
feat: Retry row insertion when BigQuery API returns NOT_FOUND dataset #30

Conversation

jclarysse

Every couple of days, the BigQuery API might return 404 / NOT_FOUND for a Dataset. This happens even though the Dataset exists in the same GCP region as the one where the connector is running. This causes the task to fail, and restarting it resumes without any error. This change retries the insertion when this error occurs, instead of failing fast.

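For illustration, a minimal sketch of the kind of check being proposed (not the actual diff; isDatasetNotFound is a hypothetical helper, and the matching criteria — reason notFound and a message mentioning the Dataset — are assumptions based on the error reported below):

import com.google.cloud.bigquery.BigQueryError;
import com.google.cloud.bigquery.BigQueryException;

// Hypothetical helper: recognize the spurious dataset-not-found response from
// insertAll before deciding to retry instead of failing fast.
static boolean isDatasetNotFound(BigQueryException e) {
  BigQueryError error = e.getError();
  return error != null
      && "notFound".equals(error.getReason())
      && error.getMessage() != null
      && error.getMessage().contains("Dataset");
}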
@C0urante
Contributor

Thanks @jclarysse, and sorry for the delay (the whole company is doing an off-site this week).

Do you have more information about the circumstances that might lead to these kinds of spurious dataset-not-found errors? It seems like this should be reported upstream as a bug in BigQuery.

As far as the fix goes, it looks like this will add latency to the time it takes the connector to fail if it tries to write to a dataset that really does not exist. Would a single retry be sufficient instead?

I also think we may want to add this logic to more places than just the AdaptiveBigQueryWriter, since IIRC that class is only used when table creation/updates are enabled.

@jclarysse
Author

Thanks @C0urante for following up on this.

The dataset-not-found error is a very infrequent result of BigQuery API tabledata.insertAll requests. The log is as follows:

[2024-04-12 23:02:32,513] WARN [tilbud-offers-sink|task-1] Could not write batch of size 1 to BigQuery. Error code: 404, underlying error (if present): BigQueryError{reason=notFound, location=null, message=Not found: Dataset some-project-id:some_dataset} (com.wepay.kafka.connect.bigquery.write.batch.TableWriter:97)
com.google.cloud.bigquery.BigQueryException: Not found: Dataset some-project-id:some_dataset
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.insertAll(HttpBigQueryRpc.java:494)
	at com.google.cloud.bigquery.BigQueryImpl$28.call(BigQueryImpl.java:1068)
	at com.google.cloud.bigquery.BigQueryImpl$28.call(BigQueryImpl.java:1065)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at com.google.cloud.bigquery.BigQueryImpl.insertAll(BigQueryImpl.java:1064)
	at com.wepay.kafka.connect.bigquery.write.row.AdaptiveBigQueryWriter.performWriteRequest(AdaptiveBigQueryWriter.java:96)
	at com.wepay.kafka.connect.bigquery.write.row.BigQueryWriter.writeRows(BigQueryWriter.java:116)
	at com.wepay.kafka.connect.bigquery.write.batch.TableWriter.run(TableWriter.java:93)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://www.googleapis.com/bigquery/v2/projects/some-project-id/datasets/some_dataset/tables/some_table$20240412/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Dataset some-project-id:some_dataset",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Dataset some-project-id:some_dataset",
  "status" : "NOT_FOUND"
}
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.insertAll(HttpBigQueryRpc.java:492)
	... 12 more

The error does not depend on the batch size. As far as I am aware, it has only occurred with partitioned tables from datasets located in the EU multi-region (physically stored in the GCP region europe-west1), so there might be an API bug related to this specific scenario.

In general, the problem felt similar to the long-known BigQuery: 404 table not found even when the table exists issue, and my understanding was that issues related to BigQuery's eventual consistency should be handled on the client side. In this case there is no backend error or quota limit, so the connector option bigQueryRetry doesn't apply. We were looking for another way to retry and noticed that the table-not-found scenario was already handled here. Apart from that, I agree that a single retry should be sufficient.
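For reference, the existing retry settings look roughly like this in the connector configuration; they only cover backend and quota-exceeded errors, which is why they don't kick in for this 404 (the values shown are illustrative):

bigQueryRetry=3
bigQueryRetryWait=1000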

Does this sound like a valid PR, or do you feel that we are going in the wrong direction?

@C0urante
Contributor

Thanks for the clarification!

Regarding this question:

Does this sound like a valid PR, or do you feel that we are going in the wrong direction?

I think retrying to handle any poor backend behavior is fine, regardless of whether it's expected (e.g., a documented limitation due to eventual consistency) or unexpected (e.g., a bug that might be patched in the future). The only difference is that if it seems like a bug, it should be reported upstream.

One thing I'm still unclear about is whether this is related to recently-created tables or datasets (which would definitely fall under the umbrella of eventual consistency issues), or if it occurs for tables/datasets that have existed for a while.

If it's for newly-created entities, then I think this patch is in pretty good shape.

If it's for entities that have existed for a while (and/or for which at least one write has already succeeded), then I think the logic should be moved out of the AdaptiveBigQueryWriter class and into the parent class (so that retries occur regardless of whether automatic table creation/updates are enabled in the connector) and we should limit the number of retries that take place (so that we can fail faster when the dataset truly does not exist).
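Roughly, that could look like the following in the parent writer (a hypothetical sketch, not existing code; MAX_DATASET_NOT_FOUND_RETRIES is a placeholder name and isDatasetNotFound refers to the helper sketched earlier):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;

// Hypothetical sketch: a capped retry around the insertAll call so the behavior
// applies even when automatic table creation/updates are disabled, while still
// failing quickly when the dataset truly does not exist.
static final int MAX_DATASET_NOT_FOUND_RETRIES = 1;

static InsertAllResponse insertWithBoundedRetry(BigQuery bigQuery, InsertAllRequest request) {
  int attempts = 0;
  while (true) {
    try {
      return bigQuery.insertAll(request);
    } catch (BigQueryException e) {
      // Retry only the spurious dataset-not-found case, and only up to the cap.
      if (!isDatasetNotFound(e) || attempts++ >= MAX_DATASET_NOT_FOUND_RETRIES) {
        throw e;
      }
    }
  }
}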

How does that sound?

@jclarysse
Author

Thanks @C0urante, great advice!
Meanwhile, the user seems to have moved away from this issue by enabling the Storage Write API.
As a result, I suggest not investing further in this PR unless someone else faces the same issue.

jclarysse closed this May 23, 2024