feat: Retry row insertion when BigQuery API returns NOT_FOUND dataset #30
Conversation
Every couple of days, the BigQuery API might return 404 / NOT_FOUND for a dataset. This happens even though the dataset exists in the same GCP region as the one where the connector is running. This causes the task to fail, and restarting it resumes without any error. This change suggests retrying the insertion when this error occurs instead of failing fast.
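As a rough illustration of the proposed behaviour, here is a minimal sketch using the google-cloud-bigquery Java client that retries the insert a bounded number of times when the client surfaces a 404. The class name, retry count, and backoff are illustrative assumptions, not code taken from the connector or this PR.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;

// Hypothetical helper sketching the retry-on-NOT_FOUND idea; not the connector's actual code.
public class RetryingInserter {
  private static final int MAX_ATTEMPTS = 3;      // illustrative bound
  private static final long BACKOFF_MS = 1_000L;  // illustrative backoff

  private final BigQuery bigQuery;

  public RetryingInserter(BigQuery bigQuery) {
    this.bigQuery = bigQuery;
  }

  public InsertAllResponse insertWithRetry(InsertAllRequest request) throws InterruptedException {
    BigQueryException lastNotFound = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        return bigQuery.insertAll(request);
      } catch (BigQueryException e) {
        if (e.getCode() != 404) {
          throw e;  // only NOT_FOUND is treated as possibly spurious
        }
        lastNotFound = e;
        Thread.sleep(BACKOFF_MS * attempt);  // simple linear backoff between attempts
      }
    }
    throw lastNotFound;  // dataset/table still reported missing after all attempts
  }
}
```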
Thanks @jclarysse, and sorry for the delay (the whole company is doing an off-site this week). Do you have more information about the circumstances that might lead to these kinds of spurious dataset-not-found errors? It seems like this should be reported upstream as a bug in BigQuery. As far as the fix goes, it looks like this will add latency to the time it takes the connector to fail if it tries to write to a dataset that really does not exist. Would a single retry be sufficient instead? I also think we may want to add this logic to more places than just the …
Thanks @C0urante for following up on this. The dataset-not-found error is the very infrequent result of BigQuery API tabledata insertAll requests. The log is as follows:
The error does not depend on the batch size. As far as I am aware, it has only occurred with partitioned tables from datasets located in multi-regions. In general, the problem felt similar to the long-known “BigQuery: 404 table not found even when the table exists” issue, and my understanding was that issues related to BigQuery's eventual consistency should be handled on the client side. In this case there is no backend error or quota limit, and so the connector option …

Does it sound like a valid PR, or do you feel that we are going in the wrong direction?
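One way to make the eventual-consistency reading above concrete is to re-check the dataset after the failure: if it still resolves, the NOT_FOUND was most likely transient. The sketch below is an assumption about how such a check could look with the Java client, not code from this PR.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.TableId;

// Hypothetical check: if the dataset still resolves after a 404 from insertAll,
// the NOT_FOUND was most likely transient and retrying is reasonable.
final class NotFoundClassifier {
  static boolean isSpuriousNotFound(BigQuery bigQuery, TableId tableId, BigQueryException error) {
    if (error.getCode() != 404) {
      return false;
    }
    Dataset dataset = bigQuery.getDataset(tableId.getDataset());
    return dataset != null;  // the dataset exists, so the error does not mean it is missing
  }
}
```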
Thanks for the clarification! Regarding this question:
I think retrying to handle any poor backend behavior is fine, regardless of whether it's expected (e.g., a documented limitation due to eventual consistency) or unexpected (e.g., a bug that might be patched in the future). The only difference is that if it seems like a bug, it should be reported upstream.

One thing I'm still unclear about is whether this is related to recently-created tables or datasets (which would definitely fall under the umbrella of eventual consistency issues), or if it occurs for tables/datasets that have existed for a while. If it's for newly-created entities, then I think this patch is in pretty good shape. If it's for entities that have existed for a while (and/or for which at least one write has already succeeded), then I think the logic should be moved out of the …

How does that sound?
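To make the two suggestions above concrete (a single retry, scoped to entities the connector has only just created), a gated retry might look roughly like the sketch below. The `recentlyCreatedTables` set and all other names here are hypothetical; this only illustrates the design being discussed, not the eventual implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

// Hypothetical sketch of "retry once, but only for entities the connector just created".
public class GatedRetryWriter {
  // Populated by whatever code path creates tables/datasets on demand.
  private final Set<TableId> recentlyCreatedTables = ConcurrentHashMap.newKeySet();

  private final BigQuery bigQuery;

  public GatedRetryWriter(BigQuery bigQuery) {
    this.bigQuery = bigQuery;
  }

  public void markRecentlyCreated(TableId tableId) {
    recentlyCreatedTables.add(tableId);
  }

  public InsertAllResponse insert(TableId tableId, InsertAllRequest request) throws InterruptedException {
    try {
      return bigQuery.insertAll(request);
    } catch (BigQueryException e) {
      boolean retriable = e.getCode() == 404 && recentlyCreatedTables.contains(tableId);
      if (!retriable) {
        throw e;  // fail fast for long-existing tables: a real NOT_FOUND should surface quickly
      }
      Thread.sleep(1_000L);               // single retry after a short pause
      return bigQuery.insertAll(request);
    }
  }
}
```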
Thanks @C0urante, great advice!