DynamoDB: Add table loader for full-load operations #226
c85e125 to a86ec49
records_target = self.cratedb_adapter.count_records(self.cratedb_table)
logger.info(f"Target: CrateDB table={self.cratedb_table} count={records_target}")
progress_bar = tqdm(total=records_in)
result = self.dynamodb_adapter.scan(table_name=self.dynamodb_table)
Another variant to scan the table, maybe for resuming on errors?
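key = None
while True:
    if key is None:
        response = table.scan()
    else:
        response = table.scan(ExclusiveStartKey=key)
    # Process response["Items"] here.
    key = response.get("LastEvaluatedKey", None)
    # No LastEvaluatedKey means the last page has been read.
    if key is None:
        break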
/cc @wierdvanderhaar
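To illustrate the resume-on-error idea, here is a minimal sketch that wraps the paginated scan in a generator and hands the caller the `LastEvaluatedKey` after each page, so it can be stored as a checkpoint. The names `scan_pages` and `FakeTable` are hypothetical and not part of this PR; `FakeTable` only mimics the `Items` / `LastEvaluatedKey` shape of a boto3 `Table.scan()` response so that the sketch runs without AWS access.

```python
def scan_pages(table, start_key=None):
    """Yield (items, last_evaluated_key) for each page of a table scan.

    Passing a previously stored key as `start_key` resumes the scan there.
    """
    key = start_key
    while True:
        kwargs = {"ExclusiveStartKey": key} if key is not None else {}
        response = table.scan(**kwargs)
        key = response.get("LastEvaluatedKey")
        yield response.get("Items", []), key
        # A missing LastEvaluatedKey means the last page has been read.
        if key is None:
            break


class FakeTable:
    """Hypothetical in-memory stand-in for a boto3 Table, for demonstration only."""

    def __init__(self, items, page_size=2):
        self.items = items
        self.page_size = page_size

    def scan(self, ExclusiveStartKey=None):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["pos"]
        end = start + self.page_size
        response = {"Items": self.items[start:end]}
        if end < len(self.items):
            response["LastEvaluatedKey"] = {"pos": end}
        return response


if __name__ == "__main__":
    table = FakeTable([{"id": i} for i in range(5)])
    for items, checkpoint in scan_pages(table):
        print(len(items), checkpoint)
```

After a failure, the caller could invoke `scan_pages(table, start_key=checkpoint)` with the last key it persisted. Whether the loader should persist such checkpoints is an open design question, not something this PR implements.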
    Convert data for record items to INSERT statements.
    """
    for item in items:
        yield self.translator.to_sql(item)
That's another item transformation idea picked up from an example program. Please advise if this is sensible in all situations, or if it's just a special case.
if 'id' in item and not isinstance(item['id'], str):
    item['id'] = str(item['id'])
/cc @wierdvanderhaar
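For discussion, the transformation above could be isolated into a small helper, sketched below. The function name `stringify_id` and its `key` parameter are hypothetical. The sketch returns a copy instead of mutating the input, and it uses `Decimal` in the example because boto3 deserializes DynamoDB numbers to `decimal.Decimal`.

```python
from decimal import Decimal


def stringify_id(item, key="id"):
    """Return the item with a non-string `key` value converted to str.

    Items without the key, or with a string value already, are returned unchanged.
    """
    if key in item and not isinstance(item[key], str):
        return {**item, key: str(item[key])}
    return item


if __name__ == "__main__":
    print(stringify_id({"id": Decimal("42"), "name": "sensor"}))
    print(stringify_id({"id": "abc"}))
```

Note that `str()` keeps the numeric representation as stored, so `Decimal("42.0")` becomes `"42.0"` and not `"42"`. That is one reason the conversion may only suit special cases.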
# DynamoDB Backlog

- Pagination / Batch Getting.
  https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/programming-with-python.html#programming-with-python-pagination

- Use `batch_get_item`.
  https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb/client/batch_get_item.html

- Scan by query instead of full.
@wierdvanderhaar: With respect to alternative implementations, using batched reading from DynamoDB is probably the way to go when processing large amounts of data?
Yes, we should use the batched method. Let's use a default batch size of 100 but give the option to use different batch sizes.
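A minimal sketch of what a configurable batch size could look like on the consuming side, assuming records arrive as a plain iterable. The helper name `batched` and its `batch_size` parameter are hypothetical; how the loader would actually wire this into DynamoDB reads is not decided here.

```python
from itertools import islice


def batched(iterable, batch_size=100):
    """Yield lists of up to `batch_size` items from `iterable`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Default batch size of 100, overridable by the caller.
    print([len(batch) for batch in batched(range(250))])
    print(list(batched(range(5), batch_size=2)))
```

A default of 100 also lines up with `batch_get_item`, which accepts at most 100 items per request.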
Thank you. I added the remaining items to the backlog.
Do you agree to merge and release it first, following the "first make it work, then make it fast|beautiful|robust" paradigm, in order to get it out as quickly as possible?
4a746ff to 882676e
882676e to 736a603
About
Bring DynamoDB full-load to Toolkit's `ctk load table` interface.

Documentation
https://cratedb-toolkit--226.org.readthedocs.build/io/dynamodb/loader.html

Status
Alpha. For now, the implementation uses the same easy strategy to converge the source record into a single `data` (OBJECT) column in CrateDB. The 1:1 strategy may follow.

Backlog

/cc @hammerhead, @zolbatar
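
The single-column strategy described under Status can be pictured roughly as follows. This is only a sketch under assumptions, not the translator used in this PR: the function `record_to_insert` and the helper `_json_default` are hypothetical, and the statement shape (a JSON literal cast to `OBJECT`) is one plausible way to store a whole record in a `data` column.

```python
import json
from decimal import Decimal


def _json_default(value):
    # boto3 returns DynamoDB numbers as Decimal, which json cannot encode natively.
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    raise TypeError(f"Unsupported type: {type(value).__name__}")


def record_to_insert(table_name, record):
    """Render one source record as an INSERT into a single `data` column."""
    payload = json.dumps(record, default=_json_default)
    # Escape single quotes for use inside an SQL string literal.
    payload = payload.replace("'", "''")
    return f"INSERT INTO {table_name} (data) VALUES ('{payload}'::OBJECT);"


if __name__ == "__main__":
    print(record_to_insert("demo", {"id": "a", "n": Decimal("3")}))
```

The sketch assumes `table_name` is a trusted identifier. A 1:1 strategy would instead map each top-level attribute to its own column.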