This issue was moved to a discussion.
Text missing or displaced in parsed table #540
Comments
(accidentally deleted previous comment 😮💨) In Docling CLI: In Python:
This is the result (better than with the fast model): However, such tables are outside the training distribution (hence the mistakes); we want to address this with additional training data in the future.
@maxmnemonic I think it is
@pbonito, sure, let me shed some light on how it works. TableFormer is the model we use in Docling for table structure recognition. It's a single-encoder / dual-decoder model trained to predict structural tags together with bounding boxes of the cell content. Docling then extracts the text from each bounding box and places it in the appropriate position in the structure. As we see here, those bounding boxes most likely fall short sometimes for tables like this one.

The thing is, we trained TableFormer on public datasets such as FinTabNet and PubTab1M (and some others), which come from scientific papers, public financial reports, etc. While the tables in these datasets vary and let the model generalize fairly well, they miss some types of tables, like the one we are looking at here, where there is a lot of text in each cell. The model is not used to seeing so much "volume" in each cell, so the accuracy of the predicted "content bounding box" drops.

Our current strategy is to fine-tune the model on a dataset that has such large-text tables in abundance. Our team is working on such a dataset as we speak, so once we have it and do the fine-tuning, we will push new model weights that should hopefully improve the situation we see here. Hope this helps!
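To make the failure mode described above concrete, here is a minimal sketch of how a too-small predicted content box can drop text from a cell. This is not Docling's or TableFormer's actual code: the `Box` class, the `fill_cell` helper, the 0.5 coverage threshold, and all coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; hypothetical helper, not a Docling type."""
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # Clamp each side at zero so inverted (non-overlapping) boxes have area 0.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)

    def intersection(self, other: "Box") -> float:
        # Area of the overlap between two boxes (0 if they do not overlap).
        return Box(
            max(self.left, other.left),
            max(self.top, other.top),
            min(self.right, other.right),
            min(self.bottom, other.bottom),
        ).area()


def fill_cell(predicted: Box, words: list[tuple[str, Box]], min_cover: float = 0.5) -> str:
    """Join the words whose own box is at least `min_cover` covered by `predicted`."""
    kept = [
        text
        for text, box in words
        if box.area() > 0 and box.intersection(predicted) / box.area() >= min_cover
    ]
    return " ".join(kept)


# Three lines of text inside one table cell (made-up coordinates).
words = [
    ("first", Box(10, 10, 90, 20)),
    ("second", Box(10, 22, 90, 32)),
    ("third", Box(10, 34, 90, 44)),
]

full_box = Box(8, 8, 92, 46)   # prediction that covers the whole cell content
short_box = Box(8, 8, 92, 26)  # prediction that stops too early

print(fill_cell(full_box, words))   # first second third
print(fill_cell(short_box, words))  # first
```

With the short box, the second and third lines are not sufficiently covered by the prediction and never reach the cell, which matches the "text missing" symptom reported in this issue under the assumptions above.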
@maxmnemonic Is there an estimated timeline for when the new model weights will be pushed? I'm just curious about the expected update.
@itsainii I don't want to give timelines at the moment, before we test and prove the improvements, but I hope very early next year.
Bug
Text in table not present or displaced in parsed document
Steps to reproduce
Parse this pdf
parser_test.pdf
Docling version
Docling version: 2.8.1
Docling Core version: 2.5.1
Docling IBM Models version: 2.0.6
Docling Parse version: 2.1.2
Python version
Python 3.11.5