
Seeing "numbers" as text in converted "tables" json #588

Closed
mllife opened this issue Dec 13, 2024 · 2 comments
Labels
question Further information is requested

Comments


mllife commented Dec 13, 2024

Question

I want to convert the JSON output into a separate CSV file per table. I wrote code for it, but I am seeing numbers converted to text:
```json
{
  "bbox": {
    "l": 364.781005859375,
    "t": 337.5539855957031,
    "r": 377.7409973144531,
    "b": 328.56201171875,
    "coord_origin": "BOTTOMLEFT"
  },
  "row_span": 1,
  "col_span": 1,
  "start_row_offset_idx": 4,
  "end_row_offset_idx": 5,
  "start_col_offset_idx": 2,
  "end_col_offset_idx": 3,
  "text": "/five.lt/period.tab/eight.lt",
  "column_header": false,
  "row_header": false,
  "row_section": false
},
{
  "bbox": {
    "l": 420.2619934082031,
    "t": 337.5539855957031,
    "r": 433.22198486328125,
    "b": 328.56201171875,
    "coord_origin": "BOTTOMLEFT"
  },
  "row_span": 1,
  "col_span": 1,
  "start_row_offset_idx": 4,
  "end_row_offset_idx": 5,
  "start_col_offset_idx": 3,
  "end_col_offset_idx": 4,
  "text": "/six.lt/period.tab/seven.lt",
  "column_header": false,
  "row_header": false,
  "row_section": false
},
```

Here, 5.8 is showing as "/five.lt/period.tab/eight.lt".

In one of the other tables, the values show up as "/three.osf_tab./zero.osf_tab/zero.osf_tab% " and "-/zero.osf_tab./two.osf_tab/three.osf_tab%".

I think this is due to OTSL, from the work in https://arxiv.org/abs/2305.03393 that is used in the TSR model. How can I convert these strings to normal numbers with post-processing? Is there any utility code available for this already? Any other help will be appreciated.

mllife added the question label Dec 13, 2024
Contributor

cau-git commented Dec 13, 2024

@mllife this is a matter of how the PDF encoded the text; you get out whatever the PDF has encoded in it. So this is not a matter of TableFormer but of the PDF backend and its string sanitization.

Author

mllife commented Dec 14, 2024

@cau-git, UPDATE: I tried the other backend, "pypdfium2", and the output is correct now; the docling_v2 parser had a text encoding issue.
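
For cases where switching the backend is not an option, the glyph-name strings quoted in this thread could also be mapped back to characters in post-processing. The following is only a minimal sketch, not a docling utility: the function name `decode_glyph_names`, the mapping table, and the assumed token shape (`/<glyph name>.<suffix>`, e.g. `/five.lt` or `/zero.osf_tab`) are assumptions inferred from the examples above, and other PDFs may use different glyph names.

```python
import re

# Hypothetical mapping from glyph names to characters. Only names seen in
# this thread plus a few common ones are covered; extend as needed.
GLYPH_TO_CHAR = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "period": ".", "comma": ",", "hyphen": "-", "minus": "-",
    "plus": "+", "percent": "%",
}

# Assumed token shape: a slash, a glyph name, a dot, then a font-variant
# suffix made of letters and underscores (e.g. "lt", "tab", "osf_tab").
_GLYPH_TOKEN = re.compile(r"/([A-Za-z]+)\.[A-Za-z_]+")


def decode_glyph_names(text: str) -> str:
    """Replace known glyph-name tokens with their characters.

    Unknown tokens and all other text are left unchanged.
    """
    def _replace(match: re.Match) -> str:
        return GLYPH_TO_CHAR.get(match.group(1), match.group(0))

    return _GLYPH_TOKEN.sub(_replace, text)


if __name__ == "__main__":
    print(decode_glyph_names("/five.lt/period.tab/eight.lt"))               # 5.8
    print(decode_glyph_names("-/zero.osf_tab./two.osf_tab/three.osf_tab%"))  # -0.23%
```

Because the pattern matches any `/name.suffix` sequence, ordinary text that happens to have this shape (a URL path, for example) would also be rewritten if the name is in the table, so it is safer to apply the function only to table cell `text` values.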

mllife closed this as completed Dec 17, 2024