Table picker for PDF #2

sambitdash · 2017-07-12T11:44:27Z

Natural tabular objects in a PDF document should ideally be picked up for extraction.

The intent of the project is API development, hence it will be headless for the most part. Unlike a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. In case document tagging is enabled, the table picker can use the tagged text.
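To illustrate the "dump them in tabular HTML" part, here is a minimal sketch in Python. It assumes the picker has already produced a table as a list of rows of cell strings; the function name `rows_to_html` is hypothetical and is not part of PDFIO.

```python
from html import escape


def rows_to_html(rows):
    """Render a list of rows (each a list of cell values) as an HTML table.

    Hypothetical helper for illustration only; not a PDFIO function.
    """
    lines = ["<table>"]
    for row in rows:
        # Escape cell text so characters like '<' or '&' cannot break the markup.
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_html([["Name", "Qty"], ["Apple", "3"]]))
```

Styling (the CSS part) and header detection are left out here; they would depend on what the heuristic can actually recover from the page.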

hhaensel · 2022-05-09T07:04:44Z

I have written some lines of code to extract tabular data. Currently it uses keywords to determine which text layouts to include. I also managed to make a short IJulia notebook where you can interactively select text in a Plotly chart.
@sambitdash Would you be interested in including that code in your package?
Otherwise I might release my own package but I feel that this functionality would nicely fit into PDFIO.

sambitdash · 2022-05-09T16:17:37Z

@hhaensel thank you for your interest. I want to understand how complex the cases are that this software can handle. If you submit a PR, I can review it and let you know if it is useful for this SDK.

hhaensel · 2022-05-09T20:54:31Z

Sounds perfect, I'll submit a PR tomorrow.
The code extracts a vector of TextLayouts as a function of page(s) and keywords, then scans for common elements in rows and columns as a function of their layout box. The layout boxes can be scaled in order to reduce the probability of overlapping areas. Optionally a Plotly graph displays the elements and their recognised arrangement with a color code.

Looking forward to your feedback.
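
The approach described above (shrink each layout box, then look for boxes that share a row or a column) can be sketched roughly as follows. This is not the code from the planned PR, which was not posted in this thread; it is an illustrative Python sketch under stated assumptions. The names `Box`, `scaled`, `cluster` and `arrange` are hypothetical, coordinates are assumed to follow the PDF convention of y growing upward, and boxes are grouped whenever their shrunken extents overlap along one axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A piece of text with its layout box (hypothetical stand-in for a text layout)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def scaled(box, factor):
    """Shrink (factor < 1) or grow a box around its center."""
    cx, cy = (box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2
    hw = (box.x1 - box.x0) / 2 * factor
    hh = (box.y1 - box.y0) / 2 * factor
    return Box(box.text, cx - hw, cy - hh, cx + hw, cy + hh)


def cluster(intervals):
    """Label each (lo, hi) interval; overlapping intervals get the same label.

    Labels are assigned in increasing order of position along the axis.
    """
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    labels = [0] * len(intervals)
    current, reach = -1, float("-inf")
    for i in order:
        lo, hi = intervals[i]
        if lo > reach:          # gap found: start a new cluster
            current += 1
            reach = hi
        else:                   # overlaps the running cluster: extend it
            reach = max(reach, hi)
        labels[i] = current
    return labels


def arrange(boxes, factor=0.8):
    """Place box texts into a grid of rows and columns."""
    if not boxes:
        return []
    shrunk = [scaled(b, factor) for b in boxes]
    cols = cluster([(b.x0, b.x1) for b in shrunk])
    # Negate y so that the topmost row gets index 0 (assumes y grows upward).
    rows = cluster([(-b.y1, -b.y0) for b in shrunk])
    grid = [[""] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    for b, r, c in zip(boxes, rows, cols):
        grid[r][c] = (grid[r][c] + " " + b.text).strip()
    return grid


if __name__ == "__main__":
    sample = [
        Box("Name", 0, 90, 40, 100), Box("Qty", 50, 90, 80, 100),
        Box("Apple", 0, 70, 45, 80), Box("3", 50, 70, 60, 80),
    ]
    print(arrange(sample))
```

A smaller `factor` makes neighbouring boxes less likely to overlap by accident, at the cost of splitting columns whose cells are poorly aligned; the value 0.8 is an arbitrary choice for the sketch.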

hhaensel · 2022-05-18T19:39:48Z

Sorry, currently in overload, will take some more time ...

sambitdash added the enhancement label Oct 2, 2017

sambitdash mentioned this issue Nov 14, 2017

pdPageExtractText should support multi-column documents #17

Open

sambitdash mentioned this issue Dec 22, 2019

Extract all Text Objects #83

Closed

sambitdash mentioned this issue Dec 26, 2023

problem extracting text on a two columns layout #112

Closed
