This extracts tables from PDFs. It supports cells spanning multiple rows or columns

DenisStad/PdfTableExtract

PdfTableExtract

Input PDF: [image]

Output HTML: [image]

This extracts tables from PDFs. It supports cells spanning multiple rows or columns. For sample results, compare the PDF and the HTML file in this repository: the HTML table was extracted from that PDF.

I wrote this because I needed to extract tables from a lot of PDFs, but the good tools were expensive and the others did not work well.

This is not a very user-friendly tool, but if you want me to make it easier to use, tell me!

You need the following installed: Ghostscript, pdftotext, and OpenCV.

Compile main.cpp and link it against OpenCV. The program will overwrite tmp.txt and tmp.jpg in your working directory, so make sure you don't have anything important under those names there.
