PdfTableExtract

Input PDF:

Output HTML:

This tool extracts tables from PDFs and writes them as HTML. It supports cells spanning multiple rows or columns. For sample results, compare test.pdf and out.html in this repository: the HTML table was extracted from the PDF.
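To illustrate what "cells spanning multiple rows or columns" means in the HTML output, here is a minimal sketch. It is not taken from main.cpp: the Cell struct and the renderTable function are hypothetical names made up for this example, and the text is written out without HTML escaping to keep it short.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical cell type for illustration only; not the structure used in main.cpp.
struct Cell {
    std::string text;
    int rowspan = 1;  // number of rows the cell covers
    int colspan = 1;  // number of columns the cell covers
};

// Render rows of cells as an HTML table.
// rowspan/colspan attributes are only written when they differ from 1.
// Note: cell text is NOT HTML-escaped in this sketch.
std::string renderTable(const std::vector<std::vector<Cell>>& rows) {
    std::string html = "<table>\n";
    for (const auto& row : rows) {
        html += "  <tr>";
        for (const auto& cell : row) {
            assert(cell.rowspan >= 1 && cell.colspan >= 1);
            html += "<td";
            if (cell.rowspan > 1) {
                html += " rowspan=\"" + std::to_string(cell.rowspan) + "\"";
            }
            if (cell.colspan > 1) {
                html += " colspan=\"" + std::to_string(cell.colspan) + "\"";
            }
            html += ">" + cell.text + "</td>";
        }
        html += "</tr>\n";
    }
    html += "</table>\n";
    return html;
}
```

A cell that covers two rows appears once, in the first row it touches, and the following row simply has one cell fewer.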

I wrote this because I needed to extract the tables from a lot of PDFs, but the good tools were expensive and the others did not work well.

This is not a very user-friendly tool, but if you want me to make it easier to use, tell me!

You need the following installed: Ghostscript, pdftotext, and OpenCV.

Compile main.cpp and link it against OpenCV. The program overwrites tmp.txt and tmp.jpg in your working directory, so make sure you don't have anything important stored under those names.

DenisStad/PdfTableExtract
