Read text and parse tables from PDF files.
Supports tabular data with automatic column detection, and rule-based parsing.
Dependencies: it is based on pdf2json, which itself relies on Mozilla's pdf.js.
Now includes TypeScript type definitions!
Important notes:
- This module is meant to be run using Node.js only. It does not work from a web browser.
- This module extracts text entries from PDF files. It does not support photographed text. If you cannot select text from the PDF file, you may need to use OCR software first.
Summary:
- Installation, tests and CLI usage
- Raw PDF reading (incl. examples)
- Rule-based data extraction
- Troubleshooting & FAQ
After installing Node.js:
git clone https://github.com/adrienjoly/npm-pdfreader.git
cd npm-pdfreader
npm install
npm test
node parse.js test/sample.pdf
To install pdfreader as a dependency of your Node.js project:
npm install pdfreader
Then, see below for examples of use.
This module exposes the PdfReader class, to be instantiated. You can pass { debug: true } to the constructor to log debugging information, which is useful for troubleshooting.
Your instance has two methods for parsing a PDF. They produce the same output and differ only in input: PdfReader.parseFileItems (as below) takes a filename, and PdfReader.parseBuffer (see: "Raw PDF reading from a PDF already in memory (buffer)") takes data that you don't want to reference from the filesystem.
Whichever method you choose, it asks for a callback, which gets called each time the instance finds what it denotes as a PDF item.
An item object can match one of the following objects:
- null, when the parsing is over, or an error occurred.
- File metadata, {file:{path:string}}, when a PDF file is being opened. This is always the first item.
- Page metadata, {page:integer, width:float, height:float}, when a new page is being parsed. It provides the page number, starting at 1, and basically acts as a carriage return for the coordinates of the text items to be processed.
- Text items, {text:string, x:float, y:float, w:float, ...}, which you can think of as simple objects with a text property and floating-point 2D AABB (axis-aligned bounding box) coordinates on the page.
It's up to your callback to process these items into a data structure of your choice, and also to handle any errors thrown to it.
For example:
import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
// For a password-protected PDF file, pass the password to the constructor.
new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file");
    else if (item.text) console.log(item.text);
  }
);
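As a minimal sketch of processing items into a data structure of your choice, the callback can group text items by page and by y coordinate to rebuild lines of text. The helper name createRowCollector and the shape it returns are illustrative and not part of pdfreader. The sketch assumes that items on the same line share exactly the same y value and that y increases down the page; real documents may need a tolerance when comparing coordinates. It runs on simulated items, so no PDF file is needed.

```javascript
// Sketch: collect text items into rows, per page, using their y coordinate.
// `createRowCollector` is a hypothetical helper, not part of pdfreader.
function createRowCollector() {
  const pages = []; // one Map per page: y coordinate -> array of text items
  let currentRows = null;

  return {
    // Feed every item received by the parsing callback, including the final falsy one.
    add(item) {
      if (!item) return; // end of parsing: nothing to store
      if (item.page) {
        // Page metadata: start a fresh set of rows.
        currentRows = new Map();
        pages.push(currentRows);
      } else if (item.text && currentRows) {
        // Text item: file it under its y coordinate (assumed identical within a line).
        if (!currentRows.has(item.y)) currentRows.set(item.y, []);
        currentRows.get(item.y).push(item);
      }
    },
    // For each page, return its lines ordered by increasing y, then by increasing x.
    lines() {
      return pages.map((pageRows) =>
        [...pageRows.entries()]
          .sort(([y1], [y2]) => y1 - y2)
          .map(([, items]) =>
            [...items]
              .sort((a, b) => a.x - b.x)
              .map((textItem) => textItem.text)
              .join(" ")
          )
      );
    },
  };
}

// Simulated items, in the shapes listed in the item description.
const collector = createRowCollector();
[
  { file: { path: "example.pdf" } },
  { page: 1, width: 40, height: 50 },
  { text: "world", x: 5, y: 1, w: 3 },
  { text: "Hello", x: 1, y: 1, w: 3 },
  { text: "Second line", x: 1, y: 2.5, w: 6 },
  null,
].forEach((item) => collector.add(item));

console.log(collector.lines());
```

With a real reader, the same collector could be fed from the parsing callback by calling collector.add(item) for each item and reading collector.lines() once a falsy item signals the end.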
The same items can also be read from a buffer in memory, rather than from a file referenced by path, using parseBuffer. For example:
import fs from "fs";
import { PdfReader } from "pdfreader";

fs.readFile("test/sample.pdf", (readErr, pdfBuffer) => {
  // Stop here if the file could not be read.
  if (readErr) return console.error("error:", readErr);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer");
    else if (item.text) console.log(item.text);
  });
});
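Because the callback is called once per item, it can be convenient to gather all items before processing them. The following is a minimal sketch of one way to do that with a Promise. collectItems is a hypothetical helper, not part of pdfreader; it only assumes the callback contract described here: an error as first argument, and a falsy item when parsing is over. A stand-in parser is used so the sketch runs without a PDF file.

```javascript
// Sketch: turn the callback-per-item interface into a Promise of all items.
// `collectItems` is a hypothetical helper, not part of pdfreader.
// `parse` is any function that accepts an (err, item) callback, for example:
//   (callback) => new PdfReader().parseBuffer(pdfBuffer, callback)
function collectItems(parse) {
  return new Promise((resolve, reject) => {
    const items = [];
    let settled = false; // ignore any callback invoked after the end or an error
    parse((err, item) => {
      if (settled) return;
      if (err) {
        settled = true;
        reject(err);
      } else if (!item) {
        settled = true; // a falsy item signals the end of parsing
        resolve(items);
      } else {
        items.push(item);
      }
    });
  });
}

// Stand-in for a real parser, emitting items in the documented shapes.
const fakeParse = (callback) => {
  callback(null, { page: 1, width: 40, height: 50 });
  callback(null, { text: "Hello", x: 1, y: 1, w: 3 });
  callback(null, null);
};

collectItems(fakeParse).then((items) => {
  console.log(items.filter((item) => item.text).map((item) => item.text));
});
```

Passing the parser as a function keeps the helper independent of where the PDF comes from, so the same code can wrap parseFileItems or parseBuffer.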
Source code of the examples above: parsing a CV/résumé.
For more, see Examples of use.
The Rule class can be used to define and process data extraction rules while parsing a PDF document.
Rule instances expose "accumulators": methods that define the data extraction strategy to be used for each rule.
Example:
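import { PdfReader, Rule } from "pdfreader";

// Illustrative handlers: replace them with your own processing.
const displayValue = (value) => console.log("extracted value:", value);
const displayTable = (table) => console.log("extracted table:", table);

const processItem = Rule.makeItemProcessor([
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});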
Solutions exist, but this module cannot be run directly by a web browser. If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.
Dmitry found out that you may need to run these instructions before including the pdfreader module:
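global.navigator = {
  userAgent: "node",
};
// `window` is not defined in plain Node.js, so only patch it when it exists.
if (typeof window !== "undefined") {
  window.navigator = {
    userAgent: "node",
  };
}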