pimmer

Exploratory code for PDF image mining. A multi-page PDF is split into pages and converted to JPEG files, which are then mined for illustrations and images. Based on https://github.com/megloff1/image-mining with added PDF splitting, a simple GUI and queue management.

Install

  1. Make sure you have Git and Docker with docker-compose installed.
  2. Get the latest version of this repository: git clone --depth 1 https://github.com/peterk/pimmer.git
  3. Copy the example_env file to .env and edit settings.
  4. Make sure you have a folder called data in the project root folder (jobs and resulting image files will end up here). You can map output to a different local folder for the worker in docker-compose.yml.
  5. Run docker-compose up -d. Wait a minute until the queue and worker are up.

The service is now running on http://localhost:7777.
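
Instead of waiting a fixed minute, a script can poll the web interface until it answers. The following is a minimal sketch, not part of pimmer: the function name wait_for_pimmer and its arguments are made up for illustration, and it assumes that any HTTP response from the root URL means the GUI is up. It does not prove that the queue or the workers are ready.

```shell
# Poll the pimmer web interface until it answers, or give up.
# Assumption: any HTTP response from the root URL means the GUI is up;
# this says nothing about the queue or the workers.
wait_for_pimmer() {
  url="${1:-http://localhost:7777}"   # address from the install steps
  attempts="${2:-30}"                 # how many times to try
  delay="${3:-2}"                     # seconds between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl exits 0 on any HTTP response and non-zero if it cannot connect
    if curl --silent --max-time 5 --output /dev/null "$url"; then
      echo "pimmer is answering at $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up waiting for $url" >&2
  return 1
}
```

After docker-compose up -d, calling wait_for_pimmer with no arguments tries for about a minute before giving up.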

If you plan to process a large number of documents, you can start more workers with docker-compose up -d --scale worker=5 and then post files with curl to the /process/ endpoint:

curl -v --silent -F "file=@testdata/hat_catalog.pdf" http://0.0.0.0:7777/process/
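
To queue a whole folder of documents, the same request can be repeated in a loop. This is a rough sketch under a few assumptions: post_pdfs and the PIMMER_URL variable are hypothetical names introduced here, not pimmer settings, and the server's response body is simply discarded.

```shell
# Post every PDF in a folder to the /process/ endpoint, one request per file.
# PIMMER_URL is a hypothetical variable for this sketch, not a pimmer setting.
post_pdfs() {
  dir="$1"
  base="${PIMMER_URL:-http://localhost:7777}"
  for pdf in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the folder contains no PDFs
    [ -e "$pdf" ] || continue
    echo "posting $pdf"
    # Same multipart form field ("file") as the single-file example
    curl --silent --show-error -F "file=@$pdf" "$base/process/" > /dev/null \
      || echo "failed: $pdf" >&2
  done
}
```

For example, post_pdfs testdata posts each PDF in the testdata folder. File names containing commas or semicolons may need extra quoting, because curl gives those characters special meaning inside -F values.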

Please report bugs and feedback in the GitHub issue tracker.

Results

The detected images end up as individual image files in job folders under ./data/results.

Each job folder also contains one JSON file per page with the coordinates of the detected images.
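
A quick way to check what a run produced is to count the files in each job folder. The sketch below is an illustration only: summarize_results is a hypothetical helper, and it assumes that job folders sit directly under the results directory and that every file other than a .json file is a detected image. It does not read the JSON files, since their exact layout is not described here.

```shell
# Print one line per job folder: name, JSON page files, other (image) files.
# Assumptions: job folders are direct children of the results directory,
# and anything that is not *.json is a detected image.
summarize_results() {
  results="${1:-./data/results}"
  for job in "$results"/*/; do
    # Skip the unexpanded pattern when there are no job folders yet
    [ -d "$job" ] || continue
    pages=$(find "$job" -type f -name '*.json' | wc -l | tr -d ' ')
    images=$(find "$job" -type f ! -name '*.json' | wc -l | tr -d ' ')
    echo "$(basename "$job"): $pages page files, $images images"
  done
}
```

Run summarize_results from the project root, or pass another path if the worker output is mapped to a different folder in docker-compose.yml.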

A digitized hat catalog like this:

[Image: Hat catalog page]

... results in all the individual hat images:

[Image: Individual hat images]