
v1.0 - Basic scraping


A scraper for Gallica (BnF) that generates URLs from dates, tries to resolve them in order to get the Ark ID (the identifier of a document), collects multiple Ark IDs when a date covers several documents, and finally downloads all of the JPEGs and PDFs of each Ark ID.
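For illustration, the date-to-Ark-ID step could look roughly like the sketch below. This is a minimal sketch, not the actual implementation: the date-URL pattern is a placeholder, and the regex only assumes that BnF Ark IDs generally look like `ark:/12148/...`.

```python
# Minimal sketch of the date -> Ark ID step (illustrative, not the real script).
import re
import requests

# Hypothetical date-based URL; the real pattern used by the scraper may differ.
DATE_URL = "https://gallica.bnf.fr/services/engine/search?date={date}"

def resolve_ark_ids(date: str) -> list[str]:
    """Resolve a date-based URL and collect every Ark ID found on the page."""
    resp = requests.get(DATE_URL.format(date=date), timeout=60)
    resp.raise_for_status()
    # One date can match several documents, hence findall() instead of search().
    return re.findall(r"ark:/\d+/\w+", resp.text)
```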

The scraper can recover from errors by recording where it failed in each list.
However, only one instance can be run at a time.
More precisely: only one instance can run at a time in a given folder (you can copy/paste the source code and your lists into multiple different folders and launch them in parallel), because the error-recovery file always has the same name (TODO: create one tmp file per input filename).
Keep in mind that BnF/Gallica does not handle multiple connections well... the error recovery is there so that you can relaunch the script when the BnF server has a problem and closes the connection.
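The fixed-name recovery file is why the one-instance-per-folder rule exists; a minimal sketch of the idea (file and function names here are hypothetical, not the script's actual ones):

```python
# Sketch of error recovery with a fixed-name checkpoint file.
# Because CHECKPOINT never varies, two instances in the same folder
# would overwrite each other's progress -- hence the one-instance rule.
import os

CHECKPOINT = "recovery.tmp"  # hypothetical fixed name

def load_checkpoint() -> int:
    """Index of the last successfully processed entry, or -1 if none."""
    if not os.path.exists(CHECKPOINT):
        return -1
    with open(CHECKPOINT) as f:
        return int(f.read().strip() or -1)

def save_checkpoint(index: int) -> None:
    with open(CHECKPOINT, "w") as f:
        f.write(str(index))

def process(entries: list[str]) -> None:
    start = load_checkpoint() + 1         # resume where the last run stopped
    for i in range(start, len(entries)):
        print("processing", entries[i])   # placeholder for the real work
        save_checkpoint(i)                # record progress after each success
```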

There is a small criterion for detecting the end of a processing run: "_final.txt" is appended to the input filename.
For the JPEG/PDF downloads, a temporary folder is created ( *_WIP_JPEG / *_WIP_PDF ) and renamed with the date (*JPEG[date]) once complete.
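A sketch of that finalization convention (the exact marker and folder names are assumptions based on the patterns above):

```python
# Sketch of the completion convention: a *_final.txt marker file plus
# the WIP folder being renamed once every download has succeeded.
import os

def finalize(input_name: str, date: str) -> None:
    # Marker telling a later run that this input list is fully processed.
    open(input_name + "_final.txt", "w").close()
    # Promote the temporary download folder now that it is complete.
    wip = input_name + "_WIP_JPEG"
    if os.path.isdir(wip):
        os.rename(wip, input_name + "_JPEG" + date)
```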

Several kinds of exceptions are currently handled: network failures (a timeout is even set for when the server does not respond) and a full disk.
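As an illustration, the two failure classes could be distinguished like this (a sketch, assuming `requests` is used for HTTP; the real script's handling may differ):

```python
# Sketch of the two handled failure classes: network errors (with a
# timeout so a silent server cannot hang the run) and a full disk.
import errno
import requests

def download(url: str, dest: str) -> bool:
    """Return True on success, False on a network error worth retrying."""
    try:
        resp = requests.get(url, timeout=60)  # fail instead of hanging forever
        resp.raise_for_status()
        with open(dest, "wb") as f:
            f.write(resp.content)
        return True
    except requests.exceptions.RequestException:
        return False  # network failure: rely on error recovery and relaunch
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            print("disk full: free some space, then relaunch")
        raise
```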
When launching the scraper, you must keep a trace of the log!

```
python src/script.py > logX_Y.log 2>&1
```

With this, you can't miss what's happening (just use `tail`, or `tail -n 25`, on the log to see what failed).