Skip to content

globalise-huygens/globalise-corpus-diplomaticum

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Corpus diplomaticum

Download pdf files

https://www.cortsfoundation.org/digitheek-en#CD

Extract ppm images from pdf files

mkdir ppm

for file in /path/to/folder/*.pdf; do
  prefix=$(basename "$file" .pdf)
  pdftoppm -r 300 "$file" /path/to/folder/ppm/"$prefix"
done

OCR

Using Tesseract version 5.3.1

$ tesseract --version
tesseract 5.3.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.11 : libwebp 1.3.0 : libopenjp2 2.5.0
 Found SSE4.1
 Found libarchive 3.6.2 zlib/1.2.11 liblzma/5.4.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.4
 Found libcurl/7.87.0 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.51.0

TXT

mkdir txt

for f in images/*.ppm; do
  output_file="txt/$(basename "${f%.ppm}")"
  tesseract "$f" "$output_file" \
  -l nld \
  -c tessedit_char_whitelist="\"-()'’.,1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ;:" \
  -c preserve_interword_spaces=1;
done

Sort files into folders

sourcedir="txt/"
prefixes=$(find "$sourcedir" -type f -name "*.txt" -exec basename {} \; | awk -F '-' '{print $1}' | sort -u)

for prefix in $prefixes; do
    folder_path="$sourcedir/$prefix"
    mkdir -p "$folder_path"

    mv "$sourcedir/$prefix"* "$folder_path/"
done

Notes on use

  • contains sections in modern and old Dutch that may be difficult to separate (the modern sections are predominantly the indented introductions to the treaties)
  • contains footnotes

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published