🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️
This pipeline was originally used to build the OSCAR dataset. The original version uses the fasttext lid.176.bin model to generate labels for 176 languages. We forked the code here so that it works with GlotLID, which is also a fasttext model but can label text in more than 2000 languages.
The result of this fork is the GlotCC dataset, available at: https://github.com/cisnlp/GlotCC
- Via git: cargo install --git https://github.com/cisnlp/ungoliant
Ungoliant has numerous dependencies that are compiled during installation. cmake and gcc may also be needed, since the project uses fasttext-rs.
By default, ungoliant expects the model to be named lid.176.bin.
Use wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin -O glotlid.bin to get GlotLID as glotlid.bin.
However, you can use the model you want: just point to its path using ungoliant download --lid-path <path to lid>
The usual way of generating corpora is:
- First create this structure of folders with mkdir:
res
├── annotation
│   └── ...
├── blocklist
│   └── ...
├── corpus
│   └── ...
├── filter
│   └── ...
└── shards
    └── ...
- Fetch the wet.paths.gz file from the latest CommonCrawl dump.
  1.1 Decompress it using: gzip -d wet.paths.gz
  1.2 Download the files using the download command: ungoliant download wet.paths res/shards
- Download website categorizations using: wget https://github.com/olbat/ut1-blacklists/archive/refs/heads/master.zip
  2.1 Decompress it using: unzip master.zip
  2.2 Move the blacklists to res/blocklist using: mv ut1-blacklists-master/blacklists/* res/blocklist
  2.3 Decompress the adult block using: gzip -d res/blocklist/adult/domains.gz
  2.4 Remove the ut1-blacklists-master folder using: rm -r ut1-blacklists-master
- Generate the corpus using the pipeline command (it may take some time): ungoliant pipeline ./res/shards/ ./res/corpus --lid-path glotlid.bin --blocklist-path ./res/blocklist/
- Head on to glotcc-filters for the additional filter steps.
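The steps above can be strung together in one script. What follows is a minimal sketch, not a script shipped with this repository. It assumes wet.paths.gz is already in the current directory and that GlotLID was saved as glotlid.bin. The DRY_RUN variable and the run helper are made up for this sketch. By default only the folder layout is created, and the remaining commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch of the corpus generation steps; not an official ungoliant script.
set -eu

# Assumption: DRY_RUN is a variable invented for this sketch.
# With the default value 1, commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the folder layout. This always runs; it only creates directories.
mkdir -p res/annotation res/blocklist res/corpus res/filter res/shards

# Step 1: decompress the WET path list and download the shards.
run gzip -d wet.paths.gz
run ungoliant download wet.paths res/shards

# Step 2: fetch the UT1 blocklists and move them into res/blocklist.
run wget https://github.com/olbat/ut1-blacklists/archive/refs/heads/master.zip
run unzip master.zip
run mv ut1-blacklists-master/blacklists/* res/blocklist
run gzip -d res/blocklist/adult/domains.gz
run rm -r ut1-blacklists-master

# Step 3: generate the corpus (it may take some time).
run ungoliant pipeline ./res/shards/ ./res/corpus --lid-path glotlid.bin --blocklist-path ./res/blocklist/
```

Running the script as-is creates the res folders and lists the commands it would run. Setting DRY_RUN=0 executes them in order and stops at the first failure because of set -e.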
You can find more information in each command's --help.
ungoliant 2
corpus generation tool.

USAGE:
    ungoliant <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.