HTTP Archive Topics API Classification

Classification of HTTP Archive origins by the Topics API.

Getting Started

1. Clone this repository together with its submodule: git clone --recurse-submodules <HTTPS or SSH URL>
2. Place the .csv files containing the HTTP Archive (HA) origins under ha_urls.
3. Launch the classification (we recommend running it inside a screen session so that it keeps running if your connection drops):

   If the dependencies are installed locally: ./classify_origins.sh
   - System dependencies: python3, GNU parallel, unzip
   - Python dependencies: pandas, tflite-support

   If using Docker:

   docker build -t topics-image:latest .
   docker run --rm -it -v ${PWD}:/workspaces/topics \
       -w /workspaces/topics --entrypoint ./classify_origins.sh topics-image:latest

4. Refer to the generated .tsv file for the classification results. The matching taxonomy is in the corresponding folder under topics_classifier (a topic ID of -2 stands for the Unknown topic).
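
As a rough illustration of how the results could be inspected, the sketch below counts how many rows fall under each topic ID and reports the share of the Unknown topic (-2). The column names origin and topic_id and the sample rows are assumptions made for this example, not the confirmed layout of the generated .tsv file; adjust them to match the real header.

```python
import csv
import io
from collections import Counter

UNKNOWN_TOPIC_ID = "-2"  # per this README, -2 stands for the Unknown topic


def count_topics(tsv_file, topic_column="topic_id"):
    """Count rows per topic ID in a tab-separated results file.

    `topic_column` is a hypothetical header name; change it to match
    the header of the real .tsv file.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[topic_column] for row in reader)


# Made-up in-memory rows standing in for the real .tsv file.
sample = io.StringIO(
    "origin\ttopic_id\n"
    "https://a.example\t215\n"
    "https://b.example\t-2\n"
    "https://c.example\t-2\n"
)
counts = count_topics(sample)
unknown_share = counts[UNKNOWN_TOPIC_ID] / sum(counts.values())
print(f"Unknown topic share: {unknown_share:.0%}")
```

For a real run, replace the in-memory sample with open("<results file>.tsv", newline="") and pass the file object to count_topics.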

Parallelization

To classify millions of domains, deploy a VM with a large number of vCPUs so that GNU parallel can be used to its full extent. RAM and storage beyond the minimum required for the chosen instance are not needed.

As a reference, classifying the latest CrUX top 1M list on a c6g.8xlarge (32 vCPUs) EC2 instance takes about 40 minutes.
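
The reference run above can be turned into a back-of-the-envelope estimate for other workloads. The sketch below assumes that throughput scales linearly with the number of vCPUs and with the number of origins, which is an assumption of this example rather than a measured property; the function name estimate_minutes is hypothetical, and actual times will depend on the instance type and the input data.

```python
def estimate_minutes(num_origins, vcpus,
                     ref_origins=1_000_000, ref_vcpus=32, ref_minutes=40):
    """Extrapolate run time from the reference run, assuming linear scaling.

    Defaults are the reference figures quoted above:
    about 1M origins in about 40 minutes on 32 vCPUs.
    """
    # Origins processed per vCPU per minute in the reference run.
    origins_per_vcpu_minute = ref_origins / (ref_vcpus * ref_minutes)
    return num_origins / (origins_per_vcpu_minute * vcpus)


# Example: 10M origins on a hypothetical 64-vCPU instance.
print(estimate_minutes(10_000_000, 64))
```

Under these assumptions the reference run works out to roughly 780 origins per vCPU per minute, so doubling the vCPU count would roughly halve the run time.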
