## What is this?

PySpark jobs for investigating the prevalence of ML opt-out protocols, written by Alex Xue as part of the blog post *A Further Look Into the Prevalence of Various ML Opt-Out Protocols*.

## How Do I Run It?

Requires `sparkcc.py` from [commoncrawl/cc-pyspark](https://github.com/commoncrawl/cc-pyspark).

Setup is the same as for cc-pyspark. Make sure you have an `./input` directory containing the input file(s) that list the WARC paths to process (e.g. `./input/test_warc.txt`).

To run the jobs:

```sh
$SPARK_HOME/bin/spark-submit job_name.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt output_file_name
```

To run `html_metatag_count.py` specifically (which has a different output schema), add the `--tuple_key_schema` flag:

```sh
$SPARK_HOME/bin/spark-submit ./html_metatag_count.py \
    --num_output_partitions 1 --log_level WARN --tuple_key_schema True \
    ./input/test_warc.txt output_file_name
```
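
For reference, jobs built on `sparkcc.py` follow cc-pyspark's usual pattern: subclass `CCSparkJob` and implement `process_record`, yielding `(key, count)` pairs that get aggregated for you. The sketch below is a minimal, hypothetical example of that pattern; the class name and the byte-level meta-tag check are illustrative only and are not one of the jobs in this repository.

```python
# Minimal sketch of a cc-pyspark-style job (hypothetical example,
# not an actual job from this repository).
from sparkcc import CCSparkJob


class MetaTagSketchJob(CCSparkJob):
    """Count WARC response records whose payload contains a robots meta tag."""

    name = "MetaTagSketch"

    def process_record(self, record):
        # Only WARC 'response' records carry the fetched page payload.
        if record.rec_type != 'response':
            return
        payload = record.content_stream().read()
        # Crude byte-level check, just to illustrate yielding (key, count)
        # pairs; the real jobs parse the HTML properly.
        if b'name="robots"' in payload:
            yield 'robots_meta_tag', 1


if __name__ == '__main__':
    job = MetaTagSketchJob()
    job.run()
```

A job like this would be launched with the same `spark-submit` invocation shown above.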