Common Crawl PySpark Examples

This project provides examples of how to process the Common Crawl dataset with Apache Spark and Python:

count HTML tags in Common Crawl's raw response data (WARC files)
count web server names in Common Crawl's metadata (WAT files or WARC files)
list host names and corresponding IP addresses (WAT files or WARC files)
word count (term and document frequency) in Common Crawl's extracted text (WET files)
extract links from WAT files and construct the (host-level) web graph – for further details about the web graphs see the project cc-webgraph
work with the columnar URL index (see also cc-index-table and notes about schema merging):
- run a SQL query and export the result as a table
- select WARC records by a SQL query, parse the HTML, extract the text and count words. Alternatively, the first step (querying the columnar index) can be executed using Amazon Athena. The list of WARC record coordinates (a CSV file or a table created by a CTAS statement) is then passed to the Spark job via --csv or --input_table_format.
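To make the "term and document frequency" wording in the word count example concrete, here is a minimal pure-Python sketch of the two counts. It is not the project's word_count.py: the function name `term_and_doc_frequency`, the `\w+` tokenizer and the lowercasing are assumptions chosen for illustration, and the Spark job would distribute the same aggregation across the cluster.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of word characters; the real job may tokenize differently.
WORD = re.compile(r"\w+")


def term_and_doc_frequency(documents):
    """Return {word: (term_frequency, document_frequency)} for an iterable of texts."""
    totals = {}
    for text in documents:
        # Count every occurrence of each word within this one document.
        per_doc = Counter(w.lower() for w in WORD.findall(text))
        for word, tf in per_doc.items():
            term_freq, doc_freq = totals.get(word, (0, 0))
            # Term frequency sums all occurrences; document frequency adds 1 per document.
            totals[word] = (term_freq + tf, doc_freq + 1)
    return totals


if __name__ == "__main__":
    stats = term_and_doc_frequency(["the cat and the hat", "The dog"])
    print(stats["the"])
```

The term frequency counts every occurrence of a word, while the document frequency counts each document at most once per word.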

Further information about the examples and available options is shown via the command-line option --help.

Implementing a Custom Extractor

Extending the CCSparkJob isn't difficult, and for many use cases it's sufficient to override a single method (process_record). Have a look at one of the examples, e.g., the one that counts HTML tags.
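The shape of such an override can be sketched without Spark. The class below is a hypothetical stand-in, not a real CCSparkJob subclass: it only assumes that process_record receives a warcio-style record (with `rec_type` and `content_stream()`) and yields key/value pairs that the job later sums per key. The names `TagCountSketch`, `count_tags` and `TAG_PATTERN` are invented for this illustration.

```python
import re
from collections import Counter

# Matches the name of an opening HTML tag, e.g. b"<div" -> b"div" (closing tags are skipped).
TAG_PATTERN = re.compile(rb"<([a-z0-9]+)", re.IGNORECASE)


def count_tags(payload):
    """Yield (tag, count) pairs for the bytes of one HTML document."""
    counts = Counter(
        match.group(1).lower().decode("ascii")
        for match in TAG_PATTERN.finditer(payload)
    )
    for tag, count in counts.items():
        yield tag, count


class TagCountSketch:
    """Hypothetical stand-in; in the project this would extend CCSparkJob from sparkcc.py."""

    def process_record(self, record):
        # Assumption: `record` behaves like a warcio record. Only HTTP responses carry HTML.
        if record.rec_type != "response":
            return
        yield from count_tags(record.content_stream().read())
```

In a real job the base class would take care of reading the input listing, iterating over the records of each file and reducing the yielded pairs; only the per-record logic shown here would need to be written.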

Setup

To develop and test locally, you will need to install

Spark, see the detailed instructions, and
all required Python modules by running

pip install -r requirements.txt

(optionally, and only if you want to query the columnar index) install S3 support libraries so that Spark can load the columnar index from S3

Compatibility and Requirements

Tested with Spark 2.1.0 – 2.4.6 in combination with Python 2.7 or 3.5, 3.6, 3.7, and with Spark 3.0.0 – 3.2.1 in combination with Python 3.7, 3.8 and 3.9.

Get Sample Data

To develop locally, you'll need at least three data files – one for each format used in at least one of the examples. They can be fetched from the following links:

Alternatively, running get-data.sh downloads the sample data. It also writes input files containing

sample input as file:// URLs
all input of one monthly crawl as relative paths
- to use with --input_base_url s3://commoncrawl/ or --input_base_url https://data.commoncrawl.org/
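The difference between the two kinds of input files can be illustrated with a small sketch of how a line from such a listing might be turned into a full location. The helper `resolve_input` is hypothetical and only mirrors the idea behind --input_base_url; the project's own resolution logic in sparkcc.py may differ in detail.

```python
def resolve_input(line, input_base_url=""):
    """Return the full location for one line of an input listing."""
    path = line.strip()
    # Lines such as file:///..., s3://... or https://... are already complete.
    if "://" in path or not input_base_url:
        return path
    # Relative paths are appended to the base URL with exactly one slash in between.
    return input_base_url.rstrip("/") + "/" + path.lstrip("/")


if __name__ == "__main__":
    print(resolve_input("crawl-data/CC-MAIN-2017-13/example.warc.gz",
                        "https://data.commoncrawl.org/"))
```

With this scheme the same listing of relative paths can be read either via S3 or via HTTPS simply by changing the base URL.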

Note that the sample data is from an older crawl (CC-MAIN-2017-13 run in March 2017). If you want to use more recent data, please visit the Common Crawl site.

Process Common Crawl Data on Spark

Running locally

First, point the environment variable SPARK_HOME to your Spark installation. Then submit a job via

$SPARK_HOME/bin/spark-submit ./server_count.py \
	--num_output_partitions 1 --log_level WARN \
	./input/test_warc.txt servernames

This will count web server names sent in HTTP response headers for the sample WARC input and store the resulting counts in the SparkSQL table "servernames" in your warehouse location defined by spark.sql.warehouse.dir (usually in your working directory as ./spark-warehouse/servernames).
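As a rough mental model of what this job aggregates, the counting step can be written in a few lines of plain Python. This is only a local sketch, not the code of server_count.py: the function names are invented here, and the input is assumed to be one dictionary of HTTP response headers per record.

```python
from collections import Counter

# Placeholder key for responses that carry no Server header.
NO_SERVER = "(no server in HTTP header)"


def server_name(headers):
    """Return the value of the Server header, matched case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "server":
            return value.strip() or NO_SERVER
    return NO_SERVER


def count_servers(all_headers):
    """Count server names over an iterable of header dictionaries."""
    return Counter(server_name(headers) for headers in all_headers)


if __name__ == "__main__":
    sample = [{"Server": "nginx"}, {"server": "Apache"}, {"Content-Type": "text/html"}]
    print(count_servers(sample).most_common())
```

The Spark job performs the same kind of per-key summation, but spread over all input files and written to a table instead of an in-memory counter.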

The output table can be accessed via SparkSQL, e.g.,

$SPARK_HOME/bin/pyspark
>>> df = sqlContext.read.parquet("spark-warehouse/servernames")
>>> for row in df.sort(df.val.desc()).take(10): print(row)
... 
Row(key=u'Apache', val=9396)
Row(key=u'nginx', val=4339)
Row(key=u'Microsoft-IIS/7.5', val=3635)
Row(key=u'(no server in HTTP header)', val=3188)
Row(key=u'cloudflare-nginx', val=2743)
Row(key=u'Microsoft-IIS/8.5', val=1459)
Row(key=u'Microsoft-IIS/6.0', val=1324)
Row(key=u'GSE', val=886)
Row(key=u'Apache/2.2.15 (CentOS)', val=827)
Row(key=u'Apache-Coyote/1.1', val=790)

Running in Spark cluster over large amounts of data

As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it on Amazon AWS (in the us-east-1 AWS region) without incurring any transfer costs. The only cost that you incur is the cost of the machines running your Spark cluster.

spinning up the Spark cluster: AWS EMR contains a ready-to-use Spark installation, but you'll also find multiple descriptions on the web of how to deploy Spark on a cheap cluster of AWS spot instances. See also launching Spark on a cluster.
choose appropriate cluster-specific settings when submitting jobs and also check for relevant command-line options (e.g., --num_input_partitions or --num_output_partitions, see below)
don't forget to deploy all dependencies in the cluster, see advanced dependency management
the file sparkcc.py also needs to be deployed or passed to spark-submit via the argument --py-files sparkcc.py. Note: some of the examples require further Python files as dependencies.

Command-line options

All examples show the available command-line options if called with the parameter --help or -h, e.g.

$SPARK_HOME/bin/spark-submit ./server_count.py --help

Overwriting Spark configuration properties

There are many Spark configuration properties that allow you to tune the job execution or output; see, for example, tuning Spark or EMR Spark memory tuning.

It's possible to overwrite Spark properties when submitting the job:

$SPARK_HOME/bin/spark-submit \
    --conf spark.sql.warehouse.dir=myWareHouseDir \
    ... (other Spark options, flags, config properties) \
    ./server_count.py \
    ... (program-specific options)
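When jobs are launched from a script, the ordering shown above (Spark options first, then the program, then the program-specific options) can be captured in a small helper. This is a sketch only: `build_submit_command` is a hypothetical function, and it uses just the --conf and --py-files flags and the paths that appear in this README.

```python
import os


def build_submit_command(script, program_args, conf=None, py_files=None, spark_home=None):
    """Assemble the argument list for spark-submit without running it."""
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    cmd = [os.path.join(spark_home, "bin", "spark-submit")]
    # Spark options must come before the program to be run.
    for key, value in (conf or {}).items():
        cmd += ["--conf", "{}={}".format(key, value)]
    if py_files:
        # spark-submit expects a comma-separated list of extra Python files.
        cmd += ["--py-files", ",".join(py_files)]
    # Everything after the script path is passed to the program itself.
    cmd.append(script)
    cmd += list(program_args)
    return cmd


if __name__ == "__main__":
    print(build_submit_command(
        "./server_count.py",
        ["--num_output_partitions", "1", "./input/test_warc.txt", "servernames"],
        conf={"spark.sql.warehouse.dir": "myWareHouseDir"},
        py_files=["sparkcc.py"],
    ))
```

The returned list could be handed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with option values.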

Authenticated S3 Access or Access Via HTTP

Since April 2022 there are two ways to access Common Crawl data:

using HTTP/HTTPS and the base URL https://data.commoncrawl.org/ or https://ds5q9oxwqwsfj.cloudfront.net/
using the S3 API to read the bucket s3://commoncrawl/, which requires authentication and makes an Amazon Web Services account mandatory.
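For the HTTP route, a single WARC record can be requested with a byte-range request once its file name, offset and length are known (for example from the columnar index). The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes that the record is stored as its own gzip member, as is conventional for .warc.gz files.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def record_request(warc_filename, offset, length, base_url=BASE_URL):
    """Build (but do not send) a ranged GET request for one WARC record."""
    url = base_url.rstrip("/") + "/" + warc_filename.lstrip("/")
    # HTTP byte ranges are inclusive on both ends.
    last_byte = offset + length - 1
    return urllib.request.Request(
        url, headers={"Range": "bytes={}-{}".format(offset, last_byte)}
    )


def fetch_record(warc_filename, offset, length):
    """Download one record and return its decompressed bytes (needs network access)."""
    request = record_request(warc_filename, offset, length)
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())
```

No AWS account is needed for this variant, which makes it convenient for small experiments outside of AWS.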

The cc-pyspark project uses boto3 to access WARC, WAT or WET files on s3://commoncrawl/. The best way is to ensure that an S3 read-only IAM policy is attached to the IAM role of the EC2 instances where Common Crawl data is processed, see the IAM user guide. If this is not an option (or if the processing is not running on AWS), there are various options to configure credentials in boto3.

Installation of S3 Support Libraries

While WARC/WAT/WET files are read using boto3, accessing the columnar URL index (see option --query of CCIndexSparkJob) is done directly by the SparkSQL engine and requires that S3 support libraries are available. These libraries are usually provided when the Spark job runs on a Hadoop cluster on AWS (e.g., EMR). However, they are not included in every Spark distribution and are usually absent when running Spark locally (not in a Hadoop cluster). In these situations, the easiest way is to add the libraries as required packages by adding --packages org.apache.hadoop:hadoop-aws:3.2.1 to the arguments of spark-submit. Spark then manages the dependencies: the hadoop-aws package and its transitive dependencies are downloaded as Maven dependencies. Note that the required version of the hadoop-aws package depends on the Hadoop version bundled with your Spark installation, e.g., Spark 3.2.1 bundled with Hadoop 3.2 (spark-3.2.1-bin-hadoop3.2.tgz).

Please also note that:

the scheme of the URL referencing the columnar index depends on the actual S3 file system implementation: it's s3:// on EMR but s3a:// when using Hadoop's S3A connector.
(since April 2022) only authenticated S3 access is possible. This requires that access to S3 is properly set up. For configuration details, see Authorizing access to EMRFS data in Amazon S3 or Hadoop-AWS: Authenticating with S3.

Example call to count words in 10 WARC records hosted under the .is top-level domain:

$SPARK_HOME/bin/spark-submit \
    --packages org.apache.hadoop:hadoop-aws:3.3.2 \
    ./cc_index_word_count.py \
    --input_base_url s3://commoncrawl/ \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length, content_charset FROM ccindex WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc' AND url_host_tld = 'is' LIMIT 10" \
    s3a://commoncrawl/cc-index/table/cc-main/warc/ \
    myccindexwordcountoutput \
    --num_output_partitions 1 \
    --output_format json

Columnar index and schema merging

The schema of the columnar URL index has been extended over time by adding new columns. If you want to query one of the new columns (e.g., content_languages), the following Spark configuration option needs to be set:

--conf spark.sql.parquet.mergeSchema=true

However, this option impacts the query performance, so use with care! Please also read cc-index-table about configuration options to improve the performance of Spark SQL queries.
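Why merging is needed can be pictured with a toy model in which every crawl partition lists its own columns. The sketch below is purely conceptual and is not how Spark implements schema merging; the helper names are invented, and the column lists are shortened examples.

```python
def merge_columns(partition_schemas):
    """Return the ordered union of column names over all partitions."""
    merged = []
    for columns in partition_schemas:
        for name in columns:
            if name not in merged:
                merged.append(name)
    return merged


def align_row(row, merged):
    """Fill columns that a row's partition does not have with None."""
    return {name: row.get(name) for name in merged}


if __name__ == "__main__":
    older = ["url", "warc_filename"]
    newer = ["url", "warc_filename", "content_languages"]
    merged = merge_columns([older, newer])
    print(align_row({"url": "http://example.is/", "warc_filename": "a.warc.gz"}, merged))
```

Computing such a union requires looking at the schema of many files rather than just one, which gives an intuition for the performance cost mentioned above.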

Alternatively, it's possible to configure the table schema explicitly:

download the latest table schema as JSON
and use it by adding the command-line argument --table_schema cc-index-schema-flat.json.

Credits

Examples are originally ported from Stephen Merity's cc-mrjob with the following changes and upgrades:

based on Apache Spark (instead of mrjob)
boto3, supporting multi-part download of data from S3
warcio, a Python 2 and Python 3 compatible module to access WARC files

Further inspirations are taken from

cosr-back written by Sylvain Zimmer for Common Search. You should definitely have a look at it if you need a more sophisticated WARC processor (including an HTML parser, for example).
Mark Litwintschik's blog post Analysing Petabytes of Websites

License

MIT License, as per LICENSE
