layout: page
title: End of Term 2020 Dataset
permalink: /data/data-2020/

End of Term 2020 Dataset

The End of Term 2020 Dataset contains data collected by two institutions: the Internet Archive (IA) and the University of North Texas Libraries (UNT). The data is part of the End of Term Presidential Web Archive initiative.

Archive Location and Download

The 2020 End of Term archive is stored in the eotarchive S3 bucket under the EOT-2020 prefix.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, WET, META, CDX, and URL index files.

Prepending either s3://eotarchive/ or https://eotarchive.s3.amazonaws.com/ to each line of a file list yields the S3 or HTTP path, respectively.

| | File List | #Files | Total Size (Compressed) |
|---|---|---|---|
| Segments | EOT-2020/segment.paths.gz | 26 | |
| WARC files | EOT-2020/warc.paths.gz | 239811 | 266.04 TB |
| WAT files | EOT-2020/wat.paths.gz | 239811 | 9.15 TB |
| WET files | EOT-2020/wet.paths.gz | 239811 | 2.6 TB |
| META files | EOT-2020/meta.paths.gz | 239811 | 712.66 GB |
| CDX files | EOT-2020/cdx.paths.gz | 239811 | 83.66 GB |
| URL Index files | EOT-2020/eot-index.paths.gz | 49 | 74.4 GB |
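
The prefix rule described above can be sketched in Python. This is a minimal illustration, not an official tool: the helper name `expand_paths` and the two sample paths are hypothetical, and the code assumes each listing is a gzipped text file with one bucket-relative path per line. The sample listing is built in memory so the sketch runs without network access.

```python
import gzip
import io

# Prefixes given in the text above.
S3_PREFIX = "s3://eotarchive/"
HTTP_PREFIX = "https://eotarchive.s3.amazonaws.com/"


def expand_paths(gz_bytes, prefix=HTTP_PREFIX):
    """Decompress a *.paths.gz listing and prepend `prefix` to every non-empty line.

    Hypothetical helper: assumes one bucket-relative path per line.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        return [prefix + line.strip() for line in fh if line.strip()]


# Stand-in listing with made-up paths; the real file names in the
# listings are not given in this document.
sample_listing = gzip.compress(
    b"EOT-2020/example/a.warc.gz\nEOT-2020/example/b.warc.gz\n"
)

http_urls = expand_paths(sample_listing)
s3_uris = expand_paths(sample_listing, prefix=S3_PREFIX)

for url in http_urls:
    print(url)
```

To work with real data, the bytes of a downloaded listing (for example, the file obtained by applying the same HTTP prefix to EOT-2020/warc.paths.gz) would be passed in place of `sample_listing`.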