Downloading CIL data

Chris Churas edited this page Oct 30, 2018 · 6 revisions

This page provides instructions on how to use this tool to download and convert CIL data from legacy servers.

WARNING THIS PAGE MAY CONTAIN ERRORS AND OMISSIONS AND WE ARE NOT RESPONSIBLE FOR DAMAGES. YOU HAVE BEEN WARNED.

Step 1) Install CIL_file_download_tool

The first step is to install the tool.

git clone https://github.com/CRBS/CIL_file_download_tool.git
cd CIL_file_download_tool
make dist
sudo pip install dist/cildata*whl

Step 2) Create a directory to work in and setup database configuration file

Create a directory to download the data to and change to that directory. The full CIL download comes to a couple of terabytes, so make sure you have enough space.

mkdir cil
cd cil
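Since the full download is described as a couple of terabytes, it can help to check free space in the working directory before starting. The following is a minimal sketch using only the Python standard library; the 2 TB threshold and the function name are assumptions made for illustration and are not part of the CIL tools.

```python
import shutil

# Assumed threshold: "a couple terabytes" is read here as 2 TB (decimal).
# Adjust this to match your own estimate.
REQUIRED_BYTES = 2 * 1000**4


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    free_tb = shutil.disk_usage(".").free / 1000**4
    print(f"Free space in current directory: {free_tb:.2f} TB")
    if not has_enough_space("."):
        print("Warning: less free space than the assumed 2 TB requirement")
```

Run it from inside the cil directory so that the check applies to the filesystem the data will be written to.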

Save the text below into a file named db.conf, replacing the placeholders in angle brackets (for example <USER>) with the correct values.

[postgres]

user = <USER>
password = <PASSWORD>
port = 5432
host = <HOST>
database = <DATABASE NAME>
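A placeholder left in db.conf is an easy mistake to make, so a quick check before the long download may save time. The sketch below parses the file with Python's configparser and reports keys that are missing or still look like `<PLACEHOLDER>`. It is only a sanity check under the assumption that the file is plain INI as shown above; it does not reproduce how the CIL tools read the file, and the function name is hypothetical.

```python
import configparser

# Keys shown in the db.conf template above.
REQUIRED_KEYS = ("user", "password", "port", "host", "database")


def unfilled_keys(conf_text):
    """Return the [postgres] keys that are missing, empty, or still a <PLACEHOLDER>."""
    # interpolation=None so that a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("postgres"):
        return list(REQUIRED_KEYS)
    problems = []
    for key in REQUIRED_KEYS:
        value = parser.get("postgres", key, fallback="").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            problems.append(key)
    return problems
```

For example, `unfilled_keys(open("db.conf").read())` returns an empty list once every value has been filled in.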

Step 3) Download data from servers

A full download takes a long time (days). If you run this step on a remote machine, consider using the screen command so that a disconnect does not stop the download.

cildatadownloader.py --log DEBUG db.conf .

If you need to restart, remove the last partially completed dataset directory (it will be under images/## or videos/##) and run this:

cildatadownloader.py --log DEBUG --skipifexists db.conf .
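Finding the partially completed dataset directory by hand can be tedious. The sketch below lists the most recently modified directory two levels below images/ and videos/. It assumes the layout is images/##/<dataset directory> and videos/##/<dataset directory>, which is inferred from the note above and not confirmed. Modification time is only a hint, so inspect the result before deleting anything.

```python
from pathlib import Path


def newest_dataset_dir(root):
    """Return the most recently modified directory matching
    <root>/images/*/* or <root>/videos/*/*, or None if there are none.

    The two-level layout is an assumption, not documented behavior.
    """
    candidates = [
        d
        for kind in ("images", "videos")
        for d in Path(root).glob(f"{kind}/*/*")
        if d.is_dir()
    ]
    return max(candidates, key=lambda d: d.stat().st_mtime, default=None)


if __name__ == "__main__":
    # Print only; removal is left as a manual step.
    print(newest_dataset_dir("."))
```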

Step 4) Check download

cildatareport.py .

The output looks like the following and provides a summary of the download (run with --printfailed to get a list of failed downloads):

Number entries: 38575 (failed: 95)
Number unique IDs: 10020 (failed: 93)
Number entries that are NOT supposed to have raw file: 959
-----------------
	application/hyperstudio ==> 179
	image/vnd.adobe.photoshop ==> 12
	image/gif ==> 40
	image/jpeg ==> 1680
	image/jpeg; charset=utf-8 ==> 10020
	video/quicktime ==> 389
	image/png ==> 114
	video/x-flv ==> 632
	None ==> 793
	video/mpeg ==> 16
	image/tif ==> 9295
	video/x-msvideo ==> 231
	text/html; charset=iso-8859-1 ==> 90
	application/zip ==> 9060
	image/tiff ==> 5606
	text/plain ==> 416
	application/vnd.ms-ims ==> 2
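If you want to track the report across runs, the text above is regular enough to parse. The following sketch extracts the entry and failure counts and the per-type counts from output shaped like the sample. The regular expressions are written against that sample only, so they are an assumption about the format and not a specification of what cildatareport.py emits.

```python
import re

ENTRIES_RE = re.compile(r"Number entries: (\d+) \(failed: (\d+)\)$")
MIME_RE = re.compile(r"(.+?)\s+==>\s+(\d+)$")


def parse_report(text):
    """Parse report text into (summary, mime_counts).

    summary holds 'entries' and 'failed' when the first line is present;
    mime_counts maps each type label (including 'None') to its count.
    """
    summary = {}
    mime_counts = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = MIME_RE.match(line)
        if match:
            mime_counts[match.group(1)] = int(match.group(2))
            continue
        match = ENTRIES_RE.match(line)
        if match:
            summary["entries"] = int(match.group(1))
            summary["failed"] = int(match.group(2))
    return summary, mime_counts
```

In the sample above, the per-type counts add up to 38575, the same as the number of entries, and 95 failures out of 38575 entries is roughly 0.25%.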

Step 5) Convert data

After the data has been downloaded, another tool can be run to perform the necessary data conversions:

cildataconverter.py .

Step 6) Convert flash videos to mp4

Create the script mentioned on this ticket: https://github.com/CRBS/cildata_util/issues/4

Then run it on the videos directory. The script assumes ffmpeg is installed.
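For orientation only, here is a rough sketch of what a batch conversion with ffmpeg could look like. It is not the script from the ticket. It assumes the flash videos end in .flv and that ffmpeg's default settings for an .mp4 output are acceptable. It only builds the command lines, so nothing is converted until you choose to run them.

```python
from pathlib import Path


def flv_to_mp4_commands(videos_dir):
    """Build one ffmpeg command per .flv file found under `videos_dir`.

    Files that already have a sibling .mp4 are skipped. The .flv suffix is
    an assumption about how the downloaded flash videos are named.
    """
    commands = []
    for flv in sorted(Path(videos_dir).rglob("*.flv")):
        mp4 = flv.with_suffix(".mp4")
        if mp4.exists():
            continue
        # Basic invocation: ffmpeg -i <input> <output>
        commands.append(["ffmpeg", "-i", str(flv), str(mp4)])
    return commands


if __name__ == "__main__":
    for cmd in flv_to_mp4_commands("videos"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.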

Step 7) Create thumbnail images

Create the script mentioned on this ticket: https://github.com/CRBS/cildata_util/issues/3

Then run it on the data. The script should create thumbnails under the images/ and videos/ directories.

Step 8) Update database

cildataupdatedb.py db.conf .