Downloading CIL data
This page provides instructions on how to use this tool to download and convert CIL data from legacy servers.
WARNING THIS PAGE MAY CONTAIN ERRORS AND OMISSIONS AND WE ARE NOT RESPONSIBLE FOR DAMAGES. YOU HAVE BEEN WARNED.
The first step is to install the tool:
git clone https://github.com/CRBS/CIL_file_download_tool.git
cd CIL_file_download_tool
make dist
sudo pip install dist/cildata*whl
Create a directory to download the data to and change to that directory. The full CIL download takes up a couple of terabytes, so make sure you have enough space.
mkdir cil
cd cil
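Before starting, it may help to confirm how much space is free in the download directory. The following is a minimal sketch using only the Python standard library. The 2 TB threshold is an assumption taken from the "couple of terabytes" estimate above, not a measured requirement, and has_enough_space is a hypothetical helper that is not part of the tool.

```python
import shutil

# Assumed threshold: the page only says "a couple of terabytes",
# so 2 TB is an illustrative figure, not a measured requirement.
REQUIRED_BYTES = 2 * 1000**4


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Report the free space for the current directory.
free_tb = shutil.disk_usage(".").free / 1000**4
print(f"Free space here: {free_tb:.2f} TB")
if not has_enough_space("."):
    print("Warning: less than the assumed 2 TB is free.")
```

Run it from inside the cil directory so that the check applies to the filesystem that will hold the data.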
Save the text below into a file named db.conf, replacing the values in angle brackets (<...>) with the correct values.
[postgres]
user = <USER>
password = <PASSWORD>
port = 5432
host = <HOST>
database = <DATABASE NAME>
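A quick check that no placeholder was left in db.conf can save a failed run. The sketch below assumes the file is INI-style text that Python's configparser can read (how the tool itself parses the file is not stated here), and check_db_conf is a hypothetical helper, not part of the tool.

```python
import configparser
import re

REQUIRED_KEYS = ("user", "password", "port", "host", "database")
# A value that is still wrapped in angle brackets, e.g. <USER>.
PLACEHOLDER = re.compile(r"^<.*>$")


def check_db_conf(text):
    """Return a list of problems found in db.conf-style text (empty if none)."""
    # Interpolation is disabled so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("postgres"):
        return ["missing [postgres] section"]
    problems = []
    for key in REQUIRED_KEYS:
        value = parser.get("postgres", key, fallback=None)
        if value is None:
            problems.append(f"missing key: {key}")
        elif PLACEHOLDER.match(value.strip()):
            problems.append(f"placeholder not replaced: {key}")
    return problems


# Demonstration on the unedited template from this page.
TEMPLATE = """[postgres]
user = <USER>
password = <PASSWORD>
port = 5432
host = <HOST>
database = <DATABASE NAME>
"""
for problem in check_db_conf(TEMPLATE):
    print(problem)
```

To check a real file, pass its contents, for example check_db_conf(open("db.conf").read()); an empty list means every key is present and filled in.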
A full download takes a long time (days). If you run this step on a remote machine, use the screen command so that a disconnect does not stop the download.
cildatadownloader.py --log DEBUG db.conf .
If you need to restart, remove the last partially completed dataset directory (it will be under images/## or videos/##) and run this:
cildatadownloader.py --log DEBUG --skipifexists db.conf .
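Finding the partially completed directory by hand can be tedious in a large download. The sketch below lists the most recently modified directory directly under images/ and videos/ as a candidate to inspect. It assumes each dataset lives in its own subdirectory there and that modification time is a fair proxy for "last written", which is a heuristic and not something the tool guarantees. newest_dataset_dir is a hypothetical helper, and it deletes nothing.

```python
import os


def newest_dataset_dir(root="."):
    """Return the most recently modified directory under root/images or
    root/videos, or None if neither contains any directory."""
    candidates = []
    for kind in ("images", "videos"):
        parent = os.path.join(root, kind)
        if not os.path.isdir(parent):
            continue
        with os.scandir(parent) as entries:
            for entry in entries:
                if entry.is_dir():
                    candidates.append((entry.stat().st_mtime, entry.path))
    if not candidates:
        return None
    # max() picks the tuple with the largest modification time.
    return max(candidates)[1]


print("Candidate to inspect:", newest_dataset_dir("."))
```

Check the reported directory against the end of the download log before removing anything.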
Once the download finishes, generate a report:
cildatareport.py .
The output looks like this and summarizes the download (run with --printfailed to get a list of failed downloads):
Number entries: 38575 (failed: 95)
Number unique IDs: 10020 (failed: 93)
Number entries that are NOT supposed to have raw file: 959
-----------------
application/hyperstudio ==> 179
image/vnd.adobe.photoshop ==> 12
image/gif ==> 40
image/jpeg ==> 1680
image/jpeg; charset=utf-8 ==> 10020
video/quicktime ==> 389
image/png ==> 114
video/x-flv ==> 632
None ==> 793
video/mpeg ==> 16
image/tif ==> 9295
video/x-msvideo ==> 231
text/html; charset=iso-8859-1 ==> 90
application/zip ==> 9060
image/tiff ==> 5606
text/plain ==> 416
application/vnd.ms-ims ==> 2
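If the report needs to be tracked across runs, its text can be parsed into numbers. The sketch below assumes the exact line shapes shown above ("Number ...: N (failed: M)" and "mime ==> count"); parse_report is a hypothetical helper, and the sample is a shortened copy of the output above.

```python
import re

# Matches the two summary lines that carry a failure count.
ENTRY_RE = re.compile(r"^Number (entries|unique IDs): (\d+) \(failed: (\d+)\)$")


def parse_report(text):
    """Return (summary, mime_counts) parsed from cildatareport.py-style output.

    summary maps "entries" / "unique IDs" to (total, failed);
    mime_counts maps each MIME type label to its count.
    """
    summary = {}
    mime_counts = {}
    for line in text.splitlines():
        line = line.strip()
        match = ENTRY_RE.match(line)
        if match:
            label, total, failed = match.groups()
            summary[label] = (int(total), int(failed))
        elif " ==> " in line:
            mime, count = line.rsplit(" ==> ", 1)
            mime_counts[mime] = int(count)
    return summary, mime_counts


SAMPLE = """Number entries: 38575 (failed: 95)
Number unique IDs: 10020 (failed: 93)
-----------------
image/tif ==> 9295
application/zip ==> 9060
None ==> 793
"""
summary, mime_counts = parse_report(SAMPLE)
total, failed = summary["entries"]
print(f"Failed entries: {failed} of {total} ({failed / total:.2%})")
```

In the full output above, the per-type counts add up to the "Number entries" total, which makes a convenient sanity check on a parsed report.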
After the data has been downloaded, run another tool to perform the necessary data conversions:
cildataconverter.py .
Create the script described in this ticket: https://github.com/CRBS/cildata_util/issues/4
and run it on the videos directory. The script assumes ffmpeg is installed.
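Since the video script depends on ffmpeg, it can be worth confirming that ffmpeg is on the PATH first. This is a minimal sketch using the standard library; ffmpeg_path is a hypothetical helper and says nothing about which ffmpeg version the script needs.

```python
import shutil


def ffmpeg_path():
    """Return the path to the ffmpeg executable if it is on PATH, else None."""
    return shutil.which("ffmpeg")


path = ffmpeg_path()
print("ffmpeg found at " + path if path else "ffmpeg not found on PATH")
```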
Create the script described in this ticket: https://github.com/CRBS/cildata_util/issues/3
and run it on the data. The script should create thumbnails under the images/ and videos/ directories.
Finally, run the database update tool:
cildataupdatedb.py db.conf .