A collection of GitHub siblings of DataLad datasets. Some of the included datasets require access credentials.
- TRR289 datasets
- Pain sensitivity datasets (RPN)
- SFB1280 datasets
- Derivative datasets
- BIDS-MEGA (BEP-035) example datasets (see also)
- PUMI example and test datasets
- dataset must be a DataLad dataset
- dataset must be in BIDS format (for derivative and non-imaging data: at least a dataset_description.json file must be there)
- datasets must have a GitHub sibling in this GH organization
- datasets must have a reliable special remote (preferred: Coscine RDS-S3, or Amazon S3 for public datasets, e.g. clones of OpenNeuro datasets)
- datasets must have an unannexed readme.md and dataset_description.json
- GitHub repo description must be set to a short description of the dataset, ending with the sample size when possible (n=xy)
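The local part of this checklist can be sketched as a small shell helper. This is a minimal sketch, not an official tool: the function name `check_dataset` is hypothetical, it assumes the lowercase file names used above, and it assumes that annexed files show up as symlinks (the usual case outside of adjusted or unlocked mode). It does not check the GitHub sibling or the special remote.

```shell
# Hypothetical helper: report which local requirements a dataset directory misses.
check_dataset() {
    cd_dir="$1"
    cd_status=0
    # A DataLad dataset keeps its configuration in a .datalad directory.
    if [ ! -d "$cd_dir/.datalad" ]; then
        echo "missing: .datalad"
        cd_status=1
    fi
    for cd_file in readme.md dataset_description.json; do
        if [ -L "$cd_dir/$cd_file" ]; then
            # Assumption: an annexed file appears as a symlink into .git/annex.
            echo "annexed (symlink): $cd_file"
            cd_status=1
        elif [ ! -e "$cd_dir/$cd_file" ]; then
            echo "missing: $cd_file"
            cd_status=1
        fi
    done
    return "$cd_status"
}
```

Calling `check_dataset <dataset_name>` prints one line per problem and returns a non-zero status if anything is missing.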
This does not download the actual data, only the "skeleton". After the install command, you have to explicitly tell DataLad that you would like your GitHub sibling (origin) to depend on the S3 sibling.
datalad install -s git@github.com:pni-data/<dataset_name>.git <dataset_name>
datalad siblings configure -s origin --publish-depends coscine-rds-s3
If the dataset you are about to download is in a private GitHub repo, you'll need to authenticate as usual (e.g. with a Personal Access Token or an SSH key).
If your connection goes through a proxy server, you'll need to allow its IP address to be able to communicate with the S3 sibling.
git config --add annex.security.allowed-ip-addresses <proxy-server-ip>
You can selectively download what you need (e.g. derivatives only).
cd <dataset_name>
datalad get <path/to/file*>
Depending on the dataset, you may be prompted for the S3 credentials to access the files. In this case, contact the dataset owner to obtain the (read or write) credentials and set them up like this:
export AWS_ACCESS_KEY_ID="XXXXX-XXXX-XXXX-XXXX-XXXX"
export AWS_SECRET_ACCESS_KEY="XXXXXXXX"
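A missing or unexported credential can be caught before any transfer starts. The following is a minimal sketch; the function name `require_s3_credentials` is hypothetical and not part of DataLad or git-annex.

```shell
# Hypothetical helper: fail early if the S3 credentials are not exported.
require_s3_credentials() {
    rc_missing=0
    for rc_var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # printenv only sees exported variables, i.e. those child processes inherit.
        if [ -z "$(printenv "$rc_var")" ]; then
            echo "error: $rc_var is not set or not exported" >&2
            rc_missing=1
        fi
    done
    return "$rc_missing"
}
```

It could be chained in front of the download, e.g. `require_s3_credentials && datalad get <path/to/file*>`.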
Now you should be able to get the data.
git-annex whereis <path/to/file>
As all datasets here are guaranteed to also be stored on an S3 remote, you can always safely drop any file from your local dataset. DataLad only drops the actual data, not the annexed links. That is, the "dataset skeleton" never has to be removed. You will still be able to browse and search the dataset skeleton (and the metadata), and download a file again if you need it.
datalad drop <path/to/file*>
Just save your changes and push/publish them to the GitHub sibling. As the GitHub sibling depends on the coscine-rds-s3 special remote, the following commands will also upload the actual data to the S3 storage.
datalad save .
datalad push --to origin
cd <my_dataset>
datalad create -f .
datalad save .
E.g. readme.md and dataset_description.json (this way these will be directly visible in github)
datalad no-annex --pattern readme.md dataset_description.json
datalad save .
Here we create a Coscine RDS-S3 sibling.
You will need the following info about the S3 resource:
- Host name (e.g. coscine-s3-01.data.fds.uni-due.de)
- Port (e.g. 443)
- Access Key for Writing
- Secret Key for Writing
- Bucket Name
See the DataLad docs for more detail.
export AWS_ACCESS_KEY_ID="your-access-key-for-writing"
export AWS_SECRET_ACCESS_KEY="your-secret-key-for-writing"
git-annex initremote coscine-rds-s3 type=S3 host=<your_host> port=<your_port> encryption=none bucket=<your_bucket_name> signature=v4 chunk=50mb autoenable=true
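To avoid typos when filling in the placeholders, the call above can be assembled from variables and printed for review first. This is only a sketch: the variable names `S3_HOST`, `S3_PORT` and `S3_BUCKET` and the function name are illustrative assumptions, and the function prints the command rather than running it.

```shell
# Hypothetical helper: print the initremote command built from S3_HOST, S3_PORT, S3_BUCKET.
coscine_initremote_cmd() {
    if [ -z "${S3_HOST:-}" ] || [ -z "${S3_PORT:-}" ] || [ -z "${S3_BUCKET:-}" ]; then
        echo "error: set S3_HOST, S3_PORT and S3_BUCKET first" >&2
        return 1
    fi
    # Same options as in the command above; only the placeholders are substituted.
    echo "git-annex initremote coscine-rds-s3 type=S3 host=$S3_HOST port=$S3_PORT encryption=none bucket=$S3_BUCKET signature=v4 chunk=50mb autoenable=true"
}
```

After setting the three variables, run `coscine_initremote_cmd`, check the printed line, and paste it into the shell inside the dataset.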
This is for RDM purposes (listing, sharing, using issues, PRs, etc.). The GitHub repo will only contain data that is unannexed. It will disclose the directory tree and the filenames, though.
If that's not what you want, make the repo private with --private. See the DataLad docs for details.
Requirements:
- you must be a member of the GitHub organization "pni-data" (or swap it for your own profile or organization)
- you need a valid GitHub Personal Access Token
- the GitHub repo must not yet exist (the command creates it)
datalad create-sibling-github -d . --github-organization pni-data -s origin <dataset_name> --publish-depends coscine-rds-s3 --access-protocol ssh
# here, your github token is needed
datalad siblings
.: here(+) [git]
.: coscine-rds-s3(+) [git]
.: github(-) [git@github.com:pni-data/datalad_test2.git (git)]
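Whether a given sibling is configured can also be checked in a script. The sketch below assumes the human-readable output format shown above (`.: <name>(+) [...]`), which may differ between DataLad versions; the function name `has_sibling` is hypothetical.

```shell
# Hypothetical helper: read `datalad siblings` output on stdin and
# succeed if a sibling with the given name is listed.
has_sibling() {
    # Matches lines such as ".: coscine-rds-s3(+) [git]" as a fixed string.
    grep -F ": $1(" >/dev/null
}
```

For example: `datalad siblings | has_sibling coscine-rds-s3 || echo "S3 sibling missing"`.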
We push/publish the unannexed data and the annexed "dataset skeleton" to GitHub. As the GitHub sibling (origin) depends on the coscine-rds-s3 special remote, the following command will also upload the actual data to the S3 storage (in machine-readable chunks and, if requested, in an encrypted format).
datalad push --to origin