
HCP Suite

Description

This repository provides

  • a set of scripts to perform common operations and analyses on the Human Connectome Project's neuroimaging data, with a focus on task-based fMRI
  • a Python-based framework for connectome-based predictive modeling (CPM), parallelized with Ray, which has become the main focus of this project

Installation

Apart from cloning this repository, you only need bash installed (which is most likely already the case if you use Linux, *BSD or even macOS). For the Python code, the easiest and cleanest way is arguably to set up a Python virtual environment and install the dependencies there:

$ python3 -m venv hcp-suite # Set up the virtual environment
$ source ./hcp-suite/bin/activate # Activate the virtual environment
$ pip install pandas pingouin nilearn nibabel ray # Install Python dependencies within the virtual environment
$ pip install ipython jupyterlab # Install interactive Python shells to run hcp-suite code in
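
Optionally, you can verify that the dependencies were installed correctly by importing them:

$ python -c "import pandas, pingouin, nilearn, nibabel, ray" # Exits silently if all Python dependencies are importable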

Usage

CPM tutorial

The following tutorial uses the gambling task as an example; the variable to be predicted is BMI.

Overview

Python code in this repository is written to be used in an interactive Python shell. Either Jupyter Lab or Jupyter Notebook is recommended, as plots and images are conveniently displayed inline, but any Python shell (e.g. IPython) should work.
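
For example, assuming the virtual environment from the Installation section, Jupyter Lab can be started like this:

$ source ./hcp-suite/bin/activate # Activate the virtual environment
$ jupyter lab # Open a console or notebook in the browser tab that appears and run the code from the following sections there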

Generally speaking, the procedure is as follows:

  1. Creation of a list of subject IDs to be included in the analysis
  2. Parcellation of CIFTI files with task.sh prepare (in its current state, the code assumes the HCP's folder structure)
  3. Extraction of fMRI time series based on EV files provided by the HCP with get_ev_timeseries() from hcpsuite.py
  4. Computing of correlation matrices via compute_correlations() from hcpsuite.py
  5. Performing of CPM, the functions are provided by cpm_ray.py
  6. Establishing of statistical significance by way of permutation testing (also cpm_ray.py)

Parcellation

The following line is an example for preparation of the gambling task (to be run in the system, e.g. Linux, shell):

$ ./task.sh prepare -s /path/to/gambling_subjects.ids -t GAMBLING -p RenTianGlasser -e

Calling task.sh without arguments prints an explanation of its parameters (this output is also reproduced at the end of this README).

To accelerate the procedure, IDs files can be split and task.sh can be run in parallel:

$ split -l 270 /path/to/gambling_subjects.ids /tmp/gambling_ # Creates /tmp/gambling_aa, /tmp/gambling_ab etc. with 270 lines each
$ for id in /tmp/gambling_a?; do test -e $id.lock && continue; touch $id.lock; ./task.sh prepare -s $id -t GAMBLING -p RenTianGlasser -e; done # Runs one task.sh process per /tmp/gambling_a? file; start this loop in several shells at once, the lock files prevent an ID file from being processed twice

Run like this, task.sh will output a list of subject IDs for which preparation failed, so they can be re-run separately, as shown below.
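
The reported IDs can be collected into a file (the file name below is just an example) and preparation repeated for those subjects only:

$ ./task.sh prepare -s /tmp/gambling_failed.ids -t GAMBLING -p RenTianGlasser -e # /tmp/gambling_failed.ids contains one failed subject ID per line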

Extraction of time series

The following is run in an interactive Python shell (e.g. the "Console" of Jupyter Lab):

# We need to append Python's path variable as hcp-suite is not part of the standard path
import sys
sys.path.append("/path/to/hcp-suite/lib")
# ... so we can import the functions from hcpsuite.py
from hcpsuite import *

# First we will load the subject IDs into a Python list structure
ids = load_list_from_file("./path/to/gambling_subjects.ids")

# get_ev_timeseries() reads our parcellated CIFTI files and extracts information from the
# HCP-provided EV files (in this example, win.txt for the win condition of the gambling task)
# regarding the specific sections of the task time series we need for that condition
ev_data_dict = get_ev_timeseries(ids, ["win.txt"], task="GAMBLING", runs=('LR', 'RL'), parcellation="RenTianGlasser", data_dir='/home/HCP/S1200/', prewhiten=True) # Be sure to modify data_dir to match your HCP imaging data directory

# Now we save the extracted time series as text files in a directory of our choice
# (in this case: ./GAMBLING_win)
save_data_dict(ev_data_dict, path="./GAMBLING_win")

Computation of correlation matrices

We continue in our Python shell to compute correlation matrices:

# We load the saved time series from the previous step
%cd GAMBLING_win # IPython/Jupyter magic command; save_data_dict() writes file names into "ts_files"
                 # without paths, thus the easiest way is to change into the directory containing the files
time_series, time_series_files = load_time_series("./ts_files") # ... and read them from there

correlation_measure, correlation_matrices = compute_correlations(time_series, kind='tangent')
# Tangent in our experience provides the best results, but there are alternatives:
# https://nilearn.github.io/dev/modules/generated/nilearn.connectome.ConnectivityMeasure.html

# We then save the matrices into a single file in the NIfTI format for downstream processing
save_matrix(cifti_dim_to_nifti_dim(correlation_matrices), "./GAMBLING_win-tangent.nii.gz")

CPM

For actual CPM, we need to install and run Ray (run this e.g. in your Python virtual environment as described in Installation):

$ pip install ray
$ ray start --head --num-cpus=16 # Run this on your main node.
                                 # Processes take up a lot of RAM, be careful not to use too many CPUs

Optionally, add more Ray nodes to form a cluster; see the Ray documentation for details (a minimal example follows below).
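
A worker node can join the cluster by pointing it at the head node; the address below is a placeholder for your head node's IP and Ray port:

$ ray start --address='192.168.1.10:6379' --num-cpus=16 # Run this on each additional node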

Merging behavioral/biometric data in Python

For our analysis, we need values from both the unrestricted and restricted HCP data, which are available as separate CSV files. For easier handling, we merge them into a single CSV file:

import pandas as pd
unrestricted = pd.read_csv("/path/to/unrestricted.csv")
restricted = pd.read_csv("/path/to/restricted.csv")
merged = pd.merge(restricted, unrestricted, on="Subject")
merged.to_csv("./merged.csv") # Save the merged DataFrame as a CSV file in the current directory

Loading and preparing data in Python (to be run in an interactive Python shell)

import sys
sys.path.append("/path/to/hcp-suite/lib")
from cpm_ray import *

ids = load_list_from_file("./path/to/gambling_subjects.ids") # Load list of subject IDs
behav = 'BMI' # Which variable do we want to predict?
covars = ['Age_in_Yrs', 'Gender', 'Race'] # What variables do we want to correct for?

behav_data = get_behav_data("./path/to/merged.csv", ids) # Load behavioral and biometric data as a Pandas DataFrame from a CSV file
# We need to binarize gender and race for our analysis
behav_data["Gender"] = behav_data["Gender"].replace("F", 0)
behav_data["Gender"] = behav_data["Gender"].replace("M", 1)
behav_data["Race"] = behav_data["Race"].replace("White", 0)
behav_data["Race"] = behav_data["Race"].replace("Black or African Am.", 1)
behav_data["Race"] = behav_data["Race"].replace("Asian/Nat. Hawaiian/Othr Pacific Is.", 1)
behav_data["Race"] = behav_data["Race"].replace("Am. Indian/Alaskan Nat.", 1)
behav_data["Race"] = behav_data["Race"].replace("Unknown or Not Reported", 1)
behav_data["Race"] = behav_data["Race"].replace("More than one", 1)

functional_data = convert_matrices_to_dataframe(get_nimg_data("./path/to/GAMBLING_win-tangent.nii.gz"), ids) # Loading of correlation matrices as Pandas DataFrame

Starting the Ray handler

RayHandler is a Python class that handles data management as well as the starting and coordination of Ray Actors (i.e. the processes working in parallel).

n_folds = 128 # In this example, we use 128 folds, which is a good starting point
n_perm = 1000 # How many permutations are we planning to do later on?

ray_handler = RayHandler(
    functional_data.copy(),
    behav_data.copy(),
    behav,
    covars,
    address="auto", # We assume that the main Ray process runs on the same host
    n_perm=n_perm,
)

ray_handler.add_kfold_indices(n_folds, clean=True) # By setting "clean" to True, we remove twins from the fold so they don't predict each other
ray_handler.upload_data() # Functional and behavioral data are uploaded into the shared storage to which Ray Actors have access

Starting the analysis

First we define the jobs for the actors; a job is a Python list consisting of the following items: "job type", "fold number", "permutation number". The permutation number for the actual, i.e. unpermuted, data is "-1".

job_list = [["fselection_and_prediction", fold, perm] for perm in [-1] for fold in range(n_folds)] # This is the job list without permutations

# If we wanted to also compute the permutations (which takes a very long time), the job list can be created as follows:
#job_list = [["fselection_and_prediction", fold, perm] for perm in [-1, *range(n_perm)] for fold in range(n_folds)]

ray_handler.start_actors(job_list) # Start computing

Monitoring progress and retrieving results

n_done, n_remaining = ray_handler.status() # Prints a status report

results = ray_handler.get_results(n=100) # Retrieving a huge number of results (e.g. when performing permutation analysis)
                                         # and especially from distributed Ray actors can take a long time. Specifying the
                                         # number of results (e.g. n=100) to be retrieved from the results store at once
                                         # allows for a display of progress

# Optional: Save results into a file (NumPy format)
np.save("./GAMBLING_win-results.npy", results)

Presenting results

perm = -1 # -1 selects the results for the actual (unpermuted) data
g = plot_predictions(results['prediction_results'][perm], tail="glm", color="green")

For more presentation and plotting examples, see save.py.

Task.sh usage

Usage: task.sh <action> [action-specific arguments] [common options]
  Actions:
    prepare
      -s <arg>: specify file with subject IDs or a space-separated list of subjects
      -t <arg>: specify a task
      -p <arg>: specify a parcellation
      You might want to consider using -e to prevent exiting when one subject fails.
    design
      -s <arg>: specify file with subject IDs or a space-separated list of subjects
      -o <arg>: specify an output file (e.g. design.mat)
      -V <arg>: specify variables (e.g. "-V Age_in_Yrs -V Gender -V BMI"; repeatable)
      -f <arg>: specify CSV file containing the specified variables (repeatable; e.g.
                "-f restricted.csv -f unrestricted.csv")
    eb
      -s <arg>: specify file with subject IDs or a space-separated list of subjects
      -o <arg>: specify an output file (e.g. eb.csv)
      -f <arg>: specify CSV file containing the specified variables (repeatable; e.g.
                "-f restricted.csv -f unrestricted.csv")
    analyse
      -s <arg>: specify file with subject IDs or a space-separated list of subjects
      -t <arg>: specify a task
      -c <arg>: specify a COPE number
      -p <arg>: specify a parcellation
      -d <arg>: specify target directory
      [-n <arg>]: number of permutations (default: 5000)
      [-o <arg>]: additional PALM arguments
      Specify design, contrast and EB via additional PALM arguments or place them in
      <target directory>/../ as design_<task>.mat, design.con and eb_<task>.csv

  Common options:
    [-D <arg>]: specify the directory containing HCP subject data (overrides HCP variable in ../lib/common.sh)
    [-l <log file>]: specify log file instead of default /tmp/${0}-${RANDOM}.log
    [-e]: turn off error handling (do not exit on error)
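
As an illustration of the usage above, a design matrix for the gambling example might be generated like this (all paths are placeholders):

$ ./task.sh design -s /path/to/gambling_subjects.ids -o design_GAMBLING.mat -V Age_in_Yrs -V Gender -V BMI -f /path/to/restricted.csv -f /path/to/unrestricted.csv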