Benchmarks mmtf-pyspark methods as a function of the number of logical cores.

To set up the environment and start the benchmark notebooks:
git clone https://github.com/sbl-sdsc/mmtf-pyspark-benchmarks.git
cd mmtf-pyspark-benchmarks
conda env create -f environment.yml
conda activate benchmark
jupyter notebook
conda deactivate
Any time you want to use the environment, activate it again and start Jupyter Notebook. To remove the environment when you no longer need it:
conda remove --name benchmark --all
Download the entire PDB as an MMTF Hadoop Sequence File:
curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar
Then set the MMTF_FULL environment variable to point to the extracted directory:
export MMTF_FULL=<path>/full
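The notebooks can then resolve this location from the environment. A minimal sketch (the variable name `path` is illustrative):

```python
import os

# Location of the full PDB set, as exported above
path = os.environ["MMTF_FULL"]
print(path)
```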
When running PySpark on many cores, the Spark driver may run out of memory (the default driver memory is 1 GB). If necessary, increase it by setting the SPARK_DRIVER_MEMORY environment variable in the .bashrc (Linux) or .bash_profile (Mac) file:
export SPARK_DRIVER_MEMORY=16G
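For reference, a minimal sketch of how the same setting can be applied explicitly when a SparkSession is created from Python; this uses the standard spark.driver.memory property and only takes effect if set before the driver JVM starts:

```python
import os
from pyspark.sql import SparkSession

# Apply the driver memory setting when building the session; in client mode
# this must happen before the driver JVM is launched
spark = (
    SparkSession.builder
    .appName("benchmark")
    .config("spark.driver.memory", os.environ.get("SPARK_DRIVER_MEMORY", "16G"))
    .getOrCreate()
)
print(spark.conf.get("spark.driver.memory"))
spark.stop()
```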
The benchmarks compare:
- Reading the whole PDB from an MMTF Hadoop Sequence File (baseline benchmark; a minimal sketch follows this list)
- Tabulating zinc interactions in the PDB
- Tabulating salt-bridge interactions in the PDB
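As an illustration, here is a minimal sketch of the baseline read benchmark, assuming the mmtfReader API from mmtf-pyspark (older releases passed an explicit SparkContext to the reader; the timing code and variable names are illustrative):

```python
import os
import time
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader

spark = SparkSession.builder.master("local[4]").appName("readBaseline").getOrCreate()

start = time.time()
# Read the full PDB from the MMTF Hadoop Sequence File downloaded above
pdb = mmtfReader.read_sequence_file(os.environ["MMTF_FULL"])
count = pdb.count()  # force the lazy read to execute
elapsed = time.time() - start

print(f"Read {count} structures in {elapsed:.1f} s")
spark.stop()
```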
To run the benchmarks, set the list of core counts in cell 3 of RunBenchmarks.ipynb, e.g., for a machine with 24 logical cores:
cores = [24, 20, 16, 12, 8, 4, 2, 1]
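Conceptually, each entry in this list corresponds to one benchmark run on a fresh local[n] SparkSession, along the lines of the following sketch (the run_benchmark helper is a stand-in for the actual workloads):

```python
import time
from pyspark.sql import SparkSession

def run_benchmark(spark):
    # Stand-in workload; the notebooks read the PDB and run the analyses above
    return spark.sparkContext.parallelize(range(10_000_000)).sum()

cores = [24, 20, 16, 12, 8, 4, 2, 1]
for n in cores:
    # One SparkSession per core count so each run starts from a clean state
    spark = SparkSession.builder.master(f"local[{n}]").appName(f"bench-{n}").getOrCreate()
    start = time.time()
    run_benchmark(spark)
    print(f"{n} cores: {time.time() - start:.1f} s")
    spark.stop()
```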
The benchmarks are typically run on 1, 2, and 4n cores (n = 1, ..., maxcores/4), where maxcores is the number of logical cores on the machine. Note that the number of physical cores may be lower than the number of logical cores if hyperthreading is enabled.
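For a given machine, this sequence can be generated as follows (maxcores = 24 is assumed for illustration):

```python
maxcores = 24  # number of logical cores on the machine (assumed for illustration)

# 1, 2, and multiples of 4 up to maxcores, in descending order as in cell 3
cores = sorted({1, 2} | {4 * n for n in range(1, maxcores // 4 + 1)}, reverse=True)
print(cores)  # [24, 20, 16, 12, 8, 4, 2, 1]
```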
The repository contains the following notebooks:
- PrintSettings.ipynb (prints Spark settings, e.g., SPARK_DRIVER_MEMORY)
- RunBenchmarks.ipynb (runs all benchmarks)
- PlotReadResults.ipynb (plots benchmark results for reading MMTF Hadoop Sequence Files)
- PlotCalculation.ipynb (plots benchmark results for performing calculations using MMTF Hadoop Sequence Files)
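The plotting notebooks visualize the timings collected by RunBenchmarks.ipynb. A minimal sketch of that kind of plot, assuming the timings are available as lists of core counts and run times (the numbers below are purely illustrative):

```python
import matplotlib.pyplot as plt

# Illustrative data only; the real numbers come from RunBenchmarks.ipynb
cores = [1, 2, 4, 8, 12, 16, 20, 24]
times = [960, 500, 260, 140, 100, 80, 70, 65]  # seconds, hypothetical

# Speedup relative to the single-core run
speedup = [times[0] / t for t in times]

plt.plot(cores, speedup, marker="o")
plt.xlabel("Number of logical cores")
plt.ylabel("Speedup vs. 1 core")
plt.title("Benchmark scaling")
plt.show()
```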