Benchmarks mmtf-pyspark methods as a function of the number of logical cores.

To set up the environment and start the benchmark notebooks:
git clone https://github.com/sbl-sdsc/mmtf-pyspark-benchmarks.git
cd mmtf-pyspark-benchmarks
conda env create -f environment.yml
conda activate benchmark
jupyter notebook
conda deactivate
Any time you want to use the environment, activate it again and start Jupyter Notebook. To remove the environment when you no longer need it:
conda remove --name benchmark --all
Download the entire PDB as an MMTF Hadoop Sequence File:
curl -O https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar -xvf full.tar
Then set the MMTF_FULL environment variable to point to the extracted directory:
export MMTF_FULL=<path>/full
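The notebooks can then resolve this location from the environment. A minimal sketch (the variable name `path` is illustrative):

```python
import os

# Location of the full PDB set, as exported above
path = os.environ["MMTF_FULL"]
print(path)
```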
When running PySpark on many cores, the Spark driver may run out of memory (the default driver memory is 1 GB). If necessary, increase it by setting the SPARK_DRIVER_MEMORY environment variable in the .bashrc (Linux) or .bash_profile (Mac) file:
export SPARK_DRIVER_MEMORY=16G
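For reference, a minimal sketch of how the same setting can be applied explicitly when a SparkSession is created from Python; this uses the standard spark.driver.memory property and only takes effect if set before the driver JVM starts:

```python
import os
from pyspark.sql import SparkSession

# Apply the driver memory setting when building the session; in client mode
# this must happen before the driver JVM is launched
spark = (
    SparkSession.builder
    .appName("benchmark")
    .config("spark.driver.memory", os.environ.get("SPARK_DRIVER_MEMORY", "16G"))
    .getOrCreate()
)
print(spark.conf.get("spark.driver.memory"))
spark.stop()
```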
The benchmarks compare:
- Reading the whole PDB from an MMTF Hadoop Sequence File (baseline benchmark; a minimal sketch follows this list)
- Tabulating zinc interactions in the PDB
- Tabulating salt-bridge interactions in the PDB
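As an illustration, here is a minimal sketch of the baseline read benchmark, assuming the mmtfReader API from mmtf-pyspark (older releases passed an explicit SparkContext to the reader; the timing code and variable names are illustrative):

```python
import os
import time
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader

spark = SparkSession.builder.master("local[4]").appName("readBaseline").getOrCreate()

start = time.time()
# Read the full PDB from the MMTF Hadoop Sequence File downloaded above
pdb = mmtfReader.read_sequence_file(os.environ["MMTF_FULL"])
count = pdb.count()  # force the lazy read to execute
elapsed = time.time() - start

print(f"Read {count} structures in {elapsed:.1f} s")
spark.stop()
```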
To run the benchmarks, set the list of core counts in cell 3 of RunBenchmarks.ipynb, e.g., for a machine with 24 logical cores:
cores = [24, 20, 16, 12, 8, 4, 2, 1]
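Conceptually, each entry in this list corresponds to one benchmark run on a fresh local[n] SparkSession, along the lines of the following sketch (the run_benchmark helper is a stand-in for the actual workloads):

```python
import time
from pyspark.sql import SparkSession

def run_benchmark(spark):
    # Stand-in workload; the notebooks read the PDB and run the analyses above
    return spark.sparkContext.parallelize(range(10_000_000)).sum()

cores = [24, 20, 16, 12, 8, 4, 2, 1]
for n in cores:
    # One SparkSession per core count so each run starts from a clean state
    spark = SparkSession.builder.master(f"local[{n}]").appName(f"bench-{n}").getOrCreate()
    start = time.time()
    run_benchmark(spark)
    print(f"{n} cores: {time.time() - start:.1f} s")
    spark.stop()
```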
The benchmarks are typically run on 1, 2, and 4n cores (n = 1, ..., maxcores/4), where maxcores is the number of logical cores on the machine. Note that the number of physical cores may be lower than the number of logical cores if hyperthreading is enabled.
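For a given machine, this sequence can be generated as follows (maxcores = 24 is assumed for illustration):

```python
maxcores = 24  # number of logical cores on the machine (assumed for illustration)

# 1, 2, and multiples of 4 up to maxcores, in descending order as in cell 3
cores = sorted({1, 2} | {4 * n for n in range(1, maxcores // 4 + 1)}, reverse=True)
print(cores)  # [24, 20, 16, 12, 8, 4, 2, 1]
```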
The repository contains the following notebooks:
- PrintSettings.ipynb (prints Spark settings, e.g., SPARK_DRIVER_MEMORY)
- RunBenchmarks.ipynb (runs all benchmarks)
- PlotReadResults.ipynb (plots benchmark results for reading MMTF Hadoop Sequence Files)
- PlotCalculation.ipynb (plots benchmark results for performing calculations using MMTF Hadoop Sequence Files)
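The plotting notebooks visualize the timings collected by RunBenchmarks.ipynb. A minimal sketch of that kind of plot, assuming the timings are available as lists of core counts and run times (the numbers below are purely illustrative):

```python
import matplotlib.pyplot as plt

# Illustrative data only; the real numbers come from RunBenchmarks.ipynb
cores = [1, 2, 4, 8, 12, 16, 20, 24]
times = [960, 500, 260, 140, 100, 80, 70, 65]  # seconds, hypothetical

# Speedup relative to the single-core run
speedup = [times[0] / t for t in times]

plt.plot(cores, speedup, marker="o")
plt.xlabel("Number of logical cores")
plt.ylabel("Speedup vs. 1 core")
plt.title("Benchmark scaling")
plt.show()
```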