
Enzyme - Rapid Orchestration in the Cloud

  1. Introduction
    1. Overview
    2. Motivations for enabling HPC in the cloud
    3. Intro to the Intel HPC Platform Specification
    4. Currently supported providers
  2. Installing Enzyme
    1. Required software
    2. Clone the Enzyme repository
    3. Build Enzyme
  3. Getting Started with Enzyme
    1. User Credentials File
    2. Cloud Provider Templates
    3. Test Run
  4. Enzyme User Guide
    1. Launching workloads
    2. Persistent clusters
    3. Launching workloads with storage
    4. Destroying clusters
    5. Create image
    6. Create cluster
    7. Create storage
    8. Check status
    9. Check version
    10. Check user-defined parameters
    11. Help
    12. Set Verbosity
    13. Simulate
    14. Options and parameters
  5. Additional Examples
    1. LAMMPS
    2. OpenFOAM
  6. Cloud Provider Quick Reference
    1. Amazon Web Services
    2. Google Cloud Platform

Introduction

  • This is a relatively new project and should be considered alpha-level software.

Overview

Enzyme is a software tool that provides an accelerated path to spinning up and using high-performance compute clusters in the cloud. Enzyme offers a simple command-line mechanism for launching workloads, pointing to templates that abstract away the orchestration and operating system image generation of the HPC cluster. It creates operating system images that follow the Intel® HPC Platform Specification, providing a standard base solution that enables a wide range of popular HPC applications. Enzyme aims to accelerate the path for users who want to migrate to a public cloud by abstracting away the learning curve of a supported cloud provider. This gives users a "rapid" path to start using cloud resources, and it enables the Enzyme community to collaborate on providing optimal environments for the underlying HPC solutions.

Motivations for enabling HPC in the cloud

There are many reasons for running HPC and compute-intensive workloads in a cloud environment. The following are some of the top motivators behind Enzyme, but the list is not exhaustive.

  • Local HPC cluster resource capacity is typically fixed while demand is variable. Cloud resources provide augmentation to local resources that help meet spikes in resource needs on-demand.
  • Cloud-based HPC clusters can simplify and accelerate access for new HPC users and new businesses, resulting in faster time to results.
  • Cloud provides a means to access massive resources or specialized resources for short periods, to address temporary or intermittent business needs.
  • Cloud provides access to the newest technologies, allowing evaluation and use ahead of long-term ownership.
  • Datasets may already exist in the cloud, and utilizing cloud resources may be the best option for performance and/or cost.

Intro to the Intel® HPC Platform Specification

The Intel HPC Platform Specification captures industry best practices, optimized Intel runtime requirements, and broad application compatibility needs. These requirements form a foundation for high-performance computing solutions to provide enhanced compatibility and performance across a range of popular HPC workloads. Intel developed this specification by collaborating with many industry partners, incorporating feedback, and curating the specification since 2007.

Currently supported providers

  • Google Cloud Platform (GCP)
  • Amazon Web Services (AWS)

Support for additional providers is planned for the near future.

Installing Enzyme

Required software

You need to install:

  • git (to clone the Enzyme repository and its sub-modules)
  • Go (Enzyme is built from Go source code)
  • make (to drive the build)

Clone the Enzyme repository

Clone the Enzyme repository from GitHub.

Enzyme uses open source tools from HashiCorp, and those tools are included as sub-modules in the Enzyme git repository. To ensure all the required source is cloned, it is suggested to use the following:

git clone --recurse-submodules https://github.com/intel-go/Enzyme

If needed, the sub-modules can be downloaded after cloning by using

git submodule init
git submodule update 

Note: some firewall configurations can impact access to git repositories.

Build Enzyme

Enzyme uses make to build the binary from Go source code. Build Enzyme by running the make command, optionally specifying the target OS platform with the GOOS option. Options are currently windows or linux. If no OS is specified, the default build assumes linux.
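
For example, a default Linux build is simply:

    make

which should be equivalent to specifying the target explicitly:

    make GOOS=linux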

Note: the Makefile does not currently support building for Windows under a Windows cmd shell. To build Enzyme to run from a Windows platform, use a Windows Bash implementation and run the following make command:

make GOOS=windows

If the build completes successfully, Enzyme creates a sub-directory called package-{GOOS}-amd64 that includes the binaries and supporting directory structures for executing Enzyme. In addition, the package sub-directory is archived into package-{GOOS}-amd64-{version}-{hash}.tar.gz for easy distribution.

The binary name for Linux is Enzyme, and the binary name for Windows is Enzyme.exe. The command-line examples in this guide all use the Linux binary name; when running from a Windows system, substitute Enzyme.exe for Enzyme.

Getting Started with Enzyme

Enzyme takes several input parameters that provide user credentials for a target cloud account, templates for the cloud provider, and templates for the desired image to run on top of in the cloud. These JSON inputs are combined into a single structure that drives the HashiCorp tools, Terraform and Packer, to create machine images and spin up the cluster. The combined structure is saved in the .Enzyme/ sub-directory of the Enzyme package directory.

User Credentials File

Enzyme requires an active account for the desired cloud provider. Access to that account is provided via access keys and account information in a credentials JSON file. Cloud providers typically offer mechanisms to create this credentials file; see the appropriate Cloud Provider Quick Reference section for how to create a user credentials file for a specific provider. Please note that these provider-specific mechanisms may change.

The user credentials file needs to be copied to the user's host system, where Enzyme will execute. To use Enzyme without specifying a full path to the desired user credentials file, copy the cloud provider credentials file to ./user_credentials/credentials.json in the Enzyme binary directory. Enzyme uses this file by default to access the desired cloud provider account. Enzyme also provides a command-line option to use a different path and filename for credentials. For example, a user may have more than one account, and thus multiple user credentials files, specifying the desired one on the command line with each run.
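
For example (the downloaded key file name and the alternate path are illustrative):

    # Use the default credentials location
    cp ~/Downloads/my-cloud-key.json ./user_credentials/credentials.json

    # Or point Enzyme at a different credentials file with -c / --credentials
    Enzyme state -c /path/to/other-account/credentials.json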

Cloud Provider Templates

Enzyme uses template files to direct how to build a cluster and how to build the compute node operating system to run workloads on the desired type of instance within the desired cloud provider. These templates are JSON files that provide variables that control how Enzyme uses the Hashicorp tools. These templates may be curated and expanded to provide additional user options and customizations.

Cloud provider templates are provided under the ./templates directory and are typically named after the cloud service provider. The templates cluster_template.json and image_template.json under a given cloud provider directory control instance and image creation, respectively.

For example, a hypothetical cloud provider called MyCloud would have:

    ./templates/mycloud/cluster_template.json
    ./templates/mycloud/image_template.json

A user specifies which cloud provider templates to use with the -p or --provider command-line parameter. Enzyme currently defaults to the Google Cloud Platform templates. To use the hypothetical MyCloud templates, a user would include -p mycloud or --provider mycloud on the command line.
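
For example, to run a workload using the AWS templates instead of the default:

    Enzyme run task.sh --parameters path/to/parameters.json -p aws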

Test Run

Now let's execute a real job as a "Hello, World" test to verify that Enzyme is working. To do this, we'll use the well-known High-Performance LINPACK (HP-LINPACK) benchmark included in the ./examples folder. This example uses the default cloud provider; to test a different provider, insert the -p option with the directory name of the desired provider into the example below.

To execute a workload through Enzyme, the user specifies the job to launch (typically a launch script) and points to the Enzyme parameter file and any input data files. The Enzyme parameter file fills in and replaces default values used in the execution. This allows a user to modify some aspects of execution without needing to modify the cloud provider or image templates themselves. An important parameter is the project name associated with the user account; it must be set correctly in the project parameter file.

With this in mind, three steps are all that is required to test execution using HP-LINPACK.

  1. Copy the user credentials file to ./user_credentials/credentials.json. This is the default credentials file Enzyme uses.

  2. Modify the ./examples/linpack/linpack-cluster.json file to set the project_name value to the actual name of the cloud project. For example, if the cloud project name is My-Hpc-Cloud-Cluster, modify the key-value pair in the JSON file to be

    "project_name": "My-Hpc-Cloud-Cluster",
    
  3. Execute the command to run HP-LINPACK through Enzyme on the default cloud provider. The following command uses both the default cloud provider as well as the default user credentials file (from Step 1).

    Enzyme run examples/linpack/linpack-cluster.sh --parameters examples/linpack/linpack-cluster.json --upload-files examples/linpack/HPL.dat
    

Enzyme will begin building a compute node operating system image and installing it on the desired instance types in the cloud provider. If that succeeds, Enzyme will launch HP-LINPACK on the cluster. Enzyme reports progress along the way, so there should be periodic output displayed on the console.

If the end of the output looks like this:

    Finished        1 tests with the following results:
                    1 tests completed and passed residual checks,
                    0 tests completed and failed residual checks,
                    0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------
    End of Tests.

then HP-LINPACK successfully executed in the cloud. Congratulations!

If there is an issue, Enzyme unfortunately does not have a well-documented debug section yet. That is a work in progress! Stay tuned. Troubleshooting areas to check:

  • Verify that the Terraform and Packer executables exist under the ./tools directory. If they are missing, there was a problem building those tools during the Enzyme build.

  • If a cluster does not appear in the cloud provider dashboard while Enzyme is running, potential causes include insufficient user account permissions, an incorrect user credentials file, or an incorrect project name in the ./examples/linpack/linpack-cluster.json file.

Enzyme User Guide

Launching workloads

Enzyme run task.sh --parameters path/to/parameters.json

This command will instantiate a cloud-based cluster and run the specified task. On first use, the machine image will be automatically created. After the task is completed, the cluster will be destroyed, but the machine image will be left intact for future use.

Persistent clusters

Enzyme run task.sh --parameters path/to/parameters.json --keep-cluster

This command will instantiate the requested cluster (and storage, if requested) for the specified task and keep the cluster running after the task completes. The required images will be created on first use. Using the --use-storage option allows you to access data living on the storage node. NOTICE: make sure you don't change parameters in the configuration except storage_disk_size; otherwise, a new storage will be created after the parameters change. Currently, changing storage_disk_size has no effect and the disk keeps its previous size; to force it to resize, destroy the storage node and delete the disk in the cloud provider interface.

You can create a persistent cluster without running a task. For this, just use the create cluster command.
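
For example:

    Enzyme create cluster --parameters path/to/parameters.json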

Launching workloads with storage

Enzyme run task.sh --parameters path/to/parameters.json --use-storage

This command will instantiate the requested cluster and storage and then run the specified task. As before, the required images will be created on first use. The --use-storage option allows you to access the storage data. NOTICE: as above, make sure you don't change parameters in the configuration except storage_disk_size; otherwise, a new storage will be created after the parameters change. Changes to storage_disk_size are ignored, and the disk keeps its previous size.

You can create storage without running a task. For this, just use the create storage command.
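
For example:

    Enzyme create storage --parameters path/to/parameters.json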

Destroying clusters

Enzyme destroy destroyObjectID

You can destroy a cluster or storage by its destroyObjectID, which can be found with the Enzyme state command.
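
For example (the ID shown is illustrative; use the destroyObjectID reported by Enzyme state):

    Enzyme state
    Enzyme destroy sample-cloud-cluster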

NOTICE: The disk is kept when the storage is destroyed. Only the VM instances will be removed, and the "storage" Enzyme entity will change its status from XXXX to configured. You can delete the disk manually through the selected provider if you want to.

Create image

Enzyme create image --parameters path/to/parameters.json

This command tells Enzyme to create a VM image from a single configuration file. You can check for created images in the cloud provider interface if you want to.

Create cluster

Enzyme create cluster --parameters path/to/parameters.json

This command tells Enzyme to spawn VM instances and form a cluster. It also creates the needed image if it doesn't yet exist.

Create storage

Enzyme create storage --parameters path/to/parameters.json

This command tells Enzyme to create a VM instance backed by a disk that holds your data. You can use storage to organize your data and control access to it. The storage is located in the /storage folder on the VM instance. This command also creates the needed image if it doesn't exist yet.

Uploading data into the storage is outside the scope of Enzyme. Enzyme only provides the information allowing you to connect to the storage, via the Enzyme state command.

Check status

Enzyme state

This command enumerates all manageable entities (images, clusters, storage, etc.) and their respective status. For cluster and storage entities, additional information about SSH/SCP connection (user name, address, and security keys) is provided in order to facilitate access to these resources.

Check version

Enzyme version 

Check user-defined parameters

Use this command with one of the additional arguments: image, cluster, task.

Enzyme print-vars image

You can use the --provider flag to check parameters specific to a certain provider (default: GCP).
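
For example, to check cluster parameters for AWS:

    Enzyme print-vars cluster --provider aws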

Help

Enzyme help

This command prints a short help summary. Also, each Enzyme command has a --help switch for providing command-related help.

Set Verbosity

Use the -v or --verbose flag with any command to get extended info.
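
For example:

    Enzyme state --verbose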

Simulate

Use the -s or --simulate flag with any command to simulate the execution without actually running any commands that could modify anything in the cloud or locally. This is useful for checking what Enzyme would do without actually doing it.
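
For example, to preview a run without touching the cloud:

    Enzyme run task.sh --parameters path/to/parameters.json --simulate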

Options and parameters

Common parameters

  • -p, --provider select provider (default: gcp)
      ◦ gcp - Google Cloud Platform
      ◦ aws - Amazon Web Services

  • -c, --credentials path to credentials file (default: user_credentials/credentials.json)

  • -r, --region location of your cluster for the selected provider (default: us-central1)

  • -z, --zone zone of your cluster within the selected region (default: a)

  • --parameters path to file with user parameters

The above parameters can be defined only via the command line.

The parameters presented below can be used both in the configuration file and on the command line. When specified on the command line, they override parameters from the configuration file.

To apply them via the command line, use

  • --vars list of user's variables (example: "image_name=Enzyme,disk_size=30")
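
For example, to override two image parameters at run time:

    Enzyme create image --parameters path/to/parameters.json --vars "image_name=Enzyme,disk_size=30"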

Task

parameters

A task combines parameters from all entities it might need to create. For the individual entities, see the Image, Cluster, and Storage sections below.

options
  • --keep-cluster keep the cluster running after the script is done
  • --use-storage allow access to storage data
  • --newline-conversion enable conversion of DOS/Windows newlines to UNIX newlines for the uploaded script (useful if you're running Enzyme on Windows)
  • --overwrite overwrite the content of the remote file with the content of the local file
  • --remote-path name for the uploaded script on the remote machine (default: "./Enzyme-script")
  • --upload-files files for copying into the cluster (into ~/Enzyme-upload folder with the same names)
  • --download-files files for copying from the cluster (into ./Enzyme-download folder with the same names)
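
For example, a run combining several of these options (the file names are illustrative):

    Enzyme run task.sh --parameters path/to/parameters.json --keep-cluster --upload-files input.dat --download-files results.log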

Image

parameters
  • project_name (default: "zyme-cluster")
  • user_name user name for ssh access (default: "ec2-user")
  • image_name name of the image of the machine being created (default: "zyme-worker-node")
  • disk_size size of image boot disk, in GB (default: "20")

Cluster

parameters
  • project_name (default: "zyme-cluster")

  • user_name user name for ssh access (default: "ec2-user")

  • cluster_name name of the cluster being created (default: "sample-cloud-cluster")

  • image_name name of the image which will be used (default: "zyme-worker-node")

  • worker_count count of worker nodes (default: "2")

       **NOTICE**: *Must be greater than 1*
    
  • login_node_root_size boot disk size for login node, in GB (default: "20")

      **NOTICE**: *Must be no less than `disk_size`*
    
  • instance_type_login_node machine type of the login node (default: "f1-micro" for GCP)

  • instance_type_worker_node machine type of worker nodes (default: "f1-micro" for GCP)

  • ssh_key_pair_path (default: "private_keys")

  • key_name (default: "hello")
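
Putting these together, a minimal cluster parameters file might look like the following sketch. This assumes the parameters file is a flat JSON object of string values, as the project_name example in the Test Run section suggests; the values shown are the documented defaults except worker_count:

    {
        "project_name": "zyme-cluster",
        "user_name": "ec2-user",
        "cluster_name": "sample-cloud-cluster",
        "image_name": "zyme-worker-node",
        "worker_count": "4",
        "login_node_root_size": "20",
        "instance_type_login_node": "f1-micro",
        "instance_type_worker_node": "f1-micro"
    }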

Storage

parameters
  • project_name (default: "zyme-cluster")
  • user_name user name for ssh access (default: "ec2-user")
  • storage_name name of the storage being created (default: "zyme-storage")
  • image_name name of the image which will be used (default: "zyme-worker-node")
  • storage_disk_size size of permanent disk, in GB (default: "50")
  • storage_instance_type machine type of storage node (default: "f1-micro" for GCP)
  • ssh_key_pair_path (default: "private_keys")
  • storage_key_name (default: "hello-storage")

Additional Examples

The examples in this section all assume that Enzyme has been built correctly and that user credentials have been set up. The examples use the default cloud provider and the default user credentials file.

LAMMPS

LAMMPS is a molecular dynamics simulation application. The included workload will launch a container to execute LAMMPS on a single compute node. This requires the use of the storage capabilities of Enzyme.

  1. Create storage for the LAMMPS workload

    ./Enzyme create storage --parameters=examples/lammps/lammps-single-node.json
    
  2. Use the information from ./Enzyme state to get connection details for the storage node created in step 1. SSH into the storage node using the provided private key and IP address, and execute the following commands:

    sudo mkdir /storage/lammps/
    sudo chown lammps-user /storage/lammps/
    

    Then log out of the storage node.

  3. Upload the lammps.avx512.simg container into /storage/lammps/, e.g. by scp -i path/to/private_key.pem path/to/lammps.avx512.simg lammps-user@storage-address:/storage/lammps/

  4. Execute the LAMMPS benchmark through Enzyme

    Enzyme run examples/lammps/lammps-single-node.sh --parameters=examples/lammps/lammps-single-node.json --use-storage --download-files=lammps.log
    

If successful, the content of the Enzyme-download/lammps.log file should look like this (note: this output was produced by running on 4 cores):

args: 2
OMP_NUM_THREADS=1
NUMCORES=4
mpiexec.hydra -np 4 ./lmp_intel_cpu_intelmpi -in WORKLOAD -log none -pk intel 0 omp 1 -sf intel -v m 0.2 -screen
Running: airebo Performance: 1.208 timesteps/sec
Running: dpd Performance: 9.963 timesteps/sec
Running: eam Performance: 9.378 timesteps/sec
Running: lc Performance: 1.678 timesteps/sec
Running: lj Performance: 19.073 timesteps/sec
Running: rhodo Performance: 1.559 timesteps/sec
Running: sw Performance: 14.928 timesteps/sec
Running: tersoff Performance: 7.026 timesteps/sec
Running: water Performance: 7.432 timesteps/sec
Output file lammps-cluster-login_lammps_2019_11_17.results and all the logs for each workload lammps-cluster-login_lammps_2019_11_17 ... are located at /home/lammps-user/lammps
  5. Important: Destroy the storage using the ./Enzyme destroy command with the storage ID to avoid unintended storage fees with the cloud provider.

OpenFOAM

OpenFOAM is a computational fluid dynamics application.

  1. Run the OpenFOAM benchmark, where 7 is the endTime of the benchmark computation:
    Enzyme run -r us-east1 -z b --parameters examples/openfoam/openfoam-single-node.json --download-files DrivAer/log.simpleFoam --overwrite examples/openfoam/openfoam-single-node.sh 7
    

The full log of the OpenFOAM run should be available as Enzyme-download/log.simpleFoam.

Cloud Provider Quick Reference

This section provides quick references for setting up Enzyme with each supported cloud provider.

Amazon Web Services

Help generating the user credentials for Amazon Web Services

Google Cloud Platform

Google Cloud Platform Account Information

Help generating the user credentials for Google Cloud Platform