HPC Workflow

Paul Nilsson edited this page Apr 25, 2022 · 2 revisions

HPC workflow in Pilot 3

The Pilot 3 HPC workflow is a special mode in which Pilot runs without a remote connection to the PanDA server or other remote facilities. All external communication is handled by the Harvester application. In this mode Pilot 3 acts as a simple MPI application that executes multiple jobs on the compute nodes of the HPC.

How to launch HPC workflow

To launch the HPC workflow in Pilot 3, specify the command line parameter '-w generic_hpc' (a Pilot option).

HPC Plugins

Different HPCs may require special treatment when preparing to launch the payload. The resource-specific implementations are therefore placed in HPC-specific plugins. The plugin to use is selected with the command line parameter '--hpc-resource' followed by the resource name.
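As a rough illustration of how the two options fit together, the sketch below parses them with Python's argparse. It is not Pilot's actual argument parser: the function name build_parser, the destination names, the default values and the resource name 'titan' are assumptions made for the example. Only '-w generic_hpc' and '--hpc-resource' come from this page.

```python
import argparse


def build_parser():
    """Build a parser holding only the two options discussed above (hypothetical subset)."""
    parser = argparse.ArgumentParser(description="Sketch of the HPC-related Pilot options")
    # '-w' selects the workflow; the value 'generic_hpc' enables the HPC workflow.
    # The default value here is an assumption.
    parser.add_argument("-w", dest="workflow", default="generic")
    # '--hpc-resource' names the resource plugin to use (value is an assumption).
    parser.add_argument("--hpc-resource", dest="hpc_resource", default="")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-w", "generic_hpc", "--hpc-resource", "titan"])
    print(args.workflow, args.hpc_resource)
```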

Implementations of the plugins must be placed in the pilot/resource directory. Modules have been created for ALCF, BNL, NERSC and Titan. As of January 2019, only the module for Titan is fully implemented.
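One way to map the '--hpc-resource' value onto a module in pilot/resource is a dynamic import. The sketch below assumes that module names are the lower-case resource names (e.g. pilot.resource.titan). The helper names resource_module_name and load_resource_plugin are hypothetical and are not taken from the Pilot code.

```python
import importlib


def resource_module_name(hpc_resource, package="pilot.resource"):
    """Return the dotted module path for a resource plugin (assumed naming scheme)."""
    name = hpc_resource.strip().lower()
    # Reject anything that could not be a Python module name.
    if not name.isidentifier():
        raise ValueError(f"invalid resource name: {hpc_resource!r}")
    return f"{package}.{name}"


def load_resource_plugin(hpc_resource, package="pilot.resource"):
    """Import and return the plugin module; raises ImportError if it does not exist."""
    return importlib.import_module(resource_module_name(hpc_resource, package))
```

With this scheme, load_resource_plugin("titan") would import pilot.resource.titan when run inside a Pilot checkout, and the mandatory functions described below could then be called on the returned module.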

Mandatory functions in the HPC Plugins

Get job

get_job(communication_point)
:param communication_point: path to config.Harvester.jobs_list_file (string).
:return: job object, rank (int).

Retrieve the job description from a JSON file, fill a Job object and return it along with the current rank.
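The idea of selecting a job by rank can be sketched as follows. This is a simplified illustration, not the plugin code: it assumes the JSON file holds a list with one job description per rank, it returns a plain dictionary instead of a Job object, and it takes the rank as an argument so that it runs without MPI. In a real MPI run the rank would come from the communicator (for example MPI.COMM_WORLD.Get_rank() in mpi4py). The name get_job_for_rank is hypothetical.

```python
import json


def get_job_for_rank(communication_point, rank=0):
    """Return (job description, rank) for the given rank.

    Assumes the file at communication_point is a JSON list with
    one job description per rank; the real file layout may differ.
    """
    with open(communication_point) as handle:
        jobs = json.load(handle)
    if not 0 <= rank < len(jobs):
        raise IndexError(f"no job available for rank {rank}")
    return jobs[rank], rank
```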

Set working directory for job

set_job_workdir(job, communication_point)
:param job: job object.
:param communication_point: local path to Harvester access point (string).
:return: job working directory (string).

Set the job/pilot working directory. The function also changes the current directory to it.
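A minimal sketch of this behaviour, assuming the working directory is a subdirectory of the Harvester access point named after the job id. That layout and the function name make_job_workdir are assumptions. Unlike the documented function, the sketch takes a job id rather than a job object.

```python
import os


def make_job_workdir(job_id, communication_point):
    """Create <communication_point>/<job_id>, cd into it and return the path."""
    workdir = os.path.join(communication_point, str(job_id))
    os.makedirs(workdir, exist_ok=True)
    # Mirror the documented side effect: the process changes into the directory.
    os.chdir(workdir)
    return workdir
```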

Get setup

get_setup(job=None)
:param job: optional job object.
:return: setup commands (list).

Return a list of setup commands, which may be required by the infrastructure. The job object is also sent to this function in case it is needed on the relevant resource.

Set working directory for scratch space

set_scratch_workdir(job, work_dir, args)
:param job: job object.
:param work_dir: job working directory (permanent FS) (string).
:param args: args dictionary to collect timing metrics.
:return: job working directory in scratch (string).

Set up the working directory on the transient high-speed storage (RAM disk, local SSD, etc.). Input files and some DB files are copied to the scratch disk.
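The staging step can be pictured with the sketch below. It is an illustration under stated assumptions: the scratch directory is named after the last component of the permanent working directory, the list of input files is passed in explicitly, and the timing is stored under the hypothetical key 'stage_in_seconds'. The name stage_to_scratch is also hypothetical.

```python
import os
import shutil
import time


def stage_to_scratch(input_files, work_dir, scratch_root, timings):
    """Copy input files from work_dir to a scratch directory and record the copy time."""
    job_name = os.path.basename(os.path.normpath(work_dir))
    scratch_dir = os.path.join(scratch_root, job_name)
    os.makedirs(scratch_dir, exist_ok=True)

    start = time.time()
    for name in input_files:
        shutil.copy(os.path.join(work_dir, name), scratch_dir)
    # Store the elapsed time so the caller can report it as a timing metric.
    timings["stage_in_seconds"] = time.time() - start
    return scratch_dir
```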

Payload command fix

command_fix(command, job_scratch_dir)
:param command: payload command (string).
:param job_scratch_dir: local path to input files (string).
:return: updated/fixed payload command (string).

Adapt payload parameters for execution on the particular infrastructure. For example, on Titan the full paths to the input files are inserted into the payload execution command, and any '--DBRelease="all:current"' option is removed to avoid reading from Frontier.
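The two adjustments described for Titan can be sketched with regular expressions. This is an illustration rather than the plugin code: it assumes input files are passed as '--input<Type>File=<name>[,<name>...]' options, and the name fix_payload_command and the sample command are made up for the example.

```python
import os
import re


def fix_payload_command(command, job_scratch_dir):
    """Remove the DBRelease option and prefix input file names with the scratch path."""
    # Drop any --DBRelease="all:current" option (with or without quotes).
    command = re.sub(r'\s*--DBRelease="?all:current"?', "", command)

    def add_path(match):
        # Prefix every comma-separated file name with the scratch directory.
        names = match.group(2).split(",")
        full = ",".join(os.path.join(job_scratch_dir, name) for name in names)
        return match.group(1) + full

    # Assumed option pattern: --input<Type>File=<names>
    return re.sub(r"(--input\w*File=)(\S+)", add_path, command).strip()


if __name__ == "__main__":
    cmd = 'Sim_tf.py --inputEVNTFile=EVNT.pool.root --DBRelease="all:current" --maxEvents=10'
    print(fix_payload_command(cmd, "/scratch/job"))
```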

Job report processing

process_jobreport(payload_report_file, job_scratch_path, job_communication_point)
:param payload_report_file: name of job report (string).
:param job_scratch_path: path to scratch directory (string).
:param job_communication_point: path to updated job report accessible by Harvester (string).
:raises FileHandlingFailure: in case of IOError.

Copy the job report file from scratch to the working directory and shrink it if necessary, e.g. by removing any 'logfileReport' from the dictionary.
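A sketch of the shrinking step, assuming the job report is a JSON document and that 'logfileReport' entries may appear at any nesting level. The FileHandlingFailure class defined here is a local stand-in for Pilot's exception of the same name, and the function names are hypothetical.

```python
import json
import os


class FileHandlingFailure(Exception):
    """Local stand-in for Pilot's exception of the same name."""


def strip_key(node, key):
    """Return a copy of a JSON-like structure with every occurrence of key removed."""
    if isinstance(node, dict):
        return {k: strip_key(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_key(item, key) for item in node]
    return node


def shrink_job_report(payload_report_file, job_scratch_path, job_communication_point):
    """Read the report from scratch, drop 'logfileReport' entries, write the result."""
    source = os.path.join(job_scratch_path, payload_report_file)
    try:
        with open(source) as handle:
            report = json.load(handle)
        with open(job_communication_point, "w") as handle:
            json.dump(strip_key(report, "logfileReport"), handle)
    except OSError as exc:
        # IOError is an alias of OSError in Python 3.
        raise FileHandlingFailure(f"failed to process job report: {exc}") from exc
```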

Post processing of working directory

postprocess_workdir(workdir)
:param workdir: path to directory to be processed (string).
:raises FileHandlingFailure: in case of IOError.

Perform post processing of the working directory, if the resource needs any.