The datasets in this repository are used to evaluate task clustering algorithms for executing workflows on batch-scheduled HPC platforms, using the simulator in the task_clustering_batch_simulator repository.
The workflows were synthetically generated using this open-source workflow generator, based on four scientific application domains. Given these four workflow characterizations, we generated 4 * 3 * 3 = 36 workflow configurations by varying the application, the number of tasks in each workflow, and the total sequential duration of each workflow.
Application | Number of tasks | Duration (hours) |
---|---|---|
CyberShake, Epigenomics, Montage, SIPHT | 50, 250, 500 | 100, 500, 1000 |
Each workflow file is named after its workflow configuration using the scheme APPLICATION_TASKS_DURATION.dax, where the duration is in seconds. These files are in the
DAX format. A path to one of these workflow files is passed to the simulator as the <workflow_specification>
command-line argument.
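
For reference, here is a minimal Python sketch that enumerates the 36 configurations and the file names they would map to under this scheme. The uppercase application names and the hours-to-seconds conversion are assumptions based on the table and naming scheme above, so check them against the actual files in the repository:

```python
# Enumerate the 36 workflow configurations and their expected file names.
# Assumptions: application names appear uppercase in the file names, and
# the table durations (hours) are converted to seconds for the file names.

applications = ["CYBERSHAKE", "EPIGENOMICS", "MONTAGE", "SIPHT"]
task_counts = [50, 250, 500]
durations_hours = [100, 500, 1000]

for app in applications:
    for tasks in task_counts:
        for hours in durations_hours:
            seconds = hours * 3600
            print(f"{app}_{tasks}_{seconds}.dax")
```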
The workloads are logs of real parallel jobs from various production systems. An expansive collection of such logs is available in SWF format at the Parallel Workloads Archive. The logs in this repository were downloaded from that archive; all of them are the "cleaned log -- RECOMMENDED" versions.
Log File | System | Nodes |
---|---|---|
kth_sp2.swf | KTH SP2 | 100 |
sdsc_sp2.swf | SDSC SP2 | 128 |
ctc_sp2.swf | CTC SP2 | 338 |
hpc2n.swf | HPC2N | 120 |
A path to one of these workload files is passed to the simulator as the <trace_file>
command-line argument (the corresponding number of nodes from the table above should be passed as the <compute_nodes>
argument).
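
As an illustration of the SWF layout, the following Python sketch summarizes one of these workload files. It relies only on Standard Workload Format conventions: header and comment lines begin with `;`, and among the whitespace-separated fields of each job record, field 4 is the run time in seconds and field 5 is the number of allocated processors:

```python
# Summarize a workload trace in the Standard Workload Format (SWF).
# Header/comment lines start with ';'. Fields used below (1-indexed):
#   4 = run time (seconds), 5 = number of allocated processors.

import sys

def summarize_swf(path):
    jobs = 0
    max_procs = 0
    total_runtime = 0.0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";"):
                continue  # skip SWF header/comment lines
            fields = line.split()
            runtime = float(fields[3])   # field 4: run time (s)
            procs = int(fields[4])       # field 5: allocated processors
            if runtime < 0 or procs < 0:
                continue                 # -1 marks missing values in SWF
            jobs += 1
            max_procs = max(max_procs, procs)
            total_runtime += runtime
    print(f"{path}: {jobs} jobs, max processors = {max_procs}, "
          f"total run time = {total_runtime / 3600:.1f} hours")

if __name__ == "__main__":
    summarize_swf(sys.argv[1])  # e.g.: python summarize_swf.py kth_sp2.swf
```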