Explain what sort of workflows CWL is for. #36
base: main
Conversation
cloud, and high performance computing (HPC) environments.

CWL is for dataflow style batch analysis, where the units of processing are command line programs.
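As a concrete illustration of "the units of processing are command line programs", here is a minimal sketch of a CWL CommandLineTool wrapping the standard `wc -l` command (the identifiers `input_file` and `line_count` and the captured file name are just illustrative):

```yaml
#!/usr/bin/env cwl-runner
# Minimal sketch: wrap the standard `wc -l` command as a CWL CommandLineTool.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1        # passed as the first positional argument
outputs:
  line_count:
    type: stdout         # capture the tool's standard output as its result
stdout: line_count.txt
```

A runner such as cwltool stages `input_file`, executes the command, and collects `line_count.txt` as the step's output; a workflow is then a dataflow graph of such tool invocations.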
Do we think it would be beneficial to comment here on some known workflows / use cases that CWL does NOT handle well?
Example from chat:
"explicitly not for business process modeling"
And any other use cases that users can think of that aren't intended use cases?
I would add something clarifying the use for batch processing vs interactive processing, as sometimes we've had confusion about workflows being able to interact with external services such as databases or other APIs.
Do we think it would be beneficial to comment here on some known workflows / use cases that CWL does NOT handle well?
It could (and I see the value in that!), but it probably leaves the reader with a better feeling to not have a list of negatives when they first learn about something. I also don't want this introduction to be too long or wordy. A bit tricky to balance!
Yeah, I can see where you are coming from there. It might also dissuade someone from trying it out if they do not fully understand what is meant by an item listed as "not supported" (i.e., CWL could be a fit for their problem, but since they don't understand the terminology of the item they might just not try it out). After thinking about it, it might do more harm than good.
Apologies for banging my little drum (but @mr-c @-ed the Gitter channel): while CWL is excellent for handling high-throughput computing, it is not (yet) equipped to handle high-performance computing tasks.
What is your scheduler @rupertnash? I have been using CWL workflows for HPC for over a year now, using Toil as the runner for IBM LSF.
My scheduler? We have both PBS Pro and SLURM machines at EPCC.
But my point is that having thousands of independent tasks running across a cluster is high throughput. Unless the processes are communicating with each other, it's not HPC.
@rupertnash I suspect this is a case where different folks have different definitions of a term. It is common in the life sciences for "HPC" to colloquially imply on-prem job schedulers (e.g. LSF, SGE, SLURM, PBS, etc.).
@rupertnash I was going to suggest […] and have the […]. So I guess we'll take the HPC part out, and leave in High-Throughput Computing until […].
@rupertnash I guess I am a bit confused: if the individual tools are using processes that are communicating with each other, then is that not HPC? We have a workflow that calls a tool that uses 32 cores across 2 nodes, which the scheduler handles. This is then being requested by both the tool and the scheduler, if written correctly. So the tool requires high performance computing, while the workflow just details the steps. Are you referring to CWL itself using multiple processes talking to each other to execute a step?
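(For reference, the way a CWL tool "requests" resources is a ResourceRequirement block in its description, which the runner translates into the scheduler's job request. A rough sketch, with a hypothetical executable name and placeholder numbers; note that CWL expresses per-job cores and RAM, not how they are laid out across nodes:)

```yaml
# Sketch: a tool declares its resource needs; the runner passes them to the scheduler.
cwlVersion: v1.2
class: CommandLineTool
requirements:
  ResourceRequirement:
    coresMin: 32                  # placeholder: minimum CPU cores requested
    ramMin: 65536                 # placeholder: minimum RAM in mebibytes
baseCommand: my_parallel_tool     # hypothetical multi-threaded executable
inputs: []
outputs: []
```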
@drkennetz the distinction @rupertnash is making is that HPC (in certain communities) implies a single logical job that runs as a set of parallel processes across different nodes that need to coordinate to complete the job. For example, a simulation where each node represents a particular "cell" of the simulated space, and nodes have to be able to interact at the boundaries. That's different from high throughput computing, where you can split a job into pieces, scatter them over nodes, each piece runs independently of the others, and you gather the results at the end.
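To make the scatter/gather pattern concrete, here is a rough sketch of a CWL workflow that runs the same step independently over a list of inputs and gathers the results (the tool file `wc-tool.cwl` is a hypothetical path, e.g. the `wc -l` wrapper sketched earlier):

```yaml
# Sketch of the high-throughput pattern: scatter independent pieces, gather results.
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  samples:
    type: File[]                            # the pieces to process independently
outputs:
  counts:
    type: File[]
    outputSource: count_lines/line_count    # one result gathered per scattered piece
steps:
  count_lines:
    run: wc-tool.cwl            # hypothetical path to a CommandLineTool description
    scatter: input_file         # run the step once per element of samples
    in:
      input_file: samples
    out: [line_count]
```

Each scattered invocation is an independent job the runner can dispatch anywhere; the invocations do not communicate with each other, which is exactly the distinction being drawn above between HTC and HPC.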
For the general user, the distinction between HPC and HTC is not well known. Additionally, many users will not know what HTC is. Perhaps instead of using HTC, we could have a phrase that encompasses what CWL does? We want to be cognizant that not every CWL user is an expert in these areas.
To address #35