
Explain what sort of workflows CWL is for. #36

Open
wants to merge 1 commit into main

Conversation

@mr-c (Member) commented Oct 26, 2020

To address #35

cloud, and high performance computing (HPC) environments.

CWL is for dataflow style batch analysis, where the units of processing are command line programs.
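(As a concrete illustration of that sentence: a minimal CWL CommandLineTool wrapping an ordinary command line program could look like the sketch below. The choice of `wc -l` and the file names are only examples, not part of the proposed text.)

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
# The unit of processing is an ordinary command line program, here `wc -l`.
baseCommand: [wc, -l]
inputs:
  text_file:
    type: File
    inputBinding:
      position: 1        # passed as the first positional argument
outputs:
  line_count:
    type: stdout         # capture standard output as the result file
stdout: line_count.txt
```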


Do we think it would be beneficial to comment here on some known workflows / use cases that CWL does NOT handle well?

Example from chat:

"explicitly not for business process modeling"

And any other use cases that users can think of that aren't intended use cases?


I would add something clarifying the use for batch processing vs. interactive processing, as sometimes we've had confusion about workflows being able to interact with external services such as databases or other APIs.

@mr-c (Member, Author)


> Do we think it would be beneficial to comment here on some known workflows / use cases that CWL does NOT handle well?

It could (and I see the value in that!), but it probably leaves the reader with a better feeling to not have a list of negatives when they first learn about something. I also don't want this introduction to be too long or wordy. A bit tricky to balance!


Yeah, I can see where you are coming from there. It might also dissuade someone from trying it out if they do not fully understand what is meant by the item listed as "not supported" (i.e., CWL could be a fit for their problem, but since they don't understand the terminology they might just not try it out). After thinking about it, it might do more harm than good.

@rupertnash

Apologies for banging my little drum (but @mr-c @-ed the Gitter channel): while CWL is excellent for handling high-throughput computing, it is not (yet) equipped to handle high-performance computing tasks.

@drkennetz

> Apologies for banging my little drum (but @mr-c @-ed the Gitter channel): while CWL is excellent for handling high-throughput computing, it is not (yet) equipped to handle high-performance computing tasks.

What is your scheduler, @rupertnash? I have been using CWL workflows for HPC for over a year now, using Toil as the runner for IBM LSF.

@rupertnash

My scheduler? We have both PBS Pro and Slurm machines at EPCC.

@rupertnash

But my point is that having thousands of independent tasks running across a cluster is high throughput. Unless the processes are communicating with each other, it's not HPC.

@geoffjentry (Contributor)

@rupertnash I suspect this is a case where different folks have different definitions of a term. It is common in the life sciences for "HPC" to colloquially imply on-prem job schedulers (e.g., LSF, SGE, SLURM, PBS, etc.).

@mr-c (Member, Author) commented Oct 29, 2020

@rupertnash I was going to suggest

> The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and HTC/HPC* environments.

And have the * link to an explanation of the plans to incorporate the MPIRequirement in a future version of the CWL standards. But then I noticed that https://en.wikipedia.org/wiki/High-performance_computing redirects to https://en.wikipedia.org/wiki/Supercomputer which is not at all helpful.

So I guess we'll take the HPC part out, and leave in High-Throughput Computing until MPIRequirement has matured and been ratified.
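(For readers finding this later: cwltool carries a draft MPI extension along these lines. A hedged sketch of how a tool might request a parallel launch under that extension is below; the `cwltool:MPIRequirement` name and `processes` field follow the cwltool extension as I understand it, and are not part of the ratified standards.)

```yaml
cwlVersion: v1.2
class: CommandLineTool
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
requirements:
  cwltool:MPIRequirement:
    processes: 128              # number of MPI ranks to request (illustrative)
baseCommand: my_mpi_simulation  # hypothetical MPI-enabled executable
inputs: []
outputs: []
```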

@drkennetz

@rupertnash I guess I am a bit confused: if the individual tools are using processes that communicate with each other, then is that not HPC? We have a workflow that calls a tool that uses 32 cores across 2 nodes, which the scheduler handles. This is requested by both the tool and the scheduler, if written correctly. So the tool requires high performance computing, while the workflow just details the steps.

Are you referring to CWL itself using multiple processes talking to each other to execute a step?
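(Side note: the per-step core/RAM request described above can already be expressed with the standard ResourceRequirement; a minimal sketch with illustrative numbers follows. Note that ResourceRequirement describes the resources of a single job, so it does not by itself express a multi-node MPI layout.)

```yaml
cwlVersion: v1.2
class: CommandLineTool
requirements:
  ResourceRequirement:
    coresMin: 32               # ask the runner/scheduler for at least 32 cores
    ramMin: 65536              # and at least 64 GiB of RAM (value is in mebibytes)
baseCommand: my_threaded_tool  # hypothetical multi-threaded executable
inputs: []
outputs: []
```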

@tetron (Member) commented Oct 30, 2020

@drkennetz the distinction @rupertnash is making is that HPC (in certain communities) implies a single logical job that runs as a set of parallel processes across different nodes that need to coordinate to complete the job. For example, a simulation where each node represents a particular "cell" of the simulated space, and nodes have to be able to interact at the boundaries. That's different from high throughput computing, where you can split a job into pieces, scatter them over nodes, each piece runs independently of the others, and you gather the results at the end.
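(To make the scatter/gather pattern concrete in CWL terms, here is a small sketch; the step name and the `analyze-one-sample.cwl` tool are invented for illustration.)

```yaml
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  samples: File[]                    # many independent inputs
outputs:
  results:
    type: File[]
    outputSource: analyze/result     # gathered back into an array at the end
steps:
  analyze:
    run: analyze-one-sample.cwl      # hypothetical per-sample tool
    scatter: sample                  # one independent job per input file
    in:
      sample: samples
    out: [result]
```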

@swzCuroverse (Contributor)

For the general user, the distinction between HPC and HTC is not well known. Additionally, many users will not know what HTC is. Perhaps instead of using HTC, we could have a phrase that encompasses what CWL does? We want to be cognizant that not every CWL user is an expert in these areas.
