How should we standardize GitHub Action workflows per repo? #25
Comments
This is a great summary! Thanks @joverlee521. To avoid confusion, I'll use the following definitions, as per a recent Slack conversation.
Thinking about the ideal automated usage of a repo, I'd advocate for the following:
- Ingest (may require multiple invocations, e.g. ncov)
- phylo
I don't know if I have strong opinions on whether we bundle multiple invocations up into AWS jobs or keep each invocation as its own separate AWS job, or how to handle the dependency graph, except to say that the simpler the solution the better! My answers to your open questions would be:
¹ This could be made different from an invocation of
I think I generally concur with @jameshadfield's comments above. One minor exception: I'd use the term "build" as we define it in our glossary, where it encompasses everything that goes into and results in an Auspice dataset JSON, not just the JSON itself.
There's definitely a dependency graph. But ISTM that expressing it in Snakemake only makes sense if we're running everything in a single giant job (i.e. the answer to questions 1 and 3 is "yes"). If we're not running everything in a single giant job (and I think we shouldn't), then it makes more sense to me to manage this dependency graph in the overarching "orchestration" layer, which for us is GitHub Actions workflows. This would be difficult without the GitHub job staying attached to the AWS Batch job, but we plan to do that anyway!
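A minimal sketch of what managing the dependency graph in the orchestration layer could look like, assuming one GitHub Actions job per workflow. The workflow name and the `./scripts/run-ingest` and `./scripts/run-phylogenetic` scripts are hypothetical placeholders for whatever actually launches each workflow (and stays attached to it until it finishes); only `needs` is the point here.

```yaml
# Hypothetical workflow file; all names and scripts are illustrative.
name: Ingest to phylogenetic

on:
  workflow_dispatch:

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder: launch ingest and wait for it to finish.
      - run: ./scripts/run-ingest   # hypothetical script

  phylogenetic:
    # `needs` is the dependency edge: this job starts only after
    # the ingest job has completed successfully.
    needs: ingest
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-phylogenetic   # hypothetical script
```

Jobs that declare no `needs` run in parallel by default, so independent top-level work does not have to be serialized.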
Another thought, prompted by following the builddir issue: Many of our pathogens¹ could be run in their entirety (ingest + phylo) within a single GitHub Actions job (no AWS involved). This would be even simpler if we used a single Snakemake workflow² for the entire repo (e.g.), and could use a single invocation of For collaborators, this means any job which fits in a single action job (under 6h, under some memory ceiling) could easily be automated and monitored with only a few minor modifications to
¹ Probably all or most of the following: mumps, measles, zika, ebola, rsv, hepB, wnv, dengue. Memory requirements may force some of them to be run on AWS.
² This would require phylo to not use separate configs to define separate auspice datasets. I.e. how rsv works, not how mpox does it. But I think this is cleaner anyway.
³ Using targets to separate out which parts of the workflow to run in each job.
⁴ And if it doesn't run in a single action job, we have docs on how they can set up their own AWS environment.
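As a rough illustration of the single-job idea, here is a sketch under the assumption that the whole repo can be run from one entry point. `./scripts/run-all` is a hypothetical stand-in for that single invocation; it is not a command from this thread.

```yaml
# Hypothetical workflow file; names and the script are illustrative.
name: Ingest and phylogenetic (single job)

on:
  workflow_dispatch:

jobs:
  run-all:
    runs-on: ubuntu-latest
    # Make the 6-hour ceiling mentioned above explicit.
    timeout-minutes: 360
    steps:
      - uses: actions/checkout@v4
      # Placeholder for the single invocation covering ingest + phylo.
      - run: ./scripts/run-all   # hypothetical script
```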
Thanks -- I've updated my post to use "auspice dataset" to avoid any confusion.
Yeah, agreed that not all pathogens need to run on AWS Batch. ncov-ingest ran on GitHub Actions for the first year before I lifted-and-shifted it to AWS Batch due to disk space. The memory requirements aren't the only thing to consider: GitHub Actions CPUs are limited in number (4) and in my experience can be slower than elsewhere. Likely not a deal breaker, but something to consider. Disk space can also be an issue for larger pathogens. Even without Batch involved, though, GitHub Actions is still our base "orchestration" layer, and it'd be a speedup to hoist top-level, fully independent jobs out of the Snakemake workflow and into parallel GitHub Actions jobs, regardless of whether they then launch on AWS Batch.
Some related thoughts I jotted down before lunch. We'll be using our noon (Seattle) slot tomorrow to discuss this topic in general.
Zika's GH Action workflows are my latest iteration on the standard GH Action workflows. This was built on @tsibley's ideas and the discussion we had back in February. There are a couple things I'd still like to work through:
Instead of having a separate This means the manual and automatic workflows are the same, and we'd avoid issues of drift and additional overhead and multiple sources of workflow run history.
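One way to read this suggestion, sketched under the assumption that "the same workflow" means a single workflow file with both a scheduled and a manual trigger. The cron time, workflow name, and `./scripts/run-workflow` script are hypothetical.

```yaml
# Hypothetical workflow file; schedule, names, and script are illustrative.
name: Ingest

on:
  # Automatic runs.
  schedule:
    - cron: "0 16 * * *"   # hypothetical daily time, in UTC
  # Manual runs of the very same workflow from the Actions tab.
  workflow_dispatch:

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-workflow ingest   # hypothetical script
```

With both triggers on one workflow, scheduled and manual runs share a single run history.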
Context
In a pathogen repo, there will technically be 3 separate workflows that can be run: ingest, phylogenetic, and nextclade.
It's unclear how we want to set up the standardized GitHub Action workflows for automating them.
This was originally brought up in our discussion of the pathogen-repo-template.
Existing workflows
The existing GitHub Action workflows for automated pathogen repos vary mostly for historical reasons:
Open questions
How should phylo builds that require multiple nextstrain build invocations be maintained? Do we run these as a matrix within the action (i.e. independent AWS jobs), or do we extend the reusable workflow to accommodate this?
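For the matrix option, a minimal sketch of what it could look like in a GitHub Actions workflow. The build names `build-a` and `build-b` and the `./scripts/run-phylogenetic` script are hypothetical; each matrix entry becomes its own independent job.

```yaml
# Hypothetical workflow file; build names and script are illustrative.
name: Phylogenetic

on:
  workflow_dispatch:

jobs:
  phylogenetic:
    runs-on: ubuntu-latest
    strategy:
      # Let the remaining builds finish even if one of them fails.
      fail-fast: false
      matrix:
        build: [build-a, build-b]   # hypothetical build names
    steps:
      - uses: actions/checkout@v4
      # One invocation per matrix entry, each running as a separate job.
      - run: ./scripts/run-phylogenetic "${{ matrix.build }}"   # hypothetical script
```

The alternative raised in the question, extending the reusable workflow, would move this fan-out into the shared workflow instead of each pathogen repo.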