Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How should we standardize config schema for files? #26

Open
joverlee521 opened this issue Jan 17, 2024 · 3 comments
Open

How should we standardize config schema for files? #26

joverlee521 opened this issue Jan 17, 2024 · 3 comments

Comments

@joverlee521
Copy link
Contributor

Context

This issue was originally brought up in nextstrain/measles#9 (review) and I wanted to document options and consensus on how we should standardize Nextstrain config schemas for files. This is specifically for discussing config files required for workflows such as reference.gb, exclude.txt, include.txt, etc.

Historically, these files were provided at the top of the Snakefile. The files are grouped together instead of within rule specific params because a single file may be used by multiple rules.

To make the files easily configurable, the ncov workflow moved the files to the config YAML under a top level files key. Other pathogen repos have taken similar paths of providing config files through the config YAML, but have varying schemas. mpox uses top level file name keys (i.e. drops the files key) and it also includes rule specific file name keys:

auspice_config: "config/mpxv/auspice_config.json"
include: "config/mpxv/include.txt"
reference: "config/reference.fasta"
...
filter:
    exclude: "config/exclude_accessions.txt"

rsv uses the top level files key and top level file name keys

exclude: "config/outliers.txt"
description: "config/description.md"
...
files:
    color_schemes: "config/colors.tsv"
    auspice_config: "config/auspice_config.json"

Open questions

  1. Is providing config files through the config YAML still the way forward?
  2. Do we want to standardize config file schemas across pathogens? Should this be decided per pathogen repo?
  3. If we do standardize config schemas, should it use the top level files key, top level file name keys, or rule specific file name keys? Are there other suggested schemas?
@joverlee521
Copy link
Contributor Author

(1) I'm slightly wary of Snakemake's different interpretations of relative paths, but this should be okay as long as config files are always used as inputs and we use hardcoded workdir directives.
(2) I don't think we can have a completely standardized schema across repos, but we can at least standardize the schema for the purpose of providing config files.
(3) Copying my previous comment: Grouping files together makes sense if expected to be used multiple times. I also see the other side of it being confusing to need to edit configs for the filter rule under both files and filter. It's also not be immediately obvious looking at the config what rules use which files (unless you have detailed docs like ncov).

@joverlee521
Copy link
Contributor Author

Copying over @jameshadfield 's relevant comment:

I have spent plenty of time following the strings to work out which rules use which files (mainly in ncov), and often finding a bunch of files no longer used by any rule. Docs are great, but from experience they inevitably fall out of sync. How about something like the following (or a variant thereof)?

files:
  exclude: &files_exclude 'config/exclude_accessions.txt'
filter:
  exclude: *files_exclude

@tsibley
Copy link
Member

tsibley commented Jan 20, 2024

This is specifically for discussing config files required for workflows such as reference.gb, exclude.txt, include.txt, etc.

One quick point which I think can help with clarity: I'd call these input files or similar since config files already means the YAML/JSON files that are loaded into the workflow config variable.

  1. Is providing config files through the config YAML still the way forward?

… how else would they be provided? If we revert back to a rule files with a params block (e.g.), how would alternative files be specified in a workflow invocation?

I'm slightly wary of Snakemake's different interpretations of relative paths […]

That's very fair! If really needed, I think we could have utilities to standardize paths inside the workflow. But I think/hope it wouldn't be necessary with due care around inputs and workdir or similar.

  1. Do we want to standardize config file schemas across pathogens? Should this be decided per pathogen repo?

I don't think we can have a completely standardized schema across repos, but we can at least standardize the schema for the purpose of providing config files.

Agreed.

  1. If we do standardize config schemas, should it use the top level files key, top level file name keys, or rule specific file name keys? Are there other suggested schemas?

To me, files implies that it will contain all the input files used by the workflow, and that sounds both not currently true and impractical. I'd think to use top-level keys or rule-specific keys, with a preference for top-level keys. The preference for top-level keys is that what rule they're used in is ultimately an implementation detail. For example, the "exclude" file is to me conceptually connected most strongly to the input metadata and sequence files; it's a set of things to skip from them. If you provide vastly different metadata/sequences, then you'll likely need to provide different exclusions too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants