Module proposal: hello-channels #367

Open
adamrtalbot opened this issue Sep 9, 2024 · 7 comments

@adamrtalbot
Collaborator

hello-channels

An additional module that would fit between hello-gatk and hello-modules.

Aims:

  • Teach users about the concepts of channels and functional programming with Nextflow
  • Teach users about data structure within channels
  • Teach users practical examples of operators to manipulate channels

Proposal:

Subject to change, this part might need further discussion.

From the hello-gatk pipeline, add the following features stepwise:

  1. Use a samplesheet to read in the BAM files (splitCsv)
  2. Add a sample ID to each BAM file (tuples)
  3. Pass the tuple between all processes with a manipulation (map)
  4. Group per family ID (groupTuple)
  5. Create samplesheet output

Key targets:

  • view for debugging
  • map for manipulating channel contents
  • 1 to 3 more advanced operators, such as collectFile, groupTuple and join, to demonstrate how channels can be manipulated with built-in methods.

To do:

  • Write final endpoint pipeline to be aiming for
  • Write intermediate steps as tutorial
  • Add any changes to hello-modules and hello-nf-test that need to be included

Related issues

#361
#359

@kenibrewer
Member

I think this is a great training module plan. hello-gatk has a lot of content and this feels like a logical grouping to split out.

@adamrtalbot
Collaborator Author

hello-channels

1 Debugging

Objective: Know how to view the contents of a channel

1.1. Use .view() to debug a channel

// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitText()
                    .view()
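
As a rough sketch of how this step could be extended, .view() also accepts a closure, which makes it easier to tell apart several inspection points in one chain. This assumes params.reads_bam points to the plain-text list of BAM paths used above; the labels are made up for illustration.

```nextflow
// Assumes params.reads_bam is a text file with one BAM path per line
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitText()
                    // splitText keeps the trailing newline on each line
                    .view { line -> "Before trim: '${line}'" }
                    .map { line -> line.trim() }
                    .view { path -> "After trim: '${path}'" }
```

Labelling each .view() call this way shows where in the chain a value changes shape.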

2 Add sample ID to samples

Objective: Understand how sample information can be associated with a sample

2.1. Read sample ID from CSV file

This would be broken down into multiple steps, using .view() to inspect the channel contents after each one.

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)

2.2. Use .map() to modify items in a channel

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> [row.id, file(row.bam)] }
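
A minimal sketch of what this step does to each item, using hand-written rows in place of a real samplesheet. The rows below are an assumption: they stand in for what splitCsv(header: true) would emit for a CSV with id and bam columns.

```nextflow
// Hypothetical rows, shaped like the maps that splitCsv(header: true) emits
rows_ch = Channel.of(
    [id: 'mother', bam: '/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam'],
    [id: 'father', bam: '/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam']
)

rows_ch
    // Each item is a map keyed by the column names
    .view { row -> "Row: ${row}" }
    // Turn each map into a two-element list: sample ID and file object
    .map { row -> [row.id, file(row.bam)] }
    .view { item -> "Tuple: ${item}" }
```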

2.3. Carry sample ID through the pipeline

input:
    // Include path(bai) only in processes that also take the BAM index
    tuple val(id), path(bam), path(bai)

etc.
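
To make "carry the sample ID through" concrete, here is a sketch of one process that receives the ID and passes it on unchanged. The process name and samtools command are modelled on the indexing step in hello-gatk, but the exact definition (and the omitted container directive) is an assumption.

```nextflow
// Sketch only: the container directive is omitted for brevity
process SAMTOOLS_INDEX {

    input:
    tuple val(id), path(bam)

    output:
    // Re-emit the ID so the next process still knows which sample this is
    tuple val(id), path(bam), path("${bam}.bai")

    script:
    """
    samtools index '${bam}'
    """
}
```

Because the ID travels with the files, a downstream process can declare tuple val(id), path(bam), path(bai) as its input and consume this output directly.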

3 Maps (key-val pairs) and family ID

Objective: Understand how sample information can be used to make Nextflow extremely scalable

Support > 1 family per run by adding a family (cohort) ID to the sample sheet

3.1 Use a meta map as the first value

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> 
                        [
                            [
                                id: row.id,
                                family: row.family,
                            ],
                            file(row.bam)
                        ] 
                    }

Note: nf-schema can do this for you.

3.2 Replace sample ID with meta map:

input:
    // Include path(bai) only in processes that also take the BAM index
    tuple val(meta), path(bam), path(bai)
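
A sketch of what a process might look like once the meta map replaces the plain ID. The process body is modelled on the hello-gatk indexing step and is an assumption; the tag directive is a standard Nextflow directive used here to show one practical benefit of having meta available.

```nextflow
// Sketch only: the container directive is omitted for brevity
process SAMTOOLS_INDEX {

    // Label each task in the run log with the sample ID from the meta map
    tag "${meta.id}"

    input:
    tuple val(meta), path(bam)

    output:
    // Pass the whole meta map along, including the family ID
    tuple val(meta), path(bam), path("${bam}.bai")

    script:
    """
    samtools index '${bam}'
    """
}
```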

3.3. Aggregate per-family prior to performing jointgenotyping

Output of haplotyper:

output:
    tuple val(meta), path("${bam}.g.vcf"), path("${bam}.g.vcf.idx")

Collect families using groupTuple:

GATK_HAPLOTYPECALLER.out
    .map { meta, vcf, idx ->
        // Put the family ID first so groupTuple can use it as the grouping key
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()
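
A self-contained sketch of the grouping step, with hand-written items in place of the real haplotyper output. The file names and meta values are hypothetical; only the shape of the items matters.

```nextflow
// Hypothetical items shaped like the haplotyper output: [meta, vcf, idx]
calls_ch = Channel.of(
    [[id: 'mother', family: 'family1'], 'mother.g.vcf', 'mother.g.vcf.idx'],
    [[id: 'father', family: 'family1'], 'father.g.vcf', 'father.g.vcf.idx'],
    [[id: 'mother', family: 'family2'], 'mother2.g.vcf', 'mother2.g.vcf.idx']
)

calls_ch
    // Prepend the family ID so it becomes the grouping key
    .map { meta, vcf, idx -> [meta.family, meta, vcf, idx] }
    // By default groupTuple groups on the first element of each item
    .groupTuple()
    .view()
```

Each emitted item should have the shape [family, [meta, ...], [vcf, ...], [idx, ...]], one item per family, which is what a per-family joint genotyping process would take as input.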

Add a fake family 2 to the input CSV:

family,id,bam
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam

Thoughts? Does this cover sufficient objectives? Should we extend it further and include more operators? If so, which? Or is it already too much, requiring too much re-wiring of the whole pipeline?

@kenibrewer
Member

This is perfect. I really like the progression that you've designed here. I was trying to explain this concept to someone fresh out of hello-gatk using our existing training materials (in Advanced) and I quickly ran into the issue of needing to explain things that hadn't been covered yet.

@adamrtalbot
Collaborator Author

After discussion with @maxulysse, we think we can make it better.

  • Move the jointgenotyping part out of hello-gatk and make it the first introduction to hello-channels
  • Use view to inspect the contents of the channel before and after collect()
  • Add sample ID to samplesheet as above

Then we could have another module afterwards which includes more advanced concepts like map and groupTuple.

@adamrtalbot
Collaborator Author

hello-channels

1 Collect

Objective: Understand how to collect a channel into 1 item.

1.1. Add jointgenotyping process

As in hello-gatk. Run it and see that every VCF is processed separately.

1.2. Use .view() to inspect the contents of a channel

// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitText()
                    .view()

1.3. Collect results of haplotyper process

all_vcfs = GATK_HAPLOTYPECALLER.out[0].collect()
all_tbis = GATK_HAPLOTYPECALLER.out[1].collect()

1.4. View to see the contents of the collection

all_vcfs.view()
all_tbis.view()

1.5. Run with jointgenotyping again

See that the jointgenotyping process now runs only once.
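
The before/after effect of collect() in this section can be reproduced without running GATK at all. This is a minimal sketch with hypothetical file names standing in for the haplotyper outputs.

```nextflow
// Hypothetical file names standing in for the haplotyper outputs
vcfs_ch = Channel.of('reads_mother.bam.g.vcf', 'reads_father.bam.g.vcf', 'reads_son.bam.g.vcf')

// Before collect(): three separate items, so a downstream process would run three times
vcfs_ch.view { vcf -> "Before collect: ${vcf}" }

// After collect(): a single item holding a list of all three, so it would run once
vcfs_ch.collect().view { vcfs -> "After collect: ${vcfs}" }
```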

2 Add sample ID to samples

Objective: Understand how sample information can be associated with a sample

2.1. Read sample ID from CSV file

This would be broken down into multiple steps, using .view() to inspect the channel contents after each one.

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)

2.2. Use .map() to modify items in a channel

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> [row.id, file(row.bam)] }

2.3. Carry sample ID through the pipeline

input:
    // Include path(bai) only in processes that also take the BAM index
    tuple val(id), path(bam), path(bai)

etc.

hello-operator

or hello-meta?

1 maps (key-val pairs) and family ID

Objective: Understand how sample information can be used to make Nextflow extremely scalable

Support > 1 family per run by adding a family (cohort) ID to the sample sheet

1.1 Use a meta map as the first value

// Create input channel from samplesheet in CSV format (via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)
                    .splitCsv(header: true)
                    .map{ row -> 
                        [
                            [
                                id: row.id,
                                family: row.family,
                            ],
                            file(row.bam)
                        ] 
                    }

Note: nf-schema can do this for you.

1.2 Replace sample ID with meta map:

input:
    // Include path(bai) only in processes that also take the BAM index
    tuple val(meta), path(bam), path(bai)

1.3. Aggregate per-family prior to performing jointgenotyping

Output of haplotyper:

output:
    tuple val(meta), path("${bam}.g.vcf"), path("${bam}.g.vcf.idx")

Collect families using groupTuple:

GATK_HAPLOTYPECALLER.out
    .map { meta, vcf, idx ->
        // Put the family ID first so groupTuple can use it as the grouping key
        [meta.family, meta, vcf, idx]
    }
    .groupTuple()

Add a fake family 2 to the input CSV:

family,id,bam
family1,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family1,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family1,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam
family2,mother,/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
family2,father,/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
family2,son,/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam

2 join?

Objective: Understand how to join channel contents together by common element.

We could join just prior to groupTuple above? Unclear where it would fit in best here, but that's outside of the scope of this issue.
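
For reference while deciding where it fits, a minimal sketch of what join does. The two channels and their contents are hypothetical; they just illustrate matching items from separate channels on a shared sample ID.

```nextflow
// Hypothetical channels keyed by sample ID; note the different emission order
bams_ch = Channel.of(
    ['mother', 'reads_mother.bam'],
    ['father', 'reads_father.bam']
)
bais_ch = Channel.of(
    ['father', 'reads_father.bam.bai'],
    ['mother', 'reads_mother.bam.bai']
)

// join matches items on their first element by default,
// emitting [id, bam, bai] for each ID present in both channels
bams_ch.join(bais_ch).view()
```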

@vdauwera
Collaborator

vdauwera commented Oct 4, 2024

  • Move the jointgenotyping part out of hello-gatk and make it the first introduction to hello-channels

I love the overall plan for Hello-Channels but I think I'd like to keep the joint-genotyping as part of the Hello-GATK module, because it makes for a very satisfying example as it stands now.

However I could be convinced to change my mind because as I type I realize this could be an opportunity to simplify GATK further (the GVCF stuff is a bit of a curve ball). We could change Hello-GATK to emit regular VCFs and have that module show a purely linear example (and also keep the groovy magic mostly out of the 'first bioinfx example' for simplicity). And that way people are already a bit further down their Nextflow journey when they hit the more interesting plumbing options.

Ok I've gone and convinced myself this is the way to go.

Question: should this new Hello-Channels module come before or after the Config/Modules/nf-test ones? (note that I want to move hello-config to before hello-modules)

vdauwera added a commit that referenced this issue Oct 22, 2024
…les (#391)

Reorder training modules and add stubs to expand the series, improve instructions and add explanations throughout, update GATK flowcharts, improve flow and improve formatting.

Notable changes:

* In Hello-World, start by looking at the code before running it

* Convert the splitText to splitCsv in Hello-World

* Add stub for new Hello-Containers module (Ken)

* Rename Hello-GATK to Hello-Science

* Simplify Hello-Science by moving GVCF and joint genotyping out to new Hello-Channels module based on Adam's proposal in #367

* Improve flow of joint genotyping in Hello-Channels

---------

Co-authored-by: Maxime U Garcia <maxime.garcia@seqera.io>
Co-authored-by: Ken Brewer <kenibrewer@users.noreply.github.com>

@vdauwera
Collaborator

I implemented part of this in #408 with the following caveats:

  • Added use of .view() to inspect the contents of a channel earlier, in hello-genomics (formerly hello-gatk)

  • The use of collect() to bring the GVCFs together, plus the closure and join() that generate the concatenated string for the GenomicsDBImport command, ends up taking a lot of explaining (it could probably use more explicit, worked-out use of .view(), but that will have to wait)

  • The introduction of samplesheet and metamap is very compelling but I think it needs to be its own "Hello Meta" module, which I don't have the time to put together now. But it's the logical next expansion todo.
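
One possible way to unpack the "closure and join()" point above is to show the string-building step on its own, outside any process. This is a sketch under assumptions: the variable names and file names are hypothetical, and it presumes the string is built with Groovy list methods inside the process script.

```nextflow
// Hypothetical list standing in for the collected GVCF paths
def all_gvcfs = ['mother.g.vcf', 'father.g.vcf', 'son.g.vcf']

// collect with a closure is the Groovy list method here (not the channel operator):
// it transforms each element, then join glues the results into one string
def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ')

println gvcfs_line
// prints: -V mother.g.vcf -V father.g.vcf -V son.g.vcf
```

Part of what takes explaining is likely the name clash: collect() on a channel gathers items into one list, whereas collect { } and join(' ') on a Groovy list transform and concatenate its elements.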

@vdauwera vdauwera self-assigned this Oct 23, 2024