Decision: Parallel backend and pipeline tool #233

seabbs · 2024-05-14T10:52:03Z

seabbs
May 14, 2024
Maintainer

We need to decide which pipelining tool and computational backends we are going to use.

At the moment we have a partial Dagger.jl implementation, but it's unclear how we can scale this across different types of compute (e.g. local, connections via SSH, Slurm and Azure Batch). It is also unclear what kind of pipeline tooling Dagger.jl offers (such as the task graph, progress monitoring, etc.). See this issue, which looks into some of this: JuliaParallel/Dagger.jl#512

Another option is the JobSchedulers.jl and Pipelines.jl ecosystem, which would be closer to a traditional command-line-based workflow. However, it is currently unclear whether these packages support non-local compute. See this issue on the topic: cihga39871/JobSchedulers.jl#15

A final alternative is to take a mixed approach of the two or to take a simpler approach that uses more of the standard base Julia tooling.

Making a decision is not time-critical, as we can use local compute for our current pipeline. However, as an example of Julia best practices, we want to have a clear steer on scalable approaches for the future.

SamuelBrand1 · 2024-05-16T18:41:27Z

SamuelBrand1
May 16, 2024
Maintainer

After some time looking at this and a face-to-face discussion, I now favour using either Nextflow or Airflow as the eventual, scalable backend.

Upsides

Well documented.
Example usages are easily available and seem numerous.
Connection to Azure Batch resources seems maintained.

Downsides

We lose the self-constructing computational DAG by moving away from Dagger.jl (I think; open to pushback or ideas on how to combine the two).

Thoughts, @zsusswein?

seabbs · 2024-05-17T08:56:47Z

seabbs
May 17, 2024
Maintainer Author

Yeah, I can see the arguments for this.

I would note, though, that one of the main issues is not a lack of cloud support in Dagger.jl (via Distributed.jl), as it actually has quite a lot. The problem is that Azure cloud support in Julia is poorly integrated/implemented and so doesn't work with the rest of the ecosystem.

In terms of the DAG, I see the manual DAG construction you do in Airflow etc. as very similar. I think the real hit we take is that we move from an all-Julia workflow to one with a glue language plus Julia, which seems like a shame (but, for the points you raise, seems like it makes sense).

I think for now we should press on but plan to refactor as a demonstration project that we can pitch if/when there is interest, including from others (e.g. @zsusswein).

zsusswein · 2024-05-17T14:30:19Z

zsusswein
May 17, 2024
Maintainer

A few quick thoughts:

Pipeline tooling is usually opinionated -- I think that's both a pro and a con. I'd try to (a) avoid choosing a tool that doesn't match your opinions and (b) figure out how it plays with both cloud and local runs before you go with it. I'm not super clear on when/where this becomes a cloud problem, so not sure how big a deal that is.
Moving from all-Julia to multi-language is a real hit. I buy the argument that it's worth it. But also, I'd push for moving to containerization earlier as a result.
What's the right level of abstraction for parallelization? Are you running the whole pipeline in one container? Or are you spawning a container from a shared base image per-task in your DAG? (I vote for the latter). This is moot pre-container, but the kind of thing I'd design for early on.

seabbs · 2024-05-17T14:38:47Z

seabbs
May 17, 2024
Maintainer Author

Moving from all-Julia to multi-language is a real hit.

I strongly agree, and if we could use anything other than Azure Batch we would be able to continue all in Julia, but alas. That being said, I can see an argument for a standard approach to pipelining, and that might as well be very fully featured (i.e. some of the above options).

containerization earlier as result

Noting that containers in Julia look quite trivial: https://discourse.julialang.org/t/recommended-recipe-for-deploying-a-julia-app-in-docker-with-efficient-precompilation/95591/2. I would also say that in most good pipeline tools exactly where a specific task is actually being run is abstracted, so I would lightly push back against this, or perhaps rephrase it as: it's key to think about how the tools you are using would require you to do this.

Or are you spawning a container from a shared base image per-task in your DAG? (I vote for the latter).

Currently the latter. The nice thing about Distributed.jl is that this is all abstracted, so we don't need to make an internal choice about where it runs and on what (like in the future ecosystem).

SamuelBrand1 · 2024-05-21T11:23:48Z

SamuelBrand1
May 21, 2024
Maintainer

Shall we move this to discussion?

seabbs · 2024-05-21T22:34:24Z

seabbs
May 21, 2024
Maintainer Author

yes

SamuelBrand1 · 2024-08-29T10:02:29Z

SamuelBrand1
Aug 29, 2024
Maintainer

Here is a nice example of Nextflow being used to organise an analysis pipeline with a mix of Julia and Stan code:

https://github.com/Julia-Tempering/autoMALA-mev

SamuelBrand1 · 2024-08-30T12:57:18Z

SamuelBrand1
Aug 30, 2024
Maintainer

Given the nice Nextflow example linked above, these docs look useful: https://www.nextflow.io/docs/latest/azure.html
