Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase flexibility in use of Dockerfile within Wave #4

Open
marcodelapierre opened this issue Nov 15, 2022 · 21 comments
Open

Increase flexibility in use of Dockerfile within Wave #4

marcodelapierre opened this issue Nov 15, 2022 · 21 comments

Comments

@marcodelapierre
Copy link
Member

marcodelapierre commented Nov 15, 2022

Hi @pditommaso , as per our thread a few days ago.

Right now, the main ways to instruct Nextflow/Wave to build containers are:

  1. via conda package manager + wave combo, with lots of flexibility (with/without NF modules, per process conda env, one conda env for whole workflow...)
  2. using a provided Dockerfile + wave, which only work with NF modules

Regarding point 2., I believe there is room for simple solutions that expand flexibility and hence user cases. I am thinking of users/workflows with the following features:

  • software is not available via package manager (because it is not there, or because it is in-house software/scripts); PLUS
  • EITHER: users that do not want to use NF modules in their pipelines (maybe because they only develop a limited number of workflow with common building blocks, and do not need modularity);
  • OR: workflow is prone to having a single container for the whole of it (e.g. python/R-heavy pipelines)

So, I was wondering about the following features:

  1. ability for NF/Wave to look for Dockerfile in main pipeline directory; to be used for all processes, eventually except those that have their own specific Dockerfile
  2. implementation that enables providing paths for process-/tag- specific Dockerfile, in the same way as process.container or process.conda .. something like process.dockerfile or wave.dockerfile
@pditommaso
Copy link
Member

Hi Marco, understand your requirement. Between these choices, I'd go with an explicit the declaration in the config file adding:

wave {
  dockerfile = '/some/Dockerfile'
}

Then this file is used unless another is found at the module level.

Other points to refine:

  1. how to handle multi-arch builds (see Multiple architectures #3)? The Dockerfile path should use as specified it should try to look for the .<arch> variant?
  2. should the wave.strategy be aware of it?

@marcodelapierre
Copy link
Member Author

As I was mentioning to you in our chat, Paolo, I think of multi-arch builds in many respects:

  • amd64 vs ARM vs ..
  • CPU microarchs (e.g. Intel Haswell, Broadwell, Cascadelake, AMD Zen2, Zen3..)
  • GPU vendors and generations

With this in mind, the most flexible way is probably to have support for multiple Dockerfiles with .<arch> ; you could have a documented set of allowed values, that must map to the Dockerfile extension.
And eventually you might have a wave.strategy to tune the multi-Dockerfile feature:

  • multiple Dockerfiles with arch extension
  • one Dockefile with arch build variable
  • one Dockerfile

@pditommaso
Copy link
Member

pditommaso commented Nov 22, 2022

What it could be done is that wave.containerPlatform is specified (see 10d56ca1), it tries to look for first the Dockerfile.<platform>, then fallback to Dockerfile.

When wave.dockerfile (or maybe wave.containerFile) is specified the same logic applies the that file path.

However, to keep the fucos on this issue, my main struggle how it should be handled the behavior of Dockerfile at module when wave.dockerfile is specified. I see 3 choices

  1. always use module Dockerfile when exist, and fallback to wave.dockerfile when the module one is missing
  2. only use wave.dockerfile and ignore module level Dockerfile
  3. default to 1 and add a new option to be provided into wave.strategy to force 2

@marcodelapierre
Copy link
Member Author

Before commenting on your last comment, let me share something that may be useful for arch determination:

https://github.com/archspec/archspec
https://archspec.readthedocs.io/en/latest/

I believe this is the tool that Spack uses (or, if not, they are closely related).
I was wondering...shall Wave try to infer the platform with such a tool?

@marcodelapierre
Copy link
Member Author

It sounds to me like option 3. can be useful when a pipeline developer wants to overwrite the modules default Dockerfiles. However, is this a reasonable use case?

I would probably start with option 1. to keep things simple, and implement 3. if the need arises.

@marcodelapierre
Copy link
Member Author

On this thread, so do you think that there should be no option to have process-specific wave.dockerfiles? it is a use case I particularly like, but maybe it is only me LOL !

@pditommaso
Copy link
Member

so do you think that there should be no option to have process-specific

Now that you are mentioning, what could be possible to do is to use the existing container directive and allow the use of a Dockerfile/Containerfile file path instead of a container name. Tho not 100% about this

@pditommaso
Copy link
Member

I would probably start with option 1. to keep things simple, and implement 3. if the need arises.

I tend to agree, where I get confused is what name to use in the wave.strategy setting to enable the project level Dockefile, because dockerfile is already used to enable the use of module level dockerfiles ..

About archspec, interesting but let's keep this on #3 otherwise there's too much stuff in this thread 😄

@marcodelapierre
Copy link
Member Author

marcodelapierre commented Dec 7, 2022

it could be something like wave.strategy = pipeline_dockerfile, or single_dockerfile, or similar?
however, we are saying that right now we may want to start with option 1., right?

@marcodelapierre
Copy link
Member Author

marcodelapierre commented Dec 7, 2022

Now that you are mentioning, what could be possible to do is to use the existing container directive and allow the use of a Dockerfile/Containerfile file path instead of a container name. Tho not 100% about this.

What I am doubtful about in this scenario is some possible perceived inconsistency in user experience, which can make it confusing.
Take conda, you can choose to provide a list of packages, a yaml path, or an env path, but these 3 options are entirely self-controlled by the conda option itself.

Now, you suggest to use container with either a container spec, or a dockerfile path. But the tricky thing is, how the 2 options are handled also depend on whether wave is on or off. In addition, the wave strategy may allow to choose between dockerfile and container, which makes me think these two options cannot share the same keyword.

So back to wave.dockerfile. however, how about ... (morning walk thinking)

having a process directive, similar to conda and container to handle what effectively define a Dockerfile which would only be used by Wave, a Dockerfile for Wave, so... a Wave Dockerfile, or ... a wavefile !

So you would have process.wavefile, which can be defined by process name/tag, or for all processes.
I mean, in the end it could be just called dockerfile, but isn't wavefile super cool?!?!

Ultimately, I think that what would solve this is to define the dockerfile path using a process rather than a wave directive.

@pditommaso
Copy link
Member

I'd resist the idea of adding a specific directive for Wave, both to avoid the proliferation of directives and also because a directive should capture a general concept. Instead this looks related to a specific plugin.

I still think it could be used the existing process.container with a particular semantic for a file path to be interpreted as the container file to be built when used with Wave.

In the alternative, it could be added a wave.containerFile that will take precedence over the other possible strategies.

@marcodelapierre
Copy link
Member Author

marcodelapierre commented May 29, 2023

Thanks for following up on this, Paolo.

The shortcoming I see in a Wave specific option, i.e. wave.containerFile, is that then it is not possible to have process-specific values. I really believe that, similar to, e.g., conda and spack directives, it is critical that a Dockerfile directive has such process-specific trait.

I also appreciate your concern about non-generality of the feature. This is why I called it process.waveDockerfile, indeed an explicit reminder of its Wave specificity.

If my minor follow-up still did not convince you, I am generally on board with your suggestion of a specific syntax for process.container. I only see a minor source of inconsistency here, as the Dockerfile-path special syntax would only work when Wave is enabled, and should still be implemented to shoot an error message when invoked without the plugin. As a side note, to me this actually represents a point of advantage to my process.waveDockerfile syntax, that explicitly declares its plugin-related nature.

On non-proliferation, I feel I have no word, you have the last word :-)

What do you think?

@pditommaso
Copy link
Member

I believe the use of process.container is the one that better fits the current model.

Let's try to estimate the impact of this change before implementing it.

@marcodelapierre
Copy link
Member Author

Sure, let me start with a couple of aspects I can think of.

1- Possible syntax for the updated process.container syntax.

Container syntax:

container = 'ubuntu'
container = ubuntu:22.04'
container = 'quay.io/marco/ubuntu'
container = 'quay.io/marco/ubuntu:22.04'

This is to say that colon and slashes cannot be used to discriminate between container URI and Dockerfile path, because they may or may not be there.
Hence we could explicitly check for the file name (file name, not full path, to be more prescriptive) to start with Dockerfile or Containerfile (case-insensitive, to avoid silly issues). So for instance these would be acceptable:

container = 'dockerFile'
container = 'containerfile'
container = 'Containerfile.paolo'
container = 'Dockerfile_marco'
container = '/path/to/Dockerfile.marco'
container = '/path/to/Dockerfile'

2- syntax for Wave strategy
Currently the accepted values for wave.strategy are container,dockerfile,conda,spack. This is where I get slightly confused, because we have container and dockerfile, but then the underlying directive would be process.container in both cases.
However, if you deem the above issue to be minor, we could rely on the filename check outlined in 1-, so that the default strategy would still try and understand automatically whether it is a dockerfile or container.

Thoughts?

@pditommaso
Copy link
Member

pditommaso commented May 30, 2023

Regarding the syntax we can assume the file path should start with / or file:/ (same for conda and spack).

regarding the second point, I'd argue that's still consistent because the dockerfile is specified via the container directive.

@marcodelapierre
Copy link
Member Author

Sounds good.
Only one thing, note that for conda/spack the relative path syntax is also accepted, and now that I think, if I remember correctly there is a "fileExists" or "pathExists" check (whatever syntax) in the code, to ascertain whether the value should be taken as file path.
I agree the dockerfile vs container can be implemented to mimick conda/spack yml-file vs recipe specification.

@pditommaso
Copy link
Member

Relative paths can create ambiguity with image names. We have these choices

  1. only allow absolute paths
  2. used a ./ prefix for relative paths (we are doing the same for nextflow module imports)
  3. use a pseudo prefix e.g. containerfile:some/relative/path/to/dockerfile

@marcodelapierre
Copy link
Member Author

personally, I prefer option 2., flexible and simple

@marcodelapierre
Copy link
Member Author

so .. is this a green light to implement? :-)

@pditommaso
Copy link
Member

yeah, I agree 2 is the best option. Ok, let's see what happens .. :D

@marcodelapierre
Copy link
Member Author

[update] yesterday @pditommaso closed #3916

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants