Reconfigurable pipelines #5921
Replies: 15 comments
-
Sounds great, I guess we should think about all the use cases and what the new functionality will bring compared to the existing solutions.
-
@prihoda I agree with both your statements, but I need to clarify a couple of things. First, I didn't get this part: "Parameters could also be incorporated into input and output paths." Does it mean that you would like to change inputs and outputs through parameters? To my mind, inputs, outputs and parameters are different concepts, and DVC should provide an ability to change them separately, like
What do you think about this? Second, "Possible improvement 2" looks better to me, since our custom wildcards will have more limitations (compared to user scripts\loops) and might be tricky to implement, as you said. Does the code from above PS: I think I need to clarify the "#1119 repetitive commands" part. It is not about loops (loops are fine and probably even better than custom wildcards), it is about running a reconfigurable stage (let's imagine
-
I'd add one more use case to the requirement list that might be extremely useful.
In this way, users will be able to create a "library" of reusable stages\pipelines and reuse them from different projects (through copy, Git-submodules or UPDATE: This comment was extracted as a separate feature request #1472
-
@dmpetrov For the inputs and outputs, I meant that they could also contain variables from the config file. But now I see that my suggestions were trying to solve a different, less challenging problem - reproducing an existing stage with changed parameters, inputs or outputs (where all of those could have multiple values - unzipping multiple files, evaluating multiple parameters). So the point of reconfigurable stages is different than I thought - basically you want a named library of "stage templates" or "reusable stages" that define a specific command, right? And wrapping those in "reusable pipelines" that would define a whole pipeline. This would definitely be useful, but you would have to make sure you are not reinventing the wheel. I see that there are two levels to pipeline (workflow) management:
There are loads of workflow managers that operate on the reusable level, see https://github.com/pditommaso/awesome-pipeline. You could imagine writing a workflow in these tools which would actually run each command using I definitely see the benefit in reusable stages, just keep in mind you're entering a whole new world of existing solutions 😄
-
@prihoda It looks like the problems of reconfigurable stages\pipelines and a library of stages\pipelines are related, and the solutions can complement each other. If we have a way to define a reconfigurable stage, why don't we provide an ability to extract this stage from a project and reuse it from a different project?
-
Sure, a library could be useful. My point is only that reconfigurable/reusable pipelines are a world of their own, with many existing solutions. If I understand it correctly, reconfigurable stages would basically just define a command and its inputs, outputs and other parameters. But isn't that already what any script can do? What would a reconfigurable stage bring as opposed to writing a bash script? So the main contribution would be the reconfigurable pipelines, but again, they would have to provide some benefits over just writing a bash script that calls each step. The main problem is that you don't know the exact intermediate files that will be created when executing the pipeline, since they are based on parameters, e.g. an unknown number of input files or model hyperparameter values.
-
For example, let's say you want to create a pipeline that chops a list of files into chunks of 100 lines, sorts the lines in each produced file and then merges the files into one final file. The stages are:
I see two options to define a reconfigurable pipeline:
Are you thinking about solution 1 or 2? Or do you have something else in mind?
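To make the chop/sort/merge example concrete, here is a minimal plain-Python sketch of the three stages. It is not DVC code; the function names (chop, sort_file, merge, run_pipeline), the file naming scheme and the merged.txt output name are all made up for illustration. It mainly shows the point raised above: the number of intermediate chunk files is only known once the inputs and the chunk size are known.

```python
import heapq
from pathlib import Path


def chop(src: Path, out_dir: Path, chunk_size: int = 100) -> list:
    """Split src into files of at most chunk_size lines each."""
    lines = src.read_text().splitlines()
    chunks = []
    for i in range(0, len(lines), chunk_size):
        chunk = out_dir / f"{src.stem}.chunk{i // chunk_size:04d}"
        # Always terminate lines with "\n" so the later merge stays line-based.
        chunk.write_text("".join(line + "\n" for line in lines[i:i + chunk_size]))
        chunks.append(chunk)
    return chunks


def sort_file(path: Path) -> Path:
    """Sort the lines of one chunk and write them to a sibling file."""
    out = path.with_suffix(path.suffix + ".sorted")
    out.write_text("".join(sorted(path.read_text().splitlines(keepends=True))))
    return out


def merge(sorted_paths: list, dest: Path) -> Path:
    """Merge already-sorted files into one sorted file."""
    handles = [p.open() for p in sorted_paths]
    try:
        with dest.open("w") as out:
            out.writelines(heapq.merge(*handles))
    finally:
        for handle in handles:
            handle.close()
    return dest


def run_pipeline(inputs: list, work_dir: Path, chunk_size: int = 100) -> Path:
    """Chop every input, sort every chunk, merge everything."""
    chunks = []
    for idx, src in enumerate(inputs):
        # One sub-directory per input, so equal file names do not collide.
        sub_dir = work_dir / f"input{idx}"
        sub_dir.mkdir(parents=True, exist_ok=True)
        chunks.extend(chop(src, sub_dir, chunk_size))
    return merge([sort_file(c) for c in chunks], work_dir / "merged.txt")
```

Calling run_pipeline with a different list of inputs or a different chunk_size produces a different set of intermediate files, which is exactly what a static stage definition cannot list in advance.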
-
Sorry for interrupting, guys, but it seems like my emails are getting lost somewhere. @prihoda I've sent you a few messages and didn't hear back at all, could you please get back to me? Thanks.
-
@efiop Sorry, I only check my email from my laptop and I was away from it for a few days. Sent a reply.
-
@prihoda you are right - you can just rerun a stage with different inputs\outputs\params. However, to reconfigure a pipeline you have to redefine the whole pipeline each time you reuse it. See the Discord discussion with vern from 11/27/18: "it's annoying to write them all (stages) out by hand and then do it again for each color (parameter)." Intermediate results should be cached and reused (step1 can be the same for two different "pipeline calls\instances") if we implement build-cache #1234. Your example with variable output size is a separate question. The reconfiguration might support variable output size or might not (I see no reason not to support it). So, 1.a looks like a more reasonable solution. I don't want to make the stages "know about each other". PS: I don't see any problem with "reusable pipelines are a world on their own". We have a pretty clear demand for reusable pipelines, and it was one of the DVC features that I initially planned to implement, but the data\cache part took much more time than I expected. If you have any concerns with this direction - I'd love to hear more.
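A rough sketch of the "two pipeline instances share step1" idea, in plain Python. This is not how DVC or the build cache in #1234 works; PIPELINE, instantiate, run_cached and the script names are hypothetical. As a simplification, the cache key covers only the command text and the upstream key, whereas a real build cache would also need to hash the contents of the dependencies.

```python
import hashlib
from string import Template

# Hypothetical pipeline template: (stage name, command with $placeholders).
PIPELINE = [
    ("step1", "python prepare.py --input $data"),
    ("step2", "python paint.py --color $color"),
]


def instantiate(pipeline, **params):
    """Fill the placeholders of every stage with one set of parameters."""
    return [(name, Template(cmd).substitute(params)) for name, cmd in pipeline]


def run_cached(stages, cache, execute):
    """Run stages in order, skipping any whose key is already in the cache.

    A stage's key covers its own command and the key of the stage before
    it, so a change upstream invalidates everything downstream.
    Returns the names of the stages that were actually executed.
    """
    prev_key, executed = "", []
    for name, cmd in stages:
        key = hashlib.sha256(f"{prev_key}\n{cmd}".encode()).hexdigest()
        if key not in cache:
            cache[key] = execute(cmd)
            executed.append(name)
        prev_key = key
    return executed
```

With a shared cache dict, instantiating PIPELINE once with color="red" and once with color="blue" (same data) runs both stages the first time and only step2 the second time, because the step1 command and therefore its key are identical.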
-
@dmpetrov yeah, the "world on its own" would mostly be a problem if you were going for option 2. So if you are going with option 1.a, what are the new "reconfigurable" features that you have in mind? Providing a storage of pipelines, plus an ability to execute them with custom parameters (command parameters and input and output paths)? Or is it more about the build cache #1234?
-
@pared I've just separated two issues: this one and #1472. And thank you for your comments - they made me clarify the issue and even rename it. This issue is just about defining configs/input/params and how to instantiate a pipeline with a new set of params. The instantiation part might and probably should include #1234. A store of pipelines is related to the new issue. I don't think any special store is needed. A module can simply be reused from Git repos or just copied.
-
I believe that this issue may also address our use case (but please correct me if I'm wrong or if you have some nice idea for something else that already addresses it better). Anyways, in our case, we have some large codebase that has various functions in it which perform different steps in our pipeline. We also have multiple customers that we create models for. We also create multiple models for each customer.
I think no matter what, we'd have to do quite a bit of custom work to make everything run smoothly, but I think maybe with reconfigurable DVC pipelines, it'd be a little easier.
-
Hi! Resurrecting 🧟
From recurrent feedback from users on support channels, I also came up with this idea recently (see #4254). I think it's still needed (or desirable) even now that we also have parameters.
-
This is my use case. At the moment we have multiple pipelines in the same workspace. I copied the dvc.yaml into multiple directories and used a variable to change directories, parameters, etc. This isn't so bad, but there is still the inconvenience of having to make the same change in multiple places if I want to update the base pipeline definition.
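One way to avoid editing several copies by hand is to keep a single base definition and generate the per-directory variants from it. The sketch below is only an illustration in plain Python and does not use any DVC API: BASE, VARIANTS, fill and expand are hypothetical names, the keys cmd, deps and outs only loosely mirror a stage definition, and writing the results back to files is left out.

```python
from string import Template

# Hypothetical base definition, maintained in exactly one place.
BASE = {
    "train": {
        "cmd": "python train.py --data $data_dir --lr $lr",
        "deps": ["$data_dir", "train.py"],
        "outs": ["$out_dir/model.pkl"],
    }
}

# Hypothetical variants, one per directory that would hold a copy.
VARIANTS = {
    "variant_a": {"data_dir": "data/a", "out_dir": "models/a", "lr": "0.01"},
    "variant_b": {"data_dir": "data/b", "out_dir": "models/b", "lr": "0.1"},
}


def fill(value, params):
    """Recursively substitute $placeholders in strings, lists and dicts."""
    if isinstance(value, str):
        return Template(value).substitute(params)
    if isinstance(value, list):
        return [fill(item, params) for item in value]
    if isinstance(value, dict):
        return {key: fill(item, params) for key, item in value.items()}
    return value


def expand(base, variants):
    """Build one concrete definition per variant; base is left untouched."""
    return {name: fill(base, params) for name, params in variants.items()}
```

A change to BASE then shows up in every generated variant after calling expand(BASE, VARIANTS) again, instead of having to be repeated in each copy.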
-
Many issues require reconfiguration of stages and even pipelines:
dvc run handle files with same name but different path #973: @Hong-Xiang was asking about reusing (reconfigurable) pipelines.
dvc run commands (like unpacking of many zip files)? #1119: repetitive commands.
I see a similarity with parametrizable commands where only a single output is in use and without creating a separate directory for each experiment (./output.p instead of gs1/output.p).
A concept of reconfigurable-stage should be introduced in DVC.
Open questions:
gs1/)?
./output.p instead of gs1/output.p from the above)?
UPDATE: #1214 might be also related to this issue.
UPDATE2: Add a quote from vern and open question 7.