retrain-pipelines simplifies the creation and management of machine learning retraining pipelines. The package is designed to remove the complexity of building end-to-end ML retraining pipelines, allowing users to focus on their data and model-architecture. With pre-built, highly adaptable pipeline examples that work out of the box, users can easily integrate their own data and begin retraining models with minimal-to-no setup.
- Model version blessing: Automatically compare the performance of retrained models against previous best versions to ensure only superior models are deployed.
- Infrastructure validation: Each retraining pipeline includes inference pipeline packaging, local Docker container deployment, and request/response validation to ensure that models are production-ready.
- Comprehensive documentation: Every retraining pipeline is fully documented with sections covering Exploratory Data Analysis (EDA), hyperparameter tuning, retraining steps, model performance metrics, and key commands for retrieving training artifacts. Additionally, DAG information for the retraining process is readily available for pipeline transparency and debugging.
In essence, retrain-pipelines offers a seamless solution: "Come with your data and it works" with the added benefit of flexibility for more advanced users to adjust and extend pipelines as needed.
retrain-pipelines offers a high degree of flexibility, allowing users to tailor the pre-shipped pipelines to their specific needs:
- Custom Preprocessing Functions: Users can provide their own Python functions for custom data preprocessing. For example, some built-in pipelines for tabular data allow optional bucketization of numerical features by name, but you can easily modify or extend these preprocessing steps to suit your dataset and feature requirements.
- Custom Pipeline Card Generation: You can specify custom Python functions to generate pipeline cards, such as including specific performance charts or metrics relevant to your use case.
- Custom HTML Templates: For further personalization, retrain-pipelines supports customizable HTML templates, enabling you to adjust formatting, insert specific charts, change page colors, or even add your company's logo to documentation pages.
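To make the bucketization idea from the preprocessing bullet concrete, here is a minimal, standard-library-only sketch of bucketizing numerical features by name. It is not the preprocessing module shipped with retrain-pipelines: the function name `bucketize_features`, the record layout, and the bucket edges are all assumptions made for illustration.

```python
from bisect import bisect_right


def bucketize_features(rows, boundaries_by_feature):
    """Replace selected numerical features with integer bucket indices.

    rows: list of dicts, one per record (hypothetical layout).
    boundaries_by_feature: feature name -> sorted list of bucket edges.
    Features not named in the mapping are left untouched.
    """
    bucketized = []
    for row in rows:
        new_row = dict(row)  # copy, so the caller's data is not mutated
        for name, edges in boundaries_by_feature.items():
            # bisect_right counts the edges that are <= value,
            # which is the index of the bucket the value falls into
            new_row[name] = bisect_right(edges, row[name])
        bucketized.append(new_row)
    return bucketized


rows = [
    {"age": 17, "income": 12_000.0, "city": "Lyon"},
    {"age": 42, "income": 58_500.0, "city": "Oslo"},
]
result = bucketize_features(
    rows, {"age": [18, 30, 50], "income": [20_000, 50_000]}
)
print(result)
```

Only the features named in the mapping are bucketized, which mirrors the "by name" behaviour described above; everything else passes through unchanged.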
retrain-pipelines doesn't just streamline the retraining process, it empowers teams to innovate faster, iterate smarter, and deploy more robust models with confidence. Whether you're looking for an out-of-the-box solution or a highly customizable pipeline, retrain-pipelines is your ultimate companion for continuous model improvement.
You can trigger a retrain-pipelines launch from many different places.
Demo video: local_launcher.webm
The retrain-pipelines package comes with off-the-shelf machine learning retraining pipelines. Find them at /sample_pipelines. For instance:
framework | modality | task | model lib | Serving | |
---|---|---|---|---|---|
Metaflow | Tabular | regression | Dask / LightGBM | ML Server | starter-kit |
Metaflow | Tabular | classification | PyTorch / TabNet | TorchServe | starter-kit |
You can simply give one of those your data and it just runs. The only manual change you need to make concerns the endpoint request and serving signatures, which are purposely hard-coded.
Indeed, the infra_validator step is there to ensure that your inference pipeline (the one you are building continuous-retraining automation for) keeps adhering to the schema expected by consumers of the inference endpoint. So, if you change the format of the required raw input data, you need to create a new retraining pipeline and assign it a new unique name. This ensures that any interface disruption between the inference endpoint and its consumer(s) is intentional.
One of the things that make retrain-pipelines stand out is its focus on strong MLOps fundamentals.
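As an illustration of what a hard-coded request signature buys you, the sketch below checks a raw input record against a fixed field-to-type mapping and reports every deviation. The names `REQUEST_SIGNATURE` and `signature_violations`, as well as the fields themselves, are hypothetical and not taken from the retrain-pipelines code base.

```python
# Hypothetical hard-coded request signature: field name -> expected type.
REQUEST_SIGNATURE = {"age": int, "income": float, "city": str}


def signature_violations(record, signature=REQUEST_SIGNATURE):
    """Return a list of human-readable problems; empty means the record conforms."""
    problems = []
    for name, expected_type in signature.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    # Fields the consumers never agreed on are also an interface change.
    for name in record:
        if name not in signature:
            problems.append(f"unexpected field: {name}")
    return problems


valid = {"age": 42, "income": 58_500.0, "city": "Oslo"}
broken = {"age": "42", "income": 1.0}
print(signature_violations(valid))
print(signature_violations(broken))
```

Because the signature lives in code rather than being inferred from the data, a change in input format cannot slip through silently: it has to be made on purpose.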
model blessing 🔽
retrain-pipelines makes sure each newly retrained model version is evaluated against the previous model versions from that retraining pipeline, so that no lesser-performing model ever gets into production. The default sample pipelines each come with built-in evaluation criteria, but you can customize those to your own requirements. You can, for instance, choose to include an evaluation of model performance on a particular sub-population, to serve as a gate against potential incoming biases.
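A blessing rule of this kind can be sketched in a few lines. The version below is an assumption about how such a gate might look, not the criteria built into the sample pipelines: a candidate is blessed when no tracked metric regresses against the previous best, and a sub-population metric is simply one more entry in the metrics dictionary. The function name `is_blessed` and the metric names are hypothetical.

```python
def is_blessed(candidate, previous_best, lower_is_better=frozenset()):
    """Return True when the candidate regresses on none of the tracked metrics.

    candidate / previous_best: metric name -> value.
    lower_is_better: names of metrics where smaller values are better (e.g. RMSE).
    A first run, with no previous best, is blessed by default (assumption).
    """
    if previous_best is None:
        return True
    for name, previous_value in previous_best.items():
        new_value = candidate[name]
        if name in lower_is_better:
            regressed = new_value > previous_value
        else:
            regressed = new_value < previous_value
        if regressed:
            return False
    return True


previous = {"accuracy": 0.90, "accuracy_subpop": 0.85}
better_overall_only = {"accuracy": 0.92, "accuracy_subpop": 0.80}
better_everywhere = {"accuracy": 0.92, "accuracy_subpop": 0.86}

print(is_blessed(better_overall_only, previous))  # sub-population regressed
print(is_blessed(better_everywhere, previous))
```

Tracking the sub-population score as its own metric means a model that improves on average but degrades for that group is rejected.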
infrastructure validation 🔽
retrain-pipelines makes sure the inference endpoint is tested prior to deployment. We pack the preprocessing engine together with the newly retrained (and blessed) model version and the ML server of choice, and deploy it locally. We then send an inference request to that temporary endpoint and check for a 200 HTTP OK response with a valid payload format.
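The request/response check can be illustrated without Docker. In the sketch below, a tiny standard-library HTTP server stands in for the locally deployed container, and `validate_endpoint` sends one request and checks the status code and payload shape. The `/predict` path, the `instances`/`predictions` JSON layout, and all names are assumptions for illustration; the actual sample pipelines use ML Server or TorchServe with their own signatures.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StandInModelHandler(BaseHTTPRequestHandler):
    """Stand-in for the containerized inference endpoint (hypothetical)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        instances = json.loads(self.rfile.read(length))["instances"]
        # Dummy model: one constant prediction per received instance.
        body = json.dumps({"predictions": [0.0 for _ in instances]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def validate_endpoint(url, sample_instances):
    """Send one inference request; require HTTP 200 and one prediction per instance."""
    payload = json.dumps({"instances": sample_instances}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    # Empty ProxyHandler: ignore any system proxy for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        if response.status != 200:
            return False
        answer = json.loads(response.read())
    if not isinstance(answer, dict):
        return False
    predictions = answer.get("predictions")
    return isinstance(predictions, list) and len(predictions) == len(sample_instances)


server = HTTPServer(("127.0.0.1", 0), StandInModelHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    url = f"http://127.0.0.1:{server.server_port}/predict"
    endpoint_ok = validate_endpoint(url, [{"age": 42, "income": 58500.0}])
finally:
    server.shutdown()
    server.server_close()
print("endpoint valid:", endpoint_ok)
```

Checking the payload shape in addition to the status code catches the case where the server answers but the response no longer matches what consumers expect.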
pipeline cards 🔽
retrain-pipelines is strongly opinionated about giving quick and easy access to the information ML engineers care about when it comes to retraining and serving. That's why it offers a central place and a minimal number of clicks to navigate efficiently.
Pipeline card sections (thumbnails): overview | EDA | overall retraining | hyperparameter tuning | key artifacts | pipeline DAG
Third-party integrations 🔽
TensorBoard, PyTorch Profiler, Weights and Biases. retrain-pipelines aims to make the information ML engineers care about centrally available.
illustration with WandB in the LightGBM_hp_cv_WandB sample pipeline 🔽
In the LightGBM_hp_cv_WandB sample pipeline, for instance, you can find information on how to view details of the logging performed during the different training_job steps of a given run. Follow the guidance from the video below: wandb_integration.webm
customizability 🔽
As alluded to above, ML engineers are given a lot of room to customize retrain-pipelines workflows. For starters, the sample pipelines themselves are freely modifiable. But it goes far beyond that: the defaults for preprocessing and for pipeline_card are fully amendable as well.
illustration with the LightGBM_hp_cv_WandB sample pipeline 🔽
Start by getting the defaults you'd like to customize (any combination of the 3 below):
- preprocessing.py module
- pipeline_card.py module
- template.html html template
cd sample_pipelines/LightGBM_hp_cv_WandB/
from retraining_pipeline import LightGbmHpCvWandbFlow
LightGbmHpCvWandbFlow.copy_default_preprocess_module(".", exists_ok=True)
LightGbmHpCvWandbFlow.copy_default_pipeline_card_module(".", exists_ok=True)
LightGbmHpCvWandbFlow.copy_default_pipeline_card_html_template(".", exists_ok=True)
Once you have updated any of them, you can launch a retrain-pipelines run so that it uses them:
%retrain_pipelines_local retraining_pipeline.py run \
--pipeline_card_artifacts_path "." \
--preprocess_artifacts_path "."
Inspectors are convenience methods that abstract away some of the logic to get access to artifacts logged during retrain-pipelines runs.
For instance:
- browse_local_pipeline_card 🔽
With this convenience method, programmatically open a pipeline_card without the need to browse and click through an ML-framework UI:
from retrain_pipelines.inspectors import browse_local_pipeline_card
browse_local_pipeline_card(mf_flow_name)
This opens the pipeline_card in a web-browser tab, so you don't have to look for it. It's ideal for quick ideation during the drafting phase: developers can now run/resume & browse in a chain of instructions.
- get_execution_source_code 🔽
With this convenience method, programmatically access the versioned source code that was used for a particular retrain-pipelines run. This comes together with the WandB integration:
from retrain_pipelines.inspectors import get_execution_source_code
for source_code_artifact in get_execution_source_code(mf_run_id=<your_flow_id>):
    print(f" - {source_code_artifact.name} {source_code_artifact.url}")
You can even have those artifacts downloaded on the go:
from retrain_pipelines.inspectors import explore_source_code
# download and open file explorer
explore_source_code(mf_run_id=<your_flow_id>)
- plot_run_all_cv_tasks 🔽
Specific to retrain-pipelines runs that involve data-parallelism, this inspector method plots each individual hyperparameter-tuning cross-validation training job, showing details for every data-parallel worker.
For example, for executions of the LightGbmHpCvWandbFlow sample pipeline (which employs Dask for data-parallel training), this gives:
from retrain_pipelines.inspectors.hp_cv_inspector import plot_run_all_cv_tasks
plot_run_all_cv_tasks(mf_run_id=<your_flow_id>)
with results looking like below for a run with 6 different sets of hp values, 2 cross-validation folds and 4 Dask data-parallel workers:
- and more.
pytest -s tests
python -m build --verbose pkg_src
pip install -e pkg_src
pip install git+https://github.com/aurelienmorgan/retrain-pipelines.git@master#subdirectory=pkg_src
find us @ https://pypi.org/project/retrain-pipelines/
Please consider dropping us a star ! ⭐