JupyterLab is a popular open source development environment among data scientists and machine learning engineers who use Jupyter notebooks extensively. JupyterLab can be customized and enhanced using extensions.
Elyra is a set of AI-centric extensions to JupyterLab that provide productivity features for common day-to-day activities. One of its key features is support for pipelines, which enables batch execution of notebooks or Python/R scripts:
In the example above, each notebook/script in the pipeline implements a typical data science task, such as data loading, data cleansing, data analysis, or model training. Pipelines provide a way to automate the execution of these tasks.
The Visual Pipeline Editor in Elyra allows for the assembly of pipelines without the need for any coding and supports execution of those pipelines locally in JupyterLab, or remotely on Kubeflow Pipelines or Apache Airflow.
In this tutorial you will learn how to create a basic pipeline and run it in a local JupyterLab environment. When you run a pipeline in your local environment, each notebook or script is executed in a kernel on the machine where JupyterLab is running, such as your laptop. Since resources on that machine might be limited, local pipeline execution might not always be a viable option.
If you are interested in learning more about how to run pipelines in remote environments such as Kubeflow Pipelines or Apache Airflow, refer to the Hello World Kubeflow Pipelines tutorial and the Hello World Apache Airflow tutorial. (Note that those tutorials require a Kubeflow Pipelines/Apache Airflow deployment, which is not provided.)
You can complete this tutorial using a ready-to-use Elyra container image in your local environment (e.g. running Docker Desktop) or using a public sandbox environment that's hosted in the cloud.
To complete this tutorial in your local environment an installation of Docker Desktop is required.
To run the tutorial:
- Open a terminal window.
- Create a directory named `lab-material` in `${HOME}`. When you run the container image, this directory is used to store files that you create in the JupyterLab development environment. (You can choose a different name and location but need to update the `docker run` command below accordingly.)
- Create a directory named `jupyter-data-dir` in `${HOME}`. When you run the container image, this directory is used to store metadata files that Elyra creates in the JupyterLab development environment. (You can choose a different name and location but need to update the `docker run` command below accordingly.)
- Run the Elyra container image:

  ```
  $ docker run -it -p 8888:8888 \
      -v ${HOME}/lab-material/:/home/jovyan/work \
      -w /home/jovyan/work \
      -v ${HOME}/jupyter-data-dir:/home/jovyan/.local/share/jupyter \
      elyra/elyra:2.2.4 jupyter lab
  ```

- Open the displayed URL (e.g. `http://127.0.0.1:8888/lab?token=...`) in your web browser.
You are ready to start the tutorial. Note that you can shut down and restart the container because the information required to complete the lab is persisted in the two directories you've created earlier.
To complete this tutorial in a public sandbox environment in the cloud, open this link in your web browser. In the cloud sandbox environment the container image is built for you on the fly and should be ready after a couple of minutes. Do note that the sandbox environment does not persist any changes you make. If you disconnect (or are disconnected due to a bad connection), all changes are discarded and you need to start over.
You are ready to start the tutorial.
This tutorial is based on a set of Jupyter notebooks and Python scripts that are published in the https://github.com/CODAIT/ddc-data-and-ai-2021-automate-using-open-source GitHub repository. The Elyra installation includes a git extension, which provides access to common source control tasks within your JupyterLab environment.
To clone the tutorial artifacts:
- In the JupyterLab GUI open the Git clone wizard (Git > Clone a Repository).
- Enter `https://github.com/CODAIT/ddc-data-and-ai-2021-automate-using-open-source.git` as Clone URI.
- In the File Browser navigate to `ddc-data-and-ai-2021-automate-using-open-source/jupyterlab-and-elyra/pipelines/hello_world`.

The cloned repository includes a set of notebooks that download an open weather data set from the Data Asset Exchange, cleanse the data, analyze the data, and perform time-series predictions.
- Open the Launcher (File > New Launcher) if it is not already open.
- Open the Pipeline Editor (Elyra > Pipeline Editor) to create a new untitled pipeline.
- In the File Browser pane, right click on the untitled pipeline, and select ✎ Rename.
- Change the pipeline name to `hello_world`.
Next, you'll add a notebook or script to the pipeline that downloads an open data set archive from public cloud storage.
- From the File Browser pane drag the `load_data.ipynb` notebook onto the canvas. If you like, you can add the `load_data.py` Python script instead. The script provides the same functionality as the notebook. The instructions below assume that you've added the notebook to the pipeline, but the steps you need to complete are identical.
- Right click on the `load_data` node to customize its properties.

  Some properties are only required when you plan to run the pipeline in a remote environment, such as Kubeflow Pipelines. However, it is considered good practice to always specify those properties to allow for easy migration from development (where you might run a pipeline locally) to test and production (where you would want to take advantage of resources that are not available to you in a local environment). Details are in the instructions below.

- By default the file name is used as the node label. You should customize the label text if it is too long (and therefore displayed truncated on the canvas) or not descriptive enough.
- As Runtime Image choose `Pandas`. The runtime image identifies the container image that is used to execute the notebook or Python script when the pipeline is run on Kubeflow Pipelines or Apache Airflow. This setting must always be specified but is ignored when you run the pipeline locally. Click here to learn more about runtime images and how to use your own images.

  If the container requires a specific minimum amount of resources during execution, you can specify them. If no custom requirements are defined, the defaults in the target runtime environment (Kubeflow Pipelines or Apache Airflow) are used.

  If a notebook or script requires access to local files, such as Python scripts, you can specify them as File Dependencies. When you run a pipeline locally this setting is ignored because the notebook or script can access all (readable) files in your workspace. However, it is considered good practice to explicitly declare file dependencies to make the pipeline also runnable in environments where notebooks or scripts are executed in isolation from each other.

- The `load_data` file does not have any input file dependencies. Leave the input field empty.

  If desired, you can customize additional inputs by defining environment variables. The `load_data` file requires the environment variable `DATASET_URL`. This variable identifies the name and location of a data set, which the notebook or script will download and decompress.

- Assign environment variable `DATASET_URL` the value `https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz`:

  `DATASET_URL=https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz`

  If a notebook or script generates files that other notebooks or scripts in the pipeline require access to, specify them as Output Files. This setting is ignored if you are running a pipeline locally because all notebooks or scripts in a pipeline have access to the same shared file system. However, it is considered good practice to declare these files to make the pipeline also runnable in environments where notebooks or scripts are executed in isolation from each other.

- Declare an output file named `data/noaa-weather-data-jfk-airport/jfk_weather.csv`, which other notebooks in this pipeline process.

  It is considered good practice to specify paths that are relative to the notebook or script location to keep the pipeline portable. (A short sketch of how the `load_data` file uses `DATASET_URL` and produces this output follows this list.)

- Select the `load_data` node and attach a comment to it. The comment is automatically associated with the node.
- In the comment node enter a descriptive text, such as `Download the JFK Weather dataset archive and extract it`.
- Save the pipeline.
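The `load_data` notebook itself lives in the cloned repository and isn't reproduced here, but the contract you just configured can be sketched in a few lines of Python. Treat the snippet below as a simplified illustration of the pattern, not the repository's actual code: it reads `DATASET_URL` from the environment, downloads the archive, and extracts it into a relative `data/` directory, which is what makes the relative output path declaration work.

```python
# Simplified sketch of the load_data pattern -- not the repository's actual code.
import os
import tarfile
import urllib.request
from pathlib import Path

# The pipeline node passes the archive location through an environment variable.
dataset_url = os.environ["DATASET_URL"]

# Download the archive into the working directory.
archive = Path(dataset_url.rsplit("/", 1)[-1])
urllib.request.urlretrieve(dataset_url, archive)

# Extract it into a relative data/ directory; the archive is expected to yield
# data/noaa-weather-data-jfk-airport/jfk_weather.csv, the declared output file.
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(path=data_dir)
```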
Next, you'll add a data pre-processing notebook to the pipeline and connect it with the first notebook in such a way that it is executed after the first notebook. This notebook cleans the data in `data/noaa-weather-data-jfk-airport/jfk_weather.csv`, which the `load_data` notebook or script produced. (A simplified sketch of this cleaning step follows the list below.)
- Drag the `Part 1 - Data Cleaning.ipynb` notebook onto the canvas.
- Customize its execution properties as follows:
  - Runtime image: `Pandas`
  - Output files: `data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv`
- Attach a comment node to the `Part 1 - Data Cleaning` node and provide a description, such as `Clean the dataset`.
- Connect the output port of the `load_data` node to the input port of the `Part 1 - Data Cleaning` node to establish a dependency between the two notebooks.

  When a pipeline is executed, dependent nodes are processed in sequential order (node-1 -> node-2). Nodes that have no dependencies defined between them are processed in random order when you run a pipeline locally in JupyterLab and in parallel when you run a pipeline on Kubeflow Pipelines or Apache Airflow.

- Save the pipeline.
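The real cleaning logic is in `Part 1 - Data Cleaning.ipynb`; the hypothetical sketch below only illustrates the input/output contract you just declared for the node: read the raw CSV produced by `load_data`, apply some cleanup, and write the cleaned file to the declared relative output path.

```python
# Hypothetical stand-in for the cleaning step -- the repository notebook
# performs more thorough, dataset-specific cleanup.
import pandas as pd

raw_path = "data/noaa-weather-data-jfk-airport/jfk_weather.csv"
cleaned_path = "data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv"

df = pd.read_csv(raw_path)

# Illustrative cleanup only: drop duplicate rows and rows that are entirely empty.
df = df.drop_duplicates().dropna(how="all")

# Write the file that downstream notebooks consume, again via a relative path.
df.to_csv(cleaned_path, index=False)
```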
You are ready to run the pipeline in JupyterLab.
When you run a pipeline locally the notebooks and Python/R scripts are executed on the machine where JupyterLab is running.
- Run the pipeline.
- Enter `hello_world_pipeline` as Pipeline name.

  Note that you can only choose a different runtime platform (Kubeflow Pipelines or Apache Airflow) after you've created a runtime configuration for that platform. The last section in this tutorial includes links to resources that outline how to run pipelines on those platforms.

- Start the pipeline run. A message similar to the following is displayed after the run completes.

You can monitor the run progress in the terminal window where you've launched JupyterLab. (Note that you don't have access to the output when you are completing this tutorial in the cloud-hosted sandbox environment.)
A local pipeline run produces the following output artifacts:
- Each executed notebook is updated and includes the run results in the output cells.
- Script output (e.g. logging messages sent to STDOUT/STDERR) is displayed in the terminal window where JupyterLab is running.
- If a notebook/script produces files, they are stored in the local file system.

You can access output artifacts from the File Browser. In the screen capture below the `hello_world` pipeline output artifacts are highlighted in green.
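If you prefer to verify the results programmatically, you can also open a new notebook (or the Python console) in the same directory and load the generated file; the path below matches the output file declared for the `Part 1 - Data Cleaning` node:

```python
# Quick sanity check of the pipeline output, run from the hello_world directory.
import pandas as pd

cleaned = pd.read_csv("data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv")
print(cleaned.shape)   # rows and columns produced by the cleaning step
print(cleaned.head())  # first few cleaned records
```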
- Open the `Part 1 - Data Cleaning` notebook by double clicking on the file name in the File Browser. The output cells should contain the results that the code cells produced.

  For illustrative purposes, let's also use the git integration to review how one of the notebooks was changed while it was executed.

- From the sidebar select the Git tab. Note the files listed in the Changed (modified files) section and the Untracked (new files) section.

  Most of the new files were produced while the pipeline was processed. The only exception is the `hello_world.pipeline` file, which you created when you assembled the pipeline. (A short snippet for inspecting this file follows this list.)

- Select one of the modified notebook files and right click on its name to review the available actions.
- Select Diff to compare the updated notebook with the original version.
- Expand the twistie next to one of the Metadata changed or Outputs changed sections to review the differences.
- Close the file comparison tab.
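The `hello_world.pipeline` file you assembled earlier is itself a plain JSON document, so you can also look at it directly. The exact schema depends on the Elyra version, so the snippet below only peeks at the top-level structure rather than assuming specific fields:

```python
# Inspect the pipeline definition file; run this from the hello_world directory.
# The schema varies between Elyra versions, so only top-level keys are printed.
import json

with open("hello_world.pipeline") as f:
    pipeline_doc = json.load(f)

print(list(pipeline_doc.keys()))
```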
For illustrative purposes, let's revert the changes for one of the modified files.
- Select one of the modified notebook files, right click, and choose Discard. Note that the file name disappears from the Changed section in the Git panel.
- Switch to the File Browser tab and open that notebook file. The output cells should now be empty.
While you have the notebook editor open, let's also briefly review some of the editor productivity enhancements Elyra includes.
- Click in one of the code cells and right click. The first three listed actions are something you are probably very familiar with from other development environments. Try them out!

  The fourth action, Show diagnostics panel, provides access to linting output, which helps you keep the code clean and tidy.

  The same actions are available by default in Elyra's Python editor and optionally for the R script editor.
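As a quick way to see the diagnostics panel in action, you could paste a deliberately untidy cell like the hypothetical one below; the unused import and the missing whitespace are the kind of issues the linting output typically flags:

```python
import os  # unused import -- the kind of issue the linter typically flags
import pandas as pd

df=pd.DataFrame({"a": [1, 2, 3]})  # missing spaces around "=" may be flagged as a style issue
print(df)
```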
This concludes the introductory Elyra tutorial. You've learned how to
- create a pipeline
- add and configure notebooks or Python scripts
- run a pipeline in a local environment
- monitor the pipeline run progress
- inspect the pipeline run results
- use the git feature to compare file versions and revert changes
- simplify common code development tasks
If you'd like, you can extend the pipeline by adding two more notebooks, which can be executed in parallel after the Part 1 - Data Cleaning.ipynb notebook was processed:

- Part 2 - Data Analysis.ipynb
- Part 3 - Time Series Forecasting.ipynb

Each of these notebooks can run in the `Pandas` container image, has no input file dependencies, requires no environment variables, and produces no additional output files.
If you have questions about Elyra or suggestions, please connect with us using one of the channels listed in the community documentation.
The following resources provide more information about pipelines: