
Reconsider how editing environments works #886

Open
krassowski opened this issue Sep 19, 2024 · 7 comments
Labels
needs: triaging 🚦 Someone needs to have a look at this issue and triage

Comments

@krassowski

Context

Currently, editing an environment:

  • creates a new build directory
  • swaps the symlink

This means that autoreloading does not work. For example, when used with Jupyter/IPython:

  • the kernelspec needs to be reloaded, hence the need for back-and-forth switching between environments to see changes applied; in my experience this wastes about 3 minutes per change to an environment (compared to a workflow without conda-store)
  • newly installed packages do not become available in a running kernel, which means that potentially hours of computation may be lost, as the kernel needs to be restarted to pick up the smallest change in the env
  • changed packages cannot be picked up by the IPython autoreloader either (a minimal diagnostic of the symptom is sketched below)
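
A minimal diagnostic sketch of the symptom, run inside the kernel; the paths are purely illustrative and do not reflect conda-store's actual on-disk layout:

```python
import os
import sys

# The running kernel stays tied to the build directory it was started from...
print(sys.prefix)
# e.g. /opt/conda-store/envs/.builds/analysis-20240918  (old build)

# ...while the environment symlink now resolves to a freshly created build,
# so packages installed there are invisible to this kernel until it restarts.
print(os.path.realpath("/opt/conda-store/envs/analysis"))
# e.g. /opt/conda-store/envs/.builds/analysis-20240919  (new build)
```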

If instead it worked like this:

  • copy the existing build to a new folder
  • the copied folder becomes the archival build version
  • update the environment:
    • ideally by updating in place rather than rebuilding from scratch
    • if need be, by rebuilding (which will still work, but will likely change some versions inadvertently if new versions were released, which is a separate issue)
  • there would be no need to swap the symlink, so autoreloading and everything else works! (see the sketch below)
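
A rough sketch of what that flow could look like, expressed with plain conda commands and hypothetical paths; this is not how conda-store works today, just an illustration of the proposal:

```python
import shutil
import subprocess

# Hypothetical paths for illustration; not conda-store's actual layout.
live_build = "/opt/conda-store/envs/.builds/analysis-20240918"
archival_copy = "/opt/conda-store/envs/.builds/analysis-20240918-archive"

# 1. Copy the existing build aside; the copy becomes the archival build version.
shutil.copytree(live_build, archival_copy, symlinks=True)

# 2. Update the live prefix in place. No symlink swap happens, so running
#    kernels and the IPython autoreloader keep seeing the same prefix, whose
#    contents have simply been updated.
subprocess.run(
    ["conda", "install", "--prefix", live_build, "--yes", "somepackage=1.2.3"],
    check=True,
)
```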

Value and/or benefit

Many minutes to hours of productivity gained (or rather, not lost) for the use case of interactive environment creation by a senior data analyst.

Anything else?

No response

@krassowski krassowski added the needs: triaging 🚦 Someone needs to have a look at this issue and triage label Sep 19, 2024
@krassowski
Author

@kcpevey mentioned to me that this may be a footgun for shared environments:

The problem with autoreloading the environment is that the environment can change underneath you - other people could have updated the environment without your knowledge.

I somewhat agree, but ultimately if a shared env is changed by someone else, activating it after the change will cause the same issue.

They also questioned the UX around user awareness:

What if you are running a notebook, stop to kick off a rebuild of the env which takes 20 minutes, and while that's going you keep working in the notebook. At some point, the env build is complete. What happens to your running notebook? Does the kernel remain on the old env until you restart it? Does the user get a warning that the kernel has been replaced?

Here I would mention that auto-reloading is not enabled by default, and users who enable it know what they are doing. Also, rebuilding an env should not take 20 minutes (but it does). I do, however, agree that a notification that an environment build has completed should be shown when conda-store is used with JupyterLab, which is tracked in:

@dharhas
Member

dharhas commented Sep 19, 2024

As an FYI: historically, updating in place rather than rebuilding from scratch has been a really bad idea, and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to replay every update step and that is not tracked anywhere.

@dharhas
Member

dharhas commented Sep 19, 2024

But this ties into another discussion I had about packaging at PyCon: we actually have multiple target audiences (devs, end users, etc.) for environment management, and we are using the same tools for all of them.

@dharhas
Member

dharhas commented Sep 19, 2024

newly installed packages do not become available in a running kernel, which means that potentially hours of computation may be lost, as the kernel needs to be restarted to pick up the smallest change in the env

Is this actually a valid use case? How reliably does it work? For pure Python packages, maybe. To me it seems that if you change the underlying environment, all bets are off on whether your Python objects are even valid if an install changed something under the hood. It seems like a better option would be to make sure you serialize your results.

@krassowski
Author

Is this actually a valid use case?

Yes, in IPython installing/updating packages via the %pip and %conda magics is a supported and valid use case. These magics warn that for some packages restarting the kernel may be required, but with autoreload on that is rarely the case (it is the case when updating non-pure-Python packages).
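
For illustration, this is the kind of in-kernel workflow being described (the package name and version are just examples):

```python
# In a running IPython/Jupyter session:
%load_ext autoreload
%autoreload 2   # re-import changed modules automatically before running each cell

# Install or upgrade a package into the environment backing the *current* kernel.
# IPython may warn that a restart is needed; for pure-Python updates with
# autoreload enabled, that is usually unnecessary.
%pip install "somepackage==1.2.4"   # %conda install somepackage=1.2.4 also works
```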

How reliably does it work? For pure Python packages, maybe.

Very well, in my experience. And Databricks considers it a valid use case too; they are contributing enhancements in ipython/ipython#14500.

To me it seems if you change the underlying environment all bets are off on whether your python objects are even valid if an install changed something under the hood.

This is my call to make as an experienced user. I can tell whether I will need to restart the kernel or not, and I often know what specific changes will be made. It is not for updating numpy from 1.x to 2.x; it is for grabbing patches for very specific bugfixes.

Seems like a better option would be to make sure you serialized your results.

No, I disagree here. It sounds like blaming the user, but in fact, even with the best serialization and caching, there are operations that always take time, like loading large files or training small/medium models. I don't think that using conda-store should be incompatible with data scientists analysing big data or training baseline/statistical models in notebooks. But maybe I misunderstand the target audience of conda-store.

@krassowski
Author

historically, updating in place rather than rebuilding from scratch has been a really bad idea, and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to replay every update step and that is not tracked anywhere.

On the other hand, the current conda-store approach leads to broken notebooks for data scientists who are not used to working with conda-store:

  • one time, my Python version was updated when I added a small, unrelated package, because a new Python version had been released; it broke half of the packages I had installed
  • my numpy was upgraded to 2.0 while I was making an unrelated change to the env, just because it had been released, and it broke my code
  • another time, when I upgraded one dependency, it downgraded another dependency without telling me

Why not give advanced users a choice on whether to update in place or not? If the old environment is copied as an archival build, that carries zero risk, right?

Thinking about it, what I really miss is:

  • a) hot-reloading support (as discussed in this issue)
  • b) a confirmation screen saying what changes will be made so that I can adjust pins
  • c) a way to pin all packages to their current versions (e.g. if numpy 1.2 was auto-installed, I would want to easily apply a "numpy>=1.2.0,<2.0" pin when modifying the environment); see the sketch after this list
    • right now the version information is actually a bit hidden
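
As an illustration of (c), pins could in principle be derived from whatever is currently installed. A rough sketch (`conda list --json` is a real command; the pinning policy and environment name are just examples):

```python
import json
import subprocess

# List what is currently installed ("analysis" is a placeholder environment name).
installed = json.loads(
    subprocess.run(
        ["conda", "list", "--name", "analysis", "--json"],
        check=True, capture_output=True, text=True,
    ).stdout
)

# Turn each package into a conservative pin: at least the current version,
# but below the next major release.
for pkg in installed:
    major = pkg["version"].split(".")[0]
    if major.isdigit():
        print(f'{pkg["name"]} >={pkg["version"]},<{int(major) + 1}')
```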

@krassowski
Author

Part of the delay is that even after the environment is built, we need to wait for up to a minute for it to be refreshed on the nb_conda_kernels side: https://github.com/anaconda/nb_conda_kernels/blob/04c5fc605c08a4ced0cc45d2a6507dea40897600/nb_conda_kernels/manager.py#L18

So, as a user, I keep restarting the kernel until the new environment is picked up. If my new/edited dependency is used lower in the notebook, I can waste many minutes there.

For the interactive use case, we need to either somehow rewrite nb_conda_kernels to watch kernelspec changes on disk, or emit an event from conda-store and make nb_conda_kernels refresh (a rough sketch of the watcher idea follows the Dockerfile excerpt below). Interaction with nb_conda_kernels appears to be in scope, as it is included in the Dockerfile:

python=${python_version} nb_conda_kernels nodejs=18 yarn constructor \
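
A very rough sketch of the "watch kernelspecs on disk" idea, using the watchdog library and a placeholder refresh hook; none of this is nb_conda_kernels' actual API, and the kernelspec path is illustrative:

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

KERNELSPEC_DIR = "/opt/conda/share/jupyter/kernels"  # illustrative path


def refresh_kernelspecs():
    # Placeholder: in nb_conda_kernels this would invalidate the cached
    # kernelspec list instead of waiting out the polling interval.
    print("kernelspec change detected; refreshing")


class KernelspecChanged(FileSystemEventHandler):
    def on_any_event(self, event):
        refresh_kernelspecs()


observer = Observer()
observer.schedule(KernelspecChanged(), KERNELSPEC_DIR, recursive=True)
observer.start()
```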
