
Reconsider how editing environments works #886

Open
krassowski opened this issue Sep 19, 2024 · 7 comments
Labels
needs: triaging 🚦 Someone needs to have a look at this issue and triage

Comments

@krassowski

Context

Currently, editing an environment:

  • creates a new build directory
  • swaps the symlink

This means that autoreloading does not work. For example, when used with Jupyter/IPython:

  • the kernelspec needs to be reloaded, hence the need for back-and-forth switching between environments to see changes applied; in my experience this wastes about 3 minutes per change to an environment (compared to a workflow without conda-store)
  • newly installed packages do not become available in a running kernel, which means that potentially hours of computation may be lost, as the kernel needs to be restarted to pick up the smallest change in the env
  • changed packages cannot be picked up by the IPython autoreloader either (a minimal diagnostic of the symptom is sketched below)
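
A minimal diagnostic sketch of the symptom, run inside the kernel; the paths are purely illustrative and do not reflect conda-store's actual on-disk layout:

```python
import os
import sys

# The running kernel stays tied to the build directory it was started from...
print(sys.prefix)
# e.g. /opt/conda-store/envs/.builds/analysis-20240918  (old build)

# ...while the environment symlink now resolves to a freshly created build,
# so packages installed there are invisible to this kernel until it restarts.
print(os.path.realpath("/opt/conda-store/envs/analysis"))
# e.g. /opt/conda-store/envs/.builds/analysis-20240919  (new build)
```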

If instead it worked like this:

  • copy the existing build to a new folder
  • the copied folder becomes the archival build version
  • update the environment:
    • ideally by updating in place rather than rebuilding from scratch
    • if need be, by rebuilding (which will still work, but will likely change some versions inadvertently if new versions were released, which is a separate issue)
  • there would be no need to swap the symlink, so autoreloading and everything else works! (see the sketch below)
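
A rough sketch of what that flow could look like, expressed with plain conda commands and hypothetical paths; this is not how conda-store works today, just an illustration of the proposal:

```python
import shutil
import subprocess

# Hypothetical paths for illustration; not conda-store's actual layout.
live_build = "/opt/conda-store/envs/.builds/analysis-20240918"
archival_copy = "/opt/conda-store/envs/.builds/analysis-20240918-archive"

# 1. Copy the existing build aside; the copy becomes the archival build version.
shutil.copytree(live_build, archival_copy, symlinks=True)

# 2. Update the live prefix in place. No symlink swap happens, so running
#    kernels and the IPython autoreloader keep seeing the same prefix, whose
#    contents have simply been updated.
subprocess.run(
    ["conda", "install", "--prefix", live_build, "--yes", "somepackage=1.2.3"],
    check=True,
)
```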

Value and/or benefit

Many minutes to hours of productivity gained (or rather, not lost) for the use case of interactive environment creation by a senior data analyst.

Anything else?

No response

@krassowski krassowski added the needs: triaging 🚦 Someone needs to have a look at this issue and triage label Sep 19, 2024
@krassowski
Author

@kcpevey mentioned to me that this may be a footgun for shared environments:

The problem with autoreloading the environment is that the environment can change underneath you - other people could have updated the environment without your knowledge.

I somewhat agree, but ultimately if a shared env is changed by someone else, activating it after the change will cause the same issue.

They also questioned the UX around user awareness:

What if you are running a notebook, stop to kick off a rebuild of the env which takes 20 minutes, and while that's going you keep working in the notebook. At some point, the env build is complete. What happens to your running notebook? Does the kernel remain on the old env until you restart it? Does the user get a warning that the kernel has been replaced?

Here I would mention that auto-reloading is not enabled by default, and users who enable it know what they are doing. Also, rebuilding an env should not take 20 minutes (but it does). I do, however, agree that a notification that an environment build has completed should be shown when conda-store is used with JupyterLab, which is tracked in:

@dharhas
Member

dharhas commented Sep 19, 2024

As an FYI: historically, updating in place rather than rebuilding from scratch has been a really bad idea, and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to replay every update step and that is not tracked anywhere.

@dharhas
Member

dharhas commented Sep 19, 2024

But this ties into another discussion I had about packaging at PyCon: we actually have multiple target audiences (devs, end users, etc.) for environment management, and we are using the same tools for all of them.

@dharhas
Member

dharhas commented Sep 19, 2024

newly installed packages do not become available in a running kernel, which means that potentially hours of computation may be lost, as the kernel needs to be restarted to pick up the smallest change in the env

Is this actually a valid use case? How reliably does it work? For pure Python packages, maybe. To me it seems that if you change the underlying environment, all bets are off on whether your Python objects are even valid if an install changed something under the hood. It seems like a better option would be to make sure you serialize your results.

@krassowski
Author

Is this actually a valid use case?

Yes, in IPython installing/updating packages via the %pip and %conda magics is a supported and valid use case. These magics warn that for some packages restarting the kernel may be required, but with autoreload on that is rarely the case (it is the case when updating non-pure-Python packages).
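
For illustration, this is the kind of in-kernel workflow being described (the package name and version are just examples):

```python
# In a running IPython/Jupyter session:
%load_ext autoreload
%autoreload 2   # re-import changed modules automatically before running each cell

# Install or upgrade a package into the environment backing the *current* kernel.
# IPython may warn that a restart is needed; for pure-Python updates with
# autoreload enabled, that is usually unnecessary.
%pip install "somepackage==1.2.4"   # %conda install somepackage=1.2.4 also works
```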

How reliably does it work? For pure Python packages, maybe.

Very well, in my experience. And Databricks considers it a valid use case too; they are contributing enhancements in ipython/ipython#14500.

To me it seems if you change the underlying environment all bets are off on whether your python objects are even valid if an install changed something under the hood.

This is my call to make as an experienced user. I can tell whether I will need to restart the kernel or not, and I often know what specific changes will be made. It is not for updating numpy from 1.x to 2.x; it is for grabbing patches for very specific bugfixes.

Seems like a better option would be to make sure you serialized your results.

No, I disagree here. It sounds like blaming the user, but in fact, even with the best serialization and caching, there are operations that always take time, like loading large files or training small/medium models. I don't think that using conda-store should be incompatible with data scientists analysing big data or training baseline/statistical models in notebooks. But maybe I misunderstand the target audience of conda-store.

@krassowski
Author

historically, updating in place rather than rebuilding from scratch has been a really bad idea, and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to replay every update step and that is not tracked anywhere.

On the other hand, the current conda-store approach leads to broken notebooks for data scientists who are not used to working with conda-store:

  • one time, my Python version was updated when I added a small, unrelated package, because a new Python version had been released; it broke half of the packages I had installed
  • my numpy was upgraded to 2.0 while I was making an unrelated change to the env, just because it had been released, and it broke my code
  • another time, when I upgraded one dependency, it downgraded another dependency without telling me

Why not give advanced users a choice on whether to update in place or not? If the old environment is copied as an archival build, that carries zero risk, right?

Thinking about it, what I really miss is:

  • a) hot-reloading support (as discussed in this issue)
  • b) a confirmation screen saying what changes will be made so that I can adjust pins
  • c) a way to pin all packages to their current versions (e.g. if numpy 1.2 was auto-installed, I would want to easily apply a "numpy>=1.2.0,<2.0" pin when modifying the environment); see the sketch after this list
    • right now the version information is actually a bit hidden
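
As an illustration of (c), pins could in principle be derived from whatever is currently installed. A rough sketch (`conda list --json` is a real command; the pinning policy and environment name are just examples):

```python
import json
import subprocess

# List what is currently installed ("analysis" is a placeholder environment name).
installed = json.loads(
    subprocess.run(
        ["conda", "list", "--name", "analysis", "--json"],
        check=True, capture_output=True, text=True,
    ).stdout
)

# Turn each package into a conservative pin: at least the current version,
# but below the next major release.
for pkg in installed:
    major = pkg["version"].split(".")[0]
    if major.isdigit():
        print(f'{pkg["name"]} >={pkg["version"]},<{int(major) + 1}')
```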

@krassowski
Author

Part of the delay is that even after the environment is built, we need to wait for up to a minute for it to be refreshed on the nb_conda_kernels side: https://github.com/anaconda/nb_conda_kernels/blob/04c5fc605c08a4ced0cc45d2a6507dea40897600/nb_conda_kernels/manager.py#L18

So, as a user, I keep restarting the kernel until the new environment is picked up. If my new/edited dependency is used lower in the notebook, I can waste many minutes there.

For the interactive use case, we need to either somehow rewrite nb_conda_kernels to watch kernelspec changes on disk, or emit an event from conda-store and make nb_conda_kernels refresh (a rough sketch of the watcher idea follows the Dockerfile excerpt below). Interaction with nb_conda_kernels appears to be in scope, as it is included in the Dockerfile:

python=${python_version} nb_conda_kernels nodejs=18 yarn constructor \
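
A very rough sketch of the "watch kernelspecs on disk" idea, using the watchdog library and a placeholder refresh hook; none of this is nb_conda_kernels' actual API, and the kernelspec path is illustrative:

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

KERNELSPEC_DIR = "/opt/conda/share/jupyter/kernels"  # illustrative path


def refresh_kernelspecs():
    # Placeholder: in nb_conda_kernels this would invalidate the cached
    # kernelspec list instead of waiting out the polling interval.
    print("kernelspec change detected; refreshing")


class KernelspecChanged(FileSystemEventHandler):
    def on_any_event(self, event):
        refresh_kernelspecs()


observer = Observer()
observer.schedule(KernelspecChanged(), KERNELSPEC_DIR, recursive=True)
observer.start()
```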
