Bug: Existence of system-wide version of a shared library causes undefined symbol error #1640

Open
Garbaz opened this issue Aug 2, 2024 · 14 comments

@Garbaz

Garbaz commented Aug 2, 2024

To reproduce (assuming you have libnvjitlink12 installed system-wide, and in a different version):

library(reticulate)

venv_name <- "deleteme_5267"
virtualenv_create(venv_name)
use_virtualenv(venv_name)

py_install("torch", pip = TRUE)

pytorch  <- import("torch")

The final line gives me this error:

Error in py_module_import(module, convert = convert) : 
  ImportError: /home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12

Checking nm -gDC ~/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12 | grep nvJitLinkAddData, I get:

0000000000262eb0 T nvJitLinkAddData@@libnvJitLink.so.12
0000000000263070 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12
0000000000263080 T __nvJitLinkAddData_12_1@@libnvJitLink.so.12
0000000000263090 T __nvJitLinkAddData_12_2@@libnvJitLink.so.12
00000000002630a0 T __nvJitLinkAddData_12_3@@libnvJitLink.so.12
00000000002630b0 T __nvJitLinkAddData_12_4@@libnvJitLink.so.12
00000000002630c0 T __nvJitLinkAddData_12_5@@libnvJitLink.so.12
00000000002630d0 T __nvJitLinkAddData_12_6@@libnvJitLink.so.12

So the version of libnvJitLink.so.12 in the virtualenv has the symbol. And if I activate the virtualenv normally in a shell and import torch from inside a normal Python REPL I don't get any errors. So it's not the fault of libcusparse.so.12.

The thing is though, the library libnvJitLink.so.12 is also installed system-wide, but in a different version. Checking there with nm -gDC /usr/lib/x86_64-linux-gnu/libnvJitLink.so.12 | grep nvJitLinkAddData, I get only:

0000000000226bd0 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12
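
The comparison of the two nm listings can be automated. The following is a minimal sketch, not something from the original report: the helper names `exported_symbols` and `missing_symbols` are hypothetical, and the parser assumes the three-column `address type name` layout shown above (demangled C++ names containing spaces would need more careful parsing).

```python
def exported_symbols(nm_output):
    """Collect symbol names from `nm -gDC` output, dropping any
    `@@version` suffix. Only three-field lines are considered."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols print as: address, type letter, name.
        # Undefined symbols have only two fields and are skipped.
        if len(fields) == 3:
            symbols.add(fields[2].split("@")[0])
    return symbols


def missing_symbols(reference_output, candidate_output):
    """Return the symbols exported by the reference library but not by
    the candidate library, sorted alphabetically."""
    return sorted(
        exported_symbols(reference_output) - exported_symbols(candidate_output)
    )
```

Passing the virtualenv listing as `reference_output` and the system-wide listing as `candidate_output` would list every `nvJitLinkAddData` variant that the system-wide copy lacks.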

And when I remove the system-wide version of the library with

sudo apt remove libnvjitlink12:amd64

the error no longer occurs.

It appears that if there is a system-wide version of a shared library, it is preferred over the local version in the virtualenv. This is not how things should be!
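
One way to check which copy of the library actually ended up in the process is to read the memory map that Linux exposes under /proc. The sketch below is an illustration under that assumption (Linux only); the function name `mapped_libraries` is hypothetical and not part of reticulate or torch.

```python
from pathlib import Path


def mapped_libraries(name_fragment, maps_path="/proc/self/maps"):
    """Return the sorted, de-duplicated paths of files mapped into the
    current process whose file name contains `name_fragment`."""
    paths = set()
    for line in Path(maps_path).read_text().splitlines():
        # Each line is: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) == 6 and name_fragment in Path(fields[5]).name:
            paths.add(fields[5])
    return sorted(paths)
```

Calling `mapped_libraries("libnvJitLink")` in the same session right after the failed import should show whether the path under /usr/lib/x86_64-linux-gnu or the one under the virtualenv's site-packages was loaded.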

R version is 4.4.1 (2024-06-14) and reticulate version is reticulate_1.38.0.

@Garbaz
Author

Garbaz commented Aug 2, 2024

To be clear, sudo apt remove libnvjitlink12:amd64 is not really a solution to this problem.

@t-kalinowski
Member

t-kalinowski commented Aug 2, 2024

Thanks for reporting!

Are you using the RStudio IDE? Does this happen only in the RStudio IDE, or outside the IDE too?

@Garbaz
Author

Garbaz commented Aug 5, 2024

Ah, I should have added that I'm using R Studio Server. And I should have tested running the repro code directly in R.

I don't have access to a machine at the moment where I can test running the code in normal R Studio Desktop, so I can't check whether it's a R Studio Server specific issue. But running source("repro.R"), where repro.R contains the repro code:

library(reticulate)

venv_name <- "deleteme_5267"
virtualenv_create(venv_name)
use_virtualenv(venv_name)

py_install("torch", pip = TRUE)

pytorch  <- import("torch")

I do not get the error. And running e.g. pytorch$cuda$is_available() works as expected.

So it appears to be an interaction between R Studio (Server) and Reticulate that is the issue.

@Garbaz
Author

Garbaz commented Aug 5, 2024

Wait, scratch that, I forgot I uninstalled libnvjitlink12 to temporarily fix the issue. Reinstalling it, I get the same error in plain R!

So it has nothing to do with R Studio (Server) in particular.

@t-kalinowski
Member

I don't think reticulate is modifying the order of loaded libs.

If this occurs with reticulate::import("torch") in R, but not in a terminal with ~/.virtualenvs/r-torch/bin/python -c 'import torch', then it's likely that something in the R session is either

  1. Modifying LD_LIBRARY_PATH
  2. Pre-loading the "wrong" libnvjitlink12 for some reason.

Can you please double-check the value of Sys.getenv("LD_LIBRARY_PATH") in R, and also, inspect other R startup files for code that might be causing this (.Rprofile, .Renviron, etc.)?
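
For context, glibc's dynamic loader consults the directories in LD_LIBRARY_PATH before a library's own DT_RUNPATH entries, the ld.so cache, and the default directories. Whether that ordering is what causes the failure here is not established in this thread. The sketch below only approximates the LD_LIBRARY_PATH step of that search; the function name `first_match_on_search_path` is hypothetical, and real loader behaviour (RPATH/RUNPATH, already-loaded libraries, the cache) is deliberately left out.

```python
import os


def first_match_on_search_path(soname, search_path):
    """Return the first `directory/soname` that exists when scanning the
    colon-separated `search_path` from left to right, or None."""
    for directory in search_path.split(":"):
        if not directory:
            # The real loader treats an empty entry as the current
            # directory; this simplified sketch just skips it.
            continue
        candidate = os.path.join(directory, soname)
        if os.path.isfile(candidate):
            return candidate
    return None
```

For example, `first_match_on_search_path("libnvJitLink.so.12", os.environ.get("LD_LIBRARY_PATH", ""))` reports which copy, if any, a path-based lookup would reach first in the current session.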

@Garbaz
Author

Garbaz commented Aug 5, 2024

Both Sys.getenv("LD_LIBRARY_PATH") and os <- import("os"); os$environ["LD_LIBRARY_PATH"] give:

"/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server"

What I do find weird is that there is no mention of the virtualenv, even in os$environ["LD_LIBRARY_PATH"], even though, evidently, the libraries from the virtualenv are found.

@Garbaz
Author

Garbaz commented Aug 5, 2024

Okay, it appears Python does not simply use the LD_LIBRARY_PATH environment variable. At least when I run os.environ["LD_LIBRARY_PATH"] in the normal python REPL (from the virtualenv), I get a key error.

However, Python does use an environment variable PYTHONPATH. Running os <- import("os"); os$environ["PYTHONPATH"] in R I get:

"/usr/local/lib/R/site-library/reticulate/config:/usr/lib/python312.zip:/usr/lib/python3.12:/usr/lib/python3.12/lib-dynload:/home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages:/usr/local/lib/R/site-library/reticulate/python"

I will investigate whether I can fix the issue by messing with PYTHONPATH.

Update: I have experimented with both LD_LIBRARY_PATH and PYTHONPATH and could not get the issue to go away. I will continue trying to figure this out later this week.

@Garbaz
Author

Garbaz commented Aug 5, 2024

By the way, py_last_error() gives:

--- Python Exception Message
Traceback (most recent call last):
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 122, in _find_and_load_hook
    return _run_hook(name, _hook)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 96, in _run_hook
    module = hook()
             ^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 120, in _hook
    return _find_and_load(name, import_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/__init__.py", line 290, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 122, in _find_and_load_hook
    return _run_hook(name, _hook)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 96, in _run_hook
    module = hook()
             ^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 120, in _hook
    return _find_and_load(name, import_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
--- R Traceback
    ▆
 1. └─reticulate::import("torch")
 2.   └─reticulate:::py_module_import(module, convert = convert)
See `reticulate::py_last_error()$r_trace$full_call` for more details.

In case that's of any help.

@t-kalinowski
Member

t-kalinowski commented Aug 5, 2024

I am unable to reproduce locally.

Note that PyTorch can be installed a few different ways, depending on your environment. You may want to consult https://pytorch.org/get-started/locally/ and see if there is something that will work better for you than a bare pip install torch (e.g., pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124).

@Garbaz

This comment was marked as outdated.

@paciorek

paciorek commented Oct 1, 2024

This issue sounds very similar to the Quarto issue I just reported, though given discussion there, I probably reported it in the wrong place.

@cderv suggested I report the issue here to reticulate. For now I am simply tagging onto this issue given I think it may be related.

That said, in the Quarto issue @cscheid commented that it seemed like it's a knitr issue rather than reticulate and his logic makes sense.

@Garbaz
Author

Garbaz commented Oct 14, 2024

Sorry for the late reply.

As I get the same bug when running my repro script in the plain R REPL, I don't see how this could be an issue with knitr.

And indeed your issue seems to be due to the same underlying bug.

@Garbaz
Author

Garbaz commented Oct 14, 2024

Also, as I use pip for my setup and not conda, the suggestion in your issue that the problem is your use of conda seems to me to also be incorrect.

@paciorek

As I discussed in an update to the issue I filed with quarto, in my case, the system copy of the library was being opened first (based on looking at the output of strace), so the copy of the library coming from the Python environment was never opened.

A work-around is to use LD_PRELOAD to load the copy of libnvJitLink.so.12 that works:

LD_PRELOAD=/path/to/good/libnvJitLink.so.12  <your executable here>

And yes, the issue is not specific to knitr. It seems to have to do with what libraries have been opened at the time the Python code is run via reticulate. In your case, I don't know what causes the 'bad' libnvjitlink12 to be opened.
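
A similar effect to LD_PRELOAD can sometimes be had from inside the Python session by loading the good copy explicitly before torch is imported. This is an untested sketch, not a confirmed fix for this issue: it assumes that the glibc loader reuses an already-loaded library with the same soname, and that the "bad" copy has not been loaded into the process yet. The function name `preload_shared_library` is hypothetical.

```python
import ctypes
import os


def preload_shared_library(path):
    """Load the shared library at `path` with RTLD_GLOBAL so that its
    symbols are visible to libraries loaded afterwards.

    Returns the ctypes handle, or None if the file does not exist.
    """
    if not os.path.isfile(path):
        return None
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)


# Intended use (assumption: run before the first `import torch`):
#   preload_shared_library("/path/to/good/libnvJitLink.so.12")
#   import torch
```

The handle should be kept referenced for the lifetime of the session. If the system-wide copy is already mapped by the time this runs, the preload is unlikely to help, and the LD_PRELOAD approach above remains the more reliable option.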
