llava-onevision convert bug #2585

Open · 2 of 4 tasks
liyi-xia opened this issue Dec 17, 2024 · 9 comments

Labels: triaged (Issue has been triaged by maintainers)

Comments

@liyi-xia

System Info

Please see line 345 in tensorrt_llm/models/qwen/model.py. llava-onevision has tie_word_embeddings=True, so the checkpoint has no separate lm_head weight and converting the lm_head weights fails. We can change

"lm_head": "language_model.lm_head" to "lm_head": "language_model.model.embed_tokens"
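A rough sketch of the idea (the model id and variable names below are examples, not the actual TensorRT-LLM code): when the HF config ties the word embeddings, the checkpoint stores no separate lm_head weight, so the converter has to read lm_head from the embedding table instead.

  # Illustrative sketch only; identifiers here are examples, not TensorRT-LLM internals.
  from transformers import AutoConfig

  config = AutoConfig.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")  # example model id
  text_config = getattr(config, "text_config", config)

  if getattr(text_config, "tie_word_embeddings", False):
      # lm_head.weight is absent from the checkpoint; reuse the embedding table
      key_mapping = {"lm_head": "language_model.model.embed_tokens"}
  else:
      key_mapping = {"lm_head": "language_model.lm_head"}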

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

lm_head.weight
197it [00:00, 2288.93it/s]
Traceback (most recent call last):
  File "/root/llava-ov/convert_checkpoint.py", line 308, in <module>
    main()
  File "/root/llava-ov/convert_checkpoint.py", line 300, in main
    convert_and_save_hf(args)
  File "/root/llava-ov/convert_checkpoint.py", line 256, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/root/llava-ov/convert_checkpoint.py", line 263, in execute
    f(args, rank)
  File "/root/llava-ov/convert_checkpoint.py", line 246, in convert_and_save_rank
    qwen = QWenForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 453, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 409, in generate_tllm_weights
    self.load(tllm_key,
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 296, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/layers/linear.py", line 407, in postprocess
    weights = weights.to(str_dtype_to_torch(self.dtype))
AttributeError: 'NoneType' object has no attribute 'to'

Expected behavior

After I modify the code, the conversion completes:

198it [00:00, 2781.83it/s]
Total time of converting checkpoints: 00:00:02

Actual behavior

Conversion fails before the modification is applied.

Additional notes

N/A

@liyi-xia liyi-xia added the bug Something isn't working label Dec 17, 2024
@nv-guomingz
Collaborator

Hi @liyi-xia, could you please try the main branch? The latest code already has the changes you mentioned above.

@nv-guomingz nv-guomingz added LLM API/Workflow and removed bug Something isn't working labels Dec 18, 2024
@nv-guomingz nv-guomingz self-assigned this Dec 18, 2024
@github-actions github-actions bot added triaged Issue has been triaged by maintainers Investigating labels Dec 18, 2024
@liyi-xia
Author

@nv-guomingz Hi, I did not see any difference between the main branch and my screenshot; it still maps lm_head to lm_head. May I know what you have done in recent commits to solve this problem?

@liyi-xia
Author

liyi-xia commented Dec 18, 2024

By the way, when I tried the latest release https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.17.0.dev2024121700-cp310-cp310-linux_x86_64.whl#sha256=c153ba5e78609d8060c2281a220f1cd4769a9c30e48d77b1fd95ac299aae4607, I encountered the following error. My CUDA is 12.6, the driver is 470.129.06, and nvcc is 12.6.3.

Traceback (most recent call last):
  File "/root/convert_checkpoint.py", line 9, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 28, in <module>
    from . import graph_rewriting as gw
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/network.py", line 31, in <module>
    from tensorrt_llm.module import Module
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 37, in <module>
    from ._utils import str_dtype_to_trt
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/_utils.py", line 32, in <module>
    from tensorrt_llm.bindings import DataType, GptJsonConfig
ImportError: /usr/local/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so: undefined symbol: cuMulticastCreate

@DylanChen-NV

(quoting liyi-xia's comment above)

Hi @liyi-xia, the mapping will be updated here.

@liyi-xia
Author

OK, I see, but I am not able to use the latest version right now.

@DylanChen-NV

(quoting liyi-xia's earlier comment about the cuMulticastCreate ImportError)

Could you share the steps for upgrading?

@liyi-xia
Author

@DylanChen-NV Hi, I downloaded the wheel and installed it with pip3. Before that, I set up CUDA Compat 12.6 and upgraded nvcc to 12.6 via the CUDA Toolkit.

@BasicCoder
Contributor

(quoting liyi-xia's earlier comment about the cuMulticastCreate ImportError)

@DylanChen-NV To provide some additional information: there is no error in a host environment with driver 470.129.06 (NVIDIA-SMI 470.129.06, CUDA Version 11.4); in a host environment with driver 525.105.17 (NVIDIA-SMI 525.105.17, CUDA Version 12.0), the undefined symbol error occurs.

On the machine where the problem occurs, the following information appears after starting the container. This may be caused by a driver compatibility issue:

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 525.105.17 which has support for CUDA 12.0. This container was built with CUDA 12.6 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use with this container but was unavailable:
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
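One way to confirm this (just a sketch, not something from the issue) is to probe the libcuda.so that actually gets loaded and see whether it exports cuMulticastCreate; the symbol only exists in newer CUDA 12.x drivers and compat libraries, so a missing symbol points to an old driver or a compat library that is not on the search path.

  # Diagnostic sketch: check whether the loaded CUDA driver exports cuMulticastCreate.
  import ctypes

  try:
      libcuda = ctypes.CDLL("libcuda.so.1")
  except OSError as exc:
      print("could not load libcuda.so.1:", exc)
  else:
      # attribute lookup on a CDLL raises AttributeError if the symbol is missing
      print("cuMulticastCreate exported:", hasattr(libcuda, "cuMulticastCreate"))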

@BasicCoder
Copy link
Contributor

BasicCoder commented Dec 19, 2024

In /etc/shinit_v2, maybe

  # If compat check passed, bring the compat lib into the path (several ways for safety)
  if [ "${_CUDA_COMPAT_STATUS}" = "CUDA Driver OK" ]; then
    # symlink the compat lib into a location that was preset to be on LD_LIBRARY_PATH via ENV
    ln -sf "${_CUDA_COMPAT_REALLIB}" "${_CUDA_COMPAT_SYMLINK}" 2>/dev/null
    # Additionally prepend _CUDA_COMPAT_REALLIB onto LD_LIBRARY_PATH in case _CUDA_COMPAT_PATH was not writable
    export LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}${LD_LIBRARY_PATH:+":${LD_LIBRARY_PATH}"}"
  fi

is not executed correctly.
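A quick way to check that (again only a sketch; the paths follow the usual NGC container layout and may differ on other images) is to see whether a compat directory ended up on LD_LIBRARY_PATH and whether a compat libcuda is present at all:

  # Sketch: verify the CUDA compat setup inside the container.
  import glob
  import os

  ld_paths = os.environ.get("LD_LIBRARY_PATH", "").split(":")
  print("compat dir on LD_LIBRARY_PATH:", any("compat" in p for p in ld_paths))
  # /usr/local/cuda/compat is the usual _CUDA_COMPAT_PATH default on NGC images
  print("compat libcuda files:", glob.glob("/usr/local/cuda/compat/**/libcuda.so*", recursive=True))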
