llava-onevision convert bug #2585

Open · 2 of 4 tasks
liyi-xia opened this issue Dec 17, 2024 · 9 comments

Labels: triaged (Issue has been triaged by maintainers)

Comments

@liyi-xia

System Info

Please see line 345 in tensorrt_llm/models/qwen/model.py. llava-onevision has tie_word_embeddings=True, so the checkpoint has no separate lm_head weight and converting the lm_head weights fails. We can change

"lm_head": "language_model.lm_head" to "lm_head": "language_model.model.embed_tokens"
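A rough sketch of the idea (the model id and variable names below are examples, not the actual TensorRT-LLM code): when the HF config ties the word embeddings, the checkpoint stores no separate lm_head weight, so the converter has to read lm_head from the embedding table instead.

  # Illustrative sketch only; identifiers here are examples, not TensorRT-LLM internals.
  from transformers import AutoConfig

  config = AutoConfig.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")  # example model id
  text_config = getattr(config, "text_config", config)

  if getattr(text_config, "tie_word_embeddings", False):
      # lm_head.weight is absent from the checkpoint; reuse the embedding table
      key_mapping = {"lm_head": "language_model.model.embed_tokens"}
  else:
      key_mapping = {"lm_head": "language_model.lm_head"}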

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

lm_head.weight
197it [00:00, 2288.93it/s]
Traceback (most recent call last):
  File "/root/llava-ov/convert_checkpoint.py", line 308, in <module>
    main()
  File "/root/llava-ov/convert_checkpoint.py", line 300, in main
    convert_and_save_hf(args)
  File "/root/llava-ov/convert_checkpoint.py", line 256, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/root/llava-ov/convert_checkpoint.py", line 263, in execute
    f(args, rank)
  File "/root/llava-ov/convert_checkpoint.py", line 246, in convert_and_save_rank
    qwen = QWenForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 453, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 409, in generate_tllm_weights
    self.load(tllm_key,
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 296, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/layers/linear.py", line 407, in postprocess
    weights = weights.to(str_dtype_to_torch(self.dtype))
AttributeError: 'NoneType' object has no attribute 'to'

Expected behavior

After I modify the code, the conversion completes:

198it [00:00, 2781.83it/s]
Total time of converting checkpoints: 00:00:02

Actual behavior

Conversion fails before the modification is applied.

Additional notes

N/A

@liyi-xia liyi-xia added the bug Something isn't working label Dec 17, 2024
@nv-guomingz
Collaborator

Hi @liyi-xia, could you please try the main branch? The latest code already has the changes you mentioned above.

@nv-guomingz nv-guomingz added LLM API/Workflow and removed bug Something isn't working labels Dec 18, 2024
@nv-guomingz nv-guomingz self-assigned this Dec 18, 2024
@github-actions github-actions bot added triaged Issue has been triaged by maintainers Investigating labels Dec 18, 2024
@liyi-xia
Author

@nv-guomingz Hi, I did not see any difference between the main branch and my screenshot; it still maps lm_head to lm_head. May I know what you have done in recent commits to solve this problem?

@liyi-xia
Author

liyi-xia commented Dec 18, 2024

By the way, when I tried the latest release https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.17.0.dev2024121700-cp310-cp310-linux_x86_64.whl#sha256=c153ba5e78609d8060c2281a220f1cd4769a9c30e48d77b1fd95ac299aae4607, I encountered the following error. My CUDA is 12.6, the driver is 470.129.06, and nvcc is 12.6.3.

Traceback (most recent call last):
  File "/root/convert_checkpoint.py", line 9, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 28, in <module>
    from . import graph_rewriting as gw
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/network.py", line 31, in <module>
    from tensorrt_llm.module import Module
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 37, in <module>
    from ._utils import str_dtype_to_trt
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/_utils.py", line 32, in <module>
    from tensorrt_llm.bindings import DataType, GptJsonConfig
ImportError: /usr/local/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so: undefined symbol: cuMulticastCreate

@DylanChen-NV

(quoting liyi-xia's comment above)

Hi @liyi-xia, the mapping will be updated here.

@liyi-xia
Author

OK, I see, but I am not able to use the latest version right now.

@DylanChen-NV

(quoting liyi-xia's earlier comment about the cuMulticastCreate ImportError)

Could you share the steps for upgrading?

@liyi-xia
Author

@DylanChen-NV Hi, I downloaded the wheel and installed it with pip3. Before that, I set up CUDA Compat 12.6 and upgraded nvcc to 12.6 via the CUDA Toolkit.

@BasicCoder
Contributor

(quoting liyi-xia's earlier comment about the cuMulticastCreate ImportError)

@DylanChen-NV To provide some additional information: there is no error in a host environment with driver 470.129.06 (NVIDIA-SMI 470.129.06, CUDA Version 11.4); in a host environment with driver 525.105.17 (NVIDIA-SMI 525.105.17, CUDA Version 12.0), the undefined symbol error occurs.

On the machine where the problem occurs, the following information appears after starting the container. This may be caused by a driver compatibility issue:

WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 525.105.17 which has support for CUDA 12.0. This container was built with CUDA 12.6 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use with this container but was unavailable:
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
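One way to confirm this (just a sketch, not something from the issue) is to probe the libcuda.so that actually gets loaded and see whether it exports cuMulticastCreate; the symbol only exists in newer CUDA 12.x drivers and compat libraries, so a missing symbol points to an old driver or a compat library that is not on the search path.

  # Diagnostic sketch: check whether the loaded CUDA driver exports cuMulticastCreate.
  import ctypes

  try:
      libcuda = ctypes.CDLL("libcuda.so.1")
  except OSError as exc:
      print("could not load libcuda.so.1:", exc)
  else:
      # attribute lookup on a CDLL raises AttributeError if the symbol is missing
      print("cuMulticastCreate exported:", hasattr(libcuda, "cuMulticastCreate"))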

@BasicCoder
Copy link
Contributor

BasicCoder commented Dec 19, 2024

In /etc/shinit_v2, maybe

  # If compat check passed, bring the compat lib into the path (several ways for safety)
  if [ "${_CUDA_COMPAT_STATUS}" = "CUDA Driver OK" ]; then
    # symlink the compat lib into a location that was preset to be on LD_LIBRARY_PATH via ENV
    ln -sf "${_CUDA_COMPAT_REALLIB}" "${_CUDA_COMPAT_SYMLINK}" 2>/dev/null
    # Additionally prepend _CUDA_COMPAT_REALLIB onto LD_LIBRARY_PATH in case _CUDA_COMPAT_PATH was not writable
    export LD_LIBRARY_PATH="${_CUDA_COMPAT_REALLIB}${LD_LIBRARY_PATH:+":${LD_LIBRARY_PATH}"}"
  fi

is not executed correctly.
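A quick way to check that (again only a sketch; the paths follow the usual NGC container layout and may differ on other images) is to see whether a compat directory ended up on LD_LIBRARY_PATH and whether a compat libcuda is present at all:

  # Sketch: verify the CUDA compat setup inside the container.
  import glob
  import os

  ld_paths = os.environ.get("LD_LIBRARY_PATH", "").split(":")
  print("compat dir on LD_LIBRARY_PATH:", any("compat" in p for p in ld_paths))
  # /usr/local/cuda/compat is the usual _CUDA_COMPAT_PATH default on NGC images
  print("compat libcuda files:", glob.glob("/usr/local/cuda/compat/**/libcuda.so*", recursive=True))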
