[xinference] Error: An error occurred during streaming #8253

Open

tigflanker opened this issue Sep 11, 2024 · 11 comments

@tigflanker

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Same problem as in this discussion: #5813

I serve models with xinference via Docker, and the deployed model is Qwen-VL-Chat.
[screenshot: 微信截图_20240911141245]

The reply there suggested disabling streaming, like this:

import os  # assumed: needed for the os.environ lookups below

# Assumed context: `model`, SystemPromptMessage and UserPromptMessage come from
# Dify's model runtime; this snippet is illustrative, not a standalone script.
response = model.invoke(
    model='ChatGLM3',
    credentials={
        'server_url': os.environ.get('XINFERENCE_SERVER_URL'),
        'model_uid': os.environ.get('XINFERENCE_CHAT_MODEL_UID')
    },
    prompt_messages=[
        SystemPromptMessage(
            content='You are a helpful AI assistant.',
        ),
        UserPromptMessage(
            content='Hello World!'
        )
    ],
    model_parameters={
        'temperature': 0.7,
        'top_p': 1.0,
    },
    stop=['you'],
    stream=False,  # Disable streaming
    user="abc-123"
)

# By setting stream=False, you should be able to avoid the streaming error.

But we don't know which file we should modify.
Could anyone please explain this step? Thanks.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response


dosubot bot commented Sep 11, 2024

Hey @tigflanker! I'm here to assist you with any bugs, questions, or contributions you have. Let's tackle this issue together!

I found a similar unsolved discussion that might be relevant to your issue: dify 0.6.8 fails to deploy local xinference vl model(qwen-vl-chat) [1].

To address the streaming error by disabling the streaming feature, you need to locate the model invocation code in your Dify codebase. Typically, this code is found in the "Model Runtime" module, which provides the interface for invoking and authenticating various models [2].

Here is a general guide on where to look and how to modify the code:

  1. Locate the Model Invocation Code:

    • The model invocation code is usually in a file related to the model you are using. For example, if you are using the Qwen-VL-Chat model, you might find the relevant code in a file like llm.py under the model_runtime module.
  2. Modify the Invocation Code:

    • Once you locate the file, look for the _invoke method or similar function where the model is invoked. You need to set the stream parameter to False in this method.

Here is an example of how you might modify the code:

def _invoke(self, model: str, credentials: dict,
            prompt_messages: list[PromptMessage], model_parameters: dict,
            tools: Optional[list[PromptMessageTool]] = None, stop: Optional[list[str]] = None,
            stream: bool = True, user: Optional[str] = None) \
        -> Union[LLMResult, Generator]:
    """
    Invoke large language model

    :param model: model name
    :param credentials: model credentials
    :param prompt_messages: prompt messages
    :param model_parameters: model parameters
    :param tools: tools for tool calling
    :param stop: stop words
    :param stream: is stream response
    :param user: unique user id
    :return: full response or stream response chunk generator result
    """
    # Collect the arguments so they can be forwarded to either handler.
    kwargs = dict(model=model, credentials=credentials, prompt_messages=prompt_messages,
                  model_parameters=model_parameters, tools=tools, stop=stop, user=user)
    if stream:
        return self._handle_stream_response(**kwargs)
    return self._handle_sync_response(**kwargs)

In your specific case, you need to set stream=False when invoking the model:

response = model.invoke(
    model='Qwen-VL-Chat',
    credentials={
        'server_url': os.environ.get('XINFERENCE_SERVER_URL'),
        'model_uid': os.environ.get('XINFERENCE_CHAT_MODEL_UID')
    },
    prompt_messages=[
        SystemPromptMessage(
            content='You are a helpful AI assistant.',
        ),
        UserPromptMessage(
            content='Hello World!'
        )
    ],
    model_parameters={
        'temperature': 0.7,
        'top_p': 1.0,
    },
    stop=['you'],
    stream=False,  # Disable streaming
    user="abc-123"
)

Make sure to replace the placeholder values with your actual model details and parameters. This should help you avoid the streaming error by disabling the streaming feature.

To continue talking to Dosu, mention @dosu.

@Howe829
Contributor

Howe829 commented Sep 11, 2024

The bot replied automatically. It told you to modify the source code, but you are running with Docker, lol.

@crazywoola
Member

We need the logs from the xinference side and from our api container.

@tigflanker
Author

@Howe829 @crazywoola
Thanks for the reply.

I traced the llm.py file inside the docker-api-1 container and modified two places, the _invoke function and the _generate function (roughly as in the sketch below).
But another issue has surfaced...
[screenshot: 微信截图_20240911154343]

dosubot, can you tell me how to find the source code where a worker job invokes this qwen-vl-chat API (such as 'response = model.invoke...')?
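
A minimal sketch of the kind of change described above, i.e. forcing stream=False inside _invoke. The class name and the _generate stub are hypothetical stand-ins; the real provider file (assumed here to be api/core/model_runtime/model_providers/xinference/llm/llm.py in Dify 0.8.x) has a richer signature:

from typing import Generator, Optional, Union


class XinferenceLLMPatched:
    """Hypothetical stand-in for the provider class in llm.py, not the real Dify class."""

    def _invoke(self, model: str, credentials: dict, prompt_messages: list,
                model_parameters: dict, tools: Optional[list] = None,
                stop: Optional[list] = None, stream: bool = True,
                user: Optional[str] = None) -> Union[dict, Generator]:
        stream = False  # the actual edit: force non-streaming regardless of the caller's request
        return self._generate(model, credentials, prompt_messages,
                              model_parameters, tools, stop, stream, user)

    def _generate(self, model, credentials, prompt_messages, model_parameters,
                  tools, stop, stream, user):
        # Stub: the real method builds and sends the request to the xinference server.
        return {"model": model, "stream": stream}


print(XinferenceLLMPatched()._invoke("Qwen-VL-Chat", {}, [], {}))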

@tigflanker
Author

@crazywoola Hi, here is the log from the xinference side:

2024-09-11T08:22:24.558692415Z 2024-09-11 01:22:24,556 xinference.api.restful_api 1 ERROR    Chat completion stream got an error: [address=0.0.0.0:41349, pid=4710] isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
2024-09-11T08:22:24.558744398Z  * (Tensor elements, Tensor test_elements, *, bool assume_unique = False, bool invert = False, Tensor out = None)
2024-09-11T08:22:24.558756994Z  * (Number element, Tensor test_elements, *, bool assume_unique = False, bool invert = False, Tensor out = None)
2024-09-11T08:22:24.558767300Z  * (Tensor elements, Number test_element, *, bool assume_unique = False, bool invert = False, Tensor out = None)
2024-09-11T08:22:24.558776930Z Traceback (most recent call last):
2024-09-11T08:22:24.558785930Z   File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1728, in stream_results
2024-09-11T08:22:24.558795803Z     async for item in iterator:
2024-09-11T08:22:24.558805087Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 340, in __anext__
2024-09-11T08:22:24.558814322Z     return await self._actor_ref.__xoscar_next__(self._uid)
2024-09-11T08:22:24.558823316Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
2024-09-11T08:22:24.558832381Z     return self._process_result_message(result)
2024-09-11T08:22:24.558841062Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
2024-09-11T08:22:24.558850515Z     raise message.as_instanceof_cause()
2024-09-11T08:22:24.558859249Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
2024-09-11T08:22:24.558868246Z     result = await self._run_coro(message.message_id, coro)
2024-09-11T08:22:24.558880874Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
2024-09-11T08:22:24.558918177Z     return await coro
2024-09-11T08:22:24.558927952Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
2024-09-11T08:22:24.558937127Z     return await super().__on_receive__(message)  # type: ignore
2024-09-11T08:22:24.558945898Z   File "xoscar/core.pyx", line 558, in __on_receive__
2024-09-11T08:22:24.559058617Z     raise ex
2024-09-11T08:22:24.559108495Z   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
2024-09-11T08:22:24.559122536Z     async with self._lock:
2024-09-11T08:22:24.559132010Z   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
2024-09-11T08:22:24.559141802Z     with debug_async_timeout('actor_lock_timeout',
2024-09-11T08:22:24.559151140Z   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
2024-09-11T08:22:24.559160616Z     result = await result
2024-09-11T08:22:24.559178825Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 431, in __xoscar_next__
2024-09-11T08:22:24.559188608Z     raise e
2024-09-11T08:22:24.559197722Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 417, in __xoscar_next__
2024-09-11T08:22:24.559207998Z     r = await asyncio.to_thread(_wrapper, gen)
2024-09-11T08:22:24.559217120Z   File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
2024-09-11T08:22:24.559226253Z     return await loop.run_in_executor(None, func_call)
2024-09-11T08:22:24.559235066Z   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-09-11T08:22:24.559243965Z     result = self.fn(*self.args, **self.kwargs)
2024-09-11T08:22:24.559253402Z   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 402, in _wrapper
2024-09-11T08:22:24.559263060Z     return next(_gen)
2024-09-11T08:22:24.559272182Z   File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 326, in _to_generator
2024-09-11T08:22:24.559281913Z     for v in gen:
2024-09-11T08:22:24.559290273Z   File "/usr/local/lib/python3.10/dist-packages/xinference/model/utils.py", line 368, in inner
2024-09-11T08:22:24.559300028Z     for obj in func(*args, **kwargs):
2024-09-11T08:22:24.559309013Z   File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 585, in _to_chat_completion_chunks
2024-09-11T08:22:24.559318311Z     for i, chunk in enumerate(chunks):
2024-09-11T08:22:24.559327806Z   File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/qwen_vl.py", line 212, in _generate_stream
2024-09-11T08:22:24.559337777Z     for response in response_generator:
2024-09-11T08:22:24.559346771Z   File "/root/.cache/huggingface/modules/transformers_modules/Qwen-VL-Chat-pytorch-7b/modeling_qwen.py", line 1021, in stream_generator
2024-09-11T08:22:24.559356738Z     for token in self.generate_stream(
2024-09-11T08:22:24.559393780Z   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2024-09-11T08:22:24.559404074Z     return func(*args, **kwargs)
2024-09-11T08:22:24.559412895Z   File "/usr/local/lib/python3.10/dist-packages/transformers_stream_generator/main.py", line 208, in generate
2024-09-11T08:22:24.559421896Z     ] = self._prepare_attention_mask_for_generation(
2024-09-11T08:22:24.559431094Z   File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 498, in _prepare_attention_mask_for_generation
2024-09-11T08:22:24.559440961Z     torch.isin(elements=inputs, test_elements=pad_token_id).any()
2024-09-11T08:22:24.559450334Z TypeError: [address=0.0.0.0:41349, pid=4710] isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
2024-09-11T08:22:24.559462690Z  * (Tensor elements, Tensor test_elements, *, bool assume_unique = False, bool invert = False, Tensor out = None)
2024-09-11T08:22:24.559471854Z  * (Number element, Tensor test_elements, *, bool assume_unique = False, bool invert = False, Tensor out = None)
2024-09-11T08:22:24.559481066Z  * (Tensor elements, Number test_element, *, bool assume_unique = False, bool invert = False, Tensor out = None)
2024-09-11T08:22:24.559489769Z 

And here is the log from docker-api-1:

2024-09-11T08:25:11.014640187Z 2024-09-11 08:25:11,014.014 INFO [Thread-34 (_generate_worker)] [_client.py:1038] - HTTP Request: POST http://xx.xx.xx.xx:9997/v1/chat/completions "HTTP/1.1 200 OK"
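
For context, the TypeError in the traceback above comes from torch.isin being called with a plain int for test_elements, which matches none of its overloads. A minimal, self-contained reproduction (the pad_token_id value is made up):

import torch

inputs = torch.tensor([[1, 2, 3, 0]])
pad_token_id = 0  # a plain Python int, as the failing code path passes it

try:
    # Fails: keyword test_elements with an int matches none of torch.isin's overloads.
    torch.isin(elements=inputs, test_elements=pad_token_id)
except TypeError as exc:
    print(exc)

# Wrapping the int in a tensor satisfies the (Tensor, Tensor) overload.
print(torch.isin(elements=inputs, test_elements=torch.tensor(pad_token_id)))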

@YihanWang0810

I have the same issue. How do I set stream=False in Dify or in Xinference? I can call the Xinference LLM on the web UI, but it returns an error when I use the API.
[screenshot]
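
One way to narrow this down (a sketch, not an official procedure) is to call the xinference OpenAI-compatible endpoint directly with stream set to false and check whether the non-streaming path works on its own. The URL and model uid below are placeholders:

import requests

# Placeholders: replace with your xinference endpoint and the deployed model uid.
XINFERENCE_URL = "http://xx.xx.xx.xx:9997/v1/chat/completions"

resp = requests.post(
    XINFERENCE_URL,
    json={
        "model": "qwen-vl-chat",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,  # exercise only the non-streaming code path
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])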

@qinxuye

qinxuye commented Oct 22, 2024

@Minamiyama Is it possible for you to look at this issue?

@sheepying

I have the same issue. Has anyone solved this problem?


dosubot bot commented Dec 9, 2024

Hi, @tigflanker. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You experienced a streaming error with the Qwen-VL-Chat model using xinference model serving with Docker.
  • Guidance was provided to disable streaming by setting stream=False, but there was confusion on where to apply this change.
  • I provided a detailed explanation on modifying the model invocation code, but modifying source code in Docker was challenging.
  • You attempted modifications and shared logs for further insights, with other users reporting similar issues.
  • Assistance was requested from @Minamiyama to help address the problem, but the issue remains unresolved.

Next Steps:

  • Please confirm if this issue is still relevant with the latest version of the Dify repository. If so, you can keep the discussion open by commenting here.
  • If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot bot added the "stale" label (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) on Dec 9, 2024
@yihong0618
Contributor

Let me take a look at this in the next few days. @laipz8200, can this be assigned to me?

dosubot bot removed the "stale" label on Dec 10, 2024
@yihong0618
Contributor

I cannot reproduce this locally. Can you provide your DSL?
