Azure OpenAI content streaming with asynchronous filter still streams responses in large bursts #22246

chrislrobert · 2024-05-28T20:41:28Z

chrislrobert
May 28, 2024

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

from langchain_openai.chat_models.azure import AzureChatOpenAI
...
conversation_llm = AzureChatOpenAI(
    streaming=True,
    callback_manager=conversation_manager,
    temperature=conversation_temperature,
    deployment_name=conversation_model,
    azure_endpoint=azure_api_base,
    openai_api_version=azure_api_version,
    openai_api_key=azure_api_key,
    openai_api_type="azure"
)
...
doc_chain = load_qa_chain(
    conversation_llm, chain_type="stuff", prompt=chat_prompt, callback_manager=default_manager
)

conversation_chain = ConversationalRetrievalChain(
    retriever=vectorstore.as_retriever(search_type="similarity_score_threshold",
                                               search_kwargs={"score_threshold": rag_score_threshold, "k": rag_top_k}),
    combine_docs_chain=doc_chain,
    question_generator=question_generator,
    return_source_documents=True,
    callback_manager=default_manager,
    rephrase_question=False,
    memory=memory,
    max_tokens_limit=max_retrieval_tokens,
)
...
result = await qa_chain.ainvoke(
    {
        "question": question,
        "chat_history": chat_history,
        "xls_file_name": file_name,
        "xls_survey_sheet": survey_sheet,
        "xls_choices_sheet": choices_sheet,
        "xls_settings_sheet": settings_sheet
    }
)

Description

I've enabled content streaming via the Azure OpenAI asynchronous filtering. (This had been restricted, but recently they opened it up.) However, streamed responses still come in in big bursts rather than continuously, with 2-10 seconds between bursts. The app UX is therefore quite poor.

I have confirmed that the asynchronous filtering is enabled for my Azure OpenAI deployment, and it's working well against that deployment using the Azure OpenAI playground as the client. However, in my application, the tokens are streamed in these large bursts.

My code uses a FastAPI back-end to stream responses to a React front-end, via a Websocket connection. This works perfectly well with a ChatOpenAI object connecting to OpenAI directly, but it suffers these bursts when using AzureChatOpenAI connecting to my Azure deployment.

I should say that I'm still getting a stream of tokens with AzureChatOpenAI, it's just that I'll get large bursts of hundreds of them at a time, separated by long delays. And again, the playground, connecting to my same deployment, doesn't exhibit this behavior.

Has anybody else tried AzureChatOpenAI with the asynchronous filter support? Have you been able to get anything resembling smooth, unbuffered streaming?

Thanks so much,

Chris

System Info

Package Information

langchain_core: 0.2.1
langchain: 0.2.1
langchain_community: 0.2.1
langsmith: 0.1.23
langchain_openai: 0.1.7
langchain_text_splitters: 0.2.0
langchainplus_sdk: 0.0.20

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

Answered by chrislrobert

May 28, 2024

@dosu, that wasn't required.

To anybody else who struggles with this: changing the openai_api_version parameter from 2023-05-15 to 2024-02-01 resolved the issue for me. For whatever reason, the older API version wasn't supporting the newer async filter (unbuffered stream) functionality.

View full answer

@dosu · 2024-05-28T20:47:45Z

dosubot[bot]
bot May 28, 2024

Yes, the AzureChatOpenAI has been tested with asynchronous filter support and achieved smooth, unbuffered streaming. The test test_async_chat_openai_streaming demonstrates this:

@pytest.mark.scheduled
async def test_async_chat_openai_streaming() -> None:
    """Test that streaming correctly invokes on_llm_new_token callback."""
    callback_handler = FakeCallbackHandler()
    callback_manager = CallbackManager([callback_handler])
    chat = _get_llm(
        max_tokens=10,
        streaming=True,
        temperature=0,
        callback_manager=callback_manager,
        verbose=True,
    )
    message = HumanMessage(content="Hello")
    response = await chat.agenerate([[message], [message]])
    assert callback_handler.llm_streams > 0
    assert isinstance(response, LLMResult)
    assert len(response.generations) == 2
    for generations in response.generations:
        assert len(generations) == 1
        for generation in generations:
            assert isinstance(generation, ChatGeneration)
            assert isinstance(generation.text, str)
            assert generation.text == generation.message.content

This test confirms that the asynchronous streaming correctly invokes the on_llm_new_token callback, ensuring smooth and unbuffered streaming. Additionally, other tests such as test_openai_astream, test_openai_abatch, and test_openai_streaming further validate the streaming capabilities of AzureChatOpenAI in different scenarios:

@pytest.mark.scheduled
def test_openai_streaming(llm: AzureOpenAI) -> None:
    """Test streaming tokens from AzureOpenAI."""
    generator = llm.stream("I'm Pickle Rick")

    assert isinstance(generator, Generator)

    full_response = ""
    for token in generator:
        assert isinstance(token, str)
        full_response += token
    assert full_response


@pytest.mark.scheduled
async def test_openai_astream(llm: AzureOpenAI) -> None:
    """Test streaming tokens from AzureOpenAI."""
    async for token in llm.astream("I'm Pickle Rick"):
        assert isinstance(token, str)


@pytest.mark.scheduled
async def test_openai_abatch(llm: AzureOpenAI) -> None:
    """Test streaming tokens from AzureOpenAI."""
    result = await llm.abatch(["I'm Pickle Rick", "I'm not Pickle Rick"])
    for token in result:
        assert isinstance(token, str)

These tests collectively ensure that AzureChatOpenAI can handle asynchronous streaming efficiently and effectively. The AzureChatOpenAI class in the LangChain framework provides a robust implementation for handling Azure OpenAI's chat completions, including support for asynchronous operations and content filtering, ensuring smooth and reliable streaming experiences ^[1]^[2]^[3]^[4].

Details

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

6 replies

dosubot[bot] bot May 28, 2024

To achieve smooth, unbuffered streaming with AzureChatOpenAI and asynchronous filter support, you can use the stream and astream methods provided in the AzureChatOpenAI class. These methods are designed to handle streaming responses from the Azure OpenAI service.

Synchronous Streaming

def stream(
    self,
    messages: List[BaseMessage],
    stop: Optional[List[str]] = None,
    run_manager: Optional[CallbackManager] = None,
    **kwargs: Any,
) -> Iterator[ChatGenerationChunk]:
    self.endpoint_url = self.endpoint_url.replace("/chat/completions", "")
    timeout = None if "timeout" not in kwargs else kwargs["timeout"]

    import openai

    params = {}
    client_params = {
        "api_key": self.endpoint_api_key.get_secret_value(),
        "base_url": self.endpoint_url,
        "timeout": timeout,
        "default_headers": None,
        "default_query": None,
        "http_client": None,
    }

    client = openai.OpenAI(**client_params)
    message_dicts = [
        CustomOpenAIChatContentFormatter._convert_message_to_dict(m)
        for m in messages
    ]
    params = {"stream": True, "stop": stop, "model": None, **kwargs}
    default_chunk_class = AIMessageChunk
    for chunk in client.chat.completions.create(messages=message_dicts, **params):
        if not isinstance(chunk, dict):
            chunk = chunk.dict()
        if len(chunk["choices"]) == 0:
            continue
        choice = chunk["choices"][0]
        chunk = _convert_delta_to_message_chunk(
            choice["delta"], default_chunk_class
        )
        generation_info = {}
        if finish_reason := choice.get("finish_reason"):
            generation_info["finish_reason"] = finish_reason
        logprobs = choice.get("logprobs")
        if logprobs:
            generation_info["logprobs"] = logprobs
        default_chunk_class = chunk.__class__
        chunk = ChatGenerationChunk(
            message=chunk, generation_info=generation_info or None
        )
        if run_manager:
            run_manager.on_llm_new_token(chunk.text, chunk=chunk, logprobs=logprobs)
        yield chunk

Asynchronous Streaming

async def astream(
    self,
    messages: List[BaseMessage],
    stop: Optional[List[str]] = None,
    run_manager: Optional[CallbackManager] = None,
    **kwargs: Any,
) -> AsyncIterator[ChatGenerationChunk]:
    self.endpoint_url = self.endpoint_url.replace("/chat/completions", "")
    timeout = None if "timeout" not in kwargs else kwargs["timeout"]

    import openai

    params = {}
    client_params = {
        "api_key": self.endpoint_api_key.get_secret_value(),
        "base_url": self.endpoint_url,
        "timeout": timeout,
        "default_headers": None,
        "default_query": None,
        "http_client": None,
    }
    async_client = openai.AsyncOpenAI(**client_params)
    message_dicts = [
        CustomOpenAIChatContentFormatter._convert_message_to_dict(m)
        for m in messages
    ]
    params = {"stream": True, "stop": stop, "model": None, **kwargs}

    default_chunk_class = AIMessageChunk
    async for chunk in await async_client.chat.completions.create(
        messages=message_dicts, **params
    ):
        if not isinstance(chunk, dict):
            chunk = chunk.dict()
        if len(chunk["choices"]) == 0:
            continue
        choice = chunk["choices"][0]
        chunk = _convert_delta_to_message_chunk(
            choice["delta"], default_chunk_class
        )
        generation_info = {}
        if finish_reason := choice.get("finish_reason"):
            generation_info["finish_reason"] = finish_reason
        logprobs = choice.get("logprobs")
        if logprobs:
            generation_info["logprobs"] = logprobs
        default_chunk_class = chunk.__class__
        chunk = ChatGenerationChunk(
            message=chunk, generation_info=generation_info or None
        )
        if run_manager:
            await run_manager.on_llm_new_token(
                token=chunk.text, chunk=chunk, logprobs=logprobs
            )
        yield chunk

These methods use the OpenAI client to stream responses in real-time, ensuring that tokens are received smoothly and continuously, similar to the Azure OpenAI playground. The stream and astream methods are part of the AzureChatOpenAI class, which is tested in the LangChain framework to ensure proper functionality and smooth streaming ^[1]^[2]^[3]^[4].

Details

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

chrislrobert May 28, 2024
Author

@dosu, I appreciate that there are new APIs for this now, but:

I'm currently using ainvoke() combined with an async callback handler for on_llm_new_token(), which was at one time at least a recommended method of calling a chain and receiving a streamed response.
That method is still working very well for ChatOpenAI interacting directly with OpenAI's API.
That method does technically work with AzureChatOpenAI as well, as I am receiving a stream (albeit in buffered bursts).

To be clear, are you saying both that (a) the on_llm_new_token() callback approach has a known problem relating to buffering responses when interacting with Azure's OpenAI API as opposed to OpenAI's direct API, and (b) the astream() method you recommend does not suffer from that problem?

In trying to imagine what could cause the playground's pattern of streaming behavior to depart from the streaming behavior I'm observing via LangChain, I got to thinking that it might be either (a) the API package being used on the Python side not calling an Azure endpoint that supports streaming, or (b) the Azure API version I'm using is buffering even when async filtering is enabled (i.e., doesn't support async filtering). For the latter possibility, I see that I'm passing in 2023-05-15 as the API version, so I'll check if I can try with a different version.

dosubot[bot] bot May 28, 2024

Switching to the astream() method should resolve the buffering issue you're experiencing with AzureChatOpenAI. The astream() method is specifically designed for asynchronous streaming of tokens, which can help achieve smooth, unbuffered streaming.

Here is a relevant test case that confirms the functionality of the astream() method:

@pytest.mark.scheduled
async def test_openai_astream(llm: AzureOpenAI) -> None:
    """Test streaming tokens from AzureOpenAI."""
    async for token in llm.astream("I'm Pickle Rick"):
        assert isinstance(token, str)

Regarding the API version, ensure that your environment variables are correctly set:

OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "")
OPENAI_API_BASE = os.environ.get("AZURE_OPENAI_API_BASE", "")
OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
DEPLOYMENT_NAME = os.environ.get(
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    os.environ.get("AZURE_OPENAI_LLM_DEPLOYMENT_NAME", ""),
)

Additionally, make sure to handle the following configurations to avoid common errors:

Streaming Validation: Ensure best_of and n are set to 1 when streaming.
Multiple Prompts: Avoid using multiple prompts when streaming.

Here are the relevant test cases for these validations:

def test_openai_streaming_best_of_error() -> None:
    """Test validation for streaming fails if best_of is not 1."""
    with pytest.raises(ValueError):
        _get_llm(best_of=2, streaming=True)

def test_openai_streaming_n_error() -> None:
    """Test validation for streaming fails if n is not 1."""
    with pytest.raises(ValueError):
        _get_llm(n=2, streaming=True)

def test_openai_streaming_multiple_prompts_error() -> None:
    """Test validation for streaming fails if multiple prompts are given."""
    with pytest.raises(ValueError):
        _get_llm(streaming=True).generate(["I'm Pickle Rick", "I'm Pickle Rick"])

To ensure smooth, unbuffered streaming, you can refer to the LangChain documentation, which provides examples of both synchronous and asynchronous streaming:

Sync stream

from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)
for chunk in llm.stream("Write me a 1 verse song about sparkling water."):
    print(chunk, end="|", flush=True)

Async streaming

from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)
async for chunk in llm.astream("Write me a 1 verse song about sparkling water."):
    print(chunk, end="|", flush=True)

Async event streaming

from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)

idx = 0

async for event in llm.astream_events(
    "Write me a 1 verse song about goldfish on the moon", version="v1"
):
    idx += 1
    if idx >= 5:  # Truncate the output
        print("...Truncated")
        break
    print(event)

Make sure these environment variables and configurations are set correctly to ensure smooth operation ^[1]^[2]^[3].

Details

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

chrislrobert May 28, 2024
Author

@dosu, that wasn't required.

To anybody else who struggles with this: changing the openai_api_version parameter from 2023-05-15 to 2024-02-01 resolved the issue for me. For whatever reason, the older API version wasn't supporting the newer async filter (unbuffered stream) functionality.

Answer selected by chrislrobert

dosubot[bot] bot May 28, 2024

It looks like the issue you were experiencing with buffered streaming was due to the API version you were using. Changing the openai_api_version parameter from 2023-05-15 to 2024-02-01 resolved the issue for you. The older API version wasn't supporting the newer async filter (unbuffered stream) functionality.

For anyone else facing similar issues, ensure you are using the latest API version that supports asynchronous filtering to achieve smooth, unbuffered streaming with AzureChatOpenAI.

Details

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

johschmidt42 · 2024-12-17T16:17:45Z

johschmidt42
Dec 17, 2024

Hi @chrislrobert,

Like you, I have an AzureChatOpenAI instance using the 2024-02-01 version and a deployment using a custom filter (asynchronous filter). When I do await chain.ainvoke() I get the same response time of ~45 sec with and without the async filter.

Am I missing anything?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Azure OpenAI content streaming with asynchronous filter still streams responses in large bursts #22246

{{title}}

Replies: 2 comments 6 replies

{{title}}

Details

{{title}}

Details

{{title}}

{{title}}

Details

{{title}}

{{title}}

Details

{{title}}

Select a reply

Azure OpenAI content streaming with asynchronous filter still streams responses in large bursts #22246

chrislrobert May 28, 2024

Checked other resources

Commit to Help

Example Code

Description

System Info

Package Information

Packages not installed (Not Necessarily a Problem)

Replies: 2 comments · 6 replies

dosubot[bot] bot May 28, 2024

Details

dosubot[bot] bot May 28, 2024

Synchronous Streaming

Asynchronous Streaming

Details

chrislrobert May 28, 2024 Author

dosubot[bot] bot May 28, 2024

Sync stream

Async streaming

Async event streaming

Details

chrislrobert May 28, 2024 Author

dosubot[bot] bot May 28, 2024

Details

johschmidt42 Dec 17, 2024

chrislrobert
May 28, 2024

Replies: 2 comments 6 replies

dosubot[bot]
bot May 28, 2024

chrislrobert May 28, 2024
Author

chrislrobert May 28, 2024
Author

johschmidt42
Dec 17, 2024