Replies: 22 comments
-
We explored this direction but ultimately decided against pursuing it. The main reason is that most OpenAI-like solutions offer no control over individual decoding steps, a crucial component that benefits from code-completion-specific optimization. For instance, handling long lists of stop words and applying grammar constraints at each decoding step becomes challenging. Revisiting this approach might be viable if a decoding-step-level API becomes widely adopted in the future.
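To make "control over decoding steps" concrete, here is a minimal sketch of a decoding loop that inspects the partial output after every token. It is not Tabby's implementation: `decode`, `make_fake_model`, and the replayed token list are hypothetical names invented for illustration, and the stop check only looks at the end of the output, so it assumes a stop word never ends in the middle of a token.

```python
from typing import Callable, Iterable, List


def decode(step: Callable[[str], str], prompt: str,
           stop_words: Iterable[str], max_steps: int = 64) -> str:
    """Toy decoding loop. `step` is a hypothetical callable that returns
    the text of the next token given the context so far."""
    stops = [s for s in stop_words if s]  # ignore empty stop words
    out = ""
    for _ in range(max_steps):
        out += step(prompt + out)
        # Step-level hook: this check runs after every single token,
        # which a completion-only HTTP API does not expose.
        for s in stops:
            if out.endswith(s):
                return out[: -len(s)]
    return out


def make_fake_model(tokens: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: replays a fixed token sequence."""
    it = iter(tokens)
    return lambda _context: next(it, "")


completion = decode(make_fake_model(["def", " f", "()", ":", "\n\n", "x"]),
                    prompt="", stop_words=["\n\n"])
print(repr(completion))
```

With an OpenAI-style endpoint, the loop body lives on the server, so a client can only pass parameters and receive the finished text.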
-
Do you think it's so hard and complex that it can't even be formulated as a protocol? Something like a Tabby Inference Protocol, which an inference engine either supports or is not compatible with? If you pick a protocol that suits your code, it shouldn't be hard to support, and it may even benefit Tabby's codebase in the long run.
-
:) It's not so much about complexity as it is about capability. With an interface like OpenAI's, we relinquish control over the intermediate decoding steps. Many optimizations that Tabby currently relies on cannot be easily implemented, for instance a lengthy stop words list.
-
I see no problem implementing even a very long stop-word list in my setup, even a long list of long stop words with optimal stopping/lookup. I've been through "a lot" with open models 😂
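One common way to get fast lookup over a long stop-word list is a trie built over the reversed stop words, so each check walks at most as many characters as the longest stop word, regardless of how many stop words there are. The sketch below is an assumption about what "optimal lookup" could look like, not the commenter's or Tabby's actual code; `StopTrie` and `match_suffix` are hypothetical names.

```python
class StopTrie:
    """Trie over reversed stop words, used to test whether the generated
    text currently ends with any stop word."""

    def __init__(self, stop_words):
        self.root = {}
        for word in stop_words:
            if not word:
                continue  # an empty stop word would match everything
            node = self.root
            for ch in reversed(word):
                node = node.setdefault(ch, {})
            node[None] = len(word)  # terminal marker stores the word length

    def match_suffix(self, text):
        """Return the length of the shortest stop word that is a suffix
        of `text`, or 0 if none matches."""
        node = self.root
        for ch in reversed(text):
            node = node.get(ch)
            if node is None:
                return 0
            if None in node:
                return node[None]
        return 0


trie = StopTrie(["\n\n", "\nclass ", "</s>"])
print(trie.match_suffix("x = 1\n\n"))   # length of the matched stop word
print(trie.match_suffix("x = 1\n"))     # no stop word matches yet
```

The lookup itself is cheap; the open question raised in this thread is whether the inference server lets a client run such a check at every decoding step.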
-
I can even provide support for extended methods, like forcing the response to be JSON, etc.
-
Thank you for the explanations.
-
Continuing from #854. Currently Tabby supports llama.cpp bindings and HTTP bindings to vertex-ai and fastchat. Are there plans to support other bindings, such as OpenAI endpoints via HTTP or a similar protocol? Thanks for the responses.
-
Hey @sundaraa-deshaw, #795 (comment) explains why we don't want an OpenAI-like HTTP interface.
-
Thanks, I was wondering if we can have a binding to the exllama[v2] inference engine, like how it is done for llama.cpp today?
-
That's possible. The trait is defined at https://github.com/TabbyML/tabby/blob/main/crates/tabby-inference/src/lib.rs. Could you share some of your findings on where exllama has an advantage over llama.cpp?
-
Thanks, are there plans to add such a binding? exllama turned out to be good for inference on GPU, compared to llama.cpp on CPU. The memory usage for a GPTQ quantized model was 2-3x less than running the non-quantized model (Llama 13B) on llama.cpp on GPU.
-
Since Tabby seems to support Fastchat, would it be possible to support Ollama HTTP bindings? They have a decent list of integrations already. Ollama is also using llama.cpp under the hood.
-
Fastchat isn't supported; it was part of the exploration mentioned in an earlier reply and was eventually abandoned for the reasons discussed above (lack of control during decoding).
-
@wsxiaoys Thanks and sorry. I was misled by the fastchat.rs file in the repo. I thought Tabby would support it somehow.
-
No problem - it's not compiled into Tabby by default (it's behind a feature flag) and is left as a reference.
-
Hi, I followed the discussion but could not figure out exactly what it means for me. I have Codellama running in the cloud and want to connect Tabby to it. Is there a way to do so, or do I have to use Tabby Server with a local GPU/CPU?
-
Hey @MehrCurry, the short answer is no. Tabby comes with its own inference stack. You could deploy Tabby to a cloud GPU (we have several tutorials on this, e.g. https://tabby.tabbyml.com/docs/installation/hugging-face/).
-
@wsxiaoys it is doable to modify the stop words for each model file in ollama. I'm only just learning about stop words now, and I only have a surface-level understanding of the tabbyml inference stack. So I'm not suggesting that the ollama configuration is feature-complete enough to plug into the tabbyml stack. But it might be?
-
It looks like it's also possible to modify stop words on the fly using the ollama API rather than just the model files.
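As a rough sketch of what that could look like: Ollama's `/api/generate` endpoint accepts an `options` object, and `stop` is one of the parameters it takes there. The model name, prompt, and stop list below are illustrative assumptions, and `build_generate_request` is a hypothetical helper. The snippet only builds the request body and does not send it.

```python
import json


def build_generate_request(model, prompt, stop_words):
    """Build a JSON-serializable body for a POST to Ollama's /api/generate.
    The stop list is passed per request instead of via the Modelfile."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"stop": list(stop_words)},
    }


# Illustrative values only; use whatever model you have pulled locally.
body = json.dumps(
    build_generate_request("codellama", "def fib(n):", ["\n\n", "\nclass ", "\ndef "])
)
print(body)
```

This changes which stop words the server applies, but the check itself still runs inside the server's decoding loop, not in the client.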
-
It also looks like GBNF grammar support is in the works in ollama. Are there any other dealbreakers beyond grammar/stop-word support?
-
If I understand correctly, Ollama is essentially just a wrapper around llama.cpp's server API, which, in turn, utilizes this stop words implementation: As far as I know, it operates in O(N) time complexity, where N equals the number of stop words. Feel free to give it a try to see how the decoding performs with a stop sequence list of approximately 20 entries, for example, like the one below. (Hint: It will be slow, as is any implementation that supports a dynamic stop word list.)
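To illustrate the O(N) cost in isolation, here is a small sketch that counts suffix comparisons for a linear scan over the stop list at every decoding step. It is not llama.cpp's actual code; `find_stop_linear`, the 20 placeholder stop words, and the 100 fake tokens are all made up for the example.

```python
def find_stop_linear(text, stop_words):
    """Linear scan: compare the end of `text` against each stop word in turn.
    Returns (matched stop word or None, number of comparisons made)."""
    checks = 0
    for word in stop_words:
        checks += 1
        if word and text.endswith(word):
            return word, checks
    return None, checks


# 20 placeholder stop words, none of which ever matches the fake output.
stop_words = [f"\n<stop{i}>" for i in range(20)]

total_checks = 0
text = ""
for token in ["tok"] * 100:  # 100 simulated decoding steps
    text += token
    _, checks = find_stop_linear(text, stop_words)
    total_checks += checks

print(total_checks)  # steps multiplied by the number of stop words
```

The work grows with both the number of steps and the length of the stop list, which is the cost being described here.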
-
TabbyML supports vLLM already: https://tabby.tabbyml.com/docs/references/models-http-api/vllm/
-
Hi,
I currently use vLLM for other services, and I am deeply interested in connecting your extension with a vLLM server. Do you think it's possible? I'm using the OpenAI API format, so if there is a possibility to connect the extension to any server using the OpenAI API, that would be great.
Thanks for the awesome work!
Please reply with a 👍 if you want this feature.