-
Notifications
You must be signed in to change notification settings - Fork 15.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Community][fix] Fix Azure cosmos db no SQL similarity search with score and mmr #28479
base: master
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
@@ -260,6 +262,28 @@ def delete_document_by_id(self, document_id: Optional[str] = None) -> None: | |||
raise ValueError("No document ids provided to delete.") | |||
self._container.delete_item(document_id, partition_key=document_id) | |||
|
|||
def _select_relevance_score_fn(self) -> Callable[[float], float]: | |||
""" | |||
The 'correct' relevance function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values are fetched from here : https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/vector-search#container-vector-policies
@@ -121,6 +121,9 @@ def __init__( | |||
self._embedding_key = self._vector_embedding_policy["vectorEmbeddings"][0][ | |||
"path" | |||
][1:] | |||
self._distance_strategy = self._vector_embedding_policy["vectorEmbeddings"][0][ | |||
"distanceFunction" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this guaranteed to always be specified? seems like we only check that self._vector_embedding_policy isn't empty if self._create_container is True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure on what's the intended logic of this code, when I wrote it I followed the embedding_key
up above.
But you are right. vector_embedding_policy
in this case is a dict. Nothing enforces what this dict needs to be if create_container
is not specified. I think there's a minor issue in the code here that can be addressed.
In essence, if we want to use an azure cosmos db no sql vector store, it must have the vector search feature enabled.
To enable and specify it, when we create the container, we must specify (among other things) a Vector Policy (the dict in question). microsoft link
Once created, it's possible to query for these info in the properties of the container. (through (vector search may be activated but not exposed, see comment down below)container.read()
)
Additionally, once created, _database.create_container_if_not_exists
(the method we use here to create our connection to the container) will not be overriden by the new specified policy (and it's totally valid also in this case to give it None policy).
So here are my suggested changes:
- make
vector_embedding_policy
optional since we only force the checks if create container is true self._vector_embedding_policy shouldn't be acquired fromself._vector_embedding_policy should be acquired from container if it is exposed otherwise from user specified optionsvector_embedding_policy
but from the container itself- optionally add a warning if there's a mismatch between container exposed
vector_embedding_policy
and used specifiedvector_embedding_policy
LMK if this is okay with you ! (I added these changes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I was testing and this assumption is not always valid
Once created, it's possible to query for these info in the properties of the container. (through container.read())
It seems there is a bug in azure cosmos db no sql but if the container was the result of a partition_key migration then it seems the information on the vector policy is lost (so not possible to read through container.read()
). However, the container still fully supports vector search !
I'll readapt the implementation to account for this case. It seems no longer possible to use container.read() as the one source of truth.
I'm sorry for the long post I wanted to be thorooughly descriptive and argumentative in my implementation choices!
@@ -273,8 +336,12 @@ def _similarity_search_with_score( | |||
if pre_filter is None or pre_filter.get("limit_offset_clause") is None: | |||
query += "TOP @limit " | |||
|
|||
embedding_field = "" | |||
if with_embedding: | |||
embedding_field = "c[@embeddingKey] as embeddingKey, " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a minor optimization but its impact can be substantial: embeddings needed only when the with_embedding
argument is specified as True
Removing it from default because it adds a heavier weight on the query
When we ran our tests, even with chunk of size 4K tokens, the retrieval is still so much faster when we don't query for the embeddings (we go from 0.3s/chunk on a test to retrieve 100 chunks to ~0.02s/chunk)
Issues:
VectorStoreRetriever
instanciated fromAzureCosmosDBNoSqlVectorSearch
because it requires an implementation of__select_relevance_score_fn_
.Changes:
VectorStoreRetriever
a rather typical implementation was added (example for Qdrant)self._embedding_key
and has "$1" (that represents it's order in the parameters). Since Cosmos DB’s SQL API does not support parameterized aliases or dynamic aliasing within the query, we chose to name it 'embeddingKey'These changes were tested locally.