[Community][fix] Fix Azure cosmos db no SQL similarity search with score and mmr #28479

wassim-mechergui-shift · 2024-12-03T16:08:18Z

Issues:

Similarity search with score does not work for a VectorStoreRetriever instantiated from AzureCosmosDBNoSqlVectorSearch, because the retriever requires an implementation of _select_relevance_score_fn.
MMR retrieval does not work because copying the metadata field from the item into the document metadata is currently bugged (issues: Azure Cosmos DB _similarity_search_with_score fails with KeyError: 'metadata' #26097, [Bug] Azure cosmos db no sql vector store similarity search method "mmr" #28476).

Changes:

VectorStoreRetriever fix:
- For similarity search from VectorStoreRetriever, a fairly typical implementation of _select_relevance_score_fn was added (see the Qdrant vector store for an example).
with_embedding in similarity search:
- Fix for the with_embedding keyword: due to a change made in #26097 (Azure Cosmos DB _similarity_search_with_score fails with KeyError: 'metadata'), the returned item no longer has a self._embedding_key field; the embedding is returned under "$1" instead (which represents its order in the parameters). Since Cosmos DB's SQL API does not support parameterized aliases or dynamic aliasing within the query, we chose to name the alias 'embeddingKey'.
- Optimization: if with_embedding is not True, the embeddings are no longer queried. Querying the embeddings is heavy and makes requests much slower.
Changed initialisation of the vector policy (from which we get the embedding key and the distance function):
- If the user specifies a dict, we check that it has the correct form.
- If the user enables the option to create the container when it doesn't exist, we check that the policy is not null.
- For the final vector policy property:
  - If the container exposes one, we use that one (and raise a warning if the user-specified policy differs).
  - If the container doesn't expose one, we use the user-specified one (with an error if nothing was specified).
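
As an illustration of the "correct form" check described above, here is a minimal sketch (not the code from this PR). The helper name `parse_vector_policy` and the exact error messages are hypothetical; the dict keys (`vectorEmbeddings`, `path`, `distanceFunction`) and the `[1:]` stripping of the leading slash follow the diff in this PR, and the allowed distance values are assumed from the Cosmos DB vector policy documentation.

```python
from typing import Any, Tuple

# Assumed set of distance functions accepted by a Cosmos DB vector policy.
ALLOWED_DISTANCE_FUNCTIONS = {"cosine", "euclidean", "dotproduct"}


def parse_vector_policy(policy: Any) -> Tuple[str, str]:
    """Validate a vector policy dict and return (embedding_key, distance_function).

    Hypothetical helper: raises ValueError instead of letting a KeyError or
    IndexError surface later in __init__.
    """
    if not isinstance(policy, dict):
        raise ValueError("vector_embedding_policy must be a dict")

    embeddings = policy.get("vectorEmbeddings")
    if not isinstance(embeddings, list) or not embeddings:
        raise ValueError("'vectorEmbeddings' must be a non-empty list")

    first = embeddings[0]
    if not isinstance(first, dict):
        raise ValueError("each entry of 'vectorEmbeddings' must be a dict")

    path = first.get("path")
    if not isinstance(path, str) or not path.startswith("/") or len(path) < 2:
        raise ValueError("'path' must look like '/embedding'")

    distance = first.get("distanceFunction")
    if not isinstance(distance, str) or distance not in ALLOWED_DISTANCE_FUNCTIONS:
        raise ValueError(f"unsupported 'distanceFunction': {distance!r}")

    # Drop the leading '/' to get the field name used in queries.
    return path[1:], distance
```

Returning both values from one validated read keeps the embedding key and the distance function from being looked up separately on an unchecked dict.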

These changes were tested locally.


wassim-mechergui-shift · 2024-12-03T16:09:18Z

libs/community/langchain_community/vectorstores/azure_cosmos_db_no_sql.py

@@ -260,6 +262,28 @@ def delete_document_by_id(self, document_id: Optional[str] = None) -> None:
            raise ValueError("No document ids provided to delete.")
        self._container.delete_item(document_id, partition_key=document_id)

+    def _select_relevance_score_fn(self) -> Callable[[float], float]:
+        """
+        The 'correct' relevance function


The values are taken from https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/vector-search#container-vector-policies
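
For readers unfamiliar with the pattern, here is a standalone sketch of how a relevance-score selector keyed on the container's distance function might look. It is not the PR's implementation: the function names are hypothetical, and the normalisation formulas are assumptions modelled on common vector store conventions. Whether the raw score returned by Cosmos DB should be treated as a distance or as a similarity for a given function should be checked against the Microsoft documentation linked above.

```python
import math
from typing import Callable, Dict


def cosine_relevance(distance: float) -> float:
    # Assumes the input is a cosine distance in [0, 1].
    return 1.0 - distance


def euclidean_relevance(distance: float) -> float:
    # Assumes unit-norm embeddings, so the distance is at most sqrt(2).
    return 1.0 - distance / math.sqrt(2)


def dot_product_relevance(distance: float) -> float:
    # Assumed convention for inner-product scores.
    return 1.0 - distance if distance > 0 else -1.0 * distance


# Keys are the distance function names assumed from the Cosmos DB vector policy.
_RELEVANCE_FNS: Dict[str, Callable[[float], float]] = {
    "cosine": cosine_relevance,
    "euclidean": euclidean_relevance,
    "dotproduct": dot_product_relevance,
}


def select_relevance_score_fn(distance_function: str) -> Callable[[float], float]:
    """Return the normaliser for a distance function, or raise ValueError."""
    try:
        return _RELEVANCE_FNS[distance_function]
    except KeyError:
        raise ValueError(
            f"Unknown distance function {distance_function!r}; "
            f"expected one of {sorted(_RELEVANCE_FNS)}"
        ) from None
```

Raising on an unknown value, rather than silently defaulting, surfaces a malformed vector policy at retrieval time.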

baskaryan · 2024-12-09T04:18:36Z

libs/community/langchain_community/vectorstores/azure_cosmos_db_no_sql.py

@@ -121,6 +121,9 @@ def __init__(
        self._embedding_key = self._vector_embedding_policy["vectorEmbeddings"][0][
            "path"
        ][1:]
+        self._distance_strategy = self._vector_embedding_policy["vectorEmbeddings"][0][
+            "distanceFunction"


is this guaranteed to always be specified? seems like we only check that self._vector_embedding_policy isn't empty if self._create_container is True

I'm not really sure what the intended logic of this code is; when I wrote it, I followed the embedding_key handling just above.
But you are right: vector_embedding_policy is a dict here, and nothing enforces its shape if create_container is not specified. I think there's a minor issue in the code here that can be addressed.

In essence, to use an Azure Cosmos DB NoSQL vector store, the container must have the vector search feature enabled.
To enable and configure it, we must specify (among other things) a vector policy (the dict in question) when we create the container. microsoft link
~~Once created, it's possible to query for this info in the properties of the container (through container.read()).~~ (vector search may be activated but not exposed, see comment down below)

Additionally, once the container exists, _database.create_container_if_not_exists (the method we use here to create our connection to the container) will not override its policy with the newly specified one (so it is also totally valid in this case to pass a None policy).

So here are my suggested changes:

make vector_embedding_policy optional since we only force the checks if create container is true

~~self._vector_embedding_policy shouldn't be acquired from vector_embedding_policy but from the container itself~~ self._vector_embedding_policy should be acquired from the container if it is exposed, otherwise from the user-specified options

optionally add a warning if there's a mismatch between the container-exposed vector_embedding_policy and the user-specified vector_embedding_policy

LMK if this is okay with you! (I added these changes.)

Actually, I was testing and this assumption is not always valid:

Once created, it's possible to query for this info in the properties of the container (through container.read()).

There seems to be a bug in Azure Cosmos DB NoSQL: if the container is the result of a partition_key migration, the vector policy information appears to be lost (so it cannot be read through container.read()). However, the container still fully supports vector search!

I'll adapt the implementation to account for this case. It seems container.read() can no longer be used as the single source of truth.

I'm sorry for the long post; I wanted to be thoroughly descriptive and to justify my implementation choices!
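
The fallback discussed in this thread (prefer the policy exposed by the container, otherwise fall back to the user-specified one) can be sketched roughly as follows. This is an illustrative sketch, not the merged code: `resolve_vector_policy` and its parameters are hypothetical names, and `container_policy` stands in for whatever the container properties expose, which may be empty after a partition key migration.

```python
import warnings
from typing import Any, Dict, Optional


def resolve_vector_policy(
    container_policy: Optional[Dict[str, Any]],
    user_policy: Optional[Dict[str, Any]],
    container_name: str = "<container>",
) -> Dict[str, Any]:
    """Pick the vector policy to use for queries.

    Prefers the policy exposed by the container; falls back to the
    user-specified one; raises if neither is available.
    """
    if container_policy:
        if user_policy and user_policy != container_policy:
            # The container's policy cannot be overridden after creation,
            # so the user's value is ignored and a warning is emitted.
            warnings.warn(
                f"Vector policy specified for container '{container_name}' "
                "differs from the one exposed by the container; "
                "using the container's policy.",
                UserWarning,
                stacklevel=2,
            )
        return container_policy

    if user_policy:
        # Container supports vector search but does not expose its policy.
        return user_policy

    raise ValueError(
        f"No vector policy exposed by container '{container_name}' "
        "and none was specified."
    )
```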


wassim-mechergui-shift · 2024-12-09T12:26:34Z

libs/community/langchain_community/vectorstores/azure_cosmos_db_no_sql.py

@@ -273,8 +336,12 @@ def _similarity_search_with_score(
        if pre_filter is None or pre_filter.get("limit_offset_clause") is None:
            query += "TOP @limit "

+        embedding_field = ""
+        if with_embedding:
+            embedding_field = "c[@embeddingKey] as embeddingKey, "


This is a minor optimization, but its impact can be substantial: the embeddings are needed only when the with_embedding argument is True.
I removed them from the default projection because they make the query heavier.

In our tests, even with chunks of 4K tokens, retrieval is much faster when we don't query for the embeddings (on a test retrieving 100 chunks, we go from 0.3s/chunk to ~0.02s/chunk).
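
To make the shape of the change concrete, here is a simplified sketch of building the SELECT clause with an optional embedding projection. Only the `TOP @limit` handling and the `c[@embeddingKey] as embeddingKey` alias come from the diff above; the function name `build_select_clause`, the other projected columns (`c.id`, `c.text`, `c.metadata`), and the omission of WHERE and ORDER BY clauses are assumptions made for illustration.

```python
def build_select_clause(with_embedding: bool, use_top: bool = True) -> str:
    """Build the SELECT part of the similarity query (simplified sketch)."""
    query = "SELECT "
    if use_top:
        query += "TOP @limit "

    # Project the embedding only on request: returning vectors for every
    # hit makes the response much larger and the query slower.
    embedding_field = ""
    if with_embedding:
        # A fixed alias is used because the alias itself cannot be a parameter.
        embedding_field = "c[@embeddingKey] as embeddingKey, "

    query += (
        "c.id, "
        + embedding_field
        + "c.text, c.metadata, "
        "VectorDistance(c[@embeddingKey], @embeddings) AS SimilarityScore "
        "FROM c"
    )
    return query
```

With this shape, the result item carries an `embeddingKey` field only when `with_embedding` is True, so the code that reads the item needs the same guard.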

add fix to mmr search

205300a

dosubot bot added labels Dec 3, 2024: size:M (This PR changes 30-99 lines, ignoring generated files), community (Related to langchain-community), Ɑ: vector store (Related to vector store module), 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)

wassim-mechergui-shift commented Dec 3, 2024

wassim-mechergui-shift added 4 commits December 3, 2024 17:09

cleanup trailing character

296f1c3

fix linting

f001717

fix linting

24f39c5

change used naming

477a79e

wassim-mechergui-shift changed the title ~~Fix Azure cosmos db no SQL similarity search with score and mmr~~ [Community][Bugfix] Fix Azure cosmos db no SQL similarity search with score and mmr Dec 6, 2024

wassim-mechergui-shift changed the title ~~[Community][Bugfix] Fix Azure cosmos db no SQL similarity search with score and mmr~~ [Community][fix] Fix Azure cosmos db no SQL similarity search with score and mmr Dec 6, 2024

baskaryan reviewed Dec 9, 2024

wassim-mechergui-shift added 6 commits December 9, 2024 12:21

enforce correct property management

d1e32cb

reflect vector policy specifications

d6179dd

update warnings to reflect situation

6ac62bc

Merge branch 'master' into azure-cosmos-db-no-sql-fix-mmr-similarity-…

292e64b

…search

specify container name in warning

9bbbd79

fix linting

a417523

wassim-mechergui-shift commented Dec 9, 2024

lint

2aa26ee

wassim-mechergui-shift requested a review from baskaryan December 9, 2024 12:31

efriis assigned baskaryan Dec 10, 2024
