Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Duplicate Loop Over ref_doc_kv_pairs.items() #16280

Open
xinsu626 opened this issue Sep 29, 2024 · 1 comment
Open

[Bug]: Duplicate Loop Over ref_doc_kv_pairs.items() #16280

xinsu626 opened this issue Sep 29, 2024 · 1 comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@xinsu626
Copy link

Bug Description

Hello, it looks like the for loop here is duplicated and is causing a significant slowdown: https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/docstore/keyval_docstore.py#L127

Version

main branch

Steps to Reproduce

Please see the link in bug description section.

Relevant Logs/Tracbacks

No response

@xinsu626 xinsu626 added bug Something isn't working triage Issue needs to be triaged/prioritized labels Sep 29, 2024
Copy link

dosubot bot commented Sep 29, 2024

Hey @xinsu626! I'm here to help you with any bugs, questions, or contributions you have for the Llama Index repository. Let's squash those bugs together!

Yes, there is a duplicated for loop at line 127 in the keyval_docstore.py file that is causing a significant slowdown. The duplicated for loop is as follows:

for node in nodes:
    # NOTE: doc could already exist in the store, but we overwrite it
    if not allow_update and self.document_exists(node.node_id):
        raise ValueError(
            f"node_id {node.node_id} already exists. "
            "Set allow_update to True to overwrite."
        )
    ref_doc_info = None
    if isinstance(node, TextNode) and node.ref_doc_id is not None:
        ref_doc_info = self.get_ref_doc_info(node.ref_doc_id) or RefDocInfo()

    (
        node_kv_pair,
        metadata_kv_pair,
        ref_doc_kv_pair,
    ) = self._get_kv_pairs_for_insert(node, ref_doc_info, store_text)

    if node_kv_pair is not None:
        node_kv_pairs.append(node_kv_pair)
    if metadata_kv_pair is not None:
        metadata_kv_pairs.append(metadata_kv_pair)
    if ref_doc_kv_pair is not None:
        key = ref_doc_kv_pair[0]
        if key not in ref_doc_kv_pairs:
            ref_doc_kv_pairs[key] = []
        ref_doc_kv_pairs[key].append(ref_doc_kv_pair)

This loop appears twice in the add_documents method, which is unnecessary and could lead to performance issues [1].

To continue talking to Dosu, mention @dosu.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized
Projects
None yet
Development

No branches or pull requests

1 participant