Transform unavailable when model was fit with only a single data sample. #2053
-
Hi! I have 2 lists of strings:

```python
main = [...]
secondary = [...]
```

I want to use the `main` list to extract topics, then use those topics as a zero-shot topic list so I can train a model on the `secondary` list. But when I pass those topics in to train a new model with zero-shot, I get this error:

```
Transform unavailable when model was fit with only a single data sample.
```

There are around 53 strings in each list. Here is my model declaration:

```python
representation_model = OpenAI(
    client=openai_client, model="gpt-3.5-turbo", delay_in_seconds=10, chat=True
)
topic_model = BERTopic(
    embedding_model=embedding_model,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.3,
    vectorizer_model=vectorizer_model,
    min_topic_size=2,
    nr_topics="auto",
    representation_model=representation_model,
)
topics, _ = topic_model.fit_transform(documents=texts, embeddings=np.array(embeddings))
```

What am I doing wrong, and how can I overcome this error? Thanks!
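For context, here is a minimal sketch of the two-step setup being described, assuming `main` and `secondary` are plain lists of strings; the label-extraction step and variable names are illustrative, not the asker's exact code:

```python
from bertopic import BERTopic

# Step 1: fit a regular model on the main list to discover topics.
main_model = BERTopic(min_topic_size=2)
main_model.fit_transform(main)

# Collect one label per discovered topic, skipping the -1 outlier topic.
topic_info = main_model.get_topic_info()
zeroshot_topic_list = topic_info[topic_info.Topic != -1].Name.tolist()

# Step 2: fit a zero-shot model on the secondary list using those labels.
secondary_model = BERTopic(
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.3,
    min_topic_size=2,
)
topics, probs = secondary_model.fit_transform(secondary)
```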
-
I'm missing a bit of information to get the complete picture. Could you share your full code and your full error message? That helps me understand how certain variables are created, the order of things, etc. What version of BERTopic are you using? Lastly, how many documents are in each of your lists?
It is quite the opposite. Almost all documents in `secondary` are matched with the topics you created from `main`. What happens is that there was just a single document not matched, which was then put through the default BERTopic pipeline. See the entire process here. In practice, you could also increase the `zeroshot_min_similarity` value to make sure that there isn't one document left over but potentially multiple.
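As an illustration of that last suggestion, here is a minimal variation of the declaration from the question; the value 0.5 is only an example and would need tuning against the data:

```python
topic_model = BERTopic(
    embedding_model=embedding_model,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.5,  # raised from 0.3: stricter matching leaves
                                  # several unmatched documents, not exactly one
    vectorizer_model=vectorizer_model,
    min_topic_size=2,
    nr_topics="auto",
    representation_model=representation_model,
)
```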