Assign existing topics to new sentences based on embeddings #2052
Unanswered
9j7axvsLuF
asked this question in
Q&A
-
Hmmm, I'm not sure if this is possible in BERTopic at the moment, since dynamic topic modeling was meant to be used on the data the model was fitted on (so corpus A). Having said that, it might be possible if you do something like this:

# Fit the model on corpus A
topic_model.fit(corpus_A)
# Assign topics to corpus B using its precomputed embeddings;
# .transform returns a (topics, probabilities) tuple
topics_B, probs_B = topic_model.transform(corpus_B, embeddings=pre_computed_embeddings_B)
# Apply dynamic topic modeling with the assigned topics
topics_over_time = topic_model.topics_over_time(
    docs=corpus_B,
    timestamps=timestamps_B,
    topics=topics_B
)

This might work but I haven't tried it myself.
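If `.transform` does not give stable assignments, a simpler fallback is nearest-centroid assignment on the embeddings themselves. The sketch below is not BERTopic's own method; it is a plain-Python illustration that assumes you have the corpus A embeddings, the topic IDs BERTopic assigned to them, and the corpus B embeddings from the same embedding model. The helper names (`cosine`, `topic_centroids`, `assign_topics`) and the `min_similarity` threshold are hypothetical, and the 2-D vectors are toy data standing in for real sentence embeddings.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_centroids(embeddings_a, topics_a):
    """Mean embedding per topic, skipping the outlier topic -1."""
    grouped = defaultdict(list)
    for emb, topic in zip(embeddings_a, topics_a):
        if topic != -1:
            grouped[topic].append(emb)
    return {
        topic: [sum(col) / len(vecs) for col in zip(*vecs)]
        for topic, vecs in grouped.items()
    }


def assign_topics(embeddings_b, centroids, min_similarity=0.0):
    """Give each embedding the topic of its most similar centroid.

    Falls back to -1 (outlier) when the best similarity is below
    min_similarity. The threshold is an assumption for illustration.
    """
    assigned = []
    for emb in embeddings_b:
        best_topic, best_sim = max(
            ((t, cosine(emb, c)) for t, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        assigned.append(best_topic if best_sim >= min_similarity else -1)
    return assigned


# Toy data: corpus A embeddings with their fitted topics, corpus B embeddings.
embeddings_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
topics_a = [0, 0, 1, 1, -1]
embeddings_b = [[0.8, 0.2], [0.2, 0.8]]

centroids = topic_centroids(embeddings_a, topics_a)
topics_b = assign_topics(embeddings_b, centroids)
print(topics_b)  # [0, 1]
```

Because every corpus B sentence is mapped onto a centroid built from corpus A, the set of topic IDs cannot drift away from the original model. With real data you would likely do the same computation with NumPy for speed.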
-
I can't figure out how to do the following. I have a corpus of interview transcripts. My pipeline involves segmenting this corpus into sentences (about 10,000 sentences, let's call this corpus A), pre-computing embeddings for these sentences, then fitting a topic model with the pre-computed embeddings. So far, so good – I get a nice topic model of corpus A.
The problem is that I have another corpus of passages that have been manually extracted from interview transcripts and assigned timestamps. I want to do dynamic topic modeling on that corpus, using the existing topic model I fit for corpus A. After segmenting these timestamped passages into sentences, I get about 5,000 sentences (let's call this corpus B). Conceptually, corpus B should be a strict subset of corpus A, but in practice it isn't, both because of slight differences in how the passages were manually extracted from corpus A (such as corrections of small transcription mistakes) and because the sentence segmentation algorithm splits some sentences at different points.
I pre-computed embeddings for corpus B using the same embedding model as for corpus A. What I would like to do is assign existing topics (from the topic model of corpus A) to the sentences in corpus B, so that I can do dynamic topic modeling with the timestamps. I'd like to use the pre-computed embeddings for corpus B to make these assignments.
Basically, I want to avoid a situation in which the topics I see in my dynamic topic model of corpus B differ from the topics in my static topic model of corpus A, since conceptually B is a subset of A and its sentences should be distributed across the same topics. What would be the best way to achieve that?
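One way to check that requirement, once topic IDs have been assigned to corpus B, is to tally them per timestamp against the fixed set of topic IDs from the corpus A model. The sketch below is a plain-Python illustration, not BERTopic's `topics_over_time` (which also recomputes topic representations per time step). The function name `topic_counts_over_time`, the variable names, and the toy values are hypothetical.

```python
from collections import Counter, defaultdict


def topic_counts_over_time(topics_b, timestamps_b, known_topics):
    """Count how often each known topic occurs at each timestamp.

    Raises ValueError if corpus B contains a topic ID that the
    corpus A model does not have, which is the situation to avoid.
    """
    counts = defaultdict(Counter)
    for topic, ts in zip(topics_b, timestamps_b):
        if topic not in known_topics:
            raise ValueError(f"topic {topic} is not in the corpus A model")
        counts[ts][topic] += 1
    # Report every known topic at every timestamp, including zero counts,
    # so the topic set is identical across time steps.
    return {
        ts: {topic: counts[ts][topic] for topic in sorted(known_topics)}
        for ts in sorted(counts)
    }


# Toy data: topic IDs from the corpus A model (including the outlier topic -1),
# plus assigned topics and timestamps for four corpus B sentences.
known_topics = {-1, 0, 1}
topics_b = [0, 0, 1, -1]
timestamps_b = [2019, 2019, 2020, 2020]

over_time = topic_counts_over_time(topics_b, timestamps_b, known_topics)
for ts, row in over_time.items():
    print(ts, row)
```

Every row has the same keys as the corpus A topic set, so a topic that is absent in a given period shows up as a zero count rather than disappearing.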