-
Found out there is
results:
But we still have a -1 cluster. So I decided to reproduce the problem using pure HDBSCAN and got perfect results (see below). So my question is still relevant:
Result:
Also I noticed that you can set
-
Thank you for the link to the dummy dim_model. I've tried it and got this error. Finally I switched to PCA and found out it works well with n_components less than 60 and returns the same error for n_components >= 61 (this threshold depends on the number of examples in the dataset). I'll try to go forward with PCA, thank you.
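For what it's worth, one likely reason the threshold moves with dataset size: scikit-learn's `PCA` requires `n_components <= min(n_samples, n_features)`. A minimal sketch with hypothetical embeddings (60 documents, 384 dimensions, as a sentence-transformers model might produce):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical setup: 60 documents embedded into 384-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 384))

# n_components may not exceed min(n_samples, n_features) = 60 here,
# so 60 fits but 61 raises a ValueError.
PCA(n_components=60).fit(embeddings)
try:
    PCA(n_components=61).fit(embeddings)
except ValueError as e:
    print("PCA error:", e)
```

If the error you saw matches this one, it is a hard constraint of PCA rather than a BERTopic bug, and the limit will indeed shift with the number of documents.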
-
I have two problems; I'm not sure if they are bugs or my lacking skills.
I searched for similar discussions, found them, and learned a lot, but not enough to solve the problem.
There were as many as a dozen topics for the same string.
The code:
Check params as topic_model.get_params() output:
Several consecutive results are below. Please notice how many rows are marked as -1. Also notice that the last run has very good results, so it is not a hyperparameter issue:
I have also tried to train BERTopic on a dataset with different phrases and then applied it to a dataset of the same single phrase. Same result.
My questions:
a) Why does the clustering algorithm perform so poorly on identical phrases? Is it a clustering algorithm issue, or maybe other parts of BERTopic?
b) How can I choose hyperparameters if I can't reproduce results?
P.s.
a)
bertopic.__version__ # '0.16.4'
b) distance_function = lambda x: 1 - np.clip(cosine_similarity(x), -1, 1) was added to fix a known issue for some particular values of n_components. It does not change the experiment results.