How to fine tune UMAP #2088
-
Hi @MaartenGr, thank you so much for creating this tool, which has made topic modeling so much fun. Our group is currently preparing a paper, and we used BERTopic as our main tool. We noticed that the results are very sensitive to the UMAP hyperparameters n_neighbors and min_dist. May I ask how to evaluate the outcomes produced by different hyperparameter values? Thank you in advance. Li
Replies: 1 comment 2 replies
-
Thank you for the kind words! There's no single way to evaluate the effect of those hyperparameters. Most importantly, what are you trying to evaluate? Answering that question makes it easier to judge the effect of those parameters. Is it the stability of the reduced embeddings? The topic coherence of the resulting topics? The "quality" of the clusters?
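If stability is what matters, one rough proxy is to compare the document-to-topic assignments of two runs (for example, two random seeds, or two nearby n_neighbors values) with the adjusted Rand index. This measures the stability of the final assignments rather than of the reduced embeddings themselves. The sketch below is plain Python; the two label lists are made-up placeholders, and treating the -1 outlier label as just another cluster is an assumption of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same documents.

    1.0 means identical partitions (label names may differ),
    values near 0.0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both label lists must cover the same documents")
    n = len(labels_a)
    if n < 2:
        return 1.0  # nothing to compare

    # Count pairs of documents that share a cluster in both / either run.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    only_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    only_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = only_a * only_b / comb(n, 2)
    maximum = (only_a + only_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in one cluster
    return (both - expected) / (maximum - expected)


# Hypothetical topic assignments from two runs over the same six documents.
# Assumption: -1 (outliers) is treated like any other cluster label here.
run_1 = [0, 0, 1, 1, -1, 2]
run_2 = [1, 1, 0, 0, -1, 2]
print(adjusted_rand_index(run_1, run_2))
```

A score that stays high across seeds for one setting but drops for another suggests the second setting is less stable. It says nothing about whether the topics are good.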
Then that will depend on your definition of "quality" in the context of the creation of clusters and/or assignment of documents. It will also depend on whether you have a ground truth available or not. There are a number of cluster-based metrics here. If it is the representation of the clusters (the topics) you are interested in, then OCTIS has many metrics implemented.
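To make the topic-representation side concrete, here is a minimal sketch of two commonly used proxies, topic diversity and NPMI coherence, written in plain Python so it does not depend on any particular library version. It is not OCTIS's implementation. The function names, the toy documents, and the convention of scoring never-co-occurring word pairs as -1 are assumptions of this sketch.

```python
from itertools import combinations
from math import log


def topic_diversity(topics, top_n=10):
    """Share of unique words among the top words of all topics (0 to 1)."""
    words = [w for topic in topics for w in topic[:top_n]]
    return len(set(words)) / len(words) if words else 0.0


def npmi_coherence(topics, tokenized_docs, top_n=10):
    """Mean NPMI over word pairs per topic, averaged over topics.

    Probabilities are document-level co-occurrence frequencies.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]
    if not doc_sets:
        raise ValueError("Need at least one document")

    def prob(*words):
        hits = sum(all(w in doc for w in words) for doc in doc_sets)
        return hits / len(doc_sets)

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:top_n], 2):
            p12 = prob(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # assumption: never co-occur -> -1
            elif p12 == 1:
                pair_scores.append(0.0)   # assumption: in every doc -> 0
            else:
                pmi = log(p12 / (prob(w1) * prob(w2)))
                pair_scores.append(pmi / -log(p12))
        if pair_scores:
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores) if topic_scores else 0.0


# Toy data standing in for the top words of each topic and the corpus.
docs = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["car", "road"]]
topics = [["cat", "dog"], ["car", "road"]]
print(topic_diversity(topics), npmi_coherence(topics, docs))
```

To compare UMAP settings, you could fit one model per (n_neighbors, min_dist) pair, collect the top words of each topic, and compute these scores for every setting on the same tokenized corpus.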
Do note though that it is important to first define exactly what it is that you want to evaluate. With unsupervised methods, such as BERTopic, a ground truth typically isn't available. As such, and due to the nature of topic modeling, there is a degree of subjectivity involved in the evaluation. There are proxies (…