UMAP Parameters for Visualization #2078

eschaffn · 2024-07-10T14:16:46Z

eschaffn
Jul 10, 2024

Hi there!

Two of the visualizations, visualize_topics() and visualize_documents(), both use a 2D-reduced version of the embeddings. visualize_topics() runs UMAP with n_neighbors = 2, while visualize_documents() uses n_neighbors = 10.

My question is: should these parameters be varied based on certain attributes of the fitted model/data? For example, should the n_neighbors of visualize_topics() be a function of the number of topics in the model? Does this make mathematical sense? Should the same be applied to visualize_documents(), or should the parameter be scaled based on some other property, or not at all?

I've been doing my best to read the math behind UMAP and the n_neighbors parameter seems like it should vary a bit more.

Thanks

MaartenGr · 2024-07-11T13:13:23Z

MaartenGr
Jul 11, 2024
Maintainer

> My question is: should these parameters be varied based on certain attributes of the fitted model/data? For example, should the n_neighbors of visualize_topics() be a function of the number of topics in the model? Does this make mathematical sense? Should the same be applied to visualize_documents(), or should the parameter be scaled based on some other property, or not at all?

It isn't easy to make this a function of the data itself, since we only know certain characteristics of the embeddings in low-dimensional space after reducing their dimensionality. Otherwise, there would have been no need for this parameter in the first place.

Ideally, the parameters would be a function of the data, but since the input data can vary wildly (embedding size, distribution of values, number of datapoints, etc.), there isn't a straightforward way to make sure all parameters are perfectly tuned to the data.
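To make that concrete, here is a toy sketch (plain Python, not UMAP itself) of how the same neighbor count can give a very different picture depending on the structure of the data. It builds a symmetric k-nearest-neighbor graph, which is roughly the kind of graph UMAP starts from, and counts its connected components. The points and the helper names `knn_graph` and `count_components` are made up for illustration.

```python
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as adjacency sets (self excluded)."""
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def count_components(adj):
    """Count connected components with an iterative depth-first search."""
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj[node] - seen)
    return components


# Two well-separated clumps of three points each (made-up toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

for k in (2, 3):
    print(k, count_components(knn_graph(points, k)))
# 2 2  -> every neighbor lies inside the point's own clump
# 3 1  -> the third neighbor must come from the other clump
```

Whether k = 2 or k = 3 is the "right" choice here depends on the clump size, which is exactly the kind of property that is not known before the reduction is run.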

eschaffn Jul 11, 2024
Author

I'd like to use these plots to get an idea of the metric I'm working to develop here: #2061

I'm setting the radius of the bubbles in visualize_topics() to the variance metric, and want to see if that correlates to an increase in spread in visualize_documents(). Would it be a good idea to keep the n_neighbors the same? And should it be lower (2?), or higher (10?)?

MaartenGr Jul 15, 2024
Maintainer

That depends highly on the number of datapoints that you have. Since the number of topics is generally low (<200), I chose n_neighbors=2 so that it would not run into any issues with a very small number of topics.
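A minimal sketch of that reasoning, assuming the goal is simply to keep n_neighbors below the number of points being reduced. `pick_n_neighbors` is a hypothetical helper, not part of BERTopic or UMAP, and the lower bound of 2 reflects the understanding that UMAP rejects smaller values.

```python
def pick_n_neighbors(n_points, preferred=10, minimum=2):
    """Hypothetical helper: clamp a preferred n_neighbors to what n_points allows."""
    if n_points <= minimum:
        raise ValueError(f"need more than {minimum} points, got {n_points}")
    # Never ask for more neighbors than there are other points.
    return max(minimum, min(preferred, n_points - 1))


for n_topics in (3, 8, 200):
    print(n_topics, pick_n_neighbors(n_topics))
# 3 2
# 8 7
# 200 10
```

With only a handful of topics the clamp falls back to 2, which is why a fixed n_neighbors=2 is the safe default for visualize_topics(). To compare the two plots with the same setting, the same value would have to be used for both reductions.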
