We run into huge collections of documents time and again. If we could get an overview of what the documents are about before reading them, the job would be much easier. For example, watching a movie's trailer helps us decide whether the movie is worth watching. Similarly, an overview lets us decide whether the documents are worth the effort.
Alternatively, if we have a bunch of documents and can label each one with its "topic", we can shortlist documents by topic of interest without having to read through all of them.
In Machine Learning and Natural Language Processing, topic models, a type of statistical model, give us the ability to discover topics from a collection of documents.
Below you can find the link to the app, where you can enter your own review and find out what it is about. Please keep in mind that the algorithm works best for food and beverage products like those commonly found in Amazon's Fine Foods reviews.
App - https://topic-modelling-amzon-reviews.herokuapp.com/
For the project, I used the publicly available Amazon Fine Food reviews data. It can be accessed here. The data contains approx. 569,000 reviews from 256,000 users.
This is what a sample of all the words looks like.
There is quite a range of words here, from coffee to chocolate to dog. It is hard to tell what kinds of topics or themes are actually in the reviews.
Cleaning the text was very important for getting good topics from the reviews. The process involved the following steps:
- Removing HTML tags
- Expanding contractions (e.g. "can't" to "cannot")
- Lowercasing the reviews
- Removing numbers and additional white spaces
- Removing punctuation
- Tokenization
- Removing stopwords (using a long list of words from rank.nl and domain specific words)
- Removing whitespace
- Lemmatizing all reviews
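The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the exact code used in the project: the `CONTRACTIONS` patterns and the tiny `STOPWORDS` set are hypothetical stand-ins for the much longer lists described above, and lemmatization is left as a pluggable `lemmatize` function because it needs an external library (NLTK or spaCy would be typical choices).

```python
import re
import string

# Hypothetical, deliberately short lists for illustration only.
CONTRACTIONS = [
    (r"\bcan't\b", "cannot"),
    (r"\bwon't\b", "will not"),
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
    (r"'ll\b", " will"),
    (r"'m\b", " am"),
]
STOPWORDS = {"a", "an", "and", "i", "is", "it", "of", "so", "the", "this", "to"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text, stopwords=STOPWORDS, lemmatize=lambda token: token):
    """Return a list of cleaned tokens for one review."""
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    for pattern, replacement in CONTRACTIONS:       # expand contractions
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    text = text.lower()                             # lowercase
    text = re.sub(r"\d+", " ", text)                # remove numbers
    text = text.translate(PUNCT_TABLE)              # remove punctuation
    tokens = text.split()                           # tokenize; drops extra whitespace
    tokens = [t for t in tokens if t not in stopwords]
    return [lemmatize(t) for t in tokens]           # identity unless a lemmatizer is passed


print(clean_review("<br />I can't believe this coffee is SO good!! 10/10"))
# ['cannot', 'believe', 'coffee', 'good']
```

Passing a real lemmatizer as `lemmatize` would cover the last step; the default simply returns each token unchanged.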
I then tried three approaches to find topics in the cleaned reviews:
- K-Means - Identified 15 topics using k-means. Evaluated the topics using the sum of squared errors (SSE)
- LDA - Identified 16 topics using LDA. Evaluated the topics using coherence scores
- NMF - Identified 11 topics using NMF. Evaluated the topics using coherence scores
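To make the two evaluation measures concrete, here is a small sketch of how they can be computed by hand. The post does not say which coherence measure was used, so the UMass variant below is an assumption chosen because it only needs document co-occurrence counts; the function names `sse` and `umass_coherence` are illustrative, not taken from the project. In practice a library would normally handle this (gensim, for instance, ships a `CoherenceModel` class).

```python
import math


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence for one topic (assumed measure, see lead-in).

    top_words must be ordered from most to least probable and every word
    must occur in at least one document, otherwise the ratio is undefined.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co_occurrences = doc_freq(top_words[i], top_words[j])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[j]))
    return score


# Toy data, made up for illustration.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
centroids = [(1.0, 0.0), (10.0, 0.0)]
docs = [["coffee", "cup", "strong"], ["coffee", "cup"], ["dog", "treat"], ["dog", "coffee"]]

print(sse(points, labels, centroids))               # 2.0
print(umass_coherence(["coffee", "cup"], docs))     # 0.0
print(umass_coherence(["coffee", "treat"], docs))   # log(1/3), about -1.10
```

A lower SSE means tighter clusters, while a higher (closer to zero) UMass score means the top words of a topic tend to appear in the same documents. On the toy data, "coffee" and "cup" co-occur and score higher than the unrelated pair "coffee" and "treat".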
LDA does a better job here. Both models did well at picking the topic for the majority of documents, but LDA has a slight edge, so I'm going to use it as my final model. The final LDA model was deployed using Flask and Heroku.