Topic-Modelling-Amazon-Fine-Food-Reviews

We encounter large collections of documents time and again. If we could get an overview of what the documents are about before reading them, the job would be much easier. For example, watching a trailer before a movie helps us decide whether the movie is worth watching. Similarly, an overview lets us decide whether the documents are worth the effort.

Alternatively, if we have a set of documents and can label each one with its "topic", we can shortlist the documents by topic of interest without having to read through all of them.

In Machine Learning and Natural Language Processing, topic models, a type of statistical model, give us the ability to discover topics from a collection of documents.

Heroku-based App

Below is the link to the app, where you can enter your own review and find out what it is about. Please keep in mind that the algorithm works best for the kind of food and beverage products commonly found in Amazon's fine foods category.

App - https://topic-modelling-amzon-reviews.herokuapp.com/

DATA

For this project, I used the publicly available Amazon Fine Food Reviews data. It can be accessed here. The data contains approximately 569,000 reviews from 256,000 users.

This is what a sample of the words in the reviews looks like.

There is quite a range of words in this sample, from coffee to chocolate to dog. From the raw words alone, it is hard to tell what topics or themes are actually present in the reviews.

Text Normalization

Cleaning the text was essential for getting good topics from the reviews. The process involved the following steps:

Removing HTML tags
Expanding contractions
Lowercasing the reviews
Removing numbers and extra whitespace
Removing punctuation
Tokenization
Removing stopwords (using a long list of words from rank.nl and domain-specific words)
Removing leftover whitespace
Lemmatizing all reviews
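
The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook's actual code: the `CONTRACTIONS` and `STOPWORDS` tables and the `normalize` function are hypothetical and deliberately tiny, lowercasing is done before expanding contractions so that one lowercase lookup table is enough, and lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables for illustration only;
# the project used a much longer stopword list plus domain-specific words.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"a", "an", "the", "is", "it", "i", "am", "this",
             "and", "of", "to", "my", "do", "not"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(review):
    """Return the cleaned tokens of one review (lemmatization omitted)."""
    text = re.sub(r"<[^>]+>", " ", review)       # remove HTML tags
    text = text.lower()                          # lowercase
    for short, full in CONTRACTIONS.items():     # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"\d+", " ", text)             # remove numbers
    text = text.translate(PUNCT_TO_SPACE)        # remove punctuation
    tokens = text.split()                        # tokenize, dropping extra whitespace
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords


tokens = normalize("<br />I don't like this 2 coffees, it's BAD!")
print(tokens)
```

A lemmatizer would then map tokens such as "coffees" to "coffee" before the reviews are passed to a topic model.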

Modelling

K-Means - Identified 15 topics using k-means. Evaluated the clusters using the sum of squared errors (SSE).
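
For reference, SSE is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below computes it in plain Python on made-up 2-D points; the function name `sse` and the toy data are assumptions for illustration, not taken from the notebook. scikit-learn exposes the same quantity as the `inertia_` attribute of a fitted `KMeans` model.

```python
def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Toy data: two obvious groups of two points each.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]

# One cluster: every point is assigned to the overall mean.
sse_k1 = sse(points, [(5, 6)], [0, 0, 0, 0])

# Two clusters: each group is assigned to its own mean.
sse_k2 = sse(points, [(0, 1), (10, 11)], [0, 0, 1, 1])

print(sse_k1, sse_k2)
```

SSE always decreases as the number of clusters grows, so it is usually read with an "elbow" plot: the number of clusters is chosen where adding more clusters stops reducing SSE substantially.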

LDA - Identified 16 topics using LDA. Evaluated the topics using coherence scores.

NMF - Identified 11 topics using NMF. Evaluated the topics using coherence scores.
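
A coherence score measures how often the top words of a topic appear together in the same documents. The README does not say which coherence measure was used, so the sketch below assumes a UMass-style score, averaged over word pairs, and computes it in plain Python. The function name `umass_coherence` and the toy documents are hypothetical; in practice a library such as gensim would compute this on the real corpus.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average UMass-style coherence of one topic's top words.

    For each pair (earlier, later) in the ranked word list, adds
    log((D(later, earlier) + 1) / D(earlier)), where D counts the
    documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score, pairs = 0.0, 0
    for earlier, later in combinations(top_words, 2):
        df = doc_freq(earlier)
        if df == 0:
            continue  # skip words that never occur, to avoid dividing by zero
        score += math.log((doc_freq(later, earlier) + 1) / df)
        pairs += 1
    return score / pairs if pairs else 0.0


# Toy tokenized reviews, made up for illustration.
docs = [
    ["coffee", "cup", "roast"],
    ["coffee", "roast", "bean"],
    ["dog", "treat", "chew"],
    ["dog", "food", "treat"],
]

coherent = umass_coherence(["coffee", "roast"], docs)  # words that co-occur
mixed = umass_coherence(["coffee", "dog"], docs)       # words that never co-occur
print(coherent, mixed)
```

Higher (less negative) values indicate topics whose top words tend to appear together, which is why the score can be used to compare models with different numbers of topics.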

Conclusion

LDA does the better job here. Both models were good at picking the topics for the majority of documents, but LDA has a slight edge, so I use it as the final model. The final LDA model was deployed using Flask and Heroku.
