Use PySpark to perform the ETL process on a dataset retrieved from an AWS RDS instance.

inregards2pluto/amazon-vine-analysis

Amazon Vine Analysis

Overview

This repository contains an analysis of Amazon video game reviews written by members of the paid Amazon Vine program. PySpark is used to perform the ETL process on a dataset retrieved from an AWS RDS instance. The PySpark ETL was originally run in Google Colab, and the code is in the Amazon_Reviews_ETL.ipynb file. The transformed data frames were loaded into a PostgreSQL database managed through pgAdmin. The analysis of Vine reviews was run in a second Colab notebook. The following metrics were assessed for both paid and unpaid Vine reviews:

  • Total reviews
  • Number of 5-star reviews
  • Percentage of total reviews that were 5-star

Results

  • Paid Vine Reviews
    • Total reviews: 94
    • 5-star reviews: 48
    • Percentage of reviews that were 5-star: 51%
  • Unpaid Vine Reviews
    • Total reviews: 40,471
    • 5-star reviews: 15,663
    • Percentage of reviews that were 5-star: 39%
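
As a quick check, the percentages above can be recomputed from the reported counts. This is a minimal sketch in plain Python rather than the notebook's PySpark code; the function name `five_star_percentage` is hypothetical and only the counts come from the Results list.

```python
def five_star_percentage(five_star: int, total: int) -> float:
    """Return 5-star reviews as a percentage of all reviews."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * five_star / total


# Counts copied from the Results list; nothing here reads the actual dataset.
paid_pct = five_star_percentage(48, 94)
unpaid_pct = five_star_percentage(15_663, 40_471)

print(f"Paid Vine reviews:   {paid_pct:.1f}% 5-star")
print(f"Unpaid Vine reviews: {unpaid_pct:.1f}% 5-star")
```

Rounded to whole numbers, the two values match the 51% and 39% reported above.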

Summary

The results of the Vine review analysis suggest a positivity bias in paid Vine reviews: 51% of paid Vine reviews are 5-star, compared with 39% of unpaid reviews. However, the number of paid reviews (94) is dwarfed by the number of unpaid reviews (40,471). With so few paid reviews, the paid percentage is a less precise estimate, so part of the observed difference may be due to the limited sample size. If the number of paid reviews were more comparable to the number of unpaid reviews, the difference in percentages might decrease.
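
One way to gauge how much of the gap could be sampling noise is a two-proportion z-test on the reported counts. This test was not part of the original analysis; the sketch below is an illustration that uses the normal approximation and only the Python standard library, and the function name `two_proportion_z_test` is hypothetical.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one 5-star rate."""
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts from the Results list: paid (48 of 94) vs. unpaid (15,663 of 40,471).
z, p_value = two_proportion_z_test(48, 94, 15_663, 40_471)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

A small p-value would indicate that the difference is unlikely to come from sample size alone, although it says nothing about why paid reviews skew positive.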

To test whether the paid and unpaid statistics become more comparable when the sample sizes are similar, the same statistics could be run on a random sample of 150 unpaid Vine reviews instead of the full dataset. A second, more ambitious analysis could run the statistics on paid Vine reviews across multiple product categories, not just video games, to see whether the positivity bias is specific to video game products or holds across categories.
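
The proposed subsampling can be sketched as follows. The population here is a stand-in rebuilt from the reported unpaid counts (1 for a 5-star review, 0 otherwise), not the actual review rows, and the names `subsample_percentages`, `n_draws`, and `seed` are hypothetical. A real run would sample rows from the unpaid data frame instead.

```python
import random
import statistics

UNPAID_TOTAL = 40_471      # from the Results list
UNPAID_FIVE_STAR = 15_663  # from the Results list
SAMPLE_SIZE = 150          # sample size proposed in the text


def subsample_percentages(n_draws: int, seed: int = 0) -> list[float]:
    """Draw repeated random samples and return each sample's 5-star percentage."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Stand-in population: 1 marks a 5-star review, 0 any other rating.
    population = [1] * UNPAID_FIVE_STAR + [0] * (UNPAID_TOTAL - UNPAID_FIVE_STAR)
    return [
        100.0 * sum(rng.sample(population, SAMPLE_SIZE)) / SAMPLE_SIZE
        for _ in range(n_draws)
    ]


percentages = subsample_percentages(n_draws=1_000)
print(f"mean 5-star share: {statistics.mean(percentages):.1f}%")
print(f"range: {min(percentages):.1f}% to {max(percentages):.1f}%")
```

Repeating the draw many times, rather than once, shows how much a 150-review percentage can vary, which gives context for comparing it with the 94 paid reviews.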

Resources

Data:
