Skip to content

souravskr/Google-Play-Store-Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 

Repository files navigation

Introduction to data

While many public datasets (on Kaggle and the like) provide Apple App Store data, there are not many counterpart datasets available for Google Play Store apps anywhere on the web.On digging deeper, it can be found out that iTunes App Store page deploys a nicely indexed appendix-like structure to allow for simple and easy web scraping. On the other hand, Google Play Store uses sophisticated modern-day techniques (like dynamic page load) using JQuery making scraping more challenging.

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

Each app (row) has values for catergory, rating, reviews, size, installs, price, rated, last updated, and version.

Source: https://www.kaggle.com/lava18/google-play-store-apps

Objectives of Google-Play Store Analysis

This project focuses on the analysis of the Play Store data set in Kaggle.

The aim of this project is:

1. Using the data to analyze consumer trends and determine which type of apps are the most popular and profitable. 

2. Classifying applications based on their categories.

3. Presenting the growth of applications from 2016 to 2018.

4. Comparing different categories of applications based on the Android version.

5. Comparing the rates in different kinds of applications.

6. Assessing supported Android version with numbers of reviews based on different categories.

Important Pre-requisites

  1. Python 3.7
  2. Pip (usually installed by default with Python 3)
  3. A Kaggle account (More Information can be found here, and details about the configuration of the Kaggle certificates can be found in the Kaggle Notes section contained within this README.md file )
    • If you have trouble obtaining your own key, you can also run bash kagglecheck.sh. This script will generate a key for you to use and be able to download the necessary data

Installation

There are 2 recommended ways to set up your environment to run the scripts, both of which will be described here. We are also assuming that you have already downloaded, or cloned, the repository, and have moved into the Project directory.

Installation Method 1

  1. Download the project(and uncompress the file if you downloaded a zip version of it), and move into the Project folder. Here is a quick example of what this might look like:
git clone https://gitlab.com/cmsc6950/team1/google-play-store-analysis.git
cd Project
  1. Install virtualenv to be able to create separate working python environments. This is to prevent damaging any other projects that you may have on your computer
pip install virtualenv 
  1. Create and initiate a new virtual environment
virtualenv -p python3 .
source ./bin/activate
  1. Install the required libraries
pip install -r requirements.txt
  1. At this point, you are ready to start enjoying the scripts. To deactivate the virtual environment, you just need to run the command deactivate from the terminal

Installation Method 2 (With the makefile)

  1. Download the project(and uncompress the file if you downloaded a zip version of it), and move into the Project folder. Here is a quick example of what this might look like:
git clone https://gitlab.com/cmsc6950/team1/google-play-store-analysis.git
cd Project
  1. Create a virtual environment with anaconda and activate the new environment
conda create -n NewEnv python=3.7
conda activate NewEnv
  1. Run the make command. This should be the last step, but if this command generates an error due to download problems, skip to the next step
make
  1. OPTIONAL if step 3 fails, the make file includes an option to use our back-up version of the kaggle dataset (OrigData)
make bashrun

Short Description

All the scripts are inside the Project folder. Within the Folder, you will find an empty Data folder, which is where the processed data will be stored. The OrigData contains the data found in the Kaggle repository as a backup, should you not be able to directly download the repository from Kaggle, or should the data itself be removed from the Kaggle servers. The following is a list of the scripts included in this project with a brief description about their functionality. It is also recommended that you ran the scripts in the order in which they are listed the first you work with this repository.

  1. download.py - This script downloads the necessary data. Please read the ”Kaggle Notes” section to find out more about the kaggle pre-requisites
  2. cleandata.py - After the data has been downloaded, and unzipped, this script will get rid of unnecessary fields, filter out erronous data points, and fill empty cells with the ‘NA’ keyword
  3. histograms.py - This script provides a preview of the data. It graphs each of the attributes against the number of apps, and will also create a csv file corresponding to each of the files.
  • These 3 scripts have a -h or —-help option, which provides more details about their syntax, and additional options.
  1. jitter.py and jitter2.py generate interactive graphs of the same type, but with different data.
  2. post_processing.py - This script generates more advanced, and multicolored, figures from the data.
  3. test.py - This is a script to test some of the functionality from the scripts. To run it, just type pipenv run py.test test.py
  • othergraph.py - requires Geoview which is an anaconda specific package, and therefore cannot be ran from the makefile or with pipenv.

Member's Contributions

  • Aida: Latex, Presentation, README.md, jitter2.py & avg_size_category.py

  • Carlos: makefile, shell scripts, download.py, cleandata.py, histogram.py, post_processing.py, test.py presentation, README.md, requirements.txt, .gitignore, OrigData.

  • Sourav: Coded with Jupyter Notebook for Project_ver_1.ipynb, post_processing.py & jitter.py. (Bokeh Pie and jitter plot, Box plot, Mean plot, Count Plot, Cat Plot, Heatmap)

  • Fatemeh: Latex, Presentation, README.md, Pie plot, jitter2.py & othergraph.py(avg_size_category graph)

Backup

As an additional backup, the original data has been included in the OrigData folder. The checkfs.sh will copy this data to the Data folder and run the scripts cleandata.py, histograms.py, jitter.py, jitter2.py and post_processing.py. This can also be used as a back up should you have offline troubles.

Kaggle Notes

To be able to download the datasets from Kaggle, it is recommended to use the Kaggle API, which can be found link Installing the Kaggle API is as simple as pip install kaggle Or you can download the source code directly from the aboved mentioned link.

After the kaggle module has been downloaded, you need to set up a kaggle account, and download your credentials to be able to download datasets. The steps required to download the credentials can be found in the Kaggle's API github. However, we have copied the instructions in further sections of this readme file.

The command to download our data set after you configure your credentials is kaggle datasets download lava18/google-play-store-apps, which will download the zip file.

The following sections were copied directly from the Kaggle's API README.md file.

API credentials

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. Place this file in the location ~/.kaggle/kaggle.json (on Windows in the location C:\Users\<Windows-username>\.kaggle\kaggle.json - you can check the exact location, sans drive, with echo %HOMEPATH%). You can define a shell environment variable KAGGLE_CONFIG_DIR to change this location to $KAGGLE_CONFIG_DIR/kaggle.json (on Windows it will be %KAGGLE_CONFIG_DIR%\kaggle.json).

For your security, ensure that other users of your computer do not have read access to your credentials. On Unix-based systems you can do this with the following command:

chmod 600 ~/.kaggle/kaggle.json

You can also choose to export your Kaggle username and token to the environment:

export KAGGLE_USERNAME=datadinosaur
export KAGGLE_KEY=xxxxxxxxxxxxxx

In addition, you can export any other configuration value that normally would be in the $HOME/.kaggle/kaggle.json in the format 'KAGGLE_' (note uppercase).
For example, if the file had the variable "proxy" you would export KAGGLE_PROXY and it would be discovered by the client.

Commands

The command line tool supports the following commands:

kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}

See more details below for using each of these commands.

Download dataset files
usage: kaggle datasets download [-h] [-f FILE_NAME] [-p PATH] [-w] [--unzip]
                                [-o] [-q]
                                [dataset]

optional arguments:
  -h, --help            show this help message and exit
  dataset               Dataset URL suffix in format <owner>/<dataset-name> (use "kaggle datasets list" to show options)
  -f FILE_NAME, --file FILE_NAME
                        File name, all files downloaded if not provided
                        (use "kaggle datasets files -d <dataset>" to show options)
  -p PATH, --path PATH  Folder where file(s) will be downloaded, defaults to current working directory
  -w, --wp              Download files to current working path
  --unzip               Unzip the downloaded file. Will delete the zip file when completed.
  -o, --force           Skip check whether local version of file is up to date, force file download
  -q, --quiet           Suppress printing information about the upload/download progress

Examples:

kaggle datasets download zillow/zecon

kaggle datasets download zillow/zecon -f State\_time\_series.csv

Please note that BigQuery datasets cannot be downloaded.

Releases

No releases published

Packages

No packages published