Taraana30 is a web scraper that compiles a weekly Top 30 Bollywood songs chart from various platforms.

The scraper collects data from the playlists/charts listed above for each provider and saves it in the data folder: `./data/<date of previous week's saturday>/` (more on `<date of previous week's saturday>` below).
The scraper provides two main commands:

- To generate just `top_30.csv` and `candidates.csv`:

      $ python main.py

  Run this command from the root of the repository to write `top_30.csv` and `candidates.csv` into the data folder: `./data/<date of previous week's saturday>/`.
- To generate all .csv files from the scraper:

      $ python main.py --all

  Run this command from the root of the repository to produce:

  - candidates.csv
  - top_30.csv
  - gaana.csv
  - hungama.csv
  - mirchi.csv
  - saavn.csv
  - wynk.csv

  in the `./data/<date of previous week's saturday>/` folder, creating the folder first if it does not already exist.
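A minimal sketch of how such an `--all` flag could be wired up with `argparse`. This is illustrative only: the flag name matches the command above, but the parser itself is an assumption, not Taraana30's actual `main.py`:

```python
import argparse

# Illustrative sketch only: the real main.py may parse arguments
# differently. The --all flag mirrors the command shown above.
parser = argparse.ArgumentParser(
    description="Weekly Top 30 Bollywood songs scraper")
parser.add_argument("--all", dest="all_files", action="store_true",
                    help="also write the per-platform .csv files")

args = parser.parse_args(["--all"])  # as if run with: python main.py --all
```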
`<date of previous week's saturday>`: The scraper names its output folders after the date of the previous week's Saturday, to distinguish one week from the next. The date is in the format DD-MM-YYYY. (The week changes at Sunday 00:00:00, i.e., even if the script is run on a Saturday, the folder name will be the date of the previous Saturday.)
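The folder-date rule above can be sketched in a few lines. `previous_saturday_folder()` is a hypothetical helper written for illustration, not part of the project's code:

```python
from datetime import date, timedelta

def previous_saturday_folder(today=None):
    """Return the DD-MM-YYYY folder name per the rule above.

    Hypothetical helper for illustration; not Taraana30's actual code.
    """
    if today is None:
        today = date.today()
    # weekday(): Monday=0 ... Sunday=6, so this is days since the last Sunday
    days_since_sunday = (today.weekday() + 1) % 7
    week_start = today - timedelta(days=days_since_sunday)  # most recent Sunday
    # The folder is named after the Saturday just before the week started
    return (week_start - timedelta(days=1)).strftime("%d-%m-%Y")
```

For example, for any day from Sunday 23-06-2019 through Saturday 29-06-2019, the folder name is `22-06-2019`; from Sunday 30-06-2019 onward it becomes `29-06-2019`.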
Example:

    ./
    └── data
        ├── 22-06-2019
        │   ├── top_30.csv
        │   └── candidates.csv
        └── 29-06-2019
            ├── top_30.csv
            ├── candidates.csv
            ├── gaana.csv
            ├── hungama.csv
            ├── mirchi.csv
            ├── saavn.csv
            └── wynk.csv
Currently the tool is available only as a GitHub repository and can only be used from there:

- Fork and clone the repository to your machine.
- Start the pipenv shell:

      pipenv shell

- Run

      pipenv install --ignore-pipfile

  to install all dependencies on your machine. The main dependencies are:

  - beautifulsoup4
  - requests
  - lxml
- Language: Python v3.7.3
- Scraping module: BeautifulSoup v4.7.1 with the lxml parser
- I/O request module: requests v2.22.0
Misc: This project is a collection of 5 scrapers, one for each platform:

- radiomirchi.com
- gaana.com
- hungama.com
- jiosaavn.com
- wynk.in

To speed up scraping, particularly to hide the delay of the I/O requests that fetch the page source from each platform, multi-threading is used: one thread per scraper (i.e., a thread pool of 5 threads).
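The threading scheme described above can be sketched with `concurrent.futures`. The `scrape` stub below is a stand-in for the real per-platform scraper functions, whose names and signatures are not shown in this README:

```python
from concurrent.futures import ThreadPoolExecutor

PLATFORMS = ["mirchi", "gaana", "hungama", "saavn", "wynk"]

def scrape(platform):
    # Stand-in for a real scraper: fetching and parsing a platform's
    # chart page is I/O-bound, which is why threads help here.
    return f"{platform}.csv"

# One worker thread per scraper, so the five page downloads overlap
with ThreadPoolExecutor(max_workers=len(PLATFORMS)) as pool:
    files = list(pool.map(scrape, PLATFORMS))
```

`pool.map` preserves the input order, so `files` lists the results platform by platform.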
Note: The `data/22-06-2019` folder in the repository is just a sample folder, included so you can see the output of the script. No .csv file produced by the scraper is tracked by git. If you want to disable this behaviour, remove `*.csv` from the `.gitignore` file.
When using Taraana30 as a package, import the `main` module and call its `taraana30()` function:

    from Taraana import main

    # This function just writes the .csv files and does not return anything
    main.taraana30()
The `taraana30()` function takes one optional argument, `all_files`, which controls how many files are created:

- `all_files=True`: same as passing the `--all` argument to the script from the terminal
- `all_files=False` (default): same as running the script without any arguments
Note: The `taraana30()` function creates the `./data/<date of previous week's saturday>/` folder relative to the location it is called from: it uses `os.getcwd()` to find the current working directory and creates the `./data/<date of previous week's saturday>/` directory inside it.
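The path resolution described in the note could look like the sketch below. This is illustrative only; the folder name is hard-coded here where the real script would compute the previous-Saturday date:

```python
import os

# Illustrative sketch: "22-06-2019" stands in for the computed
# previous-Saturday folder name.
folder_name = "22-06-2019"

# Resolved relative to the caller's current working directory
data_dir = os.path.join(os.getcwd(), "data", folder_name)
os.makedirs(data_dir, exist_ok=True)  # create ./data/<date>/ if missing
```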