- General Information
- Installation
- Usage
- Extra Configuration
- Features
- Project Status
- Acknowledgements
- Contact
- License
This is an automated data collection package (web scraper) tailored to scrape data from the Book Depository website based on a category keyword of your choice. See the Features section for details.
Use the package manager pip to install book_scraper.
- Install directly from the GitHub repository:
pip install git+https://github.com/fortune-uwha/book_scraper
BooksScraper takes the number of records to scrape, the keyword to search, and a boolean flag that also exports the results to a .csv file when set to True. It returns a pandas DataFrame with the scraped records.
- To export raw data without cleaning:
from scraper.bookscraper import BooksScraper
# 3000 records for the "economics" keyword; True also exports the results to a .csv file
scraper = BooksScraper(3000, "economics", True)
dataframe = scraper.collect_information()  # returns a pandas DataFrame
- To export clean data:
from scraper.bookscraper import CleanBookScraper
# same arguments: number of records, search keyword, .csv export flag
scraper = CleanBookScraper(3000, "economics", True)
dataframe = scraper.clean_dataframe()  # returns the cleaned pandas DataFrame
For more information, run help(BooksScraper) or help(CleanBookScraper).
To use the Database class, you will need to create a PostgreSQL database on Heroku or any other platform and enter the authentication credentials into the config.py file.
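As a rough illustration, config.py might hold credentials along these lines; the exact variable names are assumptions, so match them to what database/database.py actually reads:

# config.py - hypothetical field names; adjust to what Database() expects
HOST = "your-host.example.com"
DATABASE = "your_database_name"
USER = "your_username"
PASSWORD = "your_password"
PORT = 5432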
- Initialization
from database.database import Database
db = Database()  # uses the credentials from config.py
Example functions
- These functions are executed by running main.py; feel free to edit the variables to suit your requirements (see the sketch after this list).
- delete_tables() - Drops the categories and books tables. Handle with care - this will destroy your data!
- create_tables() - Creates the categories and books tables and sets up foreign keys.
- insert_data_into_db(dataframe, category) - Inserts the data from a DataFrame into the database.
- export_to_csv() - Fetches the data from the database and exports it as a .csv file.
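A minimal sketch of such a flow, assuming clean_dataframe() returns the cleaned DataFrame and reusing the function names listed above; the variable values are illustrative:

from scraper.bookscraper import CleanBookScraper
from database.database import Database

# Illustrative values; edit to suit your requirements
category = "economics"
scraper = CleanBookScraper(3000, category, True)
dataframe = scraper.clean_dataframe()

db = Database()
db.create_tables()  # creates the categories and books tables with foreign keys
db.insert_data_into_db(dataframe, category)
db.export_to_csv()  # writes the database contents back out as a .csv file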
Based on the specified category, BooksScraper collects information on:
- Book title
- Book author
- Book price
- Book edition
- Book publish date
- Book category
- Book item url
- Book image url
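To sanity-check a run, you can preview the resulting DataFrame; the column labels in the comment below are assumptions based on the field list above:

from scraper.bookscraper import BooksScraper
# Small test run; False skips the .csv export (per the flag described in Usage)
scraper = BooksScraper(10, "economics", False)
dataframe = scraper.collect_information()
print(dataframe.columns)  # e.g. title, author, price, edition, publish date, category, urls
print(dataframe.head())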
Project is: in progress
This project was based on Turing College coursework on SQL and data scraping.
Created by @fortune_uwha - feel free to contact me!
This project is open source and available under the terms of the MIT license.