GitHub - himynamesdave/idox_webscraper: Scrapes planning application details from Idox web sites

# Python Data Scraper

## Version 1.0

Set up: runs with Python 2.7. Modules and dependencies:

- PhantomJS
- selenium
- requests
- bs4
- multiprocessing

PhantomJS is a headless browser that powers the browser automation in this script. If you do not have it, first install Node.js and the Node package manager (npm) from https://www.npmjs.com/. With npm installed, type `npm install phantomjs-prebuilt` into your terminal. This downloads PhantomJS and gives you a directory path (***take note of this path; the script needs it for the PhantomJS webdriver***). The attached image, insertPath.png, shows where in the script to put your PhantomJS path: delete my path inside the quotes and paste yours. Once that is done, you are up and running with a PhantomJS web driver.
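A pasted path that is slightly wrong is an easy mistake at this step, so a small check before starting the driver can help. The sketch below is not taken from scrapeAll.py; the helper names `phantomjs_path` and `make_driver` are hypothetical, and it assumes a Selenium 2.x/3.x release, where `webdriver.PhantomJS` is still available (it was removed in Selenium 4).

```python
import os


def phantomjs_path(candidate):
    """Return an absolute path to the PhantomJS binary, or raise if it is missing."""
    path = os.path.abspath(os.path.expanduser(candidate))
    if not os.path.isfile(path):
        raise IOError("PhantomJS binary not found at: " + path)
    return path


def make_driver(candidate):
    """Start a PhantomJS webdriver from the path noted during `npm install`."""
    # Imported here so the path check above works even without selenium installed.
    from selenium import webdriver
    # Assumption: Selenium 2.x/3.x, where webdriver.PhantomJS exists.
    return webdriver.PhantomJS(executable_path=phantomjs_path(candidate))
```

Calling `make_driver` with the path you noted fails early with a readable message if the file is not there, rather than with a webdriver startup error.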

For the Python modules: use pip to install bs4, requests, and selenium. multiprocessing is part of the Python 2.7 standard library and does not need to be installed separately.
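To confirm the modules are in place before a multi-hour run, a quick import check along these lines may be useful. It is an illustrative sketch, not part of the original script; `missing_modules` and `THIRD_PARTY` are hypothetical names.

```python
import importlib

# Third-party modules named in the set-up notes above.
THIRD_PARTY = ["bs4", "requests", "selenium"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(THIRD_PARTY + ["multiprocessing"])
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All modules importable.")
```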

Description: This script reads every URL in column C of the following Google spreadsheet, https://docs.google.com/spreadsheets/d/1YJWH5up2sinNY7YubqJ_VbmqAR_Lj5xYmbd_ChqPjdY/edit#gid=0, and captures all application listing data for each one. It only works for the current month. For each URL it scrapes every page that contains listings.
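How the script itself reads the spreadsheet is not shown here. As a rough illustration of the "column C" step, the sketch below assumes the sheet has already been obtained as CSV text (for example by exporting it by hand) and pulls out the third column. The function name `column_c_urls` and the sample data are hypothetical.

```python
import csv


def column_c_urls(csv_text):
    """Return the URLs found in column C (the third column) of CSV text."""
    urls = []
    # splitlines() keeps this simple; it assumes no cell contains a line break.
    for row in csv.reader(csv_text.splitlines()):
        if len(row) >= 3:
            cell = row[2].strip()
            # Skip the header row and empty cells: keep only http/https values.
            if cell.lower().startswith("http"):
                urls.append(cell)
    return urls
```

Each returned URL would then be handed to the scraping step, one site at a time.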

Run time: the first two URLs and all their subpages took 17-18 minutes to scrape fully. Some URLs load much faster than others, and some have much more data. For instance, the first URL can be fully scraped in about 30 seconds; the remaining 16.5 minutes or so went to the second. I estimate the full run at anywhere from 3 to 5 hours.
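Because per-URL times vary this much, one slow site can hold up everything queued behind it. One possible way to overlap the waiting is a worker pool, sketched below. This is an assumption about structure, not the script's actual code: `scrape_one` is a hypothetical placeholder for the real per-URL scrape, and the sketch uses `multiprocessing.dummy`, the thread-backed variant that exposes the same `Pool` API as `multiprocessing`.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool


def scrape_one(url):
    """Hypothetical placeholder for the real per-URL scrape; returns (url, listings)."""
    return (url, [])


def scrape_all(urls, workers=4, scrape=scrape_one):
    """Run `scrape` over all URLs with a pool of workers, keeping input order."""
    pool = Pool(workers)
    try:
        # map() returns results in the same order as `urls`.
        return pool.map(scrape, urls)
    finally:
        pool.close()
        pool.join()
```

In a real run each worker would need its own webdriver instance, and the total time would still be bounded below by the slowest single URL.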

