Purpose: This web scraper extracts data from websites automatically, so users can gather information quickly. Web scraping is a form of copying in which specific data is gathered from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Features:
- Login and Authentication: The scraper can handle login and authentication processes, allowing access to protected areas of websites.
- Infinite Scroll Support: It can navigate through pages with infinite scroll, capturing dynamic content as it loads.
- Data Extraction: The scraper can extract specific data elements such as names, emails, phone numbers, and addresses from targeted web pages.
- Page Pooling: To optimize performance, it utilizes a page pooling mechanism, reusing and managing browser pages effectively.
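To illustrate the page pooling idea, here is a minimal sketch of a pool that caps the number of open pages and hands released pages to waiting tasks. It is not the project's actual implementation: `PagePool`, `createPage`, `use`, and `demo` are hypothetical names, and the demo uses plain objects in place of real browser pages so it runs without Puppeteer. With Puppeteer, `createPage` could be something like `() => browser.newPage()`.

```javascript
// Minimal page pool sketch (hypothetical, not the project's own code).
// `createPage` is an async factory supplied by the caller.
class PagePool {
  constructor(createPage, maxSize) {
    this.createPage = createPage;
    this.maxSize = maxSize;
    this.idle = [];    // pages ready for reuse
    this.waiters = []; // resolvers for callers waiting on a free page
    this.created = 0;  // number of pages opened so far
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.created < this.maxSize) {
      this.created += 1;
      try {
        return await this.createPage();
      } catch (err) {
        this.created -= 1; // free the slot if the page failed to open
        throw err;
      }
    }
    // Pool is full: wait until release() hands a page over.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(page) {
    const next = this.waiters.shift();
    if (next) next(page);
    else this.idle.push(page);
  }

  // Run task(page) with a pooled page and always give the page back.
  async use(task) {
    const page = await this.acquire();
    try {
      return await task(page);
    } finally {
      this.release(page);
    }
  }
}

// Demo with stand-in page objects: five tasks share at most two pages.
async function demo() {
  let opened = 0;
  const pool = new PagePool(async () => ({ id: ++opened }), 2);
  const urls = ["a", "b", "c", "d", "e"];
  const results = await Promise.all(
    urls.map((url) => pool.use(async (page) => `${url}@page${page.id}`))
  );
  return { opened, results };
}
```

Reusing pages this way avoids the cost of opening a new tab for every URL, and the `maxSize` cap keeps memory use bounded.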
This web scraping project lets you extract data from websites (10ksb) efficiently. Follow the steps below to get started.
Node.js and npm: Ensure you have Node.js (v20 or above) and npm (Node Package Manager) installed on your machine. Run the following commands in your terminal to check your Node.js version and update npm to the latest release:
- node
node -v
- npm
npm install npm@latest -g
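The same version requirement can also be checked from inside a script. The sketch below is an optional illustration, not part of the project; `nodeMajor` and `meetsMinimum` are hypothetical helper names, and it relies only on the built-in `process.versions.node` string.

```javascript
// Parse the major version from a string such as "20.11.1" or "v20.11.1".
function nodeMajor(version) {
  return Number.parseInt(String(version).replace(/^v/, ""), 10);
}

// True when the version's major number is at least minMajor (default 20).
function meetsMinimum(version, minMajor = 20) {
  const major = nodeMajor(version);
  return Number.isInteger(major) && major >= minMajor;
}

const current = process.versions.node;
console.log(
  meetsMinimum(current)
    ? `Node.js v${current} is supported`
    : `Node.js v${current} is too old; v20 or above is required`
);
```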
Chrome Browser: The project uses Puppeteer (v14.2.0), which requires Google Chrome (v103) or Chromium to be installed on your system.
- Clone the repo
git clone https://github.com/s33chin/web-scraper.git
- Install NPM packages
npm install puppeteer@14.2.0 @puppeteer/browsers cli-progress puppeteer-core
- Adjust the Chrome executable path and other settings in the scrapeData function to match your system.
- Run the script
node indexWithPooling.js
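For the Chrome executable path step, the following sketch shows one way to pick a path per operating system. How scrapeData actually reads its settings is not shown here, so treat this as an assumption-laden illustration: `DEFAULT_CHROME_PATHS`, `buildLaunchOptions`, and the `CHROME_PATH` environment variable are hypothetical, and the listed paths are only typical install locations that may differ on your machine. `executablePath` and `headless` are standard Puppeteer launch options.

```javascript
// Typical default Chrome locations per platform (assumptions; adjust as needed).
const DEFAULT_CHROME_PATHS = {
  darwin: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
  win32: "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
  linux: "/usr/bin/google-chrome",
};

// Build an options object for puppeteer.launch().
// CHROME_PATH is a hypothetical environment variable used as an override.
function buildLaunchOptions(platform = process.platform, env = process.env) {
  const executablePath = env.CHROME_PATH || DEFAULT_CHROME_PATHS[platform];
  if (!executablePath) {
    throw new Error(
      `No known Chrome path for platform "${platform}"; set CHROME_PATH`
    );
  }
  return { executablePath, headless: true };
}

// Possible use with Puppeteer (not executed here):
// const browser = await puppeteer.launch(buildLaunchOptions());
```

Keeping the path in one place means only a single value needs to change when the script moves to another machine.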
Note: Ensure that you have proper permissions and authorization to scrape data from the target website. Respect the website's terms of service and policies while scraping.
Planned features:
- Auto re-login and authentication after session timeout
- Proxy and User-Agent Rotation
- Data Storage Options: CSV, JSON, Database
- User-Friendly CLI Interface
- Support for Multiple Browsers