GitHub - s33chin/web-scraper: javascript web scraper

Web Scraper

A Node.js and Puppeteer web scraper with auto scrolling!

Table of Contents

About The Project
- Built With
Getting Started
- Prerequisites
- Installation
Usage
Roadmap
Acknowledgments

About The Project

Purpose: This web scraper extracts data from websites automatically so that users can gather information quickly and efficiently. Web scraping is a form of copying in which specific data is gathered from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Features:

Login and Authentication: The scraper can handle login and authentication processes, allowing access to protected areas of websites.
Infinite Scroll Support: It can navigate through pages with infinite scroll, capturing dynamic content as it loads.
Data Extraction: The scraper can extract specific data elements such as names, emails, phone numbers, and addresses from targeted web pages.
Page Pooling: To optimize performance, it utilizes a page pooling mechanism, reusing and managing browser pages effectively.
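
The infinite scroll and page pooling features above can be illustrated with a minimal sketch. The names `autoScroll` and `PagePool` are hypothetical and are not taken from this repository's scripts; the sketch assumes only that a page object exposes Puppeteer's `page.evaluate()` and that new pages come from a factory such as `() => browser.newPage()`.

```javascript
// Sketch only: autoScroll and PagePool are illustrative names, not the repository's code.

// Scroll to the bottom repeatedly until the page height stops growing
// (or maxRounds is reached). Returns the number of rounds before it stabilised.
async function autoScroll(page, { maxRounds = 50, pauseMs = 500 } = {}) {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    // Runs in the browser context: jump to the bottom and report the height.
    const height = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });
    if (height === previousHeight) return round; // no new content appeared
    previousHeight = height;
    // Give lazily loaded content time to arrive before measuring again.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return maxRounds;
}

// Keep at most `size` pages open and reuse them across tasks.
class PagePool {
  constructor(createPage, size) {
    this.createPage = createPage; // e.g. () => browser.newPage()
    this.size = size;
    this.idle = [];    // pages ready for reuse
    this.created = 0;  // pages opened so far
    this.waiters = []; // resolvers for callers waiting on a free page
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.created < this.size) {
      this.created++;
      return this.createPage();
    }
    // Pool is exhausted: wait until release() hands a page over.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(page) {
    const next = this.waiters.shift();
    if (next) next(page);
    else this.idle.push(page);
  }

  // Run a task with a pooled page and always return the page afterwards.
  async use(task) {
    const page = await this.acquire();
    try {
      return await task(page);
    } finally {
      this.release(page);
    }
  }
}
```

With a real browser, usage might look like `const pool = new PagePool(() => browser.newPage(), 4)` followed by `pool.use(async (page) => { await page.goto(url); await autoScroll(page); })`. Capping the number of open pages bounds memory use while still letting several URLs be processed concurrently.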

Built With

Getting Started

This web scraping project allows you to extract data from websites (10ksb) efficiently. Follow the steps below to get started.

Prerequisites

Node.js and npm: Ensure you have Node.js (v20 or above) and npm (Node Package Manager) installed on your machine. Run the following commands in your terminal to check your Node.js version and to update npm to its latest release:

node
```
node -v
```
npm
```
npm install npm@latest -g
```

Chrome Browser: The project uses Puppeteer (v14.2.0), which requires Google Chrome (v103) or Chromium to be installed on your system.
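
The Node.js version requirement above can also be checked from inside a script. The following is a minimal sketch; `meetsNodeRequirement` is a hypothetical helper name and is not part of this repository.

```javascript
// Hypothetical helper: returns true when a version string such as "v20.11.1"
// (the format printed by `node -v`) meets the required major version.
function meetsNodeRequirement(version, requiredMajor = 20) {
  const major = Number.parseInt(version.replace(/^v/, '').split('.')[0], 10);
  return Number.isInteger(major) && major >= requiredMajor;
}

// process.version holds the running Node.js version, e.g. "v20.11.1".
if (!meetsNodeRequirement(process.version)) {
  console.warn(`Node.js v20 or above is expected, found ${process.version}`);
}
```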

Installation

Clone the repo
```
git clone https://github.com/s33chin/web-scraper.git
```
Install NPM packages
```
npm install puppeteer@14.2.0 @puppeteer/browsers cli-progress puppeteer-core
```
Adjust the Chrome executable path and other settings in the scrapeData function to match your system.

Run the script
```
node indexWithPooling.js
```
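
As an illustration of the "adjust the Chrome executable path" step, the sketch below builds launch options for Puppeteer from an environment variable. The helper names `buildLaunchOptions` and `openBrowser` and the `CHROME_PATH` variable are assumptions made for this example. The actual settings live in the scrapeData function and may be structured differently.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// CHROME_PATH is an assumed environment variable name, not one the repository defines.
function buildLaunchOptions(env = process.env) {
  const options = { headless: true };
  if (env.CHROME_PATH) {
    // Point Puppeteer at a locally installed Chrome or Chromium binary,
    // e.g. /usr/bin/google-chrome on many Linux systems.
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}

async function openBrowser() {
  // Required lazily so buildLaunchOptions can be used without Puppeteer loaded.
  const puppeteer = require('puppeteer');
  return puppeteer.launch(buildLaunchOptions());
}
```

Reading the path from the environment avoids editing the script on each machine, for example `CHROME_PATH=/usr/bin/google-chrome node indexWithPooling.js`, assuming the script were adapted to call such a helper.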

Note: Ensure that you have proper permissions and authorization to scrape data from the target website. Respect the website's terms of service and policies while scraping.

Usage

Roadmap

Auto re-login and authentication after session timeout
Proxy and User-Agent Rotation
Data Storage Options: CSV, JSON, Database
User-Friendly CLI Interface
Support for Multiple Browsers

Acknowledgments
