wordpress-scraper

Description

A simple, easy-to-use scraper for extracting data from the WordPress JSON API.

Features

Stores crawled documents as MongoDB documents or JSON files
Automatically retries on errors
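
As an illustration of the retry behaviour listed above, here is a minimal sketch of retrying with exponential backoff. It is not taken from this project's code: the `retry` helper, its parameters, and the backoff schedule are all assumptions made for the example.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Hypothetical helper for illustration only; the scraper's own
    retry logic may differ.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting, for example `retry(fetch_page, attempts=5)` where `fetch_page` is any zero-argument callable of your own.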

Requirements

Python 3.7+

Installation

pip install -r requirements.txt

How to use

Basic

Run crawl.py with the site's URL as the argument:

python3 crawl.py https://your.website.here

This crawls the site using DefaultCrawlSession, which attempts to crawl all posts, categories, and tags from the site.

The crawled JSON files will be stored in the directory ./data/<domain-name>.
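
To work with the crawled output afterwards, something like the following sketch may help. It assumes that `<domain-name>` is the host part of the URL you passed to crawl.py and that the files sit directly in that directory with a `.json` extension; neither detail is confirmed here, and `output_dir` and `load_documents` are hypothetical helper names.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def output_dir(site_url, root="./data"):
    """Guess the output directory for a site (assumes <domain-name> is the URL host)."""
    domain = urlparse(site_url).netloc
    return Path(root) / domain


def load_documents(directory):
    """Load every *.json file found directly in the directory (assumed layout)."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.json"))
    ]
```

For example, `load_documents(output_dir("https://your.website.here"))` would return a list of parsed documents if the layout matches these assumptions.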

In most cases, this is sufficient for sites that:

do not require signing in
do not block the JSON API paths
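
To check the second condition by hand, you can open the standard WordPress REST API routes in a browser or with curl. The sketch below builds those URLs using the core `/wp-json/wp/v2/` routes and the `per_page` and `page` query parameters. Whether this scraper requests exactly these URLs is an assumption, and `endpoint_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urljoin

# Core WordPress REST API collections matching what the default session crawls.
RESOURCES = ("posts", "categories", "tags")


def endpoint_url(site_url, resource, page=1, per_page=100):
    """Build a paginated WordPress REST API URL for one resource."""
    base = site_url.rstrip("/") + "/"  # keep any sub-path such as /blog
    query = urlencode({"per_page": per_page, "page": page})
    return urljoin(base, f"wp-json/wp/v2/{resource}") + "?" + query


if __name__ == "__main__":
    for name in RESOURCES:
        print(endpoint_url("https://your.website.here", name))
```

If these URLs return JSON, the site is probably crawlable as-is; a login page or an error response suggests the API is restricted.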

Advanced

For advanced usage and customization, see wpscraper/session.py for the actual crawling procedures, and write your own CrawlSession.
