SoloSynth1/wordpress-scraper

wordpress-scraper

Description

Simple, easy-to-use scraper to scrape data from WordPress JSON API

Features

  • Support storing crawled documents as MongoDB documents / JSON files
  • Auto retry upon errors
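The README does not describe the retry policy itself, so the following is only a minimal sketch of what retrying on errors can look like. The helper name `with_retries`, the attempt limit, and the exponential backoff are all assumptions made for illustration and are not taken from the project's code.

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on any exception, up to max_attempts calls.

    Hypothetical helper: the name, defaults, and backoff scheme are
    illustrative assumptions, not the scraper's actual behaviour.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:  # a real crawler would catch narrower errors
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

The `sleep` parameter is injectable so the waiting behaviour can be replaced in tests or tuned per site.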

Requirements

  • Python 3.7+

Installation

pip install -r requirements.txt

How to use

Basic

Run crawl.py with the site's URL as the argument:

python3 crawl.py https://your.website.here

This will crawl the site using DefaultCrawlSession, which attempts to crawl all posts, categories & tags from the site.
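To illustrate what crawling posts, categories and tags involves, here is a rough sketch of paging through the WordPress REST API (v2) collection endpoints. The `/wp-json/wp/v2/<resource>` paths and the `per_page`/`page` query parameters come from the WordPress REST API; the function names and the `fetch_page` callback are hypothetical and do not reflect how DefaultCrawlSession is actually implemented.

```python
from urllib.parse import urlencode

# Resource types the README says the default session crawls.
RESOURCES = ("posts", "categories", "tags")


def endpoint_url(site_url, resource, page=1, per_page=100):
    """Build a WP v2 collection URL (the API caps per_page at 100)."""
    base = site_url.rstrip("/")
    query = urlencode({"per_page": per_page, "page": page})
    return f"{base}/wp-json/wp/v2/{resource}?{query}"


def crawl_resource(site_url, resource, fetch_page):
    """Collect every item of one resource type.

    fetch_page(url) is a hypothetical callback that must return
    (list_of_items, total_pages); in practice total_pages can be read
    from the X-WP-TotalPages response header.
    """
    items = []
    page, total_pages = 1, 1
    while page <= total_pages:
        batch, total_pages = fetch_page(endpoint_url(site_url, resource, page))
        items.extend(batch)
        page += 1
    return items
```

Passing the HTTP call in as `fetch_page` keeps the paging logic independent of any particular HTTP client.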

The crawled JSON files will be stored in the directory ./data/<domain-name>.
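As a rough illustration of the output layout, the sketch below derives a `data/<domain-name>` directory from the site URL and writes one JSON file into it. Using the URL's network location as the domain name and writing one file per resource type (for example `posts.json`) are assumptions; the README does not specify the file naming.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def output_dir(site_url, root="data"):
    """Map a site URL to <root>/<domain-name> (assumed to be the URL's netloc)."""
    return Path(root) / urlparse(site_url).netloc


def save_items(site_url, resource, items, root="data"):
    """Write items to <root>/<domain-name>/<resource>.json and return the path.

    The one-file-per-resource naming is a hypothetical choice for this sketch.
    """
    target = output_dir(site_url, root)
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{resource}.json"
    with path.open("w", encoding="utf-8") as handle:
        json.dump(items, handle, ensure_ascii=False, indent=2)
    return path
```

For example, `output_dir("https://your.website.here")` yields the relative path `data/your.website.here`.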

In most cases, this is sufficient for sites that:

  1. do not require signing in
  2. do not block the JSON API paths
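One way to check both conditions before starting a crawl is to request a single post from the public endpoint and see whether it succeeds. This is only a sketch using the standard library; the function name `api_is_reachable` and the injectable `open_url` parameter are hypothetical, and a blocked or login-protected site is assumed to answer with an HTTP error such as 401 or 403.

```python
import urllib.error
import urllib.request


def api_is_reachable(site_url, open_url=urllib.request.urlopen, timeout=10):
    """Return True if the public WP v2 posts endpoint answers with HTTP 200."""
    url = site_url.rstrip("/") + "/wp-json/wp/v2/posts?per_page=1"
    try:
        with open_url(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        # 401/403/404 responses and network failures all count as unreachable.
        return False
```

If this returns False, the basic invocation is unlikely to work and a custom session may be needed.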

Advanced

For advanced usage and customization, see wpscraper/session.py for the actual crawling procedures, and write your own CrawlSession.

Upcoming Features

  • Rewrite/Refactor
  • MongoDB Connector
  • Async session
  • Authentication Module
  • Cloudflare circumvention
  • Configurable retry policies
  • Full WPv2 API resources support
