A simple, easy-to-use scraper for pulling data from the WordPress JSON API
- Supports storing crawled documents as MongoDB documents or JSON files
- Automatically retries on errors
- Python 3.7+
pip install -r requirements.txt
Just run crawl.py with the site's URL supplied:
python3 crawl.py https://your.website.here
This will crawl the site using DefaultCrawlSession, which attempts to crawl all posts, categories, and tags from the site. The crawled JSON files will be stored in the directory ./data/<domain-name>.
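To give a rough idea of what such a crawl involves, here is a minimal sketch of paging through one resource of the WordPress REST API (v2) and saving each item as a JSON file. It is not the code in wpscraper: the function names, the per-resource subdirectory, and the `<id>.json` file naming are assumptions made for illustration. It relies on the standard `/wp-json/wp/v2/<resource>` routes, the `per_page` and `page` query parameters, and the `X-WP-TotalPages` response header.

```python
import json
from pathlib import Path
from urllib.parse import urlencode, urlparse
from urllib.request import urlopen


def resource_url(site, resource, page, per_page=100):
    """Build a WP REST API v2 collection URL for one page of a resource."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"{site.rstrip('/')}/wp-json/wp/v2/{resource}?{query}"


def output_dir(site, root="data"):
    """Map a site URL to <root>/<domain-name>."""
    return Path(root) / urlparse(site).netloc


def http_get_json(url):
    """Fetch a URL and return (parsed JSON body, total number of pages)."""
    with urlopen(url, timeout=30) as resp:
        total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total_pages


def crawl_resource(site, resource, fetch=http_get_json, root="data"):
    """Download every page of one resource; return the number of items saved.

    Assumed layout (hypothetical): <root>/<domain-name>/<resource>/<id>.json
    """
    target = output_dir(site, root) / resource
    target.mkdir(parents=True, exist_ok=True)
    page, total_pages, saved = 1, 1, 0
    while page <= total_pages:
        items, total_pages = fetch(resource_url(site, resource, page))
        for item in items:
            path = target / f"{item['id']}.json"
            path.write_text(json.dumps(item), encoding="utf-8")
            saved += 1
        page += 1
    return saved
```

Calling `crawl_resource("https://your.website.here", "posts")` and repeating it for `"categories"` and `"tags"` would approximate the default behaviour described above. The `fetch` parameter is injectable so that retries or a different HTTP client can be swapped in.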
Most of the time, the default session will suffice when scraping sites that:
- do not require signing in
- do not block the JSON API paths
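One way to check both conditions before starting a crawl is to probe the posts endpoint and look at the HTTP status code. The sketch below is an illustrative assumption and not part of wpscraper; the function names and the labels it returns are hypothetical, and the mapping from status codes to labels is a rule of thumb, not a guarantee.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def classify_status(status):
    """Translate an HTTP status code into a rough verdict about the JSON API."""
    if status is None:
        return "unreachable"   # no HTTP response at all
    if status == 200:
        return "open"          # readable without signing in
    if status in (401, 403):
        return "restricted"    # sign-in required or access blocked
    if status == 404:
        return "missing"       # API not exposed at this path
    return "unknown"


def probe_status(site):
    """Request a single post and return the HTTP status, or None on network failure."""
    url = f"{site.rstrip('/')}/wp-json/wp/v2/posts?per_page=1"
    try:
        with urlopen(url, timeout=15) as resp:
            return resp.status
    except HTTPError as err:   # must come before URLError, its parent class
        return err.code
    except URLError:
        return None
```

For example, `classify_status(probe_status("https://your.website.here"))` returning `"open"` suggests the default session is likely to work, while `"restricted"` or `"missing"` suggests a custom session is needed.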
For advanced usage and customization, you may want to look at wpscraper/session.py for the actual crawling procedures and write your own CrawlSession.
- Rewrite/Refactor
- MongoDB Connector
- Async session
- Authentication Module
- Cloudflare circumvention
- Configurable retry policies
- Full WPv2 API resources support