A web scraping tool for extracting structured data from websites, with configurable rules and multiple execution modes.
- Configurable JSON-based scraping rules
- Multiple extraction modes:
  - Static: Fast HTML parsing without JavaScript execution
  - Browser: Full browser emulation with JavaScript support
- Concurrent scraping with adjustable worker count
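Concurrent scraping with an adjustable worker count can be pictured as a fixed-size worker pool. The following is a minimal sketch and is not taken from the project's source: `scrapeAll`, `fetch`, `result`, and the `workers` parameter are hypothetical names, and `fetch` is a stand-in that performs no network I/O.

```go
package main

import (
	"fmt"
	"sync"
)

// result pairs a URL with whatever the (hypothetical) fetch step returned.
type result struct {
	URL  string
	Body string
}

// fetch is a placeholder for the real scraping step; it does no network I/O.
func fetch(url string) string {
	return "scraped:" + url
}

// scrapeAll processes urls with a fixed number of worker goroutines.
// Each worker writes to its own slice index, so output order matches input order.
func scrapeAll(urls []string, workers int) []result {
	if workers < 1 {
		workers = 1
	}
	results := make([]result, len(urls))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = result{URL: urls[i], Body: fetch(urls[i])}
			}
		}()
	}
	// Feed job indices to the workers, then signal that no more work is coming.
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}
	for _, r := range scrapeAll(urls, 2) {
		fmt.Println(r.URL, r.Body)
	}
}
```

Raising `workers` increases how many pages are processed at once; the unbuffered `jobs` channel keeps at most that many pages in flight.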
go install github.com/crawlerclub/extractor/cmd/rabbitextract@latest
go install github.com/crawlerclub/extractor/cmd/rabbitcrawler@latest
rabbitextract is a command-line tool for extracting data from a single webpage using JSON configuration rules.
- `-config`: Path to the config JSON file (required)
- `-url`: URL to extract data from (optional if provided in config)
- `-mode`: Extraction mode (optional, defaults to "auto")
  - `auto`: Automatically choose between static and browser mode
  - `static`: Fast HTML parsing without JavaScript
  - `browser`: Full browser emulation with JavaScript support
- `-output`: Output file path (optional, defaults to stdout)
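The flag semantics listed above (one required flag, a default mode, a restricted set of mode values) can be expressed with Go's standard `flag` package. This is only an illustrative sketch, not the tool's actual source: the `options` struct and `parseArgs` function are hypothetical, and the validation rules are inferred from the descriptions above.

```go
package main

import (
	"flag"
	"fmt"
)

// options mirrors the documented flags; the struct itself is hypothetical.
type options struct {
	Config string
	URL    string
	Mode   string
	Output string
}

// parseArgs parses the documented flags and validates -config and -mode.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("rabbitextract", flag.ContinueOnError)
	fs.StringVar(&o.Config, "config", "", "path to the config JSON file (required)")
	fs.StringVar(&o.URL, "url", "", "URL to extract data from")
	fs.StringVar(&o.Mode, "mode", "auto", "extraction mode: auto, static, or browser")
	fs.StringVar(&o.Output, "output", "", "output file path (empty means stdout)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.Config == "" {
		return o, fmt.Errorf("-config is required")
	}
	switch o.Mode {
	case "auto", "static", "browser":
		// accepted values, as documented
	default:
		return o, fmt.Errorf("unknown mode %q", o.Mode)
	}
	return o, nil
}

func main() {
	// Sample arguments stand in for os.Args[1:] so the sketch runs on its own.
	o, err := parseArgs([]string{"-config", "config.json", "-mode", "static"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```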
- Create a configuration file `config.json`:
{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [
    {
      "name": "articles",
      "entity_type": "article",
      "selector": "//div[@class='article']",
      "fields": [
        {
          "name": "title",
          "type": "text",
          "selector": ".//h1"
        },
        {
          "name": "content",
          "type": "text",
          "selector": ".//div[@class='content']"
        }
      ]
    }
  ]
}
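To make the shape of this configuration concrete, here is a minimal sketch of how such a file could be decoded in Go with `encoding/json`. The `Config`, `Schema`, and `Field` types and the `parseConfig` function are hypothetical names chosen to mirror the JSON keys above; the library's real types may differ, and the "at least one schema" check is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field, Schema and Config are hypothetical types mirroring the JSON keys above.
type Field struct {
	Name     string `json:"name"`
	Type     string `json:"type"`
	Selector string `json:"selector"`
}

type Schema struct {
	Name       string  `json:"name"`
	EntityType string  `json:"entity_type"`
	Selector   string  `json:"selector"`
	Fields     []Field `json:"fields"`
}

type Config struct {
	Name       string   `json:"name"`
	ExampleURL string   `json:"example_url"`
	Schemas    []Schema `json:"schemas"`
}

// parseConfig decodes a config document and rejects configs without schemas
// (the emptiness check is an assumption, not documented behavior).
func parseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return c, err
	}
	if len(c.Schemas) == 0 {
		return c, fmt.Errorf("config %q defines no schemas", c.Name)
	}
	return c, nil
}

// sample is a shortened copy of the example configuration.
const sample = `{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [{
    "name": "articles",
    "entity_type": "article",
    "selector": "//div[@class='article']",
    "fields": [
      {"name": "title", "type": "text", "selector": ".//h1"},
      {"name": "content", "type": "text", "selector": ".//div[@class='content']"}
    ]
  }]
}`

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Each schema selects a set of nodes; its fields use selectors relative to each node.
	for _, s := range cfg.Schemas {
		fmt.Println(s.Name, s.Selector, len(s.Fields))
	}
}
```

Note that the schema-level selector is an absolute XPath (`//div[...]`), while the field selectors start with `.//`, which makes them relative to each matched node.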
- Run the extractor:
rabbitextract -config config.json -url "https://example.com/page" -output result.json
- `text`: Extract text content from an element
- `attribute`: Extract specific attribute value from an element
- `nested`: Extract nested object with multiple fields
- `list`: Extract array of items
- `_id`: Used to generate unique external_id for items
- `_time`: Used to set external_time for items