A simple web spider framework written in Python; requires Python 3.5+
- Supports a multi-threaded crawling mode (using threading)
- Supports crawling through proxies (using threading and queue)
- Defines utility functions and classes, for example: UrlFilter, get_string_num, etc.
- Few lines of code; easy to read, understand, and extend
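The README names UrlFilter and get_string_num but does not show their implementations, so the following is only a minimal sketch of what such utilities might look like; the deduplication and host-filtering behavior here is assumed, not taken from the library:

```python
import re
from urllib.parse import urlparse

class UrlFilter:
    """Minimal url filter: deduplicates urls and optionally restricts
    crawling to a set of allowed hosts (assumed behavior)."""

    def __init__(self, allowed_hosts=None):
        self._seen = set()
        self._allowed_hosts = set(allowed_hosts or [])

    def check(self, url):
        """Return True only the first time an acceptable url is seen."""
        host = urlparse(url).netloc
        if self._allowed_hosts and host not in self._allowed_hosts:
            return False
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

def get_string_num(string, default=0):
    """Extract the first integer found in a string, e.g. '12 comments' -> 12."""
    match = re.search(r"\d+", string)
    return int(match.group()) if match else default
```

A filter like this would typically sit in front of the UrlQueue so that each url is fetched at most once.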
- utilities module: defines utility functions and classes for the multi-threaded spider
- instances module: defines the Fetcher, Parser, and Saver classes for the multi-threaded spider
- concurrent module: defines the WebSpiderFrame of the multi-threaded spider
①: A Fetcher gets a url from the UrlQueue and makes a request based on that url
②: The result (content) of ① is put into the HtmlQueue, so the Parser can get it
③: The Parser gets content from the HtmlQueue and parses it to get new urls and an item
④: The new urls are put into the UrlQueue, so the Fetchers can get them
⑤: The item is put into the ItemQueue, so the Saver can get it
⑥: The Saver gets items from the ItemQueue and saves them to the filesystem or a database
⑦: The Proxieser gets proxies from the web or a database and puts them into the ProxiesQueue
⑧: A Fetcher gets a proxy from the ProxiesQueue if needed, and makes requests through that proxy
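Steps ① through ⑥ can be sketched with the standard threading and queue modules. This is only a toy illustration of the queue flow, not the library's actual code: a PAGES dict stands in for real HTTP requests, and the queue names mirror the README. The proxies flow (⑦/⑧) would follow the same pattern with one more queue:

```python
import threading
import queue

# Fake site: url -> (content, linked urls); stands in for real HTTP fetching.
PAGES = {
    "/a": ("content-a", ["/b", "/c"]),
    "/b": ("content-b", []),
    "/c": ("content-c", []),
}

url_queue = queue.Queue()    # ①/④: urls waiting to be fetched
html_queue = queue.Queue()   # ②/③: fetched content waiting to be parsed
item_queue = queue.Queue()   # ⑤/⑥: parsed items waiting to be saved
saved = []                   # stand-in for a filesystem or database
seen = set()                 # avoid fetching the same url twice

def fetcher():
    # ①: get a url from UrlQueue; ②: put its content into HtmlQueue
    while True:
        url = url_queue.get()
        if url is None:
            break
        html_queue.put((url, PAGES[url][0]))  # put BEFORE task_done
        url_queue.task_done()

def parser():
    # ③: get content from HtmlQueue; ④: new urls back to UrlQueue; ⑤: item out
    while True:
        got = html_queue.get()
        if got is None:
            break
        url, content = got
        for new_url in PAGES[url][1]:
            if new_url not in seen:
                seen.add(new_url)
                url_queue.put(new_url)
        item_queue.put((url, content))
        html_queue.task_done()

def saver():
    # ⑥: get items from ItemQueue and "save" them
    while True:
        item = item_queue.get()
        if item is None:
            break
        saved.append(item)
        item_queue.task_done()

threads = [threading.Thread(target=f) for f in (fetcher, parser, saver)]
for t in threads:
    t.start()

seen.add("/a")
url_queue.put("/a")

# Each worker puts downstream work before calling task_done, so each join
# hands pending work to the next queue; two rounds cover this depth-2 toy site.
for _ in range(2):
    url_queue.join()
    html_queue.join()
    item_queue.join()

for q in (url_queue, html_queue, item_queue):
    q.put(None)  # sentinel: shut the workers down
for t in threads:
    t.join()
```

After the run, saved holds one (url, content) pair per page; the real frame differs in details (multiple fetcher threads, real HTTP, error handling), but the hand-off between queues is the same.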
Installation: the first method is recommended
(1) Copy the "spider" directory to your project directory, and import spider
(2) Install spider into your Python environment using python3 setup.py install
See test.py for usage examples
- More Demos
- Distributed Spider
- Execute JavaScript