(Front end: a compatible Android app is available at Career_Crawler)
This project includes a single-threaded web crawler. It is designed to crawl sites with content in English and Hindi.
- Single-threaded
- Follows breadth-first strategy
- Handles various MIME types
- Can overcome anti-crawling traps deployed by web administrators by browsing in a human-like way
- Can crawl UTF-8 encoded sites, particularly those in English and Hindi
- Can normalise relative paths written in different styles
- Detects broken/missing links and handles a variety of HTTP errors
- Logo extractor to download logos of institutions
- Can be easily integrated with Selenium to watch the crawl live!
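
To illustrate two of the features listed above (the breadth-first strategy and relative-path normalisation), here is a minimal sketch in Python. It is not the project's actual code: the function names `normalise` and `crawl_bfs`, the `get_links` callback, and the `example.edu` pages are all hypothetical, and a toy in-memory site stands in for real HTTP fetching and HTML parsing.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlsplit


def normalise(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute


def crawl_bfs(seed, get_links, max_pages=100):
    """Visit pages breadth-first, one at a time, staying on the seed's host.

    get_links(url) is a hypothetical callback that returns the raw href
    values found on a page; a real crawler would fetch and parse HTML here.
    """
    host = urlsplit(seed).netloc
    queue = deque([seed])   # FIFO queue gives breadth-first order
    seen = {seed}           # avoids revisiting pages and looping forever
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            target = normalise(url, href)
            if urlsplit(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Toy site (assumption): links written in several relative styles.
SITE = {
    "http://example.edu/": ["about.html", "./courses/", "/contact#top"],
    "http://example.edu/about.html": ["/", "courses/../contact"],
    "http://example.edu/courses/": ["../about.html", "cs101.html"],
}

if __name__ == "__main__":
    for page in crawl_bfs("http://example.edu/", lambda u: SITE.get(u, [])):
        print(page)
```

The `seen` set is what keeps differently written links to the same page (for example `/contact#top` and `courses/../contact`) from being queued twice once they have been normalised.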
We crawled educational sites located nearby. Some of them are plotted on the map: