saumya-prakash/Single-threaded-web-crawler

Single-threaded Web Crawler

(Front end: a compatible Android app is available at Career_Crawler)

This project includes a single-threaded web crawler. It is designed to crawl sites with content in English and Hindi.

Features

  • Single-threaded
  • Follows breadth-first strategy
  • Handles various MIME types
  • Can overcome anti-crawling traps deployed by web administrators by behaving like a human visitor
  • Can crawl UTF-8 encoded sites, particularly those in English and Hindi
  • Can normalise relative paths written in different styles
  • Detects broken/missing links and handles a variety of HTTP errors
  • Logo extractor to download logos of institutions
  • Can be easily integrated with Selenium to watch the crawl live
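
To illustrate two of the features above (breadth-first traversal and normalising relative paths), here is a minimal Python sketch. It is not taken from this repository: the names `normalise`, `crawl_bfs`, `get_links` and `max_pages` are hypothetical, and `get_links` stands in for whatever code downloads a page and extracts its `href` values.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def normalise(base_url, href):
    """Resolve href against base_url and drop any #fragment.

    Handles styles such as "page.html", "./page.html", "../page.html"
    and "/dir/page.html" by delegating to urllib's urljoin.
    """
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute


def crawl_bfs(seed, get_links, max_pages=100):
    """Visit pages breadth-first from seed; return URLs in visit order.

    get_links(url) is a hypothetical callable that returns the raw
    href strings found on that page (assumption, not the project's API).
    """
    queue = deque([seed])   # FIFO queue gives breadth-first order
    seen = {seed}           # avoids revisiting the same URL
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            target = normalise(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Because a single FIFO queue drives the loop, every page at link depth n is visited before any page at depth n+1, and no threads are involved. A real crawler would also need fetching, error handling, and politeness delays, which this sketch leaves out.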

Sample results

We crawled nearby educational sites. Some of them have been plotted on the map:
