The Internet is composed of web pages connected through hyperlinks. This gives it a huge, sparsely connected graph structure, with about 1.3 billion unique domains [1] represented as nodes. One fundamental problem is to decipher this graph structure by exploring all hyperlinks of all web pages and storing the result in a suitable data structure. Because of the scale of the Internet, this is no longer a trivial task but a “Big Data” problem: the data itself becomes the bottleneck. In this project, we implemented a distributed web crawler capable of handling this volume of data by scaling across multiple nodes. Besides being scalable, the crawler is efficient and fault tolerant, and in our experiments it achieved high performance even with limited resources. Furthermore, we show that the data collected by the crawler is an accurate representation of a subset of the Internet and can be used to index web pages with algorithms such as PageRank.
- Highly scalable
- Control crawl rate by varying the number of producer or consumer nodes (see the Kafka sketch after this list)
- Fault tolerant
- Easy deployment through Docker
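
Crawl rate is controlled by the number of worker processes: Kafka distributes a topic's partitions across all consumers that share the same consumer group, so starting more consumers increases throughput. Below is a minimal sketch of that pattern using kafka-python; the broker address, topic name, and group id are illustrative assumptions, not values from this project.

```python
# Sketch only: broker, topic, and group names are assumptions.
from kafka import KafkaConsumer, KafkaProducer

BROKER = "kafka-host:9092"   # assumption: broker address from config.cfg
TOPIC = "crawl_frontier"     # assumption: topic holding URLs to crawl

# Producer side: push seed / newly discovered URLs into the frontier topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b"https://en.wikipedia.org/wiki/Main_Page")
producer.flush()

# Consumer side: every process started with the same group_id gets a share
# of the topic's partitions, so adding consumers raises crawl throughput.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="crawler-workers",
    auto_offset_reset="earliest",
)
for message in consumer:
    url = message.value.decode("utf-8")
    print("would crawl:", url)
```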
- AWS EC2
- Kafka
- Zookeeper
- MongoDB
- Docker
- urllib
- BeautifulSoup
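
The consumer's parsing step relies only on urllib and BeautifulSoup. The sketch below shows one way it could look: fetch a page, parse the HTML, and return the absolute URLs of its hyperlinks. Function and variable names here are illustrative, not taken from the project code.

```python
# Minimal link-extraction sketch using only urllib and BeautifulSoup.
from urllib.request import urlopen, Request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(url):
    """Fetch a page and return the absolute URLs of its hyperlinks."""
    req = Request(url, headers={"User-Agent": "distributed-crawler"})
    html = urlopen(req, timeout=10).read()
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL and drop fragments.
        href = urljoin(url, a["href"]).split("#")[0]
        if href.startswith("http"):
            links.append(href)
    return links

if __name__ == "__main__":
    for child in extract_links("https://en.wikipedia.org/wiki/Main_Page")[:5]:
        print(child)
```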
- Start Zookeeper, Kafka, and MongoDB instances on different nodes
- Put the IP addresses of these instances in config.cfg
- Start the config server in a separate node by running config.py
- Dockerize producer and consumer using the script provided
- Start multiple producers and consumers using the docker image and AWS ECR
- Check the WikiGraph collection in MongoDB for results
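
A quick way to inspect the crawl output from Python is with pymongo, as sketched below. The connection string and database name are assumptions to be replaced with your own setup; the document shape shown (a `children` array of outgoing links) is inferred from the map function below, which reads `this.children`.

```python
# Sketch only: host and database name are assumptions from config.cfg / your setup.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017/")  # assumed MongoDB host
db = client["crawler"]                               # assumed database name

print("pages crawled:", db.WikiGraph.count_documents({}))
print("sample document:", db.WikiGraph.find_one())
# Assumed shape: {"_id": ..., "url": "<page>", "children": ["<outgoing link>", ...]}
```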
- Map function
var mapFunction = function() {
    // emit each outgoing link (child URL) of this page with a count of 1
    for (var i = 0; i < this.children.length; i++) {
        emit(this.children[i], 1);
    }
};
- Reduce function
var reduceFunction = function(key, obj) {
    // sum the 1s emitted for this key, giving the page's in-link count
    return Array.sum(obj);
};
- Run mapReduce
db.WikiGraph.mapReduce(mapFunction, reduceFunction, { out : "total"});
- Check output
db.total.find().sort({value:-1}).limit(5).pretty()
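
The same ranking of most linked-to pages (what the `db.total` query above returns) can also be computed without mapReduce by running an aggregation pipeline. A minimal pymongo sketch, again assuming each WikiGraph document carries a `children` array and using illustrative connection settings:

```python
# Alternative to the mapReduce above: in-link counts via an aggregation pipeline.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017/")  # assumed MongoDB host
db = client["crawler"]                               # assumed database name

pipeline = [
    {"$unwind": "$children"},                               # one record per outgoing link
    {"$group": {"_id": "$children", "value": {"$sum": 1}}}, # count in-links per page
    {"$sort": {"value": -1}},
    {"$limit": 5},
]
for doc in db.WikiGraph.aggregate(pipeline):
    print(doc)
```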