Open Crawler 1.0.0 - Documentation

Table Of Contents

  • Getting Started
  • Features
  • Uses
  • Commands
  • Config File
  • Working
  • Note

Getting Started

Installation

Linux
```bash
git clone https://github.com/merwin-asm/OpenCrawler.git
cd OpenCrawler
chmod +x install.sh && ./install.sh
```
Windows

You need git, python3, and pip installed.

```bash
git clone https://github.com/merwin-asm/OpenCrawler.git
cd OpenCrawler
pip install -r requirements.txt
```

Features

  • Cross platform
  • Installer for Linux
  • Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
  • Memory efficient
  • Pool crawling - use multiple crawlers at the same time
  • Supports robots.txt
  • MongoDB storage
  • Language detection
  • 18+ / offensive content checks
  • Proxies
  • Multi-threading
  • URL scanning
  • Keyword, description, and recurring-word logging

Uses

Making a (basic) search engine:

This can be done easily, with very few modifications if required.

  • An inbuilt search function is also provided; it may not be great, but it does the job (search is discussed below).

OSINT tool:

You can use the tool to crawl sites related to a target and do OSINT with the search utility, or write custom code on top of it.

Pentesting tool:

Find all websites connected to a given site; this can be done with the connection-tree command (discussed below).

Crawler, as the name says.

Commands

Find Commands

To list the available commands, you can use either of these two methods.

Warning: this only works on Linux.

man opencrawler

For Linux:

opencrawler help

For Windows:

python opencrawler help

About Commands

help

Shows the commands available

v

Shows the current version of opencrawler

crawl

Starts the normal crawler.

forced_crawl <website>

Forcefully crawls the given <website>.

crawled_status

Warning: the data shown is not exact.

Gives info on the MongoDB database. It shows the number of sites crawled and the average amount of storage used.

Shows the info for both collections (more info on the collections is given in the Working section):

  • crawledsites
  • waitlist
search <search>

Uses basic filtering methods to search; this command is not meant to serve as a real search engine (how search works is discussed in the Working section).

configure

Configures opencrawler; the same command is also used to reconfigure. It asks for all the info required to start the crawler and saves it in a JSON file, config.json (more info in the Config File section).

It is fine to run the crawl command without configuring first; it will simply prompt you to configure.

connection-tree <website> <no of layers>

Shows a tree of websites connected to <website>.

<no of layers> is how deep you want to crawl. The default depth is 2.

check_html <website>

Checks whether a website returns HTML.
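As a rough illustration of this kind of check (a minimal sketch, not the code in crawler.py; the function name and the use of the requests library are assumptions):

```python
import requests

def returns_html(url: str, timeout: int = 10) -> bool:
    """Roughly what a 'does this site return HTML?' check looks like (illustrative only)."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return False
    # Treat anything declaring text/html in the Content-Type header as HTML.
    return "text/html" in resp.headers.get("Content-Type", "")

print(returns_html("https://example.com"))
```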

crawlable <website>

Checks whether a website is allowed to be crawled. It checks robots.txt to see if crawling is disallowed.
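A minimal sketch of such a check using Python's urllib.robotparser (the actual implementation in the crawler may differ; the function name is hypothetical):

```python
from urllib import robotparser
from urllib.parse import urljoin

def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Check the site's robots.txt to see whether crawling `url` is allowed."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()  # robotparser fetches robots.txt directly, so proxies are not applied here
    return rp.can_fetch(user_agent, url)

print(is_crawlable("https://example.com/some/page"))
```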

dissallowed <website>

Shows the disallowed URLs of a website. The results are based on robots.txt.

fix_db

Starts the fix-DB program. This can be used to clean up DB contamination caused by bugs in the code.

re-install

Reinstalls opencrawler.

update

Installs a new version of opencrawler (reinstalls it).

install-requirements

Installs the requirements listed in requirements.txt.

Config File

The file is generated by the configure command, which runs config.py.

The file is JSON: config.json.

The config file stores info regarding the crawling activity. This includes:

  • MONGODB_PWD - password of the MongoDB user
  • MONGODB_URI - URI for connecting to MongoDB
  • TIMEOUT - timeout for GET requests
  • MAX_THREADS - number of threads; set it to 1 if you don't want multithreading
  • bad_words - the file containing the list of bad words, which by default is bad_words.txt (bad_words.txt is provided)
  • USE_PROXIES - bool - whether the crawler should use proxies (a proxy won't be used for robots.txt scanning even if this is set to True)
  • Scan_Bad_Words - bool - whether to save the bad/offensive text score
  • Scan_Top_Keywords - bool - whether to save the top keywords found in the HTML text
  • urlscan_key - the UrlScan API key; leave it empty if you are not using the feature
  • URL_SCAN - bool - whether to scan URLs using the UrlScan API
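For reference, a minimal sketch of what config.py might write to config.json (the key names follow the list above; the values and exact formatting are assumptions):

```python
import json

# Illustrative values only; the exact defaults are set by config.py.
example_config = {
    "MONGODB_PWD": "<mongodb-user-password>",
    "MONGODB_URI": "mongodb://localhost:27017",
    "TIMEOUT": 10,
    "MAX_THREADS": 4,
    "bad_words": "bad_words.txt",
    "USE_PROXIES": False,
    "Scan_Bad_Words": True,
    "Scan_Top_Keywords": True,
    "urlscan_key": "",
    "URL_SCAN": False,
}

with open("config.json", "w") as f:
    json.dump(example_config, f, indent=4)
```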

Working

Files:

| Filename           | Type   | Use                                                                 |
|--------------------|--------|---------------------------------------------------------------------|
| opencrawler        | python | The main file, which gets called when using the opencrawler command |
| crawler.py         | python | The file which does the crawling                                    |
| requirements.txt   | text   | The file listing the Python modules to be installed                 |
| search.py          | python | Does the search                                                     |
| opencrawler.1      | roff   | The user manual                                                     |
| mongo_db.py        | python | Handles MongoDB                                                     |
| installer.py       | python | Installer for Linux, which is run by install.sh                     |
| install.sh         | shell  | Installs basic requirements like python3; for Linux use only        |
| fix_db.py          | python | Fixes the DB                                                        |
| connection_tree.py | python | Makes the connection tree                                           |
| config.py          | python | Configures the OpenCrawler                                          |
| bad_words.txt      | text   | Contains bad words used for predicting the bad/offensive text score |

MongoDB Collections

Two collections are used:

  • waitlist - Used for storing sites that are yet to be crawled
  • crawledsites - Used to store crawled sites and collected info about them

How data is stored in MongoDB

The structure in which data is stored in the collections:

crawledsites:

```python
# Crawled info is stored in MongoDB as:
crawledsites = [
    {
        "website": "<website>",

        "time": "<last_crawled_in_epoch_time>",
        "mal": Val/None,   # malicious or not
        "offn": Val/None,  # 18+ / offensive language
        "ln": "<language>",

        "keys": [<meta-keywords>],
        "desc": "<meta-desc>",

        "recc": [<recurring words>]/None,
    }
]
```

waitlist:

```python
waitlist = [
    {
        "website": "<website>"
    }
]
```
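A minimal pymongo sketch of writing documents in these shapes (illustrative only; the database name and field values are assumptions, and the real handles live in mongo_db.py):

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["opencrawler"]  # database name is an assumption

# Queue a site to be crawled.
db.waitlist.insert_one({"website": "https://example.com"})

# Record a crawled site along with its collected info.
db.crawledsites.insert_one({
    "website": "https://example.com",
    "time": str(time.time()),       # last crawled, epoch time
    "mal": None,                    # malicious score, or None
    "offn": None,                   # 18+/offensive score, or None
    "ln": "en",                     # detected language
    "keys": ["example", "demo"],    # meta keywords
    "desc": "Example description",  # meta description
    "recc": None,                   # recurring words, or None
})
```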

Connection Tree

By default, the depth is 2.

The connection tree works by getting all URLs found on a site, then doing the same with each URL found; how many times this repeats depends on the depth.
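A minimal recursive sketch of that idea (not the code in connection_tree.py; it assumes requests and BeautifulSoup are available):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def connection_tree(website: str, depth: int = 2) -> dict:
    """Collect the links found on a page, then repeat for each link, `depth` levels deep."""
    if depth == 0:
        return {}
    try:
        html = requests.get(website, timeout=10).text
    except requests.RequestException:
        return {}
    links = {urljoin(website, a["href"])
             for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)}
    return {link: connection_tree(link, depth - 1) for link in links}

tree = connection_tree("https://example.com", depth=2)
```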

Search

The search command uses the data stored in the crawledsites collection.

For each word of the query, it checks for sites containing that word in:

  • website URL
  • desc
  • keywords
  • top recurring words

The results are sorted so that sites matching the most words from the query come first.

```python
import re

# `word` is the current query word; _DB() is the project's MongoDB handle.
url = list(_DB().Crawledsites.find({"$or": [
    {"recc": {"$regex": re.compile(word, re.IGNORECASE)}},
    {"keys": {"$regex": re.compile(word, re.IGNORECASE)}},
    {"desc": {"$regex": re.compile(word, re.IGNORECASE)}},
    {"website": {"$regex": re.compile(word, re.IGNORECASE)}}
]}))
```
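A minimal sketch of that ranking step (illustrative only; `rank_results` and its inputs are hypothetical names, not code from search.py):

```python
from collections import Counter

def rank_results(query: str, sites_per_word: dict) -> list:
    """Rank sites by how many query words matched them; `sites_per_word` maps each
    query word to the list of matching site documents (e.g. the query result above)."""
    matches = Counter()
    for word in query.split():
        for site in sites_per_word.get(word, []):
            matches[site["website"]] += 1
    # Sites matching the most query words come first.
    return [website for website, _ in matches.most_common()]

print(rank_results("open crawler", {
    "open": [{"website": "https://example.com"}],
    "crawler": [{"website": "https://example.com"}, {"website": "https://example.org"}],
}))
```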

Note

  • Proxies don't work for robots.txt scans while you are crawling; this is because urllib.robotparser doesn't allow the use of a proxy
  • If you have issues with pymongo not working, try installing the version preferred for your specific Python version
  • If you get errors regarding pymongo, also make sure the user has read and write permissions
  • You can use a local MongoDB
  • The search function does not make use of all possible filters to find a site
  • installer.py and install.sh are not the same; install.sh also installs Python and pip, then runs installer.py
  • installer.py and install.sh are for Linux use only
  • We use the ProxyScrape API for getting free proxies
  • We use VirusTotal's API for scanning websites, if required