git clone https://github.com/merwin-asm/OpenCrawler.git
cd OpenCrawler
chmod +x install.sh && ./install.sh
You need git, python3 and pip installed
git clone https://github.com/merwin-asm/OpenCrawler.git
cd OpenCrawler
pip install -r requirements.txt
- Cross-platform
- Installer for Linux
- Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
- Memory efficient
- Pool crawling - use multiple crawlers at the same time
- Supports robots.txt
- MongoDB as the database
- Language detection
- 18+ / offensive content check
- Proxies
- Multithreading
- URL scanning
- Keyword, description, and recurring-word logging
This can be done easily, with very few modifications if required.
- We also provide an inbuilt search function, which may not be great but gets the job done (search is discussed below).
You can use the tool to crawl sites related to someone and do OSINT, either with the search utility or with custom code built on the crawled data.
Find all websites related to a given site; this can be achieved using the connection tree command (discussed below).
To find the commands, you can use either of these two methods.
Warning: this one only works on Linux:
man opencrawler
For Linux:
opencrawler help
For Windows:
python opencrawler help
Shows the commands available
Shows the current version of opencrawler
This would start the normal crawler
Forcefully crawls a site; the site crawled is <website>.
Warning: the data shown is not exact.
Gives info about the MongoDB database. This shows the number of sites crawled and the average amount of storage used.
Shows the info for both collections (more info on the collections is given in the working section):
- crawledsites
- waitlist
Uses basic filtering methods to search; this command is not meant to be anything like a search engine (how search works is discussed in the working section).
Configures OpenCrawler. The same command is also used to reconfigure. It asks for all the info required to start the crawler and saves it in a JSON file (config.json) (more info in the config section).
It is fine to run the crawl command without configuring first, because it will prompt you to configure.
A tree of websites connected to <website> is shown.
<no of layers> is how deep you want to crawl the site. The default depth is 2.
Checks if a website is returning HTML.
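Conceptually this boils down to fetching the URL and inspecting the response; the sketch below is only an illustration of the idea (using requests), not the project's exact check.

```python
# Sketch: check whether a URL responds with HTML (illustrative only).
import requests

def returns_html(url, timeout=10):
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return False
    return "text/html" in resp.headers.get("Content-Type", "")
```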
Checks if a website is allowed to be crawled. It checks robots.txt to find whether crawling is disallowed.
Shows the disallowed URLs of a website. The results are based on robots.txt.
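Both checks come down to parsing robots.txt; since the notes below mention that urllib.robotparser handles this, a minimal sketch of the idea looks like the following (the function name is illustrative, not the project's API).

```python
# Sketch: decide whether a URL may be crawled according to robots.txt.
from urllib import robotparser
from urllib.parse import urljoin

def allowed_to_crawl(url, user_agent="*"):
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()  # fetches robots.txt directly (no proxy support, see notes)
    return rp.can_fetch(user_agent, url)
```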
Starts the fix-db program. This can be used to resolve issues caused by bugs in the code that could contaminate the DB.
Reinstalls OpenCrawler.
Installs a new version of OpenCrawler / reinstalls it.
Installs the requirements. These requirements are listed in requirements.txt.
The file is generated by the configure command, which runs "config.py".
The file is in JSON format: "config.json".
The config file stores info regarding the crawling activity. These include (an illustrative sketch follows the list):
- MONGODB_PWD - password of the MongoDB user
- MONGODB_URI - URI for connecting to MongoDB
- TIMEOUT - timeout for GET requests
- MAX_THREADS - number of threads; set it to 1 if you don't want multithreading
- bad_words - the file containing the list of bad words, which by default is bad_words.txt (bad_words.txt is provided)
- USE_PROXIES - bool - whether the crawler should use proxies (a proxy is not used for robots.txt scanning even if this is set to True)
- Scan_Bad_Words - bool - whether to save the bad/offensive text score
- Scan_Top_Keywords - bool - whether to save the top keywords found in the HTML text
- urlscan_key - the UrlScan API key; if you are not using the feature, leave it empty
- URL_SCAN - bool - whether to scan URLs using the UrlScan API
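As a rough illustration of the resulting file, the sketch below writes a config.json with the keys listed above; every value is a placeholder (the exact types and defaults the project expects may differ), and normally the configure command creates this file for you.

```python
# Illustrative only: the configure command normally generates config.json.
import json

example_config = {
    "MONGODB_PWD": "<password>",            # placeholder
    "MONGODB_URI": "mongodb://localhost:27017",
    "TIMEOUT": 10,
    "MAX_THREADS": 4,
    "bad_words": "bad_words.txt",
    "USE_PROXIES": False,
    "Scan_Bad_Words": True,
    "Scan_Top_Keywords": True,
    "urlscan_key": "",
    "URL_SCAN": False,
}

with open("config.json", "w") as f:
    json.dump(example_config, f, indent=4)
```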
Filename | Type | Use |
---|---|---|
opencrawler | python | The main file, which gets called when using the opencrawler command |
crawler.py | python | The file which does the crawling |
requirements.txt | text | The file listing the Python modules to be installed |
search.py | python | Does the search |
opencrawler.1 | roff | The user manual |
mongo_db.py | python | Handles MongoDB |
installer.py | python | Installer for Linux, which is run by install.sh |
install.sh | shell | Installs basic requirements like python3; for Linux use only |
fix_db.py | python | Fixes the DB |
connection_tree.py | python | Makes the connection tree |
config.py | python | Configures OpenCrawler |
bad_words.txt | text | Contains the bad words used for computing the bad/offensive text score |
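For the bad/offensive text score, a rough sketch of how such a score can be computed from page text and bad_words.txt is shown below; this is an illustration of the idea (assuming one word per line in the list), not the project's exact scoring code.

```python
# Sketch: fraction of words in the page text that appear in bad_words.txt.
import re

def bad_word_score(text, word_list_path="bad_words.txt"):
    with open(word_list_path, encoding="utf-8") as f:
        bad_words = {line.strip().lower() for line in f if line.strip()}
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in bad_words)
    return hits / len(words)
```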
There are two collections used:
- waitlist - used for storing sites which are yet to be crawled
- crawledsites - used to store crawled sites and the info collected about them
The structure in which data is stored in the collections:
######### Crawled info is stored in MongoDB as #########

Crawledsites = [
    {
        "website" : "<website>",
        "time" : "<last_crawled_in_epoch_time>",
        "mal" : Val/None,   # malicious or not
        "offn" : Val/None,  # 18+ / offensive language
        "ln" : "<language>",
        "keys" : [<meta-keywords>],
        "desc" : "<meta-desc>",
        "recc" : [<recurring words>]/None,
    }
]

waitlist = [
    {
        "website" : "<website>"
    }
]
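As a hedged sketch of reading these collections with pymongo: the database name "opencrawler" below is an assumption (the real name comes from the project's code), the connection values come from config.json, and authentication via MONGODB_PWD is omitted for brevity.

```python
# Minimal pymongo sketch; database name and auth handling are assumptions.
import json
from pymongo import MongoClient

with open("config.json") as f:
    cfg = json.load(f)

client = MongoClient(cfg["MONGODB_URI"])
db = client["opencrawler"]  # assumed database name

# Count documents and peek at one entry from each collection.
print("crawled sites:", db["crawledsites"].count_documents({}))
print("waitlisted sites:", db["waitlist"].count_documents({}))
print(db["crawledsites"].find_one())
```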
By default, the depth of the tree is 2.
The tree command works by getting all the URLs found on a site, then doing the same with the URLs found; the number of times this repeats depends on the depth.
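A rough sketch of that idea is below; it is not connection_tree.py itself, and it assumes requests and BeautifulSoup for fetching pages and extracting links.

```python
# Sketch: depth-limited expansion of the links found on a site.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def connection_tree(url, depth=2, timeout=10):
    if depth == 0:
        return {}
    try:
        html = requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return {}
    soup = BeautifulSoup(html, "html.parser")
    links = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}
    # Each child link is expanded the same way, one level shallower.
    return {link: connection_tree(link, depth - 1, timeout) for link in links}
```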
The search command uses the data stored in the crawledsites collection.
For each word of the query, it checks for sites containing that word in:
- website URL
- desc
- keywords
- top recurring words
The results are sorted so that the sites matching the most words from the query come first. For each word, a query like the following is run:
url = list(_DB().Crawledsites.find({"$or" : [
{"recc": {"$regex": re.compile(word, re.IGNORECASE)}},
{"keys": {"$regex": re.compile(word, re.IGNORECASE)}},
{"desc": {"$regex": re.compile(word, re.IGNORECASE)}},
{"website" : {"$regex": re.compile(word, re.IGNORECASE)}}
]}))
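The query above only collects the candidate documents per word; ranking them by how many query words matched could look like the sketch below (an illustration of the sorting step, not necessarily the exact code in search.py).

```python
# Sketch: rank candidate websites by how many query words matched them.
from collections import Counter

def rank_results(per_word_results):
    """per_word_results: mapping of each query word to the documents it matched."""
    counts = Counter()
    for docs in per_word_results.values():
        for doc in docs:
            counts[doc["website"]] += 1
    # Websites matching the most query words come first.
    return [site for site, _ in counts.most_common()]
```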
- Proxies don't work for robots.txt scans while you are crawling; this is because urllib.robotparser does not allow the use of a proxy.
- If you have issues with pymongo not working, try installing the version preferred for your specific Python version.
- If you get errors regarding pymongo, also make sure you give read and write permissions to the user.
- You can use a local MongoDB.
- The search function does not make use of all possible filters to find a site.
- installer.py and install.sh are not the same; install.sh also installs python and pip, and then runs installer.py.
- installer.py and install.sh are for Linux use only.
- We use the ProxyScrape API for getting free proxies.
- We use VirusTotal's API for scanning websites, if required.