This project builds a web search engine in Python. It meets the following criteria:
- Collect HTML pages up to a maximum size (according to the given crawl depth).
- Pre-process these pages (eliminate 'stop words').
- Index the crawled data.
- Submit queries and return the results.
- Display the results in an appropriate order of relevance.
- A web interface!
# Requirements

See requirements.txt

# Project structure

- indexes : A directory of web pages indexed by indexer.py; the index layout is sketched below.
  - forward_index : forward index
  - inverted_index : inverted index
  - url_to_id : mapping of indexed URLs to document ids
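A minimal sketch of what these three structures could hold, as plain Python dicts; the actual on-disk format used by indexer.py is not specified here and is an assumption:

```python
# Illustrative contents of the three index structures (assumed to be dicts).
url_to_id = {"https://www.python.org/": 0}        # URL -> document id

forward_index = {                                  # document id -> its terms
    0: ["python", "programming", "language"],
}

inverted_index = {                                 # term -> ids of documents containing it
    "python": [0],
    "programming": [0],
    "language": [0],
}
```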
- links : A directory of web pages crawled by Crawler.py.
  Each crawled page (URL) is saved offline in the links directory under a base64-encoded filename, so that long URLs map to distinct, valid file names (sketched below).
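A sketch of that naming scheme; the URL-safe base64 variant is an assumption, but any base64 encoding gives each URL a distinct, filesystem-safe name:

```python
# Sketch: derive a unique, filesystem-safe filename from a URL via base64.
import base64
import os

def url_to_filename(url: str) -> str:
    # urlsafe_b64encode avoids '/' characters that would break file paths.
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")

def save_page(url: str, html: str, links_dir: str = "links") -> str:
    os.makedirs(links_dir, exist_ok=True)
    path = os.path.join(links_dir, url_to_filename(url))
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```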
- static : A directory of static files such as images (logo).
- templates : A directory of views (front-end files).
- Crawler.py : Main crawler module.
  To run Crawler.py:
  $ python Crawler.py --start_url "url" --max_depth depth_value
  url : the website address you want to crawl. depth_value : the maximum crawl depth (integer).
  - If the crawl completes successfully, it runs indexer.py by itself; otherwise, run the indexer.py module manually as described next.
  - By default, all crawled data is stored in the 'links/' directory.
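A minimal sketch of the crawl loop behind this command, assuming a breadth-first, depth-limited crawl built on requests; the class and function names are illustrative, not the actual Crawler.py internals:

```python
# Sketch: depth-limited breadth-first crawl matching the CLI above.
import argparse
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url: str, max_depth: int):
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        yield url, html                      # caller saves the page under links/
        if depth >= max_depth:
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--start_url", required=True)
    parser.add_argument("--max_depth", type=int, required=True)
    args = parser.parse_args()
    for url, html in crawl(args.start_url, args.max_depth):
        print("crawled:", url)
```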
- indexer.py : Main indexer module.
  To run indexer.py:
  $ python indexer.py --stored_docs_dir links/ --index_dir indexes
  - This module requires two arguments:
    - the stored links directory (used to generate the index)
    - the index directory (where the generated index is stored)
  - If indexing completes successfully, it runs web_ui.py by itself.
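A sketch of an indexer with the same two-argument interface, assuming filenames are base64-encoded URLs and the indexes are pickled dicts; the tokenization here is deliberately simplified:

```python
# Sketch: build url_to_id, forward_index and inverted_index from stored pages.
import argparse
import base64
import os
import pickle
import re

def build_indexes(stored_docs_dir: str, index_dir: str) -> None:
    url_to_id, forward_index, inverted_index = {}, {}, {}
    for doc_id, filename in enumerate(sorted(os.listdir(stored_docs_dir))):
        # Recover the original URL from the base64-encoded filename.
        url = base64.urlsafe_b64decode(filename.encode("ascii")).decode("utf-8")
        url_to_id[url] = doc_id
        path = os.path.join(stored_docs_dir, filename)
        with open(path, encoding="utf-8", errors="ignore") as f:
            terms = re.findall(r"[a-z0-9]+", f.read().lower())
        forward_index[doc_id] = terms
        for term in set(terms):
            inverted_index.setdefault(term, []).append(doc_id)

    os.makedirs(index_dir, exist_ok=True)
    for name, index in [("url_to_id", url_to_id),
                        ("forward_index", forward_index),
                        ("inverted_index", inverted_index)]:
        with open(os.path.join(index_dir, name), "wb") as f:
            pickle.dump(index, f)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--stored_docs_dir", required=True)
    parser.add_argument("--index_dir", required=True)
    args = parser.parse_args()
    build_indexes(args.stored_docs_dir, args.index_dir)
```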
- web_ui.py : Front-end (view).
  To run web_ui.py:
  $ python web_ui.py
  - The server starts at 0.0.0.0:8080.
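A minimal sketch of a view layer bound to 0.0.0.0:8080. Flask, the template name, and the lookup logic are assumptions suggested by the templates/ and static/ directories; ranking by relevance is omitted here:

```python
# Sketch: a minimal search front-end on 0.0.0.0:8080 (Flask is an assumption).
import pickle

from flask import Flask, render_template, request

app = Flask(__name__)

with open("indexes/inverted_index", "rb") as f:
    inverted_index = pickle.load(f)
with open("indexes/url_to_id", "rb") as f:
    url_to_id = pickle.load(f)
id_to_url = {doc_id: url for url, doc_id in url_to_id.items()}

@app.route("/")
def search():
    terms = request.args.get("q", "").lower().split()
    doc_ids = set()
    for term in terms:
        doc_ids.update(inverted_index.get(term, []))
    results = [id_to_url[d] for d in doc_ids]
    # "results.html" is a hypothetical template under templates/.
    return render_template("results.html", query=" ".join(terms), results=results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```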
- lang_proc.py : Language processing module.
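A sketch of the kind of pre-processing listed in the criteria above (tokenization plus stop-word removal); the stop-word set here is a small illustrative subset, not the project's actual list:

```python
# Sketch: tokenize text and drop stop words (illustrative stop-word subset).
import re

STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "or", "the", "to"}

def to_terms(text: str) -> list:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Example:
# to_terms("The Python programming language is great")
# -> ['python', 'programming', 'language', 'great']
```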
- util.py : HTML parser module.
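A sketch of an HTML-parsing helper built on the standard library; the function name and the choice to extract only visible text are assumptions about what util.py provides:

```python
# Sketch: extract the visible text of an HTML page using html.parser.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulates text nodes, skipping script and style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)
```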
Example crawled sites:
1. wikipedia.org (crawl depth 1)
2. Python.org (crawl depth 3)
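For reference, the crawl commands for these examples would look like the following; the exact start URLs are assumptions:
  $ python Crawler.py --start_url "https://www.wikipedia.org" --max_depth 1
  $ python Crawler.py --start_url "https://www.python.org" --max_depth 3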