
Wikipedia-Search-Engine

This project implements a search engine over the ~60 GB Wikipedia XML dump. It supports both simple queries and multi-field queries over the Title, Infobox, Body, Category, and Links fields, and it uses multi-level indexing, ranking algorithms, and multi-threading for parallel processing of queries. The code consists of indexer.py and search.py.
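As a rough illustration of the parallel query processing mentioned above, here is a minimal sketch using Python's standard thread pool. run_query is a hypothetical stand-in for the real on-disk search, not code from this repository.

from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    # Hypothetical placeholder: the real search would consult the on-disk index.
    return f"results for {query!r}"

queries = ["sachin tendulkar", "world cup", "alan turing"]
# Independent queries can be processed in parallel by a small thread pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_query, queries):
        print(result)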

About:

The project can be broken down into the following steps (a minimal end-to-end sketch follows the list):

  • Building the index over the given data.
  • Implementing search queries and retrieving all pages relevant to a query.
  • Implementing a page-ranking algorithm to return the K most relevant pages.
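The sketch below shows these steps in miniature: an in-memory inverted index and a TF-IDF-style ranking that returns the top K documents. It is an illustrative toy under simplifying assumptions (whitespace tokenization, everything held in memory), not the repository's actual multi-level, on-disk implementation.

import math
from collections import Counter, defaultdict

def build_index(docs):
    # Inverted index: term -> {doc_id: term frequency in that document}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(index, query, n_docs, k=10):
    # Score each candidate document with a simple TF-IDF sum; return the top K.
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    return scores.most_common(k)

docs = {
    1: "alan turing computer science pioneer",
    2: "turing machine model of computation",
    3: "history of the computer",
}
index = build_index(docs)
print(search(index, "turing computer", len(docs)))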

How to run:

python3 indexer.py pathtoXMLDumpDirectory stat.txt
  • This script takes the corpus directory as input and builds the entire index in a field-separated manner.
  • It also creates a vocabulary list and a file containing the title-ID map.
  • Along with these files, it also creates the offsets for all of the index files (a sketch of how such offsets enable fast lookups follows this list).
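As a hedged sketch of why the offset files matter: given a sorted index file and a table of per-term byte offsets, a query term can be resolved with a binary search plus a single seek, without loading the index into memory. The file layout and the one-line-per-term format below are assumptions for illustration, not the repository's actual format.

import bisect

def load_offsets(path):
    # Assumed (hypothetical) offset-file format: "term byte_offset" per line,
    # sorted by term, each offset pointing into the corresponding index file.
    terms, offsets = [], []
    with open(path) as f:
        for line in f:
            term, off = line.split()
            terms.append(term)
            offsets.append(int(off))
    return terms, offsets

def lookup(term, index_path, terms, offsets):
    # Binary-search the in-memory offset table, then seek into the index file
    # and read only the single line holding this term's postings.
    i = bisect.bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return None
    with open(index_path) as f:
        f.seek(offsets[i])
        return f.readline().rstrip("\n")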
python3 search.py queries.txt
  • This script takes queries.txt as its argument, which contains a list of queries. It returns the top K results (K is specified alongside each query in queries.txt) from the Wikipedia corpus; a sketch of a possible file format follows.
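The README does not pin down the exact line format of queries.txt. A common convention for this kind of assignment, assumed here purely for illustration, is "K, query" per line, with optional field prefixes such as t: (title) or c: (category):

def parse_queries(path):
    # Assumed (hypothetical) format, one query per line:
    #   10, sachin tendulkar
    #   5, t:world cup c:2018
    queries = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            k, _, query = line.partition(",")
            queries.append((int(k.strip()), query.strip()))
    return queries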
