
🔎 Building and evaluating the performance of retrieval models such as TF-IDF, BM25, Smoothed Query Likelihood, and Lucene.


karantyagi/information-retrieval-systems


Implementing and Evaluating Information Retrieval Models

This repository contains the project work for the course CS 6200 Information Retrieval Systems at Northeastern University. The project implements information retrieval steps such as corpus cleaning, indexing, stemming, and query enhancement. It also implements several document search models, including BM25, TF-IDF, and the Query Likelihood Model, alongside Lucene. The corpus is CACM.
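As a rough illustration of one of the models named above, here is a minimal BM25 scoring sketch. It is not taken from the project's code: the class name `Bm25Sketch`, the parameter values `K1 = 1.2` and `B = 0.75`, and the non-negative idf variant are all assumptions, and the project's own implementation may use a different formulation.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of BM25 scoring; not the project's implementation.
final class Bm25Sketch {
    static final double K1 = 1.2;  // assumed term-frequency saturation parameter
    static final double B = 0.75;  // assumed length-normalization parameter

    // Inverse document frequency; the "1 +" inside the log keeps it non-negative.
    static double idf(int docCount, int docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Scores one document against a query.
    // termFreqs: term -> frequency in this document
    // docFreqs:  term -> number of documents in the corpus containing the term
    static double score(List<String> queryTerms,
                        Map<String, Integer> termFreqs,
                        int docLength,
                        double avgDocLength,
                        Map<String, Integer> docFreqs,
                        int docCount) {
        // Length normalization shared by every term of this document.
        double norm = K1 * (1 - B + B * docLength / avgDocLength);
        double score = 0.0;
        for (String term : queryTerms) {
            int tf = termFreqs.getOrDefault(term, 0);
            if (tf == 0) {
                continue; // terms absent from the document contribute nothing
            }
            int df = docFreqs.getOrDefault(term, 0);
            score += idf(docCount, df) * tf * (K1 + 1) / (tf + norm);
        }
        return score;
    }
}
```

A document that contains none of the query terms scores 0, and repeated occurrences of a term raise the score with diminishing returns because of the `tf + norm` denominator.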

General Layout

The code is divided into multiple functional packages.

  1. cleaner: handles the cleaning logic.
  2. indexer: handles the indexing logic, based on the cleaned corpus.
  3. retriever: implements the document retrieval algorithms.
  4. stemmer: handles the stemming task.
  5. utils: general-purpose functions.
  6. evaluation: evaluates the documents retrieved by each model using metrics such as Precision, Recall, MAP, and MRR.
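The metrics listed for the evaluation package follow standard definitions, which can be sketched as below. The class `EvalSketch` and its method names are hypothetical and are not the project's API. MAP is the mean of `averagePrecision` over all queries, and MRR is the mean of `reciprocalRank` over all queries.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-query ranking metrics; not the project's API.
final class EvalSketch {
    // Counts relevant documents among the first k entries of the ranking.
    private static int hitsAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    // Fraction of the top-k results that are relevant.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        return k <= 0 ? 0.0 : (double) hitsAtK(ranked, relevant, k) / k;
    }

    // Fraction of all relevant documents found in the top-k results.
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        return relevant.isEmpty() ? 0.0 : (double) hitsAtK(ranked, relevant, k) / relevant.size();
    }

    // Mean of the precision values at each rank where a relevant document appears,
    // divided by the total number of relevant documents.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    // 1 / rank of the first relevant document, or 0 if none is retrieved.
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }
}
```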

Compiling and Running Program

Creating cleaned corpus and index files.

  • Import the project in IntelliJ or Eclipse
  • To generate the cleaned corpus, run Cleaner.java in the cleaner package. This creates the folder src/main/resources/testcollection/cleanedcorpus.
  • To generate the index, use Indexer.java. StemmedIndexer.java can be used to generate the index for the stemmed version of the CACM corpus.
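To give a sense of what an index built from the cleaned corpus can look like, here is a minimal in-memory inverted index sketch. It is only an illustration under stated assumptions: the class `InvertedIndexSketch` is hypothetical, documents are passed in as strings rather than read from the cleanedcorpus folder, and tokens are split on whitespace. The actual Indexer.java may store and tokenize things differently.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a unigram inverted index; not the project's Indexer.
final class InvertedIndexSketch {
    // Maps each term to (document ID -> term frequency in that document).
    // cleanedDocs: document ID -> already-cleaned text (assumed whitespace-separated tokens).
    static Map<String, Map<String, Integer>> build(Map<String, String> cleanedDocs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : cleanedDocs.entrySet()) {
            for (String token : doc.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields a single empty token
                }
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }
}
```

The document frequency of a term is then the size of its posting map, which is the quantity that models such as TF-IDF and BM25 need.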

Running project tasks

  • Every task in the project can be run with a command-line flag passed to Runner.java.
  • Run the Runner.java#main() method in the retreivalmodels package.
  • Run options usage: Retreival Model: -taskName <arg>
  • <arg> is the task to run; it can be one of TASK1, TASK2, TASK3, PHASE1, PHASE2, noiseGeneration, or softMatching.
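The flag handling described above could be sketched as follows. This is not the code in Runner.java, which may well use a command-line parsing library. The class `TaskFlagSketch` is hypothetical, and treating task names as case-sensitive is an assumption.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of validating the -taskName flag; not the project's Runner.
final class TaskFlagSketch {
    // Task names as listed in this README.
    static final List<String> TASKS = List.of(
            "TASK1", "TASK2", "TASK3", "PHASE1", "PHASE2", "noiseGeneration", "softMatching");

    // Returns the task that follows "-taskName" if it is a known task name,
    // or an empty Optional when the flag is missing or its value is unknown.
    static Optional<String> parseTask(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            if ("-taskName".equals(args[i]) && TASKS.contains(args[i + 1])) {
                return Optional.of(args[i + 1]);
            }
        }
        return Optional.empty();
    }
}
```

With this sketch, the arguments `-taskName TASK1` select TASK1, while an unrecognized value yields an empty result so the caller can print the usage message.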

NOTE: Read more about the tasks in the Problem Statement.

Key Terms

BM25, Lucene, Query Language Model, Noise Generation, Soft Matching

