This project presents an approach to information retrieval (IR) that departs from traditional index-then-retrieve pipelines. The Differentiable Search Index (DSI) folds indexing and retrieval into a single Transformer language model. The model is trained on the MS MARCO dataset, and the project depends on the Pyserini library. The goal is to improve retrieval effectiveness by having the model directly generate the identifiers (docids) of the documents relevant to a query.
- Develop a single sequence-to-sequence model (`f`) that maps a query directly to the docids of relevant documents.
- Investigate different training methodologies, such as auto-regressive and teacher-forcing techniques.
- Improve retrieval effectiveness, measured by Mean Average Precision (MAP), Precision@10, and Recall@1000.
- Dataset: We utilize the MS MARCO dataset for this task. Detailed instructions on how to access and use the dataset with the Pyserini library are included.
- Project Repository
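
The evaluation metrics named in the objectives can be illustrated with a minimal Python sketch. It is not taken from the repository: the function names, and the `run` and `qrels` dictionaries (query id to ranked docids, and query id to the set of relevant docids), are hypothetical. It also assumes binary relevance and no duplicate docids within a ranking.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k positions filled by relevant docids."""
    if k <= 0:
        return 0.0
    return sum(1 for docid in ranked[:k] if docid in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 1000) -> float:
    """Fraction of all relevant docids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant docid occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            total += hits / rank
    # Relevant docids that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of per-query average precision over all queries in the run."""
    if not run:
        return 0.0
    scores = [average_precision(r, qrels.get(qid, set())) for qid, r in run.items()]
    return sum(scores) / len(scores)
```

For a DSI-style model, `run` would hold the docids decoded for each query in rank order (for example, from beam search), so these functions apply regardless of how the ranking was produced.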
To set up your environment to use the DSI model, follow these steps:
- Clone the repository containing our source files:

  ```bash
  git clone https://github.com/Saad-data/neural_inverted_index_dsi
  ```
Syed Saad Hasan
Email: hasan.2106512@studenti.uniroma1.it