Baseline models for searching movie plots from Wikipedia articles. Techniques include BM25 (lexical search), bi-encoders and cross-encoders (semantic search), and retrieval-augmented generation (RAG) using Mistral 7B through Fireworks.ai.
- 3 notebooks
- executive analysis writeup [ pdf | docx ]
- requirements.txt
Develop a prototype for a search tool that helps users find relevant movies based on their queries.
Use the following code to import the dataset, if needed.
from datasets import load_dataset
ds = load_dataset("Coder-Dragon/wikipedia-movies", split='train[:1000]')
The dataset includes movie titles, plots, genres, actors, and other relevant information, mined from Wikipedia articles. This experiment uses only the first 1,000 movies, all of which date from the 1920s or earlier. Only the movie titles and their plots are embedded and queried.
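To make the lexical baseline concrete, here is a minimal, dependency-free sketch of BM25 scoring over combined title and plot text. It is an illustration under stated assumptions, not the code from the notebooks: the field names `Title` and `Plot`, the toy records, the tokenizer, and the parameter values `k1=1.5` and `b=0.75` are all assumed, and the IDF uses the common non-negative (Lucene-style) variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (a deliberately simple tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hand-written toy records, NOT rows from the dataset; field names are assumed.
movies = [
    {"Title": "The Kid", "Plot": "A tramp finds an abandoned baby and raises him as his own."},
    {"Title": "Nosferatu", "Plot": "A vampire count moves to a new town and spreads terror."},
    {"Title": "The General", "Plot": "A train engineer chases spies who stole his locomotive."},
]
docs = [f"{m['Title']}. {m['Plot']}" for m in movies]

scores = bm25_scores("vampire terror", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(movies[best]["Title"])
```

Concatenating title and plot into one string keeps the sketch short; scoring the two fields separately and weighting them is another reasonable choice.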
You should be able to run the notebooks in Colab without extra setup. If you hit dependency-related errors, or would like to run the notebooks locally, install the packages listed in the included requirements.txt.
We recommend running the notebooks in the following order: semantic search, then reranker, then RAG. This is the order in which they were developed, and the methods grow in complexity along the way. Evaluation metrics are calculated in the notebooks where applicable.
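The progression from semantic search to reranking follows a common two-stage pattern: a cheap scorer ranks every document, then a more expensive scorer re-orders only the top candidates. The sketch below shows that control flow with pluggable scoring functions. The function `retrieve_then_rerank` and the stand-in scorers `overlap` and `jaccard` are hypothetical names used for illustration; in the notebooks these roles would presumably be filled by a bi-encoder and a cross-encoder, which are not reproduced here.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (doc_index, reranker_score) pairs for the top_k candidates."""
    # Stage 1: score every document with the cheap scorer and keep the top_k.
    candidates = sorted(
        range(len(docs)),
        key=lambda i: first_stage(query, docs[i]),
        reverse=True,
    )[:top_k]
    # Stage 2: rescore only the surviving candidates with the expensive scorer.
    rescored = [(i, reranker(query, docs[i])) for i in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def overlap(query: str, doc: str) -> float:
    # Stand-in first stage: number of shared tokens (NOT a bi-encoder).
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    # Stand-in reranker: Jaccard similarity of token sets (NOT a cross-encoder).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


# Toy documents written for this example only.
docs = [
    "a detective hunts a thief in paris",
    "a thief falls in love",
    "a farmer plants corn",
]
results = retrieve_then_rerank("thief in love", docs, overlap, jaccard, top_k=2)
print(results)
```

Passing the scorers in as functions keeps the pipeline independent of any particular model, so a real embedding model or reranker can be swapped in without changing the control flow.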
Finally, feel free to review the executive analysis writeup [ pdf | docx ] for experiment findings and recommendations. An appendix is included with all experiment metrics and results neatly organized into tables.