This repository is a simple wrapper around common retrieval tools, especially scikit-learn's TF-IDF, some Hugging Face models, and the FAISS library.
The examples support fast, typo-tolerant TF-IDF, multilingual sentence embedding models, and hybrid methods for retrieval.
This repo is mainly for educational purposes, so the code is kept readable and avoids abstraction layers: retrievers share an interface through duck typing only.
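As a rough illustration of what "just duck typing" can look like, here is a minimal sketch of two unrelated classes that expose the same `add_doc_batch` and `find_similars` methods and are selected by name. The class names, the `toy_retriever_factory` function, and the `(document, score)` return format are all hypothetical and are not taken from this repository.

```python
from collections import Counter


class WordOverlapRetriever:
    """Toy retriever: scores documents by the number of shared lowercase words."""

    def __init__(self):
        self.docs = []

    def add_doc_batch(self, docs):
        self.docs.extend(docs)

    def find_similars(self, query, top_k=3):
        query_words = Counter(query.lower().split())
        scored = []
        for doc in self.docs:
            # Multiset intersection counts the words common to query and doc.
            overlap = sum((query_words & Counter(doc.lower().split())).values())
            scored.append((doc, overlap))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


class SubstringRetriever:
    """Toy retriever: returns documents that contain the query as a substring."""

    def __init__(self):
        self.docs = []

    def add_doc_batch(self, docs):
        self.docs.extend(docs)

    def find_similars(self, query, top_k=3):
        hits = [(doc, 1) for doc in self.docs if query.lower() in doc.lower()]
        return hits[:top_k]


def toy_retriever_factory(method):
    """Hypothetical factory: no base class, only a name-to-class lookup."""
    registry = {
        "word_overlap": WordOverlapRetriever,
        "substring": SubstringRetriever,
    }
    return registry[method]()
```

Calling code never checks the type of the object it receives. Any class with these two methods can be plugged in, which keeps each retriever readable on its own.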
- Clone the repository from GitHub
- cd into the repository folder
- Install with pip in editable mode
pip install -e .
You can find the most relevant documents in just three lines of code:
from retriever import retriever_factory
docs = [
    'hello world',
    'How are you woooorld',
    'I am fine ',
    'This is a junk sentence!',
    'This is a siiiiimilar word with a typo',
    "it's time to find most similar documents",
]
retriever = retriever_factory(method='tf_idf_cfg_1')
retriever.add_doc_batch(docs)
results = retriever.find_similars('similar term!', top_k=4)
Dense models share the same interface:
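samples = [
    "میوه تازه",  # fresh fruit
    "شیر",  # milk
    "ماست",  # yogurt
    "شیرینی",  # sweets
    "دوغ و نوشابه",  # doogh (yogurt drink) and soda
    "دروغ گفتن",  # telling lies
    "بادام",  # almond
    "کره",  # butter
    "هلو های تازه",  # fresh peaches
    "نارنج",  # bitter orange
    "شکلات صبحانه",  # chocolate breakfast spread
    "فندق",  # hazelnut
    "آزادی بیان",  # freedom of speech
    "مبارزه با تروریسم",  # fighting terrorism
    "مبارزه با فساد",  # fighting corruption
    "مبارزه ی مدنی",  # civil struggle
    "یادگیری ماشین",  # machine learning
    "الگوریتم های دسته بندی",  # classification algorithms
    "سربار مالیاتی",  # tax overhead
    "نت ضعیفه",  # the internet connection is weak
    "ورزش صبحگاه",  # morning exercise
]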
retriever = retriever_factory(method='dense_LaBSE')
retriever.add_doc_batch(samples)
results1 = retriever.find_similars("فعالیت بدنی")  # "physical activity" -> "ورزش صبحگاه" (morning exercise)
results2 = retriever.find_similars("نوشیدنی")  # "beverage" -> "دوغ و نوشابه" (doogh and soda)
results3 = retriever.find_similars("هزینه های پنهان")  # "hidden costs" -> "سربار مالیاتی" (tax overhead)
results4 = retriever.find_similars("کلاه برداری کردن")  # "to defraud" -> "دروغ گفتن" (telling lies)
tf-idf config-1 -> fast and typo-tolerant TF-IDF (insensitive to word order)
tf-idf config-2 -> less typo-tolerant TF-IDF with a little sensitivity to word order
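One common way to make TF-IDF tolerant of typos is to index character n-grams taken inside each word rather than whole words, so that a misspelled word still shares most of its n-grams with the correct spelling. The pure-Python sketch below shows that idea with cosine similarity. It is an assumption about the general approach, not the configuration used in factory.py, and `char_ngrams` and `CharNgramTfidf` are hypothetical names.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return character n-grams computed per word, so word order is ignored."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "  # pad so word boundaries become part of the n-grams
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


class CharNgramTfidf:
    """Toy TF-IDF retriever over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.docs = []
        self.idf = {}
        self.doc_vectors = []

    def add_doc_batch(self, docs):
        self.docs.extend(docs)
        counts = [Counter(char_ngrams(d, self.n)) for d in self.docs]
        doc_freq = Counter(gram for c in counts for gram in c)
        total = len(self.docs)
        # Smoothed IDF: log((1 + N) / (1 + df)) + 1
        self.idf = {g: math.log((1 + total) / (1 + df)) + 1
                    for g, df in doc_freq.items()}
        self.doc_vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        # Weight term counts by IDF, drop unseen n-grams, then L2-normalize.
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def find_similars(self, query, top_k=3):
        q = self._weigh(Counter(char_ngrams(query, self.n)))
        # Both vectors are unit length, so the dot product is the cosine similarity.
        scores = [sum(w * d.get(g, 0.0) for g, w in q.items())
                  for d in self.doc_vectors]
        ranked = sorted(zip(self.docs, scores), key=lambda p: p[1], reverse=True)
        return ranked[:top_k]
```

With this sketch, a misspelled query such as `"helo wrld"` still shares several trigrams with `"hello world"`, and swapping the order of the query words leaves the scores unchanged because n-grams never cross word boundaries.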
A good, large model. You can try this model on Hugging Face.
A little smaller than LaBSE, but still a good one. You can try this model on Hugging Face.
A small model. You can try this model on Hugging Face.
An ensemble of four different models (LaBSE, MiniLM, E5, TF-IDF). It may need 4 GB of free RAM for initialization.
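How the ensemble merges the outputs of its member models is not described here. As one possible illustration, the sketch below uses reciprocal rank fusion, a widely used way to merge ranked lists without having to calibrate scores across models. The function name, the document ids, and the per-model rankings are hypothetical, and the repository may combine its models differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=3):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Larger k flattens the gap between ranks.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
    return ordered[:top_k]


# Hypothetical rankings from one lexical and two dense retrievers.
tfidf_ranking = ["doc_a", "doc_b", "doc_c"]
labse_ranking = ["doc_b", "doc_a", "doc_d"]
minilm_ranking = ["doc_b", "doc_c", "doc_a"]

fused = reciprocal_rank_fusion([tfidf_ranking, labse_ranking, minilm_ranking])
print([doc for doc, _ in fused])
```

Rank-based fusion is convenient for hybrid retrieval because TF-IDF cosine scores and dense embedding scores live on different scales, while ranks are directly comparable.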
To understand the custom configs, please refer to the factory.py file.