- Overview
- 🐍 Build the app with Python
- 🔮 Overview of the files in this example
- 🌀 Flow diagram
- 🔨 Next steps, building your own app
- 🙍 Community
- 🦄 License
Summary | This showcases a semantic text search app |
Data for indexing | Wikipedia corpus |
Data for querying | A text sentence |
Dataset used | Kaggle Wikipedia corpus |
ML model used | distilbert-base-nli-stsb-mean-tokens |
This example shows you how to build a simple semantic search app powered by Jina's neural search framework. You can index and search text sentences from Wikipedia using a state-of-the-art machine learning distilbert-base-nli-stsb-mean-tokens
language model from the Transformers library.
item | content |
---|---|
Input | 1 text file with 1 sentence per line |
Output | top_k number of sentences that match input query |
These instructions explain how to build the example yourself and deploy it with Python. If you want to skip the building steps and just run the app, check out the Docker section below.
- You have a working Python 3.7 or 3.8 environment.
- We recommend creating a new Python virtual environment to have a clean installation of Jina and prevent dependency conflicts.
- You have at least 2 GB of free space on your hard drive.
Begin by cloning the repo, so you can get the required files and datasets. In case you already have the examples repository on your machine make sure to fetch the most recent version.
git clone https://github.com/jina-ai/examples
cd examples/wikipedia-sentences
In your terminal, you should now be located in you the wikipedia-sentences folder. Let's install Jina and the other required Python libraries. For further information on installing Jina check out our documentation.
pip install -r requirements.txt
If this command runs without any error messages, you can then move onto step two.
By default, a small test dataset is used for indexing. This can lead to bad search results.
To index the full dataset (around 900 MB):
- Set up Kaggle
- Run the script:
sh get_data.sh
- Index your new dataset:
python app.py -t index -d full -n $num_docs
The whole dataset contains about 8 Million wikipedia sentences, indexing all of this will take a very long time.
Therefore, we recommend selecting only a subset of the data, the number of elements can be selected by the -n
flag.
We recommend values smaller than 100000. For larger indexes, the SimpleIndexer used in this example will be very slow also in query time.
It is then recommended to use more advanced indexers like the FaissIndexer.
Index your data by running:
python app.py -t index
Here, we can also specify the number of documents to index with --num_docs
/ -n
(defult is 10000).
A search prompt will appear in your terminal after running:
python app.py -t query
See the text below for an example search query and response.
You can also specify the top k search results with --top_k
/ -k
(default is 5)
please type a sentence: What is ROMEO
Ta-Dah🔮, here are what we found for: What is ROMEO
> 0(0.36). The ROMEO website, iOS app and Android app are commonly used by the male gay community to find friends, dates, love or get informed about LGBT+ topics.
Here is a small overview if you're interested in understanding what each file in this example is doing.
File | Explanation |
---|---|
📂 test/* |
Various maintenance tests to keep the example running. |
📃 app.py |
The gateway code to that runs the index & query Flow. |
📃 get_data.sh |
Downloads the Kaggle dataset. |
📃 requirements.txt |
Contains all required python libraries. |
This diagram provides a visual representation of the flow in this example, showing which Executors are used in which order:
It can be seen that the flow for this example is quite simple. We receive input Documents from the gateway, which are then fed into a transformer. This transformer computes an embedding based on the text of the document. Then, the documents are sent to the indexer which does the following:
- Index time: Store all the documents on disk (in the workspace folder).
- Query time: Compare the query document embedding with all stored embeddings and return closest matches
Did you like this example and are you interested in building your own? For a detailed tuturial on how to build your Jina app check out How to Build Your First Jina App guide in our documentation.
- Slack channel - a communication platform for developers to discuss Jina
- LinkedIn - get to know Jina AI as a company and find job opportunities
- - follow us and interact with us using hashtag
#JinaSearch
- Company - know more about our company, we are fully committed to open-source!
Copyright (c) 2021 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.