Hi. I am jann. I am a retrieval-based chatbot. I would make a great baseline.
I use approximate nearest neighbor lookup using Spotify's Annoy (Apache License 2.0) library, over a distributed semantic embedding space (Google's Universal Sentence Encoder (code: Apache License 2.0) from TensorFlow Hub).
The goal of jann is to explicitly describe each step of the process of building a semantic similarity retrieval-based text chatbot. It is designed to use diverse text sources as input (e.g. Facebook messages, tweets, emails, movie lines, speeches, restaurant reviews, ...), so long as the data is collected in a single text file ready for processing.
Note: jann development is tested with Python 3.8.6 on macOS 11.5.2 and Ubuntu 20.04.
To run jann on your local system or a server, you will need to perform the following installation steps.
# OSX: Install homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
# OSX: Install wget
brew install wget
# Configure and activate virtual environment
python3.8 -m venv venv
source venv/bin/activate
python --version
# Ensure Python 3.8.10
# Upgrade Pip
pip install --upgrade pip setuptools
# Install requirements
pip install -r requirements.txt
# Install Jann
python setup.py install
# Set environmental variable for TensorFlow Hub
export TFHUB_CACHE_DIR=Jann/data/module
# Make the TFHUB_CACHE_DIR
mkdir -p ${TFHUB_CACHE_DIR}
# Download and unpack the Universal Sentence Encoder Lite model (~25 MB)
wget "https://tfhub.dev/google/universal-sentence-encoder-lite/2?tf-hub-format=compressed" -O ${TFHUB_CACHE_DIR}/module_lite.tar.gz
cd ${TFHUB_CACHE_DIR};
mkdir -p universal-sentence-encoder-lite-2 && tar -zxvf module_lite.tar.gz -C universal-sentence-encoder-lite-2;
cd -
Download the Cornell Movie Dialog Corpus, and extract it to data/CMDC.
# Change directory to CMDC data subdirectory
mkdir -p Jann/data/CMDC
cd Jann/data/CMDC/
# Download the corpus
wget http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
# Unzip the corpus and move lines and convos to the main directory
unzip cornell_movie_dialogs_corpus.zip
mv cornell\ movie-dialogs\ corpus/movie_lines.txt movie_lines.txt
mv cornell\ movie-dialogs\ corpus/movie_conversations.txt movie_conversations.txt
# Change directory to jann's main directory
cd -
As an example, we might use the first 50 lines of movie dialogue from the Cornell Movie Dialog Corpus.
You can set the number of lines from the corpus you want to use by changing the parameter export NUMLINES='50' in run_examples/run_CMDC.sh.
pytest --cov-report=xml --cov-report=html --cov=Jann
You should see all the tests passing.
cd Jann
# make sure that the run code is runnable
chmod +x run_examples/run_CMDC.sh
# run it
./run_examples/run_CMDC.sh
jann is composed of several submodules, each of which can be run in sequence as follows:
# Ensure that the virtual environment is activated
source venv/bin/activate
# Change directory to Jann
cd Jann
# Number of trees to build in the Annoy index
export NUMTREES='100'
# Number of neighbors to return
export NUMNEIGHBORS='10'
# Input file containing the lines to embed
export INFILE="data/CMDC/all_lines_50.txt"
# Embed the lines using the encoder (Universal Sentence Encoder)
python embed_lines.py --infile=${INFILE} --verbose
# Process the embeddings and save as unique strings and numpy array
python process_embeddings.py --infile=${INFILE} --verbose
# Index the embeddings using approximate nearest neighbor search (Annoy)
python index_embeddings.py --infile=${INFILE} --verbose --num_trees=${NUMTREES}
# Build a simple command line interaction for model testing
python interact_with_model.py --infile=${INFILE} --verbose --num_neighbors=${NUMNEIGHBORS}
For interaction with the model, the only files needed are the unique strings (_unique_strings.csv) and the Annoy index (.ann) file.
With the unique strings and the index file you can build a basic interaction.
This is demonstrated in the interact_with_model.py file.
Conversational dialogue is composed of sequences of utterances. The sequence can be seen as pairs of utterances: inputs and responses.
A nearest-neighbour lookup for a given input finds stored inputs that are semantically related to it. By storing input<>response pairs, rather than only inputs, jann can reply with the response stored alongside the most similar input. This example is shown in run_examples/run_CMDC_pairs.sh.
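The pairing idea can be illustrated with a minimal, self-contained sketch. It is not jann's implementation: the `embed` function here is a toy bag-of-words vector standing in for the Universal Sentence Encoder, the search is a brute-force cosine comparison standing in for Annoy, and the names `embed`, `cosine`, `respond`, and `PAIRS` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def respond(query, pairs):
    """Return the response stored with the input most similar to the query."""
    query_vec = embed(query)
    # Brute-force search over stored inputs (assumption: replaces the Annoy index).
    _, best_response = max(pairs, key=lambda pair: cosine(query_vec, embed(pair[0])))
    return best_response


# Hypothetical input<>response pairs, for illustration only.
PAIRS = [
    ("How are you today?", "I am fine, thanks."),
    ("What is your name?", "My name is jann."),
    ("Where do you live?", "I live in a text file."),
]

if __name__ == "__main__":
    print(respond("tell me your name", PAIRS))
```

The query is matched against the stored inputs only; the response is simply looked up from the winning pair, which is why the reply can be relevant even when it shares no words with the query.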
jann is designed to run as a web service to be queried by a dialogue interface builder. For instance, jann is natively configured to be compatible with Dialogflow Webhook Service. The web service runs using the Flask micro-framework and uses the performance-oriented gunicorn application server to launch the application with 4 workers.
cd Jann
# run the pairs set up and test the interaction
./run_examples/run_CMDC_pairs.sh
# pairs set up will write files needed for web server deployment
# default data_key is all_lines_0
# start development server
python app.py
# or serve the pairs model with gunicorn and 4 workers
gunicorn --bind 0.0.0.0:8000 app:JANN -w 4
A Flask monitoring dashboard is helpful for tracking statistics on the bot. Flask-MonitoringDashboard is already installed as part of Jann; see Jann/app.py.
To view the dashboard, navigate to http://0.0.0.0:8000/dashboard. The default user/pass is: admin / admin.
Once jann is running, in a new terminal window you can test the load on the server with Locust, as defined in Jann/tests/locustfile.py:
source venv/bin/activate
cd Jann/tests
locust --host=http://0.0.0.0:8000
You can then navigate a web browser to http://0.0.0.0:8089/, and simulate N users spawning at M users per second and making requests to jann.
curl --header "Content-Type: application/json" \
--request POST \
--data '{"queryResult": {"queryText": "that sounds really depressing"}}' \
http://0.0.0.0:8000/model_inference
Response:
{"fulfillmentText":"Oh, come on, man. Tell me you wouldn't love it!"}
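The same request can be made from Python. The sketch below uses only the standard library and assumes the endpoint, payload shape, and response field shown in the curl example above; the function names are hypothetical and not part of jann.

```python
import json
import urllib.request


def build_inference_request(text, base_url="http://0.0.0.0:8000"):
    """Build a POST request carrying a Dialogflow-style query payload."""
    payload = {"queryResult": {"queryText": text}}
    return urllib.request.Request(
        url=base_url + "/model_inference",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_fulfillment(body):
    """Extract the reply text from the JSON response body."""
    return json.loads(body)["fulfillmentText"]


def query_jann(text, base_url="http://0.0.0.0:8000"):
    """Send the query to a running server and return its reply text."""
    request = build_inference_request(text, base_url)
    with urllib.request.urlopen(request) as response:
        return parse_fulfillment(response.read().decode("utf-8"))
```

With the server running, calling `query_jann("that sounds really depressing")` should return the fulfillment text from the response.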
You can use any dataset you want! Format your source text with a single entry on each line, as follows:
# data/custom_data/example.txt
This is the first line.
This is the second line, a response to the first line.
This is the third line.
This is the fourth line, a response to the third line.
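As a rough illustration of how such a file could be turned into input<>response pairs, the sketch below pairs each odd-numbered entry with the entry that follows it. This non-overlapping pairing is an assumption read off the example above, not a description of how jann itself builds pairs, and `make_pairs` is a hypothetical helper.

```python
def make_pairs(lines):
    """Pair entries 1-2, 3-4, ... after dropping blank lines.

    Assumption: non-overlapping pairs; a trailing unpaired entry is ignored.
    """
    entries = [line.strip() for line in lines if line.strip()]
    return list(zip(entries[0::2], entries[1::2]))


EXAMPLE = """\
This is the first line.
This is the second line, a response to the first line.
This is the third line.
This is the fourth line, a response to the third line.
"""

if __name__ == "__main__":
    for prompt, response in make_pairs(EXAMPLE.splitlines()):
        print(f"{prompt!r} -> {response!r}")
```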
There is a collection of Universal Sentence Encoders trained on a variety of data.
Note from TensorFlow Hub: The module performs best effort text input preprocessing, therefore it is not required to preprocess the data before applying the module.
# Standard Model (914 MB)
wget 'https://tfhub.dev/google/universal-sentence-encoder/4?tf-hub-format=compressed' -O module_standard.tar.gz
mkdir -p universal-sentence-encoder && tar -zxvf module_standard.tar.gz -C universal-sentence-encoder
There are two parameters for the Approximate Nearest Neighbour:
- set n_trees as large as possible given the amount of memory you can afford,
- set search_k as large as possible given the time constraints you have for the queries. This parameter is a tradeoff between accuracy and speed.
You will need to configure your server with the necessary software:
sudo apt update
sudo apt -y upgrade
sudo apt install unzip python3-pip python3-dev python3-venv build-essential libssl-dev libffi-dev python3-setuptools
sudo apt-get install nginx
git clone https://github.com/korymath/jann
# and follow the installation and configuration steps above
sudo /etc/init.d/nginx start # start nginx
Then, you can reference a more in-depth guide here. And here is a walkthrough on how to configure nginx on GCP.
You will need the uwsgi_params file, which is available in the nginx directory of the uWSGI distribution, or from the nginx GitHub repository.
uwsgi_param QUERY_STRING $query_string;
uwsgi_param REQUEST_METHOD $request_method;
uwsgi_param CONTENT_TYPE $content_type;
uwsgi_param CONTENT_LENGTH $content_length;
uwsgi_param REQUEST_URI $request_uri;
uwsgi_param PATH_INFO $document_uri;
uwsgi_param DOCUMENT_ROOT $document_root;
uwsgi_param SERVER_PROTOCOL $server_protocol;
uwsgi_param REQUEST_SCHEME $scheme;
uwsgi_param HTTPS $https if_not_empty;
uwsgi_param REMOTE_ADDR $remote_addr;
uwsgi_param REMOTE_PORT $remote_port;
uwsgi_param SERVER_PORT $server_port;
uwsgi_param SERVER_NAME $server_name;
Copy it into your project directory (e.g. /home/${USER}/jann/uwsgi_params).
In a moment we will tell nginx to refer to it.
We will serve our application over HTTP on port 80, so we need to enable it:
sudo ufw allow 'Nginx HTTP'
This will allow HTTP traffic on port 80, the default HTTP port.
We can check the rule has been applied with:
sudo ufw status
# Status: active
# To Action From
# -- ------ ----
# Nginx HTTP ALLOW Anywhere
# Nginx HTTP (v6) ALLOW Anywhere (v6)
Make a Systemd unit file:
[Unit]
Description=JANN as a well served Flask application.
After=network.target
[Service]
User=korymath
Group=www-data
WorkingDirectory=/home/korymath/jann/Jann
Environment="PATH=/home/korymath/jann/venv/bin"
ExecStart=/home/korymath/jann/venv/bin/uwsgi --ini wsgi.ini
[Install]
WantedBy=multi-user.target
Then, copy the following into a file on your server, named: /etc/nginx/sites-available/JANN.conf
# JANN.conf
server {
listen 80;
server_name 35.209.230.155;
location / {
include /home/korymath/jann/uwsgi_params;
uwsgi_pass unix:/home/korymath/jann/Jann/jann.sock;
}
}
Then, we tell nginx how to refer to the server:
# link the site configuration to nginx enabled sites
sudo ln -s /etc/nginx/sites-available/JANN.conf /etc/nginx/sites-enabled/
# restart nginx
sudo systemctl restart nginx
# restart jann
sudo systemctl restart jann
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
Solution (for OSX 10.13):
pip install --ignore-installed --upgrade https://github.com/lakshayg/tensorflow-build/releases/download/tf1.9.0-macos-py27-py36/tensorflow-1.9.0-cp36-cp36m-macosx_10_13_x86_64.whl
FileNotFoundError: [Errno 2] No such file or directory: 'data/CMDC/movie_lines.txt'
Solution:
Ensure that the input movie lines file is extracted to the correct path
ValueError: Signature 'spm_path' is missing from meta graph.
Solution:
Currently jann is configured to use the universal-sentence-encoder-lite module from TFHub as it is small, lightweight, and ready for rapid deployment. This module depends on the SentencePiece library and the SentencePiece model published with the module.
You will need to make some minor code adjustments to use the heavier modules (such as universal-sentence-encoder and universal-sentence-encoder-large).
The guide for contributors can be found here. It covers everything you need to know to start contributing to jann.
- Universal Sentence Encoder on TensorFlow Hub
- Cer, Daniel, et al. 'Universal sentence encoder.' arXiv preprint arXiv:1803.11175 (2018).
- Danescu-Niculescu-Mizil, Cristian, and Lillian Lee. 'Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.' Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics. Association for Computational Linguistics, 2011.
jann is made with love by Kory Mathewson.
Icon made by Freepik from www.flaticon.com is licensed by CC 3.0 BY.