Note:
- The terminologies used here are from "Rasa Space": component, slot, tracker, etc. Familiarity with Rasa is assumed.
- Please review the Note at the end of this file!
Why I am doing this? Let's quote Rasa here:
"The driver for our research is this: what would help developers build great conversational software? How can we enable them to build things that are currently out of reach? The primary outcome of our research is new features in the open source Rasa Stack."
Let's see if we can build something that rasa currently doesn't support.
- multiple languages
- multiple scripts
In the project, we will demonstrate how to build a bot that
- Understands multiple languages (in this case English and Hindi, "en" and "hi" henceforth resp.)
- Detects scripts, here Devnagari and Latin, from the text message and
- Replies in the script and language found in the latest user message. This is just for fun, not a practical thing to do.
What novel thing is this project going to do?
- How to write custom NLP components
- How to add additional properties to
Message
object and drive conversations using them. How to makerasa core
ignore these additionalMessage
properties.Test theunfeaturized
slot type.Add as needed..
Let's zoom out a bit and try to see which part of Rasa Stack drives conversations: core
. For a conversation to move ahead, which is governed by core
, it doesn't matter which language/script was used by a user to send messages. Core is agnostic to the user's language/script. This implies we need to handle multiple languages at the nlu
part of the Rasa Stack.
There are other ways in which this can be accomplished but we will train two different models :
- to detect language: so that we can reply in the same language
- to detect language-agnostic intent: so that
core
can understand what the user is saying
The data for these two models looks like this:
## intent:hi_greet
- namaste
- kaise ho
## intent:greet
- hi
- hello
- wassup!
The intents hi_greet
and greet
will be mapped to greet
while training the component after stripping off the language info: hi_
in hi_greet
. If the intent starts with hi
then it's in Hindi language else it's in English. This way we can annotate a text for two different kinds of labels: langauge and intent. This is (definitely?) hacky!
By doing so, we expect a single model to learn intents from different languages. This might not be ideal in practice and one might need to detect language and its intent using two diffferent models, but for simplicity, let's just stick with one model doing both the task. [#Explore: how far away can you take this approach? Where does this assumption break or is no more practical]
The script of text that was used also does not matter to the core
. we will do script detection and its translation/transliteration at nlu
of the Rasa Stack. This will not be done using any model though. We will use some package/library to transliterate the text and add it as property to the message. We will transliterate everything to Latin before classifying its intent.
Both language and script will be added to the message with add_to_output=True
so that the tracker
from rasa_sdk
will get that info, which we will use to converse in the latest observed language with latest observed script.
Let's take a look at our pipeline
of component
s.
Pipeline overview:
custom_pipeline:
ScriptExtractor: text --> script
LatinTextExtractor: text, script --> latin_text
CustomFeaturizer: latin_text --> latin_text_features
LanguageExtractor: latin_text_features --> language
CustomIntentClassifier: latin_text_features --> intent
The (unwisely named) custom_pipeline
has five components : each doing a specific task (A pipeline
is like a class
and a component
is like a function
! Rasa's design is so cool!)
- script detector : it detects script from the raw text received from the user.
- latin text extractor : it takes text and script, and transliterates text into latin_text if script is devanagari
- custom featurizer : makes bigrams and skip-grams out of latin_text to give latin_text_features
- language extractor : this is actually a classifier, I have misplaced it into
../extractors/
. Anyway, gets language from latin_text_features - custom intent classifier : this gives us language-agnostic intent from latin_text_features
You can find all about this custom pipeline here : rasa_custom
- Install this first: rasa x
- IMPORTANT: Then replace
rasa
fromsite-packages
with this: https://github.com/psds01/rasa_custom - Then install this: https://pypi.org/project/indic-transliteration/
Here, we will train the rasa init bot with the same stories but with different data which you can find in /data/nlu.md
The template for utter_cheer_up
will also be changed to support language and script since it contains an image.
Train the bot with
rasa train
The default rasa init bot does not come with custom actions. Let's create custom actions that will reply in language and script that was used in the latest user message.
For me, Language
is what you'd hear if you read that text out loud and Script
is what you'd see!
So, namaste
is in Hindi
language with Latin
script and हेल्लो
is in English language in Devanagari Script.
Check:
actions.py
for actions and how it gets language and script from latest user messageaction_utterance_templates.py
for utterance templates of each action in Hindi and English language and in Devanagari and Latin script.
Now since we have custom actions as well, run actions server:
python -m rasa_sdk --actions actions
Now run rasa bot with
rasa x
Check out the ./screenshots/
folder