Technical Guide and Documentation
This page contains the technical documentation about how to set up/customize a model in the framework.
*: Not fully documented - under construction!
- The concept
- Prepare a target
- Finding a use case
- Lexical analysis
- Morphology
- Syntactic rules
- Manual Modelling
- Machine Learning aided modelling
- Advanced topics
- Classification and tagging
- Punctuation and mood
- Handling statements and questions
- Syntactically incorrect sentences
- Coordinating conjunctions (logical operators) *
- Incomplete sentences *
- Relative clauses *
- Empty terminals
- Type inference *
- Functors for grammatical categories *
- Lexical ambiguity *
- Pseudo lexemes *
- Multilanguage support *
- Targeting Android *
- Technical details
To put it fairly simply: using the techniques of formal semantics, I'm trying to treat verbs as functions (e.g. 'list' as a shell command like 'ls'), adverbs as modifiers of verbs (options to the command like 'ls -l'), nouns as arguments on which the function operates (e.g. file A) and adjectives as modifiers of the noun (e.g. executable). This is somewhat similar to what dependency grammars/semantics do. The difference here is that I describe the grammar by means of a mixture of categorial grammar and distributed morphology and map each syntactic rule (if applicable) to a semantic rule. Based on the semantic rule, the so-called main (or head) functors and their dependent functors (arguments) are looked up in the parse tree. Each word (more precisely, the lexeme of the word) has a functor which is treated almost like a function. Technically speaking, functors therefore have their own signatures, which I call dependencies in the framework. Each functor has zero or more dependencies, which are stored in a table called depolex (even zero dependencies must be explicitly entered for a functor). The dependencies of a functor consist of other functors, so they're typed: a functor expects zero or more functors as input parameters and its output is typed by the functor name itself. For example, here are two functors where one depends on the other:
list(contact);
contact(null);
Or as a kind of function composition: list(contact())
A more complex example is:
list(contact);
contact(with[optional]);
with(constant);
constant(null);
Or: list(contact(with(constant)))
Please note that this guide is about desktop targets and does not (yet) provide any details about what you need to install and how. However, if you issue:
make help
you'll get something like this, describing the minimal requirements for the targets:
No dependencies are checked currently. Please, make sure that you have the followings for your targets at least with the versions specified:
desktop_client: rapidjson-1.1.0
desktop_fst, android_fst, js_fst: foma-0.9.17
desktop_parser_db, android_parser_db, js_parser_db: sqlite3-3.8.11
desktop_parser, android_parser, js_parser: bison-3.0.4
shared_native_lib: foma-0.9.17, sqlite3-3.8.11, bison-3.0.4
arm32_lib, arm64_lib: Android NDK r16b
embedded_js_lib, node_js_lib: emscripten-1.36.0
test_tools: sqlite3-3.8.11, foma-0.9.17, python-3.7, nltk-3.3
ml_tools: ABL-1.2 (Alignment-Based Learning framework)
You'll find links to most of these on the Home wiki page, though not as a collection but rather scattered around.
If you're good to go then in the project directory (where the Makefile resides) issue:
make parser_generator test_tools ml_tools
mkdir build/hi_desktop
The hi_desktop directory will be the place where we put all the files we need while modelling the example use case. Please keep an eye on where a certain shell command is supposed to be executed. I'll try to indicate it everywhere, but in case you don't see any hint, you can use this as a rule of thumb (see the small sketch after the list below):
shell script commands - usually in build/hi_desktop unless otherwise indicated
except:
make - always needs to be executed in the project directory (where the Makefile resides)
hi - this is the interpreter executable, which is always built in the build/ subdirectory of the project directory (as all other executables actually) and needs to be executed within that directory
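As a small illustration, here is what a typical session's directory switching might look like (purely a sketch; the clone path ~/hi is just an assumed example):
# illustrative only: ~/hi stands for wherever the repository was cloned
cd ~/hi                     # project root - all make targets are issued here
mkdir -p build/hi_desktop   # modelling files and most other shell commands live here
cd build/hi_desktop
cd ..                       # executables, including hi, are built into build/
./hi                        # (once it has been built in the later steps)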
The first thing to do is to think over what we want to model. So let's identify a few sentences for an example use case where we want to create a model for an assistant that can look up contacts and call a number from the hit list or call someone directly:
list contacts
list contacts with peter
call peter
call the first/second/third/fourth/.../last
You may have observed that there are words (like peter or the numbers) that represent a category which cannot be enumerated. But the same holds for a word like 'call': it is also just a representative of a word in the lexicon we have to prepare. It is certainly not only Peter whom we want to call or whose contacts we want to list, and the same goes for the numbers, but it is not necessary to list everyone and everything we want to name in the model.
The first thing to create is a morphological analyser for the words used in our use case:
list, contacts, call, with, first, second, third, ...
This project uses foma as morphological analyser, which requires setting up both phonological and morphological rules. For bigger projects, it is best to take an existing morphological analyser like this one for English or this one for Hungarian. However, if size does matter, or if for any other reason (like writing a tutorial) you want to create one on your own, here is an example. So let's create the phonological rules for our example. To be able to interpret the following example used in this project, please refer to the original foma documentation on how lexical rules can be set up. Let's take the examples from foma and make them minimal.
The phonological rules are as follows (english.foma):
### english.foma ###
# Vowels
define V [ a | e | i | o | u ];
# Consonants
define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | w | x | y | z ];
define PotentialWord [C|V]+;
# E insertion: e added after -s, -z, -x, -ch, -sh before s (watch/watches)
define EInsertion [..] -> e || s | z | x | c h | s h _ "^" s ;
# Y replacement: -y changes to -ie before -s (try/tries) and before -th (twenty/twentieth)
define YReplacement y -> i e || _ "^" s ;
define YthReplacement y -> i e || _ "^" t h ;
# Cleanup: remove morpheme boundaries
define Cleanup "^" -> 0;
read lexc engnum.lexc
define LexiconNum
define GrammarNum LexiconNum .o.
YthReplacement .o.
Cleanup;
read lexc engnoun.lexc
substitute defined PotentialWord for "[Guess]"
define LexiconN
define GrammarN LexiconN .o.
EInsertion .o.
YReplacement .o.
Cleanup;
read lexc english.lexc
define Lexicon
define Grammar Lexicon .o.
Cleanup;
read lexc engpunct.lexc
define LexiconPunct
define GrammarPunct LexiconPunct .o.
Cleanup;
regex GrammarN | GrammarNum | Grammar | GrammarPunct;
Save this as english.foma in build/hi_desktop.
The best way to set up morphological rules is to split the words into different lexicons according to their grammatical category, though in case of such a small example some may be kept together in one lexicon if it's not confusing. Please refer to the foma documentation to be able to interpret the example below used in this project. There's only one thing to keep in mind when setting up morphological rules for this framework: you need to tag the stems if you want to have morphosyntactic rules in the bison source. Currently, the stem tag is hardcoded and looks like: [stem]. All other tags are gone now, which comes at the price that the set of grammatical categories and the set of other linguistic features must be disjoint.
Let's see the lexicons included:
The lexicon for numbers was created to have a common denominator when it comes to interpreting numbers. I chose to do it this way because you may need to interpret numbers in different forms: roman, arabic and, of course, written. So I usually convert roman and arabic numbers in texts to their written English forms.
Please, note that the implementation for numbers is a development version and covers only numbers from 0-99.
Lexicon for numbers (engnum.lexc):
!!!engnum.lexc!!!
Multichar_Symbols [stem] +Num +Npref+
+Npref1+ +Npref2+ +Npref3+ +Npref4+
+Npref5+ +Npref6+ +Npref7+ +Npref8+
+Npref9+ +Npref1nn+ +Npref1nnn+
+Npref1n+ +Npref2n+ +Npref3n+ +Npref4n+
+Npref5n+ +Npref6n+ +Npref7n+ +Npref8n+
+Npref9n+ +Nom +Ord
LEXICON Root
NumPref;
LEXICON Num
one NumSuffix;
two NumSuffix;
three NumSuffix;
four NumSuffix;
five NumSuffix;
six NumSuffix;
seven NumSuffix;
eight NumSuffix;
nine NumSuffix;
one[stem]+Num+Ord:first #;
two[stem]+Num+Ord:second #;
three[stem]+Num+Ord:third #;
four[stem]+Num+Ord:fourth #;
five[stem]+Num+Ord:fifth #;
six[stem]+Num+Ord:sixth #;
seven[stem]+Num+Ord:seventh #;
eight[stem]+Num+Ord:eighth #;
nine[stem]+Num+Ord:ninth #;
LEXICON Num1
ten Num1Suffix;
one:eleven Num1Suffix;
two:twelve Num1Suffix;
three:thirteen Num1Suffix;
four:fourteen Num1Suffix;
five:fifteen Num1Suffix;
six:sixteen Num1Suffix;
seven:seventeen Num1Suffix;
eight:eighteen Num1Suffix;
nine:nineteen Num1Suffix;
Num1Suffix;
LEXICON NumPref
zero:zero Num1Suffix;
Num;
+Npref1n+:0 Num1;
+Npref2n+:twenty- Num;
+Npref3n+:thirty- Num;
+Npref4n+:fourty- Num;
+Npref5n+:fifty- Num;
+Npref6n+:sixty- Num;
+Npref7n+:seventy- Num;
+Npref8n+:eighty- Num;
+Npref9n+:ninety- Num;
+Npref2n+twenty:twenty Num1Suffix;
+Npref3n+thirty:thirty Num1Suffix;
+Npref4n+fourty:fourty Num1Suffix;
+Npref5n+fifty:fifty Num1Suffix;
+Npref6n+sixty:sixty Num1Suffix;
+Npref7n+seventy:seventy Num1Suffix;
+Npref8n+eighty:eighty Num1Suffix;
+Npref9n+ninety:ninety Num1Suffix;
LEXICON NumSuffix
[stem]+Num:0 Nom;
LEXICON Num1Suffix
two[stem]+Num+Ord:twelfth #;
[stem]+Num:^th Ord;
[stem]+Num:0 Nom;
LEXICON Nom
+Nom:0 #;
LEXICON Ord
+Ord:0 #;
Lexicon for nouns (engnoun.lexc):
!!!engnoun.lexc!!!
Multichar_Symbols [stem] [Guess] +N +Sg +Pl +CON
LEXICON Root
Noun ;
LEXICON Noun
[Guess] Constant;
contact Ninf;
name Ninf;
first Ninf;
last Ninf;
LEXICON Constant
[stem]+CON:0 #;
LEXICON Ninf
[stem]+N+Sg:0 #;
[stem]+N+Pl:^s #;
Lexicon for the rest of words (english.lexc):
!!!english.lexc!!!
Multichar_Symbols [stem] +V +Sg +Pl +PREP +DET
LEXICON Root
Verb ;
Preposition ;
Determiner ;
LEXICON Verb
call Vinf;
list Vinf;
LEXICON Vinf
[stem]+V:0 #;
LEXICON Preposition
with Pinf;
LEXICON Pinf
[stem]+PREP:0 #;
LEXICON Determiner
the[stem]+DET:the #;
Lexicon for punctuation (engpunct.lexc):
!!!engpunct.lexc!!!
Multichar_Symbols [stem] +Punct +Apostrophe +DoubleApostrophe +Quotes +OpeningBracket +ClosingBracket +OpeningSqBracket +ClosingSqBracket +Comma +Hyphen +Dash +FullStop +QuestionMark +BackTick +DoubleBackTick +Colon +ExclamationMark +SemiColon
LEXICON Root
Punctuation ;
LEXICON Punctuation
'[stem]+Punct+Apostrophe:' #;
''[stem]+Punct+DoubleApostrophe:'' #;
%"[stem]+Punct+Quotes:%" #;
([stem]+Punct+OpeningBracket:( #;
)[stem]+Punct+ClosingBracket:) #;
[[stem]+Punct+OpeningSqBracket:[ #;
][stem]+Punct+ClosingSqBracket:] #;
,[stem]+Punct+Comma:, #;
-[stem]+Punct+Hyphen:- #;
--[stem]+Punct+Dash:-- #;
.[stem]+Punct+FullStop:. #;
?[stem]+Punct+QuestionMark:? #;
`[stem]+Punct+BackTick:` #;
``[stem]+Punct+DoubleBackTick:`` #;
%:[stem]+Punct+Colon:%: #;
%![stem]+Punct+ExclamationMark:%! #;
%;[stem]+Punct+SemiColon:%; #;
Save these in separate files in build/hi_desktop. If there's any linguistic feature that you want to add at a morphological level then the lexc files are the right place to do so.
Now, let's create the morphological analyser:
make desktop_fst DESKTOPFOMAPATH=build/hi_desktop/english.foma DESKTOPLEXCFILES=build/hi_desktop DESKTOPFSTNAME=english.fst
Test your fst by checking if the correct analysis is returned for each word e.g.:
echo contacts|flookup build/hi_desktop/english.fst
contacts contact[stem]+N+Pl
contacts contacts[stem]+CON
echo last|flookup build/hi_desktop/english.fst
last last[stem]+CON
last last[stem]+N+Sg
echo call|flookup build/hi_desktop/english.fst
call call[stem]+CON
call call[stem]+V
echo first|flookup build/hi_desktop/english.fst
first first[stem]+CON
first first[stem]+N+Sg
first one[stem]+Num+Ord
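To check all the words of the use case in one go, a small shell loop like the following may help (just a sketch; adjust the word list as needed and run it from the project directory like the commands above):
# sketch: the word list is taken from the use case above
for w in list contacts call with the first second third fourth last peter
do
  echo "$w" | flookup build/hi_desktop/english.fst
done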
Now, prepare a minimal content for modelling in the db:
- collect all multichar symbols from the lexc files and create an entry for each in the SYMBOLS db table.
- set up the pairs of grammatical categories and their corresponding linguistic features (like N-Stem, N-Pl, N-Sg) and create an entry for each pair in the GCAT db table.
There are some more entries in other db tables that are technically required; they will be covered later, so just take them as they are for now.
PRAGMA foreign_keys = ON;
BEGIN;
insert into SETTINGS values('main_symbol','main_verb');
insert into SETTINGS values('main_verb','<V>');
insert into ROOT_TYPE values('H');
insert into ROOT_TYPE values('N');
insert into LANGUAGES values('ENG', 'English', '1', 'english.fst');
insert into SYMBOLS values('CON', 'ENG', 'Constant');
insert into SYMBOLS values('DET', 'ENG', 'Determiner');
insert into SYMBOLS values('N', 'ENG', 'Noun');
insert into SYMBOLS values('Stem', 'ENG', 'Stem');
insert into SYMBOLS values('Pl', 'ENG', 'Plural');
insert into SYMBOLS values('Sg', 'ENG', 'Singular');
insert into SYMBOLS values('PREP', 'ENG', 'Preposition');
insert into SYMBOLS values('V', 'ENG', 'Verb');
insert into SYMBOLS values('S','ENG', NULL);
insert into SYMBOLS values('Num','ENG',NULL);
insert into SYMBOLS values('Nom','ENG',NULL);
insert into SYMBOLS values('Ord','ENG',NULL);
insert into SYMBOLS values('Npref1n','ENG',NULL);
insert into SYMBOLS values('Npref2n','ENG',NULL);
insert into SYMBOLS values('Npref3n','ENG',NULL);
insert into SYMBOLS values('Npref4n','ENG',NULL);
insert into SYMBOLS values('Npref5n','ENG',NULL);
insert into SYMBOLS values('Npref6n','ENG',NULL);
insert into SYMBOLS values('Npref7n','ENG',NULL);
insert into SYMBOLS values('Npref8n','ENG',NULL);
insert into SYMBOLS values('Npref9n','ENG',NULL);
insert into GCAT values('CON', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('DET', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('N', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('N', 'Pl', 'ENG', '1',NULL,NULL);
insert into GCAT values('N', 'Sg', 'ENG', '1',NULL,NULL);
insert into GCAT values('PREP', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('V', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Nom', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Ord', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref1n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref2n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref3n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref4n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref5n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref6n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref7n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref8n', 'ENG', '1',NULL,NULL);
insert into GCAT values('Num', 'Npref9n', 'ENG', '1',NULL,NULL);
COMMIT;
Save this as m1content.sql in build/hi_desktop.
Create the database:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
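To double-check that the content was loaded, you can query the freshly created database with the sqlite3 command line tool (a quick sketch; I'm assuming the generated database ends up in build/hi_desktop/m1.db, which is how it is referenced in the later steps):
echo "select * from GCAT;"|sqlite3 build/hi_desktop/m1.db
echo "select count(*) from SYMBOLS;"|sqlite3 build/hi_desktop/m1.db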
At this point you can build a library that's capable of carrying out a morphological analysis of a text input.
If you wish to do so:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=""
make desktop_parser
make shared_native_lib
Modify hi.cpp: delete HI_SYNTAX and HI_SEMANTICS in the main function at the line where the 'toa' and 'language' variables are set so that they look like:
language="js";
toa=HI_MORPHOLOGY;
and change hi.db to m1.db when the hi() function is called so that it looks like:
analyses=hi(text.c_str(),"ENG",toa,language.c_str(),"hi_desktop/m1.db","test",crh);
Then build the desktop client:
make desktop_client
Start the interpreter and type the example sentence to get the morphological analysis for the words in it:
./hi
list contacts
human_input:list contacts
picking new token path
nr of paths:4
current_path_nr:0
lexer started
interpreter started
There is 1 analysis.
{"morphology":[{"morpheme id":"3","word":"contacts","lexeme":"contacts","stem":"contacts","gcat":"CON","tags":["contacts[stem]","CON"]},{"morpheme id":"4","word":"contacts","lexeme":"contact","stem":"contact","gcat":"N","tags":["contact[stem]","N","Pl"]},{"morpheme id":"1","word":"list","lexeme":"list","stem":"list","gcat":"CON","tags":["list[stem]","CON"]},{"morpheme id":"2","word":"list","lexeme":"list","stem":"list","gcat":"V","tags":["list[stem]","V"]}]}
Hint: check the json form of the analyses at jsonlint.com
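Alternatively, since python is already among the test_tools requirements, you can pretty-print such an output locally, e.g. by saving the json line into a file (the file name analysis.json is just an assumed example):
# analysis.json: the json line copied from the interpreter output (hypothetical file name)
python3 -m json.tool analysis.json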
Syntactic rules can be set up in the bison source, some of which have been presented on the main wiki page. There are different ways to set up a grammar for the model:
- Code your grammar manually. See the section about Developing and coding a parse tree manually
- Grammar rules can be entered manually in the grammar db table out of which the bison source can be generated. See the section about Developing and generating code for a parse tree
- Set up a text corpus and use machine learning to induce a grammar. See the section about Machine Learning aided modelling.
Here the tutorial is split into a Manual Modelling section (options 1-2) and a Machine Learning aided modelling section (option 3). In the end they are not that different, as the resulting model is used the same way in the framework, but the two approaches to creating a model cannot be described within the same section.
Coding the grammar manually is still possible, but don't forget to copy the sections from the C_declarations.cpp, bison_declarations.cpp and C_code.cpp files residing in the gensrc subdirectory. You can use a generated bison source as an example to see what the final bison source should look like. If you need help with bison to understand/write your own rules, please refer to a good Bison documentation. An important thing is that not only purely syntactic but also morphosyntactic rules can be added. The tokens for the terminal symbols need to be entered in the gcat table, which is used as a reference for the words entered in the lexicon table, where the lexemes are assigned to them. To make the framework react to a syntactic rule, two different methods of the interpreter class can be called: set_node_info() or combine_nodes(). These can almost be copy-pasted in each case. For example:
ENG_N : ENG_N_Sg
{
const node_info& ENG_N_Sg=sparser->get_node_info($1);
$$=sparser->set_node_info("ENG_N",ENG_N_Sg);
std::cout<<"ENG_N->ENG_N_Sg"<<std::endl;
};
set_node_info() needs to be called whenever a symbol is transformed. This creates a new node in the semantic parser tree, which is necessary to be able to carry information such as the dependencies of the word or the dependency validation matrix. TODO: Explain implicit call to check_prerequisite_symbols().
ENG_Vbar1 : ENG_V ENG_NP
{
const node_info& ENG_V=sparser->get_node_info($1);
const node_info& ENG_NP=sparser->get_node_info($2);
$$=sparser->combine_nodes("ENG_VBAR1",ENG_V,ENG_NP);
std::cout<<"ENG_Vbar1->ENG_V ENG_NP"<<std::endl;
};
combine_nodes() always needs to be called when two nodes are combined, i.e. when a phrase needs to be validated semantically. The framework only handles binary branching rules, so the syntactic rules must be set up that way. Whenever combine_nodes() is called, the semantic checks are carried out based on the semantic rules in the rule_to_rule_map table that are assigned to the triggering syntactic rule. Based on these semantic rules, the framework looks up whether the dependent node can be found among the dependencies of the main node. All such possibilities are then registered in a so-called dependency validation matrix of the main node. If none is found, the phrase is considered invalid and the interpretation stops.
add_feature_to_leaf() -v1
add_feature_to_leaf() -v2
add_feature_to_leaf() -v3
get_node_info()
The next thing is to set up the LEXICON table containing the stems of the words that the morphological analyser can already process and that we want the interpreter to be able to process. Let's add these to m1content.sql:
insert into LEXICON values('call', 'ENG', 'V', 'CALLENGV');
insert into LEXICON values('list', 'ENG', 'V', 'LISTENGV');
insert into LEXICON values('contact', 'ENG', 'N', 'CONTACTENGN');
insert into LEXICON values('first', 'ENG', 'N', 'FIRSTLASTENGN');
insert into LEXICON values('last', 'ENG', 'N', 'FIRSTLASTENGN');
insert into LEXICON values('with', 'ENG', 'PREP', 'WITHENGPREP');
The only field that needs some explanation, I think, is the lexeme. It's not only used the way a lexeme is generally used, but it's also linked to a functor which is later used for grasping its semantics - in the end, by a real function. So the value of the lexeme field in each row must be entered in the FUNCTORS table as well (for technical reasons, before the lexicon entries):
insert into FUNCTORS values('CON', '1', NULL);
insert into FUNCTORS values('CALLENGV', '1', NULL);
insert into FUNCTORS values('LISTENGV', '1', NULL);
insert into FUNCTORS values('CONTACTENGN', '1', NULL);
insert into FUNCTORS values('Num', '1', NULL);
insert into FUNCTORS values('FIRSTLASTENGN', '1', NULL);
insert into FUNCTORS values('WITHENGPREP', '1', NULL);
Each lexeme in the LEXICON table must have a functor declaration in the FUNCTORS table, though a functor may lack a definition in the FUNCTOR_DEFS table. In such a case, the functor_id field can be set to NULL. This is exactly the case here, as during grammar development only the functor declarations are necessary but their definitions are not. Besides, there's a mandatory entry for the CON functor, which has no counterpart in the LEXICON and is used basically for unknown words, i.e. those that cannot be processed by the morphological analyser and are therefore called CONcealed/CONstant. There are also words that can be processed by the morphological analyser but cannot be enumerated, like numbers. Such cases can be handled by entering the symbol of the corresponding grammatical category as the functor, as in the case of 'Num' above.
When manually developing a grammar, keep in mind that all the symbols you use in the grammar need to be entered in the SYMBOLS table. The first things that will be used when the syntactic analysis starts are the terminal symbols. Terminal symbols, for historical reasons, begin with 't_', followed by the language id, the grammatical category and the linguistic feature, which are taken from a row of the GCAT table. E.g. for a row inserted in GCAT as:
insert into GCAT values('N', 'Stem', 'ENG', '1',NULL,NULL);
the corresponding terminal symbol is:
t_ENG_N_Stem
which needs to be inserted in the SYMBOLS table. If you built the interpreter earlier to check the morphological analysis functionality, you can see that a tokensymbols.h file was generated (in directory build/hi_desktop) which contains exactly the terminal symbols the system generated. Let's add the terminal symbols to the m1content.sql file:
insert into SYMBOLS values('t_ENG_CON_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_DET_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_N_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_N_Pl','ENG',NULL);
insert into SYMBOLS values('t_ENG_N_Sg','ENG',NULL);
insert into SYMBOLS values('t_ENG_PREP_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_V_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Ord','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Nom','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref1n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref2n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref3n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref4n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref5n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref6n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref7n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref8n','ENG',NULL);
insert into SYMBOLS values('t_ENG_Num_Npref9n','ENG',NULL);
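If you built the morphology-only interpreter above, you can cross-check this list against the generated tokensymbols.h (a quick sketch, run in build/hi_desktop; the exact layout of the header may differ, so take the grep pattern as an assumption):
grep -o 't_ENG_[A-Za-z0-9_]*' tokensymbols.h | sort -u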
A good practice to start developing a grammar manually is:
- 1. take the sentences from the use case
- 2. group them according to their morphological structure AND length
- 3. take one of those groups containing the shortest sentences
- 3.1. take a sentence from that group
- 3.2. develop the rules for that sentence
- 3.3. test if the rules work out when implemented in the interpreter
- 3.4. if ok, put aside that sentence and go to 3.5; if not, go to 3.2.
- 3.5. if there are no more sentences in the group, go to 4, otherwise go to 3.
- 4. if there's any group left go to 3
That way you go through the sentences of the use case one by one building the rules bottom up. Taking our little use case as an example, there're not many possibilities:
list contacts
list contacts with peter
call peter
call the first/second/third/fourth/.../last
The bottom is of course the level of words. Getting an analysis for a word can be done by using the morphological analyser. Let's take the first sentence:
'list contacts'
Then find some more and make a group of them. As you can see, there's only one other sentence with a similar structure and length:
'call peter'
Here is what we get from the morphological analyser for the words of those sentences:
list list[stem]+CON
list list[stem]+V
contacts contact[stem]+N+Pl
contacts contacts[stem]+CON
call call[stem]+CON
call call[stem]+V
peter peter[stem]+CON
As the morphological analyser is set up to also give an analysis for any potential word, it always returns an analysis with the CON (concealed/constant) symbol. In the end, the (syntactic and semantic) analyses are ranked by the interpreter, so the more CONs appear in an analysis, the weaker position it gets in the ranked list. Therefore, such morphological analyses can be ignored for now unless we expect a constant in the sentence, as in the case of:
'call peter'
since we know that we don't list any names in the morphological model, so they'll be analysed as CONs. That also means that we expect 'call' to be followed by a CON and 'list' by a noun ('contacts'), which means that the two sentences are two separate cases and so belong to two different groups:
'call peter': [V]+[CON]
'list contacts': [V]+[N+Pl]
Let's stick to the first example and go left to right just like the parser will do. We already know that the symbol for the stem of a verb is 't_ENG_V_Stem'. That's what the bison parser will return when it finds one (it is actually a symbol for a token). It is important to always create a node even for the tokens, as the interpreter's methods expect nodes and not tokens (see Syntactic rules - Option 1. in Manual Modelling). The difference between tokens and nodes is that tokens represent terminal symbols while nodes represent non-terminal symbols. Creating nodes for tokens is actually technically required, as mentioned previously. So, let's find a symbol for a node representing 't_ENG_V_Stem'. I usually just omit the 't_' prefix in such cases, so I'd call it 'ENG_V_Stem'. The next symbol the parser will find is 't_ENG_CON_Stem', which I'd convert to 'ENG_CON'. There's nothing more in the sentence to analyse, so let's see what kind of rules we can set up (using bison notation):
ENG_V_Stem: t_ENG_V_Stem
ENG_CON: t_ENG_CON_Stem
Depending on how much abstraction we want, there are several ways to combine nodes and get to the root of the syntactic parser tree. The topic of developing a parse tree for a grammar is too broad to discuss here, as there are many ways and theories about it. I'll only demonstrate what is supported by the system using a simple example ('call peter') that can be easily understood. First of all, the system supports only binary branching, which should not pose any problem since n-ary branching can be modelled with binary branching as well.
Let's continue with our simple example. We already have the bottom level rules that map tokens to nodes. As it may be necessary to handle different kinds of verbs let's first create a new node for the existing verb symbol we have and at the same time transform the constant node into a noun as there's hardly any other way to handle them:
ENG_V: ENG_V_Stem
ENG_1CON: ENG_CON
ENG_N_Sg: ENG_1CON
ENG_N: ENG_N_Sg
The way a constant is transformed to a noun here leaves some possibilities open to cover later, like collecting more than one constant by one symbol or handling plural nouns. The next thing we need to take care of is combining the nodes in a way that they build a parse tree for the sentence:
ENG_NP: ENG_N
ENG_Vbar1: ENG_V ENG_NP
ENG_VP: ENG_Vbar1
S: ENG_VP
These rules take care of combining all the nodes to make a parse tree. Here you can also see some nodes that are used only for later enhancements, like ENG_NP (noun phrase) or ENG_VP (verb phrase). As mentioned earlier, presenting how parse trees can be developed exceeds the limits of this tutorial. However, I hope this simple example helps as a starting point and that, with the help of further reading, the complete list of rules given here will become clear. Again, this is just one way of modelling a grammar for our use case:
insert into SYMBOLS values('ENG_N_lfea_Pl','ENG',NULL);
insert into SYMBOLS values('ENG_N_lfea_Sg','ENG',NULL);
insert into SYMBOLS values('ENG_N_Sg','ENG',NULL);
insert into SYMBOLS values('ENG_N_Pl','ENG',NULL);
insert into SYMBOLS values('ENG_N','ENG',NULL);
insert into SYMBOLS values('ENG_N_Stem','ENG',NULL);
insert into SYMBOLS values('ENG_NP','ENG',NULL);
insert into SYMBOLS values('ENG_PREP_Stem','ENG',NULL);
insert into SYMBOLS values('ENG_PREP','ENG',NULL);
insert into SYMBOLS values('ENG_V_Stem','ENG',NULL);
insert into SYMBOLS values('ENG_V','ENG',NULL);
insert into SYMBOLS values('ENG_PP','ENG',NULL);
insert into SYMBOLS values('ENG_VP','ENG',NULL);
insert into SYMBOLS values('ENG_Vbar1','ENG',NULL);
insert into SYMBOLS values('ENG_DET_Stem','ENG',NULL);
insert into SYMBOLS values('ENG_DP','ENG',NULL);
insert into SYMBOLS values('ENG_Num_Ord','ENG',NULL);
insert into SYMBOLS values('ENG_Num_Nom','ENG',NULL);
insert into SYMBOLS values('ENG_Num_Stem','ENG',NULL);
insert into SYMBOLS values('ENG_Num','ENG',NULL);
insert into SYMBOLS values('ENG_Num_lfea_Ord','ENG',NULL);
insert into SYMBOLS values('ENG_Num_lfea_Nom','ENG',NULL);
insert into SYMBOLS values('ENG_Num_Pref','ENG',NULL);
insert into SYMBOLS values('ENG_CON','ENG',NULL);
insert into SYMBOLS values('ENG_1CON','ENG',NULL);
insert into GRAMMAR values('ENG','ENG_CON','t_ENG_CON_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_1CON','ENG_CON',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N_Stem','t_ENG_N_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N_lfea_Pl','t_ENG_N_Pl',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N_lfea_Sg','t_ENG_N_Sg',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N','ENG_N_Sg',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N','ENG_N_Pl',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N_Pl','ENG_N_Stem','ENG_N_lfea_Pl',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N_Sg','ENG_N_Stem','ENG_N_lfea_Sg',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_N_Sg','ENG_1CON',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_NP','ENG_N',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_PREP_Stem','t_ENG_PREP_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_PREP','ENG_PREP_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_V_Stem','t_ENG_V_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_V','ENG_V_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_PP','ENG_PREP','ENG_NP',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Vbar1','ENG_V','ENG_NP',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_VP','ENG_Vbar1',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_VP','ENG_Vbar1','ENG_PP',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_VP','ENG_V','ENG_DP',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_DET_Stem','t_ENG_DET_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_DP','ENG_DET_Stem','ENG_Num_Ord',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_DP','ENG_DET_Stem','ENG_Num_Nom',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_DP','ENG_DET_Stem','ENG_N',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Stem','t_ENG_Num_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num','ENG_Num_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_lfea_Ord','t_ENG_Num_Ord',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_lfea_Nom','t_ENG_Num_Nom',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Ord','ENG_Num','ENG_Num_lfea_Ord',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Nom','ENG_Num','ENG_Num_lfea_Nom',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref1n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref2n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref3n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref4n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref5n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref6n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref7n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref8n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num_Pref','t_ENG_Num_Npref9n',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Num','ENG_Num_Pref','ENG_Num',NULL,NULL);
insert into GRAMMAR values('ENG','S','ENG_VP',NULL,NULL,NULL);
Copy all those into m1content.sql and create the database:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
Using the test tools you can check what kind of syntactic structures (stex) and sentences (stax) can be generated by the grammar developed. Let's test if the sentences in the use case get generated by the parse tree (change directory to build/hi_desktop):
../stex m1.db ENG 10d list,call,contacts,with,peter,the,first,last > m1stex.txt
../remove_stex_output_duplicates.sh m1stex.txt
../stax m1stex.txt_unique
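If you just want to check quickly whether a particular use-case sentence shows up among the generated ones, you can filter the stax output (a sketch, assuming stax writes the generated sentences to standard output):
../stax m1stex.txt_unique | grep 'call the first'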
The output of stax will contain sentences that may surprise you, like 'call contacts with peter !', no matter how carefully you craft your grammar. It's normal for grammars to generate sentences that are syntactically correct but make no sense, like the famous example 'colourless green ideas sleep furiously'. Of course, the possibility of generating syntactically incorrect sentences shall be eliminated - please excuse me if there are any in this demo grammar. Sometimes it's difficult to write rules that do not accept a certain input, as rules are usually created to accept something. In that case, the generated code can be modified (we'll discuss this later as well) to return a bison error, by telling the system to use a specific piece of code for the action handler that does exactly that:
insert into GRAMMAR values('ENG','A','B','C',NULL,'"YYERROR;"');
In case 'A: B C' is resolved, the action handler code will call 'YYERROR' - see the bison documentation for details.
At this point you can also build a library that's capable of carrying out a syntactic analysis of a text input.
If you wish to do so:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=""
make desktop_parser
make shared_native_lib
Modify hi.cpp: delete HI_SEMANTICS in the main function at the line where the 'toa' variable is set so that it looks like:
toa=HI_MORPHOLOGY|HI_SYNTAX;
and change hi.db to m1.db when the hi() function is called so that it looks like:
analyses=hi(text.c_str(),"ENG",toa,language.c_str(),"hi_desktop/m1.db","test",crh);
Then build the desktop client:
make desktop_client
Start the interpreter and type another example sentence 'call peter' to get a simple syntactic analysis for the sentence:
./hi
call peter
human_input:call peter
picking new token path
nr of paths:2
current_path_nr:0
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at call
picking new token path
current_path_nr:1
lexer started
interpreter started
V->CALLENGV
ENG_V->ENG_V_Stem
CON->peter
ENG_1CON->ENG_CON
ENG_N_Sg->ENG_1CON
ENG_N->ENG_N_Sg
ENG_NP->ENG_N
ENG_Vbar1->ENG_V ENG_NP
ENG_VP->ENG_Vbar1
S->ENG_VP
There are 1 analyses.
{"analyses":[{"morphology":[{"morpheme id":"2","word":"call","lexeme":"CALLENGV","stem":"call","gcat":"V","tags":["call[stem]","V"]},{"morpheme id":"3","word":"peter","lexeme":"peter","stem":"peter","gcat":"CON","tags":["peter[stem]","CON"]}],"syntax":[{"symbol":"S","left child":{},"right child":{"symbol":"ENG_VP","left child":{},"right child":{"symbol":"ENG_Vbar1","left child":{"symbol":"ENG_V","left child":{},"right child":{"symbol":"ENG_V_Stem","morpheme id":"2"}},"right child":{"symbol":"ENG_NP","left child":{},"right child":{"symbol":"ENG_N","left child":{},"right child":{"symbol":"ENG_N_Sg","left child":{},"right child":{"symbol":"ENG_1CON","left child":{},"right child":{"symbol":"ENG_CON","morpheme id":"3"}}}}}}}}]}]}
If you check the output at for example jsonlint.com, you'll see how the rules are applied:
{
"analyses": [{
"morphology": [{
"morpheme id": "2",
"word": "call",
"lexeme": "CALLENGV",
"stem": "call",
"gcat": "V",
"tags": ["call[stem]", "V"]
}, {
"morpheme id": "3",
"word": "peter",
"lexeme": "peter",
"stem": "peter",
"gcat": "CON",
"tags": ["peter[stem]", "CON"]
}],
"syntax": [{
"symbol": "S",
"left child": {},
"right child": {
"symbol": "ENG_VP",
"left child": {},
"right child": {
"symbol": "ENG_Vbar1",
"left child": {
"symbol": "ENG_V",
"left child": {},
"right child": {
"symbol": "ENG_V_Stem",
"morpheme id": "2"
}
},
"right child": {
"symbol": "ENG_NP",
"left child": {},
"right child": {
"symbol": "ENG_N",
"left child": {},
"right child": {
"symbol": "ENG_N_Sg",
"left child": {},
"right child": {
"symbol": "ENG_1CON",
"left child": {},
"right child": {
"symbol": "ENG_CON",
"morpheme id": "3"
}
}
}
}
}
}
}
}]
}]
}
The result is a json structure containing all the analyses in an array, which is sorted according to the rank of the analyses. Ranking is currently based on the best (the fewer CONs the better) and longest match. This means that the analysis found at the first position of the array (i.e. index 0) could be analysed along the longest path and contains the fewest CONs.
If there's any linguistic feature that you want to add at a syntactic level, then the grammar rules are the right place to do so. One such mandatory feature (to prepare for semantics) is marking a verb as the main verb, as more than one verb may appear in a sentence even though that's not the case in our simple examples. Currently there isn't any logic at hand that could figure it out and do that automatically. It does not make sense to mark a verb as main verb in a rule which is applied whenever a verb pops up in a sentence, since all verbs would then be marked as main verbs; it should rather be done when two phrases or parts of a sentence get combined.
In our grammar, the action handler for the relevant rule 'ENG_Vbar1: ENG_V ENG_NP' would look like:
"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(main_node,"main_verb");
std::string parent_symbol="ENG_Vbar1";
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"
We may insert this content in the corresponding action field of the rule in an editor (i.e. in m1content.sql), insert it directly into m1.db, or create an action snippet file for it and put that file name as the content of the action field. If you add the action handler in an editor, don't forget to regenerate the db, and do NOT add a line break before the opening quote; rather, bring the opening apostrophe onto the same line as the quote. To make sure that the change lands in the right place, I show the other two cases. It's not easy to echo such a long text through a pipe to sqlite, so I'd recommend saving it in a file (together with the quotes at the beginning and the end) called e.g. main_verb and then echoing that:
action=`cat main_verb`;echo update grammar set action=\'"$action"\' where lid=\'ENG\' and parent_symbol=\'ENG_Vbar1\' and head_symbol=\'ENG_V\' and non_head_symbol=\'ENG_NP\'\;|sqlite3 m1.db
The third case: copy all .cpp files (except gensrc.cpp itself) from the project subdirectory gensrc to the build/hi_desktop directory. Save the action implementation (without the quotes at the beginning and the end) in a file called e.g. main_verb in the build/hi_desktop directory. Then update the action field of the corresponding grammar rule with the action snippet file name:
echo update grammar set action=\'main_verb\' where lid=\'ENG\' and parent_symbol=\'ENG_Vbar1\' and head_symbol=\'ENG_V\' and non_head_symbol=\'ENG_NP\'\;|sqlite3 m1.db
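Either way, you can verify that the update landed on the intended rule with a quick query (run in build/hi_desktop):
echo "select parent_symbol,head_symbol,non_head_symbol,action from grammar where lid='ENG' and parent_symbol='ENG_Vbar1';"|sqlite3 m1.db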
The value of the "main_symbol" setting, i.e. "main_verb", may be any arbitrary string but needs to be added to the SETTINGS table. Under the chosen symbol, the symbols of one or more grammatical categories must be assigned to it, each surrounded by angle brackets. More than one symbol may prove useful if more than one grammatical model is stored in the database and they use different symbols. (Actually, turning the SETTINGS table into one that stores language dependent options could be done, but it's not appropriate for every setting, so it's under consideration.)
insert into SETTINGS values('main_symbol','main_verb');
insert into SETTINGS values('main_verb','<V>');
The semantic modelling part is about mapping the syntactic rules to semantic ones and setting up the dependencies of lexemes. Semantic rules are stored in the rule_to_rule_map table, while dependencies are stored in the depolex table. In the rule_to_rule_map table, each syntactic rule that combines two nodes and for which you want a semantic check to be carried out must be entered. In such combinations, one of the nodes is considered the main node and the other one the dependent node. E.g. when combining a verb and a noun, the noun is the dependency of the verb, as the functor of the verb operates on the noun as its argument. At least, it usually makes sense to set up the dependencies in the depolex table in such a way that e.g. 'contacts' is an argument of 'list'. The depolex table stores the dependencies of lexemes. This is the content which in the future could be extracted from dependency graphs generated by ML/AI.
Let's take the example of 'list contacts' where 'contacts' is the argument of 'list'; such entries would look like:
insert into DEPOLEX values('LISTENGV', '1', '1', NULL, NULL, NULL, '0', 'CONTACTENGN', '1');
insert into DEPOLEX values('CONTACTENGN', '1', '1', NULL, NULL, NULL, NULL, NULL, NULL);
The meaning of the fields can be looked up in the DEPOLEX table documentation. Copy the entries to a file, e.g. depolex.sql, and issue:
cat depolex.sql|sqlite3 m1.db
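To see what landed in the table, a quick check might look like this (run in build/hi_desktop):
echo "select * from depolex;"|sqlite3 m1.db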
The depolex entries in the example therefore mean that the word 'contact' (with a certain meaning, identified as CONTACTENGN_1) is a dependency of the word 'list' (with a certain meaning, identified as LISTENGV_1). Similarly, the word 'with' is a dependency of the word 'contact', where not finding the dependency 'with' does not count as a failure. This dependency hierarchy (or chain) is sufficient to interpret e.g. 'list contacts' or 'list contacts with peter'.
Based on the dependency hierarchy and the nodes combined in the syntax tree, the rule_to_rule_map entries can be set up. So we need to follow the path from the terminals bottom-up to the root in the syntax tree to identify where two nodes get combined. We can find this out by checking the result of the previous execution of the interpreter for our example sentence.
There is one combination in the analysis that we need to validate semantically to see if the combination of the verb and the noun makes sense, which means we need to make an entry for the rule:
ENG_Vbar1: ENG_V ENG_NP
Before inserting anything in the RULE_TO_RULE_MAP table, look up the meaning of its fields. The following entry will trigger semantic validation for the syntactic rule 'ENG_Vbar1: ENG_V ENG_NP' by looking for lexemes having grammatical category 'V' (i.e. verb) in the parser tree of ENG_V, other lexemes having grammatical category 'N' (i.e. noun) in the parser tree of ENG_NP, and finally checking if the lexemes found for ENG_V have the lexemes found for ENG_NP as their dependencies in the DEPOLEX table.
insert into RULE_TO_RULE_MAP values( 'ENG_Vbar1', 'ENG_V', 'ENG_NP', '1', NULL, NULL, 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
When building the semantic model, the most important thing to keep an eye on is that the semantic rules (RULE_TO_RULE_MAP) and the dependencies (DEPOLEX) go hand in hand. Nodes get combined according to the semantic rules, but those combinations are validated in the end based on the dependencies being fulfilled by the lexemes of those combined nodes. Another thing which needs some attention is that although syntactic rules and semantic rules are pretty much mapped onto each other, sometimes the semantic head is different from the syntactic one. In such cases, the semantic head needs to be the head when combining two nodes (i.e. when writing semantic rules) in order to be able to build a dependency tree according to the dependencies modelled in DEPOLEX. The semantic head can be determined by the main_lookup_root and dependency_lookup_root fields (see the RULE_TO_RULE_MAP field descriptions), which in the aforementioned case means that the main_lookup_root shall be set to 'N' (i.e. Non-head) and the dependency_lookup_root shall be set to 'H' (i.e. Head).
At this point you can build a library that's capable of carrying out a semantic analysis of 'list contacts'. Copy all those entries into m1content.sql and create the database:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=""
make desktop_parser
make shared_native_lib
Modify hi.cpp: make sure HI_SEMANTICS is included again in the main function at the line where the 'toa' variable is set so that it looks like:
toa=HI_MORPHOLOGY|HI_SYNTAX|HI_SEMANTICS;
and change hi.db to m1.db when the hi() function is called so that it looks like:
analyses=hi(text.c_str(),"ENG",toa,language.c_str(),"hi_desktop/m1.db","test",crh);
Then build the desktop client:
make desktop_client
Start the interpreter and type the example sentence 'list contacts' to get the semantic analysis for the sentence:
human_input:list contacts
picking new token path
nr of paths:4
current_path_nr:0
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:1
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:2
lexer started
interpreter started
V->LISTENGV
ENG_V->ENG_V_Stem
CON->contacts
ENG_1CON->ENG_CON
ENG_N_Sg->ENG_1CON
ENG_N->ENG_N_Sg
ENG_NP->ENG_N
ENG_Vbar1->ENG_V ENG_NP
step:1 failover:0 successor:0
invalid_combination:Cannot interpret the invalid combination of list and contacts
picking new token path
current_path_nr:3
lexer started
interpreter started
V->LISTENGV
ENG_V->ENG_V_Stem
N->CONTACTENGN
t_ENG_N_Pl->
ENG_N_Pl->ENG_N_Stem ENG_N_lfea_Pl
ENG_N->ENG_N_Pl
ENG_NP->ENG_N
ENG_Vbar1->ENG_V ENG_NP
step:1 failover:0 successor:0
inserting in dvm, dependent node functor CONTACTENGN for main node functor LISTENGV
ENG_VP->ENG_Vbar1
S->ENG_VP
dependencies with longest match:
functor LISTENGV d_key 1: 1 deps found out of the expected 1 deps to be found
functor CONTACTENGN d_key 1: 0 deps found out of the expected 0 deps to be found
Minimum number of dependencies to match:1
Matching nr of dependencies found for functor LISTENGV with d_key 1:1
Total number of dependencies:1
TRUE
There are 1 analyses.
transcripting:LISTENGV_1
transcripting:CONTACTENGN_1
{"analyses":[{"morphology":[{"morpheme id":"2","word":"list","lexeme":"LISTENGV","stem":"list","gcat":"V","tags":["list[stem]","V"]},{"morpheme id":"4","word":"contacts","lexeme":"CONTACTENGN","stem":"contact","gcat":"N","tags":["contact[stem]","N","Pl"]}],"syntax":[{"symbol":"S","left child":{},"right child":{"symbol":"ENG_VP","left child":{},"right child":{"symbol":"ENG_Vbar1","left child":{"symbol":"ENG_V","left child":{},"right child":{"symbol":"ENG_V_Stem","morpheme id":"2"}},"right child":{"symbol":"ENG_NP","left child":{},"right child":{"symbol":"ENG_N","left child":{},"right child":{"symbol":"ENG_N_Pl","left child":{"symbol":"ENG_N_Stem","morpheme id":"4"},"right child":{"symbol":"ENG_N_lfea_Pl"}}}}}}}],"semantics":[{"id":"1","functor":"LISTENGV","d_key":"1","morpheme id":"2","tags":{},"functor id":"","dependencies":[{"id":"3","functor":"CONTACTENGN","d_key":"1","morpheme id":"4","tags":{},"functor id":""}]}],"functors":[]}]}
CONTACTENGN_3_out="{\"morpheme id\":\"4\",\"word\":\"contacts\",\"lexeme\":\"CONTACTENGN\",\"stem\":\"contact\",\"gcat\":\"N\",\"tags\":[\"contact[stem]\",\"N\",\"Pl\"]}";LISTENGV_1_out="{\"morpheme id\":\"2\",\"word\":\"list\",\"lexeme\":\"LISTENGV\",\"stem\":\"list\",\"gcat\":\"V\",\"tags\":[\"list[stem]\",\"V\"]}";
If you check the json analysis, you'll see that now there's a 'semantics' property:
{
"analyses": [{
"morphology": [{
"morpheme id": "2",
"word": "list",
"lexeme": "LISTENGV",
"stem": "list",
"gcat": "V",
"tags": ["list[stem]", "V"]
}, {
"morpheme id": "4",
"word": "contacts",
"lexeme": "CONTACTENGN",
"stem": "contact",
"gcat": "N",
"tags": ["contact[stem]", "N", "Pl"]
}],
"syntax": [{
"symbol": "S",
"left child": {},
"right child": {
"symbol": "ENG_VP",
"left child": {},
"right child": {
"symbol": "ENG_Vbar1",
"left child": {
"symbol": "ENG_V",
"left child": {},
"right child": {
"symbol": "ENG_V_Stem",
"morpheme id": "2"
}
},
"right child": {
"symbol": "ENG_NP",
"left child": {},
"right child": {
"symbol": "ENG_N",
"left child": {},
"right child": {
"symbol": "ENG_N_Pl",
"left child": {
"symbol": "ENG_N_Stem",
"morpheme id": "4"
},
"right child": {
"symbol": "ENG_N_lfea_Pl"
}
}
}
}
}
}
}],
"semantics": [{
"id": "1",
"functor": "LISTENGV",
"d_key": "1",
"morpheme id": "2",
"tags": {},
"functor id": "",
"dependencies": [{
"id": "3",
"functor": "CONTACTENGN",
"d_key": "1",
"morpheme id": "4",
"tags": {},
"functor id": ""
}]
}],
"functors": []
}]
}
The semantics section shows how the semantic parse tree is built, just like the syntax section does for the syntactic parse tree. There is a new property called 'functors' which is necessary for text-to-code generation. It's currently empty, but we're coming to that right now. However, if you wish to check first how to build an interpreter that can analyse all the sentences in the use case, the sql content to which you need to adjust our current content is:
insert into DEPOLEX values('LISTENGV', '1', '1', NULL, NULL, NULL, '0', 'CONTACTENGN', '1');
insert into DEPOLEX values('CONTACTENGN', '1', '1', NULL, '1', NULL, '0', 'WITHENGPREP', '1');
insert into DEPOLEX values('WITHENGPREP', '1', '1', NULL, NULL, NULL, '0', 'CON', '1');
insert into DEPOLEX values('CON', '1', '1', NULL, NULL, NULL, NULL, NULL, NULL);
insert into DEPOLEX values('CALLENGV', '1', '1', NULL, '2', NULL, '0', 'FIRSTLASTENGN', '1');
insert into DEPOLEX values('CALLENGV', '1', '2', NULL, '3', NULL, '1', 'Num', '1');
insert into DEPOLEX values('CALLENGV', '1', '3', NULL, NULL, NULL, '0', 'CON', '1');
insert into DEPOLEX values('Num', '1', '1', NULL, NULL, NULL, NULL, NULL, NULL);
insert into DEPOLEX values('FIRSTLASTENGN', '1', '1', NULL, NULL, NULL, NULL, NULL, NULL);
insert into RULE_TO_RULE_MAP values( 'ENG_Vbar1', 'ENG_V', 'ENG_NP', '1', '2', NULL, 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_Vbar1', 'ENG_V', 'ENG_NP', '2', NULL, NULL, 'V', NULL, 'H', NULL, NULL, 'CON', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP', 'ENG_Vbar1', 'ENG_PP', '1', NULL, NULL, 'N', NULL, 'H', NULL, NULL, 'PREP', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_PP', 'ENG_PREP', 'ENG_NP', '1', NULL, NULL, 'PREP', NULL, 'H', NULL, NULL, 'CON', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP', 'ENG_V', 'ENG_DP', '1', '2', NULL, 'V', NULL, 'H', NULL, NULL, 'CON', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP', 'ENG_V', 'ENG_DP', '2', '3', NULL, 'V', NULL, 'H', NULL, NULL, 'Num', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP', 'ENG_V', 'ENG_DP', '3', NULL, NULL, 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
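One way to apply these is to replace the earlier DEPOLEX and RULE_TO_RULE_MAP entries in m1content.sql with the lines above and regenerate the database with the same make target as before (issued from the project directory):
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql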
To be able to generate code for natural language text, the functors need to be implemented. Let's add an entry to the FUNCTOR_DEFS table, in which functor ids are assigned a particular function:
insert into FUNCTOR_DEFS values('LISTENGV_1', 'js', '1', 'listengv_1.js');
The assignment must be set for the functor as well, so that its entry looks like:
insert into FUNCTORS values('LISTENGV', '1', 'LISTENGV_1');
Let's create a directory called 'functors'. In the build directory issue:
mkdir hi_desktop/functors
Save the following code in the functors directory in a file called 'listengv_1.js':
contact="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('CONTACTENGN_')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
contact=arguments[i+2];
break;
}
}
}
let result={action:'fetchContacts',params:[contact]};
return result;
Functor implementations have their peculiarities, which can be understood if we know how the system generates code from the functors. As mentioned earlier, the code is in the end the result of function composition. That composition is based on the semantic parse tree, which is built according to the semantic model described by the dependencies (in DEPOLEX) and the semantic rules (in RULE_TO_RULE_MAP). The system generates a function for each functor in the analysis using the implementation specified in FUNCTOR_DEFS. The declaration of the functions is generated and must not be done in the functor implementation. Both of the currently available transcriptors (js and sh) work this way. Let's build an interpreter to have a simple example for the explanation:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=build/hi_desktop/functors
make desktop_parser
make shared_native_lib
Note that the make target desktop_bison_parser now has DESKTOPFUNCTORPATH set to the functors directory. If you start the interpreter and type the sentence 'list contacts', you should get an analysis like:
{
"analyses": [{
"morphology": [{
"morpheme id": "2",
"word": "list",
"lexeme": "LISTENGV",
"stem": "list",
"gcat": "V",
"tags": ["list[stem]", "V"]
}, {
"morpheme id": "4",
"word": "contacts",
"lexeme": "CONTACTENGN",
"stem": "contact",
"gcat": "N",
"tags": ["contact[stem]", "N", "Pl"]
}],
"syntax": [{
"symbol": "S",
"left child": {},
"right child": {
"symbol": "ENG_VP",
"left child": {},
"right child": {
"symbol": "ENG_Vbar1",
"left child": {
"symbol": "ENG_V",
"left child": {},
"right child": {
"symbol": "ENG_V_Stem",
"morpheme id": "2"
}
},
"right child": {
"symbol": "ENG_NP",
"left child": {},
"right child": {
"symbol": "ENG_N",
"left child": {},
"right child": {
"symbol": "ENG_N_Pl",
"left child": {
"symbol": "ENG_N_Stem",
"morpheme id": "4"
},
"right child": {
"symbol": "ENG_N_lfea_Pl"
}
}
}
}
}
}
}],
"semantics": [{
"id": "1",
"functor": "LISTENGV",
"d_key": "1",
"morpheme id": "2",
"tags": {},
"functor id": "LISTENGV_1",
"dependencies": [{
"id": "3",
"functor": "CONTACTENGN",
"d_key": "1",
"morpheme id": "4",
"tags": {},
"functor id": ""
}]
}],
"functors": [{
"functor id": "LISTENGV_1",
"definition": "contact=\"\";for(i=0;i<parameterList.length;++i){ if(parameterList[i].indexOf('CONTACTENGN_')>-1){ if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){ contact=arguments[i+2]; break; } }}let result={action:'fetchContacts',params:[contact]};return result;"
}]
}]
}
The functors section now contains the functor for 'list' and even its implementation. The desktop client implementation also prints the transcribed script which in this case is:
CONTACTENGN_3_out="{\"morpheme id\":\"4\",\"word\":\"contacts\",\"lexeme\":\"CONTACTENGN\",\"stem\":\"contact\",\"gcat\":\"N\",\"tags\":[\"contact[stem]\",\"N\",\"Pl\"]}";
LISTENGV_1_1_morphology="{\"morpheme id\":\"2\",\"word\":\"list\",\"lexeme\":\"LISTENGV\",\"stem\":\"list\",\"gcat\":\"V\",\"tags\":[\"list[stem]\",\"V\"]}";
function LISTENGV_1_1(functionName,parameterList,CONTACTENGN_3_out,LISTENGV_1_1_morphology){
contact="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('CONTACTENGN_')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
contact=arguments[i+2];
break;
}
}
}
let result={action:'fetchContacts',params:[contact]};
return result;
};
return LISTENGV_1_1_out=LISTENGV_1_1('LISTENGV_1_1',['CONTACTENGN_3_out','LISTENGV_1_1_morphology'],CONTACTENGN_3_out,LISTENGV_1_1_morphology);
This is not json but the code generated for the natural language text, which I formatted manually here to make it easier to read. It begins with a copy of what was inserted in the ANALYSES_DEPS table, but that part is irrelevant (and will be removed as it proved to be unnecessary). As you can see, there's a generated declaration for the functor implemented for 'list'. However, it's not called LISTENGV_1 as entered in FUNCTOR_DEFS but LISTENGV_1_1. That's because the transcriptor appends the node id of the corresponding node in the semantic parse tree. If you look up the node with id 1 in the semantics section, you'll see that its functor id matches that of the functor in the functors section. But that's a system implementation detail that you don't need to know in order to use it. The parameter list of the function is generated as follows:
- function name: js - functionName; sh - as shell script functions don't have named parameters, access as $1
- parameter list: js - parameterList; sh - as shell script functions don't have named parameters, access as $2
- values of the parameters in the parameter list
It's done this way so that, in the implementation of the function, one can dynamically check which function is being executed and what the names of the available parameters are. The function name is usually not necessary (except in the case of shell scripts, see later) and it is probably bad practice to build logic on it anyway. The list of parameters is pretty much necessary though, and unfortunately there's no other way to make the parameters available, as their names are also generated just like the function names, depending on which other function yields them. By this, I mean that the result of each function is stored in a variable generated from the name of the function and an '_out' suffix. For example, the last row of the output above is LISTENGV_1_1_out, which is the variable in which the result of LISTENGV_1_1 is stored. Though, as LISTENGV_1_1 is the function of the main verb, it is only used to return the end result. If we check the declaration of LISTENGV_1_1, we can see that it gets a 'CONTACTENGN_3_out' and a 'LISTENGV_1_1_morphology' parameter as well. 'CONTACTENGN_3_out' is the result of the functor of its dependency 'CONTACTENGN' as shown in the semantics parse tree. The strange thing here is that we haven't implemented anything for 'CONTACTENGN'. If there's no implementation for a functor, the system still generates an output variable for it and assigns to it the corresponding morphological analysis based on the morpheme id of the node in the semantics parse tree. This is just for convenience, as there are dependencies that don't really need a functor implementation. If a functor has an implementation, it also gets its morphological analysis as a parameter anyway, as it usually comes in handy. This is what appears for LISTENGV_1_1 in its declaration as 'LISTENGV_1_1_morphology'.
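If you're unsure what a js functor actually receives, a small debugging snippet like the following can be dropped into a functor body. This is a sketch only; it assumes the generated script runs under node (as in the later examples of this guide) so that console.log is available:
/*dump every generated parameter name together with its value*/
let dump={};
for(let i=0;i<parameterList.length;++i){
dump[parameterList[i]]=arguments[i+2];
}
console.log(functionName+': '+JSON.stringify(dump));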
As mentioned, all that applies to both the js and the sh transcriptors. The only difference is that shell script functions cannot return anything but integers for historical reasons, so instead of returning the result, you must put it in the variable that the system generates to store the function's result. This is the only case where the function name comes in handy. For example, if you have the result in a variable called 'out', you can assign its value to the generated variable (which other functions will expect to receive):
eval "$1"_out='"$out"';
That's what I meant by the peculiarities of the functor implementations. Now you may understand why it's necessary to iterate over the list of parameters in the implementation of the functor of LISTENGV:
- parameter names are generated, or at least suffixed (that's why I look them up by their name prefix, like XYZ_, ignoring the generated suffix)
- depending on the dependencies found for a lexeme in the sentence, certain parameters may not even be present
- the position of a parameter in the arguments list has to be determined so that its value can be accessed (see 'arguments[i+2]', where +2 skips the first two parameters)
This shouldn't impact performance much though as it's unusual for a functor to have that many parameters.
What finally shows up at the bottom of the interpretation is not part of the analyses but the result of the executed code:
{
action: 'fetchContacts',
params: [
'{"morpheme id":"4","word":"contacts","lexeme":"CONTACTENGN","stem":"contact","gcat":"N","tags":["contact[stem]","N","Pl"]}'
]
}
As you can see, the current implementation returned a json object. The generated scripts can do whatever you want, but depending on the platform they run on, they may or may not have access to all capabilities of the device. When they don't, it's practical to define a structure which the caller can easily process. The example above may be used to call a native function called 'fetchContacts' with the parameters in 'params' -even though in this example it's not that useful yet. Let's make it better and add some more entries to FUNCTOR_DEFS:
insert into FUNCTOR_DEFS values('CONTACTENGN_1', 'js', '1', 'contactengn_1.js');
insert into FUNCTOR_DEFS values('WITHENGPREP_1', 'js', '1', 'withengprep_1.js');
Change the functor assignments in the FUNCTORS table accordingly:
insert into FUNCTORS values('CONTACTENGN', '1', 'CONTACTENGN_1');
insert into FUNCTORS values('WITHENGPREP', '1', 'WITHENGPREP_1');
Let's create the implementations in the functors directory as well, starting with 'contactengn_1.js':
contact="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('WITHENGPREP_')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
contact=arguments[i+2];
break;
}
}
}
return contact;
Also for 'withengprep_1.js':
contact="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('CON_')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
con=JSON.parse(arguments[i+2]);
if(contact.length>0)contact=contact+" "+con.stem;
else contact=con.stem;
}
}
}
return contact;
According to the dependency chain set up in depolex, the function composition looks like: LISTENGV_1(CONTACTENGN_1(WITHENGPREP_1(CON()))). The functor for CON, as mentioned earlier, does not have any implementation, so it returns its morphological analysis to WITHENGPREP_1(). The implementation of WITHENGPREP_1() takes the value of the 'stem' property, which is the word itself as it could not be analysed (that's why it is a CON after all). Actually, the code is prepared to handle more than one incoming CON but we don't need that part now. Then WITHENGPREP_1() returns whatever it found. That return value is passed to CONTACTENGN_1(), which does not have any specific logic, so it just returns what it gets from WITHENGPREP_1(). Finally, LISTENGV_1() receives the return value of CONTACTENGN_1(), looks it up, creates a command object for the caller and returns that. If there's anything that can be done via a javascript runtime, direct action can be taken (like sending an http request), but looking up contacts (like in a phone), as in this demo, is usually done by the calling program natively. Nevertheless, this example is pretty good to demonstrate that the interpreter can be used even for translations, for example to generate SQL statements from natural language sentences.
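To make the composition tangible, here is a conceptual sketch in plain javascript of how the chain evaluates for 'list contacts with peter'. This is not the generated code; the functions and the morphology literal below are simplified stand-ins for the generated functors:
/*conceptual sketch only - simplified stand-ins for the generated functors*/
function CON(){return '{"word":"peter","stem":"peter","gcat":"CON"}';} /*no implementation: its morphological analysis is passed through*/
function WITHENGPREP_1(conOut){return JSON.parse(conOut).stem;} /*extracts the stem, i.e. 'peter'*/
function CONTACTENGN_1(withOut){return withOut;} /*no specific logic, just forwards*/
function LISTENGV_1(contactOut){return {action:'fetchContacts',params:[contactOut]};}
console.log(LISTENGV_1(CONTACTENGN_1(WITHENGPREP_1(CON()))));
/*prints: { action: 'fetchContacts', params: [ 'peter' ] }*/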
If we rebuild the interpreter like:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=build/hi_desktop/functors
make desktop_parser
make shared_native_lib
After starting it and typing 'list contacts with peter' the generated code will return at the bottom:
{ action: 'fetchContacts', params: [ 'peter' ] }
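On the caller side, a command object like the one above can be routed to native functionality. The following is only a sketch of such a dispatcher; fetchContacts() here is a hypothetical native implementation on the caller's side, not something provided by the framework:
/*hypothetical caller-side dispatcher; fetchContacts() is an assumed native function*/
function fetchContacts(name){console.log('would fetch contacts matching: '+name);}
function dispatch(result){
const handlers={fetchContacts:fetchContacts};
if(handlers[result.action])handlers[result.action](...result.params);
else console.error('unknown action: '+result.action);
}
dispatch({action:'fetchContacts',params:['peter']});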
Let's create a corpus file with the sentences you want to use for training:
list contacts
list contacts with peter
call peter
call the first
call the last
call the second
Save it as corpus.txt in build/hi_desktop and stay in that directory.
To prepare a stemmed corpus for the integrated machine learning tool called ABL and to add the stems of the words in the corpus to the lexicon, issue:
../prep_abl m1.db corpus.txt ENG stemmed_corpus.txt nopun -d lex
The option 'nopun' indicates that punctuation shall be ignored, '-d' lets the default delimiter (the newline character) be used, while 'lex' means that the words in the corpus need to be lexicalized, i.e. their stems will be written in the lexicon db table and a default lexeme will be generated for each of them, which will also be used as the functor name.
The LEXICON table now contains the following:
call|ENG|V|call_ENG_V
contact|ENG|N|contact_ENG_N
first|ENG|N|first_ENG_N
last|ENG|N|last_ENG_N
list|ENG|V|list_ENG_V
one|ENG|Num|one_ENG_Num
the|ENG|DET|the_ENG_DET
two|ENG|Num|two_ENG_Num
with|ENG|PREP|with_ENG_PREP
The FUNCTORS table content is like this:
call_ENG_V|1|
contact_ENG_N|1|
first_ENG_N|1|
last_ENG_N|1|
list_ENG_V|1|
one_ENG_Num|1|
the_ENG_DET|1|
two_ENG_Num|1|
with_ENG_PREP|1|
As numbers shall be treated differently rather than by enumerating the ones we need, delete them from the lexicon and from the functors table by:
echo "delete from lexicon where gcat='Num';"|sqlite3 m1.db
echo "delete from functors where functor like '%_Num';"|sqlite3 m1.db
Technically, constants/concealed words are also required to have a functor (though with no implementation), so let's add it:
echo "insert into FUNCTORS values('CON', '1', NULL);"|sqlite3 m1.db
Now let's induce a grammar using ABL:
abl_align -a a -p b -e -i stemmed_corpus.txt -o m1_aligned.txt
abl_cluster -i m1_aligned.txt -o m1_clustered.txt
abl_select -s b -i m1_clustered.txt -o m1_selected.txt
or play around with different selection methods like:
abl_select -s f -i m1_clustered.txt -o m1_selected.txt
Extract the grammar from the output of ABL:
../proc_abl m1_selected.txt ENG m1.db
Test if the sentences used for training get generated by the induced grammar:
../stex m1.db ENG 10d list,call,contacts,with,peter,the,first,last,second > m1stex.txt
../remove_stex_output_duplicates.sh m1stex.txt
../stax m1stex.txt_unique stemmed_corpus.txt
If you're not satisfied with the results, you can retrain the ML or, in case of a white-box ML like this, you can check the output of all the steps and figure out why the ML came up with such a grammar instead of guessing and retraining. You can even manually modify the generated grammar or anything else in the model.
In case you want to start the ML over, you'll need to delete the files created after the db file creation (including the db file itself), then recreate the db file and carry out all subsequent steps, as sketched below.
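A sketch of such a restart, assuming the file names used in this walkthrough (run the rm in build/hi_desktop and adjust the list if you named things differently):
# in build/hi_desktop: remove the db and everything derived from it by the ML steps above
rm -f m1.db stemmed_corpus.txt m1_aligned.txt m1_clustered.txt m1_selected.txt m1stex.txt m1stex.txt_unique
# then recreate the db the same way you originally did and repeat prep_abl, abl_align,
# abl_cluster, abl_select, proc_abl and the stex/stax checks described above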
At this point you can build a library that's capable of carrying out morphological and syntactic analysis of a text input like:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=""
make desktop_parser
Hint: Check shift/reduce, reduce/reduce conflicts!
make shared_native_lib
Modify hi.cpp: in the main function, at the line where the 'toa' variable is set, change it so that it looks like:
language="sh";
toa=HI_MORPHOLOGY|HI_SYNTAX;
and change hi.db to m1.db when the hi() function is called so that it looks like:
analyses=hi(text.c_str(),"ENG",toa,language.c_str(),"hi_desktop/m1.db","test",crh);
Then build the desktop client:
make desktop_client
Start the interpreter and type the example sentence:
human_input:list contacts
picking new token path
nr of paths:4
current_path_nr:0
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:1
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:2
lexer started
interpreter started
V->list_ENG_V
CON->contacts
_3_0->_1_1 _2_2
S->_3_0
picking new token path
current_path_nr:3
lexer started
interpreter started
V->list_ENG_V
N->contact_ENG_N
t_ENG_N_Pl->
_1_2->_1_3 _1_4
_1_0->_1_1 _1_2
S->_1_0
There are 2 analyses.
{"analyses":[{"morphology":[{"morpheme id":"2","word":"list","lexeme":"list_ENG_V","stem":"list","gcat":"V","tags":["list[stem]","V"]},{"morpheme id":"4","word":"contacts","lexeme":"contact_ENG_N","stem":"contact","gcat":"N","tags":["contact[stem]","N","Pl"]}],"syntax":[{"symbol":"S","left child":{},"right child":{"symbol":"_1_0","left child":{"symbol":"_1_1","morpheme id":"2"},"right child":{"symbol":"_1_2","left child":{"symbol":"_1_3","morpheme id":"4"},"right child":{"symbol":"_1_4"}}}}]},{"morphology":[{"morpheme id":"2","word":"list","lexeme":"list_ENG_V","stem":"list","gcat":"V","tags":["list[stem]","V"]},{"morpheme id":"3","word":"contacts","lexeme":"contacts","stem":"contacts","gcat":"CON","tags":["contacts[stem]","CON"]}],"syntax":[{"symbol":"S","left child":{},"right child":{"symbol":"_3_0","left child":{"symbol":"_1_1","morpheme id":"2"},"right child":{"symbol":"_2_2","morpheme id":"3"}}}]}]}
Hint: check the json form of the analyses at jsonlint.com
If there's any linguistic feature that you want to add at the syntactic level, the grammar rules are the right place to do so. One such mandatory feature is marking a verb as main verb, since more than one verb may appear in a sentence, even though that's not the case in our simple examples. Currently there isn't any logic at hand that could figure it out and do that automatically. It does not make sense to mark a verb as main verb in a rule which is applied whenever a verb pops up in a sentence, since then all verbs would be marked as main verb; rather, it should be done when two phrases or parts of a sentence get combined. To be able to track back such combinations, we need the content of the GRAMMAR table generated by the ML, which we can get by issuing:
echo "select * from grammar;"|sqlite3 build/hi_desktop/m1.db
ENG|S|_1_0|||
ENG|S|_2_0|||
ENG|S|_3_0|||
ENG|S|_6_0|||
ENG|S|_7_0|||
ENG|_1_0|_1_1|_1_2||
ENG|_1_1|t_ENG_V_Stem|||
ENG|_1_2|_1_3|_1_4||
ENG|_1_3|t_ENG_N_Stem|||
ENG|_1_4|t_ENG_N_Pl|||
ENG|_2_0|_1_1|_2_6___2_2||
ENG|_2_2|t_ENG_CON_Stem|||
ENG|_2_6|_1_3|_2__2_6_2___2__2_6_3||
ENG|_2_6___2_2|_2_6|_2_2||
ENG|_2__2_6_2___2__2_6_3|_1_4|_2__2_6_3||
ENG|_2__2_6_3|t_ENG_PREP_Stem|||
ENG|_3_0|_1_1|_2_2||
ENG|_4_4|t_ENG_N_Sg|||
ENG|_4_7|t_ENG_DET_Stem|||
ENG|_4_8|_1_1|_4_7||
ENG|_4_9|_1_3|_4_4||
ENG|_5_9|_5__5_9_2|_5__5_9_3||
ENG|_5__5_9_2|t_ENG_Num_Stem|||
ENG|_5__5_9_3|t_ENG_Num_Ord|||
ENG|_6_0|_4_8|_4_9||
ENG|_7_0|_4_8|_5_9||
By this point at the latest, one needs to decide either to keep living in the matrix or to replace the generated symbols with something meaningful like:
_2_2=ENG_CON_Stem
_1_4=ENG_N_Pl
_1_3=ENG_N_Stem
_1_2=ENG_N
_1_1=ENG_V
_1_0,_3_0=ENG_VP
Here, I'll stick to the generated symbols.
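If you do decide to replace them, a sketch of renaming one generated symbol (here _1_1 to ENG_V) in the GRAMMAR table could look like the following; the column names are the ones used by the update statements later in this section, and the renaming has to be repeated for every symbol and wherever it appears (check SYMBOLS and any RULE_TO_RULE_MAP entries you add later as well):
/*sketch only: rename _1_1 to ENG_V everywhere it appears in the grammar*/
update grammar set parent_symbol='ENG_V' where lid='ENG' and parent_symbol='_1_1';
update grammar set head_symbol='ENG_V' where lid='ENG' and head_symbol='_1_1';
update grammar set non_head_symbol='ENG_V' where lid='ENG' and non_head_symbol='_1_1';
Save such statements in a file and pipe it through sqlite3 m1.db like the other sql files in this guide.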
There are three relevant combinations for us in the grammar:
_1_2->_1_3 _1_4
_1_0->_1_1 _1_2
_3_0->_1_1 _2_2
Let's track back the first one:
_1_2->_1_3 _1_4
ENG|_1_3|t_ENG_N_Stem|||
ENG|_1_4|t_ENG_N_Pl|||
the second one:
_1_0->_1_1 _1_2
ENG|_1_1|t_ENG_V_Stem|||
ENG|_1_2|_1_3|_1_4||
ENG|_1_0|_1_1|_1_2||
and the third:
_3_0->_1_1 _2_2
ENG|_1_1|t_ENG_V_Stem|||
ENG|_2_2|t_ENG_CON_Stem|||
ENG|_3_0|_1_1|_2_2||
So in our example the rule that combines a verb with a noun phrase would be this entry:
ENG|_1_0|_1_1|_1_2||
while the rule that combines a verb with a constant is this:
ENG|_3_0|_1_1|_2_2||
For our example, the action code would look like:
"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(main_node,"main_verb");
std::string parent_symbol=yytname_[yylhs.type_get()];
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"
We may either insert this content directly in the corresponding action field of the rule in m1.db or create an action snippet file for it and put that file name as the content of the action field. The first case: it's not easy to echo such a long text through a pipe to sqlite, so I'd still recommend saving it in a file (together with the quotes at the beginning and at the end) called e.g. main_verb and then echoing that:
action=`cat main_verb`;echo update grammar set action=\'"$action"\' where lid=\'ENG\' and parent_symbol=\'_1_0\' and head_symbol=\'_1_1\' and non_head_symbol=\'_1_2\'\;|sqlite3 m1.db
action=`cat main_verb`;echo update grammar set action=\'"$action"\' where lid=\'ENG\' and parent_symbol=\'_3_0\' and head_symbol=\'_1_1\' and non_head_symbol=\'_2_2\'\;|sqlite3 m1.db
The second case: copy all .cpp files (except gensrc.cpp itself) from the project subdirectory /gensrc to the build/hi_desktop directory. Save the action implementation (without the quotes at the beginning and at the end) as a file called e.g. main_verb in the build/hi_desktop directory. Then update the action field of the corresponding grammar rule with the action snippet file name:
echo update grammar set action=\'main_verb\' where lid=\'ENG\' and parent_symbol=\'_1_0\' and head_symbol=\'_1_1\' and non_head_symbol=\'_1_2\'\;|sqlite3 m1.db
echo update grammar set action=\'main_verb\' where lid=\'ENG\' and parent_symbol=\'_3_0\' and head_symbol=\'_1_1\' and non_head_symbol=\'_2_2\'\;|sqlite3 m1.db
The semantic modelling part is about mapping the syntactic rules to semantic ones and setting up the dependencies of lexemes. Semantic rules are stored in the rule_to_rule_map table while dependencies are stored in the depolex table. Each syntactic rule that combines two nodes and for which you want a semantic check to be carried out must be entered in the rule_to_rule_map table. In such combinations, one of the nodes is considered the main node and the other one the dependent node. E.g. when combining a verb and a noun, the noun is the dependency of the verb as the functor of the verb operates on the noun as its argument. At least, it usually makes sense to set up the dependencies in the depolex table in such a way that e.g. 'contacts' is an argument of 'list'. The DEPOLEX table stores the dependencies of lexemes. This is the content which in the future could be extracted from dependency graphs generated by ML/AI.
To stick to the example of 'list contacts' where 'contacts' is the argument of 'list', such an entry would look like:
insert into DEPOLEX values('list_ENG_V', '1', '1', NULL, NULL, NULL, '0', 'contact_ENG_N', '1');
insert into DEPOLEX values('contact_ENG_N', '1', '1', NULL, '1', NULL, '0', 'with_ENG_PREP', '1');
insert into DEPOLEX values('with_ENG_PREP', '1', '1', NULL, NULL, NULL, '0', 'CON', '1');
insert into DEPOLEX values('CON', '1', '1', NULL, NULL, NULL, NULL, NULL, NULL);
Copy the entries to a file, e.g. depolex.sql, and issue:
cat depolex.sql|sqlite3 m1.db
The depolex entries in the example therefore mean that the word 'contact' (with a certain meaning, identified by the lexeme contact_ENG_N and d_key 1) is a dependency of the word 'list' (with a certain meaning, identified by list_ENG_V and d_key 1). Similarly, the word 'with' is a dependency of the word 'contact', where not finding the dependency 'with' does not count as a failure. This dependency hierarchy (or chain) is sufficient to interpret e.g. 'list contacts' or 'list contacts with Peter'.
Based on the dependency hierarchy and the nodes combined in the syntax tree, the rule_to_rule_map entries can be set up. So we need to follow the path from the terminals bottom-up to the root in the syntax tree to identify where two nodes get combined. We can find this out by checking the result of the previous execution of the interpreter for our example sentence.
There are two combinations in the analysis as earlier mentioned:
_1_2->_1_3 _1_4
_1_0->_1_1 _1_2
Out of these two, we need to validate semantically whether the combination of the verb and the noun makes sense, which means we need to take this entry:
ENG|_1_0|_1_1|_1_2||
Before inserting anything in the RULE_TO_RULE_MAP table, check the fields and their meaning. The following entry will trigger semantic validation for the syntactic rule _1_0->_1_1 _1_2 by looking for words having grammatical category 'V' (i.e. verb) in the parse tree of _1_1, for other words having grammatical category 'N' (i.e. noun) in the parse tree of _1_2, and finally checking whether the words found for _1_1 have the words found for _1_2 as their dependencies.
insert into RULE_TO_RULE_MAP values( '_1_0', '_1_1', '_1_2', '1', NULL, NULL, 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
Copy the entry to a file e.g. r2rm.sql and issue:
cat r2rm.sql|sqlite3 m1.db
Depending on whether you use action snippets or not, there are two ways to regenerate the sources for the parser:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPACTIONSNIPPETS=build/hi_desktop/ DESKTOPFUNCTORPATH=""
OR
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=""
Then the next step:
make desktop_parser
Hint: check shift/reduce, reduce/reduce conflicts!
Build the shared library:
make shared_native_lib
Modify hi.cpp: in the main function, at the line where the 'toa' variable is set, change it so that it looks like:
toa=HI_MORPHOLOGY|HI_SYNTAX|HI_SEMANTICS;
Then build the desktop client:
make desktop_client
Execute the interpreter:
./hi
list contacts
human_input:list contacts
picking new token path
nr of paths:4
current_path_nr:0
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:1
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:2
lexer started
interpreter started
V->list_ENG_V
CON->contacts
_3_0->_1_1 _2_2
S->_3_0
dependencies with longest match:
functor list_ENG_V d_key 1: 0 deps found out of the expected 1 deps to be found
Minimum number of dependencies to match:1
Matching nr of dependencies found for functor list_ENG_V with d_key 1:0
Total number of dependencies:1
No matching nr of dependencies found for functor list_ENG_V with any d_key
semantic error
processed words:list
FALSE: error at contacts
picking new token path
current_path_nr:3
lexer started
interpreter started
V->list_ENG_V
N->contact_ENG_N
t_ENG_N_Pl->
_1_2->_1_3 _1_4
_1_0->_1_1 _1_2
step:1 failover:0 successor:0
inserting in dvm, dependent node functor contact_ENG_N for main node functor list_ENG_V
S->_1_0
dependencies with longest match:
functor list_ENG_V d_key 1: 1 deps found out of the expected 1 deps to be found
functor contact_ENG_N d_key 1: 0 deps found out of the expected 0 deps to be found
Minimum number of dependencies to match:1
Matching nr of dependencies found for functor list_ENG_V with d_key 1:1
Total number of dependencies:1
TRUE
There are 1 analyses.
transcripting:list_ENG_V_1
transcripting:contact_ENG_N_1
{"analyses":[{"morphology":[{"morpheme id":"2","word":"list","lexeme":"list_ENG_V","stem":"list","gcat":"V","tags":["list[stem]","V"]},{"morpheme id":"4","word":"contacts","lexeme":"contact_ENG_N","stem":"contact","gcat":"N","tags":["contact[stem]","N","Pl"]}],"syntax":[{"symbol":"S","left child":{},"right child":{"symbol":"_1_0","left child":{"symbol":"_1_1","morpheme id":"2"},"right child":{"symbol":"_1_2","left child":{"symbol":"_1_3","morpheme id":"4"},"right child":{"symbol":"_1_4"}}}}],"semantics":[{"id":"1","functor":"list_ENG_V","d_key":"1","morpheme id":"2","tags":{},"functor id":"","dependencies":[{"id":"2","functor":"contact_ENG_N","d_key":"1","morpheme id":"4","tags":{},"functor id":""}]}],"functors":[]}]}
contact_ENG_N_2_out="{\"morpheme id\":\"4\",\"word\":\"contacts\",\"lexeme\":\"contact_ENG_N\",\"stem\":\"contact\",\"gcat\":\"N\",\"tags\":[\"contact[stem]\",\"N\",\"Pl\"]}";list_ENG_V_1_out="{\"morpheme id\":\"2\",\"word\":\"list\",\"lexeme\":\"list_ENG_V\",\"stem\":\"list\",\"gcat\":\"V\",\"tags\":[\"list[stem]\",\"V\"]}";
Hint: check the json form of the analyses at jsonlint.com
At this point we get back the analyses in json form, but if we need more, as may be the case in many nlp use cases like classification (see the Advanced topics) or triggering a semantic action, we need to add tagging and/or functor implementations as well.
Let's extend our example to have a real life use case for functor implementation. If we execute the interpreter and type this sentence:
./hi
list contacts with peter
it'll quit saying 'There are 0 analyses' (the error-reporting analyses at the end of the output are cut here):
human_input:list contacts with peter
picking new token path
nr of paths:8
current_path_nr:0
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:1
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:2
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:3
lexer started
interpreter started
bison:syntax error, unexpected t_ENG_CON_Stem, expecting t_ENG_V_Stem
syntax error
processed words:
FALSE: error at list
picking new token path
current_path_nr:4
lexer started
interpreter started
V->list_ENG_V
CON->contacts
_3_0->_1_1 _2_2
S->_3_0
bison:syntax error, unexpected t_ENG_CON_Stem, expecting END
syntax error
processed words:list
FALSE: error at with
picking new token path
current_path_nr:5
lexer started
interpreter started
V->list_ENG_V
CON->contacts
_3_0->_1_1 _2_2
S->_3_0
bison:syntax error, unexpected t_ENG_PREP_Stem, expecting END
syntax error
processed words:list
FALSE: error at with
picking new token path
current_path_nr:6
lexer started
interpreter started
V->list_ENG_V
N->contact_ENG_N
t_ENG_N_Pl->
_1_2->_1_3 _1_4
_1_0->_1_1 _1_2
step:1 failover:0 successor:0
inserting in dvm, dependent node functor contact_ENG_N for main node functor list_ENG_V
S->_1_0
bison:syntax error, unexpected t_ENG_CON_Stem, expecting END
syntax error
processed words:list contacts
FALSE: error at with
picking new token path
current_path_nr:7
lexer started
interpreter started
V->list_ENG_V
N->contact_ENG_N
t_ENG_N_Pl->
PREP->with_ENG_PREP
_2__2_6_2___2__2_6_3->_1_4 _2__2_6_3
_2_6->_1_3 _2__2_6_2___2__2_6_3
CON->peter
_2_6___2_2->_2_6 _2_2
_2_0->_1_1 _2_6___2_2
S->_2_0
dependencies with longest match:
functor list_ENG_V d_key 1: 0 deps found out of the expected 1 deps to be found
Minimum number of dependencies to match:1
Matching nr of dependencies found for functor list_ENG_V with d_key 1:0
Total number of dependencies:3
No matching nr of dependencies found for functor list_ENG_V with any d_key
semantic error
processed words:list contacts with
FALSE: error at peter
There are 0 analyses.
That's because the grammar induced by the ML uses a different parse tree to validate this sentence, and a different grammar rule is used to combine the verb and noun phrases than earlier. Let's track back what the interpreter does:
_2_0->_1_1 _2_6___2_2
ENG|_1_1|t_ENG_V_Stem|||
ENG|_2_6___2_2|_2_6|_2_2||
ENG|_2_6|_1_3|_2__2_6_2___2__2_6_3||
ENG|_1_3|t_ENG_N_Stem|||
ENG|_2__2_6_2___2__2_6_3|_1_4|_2__2_6_3||
ENG|_1_4|t_ENG_N_Pl|||
ENG|_2__2_6_3|t_ENG_PREP_Stem|||
ENG|_2_2|t_ENG_CON_Stem|||
This shows that even though the ML found a way to generate a grammar for this sentence, the training corpus was too small to generate a "correct" one. For me, the problematic rule is:
ENG|_2__2_6_2___2__2_6_3|_1_4|_2__2_6_3||
Here, a morpheme that indicates that a noun is plural gets combined with a preposition, instead of first combining the noun with the plural morpheme, then the preposition with the constant, and then combining these two combinations.
So the entry where the main verb needs to be marked is:
_2_0->_1_1 _2_6___2_2
The action implementation that marks a verb as main verb needs to be added for it as well:
"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(main_node,"main_verb");
std::string parent_symbol=yytname_[yylhs.type_get()];
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"
Depending on whether you use action snippets or not, attach the action to the rule in one of the following two ways:
action=`cat main_verb`;echo update grammar set action=\'"$action"\' where lid=\'ENG\' and parent_symbol=\'_2_0\' and head_symbol=\'_1_1\' and non_head_symbol=\'_2_6___2_2\'\;|sqlite3 m1.db
OR
echo update grammar set action=\'main_verb\' where lid=\'ENG\' and parent_symbol=\'_2_0\' and head_symbol=\'_1_1\' and non_head_symbol=\'_2_6___2_2\'\;|sqlite3 m1.db
Besides that, we also need to map semantic rules to the syntactic ones wherever two nodes get combined:
insert into RULE_TO_RULE_MAP values( '_2_0', '_1_1', '_2_6___2_2', '1', NULL, NULL, 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( '_2_6', '_1_3', '_2__2_6_2___2__2_6_3', '1', NULL, NULL, 'N', NULL, 'H', NULL, NULL, 'PREP', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( '_2_6___2_2', '_2_6', '_2_2', '1', NULL, NULL, 'PREP', NULL, 'H', NULL, NULL, 'CON', NULL, 'N', NULL, NULL, 'ENG');
Copy the entries to a file, e.g. r2rm2.sql, and issue:
cat r2rm2.sql|sqlite3 m1.db
Depending on whether you use action snippets or not, there are two ways to regenerate the sources for the parser:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPACTIONSNIPPETS=build/hi_desktop/ DESKTOPFUNCTORPATH=""
OR
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=""
make desktop_parser
Hint: check shift/reduce reduce/reduce conflicts!
make shared_native_lib
If we now execute the interpreter and type 'list contacts with peter', it will provide an analysis for our extended example. Though, the functors section at the end is still empty:
"functors": []
The language of the functor implementation depends on the target language chosen in the FUNCTOR_DEFS db table, which is linked to the FUNCTORS table. Although the example sentence is not really relevant for a desktop use case, in order to have a complete guide for desktop usage let's consider contacts to be files with the .vcf extension. This also means that the target language will be shell script, as the client delivered for desktop usage can only generate shell script code. Nevertheless, any other language can be used, but then the client generating the code must be written for that as well.
Let's prepare the functor definitions we'll implement in this example in the db. Create a file called functors.sql in build/hi_desktop with the following content:
insert into FUNCTOR_DEFS values('list_ENG_V_1', 'sh', '1', NULL);
insert into FUNCTOR_DEFS values('contact_ENG_N_1', 'sh', '1', NULL);
insert into FUNCTOR_DEFS values('with_ENG_PREP_1', 'sh', '1', NULL);
update FUNCTORS set functor_id='list_ENG_V_1' where functor='list_ENG_V' and d_key='1';
update FUNCTORS set functor_id='contact_ENG_N_1' where functor='contact_ENG_N' and d_key='1';
update FUNCTORS set functor_id='with_ENG_PREP_1' where functor='with_ENG_PREP' and d_key='1';
Then put it in the db:
cat functors.sql|sqlite3 m1.db
Create a directory in build/hi_desktop called e.g. functors where the functor implementations will reside:
mkdir functors
To have some initial code that shows what a functor gets as incoming parameters, let's put the same code in each functor implementation and save them in the functors directory using the file names listengv_1.sh, contactengn_1.sh and withengprep_1.sh (constants/concealed words do not have any functor implementation so there's nothing to be done for words having the grammatical category CON):
echo "printing parameters and their contents for" $1;
unset out;
c=1;
for i in $2;
do
p=$(($c+2));
eval v="\$$p";
echo name;
echo $i;
echo content;
echo "$v";
case "$i" in
*_out) out="$v";
;;
esac;
c=$(($c+1));
done;
eval "$1"_out='"$out"';
The code used for functor implementations is always client-specific. I wrote the desktop client according to the concept (see above) in such a way that whenever a functor is called, it gets its own name in the first parameter, the list of incoming parameters in the second parameter, and all subsequent parameters are the ones enumerated in the parameter list. The desktop client takes care of adding an extra parameter to the list which is called:
<functor>_morphology
which contains the current functor's morphology.
This piece of code prints the first parameter, then loops over the parameter list in the second parameter, printing the name and content of each parameter in the list. The only extra logic in the loop is that when a parameter ending in "_out" is found, its value is stored in a variable called "out". According to the concept, a functor receives the output of the functor(s) it calls as its input, which ideally could be done via the return value of a shell function here. However, shell functions have only the exit status as their return value. So instead of returning a value, the desktop client expects a functor to put its output in a variable named like:
<functor>_out
So when a parameter ends in "_out", its value will be copied to another variable and a new outgoing parameter, named according to the previously mentioned convention, will get that value. This is sufficient here as there's no functor with more than one dependency, which could lead to having more than one incoming parameter as the output of the corresponding dependencies. Copying the output of a previously called functor to the current functor's output makes sure that the caller of the current functor will receive it as well. This is just to demonstrate how values can be passed over from one functor to the other. Usually, there is also some data processing logic implemented in the functors.
Instead of implementing the functors directly in the definition field of a functor record in the db, I'd suggest implementing them in separate files. So let's update the db with our functor implementation file names:
echo update functor_defs set definition=\'listengv_1.sh\' where functor_id=\'list_ENG_V_1\' and tlid=\'sh\' and imp_counter=\'1\'\;|sqlite3 m1.db
echo update functor_defs set definition=\'contactengn_1.sh\' where functor_id=\'contact_ENG_N_1\' and tlid=\'sh\' and imp_counter=\'1\'\;|sqlite3 m1.db
echo update functor_defs set definition=\'withengprep_1.sh\' where functor_id=\'with_ENG_PREP_1\' and tlid=\'sh\' and imp_counter=\'1\'\;|sqlite3 m1.db
In order for the functor implementations to get copied to the db, we need to regenerate the bison parser source as well, since the same tool (gensrc) is used for copying the grammar action snippets and the functor implementations. Depending on whether we use action snippets or not, we need to trigger one of the following commands:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPACTIONSNIPPETS=build/hi_desktop/ DESKTOPFUNCTORPATH=build/hi_desktop/functors
OR
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=build/hi_desktop/functors
make desktop_parser
Hint: check shift/reduce reduce/reduce conflicts!
make shared_native_lib
If we execute the interpreter, type 'list contacts with peter' and check the output after the json analyses, we'll see that it starts with:
printing parameters and their contents for with_ENG_PREP_1_4
name
CON_7_out
content
{"morpheme id":"4","word":"peter","stem":"peter","gcat":"CON"}
name
with_ENG_PREP_1_4_morphology
content
{"morpheme id":"3","word":"with","stem":"with","gcat":"PREP","tags":["with[stem]","PREP"]}
If we check the semantics part in the json analyses:
"semantics": [{
"id": "1",
"functor": "list_ENG_V",
"d_key": "1",
"morpheme id": "2",
"tags": {},
"functor id": "list_ENG_V_1",
"dependencies": [{
"id": "2",
"functor": "contact_ENG_N",
"d_key": "1",
"morpheme id": "4",
"tags": {},
"functor id": "contact_ENG_N_1",
"dependencies": [{
"id": "4",
"functor": "with_ENG_PREP",
"d_key": "1",
"morpheme id": "6",
"tags": {},
"functor id": "with_ENG_PREP_1",
"dependencies": [{
"id": "7",
"functor": "CON",
"d_key": "1",
"morpheme id": "7"
}]
}]
}]
}]
we'll find the with_ENG_PREP functor as the last but one element of the dependency chain, right before the constant dependency. That's because we've set up our dependency hierarchy as:
list(contact(with(constant)))
in the depolex table. But as constants/concealed words don't have a functor implementation, the value of a constant is automatically passed over to the functor to which the constant belongs as a dependency.
To add some specific logic to each functor let's change the functor implementations:
contactengn_1.sh:
echo "printing parameters and their contents for" $1;
unset out;
c=1;
for i in $2;
do
p=$(($c+2));
eval v="\$$p";
echo name;
echo $i;
echo content;
echo "$v";
case "$i" in
*_out) out=*"$(echo "$v"|grep -o '\"stem\"\:.*'|cut -f1 -d,|cut -f4 -d\")"*.vcf
;;
esac;
c=$(($c+1));
done;
eval "$1"_out='"$out"';
listengv_1.sh:
echo "printing parameters and their contents for" $1;
unset out;
c=1;
for i in $2;
do
p=$(($c+2));
eval v="\$$p";
echo name;
echo $i;
echo content;
echo "$v";
case "$i" in
*_out) if [ -z "$v" ];
then out="find . -name '*.vcf'";
else out="find . -name '"$v"'";
fi;
;;
esac;
c=$(($c+1));
done;
echo "$out";
eval "$out";
The functor implementation of 'with' does not need to be changed as there's no specific logic we could add.
During source generation the file names referring to functor implementations are overwritten by the actual code the files contain, so we have to put the file names back and regenerate the source:
echo update functor_defs set definition=\'listengv_1.sh\' where functor_id=\'list_ENG_V_1\' and tlid=\'sh\' and imp_counter=\'1\'\;|sqlite3 m1.db
echo update functor_defs set definition=\'contactengn_1.sh\' where functor_id=\'contact_ENG_N_1\' and tlid=\'sh\' and imp_counter=\'1\'\;|sqlite3 m1.db
echo update functor_defs set definition=\'withengprep_1.sh\' where functor_id=\'with_ENG_PREP_1\' and tlid=\'sh\' and imp_counter=\'1\'\;|sqlite3 m1.db
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPACTIONSNIPPETS=build/hi_desktop/ DESKTOPFUNCTORPATH=build/hi_desktop/functors
OR
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=build/hi_desktop/functors
make desktop_parser
Hint: check shift/reduce reduce/reduce conflicts!
make shared_native_lib
Before executing the interpreter, let's create some files in build/hi_desktop with which our functor implementations can be tested. As only the file name counts according to the logic we implemented, the content of the files can be anything.
echo test > build/hi_desktop/"peter smith.vcf"
echo test > build/hi_desktop/"a peter.vcf"
echo test > build/hi_desktop/peter.vcf
echo test > build/hi_desktop/test.vcf
If we execute the interpreter and type 'list contacts', we'll get all the .vcf files we have in any of the subdirectories listed, since the desktop client generates a script from the functor implementations, which is executed as the semantic action:
find . -name '*.vcf'
./hi_desktop/peter smith.vcf
./hi_desktop/a peter.vcf
./hi_desktop/peter.vcf
./hi_desktop/test.vcf
If we type 'list contacts with peter', only the files get listed that match the *peter*.vcf pattern:
find . -name '*peter*.vcf'
./hi_desktop/peter smith.vcf
./hi_desktop/a peter.vcf
./hi_desktop/peter.vcf
The following topics are based on the final sql content of the manual modelling section.
To add tags based on which classification can be done, we need some content in the FUNCTOR_TAGS db table. An entry like this:
insert into FUNCTOR_TAGS values('LISTENGV', '1', 'main_verb', '1', 'type', 'action');
will result in tagging 'LISTENGV'. Copy the entry to a file e.g. tag.sql and issue:
cat tag.sql|sqlite3 m1.db
If we execute the interpreter and check the result for 'list contacts', there's a new object called 'tags' for LISTENGV in the semantics:
"semantics": [{
"id": "1",
"functor": "LISTENGV",
"d_key": "1",
"morpheme id": "2",
"tags": {
"type": "action"
},
"functor id": "",
"dependencies": [{
"id": "3",
"functor": "CONTACTENGN",
"d_key": "1",
"morpheme id": "4",
"tags": {},
"functor id": ""
}]
}]
If there's any linguistic feature that you want to add at the semantic level, tagging is the right way to do so.
The framework can handle punctuation via foma. In this section I'll show how to handle punctuation using the imperative mood as an example, while the other moods are explained in the Handling statements and questions section. The following needs to be added to m1content.sql:
insert into SETTINGS values('imperative_mood_tag','imperative');
insert into SYMBOLS values('Punct', 'ENG', 'Punctuation');
insert into SYMBOLS values('ExclamationMark', 'ENG', 'Exclamation mark');
insert into SYMBOLS values('t_ENG_Punct_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_Punct_ExclamationMark','ENG',NULL);
insert into SYMBOLS values('ENG_VP_Imp','ENG',NULL);
insert into SYMBOLS values('ENG_Punct_Imp','ENG',NULL);
insert into SYMBOLS values('ENG_Punct_Stem','ENG',NULL);
insert into SYMBOLS values('ENG_Punct_ExclamationMark','ENG',NULL);
insert into GCAT values('Punct', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('Punct', 'ExclamationMark', 'ENG', '1',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Punct_Stem','t_ENG_Punct_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Punct_ExclamationMark','t_ENG_Punct_ExclamationMark',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Punct_Imp','ENG_Punct_Stem','ENG_Punct_ExclamationMark',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_VP_Imp','ENG_Vbar1',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_VP_Imp','ENG_Vbar1','ENG_PP',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_VP_Imp','ENG_V','ENG_DP',NULL,NULL);
insert into GRAMMAR values('ENG','S','ENG_VP_Imp','ENG_Punct_Imp',NULL,NULL);
insert into RULE_TO_RULE_MAP values( 'ENG_VP_Imp', 'ENG_Vbar1', 'ENG_PP', '1', NULL, NULL, 'N', NULL, 'H', NULL, NULL, 'PREP', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP_Imp', 'ENG_V', 'ENG_DP', '1', '2', NULL, 'V', NULL, 'H', NULL, NULL, 'CON', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP_Imp', 'ENG_V', 'ENG_DP', '2', '3', NULL, 'V', NULL, 'H', NULL, NULL, 'Num', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP_Imp', 'ENG_V', 'ENG_DP', '3', NULL, NULL, 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
Please note that the symbol ENG_VP got replaced for the sake of clarity with ENG_VP_Imp, so you need to clean up the corresponding entries in SYMBOLS, GRAMMAR and RULE_TO_RULE_MAP.
The value of the 'imperative_mood_tag' setting can be any arbitrary string, but the chosen value shall be used when adding the corresponding feature to indicate mood at the syntactic level, since the model handles the other moods as well, which are indicated by punctuation. In our case, we need to add the following action snippet to the corresponding rule S: ENG_VP_Imp ENG_Punct_Imp:
"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(main_node,"main_verb","imperative",true);
std::string parent_symbol="S";
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"
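If you go the snippet-file route, save the code above (without the surrounding quotes) as a file called e.g. imp_mood in build/hi_desktop and put that file name in the action field of the corresponding GRAMMAR entry in m1content.sql instead of the trailing NULL used above, roughly like this (alternatively, embed the quoted action code directly in that field, as the indicative-mood rules in the next section do):
insert into GRAMMAR values('ENG','S','ENG_VP_Imp','ENG_Punct_Imp',NULL,'imp_mood');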
Generate a new parser by executing the following steps:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
If you saved the action snippets in files in build/hi_desktop:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPACTIONSNIPPETS=build/hi_desktop/ DESKTOPFUNCTORPATH=build/hi_desktop/functors
else
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=build/hi_desktop/functors
Finally:
make desktop_parser
make shared_native_lib
make desktop_client
Invoke the interpreter and type 'list contacts with peter !' (notice the exclamation mark at the end); you'll get a different analysis but the same result as previously without mood handling. At the same time, not ending the sentence with an exclamation mark now results in an error.
Let's extend the example use case with a statement and a question:
list contacts !
list contacts with peter !
call peter !
call the first/second/third/fourth/.../last !
today is peter's birthday .
when is peter's birthday ?
As usual, before being able to ask about a fact, we need to have the piece of information that is necessary to answer the question. So let's begin with handling statements. The first thing that needs to be enhanced is the morphological analyser. There are some new words we need to add: two new nouns ('today', 'birthday'), a 3rd person genitive case for CONs, a question word ('when') and an auxiliary ('is').
Lexicon for nouns (engnoun.lexc):
!!!engnoun.lexc!!!
Multichar_Symbols [stem] [Guess] +N +Sg +Pl +CON +3sg +GEN
LEXICON Root
Noun ;
LEXICON Noun
[Guess] Constant;
contact Ninf;
name Ninf;
first Ninf;
last Ninf;
one Ninf;
birthday Ninf;
today Ninf;
LEXICON Constant
[stem]+CON:0 #;
[stem]+CON+3sg+GEN:^'s #;
LEXICON Ninf
[stem]+N+Sg:0 #;
[stem]+N+Pl:^s #;
Lexicon for the rest (english.lexc):
!!!english.lexc!!!
Multichar_Symbols [stem] +V +Sg +Pl +PREP +DET +PRON +3sg +Aux +wh
LEXICON Root
Verb ;
Preposition ;
Determiner ;
Pronoun ;
LEXICON Verb
call Vinf;
list Vinf;
be[stem]+V+3sg+Aux:is #;
LEXICON Vinf
[stem]+V:0 #;
LEXICON Preposition
with Pinf;
LEXICON Pinf
[stem]+PREP:0 #;
LEXICON Determiner
the[stem]+DET:the #;
LEXICON Pronoun
when Pron;
LEXICON Pron
[stem]+PRON+wh:0 #;
Rebuild the morphological analyser by issuing:
make desktop_fst DESKTOPFOMAPATH=build/hi_desktop/english.foma DESKTOPLEXCFILES=build/hi_desktop DESKTOPFSTNAME=english.fst
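Optionally, you can sanity-check the rebuilt analyser with foma's flookup before wiring it into the parser. This is only a sketch: it assumes the fst ends up at build/hi_desktop/english.fst (adjust the path to wherever your make target writes it), and the tag string in the comment is merely what the lexc entries above suggest, not verified output:
echo "birthdays" | flookup build/hi_desktop/english.fst
# expected to print something along the lines of: birthdays	birthday[stem]+N+Pl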
Concerning the sql content, the changes are as follows:
insert into SETTINGS values('indicative_mood_tag','indicative');
insert into SYMBOLS values('3sg', 'ENG', '3rd person singular');
insert into SYMBOLS values('Aux', 'ENG', 'Auxiliary');
insert into SYMBOLS values('GEN', 'ENG', 'Genitive');
insert into SYMBOLS values('FullStop', 'ENG', 'Full stop');
insert into SYMBOLS values('t_ENG_CON_3sg','ENG',NULL);
insert into SYMBOLS values('t_ENG_CON_GEN','ENG',NULL);
insert into SYMBOLS values('t_ENG_V_3sg','ENG',NULL);
insert into SYMBOLS values('t_ENG_V_Aux','ENG',NULL);
insert into SYMBOLS values('t_ENG_Punct_FullStop','ENG',NULL);
insert into SYMBOLS values('ENG_V_lfea_Aux','ENG',NULL);
insert into SYMBOLS values('ENG_V_lfea_3sg','ENG',NULL);
insert into SYMBOLS values('ENG_V_Aux','ENG',NULL);
insert into SYMBOLS values('ENG_VP_Ind','ENG',NULL);
insert into SYMBOLS values('ENG_Vbar2','ENG',NULL);
insert into SYMBOLS values('ENG_Punct_FullStop','ENG',NULL);
insert into SYMBOLS values('ENG_Punct_Ind','ENG',NULL);
insert into SYMBOLS values('ENG_CON_lfea_3sg','ENG',NULL);
insert into SYMBOLS values('ENG_CON_lfea_GEN','ENG',NULL);
insert into SYMBOLS values('ENG_CON_GEN','ENG',NULL);
insert into SYMBOLS values('ENG_CON_3sgGEN','ENG',NULL);
insert into GCAT values('CON', '3sg', 'ENG', '1',NULL,NULL);
insert into GCAT values('CON', 'GEN', 'ENG', '1',NULL,NULL);
insert into GCAT values('V', '3sg', 'ENG', '1',NULL,NULL);
insert into GCAT values('V', 'Aux', 'ENG', '1',NULL,NULL);
insert into GCAT values('Punct', 'Stem', 'ENG', '1',NULL,NULL);/*This may be duplicate if you added the changes from the section Punctuation and mood*/
insert into GCAT values('Punct', 'FullStop', 'ENG', '1',NULL,NULL);
insert into FUNCTOR_DEFS values('TODAYENGN_1', 'js', '1', 'todayengn_1.js');
insert into FUNCTOR_DEFS values('BEENGV_1', 'js', '1', 'beengv_1.js');
insert into FUNCTOR_DEFS values('DATEENGN_1', 'js', '1', 'dateengn_1.js');
insert into FUNCTOR_DEFS values('BIRTHDAYENGN_1', 'js', '1', 'birthdayengn_1.js');
insert into FUNCTORS values('TODAYENGN', '1', 'TODAYENGN_1');
insert into FUNCTORS values('BIRTHDAYENGN', '1', 'BIRTHDAYENGN_1');
insert into FUNCTORS values('BEENGV', '1', 'BEENGV_1');
insert into FUNCTORS values('DATEENGN', '1', 'DATEENGN_1');
insert into FUNCTOR_TAGS values('BEENGV', '1', 'indicative', '1', 'mood', 'indicative');
insert into FUNCTOR_TAGS values('BIRTHDAYENGN', '1', 'qword', '1', 'qword', 'when');
insert into LEXICON values('today', 'ENG', 'N', 'TODAYENGN');
insert into LEXICON values('birthday', 'ENG', 'N', 'BIRTHDAYENGN');
insert into LEXICON values('be', 'ENG', 'V', 'BEENGV');
/*Statements related rules{*/
insert into GRAMMAR values('ENG','ENG_CON_lfea_3sg','t_ENG_CON_3sg',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_CON_lfea_GEN','t_ENG_CON_GEN',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_CON_3sgGEN','ENG_CON_lfea_3sg','ENG_CON_lfea_GEN',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_CON_GEN','ENG_CON','ENG_CON_3sgGEN',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_V_lfea_Aux','t_ENG_V_Aux',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_V_lfea_3sg','t_ENG_V_3sg',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_V','ENG_V_Stem','ENG_V_lfea_3sg',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_V_Aux','ENG_V','ENG_V_lfea_Aux',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Punct_Stem','t_ENG_Punct_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','S','ENG_VP_Ind','ENG_Punct_Ind',NULL,
'"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(main_node,"main_verb","indicative",true);
std::string parent_symbol="S";
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"');
insert into GRAMMAR values('ENG','ENG_Vbar2','ENG_N_Sg','ENG_V_Aux',NULL,
'"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(dependent_node,"V",std::string("main_verb"));
std::string parent_symbol="ENG_Vbar2";
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"');
insert into GRAMMAR values('ENG','ENG_VP_Ind','ENG_Vbar2','ENG_NP',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_NP','ENG_CON_GEN','ENG_N',NULL,
'"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(dependent_node,"N",std::string("qword"));
std::string parent_symbol="ENG_NP";
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"');
insert into GRAMMAR values('ENG','ENG_Punct_FullStop','t_ENG_Punct_FullStop',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Punct_Ind','ENG_Punct_Stem','ENG_Punct_FullStop',NULL,NULL);
/*}Statements related rules*/
insert into DEPOLEX values('TODAYENGN', '1', '1', NULL, NULL, NULL, NULL, NULL, NULL);
insert into DEPOLEX values('DATEENGN', '1', '1', '1', NULL, NULL, '0', 'TODAYENGN', '1');
insert into DEPOLEX values('BEENGV', '1', '1', NULL, NULL, NULL, '0', 'BIRTHDAYENGN', '1');
insert into DEPOLEX values('BIRTHDAYENGN', '1', '1', NULL, '2', '2', '0', 'DATEENGN', '1');
insert into DEPOLEX values('BIRTHDAYENGN', '1', '2', NULL, NULL, NULL, '0', 'CON', '1');
/*begin rules for statements*/
insert into RULE_TO_RULE_MAP values( 'ENG_NP', 'ENG_CON_GEN', 'ENG_N', '1', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'CON', NULL, 'H', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP_Ind', 'ENG_Vbar2', 'ENG_NP', '1', NULL, '2', 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP_Ind', 'ENG_Vbar2', 'ENG_NP', '2', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'N', NULL, 'H', NULL, NULL, 'ENG');
/*end rules for statements*/
With these changes the interpreter is able to handle the statement 'today is peter's birthday .' What remains is to implement the new functors. But now, as we have to handle different moods, e.g. in the case of the auxiliary 'is' which appears in a statement (indicative mood), its functor has to react accordingly. Tagging comes in handy in this case (see the Classification and tagging section). As you can see in the sql content, we added a new entry to the SETTINGS table for 'indicative_mood_tag' with the value 'indicative' and another entry to the FUNCTOR_TAGS table that assigns the 'indicative' tag to the lexeme of the auxiliary 'BEENGV' with the key:value pair 'mood':'indicative'. The action handler of the rule 'S: ENG_VP_Ind ENG_Punct_Ind' contains a line that takes care of adding that tag to the main verb:
sparser->add_feature_to_leaf(main_node,"main_verb","indicative",true);
As mentioned earlier, the last parameter is a flag that indicates whether the tag is meant to be global for all the functors or not. As we want all the functors to be able to handle the 'mood' tag, it's set to true. There's also another entry added to the FUNCTOR_TAGS table that assigns the tag 'qword' to the lexeme 'BIRTHDAYENGN' with the key:value pair 'qword':'when'. The action handler of the rule 'ENG_NP: ENG_CON_GEN ENG_N' contains a line that takes care of adding that feature to the noun:
sparser->add_feature_to_leaf(dependent_node,"N",std::string("qword"));
Let's see what the functor implementations look like. The functor implementation of TODAYENGN_1 (todayengn_1.js):
const {execFile}=require('node:child_process');
let tags="";
let date="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('_tags')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
tags=JSON.parse(arguments[i+2]);
break;
}
}
}
if(tags.mood==='interrogative'){
/*Not part of use case*/
}
else if(tags.mood==='indicative'){
/*Get today's date*/
let dateObj=new Date();
let year=dateObj.getFullYear();
let month=dateObj.getMonth()+1;
if(month<10) month="0"+month;
let day=dateObj.getDate();
if(day<10) day="0"+day;
date=year+"-"+month+"-"+day;
}
else if(tags.mood==='imperative'){
/*Not part of use case*/
}
else{
/*Error*/
}
return date;
The functor implementation of BEENGV_1 (beengv_1.js):
let tags="";
for(let i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('_tags')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
tags=JSON.parse(arguments[i+2]);
break;
}
}
}
if(tags.mood=='interrogative'){
/*Do nothing*/
}
else if(tags.mood=='indicative'){
/*Do nothing*/
}
else if(tags.mood=='imperative'){
/*Not part of use case*/
}
else{
/*Error*/
}
return;
The functor implementation of DATEENGN_1 (dateengn_1.js):
let tags="";
let date="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('_tags')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
tags=JSON.parse(arguments[i+2]);
break;
}
}
else if(parameterList[i].indexOf('_out')>-1){
/*Get the result of the called functor (from depolex we know though that there's only one)*/
date=arguments[i+2];
}
}
if(tags.mood==='interrogative'){
/*Not part of use case*/
}
else if(tags.mood==='indicative'){
/*Nothing to do, date has already been stored, just pass it on*/
}
else if(tags.mood==='imperative'){
/*Not part of use case*/
}
else{
/*Error*/
}
return date;
The functor implementation of BIRTHDAYENGN_1 (birthdayengn_1.js):
const {execFile}=require('node:child_process');
let tags="";
let date="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('_tags')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
tags=JSON.parse(arguments[i+2]);
break;
}
}
else if(parameterList[i].indexOf('DATEENGN_')>-1){
date=arguments[i+2];
}
}
if(tags.mood==='interrogative'){
/*Not part of use case*/
}
else if(tags.mood==='indicative'){
/*get rid of year*/
let dateObj=new Date(date);
let month=dateObj.getMonth()+1;
if(month<10) month="0"+month;
let day=dateObj.getDate();
if(day<10) day="0"+day;
date=month+"-"+day;
execFile('./hi',['-c','hi_desktop/m1.db',analysis_deps,functionName,date],(error, stdout, stderr) => {
console.log(`stdout: ${stdout}`);
if (error) {
console.error(`exec error: ${error}`);
return;
}
console.error(`stderr: ${stderr}`);
});
}
else if(tags.mood==='imperative'){
/*Not part of use case*/
}
else{
/*Error*/
}
return date;
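All four functor bodies above repeat the same scan over parameterList to find the '_tags' parameter and, where applicable, the value of a dependency; the mechanism itself is explained in the paragraph below. Purely as an illustration -this helper is neither part of the framework nor of the use case- the scan could be factored out along these lines:
/*Hypothetical helper, not part of the framework or the use case*/
/*It factors out the parameterList scan repeated in the functor bodies above*/
/*Dependency results identified by a functor name prefix (as 'DATEENGN_' in birthdayengn_1.js) would still need their own check*/
function lookupFunctorParameters(parameterList,args){
let tags={};
let outputs=[];
for(let i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('_tags')>-1){
if(typeof args[i+2]!=='undefined'&&args[i+2].length>0){
/*Same i+2 offset as in the generated functor wrappers*/
tags=JSON.parse(args[i+2]);
}
}
else if(parameterList[i].indexOf('_out')>-1){
if(typeof args[i+2]!=='undefined') outputs.push(args[i+2]);
}
}
return {tags:tags,outputs:outputs};
}
/*Usage inside a functor body: const {tags,outputs}=lookupFunctorParameters(parameterList,arguments);*/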
As the indicative mood tag is added globally, it is available to all functors as a parameter whose generated name is the function name suffixed by '_tags'. That's the same parameter that otherwise holds the functor-specific tags. After looking up that parameter, we can implement checks for the different values of the mood tag. Handling the mood in certain functors like TODAYENGN_1 serves a demonstrational purpose only rather than having any significance, since the value of 'today' will most probably always be converted to a date independent of the mood, while in case of the auxiliary the interrogative handling will be added later. Besides, in the indicative mood there's not much to do except calculating a certain value and storing it in the db. Without calculated values, answering a question like 'when is peter's birthday ?' would end up in the answer 'today is peter's birthday .' which is probably undesirable. This is what we handle for 'today': we get the date for today and return it so that the DATEENGN_1 functor can handle it; that functor returns the date to BIRTHDAYENGN_1, which calls the client program (could be an api call) with the appropriate option to store the calculated value for the functor. The arguments passed to the client program are mapped to the following parameters of the underlying hi_state_cvalue() function:
- database file path
- semantic dependencies (provided as global variable called analysis_deps)
- function name (provided as incoming parameter called functionName)
- calculated value
In the functor implementation we call the desktop client, which is prepared to handle the '-c' option and calls the hi_state_cvalue() function, as that is easier than calling the function of the C/C++ library directly from node js. Build the parser as usual:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
If you saved the action snippets in files in build/hi_desktop:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPACTIONSNIPPETS=build/hi_desktop/ DESKTOPFUNCTORPATH=build/hi_desktop/functors
else
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=build/hi_desktop/functors
Finally:
make desktop_parser
make shared_native_lib
make desktop_client
If you start the interpreter then type 'today is peter's birthday .' you'll get the usual analysis. To check if the calculated value and the tag got stored indeed, we need to check the db since we cannot ask yet 'when is peter's birthday ?' Issuing the following command in the project directory:
echo "select * from analyses_deps;"|sqlite hi_desktop/m1.db
will yield some entries from the ANALYSES_DEPS table:
test|1673267033|today is peter's birthday .|1|1|indicative|BEENGV_1_4|0|0|is||0|0|BEENGV|1|{}|
test|1673267033|today is peter's birthday .|1|1|indicative|BIRTHDAYENGN_1_15|1|1|birthday|BEENGV|1|1|BIRTHDAYENGN|1|{"qword":"when"}|01-09
test|1673267033|today is peter's birthday .|1|1|indicative|DATEENGN_1_15_1|2|2||BIRTHDAYENGN|1|1|DATEENGN|1|{}|
test|1673267033|today is peter's birthday .|1|1|indicative|TODAYENGN_1_1|3|3|today|DATEENGN|1|1|TODAYENGN|1|{}|
test|1673267033|today is peter's birthday .|1|1|indicative|CON_10|4|2|peter's|BIRTHDAYENGN|1|2|CON|1|{}|
The relevant line is the one with the function name 'BIRTHDAYENGN_1_15', which contains today's date in the last field (called c_value) and the tag "qword":"when" in the second to last field (called tags). These ensure that the word 'birthday' or a calculated value (if available) is considered a valid answer in the model for the question word 'when'.
If you check the analyses now, there is a new property called 'analyses_deps'. It is the semantics itself broken down into searchable entries (keeping the hierarchy of the parse tree) exactly as they are inserted into the ANALYSES_DEPS db table. The point is that each analysis (successful and failed ones alike) is stored in the ANALYSES or FAILED_ANALYSES tables, but those would be difficult to use for searching, so the semantics of the successful analyses are additionally stored in a structured form in ANALYSES_DEPS, using the key fields of the ANALYSES table as a prefix in addition to its own keys.
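If you prefer to inspect those entries from node rather than from the shell, a minimal sketch could look like the one below; it assumes the sqlite3 command line tool is installed and that the database is reachable at hi_desktop/m1.db (adjust the path to your build):
const {execFile}=require('node:child_process');
/*Minimal sketch: dump the stored semantic dependencies via the sqlite3 CLI*/
execFile('sqlite3',['hi_desktop/m1.db','select * from analyses_deps;'],(error, stdout, stderr)=>{
if(error){
console.error(`exec error: ${error}`);
return;
}
console.log(stdout);
});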
Let's move on to handling questions. As we already added the word 'when' to the morphological analyser at the beginning of this section, there's nothing to do with that part. What needs to be extended is the database content:
insert into SETTINGS values('interrogative_mood_tag','interrogative');
insert into SYMBOLS values('wh', 'ENG', 'question word');
insert into SYMBOLS values('PRON', 'ENG', 'Pronoun');
insert into SYMBOLS values('QuestionMark', 'ENG', 'Question mark');
insert into SYMBOLS values('t_ENG_PRON_Stem','ENG',NULL);
insert into SYMBOLS values('t_ENG_PRON_wh','ENG',NULL);
insert into SYMBOLS values('t_ENG_Punct_QuestionMark','ENG',NULL);
insert into SYMBOLS values('ENG_PRON_Stem','ENG',NULL);
insert into SYMBOLS values('ENG_PRON_lfea_wh','ENG',NULL);
insert into SYMBOLS values('ENG_PRON_qw','ENG',NULL);
insert into SYMBOLS values('ENG_VP_Int','ENG',NULL);
insert into SYMBOLS values('ENG_Vbar3','ENG',NULL);
insert into SYMBOLS values('ENG_Punct_QuestionMark','ENG',NULL);
insert into SYMBOLS values('ENG_Punct_Int','ENG',NULL);
insert into GCAT values('PRON', 'Stem', 'ENG', '1',NULL,NULL);
insert into GCAT values('PRON', 'wh', 'ENG', '1',NULL,NULL);
insert into GCAT values('Punct', 'QuestionMark', 'ENG', '1',NULL,NULL);
insert into FUNCTORS values('WHENENGPRON', '1', NULL);
insert into FUNCTOR_TAGS values('BEENGV', '1', 'main_verb', '1', 'is_root', 'true');
insert into FUNCTOR_TAGS values('BEENGV', '1', 'interrogative', '1', 'mood', 'interrogative');
insert into FUNCTOR_TAGS values('WHENENGPRON', '1', 'qword', '1', 'is_qword', 'true');
insert into LEXICON values('when', 'ENG', 'PRON', 'WHENENGPRON');
/*Questions related rules{*/
insert into GRAMMAR values('ENG','S','ENG_VP_Int','ENG_Punct_Int',NULL,
'"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(main_node,"main_verb","interrogative",true);
std::string parent_symbol="S";
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"');
insert into GRAMMAR values('ENG','ENG_Punct_QuestionMark','t_ENG_Punct_QuestionMark',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Punct_Int','ENG_Punct_Stem','ENG_Punct_QuestionMark',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_PRON_Stem','t_ENG_PRON_Stem',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_PRON_lfea_wh','t_ENG_PRON_wh',NULL,NULL,NULL);
insert into GRAMMAR values('ENG','ENG_PRON_qw','ENG_PRON_Stem','ENG_PRON_lfea_wh',NULL,NULL);
insert into GRAMMAR values('ENG','ENG_Vbar3','ENG_PRON_qw','ENG_V_Aux',NULL,
'"const node_info& main_node=sparser->get_node_info($1);
const node_info& dependent_node=sparser->get_node_info($2);
sparser->add_feature_to_leaf(dependent_node,"main_verb");
sparser->add_feature_to_leaf(main_node,"PRON",std::string("qword"));
std::string parent_symbol="ENG_Vbar3";
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,parent_symbol+"->"+main_node.symbol+" "+dependent_node.symbol);
$$=sparser->combine_nodes(parent_symbol,main_node,dependent_node);"');
insert into GRAMMAR values('ENG','ENG_VP_Int','ENG_Vbar3','ENG_NP',NULL,NULL);
/*}Questions related rules*/
insert into DEPOLEX values('BEENGV', '1', '1', NULL, '2', '2', '0', 'WHENENGPRON', '1');
insert into DEPOLEX values('BEENGV', '1', '2', NULL, NULL, NULL, '0', 'BIRTHDAYENGN', '1');
insert into DEPOLEX values('WHENENGPRON', '1', '1', NULL, NULL, NULL, NULL, NULL, NULL);
/*begin rules for questions*/
insert into RULE_TO_RULE_MAP values( 'ENG_Vbar3', 'ENG_PRON_qw', 'ENG_V_Aux', '1', NULL, NULL, 'V', NULL, 'N', NULL, NULL, 'PRON', NULL, 'H', NULL, NULL, 'ENG');
insert into RULE_TO_RULE_MAP values( 'ENG_VP_Int', 'ENG_Vbar3', 'ENG_NP', '1', NULL, NULL, 'V', NULL, 'H', NULL, NULL, 'N', NULL, 'N', NULL, NULL, 'ENG');
/*end rules for questions*/
Now the interpreter is able to handle the question: 'when is peter's birthday ?' To be able to handle questions, a new entry was added to the SETTINGS table for 'interrogative_mood_tag' with the value 'interrogative' and some new entries to the FUNCTOR_TAGS table:
- when BEENGV has the 'main_verb' feature, it is tagged with is_root:true
- when BEENGV has the 'interrogative' feature, it is tagged with mood:interrogative
- when WHENENGPRON has the 'qword' feature, it is tagged with is_qword:true
The keys 'mood', 'is_root' and 'is_qword' are currently hardcoded but will be added to the SETTINGS table later. These keys are used by the api function hi_query() which will also be explained later. The action handler of the rule 'S: ENG_VP_Int ENG_Punct_Int' also must contain a line that takes care of adding the 'interrogative' tag to the main verb:
sparser->add_feature_to_leaf(main_node,"main_verb","interrogative",true);
The other entry added to the FUNCTOR_TAGS table, which assigns the key:value pair 'is_qword':'true' to the lexeme 'WHENENGPRON' when it carries the 'qword' tag, is triggered by the action handler of the rule 'ENG_Vbar3: ENG_PRON_qw ENG_V_Aux':
sparser->add_feature_to_leaf(main_node,"PRON",std::string("qword"));
We also need to adjust the functor BEENGV_1 to handle the interrogative mood properly:
const {execFile}=require('node:child_process');
let tags="";
for(let i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('_tags')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
tags=JSON.parse(arguments[i+2]);
break;
}
}
}
if(tags.mood==='interrogative'){
let jsonDoc=JSON.parse(analysis_deps);
execFile('./hi',['-q','hi_desktop/m1.db','indicative',analysis_deps],(error, stdout, stderr) => {
console.log(`stdout: ${stdout}`);
if (error) {
console.error(`exec error: ${error}`);
return;
}
console.error(`stderr: ${stderr}`);
});
}
else if(tags.mood==='indicative'){
/*Do nothing*/
}
else if(tags.mood==='imperative'){
/*Not part of use case*/
}
else{
/*Error*/
}
return;
In case of interrogative mood, that functor calls the client program (could be an api call) using the appropriate option to return a ranked list of previous analyses from the ANALYSES_DEPS table that best match the query. The arguments passed to the client program are mapped to the following parameters of the underlying hi_query() function:
- database file path
- mood
- semantic dependencies (provided as global variable called analysis_deps)
In the functor implementation we call the desktop client, which is prepared to handle the '-q' option and calls the hi_query() function, as that is easier than calling the function of the C/C++ library directly from node js. The query is generated from the semantic dependencies (here, the analysis_deps variable). Usually it is enough to use the generated semantic dependencies but you can modify their contents or even craft a query manually. It is currently a json object with the following structure:
{
"dependencies":[
{
"dependency": "",
"ref_d_key": 0,
"c_value": "",
"word": "",
"tags": {}
},
...
]
}
The query is assembled the following way by hi_query():
- checks the root lexeme using the 'is_root':'true' functor tag (remember that the main verb was tagged as root)
- collects question words using the 'is_qword':'true' functor tag
- collects calculated values from the c_value fields of dependency entries (if a dependency is a CON then it is also treated as a calculated value)
- takes the incoming mood argument into account
If there are no question words, the root lexeme (which is the main verb) is used for finding an answer, which e.g. could be the case with 'is file abc executable ?' The query tries to find all analyses in ANALYSES_DEPS that match the most criteria, like mood, the lexemes (dependencies) tagged with the question words and the calculated values collected.
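As mentioned, instead of passing on the generated analysis_deps unchanged, a functor could also craft or adjust such a query object itself. The sketch below only illustrates the shape: the concrete dependency entries, tags and mood are assumptions based on the example model, so they would need to be adapted to whatever your model actually stores:
const {execFile}=require('node:child_process');
/*Hypothetical, hand-crafted query following the json structure above; values are illustrative only*/
let query={
"dependencies":[
{"dependency":"BEENGV","ref_d_key":1,"c_value":"","word":"is","tags":{"is_root":"true"}},
{"dependency":"BIRTHDAYENGN","ref_d_key":1,"c_value":"","word":"birthday","tags":{"qword":"when"}}
]
};
/*Same '-q' call as in beengv_1.js, but with the crafted query instead of analysis_deps*/
execFile('./hi',['-q','hi_desktop/m1.db','indicative',JSON.stringify(query)],(error, stdout, stderr)=>{
console.log(`stdout: ${stdout}`);
if(error){
console.error(`exec error: ${error}`);
return;
}
console.error(`stderr: ${stderr}`);
});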
Build the parser as usual:
make desktop_parser_db NATIVEPARSERDBNAME=m1.db NATIVEPARSERDBCONTENT=build/hi_desktop/m1content.sql
If you saved the action snippets in files in build/hi_desktop:
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPACTIONSNIPPETS=build/hi_desktop/ DESKTOPFUNCTORPATH=build/hi_desktop/functors
else
make desktop_bison_parser NATIVEPARSERDBNAME=m1.db DESKTOPFUNCTORPATH=build/hi_desktop/functors
Finally:
make desktop_parser
make shared_native_lib
make desktop_client
If you start the interpreter and type 'when is peter's birthday ?' you'll get the usual analysis and the answer in the form of month and day, calculated on the day the statement was entered. In case of our example, it is: 01-09.
In real life scenarios, not all sentences are syntactically correct. The framework is capable of handling syntactically incorrect sentences if you ask it not to carry out syntactic checks (e.g. after a failed syntactic analysis, giving it a second chance) but relying solely on semantic parsing is not a good idea for two reasons:
- The dependency chains (or graph) set up in the DEPOLEX table may contain dependencies that are required by more than one other dependency. If there's no syntax to sort out which dependency belongs where (see the RULE_TO_RULE_MAP table which maps each syntactic rule to a semantic one), it is impossible to get the right interpretation even in such a simple case as "Peter sent Mary a letter.", which without syntax could just as well be read as "A letter sent Peter Mary." There are several possibilities in such a case and the interpreter won't be able to figure out without syntax which of them is the right interpretation, but it will still give an analysis that makes sense according to the dependency chains.
- The semantic parser will try to put each dependency in its place in the dependency chain but does not throw an error if not all of them have a slot there. It is implemented this way because if the syntactic analysis is requested, it ensures that the sentence is syntactically correct and, according to the mapping in RULE_TO_RULE_MAP, the appropriate dependencies are put in their place. Without checking the syntax, however, sentences are interpreted more loosely. For example, if the model is set up to handle 'Call Peter!' but not "Let's call Peter!" then the analysis of the latter will fail. Interpreting the same sentence without syntax checks will nevertheless succeed, as 'Peter' will be accepted as a dependency of 'call' just as in the syntactically correct sentence, but as the "let's" part could not be analysed and nothing is set up for it in DEPOLEX, it is simply ignored. This works fine in such cases but may have unwanted consequences in others.
In order to disable the syntactic analysis, the only thing you need to do is change the type of analyses (toa) variable in the client (hi.cpp) to request only morphological (HI_MORPHOLOGY) and semantic (HI_SEMANTICS) analyses but no syntactic (HI_SYNTAX) one, and call the api function:
toa=HI_MORPHOLOGY|HI_SEMANTICS;
analyses=hi(text.c_str(),"ENG",toa,language.c_str(),"hi_desktop/m1.db","test",crh);
In this section we won't change the model created in the tutorial but will take the desktop example delivered with the repository as a case study. So if you already have something in the default build directory, you either need to clean it up:
make clean
or use a different build directory by always specifying it for make using the BUILDDIR parameter. Building the delivered desktop example is easy:
make desktop_fst
make desktop_parser_db
make desktop_bison_parser
make desktop_parser
make shared_native_lib
make desktop_client
As a result you have everything necessary in the build directory. Let's see the sentences used for demonstration:
list symlinked directories !
list executable directories !
list empty directories !
list symlinked and executable directories !
list symlinked and not executable directories !
list symlinked and executable or not executable and empty directories !
list symlinked and executable or not executable or empty directories !
However, to be able to start examining the sentences, we need to prepare some directories first. Enter the build directory and issue the following:
mkdir abc
ln -s abc def
mkdir ghi
chmod ugo-x ghi
Now you have an empty directory (abc) that is symlinked (def) and another empty directory (ghi) that is not symlinked but whose executable flag is cleared. Start the interpreter and check the results of the first three, simple sentences.
The sentence 'list symlinked directories !' results in (analyses not shown for now):
find -L ./ -type d |grep "$(find ./ -type l)"
./def
The sentence 'list executable directories !' results in (analyses not shown for now):
find ./ -type d -perm -111
./
./hi_desktop
./abc
The sentence 'list empty directories !' results in (analyses not shown for now):
find ./ -type d -empty
./abc
./ghi
This is the state of affairs in the build directory and all the results are correct. Let's check the results of the next two sentences.
The sentence 'list symlinked and executable directories !' results in (analyses not shown for now):
find -L ./ -type d -perm -111|grep "$(find ./ -type l)"
./def
The sentence 'list symlinked and not executable directories !' results in (analyses not shown for now):
find -L ./ -type d ! -perm -111|grep "$(find ./ -type l)"
These results are also fine as the only symlinked and executable directory we have is 'def' and we have no directory that is symlinked but not executable. Let's continue with the next sentence which is a bit more complicated.
The sentence 'list symlinked and executable or not executable and empty directories !' results in (analyses not shown for now):
find -L ./ -type d -perm -111|grep "$(find ./ -type l)"
./def
&! -perm -111&-empty
! -perm -111
-empty
find ./ -type d ! -perm -111 -empty
./ghi
Some debug info is printed as well, which reveals some internals: as you can see, the command is executed in two parts. That is not strictly necessary, the functors are just implemented that way. If you check the result, the directories 'def' and 'ghi' are returned, which is correct since 'def' is symlinked and executable while 'ghi' is not executable and empty.
The sentence 'list symlinked and executable or not executable or empty directories !' results in (analyses not shown for now):
find -L ./ -type d -perm -111|grep "$(find ./ -type l)"
./def
&! -perm -111
! -perm -111
find ./ -type d ! -perm -111
./ghi
&-empty
-empty
find ./ -type d -empty
./abc
./ghi
This one also gives a correct result as 'def' is symlinked and executable, 'ghi' is not executable, while 'abc' and 'ghi' are both empty. The directory 'ghi' appears in the result twice since it satisfies two criteria (not executable and empty), which is due to the way the functors are implemented.
Now, let's see how the framework interprets these sentences and handles logical operators.
...
In order to be able to handle relative clauses, the following needs to be set up:
- morphemes to be handled as e.g. relative pronouns must be tagged with the 'Relative' feature in the lexc files of the morphological analyser (foma)
- in the appropriate rule of the RULE_TO_RULE_MAP table, to set a reference to the main node, the dependent node must have the 'Relative' feature, which results in handling the dependent node as a representative of the referenced main node
- the lexeme of the dependent node having the 'Relative' feature must have an entry in the DEPOLEX table without any dependencies
- the dependencies that the lexeme with the 'Relative' feature may have must be added to the referenced (see bullet 2.) main node's lexeme in the DEPOLEX table
- the relative clause verb (RCV) is usually assigned an RCV feature during the syntactic analysis in order to check for it in RULE_TO_RULE_MAP as a follow-up of step 2) and combine it with the reference node of step 2)
To handle empty terminals, the '%empty' directive of bison comes in handy. This is not supported conveniently yet, but at least it's possible if you add an entry to the db content like:
insert into SYMBOLS values('%empty','ENG',NULL);
This can be handled according to the example below, which assumes that a verb is missing from the position where it would normally appear:
insert into GRAMMAR values('ENG','ENG_V','%empty',NULL,NULL,
'"lexicon empty;
std::string symbol="t_ENG_V_Stem";
auto&& symbol_token_map_entry=symbol_token_map.find(symbol);
empty.word="";
empty.gcat="V";
empty.lexeme="?";
empty.dependencies=lex->dependencies_read_for_functor("V");
empty.morphalytics=new morphan_result(empty.word,"ENG","V");
empty.token=symbol_token_map_entry->second;
empty.tokens.clear();
empty.tokens.push_back(empty.token);
empty.lexicon_entry=false;
logger::singleton()==NULL?(void)0:logger::singleton()->log(0,"ENG_V->%empty");
$$=sparser->set_node_info("ENG_V",empty);"');
As mentioned, this is not supported well but it works and will be improved.
Adjusting a model to be used on Android does not take much: the functor implementations need to be done in javascript and the target language must be changed in the FUNCTOR_DEFS table from 'sh' to 'js'. The Android client applies the same logic for passing parameters when calling functors, so looking up the incoming parameters inside a functor is done the same way (though the way a parameter is identified in the examples below is a bit different, it could have been done the same way as in the shell scripts). The only difference is that in javascript we can use the 'return' statement to pass back the functor output. Here are the examples for the same functors (which can also be found in the project subdirectory hi_android/functors):
withengprep_1.js:
contact="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('CON_')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
con=JSON.parse(arguments[i+2]);
if(contact.length>0)contact=contact+" "+con.stem;
else contact=con.stem;
}
}
}
return contact;
contactengn_1.js:
contact="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('WITHENGPREP_1')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
contact=arguments[i+2];
break;
}
}
}
return contact;
listengv_1.js:
contact="";
for(i=0;i<parameterList.length;++i){
if(parameterList[i].indexOf('CONTACTENGN_1')>-1){
if(typeof arguments[i+2]!=='undefined'&&arguments[i+2].length>0){
contact=arguments[i+2];
break;
}
}
}
if(contact) Android.fetchContacts(contact);
else Android.fetchContacts("");
One major technical issue on Android is that if we want to access any system functionality, it needs to be done via a javascript interface or a web message. Such a javascript interface call happens in listengv_1.js as Android.fetchContacts() above, but the source in the repository now uses a web message. Everything is prepared in the Android example delivered with the project for this call, so once the functor implementations are adapted, the Android specific make targets need to be called, the generated artifacts (db, fst and shared library) need to be copied to replace the ones delivered with the example Android project, and the Android project needs to be rebuilt.
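For reference, a web message based variant of that call could look roughly like the sketch below. The channel object is an assumption: it presumes the Android side injected a message listener object (here called 'hiChannel', the name is illustrative), e.g. via WebViewCompat.addWebMessageListener, so check the delivered Android example for the exact mechanism used there:
/*Hypothetical sketch of replacing the javascript interface call with a web message*/
/*Assumes an injected object named 'hiChannel' (illustrative name) that accepts postMessage()*/
if(typeof hiChannel!=='undefined'){
hiChannel.postMessage(JSON.stringify({"action":"fetchContacts","contact":contact}));
}
else{
/*Fall back to the javascript interface call shown above*/
Android.fetchContacts(contact);
}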
The fields of the ROOT_TYPE table include:
root_type: Currently, only 'H' (head) and 'N' (non-head) values are allowed.
The SYMBOLS table is for all kinds of symbols: terminals (including gcat features) and non-terminals. The SYMBOLS table fields are:
symbol: symbol id
lid: a language id referenced in the LANGUAGES table
description: the description of the symbol
The LANGUAGES table fields:
lid: language id
language: descriptive name of the language
head_position: an integer (0:undefined,1:head first,2:head last) indicating if the language is head first or head last. Partially implemented, set it to 1.
fst: name of the finite state transducer file generated by foma
The GCAT table is a table for terminal symbols, i.e. symbols to which bison tokens can be assigned. The GCAT table fields are:
gcat: a symbol for a grammatical category as defined in the SYMBOLS table
feature: a symbol for a linguistic feature belonging to a grammatical category as defined in the SYMBOLS table
lid: a language id referenced in the LANGUAGES table
token: NULL or '0': don't generate token in bison source, non-NULL: generate token in bison source
precedence: NULL or a precedence matching a precedence value in the PRECEDENCES table
precedence_level: NULL or greater than 0 - defines the order in which the operators of the assigned tokens are declared, which determines whose precedence is lowest. See the doc linked at the PRECEDENCES table and some examples here
The PRECEDENCES table fields are:
precedence: a precedence id, one of the following letters in apostrophes: 'L' (%left), 'R' (%right), 'P' (%precedence), 'N' (%nonassoc)
declaration: '%left', '%right', '%nonassoc' or '%precedence' as described in the Bison manual
The fields of the LEXICON table are:
word: the stem of a word
lid: a language id referenced in the GCAT table together with the grammatical category
gcat: a grammatical category referenced in the GCAT table together with the language id
lexeme: the lexeme assigned to the stem as defined in the FUNCTORS table
The fields of the FUNCTORS table are:
functor: a value matching a lexeme in the LEXICON table. There's no constraint on what the functor name shall look like but historically the concatenation of the stem, the language and the grammatical category is used -all in capitals.
d_key: a lexeme can have more than one functor definition assigned, for which the dependency key (d_key) is used to differentiate
functor_id: the id of the functor definition assigned to the functor. There's also no restriction on what such an id shall look like.
The fields of the DEPOLEX table are:
lexeme: the lexeme you specified in the LEXICON db table for a word
d_key: the dependency key determining one specific meaning
d_counter: just a counter that enables having more than one dependency
optional_parent_allowed: enables taking the dependency entry into account even if its parent was not found
d_failover: a failover dependency points to a d_counter greater than the current d_counter and is only taken into account if the current d_counter dependency check failed. NULL or 0 means end of dependency chain while pointing to itself also means end of dependency chain but without taking the failure into account.
d_successor: a successor dependency points to a d_counter greater than the current d_counter and is only executed if the current d_counter dependency check was successful
manner: specifies how the dependency is taken into account: 0 - exactly once, 1 - at least once, 2 - more than once
semantic_dependency: specifies the dependency following this dependency together with the ref_d_key field
ref_d_key: d_key of the semantic_dependency
The fields of the RULE_TO_RULE_MAP table are:
parent_symbol: The parent (left hand side) symbol in the syntactic rule mapped.
head_root_symbol: The head symbol of the right hand side symbols in the syntactic rule mapped.
non_head_root_symbol: The non-head symbol of the right hand side symbols in the syntactic rule mapped.
step: An integer (starting at 1) representing the number of the step for the same syntactic rule since a syntactic rule may be mapped to a sequence of semantic rules.
failover: smallest value: 1; NULL or 0 means end of rule chain, results in a valid combination. A failover step is only executed if the current step failed (e.g. either symbols or functors are not found). A failover step must be greater than the current step to continue evaluating the rule chain. A smaller value results in an error. Setting it to the value of the current step means "no problem" and evaluates to true, resulting in a valid combination.
successor: smallest value: 1; NULL or 0 means end of rule chain, results in a valid combination. A successor step is only executed if the current step succeeded. A successor step must be greater than the current step to continue evaluating the rule chain. A value smaller than or equal to the value of the current step results in an error.
main_node_symbol: The symbol to be looked up in the subtree.
main_node_lexeme: The lexeme to be looked up in the subtree.
main_lookup_root: The root symbol (see ROOT_TYPE table - currently H for Head or N for Non-head) that indicates which subtree to use for symbol lookup. This determines which nodes of the symbols found will be used as main node(s) regardless of the order the nodes were passed to combine_nodes().
main_lookup_subtree_symbol: A symbol denoting a subtree in the subtree denoted by main_lookup_root.
main_set_op: An integer value between 0 and 6. Operation on set of main symbols found in previous and current steps. 1: union, 2: intersection, 3: current - previous (including intersection), 4: previous - current (including intersection), 5: disjunct. The result set will be used in the current step for symbol lookup.
dependent_node_symbol: The symbol to be looked up in the subtree.
dependent_node_lexeme: The lexeme to be looked up in the subtree.
dependency_lookup_root: The root symbol (see ROOT_TYPE table - currently H for Head or N for Non-head) that indicates which subtree to use for symbol lookup. This determines which nodes of the symbols found will be used as dependent node(s) regardless of the order the nodes were passed to combine_nodes().
dependency_lookup_subtree_symbol: A symbol denoting a subtree in the subtree denoted by dependency_lookup_root.
dependent_set_op: An integer value between 0 and 6. Operation on set of dependent symbols found in previous and current steps. 1: union, 2: intersection, 3: current - previous (including intersection), 4: previous - current (including intersection), 5: disjunct. The result set will be used in the current step for symbol lookup.
lid: A language id matching a value in the LANGUAGES table.
The FUNCTOR_DEFS table fields are:
functor_id: a functor id referenced in the FUNCTORS table
tlid: translation language id - the language id to which a translation can be carried out using this functor definition. There's no constraint on it but there are only two transcriptors implemented for 'js' (javascript) and 'sh' (shell script).
imp_counter: technical field used during source generation (gensrc), set it to 1
definition: function implementation in the language specified in tlid either by specifying a file name or concrete code surrounded by double quotes
The fields of the GRAMMAR table are:
lid: A language id matching a value in the LANGUAGES table.
parent_symbol: The parent (left hand side) symbol in the syntactic rule mapped.
head_symbol: The head symbol of the right hand side symbols in the syntactic rule mapped.
non_head_symbol: The non-head symbol of the right hand side symbols in the syntactic rule mapped.
precedence: A precedence id matching a value in the PRECEDENCES table.
action: The last field in the GRAMMAR db table is called 'action'. If it does not contain anything, the bison source generator tool called gensrc will generate a predefined action code for the grammar rule. If it is maintained, it may contain either a specific implementation of the action or refer to an action snippet by the name of a file containing the implementation. If the content is in quotes then it is regarded as code, otherwise it is regarded as a file name.
The fields of the FUNCTOR_TAGS table are:
functor: a functor referenced in the FUNCTORS table
d_key: a d_key of the functor referenced in the FUNCTORS table
trigger_tag: Serves as a condition. If such a tag was created during the interpretation, it triggers taking into account (during transcription) the tag-value pairs of the entry, e.g. the grammatical mood of the verb (imperative, interrogative, indicative), since different tag-value pairs may belong to an indicative mood and an imperative mood as in case of "a directory lists files" and "list files".
counter: just a counter that enables having more than one tag entry
tag: if the trigger_tag is empty, tag-value pairs are added unconditionally
value: a value to be assigned to the tag