Enabled morpheme segmentation in the stem.
timarkh committed Jun 8, 2021
1 parent 430bed7 commit ab87860
Showing 34 changed files with 100 additions and 34 deletions.
Binary file modified docs/_build/doctrees/bad_analyses.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/clitics.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/derivations.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/examples.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/format.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/lex_rules.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/lexemes.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/paradigms.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/stem_conversions.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/usage.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 2936deac5b1d202353ba3b872ca5b597
config: f4308c907af59774330e9e5fd8b8e221
tags: 645f666f9bcd5a90fca523b33c5a78b7
19 changes: 18 additions & 1 deletion docs/_build/html/_sources/lexemes.rst.txt
@@ -20,7 +20,7 @@ The ``-lexeme`` line starts a new entry. An entry contains a number of key-value
Pre-defined fields that have to be present in each lexeme are the following:

* ``lex`` stands for lemma, or dictionary form, of the lexeme.
* ``stem`` contains a string describing the stem of the lexeme as a morpheme object (see :doc:`format overview </format>`). If there are several free variants, i.e. variants that are equally possible in any context, they may be written inside one value separated by ``//``. In the example above, the stem of the numeral *eight* in Meadow Mari can be written either as a word *кандаш* or as a numeral *8*. In both cases, affixes can only attach to the right side of it, which is why there is a dot at the right. A stem can only include the root morpheme or be a combination of a root with some (probably non-productive) derivational affixes that you want to treat as a single lexeme.
* ``stem`` contains a string describing the stem of the lexeme as a morpheme object (see :doc:`format overview </format>`). If there are several free variants, i.e. variants that are equally possible in any context, they may be written inside one value separated by ``//``. In the example above, the stem of the numeral *eight* in Meadow Mari can be written either as the word *кандаш* or as the numeral *8*. In both cases, affixes can only attach to its right side, which is why there is a dot on the right. A stem may consist of the root morpheme alone, or combine a root with some (probably non-productive) derivational affixes that you want to treat as a single lexeme (see `Morpheme segmentation in the stem`_ below).
* ``gramm`` contains tags separated by a comma. Normally tags for a lexeme would include its part of speech (``NUM`` in this case) and, possibly, some dictionary categories such as gender / noun class for nouns or transitivity for verbs.
* ``paradigm`` is a link to the inflectional paradigm for this lexeme, which describes how forms of this lexeme can be produced from its stem(s). Even if the lexeme can not be inflected (e.g. it's a conjunction), there has to be a link, which should in this case lead to a paradigm with a single empty affix. The value must be a name of a paradigm listed in :doc:`paradigms.txt </paradigms>`. There may be multiple paradigm links specified by multiple ``paradigm`` keys.
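
To make the ``//`` convention concrete, here is a minimal Python sketch of how a ``stem`` value could be split into its free variants. It is an illustration only, not the parser used by ``uniparser-morph``: the helper names ``split_free_variants`` and ``attaches_on_right`` are hypothetical, and the sample value ``кандаш.//8.`` is reconstructed from the description above rather than copied from the actual example.

```python
from typing import List


def split_free_variants(stem_value: str) -> List[str]:
    """Split a ``stem`` value into free variants separated by ``//``.

    Hypothetical helper for illustration; not part of uniparser-morph.
    """
    return [variant.strip() for variant in stem_value.split("//") if variant.strip()]


def attaches_on_right(variant: str) -> bool:
    """Return True if the variant ends in a dot, i.e. affixes attach on the right."""
    return variant.endswith(".")


# Sample value reconstructed from the prose description (an assumption).
variants = split_free_variants("кандаш.//8.")
for variant in variants:
    print(variant, attaches_on_right(variant))
```

Both variants end in a dot, so in this sketch both are treated as accepting affixes on the right side only.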

@@ -43,3 +43,20 @@ If a lexeme has multiple stem allomorphs that are chosen based on grammatical or
trans_en: wings

The stems are automatically numbered by ``uniparser-morph``: the first stem, *борд.*, is considered to have the number 0, while *бордй.* has the number 1. These numbers can be used in :doc:`paradigms.txt </paradigms>` to specify which morpheme requires which stem allomorph.

Morpheme segmentation in the stem
---------------------------------

Although in many cases what you describe as the stem consists of only one morpheme, it can also be a combination of a root and one or more derivational affixes. If you enable glossing and want the stem to be split into several morphemes, each with a separate gloss, you can mark the morpheme breaks in ``stem`` and the corresponding gloss breaks in ``gloss`` with the ``&`` character. (Note that, for historical reasons, this is a different character from the one used in the :doc:`paradigms </paradigms>`.) This can make sense for not-very-productive derivations that you wouldn't describe in the paradigms, but would still like to see in the annotation. Here is an Udmurt example::

    -lexeme
    lex: котькуд
    stem: коть&куд.
    gramm: ADJPRO
    gloss: INDEF&which
    paradigm: Noun-mar
    trans_en: whichever

The ``&`` character splits the stem, ``котькуд``, into two parts: ``коть``, an indefiniteness marker glossed as ``INDEF``, and the root ``куд``, glossed as ``which``.

Stem morpheme segmentation is designed for concatenative morphology and is not intended for stems that allow infixes.
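
The pairing of stem parts with gloss parts described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code ``uniparser-morph`` actually runs: the function name ``segment_stem`` is hypothetical, and the sketch assumes a purely concatenative stem whose only dots are the attachment markers at its edges.

```python
from typing import List, Tuple


def segment_stem(stem: str, gloss: str) -> List[Tuple[str, str]]:
    """Pair each ``&``-separated stem part with its ``&``-separated gloss.

    Hypothetical helper for illustration; not part of uniparser-morph.
    """
    # Dots at the edges only mark where affixes attach, so drop them
    # (assumption: no stem-internal dots, i.e. no infix slots).
    morphemes = stem.strip(".").split("&")
    glosses = gloss.split("&")
    if len(morphemes) != len(glosses):
        raise ValueError(
            f"{len(morphemes)} stem parts but {len(glosses)} gloss parts"
        )
    return list(zip(morphemes, glosses))


pairs = segment_stem("коть&куд.", "INDEF&which")
print(pairs)  # [('коть', 'INDEF'), ('куд', 'which')]
```

Raising an error on a count mismatch is a design choice of this sketch: a stem with two parts and a gloss with one would otherwise be misaligned silently.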
7 changes: 1 addition & 6 deletions docs/_build/html/_static/pygments.css
@@ -1,10 +1,5 @@
pre { line-height: 125%; }
td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: #ffffcc }
.highlight { background: #f8f8f8; }
.highlight { background: #f8f8f8; }
.highlight .c { color: #8f5902; font-style: italic } /* Comment */
.highlight .err { color: #a40000; border: 1px solid #ef2929 } /* Error */
.highlight .g { color: #000000 } /* Generic */
2 changes: 1 addition & 1 deletion docs/_build/html/bad_analyses.html
@@ -101,7 +101,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/clitics.html
@@ -97,7 +97,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/derivations.html
@@ -97,7 +97,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/examples.html
@@ -127,7 +127,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/format.html
@@ -316,7 +316,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/genindex.html
@@ -98,7 +98,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

</div>
3 changes: 2 additions & 1 deletion docs/_build/html/index.html
@@ -72,6 +72,7 @@ <h2>Contents<a class="headerlink" href="#contents" title="Permalink to this head
<li class="toctree-l1"><a class="reference internal" href="lexemes.html">lexemes.txt</a><ul>
<li class="toctree-l2"><a class="reference internal" href="lexemes.html#introduction">Introduction</a></li>
<li class="toctree-l2"><a class="reference internal" href="lexemes.html#multiple-stems">Multiple stems</a></li>
<li class="toctree-l2"><a class="reference internal" href="lexemes.html#morpheme-segmentation-in-the-stem">Morpheme segmentation in the stem</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="paradigms.html">paradigms.txt</a><ul>
@@ -155,7 +156,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/lex_rules.html
@@ -101,7 +101,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
20 changes: 18 additions & 2 deletions docs/_build/html/lexemes.html
@@ -50,7 +50,7 @@ <h2>Introduction<a class="headerlink" href="#introduction" title="Permalink to t
<p>Pre-defined fields that have to be present in each lexeme are the following:</p>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">lex</span></code> stands for lemma, or dictionary form, of the lexeme.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">stem</span></code> contains a string describing the stem of the lexeme as a morpheme object (see <a class="reference internal" href="format.html"><span class="doc">format overview</span></a>). If there are several free variants, i.e. variants that are equally possible in any context, they may be written inside one value separated by <code class="docutils literal notranslate"><span class="pre">//</span></code>. In the example above, the stem of the numeral <em>eight</em> in Meadow Mari can be written either as a word <em>кандаш</em> or as a numeral <em>8</em>. In both cases, affixes can only attach to the right side of it, which is why there is a dot at the right. A stem can only include the root morpheme or be a combination of a root with some (probably non-productive) derivational affixes that you want to treat as a single lexeme.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">stem</span></code> contains a string describing the stem of the lexeme as a morpheme object (see <a class="reference internal" href="format.html"><span class="doc">format overview</span></a>). If there are several free variants, i.e. variants that are equally possible in any context, they may be written inside one value separated by <code class="docutils literal notranslate"><span class="pre">//</span></code>. In the example above, the stem of the numeral <em>eight</em> in Meadow Mari can be written either as a word <em>кандаш</em> or as a numeral <em>8</em>. In both cases, affixes can only attach to the right side of it, which is why there is a dot at the right. A stem can only include the root morpheme or be a combination of a root with some (probably non-productive) derivational affixes that you want to treat as a single lexeme (see <code class="docutils literal notranslate"><span class="pre">Morpheme</span> <span class="pre">segmentation</span> <span class="pre">in</span> <span class="pre">the</span> <span class="pre">stem</span></code> below).</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">gramm</span></code> contains tags separated by a comma. Normally tags for a lexeme would include its part of speech (<code class="docutils literal notranslate"><span class="pre">NUM</span></code> in this case) and, possibly, some dictionary categories such as gender / noun class for nouns or transitivity for verbs.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">paradigm</span></code> is a link to the inflectional paradigm for this lexeme, which describes how forms of this lexeme can be produced from its stem(s). Even if the lexeme can not be inflected (e.g. it’s a conjunction), there has to be a link, which should in this case lead to a paradigm with a single empty affix. The value must be a name of a paradigm listed in <a class="reference internal" href="paradigms.html"><span class="doc">paradigms.txt</span></a>. There may be multiple paradigm links specified by multiple <code class="docutils literal notranslate"><span class="pre">paradigm</span></code> keys.</p></li>
</ul>
@@ -74,6 +74,21 @@ <h2>Multiple stems<a class="headerlink" href="#multiple-stems" title="Permalink
</div>
<p>The stems are automatically numbered by <code class="docutils literal notranslate"><span class="pre">uniparser-morph</span></code>: the first stem, <em>борд.</em>, is considered to have the number 0, while <em>бордй.</em> has the number 1. These numbers can be used in <a class="reference internal" href="paradigms.html"><span class="doc">paradigms.txt</span></a> to specify which morpheme requires which stem allomorph.</p>
</div>
<div class="section" id="morpheme-segmentation-in-the-stem">
<h2>Morpheme segmentation in the stem<a class="headerlink" href="#morpheme-segmentation-in-the-stem" title="Permalink to this headline"></a></h2>
<p>Although in many cases what you describe as the stem only consists of one morpheme, it can also be a combination of a root and a number of derivations. If you enable glossing and want the stem to be split into several morphemes, each with a separate gloss, you can indicate the morpheme and gloss breaks with the <code class="docutils literal notranslate"><span class="pre">&amp;</span></code> character. (Note that this is done with a different character than in the <a class="reference internal" href="paradigms.html"><span class="doc">paradigms</span></a> for historical reasons.) This could make sense in the case of not-very-productive derivations that you wouldn’t describe in the paradigms, but would still like to see in the annotation. Here is an Udmurt example:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="o">-</span><span class="n">lexeme</span>
<span class="n">lex</span><span class="p">:</span> <span class="n">котькуд</span>
<span class="n">stem</span><span class="p">:</span> <span class="n">коть</span><span class="o">&amp;</span><span class="n">куд</span><span class="o">.</span>
<span class="n">gramm</span><span class="p">:</span> <span class="n">ADJPRO</span>
<span class="n">gloss</span><span class="p">:</span> <span class="n">INDEF</span><span class="o">&amp;</span><span class="n">which</span>
<span class="n">paradigm</span><span class="p">:</span> <span class="n">Noun</span><span class="o">-</span><span class="n">mar</span>
<span class="n">trans_en</span><span class="p">:</span> <span class="n">whichever</span>
</pre></div>
</div>
<p>The <code class="docutils literal notranslate"><span class="pre">&amp;</span></code> character splits the stem, <code class="docutils literal notranslate"><span class="pre">котькуд</span></code>, in two parts: <code class="docutils literal notranslate"><span class="pre">коть</span></code>, an indefiniteness marker glossed as <code class="docutils literal notranslate"><span class="pre">INDEF</span></code>, and the root <code class="docutils literal notranslate"><span class="pre">куд</span></code>, glossed <code class="docutils literal notranslate"><span class="pre">which</span></code>.</p>
<p>Stem morpheme segmentation is designed for concatenative morphology and is not intended for stems that allow infixes.</p>
</div>
</div>


@@ -100,6 +115,7 @@ <h3>Navigation</h3>
<li class="toctree-l1 current"><a class="current reference internal" href="#">lexemes.txt</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#introduction">Introduction</a></li>
<li class="toctree-l2"><a class="reference internal" href="#multiple-stems">Multiple stems</a></li>
<li class="toctree-l2"><a class="reference internal" href="#morpheme-segmentation-in-the-stem">Morpheme segmentation in the stem</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="paradigms.html">paradigms.txt</a></li>
@@ -143,7 +159,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/paradigms.html
@@ -284,7 +284,7 @@ <h3 id="searchlabel">Quick search</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

|
2 changes: 1 addition & 1 deletion docs/_build/html/search.html
@@ -108,7 +108,7 @@ <h3>Related Topics</h3>
&copy;2021, Timofey Arkhangelskiy.

|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.3</a>
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.1</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.12</a>

</div>