Alignment outputs are not as expected #32

Open
xanguera opened this issue Oct 6, 2017 · 5 comments

xanguera commented Oct 6, 2017

Hi,
I am using phonetisaurus to align a grapheme input to its phonetic transcription.
For this I use the phonetisaurus-align tool with alignment models trained on CMUDict.
In a few cases I see that the output does not match the input; see for example:
Input to the aligner:
OVERAWE OW1 V ER0 AA2
Output from the aligner:
O}OW1 V}V E}_ R}_ A}_ E}_

I worked around it by counting how many phonemes and graphemes there are in the input and in the output and handling the entry differently when they do not match, but I was wondering whether it would be possible/advisable for phonetisaurus to raise an error/warning in these cases. Currently it exits normally, without any sign that an issue occurred.
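
For reference, the kind of check described above might look roughly like the following sketch; the helper name and the assumed grapheme}phoneme output format are illustrative only, based on the example shown here, and this is not part of phonetisaurus itself:

# Sketch of the workaround described above: compare an aligner output
# line against the original lexicon entry.
import re

def alignment_matches(input_line, aligned_line):
    """Return True if the aligned entry reproduces the input entry."""
    word, *phones = input_line.split()
    graphs, prons = [], []
    for token in re.split(r"\s+", aligned_line.strip()):
        g, p = token.split("}")          # tokens look like "O}OW1" or "E}_"
        graphs.extend(g.split("|"))      # "|" joins multi-symbol chunks
        prons.extend(p.split("|"))
    recovered_word = "".join(g for g in graphs if g != "_")
    recovered_pron = [p for p in prons if p != "_"]
    return recovered_word == word and recovered_pron == phones

# The entry from this report fails the check:
print(alignment_matches("OVERAWE OW1 V ER0 AA2",
                        "O}OW1 V}V E}_ R}_ A}_ E}_"))   # -> False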

Thanks!

AdolfVonKleist (Owner) commented Oct 6, 2017 via email

AdolfVonKleist (Owner) commented Oct 7, 2017

Can you also share the version of the cmudict that you are using, or a link to the revision in their corresponding repo?

I cannot find the example word you shared in any recent revision I have handy.
In theory this should not be possible; the aligner builds a lattice for each entry, and the provided example does not look like the result of a valid path terminating in a valid final state. It looks like part of the pronunciation may have been truncated during the read - maybe space/tab related?

I tried to reproduce similar behavior with the latest version of the aligner in master, and the latest version of cmudict:

$ wget https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict
$ cat cmudict.dict   | perl -pe 's/\([0-9]+\)//;
              s/\s+/ /g; s/^\s+//;
              s/\s+$//; @_ = split (/\s+/);
              $w = shift (@_);
              $_ = $w."\t".join (" ", @_)."\n";'   > cmudict.formatted.dict
$ phonetisaurus-train --lexicon cmudict.formatted.dict --seq2_del
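
For anyone less at home with the perl one-liner, the same formatting step could be sketched in Python roughly as follows; the file name format_cmudict.py is made up for illustration, and the logic simply mirrors the one-liner (drop variant markers such as "(2)", collapse whitespace, and emit word<TAB>pronunciation):

#!/usr/bin/env python
# Rough Python equivalent of the perl formatting one-liner above.
import re
import sys

with open(sys.argv[1], "r", encoding="utf8") as ifp, \
     open(sys.argv[2], "w", encoding="utf8") as ofp:
    for line in ifp:
        line = re.sub(r"\([0-9]+\)", "", line)   # drop variant markers like (2)
        tokens = line.split()                    # collapse all whitespace
        if not tokens:
            continue
        ofp.write("{0}\t{1}\n".format(tokens[0], " ".join(tokens[1:])))

which would be run as:

$ python format_cmudict.py cmudict.dict cmudict.formatted.dict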

I wrote the following script which I think performs the comparison you described:

#!/usr/bin/env python
"""Compare an aligned corpus against the reference lexicon and print
any recovered pronunciations that are missing from the lexicon."""
import re
import sys
from collections import defaultdict

def ProcessAligned(corpusfile, lexicon):
    with open(corpusfile, "r", encoding="utf8") as ifp:
        for line in ifp:
            graphs = []
            phones = []
            tokens = re.split(r"\s+", line.strip())
            for token in tokens:
                # Tokens look like "g}p"; multi-symbol chunks are joined
                # with '|' and '_' marks a deletion/insertion.
                g, p = re.split(r"\}", token)
                graphs.extend(re.split(r"\|", g))
                phones.extend(re.split(r"\|", p))
            word = "".join([g for g in graphs if not g == "_"])
            pron = " ".join([p for p in phones if not p == "_"])
            prons = lexicon[word]
            if pron not in prons:
                print("{0}\t{1}".format(word, pron))
    return

def LoadLexicon(lexiconfile):
    # The reference lexicon is tab separated: word<TAB>pronunciation
    lexicon = defaultdict(list)
    with open(lexiconfile, "r", encoding="utf8") as ifp:
        for entry in ifp:
            word, pron = re.split(r"\t", entry.strip())
            lexicon[word].append(pron)

    return lexicon

if __name__ == "__main__":
    lexicon = LoadLexicon(sys.argv[1])
    ProcessAligned(sys.argv[2], lexicon)

When I run it against the reference lexicon and the resulting aligned corpus:

$ python proc.py ../cmudict.formatted.dict model.corpus
$

All pronunciations from the original are found. This again makes me think that it may be an issue related to whitespace in the lexicon as it is read in. Lemme know!
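
If whitespace in the lexicon is indeed the culprit, a quick pre-check along these lines might surface malformed entries before alignment; this is only a sketch suggested here, not an existing phonetisaurus tool, and it assumes the word<TAB>pronunciation format produced above:

#!/usr/bin/env python
# Sketch of a lexicon whitespace check: flag lines that do not follow
# the word<TAB>pron convention (not an existing phonetisaurus tool).
import sys

with open(sys.argv[1], "r", encoding="utf8") as ifp:
    for n, line in enumerate(ifp, 1):
        entry = line.rstrip("\n")
        if "\t" not in entry:
            print("line {0}: no tab separator: {1!r}".format(n, entry))
        elif entry.count("\t") > 1:
            print("line {0}: multiple tabs: {1!r}".format(n, entry))
        elif entry != entry.strip() or "\u00a0" in entry:
            print("line {0}: stray whitespace: {1!r}".format(n, entry))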

xanguera (Author) commented Oct 7, 2017 via email

AdolfVonKleist (Owner) commented:

Ah OK, I did not quite understand at first. Can you use the Python bindings or the script interface directly? These will return the original alignment from the decoding step, and will also retain the arc weights from the joint sequence LM, including backoff epsilon arcs.

The Python bindings/script interface return the following result in my case:

$ ./script/phoneticize.py --model /tmp/experiment/train/model.fst --word overawe
0.00	OW1 V ER0 AA1
-------
o:OW1:5.37
v:V:0.84
e|r:ER0:0.06
<eps>:<eps>:1.85
<eps>:<eps>:0.49
a:AA1:5.51
<eps>:<eps>:0.29
w:_:4.21
<eps>:<eps>:0.29
e:_:2.63
<eps>:<eps>:1.06
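
As a side note, the per-arc trace above is just input:output:weight triples printed one per line, so it is easy to post-process. The labelling below is only a sketch based on the output format shown here, not on any official API, and the file name label_trace.py is made up:

# Minimal sketch for labelling the per-arc trace printed above; it only
# assumes the "input:output:weight" format of those lines.
import sys

for line in sys.stdin:
    line = line.strip()
    if ":" not in line:
        continue                          # skip the summary/separator lines
    graph, phone, weight = line.rsplit(":", 2)
    if graph == "<eps>" and phone == "<eps>":
        kind = "epsilon arc (e.g. LM backoff)"
    elif phone == "_":
        kind = "grapheme aligned to nothing"
    else:
        kind = "emission"
    print("{0:>8} -> {1:<5} {2:>6}  {3}".format(graph, phone, weight, kind))

It can be piped directly from the command above:

$ ./script/phoneticize.py --model /tmp/experiment/train/model.fst --word overawe | python label_trace.py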

xanguera (Author) commented Oct 10, 2017 via email
