Support infinite weights in lt-comp/lt-proc #62

Open
AMR-KELEG opened this issue Jun 12, 2019 · 7 comments

@AMR-KELEG
Contributor

We need to implement a way to represent infinite weights.
The current outcome is strange!

$ cat sample.att
0       1       a       b       2
1       2       b       c       1
1       2       c       d       inf
2       0

$ lt-comp lr sample.att sa.bin
main@standard 3 3

$ lt-print sa.bin
0       1       a       b       1.000000
1       2       b       c       2.000000
1       2       c       d       -2.000000
2       0.000000
@flammie
Member

flammie commented Jun 12, 2019

I think functions like atof and strtod should just work with inf as a string. Inf is not the most useful weight, though, given that inf+x is inf for all x. I think at least openfst just decides to bounce when it sees an inf arc (considering it a non-arc); hfst also prints +? in xerox mode as an analysis with weight inf, etc.
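
As a quick check of the point about strtod, here is a minimal sketch. `parse_weight` and `path_weight` are hypothetical helpers written for illustration, not lttoolbox functions. Since C99/C++11, `std::strtod` accepts `inf` and `infinity` in any letter case with an optional sign, and adding anything finite to an infinite weight leaves it infinite.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Hypothetical helper: parse the weight column of an ATT line.
// std::strtod accepts "inf"/"infinity" case-insensitively, with optional sign.
double parse_weight(const char* text) {
    char* end = nullptr;
    double w = std::strtod(text, &end);
    assert(end != text && "no number could be parsed");
    return w;
}

// Path weights are summed along a path, so a single infinite arc
// makes the whole path weight infinite.
double path_weight(double a, double b) {
    return a + b;
}
```

With these, `parse_weight("inf")` is positive infinity and `path_weight(parse_weight("inf"), 1.0)` stays infinite, which is why an inf arc can be treated as a non-arc.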

For OOVs it's good enough to have a reasonably high non-inf number; for more advanced implementations one can calculate probability estimates such as https://en.wikipedia.org/wiki/Additive_smoothing or https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing, and so forth.

@AMR-KELEG
Contributor Author

Well, using Laplace smoothing will solve the problem while ensuring that OOV tokens get the highest -log(P) value.

OTOH, lt-print seems to not be showing inf weights as shown above.
I am convinced now that an edge with an infinite weight isn't that useful in most fsts.
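
To make the smoothing idea concrete, here is a minimal sketch of add-alpha (Laplace when alpha is 1) smoothing turned into a -log(P) weight. The function name, its parameters, and the choice of reserving one vocabulary slot for OOV items are assumptions made for illustration; nothing here is taken from lttoolbox.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper, not part of lttoolbox.
//   count      : how often the item was observed (0 for an OOV item)
//   total      : total number of observations
//   vocabulary : number of distinct outcomes, counting one slot for OOV
//   alpha      : pseudo-count added to every outcome (1.0 = Laplace)
double smoothed_weight(std::size_t count, std::size_t total,
                       std::size_t vocabulary, double alpha = 1.0) {
    assert(alpha > 0.0 && vocabulary > 0);
    double p = (count + alpha) / (total + alpha * vocabulary);
    return -std::log(p);  // finite for every count, largest when count == 0
}
```

Because the pseudo-count keeps every probability above zero, an unseen item gets a large but finite weight, higher than that of any observed item.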

@flammie
Member

flammie commented Jun 18, 2019

> Well, using Laplace smoothing will solve the problem while ensuring that OOV tokens get the highest -log(P) value.

Yes that should be good.

> OTOH, lt-print seems to not be showing inf weights as shown above.
> I am convinced now that an edge with an infinite weight isn't that useful in most fsts.

Yeah, so infinite weights in the tropical semiring are mainly good for theoretical constructions like graph completion (where every state must have a transition on every symbol). You could check the code to see where the inf parsing/printing/handling goes awry, since theoretically it should be possible to support it, but it's not a high priority at all.
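
For readers less familiar with the tropical semiring, the sketch below spells out its two operations. The names are chosen for illustration only. Alternatives are combined with min and paths are extended with +, so positive infinity is the identity for min and absorbs under +. That is why an arc added purely for completion can carry weight inf without ever changing the best path.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Tropical semiring over non-negative reals plus +infinity (illustrative names).
const double kTropicalZero = std::numeric_limits<double>::infinity();  // "no path"
const double kTropicalOne = 0.0;                                        // "free" path

// Combine alternative paths: keep the cheaper one.
double tropical_plus(double a, double b) {
    return std::min(a, b);
}

// Extend a path by another arc: add the weights.
double tropical_times(double a, double b) {
    assert(a >= 0.0 && b >= 0.0);  // restricted to R+ and +inf in this sketch
    return a + b;
}
```

For example, `tropical_plus(3.0, kTropicalZero)` is 3.0, while `tropical_times(3.0, kTropicalZero)` is infinite.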

@mr-martian
Contributor

I believe the issue here is not with lt-comp but with the way floating-point numbers are written in the current file format: the functions used in compression.cc to disassemble doubles store an unspecified exponent when applied to inf (https://en.cppreference.com/w/cpp/numeric/math/frexp).
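
The frexp behaviour can be illustrated with a small sketch. `SplitDouble`, `split_finite`, and `join` are hypothetical stand-ins and do not reproduce the code in compression.cc. For an infinite argument, `std::frexp` returns the infinity unchanged and stores an unspecified value in the exponent, so a mantissa/exponent split is only meaningful for finite input.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a mantissa/exponent split.
struct SplitDouble {
    double mantissa;  // magnitude in [0.5, 1), or 0
    int exponent;
};

// std::frexp leaves the exponent unspecified for +-inf,
// so this sketch insists on finite input.
SplitDouble split_finite(double value) {
    assert(std::isfinite(value));
    SplitDouble out{0.0, 0};
    out.mantissa = std::frexp(value, &out.exponent);
    return out;
}

// Inverse of split_finite: mantissa * 2^exponent.
double join(const SplitDouble& s) {
    return std::ldexp(s.mantissa, s.exponent);
}
```

A writer built on such a split would need a separate code path for inf, since the round trip through `join` is only guaranteed for finite values.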

@TinoDidriksen
Member

We can reserve 0xFFFFFFFF 0xFFFFFFFF as inf. But is -inf meaningful?

@flammie
Member

flammie commented Jul 22, 2022

I think the tropical semiring weight structures we use are only well defined in R+ including positive infinity. They may kind of work with negative values, and I guess one could interpret a path with negative infinity as an unconditional top suggestion...

@TinoDidriksen
Member

Implemented by reserving 0xFFFFFFFF 0xFFFFFFFF as inf and 0xFFFFFFFF 0xFFFFFFFE as -inf.
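
A rough sketch of how such reserved pairs could work is shown below. The two-word layout (a 30-bit scaled mantissa from frexp followed by the exponent) and every name in it are assumptions made for illustration; only the two sentinel values come from this thread, and the actual lttoolbox encoding may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <utility>

// Assumed layout: (mantissa word, exponent word). Not the real file format.
using Words = std::pair<std::uint32_t, std::uint32_t>;

const Words kPosInf{0xFFFFFFFFu, 0xFFFFFFFFu};  // reserved for +inf
const Words kNegInf{0xFFFFFFFFu, 0xFFFFFFFEu};  // reserved for -inf

Words encode_weight(double w) {
    assert(!std::isnan(w));
    if (std::isinf(w)) return w > 0 ? kPosInf : kNegInf;
    int exp = 0;
    double frac = std::frexp(w, &exp);  // |frac| in [0.5, 1), or 0
    // Scale to 30 bits. The result is 0 or has magnitude >= 2^29, so it can
    // never equal -1 (0xFFFFFFFF) and cannot collide with the sentinels.
    auto mantissa = static_cast<std::uint32_t>(
        static_cast<std::int32_t>(frac * 0x40000000));
    auto exponent = static_cast<std::uint32_t>(static_cast<std::int32_t>(exp));
    return {mantissa, exponent};
}

double decode_weight(const Words& words) {
    if (words == kPosInf) return std::numeric_limits<double>::infinity();
    if (words == kNegInf) return -std::numeric_limits<double>::infinity();
    // Two's-complement round trip (guaranteed since C++20, and the behaviour
    // of mainstream compilers before that).
    auto mantissa = static_cast<std::int32_t>(words.first);
    auto exponent = static_cast<std::int32_t>(words.second);
    return std::ldexp(static_cast<double>(mantissa) / 0x40000000, exponent);
}
```

Under this assumed layout the sentinels are safe because no finite value produces a mantissa word of 0xFFFFFFFF, so the reader can test for the reserved pairs before decoding anything else.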

ICU's u_sscanf() only parses all-upper-case INF and -INF, and prints all-upper-case as well. So the first quirk was adding a special-case parse for lower-case inf and -inf.

See if that breaks anything.
