[BUG] correct behavior for “Ё” (U+0401) #29
Comments
btw the icu4x implementation lives here
As Russian is my mother tongue, believe me: the Russian alphabet doesn't have accented characters :)
Concerning icu4x and encoding-rs: these are very large changes. I would prefer to finish the idiomatic changes and any other speed improvements (or at least the attempts and ideas) before starting work on them.
Ah very interesting. Right now the situation is as follows:
I agree. Most of my ideas for idiomatic changes are exhausted, so I have been changing other aspects of the code to explore other options. Feel free to close this issue if you don't think there is any more use for it.
Describe the bug
In test_is_accentuated (charset-normalizer-rs/src/tests/utils.rs, line 28 in cbe086f), this case is tested to see if it is false:
“Ё” (U+0401) Cyrillic Capital Letter Io
The code being tested is here: charset-normalizer-rs/src/utils.rs, line 118 in cbe086f.
The problem is that, under the current Unicode decomposition rules (both NFD and NFKD), this character is considered to have a diaeresis:
https://www.compart.com/en/unicode/U+0401
https://graphemica.com/%D0%81
(BTW, this is different from the almost identical-looking “Ë” (U+00CB) Latin Capital Letter E with Diaeresis.)
To Reproduce
The icu4x crates can be used to decompose characters in Rust:
cargo add icu_normalizer
I am actually trying to reimplement some parts of the code, and that is how I discovered it.
The new implementation decomposes the input character directly and checks whether the result contains any Unicode characters that indicate accents.
Since “Ё” (U+0401) Cyrillic Capital Letter Io decomposes into “Е” (U+0415) Cyrillic Capital Letter Ie + Combining Diaeresis '\u{0308}',
the new code returns true, while the old code returns false (since "diaeresis" is not in the character's name).
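To make the difference concrete, here is a minimal sketch of the two approaches side by side. It is not the code from either implementation: toy_nfd is a hypothetical stand-in for a real NFD decomposition (for example from icu_normalizer) and only covers the handful of characters discussed in this issue, and both the list of combining marks and the list of name keywords are assumptions made for illustration.

```rust
/// Hypothetical stand-in for a real NFD decomposition.
/// The table only covers the characters discussed in this issue.
fn toy_nfd(c: char) -> Vec<char> {
    match c {
        '\u{0401}' => vec!['\u{0415}', '\u{0308}'], // Ё -> Е + combining diaeresis
        '\u{0451}' => vec!['\u{0435}', '\u{0308}'], // ё -> е + combining diaeresis
        '\u{00CB}' => vec!['\u{0045}', '\u{0308}'], // Ë -> E + combining diaeresis
        '\u{00E9}' => vec!['\u{0065}', '\u{0301}'], // é -> e + combining acute
        _ => vec![c],                               // everything else: unchanged
    }
}

/// Combining marks treated as accents in this sketch (an assumed list).
fn is_accent_mark(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'   // combining grave
        | '\u{0301}' // combining acute
        | '\u{0302}' // combining circumflex
        | '\u{0303}' // combining tilde
        | '\u{0308}' // combining diaeresis
        | '\u{0327}' // combining cedilla
    )
}

/// Decomposition-based check: look for an accent mark after the base character.
fn is_accentuated_by_decomposition(c: char) -> bool {
    toy_nfd(c).into_iter().skip(1).any(is_accent_mark)
}

/// Name-based check: look for an accent keyword in the Unicode character name.
/// The keyword list is an assumption, not copied from the crate.
fn is_accentuated_by_name(unicode_name: &str) -> bool {
    const KEYWORDS: [&str; 6] = [
        "WITH GRAVE",
        "WITH ACUTE",
        "WITH CEDILLA",
        "WITH DIAERESIS",
        "WITH CIRCUMFLEX",
        "WITH TILDE",
    ];
    KEYWORDS.iter().any(|k| unicode_name.contains(*k))
}
```

With these definitions, is_accentuated_by_name("CYRILLIC CAPITAL LETTER IO") is false because the name carries no accent keyword, while is_accentuated_by_decomposition('\u{0401}') is true because the decomposition contains U+0308. For “Ë” (U+00CB), whose name is "LATIN CAPITAL LETTER E WITH DIAERESIS", both checks agree.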
Expected behavior
“Ё” (U+0401) should return true.
Additional context
The Unicode standard moves fast: there is a new version every year, and new code points are added constantly, especially for CJK.
I think it is valuable to have an implementation that is up to date.
Btw, I have almost finished my implementation using various components from https://github.com/unicode-org/icu4x
It is a pure Rust codebase worked on by both standards bodies and industry supporters such as Google, so I feel it would be a good library to rely on.