Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] correct behavior for “Ё” (U+0401) #29

Open
chris-ha458 opened this issue Oct 4, 2023 · 5 comments
Open

[BUG] correct behavior for “Ё” (U+0401) #29

chris-ha458 opened this issue Oct 4, 2023 · 5 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@chris-ha458
Copy link
Contributor

chris-ha458 commented Oct 4, 2023

Describe the bug
In test_is_accentuated


This case is tested to see if it is false.
“Ё” (U+0401) Cyrillic Capital Letter Io

The code being tested is here

"WITH DIAERESIS",

The problem here is that it is considered to have an diaeresis under current correct unicode decomposition rules (both NFKD and NFD)
https://www.compart.com/en/unicode/U+0401
https://graphemica.com/%D0%81

(BTW this is different from almost exactly looking Unicode Character “Ë” (U+00CB) Latin Capital Letter E with Diaeresis

To Reproduce
the icu4x crates can be used to decompose in rust.
cargo add icu_normalizer
I am actually trying to reimplement some parts of the code and that is how i discovered it.

pub(crate) fn is_accentuated(ch: char) -> bool {
    let nfd = icu_normalizer::DecomposingNormalizer::new_nfkd();
    let denormalized_string: String = nfd.normalize(ch.to_string().as_str());
    denormalized_string
        .chars()
        .any(|decomposed| match decomposed {
        '\u{0300}' // "WITH GRAVE"
        |'\u{0301}' // "WITH ACUTE"
        |'\u{0302}' // "WITH CIRCUMFLEX"
        |'\u{0303}' // "WITH TILDE"
        |'\u{0308}' //  "WITH DIAERESIS"
        |'\u{0327}' // "WITH CEDILLA"
         => true,
        _=> false,
    })
}

This new implementation directly tries to directly decompose the input character and try to see if unicode characters that indicate accents exist.
Since “Ё” (U+0401) Cyrillic Capital Letter Io decomposes into Е Cyrillic Capital Letter Ie + Diaeresis '\u{0308}'
the new code returns true, while the old code returns false (since diaeresis is not in the name)

Expected behavior
“Ё” (U+0401) should return true.

Additional context
Unicode standard is fast moving. A new standard every year and especially for CJK there are new codepoints added constantly.
I think it is valuable to have an implementation that is up to date.

Btw I have almost finished my implementation using various components from https://github.com/unicode-org/icu4x
It is a pure rust codebase worked on by both standard bodies and industry supporters such as google, so I feel like it would be a good library to rely upon.

@chris-ha458 chris-ha458 added bug Something isn't working help wanted Extra attention is needed labels Oct 4, 2023
@chris-ha458
Copy link
Contributor Author

btw the icu4x implementation lives here
https://github.com/chris-ha458/charset-normalizer-rs/tree/icu

@nickspring
Copy link
Owner

As Russian is my mother tongue believe me Russian alphabet doesn't have accentuated characters :)
The behaviour is correct IMO.

@nickspring
Copy link
Owner

Concerning icu4x, encoding-rs - these are a very huge changes. I would prefer to finish idiomatic changes and any other speed improvements (at least attempts and ideas) before start to work on it.

@chris-ha458
Copy link
Contributor Author

As Russian is my mother tongue believe me Russian alphabet doesn't have accentuated characters :)
The behaviour is correct IMO.

Ah very interesting.
In sample-russian-2 there is this character ё that is the lower case of the one mentioned above.

Right now the situation is as follows:

  1. “Ё” (U+0401) Cyrillic Capital Letter Io decomposing into Diaeresis '\u{0308}' seems to be an implementation detail and not meant to signify it is accentuated.
  2. There is no good way of retrieving accentuated characters/languages. Most things must be hard coded, and it does not seem to be changing in the near future (if at all).
  3. In the current implementation, the unicode names are directly retrieved and then compared to with the unic crate (this conforms to unicode standard 10) this crate has been abandoned, and there have been multiple replacements that were also abandoned. Right now efforts have finally settled on icu4x.
  4. icu4x does not currently provide a way to retrieve unicode names like that, but this may change in the near future. It is really fast moving, and unicode names are part of UAX#44 so I am hopeful this changes. Meanwhile, icu4x enables decomposition. This enables approaches such as the code snippet above.
  5. unicode_names2 exists which provides up to date name retrieval that could 3. style retrieval according to unicode standard 15(up to date)
  6. unfortunately, 4 nor 5 show meaningful speed ups, but it does enable further idiomatic code changes.

@chris-ha458
Copy link
Contributor Author

Concerning icu4x, encoding-rs - these are a very huge changes. I would prefer to finish idiomatic changes and any other speed improvements (at least attempts and ideas) before start to work on it.

I agree. Most of my ideas for idiomatic changes are being exhausted so i have been changing other aspects of the code to explore other options.

Feel free to close this issue if you don't think there would be more use for it.
I will close this when I find an idiomatic solution to all the above concerns (this might not exist, so it would take a while to close)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants