[BUG] correct behavior for “Ё” (U+0401) #29

chris-ha458 · 2023-10-04T10:18:21Z

Describe the bug
In test_is_accentuated

charset-normalizer-rs/src/tests/utils.rs

Line 28 in cbe086f

('Ё', false),

This case is tested to see if it is false.
“Ё” (U+0401) Cyrillic Capital Letter Io

The code being tested is here

charset-normalizer-rs/src/utils.rs

Line 118 in cbe086f

"WITH DIAERESIS",

The problem here is that it is considered to have an diaeresis under current correct unicode decomposition rules (both NFKD and NFD)
https://www.compart.com/en/unicode/U+0401
https://graphemica.com/%D0%81

(BTW this is different from almost exactly looking Unicode Character “Ë” (U+00CB) Latin Capital Letter E with Diaeresis

To Reproduce
the icu4x crates can be used to decompose in rust.
cargo add icu_normalizer
I am actually trying to reimplement some parts of the code and that is how i discovered it.

pub(crate) fn is_accentuated(ch: char) -> bool {
    let nfd = icu_normalizer::DecomposingNormalizer::new_nfkd();
    let denormalized_string: String = nfd.normalize(ch.to_string().as_str());
    denormalized_string
        .chars()
        .any(|decomposed| match decomposed {
        '\u{0300}' // "WITH GRAVE"
        |'\u{0301}' // "WITH ACUTE"
        |'\u{0302}' // "WITH CIRCUMFLEX"
        |'\u{0303}' // "WITH TILDE"
        |'\u{0308}' //  "WITH DIAERESIS"
        |'\u{0327}' // "WITH CEDILLA"
         => true,
        _=> false,
    })
}

This new implementation directly tries to directly decompose the input character and try to see if unicode characters that indicate accents exist.
Since “Ё” (U+0401) Cyrillic Capital Letter Io decomposes into Е Cyrillic Capital Letter Ie + Diaeresis '\u{0308}'
the new code returns true, while the old code returns false (since diaeresis is not in the name)

Expected behavior
“Ё” (U+0401) should return true.

Additional context
Unicode standard is fast moving. A new standard every year and especially for CJK there are new codepoints added constantly.
I think it is valuable to have an implementation that is up to date.

Btw I have almost finished my implementation using various components from https://github.com/unicode-org/icu4x
It is a pure rust codebase worked on by both standard bodies and industry supporters such as google, so I feel like it would be a good library to rely upon.

The text was updated successfully, but these errors were encountered:

chris-ha458 · 2023-10-04T10:24:07Z

btw the icu4x implementation lives here
https://github.com/chris-ha458/charset-normalizer-rs/tree/icu

nickspring · 2023-10-04T15:56:26Z

As Russian is my mother tongue believe me Russian alphabet doesn't have accentuated characters :)
The behaviour is correct IMO.

nickspring · 2023-10-04T20:19:15Z

Concerning icu4x, encoding-rs - these are a very huge changes. I would prefer to finish idiomatic changes and any other speed improvements (at least attempts and ideas) before start to work on it.

chris-ha458 · 2023-10-04T23:50:57Z

As Russian is my mother tongue believe me Russian alphabet doesn't have accentuated characters :)
The behaviour is correct IMO.

Ah very interesting.
In sample-russian-2 there is this character ё that is the lower case of the one mentioned above.

Right now the situation is as follows:

“Ё” (U+0401) Cyrillic Capital Letter Io decomposing into Diaeresis '\u{0308}' seems to be an implementation detail and not meant to signify it is accentuated.
There is no good way of retrieving accentuated characters/languages. Most things must be hard coded, and it does not seem to be changing in the near future (if at all).
In the current implementation, the unicode names are directly retrieved and then compared to with the unic crate (this conforms to unicode standard 10) this crate has been abandoned, and there have been multiple replacements that were also abandoned. Right now efforts have finally settled on icu4x.
icu4x does not currently provide a way to retrieve unicode names like that, but this may change in the near future. It is really fast moving, and unicode names are part of UAX#44 so I am hopeful this changes. Meanwhile, icu4x enables decomposition. This enables approaches such as the code snippet above.
unicode_names2 exists which provides up to date name retrieval that could 3. style retrieval according to unicode standard 15(up to date)
unfortunately, 4 nor 5 show meaningful speed ups, but it does enable further idiomatic code changes.

chris-ha458 · 2023-10-04T23:53:50Z

Concerning icu4x, encoding-rs - these are a very huge changes. I would prefer to finish idiomatic changes and any other speed improvements (at least attempts and ideas) before start to work on it.

I agree. Most of my ideas for idiomatic changes are being exhausted so i have been changing other aspects of the code to explore other options.

Feel free to close this issue if you don't think there would be more use for it.
I will close this when I find an idiomatic solution to all the above concerns (this might not exist, so it would take a while to close)

chris-ha458 added bug Something isn't working help wanted Extra attention is needed labels Oct 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] correct behavior for “Ё” (U+0401) #29

[BUG] correct behavior for “Ё” (U+0401) #29

chris-ha458 commented Oct 4, 2023 •

edited

Loading

chris-ha458 commented Oct 4, 2023

nickspring commented Oct 4, 2023

nickspring commented Oct 4, 2023

chris-ha458 commented Oct 4, 2023

chris-ha458 commented Oct 4, 2023

[BUG] correct behavior for “Ё” (U+0401) #29

[BUG] correct behavior for “Ё” (U+0401) #29

Comments

chris-ha458 commented Oct 4, 2023 • edited Loading

chris-ha458 commented Oct 4, 2023

nickspring commented Oct 4, 2023

nickspring commented Oct 4, 2023

chris-ha458 commented Oct 4, 2023

chris-ha458 commented Oct 4, 2023

chris-ha458 commented Oct 4, 2023 •

edited

Loading