I don't know of such a plan, but our plans depend on prospective clients, so if people want something like that we could potentially add it (or accept patches for it). It depends, though; I'm not familiar enough with this API to be sure.
-
I know XPath is by no means the authoritative specification here, but it's the perspective I'm coming from. The XPath standard has a section on substring matching in light of collation (https://www.w3.org/TR/xpath-functions-31/#substring.functions). It talks about "collation units", which, it says, are the same as Unicode "collation elements", with a reference to "Unicode Technical Standard #10: Unicode Collation Algorithm". So it looks like those algorithms could be supported if we had access to "collation elements". XPath implies that not all collations support this. I looked through the icu4x source code and indeed the collator does seem to implement this concept, but it looks like neither [...]
As a next step I should examine the older ICU implementations to get a better idea of what such an API should look like.
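To make the "collation units" idea concrete, here is a minimal sketch of a substring test expressed over units rather than over characters. Everything in it is an assumption for illustration: `collation_units` is a hypothetical toy rule (ignore combining diacritics, fold case), not the ICU4X collator or real UCA data, and `contains_units` is a made-up name.

```rust
/// HYPOTHETICAL toy stand-in for a collator: maps text to "collation units".
/// Real collation elements come from UCA/CLDR data, not from this rule.
fn collation_units(s: &str) -> Vec<u32> {
    s.chars()
        // Treat combining diacritics (U+0300..=U+036F) as ignorable:
        // they contribute no unit at all.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        // Fold case so the comparison behaves like a primary-strength one.
        .flat_map(|c| c.to_lowercase())
        .map(|c| c as u32)
        .collect()
}

/// True if the units of `needle` occur as a contiguous run
/// within the units of `haystack`.
fn contains_units(haystack: &str, needle: &str) -> bool {
    let h = collation_units(haystack);
    let n = collation_units(needle);
    // An empty needle matches trivially; `windows(0)` would panic.
    n.is_empty() || h.windows(n.len()).any(|w| w == n.as_slice())
}
```

With this toy rule, `contains_units("Cafe\u{0301} Noir", "FE N")` holds because the combining acute accent produces no unit. A real implementation would need the collator to expose its elements, which is exactly the API question raised here.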
-
We've had discussions about search collations in the past, such as #3174 (comment). Basically, we need a client with a clear and compelling use case who ideally can make some contributions, and then the team can provide mentorship to help land this type of feature.
-
To add another use case besides the XPath one mentioned earlier: I am looking into replacing the search capabilities in the space simulator Celestia with a more Unicode-aware version. As we're a C++ application, we're currently using ICU4C (and we have to use the C API, because OS-provided ICU implementations, e.g. on Windows, provide it rather than the C++ API), but I've been considering ICU4X for a while as it looks like it would be friendlier for the web implementation and would avoid a bunch of UTF-8/UTF-16 back-and-forth conversions. From experimenting, it feels like there is definitely room to provide some more useful APIs than are currently available via ICU4C. What I actually want to do here is prefix matching using the asymmetric search described in TR10, where unmarked characters act as wildcards that match any form of the primary character. This is due to a few situations where case distinctions are important:
The autocomplete functionality also suggests the use of the search collations (which currently do not seem to be supported by ICU4X) so that, e.g. searching for "c" in Slovak will also bring up names starting with "ch", which would otherwise not happen due to the main collation treating the "ch" digraph as a separate letter that sorts next to "h". The implementation in ICU4C ends up working as follows:
For step 1, I don't see anything in the ICU4X documentation that indicates that this U+FFFF technique is also supported. If it is, it would be good to mention it in the collator documentation (apologies if I overlooked it).
Step 2 is the least ideal part. The ICU4C API does not seem to provide a "match prefix" option that also supports asymmetric search, so I'm instead doing a substring search that will consider any position in the string, which means it is likely doing unnecessary work. There is also a slightly odd situation: while the case-insensitive search via the en_GB collator will return that, e.g., "Ægir" (which comes up as the name of an exoplanet) starts with "a", the search API says that this does NOT contain "a", but that it does contain "ae". This feels similar to the Slovak "ch" issue that I was using the search collation to avoid.
It looks like I should be able to construct the prefix-matching algorithm I actually want in step 2 via the collation element iterator, but again this does not quite work. I can find out that in my copy of ICU with the en_GB search collation, the secondary and tertiary weights of "a" (which would count as the unmarked form of the character) are 5 and 5 respectively, but it doesn't appear to be documented whether these are the weights of ALL unweighted characters in this collation, whether other collations use these same values for unweighted characters, or whether this value of 5 remains constant across different ICU4C versions.
In any case, I reckon prefix matching is likely going to be a sufficiently common use case that it makes sense to provide an API for it.
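As a rough illustration of the asymmetric prefix match described above, the sketch below works on a hypothetical collation-element model. None of it is ICU4C or ICU4X API: the `Ce` struct, the `UNMARKED` weight of 0, and the `toy_elements` rule (case sets the tertiary weight, and a combining mark is folded into the preceding element's secondary weight) are all assumptions made for the example, and input is assumed to be decomposed (NFD).

```rust
/// HYPOTHETICAL collation element with three weight levels.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ce {
    primary: u32,
    secondary: u32,
    tertiary: u32,
}

/// Assumed weight of an "unmarked" character in this toy model only.
const UNMARKED: u32 = 0;

/// Toy element generator (an assumption, not real UCA data).
/// Expects decomposed (NFD) input.
fn toy_elements(s: &str) -> Vec<Ce> {
    let mut out: Vec<Ce> = Vec::new();
    for c in s.chars() {
        if ('\u{0300}'..='\u{036F}').contains(&c) {
            // Simplification: record the accent on the previous element.
            if let Some(last) = out.last_mut() {
                last.secondary = c as u32;
            }
            continue;
        }
        let lower = c.to_lowercase().next().unwrap_or(c);
        out.push(Ce {
            primary: lower as u32,
            secondary: UNMARKED,
            tertiary: if c.is_uppercase() { 1 } else { UNMARKED },
        });
    }
    out
}

/// Asymmetric comparison: an unmarked weight in the pattern is a wildcard,
/// while a marked weight must match the target exactly.
fn matches_asym(pattern: &Ce, target: &Ce) -> bool {
    pattern.primary == target.primary
        && (pattern.secondary == UNMARKED || pattern.secondary == target.secondary)
        && (pattern.tertiary == UNMARKED || pattern.tertiary == target.tertiary)
}

/// True if `pattern` asymmetrically matches the start of `target`.
fn asymmetric_prefix(target: &str, pattern: &str) -> bool {
    let t = toy_elements(target);
    let p = toy_elements(pattern);
    p.len() <= t.len() && p.iter().zip(t.iter()).all(|(pe, te)| matches_asym(pe, te))
}
```

In this model, the lowercase query "eur" matches "Europa", but the query "Eur" does not match "europa", because the uppercase letter carries a marked tertiary weight. Only the start of the target is examined, which avoids the any-position substring scan. A real version would also need contractions and expansions (the Slovak "ch" and "Æ" cases), which this sketch ignores.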
-
I looked through the project to figure out how to do collator-aware string search, such as substring matching. I couldn't find any API in icu4x that helps me implement this. Is this correct or did I miss something?
I did some digging. First, this document suggests that the string search algorithm implemented by the other ICU implementations has shifted to a less performant but more accurate linear search:
https://unicode-org.github.io/icu/userguide/collation/string-search.html#performance-and-other-implications
I'm also trying to imagine how difficult it would be to implement linear collation-aware search myself. Since collations can ignore characters, a naive implementation that uses the collator compare ordering is_eq() for a range of characters starting at a position won't work, I think.
But before I delve into this further, is a (linear or not) search implementation planned for ICU4X?
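A small sketch of why the fixed-width approach breaks down, under stated assumptions: `toy_compare` is a hypothetical comparator standing in for a real collator (it ignores combining marks U+0300..=U+036F and folds case), and both search functions are illustrative names, not ICU4X API.

```rust
use std::cmp::Ordering;

/// HYPOTHETICAL toy key: combining marks are ignorable, case is folded.
fn toy_key(s: &str) -> Vec<char> {
    s.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Stand-in for a collator's compare(); not a real collation.
fn toy_compare(a: &str, b: &str) -> Ordering {
    toy_key(a).cmp(&toy_key(b))
}

/// Naive search: only tests windows with as many chars as the needle.
/// Returns the char index of the first match.
fn naive_find(haystack: &str, needle: &str) -> Option<usize> {
    let h: Vec<char> = haystack.chars().collect();
    let n = needle.chars().count();
    if n == 0 {
        return Some(0);
    }
    h.windows(n).position(|w| {
        let s: String = w.iter().collect();
        toy_compare(&s, needle).is_eq()
    })
}

/// Brute-force search over every (start, end) char range. It is slow,
/// but it shows that the length of a match cannot be assumed up front.
fn any_length_find(haystack: &str, needle: &str) -> Option<(usize, usize)> {
    if toy_key(needle).is_empty() {
        return Some((0, 0));
    }
    let h: Vec<char> = haystack.chars().collect();
    for start in 0..h.len() {
        for end in start + 1..=h.len() {
            let s: String = h[start..end].iter().collect();
            if toy_compare(&s, needle).is_eq() {
                return Some((start, end));
            }
        }
    }
    None
}
```

Searching the decomposed string "re\u{0301}sume\u{0301}" for "resume" finds nothing with `naive_find`, because every six-char window contains at least one ignorable accent and so compares as a shorter key. `any_length_find` does find a match, and it spans seven chars rather than six.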