String search with collators #3981

faassen · 2023-08-31T15:05:21Z

I looked through the project to figure out how to do collator-aware string search, such as substring matching, but I couldn't find any API in ICU4X that helps me implement this. Is that correct, or did I miss something?

I did some digging. First, this document suggests that the string search algorithm in the other ICU implementations has shifted to a less performant but more accurate linear search:

https://unicode-org.github.io/icu/userguide/collation/string-search.html#performance-and-other-implications

I'm also trying to gauge how difficult it would be to implement linear collation-aware search myself. Since collations can ignore characters, I think a naive implementation won't work: comparing a fixed-length range of characters starting at each position with the collator's compare and checking the resulting Ordering with is_eq() would miss matches whose length in characters differs from the pattern's.
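
To make the concern concrete, here is a minimal sketch that uses a toy stand-in for a collator, not the ICU4X Collator. The names toy_key, toy_compare and naive_find are hypothetical, and the toy rule (ignore case and U+00AD SOFT HYPHEN) is only an assumption chosen to show the effect.

```rust
use std::cmp::Ordering;

/// Toy comparison key: lowercases and drops U+00AD SOFT HYPHEN.
/// Hypothetical stand-in for what a real collator would do internally.
fn toy_key(s: &str) -> Vec<char> {
    s.chars()
        .filter(|&c| c != '\u{00AD}')
        .flat_map(char::to_lowercase)
        .collect()
}

/// Toy replacement for a collator's `compare`.
fn toy_compare(a: &str, b: &str) -> Ordering {
    toy_key(a).cmp(&toy_key(b))
}

/// Naive search: slide a window with as many characters as the needle
/// over the haystack and ask the toy collator whether they are equal.
fn naive_find(haystack: &str, needle: &str) -> Option<usize> {
    let hay: Vec<char> = haystack.chars().collect();
    let width = needle.chars().count();
    if width > hay.len() {
        return None;
    }
    (0..=hay.len() - width).find(|&start| {
        let window: String = hay[start..start + width].iter().collect();
        toy_compare(&window, needle) == Ordering::Equal
    })
}
```

With this toy, naive_find("cooperate", "COOP") finds a match at position 0, but naive_find("co\u{00AD}operate", "coop") finds nothing, even though toy_compare("co\u{00AD}op", "coop") reports equality. The match spans five characters while the window is only four wide.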

But before I delve into this further, is a (linear or not) search implementation planned for ICU4X?

Manishearth (Maintainer) · 2023-08-31T16:37:14Z

I don't know of such a plan, but our plans depend on prospective clients, so if people want something like that we could potentially add it (or accept patches for it). It depends, though; I'm not familiar enough with this API to be sure.

1 reply

faassen Sep 1, 2023
Author

Thanks! I wanted to make sure I am on the right track and wasn't missing something.

I will dig some more into the state of this and report back.

faassen (Author) · 2023-09-01T07:58:17Z

I know XPath is not in any way the real specification here, but it's the perspective I'm coming from. The XPath standard has this section on substring matching in light of collation:

https://www.w3.org/TR/xpath-functions-31/#substring.functions

It talks about "collation units", which it says are the same as Unicode "collation elements", with a reference to "Unicode Technical Standard #10: Unicode Collation Algorithm". So it looks like those algorithms could be supported if we had access to collation elements. XPath implies that not all collations support this.

I looked through the ICU4X source code and the collator does indeed seem to implement this concept, but it looks like none of CollationElement, CollationElement32, or the iterator that produces them is exposed to the outside. At first glance it doesn't look like those APIs are small enough to expose. But perhaps some higher-level matching functionality could be exposed on the Collator, such as a contains method that checks whether string b is contained in string a.
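
As a rough illustration of what such a contains could mean at the collation-element level, here is a sketch over toy elements. ToyElement, toy_elements and this contains are hypothetical and are not part of ICU4X; the toy mapping (lowercase code points, with U+00AD SOFT HYPHEN producing no element) is an assumption made only for the example.

```rust
/// Hypothetical stand-in for a collation element: a primary weight only.
type ToyElement = u32;

/// Toy mapping from text to elements. Ignorable characters (here only
/// U+00AD SOFT HYPHEN) produce no element at all, and case is folded.
fn toy_elements(s: &str) -> Vec<ToyElement> {
    s.chars()
        .filter(|&c| c != '\u{00AD}')
        .flat_map(char::to_lowercase)
        .map(|c| c as u32)
        .collect()
}

/// Returns true if the elements of `b` occur as a contiguous run
/// within the elements of `a`.
fn contains(a: &str, b: &str) -> bool {
    let hay = toy_elements(a);
    let needle = toy_elements(b);
    if needle.is_empty() {
        // A pattern with no elements trivially matches.
        return true;
    }
    hay.windows(needle.len()).any(|w| w == needle.as_slice())
}
```

Because the comparison runs over elements rather than characters, contains("co\u{00AD}operate", "COOP") is true in this toy, while contains("cooperate", "cop") is false.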

As a next step I should examine the older ICU implementations to get a better idea of what such an API should look like.

sffc (Maintainer) · 2023-09-27T05:05:02Z

We've had discussions about search collations in the past, such as #3174 (comment)

Basically, we need a client with a clear and compelling use case who ideally can make some contributions, and then the team can provide mentorship to help land this type of feature.

1 reply

faassen Sep 27, 2023
Author

My use case is mostly informed by the requirements of the XPath specification, so unfortunately not very compelling as the specification is likely heavily informed by what ICU for Java made available at the time it was written. A bit too circular.

Could I ask what you mean by "client" in this context?

ajtribick · 2024-12-07T09:42:59Z

To add to this by providing another use case other than the XPath one mentioned earlier...

I am looking into replacing the search capabilities in the space simulator Celestia with a more Unicode-aware version. As we're a C++ application, we're currently using ICU4C (and we have to use the C API, because OS-provided ICU implementations, e.g. on Windows, provide this and not the C++ API). I've been considering ICU4X for a while, though, as it looks like it would be friendlier for the web implementation and would avoid a bunch of UTF-8/UTF-16 back-and-forth conversions. From experimenting, it feels like there is definitely room to provide some more useful APIs than are currently available via ICU4C.

What I actually want to do here is prefix matching using the asymmetric search described in TR10, where unmarked characters act as wildcards that match any form of the primary character. This is due to a few situations where case distinctions are important:

- Latin-letter Bayer designations, e.g. B Centauri and b Centauri are different stars.
- Spelled-out forms of Greek letter Bayer designations and variable star designations, e.g. Mu Cassiopeiae (Greek letter mu) vs MU Cassiopeiae. This is language dependent: in German the conflicts are My/MY and Ny/NY.
- Multiple star system components, e.g. Castor AB is the subsystem of the Castor multiple star system comprising the two binaries Castor A and Castor B, while Castor Ab refers to the secondary star of the Castor A binary.

The autocomplete functionality also suggests the use of the search collations (which currently do not seem to be supported by ICU4X) so that, e.g. searching for "c" in Slovak will also bring up names starting with "ch", which would otherwise not happen due to the main collation treating the "ch" digraph as a separate letter that sorts next to "h".
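
A toy sketch of why the contraction gets in the way of prefix matching. This does not use real CLDR data: toy_elements and starts_with are hypothetical, and the contract_ch flag merely imitates a main collation that treats "ch" as one unit versus a search collation that, as assumed here, does not.

```rust
/// Split text into toy collation elements. With `contract_ch` set,
/// "ch" becomes a single element, imitating a main collation in which
/// the digraph is a separate letter. Case is folded for simplicity.
fn toy_elements(s: &str, contract_ch: bool) -> Vec<String> {
    let chars: Vec<char> = s.chars().flat_map(char::to_lowercase).collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if contract_ch && chars[i] == 'c' && chars.get(i + 1) == Some(&'h') {
            out.push("ch".to_string());
            i += 2;
        } else {
            out.push(chars[i].to_string());
            i += 1;
        }
    }
    out
}

/// Prefix match on toy elements rather than on characters.
fn starts_with(text: &str, prefix: &str, contract_ch: bool) -> bool {
    toy_elements(text, contract_ch).starts_with(&toy_elements(prefix, contract_ch))
}
```

In this toy, starts_with("Chata", "c", true) is false because the first element of the name is the single unit "ch", whereas starts_with("Chata", "c", false) is true.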

The implementation in ICU4C ends up working as follows:

1. The initial list of candidates for the asymmetric string search matching can be obtained with a primary-strength search collator, with U+FFFF appended to the prefix to set an upper bound on the range of candidate strings, as described in the ICU collation service architecture.
2. The asymmetric search can then be done via the APIs in usearch.h, generated using a default-strength search collator using either USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD or USEARCH_ANY_BASE_WEIGHT_IS_WILDCARD (I'm still experimenting to see which of these feels better).
3. Sort the filtered results with a default-strength non-search collator with numeric sorting turned on and display the result as a list of possible completions to the user.
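
The range lookup in the first step can be sketched with a toy key in place of a real primary-strength collator. toy_primary_key and candidates are hypothetical names, the key (plain lowercasing, compared in code point order) is an assumption, and the list is assumed to be pre-sorted by that key.

```rust
/// Toy primary-strength key: lowercases the text. Hypothetical stand-in
/// for a sort key produced by a primary-strength collator.
fn toy_primary_key(s: &str) -> String {
    s.chars().flat_map(char::to_lowercase).collect()
}

/// Given names sorted by `toy_primary_key`, return the run of names
/// whose key lies in the half-open range [key(prefix), key(prefix) + U+FFFF).
fn candidates<'a>(sorted: &'a [&'a str], prefix: &str) -> &'a [&'a str] {
    let lower = toy_primary_key(prefix);
    // Appending U+FFFF gives an upper bound that sorts after every key
    // that extends the prefix with Basic Multilingual Plane characters.
    let mut upper = lower.clone();
    upper.push('\u{FFFF}');
    let start = sorted.partition_point(|name| toy_primary_key(name) < lower);
    let end = sorted.partition_point(|name| toy_primary_key(name) < upper);
    &sorted[start..end]
}
```

For the sorted list ["Alcor", "castor a", "Castor Ab", "Castor B", "Deneb"], candidates(&names, "CASTOR") yields the three Castor entries, which would then go on to the asymmetric filtering and final sort.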

For step 1, I don't see anything in the ICU4X documentation that indicates that this U+FFFF technique is also supported. If it is, it would be good to mention it in the collator documentation (apologies if I overlooked it).

Step 2 is the least ideal part. The ICU4C API does not seem to provide a "match prefix" option that also supports asymmetric search, so I'm instead doing a substring search that will consider any position in the string, which means it is likely doing unnecessary work. There is also the slightly odd situation that while the case-insensitive search via the en_GB collator will return that, e.g. "Ægir" (which comes up as the name of an exoplanet) starts with "a", the search API says that this does NOT contain "a", but it does contain "ae", which feels similar to the Slovak "ch" issue that I was using the search collation to avoid.

It looks like I should be able to construct the prefix-matching algorithm I actually want in step 2 via the collation element iterator, but again this does not quite work. I can find out that in my copy of ICU with the en_GB search collation, the secondary and tertiary weights of "a" (which would count as the unmarked form of the character) are 5 and 5 respectively. However, it doesn't appear to be documented whether these are the weights of ALL unweighted characters in this collation, whether other collations use these same values for unweighted characters, or whether this value of 5 remains constant across different ICU4C versions. In any case, I reckon prefix matching is likely to be a sufficiently common use case that it makes sense to provide an API for it.
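
For illustration, here is a sketch of the asymmetric prefix match described above, written over toy collation elements rather than anything ICU provides. ToyElement, toy_element, asymmetric_starts_with and the COMMON constant are all hypothetical; the value 5 simply mirrors the weight observed above and is not a documented guarantee. The toy mapping handles only letter case and a single accented letter.

```rust
/// Hypothetical weight for the unmarked form of a character.
const COMMON: u8 = 5;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ToyElement {
    primary: u32,
    secondary: u8,
    tertiary: u8,
}

/// Toy mapping: an acute accent on e/E is a secondary difference and
/// uppercase is a tertiary difference. Everything else is unmarked.
fn toy_element(c: char) -> ToyElement {
    let (base, secondary) = match c {
        '\u{00E9}' => ('e', COMMON + 1),
        '\u{00C9}' => ('E', COMMON + 1),
        other => (other, COMMON),
    };
    let tertiary = if base.is_uppercase() { COMMON + 1 } else { COMMON };
    ToyElement {
        primary: base.to_ascii_lowercase() as u32,
        secondary,
        tertiary,
    }
}

/// Prefix match in which an unmarked weight in the pattern acts as a
/// wildcard, while a marked weight must match the text exactly.
fn asymmetric_starts_with(text: &str, pattern: &str) -> bool {
    let text: Vec<ToyElement> = text.chars().map(toy_element).collect();
    let pattern: Vec<ToyElement> = pattern.chars().map(toy_element).collect();
    pattern.len() <= text.len()
        && pattern.iter().zip(&text).all(|(p, t)| {
            p.primary == t.primary
                && (p.secondary == COMMON || p.secondary == t.secondary)
                && (p.tertiary == COMMON || p.tertiary == t.tertiary)
        })
}
```

In this toy, the pattern "b" matches both "B Centauri" and "b Centauri", while the pattern "B" matches only "B Centauri". Likewise "castor a" matches "Castor Ab", but "Castor AB" does not.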
