String matching algorithms #162

SergeyMalashenko · 2021-01-28T11:26:39Z

Good day! I am trying to use your module in a license plate recognition system. Because the recognized strings are not always stable, I solve the clustering problem with your method rapidfuzz::extractOne. On my test data I see the following results and can't explain them:

from rapidfuzz import process

choices = ['7', 'X946PT78', 'X10', 'T']
print(process.extractOne('X94678', choices))

[('7', 90.0, 0), ('X946PT78', 85.71428571428571, 3), ('X10', 29.999999999999996, 2), ('T', 0.0, 1)]

Please help me understand these results.

Answered by maxbachmann


maxbachmann (Maintainer) · 2021-01-28T12:40:04Z

By default, extractOne uses fuzz.WRatio, which combines multiple ratios (I will add an explanation of this to the documentation for 1.0.0):

documentation fuzz.WRatio

Here is the documentation from FuzzyWuzzy on fuzz.WRatio:

#. Run full_process from utils on both strings
#. Short circuit if this makes either string empty
#. Take the ratio of the two processed strings (fuzz.ratio)
#. Run checks to compare the length of the strings
    * If one of the strings is more than 1.5 times as long as the other
      use partial_ratio comparisons - scale partial results by 0.9
      (this makes sure only full results can return 100)
    * If one of the strings is over 8 times as long as the other
      instead scale by 0.6
#. Run the other ratio functions
    * if using partial ratio functions call partial_ratio,
      partial_token_sort_ratio and partial_token_set_ratio
      scale all of these by the ratio based on length
    * otherwise call token_sort_ratio and token_set_ratio
    * all token based comparisons are scaled by 0.95
      (on top of any partial scalars)
#. Take the highest value from these results
   round it and return it as an integer.
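
The length check in the list above can be sketched as a small helper. This is only an illustration of the quoted rules, not the FuzzyWuzzy or RapidFuzz source: the name wratio_strategy is hypothetical, and the behaviour at exactly 1.5 or 8 follows the wording of the docstring and may differ in the real implementations.

```python
from typing import Tuple


def wratio_strategy(len_a: int, len_b: int) -> Tuple[bool, float]:
    """Return (use_partial, scale) for two string lengths.

    Hypothetical helper that mirrors the quoted docstring only.
    """
    shorter, longer = sorted((len_a, len_b))
    if shorter == 0:
        # Empty strings are short-circuited before the length check.
        raise ValueError("empty strings never reach the length check")
    length_ratio = longer / shorter
    if length_ratio <= 1.5:
        # Similar lengths: compare the full strings, no partial scaling.
        return False, 1.0
    if length_ratio > 8:
        # Very different lengths: partial results are scaled by 0.6.
        return True, 0.6
    # Moderately different lengths: partial results are scaled by 0.9.
    return True, 0.9


# Query 'X94678' (6 characters) against '7' (1 character) and 'X946PT78' (8).
print(wratio_strategy(6, 1))
print(wratio_strategy(6, 8))
```

With the lengths from the question, the pair (6, 1) falls into the partial branch with a scale of 0.9, while (6, 8) stays on the full-string comparison.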

Result explanation

For your data, comparing the choices with 'X94678' gives the following results:

'7': The result is 90 because the query is six times as long as the choice, so partial_ratio is used and weighted by 0.9. Since '7' is a substring of the query, partial_ratio returns 100, which gives 100 * 0.9 = 90.
'X946PT78': The two strings have similar lengths (6 vs. 8 characters), so fuzz.ratio is used as the result.
'X10': As with '7', this uses partial_ratio weighted by 0.9 -> 33.333 * 0.9
'T': This character is not part of the query, so every similarity ratio is 0.
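
To make these numbers easier to follow, here is a standard-library sketch that reproduces them. It is a simplification under stated assumptions, not RapidFuzz's code: ratio is computed as 2 * LCS / (len(a) + len(b)), partial_ratio slides a window of the shorter string's length over the longer one, and the token-based ratios and the preprocessing step are left out (they do not change the result for single-word, uppercase plates). All function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ratio(a: str, b: str) -> float:
    """Similarity in percent over the full strings."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def partial_ratio(a: str, b: str) -> float:
    """Best ratio of the shorter string against equally long windows of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0
    width = len(shorter)
    return max(
        ratio(shorter, longer[i:i + width])
        for i in range(len(longer) - width + 1)
    )


def simplified_wratio(query: str, choice: str) -> float:
    """WRatio-style score without token ratios or preprocessing (assumption)."""
    if not query or not choice:
        return 0.0
    shorter, longer = sorted((len(query), len(choice)))
    length_ratio = longer / shorter
    base = ratio(query, choice)
    if length_ratio <= 1.5:
        return base
    scale = 0.6 if length_ratio > 8 else 0.9
    return max(base, scale * partial_ratio(query, choice))


choices = ['7', 'X946PT78', 'X10', 'T']
scores = {choice: simplified_wratio('X94678', choice) for choice in choices}
for choice, score in scores.items():
    print(choice, round(score, 2))
```

Under these assumptions the scores come out at roughly 90, 85.71, 30 and 0, in line with the output reported in the question.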

Suggestion

For your use case WRatio is a poor fit for several reasons:

It uses pretty much all ratios, e.g. sorted ratios and set ratios, which have no effect in your use case, since I assume each license plate consists of a single word.
I assume you want the edit distance for the full license plate, which, as described above, is not always what WRatio measures, since it uses partial_ratio for strings with big length differences.
It is the slowest ratio, since it combines many string similarity measures.

I would suggest using fuzz.ratio. It compares the whole string and is a lot faster.

process.extractOne('X94678', choices, scorer=fuzz.ratio)

This will still use a preprocessor that lowercases the string, removes non-alphanumeric characters, and trims whitespace at the beginning and end. If your input and choices are already preprocessed, you can save a lot of time by deactivating it:

process.extractOne('X94678', choices, scorer=fuzz.ratio, processor=None)
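
If the plates are normalized once up front, passing processor=None is safe. The following is a rough standard-library approximation of the preprocessing described above, useful for cleaning the data a single time before matching. The name simple_process is hypothetical, and the choice to replace non-alphanumeric characters with spaces (rather than drop them) is an assumption; RapidFuzz's own processor may differ in details.

```python
import re


def simple_process(text: str) -> str:
    """Lowercase, blank out non-alphanumeric characters, trim both ends.

    Hypothetical stand-in for the library's default preprocessing.
    """
    # [\W_] matches every character that is not a letter or a digit.
    cleaned = re.sub(r"[\W_]", " ", text)
    return cleaned.strip().lower()


raw_choices = [' X946-PT78 ', 'x10', 'T']
# Normalize once, then reuse the cleaned list for every query.
clean_choices = [simple_process(choice) for choice in raw_choices]
print(clean_choices)
```

The cleaned list can then be passed as choices together with an equally cleaned query, so the per-comparison preprocessing cost is paid only once.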

Once v1.0.0 is released, fuzz.ratio will become a lot faster. Here is a performance comparison for the upcoming release:

[Figure: performance comparison for v1.0.0]

For comparison, the current performance is shown in the following graph:

[Figure: current performance]

So this performance difference will only grow. By the way, fuzz.QRatio is similar to fuzz.ratio but with preprocessing activated, so it shows the impact preprocessing has on performance.

SergeyMalashenko (Author) · 2021-01-30T07:15:28Z

Thank you for your answer. I took your advice.
