Merge #241
241: Implement the CharNormalizer trait on the LowercaseNormalizer struct r=ManyTheFish a=Bradshaw

# Pull Request

## Related issue
Fixes #239 

## What does this PR do?
- Implements `CharNormalizer` for `LowercaseNormalizer`
- Removes the previous `Normalizer` trait implementation
- Updates tests to include an initialized `char_map` field on the produced `Token`
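
To illustrate why the `char_map` has to be kept in sync, here is a minimal, standalone sketch that does not use charabia. It assumes each `char_map` entry is `(original byte length, normalized byte length)`, which the `split_at` and `len()` calls in the removed code suggest. The function name `lowercase_with_char_map` is hypothetical and exists only for this example.

```rust
/// Hypothetical helper (not part of charabia): lowercases `text` one character
/// at a time and records, for each original character, the pair
/// (original byte length, lowercased byte length).
fn lowercase_with_char_map(text: &str) -> (String, Vec<(u8, u8)>) {
    let mut lemma = String::with_capacity(text.len());
    let mut char_map = Vec::with_capacity(text.chars().count());
    for c in text.chars() {
        let before = lemma.len();
        // `char::to_lowercase` returns an iterator because one character
        // may lowercase to several characters.
        lemma.extend(c.to_lowercase());
        char_map.push((c.len_utf8() as u8, (lemma.len() - before) as u8));
    }
    (lemma, char_map)
}
```

For ASCII input every entry is `(1, 1)`, as in the updated tests. A character such as `İ` (U+0130) lowercases to `i` followed by U+0307, so its entry would be `(2, 3)` under the assumption above.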

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  - Yes, and as explained in #239, it leads to a correctly populated `char_map`
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

## Some quick notes

I based my implementation on [`CompatibilityDecompositionNormalizer`](https://github.com/meilisearch/charabia/blob/de62ab9b889061126b0a8473aa53bf6288a99679/charabia/src/normalizer/compatibility_decomposition.rs#L17-L31).

This leads to duplicated code. I considered adding a function that both implementations could share; if that seems more appropriate, I can propose an alternative solution.
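
For reference, a shared function could look roughly like the sketch below. This is only an illustration under assumptions: the `CharOrStr` enum here is a local stand-in for the crate's type, and `collapse_chars` and `lowercase_char` are hypothetical names that do not exist in charabia.

```rust
use std::iter::once;

/// Local stand-in for charabia's `CharOrStr`, defined here only so the
/// sketch compiles on its own.
#[derive(Debug, PartialEq)]
enum CharOrStr {
    Char(char),
    Str(String),
}

/// Hypothetical shared helper: turns the output of a per-character
/// normalization into a single char when possible, a String otherwise,
/// and `None` when the normalization produced nothing.
fn collapse_chars(mut chars: impl Iterator<Item = char>) -> Option<CharOrStr> {
    match (chars.next(), chars.next()) {
        // Exactly one character: no allocation needed.
        (Some(c), None) => Some(CharOrStr::Char(c)),
        // Two or more characters: collect them all into a String.
        (Some(first), Some(second)) => {
            let s: String = once(first).chain(once(second)).chain(chars).collect();
            Some(CharOrStr::Str(s))
        }
        (None, _) => None,
    }
}

/// How the lowercase normalizer could delegate to the helper.
fn lowercase_char(c: char) -> Option<CharOrStr> {
    collapse_chars(c.to_lowercase())
}
```

With a helper of this shape, each normalizer would only need to supply its own character iterator, and the single-char fast path would live in one place.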

Co-authored-by: Gaeel Bradshaw <gaeel@spaceshipsin.space>
meili-bors[bot] and Bradshaw authored Sep 19, 2023
2 parents 5c3d09a + f171034 commit 944cf62
Showing 1 changed file with 40 additions and 24 deletions.
64 changes: 40 additions & 24 deletions charabia/src/normalizer/lowercase.rs
@@ -1,37 +1,28 @@
use std::borrow::Cow;
use std::iter::once;

use super::{Normalizer, NormalizerOption};
use super::{CharNormalizer, CharOrStr};
use crate::detection::Script;
use crate::Token;

/// A global [`Normalizer`] lowercasing characters.
///
pub struct LowercaseNormalizer;

impl Normalizer for LowercaseNormalizer {
// lowercasing characters can change the characters length, so we need
// to make sure that the char mapping is correct and remap it if necessary.
// <https://github.com/meilisearch/charabia/pull/234>
fn normalize<'o>(&self, mut token: Token<'o>, _options: &NormalizerOption) -> Token<'o> {
match token.char_map.take() {
Some(char_map) => {
let mut new_lemma = String::with_capacity(token.lemma.len());
let mut new_char_map = Vec::with_capacity(char_map.len());
let mut s = token.lemma.as_ref();
for (orig_len, new_len) in char_map {
let (chunk, tail) = s.split_at(new_len as usize);
s = tail;
let lowercased_chunk = chunk.to_lowercase();
new_char_map.push((orig_len, lowercased_chunk.len() as u8));
new_lemma.push_str(&lowercased_chunk);
}
token.lemma = Cow::Owned(new_lemma);
token.char_map = Some(new_char_map);
impl CharNormalizer for LowercaseNormalizer {
fn normalize_char(&self, c: char) -> Option<CharOrStr> {
let mut normalized = c.to_lowercase();

// if the original character is converted in exactly one character,
// then we return the character directly instead of creating a string for it.
match (normalized.next(), normalized.next()) {
(Some(c), None) => Some(c.into()),
(Some(first), Some(second)) => {
let normalized: String =
once(first).chain(once(second)).chain(normalized).collect();
Some(normalized.into())
}
None => token.lemma = Cow::Owned(token.lemma().to_lowercase()),
(None, _) => None,
}

token
}

fn should_normalize(&self, token: &Token) -> bool {
@@ -46,6 +37,7 @@ mod test {
use std::borrow::Cow::Owned;

use crate::normalizer::test::test_normalizer;
use crate::normalizer::{Normalizer, NormalizerOption};
use crate::token::TokenKind;

fn tokens() -> Vec<Token<'static>> {
@@ -64,6 +56,18 @@ mod test {
char_end: 10,
byte_end: 10,
script: Script::Latin,
char_map: Some(vec![
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
]),
..Default::default()
}]
}
@@ -75,6 +79,18 @@ mod test {
byte_end: 10,
script: Script::Latin,
kind: TokenKind::Word,
char_map: Some(vec![
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
]),
..Default::default()
}]
}