Add an option to customize separators #215

Merged: 12 commits, Jun 28, 2023


@ManyTheFish ManyTheFish commented May 29, 2023

This PR adds a pre-segmenter that splits text on a list of separators, which can optionally be supplied when building the Tokenizer. The classifier uses the same list to decide whether a lemma is a separator.
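To illustrate the idea, here is a minimal, self-contained sketch of a separator-driven pre-segmenter and classifier. It is not the code from this PR: the names `pre_segment`, `classify`, and `Kind` are hypothetical, and the choice to keep separators as their own lemmas and to prefer the longest match is an assumption made for the example.

```rust
/// Kind assigned to a lemma in this sketch (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Word,
    Separator,
}

/// Splits `text` on every occurrence of a separator from `separators`.
/// Separators are kept as their own lemmas. When several separators match
/// at the same position, the longest one wins (an assumption of this sketch).
pub fn pre_segment<'t>(text: &'t str, separators: &[&str]) -> Vec<&'t str> {
    let mut lemmas = Vec::new();
    let mut word_start = 0; // byte offset where the pending word begins
    let mut pos = 0; // current byte offset in `text`

    while pos < text.len() {
        let rest = &text[pos..];
        // Longest non-empty separator starting at the current position, if any.
        let matched = separators
            .iter()
            .filter(|sep| !sep.is_empty() && rest.starts_with(**sep))
            .max_by_key(|sep| sep.len());

        match matched {
            Some(sep) => {
                // Flush the word accumulated before the separator.
                if word_start < pos {
                    lemmas.push(&text[word_start..pos]);
                }
                lemmas.push(&text[pos..pos + sep.len()]);
                pos += sep.len();
                word_start = pos;
            }
            None => {
                // Advance by one whole character to stay on a char boundary.
                pos += rest.chars().next().map_or(1, char::len_utf8);
            }
        }
    }
    // Flush the trailing word, if any.
    if word_start < text.len() {
        lemmas.push(&text[word_start..]);
    }
    lemmas
}

/// Classifies a lemma using the same separator list as the pre-segmenter.
pub fn classify(lemma: &str, separators: &[&str]) -> Kind {
    if separators.contains(&lemma) {
        Kind::Separator
    } else {
        Kind::Word
    }
}
```

With the separator list `[" ", "-", "||"]`, the input `foo-bar||baz qux` would be cut into `foo`, `-`, `bar`, `||`, `baz`, a space, and `qux`, and `classify` would mark `-`, `||`, and the space as separators.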

These changes significantly affect the Hebrew and Latin tokenizers, which are based on unicode-segmentation:

  • The Hebrew segmenter has been removed because its segmentation is now done entirely by the pre-segmenter.
  • The Latin segmenter now only segments camel-cased words, and only when that feature is activated.
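As a rough illustration of the remaining Latin behavior, the sketch below splits a word at each lowercase-to-uppercase boundary. It is an assumption-laden example, not this PR's implementation: `split_camel_case` and `segment_latin` are hypothetical names, and a plain boolean stands in for the feature switch.

```rust
/// Splits `word` at every lowercase-to-uppercase boundary,
/// e.g. between the `l` and `C` of `camelCase`.
pub fn split_camel_case(word: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0; // byte offset where the current part begins
    let mut prev_is_lower = false;

    for (idx, ch) in word.char_indices() {
        if prev_is_lower && ch.is_uppercase() {
            parts.push(&word[start..idx]);
            start = idx;
        }
        prev_is_lower = ch.is_lowercase();
    }
    // Push the last part, if the word was not empty.
    if start < word.len() {
        parts.push(&word[start..]);
    }
    parts
}

/// Hypothetical wrapper: the word is split only when the camel-case
/// option is enabled; otherwise it is returned unchanged.
pub fn segment_latin(word: &str, camel_case_enabled: bool) -> Vec<&str> {
    if camel_case_enabled {
        split_camel_case(word)
    } else {
        vec![word]
    }
}
```

Under this rule a run of capitals such as `HTTPServer` stays in one piece, because it contains no lowercase-to-uppercase transition. Handling acronyms differently would need an extra rule.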

About hard separators (or context separators): I didn't want to rely on deunicode to define a hard separator, and I couldn't find an exhaustive list of "sentence/context separators" (as opposed to word separators). I therefore created my own list of context separators, based on several Unicode usages and conventions.
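The distinction between context separators and word separators can be sketched as a lookup against a dedicated list. Everything here is illustrative: the enum and function names are hypothetical, and the small `ILLUSTRATIVE_CONTEXT_SEPARATORS` list is a placeholder invented for the example, not the list shipped in this PR.

```rust
/// Hypothetical separator categories: `Hard` ends a sentence or context,
/// `Soft` only separates words.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SeparatorKind {
    Hard,
    Soft,
}

/// Placeholder list for this example only; the PR's real list is
/// larger and based on Unicode usages and conventions.
const ILLUSTRATIVE_CONTEXT_SEPARATORS: &[&str] = &[".", "!", "?", ";", "\n"];

/// Returns `None` when `lemma` is not a separator at all, otherwise
/// whether it is a hard (context) or soft (word) separator.
pub fn separator_kind(lemma: &str, separators: &[&str]) -> Option<SeparatorKind> {
    if !separators.contains(&lemma) {
        return None;
    }
    if ILLUSTRATIVE_CONTEXT_SEPARATORS.contains(&lemma) {
        Some(SeparatorKind::Hard)
    } else {
        Some(SeparatorKind::Soft)
    }
}
```

Keeping the context list explicit, rather than deriving it from a transliteration step, means that classification depends only on the separator itself.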

@ManyTheFish ManyTheFish force-pushed the segment-n-classify-custom-separators branch from cb5ccb4 to 8828b68 Compare June 22, 2023 14:01
@ManyTheFish ManyTheFish requested review from dureuill and Kerollmops and removed request for dureuill June 28, 2023 10:17
@ManyTheFish ManyTheFish marked this pull request as ready for review June 28, 2023 11:52

@Kerollmops Kerollmops left a comment


Looks good to me 👍
bors merge


meili-bors bot commented Jun 28, 2023

Build succeeded:

  • tests

@ManyTheFish ManyTheFish merged commit e571f74 into main Jun 28, 2023
3 checks passed
@ManyTheFish ManyTheFish deleted the segment-n-classify-custom-separators branch June 28, 2023 12:08