Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(text/unstable): handle non-Latin-script text in slugify #6012

Merged
merged 1 commit into from
Oct 21, 2024

Conversation

lionel-rowe
Copy link
Contributor

Fixes #5830 (2nd attempt, supersedes abandoned #5880).

transliterate option is provided as a hook for transliteration libraries such as npm:any-ascii, without requiring the transliteration library to re-implement word segmenting (which is already handled by slugify).

The empty case is changed from "" to "-" to mitigate against unsafe slugs (URL path /a//b normalizes to /a/b, and often /a/ is also normalized to /a, i.e. an empty slug allows a limited form of path traversal up one level). Maybe that's not necessary as slugs are often combined with some kind of ID, open to arguments either way on that one.

Four strip regexes, ASCII_DIACRITICS, DIACRITICS, NON_ASCII, NON_WORD, are exported, and supplying a custom regex is also supported. Possibly ASCII_DIACRITICS isn't really necessary? It enables a mode in which certain languages (e.g. Spanish) can still have fully ASCII slugs, yet retains diacritics in other contexts (where stripping them is strictly worse, as the slug wouldn't be ASCII even if they were removed).

@nestarz
Copy link

nestarz commented Sep 18, 2024

const wordSegmenter = new Intl.Segmenter("en-US", { granularity: "word" });

Should the locale an option which could be defined by the user instead and defaults to en-US ? I'm not familiar with a lot of languages so I don't know if it's valuable for the purpose of slugify.

@lionel-rowe
Copy link
Contributor Author

Should the locale an option which could be defined by the user instead and defaults to en-US ? I'm not familiar with a lot of languages so I don't know if it's valuable for the purpose of slugify.

Hmm... possibly. In my experience the locale argument to Intl.Segmenter instances rarely makes a difference, so I typically use "en-US" as basically a placeholder, given that it's de-facto "neutral locale" (as if such a thing could exist). You can set it to undefined, but IMO that's a bad idea in library code as it leads to different behavior on different systems, i.e. may run differently on your local machine vs Deno Deploy vs CI/CD workflows vs individual users' browsers etc.

Open to adding locale as an option if a clear use case can be demonstrated here.

Comment on lines +92 to +94
* import transliterate from "npm:any-ascii";
*
* slugify("Συστημάτων Γραφής", { transliterate, strip: NON_ASCII });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice example for demonstrating transliteration using 3rd party library 👍

@kt3k kt3k changed the title fix(text/unstable): handle non-Latin-script text in slugify feat(text/unstable): handle non-Latin-script text in slugify Oct 21, 2024
Copy link
Member

@kt3k kt3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice enhancement for supporting varieties of patterns without increasing much complexity while keeping backward compatibility of the default behavior.

LGTM

@kt3k kt3k merged commit a541fb4 into denoland:main Oct 21, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

text/slugify gives empty results for non-Latin alphabets
3 participants