`text/slugify` gives empty results for non-Latin alphabets #5830
Comments
Here's a rundown of how various platforms handle non-ASCII text in slugs. The transliteration option has one big advantage, namely that the URL remains legible in any context: plaintext files, IM platforms with limited rich-text features, etc. It also typically leads to shorter URLs compared to the percent-encoded version. Still, it's strictly worse when viewed in a browser address bar, adds a massive amount of complexity, including mappings for thousands of CJK characters, and often still leads to suboptimal results (as seen in the rundown).

As for diacritics, 3 of the 7 platforms strip them from Latin-script text, while the other 4 keep them. As with transliteration, stripping leads to more plaintext-friendly URLs; however, diacritics can be semantically important, as some of the platforms' results also illustrate.
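To make the length difference concrete, here's a quick illustration of my own; the transliterated form is an assumed pinyin-style rendering, not output from any particular library:

```ts
// Percent-encoded slug: 73 characters for a 9-character phrase.
encodeURIComponent("三人行-必有我师焉");
// => "%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89"

// An assumed pinyin-style transliteration is less than half that length:
// "san-ren-xing-bi-you-wo-shi-yan" (30 characters)
```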
In the initial implementation it was discussed to port an existing slugify library. The behavior you describe is probably better handled with `toKebabCase`:

```ts
import { toKebabCase } from "@std/text/to-kebab-case";

console.log(toKebabCase("三人行-必有我师焉")); // "三人行-必有我师焉"
```
@timreichen In the state it was merged, that PR isn't a port of the implementation that was originally discussed.
We removed the char mapping because the list was random. The problem is, as you pointed out, that there is no standard, and slugify functionality varies depending on the implementation.
Every slugify function will only work on a subset of use cases; that is why, for example, existing implementations differ so widely.
Why? Again, massive platforms like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr don't obey that rule, and browsers and web APIs handle non-ASCII URL components perfectly fine.
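As a quick sanity check (my own example, not from the thread), the WHATWG `URL` API percent-encodes a non-ASCII path segment transparently and round-trips it:

```ts
const url = new URL("https://example.com/blog/三人行-必有我师焉");

console.log(url.pathname);
// "/blog/%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89"

console.log(decodeURIComponent(url.pathname));
// "/blog/三人行-必有我师焉"
```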
Maybe we could just change the signature of `slugify` so users can provide their own strip regex?

```ts
function slugify(input: string, strip = /[^a-zA-Z0-9\s-]/g): string
```

This way it doesn't really add much more complexity while offering a bit more liberty to end users (who would know best which charset they'd like to support)?

```ts
slugify("déjà-vu", /[^a-zA-Z0-9\s-À-ÖØ-öø-ÿ]/g) // "déjà-vu"
```
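For what it's worth, a minimal sketch of what that signature could look like in practice; this is my own illustration, not the std implementation, and the trimming/collapsing steps are assumptions:

```ts
function slugify(input: string, strip = /[^a-zA-Z0-9\s-]/g): string {
  return input
    .replace(strip, "")       // drop everything outside the allowed set
    .trim()
    .replace(/[\s-]+/g, "-")  // collapse whitespace and dash runs into single dashes
    .toLowerCase();
}

console.log(slugify("Hello, World!")); // "hello-world"
console.log(slugify("déjà-vu", /[^a-zA-Z0-9\s\-À-ÖØ-öø-ÿ]/g)); // "déjà-vu"
```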
@lowlighter That seems to me like it's simultaneously too granular and not customizable enough. Too granular because I can't see any good reason why you'd want to allow some non-ASCII characters but not others; not customizable enough because it still doesn't provide any way of mapping. Something like this could work:

```ts
// slugify.ts
export type SlugifyOptions = {
  /** @default {undefined} */
  charMap: Record<string, string> | undefined,
  /** @default {Boolean(options.charMap)} */
  stripUnknown: boolean,
  /** @default {Boolean(options.charMap || options.stripUnknown)} */
  stripDiacritics: boolean,
}

export function slugify(input: string, options?: Partial<SlugifyOptions>): string
```

```ts
// slugify_char_map.ts
// A comprehensive char mapping (transliteration) from some decently authoritative source
export const charMap = {
  // ...
  я: "ya",
  // ...
  鼎: "ding",
  // ...
}
```

If you really want to opt in to the "nuke everything other than Basic Latin" option for some reason, you could still do that with these options. As for a "decently authoritative source" for the char map, I'm not sure what that would be. https://unicode-org.github.io/icu/userguide/transforms/general/ provides some notes on transliteration, which suggest that a simple character-to-string map may be too naive.
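To illustrate how the proposed options might compose, a hypothetical usage sketch (file names are the ones from the snippet above; all outputs are illustrative, not real):

```ts
// Hypothetical usage of the proposed API; outputs are what the options
// *would* produce, assuming the semantics described above.
import { slugify } from "./slugify.ts";
import { charMap } from "./slugify_char_map.ts";

slugify("я 鼎");                               // pass-through: "я-鼎"
slugify("я 鼎", { charMap });                  // transliterated: "ya-ding"
slugify("déjà vu", { stripDiacritics: true }); // "deja-vu"
```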
I'm happy to defer to the consensus of others on this issue, but if we go down the route of having a character map, best it be a well-maintained one from an authoritative source.
Looks like the requisite ICU data for transliteration is here: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/translit/. With some truly disgusting regex-based """parsing""" of the data files, I was able to get some initial results.

Limitations: Japanese will always give bad results for Kanji; Arabic lacks most vowels (I think that's because the vowels aren't indicated in the source text in the first place, so there's no way round that); the Greek is currently based on Ancient Greek transliteration.
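For context on why parsing this data is painful (my own toy illustration; the rule snippet is made up in the general shape of ICU transform rules, not copied from the actual files):

```ts
// Toy excerpt in the general shape of ICU transform rules (illustrative only):
const rules = `
θ > th;
φ > ph;
σ } [:^Letter:] > s;
`;

// A naive regex-based "parse" that only understands simple context-free rules;
// the contextual third rule is silently skipped, which is roughly why a real
// grammar (peggy) ends up being necessary.
const mappings = [...rules.matchAll(/^([^;{}\[\]]+?)\s*>\s*([^;]+);/gm)]
  .map(([, from, to]) => [from.trim(), to.trim()]);

console.log(mappings); // [["θ", "th"], ["φ", "ph"]]
```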
OK, switched to using a custom peggy grammar to parse the ICU mappings, and I'm now getting what seem to be decent results for all the languages I've tested (as good as or better than the results from other general-purpose transliteration libraries I've looked at).

The char map is ~213KB (un-minified, un-gzipped). IMO those are "good enough" results at this stage (given that the default will be not to transliterate), but it'd be good to get some input from speakers of a few more of these languages. You can also try it out with other languages here: https://dash.deno.com/playground/slugify
Upon testing more languages (list taken from an external source), the results were more mixed.
While I personally found this research interesting, I think it's difficult to do these transliterations in an unopinionated way. It also seems difficult to maintain them, as the maintainers are not knowledgeable about many of these languages. I'd consider the handling of non-Latin-alphabet languages out of scope for this API.
@kt3k My main concern isn't that non-Latin script should have special handling, rather that it should be passed through rather than removed (and especially that it shouldn't be removed as a default option). It's worth mentioning that in its current state, the function returns an empty string for such input. I only started looking into transliteration, which IMO is a less-good option compared to pass-through (not to mention significantly less common in the wild), as an alternative.

With all that said... I'm inclined to think you're probably right. Further, @lowlighter's suggestion of a strip regex is a useful option after all, but with suggested regexes being exported from the package itself (roll-your-own is probably less useful). With that option, you can easily implement pass-through (default), strip, strip-diacritics, or even strip-only-ascii-diacritics behavior. The regex would be run against the NFD-normalized input:

```ts
export const NON_WORD = /[^\p{L}\p{M}\p{N}\-]+/gu;
export const DIACRITICS = /[^\p{L}\p{N}\-]+/gu;
export const ASCII_DIACRITICS = /(?<=[a-zA-Z])\p{M}+|[^\p{L}\p{M}\p{N}\-]+/gu;
export const NON_ASCII = /[^0-9a-zA-Z\-]/g;
```
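A minimal sketch of how those exports could be wired into `slugify` (my assumption of the mechanics, not actual std code): whitespace becomes dashes first, then the strip regex deletes whatever it matches.

```ts
export function slugify(input: string, options?: { strip?: RegExp }): string {
  return input
    .trim()
    .normalize("NFD")                        // decompose, so \p{M} matches combining marks
    .replace(/\s+/g, "-")                    // whitespace becomes dashes
    .replace(options?.strip ?? NON_WORD, "") // delete disallowed characters/marks
    .normalize("NFC")                        // recompose what's left
    .toLowerCase();
}
```

The assertions below pass against this sketch.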
```ts
// NON_WORD (the default)
assertEquals(slugify("déjà-vu"), "déjà-vu");
assertEquals(slugify("Συστημάτων Γραφής"), "συστημάτων-γραφής");

assertEquals(slugify("déjà-vu", { strip: DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: DIACRITICS }), "συστηματων-γραφης");

assertEquals(slugify("déjà-vu", { strip: ASCII_DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: ASCII_DIACRITICS }), "συστημάτων-γραφής");

assertEquals(slugify("déjà-vu", { strip: NON_ASCII }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: NON_ASCII }), "-");
```

Further, you could easily use a third-party transliteration library along with `NON_ASCII`:

```ts
import transliterate from 'npm:any-ascii'

assertEquals(slugify(transliterate("Συστημάτων Γραφής"), { strip: NON_ASCII }), "systimaton-grafis");
```
Ah ok.
Describe the bug

`slugify` gives readable results for only a subset of languages, namely those that use the Latin alphabet (English, Spanish, Vietnamese...). For all other languages (Chinese, Arabic, Russian...), it just returns an empty string.

Steps to Reproduce

Expected behavior

`slugify` to return non-Latin-alphabet words unchanged (but still stripping start/end punctuation and replacing medial punctuation with dashes). Something like the following:
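For instance, with expected outputs shown in comments (these are the desired results, not what the current implementation returns):

```ts
// Expected behavior: pass non-Latin text through, replace medial
// punctuation/whitespace with dashes, strip leading/trailing punctuation.
slugify("三人行,必有我师焉。"); // expected: "三人行-必有我师焉"
slugify("Déjà Vu!");            // expected: "déjà-vu"
```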
IMO the `slugify` function should not handle percent encoding, instead leaving that up to the calling code (e.g. passing to the `URL` constructor or `URLSearchParams`):

- `url1`: https://example.com/blog/%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89
- `url2`: https://example.com/?q=%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89
- `url3`: https://xn----io6a1vz30ct6b22enxlhzue11c.example.com/

While it's true they're unreadable in the percent-encoded/punycode form, modern browsers display the human-readable form automatically on hover, or in the address bar when you open them. At some point it might also be worth adding a `prettifyUrl` function to allow similar conversion in userland, but I'm leaving that out of scope for this issue.
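For the record, a hypothetical `prettifyUrl` sketch (the name comes from the paragraph above; the implementation is my assumption and deliberately ignores punycode hostnames):

```ts
function prettifyUrl(url: string | URL): string {
  // Decode percent-escapes for display; does not decode punycode hostnames.
  return decodeURIComponent(new URL(url).href);
}

prettifyUrl("https://example.com/blog/%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89");
// "https://example.com/blog/三人行-必有我师焉"
```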
Environment

- OS: Ubuntu 20.04, WSL
- deno version: 1.46.0
- std version: text@1.0.4