CharsetDetect is a simple wrapper around the chardetng crate.
Guess the encoding of a string:
iex> File.read!("test/assets/sjis.txt") |> CharsetDetect.guess
{:ok, "Shift_JIS"}
iex> File.read!("test/assets/big5.txt") |> CharsetDetect.guess!
"Big5"
To limit memory use on long inputs, consider passing only the beginning of the text to the detector:
iex> "... (long text) ..." |> String.slice(0, 1024) |> CharsetDetect.guess
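Because the input is usually not valid UTF-8 yet, a byte-based cut may be preferable to a grapheme-based one. The following is a minimal sketch, not part of CharsetDetect: the module name `Sample` and the function `head/2` are hypothetical names introduced here for illustration. It uses only `binary_part/3`, `byte_size/1`, and `min/2` from Elixir's Kernel.

```elixir
defmodule Sample do
  @moduledoc "Hypothetical helper for illustration; not part of CharsetDetect."

  # Return at most `max_bytes` leading bytes of `data`.
  # The cut is made on bytes, so the binary is not scanned for graphemes.
  @spec head(binary, non_neg_integer) :: binary
  def head(data, max_bytes \\ 1024)
      when is_binary(data) and is_integer(max_bytes) and max_bytes >= 0 do
    binary_part(data, 0, min(byte_size(data), max_bytes))
  end
end
```

The result could then be piped into the detector, for example `File.read!("test/assets/sjis.txt") |> Sample.head() |> CharsetDetect.guess`. A byte-based cut may land in the middle of a multibyte character, so the guess on a short sample might differ from the guess on the full text.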
Note that an ASCII string, including an empty string, will result in a UTF-8 encoding rather than ASCII.
iex> "hello world" |> CharsetDetect.guess
{:ok, "UTF-8"}
To convert text to any desired encoding, combine the guessed encoding with iconv:
defmodule Converter do
  @spec convert(binary, String.t()) :: {:ok, binary} | {:error, String.t()}
  def convert(text, to_encoding \\ "UTF-8") do
    case text |> String.slice(0, 1024) |> CharsetDetect.guess do
      {:ok, ^to_encoding} ->
        {:ok, text}

      {:ok, encoding} ->
        try do
          {:ok, :iconv.convert(encoding, to_encoding, text)}
        rescue
          e in ArgumentError -> {:error, inspect(e)}
        end

      {:error, reason} ->
        {:error, reason}
    end
  end
end
iex> File.read!("test/assets/big5.txt") |> Converter.convert
{:ok, "大五碼是繁体中文(正體中文)社群最常用的電腦漢字字符集標準。\n"}
The package can be installed by adding charset_detect to your list of dependencies in mix.exs:
def deps do
  [
    {:charset_detect, "~> 0.1.0"}
  ]
end
Then, run mix deps.get.
Note: This library requires the Rust Toolchain for compilation.
Follow the instructions at www.rust-lang.org/tools/install to install Rust.
Verify the installation by checking the cargo command version:
cargo --version
# Should output something like: cargo 1.82.0 (8f40fc59f 2024-08-21)
Then, set the RUSTLER_PRECOMPILATION_EXAMPLE_BUILD environment variable to ensure that local sources are compiled instead of downloading a precompiled library file.
RUSTLER_PRECOMPILATION_EXAMPLE_BUILD=1 mix compile
The MIT License