groonga-normalizer-mysql
groonga-normalizer-mysql is a Groonga plugin. It provides MySQL-compatible normalizers and custom normalizers to Groonga.
Here are the MySQL-compatible normalizers:
- NormalizerMySQLGeneralCI for utf8mb4_general_ci
- NormalizerMySQLUnicodeCI for utf8mb4_unicode_ci
- NormalizerMySQLUnicode520CI for utf8mb4_unicode_520_ci
- NormalizerMySQLUnicode900 for utf8mb4_0900_ai_ci, utf8mb4_0900_as_ci, utf8mb4_0900_as_cs, utf8mb4_ja_0900_as_cs and utf8mb4_ja_0900_as_cs_ks
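The pairing of collations and normalizers listed above can be written down as a small lookup table. This is only a sketch for choosing a normalizer name from a MySQL collation name. The names NORMALIZER_BY_COLLATION and normalizer_for_collation are hypothetical and not part of the plugin, and this README does not say how NormalizerMySQLUnicode900 tells its several collations apart, so the sketch returns the normalizer name only.

```python
# Collation -> normalizer pairs copied from the list in this README.
# The dict and function names are hypothetical helpers, not plugin API.
NORMALIZER_BY_COLLATION = {
    "utf8mb4_general_ci": "NormalizerMySQLGeneralCI",
    "utf8mb4_unicode_ci": "NormalizerMySQLUnicodeCI",
    "utf8mb4_unicode_520_ci": "NormalizerMySQLUnicode520CI",
    "utf8mb4_0900_ai_ci": "NormalizerMySQLUnicode900",
    "utf8mb4_0900_as_ci": "NormalizerMySQLUnicode900",
    "utf8mb4_0900_as_cs": "NormalizerMySQLUnicode900",
    "utf8mb4_ja_0900_as_cs": "NormalizerMySQLUnicode900",
    "utf8mb4_ja_0900_as_cs_ks": "NormalizerMySQLUnicode900",
}


def normalizer_for_collation(collation):
    """Return the normalizer name listed for a collation, or None if unlisted."""
    return NORMALIZER_BY_COLLATION.get(collation)


print(normalizer_for_collation("utf8mb4_unicode_520_ci"))
```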
Here are the custom normalizers:
- NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark
  - It's based on NormalizerMySQLUnicodeCI
- NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark
  - It's based on NormalizerMySQLUnicode520CI
Their names are self-descriptive but long. They are variants of NormalizerMySQLUnicodeCI and NormalizerMySQLUnicode520CI with different behaviors. The differences are described here in terms of NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark, but they also hold for NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark.
NormalizerMySQLUnicodeCI normalizes all small Hiragana such as ぁ, っ to Hiragana such as あ, つ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ぁ to あ nor っ to つ. ぁ and あ are different characters. っ and つ are also different characters. This behavior is described by ExceptKanaCI in the long name. The following behaviors are described by ExceptKanaWithVoicedSoundMark in the long name.

NormalizerMySQLUnicodeCI normalizes all Hiragana with voiced sound mark such as が to Hiragana without voiced sound mark such as か. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize が to か. が and か are different characters.

NormalizerMySQLUnicodeCI normalizes all Hiragana with semi-voiced sound mark such as ぱ to Hiragana without semi-voiced sound mark such as は. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ぱ to は. ぱ and は are different characters.

NormalizerMySQLUnicodeCI normalizes all Katakana with voiced sound mark such as ガ to Katakana without voiced sound mark such as カ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize ガ to カ. ガ and カ are different characters.

NormalizerMySQLUnicodeCI normalizes all Katakana with semi-voiced sound mark such as パ to Katakana without semi-voiced sound mark such as ハ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't normalize パ to ハ. パ and ハ are different characters.

NormalizerMySQLUnicodeCI normalizes all halfwidth Katakana with voiced sound mark such as ｶﾞ to halfwidth Katakana without voiced sound mark such as ｶ. NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark normalizes all halfwidth Katakana with voiced sound mark such as ｶﾞ to fullwidth Katakana with voiced sound mark such as ガ.
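The kana behaviors described above can be imitated with a toy model built on Python's unicodedata module. This is a sketch only: the function names and the small-kana table are hypothetical, it does not use the plugin's actual normalization table, and it covers nothing beyond the kana cases named in this README.

```python
import unicodedata

# Hypothetical table: small kana and their regular-size counterparts.
SMALL_TO_LARGE = str.maketrans(
    "ぁぃぅぇぉっゃゅょゎァィゥェォッャュョヮ",
    "あいうえおつやゆよわアイウエオツヤユヨワ",
)
# Combining and halfwidth (semi-)voiced sound marks.
SOUND_MARKS = {"\u3099", "\u309a", "\uff9e", "\uff9f"}


def toy_unicode_ci(text):
    """Imitate the kana folding described for NormalizerMySQLUnicodeCI."""
    # Split each kana from its sound mark, drop the mark, then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in SOUND_MARKS)
    return unicodedata.normalize("NFC", stripped).translate(SMALL_TO_LARGE)


def toy_except_kana(text):
    """Imitate the ExceptKanaCIKanaWithVoicedSoundMark variant."""
    # Small kana and sound marks are kept. NFKC only composes halfwidth
    # Katakana plus its sound mark into one fullwidth character.
    return unicodedata.normalize("NFKC", text)


for sample in ["ぁっ", "がぱガパ", "\uff76\uff9e"]:
    print(sample, toy_unicode_ci(sample), toy_except_kana(sample))
```

NFKC also rewrites characters unrelated to kana, such as fullwidth Latin letters, so the second function is a convenient stand-in for these examples rather than a description of what the normalizer does to other text.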
NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark and NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark are not MySQL-compatible, but they are useful for Japanese text. For example, ふらつく and ブラック have different meanings. NormalizerMySQLUnicodeCI identifies ふらつく with ブラック, but NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark doesn't identify them.
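To see why ふらつく and ブラック collide under NormalizerMySQLUnicodeCI, the folding can be traced one step at a time. The sketch below is an explanatory model with hypothetical helper names. Splitting the folding into three stages, and the Katakana to Hiragana stage itself, are assumptions inferred from the example above; the real normalizer works from MySQL's table rather than in stages.

```python
import unicodedata

# Only the small kana this example needs (hypothetical, not the full table).
SMALL_TO_LARGE = str.maketrans("っッ", "つツ")


def strip_sound_marks(text):
    """Drop combining voiced and semi-voiced sound marks."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in "\u3099\u309a")
    return unicodedata.normalize("NFC", kept)


def katakana_to_hiragana(text):
    """Shift Katakana U+30A1..U+30F6 down by 0x60 to the matching Hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def trace_fold(word):
    """Return the word after each folding stage, starting with the input."""
    without_marks = strip_sound_marks(word)
    without_small = without_marks.translate(SMALL_TO_LARGE)
    as_hiragana = katakana_to_hiragana(without_small)
    return [word, without_marks, without_small, as_hiragana]


print(" -> ".join(trace_fold("ブラック")))
# ブラック -> フラック -> フラツク -> ふらつく
```

The first two stages are exactly the ones the ExceptKanaCIKanaWithVoicedSoundMark variants skip, which is why ブ and ッ stay distinct from ふ and つ and the two words no longer match.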
Add the apt-line for the Groonga deb package repository and install the groonga-normalizer-mysql package:
% sudo apt-get -y install groonga-normalizer-mysql
On AlmaLinux 8, install the groonga-repository package:
% sudo dnf install -y https://packages.groonga.org/almalinux/8/groonga-release-latest.noarch.rpm
Then install the groonga-normalizer-mysql package:
% sudo dnf install -y --enablerepo=epel groonga-normalizer-mysql
On AlmaLinux 9, install the groonga-repository package:
% sudo dnf install -y https://packages.groonga.org/almalinux/9/groonga-release-latest.noarch.rpm
Then install the groonga-normalizer-mysql package:
% sudo dnf install -y --enablerepo=epel groonga-normalizer-mysql
With Homebrew, install the groonga package (which includes groonga-normalizer-mysql):
% brew install groonga
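On Windows, you need to build from source. Here are the build instructions.
Install the following build tools:
- Microsoft Visual Studio (the cmake commands in these instructions use the "Visual Studio 14 Win64" generator)
- CMake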
Download the latest Groonga source from packages.groonga.org. The source file name is formatted as groonga-X.Y.Z.zip.
Extract the source and move to the source folder:
> cd ...\groonga-X.Y.Z
groonga-X.Y.Z>
Run CMake. Here is a command line to install Groonga to the C:\groonga folder:
groonga-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga
Build:
groonga-X.Y.Z> cmake --build . --config Release
Install:
groonga-X.Y.Z> cmake --build . --config Release --target Install
Download the latest groonga-normalizer-mysql source from packages.groonga.org. The source file name is formatted as groonga-normalizer-mysql-X.Y.Z.zip.
Extract the source and move to the source folder:
> cd ...\groonga-normalizer-mysql-X.Y.Z
groonga-normalizer-mysql-X.Y.Z>
IMPORTANT: Set the PKG_CONFIG_PATH environment variable:
groonga-normalizer-mysql-X.Y.Z> set PKG_CONFIG_PATH=C:\groonga\local\lib\pkgconfig
Run CMake. Here is a command line to install groonga-normalizer-mysql to the C:\groonga folder:
groonga-normalizer-mysql-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga
Build:
groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release
Install:
groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release --target Install
First, you need to register the normalizers/mysql plugin:
groonga> register normalizers/mysql
Then, you can use NormalizerMySQLGeneralCI and NormalizerMySQLUnicodeCI as normalizers:
groonga> table_create Lexicon TABLE_PAT_KEY --default_tokenizer TokenBigram --normalizer NormalizerMySQLGeneralCI
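The same two commands work with any of the normalizers this plugin provides. As a rough sketch, the helper below assembles them as text for a chosen table name and normalizer. The function name groonga_setup_commands is hypothetical, and the code only builds command strings. It does not connect to Groonga.

```python
def groonga_setup_commands(table, normalizer):
    """Return the register and table_create commands for one lexicon table."""
    return [
        "register normalizers/mysql",
        f"table_create {table} TABLE_PAT_KEY "
        f"--default_tokenizer TokenBigram --normalizer {normalizer}",
    ]


# Same setup as the example above, but with NormalizerMySQLUnicodeCI.
for command in groonga_setup_commands("Lexicon", "NormalizerMySQLUnicodeCI"):
    print(command)
```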
- Groonga >= 8.0.4
- English: groonga-talk@lists.sourceforge.net
- Japanese: groonga-dev@lists.sourceforge.jp
- Alexander Barkov <bar@udm.net>: The author of MYSQL_SOURCE/strings/ctype-utf8.c.
- ...
- Kouhei Sutou <kou@clear-code.com>
LGPLv2 only. See doc/text/lgpl-2.0.txt for details.
This program uses normalization tables defined in the MySQL source code, so it is a derived work of MYSQL_SOURCE/strings/ctype-utf8.c, MYSQL_SOURCE/strings/uca900_data.h and MYSQL_SOURCE/strings/uca900_ja_data.h. This program is under the same license as those files, which are licensed under LGPLv2 only.