sentencepiece

This repository contains an R package which is an Rcpp wrapper around the sentencepiece C++ library

sentencepiece is an unsupervised tokeniser which allows to execute text tokenization using
- Byte Pair Encoding
- Unigrams
- Words
- Characters
It is based on the paper SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing [Taku Kudo, John Richardson.]
The sentencepiece C++ code is available at https://github.com/google/sentencepiece
- This package currently wraps release v0.1.96
This R package has similar functionalities as the R package https://github.com/bnosac/tokenizers.bpe

Features

The R package allows you to

build a Byte Pair Encoding (BPE), Unigram, Char or Word model
apply the model to encode text
apply the model to decode ids back to text
download pretrained sentencepiece models built on Wikipedia

Installation

For regular users, install the package from your local CRAN mirror install.packages("sentencepiece")
For installing the development version of this package: remotes::install_github("bnosac/sentencepiece")

Look to the documentation of the functions

help(package = "sentencepiece")

Example on encoding / decoding with a pretrained model built on Wikipedia

library(sentencepiece)
dl    <- sentencepiece_download_model("English", vocab_size = 50000)
model <- sentencepiece_load_model(dl$file_model)
model

Sentencepiece model
  size of the vocabulary: 50000
  model stored at: C:/Users/Jan/Documents/R/win-library/3.5/sentencepiece/models/en.wiki.bpe.vs50000.model

txt <- c("Give me back my Money or I'll call the police.",
         "Talk to the hand because the face don't want to hear it any more.")
txt <- tolower(txt)
sentencepiece_encode(model, txt, type = "subwords")

[[1]]
 [1] "▁give"   "▁me"     "▁back"   "▁my"     "▁money"  "▁or"     "▁i"      "'"       "ll"      "▁call"   "▁the"    "▁police" "."      

[[2]]
 [1] "▁talk"    "▁to"      "▁the"     "▁hand"    "▁because" "▁the"     "▁face"    "▁don"     "'"        "t"        "▁want"    "▁to"      "▁hear"    "▁it"      "▁any"     "▁more"    "."

sentencepiece_encode(model, txt, type = "ids")

[[1]]
 [1]  3090   352   810  1241  2795   127   386 49937  1188   612     7  2142 49935

[[2]]
 [1]  4252    42     7  1197   936     7  3227  1616 49937 49915  4451    42  6800   107   756   407 49935

Example on training

As an example, let's take some training data containing questions asked in Belgian Parliament in 2017 and focus on French text only.

library(tokenizers.bpe)
library(sentencepiece)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
writeLines(text = x$text, con = "traindata.txt")

Train a model on text data and inspect the vocabulary

model <- sentencepiece("traindata.txt", type = "bpe", coverage = 0.999, vocab_size = 5000, 
                       model_dir = getwd(), verbose = FALSE)
model

Sentencepiece model
  size of the vocabulary: 5000
  model stored at: sentencepiece.model

str(model$vocabulary)

'data.frame':	5000 obs. of  2 variables:
 $ id     : int  0 1 2 3 4 5 6 7 8 9 ...
 $ subword: chr  "<unk>" "<s>" "</s>" "es" ...

Use the model to encode text

text <- c("L'appartement est grand & vraiment bien situe en plein centre",
          "Proportion de femmes dans les situations de famille monoparentale.")
sentencepiece_encode(model, x = text, type = "subwords")

[[1]]
 [1] "▁L"      "'"       "app"     "ar"      "tement"  "▁est"    "▁grand"  "▁"       "&"       "▁v"      "r"       "ai"      "ment"    "▁bien"   "▁situe"  "▁en"     "▁plein"  "▁centre"

[[2]]
 [1] "▁Pro"        "por"         "tion"        "▁de"         "▁femmes"     "▁dans"       "▁les"        "▁situations" "▁de"         "▁famille"    "▁mon"        "op"          "ar"          "ent"         "ale"         "."

sentencepiece_encode(model, x = text, type = "ids")

[[1]]
 [1]   75 4951  252   31  461  109  960 4934    0   49 4941   34   32  585 4225   44 3356 1915

[[2]]
 [1] 1362 4159   25    9 2060   93   40 3825    9 2923  705  247   31   19  116 4953

Use the model to decode byte pair encodings back to text

x <- sentencepiece_encode(model, x = text, type = "ids")
sentencepiece_decode(model, x)

[[1]]
[1] "L'appartement est grand  ⁇  vraiment bien situe en plein centre"

[[2]]
[1] "Proportion de femmes dans les situations de famille monoparentale."

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
.github/workflows		.github/workflows
R		R
inst		inst
man		man
src		src
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.note		LICENSE.note
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
sentencepiece.Rproj		sentencepiece.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

sentencepiece

Features

Installation

Example on encoding / decoding with a pretrained model built on Wikipedia

Example on training

Support in text mining

About

Licenses found

Releases 7

Packages

Contributors 2

Languages

License

Licenses found

bnosac/sentencepiece

Folders and files

Latest commit

History

Repository files navigation

sentencepiece

Features

Installation

Example on encoding / decoding with a pretrained model built on Wikipedia

Example on training

Support in text mining

About

Topics

Resources

License

Licenses found

Stars

Watchers

Forks

Releases 7

Packages 0

Contributors 2

Languages

Packages