TL;DR: The authors train a deep sequence-to-sequence LSTM directly on byte-level input from several languages (shuffling the examples of all languages together) and apply it to NER and POS tagging, achieving state-of-the-art results or close to them. The model outputs spans of the form [START_POSITION, LENGTH, LABEL], where each span element is a separate token prediction. A single model works well for all languages and learns shared high-level representations. The authors also present a novel way to drop out input tokens (bytes in their case) by randomly replacing them with a DROP symbol.
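
To make the output format concrete, here is a minimal sketch of how I picture a labeled segment being flattened into the decoder's target sequence. The token naming (START_*, LEN_*, LABEL_*) and the explicit STOP marker are my assumptions, not taken from the paper.

```python
# Minimal sketch of the byte-to-span output encoding as I understand it.
# The exact token vocabulary (separate start/length/label token types and
# an explicit STOP symbol) is my assumption, not taken from the paper.

def encode_spans(text, annotations):
    """Turn (start, length, label) annotations over a byte segment into
    the flat output token sequence the decoder would be trained to produce."""
    data = text.encode("utf-8")  # the model operates on raw bytes
    tokens = []
    for start, length, label in sorted(annotations):
        assert 0 <= start and start + length <= len(data)
        tokens += [f"START_{start}", f"LEN_{length}", f"LABEL_{label}"]
    tokens.append("STOP")  # assumed end-of-output marker
    return tokens

# Example: "Obama" occupies bytes 0-4, "Paris" bytes 14-18 of the segment.
print(encode_spans("Obama visited Paris.", [(0, 5, "PER"), (14, 5, "LOC")]))
# ['START_0', 'LEN_5', 'LABEL_PER', 'START_14', 'LEN_5', 'LABEL_LOC', 'STOP']
```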
Data:
- POS Tagging: 13 languages, 2.87M tokens, 25.3M training segments
- NER: 4 languages, 0.88M tokens, 6M training segments
Results:
- POS CRF Accuracy (average across languages): 95.41
- POS BTS Accuracy (average across languages): 95.85
- NER BTS en/de/es/nl F1: 86.50/76.22/82.95/82.84
- (See paper for NER comparison models)
- Surprising to me that the span generation works so well without imposing independence assumptions on it; it's all state the LSTM has to keep in memory.
- 0.2-0.3 dropout, 320-dimensional embeddings, 320 LSTM units, and 4 layers seem to perform well. The resulting model is surprisingly compact (~1M parameters) due to the small vocabulary of 256 bytes. Changing the input sequence order didn't have much of an effect. Dropout and byte dropout significantly improved F1 for NER (74 -> 78 -> 82).
- To limit sequence length the authors split the text into segments of k=60 bytes with 50% overlap, to avoid losing spans that would otherwise be cut at segment boundaries (both this segmentation and byte dropout are sketched after these notes).
- Byte dropout can be seen as "blurring" the text. I believe I've seen the same technique applied to words before, under the name word dropout.
- Training examples for all languages are shuffled together. The biggest score improvements are observed for low-resource languages.
- Not clear how to tune recall of the model since non-spans are simply not annotated.
- I wonder if the fixed-vector embedding of the input sequence is a bottleneck, since the decoder LSTM has to carry information not only about the input sequence but also about the structure it has produced so far. I wonder if the authors experimented with varying k, or with attention mechanisms to deal with longer sequences (I've seen papers dealing with sequences of 2000 tokens). k=60 seems quite short to me. Of course, output vocabulary size is also a concern with longer sequences.
- What about LSTM initialization? When feeding segments coming from the same document, is the state kept around or re-initialized? I strongly suspect it's kept, since 60 bytes probably don't contain enough context for proper labeling, but I didn't see an explicit reference.
- Why not a bidirectional LSTM? Seems to be the standard in most other papers.
- How exactly are multiple languages encoded in the LSTM memories? I kind of understand the reasoning behind this, but it's unclear what these "high-level" representations are. Experiments that demonstrate what the LSTM cells represent would be valuable.
- Is there a way to easily re-train the model for a new language?
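
Below is a sketch of the two preprocessing steps referenced in the notes above: segmenting the byte stream into k-byte windows with 50% overlap, and byte dropout. The function names, the dropout rate, and the choice of 256 as the DROP symbol id are my assumptions; the paper only specifies k=60, the 50% overlap, and that dropped bytes are replaced with a special DROP symbol.

```python
import random

def segment_bytes(data: bytes, k: int = 60):
    """Slide a window of k bytes over the input with stride k // 2 (50% overlap),
    so that a span cut off at one segment boundary appears whole in the next."""
    stride = k // 2
    return [data[i:i + k] for i in range(0, max(len(data) - stride, 1), stride)]

def byte_dropout(byte_ids, p: float = 0.3, drop_id: int = 256):
    """Randomly replace each input byte id with a DROP symbol ("blurring" the text).
    drop_id=256 (one past the byte range) is an assumed encoding of DROP."""
    return [drop_id if random.random() < p else b for b in byte_ids]

segments = segment_bytes("Obama visited Paris.".encode("utf-8"), k=10)
noisy = byte_dropout(list(segments[0]), p=0.3)
```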