TL;DR: The authors train a deep sequence-to-sequence LSTM directly on byte-level input from several languages (shuffling the examples of all languages together) and apply it to NER and POS tagging, achieving state-of-the-art results or close to them. The model outputs spans of the form [START_POSITION, LENGTH, LABEL], where each span element is a separate token prediction. A single model works well for all languages and learns shared high-level representations. The authors also present a novel way to drop out input tokens (bytes in their case) by randomly replacing them with a DROP symbol.
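
To make the output format concrete, here is a minimal sketch of how I picture a labeled segment being flattened into the decoder's target sequence. The token naming (START_*, LEN_*, LABEL_*) and the explicit STOP marker are my assumptions, not taken from the paper.

```python
# Minimal sketch of the byte-to-span output encoding as I understand it.
# The exact token vocabulary (separate start/length/label token types and
# an explicit STOP symbol) is my assumption, not taken from the paper.

def encode_spans(text, annotations):
    """Turn (start, length, label) annotations over a byte segment into
    the flat output token sequence the decoder would be trained to produce."""
    data = text.encode("utf-8")  # the model operates on raw bytes
    tokens = []
    for start, length, label in sorted(annotations):
        assert 0 <= start and start + length <= len(data)
        tokens += [f"START_{start}", f"LEN_{length}", f"LABEL_{label}"]
    tokens.append("STOP")  # assumed end-of-output marker
    return tokens

# Example: "Obama" occupies bytes 0-4, "Paris" bytes 14-18 of the segment.
print(encode_spans("Obama visited Paris.", [(0, 5, "PER"), (14, 5, "LOC")]))
# ['START_0', 'LEN_5', 'LABEL_PER', 'START_14', 'LEN_5', 'LABEL_LOC', 'STOP']
```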
Data:
- POS Tagging: 13 languages, 2.87M tokens, 25.3M training segments
- NER: 4 languages, 0.88M tokens, 6M training segments
Results:
- POS CRF Accuracy (average across languages): 95.41
- POS BTS Accuracy (average across languages): 95.85
- NER BTS en/de/es/nl F1: 86.50/76.22/82.95/82.84
- (See paper for NER comparison models)
- Surprising to me that the span generation works so well without imposing independence assumptions on it; it's all state the LSTM has to keep in memory.
- 0.2-0.3 dropout, 320-dimensional embeddings, 320 LSTM units, and 4 layers seem to perform well. The resulting model is surprisingly compact (~1M parameters) due to the small vocabulary of 256 bytes. Changing the input sequence order didn't have much of an effect. Dropout and byte dropout significantly improved F1 for NER (74 -> 78 -> 82).
- To limit sequence length the authors split the text into segments of k=60 bytes with 50% overlap, to avoid losing spans that would otherwise be cut at segment boundaries (both this segmentation and byte dropout are sketched after these notes).
- Byte dropout can be seen as "blurring" the text. I believe I've seen the same technique applied to words before, under the name word dropout.
- Training examples for all languages are shuffled together. The biggest score improvements are observed for low-resource languages.
- Not clear how to tune recall of the model since non-spans are simply not annotated.
- I wonder if the fixed-vector embedding of the input sequence is a bottleneck, since the decoder LSTM has to carry information not only about the input sequence but also about the structure it has produced so far. I wonder if the authors experimented with varying k, or with attention mechanisms to deal with longer sequences (I've seen papers dealing with sequences of 2000 tokens). k=60 seems quite short to me. Of course, output vocabulary size is also a concern with longer sequences.
- What about LSTM initialization? When feeding segments coming from the same document, is the state kept around or re-initialized? I strongly suspect it's kept, since 60 bytes probably don't contain enough context for proper labeling, but I didn't see an explicit reference.
- Why not a bidirectional LSTM? Seems to be the standard in most other papers.
- How exactly are multiple languages encoded in the LSTM memories? I kind of understand the reasoning behind this, but it's unclear what these "high-level" representations are. Experiments that demonstrate what the LSTM cells represent would be valuable.
- Is there a way to easily re-train the model for a new language?
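
Below is a sketch of the two preprocessing steps referenced in the notes above: segmenting the byte stream into k-byte windows with 50% overlap, and byte dropout. The function names, the dropout rate, and the choice of 256 as the DROP symbol id are my assumptions; the paper only specifies k=60, the 50% overlap, and that dropped bytes are replaced with a special DROP symbol.

```python
import random

def segment_bytes(data: bytes, k: int = 60):
    """Slide a window of k bytes over the input with stride k // 2 (50% overlap),
    so that a span cut off at one segment boundary appears whole in the next."""
    stride = k // 2
    return [data[i:i + k] for i in range(0, max(len(data) - stride, 1), stride)]

def byte_dropout(byte_ids, p: float = 0.3, drop_id: int = 256):
    """Randomly replace each input byte id with a DROP symbol ("blurring" the text).
    drop_id=256 (one past the byte range) is an assumed encoding of DROP."""
    return [drop_id if random.random() < p else b for b in byte_ids]

segments = segment_bytes("Obama visited Paris.".encode("utf-8"), k=10)
noisy = byte_dropout(list(segments[0]), p=0.3)
```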