Merge pull request #28 from amir-zeldes/dev
Dev
amir-zeldes authored Oct 18, 2022
2 parents 7bc9fd2 + 10cd30c commit ed51d54
Showing 15 changed files with 1,917 additions and 12,668 deletions.
30 changes: 30 additions & 0 deletions .gitignore
@@ -0,0 +1,30 @@
/hebpipe/conll18_ud_eval.py
/hebpipe/he_htb-ud-dev.conllu
/hebpipe/he_htb-ud-test.conllu
/hebpipe/he_htb-ud-train.conllu
/hebpipe/he_iahltwiki-ud-dev.conllu
/hebpipe/he_iahltwiki-ud-test.conllu
/hebpipe/he_iahltwiki-ud-train.conllu
/hebpipe/htb-test.conllu
/hebpipe/htb-test.txt
/hebpipe/htb-test-split.conllu
/hebpipe/htb-test-split.txt
/hebpipe/iahlt-test.conllu
/hebpipe/iahlt-test.txt
/hebpipe/models/stanza/he_htb.pretrain.pt
/hebpipe/models/stanza/he_htb_tagger.pt
/hebpipe/models/stanza/he_lemmatizer.pt
/hebpipe/models/heb.diaparser
/hebpipe/models/heb.flair
/hebpipe/models/heb.morph
/hebpipe/models/heb.sbdposmorph.pt
/hebpipe/models/heb.seg
/hebpipe/models/heb.sent
/hebpipe/models/heb.sm3
/hebpipe/models/heb.xrm
/hebpipe/data/checkpoint/
/hebpipe/data/tensorboarddir/

__pycache__
/hebpipe/data/sentsplit_postag_dev_gold.tab
/hebpipe/data/sentsplit_postag_train_gold.tab
60 changes: 30 additions & 30 deletions README.md
@@ -15,22 +15,18 @@ A simple NLP pipeline for Hebrew text in UTF-8 encoding, using standard components

Note that entity recognition and coreference are still in beta and offer rudimentary accuracy.

-Online demo available at: (choose 'Hebrew' and enter plain text)
+To cite this tool in academic papers please refer to this paper:

-https://corpling.uis.georgetown.edu/xrenner/
+Zeldes, Amir, Nick Howell, Noam Ordan and Yifat Ben Moshe (2022) [A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing](https://arxiv.org/abs/2210.07873). In: *Proceedings of EMNLP 2022*. Abu Dhabi, UAE.

-To cite this work please refer to the paper about the morphological segmenter here:
-
-Zeldes, Amir (2018) A Characterwise Windowed Approach to Hebrew Morphological Segmentation. In: *Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology*. Brussels, Belgium.
-
```
-@InProceedings{Zeldes2018,
-  author    = {Amir Zeldes},
-  title     = {A Characterwise Windowed Approach to {H}ebrew Morphological Segmentation},
-  booktitle = {Proceedings of the 15th {SIGMORPHON} Workshop on Computational Research in Phonetics, Phonology, and Morphology},
-  year      = {2018},
-  pages     = {101--110},
-  address   = {Brussels, Belgium}
+@InProceedings{ZeldesHowellOrdanBenMoshe2022,
+  author    = {Amir Zeldes and Nick Howell and Noam Ordan and Yifat Ben Moshe},
+  booktitle = {Proceedings of {EMNLP} 2022},
+  title     = {A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing},
+  year      = {2022},
+  address   = {Abu Dhabi, UAE},
}
```

@@ -57,18 +53,23 @@ Models can be downloaded automatically by the script on its first run.

The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). Required libraries:

-* requests
-* numpy
-* scipy
-* pandas
-* depedit
-* xmltodict
-* xgboost==0.81
-* rftokenizer
-* joblib
-* flair==0.6.1
-* stanza
-* diaparser
+```
+requests
+transformers==3.5.1
+torch==1.6.0
+xgboost==0.81
+rftokenizer
+numpy
+scipy
+depedit
+pandas
+joblib
+xmltodict
+diaparser==1.1.2
+flair==0.6.1
+stanza
+conllu
+```

You should be able to install these manually via pip if necessary (i.e. `pip install rftokenizer` etc.).
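For example, the pinned set added above can be installed in one step by saving that list to a file and pointing pip at it; a minimal sketch (the `requirements.txt` filename here is illustrative, not a documented repo file):

```
# Sketch: install the pinned dependencies listed in the diff above.
# Assumes that list was saved verbatim as requirements.txt (illustrative name).
pip install -r requirements.txt

# Or install a single package manually, as the README suggests:
pip install rftokenizer
```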

@@ -79,7 +80,7 @@ Note that some older versions of Python + Windows do not install numpy correctly

### Model files

-Model files are too large to include in the standard GitHub repository. The software will offer to download them automatically. The latest models can also be downloaded manually at https://corpling.uis.georgetown.edu/amir/download/heb_models_v2/.
+Model files are too large to include in the standard GitHub repository. The software will offer to download them automatically. The latest models can also be downloaded manually at https://gucorpling.org/amir/download/heb_models_v3/.
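For a manual download, the target directory and filenames below are assumptions inferred from the `/hebpipe/models/` paths in the .gitignore above, not names confirmed by the docs; a sketch:

```
# Sketch: manual model download. Filenames are assumed from the
# .gitignore entries above and may differ from the server's actual listing.
wget -P hebpipe/models/ https://gucorpling.org/amir/download/heb_models_v3/heb.sbdposmorph.pt
wget -P hebpipe/models/ https://gucorpling.org/amir/download/heb_models_v3/heb.diaparser
```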

## Command line usage

@@ -97,9 +98,8 @@ standard module options:
                         forms
   -t, --tokenize        Tokenize large word forms into smaller morphological
                         segments
-  -p, --pos             Do POS tagging
+  -p, --posmorph        Do POS tagging and Morphological Tagging
   -l, --lemma           Do lemmatization
-  -m, --morph           Do morphological tagging
   -d, --dependencies    Parse with dependency parser
   -e, --entities        Add entity spans and types
   -c, --coref           Add coreference annotations
@@ -123,7 +123,7 @@ less common options:

Whitespace tokenize, tokenize morphemes, add pos, lemma, morph, dep parse with automatic sentence splitting,
entity recognition and coref for one text file, output in default conllu format:
-> python heb_pipe.py -wtplmdec example_in.txt
+> python heb_pipe.py -wtpldec example_in.txt
OR specify no processing options (automatically assumes you want all steps)
> python heb_pipe.py example_in.txt
@@ -132,10 +132,10 @@ Just tokenize a file using pipes:
> python heb_pipe.py -wt -o pipes example_in.txt
Pos tag, lemmatize, add morphology and parse a pre-tokenized file, splitting sentences by existing <sent> tags:
-> python heb_pipe.py -plmd -s sent example_in.txt
+> python heb_pipe.py -pld -s sent example_in.txt
Add full analyses to a whole directory of *.txt files, output to a specified directory:
-> python heb_pipe.py -wtplmdec --dirout /home/heb/out/ *.txt
+> python heb_pipe.py -wtpldec --dirout /home/heb/out/ *.txt
Parse a tagged TT SGML file into CoNLL tabular format for treebanking, use existing tag <sent> to recognize sentence borders:
> python heb_pipe.py -d -s sent example_in.tt