Merge pull request #28 from amir-zeldes/dev
Dev
amir-zeldes authored Oct 18, 2022
2 parents 7bc9fd2 + 10cd30c commit ed51d54
Showing 15 changed files with 1,917 additions and 12,668 deletions.
30 changes: 30 additions & 0 deletions .gitignore
@@ -0,0 +1,30 @@
/hebpipe/conll18_ud_eval.py
/hebpipe/he_htb-ud-dev.conllu
/hebpipe/he_htb-ud-test.conllu
/hebpipe/he_htb-ud-train.conllu
/hebpipe/he_iahltwiki-ud-dev.conllu
/hebpipe/he_iahltwiki-ud-test.conllu
/hebpipe/he_iahltwiki-ud-train.conllu
/hebpipe/htb-test.conllu
/hebpipe/htb-test.txt
/hebpipe/htb-test-split.conllu
/hebpipe/htb-test-split.txt
/hebpipe/iahlt-test.conllu
/hebpipe/iahlt-test.txt
/hebpipe/models/stanza/he_htb.pretrain.pt
/hebpipe/models/stanza/he_htb_tagger.pt
/hebpipe/models/stanza/he_lemmatizer.pt
/hebpipe/models/heb.diaparser
/hebpipe/models/heb.flair
/hebpipe/models/heb.morph
/hebpipe/models/heb.sbdposmorph.pt
/hebpipe/models/heb.seg
/hebpipe/models/heb.sent
/hebpipe/models/heb.sm3
/hebpipe/models/heb.xrm
/hebpipe/data/checkpoint/
/hebpipe/data/tensorboarddir/

__pycache__
/hebpipe/data/sentsplit_postag_dev_gold.tab
/hebpipe/data/sentsplit_postag_train_gold.tab
60 changes: 30 additions & 30 deletions README.md
@@ -15,22 +15,18 @@ A simple NLP pipeline for Hebrew text in UTF-8 encoding, using standard components

Note that entity recognition and coreference are still in beta and offer rudimentary accuracy.

-Online demo available at: (choose 'Hebrew' and enter plain text)
+To cite this tool in academic papers please refer to this paper:

-https://corpling.uis.georgetown.edu/xrenner/
+Zeldes, Amir, Nick Howell, Noam Ordan and Yifat Ben Moshe (2022) [A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing](https://arxiv.org/abs/2210.07873). In: *Proceedings of EMNLP 2022*. Abu Dhabi, UAE.

-To cite this work please refer to the paper about the morphological segmenter here:
-
-Zeldes, Amir (2018) A Characterwise Windowed Approach to Hebrew Morphological Segmentation. In: *Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology*. Brussels, Belgium.
-
```
-@InProceedings{Zeldes2018,
-  author    = {Amir Zeldes},
-  title     = {A Characterwise Windowed Approach to {H}ebrew Morphological Segmentation},
-  booktitle = {Proceedings of the 15th {SIGMORPHON} Workshop on Computational Research in Phonetics, Phonology, and Morphology},
-  year      = {2018},
-  pages     = {101--110},
-  address   = {Brussels, Belgium}
+@InProceedings{ZeldesHowellOrdanBenMoshe2022,
+  author    = {Amir Zeldes and Nick Howell and Noam Ordan and Yifat Ben Moshe},
+  booktitle = {Proceedings of {EMNLP} 2022},
+  title     = {A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing},
+  year      = {2022},
+  address   = {Abu Dhabi, UAE},
}
```

@@ -57,18 +53,23 @@ Models can be downloaded automatically by the script on its first run.

The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). Required libraries:

-* requests
-* numpy
-* scipy
-* pandas
-* depedit
-* xmltodict
-* xgboost==0.81
-* rftokenizer
-* joblib
-* flair==0.6.1
-* stanza
-* diaparser
+```
+requests
+transformers==3.5.1
+torch==1.6.0
+xgboost==0.81
+rftokenizer
+numpy
+scipy
+depedit
+pandas
+joblib
+xmltodict
+diaparser==1.1.2
+flair==0.6.1
+stanza
+conllu
+```

You should be able to install these manually via pip if necessary (i.e. `pip install rftokenizer` etc.).
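For example, the pinned set added above can be installed in one step by saving that list to a file and pointing pip at it; a minimal sketch (the `requirements.txt` filename here is illustrative, not a documented repo file):

```
# Sketch: install the pinned dependencies listed in the diff above.
# Assumes that list was saved verbatim as requirements.txt (illustrative name).
pip install -r requirements.txt

# Or install a single package manually, as the README suggests:
pip install rftokenizer
```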

@@ -79,7 +80,7 @@ Note that some older versions of Python + Windows do not install numpy correctly

### Model files

-Model files are too large to include in the standard GitHub repository. The software will offer to download them automatically. The latest models can also be downloaded manually at https://corpling.uis.georgetown.edu/amir/download/heb_models_v2/.
+Model files are too large to include in the standard GitHub repository. The software will offer to download them automatically. The latest models can also be downloaded manually at https://gucorpling.org/amir/download/heb_models_v3/.
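For a manual download, the target directory and filenames below are assumptions inferred from the `/hebpipe/models/` paths in the .gitignore above, not names confirmed by the docs; a sketch:

```
# Sketch: manual model download. Filenames are assumed from the
# .gitignore entries above and may differ from the server's actual listing.
wget -P hebpipe/models/ https://gucorpling.org/amir/download/heb_models_v3/heb.sbdposmorph.pt
wget -P hebpipe/models/ https://gucorpling.org/amir/download/heb_models_v3/heb.diaparser
```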

## Command line usage

@@ -97,9 +98,8 @@ standard module options:
                         forms
   -t, --tokenize        Tokenize large word forms into smaller morphological
                         segments
-  -p, --pos             Do POS tagging
+  -p, --posmorph        Do POS tagging and Morphological Tagging
   -l, --lemma           Do lemmatization
-  -m, --morph           Do morphological tagging
   -d, --dependencies    Parse with dependency parser
   -e, --entities        Add entity spans and types
   -c, --coref           Add coreference annotations
@@ -123,7 +123,7 @@ less common options:

Whitespace tokenize, tokenize morphemes, add pos, lemma, morph, dep parse with automatic sentence splitting,
entity recognition and coref for one text file, output in default conllu format:
-> python heb_pipe.py -wtplmdec example_in.txt
+> python heb_pipe.py -wtpldec example_in.txt
OR specify no processing options (automatically assumes you want all steps)
> python heb_pipe.py example_in.txt
@@ -132,10 +132,10 @@ Just tokenize a file using pipes:
> python heb_pipe.py -wt -o pipes example_in.txt
Pos tag, lemmatize, add morphology and parse a pre-tokenized file, splitting sentences by existing <sent> tags:
-> python heb_pipe.py -plmd -s sent example_in.txt
+> python heb_pipe.py -pld -s sent example_in.txt
Add full analyses to a whole directory of *.txt files, output to a specified directory:
-> python heb_pipe.py -wtplmdec --dirout /home/heb/out/ *.txt
+> python heb_pipe.py -wtpldec --dirout /home/heb/out/ *.txt
Parse a tagged TT SGML file into CoNLL tabular format for treebanking, use existing tag <sent> to recognize sentence borders:
> python heb_pipe.py -d -s sent example_in.tt