├── data ├── processed │ ├── mwt │ │ └── README.txt │ ├── ner │ │ └── README.txt │ ├── pos │ │ └── README.txt │ ├── depparse │ │ └── README.txt │ ├── lemma │ │ └── README.txt │ ├── tokenize │ │ └── README.txt │ └── charlm │ │ └── en │ │ └── test │ │ ├── README.txt │ │ ├── dev.txt │ │ ├── test.txt │ │ └── train │ │ ├── train-1.txt │ │ └── train-2.txt ├── udbase │ └── UD_English-TEST │ │ ├── en_test-ud-dev.txt │ │ ├── en_test-ud-test.txt │ │ ├── en_test-ud-train.txt │ │ ├── en_test-ud-dev.conllu │ │ ├── en_test-ud-test.conllu │ │ └── en_test-ud-train.conllu ├── wordvec │ └── word2vec │ │ └── English │ │ ├── en.vectors.xz │ │ └── en.vectors.txt └── nerbase │ └── English-SAMPLE │ ├── en_sample.dev.bio │ ├── en_sample.test.bio │ └── en_sample.train.bio ├── requirements.txt ├── config ├── config.sh └── xpos_vocab_factory.py ├── .gitignore └── README.md /data/processed/mwt/README.txt: -------------------------------------------------------------------------------- 1 | Training and test data generated by Stanza. -------------------------------------------------------------------------------- /data/processed/ner/README.txt: -------------------------------------------------------------------------------- 1 | Training and test data generated by Stanza. -------------------------------------------------------------------------------- /data/processed/pos/README.txt: -------------------------------------------------------------------------------- 1 | Training and test data generated by Stanza. -------------------------------------------------------------------------------- /data/processed/depparse/README.txt: -------------------------------------------------------------------------------- 1 | Training and test data generated by Stanza. -------------------------------------------------------------------------------- /data/processed/lemma/README.txt: -------------------------------------------------------------------------------- 1 | Training and test data generated by Stanza. -------------------------------------------------------------------------------- /data/processed/tokenize/README.txt: -------------------------------------------------------------------------------- 1 | Training and test data generated by Stanza. -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | protobuf 3 | requests 4 | tqdm 5 | torch>=1.3.0 6 | -------------------------------------------------------------------------------- /data/processed/charlm/en/test/README.txt: -------------------------------------------------------------------------------- 1 | Training and test data for character language model. 2 | -------------------------------------------------------------------------------- /data/processed/charlm/en/test/dev.txt: -------------------------------------------------------------------------------- 1 | Barack Obama was born in Hawaii . He was elected president in 2008 . 2 | -------------------------------------------------------------------------------- /data/processed/charlm/en/test/test.txt: -------------------------------------------------------------------------------- 1 | Barack Obama was born in Hawaii . He was elected president in 2008 . 
2 | -------------------------------------------------------------------------------- /data/udbase/UD_English-TEST/en_test-ud-dev.txt: -------------------------------------------------------------------------------- 1 | Barack Obama was born in Hawaii. He was elected president in 2008. 2 | -------------------------------------------------------------------------------- /data/udbase/UD_English-TEST/en_test-ud-test.txt: -------------------------------------------------------------------------------- 1 | Barack Obama was born in Hawaii. He was elected president in 2008. 2 | -------------------------------------------------------------------------------- /data/processed/charlm/en/test/train/train-1.txt: -------------------------------------------------------------------------------- 1 | Barack Obama was born in Hawaii . He was elected president in 2008 . 2 | -------------------------------------------------------------------------------- /data/processed/charlm/en/test/train/train-2.txt: -------------------------------------------------------------------------------- 1 | Barack Obama was born in Hawaii . He was elected president in 2008 . 2 | -------------------------------------------------------------------------------- /data/udbase/UD_English-TEST/en_test-ud-train.txt: -------------------------------------------------------------------------------- 1 | Barack Obama was born in Hawaii. He was elected president in 2008. 2 | -------------------------------------------------------------------------------- /data/wordvec/word2vec/English/en.vectors.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stanfordnlp/stanza-train/HEAD/data/wordvec/word2vec/English/en.vectors.xz -------------------------------------------------------------------------------- /data/nerbase/English-SAMPLE/en_sample.dev.bio: -------------------------------------------------------------------------------- 1 | Barack I-PER 2 | Obama I-PER 3 | was O 4 | born O 5 | in O 6 | Hawaii I-LOC 7 | . O 8 | 9 | He O 10 | was O 11 | elected O 12 | president O 13 | in O 14 | 2008 O 15 | . O 16 | -------------------------------------------------------------------------------- /data/nerbase/English-SAMPLE/en_sample.test.bio: -------------------------------------------------------------------------------- 1 | Barack I-PER 2 | Obama I-PER 3 | was O 4 | born O 5 | in O 6 | Hawaii I-LOC 7 | . O 8 | 9 | He O 10 | was O 11 | elected O 12 | president O 13 | in O 14 | 2008 O 15 | . O 16 | -------------------------------------------------------------------------------- /data/nerbase/English-SAMPLE/en_sample.train.bio: -------------------------------------------------------------------------------- 1 | Barack I-PER 2 | Obama I-PER 3 | was O 4 | born O 5 | in O 6 | Hawaii I-LOC 7 | . O 8 | 9 | He O 10 | was O 11 | elected O 12 | president O 13 | in O 14 | 2008 O 15 | . O 16 | -------------------------------------------------------------------------------- /data/wordvec/word2vec/English/en.vectors.txt: -------------------------------------------------------------------------------- 1 | 10 5 2 | Barack 0.01613954 0.00141043 -0.00869777 0.000911 0.01950155 3 | Obama -0.00907914 0.01053656 -0.00389627 -0.00673913 -0.00667982 4 | was -0.00046209 0.01675782 0.00450974 0.00875711 -0.00223494 5 | born -0.02178387 0.01755228 -0.00446462 0.00476047 0.02028277 6 | in 0.00124867 0.01410756 0.01728466 0.01355088 -0.00336146 7 | Hawaii 0.00582745 -0.01101075 -0.00198883 0.01841053 0.00072485 8 | . 
0.00745728 -0.00108565 0.01947713 0.00447089 -0.01529367 9 | He -0.00628661 -0.0084458 0.00466739 -0.00817884 -0.02236676 10 | elected 0.00366836 -0.00218679 0.01713075 -0.0119266 -0.0078803 11 | president 0.0030667 0.01066898 -0.01944919 0.00631905 0.00310773 12 | -------------------------------------------------------------------------------- /data/udbase/UD_English-TEST/en_test-ud-dev.conllu: -------------------------------------------------------------------------------- 1 | # text = Barack Obama was born in Hawaii. 2 | 1 Barack Barack PROPN NNP Number=Sing 4 nsubj:pass _ _ 3 | 2 Obama Obama PROPN NNP Number=Sing 1 flat _ _ 4 | 3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 4 aux:pass _ _ 5 | 4 born bear VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ _ 6 | 5 in in ADP IN _ 6 case _ _ 7 | 6 Hawaii Hawaii PROPN NNP Number=Sing 4 obl _ SpaceAfter=No 8 | 7 . . PUNCT . _ 4 punct _ _ 9 | 10 | # text = He was elected president in 2008. 11 | 1 He he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 3 nsubj:pass _ _ 12 | 2 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 3 aux:pass _ _ 13 | 3 elected elect VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ _ 14 | 4 president president PROPN NNP Number=Sing 3 xcomp _ _ 15 | 5 in in ADP IN _ 6 case _ _ 16 | 6 2008 2008 NUM CD NumType=Card 3 obl _ SpaceAfter=No 17 | 7 . . PUNCT . _ 3 punct _ _ 18 | 19 | -------------------------------------------------------------------------------- /data/udbase/UD_English-TEST/en_test-ud-test.conllu: -------------------------------------------------------------------------------- 1 | # text = Barack Obama was born in Hawaii. 2 | 1 Barack Barack PROPN NNP Number=Sing 4 nsubj:pass _ _ 3 | 2 Obama Obama PROPN NNP Number=Sing 1 flat _ _ 4 | 3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 4 aux:pass _ _ 5 | 4 born bear VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ _ 6 | 5 in in ADP IN _ 6 case _ _ 7 | 6 Hawaii Hawaii PROPN NNP Number=Sing 4 obl _ SpaceAfter=No 8 | 7 . . PUNCT . _ 4 punct _ _ 9 | 10 | # text = He was elected president in 2008. 11 | 1 He he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 3 nsubj:pass _ _ 12 | 2 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 3 aux:pass _ _ 13 | 3 elected elect VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ _ 14 | 4 president president PROPN NNP Number=Sing 3 xcomp _ _ 15 | 5 in in ADP IN _ 6 case _ _ 16 | 6 2008 2008 NUM CD NumType=Card 3 obl _ SpaceAfter=No 17 | 7 . . PUNCT . _ 3 punct _ _ 18 | 19 | -------------------------------------------------------------------------------- /data/udbase/UD_English-TEST/en_test-ud-train.conllu: -------------------------------------------------------------------------------- 1 | # text = Barack Obama was born in Hawaii. 2 | 1 Barack Barack PROPN NNP Number=Sing 4 nsubj:pass _ _ 3 | 2 Obama Obama PROPN NNP Number=Sing 1 flat _ _ 4 | 3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 4 aux:pass _ _ 5 | 4 born bear VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ _ 6 | 5 in in ADP IN _ 6 case _ _ 7 | 6 Hawaii Hawaii PROPN NNP Number=Sing 4 obl _ SpaceAfter=No 8 | 7 . . PUNCT . _ 4 punct _ _ 9 | 10 | # text = He was elected president in 2008. 
11 | 1 He he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 3 nsubj:pass _ _ 12 | 2 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 3 aux:pass _ _ 13 | 3 elected elect VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ _ 14 | 4 president president PROPN NNP Number=Sing 3 xcomp _ _ 15 | 5 in in ADP IN _ 6 case _ _ 16 | 6 2008 2008 NUM CD NumType=Card 3 obl _ SpaceAfter=No 17 | 7 . . PUNCT . _ 3 punct _ _ 18 | 19 | -------------------------------------------------------------------------------- /config/config.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Set environment variables for the training and testing of stanza modules. 4 | 5 | # Set UDBASE to the location of UD data folder 6 | # The data should be CoNLL-U format 7 | # For details, see http://universaldependencies.org/conll18/data.html (CoNLL-18 UD data) 8 | export UDBASE=../data/udbase 9 | 10 | # Set NERBASE to the location of NER data folder 11 | # The data should be BIO format 12 | # For details, see https://www.aclweb.org/anthology/W03-0419.pdf (CoNLL-03 NER paper) 13 | export NERBASE=../data/nerbase 14 | 15 | # Set directories to store processed training/evaluation files 16 | export DATA_ROOT=../data/processed 17 | export TOKENIZE_DATA_DIR=$DATA_ROOT/tokenize 18 | export MWT_DATA_DIR=$DATA_ROOT/mwt 19 | export LEMMA_DATA_DIR=$DATA_ROOT/lemma 20 | export POS_DATA_DIR=$DATA_ROOT/pos 21 | export DEPPARSE_DATA_DIR=$DATA_ROOT/depparse 22 | export ETE_DATA_DIR=$DATA_ROOT/ete 23 | export NER_DATA_DIR=$DATA_ROOT/ner 24 | export CHARLM_DATA_DIR=$DATA_ROOT/charlm 25 | 26 | # Set directories to store external word vector data 27 | export WORDVEC_DIR=../data/wordvec 28 | -------------------------------------------------------------------------------- /config/xpos_vocab_factory.py: -------------------------------------------------------------------------------- 1 | # This is the XPOS factory method generated automatically from models.pos.build_xpos_factory. 2 | # Please don't edit it! 
3 | 4 | from stanza.models.pos.vocab import WordVocab, XPOSVocab 5 | 6 | def xpos_vocab_factory(data, shorthand): 7 | if shorthand in ["af_afribooms", "grc_perseus", "ar_padt", "bg_btb", "hr_set", "cs_cac", "cs_cltt", "cs_fictree", "cs_pdt", "en_partut", "fr_partut", "gl_ctg", "it_isdt", "it_partut", "it_postwita", "it_twittiro", "it_vit", "ja_gsd", "lv_lvtb", "lt_alksnis", "ro_nonstandard", "ro_rrt", "gd_arcosg", "sr_set", "sk_snk", "sl_ssj", "ta_ttb", "uk_iu", "gl_treegal", "la_perseus", "sl_sst"]: 8 | return XPOSVocab(data, shorthand, idx=2, sep="") 9 | elif shorthand in ["en_test", "grc_proiel", "hy_armtdp", "eu_bdt", "be_hse", "ca_ancora", "zh-hant_gsd", "zh-hans_gsdsimp", "lzh_kyoto", "cop_scriptorium", "da_ddt", "en_ewt", "en_gum", "et_edt", "fi_tdt", "fr_ftb", "fr_gsd", "fr_sequoia", "fr_spoken", "de_gsd", "de_hdt", "got_proiel", "el_gdt", "he_htb", "hi_hdtb", "hu_szeged", "ga_idt", "ja_bccwj", "la_proiel", "lt_hse", "mt_mudt", "mr_ufal", "nb_bokmaal", "nn_nynorsk", "nn_nynorsklia", "cu_proiel", "fro_srcmf", "orv_torot", "fa_seraji", "pt_bosque", "pt_gsd", "ru_gsd", "ru_syntagrus", "ru_taiga", "es_ancora", "es_gsd", "swl_sslc", "te_mtg", "tr_imst", "ug_udt", "vi_vtb", "wo_wtb", "bxr_bdt", "et_ewt", "kk_ktb", "kmr_mg", "olo_kkpp", "sme_giella", "hsb_ufal"]: 10 | return WordVocab(data, shorthand, idx=2, ignore=["_"]) 11 | elif shorthand in ["nl_alpino", "nl_lassysmall", "la_ittb", "sv_talbanken"]: 12 | return XPOSVocab(data, shorthand, idx=2, sep="|") 13 | elif shorthand in ["en_lines", "sv_lines", "ur_udtb"]: 14 | return XPOSVocab(data, shorthand, idx=2, sep="-") 15 | elif shorthand in ["fi_ftb"]: 16 | return XPOSVocab(data, shorthand, idx=2, sep=",") 17 | elif shorthand in ["id_gsd", "ko_gsd", "ko_kaist"]: 18 | return XPOSVocab(data, shorthand, idx=2, sep="+") 19 | elif shorthand in ["pl_lfg", "pl_pdb"]: 20 | return XPOSVocab(data, shorthand, idx=2, sep=":") 21 | else: 22 | raise NotImplementedError('Language shorthand "{}" not found!'.format(shorthand)) 23 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # kept from original 2 | .DS_Store 3 | *.tmp 4 | *.pkl 5 | *.conllu 6 | *.lem 7 | *.toklabels 8 | 9 | # standard github python project gitignore 10 | # Byte-compiled / optimized / DLL files 11 | __pycache__/ 12 | *.py[cod] 13 | *$py.class 14 | 15 | # C extensions 16 | *.so 17 | 18 | # Distribution / packaging 19 | .Python 20 | build/ 21 | develop-eggs/ 22 | dist/ 23 | downloads/ 24 | eggs/ 25 | .eggs/ 26 | lib/ 27 | lib64/ 28 | parts/ 29 | sdist/ 30 | var/ 31 | wheels/ 32 | pip-wheel-metadata/ 33 | share/python-wheels/ 34 | *.egg-info/ 35 | .installed.cfg 36 | *.egg 37 | MANIFEST 38 | 39 | # PyInstaller 40 | # Usually these files are written by a python script from a template 41 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
42 | *.manifest 43 | *.spec 44 | 45 | # Installer logs 46 | pip-log.txt 47 | pip-delete-this-directory.txt 48 | 49 | # Unit test / coverage reports 50 | htmlcov/ 51 | .tox/ 52 | .nox/ 53 | .coverage 54 | .coverage.* 55 | .cache 56 | nosetests.xml 57 | coverage.xml 58 | *.cover 59 | *.py,cover 60 | .hypothesis/ 61 | .pytest_cache/ 62 | cover/ 63 | 64 | # Translations 65 | *.mo 66 | *.pot 67 | 68 | # Django stuff: 69 | *.log 70 | local_settings.py 71 | db.sqlite3 72 | db.sqlite3-journal 73 | 74 | # Flask stuff: 75 | instance/ 76 | .webassets-cache 77 | 78 | # Scrapy stuff: 79 | .scrapy 80 | 81 | # Sphinx documentation 82 | docs/_build/ 83 | 84 | # PyBuilder 85 | .pybuilder/ 86 | target/ 87 | 88 | # Jupyter Notebook 89 | .ipynb_checkpoints 90 | 91 | # IPython 92 | profile_default/ 93 | ipython_config.py 94 | 95 | # pyenv 96 | # For a library or package, you might want to ignore these files since the code is 97 | # intended to run in multiple environments; otherwise, check them in: 98 | # .python-version 99 | 100 | # pipenv 101 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 102 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 103 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 104 | # install all needed dependencies. 105 | #Pipfile.lock 106 | 107 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 108 | __pypackages__/ 109 | 110 | # Celery stuff 111 | celerybeat-schedule 112 | celerybeat.pid 113 | 114 | # SageMath parsed files 115 | *.sage.py 116 | 117 | # Environments 118 | .env 119 | .venv 120 | env/ 121 | venv/ 122 | ENV/ 123 | env.bak/ 124 | venv.bak/ 125 | 126 | # Spyder project settings 127 | .spyderproject 128 | .spyproject 129 | 130 | # Rope project settings 131 | .ropeproject 132 | 133 | # mkdocs documentation 134 | /site 135 | 136 | # mypy 137 | .mypy_cache/ 138 | .dmypy.json 139 | dmypy.json 140 | 141 | # Pyre type checker 142 | .pyre/ 143 | 144 | # pytype static type analyzer 145 | .pytype/ 146 | 147 | # Cython debug symbols 148 | cython_debug/ 149 | 150 | 151 | # ignore the version of stanza we download into the training repo 152 | stanza/ 153 | 154 | # ignore the artifacts produced by preparing training data 155 | data/processed/mwt/en_test*json 156 | data/processed/ner/en_sample*json 157 | data/processed/tokenize/en_test*json 158 | data/processed/tokenize/en_test*txt 159 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 
3 | # Training Tutorials for the Stanza Python NLP Library
4 | 
5 | This repo provides step-by-step tutorials for training models with [Stanza](https://github.com/stanfordnlp/stanza) - the official Python NLP library by the Stanford NLP Group. All neural processors in Stanza, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, the dependency parser, and the named entity tagger, can be trained with your own data.
6 | 
7 | This repo is meant to complement our [training documentation](https://stanfordnlp.github.io/stanza/training.html) by providing runnable scripts coupled with toy data that make it much easier for users to get started with model training. To train models with your own data, you should be able to simply replace the provided toy data with your own data in the same format, and start using the resulting models with Stanza right after training.
8 | 
9 | > Warning: This repo is fully tested on Linux. Due to syntax differences between macOS and Linux (e.g., the `declare -A` syntax used in `scripts/treebank_to_shorthand.sh` is not supported by the bash version shipped with macOS), you need to rewrite some files to run on macOS. The command lines given here will not work on Windows.
10 | 
11 | This repo is designed for and tested on stanza 1.4.0. Earlier versions will not fully work with these commands.
12 | 
13 | ### Windows
14 | 
15 | To reiterate, this repo is only tested on Linux. The initial setup
16 | below includes a `source scripts/config.sh` step, which sets several
17 | environment variables. Theoretically, if you manually set those
18 | variables in your shell, or add them to the environment using the
19 | Control Panel, the rest of the scripts might work on Windows.
20 | 
21 | ## Environment Setup
22 | 
23 | Run the following commands at the command line.
24 | 
25 | Stanza only supports `python3`. You can install all dependencies needed for training Stanza models with:
26 | ```bash
27 | pip install -r requirements.txt
28 | ```
29 | 
30 | Next, set up the folders and scripts needed for training with:
31 | 
32 | ```bash
33 | git clone git@github.com:stanfordnlp/stanza-train.git
34 | cd stanza-train
35 | 
36 | git clone git@github.com:stanfordnlp/stanza.git
37 | cp config/config.sh stanza/scripts/config.sh
38 | cp config/xpos_vocab_factory.py stanza/stanza/models/pos/xpos_vocab_factory.py
39 | cd stanza
40 | source scripts/config.sh
41 | ```
42 | 
43 | The [`config.sh`](config/config.sh) script sets the environment variables (e.g., data paths, word vector paths) needed for training and testing Stanza models.
44 | 
45 | The [`xpos_vocab_factory.py`](config/xpos_vocab_factory.py) script is used to build the XPOS vocabulary file for our provided `UD_English-TEST` toy treebank. Compared with the original file in the downloaded Stanza repo, we only add the shorthand name of the toy treebank (`en_test`) to the script, so that it can be recognized during training. If you want to use a dataset other than `UD_English-TEST` after running this tutorial, you can add the shorthand of your treebank in the same way, as illustrated in the sketch below. In case you're curious, [here's how we built this file](https://github.com/stanfordnlp/stanza/blob/master/stanza/models/pos/build_xpos_vocab_factory.py).
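For example, suppose your treebank had the hypothetical shorthand `xx_mytb` and its XPOS tags were atomic labels like those of `en_test`. The change would then be a one-word addition to the matching branch of `xpos_vocab_factory.py`, sketched here with the long shorthand lists abbreviated:

```python
# Sketch only: register a hypothetical shorthand "xx_mytb" in xpos_vocab_factory.py.
# Pick the branch whose vocabulary type matches how your treebank's XPOS tags are built;
# the real file lists many more shorthands per branch than shown here.
from stanza.models.pos.vocab import WordVocab, XPOSVocab

def xpos_vocab_factory(data, shorthand):
    if shorthand in ["en_test", "en_ewt", "xx_mytb"]:
        # atomic XPOS tags (e.g. NNP, VBD): one vocabulary entry per whole tag
        return WordVocab(data, shorthand, idx=2, ignore=["_"])
    elif shorthand in ["nl_alpino", "nl_lassysmall", "la_ittb", "sv_talbanken"]:
        # composite XPOS tags joined by "|": split each tag into its parts
        return XPOSVocab(data, shorthand, idx=2, sep="|")
    else:
        raise NotImplementedError('Language shorthand "{}" not found!'.format(shorthand))
```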
46 | 
47 | 
48 | ## Training and Evaluating Processors
49 | 
50 | Here we provide instructions for training each processor currently supported by Stanza, using the toy data in this repo as example datasets. Model performance will be printed during training. As our provided toy data only contains a few sentences for demonstration purposes, you should be able to get 100% accuracy at the end of training.
51 | 
52 | ### `tokenize`
53 | 
54 | The [`tokenize`](https://stanfordnlp.github.io/stanza/tokenize.html) processor segments the text into tokens and sentences. All downstream processors which generate annotations at the token or sentence level depend on the output from this processor.
55 | 
56 | Training the `tokenize` processor currently requires [Universal Dependencies](https://universaldependencies.org/) treebank data in both plain text and the `conllu` format, as in our provided toy examples [here](data/udbase/UD_English-TEST). To train the `tokenize` processor with this toy data, run the following commands:
57 | 
58 | ```sh
59 | python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST
60 | python3 -m stanza.utils.training.run_tokenizer UD_English-TEST --step 500
61 | ```
62 | 
63 | Note that since this toy data is very small in scale, we are restricting the training to a very small number of steps with the `--step` parameter. To train on your own data, you can either set a larger `--step` value or use the default. Once training finishes, you can quickly sanity-check the trained tokenizer with the sketch below.
64 | 
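This is a minimal sketch for trying out the tokenizer you just trained, assuming the run above produced `saved_models/tokenize/en_test_tokenizer.pt` (the same model path used in the "Initializing Processors with Trained Models" section at the end of this tutorial) and that you run it from the `stanza` directory:

```python
import stanza

# Build a pipeline that runs only the freshly trained tokenizer.
nlp = stanza.Pipeline(lang='en', processors='tokenize',
                      tokenize_model_path='saved_models/tokenize/en_test_tokenizer.pt')

doc = nlp('Barack Obama was born in Hawaii. He was elected president in 2008.')
for i, sentence in enumerate(doc.sentences, start=1):
    # Each sentence holds the tokens the model segmented from the raw text.
    print(f'Sentence {i}:', ' '.join(token.text for token in sentence.tokens))
```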
65 | ### `mwt`
66 | 
67 | The Universal Dependencies grammar defines syntactic relations between [syntactic words](https://universaldependencies.org/u/overview/tokenization.html), which, for many languages (e.g., French), are different from the raw tokens segmented from the text. For these languages, the [`mwt`](https://stanfordnlp.github.io/stanza/mwt.html) processor expands the multi-word tokens (MWT) recognized by the [`tokenize`](https://stanfordnlp.github.io/stanza/tokenize.html) processor into multiple syntactic words, paving the way for downstream annotations.
68 | 
69 | > Note: The `mwt` processor is not needed and cannot be trained for languages that do not have [multi-word tokens (MWT)](https://universaldependencies.org/u/overview/tokenization.html), such as English or Chinese.
70 | 
71 | Like the `tokenize` processor, training the `mwt` processor requires UD data in the same format as our provided toy examples [here](data/udbase/UD_English-TEST). You can run the following command to train the `mwt` processor:
72 | 
73 | ```sh
74 | python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_English-TEST
75 | python3 -m stanza.utils.training.run_mwt UD_English-TEST --num_epoch 2
76 | ```
77 | 
78 | > Note: Running the above command with the toy data will yield a message saying that zero training data can be found for MWT training. This is normal, since MWT is not needed for English. The training should work once you replace the provided data with data from a language that has MWT (e.g., German, French, etc.).
79 | 
80 | ### `lemma`
81 | 
82 | The [`lemma`](https://stanfordnlp.github.io/stanza/lemma.html) processor predicts lemmas for all words in an input sentence. Training the `lemma` processor requires data files in the `conllu` format. With the toy examples, you can train the `lemma` processor with the following command:
83 | 
84 | ```sh
85 | python3 -m stanza.utils.datasets.prepare_lemma_treebank UD_English-TEST
86 | python3 -m stanza.utils.training.run_lemma UD_English-TEST --num_epoch 2
87 | ```
88 | 
89 | ### `pos`
90 | 
91 | The [`pos`](https://stanfordnlp.github.io/stanza/pos.html) processor annotates words with three types of syntactic information simultaneously: [Universal POS (UPOS) tags](https://universaldependencies.org/u/pos/), treebank-specific POS (XPOS) tags, and [universal morphological features (UFeats)](https://universaldependencies.org/u/feat/index.html).
92 | 
93 | Training the `pos` processor usually requires UD data in the `conllu` format and pretrained word vectors. For demo purposes, we provide an example word vector file [here](data/wordvec/word2vec/English). With the toy data and word vector file, you can train the `pos` processor with:
94 | 
95 | ```sh
96 | python3 -m stanza.utils.datasets.prepare_pos_treebank UD_English-TEST
97 | python3 -m stanza.utils.training.run_pos UD_English-TEST --max_steps 500
98 | ```
99 | 
100 | ### `depparse`
101 | 
102 | The [`depparse`](https://stanfordnlp.github.io/stanza/depparse.html) processor implements a dependency parser that predicts syntactic relations between words in a sentence. Training the `depparse` processor requires data files in the `conllu` format and a pretrained word vector file. With the toy data and word vector file, you can train the `depparse` processor with:
103 | 
104 | ```sh
105 | python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_English-TEST --gold
106 | python3 -m stanza.utils.training.run_depparse UD_English-TEST --max_steps 500
107 | ```
108 | 
109 | Note that the `--gold` parameter here tells the scripts to use the "gold" human-annotated POS tags in the training of the parser.
110 | 
111 | ### `ner`
112 | 
113 | The [`ner`](https://stanfordnlp.github.io/stanza/ner.html) processor recognizes named entities in the input text. Training the `ner` processor requires column training data in either `BIO` or `BIOES` format. See [this Wikipedia page](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for an introduction to these formats. We provide toy examples [here](data/nerbase/English-SAMPLE) in the BIO format. For better performance, a pretrained word vector file is also recommended. With the toy data and word vector file, you can train the `ner` processor with:
114 | 
115 | ```sh
116 | python3 -m stanza.utils.training.run_ner en_sample --max_steps 500 --word_emb_dim 5
117 | ```
118 | 
119 | Note that for demo purposes we are restricting the word vector dimension to 5 with the `--word_emb_dim` parameter. You should change it to match the dimension of your own word vectors (see the sketch below for a quick way to check it).
120 | 
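Since `--word_emb_dim` has to match the dimension of the word vectors you supply, a quick way to look it up is to read the header of a word2vec-style text file, whose first line stores `<vocab size> <dimension>` (the toy file `data/wordvec/word2vec/English/en.vectors.txt` declares `10 5`). A minimal sketch, run from the `stanza` directory so that the relative path below resolves:

```python
# Read the word2vec text header to find the embedding dimension.
# The toy vectors shipped with this repo contain 10 vectors of dimension 5.
with open('../data/wordvec/word2vec/English/en.vectors.txt', encoding='utf-8') as f:
    vocab_size, dim = f.readline().split()

print(f'{vocab_size} vectors of dimension {dim}')
# Pass the dimension to the training script, e.g. --word_emb_dim 5 for the toy file.
```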
121 | 
122 | ### Improving NER Performance with Contextualized Character Language Models
123 | 
124 | The performance of the [`ner`](https://stanfordnlp.github.io/stanza/ner.html) processor can be significantly improved by using contextualized string embeddings (i.e., a character-level language model), as was shown in [this COLING 2018 paper](https://www.aclweb.org/anthology/C18-1139/). To enable this in your NER model, you'll need to first train two character-level language models for your language (known as the `charlm` module in Stanza), and then use these trained `charlm` models in your NER training.
125 | 
126 | 
127 | #### `charlm`
128 | 
129 | Training `charlm` requires a large amount of raw text, such as text from news articles or Wikipedia pages, in plain text files. We provide toy data for training `charlm` [here](data/processed/charlm/en/test). With the toy data, you can run the following commands to train two `charlm` models, one in the forward direction of the text and one in the backward direction:
130 | 
131 | ```sh
132 | python3 -m stanza.utils.training.run_charlm en_test --forward --epochs 2 --cutoff 0 --batch_size 2
133 | python3 -m stanza.utils.training.run_charlm en_test --backward --epochs 2 --cutoff 0 --batch_size 2
134 | ```
135 | 
136 | Running these commands will result in two model files in the `saved_models/charlm` directory, with the prefix `en_test`.
137 | 
138 | > Note: For details on why two models are needed and how they are used in the NER tagger, please refer to [this COLING 2018 paper](https://www.aclweb.org/anthology/C18-1139/).
139 | 
140 | #### Training contextualized `ner` models with pretrained `charlm`
141 | 
142 | Training contextualized `ner` models requires BIO-format data, pretrained word vectors, and the pretrained `charlm` models obtained in the last step. You can run the following command to train the `ner` processor:
143 | 
144 | ```sh
145 | python3 -m stanza.utils.training.run_ner en_sample --max_steps 500 --word_emb_dim 5 --charlm test
146 | ```
147 | 
148 | Note that the `--charlm` parameter here instructs the training script to look for character language model files with the prefix `en_test`.
149 | 
150 | 
151 | ## Initializing Processors with Trained Models
152 | 
153 | Initializing a processor with your own trained model only requires the path to the model file. Here we provide an example that initializes the `tokenize` processor with a model file saved at `saved_models/tokenize/en_test_tokenizer.pt`:
154 | 
155 | ```python
156 | >>> import stanza
157 | >>> nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_model_path='saved_models/tokenize/en_test_tokenizer.pt')
158 | ```
159 | 
160 | ## Contributing Your Models to the Model Zoo
161 | 
162 | After training your own models, we welcome you to contribute them so that they can be used by the community. To do this, you can start by creating a [GitHub issue](https://github.com/stanfordnlp/stanza/issues). Please help us understand your models by clearly describing your dataset and model performance, providing your contact information, and explaining why you think your models would benefit the whole community. We will integrate your models into our official repository once we are able to verify their quality and usability.
163 | 
--------------------------------------------------------------------------------