ParsNER 🦁

├── .gitignore
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | pip-wheel-metadata/
 24 | share/python-wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | MANIFEST
 29 | 
 30 | # PyInstaller
 31 | #  Usually these files are written by a python script from a template
 32 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 33 | *.manifest
 34 | *.spec
 35 | 
 36 | # Installer logs
 37 | pip-log.txt
 38 | pip-delete-this-directory.txt
 39 | 
 40 | # Unit test / coverage reports
 41 | htmlcov/
 42 | .tox/
 43 | .nox/
 44 | .coverage
 45 | .coverage.*
 46 | .cache
 47 | nosetests.xml
 48 | coverage.xml
 49 | *.cover
 50 | *.py,cover
 51 | .hypothesis/
 52 | .pytest_cache/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | target/
 76 | 
 77 | # Jupyter Notebook
 78 | .ipynb_checkpoints
 79 | 
 80 | # IPython
 81 | profile_default/
 82 | ipython_config.py
 83 | 
 84 | # pyenv
 85 | .python-version
 86 | 
 87 | # pipenv
 88 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 89 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 90 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 91 | #   install all needed dependencies.
 92 | #Pipfile.lock
 93 | 
 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 95 | __pypackages__/
 96 | 
 97 | # Celery stuff
 98 | celerybeat-schedule
 99 | celerybeat.pid
100 | 
101 | # SageMath parsed files
102 | *.sage.py
103 | 
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 | 
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 | 
117 | # Rope project settings
118 | .ropeproject
119 | 
120 | # mkdocs documentation
121 | /site
122 | 
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 | 
128 | # Pyre type checker
129 | .pyre/
130 | .idea
131 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | <h1 align="center">ParsNER 🦁</h1>
  2 | 
  3 | <br/><br/>
  4 | 
  5 | 
  6 | ## Introduction
  7 | 
  8 | This repo contains the pretrained models that have been fine-tuned for the Named Entity Recognition (NER) task. The models were trained on a mixed NER dataset collected from [ARMAN](https://github.com/HaniehP/PersianNER), [PEYMA](http://nsurl.org/2019-2/tasks/task-7-named-entity-recognition-ner-for-farsi/), and [WikiANN](https://elisa-ie.github.io/wikiann/) that covers ten types of entities:
  9 | 
 10 | - Date (DAT)
 11 | - Event (EVE)
 12 | - Facility (FAC)
 13 | - Location (LOC)
 14 | - Money (MON)
 15 | - Organization (ORG)
 16 | - Percent (PCT)
 17 | - Person (PER)
 18 | - Product (PRO)
 19 | - Time (TIM)
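
The dataset table in this README uses `B-` and `I-` prefixes for these types, which suggests a BIO tagging scheme. A minimal sketch of the resulting label set, assuming an additional `O` (outside) tag and an arbitrary ordering; the `id2label` mappings shipped with the models may differ:

```python
# The ten entity types listed above.
ENTITY_TYPES = ["DAT", "EVE", "FAC", "LOC", "MON", "ORG", "PCT", "PER", "PRO", "TIM"]


def build_bio_labels(entity_types):
    """Return BIO labels: one assumed "O" tag plus B-/I- tags per entity type."""
    labels = ["O"]  # assumption: tokens outside any entity are tagged "O"
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation token of an entity
    return labels


labels = build_bio_labels(ENTITY_TYPES)
# Hypothetical mapping for illustration only; the ordering is not taken from the models.
label2id = {label: idx for idx, label in enumerate(labels)}
print(len(labels), labels[:3])
```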
 20 | 
 21 | 
 22 | ## Dataset Information
 23 | 
 24 | |       |   Records |   B-DAT |   B-EVE |   B-FAC |   B-LOC |   B-MON |   B-ORG |   B-PCT |   B-PER |   B-PRO |   B-TIM |   I-DAT |   I-EVE |   I-FAC |   I-LOC |   I-MON |   I-ORG |   I-PCT |   I-PER |   I-PRO |   I-TIM |
 25 | |:------|----------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
 26 | | Train |     29133 |    1423 |    1487 |    1400 |   13919 |     417 |   15926 |     355 |   12347 |    1855 |     150 |    1947 |    5018 |    2421 |    4118 |    1059 |   19579 |     573 |    7699 |    1914 |     332 |
 27 | | Valid |      5142 |     267 |     253 |     250 |    2362 |     100 |    2651 |      64 |    2173 |     317 |      19 |     373 |     799 |     387 |     717 |     270 |    3260 |     101 |    1382 |     303 |      35 |
 28 | | Test  |      6049 |     407 |     256 |     248 |    2886 |      98 |    3216 |      94 |    2646 |     318 |      43 |     568 |     888 |     408 |     858 |     263 |    3967 |     141 |    1707 |     296 |      78 |
 29 | 
 30 | 
 31 | **Download:** You can download the dataset [from here](https://drive.google.com/uc?id=1fC2WGlpqumUTaT9Dr_U1jO2no3YMKFJ4).
 32 | 
 33 | 
 34 | ## Evaluation
 35 | 
 36 | The following tables summarize the scores obtained by the pretrained models, overall and per class.
 37 | 
 38 | |    Model   | accuracy | precision |  recall  |    f1    |
 39 | |:----------:|:--------:|:---------:|:--------:|:--------:|
 40 | |    Bert    | 0.995086 |  0.953454 | 0.961113 | 0.957268 |
 41 | |   Roberta  | 0.994849 |  0.949816 | 0.960235 | 0.954997 |
 42 | | Distilbert | 0.994534 |  0.946326 |  0.95504 | 0.950663 |
 43 | |   Albert   | 0.993405 |  0.938907 | 0.943966 | 0.941429 |
 44 | 
 45 | 
 46 | ### Bert
 47 | 
 48 | |     	| number 	| precision 	|  recall  	|    f1    	|
 49 | |:---:	|:------:	|:---------:	|:--------:	|:--------:	|
 50 | | DAT 	|   407  	|  0.860636 	| 0.864865 	| 0.862745 	|
 51 | | EVE 	|   256  	|  0.969582 	| 0.996094 	| 0.982659 	|
 52 | | FAC 	|   248  	|  0.976190 	| 0.991935 	| 0.984000 	|
 53 | | LOC 	|  2884  	|  0.970232 	| 0.971914 	| 0.971072 	|
 54 | | MON 	|   98   	|  0.905263 	| 0.877551 	| 0.891192 	|
 55 | | ORG 	|  3216  	|  0.939125 	| 0.954602 	| 0.946800 	|
 56 | | PCT 	|   94   	|  1.000000 	| 0.968085 	| 0.983784 	|
 57 | | PER 	|  2645  	|  0.965244 	| 0.965974 	| 0.965608 	|
 58 | | PRO 	|   318  	|  0.981481 	| 1.000000 	| 0.990654 	|
 59 | | TIM 	|   43   	|  0.692308 	| 0.837209 	| 0.757895 	|
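
The overall scores appear to be micro-averaged over entities. As a rough check, the sketch below weights each per-class recall by its `number` column, assuming `number` is the count of gold entities of that class in the test set:

```python
# (number, recall) per class, copied from the Bert table above.
BERT_PER_CLASS = {
    "DAT": (407, 0.864865),
    "EVE": (256, 0.996094),
    "FAC": (248, 0.991935),
    "LOC": (2884, 0.971914),
    "MON": (98, 0.877551),
    "ORG": (3216, 0.954602),
    "PCT": (94, 0.968085),
    "PER": (2645, 0.965974),
    "PRO": (318, 1.000000),
    "TIM": (43, 0.837209),
}


def micro_recall(per_class):
    """Support-weighted recall: total correctly found entities / total gold entities."""
    total = sum(number for number, _ in per_class.values())
    found = sum(number * recall for number, recall in per_class.values())
    return found / total


print(round(micro_recall(BERT_PER_CLASS), 6))
```

The result can be compared with the `recall` value in the Bert row of the overall table.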
 60 | 
 61 | ### Roberta
 62 | 
 63 | |     	| number 	| precision 	|  recall  	|    f1    	|
 64 | |:---:	|:------:	|:---------:	|:--------:	|:--------:	|
 65 | | DAT 	|   407  	|  0.844869 	| 0.869779 	| 0.857143 	|
 66 | | EVE 	|   256  	|  0.948148 	| 1.000000 	| 0.973384 	|
 67 | | FAC 	|   248  	|  0.957529 	| 1.000000 	| 0.978304 	|
 68 | | LOC 	|  2884  	|  0.965422 	| 0.968100 	| 0.966759 	|
 69 | | MON 	|   98   	|  0.937500 	| 0.918367 	| 0.927835 	|
 70 | | ORG 	|  3216  	|  0.943662 	| 0.958333 	| 0.950941 	|
 71 | | PCT 	|   94   	|  1.000000 	| 0.968085 	| 0.983784 	|
 72 | | PER 	|  2646  	|  0.957030 	| 0.959562 	| 0.958294 	|
 73 | | PRO 	|   318  	|  0.963636 	| 1.000000 	| 0.981481 	|
 74 | | TIM 	|   43   	|  0.739130 	| 0.790698 	| 0.764045 	|
 75 | 
 76 | 
 77 | ### Distilbert
 78 | 
 79 | |     	| number 	| precision 	|  recall  	|    f1    	|
 80 | |:---:	|:------:	|:---------:	|:--------:	|:--------:	|
 81 | | DAT 	|   407  	|  0.812048 	| 0.828010 	| 0.819951 	|
 82 | | EVE 	|   256  	|  0.955056 	| 0.996094 	| 0.975143 	|
 83 | | FAC 	|   248  	|  0.972549 	| 1.000000 	| 0.986083 	|
 84 | | LOC 	|  2884  	|  0.968403 	| 0.967060 	| 0.967731 	|
 85 | | MON 	|   98   	|  0.925532 	| 0.887755 	| 0.906250 	|
 86 | | ORG 	|  3216  	|  0.932095 	| 0.951803 	| 0.941846 	|
 87 | | PCT 	|   94   	|  0.936842 	| 0.946809 	| 0.941799 	|
 88 | | PER 	|  2645  	|  0.959818 	| 0.957278 	| 0.958546 	|
 89 | | PRO 	|   318  	|  0.963526 	| 0.996855 	| 0.979907 	|
 90 | | TIM 	|   43   	|  0.760870 	| 0.813953 	| 0.786517 	|
 91 | 
 92 | ### Albert
 93 | 
 94 | |     	| number 	| precision 	|  recall  	|    f1    	|
 95 | |:---:	|:------:	|:---------:	|:--------:	|:--------:	|
 96 | | DAT 	|   407  	|  0.820639 	| 0.820639 	| 0.820639 	|
 97 | | EVE 	|   256  	|  0.936803 	| 0.984375 	| 0.960000 	|
 98 | | FAC 	|   248  	|  0.925373 	| 1.000000 	| 0.961240 	|
 99 | | LOC 	|  2884  	|  0.960818 	| 0.960818 	| 0.960818 	|
100 | | MON 	|   98   	|  0.913978 	| 0.867347 	| 0.890052 	|
101 | | ORG 	|  3216  	|  0.920892 	| 0.937500 	| 0.929122 	|
102 | | PCT 	|   94   	|  0.946809 	| 0.946809 	| 0.946809 	|
103 | | PER 	|  2644  	|  0.960000 	| 0.944024 	| 0.951945 	|
104 | | PRO 	|   318  	|  0.942943 	| 0.987421 	| 0.964670 	|
105 | | TIM 	|   43   	|  0.780488 	| 0.744186 	| 0.761905 	|
106 | 
107 | ## How To Use
108 | You can use these models with the Transformers pipeline for NER.
109 | 
110 | ### Installing requirements
111 | 
112 | ```bash
113 | pip install sentencepiece
114 | pip install transformers
115 | ```
116 | 
117 | ### How to predict using pipeline
118 | 
119 | ```python
120 | from transformers import AutoTokenizer
121 | from transformers import AutoModelForTokenClassification  # for pytorch
122 | from transformers import TFAutoModelForTokenClassification  # for tensorflow
123 | from transformers import pipeline
124 | 
125 | # model_name_or_path = "HooshvareLab/bert-fa-zwnj-base-ner"  # Bert
126 | # model_name_or_path = "HooshvareLab/roberta-fa-zwnj-base-ner"  # Roberta
127 | model_name_or_path = "HooshvareLab/distilbert-fa-zwnj-base-ner"  # Distilbert
128 | # model_name_or_path = "HooshvareLab/albert-fa-zwnj-base-v2-ner"  # Albert
129 | 
130 | tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
131 | 
132 | model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)  # Pytorch
133 | # model = TFAutoModelForTokenClassification.from_pretrained(model_name_or_path)  # Tensorflow
134 | 
135 | nlp = pipeline("ner", model=model, tokenizer=tokenizer)
136 | example = "در سال ۲۰۱۳ درگذشت و آندرتیکر و کین برای او مراسم یادبود گرفتند."
137 | 
138 | ner_results = nlp(example)
139 | print(ner_results)
140 | ```
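
By default the pipeline returns one prediction per token. A minimal sketch of merging such predictions into entity spans is shown below; it assumes each prediction is a dict with `entity`, `start`, and `end` keys. `group_entities` is a hypothetical helper, and `sample_predictions` is hand-written for illustration, not real model output:

```python
def group_entities(text, token_predictions):
    """Merge token-level B-/I- predictions into entity spans."""
    spans = []
    current = None  # the span currently being extended, if any
    for pred in token_predictions:
        tag = pred["entity"]
        if tag == "O":
            current = None
            continue
        prefix, _, label = tag.partition("-")
        # Extend the open span only for an I- tag with the same label that
        # starts at most one character (e.g. a space) after the span's end.
        if (
            prefix == "I"
            and current is not None
            and current["label"] == label
            and pred["start"] - current["end"] <= 1
        ):
            current["end"] = pred["end"]
        else:
            current = {"label": label, "start": pred["start"], "end": pred["end"]}
            spans.append(current)
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "علی دایی در تهران"
# Hand-written token predictions (illustrative only).
sample_predictions = [
    {"entity": "B-PER", "start": 0, "end": 3},
    {"entity": "I-PER", "start": 4, "end": 8},
    {"entity": "B-LOC", "start": 12, "end": 17},
]
for span in group_entities(text, sample_predictions):
    print(span["label"], span["text"])
```

Recent versions of Transformers can also do this grouping inside the pipeline through the `aggregation_strategy` argument of `pipeline`.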
141 | 
142 | ## Models
143 | 
144 | ### Hugging Face Model Hub
145 | 
146 | - [Bert](https://huggingface.co/HooshvareLab/bert-fa-zwnj-base-ner)
147 | - [Roberta](https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base-ner)
148 | - [Distilbert](https://huggingface.co/HooshvareLab/distilbert-fa-zwnj-base-ner)
149 | - [Albert](https://huggingface.co/HooshvareLab/albert-fa-zwnj-base-v2-ner)
150 | 
151 | ### Training
152 | All models were trained on a single NVIDIA P100 GPU with the following parameters.
153 | 
154 | **Arguments**
155 | ```bash
156 | "task_name": "ner"
157 | "model_name_or_path": model_name_or_path
158 | "train_file": "/content/ner/train.csv"
159 | "validation_file": "/content/ner/valid.csv"
160 | "test_file": "/content/ner/test.csv"
161 | "output_dir": output_dir
162 | "cache_dir": "/content/cache"
163 | "per_device_train_batch_size": 16
164 | "per_device_eval_batch_size": 16
165 | "use_fast_tokenizer": True
166 | "num_train_epochs": 5.0
167 | "do_train": True
168 | "do_eval": True
169 | "do_predict": True
170 | "learning_rate": 2e-5
171 | "evaluation_strategy": "steps"
172 | "logging_steps": 1000
173 | "save_steps": 1000
174 | "save_total_limit": 2
175 | "overwrite_output_dir": True
176 | "fp16": True
177 | "preprocessing_num_workers": 4
178 | ```
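
The arguments above can be collected into a JSON configuration. The sketch below does that with the standard library only. The argument names resemble those of the token-classification example script in the Transformers repository, but whether that script was used is an assumption. `build_training_args` is a hypothetical helper, and the two values passed to it are placeholders:

```python
import json


def build_training_args(model_name_or_path, output_dir):
    """Return the training arguments listed above as a plain dict."""
    return {
        "task_name": "ner",
        "model_name_or_path": model_name_or_path,
        "train_file": "/content/ner/train.csv",
        "validation_file": "/content/ner/valid.csv",
        "test_file": "/content/ner/test.csv",
        "output_dir": output_dir,
        "cache_dir": "/content/cache",
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "use_fast_tokenizer": True,
        "num_train_epochs": 5.0,
        "do_train": True,
        "do_eval": True,
        "do_predict": True,
        "learning_rate": 2e-5,
        "evaluation_strategy": "steps",
        "logging_steps": 1000,
        "save_steps": 1000,
        "save_total_limit": 2,
        "overwrite_output_dir": True,
        "fp16": True,
        "preprocessing_num_workers": 4,
    }


# Placeholder values: substitute the base model and output directory you use.
args = build_training_args("<base-model-name-or-path>", "<output-dir>")
config_json = json.dumps(args, indent=2)
print(config_json)
```

Scripts that parse their arguments with `HfArgumentParser` can typically read such a file via `parse_json_file`.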
179 | 
180 | 
181 | ## Cite
182 | Please cite this repository in publications as the following:
183 | 
184 | ```bibtex
185 | @misc{ParsNER,
186 |   author = {Hooshvare Team},
187 |   title = {Pre-Trained NER models for Persian},
188 |   year = {2021},
189 |   publisher = {GitHub},
190 |   journal = {GitHub repository},
191 |   howpublished = {\url{https://github.com/hooshvare/parsner}},
192 | }
193 | ```
194 | 
195 | 
196 | ## Questions?
197 | Post a GitHub issue on the [ParsNER Issues](https://github.com/hooshvare/parsner/issues) page.


--------------------------------------------------------------------------------