├── ner4el
    ├── src
    │   ├── ui
    │   │   ├── __init__.py
    │   │   ├── run.py
    │   │   └── ui_utils.py
    │   ├── common
    │   │   ├── __init__.py
    │   │   └── utils.py
    │   ├── pl_data
    │   │   ├── __init__.py
    │   │   ├── datamodule.py
    │   │   └── dataset.py
    │   ├── pl_modules
    │   │   ├── __init__.py
    │   │   ├── ner_model.py
    │   │   └── model.py
    │   ├── test.py
    │   └── run.py
    ├── conf
    │   ├── model
    │   │   └── default.yaml
    │   ├── hydra
    │   │   └── default.yaml
    │   ├── default.yaml
    │   ├── optim
    │   │   └── default.yaml
    │   ├── train
    │   │   └── default.yaml
    │   ├── logging
    │   │   └── default.yaml
    │   └── data
    │   │   └── default.yaml
    ├── wandb
    │   └── README.md
    ├── data
    │   └── README.md
    ├── preprocessed_datasets
    │   └── README.md
    ├── requirements.txt
    ├── LICENSE_TEMPLATE
    └── .gitignore
├── img
    ├── logo_ner4el.png
    ├── percentages.png
    └── contributions.png
├── .env
├── requirements.txt
├── README.md
└── LICENSE


/ner4el/src/ui/__init__.py:
--------------------------------------------------------------------------------
1 | 


--------------------------------------------------------------------------------
/ner4el/src/common/__init__.py:
--------------------------------------------------------------------------------
1 | 


--------------------------------------------------------------------------------
/ner4el/src/pl_data/__init__.py:
--------------------------------------------------------------------------------
1 | 


--------------------------------------------------------------------------------
/ner4el/src/pl_modules/__init__.py:
--------------------------------------------------------------------------------
1 | 


--------------------------------------------------------------------------------
/img/logo_ner4el.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Babelscape/ner4el/HEAD/img/logo_ner4el.png


--------------------------------------------------------------------------------
/img/percentages.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Babelscape/ner4el/HEAD/img/percentages.png


--------------------------------------------------------------------------------
/img/contributions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Babelscape/ner4el/HEAD/img/contributions.png


--------------------------------------------------------------------------------
/ner4el/conf/model/default.yaml:
--------------------------------------------------------------------------------
1 | _target_: src.pl_modules.model.MyModel
2 | 
3 | transformer_name: bert-large-uncased
4 | dropout: 0.5
5 | 


--------------------------------------------------------------------------------
/ner4el/wandb/README.md:
--------------------------------------------------------------------------------
1 | This folder will contain all the model checkpoints. Put here also the model of the pretrained NER classifier.
2 | 
3 | 


--------------------------------------------------------------------------------
/ner4el/data/README.md:
--------------------------------------------------------------------------------
1 | This folder should contain the resources (i.e., alias table, descriptions dict, counts dict, title dict) and the datasets (aida-train, aida-dev, aida-test, msnbc, aquaint, ace2004, cweb, wiki).
2 | 


--------------------------------------------------------------------------------
/.env:
--------------------------------------------------------------------------------
1 | export PROJECT_ROOT="/mnt/data/ner4el/ner4el"
2 | export YOUR_TRAIN_DATASET_PATH="/your/project/root/data/blues/train"
3 | export YOUR_VAL_DATASET_PATH="/your/project/root/data/blues/val"
4 | export YOUR_TEST_DATASET_PATH="/your/project/root/data/blues/test"
5 | export PYTHONPATH=$PROJECT_ROOT
6 | 
7 | 


--------------------------------------------------------------------------------
/ner4el/conf/hydra/default.yaml:
--------------------------------------------------------------------------------
 1 | run:
 2 |   dir: .cache/${now:%Y-%m-%d}/${now:%H-%M-%S}
 3 | 
 4 | sweep:
 5 |   dir: .cache/multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}/
 6 |   subdir: ${hydra.job.num}_${hydra.job.id}
 7 | 
 8 | job:
 9 |   env_set:
10 |     WANDB_START_METHOD: thread
11 |     WANDB_DIR: ${oc.env:PROJECT_ROOT}
12 | 


--------------------------------------------------------------------------------
/ner4el/preprocessed_datasets/README.md:
--------------------------------------------------------------------------------
1 | Once the src/run.py script is executed, the indexed training dataset will be stored in this folder. This allows to avoid regenerating the training dataset from scratch every time (this process requires around half an hour).To achieve this, you have to set "processed = True" in the conf/data/default.yaml file.
2 | 
3 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | # Stuff easy to break with updates
 2 | torch==1.8.1
 3 | torchvision==0.9.1
 4 | pytorch-lightning==1.2.5
 5 | transformers==4.6.0
 6 | hydra-core==1.1.0.dev5
 7 | wandb==0.10.23
 8 | streamlit==0.79.0
 9 | # hydra-joblib-launcher
10 | 
11 | # Stable stuff usually backward compatible
12 | dvc
13 | python-dotenv
14 | sklearn
15 | matplotlib
16 | stqdm
17 | 


--------------------------------------------------------------------------------
/ner4el/requirements.txt:
--------------------------------------------------------------------------------
 1 | # Stuff easy to break with updates
 2 | torch==1.8.1
 3 | torchvision==0.9.1
 4 | pytorch-lightning==1.2.5
 5 | hydra-core==1.1.0.dev5
 6 | wandb==0.10.23
 7 | streamlit==0.79.0
 8 | transformers==4.6.0
 9 | # hydra-joblib-launcher
10 | 
11 | # Stable stuff usually backward compatible
12 | dvc
13 | python-dotenv
14 | matplotlib
15 | stqdm
16 | sklearn
17 | 


--------------------------------------------------------------------------------
/ner4el/conf/default.yaml:
--------------------------------------------------------------------------------
 1 | # metadata specialised for each experiment
 2 | core:
 3 |   version: 0.0.1
 4 |   tags:
 5 |     - mytag
 6 | 
 7 | defaults:
 8 |   - data: default
 9 |   - hydra: default
10 |   - logging: default
11 |   - model: default
12 |   - optim: default
13 |   - train: default
14 | #    Decomment this parameter to get parallel job running
15 | #  - hydra/launcher: joblib
16 | 


--------------------------------------------------------------------------------
/ner4el/src/ui/run.py:
--------------------------------------------------------------------------------
 1 | from pathlib import Path
 2 | 
 3 | import streamlit as st
 4 | 
 5 | from src.pl_modules.model import MyModel
 6 | from src.ui.ui_utils import select_checkpoint
 7 | 
 8 | 
 9 | @st.cache(allow_output_mutation=True)
10 | def get_model(checkpoint_path: Path):
11 |     return MyModel.load_from_checkpoint(checkpoint_path=str(checkpoint_path))
12 | 
13 | 
14 | checkpoint_path = select_checkpoint()
15 | model: MyModel = get_model(checkpoint_path=checkpoint_path)
16 | 


--------------------------------------------------------------------------------
/ner4el/conf/optim/default.yaml:
--------------------------------------------------------------------------------
 1 | optimizer:
 2 |   #  Adam-oriented deep learning
 3 |   _target_: torch.optim.Adam
 4 |   #  These are all default parameters for the Adam optimizer
 5 |   lr: 0.00001
 6 |   betas: [ 0.9, 0.999 ]
 7 |   eps: 1e-08
 8 |   weight_decay: 0
 9 | 
10 | use_lr_scheduler: False
11 | lr_scheduler:
12 |   _target_: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts
13 |   T_0: 10
14 |   T_mult: 2
15 |   eta_min: 0 # min value for the lr
16 |   last_epoch: -1
17 |   verbose: True
18 | 


--------------------------------------------------------------------------------
/ner4el/conf/train/default.yaml:
--------------------------------------------------------------------------------
 1 | # reproducibility
 2 | deterministic: True
 3 | random_seed: 2
 4 | 
 5 | # training
 6 | 
 7 | pl_trainer: #tutto ciò che è sotto pl_trainer viene passato al trainer di lightning
 8 |   fast_dev_run: False # Enable this for debug purposes
 9 |   gpus: 1
10 |   precision: 32
11 |   max_epochs: 100
12 |   accumulate_grad_batches: 256
13 |   num_sanity_val_steps: 2
14 |   #gradient_clip_val: 10.0
15 | 
16 | monitor_metric: 'val_acc'
17 | monitor_metric_mode: 'max'
18 | 
19 | early_stopping:
20 |   patience: 5
21 |   verbose: True
22 | 
23 | model_checkpoints:
24 |   save_top_k: 3
25 |   verbose: True
26 | 


--------------------------------------------------------------------------------
/ner4el/conf/logging/default.yaml:
--------------------------------------------------------------------------------
 1 | # log frequency
 2 | val_check_interval: 5000
 3 | progress_bar_refresh_rate: 20
 4 | 
 5 | wandb:
 6 |   project: NER_for_EL
 7 |   entity: null
 8 |   log_model: False
 9 |   mode: 'offline'
10 |   name: ${data.datamodule.version}-negativesamples=${data.datamodule.negative_samples}-nernegativesamples=${data.datamodule.ner_negative_samples}-nerrepresentation=${data.datamodule.ner_representation}-${data.datamodule.datasets.train.num_candidates}-${data.datamodule.datasets.train.window}-${model.transformer_name}-precision${train.pl_trainer.precision}-accumulation${train.pl_trainer.accumulate_grad_batches}
11 | 
12 | wandb_watch:
13 |   log: 'all'
14 |   log_freq: 100
15 | 
16 | lr_monitor:
17 |   logging_interval: "step"
18 |   log_momentum: False
19 | 


--------------------------------------------------------------------------------
/ner4el/LICENSE_TEMPLATE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2021 Valentino Maiorca, Luca Moschella
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/ner4el/conf/data/default.yaml:
--------------------------------------------------------------------------------
 1 | datamodule:
 2 |   _target_: src.pl_data.datamodule.MyDataModule
 3 |   transformer_name: ${model.transformer_name}
 4 |   alias_table_path: data/alias_table.pickle
 5 |   descriptions_dict_path: data/descriptions_dict.csv
 6 |   item_counts_dict_path: data/item_counts_dict.csv
 7 |   title_dict_path: data/title_dict.pickle
 8 |   id2ner_dict_path: data/id2ner_dict.pickle
 9 |   version: ${model.transformer_name}
10 |   negative_samples: False
11 |   ner_negative_samples: True
12 |   ner_representation: True
13 |   ner_filter_candidates: False
14 |   #ner_constrained_decoding -> can be specified once the script is started 
15 |   processed: True
16 | 
17 |   datasets:
18 |     train:
19 |       _target_: src.pl_data.dataset.MyDataset
20 |       name: train
21 |       path: data/aida_train.jsonl
22 |       num_candidates: 40
23 |       window: 128
24 |       dataset_type: train
25 | 
26 |     val:
27 |       _target_: src.pl_data.dataset.MyDataset
28 |       name: dev
29 |       path: data/aida_dev.jsonl
30 |       num_candidates: 40
31 |       window: 128
32 |       dataset_type: dev
33 | 
34 |     #UNCOMMENT THE BLOCK CORRESPONDING TEST SET THAT YOU WANT TO USE (Ctrl + Shift + 7)
35 | 
36 |     test:
37 |       _target_: src.pl_data.dataset.MyDataset
38 |       name: test
39 |       path: data/aida_test.jsonl
40 |       num_candidates: 40
41 |       window: 128
42 |       dataset_type: test
43 | 
44 |     # test:
45 |     #   _target_: src.pl_data.dataset.MyDataset
46 |     #   name: test
47 |     #   path: data/msnbc_test.jsonl
48 |     #   num_candidates: 40
49 |     #   window: 128
50 |     #   dataset_type: test
51 | 
52 |     # test:
53 |     #   _target_: src.pl_data.dataset.MyDataset
54 |     #   name: test
55 |     #   path: data/aquaint_test.jsonl
56 |     #   num_candidates: 5
57 |     #   window: 128
58 |     #   dataset_type: test
59 | 
60 |     # test:
61 |     #   _target_: src.pl_data.dataset.MyDataset
62 |     #   name: test
63 |     #   path: data/ace2004_test.jsonl
64 |     #   num_candidates: 40
65 |     #   window: 128
66 |     #   dataset_type: test
67 | 
68 |     # test:
69 |     #   _target_: src.pl_data.dataset.MyDataset
70 |     #   name: test
71 |     #   path: data/cweb_test.jsonl
72 |     #   num_candidates: 40
73 |     #   window: 128
74 |     #   dataset_type: test
75 | 
76 |     # test:
77 |     #   _target_: src.pl_data.dataset.MyDataset
78 |     #   name: test
79 |     #   path: data/wiki_test.jsonl
80 |     #   num_candidates: 40
81 |     #   window: 128
82 |     #   dataset_type: test
83 | 
84 |   num_workers:
85 |     train: 8
86 |     val: 4
87 |     test: 4
88 | 
89 |   batch_size:
90 |     train: 1
91 |     val: 1
92 |     test: 1
93 | 


--------------------------------------------------------------------------------
/ner4el/src/ui/ui_utils.py:
--------------------------------------------------------------------------------
  1 | import datetime
  2 | import operator
  3 | from pathlib import Path
  4 | from typing import List
  5 | 
  6 | import hydra
  7 | import omegaconf
  8 | import streamlit as st
  9 | import wandb
 10 | from hydra.core.global_hydra import GlobalHydra
 11 | from hydra.experimental import compose
 12 | from stqdm import stqdm
 13 | 
 14 | from src.common.utils import PROJECT_ROOT, load_envs
 15 | 
 16 | load_envs()
 17 | 
 18 | WANDB_DIR: Path = PROJECT_ROOT / "wandb"
 19 | WANDB_DIR.mkdir(exist_ok=True, parents=True)
 20 | 
 21 | st_run_sel = st.sidebar
 22 | 
 23 | 
 24 | def local_checkpoint_selection(run_dir: Path, st_key: str) -> Path:
 25 |     checkpoint_paths: List[Path] = list(run_dir.rglob("checkpoints/*"))
 26 |     if len(checkpoint_paths) == 0:
 27 |         st.error(
 28 |             f"There's no checkpoint under {run_dir}! Are you sure the restore was successful?"
 29 |         )
 30 |         st.stop()
 31 |     checkpoint_path: Path = st_run_sel.selectbox(
 32 |         label="Select a checkpoint",
 33 |         index=len(checkpoint_paths) - 1,
 34 |         options=checkpoint_paths,
 35 |         format_func=operator.attrgetter("name"),
 36 |         key=f"checkpoint_select_{st_key}",
 37 |     )
 38 | 
 39 |     return checkpoint_path
 40 | 
 41 | 
 42 | def get_run_dir(entity: str, project: str, run_id: str) -> Path:
 43 |     """
 44 |     :param run_path: "flegyas/nn-template/3hztfivf"
 45 |     :return:
 46 |     """
 47 | 
 48 |     api = wandb.Api()
 49 |     run = api.run(path=f"{entity}/{project}/{run_id}")
 50 |     created_at: datetime = datetime.datetime.strptime(
 51 |         run.created_at, "%Y-%m-%dT%H:%M:%S"
 52 |     )
 53 |     st.sidebar.markdown(body=f"[`Open on WandB`]({run.url})")
 54 | 
 55 |     timestamp: str = created_at.strftime("%Y%m%d_%H%M%S")
 56 | 
 57 |     matching_runs: List[Path] = [
 58 |         item
 59 |         for item in WANDB_DIR.iterdir()
 60 |         if item.is_dir() and item.name.endswith(run_id)
 61 |     ]
 62 | 
 63 |     if len(matching_runs) > 1:
 64 |         st.error(
 65 |             f"More than one run matching unique id {run_id}! Are you sure about that?"
 66 |         )
 67 |         st.stop()
 68 | 
 69 |     if len(matching_runs) == 1:
 70 |         return matching_runs[0]
 71 | 
 72 |     only_checkpoint: bool = st_run_sel.checkbox(
 73 |         label="Download only the checkpoint?", value=True
 74 |     )
 75 |     if st_run_sel.button(label="Download"):
 76 |         run_dir: Path = WANDB_DIR / f"restored-{timestamp}-{run.id}" / "files"
 77 |         files = [
 78 |             file
 79 |             for file in run.files()
 80 |             if "checkpoint" in file.name or not only_checkpoint
 81 |         ]
 82 |         if len(files) == 0:
 83 |             st.error(
 84 |                 f"There is no file to download from this run! Check on WandB: {run.url}"
 85 |             )
 86 |         for file in stqdm(files, desc="Downloading files..."):
 87 |             file.download(root=run_dir)
 88 |         return run_dir
 89 |     else:
 90 |         st.stop()
 91 | 
 92 | 
 93 | def select_run_path(st_key: str, default_run_path: str):
 94 |     run_path: str = st_run_sel.text_input(
 95 |         label="Run path (entity/project/id):",
 96 |         value=default_run_path,
 97 |         key=f"run_path_select_{st_key}",
 98 |     )
 99 |     if not run_path:
100 |         st.stop()
101 |     tokens: List[str] = run_path.split("/")
102 |     if len(tokens) != 3:
103 |         st.error(
104 |             f"This run path {run_path} doesn't look like a WandB run path! Are you sure about that?"
105 |         )
106 |         st.stop()
107 | 
108 |     return tokens
109 | 
110 | 
111 | def select_checkpoint(st_key: str = "MyAwesomeModel", default_run_path: str = ""):
112 |     entity, project, run_id = select_run_path(
113 |         st_key=st_key, default_run_path=default_run_path
114 |     )
115 | 
116 |     run_dir: Path = get_run_dir(entity=entity, project=project, run_id=run_id)
117 | 
118 |     return local_checkpoint_selection(run_dir, st_key=st_key)
119 | 
120 | 
121 | def get_hydra_cfg(config_name: str = "default") -> omegaconf.DictConfig:
122 |     """
123 |     Instantiate and return the hydra config -- streamlit and jupyter compatible
124 | 
125 |     Args:
126 |         config_name: .yaml configuration name, without the extension
127 | 
128 |     Returns:
129 |         The desired omegaconf.DictConfig
130 |     """
131 |     GlobalHydra.instance().clear()
132 |     hydra.experimental.initialize_config_dir(config_dir=str(PROJECT_ROOT / "conf"))
133 |     return compose(config_name=config_name)
134 | 


--------------------------------------------------------------------------------
/ner4el/.gitignore:
--------------------------------------------------------------------------------
  1 | 
  2 | # .gitignore defaults for python and pycharm
  3 | .idea
  4 | 
  5 | # Byte-compiled / optimized / DLL files
  6 | __pycache__/
  7 | *.py[cod]
  8 | *$py.class
  9 | 
 10 | # C extensions
 11 | *.so
 12 | 
 13 | # Distribution / packaging
 14 | .Python
 15 | build/
 16 | develop-eggs/
 17 | dist/
 18 | downloads/
 19 | eggs/
 20 | .eggs/
 21 | lib/
 22 | lib64/
 23 | parts/
 24 | sdist/
 25 | var/
 26 | wheels/
 27 | share/python-wheels/
 28 | *.egg-info/
 29 | .installed.cfg
 30 | *.egg
 31 | MANIFEST
 32 | 
 33 | # PyInstaller
 34 | #  Usually these files are written by a python script from a template
 35 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 36 | *.manifest
 37 | *.spec
 38 | 
 39 | # Installer logs
 40 | pip-log.txt
 41 | pip-delete-this-directory.txt
 42 | 
 43 | # Unit test / coverage reports
 44 | htmlcov/
 45 | .tox/
 46 | .nox/
 47 | .coverage
 48 | .coverage.*
 49 | .cache
 50 | nosetests.xml
 51 | coverage.xml
 52 | *.cover
 53 | *.py,cover
 54 | .hypothesis/
 55 | .pytest_cache/
 56 | cover/
 57 | 
 58 | # Translations
 59 | *.mo
 60 | *.pot
 61 | 
 62 | # Django stuff:
 63 | *.log
 64 | local_settings.py
 65 | db.sqlite3
 66 | db.sqlite3-journal
 67 | 
 68 | # Flask stuff:
 69 | instance/
 70 | .webassets-cache
 71 | 
 72 | # Scrapy stuff:
 73 | .scrapy
 74 | 
 75 | # Sphinx documentation
 76 | docs/_build/
 77 | 
 78 | # PyBuilder
 79 | .pybuilder/
 80 | target/
 81 | 
 82 | # Jupyter Notebook
 83 | .ipynb_checkpoints
 84 | 
 85 | # IPython
 86 | profile_default/
 87 | ipython_config.py
 88 | 
 89 | # pyenv
 90 | #   For a library or package, you might want to ignore these files since the code is
 91 | #   intended to run in multiple environments; otherwise, check them in:
 92 | # .python-version
 93 | 
 94 | # pipenv
 95 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 96 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 97 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 98 | #   install all needed dependencies.
 99 | #Pipfile.lock
100 | 
101 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
102 | __pypackages__/
103 | 
104 | # Celery stuff
105 | celerybeat-schedule
106 | celerybeat.pid
107 | 
108 | # SageMath parsed files
109 | *.sage.py
110 | 
111 | # Environments
112 | .env
113 | .venv
114 | env/
115 | venv/
116 | ENV/
117 | env.bak/
118 | venv.bak/
119 | 
120 | # Spyder project settings
121 | .spyderproject
122 | .spyproject
123 | 
124 | # Rope project settings
125 | .ropeproject
126 | 
127 | # mkdocs documentation
128 | /site
129 | 
130 | # mypy
131 | .mypy_cache/
132 | .dmypy.json
133 | dmypy.json
134 | 
135 | # Pyre type checker
136 | .pyre/
137 | 
138 | # pytype static type analyzer
139 | .pytype/
140 | 
141 | # Cython debug symbols
142 | cython_debug/
143 | 
144 | # Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
145 | # Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
146 | 
147 | # User-specific stuff
148 | .idea/**/workspace.xml
149 | .idea/**/tasks.xml
150 | .idea/**/usage.statistics.xml
151 | .idea/**/dictionaries
152 | .idea/**/shelf
153 | 
154 | # Generated files
155 | .idea/**/contentModel.xml
156 | 
157 | # Sensitive or high-churn files
158 | .idea/**/dataSources/
159 | .idea/**/dataSources.ids
160 | .idea/**/dataSources.local.xml
161 | .idea/**/sqlDataSources.xml
162 | .idea/**/dynamic.xml
163 | .idea/**/uiDesigner.xml
164 | .idea/**/dbnavigator.xml
165 | 
166 | # Gradle
167 | .idea/**/gradle.xml
168 | .idea/**/libraries
169 | 
170 | # Gradle and Maven with auto-import
171 | # When using Gradle or Maven with auto-import, you should exclude module files,
172 | # since they will be recreated, and may cause churn.  Uncomment if using
173 | # auto-import.
174 | # .idea/artifacts
175 | # .idea/compiler.xml
176 | # .idea/jarRepositories.xml
177 | # .idea/modules.xml
178 | # .idea/*.iml
179 | # .idea/modules
180 | # *.iml
181 | # *.ipr
182 | 
183 | # CMake
184 | cmake-build-*/
185 | 
186 | # Mongo Explorer plugin
187 | .idea/**/mongoSettings.xml
188 | 
189 | # File-based project format
190 | *.iws
191 | 
192 | # IntelliJ
193 | out/
194 | 
195 | # mpeltonen/sbt-idea plugin
196 | .idea_modules/
197 | 
198 | # JIRA plugin
199 | atlassian-ide-plugin.xml
200 | 
201 | # Cursive Clojure plugin
202 | .idea/replstate.xml
203 | 
204 | # Crashlytics plugin (for Android Studio and IntelliJ)
205 | com_crashlytics_export_strings.xml
206 | crashlytics.properties
207 | crashlytics-build.properties
208 | fabric.properties
209 | 
210 | # Editor-based Rest Client
211 | .idea/httpRequests
212 | 
213 | # Android studio 3.1+ serialized cache file
214 | .idea/caches/build_file_checksums.ser
215 | 


--------------------------------------------------------------------------------
/ner4el/src/common/utils.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | from pathlib import Path
  3 | from typing import Dict, List, Optional
  4 | 
  5 | import dotenv
  6 | import numpy as np
  7 | import pytorch_lightning as pl
  8 | import torch
  9 | import pickle
 10 | import json
 11 | from tqdm import tqdm
 12 | import pandas as pd
 13 | import pytorch_lightning as pl
 14 | from omegaconf import DictConfig, OmegaConf
 15 | 
 16 | 
 17 | def read_alias_table(filename):
 18 |     with open(filename, 'rb') as f:
 19 |         alias_table = pickle.load(f)
 20 |     return alias_table
 21 | 
 22 | def read_descriptions_dict(filename):
 23 |     descriptions = pd.read_csv(filename)
 24 |     descriptions_dict = descriptions.set_index("text_id").text.to_dict()
 25 |     return descriptions_dict
 26 | 
 27 | def read_item_counts_dict(filename):
 28 |     id_counts_df = pd.read_csv(filename).set_index('page_id')
 29 |     item_counts_dict = id_counts_df.counts.to_dict()
 30 |     return item_counts_dict
 31 | 
 32 | def read_dataset(filename):
 33 |     data = []
 34 |     with open(filename, 'r', encoding='utf-8') as f:
 35 |         for line in tqdm(f):
 36 |             data.append(json.loads(line))
 37 | 
 38 |     return data
 39 | 
 40 | def get_title_dict(filename):
 41 |     '''id_counts_df = pd.read_csv(filename)
 42 |     title_dict = id_counts_df.set_index("page_id").title.to_dict()'''
 43 |     with open(filename, 'rb') as f:
 44 |         title_dict = pickle.load(f)
 45 |     return title_dict
 46 | 
 47 | def read_id2ner_dict(filename):
 48 |     with open(filename, 'rb') as f:
 49 |         id2ner = pickle.load(f)
 50 |     return id2ner
 51 | 
 52 | 
 53 | def get_env(env_name: str, default: Optional[str] = None) -> str:
 54 |     """
 55 |     Safely read an environment variable.
 56 |     Raises errors if it is not defined or it is empty.
 57 | 
 58 |     :param env_name: the name of the environment variable
 59 |     :param default: the default (optional) value for the environment variable
 60 | 
 61 |     :return: the value of the environment variable
 62 |     """
 63 |     if env_name not in os.environ:
 64 |         if default is None:
 65 |             raise KeyError(f"{env_name} not defined and no default value is present!")
 66 |         return default
 67 | 
 68 |     env_value: str = os.environ[env_name]
 69 |     if not env_value:
 70 |         if default is None:
 71 |             raise ValueError(
 72 |                 f"{env_name} has yet to be configured and no default value is present!"
 73 |             )
 74 |         return default
 75 | 
 76 |     return env_value
 77 | 
 78 | 
 79 | def load_envs(env_file: Optional[str] = None) -> None:
 80 |     """
 81 |     Load all the environment variables defined in the `env_file`.
 82 |     This is equivalent to `. env_file` in bash.
 83 | 
 84 |     It is possible to define all the system specific variables in the `env_file`.
 85 | 
 86 |     :param env_file: the file that defines the environment variables to use. If None
 87 |                      it searches for a `.env` file in the project.
 88 |     """
 89 |     dotenv.load_dotenv(dotenv_path=env_file, override=True)
 90 | 
 91 | 
 92 | STATS_KEY: str = "stats"
 93 | 
 94 | 
 95 | # Adapted from https://github.com/hobogalaxy/lightning-hydra-template/blob/6bf03035107e12568e3e576e82f83da0f91d6a11/src/utils/template_utils.py#L125
 96 | def log_hyperparameters(
 97 |     cfg: DictConfig,
 98 |     model: pl.LightningModule,
 99 |     trainer: pl.Trainer,
100 | ) -> None:
101 |     """This method controls which parameters from Hydra config are saved by Lightning loggers.
102 |     Additionally saves:
103 |         - sizes of train, val, test dataset
104 |         - number of trainable model parameters
105 |     Args:
106 |         cfg (DictConfig): [description]
107 |         model (pl.LightningModule): [description]
108 |         trainer (pl.Trainer): [description]
109 |     """
110 |     hparams = OmegaConf.to_container(cfg, resolve=True)
111 | 
112 |     # save number of model parameters
113 |     hparams[f"{STATS_KEY}/params_total"] = sum(p.numel() for p in model.parameters())
114 |     hparams[f"{STATS_KEY}/params_trainable"] = sum(
115 |         p.numel() for p in model.parameters() if p.requires_grad
116 |     )
117 |     hparams[f"{STATS_KEY}/params_not_trainable"] = sum(
118 |         p.numel() for p in model.parameters() if not p.requires_grad
119 |     )
120 | 
121 |     # send hparams to all loggers
122 |     trainer.logger.log_hyperparams(hparams)
123 | 
124 |     # disable logging any more hyperparameters for all loggers
125 |     # (this is just a trick to prevent trainer from logging hparams of model, since we already did that above)
126 |     trainer.logger.log_hyperparams = lambda params: None
127 | 
128 | 
129 | # Load environment variables
130 | load_envs()
131 | 
132 | # Set the cwd to the project root
133 | PROJECT_ROOT: Path = Path(get_env("PROJECT_ROOT"))
134 | assert (
135 |     PROJECT_ROOT.exists()
136 | ), "You must configure the PROJECT_ROOT environment variable in a .env file!"
137 | 
138 | os.chdir(PROJECT_ROOT)
139 | 


--------------------------------------------------------------------------------
/ner4el/src/test.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | from pathlib import Path
  3 | from typing import List
  4 | 
  5 | import hydra
  6 | import omegaconf
  7 | import pytorch_lightning as pl
  8 | from hydra.core.hydra_config import HydraConfig
  9 | from omegaconf import DictConfig, OmegaConf
 10 | from pytorch_lightning import seed_everything, Callback
 11 | from pytorch_lightning.callbacks import (
 12 |     EarlyStopping,
 13 |     LearningRateMonitor,
 14 |     ModelCheckpoint,
 15 | )
 16 | from pytorch_lightning.loggers import WandbLogger
 17 | 
 18 | from src.common.utils import log_hyperparameters, PROJECT_ROOT
 19 | 
 20 | from pl_modules.model import MyModel
 21 | 
 22 | 
 23 | def build_callbacks(cfg: DictConfig) -> List[Callback]:
 24 |     callbacks: List[Callback] = []
 25 | 
 26 |     if "lr_monitor" in cfg.logging:
 27 |         hydra.utils.log.info(f"Adding callback <LearningRateMonitor>")
 28 |         callbacks.append(
 29 |             LearningRateMonitor(
 30 |                 logging_interval=cfg.logging.lr_monitor.logging_interval,
 31 |                 log_momentum=cfg.logging.lr_monitor.log_momentum,
 32 |             )
 33 |         )
 34 | 
 35 |     if "early_stopping" in cfg.train:
 36 |         hydra.utils.log.info(f"Adding callback <EarlyStopping>")
 37 |         callbacks.append(
 38 |             EarlyStopping(
 39 |                 monitor=cfg.train.monitor_metric,
 40 |                 mode=cfg.train.monitor_metric_mode,
 41 |                 patience=cfg.train.early_stopping.patience,
 42 |                 verbose=cfg.train.early_stopping.verbose,
 43 |             )
 44 |         )
 45 | 
 46 |     if "model_checkpoints" in cfg.train:
 47 |         hydra.utils.log.info(f"Adding callback <ModelCheckpoint>")
 48 |         callbacks.append(
 49 |             ModelCheckpoint(
 50 |                 monitor=cfg.train.monitor_metric,
 51 |                 mode=cfg.train.monitor_metric_mode,
 52 |                 save_top_k=cfg.train.model_checkpoints.save_top_k,
 53 |                 verbose=cfg.train.model_checkpoints.verbose,
 54 |             )
 55 |         )
 56 | 
 57 |     return callbacks
 58 | 
 59 | 
 60 | def run(cfg: DictConfig) -> None:
 61 |     """
 62 |     Generic train loop
 63 | 
 64 |     :param cfg: run configuration, defined by Hydra in /conf
 65 |     """
 66 |     if cfg.train.deterministic:
 67 |         seed_everything(cfg.train.random_seed)
 68 | 
 69 |     if cfg.train.pl_trainer.fast_dev_run:
 70 |         hydra.utils.log.info(
 71 |             f"Debug mode <{cfg.train.pl_trainer.fast_dev_run=}>. "
 72 |             f"Forcing debugger friendly configuration!"
 73 |         )
 74 |         # Debuggers don't like GPUs nor multiprocessing
 75 |         cfg.train.pl_trainer.gpus = 0
 76 |         cfg.data.datamodule.num_workers.train = 0
 77 |         cfg.data.datamodule.num_workers.val = 0
 78 |         cfg.data.datamodule.num_workers.test = 0
 79 | 
 80 |         # Switch wandb mode to offline to prevent online logging
 81 |         cfg.logging.wandb.mode = "offline"
 82 | 
 83 |     # Hydra run directory
 84 |     hydra_dir = Path(HydraConfig.get().run.dir)
 85 | 
 86 |     # Instantiate datamodule
 87 |     hydra.utils.log.info(f"Instantiating <{cfg.data.datamodule._target_}>")
 88 |     datamodule: pl.LightningDataModule = hydra.utils.instantiate(
 89 |         cfg.data.datamodule, _recursive_=False
 90 |     )
 91 | 
 92 |     model_ckpt = input("> Insert the absolute path of your model checkpoint: ")
 93 |     model = MyModel.load_from_checkpoint(checkpoint_path=str(model_ckpt))
 94 |     
 95 |     # Instantiate the callbacks
 96 |     callbacks: List[Callback] = build_callbacks(cfg=cfg)
 97 | 
 98 |     # Logger instantiation/configuration
 99 |     wandb_logger = None
100 |     if "wandb" in cfg.logging:
101 |         hydra.utils.log.info(f"Instantiating <WandbLogger>")
102 |         wandb_config = cfg.logging.wandb
103 |         wandb_logger = WandbLogger(
104 |             **wandb_config,
105 |             tags=cfg.core.tags
106 |         )
107 |         hydra.utils.log.info(f"W&B is now watching <{cfg.logging.wandb_watch.log}>!")
108 |         wandb_logger.watch(
109 |             model,
110 |             log=cfg.logging.wandb_watch.log,
111 |             log_freq=cfg.logging.wandb_watch.log_freq,
112 |         )
113 | 
114 |     # Store the YaML config separately into the wandb dir
115 |     yaml_conf: str = OmegaConf.to_yaml(cfg=cfg)
116 |     (Path(wandb_logger.experiment.dir) / "hparams.yaml").write_text(yaml_conf)
117 | 
118 |     # The Lightning core, the Trainer
119 |     trainer = pl.Trainer(
120 |         default_root_dir=hydra_dir,
121 |         logger=wandb_logger,
122 |         callbacks=callbacks,
123 |         deterministic=cfg.train.deterministic,
124 |         val_check_interval=cfg.logging.val_check_interval,
125 |         progress_bar_refresh_rate=cfg.logging.progress_bar_refresh_rate,
126 |         **cfg.train.pl_trainer, #** -> se hai un dizionario passa sia chiave che valore prendendoli dal config
127 |     )
128 |     
129 |     hydra.utils.log.info(f"Starting testing!")
130 |     trainer.test(model=model, datamodule=datamodule)
131 | 
132 |     # Logger closing to release resources/avoid multi-run conflicts
133 |     if wandb_logger is not None:
134 |         wandb_logger.experiment.finish()
135 | 
136 | 
137 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default")
138 | def main(cfg: omegaconf.DictConfig):
139 |     run(cfg)
140 | 
141 | 
142 | if __name__ == "__main__":
143 |     main()
144 | 


--------------------------------------------------------------------------------
/ner4el/src/run.py:
--------------------------------------------------------------------------------
  1 | from logging import log
  2 | import os
  3 | from pathlib import Path
  4 | from typing import List
  5 | 
  6 | import hydra
  7 | import omegaconf
  8 | import pytorch_lightning as pl
  9 | from hydra.core.hydra_config import HydraConfig
 10 | from omegaconf import DictConfig, OmegaConf
 11 | from pytorch_lightning import seed_everything, Callback
 12 | from pytorch_lightning.callbacks import (
 13 |     EarlyStopping,
 14 |     LearningRateMonitor,
 15 |     ModelCheckpoint,
 16 | )
 17 | from pytorch_lightning.loggers import WandbLogger
 18 | 
 19 | from src.common.utils import log_hyperparameters, PROJECT_ROOT
 20 | 
 21 | 
 22 | def build_callbacks(cfg: DictConfig) -> List[Callback]:
 23 |     callbacks: List[Callback] = []
 24 | 
 25 |     if "lr_monitor" in cfg.logging:
 26 |         hydra.utils.log.info(f"Adding callback <LearningRateMonitor>")
 27 |         callbacks.append(
 28 |             LearningRateMonitor(
 29 |                 logging_interval=cfg.logging.lr_monitor.logging_interval,
 30 |                 log_momentum=cfg.logging.lr_monitor.log_momentum,
 31 |             )
 32 |         )
 33 | 
 34 |     if "early_stopping" in cfg.train:
 35 |         hydra.utils.log.info(f"Adding callback <EarlyStopping>")
 36 |         callbacks.append(
 37 |             EarlyStopping(
 38 |                 monitor=cfg.train.monitor_metric,
 39 |                 mode=cfg.train.monitor_metric_mode,
 40 |                 patience=cfg.train.early_stopping.patience,
 41 |                 verbose=cfg.train.early_stopping.verbose,
 42 |             )
 43 |         )
 44 | 
 45 |     if "model_checkpoints" in cfg.train:
 46 |         hydra.utils.log.info(f"Adding callback <ModelCheckpoint>")
 47 |         callbacks.append(
 48 |             ModelCheckpoint(
 49 |                 monitor=cfg.train.monitor_metric,
 50 |                 mode=cfg.train.monitor_metric_mode,
 51 |                 save_top_k=cfg.train.model_checkpoints.save_top_k,
 52 |                 verbose=cfg.train.model_checkpoints.verbose,
 53 |             )
 54 |         )
 55 | 
 56 |     return callbacks
 57 | 
 58 | 
 59 | def run(cfg: DictConfig) -> None:
 60 |     """
 61 |     Generic train loop
 62 | 
 63 |     :param cfg: run configuration, defined by Hydra in /conf
 64 |     """
 65 |     if cfg.train.deterministic:
 66 |         seed_everything(cfg.train.random_seed)
 67 | 
 68 |     if cfg.train.pl_trainer.fast_dev_run:
 69 |         hydra.utils.log.info(
 70 |             f"Debug mode <{cfg.train.pl_trainer.fast_dev_run=}>. "
 71 |             f"Forcing debugger friendly configuration!"
 72 |         )
 73 |         # Debuggers don't like GPUs nor multiprocessing
 74 |         cfg.train.pl_trainer.gpus = 0
 75 |         cfg.data.datamodule.num_workers.train = 0
 76 |         cfg.data.datamodule.num_workers.val = 0
 77 |         cfg.data.datamodule.num_workers.test = 0
 78 | 
 79 |         # Switch wandb mode to offline to prevent online logging
 80 |         cfg.logging.wandb.mode = "offline"
 81 | 
 82 |     # Hydra run directory
 83 |     hydra_dir = Path(HydraConfig.get().run.dir)
 84 | 
 85 |     # Instantiate datamodule
 86 |     hydra.utils.log.info(f"Instantiating <{cfg.data.datamodule._target_}>")
 87 |     datamodule: pl.LightningDataModule = hydra.utils.instantiate(
 88 |         cfg.data.datamodule, _recursive_=False
 89 |     )
 90 | 
 91 |     # Instantiate model
 92 |     hydra.utils.log.info(f"Instantiating <{cfg.model._target_}>")
 93 |     model: pl.LightningModule = hydra.utils.instantiate(
 94 |         cfg.model,
 95 |         optim=cfg.optim,
 96 |         data=cfg.data,
 97 |         logging=cfg.logging,
 98 |         _recursive_=False,
 99 |     )
100 | 
101 |     # Instantiate the callbacks
102 |     callbacks: List[Callback] = build_callbacks(cfg=cfg)
103 | 
104 |     # Logger instantiation/configuration
105 |     wandb_logger = None
106 |     if "wandb" in cfg.logging:
107 |         hydra.utils.log.info(f"Instantiating <WandbLogger>")
108 |         wandb_config = cfg.logging.wandb
109 |         wandb_logger = WandbLogger(
110 |             **wandb_config,
111 |             tags=cfg.core.tags
112 |         )
113 |         hydra.utils.log.info(f"W&B is now watching <{cfg.logging.wandb_watch.log}>!")
114 |         wandb_logger.watch(
115 |             model,
116 |             log=cfg.logging.wandb_watch.log,
117 |             log_freq=cfg.logging.wandb_watch.log_freq,
118 |         )
119 | 
120 |     # Store the YaML config separately into the wandb dir
121 |     yaml_conf: str = OmegaConf.to_yaml(cfg=cfg)
122 |     (Path(wandb_logger.experiment.dir) / "hparams.yaml").write_text(yaml_conf)
123 | 
124 |     hydra.utils.log.info(f"Instantiating the Trainer")
125 | 
126 |     # The Lightning core, the Trainer
127 |     trainer = pl.Trainer(
128 |         default_root_dir=hydra_dir,
129 |         logger=wandb_logger,
130 |         callbacks=callbacks,
131 |         deterministic=cfg.train.deterministic,
132 |         val_check_interval=cfg.logging.val_check_interval,
133 |         progress_bar_refresh_rate=cfg.logging.progress_bar_refresh_rate,
134 |         **cfg.train.pl_trainer, #** -> se hai un dizionario passa sia chiave che valore prendendoli dal config
135 |     )
136 |     log_hyperparameters(trainer=trainer, model=model, cfg=cfg)
137 | 
138 |     hydra.utils.log.info(f"Starting training!")
139 |     trainer.fit(model=model, datamodule=datamodule)
140 | 
141 |     hydra.utils.log.info(f"Starting testing!")
142 |     trainer.test(ckpt_path="best", datamodule=datamodule)
143 | 
144 |     # Logger closing to release resources/avoid multi-run conflicts
145 |     if wandb_logger is not None:
146 |         wandb_logger.experiment.finish()
147 | 
148 | 
149 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default")
150 | def main(cfg: omegaconf.DictConfig):
151 |     run(cfg)
152 | 
153 | 
154 | if __name__ == "__main__":
155 |     main()
156 | 


--------------------------------------------------------------------------------
/ner4el/src/pl_modules/ner_model.py:
--------------------------------------------------------------------------------
  1 | from typing import Any, Dict, Sequence, Tuple, Union, List
  2 | 
  3 | import hydra
  4 | import omegaconf
  5 | import pytorch_lightning as pl
  6 | import torch
  7 | from torch import nn
  8 | from omegaconf import DictConfig
  9 | from torch._C import device
 10 | from torch.optim import Optimizer
 11 | import os
 12 | import math
 13 | from sklearn.metrics import f1_score
 14 | 
 15 | 
 16 | from src.common.utils import PROJECT_ROOT
 17 | from src.common.utils import *
 18 | 
 19 | from transformers import BertTokenizer, BertModel, BertConfig
 20 | 
 21 | model_name_ner = 'bert-base-uncased'
 22 | 
 23 | 
 24 | class MyNERModel(pl.LightningModule):
 25 |     def __init__(self, *args, **kwargs) -> None:
 26 |         super().__init__()
 27 |         #self.save_hyperparameters()  # populate self.hparams with args and kwargs automagically!
 28 |         global bert_tokenizer_ner
 29 | 
 30 |         id2ner_dict_path = "data/id2ner_dict.pickle"
 31 |         id2ner_dict_path = str(PROJECT_ROOT / id2ner_dict_path)
 32 |         id2ner = read_id2ner_dict(id2ner_dict_path)
 33 | 
 34 |         labels_vocab = {}
 35 |         i = 0
 36 |         for v in id2ner.values():
 37 |             if v not in labels_vocab:
 38 |                 labels_vocab[v] = i
 39 |                 i+=1
 40 | 
 41 |  
 42 |         bert_config_ner = BertConfig.from_pretrained(model_name_ner, output_hidden_states=True)
 43 |         bert_tokenizer_ner = BertTokenizer.from_pretrained(model_name_ner)
 44 |         special_tokens_dict = {'additional_special_tokens': ['[E]','[\E]']}
 45 |         num_added_toks = bert_tokenizer_ner.add_special_tokens(special_tokens_dict)
 46 |         bert_model_ner = BertModel.from_pretrained(model_name_ner, config=bert_config_ner)
 47 |         bert_model_ner.resize_token_embeddings(len(bert_tokenizer_ner))
 48 | 
 49 | 
 50 |         self.mention_encoder = bert_model_ner
 51 | 
 52 |         self.cosine_similarity = nn.CosineSimilarity(dim=-1, eps=1e-6)
 53 |         
 54 |         self.dropout = nn.Dropout(0.5)
 55 |             
 56 |         self.linear = nn.Linear(768, len(labels_vocab))
 57 | 
 58 |     def forward(
 59 |         self, mentions, positions, mask1, **kwargs
 60 |     ) -> Dict[str, torch.Tensor]:
 61 |         """
 62 |         Method for the forward pass.
 63 |         'training_step', 'validation_step' and 'test_step' should call
 64 |         this method in order to compute the output predictions and the loss.
 65 |         Returns:
 66 |             output_dict: forward output containing the predictions (output logits ecc...) and the loss if any.
 67 |         """
 68 | 
 69 |         embedding_mention = self.mention_encoder.forward(mentions, mask1)[0] #16x64x768
 70 |         embedding_mention2 = embedding_mention.gather(1, positions.reshape(-1, 1, 1).repeat(1, 1, 768)).squeeze(1)
 71 |         embedding_mention2 = self.dropout(embedding_mention2) #16x768
 72 |         
 73 |         predictions = self.linear(embedding_mention2)
 74 |                         
 75 |         return predictions
 76 | 
 77 |     def step(self, batch: Any, batch_idx: int, dataset_type:str):
 78 |         softmax_function = nn.Softmax(dim=1)
 79 | 
 80 |         mentions, positions, candidates, descriptions, labels = batch
 81 |         positions = torch.tensor(positions, device=self.device)
 82 | 
 83 |         mask1 = self.padding_mask(mentions)
 84 |         predictions = self.forward(mentions, positions, mask1)
 85 |         predictions = softmax_function(predictions)
 86 | 
 87 |         return {
 88 |             "pred": predictions,
 89 |         }
 90 |                     
 91 |                 
 92 | 
 93 |     def training_step(self, batch: Any, batch_idx: int) -> torch.Tensor:
 94 |         step_output = self.step(batch, batch_idx, "train")
 95 |         return step_output
 96 | 
 97 |     def validation_step(self, batch: Any, batch_idx: int) -> torch.Tensor:
 98 |         step_output = self.step(batch, batch_idx, "dev")
 99 |         return step_output
100 | 
101 |     def test_step(self, batch: Any, batch_idx: int) -> torch.Tensor:
102 |         step_output = self.step(batch, batch_idx, "test")
103 |         return step_output
104 | 
105 |         
106 | 
107 |     def configure_optimizers(
108 |         self,
109 |     ) -> Union[Optimizer, Tuple[Sequence[Optimizer], Sequence[Any]]]:
110 |         """
111 |         Choose what optimizers and learning-rate schedulers to use in your optimization.
112 |         Normally you'd need one. But in the case of GANs or similar you might have multiple.
113 |         Return:
114 |             Any of these 6 options.
115 |             - Single optimizer.
116 |             - List or Tuple - List of optimizers.
117 |             - Two lists - The first list has multiple optimizers, the second a list of LR schedulers (or lr_dict).
118 |             - Dictionary, with an 'optimizer' key, and (optionally) a 'lr_scheduler'
119 |               key whose value is a single LR scheduler or lr_dict.
120 |             - Tuple of dictionaries as described, with an optional 'frequency' key.
121 |             - None - Fit will run without any optimizer.
122 |         """
123 |         opt = hydra.utils.instantiate(
124 |             self.hparams.optim.optimizer, params=self.parameters(), _convert_="partial"
125 |         )
126 |         if not self.hparams.optim.use_lr_scheduler:
127 |             return [opt]
128 |         scheduler = hydra.utils.instantiate(
129 |             self.hparams.optim.lr_scheduler, optimizer=opt
130 |         )
131 |         return [opt], [scheduler]
132 | 
133 |     def padding_mask(self, batch):
134 |         padding = torch.ones_like(batch)
135 |         padding[batch == 0] = 0
136 |         padding = padding.type(torch.int64)
137 |         return padding
138 | 
139 |     def normalize(self, m):
140 |         row_min, _ = m.min(dim=1, keepdim=True)
141 |         row_max, _ = m.max(dim=1, keepdim=True)
142 |         return (m - row_min) / (row_max - row_min)
143 | 
144 | 
145 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default")
146 | def main(cfg: omegaconf.DictConfig):
147 |     model: pl.LightningModule = hydra.utils.instantiate(
148 |         cfg.model,
149 |         optim=cfg.optim,
150 |         data=cfg.data,
151 |         logging=cfg.logging,
152 |         _recursive_=False,
153 |     )
154 | 
155 | 
156 | if __name__ == "__main__":
157 |     main()
158 | 


--------------------------------------------------------------------------------
/ner4el/src/pl_data/datamodule.py:
--------------------------------------------------------------------------------
  1 | import random
  2 | from typing import Optional, Sequence
  3 | 
  4 | import hydra
  5 | from hydra import utils
  6 | import numpy as np
  7 | import omegaconf
  8 | import pytorch_lightning as pl
  9 | from pytorch_lightning.core import datamodule
 10 | import torch
 11 | from pprint import pprint
 12 | from omegaconf import DictConfig
 13 | from torch.utils.data import DataLoader, Dataset
 14 | from torch.nn.utils.rnn import pad_sequence
 15 | from src.pl_data.dataset import MyDataset
 16 | 
 17 | from src.common.utils import PROJECT_ROOT
 18 | 
 19 | from src.common.utils import *
 20 | from transformers import BertTokenizer
 21 | from transformers import XLMRobertaTokenizer
 22 | 
 23 | 
 24 | def worker_init_fn(id: int):
 25 |     """
 26 |     DataLoaders workers init function.
 27 | 
 28 |     Initialize the numpy.random seed correctly for each worker, so that
 29 |     random augmentations between workers and/or epochs are not identical.
 30 | 
 31 |     If a global seed is set, the augmentations are deterministic.
 32 | 
 33 |     https://pytorch.org/docs/stable/notes/randomness.html#dataloader
 34 |     """
 35 |     uint64_seed = torch.initial_seed()
 36 |     ss = np.random.SeedSequence([uint64_seed])
 37 |     # More than 128 bits (4 32-bit words) would be overkill.
 38 |     np.random.seed(ss.generate_state(4))
 39 |     random.seed(uint64_seed)
 40 | 
 41 | 
 42 | class MyDataModule(pl.LightningDataModule):
 43 |     def __init__(
 44 |         self,
 45 |         datasets: DictConfig,
 46 |         num_workers: DictConfig,
 47 |         batch_size: DictConfig,
 48 |         transformer_name: str,
 49 |         alias_table_path: str,
 50 |         descriptions_dict_path: str,
 51 |         item_counts_dict_path: str,
 52 |         title_dict_path: str,
 53 |         id2ner_dict_path: str,
 54 |         version: str,
 55 |         negative_samples: bool,
 56 |         ner_negative_samples: bool,
 57 |         ner_representation: bool,
 58 |         ner_filter_candidates: bool,
 59 |         processed: bool,
 60 |     ):
 61 |         super().__init__()
 62 |         self.datasets = datasets
 63 |         self.num_workers = num_workers
 64 |         self.batch_size = batch_size
 65 |         self.transformer_name = transformer_name
 66 |         self.alias_table_path = str(PROJECT_ROOT / alias_table_path)
 67 |         self.descriptions_dict_path = str(PROJECT_ROOT / descriptions_dict_path)
 68 |         self.item_counts_dict_path = str(PROJECT_ROOT / item_counts_dict_path)
 69 |         self.title_dict_path = str(PROJECT_ROOT / title_dict_path)
 70 |         self.id2ner_dict_path = str(PROJECT_ROOT / id2ner_dict_path)
 71 |         self.version = version
 72 |         self.negative_samples = negative_samples
 73 |         self.ner_negative_samples = ner_negative_samples
 74 |         self.ner_representation = ner_representation
 75 |         self.ner_filter_candidates = ner_filter_candidates
 76 |         self.processed = processed
 77 | 
 78 |         self.train_dataset: Optional[Dataset] = None
 79 |         self.val_dataset: Optional[Dataset] = None
 80 |         self.test_dataset: Optional[Dataset] = None
 81 | 
 82 |     def prepare_data(self) -> None:
 83 |         # download only
 84 |         pass
 85 | 
 86 |     def setup(self, stage: Optional[str] = None):
 87 |         # Here you should instantiate your datasets, you may also split the train into train and validation if needed.
 88 | 
 89 |         train_data = read_dataset(str(PROJECT_ROOT / self.datasets.train.path))
 90 |         dev_data = read_dataset(str(PROJECT_ROOT / self.datasets.val.path))
 91 |         test_data = read_dataset(str(PROJECT_ROOT / self.datasets.test.path))
 92 |         print("Datasets loaded.")
 93 | 
 94 |         if "bert-" in self.transformer_name:
 95 |             self.tokenizer = BertTokenizer.from_pretrained(self.transformer_name)
 96 |             special_tokens_dict = {'additional_special_tokens': ['[E]','[\E]']}
 97 |             self.tokenizer.add_special_tokens(special_tokens_dict)
 98 |         elif "xlm" in self.transformer_name:
 99 |             self.tokenizer = XLMRobertaTokenizer.from_pretrained(self.transformer_name)
100 |             special_tokens_dict = {'additional_special_tokens': ['[E]','[\E]']}
101 |             self.tokenizer.add_special_tokens(special_tokens_dict)
102 | 
103 |         self.alias_table = read_alias_table(self.alias_table_path)
104 |         print("Alias table loaded.")
105 | 
106 |         self.descriptions_dict = read_descriptions_dict(self.descriptions_dict_path)
107 |         print("Descriptions dict loaded.")
108 | 
109 |         self.item_counts_dict = read_item_counts_dict(self.item_counts_dict_path)
110 |         print("Item counts dict loaded.")
111 | 
112 |         self.title_dict = get_title_dict(self.title_dict_path)
113 |         self.title_dict_reverse = dict((v, k) for (k, v) in self.title_dict.items())
114 |         print("Title dict loaded.")
115 | 
116 |         self.id2ner = read_id2ner_dict(self.id2ner_dict_path)
117 |         print("NER dict loaded.")
118 |    
119 | 
120 |         if stage is None or stage == "fit":
121 |             self.train_dataset = hydra.utils.instantiate(
122 |                 self.datasets.train,
123 |                 data=train_data,
124 |                 datamodule = self
125 |             )
126 | 
127 |             self.val_dataset = hydra.utils.instantiate(
128 |                 self.datasets.val,
129 |                 data=dev_data,
130 |                 datamodule = self
131 |             )
132 | 
133 |         if stage is None or stage == "test":
134 |             self.test_dataset = hydra.utils.instantiate(
135 |                 self.datasets.test,
136 |                 data=test_data,
137 |                 datamodule = self
138 |             )
139 | 
140 |     def train_dataloader(self) -> DataLoader:
141 |         return DataLoader(
142 |             self.train_dataset,
143 |             shuffle=True,
144 |             batch_size=self.batch_size.train,
145 |             num_workers=self.num_workers.train,
146 |             worker_init_fn=worker_init_fn,
147 |             pin_memory=True,
148 |             collate_fn=self.collate
149 |         )
150 | 
151 |     def val_dataloader(self) -> DataLoader:
152 |         return DataLoader(
153 |             self.val_dataset,
154 |             shuffle=False,
155 |             batch_size=self.batch_size.val,
156 |             num_workers=self.num_workers.val,
157 |             worker_init_fn=worker_init_fn,
158 |             pin_memory=True,
159 |             collate_fn=self.collate
160 |         )
161 | 
162 |     def test_dataloader(self) -> Sequence[DataLoader]:
163 |         return DataLoader(
164 |             self.test_dataset,
165 |             shuffle=False,
166 |             batch_size=self.batch_size.test,
167 |             num_workers=self.num_workers.test,
168 |             worker_init_fn=worker_init_fn,
169 |             pin_memory=True,
170 |             collate_fn=self.collate
171 |         )
172 | 
173 |     def collate(self, elems: List[tuple]) -> List[tuple]:
174 |         mentions, positions, candidates, descriptions, labels = list(zip(*elems))
175 |     
176 |         pad_mentions = pad_sequence(mentions, batch_first=True, padding_value=0)
177 |         pad_candidates = pad_sequence(candidates, batch_first=True, padding_value=0)
178 |         pad_descriptions = pad_sequence(descriptions, batch_first=True, padding_value=0)
179 |  
180 |         return pad_mentions, positions, pad_candidates, pad_descriptions, labels
181 | 
182 |     def __repr__(self) -> str:
183 |         return (
184 |             f"{self.__class__.__name__}("
185 |             f"{self.datasets=}, "
186 |             f"{self.num_workers=}, "
187 |             f"{self.batch_size=})"
188 |         )
189 | 
190 | 
191 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default")
192 | def main(cfg: omegaconf.DictConfig):
193 |     datamodule: pl.LightningDataModule = hydra.utils.instantiate(
194 |         cfg.data.datamodule, _recursive_=False
195 |     )
196 |     datamodule.setup()
197 | 
198 | 
199 | if __name__ == "__main__":
200 |     main()
201 | 


--------------------------------------------------------------------------------
/ner4el/src/pl_data/dataset.py:
--------------------------------------------------------------------------------
  1 | from typing import Dict, Tuple, Union
  2 | 
  3 | import hydra
  4 | import omegaconf
  5 | import pytorch_lightning as pl
  6 | from pytorch_lightning.core import datamodule
  7 | import torch
  8 | from torch import nn
  9 | from omegaconf import ValueNode
 10 | from torch.utils.data import Dataset
 11 | from tqdm import tqdm
 12 | import pickle
 13 | import random
 14 | 
 15 | from src.common.utils import PROJECT_ROOT
 16 | from pl_modules.ner_model import MyNERModel
 17 | 
 18 | 
 19 | 
 20 | class MyDataset(Dataset):
 21 |     def __init__(self, name: ValueNode, 
 22 |                 path: ValueNode, 
 23 |                 data: list,
 24 |                 num_candidates: int, 
 25 |                 window: int,
 26 |                 datamodule,
 27 |                 dataset_type,
 28 |                 **kwargs):
 29 | 
 30 |         from src.pl_data.datamodule import MyDataModule
 31 |         datamodule:MyDataModule
 32 | 
 33 |         super().__init__()
 34 |         self.path = path
 35 |         self.name = name
 36 |         self.data = data
 37 |         self.num_candidates = num_candidates
 38 |         self.window = window
 39 |         self.transformer_name = datamodule.transformer_name
 40 |         self.tokenizer = datamodule.tokenizer
 41 |         self.alias_table = datamodule.alias_table
 42 |         self.descriptions_dict = datamodule.descriptions_dict
 43 |         self.count_dict = datamodule.item_counts_dict
 44 |         self.id2ner = datamodule.id2ner
 45 |         self.dataset_type = dataset_type
 46 |         self.title_dict = datamodule.title_dict
 47 |         self.title_dict_reverse = datamodule.title_dict_reverse
 48 |         self.negative_samples = datamodule.negative_samples
 49 |         self.ner_negative_samples = datamodule.ner_negative_samples
 50 |         self.ner_representation = datamodule.ner_representation
 51 |         self.ner_filter_candidates = datamodule.ner_filter_candidates
 52 |         self.processed = datamodule.processed
 53 | 
 54 |         self.encoded_data = []
 55 |         self.__encode_data()
 56 |         print(f"LEN: {len(self.encoded_data)}")
 57 | 
 58 |     def __encode_data(self):
 59 | 
 60 |         if self.ner_filter_candidates:
 61 |             ner_classifier = MyNERModel()
 62 |             ner_classifier.load_state_dict(torch.load(str(PROJECT_ROOT / "wandb/ner_classifier.pt")))
 63 | 
 64 |             softmax_function = nn.Softmax(dim=1)
 65 | 
 66 |             labels_vocab = {}
 67 |             i = 0
 68 |             for v in self.id2ner.values():
 69 |                 if v not in labels_vocab:
 70 |                     labels_vocab[v] = i
 71 |                     i+=1
 72 | 
 73 | 
 74 |         if self.processed == True and self.dataset_type=="train":
 75 |             with open(str(PROJECT_ROOT /  f"preprocessed_datasets/aida_kilt_train_{self.num_candidates}_{self.window}_{self.transformer_name}_negativesamples={self.negative_samples}_nernegativesamples={self.ner_negative_samples}_nerrepresentation={self.ner_representation}.pickle"), 'rb') as f:
 76 |                 self.encoded_data = pickle.load(f)
 77 | 
 78 |         else:
 79 |             total_entities = 0
 80 |             target_between_candidates = 0
 81 | 
 82 |             #for negative samples
 83 |             if self.dataset_type == "train":
 84 |                 count_dict_with_descriptions = {}
 85 |                 dict_candidates_for_NER_type = {}
 86 | 
 87 |                 for key in tqdm(self.count_dict):
 88 |                     if key in self.descriptions_dict:
 89 |                         count_dict_with_descriptions[key] = 1
 90 | 
 91 |                         if self.ner_negative_samples == True:
 92 |                             if str(key) in self.id2ner:
 93 |                                 ner_tag = self.id2ner[str(key)]
 94 |                                 if ner_tag not in dict_candidates_for_NER_type:
 95 |                                     dict_candidates_for_NER_type[ner_tag] = [key]
 96 |                                 else:
 97 |                                     dict_candidates_for_NER_type[ner_tag].append(key)
 98 |                         
 99 |             
100 |             for entry in tqdm(self.data):
101 |                 m = entry["mention"].lower()
102 |                 left_context = entry["left_context"]
103 |                 right_context = entry["right_context"]
104 |                 ent = self.title_dict_reverse[entry["output"]] if entry["output"] in self.title_dict_reverse else ""
105 |                 title = entry["output"]
106 | 
107 |                 if self.ner_negative_samples == True:
108 |                     if str(ent) in self.id2ner: #to see the upperbound
109 |                             target_ner_tag = self.id2ner[str(ent)]
110 |                     else:
111 |                         target_ner_tag = ""
112 | 
113 |                 tokenized_left_context = self.tokenize_mention("[CLS]" + left_context, self.tokenizer, self.window//2, False)
114 |                 mention_position = len(tokenized_left_context) + 1
115 |                 tokenized_m = self.tokenize_mention("[E]" + m + "[\E]", self.tokenizer, self.window//2, False)
116 |                 tokenized_right_context = self.tokenize_mention(right_context + "[SEP]", self.tokenizer, self.window//2, False)
117 | 
118 |                 tokenized_mention = tokenized_left_context + tokenized_m + tokenized_right_context
119 |                 for _ in range(self.window-len(tokenized_mention)):
120 |                     tokenized_mention.append(0)
121 |                 tokenized_mention = torch.tensor(tokenized_mention)
122 | 
123 | 
124 |                 if self.ner_filter_candidates:
125 |                     pad_mentions = tokenized_mention.unsqueeze(0)
126 |                     mask = self.padding_mask(pad_mentions)
127 | 
128 |                     position = torch.tensor(mention_position)
129 | 
130 |                     predictions = ner_classifier(pad_mentions, position, mask)
131 |                     predictions = softmax_function(predictions)[0]
132 |                     top_k_ner = torch.topk(predictions, 3)[1]
133 |                     label_ner = top_k_ner[0].item()
134 |                     confidence = predictions[label_ner]
135 | 
136 | 
137 |                 if(m in self.alias_table) and ent!="":
138 |                     total_entities+=1
139 |                     all_candidates = self.alias_table[m]
140 | 
141 |                     all_candidates_with_description = []
142 |                     for c in all_candidates:
143 |                         if c in self.descriptions_dict:
144 |                             all_candidates_with_description.append(c)
145 | 
146 |                     all_candidates_counts = torch.tensor([self.count_dict.get(idx, 0) for idx in all_candidates_with_description])
147 | 
148 |                     if len(all_candidates_with_description)>self.num_candidates:
149 |                         top_k = torch.topk(all_candidates_counts, self.num_candidates)[1].tolist()
150 |                         top_k_candidates = []
151 |                         for idx in top_k:
152 |                             top_k_candidates.append(all_candidates_with_description[idx])
153 |                     else:
154 |                         top_k_candidates = all_candidates_with_description
155 | 
156 |                         if self.negative_samples == True and self.dataset_type == "train": #standard negative samples
157 |                             tmp_count_dict_with_descriptions = count_dict_with_descriptions.copy()
158 |                             for candidate in top_k_candidates:
159 |                                 del tmp_count_dict_with_descriptions[candidate]
160 |                             negative_samples = random.sample(tmp_count_dict_with_descriptions.keys(), self.num_candidates-len(top_k_candidates))
161 |                             top_k_candidates.extend(negative_samples)
162 | 
163 |                         elif self.ner_negative_samples == True and self.dataset_type == "train": #NER-enhanced negative samples
164 |                             if target_ner_tag!="":
165 |                                 candidates_of_same_NER_type = dict_candidates_for_NER_type[target_ner_tag]
166 |                                 for candidate in top_k_candidates:
167 |                                     if candidate in candidates_of_same_NER_type:
168 |                                         candidates_of_same_NER_type.remove(candidate)
169 |                                 negative_samples = random.sample(candidates_of_same_NER_type, self.num_candidates-len(top_k_candidates))
170 |                                 top_k_candidates.extend(negative_samples)
171 | 
172 |                         elif self.ner_filter_candidates == True and self.dataset_type == "train":
173 |                             top_k_candidates_filtered = []
174 |                             for c in top_k_candidates:
175 |                                 if str(c) in self.id2ner and confidence>0.5:
176 |                                     if labels_vocab[self.id2ner[str(c)]] == label_ner:
177 |                                         top_k_candidates_filtered.append(c)
178 |                                     elif labels_vocab[self.id2ner[str(c)]] in top_k_ner:
179 |                                         top_k_candidates_filtered.append(c)
180 |                                 else:
181 |                                     top_k_candidates_filtered.append(c)
182 | 
183 | 
184 |                     if int(ent) in top_k_candidates:
185 |                         target_between_candidates+=1
186 | 
187 |                     tokenized_descriptions = []
188 |                     for c in top_k_candidates:
189 |                         if self.ner_representation == True:
190 |                             if str(c) in self.id2ner:
191 |                                 ner_tag = self.id2ner[str(c)]
192 |                             else:
193 |                                 ner_tag = ""
194 |                             d = ner_tag + "[SEP]" + self.descriptions_dict[c]
195 |                         else:
196 |                             d = self.descriptions_dict[c]
197 | 
198 |                         tokenized_descriptions.append(self.tokenize_description(d, self.tokenizer, self.window))
199 | 
200 | 
201 |                     if self.dataset_type == "train":
202 |                         if len(top_k_candidates)>0 and int(ent) in top_k_candidates:
203 |                             self.encoded_data.append((tokenized_mention,
204 |                                                 mention_position,
205 |                                                 torch.tensor(top_k_candidates),
206 |                                                 torch.tensor(tokenized_descriptions),
207 |                                                 int(ent)))
208 | 
209 |                     else:
210 |                         if len(top_k_candidates)>0:
211 |                             self.encoded_data.append((tokenized_mention,
212 |                                                 mention_position,
213 |                                                 torch.tensor(top_k_candidates),
214 |                                                 torch.tensor(tokenized_descriptions),
215 |                                                 int(ent)))
216 | 
217 | 
218 |                     if total_entities>0:
219 |                         print(f"Percentage of target entities within the candidate set: {target_between_candidates/total_entities}")
220 | 
221 |                     #print(self.encoded_data)
222 | 
223 | 
224 |             if self.dataset_type == "train":
225 |                 with open(str(PROJECT_ROOT /  f"preprocessed_datasets/aida_kilt_train_{self.num_candidates}_{self.window}_{self.transformer_name}_negativesamples={self.negative_samples}_nernegativesamples={self.ner_negative_samples}_nerrepresentation={self.ner_representation}.pickle"), 'wb') as f:
226 |                     pickle.dump(self.encoded_data, f)
227 |         
228 |          
229 |         return self.encoded_data
230 | 
231 | 
232 |     def tokenize_mention(self, sent, tokenizer, window, special_tokens):
233 |         encoded_sentence = tokenizer.encode(sent, add_special_tokens = special_tokens)
234 |         if len(encoded_sentence)>=window:
235 |             return encoded_sentence[:window]
236 |         else:
237 |             return encoded_sentence
238 |     
239 |     def tokenize_description(self, sent, tokenizer, window):
240 |         encoded_sentence = tokenizer.encode(sent, add_special_tokens = True)
241 |         if len(encoded_sentence)>=window:
242 |             return encoded_sentence[:window]
243 |         else:
244 |             return encoded_sentence + [0]*(window-len(encoded_sentence))
245 | 
246 |     def padding_mask(self, batch):
247 |         padding = torch.ones_like(batch)
248 |         padding[batch == 0] = 0
249 |         padding = padding.type(torch.int64)
250 |         return padding
251 | 
252 | 
253 |     def __len__(self) -> int:
254 |         return len(self.encoded_data)
255 | 
256 |     def __getitem__(
257 |         self, index
258 |     ) -> Union[Dict[str, torch.Tensor], Tuple[torch.Tensor, torch.Tensor]]:
259 |         return self.encoded_data[index]
260 | 
261 |     def __repr__(self) -> str:
262 |         return f"MyDataset({self.name=}, {self.path=})"
263 | 
264 | 
265 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default")
266 | def main(cfg: omegaconf.DictConfig):
267 |     dataset: MyDataset = hydra.utils.instantiate(
268 |         cfg.data.datamodule.datasets.train, _recursive_=False
269 |     )
270 | 
271 | 
272 | if __name__ == "__main__":
273 |     main()
274 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-ace2004)](https://paperswithcode.com/sota/entity-disambiguation-on-ace2004?p=named-entity-recognition-for-entity-linking)
  2 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-aquaint)](https://paperswithcode.com/sota/entity-disambiguation-on-aquaint?p=named-entity-recognition-for-entity-linking)
  3 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-msnbc)](https://paperswithcode.com/sota/entity-disambiguation-on-msnbc?p=named-entity-recognition-for-entity-linking)
  4 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-wned-cweb)](https://paperswithcode.com/sota/entity-disambiguation-on-wned-cweb?p=named-entity-recognition-for-entity-linking)	
  5 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-wned-wiki)](https://paperswithcode.com/sota/entity-disambiguation-on-wned-wiki?p=named-entity-recognition-for-entity-linking)
  6 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-aida-conll)](https://paperswithcode.com/sota/entity-disambiguation-on-aida-conll?p=named-entity-recognition-for-entity-linking)
  7 | 
  8 | ![logo](./img/logo_ner4el.png)
  9 | --------------------------------------------------------------------------------
 10 | 
 11 | Code and resources for the paper [Named Entity Recognition for Entity Linking: What Works and What's Next](https://aclanthology.org/2021.findings-emnlp.220/).
 12 | 
 13 | This repository is mainly built upon [Pytorch](https://pytorch.org/) and [Pytorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/).
 14 | 
 15 | ## Reference
 16 | **Please cite our work if you use resources and/or code from this repository.**
 17 | #### Plaintext
 18 | Simone Tedeschi, Simone Conia, Francesco Cecconi and Roberto Navigli, 2021. **Named Entity Recognition for Entity Linking: What Works and What's Next**. *In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP 2021)*. Punta Cana, Dominican Republic. Association for Computational Linguistics.
 19 | 
 20 | #### Bibtex
 21 | ```bibtex
 22 | @inproceedings{tedeschi-etal-2021-named-entity,
 23 |     title = "{N}amed {E}ntity {R}ecognition for {E}ntity {L}inking: {W}hat Works and What{'}s Next",
 24 |     author = "Tedeschi, Simone  and
 25 |       Conia, Simone  and
 26 |       Cecconi, Francesco  and
 27 |       Navigli, Roberto",
 28 |     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
 29 |     month = nov,
 30 |     year = "2021",
 31 |     address = "Punta Cana, Dominican Republic",
 32 |     publisher = "Association for Computational Linguistics",
 33 |     url = "https://aclanthology.org/2021.findings-emnlp.220",
 34 |     pages = "2584--2596",
 35 |     abstract = "Entity Linking (EL) systems have achieved impressive results on standard benchmarks mainly thanks to the contextualized representations provided by recent pretrained language models. However, such systems still require massive amounts of data {--} millions of labeled examples {--} to perform at their best, with training times that often exceed several days, especially when limited computational resources are available. In this paper, we look at how Named Entity Recognition (NER) can be exploited to narrow the gap between EL systems trained on high and low amounts of labeled data. More specifically, we show how and to what extent an EL system can benefit from NER to enhance its entity representations, improve candidate selection, select more effective negative samples and enforce hard and soft constraints on its output entities. We release our software {--} code and model checkpoints {--} at https://github.com/Babelscape/ner4el.",
 36 | }
 37 | ```
 38 | 
 39 | # Named Entity Recognition for Entity Linking: An Introduction
 40 | In this work we focus on **Entity Linking (EL)**, a key task in NLP which aims at associating an ambiguous textual mention with a named entity in a knowledge base. It is a very **knowledge-intensive task** and current EL approaches requires massive amounts of training data – often millions of labeled items – in order to perform at their best, making the development of a high-performance EL system viable only to a **limited audience**. Hence, we study whether it is possible to **narrow the performance gap** between systems trained on limited (i.e., less than 20K labeled samples) and large amounts of data (i.e., millions of training samples). In particular, we take a look at **Named Entity Recognition (NER)** – the task of identifying specific words as belonging to predefined semantic types such as Person, Location, Organization – and how this task can be exploited to **improve a strong Entity Linking baseline in low-resource settings** without requiring any additional data. We show how and to what extent an EL  system can benefit from NER to enhance its entity  representations, improve candidate selection, select more  effective negative samples and enforce hard and soft  constraints on its output entities.
 41 | 
 42 | <div align="center">
 43 | 
 44 | ![contributions](./img/contributions.png)
 45 | 
 46 | </div>
 47 | 
 48 | 
 49 | <br>
 50 | 
 51 | # Fine-Grained Classes for NER
 52 | In its standard formulation, NER distinguishes between four classes of entities: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC).
 53 | Although NER systems that use these four classes have been found to be beneficial in downstream tasks, we argue that they might be too coarse-grained and, at the same time, not provide a sufficiently exhaustive coverage to also benefit EL, as many different entities would fall within the same Misc class.
 54 | 
 55 | For these reasons, **we introduce a new set of 18 finer-grained NER classes**, namely, Person (PER), Location (LOC), Organization (ORG), Animal (ANIM), Biology (BIO), Celestial Body (CEL), Disease (DIS), Event (EVE), Food (FOOD), Instrument (INST), Media (MEDIA), Monetary (MON), Number (NUM), Physical Phenomenon (PHYS), Plant (PLANT), Supernatural (SUPER), Time (TIME) and Vehicle (VEHI).
 56 | 
 57 | In order to use the newly introduced NER classes, we **automatically** label each Wikipedia entity with one of them by taking advantage of [WordNet](https://wordnet.princeton.edu/) and [BabelNet](https://babelnet.org/).
 58 | 
 59 | You can **download** the resulting mapping here: [Wikipedia2NER-mapping](https://drive.google.com/file/d/1tnyYe1alAPP2L866bUq4MtUh687z7oE4/view?usp=sharing) (158MB).
 60 | 
 61 | The following plot shows the percentage of Wikipedia articles for each of the 18 NER classes.
 62 | 
 63 | <div align="center">
 64 | 
 65 | ![percentages](./img/percentages.png)
 66 | 
 67 | </div>
 68 | 
 69 | Further details about these classes such as the exact number of articles for each class, or the full strategy used to obtain these annotations, are provided in the paper.
 70 | 
 71 | <br>
 72 | 
 73 | # Other Resources
 74 | Here you can download other resources needed to run the code, but also useful for other purposes (e.g., as a starting point for other EL projects).
 75 | 
 76 | <center>
 77 | 
 78 | | Resource | Description |
 79 | | ------------- | :------------- |
 80 | | [Alias Table](https://drive.google.com/file/d/13iro8M2KVONWANcgna_3zxxPZl9b7TVC/view?usp=sharing) (732MB) | A dictionary that associates each textual mention with a set of possible candidates <br>(i.e., a set of possible Wikipedia IDs)|
 81 | | [Descriptions Dictionary](https://drive.google.com/file/d/1kv1yxbrqvNgONcjuu2XNaoDrs6acOs4t/view?usp=sharing) (2.7GB) | A dictionary that associates each Wikipedia ID with its textual description|
 82 | | [Counts Dictionary](https://drive.google.com/file/d/1uKAO2866GAwVYdq1Rda6v-C2TZvoWOoZ/view?usp=sharing) (222MB) | A dictionary that associates each Wikipedia ID with its frequency in Wikipedia <br>(i.e., the sum of all the wikilinks that refer to that page)|
 83 | | [Titles Dictionary](https://drive.google.com/file/d/1hoUfhfNTP_73mcrYoWVBrwHQ8RXP2OSY/view?usp=sharing) (178MB) | A dictionary that associates the title of a Wikipedia page with its corresponding Wikipedia ID|
 84 | | [NER Classifier](https://drive.google.com/file/d/1SNXL_UvJ1RWzQaFOKZusfimIMgFQ5LAy/view?usp=sharing) (418MB) | The pretrained NER classifier used for the NER-constrained decoding and NER-enhanced candidate generation contributions (place it into the ner4el/wandb folder)|
 85 | 
 86 | </center>
 87 | 
 88 | <br>
 89 | 
 90 | # Data
 91 | The only training data that we use for our experiments are the training instances from the **AIDA-YAGO-CoNLL** training set. We evaluate our systems on the **validation** split of **AIDA-YAGO-CoNLL**.
 92 | For **testing** we use the test split of **AIDA-YAGO-CoNLL**, and the **MSNBC**, **AQUAINT**, **ACE2004**, **WNED-CWEB** and **WNED-WIKI** test sets.
 93 | 
 94 | We preprocessed the datasets and converted them in the following format:
 95 | <center>
 96 | 
 97 | ```python
 98 | {"mention": MENTION, "left_context": LEFT_CTX, "right_context": RIGHT_CTX, "output": OUTPUT}
 99 | ```
100 | </center>
101 | 
102 | The preprocessed datasets are already available in this repository:
103 | - [AIDA-YAGO-CoNLL (Train)](./ner4el/data/aida_train.jsonl)
104 | - [AIDA-YAGO-CoNLL (Dev)](./ner4el/data/aida_dev.jsonl)
105 | - [AIDA-YAGO-CoNLL (Test)](./ner4el/data/aida_test.jsonl)
106 | - [MSNBC](./ner4el/data/msnbc_test.jsonl)
107 | - [AQUAINT](./ner4el/data/aquaint_test.jsonl)
108 | - [ACE2004](./ner4el/data/ace2004_test.jsonl)
109 | - [WNED-CWEB](./ner4el/data/cweb_test.jsonl)
110 | - [WNED-WIKI](./ner4el/data/wiki_test.jsonl)
111 | 
112 | <br>
113 | 
114 | # Pretrained Model
115 | We release the model checkpoint of the best NER4EL system [here](https://drive.google.com/file/d/1CbjbknVYiON11xV1rOZto5mumbPov0z4/view?usp=sharing) (4.0GB).
116 | 
117 | We underline that we trained our system only on the 18K training instances provided by the **AIDA-YAGO-CoNLL** training set. If you want to obtain a stronger EL system using our architecture you can pretrain it on BLINK (9M training instances from Wikipedia). You can download the BLINK train and validation splits as follows:
118 | ```python
119 | wget http://dl.fbaipublicfiles.com/KILT/blink-train-kilt.jsonl
120 | wget http://dl.fbaipublicfiles.com/KILT/blink-dev-kilt.jsonl
121 | ```
122 | 
123 | **Note**: We re-implemented the code using the [nn-template](https://github.com/lucmos/nn-template) and trained the system on a different hardware architecture. Although the obtained results are, on average, almost identical to those reported in the paper, they are slightly different. Please, find below the performance of the released system:
124 | 
125 | <center>
126 | 
127 | | System | AIDA | MSNBC | AQUAINT | ACE2004 | CWEB | WIKI | Avg. |
128 | | ------------- | -------------: | -------------: | -------------: |  -------------: | -------------: | -------------: | -------------: |
129 | | Paper (Baseline + NER-R + NER-NS + NER-CD) | 92.5 | 89.2 | 69.5 | 91.3 | 68.5 | 64.0 | 79.16 |
130 | | Released system (Baseline + NER-R + NER-NS + NER-CD) | 93.6 | 89.1 | 70.6 | 91.0 | 67.2 | 63.7 | 79.20 |
131 | </center>
132 | 
133 | 
134 | <br>
135 | 
136 | # How To Use
137 | To run the code, after you have downloaded the above listed resources and put them into the right folders as specified by the README files inside the folders, you need to perform the following steps:
138 | 
139 | 0. Set the PROJECT_ROOT variable in the [.env](.env) file (it should correspond to the absolute path of the ner4el/ner4el folder)
140 | 
141 | 1. Install the requirements:
142 |     ```
143 |     pip install -r requirements.txt
144 |     ```
145 |     The code requires **python >= 3.8**, hence we suggest you to create a conda environment with python 3.8.
146 | 
147 | 2. Move to the ner4el folder and run the following command to train and evaluate the system:
148 |     ```
149 |     PYTHONPATH=. python src/run.py
150 |     ```
151 | 
152 | 3. If you want to test a trained system (e.g., the **NER4EL pretrained model** available in the previous section), run the command:
153 |     ```
154 |     PYTHONPATH=. python src/test.py
155 |     ```
156 |     Once the script is started, it asks you to specify the path of your model checkpoint.
157 | 
158 | **Note**: If you want to **change the system configuration**, you need to move in the *ner4el/conf* folder and change the parameters of your interest. As an example, if you move to the [data configuration file](./ner4el/conf/data/default.yaml), you can set the *training, evaluation and test sets*, but you can also specify the *number of candidates* you want to use, as well as the *context window*. At lines 10-14, you can also choose which *NER-based contribution* you want to apply on the baseline system, by setting it to *True*.
159 | Similarly, in the [training configuration file](./ner4el/conf/train/default.yaml), you can specify the *number of epochs*, the *value of patience parameter*, and the number of *gradient accumulation steps*.
160 | 
161 | <br>
162 | 
163 | # License 
164 | NER4EL is licensed under the CC BY-SA-NC 4.0 license. The text of the license can be found [here](https://github.com/Babelscape/wikineural/blob/master/LICENSE).
165 | 
166 | <br>
167 | 
168 | # Acknowledgments
169 | We gratefully acknowledge the support of the **ERC Consolidator Grant MOUSSE No. 726487** under the European Union’s Horizon 2020 research and innovation programme.
170 | 
171 | The code in this repository is built on top of [![](https://shields.io/badge/-nn--template-emerald?style=flat&logo=github&labelColor=gray)](https://github.com/lucmos/nn-template).
172 | 


--------------------------------------------------------------------------------
/ner4el/src/pl_modules/model.py:
--------------------------------------------------------------------------------
  1 | from typing import Any, Dict, Sequence, Tuple, Union, List
  2 | 
  3 | import hydra
  4 | import omegaconf
  5 | import pytorch_lightning as pl
  6 | import torch
  7 | from torch import nn
  8 | from omegaconf import DictConfig
  9 | from torch._C import device
 10 | from torch.optim import Optimizer
 11 | import os
 12 | import math
 13 | from sklearn.metrics import f1_score
 14 | from pl_modules.ner_model import MyNERModel
 15 | from src.common.utils import *
 16 | 
 17 | from src.common.utils import PROJECT_ROOT
 18 | 
 19 | from transformers import BertTokenizer, BertModel, BertConfig
 20 | 
 21 | ner_constrained_decoding = input("> Do you want to use the NER-Constrained Decoding (NER-CD) strategy at inference time (it's applied only during testing)? ")
 22 | 
 23 | if ner_constrained_decoding.lower() == "yes" or ner_constrained_decoding.lower() == "y":
 24 |     ner_model = MyNERModel().cuda()
 25 |     ner_model.load_state_dict(torch.load(str(PROJECT_ROOT / "wandb/ner_classifier.pt")))
 26 | 
 27 | id2ner_dict_path = "data/id2ner_dict.pickle"
 28 | id2ner_dict_path = str(PROJECT_ROOT / id2ner_dict_path)
 29 | 
 30 | id2ner = read_id2ner_dict(id2ner_dict_path)
 31 | 
 32 | labels_vocab = {}
 33 | i = 0
 34 | for v in id2ner.values():
 35 |     if v not in labels_vocab:
 36 |         labels_vocab[v] = i
 37 |         i+=1
 38 | 
 39 | class MyModel(pl.LightningModule):
 40 |     def __init__(self, *args, **kwargs) -> None:
 41 |         super().__init__()
 42 |         self.save_hyperparameters()  # populate self.hparams with args and kwargs automagically!
 43 | 
 44 |         self.bert_config = BertConfig.from_pretrained(
 45 |             self.hparams.transformer_name, output_hidden_states=True
 46 |         )
 47 | 
 48 |         self.bert_tokenizer = BertTokenizer.from_pretrained(
 49 |             self.hparams.transformer_name
 50 |         )
 51 | 
 52 |         '''self.mention_encoder = BertModel.from_pretrained(
 53 |             self.hparams.transformer_name, config=self.bert_config
 54 |         )
 55 | 
 56 |         self.entity_encoder = BertModel.from_pretrained(
 57 |             self.hparams.transformer_name, config=self.bert_config
 58 |         )
 59 | 
 60 |         special_tokens_dict = {"additional_special_tokens": ["[E]", "[\E]"]}
 61 |         self.bert_tokenizer.add_special_tokens(special_tokens_dict)
 62 |         self.mention_encoder.resize_token_embeddings(len(self.bert_tokenizer))
 63 |         self.entity_encoder.resize_token_embeddings(len(self.bert_tokenizer))'''
 64 | 
 65 |         #------------------------------------------------------------------------------------
 66 |         # Uncomment the above alternative block of code to use the dual-encoder architecture.
 67 |         # However, we observed that using a single encoder, we obtain very similar 
 68 |         # performances, while we have much shorter training times and less computational
 69 |         # resources are required
 70 | 
 71 |         self.bert_model = BertModel.from_pretrained(
 72 |             self.hparams.transformer_name, config=self.bert_config
 73 |         )
 74 | 
 75 |         special_tokens_dict = {"additional_special_tokens": ["[E]", "[\E]"]}
 76 |         self.bert_tokenizer.add_special_tokens(special_tokens_dict)
 77 |         self.bert_model.resize_token_embeddings(len(self.bert_tokenizer))
 78 | 
 79 |         self.mention_encoder = self.bert_model
 80 |         self.entity_encoder = self.bert_model
 81 |         
 82 |         #------------------------------------------------------------------------------------
 83 | 
 84 |         self.cosine_similarity = nn.CosineSimilarity(dim=-1, eps=1e-6)
 85 | 
 86 |         self.dropout = nn.Dropout(self.hparams.dropout)
 87 |         self.loss_function = nn.CrossEntropyLoss()
 88 | 
 89 |     def forward(
 90 |         self, mentions, positions, descriptions, mask1, mask2, **kwargs
 91 |     ) -> Dict[str, torch.Tensor]:
 92 |         """
 93 |         Method for the forward pass.
 94 |         'training_step', 'validation_step' and 'test_step' should call
 95 |         this method in order to compute the output predictions and the loss.
 96 |         Returns:
 97 |             output_dict: forward output containing the predictions (output logits ecc...) and the loss if any.
 98 |         """
 99 |         num_candidates = descriptions.shape[1]
100 | 
101 |         embedding_mention = self.mention_encoder.forward(mentions, mask1)[0]  # 16x64x768
102 |         embedding_mention2 = (embedding_mention.gather(1, positions.reshape(-1, 1, 1).repeat(1, 1, self.bert_config.hidden_size),).squeeze(1))
103 |         # embedding_mention2 = self.dropout(embedding_mention2) #16x768
104 | 
105 |         descriptions = descriptions.flatten(start_dim=0, end_dim=1)  # 16x20x64 -> 320x64
106 |         mask2 = mask2.flatten(start_dim=0, end_dim=1)
107 | 
108 |         embedding_entities = self.entity_encoder.forward(descriptions, mask2)[0]  # 320x64x768
109 |         # embedding_entities = self.dropout(embedding_entities)
110 |         embedding_entities = embedding_entities[:, 0, :].squeeze(1)  # 320x768
111 |         embedding_entities = embedding_entities.reshape(embedding_mention2.shape[0], num_candidates, -1)  # 16x20x768
112 | 
113 |         # mentions 16x768, entities #16x20x768
114 |         embedding_mention2 = embedding_mention2.unsqueeze(1)  # 16x1x768
115 |         embedding_mention2 = embedding_mention2.repeat_interleave(num_candidates, dim=1)  # 16x20x768
116 | 
117 |         similarities = self.cosine_similarity(embedding_mention2, embedding_entities)
118 | 
119 |         return similarities
120 | 
121 |     def step(self, batch: Any, batch_idx: int, dataset_type:str):
122 |         
123 |         if ner_constrained_decoding.lower() == ("no") or ner_constrained_decoding.lower() == ("n"):
124 |             mentions, positions, candidates, descriptions, labels = batch
125 |             positions = torch.tensor(positions, device=self.device)
126 | 
127 |             mask1 = self.padding_mask(mentions)
128 |             mask2 = self.padding_mask(descriptions)
129 | 
130 |             similarities = self.forward(mentions, positions, descriptions, mask1, mask2)
131 |             normalized_similarities = self.normalize(similarities)
132 | 
133 |             gold = torch.zeros(normalized_similarities.shape[0])
134 |             for i in range(descriptions.shape[0]):  # i is the index of the batch
135 |                 for j in range(descriptions.shape[1]):
136 |                     if candidates[i][j] == labels[i]:
137 |                         gold[i] = j
138 |             gold = gold.type(torch.LongTensor).to(self.device)
139 | 
140 | 
141 |             loss = self.loss_function(normalized_similarities, gold)
142 | 
143 |             all_predictions = list()
144 |             all_labels = list()
145 |             
146 |             if dataset_type != "train":
147 |                 for i in range(len(similarities)):
148 |                     current_candidates = list(filter(lambda x: x!=0, candidates[i]))
149 |                     normalized_similarities_line = normalized_similarities[i][:len(current_candidates)]
150 |                     all_predictions.append(int(candidates[i][torch.argmax(normalized_similarities_line)]))
151 |                     all_labels.append(int(labels[i]))
152 | 
153 |             if dataset_type=="train":
154 |                 if not math.isnan(loss):
155 |                     return {
156 |                         "loss": loss,
157 |                         "pred": all_predictions,
158 |                         "gold": all_labels,
159 |                     }
160 |                 else:
161 |                     return None
162 |             else:
163 |                 return {
164 |                         "pred": all_predictions,
165 |                         "gold": all_labels,
166 |                     }
167 | 
168 |         else:
169 |             softmax_function = nn.Softmax(dim=1)
170 | 
171 |             mentions, positions, candidates, descriptions, labels = batch
172 |             positions = torch.tensor(positions, device=self.device)
173 | 
174 |             mask1 = self.padding_mask(mentions)
175 |             mask2 = self.padding_mask(descriptions)
176 | 
177 |             similarities = self.forward(mentions, positions, descriptions, mask1, mask2)
178 |             normalized_similarities = self.normalize(similarities)
179 | 
180 |             predictions = ner_model.forward(mentions, positions, mask1)
181 |             predictions = softmax_function(predictions)
182 | 
183 |             gold = torch.zeros(normalized_similarities.shape[0])
184 |             for i in range(descriptions.shape[0]):  # i is the index of the batch
185 |                 for j in range(descriptions.shape[1]):
186 |                     if candidates[i][j] == labels[i]:
187 |                         gold[i] = j
188 |             gold = gold.type(torch.LongTensor).to(self.device)
189 | 
190 | 
191 |             loss = self.loss_function(normalized_similarities, gold)
192 | 
193 |             all_predictions = list()
194 |             all_labels = list()
195 |             
196 |             if dataset_type != "train":
197 |                 for i in range(len(similarities)):
198 |                     
199 |                     top_k_ner = torch.topk(predictions[i], 3)[1]
200 |                     target_ner_tag_id = top_k_ner[0].item()
201 |                     confidence = predictions[i][target_ner_tag_id].item()
202 |                     target_ner_tags = []
203 |                     for candidate in top_k_ner:
204 |                         target_ner_tags.append(self.get_key(labels_vocab, candidate))
205 |                                             
206 |                     current_candidates = list(filter(lambda x: x!=0, candidates[i]))
207 |                     normalized_similarities_line = normalized_similarities[i][:len(current_candidates)]
208 |                     current_candidates_ner = [id2ner[str(c.item())] if str(c.item()) in id2ner else "" for c in current_candidates]
209 |                     for k in range(len(current_candidates)):
210 |                         if current_candidates_ner[k] not in target_ner_tags[0] and confidence>0.5: #confidence
211 |                             normalized_similarities_line[k] = 0.0
212 |                         elif current_candidates_ner[k] not in target_ner_tags: #top-k
213 |                             normalized_similarities_line[k] = 0.0
214 | 
215 |                             
216 |                     all_predictions.append(int(candidates[i][torch.argmax(normalized_similarities_line)]))       
217 |                     all_labels.append(labels[i])
218 | 
219 |             if dataset_type=="train":
220 |                 if not math.isnan(loss):
221 |                     return {
222 |                         "loss": loss,
223 |                         "pred": all_predictions,
224 |                         "gold": all_labels,
225 |                     }
226 |                 else:
227 |                     return None
228 |             else:
229 |                 return {
230 |                         "pred": all_predictions,
231 |                         "gold": all_labels,
232 |                     }
233 | 
234 | 
235 | 
236 | 
237 |     def training_step(self, batch: Any, batch_idx: int) -> torch.Tensor:
238 |         step_output = self.step(batch, batch_idx, "train")
239 |         if step_output is not None:
240 |             self.log_dict(
241 |                 {"train_loss": step_output["loss"]},
242 |                 on_step=True,
243 |                 on_epoch=True,
244 |                 prog_bar=True,
245 |             )
246 |         return step_output
247 | 
248 |     def validation_step(self, batch: Any, batch_idx: int) -> torch.Tensor:
249 |         step_output = self.step(batch, batch_idx, "dev")
250 |         return step_output
251 | 
252 |     def test_step(self, batch: Any, batch_idx: int) -> torch.Tensor:
253 |         step_output = self.step(batch, batch_idx, "test")
254 |         return step_output
255 | 
256 | 
257 |     def my_epoch_end(self, outputs: List[Any], split:str) -> None:
258 |         all_predictions = []
259 |         all_labels = []
260 | 
261 |         for elem in outputs:
262 |             all_predictions.extend(elem["pred"])
263 |             all_labels.extend(elem["gold"])
264 |         
265 |         f1_micro = f1_score(all_labels, all_predictions, average='micro')
266 |         self.log_dict(
267 |                 {f"{split}_acc": f1_micro},
268 |                 prog_bar=True
269 |             )
270 | 
271 |         return super().validation_epoch_end(outputs)
272 | 
273 |     def validation_epoch_end(self, outputs: List[Any]) -> None:
274 |         return self.my_epoch_end(outputs, "val")
275 | 
276 |     def test_epoch_end(self, outputs: List[Any]) -> None:
277 |         return self.my_epoch_end(outputs, "test")
278 |         
279 | 
280 |     def configure_optimizers(
281 |         self,
282 |     ) -> Union[Optimizer, Tuple[Sequence[Optimizer], Sequence[Any]]]:
283 |         """
284 |         Choose what optimizers and learning-rate schedulers to use in your optimization.
285 |         Normally you'd need one. But in the case of GANs or similar you might have multiple.
286 |         Return:
287 |             Any of these 6 options.
288 |             - Single optimizer.
289 |             - List or Tuple - List of optimizers.
290 |             - Two lists - The first list has multiple optimizers, the second a list of LR schedulers (or lr_dict).
291 |             - Dictionary, with an 'optimizer' key, and (optionally) a 'lr_scheduler'
292 |               key whose value is a single LR scheduler or lr_dict.
293 |             - Tuple of dictionaries as described, with an optional 'frequency' key.
294 |             - None - Fit will run without any optimizer.
295 |         """
296 |         opt = hydra.utils.instantiate(
297 |             self.hparams.optim.optimizer, params=self.parameters(), _convert_="partial"
298 |         )
299 |         if not self.hparams.optim.use_lr_scheduler:
300 |             return [opt]
301 |         scheduler = hydra.utils.instantiate(
302 |             self.hparams.optim.lr_scheduler, optimizer=opt
303 |         )
304 |         return [opt], [scheduler]
305 | 
306 |     def padding_mask(self, batch):
307 |         padding = torch.ones_like(batch)
308 |         padding[batch == 0] = 0
309 |         padding = padding.type(torch.int64)
310 |         return padding
311 | 
312 |     def normalize(self, m):
313 |         row_min, _ = m.min(dim=1, keepdim=True)
314 |         row_max, _ = m.max(dim=1, keepdim=True)
315 |         return (m - row_min) / (row_max - row_min)
316 | 
317 |     def get_key(self, dictionary, val):
318 |         for key, value in dictionary.items():
319 |             if val == value:
320 |                 return key
321 |     
322 |         return "key doesn't exist"
323 | 
324 | 
325 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default")
326 | def main(cfg: omegaconf.DictConfig):
327 |     model: pl.LightningModule = hydra.utils.instantiate(
328 |         cfg.model,
329 |         optim=cfg.optim,
330 |         data=cfg.data,
331 |         logging=cfg.logging,
332 |         _recursive_=False,
333 |     )
334 | 
335 | 
336 | if __name__ == "__main__":
337 |     main()
338 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | =======================================================================
  2 | 
  3 | Attribution-NonCommercial-ShareAlike 4.0 International
  4 | 
  5 | =======================================================================
  6 | 
  7 | Creative Commons Corporation ("Creative Commons") is not a law firm and
  8 | does not provide legal services or legal advice. Distribution of
  9 | Creative Commons public licenses does not create a lawyer-client or
 10 | other relationship. Creative Commons makes its licenses and related
 11 | information available on an "as-is" basis. Creative Commons gives no
 12 | warranties regarding its licenses, any material licensed under their
 13 | terms and conditions, or any related information. Creative Commons
 14 | disclaims all liability for damages resulting from their use to the
 15 | fullest extent possible.
 16 | 
 17 | Using Creative Commons Public Licenses
 18 | 
 19 | Creative Commons public licenses provide a standard set of terms and
 20 | conditions that creators and other rights holders may use to share
 21 | original works of authorship and other material subject to copyright
 22 | and certain other rights specified in the public license below. The
 23 | following considerations are for informational purposes only, are not
 24 | exhaustive, and do not form part of our licenses.
 25 | 
 26 |      Considerations for licensors: Our public licenses are
 27 |      intended for use by those authorized to give the public
 28 |      permission to use material in ways otherwise restricted by
 29 |      copyright and certain other rights. Our licenses are
 30 |      irrevocable. Licensors should read and understand the terms
 31 |      and conditions of the license they choose before applying it.
 32 |      Licensors should also secure all rights necessary before
 33 |      applying our licenses so that the public can reuse the
 34 |      material as expected. Licensors should clearly mark any
 35 |      material not subject to the license. This includes other CC-
 36 |      licensed material, or material used under an exception or
 37 |      limitation to copyright. More considerations for licensors:
 38 |     wiki.creativecommons.org/Considerations_for_licensors
 39 | 
 40 |      Considerations for the public: By using one of our public
 41 |      licenses, a licensor grants the public permission to use the
 42 |      licensed material under specified terms and conditions. If
 43 |      the licensor's permission is not necessary for any reason--for
 44 |      example, because of any applicable exception or limitation to
 45 |      copyright--then that use is not regulated by the license. Our
 46 |      licenses grant only permissions under copyright and certain
 47 |      other rights that a licensor has authority to grant. Use of
 48 |      the licensed material may still be restricted for other
 49 |      reasons, including because others have copyright or other
 50 |      rights in the material. A licensor may make special requests,
 51 |      such as asking that all changes be marked or described.
 52 |      Although not required by our licenses, you are encouraged to
 53 |      respect those requests where reasonable. More considerations
 54 |      for the public:
 55 |     wiki.creativecommons.org/Considerations_for_licensees
 56 | 
 57 | =======================================================================
 58 | 
 59 | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
 60 | Public License
 61 | 
 62 | By exercising the Licensed Rights (defined below), You accept and agree
 63 | to be bound by the terms and conditions of this Creative Commons
 64 | Attribution-NonCommercial-ShareAlike 4.0 International Public License
 65 | ("Public License"). To the extent this Public License may be
 66 | interpreted as a contract, You are granted the Licensed Rights in
 67 | consideration of Your acceptance of these terms and conditions, and the
 68 | Licensor grants You such rights in consideration of benefits the
 69 | Licensor receives from making the Licensed Material available under
 70 | these terms and conditions.
 71 | 
 72 | 
 73 | Section 1 -- Definitions.
 74 | 
 75 |   a. Adapted Material means material subject to Copyright and Similar
 76 |      Rights that is derived from or based upon the Licensed Material
 77 |      and in which the Licensed Material is translated, altered,
 78 |      arranged, transformed, or otherwise modified in a manner requiring
 79 |      permission under the Copyright and Similar Rights held by the
 80 |      Licensor. For purposes of this Public License, where the Licensed
 81 |      Material is a musical work, performance, or sound recording,
 82 |      Adapted Material is always produced where the Licensed Material is
 83 |      synched in timed relation with a moving image.
 84 | 
 85 |   b. Adapter's License means the license You apply to Your Copyright
 86 |      and Similar Rights in Your contributions to Adapted Material in
 87 |      accordance with the terms and conditions of this Public License.
 88 | 
 89 |   c. BY-NC-SA Compatible License means a license listed at
 90 |      creativecommons.org/compatiblelicenses, approved by Creative
 91 |      Commons as essentially the equivalent of this Public License.
 92 | 
 93 |   d. Copyright and Similar Rights means copyright and/or similar rights
 94 |      closely related to copyright including, without limitation,
 95 |      performance, broadcast, sound recording, and Sui Generis Database
 96 |      Rights, without regard to how the rights are labeled or
 97 |      categorized. For purposes of this Public License, the rights
 98 |      specified in Section 2(b)(1)-(2) are not Copyright and Similar
 99 |      Rights.
100 | 
101 |   e. Effective Technological Measures means those measures that, in the
102 |      absence of proper authority, may not be circumvented under laws
103 |      fulfilling obligations under Article 11 of the WIPO Copyright
104 |      Treaty adopted on December 20, 1996, and/or similar international
105 |      agreements.
106 | 
107 |   f. Exceptions and Limitations means fair use, fair dealing, and/or
108 |      any other exception or limitation to Copyright and Similar Rights
109 |      that applies to Your use of the Licensed Material.
110 | 
111 |   g. License Elements means the license attributes listed in the name
112 |      of a Creative Commons Public License. The License Elements of this
113 |      Public License are Attribution, NonCommercial, and ShareAlike.
114 | 
115 |   h. Licensed Material means the artistic or literary work, database,
116 |      or other material to which the Licensor applied this Public
117 |      License.
118 | 
119 |   i. Licensed Rights means the rights granted to You subject to the
120 |      terms and conditions of this Public License, which are limited to
121 |      all Copyright and Similar Rights that apply to Your use of the
122 |      Licensed Material and that the Licensor has authority to license.
123 | 
124 |   j. Licensor means the individual(s) or entity(ies) granting rights
125 |      under this Public License.
126 | 
127 |   k. NonCommercial means not primarily intended for or directed towards
128 |      commercial advantage or monetary compensation. For purposes of
129 |      this Public License, the exchange of the Licensed Material for
130 |      other material subject to Copyright and Similar Rights by digital
131 |      file-sharing or similar means is NonCommercial provided there is
132 |      no payment of monetary compensation in connection with the
133 |      exchange.
134 | 
135 |   l. Share means to provide material to the public by any means or
136 |      process that requires permission under the Licensed Rights, such
137 |      as reproduction, public display, public performance, distribution,
138 |      dissemination, communication, or importation, and to make material
139 |      available to the public including in ways that members of the
140 |      public may access the material from a place and at a time
141 |      individually chosen by them.
142 | 
143 |   m. Sui Generis Database Rights means rights other than copyright
144 |      resulting from Directive 96/9/EC of the European Parliament and of
145 |      the Council of 11 March 1996 on the legal protection of databases,
146 |      as amended and/or succeeded, as well as other essentially
147 |      equivalent rights anywhere in the world.
148 | 
149 |   n. You means the individual or entity exercising the Licensed Rights
150 |      under this Public License. Your has a corresponding meaning.
151 | 
152 | 
153 | Section 2 -- Scope.
154 | 
155 |   a. License grant.
156 | 
157 |        1. Subject to the terms and conditions of this Public License,
158 |           the Licensor hereby grants You a worldwide, royalty-free,
159 |           non-sublicensable, non-exclusive, irrevocable license to
160 |           exercise the Licensed Rights in the Licensed Material to:
161 | 
162 |             a. reproduce and Share the Licensed Material, in whole or
163 |                in part, for NonCommercial purposes only; and
164 | 
165 |             b. produce, reproduce, and Share Adapted Material for
166 |                NonCommercial purposes only.
167 | 
168 |        2. Exceptions and Limitations. For the avoidance of doubt, where
169 |           Exceptions and Limitations apply to Your use, this Public
170 |           License does not apply, and You do not need to comply with
171 |           its terms and conditions.
172 | 
173 |        3. Term. The term of this Public License is specified in Section
174 |           6(a).
175 | 
176 |        4. Media and formats; technical modifications allowed. The
177 |           Licensor authorizes You to exercise the Licensed Rights in
178 |           all media and formats whether now known or hereafter created,
179 |           and to make technical modifications necessary to do so. The
180 |           Licensor waives and/or agrees not to assert any right or
181 |           authority to forbid You from making technical modifications
182 |           necessary to exercise the Licensed Rights, including
183 |           technical modifications necessary to circumvent Effective
184 |           Technological Measures. For purposes of this Public License,
185 |           simply making modifications authorized by this Section 2(a)
186 |           (4) never produces Adapted Material.
187 | 
188 |        5. Downstream recipients.
189 | 
190 |             a. Offer from the Licensor -- Licensed Material. Every
191 |                recipient of the Licensed Material automatically
192 |                receives an offer from the Licensor to exercise the
193 |                Licensed Rights under the terms and conditions of this
194 |                Public License.
195 | 
196 |             b. Additional offer from the Licensor -- Adapted Material.
197 |                Every recipient of Adapted Material from You
198 |                automatically receives an offer from the Licensor to
199 |                exercise the Licensed Rights in the Adapted Material
200 |                under the conditions of the Adapter's License You apply.
201 | 
202 |             c. No downstream restrictions. You may not offer or impose
203 |                any additional or different terms or conditions on, or
204 |                apply any Effective Technological Measures to, the
205 |                Licensed Material if doing so restricts exercise of the
206 |                Licensed Rights by any recipient of the Licensed
207 |                Material.
208 | 
209 |        6. No endorsement. Nothing in this Public License constitutes or
210 |           may be construed as permission to assert or imply that You
211 |           are, or that Your use of the Licensed Material is, connected
212 |           with, or sponsored, endorsed, or granted official status by,
213 |           the Licensor or others designated to receive attribution as
214 |           provided in Section 3(a)(1)(A)(i).
215 | 
216 |   b. Other rights.
217 | 
218 |        1. Moral rights, such as the right of integrity, are not
219 |           licensed under this Public License, nor are publicity,
220 |           privacy, and/or other similar personality rights; however, to
221 |           the extent possible, the Licensor waives and/or agrees not to
222 |           assert any such rights held by the Licensor to the limited
223 |           extent necessary to allow You to exercise the Licensed
224 |           Rights, but not otherwise.
225 | 
226 |        2. Patent and trademark rights are not licensed under this
227 |           Public License.
228 | 
229 |        3. To the extent possible, the Licensor waives any right to
230 |           collect royalties from You for the exercise of the Licensed
231 |           Rights, whether directly or through a collecting society
232 |           under any voluntary or waivable statutory or compulsory
233 |           licensing scheme. In all other cases the Licensor expressly
234 |           reserves any right to collect such royalties, including when
235 |           the Licensed Material is used other than for NonCommercial
236 |           purposes.
237 | 
238 | 
239 | Section 3 -- License Conditions.
240 | 
241 | Your exercise of the Licensed Rights is expressly made subject to the
242 | following conditions.
243 | 
244 |   a. Attribution.
245 | 
246 |        1. If You Share the Licensed Material (including in modified
247 |           form), You must:
248 | 
249 |             a. retain the following if it is supplied by the Licensor
250 |                with the Licensed Material:
251 | 
252 |                  i. identification of the creator(s) of the Licensed
253 |                     Material and any others designated to receive
254 |                     attribution, in any reasonable manner requested by
255 |                     the Licensor (including by pseudonym if
256 |                     designated);
257 | 
258 |                 ii. a copyright notice;
259 | 
260 |                iii. a notice that refers to this Public License;
261 | 
262 |                 iv. a notice that refers to the disclaimer of
263 |                     warranties;
264 | 
265 |                  v. a URI or hyperlink to the Licensed Material to the
266 |                     extent reasonably practicable;
267 | 
268 |             b. indicate if You modified the Licensed Material and
269 |                retain an indication of any previous modifications; and
270 | 
271 |             c. indicate the Licensed Material is licensed under this
272 |                Public License, and include the text of, or the URI or
273 |                hyperlink to, this Public License.
274 | 
275 |        2. You may satisfy the conditions in Section 3(a)(1) in any
276 |           reasonable manner based on the medium, means, and context in
277 |           which You Share the Licensed Material. For example, it may be
278 |           reasonable to satisfy the conditions by providing a URI or
279 |           hyperlink to a resource that includes the required
280 |           information.
281 |        3. If requested by the Licensor, You must remove any of the
282 |           information required by Section 3(a)(1)(A) to the extent
283 |           reasonably practicable.
284 | 
285 |   b. ShareAlike.
286 | 
287 |      In addition to the conditions in Section 3(a), if You Share
288 |      Adapted Material You produce, the following conditions also apply.
289 | 
290 |        1. The Adapter's License You apply must be a Creative Commons
291 |           license with the same License Elements, this version or
292 |           later, or a BY-NC-SA Compatible License.
293 | 
294 |        2. You must include the text of, or the URI or hyperlink to, the
295 |           Adapter's License You apply. You may satisfy this condition
296 |           in any reasonable manner based on the medium, means, and
297 |           context in which You Share Adapted Material.
298 | 
299 |        3. You may not offer or impose any additional or different terms
300 |           or conditions on, or apply any Effective Technological
301 |           Measures to, Adapted Material that restrict exercise of the
302 |           rights granted under the Adapter's License You apply.
303 | 
304 | 
305 | Section 4 -- Sui Generis Database Rights.
306 | 
307 | Where the Licensed Rights include Sui Generis Database Rights that
308 | apply to Your use of the Licensed Material:
309 | 
310 |   a. for the avoidance of doubt, Section 2(a)(1) grants You the right
311 |      to extract, reuse, reproduce, and Share all or a substantial
312 |      portion of the contents of the database for NonCommercial purposes
313 |      only;
314 | 
315 |   b. if You include all or a substantial portion of the database
316 |      contents in a database in which You have Sui Generis Database
317 |      Rights, then the database in which You have Sui Generis Database
318 |      Rights (but not its individual contents) is Adapted Material,
319 |      including for purposes of Section 3(b); and
320 | 
321 |   c. You must comply with the conditions in Section 3(a) if You Share
322 |      all or a substantial portion of the contents of the database.
323 | 
324 | For the avoidance of doubt, this Section 4 supplements and does not
325 | replace Your obligations under this Public License where the Licensed
326 | Rights include other Copyright and Similar Rights.
327 | 
328 | 
329 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
330 | 
331 |   a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
332 |      EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
333 |      AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
334 |      ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
335 |      IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
336 |      WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
337 |      PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
338 |      ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
339 |      KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
340 |      ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
341 | 
342 |   b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
343 |      TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
344 |      NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
345 |      INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
346 |      COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
347 |      USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
348 |      ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
349 |      DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
350 |      IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
351 | 
352 |   c. The disclaimer of warranties and limitation of liability provided
353 |      above shall be interpreted in a manner that, to the extent
354 |      possible, most closely approximates an absolute disclaimer and
355 |      waiver of all liability.
356 | 
357 | 
358 | Section 6 -- Term and Termination.
359 | 
360 |   a. This Public License applies for the term of the Copyright and
361 |      Similar Rights licensed here. However, if You fail to comply with
362 |      this Public License, then Your rights under this Public License
363 |      terminate automatically.
364 | 
365 |   b. Where Your right to use the Licensed Material has terminated under
366 |      Section 6(a), it reinstates:
367 | 
368 |        1. automatically as of the date the violation is cured, provided
369 |           it is cured within 30 days of Your discovery of the
370 |           violation; or
371 | 
372 |        2. upon express reinstatement by the Licensor.
373 | 
374 |      For the avoidance of doubt, this Section 6(b) does not affect any
375 |      right the Licensor may have to seek remedies for Your violations
376 |      of this Public License.
377 | 
378 |   c. For the avoidance of doubt, the Licensor may also offer the
379 |      Licensed Material under separate terms or conditions or stop
380 |      distributing the Licensed Material at any time; however, doing so
381 |      will not terminate this Public License.
382 | 
383 |   d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
384 |      License.
385 | 
386 | 
387 | Section 7 -- Other Terms and Conditions.
388 | 
389 |   a. The Licensor shall not be bound by any additional or different
390 |      terms or conditions communicated by You unless expressly agreed.
391 | 
392 |   b. Any arrangements, understandings, or agreements regarding the
393 |      Licensed Material not stated herein are separate from and
394 |      independent of the terms and conditions of this Public License.
395 | 
396 | 
397 | Section 8 -- Interpretation.
398 | 
399 |   a. For the avoidance of doubt, this Public License does not, and
400 |      shall not be interpreted to, reduce, limit, restrict, or impose
401 |      conditions on any use of the Licensed Material that could lawfully
402 |      be made without permission under this Public License.
403 | 
404 |   b. To the extent possible, if any provision of this Public License is
405 |      deemed unenforceable, it shall be automatically reformed to the
406 |      minimum extent necessary to make it enforceable. If the provision
407 |      cannot be reformed, it shall be severed from this Public License
408 |      without affecting the enforceability of the remaining terms and
409 |      conditions.
410 | 
411 |   c. No term or condition of this Public License will be waived and no
412 |      failure to comply consented to unless expressly agreed to by the
413 |      Licensor.
414 | 
415 |   d. Nothing in this Public License constitutes or may be interpreted
416 |      as a limitation upon, or waiver of, any privileges and immunities
417 |      that apply to the Licensor or You, including from the legal
418 |      processes of any jurisdiction or authority.
419 | 
420 | =======================================================================
421 | 
422 | Creative Commons is not a party to its public
423 | licenses. Notwithstanding, Creative Commons may elect to apply one of
424 | its public licenses to material it publishes and in those instances
425 | will be considered the “Licensor.” The text of the Creative Commons
426 | public licenses is dedicated to the public domain under the CC0 Public
427 | Domain Dedication. Except for the limited purpose of indicating that
428 | material is shared under a Creative Commons public license or as
429 | otherwise permitted by the Creative Commons policies published at
430 | creativecommons.org/policies, Creative Commons does not authorize the
431 | use of the trademark "Creative Commons" or any other trademark or logo
432 | of Creative Commons without its prior written consent including,
433 | without limitation, in connection with any unauthorized modifications
434 | to any of its public licenses or any other arrangements,
435 | understandings, or agreements concerning use of licensed material. For
436 | the avoidance of doubt, this paragraph does not form part of the
437 | public licenses.
438 | 
439 | Creative Commons may be contacted at creativecommons.org.
440 | 


--------------------------------------------------------------------------------