├── ner4el ├── src │ ├── ui │ │ ├── __init__.py │ │ ├── run.py │ │ └── ui_utils.py │ ├── common │ │ ├── __init__.py │ │ └── utils.py │ ├── pl_data │ │ ├── __init__.py │ │ ├── datamodule.py │ │ └── dataset.py │ ├── pl_modules │ │ ├── __init__.py │ │ ├── ner_model.py │ │ └── model.py │ ├── test.py │ └── run.py ├── conf │ ├── model │ │ └── default.yaml │ ├── hydra │ │ └── default.yaml │ ├── default.yaml │ ├── optim │ │ └── default.yaml │ ├── train │ │ └── default.yaml │ ├── logging │ │ └── default.yaml │ └── data │ │ └── default.yaml ├── wandb │ └── README.md ├── data │ └── README.md ├── preprocessed_datasets │ └── README.md ├── requirements.txt ├── LICENSE_TEMPLATE └── .gitignore ├── img ├── logo_ner4el.png ├── percentages.png └── contributions.png ├── .env ├── requirements.txt ├── README.md └── LICENSE /ner4el/src/ui/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ner4el/src/common/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ner4el/src/pl_data/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ner4el/src/pl_modules/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /img/logo_ner4el.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Babelscape/ner4el/HEAD/img/logo_ner4el.png -------------------------------------------------------------------------------- /img/percentages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Babelscape/ner4el/HEAD/img/percentages.png -------------------------------------------------------------------------------- /img/contributions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Babelscape/ner4el/HEAD/img/contributions.png -------------------------------------------------------------------------------- /ner4el/conf/model/default.yaml: -------------------------------------------------------------------------------- 1 | _target_: src.pl_modules.model.MyModel 2 | 3 | transformer_name: bert-large-uncased 4 | dropout: 0.5 5 | -------------------------------------------------------------------------------- /ner4el/wandb/README.md: -------------------------------------------------------------------------------- 1 | This folder will contain all the model checkpoints. The pretrained NER classifier model should also be placed here. 2 | 3 | -------------------------------------------------------------------------------- /ner4el/data/README.md: -------------------------------------------------------------------------------- 1 | This folder should contain the resources (i.e., alias table, descriptions dict, counts dict, title dict) and the datasets (aida-train, aida-dev, aida-test, msnbc, aquaint, ace2004, cweb, wiki). 
2 | -------------------------------------------------------------------------------- /.env: -------------------------------------------------------------------------------- 1 | export PROJECT_ROOT="/mnt/data/ner4el/ner4el" 2 | export YOUR_TRAIN_DATASET_PATH="/your/project/root/data/blues/train" 3 | export YOUR_VAL_DATASET_PATH="/your/project/root/data/blues/val" 4 | export YOUR_TEST_DATASET_PATH="/your/project/root/data/blues/test" 5 | export PYTHONPATH=$PROJECT_ROOT 6 | -------------------------------------------------------------------------------- /ner4el/conf/hydra/default.yaml: -------------------------------------------------------------------------------- 1 | run: 2 | dir: .cache/${now:%Y-%m-%d}/${now:%H-%M-%S} 3 | 4 | sweep: 5 | dir: .cache/multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}/ 6 | subdir: ${hydra.job.num}_${hydra.job.id} 7 | 8 | job: 9 | env_set: 10 | WANDB_START_METHOD: thread 11 | WANDB_DIR: ${oc.env:PROJECT_ROOT} 12 | -------------------------------------------------------------------------------- /ner4el/preprocessed_datasets/README.md: -------------------------------------------------------------------------------- 1 | Once the src/run.py script is executed, the indexed training dataset will be stored in this folder. This makes it possible to avoid regenerating the training dataset from scratch on every run (a process that takes around half an hour). To enable this, set "processed: True" in the conf/data/default.yaml file. 2 | 3 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # Stuff easy to break with updates 2 | torch==1.8.1 3 | torchvision==0.9.1 4 | pytorch-lightning==1.2.5 5 | transformers==4.6.0 6 | hydra-core==1.1.0.dev5 7 | wandb==0.10.23 8 | streamlit==0.79.0 9 | # hydra-joblib-launcher 10 | 11 | # Stable stuff usually backward compatible 12 | dvc 13 | python-dotenv 14 | sklearn 15 | matplotlib 16 | stqdm 17 | -------------------------------------------------------------------------------- /ner4el/requirements.txt: -------------------------------------------------------------------------------- 1 | # Stuff easy to break with updates 2 | torch==1.8.1 3 | torchvision==0.9.1 4 | pytorch-lightning==1.2.5 5 | hydra-core==1.1.0.dev5 6 | wandb==0.10.23 7 | streamlit==0.79.0 8 | transformers==4.6.0 9 | # hydra-joblib-launcher 10 | 11 | # Stable stuff usually backward compatible 12 | dvc 13 | python-dotenv 14 | matplotlib 15 | stqdm 16 | sklearn 17 | -------------------------------------------------------------------------------- /ner4el/conf/default.yaml: -------------------------------------------------------------------------------- 1 | # metadata specialised for each experiment 2 | core: 3 | version: 0.0.1 4 | tags: 5 | - mytag 6 | 7 | defaults: 8 | - data: default 9 | - hydra: default 10 | - logging: default 11 | - model: default 12 | - optim: default 13 | - train: default 14 | # Uncomment this entry to enable parallel job execution 15 | # - hydra/launcher: joblib 16 | -------------------------------------------------------------------------------- /ner4el/src/ui/run.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | import streamlit as st 4 | 5 | from src.pl_modules.model import MyModel 6 | from src.ui.ui_utils import select_checkpoint 7 | 8 | 9 | @st.cache(allow_output_mutation=True) 10 | def get_model(checkpoint_path: Path): 11 | return 
MyModel.load_from_checkpoint(checkpoint_path=str(checkpoint_path)) 12 | 13 | 14 | checkpoint_path = select_checkpoint() 15 | model: MyModel = get_model(checkpoint_path=checkpoint_path) 16 | -------------------------------------------------------------------------------- /ner4el/conf/optim/default.yaml: -------------------------------------------------------------------------------- 1 | optimizer: 2 | # Adam-oriented deep learning 3 | _target_: torch.optim.Adam 4 | # Adam hyperparameters (betas, eps and weight_decay match the PyTorch defaults) 5 | lr: 0.00001 6 | betas: [ 0.9, 0.999 ] 7 | eps: 1e-08 8 | weight_decay: 0 9 | 10 | use_lr_scheduler: False 11 | lr_scheduler: 12 | _target_: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts 13 | T_0: 10 14 | T_mult: 2 15 | eta_min: 0 # min value for the lr 16 | last_epoch: -1 17 | verbose: True 18 | -------------------------------------------------------------------------------- /ner4el/conf/train/default.yaml: -------------------------------------------------------------------------------- 1 | # reproducibility 2 | deterministic: True 3 | random_seed: 2 4 | 5 | # training 6 | 7 | pl_trainer: # everything under pl_trainer is passed directly to the Lightning Trainer 8 | fast_dev_run: False # Enable this for debug purposes 9 | gpus: 1 10 | precision: 32 11 | max_epochs: 100 12 | accumulate_grad_batches: 256 13 | num_sanity_val_steps: 2 14 | #gradient_clip_val: 10.0 15 | 16 | monitor_metric: 'val_acc' 17 | monitor_metric_mode: 'max' 18 | 19 | early_stopping: 20 | patience: 5 21 | verbose: True 22 | 23 | model_checkpoints: 24 | save_top_k: 3 25 | verbose: True 26 | -------------------------------------------------------------------------------- /ner4el/conf/logging/default.yaml: -------------------------------------------------------------------------------- 1 | # log frequency 2 | val_check_interval: 5000 3 | progress_bar_refresh_rate: 20 4 | 5 | wandb: 6 | project: NER_for_EL 7 | entity: null 8 | log_model: False 9 | mode: 'offline' 10 | name: ${data.datamodule.version}-negativesamples=${data.datamodule.negative_samples}-nernegativesamples=${data.datamodule.ner_negative_samples}-nerrepresentation=${data.datamodule.ner_representation}-${data.datamodule.datasets.train.num_candidates}-${data.datamodule.datasets.train.window}-${model.transformer_name}-precision${train.pl_trainer.precision}-accumulation${train.pl_trainer.accumulate_grad_batches} 11 | 12 | wandb_watch: 13 | log: 'all' 14 | log_freq: 100 15 | 16 | lr_monitor: 17 | logging_interval: "step" 18 | log_momentum: False 19 | -------------------------------------------------------------------------------- /ner4el/LICENSE_TEMPLATE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Valentino Maiorca, Luca Moschella 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /ner4el/conf/data/default.yaml: -------------------------------------------------------------------------------- 1 | datamodule: 2 | _target_: src.pl_data.datamodule.MyDataModule 3 | transformer_name: ${model.transformer_name} 4 | alias_table_path: data/alias_table.pickle 5 | descriptions_dict_path: data/descriptions_dict.csv 6 | item_counts_dict_path: data/item_counts_dict.csv 7 | title_dict_path: data/title_dict.pickle 8 | id2ner_dict_path: data/id2ner_dict.pickle 9 | version: ${model.transformer_name} 10 | negative_samples: False 11 | ner_negative_samples: True 12 | ner_representation: True 13 | ner_filter_candidates: False 14 | #ner_constrained_decoding -> can be specified once the script is started 15 | processed: True 16 | 17 | datasets: 18 | train: 19 | _target_: src.pl_data.dataset.MyDataset 20 | name: train 21 | path: data/aida_train.jsonl 22 | num_candidates: 40 23 | window: 128 24 | dataset_type: train 25 | 26 | val: 27 | _target_: src.pl_data.dataset.MyDataset 28 | name: dev 29 | path: data/aida_dev.jsonl 30 | num_candidates: 40 31 | window: 128 32 | dataset_type: dev 33 | 34 | #UNCOMMENT THE BLOCK CORRESPONDING TO THE TEST SET THAT YOU WANT TO USE (Ctrl + Shift + 7) 35 | 36 | test: 37 | _target_: src.pl_data.dataset.MyDataset 38 | name: test 39 | path: data/aida_test.jsonl 40 | num_candidates: 40 41 | window: 128 42 | dataset_type: test 43 | 44 | # test: 45 | # _target_: src.pl_data.dataset.MyDataset 46 | # name: test 47 | # path: data/msnbc_test.jsonl 48 | # num_candidates: 40 49 | # window: 128 50 | # dataset_type: test 51 | 52 | # test: 53 | # _target_: src.pl_data.dataset.MyDataset 54 | # name: test 55 | # path: data/aquaint_test.jsonl 56 | # num_candidates: 5 57 | # window: 128 58 | # dataset_type: test 59 | 60 | # test: 61 | # _target_: src.pl_data.dataset.MyDataset 62 | # name: test 63 | # path: data/ace2004_test.jsonl 64 | # num_candidates: 40 65 | # window: 128 66 | # dataset_type: test 67 | 68 | # test: 69 | # _target_: src.pl_data.dataset.MyDataset 70 | # name: test 71 | # path: data/cweb_test.jsonl 72 | # num_candidates: 40 73 | # window: 128 74 | # dataset_type: test 75 | 76 | # test: 77 | # _target_: src.pl_data.dataset.MyDataset 78 | # name: test 79 | # path: data/wiki_test.jsonl 80 | # num_candidates: 40 81 | # window: 128 82 | # dataset_type: test 83 | 84 | num_workers: 85 | train: 8 86 | val: 4 87 | test: 4 88 | 89 | batch_size: 90 | train: 1 91 | val: 1 92 | test: 1 93 | -------------------------------------------------------------------------------- /ner4el/src/ui/ui_utils.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import operator 3 | from pathlib import Path 4 | from typing import List 5 | 6 | import hydra 7 | import omegaconf 8 | import streamlit as st 9 | import wandb 10 | from hydra.core.global_hydra import GlobalHydra 11 | from hydra.experimental import compose 12 | from stqdm import stqdm 13 | 14 | from 
src.common.utils import PROJECT_ROOT, load_envs 15 | 16 | load_envs() 17 | 18 | WANDB_DIR: Path = PROJECT_ROOT / "wandb" 19 | WANDB_DIR.mkdir(exist_ok=True, parents=True) 20 | 21 | st_run_sel = st.sidebar 22 | 23 | 24 | def local_checkpoint_selection(run_dir: Path, st_key: str) -> Path: 25 | checkpoint_paths: List[Path] = list(run_dir.rglob("checkpoints/*")) 26 | if len(checkpoint_paths) == 0: 27 | st.error( 28 | f"There's no checkpoint under {run_dir}! Are you sure the restore was successful?" 29 | ) 30 | st.stop() 31 | checkpoint_path: Path = st_run_sel.selectbox( 32 | label="Select a checkpoint", 33 | index=len(checkpoint_paths) - 1, 34 | options=checkpoint_paths, 35 | format_func=operator.attrgetter("name"), 36 | key=f"checkpoint_select_{st_key}", 37 | ) 38 | 39 | return checkpoint_path 40 | 41 | 42 | def get_run_dir(entity: str, project: str, run_id: str) -> Path: 43 | """ 44 | :param entity, project, run_id: the components of a run path such as "flegyas/nn-template/3hztfivf" 45 | :return: the local directory associated with the (possibly downloaded) run 46 | """ 47 | 48 | api = wandb.Api() 49 | run = api.run(path=f"{entity}/{project}/{run_id}") 50 | created_at: datetime.datetime = datetime.datetime.strptime( 51 | run.created_at, "%Y-%m-%dT%H:%M:%S" 52 | ) 53 | st.sidebar.markdown(body=f"[`Open on WandB`]({run.url})") 54 | 55 | timestamp: str = created_at.strftime("%Y%m%d_%H%M%S") 56 | 57 | matching_runs: List[Path] = [ 58 | item 59 | for item in WANDB_DIR.iterdir() 60 | if item.is_dir() and item.name.endswith(run_id) 61 | ] 62 | 63 | if len(matching_runs) > 1: 64 | st.error( 65 | f"More than one run matching unique id {run_id}! Are you sure about that?" 66 | ) 67 | st.stop() 68 | 69 | if len(matching_runs) == 1: 70 | return matching_runs[0] 71 | 72 | only_checkpoint: bool = st_run_sel.checkbox( 73 | label="Download only the checkpoint?", value=True 74 | ) 75 | if st_run_sel.button(label="Download"): 76 | run_dir: Path = WANDB_DIR / f"restored-{timestamp}-{run.id}" / "files" 77 | files = [ 78 | file 79 | for file in run.files() 80 | if "checkpoint" in file.name or not only_checkpoint 81 | ] 82 | if len(files) == 0: 83 | st.error( 84 | f"There is no file to download from this run! Check on WandB: {run.url}" 85 | ) 86 | for file in stqdm(files, desc="Downloading files..."): 87 | file.download(root=run_dir) 88 | return run_dir 89 | else: 90 | st.stop() 91 | 92 | 93 | def select_run_path(st_key: str, default_run_path: str): 94 | run_path: str = st_run_sel.text_input( 95 | label="Run path (entity/project/id):", 96 | value=default_run_path, 97 | key=f"run_path_select_{st_key}", 98 | ) 99 | if not run_path: 100 | st.stop() 101 | tokens: List[str] = run_path.split("/") 102 | if len(tokens) != 3: 103 | st.error( 104 | f"This run path {run_path} doesn't look like a WandB run path! Are you sure about that?" 
105 | ) 106 | st.stop() 107 | 108 | return tokens 109 | 110 | 111 | def select_checkpoint(st_key: str = "MyAwesomeModel", default_run_path: str = ""): 112 | entity, project, run_id = select_run_path( 113 | st_key=st_key, default_run_path=default_run_path 114 | ) 115 | 116 | run_dir: Path = get_run_dir(entity=entity, project=project, run_id=run_id) 117 | 118 | return local_checkpoint_selection(run_dir, st_key=st_key) 119 | 120 | 121 | def get_hydra_cfg(config_name: str = "default") -> omegaconf.DictConfig: 122 | """ 123 | Instantiate and return the hydra config -- streamlit and jupyter compatible 124 | 125 | Args: 126 | config_name: .yaml configuration name, without the extension 127 | 128 | Returns: 129 | The desired omegaconf.DictConfig 130 | """ 131 | GlobalHydra.instance().clear() 132 | hydra.experimental.initialize_config_dir(config_dir=str(PROJECT_ROOT / "conf")) 133 | return compose(config_name=config_name) 134 | -------------------------------------------------------------------------------- /ner4el/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | # .gitignore defaults for python and pycharm 3 | .idea 4 | 5 | # Byte-compiled / optimized / DLL files 6 | __pycache__/ 7 | *.py[cod] 8 | *$py.class 9 | 10 | # C extensions 11 | *.so 12 | 13 | # Distribution / packaging 14 | .Python 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | wheels/ 27 | share/python-wheels/ 28 | *.egg-info/ 29 | .installed.cfg 30 | *.egg 31 | MANIFEST 32 | 33 | # PyInstaller 34 | # Usually these files are written by a python script from a template 35 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 36 | *.manifest 37 | *.spec 38 | 39 | # Installer logs 40 | pip-log.txt 41 | pip-delete-this-directory.txt 42 | 43 | # Unit test / coverage reports 44 | htmlcov/ 45 | .tox/ 46 | .nox/ 47 | .coverage 48 | .coverage.* 49 | .cache 50 | nosetests.xml 51 | coverage.xml 52 | *.cover 53 | *.py,cover 54 | .hypothesis/ 55 | .pytest_cache/ 56 | cover/ 57 | 58 | # Translations 59 | *.mo 60 | *.pot 61 | 62 | # Django stuff: 63 | *.log 64 | local_settings.py 65 | db.sqlite3 66 | db.sqlite3-journal 67 | 68 | # Flask stuff: 69 | instance/ 70 | .webassets-cache 71 | 72 | # Scrapy stuff: 73 | .scrapy 74 | 75 | # Sphinx documentation 76 | docs/_build/ 77 | 78 | # PyBuilder 79 | .pybuilder/ 80 | target/ 81 | 82 | # Jupyter Notebook 83 | .ipynb_checkpoints 84 | 85 | # IPython 86 | profile_default/ 87 | ipython_config.py 88 | 89 | # pyenv 90 | # For a library or package, you might want to ignore these files since the code is 91 | # intended to run in multiple environments; otherwise, check them in: 92 | # .python-version 93 | 94 | # pipenv 95 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 96 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 97 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 98 | # install all needed dependencies. 99 | #Pipfile.lock 100 | 101 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 102 | __pypackages__/ 103 | 104 | # Celery stuff 105 | celerybeat-schedule 106 | celerybeat.pid 107 | 108 | # SageMath parsed files 109 | *.sage.py 110 | 111 | # Environments 112 | .env 113 | .venv 114 | env/ 115 | venv/ 116 | ENV/ 117 | env.bak/ 118 | venv.bak/ 119 | 120 | # Spyder project settings 121 | .spyderproject 122 | .spyproject 123 | 124 | # Rope project settings 125 | .ropeproject 126 | 127 | # mkdocs documentation 128 | /site 129 | 130 | # mypy 131 | .mypy_cache/ 132 | .dmypy.json 133 | dmypy.json 134 | 135 | # Pyre type checker 136 | .pyre/ 137 | 138 | # pytype static type analyzer 139 | .pytype/ 140 | 141 | # Cython debug symbols 142 | cython_debug/ 143 | 144 | # Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider 145 | # Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 146 | 147 | # User-specific stuff 148 | .idea/**/workspace.xml 149 | .idea/**/tasks.xml 150 | .idea/**/usage.statistics.xml 151 | .idea/**/dictionaries 152 | .idea/**/shelf 153 | 154 | # Generated files 155 | .idea/**/contentModel.xml 156 | 157 | # Sensitive or high-churn files 158 | .idea/**/dataSources/ 159 | .idea/**/dataSources.ids 160 | .idea/**/dataSources.local.xml 161 | .idea/**/sqlDataSources.xml 162 | .idea/**/dynamic.xml 163 | .idea/**/uiDesigner.xml 164 | .idea/**/dbnavigator.xml 165 | 166 | # Gradle 167 | .idea/**/gradle.xml 168 | .idea/**/libraries 169 | 170 | # Gradle and Maven with auto-import 171 | # When using Gradle or Maven with auto-import, you should exclude module files, 172 | # since they will be recreated, and may cause churn. Uncomment if using 173 | # auto-import. 174 | # .idea/artifacts 175 | # .idea/compiler.xml 176 | # .idea/jarRepositories.xml 177 | # .idea/modules.xml 178 | # .idea/*.iml 179 | # .idea/modules 180 | # *.iml 181 | # *.ipr 182 | 183 | # CMake 184 | cmake-build-*/ 185 | 186 | # Mongo Explorer plugin 187 | .idea/**/mongoSettings.xml 188 | 189 | # File-based project format 190 | *.iws 191 | 192 | # IntelliJ 193 | out/ 194 | 195 | # mpeltonen/sbt-idea plugin 196 | .idea_modules/ 197 | 198 | # JIRA plugin 199 | atlassian-ide-plugin.xml 200 | 201 | # Cursive Clojure plugin 202 | .idea/replstate.xml 203 | 204 | # Crashlytics plugin (for Android Studio and IntelliJ) 205 | com_crashlytics_export_strings.xml 206 | crashlytics.properties 207 | crashlytics-build.properties 208 | fabric.properties 209 | 210 | # Editor-based Rest Client 211 | .idea/httpRequests 212 | 213 | # Android studio 3.1+ serialized cache file 214 | .idea/caches/build_file_checksums.ser 215 | -------------------------------------------------------------------------------- /ner4el/src/common/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pathlib import Path 3 | from typing import Dict, List, Optional 4 | 5 | import dotenv 6 | import numpy as np 7 | import pytorch_lightning as pl 8 | import torch 9 | import pickle 10 | import json 11 | from tqdm import tqdm 12 | import pandas as pd 13 | import pytorch_lightning as pl 14 | from omegaconf import DictConfig, OmegaConf 15 | 16 | 17 | def read_alias_table(filename): 18 | with open(filename, 'rb') as f: 19 | alias_table = pickle.load(f) 20 | return alias_table 21 | 22 | def read_descriptions_dict(filename): 23 | descriptions = pd.read_csv(filename) 24 | descriptions_dict = descriptions.set_index("text_id").text.to_dict() 25 | return descriptions_dict 26 | 27 | def 
read_item_counts_dict(filename): 28 | id_counts_df = pd.read_csv(filename).set_index('page_id') 29 | item_counts_dict = id_counts_df.counts.to_dict() 30 | return item_counts_dict 31 | 32 | def read_dataset(filename): 33 | data = [] 34 | with open(filename, 'r', encoding='utf-8') as f: 35 | for line in tqdm(f): 36 | data.append(json.loads(line)) 37 | 38 | return data 39 | 40 | def get_title_dict(filename): 41 | '''id_counts_df = pd.read_csv(filename) 42 | title_dict = id_counts_df.set_index("page_id").title.to_dict()''' 43 | with open(filename, 'rb') as f: 44 | title_dict = pickle.load(f) 45 | return title_dict 46 | 47 | def read_id2ner_dict(filename): 48 | with open(filename, 'rb') as f: 49 | id2ner = pickle.load(f) 50 | return id2ner 51 | 52 | 53 | def get_env(env_name: str, default: Optional[str] = None) -> str: 54 | """ 55 | Safely read an environment variable. 56 | Raises errors if it is not defined or it is empty. 57 | 58 | :param env_name: the name of the environment variable 59 | :param default: the default (optional) value for the environment variable 60 | 61 | :return: the value of the environment variable 62 | """ 63 | if env_name not in os.environ: 64 | if default is None: 65 | raise KeyError(f"{env_name} not defined and no default value is present!") 66 | return default 67 | 68 | env_value: str = os.environ[env_name] 69 | if not env_value: 70 | if default is None: 71 | raise ValueError( 72 | f"{env_name} has yet to be configured and no default value is present!" 73 | ) 74 | return default 75 | 76 | return env_value 77 | 78 | 79 | def load_envs(env_file: Optional[str] = None) -> None: 80 | """ 81 | Load all the environment variables defined in the `env_file`. 82 | This is equivalent to `. env_file` in bash. 83 | 84 | It is possible to define all the system specific variables in the `env_file`. 85 | 86 | :param env_file: the file that defines the environment variables to use. If None 87 | it searches for a `.env` file in the project. 88 | """ 89 | dotenv.load_dotenv(dotenv_path=env_file, override=True) 90 | 91 | 92 | STATS_KEY: str = "stats" 93 | 94 | 95 | # Adapted from https://github.com/hobogalaxy/lightning-hydra-template/blob/6bf03035107e12568e3e576e82f83da0f91d6a11/src/utils/template_utils.py#L125 96 | def log_hyperparameters( 97 | cfg: DictConfig, 98 | model: pl.LightningModule, 99 | trainer: pl.Trainer, 100 | ) -> None: 101 | """This method controls which parameters from Hydra config are saved by Lightning loggers. 
102 | Additionally saves: 103 | - sizes of train, val, test dataset 104 | - number of trainable model parameters 105 | Args: 106 | cfg (DictConfig): the run configuration composed by Hydra 107 | model (pl.LightningModule): the model whose parameter counts are logged 108 | trainer (pl.Trainer): the trainer whose loggers receive the hyperparameters 109 | """ 110 | hparams = OmegaConf.to_container(cfg, resolve=True) 111 | 112 | # save number of model parameters 113 | hparams[f"{STATS_KEY}/params_total"] = sum(p.numel() for p in model.parameters()) 114 | hparams[f"{STATS_KEY}/params_trainable"] = sum( 115 | p.numel() for p in model.parameters() if p.requires_grad 116 | ) 117 | hparams[f"{STATS_KEY}/params_not_trainable"] = sum( 118 | p.numel() for p in model.parameters() if not p.requires_grad 119 | ) 120 | 121 | # send hparams to all loggers 122 | trainer.logger.log_hyperparams(hparams) 123 | 124 | # disable logging any more hyperparameters for all loggers 125 | # (this is just a trick to prevent trainer from logging hparams of model, since we already did that above) 126 | trainer.logger.log_hyperparams = lambda params: None 127 | 128 | 129 | # Load environment variables 130 | load_envs() 131 | 132 | # Set the cwd to the project root 133 | PROJECT_ROOT: Path = Path(get_env("PROJECT_ROOT")) 134 | assert ( 135 | PROJECT_ROOT.exists() 136 | ), "You must configure the PROJECT_ROOT environment variable in a .env file!" 137 | 138 | os.chdir(PROJECT_ROOT) 139 | -------------------------------------------------------------------------------- /ner4el/src/test.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pathlib import Path 3 | from typing import List 4 | 5 | import hydra 6 | import omegaconf 7 | import pytorch_lightning as pl 8 | from hydra.core.hydra_config import HydraConfig 9 | from omegaconf import DictConfig, OmegaConf 10 | from pytorch_lightning import seed_everything, Callback 11 | from pytorch_lightning.callbacks import ( 12 | EarlyStopping, 13 | LearningRateMonitor, 14 | ModelCheckpoint, 15 | ) 16 | from pytorch_lightning.loggers import WandbLogger 17 | 18 | from src.common.utils import log_hyperparameters, PROJECT_ROOT 19 | 20 | from pl_modules.model import MyModel 21 | 22 | 23 | def build_callbacks(cfg: DictConfig) -> List[Callback]: 24 | callbacks: List[Callback] = [] 25 | 26 | if "lr_monitor" in cfg.logging: 27 | hydra.utils.log.info(f"Adding callback <LearningRateMonitor>") 28 | callbacks.append( 29 | LearningRateMonitor( 30 | logging_interval=cfg.logging.lr_monitor.logging_interval, 31 | log_momentum=cfg.logging.lr_monitor.log_momentum, 32 | ) 33 | ) 34 | 35 | if "early_stopping" in cfg.train: 36 | hydra.utils.log.info(f"Adding callback <EarlyStopping>") 37 | callbacks.append( 38 | EarlyStopping( 39 | monitor=cfg.train.monitor_metric, 40 | mode=cfg.train.monitor_metric_mode, 41 | patience=cfg.train.early_stopping.patience, 42 | verbose=cfg.train.early_stopping.verbose, 43 | ) 44 | ) 45 | 46 | if "model_checkpoints" in cfg.train: 47 | hydra.utils.log.info(f"Adding callback <ModelCheckpoint>") 48 | callbacks.append( 49 | ModelCheckpoint( 50 | monitor=cfg.train.monitor_metric, 51 | mode=cfg.train.monitor_metric_mode, 52 | save_top_k=cfg.train.model_checkpoints.save_top_k, 53 | verbose=cfg.train.model_checkpoints.verbose, 54 | ) 55 | ) 56 | 57 | return callbacks 58 | 59 | 60 | def run(cfg: DictConfig) -> None: 61 | """ 62 | Generic test loop 63 | 64 | :param cfg: run configuration, defined by Hydra in /conf 65 | """ 66 | if cfg.train.deterministic: 67 | seed_everything(cfg.train.random_seed) 68 | 69 | if cfg.train.pl_trainer.fast_dev_run: 70 | hydra.utils.log.info( 
f"Debug mode <{cfg.train.pl_trainer.fast_dev_run=}>. " 72 | f"Forcing debugger friendly configuration!" 73 | ) 74 | # Debuggers don't like GPUs nor multiprocessing 75 | cfg.train.pl_trainer.gpus = 0 76 | cfg.data.datamodule.num_workers.train = 0 77 | cfg.data.datamodule.num_workers.val = 0 78 | cfg.data.datamodule.num_workers.test = 0 79 | 80 | # Switch wandb mode to offline to prevent online logging 81 | cfg.logging.wandb.mode = "offline" 82 | 83 | # Hydra run directory 84 | hydra_dir = Path(HydraConfig.get().run.dir) 85 | 86 | # Instantiate datamodule 87 | hydra.utils.log.info(f"Instantiating <{cfg.data.datamodule._target_}>") 88 | datamodule: pl.LightningDataModule = hydra.utils.instantiate( 89 | cfg.data.datamodule, _recursive_=False 90 | ) 91 | 92 | model_ckpt = input("> Insert the absolute path of your model checkpoint: ") 93 | model = MyModel.load_from_checkpoint(checkpoint_path=str(model_ckpt)) 94 | 95 | # Instantiate the callbacks 96 | callbacks: List[Callback] = build_callbacks(cfg=cfg) 97 | 98 | # Logger instantiation/configuration 99 | wandb_logger = None 100 | if "wandb" in cfg.logging: 101 | hydra.utils.log.info(f"Instantiating ") 102 | wandb_config = cfg.logging.wandb 103 | wandb_logger = WandbLogger( 104 | **wandb_config, 105 | tags=cfg.core.tags 106 | ) 107 | hydra.utils.log.info(f"W&B is now watching <{cfg.logging.wandb_watch.log}>!") 108 | wandb_logger.watch( 109 | model, 110 | log=cfg.logging.wandb_watch.log, 111 | log_freq=cfg.logging.wandb_watch.log_freq, 112 | ) 113 | 114 | # Store the YaML config separately into the wandb dir 115 | yaml_conf: str = OmegaConf.to_yaml(cfg=cfg) 116 | (Path(wandb_logger.experiment.dir) / "hparams.yaml").write_text(yaml_conf) 117 | 118 | # The Lightning core, the Trainer 119 | trainer = pl.Trainer( 120 | default_root_dir=hydra_dir, 121 | logger=wandb_logger, 122 | callbacks=callbacks, 123 | deterministic=cfg.train.deterministic, 124 | val_check_interval=cfg.logging.val_check_interval, 125 | progress_bar_refresh_rate=cfg.logging.progress_bar_refresh_rate, 126 | **cfg.train.pl_trainer, #** -> se hai un dizionario passa sia chiave che valore prendendoli dal config 127 | ) 128 | 129 | hydra.utils.log.info(f"Starting testing!") 130 | trainer.test(model=model, datamodule=datamodule) 131 | 132 | # Logger closing to release resources/avoid multi-run conflicts 133 | if wandb_logger is not None: 134 | wandb_logger.experiment.finish() 135 | 136 | 137 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default") 138 | def main(cfg: omegaconf.DictConfig): 139 | run(cfg) 140 | 141 | 142 | if __name__ == "__main__": 143 | main() 144 | -------------------------------------------------------------------------------- /ner4el/src/run.py: -------------------------------------------------------------------------------- 1 | from logging import log 2 | import os 3 | from pathlib import Path 4 | from typing import List 5 | 6 | import hydra 7 | import omegaconf 8 | import pytorch_lightning as pl 9 | from hydra.core.hydra_config import HydraConfig 10 | from omegaconf import DictConfig, OmegaConf 11 | from pytorch_lightning import seed_everything, Callback 12 | from pytorch_lightning.callbacks import ( 13 | EarlyStopping, 14 | LearningRateMonitor, 15 | ModelCheckpoint, 16 | ) 17 | from pytorch_lightning.loggers import WandbLogger 18 | 19 | from src.common.utils import log_hyperparameters, PROJECT_ROOT 20 | 21 | 22 | def build_callbacks(cfg: DictConfig) -> List[Callback]: 23 | callbacks: List[Callback] = [] 24 | 25 | if "lr_monitor" in cfg.logging: 
26 | hydra.utils.log.info(f"Adding callback <LearningRateMonitor>") 27 | callbacks.append( 28 | LearningRateMonitor( 29 | logging_interval=cfg.logging.lr_monitor.logging_interval, 30 | log_momentum=cfg.logging.lr_monitor.log_momentum, 31 | ) 32 | ) 33 | 34 | if "early_stopping" in cfg.train: 35 | hydra.utils.log.info(f"Adding callback <EarlyStopping>") 36 | callbacks.append( 37 | EarlyStopping( 38 | monitor=cfg.train.monitor_metric, 39 | mode=cfg.train.monitor_metric_mode, 40 | patience=cfg.train.early_stopping.patience, 41 | verbose=cfg.train.early_stopping.verbose, 42 | ) 43 | ) 44 | 45 | if "model_checkpoints" in cfg.train: 46 | hydra.utils.log.info(f"Adding callback <ModelCheckpoint>") 47 | callbacks.append( 48 | ModelCheckpoint( 49 | monitor=cfg.train.monitor_metric, 50 | mode=cfg.train.monitor_metric_mode, 51 | save_top_k=cfg.train.model_checkpoints.save_top_k, 52 | verbose=cfg.train.model_checkpoints.verbose, 53 | ) 54 | ) 55 | 56 | return callbacks 57 | 58 | 59 | def run(cfg: DictConfig) -> None: 60 | """ 61 | Generic train loop 62 | 63 | :param cfg: run configuration, defined by Hydra in /conf 64 | """ 65 | if cfg.train.deterministic: 66 | seed_everything(cfg.train.random_seed) 67 | 68 | if cfg.train.pl_trainer.fast_dev_run: 69 | hydra.utils.log.info( 70 | f"Debug mode <{cfg.train.pl_trainer.fast_dev_run=}>. " 71 | f"Forcing debugger friendly configuration!" 72 | ) 73 | # Debuggers don't like GPUs nor multiprocessing 74 | cfg.train.pl_trainer.gpus = 0 75 | cfg.data.datamodule.num_workers.train = 0 76 | cfg.data.datamodule.num_workers.val = 0 77 | cfg.data.datamodule.num_workers.test = 0 78 | 79 | # Switch wandb mode to offline to prevent online logging 80 | cfg.logging.wandb.mode = "offline" 81 | 82 | # Hydra run directory 83 | hydra_dir = Path(HydraConfig.get().run.dir) 84 | 85 | # Instantiate datamodule 86 | hydra.utils.log.info(f"Instantiating <{cfg.data.datamodule._target_}>") 87 | datamodule: pl.LightningDataModule = hydra.utils.instantiate( 88 | cfg.data.datamodule, _recursive_=False 89 | ) 90 | 91 | # Instantiate model 92 | hydra.utils.log.info(f"Instantiating <{cfg.model._target_}>") 93 | model: pl.LightningModule = hydra.utils.instantiate( 94 | cfg.model, 95 | optim=cfg.optim, 96 | data=cfg.data, 97 | logging=cfg.logging, 98 | _recursive_=False, 99 | ) 100 | 101 | # Instantiate the callbacks 102 | callbacks: List[Callback] = build_callbacks(cfg=cfg) 103 | 104 | # Logger instantiation/configuration 105 | wandb_logger = None 106 | if "wandb" in cfg.logging: 107 | hydra.utils.log.info(f"Instantiating <WandbLogger>") 108 | wandb_config = cfg.logging.wandb 109 | wandb_logger = WandbLogger( 110 | **wandb_config, 111 | tags=cfg.core.tags 112 | ) 113 | hydra.utils.log.info(f"W&B is now watching <{cfg.logging.wandb_watch.log}>!") 114 | wandb_logger.watch( 115 | model, 116 | log=cfg.logging.wandb_watch.log, 117 | log_freq=cfg.logging.wandb_watch.log_freq, 118 | ) 119 | 120 | # Store the YaML config separately into the wandb dir 121 | yaml_conf: str = OmegaConf.to_yaml(cfg=cfg) 122 | (Path(wandb_logger.experiment.dir) / "hparams.yaml").write_text(yaml_conf) 123 | 124 | hydra.utils.log.info(f"Instantiating the Trainer") 125 | 126 | # The Lightning core, the Trainer 127 | trainer = pl.Trainer( 128 | default_root_dir=hydra_dir, 129 | logger=wandb_logger, 130 | callbacks=callbacks, 131 | deterministic=cfg.train.deterministic, 132 | val_check_interval=cfg.logging.val_check_interval, 133 | progress_bar_refresh_rate=cfg.logging.progress_bar_refresh_rate, 134 | **cfg.train.pl_trainer, #** -> unpacks the dictionary, passing both keys and values taken from the 
config 135 | ) 136 | log_hyperparameters(trainer=trainer, model=model, cfg=cfg) 137 | 138 | hydra.utils.log.info(f"Starting training!") 139 | trainer.fit(model=model, datamodule=datamodule) 140 | 141 | hydra.utils.log.info(f"Starting testing!") 142 | trainer.test(ckpt_path="best", datamodule=datamodule) 143 | 144 | # Logger closing to release resources/avoid multi-run conflicts 145 | if wandb_logger is not None: 146 | wandb_logger.experiment.finish() 147 | 148 | 149 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default") 150 | def main(cfg: omegaconf.DictConfig): 151 | run(cfg) 152 | 153 | 154 | if __name__ == "__main__": 155 | main() 156 | -------------------------------------------------------------------------------- /ner4el/src/pl_modules/ner_model.py: -------------------------------------------------------------------------------- 1 | from typing import Any, Dict, Sequence, Tuple, Union, List 2 | 3 | import hydra 4 | import omegaconf 5 | import pytorch_lightning as pl 6 | import torch 7 | from torch import nn 8 | from omegaconf import DictConfig 9 | from torch._C import device 10 | from torch.optim import Optimizer 11 | import os 12 | import math 13 | from sklearn.metrics import f1_score 14 | 15 | 16 | from src.common.utils import PROJECT_ROOT 17 | from src.common.utils import * 18 | 19 | from transformers import BertTokenizer, BertModel, BertConfig 20 | 21 | model_name_ner = 'bert-base-uncased' 22 | 23 | 24 | class MyNERModel(pl.LightningModule): 25 | def __init__(self, *args, **kwargs) -> None: 26 | super().__init__() 27 | #self.save_hyperparameters() # populate self.hparams with args and kwargs automagically! 28 | global bert_tokenizer_ner 29 | 30 | id2ner_dict_path = "data/id2ner_dict.pickle" 31 | id2ner_dict_path = str(PROJECT_ROOT / id2ner_dict_path) 32 | id2ner = read_id2ner_dict(id2ner_dict_path) 33 | 34 | labels_vocab = {} 35 | i = 0 36 | for v in id2ner.values(): 37 | if v not in labels_vocab: 38 | labels_vocab[v] = i 39 | i+=1 40 | 41 | 42 | bert_config_ner = BertConfig.from_pretrained(model_name_ner, output_hidden_states=True) 43 | bert_tokenizer_ner = BertTokenizer.from_pretrained(model_name_ner) 44 | special_tokens_dict = {'additional_special_tokens': ['[E]','[\E]']} 45 | num_added_toks = bert_tokenizer_ner.add_special_tokens(special_tokens_dict) 46 | bert_model_ner = BertModel.from_pretrained(model_name_ner, config=bert_config_ner) 47 | bert_model_ner.resize_token_embeddings(len(bert_tokenizer_ner)) 48 | 49 | 50 | self.mention_encoder = bert_model_ner 51 | 52 | self.cosine_similarity = nn.CosineSimilarity(dim=-1, eps=1e-6) 53 | 54 | self.dropout = nn.Dropout(0.5) 55 | 56 | self.linear = nn.Linear(768, len(labels_vocab)) 57 | 58 | def forward( 59 | self, mentions, positions, mask1, **kwargs 60 | ) -> Dict[str, torch.Tensor]: 61 | """ 62 | Method for the forward pass. 63 | 'training_step', 'validation_step' and 'test_step' should call 64 | this method in order to compute the output predictions and the loss. 65 | Returns: 66 | output_dict: forward output containing the predictions (output logits, etc.) and the loss if any. 
67 | """ 68 | 69 | embedding_mention = self.mention_encoder.forward(mentions, mask1)[0] #16x64x768 70 | embedding_mention2 = embedding_mention.gather(1, positions.reshape(-1, 1, 1).repeat(1, 1, 768)).squeeze(1) 71 | embedding_mention2 = self.dropout(embedding_mention2) #16x768 72 | 73 | predictions = self.linear(embedding_mention2) 74 | 75 | return predictions 76 | 77 | def step(self, batch: Any, batch_idx: int, dataset_type:str): 78 | softmax_function = nn.Softmax(dim=1) 79 | 80 | mentions, positions, candidates, descriptions, labels = batch 81 | positions = torch.tensor(positions, device=self.device) 82 | 83 | mask1 = self.padding_mask(mentions) 84 | predictions = self.forward(mentions, positions, mask1) 85 | predictions = softmax_function(predictions) 86 | 87 | return { 88 | "pred": predictions, 89 | } 90 | 91 | 92 | 93 | def training_step(self, batch: Any, batch_idx: int) -> torch.Tensor: 94 | step_output = self.step(batch, batch_idx, "train") 95 | return step_output 96 | 97 | def validation_step(self, batch: Any, batch_idx: int) -> torch.Tensor: 98 | step_output = self.step(batch, batch_idx, "dev") 99 | return step_output 100 | 101 | def test_step(self, batch: Any, batch_idx: int) -> torch.Tensor: 102 | step_output = self.step(batch, batch_idx, "test") 103 | return step_output 104 | 105 | 106 | 107 | def configure_optimizers( 108 | self, 109 | ) -> Union[Optimizer, Tuple[Sequence[Optimizer], Sequence[Any]]]: 110 | """ 111 | Choose what optimizers and learning-rate schedulers to use in your optimization. 112 | Normally you'd need one. But in the case of GANs or similar you might have multiple. 113 | Return: 114 | Any of these 6 options. 115 | - Single optimizer. 116 | - List or Tuple - List of optimizers. 117 | - Two lists - The first list has multiple optimizers, the second a list of LR schedulers (or lr_dict). 118 | - Dictionary, with an 'optimizer' key, and (optionally) a 'lr_scheduler' 119 | key whose value is a single LR scheduler or lr_dict. 120 | - Tuple of dictionaries as described, with an optional 'frequency' key. 121 | - None - Fit will run without any optimizer. 
122 | """ 123 | opt = hydra.utils.instantiate( 124 | self.hparams.optim.optimizer, params=self.parameters(), _convert_="partial" 125 | ) 126 | if not self.hparams.optim.use_lr_scheduler: 127 | return [opt] 128 | scheduler = hydra.utils.instantiate( 129 | self.hparams.optim.lr_scheduler, optimizer=opt 130 | ) 131 | return [opt], [scheduler] 132 | 133 | def padding_mask(self, batch): 134 | padding = torch.ones_like(batch) 135 | padding[batch == 0] = 0 136 | padding = padding.type(torch.int64) 137 | return padding 138 | 139 | def normalize(self, m): 140 | row_min, _ = m.min(dim=1, keepdim=True) 141 | row_max, _ = m.max(dim=1, keepdim=True) 142 | return (m - row_min) / (row_max - row_min) 143 | 144 | 145 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default") 146 | def main(cfg: omegaconf.DictConfig): 147 | model: pl.LightningModule = hydra.utils.instantiate( 148 | cfg.model, 149 | optim=cfg.optim, 150 | data=cfg.data, 151 | logging=cfg.logging, 152 | _recursive_=False, 153 | ) 154 | 155 | 156 | if __name__ == "__main__": 157 | main() 158 | -------------------------------------------------------------------------------- /ner4el/src/pl_data/datamodule.py: -------------------------------------------------------------------------------- 1 | import random 2 | from typing import Optional, Sequence 3 | 4 | import hydra 5 | from hydra import utils 6 | import numpy as np 7 | import omegaconf 8 | import pytorch_lightning as pl 9 | from pytorch_lightning.core import datamodule 10 | import torch 11 | from pprint import pprint 12 | from omegaconf import DictConfig 13 | from torch.utils.data import DataLoader, Dataset 14 | from torch.nn.utils.rnn import pad_sequence 15 | from src.pl_data.dataset import MyDataset 16 | 17 | from src.common.utils import PROJECT_ROOT 18 | 19 | from src.common.utils import * 20 | from transformers import BertTokenizer 21 | from transformers import XLMRobertaTokenizer 22 | 23 | 24 | def worker_init_fn(id: int): 25 | """ 26 | DataLoaders workers init function. 27 | 28 | Initialize the numpy.random seed correctly for each worker, so that 29 | random augmentations between workers and/or epochs are not identical. 30 | 31 | If a global seed is set, the augmentations are deterministic. 32 | 33 | https://pytorch.org/docs/stable/notes/randomness.html#dataloader 34 | """ 35 | uint64_seed = torch.initial_seed() 36 | ss = np.random.SeedSequence([uint64_seed]) 37 | # More than 128 bits (4 32-bit words) would be overkill. 
38 | np.random.seed(ss.generate_state(4)) 39 | random.seed(uint64_seed) 40 | 41 | 42 | class MyDataModule(pl.LightningDataModule): 43 | def __init__( 44 | self, 45 | datasets: DictConfig, 46 | num_workers: DictConfig, 47 | batch_size: DictConfig, 48 | transformer_name: str, 49 | alias_table_path: str, 50 | descriptions_dict_path: str, 51 | item_counts_dict_path: str, 52 | title_dict_path: str, 53 | id2ner_dict_path: str, 54 | version: str, 55 | negative_samples: bool, 56 | ner_negative_samples: bool, 57 | ner_representation: bool, 58 | ner_filter_candidates: bool, 59 | processed: bool, 60 | ): 61 | super().__init__() 62 | self.datasets = datasets 63 | self.num_workers = num_workers 64 | self.batch_size = batch_size 65 | self.transformer_name = transformer_name 66 | self.alias_table_path = str(PROJECT_ROOT / alias_table_path) 67 | self.descriptions_dict_path = str(PROJECT_ROOT / descriptions_dict_path) 68 | self.item_counts_dict_path = str(PROJECT_ROOT / item_counts_dict_path) 69 | self.title_dict_path = str(PROJECT_ROOT / title_dict_path) 70 | self.id2ner_dict_path = str(PROJECT_ROOT / id2ner_dict_path) 71 | self.version = version 72 | self.negative_samples = negative_samples 73 | self.ner_negative_samples = ner_negative_samples 74 | self.ner_representation = ner_representation 75 | self.ner_filter_candidates = ner_filter_candidates 76 | self.processed = processed 77 | 78 | self.train_dataset: Optional[Dataset] = None 79 | self.val_dataset: Optional[Dataset] = None 80 | self.test_dataset: Optional[Dataset] = None 81 | 82 | def prepare_data(self) -> None: 83 | # download only 84 | pass 85 | 86 | def setup(self, stage: Optional[str] = None): 87 | # Here you should instantiate your datasets, you may also split the train into train and validation if needed. 
88 | 89 | train_data = read_dataset(str(PROJECT_ROOT / self.datasets.train.path)) 90 | dev_data = read_dataset(str(PROJECT_ROOT / self.datasets.val.path)) 91 | test_data = read_dataset(str(PROJECT_ROOT / self.datasets.test.path)) 92 | print("Datasets loaded.") 93 | 94 | if "bert-" in self.transformer_name: 95 | self.tokenizer = BertTokenizer.from_pretrained(self.transformer_name) 96 | special_tokens_dict = {'additional_special_tokens': ['[E]','[\E]']} 97 | self.tokenizer.add_special_tokens(special_tokens_dict) 98 | elif "xlm" in self.transformer_name: 99 | self.tokenizer = XLMRobertaTokenizer.from_pretrained(self.transformer_name) 100 | special_tokens_dict = {'additional_special_tokens': ['[E]','[\E]']} 101 | self.tokenizer.add_special_tokens(special_tokens_dict) 102 | 103 | self.alias_table = read_alias_table(self.alias_table_path) 104 | print("Alias table loaded.") 105 | 106 | self.descriptions_dict = read_descriptions_dict(self.descriptions_dict_path) 107 | print("Descriptions dict loaded.") 108 | 109 | self.item_counts_dict = read_item_counts_dict(self.item_counts_dict_path) 110 | print("Item counts dict loaded.") 111 | 112 | self.title_dict = get_title_dict(self.title_dict_path) 113 | self.title_dict_reverse = dict((v, k) for (k, v) in self.title_dict.items()) 114 | print("Title dict loaded.") 115 | 116 | self.id2ner = read_id2ner_dict(self.id2ner_dict_path) 117 | print("NER dict loaded.") 118 | 119 | 120 | if stage is None or stage == "fit": 121 | self.train_dataset = hydra.utils.instantiate( 122 | self.datasets.train, 123 | data=train_data, 124 | datamodule = self 125 | ) 126 | 127 | self.val_dataset = hydra.utils.instantiate( 128 | self.datasets.val, 129 | data=dev_data, 130 | datamodule = self 131 | ) 132 | 133 | if stage is None or stage == "test": 134 | self.test_dataset = hydra.utils.instantiate( 135 | self.datasets.test, 136 | data=test_data, 137 | datamodule = self 138 | ) 139 | 140 | def train_dataloader(self) -> DataLoader: 141 | return DataLoader( 142 | self.train_dataset, 143 | shuffle=True, 144 | batch_size=self.batch_size.train, 145 | num_workers=self.num_workers.train, 146 | worker_init_fn=worker_init_fn, 147 | pin_memory=True, 148 | collate_fn=self.collate 149 | ) 150 | 151 | def val_dataloader(self) -> DataLoader: 152 | return DataLoader( 153 | self.val_dataset, 154 | shuffle=False, 155 | batch_size=self.batch_size.val, 156 | num_workers=self.num_workers.val, 157 | worker_init_fn=worker_init_fn, 158 | pin_memory=True, 159 | collate_fn=self.collate 160 | ) 161 | 162 | def test_dataloader(self) -> Sequence[DataLoader]: 163 | return DataLoader( 164 | self.test_dataset, 165 | shuffle=False, 166 | batch_size=self.batch_size.test, 167 | num_workers=self.num_workers.test, 168 | worker_init_fn=worker_init_fn, 169 | pin_memory=True, 170 | collate_fn=self.collate 171 | ) 172 | 173 | def collate(self, elems: List[tuple]) -> List[tuple]: 174 | mentions, positions, candidates, descriptions, labels = list(zip(*elems)) 175 | 176 | pad_mentions = pad_sequence(mentions, batch_first=True, padding_value=0) 177 | pad_candidates = pad_sequence(candidates, batch_first=True, padding_value=0) 178 | pad_descriptions = pad_sequence(descriptions, batch_first=True, padding_value=0) 179 | 180 | return pad_mentions, positions, pad_candidates, pad_descriptions, labels 181 | 182 | def __repr__(self) -> str: 183 | return ( 184 | f"{self.__class__.__name__}(" 185 | f"{self.datasets=}, " 186 | f"{self.num_workers=}, " 187 | f"{self.batch_size=})" 188 | ) 189 | 190 | 191 | 
@hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default") 192 | def main(cfg: omegaconf.DictConfig): 193 | datamodule: pl.LightningDataModule = hydra.utils.instantiate( 194 | cfg.data.datamodule, _recursive_=False 195 | ) 196 | datamodule.setup() 197 | 198 | 199 | if __name__ == "__main__": 200 | main() 201 | -------------------------------------------------------------------------------- /ner4el/src/pl_data/dataset.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, Tuple, Union 2 | 3 | import hydra 4 | import omegaconf 5 | import pytorch_lightning as pl 6 | from pytorch_lightning.core import datamodule 7 | import torch 8 | from torch import nn 9 | from omegaconf import ValueNode 10 | from torch.utils.data import Dataset 11 | from tqdm import tqdm 12 | import pickle 13 | import random 14 | 15 | from src.common.utils import PROJECT_ROOT 16 | from pl_modules.ner_model import MyNERModel 17 | 18 | 19 | 20 | class MyDataset(Dataset): 21 | def __init__(self, name: ValueNode, 22 | path: ValueNode, 23 | data: list, 24 | num_candidates: int, 25 | window: int, 26 | datamodule, 27 | dataset_type, 28 | **kwargs): 29 | 30 | from src.pl_data.datamodule import MyDataModule 31 | datamodule:MyDataModule 32 | 33 | super().__init__() 34 | self.path = path 35 | self.name = name 36 | self.data = data 37 | self.num_candidates = num_candidates 38 | self.window = window 39 | self.transformer_name = datamodule.transformer_name 40 | self.tokenizer = datamodule.tokenizer 41 | self.alias_table = datamodule.alias_table 42 | self.descriptions_dict = datamodule.descriptions_dict 43 | self.count_dict = datamodule.item_counts_dict 44 | self.id2ner = datamodule.id2ner 45 | self.dataset_type = dataset_type 46 | self.title_dict = datamodule.title_dict 47 | self.title_dict_reverse = datamodule.title_dict_reverse 48 | self.negative_samples = datamodule.negative_samples 49 | self.ner_negative_samples = datamodule.ner_negative_samples 50 | self.ner_representation = datamodule.ner_representation 51 | self.ner_filter_candidates = datamodule.ner_filter_candidates 52 | self.processed = datamodule.processed 53 | 54 | self.encoded_data = [] 55 | self.__encode_data() 56 | print(f"LEN: {len(self.encoded_data)}") 57 | 58 | def __encode_data(self): 59 | 60 | if self.ner_filter_candidates: 61 | ner_classifier = MyNERModel() 62 | ner_classifier.load_state_dict(torch.load(str(PROJECT_ROOT / "wandb/ner_classifier.pt"))) 63 | 64 | softmax_function = nn.Softmax(dim=1) 65 | 66 | labels_vocab = {} 67 | i = 0 68 | for v in self.id2ner.values(): 69 | if v not in labels_vocab: 70 | labels_vocab[v] = i 71 | i+=1 72 | 73 | 74 | if self.processed == True and self.dataset_type=="train": 75 | with open(str(PROJECT_ROOT / f"preprocessed_datasets/aida_kilt_train_{self.num_candidates}_{self.window}_{self.transformer_name}_negativesamples={self.negative_samples}_nernegativesamples={self.ner_negative_samples}_nerrepresentation={self.ner_representation}.pickle"), 'rb') as f: 76 | self.encoded_data = pickle.load(f) 77 | 78 | else: 79 | total_entities = 0 80 | target_between_candidates = 0 81 | 82 | #for negative samples 83 | if self.dataset_type == "train": 84 | count_dict_with_descriptions = {} 85 | dict_candidates_for_NER_type = {} 86 | 87 | for key in tqdm(self.count_dict): 88 | if key in self.descriptions_dict: 89 | count_dict_with_descriptions[key] = 1 90 | 91 | if self.ner_negative_samples == True: 92 | if str(key) in self.id2ner: 93 | ner_tag = self.id2ner[str(key)] 94 | if 
ner_tag not in dict_candidates_for_NER_type: 95 | dict_candidates_for_NER_type[ner_tag] = [key] 96 | else: 97 | dict_candidates_for_NER_type[ner_tag].append(key) 98 | 99 | 100 | for entry in tqdm(self.data): 101 | m = entry["mention"].lower() 102 | left_context = entry["left_context"] 103 | right_context = entry["right_context"] 104 | ent = self.title_dict_reverse[entry["output"]] if entry["output"] in self.title_dict_reverse else "" 105 | title = entry["output"] 106 | 107 | if self.ner_negative_samples == True: 108 | if str(ent) in self.id2ner: #to see the upperbound 109 | target_ner_tag = self.id2ner[str(ent)] 110 | else: 111 | target_ner_tag = "" 112 | 113 | tokenized_left_context = self.tokenize_mention("[CLS]" + left_context, self.tokenizer, self.window//2, False) 114 | mention_position = len(tokenized_left_context) + 1 115 | tokenized_m = self.tokenize_mention("[E]" + m + "[\E]", self.tokenizer, self.window//2, False) 116 | tokenized_right_context = self.tokenize_mention(right_context + "[SEP]", self.tokenizer, self.window//2, False) 117 | 118 | tokenized_mention = tokenized_left_context + tokenized_m + tokenized_right_context 119 | for _ in range(self.window-len(tokenized_mention)): 120 | tokenized_mention.append(0) 121 | tokenized_mention = torch.tensor(tokenized_mention) 122 | 123 | 124 | if self.ner_filter_candidates: 125 | pad_mentions = tokenized_mention.unsqueeze(0) 126 | mask = self.padding_mask(pad_mentions) 127 | 128 | position = torch.tensor(mention_position) 129 | 130 | predictions = ner_classifier(pad_mentions, position, mask) 131 | predictions = softmax_function(predictions)[0] 132 | top_k_ner = torch.topk(predictions, 3)[1] 133 | label_ner = top_k_ner[0].item() 134 | confidence = predictions[label_ner] 135 | 136 | 137 | if(m in self.alias_table) and ent!="": 138 | total_entities+=1 139 | all_candidates = self.alias_table[m] 140 | 141 | all_candidates_with_description = [] 142 | for c in all_candidates: 143 | if c in self.descriptions_dict: 144 | all_candidates_with_description.append(c) 145 | 146 | all_candidates_counts = torch.tensor([self.count_dict.get(idx, 0) for idx in all_candidates_with_description]) 147 | 148 | if len(all_candidates_with_description)>self.num_candidates: 149 | top_k = torch.topk(all_candidates_counts, self.num_candidates)[1].tolist() 150 | top_k_candidates = [] 151 | for idx in top_k: 152 | top_k_candidates.append(all_candidates_with_description[idx]) 153 | else: 154 | top_k_candidates = all_candidates_with_description 155 | 156 | if self.negative_samples == True and self.dataset_type == "train": #standard negative samples 157 | tmp_count_dict_with_descriptions = count_dict_with_descriptions.copy() 158 | for candidate in top_k_candidates: 159 | del tmp_count_dict_with_descriptions[candidate] 160 | negative_samples = random.sample(tmp_count_dict_with_descriptions.keys(), self.num_candidates-len(top_k_candidates)) 161 | top_k_candidates.extend(negative_samples) 162 | 163 | elif self.ner_negative_samples == True and self.dataset_type == "train": #NER-enhanced negative samples 164 | if target_ner_tag!="": 165 | candidates_of_same_NER_type = dict_candidates_for_NER_type[target_ner_tag] 166 | for candidate in top_k_candidates: 167 | if candidate in candidates_of_same_NER_type: 168 | candidates_of_same_NER_type.remove(candidate) 169 | negative_samples = random.sample(candidates_of_same_NER_type, self.num_candidates-len(top_k_candidates)) 170 | top_k_candidates.extend(negative_samples) 171 | 172 | elif self.ner_filter_candidates == True and 
self.dataset_type == "train": 173 | top_k_candidates_filtered = [] 174 | for c in top_k_candidates: 175 | if str(c) in self.id2ner and confidence>0.5: 176 | if labels_vocab[self.id2ner[str(c)]] == label_ner: 177 | top_k_candidates_filtered.append(c) 178 | elif labels_vocab[self.id2ner[str(c)]] in top_k_ner: 179 | top_k_candidates_filtered.append(c) 180 | else: 181 | top_k_candidates_filtered.append(c) 182 | 183 | 184 | if int(ent) in top_k_candidates: 185 | target_between_candidates+=1 186 | 187 | tokenized_descriptions = [] 188 | for c in top_k_candidates: 189 | if self.ner_representation == True: 190 | if str(c) in self.id2ner: 191 | ner_tag = self.id2ner[str(c)] 192 | else: 193 | ner_tag = "" 194 | d = ner_tag + "[SEP]" + self.descriptions_dict[c] 195 | else: 196 | d = self.descriptions_dict[c] 197 | 198 | tokenized_descriptions.append(self.tokenize_description(d, self.tokenizer, self.window)) 199 | 200 | 201 | if self.dataset_type == "train": 202 | if len(top_k_candidates)>0 and int(ent) in top_k_candidates: 203 | self.encoded_data.append((tokenized_mention, 204 | mention_position, 205 | torch.tensor(top_k_candidates), 206 | torch.tensor(tokenized_descriptions), 207 | int(ent))) 208 | 209 | else: 210 | if len(top_k_candidates)>0: 211 | self.encoded_data.append((tokenized_mention, 212 | mention_position, 213 | torch.tensor(top_k_candidates), 214 | torch.tensor(tokenized_descriptions), 215 | int(ent))) 216 | 217 | 218 | if total_entities>0: 219 | print(f"Percentage of target entities within the candidate set: {target_between_candidates/total_entities}") 220 | 221 | #print(self.encoded_data) 222 | 223 | 224 | if self.dataset_type == "train": 225 | with open(str(PROJECT_ROOT / f"preprocessed_datasets/aida_kilt_train_{self.num_candidates}_{self.window}_{self.transformer_name}_negativesamples={self.negative_samples}_nernegativesamples={self.ner_negative_samples}_nerrepresentation={self.ner_representation}.pickle"), 'wb') as f: 226 | pickle.dump(self.encoded_data, f) 227 | 228 | 229 | return self.encoded_data 230 | 231 | 232 | def tokenize_mention(self, sent, tokenizer, window, special_tokens): 233 | encoded_sentence = tokenizer.encode(sent, add_special_tokens = special_tokens) 234 | if len(encoded_sentence)>=window: 235 | return encoded_sentence[:window] 236 | else: 237 | return encoded_sentence 238 | 239 | def tokenize_description(self, sent, tokenizer, window): 240 | encoded_sentence = tokenizer.encode(sent, add_special_tokens = True) 241 | if len(encoded_sentence)>=window: 242 | return encoded_sentence[:window] 243 | else: 244 | return encoded_sentence + [0]*(window-len(encoded_sentence)) 245 | 246 | def padding_mask(self, batch): 247 | padding = torch.ones_like(batch) 248 | padding[batch == 0] = 0 249 | padding = padding.type(torch.int64) 250 | return padding 251 | 252 | 253 | def __len__(self) -> int: 254 | return len(self.encoded_data) 255 | 256 | def __getitem__( 257 | self, index 258 | ) -> Union[Dict[str, torch.Tensor], Tuple[torch.Tensor, torch.Tensor]]: 259 | return self.encoded_data[index] 260 | 261 | def __repr__(self) -> str: 262 | return f"MyDataset({self.name=}, {self.path=})" 263 | 264 | 265 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default") 266 | def main(cfg: omegaconf.DictConfig): 267 | dataset: MyDataset = hydra.utils.instantiate( 268 | cfg.data.datamodule.datasets.train, _recursive_=False 269 | ) 270 | 271 | 272 | if __name__ == "__main__": 273 | main() 274 | -------------------------------------------------------------------------------- 
/README.md: -------------------------------------------------------------------------------- 1 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-ace2004)](https://paperswithcode.com/sota/entity-disambiguation-on-ace2004?p=named-entity-recognition-for-entity-linking) 2 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-aquaint)](https://paperswithcode.com/sota/entity-disambiguation-on-aquaint?p=named-entity-recognition-for-entity-linking) 3 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-msnbc)](https://paperswithcode.com/sota/entity-disambiguation-on-msnbc?p=named-entity-recognition-for-entity-linking) 4 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-wned-cweb)](https://paperswithcode.com/sota/entity-disambiguation-on-wned-cweb?p=named-entity-recognition-for-entity-linking) 5 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-wned-wiki)](https://paperswithcode.com/sota/entity-disambiguation-on-wned-wiki?p=named-entity-recognition-for-entity-linking) 6 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/named-entity-recognition-for-entity-linking/entity-disambiguation-on-aida-conll)](https://paperswithcode.com/sota/entity-disambiguation-on-aida-conll?p=named-entity-recognition-for-entity-linking) 7 | 8 | ![logo](./img/logo_ner4el.png) 9 | -------------------------------------------------------------------------------- 10 | 11 | Code and resources for the paper [Named Entity Recognition for Entity Linking: What Works and What's Next](https://aclanthology.org/2021.findings-emnlp.220/). 12 | 13 | This repository is mainly built upon [Pytorch](https://pytorch.org/) and [Pytorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/). 14 | 15 | ## Reference 16 | **Please cite our work if you use resources and/or code from this repository.** 17 | #### Plaintext 18 | Simone Tedeschi, Simone Conia, Francesco Cecconi and Roberto Navigli, 2021. **Named Entity Recognition for Entity Linking: What Works and What's Next**. *In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP 2021)*. Punta Cana, Dominican Republic. Association for Computational Linguistics. 19 | 20 | #### Bibtex 21 | ```bibtex 22 | @inproceedings{tedeschi-etal-2021-named-entity, 23 | title = "{N}amed {E}ntity {R}ecognition for {E}ntity {L}inking: {W}hat Works and What{'}s Next", 24 | author = "Tedeschi, Simone and 25 | Conia, Simone and 26 | Cecconi, Francesco and 27 | Navigli, Roberto", 28 | booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021", 29 | month = nov, 30 | year = "2021", 31 | address = "Punta Cana, Dominican Republic", 32 | publisher = "Association for Computational Linguistics", 33 | url = "https://aclanthology.org/2021.findings-emnlp.220", 34 | pages = "2584--2596", 35 | abstract = "Entity Linking (EL) systems have achieved impressive results on standard benchmarks mainly thanks to the contextualized representations provided by recent pretrained language models. 
However, such systems still require massive amounts of data {--} millions of labeled examples {--} to perform at their best, with training times that often exceed several days, especially when limited computational resources are available. In this paper, we look at how Named Entity Recognition (NER) can be exploited to narrow the gap between EL systems trained on high and low amounts of labeled data. More specifically, we show how and to what extent an EL system can benefit from NER to enhance its entity representations, improve candidate selection, select more effective negative samples and enforce hard and soft constraints on its output entities. We release our software {--} code and model checkpoints {--} at https://github.com/Babelscape/ner4el.", 36 | } 37 | ``` 38 | 39 | # Named Entity Recognition for Entity Linking: An Introduction 40 | In this work, we focus on **Entity Linking (EL)**, a key task in NLP which aims at associating an ambiguous textual mention with a named entity in a knowledge base. It is a very **knowledge-intensive task**, and current EL approaches require massive amounts of training data – often millions of labeled items – in order to perform at their best, making the development of a high-performance EL system viable only for a **limited audience**. Hence, we study whether it is possible to **narrow the performance gap** between systems trained on limited (i.e., less than 20K labeled samples) and large amounts of data (i.e., millions of training samples). In particular, we take a look at **Named Entity Recognition (NER)** – the task of identifying specific words as belonging to predefined semantic types such as Person, Location, Organization – and how this task can be exploited to **improve a strong Entity Linking baseline in low-resource settings** without requiring any additional data. We show how and to what extent an EL system can benefit from NER to enhance its entity representations, improve candidate selection, select more effective negative samples and enforce hard and soft constraints on its output entities. 41 | 42 |
43 | 44 | ![contributions](./img/contributions.png) 45 | 46 |
47 | 48 | 49 |
50 | 51 | # Fine-Grained Classes for NER 52 | In its standard formulation, NER distinguishes between four classes of entities: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). 53 | Although NER systems that use these four classes have been found to be beneficial in downstream tasks, we argue that they might be too coarse-grained and, at the same time, not provide a sufficiently exhaustive coverage to also benefit EL, as many different entities would fall within the same Misc class. 54 | 55 | For these reasons, **we introduce a new set of 18 finer-grained NER classes**, namely, Person (PER), Location (LOC), Organization (ORG), Animal (ANIM), Biology (BIO), Celestial Body (CEL), Disease (DIS), Event (EVE), Food (FOOD), Instrument (INST), Media (MEDIA), Monetary (MON), Number (NUM), Physical Phenomenon (PHYS), Plant (PLANT), Supernatural (SUPER), Time (TIME) and Vehicle (VEHI). 56 | 57 | In order to use the newly introduced NER classes, we **automatically** label each Wikipedia entity with one of them by taking advantage of [WordNet](https://wordnet.princeton.edu/) and [BabelNet](https://babelnet.org/). 58 | 59 | You can **download** the resulting mapping here: [Wikipedia2NER-mapping](https://drive.google.com/file/d/1tnyYe1alAPP2L866bUq4MtUh687z7oE4/view?usp=sharing) (158MB). 60 | 61 | The following plot shows the percentage of Wikipedia articles for each of the 18 NER classes. 62 | 63 |
64 | 65 | ![percentages](./img/percentages.png) 66 | 67 |
68 | 69 | Further details about these classes such as the exact number of articles for each class, or the full strategy used to obtain these annotations, are provided in the paper. 70 | 71 |
72 | 73 | # Other Resources 74 | Here you can download other resources needed to run the code, but also useful for other purposes (e.g., as a starting point for other EL projects). 75 | 76 |
77 | 78 | | Resource | Description | 79 | | ------------- | :------------- | 80 | | [Alias Table](https://drive.google.com/file/d/13iro8M2KVONWANcgna_3zxxPZl9b7TVC/view?usp=sharing) (732MB) | A dictionary that associates each textual mention with a set of possible candidates
(i.e., a set of possible Wikipedia IDs)| 81 | | [Descriptions Dictionary](https://drive.google.com/file/d/1kv1yxbrqvNgONcjuu2XNaoDrs6acOs4t/view?usp=sharing) (2.7GB) | A dictionary that associates each Wikipedia ID with its textual description| 82 | | [Counts Dictionary](https://drive.google.com/file/d/1uKAO2866GAwVYdq1Rda6v-C2TZvoWOoZ/view?usp=sharing) (222MB) | A dictionary that associates each Wikipedia ID with its frequency in Wikipedia
(i.e., the sum of all the wikilinks that refer to that page)| 83 | | [Titles Dictionary](https://drive.google.com/file/d/1hoUfhfNTP_73mcrYoWVBrwHQ8RXP2OSY/view?usp=sharing) (178MB) | A dictionary that associates the title of a Wikipedia page with its corresponding Wikipedia ID| 84 | | [NER Classifier](https://drive.google.com/file/d/1SNXL_UvJ1RWzQaFOKZusfimIMgFQ5LAy/view?usp=sharing) (418MB) | The pretrained NER classifier used for the NER-constrained decoding and NER-enhanced candidate generation contributions (place it into the ner4el/wandb folder)| 85 | 86 |
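As a rough illustration of how these resources fit together: the code in src/pl_data/dataset.py consumes them as plain Python dictionaries, which suggests they can be deserialized with pickle. The file names and serialization format below are assumptions for illustration, not documented facts.

```python
# Hypothetical loading snippet: the file names and the pickle format are
# assumptions, inferred from how src/pl_data/dataset.py uses these resources
# as plain Python dicts.
import pickle

with open("data/alias_table.pickle", "rb") as f:
    alias_table = pickle.load(f)          # mention (str) -> list of Wikipedia IDs

with open("data/descriptions_dict.pickle", "rb") as f:
    descriptions_dict = pickle.load(f)    # Wikipedia ID -> textual description

# Chaining the two dictionaries resolves a mention to its candidate descriptions:
for candidate in alias_table.get("rome", []):
    print(candidate, descriptions_dict.get(candidate, "<no description>"))
```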
87 | 88 |
89 | 90 | # Data 91 | The only training data that we use for our experiments are the training instances from the **AIDA-YAGO-CoNLL** training set. We evaluate our systems on the **validation** split of **AIDA-YAGO-CoNLL**. 92 | For **testing** we use the test split of **AIDA-YAGO-CoNLL**, and the **MSNBC**, **AQUAINT**, **ACE2004**, **WNED-CWEB** and **WNED-WIKI** test sets. 93 | 94 | We preprocessed the datasets and converted them in the following format: 95 |
96 | 97 | ```python 98 | {"mention": MENTION, "left_context": LEFT_CTX, "right_context": RIGHT_CTX, "output": OUTPUT} 99 | ``` 100 |
101 | 102 | The preprocessed datasets are already available in this repository: 103 | - [AIDA-YAGO-CoNLL (Train)](./ner4el/data/aida_train.jsonl) 104 | - [AIDA-YAGO-CoNLL (Dev)](./ner4el/data/aida_dev.jsonl) 105 | - [AIDA-YAGO-CoNLL (Test)](./ner4el/data/aida_test.jsonl) 106 | - [MSNBC](./ner4el/data/msnbc_test.jsonl) 107 | - [AQUAINT](./ner4el/data/aquaint_test.jsonl) 108 | - [ACE2004](./ner4el/data/ace2004_test.jsonl) 109 | - [WNED-CWEB](./ner4el/data/cweb_test.jsonl) 110 | - [WNED-WIKI](./ner4el/data/wiki_test.jsonl) 111 | 112 |
113 | 114 | # Pretrained Model 115 | We release the model checkpoint of the best NER4EL system [here](https://drive.google.com/file/d/1CbjbknVYiON11xV1rOZto5mumbPov0z4/view?usp=sharing) (4.0GB). 116 | 117 | We underline that we trained our system only on the 18K training instances provided by the **AIDA-YAGO-CoNLL** training set. If you want to obtain a stronger EL system using our architecture you can pretrain it on BLINK (9M training instances from Wikipedia). You can download the BLINK train and validation splits as follows: 118 | ```python 119 | wget http://dl.fbaipublicfiles.com/KILT/blink-train-kilt.jsonl 120 | wget http://dl.fbaipublicfiles.com/KILT/blink-dev-kilt.jsonl 121 | ``` 122 | 123 | **Note**: We re-implemented the code using the [nn-template](https://github.com/lucmos/nn-template) and trained the system on a different hardware architecture. Although the obtained results are, on average, almost identical to those reported in the paper, they are slightly different. Please, find below the performance of the released system: 124 | 125 |
126 | 127 | | System | AIDA | MSNBC | AQUAINT | ACE2004 | CWEB | WIKI | Avg. | 128 | | ------------- | -------------: | -------------: | -------------: | -------------: | -------------: | -------------: | -------------: | 129 | | Paper (Baseline + NER-R + NER-NS + NER-CD) | 92.5 | 89.2 | 69.5 | 91.3 | 68.5 | 64.0 | 79.16 | 130 | | Released system (Baseline + NER-R + NER-NS + NER-CD) | 93.6 | 89.1 | 70.6 | 91.0 | 67.2 | 63.7 | 79.20 | 131 |
132 | 133 | 134 |
135 | 136 | # How To Use 137 | To run the code, after you have downloaded the above listed resources and put them into the right folders as specified by the README files inside the folders, you need to perform the following steps: 138 | 139 | 0. Set the PROJECT_ROOT variable in the [.env](.env) file (it should correspond to the absolute path of the ner4el/ner4el folder) 140 | 141 | 1. Install the requirements: 142 | ``` 143 | pip install -r requirements.txt 144 | ``` 145 | The code requires **python >= 3.8**, hence we suggest you to create a conda environment with python 3.8. 146 | 147 | 2. Move to the ner4el folder and run the following command to train and evaluate the system: 148 | ``` 149 | PYTHONPATH=. python src/run.py 150 | ``` 151 | 152 | 3. If you want to test a trained system (e.g., the **NER4EL pretrained model** available in the previous section), run the command: 153 | ``` 154 | PYTHONPATH=. python src/test.py 155 | ``` 156 | Once the script is started, it asks you to specify the path of your model checkpoint. 157 | 158 | **Note**: If you want to **change the system configuration**, you need to move in the *ner4el/conf* folder and change the parameters of your interest. As an example, if you move to the [data configuration file](./ner4el/conf/data/default.yaml), you can set the *training, evaluation and test sets*, but you can also specify the *number of candidates* you want to use, as well as the *context window*. At lines 10-14, you can also choose which *NER-based contribution* you want to apply on the baseline system, by setting it to *True*. 159 | Similarly, in the [training configuration file](./ner4el/conf/train/default.yaml), you can specify the *number of epochs*, the *value of patience parameter*, and the number of *gradient accumulation steps*. 160 | 161 |
162 | 163 | # License 164 | NER4EL is licensed under the CC BY-SA-NC 4.0 license. The text of the license can be found [here](https://github.com/Babelscape/wikineural/blob/master/LICENSE). 165 | 166 |
167 | 168 | # Acknowledgments 169 | We gratefully acknowledge the support of the **ERC Consolidator Grant MOUSSE No. 726487** under the European Union’s Horizon 2020 research and innovation programme. 170 | 171 | The code in this repository is built on top of [![](https://shields.io/badge/-nn--template-emerald?style=flat&logo=github&labelColor=gray)](https://github.com/lucmos/nn-template). 172 | -------------------------------------------------------------------------------- /ner4el/src/pl_modules/model.py: -------------------------------------------------------------------------------- 1 | from typing import Any, Dict, Sequence, Tuple, Union, List 2 | 3 | import hydra 4 | import omegaconf 5 | import pytorch_lightning as pl 6 | import torch 7 | from torch import nn 8 | from omegaconf import DictConfig 9 | from torch._C import device 10 | from torch.optim import Optimizer 11 | import os 12 | import math 13 | from sklearn.metrics import f1_score 14 | from pl_modules.ner_model import MyNERModel 15 | from src.common.utils import * 16 | 17 | from src.common.utils import PROJECT_ROOT 18 | 19 | from transformers import BertTokenizer, BertModel, BertConfig 20 | 21 | ner_constrained_decoding = input("> Do you want to use the NER-Constrained Decoding (NER-CD) strategy at inference time (it's applied only during testing)? ") 22 | 23 | if ner_constrained_decoding.lower() == "yes" or ner_constrained_decoding.lower() == "y": 24 | ner_model = MyNERModel().cuda() 25 | ner_model.load_state_dict(torch.load(str(PROJECT_ROOT / "wandb/ner_classifier.pt"))) 26 | 27 | id2ner_dict_path = "data/id2ner_dict.pickle" 28 | id2ner_dict_path = str(PROJECT_ROOT / id2ner_dict_path) 29 | 30 | id2ner = read_id2ner_dict(id2ner_dict_path) 31 | 32 | labels_vocab = {} 33 | i = 0 34 | for v in id2ner.values(): 35 | if v not in labels_vocab: 36 | labels_vocab[v] = i 37 | i+=1 38 | 39 | class MyModel(pl.LightningModule): 40 | def __init__(self, *args, **kwargs) -> None: 41 | super().__init__() 42 | self.save_hyperparameters() # populate self.hparams with args and kwargs automagically! 43 | 44 | self.bert_config = BertConfig.from_pretrained( 45 | self.hparams.transformer_name, output_hidden_states=True 46 | ) 47 | 48 | self.bert_tokenizer = BertTokenizer.from_pretrained( 49 | self.hparams.transformer_name 50 | ) 51 | 52 | '''self.mention_encoder = BertModel.from_pretrained( 53 | self.hparams.transformer_name, config=self.bert_config 54 | ) 55 | 56 | self.entity_encoder = BertModel.from_pretrained( 57 | self.hparams.transformer_name, config=self.bert_config 58 | ) 59 | 60 | special_tokens_dict = {"additional_special_tokens": ["[E]", "[\E]"]} 61 | self.bert_tokenizer.add_special_tokens(special_tokens_dict) 62 | self.mention_encoder.resize_token_embeddings(len(self.bert_tokenizer)) 63 | self.entity_encoder.resize_token_embeddings(len(self.bert_tokenizer))''' 64 | 65 | #------------------------------------------------------------------------------------ 66 | # Uncomment the above alternative block of code to use the dual-encoder architecture. 
67 | # However, we observed that using a single encoder, we obtain very similar 68 | # performances, while we have much shorter training times and less computational 69 | # resources are required 70 | 71 | self.bert_model = BertModel.from_pretrained( 72 | self.hparams.transformer_name, config=self.bert_config 73 | ) 74 | 75 | special_tokens_dict = {"additional_special_tokens": ["[E]", "[\E]"]} 76 | self.bert_tokenizer.add_special_tokens(special_tokens_dict) 77 | self.bert_model.resize_token_embeddings(len(self.bert_tokenizer)) 78 | 79 | self.mention_encoder = self.bert_model 80 | self.entity_encoder = self.bert_model 81 | 82 | #------------------------------------------------------------------------------------ 83 | 84 | self.cosine_similarity = nn.CosineSimilarity(dim=-1, eps=1e-6) 85 | 86 | self.dropout = nn.Dropout(self.hparams.dropout) 87 | self.loss_function = nn.CrossEntropyLoss() 88 | 89 | def forward( 90 | self, mentions, positions, descriptions, mask1, mask2, **kwargs 91 | ) -> Dict[str, torch.Tensor]: 92 | """ 93 | Method for the forward pass. 94 | 'training_step', 'validation_step' and 'test_step' should call 95 | this method in order to compute the output predictions and the loss. 96 | Returns: 97 | output_dict: forward output containing the predictions (output logits ecc...) and the loss if any. 98 | """ 99 | num_candidates = descriptions.shape[1] 100 | 101 | embedding_mention = self.mention_encoder.forward(mentions, mask1)[0] # 16x64x768 102 | embedding_mention2 = (embedding_mention.gather(1, positions.reshape(-1, 1, 1).repeat(1, 1, self.bert_config.hidden_size),).squeeze(1)) 103 | # embedding_mention2 = self.dropout(embedding_mention2) #16x768 104 | 105 | descriptions = descriptions.flatten(start_dim=0, end_dim=1) # 16x20x64 -> 320x64 106 | mask2 = mask2.flatten(start_dim=0, end_dim=1) 107 | 108 | embedding_entities = self.entity_encoder.forward(descriptions, mask2)[0] # 320x64x768 109 | # embedding_entities = self.dropout(embedding_entities) 110 | embedding_entities = embedding_entities[:, 0, :].squeeze(1) # 320x768 111 | embedding_entities = embedding_entities.reshape(embedding_mention2.shape[0], num_candidates, -1) # 16x20x768 112 | 113 | # mentions 16x768, entities #16x20x768 114 | embedding_mention2 = embedding_mention2.unsqueeze(1) # 16x1x768 115 | embedding_mention2 = embedding_mention2.repeat_interleave(num_candidates, dim=1) # 16x20x768 116 | 117 | similarities = self.cosine_similarity(embedding_mention2, embedding_entities) 118 | 119 | return similarities 120 | 121 | def step(self, batch: Any, batch_idx: int, dataset_type:str): 122 | 123 | if ner_constrained_decoding.lower() == ("no") or ner_constrained_decoding.lower() == ("n"): 124 | mentions, positions, candidates, descriptions, labels = batch 125 | positions = torch.tensor(positions, device=self.device) 126 | 127 | mask1 = self.padding_mask(mentions) 128 | mask2 = self.padding_mask(descriptions) 129 | 130 | similarities = self.forward(mentions, positions, descriptions, mask1, mask2) 131 | normalized_similarities = self.normalize(similarities) 132 | 133 | gold = torch.zeros(normalized_similarities.shape[0]) 134 | for i in range(descriptions.shape[0]): # i is the index of the batch 135 | for j in range(descriptions.shape[1]): 136 | if candidates[i][j] == labels[i]: 137 | gold[i] = j 138 | gold = gold.type(torch.LongTensor).to(self.device) 139 | 140 | 141 | loss = self.loss_function(normalized_similarities, gold) 142 | 143 | all_predictions = list() 144 | all_labels = list() 145 | 146 | if dataset_type != "train": 
147 | for i in range(len(similarities)): 148 | current_candidates = list(filter(lambda x: x!=0, candidates[i])) 149 | normalized_similarities_line = normalized_similarities[i][:len(current_candidates)] 150 | all_predictions.append(int(candidates[i][torch.argmax(normalized_similarities_line)])) 151 | all_labels.append(int(labels[i])) 152 | 153 | if dataset_type=="train": 154 | if not math.isnan(loss): 155 | return { 156 | "loss": loss, 157 | "pred": all_predictions, 158 | "gold": all_labels, 159 | } 160 | else: 161 | return None 162 | else: 163 | return { 164 | "pred": all_predictions, 165 | "gold": all_labels, 166 | } 167 | 168 | else: 169 | softmax_function = nn.Softmax(dim=1) 170 | 171 | mentions, positions, candidates, descriptions, labels = batch 172 | positions = torch.tensor(positions, device=self.device) 173 | 174 | mask1 = self.padding_mask(mentions) 175 | mask2 = self.padding_mask(descriptions) 176 | 177 | similarities = self.forward(mentions, positions, descriptions, mask1, mask2) 178 | normalized_similarities = self.normalize(similarities) 179 | 180 | predictions = ner_model.forward(mentions, positions, mask1) 181 | predictions = softmax_function(predictions) 182 | 183 | gold = torch.zeros(normalized_similarities.shape[0]) 184 | for i in range(descriptions.shape[0]): # i is the index of the batch 185 | for j in range(descriptions.shape[1]): 186 | if candidates[i][j] == labels[i]: 187 | gold[i] = j 188 | gold = gold.type(torch.LongTensor).to(self.device) 189 | 190 | 191 | loss = self.loss_function(normalized_similarities, gold) 192 | 193 | all_predictions = list() 194 | all_labels = list() 195 | 196 | if dataset_type != "train": 197 | for i in range(len(similarities)): 198 | 199 | top_k_ner = torch.topk(predictions[i], 3)[1] 200 | target_ner_tag_id = top_k_ner[0].item() 201 | confidence = predictions[i][target_ner_tag_id].item() 202 | target_ner_tags = [] 203 | for candidate in top_k_ner: 204 | target_ner_tags.append(self.get_key(labels_vocab, candidate)) 205 | 206 | current_candidates = list(filter(lambda x: x!=0, candidates[i])) 207 | normalized_similarities_line = normalized_similarities[i][:len(current_candidates)] 208 | current_candidates_ner = [id2ner[str(c.item())] if str(c.item()) in id2ner else "" for c in current_candidates] 209 | for k in range(len(current_candidates)): 210 | if current_candidates_ner[k] not in target_ner_tags[0] and confidence>0.5: #confidence 211 | normalized_similarities_line[k] = 0.0 212 | elif current_candidates_ner[k] not in target_ner_tags: #top-k 213 | normalized_similarities_line[k] = 0.0 214 | 215 | 216 | all_predictions.append(int(candidates[i][torch.argmax(normalized_similarities_line)])) 217 | all_labels.append(labels[i]) 218 | 219 | if dataset_type=="train": 220 | if not math.isnan(loss): 221 | return { 222 | "loss": loss, 223 | "pred": all_predictions, 224 | "gold": all_labels, 225 | } 226 | else: 227 | return None 228 | else: 229 | return { 230 | "pred": all_predictions, 231 | "gold": all_labels, 232 | } 233 | 234 | 235 | 236 | 237 | def training_step(self, batch: Any, batch_idx: int) -> torch.Tensor: 238 | step_output = self.step(batch, batch_idx, "train") 239 | if step_output is not None: 240 | self.log_dict( 241 | {"train_loss": step_output["loss"]}, 242 | on_step=True, 243 | on_epoch=True, 244 | prog_bar=True, 245 | ) 246 | return step_output 247 | 248 | def validation_step(self, batch: Any, batch_idx: int) -> torch.Tensor: 249 | step_output = self.step(batch, batch_idx, "dev") 250 | return step_output 251 | 252 | def 
test_step(self, batch: Any, batch_idx: int) -> torch.Tensor: 253 | step_output = self.step(batch, batch_idx, "test") 254 | return step_output 255 | 256 | 257 | def my_epoch_end(self, outputs: List[Any], split:str) -> None: 258 | all_predictions = [] 259 | all_labels = [] 260 | 261 | for elem in outputs: 262 | all_predictions.extend(elem["pred"]) 263 | all_labels.extend(elem["gold"]) 264 | 265 | f1_micro = f1_score(all_labels, all_predictions, average='micro') 266 | self.log_dict( 267 | {f"{split}_acc": f1_micro}, 268 | prog_bar=True 269 | ) 270 | 271 | return super().validation_epoch_end(outputs) 272 | 273 | def validation_epoch_end(self, outputs: List[Any]) -> None: 274 | return self.my_epoch_end(outputs, "val") 275 | 276 | def test_epoch_end(self, outputs: List[Any]) -> None: 277 | return self.my_epoch_end(outputs, "test") 278 | 279 | 280 | def configure_optimizers( 281 | self, 282 | ) -> Union[Optimizer, Tuple[Sequence[Optimizer], Sequence[Any]]]: 283 | """ 284 | Choose what optimizers and learning-rate schedulers to use in your optimization. 285 | Normally you'd need one. But in the case of GANs or similar you might have multiple. 286 | Return: 287 | Any of these 6 options. 288 | - Single optimizer. 289 | - List or Tuple - List of optimizers. 290 | - Two lists - The first list has multiple optimizers, the second a list of LR schedulers (or lr_dict). 291 | - Dictionary, with an 'optimizer' key, and (optionally) a 'lr_scheduler' 292 | key whose value is a single LR scheduler or lr_dict. 293 | - Tuple of dictionaries as described, with an optional 'frequency' key. 294 | - None - Fit will run without any optimizer. 295 | """ 296 | opt = hydra.utils.instantiate( 297 | self.hparams.optim.optimizer, params=self.parameters(), _convert_="partial" 298 | ) 299 | if not self.hparams.optim.use_lr_scheduler: 300 | return [opt] 301 | scheduler = hydra.utils.instantiate( 302 | self.hparams.optim.lr_scheduler, optimizer=opt 303 | ) 304 | return [opt], [scheduler] 305 | 306 | def padding_mask(self, batch): 307 | padding = torch.ones_like(batch) 308 | padding[batch == 0] = 0 309 | padding = padding.type(torch.int64) 310 | return padding 311 | 312 | def normalize(self, m): 313 | row_min, _ = m.min(dim=1, keepdim=True) 314 | row_max, _ = m.max(dim=1, keepdim=True) 315 | return (m - row_min) / (row_max - row_min) 316 | 317 | def get_key(self, dictionary, val): 318 | for key, value in dictionary.items(): 319 | if val == value: 320 | return key 321 | 322 | return "key doesn't exist" 323 | 324 | 325 | @hydra.main(config_path=str(PROJECT_ROOT / "conf"), config_name="default") 326 | def main(cfg: omegaconf.DictConfig): 327 | model: pl.LightningModule = hydra.utils.instantiate( 328 | cfg.model, 329 | optim=cfg.optim, 330 | data=cfg.data, 331 | logging=cfg.logging, 332 | _recursive_=False, 333 | ) 334 | 335 | 336 | if __name__ == "__main__": 337 | main() 338 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | ======================================================================= 2 | 3 | Attribution-NonCommercial-ShareAlike 4.0 International 4 | 5 | ======================================================================= 6 | 7 | Creative Commons Corporation ("Creative Commons") is not a law firm and 8 | does not provide legal services or legal advice. Distribution of 9 | Creative Commons public licenses does not create a lawyer-client or 10 | other relationship. 
Creative Commons makes its licenses and related 11 | information available on an "as-is" basis. Creative Commons gives no 12 | warranties regarding its licenses, any material licensed under their 13 | terms and conditions, or any related information. Creative Commons 14 | disclaims all liability for damages resulting from their use to the 15 | fullest extent possible. 16 | 17 | Using Creative Commons Public Licenses 18 | 19 | Creative Commons public licenses provide a standard set of terms and 20 | conditions that creators and other rights holders may use to share 21 | original works of authorship and other material subject to copyright 22 | and certain other rights specified in the public license below. The 23 | following considerations are for informational purposes only, are not 24 | exhaustive, and do not form part of our licenses. 25 | 26 | Considerations for licensors: Our public licenses are 27 | intended for use by those authorized to give the public 28 | permission to use material in ways otherwise restricted by 29 | copyright and certain other rights. Our licenses are 30 | irrevocable. Licensors should read and understand the terms 31 | and conditions of the license they choose before applying it. 32 | Licensors should also secure all rights necessary before 33 | applying our licenses so that the public can reuse the 34 | material as expected. Licensors should clearly mark any 35 | material not subject to the license. This includes other CC- 36 | licensed material, or material used under an exception or 37 | limitation to copyright. More considerations for licensors: 38 | wiki.creativecommons.org/Considerations_for_licensors 39 | 40 | Considerations for the public: By using one of our public 41 | licenses, a licensor grants the public permission to use the 42 | licensed material under specified terms and conditions. If 43 | the licensor's permission is not necessary for any reason--for 44 | example, because of any applicable exception or limitation to 45 | copyright--then that use is not regulated by the license. Our 46 | licenses grant only permissions under copyright and certain 47 | other rights that a licensor has authority to grant. Use of 48 | the licensed material may still be restricted for other 49 | reasons, including because others have copyright or other 50 | rights in the material. A licensor may make special requests, 51 | such as asking that all changes be marked or described. 52 | Although not required by our licenses, you are encouraged to 53 | respect those requests where reasonable. More considerations 54 | for the public: 55 | wiki.creativecommons.org/Considerations_for_licensees 56 | 57 | ======================================================================= 58 | 59 | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International 60 | Public License 61 | 62 | By exercising the Licensed Rights (defined below), You accept and agree 63 | to be bound by the terms and conditions of this Creative Commons 64 | Attribution-NonCommercial-ShareAlike 4.0 International Public License 65 | ("Public License"). To the extent this Public License may be 66 | interpreted as a contract, You are granted the Licensed Rights in 67 | consideration of Your acceptance of these terms and conditions, and the 68 | Licensor grants You such rights in consideration of benefits the 69 | Licensor receives from making the Licensed Material available under 70 | these terms and conditions. 71 | 72 | 73 | Section 1 -- Definitions. 74 | 75 | a. 
Adapted Material means material subject to Copyright and Similar 76 | Rights that is derived from or based upon the Licensed Material 77 | and in which the Licensed Material is translated, altered, 78 | arranged, transformed, or otherwise modified in a manner requiring 79 | permission under the Copyright and Similar Rights held by the 80 | Licensor. For purposes of this Public License, where the Licensed 81 | Material is a musical work, performance, or sound recording, 82 | Adapted Material is always produced where the Licensed Material is 83 | synched in timed relation with a moving image. 84 | 85 | b. Adapter's License means the license You apply to Your Copyright 86 | and Similar Rights in Your contributions to Adapted Material in 87 | accordance with the terms and conditions of this Public License. 88 | 89 | c. BY-NC-SA Compatible License means a license listed at 90 | creativecommons.org/compatiblelicenses, approved by Creative 91 | Commons as essentially the equivalent of this Public License. 92 | 93 | d. Copyright and Similar Rights means copyright and/or similar rights 94 | closely related to copyright including, without limitation, 95 | performance, broadcast, sound recording, and Sui Generis Database 96 | Rights, without regard to how the rights are labeled or 97 | categorized. For purposes of this Public License, the rights 98 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 99 | Rights. 100 | 101 | e. Effective Technological Measures means those measures that, in the 102 | absence of proper authority, may not be circumvented under laws 103 | fulfilling obligations under Article 11 of the WIPO Copyright 104 | Treaty adopted on December 20, 1996, and/or similar international 105 | agreements. 106 | 107 | f. Exceptions and Limitations means fair use, fair dealing, and/or 108 | any other exception or limitation to Copyright and Similar Rights 109 | that applies to Your use of the Licensed Material. 110 | 111 | g. License Elements means the license attributes listed in the name 112 | of a Creative Commons Public License. The License Elements of this 113 | Public License are Attribution, NonCommercial, and ShareAlike. 114 | 115 | h. Licensed Material means the artistic or literary work, database, 116 | or other material to which the Licensor applied this Public 117 | License. 118 | 119 | i. Licensed Rights means the rights granted to You subject to the 120 | terms and conditions of this Public License, which are limited to 121 | all Copyright and Similar Rights that apply to Your use of the 122 | Licensed Material and that the Licensor has authority to license. 123 | 124 | j. Licensor means the individual(s) or entity(ies) granting rights 125 | under this Public License. 126 | 127 | k. NonCommercial means not primarily intended for or directed towards 128 | commercial advantage or monetary compensation. For purposes of 129 | this Public License, the exchange of the Licensed Material for 130 | other material subject to Copyright and Similar Rights by digital 131 | file-sharing or similar means is NonCommercial provided there is 132 | no payment of monetary compensation in connection with the 133 | exchange. 134 | 135 | l. 
Share means to provide material to the public by any means or 136 | process that requires permission under the Licensed Rights, such 137 | as reproduction, public display, public performance, distribution, 138 | dissemination, communication, or importation, and to make material 139 | available to the public including in ways that members of the 140 | public may access the material from a place and at a time 141 | individually chosen by them. 142 | 143 | m. Sui Generis Database Rights means rights other than copyright 144 | resulting from Directive 96/9/EC of the European Parliament and of 145 | the Council of 11 March 1996 on the legal protection of databases, 146 | as amended and/or succeeded, as well as other essentially 147 | equivalent rights anywhere in the world. 148 | 149 | n. You means the individual or entity exercising the Licensed Rights 150 | under this Public License. Your has a corresponding meaning. 151 | 152 | 153 | Section 2 -- Scope. 154 | 155 | a. License grant. 156 | 157 | 1. Subject to the terms and conditions of this Public License, 158 | the Licensor hereby grants You a worldwide, royalty-free, 159 | non-sublicensable, non-exclusive, irrevocable license to 160 | exercise the Licensed Rights in the Licensed Material to: 161 | 162 | a. reproduce and Share the Licensed Material, in whole or 163 | in part, for NonCommercial purposes only; and 164 | 165 | b. produce, reproduce, and Share Adapted Material for 166 | NonCommercial purposes only. 167 | 168 | 2. Exceptions and Limitations. For the avoidance of doubt, where 169 | Exceptions and Limitations apply to Your use, this Public 170 | License does not apply, and You do not need to comply with 171 | its terms and conditions. 172 | 173 | 3. Term. The term of this Public License is specified in Section 174 | 6(a). 175 | 176 | 4. Media and formats; technical modifications allowed. The 177 | Licensor authorizes You to exercise the Licensed Rights in 178 | all media and formats whether now known or hereafter created, 179 | and to make technical modifications necessary to do so. The 180 | Licensor waives and/or agrees not to assert any right or 181 | authority to forbid You from making technical modifications 182 | necessary to exercise the Licensed Rights, including 183 | technical modifications necessary to circumvent Effective 184 | Technological Measures. For purposes of this Public License, 185 | simply making modifications authorized by this Section 2(a) 186 | (4) never produces Adapted Material. 187 | 188 | 5. Downstream recipients. 189 | 190 | a. Offer from the Licensor -- Licensed Material. Every 191 | recipient of the Licensed Material automatically 192 | receives an offer from the Licensor to exercise the 193 | Licensed Rights under the terms and conditions of this 194 | Public License. 195 | 196 | b. Additional offer from the Licensor -- Adapted Material. 197 | Every recipient of Adapted Material from You 198 | automatically receives an offer from the Licensor to 199 | exercise the Licensed Rights in the Adapted Material 200 | under the conditions of the Adapter's License You apply. 201 | 202 | c. No downstream restrictions. You may not offer or impose 203 | any additional or different terms or conditions on, or 204 | apply any Effective Technological Measures to, the 205 | Licensed Material if doing so restricts exercise of the 206 | Licensed Rights by any recipient of the Licensed 207 | Material. 208 | 209 | 6. No endorsement. 
Nothing in this Public License constitutes or 210 | may be construed as permission to assert or imply that You 211 | are, or that Your use of the Licensed Material is, connected 212 | with, or sponsored, endorsed, or granted official status by, 213 | the Licensor or others designated to receive attribution as 214 | provided in Section 3(a)(1)(A)(i). 215 | 216 | b. Other rights. 217 | 218 | 1. Moral rights, such as the right of integrity, are not 219 | licensed under this Public License, nor are publicity, 220 | privacy, and/or other similar personality rights; however, to 221 | the extent possible, the Licensor waives and/or agrees not to 222 | assert any such rights held by the Licensor to the limited 223 | extent necessary to allow You to exercise the Licensed 224 | Rights, but not otherwise. 225 | 226 | 2. Patent and trademark rights are not licensed under this 227 | Public License. 228 | 229 | 3. To the extent possible, the Licensor waives any right to 230 | collect royalties from You for the exercise of the Licensed 231 | Rights, whether directly or through a collecting society 232 | under any voluntary or waivable statutory or compulsory 233 | licensing scheme. In all other cases the Licensor expressly 234 | reserves any right to collect such royalties, including when 235 | the Licensed Material is used other than for NonCommercial 236 | purposes. 237 | 238 | 239 | Section 3 -- License Conditions. 240 | 241 | Your exercise of the Licensed Rights is expressly made subject to the 242 | following conditions. 243 | 244 | a. Attribution. 245 | 246 | 1. If You Share the Licensed Material (including in modified 247 | form), You must: 248 | 249 | a. retain the following if it is supplied by the Licensor 250 | with the Licensed Material: 251 | 252 | i. identification of the creator(s) of the Licensed 253 | Material and any others designated to receive 254 | attribution, in any reasonable manner requested by 255 | the Licensor (including by pseudonym if 256 | designated); 257 | 258 | ii. a copyright notice; 259 | 260 | iii. a notice that refers to this Public License; 261 | 262 | iv. a notice that refers to the disclaimer of 263 | warranties; 264 | 265 | v. a URI or hyperlink to the Licensed Material to the 266 | extent reasonably practicable; 267 | 268 | b. indicate if You modified the Licensed Material and 269 | retain an indication of any previous modifications; and 270 | 271 | c. indicate the Licensed Material is licensed under this 272 | Public License, and include the text of, or the URI or 273 | hyperlink to, this Public License. 274 | 275 | 2. You may satisfy the conditions in Section 3(a)(1) in any 276 | reasonable manner based on the medium, means, and context in 277 | which You Share the Licensed Material. For example, it may be 278 | reasonable to satisfy the conditions by providing a URI or 279 | hyperlink to a resource that includes the required 280 | information. 281 | 3. If requested by the Licensor, You must remove any of the 282 | information required by Section 3(a)(1)(A) to the extent 283 | reasonably practicable. 284 | 285 | b. ShareAlike. 286 | 287 | In addition to the conditions in Section 3(a), if You Share 288 | Adapted Material You produce, the following conditions also apply. 289 | 290 | 1. The Adapter's License You apply must be a Creative Commons 291 | license with the same License Elements, this version or 292 | later, or a BY-NC-SA Compatible License. 293 | 294 | 2. You must include the text of, or the URI or hyperlink to, the 295 | Adapter's License You apply. 
You may satisfy this condition 296 | in any reasonable manner based on the medium, means, and 297 | context in which You Share Adapted Material. 298 | 299 | 3. You may not offer or impose any additional or different terms 300 | or conditions on, or apply any Effective Technological 301 | Measures to, Adapted Material that restrict exercise of the 302 | rights granted under the Adapter's License You apply. 303 | 304 | 305 | Section 4 -- Sui Generis Database Rights. 306 | 307 | Where the Licensed Rights include Sui Generis Database Rights that 308 | apply to Your use of the Licensed Material: 309 | 310 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 311 | to extract, reuse, reproduce, and Share all or a substantial 312 | portion of the contents of the database for NonCommercial purposes 313 | only; 314 | 315 | b. if You include all or a substantial portion of the database 316 | contents in a database in which You have Sui Generis Database 317 | Rights, then the database in which You have Sui Generis Database 318 | Rights (but not its individual contents) is Adapted Material, 319 | including for purposes of Section 3(b); and 320 | 321 | c. You must comply with the conditions in Section 3(a) if You Share 322 | all or a substantial portion of the contents of the database. 323 | 324 | For the avoidance of doubt, this Section 4 supplements and does not 325 | replace Your obligations under this Public License where the Licensed 326 | Rights include other Copyright and Similar Rights. 327 | 328 | 329 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 330 | 331 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 332 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 333 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 334 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 335 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 336 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 337 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 338 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 339 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 340 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 341 | 342 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 343 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 344 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 345 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 346 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 347 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 348 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 349 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 350 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 351 | 352 | c. The disclaimer of warranties and limitation of liability provided 353 | above shall be interpreted in a manner that, to the extent 354 | possible, most closely approximates an absolute disclaimer and 355 | waiver of all liability. 356 | 357 | 358 | Section 6 -- Term and Termination. 359 | 360 | a. This Public License applies for the term of the Copyright and 361 | Similar Rights licensed here. However, if You fail to comply with 362 | this Public License, then Your rights under this Public License 363 | terminate automatically. 364 | 365 | b. 
Where Your right to use the Licensed Material has terminated under 366 | Section 6(a), it reinstates: 367 | 368 | 1. automatically as of the date the violation is cured, provided 369 | it is cured within 30 days of Your discovery of the 370 | violation; or 371 | 372 | 2. upon express reinstatement by the Licensor. 373 | 374 | For the avoidance of doubt, this Section 6(b) does not affect any 375 | right the Licensor may have to seek remedies for Your violations 376 | of this Public License. 377 | 378 | c. For the avoidance of doubt, the Licensor may also offer the 379 | Licensed Material under separate terms or conditions or stop 380 | distributing the Licensed Material at any time; however, doing so 381 | will not terminate this Public License. 382 | 383 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 384 | License. 385 | 386 | 387 | Section 7 -- Other Terms and Conditions. 388 | 389 | a. The Licensor shall not be bound by any additional or different 390 | terms or conditions communicated by You unless expressly agreed. 391 | 392 | b. Any arrangements, understandings, or agreements regarding the 393 | Licensed Material not stated herein are separate from and 394 | independent of the terms and conditions of this Public License. 395 | 396 | 397 | Section 8 -- Interpretation. 398 | 399 | a. For the avoidance of doubt, this Public License does not, and 400 | shall not be interpreted to, reduce, limit, restrict, or impose 401 | conditions on any use of the Licensed Material that could lawfully 402 | be made without permission under this Public License. 403 | 404 | b. To the extent possible, if any provision of this Public License is 405 | deemed unenforceable, it shall be automatically reformed to the 406 | minimum extent necessary to make it enforceable. If the provision 407 | cannot be reformed, it shall be severed from this Public License 408 | without affecting the enforceability of the remaining terms and 409 | conditions. 410 | 411 | c. No term or condition of this Public License will be waived and no 412 | failure to comply consented to unless expressly agreed to by the 413 | Licensor. 414 | 415 | d. Nothing in this Public License constitutes or may be interpreted 416 | as a limitation upon, or waiver of, any privileges and immunities 417 | that apply to the Licensor or You, including from the legal 418 | processes of any jurisdiction or authority. 419 | 420 | ======================================================================= 421 | 422 | Creative Commons is not a party to its public 423 | licenses. Notwithstanding, Creative Commons may elect to apply one of 424 | its public licenses to material it publishes and in those instances 425 | will be considered the “Licensor.” The text of the Creative Commons 426 | public licenses is dedicated to the public domain under the CC0 Public 427 | Domain Dedication. Except for the limited purpose of indicating that 428 | material is shared under a Creative Commons public license or as 429 | otherwise permitted by the Creative Commons policies published at 430 | creativecommons.org/policies, Creative Commons does not authorize the 431 | use of the trademark "Creative Commons" or any other trademark or logo 432 | of Creative Commons without its prior written consent including, 433 | without limitation, in connection with any unauthorized modifications 434 | to any of its public licenses or any other arrangements, 435 | understandings, or agreements concerning use of licensed material. 
For 436 | the avoidance of doubt, this paragraph does not form part of the 437 | public licenses. 438 | 439 | Creative Commons may be contacted at creativecommons.org. 440 | --------------------------------------------------------------------------------