├── architecture.png
├── example_output_graph.png
├── Requirements
│   ├── requirements_pipeline.txt
│   ├── requirements_RE.txt
│   └── requirements_TE.txt
├── README.md
├── Term Extraction
│   └── Term_Extraction_Token_Classifier_with_TERMEVAL(LDK).ipynb
└── Pipeline_LDK.ipynb

/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Text2TCS/Towards-Learning-Terminological-Concept-Systems/main/architecture.png
--------------------------------------------------------------------------------
/example_output_graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Text2TCS/Towards-Learning-Terminological-Concept-Systems/main/example_output_graph.png
--------------------------------------------------------------------------------
/Requirements/requirements_pipeline.txt:
--------------------------------------------------------------------------------
1 | spacy==2.2.4
2 | pandas==1.1.5
3 | torch==1.8.1+cu101
4 | graphviz==0.10.1
5 | lxml==4.2.6
6 | transformers==4.6.1
--------------------------------------------------------------------------------
/Requirements/requirements_RE.txt:
--------------------------------------------------------------------------------
1 | sacremoses==0.0.45
2 | seaborn==0.11.1
3 | torch==1.8.1+cu101
4 | nltk==3.2.5
5 | numpy==1.19.5
6 | pandas==1.1.5
7 | sentencepiece==0.1.95
8 | spacy==2.2.4
9 | transformers==4.6.1
10 | matplotlib==3.2.2
11 | scikit_learn==0.24.2
--------------------------------------------------------------------------------
/Requirements/requirements_TE.txt:
--------------------------------------------------------------------------------
1 | pandas==1.1.5
2 | torch==1.8.1+cu101
3 | spacy==2.2.4
4 | sentencepiece==0.1.95
5 | nvidia_ml_py3==7.352.0
6 | seaborn==0.11.1
7 | numpy==1.19.5
8 | sacremoses==0.0.45
9 | seqeval==1.2.2
10 | transformers==4.6.1
11 | matplotlib==3.2.2
12 | 
nltk==3.2.5
13 | pynvml==8.0.4
14 | scikit_learn==0.24.2
15 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Towards Learning Terminological Concept Systems from Multilingual Natural Language Text
2 | 
3 | ## Reference
4 | Wachowiak, L., Lang, C., Heinisch, B., & Gromann, D. (2021). Towards Learning Terminological Concept Systems from Multilingual Natural Language Text. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
5 | 
6 | - [PDF](https://drops.dagstuhl.de/opus/volltexte/2021/14558/pdf/OASIcs-LDK-2021-22.pdf)
7 | - [Video Presentation](https://www.youtube.com/watch?v=Kb8bnb15VLI)
8 | 
9 | ## Try it Out
10 | If you just want to try out the service without using the code provided here, you can use our [implementation made available on the European Language Grid](https://live.european-language-grid.eu/catalogue/tool-service/8122). However, the implementation on the European Language Grid uses a slightly improved architecture as well as a different dataset for model training; an updated description will be made available soon.
11 | 
12 | 
13 | ## Architecture
14 | Architecture for extracting terminological concept systems from natural language text.
15 | ![PicArchitecture](/architecture.png)
16 | 
17 | ## Example Output
18 | The resulting terminological concept system is returned in TBX format as well as a connected graph (see below). 
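As a rough illustration of the TBX output, the sketch below builds a single concept entry with Python's standard `xml.etree.ElementTree` (the pipeline itself lists `lxml` in its requirements). The element names follow the general TBX style, and the terms, ids, and the way a relation is attached as a typed cross-reference are illustrative assumptions, not the pipeline's exact schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical TBX-style skeleton: one concept entry holding an English
# term, plus a typed cross-reference standing in for an extracted relation.
martif = ET.Element("martif", type="TBX")
body = ET.SubElement(ET.SubElement(martif, "text"), "body")

entry = ET.SubElement(body, "termEntry", id="c1")
lang_set = ET.SubElement(entry, "langSet", {"xml:lang": "en"})
tig = ET.SubElement(lang_set, "tig")
ET.SubElement(tig, "term").text = "wind turbine"

# a relation extracted between two concepts becomes a typed cross-reference
ref = ET.SubElement(tig, "ref", type="partitiveRelation", target="c2")
ref.text = "rotor blade"

print(ET.tostring(martif, encoding="unicode"))
```

A complete TBX file would additionally carry a header and one `langSet` per language per concept; the sketch only shows how extracted terms and relations can be serialized into one entry.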
19 | ![PicExampleGraphOutput](/example_output_graph.png)
20 | 
21 | ## Term Extraction Scores
22 | 
23 | | Dataset | Precision | Recall | F1 |
24 | |---|---|---|---|
25 | | TermEval2020 EN | 54.9 | 62.2 | 58.3 |
26 | | TermEval2020 FR | 65.4 | 51.4 | 57.6 |
27 | | TermEval2020 NL | 67.9 | 71.7 | 69.8 |
28 | | ACL RD-TEC Annotator 1 | 74.4 | 77.2 | 75.8 |
29 | | ACL RD-TEC Annotator 2 | 80.1 | 79.3 | 80.0 |
30 | 
31 | ## Relation Extraction Scores
32 | 
33 | | Relation Type | Precision | Recall | F1 |
34 | |---|---|---|---|
35 | | synonymy | 0.85 | 0.76 | 0.80 |
36 | | activityRelation (e1,e2) | 0.93 | 0.97 | 0.95 |
37 | | activityRelation (e2,e1) | 0.00 | 0.00 | 0.00 |
38 | | associativeRelation | 0.90 | 0.92 | 0.91 |
39 | | causalRelation (e1,e2) | 0.90 | 0.95 | 0.92 |
40 | | causalRelation (e2,e1) | 0.92 | 0.91 | 0.91 |
41 | | genericRelation (e1,e2) | 0.90 | 0.93 | 0.92 |
42 | | genericRelation (e2,e1) | 0.46 | 0.41 | 0.43 |
43 | | instrumentalRelation (e1,e2) | 0.72 | 0.68 | 0.70 |
44 | | instrumentalRelation (e2,e1) | 0.85 | 0.88 | 0.86 |
45 | | none | 0.69 | 0.44 | 0.54 |
46 | | originationRelation (e1,e2) | 0.83 | 0.89 | 0.86 |
47 | | originationRelation (e2,e1) | 0.84 | 0.83 | 0.83 |
48 | | partitiveRelation (e1,e2) | 0.90 | 0.85 | 0.87 |
49 | | partitiveRelation (e2,e1) | 0.77 | 0.77 | 0.77 |
50 | | spatialRelation (e1,e2) | 0.90 | 0.91 | 0.91 |
51 | | spatialRelation (e2,e1) | 0.90 | 0.82 | 0.86 |
52 | 
--------------------------------------------------------------------------------
/Term Extraction/Term_Extraction_Token_Classifier_with_TERMEVAL(LDK).ipynb:
--------------------------------------------------------------------------------
1 | {
2 |   "nbformat": 4,
3 |   "nbformat_minor": 0,
4 |   "metadata": {
5 |     "accelerator": "GPU",
6 |     "colab": {
7 |       "name": "Term Extraction Token Classifier",
8 |       "provenance": [],
9 |       "collapsed_sections": [],
10 |       "authorship_tag": "ABX9TyMsd6QWA0JUgbYmp2ibk1Hf",
11 |       "include_colab_link": true
12 |     },
13 |     "kernelspec": {
14 |       "display_name": "Python 3",
15 | 
"name": "python3" 16 | }, 17 | "widgets": { 18 | "application/vnd.jupyter.widget-state+json": { 19 | "2c019229186e4a99b51ab9bbe4c5b784": { 20 | "model_module": "@jupyter-widgets/controls", 21 | "model_name": "HBoxModel", 22 | "state": { 23 | "_view_name": "HBoxView", 24 | "_dom_classes": [], 25 | "_model_name": "HBoxModel", 26 | "_view_module": "@jupyter-widgets/controls", 27 | "_model_module_version": "1.5.0", 28 | "_view_count": null, 29 | "_view_module_version": "1.5.0", 30 | "box_style": "", 31 | "layout": "IPY_MODEL_345ae3166eff49338827b7a5c7d93d05", 32 | "_model_module": "@jupyter-widgets/controls", 33 | "children": [ 34 | "IPY_MODEL_91d1ab635bc94af5abe773e177d09461", 35 | "IPY_MODEL_184b49ae656d47e6bfecc7cb3f508dd8" 36 | ] 37 | } 38 | }, 39 | "345ae3166eff49338827b7a5c7d93d05": { 40 | "model_module": "@jupyter-widgets/base", 41 | "model_name": "LayoutModel", 42 | "state": { 43 | "_view_name": "LayoutView", 44 | "grid_template_rows": null, 45 | "right": null, 46 | "justify_content": null, 47 | "_view_module": "@jupyter-widgets/base", 48 | "overflow": null, 49 | "_model_module_version": "1.2.0", 50 | "_view_count": null, 51 | "flex_flow": null, 52 | "width": null, 53 | "min_width": null, 54 | "border": null, 55 | "align_items": null, 56 | "bottom": null, 57 | "_model_module": "@jupyter-widgets/base", 58 | "top": null, 59 | "grid_column": null, 60 | "overflow_y": null, 61 | "overflow_x": null, 62 | "grid_auto_flow": null, 63 | "grid_area": null, 64 | "grid_template_columns": null, 65 | "flex": null, 66 | "_model_name": "LayoutModel", 67 | "justify_items": null, 68 | "grid_row": null, 69 | "max_height": null, 70 | "align_content": null, 71 | "visibility": null, 72 | "align_self": null, 73 | "height": null, 74 | "min_height": null, 75 | "padding": null, 76 | "grid_auto_rows": null, 77 | "grid_gap": null, 78 | "max_width": null, 79 | "order": null, 80 | "_view_module_version": "1.2.0", 81 | "grid_template_areas": null, 82 | "object_position": null, 83 | 
"object_fit": null, 84 | "grid_auto_columns": null, 85 | "margin": null, 86 | "display": null, 87 | "left": null 88 | } 89 | }, 90 | "91d1ab635bc94af5abe773e177d09461": { 91 | "model_module": "@jupyter-widgets/controls", 92 | "model_name": "FloatProgressModel", 93 | "state": { 94 | "_view_name": "ProgressView", 95 | "style": "IPY_MODEL_bd23992f5063466cbe78e07f5318663a", 96 | "_dom_classes": [], 97 | "description": "Downloading: 100%", 98 | "_model_name": "FloatProgressModel", 99 | "bar_style": "success", 100 | "max": 5069051, 101 | "_view_module": "@jupyter-widgets/controls", 102 | "_model_module_version": "1.5.0", 103 | "value": 5069051, 104 | "_view_count": null, 105 | "_view_module_version": "1.5.0", 106 | "orientation": "horizontal", 107 | "min": 0, 108 | "description_tooltip": null, 109 | "_model_module": "@jupyter-widgets/controls", 110 | "layout": "IPY_MODEL_fda1de93c9b64edaa54e83dae313ab9a" 111 | } 112 | }, 113 | "184b49ae656d47e6bfecc7cb3f508dd8": { 114 | "model_module": "@jupyter-widgets/controls", 115 | "model_name": "HTMLModel", 116 | "state": { 117 | "_view_name": "HTMLView", 118 | "style": "IPY_MODEL_10e7746c509342e9881fb2d7644d71c1", 119 | "_dom_classes": [], 120 | "description": "", 121 | "_model_name": "HTMLModel", 122 | "placeholder": "​", 123 | "_view_module": "@jupyter-widgets/controls", 124 | "_model_module_version": "1.5.0", 125 | "value": " 5.07M/5.07M [00:02<00:00, 1.72MB/s]", 126 | "_view_count": null, 127 | "_view_module_version": "1.5.0", 128 | "description_tooltip": null, 129 | "_model_module": "@jupyter-widgets/controls", 130 | "layout": "IPY_MODEL_dfae5588591b4cd0993d3324ad5d1c8d" 131 | } 132 | }, 133 | "bd23992f5063466cbe78e07f5318663a": { 134 | "model_module": "@jupyter-widgets/controls", 135 | "model_name": "ProgressStyleModel", 136 | "state": { 137 | "_view_name": "StyleView", 138 | "_model_name": "ProgressStyleModel", 139 | "description_width": "initial", 140 | "_view_module": "@jupyter-widgets/base", 141 | 
"_model_module_version": "1.5.0", 142 | "_view_count": null, 143 | "_view_module_version": "1.2.0", 144 | "bar_color": null, 145 | "_model_module": "@jupyter-widgets/controls" 146 | } 147 | }, 148 | "fda1de93c9b64edaa54e83dae313ab9a": { 149 | "model_module": "@jupyter-widgets/base", 150 | "model_name": "LayoutModel", 151 | "state": { 152 | "_view_name": "LayoutView", 153 | "grid_template_rows": null, 154 | "right": null, 155 | "justify_content": null, 156 | "_view_module": "@jupyter-widgets/base", 157 | "overflow": null, 158 | "_model_module_version": "1.2.0", 159 | "_view_count": null, 160 | "flex_flow": null, 161 | "width": null, 162 | "min_width": null, 163 | "border": null, 164 | "align_items": null, 165 | "bottom": null, 166 | "_model_module": "@jupyter-widgets/base", 167 | "top": null, 168 | "grid_column": null, 169 | "overflow_y": null, 170 | "overflow_x": null, 171 | "grid_auto_flow": null, 172 | "grid_area": null, 173 | "grid_template_columns": null, 174 | "flex": null, 175 | "_model_name": "LayoutModel", 176 | "justify_items": null, 177 | "grid_row": null, 178 | "max_height": null, 179 | "align_content": null, 180 | "visibility": null, 181 | "align_self": null, 182 | "height": null, 183 | "min_height": null, 184 | "padding": null, 185 | "grid_auto_rows": null, 186 | "grid_gap": null, 187 | "max_width": null, 188 | "order": null, 189 | "_view_module_version": "1.2.0", 190 | "grid_template_areas": null, 191 | "object_position": null, 192 | "object_fit": null, 193 | "grid_auto_columns": null, 194 | "margin": null, 195 | "display": null, 196 | "left": null 197 | } 198 | }, 199 | "10e7746c509342e9881fb2d7644d71c1": { 200 | "model_module": "@jupyter-widgets/controls", 201 | "model_name": "DescriptionStyleModel", 202 | "state": { 203 | "_view_name": "StyleView", 204 | "_model_name": "DescriptionStyleModel", 205 | "description_width": "", 206 | "_view_module": "@jupyter-widgets/base", 207 | "_model_module_version": "1.5.0", 208 | "_view_count": null, 209 | 
"_view_module_version": "1.2.0", 210 | "_model_module": "@jupyter-widgets/controls" 211 | } 212 | }, 213 | "dfae5588591b4cd0993d3324ad5d1c8d": { 214 | "model_module": "@jupyter-widgets/base", 215 | "model_name": "LayoutModel", 216 | "state": { 217 | "_view_name": "LayoutView", 218 | "grid_template_rows": null, 219 | "right": null, 220 | "justify_content": null, 221 | "_view_module": "@jupyter-widgets/base", 222 | "overflow": null, 223 | "_model_module_version": "1.2.0", 224 | "_view_count": null, 225 | "flex_flow": null, 226 | "width": null, 227 | "min_width": null, 228 | "border": null, 229 | "align_items": null, 230 | "bottom": null, 231 | "_model_module": "@jupyter-widgets/base", 232 | "top": null, 233 | "grid_column": null, 234 | "overflow_y": null, 235 | "overflow_x": null, 236 | "grid_auto_flow": null, 237 | "grid_area": null, 238 | "grid_template_columns": null, 239 | "flex": null, 240 | "_model_name": "LayoutModel", 241 | "justify_items": null, 242 | "grid_row": null, 243 | "max_height": null, 244 | "align_content": null, 245 | "visibility": null, 246 | "align_self": null, 247 | "height": null, 248 | "min_height": null, 249 | "padding": null, 250 | "grid_auto_rows": null, 251 | "grid_gap": null, 252 | "max_width": null, 253 | "order": null, 254 | "_view_module_version": "1.2.0", 255 | "grid_template_areas": null, 256 | "object_position": null, 257 | "object_fit": null, 258 | "grid_auto_columns": null, 259 | "margin": null, 260 | "display": null, 261 | "left": null 262 | } 263 | }, 264 | "5fb35ed19a634443810db6013a595bf9": { 265 | "model_module": "@jupyter-widgets/controls", 266 | "model_name": "HBoxModel", 267 | "state": { 268 | "_view_name": "HBoxView", 269 | "_dom_classes": [], 270 | "_model_name": "HBoxModel", 271 | "_view_module": "@jupyter-widgets/controls", 272 | "_model_module_version": "1.5.0", 273 | "_view_count": null, 274 | "_view_module_version": "1.5.0", 275 | "box_style": "", 276 | "layout": "IPY_MODEL_80536fcb301343ad8c43a05a68c58120", 277 | 
"_model_module": "@jupyter-widgets/controls", 278 | "children": [ 279 | "IPY_MODEL_a73b08b27a5740eb9320a4a467446088", 280 | "IPY_MODEL_7554d113a0f54da3b649252fd8969df3" 281 | ] 282 | } 283 | }, 284 | "80536fcb301343ad8c43a05a68c58120": { 285 | "model_module": "@jupyter-widgets/base", 286 | "model_name": "LayoutModel", 287 | "state": { 288 | "_view_name": "LayoutView", 289 | "grid_template_rows": null, 290 | "right": null, 291 | "justify_content": null, 292 | "_view_module": "@jupyter-widgets/base", 293 | "overflow": null, 294 | "_model_module_version": "1.2.0", 295 | "_view_count": null, 296 | "flex_flow": null, 297 | "width": null, 298 | "min_width": null, 299 | "border": null, 300 | "align_items": null, 301 | "bottom": null, 302 | "_model_module": "@jupyter-widgets/base", 303 | "top": null, 304 | "grid_column": null, 305 | "overflow_y": null, 306 | "overflow_x": null, 307 | "grid_auto_flow": null, 308 | "grid_area": null, 309 | "grid_template_columns": null, 310 | "flex": null, 311 | "_model_name": "LayoutModel", 312 | "justify_items": null, 313 | "grid_row": null, 314 | "max_height": null, 315 | "align_content": null, 316 | "visibility": null, 317 | "align_self": null, 318 | "height": null, 319 | "min_height": null, 320 | "padding": null, 321 | "grid_auto_rows": null, 322 | "grid_gap": null, 323 | "max_width": null, 324 | "order": null, 325 | "_view_module_version": "1.2.0", 326 | "grid_template_areas": null, 327 | "object_position": null, 328 | "object_fit": null, 329 | "grid_auto_columns": null, 330 | "margin": null, 331 | "display": null, 332 | "left": null 333 | } 334 | }, 335 | "a73b08b27a5740eb9320a4a467446088": { 336 | "model_module": "@jupyter-widgets/controls", 337 | "model_name": "FloatProgressModel", 338 | "state": { 339 | "_view_name": "ProgressView", 340 | "style": "IPY_MODEL_895b16272a4f4c32a1f572d73eaec700", 341 | "_dom_classes": [], 342 | "description": "Downloading: 100%", 343 | "_model_name": "FloatProgressModel", 344 | "bar_style": "success", 
345 | "max": 9096718, 346 | "_view_module": "@jupyter-widgets/controls", 347 | "_model_module_version": "1.5.0", 348 | "value": 9096718, 349 | "_view_count": null, 350 | "_view_module_version": "1.5.0", 351 | "orientation": "horizontal", 352 | "min": 0, 353 | "description_tooltip": null, 354 | "_model_module": "@jupyter-widgets/controls", 355 | "layout": "IPY_MODEL_947011ec13c7404390295e9e2ee39caf" 356 | } 357 | }, 358 | "7554d113a0f54da3b649252fd8969df3": { 359 | "model_module": "@jupyter-widgets/controls", 360 | "model_name": "HTMLModel", 361 | "state": { 362 | "_view_name": "HTMLView", 363 | "style": "IPY_MODEL_83db5fd4d6db4caa885164519ef833cc", 364 | "_dom_classes": [], 365 | "description": "", 366 | "_model_name": "HTMLModel", 367 | "placeholder": "​", 368 | "_view_module": "@jupyter-widgets/controls", 369 | "_model_module_version": "1.5.0", 370 | "value": " 9.10M/9.10M [00:01<00:00, 7.29MB/s]", 371 | "_view_count": null, 372 | "_view_module_version": "1.5.0", 373 | "description_tooltip": null, 374 | "_model_module": "@jupyter-widgets/controls", 375 | "layout": "IPY_MODEL_43b31066b58a40f18fd97596b38aef76" 376 | } 377 | }, 378 | "895b16272a4f4c32a1f572d73eaec700": { 379 | "model_module": "@jupyter-widgets/controls", 380 | "model_name": "ProgressStyleModel", 381 | "state": { 382 | "_view_name": "StyleView", 383 | "_model_name": "ProgressStyleModel", 384 | "description_width": "initial", 385 | "_view_module": "@jupyter-widgets/base", 386 | "_model_module_version": "1.5.0", 387 | "_view_count": null, 388 | "_view_module_version": "1.2.0", 389 | "bar_color": null, 390 | "_model_module": "@jupyter-widgets/controls" 391 | } 392 | }, 393 | "947011ec13c7404390295e9e2ee39caf": { 394 | "model_module": "@jupyter-widgets/base", 395 | "model_name": "LayoutModel", 396 | "state": { 397 | "_view_name": "LayoutView", 398 | "grid_template_rows": null, 399 | "right": null, 400 | "justify_content": null, 401 | "_view_module": "@jupyter-widgets/base", 402 | "overflow": null, 403 | 
"_model_module_version": "1.2.0", 404 | "_view_count": null, 405 | "flex_flow": null, 406 | "width": null, 407 | "min_width": null, 408 | "border": null, 409 | "align_items": null, 410 | "bottom": null, 411 | "_model_module": "@jupyter-widgets/base", 412 | "top": null, 413 | "grid_column": null, 414 | "overflow_y": null, 415 | "overflow_x": null, 416 | "grid_auto_flow": null, 417 | "grid_area": null, 418 | "grid_template_columns": null, 419 | "flex": null, 420 | "_model_name": "LayoutModel", 421 | "justify_items": null, 422 | "grid_row": null, 423 | "max_height": null, 424 | "align_content": null, 425 | "visibility": null, 426 | "align_self": null, 427 | "height": null, 428 | "min_height": null, 429 | "padding": null, 430 | "grid_auto_rows": null, 431 | "grid_gap": null, 432 | "max_width": null, 433 | "order": null, 434 | "_view_module_version": "1.2.0", 435 | "grid_template_areas": null, 436 | "object_position": null, 437 | "object_fit": null, 438 | "grid_auto_columns": null, 439 | "margin": null, 440 | "display": null, 441 | "left": null 442 | } 443 | }, 444 | "83db5fd4d6db4caa885164519ef833cc": { 445 | "model_module": "@jupyter-widgets/controls", 446 | "model_name": "DescriptionStyleModel", 447 | "state": { 448 | "_view_name": "StyleView", 449 | "_model_name": "DescriptionStyleModel", 450 | "description_width": "", 451 | "_view_module": "@jupyter-widgets/base", 452 | "_model_module_version": "1.5.0", 453 | "_view_count": null, 454 | "_view_module_version": "1.2.0", 455 | "_model_module": "@jupyter-widgets/controls" 456 | } 457 | }, 458 | "43b31066b58a40f18fd97596b38aef76": { 459 | "model_module": "@jupyter-widgets/base", 460 | "model_name": "LayoutModel", 461 | "state": { 462 | "_view_name": "LayoutView", 463 | "grid_template_rows": null, 464 | "right": null, 465 | "justify_content": null, 466 | "_view_module": "@jupyter-widgets/base", 467 | "overflow": null, 468 | "_model_module_version": "1.2.0", 469 | "_view_count": null, 470 | "flex_flow": null, 471 | 
"width": null, 472 | "min_width": null, 473 | "border": null, 474 | "align_items": null, 475 | "bottom": null, 476 | "_model_module": "@jupyter-widgets/base", 477 | "top": null, 478 | "grid_column": null, 479 | "overflow_y": null, 480 | "overflow_x": null, 481 | "grid_auto_flow": null, 482 | "grid_area": null, 483 | "grid_template_columns": null, 484 | "flex": null, 485 | "_model_name": "LayoutModel", 486 | "justify_items": null, 487 | "grid_row": null, 488 | "max_height": null, 489 | "align_content": null, 490 | "visibility": null, 491 | "align_self": null, 492 | "height": null, 493 | "min_height": null, 494 | "padding": null, 495 | "grid_auto_rows": null, 496 | "grid_gap": null, 497 | "max_width": null, 498 | "order": null, 499 | "_view_module_version": "1.2.0", 500 | "grid_template_areas": null, 501 | "object_position": null, 502 | "object_fit": null, 503 | "grid_auto_columns": null, 504 | "margin": null, 505 | "display": null, 506 | "left": null 507 | } 508 | } 509 | } 510 | } 511 | }, 512 | "cells": [ 513 | { 514 | "cell_type": "markdown", 515 | "metadata": { 516 | "id": "view-in-github", 517 | "colab_type": "text" 518 | }, 519 | "source": [ 520 | "\"Open" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": { 526 | "id": "0yMTmZptEkHC" 527 | }, 528 | "source": [ 529 | "# Imports\n" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "metadata": { 535 | "id": "aaxWLY9GFE2W" 536 | }, 537 | "source": [ 538 | "!pip install transformers\n", 539 | "!pip install sacremoses\n", 540 | "!pip install sentencepiece\n", 541 | "!pip install seqeval" 542 | ], 543 | "execution_count": null, 544 | "outputs": [] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "metadata": { 549 | "id": "m9fYtB3_FHuK" 550 | }, 551 | "source": [ 552 | "#torch and tranformers for model and training\n", 553 | "import torch \n", 554 | "from torch.utils.data import DataLoader, RandomSampler, SequentialSampler\n", 555 | "from torch.utils.data import TensorDataset\n", 556 | 
"from transformers import XLMRobertaTokenizerFast \n", 557 | "from transformers import XLMRobertaForTokenClassification\n", 558 | "from transformers import AdamW \n", 559 | "from transformers import get_linear_schedule_with_warmup\n", 560 | "from transformers import DataCollatorForTokenClassification\n", 561 | "from transformers import Trainer, TrainingArguments\n", 562 | "import sentencepiece\n", 563 | "\n", 564 | "#sklearn for evaluation\n", 565 | "from sklearn import preprocessing \n", 566 | "from sklearn.metrics import classification_report \n", 567 | "from sklearn.metrics import f1_score\n", 568 | "from sklearn.metrics import confusion_matrix\n", 569 | "from sklearn.model_selection import ParameterGrid \n", 570 | "from sklearn.model_selection import ParameterSampler \n", 571 | "from sklearn.utils.fixes import loguniform\n", 572 | "\n", 573 | "#nlp preprocessing\n", 574 | "from nltk import ngrams \n", 575 | "from spacy.pipeline import SentenceSegmenter\n", 576 | "from spacy.lang.en import English\n", 577 | "from spacy.pipeline import Sentencizer\n", 578 | "from sacremoses import MosesTokenizer, MosesDetokenizer\n", 579 | "\n", 580 | "\n", 581 | "#utilities\n", 582 | "from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score\n", 583 | "import pandas as pd\n", 584 | "import glob, os\n", 585 | "import time\n", 586 | "import datetime\n", 587 | "import random\n", 588 | "import numpy as np\n", 589 | "import matplotlib.pyplot as plt\n", 590 | "% matplotlib inline\n", 591 | "import seaborn as sns\n", 592 | "import pickle # for saving data structures\n", 593 | "from pynvml import * # for checking gpu memory" 594 | ], 595 | "execution_count": 4, 596 | "outputs": [] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "metadata": { 601 | "colab": { 602 | "base_uri": "https://localhost:8080/" 603 | }, 604 | "id": "YyekDMZ28gBc", 605 | "outputId": "ff6b00cb-4b3c-4a4f-ec85-e00296791586" 606 | }, 607 | "source": [ 608 | "# connect to GPU \n", 609 | 
"device = torch.device('cuda')\n", 610 | "\n", 611 | "print('Connected to GPU:', torch.cuda.get_device_name(0))" 612 | ], 613 | "execution_count": null, 614 | "outputs": [ 615 | { 616 | "output_type": "stream", 617 | "text": [ 618 | "Connected to GPU: Tesla T4\n" 619 | ], 620 | "name": "stdout" 621 | } 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": { 627 | "id": "3RPZ14sYHHUm" 628 | }, 629 | "source": [ 630 | "# Prepare Data" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": { 636 | "id": "TKqV3YfXHSNz" 637 | }, 638 | "source": [ 639 | "Training Data: corp, wind\n", 640 | "\n", 641 | "Validation Data: equi\n", 642 | "\n", 643 | "Test Data: htfl" 644 | ] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "metadata": { 649 | "id": "ERUBsPPOFfe1" 650 | }, 651 | "source": [ 652 | "#load terms\n", 653 | "\n", 654 | "#en\n", 655 | "df_corp_terms_en=pd.read_csv('ACTER-master/ACTER-master/en/corp/annotations/corp_en_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 656 | "df_equi_terms_en=pd.read_csv('ACTER-master/ACTER-master/en/equi/annotations/equi_en_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 657 | "df_htfl_terms_en=pd.read_csv('ACTER-master/ACTER-master/en/htfl/annotations/htfl_en_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 658 | "df_wind_terms_en=pd.read_csv('ACTER-master/ACTER-master/en/wind/annotations/wind_en_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 659 | "\n", 660 | "#fr\n", 661 | "df_corp_terms_fr=pd.read_csv('ACTER-master/ACTER-master/fr/corp/annotations/corp_fr_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 662 | "df_equi_terms_fr=pd.read_csv('ACTER-master/ACTER-master/fr/equi/annotations/equi_fr_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 663 | "df_htfl_terms_fr=pd.read_csv('ACTER-master/ACTER-master/fr/htfl/annotations/htfl_fr_terms_nes.ann', delimiter=\"\\t\", 
names=[\"Term\", \"Label\"]) \n", 664 | "df_wind_terms_fr=pd.read_csv('ACTER-master/ACTER-master/fr/wind/annotations/wind_fr_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 665 | "\n", 666 | "#nl\n", 667 | "df_corp_terms_nl=pd.read_csv('ACTER-master/ACTER-master/nl/corp/annotations/corp_nl_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 668 | "df_equi_terms_nl=pd.read_csv('ACTER-master/ACTER-master/nl/equi/annotations/equi_nl_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 669 | "df_htfl_terms_nl=pd.read_csv('ACTER-master/ACTER-master/nl/htfl/annotations/htfl_nl_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 670 | "df_wind_terms_nl=pd.read_csv('ACTER-master/ACTER-master/nl/wind/annotations/wind_nl_terms_nes.ann', delimiter=\"\\t\", names=[\"Term\", \"Label\"]) \n", 671 | "\n", 672 | "labels=[\"Random\", \"Term\"]" 673 | ], 674 | "execution_count": null, 675 | "outputs": [] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "metadata": { 680 | "colab": { 681 | "base_uri": "https://localhost:8080/", 682 | "height": 419 683 | }, 684 | "id": "tw11QcsHF8Gc", 685 | "outputId": "18b520b6-2e7f-43e3-8b14-5ea4de9e6edf" 686 | }, 687 | "source": [ 688 | "# show dataframe\n", 689 | "df_wind_terms_en" 690 | ], 691 | "execution_count": null, 692 | "outputs": [ 693 | { 694 | "output_type": "execute_result", 695 | "data": { 696 | "text/html": [ 697 | "
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead><tr style=\"text-align: right;\"><th></th><th>Term</th><th>Label</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>48/600</td><td>Named_Entity</td></tr>\n",
"    <tr><th>1</th><td>4energia</td><td>Named_Entity</td></tr>\n",
"    <tr><th>2</th><td>4energy</td><td>Named_Entity</td></tr>\n",
"    <tr><th>3</th><td>ab \"lietuvos energija\"</td><td>Named_Entity</td></tr>\n",
"    <tr><th>4</th><td>ab lietuvos elektrine</td><td>Named_Entity</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>1529</th><td>zhiquan</td><td>Named_Entity</td></tr>\n",
"    <tr><th>1530</th><td>çetinkaya</td><td>Named_Entity</td></tr>\n",
"    <tr><th>1531</th><td>çeti̇nkaya</td><td>Named_Entity</td></tr>\n",
"    <tr><th>1532</th><td>çeşme</td><td>Named_Entity</td></tr>\n",
"    <tr><th>1533</th><td>özgen</td><td>Named_Entity</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>1534 rows × 2 columns</p>
" 779 | ], 780 | "text/plain": [ 781 | " Term Label\n", 782 | "0 48/600 Named_Entity\n", 783 | "1 4energia Named_Entity\n", 784 | "2 4energy Named_Entity\n", 785 | "3 ab \"lietuvos energija\" Named_Entity\n", 786 | "4 ab lietuvos elektrine Named_Entity\n", 787 | "... ... ...\n", 788 | "1529 zhiquan Named_Entity\n", 789 | "1530 çetinkaya Named_Entity\n", 790 | "1531 çeti̇nkaya Named_Entity\n", 791 | "1532 çeşme Named_Entity\n", 792 | "1533 özgen Named_Entity\n", 793 | "\n", 794 | "[1534 rows x 2 columns]" 795 | ] 796 | }, 797 | "metadata": { 798 | "tags": [] 799 | }, 800 | "execution_count": 7 801 | } 802 | ] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "metadata": { 807 | "id": "sU7NMPaDvbWt" 808 | }, 809 | "source": [ 810 | "**Functions for preprocessing and creating of Training Data**" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "metadata": { 816 | "id": "3_stqlIDvZxA" 817 | }, 818 | "source": [ 819 | "#load all text files from folder into a string\n", 820 | "def load_text_corpus(path):\n", 821 | " text_data=\"\"\n", 822 | " print(glob.glob(path))\n", 823 | " for file in glob.glob(path+\"*.txt\"):\n", 824 | " print(file)\n", 825 | " with open(file) as f:\n", 826 | " temp_data = f.read()\n", 827 | " print(len(temp_data))\n", 828 | " text_data=text_data+\" \"+temp_data\n", 829 | " print(len(text_data))\n", 830 | " return text_data" 831 | ], 832 | "execution_count": null, 833 | "outputs": [] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "metadata": { 838 | "id": "4nXtHwAyPoK0" 839 | }, 840 | "source": [ 841 | "#split in sentences and tokenize\n", 842 | "def preprocess(text):\n", 843 | " #sentenize (from spacy)\n", 844 | " sentencizer = Sentencizer()\n", 845 | " nlp = English()\n", 846 | " nlp.add_pipe(sentencizer)\n", 847 | " doc = nlp(text)\n", 848 | "\n", 849 | " #tokenize\n", 850 | " sentence_list=[]\n", 851 | " mt = MosesTokenizer(lang='en')\n", 852 | " for s in doc.sents:\n", 853 | " tokenized_text = mt.tokenize(s, return_str=True) 
#append tuple of tokens and original sentence\n",
"        sentence_list.append((tokenized_text.split(), s))\n",
854 |         "    return sentence_list\n"
855 | ],
856 | "execution_count": null,
857 | "outputs": []
858 | },
859 | {
860 | "cell_type": "code",
861 | "metadata": {
862 | "id": "PPUGg3GD9d5W"
863 | },
864 | "source": [
865 | "#find indices of a sublist sub in a list l\n",
866 | "def find_sub_list(subl,l):\n",
867 | "    results=[]\n",
868 | "    subllen=len(subl)\n",
869 | "    for ind in (i for i,e in enumerate(l) if e==subl[0]):\n",
870 | "        if l[ind:ind+subllen]==subl:\n",
871 | "            results.append((ind,ind+subllen-1))\n",
872 | "\n",
873 | "    return results"
874 | ],
875 | "execution_count": null,
876 | "outputs": []
877 | },
878 | {
879 | "cell_type": "code",
880 | "metadata": {
881 | "id": "1qBA_KhoQkhB"
882 | },
883 | "source": [
884 | "#input is a list of sentences and a dataframe containing terms\n",
885 | "def create_training_data(sentence_list, df_terms, n):\n",
886 | "\n",
887 | "    #create empty list for the training examples\n",
888 | "    training_data = []\n",
889 | "\n",
890 | "    md = MosesDetokenizer(lang='en')\n",
891 | "\n",
892 | "    print(len(sentence_list))\n",
893 | "    count=0\n",
894 | "\n",
895 | "    for sen in sentence_list:\n",
896 | "        count+=1\n",
897 | "        if count%100==0:print(count)\n",
898 | "\n",
899 | "        s=sen[0] #take first part of tuple, i.e. 
the tokens\n", 900 | "\n", 901 | " #create label list, with \"n\" for non-terms, \"B-T\" for beginning of a term and \"T\" for the continuation of a term\n", 902 | " tags=[\"n\"]*len(s)\n", 903 | "\n", 904 | " # 1-gram up to n-gram\n", 905 | " for i in range(1,n+1):\n", 906 | " #create n-grams of this sentence\n", 907 | " n_grams = ngrams(s, i)\n", 908 | "\n", 909 | " #look if n-grams are in the annotation dataset\n", 910 | " for n_gram in n_grams: \n", 911 | " n_gram_aslist=list(n_gram)\n", 912 | " n_gram=md.detokenize(n_gram) \n", 913 | " context=str(sen[1]).strip()\n", 914 | " #if yes add an entry to the training data\n", 915 | " if n_gram.lower() in df_terms.values:\n", 916 | " #check where n_gram is in sentence and annotate it \n", 917 | " #print(n_gram_aslist,s)\n", 918 | " sublist_indices=find_sub_list(n_gram_aslist, s)\n", 919 | " for indices in sublist_indices:\n", 920 | " for ind in range(indices[0],indices[1]+1):\n", 921 | " #if term start\n", 922 | " if ind==indices[0]:\n", 923 | " tags[ind]=\"B-T\"\n", 924 | " #if continuation of a Term\n", 925 | " else: \n", 926 | " tags[ind]=\"T\"\n", 927 | "\n", 928 | " training_data.append((s,tags))\n", 929 | " \n", 930 | "\n", 931 | " return training_data\n", 932 | "\n", 933 | " " 934 | ], 935 | "execution_count": null, 936 | "outputs": [] 937 | }, 938 | { 939 | "cell_type": "markdown", 940 | "metadata": { 941 | "id": "4HhBTwYl1-dy" 942 | }, 943 | "source": [ 944 | "**Create Training Data**" 945 | ] 946 | }, 947 | { 948 | "cell_type": "code", 949 | "metadata": { 950 | "id": "UemCf-2xPrn1" 951 | }, 952 | "source": [ 953 | "#create trainings data for all corp texts\n", 954 | "corp_text_en=load_text_corpus(\"ACTER-master/ACTER-master/en/corp/texts/annotated/\") # load text\n", 955 | "corp_s_list=preprocess(corp_text_en) # preprocess\n", 956 | "train_data_corp_en=create_training_data(corp_s_list, df_corp_terms_en, 6) # create training data" 957 | ], 958 | "execution_count": null, 959 | "outputs": [] 960 | }, 961 | { 
962 | "cell_type": "code", 963 | "metadata": { 964 | "id": "gzCWke974lJS" 965 | }, 966 | "source": [ 967 | "#create training data for all wind texts\n", 968 | "wind_text_en=load_text_corpus(\"ACTER-master/ACTER-master/en/wind/texts/annotated/\") # load text\n", 969 | "wind_s_list=preprocess(wind_text_en) # preprocess\n", 970 | "train_data_wind_en=create_training_data(wind_s_list, df_wind_terms_en, 6) # create training data" 971 | ], 972 | "execution_count": null, 973 | "outputs": [] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "metadata": { 978 | "id": "iGeV2rgS4lbn" 979 | }, 980 | "source": [ 981 | "#create training data for all equi texts\n", 982 | "equi_text_en=load_text_corpus(\"ACTER-master/ACTER-master/en/equi/texts/annotated/\") # load text\n", 983 | "equi_s_list=preprocess(equi_text_en) # preprocess\n", 984 | "train_data_equi_en=create_training_data(equi_s_list, df_equi_terms_en, 6) # create training data" 985 | ], 986 | "execution_count": null, 987 | "outputs": [] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "metadata": { 992 | "id": "HCsU8wUE4lmk" 993 | }, 994 | "source": [ 995 | "#create training data for all htfl texts\n", 996 | "htfl_text_en=load_text_corpus(\"ACTER-master/ACTER-master/en/htfl/texts/annotated/\") # load text\n", 997 | "htfl_s_list=preprocess(htfl_text_en) # preprocess\n", 998 | "train_data_htfl_en=create_training_data(htfl_s_list, df_htfl_terms_en, 6) # create training data " 999 | ], 1000 | "execution_count": null, 1001 | "outputs": [] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "metadata": { 1006 | "id": "mSju0fa5m5Vj" 1007 | }, 1008 | "source": [ 1009 | "#fr\n", 1010 | "corp_text_fr=load_text_corpus(\"ACTER-master/ACTER-master/fr/corp/texts/annotated/\") # load text\n", 1011 | "corp_s_list=preprocess(corp_text_fr) # preprocess\n", 1012 | "train_data_corp_fr=create_training_data(corp_s_list, df_corp_terms_fr, 6) # create training data\n", 1013 | "\n", 1014 |
"wind_text_fr=load_text_corpus(\"ACTER-master/ACTER-master/fr/wind/texts/annotated/\") # load text\n", 1015 | "wind_s_list=preprocess(wind_text_fr) # preprocess\n", 1016 | "train_data_wind_fr=create_training_data(wind_s_list, df_wind_terms_fr, 6) # create training data\n", 1017 | "\n", 1018 | "equi_text_fr=load_text_corpus(\"ACTER-master/ACTER-master/fr/equi/texts/annotated/\") # load text\n", 1019 | "equi_s_list=preprocess(equi_text_fr) # preprocess\n", 1020 | "train_data_equi_fr=create_training_data(equi_s_list, df_equi_terms_fr, 6) # create training data\n", 1021 | "\n", 1022 | "htfl_text_fr=load_text_corpus(\"ACTER-master/ACTER-master/fr/htfl/texts/annotated/\") # load text\n", 1023 | "htfl_s_list=preprocess(htfl_text_fr) # preprocess\n", 1024 | "train_data_htfl_fr=create_training_data(htfl_s_list, df_htfl_terms_fr, 6) # create training data " 1025 | ], 1026 | "execution_count": null, 1027 | "outputs": [] 1028 | }, 1029 | { 1030 | "cell_type": "code", 1031 | "metadata": { 1032 | "id": "IQt-Z0p2m5Zy" 1033 | }, 1034 | "source": [ 1035 | "#nl\n", 1036 | "corp_text_nl=load_text_corpus(\"ACTER-master/ACTER-master/nl/corp/texts/annotated/\") # load text\n", 1037 | "corp_s_list=preprocess(corp_text_nl) # preprocess\n", 1038 | "train_data_corp_nl=create_training_data(corp_s_list, df_corp_terms_nl, 6) # create training data\n", 1039 | "\n", 1040 | "wind_text_nl=load_text_corpus(\"ACTER-master/ACTER-master/nl/wind/texts/annotated/\") # load text\n", 1041 | "wind_s_list=preprocess(wind_text_nl) # preprocess\n", 1042 | "train_data_wind_nl=create_training_data(wind_s_list, df_wind_terms_nl, 6) # create training data\n", 1043 | "\n", 1044 | "equi_text_nl=load_text_corpus(\"ACTER-master/ACTER-master/nl/equi/texts/annotated/\") # load text\n", 1045 | "equi_s_list=preprocess(equi_text_nl) # preprocess\n", 1046 | "train_data_equi_nl=create_training_data(equi_s_list, df_equi_terms_nl, 6) # create training data\n", 1047 | "\n", 1048 | 
"htfl_text_nl=load_text_corpus(\"ACTER-master/ACTER-master/nl/htfl/texts/annotated/\") # load text\n", 1049 | "htfl_s_list=preprocess(htfl_text_nl) # preprocess\n", 1050 | "train_data_htfl_nl=create_training_data(htfl_s_list, df_htfl_terms_nl, 6) # create training data " 1051 | ], 1052 | "execution_count": null, 1053 | "outputs": [] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "metadata": { 1058 | "colab": { 1059 | "base_uri": "https://localhost:8080/" 1060 | }, 1061 | "id": "VSy8hZggPQpf", 1062 | "outputId": "ee451206-219a-4318-d90c-898d39c15bec" 1063 | }, 1064 | "source": [ 1065 | "#concatenate training data\n", 1066 | "trainings_data = train_data_corp_en + train_data_wind_en\n", 1067 | "\n", 1068 | "val_data = train_data_equi_en + train_data_equi_fr + train_data_equi_nl\n", 1069 | "val_data_en = train_data_equi_en\n", 1070 | "val_data_fr = train_data_equi_fr\n", 1071 | "val_data_nl = train_data_equi_nl\n", 1072 | "\n", 1073 | "test_data = train_data_htfl_en + train_data_htfl_fr + train_data_htfl_nl\n", 1074 | "test_data_en = train_data_htfl_en\n", 1075 | "test_data_fr = train_data_htfl_fr\n", 1076 | "test_data_nl = train_data_htfl_nl\n", 1077 | "\n", 1078 | "gold_set_for_validation=set(df_equi_terms_en[\"Term\"]).union(set(df_equi_terms_fr[\"Term\"])).union(set(df_equi_terms_nl[\"Term\"])) \n", 1079 | "\n", 1080 | "print(len(trainings_data))\n", 1081 | "print(len(val_data))\n", 1082 | "print(len(test_data))" 1083 | ], 1084 | "execution_count": null, 1085 | "outputs": [ 1086 | { 1087 | "output_type": "stream", 1088 | "text": [ 1089 | "3449\n", 1090 | "7978\n", 1091 | "6416\n" 1092 | ], 1093 | "name": "stdout" 1094 | } 1095 | ] 1096 | }, 1097 | { 1098 | "cell_type": "code", 1099 | "metadata": { 1100 | "id": "OdoXY46fSoxS" 1101 | }, 1102 | "source": [ 1103 | "#separate tokens and tags\n", 1104 | "\n", 1105 | "#train\n", 1106 | "train_tags=[tup[1] for tup in trainings_data]\n", 1107 | "train_texts=[tup[0] for tup in trainings_data]\n", 1108 | "\n", 1109 |
"#val\n", 1110 | "val_tags=[tup[1] for tup in val_data]\n", 1111 | "val_texts=[tup[0] for tup in val_data]\n", 1112 | "\n", 1113 | "val_tags_en=[tup[1] for tup in val_data_en]\n", 1114 | "val_texts_en=[tup[0] for tup in val_data_en]\n", 1115 | "\n", 1116 | "val_tags_fr=[tup[1] for tup in val_data_fr]\n", 1117 | "val_texts_fr=[tup[0] for tup in val_data_fr]\n", 1118 | "\n", 1119 | "val_tags_nl=[tup[1] for tup in val_data_nl]\n", 1120 | "val_texts_nl=[tup[0] for tup in val_data_nl]\n", 1121 | "\n", 1122 | "#test\n", 1123 | "test_tags=[tup[1] for tup in test_data]\n", 1124 | "test_texts=[tup[0] for tup in test_data]\n", 1125 | "\n", 1126 | "test_tags_en=[tup[1] for tup in test_data_en]\n", 1127 | "test_texts_en=[tup[0] for tup in test_data_en]\n", 1128 | "\n", 1129 | "test_tags_fr=[tup[1] for tup in test_data_fr]\n", 1130 | "test_texts_fr=[tup[0] for tup in test_data_fr]\n", 1131 | "\n", 1132 | "test_tags_nl=[tup[1] for tup in test_data_nl]\n", 1133 | "test_texts_nl=[tup[0] for tup in test_data_nl]" 1134 | ], 1135 | "execution_count": null, 1136 | "outputs": [] 1137 | }, 1138 | { 1139 | "cell_type": "markdown", 1140 | "metadata": { 1141 | "id": "wVxAsANXfpDv" 1142 | }, 1143 | "source": [ 1144 | "# Tokenize " 1145 | ] 1146 | }, 1147 | { 1148 | "cell_type": "code", 1149 | "metadata": { 1150 | "id": "_ieMHql0gobX", 1151 | "colab": { 1152 | "base_uri": "https://localhost:8080/", 1153 | "height": 115, 1154 | "referenced_widgets": [ 1155 | "2c019229186e4a99b51ab9bbe4c5b784", 1156 | "345ae3166eff49338827b7a5c7d93d05", 1157 | "91d1ab635bc94af5abe773e177d09461", 1158 | "184b49ae656d47e6bfecc7cb3f508dd8", 1159 | "bd23992f5063466cbe78e07f5318663a", 1160 | "fda1de93c9b64edaa54e83dae313ab9a", 1161 | "10e7746c509342e9881fb2d7644d71c1", 1162 | "dfae5588591b4cd0993d3324ad5d1c8d", 1163 | "5fb35ed19a634443810db6013a595bf9", 1164 | "80536fcb301343ad8c43a05a68c58120", 1165 | "a73b08b27a5740eb9320a4a467446088", 1166 | "7554d113a0f54da3b649252fd8969df3", 1167 | 
"895b16272a4f4c32a1f572d73eaec700", 1168 | "947011ec13c7404390295e9e2ee39caf", 1169 | "83db5fd4d6db4caa885164519ef833cc", 1170 | "43b31066b58a40f18fd97596b38aef76" 1171 | ] 1172 | }, 1173 | "outputId": "f0964c84-91dc-4937-eaae-33ef4a4f7695" 1174 | }, 1175 | "source": [ 1176 | "tokenizer = XLMRobertaTokenizerFast.from_pretrained(\"xlm-roberta-base\")" 1177 | ], 1178 | "execution_count": null, 1179 | "outputs": [ 1180 | { 1181 | "output_type": "display_data", 1182 | "data": { 1183 | "application/vnd.jupyter.widget-view+json": { 1184 | "model_id": "2c019229186e4a99b51ab9bbe4c5b784", 1185 | "version_minor": 0, 1186 | "version_major": 2 1187 | }, 1188 | "text/plain": [ 1189 | "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=5069051.0, style=ProgressStyle(descript…" 1190 | ] 1191 | }, 1192 | "metadata": { 1193 | "tags": [] 1194 | } 1195 | }, 1196 | { 1197 | "output_type": "stream", 1198 | "text": [ 1199 | "\n" 1200 | ], 1201 | "name": "stdout" 1202 | }, 1203 | { 1204 | "output_type": "display_data", 1205 | "data": { 1206 | "application/vnd.jupyter.widget-view+json": { 1207 | "model_id": "5fb35ed19a634443810db6013a595bf9", 1208 | "version_minor": 0, 1209 | "version_major": 2 1210 | }, 1211 | "text/plain": [ 1212 | "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=9096718.0, style=ProgressStyle(descript…" 1213 | ] 1214 | }, 1215 | "metadata": { 1216 | "tags": [] 1217 | } 1218 | }, 1219 | { 1220 | "output_type": "stream", 1221 | "text": [ 1222 | "\n" 1223 | ], 1224 | "name": "stdout" 1225 | } 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "code", 1230 | "metadata": { 1231 | "id": "XYftDnmguJMr" 1232 | }, 1233 | "source": [ 1234 | "#align labels with tokenization from XLM-R\n", 1235 | "label_list=[\"n\", \"B-T\", \"T\"]\n", 1236 | "label_to_id = {l: i for i, l in enumerate(label_list)}\n", 1237 | "num_labels=len(label_list)\n", 1238 | "\n", 1239 | "def tokenize_and_align_labels(texts, tags):\n", 1240 | " tokenized_inputs = 
tokenizer(\n", 1241 | " texts,\n", 1242 | " padding=True,\n", 1243 | " truncation=True,\n", 1244 | " # We use this argument because the texts in our dataset are lists of words (with a label for each word).\n", 1245 | " is_split_into_words=True,\n", 1246 | " )\n", 1247 | " labels = []\n", 1248 | " for i, label in enumerate(tags):\n", 1249 | " word_ids = tokenized_inputs.word_ids(batch_index=i)\n", 1250 | " previous_word_idx = None\n", 1251 | " label_ids = []\n", 1252 | " for word_idx in word_ids:\n", 1253 | " # Special tokens have a word id that is None. We set the label to -100 so they are automatically\n", 1254 | " # ignored in the loss function.\n", 1255 | " if word_idx is None:\n", 1256 | " label_ids.append(-100)\n", 1257 | " # We set the label for the first token of each word.\n", 1258 | " elif word_idx != previous_word_idx:\n", 1259 | " label_ids.append(label_to_id[label[word_idx]])\n", 1260 | " # For the other sub-tokens of a word, we set the label to -100 so that only the first\n", 1261 | " # sub-token of each word contributes to the loss.\n", 1262 | " else:\n", 1263 | " label_ids.append(-100)\n", 1264 | " previous_word_idx = word_idx\n", 1265 | "\n", 1266 | " labels.append(label_ids)\n", 1267 | " tokenized_inputs[\"labels\"] = labels\n", 1268 | " return tokenized_inputs \n", 1269 | "\n", 1270 | "\n", 1271 | "train_input_and_labels = tokenize_and_align_labels(train_texts, train_tags)\n", 1272 | "\n", 1273 | "val_input_and_labels = tokenize_and_align_labels(val_texts, val_tags)\n", 1274 | "val_input_and_labels_en = tokenize_and_align_labels(val_texts_en, val_tags_en)\n", 1275 | "val_input_and_labels_fr = tokenize_and_align_labels(val_texts_fr, val_tags_fr)\n", 1276 | "val_input_and_labels_nl = tokenize_and_align_labels(val_texts_nl, val_tags_nl)\n", 1277 | "\n", 1278 | "test_input_and_labels = tokenize_and_align_labels(test_texts, test_tags)\n", 1279 | "test_input_and_labels_en = tokenize_and_align_labels(test_texts_en, test_tags_en)\n", 1280 |
"test_input_and_labels_fr = tokenize_and_align_labels(test_texts_fr, test_tags_fr)\n", 1281 | "test_input_and_labels_nl = tokenize_and_align_labels(test_texts_nl, test_tags_nl)\n", 1282 | "\n" 1283 | ], 1284 | "execution_count": null, 1285 | "outputs": [] 1286 | }, 1287 | { 1288 | "cell_type": "code", 1289 | "metadata": { 1290 | "id": "6lcPXbZ22yWG" 1291 | }, 1292 | "source": [ 1293 | "# create dataset that can be used for training with the huggingface trainer\n", 1294 | "class OurDataset(torch.utils.data.Dataset):\n", 1295 | " def __init__(self, encodings, labels):\n", 1296 | " self.encodings = encodings\n", 1297 | " self.labels = labels\n", 1298 | "\n", 1299 | " def __getitem__(self, idx):\n", 1300 | " item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}\n", 1301 | " item['labels'] = torch.tensor(self.labels[idx])\n", 1302 | " return item\n", 1303 | "\n", 1304 | " def __len__(self):\n", 1305 | " return len(self.labels)\n", 1306 | "\n", 1307 | "train_dataset = OurDataset(train_input_and_labels, train_input_and_labels[\"labels\"])\n", 1308 | "\n", 1309 | "val_dataset = OurDataset(val_input_and_labels, val_input_and_labels[\"labels\"])\n", 1310 | "val_dataset_en = OurDataset(val_input_and_labels_en, val_input_and_labels_en[\"labels\"])\n", 1311 | "val_dataset_fr = OurDataset(val_input_and_labels_fr, val_input_and_labels_fr[\"labels\"])\n", 1312 | "val_dataset_nl = OurDataset(val_input_and_labels_nl, val_input_and_labels_nl[\"labels\"])\n", 1313 | "\n", 1314 | "test_dataset = OurDataset(test_input_and_labels, test_input_and_labels[\"labels\"])\n", 1315 | "test_dataset_en = OurDataset(test_input_and_labels_en, test_input_and_labels_en[\"labels\"])\n", 1316 | "test_dataset_fr = OurDataset(test_input_and_labels_fr, test_input_and_labels_fr[\"labels\"])\n", 1317 | "test_dataset_nl = OurDataset(test_input_and_labels_nl, test_input_and_labels_nl[\"labels\"])" 1318 | ], 1319 | "execution_count": null, 1320 | "outputs": [] 1321 | }, 1322 | { 1323 | 
"cell_type": "markdown", 1324 | "metadata": { 1325 | "id": "9miQ8_HxKqGB" 1326 | }, 1327 | "source": [ 1328 | "# Training" 1329 | ] 1330 | }, 1331 | { 1332 | "cell_type": "code", 1333 | "metadata": { 1334 | "id": "lE8JxaF5T9Yd" 1335 | }, 1336 | "source": [ 1337 | "# return the extracted terms given the token level prediction and the original texts\n", 1338 | "\n", 1339 | "def extract_terms(token_predictions, val_texts):\n", 1340 | " extracted_terms = set()\n", 1341 | " # go over all predictions\n", 1342 | " for i in range(len(token_predictions)):\n", 1343 | " pred = token_predictions[i]\n", 1344 | " txt = val_texts[i]\n", 1345 | " for j in range(len(pred)):\n", 1346 | " # on a term beginning (B-T), collect the full term and add it to the set\n", 1347 | " if pred[j]==\"B-T\":\n", 1348 | " term=txt[j]\n", 1349 | " for k in range(j+1,len(pred)):\n", 1350 | " if pred[k]==\"T\": term+=\" \"+txt[k]\n", 1351 | " else: break\n", 1352 | " extracted_terms.add(term)\n", 1353 | " return extracted_terms" 1354 | ], 1355 | "execution_count": null, 1356 | "outputs": [] 1357 | }, 1358 | { 1359 | "cell_type": "code", 1360 | "metadata": { 1361 | "id": "5FiN4TVUTDXL" 1362 | }, 1363 | "source": [ 1364 | "#compute the metrics TermEval style for Trainer\n", 1365 | "\n", 1366 | "def compute_metrics(p):\n", 1367 | " predictions, labels = p\n", 1368 | " predictions = np.argmax(predictions, axis=2)\n", 1369 | "\n", 1370 | " # Remove ignored index (special tokens)\n", 1371 | " true_predictions = [\n", 1372 | " [label_list[p] for (p, l) in zip(prediction, label) if l != -100]\n", 1373 | " for prediction, label in zip(predictions, labels)\n", 1374 | " ]\n", 1375 | "\n", 1376 | " extracted_terms=extract_terms(true_predictions, val_texts) # uses the global val_texts defined above\n", 1377 | " extracted_terms = set([item.lower() for item in extracted_terms])\n", 1378 | " gold_set=gold_set_for_validation # uses the global gold_set_for_validation defined above\n", 1379 | "\n", 1380 | " true_pos=extracted_terms.intersection(gold_set)\n", 1381 |
recall=len(true_pos)/len(gold_set)\n", 1382 | " precision=len(true_pos)/len(extracted_terms) if extracted_terms else 0 # avoid division by zero if nothing was extracted\n", 1383 | "\n", 1384 | " return {\n", 1385 | " \"precision\": precision,\n", 1386 | " \"recall\": recall,\n", 1387 | " \"f1\": 2*(precision*recall)/(precision+recall) if precision+recall>0 else 0,\n", 1388 | " }" 1389 | ], 1390 | "execution_count": null, 1391 | "outputs": [] 1392 | }, 1393 | { 1394 | "cell_type": "code", 1395 | "metadata": { 1396 | "id": "AcDLjK-i4-Y8" 1397 | }, 1398 | "source": [ 1399 | "# training arguments\n", 1400 | "\n", 1401 | "training_args = TrainingArguments(\n", 1402 | " output_dir='./results', # output directory\n", 1403 | " num_train_epochs=1, # total # of training epochs\n", 1404 | " per_device_train_batch_size=8, # batch size per device during training\n", 1405 | " per_device_eval_batch_size=16, # batch size for evaluation\n", 1406 | " warmup_steps=0, # number of warmup steps for learning rate scheduler\n", 1407 | " weight_decay=0, # strength of weight decay\n", 1408 | " learning_rate=2e-5,\n", 1409 | " logging_dir='./logs', # directory for storing logs\n", 1410 | " evaluation_strategy=\"no\", # set to \"steps\" or \"epoch\" to evaluate during training\n", 1411 | " eval_steps=100,\n", 1412 | " #save_total_limit=1,\n", 1413 | " load_best_model_at_end=True, #loads the model with the best evaluation score\n", 1414 | " metric_for_best_model=\"f1\",\n", 1415 | " greater_is_better=True\n", 1416 | ")" 1417 | ], 1418 | "execution_count": null, 1419 | "outputs": [] 1420 | }, 1421 | { 1422 | "cell_type": "code", 1423 | "metadata": { 1424 | "colab": { 1425 | "base_uri": "https://localhost:8080/" 1426 | }, 1427 | "id": "yJf_Rnyf26el", 1428 | "outputId": "742cd3c9-5bbf-443d-95c8-75ee69d0590e" 1429 | }, 1430 | "source": [ 1431 | "# initialize model\n", 1432 | "model = XLMRobertaForTokenClassification.from_pretrained(\"xlm-roberta-base\", num_labels=num_labels)\n" 1433 | ], 1434 | "execution_count": null, 1435 | "outputs": [ 1436 | { 1437 | "output_type": "stream", 1438 | "text": [ 1439 | "Some weights of the
model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']\n", 1440 | "- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", 1441 | "- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", 1442 | "Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.weight', 'classifier.bias']\n", 1443 | "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" 1444 | ], 1445 | "name": "stderr" 1446 | } 1447 | ] 1448 | }, 1449 | { 1450 | "cell_type": "code", 1451 | "metadata": { 1452 | "id": "wEg8krxu-MY4" 1453 | }, 1454 | "source": [ 1455 | "# initialize huggingface trainer\n", 1456 | "trainer = Trainer(\n", 1457 | " model=model,\n", 1458 | " args=training_args,\n", 1459 | " train_dataset=train_dataset,\n", 1460 | " eval_dataset=val_dataset,\n", 1461 | " tokenizer=tokenizer,\n", 1462 | " compute_metrics=compute_metrics,\n", 1463 | " )" 1464 | ], 1465 | "execution_count": null, 1466 | "outputs": [] 1467 | }, 1468 | { 1469 | "cell_type": "code", 1470 | "metadata": { 1471 | "id": "nZIimqAK28qS" 1472 | }, 1473 | "source": [ 1474 | "# train\n", 1475 | "trainer.train()" 1476 | ], 1477 | "execution_count": null, 1478 | "outputs": [] 1479 | }, 1480 | { 1481 | "cell_type": "markdown", 1482 | "metadata": { 1483 | "id": "P_dYhX6t2FsM" 1484 | }, 1485 | 
"source": [ 1486 | "# Test Set Evaluation" 1487 | ] 1488 | }, 1489 | { 1490 | "cell_type": "code", 1491 | "metadata": { 1492 | "colab": { 1493 | "base_uri": "https://localhost:8080/", 1494 | "height": 105 1495 | }, 1496 | "id": "Xcy0k_pfwjxi", 1497 | "outputId": "56b40925-bb94-48d9-a1c8-5f89f63fc0ad" 1498 | }, 1499 | "source": [ 1500 | "#test\n", 1501 | "test_predictions, test_labels, test_metrics = trainer.predict(test_dataset)\n", 1502 | "test_predictions = np.argmax(test_predictions, axis=2)\n", 1503 | "# Remove ignored index (special tokens)\n", 1504 | "true_test_predictions = [\n", 1505 | " [label_list[p] for (p, l) in zip(test_prediction, test_label) if l != -100]\n", 1506 | " for test_prediction, test_label in zip(test_predictions, test_labels)\n", 1507 | "]" 1508 | ], 1509 | "execution_count": null, 1510 | "outputs": [ 1511 | { 1512 | "output_type": "display_data", 1513 | "data": { 1514 | "text/html": [ 1515 | "\n", 1516 | "
\n", 1517 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " [499/499 06:34]\n", 1529 | "
\n", 1530 | " " 1531 | ], 1532 | "text/plain": [ 1533 | "" 1534 | ] 1535 | }, 1536 | "metadata": { 1537 | "tags": [] 1538 | } 1539 | }, 1540 | { 1541 | "output_type": "stream", 1542 | "text": [ 1543 | "/usr/local/lib/python3.6/dist-packages/seqeval/metrics/sequence_labeling.py:171: UserWarning: n seems not to be NE tag.\n", 1544 | " warnings.warn('{} seems not to be NE tag.'.format(chunk))\n", 1545 | "/usr/local/lib/python3.6/dist-packages/seqeval/metrics/sequence_labeling.py:171: UserWarning: T seems not to be NE tag.\n", 1546 | " warnings.warn('{} seems not to be NE tag.'.format(chunk))\n" 1547 | ], 1548 | "name": "stderr" 1549 | } 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "code", 1554 | "metadata": { 1555 | "colab": { 1556 | "base_uri": "https://localhost:8080/" 1557 | }, 1558 | "id": "p5wz6JzfHk1B", 1559 | "outputId": "97975df3-980e-458b-f9e5-180d8ae6d0e7" 1560 | }, 1561 | "source": [ 1562 | "# example output\n", 1563 | "i=1\n", 1564 | "print('{:>10} {:>10} {:>10}'.format(\"Text\", \"Label\", \"Prediction\"))\n", 1565 | "for j in range(len(true_test_predictions[i])):\n", 1566 | " print('{:>10} {:>10} {:>10}'.format(test_texts[i][j], test_tags[i][j], true_test_predictions[i][j]))" 1567 | ], 1568 | "execution_count": null, 1569 | "outputs": [ 1570 | { 1571 | "output_type": "stream", 1572 | "text": [ 1573 | " Text Label Prediction\n", 1574 | " The n n\n", 1575 | " analysis n n\n", 1576 | " included n n\n", 1577 | " a n n\n", 1578 | " large n n\n", 1579 | " study n n\n", 1580 | " sample n n\n", 1581 | " with n n\n", 1582 | " more n n\n", 1583 | " than n n\n", 1584 | " 60,000 n n\n", 1585 | " patients B-T n\n", 1586 | " across n n\n", 1587 | " 4372 n n\n", 1588 | " hospitals B-T n\n", 1589 | " .
n n\n" 1590 | ], 1591 | "name": "stdout" 1592 | } 1593 | ] 1594 | }, 1595 | { 1596 | "cell_type": "code", 1597 | "metadata": { 1598 | "id": "QnY89ltMTrm6" 1599 | }, 1600 | "source": [ 1601 | "def computeTermEvalMetrics(extracted_terms, gold_df):\n", 1602 | " #lowercase the extracted terms because the gold standard is lowercased\n", 1603 | " extracted_terms = set([item.lower() for item in extracted_terms])\n", 1604 | " gold_set=set(gold_df)\n", 1605 | " true_pos=extracted_terms.intersection(gold_set)\n", 1606 | " recall=len(true_pos)/len(gold_set)\n", 1607 | " precision=len(true_pos)/len(extracted_terms)\n", 1608 | "\n", 1609 | " print(\"Intersection\",len(true_pos))\n", 1610 | " print(\"Gold\",len(gold_set))\n", 1611 | " print(\"Extracted\",len(extracted_terms))\n", 1612 | " print(\"Recall:\", recall)\n", 1613 | " print(\"Precision:\", precision)\n", 1614 | " print(\"F1:\", 2*(precision*recall)/(precision+recall))" 1615 | ], 1616 | "execution_count": null, 1617 | "outputs": [] 1618 | }, 1619 | { 1620 | "cell_type": "code", 1621 | "metadata": { 1622 | "id": "iSAjQkEAMaMq" 1623 | }, 1624 | "source": [ 1625 | "test_extracted_terms = extract_terms(true_test_predictions, test_texts)" 1626 | ], 1627 | "execution_count": null, 1628 | "outputs": [] 1629 | }, 1630 | { 1631 | "cell_type": "code", 1632 | "metadata": { 1633 | "id": "9DrVLBYIaMaU" 1634 | }, 1635 | "source": [ 1636 | "computeTermEvalMetrics(test_extracted_terms, set(df_htfl_terms_en[\"Term\"]).union(set(df_htfl_terms_fr[\"Term\"])).union(set(df_htfl_terms_nl[\"Term\"])))" 1637 | ], 1638 | "execution_count": null, 1639 | "outputs": [] 1640 | } 1641 | ] 1642 | } -------------------------------------------------------------------------------- /Pipeline_LDK.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Pipeline LDK", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 |
"display_name": "Python 3" 12 | }, 13 | "widgets": { 14 | "application/vnd.jupyter.widget-state+json": { 15 | "b3d333ffcede41fa8271faae3f265560": { 16 | "model_module": "@jupyter-widgets/controls", 17 | "model_name": "HBoxModel", 18 | "state": { 19 | "_view_name": "HBoxView", 20 | "_dom_classes": [], 21 | "_model_name": "HBoxModel", 22 | "_view_module": "@jupyter-widgets/controls", 23 | "_model_module_version": "1.5.0", 24 | "_view_count": null, 25 | "_view_module_version": "1.5.0", 26 | "box_style": "", 27 | "layout": "IPY_MODEL_a83dba1d35b2425bbbf06b2b9d95a6b4", 28 | "_model_module": "@jupyter-widgets/controls", 29 | "children": [ 30 | "IPY_MODEL_ca5be59af2e94d23aa1e4f09291edd4f", 31 | "IPY_MODEL_c882c56646a24f81ba2e5e206c742ccf" 32 | ] 33 | } 34 | }, 35 | "a83dba1d35b2425bbbf06b2b9d95a6b4": { 36 | "model_module": "@jupyter-widgets/base", 37 | "model_name": "LayoutModel", 38 | "state": { 39 | "_view_name": "LayoutView", 40 | "grid_template_rows": null, 41 | "right": null, 42 | "justify_content": null, 43 | "_view_module": "@jupyter-widgets/base", 44 | "overflow": null, 45 | "_model_module_version": "1.2.0", 46 | "_view_count": null, 47 | "flex_flow": null, 48 | "width": null, 49 | "min_width": null, 50 | "border": null, 51 | "align_items": null, 52 | "bottom": null, 53 | "_model_module": "@jupyter-widgets/base", 54 | "top": null, 55 | "grid_column": null, 56 | "overflow_y": null, 57 | "overflow_x": null, 58 | "grid_auto_flow": null, 59 | "grid_area": null, 60 | "grid_template_columns": null, 61 | "flex": null, 62 | "_model_name": "LayoutModel", 63 | "justify_items": null, 64 | "grid_row": null, 65 | "max_height": null, 66 | "align_content": null, 67 | "visibility": null, 68 | "align_self": null, 69 | "height": null, 70 | "min_height": null, 71 | "padding": null, 72 | "grid_auto_rows": null, 73 | "grid_gap": null, 74 | "max_width": null, 75 | "order": null, 76 | "_view_module_version": "1.2.0", 77 | "grid_template_areas": null, 78 | "object_position": null, 79 | 
"object_fit": null, 80 | "grid_auto_columns": null, 81 | "margin": null, 82 | "display": null, 83 | "left": null 84 | } 85 | }, 86 | "ca5be59af2e94d23aa1e4f09291edd4f": { 87 | "model_module": "@jupyter-widgets/controls", 88 | "model_name": "FloatProgressModel", 89 | "state": { 90 | "_view_name": "ProgressView", 91 | "style": "IPY_MODEL_fa120f5a7023402cb4628a55bf29b71d", 92 | "_dom_classes": [], 93 | "description": "Downloading: 100%", 94 | "_model_name": "FloatProgressModel", 95 | "bar_style": "success", 96 | "max": 5069051, 97 | "_view_module": "@jupyter-widgets/controls", 98 | "_model_module_version": "1.5.0", 99 | "value": 5069051, 100 | "_view_count": null, 101 | "_view_module_version": "1.5.0", 102 | "orientation": "horizontal", 103 | "min": 0, 104 | "description_tooltip": null, 105 | "_model_module": "@jupyter-widgets/controls", 106 | "layout": "IPY_MODEL_07f4c0a0d8554549b19476371b1bf9b6" 107 | } 108 | }, 109 | "c882c56646a24f81ba2e5e206c742ccf": { 110 | "model_module": "@jupyter-widgets/controls", 111 | "model_name": "HTMLModel", 112 | "state": { 113 | "_view_name": "HTMLView", 114 | "style": "IPY_MODEL_d23bff412bee40a0867f0be8b71d7c7f", 115 | "_dom_classes": [], 116 | "description": "", 117 | "_model_name": "HTMLModel", 118 | "placeholder": "​", 119 | "_view_module": "@jupyter-widgets/controls", 120 | "_model_module_version": "1.5.0", 121 | "value": " 5.07M/5.07M [00:01<00:00, 4.55MB/s]", 122 | "_view_count": null, 123 | "_view_module_version": "1.5.0", 124 | "description_tooltip": null, 125 | "_model_module": "@jupyter-widgets/controls", 126 | "layout": "IPY_MODEL_ea0909fa34c141cea1bc99f4e0af8f4c" 127 | } 128 | }, 129 | "fa120f5a7023402cb4628a55bf29b71d": { 130 | "model_module": "@jupyter-widgets/controls", 131 | "model_name": "ProgressStyleModel", 132 | "state": { 133 | "_view_name": "StyleView", 134 | "_model_name": "ProgressStyleModel", 135 | "description_width": "initial", 136 | "_view_module": "@jupyter-widgets/base", 137 | "_model_module_version": 
"1.5.0", 138 | "_view_count": null, 139 | "_view_module_version": "1.2.0", 140 | "bar_color": null, 141 | "_model_module": "@jupyter-widgets/controls" 142 | } 143 | }, 144 | "07f4c0a0d8554549b19476371b1bf9b6": { 145 | "model_module": "@jupyter-widgets/base", 146 | "model_name": "LayoutModel", 147 | "state": { 148 | "_view_name": "LayoutView", 149 | "grid_template_rows": null, 150 | "right": null, 151 | "justify_content": null, 152 | "_view_module": "@jupyter-widgets/base", 153 | "overflow": null, 154 | "_model_module_version": "1.2.0", 155 | "_view_count": null, 156 | "flex_flow": null, 157 | "width": null, 158 | "min_width": null, 159 | "border": null, 160 | "align_items": null, 161 | "bottom": null, 162 | "_model_module": "@jupyter-widgets/base", 163 | "top": null, 164 | "grid_column": null, 165 | "overflow_y": null, 166 | "overflow_x": null, 167 | "grid_auto_flow": null, 168 | "grid_area": null, 169 | "grid_template_columns": null, 170 | "flex": null, 171 | "_model_name": "LayoutModel", 172 | "justify_items": null, 173 | "grid_row": null, 174 | "max_height": null, 175 | "align_content": null, 176 | "visibility": null, 177 | "align_self": null, 178 | "height": null, 179 | "min_height": null, 180 | "padding": null, 181 | "grid_auto_rows": null, 182 | "grid_gap": null, 183 | "max_width": null, 184 | "order": null, 185 | "_view_module_version": "1.2.0", 186 | "grid_template_areas": null, 187 | "object_position": null, 188 | "object_fit": null, 189 | "grid_auto_columns": null, 190 | "margin": null, 191 | "display": null, 192 | "left": null 193 | } 194 | }, 195 | "d23bff412bee40a0867f0be8b71d7c7f": { 196 | "model_module": "@jupyter-widgets/controls", 197 | "model_name": "DescriptionStyleModel", 198 | "state": { 199 | "_view_name": "StyleView", 200 | "_model_name": "DescriptionStyleModel", 201 | "description_width": "", 202 | "_view_module": "@jupyter-widgets/base", 203 | "_model_module_version": "1.5.0", 204 | "_view_count": null, 205 | "_view_module_version": 
"1.2.0", 206 | "_model_module": "@jupyter-widgets/controls" 207 | } 208 | }, 209 | "ea0909fa34c141cea1bc99f4e0af8f4c": { 210 | "model_module": "@jupyter-widgets/base", 211 | "model_name": "LayoutModel", 212 | "state": { 213 | "_view_name": "LayoutView", 214 | "grid_template_rows": null, 215 | "right": null, 216 | "justify_content": null, 217 | "_view_module": "@jupyter-widgets/base", 218 | "overflow": null, 219 | "_model_module_version": "1.2.0", 220 | "_view_count": null, 221 | "flex_flow": null, 222 | "width": null, 223 | "min_width": null, 224 | "border": null, 225 | "align_items": null, 226 | "bottom": null, 227 | "_model_module": "@jupyter-widgets/base", 228 | "top": null, 229 | "grid_column": null, 230 | "overflow_y": null, 231 | "overflow_x": null, 232 | "grid_auto_flow": null, 233 | "grid_area": null, 234 | "grid_template_columns": null, 235 | "flex": null, 236 | "_model_name": "LayoutModel", 237 | "justify_items": null, 238 | "grid_row": null, 239 | "max_height": null, 240 | "align_content": null, 241 | "visibility": null, 242 | "align_self": null, 243 | "height": null, 244 | "min_height": null, 245 | "padding": null, 246 | "grid_auto_rows": null, 247 | "grid_gap": null, 248 | "max_width": null, 249 | "order": null, 250 | "_view_module_version": "1.2.0", 251 | "grid_template_areas": null, 252 | "object_position": null, 253 | "object_fit": null, 254 | "grid_auto_columns": null, 255 | "margin": null, 256 | "display": null, 257 | "left": null 258 | } 259 | }, 260 | "25eaf32fb66846d88496264f6e073eae": { 261 | "model_module": "@jupyter-widgets/controls", 262 | "model_name": "HBoxModel", 263 | "state": { 264 | "_view_name": "HBoxView", 265 | "_dom_classes": [], 266 | "_model_name": "HBoxModel", 267 | "_view_module": "@jupyter-widgets/controls", 268 | "_model_module_version": "1.5.0", 269 | "_view_count": null, 270 | "_view_module_version": "1.5.0", 271 | "box_style": "", 272 | "layout": "IPY_MODEL_2b8ced7e67084cbd865bb7e064b87b0e", 273 | "_model_module": 
"@jupyter-widgets/controls", 274 | "children": [ 275 | "IPY_MODEL_f2ff88b0edb64cb59f561aa271a464e3", 276 | "IPY_MODEL_f41762de9aee40a2941821e9d2cbe5b2" 277 | ] 278 | } 279 | }, 280 | "2b8ced7e67084cbd865bb7e064b87b0e": { 281 | "model_module": "@jupyter-widgets/base", 282 | "model_name": "LayoutModel", 283 | "state": { 284 | "_view_name": "LayoutView", 285 | "grid_template_rows": null, 286 | "right": null, 287 | "justify_content": null, 288 | "_view_module": "@jupyter-widgets/base", 289 | "overflow": null, 290 | "_model_module_version": "1.2.0", 291 | "_view_count": null, 292 | "flex_flow": null, 293 | "width": null, 294 | "min_width": null, 295 | "border": null, 296 | "align_items": null, 297 | "bottom": null, 298 | "_model_module": "@jupyter-widgets/base", 299 | "top": null, 300 | "grid_column": null, 301 | "overflow_y": null, 302 | "overflow_x": null, 303 | "grid_auto_flow": null, 304 | "grid_area": null, 305 | "grid_template_columns": null, 306 | "flex": null, 307 | "_model_name": "LayoutModel", 308 | "justify_items": null, 309 | "grid_row": null, 310 | "max_height": null, 311 | "align_content": null, 312 | "visibility": null, 313 | "align_self": null, 314 | "height": null, 315 | "min_height": null, 316 | "padding": null, 317 | "grid_auto_rows": null, 318 | "grid_gap": null, 319 | "max_width": null, 320 | "order": null, 321 | "_view_module_version": "1.2.0", 322 | "grid_template_areas": null, 323 | "object_position": null, 324 | "object_fit": null, 325 | "grid_auto_columns": null, 326 | "margin": null, 327 | "display": null, 328 | "left": null 329 | } 330 | }, 331 | "f2ff88b0edb64cb59f561aa271a464e3": { 332 | "model_module": "@jupyter-widgets/controls", 333 | "model_name": "FloatProgressModel", 334 | "state": { 335 | "_view_name": "ProgressView", 336 | "style": "IPY_MODEL_286f45d8d837403dbbc7c1ca5f62986a", 337 | "_dom_classes": [], 338 | "description": "Downloading: 100%", 339 | "_model_name": "FloatProgressModel", 340 | "bar_style": "success", 341 | "max": 
9096718, 342 | "_view_module": "@jupyter-widgets/controls", 343 | "_model_module_version": "1.5.0", 344 | "value": 9096718, 345 | "_view_count": null, 346 | "_view_module_version": "1.5.0", 347 | "orientation": "horizontal", 348 | "min": 0, 349 | "description_tooltip": null, 350 | "_model_module": "@jupyter-widgets/controls", 351 | "layout": "IPY_MODEL_a53406ec316c40b1b0dc429b29a7df67" 352 | } 353 | }, 354 | "f41762de9aee40a2941821e9d2cbe5b2": { 355 | "model_module": "@jupyter-widgets/controls", 356 | "model_name": "HTMLModel", 357 | "state": { 358 | "_view_name": "HTMLView", 359 | "style": "IPY_MODEL_801096255e3c4d6d887ec22bafd40ce0", 360 | "_dom_classes": [], 361 | "description": "", 362 | "_model_name": "HTMLModel", 363 | "placeholder": "​", 364 | "_view_module": "@jupyter-widgets/controls", 365 | "_model_module_version": "1.5.0", 366 | "value": " 9.10M/9.10M [00:24<00:00, 377kB/s]", 367 | "_view_count": null, 368 | "_view_module_version": "1.5.0", 369 | "description_tooltip": null, 370 | "_model_module": "@jupyter-widgets/controls", 371 | "layout": "IPY_MODEL_b6d9997a6e5c4d998fd491cec9bd0427" 372 | } 373 | }, 374 | "286f45d8d837403dbbc7c1ca5f62986a": { 375 | "model_module": "@jupyter-widgets/controls", 376 | "model_name": "ProgressStyleModel", 377 | "state": { 378 | "_view_name": "StyleView", 379 | "_model_name": "ProgressStyleModel", 380 | "description_width": "initial", 381 | "_view_module": "@jupyter-widgets/base", 382 | "_model_module_version": "1.5.0", 383 | "_view_count": null, 384 | "_view_module_version": "1.2.0", 385 | "bar_color": null, 386 | "_model_module": "@jupyter-widgets/controls" 387 | } 388 | }, 389 | "a53406ec316c40b1b0dc429b29a7df67": { 390 | "model_module": "@jupyter-widgets/base", 391 | "model_name": "LayoutModel", 392 | "state": { 393 | "_view_name": "LayoutView", 394 | "grid_template_rows": null, 395 | "right": null, 396 | "justify_content": null, 397 | "_view_module": "@jupyter-widgets/base", 398 | "overflow": null, 399 | 
"_model_module_version": "1.2.0", 400 | "_view_count": null, 401 | "flex_flow": null, 402 | "width": null, 403 | "min_width": null, 404 | "border": null, 405 | "align_items": null, 406 | "bottom": null, 407 | "_model_module": "@jupyter-widgets/base", 408 | "top": null, 409 | "grid_column": null, 410 | "overflow_y": null, 411 | "overflow_x": null, 412 | "grid_auto_flow": null, 413 | "grid_area": null, 414 | "grid_template_columns": null, 415 | "flex": null, 416 | "_model_name": "LayoutModel", 417 | "justify_items": null, 418 | "grid_row": null, 419 | "max_height": null, 420 | "align_content": null, 421 | "visibility": null, 422 | "align_self": null, 423 | "height": null, 424 | "min_height": null, 425 | "padding": null, 426 | "grid_auto_rows": null, 427 | "grid_gap": null, 428 | "max_width": null, 429 | "order": null, 430 | "_view_module_version": "1.2.0", 431 | "grid_template_areas": null, 432 | "object_position": null, 433 | "object_fit": null, 434 | "grid_auto_columns": null, 435 | "margin": null, 436 | "display": null, 437 | "left": null 438 | } 439 | }, 440 | "801096255e3c4d6d887ec22bafd40ce0": { 441 | "model_module": "@jupyter-widgets/controls", 442 | "model_name": "DescriptionStyleModel", 443 | "state": { 444 | "_view_name": "StyleView", 445 | "_model_name": "DescriptionStyleModel", 446 | "description_width": "", 447 | "_view_module": "@jupyter-widgets/base", 448 | "_model_module_version": "1.5.0", 449 | "_view_count": null, 450 | "_view_module_version": "1.2.0", 451 | "_model_module": "@jupyter-widgets/controls" 452 | } 453 | }, 454 | "b6d9997a6e5c4d998fd491cec9bd0427": { 455 | "model_module": "@jupyter-widgets/base", 456 | "model_name": "LayoutModel", 457 | "state": { 458 | "_view_name": "LayoutView", 459 | "grid_template_rows": null, 460 | "right": null, 461 | "justify_content": null, 462 | "_view_module": "@jupyter-widgets/base", 463 | "overflow": null, 464 | "_model_module_version": "1.2.0", 465 | "_view_count": null, 466 | "flex_flow": null, 467 | 
"width": null, 468 | "min_width": null, 469 | "border": null, 470 | "align_items": null, 471 | "bottom": null, 472 | "_model_module": "@jupyter-widgets/base", 473 | "top": null, 474 | "grid_column": null, 475 | "overflow_y": null, 476 | "overflow_x": null, 477 | "grid_auto_flow": null, 478 | "grid_area": null, 479 | "grid_template_columns": null, 480 | "flex": null, 481 | "_model_name": "LayoutModel", 482 | "justify_items": null, 483 | "grid_row": null, 484 | "max_height": null, 485 | "align_content": null, 486 | "visibility": null, 487 | "align_self": null, 488 | "height": null, 489 | "min_height": null, 490 | "padding": null, 491 | "grid_auto_rows": null, 492 | "grid_gap": null, 493 | "max_width": null, 494 | "order": null, 495 | "_view_module_version": "1.2.0", 496 | "grid_template_areas": null, 497 | "object_position": null, 498 | "object_fit": null, 499 | "grid_auto_columns": null, 500 | "margin": null, 501 | "display": null, 502 | "left": null 503 | } 504 | } 505 | } 506 | } 507 | }, 508 | "cells": [ 509 | { 510 | "cell_type": "code", 511 | "metadata": { 512 | "id": "_LfkDTnzbHu0" 513 | }, 514 | "source": [ 515 | "!pip install transformers" 516 | ], 517 | "execution_count": null, 518 | "outputs": [] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "metadata": { 523 | "id": "FWadm7hSZhBV" 524 | }, 525 | "source": [ 526 | "#imports\n", 527 | "from transformers import XLMRobertaTokenizerFast, pipeline, XLMRobertaForSequenceClassification, XLMRobertaForTokenClassification \n", 528 | "from spacy.pipeline import SentenceSegmenter\n", 529 | "from spacy.lang.en import English\n", 530 | "from spacy.pipeline import Sentencizer\n", 531 | "#from sacremoses import MosesTokenizer, MosesDetokenizer \n", 532 | "import torch \n", 533 | "import itertools\n", 534 | "from string import punctuation\n", 535 | "import pandas as pd\n", 536 | "from lxml import etree\n", 537 | "import datetime\n", 538 | "from graphviz import Digraph\n" 539 | ], 540 | "execution_count": null, 541 | 
"outputs": [] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": { 546 | "id": "utBSMNz1tScN" 547 | }, 548 | "source": [ 549 | "#Pipeline" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": { 555 | "id": "H3gh4G4yQrO3" 556 | }, 557 | "source": [ 558 | "**Functions**" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "metadata": { 564 | "id": "abnLMuzpRhCX" 565 | }, 566 | "source": [ 567 | "def extract_terms(pred, txt):\n", 568 | " extracted_terms = []\n", 569 | " # go over all predictions\n", 570 | " for j in range(len(pred)):\n", 571 | " # if right tag build term and add it to the set otherwise just continue\n", 572 | " if pred[j]==\"LABEL_1\":\n", 573 | " term=txt[j]\n", 574 | " for k in range(j+1,len(pred)):\n", 575 | " if pred[k]==\"LABEL_2\": term+=\" \"+txt[k] #if continuation of term \n", 576 | " else: break\n", 577 | " #remove wrong punctuation and add it to the termlist if it is no duplicate\n", 578 | " term = remove_end_punctuation(term)\n", 579 | " if term not in extracted_terms: \n", 580 | " extracted_terms.append(term)\n", 581 | " return extracted_terms" 582 | ], 583 | "execution_count": null, 584 | "outputs": [] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "metadata": { 589 | "colab": { 590 | "base_uri": "https://localhost:8080/" 591 | }, 592 | "id": "5_ZUu5kQphnt", 593 | "outputId": "917eb574-53e8-4b33-f3b8-21faf37bbb37" 594 | }, 595 | "source": [ 596 | "# remove last character if it is punctuation and there is not other punctuation in the word\n", 597 | "def remove_end_punctuation(word):\n", 598 | " word_without_last = word[:-1]\n", 599 | " #only remove end punctuation if there is not other punctuation inside the string\n", 600 | " if not any(char in word_without_last for char in punctuation):\n", 601 | " return word.translate(str.maketrans('', '', punctuation)) #all chars in punctuation are mapped to None and the translate function uses this translation table\n", 602 | " else:\n", 603 | " return 
word\n", 604 | "\n", 605 | "print(punctuation)" 606 | ], 607 | "execution_count": null, 608 | "outputs": [ 609 | { 610 | "output_type": "stream", 611 | "text": [ 612 | "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\n" 613 | ], 614 | "name": "stdout" 615 | } 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "metadata": { 621 | "id": "589hXmu-Kxat" 622 | }, 623 | "source": [ 624 | "def get_term_list(output):\n", 625 | " word_list=[]\n", 626 | " label_list=[]\n", 627 | " #can i do this with some already existing decode/encode function?\n", 628 | " for i in range(len(output)):\n", 629 | " item=output[i]\n", 630 | " #print(item[\"word\"])\n", 631 | " #if start of word_\n", 632 | " if item[\"word\"][0]==\"▁\":\n", 633 | " word=item[\"word\"]\n", 634 | " label=item[\"entity\"]\n", 635 | " for j in range(i+1,len(output)):\n", 636 | " item=output[j]\n", 637 | " if item[\"word\"][0]!=\"▁\": \n", 638 | " word+=item[\"word\"]\n", 639 | " else:\n", 640 | " break\n", 641 | " #print(word,label)\n", 642 | " word_list.append(word[1:len(word)])\n", 643 | " label_list.append(label)\n", 644 | " return label_list, word_list\n", 645 | "\n" 646 | ], 647 | "execution_count": null, 648 | "outputs": [] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "metadata": { 653 | "id": "OL-TsbGpnoAw" 654 | }, 655 | "source": [ 656 | "#split in sentences and tokenize\n", 657 | "def preprocess(text):\n", 658 | " #sentenize (from spacy)\n", 659 | " sentencizer = Sentencizer()\n", 660 | " nlp = English()\n", 661 | " nlp.add_pipe(sentencizer)\n", 662 | " doc = nlp(text)\n", 663 | "\n", 664 | " #tokenize\n", 665 | " sentence_list=[]\n", 666 | " #mt = MosesTokenizer(lang='en')\n", 667 | " for s in doc.sents:\n", 668 | " # tokenized_text = mt.tokenize(s, return_str=True) \n", 669 | " #sentence_list.append((tokenized_text.split(), s)) #append tuple of tokens and original sentence\n", 670 | " sentence_list.append(str(s))\n", 671 | " return sentence_list" 672 | ], 673 | "execution_count": null, 674 | "outputs": 
[] 675 | }, 676 | { 677 | "cell_type": "markdown", 678 | "metadata": { 679 | "id": "LE2q0YdPJ4Gg" 680 | }, 681 | "source": [ 682 | "**Load Models and Tokenizers**" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "metadata": { 688 | "id": "-J87mUFjJ4RQ", 689 | "colab": { 690 | "base_uri": "https://localhost:8080/", 691 | "height": 183, 692 | "referenced_widgets": [ 693 | "b3d333ffcede41fa8271faae3f265560", 694 | "a83dba1d35b2425bbbf06b2b9d95a6b4", 695 | "ca5be59af2e94d23aa1e4f09291edd4f", 696 | "c882c56646a24f81ba2e5e206c742ccf", 697 | "fa120f5a7023402cb4628a55bf29b71d", 698 | "07f4c0a0d8554549b19476371b1bf9b6", 699 | "d23bff412bee40a0867f0be8b71d7c7f", 700 | "ea0909fa34c141cea1bc99f4e0af8f4c", 701 | "25eaf32fb66846d88496264f6e073eae", 702 | "2b8ced7e67084cbd865bb7e064b87b0e", 703 | "f2ff88b0edb64cb59f561aa271a464e3", 704 | "f41762de9aee40a2941821e9d2cbe5b2", 705 | "286f45d8d837403dbbc7c1ca5f62986a", 706 | "a53406ec316c40b1b0dc429b29a7df67", 707 | "801096255e3c4d6d887ec22bafd40ce0", 708 | "b6d9997a6e5c4d998fd491cec9bd0427" 709 | ] 710 | }, 711 | "outputId": "ad837cbc-f30e-4f7a-ede8-77a07dc32507" 712 | }, 713 | "source": [ 714 | "#load model TE\n", 715 | "PATH = \"./TermExtraction/saved models/tvt_en_only\" #en_only, fr, nl, all\n", 716 | "model_TermExtraction = XLMRobertaForTokenClassification.from_pretrained(PATH)\n", 717 | "print(\"Term Extraction Model loaded\")\n", 718 | "\n", 719 | "#load tokenizer TE\n", 720 | "tokenizer_TermExtraction = XLMRobertaTokenizerFast.from_pretrained(\"xlm-roberta-base\")\n", 721 | "print(\"Tokenizer loaded\")\n", 722 | "\n", 723 | "#load model RE\n", 724 | "PATH = \"./RelationExtraction/saved models/pipeline1803\" \n", 725 | "model_RelationExtraction = XLMRobertaForSequenceClassification.from_pretrained(PATH)\n", 726 | "print(\"Relation Extraction Model loaded\")\n" 727 | ], 728 | "execution_count": null, 729 | "outputs": [ 730 | { 731 | "output_type": "stream", 732 | "text": [ 733 | "Term Extraction Model loaded\n" 734 | 
], 735 | "name": "stdout" 736 | }, 737 | { 738 | "output_type": "display_data", 739 | "data": { 740 | "application/vnd.jupyter.widget-view+json": { 741 | "model_id": "b3d333ffcede41fa8271faae3f265560", 742 | "version_minor": 0, 743 | "version_major": 2 744 | }, 745 | "text/plain": [ 746 | "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=5069051.0, style=ProgressStyle(descript…" 747 | ] 748 | }, 749 | "metadata": { 750 | "tags": [] 751 | } 752 | }, 753 | { 754 | "output_type": "stream", 755 | "text": [ 756 | "\n" 757 | ], 758 | "name": "stdout" 759 | }, 760 | { 761 | "output_type": "display_data", 762 | "data": { 763 | "application/vnd.jupyter.widget-view+json": { 764 | "model_id": "25eaf32fb66846d88496264f6e073eae", 765 | "version_minor": 0, 766 | "version_major": 2 767 | }, 768 | "text/plain": [ 769 | "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=9096718.0, style=ProgressStyle(descript…" 770 | ] 771 | }, 772 | "metadata": { 773 | "tags": [] 774 | } 775 | }, 776 | { 777 | "output_type": "stream", 778 | "text": [ 779 | "\n", 780 | "Tokenizer loaded\n", 781 | "Relation Extraction Model loaded\n", 782 | "Hierachy Model loaded\n" 783 | ], 784 | "name": "stdout" 785 | } 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "metadata": { 791 | "colab": { 792 | "base_uri": "https://localhost:8080/" 793 | }, 794 | "id": "upoICrPYXMun", 795 | "outputId": "54ccf8c9-bb24-464b-b764-126239d07980" 796 | }, 797 | "source": [ 798 | "label_list=['SYNONYM', 'activityRelation (e1,e2)', 'activityRelation (e2,e1)',\n", 799 | " 'associativeRelation', 'causalRelation (e1,e2)', 'causalRelation (e2,e1)',\n", 800 | " 'genericRelation (e1,e2)', 'genericRelation (e2,e1)',\n", 801 | " 'instrumentalRelation (e1,e2)', 'instrumentalRelation (e2,e1)', 'none',\n", 802 | " 'originationRelation (e1,e2)', 'originationRelation (e2,e1)',\n", 803 | " 'partitiveRelation (e1,e2)', 'partitiveRelation (e2,e1)',\n", 804 | " 'spatialRelation (e1,e2)', 
'spatialRelation (e2,e1)']\n", 805 | "\n", 806 | "wrong_labels=[\"LABEL_0\",\"LABEL_1\",\"LABEL_2\",\"LABEL_3\",\"LABEL_4\",\"LABEL_5\",\"LABEL_6\",\"LABEL_7\",\"LABEL_8\",\"LABEL_9\",\"LABEL_10\",\"LABEL_11\",\"LABEL_12\",\"LABEL_13\", \"LABEL_14\", \"LABEL_15\", \"LABEL_16\"]\n", 807 | "print(len(label_list),len(wrong_labels))" 808 | ], 809 | "execution_count": null, 810 | "outputs": [ 811 | { 812 | "output_type": "stream", 813 | "text": [ 814 | "25 25\n" 815 | ], 816 | "name": "stdout" 817 | } 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "metadata": { 823 | "id": "aR6FAsk5gMO4" 824 | }, 825 | "source": [ 826 | "#pipelines\n", 827 | "\n", 828 | "pipeline_terms=pipeline(\"ner\", model=model_TermExtraction, tokenizer=tokenizer_TermExtraction)\n", 829 | "\n", 830 | "pipeline_relation=pipeline(\"sentiment-analysis\", model=model_RelationExtraction, tokenizer=tokenizer_TermExtraction)" 831 | ], 832 | "execution_count": null, 833 | "outputs": [] 834 | }, 835 | { 836 | "cell_type": "markdown", 837 | "metadata": { 838 | "id": "c0gdW4GlJPsN" 839 | }, 840 | "source": [ 841 | "**Read Text**" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "metadata": { 847 | "id": "2IPkd6ZtpZgO" 848 | }, 849 | "source": [ 850 | "text = \"' Vorhandener Wirkstoff\\n\\n„Prüfprogramm“ ist die Bezeichnung, die üblicherweise für das Arbeitsprogramm zur Prüfung in Biozidprodukten enthaltener alter Wirkstoffe verwendet wird. Das Programm wurde von der Europäischen Kommission unter der Biozidprodukte-Richtlinie (BPD) eingerichtet und wird unter der Verordnung über Biozidprodukte (BPR) fortgeführt.\\n\\nAlte Wirkstoffe sind jene Stoffe, die am 14. Mai 2000 als Wirkstoff eines Biozidprodukts auf dem Markt waren (für andere Zwecke als die wissenschaftliche oder produkt- und verfahrensorientierte Forschung und Entwicklung). 
Es wurden jene alten Wirkstoffe zur Überprüfung im Prüfprogramm akzeptiert, die als solche identifiziert wurden und für die eine Notifizierung gemäß Anhang II der Verordnung (EG) Nr. 1451/2007 der Kommission akzeptiert wurde.\\n\\nDie genauen Vorschriften für das Prüfprogramm wurden im Rahmen der neuen Verordnung zum Prüfprogramm (EU) Nr. 1062/2014, die die Verordnung (EG) Nr. 1451/2007 der Kommission aufhebt und ersetzt, an die Bestimmungen der BPR angepasst.\\n\\nDie in Artikel 89 der Verordnung (EU) Nr. 528/2012 festgelegten Übergangsmaßnahmen ermöglichen das Inverkehrbringen und die Verwendung von Biozidprodukten, die einen im Prüfprogramm (für eine bestimmte Produktart) enthaltenen Wirkstoff enthalten, vorbehaltlich der nationalen Vorschriften, bis drei Jahre nach ihrem Genehmigungsdatum (im Falle einer Nichtgenehmigung können kürzere Zeiträume gelten).\\n\\nIn Anhang II Teil 1 der Verordnung zum Prüfprogramm sind die Wirkstoffe aufgeführt, die derzeit geprüft werden.\\n\\nDarüber hinaus passt die Verordnung zum Prüfprogramm die Verfahren zur Bewertung von Dossiers an die Verfahren an, die in der BPR für neue Wirkstoffe oder in Verordnung (EU) Nr. 
88/2014 zur Änderung von Anhang I beschrieben sind.\\n\\nDes Weiteren sieht die Verordnung zum Prüfprogramm eine feste Rolle für die ECHA vor und legt Verfahren zum Beteiligen oder Ersetzen von Teilnehmern im Prüfprogramm in gegenseitigem Einvernehmen, zum Ausscheiden als Teilnehmer sowie zur Übernahme der Rolle eines Teilnehmers in bestimmten Situationen fest und führt die Möglichkeit ein, unter bestimmten Bedingungen Stoff/Produktart-Kombinationen in das Prüfprogramm aufzunehmen.\"" 851 | ], 852 | "execution_count": null, 853 | "outputs": [] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": { 858 | "id": "HX7RqILgrp4C" 859 | }, 860 | "source": [ 861 | "#Single Example of the Whole Pipeline" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": { 867 | "id": "uiBgEXCfJP-0" 868 | }, 869 | "source": [ 870 | "**Split corpus into sentences**" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "metadata": { 876 | "id": "16IMCujcJQFg", 877 | "colab": { 878 | "base_uri": "https://localhost:8080/", 879 | "height": 54 880 | }, 881 | "outputId": "94e9933c-fa6c-4d40-8c20-334ab6d1af18" 882 | }, 883 | "source": [ 884 | "sentences=preprocess(text)\n", 885 | "sentences[0]" 886 | ], 887 | "execution_count": null, 888 | "outputs": [ 889 | { 890 | "output_type": "execute_result", 891 | "data": { 892 | "application/vnd.google.colaboratory.intrinsic+json": { 893 | "type": "string" 894 | }, 895 | "text/plain": [ 896 | "' Vorhandener Wirkstoff\\n\\n„Prüfprogramm“ ist die Bezeichnung, die üblicherweise für das Arbeitsprogramm zur Prüfung in Biozidprodukten enthaltener alter Wirkstoffe verwendet wird.'" 897 | ] 898 | }, 899 | "metadata": { 900 | "tags": [] 901 | }, 902 | "execution_count": 27 903 | } 904 | ] 905 | }, 906 | { 907 | "cell_type": "markdown", 908 | "metadata": { 909 | "id": "Nm8GNgQUJQMV" 910 | }, 911 | "source": [ 912 | "**Extract Terms from corpus**" 913 | ] 914 | }, 915 | { 916 | "cell_type": "code", 917 | "metadata": { 918 | 
"id": "HC5cUP7mJQRj" 919 | }, 920 | "source": [ 921 | "#OR USE OWN TOKENIZER + FAKE LABELS TO GET OUTPUT..... (otherwise problems with punctuation)\n", 922 | "\n", 923 | "#pipeline with list of sentences as input ?????\n", 924 | "terms_per_sentence=[]\n", 925 | "for s in sentences:\n", 926 | " #pipeline output\n", 927 | " term_output=pipeline_terms(s)\n", 928 | " #reconstruct full words from pipeline and asign labels based on start word \n", 929 | " labels, words=get_term_list(term_output)\n", 930 | " terms_per_sentence.append(extract_terms(labels,words))\n", 931 | " \n" 932 | ], 933 | "execution_count": null, 934 | "outputs": [] 935 | }, 936 | { 937 | "cell_type": "code", 938 | "metadata": { 939 | "colab": { 940 | "base_uri": "https://localhost:8080/" 941 | }, 942 | "id": "2228Ut560n4C", 943 | "outputId": "c6f6a7c2-ffdd-4f53-b8cd-efc35657ab7e" 944 | }, 945 | "source": [ 946 | "terms_per_sentence" 947 | ], 948 | "execution_count": null, 949 | "outputs": [ 950 | { 951 | "output_type": "execute_result", 952 | "data": { 953 | "text/plain": [ 954 | "[['active substance', 'biocidal active substances', 'biocidal products'],\n", 955 | " ['European Commission',\n", 956 | " 'Biocidal Products Directive',\n", 957 | " 'Biocidal Products Regulation'],\n", 958 | " ['biocidal product'],\n", 959 | " ['Review Programme', 'Commission Regulation'],\n", 960 | " ['rules',\n", 961 | " 'Review Programme',\n", 962 | " 'BPR',\n", 963 | " 'Review Programme Regulation',\n", 964 | " 'Commission Regulation'],\n", 965 | " ['Regulation', 'biocidal products', 'Review', 'rules'],\n", 966 | " ['Review'],\n", 967 | " ['Review Programme',\n", 968 | " 'Regulation',\n", 969 | " 'dossier',\n", 970 | " 'BPR',\n", 971 | " 'active substances',\n", 972 | " 'Review Programme Regulation',\n", 973 | " 'ECHA',\n", 974 | " 'substance/PT'],\n", 975 | " []]" 976 | ] 977 | }, 978 | "metadata": { 979 | "tags": [] 980 | }, 981 | "execution_count": 62 982 | } 983 | ] 984 | }, 985 | { 986 | "cell_type": "code", 987 | 
"metadata": { 988 | "id": "aqr6bbVYRgXf" 989 | }, 990 | "source": [ 991 | "# flat set of all terms\n", 992 | "term_list=[]\n", 993 | "for term_l in terms_per_sentence:\n", 994 | " for term in term_l:\n", 995 | " if term not in term_list:\n", 996 | " term_list.append(term)" 997 | ], 998 | "execution_count": null, 999 | "outputs": [] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "metadata": { 1004 | "colab": { 1005 | "base_uri": "https://localhost:8080/" 1006 | }, 1007 | "id": "gMtRVW5BtgF7", 1008 | "outputId": "df0bd28a-a72a-4130-e4a2-8e986f57a9ae" 1009 | }, 1010 | "source": [ 1011 | "term_list" 1012 | ], 1013 | "execution_count": null, 1014 | "outputs": [ 1015 | { 1016 | "output_type": "execute_result", 1017 | "data": { 1018 | "text/plain": [ 1019 | "['active substance',\n", 1020 | " 'biocidal active substances',\n", 1021 | " 'biocidal products',\n", 1022 | " 'European Commission',\n", 1023 | " 'Biocidal Products Directive',\n", 1024 | " 'Biocidal Products Regulation',\n", 1025 | " 'biocidal product',\n", 1026 | " 'Review Programme',\n", 1027 | " 'Commission Regulation',\n", 1028 | " 'rules',\n", 1029 | " 'BPR',\n", 1030 | " 'Review Programme Regulation',\n", 1031 | " 'Regulation',\n", 1032 | " 'Review',\n", 1033 | " 'dossier',\n", 1034 | " 'active substances',\n", 1035 | " 'ECHA',\n", 1036 | " 'substance/PT']" 1037 | ] 1038 | }, 1039 | "metadata": { 1040 | "tags": [] 1041 | }, 1042 | "execution_count": 64 1043 | } 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "code", 1048 | "metadata": { 1049 | "id": "FTTDgSMSVodw" 1050 | }, 1051 | "source": [ 1052 | "# create a list of concept dictionaries + term2id mapping\n", 1053 | "\n", 1054 | "term_to_id = dict()\n", 1055 | "\n", 1056 | "concept_list=[]\n", 1057 | "for i, c in enumerate(term_list):\n", 1058 | " concept_list.append(\n", 1059 | " {\"id\":i, \"terms\":[c], \"relations\":[]}\n", 1060 | " )\n", 1061 | " # map terms to id \n", 1062 | " for term in [c]: \n", 1063 | " term_to_id[term]=i\n" 1064 | ], 1065 | 
"execution_count": null, 1066 | "outputs": [] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "metadata": { 1071 | "colab": { 1072 | "base_uri": "https://localhost:8080/" 1073 | }, 1074 | "id": "b1Nr1MHxnrGY", 1075 | "outputId": "57d3d5ca-6bcf-4a52-8cc4-afe7ae119a7d" 1076 | }, 1077 | "source": [ 1078 | "concept_list" 1079 | ], 1080 | "execution_count": null, 1081 | "outputs": [ 1082 | { 1083 | "output_type": "execute_result", 1084 | "data": { 1085 | "text/plain": [ 1086 | "[{'id': 0, 'relations': [], 'terms': ['active substance']},\n", 1087 | " {'id': 1, 'relations': [], 'terms': ['biocidal active substances']},\n", 1088 | " {'id': 2, 'relations': [], 'terms': ['biocidal products.']},\n", 1089 | " {'id': 3, 'relations': [], 'terms': ['European Commission']},\n", 1090 | " {'id': 4, 'relations': [], 'terms': ['Biocidal Products Directive']},\n", 1091 | " {'id': 5, 'relations': [], 'terms': ['Biocidal Products Regulation']},\n", 1092 | " {'id': 6, 'relations': [], 'terms': ['biocidal product']},\n", 1093 | " {'id': 7, 'relations': [], 'terms': ['Review Programme']},\n", 1094 | " {'id': 8, 'relations': [], 'terms': ['Commission Regulation']},\n", 1095 | " {'id': 9, 'relations': [], 'terms': ['rules']},\n", 1096 | " {'id': 10, 'relations': [], 'terms': ['BPR']},\n", 1097 | " {'id': 11, 'relations': [], 'terms': ['Review Programme Regulation']},\n", 1098 | " {'id': 12, 'relations': [], 'terms': ['Regulation']},\n", 1099 | " {'id': 13, 'relations': [], 'terms': ['biocidal products']},\n", 1100 | " {'id': 14, 'relations': [], 'terms': ['Review']},\n", 1101 | " {'id': 15, 'relations': [], 'terms': ['rules,']},\n", 1102 | " {'id': 16, 'relations': [], 'terms': ['dossier']},\n", 1103 | " {'id': 17, 'relations': [], 'terms': ['active substances']},\n", 1104 | " {'id': 18, 'relations': [], 'terms': ['ECHA']},\n", 1105 | " {'id': 19, 'relations': [], 'terms': ['substance/PT']},\n", 1106 | " {'id': 20, 'relations': [], 'terms': ['Review Programme,']}]" 1107 | ] 1108 | }, 
1109 | "metadata": { 1110 | "tags": [] 1111 | }, 1112 | "execution_count": 21 1113 | } 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": { 1119 | "id": "jyebFX2YJoOB" 1120 | }, 1121 | "source": [ 1122 | "**Extract relations per sentence**" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "code", 1127 | "metadata": { 1128 | "id": "mO8mTDcQ2bqe", 1129 | "colab": { 1130 | "base_uri": "https://localhost:8080/" 1131 | }, 1132 | "outputId": "de146a31-1a5a-41e9-d6d2-7d0e1a84e1e4" 1133 | }, 1134 | "source": [ 1135 | "# take terms and save them in a dictionary that we later write to tbx\n", 1136 | "for i in range(len(terms_per_sentence)):\n", 1137 | " #create all termpairs for the sentence \n", 1138 | " terms=terms_per_sentence[i]\n", 1139 | " sentence=sentences[i]\n", 1140 | " term_pairs=list(itertools.combinations(terms, 2))\n", 1141 | " print(\"\\n\\nSentence:\", sentence)\n", 1142 | " print(\"Relations:\")\n", 1143 | " #extract relation for each possible pair \n", 1144 | " for pair in term_pairs:\n", 1145 | " input=pair[0]+\". \"+pair[1]+\". 
\"+sentence\n", 1146 | " relation=pipeline_relation(input)\n", 1147 | " #print((relation))\n", 1148 | " true_rel=label_list[wrong_labels.index(relation[0][\"label\"])]\n", 1149 | "\n", 1150 | " #add true_rel to concept dict\n", 1151 | " if \"(e1,e2)\" in true_rel:\n", 1152 | " concept_list[term_to_id[pair[0]]][\"relations\"].append([true_rel[:-8],pair[1],round(relation[0][\"score\"],3)])\n", 1153 | " elif \"(e2,e1)\" in true_rel:\n", 1154 | " concept_list[term_to_id[pair[1]]][\"relations\"].append([true_rel[:-8],pair[0]])\n", 1155 | " elif \"associativeRelation\" == true_rel:\n", 1156 | " concept_list[term_to_id[pair[0]]][\"relations\"].append([true_rel,pair[1]])\n", 1157 | " \n", 1158 | "\n", 1159 | " print(\"{:30s}{:30s}{:30s}{:3f}\".format(pair[0],pair[1], true_rel, round(relation[0][\"score\"],3))) #pair[0],\"----\", pair[1], true_rel, relation[0][\"score\"])" 1160 | ], 1161 | "execution_count": null, 1162 | "outputs": [ 1163 | { 1164 | "output_type": "stream", 1165 | "text": [ 1166 | "\n", 1167 | "\n", 1168 | "Sentence: Existing active substance\n", 1169 | "\n", 1170 | "The Review Programme is the name commonly used for the work programme for the examination of existing biocidal active substances contained in biocidal products.\n", 1171 | "Relations:\n", 1172 | "active substance biocidal active substances partitiveRelation (e1,e2) 0.966000\n", 1173 | "active substance biocidal products. partitiveRelation (e1,e2) 0.875000\n", 1174 | "biocidal active substances biocidal products. 
partitiveRelation (e1,e2) 0.919000\n", 1175 | "\n", 1176 | "\n", 1177 | "Sentence: The programme was set up by the European Commission under the Biocidal Products Directive (BPD) and continues under the Biocidal Products Regulation (BPR).\n", 1178 | "Relations:\n", 1179 | "European Commission Biocidal Products Directive none 0.559000\n", 1180 | "European Commission Biocidal Products Regulation none 0.779000\n", 1181 | "Biocidal Products Directive Biocidal Products Regulation none 0.861000\n", 1182 | "\n", 1183 | "\n", 1184 | "Sentence: \n", 1185 | "\n", 1186 | "Existing active substances are those substances which were on the market on 14 May 2000 as an active substance of a biocidal product (for purposes other than scientific or product and process-orientated research and development).\n", 1187 | "Relations:\n", 1188 | "\n", 1189 | "\n", 1190 | "Sentence: The existing active substances which were accepted to be examined in the Review Programme were those which were identified as such and for which a notification was accepted, as set out in Annex II to Commission Regulation (EC) No 1451/2007.\n", 1191 | "Relations:\n", 1192 | "Review Programme Commission Regulation none 0.880000\n", 1193 | "\n", 1194 | "\n", 1195 | "Sentence: \n", 1196 | "\n", 1197 | "The detailed rules for the Review Programme have been adapted to the provisions of the BPR in the new Review Programme Regulation (EU) No 1062/2014, which repeals and replaces Commission Regulation (EC) No 1451/2007.\n", 1198 | "Relations:\n", 1199 | "rules Review Programme partitiveRelation (e1,e2) 0.702000\n", 1200 | "rules BPR partitiveRelation (e1,e2) 0.699000\n", 1201 | "rules Review Programme Regulation associativeRelation 0.904000\n", 1202 | "rules Commission Regulation none 0.588000\n", 1203 | "Review Programme BPR partitiveRelation (e1,e2) 0.506000\n", 1204 | "Review Programme Review Programme Regulation none 0.653000\n", 1205 | "Review Programme Commission Regulation none 0.552000\n", 1206 | "BPR Review 
Programme Regulation partitiveRelation (e1,e2) 0.430000\n", 1207 | "BPR Commission Regulation none 0.435000\n", 1208 | "Review Programme Regulation Commission Regulation none 0.659000\n", 1209 | "\n", 1210 | "\n", 1211 | "Sentence: \n", 1212 | "\n", 1213 | "The transitional provisions laid down in Article 89 of Regulation (EU) No 528/2012 allow biocidal products containing an active substance included in the Review Programme (for a given product-type) to be made available on the market and used, subject to national rules, until three years after the date of their approval (shorter timeframes apply in case of non-approval).\n", 1214 | "Relations:\n", 1215 | "Regulation biocidal products none 0.700000\n", 1216 | "Regulation Review none 0.875000\n", 1217 | "Regulation rules, associativeRelation 0.854000\n", 1218 | "biocidal products Review none 0.900000\n", 1219 | "biocidal products rules, associativeRelation 0.605000\n", 1220 | "Review rules, none 0.768000\n", 1221 | "\n", 1222 | "\n", 1223 | "Sentence: \n", 1224 | "\n", 1225 | "In Annex II part 1 of the Review Programme Regulation, the active substances which are under evaluation are listed.\n", 1226 | "Relations:\n", 1227 | "\n", 1228 | "\n", 1229 | "Sentence: \n", 1230 | "\n", 1231 | "In addition, the Review Programme Regulation adapts the processes for the evaluation of a dossier to align them to those described in the BPR for new active substances or in Regulation (EU) No 88/2014 for the amendment of Annex I.\n", 1232 | "\n", 1233 | "Furthermore, the Review Programme Regulation provides a defined role for ECHA and sets out procedures on how to join or replace a participant in the Review Programme by mutual agreement, how to withdraw as a participant, how to take over the role of participant in certain situations and introduces the possibility to add substance/PT combinations to the Review Programme, under certain conditions.\n", 1234 | "Relations:\n", 1235 | "Review Programme Regulation none 0.439000\n", 1236 | 
"Review Programme dossier none 0.832000\n", 1237 | "Review Programme BPR none 0.734000\n", 1238 | "Review Programme active substances none 0.805000\n", 1239 | "Review Programme Review Programme Regulation none 0.779000\n", 1240 | "Review Programme ECHA none 0.773000\n", 1241 | "Review Programme substance/PT none 0.855000\n", 1242 | "Review Programme Review Programme, genericRelation (e1,e2) 0.751000\n", 1243 | "Regulation dossier none 0.747000\n", 1244 | "Regulation BPR none 0.849000\n", 1245 | "Regulation active substances none 0.457000\n", 1246 | "Regulation Review Programme Regulation none 0.868000\n", 1247 | "Regulation ECHA none 0.781000\n", 1248 | "Regulation substance/PT none 0.834000\n", 1249 | "Regulation Review Programme, none 0.819000\n", 1250 | "dossier BPR none 0.670000\n", 1251 | "dossier active substances none 0.707000\n", 1252 | "dossier Review Programme Regulation none 0.905000\n", 1253 | "dossier ECHA none 0.799000\n", 1254 | "dossier substance/PT partitiveRelation (e2,e1) 0.439000\n", 1255 | "dossier Review Programme, none 0.762000\n", 1256 | "BPR active substances associativeRelation 0.944000\n", 1257 | "BPR Review Programme Regulation none 0.744000\n", 1258 | "BPR ECHA none 0.796000\n", 1259 | "BPR substance/PT none 0.668000\n", 1260 | "BPR Review Programme, genericRelation (e1,e2) 0.461000\n", 1261 | "active substances Review Programme Regulation none 0.908000\n", 1262 | "active substances ECHA none 0.860000\n", 1263 | "active substances substance/PT partitiveRelation (e2,e1) 0.641000\n", 1264 | "active substances Review Programme, none 0.786000\n", 1265 | "Review Programme Regulation ECHA none 0.836000\n", 1266 | "Review Programme Regulation substance/PT none 0.890000\n", 1267 | "Review Programme Regulation Review Programme, none 0.293000\n", 1268 | "ECHA substance/PT none 0.488000\n", 1269 | "ECHA Review Programme, none 0.704000\n", 1270 | "substance/PT Review Programme, none 0.767000\n", 1271 | "\n", 1272 | "\n", 1273 | "Sentence: \n", 1274 
| "\n", 1275 | "The Review Programme is foreseen to be completed by 2024.\n", 1276 | "Relations:\n" 1277 | ], 1278 | "name": "stdout" 1279 | } 1280 | ] 1281 | }, 1282 | { 1283 | "cell_type": "markdown", 1284 | "metadata": { 1285 | "id": "M8B85yx4lSyo" 1286 | }, 1287 | "source": [ 1288 | "**Clean Concept List**\n" 1289 | ] 1290 | }, 1291 | { 1292 | "cell_type": "code", 1293 | "metadata": { 1294 | "id": "_-yT_XBbdRBM" 1295 | }, 1296 | "source": [ 1297 | "# merge concept entries with synonym relations \n", 1298 | "\n", 1299 | "# go over all concepts\n", 1300 | "for i, concept in enumerate(concept_list):\n", 1301 | " print(\"checking\", concept[\"id\"], concept[\"terms\"])\n", 1302 | " print(concept[\"relations\"])\n", 1303 | " #go over all relations\n", 1304 | " for k, rel in enumerate(concept[\"relations\"]): \n", 1305 | " print(\" \", rel)\n", 1306 | " # if it's a synonym relation\n", 1307 | " if rel[0]==\"SYNONYM\":\n", 1308 | " e2_term=rel[1] #this is a string \n", 1309 | " #find the concept e2 in the list to be able to remove it \n", 1310 | " for j in range(len(concept_list)):\n", 1311 | " if e2_term in concept_list[j][\"terms\"]: break\n", 1312 | " #remove the found concept if it's not the current concept\n", 1313 | " e2_concept=concept_list[j]\n", 1314 | " if e2_concept!=concept:\n", 1315 | " del concept_list[j]\n", 1316 | " print(\" merge\", concept[\"id\"], e2_concept[\"id\"])\n", 1317 | " #merge found concept\n", 1318 | " for term in e2_concept[\"terms\"]:\n", 1319 | " concept[\"terms\"].append(term)\n", 1320 | " for rel in e2_concept[\"relations\"]:\n", 1321 | " concept[\"relations\"].append(rel)\n", 1322 | " #print updated relations\n", 1323 | " print(concept[\"relations\"])\n", 1324 | " #no need to update the ids here; relations still reference terms as strings and ids are rebuilt in the next cell \n" 1325 | ], 1326 | "execution_count": null, 1327 | "outputs": [] 1328 | }, 1329 | { 1330 | "cell_type": "code", 1331 | "metadata": { 1332 | "id": "FJ8OTMeuIZEu" 1333 | }, 1334 | "source": [ 1335 | "# correct 
term_to_id (which is broken due to synonymy) and update ids (to have no missing values after the merge)\n", 1336 | "\n", 1337 | "#update term_to_id\n", 1338 | "for i, concept in enumerate(concept_list):\n", 1339 | " concept[\"id\"]=\"c\"+str(i+1)\n", 1340 | " for term in concept[\"terms\"]:\n", 1341 | " term_to_id[term]=concept[\"id\"]\n", 1342 | "\n", 1343 | "# update relations to contain ids instead of the string-terms using term_to_id\n", 1344 | "for concept in concept_list:\n", 1345 | " for rel in concept[\"relations\"]:\n", 1346 | " #print(rel[1], term_to_id[rel[1]])\n", 1347 | " rel[1]=term_to_id[rel[1]]\n" 1348 | ], 1349 | "execution_count": null, 1350 | "outputs": [] 1351 | }, 1352 | { 1353 | "cell_type": "code", 1354 | "metadata": { 1355 | "id": "d5zKIdYXSfUv" 1356 | }, 1357 | "source": [ 1358 | "# delete synonym relations and all relations that became self-referential after the merge; also remove duplicates\n", 1359 | "\n", 1360 | "for concept in concept_list:\n", 1361 | " legal_relations=[]\n", 1362 | " for rel in concept[\"relations\"]:\n", 1363 | " #only keep relations that are not synonyms and not self-referential\n", 1364 | " if rel[1]!=concept[\"id\"] and rel[0]!=\"SYNONYM\":\n", 1365 | " #do not keep duplicates\n", 1366 | " duplicate=False\n", 1367 | " for legal_rel in legal_relations:\n", 1368 | " #if duplicate, only update probability\n", 1369 | " if rel[0]==legal_rel[0] and rel[1]==legal_rel[1]:\n", 1370 | " legal_rel[2]=max(legal_rel[2], rel[2])\n", 1371 | " duplicate=True\n", 1372 | " if not duplicate:\n", 1373 | " legal_relations.append(rel)\n", 1374 | " else:\n", 1375 | " print(rel)\n", 1376 | " concept[\"relations\"] = legal_relations" 1377 | ], 1378 | "execution_count": null, 1379 | "outputs": [] 1380 | }, 1381 | { 1382 | "cell_type": "markdown", 1383 | "metadata": { 1384 | "id": "nZChj0zoRBkg" 1385 | }, 1386 | "source": [ 1387 | "**Write to TBX**" 1388 | ] 1389 | }, 1390 | { 1391 | "cell_type": "code", 1392 | "metadata": { 1393 | "id": "_deNb4lLRFY_" 
1394 | }, 1395 | "source": [ 1396 | "# function to write header\n", 1397 | "def write_header(root):\n", 1398 | " header = etree.SubElement(root, \"tbxHeader\")\n", 1399 | " fileDesc = etree.SubElement(header, \"fileDesc\")\n", 1400 | " sourceDesc = etree.SubElement(fileDesc, \"sourceDesc\")\n", 1401 | " etree.SubElement(sourceDesc, \"p\").text = \"TBX file automatically generated by Text2TCS (https://text2tcs.univie.ac.at/)\"\n", 1402 | " encodingDesc = etree.SubElement(header, \"encodingDesc\") \n", 1403 | " etree.SubElement(encodingDesc, \"p\", {\"type\": \"XCSURI\"}).text=\"TBXXCSV02.xcs\"" 1404 | ], 1405 | "execution_count": null, 1406 | "outputs": [] 1407 | }, 1408 | { 1409 | "cell_type": "code", 1410 | "metadata": { 1411 | "id": "O7vm2SJ1Ucc7" 1412 | }, 1413 | "source": [ 1414 | "# function for writing the text body with all concept entries \n", 1415 | "# each concept = id, list of synonymous terms, and relations \n", 1416 | "def write_text(root, concept_list):\n", 1417 | " date_string = datetime.datetime.now().strftime(\"%y-%m-%d_%Hh-%Mm\")\n", 1418 | " text = etree.SubElement(root, \"text\")\n", 1419 | " body = etree.SubElement(text,\"body\") \n", 1420 | " for concept in concept_list:\n", 1421 | " conceptEntry = etree.SubElement(body, \"conceptEntry\", {\"id\":concept[\"id\"]})\n", 1422 | " transacGrp = etree.SubElement(conceptEntry, \"transacGrp\")\n", 1423 | " transac = etree.SubElement(transacGrp, \"transac\", {\"type\": \"transactionType\"}).text=\"origination\"\n", 1424 | " transacNote = etree.SubElement(transacGrp, \"transacNote\", {\"type\": \"responsibility\"}).text=\"Text2TCS\"\n", 1425 | " date = etree.SubElement(transacGrp, \"date\").text = date_string\n", 1426 | " #write all terms\n", 1427 | " langSec = etree.SubElement(conceptEntry, \"langSec\", {\"{http://www.w3.org/XML/1998/namespace}lang\":language})\n", 1428 | " for i, term in enumerate(concept[\"terms\"]):\n", 1429 | " term_id=concept[\"id\"]+\"-\"+language+\"-t\"+str(i)\n", 1430 | " termSec = etree.SubElement(langSec, \"termSec\", 
{\"id\":term_id})\n", 1431 | " etree.SubElement(termSec, \"term\").text = term\n", 1432 | " #write all relations\n", 1433 | " for rel in concept[\"relations\"]:\n", 1434 | " descripGrp = etree.SubElement(conceptEntry,\"descripGrp\")\n", 1435 | " etree.SubElement(descripGrp, \"descrip\", {\"type\":rel[0]}).text = rel[1]\n", 1436 | "\n", 1437 | " \n" 1438 | ], 1439 | "execution_count": null, 1440 | "outputs": [] 1441 | }, 1442 | { 1443 | "cell_type": "code", 1444 | "metadata": { 1445 | "id": "soOiI8sSL7Mt" 1446 | }, 1447 | "source": [ 1448 | "#TODO IMPLEMENT AUTOMATIC LANGUAGE DETECTION // User Input \n", 1449 | "language=\"en\"" 1450 | ], 1451 | "execution_count": null, 1452 | "outputs": [] 1453 | }, 1454 | { 1455 | "cell_type": "code", 1456 | "metadata": { 1457 | "id": "wVssli2KF66a" 1458 | }, 1459 | "source": [ 1460 | "root = etree.Element(\"tbx\", {\"type\":\"TBX-Core\", \"style\":\"dca\", \"{http://www.w3.org/XML/1998/namespace}lang\":language, \"xmlns\":\"urn:iso:std:iso:30042:ed-2\"})\n", 1461 | "#add head xml elements\n", 1462 | "pi2 = etree.ProcessingInstruction('xml-model', 'href=\"https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBX-Core.sch\" type=\"application/xml\" schematypens=\"http://purl.oclc.org/dsdl/schematron\"') \n", 1463 | "pi1 = etree.ProcessingInstruction('xml-model', 'href=\"https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBXcoreStructV03_TBX-Core_integrated.rng\" type=\"application/xml\" schematypens=\"http://relaxng.org/ns/structure/1.0\"') \n", 1464 | "tree = etree.ElementTree(root)\n", 1465 | "tree.getroot().addprevious(pi1)\n", 1466 | "tree.getroot().addprevious(pi2)\n", 1467 | "\n", 1468 | "#write content\n", 1469 | "write_header(root)\n", 1470 | "write_text(root, concept_list)" 1471 | ], 1472 | "execution_count": null, 1473 | "outputs": [] 1474 | }, 1475 | { 1476 | "cell_type": "code", 1477 | "metadata": { 1478 | "colab": { 1479 | "base_uri": "https://localhost:8080/" 1480 | 
}, 1481 | "id": "X1He6CGNHDr7", 1482 | "outputId": "798722ed-f5a3-418b-907e-56cda52001b2" 1483 | }, 1484 | "source": [ 1485 | "print(etree.tostring(root, encoding='utf-8', xml_declaration=True, pretty_print=True))" 1486 | ], 1487 | "execution_count": null, 1488 | "outputs": [ 1489 | { 1490 | "output_type": "stream", 1491 | "text": [ 1492 | "b'\\n\\n \\n \\n \\n
[output garbled during extraction: the printed byte string is the serialized TBX-Core XML document, but its tags were stripped; recoverable content: a tbxHeader with the source note "TBX file automatically generated by Text2TCS (https://text2tcs.univie.ac.at/)" and the encodingDesc value "TBXXCSV02.xcs", followed by conceptEntry elements c1-c7 with transacGrp metadata (origination, Text2TCS, 21-03-08_21h-03m), terms such as active substance, biocidal product(s), BPR, Biocidal Products Directive, Biocidal Products Regulation, European Commission, Commission Regulation, Review, dossier, rules, Regulation, Review Programme (Regulation), ECHA and substance/PT, and descrip elements referencing related concept ids]
\\n'\n" 1493 | ], 1494 | "name": "stdout" 1495 | } 1496 | ] 1497 | }, 1498 | { 1499 | "cell_type": "code", 1500 | "metadata": { 1501 | "id": "maOFOILkHnP7" 1502 | }, 1503 | "source": [ 1504 | "et=etree.ElementTree(root)\n", 1505 | "et.write(\"output.tbx\", encoding='utf-8', xml_declaration=True, pretty_print=True)" 1506 | ], 1507 | "execution_count": null, 1508 | "outputs": [] 1509 | }, 1510 | { 1511 | "cell_type": "markdown", 1512 | "metadata": { 1513 | "id": "Wb52PWHFd9Gg" 1514 | }, 1515 | "source": [ 1516 | "**Graph Visualization**" 1517 | ] 1518 | }, 1519 | { 1520 | "cell_type": "code", 1521 | "metadata": { 1522 | "id": "sIGLRSWZes7A" 1523 | }, 1524 | "source": [ 1525 | "def make_graph(concept_list, filename):\n", 1526 | " g = Digraph(\"G\", filename=filename)\n", 1527 | " #g.attr(size=\"1000,5\")\n", 1528 | " #create nodes\n", 1529 | " for concept in concept_list:\n", 1530 | " nodename=concept[\"id\"]+\"\\n\"+str(concept[\"terms\"])\n", 1531 | " g.node(nodename, shape=\"box\")\n", 1532 | " #create edges\n", 1533 | " for concept in concept_list:\n", 1534 | " node1=concept[\"id\"]+\"\\n\"+str(concept[\"terms\"])\n", 1535 | " for rel in concept[\"relations\"]:\n", 1536 | " for concepte2 in concept_list:\n", 1537 | " if concepte2[\"id\"]==rel[1]:\n", 1538 | " node2=concepte2[\"id\"]+\"\\n\"+str(concepte2[\"terms\"])\n", 1539 | " if rel[0]==\"associativeRelation\":\n", 1540 | " g.edge(node1, node2, label=str(rel[0]), dir=\"none\")\n", 1541 | " else:\n", 1542 | " g.edge(node1, node2, label=str(rel[0]))\n", 1543 | " return g" 1544 | ], 1545 | "execution_count": null, 1546 | "outputs": [] 1547 | }, 1548 | { 1549 | "cell_type": "code", 1550 | "metadata": { 1551 | "id": "qTQs0hbwxcSR" 1552 | }, 1553 | "source": [ 1554 | "def make_graph_accumulated(concept_list, filename):\n", 1555 | " g = Digraph(\"G\", filename=filename)\n", 1556 | " g.graph_attr[\"rankdir\"] = \"BT\" #change direction to bottom to top \n", 1557 | " #g.attr(size=\"1000,5\")\n", 1558 | " #create 
nodes\n", 1559 | " for concept in concept_list:\n", 1560 | " nodename=concept[\"id\"]+\"\\n\"+str(concept[\"terms\"])\n", 1561 | " g.node(nodename, shape=\"box\")\n", 1562 | " #create edges (labels accumulated)\n", 1563 | " for concept_1 in concept_list:\n", 1564 | " for concept_2 in concept_list:\n", 1565 | " #find all relations directed from c1 to c2\n", 1566 | " relations_c1_c2 = []\n", 1567 | " for rel in concept_1[\"relations\"]:\n", 1568 | " if rel[1]==concept_2[\"id\"]:\n", 1569 | " relations_c1_c2.append(rel)\n", 1570 | " #draw a single accumulated arrow between the two concepts \n", 1571 | " if len(relations_c1_c2)>0:\n", 1572 | " node1=concept_1[\"id\"]+\"\\n\"+str(concept_1[\"terms\"]) \n", 1573 | " node2=concept_2[\"id\"]+\"\\n\"+str(concept_2[\"terms\"])\n", 1574 | " label=\"\"\n", 1575 | " for i in range(len(relations_c1_c2)):\n", 1576 | " rel=relations_c1_c2[i]\n", 1577 | " if i == 0:\n", 1578 | " label+=rel[0]\n", 1579 | " else:\n", 1580 | " label+=\", \"+rel[0]\n", 1581 | " if label!=\"associativeRelation\":\n", 1582 | " g.edge(node1, node2, label=label)\n", 1583 | " else:\n", 1584 | " g.edge(node1, node2, label=label, dir=\"none\")\n", 1585 | " return g" 1586 | ], 1587 | "execution_count": null, 1588 | "outputs": [] 1589 | }, 1590 | { 1591 | "cell_type": "code", 1592 | "metadata": { 1593 | "id": "4gaocnHXGJrl" 1594 | }, 1595 | "source": [ 1596 | "make_graph_accumulated(concept_list, \"test_graph_acc_\"+str(text_id)+\".gv\")" 1597 | ], 1598 | "execution_count": null, 1599 | "outputs": [] 1600 | } 1601 | ] 1602 | } --------------------------------------------------------------------------------
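The notebook's "Clean Concept List" cells (synonym merge, id reassignment, relation rewriting and deduplication) can be sketched as a pair of standalone functions. This is a simplified re-implementation, not the notebook's exact code: the fixpoint loop avoids deleting from `concept_list` while iterating over it with `enumerate`, and it assumes every relation target string appears in some concept's term list (as in the notebook's data, where both come from the same extraction step). Concepts follow the notebook's structure: dicts with `id`, `terms`, and `relations` holding `[type, target, confidence]` triples.

```python
def merge_synonyms(concepts):
    """Merge concepts connected by SYNONYM relations (fixpoint loop)."""
    changed = True
    while changed:
        changed = False
        for concept in concepts:
            for rel in concept["relations"]:
                if rel[0] != "SYNONYM":
                    continue
                # find another concept carrying the synonymous term
                other = next((c for c in concepts
                              if rel[1] in c["terms"] and c is not concept), None)
                if other is not None:
                    concept["terms"].extend(other["terms"])
                    concept["relations"].extend(other["relations"])
                    concepts.remove(other)
                    changed = True
                    break
            if changed:
                break  # restart: the list was modified
    return concepts

def assign_ids(concepts):
    """Reassign gap-free ids and rewrite relation targets from terms to ids."""
    term_to_id = {}
    for i, c in enumerate(concepts):
        c["id"] = "c" + str(i + 1)
        for t in c["terms"]:
            term_to_id[t] = c["id"]
    for c in concepts:
        kept = []
        for rel in c["relations"]:
            target = term_to_id[rel[1]]
            if rel[0] == "SYNONYM" or target == c["id"]:
                continue  # drop synonym and self-referential relations
            dup = next((k for k in kept if k[0] == rel[0] and k[1] == target), None)
            if dup:
                dup[2] = max(dup[2], rel[2])  # duplicate: keep higher confidence
            else:
                kept.append([rel[0], target, rel[2]])
        c["relations"] = kept
    return concepts
```

Note that, as in the notebook, the result can depend on merge order when a term appears in several concepts; the fixpoint loop merely guarantees that no unmerged SYNONYM link survives.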