├── LICENSE.txt ├── README.md ├── __init__.py ├── annotation ├── __init__.py ├── annot_scripts │ ├── __init__.py │ ├── abstract_classes.py │ ├── annotation_models.py │ ├── file_loader.py │ ├── knowledge_bases.py │ └── utils.py └── table_annotation.py ├── data ├── hashmap │ ├── units.json │ └── wd_hashmap_indexing.py └── lookup │ └── entity_indexing.py ├── docker-compose.yml ├── lookup ├── __init__.py ├── entity_lookup.py ├── es_lookup.py └── settings.py ├── preprocessing ├── __init__.py ├── prp_scripts │ ├── __init__.py │ ├── entity_parsers │ │ ├── __init__.py │ │ ├── phoneNumber_parser.py │ │ ├── regex_parser.py │ │ ├── spacy_ner_parser.py │ │ └── unit_parser.py │ ├── file_loader.py │ ├── table_info_extraction_modules.py │ └── utils.py └── table_preprocessing.py └── setup.py /LICENSE.txt: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 
49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. 
"Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 
174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 
234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 
296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 
414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. 
The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. 
You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 
583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | 635 | Copyright (C) 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see . 649 | 650 | Also add information on how to contact you by electronic and paper mail. 
651 | 
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 | 
655 | Copyright (C)
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 | 
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 | 
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | .
668 | 
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | .
675 | 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 
1 | # Table Annotation
2 | 
3 | The semantic annotation process of a table, performed by DAGOBAH, involves three steps:
4 | 1. Table Preprocessing: a set of comprehensive heuristics to clean the table (e.g. fix encoding errors) and determine the table orientation and the data types of the columns.
5 | 2. Entity Lookup: retrieve a number of entity candidates for the mentions in the table, using an Elasticsearch-based entity lookup.
6 | 3. Annotation: disambiguate the retrieved entity candidates and select the most relevant entity for each mention. This consists of three tasks:
7 | - Cell-Entity Annotation (CEA): assign an entity in a KG to a cell mention.
8 | - Column-Type Annotation (CTA): predict semantic types for a column.
9 | - Column-Pair Annotation (CPA): represent the relationship between two columns with a property in the KG.
10 | 
11 | # Installation
12 | 
13 | - Environment:
14 | 
15 | ```conda create -n dagobah python=3.10```
16 | 
17 | - Indexing the entity label dump using [elasticsearch](https://github.com/elastic/elasticsearch-py) and the KG dump using [lmdb](https://lmdb.readthedocs.io/en/release/):
18 | 
19 | *Resource requirements:* 500GB SSD.
20 | 
21 | ```docker-compose up -d```
22 | 
23 | This will download the dumps from [zenodo](https://zenodo.org/records/8426650) and index them using the scripts from ```./table-annotation/data/```.
24 | 
25 | This step takes several hours to finish (12-24h).
26 | 
27 | Once finished, the Elasticsearch endpoint for entity lookup is available at ```localhost:9200``` and the KG hashmap is accessible at ```./table-annotation/data/hashmap/```.
28 | 
29 | - Install the system:
30 | 
31 | Run
32 | 
33 | ```python setup.py install```
34 | 
35 | then download the spaCy NER model required by table preprocessing:
36 | ```python -m spacy download en_core_web_sm``` 37 | 38 | ### Examples 39 | 40 | - Entity Lookup: 41 | 42 | ``` 43 | from lookup import entity_lookup 44 | entity_lookup(labels=["MUFC"], KG="dagobah_lookup") 45 | 46 | Output: 47 | {'executionTimeSec': 3.5, 'output': [{'label': 'MUFC', 'entities': [{'entity': 'Q18656', 'label': 'MUFC', 'score': 0.9314198500543867, 'origin': 'MAIN_ALIAS'}, {'entity': 'Q1764590', 'label': 'MUFC', 'score': 0.886259396137413, 'origin': 'MAIN_ALIAS'}, {'entity': 'Q1131109', 'label': 'MUFC', 'score': 0.8855865364578572, 'origin': 'MAIN_ALIAS'}, {'entity': 'Q19828435', 'label': 'MNUFC', 'score': 0.7682839133054579, 'origin': 'MAIN_ALIAS'}, ,...]}]} 48 | ``` 49 | 50 | - Table Preprocessing: 51 | 52 | ``` 53 | from preprocessing import table_preprocessing 54 | table_preprocessing([["city", "country"],["Paris", "France"], ["Berlin", "Germany"], ["Madrid", "Spain"], ["Rome", "Italy"]]) 55 | 56 | Output: 57 | {'raw': {'tableDataRaw': [['city', 'country'], ['Paris', 'France'], ['Berlin', 'Germany'], ['Madrid', 'Spain'], ['Rome', 'Italy']]}, 'preprocessed': {'tableDataRevised': [['city', 'country'], ['Paris', 'France'], ['Berlin', 'Germany'], ['Madrid', 'Spain'], ['Rome', 'Italy']], 'tableOrientation': {'orientationLabel': 'HORIZONTAL', 'orientationScore': 0.1}, 'headerInfo': {'hasHeader': True, 'headerPosition': 0, 'headerLabel': ['city', 'country'], 'headerScore': 0.09}, 'primaryKeyInfo': {'hasPrimaryKey': True, 'primaryKeyPosition': 0, 'primaryKeyScore': 0.03}, 'primitiveTyping': [{'columnIndex': 0, 'typing': [{'typingLabel': 'GPE', 'typingScore': 0.75}, {'typingLabel': 'UNKNOWN', 'typingScore': 0.25}]}, {'columnIndex': 1, 'typing': [{'typingLabel': 'GPE', 'typingScore': 1.0}]}]}} 58 | ``` 59 | 60 | - Table Annotation: 61 | ``` 62 | from annotation import table_annotation 63 | table_annotation([["Title","Year","Cast","col3"], ["Pulp Fiction","1994","John Travolta","Gangster"], ["Casino Royale","1967","David Niven","James Bond"], ["Outsiders","1983","Matt Dillon","Drama"], ["Hearts of Darkness: A Filmmakers Apocalypse","1991","Marlon Brando","Docmuentary"], ["Virgin Suicides","1999","Kristen Dunst","Drama"]]) 64 | 65 | Output: 66 | { "annotated": { "CEA": [ { "annotation": { "label": "Pulp Fiction", "score": 0.93, "uri": "https://www.wikidata.org/wiki/Q104123" }, "column": 0, "row": 1 }, { "annotation": { "label": "Casino Royale", "score": 0.98, "uri": "https://www.wikidata.org/wiki/Q591272" }, "column": 0, "row": 2 }, { "annotation": { "label": "The Outsiders", "score": 0.99, "uri": "https://www.wikidata.org/wiki/Q1055332" }, "column": 0, "row": 3 }, { "annotation": { "label": "Hearts of Darkness: A Filmmaker's Apocalypse", "score": 0.9, "uri": "https://www.wikidata.org/wiki/Q1962835" }, "column": 0, "row": 4 }, { "annotation": { "label": "The Virgin Suicides", "score": 0.95, "uri": "https://www.wikidata.org/wiki/Q1423971" }, "column": 0, "row": 5 }, { "annotation": { "label": "John Travolta", "score": 0.94, "uri": "https://www.wikidata.org/wiki/Q80938" }, "column": 2, "row": 1 }, { "annotation": { "label": "David Niven", "score": 0.94, "uri": "https://www.wikidata.org/wiki/Q181917" }, "column": 2, "row": 2 }, { "annotation": { "label": "Matt Dillon", "score": 0.91, "uri": "https://www.wikidata.org/wiki/Q193070" }, "column": 2, "row": 3 }, { "annotation": { "label": "Marlon Brando", "score": 0.9, "uri": "https://www.wikidata.org/wiki/Q34012" }, "column": 2, "row": 4 }, { "annotation": { "label": "Kirsten Dunst", "score": 0.81, "uri": 
"https://www.wikidata.org/wiki/Q76478" }, "column": 2, "row": 5 }, { "annotation": { "label": "gangster film", "score": 0.67, "uri": "https://www.wikidata.org/wiki/Q7444356" }, "column": 3, "row": 1 }, { "annotation": { "label": "James Bond", "score": 0.78, "uri": "https://www.wikidata.org/wiki/Q2009573" }, "column": 3, "row": 2 }, { "annotation": { "label": "drama", "score": 0.87, "uri": "https://www.wikidata.org/wiki/Q130232" }, "column": 3, "row": 3 }, { "annotation": { "label": "documentary film", "score": 0.64, "uri": "https://www.wikidata.org/wiki/Q93204" }, "column": 3, "row": 4 }, { "annotation": { "label": "drama", "score": 0.87, "uri": "https://www.wikidata.org/wiki/Q130232" }, "column": 3, "row": 5 } ], "CPA": [ { "annotation": { "coverage": 1.0, "label": "cast member", "score": 0.95, "uri": "https://www.wikidata.org/wiki/Property:P161" }, "headColumn": 0, "tailColumn": 2 }, { "annotation": { "coverage": 0.8, "label": "genre", "score": 0.75, "uri": "https://www.wikidata.org/wiki/Property:P136" }, "headColumn": 0, "tailColumn": 3 }, { "annotation": { "coverage": 0.8, "label": "(-)cast member::genre", "score": 0.09, "uri": "(-)https://www.wikidata.org/wiki/Property:P161::https://www.wikidata.org/wiki/Property:P136" }, "headColumn": 2, "tailColumn": 3 }, { "annotation": { "coverage": 0.4, "label": "publication date", "score": 0.06, "uri": "https://www.wikidata.org/wiki/Property:P577" }, "headColumn": 0, "tailColumn": 1 }, { "annotation": { "coverage": 0.2, "label": "publication date", "score": 0.0, "uri": "https://www.wikidata.org/wiki/Property:P577" }, "headColumn": 3, "tailColumn": 1 } ], "CTA": [ { "annotation": [ { "coverage": 1.0, "label": "film", "score": 0.95, "uri": "https://www.wikidata.org/wiki/Q11424" } ], "column": 0 }, { "annotation": [ { "coverage": 1.0, "label": "human", "score": 0.9, "uri": "https://www.wikidata.org/wiki/Q5" } ], "column": 2 }, { "annotation": [ { "coverage": 0.8, "label": "film genre", "score": 0.64, "uri": "https://www.wikidata.org/wiki/Q201658" } ], "column": 3 } ] }, "raw": { "tableContent": null, "tableEndOffset": null, "tableNum": null, "tableOffset": null }, "requestInfo": { "id": 1 } } 67 | ``` 68 | ### Citation 69 | Please cite the following if you found this tool useful in your research. 
70 | ``` 71 | @inproceedings{huynh2022heuristics, 72 | title={{From Heuristics to Language Models: A Journey Through the Universe of Semantic Table Interpretation with DAGOBAH}}, 73 | author={Huynh, Viet-Phi and Chabot, Yoan and Labb{\'e}, Thomas and Liu, Jixiong and Troncy, Rapha{\"e}l}, 74 | booktitle={Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab)}, 75 | year={2022} 76 | } 77 | 78 | or 79 | 80 | @inproceedings{huynh2021dagobah, 81 | title={{DAGOBAH: Table and Graph Contexts for Efficient Semantic Annotation of Tabular Data}}, 82 | author={Huynh, Viet-Phi and Liu, Jixiong and Chabot, Yoan and Deuz{\'e}, Fr{\'e}d{\'e}ric and Labb{\'e}, Thomas and Monnin, Pierre and Troncy, Rapha{\"e}l}, 83 | booktitle={Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab)}, 84 | year={2021} 85 | } 86 | ``` 87 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ -------------------------------------------------------------------------------- /annotation/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 
5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | from .table_annotation import table_annotation -------------------------------------------------------------------------------- /annotation/annot_scripts/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ -------------------------------------------------------------------------------- /annotation/annot_scripts/abstract_classes.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 
5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | from abc import ABC, abstractmethod 21 | import collections 22 | 23 | ### For definition of KB ### 24 | class AbstractKnowledgeBase(ABC): 25 | """ KB Abstract class """ 26 | @abstractmethod 27 | def __init__(self, dump_path): 28 | """ Intialize a KB: reading hashmaps... """ 29 | pass 30 | 31 | @abstractmethod 32 | def is_valid_ID(self, entity_id): 33 | """ Check whether an id is a valid ID w.r.p the KB """ 34 | pass 35 | 36 | @abstractmethod 37 | def get_subgraph_of_entity(self, entity_id): 38 | """ get (subject, predicate, object) edges involving entity_id 39 | where object is target entity """ 40 | pass 41 | 42 | @abstractmethod 43 | def get_types_of_entity(self, entity_id, num_level): 44 | """ Get the types of an entity in hierachical levels. 45 | A list of relevant properties for entity types (for ex. occupation) is also allowed.""" 46 | pass 47 | 48 | @abstractmethod 49 | def get_label_of_entity(self, entity_id): 50 | """ Get the labels + aliases of an entity """ 51 | pass 52 | 53 | @abstractmethod 54 | def get_num_edges(self, entity_id): 55 | """ Get number of incoming edges of an entity in KG """ 56 | pass 57 | 58 | @abstractmethod 59 | def prefixing_entity(self, entity): 60 | """ Adding prefix to an entity """ 61 | pass 62 | 63 | ### Table component definition ### 64 | class Column(collections.namedtuple( 65 | "Column", ("col_index"))): 66 | """ Definition of table column """ 67 | pass 68 | 69 | class Cell(collections.namedtuple( 70 | "Cell", ("row_index", "col_index"))): 71 | """ Definition of table cell. """ 72 | pass 73 | 74 | class Column_Pair(collections.namedtuple( 75 | "Column_Pair", ("head_col_index", "tail_col_index"))): 76 | """ Definition of column pair. """ 77 | pass 78 | 79 | class Candidate_Entity(collections.namedtuple( 80 | "Candidate_Entity", ("row_index", "col_index", "id"))): 81 | """ Definition of a candidate entity for a table cell. """ 82 | pass 83 | 84 | ### Abstract annotation components ### 85 | class Relation(collections.namedtuple( 86 | "Relation", ("id", "semantic_proximity"))): 87 | """ Definition of a relation candidate""" 88 | pass 89 | 90 | class Edge(collections.namedtuple( 91 | "Edge", ("pid", "info"))): 92 | """ Definition of an edge in KG: info field indicates the type of object that the edge points to: entity, literal value """ 93 | pass 94 | 95 | ### Abstract annotation model ### 96 | class AbstractAnnotationModel(ABC): 97 | """ Table annotation abstract class.""" 98 | def __init__(self, table, target_kb, preprocessing_backend, lookup_backend): 99 | """ Initialize the annotation model: KB specified at target_kb, preprocessing using preprocessing_backend, lookup using lookup_backend... 
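        The model wraps the full DAGOBAH workflow described in the module header: table
        preprocessing, entity lookup for the cell mentions, and the CEA/CTA/CPA
        annotation tasks, each exposed as a separate method below.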
""" 100 | pass 101 | 102 | def preprocessing_task(self): 103 | """ table preprocessing task """ 104 | pass 105 | 106 | def lookup_task(self): 107 | """ lookup task for semantic cells in table """ 108 | pass 109 | 110 | def cta_task(self, column_idx, only_one): 111 | """ type annotation for a column index. Return many or more types. """ 112 | pass 113 | 114 | def cea_task(self, column_idx, row_idx, only_one): 115 | """ entity annotation for a cell at (row_index, column_index). Return many or more annotation. """ 116 | pass 117 | 118 | def cpa_task(self, column_head, column_tail, only_one): 119 | """ relation annotation for a column pair (head_column, tail_column). Return many or more annotation. """ 120 | pass 121 | 122 | 123 | -------------------------------------------------------------------------------- /annotation/annot_scripts/file_loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | ''' 21 | FILE READER: 22 | 23 | Auto converter 24 | 25 | Auto-converter is for automatically transform from different types of data 26 | to tables 27 | 28 | ''' 29 | import csv 30 | import pandas as pd 31 | import numpy as np 32 | import chardet 33 | import datetime 34 | import openpyxl 35 | from openpyxl.utils import range_boundaries 36 | from scipy import ndimage as ndi 37 | 38 | def txt_to_table(filepath: str): 39 | """ 40 | Read table from text file (txt, tsv,csv..). Currently, only 1 table per file supported 41 | and delimiter detected automatically. 
42 | """ 43 | list_tables = [] 44 | try: 45 | f = open(filepath, 'rb') 46 | en_scheme = chardet.detect(f.read()) # detect encoding scheme 47 | f.close() 48 | f = open(filepath, 'r', encoding=en_scheme['encoding']) 49 | possible_sep = [',', '\t', ';', ':'] 50 | dialect = csv.Sniffer().sniff(f.read(), possible_sep) 51 | f.close() 52 | f = open(filepath, 'r', encoding=en_scheme['encoding']) 53 | except FileNotFoundError: 54 | print(" File not found !!") 55 | return [] 56 | except Exception as e: 57 | print(e) 58 | return [] 59 | reader = csv.reader(f, delimiter=dialect.delimiter, skipinitialspace=True) 60 | table = [] 61 | for item in reader: 62 | table.append(item) 63 | if table: 64 | list_tables.append(table) 65 | f.close() 66 | return {"tableFromTextFile": list_tables} 67 | 68 | """ 69 | def json_to_table(filepath: str) -> List[List[List[str]]]: 70 | list_tables = [] 71 | table = [] 72 | with open(filepath, 'r', encoding="utf8") as f: 73 | temp = json.loads(f.read()) 74 | table.append(temp) 75 | list_tables.append(table) 76 | return list_tables 77 | """ 78 | 79 | def excel_to_table(filepath): 80 | """ 81 | Read multiple tables per worksheet in excel file. Only .xlsx supported. Old .xls not supported. 82 | """ 83 | wb_obj = openpyxl.load_workbook(filepath, data_only=True) 84 | sheetnames = wb_obj.sheetnames 85 | tables_per_sheet = {} 86 | for sheet_name in sheetnames: 87 | w_sheet = wb_obj[sheet_name] 88 | num_sheet_col = w_sheet.max_column 89 | num_sheet_row = w_sheet.max_row 90 | ## unmerge cells 91 | groups = w_sheet.merged_cells 92 | for group in list(groups): 93 | w_sheet.unmerge_cells(range_string=str(group)) 94 | min_col, min_row, max_col, max_row = range_boundaries(str(group)) 95 | top_left_cell_value = w_sheet.cell(row=min_row, column=min_col).value 96 | for row in w_sheet.iter_rows(min_col=min_col, min_row=min_row, max_col=max_col, max_row=max_row): 97 | for cell in row: 98 | cell.value = top_left_cell_value 99 | 100 | ## clustering tables in a worksheet by connected components. 101 | raw_sheet = [] ## read value of all cells. 102 | binary_sheet = [] ## 1 (foreground) if cell contains value, otherwise 0 (background) 103 | for row in w_sheet.iter_rows(min_col=1, min_row=1, max_col=num_sheet_col, max_row=num_sheet_row): 104 | binary_row = [] 105 | value_row = [] 106 | for cell in row: 107 | ## read raw value, take care of reading datetime properly. 108 | if cell.value: 109 | if isinstance(cell.value, datetime.datetime): 110 | value_row.append(cell.value.strftime('%m/%d/%Y')) 111 | else: 112 | value_row.append(cell.value) 113 | else: 114 | value_row.append("") 115 | 116 | ## decide if a cell contains a value (hence a foreground cell) 117 | if cell.value: 118 | binary_row.append(True) 119 | elif cell.fill.patternType: 120 | binary_row.append(True) 121 | elif cell.border.left.style or cell.border.right.style: 122 | binary_row.append(True) 123 | else: 124 | binary_row.append(False) 125 | raw_sheet.append(value_row) 126 | binary_sheet.append(binary_row) 127 | 128 | ## find connected components 129 | cnt_components = np.array(binary_sheet) 130 | cnt_component_labels, n_cnt_component = ndi.label(cnt_components) 131 | 132 | ## each connected component can be a potential independent table. 
133 | tables = [] 134 | for i_label in range(1, n_cnt_component + 1): 135 | ## define the rectangle that may contain a table 136 | min_row = 0 137 | max_row = 0 138 | min_col = 0 139 | max_col = 0 140 | for i_line, line in enumerate(cnt_component_labels): 141 | if i_label in line: 142 | min_row = i_line 143 | break 144 | for i_line, line in enumerate(list(map(list, zip(*cnt_component_labels)))): 145 | if i_label in line: 146 | min_col = i_line 147 | break 148 | for i_line, line in enumerate(cnt_component_labels[::-1]): 149 | if i_label in line: 150 | max_row = len(cnt_component_labels) - 1 - i_line 151 | break 152 | for i_line, line in enumerate(list(map(list, zip(*cnt_component_labels)))[::-1]): 153 | if i_label in line: 154 | max_col = num_sheet_col - 1 - i_line 155 | break 156 | ## check if there exist potentially a table in the rectangle. 157 | table = np.array(raw_sheet)[min_row:max_row+1,min_col:max_col+1] 158 | if table.shape[0] > 1 and table.shape[1] > 1: 159 | tables.append(table.tolist()) 160 | tables_per_sheet[f"tableFromExcelSheet_{sheet_name}"] = tables 161 | return tables_per_sheet 162 | 163 | def deprecated_excel_to_table(filepath): 164 | """ 165 | Read table from excel file. Currently, multi tables supported. 166 | Only one heuristic supported for seperating tables: blank lines between 2 consecutive tables. 167 | Args: 168 | input table path 169 | Return 170 | 3D array with first dimension is the number of 2D tables. 171 | """ 172 | list_tables = [] 173 | try: 174 | xl = pd.ExcelFile(filepath, engine='openpyxl') 175 | except FileNotFoundError: 176 | print(" File not found !!") 177 | return [] 178 | except: 179 | return [] 180 | sheet_names = xl.sheet_names 181 | for i_sheet in range(len(sheet_names)): 182 | excel_sheet = pd.read_excel(filepath, header=None, sheet_name=i_sheet) 183 | excel_sheet = excel_sheet.values.tolist() 184 | single_table = [] 185 | for line in excel_sheet: 186 | i_element = len(line) - 1 187 | end_of_line = len(line) - 1 188 | is_eol = False 189 | while i_element >= 0: 190 | if pd.isna(line[i_element]): 191 | if not is_eol: 192 | end_of_line -= 1 193 | else: 194 | line[i_element] = "" 195 | else: 196 | is_eol = True 197 | i_element -= 1 198 | line = line[:end_of_line+1] 199 | line = [str(s) for s in line] 200 | if line == []: 201 | if single_table != []: 202 | list_tables.append(single_table) 203 | single_table = [] 204 | else: 205 | tmp_line = [] 206 | for e in line: 207 | if e != "" and e != " "*len(e): 208 | tmp_line.append(e) 209 | if tmp_line == []: 210 | if len(single_table) > 1: 211 | single_table.append(line) 212 | else: 213 | single_table = [] 214 | else: 215 | single_table.append(line) 216 | if single_table != []: 217 | list_tables.append(single_table) 218 | return list_tables 219 | 220 | def file_loader(filepath): 221 | ''' 222 | automatic form detection tool. 
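    Dispatches on the file extension; for example (hypothetical paths):
        file_loader("cities.csv")    # handled by txt_to_table
        file_loader("report.xlsx")   # handled by excel_to_table
    Unsupported extensions yield an empty list.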
223 | :param filepath: target file 224 | :return: 225 | ''' 226 | splitPart = filepath.split('.') 227 | if(splitPart[-1].lower() in ['csv', 'txt', 'tsv']): return txt_to_table(filepath) 228 | # if(splitPart[-1] in ['json']): return json_to_table(filepath) 229 | if(splitPart[-1].lower() in ['xlsx']): return excel_to_table(filepath) 230 | return [] 231 | 232 | 233 | if __name__ == '__main__': 234 | # csv = file_loader(r"C:\Users\pgkx5469\Documents\ECE.csv") 235 | txt = file_loader(r"/datastorage/uploaded_files/ECE.csv") 236 | # txt = file_loader(r"C:\Users\pgkx5469\Documents\Python Scripts\t1.txt") 237 | # excel = file_loader(r"C:\Users\pgkx5469\Documents\Projets\dagobah\data\round_3\1.xlsx") 238 | print(txt) -------------------------------------------------------------------------------- /annotation/annot_scripts/knowledge_bases.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | import lmdb 21 | import json 22 | import pickle 23 | from .abstract_classes import AbstractKnowledgeBase 24 | 25 | class Wikidata_KB(AbstractKnowledgeBase): 26 | """ Wikidata KB Interface """ 27 | def __init__(self, dump_path): 28 | """ Initialize the Wikidata KB """ 29 | ## list of properties to be considered in CTA 30 | self.type_properties = ["P31", "P106", "P39", "P105"] 31 | ## PID of subclass property in Wikidata KB. 
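        ## (P31 is Wikidata's "instance of" property, P279 its "subclass of" property)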
32 | self.instanceOfPID = "P31" 33 | self.subClassPID = "P279" 34 | ## PID of unit symbol in Wikidata KB 35 | self.unitSymbolPID = "P5061" 36 | ## pairs of time periods 37 | self.timePeriodPID = [ 38 | ("P571", "P576"), ("P571", "P2699"), ("P571", "P730"), ("P571", "P3999"), 39 | ("P1619", "P3999"), ("P729", "P730"), ("P5204", "P2669"), ("P5204", "P576"), 40 | ("P580", "P582"), ("P2031", "P2032"), ("P3415", "P3416"), ("P3027", "P3028"), 41 | ("P7103", "P7104"), ("P2310", "P2311"), ("P7124", "P7125"), ("P569", "P570"), 42 | ("P1636", "P570") 43 | ] 44 | 45 | ## transitive property (https://www.wikidata.org/wiki/Wikidata:List_of_properties/transitive_relation) 46 | self.transitivePID = ["P131", "P276", "P279", "P361", "P403", "P460", "P527", "P706", "P927", "P1647", "P2094", 47 | "P3373", "P3403", "P5607", "P5973", "P171"] 48 | 49 | ## edge reader, entity reader, property reader, unit dictionary reader 50 | try: 51 | ## load the unit_entity dictionary 52 | self.unit_entity_mapping = json.load(open(dump_path + "/units.json", "r")) 53 | ## wikidata edges reader 54 | self.edge_reader = lmdb.open(dump_path + "/edges", readonly=True, readahead=False, lock=False) 55 | self.edge_txn = self.edge_reader.begin() 56 | except Exception as e: 57 | print(f" Error loading knowledge dumps. Details: {e} !! ") 58 | 59 | def close_dump(self): 60 | """ Close the graph readers """ 61 | self.edge_reader.close() 62 | 63 | def is_valid_ID(self, entity_id): 64 | """ Heuristic way to verify whether an id is a valid entityID wrt. Wikidata """ 65 | if (len(entity_id) > 1) and (entity_id[0] in ["P", "Q"]) and (entity_id[1:].isdigit()): 66 | return True 67 | return False 68 | 69 | def get_subgraph_of_entity(self, entity_id): 70 | """ Get forward nodes (predicate->object) and backward nodes (subject->predicate) of an entity in KG. """ 71 | key = self.edge_txn.get(entity_id.encode("ascii")) 72 | if key: 73 | prop_obj_dict = pickle.loads(key) 74 | del prop_obj_dict["labels"], prop_obj_dict["aliases"], prop_obj_dict["descriptions"] 75 | else: 76 | prop_obj_dict = {} 77 | return prop_obj_dict 78 | 79 | def get_label_of_entity(self, entity_id): 80 | """ Get the labels and aliases of an entity. Language info is not returned 81 | If only_one = True, return the default en label """ 82 | en_label = "" 83 | key = self.edge_txn.get(entity_id.encode("ascii")) 84 | if key: 85 | prop_obj_dict = pickle.loads(key) 86 | if prop_obj_dict["labels"]: 87 | en_label = prop_obj_dict["labels"][0] 88 | else: 89 | en_label = "No English Label" 90 | return en_label 91 | 92 | def get_num_edges(self, entity_id): 93 | """ Get number of incoming edges of an entity in KG """ 94 | key = self.edge_txn.get(entity_id.encode("ascii")) 95 | num_edges = 0 96 | if key: 97 | prop_obj_dict = pickle.loads(key) 98 | for prop, obj_dict in prop_obj_dict.items(): 99 | if prop not in ["descriptions", "labels", "aliases"]: 100 | num_edges += len(obj_dict) 101 | return num_edges 102 | 103 | def get_symbol_of_unit_entity(self, unit_entity_id): 104 | """ get the unit symbol of an unit entity. E.g. unit symbol of Q11573 (metre) is m """ 105 | key = self.edge_txn.get(unit_entity_id.encode("ascii")) 106 | if key: 107 | prop_obj_dict = pickle.loads(key) 108 | ## since Pint does not support officially currency, we should handle it ourself by defining Currency in Pint. First, convert currency symbol to name (e.g. € to euro) 109 | ## since Pint does not accept special symbols like currency symbols. 
110 | ## Currently, we only support: dollar, euro, japanese_yen, chinese_yuan, pound_sterling, south_korean_won, russian_ruble, australian_dollar 111 | if "Q8142" in prop_obj_dict.get(self.instanceOfPID, {}): ## if unit indicates currency (Q8142) 112 | return "_".join(self.get_label_of_entity(unit_entity_id).lower().split(" ")) 113 | else: 114 | if self.unitSymbolPID in prop_obj_dict: 115 | return list(prop_obj_dict[self.unitSymbolPID].keys())[0] 116 | ## not a unit entity 117 | return None 118 | else: 119 | ## not an entity 120 | return None 121 | 122 | def map_unit_dimension_to_entity(self, unit_dim): 123 | """ return the corresponding entity of a given unit dimension. E.g. "microsecond" --> Q842015 """ 124 | return self.unit_entity_mapping.get(unit_dim, {}).get("wikidataID", "").replace("http://www.wikidata.org/entity/", "") or None 125 | 126 | def get_supertypes_of_type(self, type_id): 127 | """ Return the supertypes of an entity type """ 128 | key = self.edge_txn.get(type_id.encode("ascii")) 129 | if key: 130 | prop_obj_dict = pickle.loads(key) 131 | else: 132 | prop_obj_dict = {} 133 | super_type = prop_obj_dict.get(self.subClassPID, {}) 134 | return super_type 135 | 136 | def get_types_of_entity(self, entity_id, num_level=1): 137 | """ Get the hierarchical types of an entity up to num_level (level_1 contains the direct types). 138 | Properties in self.type_properties other than P31 (e.g. occupation) are also considered 139 | and take precedence over P31 when they yield types.""" 140 | hierachical_types = {} 141 | if num_level > 0: 142 | key = self.edge_txn.get(entity_id.encode("ascii")) 143 | if key: 144 | prop_obj_dict = pickle.loads(key) 145 | else: 146 | prop_obj_dict = {} 147 | 148 | instanceOf_types = {} 149 | others_types = {} 150 | for prop in self.type_properties: 151 | a_type = prop_obj_dict.get(prop, None) 152 | if a_type: 153 | if prop == self.instanceOfPID: 154 | instanceOf_types.update(a_type) 155 | else: 156 | others_types.update(a_type) 157 | if others_types: 158 | hierachical_types[f"level_1"] = others_types 159 | # hierachical_types[f"level_2"] = instanceOf_types 160 | else: 161 | hierachical_types[f"level_1"] = instanceOf_types 162 | 163 | if num_level > 1: 164 | inter_types = hierachical_types[f"level_1"] 165 | for i in range(2, num_level+1): 166 | if f"level_{i}" not in hierachical_types: 167 | hierachical_types[f"level_{i}"] = {} 168 | types = {} 169 | for t in inter_types: 170 | key = self.edge_txn.get(t.encode("ascii")) 171 | if key: 172 | prop_obj_dict = pickle.loads(key) 173 | else: 174 | prop_obj_dict = {} 175 | super_type = prop_obj_dict.get(self.subClassPID, None) 176 | if super_type: 177 | types.update(super_type) 178 | hierachical_types[f"level_{i}"].update(types) 179 | inter_types = types 180 | return hierachical_types 181 | 182 | def map_rank(self, rank): 183 | """ 184 | Each wikidata attribute has a rank {"PREFERRED", "NORMAL", "DEPRECATED"}. 
185 | We map it to a numeric relevance value (2: preferred, 1: normal, 0: deprecated) 186 | """ 187 | if rank == "PREFERRED": 188 | return 2 189 | elif rank == "NORMAL": 190 | return 1 191 | else: 192 | return 0 193 | 194 | def prefixing_entity(self, entity): 195 | """ Append the URI prefix to an entity or property ID """ 196 | if entity[0] == "Q": ## entity 197 | # return "https://www.wikidata.org/wiki/" + entity 198 | return "http://www.wikidata.org/entity/" + entity ## avoid the wiki redirect, use the real identifier of the entity 199 | elif entity[0] == "P": ## property 200 | # return "https://www.wikidata.org/wiki/Property:" + entity 201 | return "http://www.wikidata.org/prop/direct/" + entity ## avoid the wiki redirect, use the real identifier of the property 202 | else: 203 | return entity 204 | 205 | -------------------------------------------------------------------------------- /annotation/annot_scripts/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | from dateutil.parser import parse 21 | from rapidfuzz import fuzz 22 | from quantulum3 import parser as qt_unit_parser 23 | import pint 24 | ureg = pint.UnitRegistry() 25 | ## since Pint does not officially support currency, we define it ourselves. 
26 | ## Currently, we only support: dollar, euro, japanese_yen, chinese_yuan, pound_sterling, south_korean_won, russian_ruble, australian_dollar 27 | ureg.load_definitions(""" 28 | dollar = [currency] = united_states_dollar 29 | euro = 1.1 dollar 30 | japanese_yen = 0.0082 dollar 31 | chinese_yuan = 0.16 dollar 32 | renminbi = 0.16 dollar 33 | pound_sterling = 1.32 dollar 34 | south_korean_won = 0.00082 dollar 35 | russian_ruble = 0.01 dollar 36 | australian_dollar = 0.75 dollar 37 | """.splitlines()) 38 | 39 | def float_parse(value): 40 | """ Check whether an input value is numeric; if yes, return it as a number """ 41 | if isinstance(value, float) or isinstance(value, int): 42 | return value 43 | elif isinstance(value, str): 44 | try: 45 | return float(value.replace(",", "")) 46 | except: 47 | return None 48 | 49 | def date_similarity(s1, s2, operator): 50 | """ Check whether two input strings parse as datetimes that satisfy the given comparison operator """ 51 | try: 52 | if operator(parse(s1), parse(s2)): 53 | return True 54 | return False 55 | except: 56 | return False 57 | 58 | def get_year_from_date(d): 59 | " Return the year of a date string " 60 | try: 61 | return str(parse(d).year) 62 | except: 63 | return False 64 | 65 | def textual_similarity(s1, s2): 66 | """ Calculate the similarity score between two textual values using three Levenshtein-based ratios. """ 67 | char_based_ratio = fuzz.ratio(s1.lower(), s2.lower())/100 68 | token_sort_based_ratio = fuzz.token_sort_ratio(s1.lower(), s2.lower())/100 69 | token_set_based_ratio = fuzz.token_set_ratio(s1.lower(), s2.lower())/100 70 | ## the final ratio is the mean of two maximum ratios among three ratios. 71 | ## to avoid that 2 ratios of same values dominate the other. 72 | ## e.g. char_based_ratio("universal", "universal picture") = token_sort_based_ratio("universal", "universal picture") = 0.66 73 | ## so including both ratios in the final ratio will decrease the significance of token_set_based_ratio("universal", "universal picture") which is 1.0 74 | final_ratio = sum(sorted([char_based_ratio, token_sort_based_ratio, token_set_based_ratio], reverse=True)[:2])/2 75 | return final_ratio 76 | # return (fuzz.ratio(s1.lower(), s2.lower())/100 + fuzz.token_sort_ratio(s1.lower(), s2.lower())/100 + fuzz.token_set_ratio(s1.lower(), s2.lower())/100)/3 77 | 78 | def dimensionless_quantity_similarity(s1, s2): 79 | """ Calculate the similarity score between two dimensionless (quantity) values. """ 80 | s1_float = float_parse(s1) 81 | s2_float = float_parse(s2) 82 | if s1_float is not None and s2_float is not None: 83 | sim = 1 - abs(s1_float - s2_float)/(abs(s1_float) + abs(s2_float) + 0.0001) 84 | return sim 85 | else: 86 | return 0.0 87 | 88 | def standardize_to_base_unit(measure): 89 | """ standardize a measurement with unit to base unit. E.g. 5 km -> 5000 m given that metre is base unit of length """ 90 | standardized_measure = {} 91 | if isinstance(measure, str): ## if the input measure is plain text, e.g. 
"5 km" 92 | parsed_measure = qt_unit_parser.parse(measure) 93 | for a_unit in parsed_measure: 94 | if a_unit.unit.name != "dimensionless": 95 | try: 96 | transformed_measure = float(a_unit.value)*ureg("_".join(a_unit.unit.name.lower().split(" "))).to_base_units() 97 | if transformed_measure.units not in standardized_measure: 98 | standardized_measure[transformed_measure.units] = [transformed_measure.magnitude] 99 | else: 100 | for magnitude in standardized_measure[transformed_measure.units]: 101 | if 0.98 < magnitude/transformed_measure.magnitude < 0.98**-1: 102 | ## this indicates that single measure has different units, hence, we dont need to append 103 | ## duplicated measure into result 104 | pass 105 | else: 106 | standardized_measure[transformed_measure.units].append(transformed_measure.magnitude) 107 | except: 108 | pass 109 | elif isinstance(measure, dict) and "value" in measure and "unit" in measure: 110 | try: 111 | transformed_measure = float(measure["value"])*ureg(measure["unit"]).to_base_units() 112 | standardized_measure[transformed_measure.units] = [transformed_measure.magnitude] 113 | except: 114 | pass 115 | 116 | return standardized_measure 117 | 118 | # def dimensional_quantity_similarity(s1, s1_unit, s2, s2_unit): 119 | # """ Calculate the similarity score between two dimensional (quantity) values. """ 120 | # ## convert to base units 121 | # try: 122 | # converted_s1 = float(s1)*ureg(s1_unit).to_base_units() 123 | # converted_s2 = float(s2)*ureg(s2_unit).to_base_units() 124 | # except (pint.errors.UndefinedUnitError, AttributeError): 125 | # ## invalid or unsupported unit by Pint 126 | # return 0.0 127 | 128 | # if converted_s1.units == converted_s2.units: 129 | # converted_s1_val = converted_s1.magnitude 130 | # converted_s2_val = converted_s2.magnitude 131 | # ## compare s1 and s2 values 132 | # sim = 1 - abs(converted_s1_val - converted_s2_val)/(abs(converted_s1_val) + abs(converted_s2_val) + 0.0001) 133 | # return sim 134 | # else: 135 | # return 0.0 136 | 137 | def named_entity_related_typing(t): 138 | """ Verify whether a type t talks about an entity. """ 139 | named_entity_list = ["UNKNOWN", "PERSON", "ORG", "FAC", "GPE", "LANGUAGE", "LAW", "LOC", "NORP", "ORG", "PRODUCT", "WORK_OF_ART", "EVENT"] 140 | if t in named_entity_list: 141 | return True 142 | else: 143 | return False 144 | 145 | def date_related_typing(t): 146 | """ Verify whether a type t talks about a date. """ 147 | if t == "DATE": 148 | return True 149 | else: 150 | return False 151 | 152 | def numerical_typing_with_unit(t): 153 | """ Verify whether a type t talks about numerial entity that has an unit. 
""" 154 | with_unit_list = ['PERCENT', 'DISTANCE', 'MASS', 'MONEY', 'DURATION', 155 | 'TEMPERATURE', 'CHARGE', 'ANGLE', 'DATA STORAGE', 156 | 'AMOUNT OF SUBSTANCE', 'CATALYTIC ACTIVITY', 'AREA', 157 | 'VOLUME','VOLUME (LUMBER)', 'FORCE', 'PRESSURE', 158 | 'ENERGY', 'POWER', 'SPEED', 'ACCELERATION', 159 | 'FUEL ECONOMY', 'FUEL CONSUMPTION', 'ANGULAR SPEED', 'ANGULAR ACCELERATION', 160 | 'DENSITY', 'SPECIFIC VOLUME', 'MOMENT OF INERTIA', 'TORQUE', 161 | 'THERMAL RESISTANCE', 'THERMAL CONDUCTIVITY', 'SPECIFIC HEAT CAPACITY', 'VOLUMETRIC FLOW', 162 | 'MASS FLOW', 'CONCENTRATION', 'DYNAMIC VISCOSITY', 'KINEMATIC VISCOSITY', 163 | 'FLUIDITY', 'SURFACE TENSION', 'PERMEABILITY', 'SOUND LEVEL', 164 | 'LUMINOUS INTENSITY', 'LUMINOUS FLUX', 'ILLUMINANCE', 'LUMINANCE', 165 | 'TYPOGRAPHICAL ELEMENT', 'IMAGE RESOLUTION', 'FREQUENCY', 'INSTANCE FREQUENCY', 166 | 'FLUX DENSITY', 'LINEAR MASS DENSITY', 'LINEAR CHARGE DENSITY', 'SURFACE CHARGE DENSITY', 167 | 'CHARGE DENSITY', 'CURRENT', 'LINEAR CURRENT DENSITY', 'SURFACE CURRENT DENSITY', 168 | 'ELECTRIC POTENTIAL', 'ELECTRIC FIELD', 'ELECTRICAL RESISTANCE', 'ELECTRICAL RESISTIVITY', 169 | 'ELECTRICAL CONDUCTANCE', 'ELECTRICAL CONDUCTIVITY', 'CAPACITANCE', 'INDUCTANCE', 170 | 'MAGNETIC FLUX', 'RELUCTANCE', 'MAGNETOMOTIVE FORCE', 'MAGNETIC FIELD', 171 | 'IRRADIANCE', 'RADIATION ABSORBED DOSE', 'RADIOACTIVITY', 'RADIATION EXPOSURE', 172 | 'RADIATION', 'DATA TRANSFER RATE'] 173 | if t in with_unit_list: 174 | return True 175 | else: 176 | return False 177 | 178 | def numerical_typing_without_unit(t): 179 | """ Verify whether a type t talks about numerical entity that doesn't have an unit. """ 180 | without_unit_list = ["CARDINAL", "QUANTITY", "ORDINAL"] 181 | if t in without_unit_list: 182 | return True 183 | else: 184 | return False 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | -------------------------------------------------------------------------------- /annotation/table_annotation.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | """ 20 | from .annot_scripts.annotation_models import Baseline_Model 21 | 22 | def table_annotation(raw_table, K=20, kb_path="./data/hashmap", lookup_index="dagobah_lookup"): 23 | """ 24 | Main table annotation: 25 | Input: 26 | + 2D table 27 | + K: number of entity candidates per table mention. 28 | + kb_path: path to the KB hashmap. 29 | + lookup_index: the index name of ES lookup. 30 | """ 31 | target_kb = {"kb_path": kb_path, "lookup_index": lookup_index} 32 | annotation_output = {"raw": { 33 | "tableDataRaw": raw_table 34 | }, 35 | "annotated": {}, 36 | "preprocessed": {}, 37 | "preprocessingTime": 0.0, 38 | "lookupTime": 0.0, 39 | "entityScoringTime": 0.0, 40 | "subgraphConstructionTime": 0.0, 41 | "ctaTaskTime": 0.0, 42 | "ceaTaskTime": 0.0, 43 | "cpaTaskTime": 0.0, 44 | "avgLookupCandidate": 0.0} 45 | 46 | params = {"multiHop_context": True, "transitivePropertyOnly_path": False, "soft_scoring": True, "K": K} 47 | baseline_model = Baseline_Model(table=raw_table, target_kb=target_kb, params=params) 48 | ## record the size of subgraphs. Disabled in production due to time consuming. 49 | if baseline_model.is_model_init_success: 50 | revised_table = baseline_model.table 51 | baseline_model.entity_scoring_task() 52 | ## First annotation loop 53 | # CEA 54 | for col_idx in baseline_model.entity_cols: 55 | for row_idx in range(baseline_model.first_data_row, baseline_model.num_rows): 56 | baseline_model.cea_task(col_index=col_idx, row_index=row_idx, only_one=False) 57 | # CPA 58 | for i in range(len(baseline_model.entity_cols)-1): 59 | head_col = baseline_model.entity_cols[i] 60 | for j in range(i+1, len(baseline_model.entity_cols)): 61 | tail_col = baseline_model.entity_cols[j] 62 | baseline_model.cpa_task(head_col_index=head_col,tail_col_index=tail_col, only_one=False) 63 | for head_col in baseline_model.entity_cols: 64 | for tail_col in baseline_model.literal_cols: 65 | baseline_model.cpa_task(head_col_index=head_col,tail_col_index=tail_col, only_one=False) 66 | # Weight update: soft scoring 67 | baseline_model.update_context_weight() 68 | baseline_model.entity_scoring_task(first_step=False) 69 | ## Second annotation loop: with updated score. 70 | baseline_model.cea_annot = {} 71 | for col_idx in baseline_model.entity_cols: 72 | for row_idx in range(baseline_model.first_data_row, baseline_model.num_rows): 73 | baseline_model.cea_task(col_index=col_idx, row_index=row_idx, only_one=False) 74 | for col_idx in baseline_model.entity_cols: 75 | baseline_model.cta_task(col_index=col_idx, only_one=False) 76 | ## Third annotation loop: disambiguation. 
77 | baseline_model.cea_annot = {} 78 | for col_idx in baseline_model.entity_cols: 79 | for row_idx in range(baseline_model.first_data_row, baseline_model.num_rows): 80 | baseline_model.cea_task(col_index=col_idx, row_index=row_idx, only_one=True) 81 | baseline_model.cta_annot = {} 82 | for col_idx in baseline_model.entity_cols: 83 | baseline_model.cta_task(col_index=col_idx, only_one=True) 84 | baseline_model.cpa_annot = {} 85 | for i in range(len(baseline_model.entity_cols)-1): 86 | head_col = baseline_model.entity_cols[i] 87 | for j in range(i+1, len(baseline_model.entity_cols)): 88 | tail_col = baseline_model.entity_cols[j] 89 | baseline_model.cpa_task(head_col_index=head_col,tail_col_index=tail_col, only_one=False) 90 | for head_col in baseline_model.entity_cols: 91 | for tail_col in baseline_model.literal_cols: 92 | baseline_model.cpa_task(head_col_index=head_col,tail_col_index=tail_col, only_one=False) 93 | 94 | # Fourth annotation loop: reinforced disambiguation 95 | baseline_model.update_context_weight(onlyLiteralContext=True) 96 | baseline_model.entity_scoring_task(first_step=False, last_step=True) 97 | baseline_model.cea_annot = {} 98 | for col_idx in baseline_model.entity_cols: 99 | for row_idx in range(baseline_model.first_data_row, baseline_model.num_rows): 100 | baseline_model.cea_task(col_index=col_idx, row_index=row_idx, only_one=True) 101 | baseline_model.cta_annot = {} 102 | for col_idx in baseline_model.entity_cols: 103 | baseline_model.cta_task(col_index=col_idx, only_one=True) 104 | baseline_model.cpa_annot = {} 105 | for i in range(len(baseline_model.entity_cols)-1): 106 | head_col = baseline_model.entity_cols[i] 107 | for j in range(i+1, len(baseline_model.entity_cols)): 108 | tail_col = baseline_model.entity_cols[j] 109 | baseline_model.cpa_task(head_col_index=head_col,tail_col_index=tail_col, only_one=True) 110 | for head_col in baseline_model.entity_cols: 111 | for tail_col in baseline_model.literal_cols: 112 | baseline_model.cpa_task(head_col_index=head_col,tail_col_index=tail_col, only_one=True) 113 | 114 | annotation_output["annotated"]["tableDataRevised"] = revised_table 115 | annotation_output["annotated"]["CEA"] = [{"row": cell.row_index, "column": cell.col_index, "annotation": {"label": baseline_model.KB.get_label_of_entity(cea[0]["id"]), 116 | "uri": baseline_model.KB.prefixing_entity(cea[0]["id"]), 117 | "score": round(cea[0]["score"],2)}} for cell, cea in baseline_model.cea_annot.items()] 118 | annotation_output["annotated"]["CTA"] = [{"column": col.col_index, "annotation": [{"label": baseline_model.KB.get_label_of_entity(cta["id"]), 119 | "uri": baseline_model.KB.prefixing_entity(cta["id"]), "score": round(cta["score"],2), 120 | "coverage": round(cta["coverage"],2)} for cta in cta_list]} for col, cta_list in baseline_model.cta_annot.items()] 121 | annotation_output["annotated"]["CPA"] = [] 122 | for col_pair, cpa in baseline_model.cpa_annot.items(): 123 | rel_id = cpa[0]["id"] 124 | id_components = set(rel_id.replace("(-)", "").replace("(", "").replace(")", "").split("::")) 125 | rel_uri = rel_id 126 | rel_label = rel_id 127 | for a_id in id_components: 128 | if baseline_model.KB.is_valid_ID(a_id): 129 | rel_uri = rel_uri.replace(a_id, baseline_model.KB.prefixing_entity(a_id)) 130 | rel_label = rel_label.replace(a_id,baseline_model.KB.get_label_of_entity(a_id)) 131 | annotation_output["annotated"]["CPA"].append({"headColumn": col_pair.head_col_index, "tailColumn": col_pair.tail_col_index, 132 | "annotation": {"label": rel_label, "uri": rel_uri, 
"score": round(cpa[0]["score"],2), "coverage": round(cpa[0]["coverage"],2)}}) 133 | 134 | annotation_output["preprocessed"] = baseline_model.table_infos 135 | annotation_output["preprocessed"].pop("tableDataRevised", None) 136 | annotation_output["preprocessingTime"] = baseline_model.preprocessing_time 137 | annotation_output["lookupTime"] = baseline_model.lookup_time 138 | annotation_output["entityScoringTime"] = baseline_model.entity_scoring_time 139 | annotation_output["subgraphConstructionTime"] = baseline_model.subgraph_construction_time 140 | annotation_output["ctaTaskTime"] = baseline_model.cta_task_time 141 | annotation_output["ceaTaskTime"] = baseline_model.cea_task_time 142 | annotation_output["cpaTaskTime"] = baseline_model.cpa_task_time 143 | annotation_output["avgLookupCandidate"] = baseline_model.avg_lookup_candidate 144 | 145 | ## close graph readers 146 | baseline_model.KB.close_dump() 147 | 148 | return annotation_output 149 | 150 | if __name__ == "__main__": 151 | table = [["Name", "Soundtrack", "Actors","Character"], 152 | ["Pulp fiction", "Dick Dale", "Travolta", "Vincent Vega"], 153 | ["Casino Royal", "Chris Cornell", "Craig", "James Bond"], 154 | ["Outsiders", "Carmine Coppola", "Dillon"], 155 | ["Hearts of Darkness: A Filmmaker's Apocalypse","Todd Boekelheide","Coppola"], 156 | ["Virgin Suicides","Thomas Mars","Dunst","Lux Lisbon"]] 157 | 158 | print(table_annotation(raw_table=table, K=20, kb_path="./data/hashmap", lookup_index="dagobah_lookup")) 159 | 160 | 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /data/hashmap/wd_hashmap_indexing.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | """ 20 | import os 21 | import lmdb 22 | import json 23 | import gzip 24 | import pickle 25 | import logging 26 | 27 | wd_graph_dump_url = os.getenv("WD_HASHMAP_URL", "") 28 | index_name = "/data/edges" 29 | os.makedirs(index_name, exist_ok=True) 30 | dump_file = "/data/graph_dump.json.gz" 31 | 32 | if not os.path.isfile(dump_file): 33 | os.system(f"wget {wd_graph_dump_url} -O {dump_file}") 34 | 35 | logging.basicConfig( 36 | filename='/data/indexing.log', 37 | format='%(asctime)s [%(module)s] %(levelname)s: %(message)s', 38 | level=logging.INFO 39 | ) 40 | logging.info("BEGIN") 41 | 42 | if os.listdir(index_name): 43 | logging.info("Index exists, Skpipping this step.") 44 | else: 45 | edge_lmdb_writer = lmdb.open(index_name, map_size=248000000000) 46 | count_item = 0 47 | with edge_lmdb_writer.begin(write=True) as e_txn: 48 | with gzip.open(dump_file, "r") as f: 49 | for line in f: 50 | count_item += 1 51 | try: 52 | json_line = json.loads(line[:-2]) 53 | except: 54 | json_line = json.loads(line) 55 | 56 | item_QID = list(json_line.keys())[0] 57 | item_infos = json_line[item_QID] 58 | # print(item_infos) 59 | new_item_infos = {} 60 | for pid, qid_list in item_infos.items(): 61 | if pid in ["labels", "descriptions", "aliases"]: 62 | new_item_infos[pid] = qid_list.get("en-us", []) 63 | else: 64 | if "P1889" not in pid: 65 | if "(-)" not in pid: 66 | new_item_infos[pid] = {} 67 | for qid, qtype in qid_list.items(): 68 | if isinstance(qtype, str) and qtype.split("-")[0] == "DateTime": 69 | new_qid = qid.replace("-00-00", "").replace("-01-01", "") 70 | new_item_infos[pid][new_qid] = qtype 71 | else: 72 | new_item_infos[pid][qid] = qtype 73 | else: 74 | new_item_infos[pid] = qid_list 75 | e_txn.put(item_QID.encode("ascii"), pickle.dumps(new_item_infos)) 76 | if (count_item%100000 == 0): 77 | logging.info("... Processed " + str(count_item) + " wikidata items.") 78 | edge_lmdb_writer.close() 79 | logging.info("Done") -------------------------------------------------------------------------------- /data/lookup/entity_indexing.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | import os 21 | from elasticsearch import Elasticsearch, helpers 22 | import json 23 | import gzip 24 | import logging 25 | import time 26 | import urllib3 27 | urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) 28 | 29 | es_host = os.getenv("ELASTICSEARCH_HOST", "localhost") 30 | es_port = os.getenv("ELASTICSEARCH_PORT", 9200) 31 | es_user = os.getenv("ELASTICSEARCH_USER", "") 32 | es_pwd = os.getenv("ELASTICSEARCH_PWD", "") 33 | 34 | wd_lookup_dump_url = os.getenv("WD_LOOKUP_DUMP_URL", "") 35 | index_name = "dagobah_lookup" 36 | dump_file = "/data/label_dump.json.gz" 37 | 38 | if not os.path.isfile(dump_file): 39 | os.system(f"wget {wd_lookup_dump_url} -O {dump_file} ") 40 | 41 | logging.basicConfig( 42 | filename='/data/indexing.log', 43 | format='%(asctime)s [%(module)s] %(levelname)s: %(message)s', 44 | level=logging.INFO 45 | ) 46 | 47 | logging.info("BEGIN") 48 | 49 | es=Elasticsearch([{'host': es_host,'port': es_port}], timeout=30, max_retries=10, retry_on_timeout=True, http_auth=(es_user, es_pwd)) 50 | if es.indices.exists(index=index_name): 51 | logging.info("Index exists, skipping this step.") 52 | logging.info("END") 53 | 54 | else: 55 | es.indices.create(index=index_name, body={ 56 | 'settings' : { 57 | 'index' : { 58 | 'number_of_shards':3 59 | } 60 | } 61 | }) 62 | 63 | try: 64 | connected = False 65 | while not connected: 66 | try: 67 | logging.info("Connecting to elasticsearch: %s:%s", es_host, es_port) 68 | res = es.cluster.health() 69 | logging.info("Connection successful to %s", res["cluster_name"]) 70 | connected = True 71 | except Exception: 72 | logging.warning("Error during connection to elasticsearch, retry in 10 seconds") 73 | time.sleep(10) 74 | bulk = [] 75 | count = 0 76 | for line in gzip.open(dump_file, "r"): 77 | count += 1 78 | line = line.strip() 79 | if line[-1] == 44:#comma 80 | line = line[:-1] 81 | item = json.loads(line) 82 | qid = item["ID"] 83 | page_rank = item["page_rank"] 84 | labels = item["labels"] 85 | main_aliases = item["main_aliases"] 86 | sub_aliases = item["sub_aliases"] 87 | for label in labels: 88 | data = {"entity": qid, "label": label, "length" : len(label), "origin": "LABEL", "PR": page_rank} 89 | bulk.append({"_index": index_name, "_source": data}) 90 | for alias in main_aliases: 91 | if alias not in labels: 92 | data = {"entity": qid, "label": alias, "length" : len(alias), "origin": "MAIN_ALIAS", "PR": page_rank} 93 | bulk.append({"_index": index_name, "_source": data}) 94 | for alias in sub_aliases: 95 | if alias not in labels and alias not in main_aliases: 96 | data = {"entity": qid, "label": alias, "length" : len(alias), "origin": "SUB_ALIAS","PR": page_rank} 97 | bulk.append({"_index": index_name, "_source": data}) 98 | 99 | if len(bulk) >= 1000000: 100 | res = helpers.bulk(es,bulk) 101 | logging.info(res) 102 | logging.info(count) 103 | bulk.clear() 104 | 105 | if len(bulk) > 0: 106 | res = helpers.bulk(es,bulk) 107 | 108 | logging.info(count) 109 | except Exception as e: 110 | logging.exception(e) 111 | 112 | logging.info("END") 113 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: '3' 2 | 3 | x-prod-common: 4 | environment: 5 | &prod-common-env 6 | ELASTICSEARCH_HOST: "elasticsearch" 7 | ELASTICSEARCH_PORT: "9200" 8 | ELASTICSEARCH_USER: 
"" 9 | ELASTICSEARCH_PWD: "" 10 | WD_LOOKUP_DUMP_URL: "https://zenodo.org/record/8426650/files/wikidata_lookup_dump.json.gz?download=1" 11 | WD_HASHMAP_URL: "https://zenodo.org/records/8426650/files/wikidata_hashmap.json.gz?download=1" 12 | 13 | services: 14 | elasticsearch: 15 | image: elasticsearch:7.17.2 16 | environment: 17 | - "discovery.type=single-node" 18 | - "cluster.name=cw-single-es" 19 | - "ES_JAVA_OPTS=-Xms1G -Xmx1G" 20 | ports: ['9200:9200'] 21 | volumes: 22 | - ${ELASTICSEARCH_VOLUME_DIR:-./data/elasticsearch}/:/usr/share/elasticsearch/data 23 | ulimits: 24 | nofile: 25 | soft: 65535 26 | hard: 65535 27 | 28 | kibana: 29 | image: kibana:7.17.2 30 | ports: ['5601:5601'] 31 | depends_on: ['elasticsearch'] 32 | 33 | entity_indexing: 34 | image: python:3.10 35 | volumes: 36 | - ${DATA_VOLUME_DIR:-./data/lookup}/:/data 37 | command: sh -c "pip install elasticsearch==7.15.1 && python /data/entity_indexing.py" 38 | depends_on: ['elasticsearch'] 39 | environment: 40 | <<: *prod-common-env 41 | 42 | wd_hashmap_indexing: 43 | image: python:3.10 44 | volumes: 45 | - ${DATA_VOLUME_DIR:-./data/hashmap}/:/data 46 | command: sh -c "pip install lmdb==1.3.0 && python /data/wd_hashmap_indexing.py" 47 | environment: 48 | <<: *prod-common-env 49 | 50 | 51 | -------------------------------------------------------------------------------- /lookup/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | from .entity_lookup import entity_lookup -------------------------------------------------------------------------------- /lookup/entity_lookup.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. 
fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | import time 21 | import threading 22 | from .es_lookup import LookupES 23 | from . import settings 24 | 25 | class LookupManager: 26 | def __init__(self): 27 | self.lookup = LookupES() 28 | self.lookup.connect() 29 | 30 | def search(self, labels, KG): 31 | """Search entry point; switches between sequential and parallel lookup according to the settings""" 32 | start_time = time.time() 33 | index_name = KG 34 | result = None 35 | if isinstance(labels, str): 36 | labels = [labels] 37 | nb = len(labels) 38 | if nb > settings.PARALLEL_MIN and settings.PARALLEL_MODE: 39 | #Split the label list into four chunks and run the lookups in parallel threads 40 | half = int(nb/2) 41 | threads = [] 42 | quarter = int(half/2) 43 | threads.append(LookupThread(self.lookup, index_name, labels[:quarter])) 44 | threads.append(LookupThread(self.lookup, index_name, labels[quarter:half])) 45 | threads.append(LookupThread(self.lookup, index_name, labels[half:half+quarter])) 46 | threads.append(LookupThread(self.lookup, index_name, labels[half+quarter:])) 47 | # Start all threads 48 | for t in threads: 49 | t.start() 50 | # Wait for all threads to complete 51 | for t in threads: 52 | t.join() 53 | result = [item for sub_list in threads for item in sub_list.get_result()] 54 | else: 55 | result = self.lookup.flat_search(index_name, labels) 56 | end_time = time.time() 57 | return {"executionTimeSec": round(end_time-start_time,2), "output": result} 58 | 59 | class LookupThread(threading.Thread): 60 | def __init__(self, es_lookup, index_name, labels): 61 | threading.Thread.__init__(self) 62 | self.es_lookup = es_lookup 63 | self.index_name = index_name 64 | self.labels = labels 65 | self.result = [] 66 | 67 | def run(self): 68 | for label in self.labels: 69 | self.result.append(self.es_lookup.flat_search_item(self.index_name, label)) 70 | 71 | def get_result(self): 72 | return self.result 73 | 74 | def entity_lookup(labels, KG): 75 | lkp_manager = LookupManager() 76 | return lkp_manager.search(labels, KG) 77 | 78 | if __name__ == "__main__": 79 | print(entity_lookup(labels=["belgium"], KG="dagobah_lookup")) -------------------------------------------------------------------------------- /lookup/es_lookup.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: 
TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | from elasticsearch import Elasticsearch 21 | from rapidfuzz import fuzz 22 | import copy 23 | import math 24 | from . import settings 25 | 26 | class LookupES: 27 | FLAT_QUERY_STRING = { 28 | "query": { 29 | "function_score": { 30 | "query": { 31 | "bool": { 32 | "should": [ 33 | { 34 | "bool": { 35 | "must": [ 36 | {"match": 37 | {"label": {"query": "ABC", "fuzziness": "AUTO" }} 38 | } 39 | ], 40 | "filter": { 41 | "range": { "length" : { "gte" : 0, "lte": 0 }} 42 | } 43 | 44 | } 45 | }, 46 | { 47 | "bool": { 48 | "must": [ 49 | {"match": 50 | {"label.keyword": {"query": "ABC", "fuzziness": "AUTO" }} 51 | } 52 | ], 53 | "filter": { 54 | "range": { "length" : { "gte" : 0, "lte": 0 }} 55 | } 56 | 57 | } 58 | } 59 | 60 | ] 61 | 62 | } 63 | }, 64 | "functions": [ 65 | { 66 | "filter": { "match": { "origin": "MAIN_ALIAS" }}, 67 | "weight": settings.MAIN_ALIAS_FACTOR 68 | }, 69 | { 70 | "filter": { "match": { "origin": "SUB_ALIAS" }}, 71 | "weight": settings.SUB_ALIAS_FACTOR 72 | } 73 | ] 74 | } 75 | }, 76 | "size": 10000 77 | } 78 | def __init__(self): 79 | self.es = None 80 | 81 | def connect(self): 82 | """Connect to ElasticSearch cluster""" 83 | host = settings.ELASTICSEARCH_HOST 84 | port = settings.ELASTICSEARCH_PORT 85 | auth = settings.ELASTICSEARCH_AUTH 86 | if auth: 87 | user = settings.ELASTICSEARCH_USER 88 | pwd = settings.ELASTICSEARCH_PWD 89 | self.es = Elasticsearch([{'host':host,'port':port}], timeout=300, http_auth=(user, pwd), verify_certs=False, use_ssl=True) 90 | else: 91 | self.es = Elasticsearch([{'host':host,'port':port}], timeout=300) 92 | # connect to ES cluster 93 | self.es.cluster.health() 94 | 95 | def flat_search(self, index_name, labels): 96 | """Search candidate entities for labels in flat index""" 97 | result = [] 98 | for label in labels: 99 | result.append(self.flat_search_item(index_name, label)) 100 | return result 101 | 102 | def flat_search_item(self, index_name, label): 103 | try: 104 | def es_search(request): 105 | result = self.es.search(index=index_name, body=request) 106 | return result 107 | 108 | def filter_result(label, result): 109 | entities_result = [] 110 | total = result["hits"]["total"]["value"] 111 | bm25_max = result["hits"]["max_score"] ## 
tdidf score max 112 | if (total > 0): 113 | #Calculate max ratio for all entities returned by ES 114 | entities_set = set() 115 | entity_fuzzy_ratio = {} ## store fuzzy matching score 116 | entity_bm25_ratio = {} ## store keyword matching (tfidf) score (retrieved from ES) 117 | entity_pr_ratio = {} 118 | entity_partial_matching = set() ## since partial exact matching sometimes not fit well with levenshtein, 119 | ## we do not use levenshtein distances to evaluate the partial matching entity. 120 | ## for e.g. "YANKEES" vs. "NEW YORK YANKEES" 121 | entity_best_label = {} 122 | max_ratio = 0.0 123 | for hit in result["hits"]["hits"]: 124 | #Calculate ratio for label of the entity 125 | entity_label = hit["_source"]["label"] 126 | entity_origin = hit["_source"]["origin"] 127 | bm25_score = hit["_score"]/bm25_max 128 | entity_pr_ratio[hit["_source"]["entity"]] = hit["_source"].get("PR", 0.0) 129 | entity_bm25_ratio[hit["_source"]["entity"]] = max(entity_bm25_ratio.get(hit["_source"]["entity"], bm25_score), bm25_score) 130 | 131 | entity_label_lower = entity_label.lower() 132 | ## ratio components 133 | char_based_ratio = 0.9*fuzz.ratio(label_lower, entity_label_lower)/100 + 0.1*fuzz.ratio(new_label, entity_label)/100 134 | token_sort_based_ratio = 0.9*fuzz.token_sort_ratio(label_lower, entity_label_lower)/100 + 0.1*fuzz.token_sort_ratio(new_label, entity_label)/100 135 | if 0.5 < len(label_lower)/len(entity_label_lower) < 2.0: ## token set ratio is noisy, only apply on two labels of similar lengths. 136 | token_set_based_ratio = 0.9*fuzz.token_set_ratio(label_lower, entity_label_lower)/100 + 0.1*fuzz.token_set_ratio(new_label, entity_label)/100 137 | else: 138 | token_set_based_ratio = 0.0 139 | ## find entities that have partial exact matching, we put them directly in output without evaluating levenshtein distances 140 | ## since levenshtein does not fit well with partial exact matching. 141 | ## to avoid extracting too much irrelevant entities, the entity label should not be too long or too short w.r.t. input label 142 | partial_ratio = 0.9*fuzz.partial_ratio(label_lower, entity_label_lower)/100 + 0.1*fuzz.partial_ratio(new_label, entity_label)/100 143 | token_diff = abs(len(label_lower.split(" ")) - len(entity_label_lower.split(" "))) ## token difference between 2 labels 144 | ## partial matching ratio and token set ratio are noisy, only apply on two labels of similar lengths. 145 | if (partial_ratio > 0.9 and token_diff <= 2) or \ 146 | (token_set_based_ratio > 0.9 and 0.5 < len(label_lower)/len(entity_label_lower) < 2.0): 147 | entity_partial_matching.add(hit["_source"]["entity"]) 148 | ## the final ratio is the mean of two maximum ratios among three ratios. 149 | ## to avoid that 2 ratios of same values dominate the other. 150 | ## e.g. 
char_based_ratio("universal", "universal picture") = token_sort_based_ratio("universal", "universal picture") = 0.66 151 | ## so including both ratios in the final ratio will decrease the significance of token_set_based_ratio("universal", "universal picture") which is 1.0 152 | ratio = sum(sorted([char_based_ratio, token_sort_based_ratio, token_set_based_ratio], reverse=True)[:2])/2 153 | 154 | #Apply factor according to label origin 155 | label_origin = hit["_source"].get("origin") 156 | factor = 1 157 | if label_origin == "MAIN_ALIAS": 158 | factor = settings.MAIN_ALIAS_FACTOR 159 | elif label_origin == "SUB_ALIAS": 160 | factor = settings.SUB_ALIAS_FACTOR 161 | ratio *= factor 162 | 163 | #Store max ratio of the queried label 164 | max_ratio = max(max_ratio, ratio) 165 | #print(hit["_source"]["entity"] + ": " + entity_label + ": " + str(ratio) + " ("+str(hit["_score"])+")") 166 | if entity_fuzzy_ratio.get(hit["_source"]["entity"]): 167 | if ratio > entity_fuzzy_ratio[hit["_source"]["entity"]]: 168 | #Store max ratio of the entity 169 | entity_fuzzy_ratio[hit["_source"]["entity"]] = ratio 170 | entity_best_label[hit["_source"]["entity"]] = entity_label 171 | else: 172 | #Store ratio of the entity 173 | entity_fuzzy_ratio[hit["_source"]["entity"]] = ratio 174 | entity_best_label[hit["_source"]["entity"]] = entity_label 175 | 176 | ratio_threshold = max(settings.ADAPTATIVE_RATIO_MIN_THRESHOLD, max_ratio-settings.ADAPTATIVE_RATIO_MAX_GAP) 177 | 178 | ## in wikidata, we use pagerank to re-rank relevant candidates. 179 | #Filter entities 180 | filtered = 0 181 | ## in wikidata, we use pagerank to re-rank relevant candidates. 182 | ## fist, find the max page rank among candidates. 183 | max_page_rank = 0 184 | # with self.wikidata_stats_reader.begin() as stat_txn: 185 | for entity in entity_fuzzy_ratio: 186 | if entity_fuzzy_ratio[entity] >= ratio_threshold or entity in entity_partial_matching: 187 | entities_set.add(entity) 188 | max_page_rank = max(max_page_rank, entity_pr_ratio[entity]) 189 | filtered += 1 190 | if max_page_rank == 0.0: 191 | max_page_rank = 1.0 192 | 193 | ## re-rank relevant candidates with locally log-normalized page rank score. 194 | for entity in entities_set: 195 | # entity_score = (1-settings.PAGE_RANK_FACTOR-settings.BM25_FACTOR)* entity_fuzzy_ratio[entity] + settings.PAGE_RANK_FACTOR*math.log2(list_page_rank[entity]+1)/math.log2(max_page_rank+1) + settings.BM25_FACTOR * entity_bm25_ratio[entity] 196 | entity_score = (1-settings.PAGE_RANK_FACTOR-settings.BM25_FACTOR)* entity_fuzzy_ratio[entity] + settings.PAGE_RANK_FACTOR*math.log2(entity_pr_ratio[entity]+1)/math.log2(max_page_rank+1) + settings.BM25_FACTOR * entity_bm25_ratio[entity] 197 | entities_result.append({"entity": entity, "label": entity_best_label[entity], "score": entity_score, "origin": entity_origin}) 198 | entities_result.sort(key=lambda x: x["score"], reverse=True) 199 | return {"label": label, "entities": entities_result} 200 | 201 | request = copy.deepcopy(self.FLAT_QUERY_STRING) 202 | new_label = label.replace('"', '').strip() ## only vital preprocessing on input label: replace '"" by '' since ES does not accept double quote and remove last spaces. 
203 | new_label = " ".join(new_label.split()) # remove multiple space in string 204 | label_lower = new_label.lower() 205 | request["query"]["function_score"]["query"]["bool"]["should"][0]["bool"]["must"][0]["match"]["label"]["query"] = new_label 206 | request["query"]["function_score"]["query"]["bool"]["should"][1]["bool"]["must"][0]["match"]["label.keyword"]["query"] = new_label 207 | request["query"]["function_score"]["query"]["bool"]["should"][0]["bool"]["filter"]["range"]["length"]["gte"] = int(len(new_label) * settings.LABEL_LENGTH_MIN_FACTOR) 208 | request["query"]["function_score"]["query"]["bool"]["should"][0]["bool"]["filter"]["range"]["length"]["lte"] = int(len(new_label) * settings.LABEL_LENGTH_MAX_FACTOR) 209 | request["query"]["function_score"]["query"]["bool"]["should"][1]["bool"]["filter"]["range"]["length"]["gte"] = int(max(0, len(new_label) - settings.LABEL_TOKEN_DIFF)) 210 | request["query"]["function_score"]["query"]["bool"]["should"][1]["bool"]["filter"]["range"]["length"]["lte"] = int(len(new_label) + settings.LABEL_TOKEN_DIFF) 211 | 212 | result = es_search(request) 213 | entities = filter_result(label, result) 214 | 215 | return entities 216 | except Exception as e: 217 | return {"label": label, "error": str(e)} 218 | -------------------------------------------------------------------------------- /lookup/settings.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | """ 20 | import os 21 | # Elasticsearch server information 22 | ELASTICSEARCH_HOST = os.getenv('ELASTICSEARCH_HOST', 'localhost').strip() 23 | ELASTICSEARCH_PORT = os.getenv('ELASTICSEARCH_PORT', 9200) 24 | ELASTICSEARCH_AUTH = os.getenv('ELASTICSEARCH_AUTH', False) 25 | ELASTICSEARCH_USER = os.getenv('ELASTICSEARCH_USER', '').strip() 26 | ELASTICSEARCH_PWD = os.getenv('ELASTICSEARCH_PWD', '').strip() 27 | 28 | # Adaptative threshold ratio 29 | ADAPTATIVE_RATIO_MIN_THRESHOLD = float(os.getenv('ADAPTATIVE_RATIO_MIN_THRESHOLD', 0.70)) 30 | ADAPTATIVE_RATIO_MAX_GAP = float(os.getenv('ADAPTATIVE_RATIO_MAX_GAP', 0.25)) 31 | 32 | # Parallel mode 33 | PARALLEL_MODE = os.getenv('PARALLEL_MODE', True) 34 | PARALLEL_MIN = os.getenv('PARALLEL_MIN', 5) 35 | 36 | # Lookup score factors 37 | MAIN_ALIAS_FACTOR = float(os.getenv('MAIN_ALIAS_FACTOR', 0.94)) 38 | SUB_ALIAS_FACTOR = float(os.getenv('SUB_ALIAS_FACTOR', 0.88)) 39 | 40 | ## Page rank for wikidata 41 | PAGE_RANK_FACTOR = float(os.getenv('PAGE_RANK_FACTOR', 0.1)) 42 | 43 | ## Filter on label length 44 | LABEL_LENGTH_MIN_FACTOR = float(os.getenv('LABEL_LENGTH_MIN_FACTOR', 0.25)) 45 | LABEL_LENGTH_MAX_FACTOR = float(os.getenv('LABEL_LENGTH_MAX_FACTOR', 4)) 46 | LABEL_TOKEN_DIFF = float(os.getenv('LABEL_TOKEN_DIFF_FACTOR', 4)) 47 | 48 | ## BM25 (TDIDF) score factor. 49 | BM25_FACTOR = float(os.getenv('BM25_FACTOR', 0.20)) -------------------------------------------------------------------------------- /preprocessing/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | """ 20 | from .table_preprocessing import table_preprocessing -------------------------------------------------------------------------------- /preprocessing/prp_scripts/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ -------------------------------------------------------------------------------- /preprocessing/prp_scripts/entity_parsers/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | """ -------------------------------------------------------------------------------- /preprocessing/prp_scripts/entity_parsers/phoneNumber_parser.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | import phonenumbers 21 | 22 | ## verify whether each cell in list_cell reprensents a PHONE NUMBER 23 | def phonenumber_parser(list_cell): 24 | ner_per_label = {} 25 | for label in list_cell: 26 | ner_per_label[label] = [] 27 | # for code in country_codes: 28 | try: 29 | phone_res = phonenumbers.parse(label) 30 | if phonenumbers.is_valid_number(phone_res): 31 | ner_per_label[label].append("PHONE NUMBER") 32 | break 33 | except: 34 | pass 35 | return ner_per_label 36 | -------------------------------------------------------------------------------- /preprocessing/prp_scripts/entity_parsers/regex_parser.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 
10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | import re 21 | 22 | def init_parser(): 23 | regex_matcher = {} 24 | range_pattern = ["^[\s\[\{\(]*[\s]*\d+[.,]?\d*[\s]*[-]+[\s]*\d+[.,]?\d*[\s]*[\s\]\)\}]*$", 25 | "^[\[\{\(]+[\s]*\d+[.,]?\d*[\s]*[,]+[\s]*\d+[.,]?\d*[\s]*[\s\]\)\}]*$", 26 | "^[\s\[\{\(]*[\s]*\d+[.,]?\d*[\s]*[,]+[\s]*\d+[.,]?\d*[\s]*[\]\)\}]+$", 27 | "^[\s\[\{\(]*[\s]*\d+[.,]?\d*[\s]*[–]+[\s]*\d+[.,]?\d*[\s]*[\s\]\)\}]*$"] 28 | 29 | range_matcher = re.compile('|'.join(range_pattern)) 30 | regex_matcher["RANGE"] = range_matcher 31 | 32 | cardinal_matcher = re.compile(r"^\s*[+,-]?\d+[.,]?\d*\s*$|^\s*[+,-]?\d*[\u2150-\u215E\u00BC-\u00BE]\s*$") 33 | regex_matcher["CARDINAL"] = cardinal_matcher 34 | 35 | percent_matcher = re.compile(r"^\s*(\d*(\.\d+)?[\s]*%)\s*$") 36 | regex_matcher["PERCENT"] = percent_matcher 37 | 38 | ip_pattern = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" 39 | ip_matcher = re.compile(ip_pattern, re.IGNORECASE) 40 | regex_matcher["IP ADDRESS"] = ip_matcher 41 | 42 | ipv6_pattern = "\s*(?!.*::.*::)(?:(?!:)|:(?=:))(?:[0-9a-f]{0,4}(?:(?<=::)|(? 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | import spacy 21 | 22 | def is_concept(label: str): 23 | ## verify if a typing represents a named entity. 24 | concept_list = ["EVENT", "FAC", "GPE", "LAW", "LOC", "NORP", "ORG", "PERSON", "PRODUCT", "WORK_OF_ART", "LANGUAGE", "MONEY", "PERCENT", "UNKNOWN"] 25 | for concept in concept_list: 26 | if concept in label: 27 | return True 28 | return False 29 | 30 | spacy_model = {"trf": spacy.load("en_core_web_sm", disable=["parser", "textcat"])} 31 | 32 | def spacy_parser(list_cell): 33 | ner_per_label = {} 34 | for doc in spacy_model["trf"].pipe(list_cell): 35 | label = str(doc) 36 | ner_per_label[label] = [] 37 | covered_label = ''.join([t.text for t in doc.ents]) ## record which parts of input label are covered by an named entity. 38 | if 1.4*len(covered_label) >= len(label): ## if a label is covered enough by named entities. we have all possible entities detected on this label. 
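## Illustrative example of the 1.4 coverage factor (hypothetical cell): for the cell "Paris France",
## spaCy may return the entities "Paris" (GPE) and "France" (GPE); covered_label is then
## "ParisFrance" (11 characters) and 1.4*11 = 15.4 >= 12 = len(label), so the cell is considered
## sufficiently covered and the GPE label is kept for it.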
39 | concept_exist = False 40 | for a_ner in doc.ents: 41 | if is_concept(a_ner.label_): 42 | concept_exist = True 43 | if a_ner.label_ not in ner_per_label[label]: 44 | ner_per_label[label].append(a_ner.label_) 45 | if concept_exist: 46 | for num_enity in ["CARDINAL", "ORDINAL", "DATE"]: 47 | if num_enity in ner_per_label[label]: 48 | ner_per_label[label].remove(num_enity) 49 | return ner_per_label 50 | 51 | 52 | 53 | 54 | 55 | 56 | -------------------------------------------------------------------------------- /preprocessing/prp_scripts/entity_parsers/unit_parser.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | from quantulum3 import parser as qt_unit_parser 21 | 22 | def unit_parser(list_cell): 23 | ner_per_label = {} 24 | for label in list_cell: 25 | ner_per_label[label] = [] 26 | unit_res = qt_unit_parser.parse(label) 27 | surface = "" 28 | entity = [] 29 | for res in unit_res: 30 | surface += res.surface 31 | entity.append(res.unit.entity.name) 32 | """ v1.0.1 """ 33 | ## if a label is covered enough by entities. we keep all possible entities detected on this label. 
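## Illustrative example (hypothetical cell): for "10 km", quantulum3 would typically return one
## quantity whose surface is "10 km" and whose unit entity is named "length"; the surface covers
## the whitespace-stripped label ("10km", 4 <= 1.4*5), and "length" is remapped to "DISTANCE" below.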
34 | if 1.4*len(surface) >= len(label.replace(" ", "")): 35 | for a_ner in entity: 36 | if a_ner not in ["unknown", "dimensionless"]: 37 | if a_ner == "time": 38 | a_ner = "DURATION" 39 | elif a_ner == "length": 40 | a_ner = "DISTANCE" 41 | elif a_ner == "currency": 42 | a_ner = "MONEY" 43 | else: 44 | a_ner = a_ner.upper() 45 | if a_ner not in ner_per_label[label]: 46 | ner_per_label[label].append(a_ner) 47 | return ner_per_label 48 | -------------------------------------------------------------------------------- /preprocessing/prp_scripts/file_loader.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | ''' 21 | FILE READER: 22 | 23 | Auto converter 24 | 25 | Auto-converter is for automatically transform from different types of data 26 | to tables 27 | 28 | ''' 29 | 30 | import csv 31 | import json 32 | import pandas as pd 33 | import numpy as np 34 | from typing import List 35 | import chardet 36 | 37 | def txt_to_table(filepath: str): 38 | """ 39 | Read table from text file (txt, tsv,csv..). Currently, only 1 table/file supported 40 | and delimiter detected automatically. 41 | Args: 42 | input table path 43 | Return 44 | 3D array with first dimension = 1, indicating that the oupput contains only one 2D-table. 
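For example (hypothetical file), a CSV containing the two rows "city,country" and "Paris,France"
is returned as [[["city", "country"], ["Paris", "France"]]].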
45 | """ 46 | list_tables = [] 47 | try: 48 | f = open(filepath, 'rb') 49 | en_scheme = chardet.detect(f.read()) # detect encoding scheme 50 | f.close() 51 | f = open(filepath, 'r', encoding=en_scheme['encoding']) 52 | possible_sep = [',', '\t', ';', ':'] 53 | dialect = csv.Sniffer().sniff(f.read(), possible_sep) 54 | f.close() 55 | f = open(filepath, 'r', encoding=en_scheme['encoding']) 56 | except FileNotFoundError: 57 | print(" File not found !!") 58 | return [] 59 | except Exception as e: 60 | print(e) 61 | return [] 62 | reader = csv.reader(f, delimiter=dialect.delimiter, skipinitialspace=True) 63 | table = [] 64 | for item in reader: 65 | table.append(item) 66 | if table: 67 | list_tables.append(table) 68 | f.close() 69 | return list_tables 70 | 71 | """ 72 | def json_to_table(filepath: str) -> List[List[List[str]]]: 73 | list_tables = [] 74 | table = [] 75 | with open(filepath, 'r', encoding="utf8") as f: 76 | temp = json.loads(f.read()) 77 | table.append(temp) 78 | list_tables.append(table) 79 | return list_tables 80 | """ 81 | 82 | def excel_to_table(filepath: str) : 83 | """ 84 | Read table from excel file. Currently, multi tables supported. 85 | Only one heuristic supported for seperating tables: blank lines between 2 consecutive tables. 86 | Args: 87 | input table path 88 | Return 89 | 3D array with first dimension is the number of 2D tables. 90 | """ 91 | list_tables = [] 92 | try: 93 | xl = pd.ExcelFile(filepath, engine='openpyxl') 94 | except FileNotFoundError: 95 | print(" File not found !!") 96 | return [] 97 | except Exception as e: 98 | return [] 99 | sheet_names = xl.sheet_names 100 | for i_sheet in range(len(sheet_names)): 101 | excel_sheet = pd.read_excel(filepath, header=None, sheet_name=i_sheet) 102 | excel_sheet = excel_sheet.values.tolist() 103 | single_table = [] 104 | for line in excel_sheet: 105 | i_element = len(line) - 1 106 | end_of_line = len(line) - 1 107 | is_eol = False 108 | while i_element >= 0: 109 | if pd.isna(line[i_element]): 110 | if not is_eol: 111 | end_of_line -= 1 112 | else: 113 | line[i_element] = "" 114 | else: 115 | is_eol = True 116 | i_element -= 1 117 | line = line[:end_of_line+1] 118 | line = [str(s) for s in line] 119 | if line == []: 120 | if single_table != []: 121 | list_tables.append(single_table) 122 | single_table = [] 123 | else: 124 | tmp_line = [] 125 | for e in line: 126 | if e != "" and e != " "*len(e): 127 | tmp_line.append(e) 128 | if tmp_line == []: 129 | if len(single_table) > 1: 130 | single_table.append(line) 131 | else: 132 | single_table = [] 133 | else: 134 | single_table.append(line) 135 | if single_table != []: 136 | list_tables.append(single_table) 137 | 138 | return list_tables 139 | 140 | def file_loader(filepath: str): 141 | ''' 142 | automatic form detection tool. 
143 | :param filepath: target file 144 | :return: 145 | ''' 146 | splitPart = filepath.split('.') 147 | if(splitPart[-1].lower() in ['csv', 'txt', 'tsv']): return txt_to_table(filepath) 148 | # if(splitPart[-1] in ['json']): return json_to_table(filepath) 149 | if(splitPart[-1].lower() in ['xls', 'xlsx']): return excel_to_table(filepath) 150 | return [] 151 | 152 | 153 | if __name__ == '__main__': 154 | # csv = file_loader(r"C:\Users\pgkx5469\Documents\ECE.csv") 155 | txt = file_loader(r"/datastorage/uploaded_files/ECE.csv") 156 | # txt = file_loader(r"C:\Users\pgkx5469\Documents\Python Scripts\t1.txt") 157 | # excel = file_loader(r"C:\Users\pgkx5469\Documents\Projets\dagobah\data\round_3\1.xlsx") 158 | print(txt) -------------------------------------------------------------------------------- /preprocessing/prp_scripts/table_info_extraction_modules.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | ''' 21 | TABLE INFO EXTRACTION MODULES/ 22 | + Orientation 23 | + Header 24 | + Primitive Typing 25 | + Data Type 26 | + Key Column 27 | 28 | ''' 29 | import numpy as np 30 | import collections 31 | from typing import List 32 | from . import utils 33 | 34 | __all__ = ["table_orientation_detection", "table_header_detection", "table_reshaping", 35 | "table_primitive_typing", "table_key_column_detection"] 36 | 37 | class Table_Orientation(collections.namedtuple( 38 | "Table_Orientation", ("orientation", "score"))): 39 | """ To allow for flexibility in returning different outputs. """ 40 | pass 41 | 42 | class Table_Header(collections.namedtuple( 43 | "Table_Header", ("has_header", "header", "score"))): 44 | """ To allow for flexibility in returning different outputs. """ 45 | pass 46 | 47 | class Table_primitive_Typing(collections.namedtuple( 48 | "Table_primitive_Typing", ("type_list"))): 49 | """ To allow for flexibility in returning different outputs. 
""" 50 | pass 51 | 52 | class Table_keyColumn(collections.namedtuple( 53 | "Table_keyColumn", ("keyColumn", "score"))): 54 | """ To allow for flexibility in returning different outputs. """ 55 | pass 56 | 57 | def table_orientation_detection(targetTable: List[List[str]], table_Datatype, table_Typings) -> Table_Orientation: 58 | """ 59 | Detect the orientation of a table based on column-wise and row-wise dataType homogeneity. 60 | The homogeneities are calculated starting from the second row and second column, 61 | to avoid the unexpected affect of the header. 62 | Args: 63 | targetTable: 2D nested list 64 | ignore_first_row_col: set True by default. If False, first row/rol are also taken into 65 | account in calculation of the homogeneity. 66 | 67 | Return: 68 | Orientation with a confidence score. 69 | Orientation can be "VERTICAL", "HORIZONTAL", "Unknow/In this case, we assume HORIZONTAL since it 70 | is the case usual. 71 | """ 72 | orientation = "" 73 | orientation_score = 0.0 74 | smooth_coef = 0.0 75 | table_rows = len(targetTable) 76 | table_cols = len(targetTable[0]) 77 | 78 | ## step 1: ignoring first row/first column. The horizontal/vertical data type based heterogeneity score are calculated. 79 | ## to assure a confidential homogeneity calculation, table shoud be large enough (num_row, num_col > 2) 80 | is_step1_success = False 81 | if table_rows > 2 and table_cols > 2: 82 | starting_row = 1 83 | starting_col = 1 84 | ## homogeneity mean and std for each row 85 | homogeneity_horizontal, std_horiz = utils.homogeneity_compute([line[starting_col:] for line in targetTable[starting_row:]], table_Datatype, 86 | direction="horizontal") 87 | 88 | ## homogeneity mean and std for each column 89 | homogeneity_vertical, std_verti = utils.homogeneity_compute([line[starting_col:] for line in targetTable[starting_row:]], table_Datatype, 90 | direction="vertical") 91 | if homogeneity_horizontal is not None and homogeneity_vertical is not None: 92 | ## To alleviate the impact of noise, we employ a soft margin in the comparison of horizontal/vertical homogeneity 93 | ## by considering lower- and upper-confidence bounds. 94 | ## a confidence score is computed as a function of the mean and dev of homogeneity. 95 | if homogeneity_horizontal + 0.5*std_horiz/table_rows**0.5 + 0.01 < homogeneity_vertical - 0.5*std_verti/table_cols**0.5 : 96 | is_step1_success = True 97 | if homogeneity_horizontal < 0.1: 98 | smooth_coef = 0.1 ## smooth coef is necessary to avoid resulting a high confidence score 99 | ## when both horizontal and vertical homogeneity are too small (<0.1) 100 | 101 | y_score = homogeneity_vertical - 0.5*std_verti/table_cols**0.5 102 | x_score = homogeneity_horizontal + 0.5*std_horiz/table_rows**0.5 103 | orientation = "VERTICAL" 104 | orientation_score = (y_score - x_score)/(y_score + smooth_coef) 105 | 106 | 107 | elif homogeneity_horizontal - 0.5*std_horiz/table_rows**0.5 >= homogeneity_vertical + 0.5*std_verti/table_cols**0.5 + 0.01 : 108 | is_step1_success = True 109 | if homogeneity_vertical < 0.1: 110 | smooth_coef = 0.1 111 | y_score = homogeneity_vertical + 0.5*std_verti/table_cols**0.5 112 | x_score = homogeneity_horizontal - 0.5*std_horiz/table_rows**0.5 113 | orientation = "HORIZONTAL" 114 | orientation_score = (x_score-y_score)/(x_score + smooth_coef) 115 | else: 116 | orientation = "HORIZONTAL" 117 | orientation_score = 0.1 118 | is_step1_success = True 119 | 120 | if not is_step1_success: 121 | ## step 2: working on first row/column. 
We impose a strict constraint: a header can not contain cells exposing a primitive typing. 122 | ## store typing of each cell in top row 123 | top_row_typings = [] 124 | for element in targetTable[0][1:]: 125 | if element in table_Typings: 126 | top_row_typings.append(table_Typings[element]) 127 | else: 128 | top_row_typings.append("") 129 | 130 | ## store typing of each cell in left column 131 | left_col_typings = [] 132 | for line in targetTable[1:]: 133 | element = line[0] 134 | if element in table_Typings: 135 | left_col_typings.append(table_Typings[element]) 136 | else: 137 | left_col_typings.append("") 138 | ## count how many cells in top row that contain a typing. 139 | ratio_header_related_typing_top_row = 0 140 | for ts in top_row_typings: 141 | for t in ts: 142 | if t not in ["", "UNKNOWN"]: 143 | ratio_header_related_typing_top_row += 1 144 | break 145 | if top_row_typings: 146 | ratio_header_related_typing_top_row /= len(top_row_typings) 147 | else: 148 | ratio_header_related_typing_top_row = 0.0 149 | 150 | ## count how many cells in left col that contain a typing. 151 | ratio_header_related_typing_left_col = 0 152 | for ts in left_col_typings: 153 | for t in ts: 154 | if t not in ["", "UNKNOWN"]: 155 | ratio_header_related_typing_left_col += 1 156 | break 157 | if left_col_typings: 158 | ratio_header_related_typing_left_col /= len(left_col_typings) 159 | else: 160 | ratio_header_related_typing_left_col = 0.0 161 | 162 | ## If first column has nearly no primitive typings in its cells and first row has significant number of primitive typings in its cells 163 | if ratio_header_related_typing_top_row > 0.5 and ratio_header_related_typing_left_col < 0.05: 164 | orientation = "VERTICAL" 165 | orientation_score = 0.2 166 | ## If first row has nearly no primitive typings in its cells and first column has significant number of primitive typings in its cells 167 | elif ratio_header_related_typing_left_col > 0.5 and ratio_header_related_typing_top_row < 0.05: 168 | orientation = "HORIZONTAL" 169 | orientation_score = 0.2 170 | else: 171 | ## step 3: working on complete primitive typings of the all table cells. 172 | # We rely on the assumption that if a table is incorrectly oriented, no column in it has a consistent (homogenous) primitive typing, 173 | # whereas if the table is correctly oriented, it has at least 1 column from which a consistent primitive typing can be retrieved. 174 | horizontal_typings = table_primitive_typing(targetTable, table_Typings, top_k = 1) 175 | vertical_typings = table_primitive_typing(utils.transpose_heterogeneous_table(targetTable), table_Typings, top_k = 1) 176 | 177 | ## check if horizontal table contains at least 1 column from which a consistent primitive typing can be retrieved. 178 | homo_hori_typing_exist = False 179 | for col_idx, typings in horizontal_typings.type_list.items(): 180 | if typings[0]["type"] not in ["", "UNKNOWN"] and typings[0]["score"] > 0.8: 181 | homo_hori_typing_exist = True 182 | break 183 | ## check if vertical table contains at least 1 column from which a consistent primitive typing can be retrieved. 
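## Illustrative example of this assumption (hypothetical table): in a horizontal table whose rows are
## records such as ["France", "67 million", "Paris"], the country and capital columns expose a
## consistent GPE typing, whereas the transposed table mixes GPE and CARDINAL values in every column,
## so only the correctly oriented version is likely to contain a column with a typing score above 0.8.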
184 | homo_verti_typing_exist = False 185 | for col_idx, typings in vertical_typings.type_list.items(): 186 | if typings[0]["type"] not in ["", "UNKNOWN"] and typings[0]["score"] > 0.8: 187 | homo_verti_typing_exist = True 188 | break 189 | ## decide orientation 190 | if table_rows > 2 and table_cols > 2 and homo_hori_typing_exist and not homo_verti_typing_exist: 191 | orientation = "HORIZONTAL" 192 | orientation_score = 0.15 193 | elif table_rows > 2 and table_cols > 2 and homo_verti_typing_exist and not homo_hori_typing_exist: 194 | orientation = "VERTICAL" 195 | orientation_score = 0.15 196 | else: 197 | ## step 4: a very long+thin table is horizontal and a very short+fat table is vertical. 198 | if table_rows/table_cols <= 0.25 or table_rows/table_cols >= 4.0: 199 | if table_rows >= table_cols: 200 | orientation = "HORIZONTAL" 201 | orientation_score = 0.1 202 | else: 203 | orientation = "VERTICAL" 204 | orientation_score = 0.1 205 | else: 206 | ## step 4: WTC string length-based calculation. 207 | std_row_wordLength = utils.std_column_wordLength(targetTable, 208 | direction="horizontal") 209 | std_col_wordLength = utils.std_column_wordLength(targetTable, 210 | direction="vertical") 211 | print(std_row_wordLength, std_col_wordLength) 212 | if std_row_wordLength >= std_col_wordLength: 213 | orientation = "HORIZONTAL" 214 | orientation_score = 0.1 215 | else: 216 | orientation = "VERTICAL" 217 | orientation_score = 0.1 218 | 219 | return Table_Orientation(orientation=orientation, 220 | score=orientation_score) 221 | 222 | def table_header_detection(targetTable: List[List[str]], table_orientation_score, table_Typings) -> Table_Header: 223 | """ 224 | Header detection using primitive typings. We impose a strict constraint: a header can not contain cells exposing a primitive typing. 225 | """ 226 | ## we currently consider 2 cases: no_header or single header (first table row) 227 | potential_header = targetTable[0] 228 | ## get primitive typing for each cell in first table row which is potential header. 229 | potential_header_typings = [] 230 | for element in potential_header: 231 | if element in table_Typings: 232 | potential_header_typings.append(table_Typings[element]) 233 | else: 234 | potential_header_typings.append("") 235 | ## get primitive typings represent for each column in table. 236 | if len(targetTable) > 1: 237 | column_typings = utils.typing_per_column(targetTable[1:], table_Typings, 3) 238 | else: 239 | column_typings = utils.typing_per_column(targetTable, table_Typings, 3) 240 | ## Verify if first row is header by assumption: a header can not contain cells exposing a primitive typing. 241 | noheader_score = 0.0 242 | for i_col, typings in column_typings.items(): 243 | if potential_header_typings[i_col]: 244 | ## first case: if primitive typing of header cell is UNIT, MISC (not NAMED ENTITY). 245 | ## if header cell and its column has same primitive typing. (threshold is set to low value 0.2 since UNIT, MISC is reliably parsed.) 246 | if sum([utils.is_concept(t) for t in potential_header_typings[i_col]]) == 0: 247 | if typings[0]["type"] in potential_header_typings[i_col] and typings[0]["score"] > 0.2: 248 | noheader_score = max(noheader_score, typings[0]["score"]) 249 | ## second case: if primitive typing of header cell is NAMED ENTITY, but we do not consider UNKNOWN and PERSON since UNKNOWN contains no information and PERSON is high false positive. 250 | ## if header cell and its column has same primitive typing. 
(threshold is set to low value 0.2 since NAMED ENTITY is reliably parsed.) 251 | elif "UNKNOWN" not in potential_header_typings[i_col] and "PERSON" not in potential_header_typings[i_col]: 252 | if typings[0]["type"] in potential_header_typings[i_col] and typings[0]["score"] > 0.2: 253 | noheader_score = max(noheader_score, typings[0]["score"]) 254 | 255 | if noheader_score > 0.0: 256 | ## noheader detected. 257 | return Table_Header(has_header=False, header=[], score=noheader_score*table_orientation_score) 258 | else: 259 | ## header exist. 260 | hasheader_score = 0.0 261 | for i_col, typings in column_typings.items(): 262 | if potential_header_typings[i_col]: 263 | for dt in typings: 264 | if dt["type"] not in potential_header_typings[i_col]: 265 | hasheader_score += dt["score"] 266 | hasheader_score = hasheader_score / len(column_typings) 267 | return Table_Header(has_header=True, header=potential_header, score=hasheader_score*table_orientation_score) 268 | 269 | def table_primitive_typing(targetTable: List[List[str]], table_Typing, top_k: int = 1) -> Table_primitive_Typing: 270 | """ 271 | Return the primitive typing (generic type + specific type) for each table column. 272 | Args: 273 | targetTable: 2D input table 274 | top_k: return top k frequent types. 275 | Return: 276 | two type dictionaries (generic and specific) under format: {"col_idx": types} 277 | """ 278 | if len(targetTable) > 1: 279 | pri_typing_per_col = utils.typing_per_column(targetTable[1:], table_Typing, top_k) 280 | else: 281 | pri_typing_per_col = utils.typing_per_column(targetTable, table_Typing, top_k) 282 | return Table_primitive_Typing(type_list=pri_typing_per_col) 283 | 284 | def table_key_column_detection(targetTable: List[List[str]], table_orientation_score, table_Datatype) -> Table_keyColumn: 285 | """ 286 | The main assumption is that the key column covers a large amount of unique cell values. 287 | In addition to the requirement of having at least 50% unique values, it must be a column 288 | consisting primarily of objects (string values) ​​and an average length greater than 3.5% and less than 200. 289 | Args: 290 | targetTable: horizontal input table 291 | tablePrimitiveTyping: column typing used to decide whether the column is a potential key column. 
292 | if typing is Object: Yes, otherwise, No 293 | Return: 294 | key column index and confidence score 295 | """ 296 | if len(targetTable) > 1: 297 | column_dataTypes = utils.datatype_per_column(targetTable[1:], table_Datatype, 3) 298 | else: 299 | column_dataTypes = utils.datatype_per_column(targetTable, table_Datatype, 3) 300 | targetTable_T = utils.transpose_heterogeneous_table(targetTable) 301 | ## compute the uniqueness of each column 302 | column_scores = {} 303 | first_keyCol_candidate = None 304 | 305 | num_considered_cols = 0 306 | max_conisdered_cols = 2 307 | if len(targetTable_T) > 8: 308 | max_conisdered_cols = 3 309 | 310 | for col_idx, column in enumerate(targetTable_T): 311 | if column_dataTypes[col_idx][0]["type"]: 312 | if num_considered_cols <= max_conisdered_cols: 313 | num_considered_cols += 1 314 | keyCol_candidate_score = 0 315 | for dtype in column_dataTypes[col_idx]: 316 | if utils.keyColumn_related_datatype(dtype["type"]): 317 | keyCol_candidate_score += dtype["score"] 318 | if keyCol_candidate_score > 0.5: 319 | if not isinstance(first_keyCol_candidate, int): 320 | first_keyCol_candidate = col_idx 321 | unique_contents = [] 322 | empty_cells = [] 323 | number_of_words_per_cell = [] 324 | for cell in column: 325 | number_of_words_per_cell.append(len([word for word in cell.split(" ") if word])) 326 | if cell in table_Datatype: 327 | new_cell = cell 328 | for s in ".@_!#$%^&*()<>?/\|}{][~:\'-+~~_°¨": 329 | new_cell = new_cell.replace(s, '') 330 | for dtype in table_Datatype[cell]: 331 | if utils.keyColumn_related_datatype(dtype) and (3 < len(new_cell) < 200): 332 | unique_contents.append(cell) 333 | break 334 | else: 335 | empty_cells.append(cell) 336 | if unique_contents: 337 | ratio_unique_content = len(set(unique_contents)) / len(column) 338 | ratio_empty_cells = len(empty_cells) / len(column) 339 | # avg_words_per_cell = sum(number_of_words_per_cell)/len(number_of_words_per_cell) 340 | # column_scores[col_idx] = (3*ratio_unique_content+avg_words_per_cell-ratio_empty_cells)/np.sqrt(1+(col_idx-first_keyCol_candidate)) 341 | column_scores[col_idx] = (ratio_unique_content - ratio_empty_cells)/np.sqrt(1+2*(col_idx-first_keyCol_candidate)) 342 | else: 343 | column_scores[col_idx] = 0.0 344 | else: 345 | column_scores[col_idx] = 0.00 346 | ## sorted the list of uniqueness. In case of a lot of max_score, smaller index is more prefered. 347 | if column_scores: 348 | if len(column_scores) > 1: 349 | (key_col, max_score), (_, second_max_score) = sorted(column_scores.items(), key=lambda i: i[1], reverse=True)[:2] 350 | if max_score < 0.25: ## untrustable detection. 351 | return Table_keyColumn(keyColumn=None, score=0.0) 352 | else: 353 | return Table_keyColumn(keyColumn=key_col, score=(max_score - second_max_score)/(max_score+second_max_score)*table_orientation_score) 354 | else: 355 | if list(column_scores.items())[0][1] < 0.25: ## untrustable detection. 356 | return Table_keyColumn(keyColumn=None, score=0.0) 357 | else: 358 | 359 | return Table_keyColumn(keyColumn=list(column_scores.items())[0][0], score=table_orientation_score) 360 | else: 361 | return Table_keyColumn(keyColumn=None, score=0.0) 362 | 363 | def table_reshaping(targetTable: List[List[str]], table_Datatype, table_Typing) -> List[List[str]]: 364 | """ 365 | If table is not well-shaped (heterogeneous row lengths, col lengths), we try 366 | to reshape it padding "" or aligning short lines. 
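For example (hypothetical table), if most rows have 5 cells and one row has only 4, the short row
is either re-aligned against the dominant column data types or padded with "" so that all rows
end up with the same width.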
367 | Args: 368 | input table 369 | Return: 370 | reshaped table 371 | """ 372 | list_row_lens = [len(row) for row in targetTable] 373 | if (min(list_row_lens) != max(list_row_lens)): 374 | tab_width = max(set(list_row_lens), key=list_row_lens.count) 375 | reduced_table = [] 376 | for line in targetTable: 377 | if len(line) == tab_width: 378 | reduced_table.append(line) 379 | reshaped_table = [] 380 | if len(reduced_table) > 1: 381 | ### Orientation detection on well-shaped part of table ### 382 | tableOrientation = table_orientation_detection(reduced_table, table_Datatype, table_Typing) 383 | if tableOrientation.orientation == "HORIZONTAL": 384 | ### First row is potential header ### 385 | reshaped_table.append(targetTable[0]) 386 | 387 | ### Column Datatype & Primitive Typing on well-shaped part of table ### 388 | column_dataTypes = utils.datatype_per_column(targetTable[1:], table_Datatype) 389 | 390 | ### Reshape content of Table ### 391 | for line in targetTable[1:]: 392 | if len(line) < tab_width: 393 | ### try to reshape short line ### 394 | new_line = utils.re_align_short_row(line, table_Datatype, column_dataTypes) 395 | reshaped_table.append(new_line) 396 | ### simply append other lines ### 397 | else: 398 | reshaped_table.append(line) 399 | 400 | else: 401 | reshaped_table = targetTable 402 | else: 403 | reshaped_table = targetTable 404 | ## padding "" for unsolved short lines. 405 | utils.table_padding(reshaped_table, max(list_row_lens)) 406 | return reshaped_table 407 | else: 408 | return targetTable 409 | 410 | if __name__ == '__main__': 411 | pass 412 | 413 | -------------------------------------------------------------------------------- /preprocessing/prp_scripts/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristic to clean the table (e.g. fix encoding error), determine table orientation, data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an elastic search-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 
19 | """ 20 | ''' 21 | TABLE UTILITY FUNCTIONS 22 | Developper: Huynh Viet Phi (vietphi.huynh@orange.com) 23 | ''' 24 | import json 25 | import numpy as np 26 | import ftfy 27 | from collections import Counter, namedtuple 28 | from itertools import combinations 29 | 30 | from .entity_parsers.spacy_ner_parser import spacy_parser 31 | from .entity_parsers.unit_parser import unit_parser 32 | from .entity_parsers.regex_parser import regex_parser 33 | from .entity_parsers.phoneNumber_parser import phonenumber_parser 34 | 35 | """ Named entity and Unit entity parsers """ 36 | def is_concept(label): 37 | concept_list = ["EVENT", "FAC", "GPE", "LAW", "LOC", "NORP", "ORG", "PERSON", "PRODUCT", "WORK_OF_ART", "LANGUAGE", "UNKNOWN"] 38 | for concept in concept_list: 39 | if concept in label: 40 | return True 41 | return False 42 | 43 | def typing_priority(t): 44 | if t != "CARDINAL": 45 | return 1 46 | else: 47 | return 0 48 | 49 | def get_string_type(label): 50 | if len(label) >=100: 51 | return "String_Normal" 52 | 53 | ## string whose large portion is numer is considered as number string 54 | elif 2*len([char for char in label if char.isdigit()])>=len(label): 55 | return "String_Number" 56 | 57 | elif label.upper() == label: 58 | return "String_Uppercase" 59 | 60 | else: 61 | ## Unsolved case assumes normal string 62 | return "String_Normal" 63 | 64 | def text_parser(list_cell): 65 | entity_per_cell = {} 66 | remain_cells = [] 67 | for cell in list_cell: 68 | if (cell == "") or (cell[0] in ".@_!#$%^&*()<>?/\|}{~:\'-+~~_°¨" and cell == cell[0]*len(cell)): 69 | continue ## "" means no data type 70 | ## string contains a single char that is neither alpha nor digit has no type 71 | if (len(cell) == 1) and \ 72 | (((not cell.isalpha()) and (not cell.isdigit())) or len(cell.encode("utf-8")) > 1): 73 | continue 74 | entity_per_cell[cell] = [] 75 | if len(cell) > 70: 76 | entity_per_cell[cell].append("UNKNOWN") 77 | continue 78 | remain_cells.append(cell) 79 | 80 | if remain_cells: 81 | ## perform typing parsers 82 | regex_entity_per_cell = regex_parser(remain_cells) 83 | unit_entity_per_cell = unit_parser(remain_cells) 84 | phone_entity_per_cell = phonenumber_parser(remain_cells) 85 | named_entity_cell = spacy_parser(remain_cells) 86 | 87 | ## collect typings 88 | for ner_list in [phone_entity_per_cell, regex_entity_per_cell, 89 | unit_entity_per_cell, named_entity_cell]: 90 | for cell in ner_list: 91 | for a_ner in ner_list[cell]: 92 | if a_ner not in entity_per_cell[cell]: 93 | entity_per_cell[cell].append(a_ner) 94 | 95 | for cell in entity_per_cell: 96 | if not entity_per_cell[cell]: 97 | entity_per_cell[cell].append("UNKNOWN") 98 | 99 | ## datatype parsers (from typing) 100 | datatype_per_cell = {} 101 | for cell, entities in entity_per_cell.items(): 102 | datatype_per_cell[cell] = [] 103 | for entity in entities: 104 | if is_concept(entity): 105 | dtype = get_string_type(cell) 106 | if dtype not in datatype_per_cell[cell]: 107 | datatype_per_cell[cell].append(dtype) 108 | else: 109 | if entity not in datatype_per_cell[cell]: 110 | datatype_per_cell[cell].append(entity) 111 | 112 | return entity_per_cell, datatype_per_cell 113 | 114 | """ Utility functions specifically for table processing """ 115 | #### type related functions #### 116 | def header_related_datatype(t): 117 | """ 118 | Verify whether a type could appear in header. 
119 | """ 120 | if t in ["String_Normal", "String_Uppercase"]: 121 | return True 122 | else: 123 | return False 124 | 125 | def keyColumn_related_datatype(t): 126 | """ 127 | Verify whether a type could appear in header. 128 | """ 129 | if t in ["String_Normal", "String_Uppercase", "String_Number"]: 130 | return True 131 | else: 132 | return False 133 | 134 | #### Table-related functions #### 135 | def recover_poorly_encoded_cell(broke_cell): 136 | """ Recovering a badly-encoded utf-8 cell """ 137 | ## Poorly encoded cells: encode by utf-8 but decode by unicode 138 | try: 139 | cell = broke_cell 140 | # Solved: 141 | # 1. Convert string to byte keeping content, in other word, encode string by latin1 142 | byte_cell = bytes(cell.encode('latin1')) 143 | # 2. Decode Cell Byte by unicode 144 | new_cell = byte_cell.decode('unicode-escape') 145 | return ftfy.fix_text(new_cell) 146 | except: 147 | return ftfy.fix_text(broke_cell) 148 | 149 | def table_filtering(table): 150 | """ 151 | Possibly filtering empty rows and information rows (title rows) 152 | Args: 153 | input table 154 | Return: 155 | filtered table and possible information rows. 156 | """ 157 | title = [] 158 | new_table = [] 159 | max_width = max([len(row) for row in table]) 160 | for row in table: 161 | if row: 162 | new_row = [] 163 | num_nonmissing_cell = 0 164 | for cell in row: 165 | if cell != "" and cell != " "*len(cell): 166 | num_nonmissing_cell += 1 167 | try: 168 | new_row.append(recover_poorly_encoded_cell(cell)) 169 | except: 170 | new_row.append(cell) 171 | if num_nonmissing_cell > 0: 172 | new_table.append(new_row) 173 | ## padding short rows 174 | table_padding(new_table, max_width) 175 | ## remove empty columns 176 | new_table = table_null_column_removing(new_table) 177 | return new_table, title 178 | 179 | def table_padding(table, tab_width): 180 | """ 181 | Padding "" to short rows to balance the table. 182 | """ 183 | for line in table: 184 | for i in range(tab_width-len(line)): 185 | line.append("") 186 | 187 | def table_null_column_removing(table): 188 | """ 189 | Remove null columns in table. 190 | """ 191 | table_T = list(map(list, zip(*table))) 192 | final_table = [] 193 | for line in table_T: 194 | if line != [""]*(len(table)): 195 | final_table.append(line) 196 | final_table = list(map(list, zip(*final_table))) 197 | return final_table 198 | 199 | def transpose_heterogeneous_table(table): 200 | """ 201 | Transpose a heterogeneous table (rows with possible different widths). 202 | We natively padding "" for short lines and perform normal transpose. 203 | """ 204 | table_T = [] 205 | end_of_tab = False 206 | i = 0 207 | while not end_of_tab: 208 | end_of_tab = True 209 | line_T = [] 210 | for line in table: 211 | if i < len(line): 212 | line_T.append(line[i]) 213 | end_of_tab = False 214 | else: 215 | line_T.append("") 216 | i += 1 217 | table_T.append(line_T) 218 | return table_T[:-1] 219 | 220 | #### Type-related functions #### 221 | def parse_table(table): 222 | list_cells = list(set([item for line in table for item in line])) 223 | entity_list, dataType_list = text_parser(list_cells) 224 | return entity_list, dataType_list 225 | 226 | def datatype_per_column(table, table_Datatype, top_k=1): 227 | """ 228 | Return the top_k data types of each column in table. 229 | A type counter is applied to each column and the top_k most frequent type 230 | in the counter is the top_k data type of corresponding column. 231 | 232 | Args: 233 | table: horizontally oriented-table. 234 | nested list whose sublist is a row. 
235 | top_k: default 1, the most frequent type 236 | Return: 237 | dict mapping each column index to its top_k data types with scores. 238 | """ 239 | ## transpose the table in order to retrieve its columns more easily 240 | table_T = transpose_heterogeneous_table(table) 241 | ## apply a Datatype counter to each column. 242 | type_per_col = {} 243 | for col_idx, col in enumerate(table_T): 244 | dtypes = {} 245 | sum_type = 0 246 | for cell in col: 247 | if cell in table_Datatype: 248 | for a_dt in table_Datatype[cell]: 249 | dtypes[a_dt] = dtypes.get(a_dt, 0) + 1 250 | sum_type += 1 251 | 252 | for cell in col: 253 | if len(table_Datatype.get(cell, [])) > 1: 254 | sorted_ts = sorted(table_Datatype[cell], key=lambda x: (dtypes[x], typing_priority(x)), reverse=True) 255 | for other_type in sorted_ts[1:]: 256 | dtypes[other_type] -= 1 257 | if dtypes[other_type] == 0: 258 | dtypes.pop(other_type) 259 | 260 | if dtypes: 261 | sorted_types = Counter(dtypes) 262 | top_k_freq = sorted_types.most_common(top_k) 263 | type_per_col[col_idx] = [{"type": item[0], "score": item[1]/sum_type} for item in top_k_freq if item[1] > 0.0] 264 | else: 265 | type_per_col[col_idx] = [{"type": "", "score": 1.0}] 266 | return type_per_col 267 | 268 | def typing_per_column(table, table_Typing, top_k=1): 269 | """ 270 | Return the top_k primitive types of each column in the table. 271 | A type counter is applied to each column and the top_k most frequent types 272 | in the counter become the top_k types of the corresponding column. 273 | 274 | Args: 275 | table: horizontally oriented table. 276 | nested list whose sublists are rows. 277 | top_k: default 1, the most frequent type 278 | Return: 279 | dict mapping each column index to its top_k types with scores. 280 | """ 281 | ## transpose the table in order to retrieve its columns more easily 282 | table_T = transpose_heterogeneous_table(table) 283 | ## apply a type counter to each column. 284 | type_per_col = {} 285 | for col_idx, col in enumerate(table_T): 286 | dtypes = {} 287 | sum_type = 0 288 | for cell in col: 289 | if cell in table_Typing: 290 | for a_tp in table_Typing[cell]: 291 | dtypes[a_tp] = dtypes.get(a_tp, 0) + 1 292 | sum_type += 1 293 | 294 | for cell in col: 295 | if len(table_Typing.get(cell, set())) > 1: 296 | sorted_ts = sorted(table_Typing[cell], key=lambda x: (dtypes[x], typing_priority(x)), reverse=True) 297 | for other_type in sorted_ts[1:]: 298 | dtypes[other_type] -= 1 299 | if dtypes[other_type] == 0: 300 | dtypes.pop(other_type) 301 | if dtypes: 302 | sorted_types = Counter(dtypes) 303 | top_k_freq = sorted_types.most_common(top_k) 304 | type_per_col[col_idx] = [{"type": item[0], "score": item[1]/sum_type} for item in top_k_freq if item[1] > 0.0] 305 | else: 306 | type_per_col[col_idx] = [{"type": "", "score": 1.0}] 307 | 308 | if col_idx == 0 and type_per_col[col_idx][0]["type"] == "CARDINAL": 309 | ## verify whether the first column is an index column 310 | current_index = None 311 | is_index_column = True 312 | tolerate = 0 313 | for cell in col: 314 | try: 315 | idx = int(float(cell)) 316 | if current_index: 317 | if idx == current_index + 1: 318 | current_index += 1 319 | elif idx == current_index: 320 | pass 321 | else: 322 | is_index_column = False 323 | break 324 | else: 325 | current_index = idx 326 | except: 327 | ## no index detected in this cell 328 | current_index = None 329 | tolerate += 1 330 | if tolerate > 4: ## tolerance on cells with no detected index; beyond this number, the column is not considered ordinal.
331 | is_index_column = False 332 | break 333 | if is_index_column: 334 | type_per_col[col_idx][0]["type"] = "ORDINAL" 335 | 336 | return type_per_col 337 | 338 | #### Orientation-related functions #### 339 | def homogeneity_compute(table, table_Datatype, direction="horizontal") -> float: 340 | """ 341 | Compute the homogeneity of the table's horizontal lines (rows) or vertical lines (columns). 342 | The homogeneity represents the uniqueness of a data type across the table's lines. 343 | Args: 344 | table: input table 345 | direction: horizontal/vertical 346 | Return: 347 | mean and std of the homogeneity across the table's lines. 348 | 349 | """ 350 | computed_table = [] 351 | if direction == "horizontal": 352 | computed_table = table 353 | elif direction == "vertical": 354 | computed_table = transpose_heterogeneous_table(table) 355 | 356 | ## compute homogeneity for each line 357 | homogeneity_per_line = [] 358 | for line in computed_table: 359 | dtypes = {} 360 | sum_type = 0 361 | for cell in line: 362 | if cell in table_Datatype: 363 | for a_dt in table_Datatype[cell]: 364 | dtypes[a_dt] = dtypes.get(a_dt, 0) + 1 365 | sum_type += 1 366 | for cell in line: 367 | if len(table_Datatype.get(cell, [])) > 1: 368 | sorted_ts = sorted(table_Datatype[cell], key=lambda x: (dtypes[x], typing_priority(x)), reverse=True) 369 | for other_type in sorted_ts[1:]: 370 | dtypes[other_type] -= 1 371 | if dtypes[other_type] == 0: 372 | dtypes.pop(other_type) 373 | 374 | ## if a line contains too many missing values, it is not reliable to take it into 375 | ## account in the homogeneity calculation. 376 | if sum_type/len(line) >= 0.25: 377 | type_homoCoef = 0 378 | for t in dtypes: 379 | popularity_score = 1 - np.square(1 - 2 * (dtypes[t] / sum_type)) 380 | type_homoCoef = type_homoCoef + popularity_score 381 | homogeneity = np.square(type_homoCoef / len(dtypes)) 382 | homogeneity_per_line.append(homogeneity) 383 | 384 | ## a single usable line is not enough to compute a reliable mean and std 385 | if len(homogeneity_per_line) > 1: 386 | ## compute mean and std across different lines 387 | mean_homogeneity = np.mean(homogeneity_per_line) 388 | std_homogeneity = np.std(homogeneity_per_line, ddof=1) 389 | return mean_homogeneity, std_homogeneity 390 | else: 391 | return None, None 392 | 393 | def std_column_wordLength(table, direction="horizontal") -> float: 394 | computed_table = [] 395 | if direction == "horizontal": 396 | computed_table = table 397 | elif direction == "vertical": 398 | computed_table = transpose_heterogeneous_table(table) 399 | 400 | standardDeviationofAllRows = [] 401 | for line in computed_table: 402 | cell_lens = [] 403 | for cell in line: 404 | if cell: 405 | cell_lens.append(len(cell)) 406 | if 2*len(cell_lens) >= len(line): 407 | standardDeviationofAllRows.append(np.std(cell_lens)) 408 | if standardDeviationofAllRows: 409 | return np.mean(standardDeviationofAllRows) 410 | else: 411 | return 0 412 | 413 | #### Realignment-related functions #### 414 | def re_align_short_row(line, table_Datatype, column_dataTypes): 415 | ''' Try to re-align a short row to the table's columns by matching cell data types with column data types. ''' 416 | 417 | dtype_of_currline = [] 418 | for element in line: 419 | if element in table_Datatype: 420 | dtype_of_currline.append(table_Datatype[element]) 421 | else: 422 | dtype_of_currline.append("") 423 | 424 | alignable = True 425 | if "" in dtype_of_currline: 426 | alignable = False 427 | else: 428 | for col_idx, col_datatype in column_dataTypes.items(): 429 | if col_datatype[0]["type"] == "" or col_datatype[0]["score"] < 0.75: 430 | alignable = False 431 | break 432 | if alignable: 433 | index_comb = 
combinations(range(len(column_dataTypes)), len(line)) 434 | index_comb = list(index_comb) 435 | valid_alignments = [] 436 | for idx_set in index_comb: 437 | target_col_types = [column_dataTypes[idx][0]["type"] for idx in idx_set] 438 | if target_col_types == dtype_of_currline: 439 | valid_alignments.append(idx_set) 440 | 441 | if len(valid_alignments) == 1: 442 | new_line = [""]*len(column_dataTypes) 443 | for idx, new_val in zip(valid_alignments[0], line): 444 | new_line[idx] = new_val 445 | return new_line 446 | return line 447 | 448 | """ OUTPUT """ 449 | class Table_Informations(namedtuple( 450 | "Table_Information", ("raw_table", "structured_table", "title", 451 | "orientation", "header", "keyColumn", "primitiveTyping"))): 452 | """ To allow for flexibility in returning different outputs. """ 453 | pass 454 | 455 | def save_table_info(output_path, table_info): 456 | """ Save the table information as a JSON file. """ 457 | with open(output_path, 'w') as f: 458 | json.dump(table_info._asdict(), f) 459 | 460 | def output_jsonlize(table_info): 461 | # convert the table_info named tuple into a dictionary 462 | # for a cleaner JSON display. 463 | table_info_dict = { "raw_table": table_info.raw_table, 464 | "restructured_table": table_info.structured_table, 465 | "orientation": {"label": table_info.orientation.orientation, "score": round(table_info.orientation.score,2)}, 466 | "header":{"label": table_info.header.header, "score": round(table_info.header.score,2)}, 467 | "keyColumn": {"label": table_info.keyColumn.keyColumn, "score": round(table_info.keyColumn.score,2)}, 468 | "primitiveTyping": [{"column": i_col, "typing": [{"label": t["type"], "score": round(t["score"],2)} for t in ts]} for i_col, ts in table_info.primitiveTyping.type_list.items()] 469 | } 470 | return table_info_dict 471 | 472 | if __name__ == '__main__': 473 | # table = [["Col 0","Col 1","Col 2","Col 3"], 474 | # ["1.","Nguyen An","an@gmail.com","0654893215"], 475 | # ["2.","Tran Binh","binh@orange.com","+33(0)624759812"], 476 | # ["3.","Huynh Cong","cong@eurecom.fr","+840641896315"], 477 | # ["12","1888","6 kilo","093-456-123"]] 478 | table = [["United States", "2015 National Women's Soccer League", "FC Kansas City" , "2nd", "2014"]] 479 | print(parse_table([["vietphi.huynh@orange.com", "2 m/s", "Orange Labs", "France"], ["(2-3)"]])) 480 | # print(parse_table(table)) 481 | # table_Typing,table_Datatype = parse_table(table) 482 | # print(table_Typing) 483 | # for line in transpose_heterogeneous_table(table): 484 | # print(typing_per_column(table[1:], table_Typing, 3)) 485 | # typing_per_column(table[1:], table_Typing, 3) 486 | 487 | 488 | 489 | -------------------------------------------------------------------------------- /preprocessing/table_preprocessing.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristics to clean the table (e.g. fixing encoding errors) and determine the table orientation and the data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an Elasticsearch-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention.
This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License. 10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | ''' 21 | MAIN PRE-PROCESSING 22 | ''' 23 | from typing import List 24 | from random import shuffle 25 | 26 | from .prp_scripts import table_info_extraction_modules as tb_modu 27 | from .prp_scripts import utils as tb_utils 28 | 29 | def table_preprocessing(raw_table: List[List[str]]): 30 | """ 31 | Extract basic information from a table: 32 | + Potential Reshaping, Filtering table 33 | + Orientation 34 | + Header 35 | + Column Primitive Typing 36 | + Key Column 37 | 38 | """ 39 | preprocessing_output = {"raw": { 40 | "tableDataRaw": raw_table 41 | }, 42 | "preprocessed": {}} 43 | 44 | ## filter the table (encoding correction, blank line removal) 45 | table,_ = tb_utils.table_filtering(raw_table) 46 | if len(table) > 1: 47 | ## if the table is too large, say > 400 lines (horizontal) or > 400 columns (vertical), 48 | ## randomly sub-sample it down to at most 400 lines (or 400 columns) to ensure a reasonable preprocessing time (~<60s) 49 | sample_table = table 50 | if len(table) > 400: 51 | sample_index = list(range(10,len(table))) 52 | shuffle(sample_index) 53 | sample_index = list(range(10)) + sample_index[:390] ## always keep the first 10 lines since they may contain the header.
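## Illustration of the sampling above: for a 1,000-line table, lines 0-9 are always kept and 390
## further indices are drawn at random from lines 10-999, so at most 400 lines are parsed.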
54 | sample_index = sorted(sample_index) 55 | sample_table = [table[i] for i in sample_index] 56 | 57 | ### Table parsing: extract entities + datatypes ### 58 | tb_entity_list, tb_dataType_list = tb_utils.parse_table(sample_table) 59 | 60 | # print(tb_entity_list) 61 | ### Potentially reshaping exotic table ### TODO 62 | # table = tb_modu.table_reshaping(table, tb_dataType_list, tb_entity_list) 63 | ### Removing null column ### TODO 64 | # table = tb_utils.table_null_column_removing(table) 65 | 66 | ### Extracting table information ### 67 | ## Orientation 68 | tableOrientation = tb_modu.table_orientation_detection(sample_table, tb_dataType_list, tb_entity_list) 69 | if tableOrientation.orientation == "VERTICAL": 70 | sample_table = tb_utils.transpose_heterogeneous_table(sample_table) 71 | table = tb_utils.transpose_heterogeneous_table(table) 72 | ## Column Primitive Typing 73 | tablePrimitiveTyping = tb_modu.table_primitive_typing(sample_table, tb_entity_list, top_k = 3) 74 | 75 | ## Key Column 76 | tableKeyColumn = tb_modu.table_key_column_detection(sample_table, tableOrientation.score, tb_dataType_list) 77 | 78 | ## Header 79 | tableHeader = tb_modu.table_header_detection(sample_table, tableOrientation.score, tb_entity_list) 80 | 81 | preprocessing_output["preprocessed"]["tableDataRevised"] = table 82 | preprocessing_output["preprocessed"]["tableOrientation"] = {"orientationLabel": tableOrientation.orientation, 83 | "orientationScore": round(tableOrientation.score,2) 84 | } 85 | 86 | preprocessing_output["preprocessed"]["headerInfo"] = {"hasHeader": tableHeader.has_header, 87 | "headerPosition": 0 if tableHeader.has_header else None, 88 | "headerLabel": tableHeader.header, 89 | "headerScore": round(tableHeader.score,2) 90 | } 91 | 92 | preprocessing_output["preprocessed"]["primaryKeyInfo"] = {"hasPrimaryKey": bool(tableKeyColumn.keyColumn is not None), 93 | "primaryKeyPosition": tableKeyColumn.keyColumn, 94 | "primaryKeyScore": round(tableKeyColumn.score,2) 95 | } 96 | 97 | preprocessing_output["preprocessed"]["primitiveTyping"] = [{"columnIndex": i_col, 98 | "typing": [{"typingLabel": t["type"], "typingScore": round(t["score"],2)} for t in ts]} for i_col, ts in tablePrimitiveTyping.type_list.items()] 99 | 100 | return preprocessing_output 101 | 102 | if __name__ == "__main__": 103 | print("dkm") 104 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | """ 2 | * Software Name : TableAnnotation 3 | * Author: Viet-Phi Huynh, Jixiong Liu, Yoan Chabot, Frédéric Deuzé and Raphael Troncy 4 | * Software description: TableAnnotation (a.k.a DAGOBAH) is a semantic annotation tool for tables leveraging three steps: 1) Table Preprocessing: a set of comprehensive heuristics to clean the table (e.g. fixing encoding errors) and determine the table orientation and the data types of columns. 2) Entity Lookup: retrieve a number of entity candidates for mentions in the table, using an Elasticsearch-based entity lookup. 3) Annotation: disambiguate retrieved entity candidates, select the most relevant entity for each mention. This consists of three tasks, namely Cell-Entity Annotation, Column-Type Annotation, Column-Pair Annotation. 5 | * Version: <1.0.0> 6 | * SPDX-FileCopyrightText: Copyright (c) 2023 Orange 7 | * SPDX-License-Identifier: GPL-3.0-or-later 8 | * Licensed under the GNU-GPL v3 (the "License"); 9 | * you may not use this file except in compliance with the License.
10 | * You may obtain a copy of the License at 11 | * 12 | * https://www.gnu.org/licenses/gpl-3.0.html#license-text 13 | * 14 | * Unless required by applicable law or agreed to in writing, software 15 | * distributed under the License is distributed on an "AS IS" BASIS, 16 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | * See the License for the specific language governing permissions and 18 | * limitations under the License. 19 | """ 20 | from setuptools import setup 21 | 22 | setup( 23 | name='Table Annotation', 24 | version='1.0', 25 | description='DAGOBAH: A toolkit for semantic table annotation using heuristics', 26 | author='Orange SA', 27 | license='GPL-3.0-or-later', 28 | packages=['annotation', 'lookup', 'preprocessing'], 29 | install_requires=[ 30 | 'ftfy==6.0.3', 31 | 'numpy==1.22.0', 32 | 'scipy==1.7.3', 33 | 'pandas==1.4.0', 34 | 'quantulum3==0.7.9', 35 | 'stemming==1.0.1', 36 | 'phonenumbers==8.12.22', 37 | 'spacy==3.7.2', 38 | 'pydantic==2.5.2', 39 | 'typing-extensions>=4.6.1', 40 | 'tqdm==4.60.0', 41 | 'lmdb==1.3.0', 42 | 'rapidfuzz==1.9.1', 43 | 'openpyxl==3.0.9', 44 | 'Pint==0.18', 45 | 'Unidecode==1.3.4', 46 | 'elasticsearch==7.15.1', 47 | ] 48 | ) --------------------------------------------------------------------------------
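A minimal usage sketch of the preprocessing entry point (illustrative only; the example rows are adapted from the commented-out test table in prp_scripts/utils.py, and the exact labels and scores depend on the installed models):

from preprocessing.table_preprocessing import table_preprocessing

raw_table = [["Col 0", "Col 1", "Col 2", "Col 3"],
             ["1.", "Nguyen An", "an@gmail.com", "0654893215"],
             ["2.", "Tran Binh", "binh@orange.com", "+33(0)624759812"]]

output = table_preprocessing(raw_table)
# The result exposes the revised table plus the detected orientation, header, key column
# and per-column primitive typing, e.g.:
print(output["preprocessed"]["tableOrientation"])   # {"orientationLabel": ..., "orientationScore": ...}
print(output["preprocessed"]["primitiveTyping"])    # per-column typing labels with scores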