├── LICENSE ├── named_entities.csv └── node2vec.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /named_entities.csv: -------------------------------------------------------------------------------- 1 | named_entities 2 | "basketball,Kobe Bryant" 3 | "basketball,Lebron James" 4 | -------------------------------------------------------------------------------- /node2vec.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This notebook accompanies the blog post https://engineering.taboola.com/think-your-data-different." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pandas as pd\n", 17 | "import numpy as np\n", 18 | "import itertools\n", 19 | "from sklearn.cluster import KMeans\n", 20 | "import pprint" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## 1. 
Prepare input for node2vec\n", 28 | "We'll use a CSV file where each row represents a single recommendable item and contains a comma-separated list of the named entities that appear in the item's title." 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "data": { 38 | "text/html": [ 39 | "
\n", 40 | "\n", 53 | "\n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | "
\n", 83 | "
" 84 | ], 85 | "text/plain": [ 86 | " named_entities\n", 87 | "0 CONCEPT-certification mark,CONCEPT-i swear,CON...\n", 88 | "1 CONCEPT-middle school,CONCEPT-gun,CONCEPT-scho...\n", 89 | "2 Facility-rush university medical center,CONCEP...\n", 90 | "3 CONCEPT-web browser\n", 91 | "4 CONCEPT-types of companies,Person-saquon barkl..." 92 | ] 93 | }, 94 | "execution_count": 2, 95 | "metadata": {}, 96 | "output_type": "execute_result" 97 | } 98 | ], 99 | "source": [ 100 | "named_entities_df = pd.read_csv('named_entities.csv')\n", 101 | "named_entities_df.head()" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "First, we'll have to tokenize the named entities, since `node2vec` expects integers." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 3, 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "data": { 118 | "text/html": [ 119 | "
\n", 120 | "\n", 133 | "\n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | "
\n", 163 | "
" 164 | ], 165 | "text/plain": [ 166 | " named_entities\n", 167 | "0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n", 168 | "1 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2...\n", 169 | "2 [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 3...\n", 170 | "3 [41]\n", 171 | "4 [42, 43, 44, 45, 46, 9]" 172 | ] 173 | }, 174 | "execution_count": 3, 175 | "metadata": {}, 176 | "output_type": "execute_result" 177 | } 178 | ], 179 | "source": [ 180 | "tokenizer = dict()\n", 181 | "named_entities_df['named_entities'] = named_entities_df['named_entities'].apply(\n", 182 | " lambda named_entities: [tokenizer.setdefault(named_entity, len(tokenizer))\n", 183 | " for named_entity in named_entities.split(',')])\n", 184 | "named_entities_df.head()" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 4, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "{'CONCEPT-gal gadot': 20918,\n", 197 | " 'CONCEPT-irish singles chart number one singles': 59693,\n", 198 | " 'CONCEPT-tarantula': 83904,\n", 199 | " 'Organization-ohio republican party': 93001,\n", 200 | " 'Person-billy donovan': 32857}\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "pprint.pprint(dict(list(tokenizer.items())[:5]))" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "In order to construct the graph on which we'll run node2vec, we first need to find which named entities appear together in the same item." 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 5, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/html": [ 223 | "
\n", 224 | "\n", 237 | "\n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | "
\n", 273 | "
" 274 | ], 275 | "text/plain": [ 276 | " named_entity_1 named_entity_2\n", 277 | "0 0 1\n", 278 | "1 0 2\n", 279 | "2 0 3\n", 280 | "3 0 4\n", 281 | "4 0 5" 282 | ] 283 | }, 284 | "execution_count": 5, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | } 288 | ], 289 | "source": [ 290 | "pairs_df = named_entities_df['named_entities'].apply(lambda named_entities: list(itertools.combinations(named_entities, 2)))\n", 291 | "pairs_df = pairs_df[pairs_df.apply(len) > 0]\n", 292 | "pairs_df = pd.DataFrame(np.concatenate(pairs_df.values), columns=['named_entity_1', 'named_entity_2'])\n", 293 | "pairs_df.head()" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "Now we can construct the graph. The weight of an edge connecting two named entities will be the number of times these named entities appear together in our dataset." 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 6, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "data": { 310 | "text/html": [ 311 | "
\n", 312 | "\n", 325 | "\n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | "
\n", 367 | "
" 368 | ], 369 | "text/plain": [ 370 | " named_entity_1 named_entity_2 weight\n", 371 | "49 3 9 34\n", 372 | "988 9 41 1142\n", 373 | "1275 11 127 31\n", 374 | "1281 11 134 35\n", 375 | "1290 11 149 61" 376 | ] 377 | }, 378 | "execution_count": 6, 379 | "metadata": {}, 380 | "output_type": "execute_result" 381 | } 382 | ], 383 | "source": [ 384 | "NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD = 25\n", 385 | "\n", 386 | "edges_df = pairs_df.groupby(['named_entity_1', 'named_entity_2']).size().reset_index(name='weight')\n", 387 | "edges_df = edges_df[edges_df['weight'] > NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD]\n", 388 | "edges_df[['named_entity_1', 'named_entity_2', 'weight']].to_csv('edges.csv', header=False, index=False, sep=' ')\n", 389 | "edges_df.head()" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "Next, we'll run `node2vec`, which will write the resulting embeddings to a file called `emb`. \n", 397 | "We'll use the open source implementation developed by [Stanford](https://github.com/snap-stanford/snap/tree/master/examples/node2vec)." 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 7, 403 | "metadata": {}, 404 | "outputs": [ 405 | { 406 | "name": "stdout", 407 | "output_type": "stream", 408 | "text": [ 409 | "Walk iteration:\n", 410 | "1 / 10\n", 411 | "2 / 10\n", 412 | "3 / 10\n", 413 | "4 / 10\n", 414 | "5 / 10\n", 415 | "6 / 10\n", 416 | "7 / 10\n", 417 | "8 / 10\n", 418 | "9 / 10\n", 419 | "10 / 10\n" 420 | ] 421 | } 422 | ], 423 | "source": [ 424 | "!python node2vec/src/main.py --input edges.csv --output emb --weighted" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "## 2. Read embedding and run KMeans clustering:" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 8, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "data": { 441 | "text/html": [ 442 | "
\n", 443 | "\n", 456 | "\n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " 
\n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | "
\n", 630 | "


\n", 631 | "
" 632 | ], 633 | "text/plain": [ 634 | " 1 2 3 4 5 6 \\\n", 635 | "named_entity \n", 636 | "45 0.193684 0.199515 -0.558070 0.193501 -0.151151 -0.108368 \n", 637 | "41 0.116208 -0.013772 0.270675 0.227480 -0.123978 -0.076915 \n", 638 | "478 0.326508 -0.080868 -0.534134 0.137786 -0.262377 -0.071972 \n", 639 | "88 -0.053936 -0.098514 -0.116975 0.194783 -0.127855 0.310879 \n", 640 | "83 0.013028 -0.122749 -0.029661 0.059336 -0.258743 0.397353 \n", 641 | "\n", 642 | " 7 8 9 10 ... 119 \\\n", 643 | "named_entity ... \n", 644 | "45 -0.080395 0.483877 -0.216687 -0.027689 ... -0.020264 \n", 645 | "41 -0.080015 0.338822 0.007791 -0.028516 ... -0.250689 \n", 646 | "478 -0.187409 0.533022 -0.314909 -0.019874 ... -0.160482 \n", 647 | "88 -0.050054 -0.002542 0.094705 -0.104536 ... 0.025011 \n", 648 | "83 -0.082249 0.078653 0.102366 0.091354 ... 0.141847 \n", 649 | "\n", 650 | " 120 121 122 123 124 125 \\\n", 651 | "named_entity \n", 652 | "45 -0.219160 -0.006211 -0.116050 -0.208311 -0.238917 0.416022 \n", 653 | "41 -0.219996 -0.346024 0.006914 -0.185476 0.099120 0.231357 \n", 654 | "478 -0.192272 -0.132486 -0.058005 -0.182971 -0.201600 0.317926 \n", 655 | "88 -0.357876 -0.238409 0.247654 0.082463 -0.147044 0.153850 \n", 656 | "83 -0.456273 -0.119102 0.301741 0.072765 -0.035528 0.042997 \n", 657 | "\n", 658 | " 126 127 128 \n", 659 | "named_entity \n", 660 | "45 -0.069208 0.382213 -0.198407 \n", 661 | "41 0.326392 0.197053 -0.103405 \n", 662 | "478 0.059988 0.380023 -0.127033 \n", 663 | "88 -0.535327 -0.435655 0.259705 \n", 664 | "83 -0.511059 -0.263644 0.366281 \n", 665 | "\n", 666 | "[5 rows x 128 columns]" 667 | ] 668 | }, 669 | "execution_count": 8, 670 | "metadata": {}, 671 | "output_type": "execute_result" 672 | } 673 | ], 674 | "source": [ 675 | "emb_df = pd.read_csv('emb', sep=' ', skiprows=[0], header=None)\n", 676 | "emb_df.set_index(0, inplace=True)\n", 677 | "emb_df.index.name = 'named_entity'\n", 678 | "emb_df.head()" 679 | ] 680 | }, 681 | { 682 | "cell_type": 
"markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "Each row holds the embedding vector of one named entity, and each column is one dimension of the embedding space. \n", 686 | "We'll now cluster the embeddings using k-means, a simple clustering algorithm." 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 9, 692 | "metadata": {}, 693 | "outputs": [ 694 | { 695 | "data": { 696 | "text/html": [ 697 | "
\n", 698 | "\n", 711 | "\n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | "
\n", 747 | "
" 748 | ], 749 | "text/plain": [ 750 | " named_entity cluster\n", 751 | "0 45 2\n", 752 | "1 41 3\n", 753 | "2 478 2\n", 754 | "3 88 1\n", 755 | "4 83 1" 756 | ] 757 | }, 758 | "execution_count": 9, 759 | "metadata": {}, 760 | "output_type": "execute_result" 761 | } 762 | ], 763 | "source": [ 764 | "NUM_CLUSTERS = 10\n", 765 | "\n", 766 | "kmeans = KMeans(n_clusters=NUM_CLUSTERS)\n", 767 | "kmeans.fit(emb_df)\n", 768 | "labels = kmeans.predict(emb_df)\n", 769 | "emb_df['cluster'] = labels\n", 770 | "clusters_df = emb_df.reset_index()[['named_entity','cluster']]\n", 771 | "clusters_df.head()" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "## 3. Prepare input for Gephi:" 779 | ] 780 | }, 781 | { 782 | "cell_type": "markdown", 783 | "metadata": {}, 784 | "source": [ 785 | "[Gephi](https://gephi.org) is a visualization tool for graph data. \n", 786 | "We'll write our nodes, their clusters, and the weighted edges to a GDF file, a format Gephi can read." 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": 10, 792 | "metadata": { 793 | "collapsed": true 794 | }, 795 | "outputs": [], 796 | "source": [ 797 | "id_to_named_entity = {named_entity_id: named_entity\n", 798 | " for named_entity, named_entity_id in tokenizer.items()}\n", 799 | "\n", 800 | "with open('clusters.gdf', 'w') as f:\n", 801 | " f.write('nodedef>name VARCHAR,cluster_id VARCHAR,label VARCHAR\\n')\n", 802 | " for index, row in clusters_df.iterrows():\n", 803 | " f.write('{},{},{}\\n'.format(row['named_entity'], row['cluster'], id_to_named_entity[row['named_entity']]))\n", 804 | " f.write('edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\\n')\n", 805 | " for index, row in edges_df.iterrows(): \n", 806 | " f.write('{},{},{}\\n'.format(row['named_entity_1'], row['named_entity_2'], row['weight']))" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "Finally, we can open `clusters.gdf` using Gephi in order to
inspect the clusters." 814 | ] 815 | } 816 | ], 817 | "metadata": { 818 | "kernelspec": { 819 | "display_name": "Python 2", 820 | "language": "python", 821 | "name": "python2" 822 | }, 823 | "language_info": { 824 | "codemirror_mode": { 825 | "name": "ipython", 826 | "version": 2 827 | }, 828 | "file_extension": ".py", 829 | "mimetype": "text/x-python", 830 | "name": "python", 831 | "nbconvert_exporter": "python", 832 | "pygments_lexer": "ipython2", 833 | "version": "2.7.13" 834 | } 835 | }, 836 | "nbformat": 4, 837 | "nbformat_minor": 2 838 | } 839 | --------------------------------------------------------------------------------