├── LICENSE
├── named_entities.csv
└── node2vec.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/named_entities.csv:
--------------------------------------------------------------------------------
1 | named_entities
2 | "basketball,Kobe Bryant"
3 | "basketball,Lebron James"
4 |
--------------------------------------------------------------------------------
/node2vec.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This notebook accompanies the blog post https://engineering.taboola.com/think-your-data-different."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import pandas as pd\n",
17 | "import numpy as np\n",
18 | "import itertools\n",
19 | "from sklearn.cluster import KMeans\n",
20 | "import pprint"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## 1. Prepare input for node2vec\n",
28 | "We'll use a CSV file where each row represents a single recommendable item: it contains a comma-separated list of the named entities that appear in the item's title."
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 2,
34 | "metadata": {},
35 | "outputs": [
36 | {
37 | "data": {
85 | "text/plain": [
86 | " named_entities\n",
87 | "0 CONCEPT-certification mark,CONCEPT-i swear,CON...\n",
88 | "1 CONCEPT-middle school,CONCEPT-gun,CONCEPT-scho...\n",
89 | "2 Facility-rush university medical center,CONCEP...\n",
90 | "3 CONCEPT-web browser\n",
91 | "4 CONCEPT-types of companies,Person-saquon barkl..."
92 | ]
93 | },
94 | "execution_count": 2,
95 | "metadata": {},
96 | "output_type": "execute_result"
97 | }
98 | ],
99 | "source": [
100 | "named_entities_df = pd.read_csv('named_entities.csv')\n",
101 | "named_entities_df.head()"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "First, we'll have to tokenize the named entities, since `node2vec` expects integer node IDs."
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": 3,
114 | "metadata": {},
115 | "outputs": [
116 | {
117 | "data": {
165 | "text/plain": [
166 | " named_entities\n",
167 | "0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n",
168 | "1 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2...\n",
169 | "2 [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 3...\n",
170 | "3 [41]\n",
171 | "4 [42, 43, 44, 45, 46, 9]"
172 | ]
173 | },
174 | "execution_count": 3,
175 | "metadata": {},
176 | "output_type": "execute_result"
177 | }
178 | ],
179 | "source": [
180 | "tokenizer = dict()\n",
181 | "named_entities_df['named_entities'] = named_entities_df['named_entities'].apply(\n",
182 | " lambda named_entities: [tokenizer.setdefault(named_entity, len(tokenizer))\n",
183 | " for named_entity in named_entities.split(',')])\n",
184 | "named_entities_df.head()"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 4,
190 | "metadata": {},
191 | "outputs": [
192 | {
193 | "name": "stdout",
194 | "output_type": "stream",
195 | "text": [
196 | "{'CONCEPT-gal gadot': 20918,\n",
197 | " 'CONCEPT-irish singles chart number one singles': 59693,\n",
198 | " 'CONCEPT-tarantula': 83904,\n",
199 | " 'Organization-ohio republican party': 93001,\n",
200 | " 'Person-billy donovan': 32857}\n"
201 | ]
202 | }
203 | ],
204 | "source": [
205 | "pprint.pprint(dict(list(tokenizer.items())[:5]))"
206 | ]
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "In order to construct the graph on which we'll run node2vec, we first need to understand which named entities appear together."
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 5,
218 | "metadata": {},
219 | "outputs": [
220 | {
221 | "data": {
275 | "text/plain": [
276 | " named_entity_1 named_entity_2\n",
277 | "0 0 1\n",
278 | "1 0 2\n",
279 | "2 0 3\n",
280 | "3 0 4\n",
281 | "4 0 5"
282 | ]
283 | },
284 | "execution_count": 5,
285 | "metadata": {},
286 | "output_type": "execute_result"
287 | }
288 | ],
289 | "source": [
290 | "pairs_df = named_entities_df['named_entities'].apply(lambda named_entities: list(itertools.combinations(named_entities, 2)))\n",
291 | "pairs_df = pairs_df[pairs_df.apply(len) > 0]\n",
292 | "pairs_df = pd.DataFrame(np.concatenate(pairs_df.values), columns=['named_entity_1', 'named_entity_2'])\n",
293 | "pairs_df.head()"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "Now we can construct the graph. The weight of an edge connecting two named entities will be the number of times these named entities appear together in our dataset."
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 6,
306 | "metadata": {},
307 | "outputs": [
308 | {
309 | "data": {
369 | "text/plain": [
370 | " named_entity_1 named_entity_2 weight\n",
371 | "49 3 9 34\n",
372 | "988 9 41 1142\n",
373 | "1275 11 127 31\n",
374 | "1281 11 134 35\n",
375 | "1290 11 149 61"
376 | ]
377 | },
378 | "execution_count": 6,
379 | "metadata": {},
380 | "output_type": "execute_result"
381 | }
382 | ],
383 | "source": [
384 | "NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD = 25\n",
385 | "\n",
386 | "edges_df = pairs_df.groupby(['named_entity_1', 'named_entity_2']).size().reset_index(name='weight')\n",
387 | "edges_df = edges_df[edges_df['weight'] > NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD]\n",
388 | "edges_df[['named_entity_1', 'named_entity_2', 'weight']].to_csv('edges.csv', header=False, index=False, sep=' ')\n",
389 | "edges_df.head()"
390 | ]
391 | },
392 | {
393 | "cell_type": "markdown",
394 | "metadata": {},
395 | "source": [
396 | "Next, we'll run `node2vec`, which writes the resulting embeddings to a file called `emb`. \n",
397 | "We'll use the open source implementation developed by [Stanford](https://github.com/snap-stanford/snap/tree/master/examples/node2vec)."
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 7,
403 | "metadata": {},
404 | "outputs": [
405 | {
406 | "name": "stdout",
407 | "output_type": "stream",
408 | "text": [
409 | "Walk iteration:\n",
410 | "1 / 10\n",
411 | "2 / 10\n",
412 | "3 / 10\n",
413 | "4 / 10\n",
414 | "5 / 10\n",
415 | "6 / 10\n",
416 | "7 / 10\n",
417 | "8 / 10\n",
418 | "9 / 10\n",
419 | "10 / 10\n"
420 | ]
421 | }
422 | ],
423 | "source": [
424 | "!python node2vec/src/main.py --input edges.csv --output emb --weighted"
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "## 2. Read embeddings and run KMeans clustering:"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 8,
437 | "metadata": {},
438 | "outputs": [
439 | {
440 | "data": {
633 | "text/plain": [
634 | " 1 2 3 4 5 6 \\\n",
635 | "named_entity \n",
636 | "45 0.193684 0.199515 -0.558070 0.193501 -0.151151 -0.108368 \n",
637 | "41 0.116208 -0.013772 0.270675 0.227480 -0.123978 -0.076915 \n",
638 | "478 0.326508 -0.080868 -0.534134 0.137786 -0.262377 -0.071972 \n",
639 | "88 -0.053936 -0.098514 -0.116975 0.194783 -0.127855 0.310879 \n",
640 | "83 0.013028 -0.122749 -0.029661 0.059336 -0.258743 0.397353 \n",
641 | "\n",
642 | " 7 8 9 10 ... 119 \\\n",
643 | "named_entity ... \n",
644 | "45 -0.080395 0.483877 -0.216687 -0.027689 ... -0.020264 \n",
645 | "41 -0.080015 0.338822 0.007791 -0.028516 ... -0.250689 \n",
646 | "478 -0.187409 0.533022 -0.314909 -0.019874 ... -0.160482 \n",
647 | "88 -0.050054 -0.002542 0.094705 -0.104536 ... 0.025011 \n",
648 | "83 -0.082249 0.078653 0.102366 0.091354 ... 0.141847 \n",
649 | "\n",
650 | " 120 121 122 123 124 125 \\\n",
651 | "named_entity \n",
652 | "45 -0.219160 -0.006211 -0.116050 -0.208311 -0.238917 0.416022 \n",
653 | "41 -0.219996 -0.346024 0.006914 -0.185476 0.099120 0.231357 \n",
654 | "478 -0.192272 -0.132486 -0.058005 -0.182971 -0.201600 0.317926 \n",
655 | "88 -0.357876 -0.238409 0.247654 0.082463 -0.147044 0.153850 \n",
656 | "83 -0.456273 -0.119102 0.301741 0.072765 -0.035528 0.042997 \n",
657 | "\n",
658 | " 126 127 128 \n",
659 | "named_entity \n",
660 | "45 -0.069208 0.382213 -0.198407 \n",
661 | "41 0.326392 0.197053 -0.103405 \n",
662 | "478 0.059988 0.380023 -0.127033 \n",
663 | "88 -0.535327 -0.435655 0.259705 \n",
664 | "83 -0.511059 -0.263644 0.366281 \n",
665 | "\n",
666 | "[5 rows x 128 columns]"
667 | ]
668 | },
669 | "execution_count": 8,
670 | "metadata": {},
671 | "output_type": "execute_result"
672 | }
673 | ],
674 | "source": [
675 | "emb_df = pd.read_csv('emb', sep=' ', skiprows=[0], header=None)\n",
676 | "emb_df.set_index(0, inplace=True)\n",
677 | "emb_df.index.name = 'named_entity'\n",
678 | "emb_df.head()"
679 | ]
680 | },
681 | {
682 | "cell_type": "markdown",
683 | "metadata": {},
684 | "source": [
685 | "Each column is a dimension in the embedding space. Each row holds the embedding vector of one named entity. \n",
686 | "We'll now cluster the embeddings using a simple clustering algorithm such as k-means."
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": 9,
692 | "metadata": {},
693 | "outputs": [
694 | {
695 | "data": {
749 | "text/plain": [
750 | " named_entity cluster\n",
751 | "0 45 2\n",
752 | "1 41 3\n",
753 | "2 478 2\n",
754 | "3 88 1\n",
755 | "4 83 1"
756 | ]
757 | },
758 | "execution_count": 9,
759 | "metadata": {},
760 | "output_type": "execute_result"
761 | }
762 | ],
763 | "source": [
764 | "NUM_CLUSTERS = 10\n",
765 | "\n",
766 | "kmeans = KMeans(n_clusters=NUM_CLUSTERS)\n",
767 | "kmeans.fit(emb_df)\n",
768 | "labels = kmeans.predict(emb_df)\n",
769 | "emb_df['cluster'] = labels\n",
770 | "clusters_df = emb_df.reset_index()[['named_entity','cluster']]\n",
771 | "clusters_df.head()"
772 | ]
773 | },
774 | {
775 | "cell_type": "markdown",
776 | "metadata": {},
777 | "source": [
778 | "## 3. Prepare input for Gephi:"
779 | ]
780 | },
781 | {
782 | "cell_type": "markdown",
783 | "metadata": {},
784 | "source": [
785 | "[Gephi](https://gephi.org) is a visualization tool for graph data. \n",
786 | "We'll output our data into a format recognizable by Gephi."
787 | ]
788 | },
789 | {
790 | "cell_type": "code",
791 | "execution_count": 10,
792 | "metadata": {
793 | "collapsed": true
794 | },
795 | "outputs": [],
796 | "source": [
797 | "id_to_named_entity = {named_entity_id: named_entity\n",
798 | " for named_entity, named_entity_id in tokenizer.items()}\n",
799 | "\n",
800 | "with open('clusters.gdf', 'w') as f:\n",
801 | " f.write('nodedef>name VARCHAR,cluster_id VARCHAR,label VARCHAR\\n')\n",
802 | " for index, row in clusters_df.iterrows():\n",
803 | " f.write('{},{},{}\\n'.format(row['named_entity'], row['cluster'], id_to_named_entity[row['named_entity']]))\n",
804 | " f.write('edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\\n')\n",
805 | " for index, row in edges_df.iterrows(): \n",
806 | " f.write('{},{},{}\\n'.format(row['named_entity_1'], row['named_entity_2'], row['weight']))"
807 | ]
808 | },
809 | {
810 | "cell_type": "markdown",
811 | "metadata": {},
812 | "source": [
813 | "Finally, we can open `clusters.gdf` using Gephi in order to inspect the clusters."
814 | ]
815 | }
816 | ],
817 | "metadata": {
818 | "kernelspec": {
819 | "display_name": "Python 2",
820 | "language": "python",
821 | "name": "python2"
822 | },
823 | "language_info": {
824 | "codemirror_mode": {
825 | "name": "ipython",
826 | "version": 2
827 | },
828 | "file_extension": ".py",
829 | "mimetype": "text/x-python",
830 | "name": "python",
831 | "nbconvert_exporter": "python",
832 | "pygments_lexer": "ipython2",
833 | "version": "2.7.13"
834 | }
835 | },
836 | "nbformat": 4,
837 | "nbformat_minor": 2
838 | }
839 |
--------------------------------------------------------------------------------
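The graph-preparation steps in `node2vec.ipynb` (tokenize the entities, enumerate the pairs that share a row, count each pair as an edge weight) can be sketched without pandas. This is a minimal standard-library variant for illustration, not the notebook's own code. The function name `build_edges` and its `threshold` parameter are assumptions introduced here. The sample rows follow the two rows in `named_entities.csv`, with the first row repeated so that one edge gets a weight of 2.

```python
import itertools
from collections import Counter


def build_edges(rows, threshold=0):
    """Tokenize comma-separated entity rows and count pair co-occurrences.

    Hypothetical helper: returns the name -> id mapping and a sorted list of
    (id_1, id_2, weight) tuples whose weight is strictly above `threshold`.
    """
    tokenizer = {}
    weights = Counter()
    for row in rows:
        # Give each distinct entity the next free integer id (same trick as the notebook).
        ids = [tokenizer.setdefault(name, len(tokenizer)) for name in row.split(',')]
        # Each pair of entities in the same row co-occurs once. Pairs keep the
        # row's order, so (a, b) and (b, a) are counted separately.
        for pair in itertools.combinations(ids, 2):
            weights[pair] += 1
    edges = [(a, b, w) for (a, b), w in sorted(weights.items()) if w > threshold]
    return tokenizer, edges


rows = [
    "basketball,Kobe Bryant",
    "basketball,Lebron James",
    "basketball,Kobe Bryant",
]
tokenizer, edges = build_edges(rows)
print(tokenizer)
print(edges)
```

Raising `threshold` drops rare co-occurrences, which is the role the threshold of 25 plays in the notebook. Because pairs are not normalized, two entities that appear in different orders across rows yield two separate edges. Sorting the ids within each row before pairing would merge them, if that is the behavior you want.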