├── .gitignore
├── presentation.odp
├── presentation.pptx
├── requirements.txt
├── results
│   ├── ng_norm-hist.png
│   ├── ng_norm-tf.png
│   ├── no_ngram_norm-tf.png
│   ├── standard_norm-tf.png
│   ├── ng_norm-common-density.png
│   ├── words-top-right-cluster.csv
│   ├── words-bottom-left-cluster.csv
│   ├── words-middle-right-cluster.csv
│   ├── words-bottom-right-cluster.csv
│   ├── hypo_norm_rel_perc_diff.csv
│   └── ng-norm-density-hist.csv
├── README.md
└── main.py
/.gitignore:
--------------------------------------------------------------------------------
1 | /.idea
2 | /data
3 | 
--------------------------------------------------------------------------------
/presentation.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/presentation.odp
--------------------------------------------------------------------------------
/presentation.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/presentation.pptx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | gensim==3.7.3
2 | scipy==1.3.0
3 | seaborn==0.9.0
4 | numpy==1.16.4
5 | pandas==0.24.2
6 | matplotlib
7 | 
--------------------------------------------------------------------------------
/results/ng_norm-hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/ng_norm-hist.png
--------------------------------------------------------------------------------
/results/ng_norm-tf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/ng_norm-tf.png
--------------------------------------------------------------------------------
/results/no_ngram_norm-tf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/no_ngram_norm-tf.png
--------------------------------------------------------------------------------
/results/standard_norm-tf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/standard_norm-tf.png
--------------------------------------------------------------------------------
/results/ng_norm-common-density.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/ng_norm-common-density.png
--------------------------------------------------------------------------------
/results/words-top-right-cluster.csv:
--------------------------------------------------------------------------------
1 | position
2 | wonderful
3 | shooting
4 | switch
5 | â
6 | Atlantic
7 | ladies
8 | vegetables
9 | tourist
10 | HERE
11 | prescription
12 | upgraded
13 | Evil
14 | 
--------------------------------------------------------------------------------
/results/words-bottom-left-cluster.csv:
--------------------------------------------------------------------------------
1 | now
2 | three
3 | month
4 | News
5 | Big
6 | picked
7 | votes
8 | signature
9 | Challenge
10 | Short
11 | trick
12 | Lots
13 | 68
14 | priorities
15 | upgrades
16 | 
--------------------------------------------------------------------------------
/results/words-middle-right-cluster.csv:
--------------------------------------------------------------------------------
1 | via
2 | companies
3 | necessary
4 | straight
5 | menu
6 | kinds
7 | Championship
8 | relief
9 | periods
10 | Prize
11 | minimal
12 | Rated
13 | 83
14 | wears
15 | Tiger
16 | 
--------------------------------------------------------------------------------
/results/words-bottom-right-cluster.csv:
--------------------------------------------------------------------------------
1 | our
2 | home
3 | game
4 | won
5 | control
6 | law
7 | common
8 | Street
9 | speed
10 | Tuesday
11 | direct
12 | helped
13 | passed
14 | condition
15 | Date
16 | signed
17 | Government
18 | flight
19 | cheap
20 | foreign
21 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # FastText Vector Norms And OOV Words
2 | 
3 | 
4 | # Summary
5 | 
6 | Word embeddings trained on large unlabeled corpora are useful for many natural language processing tasks. FastText [(Bojanowski et al., 2016)](https://arxiv.org/abs/1607.04606), in contrast to the Word2vec model, accounts for sub-word information by also embedding sub-word n-grams. A FastText word representation is the word's own embedding vector plus the sum of the vectors of the n-grams contained in it.
7 | Word2vec vector norms have been shown [(Schakel & Wilson, 2015)](http://arxiv.org/abs/1508.02297) to be correlated with word significance. This blog post visualizes the vector norms of the FastText embedding and evaluates the use of the FastText word vector norm multiplied by the number of the word's n-grams for detecting non-English OOV words.
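The representation and the norm studied here can be reproduced with the same gensim API that main.py relies on. A minimal sketch, assuming gensim 3.7.3 as pinned in requirements.txt and assuming the model file has been downloaded to the path used in main.py:

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors
from gensim.models.utils_any2vec import ft_ngram_hashes

# Assumption: cc.en.300.bin is available locally (see main.py for the path used there).
wv = load_facebook_vectors('data/input/cc.en.300.bin')

word = 'inflationlithium'  # OOV example word taken from main.py
hashes = ft_ngram_hashes(word, wv.min_n, wv.max_n, wv.bucket, wv.compatible_hash)
vec = wv.vectors_ngrams[hashes].sum(axis=0)  # sum of the word's n-gram vectors

# ng_norm evaluated in this experiment: the norm of the n-gram sum,
# which equals ngram_count * norm(mean n-gram vector).
print(len(hashes), np.linalg.norm(vec))
```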
8 | 
9 | - [Read the full description of this experiment on FastText OOV on my blog and ask questions or subscribe](https://vaclavkosar.com/ml/FastText-Vector-Norms-And-OOV-Words)
10 | - [The entire code for this post is available in this repository in the file "main.py"](https://github.com/vackosar/fasttext-vector-norms-and-oov-words/blob/master/main.py)
11 | - [Continue: StarSpace - general-purpose embeddings inspired by FastText](https://vaclavkosar.com/ml/starspace-embedding)
--------------------------------------------------------------------------------
/results/hypo_norm_rel_perc_diff.csv:
--------------------------------------------------------------------------------
1 | hyper,hypo,standard_norm,no_ngram_norm,ng_norm,count
2 | month,January,-22.559890151023865,12.179197371006012,29.06685471534729,76.9945483694108
3 | month,February,-34.532856941223145,13.354693353176117,30.93428909778595,57.40424788649045
4 | month,March,21.790121495723724,8.177371323108673,21.79012894630432,91.52572109809037
5 | month,April,25.993046164512634,10.371281206607819,25.993049144744873,86.94309311928393
6 | month,May,247.45163917541504,6.607942283153534,15.817219018936157,219.5778495723792
7 | month,June,86.63637638092041,9.665937721729279,24.424254894256592,80.36360749962446
8 | month,July,93.21904182434082,12.777550518512726,28.812697529792786,71.60087186170838
9 | month,August,-4.813988506793976,11.601139605045319,26.91468596458435,56.87035804514697
10 | month,September,-44.985681772232056,12.394984811544418,28.366747498512268,61.35298972257148
11 | month,October,-21.94921225309372,12.07357794046402,30.084648728370667,64.15855587788923
12 | month,November,-35.144105553627014,13.222669064998627,29.71179187297821,55.42349838944575
13 | month,December,-34.639644622802734,12.90571391582489,30.72071075439453,59.169546969422356
14 | color,red,214.2678737640381,-2.4428382515907288,4.7559455037117,-14.315463389067611
15 | color,blue,44.778406620025635,-5.8992899954319,-3.481072559952736,-40.68353126458778
16 | color,green,-16.087377071380615,-4.437129572033882,-16.08739197254181,-30.0291182166378
17 | color,white,-3.950345516204834,-4.10078652203083,-3.95035520195961,24.457166798508602
18 | color,orange,-19.920025765895844,1.365382969379425,6.773289293050766,-80.10268794209723
19 | color,purple,-25.538989901542664,-3.007206879556179,-0.7186640985310078,-87.57766504883688
20 | color,black,-5.428289994597435,-3.7266232073307037,-5.428304523229599,26.119314060631037
21 | color,pink,61.684030294418335,1.4097620733082294,7.789343595504761,-74.93923364878756
22 | color,yellow,-25.193437933921814,1.5528187155723572,-0.25792764499783516,-71.49442220962484
23 | color,cyan,87.80776262283325,18.8669815659523,25.205162167549133,-99.41605573320909
24 | color,violet,-28.96551787853241,10.081180185079575,-5.287368223071098,-98.38465018973534
25 | color,grey,39.793407917022705,-0.5296188872307539,-6.80440366268158,-86.86721598188748
26 | animal,dog,252.45609283447266,-6.00099042057991,-11.885977536439896,76.01154405541592
27 | animal,cat,234.5402717590332,-6.204253435134888,-16.36492908000946,-22.22957017107758
28 | animal,bird,85.8738124370575,2.5565512478351593,-7.063093036413193,-45.238106374812375
29 | animal,reptile,-22.501929104328156,15.213459730148315,-3.1274113804101944,-97.98246182931965
30 | animal,fish,75.87220072746277,2.111702412366867,-12.063899636268616,22.17952626178609
31 | animal,cow,267.8441286087036,7.508818805217743,-8.038965612649918,-82.95581896832819
32 | 
animal,insect,-7.264796644449234,7.887063175439835,-7.264796644449234,-90.30017980186985 33 | animal,fly,259.20159816741943,-6.115291640162468,-10.199600458145142,-24.141623498716452 34 | animal,mammal,3.3455990254879,16.28085821866989,3.3455990254879,-96.89425529023083 35 | tool,hammer,-60.361552238464355,5.061610043048859,-20.72310447692871,-88.92333982707078 36 | tool,screwdriver,-75.43929815292358,33.55134129524231,10.523150116205215,-97.63942241985984 37 | tool,drill,-43.53181719779968,11.923173069953918,-15.297724306583405,-85.13255472071457 38 | tool,handsaw,-49.96241331100464,76.5534520149231,25.093963742256165,-99.87315610920845 39 | tool,knife,-37.43588626384735,20.666681230068207,-6.153828650712967,-75.10034918456768 40 | tool,wrench,-51.09402537345886,26.54104232788086,-2.1880509331822395,-96.23042980088515 41 | tool,pliers,-45.3825443983078,50.95037817955017,9.23491045832634,-98.40439012603827 42 | fruit,banana,-22.01044261455536,-1.040846575051546,3.986077383160591,-81.24529080040269 43 | fruit,apple,-3.1680751591920853,-1.2982229702174664,-3.1680796295404434,-56.6124435389897 44 | fruit,pear,59.400397539138794,5.798259750008583,6.266932189464569,-93.56599616301408 45 | fruit,peach,-3.1049944460392,-7.888755947351456,-3.105001151561737,-91.25212669185491 46 | fruit,orange,-27.72880494594574,-11.572788655757904,-3.638404980301857,-36.93354726429143 47 | fruit,pineapple,-55.38869500160217,2.2534646093845367,4.093045741319656,-91.03878920728545 48 | fruit,lemon,7.380922883749008,0.14850908191874623,7.38091841340065,-65.35893663637096 49 | fruit,pomegranate,-63.00467848777771,3.623020276427269,10.985970497131348,-97.13937722324889 50 | fruit,grape,5.267916992306709,6.485439836978912,5.267921462655067,-88.9861233404569 51 | fruit,strawberries,-65.2907133102417,4.979605600237846,15.697626769542694,-88.7931709018528 52 | flower,peony,51.811230182647705,15.8254474401474,13.858413696289062,-98.44909184269798 53 | flower,rose,146.12656831741333,-2.1894115954637527,23.063285648822784,16.749135146156082 54 | flower,lily,108.58210325241089,7.221601158380508,4.2910512536764145,-93.18192151636643 55 | flower,tulip,49.614036083221436,14.132213592529297,12.210532277822495,-96.02838754717223 56 | flower,sunflower,-34.75457727909088,9.156723320484161,14.179478585720062,-90.75112293025342 57 | flower,marigold,-25.294849276542664,9.12274420261383,12.057727575302124,-99.08452695857119 58 | flower,orchid,10.703951865434647,7.983660697937012,10.703951865434647,-93.58325297140551 59 | tree,pine,9.169016778469086,4.525832831859589,9.169016778469086,-86.72316207320807 60 | tree,pear,20.91183215379715,11.848663538694382,20.91183215379715,-95.66247039113979 61 | tree,maple,-12.181924283504486,19.677509367465973,31.727102398872375,-90.921095628226 62 | tree,oak,155.62005043029785,15.57936817407608,27.810028195381165,-85.23922111925685 63 | tree,aspen,-16.014888882637024,16.366398334503174,25.97767412662506,-99.31872684331077 64 | tree,spruce,-32.47937858104706,5.101566016674042,35.041239857673645,-97.13105671310633 65 | tree,larch,-4.0693119168281555,22.811006009578705,43.89603137969971,-99.65749381620648 66 | tree,linden,-46.235501766204834,23.560859262943268,7.528995722532272,-99.73155855602536 67 | tree,juniper,-57.9480767250061,14.995041489601135,5.129804089665413,-99.00291744925768 68 | tree,birch,-20.747947692871094,14.30957019329071,18.878066539764404,-97.59187554218126 69 | tree,elm,196.46062850952148,20.48897296190262,48.2303112745285,-98.9773275489597 70 | 
average,,25.257317321922848,9.337584405404735,9.726516508004245,-44.851693209240054 71 | counts,,42.64705882352941,77.94117647058823,66.17647058823529, 72 | counts selected,,42.64705882352941,77.94117647058823,66.17647058823529, 73 | -------------------------------------------------------------------------------- /results/ng-norm-density-hist.csv: -------------------------------------------------------------------------------- 1 | density,ng_norm 2 | 0.0,0.021739130434782608 3 | 0.0,0.06521739130434782 4 | 0.0,0.10869565217391304 5 | 0.0,0.15217391304347824 6 | 0.0,0.19565217391304346 7 | 0.0,0.2391304347826087 8 | 0.0,0.28260869565217395 9 | 0.0,0.32608695652173914 10 | 0.0,0.3695652173913043 11 | 0.0,0.4130434782608695 12 | 0.0,0.4565217391304348 13 | 0.0,0.5000000000000001 14 | 0.0,0.5434782608695654 15 | 0.0,0.5869565217391306 16 | 0.0,0.6304347826086958 17 | 0.0,0.673913043478261 18 | 0.0,0.7173913043478263 19 | 0.0,0.7608695652173914 20 | 0.0,0.8043478260869567 21 | 0.0,0.847826086956522 22 | 0.0,0.8913043478260871 23 | 0.0,0.9347826086956526 24 | 0.0,0.9782608695652177 25 | 0.029745022275322604,1.0217391304347831 26 | 0.0,1.0652173913043483 27 | 0.0,1.1086956521739135 28 | 0.0,1.1521739130434787 29 | 0.0,1.195652173913044 30 | 0.0,1.2391304347826093 31 | 0.017818040940899414,1.2826086956521745 32 | 0.01685360955024037,1.3260869565217397 33 | 0.030744854956846052,1.3695652173913049 34 | 0.054487934071828927,1.41304347826087 35 | 0.038852790157855005,1.456521739130435 36 | 0.048876874899388995,1.5 37 | 0.02303444537165591,1.5434782608695652 38 | 0.04253199600714571,1.5869565217391304 39 | 0.05088508678532239,1.6304347826086958 40 | 0.06863009605820303,1.6739130434782612 41 | 0.09945226146254896,1.7173913043478262 42 | 0.1190031249365472,1.7608695652173914 43 | 0.13667132373784924,1.8043478260869565 44 | 0.12125987330468127,1.8478260869565215 45 | 0.22090784600384045,1.8913043478260865 46 | 0.2395107597547419,1.9347826086956517 47 | 0.25932458876772685,1.9782608695652166 48 | 0.29461419518552434,2.021739130434782 49 | 0.44009275138709064,2.065217391304347 50 | 0.5988659381170858,2.108695652173912 51 | 0.5217400800936571,2.1521739130434776 52 | 0.7479918294747602,2.195652173913043 53 | 0.8052760982827616,2.239130434782608 54 | 1.2067470806150369,2.282608695652173 55 | 1.241471076965556,2.3260869565217384 56 | 1.1326357447550428,2.3695652173913038 57 | 1.226660243147753,2.413043478260869 58 | 1.2675036866714802,2.456521739130434 59 | 1.377630912106951,2.499999999999999 60 | 1.3524228496476398,2.5434782608695645 61 | 1.1204424508920712,2.58695652173913 62 | 0.9765901659390295,2.6304347826086953 63 | 0.8819719033203989,2.6739130434782603 64 | 0.7596683812373868,2.7173913043478253 65 | 0.8139558979250324,2.7608695652173907 66 | 0.6012373281823025,2.8043478260869557 67 | 0.5350413837621657,2.8478260869565206 68 | 0.4493961552700747,2.8913043478260856 69 | 0.4106285023910703,2.9347826086956506 70 | 0.32469622798928155,2.978260869565216 71 | 0.2911248648326743,3.021739130434781 72 | 0.2548415142849414,3.065217391304346 73 | 0.2171309046935094,3.1086956521739113 74 | 0.1833797974179618,3.152173913043476 75 | 0.1375414762525093,3.1956521739130412 76 | 0.1255213172844463,3.239130434782606 77 | 0.12119496476901685,3.282608695652171 78 | 0.07265723441974399,3.3260869565217366 79 | 0.08986372209492456,3.369565217391301 80 | 0.06641048119613192,3.4130434782608665 81 | 0.0731763418728756,3.4565217391304315 82 | 0.05616108497321689,3.4999999999999964 83 | 0.037323420706538546,3.543478260869562 84 | 
0.04064077891412763,3.5869565217391264 85 | 0.033695208195448086,3.630434782608692 86 | 0.02390649713951644,3.6739130434782568 87 | 0.030977147407938492,3.7173913043478217 88 | 0.0295033405621686,3.760869565217387 89 | 0.012616395028748856,3.804347826086952 90 | 0.019076422588284705,3.847826086956517 91 | 0.01705417690734295,3.891304347826082 92 | 0.004895226292839087,3.934782608695647 93 | 0.010311266186646455,3.9782608695652124 94 | 0.016140389522706813,4.021739130434778 95 | 0.007586599771621939,4.065217391304342 96 | 0.009838574058438545,4.108695652173907 97 | 0.01238466935920343,4.152173913043473 98 | 0.00220167759281762,4.195652173913039 99 | 0.009129002002053575,4.239130434782604 100 | 0.012038292535885977,4.282608695652169 101 | 0.005062982514948528,4.326086956521735 102 | 0.007954520994995077,4.369565217391299 103 | 0.011174967325071882,4.413043478260865 104 | 0.01193439425847806,4.45652173913043 105 | 0.0031025123933556753,4.499999999999995 106 | 0.0032763920058489675,4.54347826086956 107 | 0.0033795437517031915,4.5869565217391255 108 | 0.0036110916991590855,4.630434782608691 109 | 0.011108286972685205,4.673913043478256 110 | 0.0,4.717391304347822 111 | 0.0,4.760869565217387 112 | 0.0,4.804347826086952 113 | 0.0,4.847826086956517 114 | 0.0,4.8913043478260825 115 | 0.004889592702792756,4.934782608695647 116 | 0.0051379348014483515,4.978260869565212 117 | 0.005292785157127927,5.021739130434778 118 | 0.011044944120346204,5.065217391304342 119 | 0.0,5.108695652173907 120 | 0.0,5.152173913043473 121 | 0.006250520056716423,5.195652173913039 122 | 0.0,5.239130434782604 123 | 0.0,5.282608695652169 124 | 0.01377153477991724,5.326086956521735 125 | 0.0,5.369565217391299 126 | 0.007483152016085931,5.413043478260865 127 | 0.0,5.45652173913043 128 | 0.0,5.499999999999995 129 | 0.0,5.54347826086956 130 | 0.008252566330521835,5.5869565217391255 131 | 0.0,5.63043478260869 132 | 0.0,5.6739130434782545 133 | 0.0,5.71739130434782 134 | 0.0,5.760869565217385 135 | 0.00939919779027535,5.80434782608695 136 | 0.0,5.847826086956515 137 | 0.010181737680513952,5.891304347826081 138 | 0.0,5.934782608695645 139 | 0.0,5.978260869565211 140 | 0.0,6.021739130434776 141 | 0.0,6.065217391304341 142 | 0.0,6.108695652173905 143 | 0.0,6.152173913043471 144 | 0.0,6.195652173913036 145 | 0.0,6.2391304347826 146 | 0.0,6.282608695652166 147 | 0.0,6.326086956521731 148 | 0.0,6.369565217391296 149 | 0.0,6.413043478260861 150 | 0.01564634813912194,6.456521739130427 151 | 0.0,6.499999999999991 152 | 0.0,6.5434782608695565 153 | 0.017760377377983885,6.586956521739122 154 | 0.0,6.6304347826086865 155 | 0.0,6.673913043478251 156 | 0.0,6.717391304347816 157 | 0.0,6.760869565217382 158 | 0.0,6.804347826086946 159 | 0.0,6.847826086956512 160 | 0.0,6.891304347826077 161 | 0.0,6.934782608695642 162 | 0.0,6.978260869565207 163 | 0.0,7.0217391304347725 164 | 0.0,7.065217391304337 165 | 0.0,7.1086956521739015 166 | 0.0,7.152173913043468 167 | 0.02667293613510095,7.195652173913032 168 | 0.0,7.239130434782597 169 | 0.0,7.282608695652162 170 | 0.0,7.326086956521728 171 | 0.0,7.369565217391292 172 | 0.0,7.413043478260858 173 | 0.0,7.456521739130423 174 | 0.0,7.499999999999988 175 | 0.0,7.543478260869553 176 | 0.0,7.586956521739118 177 | 0.0,7.630434782608683 178 | 0.0,7.673913043478247 179 | 0.0,7.717391304347813 180 | 0.0,7.760869565217378 181 | 0.0,7.804347826086943 182 | 0.0,7.847826086956508 183 | 0.0,7.891304347826074 184 | 0.0,7.934782608695638 185 | 0.0,7.9782608695652035 186 | 0.0,8.021739130434769 187 | 0.0,8.065217391304333 188 
| 0.0,8.1086956521739 189 | 0.0,8.152173913043463 190 | 0.0,8.195652173913029 191 | 0.0,8.239130434782595 192 | 0.0,8.282608695652161 193 | 0.0,8.326086956521728 194 | 0.0,8.369565217391292 195 | 0.0,8.413043478260857 196 | 0.0,8.456521739130423 197 | 0.0,8.499999999999988 198 | 0.0,8.543478260869552 199 | 0.0,8.586956521739118 200 | 0.0,8.630434782608683 201 | 0.0,8.673913043478247 202 | 0.0,8.717391304347814 203 | 0.0,8.760869565217378 204 | 0.0,8.804347826086943 205 | 0.0,8.847826086956509 206 | 0.07341747972972602,8.891304347826074 207 | 0.0,8.934782608695638 208 | 0.0,8.978260869565204 209 | 0.0,9.021739130434769 210 | 0.0,9.065217391304333 211 | 0.0,9.1086956521739 212 | 0.0,9.152173913043463 213 | 0.0,9.195652173913029 214 | 0.0,9.239130434782595 215 | 0.0,9.282608695652161 216 | 0.0,9.326086956521728 217 | 0.0,9.369565217391292 218 | 0.0,9.413043478260857 219 | 0.0,9.456521739130423 220 | 0.0,9.499999999999988 221 | 0.0,9.543478260869552 222 | 0.0,9.586956521739118 223 | 0.0,9.630434782608683 224 | 0.0,9.673913043478247 225 | 0.0,9.717391304347814 226 | 0.0,9.760869565217378 227 | 0.0,9.804347826086943 228 | 0.0,9.847826086956509 229 | 0.0,9.891304347826074 230 | 0.0,9.934782608695638 231 | 0.0,9.978260869565204 232 | 0.0,10.021739130434769 233 | 0.0,10.065217391304333 234 | 0.0,10.1086956521739 235 | 0.0,10.152173913043463 236 | 0.0,10.195652173913029 237 | 0.0,10.239130434782595 238 | 0.0,10.282608695652161 239 | 0.0,10.326086956521728 240 | 0.0,10.369565217391292 241 | 0.0,10.413043478260857 242 | 0.0,10.456521739130423 243 | 0.0,10.499999999999988 244 | 0.0,10.543478260869552 245 | 0.0,10.586956521739118 246 | 0.0,10.630434782608683 247 | 0.0,10.673913043478247 248 | 0.0,10.717391304347814 249 | 0.0,10.760869565217378 250 | 0.0,10.804347826086943 251 | 0.0,10.847826086956509 252 | 0.0,10.891304347826074 253 | 0.0,10.934782608695638 254 | 0.0,10.978260869565204 255 | 0.0,11.021739130434769 256 | 0.0,11.065217391304333 257 | 0.0,11.1086956521739 258 | 0.0,11.152173913043463 259 | 0.0,11.195652173913029 260 | 0.0,11.239130434782595 261 | 0.0,11.282608695652158 262 | 0.0,11.326086956521724 263 | 0.0,11.369565217391289 264 | 0.0,11.413043478260853 265 | 0.0,11.45652173913042 266 | 0.0,11.499999999999984 267 | 0.0,11.543478260869549 268 | 0.0,11.586956521739115 269 | 0.0,11.63043478260868 270 | 0.0,11.673913043478244 271 | 0.0,11.71739130434781 272 | 0.0,11.760869565217375 273 | 0.0,11.80434782608694 274 | 0.0,11.847826086956506 275 | 0.0,11.89130434782607 276 | 0.0,11.934782608695635 277 | 0.0,11.9782608695652 278 | 0.0,12.021739130434765 279 | 0.0,12.06521739130433 280 | 0.0,12.108695652173896 281 | 0.0,12.152173913043459 282 | 0.0,12.195652173913025 283 | 0.0,12.239130434782592 284 | 0.0,12.282608695652158 285 | 0.0,12.326086956521724 286 | 0.0,12.369565217391289 287 | 0.0,12.413043478260853 288 | 0.0,12.45652173913042 289 | 0.0,12.499999999999984 290 | 0.0,12.543478260869549 291 | 0.0,12.586956521739115 292 | 0.0,12.63043478260868 293 | 0.0,12.673913043478244 294 | 0.0,12.71739130434781 295 | 0.0,12.760869565217375 296 | 0.0,12.80434782608694 297 | 0.0,12.847826086956506 298 | 0.0,12.89130434782607 299 | 0.0,12.934782608695635 300 | 0.0,12.9782608695652 301 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | from typing import Callable, Tuple 2 | 3 | import matplotlib.pyplot as plt 4 | import numpy as np 5 | import pandas as pd 6 | import seaborn 7 | from 
gensim.models.fasttext import load_facebook_vectors 8 | from gensim.models.utils_any2vec import ft_ngram_hashes 9 | from gensim.test.utils import datapath 10 | from matplotlib.axes import Axes 11 | from matplotlib.figure import Figure 12 | from numpy import linalg as LA 13 | from numpy import ndarray 14 | from scipy.optimize import leastsq 15 | from scipy.stats import t 16 | 17 | #%% load model from disk 18 | 19 | # FIXME use working dir 20 | cap_path = datapath("/home/vackosar/src/fasttext-vector-norms-and-oov-words/data/input/cc.en.300.bin") 21 | # fb_model = load_facebook_model(cap_path) 22 | wv = load_facebook_vectors(cap_path) 23 | wv.init_sims() 24 | print(f'model: maxn: {wv.max_n}, minn {wv.min_n}, vocab size: {len(wv.vectors_vocab)}') 25 | 26 | 27 | #%% shared methods 28 | tf_label = 'tf (Fasttext Model Word Count)' 29 | ng_norm_label = 'ng_norm = ngram_count * ngram_only_norm i.e. (only sub-ngrams used)' 30 | mit_10k_common_label = 'MIT 10k Common Words' 31 | fasttext_model_vocab_label = 'Fasttext Model 2M Vocabulary Words' 32 | 33 | 34 | def select_word_index(norms: ndarray, tfs: ndarray, min_count: int, max_count: int, min_norm: float, max_norm: float) -> int: 35 | return select_word_indexes(norms, tfs, min_count, max_count, min_norm, max_norm)[0][0] 36 | 37 | 38 | def select_word_indexes(norms: ndarray, tfs: ndarray, min_count: int, max_count: int, min_norm: float, max_norm: float) -> ndarray: 39 | mask = (max_norm > norms) & (norms > min_norm) & (max_count > tfs) & (tfs > min_count) 40 | idxs = np.argwhere(mask) 41 | if len(idxs) == 0: 42 | raise ValueError(f'Not found {min_count}-{max_count}, {min_norm}-{max_norm}.') 43 | 44 | else: 45 | print(f'idxs found: {idxs[:5]}') 46 | return idxs 47 | 48 | 49 | def read_mit_10k_words(): 50 | words: ndarray = pd.read_csv('data/input/mit-10k-words.csv', header=None, names=['word'])['word'].values 51 | return words 52 | 53 | 54 | def common_words_norms(get_vec: Callable): 55 | words = read_mit_10k_words() 56 | norms = [] 57 | tfs = [] 58 | for i, word in enumerate(words): 59 | if word in wv.vocab: 60 | vocab_word_ = wv.vocab[word] 61 | norms.append(LA.norm(get_vec(word))) 62 | # norms.append(LA.norm(wv.vectors_vocab[vocab_word_.index])) 63 | tfs.append(vocab_word_.count) 64 | 65 | norms = np.array(norms) 66 | tfs = np.array(tfs) 67 | non_zero_norms_mask = (norms != 0) & (tfs != 0) 68 | norms = norms[non_zero_norms_mask] 69 | tfs = tfs[non_zero_norms_mask] 70 | 71 | print(f'norms {norms.shape}, tfs {tfs.shape}') 72 | return norms, tfs 73 | 74 | 75 | def ng_norm_vec(w: str): # plain sum of the word's n-gram vectors, so its norm scales with the n-gram count 76 | word_vec = np.zeros(wv.vectors_ngrams.shape[1], dtype=np.float32) 77 | ngram_hashes = ft_ngram_hashes(w, wv.min_n, wv.max_n, wv.bucket, wv.compatible_hash) 78 | for nh in ngram_hashes: 79 | word_vec += wv.vectors_ngrams[nh] 80 | # +1 same as in the adjust vecs method 81 | #word_vec /= len(ngram_hashes) 82 | # word_vec /= math.log(1 + len(ngram_hashes)) 83 | return word_vec 84 | 85 | 86 | def standard_vec(w: str): # Gensim-style OOV vector: average of the word's n-gram vectors 87 | word_vec = np.zeros(wv.vectors_ngrams.shape[1], dtype=np.float32) 88 | ngram_hashes = ft_ngram_hashes(w, wv.min_n, wv.max_n, wv.bucket, wv.compatible_hash) 89 | for nh in ngram_hashes: 90 | word_vec += wv.vectors_ngrams[nh] 91 | # +1 same as in the adjust vecs method 92 | if len(ngram_hashes) == 0: 93 | word_vec.fill(0) 94 | return word_vec 95 | 96 | else: 97 | return word_vec / len(ngram_hashes) 98 | 99 | 100 | def calc_norms(get_vec: Callable): 101 | norms = np.zeros(len(wv.vectors_vocab), dtype=np.float64) 102 | tfs =
np.zeros(len(wv.vectors_vocab), dtype=np.float64) 103 | # for i in range(len(vectors_vocab)): 104 | for word, val in wv.vocab.items(): 105 | # norms[i] = LA.norm(v) 106 | # word = wv.index2word[i] 107 | # norms[i] = LA.norm(wv.word_vec(word)) 108 | i = val.index 109 | norms[i] = LA.norm(get_vec(word)) 110 | # tfs[i] = log(wv.vocab[word].count) 111 | tfs[i] = val.count 112 | 113 | non_zero_norms_mask = (norms != 0) & (tfs != 0) 114 | norms = norms[non_zero_norms_mask] 115 | tfs = tfs[non_zero_norms_mask] 116 | 117 | return norms, tfs 118 | 119 | 120 | def common_word_norm_density_histogram() -> Tuple[ndarray, ndarray]: 121 | common_norms, _ = common_words_norms(ng_norm_vec) 122 | norms, _ = calc_norms(ng_norm_vec) 123 | bins = np.linspace(0, 13, 300) 124 | norm_histogram, _ = np.histogram(norms, bins) 125 | norm_histogram[0] = 1 126 | common_histogram, _ = np.histogram(common_norms, bins) 127 | histogram = common_histogram / norm_histogram 128 | common_non_zero_on_nan = np.argwhere(common_histogram[np.isnan(histogram)] != 0) 129 | if len(np.argwhere(common_non_zero_on_nan)) > 0: 130 | raise ValueError(f'unexpected nan at {common_non_zero_on_nan}, common: {common_histogram[common_non_zero_on_nan]}') 131 | 132 | histogram[np.isnan(histogram)] = 0 133 | density_histogram = histogram / histogram[np.isfinite(histogram)].sum() / (bins[1] - bins[0]) 134 | return density_histogram, bins 135 | 136 | 137 | def histogram_position(bins, value) -> int: 138 | for i, b in enumerate(bins): 139 | if value < b: 140 | if i == 0: 141 | raise ValueError 142 | else: 143 | return i - 1 144 | raise ValueError 145 | 146 | 147 | def histogram_val(histogram: ndarray, bins: ndarray, value: float) -> float: 148 | return histogram[np.digitize(value, bins)] 149 | 150 | 151 | def no_ngram_vec(word: str) -> ndarray: 152 | if word in wv.vocab: 153 | return wv.vectors_vocab[wv.vocab[word].index] 154 | 155 | else: 156 | return np.zeros(wv.vectors_vocab[0].shape[0]) 157 | 158 | 159 | #%% def plot_standard_vec_norms(): 160 | norms, tfs = calc_norms(standard_vec) 161 | 162 | rnd_word_idx = [ 163 | # sorted_idxs[400000], sorted_idxs[800000], sorted_idxs[1200000], sorted_idxs[1600000], sorted_idxs[1800000] 164 | select_word_index(norms, tfs, 70_000, 100_000, 2.4, 2.71), 165 | select_word_index(norms, tfs, 55_000, 70_000, 0.53, 0.6), 166 | select_word_index(norms, tfs, 4600_000, 7000_000, 0.44, 0.47), 167 | select_word_index(norms, tfs, 4600_000, 7000_000, 1.26, 1.3), 168 | select_word_index(norms, tfs, 4600_000, 7000_000, 2.4, 2.71) 169 | ] 170 | 171 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 172 | fig: Figure = plt.figure() 173 | plt.title('FastText Norm - TF') 174 | plt.xlabel(tf_label) 175 | plt.xscale('log') 176 | plt.ylabel('standard norm (Gensim)') 177 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 178 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 179 | 180 | common_words_norm, common_words_tfs = common_words_norms(standard_vec) 181 | ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 182 | 183 | for i in rnd_word_idx: 184 | word = wv.index2word[i] 185 | tf = wv.vocab[word].count 186 | norm = norms[i] 187 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 188 | 189 | ax.grid(True, which='both') 190 | # plt.ylim(0, 40) 191 | ax.legend() 192 | fig.tight_layout() 193 | fig.savefig('data/standard_norm-tf.png') 194 | fig.show() 195 | 196 | 197 | #%% plot standard-norm of hypernyms
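# Under the norm~word-significance correlation (Schakel & Wilson, 2015), hyponyms such as
# 'January' are expected to be more specific, and thus to carry larger norms, than their more
# generic hypernyms such as 'month'; this cell and the hypo_norm_rel_perc_diff table computed
# below probe that expectation for the different norm variants.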
198 | norms, tfs = calc_norms(standard_vec) 199 | 200 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 201 | fig: Figure = plt.figure() 202 | plt.title('FastText Norm - TF') 203 | plt.xlabel(tf_label) 204 | plt.xscale('log') 205 | plt.ylabel('standard norm (Gensim)') 206 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 207 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 208 | 209 | for word in ['month', 'January', 'February', 'color', 'red', 'blue']: 210 | #'animal', 'dog', 'Labrador']: 211 | vocab_word = wv.vocab[word] 212 | tf = vocab_word.count 213 | norm = norms[vocab_word.index] 214 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 215 | 216 | ax.grid(True, which='both') 217 | # plt.ylim(0, 40) 218 | ax.legend() 219 | fig.tight_layout() 220 | fig.savefig('data/standard_norm-tf-hypernyms.png') # distinct file name so the scatter plot saved above is not overwritten 221 | fig.show() 222 | 223 | 224 | 225 | #%% common word cluster samples 226 | def plot_words(word_idxs: ndarray, plot_title: str, file_name: str): 227 | # fig: Figure = plt.figure() 228 | # plt.title(f'FastText Norm - TF - {plot_title}') 229 | # plt.xlabel(tf_label) 230 | # plt.xscale('log') 231 | # plt.ylabel('standard norm (Gensim)') 232 | # ax: Axes = fig.add_subplot(1, 1, 1) 233 | # ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 234 | # ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 235 | words = [] 236 | for i in word_idxs: 237 | i = i[0] 238 | word = wv.index2word[i] 239 | tf = wv.vocab[word].count 240 | norm = norms[i] 241 | # ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 242 | words.append(word) 243 | pd.DataFrame(data=dict(word=words)).to_csv(f'data/words-{file_name}.csv', index=False, header=False) 244 | # ax.grid(True, which='both') 245 | # ax.legend() 246 | # fig.tight_layout() 247 | # fig.savefig(f'data/standard_norm-tf-{file_name}.png') 248 | # fig.show() 249 | # plt.clf() 250 | 251 | 252 | # norms, tfs = calc_norms(standard_vec) 253 | common_words_norm, common_words_tfs = common_words_norms(standard_vec) 254 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 255 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 4600_000, 7000_000, 0.44, 0.47)[:20], 'Bottom Right Cluster', 'bottom-right-cluster') 256 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 4600_000, 7000_000, 1.26, 1.3)[:20], 'Middle Right Cluster', 'middle-right-cluster') 257 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 4600_000, 7000_000, 2.4, 2.71)[:20], 'Top Right Cluster', 'top-right-cluster') 258 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 55_000, 70_000, 0.53, 0.6)[:20], 'Bottom Left Cluster', 'bottom-left-cluster') 259 | 260 | 261 | #%% def plot_no_ngram(): 262 | norms, tfs = calc_norms(no_ngram_vec) 263 | # sorted_idxs = matutils.argsort(norms, reverse=True) 264 | rnd_word_idx = [ 265 | # sorted_idxs[400000], sorted_idxs[800000], sorted_idxs[1200000], sorted_idxs[1600000], sorted_idxs[1800000] 266 | # select_word_index(norms, tfs, 5_000, 10_000, 4.7, 5.1), 267 | select_word_index(norms, tfs, 70_000, 80_000, 5, 5.1), 268 | select_word_index(norms, tfs, 70_000, 80_000, 10, 11), 269 | select_word_index(norms, tfs, 70_000, 80_000, 15, 20), 270 | select_word_index(norms, tfs, 900_000, 1000_000, 4.7, 5), 271 | select_word_index(norms, tfs, 15_000_000, 17_000_000, 3, 6.0), 272 | ] 273 | seaborn.set(style='white',
rc={'figure.figsize': (12, 8)}) 274 | fig: Figure = plt.figure() 275 | plt.title('FastText Whole Word Token (no ngram) - TF') 276 | plt.xlabel(tf_label) 277 | plt.xscale('log') 278 | plt.ylabel('norm of the whole words without sub-ngrams') 279 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 280 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 281 | 282 | common_words_norm, common_words_tfs = common_words_norms(no_ngram_vec) 283 | ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 284 | 285 | for i in rnd_word_idx: 286 | word = wv.index2word[i] 287 | tf = wv.vocab[word].count 288 | norm = norms[i] 289 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 290 | 291 | ax.grid(True, which='both') 292 | plt.ylim(0, 40) 293 | ax.legend() 294 | fig.tight_layout() 295 | fig.savefig('data/no_ngram_norm-tf.png') 296 | fig.show() 297 | 298 | 299 | #%% list norms for hypernyms and hyponyms 300 | 301 | pd.set_option('display.max_rows', 500) 302 | pd.set_option('display.max_columns', 500) 303 | pd.set_option('display.width', 1000) 304 | 305 | 306 | def get_norm_tuple(word: str): 307 | standard_norm = LA.norm(standard_vec(word)) 308 | no_ngram_norm = LA.norm(no_ngram_vec(word)) 309 | ng_norm = LA.norm(ng_norm_vec(word)) 310 | count = wv.vocab[word].count 311 | return (word, standard_norm, no_ngram_norm, ng_norm, count) 312 | 313 | 314 | def rel_perc_diff(val1, val2): 315 | return (val1 - val2) / val2 * 100 316 | 317 | 318 | all_norms = pd.DataFrame(columns=['word', 'standard_norm', 'no_ngram_norm', 'ng_norm', 'count']) 319 | hypo_norm_rel_perc_diff = pd.DataFrame(columns=['hyper', 'hypo', 'standard_norm', 'no_ngram_norm', 'ng_norm', 'count']) 320 | for hypernym, hyponyms in { 321 | 'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], 322 | 'color': ['red', 'blue', 'green', 'white', 'orange', 'purple', 'black', 'pink', 'yellow', 'cyan', 'violet', 'grey'], 323 | 'animal': ['dog', 'cat', 'bird', 'reptile', 'fish', 'cow', 'insect', 'fly', 'mammal'], 324 | 'tool': ['hammer', 'screwdriver', 'drill', 'handsaw', 'knife', 'wrench', 'pliers'], 325 | 'fruit': ['banana', 'apple', 'pear', 'peach', 'orange', 'pineapple', 'lemon', 'pomegranate', 'grape', 'strawberries'], 326 | 'flower': ['peony', 'rose', 'lily', 'tulip', 'sunflower', 'marigold', 'orchid'], 327 | 'tree': ['pine', 'pear', 'maple', 'oak', 'aspen', 'spruce', 'larch', 'linden', 'juniper', 'birch', 'elm'] 328 | }.items(): 329 | hyper_norms = get_norm_tuple(hypernym) 330 | all_norms.loc[all_norms.shape[0]] = hyper_norms 331 | for hyponym in hyponyms: 332 | hypo_norms = get_norm_tuple(hyponym) 333 | all_norms.loc[all_norms.shape[0]] = hypo_norms 334 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = ( 335 | hypernym, 336 | hyponym, 337 | rel_perc_diff(hypo_norms[1], hyper_norms[1]), 338 | rel_perc_diff(hypo_norms[2], hyper_norms[2]), 339 | rel_perc_diff(hypo_norms[3], hyper_norms[3]), 340 | rel_perc_diff(hypo_norms[4], hyper_norms[4]) 341 | ) 342 | 343 | # hypo_norm_rel_perc_diff: pd.DataFrame = hypo_norm_rel_perc_diff.loc[lambda df: df['count'].abs() < 30] 344 | # hypo_norm_rel_perc_diff.reset_index(inplace=True, drop=True) 345 | 346 | averages = ( 347 | 'average', 348 | '', 349 | hypo_norm_rel_perc_diff['standard_norm'].mean(), 350 | hypo_norm_rel_perc_diff['no_ngram_norm'].mean(), 351 | hypo_norm_rel_perc_diff['ng_norm'].mean(), 352
| hypo_norm_rel_perc_diff['count'].mean(), 353 | ) 354 | 355 | counts = ( 356 | 'counts', 357 | '', 358 | np.argwhere(hypo_norm_rel_perc_diff['standard_norm'] > 0).shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 359 | np.argwhere(hypo_norm_rel_perc_diff['no_ngram_norm'] > 0).shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 360 | np.argwhere(hypo_norm_rel_perc_diff['ng_norm'] > 0).shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 361 | np.nan, 362 | ) 363 | 364 | counts_selected = ( 365 | 'counts selected', 366 | '', 367 | hypo_norm_rel_perc_diff.loc[lambda df: (df['standard_norm'] > 0)].shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 368 | hypo_norm_rel_perc_diff.loc[lambda df: (df['no_ngram_norm'] > 0)].shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 369 | hypo_norm_rel_perc_diff.loc[lambda df: (df['ng_norm'] > 0)].shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 370 | np.nan, 371 | ) 372 | 373 | 374 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = averages 375 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = counts 376 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = counts_selected 377 | 378 | print('all norms') 379 | print(all_norms) 380 | print() 381 | print('rel perc norm diff') 382 | print(hypo_norm_rel_perc_diff) 383 | 384 | hypo_norm_rel_perc_diff.to_html('data/hypo_norm_rel_perc_diff.html') 385 | hypo_norm_rel_perc_diff.to_csv('data/hypo_norm_rel_perc_diff.csv', index=False) 386 | 387 | 388 | #%% def plot_ng_norm_vec_norms(): 389 | norms, tfs = calc_norms(ng_norm_vec) 390 | # sorted_idxs = matutils.argsort(norms, reverse=True) 391 | rnd_word_idx = [ 392 | # sorted_idxs[400000], sorted_idxs[800000], sorted_idxs[1200000], sorted_idxs[1600000], sorted_idxs[1800000] 393 | # select_word_index(norms, tfs, 5_000, 10_000, 2.4, 2.6), 394 | select_word_index(norms, tfs, 90_000, 100_000, 2.4, 2.6), 395 | select_word_index(norms, tfs, 90_000, 100_000, 5, 5.1), 396 | select_word_index(norms, tfs, 90_000, 100_000, 10, 11), 397 | select_word_index(norms, tfs, 900_000, 1000_000, 2.4, 2.6), 398 | # select_word_index(norms, tfs, 3000_000, 4000_000, 2.4, 2.6), 399 | select_word_index(norms, tfs, 15_000_000, 17_000_000, 2.4, 2.6), 400 | ] 401 | 402 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 403 | fig: Figure = plt.figure() 404 | plt.title('FastText NG_Norm - TF') 405 | plt.xlabel(tf_label) 406 | plt.xscale('log') 407 | plt.ylabel(ng_norm_label) 408 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 409 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 410 | 411 | common_words_norm, common_words_tfs = common_words_norms(ng_norm_vec) 412 | ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 413 | 414 | for i in rnd_word_idx: 415 | word = wv.index2word[i] 416 | tf = wv.vocab[word].count 417 | norm = norms[i] 418 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 419 | 420 | ax.grid(True, which='both') 421 | plt.ylim(0, 30) 422 | ax.legend() 423 | fig.tight_layout() 424 | fig.savefig('data/ng_norm-tf.png') 425 | fig.show() 426 | 427 | # %% def plot_histogram_of_common_and_vocab_ng_norms(): 428 | norms, _ = calc_norms(ng_norm_vec) 429 | print(f'vecs norms avg: {np.average(norms[np.isfinite(norms)], axis=0)}') 430 | print(f'norms: {norms[0:10].tolist()}') 431 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 432 | fig: Figure = plt.figure() 433 | plt.title('FastText NG_Norm Distribution')
434 | plt.ylabel('Word Count Density') 435 | plt.xlabel(ng_norm_label) # x axis shows ng_norm values, not term frequency 436 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 437 | bins = np.linspace(0, 10, 100) 438 | ax.hist(norms, bins=bins, alpha=0.5, label=fasttext_model_vocab_label, density=True) 439 | common_norms, _ = common_words_norms(ng_norm_vec) 440 | print(f'common vecs norms avg: {np.average(common_norms[np.isfinite(common_norms)], axis=0)}') 441 | ax.hist(common_norms, bins=bins, alpha=0.5, label=mit_10k_common_label, density=True) 442 | ax.grid(True, which='both') 443 | ax.legend() 444 | fig.tight_layout() 445 | fig.savefig('data/ng_norm-hist.png') 446 | fig.show() 447 | 448 | 449 | #%% calc_and_store_ng_norm_density_histogram(): 450 | density_histogram, bins = common_word_norm_density_histogram() 451 | # np.savetxt('data/hist-probability.txt', probability_histogram) 452 | ng_norms = pd.Series(bins).rolling(window=2).mean().iloc[1:].values 453 | pd.DataFrame({'density': density_histogram, 'ng_norm': ng_norms}).to_csv('data/ng-norm-density-hist.csv', index=False) 454 | np.savetxt('data/hist-bins.txt', bins) 455 | 456 | 457 | #%% 458 | def run_plot_density_histogram(): 459 | pdf_df = pd.read_csv('data/ng-norm-density-hist.csv') 460 | density_histogram = pdf_df['density'].values 461 | ng_norms = pdf_df['ng_norm'] 462 | 463 | bin_width = ng_norms[1] - ng_norms[0] 464 | fitfunc = lambda mu, sigma, df, x: t.pdf(x, df, mu, sigma) 465 | errfunc = lambda p, x, y: fitfunc(p[0], p[1], p[2], x) - y 466 | 467 | mean = np.sum(density_histogram * bin_width * ng_norms) 468 | sigma = np.sqrt(np.sum(density_histogram * bin_width * (ng_norms - mean) ** 2)) 469 | starting_param = np.array([mean, sigma, 3]) 470 | print(f'starting p {starting_param}') 471 | p, success = leastsq(errfunc, starting_param, args=(ng_norms, density_histogram), full_output=False) 472 | print(f'fitted_density params: {p}, success value: {success}') 473 | fitted_density = fitfunc(p[0], p[1], p[2], ng_norms) 474 | 475 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 476 | fig: Figure = plt.figure() 477 | plt.title(mit_10k_common_label + ' FastText NG-Norm Density Histogram') 478 | plt.ylabel('Probability Density') 479 | plt.xlabel(ng_norm_label) 480 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 481 | ax.bar(ng_norms, density_histogram, label='original distribution', color='grey', alpha=1, width=bin_width, edgecolor='grey') 482 | # ax.plot(X, fitted_density, label='fitted_density', color='green', alpha=0.5, width=bin_width) #, linestyle='--') 483 | fit_label = f'fitted t-distribution (df: {np.round(p[2], 1)}, mean: {np.round(p[0], 1)}, scale: {np.round(p[1], 1)})' 484 | ax.plot(ng_norms, fitted_density, label=fit_label, color='orange', alpha=1, linestyle='--') 485 | ax.grid(True, which='both') 486 | ax.legend() 487 | fig.tight_layout() 488 | fig.savefig('data/ng_norm-common-density.png') 489 | fig.show() 490 | 491 | 492 | run_plot_density_histogram() 493 | 494 | 495 | #%% print_word_separation 496 | def word_split_probability(text: str, density_histogram, bins): 497 | bin_width = bins[1] - bins[0] 498 | text = text.lower() 499 | df = pd.DataFrame(index=range(1, len(text)), columns=['word1', 'word2', 'norm1', 'norm2', 'prob1', 'prob2', 'prob']) 500 | for i in df.index: 501 | word1 = text[:i] 502 | word2 = text[i:] 503 | df.loc[i, 'word1'] = word1 504 | df.loc[i, 'word2'] = word2 505 | df.loc[i, 'norm1'] = LA.norm(ng_norm_vec(word1)) 506 | df.loc[i, 'norm2'] = LA.norm(ng_norm_vec(word2)) 507 | 508 | try: 509 | df['prob1'] =
density_histogram[np.digitize(df['norm1'].values, bins)] * bin_width 510 | df['prob2'] = density_histogram[np.digitize(df['norm2'].values, bins)] * bin_width 511 | df['prob'] = df['prob1'].values * df['prob2'].values 512 | 513 | except IndexError as e: 514 | raise ValueError(f'Failed to find in histogram one of {df[["word1", "word2", "norm1", "norm2"]]}') from e 515 | 516 | return df 517 | 518 | 519 | density_histogram = pd.read_csv('data/ng-norm-density-hist.csv')['density'].values 520 | bins = np.loadtxt('data/hist-bins.txt') 521 | 522 | text = 'inflationlithium' 523 | # text = 'goldtwenty' 524 | # text = 'goldvector' 525 | 526 | splits = word_split_probability(text, density_histogram, bins) 527 | with pd.option_context('display.max_rows', None, 'display.max_columns', None): # more options can be specified also 528 | print(splits) 529 | 530 | with open('data/ng_norm-split-sample.html', 'w') as f: 531 | splits.to_html(f, index=False) 532 | 533 | max_idx = np.argmax(splits['prob'].values) 534 | print(f'max idx {max_idx}, {list(splits.loc[max_idx + 1, ["word1", "word2"]].iteritems())}') 535 | 536 | 537 | #%% split common: measure how often the most probable split recovers the true boundary of two concatenated common words 538 | bins = np.loadtxt('data/hist-bins.txt') 539 | density_histogram = pd.read_csv('data/ng-norm-density-hist.csv')['density'].values 540 | 541 | words = read_mit_10k_words() 542 | np.random.seed(42) 543 | words1 = words[np.random.randint(0, 9900, size=3000)] 544 | words2 = words[np.random.randint(0, 9900, size=3000)] 545 | 546 | match_count = 0 547 | total_count = 0 548 | for i in range(words1.shape[0]): 549 | word1 = words1[i] 550 | word2 = words2[i] 551 | # print(word1) 552 | if word1 in wv.vocab and word2 in wv.vocab: 553 | total_count += 1 554 | splits = word_split_probability(word1 + word2, density_histogram, bins) 555 | # print(splits) 556 | max_idx = splits['prob'].idxmax() 557 | # print(max_idx) 558 | best_word1 = splits.loc[max_idx, 'word1'] 559 | # print(best_word1) 560 | if len(best_word1) == len(word1): 561 | match_count += 1 562 | 563 | else: 564 | pass 565 | # best_word2 = splits.loc[max_idx, 'word2'] 566 | # print(f'invalid split {best_word1} {best_word2}') 567 | 568 | print(f'match_count {match_count}, total_count {total_count}, accuracy {100 * match_count / total_count}') 569 | --------------------------------------------------------------------------------