├── .gitignore
├── presentation.odp
├── presentation.pptx
├── requirements.txt
├── results
│   ├── ng_norm-hist.png
│   ├── ng_norm-tf.png
│   ├── no_ngram_norm-tf.png
│   ├── standard_norm-tf.png
│   ├── ng_norm-common-density.png
│   ├── words-top-right-cluster.csv
│   ├── words-bottom-left-cluster.csv
│   ├── words-middle-right-cluster.csv
│   ├── words-bottom-right-cluster.csv
│   ├── hypo_norm_rel_perc_diff.csv
│   └── ng-norm-density-hist.csv
├── README.md
└── main.py
/.gitignore:
--------------------------------------------------------------------------------
1 | /.idea
2 | /data
3 | 
--------------------------------------------------------------------------------
/presentation.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/presentation.odp
--------------------------------------------------------------------------------
/presentation.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/presentation.pptx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | gensim==3.7.3
2 | scipy==1.3.0
3 | seaborn==0.9.0
4 | numpy==1.16.4
5 | pandas==0.24.2
6 | matplotlib
7 | 
--------------------------------------------------------------------------------
/results/ng_norm-hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/ng_norm-hist.png
--------------------------------------------------------------------------------
/results/ng_norm-tf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/ng_norm-tf.png
--------------------------------------------------------------------------------
/results/no_ngram_norm-tf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/no_ngram_norm-tf.png
--------------------------------------------------------------------------------
/results/standard_norm-tf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/standard_norm-tf.png
--------------------------------------------------------------------------------
/results/ng_norm-common-density.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vackosar/fasttext-vector-norms-and-oov-words/HEAD/results/ng_norm-common-density.png
--------------------------------------------------------------------------------
/results/words-top-right-cluster.csv:
--------------------------------------------------------------------------------
1 | position
2 | wonderful
3 | shooting
4 | switch
5 | â
6 | Atlantic
7 | ladies
8 | vegetables
9 | tourist
10 | HERE
11 | prescription
12 | upgraded
13 | Evil
14 | 
--------------------------------------------------------------------------------
/results/words-bottom-left-cluster.csv:
--------------------------------------------------------------------------------
1 | now
2 | three
3 | month
4 | News
5 | Big
6 | picked
7 | votes
8 | signature
9 | Challenge
10 | Short
11 | trick
12 | Lots
13 | 68
14 | priorities
15 | upgrades
16 | 
--------------------------------------------------------------------------------
/results/words-middle-right-cluster.csv:
--------------------------------------------------------------------------------
1 | via
2 | companies
3 | necessary
4 | straight
5 | menu
6 | kinds
7 | Championship
8 | relief
9 | periods
10 | Prize
11 | minimal
12 | Rated
13 | 83
14 | wears
15 | Tiger
16 | 
--------------------------------------------------------------------------------
/results/words-bottom-right-cluster.csv:
--------------------------------------------------------------------------------
1 | our
2 | home
3 | game
4 | won
5 | control
6 | law
7 | common
8 | Street
9 | speed
10 | Tuesday
11 | direct
12 | helped
13 | passed
14 | condition
15 | Date
16 | signed
17 | Government
18 | flight
19 | cheap
20 | foreign
21 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # FastText Vector Norms And OOV Words
2 | 
3 | 
4 | # Summary
5 | 
6 | Word embeddings trained on large unlabeled corpora are useful for many natural language processing tasks. FastText [(Bojanowski et al., 2016)](https://arxiv.org/abs/1607.04606), in contrast to the Word2vec model, accounts for sub-word information by also embedding sub-word n-grams. A FastText word representation is the word's own embedding vector plus the sum of the vectors of the n-grams contained in it.
7 | Word2vec vector norms have been shown [(Schakel & Wilson, 2015)](http://arxiv.org/abs/1508.02297) to be correlated with word significance. This blog post visualizes the vector norms of the FastText embedding and evaluates the use of the FastText word vector norm multiplied by the number of the word's n-grams for detecting non-English OOV words.
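The representation and the norm studied here can be reproduced with the same gensim API that main.py relies on. A minimal sketch, assuming gensim 3.7.3 as pinned in requirements.txt and assuming the model file has been downloaded to the path used in main.py:

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors
from gensim.models.utils_any2vec import ft_ngram_hashes

# Assumption: cc.en.300.bin is available locally (see main.py for the path used there).
wv = load_facebook_vectors('data/input/cc.en.300.bin')

word = 'inflationlithium'  # OOV example word taken from main.py
hashes = ft_ngram_hashes(word, wv.min_n, wv.max_n, wv.bucket, wv.compatible_hash)
vec = wv.vectors_ngrams[hashes].sum(axis=0)  # sum of the word's n-gram vectors

# ng_norm evaluated in this experiment: the norm of the n-gram sum,
# which equals ngram_count * norm(mean n-gram vector).
print(len(hashes), np.linalg.norm(vec))
```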
8 | 
9 | - [Read the full description of this experiment on FastText OOV on my blog and ask questions or subscribe](https://vaclavkosar.com/ml/FastText-Vector-Norms-And-OOV-Words)
10 | - [The entire code for this post is available in this repository in the file "main.py"](https://github.com/vackosar/fasttext-vector-norms-and-oov-words/blob/master/main.py)
11 | - [Continue: StarSpace - general-purpose embeddings inspired by FastText](https://vaclavkosar.com/ml/starspace-embedding)
--------------------------------------------------------------------------------
/results/hypo_norm_rel_perc_diff.csv:
--------------------------------------------------------------------------------
1 | hyper,hypo,standard_norm,no_ngram_norm,ng_norm,count
2 | month,January,-22.559890151023865,12.179197371006012,29.06685471534729,76.9945483694108
3 | month,February,-34.532856941223145,13.354693353176117,30.93428909778595,57.40424788649045
4 | month,March,21.790121495723724,8.177371323108673,21.79012894630432,91.52572109809037
5 | month,April,25.993046164512634,10.371281206607819,25.993049144744873,86.94309311928393
6 | month,May,247.45163917541504,6.607942283153534,15.817219018936157,219.5778495723792
7 | month,June,86.63637638092041,9.665937721729279,24.424254894256592,80.36360749962446
8 | month,July,93.21904182434082,12.777550518512726,28.812697529792786,71.60087186170838
9 | month,August,-4.813988506793976,11.601139605045319,26.91468596458435,56.87035804514697
10 | month,September,-44.985681772232056,12.394984811544418,28.366747498512268,61.35298972257148
11 | month,October,-21.94921225309372,12.07357794046402,30.084648728370667,64.15855587788923
12 | month,November,-35.144105553627014,13.222669064998627,29.71179187297821,55.42349838944575
13 | month,December,-34.639644622802734,12.90571391582489,30.72071075439453,59.169546969422356
14 | color,red,214.2678737640381,-2.4428382515907288,4.7559455037117,-14.315463389067611
15 | color,blue,44.778406620025635,-5.8992899954319,-3.481072559952736,-40.68353126458778
16 | color,green,-16.087377071380615,-4.437129572033882,-16.08739197254181,-30.0291182166378
17 | color,white,-3.950345516204834,-4.10078652203083,-3.95035520195961,24.457166798508602
18 | color,orange,-19.920025765895844,1.365382969379425,6.773289293050766,-80.10268794209723
19 | color,purple,-25.538989901542664,-3.007206879556179,-0.7186640985310078,-87.57766504883688
20 | color,black,-5.428289994597435,-3.7266232073307037,-5.428304523229599,26.119314060631037
21 | color,pink,61.684030294418335,1.4097620733082294,7.789343595504761,-74.93923364878756
22 | color,yellow,-25.193437933921814,1.5528187155723572,-0.25792764499783516,-71.49442220962484
23 | color,cyan,87.80776262283325,18.8669815659523,25.205162167549133,-99.41605573320909
24 | color,violet,-28.96551787853241,10.081180185079575,-5.287368223071098,-98.38465018973534
25 | color,grey,39.793407917022705,-0.5296188872307539,-6.80440366268158,-86.86721598188748
26 | animal,dog,252.45609283447266,-6.00099042057991,-11.885977536439896,76.01154405541592
27 | animal,cat,234.5402717590332,-6.204253435134888,-16.36492908000946,-22.22957017107758
28 | animal,bird,85.8738124370575,2.5565512478351593,-7.063093036413193,-45.238106374812375
29 | animal,reptile,-22.501929104328156,15.213459730148315,-3.1274113804101944,-97.98246182931965
30 | animal,fish,75.87220072746277,2.111702412366867,-12.063899636268616,22.17952626178609
31 | animal,cow,267.8441286087036,7.508818805217743,-8.038965612649918,-82.95581896832819
32 | 
animal,insect,-7.264796644449234,7.887063175439835,-7.264796644449234,-90.30017980186985 33 | animal,fly,259.20159816741943,-6.115291640162468,-10.199600458145142,-24.141623498716452 34 | animal,mammal,3.3455990254879,16.28085821866989,3.3455990254879,-96.89425529023083 35 | tool,hammer,-60.361552238464355,5.061610043048859,-20.72310447692871,-88.92333982707078 36 | tool,screwdriver,-75.43929815292358,33.55134129524231,10.523150116205215,-97.63942241985984 37 | tool,drill,-43.53181719779968,11.923173069953918,-15.297724306583405,-85.13255472071457 38 | tool,handsaw,-49.96241331100464,76.5534520149231,25.093963742256165,-99.87315610920845 39 | tool,knife,-37.43588626384735,20.666681230068207,-6.153828650712967,-75.10034918456768 40 | tool,wrench,-51.09402537345886,26.54104232788086,-2.1880509331822395,-96.23042980088515 41 | tool,pliers,-45.3825443983078,50.95037817955017,9.23491045832634,-98.40439012603827 42 | fruit,banana,-22.01044261455536,-1.040846575051546,3.986077383160591,-81.24529080040269 43 | fruit,apple,-3.1680751591920853,-1.2982229702174664,-3.1680796295404434,-56.6124435389897 44 | fruit,pear,59.400397539138794,5.798259750008583,6.266932189464569,-93.56599616301408 45 | fruit,peach,-3.1049944460392,-7.888755947351456,-3.105001151561737,-91.25212669185491 46 | fruit,orange,-27.72880494594574,-11.572788655757904,-3.638404980301857,-36.93354726429143 47 | fruit,pineapple,-55.38869500160217,2.2534646093845367,4.093045741319656,-91.03878920728545 48 | fruit,lemon,7.380922883749008,0.14850908191874623,7.38091841340065,-65.35893663637096 49 | fruit,pomegranate,-63.00467848777771,3.623020276427269,10.985970497131348,-97.13937722324889 50 | fruit,grape,5.267916992306709,6.485439836978912,5.267921462655067,-88.9861233404569 51 | fruit,strawberries,-65.2907133102417,4.979605600237846,15.697626769542694,-88.7931709018528 52 | flower,peony,51.811230182647705,15.8254474401474,13.858413696289062,-98.44909184269798 53 | flower,rose,146.12656831741333,-2.1894115954637527,23.063285648822784,16.749135146156082 54 | flower,lily,108.58210325241089,7.221601158380508,4.2910512536764145,-93.18192151636643 55 | flower,tulip,49.614036083221436,14.132213592529297,12.210532277822495,-96.02838754717223 56 | flower,sunflower,-34.75457727909088,9.156723320484161,14.179478585720062,-90.75112293025342 57 | flower,marigold,-25.294849276542664,9.12274420261383,12.057727575302124,-99.08452695857119 58 | flower,orchid,10.703951865434647,7.983660697937012,10.703951865434647,-93.58325297140551 59 | tree,pine,9.169016778469086,4.525832831859589,9.169016778469086,-86.72316207320807 60 | tree,pear,20.91183215379715,11.848663538694382,20.91183215379715,-95.66247039113979 61 | tree,maple,-12.181924283504486,19.677509367465973,31.727102398872375,-90.921095628226 62 | tree,oak,155.62005043029785,15.57936817407608,27.810028195381165,-85.23922111925685 63 | tree,aspen,-16.014888882637024,16.366398334503174,25.97767412662506,-99.31872684331077 64 | tree,spruce,-32.47937858104706,5.101566016674042,35.041239857673645,-97.13105671310633 65 | tree,larch,-4.0693119168281555,22.811006009578705,43.89603137969971,-99.65749381620648 66 | tree,linden,-46.235501766204834,23.560859262943268,7.528995722532272,-99.73155855602536 67 | tree,juniper,-57.9480767250061,14.995041489601135,5.129804089665413,-99.00291744925768 68 | tree,birch,-20.747947692871094,14.30957019329071,18.878066539764404,-97.59187554218126 69 | tree,elm,196.46062850952148,20.48897296190262,48.2303112745285,-98.9773275489597 70 | 
average,,25.257317321922848,9.337584405404735,9.726516508004245,-44.851693209240054 71 | counts,,42.64705882352941,77.94117647058823,66.17647058823529, 72 | counts selected,,42.64705882352941,77.94117647058823,66.17647058823529, 73 | -------------------------------------------------------------------------------- /results/ng-norm-density-hist.csv: -------------------------------------------------------------------------------- 1 | density,ng_norm 2 | 0.0,0.021739130434782608 3 | 0.0,0.06521739130434782 4 | 0.0,0.10869565217391304 5 | 0.0,0.15217391304347824 6 | 0.0,0.19565217391304346 7 | 0.0,0.2391304347826087 8 | 0.0,0.28260869565217395 9 | 0.0,0.32608695652173914 10 | 0.0,0.3695652173913043 11 | 0.0,0.4130434782608695 12 | 0.0,0.4565217391304348 13 | 0.0,0.5000000000000001 14 | 0.0,0.5434782608695654 15 | 0.0,0.5869565217391306 16 | 0.0,0.6304347826086958 17 | 0.0,0.673913043478261 18 | 0.0,0.7173913043478263 19 | 0.0,0.7608695652173914 20 | 0.0,0.8043478260869567 21 | 0.0,0.847826086956522 22 | 0.0,0.8913043478260871 23 | 0.0,0.9347826086956526 24 | 0.0,0.9782608695652177 25 | 0.029745022275322604,1.0217391304347831 26 | 0.0,1.0652173913043483 27 | 0.0,1.1086956521739135 28 | 0.0,1.1521739130434787 29 | 0.0,1.195652173913044 30 | 0.0,1.2391304347826093 31 | 0.017818040940899414,1.2826086956521745 32 | 0.01685360955024037,1.3260869565217397 33 | 0.030744854956846052,1.3695652173913049 34 | 0.054487934071828927,1.41304347826087 35 | 0.038852790157855005,1.456521739130435 36 | 0.048876874899388995,1.5 37 | 0.02303444537165591,1.5434782608695652 38 | 0.04253199600714571,1.5869565217391304 39 | 0.05088508678532239,1.6304347826086958 40 | 0.06863009605820303,1.6739130434782612 41 | 0.09945226146254896,1.7173913043478262 42 | 0.1190031249365472,1.7608695652173914 43 | 0.13667132373784924,1.8043478260869565 44 | 0.12125987330468127,1.8478260869565215 45 | 0.22090784600384045,1.8913043478260865 46 | 0.2395107597547419,1.9347826086956517 47 | 0.25932458876772685,1.9782608695652166 48 | 0.29461419518552434,2.021739130434782 49 | 0.44009275138709064,2.065217391304347 50 | 0.5988659381170858,2.108695652173912 51 | 0.5217400800936571,2.1521739130434776 52 | 0.7479918294747602,2.195652173913043 53 | 0.8052760982827616,2.239130434782608 54 | 1.2067470806150369,2.282608695652173 55 | 1.241471076965556,2.3260869565217384 56 | 1.1326357447550428,2.3695652173913038 57 | 1.226660243147753,2.413043478260869 58 | 1.2675036866714802,2.456521739130434 59 | 1.377630912106951,2.499999999999999 60 | 1.3524228496476398,2.5434782608695645 61 | 1.1204424508920712,2.58695652173913 62 | 0.9765901659390295,2.6304347826086953 63 | 0.8819719033203989,2.6739130434782603 64 | 0.7596683812373868,2.7173913043478253 65 | 0.8139558979250324,2.7608695652173907 66 | 0.6012373281823025,2.8043478260869557 67 | 0.5350413837621657,2.8478260869565206 68 | 0.4493961552700747,2.8913043478260856 69 | 0.4106285023910703,2.9347826086956506 70 | 0.32469622798928155,2.978260869565216 71 | 0.2911248648326743,3.021739130434781 72 | 0.2548415142849414,3.065217391304346 73 | 0.2171309046935094,3.1086956521739113 74 | 0.1833797974179618,3.152173913043476 75 | 0.1375414762525093,3.1956521739130412 76 | 0.1255213172844463,3.239130434782606 77 | 0.12119496476901685,3.282608695652171 78 | 0.07265723441974399,3.3260869565217366 79 | 0.08986372209492456,3.369565217391301 80 | 0.06641048119613192,3.4130434782608665 81 | 0.0731763418728756,3.4565217391304315 82 | 0.05616108497321689,3.4999999999999964 83 | 0.037323420706538546,3.543478260869562 84 | 
0.04064077891412763,3.5869565217391264 85 | 0.033695208195448086,3.630434782608692 86 | 0.02390649713951644,3.6739130434782568 87 | 0.030977147407938492,3.7173913043478217 88 | 0.0295033405621686,3.760869565217387 89 | 0.012616395028748856,3.804347826086952 90 | 0.019076422588284705,3.847826086956517 91 | 0.01705417690734295,3.891304347826082 92 | 0.004895226292839087,3.934782608695647 93 | 0.010311266186646455,3.9782608695652124 94 | 0.016140389522706813,4.021739130434778 95 | 0.007586599771621939,4.065217391304342 96 | 0.009838574058438545,4.108695652173907 97 | 0.01238466935920343,4.152173913043473 98 | 0.00220167759281762,4.195652173913039 99 | 0.009129002002053575,4.239130434782604 100 | 0.012038292535885977,4.282608695652169 101 | 0.005062982514948528,4.326086956521735 102 | 0.007954520994995077,4.369565217391299 103 | 0.011174967325071882,4.413043478260865 104 | 0.01193439425847806,4.45652173913043 105 | 0.0031025123933556753,4.499999999999995 106 | 0.0032763920058489675,4.54347826086956 107 | 0.0033795437517031915,4.5869565217391255 108 | 0.0036110916991590855,4.630434782608691 109 | 0.011108286972685205,4.673913043478256 110 | 0.0,4.717391304347822 111 | 0.0,4.760869565217387 112 | 0.0,4.804347826086952 113 | 0.0,4.847826086956517 114 | 0.0,4.8913043478260825 115 | 0.004889592702792756,4.934782608695647 116 | 0.0051379348014483515,4.978260869565212 117 | 0.005292785157127927,5.021739130434778 118 | 0.011044944120346204,5.065217391304342 119 | 0.0,5.108695652173907 120 | 0.0,5.152173913043473 121 | 0.006250520056716423,5.195652173913039 122 | 0.0,5.239130434782604 123 | 0.0,5.282608695652169 124 | 0.01377153477991724,5.326086956521735 125 | 0.0,5.369565217391299 126 | 0.007483152016085931,5.413043478260865 127 | 0.0,5.45652173913043 128 | 0.0,5.499999999999995 129 | 0.0,5.54347826086956 130 | 0.008252566330521835,5.5869565217391255 131 | 0.0,5.63043478260869 132 | 0.0,5.6739130434782545 133 | 0.0,5.71739130434782 134 | 0.0,5.760869565217385 135 | 0.00939919779027535,5.80434782608695 136 | 0.0,5.847826086956515 137 | 0.010181737680513952,5.891304347826081 138 | 0.0,5.934782608695645 139 | 0.0,5.978260869565211 140 | 0.0,6.021739130434776 141 | 0.0,6.065217391304341 142 | 0.0,6.108695652173905 143 | 0.0,6.152173913043471 144 | 0.0,6.195652173913036 145 | 0.0,6.2391304347826 146 | 0.0,6.282608695652166 147 | 0.0,6.326086956521731 148 | 0.0,6.369565217391296 149 | 0.0,6.413043478260861 150 | 0.01564634813912194,6.456521739130427 151 | 0.0,6.499999999999991 152 | 0.0,6.5434782608695565 153 | 0.017760377377983885,6.586956521739122 154 | 0.0,6.6304347826086865 155 | 0.0,6.673913043478251 156 | 0.0,6.717391304347816 157 | 0.0,6.760869565217382 158 | 0.0,6.804347826086946 159 | 0.0,6.847826086956512 160 | 0.0,6.891304347826077 161 | 0.0,6.934782608695642 162 | 0.0,6.978260869565207 163 | 0.0,7.0217391304347725 164 | 0.0,7.065217391304337 165 | 0.0,7.1086956521739015 166 | 0.0,7.152173913043468 167 | 0.02667293613510095,7.195652173913032 168 | 0.0,7.239130434782597 169 | 0.0,7.282608695652162 170 | 0.0,7.326086956521728 171 | 0.0,7.369565217391292 172 | 0.0,7.413043478260858 173 | 0.0,7.456521739130423 174 | 0.0,7.499999999999988 175 | 0.0,7.543478260869553 176 | 0.0,7.586956521739118 177 | 0.0,7.630434782608683 178 | 0.0,7.673913043478247 179 | 0.0,7.717391304347813 180 | 0.0,7.760869565217378 181 | 0.0,7.804347826086943 182 | 0.0,7.847826086956508 183 | 0.0,7.891304347826074 184 | 0.0,7.934782608695638 185 | 0.0,7.9782608695652035 186 | 0.0,8.021739130434769 187 | 0.0,8.065217391304333 188 
| 0.0,8.1086956521739 189 | 0.0,8.152173913043463 190 | 0.0,8.195652173913029 191 | 0.0,8.239130434782595 192 | 0.0,8.282608695652161 193 | 0.0,8.326086956521728 194 | 0.0,8.369565217391292 195 | 0.0,8.413043478260857 196 | 0.0,8.456521739130423 197 | 0.0,8.499999999999988 198 | 0.0,8.543478260869552 199 | 0.0,8.586956521739118 200 | 0.0,8.630434782608683 201 | 0.0,8.673913043478247 202 | 0.0,8.717391304347814 203 | 0.0,8.760869565217378 204 | 0.0,8.804347826086943 205 | 0.0,8.847826086956509 206 | 0.07341747972972602,8.891304347826074 207 | 0.0,8.934782608695638 208 | 0.0,8.978260869565204 209 | 0.0,9.021739130434769 210 | 0.0,9.065217391304333 211 | 0.0,9.1086956521739 212 | 0.0,9.152173913043463 213 | 0.0,9.195652173913029 214 | 0.0,9.239130434782595 215 | 0.0,9.282608695652161 216 | 0.0,9.326086956521728 217 | 0.0,9.369565217391292 218 | 0.0,9.413043478260857 219 | 0.0,9.456521739130423 220 | 0.0,9.499999999999988 221 | 0.0,9.543478260869552 222 | 0.0,9.586956521739118 223 | 0.0,9.630434782608683 224 | 0.0,9.673913043478247 225 | 0.0,9.717391304347814 226 | 0.0,9.760869565217378 227 | 0.0,9.804347826086943 228 | 0.0,9.847826086956509 229 | 0.0,9.891304347826074 230 | 0.0,9.934782608695638 231 | 0.0,9.978260869565204 232 | 0.0,10.021739130434769 233 | 0.0,10.065217391304333 234 | 0.0,10.1086956521739 235 | 0.0,10.152173913043463 236 | 0.0,10.195652173913029 237 | 0.0,10.239130434782595 238 | 0.0,10.282608695652161 239 | 0.0,10.326086956521728 240 | 0.0,10.369565217391292 241 | 0.0,10.413043478260857 242 | 0.0,10.456521739130423 243 | 0.0,10.499999999999988 244 | 0.0,10.543478260869552 245 | 0.0,10.586956521739118 246 | 0.0,10.630434782608683 247 | 0.0,10.673913043478247 248 | 0.0,10.717391304347814 249 | 0.0,10.760869565217378 250 | 0.0,10.804347826086943 251 | 0.0,10.847826086956509 252 | 0.0,10.891304347826074 253 | 0.0,10.934782608695638 254 | 0.0,10.978260869565204 255 | 0.0,11.021739130434769 256 | 0.0,11.065217391304333 257 | 0.0,11.1086956521739 258 | 0.0,11.152173913043463 259 | 0.0,11.195652173913029 260 | 0.0,11.239130434782595 261 | 0.0,11.282608695652158 262 | 0.0,11.326086956521724 263 | 0.0,11.369565217391289 264 | 0.0,11.413043478260853 265 | 0.0,11.45652173913042 266 | 0.0,11.499999999999984 267 | 0.0,11.543478260869549 268 | 0.0,11.586956521739115 269 | 0.0,11.63043478260868 270 | 0.0,11.673913043478244 271 | 0.0,11.71739130434781 272 | 0.0,11.760869565217375 273 | 0.0,11.80434782608694 274 | 0.0,11.847826086956506 275 | 0.0,11.89130434782607 276 | 0.0,11.934782608695635 277 | 0.0,11.9782608695652 278 | 0.0,12.021739130434765 279 | 0.0,12.06521739130433 280 | 0.0,12.108695652173896 281 | 0.0,12.152173913043459 282 | 0.0,12.195652173913025 283 | 0.0,12.239130434782592 284 | 0.0,12.282608695652158 285 | 0.0,12.326086956521724 286 | 0.0,12.369565217391289 287 | 0.0,12.413043478260853 288 | 0.0,12.45652173913042 289 | 0.0,12.499999999999984 290 | 0.0,12.543478260869549 291 | 0.0,12.586956521739115 292 | 0.0,12.63043478260868 293 | 0.0,12.673913043478244 294 | 0.0,12.71739130434781 295 | 0.0,12.760869565217375 296 | 0.0,12.80434782608694 297 | 0.0,12.847826086956506 298 | 0.0,12.89130434782607 299 | 0.0,12.934782608695635 300 | 0.0,12.9782608695652 301 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | from typing import Callable, Tuple 2 | 3 | import matplotlib.pyplot as plt 4 | import numpy as np 5 | import pandas as pd 6 | import seaborn 7 | from 
gensim.models.fasttext import load_facebook_vectors 8 | from gensim.models.utils_any2vec import ft_ngram_hashes 9 | from gensim.test.utils import datapath 10 | from matplotlib.axes import Axes 11 | from matplotlib.figure import Figure 12 | from numpy import linalg as LA 13 | from numpy import ndarray 14 | from scipy.optimize import leastsq 15 | from scipy.stats import t 16 | 17 | #%% load model from disk 18 | 19 | # FIXME use working dir 20 | cap_path = datapath("/home/vackosar/src/fasttext-vector-norms-and-oov-words/data/input/cc.en.300.bin") 21 | # fb_model = load_facebook_model(cap_path) 22 | wv = load_facebook_vectors(cap_path) 23 | wv.init_sims() 24 | print(f'model: maxn: {wv.max_n}, minn {wv.min_n}, vocab size: {len(wv.vectors_vocab)}') 25 | 26 | 27 | #%% shared methods 28 | tf_label = 'tf (Fasttext Model Word Count)' 29 | ng_norm_label = 'ng_norm = ngram_count * ngram_only_norm i.e. (only sub-ngrams used)' 30 | mit_10k_common_label = 'MIT 10k Common Words' 31 | fasttext_model_vocab_label = 'Fasttext Model 2M Vocabulary Words' 32 | 33 | 34 | def select_word_index(norms: ndarray, tfs: ndarray, min_count: int, max_count: int, min_norm: float, max_norm: float) -> int: 35 | return select_word_indexes(norms, tfs, min_count, max_count, min_norm, max_norm)[0][0] 36 | 37 | 38 | def select_word_indexes(norms: ndarray, tfs: ndarray, min_count: int, max_count: int, min_norm: float, max_norm: float) -> ndarray: 39 | mask = (max_norm > norms) & (norms > min_norm) & (max_count > tfs) & (tfs > min_count) 40 | idxs = np.argwhere(mask) 41 | if len(idxs) == 0: 42 | raise ValueError(f'Not found {min_count}-{max_count}, {min_norm}-{max_norm}.') 43 | 44 | else: 45 | print(f'idxs found: {idxs[:5]}') 46 | return idxs 47 | 48 | 49 | def read_mit_10k_words(): 50 | words: ndarray = pd.read_csv('data/input/mit-10k-words.csv', header=None, names=['word'])['word'].values 51 | return words 52 | 53 | 54 | def common_words_norms(get_vec: Callable): 55 | words = read_mit_10k_words() 56 | norms = [] 57 | tfs = [] 58 | for i, word in enumerate(words): 59 | if word in wv.vocab: 60 | vocab_word_ = wv.vocab[word] 61 | norms.append(LA.norm(get_vec(word))) 62 | # norms.append(LA.norm(wv.vectors_vocab[vocab_word_.index])) 63 | tfs.append(vocab_word_.count) 64 | 65 | norms = np.array(norms) 66 | tfs = np.array(tfs) 67 | non_zero_norms_mask = (norms != 0) & (tfs != 0) 68 | norms = norms[non_zero_norms_mask] 69 | tfs = tfs[non_zero_norms_mask] 70 | 71 | print(f'norms {norms.shape}, tfs {tfs.shape}') 72 | return norms, tfs 73 | 74 | 75 | def ng_norm_vec(w: str): # plain sum of the word's n-gram vectors, so its norm scales with the n-gram count 76 | word_vec = np.zeros(wv.vectors_ngrams.shape[1], dtype=np.float32) 77 | ngram_hashes = ft_ngram_hashes(w, wv.min_n, wv.max_n, wv.bucket, wv.compatible_hash) 78 | for nh in ngram_hashes: 79 | word_vec += wv.vectors_ngrams[nh] 80 | # +1 same as in the adjust vecs method 81 | #word_vec /= len(ngram_hashes) 82 | # word_vec /= math.log(1 + len(ngram_hashes)) 83 | return word_vec 84 | 85 | 86 | def standard_vec(w: str): # Gensim-style OOV vector: average of the word's n-gram vectors 87 | word_vec = np.zeros(wv.vectors_ngrams.shape[1], dtype=np.float32) 88 | ngram_hashes = ft_ngram_hashes(w, wv.min_n, wv.max_n, wv.bucket, wv.compatible_hash) 89 | for nh in ngram_hashes: 90 | word_vec += wv.vectors_ngrams[nh] 91 | # +1 same as in the adjust vecs method 92 | if len(ngram_hashes) == 0: 93 | word_vec.fill(0) 94 | return word_vec 95 | 96 | else: 97 | return word_vec / len(ngram_hashes) 98 | 99 | 100 | def calc_norms(get_vec: Callable): 101 | norms = np.zeros(len(wv.vectors_vocab), dtype=np.float64) 102 | tfs =
np.zeros(len(wv.vectors_vocab), dtype=np.float64) 103 | # for i in range(len(vectors_vocab)): 104 | for word, val in wv.vocab.items(): 105 | # norms[i] = LA.norm(v) 106 | # word = wv.index2word[i] 107 | # norms[i] = LA.norm(wv.word_vec(word)) 108 | i = val.index 109 | norms[i] = LA.norm(get_vec(word)) 110 | # tfs[i] = log(wv.vocab[word].count) 111 | tfs[i] = val.count 112 | 113 | non_zero_norms_mask = (norms != 0) & (tfs != 0) 114 | norms = norms[non_zero_norms_mask] 115 | tfs = tfs[non_zero_norms_mask] 116 | 117 | return norms, tfs 118 | 119 | 120 | def common_word_norm_density_histogram() -> Tuple[ndarray, ndarray]: 121 | common_norms, _ = common_words_norms(ng_norm_vec) 122 | norms, _ = calc_norms(ng_norm_vec) 123 | bins = np.linspace(0, 13, 300) 124 | norm_histogram, _ = np.histogram(norms, bins) 125 | norm_histogram[0] = 1 126 | common_histogram, _ = np.histogram(common_norms, bins) 127 | histogram = common_histogram / norm_histogram 128 | common_non_zero_on_nan = np.argwhere(common_histogram[np.isnan(histogram)] != 0) 129 | if len(np.argwhere(common_non_zero_on_nan)) > 0: 130 | raise ValueError(f'unexpected nan at {common_non_zero_on_nan}, common: {common_histogram[common_non_zero_on_nan]}') 131 | 132 | histogram[np.isnan(histogram)] = 0 133 | density_histogram = histogram / histogram[np.isfinite(histogram)].sum() / (bins[1] - bins[0]) 134 | return density_histogram, bins 135 | 136 | 137 | def histogram_position(bins, value) -> int: 138 | for i, b in enumerate(bins): 139 | if value < b: 140 | if i == 0: 141 | raise ValueError 142 | else: 143 | return i - 1 144 | raise ValueError 145 | 146 | 147 | def histogram_val(histogram: ndarray, bins: ndarray, value: float) -> float: 148 | return histogram[np.digitize(value, bins)] 149 | 150 | 151 | def no_ngram_vec(word: str) -> ndarray: 152 | if word in wv.vocab: 153 | return wv.vectors_vocab[wv.vocab[word].index] 154 | 155 | else: 156 | return np.zeros(wv.vectors_vocab[0].shape[0]) 157 | 158 | 159 | #%% def plot_standard_vec_norms(): 160 | norms, tfs = calc_norms(standard_vec) 161 | 162 | rnd_word_idx = [ 163 | # sorted_idxs[400000], sorted_idxs[800000], sorted_idxs[1200000], sorted_idxs[1600000], sorted_idxs[1800000] 164 | select_word_index(norms, tfs, 70_000, 100_000, 2.4, 2.71), 165 | select_word_index(norms, tfs, 55_000, 70_000, 0.53, 0.6), 166 | select_word_index(norms, tfs, 4600_000, 7000_000, 0.44, 0.47), 167 | select_word_index(norms, tfs, 4600_000, 7000_000, 1.26, 1.3), 168 | select_word_index(norms, tfs, 4600_000, 7000_000, 2.4, 2.71) 169 | ] 170 | 171 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 172 | fig: Figure = plt.figure() 173 | plt.title('FastText Norm - TF') 174 | plt.xlabel(tf_label) 175 | plt.xscale('log') 176 | plt.ylabel('standard norm (Gensim)') 177 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 178 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 179 | 180 | common_words_norm, common_words_tfs = common_words_norms(standard_vec) 181 | ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 182 | 183 | for i in rnd_word_idx: 184 | word = wv.index2word[i] 185 | tf = wv.vocab[word].count 186 | norm = norms[i] 187 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 188 | 189 | ax.grid(True, which='both') 190 | # plt.ylim(0, 40) 191 | ax.legend() 192 | fig.tight_layout() 193 | fig.savefig('data/standard_norm-tf.png') 194 | fig.show() 195 | 196 | 197 | #%% plot standard-norm of hypernyms
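# Under the norm~word-significance correlation (Schakel & Wilson, 2015), hyponyms such as
# 'January' are expected to be more specific, and thus to carry larger norms, than their more
# generic hypernyms such as 'month'; this cell and the hypo_norm_rel_perc_diff table computed
# below probe that expectation for the different norm variants.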
198 | norms, tfs = calc_norms(standard_vec) 199 | 200 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 201 | fig: Figure = plt.figure() 202 | plt.title('FastText Norm - TF') 203 | plt.xlabel(tf_label) 204 | plt.xscale('log') 205 | plt.ylabel('standard norm (Gensim)') 206 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 207 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 208 | 209 | for word in ['month', 'January', 'February', 'color', 'red', 'blue']: 210 | #'animal', 'dog', 'Labrador']: 211 | vocab_word = wv.vocab[word] 212 | tf = vocab_word.count 213 | norm = norms[vocab_word.index] 214 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 215 | 216 | ax.grid(True, which='both') 217 | # plt.ylim(0, 40) 218 | ax.legend() 219 | fig.tight_layout() 220 | fig.savefig('data/standard_norm-tf-hypernyms.png') # distinct file name so the scatter plot saved above is not overwritten 221 | fig.show() 222 | 223 | 224 | 225 | #%% common word cluster samples 226 | def plot_words(word_idxs: ndarray, plot_title: str, file_name: str): 227 | # fig: Figure = plt.figure() 228 | # plt.title(f'FastText Norm - TF - {plot_title}') 229 | # plt.xlabel(tf_label) 230 | # plt.xscale('log') 231 | # plt.ylabel('standard norm (Gensim)') 232 | # ax: Axes = fig.add_subplot(1, 1, 1) 233 | # ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 234 | # ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 235 | words = [] 236 | for i in word_idxs: 237 | i = i[0] 238 | word = wv.index2word[i] 239 | tf = wv.vocab[word].count 240 | norm = norms[i] 241 | # ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 242 | words.append(word) 243 | pd.DataFrame(data=dict(word=words)).to_csv(f'data/words-{file_name}.csv', index=False, header=False) 244 | # ax.grid(True, which='both') 245 | # ax.legend() 246 | # fig.tight_layout() 247 | # fig.savefig(f'data/standard_norm-tf-{file_name}.png') 248 | # fig.show() 249 | # plt.clf() 250 | 251 | 252 | # norms, tfs = calc_norms(standard_vec) 253 | common_words_norm, common_words_tfs = common_words_norms(standard_vec) 254 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 255 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 4600_000, 7000_000, 0.44, 0.47)[:20], 'Bottom Right Cluster', 'bottom-right-cluster') 256 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 4600_000, 7000_000, 1.26, 1.3)[:20], 'Middle Right Cluster', 'middle-right-cluster') 257 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 4600_000, 7000_000, 2.4, 2.71)[:20], 'Top Right Cluster', 'top-right-cluster') 258 | plot_words(select_word_indexes(common_words_norm, common_words_tfs, 55_000, 70_000, 0.53, 0.6)[:20], 'Bottom Left Cluster', 'bottom-left-cluster') 259 | 260 | 261 | #%% def plot_no_ngram(): 262 | norms, tfs = calc_norms(no_ngram_vec) 263 | # sorted_idxs = matutils.argsort(norms, reverse=True) 264 | rnd_word_idx = [ 265 | # sorted_idxs[400000], sorted_idxs[800000], sorted_idxs[1200000], sorted_idxs[1600000], sorted_idxs[1800000] 266 | # select_word_index(norms, tfs, 5_000, 10_000, 4.7, 5.1), 267 | select_word_index(norms, tfs, 70_000, 80_000, 5, 5.1), 268 | select_word_index(norms, tfs, 70_000, 80_000, 10, 11), 269 | select_word_index(norms, tfs, 70_000, 80_000, 15, 20), 270 | select_word_index(norms, tfs, 900_000, 1000_000, 4.7, 5), 271 | select_word_index(norms, tfs, 15_000_000, 17_000_000, 3, 6.0), 272 | ] 273 | seaborn.set(style='white',
rc={'figure.figsize': (12, 8)}) 274 | fig: Figure = plt.figure() 275 | plt.title('FastText Whole Word Token (no ngram) - TF') 276 | plt.xlabel(tf_label) 277 | plt.xscale('log') 278 | plt.ylabel('norm of the whole words without sub-ngrams') 279 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 280 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 281 | 282 | common_words_norm, common_words_tfs = common_words_norms(no_ngram_vec) 283 | ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 284 | 285 | for i in rnd_word_idx: 286 | word = wv.index2word[i] 287 | tf = wv.vocab[word].count 288 | norm = norms[i] 289 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 290 | 291 | ax.grid(True, which='both') 292 | plt.ylim(0, 40) 293 | ax.legend() 294 | fig.tight_layout() 295 | fig.savefig('data/no_ngram_norm-tf.png') 296 | fig.show() 297 | 298 | 299 | #%% list norms for hypernyms and hyponyms 300 | 301 | pd.set_option('display.max_rows', 500) 302 | pd.set_option('display.max_columns', 500) 303 | pd.set_option('display.width', 1000) 304 | 305 | 306 | def get_norm_tuple(word: str): 307 | standard_norm = LA.norm(standard_vec(word)) 308 | no_ngram_norm = LA.norm(no_ngram_vec(word)) 309 | ng_norm = LA.norm(ng_norm_vec(word)) 310 | count = wv.vocab[word].count 311 | return (word, standard_norm, no_ngram_norm, ng_norm, count) 312 | 313 | 314 | def rel_perc_diff(val1, val2): 315 | return (val1 - val2) / val2 * 100 316 | 317 | 318 | all_norms = pd.DataFrame(columns=['word', 'standard_norm', 'no_ngram_norm', 'ng_norm', 'count']) 319 | hypo_norm_rel_perc_diff = pd.DataFrame(columns=['hyper', 'hypo', 'standard_norm', 'no_ngram_norm', 'ng_norm', 'count']) 320 | for hypernym, hyponyms in { 321 | 'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], 322 | 'color': ['red', 'blue', 'green', 'white', 'orange', 'purple', 'black', 'pink', 'yellow', 'cyan', 'violet', 'grey'], 323 | 'animal': ['dog', 'cat', 'bird', 'reptile', 'fish', 'cow', 'insect', 'fly', 'mammal'], 324 | 'tool': ['hammer', 'screwdriver', 'drill', 'handsaw', 'knife', 'wrench', 'pliers'], 325 | 'fruit': ['banana', 'apple', 'pear', 'peach', 'orange', 'pineapple', 'lemon', 'pomegranate', 'grape', 'strawberries'], 326 | 'flower': ['peony', 'rose', 'lily', 'tulip', 'sunflower', 'marigold', 'orchid'], 327 | 'tree': ['pine', 'pear', 'maple', 'oak', 'aspen', 'spruce', 'larch', 'linden', 'juniper', 'birch', 'elm'] 328 | }.items(): 329 | hyper_norms = get_norm_tuple(hypernym) 330 | all_norms.loc[all_norms.shape[0]] = hyper_norms 331 | for hyponym in hyponyms: 332 | hypo_norms = get_norm_tuple(hyponym) 333 | all_norms.loc[all_norms.shape[0]] = hypo_norms 334 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = ( 335 | hypernym, 336 | hyponym, 337 | rel_perc_diff(hypo_norms[1], hyper_norms[1]), 338 | rel_perc_diff(hypo_norms[2], hyper_norms[2]), 339 | rel_perc_diff(hypo_norms[3], hyper_norms[3]), 340 | rel_perc_diff(hypo_norms[4], hyper_norms[4]) 341 | ) 342 | 343 | # hypo_norm_rel_perc_diff: pd.DataFrame = hypo_norm_rel_perc_diff.loc[lambda df: df['count'].abs() < 30] 344 | # hypo_norm_rel_perc_diff.reset_index(inplace=True, drop=True) 345 | 346 | averages = ( 347 | 'average', 348 | '', 349 | hypo_norm_rel_perc_diff['standard_norm'].mean(), 350 | hypo_norm_rel_perc_diff['no_ngram_norm'].mean(), 351 | hypo_norm_rel_perc_diff['ng_norm'].mean(), 352
| hypo_norm_rel_perc_diff['count'].mean(), 353 | ) 354 | 355 | counts = ( 356 | 'counts', 357 | '', 358 | np.argwhere(hypo_norm_rel_perc_diff['standard_norm'] > 0).shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 359 | np.argwhere(hypo_norm_rel_perc_diff['no_ngram_norm'] > 0).shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 360 | np.argwhere(hypo_norm_rel_perc_diff['ng_norm'] > 0).shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 361 | np.nan, 362 | ) 363 | 364 | counts_selected = ( 365 | 'counts selected', 366 | '', 367 | hypo_norm_rel_perc_diff.loc[lambda df: (df['standard_norm'] > 0)].shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 368 | hypo_norm_rel_perc_diff.loc[lambda df: (df['no_ngram_norm'] > 0)].shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 369 | hypo_norm_rel_perc_diff.loc[lambda df: (df['ng_norm'] > 0)].shape[0] / hypo_norm_rel_perc_diff.shape[0] * 100, 370 | np.nan, 371 | ) 372 | 373 | 374 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = averages 375 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = counts 376 | hypo_norm_rel_perc_diff.loc[hypo_norm_rel_perc_diff.shape[0]] = counts_selected 377 | 378 | print('all norms') 379 | print(all_norms) 380 | print() 381 | print('rel perc norm diff') 382 | print(hypo_norm_rel_perc_diff) 383 | 384 | hypo_norm_rel_perc_diff.to_html('data/hypo_norm_rel_perc_diff.html') 385 | hypo_norm_rel_perc_diff.to_csv('data/hypo_norm_rel_perc_diff.csv', index=False) 386 | 387 | 388 | #%% def plot_ng_norm_vec_norms(): 389 | norms, tfs = calc_norms(ng_norm_vec) 390 | # sorted_idxs = matutils.argsort(norms, reverse=True) 391 | rnd_word_idx = [ 392 | # sorted_idxs[400000], sorted_idxs[800000], sorted_idxs[1200000], sorted_idxs[1600000], sorted_idxs[1800000] 393 | # select_word_index(norms, tfs, 5_000, 10_000, 2.4, 2.6), 394 | select_word_index(norms, tfs, 90_000, 100_000, 2.4, 2.6), 395 | select_word_index(norms, tfs, 90_000, 100_000, 5, 5.1), 396 | select_word_index(norms, tfs, 90_000, 100_000, 10, 11), 397 | select_word_index(norms, tfs, 900_000, 1000_000, 2.4, 2.6), 398 | # select_word_index(norms, tfs, 3000_000, 4000_000, 2.4, 2.6), 399 | select_word_index(norms, tfs, 15_000_000, 17_000_000, 2.4, 2.6), 400 | ] 401 | 402 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 403 | fig: Figure = plt.figure() 404 | plt.title('FastText NG_Norm - TF') 405 | plt.xlabel(tf_label) 406 | plt.xscale('log') 407 | plt.ylabel(ng_norm_label) 408 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 409 | ax.scatter(tfs, norms, alpha=0.6, edgecolors='none', s=5, label=fasttext_model_vocab_label) 410 | 411 | common_words_norm, common_words_tfs = common_words_norms(ng_norm_vec) 412 | ax.scatter(common_words_tfs, common_words_norm, alpha=0.8, edgecolors='none', s=5, label=mit_10k_common_label) 413 | 414 | for i in rnd_word_idx: 415 | word = wv.index2word[i] 416 | tf = wv.vocab[word].count 417 | norm = norms[i] 418 | ax.scatter([tf], [norm], alpha=1, edgecolors='black', s=30, label=word) 419 | 420 | ax.grid(True, which='both') 421 | plt.ylim(0, 30) 422 | ax.legend() 423 | fig.tight_layout() 424 | fig.savefig('data/ng_norm-tf.png') 425 | fig.show() 426 | 427 | # %% def plot_histogram_of_common_and_vocab_ng_norms(): 428 | norms, _ = calc_norms(ng_norm_vec) 429 | print(f'vecs norms avg: {np.average(norms[np.isfinite(norms)], axis=0)}') 430 | print(f'norms: {norms[0:10].tolist()}') 431 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 432 | fig: Figure = plt.figure() 433 | plt.title('FastText NG_Norm Distribution')
434 | plt.ylabel('Word Count Density') 435 | plt.xlabel(ng_norm_label) # x axis shows ng_norm values, not term frequency 436 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 437 | bins = np.linspace(0, 10, 100) 438 | ax.hist(norms, bins=bins, alpha=0.5, label=fasttext_model_vocab_label, density=True) 439 | common_norms, _ = common_words_norms(ng_norm_vec) 440 | print(f'common vecs norms avg: {np.average(common_norms[np.isfinite(common_norms)], axis=0)}') 441 | ax.hist(common_norms, bins=bins, alpha=0.5, label=mit_10k_common_label, density=True) 442 | ax.grid(True, which='both') 443 | ax.legend() 444 | fig.tight_layout() 445 | fig.savefig('data/ng_norm-hist.png') 446 | fig.show() 447 | 448 | 449 | #%% calc_and_store_ng_norm_density_histogram(): 450 | density_histogram, bins = common_word_norm_density_histogram() 451 | # np.savetxt('data/hist-probability.txt', probability_histogram) 452 | ng_norms = pd.Series(bins).rolling(window=2).mean().iloc[1:].values 453 | pd.DataFrame({'density': density_histogram, 'ng_norm': ng_norms}).to_csv('data/ng-norm-density-hist.csv', index=False) 454 | np.savetxt('data/hist-bins.txt', bins) 455 | 456 | 457 | #%% 458 | def run_plot_density_histogram(): 459 | pdf_df = pd.read_csv('data/ng-norm-density-hist.csv') 460 | density_histogram = pdf_df['density'].values 461 | ng_norms = pdf_df['ng_norm'] 462 | 463 | bin_width = ng_norms[1] - ng_norms[0] 464 | fitfunc = lambda mu, sigma, df, x: t.pdf(x, df, mu, sigma) 465 | errfunc = lambda p, x, y: fitfunc(p[0], p[1], p[2], x) - y 466 | 467 | mean = np.sum(density_histogram * bin_width * ng_norms) 468 | sigma = np.sqrt(np.sum(density_histogram * bin_width * (ng_norms - mean) ** 2)) 469 | starting_param = np.array([mean, sigma, 3]) 470 | print(f'starting p {starting_param}') 471 | p, success = leastsq(errfunc, starting_param, args=(ng_norms, density_histogram), full_output=False) 472 | print(f'fitted_density params: {p}, success value: {success}') 473 | fitted_density = fitfunc(p[0], p[1], p[2], ng_norms) 474 | 475 | seaborn.set(style='white', rc={'figure.figsize': (12, 8)}) 476 | fig: Figure = plt.figure() 477 | plt.title(mit_10k_common_label + ' FastText NG-Norm Density Histogram') 478 | plt.ylabel('Probability Density') 479 | plt.xlabel(ng_norm_label) 480 | ax: Axes = fig.add_subplot(1, 1, 1) #axisbg="1.0") 481 | ax.bar(ng_norms, density_histogram, label='original distribution', color='grey', alpha=1, width=bin_width, edgecolor='grey') 482 | # ax.plot(X, fitted_density, label='fitted_density', color='green', alpha=0.5, width=bin_width) #, linestyle='--') 483 | fit_label = f'fitted t-distribution (df: {np.round(p[2], 1)}, mean: {np.round(p[0], 1)}, scale: {np.round(p[1], 1)})' 484 | ax.plot(ng_norms, fitted_density, label=fit_label, color='orange', alpha=1, linestyle='--') 485 | ax.grid(True, which='both') 486 | ax.legend() 487 | fig.tight_layout() 488 | fig.savefig('data/ng_norm-common-density.png') 489 | fig.show() 490 | 491 | 492 | run_plot_density_histogram() 493 | 494 | 495 | #%% print_word_separation 496 | def word_split_probability(text: str, density_histogram, bins): 497 | bin_width = bins[1] - bins[0] 498 | text = text.lower() 499 | df = pd.DataFrame(index=range(1, len(text)), columns=['word1', 'word2', 'norm1', 'norm2', 'prob1', 'prob2', 'prob']) 500 | for i in df.index: 501 | word1 = text[:i] 502 | word2 = text[i:] 503 | df.loc[i, 'word1'] = word1 504 | df.loc[i, 'word2'] = word2 505 | df.loc[i, 'norm1'] = LA.norm(ng_norm_vec(word1)) 506 | df.loc[i, 'norm2'] = LA.norm(ng_norm_vec(word2)) 507 | 508 | try: 509 | df['prob1'] =
density_histogram[np.digitize(df['norm1'].values, bins)] * bin_width 510 | df['prob2'] = density_histogram[np.digitize(df['norm2'].values, bins)] * bin_width 511 | df['prob'] = df['prob1'].values * df['prob2'].values 512 | 513 | except IndexError as e: 514 | raise ValueError(f'Failed to find in histogram one of {df[["word1", "word2", "norm1", "norm2"]]}') from e 515 | 516 | return df 517 | 518 | 519 | density_histogram = pd.read_csv('data/ng-norm-density-hist.csv')['density'].values 520 | bins = np.loadtxt('data/hist-bins.txt') 521 | 522 | text = 'inflationlithium' 523 | # text = 'goldtwenty' 524 | # text = 'goldvector' 525 | 526 | splits = word_split_probability(text, density_histogram, bins) 527 | with pd.option_context('display.max_rows', None, 'display.max_columns', None): # more options can be specified also 528 | print(splits) 529 | 530 | with open('data/ng_norm-split-sample.html', 'w') as f: 531 | splits.to_html(f, index=False) 532 | 533 | max_idx = np.argmax(splits['prob'].values) 534 | print(f'max idx {max_idx}, {list(splits.loc[max_idx + 1, ["word1", "word2"]].iteritems())}') 535 | 536 | 537 | #%% split common: measure how often the most probable split recovers the true boundary of two concatenated common words 538 | bins = np.loadtxt('data/hist-bins.txt') 539 | density_histogram = pd.read_csv('data/ng-norm-density-hist.csv')['density'].values 540 | 541 | words = read_mit_10k_words() 542 | np.random.seed(42) 543 | words1 = words[np.random.randint(0, 9900, size=3000)] 544 | words2 = words[np.random.randint(0, 9900, size=3000)] 545 | 546 | match_count = 0 547 | total_count = 0 548 | for i in range(words1.shape[0]): 549 | word1 = words1[i] 550 | word2 = words2[i] 551 | # print(word1) 552 | if word1 in wv.vocab and word2 in wv.vocab: 553 | total_count += 1 554 | splits = word_split_probability(word1 + word2, density_histogram, bins) 555 | # print(splits) 556 | max_idx = splits['prob'].idxmax() 557 | # print(max_idx) 558 | best_word1 = splits.loc[max_idx, 'word1'] 559 | # print(best_word1) 560 | if len(best_word1) == len(word1): 561 | match_count += 1 562 | 563 | else: 564 | pass 565 | # best_word2 = splits.loc[max_idx, 'word2'] 566 | # print(f'invalid split {best_word1} {best_word2}') 567 | 568 | print(f'match_count {match_count}, total_count {total_count}, accuracy {100 * match_count / total_count}') 569 | --------------------------------------------------------------------------------