├── FAQ.md
├── License.md
├── emb-from-suc.md
├── README.md
└── examples
    ├── Ejemplo_WordVectors.md
    └── Ejemplo_WordVectors.ipynb

/FAQ.md:
--------------------------------------------------------------------------------
1 | # FAQ
2 | 
3 | ### How to use them?
4 | 
5 | Please check out our [tutorial](https://github.com/dccuchile/spanish-word-embeddings/blob/master/examples/Ejemplo_WordVectors.md).
6 | 
7 | ### Are the embeddings ordered in any way?
8 | 
9 | Yes, the embeddings are ordered by word frequency.
10 | 
11 | ### How can I get the frequencies of the words?
12 | 
13 | For the FastText models, you can obtain the frequencies of the words with the following code:
14 | 
15 |     import fasttext
16 |     model = fasttext.load_model("your_embedding_model.bin")  # path to a downloaded .bin model
17 |     palabras, frecuencias = model.get_words(include_freq=True)  # parallel lists: words and their frequencies
18 | 
19 | ### My question is not here
20 | 
21 | Please feel free to create a new [Issue](https://github.com/dccuchile/spanish-word-embeddings/issues) with your questions or comments.
22 | 
--------------------------------------------------------------------------------
/License.md:
--------------------------------------------------------------------------------
1 | # Spanish Word Embeddings License
2 | 
3 | ## [FastText embeddings from SBWC](https://github.com/uchile-nlp/spanish-word-embeddings#fasttext-embeddings-from-sbwc)
4 | 
5 | You can use these vectors as you wish under the CC-BY-4.0 license.
6 | 
7 | ## [GloVe embeddings from SBWC](https://github.com/uchile-nlp/spanish-word-embeddings#glove-embeddings-from-sbwc)
8 | 
9 | You can use these vectors as you wish under the CC-BY-4.0 license.
10 | 
11 | ## [FastText embeddings from Spanish Wikipedia](https://github.com/uchile-nlp/spanish-word-embeddings#fasttext-embeddings-from-spanish-wikipedia)
12 | 
13 | Please refer to the [FastText Pre-trained Vectors page](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md) if you want to use these vectors.
14 | 
15 | ## [Word2Vec embeddings from SBWC](https://github.com/uchile-nlp/spanish-word-embeddings#word2vec-embeddings-from-sbwc)
16 | 
17 | Please refer to the [SBWCE page](http://crscardellino.me/SBWCE/) if you want to use these vectors.
18 | 
--------------------------------------------------------------------------------
/emb-from-suc.md:
--------------------------------------------------------------------------------
1 | ## FastText embeddings from SUC
2 | 
3 | Below you find embeddings of different sizes computed from the [Spanish Unannotated Corpora](https://github.com/josecannete/spanish-corpora).
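The `.vec` files below are in the plain word2vec text format and contain only the final word vectors, while the `.bin` files also keep the trained subword model, so they can produce vectors for words that never appeared in the corpus. A minimal sketch of the latter with the [fastText Python bindings](https://github.com/facebookresearch/fastText/tree/master/python); the local file name is an assumption and corresponds to the XS download below:

```python
# Sketch: a .bin model can embed out-of-vocabulary words, because FastText
# composes word vectors from character n-grams (lengths 3 to 6 for these models).
import fasttext

model = fasttext.load_model("embeddings-xs-model.bin")  # assumed local file name
vector = model.get_word_vector("palabrainventada")      # a made-up word still gets a vector
print(vector.shape)                                     # (10,) for the XS model
```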
4 | 
5 | #### Embeddings
6 | Links to the embeddings:
7 | ##### XS (#dimensions=10, #vectors=1313423):
8 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-xs-model.vec?download=1) (122 MB)
9 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-xs-model.bin?download=1) (209 MB)
10 | ##### S (#dimensions=30, #vectors=1313423):
11 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-s-model.vec?download=1) (348 MB)
12 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-s-model.bin?download=1) (579 MB)
13 | ##### M (#dimensions=100, #vectors=1313423):
14 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-m-model.vec?download=1) (1.1 GB)
15 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-m-model.bin?download=1) (1.9 GB)
16 | ##### L (#dimensions=300, #vectors=1313423):
17 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-l-model.vec?download=1) (3.4 GB)
18 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-l-model.bin?download=1) (5.6 GB)
19 | ##### new L (#dimensions=300, #vectors=1451827):
20 | - [Vector format (.vec)](https://zenodo.org/record/3255001/files/embeddings-new_large-general_3B_fasttext.vec?download=1) (3.8 GB)
21 | - [Binary format (.bin)](https://zenodo.org/record/3255001/files/embeddings-new_large-general_3B_fasttext.bin?download=1) (5.9 GB)
22 | 
23 | #### Algorithm
24 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
25 | - Parameters:
26 |     - min subword-ngram = 3
27 |     - max subword-ngram = 6
28 |     - minCount = 5
29 |     - epochs = 20
30 |     - dim = 10, 30, 100, 300, 300 (one value per model above)
31 |     - all other parameters set as default
32 | 
33 | #### Corpus
34 | - [Spanish Unannotated Corpora](https://github.com/josecannete/spanish-corpora)
35 | - Corpus size: 2.6 billion words (3 billion words for the new L model)
36 | - Post-processing: explained in the [Embeddings](https://github.com/BotCenter/spanishWordEmbeddings) and [Corpora](https://github.com/josecannete/spanish-corpora) repos; it includes tokenization, lowercasing, and removal of listings and URLs.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Spanish Word Embeddings
2 | 
3 | Below you find links to Spanish word embeddings computed with different methods and from different corpora. Whenever possible, a description of the parameters used to compute the embeddings is included, together with simple statistics of the vectors and vocabulary, and a description of the corpus from which the embeddings were computed. Direct links to the embeddings are provided; please refer to the original sources for proper citation (see also [References](#references)). Examples of the use of some of these embeddings can be found [here](examples/Ejemplo_WordVectors.md) and in this [tutorial](https://github.com/mquezada/starsconf2018-word-embeddings) (both in Spanish).
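As a minimal quick start (a sketch: it assumes `gensim` is installed and that the SBWC FastText `.vec.gz` file linked in row 2 of the table below has been downloaded):

```python
# Sketch: load the first 100,000 vectors of a downloaded .vec file with gensim.
from gensim.models.keyedvectors import KeyedVectors

wordvectors = KeyedVectors.load_word2vec_format("fasttext-sbwc.vec.gz", limit=100000)
print(wordvectors.most_similar("rey"))  # nearest neighbors of 'rey'
```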
4 | 
5 | Summary of (and links to) the embeddings on this page:
6 | 
7 | | |Corpus |Size |Algorithm |#vectors |vec-dim |Credits |
8 | |---|-----------|----:|-----------|---------:|---------:|-----------|
9 | |[1](#fasttext-embeddings-from-suc)|Spanish Unannotated Corpora|2.6B|FastText|1,313,423|300|[José Cañete](https://github.com/josecannete)|
10 | |[2](#fasttext-embeddings-from-sbwc)|Spanish Billion Word Corpus|1.4B|FastText|855,380|300|[Jorge Pérez](https://github.com/jorgeperezrojas)|
11 | |[3](#glove-embeddings-from-sbwc)|Spanish Billion Word Corpus|1.4B|GloVe|855,380|300|[Jorge Pérez](https://github.com/jorgeperezrojas)|
12 | |[4](#word2vec-embeddings-from-sbwc)|Spanish Billion Word Corpus|1.4B|Word2Vec|1,000,653|300|[Cristian Cardellino](https://github.com/crscardellino)|
13 | |[5](#fasttext-embeddings-from-spanish-wikipedia)|Spanish Wikipedia|???|FastText|985,667|300|[FastText team](https://github.com/facebookresearch/fastText)|
14 | 
15 | 
16 | ## FastText embeddings from SUC
17 | 
18 | #### Embeddings
19 | Links to the embeddings (#dimensions=300, #vectors=1,313,423):
20 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-l-model.vec?download=1) (3.4 GB)
21 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-l-model.bin?download=1) (5.6 GB)
22 | 
23 | More vectors of different dimensions (10, 30, 100, and 300) can be found [here](emb-from-suc.md).
24 | 
25 | #### Algorithm
26 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
27 | - Parameters:
28 |     - min subword-ngram = 3
29 |     - max subword-ngram = 6
30 |     - minCount = 5
31 |     - epochs = 20
32 |     - dim = 300
33 |     - all other parameters set as default (see the training sketch at the end of this section)
34 | 
35 | #### Corpus
36 | - [Spanish Unannotated Corpora](https://github.com/josecannete/spanish-corpora)
37 | - Corpus size: 2.6 billion words
38 | - Post-processing: explained in the [Embeddings](https://github.com/BotCenter/spanishWordEmbeddings) and [Corpora](https://github.com/josecannete/spanish-corpora) repos; it includes tokenization, lowercasing, and removal of listings and URLs.
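For reference, the parameters above correspond to a training call along the following lines (a sketch using the [fastText Python bindings](https://github.com/facebookresearch/fastText/tree/master/python); the corpus path is a placeholder, not the original training script):

```python
# Sketch of the training setup described above (paths are placeholders).
import fasttext

model = fasttext.train_unsupervised(
    "suc_corpus.txt",   # placeholder: the tokenized, lowercased corpus file
    model="skipgram",
    minn=3, maxn=6,     # min/max subword n-gram lengths
    minCount=5,
    epoch=20,
    dim=300,
)
model.save_model("embeddings-l-model.bin")
```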
39 | 
40 | ## FastText embeddings from SBWC
41 | 
42 | #### Embeddings
43 | Links to the embeddings (#dimensions=300, #vectors=855,380):
44 | - [Vector format (.vec.gz)](http://dcc.uchile.cl/~jperez/word-embeddings/fasttext-sbwc.vec.gz) (802 MB)
45 | - [Binary format (.bin)](http://dcc.uchile.cl/~jperez/word-embeddings/fasttext-sbwc.bin) (4.2 GB)
46 | 
47 | #### Algorithm
48 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
49 | - Parameters:
50 |     - min subword-ngram = 3
51 |     - max subword-ngram = 6
52 |     - minCount = 5
53 |     - epochs = 20
54 |     - dim = 300
55 |     - all other parameters set as default
56 | 
57 | #### Corpus
58 | - [Spanish Billion Word Corpus](http://crscardellino.github.io/SBWCE/)
59 | - Corpus size: 1.4 billion words
60 | - Post-processing: besides the post-processing of the raw corpus explained on the [SBWCE page](http://crscardellino.github.io/SBWCE/) (deletion of punctuation, numbers, etc.), the following processing was applied:
61 |     - Words were converted to lowercase
62 |     - Every sequence of the keyword 'DIGITO' was replaced by a single '0'
63 |     - All words consisting of more than 3 characters plus a '0' were omitted (example: 'padre0')
64 | 
65 | ## GloVe embeddings from SBWC
66 | 
67 | #### Embeddings
68 | Links to the embeddings (#dimensions=300, #vectors=855,380):
69 | - [Vector format (.vec.gz)](http://dcc.uchile.cl/~jperez/word-embeddings/glove-sbwc.i25.vec.gz) (906 MB)
70 | - [Binary format (.bin)](http://dcc.uchile.cl/~jperez/word-embeddings/glove-sbwc.i25.bin) (3.9 GB)
71 | 
72 | #### Algorithm
73 | - Implementation: [GloVe](https://github.com/stanfordnlp/GloVe)
74 | - Parameters:
75 |     - vector-size = 300
76 |     - iter = 25
77 |     - min-count = 5
78 |     - all other parameters set as default
79 | 
80 | #### Corpus
81 | - [Spanish Billion Word Corpus](http://crscardellino.github.io/SBWCE/) (see above)
82 | 
83 | ## Word2Vec embeddings from SBWC
84 | 
85 | #### Embeddings
86 | Links to the embeddings (#dimensions=300, #vectors=1,000,653):
87 | - [Vector format (.txt.bz2)](http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.txt.bz2)
88 | - [Binary format (.bin.gz)](http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz)
89 | 
90 | #### Algorithm
91 | - Implementation: [Word2Vec with Skipgram by GenSim](https://radimrehurek.com/gensim/models/word2vec.html)
92 | - Parameters: for details on the parameters, please refer to the [SBWCE page](http://crscardellino.github.io/SBWCE/)
93 | 
94 | #### Corpus
95 | - [Spanish Billion Word Corpus](http://crscardellino.github.io/SBWCE/)
96 | - Corpus size: 1.4 billion words
97 | 
98 | 
99 | ## FastText embeddings from Spanish Wikipedia
100 | 
101 | #### Embeddings
102 | Links to the embeddings (#dimensions=300, #vectors=985,667):
103 | - [Vector format (.vec)](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec) (2.4 GB)
104 | - [Binary plus vector format (.zip)](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.zip) (5.4 GB)
105 | 
106 | #### Algorithm
107 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
108 | - Parameters: FastText default parameters
109 | 
110 | #### Corpus
111 | - [Wikipedia Spanish Dump](https://archive.org/details/eswiki-20150105)
112 | 
113 | 
114 | 
115 | ## References
116 | 
117 | - FastText embeddings from SUC: Word embeddings were computed by [José Cañete](https://github.com/josecannete) at [BotCenter](https://github.com/BotCenter). You can use these vectors as you wish under the MIT license. Please refer to the [BotCenter Embeddings repo](https://github.com/BotCenter/spanishWordEmbeddings) for further discussion. You may also want to cite the FastText paper [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606).
118 | - FastText embeddings from SBWC: Word embeddings were computed by [Jorge Pérez](https://github.com/jorgeperezrojas). You can use these vectors as you wish under the CC-BY-4.0 license. You may also want to cite the FastText paper [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) and the [Spanish Billion Word Corpus project](http://crscardellino.github.io/SBWCE/).
119 | - GloVe embeddings from SBWC: Word embeddings were computed by [Jorge Pérez](https://github.com/jorgeperezrojas). You can use these vectors as you wish under the CC-BY-4.0 license. You may also want to cite the GloVe paper [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf) and the [Spanish Billion Word Corpus project](http://crscardellino.github.io/SBWCE/).
120 | - FastText embeddings from Spanish Wikipedia: Word embeddings were computed by the [FastText team](https://github.com/facebookresearch/fastText).
121 | Please refer to the [FastText Pre-trained Vectors page](https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md) if you want to use these vectors.
122 | - Word2Vec embeddings from SBWC: Word embeddings were computed by [Cristian Cardellino](https://github.com/crscardellino). Please refer to the [SBWCE page](http://crscardellino.github.io/SBWCE/) if you want to use these vectors.
123 | 
--------------------------------------------------------------------------------
/examples/Ejemplo_WordVectors.md:
--------------------------------------------------------------------------------
1 | 
2 | # Ejemplos de uso de word embeddings computados con FastText
3 | 
4 | Primero cargamos los vectores/embeddings usando [gensim](https://radimrehurek.com/gensim/). Hay al menos dos formas posibles. La primera es cargar todos los vectores desde el archivo binario (.bin) en su formato nativo de FastText. Esta opción es más demandante en recursos (tiempo y memoria), pero es mucho más versátil, por ejemplo, para obtener vectores de palabras que no se encuentran en el vocabulario. Esta forma se encuentra comentada en la siguiente celda.
5 | 
6 | 
7 | ```python
8 | # opción 1: cargar todos los vectores desde el formato binario (lento, requiere mucha memoria)
9 | # from gensim.models.wrappers import FastText  # en gensim >= 4.0 usar gensim.models.fasttext.load_facebook_model
10 | # wordvectors_file = 'fasttext-sbwc.3.6.e20'
11 | # wordvectors = FastText.load_fasttext_format(wordvectors_file)
12 | ```
13 | 
14 | La segunda forma, mucho más rápida, es cargar sólo una parte de los vectores. Para esto usamos el formato nativo de word2vec y cargamos una cantidad fija de vectores (se pueden cargar vectores generados por diversos métodos, como FastText).
15 | 
16 | 
17 | ```python
18 | # opción 2: cargar una cantidad fija de vectores (más rápido, dependiendo de la cantidad cargada)
19 | from gensim.models.keyedvectors import KeyedVectors
20 | wordvectors_file_vec = 'fasttext-sbwc.3.6.e20.vec'
21 | cantidad = 100000
22 | wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)
23 | ```
24 | 
25 | ## Word vectors en analogías
26 | 
27 | Ejemplo de uso: `most_similar_cosmul(positive=lista_palabras_positivas, negative=lista_palabras_negativas)`
28 | 
29 | Esta llamada encuentra las palabras del vocabulario que están más cercanas a las palabras en `lista_palabras_positivas` y que no están cercanas a las de `lista_palabras_negativas` (para una formalización del procedimiento, ver la fórmula (4) en la Sección 6 de [este artículo](http://www.aclweb.org/anthology/W14-1618)).
30 | 
31 | Cuando `lista_palabras_positivas` contiene dos palabras, digamos `a` y `b_p`, y `lista_palabras_negativas` contiene una palabra, digamos `a_p`, el anterior procedimiento se lee coloquialmente como encontrar la palabra `b` que responde a la pregunta: `a_p` es a `a` como `b_p` es a ???. El ejemplo clásico se tiene cuando `a` es `rey`, `b_p` es `mujer`, y `a_p` es `hombre`. La palabra buscada `b` es `reina`, pues `hombre` es a `rey` como `mujer` es a `reina`. (Personalmente considero que la intuición de palabras más lejanas y más cercanas es mucho mejor que la de analogías, pero la de analogías es más común en los tutoriales de word embeddings.)
32 | 
33 | ### Ejemplos considerando género
34 | 
35 | 
36 | ```python
37 | wordvectors.most_similar_cosmul(positive=['rey','mujer'],negative=['hombre'])
38 | ```
39 | 
40 | 
41 | 
42 | 
43 |     [('reina', 0.9141066670417786),
44 |      ('isabel', 0.8743277192115784),
45 |      ('princesa', 0.843113124370575),
46 |      ('infanta', 0.8425983190536499),
47 |      ('monarca', 0.8357319831848145),
48 |      ('hija', 0.8211697340011597),
49 |      ('consorte', 0.8179485201835632),
50 |      ('iv', 0.813984215259552),
51 |      ('esposa', 0.8115168213844299),
52 |      ('ii', 0.8099035620689392)]
53 | 
54 | 
55 | 
56 | 
57 | ```python
58 | wordvectors.most_similar_cosmul(positive=['actor','mujer'],negative=['hombre'])
59 | ```
60 | 
61 | 
62 | 
63 | 
64 |     [('actriz', 0.9732905030250549),
65 |      ('actores', 0.8580312728881836),
66 |      ('actrices', 0.8464058041572571),
67 |      ('cantante', 0.8347789645195007),
68 |      ('reparto', 0.8277631402015686),
69 |      ('protagonista', 0.8202100396156311),
70 |      ('invitada', 0.8101590871810913),
71 |      ('papel', 0.8021049499511719),
72 |      ('guionista', 0.7968517541885376),
73 |      ('intérprete', 0.7961310744285583)]
74 | 
75 | 
76 | 
77 | 
78 | ```python
79 | wordvectors.most_similar_cosmul(positive=['hijo','mujer'],negative=['hombre'])
80 | ```
81 | 
82 | 
83 | 
84 | 
85 |     [('hija', 0.9856907725334167),
86 |      ('esposa', 0.9255169034004211),
87 |      ('hijos', 0.9249492883682251),
88 |      ('madre', 0.9138885736465454),
89 |      ('hermana', 0.8996301889419556),
90 |      ('hijas', 0.8754291534423828),
91 |      ('casó', 0.8729564547538757),
92 |      ('matrimonio', 0.8709645867347717),
93 |      ('viuda', 0.8557067513465881),
94 |      ('casada', 0.8546223044395447)]
95 | 
96 | 
97 | 
98 | 
99 | ```python
100 | wordvectors.most_similar_cosmul(positive=['yerno','mujer'],negative=['hombre'])
101 | ```
102 | 
103 | 
104 | 
105 | 
106 |     [('nuera', 0.9055585861206055),
107 |      ('cuñada', 0.8592773079872131),
108 |      ('esther', 0.8199110627174377),
109 |      ('sobrina', 0.8171849846839905),
110 |      ('suegra', 0.8157253265380859),
111 |      ('hija', 0.8014461398124695),
112 |      ('infanta', 0.8008802533149719),
113 |      ('esposa', 0.8008227944374084),
114 |      ('nieta', 0.7964767813682556),
115 |      ('cuñado', 0.7955604195594788)]
116 | 
117 | 
118 | 
119 | ### Ejemplos considerando conjugaciones
120 | 
121 | 
122 | ```python
123 | wordvectors.most_similar_cosmul(positive=['jugar','canta'],negative=['cantar'])
124 | ```
125 | 
126 | 
127 | 
128 | 
129 |     [('juega', 0.8944003582000732),
130 |      ('jugando', 0.8376926183700562),
131 |      ('jugará', 0.834348201751709),
132 |      ('jugador', 0.8295056819915771),
133 |      ('jugó', 0.8156978487968445),
134 |      ('jugado', 0.8147079348564148),
135 |      ('futbolista', 0.7927162647247314),
136 |      ('juegue', 0.7921290397644043),
137 |      ('fútbol', 0.7888965606689453),
138 |      ('juegan', 0.7832154631614685)]
139 | 
140 | 
141 | 
142 | 
143 | ```python
144 | wordvectors.most_similar_cosmul(positive=['jugar','cantaría'],negative=['cantar'])
145 | ```
146 | 
147 | 
148 | 
149 | 
150 |     [('jugaría', 0.8204259276390076),
151 |      ('jugará', 0.7848052382469177),
152 |      ('juegue', 0.7704501152038574),
153 |      ('jugara', 0.7684974670410156),
154 |      ('ganamos', 0.7370696067810059),
155 |      ('disputaría', 0.7334685325622559),
156 |      ('perderá', 0.7326226234436035),
157 |      ('lesionó', 0.723604679107666),
158 |      ('perdería', 0.7234238386154175),
159 |      ('jugó', 0.7223093509674072)]
160 | 
161 | 
162 | 
163 | 
164 | ```python
165 | wordvectors.most_similar_cosmul(positive=['ir','jugando'],negative=['jugar'])
166 | ```
167 | 
168 | 
169 | 
170 | 
171 |     [('yendo', 0.881558895111084),
172 |      ('llevando', 0.8737362623214722),
173 |      ('ido', 0.8687229156494141),
174 |      ('saliendo', 0.8531793355941772),
175 |      ('seguir', 0.8456405997276306),
176 |      ('haciendo', 0.8450909852981567),
177 |      ('va', 0.8442757725715637),
178 |      ('vaya', 0.838218629360199),
179 |      ('dando', 0.8275400996208191),
180 |      ('estamos', 0.8271223306655884)]
181 | 
182 | 
183 | 
184 | ### Ejemplos con capitales y países
185 | 
186 | 
187 | ```python
188 | wordvectors.most_similar_cosmul(positive=['santiago','venezuela'],negative=['chile'])
189 | ```
190 | 
191 | 
192 | 
193 | 
194 |     [('caracas', 0.8996074795722961),
195 |      ('bolívar', 0.8295609354972839),
196 |      ('mérida', 0.8287113308906555),
197 |      ('maracaibo', 0.826995849609375),
198 |      ('miranda', 0.8242772817611694),
199 |      ('santa', 0.8197780847549438),
200 |      ('trujillo', 0.8175155520439148),
201 |      ('pérez', 0.8143640756607056),
202 |      ('rafael', 0.8114412426948547),
203 |      ('lara', 0.8102367520332336)]
204 | 
205 | 
206 | 
207 | 
208 | ```python
209 | wordvectors.most_similar_cosmul(positive=['habana','chile'],negative=['santiago'])
210 | ```
211 | 
212 | 
213 | 
214 | 
215 |     [('cuba', 0.9782935380935669),
216 |      ('venezuela', 0.8504070043563843),
217 |      ('bolivia', 0.8276636600494385),
218 |      ('rica', 0.8253333568572998),
219 |      ('colombia', 0.819764256477356),
220 |      ('cubana', 0.8174163699150085),
221 |      ('argentina', 0.8128121495246887),
222 |      ('brasil', 0.8126526474952698),
223 |      ('panamá', 0.8123562932014465),
224 |      ('nicaragua', 0.8074418306350708)]
225 | 
226 | 
227 | 
228 | ## Word vectors en términos excluidos
229 | 
230 | Ejemplo de uso: `doesnt_match(lista_palabras)`
231 | 
232 | Esta llamada selecciona la palabra dentro de `lista_palabras` que está más lejana del resto de las palabras de la lista. La distancia es simplemente el ángulo entre las direcciones de los vectores de las palabras.
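La misma idea se puede reproducir a mano: normalizar los vectores, promediarlos y descartar la palabra menos similar al promedio. Un boceto equivalente con numpy (no es el código interno exacto de gensim):

```python
# Boceto: aproximación de doesnt_match usando similitud coseno.
import numpy as np

def mas_lejana(palabras, wv):
    # vectores unitarios de cada palabra
    vecs = np.array([wv[p] / np.linalg.norm(wv[p]) for p in palabras])
    promedio = vecs.mean(axis=0)
    # la palabra cuyo vector es menos similar al promedio de la lista
    return palabras[int(np.argmin(vecs @ promedio))]

mas_lejana(['blanco', 'azul', 'rojo', 'chile'], wordvectors)  # debería dar 'chile'
```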
233 | 
234 | 
235 | ```python
236 | wordvectors.doesnt_match(['blanco','azul','rojo','chile'])
237 | ```
238 | 
239 | 
240 | 
241 | 
242 |     'chile'
243 | 
244 | 
245 | 
246 | 
247 | ```python
248 | wordvectors.doesnt_match(['sol','luna','almuerzo','jupiter'])
249 | ```
250 | 
251 | 
252 | 
253 | 
254 |     'jupiter'
255 | 
256 | 
257 | 
258 | 
259 | ```python
260 | wordvectors.doesnt_match(['abril', 'mayo', 'septiembre', 'martes', 'julio'])
261 | ```
262 | 
263 | 
264 | 
265 | 
266 |     'martes'
267 | 
268 | 
269 | 
270 | 
271 | ```python
272 | wordvectors.doesnt_match(['lunes', 'martes', 'septiembre', 'jueves', 'viernes'])
273 | ```
274 | 
275 | 
276 | 
277 | 
278 |     'septiembre'
279 | 
280 | 
281 | 
282 | 
283 | ```python
284 | wordvectors.doesnt_match(['everton', 'cobreloa', 'huachipato', 'talca'])
285 | ```
286 | 
287 | 
288 | 
289 | 
290 |     'talca'
291 | 
292 | 
293 | 
294 | 
295 | ```python
296 | wordvectors.doesnt_match(['santiago', 'paris', 'talca', 'concepcion'])
297 | ```
298 | 
299 | 
300 | 
301 | 
302 |     'concepcion'
303 | 
304 | 
305 | 
306 | 
307 | ```python
308 | wordvectors.doesnt_match(['talca', 'paris', 'londres'])
309 | ```
310 | 
311 | 
312 | 
313 | 
314 |     'talca'
315 | 
316 | 
317 | 
--------------------------------------------------------------------------------
/examples/Ejemplo_WordVectors.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Ejemplos de uso de word embeddings computados con FastText\n",
8 | "\n",
9 | "Primero cargamos los vectores/embeddings usando [gensim](https://radimrehurek.com/gensim/). Hay al menos dos formas posibles. La primera es cargar todos los vectores desde el archivo binario (.bin) en su formato nativo de FastText. Esta opción es más demandante en recursos (tiempo y memoria), pero es mucho más versátil, por ejemplo, para obtener vectores de palabras que no se encuentran en el vocabulario. Esta forma se encuentra comentada en la siguiente celda."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {
16 | "collapsed": true
17 | },
18 | "outputs": [],
19 | "source": [
20 | "# opción 1: cargar todos los vectores desde el formato binario (lento, requiere mucha memoria)\n",
21 | "# from gensim.models.wrappers import FastText  # en gensim >= 4.0 usar gensim.models.fasttext.load_facebook_model\n",
22 | "# wordvectors_file = 'fasttext-sbwc.3.6.e20'\n",
23 | "# wordvectors = FastText.load_fasttext_format(wordvectors_file)"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "La segunda forma, mucho más rápida, es cargar sólo una parte de los vectores. Para esto usamos el formato nativo de word2vec y cargamos una cantidad fija de vectores (se pueden cargar vectores generados por diversos métodos, como FastText)."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 3,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "# opción 2: cargar una cantidad fija de vectores (más rápido, dependiendo de la cantidad cargada)\n",
40 | "from gensim.models.keyedvectors import KeyedVectors\n",
41 | "wordvectors_file_vec = 'fasttext-sbwc.3.6.e20.vec'\n",
42 | "cantidad = 100000\n",
43 | "wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "## Word vectors en analogías\n",
51 | "\n",
52 | "Ejemplo de uso: `most_similar_cosmul(positive=lista_palabras_positivas, negative=lista_palabras_negativas)`\n",
53 | "\n",
54 | "Esta llamada encuentra las palabras del vocabulario que están más cercanas a las palabras en `lista_palabras_positivas` y que no están cercanas a las de `lista_palabras_negativas` (para una formalización del procedimiento, ver la fórmula (4) en la Sección 6 de [este artículo](http://www.aclweb.org/anthology/W14-1618)).\n",
55 | "\n",
56 | "Cuando `lista_palabras_positivas` contiene dos palabras, digamos `a` y `b_p`, y `lista_palabras_negativas` contiene una palabra, digamos `a_p`, el anterior procedimiento se lee coloquialmente como encontrar la palabra `b` que responde a la pregunta: `a_p` es a `a` como `b_p` es a ???. El ejemplo clásico se tiene cuando `a` es `rey`, `b_p` es `mujer`, y `a_p` es `hombre`. La palabra buscada `b` es `reina`, pues `hombre` es a `rey` como `mujer` es a `reina`. (Personalmente considero que la intuición de palabras más lejanas y más cercanas es mucho mejor que la de analogías, pero la de analogías es más común en los tutoriales de word embeddings.)"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "### Ejemplos considerando género"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 4,
69 | "metadata": {},
70 | "outputs": [
71 | {
72 | "data": {
73 | "text/plain": [
74 | "[('reina', 0.9141066670417786),\n",
75 | " ('isabel', 0.8743277192115784),\n",
76 | " ('princesa', 0.843113124370575),\n",
77 | " ('infanta', 0.8425983190536499),\n",
78 | " ('monarca', 0.8357319831848145),\n",
79 | " ('hija', 0.8211697340011597),\n",
80 | " ('consorte', 0.8179485201835632),\n",
81 | " ('iv', 0.813984215259552),\n",
82 | " ('esposa', 0.8115168213844299),\n",
83 | " ('ii', 0.8099035620689392)]"
84 | ]
85 | },
86 | "execution_count": 4,
87 | "metadata": {},
88 | "output_type": "execute_result"
89 | }
90 | ],
91 | "source": [
92 | "wordvectors.most_similar_cosmul(positive=['rey','mujer'],negative=['hombre'])"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 5,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "data": {
102 | "text/plain": [
103 | "[('actriz', 0.9732905030250549),\n",
104 | " ('actores', 0.8580312728881836),\n",
105 | " ('actrices', 0.8464058041572571),\n",
106 | " ('cantante', 0.8347789645195007),\n",
107 | " ('reparto', 0.8277631402015686),\n",
108 | " ('protagonista', 0.8202100396156311),\n",
109 | " ('invitada', 0.8101590871810913),\n",
110 | " ('papel', 0.8021049499511719),\n",
111 | " ('guionista', 0.7968517541885376),\n",
112 | " ('intérprete', 0.7961310744285583)]"
113 | ]
114 | },
115 | "execution_count": 5,
116 | "metadata": {},
117 | "output_type": "execute_result"
118 | }
119 | ],
120 | "source": [
121 | "wordvectors.most_similar_cosmul(positive=['actor','mujer'],negative=['hombre'])"
122 | ]
123 | },
124 | {
125 | 
"cell_type": "code", 126 | "execution_count": 6, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "[('hija', 0.9856907725334167),\n", 133 | " ('esposa', 0.9255169034004211),\n", 134 | " ('hijos', 0.9249492883682251),\n", 135 | " ('madre', 0.9138885736465454),\n", 136 | " ('hermana', 0.8996301889419556),\n", 137 | " ('hijas', 0.8754291534423828),\n", 138 | " ('casó', 0.8729564547538757),\n", 139 | " ('matrimonio', 0.8709645867347717),\n", 140 | " ('viuda', 0.8557067513465881),\n", 141 | " ('casada', 0.8546223044395447)]" 142 | ] 143 | }, 144 | "execution_count": 6, 145 | "metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "wordvectors.most_similar_cosmul(positive=['hijo','mujer'],negative=['hombre'])" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 7, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "[('nuera', 0.9055585861206055),\n", 162 | " ('cuñada', 0.8592773079872131),\n", 163 | " ('esther', 0.8199110627174377),\n", 164 | " ('sobrina', 0.8171849846839905),\n", 165 | " ('suegra', 0.8157253265380859),\n", 166 | " ('hija', 0.8014461398124695),\n", 167 | " ('infanta', 0.8008802533149719),\n", 168 | " ('esposa', 0.8008227944374084),\n", 169 | " ('nieta', 0.7964767813682556),\n", 170 | " ('cuñado', 0.7955604195594788)]" 171 | ] 172 | }, 173 | "execution_count": 7, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "wordvectors.most_similar_cosmul(positive=['yerno','mujer'],negative=['hombre'])" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "### Ejemplos considerando conjugaciones" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 8, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "[('juega', 0.8944003582000732),\n", 198 | " ('jugando', 0.8376926183700562),\n", 199 | " ('jugará', 0.834348201751709),\n", 200 | " ('jugador', 0.8295056819915771),\n", 201 | " ('jugó', 0.8156978487968445),\n", 202 | " ('jugado', 0.8147079348564148),\n", 203 | " ('futbolista', 0.7927162647247314),\n", 204 | " ('juegue', 0.7921290397644043),\n", 205 | " ('fútbol', 0.7888965606689453),\n", 206 | " ('juegan', 0.7832154631614685)]" 207 | ] 208 | }, 209 | "execution_count": 8, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "wordvectors.most_similar_cosmul(positive=['jugar','canta'],negative=['cantar'])" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 9, 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "data": { 225 | "text/plain": [ 226 | "[('jugaría', 0.8204259276390076),\n", 227 | " ('jugará', 0.7848052382469177),\n", 228 | " ('juegue', 0.7704501152038574),\n", 229 | " ('jugara', 0.7684974670410156),\n", 230 | " ('ganamos', 0.7370696067810059),\n", 231 | " ('disputaría', 0.7334685325622559),\n", 232 | " ('perderá', 0.7326226234436035),\n", 233 | " ('lesionó', 0.723604679107666),\n", 234 | " ('perdería', 0.7234238386154175),\n", 235 | " ('jugó', 0.7223093509674072)]" 236 | ] 237 | }, 238 | "execution_count": 9, 239 | "metadata": {}, 240 | "output_type": "execute_result" 241 | } 242 | ], 243 | "source": [ 244 | "wordvectors.most_similar_cosmul(positive=['jugar','cantaría'],negative=['cantar'])" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 10, 250 | "metadata": {}, 251 
| "outputs": [ 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "[('yendo', 0.881558895111084),\n", 256 | " ('llevando', 0.8737362623214722),\n", 257 | " ('ido', 0.8687229156494141),\n", 258 | " ('saliendo', 0.8531793355941772),\n", 259 | " ('seguir', 0.8456405997276306),\n", 260 | " ('haciendo', 0.8450909852981567),\n", 261 | " ('va', 0.8442757725715637),\n", 262 | " ('vaya', 0.838218629360199),\n", 263 | " ('dando', 0.8275400996208191),\n", 264 | " ('estamos', 0.8271223306655884)]" 265 | ] 266 | }, 267 | "execution_count": 10, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "wordvectors.most_similar_cosmul(positive=['ir','jugando'],negative=['jugar'])" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "### Ejemplos capitales y países" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 15, 286 | "metadata": {}, 287 | "outputs": [ 288 | { 289 | "data": { 290 | "text/plain": [ 291 | "[('caracas', 0.8996074795722961),\n", 292 | " ('bolívar', 0.8295609354972839),\n", 293 | " ('mérida', 0.8287113308906555),\n", 294 | " ('maracaibo', 0.826995849609375),\n", 295 | " ('miranda', 0.8242772817611694),\n", 296 | " ('santa', 0.8197780847549438),\n", 297 | " ('trujillo', 0.8175155520439148),\n", 298 | " ('pérez', 0.8143640756607056),\n", 299 | " ('rafael', 0.8114412426948547),\n", 300 | " ('lara', 0.8102367520332336)]" 301 | ] 302 | }, 303 | "execution_count": 15, 304 | "metadata": {}, 305 | "output_type": "execute_result" 306 | } 307 | ], 308 | "source": [ 309 | "wordvectors.most_similar_cosmul(positive=['santiago','venezuela'],negative=['chile'])" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 16, 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "data": { 319 | "text/plain": [ 320 | "[('cuba', 0.9782935380935669),\n", 321 | " ('venezuela', 0.8504070043563843),\n", 322 | " ('bolivia', 0.8276636600494385),\n", 323 | " ('rica', 0.8253333568572998),\n", 324 | " ('colombia', 0.819764256477356),\n", 325 | " ('cubana', 0.8174163699150085),\n", 326 | " ('argentina', 0.8128121495246887),\n", 327 | " ('brasil', 0.8126526474952698),\n", 328 | " ('panamá', 0.8123562932014465),\n", 329 | " ('nicaragua', 0.8074418306350708)]" 330 | ] 331 | }, 332 | "execution_count": 16, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "wordvectors.most_similar_cosmul(positive=['habana','chile'],negative=['santiago'])" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## Word vectors en términos excluídos\n", 346 | "\n", 347 | "Ejemplo de uso: `doesnt_match(lista_palabras)`\n", 348 | "\n", 349 | "Esta llamada selecciona la palabra dentro de `listas_palabras` que está más lejana del resto de las palabras de la lista. La distancia es simplemente el ángulo entre las direcciones de los vectores de las palabras." 
350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 17, 355 | "metadata": {}, 356 | "outputs": [ 357 | { 358 | "data": { 359 | "text/plain": [ 360 | "'chile'" 361 | ] 362 | }, 363 | "execution_count": 17, 364 | "metadata": {}, 365 | "output_type": "execute_result" 366 | } 367 | ], 368 | "source": [ 369 | "wordvectors.doesnt_match(['blanco','azul','rojo','chile'])" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 18, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "data": { 379 | "text/plain": [ 380 | "'jupiter'" 381 | ] 382 | }, 383 | "execution_count": 18, 384 | "metadata": {}, 385 | "output_type": "execute_result" 386 | } 387 | ], 388 | "source": [ 389 | "wordvectors.doesnt_match(['sol','luna','almuerzo','jupiter'])" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 19, 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "data": { 399 | "text/plain": [ 400 | "'martes'" 401 | ] 402 | }, 403 | "execution_count": 19, 404 | "metadata": {}, 405 | "output_type": "execute_result" 406 | } 407 | ], 408 | "source": [ 409 | "wordvectors.doesnt_match(['abril', 'mayo', 'septiembre', 'martes', 'julio'])" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": 20, 415 | "metadata": {}, 416 | "outputs": [ 417 | { 418 | "data": { 419 | "text/plain": [ 420 | "'septiembre'" 421 | ] 422 | }, 423 | "execution_count": 20, 424 | "metadata": {}, 425 | "output_type": "execute_result" 426 | } 427 | ], 428 | "source": [ 429 | "wordvectors.doesnt_match(['lunes', 'martes', 'septiembre', 'jueves', 'viernes'])" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": 21, 435 | "metadata": {}, 436 | "outputs": [ 437 | { 438 | "data": { 439 | "text/plain": [ 440 | "'talca'" 441 | ] 442 | }, 443 | "execution_count": 21, 444 | "metadata": {}, 445 | "output_type": "execute_result" 446 | } 447 | ], 448 | "source": [ 449 | "wordvectors.doesnt_match(['everton', 'cobreloa', 'huachipato', 'talca'])" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": 22, 455 | "metadata": {}, 456 | "outputs": [ 457 | { 458 | "data": { 459 | "text/plain": [ 460 | "'concepcion'" 461 | ] 462 | }, 463 | "execution_count": 22, 464 | "metadata": {}, 465 | "output_type": "execute_result" 466 | } 467 | ], 468 | "source": [ 469 | "wordvectors.doesnt_match(['santiago', 'paris', 'talca', 'concepcion'])" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 23, 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/plain": [ 480 | "'talca'" 481 | ] 482 | }, 483 | "execution_count": 23, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "wordvectors.doesnt_match(['talca', 'paris', 'londres'])" 490 | ] 491 | } 492 | ], 493 | "metadata": { 494 | "kernelspec": { 495 | "display_name": "Python 3", 496 | "language": "python", 497 | "name": "python3" 498 | }, 499 | "language_info": { 500 | "codemirror_mode": { 501 | "name": "ipython", 502 | "version": 3 503 | }, 504 | "file_extension": ".py", 505 | "mimetype": "text/x-python", 506 | "name": "python", 507 | "nbconvert_exporter": "python", 508 | "pygments_lexer": "ipython3", 509 | "version": "3.6.0" 510 | } 511 | }, 512 | "nbformat": 4, 513 | "nbformat_minor": 2 514 | } 515 | --------------------------------------------------------------------------------