├── FAQ.md
├── License.md
├── emb-from-suc.md
├── README.md
└── examples
    ├── Ejemplo_WordVectors.md
    └── Ejemplo_WordVectors.ipynb

/FAQ.md:
--------------------------------------------------------------------------------
1 | # FAQ
2 | 
3 | ### How to use them?
4 | 
5 | Please check out our [tutorial](https://github.com/dccuchile/spanish-word-embeddings/blob/master/examples/Ejemplo_WordVectors.md).
6 | 
7 | ### Are the embeddings ordered in any way?
8 | 
9 | Yes, the embeddings are ordered by word frequency.
10 | 
11 | ### How can I get the frequencies of the words?
12 | 
13 | For the FastText models, you can obtain the frequencies of the words with the following code:
14 | 
15 |     import fasttext
16 |     model = fasttext.load_model("your_embedding_model.bin")  # path to a downloaded .bin model
17 |     palabras, frecuencias = model.get_words(include_freq=True)  # parallel lists: words and their frequencies
18 | 
19 | ### My question is not here
20 | 
21 | Please feel free to create a new [Issue](https://github.com/dccuchile/spanish-word-embeddings/issues) with your questions or comments.
22 | 
--------------------------------------------------------------------------------
/License.md:
--------------------------------------------------------------------------------
1 | # Spanish Word Embeddings License
2 | 
3 | ## [FastText embeddings from SBWC](https://github.com/uchile-nlp/spanish-word-embeddings#fasttext-embeddings-from-sbwc)
4 | 
5 | You can use these vectors as you wish under the CC-BY-4.0 license.
6 | 
7 | ## [GloVe embeddings from SBWC](https://github.com/uchile-nlp/spanish-word-embeddings#glove-embeddings-from-sbwc)
8 | 
9 | You can use these vectors as you wish under the CC-BY-4.0 license.
10 | 
11 | ## [FastText embeddings from Spanish Wikipedia](https://github.com/uchile-nlp/spanish-word-embeddings#fasttext-embeddings-from-spanish-wikipedia)
12 | 
13 | Please refer to the [FastText Pre-trained Vectors page](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md) if you want to use these vectors.
14 | 
15 | ## [Word2Vec embeddings from SBWC](https://github.com/uchile-nlp/spanish-word-embeddings#word2vec-embeddings-from-sbwc)
16 | 
17 | Please refer to the [SBWCE page](http://crscardellino.me/SBWCE/) if you want to use these vectors.
18 | 
--------------------------------------------------------------------------------
/emb-from-suc.md:
--------------------------------------------------------------------------------
1 | ## FastText embeddings from SUC
2 | 
3 | Below you find embeddings of different sizes computed from the [Spanish Unannotated Corpora](https://github.com/josecannete/spanish-corpora).
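The `.vec` files below are in the plain word2vec text format and contain only the final word vectors, while the `.bin` files also keep the trained subword model, so they can produce vectors for words that never appeared in the corpus. A minimal sketch of the latter with the [fastText Python bindings](https://github.com/facebookresearch/fastText/tree/master/python); the local file name is an assumption and corresponds to the XS download below:

```python
# Sketch: a .bin model can embed out-of-vocabulary words, because FastText
# composes word vectors from character n-grams (lengths 3 to 6 for these models).
import fasttext

model = fasttext.load_model("embeddings-xs-model.bin")  # assumed local file name
vector = model.get_word_vector("palabrainventada")      # a made-up word still gets a vector
print(vector.shape)                                     # (10,) for the XS model
```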
4 | 
5 | #### Embeddings
6 | Links to the embeddings:
7 | ##### XS (#dimensions=10, #vectors=1313423):
8 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-xs-model.vec?download=1) (122 MB)
9 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-xs-model.bin?download=1) (209 MB)
10 | ##### S (#dimensions=30, #vectors=1313423):
11 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-s-model.vec?download=1) (348 MB)
12 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-s-model.bin?download=1) (579 MB)
13 | ##### M (#dimensions=100, #vectors=1313423):
14 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-m-model.vec?download=1) (1.1 GB)
15 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-m-model.bin?download=1) (1.9 GB)
16 | ##### L (#dimensions=300, #vectors=1313423):
17 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-l-model.vec?download=1) (3.4 GB)
18 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-l-model.bin?download=1) (5.6 GB)
19 | ##### new L (#dimensions=300, #vectors=1451827):
20 | - [Vector format (.vec)](https://zenodo.org/record/3255001/files/embeddings-new_large-general_3B_fasttext.vec?download=1) (3.8 GB)
21 | - [Binary format (.bin)](https://zenodo.org/record/3255001/files/embeddings-new_large-general_3B_fasttext.bin?download=1) (5.9 GB)
22 | 
23 | #### Algorithm
24 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
25 | - Parameters:
26 |     - min subword-ngram = 3
27 |     - max subword-ngram = 6
28 |     - minCount = 5
29 |     - epochs = 20
30 |     - dim = 10, 30, 100, 300, 300 (one value per model above)
31 |     - all other parameters set as default
32 | 
33 | #### Corpus
34 | - [Spanish Unannotated Corpora](https://github.com/josecannete/spanish-corpora)
35 | - Corpus size: 2.6 billion words (3 billion words for the new L model)
36 | - Post-processing: explained in the [Embeddings](https://github.com/BotCenter/spanishWordEmbeddings) and [Corpora](https://github.com/josecannete/spanish-corpora) repos; it includes tokenization, lowercasing, and removal of listings and URLs.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Spanish Word Embeddings
2 | 
3 | Below you find links to Spanish word embeddings computed with different methods and from different corpora. Whenever possible, a description of the parameters used to compute the embeddings is included, together with simple statistics of the vectors and vocabulary, and a description of the corpus from which the embeddings were computed. Direct links to the embeddings are provided; please refer to the original sources for proper citation (see also [References](#references)). Examples of the use of some of these embeddings can be found [here](examples/Ejemplo_WordVectors.md) and in this [tutorial](https://github.com/mquezada/starsconf2018-word-embeddings) (both in Spanish).
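As a minimal quick start (a sketch: it assumes `gensim` is installed and that the SBWC FastText `.vec.gz` file linked in row 2 of the table below has been downloaded):

```python
# Sketch: load the first 100,000 vectors of a downloaded .vec file with gensim.
from gensim.models.keyedvectors import KeyedVectors

wordvectors = KeyedVectors.load_word2vec_format("fasttext-sbwc.vec.gz", limit=100000)
print(wordvectors.most_similar("rey"))  # nearest neighbors of 'rey'
```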
4 | 
5 | Summary of (and links to) the embeddings on this page:
6 | 
7 | | |Corpus |Size |Algorithm |#vectors |vec-dim |Credits |
8 | |---|-----------|----:|-----------|---------:|---------:|-----------|
9 | |[1](#fasttext-embeddings-from-suc)|Spanish Unannotated Corpora|2.6B|FastText|1,313,423|300|[José Cañete](https://github.com/josecannete)|
10 | |[2](#fasttext-embeddings-from-sbwc)|Spanish Billion Word Corpus|1.4B|FastText|855,380|300|[Jorge Pérez](https://github.com/jorgeperezrojas)|
11 | |[3](#glove-embeddings-from-sbwc)|Spanish Billion Word Corpus|1.4B|GloVe|855,380|300|[Jorge Pérez](https://github.com/jorgeperezrojas)|
12 | |[4](#word2vec-embeddings-from-sbwc)|Spanish Billion Word Corpus|1.4B|Word2Vec|1,000,653|300|[Cristian Cardellino](https://github.com/crscardellino)|
13 | |[5](#fasttext-embeddings-from-spanish-wikipedia)|Spanish Wikipedia|???|FastText|985,667|300|[FastText team](https://github.com/facebookresearch/fastText)|
14 | 
15 | 
16 | ## FastText embeddings from SUC
17 | 
18 | #### Embeddings
19 | Links to the embeddings (#dimensions=300, #vectors=1,313,423):
20 | - [Vector format (.vec)](https://zenodo.org/record/3234051/files/embeddings-l-model.vec?download=1) (3.4 GB)
21 | - [Binary format (.bin)](https://zenodo.org/record/3234051/files/embeddings-l-model.bin?download=1) (5.6 GB)
22 | 
23 | More vectors of different dimensions (10, 30, 100, and 300) can be found [here](emb-from-suc.md).
24 | 
25 | #### Algorithm
26 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
27 | - Parameters:
28 |     - min subword-ngram = 3
29 |     - max subword-ngram = 6
30 |     - minCount = 5
31 |     - epochs = 20
32 |     - dim = 300
33 |     - all other parameters set as default (see the training sketch at the end of this section)
34 | 
35 | #### Corpus
36 | - [Spanish Unannotated Corpora](https://github.com/josecannete/spanish-corpora)
37 | - Corpus size: 2.6 billion words
38 | - Post-processing: explained in the [Embeddings](https://github.com/BotCenter/spanishWordEmbeddings) and [Corpora](https://github.com/josecannete/spanish-corpora) repos; it includes tokenization, lowercasing, and removal of listings and URLs.
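For reference, the parameters above correspond to a training call along the following lines (a sketch using the [fastText Python bindings](https://github.com/facebookresearch/fastText/tree/master/python); the corpus path is a placeholder, not the original training script):

```python
# Sketch of the training setup described above (paths are placeholders).
import fasttext

model = fasttext.train_unsupervised(
    "suc_corpus.txt",   # placeholder: the tokenized, lowercased corpus file
    model="skipgram",
    minn=3, maxn=6,     # min/max subword n-gram lengths
    minCount=5,
    epoch=20,
    dim=300,
)
model.save_model("embeddings-l-model.bin")
```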
39 | 
40 | ## FastText embeddings from SBWC
41 | 
42 | #### Embeddings
43 | Links to the embeddings (#dimensions=300, #vectors=855,380):
44 | - [Vector format (.vec.gz)](http://dcc.uchile.cl/~jperez/word-embeddings/fasttext-sbwc.vec.gz) (802 MB)
45 | - [Binary format (.bin)](http://dcc.uchile.cl/~jperez/word-embeddings/fasttext-sbwc.bin) (4.2 GB)
46 | 
47 | #### Algorithm
48 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
49 | - Parameters:
50 |     - min subword-ngram = 3
51 |     - max subword-ngram = 6
52 |     - minCount = 5
53 |     - epochs = 20
54 |     - dim = 300
55 |     - all other parameters set as default
56 | 
57 | #### Corpus
58 | - [Spanish Billion Word Corpus](http://crscardellino.github.io/SBWCE/)
59 | - Corpus size: 1.4 billion words
60 | - Post-processing: besides the post-processing of the raw corpus explained on the [SBWCE page](http://crscardellino.github.io/SBWCE/) (deletion of punctuation, numbers, etc.), the following processing was applied:
61 |     - Words were converted to lowercase
62 |     - Every sequence of the keyword 'DIGITO' was replaced by a single '0'
63 |     - All words consisting of more than 3 characters plus a '0' were omitted (example: 'padre0')
64 | 
65 | ## GloVe embeddings from SBWC
66 | 
67 | #### Embeddings
68 | Links to the embeddings (#dimensions=300, #vectors=855,380):
69 | - [Vector format (.vec.gz)](http://dcc.uchile.cl/~jperez/word-embeddings/glove-sbwc.i25.vec.gz) (906 MB)
70 | - [Binary format (.bin)](http://dcc.uchile.cl/~jperez/word-embeddings/glove-sbwc.i25.bin) (3.9 GB)
71 | 
72 | #### Algorithm
73 | - Implementation: [GloVe](https://github.com/stanfordnlp/GloVe)
74 | - Parameters:
75 |     - vector-size = 300
76 |     - iter = 25
77 |     - min-count = 5
78 |     - all other parameters set as default
79 | 
80 | #### Corpus
81 | - [Spanish Billion Word Corpus](http://crscardellino.github.io/SBWCE/) (see above)
82 | 
83 | ## Word2Vec embeddings from SBWC
84 | 
85 | #### Embeddings
86 | Links to the embeddings (#dimensions=300, #vectors=1,000,653):
87 | - [Vector format (.txt.bz2)](http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.txt.bz2)
88 | - [Binary format (.bin.gz)](http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz)
89 | 
90 | #### Algorithm
91 | - Implementation: [Word2Vec with Skipgram by GenSim](https://radimrehurek.com/gensim/models/word2vec.html)
92 | - Parameters: for details on the parameters, please refer to the [SBWCE page](http://crscardellino.github.io/SBWCE/)
93 | 
94 | #### Corpus
95 | - [Spanish Billion Word Corpus](http://crscardellino.github.io/SBWCE/)
96 | - Corpus size: 1.4 billion words
97 | 
98 | 
99 | ## FastText embeddings from Spanish Wikipedia
100 | 
101 | #### Embeddings
102 | Links to the embeddings (#dimensions=300, #vectors=985,667):
103 | - [Vector format (.vec)](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec) (2.4 GB)
104 | - [Binary plus vector format (.zip)](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.zip) (5.4 GB)
105 | 
106 | #### Algorithm
107 | - Implementation: [FastText](https://github.com/facebookresearch/fastText) with Skipgram
108 | - Parameters: FastText default parameters
109 | 
110 | #### Corpus
111 | - [Wikipedia Spanish Dump](https://archive.org/details/eswiki-20150105)
112 | 
113 | 
114 | 
115 | ## References
116 | 
117 | - FastText embeddings from SUC: Word embeddings were computed by [José Cañete](https://github.com/josecannete) at [BotCenter](https://github.com/BotCenter). You can use these vectors as you wish under the MIT license. Please refer to the [BotCenter Embeddings repo](https://github.com/BotCenter/spanishWordEmbeddings) for further discussion. You may also want to cite the FastText paper [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606).
118 | - FastText embeddings from SBWC: Word embeddings were computed by [Jorge Pérez](https://github.com/jorgeperezrojas). You can use these vectors as you wish under the CC-BY-4.0 license. You may also want to cite the FastText paper [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) and the [Spanish Billion Word Corpus project](http://crscardellino.github.io/SBWCE/).
119 | - GloVe embeddings from SBWC: Word embeddings were computed by [Jorge Pérez](https://github.com/jorgeperezrojas). You can use these vectors as you wish under the CC-BY-4.0 license. You may also want to cite the GloVe paper [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf) and the [Spanish Billion Word Corpus project](http://crscardellino.github.io/SBWCE/).
120 | - FastText embeddings from Spanish Wikipedia: Word embeddings were computed by the [FastText team](https://github.com/facebookresearch/fastText).
121 | Please refer to the [FastText Pre-trained Vectors page](https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md) if you want to use these vectors.
122 | - Word2Vec embeddings from SBWC: Word embeddings were computed by [Cristian Cardellino](https://github.com/crscardellino). Please refer to the [SBWCE page](http://crscardellino.github.io/SBWCE/) if you want to use these vectors.
123 | 
--------------------------------------------------------------------------------
/examples/Ejemplo_WordVectors.md:
--------------------------------------------------------------------------------
1 | 
2 | # Ejemplos de uso de word embeddings computados con FastText
3 | 
4 | Primero cargamos los vectores/embeddings usando [gensim](https://radimrehurek.com/gensim/). Hay al menos dos formas posibles. La primera es cargar todos los vectores desde el archivo binario (.bin) en su formato nativo de FastText. Esta opción es más demandante en recursos (tiempo y memoria), pero es mucho más versátil, por ejemplo, para obtener vectores de palabras que no se encuentran en el vocabulario. Esta forma se encuentra comentada en la siguiente celda.
5 | 
6 | 
7 | ```python
8 | # opción 1: cargar todos los vectores desde el formato binario (lento, requiere mucha memoria)
9 | # from gensim.models.wrappers import FastText  # en gensim >= 4.0 usar gensim.models.fasttext.load_facebook_model
10 | # wordvectors_file = 'fasttext-sbwc.3.6.e20'
11 | # wordvectors = FastText.load_fasttext_format(wordvectors_file)
12 | ```
13 | 
14 | La segunda forma, mucho más rápida, es cargar sólo una parte de los vectores. Para esto usamos el formato nativo de word2vec y cargamos una cantidad fija de vectores (se pueden cargar vectores generados por diversos métodos, como FastText).
15 | 
16 | 
17 | ```python
18 | # opción 2: cargar una cantidad fija de vectores (más rápido, dependiendo de la cantidad cargada)
19 | from gensim.models.keyedvectors import KeyedVectors
20 | wordvectors_file_vec = 'fasttext-sbwc.3.6.e20.vec'
21 | cantidad = 100000
22 | wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)
23 | ```
24 | 
25 | ## Word vectors en analogías
26 | 
27 | Ejemplo de uso: `most_similar_cosmul(positive=lista_palabras_positivas, negative=lista_palabras_negativas)`
28 | 
29 | Esta llamada encuentra las palabras del vocabulario que están más cercanas a las palabras en `lista_palabras_positivas` y que no están cercanas a las de `lista_palabras_negativas` (para una formalización del procedimiento, ver la fórmula (4) en la Sección 6 de [este artículo](http://www.aclweb.org/anthology/W14-1618)).
30 | 
31 | Cuando `lista_palabras_positivas` contiene dos palabras, digamos `a` y `b_p`, y `lista_palabras_negativas` contiene una palabra, digamos `a_p`, el anterior procedimiento se lee coloquialmente como encontrar la palabra `b` que responde a la pregunta: `a_p` es a `a` como `b_p` es a ???. El ejemplo clásico se tiene cuando `a` es `rey`, `b_p` es `mujer`, y `a_p` es `hombre`. La palabra buscada `b` es `reina`, pues `hombre` es a `rey` como `mujer` es a `reina`. (Personalmente considero que la intuición de palabras más lejanas y más cercanas es mucho mejor que la de analogías, pero la de analogías es más común en los tutoriales de word embeddings.)
32 | 
33 | ### Ejemplos considerando género
34 | 
35 | 
36 | ```python
37 | wordvectors.most_similar_cosmul(positive=['rey','mujer'],negative=['hombre'])
38 | ```
39 | 
40 | 
41 | 
42 | 
43 |     [('reina', 0.9141066670417786),
44 |      ('isabel', 0.8743277192115784),
45 |      ('princesa', 0.843113124370575),
46 |      ('infanta', 0.8425983190536499),
47 |      ('monarca', 0.8357319831848145),
48 |      ('hija', 0.8211697340011597),
49 |      ('consorte', 0.8179485201835632),
50 |      ('iv', 0.813984215259552),
51 |      ('esposa', 0.8115168213844299),
52 |      ('ii', 0.8099035620689392)]
53 | 
54 | 
55 | 
56 | 
57 | ```python
58 | wordvectors.most_similar_cosmul(positive=['actor','mujer'],negative=['hombre'])
59 | ```
60 | 
61 | 
62 | 
63 | 
64 |     [('actriz', 0.9732905030250549),
65 |      ('actores', 0.8580312728881836),
66 |      ('actrices', 0.8464058041572571),
67 |      ('cantante', 0.8347789645195007),
68 |      ('reparto', 0.8277631402015686),
69 |      ('protagonista', 0.8202100396156311),
70 |      ('invitada', 0.8101590871810913),
71 |      ('papel', 0.8021049499511719),
72 |      ('guionista', 0.7968517541885376),
73 |      ('intérprete', 0.7961310744285583)]
74 | 
75 | 
76 | 
77 | 
78 | ```python
79 | wordvectors.most_similar_cosmul(positive=['hijo','mujer'],negative=['hombre'])
80 | ```
81 | 
82 | 
83 | 
84 | 
85 |     [('hija', 0.9856907725334167),
86 |      ('esposa', 0.9255169034004211),
87 |      ('hijos', 0.9249492883682251),
88 |      ('madre', 0.9138885736465454),
89 |      ('hermana', 0.8996301889419556),
90 |      ('hijas', 0.8754291534423828),
91 |      ('casó', 0.8729564547538757),
92 |      ('matrimonio', 0.8709645867347717),
93 |      ('viuda', 0.8557067513465881),
94 |      ('casada', 0.8546223044395447)]
95 | 
96 | 
97 | 
98 | 
99 | ```python
100 | wordvectors.most_similar_cosmul(positive=['yerno','mujer'],negative=['hombre'])
101 | ```
102 | 
103 | 
104 | 
105 | 
106 |     [('nuera', 0.9055585861206055),
107 |      ('cuñada', 0.8592773079872131),
108 |      ('esther', 0.8199110627174377),
109 |      ('sobrina', 0.8171849846839905),
110 |      ('suegra', 0.8157253265380859),
111 |      ('hija', 0.8014461398124695),
112 |      ('infanta', 0.8008802533149719),
113 |      ('esposa', 0.8008227944374084),
114 |      ('nieta', 0.7964767813682556),
115 |      ('cuñado', 0.7955604195594788)]
116 | 
117 | 
118 | 
119 | ### Ejemplos considerando conjugaciones
120 | 
121 | 
122 | ```python
123 | wordvectors.most_similar_cosmul(positive=['jugar','canta'],negative=['cantar'])
124 | ```
125 | 
126 | 
127 | 
128 | 
129 |     [('juega', 0.8944003582000732),
130 |      ('jugando', 0.8376926183700562),
131 |      ('jugará', 0.834348201751709),
132 |      ('jugador', 0.8295056819915771),
133 |      ('jugó', 0.8156978487968445),
134 |      ('jugado', 0.8147079348564148),
135 |      ('futbolista', 0.7927162647247314),
136 |      ('juegue', 0.7921290397644043),
137 |      ('fútbol', 0.7888965606689453),
138 |      ('juegan', 0.7832154631614685)]
139 | 
140 | 
141 | 
142 | 
143 | ```python
144 | wordvectors.most_similar_cosmul(positive=['jugar','cantaría'],negative=['cantar'])
145 | ```
146 | 
147 | 
148 | 
149 | 
150 |     [('jugaría', 0.8204259276390076),
151 |      ('jugará', 0.7848052382469177),
152 |      ('juegue', 0.7704501152038574),
153 |      ('jugara', 0.7684974670410156),
154 |      ('ganamos', 0.7370696067810059),
155 |      ('disputaría', 0.7334685325622559),
156 |      ('perderá', 0.7326226234436035),
157 |      ('lesionó', 0.723604679107666),
158 |      ('perdería', 0.7234238386154175),
159 |      ('jugó', 0.7223093509674072)]
160 | 
161 | 
162 | 
163 | 
164 | ```python
165 | wordvectors.most_similar_cosmul(positive=['ir','jugando'],negative=['jugar'])
166 | ```
167 | 
168 | 
169 | 
170 | 
171 |     [('yendo', 0.881558895111084),
172 |      ('llevando', 0.8737362623214722),
173 |      ('ido', 0.8687229156494141),
174 |      ('saliendo', 0.8531793355941772),
175 |      ('seguir', 0.8456405997276306),
176 |      ('haciendo', 0.8450909852981567),
177 |      ('va', 0.8442757725715637),
178 |      ('vaya', 0.838218629360199),
179 |      ('dando', 0.8275400996208191),
180 |      ('estamos', 0.8271223306655884)]
181 | 
182 | 
183 | 
184 | ### Ejemplos con capitales y países
185 | 
186 | 
187 | ```python
188 | wordvectors.most_similar_cosmul(positive=['santiago','venezuela'],negative=['chile'])
189 | ```
190 | 
191 | 
192 | 
193 | 
194 |     [('caracas', 0.8996074795722961),
195 |      ('bolívar', 0.8295609354972839),
196 |      ('mérida', 0.8287113308906555),
197 |      ('maracaibo', 0.826995849609375),
198 |      ('miranda', 0.8242772817611694),
199 |      ('santa', 0.8197780847549438),
200 |      ('trujillo', 0.8175155520439148),
201 |      ('pérez', 0.8143640756607056),
202 |      ('rafael', 0.8114412426948547),
203 |      ('lara', 0.8102367520332336)]
204 | 
205 | 
206 | 
207 | 
208 | ```python
209 | wordvectors.most_similar_cosmul(positive=['habana','chile'],negative=['santiago'])
210 | ```
211 | 
212 | 
213 | 
214 | 
215 |     [('cuba', 0.9782935380935669),
216 |      ('venezuela', 0.8504070043563843),
217 |      ('bolivia', 0.8276636600494385),
218 |      ('rica', 0.8253333568572998),
219 |      ('colombia', 0.819764256477356),
220 |      ('cubana', 0.8174163699150085),
221 |      ('argentina', 0.8128121495246887),
222 |      ('brasil', 0.8126526474952698),
223 |      ('panamá', 0.8123562932014465),
224 |      ('nicaragua', 0.8074418306350708)]
225 | 
226 | 
227 | 
228 | ## Word vectors en términos excluidos
229 | 
230 | Ejemplo de uso: `doesnt_match(lista_palabras)`
231 | 
232 | Esta llamada selecciona la palabra dentro de `lista_palabras` que está más lejana del resto de las palabras de la lista. La distancia es simplemente el ángulo entre las direcciones de los vectores de las palabras.
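La misma idea se puede reproducir a mano: normalizar los vectores, promediarlos y descartar la palabra menos similar al promedio. Un boceto equivalente con numpy (no es el código interno exacto de gensim):

```python
# Boceto: aproximación de doesnt_match usando similitud coseno.
import numpy as np

def mas_lejana(palabras, wv):
    # vectores unitarios de cada palabra
    vecs = np.array([wv[p] / np.linalg.norm(wv[p]) for p in palabras])
    promedio = vecs.mean(axis=0)
    # la palabra cuyo vector es menos similar al promedio de la lista
    return palabras[int(np.argmin(vecs @ promedio))]

mas_lejana(['blanco', 'azul', 'rojo', 'chile'], wordvectors)  # debería dar 'chile'
```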
233 | 
234 | 
235 | ```python
236 | wordvectors.doesnt_match(['blanco','azul','rojo','chile'])
237 | ```
238 | 
239 | 
240 | 
241 | 
242 |     'chile'
243 | 
244 | 
245 | 
246 | 
247 | ```python
248 | wordvectors.doesnt_match(['sol','luna','almuerzo','jupiter'])
249 | ```
250 | 
251 | 
252 | 
253 | 
254 |     'jupiter'
255 | 
256 | 
257 | 
258 | 
259 | ```python
260 | wordvectors.doesnt_match(['abril', 'mayo', 'septiembre', 'martes', 'julio'])
261 | ```
262 | 
263 | 
264 | 
265 | 
266 |     'martes'
267 | 
268 | 
269 | 
270 | 
271 | ```python
272 | wordvectors.doesnt_match(['lunes', 'martes', 'septiembre', 'jueves', 'viernes'])
273 | ```
274 | 
275 | 
276 | 
277 | 
278 |     'septiembre'
279 | 
280 | 
281 | 
282 | 
283 | ```python
284 | wordvectors.doesnt_match(['everton', 'cobreloa', 'huachipato', 'talca'])
285 | ```
286 | 
287 | 
288 | 
289 | 
290 |     'talca'
291 | 
292 | 
293 | 
294 | 
295 | ```python
296 | wordvectors.doesnt_match(['santiago', 'paris', 'talca', 'concepcion'])
297 | ```
298 | 
299 | 
300 | 
301 | 
302 |     'concepcion'
303 | 
304 | 
305 | 
306 | 
307 | ```python
308 | wordvectors.doesnt_match(['talca', 'paris', 'londres'])
309 | ```
310 | 
311 | 
312 | 
313 | 
314 |     'talca'
315 | 
316 | 
317 | 
--------------------------------------------------------------------------------
/examples/Ejemplo_WordVectors.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Ejemplos de uso de word embeddings computados con FastText\n",
8 | "\n",
9 | "Primero cargamos los vectores/embeddings usando [gensim](https://radimrehurek.com/gensim/). Hay al menos dos formas posibles. La primera es cargar todos los vectores desde el archivo binario (.bin) en su formato nativo de FastText. Esta opción es más demandante en recursos (tiempo y memoria), pero es mucho más versátil, por ejemplo, para obtener vectores de palabras que no se encuentran en el vocabulario. Esta forma se encuentra comentada en la siguiente celda."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {
16 | "collapsed": true
17 | },
18 | "outputs": [],
19 | "source": [
20 | "# opción 1: cargar todos los vectores desde el formato binario (lento, requiere mucha memoria)\n",
21 | "# from gensim.models.wrappers import FastText  # en gensim >= 4.0 usar gensim.models.fasttext.load_facebook_model\n",
22 | "# wordvectors_file = 'fasttext-sbwc.3.6.e20'\n",
23 | "# wordvectors = FastText.load_fasttext_format(wordvectors_file)"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "La segunda forma, mucho más rápida, es cargar sólo una parte de los vectores. Para esto usamos el formato nativo de word2vec y cargamos una cantidad fija de vectores (se pueden cargar vectores generados por diversos métodos, como FastText)."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 3,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "# opción 2: cargar una cantidad fija de vectores (más rápido, dependiendo de la cantidad cargada)\n",
40 | "from gensim.models.keyedvectors import KeyedVectors\n",
41 | "wordvectors_file_vec = 'fasttext-sbwc.3.6.e20.vec'\n",
42 | "cantidad = 100000\n",
43 | "wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "## Word vectors en analogías\n",
51 | "\n",
52 | "Ejemplo de uso: `most_similar_cosmul(positive=lista_palabras_positivas, negative=lista_palabras_negativas)`\n",
53 | "\n",
54 | "Esta llamada encuentra las palabras del vocabulario que están más cercanas a las palabras en `lista_palabras_positivas` y que no están cercanas a las de `lista_palabras_negativas` (para una formalización del procedimiento, ver la fórmula (4) en la Sección 6 de [este artículo](http://www.aclweb.org/anthology/W14-1618)).\n",
55 | "\n",
56 | "Cuando `lista_palabras_positivas` contiene dos palabras, digamos `a` y `b_p`, y `lista_palabras_negativas` contiene una palabra, digamos `a_p`, el anterior procedimiento se lee coloquialmente como encontrar la palabra `b` que responde a la pregunta: `a_p` es a `a` como `b_p` es a ???. El ejemplo clásico se tiene cuando `a` es `rey`, `b_p` es `mujer`, y `a_p` es `hombre`. La palabra buscada `b` es `reina`, pues `hombre` es a `rey` como `mujer` es a `reina`. (Personalmente considero que la intuición de palabras más lejanas y más cercanas es mucho mejor que la de analogías, pero la de analogías es más común en los tutoriales de word embeddings.)"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "### Ejemplos considerando género"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 4,
69 | "metadata": {},
70 | "outputs": [
71 | {
72 | "data": {
73 | "text/plain": [
74 | "[('reina', 0.9141066670417786),\n",
75 | " ('isabel', 0.8743277192115784),\n",
76 | " ('princesa', 0.843113124370575),\n",
77 | " ('infanta', 0.8425983190536499),\n",
78 | " ('monarca', 0.8357319831848145),\n",
79 | " ('hija', 0.8211697340011597),\n",
80 | " ('consorte', 0.8179485201835632),\n",
81 | " ('iv', 0.813984215259552),\n",
82 | " ('esposa', 0.8115168213844299),\n",
83 | " ('ii', 0.8099035620689392)]"
84 | ]
85 | },
86 | "execution_count": 4,
87 | "metadata": {},
88 | "output_type": "execute_result"
89 | }
90 | ],
91 | "source": [
92 | "wordvectors.most_similar_cosmul(positive=['rey','mujer'],negative=['hombre'])"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 5,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "data": {
102 | "text/plain": [
103 | "[('actriz', 0.9732905030250549),\n",
104 | " ('actores', 0.8580312728881836),\n",
105 | " ('actrices', 0.8464058041572571),\n",
106 | " ('cantante', 0.8347789645195007),\n",
107 | " ('reparto', 0.8277631402015686),\n",
108 | " ('protagonista', 0.8202100396156311),\n",
109 | " ('invitada', 0.8101590871810913),\n",
110 | " ('papel', 0.8021049499511719),\n",
111 | " ('guionista', 0.7968517541885376),\n",
112 | " ('intérprete', 0.7961310744285583)]"
113 | ]
114 | },
115 | "execution_count": 5,
116 | "metadata": {},
117 | "output_type": "execute_result"
118 | }
119 | ],
120 | "source": [
121 | "wordvectors.most_similar_cosmul(positive=['actor','mujer'],negative=['hombre'])"
122 | ]
123 | },
124 | {
125 | 
"cell_type": "code", 126 | "execution_count": 6, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "[('hija', 0.9856907725334167),\n", 133 | " ('esposa', 0.9255169034004211),\n", 134 | " ('hijos', 0.9249492883682251),\n", 135 | " ('madre', 0.9138885736465454),\n", 136 | " ('hermana', 0.8996301889419556),\n", 137 | " ('hijas', 0.8754291534423828),\n", 138 | " ('casó', 0.8729564547538757),\n", 139 | " ('matrimonio', 0.8709645867347717),\n", 140 | " ('viuda', 0.8557067513465881),\n", 141 | " ('casada', 0.8546223044395447)]" 142 | ] 143 | }, 144 | "execution_count": 6, 145 | "metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "wordvectors.most_similar_cosmul(positive=['hijo','mujer'],negative=['hombre'])" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 7, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "[('nuera', 0.9055585861206055),\n", 162 | " ('cuñada', 0.8592773079872131),\n", 163 | " ('esther', 0.8199110627174377),\n", 164 | " ('sobrina', 0.8171849846839905),\n", 165 | " ('suegra', 0.8157253265380859),\n", 166 | " ('hija', 0.8014461398124695),\n", 167 | " ('infanta', 0.8008802533149719),\n", 168 | " ('esposa', 0.8008227944374084),\n", 169 | " ('nieta', 0.7964767813682556),\n", 170 | " ('cuñado', 0.7955604195594788)]" 171 | ] 172 | }, 173 | "execution_count": 7, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "wordvectors.most_similar_cosmul(positive=['yerno','mujer'],negative=['hombre'])" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "### Ejemplos considerando conjugaciones" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 8, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "[('juega', 0.8944003582000732),\n", 198 | " ('jugando', 0.8376926183700562),\n", 199 | " ('jugará', 0.834348201751709),\n", 200 | " ('jugador', 0.8295056819915771),\n", 201 | " ('jugó', 0.8156978487968445),\n", 202 | " ('jugado', 0.8147079348564148),\n", 203 | " ('futbolista', 0.7927162647247314),\n", 204 | " ('juegue', 0.7921290397644043),\n", 205 | " ('fútbol', 0.7888965606689453),\n", 206 | " ('juegan', 0.7832154631614685)]" 207 | ] 208 | }, 209 | "execution_count": 8, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "wordvectors.most_similar_cosmul(positive=['jugar','canta'],negative=['cantar'])" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 9, 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "data": { 225 | "text/plain": [ 226 | "[('jugaría', 0.8204259276390076),\n", 227 | " ('jugará', 0.7848052382469177),\n", 228 | " ('juegue', 0.7704501152038574),\n", 229 | " ('jugara', 0.7684974670410156),\n", 230 | " ('ganamos', 0.7370696067810059),\n", 231 | " ('disputaría', 0.7334685325622559),\n", 232 | " ('perderá', 0.7326226234436035),\n", 233 | " ('lesionó', 0.723604679107666),\n", 234 | " ('perdería', 0.7234238386154175),\n", 235 | " ('jugó', 0.7223093509674072)]" 236 | ] 237 | }, 238 | "execution_count": 9, 239 | "metadata": {}, 240 | "output_type": "execute_result" 241 | } 242 | ], 243 | "source": [ 244 | "wordvectors.most_similar_cosmul(positive=['jugar','cantaría'],negative=['cantar'])" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 10, 250 | "metadata": {}, 251 
| "outputs": [ 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "[('yendo', 0.881558895111084),\n", 256 | " ('llevando', 0.8737362623214722),\n", 257 | " ('ido', 0.8687229156494141),\n", 258 | " ('saliendo', 0.8531793355941772),\n", 259 | " ('seguir', 0.8456405997276306),\n", 260 | " ('haciendo', 0.8450909852981567),\n", 261 | " ('va', 0.8442757725715637),\n", 262 | " ('vaya', 0.838218629360199),\n", 263 | " ('dando', 0.8275400996208191),\n", 264 | " ('estamos', 0.8271223306655884)]" 265 | ] 266 | }, 267 | "execution_count": 10, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "wordvectors.most_similar_cosmul(positive=['ir','jugando'],negative=['jugar'])" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "### Ejemplos capitales y países" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 15, 286 | "metadata": {}, 287 | "outputs": [ 288 | { 289 | "data": { 290 | "text/plain": [ 291 | "[('caracas', 0.8996074795722961),\n", 292 | " ('bolívar', 0.8295609354972839),\n", 293 | " ('mérida', 0.8287113308906555),\n", 294 | " ('maracaibo', 0.826995849609375),\n", 295 | " ('miranda', 0.8242772817611694),\n", 296 | " ('santa', 0.8197780847549438),\n", 297 | " ('trujillo', 0.8175155520439148),\n", 298 | " ('pérez', 0.8143640756607056),\n", 299 | " ('rafael', 0.8114412426948547),\n", 300 | " ('lara', 0.8102367520332336)]" 301 | ] 302 | }, 303 | "execution_count": 15, 304 | "metadata": {}, 305 | "output_type": "execute_result" 306 | } 307 | ], 308 | "source": [ 309 | "wordvectors.most_similar_cosmul(positive=['santiago','venezuela'],negative=['chile'])" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 16, 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "data": { 319 | "text/plain": [ 320 | "[('cuba', 0.9782935380935669),\n", 321 | " ('venezuela', 0.8504070043563843),\n", 322 | " ('bolivia', 0.8276636600494385),\n", 323 | " ('rica', 0.8253333568572998),\n", 324 | " ('colombia', 0.819764256477356),\n", 325 | " ('cubana', 0.8174163699150085),\n", 326 | " ('argentina', 0.8128121495246887),\n", 327 | " ('brasil', 0.8126526474952698),\n", 328 | " ('panamá', 0.8123562932014465),\n", 329 | " ('nicaragua', 0.8074418306350708)]" 330 | ] 331 | }, 332 | "execution_count": 16, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "wordvectors.most_similar_cosmul(positive=['habana','chile'],negative=['santiago'])" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## Word vectors en términos excluídos\n", 346 | "\n", 347 | "Ejemplo de uso: `doesnt_match(lista_palabras)`\n", 348 | "\n", 349 | "Esta llamada selecciona la palabra dentro de `listas_palabras` que está más lejana del resto de las palabras de la lista. La distancia es simplemente el ángulo entre las direcciones de los vectores de las palabras." 
350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 17, 355 | "metadata": {}, 356 | "outputs": [ 357 | { 358 | "data": { 359 | "text/plain": [ 360 | "'chile'" 361 | ] 362 | }, 363 | "execution_count": 17, 364 | "metadata": {}, 365 | "output_type": "execute_result" 366 | } 367 | ], 368 | "source": [ 369 | "wordvectors.doesnt_match(['blanco','azul','rojo','chile'])" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 18, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "data": { 379 | "text/plain": [ 380 | "'jupiter'" 381 | ] 382 | }, 383 | "execution_count": 18, 384 | "metadata": {}, 385 | "output_type": "execute_result" 386 | } 387 | ], 388 | "source": [ 389 | "wordvectors.doesnt_match(['sol','luna','almuerzo','jupiter'])" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 19, 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "data": { 399 | "text/plain": [ 400 | "'martes'" 401 | ] 402 | }, 403 | "execution_count": 19, 404 | "metadata": {}, 405 | "output_type": "execute_result" 406 | } 407 | ], 408 | "source": [ 409 | "wordvectors.doesnt_match(['abril', 'mayo', 'septiembre', 'martes', 'julio'])" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": 20, 415 | "metadata": {}, 416 | "outputs": [ 417 | { 418 | "data": { 419 | "text/plain": [ 420 | "'septiembre'" 421 | ] 422 | }, 423 | "execution_count": 20, 424 | "metadata": {}, 425 | "output_type": "execute_result" 426 | } 427 | ], 428 | "source": [ 429 | "wordvectors.doesnt_match(['lunes', 'martes', 'septiembre', 'jueves', 'viernes'])" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": 21, 435 | "metadata": {}, 436 | "outputs": [ 437 | { 438 | "data": { 439 | "text/plain": [ 440 | "'talca'" 441 | ] 442 | }, 443 | "execution_count": 21, 444 | "metadata": {}, 445 | "output_type": "execute_result" 446 | } 447 | ], 448 | "source": [ 449 | "wordvectors.doesnt_match(['everton', 'cobreloa', 'huachipato', 'talca'])" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": 22, 455 | "metadata": {}, 456 | "outputs": [ 457 | { 458 | "data": { 459 | "text/plain": [ 460 | "'concepcion'" 461 | ] 462 | }, 463 | "execution_count": 22, 464 | "metadata": {}, 465 | "output_type": "execute_result" 466 | } 467 | ], 468 | "source": [ 469 | "wordvectors.doesnt_match(['santiago', 'paris', 'talca', 'concepcion'])" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 23, 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/plain": [ 480 | "'talca'" 481 | ] 482 | }, 483 | "execution_count": 23, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "wordvectors.doesnt_match(['talca', 'paris', 'londres'])" 490 | ] 491 | } 492 | ], 493 | "metadata": { 494 | "kernelspec": { 495 | "display_name": "Python 3", 496 | "language": "python", 497 | "name": "python3" 498 | }, 499 | "language_info": { 500 | "codemirror_mode": { 501 | "name": "ipython", 502 | "version": 3 503 | }, 504 | "file_extension": ".py", 505 | "mimetype": "text/x-python", 506 | "name": "python", 507 | "nbconvert_exporter": "python", 508 | "pygments_lexer": "ipython3", 509 | "version": "3.6.0" 510 | } 511 | }, 512 | "nbformat": 4, 513 | "nbformat_minor": 2 514 | } 515 | --------------------------------------------------------------------------------