├── .Rbuildignore ├── .gitignore ├── .travis.yml ├── CONDUCT.md ├── DESCRIPTION ├── LICENSE.txt ├── NAMESPACE ├── NEWS.md ├── R ├── data.R ├── matrixFunctions.R ├── utils.R └── word2vec.R ├── README.md ├── data └── demo_vectors.rda ├── inst ├── doc │ ├── exploration.R │ ├── exploration.Rmd │ ├── exploration.html │ ├── introduction.R │ ├── introduction.Rmd │ └── introduction.html └── paper.md ├── man ├── VectorSpaceModel-VectorSpaceModel-method.Rd ├── VectorSpaceModel-class.Rd ├── as.VectorSpaceModel.Rd ├── closest_to.Rd ├── cosineDist.Rd ├── cosineSimilarity.Rd ├── demo_vectors.Rd ├── distend.Rd ├── filter_to_rownames.Rd ├── improve_vectorspace.Rd ├── magnitudes.Rd ├── nearest_to.Rd ├── normalize_lengths.Rd ├── plot-VectorSpaceModel-method.Rd ├── prep_word2vec.Rd ├── project.Rd ├── read.binary.vectors.Rd ├── read.vectors.Rd ├── reexports.Rd ├── reject.Rd ├── square_magnitudes.Rd ├── sub-VectorSpaceModel-method.Rd ├── sub-sub-VectorSpaceModel-method.Rd ├── train_word2vec.Rd ├── word2phrase.Rd └── write.binary.word2vec.Rd ├── src ├── Makevars.win ├── tmcn_word2vec.c ├── word2phrase.c └── word2vec.h ├── tests ├── run-all.R └── testthat │ ├── test-linear-algebra-functions.R │ ├── test-name-collapsing.r │ ├── test-read-write.R │ ├── test-rejection.R │ ├── test-train.R │ └── test-types.R └── vignettes ├── exploration.Rmd └── introduction.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^CONDUCT\.md$ 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | inst/extdata/ 5 | .DS_Store 6 | hum2vec.Rproj 7 | src/*.o 8 | src/*.so 9 | cookbooks 10 | cookbooks.txt 11 | cookbooks.vectors 12 | cookbooks.zip 13 | cookbooks* 14 | etc 15 | cookbook_vectors.bin 16 | tests/testthat/binary.bin 17 | tests/testthat/input.txt 18 | tests/testthat/tmp.txt 19 | tests/testthat/binary.bin 20 | tests/testthat/tmp.bin 21 | vignettes/*.R 22 | vignettes/*.html 23 | vignettes/*_files 24 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: r 2 | cache: packages 3 | warnings_are_errors: true 4 | r_build_args: --no-build-vignettes --no-manual --no-resave-data 5 | r_check_args: --no-build-vignettes --no-manual 6 | r: 7 | - release 8 | - devel 9 | 10 | -------------------------------------------------------------------------------- /CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Code of Conduct 2 | 3 | As contributors and maintainers of this project, we pledge to respect all people who 4 | contribute through reporting issues, posting feature requests, updating documentation, 5 | submitting pull requests or patches, and other activities. 6 | 7 | We are committed to making participation in this project a harassment-free experience for 8 | everyone, regardless of level of experience, gender, gender identity and expression, 9 | sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. 10 | 11 | Examples of unacceptable behavior by participants include the use of sexual language or 12 | imagery, derogatory comments or personal attacks, trolling, public or private harassment, 13 | insults, or other unprofessional conduct. 
14 | 15 | Project maintainers have the right and responsibility to remove, edit, or reject comments, 16 | commits, code, wiki edits, issues, and other contributions that are not aligned to this 17 | Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed 18 | from the project team. 19 | 20 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by 21 | opening an issue or contacting one or more of the project maintainers. 22 | 23 | This Code of Conduct is adapted from the Contributor Covenant 24 | (http:contributor-covenant.org), version 1.0.0, available at 25 | http://contributor-covenant.org/version/1/0/0/ 26 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: wordVectors 2 | Type: Package 3 | Title: Tools for creating and analyzing vector-space models of texts 4 | Version: 2.0 5 | Author: Ben Schmidt, Jian Li 6 | Maintainer: Ben Schmidt 7 | Description: 8 | wordVectors wraps Google's implementation in C for training word2vec models, 9 | and provides several R functions for exploratory data analysis of word2vec 10 | and other related models. These include import-export from the binary format, 11 | some useful linear algebra operations missing from R, and a streamlined 12 | syntax for working with models and performing vector arithmetic that make it 13 | easier to perform useful operations in a word-vector space. 14 | License: Apache License (== 2.0) 15 | URL: http://github.com/bmschmidt/wordVectors 16 | BugReports: https://github.com/bmschmidt/wordVectors/issues 17 | Depends: 18 | R (>= 2.14.0) 19 | LazyData: TRUE 20 | Imports: 21 | magrittr, 22 | graphics, 23 | methods, 24 | utils, 25 | stats, 26 | readr, 27 | stringr, 28 | stringi 29 | Suggests: 30 | tsne, 31 | testthat, 32 | ggplot2, 33 | knitr, 34 | dplyr, 35 | rmarkdown, 36 | devtools 37 | RoxygenNote: 6.0.1 38 | VignetteBuilder: knitr 39 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | This Apache License is included in this package alongside the original 2 | Google word2vec code. Both the word2vec code and Ben Schmidt's R functions 3 | are released under the Apache license. 4 | 5 | 6 | Apache License 7 | Version 2.0, January 2004 8 | http://www.apache.org/licenses/ 9 | 10 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 11 | 12 | 1. Definitions. 13 | 14 | "License" shall mean the terms and conditions for use, reproduction, 15 | and distribution as defined by Sections 1 through 9 of this document. 16 | 17 | "Licensor" shall mean the copyright owner or entity authorized by 18 | the copyright owner that is granting the License. 19 | 20 | "Legal Entity" shall mean the union of the acting entity and all 21 | other entities that control, are controlled by, or are under common 22 | control with that entity. For the purposes of this definition, 23 | "control" means (i) the power, direct or indirect, to cause the 24 | direction or management of such entity, whether by contract or 25 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 26 | outstanding shares, or (iii) beneficial ownership of such entity. 27 | 28 | "You" (or "Your") shall mean an individual or Legal Entity 29 | exercising permissions granted by this License. 
30 | 31 | "Source" form shall mean the preferred form for making modifications, 32 | including but not limited to software source code, documentation 33 | source, and configuration files. 34 | 35 | "Object" form shall mean any form resulting from mechanical 36 | transformation or translation of a Source form, including but 37 | not limited to compiled object code, generated documentation, 38 | and conversions to other media types. 39 | 40 | "Work" shall mean the work of authorship, whether in Source or 41 | Object form, made available under the License, as indicated by a 42 | copyright notice that is included in or attached to the work 43 | (an example is provided in the Appendix below). 44 | 45 | "Derivative Works" shall mean any work, whether in Source or Object 46 | form, that is based on (or derived from) the Work and for which the 47 | editorial revisions, annotations, elaborations, or other modifications 48 | represent, as a whole, an original work of authorship. For the purposes 49 | of this License, Derivative Works shall not include works that remain 50 | separable from, or merely link (or bind by name) to the interfaces of, 51 | the Work and Derivative Works thereof. 52 | 53 | "Contribution" shall mean any work of authorship, including 54 | the original version of the Work and any modifications or additions 55 | to that Work or Derivative Works thereof, that is intentionally 56 | submitted to Licensor for inclusion in the Work by the copyright owner 57 | or by an individual or Legal Entity authorized to submit on behalf of 58 | the copyright owner. For the purposes of this definition, "submitted" 59 | means any form of electronic, verbal, or written communication sent 60 | to the Licensor or its representatives, including but not limited to 61 | communication on electronic mailing lists, source code control systems, 62 | and issue tracking systems that are managed by, or on behalf of, the 63 | Licensor for the purpose of discussing and improving the Work, but 64 | excluding communication that is conspicuously marked or otherwise 65 | designated in writing by the copyright owner as "Not a Contribution." 66 | 67 | "Contributor" shall mean Licensor and any individual or Legal Entity 68 | on behalf of whom a Contribution has been received by Licensor and 69 | subsequently incorporated within the Work. 70 | 71 | 2. Grant of Copyright License. Subject to the terms and conditions of 72 | this License, each Contributor hereby grants to You a perpetual, 73 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 74 | copyright license to reproduce, prepare Derivative Works of, 75 | publicly display, publicly perform, sublicense, and distribute the 76 | Work and such Derivative Works in Source or Object form. 77 | 78 | 3. Grant of Patent License. Subject to the terms and conditions of 79 | this License, each Contributor hereby grants to You a perpetual, 80 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 81 | (except as stated in this section) patent license to make, have made, 82 | use, offer to sell, sell, import, and otherwise transfer the Work, 83 | where such license applies only to those patent claims licensable 84 | by such Contributor that are necessarily infringed by their 85 | Contribution(s) alone or by combination of their Contribution(s) 86 | with the Work to which such Contribution(s) was submitted. 
If You 87 | institute patent litigation against any entity (including a 88 | cross-claim or counterclaim in a lawsuit) alleging that the Work 89 | or a Contribution incorporated within the Work constitutes direct 90 | or contributory patent infringement, then any patent licenses 91 | granted to You under this License for that Work shall terminate 92 | as of the date such litigation is filed. 93 | 94 | 4. Redistribution. You may reproduce and distribute copies of the 95 | Work or Derivative Works thereof in any medium, with or without 96 | modifications, and in Source or Object form, provided that You 97 | meet the following conditions: 98 | 99 | (a) You must give any other recipients of the Work or 100 | Derivative Works a copy of this License; and 101 | 102 | (b) You must cause any modified files to carry prominent notices 103 | stating that You changed the files; and 104 | 105 | (c) You must retain, in the Source form of any Derivative Works 106 | that You distribute, all copyright, patent, trademark, and 107 | attribution notices from the Source form of the Work, 108 | excluding those notices that do not pertain to any part of 109 | the Derivative Works; and 110 | 111 | (d) If the Work includes a "NOTICE" text file as part of its 112 | distribution, then any Derivative Works that You distribute must 113 | include a readable copy of the attribution notices contained 114 | within such NOTICE file, excluding those notices that do not 115 | pertain to any part of the Derivative Works, in at least one 116 | of the following places: within a NOTICE text file distributed 117 | as part of the Derivative Works; within the Source form or 118 | documentation, if provided along with the Derivative Works; or, 119 | within a display generated by the Derivative Works, if and 120 | wherever such third-party notices normally appear. The contents 121 | of the NOTICE file are for informational purposes only and 122 | do not modify the License. You may add Your own attribution 123 | notices within Derivative Works that You distribute, alongside 124 | or as an addendum to the NOTICE text from the Work, provided 125 | that such additional attribution notices cannot be construed 126 | as modifying the License. 127 | 128 | You may add Your own copyright statement to Your modifications and 129 | may provide additional or different license terms and conditions 130 | for use, reproduction, or distribution of Your modifications, or 131 | for any such Derivative Works as a whole, provided Your use, 132 | reproduction, and distribution of the Work otherwise complies with 133 | the conditions stated in this License. 134 | 135 | 5. Submission of Contributions. Unless You explicitly state otherwise, 136 | any Contribution intentionally submitted for inclusion in the Work 137 | by You to the Licensor shall be under the terms and conditions of 138 | this License, without any additional terms or conditions. 139 | Notwithstanding the above, nothing herein shall supersede or modify 140 | the terms of any separate license agreement you may have executed 141 | with Licensor regarding such Contributions. 142 | 143 | 6. Trademarks. This License does not grant permission to use the trade 144 | names, trademarks, service marks, or product names of the Licensor, 145 | except as required for reasonable and customary use in describing the 146 | origin of the Work and reproducing the content of the NOTICE file. 147 | 148 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 149 | agreed to in writing, Licensor provides the Work (and each 150 | Contributor provides its Contributions) on an "AS IS" BASIS, 151 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 152 | implied, including, without limitation, any warranties or conditions 153 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 154 | PARTICULAR PURPOSE. You are solely responsible for determining the 155 | appropriateness of using or redistributing the Work and assume any 156 | risks associated with Your exercise of permissions under this License. 157 | 158 | 8. Limitation of Liability. In no event and under no legal theory, 159 | whether in tort (including negligence), contract, or otherwise, 160 | unless required by applicable law (such as deliberate and grossly 161 | negligent acts) or agreed to in writing, shall any Contributor be 162 | liable to You for damages, including any direct, indirect, special, 163 | incidental, or consequential damages of any character arising as a 164 | result of this License or out of the use or inability to use the 165 | Work (including but not limited to damages for loss of goodwill, 166 | work stoppage, computer failure or malfunction, or any and all 167 | other commercial damages or losses), even if such Contributor 168 | has been advised of the possibility of such damages. 169 | 170 | 9. Accepting Warranty or Additional Liability. While redistributing 171 | the Work or Derivative Works thereof, You may choose to offer, 172 | and charge a fee for, acceptance of support, warranty, indemnity, 173 | or other liability obligations and/or rights consistent with this 174 | License. However, in accepting such obligations, You may act only 175 | on Your own behalf and on Your sole responsibility, not on behalf 176 | of any other Contributor, and only if You agree to indemnify, 177 | defend, and hold each Contributor harmless for any liability 178 | incurred by, or claims asserted against, such Contributor by reason 179 | of your accepting any such warranty or additional liability. 180 | 181 | END OF TERMS AND CONDITIONS 182 | 183 | APPENDIX: How to apply the Apache License to your work. 184 | 185 | To apply the Apache License to your work, attach the following 186 | boilerplate notice, with the fields enclosed by brackets "[]" 187 | replaced with your own identifying information. (Don't include 188 | the brackets!) The text should be enclosed in the appropriate 189 | comment syntax for the file format. We also recommend that a 190 | file or class name and description of purpose be included on the 191 | same "printed page" as the copyright notice for easier 192 | identification within third-party archives. 193 | 194 | Copyright [yyyy] [name of copyright owner] 195 | 196 | Licensed under the Apache License, Version 2.0 (the "License"); 197 | you may not use this file except in compliance with the License. 198 | You may obtain a copy of the License at 199 | 200 | http://www.apache.org/licenses/LICENSE-2.0 201 | 202 | Unless required by applicable law or agreed to in writing, software 203 | distributed under the License is distributed on an "AS IS" BASIS, 204 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 205 | See the License for the specific language governing permissions and 206 | limitations under the License. 
207 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export("%>%") 4 | export(as.VectorSpaceModel) 5 | export(closest_to) 6 | export(cosineDist) 7 | export(cosineSimilarity) 8 | export(distend) 9 | export(filter_to_rownames) 10 | export(improve_vectorspace) 11 | export(magnitudes) 12 | export(nearest_to) 13 | export(normalize_lengths) 14 | export(prep_word2vec) 15 | export(project) 16 | export(read.binary.vectors) 17 | export(read.vectors) 18 | export(reject) 19 | export(train_word2vec) 20 | export(word2phrase) 21 | export(write.binary.word2vec) 22 | exportClasses(VectorSpaceModel) 23 | exportMethods(plot) 24 | importFrom(magrittr,"%>%") 25 | useDynLib(wordVectors) 26 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # VERSION 2.0 2 | 3 | Upgrade focusing on ease of use and CRAN-ability. Bumping major version because of a breaking change in the behavior of `closest_to`, which now returns a data.frame. 4 | 5 | # Changes 6 | 7 | ## New default function: closest_to. 8 | 9 | `nearest_to` was previously the easiest way to interact with cosine similarity functions. That's been deprecated 10 | in favor of a new function, `closest_to`. (I would have changed the name but for back-compatibility reasons.) 11 | The data.frame columns have elaborate names so they can easily be manipulated with dplyr, and/or plotted with ggplot. 12 | `nearest_to` is now just a wrapped version of the new function. 13 | 14 | ## New syntax for vector addition. 15 | 16 | This package now allows formula scoping for the most common operations, and string inputs to access in the context of a particular matrix. This makes this much nicer for handling the bread and butter word2vec operations. 17 | 18 | For instance, instead of writing 19 | ```R 20 | vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",]) 21 | ``` 22 | 23 | (whew!), you can write 24 | 25 | ```R 26 | vectors %>% closest_to(~"king" - "man" + "woman") 27 | ``` 28 | 29 | 30 | ## Reading tweaks. 31 | 32 | In keeping with the goal of allowing manipulation of models in low-memory environments, it's now possible to read only rows with words matching certain criteria by passing an argument to read.binary.vectors(); either `rowname_list` for a fixed list, or `rowname_regexp` for a regular expression. (You could, say, read only the gerunds from a file by entering `rowname_regexp = "*.ing"`). 33 | 34 | ## Test Suite 35 | 36 | The package now includes a test suite. 37 | 38 | ## Other materials for rOpenScience and JOSS. 39 | 40 | This package has enough users it might be nice to get it on CRAN. I'm trying doing so through rOpenSci. That requires a lot of small files scattered throughout. 41 | 42 | 43 | # VERSION 1.3 44 | 45 | Two significant performance improvements. 46 | 1. Row magnitudes for a `VectorSpaceModel` object are now **cached** in an environment that allows some pass-by-reference editing. This means that the most time-consuming part of any comparison query is only done once for any given vector set; subsequent queries are at least an order of magnitude (10-20x)? faster. 47 | 48 | Although this is a big performance improvement, certain edge cases might not wipe the cache clear. 
**In particular, assignment inside a VSM object might cause incorrect calculations.** I can't see why anyone would be in the habit of manually tweaking a row or block (rather than a whole matrix). 49 | 1. Access to rows in a `VectorSpaceModel` object is now handled through callNextMethod() rather than accessing the element's .Data slot. For reasons opaque to me, hitting the .Data slot seems to internally require copying the whole huge matrix internally. Now that no longer happens. 50 | 51 | 52 | # VERSION 1.2 53 | 54 | This release implements a number of incremental improvements and clears up some errors. 55 | - The package is now able to read and write in the binary word2vec format; since this is faster and takes much less hard drive space (down by about 66%) than writing out floats as text, it does so internally. 56 | - Several improvements to the C codebase to avoid warnings by @mukul13, described [here](https://github.com/bmschmidt/wordVectors/pull/9). (Does this address the `long long` issue?) 57 | - Subsetting with `[[` now takes an argument `average`; if false, rather than collapse a matrix down to a single row, it just extracts the elements that correspond to the words. 58 | - Added sample data in the object `demo_vectors`: the 999 words from the most common vectors. 59 | - Began adding examples to the codebase. 60 | - Tracking build success using Travis. 61 | - Dodging most warnings from R CMD check. 62 | 63 | Bug fixes 64 | - If the `dir.exists` function is undefined, it creates one for you. This should allow installation on R 3.1 and some lower versions. 65 | - `reject` and `project` are better about returning VSM objects, rather than dropping back into a matrix. 66 | 67 | # VERSION 1.1 68 | 69 | A few changes, primarily to the functions for _training_ vector spaces to produce higher quality models. A number of these changes are merged back in from the fork of this repo by github user @sunecasp . Thanks! 70 | 71 | ## Some bug fixes 72 | 73 | Filenames can now be up to 1024 characters. Some parameters on alpha decay may be fixed; I'm not entirely sure what sunecasp's changes do. 74 | 75 | ## Changes to default number of iterations. 76 | 77 | Models now default to 5 iterations through the text rather than 1. That means training may take 5 times as long; but particularly for small corpora, the vectors should be of higher quality. See below for an example. 78 | 79 | ## More training arguments 80 | 81 | You can now specify more flags to the word2vec code. `?train_word2vec` gives a full list, but particularly useful are: 82 | 1. `window` now accurately sets the window size. 83 | 2. `iter` sets the number of iterations. For very large corpora, `iter=1` will train most quickly; for very small corpora, `iter=15` will give substantially better vectors. (See below). You should set this as high as you can stand within reason (Setting `iter` to a number higher than `window` is probably not that useful). But more text is better than more iterations. 84 | 3. `min_count` gives a cutoff for vocabulary size. Tokens occurring fewer than `min_count` times will be dropped from the model. Setting this high can be useful. (But note that a trained model is sorted in order of frequency, so if you have the RAM to train a big model you can reduce it in size for analysis by just subsetting to the first 10,000 or whatever rows). 85 | 86 | ## Example of vectors 87 | 88 | Here's an example of training on a small set (c. 1000 speeches on the floor of the house of commons from the early 19th century). 
89 | 90 | > proc.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)}) 91 | > Error in train_word2vec("~/tmp2.txt", "~/1_iter.vectors", iter = 1) : 92 | > The output file '~/1_iter.vectors' already exists: delete or give a new destination. 93 | > proc.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)}) 94 | > Starting training using file /Users/bschmidt/tmp2.txt 95 | > Vocab size: 4469 96 | > Words in train file: 407583 97 | > Alpha: 0.000711 Progress: 99.86% Words/thread/sec: 67.51k 98 | > Error in proc.time({ : 1 argument passed to 'proc.time' which requires 0 99 | > ?proc.time 100 | > system.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)}) 101 | > Starting training using file /Users/bschmidt/tmp2.txt 102 | > Vocab size: 4469 103 | > Words in train file: 407583 104 | > Alpha: 0.000711 Progress: 99.86% Words/thread/sec: 66.93k user system elapsed 105 | > 6.753 0.055 6.796 106 | > system.time({two = train_word2vec("~/tmp2.txt","~/2_iter.vectors",iter = 3)}) 107 | > Starting training using file /Users/bschmidt/tmp2.txt 108 | > Vocab size: 4469 109 | > Words in train file: 407583 110 | > Alpha: 0.000237 Progress: 99.95% Words/thread/sec: 67.15k user system elapsed 111 | > 18.846 0.085 18.896 112 | > 113 | > two %>% nearest_to(two["debt"]) %>% round(3) 114 | > debt remainder Jan including drawback manufactures prisoners mercantile subsisting 115 | > 0.000 0.234 0.256 0.281 0.291 0.293 0.297 0.314 0.314 116 | > Dec 117 | > 0.318 118 | > one %>% nearest_to(one[["debt"]]) %>% round(3) 119 | > debt Christmas exception preventing Indies import remainder eye eighteen labouring 120 | > 0.000 0.150 0.210 0.214 0.215 0.220 0.221 0.223 0.225 0.227 121 | > 122 | > system.time({ten = train_word2vec("~/tmp2.txt","~/10_iter.vectors",iter = 10)}) 123 | > Starting training using file /Users/bschmidt/tmp2.txt 124 | > Vocab size: 4469 125 | > Words in train file: 407583 126 | > Alpha: 0.000071 Progress: 99.98% Words/thread/sec: 66.13k user system elapsed 127 | > 62.070 0.296 62.333 128 | > 129 | > ten %>% nearest_to(ten[["debt"]]) %>% round(3) 130 | > debt surplus Dec remainder manufacturing grants Jan drawback prisoners 131 | > 0.000 0.497 0.504 0.510 0.519 0.520 0.533 0.536 0.546 132 | > compelling 133 | > 0.553 134 | 135 | ``` 136 | ``` 137 | 138 | -------------------------------------------------------------------------------- /R/data.R: -------------------------------------------------------------------------------- 1 | #' 999 vectors trained on teaching evaluations 2 | #' 3 | #' A sample VectorSpaceModel object trained on about 15 million 4 | #' teaching evaluations, limited to the 999 most common words. 5 | #' Included for demonstration purposes only: there's only so much you can 6 | #' do with a 999 length vocabulary. 7 | #' 8 | #' You're best off downloading a real model to work with, 9 | #' such as the precompiled vectors distributed by Google 10 | #' at https://code.google.com/archive/p/word2vec/ 11 | #' 12 | #' @format A VectorSpaceModel object of 999 words and 500 vectors 13 | #' 14 | #' @source Trained by package author. 15 | "demo_vectors" 16 | -------------------------------------------------------------------------------- /R/matrixFunctions.R: -------------------------------------------------------------------------------- 1 | #' Improve a vectorspace by removing common elements. 2 | #' 3 | #' 4 | #' @param vectorspace A VectorSpacemodel to be improved. 5 | #' @param D The number of principal components to eliminate. 
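#'   Defaults to \code{round(ncol(vectorspace)/100)}, i.e. roughly one component per hundred columns.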
6 | #' 7 | #' @description See reference for a full description. Supposedly, these operations will improve performance on analogy tasks. 8 | #' 9 | #' @references Jiaqi Mu, Suma Bhat, Pramod Viswanath. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. https://arxiv.org/abs/1702.01417. 10 | #' @return A VectorSpaceModel object, transformed from the original. 11 | #' @export 12 | #' 13 | #' @examples 14 | #' 15 | #' closest_to(demo_vectors,"great") 16 | #' # stopwords like "and" and "very" are no longer top ten. 17 | #' # I don't know if this is really better, though. 18 | #' 19 | #' closest_to(improve_vectorspace(demo_vectors),"great") 20 | #' 21 | improve_vectorspace = function(vectorspace,D=round(ncol(vectorspace)/100)) { 22 | mean = methods::new("VectorSpaceModel", 23 | matrix(apply(vectorspace,2,mean), 24 | ncol=ncol(vectorspace)) 25 | ) 26 | vectorspace = vectorspace-mean 27 | pca = stats::prcomp(vectorspace) 28 | 29 | # I don't totally understand the recommended operation in the source paper, but this seems to do much 30 | # the same thing using the internal functions of the package to reject the top i dimensions one at a time. 31 | drop_top_i = function(vspace,i) { 32 | if (i<=0) {vspace} else if (i==1) { 33 | reject(vspace,pca$rotation[,i]) 34 | } else { 35 | drop_top_i(reject(vspace,pca$rotation[,i]), i-1) 36 | } 37 | } 38 | better = drop_top_i(vectorspace,D) 39 | } 40 | 41 | 42 | #' Internal function to subsitute strings for a tree. Allows arithmetic on words. 43 | #' 44 | #' @noRd 45 | #' 46 | #' @param tree an expression from a formula 47 | #' @param context the VSM context in which to parse it. 48 | #' 49 | #' @return a tree 50 | sub_out_tree = function(tree, context) { 51 | # This is a whitelist of operators that I think are basic for vector math. 52 | # It's possible it could be expanded. 53 | 54 | # This might fail if you try to pass a reference to a basic 55 | # arithmetic operator, or something crazy like that. 56 | 57 | if (deparse(tree[[1]]) %in% c("+","*","-","/","^","log","sqrt","(")) { 58 | for (i in 2:length(tree)) { 59 | tree[[i]] <- sub_out_tree(tree[[i]],context) 60 | } 61 | } 62 | if (is.character(tree)) { 63 | return(context[[tree]]) 64 | } 65 | return(tree) 66 | } 67 | 68 | #' Internal function to wrap for sub_out_tree. Allows arithmetic on words. 69 | #' 70 | #' @noRd 71 | #' 72 | #' @param formula A formula; string arithmetic on the LHS, no RHS. 73 | #' @param context the VSM context in which to parse it. 74 | #' 75 | #' @return an evaluated formula. 76 | 77 | sub_out_formula = function(formula,context) { 78 | # Despite the name, this will work on something that 79 | # isn't a formula. That's by design: we want to allow 80 | # basic reference passing, and also to allow simple access 81 | # to words. 82 | 83 | if (class(context) != "VectorSpaceModel") {return(formula)} 84 | if (class(formula)=="formula") { 85 | formula[[2]] <- sub_out_tree(formula[[2]],context) 86 | return(eval(formula[[2]])) 87 | } 88 | if (is.character(formula)) {return(context[[formula]])} 89 | return(formula) 90 | } 91 | 92 | #' Vector Space Model class 93 | #' 94 | #' @description A class for describing and accessing Vector Space Models like Word2Vec. 95 | #' The base object is simply a matrix with columns describing dimensions and unique rownames 96 | #' as the names of vectors. This package gives a number of convenience functions for printing 97 | #' and, most importantly, accessing these objects. 
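#' For example, \code{model[["king"]]} returns the single row whose rowname is "king",
#' and \code{closest_to(model, "king")} ranks the other rows by cosine similarity to it.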
98 | #' @slot magnitudes The cached sum-of-squares for each row in the matrix. Can be cached to 99 | #' speed up similarity calculations 100 | #' @return An object of class "VectorSpaceModel" 101 | #' @exportClass VectorSpaceModel 102 | setClass("VectorSpaceModel",slots = c(".cache"="environment"),contains="matrix") 103 | #setClass("NormalizedVectorSpaceModel",contains="VectorSpaceModel") 104 | 105 | # This is Steve Lianoglu's method for associating a cache with an object 106 | # http://r.789695.n4.nabble.com/Change-value-of-a-slot-of-an-S4-object-within-a-method-td2338484.html 107 | setMethod("initialize", "VectorSpaceModel", 108 | function(.Object, ..., .cache=new.env()) { 109 | methods::callNextMethod(.Object, .cache=.cache, ...) 110 | }) 111 | 112 | #' Square Magnitudes with caching 113 | #' 114 | #' @param VectorSpaceModel A matrix or VectorSpaceModel object 115 | #' @description square_magnitudes Returns the square magnitudes and 116 | #' caches them if necessary 117 | #' @return A vector of the square magnitudes for each row 118 | #' @keywords internal 119 | square_magnitudes = function(object) { 120 | if (class(object)[1]=="VectorSpaceModel") { 121 | if (methods::.hasSlot(object, ".cache")) { 122 | if (is.null(object@.cache$magnitudes)) { 123 | object@.cache$magnitudes = rowSums(object^2) 124 | } 125 | return(object@.cache$magnitudes) 126 | } else { 127 | message("You seem to be using a VectorSpaceModel saved from an earlier version of this package.") 128 | message("To turn on caching on your model, which greatly speeds up queries, type") 129 | message("yourobjectname@.cache = new.env()") 130 | message("(replacing 'yourobjectname' with your actual model name)") 131 | return(rowSums(object^2)) 132 | } 133 | } else { 134 | return(rowSums(object^2)) 135 | } 136 | } 137 | 138 | #' VectorSpaceModel indexing 139 | #' 140 | #' @description Reduce a VectorSpaceModel to a smaller one 141 | #' @param x The vectorspace model to subset 142 | #' @param i The row numbers to extract 143 | #' @param j The column numbers to extract 144 | #' @param ... Other arguments passed to extract (unlikely to be useful). 145 | #' 146 | #' @param drop Whether to drop columns. This parameter is ignored. 147 | #' @return A VectorSpaceModel 148 | #' 149 | setMethod("[","VectorSpaceModel",function(x,i,j,...,drop) { 150 | nextup = methods::callNextMethod() 151 | if (!is.matrix(nextup)) { 152 | # A verbose way of effectively changing drop from TRUE to FALSE; 153 | # I don't want one-dimensional matrices turned to vectors. 154 | # I can't figure out how to do this more simply 155 | if (missing(j)) { 156 | nextup = matrix(nextup,ncol=ncol(x)) 157 | } else { 158 | nextup = matrix(nextup,ncol=j) 159 | } 160 | } 161 | methods::new("VectorSpaceModel",nextup) 162 | }) 163 | 164 | #' VectorSpaceModel subtraction 165 | #' 166 | #' @description Keep the VSM class when doing subtraction operations; 167 | #' make it possible to subtract a single row from an entire model. 168 | #' @param e1 A vector space model 169 | #' @param e2 A vector space model of equal size OR a vector 170 | #' space model of a single row. If the latter (which is more likely) 171 | #' the specified row will be subtracted from each row. 172 | #' 173 | #' 174 | #' @return A VectorSpaceModel of the same dimensions and rownames 175 | #' as e1 176 | #' 177 | #' I believe this is necessary, but honestly am not sure. 
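#' For example, \code{demo_vectors - demo_vectors[["good"]]} subtracts the single
#' "good" row from every row of the bundled demo model, returning a VectorSpaceModel.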
178 | #' 179 | setMethod("-",signature(e1="VectorSpaceModel",e2="VectorSpaceModel"),function(e1,e2) { 180 | if (nrow(e1)==nrow(e2) && ncol(e1)==ncol(e2)) { 181 | return (methods::new("VectorSpaceModel",e1@.Data-e2@.Data)) 182 | } 183 | if (nrow(e2)==1) { 184 | return( 185 | methods::new("VectorSpaceModel",e1 - matrix(rep(e2,each=nrow(e1)),nrow=nrow(e1))) 186 | ) 187 | } 188 | stop("Vector space model subtraction must use models of equal dimensions") 189 | }) 190 | 191 | #' VectorSpaceModel subsetting 192 | #' 193 | # @description Reduce a VectorSpaceModel to a single object. 194 | #' @param x The object being subsetted. 195 | #' @param i A character vector: the words to use as rownames. 196 | #' @param average Whether to collapse down to a single vector, 197 | #' or to return a subset of one row for each asked for. 198 | #' 199 | #' @return A VectorSpaceModel of a single row. 200 | setMethod("[[","VectorSpaceModel",function(x,i,average=TRUE) { 201 | # The wordvec class can extract a row from the matrix 202 | # by accessing the rownames. x[["king"]] gives the row 203 | # for which the rowname is "king"; x[[c("king","queen")]] gives 204 | # the midpoint of x[["king"]] and x[["queen"]], which can occasionally 205 | # be useful. 206 | if(typeof(i)=="character") { 207 | matching_rows = x[rownames(x) %in% i,] 208 | if (average) { 209 | val = matrix( 210 | colMeans(matching_rows) 211 | ,nrow=1 212 | ,dimnames = list( 213 | c(),colnames(x)) 214 | ) 215 | } else { 216 | val=matching_rows 217 | rownames(val) = rownames(x)[rownames(x) %in% i] 218 | } 219 | 220 | return(methods::new("VectorSpaceModel",val)) 221 | } 222 | 223 | else if (typeof(i)=="integer") { 224 | return(x[i,]) 225 | } else { 226 | stop("VectorSpaceModel objects are accessed by vectors of numbers or words") 227 | } 228 | }) 229 | 230 | setMethod("show","VectorSpaceModel",function(object) { 231 | dims = dim(object) 232 | cat("A VectorSpaceModel object of ",dims[1]," words and ", dims[2], " vectors\n") 233 | methods::show(unclass(object[1:min(nrow(object),10),1:min(ncol(object),6),drop=F])) 234 | }) 235 | 236 | #' Plot a Vector Space Model. 237 | #' 238 | #' Visualizing a model as a whole is sort of undefined. I think the 239 | #' sanest thing to do is reduce the full model down to two dimensions 240 | #' using T-SNE, which preserves some of the local clusters. 241 | #' 242 | #' For individual subsections, it can make sense to do a principal components 243 | #' plot of the space of just those letters. This is what happens if method 244 | #' is pca. On the full vocab, it's kind of a mess. 245 | #' 246 | #' This plots only the first 300 words in the model. 247 | #' 248 | #' @param x The model to plot 249 | #' @param method The method to use for plotting. "pca" is principal components, "tsne" is t-sne 250 | #' @param ... Further arguments passed to the plotting method. 251 | #' 252 | #' @return The TSNE model (silently.) 253 | #' @export 254 | setMethod("plot","VectorSpaceModel",function(x,method="tsne",...) { 255 | if (method=="tsne") { 256 | message("Attempting to use T-SNE to plot the vector representation") 257 | message("Cancel if this is taking too long") 258 | message("Or run 'install.packages' tsne if you don't have it.") 259 | x = as.matrix(x) 260 | short = x[1:min(300,nrow(x)),] 261 | m = tsne::tsne(short,...) 
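    # Lay out the 2-d t-SNE coordinates and label each point with its token;
    # label size shrinks for later (less frequent) rows.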
262 | graphics::plot(m,type='n',main="A two dimensional reduction of the vector space model using t-SNE") 263 | graphics::text(m,rownames(short),cex = ((400:1)/200)^(1/3)) 264 | rownames(m)=rownames(short) 265 | silent = m 266 | } else if (method=="pca") { 267 | vectors = stats::predict(stats::prcomp(x))[,1:2] 268 | graphics::plot(vectors,type='n') 269 | graphics::text(vectors,labels=rownames(vectors)) 270 | } 271 | }) 272 | 273 | #' Convert to a Vector Space Model 274 | #' 275 | #' @param matrix A matrix to coerce. 276 | #' 277 | #' @return An object of class "VectorSpaceModel" 278 | #' @export as.VectorSpaceModel 279 | as.VectorSpaceModel = function(matrix) { 280 | return(methods::new("VectorSpaceModel",matrix)) 281 | } 282 | 283 | #' Read VectorSpaceModel 284 | #' 285 | #' Read a VectorSpaceModel from a file exported from word2vec or a similar output format. 286 | #' 287 | #' @param filename The file to read in. 288 | #' @param vectors The number of dimensions word2vec calculated. Imputed automatically if not specified. 289 | #' @param binary Read in the binary word2vec form. (Wraps `read.binary.vectors`) By default, function 290 | #' guesses based on file suffix. 291 | #' @param ... Further arguments passed to read.table or read.binary.vectors. 292 | #' Note that both accept 'nrows' as an argument. Word2vec produces 293 | #' by default frequency sorted output. Therefore 'read.vectors("file.bin", nrows=500)', for example, 294 | #' will return the vectors for the top 500 words. This can be useful on machines with limited 295 | #' memory. 296 | #' @export 297 | #' @return An matrixlike object of class `VectorSpaceModel` 298 | #' 299 | read.vectors <- function(filename,vectors=guess_n_cols(),binary=NULL,...) { 300 | if(rev(strsplit(filename,"\\.")[[1]])[1] =="bin" && is.null(binary)) { 301 | message("Filename ends with .bin, so reading in binary format") 302 | binary=TRUE 303 | } 304 | 305 | if(binary) { 306 | return(read.binary.vectors(filename,...)) 307 | } 308 | 309 | # Figure out how many dimensions. 310 | guess_n_cols = function() { 311 | # if cols is not defined 312 | test = utils::read.table(filename,header=F,skip=1, 313 | nrows=1,quote="",comment.char="") 314 | return(ncol(test)-1) 315 | } 316 | vectors_matrix = utils::read.table(filename,header=F,skip=1, 317 | colClasses = c("character",rep("numeric",vectors)), 318 | quote="",comment.char="",...) 319 | names(vectors_matrix)[1] = "word" 320 | vectors_matrix$word[is.na(vectors_matrix$word)] = "NA" 321 | matrix = as.matrix(vectors_matrix[,colnames(vectors_matrix)!="word"]) 322 | rownames(matrix) = vectors_matrix$word 323 | colnames(matrix) = paste0("V",1:vectors) 324 | return(methods::new("VectorSpaceModel",matrix)) 325 | } 326 | 327 | #' Read binary word2vec format files 328 | #' 329 | #' @param filename A file in the binary word2vec format to import. 330 | #' @param nrows Optionally, a number of rows to stop reading after. 331 | #' Word2vec sorts by frequency, so limiting to the first 1000 rows will 332 | #' give the thousand most-common words; it can be useful not to load 333 | #' the whole matrix into memory. This limit is applied BEFORE `name_list` and 334 | #' `name_regexp`. 335 | #' @param cols The column numbers to read. Default is "All"; 336 | #' if you are in a memory-limited environment, 337 | #' you can limit the number of columns you read in by giving a vector of column integers 338 | #' @param rowname_list A whitelist of words. 
If you wish to read in only a few dozen words, 339 | #' all other rows will be skipped and only these read in. 340 | #' @param rowname_regexp A regular expression specifying a pattern for rows to read in. Row 341 | #' names matching that pattern will be included in the read; all others will be skipped. 342 | #' @return A VectorSpaceModel object 343 | #' @export 344 | 345 | read.binary.vectors = function(filename,nrows=Inf,cols="All", rowname_list = NULL, rowname_regexp = NULL) { 346 | if (!is.null(rowname_list) && !is.null(rowname_regexp)) {stop("Specify a whitelist of names or a regular expression to be applied to all input, not both.")} 347 | a = file(filename,'rb') 348 | rows = "" 349 | mostRecent="" 350 | while(mostRecent!=" ") { 351 | mostRecent = readChar(a,1) 352 | rows = paste0(rows,mostRecent) 353 | } 354 | rows = as.integer(rows) 355 | 356 | col_number = "" 357 | while(mostRecent!="\n") { 358 | mostRecent = readChar(a,1) 359 | col_number = paste0(col_number,mostRecent) 360 | } 361 | col_number = as.integer(col_number) 362 | 363 | if(nrows% closest_to("good") 778 | #' 779 | nearest_to = function(...) { 780 | vals = closest_to(...,fancy_names = F) 781 | returnable = 1 - vals$similarity 782 | names(returnable) = vals$word 783 | returnable 784 | } 785 | -------------------------------------------------------------------------------- /R/utils.R: -------------------------------------------------------------------------------- 1 | #' @importFrom magrittr %>% 2 | #' @export 3 | magrittr::`%>%` 4 | -------------------------------------------------------------------------------- /R/word2vec.R: -------------------------------------------------------------------------------- 1 | ##' Train a model by word2vec. 2 | ##' 3 | ##' The word2vec tool takes a text corpus as input and produces the 4 | ##' word vectors as output. It first constructs a vocabulary from the 5 | ##' training text data and then learns vector representation of words. 6 | ##' The resulting word vector file can be used as features in many 7 | ##' natural language processing and machine learning applications. 8 | ##' 9 | ##' 10 | ##' 11 | ##' @title Train a model by word2vec. 12 | ##' @param train_file Path of a single .txt file for training. Tokens are split on spaces. 13 | ##' @param output_file Path of the output file. 14 | ##' @param vectors The number of vectors to output. Defaults to 100. 15 | ##' More vectors usually means more precision, but also more random error, higher memory usage, and slower operations. 16 | ##' Sensible choices are probably in the range 100-500. 17 | ##' @param threads Number of threads to run training process on. 18 | ##' Defaults to 1; up to the number of (virtual) cores on your machine may speed things up. 19 | ##' @param window The size of the window (in words) to use in training. 20 | ##' @param classes Number of classes for k-means clustering. Not documented/tested. 21 | ##' @param cbow If 1, use a continuous-bag-of-words model instead of skip-grams. 22 | ##' Defaults to false (recommended for newcomers). 23 | ##' @param min_count Minimum times a word must appear to be included in the samples. 24 | ##' High values help reduce model size. 25 | ##' @param iter Number of passes to make over the corpus in training. 26 | ##' @param force Whether to overwrite existing model files. 27 | ##' @param negative_samples Number of negative samples to take in skip-gram training. 0 means full sampling, while lower numbers 28 | ##' give faster training. 
For large corpora 2-5 may work; for smaller corpora, 5-15 is reasonable. 29 | ##' @return A VectorSpaceModel object. 30 | ##' @author Jian Li <\email{rweibo@@sina.com}>, Ben Schmidt <\email{bmchmidt@@gmail.com}> 31 | ##' @references \url{https://code.google.com/p/word2vec/} 32 | ##' @export 33 | ##' 34 | ##' @useDynLib wordVectors 35 | ##' 36 | ##' @examples \dontrun{ 37 | ##' model = train_word2vec(system.file("examples", "rfaq.txt", package = "wordVectors")) 38 | ##' } 39 | train_word2vec <- function(train_file, output_file = "vectors.bin",vectors=100,threads=1,window=12, 40 | classes=0,cbow=0,min_count=5,iter=5,force=F, negative_samples=5) 41 | { 42 | if (!file.exists(train_file)) stop("Can't find the training file!") 43 | if (file.exists(output_file) && !force) stop("The output file '", 44 | output_file , 45 | "' already exists: give a new destination or run with 'force=TRUE'.") 46 | 47 | train_dir <- dirname(train_file) 48 | 49 | # cat HDA15/data/Dickens/* | perl -pe 'print "1\t"' | egrep "[a-z]" | bookworm tokenize token_stream > ~/test.txt 50 | 51 | if(missing(output_file)) { 52 | output_file <- gsub(gsub("^.*\\.", "", basename(train_file)), "bin", basename(train_file)) 53 | output_file <- file.path(train_dir, output_file) 54 | } 55 | 56 | outfile_dir <- dirname(output_file) 57 | if (!file.exists(outfile_dir)) dir.create(outfile_dir, recursive = TRUE) 58 | 59 | train_file <- normalizePath(train_file, winslash = "/", mustWork = FALSE) 60 | output_file <- normalizePath(output_file, winslash = "/", mustWork = FALSE) 61 | # Whether to output binary, default is 1 means binary. 62 | binary = 1 63 | 64 | OUT <- .C("CWrapper_word2vec", 65 | train_file = as.character(train_file), 66 | output_file = as.character(output_file), 67 | binary = as.character(binary), 68 | dims=as.character(vectors), 69 | threads=as.character(threads), 70 | window=as.character(window), 71 | classes=as.character(classes), 72 | cbow=as.character(cbow), 73 | min_count=as.character(min_count), 74 | iter=as.character(iter), 75 | neg_samples=as.character(negative_samples) 76 | ) 77 | 78 | read.vectors(output_file) 79 | } 80 | 81 | #' Prepare documents for word2Vec 82 | #' 83 | #' @description This function exports a directory or document to a single file 84 | #' suitable to Word2Vec run on. That means a single, seekable txt file 85 | #' with tokens separated by spaces. (For example, punctuation is removed 86 | #' rather than attached to the end of words.) 87 | #' This function is extraordinarily inefficient: in most real-world cases, you'll be 88 | #' much better off preparing the documents using python, perl, awk, or any other 89 | #' scripting language that can reasonable read things in line-by-line. 90 | #' 91 | #' @param origin A text file or a directory of text files 92 | #' to be used in training the model 93 | #' @param destination The location for output text. 94 | #' @param lowercase Logical. Should uppercase characters be converted to lower? 95 | #' @param bundle_ngrams Integer. Statistically significant phrases of up to this many words 96 | #' will be joined with underscores: e.g., "United States" will usually be changed to "United_States" 97 | #' if it appears frequently in the corpus. This calls word2phrase once if bundle_ngrams is 2, 98 | #' twice if bundle_ngrams is 3, and so forth; see that function for more details. 99 | #' @param ... Further arguments passed to word2phrase when bundle_ngrams is 100 | #' greater than 1. 101 | #' 102 | #' @export 103 | #' 104 | #' @return The file name (silently). 
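#' @examples
#' \dontrun{
#' # A minimal sketch: "cookbooks/" and "cookbooks.txt" are hypothetical paths
#' # standing in for your own corpus directory and output file.
#' prep_word2vec("cookbooks/", "cookbooks.txt", lowercase = TRUE, bundle_ngrams = 2)
#' }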
105 | prep_word2vec <- function(origin,destination,lowercase=F, 106 | bundle_ngrams=1, ...) 107 | { 108 | # strsplit chokes on large lines. I would not have gone down this path if I knew this 109 | # to begin with. 110 | 111 | 112 | 113 | message("Beginning tokenization to text file at ", destination) 114 | if (!exists("dir.exists")) { 115 | # Use the version from devtools if in R < 3.2.0 116 | dir.exists <- function (x) 117 | { 118 | res <- file.exists(x) & file.info(x)$isdir 119 | stats::setNames(res, x) 120 | } 121 | } 122 | 123 | if (dir.exists(origin)) { 124 | origin = list.files(origin,recursive=T,full.names = T) 125 | } 126 | 127 | if (file.exists(destination)) file.remove(destination) 128 | 129 | tokenize_words = function (x, lowercase = TRUE) { 130 | # This is an abbreviated version of the "tokenizers" package version to remove the dependency. 131 | # Sorry, Lincoln, it was failing some tests. 132 | if (lowercase) x <- stringi::stri_trans_tolower(x) 133 | out <- stringi::stri_split_boundaries(x, type = "word", skip_word_none = TRUE) 134 | unlist(out) 135 | } 136 | 137 | prep_single_file <- function(file_in, file_out, lowercase) { 138 | message("Prepping ", file_in) 139 | 140 | text <- file_in %>% 141 | readr::read_file() %>% 142 | tokenize_words(lowercase) %>% 143 | stringr::str_c(collapse = " ") 144 | 145 | stopifnot(length(text) == 1) 146 | readr::write_lines(text, file_out, append = TRUE) 147 | return(TRUE) 148 | } 149 | 150 | 151 | Map(prep_single_file, origin, lowercase=lowercase, file_out=destination) 152 | 153 | # Save the ultimate output 154 | real_destination_name = destination 155 | 156 | # Repeatedly build bigrams, trigrams, etc. 157 | if (bundle_ngrams > 1) { 158 | while(bundle_ngrams > 1) { 159 | old_destination = destination 160 | destination = paste0(destination,"_") 161 | word2phrase(old_destination,destination,...) 162 | file.remove(old_destination) 163 | bundle_ngrams = bundle_ngrams - 1 164 | } 165 | file.rename(destination,real_destination_name) 166 | } 167 | 168 | silent = real_destination_name 169 | } 170 | 171 | 172 | #' Convert words to phrases in a text file. 173 | #' 174 | #' This function attempts to learn phrases given a text document. 175 | #' It does so by progressively joining adjacent pairs of words with an '_' character. 176 | #' You can then run the code multiple times to create multiword phrases. 177 | #' Wrapper around code from the Mikolov's original word2vec release. 178 | #' 179 | #' @title Convert words to phrases 180 | #' @author Tomas Mikolov 181 | #' @param train_file Path of a single .txt file for training. 182 | #' Tokens are split on spaces. 183 | #' @param output_file Path of output file 184 | #' @param debug_mode debug mode. Must be 0, 1 or 2. 0 is silent; 1 print summary statistics; 185 | #' prints progress regularly. 186 | #' @param min_count Minimum times a word must appear to be included in the samples. 187 | #' High values help reduce model size. 188 | #' @param threshold Threshold value for determining if pairs of words are phrases. 189 | #' @param force Whether to overwrite existing files at the output location. Default FALSE 190 | #' 191 | #' @return The name of output_file, the trained file where common phrases are now joined. 
192 | #' 193 | #' @export 194 | #' @examples 195 | #' \dontrun{ 196 | #' model=word2phrase("text8","vec.txt") 197 | #' } 198 | 199 | word2phrase=function(train_file,output_file,debug_mode=0,min_count=5,threshold=100,force=FALSE) 200 | { 201 | if (!file.exists(train_file)) stop("Can't find the training file!") 202 | if (file.exists(output_file) && !force) stop("The output file '", 203 | output_file , 204 | "' already exists: give a new destination or run with 'force=TRUE'.") 205 | OUT=.C("word2phrase",rtrain_file=as.character(train_file), 206 | rdebug_mode=as.integer(debug_mode), 207 | routput_file=as.character(output_file), 208 | rmin_count=as.integer(min_count), 209 | rthreshold=as.double(threshold)) 210 | return(output_file) 211 | } 212 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Word Vectors 2 | 3 | [![Build Status](https://travis-ci.org/bmschmidt/wordVectors.svg?branch=master)](https://travis-ci.org/bmschmidt/wordVectors) 4 | 5 | An R package for building and exploring word embedding models. 6 | 7 | # Description 8 | 9 | This package does three major things to make it easier to work with word2vec and other vectorspace models of language. 10 | 11 | 1. [Trains word2vec models](#creating-text-vectors) using an extended Jian Li's word2vec code; reads and writes the binary word2vec format so that you can import pre-trained models such as Google's; and provides tools for reading only *part* of a model (rows or columns) so you can explore a model in memory-limited situations. 12 | 2. [Creates a new `VectorSpaceModel` class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods.](#vectorspacemodel-object) For example, instead of writing 13 | 14 | > `model[rownames(model)=="king",]`, 15 | 16 | you can write 17 | 18 | > `model[["king"]]`, 19 | 20 | and instead of writing 21 | 22 | > `vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])` (whew!), 23 | 24 | you can write 25 | 26 | > `vectors %>% closest_to(~"king" - "man" + "woman")`. 27 | 28 | 3. [Implements several basic matrix operations that are useful in exploring word embedding models including cosine similarity, nearest neighbor, and vector projection](#useful-matrix-operations) with some caching that makes them much faster than the simplest implementations. 29 | 30 | ### Quick start 31 | 32 | For a step-by-step interactive demo that includes installation and training a model on 77 historical cookbooks from Michigan State University, [see the introductory vignette.](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd). 33 | 34 | ### Credit 35 | 36 | This includes an altered version of Tomas Mikolov's original C code for word2vec; those wrappers were origally written by Jian Li, and I've only tweaked them a little. Several other users have improved that code since I posted it here. 37 | 38 | Right now, it [does not (I don't think) install under Windows 8](https://github.com/bmschmidt/wordVectors/issues/2). Help appreciated on that thread. OS X, Windows 7, Windows 10, and Linux install perfectly well, with one or two exceptions. 39 | 40 | It's not extremely fast, but once the data is loaded in most operations happen in suitable time for exploratory data analysis (under a second on my laptop.) 
41 | 42 | For high-performance analysis of models, C or python's numpy/gensim will likely be better than this package, in part because R doesn't have support for single-precision floats. The goal of this package is to facilitate clear code and exploratory data analysis of models. 43 | 44 | Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms. 45 | 46 | ## Creating text vectors. 47 | 48 | One portion of this is an expanded version of the code from Jian Li's `word2vec` package with a few additional parameters enabled as the function `train_word2vec`. 49 | 50 | The input must still be in a single file and pre-tokenized, but it uses the existing word2vec C code. For online data processing, I like the gensim python implementation, but I don't plan to link that to R. 51 | 52 | In RStudio I've noticed that this appears to hang, but if you check processors it actually still runs. Try it on smaller portions first, and then let it take time: the training function can take hours for tens of thousands of books. 53 | 54 | ## VectorSpaceModel object 55 | 56 | The package loads in the word2vec binary format with the format `read.vectors` into a new object called a "VectorSpaceModel" object. It's a light superclass of the standard R matrix object. Anything you can do with matrices, you can do with VectorSpaceModel objects. 57 | 58 | It has a few convenience functions as well. 59 | 60 | ### Faster Access to text vectors 61 | 62 | The rownames of a VectorSpaceModel object are presumed to be tokens in a vector space model and therefore semantically useful. The classic word2vec demonstration is that vector('king') - vector('man') + vector('woman') =~ vector('queen'). With a standard matrix, the vector on the right-hand side of the equation would be described as 63 | 64 | ```{r, include=F,show=T} 65 | vector_set[rownames(vector_set)=="king",] - vector_set[rownames(vector_set)=="man",] + vector_set[rownames(vector_set)=="woman",] 66 | ``` 67 | 68 | In this package, you can simply access it by using the double brace operators: 69 | 70 | ```{r, include=F,show=T} 71 | vector_set[["king"]] - vector_set[["man"]] + vector_set[["woman"]] 72 | ``` 73 | 74 | (And in the context of the custom functions, as a formula like `~"king" - "man" + "woman"`: see below). 75 | 76 | Since frequently an average of two vectors provides a better indication, multiple words can be collapsed into a single vector by specifying multiple labels. For example, this may provide a slightly better gender vector: 77 | 78 | ```{r} 79 | vector_set[["king"]] - vector_set[[c("man","men")]] + vector_set[[c("woman","women")]] 80 | ``` 81 | 82 | Sometimes you want to subset *without* averaging. You can do this with the argument `average==FALSE` to the subset. This is particularly useful for comparing slices of the matrix to itself in similarity operations. 83 | 84 | ```{r} 85 | cosineSimilarity(vector_set[[c("man","men","king"),average=F]], vector_set[[c("woman","women","queen"),average=F]] 86 | ``` 87 | 88 | ## A few native functions defined on the VectorSpaceModel object. 89 | 90 | The native `show` method just prints the dimensions; the native `plot` method does some crazy reductions with the T-SNE package (installation required for functionality) because T-SNE is a nice way to reduce down the size of vectors, **or** lets you pass `method='pca'` to array a full set or subset by the first two principal components. 
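
For example, with the bundled `demo_vectors` model (a rough sketch; any `VectorSpaceModel` works the same way, and the t-SNE call needs the `tsne` package installed):

```{r}
# t-SNE layout of the model (only the first 300 words are plotted)
plot(demo_vectors)

# principal-components plot of a hand-picked subset
demo_vectors[[c("good","bad","he","she"), average=FALSE]] %>% plot(method="pca")
```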
91 | 92 | 93 | ## Useful matrix operations 94 | 95 | One challenge of vector-space models of texts is that it takes some basic matrix multiplication functions to make them dance around in an entertaining way. 96 | 97 | This package bundles the ones I think are the most useful. 98 | Each takes a `VectorSpaceModel` as its first argument. Sometimes, it's appropriate for the VSM to be your entire data set; other times, it's sensible to limit it to just one or a few vectors. Where appropriate, the functions can also take vectors or matrices as inputs. 99 | 100 | * `cosineSimilarity(VSM_1,VSM_2)` calculates the cosine similarity of every vector in one vector space model to every vector in another. This is `n^2` complexity. With a vocabulary size of 20,000 or so, it can be reasonable to compare an entire set to itself; or you can compare a larger set to a smaller one to search for particular terms of interest. 101 | * `cosineDist(VSM_1,VSM_2)` is the inverse of `cosineSimilarity`. It's not really a distance metric, but can be used as one for clustering and the like. 102 | * `closest_to(VSM,vector,n)` wraps a particularly common use case for `cosineSimilarity`: finding the top `n` terms in a `VectorSpaceModel` closest to a given vector. 103 | * `project(VSM,vector)` takes a `VectorSpaceModel` and returns the portion parallel to the vector `vector`. 104 | * `reject(VSM,vector)` is the inverse of `project`; it takes a `VectorSpaceModel` and returns the portion orthogonal to the vector `vector`. This makes it possible, for example, to collapse a vector space by removing certain distinctions of meaning. 105 | * `magnitudes` calculates the magnitude of each row in a VSM. This is useful in many operations. 106 | 107 | All of these functions place the VSM object as the first argument. This makes it easy to chain together operations using the `magrittr` package. For example, beginning with a single vector set, one could find the nearest words in a set to a version of the vector for "bank" that has been decomposed to remove any semantic similarity to the banking sector. 108 | 109 | ``` {r} 110 | library(magrittr) 111 | not_that_kind_of_bank = chronam_vectors[["bank"]] %>% 112 | reject(chronam_vectors[["cashier"]]) %>% 113 | reject(chronam_vectors[["depositors"]]) %>% 114 | reject(chronam_vectors[["check"]]) 115 | chronam_vectors %>% closest_to(not_that_kind_of_bank) 116 | ``` 117 | 118 | These functions also allow an additional layer of syntactic sugar when working with word vectors. 119 | 120 | If you're working entirely with a single model, you can express the whole operation as a formula, so you don't have to keep referring to the model by name; the formula interface reduces typing and increases clarity. 121 | 122 | ```{r} 123 | vectors %>% closest_to(~ "king" - "man" + "woman") 124 | ``` 125 | 126 | 127 | # Quick start 128 | 129 | ## Install the wordVectors package. 130 | 131 | One of the major hurdles to running word2vec for ordinary people is that it requires compiling a C program. For many people, it may be easier to install it in R. 132 | 133 | 1. If you haven't already, [install R](https://cran.rstudio.com/) and then [install RStudio](https://www.rstudio.com/products/rstudio/download/). 134 | 2. Open R, and get a command-line prompt (the thing with a `>` on the left hand side.) This is where you'll be copy-pasting commands. 135 | 3. Install (if you don't already have it) the package `devtools` by pasting the following 136 | ```R 137 | install.packages("devtools") 138 | ``` 139 | 140 | 4. 
Install the latest version of this package from Github by pasting in the following. 141 | ```R 142 | devtools::install_github("bmschmidt/wordVectors") 143 | ``` 144 | Windows users may need to install "Rtools" as well: if so, a message to this effect should appear in red on the screen. This may cycle through a very large number of warnings: so long as it says "warning" and not "error", you're probably OK. 145 | 146 | ## Train a model. 147 | 148 | For instructions on training, see the [introductory vignette](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd) 149 | 150 | ## Explore an existing model. 151 | 152 | For instructions on exploration, see the end of the [introductory vignette](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd), or the slower-paced [vignette on exploration](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/exploration.Rmd) 153 | -------------------------------------------------------------------------------- /data/demo_vectors.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bmschmidt/wordVectors/ad127c1badc1f9303a2b0d34e65acfc8759d2e3f/data/demo_vectors.rda -------------------------------------------------------------------------------- /inst/doc/exploration.R: -------------------------------------------------------------------------------- 1 | ## ------------------------------------------------------------------------ 2 | library(wordVectors) 3 | library(magrittr) 4 | 5 | ## ------------------------------------------------------------------------ 6 | demo_vectors[["good"]] 7 | 8 | ## ------------------------------------------------------------------------ 9 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 10 | 11 | ## ------------------------------------------------------------------------ 12 | demo_vectors %>% closest_to("bad") 13 | 14 | ## ------------------------------------------------------------------------ 15 | 16 | demo_vectors %>% closest_to(~"good"+"bad") 17 | 18 | # The same thing could be written as: 19 | # demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]]) 20 | 21 | ## ------------------------------------------------------------------------ 22 | demo_vectors %>% closest_to(~"good" - "bad") 23 | 24 | ## ------------------------------------------------------------------------ 25 | demo_vectors %>% closest_to(~ "bad" - "good") 26 | 27 | ## ------------------------------------------------------------------------ 28 | demo_vectors %>% closest_to(~ "he" - "she") 29 | demo_vectors %>% closest_to(~ "she" - "he") 30 | 31 | ## ------------------------------------------------------------------------ 32 | demo_vectors %>% closest_to(~ "guy" - "he" + "she") 33 | 34 | ## ------------------------------------------------------------------------ 35 | demo_vectors %>% closest_to(~ "guy" + ("she" - "he")) 36 | 37 | ## ------------------------------------------------------------------------ 38 | 39 | demo_vectors[[c("lady","woman","man","he","she","guy","man"), average=F]] %>% 40 | plot(method="pca") 41 | 42 | 43 | ## ------------------------------------------------------------------------ 44 | top_evaluative_words = demo_vectors %>% 45 | closest_to(~ "good"+"bad",n=75) 46 | 47 | goodness = demo_vectors %>% 48 | closest_to(~ "good"-"bad",n=Inf) 49 | 50 | femininity = demo_vectors %>% 51 | closest_to(~ "she" - "he", n=Inf) 52 | 53 | ## 
------------------------------------------------------------------------ 54 | library(ggplot2) 55 | library(dplyr) 56 | 57 | top_evaluative_words %>% 58 | inner_join(goodness) %>% 59 | inner_join(femininity) %>% 60 | ggplot() + 61 | geom_text(aes(x=`similarity to "she" - "he"`, 62 | y=`similarity to "good" - "bad"`, 63 | label=word)) 64 | 65 | -------------------------------------------------------------------------------- /inst/doc/exploration.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec Workshop" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Vignette Title} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Exploring Word2Vec models 13 | 14 | R is a great language for *exploratory data analysis* in particular. If you're going to use a word2vec model in a larger pipeline, it may be important (intellectually or ethically) to spend a little while understanding what kind of model of language you've learned. 15 | 16 | This package makes it easy to do so, both by allowing you to read word2vec models into and out of R, and by giving some syntactic sugar that lets you describe vector-space models concisely and clearly. 17 | 18 | Note that these functions may still be useful if you're a data analyst training word2vec models elsewhere (say, in gensim.) I'm also hopeful this can be a good way of interacting with varied vector models in a workshop session. 19 | 20 | If you want to train your own model or need help setting up the package, read the introductory vignette. Aside from the installation, it assumes more knowledge of R than this walkthrough. 21 | 22 | ## Why explore? 23 | 24 | In this vignette we're going to look at (a small portion of) a model trained on teaching evaluations. It's an interesting set, but it's also one that shows the importance of exploring vector space models before you use them. Exploration is important because: 25 | 26 | 1. If you're a humanist or social scientist, it can tell you something about the *sources* by letting you see how they use language. These co-occurrence patterns can then be better investigated through close reading or more traditional collocation scores, which are potentially more reliable but also much slower and less flexible. 27 | 28 | 2. If you're an engineer, it can help you understand some of the biases built into a model that you're using in a larger pipeline. This can be both technically and ethically important: you don't want, for instance, to build a job-recommendation system which is disinclined to offer programming jobs to women because it has learned that women are underrepresented in CS jobs already. 29 | (On this point in word2vec in particular, see [here](https://freedom-to-tinker.com/blog/randomwalker/language-necessarily-contains-human-biases-and-so-will-machines-trained-on-language-corpora/) and [here](https://arxiv.org/abs/1607.06520).) 30 | 31 | ## Getting started. 32 | 33 | First we'll load this package, and the recommended package `magrittr`, which lets us pipe objects from one function to the next. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | The basic element of any vector space model is a *vector* for each word. In the demo data included with this package, an object called 'demo_vectors', each word's vector consists of 500 numbers: you can start to examine them, if you wish, by hand. So let's consider just one of these--the vector for 'good'.
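Before zooming in on 'good', a quick base-R sanity check shows the overall shape of the demo data. (This is a minimal sketch; nothing here is specific to the package, since a VectorSpaceModel behaves like an ordinary matrix.)

```{r}
dim(demo_vectors)            # how many words, and how many dimensions per word
head(rownames(demo_vectors)) # the first few tokens the model knows about
```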
41 | 42 | In R's ordinary matrix syntax, you could write that out laboriously as `demo_vectors[rownames(demo_vectors)=="good",]`. `wordVectors` provides a shorthand using double brackets: 43 | 44 | ```{r} 45 | demo_vectors[["good"]] 46 | ``` 47 | 48 | These numbers are meaningless on their own. But in the vector space, we can find similar words. 49 | 50 | ```{r} 51 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 52 | ``` 53 | 54 | The `%>%` is the pipe operator from magrittr; it helps to keep things organized, and is particularly useful with some of the things we'll see later. The 'similarity' scores here are cosine similarity in a vector space; 1.0 represents perfect similarity, 0 is no correlation, and -1.0 is complete opposition. In practice, vector "opposition" is different from the colloquial use of "opposite," and very rare. You'll only occasionally see vector scores below 0--as you can see above, "bad" is actually one of the most similar words to "good." 55 | 56 | When interactively exploring a single model (rather than comparing *two* models), it can be a pain to keep retyping words over and over. Rather than operate on the vectors, this package also lets you access the word directly by using R's formula notation: putting a tilde in front of it. For a single word, you can even access it directly, like so. 57 | 58 | ```{r} 59 | demo_vectors %>% closest_to("bad") 60 | ``` 61 | 62 | ## Vector math 63 | 64 | The tildes are necessary syntax where things get interesting--you can do **math** on these vectors. So if we want to find the words that are closest to the *combination* of "good" and "bad" (which is to say, words that get used in evaluation) we can write (see where the tilde is?): 65 | 66 | ```{r} 67 | 68 | demo_vectors %>% closest_to(~"good"+"bad") 69 | 70 | # The same thing could be written as: 71 | # demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]]) 72 | ``` 73 | 74 | Those are words that are common to both "good" and "bad". We could also find words that are shaded towards just good but *not* bad by using subtraction. 75 | 76 | ```{r} 77 | demo_vectors %>% closest_to(~"good" - "bad") 78 | ``` 79 | 80 | > What does this "subtraction" vector mean? 81 | > In practice, the easiest way to think of it is probably simply as 'similar to 82 | > good' and 'dissimilar to bad'. Omer Levy's papers suggest this interpretation. 83 | > But taking the vectors more seriously means you can think of it geometrically: "good"-"bad" is 84 | > a vector that describes the difference between positive and negative. 85 | > Similarity to this vector means, technically, the portion of a word's vector 86 | > whose multidimensional path lies largely along the direction between the two words. 87 | 88 | Again, you can easily switch the order to the opposite: here are a bunch of bad words: 89 | 90 | ```{r} 91 | demo_vectors %>% closest_to(~ "bad" - "good") 92 | ``` 93 | 94 | All sorts of binaries are captured in word2vec models. One of the most famous, since Mikolov's original word2vec paper, is *gender*. If you ask for similarity to "he"-"she", for example, you get words that appear mostly in a *male* context. 
Since these examples are from teaching evaluations, after just a few straightforwardly gendered words, we start to get things that only men are called ("arrogant") or fields where there are very few women in the university ("physics"). 95 | 96 | ```{r} 97 | demo_vectors %>% closest_to(~ "he" - "she") 98 | demo_vectors %>% closest_to(~ "she" - "he") 99 | ``` 100 | 101 | ## Analogies 102 | 103 | We can expand out the match to perform analogies. Men tend to be called 'guys'. 104 | What's the female equivalent? 105 | In an SAT-style analogy, you might write `he:guy::she:???`. 106 | In vector math, we think of this as moving between points. 107 | 108 | If you're using the mental framework of positive as 'similarity' and 109 | negative as 'dissimilarity,' you can think of this as starting at "guy", 110 | removing its similarity to "he", and adding a similarity to "she". 111 | 112 | This yields the answer: the most similar term to "guy" for a woman is "lady." 113 | 114 | ```{r} 115 | demo_vectors %>% closest_to(~ "guy" - "he" + "she") 116 | ``` 117 | 118 | If you're using the other mental framework, of thinking of these as real vectors, 119 | you might phrase this in a slightly different way. 120 | You have a gender vector `("female" - "male")` that represents the *direction* from masculinity 121 | to femininity. You can then add this vector to "guy", and that will take you to a new neighborhood. You might phrase that this way: note that the math is exactly equivalent, and 122 | only the grouping is different. 123 | 124 | ```{r} 125 | demo_vectors %>% closest_to(~ "guy" + ("she" - "he")) 126 | ``` 127 | 128 | Principal components can let you plot a subset of these vectors to see how they relate. You can imagine an arrow from "he" to "she", from "guy" to "lady", and from "man" to "woman"; all run in roughly the same direction. 129 | 130 | ```{r} 131 | 132 | demo_vectors[[c("lady","woman","man","he","she","guy","man"), average=F]] %>% 133 | plot(method="pca") 134 | 135 | ``` 136 | 137 | These lists of ten words at a time are useful for interactive exploration, but sometimes we might want to say 'n=Inf' to return the full list. For instance, we can combine these two methods to look at positive and negative words used to evaluate teachers. 138 | 139 | First we build up three data frames: first, a list of the 75 top evaluative words, and then complete lists of similarity to `"good" - "bad"` and `"she" - "he"`. 140 | 141 | ```{r} 142 | top_evaluative_words = demo_vectors %>% 143 | closest_to(~ "good"+"bad",n=75) 144 | 145 | goodness = demo_vectors %>% 146 | closest_to(~ "good"-"bad",n=Inf) 147 | 148 | femininity = demo_vectors %>% 149 | closest_to(~ "she" - "he", n=Inf) 150 | ``` 151 | 152 | Then we can use tidyverse packages to join and plot these. 153 | An `inner_join` restricts us down to just those top 75 words, and ggplot 154 | can array the words on axes. 
155 | 156 | ```{r} 157 | library(ggplot2) 158 | library(dplyr) 159 | 160 | top_evaluative_words %>% 161 | inner_join(goodness) %>% 162 | inner_join(femininity) %>% 163 | ggplot() + 164 | geom_text(aes(x=`similarity to "she" - "he"`, 165 | y=`similarity to "good" - "bad"`, 166 | label=word)) 167 | ``` 168 | 169 | -------------------------------------------------------------------------------- /inst/doc/introduction.R: -------------------------------------------------------------------------------- 1 | ## ------------------------------------------------------------------------ 2 | if (!require(wordVectors)) { 3 | if (!(require(devtools))) { 4 | install.packages("devtools") 5 | } 6 | devtools::install_github("bmschmidt/wordVectors") 7 | } 8 | 9 | 10 | 11 | ## ------------------------------------------------------------------------ 12 | library(wordVectors) 13 | library(magrittr) 14 | 15 | ## ------------------------------------------------------------------------ 16 | if (!file.exists("cookbooks.zip")) { 17 | download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip") 18 | } 19 | unzip("cookbooks.zip",exdir="cookbooks") 20 | 21 | ## ------------------------------------------------------------------------ 22 | if (!file.exists("cookbooks.txt")) prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=2) 23 | 24 | ## ------------------------------------------------------------------------ 25 | if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin") 26 | 27 | 28 | ## ------------------------------------------------------------------------ 29 | model %>% closest_to("fish") 30 | 31 | ## ------------------------------------------------------------------------ 32 | model %>% 33 | closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50) 34 | 35 | ## ------------------------------------------------------------------------ 36 | some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150) 37 | fishy = model[[some_fish$word,average=F]] 38 | plot(fishy,method="pca") 39 | 40 | ## ------------------------------------------------------------------------ 41 | set.seed(10) 42 | centers = 150 43 | clustering = kmeans(model,centers=centers,iter.max = 40) 44 | 45 | ## ------------------------------------------------------------------------ 46 | sapply(sample(1:centers,10),function(n) { 47 | names(clustering$cluster[clustering$cluster==n][1:10]) 48 | }) 49 | 50 | ## ------------------------------------------------------------------------ 51 | ingredients = c("madeira","beef","saucepan","carrots") 52 | term_set = lapply(ingredients, 53 | function(ingredient) { 54 | nearest_words = model %>% closest_to(model[[ingredient]],20) 55 | nearest_words$word 56 | }) %>% unlist 57 | 58 | subset = model[[term_set,average=F]] 59 | 60 | subset %>% 61 | cosineDist(subset) %>% 62 | as.dist %>% 63 | hclust %>% 64 | plot 65 | 66 | 67 | ## ------------------------------------------------------------------------ 68 | tastes = model[[c("sweet","salty"),average=F]] 69 | 70 | # model[1:3000,] here restricts to the 3000 most common words in the set. 71 | sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes) 72 | 73 | # Filter to the top 20 sweet or salty. 
74 | sweet_and_saltiness = sweet_and_saltiness[ 75 | rank(-sweet_and_saltiness[,1])<20 | 76 | rank(-sweet_and_saltiness[,2])<20, 77 | ] 78 | 79 | plot(sweet_and_saltiness,type='n') 80 | text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness)) 81 | 82 | 83 | ## ------------------------------------------------------------------------ 84 | 85 | tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]] 86 | 87 | # model[1:3000,] here restricts to the 3000 most common words in the set. 88 | common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes) 89 | 90 | common_similarities_tastes[20:30,] 91 | 92 | ## ------------------------------------------------------------------------ 93 | high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) < 75,] 94 | 95 | high_similarities_to_tastes %>% 96 | prcomp %>% 97 | biplot(main="Fifty words in a\nprojection of flavor space") 98 | 99 | ## ------------------------------------------------------------------------ 100 | plot(model,perplexity=50) 101 | 102 | -------------------------------------------------------------------------------- /inst/doc/introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec introduction" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Vignette Title} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Intro 13 | 14 | This vignette walks you through training a word2vec model, and using that model to search for similarities, to build clusters, and to visualize vocabulary relationships of that model in two dimensions. If you are working with pre-trained vectors, you might want to jump straight to the "exploration" vignette; it is a little slower-paced, but doesn't show off quite so many features of the package. 15 | 16 | # Package installation 17 | 18 | If you have not installed this package, paste the below. More detailed installation instructions are at the end of the [package README](https://github.com/bmschmidt/wordVectors). 19 | 20 | ```{r} 21 | if (!require(wordVectors)) { 22 | if (!(require(devtools))) { 23 | install.packages("devtools") 24 | } 25 | devtools::install_github("bmschmidt/wordVectors") 26 | } 27 | 28 | 29 | ``` 30 | 31 | # Building test data 32 | 33 | We begin by importing the `wordVectors` package and the `magrittr` package, because its pipe operator makes it easier to work with data. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | First we build up a test file to train on. 41 | As an example, we'll use a collection of cookbooks from Michigan State University. 42 | This has to download from the Internet if it doesn't already exist. 43 | 44 | ```{r} 45 | if (!file.exists("cookbooks.zip")) { 46 | download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip") 47 | } 48 | unzip("cookbooks.zip",exdir="cookbooks") 49 | ``` 50 | 51 | 52 | Then we *prepare* a single file for word2vec to read in. This does a couple things: 53 | 54 | 1. Creates a single text file with the contents of every file in the original document; 55 | 2. Uses the `tokenizers` package to clean and lowercase the original text, 56 | 3. If `bundle_ngrams` is greater than 1, joins together common bigrams into a single word. For example, "olive oil" may be joined together into "olive_oil" wherever it occurs. 
57 | 58 | You can also do this in another language: particularly for large files, that will be **much** faster. (For reference: in a console, `perl -ne 's/[^A-Za-z_0-9 \n]/ /g; print lc $_;' cookbooks/*.txt > cookbooks.txt` will do much the same thing on ASCII text in a couple seconds.) If you do this and want to bundle ngrams, you'll then need to call `word2phrase("cookbooks.txt","cookbook_bigrams.txt",...)` to build up the bigrams; call it twice if you want 3-grams, and so forth. 59 | 60 | 61 | ```{r} 62 | if (!file.exists("cookbooks.txt")) prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=2) 63 | ``` 64 | 65 | To train a word2vec model, use the function `train_word2vec`. This actually builds up the model. It uses an on-disk file as an intermediary and then reads that file into memory. 66 | 67 | ```{r} 68 | if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin") 69 | 70 | ``` 71 | 72 | A few notes: 73 | 74 | 1. The `vectors` parameter is the dimensionality of the representation. More vectors usually means more precision, but also more random error and slower operations. Likely choices are probably in the range 100-500. 75 | 2. The `threads` parameter is the number of processors to use on your computer. On a modern laptop, the fastest results will probably be between 2 and 8 threads, depending on the number of cores. 76 | 3. `iter` is how many times to read through the corpus. With fewer than 100 books, it can greatly help to increase the number of passes; if you're working with billions of words, it probably matters less. One danger of too low a number of iterations is that words that aren't closely related will seem to be closer than they are. 77 | 4. Training can take a while. On my laptop, it takes a few minutes to train these cookbooks; larger models take proportionally more time. Because of the importance of more iterations to reducing noise, don't be afraid to set things up to require a lot of training time (as much as a day!) 78 | 5. One of the best things about the word2vec algorithm is that it *does* work on extremely large corpora in linear time. 79 | 6. In RStudio I've noticed that this sometimes appears to hang after a while; the percentage bar stops updating. If you check system activity it actually is still running, and will complete. 80 | 7. If at any point you want to *read in* a previously trained model, you can do so by typing `model = read.vectors("cookbook_vectors.bin")`. 81 | 82 | Now we have a model in memory, trained on about 10 million words from 77 cookbooks. What can it tell us about food? 83 | 84 | ## Similarity searches 85 | 86 | Well, you can run some basic operations to find the nearest elements: 87 | 88 | ```{r} 89 | model %>% closest_to("fish") 90 | ``` 91 | 92 | With that list, you can expand out further to search for multiple words: 93 | 94 | ```{r} 95 | model %>% 96 | closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50) 97 | ``` 98 | 99 | Now we have a pretty expansive list of potential fish-related words from old cookbooks. This can be useful for a few different things: 100 | 101 | 1. As a list of potential query terms for keyword search. 102 | 2. As a batch of words to use as seed to some other text mining operation; for example, you could pull all paragraphs surrounding these to find ways that fish are cooked. 103 | 3. 
As a source for visualization. 104 | 105 | Or we can just arrange them somehow. In this case, it doesn't look like much of anything. 106 | 107 | ```{r} 108 | some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150) 109 | fishy = model[[some_fish$word,average=F]] 110 | plot(fishy,method="pca") 111 | ``` 112 | 113 | ## Clustering 114 | 115 | We can use standard clustering algorithms, like kmeans, to find groups of terms that fit together. You can think of this as a sort of topic model, although unlike more sophisticated topic modeling algorithms like Latent Dirichlet Allocation, each word must be tied to a single topic. 116 | 117 | ```{r} 118 | set.seed(10) 119 | centers = 150 120 | clustering = kmeans(model,centers=centers,iter.max = 40) 121 | ``` 122 | 123 | Here are ten random "topics" produced through this method. Each column shows the ten most frequent words in one random cluster. 124 | 125 | ```{r} 126 | sapply(sample(1:centers,10),function(n) { 127 | names(clustering$cluster[clustering$cluster==n][1:10]) 128 | }) 129 | ``` 130 | 131 | These can be useful for figuring out, at a glance, what some of the overall common clusters in your corpus are. 132 | 133 | Clusters need not be derived at the level of the full model. We can take, for instance, 134 | the 20 words closest to each of four different kinds of words. 135 | 136 | ```{r} 137 | ingredients = c("madeira","beef","saucepan","carrots") 138 | term_set = lapply(ingredients, 139 | function(ingredient) { 140 | nearest_words = model %>% closest_to(model[[ingredient]],20) 141 | nearest_words$word 142 | }) %>% unlist 143 | 144 | subset = model[[term_set,average=F]] 145 | 146 | subset %>% 147 | cosineDist(subset) %>% 148 | as.dist %>% 149 | hclust %>% 150 | plot 151 | 152 | ``` 153 | 154 | 155 | # Visualization 156 | 157 | ## Relationship planes. 158 | 159 | One of the basic strategies you can take is to try to project the high-dimensional space here into a plane you can look at. 160 | 161 | For instance, we can take the words "sweet" and "salty," find the twenty words most similar to either of them, and plot those in a sweet-salty plane. 162 | 163 | ```{r} 164 | tastes = model[[c("sweet","salty"),average=F]] 165 | 166 | # model[1:3000,] here restricts to the 3000 most common words in the set. 167 | sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes) 168 | 169 | # Filter to the top 20 sweet or salty. 170 | sweet_and_saltiness = sweet_and_saltiness[ 171 | rank(-sweet_and_saltiness[,1])<20 | 172 | rank(-sweet_and_saltiness[,2])<20, 173 | ] 174 | 175 | plot(sweet_and_saltiness,type='n') 176 | text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness)) 177 | 178 | ``` 179 | 180 | 181 | There's no limit to how complicated this can get. For instance, there are really *five* tastes: sweet, salty, bitter, sour, and savory. (Savory is usually called 'umami' nowadays, but that word will not appear in historic cookbooks.) 182 | 183 | Rather than use a base matrix of the whole set, we can shrink down to just five dimensions: how similar every word in our set is to each of these five. (I'm using cosine similarity here, so the closer a number is to one, the more similar it is.) 185 | 186 | ```{r} 187 | 188 | tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]] 189 | 190 | # model[1:3000,] here restricts to the 3000 most common words in the set.
190 | common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes) 191 | 192 | common_similarities_tastes[20:30,] 193 | ``` 194 | 195 | Now we can filter down to the 50 words that are closest to *any* of these (that's what the apply-max function below does), and 196 | use a PCA biplot to look at just 50 words in a flavor plane. 197 | 198 | ```{r} 199 | high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) < 75,] 200 | 201 | high_similarities_to_tastes %>% 202 | prcomp %>% 203 | biplot(main="Fifty words in a\nprojection of flavor space") 204 | ``` 205 | 206 | This tells us a few things. One is that (in some runnings of the model, at least--there is some random chance built in here) "sweet" and "sour" are closely aligned. Is this a unique feature of American cooking? A relationship that changes over time? These would require more investigation. 207 | 208 | Second is that "savory" really is an active category in these cookbooks, even without the precision of 'umami' as a word to express it. Anchovy, the flavor most closely associated with savoriness, shows up as fairly characteristic of the flavor, along with a variety of herbs. 209 | 210 | Finally, words characteristic of meals seem to show up in the upper realms of the plot. 211 | 212 | # Catchall reduction: TSNE 213 | 214 | Last but not least, there is a catchall method built into the library 215 | to visualize a single overall decent plane for viewing the model: TSNE dimensionality reduction. 216 | 217 | Just calling "plot" will display the equivalent of a word cloud with individual tokens grouped relatively close to each other based on their proximity in the higher dimensional space. 218 | 219 | "Perplexity" is the optimal number of neighbors for each word. By default it's 50; smaller numbers may cause clusters to appear more dramatically at the cost of overall coherence. 220 | 221 | ```{r} 222 | plot(model,perplexity=50) 223 | ``` 224 | 225 | A few notes on this method: 226 | 227 | 1. If you don't get local clusters, it is not working. You might need to reduce the perplexity so that clusters are smaller; or you might not have good local similarities. 228 | 229 | 2. If you're plotting only a small set of words, you're better off trying to plot a `VectorSpaceModel` with `method="pca"`, which locates the points using principal components analysis. 230 | -------------------------------------------------------------------------------- /inst/paper.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'WordVectors: an R environment for training and exploring word2vec models' 3 | tags: 4 | - Natural Language Processing 5 | - Vector Space Models 6 | - word2vec 7 | authors: 8 | - name: Benjamin M Schmidt 9 | orcid: 0000-0002-1142-5720 10 | affiliation: 1 11 | affiliations: 12 | - name: Northeastern University 13 | index: 1 14 | date: 24 January 2017 15 | bibliography: paper.bib 16 | --- 17 | 18 | # Summary 19 | 20 | This is an R package for training and exploring word2vec models.
It provides wrappers for the reference word2vec implementation released by Google to enable training of vectors from R.[@mikolov_efficient_2013] It also provides a variety of functions enabling exploratory data analysis of word2vec models in an R environment, including 1) functions for reading and writing word2vec's binary form, 2) standard linear algebra functions not bundled in base R (such as cosine similarity) with speed optimizations, and 3) a streamlined syntax for performing vector arithmetic in a vocabulary space. 21 | 22 | # References 23 | 24 | -------------------------------------------------------------------------------- /man/VectorSpaceModel-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{-,VectorSpaceModel,VectorSpaceModel-method} 5 | \alias{-,VectorSpaceModel,VectorSpaceModel-method} 6 | \title{VectorSpaceModel subtraction} 7 | \usage{ 8 | \S4method{-}{VectorSpaceModel,VectorSpaceModel}(e1, e2) 9 | } 10 | \arguments{ 11 | \item{e1}{A vector space model} 12 | 13 | \item{e2}{A vector space model of equal size OR a vector 14 | space model of a single row. If the latter (which is more likely) 15 | the specified row will be subtracted from each row.} 16 | } 17 | \value{ 18 | A VectorSpaceModel of the same dimensions and rownames 19 | as e1 20 | 21 | I believe this is necessary, but honestly am not sure. 22 | } 23 | \description{ 24 | Keep the VSM class when doing subtraction operations; 25 | make it possible to subtract a single row from an entire model. 26 | } 27 | -------------------------------------------------------------------------------- /man/VectorSpaceModel-class.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{class} 4 | \name{VectorSpaceModel-class} 5 | \alias{VectorSpaceModel-class} 6 | \title{Vector Space Model class} 7 | \value{ 8 | An object of class "VectorSpaceModel" 9 | } 10 | \description{ 11 | A class for describing and accessing Vector Space Models like Word2Vec. 12 | The base object is simply a matrix with columns describing dimensions and unique rownames 13 | as the names of vectors. This package gives a number of convenience functions for printing 14 | and, most importantly, accessing these objects. 15 | } 16 | \section{Slots}{ 17 | 18 | \describe{ 19 | \item{\code{magnitudes}}{The cached sum-of-squares for each row in the matrix. 
Can be cached to 20 | speed up similarity calculations} 21 | }} 22 | 23 | -------------------------------------------------------------------------------- /man/as.VectorSpaceModel.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{as.VectorSpaceModel} 4 | \alias{as.VectorSpaceModel} 5 | \title{Convert to a Vector Space Model} 6 | \usage{ 7 | as.VectorSpaceModel(matrix) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix to coerce.} 11 | } 12 | \value{ 13 | An object of class "VectorSpaceModel" 14 | } 15 | \description{ 16 | Convert to a Vector Space Model 17 | } 18 | -------------------------------------------------------------------------------- /man/closest_to.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{closest_to} 4 | \alias{closest_to} 5 | \title{Return the n closest words in a VectorSpaceModel to a given vector.} 6 | \usage{ 7 | closest_to(matrix, vector, n = 10, fancy_names = TRUE) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or a string or a formula coercable to a vector) 13 | of the same length as the VectorSpaceModel. See below.} 14 | 15 | \item{n}{The number of closest words to include.} 16 | 17 | \item{fancy_names}{If true (the default) the data frame will have descriptive names like 18 | 'similarity to "king+queen-man"'; otherwise, just 'similarity.' The default can speed up 19 | interactive exploration.} 20 | } 21 | \value{ 22 | A sorted data.frame with columns for the words and their similarity 23 | to the target vector. (Or, if as_df==FALSE, a named vector of similarities.) 24 | } 25 | \description{ 26 | This is a convenience wrapper around the most common use of 27 | 'cosineSimilarity'; the listing of several words similar to a given vector. 28 | Unlike cosineSimilarity, it returns a data.frame object instead of a matrix. 29 | cosineSimilarity is more powerful, because it can compare two matrices to 30 | each other; closest_to can only take a vector or vectorlike object as its second argument. 31 | But with (or without) the argument n=Inf, closest_to is often better for 32 | plugging directly into a plot. 33 | 34 | As with cosineSimilarity, the second argument can take several forms. If it's a vector or 35 | matrix slice, it will be taken literally. If it's a character string, it will 36 | be interpreted as a word and the associated vector from `matrix` will be used. If 37 | a formula, any strings in the formula will be converted to rows in the associated `matrix` 38 | before any math happens. 39 | } 40 | \examples{ 41 | 42 | # Synonyms and similar words 43 | closest_to(demo_vectors,demo_vectors[["good"]]) 44 | 45 | # If 'matrix' is a VectorSpaceModel object, 46 | # you can also just enter a string directly, and 47 | # it will be evaluated in the context of the passed matrix. 48 | 49 | closest_to(demo_vectors,"good") 50 | 51 | # You can also express more complicated formulas. 52 | 53 | closest_to(demo_vectors,"good") 54 | 55 | # Something close to the classic king:man::queen:woman; 56 | # What's the equivalent word for a female teacher that "guy" is for 57 | # a male one? 
58 | 59 | closest_to(demo_vectors,~ "guy" - "man" + "woman") 60 | 61 | } 62 | -------------------------------------------------------------------------------- /man/cosineDist.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{cosineDist} 4 | \alias{cosineDist} 5 | \title{Cosine Distance} 6 | \usage{ 7 | cosineDist(x, y) 8 | } 9 | \arguments{ 10 | \item{x}{A matrix, VectorSpaceModel, or vector.} 11 | 12 | \item{y}{A matrix, VectorSpaceModel, or vector.} 13 | } 14 | \value{ 15 | A matrix whose dimnames are rownames(x), rownames(y) and whose entires are 16 | the associated distance. 17 | } 18 | \description{ 19 | Calculate the cosine distance between two vectors. 20 | 21 | Not an actual distance metric, but can be used in similar contexts. 22 | It is calculated as simply the inverse of cosine similarity, 23 | and falls in a fixed range of 0 (identical) to 2 (completely opposite in direction.) 24 | } 25 | -------------------------------------------------------------------------------- /man/cosineSimilarity.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{cosineSimilarity} 4 | \alias{cosineSimilarity} 5 | \title{Cosine Similarity} 6 | \usage{ 7 | cosineSimilarity(x, y) 8 | } 9 | \arguments{ 10 | \item{x}{A matrix or VectorSpaceModel object} 11 | 12 | \item{y}{A vector, matrix or VectorSpaceModel object. 13 | 14 | Vector inputs are coerced to single-row matrices; y must have the 15 | same number of dimensions as x.} 16 | } 17 | \value{ 18 | A matrix. Rows correspond to entries in x; columns to entries in y. 19 | } 20 | \description{ 21 | Calculate the cosine similarity of two matrices or a matrix and a vector. 22 | } 23 | \examples{ 24 | 25 | # Inspect the similarity of several academic disciplines by hand. 26 | subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=FALSE]] 27 | similarities = cosineSimilarity(subjects,subjects) 28 | 29 | # Use 'closest_to' to build up a large list of similar words to a seed set. 30 | subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=TRUE]] 31 | new_subject_list = closest_to(demo_vectors,subjects,20) 32 | new_subjects = demo_vectors[[new_subject_list$word,average=FALSE]] 33 | 34 | # Plot the cosineDistance of these as a dendrogram. 35 | plot(hclust(as.dist(cosineDist(new_subjects,new_subjects)))) 36 | 37 | } 38 | -------------------------------------------------------------------------------- /man/demo_vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/data.R 3 | \docType{data} 4 | \name{demo_vectors} 5 | \alias{demo_vectors} 6 | \title{999 vectors trained on teaching evaluations} 7 | \format{A VectorSpaceModel object of 999 words and 500 vectors} 8 | \source{ 9 | Trained by package author. 10 | } 11 | \usage{ 12 | demo_vectors 13 | } 14 | \description{ 15 | A sample VectorSpaceModel object trained on about 15 million 16 | teaching evaluations, limited to the 999 most common words. 17 | Included for demonstration purposes only: there's only so much you can 18 | do with a 999 length vocabulary. 
19 | } 20 | \details{ 21 | You're best off downloading a real model to work with, 22 | such as the precompiled vectors distributed by Google 23 | at https://code.google.com/archive/p/word2vec/ 24 | } 25 | \keyword{datasets} 26 | -------------------------------------------------------------------------------- /man/distend.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{distend} 4 | \alias{distend} 5 | \title{Compress or expand a vector space model along a vector.} 6 | \usage{ 7 | distend(matrix, vector, multiplier) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or an object coercable to a vector, see project) 13 | of the same length as the VectorSpaceModel.} 14 | 15 | \item{multiplier}{A scaling factor. See below.} 16 | } 17 | \value{ 18 | A new matrix or VectorSpaceModel of the same dimensions as `matrix`, 19 | distended along the vector 'vector' by factor 'multiplier'. 20 | 21 | See `project` for more details and usage. 22 | } 23 | \description{ 24 | This is an experimental function that might be useful sometimes. 25 | 'Reject' flatly eliminates a particular dimension from a vectorspace, essentially 26 | squashing out a single dimension; 'distend' gives finer grained control, making it 27 | possible to stretch out or compress in the same space. High values of 'multiplier' 28 | make a given vector more prominent; 1 keeps the original matrix untransformed; values 29 | less than one compress distances along the vector; and 0 is the same as "reject," 30 | eliminating a vector entirely. Values less than zero will do some type of mirror-image 31 | universe thing, but probably aren't useful? 32 | } 33 | \examples{ 34 | closest_to(demo_vectors,"sweet") 35 | 36 | # Stretch out the vectorspace 4x longer along the gender direction. 37 | more_sexist = distend(demo_vectors, ~ "man" + "he" - "she" -"woman", 4) 38 | 39 | closest_to(more_sexist,"sweet") 40 | 41 | } 42 | -------------------------------------------------------------------------------- /man/filter_to_rownames.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{filter_to_rownames} 4 | \alias{filter_to_rownames} 5 | \title{Reduce by rownames} 6 | \usage{ 7 | filter_to_rownames(matrix, words) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel object} 11 | 12 | \item{words}{A list of rownames or VectorSpaceModel names} 13 | } 14 | \value{ 15 | An object of the same class as matrix, consisting 16 | of the rows that match its rownames. 
17 | 18 | Deprecated: use instead VSM[[c("word1","word2",...),average=FALSE]] 19 | } 20 | \description{ 21 | Reduce by rownames 22 | } 23 | -------------------------------------------------------------------------------- /man/improve_vectorspace.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{improve_vectorspace} 4 | \alias{improve_vectorspace} 5 | \title{Improve a vectorspace by removing common elements.} 6 | \usage{ 7 | improve_vectorspace(vectorspace, D = round(ncol(vectorspace)/100)) 8 | } 9 | \arguments{ 10 | \item{vectorspace}{A VectorSpacemodel to be improved.} 11 | 12 | \item{D}{The number of principal components to eliminate.} 13 | } 14 | \value{ 15 | A VectorSpaceModel object, transformed from the original. 16 | } 17 | \description{ 18 | See reference for a full description. Supposedly, these operations will improve performance on analogy tasks. 19 | } 20 | \examples{ 21 | 22 | closest_to(demo_vectors,"great") 23 | # stopwords like "and" and "very" are no longer top ten. 24 | # I don't know if this is really better, though. 25 | 26 | closest_to(improve_vectorspace(demo_vectors),"great") 27 | 28 | } 29 | \references{ 30 | Jiaqi Mu, Suma Bhat, Pramod Viswanath. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. https://arxiv.org/abs/1702.01417. 31 | } 32 | -------------------------------------------------------------------------------- /man/magnitudes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{magnitudes} 4 | \alias{magnitudes} 5 | \title{Vector Magnitudes} 6 | \usage{ 7 | magnitudes(matrix) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel object.} 11 | } 12 | \value{ 13 | A vector consisting of the magnitudes of each row. 14 | 15 | This is an extraordinarily simple function. 16 | } 17 | \description{ 18 | Vector Magnitudes 19 | } 20 | -------------------------------------------------------------------------------- /man/nearest_to.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{nearest_to} 4 | \alias{nearest_to} 5 | \title{Nearest vectors to a word} 6 | \usage{ 7 | nearest_to(...) 8 | } 9 | \arguments{ 10 | \item{...}{See `closest_to`} 11 | } 12 | \value{ 13 | a names vector of cosine similarities. See 'nearest_to' for more details. 14 | } 15 | \description{ 16 | This a wrapper around closest_to, included for back-compatibility. Use 17 | closest_to for new applications. 
18 | } 19 | \examples{ 20 | 21 | # Recommended usage in 1.0: 22 | nearest_to(demo_vectors, demo_vectors[["good"]]) 23 | 24 | # Recommended usage in 2.0: 25 | demo_vectors \%>\% closest_to("good") 26 | 27 | } 28 | -------------------------------------------------------------------------------- /man/normalize_lengths.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{normalize_lengths} 4 | \alias{normalize_lengths} 5 | \title{Matrix normalization.} 6 | \usage{ 7 | normalize_lengths(matrix) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel object} 11 | } 12 | \value{ 13 | An object of the same class as matrix 14 | } 15 | \description{ 16 | Normalize a matrix so that all rows are of unit length. 17 | } 18 | -------------------------------------------------------------------------------- /man/plot-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{plot,VectorSpaceModel-method} 5 | \alias{plot,VectorSpaceModel-method} 6 | \title{Plot a Vector Space Model.} 7 | \usage{ 8 | \S4method{plot}{VectorSpaceModel}(x, method = "tsne", ...) 9 | } 10 | \arguments{ 11 | \item{x}{The model to plot} 12 | 13 | \item{method}{The method to use for plotting. "pca" is principal components, "tsne" is t-sne} 14 | 15 | \item{...}{Further arguments passed to the plotting method.} 16 | } 17 | \value{ 18 | The TSNE model (silently.) 19 | } 20 | \description{ 21 | Visualizing a model as a whole is sort of undefined. I think the 22 | sanest thing to do is reduce the full model down to two dimensions 23 | using T-SNE, which preserves some of the local clusters. 24 | } 25 | \details{ 26 | For individual subsections, it can make sense to do a principal components 27 | plot of the space of just those letters. This is what happens if method 28 | is pca. On the full vocab, it's kind of a mess. 29 | 30 | This plots only the first 300 words in the model. 31 | } 32 | -------------------------------------------------------------------------------- /man/prep_word2vec.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word2vec.R 3 | \name{prep_word2vec} 4 | \alias{prep_word2vec} 5 | \title{Prepare documents for word2Vec} 6 | \usage{ 7 | prep_word2vec(origin, destination, lowercase = F, bundle_ngrams = 1, ...) 8 | } 9 | \arguments{ 10 | \item{origin}{A text file or a directory of text files 11 | to be used in training the model} 12 | 13 | \item{destination}{The location for output text.} 14 | 15 | \item{lowercase}{Logical. Should uppercase characters be converted to lower?} 16 | 17 | \item{bundle_ngrams}{Integer. Statistically significant phrases of up to this many words 18 | will be joined with underscores: e.g., "United States" will usually be changed to "United_States" 19 | if it appears frequently in the corpus. This calls word2phrase once if bundle_ngrams is 2, 20 | twice if bundle_ngrams is 3, and so forth; see that function for more details.} 21 | 22 | \item{...}{Further arguments passed to word2phrase when bundle_ngrams is 23 | greater than 1.} 24 | } 25 | \value{ 26 | The file name (silently). 
27 | } 28 | \description{ 29 | This function exports a directory or document to a single file 30 | suitable to Word2Vec run on. That means a single, seekable txt file 31 | with tokens separated by spaces. (For example, punctuation is removed 32 | rather than attached to the end of words.) 33 | This function is extraordinarily inefficient: in most real-world cases, you'll be 34 | much better off preparing the documents using python, perl, awk, or any other 35 | scripting language that can reasonable read things in line-by-line. 36 | } 37 | -------------------------------------------------------------------------------- /man/project.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{project} 4 | \alias{project} 5 | \title{Project each row of an input matrix along a vector.} 6 | \usage{ 7 | project(matrix, vector) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or object coercable to a vector) 13 | of the same length as the VectorSpaceModel.} 14 | } 15 | \value{ 16 | A new matrix or VectorSpaceModel of the same dimensions as `matrix`, 17 | each row of which is parallel to vector. 18 | 19 | If the input is a matrix, the output will be a matrix: if a VectorSpaceModel, 20 | it will be a VectorSpaceModel. 21 | } 22 | \description{ 23 | As with 'cosineSimilarity 24 | } 25 | -------------------------------------------------------------------------------- /man/read.binary.vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{read.binary.vectors} 4 | \alias{read.binary.vectors} 5 | \title{Read binary word2vec format files} 6 | \usage{ 7 | read.binary.vectors(filename, nrows = Inf, cols = "All", 8 | rowname_list = NULL, rowname_regexp = NULL) 9 | } 10 | \arguments{ 11 | \item{filename}{A file in the binary word2vec format to import.} 12 | 13 | \item{nrows}{Optionally, a number of rows to stop reading after. 14 | Word2vec sorts by frequency, so limiting to the first 1000 rows will 15 | give the thousand most-common words; it can be useful not to load 16 | the whole matrix into memory. This limit is applied BEFORE `name_list` and 17 | `name_regexp`.} 18 | 19 | \item{cols}{The column numbers to read. Default is "All"; 20 | if you are in a memory-limited environment, 21 | you can limit the number of columns you read in by giving a vector of column integers} 22 | 23 | \item{rowname_list}{A whitelist of words. If you wish to read in only a few dozen words, 24 | all other rows will be skipped and only these read in.} 25 | 26 | \item{rowname_regexp}{A regular expression specifying a pattern for rows to read in. 
Row 27 | names matching that pattern will be included in the read; all others will be skipped.} 28 | } 29 | \value{ 30 | A VectorSpaceModel object 31 | } 32 | \description{ 33 | Read binary word2vec format files 34 | } 35 | -------------------------------------------------------------------------------- /man/read.vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{read.vectors} 4 | \alias{read.vectors} 5 | \title{Read VectorSpaceModel} 6 | \usage{ 7 | read.vectors(filename, vectors = guess_n_cols(), binary = NULL, ...) 8 | } 9 | \arguments{ 10 | \item{filename}{The file to read in.} 11 | 12 | \item{vectors}{The number of dimensions word2vec calculated. Imputed automatically if not specified.} 13 | 14 | \item{binary}{Read in the binary word2vec form. (Wraps `read.binary.vectors`) By default, function 15 | guesses based on file suffix.} 16 | 17 | \item{...}{Further arguments passed to read.table or read.binary.vectors. 18 | Note that both accept 'nrows' as an argument. Word2vec produces 19 | by default frequency sorted output. Therefore 'read.vectors("file.bin", nrows=500)', for example, 20 | will return the vectors for the top 500 words. This can be useful on machines with limited 21 | memory.} 22 | } 23 | \value{ 24 | An matrixlike object of class `VectorSpaceModel` 25 | } 26 | \description{ 27 | Read a VectorSpaceModel from a file exported from word2vec or a similar output format. 28 | } 29 | -------------------------------------------------------------------------------- /man/reexports.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/utils.R 3 | \docType{import} 4 | \name{reexports} 5 | \alias{reexports} 6 | \alias{\%>\%} 7 | \title{Objects exported from other packages} 8 | \keyword{internal} 9 | \description{ 10 | These objects are imported from other packages. Follow the links 11 | below to see their documentation. 12 | 13 | \describe{ 14 | \item{magrittr}{\code{\link[magrittr]{\%>\%}}} 15 | }} 16 | 17 | -------------------------------------------------------------------------------- /man/reject.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{reject} 4 | \alias{reject} 5 | \title{Return a vector rejection for each element in a VectorSpaceModel} 6 | \usage{ 7 | reject(matrix, vector) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or an object coercable to a vector, see project) 13 | of the same length as the VectorSpaceModel.} 14 | } 15 | \value{ 16 | A new matrix or VectorSpaceModel of the same dimensions as `matrix`, 17 | each row of which is orthogonal to the `vector` object. 18 | 19 | This is defined simply as `matrix-project(matrix,vector)`, but having a separate 20 | name may make for cleaner code. 21 | 22 | See `project` for more details. 
23 | } 24 | \description{ 25 | Return a vector rejection for each element in a VectorSpaceModel 26 | } 27 | \examples{ 28 | closest_to(demo_vectors,demo_vectors[["man"]]) 29 | 30 | genderless = reject(demo_vectors,demo_vectors[["he"]] - demo_vectors[["she"]]) 31 | closest_to(genderless,genderless[["man"]]) 32 | 33 | } 34 | -------------------------------------------------------------------------------- /man/square_magnitudes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{square_magnitudes} 4 | \alias{square_magnitudes} 5 | \title{Square Magnitudes with caching} 6 | \usage{ 7 | square_magnitudes(object) 8 | } 9 | \arguments{ 10 | \item{VectorSpaceModel}{A matrix or VectorSpaceModel object} 11 | } 12 | \value{ 13 | A vector of the square magnitudes for each row 14 | } 15 | \description{ 16 | square_magnitudes Returns the square magnitudes and 17 | caches them if necessary 18 | } 19 | \keyword{internal} 20 | -------------------------------------------------------------------------------- /man/sub-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{[,VectorSpaceModel-method} 5 | \alias{[,VectorSpaceModel-method} 6 | \title{VectorSpaceModel indexing} 7 | \usage{ 8 | \S4method{[}{VectorSpaceModel}(x, i, j, ..., drop = TRUE) 9 | } 10 | \arguments{ 11 | \item{x}{The vectorspace model to subset} 12 | 13 | \item{i}{The row numbers to extract} 14 | 15 | \item{j}{The column numbers to extract} 16 | 17 | \item{...}{Other arguments passed to extract (unlikely to be useful).} 18 | 19 | \item{drop}{Whether to drop columns. This parameter is ignored.} 20 | } 21 | \value{ 22 | A VectorSpaceModel 23 | } 24 | \description{ 25 | Reduce a VectorSpaceModel to a smaller one 26 | } 27 | -------------------------------------------------------------------------------- /man/sub-sub-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{[[,VectorSpaceModel-method} 5 | \alias{[[,VectorSpaceModel-method} 6 | \title{VectorSpaceModel subsetting} 7 | \usage{ 8 | \S4method{[[}{VectorSpaceModel}(x, i, average = TRUE) 9 | } 10 | \arguments{ 11 | \item{x}{The object being subsetted.} 12 | 13 | \item{i}{A character vector: the words to use as rownames.} 14 | 15 | \item{average}{Whether to collapse down to a single vector, 16 | or to return a subset of one row for each asked for.} 17 | } 18 | \value{ 19 | A VectorSpaceModel of a single row. 
20 | } 21 | \description{ 22 | VectorSpaceModel subsetting 23 | } 24 | -------------------------------------------------------------------------------- /man/train_word2vec.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word2vec.R 3 | \name{train_word2vec} 4 | \alias{train_word2vec} 5 | \title{Train a model by word2vec.} 6 | \usage{ 7 | train_word2vec(train_file, output_file = "vectors.bin", vectors = 100, 8 | threads = 1, window = 12, classes = 0, cbow = 0, min_count = 5, 9 | iter = 5, force = F, negative_samples = 5) 10 | } 11 | \arguments{ 12 | \item{train_file}{Path of a single .txt file for training. Tokens are split on spaces.} 13 | 14 | \item{output_file}{Path of the output file.} 15 | 16 | \item{vectors}{The number of vectors to output. Defaults to 100. 17 | More vectors usually means more precision, but also more random error, higher memory usage, and slower operations. 18 | Sensible choices are probably in the range 100-500.} 19 | 20 | \item{threads}{Number of threads to run training process on. 21 | Defaults to 1; up to the number of (virtual) cores on your machine may speed things up.} 22 | 23 | \item{window}{The size of the window (in words) to use in training.} 24 | 25 | \item{classes}{Number of classes for k-means clustering. Not documented/tested.} 26 | 27 | \item{cbow}{If 1, use a continuous-bag-of-words model instead of skip-grams. 28 | Defaults to false (recommended for newcomers).} 29 | 30 | \item{min_count}{Minimum times a word must appear to be included in the samples. 31 | High values help reduce model size.} 32 | 33 | \item{iter}{Number of passes to make over the corpus in training.} 34 | 35 | \item{force}{Whether to overwrite existing model files.} 36 | 37 | \item{negative_samples}{Number of negative samples to take in skip-gram training. 0 means full sampling, while lower numbers 38 | give faster training. For large corpora 2-5 may work; for smaller corpora, 5-15 is reasonable.} 39 | } 40 | \value{ 41 | A VectorSpaceModel object. 42 | } 43 | \description{ 44 | Train a model by word2vec. 45 | } 46 | \details{ 47 | The word2vec tool takes a text corpus as input and produces the 48 | word vectors as output. It first constructs a vocabulary from the 49 | training text data and then learns vector representation of words. 50 | The resulting word vector file can be used as features in many 51 | natural language processing and machine learning applications. 52 | } 53 | \examples{ 54 | \dontrun{ 55 | model = train_word2vec(system.file("examples", "rfaq.txt", package = "wordVectors")) 56 | } 57 | } 58 | \references{ 59 | \url{https://code.google.com/p/word2vec/} 60 | } 61 | \author{ 62 | Jian Li <\email{rweibo@sina.com}>, Ben Schmidt <\email{bmchmidt@gmail.com}> 63 | } 64 | -------------------------------------------------------------------------------- /man/word2phrase.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word2vec.R 3 | \name{word2phrase} 4 | \alias{word2phrase} 5 | \title{Convert words to phrases} 6 | \usage{ 7 | word2phrase(train_file, output_file, debug_mode = 0, min_count = 5, 8 | threshold = 100, force = FALSE) 9 | } 10 | \arguments{ 11 | \item{train_file}{Path of a single .txt file for training. 
12 | Tokens are split on spaces.} 13 | 14 | \item{output_file}{Path of output file} 15 | 16 | \item{debug_mode}{Debug mode. Must be 0, 1, or 2: 0 is silent; 1 prints summary statistics; 17 | 2 prints progress regularly.} 18 | 19 | \item{min_count}{Minimum times a word must appear to be included in the samples. 20 | High values help reduce model size.} 21 | 22 | \item{threshold}{Threshold value for determining if pairs of words are phrases.} 23 | 24 | \item{force}{Whether to overwrite existing files at the output location. Default FALSE.} 25 | } 26 | \value{ 27 | The name of output_file: the processed file in which common phrases are now joined. 28 | } 29 | \description{ 30 | Convert words to phrases in a text file. 31 | } 32 | \details{ 33 | This function attempts to learn phrases given a text document. 34 | It does so by progressively joining adjacent pairs of words with an '_' character. 35 | You can then run the code multiple times to create multiword phrases. 36 | Wrapper around code from Mikolov's original word2vec release. 37 | } 38 | \examples{ 39 | \dontrun{ 40 | model=word2phrase("text8","vec.txt") 41 | } 42 | } 43 | \author{ 44 | Tomas Mikolov 45 | } 46 | -------------------------------------------------------------------------------- /man/write.binary.word2vec.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{write.binary.word2vec} 4 | \alias{write.binary.word2vec} 5 | \title{Write in word2vec binary format} 6 | \usage{ 7 | write.binary.word2vec(model, filename) 8 | } 9 | \arguments{ 10 | \item{model}{The wordVectors model you wish to save. (This can actually be any matrix with rownames, 11 | if you want a smaller binary serialization in single-precision floats.)} 12 | 13 | \item{filename}{The file to save the vectors to.
I recommend ".bin" as a suffix.} 14 | } 15 | \value{ 16 | Nothing 17 | } 18 | \description{ 19 | Write in word2vec binary format 20 | } 21 | -------------------------------------------------------------------------------- /src/Makevars.win: -------------------------------------------------------------------------------- 1 | 2 | PKG_CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w 3 | PKG_LIBS = -pthread 4 | 5 | -------------------------------------------------------------------------------- /src/tmcn_word2vec.c: -------------------------------------------------------------------------------- 1 | #include "R.h" 2 | #include "Rmath.h" 3 | #include "word2vec.h" 4 | 5 | void tmcn_word2vec(char *train_file0, char *output_file0, 6 | char *binary0, char *dims0, char *threads, 7 | char *window0, char *classes0, char *cbow0, 8 | char *min_count0, char *iter0, char *neg_samples0) 9 | { 10 | int i; 11 | layer1_size = atoll(dims0); 12 | num_threads = atoi(threads); 13 | window=atoi(window0); 14 | binary = atoi(binary0); 15 | classes = atoi(classes0); 16 | cbow = atoi(cbow0); 17 | min_count = atoi(min_count0); 18 | iter = atoll(iter0); 19 | negative = atoi(neg_samples0); 20 | strcpy(train_file, train_file0); 21 | strcpy(output_file, output_file0); 22 | 23 | 24 | alpha = 0.025; 25 | starting_alpha = alpha; 26 | word_count_actual = 0; 27 | 28 | vocab = (struct vocab_word *)calloc(vocab_max_size, sizeof(struct vocab_word)); 29 | vocab_hash = (int *)calloc(vocab_hash_size, sizeof(int)); 30 | expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real)); 31 | for (i = 0; i < EXP_TABLE_SIZE; i++) { 32 | expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); // Precompute the exp() table 33 | expTable[i] = expTable[i] / (expTable[i] + 1); // Precompute f(x) = x / (x + 1) 34 | } 35 | TrainModel(); 36 | } 37 | 38 | 39 | void CWrapper_word2vec(char **train_file, char **output_file, 40 | char **binary, char **dims, char **threads, 41 | char **window, char **classes, char **cbow, char **min_count, char **iter, char **neg_samples) 42 | { 43 | tmcn_word2vec(*train_file, *output_file, *binary, *dims, *threads,*window,*classes,*cbow,*min_count,*iter, *neg_samples); 44 | } 45 | 46 | -------------------------------------------------------------------------------- /src/word2phrase.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
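// Orientation: this file is Mikolov's original word2phrase tool, lightly adapted
// for R. printf output is routed through Rprintf, and the command-line argument
// parsing has been replaced by the word2phrase() entry point at the bottom of the
// file, which the R-level wrapper (documented in man/word2phrase.Rd) presumably
// reaches through .C(). The trailing "1" on the identifiers here (vocab1,
// train_file1, min_count1, ...) appears to keep these globals from colliding with
// the similarly named globals in word2vec.h, since both files are compiled into
// the same shared library.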
14 | #include "R.h" 15 | #include "Rmath.h" 16 | #include 17 | #include 18 | #include 19 | #include 20 | //#include 21 | 22 | #define MAX_STRING1 60 23 | 24 | const int vocab_hash1_size1 = 500000000; // Maximum 500M entries in the vocabulary 25 | 26 | typedef float real; // Precision of float numbers 27 | 28 | struct vocab_word1 { 29 | long long cn; 30 | char *word; 31 | }; 32 | 33 | char train_file1[MAX_STRING1], output_file1[MAX_STRING1]; 34 | struct vocab_word1 *vocab1; 35 | int debug_mode1 = 2, min_count1 = 5, *vocab_hash1, min_reduce1 = 1; 36 | long long vocab_max_size1 = 10000, vocab_size1 = 0; 37 | long long train_words1 = 0; 38 | real threshold1 = 100; 39 | 40 | unsigned long long next_random = 1; 41 | 42 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 43 | void ReadWord1(char *word, FILE *fin) { 44 | int a = 0, ch; 45 | while (!feof(fin)) { 46 | ch = fgetc(fin); 47 | if (ch == 13) continue; 48 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 49 | if (a > 0) { 50 | if (ch == '\n') ungetc(ch, fin); 51 | break; 52 | } 53 | if (ch == '\n') { 54 | strcpy(word, (char *)""); 55 | return; 56 | } else continue; 57 | } 58 | word[a] = ch; 59 | a++; 60 | if (a >= MAX_STRING1 - 1) a--; // Truncate too long words 61 | } 62 | word[a] = 0; 63 | } 64 | 65 | // Returns hash value of a word 66 | int GetWordHash1(char *word) { 67 | unsigned long long a, hash = 1; 68 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 69 | hash = hash % vocab_hash1_size1; 70 | return hash; 71 | } 72 | 73 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 74 | int SearchVocab1(char *word) { 75 | unsigned int hash = GetWordHash1(word); 76 | while (1) { 77 | if (vocab_hash1[hash] == -1) return -1; 78 | if (!strcmp(word, vocab1[vocab_hash1[hash]].word)) return vocab_hash1[hash]; 79 | hash = (hash + 1) % vocab_hash1_size1; 80 | } 81 | return -1; 82 | } 83 | 84 | // Reads a word and returns its index in the vocabulary 85 | int ReadWord1Index1(FILE *fin) { 86 | char word[MAX_STRING1]; 87 | ReadWord1(word, fin); 88 | if (feof(fin)) return -1; 89 | return SearchVocab1(word); 90 | } 91 | 92 | // Adds a word to the vocabulary 93 | int AddWordToVocab1(char *word) { 94 | unsigned int hash, length = strlen(word) + 1; 95 | if (length > MAX_STRING1) length = MAX_STRING1; 96 | vocab1[vocab_size1].word = (char *)calloc(length, sizeof(char)); 97 | strcpy(vocab1[vocab_size1].word, word); 98 | vocab1[vocab_size1].cn = 0; 99 | vocab_size1++; 100 | // Reallocate memory if needed 101 | if (vocab_size1 + 2 >= vocab_max_size1) { 102 | vocab_max_size1 += 10000; 103 | vocab1=(struct vocab_word1 *)realloc(vocab1, vocab_max_size1 * sizeof(struct vocab_word1)); 104 | } 105 | hash = GetWordHash1(word); 106 | while (vocab_hash1[hash] != -1) hash = (hash + 1) % vocab_hash1_size1; 107 | vocab_hash1[hash]=vocab_size1 - 1; 108 | return vocab_size1 - 1; 109 | } 110 | 111 | // Used later for sorting by word counts 112 | int VocabCompare1(const void *a, const void *b) { 113 | return ((struct vocab_word1 *)b)->cn - ((struct vocab_word1 *)a)->cn; 114 | } 115 | 116 | // Sorts the vocabulary by frequency using word counts 117 | void SortVocab1() { 118 | int a; 119 | unsigned int hash; 120 | // Sort the vocabulary and keep at the first position 121 | qsort(&vocab1[1], vocab_size1 - 1, sizeof(struct vocab_word1), VocabCompare1); 122 | for (a = 0; a < vocab_hash1_size1; a++) vocab_hash1[a] = -1; 123 | for (a = 0; a < vocab_size1; a++) { 124 | // Words occuring less than 
min_count1 times will be discarded from the vocab 125 | if (vocab1[a].cn < min_count1) { 126 | vocab_size1--; 127 | free(vocab1[vocab_size1].word); 128 | } else { 129 | // Hash will be re-computed, as after the sorting it is not actual 130 | hash = GetWordHash1(vocab1[a].word); 131 | while (vocab_hash1[hash] != -1) hash = (hash + 1) % vocab_hash1_size1; 132 | vocab_hash1[hash] = a; 133 | } 134 | } 135 | vocab1 = (struct vocab_word1 *)realloc(vocab1, vocab_size1 * sizeof(struct vocab_word1)); 136 | } 137 | 138 | // Reduces the vocabulary by removing infrequent tokens 139 | void ReduceVocab1() { 140 | int a, b = 0; 141 | unsigned int hash; 142 | for (a = 0; a < vocab_size1; a++) if (vocab1[a].cn > min_reduce1) { 143 | vocab1[b].cn = vocab1[a].cn; 144 | vocab1[b].word = vocab1[a].word; 145 | b++; 146 | } else free(vocab1[a].word); 147 | vocab_size1 = b; 148 | for (a = 0; a < vocab_hash1_size1; a++) vocab_hash1[a] = -1; 149 | for (a = 0; a < vocab_size1; a++) { 150 | // Hash will be re-computed, as it is not actual 151 | hash = GetWordHash1(vocab1[a].word); 152 | while (vocab_hash1[hash] != -1) hash = (hash + 1) % vocab_hash1_size1; 153 | vocab_hash1[hash] = a; 154 | } 155 | //fflush(stdout); 156 | min_reduce1++; 157 | } 158 | 159 | void LearnVocabFromTrainFile1() { 160 | char word[MAX_STRING1], last_word[MAX_STRING1], bigram_word[MAX_STRING1 * 2]; 161 | FILE *fin; 162 | long long a, i, start = 1; 163 | for (a = 0; a < vocab_hash1_size1; a++) vocab_hash1[a] = -1; 164 | fin = fopen(train_file1, "rb"); 165 | if (fin == NULL) { 166 | Rprintf("ERROR: training data file not found!\n"); 167 | return; 168 | } 169 | vocab_size1 = 0; 170 | AddWordToVocab1((char *)""); 171 | while (1) { 172 | ReadWord1(word, fin); 173 | if (feof(fin)) break; 174 | if (!strcmp(word, "")) { 175 | start = 1; 176 | continue; 177 | } else start = 0; 178 | train_words1++; 179 | if ((debug_mode1 > 1) && (train_words1 % 100000 == 0)) { 180 | Rprintf("Words processed: %lldK Vocab size: %lldK %c", train_words1 / 1000, vocab_size1 / 1000, 13); 181 | // fflush(stdout); 182 | } 183 | i = SearchVocab1(word); 184 | if (i == -1) { 185 | a = AddWordToVocab1(word); 186 | vocab1[a].cn = 1; 187 | } else vocab1[i].cn++; 188 | if (start) continue; 189 | sprintf(bigram_word, "%s_%s", last_word, word); 190 | bigram_word[MAX_STRING1 - 1] = 0; 191 | strcpy(last_word, word); 192 | i = SearchVocab1(bigram_word); 193 | if (i == -1) { 194 | a = AddWordToVocab1(bigram_word); 195 | vocab1[a].cn = 1; 196 | } else vocab1[i].cn++; 197 | if (vocab_size1 > vocab_hash1_size1 * 0.7) ReduceVocab1(); 198 | } 199 | SortVocab1(); 200 | if (debug_mode1 > 0) { 201 | Rprintf("\nVocab size (unigrams + bigrams): %lld\n", vocab_size1); 202 | Rprintf("Words in train file: %lld\n", train_words1); 203 | } 204 | fclose(fin); 205 | } 206 | 207 | void TrainModel1() { 208 | long long pa = 0, pb = 0, pab = 0, oov, i, li = -1, cn = 0; 209 | char word[MAX_STRING1], last_word[MAX_STRING1], bigram_word[MAX_STRING1 * 2]; 210 | real score; 211 | FILE *fo, *fin; 212 | Rprintf("Starting training using file %s\n", train_file1); 213 | LearnVocabFromTrainFile1(); 214 | fin = fopen(train_file1, "rb"); 215 | fo = fopen(output_file1, "wb"); 216 | word[0] = 0; 217 | while (1) { 218 | strcpy(last_word, word); 219 | ReadWord1(word, fin); 220 | if (feof(fin)) break; 221 | if (!strcmp(word, "")) { 222 | fprintf(fo, "\n"); 223 | continue; 224 | } 225 | cn++; 226 | if ((debug_mode1 > 1) && (cn % 100000 == 0)) { 227 | Rprintf("Words written: %lldK%c", cn / 1000, 13); 228 | // fflush(stdout); 229 | 
} 230 | oov = 0; 231 | i = SearchVocab1(word); 232 | if (i == -1) oov = 1; else pb = vocab1[i].cn; 233 | if (li == -1) oov = 1; 234 | li = i; 235 | sprintf(bigram_word, "%s_%s", last_word, word); 236 | bigram_word[MAX_STRING1 - 1] = 0; 237 | i = SearchVocab1(bigram_word); 238 | if (i == -1) oov = 1; else pab = vocab1[i].cn; 239 | if (pa < min_count1) oov = 1; 240 | if (pb < min_count1) oov = 1; 241 | if (oov) score = 0; else score = (pab - min_count1) / (real)pa / (real)pb * (real)train_words1; 242 | if (score > threshold1) { 243 | fprintf(fo, "_%s", word); 244 | pb = 0; 245 | } else fprintf(fo, " %s", word); 246 | pa = pb; 247 | } 248 | fclose(fo); 249 | fclose(fin); 250 | } 251 | 252 | 253 | void word2phrase(char **rtrain_file,int *rdebug_mode,char **routput_file,int *rmin_count,double *rthreshold) { 254 | /* 255 | if (argc == 1) { 256 | printf("WORD2PHRASE tool v0.1a\n\n"); 257 | printf("Options:\n"); 258 | printf("Parameters for training:\n"); 259 | printf("\t-train \n"); 260 | printf("\t\tUse text data from to train the model\n"); 261 | printf("\t-output \n"); 262 | printf("\t\tUse to save the resulting word vectors / word clusters / phrases\n"); 263 | printf("\t-min-count \n"); 264 | printf("\t\tThis will discard words that appear less than times; default is 5\n"); 265 | printf("\t-threshold1 \n"); 266 | printf("\t\t The value represents threshold1 for forming the phrases (higher means less phrases); default 100\n"); 267 | printf("\t-debug \n"); 268 | printf("\t\tSet the debug mode (default = 2 = more info during training)\n"); 269 | printf("\nExamples:\n"); 270 | printf("./word2phrase -train text.txt -output phrases.txt -threshold1 100 -debug 2\n\n"); 271 | return 0; 272 | } 273 | */ 274 | if (*rtrain_file[0]!='0') strcpy(train_file1, *rtrain_file); 275 | if (rdebug_mode[0]!=0) debug_mode1 = rdebug_mode[0]; 276 | if (*routput_file[0]!='0') strcpy(output_file1, *routput_file); 277 | if (rmin_count[0]!=0) min_count1 = rmin_count[0]; 278 | if (rthreshold[0]!= 0) threshold1=rthreshold[0]; 279 | vocab1 = (struct vocab_word1 *)calloc(vocab_max_size1, sizeof(struct vocab_word1)); 280 | vocab_hash1 = (int *)calloc(vocab_hash1_size1, sizeof(int)); 281 | TrainModel1(); 282 | 283 | } 284 | -------------------------------------------------------------------------------- /src/word2vec.h: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
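// Orientation: this header carries the bulk of Google's original word2vec
// implementation: vocabulary construction, the Huffman tree used for hierarchical
// softmax, the unigram table used for negative sampling, and the CBOW / skip-gram
// training threads. It is included by tmcn_word2vec.c, whose tmcn_word2vec()
// function fills in the globals declared below (layer1_size, num_threads, window,
// cbow, min_count, iter, negative, ...) and then calls TrainModel().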
14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include "R.h" 22 | #include "Rmath.h" 23 | 24 | 25 | #define MAX_STRING 100 26 | #define EXP_TABLE_SIZE 1000 27 | #define MAX_EXP 6 28 | #define MAX_SENTENCE_LENGTH 1000 29 | #define MAX_CODE_LENGTH 40 30 | 31 | const int vocab_hash_size = 30000000; // Maximum 30 * 0.7 = 21M words in the vocabulary 32 | 33 | typedef float real; // Precision of float numbers 34 | 35 | struct vocab_word { 36 | long long cn; 37 | int *point; 38 | char *word, *code, codelen; 39 | }; 40 | 41 | char train_file[1024], output_file[1024]; 42 | char save_vocab_file[MAX_STRING], read_vocab_file[MAX_STRING]; 43 | struct vocab_word *vocab; 44 | int binary = 0, cbow = 0, debug_mode = 2, window = 12, min_count = 5, num_threads = 1, min_reduce = 1; 45 | int *vocab_hash; 46 | long long vocab_max_size = 1000, vocab_size = 0, layer1_size = 100; 47 | long long train_words = 0, word_count_actual = 0, iter = 5, file_size = 0, classes = 0; 48 | real alpha = 0.025, starting_alpha, sample = 0; 49 | real *syn0, *syn1, *syn1neg, *expTable; 50 | clock_t start; 51 | 52 | int hs = 1, negative = 0; 53 | const int table_size = 1e8; 54 | int *table; 55 | 56 | 57 | void InitUnigramTable() { 58 | int a, i; 59 | long long train_words_pow = 0; 60 | real d1, power = 0.75; 61 | table = (int *)malloc(table_size * sizeof(int)); 62 | for (a = 0; a < vocab_size; a++) train_words_pow += pow(vocab[a].cn, power); 63 | i = 0; 64 | d1 = pow(vocab[i].cn, power) / (real)train_words_pow; 65 | for (a = 0; a < table_size; a++) { 66 | table[a] = i; 67 | if (a / (real)table_size > d1) { 68 | i++; 69 | d1 += pow(vocab[i].cn, power) / (real)train_words_pow; 70 | } 71 | if (i >= vocab_size) i = vocab_size - 1; 72 | } 73 | } 74 | 75 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 76 | void ReadWord(char *word, FILE *fin) { 77 | int a = 0, ch; 78 | while (!feof(fin)) { 79 | ch = fgetc(fin); 80 | if (ch == 13) continue; 81 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 82 | if (a > 0) { 83 | if (ch == '\n') ungetc(ch, fin); 84 | break; 85 | } 86 | if (ch == '\n') { 87 | strcpy(word, (char *)""); 88 | return; 89 | } else continue; 90 | } 91 | word[a] = ch; 92 | a++; 93 | if (a >= MAX_STRING - 1) a--; // Truncate too long words 94 | } 95 | word[a] = 0; 96 | } 97 | 98 | // Returns hash value of a word 99 | int GetWordHash(char *word) { 100 | unsigned long long a, hash = 0; 101 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 102 | hash = hash % vocab_hash_size; 103 | return hash; 104 | } 105 | 106 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 107 | int SearchVocab(char *word) { 108 | unsigned int hash = GetWordHash(word); 109 | while (1) { 110 | if (vocab_hash[hash] == -1) return -1; 111 | if (!strcmp(word, vocab[vocab_hash[hash]].word)) return vocab_hash[hash]; 112 | hash = (hash + 1) % vocab_hash_size; 113 | } 114 | return -1; 115 | } 116 | 117 | // Reads a word and returns its index in the vocabulary 118 | int ReadWordIndex(FILE *fin) { 119 | char word[MAX_STRING]; 120 | ReadWord(word, fin); 121 | if (feof(fin)) return -1; 122 | return SearchVocab(word); 123 | } 124 | 125 | // Adds a word to the vocabulary 126 | int AddWordToVocab(char *word) { 127 | unsigned int hash, length = strlen(word) + 1; 128 | if (length > MAX_STRING) length = MAX_STRING; 129 | vocab[vocab_size].word = (char *)calloc(length, sizeof(char)); 130 | strcpy(vocab[vocab_size].word, word); 131 | 
vocab[vocab_size].cn = 0; 132 | vocab_size++; 133 | // Reallocate memory if needed 134 | if (vocab_size + 2 >= vocab_max_size) { 135 | vocab_max_size += 1000; 136 | vocab = (struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word)); 137 | } 138 | hash = GetWordHash(word); 139 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 140 | vocab_hash[hash] = vocab_size - 1; 141 | return vocab_size - 1; 142 | } 143 | 144 | // Used later for sorting by word counts 145 | int VocabCompare(const void *a, const void *b) { 146 | return ((struct vocab_word *)b)->cn - ((struct vocab_word *)a)->cn; 147 | } 148 | 149 | // Sorts the vocabulary by frequency using word counts 150 | void SortVocab() { 151 | int a, size; 152 | unsigned int hash; 153 | // Sort the vocabulary and keep at the first position 154 | qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare); 155 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 156 | size = vocab_size; 157 | train_words = 0; 158 | for (a = 0; a < size; a++) { 159 | // Words occuring less than min_count times will be discarded from the vocab 160 | if (vocab[a].cn < min_count) { 161 | vocab_size--; 162 | free(vocab[vocab_size].word); 163 | } else { 164 | // Hash will be re-computed, as after the sorting it is not actual 165 | hash=GetWordHash(vocab[a].word); 166 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 167 | vocab_hash[hash] = a; 168 | train_words += vocab[a].cn; 169 | } 170 | } 171 | vocab = (struct vocab_word *)realloc(vocab, (vocab_size + 1) * sizeof(struct vocab_word)); 172 | // Allocate memory for the binary tree construction 173 | for (a = 0; a < vocab_size; a++) { 174 | vocab[a].code = (char *)calloc(MAX_CODE_LENGTH, sizeof(char)); 175 | vocab[a].point = (int *)calloc(MAX_CODE_LENGTH, sizeof(int)); 176 | } 177 | } 178 | 179 | // Reduces the vocabulary by removing infrequent tokens 180 | void ReduceVocab() { 181 | int a, b = 0; 182 | unsigned int hash; 183 | for (a = 0; a < vocab_size; a++) if (vocab[a].cn > min_reduce) { 184 | vocab[b].cn = vocab[a].cn; 185 | vocab[b].word = vocab[a].word; 186 | b++; 187 | } else free(vocab[a].word); 188 | vocab_size = b; 189 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 190 | for (a = 0; a < vocab_size; a++) { 191 | // Hash will be re-computed, as it is not actual 192 | hash = GetWordHash(vocab[a].word); 193 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 194 | vocab_hash[hash] = a; 195 | } 196 | fflush(NULL); 197 | min_reduce++; 198 | } 199 | 200 | // Create binary Huffman tree using the word counts 201 | // Frequent words will have short uniqe binary codes 202 | void CreateBinaryTree() { 203 | long long a, b, i, min1i, min2i, pos1, pos2, point[MAX_CODE_LENGTH]; 204 | char code[MAX_CODE_LENGTH]; 205 | long long *count = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 206 | long long *binary = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 207 | long long *parent_node = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 208 | for (a = 0; a < vocab_size; a++) count[a] = vocab[a].cn; 209 | for (a = vocab_size; a < vocab_size * 2; a++) count[a] = 1e15; 210 | pos1 = vocab_size - 1; 211 | pos2 = vocab_size; 212 | // Following algorithm constructs the Huffman tree by adding one node at a time 213 | for (a = 0; a < vocab_size - 1; a++) { 214 | // First, find two smallest nodes 'min1, min2' 215 | if (pos1 >= 0) { 216 | if (count[pos1] < count[pos2]) { 217 | min1i = pos1; 218 | 
pos1--; 219 | } else { 220 | min1i = pos2; 221 | pos2++; 222 | } 223 | } else { 224 | min1i = pos2; 225 | pos2++; 226 | } 227 | if (pos1 >= 0) { 228 | if (count[pos1] < count[pos2]) { 229 | min2i = pos1; 230 | pos1--; 231 | } else { 232 | min2i = pos2; 233 | pos2++; 234 | } 235 | } else { 236 | min2i = pos2; 237 | pos2++; 238 | } 239 | count[vocab_size + a] = count[min1i] + count[min2i]; 240 | parent_node[min1i] = vocab_size + a; 241 | parent_node[min2i] = vocab_size + a; 242 | binary[min2i] = 1; 243 | } 244 | // Now assign binary code to each vocabulary word 245 | for (a = 0; a < vocab_size; a++) { 246 | b = a; 247 | i = 0; 248 | while (1) { 249 | code[i] = binary[b]; 250 | point[i] = b; 251 | i++; 252 | b = parent_node[b]; 253 | if (b == vocab_size * 2 - 2) break; 254 | } 255 | vocab[a].codelen = i; 256 | vocab[a].point[0] = vocab_size - 2; 257 | for (b = 0; b < i; b++) { 258 | vocab[a].code[i - b - 1] = code[b]; 259 | vocab[a].point[i - b] = point[b] - vocab_size; 260 | } 261 | } 262 | free(count); 263 | free(binary); 264 | free(parent_node); 265 | } 266 | 267 | void LearnVocabFromTrainFile() { 268 | char word[MAX_STRING]; 269 | FILE *fin; 270 | long long a, i; 271 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 272 | fin = fopen(train_file, "rb"); 273 | if (fin == NULL) { 274 | Rprintf("ERROR: training data file not found!\n"); 275 | Rf_error("Error!"); 276 | } 277 | vocab_size = 0; 278 | AddWordToVocab((char *)""); 279 | while (1) { 280 | ReadWord(word, fin); 281 | if (feof(fin)) break; 282 | train_words++; 283 | if ((debug_mode > 1) && (train_words % 100000 == 0)) { 284 | Rprintf("%lldK%c", train_words / 1000, 13); 285 | fflush(NULL); 286 | } 287 | i = SearchVocab(word); 288 | if (i == -1) { 289 | a = AddWordToVocab(word); 290 | vocab[a].cn = 1; 291 | } else vocab[i].cn++; 292 | if (vocab_size > vocab_hash_size * 0.7) ReduceVocab(); 293 | } 294 | SortVocab(); 295 | if (debug_mode > 0) { 296 | Rprintf("Vocab size: %lld\n", vocab_size); 297 | Rprintf("Words in train file: %lld\n", train_words); 298 | } 299 | file_size = ftell(fin); 300 | fclose(fin); 301 | } 302 | 303 | void SaveVocab() { 304 | long long i; 305 | FILE *fo = fopen(save_vocab_file, "wb"); 306 | for (i = 0; i < vocab_size; i++) fprintf(fo, "%s %lld\n", vocab[i].word, vocab[i].cn); 307 | fclose(fo); 308 | } 309 | 310 | void ReadVocab() { 311 | long long a, i = 0; 312 | char c; 313 | char word[MAX_STRING]; 314 | FILE *fin = fopen(read_vocab_file, "rb"); 315 | if (fin == NULL) { 316 | Rprintf("Vocabulary file not found\n"); 317 | Rf_error("Error!"); 318 | } 319 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 320 | vocab_size = 0; 321 | while (1) { 322 | ReadWord(word, fin); 323 | if (feof(fin)) break; 324 | a = AddWordToVocab(word); 325 | if(fscanf(fin, "%lld%c", &vocab[a].cn, &c)==1) 326 | ; 327 | i++; 328 | } 329 | SortVocab(); 330 | if (debug_mode > 0) { 331 | Rprintf("Vocab size: %lld\n", vocab_size); 332 | Rprintf("Words in train file: %lld\n", train_words); 333 | } 334 | fin = fopen(train_file, "rb"); 335 | if (fin == NULL) { 336 | Rprintf("ERROR: training data file not found!\n"); 337 | Rf_error("Error!"); 338 | } 339 | fseek(fin, 0, SEEK_END); 340 | file_size = ftell(fin); 341 | fclose(fin); 342 | } 343 | 344 | void InitNet() { 345 | long long a, b; 346 | unsigned long long next_random = 1; 347 | // a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real)); 348 | #ifdef _WIN32 349 | syn0 = (real *)_aligned_malloc((long long)vocab_size * layer1_size * 
sizeof(real), 128); 350 | #else 351 | a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real)); 352 | #endif 353 | 354 | if (syn0 == NULL) {Rprintf("Memory allocation failed\n"); Rf_error("Error!");} 355 | if (hs) { 356 | // a = posix_memalign((void **)&syn1, 128, (long long)vocab_size * layer1_size * sizeof(real)); 357 | #ifdef _WIN32 358 | syn1 = (real *)_aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128); 359 | #else 360 | a = posix_memalign((void **)&(syn1), 128, (long long)vocab_size * layer1_size * sizeof(real)); 361 | #endif 362 | 363 | if (syn1 == NULL) {Rprintf("Memory allocation failed\n"); Rf_error("Error!");} 364 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 365 | syn1[a * layer1_size + b] = 0; 366 | } 367 | if (negative>0) { 368 | // a = posix_memalign((void **)&syn1neg, 128, (long long)vocab_size * layer1_size * sizeof(real)); 369 | #ifdef _WIN32 370 | syn1neg = (real *)_aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128); 371 | #else 372 | a = posix_memalign((void **)&(syn1neg), 128, (long long)vocab_size * layer1_size * sizeof(real)); 373 | #endif 374 | 375 | if (syn1neg == NULL) {Rprintf("Memory allocation failed\n"); Rf_error("Error!");} 376 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 377 | syn1neg[a * layer1_size + b] = 0; 378 | } 379 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) { 380 | next_random = next_random * (unsigned long long)25214903917 + 11; 381 | syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size; 382 | } 383 | CreateBinaryTree(); 384 | } 385 | 386 | void *TrainModelThread(void *id) { 387 | long long a, b, d, word, last_word, sentence_length = 0, sentence_position = 0; 388 | long long word_count = 0, last_word_count = 0, sen[MAX_SENTENCE_LENGTH + 1]; 389 | long long l1, l2, c, target, label, local_iter = iter; 390 | unsigned long long next_random = (long long)id; 391 | // real doneness_f, speed_f; 392 | // For writing to R. 
393 | real f, g; 394 | clock_t now; 395 | real *neu1 = (real *)calloc(layer1_size, sizeof(real)); 396 | real *neu1e = (real *)calloc(layer1_size, sizeof(real)); 397 | FILE *fi = fopen(train_file, "rb"); 398 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 399 | while (1) { 400 | if (word_count - last_word_count > 10000) { 401 | word_count_actual += word_count - last_word_count; 402 | last_word_count = word_count; 403 | /* if ((debug_mode > 1)) { */ 404 | /* now=clock(); */ 405 | /* doneness_f = word_count_actual / (real)(iter * train_words + 1) * 100; */ 406 | /* speed_f = word_count_actual / ((real)(now - start + 1) / (real)CLOCKS_PER_SEC * 1000); */ 407 | 408 | /* Rprintf("%cAlpha: %f Progress: %.2f%% Words/thread/sec: %.2fk ", 13, alpha, */ 409 | /* doneness_f, speed_f); */ 410 | 411 | /* fflush(NULL); */ 412 | /* } */ 413 | alpha = starting_alpha * (1 - word_count_actual / (real)(iter * train_words + 1)); 414 | if (alpha < starting_alpha * 0.0001) alpha = starting_alpha * 0.0001; 415 | } 416 | if (sentence_length == 0) { 417 | while (1) { 418 | word = ReadWordIndex(fi); 419 | if (feof(fi)) break; 420 | if (word == -1) continue; 421 | word_count++; 422 | if (word == 0) break; 423 | // The subsampling randomly discards frequent words while keeping the ranking same 424 | if (sample > 0) { 425 | real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn; 426 | next_random = next_random * (unsigned long long)25214903917 + 11; 427 | if (ran < (next_random & 0xFFFF) / (real)65536) continue; 428 | } 429 | sen[sentence_length] = word; 430 | sentence_length++; 431 | if (sentence_length >= MAX_SENTENCE_LENGTH) break; 432 | } 433 | sentence_position = 0; 434 | } 435 | if (feof(fi) || (word_count > train_words / num_threads)) { 436 | word_count_actual += word_count - last_word_count; 437 | local_iter--; 438 | if (local_iter == 0) break; 439 | word_count = 0; 440 | last_word_count = 0; 441 | sentence_length = 0; 442 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 443 | continue; 444 | } 445 | word = sen[sentence_position]; 446 | if (word == -1) continue; 447 | for (c = 0; c < layer1_size; c++) neu1[c] = 0; 448 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 449 | next_random = next_random * (unsigned long long)25214903917 + 11; 450 | b = next_random % window; 451 | if (cbow) { //train the cbow architecture 452 | // in -> hidden 453 | for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { 454 | c = sentence_position - window + a; 455 | if (c < 0) continue; 456 | if (c >= sentence_length) continue; 457 | last_word = sen[c]; 458 | if (last_word == -1) continue; 459 | for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; 460 | } 461 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 462 | f = 0; 463 | l2 = vocab[word].point[d] * layer1_size; 464 | // Propagate hidden -> output 465 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; 466 | if (f <= -MAX_EXP) continue; 467 | else if (f >= MAX_EXP) continue; 468 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 469 | // 'g' is the gradient multiplied by the learning rate 470 | g = (1 - vocab[word].code[d] - f) * alpha; 471 | // Propagate errors output -> hidden 472 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 473 | // Learn weights hidden -> output 474 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c]; 475 | } 476 | // NEGATIVE SAMPLING 477 | if 
(negative > 0) for (d = 0; d < negative + 1; d++) { 478 | if (d == 0) { 479 | target = word; 480 | label = 1; 481 | } else { 482 | next_random = next_random * (unsigned long long)25214903917 + 11; 483 | target = table[(next_random >> 16) % table_size]; 484 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 485 | if (target == word) continue; 486 | label = 0; 487 | } 488 | l2 = target * layer1_size; 489 | f = 0; 490 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2]; 491 | if (f > MAX_EXP) g = (label - 1) * alpha; 492 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 493 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 494 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 495 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c]; 496 | } 497 | // hidden -> in 498 | for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { 499 | c = sentence_position - window + a; 500 | if (c < 0) continue; 501 | if (c >= sentence_length) continue; 502 | last_word = sen[c]; 503 | if (last_word == -1) continue; 504 | for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c]; 505 | } 506 | } else { //train skip-gram 507 | for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { 508 | c = sentence_position - window + a; 509 | if (c < 0) continue; 510 | if (c >= sentence_length) continue; 511 | last_word = sen[c]; 512 | if (last_word == -1) continue; 513 | l1 = last_word * layer1_size; 514 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 515 | // HIERARCHICAL SOFTMAX 516 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 517 | f = 0; 518 | l2 = vocab[word].point[d] * layer1_size; 519 | // Propagate hidden -> output 520 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2]; 521 | if (f <= -MAX_EXP) continue; 522 | else if (f >= MAX_EXP) continue; 523 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 524 | // 'g' is the gradient multiplied by the learning rate 525 | g = (1 - vocab[word].code[d] - f) * alpha; 526 | // Propagate errors output -> hidden 527 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 528 | // Learn weights hidden -> output 529 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1]; 530 | } 531 | // NEGATIVE SAMPLING 532 | if (negative > 0) for (d = 0; d < negative + 1; d++) { 533 | if (d == 0) { 534 | target = word; 535 | label = 1; 536 | } else { 537 | next_random = next_random * (unsigned long long)25214903917 + 11; 538 | target = table[(next_random >> 16) % table_size]; 539 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 540 | if (target == word) continue; 541 | label = 0; 542 | } 543 | l2 = target * layer1_size; 544 | f = 0; 545 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2]; 546 | if (f > MAX_EXP) g = (label - 1) * alpha; 547 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 548 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 549 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 550 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1]; 551 | } 552 | // Learn weights input -> hidden 553 | for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c]; 554 | } 555 | } 556 | sentence_position++; 557 | if (sentence_position >= sentence_length) { 558 | sentence_length = 0; 559 | continue; 560 | } 561 | } 562 | fclose(fi); 563 | free(neu1); 564 | free(neu1e); 565 | 
pthread_exit(NULL); 566 | } 567 | 568 | void TrainModel() { 569 | long a, b, c, d; 570 | FILE *fo; 571 | pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t)); 572 | Rprintf("Starting training using file %s\n", train_file); 573 | starting_alpha = alpha; 574 | if (read_vocab_file[0] != 0) ReadVocab(); else LearnVocabFromTrainFile(); 575 | if (save_vocab_file[0] != 0) SaveVocab(); 576 | if (output_file[0] == 0) return; 577 | InitNet(); 578 | if (negative > 0) InitUnigramTable(); 579 | start = clock(); 580 | for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a); 581 | for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL); 582 | fo = fopen(output_file, "wb"); 583 | if (classes == 0) { 584 | // Save the word vectors 585 | fprintf(fo, "%lld %lld\n", vocab_size, layer1_size); 586 | for (a = 0; a < vocab_size; a++) { 587 | fprintf(fo, "%s ", vocab[a].word); 588 | if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo); 589 | else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]); 590 | fprintf(fo, "\n"); 591 | } 592 | } else { 593 | // Run K-means on the word vectors 594 | int clcn = classes, iter = 10, closeid; 595 | int *centcn = (int *)malloc(classes * sizeof(int)); 596 | int *cl = (int *)calloc(vocab_size, sizeof(int)); 597 | real closev, x; 598 | real *cent = (real *)calloc(classes * layer1_size, sizeof(real)); 599 | for (a = 0; a < vocab_size; a++) cl[a] = a % clcn; 600 | for (a = 0; a < iter; a++) { 601 | for (b = 0; b < clcn * layer1_size; b++) cent[b] = 0; 602 | for (b = 0; b < clcn; b++) centcn[b] = 1; 603 | for (c = 0; c < vocab_size; c++) { 604 | for (d = 0; d < layer1_size; d++) { 605 | cent[layer1_size * cl[c] + d] += syn0[c * layer1_size + d]; 606 | centcn[cl[c]]++; 607 | } 608 | } 609 | for (b = 0; b < clcn; b++) { 610 | closev = 0; 611 | for (c = 0; c < layer1_size; c++) { 612 | cent[layer1_size * b + c] /= centcn[b]; 613 | closev += cent[layer1_size * b + c] * cent[layer1_size * b + c]; 614 | } 615 | closev = sqrt(closev); 616 | for (c = 0; c < layer1_size; c++) cent[layer1_size * b + c] /= closev; 617 | } 618 | for (c = 0; c < vocab_size; c++) { 619 | closev = -10; 620 | closeid = 0; 621 | for (d = 0; d < clcn; d++) { 622 | x = 0; 623 | for (b = 0; b < layer1_size; b++) x += cent[layer1_size * d + b] * syn0[c * layer1_size + b]; 624 | if (x > closev) { 625 | closev = x; 626 | closeid = d; 627 | } 628 | } 629 | cl[c] = closeid; 630 | } 631 | } 632 | // Save the K-means classes 633 | for (a = 0; a < vocab_size; a++) fprintf(fo, "%s %d\n", vocab[a].word, cl[a]); 634 | free(centcn); 635 | free(cent); 636 | free(cl); 637 | } 638 | fclose(fo); 639 | } 640 | 641 | int ArgPos(char *str, int argc, char **argv) { 642 | int a; 643 | for (a = 1; a < argc; a++) if (!strcmp(str, argv[a])) { 644 | if (a == argc - 1) { 645 | Rprintf("Argument missing for %s\n", str); 646 | Rf_error("Error!"); 647 | } 648 | return a; 649 | } 650 | return -1; 651 | } 652 | 653 | -------------------------------------------------------------------------------- /tests/run-all.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | test_check("wordVectors") 3 | -------------------------------------------------------------------------------- /tests/testthat/test-linear-algebra-functions.R: -------------------------------------------------------------------------------- 1 | context("VectorSpaceModel Linear Algebra is sensible") 2 | 
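# The range checks below make sense if cosineDist is 1 - cosineSimilarity, since
# cosine similarity lies in [-1, 1]. The following sketch (an illustrative
# addition, not one of the original tests) spells that relationship out for a
# single pair of words.
test_that("cosineDist looks like 1 - cosineSimilarity for a single pair",
  expect_lt(
    abs(as.numeric(cosineDist(demo_vectors[["good"]], demo_vectors[["bad"]])) -
      (1 - as.numeric(cosineSimilarity(demo_vectors[["good"]], demo_vectors[["bad"]])))),
    1e-07
  )
)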
3 | test_that("Vectors are near to themselves", 4 | expect_lt( 5 | cosineDist(demo_vectors[1,],demo_vectors[1,]), 6 | 1e-07 7 | ) 8 | ) 9 | 10 | test_that("Distance is between 0 and 2 (pt 1)", 11 | expect_gt( 12 | min(cosineDist(demo_vectors,demo_vectors)), 13 | -1e-07 14 | ) 15 | ) 16 | 17 | test_that("Distance is between 0 and 2 (pt 1)", 18 | expect_lt( 19 | max(cosineDist(demo_vectors,demo_vectors)), 20 | 2 + 1e-07) 21 | ) 22 | 23 | 24 | test_that("Distance is between 0 and 2 (pt 1)", 25 | expect_lt( 26 | max(abs(1-square_magnitudes(normalize_lengths(demo_vectors)))), 27 | 1e-07) 28 | ) 29 | -------------------------------------------------------------------------------- /tests/testthat/test-name-collapsing.r: -------------------------------------------------------------------------------- 1 | context("Name collapsing") 2 | 3 | test_that("name substitution works", 4 | expect_equivalent( 5 | demo_vectors %>% closest_to(~"good") 6 | , 7 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 8 | ) 9 | ) 10 | 11 | test_that("character substitution works", 12 | expect_equivalent( 13 | demo_vectors %>% closest_to("good") 14 | , 15 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 16 | ) 17 | ) 18 | 19 | test_that("addition works in substitutions", 20 | expect_equivalent( 21 | demo_vectors %>% closest_to(~ "good" + "bad") 22 | , 23 | demo_vectors %>% closest_to(demo_vectors[["good"]] + demo_vectors[["bad"]]) 24 | ) 25 | ) 26 | 27 | test_that("addition provides correct results", 28 | expect_gt( 29 | demo_vectors[["good"]] %>% cosineSimilarity(demo_vectors[["good"]] + demo_vectors[["bad"]]) 30 | , 31 | .8)) 32 | 33 | test_that("single-argument negation works", 34 | expect_equivalent( 35 | demo_vectors %>% closest_to(~ -("good"-"bad")) 36 | , 37 | demo_vectors %>% closest_to(~ "bad"-"good") 38 | 39 | )) 40 | 41 | test_that("closest_to can wrap in function", 42 | expect_equal( 43 | {function(x) {closest_to(x,~ "class" + "school")}}(demo_vectors), 44 | closest_to(demo_vectors,~ "class" + "school") 45 | ) 46 | ) 47 | 48 | test_that("Name substitution is occurring", 49 | expect_equivalent( 50 | cosineSimilarity(demo_vectors,"good"), 51 | cosineSimilarity(demo_vectors,demo_vectors[["good"]]) 52 | )) 53 | 54 | test_that("reference in functional scope is passed along", 55 | expect_equivalent( 56 | lapply(c("good"),function(referenced_word) 57 | {demo_vectors %>% closest_to(demo_vectors[[referenced_word]])})[[1]], 58 | demo_vectors %>% closest_to("good") 59 | ) 60 | ) 61 | -------------------------------------------------------------------------------- /tests/testthat/test-read-write.R: -------------------------------------------------------------------------------- 1 | context("Read and Write works") 2 | 3 | ## TODO: Add tests for non-binary format; check actual value of results; test reading of slices. 
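# A sketch toward the TODO above (checking actual values on a round trip). The
# file name and tolerance are illustrative choices, and this assumes the binary
# writer stores the values as-is in single precision rather than normalizing them.
test_that("Binary round trip approximately preserves values (sketch)", {
  write.binary.word2vec(demo_vectors[1:100,], "tmp.bin")
  roundtrip <- read.binary.vectors("tmp.bin")
  expect_lt(max(abs(demo_vectors[1:100,] - roundtrip)), 1e-06)
})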
4 | 5 | test_that("Writing works", 6 | expect_null( 7 | write.binary.word2vec(demo_vectors[1:100,],"binary.bin"), 8 | 1e-07 9 | ) 10 | ) 11 | 12 | test_that("Reading Works", 13 | expect_s4_class( 14 | read.binary.vectors("binary.bin"), 15 | "VectorSpaceModel" 16 | ) 17 | ) 18 | 19 | -------------------------------------------------------------------------------- /tests/testthat/test-rejection.R: -------------------------------------------------------------------------------- 1 | context("Rejection Works") 2 | 3 | test_that("Rejection works along gender binary", 4 | expect_gt( 5 | { 6 | rejected_frame <- demo_vectors %>% reject(~ "man" - "woman") 7 | cosineDist(demo_vectors[["he"]],demo_vectors[["she"]] ) - 8 | cosineDist(rejected_frame[["he"]],rejected_frame[["she"]] ) 9 | }, 10 | .4 11 | ) 12 | ) 13 | -------------------------------------------------------------------------------- /tests/testthat/test-train.R: -------------------------------------------------------------------------------- 1 | context("Training Functions Work") 2 | 3 | # This fails on Travis. I'll worry about this later. 4 | demo = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. 5 | 6 | Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. 7 | 8 | But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth. 
9 | " 10 | 11 | message("In directory", getwd()) 12 | cat(demo,file = "input.txt") 13 | if (file.exists("tmp.txt")) file.remove("tmp.txt") 14 | 15 | test_that("Preparation produces file", 16 | expect_equal( 17 | prep_word2vec("input.txt","tmp.txt"), 18 | "tmp.txt" 19 | ) 20 | ) 21 | 22 | test_that("Preparation produces file", 23 | expect_equal( 24 | prep_word2vec("input.txt","tmp.txt"), 25 | "tmp.txt" 26 | ) 27 | ) 28 | 29 | test_that("Tokenization is the right length", 30 | expect_lt( 31 | 2, 32 | 272 - length(stringr::str_split(readr::read_file("tmp.txt"), " ")) 33 | ) 34 | ) 35 | if (FALSE) { 36 | test_that("Bundling works on multiple levels", 37 | expect_equal( 38 | prep_word2vec("input.txt","tmp.txt",bundle_ngrams = 3), 39 | "tmp.txt" 40 | ) 41 | ) 42 | } 43 | test_that("Training Works", 44 | expect_s4_class( 45 | train_word2vec("tmp.txt"), 46 | "VectorSpaceModel" 47 | ) 48 | ) 49 | 50 | -------------------------------------------------------------------------------- /tests/testthat/test-types.R: -------------------------------------------------------------------------------- 1 | context("VectorSpaceModel Class Works") 2 | 3 | test_that("Class Exists", 4 | expect_s4_class( 5 | demo_vectors, 6 | "VectorSpaceModel" 7 | ) 8 | ) 9 | 10 | test_that("Class inherits addition", 11 | expect_s4_class( 12 | demo_vectors+1, 13 | "VectorSpaceModel" 14 | ) 15 | ) 16 | 17 | test_that("Class inherits slices", 18 | expect_s4_class( 19 | demo_vectors[1,], 20 | "VectorSpaceModel" 21 | ) 22 | ) 23 | 24 | test_that("Slices aren't dropped in dimensionality", 25 | expect_s4_class( 26 | demo_vectors[1,], 27 | "matrix" 28 | ) 29 | ) 30 | -------------------------------------------------------------------------------- /vignettes/exploration.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec Workshop" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Vignette Title} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Exploring Word2Vec models 13 | 14 | R is a great language for *exploratory data analysis* in particular. If you're going to use a word2vec model in a larger pipeline, it may be important (intellectually or ethically) to spend a little while understanding what kind of model of language you've learned. 15 | 16 | This package makes it easy to do so, both by allowing you to read word2vec models to and from R, and by giving some syntactic sugar that lets you describe vector-space models concisely and clearly. 17 | 18 | Note that these functions may still be useful if you're a data analyst training word2vec models elsewhere (say, in gensim.) I'm also hopeful this can be a good way of interacting with varied vector models in a workshop session. 19 | 20 | If you want to train your own model or need help setting up the package, read the introductory vignette. Aside from the installation, it assumes more knowledge of R than this walkthrough. 21 | 22 | ## Why explore? 23 | 24 | In this vignette we're going to look at (a small portion of) a model trained on teaching evaluations. It's an interesting set, but it's also one that shows the importance of exploring vector space models before you use them. Exploration is important because: 25 | 26 | 1. If you're a humanist or social scientist, it can tell you something about the *sources* by letting you see how they use language. 
These co-occurrence patterns can then be better investigated through close reading or more traditional collocation scores, which potentially more reliable but also much slower and less flexible. 27 | 28 | 2. If you're an engineer, it can help you understand some of biases built into a model that you're using in a larger pipeline. This can be both technically and ethically important: you don't want, for instance, to build a job-recommendation system which is disinclined to offer programming jobs to women because it has learned that women are unrepresented in CS jobs already. 29 | (On this point in word2vec in particular, see [here](https://freedom-to-tinker.com/blog/randomwalker/language-necessarily-contains-human-biases-and-so-will-machines-trained-on-language-corpora/) and [here](https://arxiv.org/abs/1607.06520).) 30 | 31 | ## Getting started. 32 | 33 | First we'll load this package, and the recommended package `magrittr`, which lets us pass these arguments around. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | The basic element of any vector space model is a *vectors.* for each word. In the demo data included with this package, an object called 'demo_vectors,' there are 500 numbers: you can start to examine them, if you with, by hand. So let's consider just one of these--the vector for 'good'. 41 | 42 | In R's ordinary matrix syntax, you could write that out laboriously as `demo_vectors[rownames(demo_vectors)=="good",]`. `WordVectors` provides a shorthand using double braces: 43 | 44 | ```{r} 45 | demo_vectors[["good"]] 46 | ``` 47 | 48 | These numbers are meaningless on their own. But in the vector space, we can find similar words. 49 | 50 | ```{r} 51 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 52 | ``` 53 | 54 | The `%>%` is the pipe operator from magrittr; it helps to keep things organized, and is particularly useful with some of the things we'll see later. The 'similarity' scores here are cosine similarity in a vector space; 1.0 represents perfect similarity, 0 is no correlation, and -1.0 is complete opposition. In practice, vector "opposition" is different from the colloquial use of "opposite," and very rare. You'll only occasionally see vector scores below 0--as you can see above, "bad" is actually one of the most similar words to "good." 55 | 56 | When interactively exploring a single model (rather than comparing *two* models), it can be a pain to keep retyping words over and over. Rather than operate on the vectors, this package also lets you access the word directly by using R's formula notation: putting a tilde in front of it. For a single word, you can even access it directly, as so. 57 | 58 | ```{r} 59 | demo_vectors %>% closest_to("bad") 60 | ``` 61 | 62 | ## Vector math 63 | 64 | The tildes are necessary syntax where things get interesting--you can do **math** on these vectors. So if we want to find the words that are closest to the *combination* of "good" and "bad" (which is to say, words that get used in evaluation) we can write (see where the tilde is?): 65 | 66 | ```{r} 67 | 68 | demo_vectors %>% closest_to(~"good"+"bad") 69 | 70 | # The same thing could be written as: 71 | # demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]]) 72 | ``` 73 | 74 | Those are words that are common to both "good" and "bad". We could also find words that are shaded towards just good but *not* bad by using subtraction. 
75 | 76 | ```{r} 77 | demo_vectors %>% closest_to(~"good" - "bad") 78 | ``` 79 | 80 | > What does this "subtraction" vector mean? 81 | > In practice, the easiest way to think of it is probably simply as 'similar to 82 | > good and dissimilar to bad'. Omer Levy's papers suggest this interpretation. 83 | > But taking the vectors more seriously means you can think of it geometrically: "good"-"bad" is 84 | > a vector that describes the difference between positive and negative. 85 | > Similarity to this vector means, technically, the portion of a word's vector 86 | > whose multidimensional path lies largely along the direction between the two words. 87 | 88 | Again, you can easily switch the order to the opposite: here are a bunch of bad words: 89 | 90 | ```{r} 91 | demo_vectors %>% closest_to(~ "bad" - "good") 92 | ``` 93 | 94 | All sorts of binaries are captured in word2vec models. One of the most famous, since Mikolov's original word2vec paper, is *gender*. If you ask for similarity to "he"-"she", for example, you get words that appear mostly in a *male* context. Since these examples are from teaching evaluations, after just a few straightforwardly gendered words, we start to get words that are mostly applied to men ("arrogant") or fields where there are very few women in the university ("physics"). 95 | 96 | ```{r} 97 | demo_vectors %>% closest_to(~ "he" - "she") 98 | demo_vectors %>% closest_to(~ "she" - "he") 99 | ``` 100 | 101 | ## Analogies 102 | 103 | We can expand out the match to perform analogies. Men tend to be called 'guys'. 104 | What's the female equivalent? 105 | In an SAT-style analogy, you might write `he:guy::she:???`. 106 | In vector math, we think of this as moving between points. 107 | 108 | If you're using the mental framework of positive as 'similarity' and 109 | negative as 'dissimilarity,' you can think of this as starting at "guy", 110 | removing its similarity to "he", and adding a similarity to "she". 111 | 112 | This yields the answer: the most similar term to "guy" for a woman is "lady." 113 | 114 | ```{r} 115 | demo_vectors %>% closest_to(~ "guy" - "he" + "she") 116 | ``` 117 | 118 | If you're using the other mental framework, of thinking of these as real vectors, 119 | you might phrase this in a slightly different way. 120 | You have a gender vector `("she" - "he")` that represents the *direction* from masculinity 121 | to femininity. You can then add this vector to "guy", and that will take you to a new neighborhood. You might phrase that this way: note that the math is exactly equivalent, and 122 | only the grouping is different. 123 | 124 | ```{r} 125 | demo_vectors %>% closest_to(~ "guy" + ("she" - "he")) 126 | ``` 127 | 128 | Principal components can let you plot a subset of these vectors to see how they relate. You can imagine an arrow from "he" to "she", from "guy" to "lady", and from "man" to "woman"; all run in roughly the same direction. 129 | 130 | ```{r} 131 | 132 | demo_vectors[[c("lady","woman","man","he","she","guy"), average=F]] %>% 133 | plot(method="pca") 134 | 135 | ``` 136 | 137 | These lists of ten words at a time are useful for interactive exploration, but sometimes we might want to say 'n=Inf' to return the full list. For instance, we can combine these two methods to look at positive and negative words used to evaluate teachers. 138 | 139 | First we build up three data frames: a list of the top 75 evaluative words, and then complete lists of similarity to `"good" - "bad"` and `"she" - "he"`.
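One brief aside before the code that builds them (this note is an addition to the walkthrough): `closest_to` returns an ordinary data frame, and its similarity column is named after the expression you searched with. That is why the ggplot call further down has to wrap column names like `similarity to "good" - "bad"` in backticks. A quick way to see this for yourself:

```{r}
# The column names carry the search expression itself.
demo_vectors %>% closest_to(~ "good" - "bad", n=3) %>% names()
```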
140 | 141 | ```{r} 142 | top_evaluative_words = demo_vectors %>% 143 | closest_to(~ "good"+"bad",n=75) 144 | 145 | goodness = demo_vectors %>% 146 | closest_to(~ "good"-"bad",n=Inf) 147 | 148 | femininity = demo_vectors %>% 149 | closest_to(~ "she" - "he", n=Inf) 150 | ``` 151 | 152 | Then we can use tidyverse packages to join and plot these. 153 | An `inner_join` restricts us down to just those top 75 words, and ggplot 154 | can array the words on axes. 155 | 156 | ```{r} 157 | library(ggplot2) 158 | library(dplyr) 159 | 160 | top_evaluative_words %>% 161 | inner_join(goodness) %>% 162 | inner_join(femininity) %>% 163 | ggplot() + 164 | geom_text(aes(x=`similarity to "she" - "he"`, 165 | y=`similarity to "good" - "bad"`, 166 | label=word)) 167 | ``` 168 | 169 | -------------------------------------------------------------------------------- /vignettes/introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec introduction" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Word2Vec introduction} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Intro 13 | 14 | This vignette walks you through training a word2vec model, and using that model to search for similarities, to build clusters, and to visualize the vocabulary relationships of that model in two dimensions. If you are working with pre-trained vectors, you might want to jump straight to the "exploration" vignette; it is a little slower-paced, but doesn't show off quite so many features of the package. 15 | 16 | # Package installation 17 | 18 | If you have not installed this package, paste in the code below. More detailed installation instructions are at the end of the [package README](https://github.com/bmschmidt/wordVectors). 19 | 20 | ```{r} 21 | if (!require(wordVectors)) { 22 | if (!(require(devtools))) { 23 | install.packages("devtools") 24 | } 25 | devtools::install_github("bmschmidt/wordVectors") 26 | } 27 | 28 | 29 | ``` 30 | 31 | # Building test data 32 | 33 | We begin by importing the `wordVectors` package and the `magrittr` package, because its pipe operator makes it easier to work with data. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | First we build up a test file to train on. 41 | As an example, we'll use a collection of cookbooks from Michigan State University. 42 | This will be downloaded from the Internet if it isn't already present. 43 | 44 | ```{r} 45 | if (!file.exists("cookbooks.zip")) { 46 | download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip") 47 | } 48 | unzip("cookbooks.zip",exdir="cookbooks") 49 | ``` 50 | 51 | 52 | Then we *prepare* a single file for word2vec to read in. This does a couple of things: 53 | 54 | 1. Creates a single text file with the contents of every file in the original directory; 55 | 2. Uses the `tokenizers` package to clean and lowercase the original text; 56 | 3. If `bundle_ngrams` is greater than 1, joins together common bigrams into a single word. For example, "olive oil" may be joined together into "olive_oil" wherever it occurs. 57 | 58 | You can also do this preparation in another language: particularly for large files, that will be **much** faster. (For reference: in a console, `perl -ne 's/[^A-Za-z_0-9 \n]/ /g; print lc $_;' cookbooks/*.txt > cookbooks.txt` will do much the same thing on ASCII text in a couple of seconds.)
If you do this and want to bundle ngrams, you'll then need to call `word2phrase("cookbooks.txt","cookbook_bigrams.txt",...)` to build up the bigrams; call it twice if you want 3-grams, and so forth. 59 | 60 | 61 | ```{r} 62 | if (!file.exists("cookbooks.txt")) prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=2) 63 | ``` 64 | 65 | To train a word2vec model, use the function `train_word2vec`. This actually builds up the model. It uses an on-disk file as an intermediary and then reads that file into memory. 66 | 67 | ```{r} 68 | if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin") 69 | 70 | ``` 71 | 72 | A few notes: 73 | 74 | 1. The `vectors` parameter is the dimensionality of the representation. More vectors usually means more precision, but also more random error and slower operations. Likely choices are probably in the range 100-500. 75 | 2. The `threads` parameter is the number of processors to use on your computer. On a modern laptop, the fastest results will probably come from using between 2 and 8 threads, depending on the number of cores. 76 | 3. `iter` is how many times to read through the corpus. With fewer than 100 books, it can greatly help to increase the number of passes; if you're working with billions of words, it probably matters less. One danger of too few iterations is that words that aren't closely related will seem to be closer than they are. 77 | 4. Training can take a while. On my laptop, it takes a few minutes to train these cookbooks; larger models take proportionally more time. Because of the importance of more iterations to reducing noise, don't be afraid to set things up to require a lot of training time (as much as a day!). 78 | 5. One of the best things about the word2vec algorithm is that it *does* work on extremely large corpora in linear time. 79 | 6. In RStudio I've noticed that this sometimes appears to hang after a while; the percentage bar stops updating. If you check system activity it actually is still running, and will complete. 80 | 7. If at any point you want to *read in* a previously trained model, you can do so by typing `model = read.vectors("cookbook_vectors.bin")`. 81 | 82 | Now we have a model in memory, trained on about 10 million words from 77 cookbooks. What can it tell us about food? 83 | 84 | ## Similarity searches 85 | 86 | Well, you can run some basic operations to find the nearest elements: 87 | 88 | ```{r} 89 | model %>% closest_to("fish") 90 | ``` 91 | 92 | With that list, you can expand out further to search for multiple words: 93 | 94 | ```{r} 95 | model %>% 96 | closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50) 97 | ``` 98 | 99 | Now we have a pretty expansive list of potential fish-related words from old cookbooks. This can be useful for a few different things: 100 | 101 | 1. As a list of potential query terms for keyword search. 102 | 2. As a batch of words to use as a seed for some other text-mining operation; for example, you could pull all the paragraphs surrounding these terms to find the ways that fish are cooked. (A rough sketch of this idea appears just below the list.) 103 | 3. As a source for visualization. 104 | 105 | Or we can simply plot them to see how they arrange themselves; in this case, as the principal-components plot in the code block after that sketch shows, they don't form much visible structure.
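Here is one very rough version of that seed-word idea. This block is an addition to the walkthrough rather than part of the original text: the names `fish_terms`, `fish_pattern`, and `example_file` are made up for the example, and it assumes the `cookbooks/` directory unzipped earlier is still on disk. The original principal-components plot of the fish words follows it.

```{r}
# Rough sketch: use the expanded fish vocabulary as a crude keyword filter
# over one of the raw cookbook files.
fish_terms = model %>%
  closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50)
fish_pattern = paste0("\\b(", paste(fish_terms$word, collapse="|"), ")\\b")
example_file = list.files("cookbooks", full.names=TRUE)[1]
fishy_lines = grep(fish_pattern, tolower(readLines(example_file)), value=TRUE)
head(fishy_lines)
```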
106 | 107 | ```{r} 108 | some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150) 109 | fishy = model[[some_fish$word,average=F]] 110 | plot(fishy,method="pca") 111 | ``` 112 | 113 | ## Clustering 114 | 115 | We can use standard clustering algorithms, like kmeans, to find groups of terms that fit together. You can think of this as a sort of topic model, although unlike more sophisticated topic modeling algorithms like Latent Dirichlet Allocation, each word must be tied to a single topic. 116 | 117 | ```{r} 118 | set.seed(10) 119 | centers = 150 120 | clustering = kmeans(model,centers=centers,iter.max = 40) 121 | ``` 122 | 123 | Here are ten random "topics" produced through this method. Each column contains the ten most frequent words in one randomly chosen cluster. 124 | 125 | ```{r} 126 | sapply(sample(1:centers,10),function(n) { 127 | names(clustering$cluster[clustering$cluster==n][1:10]) 128 | }) 129 | ``` 130 | 131 | These can be useful for figuring out, at a glance, what some of the overall common clusters in your corpus are. 132 | 133 | Clusters need not be derived at the level of the full model. We can take, for instance, 134 | the 20 words closest to each of four different kinds of words. 135 | 136 | ```{r} 137 | ingredients = c("madeira","beef","saucepan","carrots") 138 | term_set = lapply(ingredients, 139 | function(ingredient) { 140 | nearest_words = model %>% closest_to(model[[ingredient]],20) 141 | nearest_words$word 142 | }) %>% unlist 143 | 144 | subset = model[[term_set,average=F]] 145 | 146 | subset %>% 147 | cosineDist(subset) %>% 148 | as.dist %>% 149 | hclust %>% 150 | plot 151 | 152 | ``` 153 | 154 | 155 | # Visualization 156 | 157 | ## Relationship planes. 158 | 159 | One of the basic strategies you can take is to try to project the high-dimensional space here into a plane you can look at. 160 | 161 | For instance, we can take the words "sweet" and "salty," find the twenty words most similar to either of them, and plot those in a sweet-salty plane. 162 | 163 | ```{r} 164 | tastes = model[[c("sweet","salty"),average=F]] 165 | 166 | # model[1:3000,] here restricts to the 3000 most common words in the set. 167 | sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes) 168 | 169 | # Filter to the top 20 sweet or salty. 170 | sweet_and_saltiness = sweet_and_saltiness[ 171 | rank(-sweet_and_saltiness[,1])<20 | 172 | rank(-sweet_and_saltiness[,2])<20, 173 | ] 174 | 175 | plot(sweet_and_saltiness,type='n') 176 | text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness)) 177 | 178 | ``` 179 | 180 | 181 | There's no limit to how complicated this can get. For instance, there are really *five* tastes: sweet, salty, bitter, sour, and savory. (Savory is usually called 'umami' nowadays, but that word will not appear in historic cookbooks.) 182 | 183 | Rather than use a base matrix of the whole set, we can shrink down to just five dimensions: how similar every word in our set is to each of these five. (I'm using cosine similarity here, so the closer a number is to one, the more similar it is.) 184 | 185 | ```{r} 186 | 187 | tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]] 188 | 189 | # model[1:3000,] here restricts to the 3000 most common words in the set.
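# A note for clarity (added to this walkthrough): the cosineSimilarity() call
# below returns a matrix with one row for each of the 3000 common words and one
# column for each of the five taste words, so every entry is a single
# word-to-taste similarity. dim(common_similarities_tastes) should therefore
# come out to 3000 x 5.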
190 | 191 | common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes) 192 | 193 | common_similarities_tastes[20:30,] 194 | ``` 195 | 196 | Now we can filter down to the 75 words that are closest to *any* of these tastes (that's what the apply-max filter below does), and 197 | use a PCA biplot to look at those words in a flavor plane. 198 | 199 | ```{r} 200 | high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) <= 75,] 201 | 202 | high_similarities_to_tastes %>% 203 | prcomp %>% 204 | biplot(main="Seventy-five words in a\nprojection of flavor space") 205 | ``` 206 | 207 | This tells us a few things. One is that (in some runs of the model, at least--there is some random chance built in here) "sweet" and "sour" are closely aligned. Is this a unique feature of American cooking? A relationship that changes over time? These questions would require more investigation. 208 | 209 | Second is that "savory" really is an operative category in these cookbooks, even without the precision of 'umami' as a word to express it. Anchovy, the flavor most closely associated with savoriness, shows up as fairly characteristic of the flavor, along with a variety of herbs. 210 | 211 | Finally, words characteristic of whole meals seem to show up in the upper region of the plot. 212 | 213 | # Catchall reduction: TSNE 214 | 215 | Last but not least, there is a catchall method built into the package 216 | to visualize a single, reasonably good plane for viewing the whole model: TSNE dimensionality reduction. 217 | 218 | Just calling "plot" will display the equivalent of a word cloud with individual tokens grouped relatively close to each other based on their proximity in the higher-dimensional space. 219 | 220 | "Perplexity" is, roughly, the number of neighbors the algorithm tries to preserve for each word. By default it's 50; smaller numbers may cause clusters to appear more dramatically at the cost of overall coherence. 221 | 222 | ```{r} 223 | plot(model,perplexity=50) 224 | ``` 225 | 226 | A few notes on this method: 227 | 228 | 1. If you don't get local clusters, it is not working. You might need to reduce the perplexity so that clusters are smaller; or you might not have good local similarities. 229 | 2. If you're plotting only a small set of words, you're better off trying to plot a `VectorSpaceModel` with `method="pca"`, which locates the points using principal components analysis.
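To make that last note concrete, here is one way it might look. This example is an addition to the walkthrough rather than part of the original text; the word list is arbitrary (just the five taste words plus "anchovy"), and it assumes the `model` object trained above is still in memory.

```{r}
# Sketch of note 2: a PCA plot of a small, hand-picked subset of words.
small_set = model[[c("sweet","salty","savory","bitter","sour","anchovy"), average=F]]
plot(small_set, method="pca")
```
230 | --------------------------------------------------------------------------------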