├── .Rbuildignore ├── .gitignore ├── .travis.yml ├── CONDUCT.md ├── DESCRIPTION ├── LICENSE.txt ├── NAMESPACE ├── NEWS.md ├── R ├── data.R ├── matrixFunctions.R ├── utils.R └── word2vec.R ├── README.md ├── data └── demo_vectors.rda ├── inst ├── doc │ ├── exploration.R │ ├── exploration.Rmd │ ├── exploration.html │ ├── introduction.R │ ├── introduction.Rmd │ └── introduction.html └── paper.md ├── man ├── VectorSpaceModel-VectorSpaceModel-method.Rd ├── VectorSpaceModel-class.Rd ├── as.VectorSpaceModel.Rd ├── closest_to.Rd ├── cosineDist.Rd ├── cosineSimilarity.Rd ├── demo_vectors.Rd ├── distend.Rd ├── filter_to_rownames.Rd ├── improve_vectorspace.Rd ├── magnitudes.Rd ├── nearest_to.Rd ├── normalize_lengths.Rd ├── plot-VectorSpaceModel-method.Rd ├── prep_word2vec.Rd ├── project.Rd ├── read.binary.vectors.Rd ├── read.vectors.Rd ├── reexports.Rd ├── reject.Rd ├── square_magnitudes.Rd ├── sub-VectorSpaceModel-method.Rd ├── sub-sub-VectorSpaceModel-method.Rd ├── train_word2vec.Rd ├── word2phrase.Rd └── write.binary.word2vec.Rd ├── src ├── Makevars.win ├── tmcn_word2vec.c ├── word2phrase.c └── word2vec.h ├── tests ├── run-all.R └── testthat │ ├── test-linear-algebra-functions.R │ ├── test-name-collapsing.r │ ├── test-read-write.R │ ├── test-rejection.R │ ├── test-train.R │ └── test-types.R └── vignettes ├── exploration.Rmd └── introduction.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^CONDUCT\.md$ 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | inst/extdata/ 5 | .DS_Store 6 | hum2vec.Rproj 7 | src/*.o 8 | src/*.so 9 | cookbooks 10 | cookbooks.txt 11 | cookbooks.vectors 12 | cookbooks.zip 13 | cookbooks* 14 | etc 15 | cookbook_vectors.bin 16 | tests/testthat/binary.bin 17 | tests/testthat/input.txt 18 | tests/testthat/tmp.txt 19 | tests/testthat/binary.bin 20 | tests/testthat/tmp.bin 21 | vignettes/*.R 22 | vignettes/*.html 23 | vignettes/*_files 24 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: r 2 | cache: packages 3 | warnings_are_errors: true 4 | r_build_args: --no-build-vignettes --no-manual --no-resave-data 5 | r_check_args: --no-build-vignettes --no-manual 6 | r: 7 | - release 8 | - devel 9 | 10 | -------------------------------------------------------------------------------- /CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Code of Conduct 2 | 3 | As contributors and maintainers of this project, we pledge to respect all people who 4 | contribute through reporting issues, posting feature requests, updating documentation, 5 | submitting pull requests or patches, and other activities. 6 | 7 | We are committed to making participation in this project a harassment-free experience for 8 | everyone, regardless of level of experience, gender, gender identity and expression, 9 | sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. 10 | 11 | Examples of unacceptable behavior by participants include the use of sexual language or 12 | imagery, derogatory comments or personal attacks, trolling, public or private harassment, 13 | insults, or other unprofessional conduct. 
14 | 15 | Project maintainers have the right and responsibility to remove, edit, or reject comments, 16 | commits, code, wiki edits, issues, and other contributions that are not aligned to this 17 | Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed 18 | from the project team. 19 | 20 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by 21 | opening an issue or contacting one or more of the project maintainers. 22 | 23 | This Code of Conduct is adapted from the Contributor Covenant 24 | (http:contributor-covenant.org), version 1.0.0, available at 25 | http://contributor-covenant.org/version/1/0/0/ 26 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: wordVectors 2 | Type: Package 3 | Title: Tools for creating and analyzing vector-space models of texts 4 | Version: 2.0 5 | Author: Ben Schmidt, Jian Li 6 | Maintainer: Ben Schmidt 7 | Description: 8 | wordVectors wraps Google's implementation in C for training word2vec models, 9 | and provides several R functions for exploratory data analysis of word2vec 10 | and other related models. These include import-export from the binary format, 11 | some useful linear algebra operations missing from R, and a streamlined 12 | syntax for working with models and performing vector arithmetic that make it 13 | easier to perform useful operations in a word-vector space. 14 | License: Apache License (== 2.0) 15 | URL: http://github.com/bmschmidt/wordVectors 16 | BugReports: https://github.com/bmschmidt/wordVectors/issues 17 | Depends: 18 | R (>= 2.14.0) 19 | LazyData: TRUE 20 | Imports: 21 | magrittr, 22 | graphics, 23 | methods, 24 | utils, 25 | stats, 26 | readr, 27 | stringr, 28 | stringi 29 | Suggests: 30 | tsne, 31 | testthat, 32 | ggplot2, 33 | knitr, 34 | dplyr, 35 | rmarkdown, 36 | devtools 37 | RoxygenNote: 6.0.1 38 | VignetteBuilder: knitr 39 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | This Apache License is included in this package alongside the original 2 | Google word2vec code. Both the word2vec code and Ben Schmidt's R functions 3 | are released under the Apache license. 4 | 5 | 6 | Apache License 7 | Version 2.0, January 2004 8 | http://www.apache.org/licenses/ 9 | 10 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 11 | 12 | 1. Definitions. 13 | 14 | "License" shall mean the terms and conditions for use, reproduction, 15 | and distribution as defined by Sections 1 through 9 of this document. 16 | 17 | "Licensor" shall mean the copyright owner or entity authorized by 18 | the copyright owner that is granting the License. 19 | 20 | "Legal Entity" shall mean the union of the acting entity and all 21 | other entities that control, are controlled by, or are under common 22 | control with that entity. For the purposes of this definition, 23 | "control" means (i) the power, direct or indirect, to cause the 24 | direction or management of such entity, whether by contract or 25 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 26 | outstanding shares, or (iii) beneficial ownership of such entity. 27 | 28 | "You" (or "Your") shall mean an individual or Legal Entity 29 | exercising permissions granted by this License. 
30 | 31 | "Source" form shall mean the preferred form for making modifications, 32 | including but not limited to software source code, documentation 33 | source, and configuration files. 34 | 35 | "Object" form shall mean any form resulting from mechanical 36 | transformation or translation of a Source form, including but 37 | not limited to compiled object code, generated documentation, 38 | and conversions to other media types. 39 | 40 | "Work" shall mean the work of authorship, whether in Source or 41 | Object form, made available under the License, as indicated by a 42 | copyright notice that is included in or attached to the work 43 | (an example is provided in the Appendix below). 44 | 45 | "Derivative Works" shall mean any work, whether in Source or Object 46 | form, that is based on (or derived from) the Work and for which the 47 | editorial revisions, annotations, elaborations, or other modifications 48 | represent, as a whole, an original work of authorship. For the purposes 49 | of this License, Derivative Works shall not include works that remain 50 | separable from, or merely link (or bind by name) to the interfaces of, 51 | the Work and Derivative Works thereof. 52 | 53 | "Contribution" shall mean any work of authorship, including 54 | the original version of the Work and any modifications or additions 55 | to that Work or Derivative Works thereof, that is intentionally 56 | submitted to Licensor for inclusion in the Work by the copyright owner 57 | or by an individual or Legal Entity authorized to submit on behalf of 58 | the copyright owner. For the purposes of this definition, "submitted" 59 | means any form of electronic, verbal, or written communication sent 60 | to the Licensor or its representatives, including but not limited to 61 | communication on electronic mailing lists, source code control systems, 62 | and issue tracking systems that are managed by, or on behalf of, the 63 | Licensor for the purpose of discussing and improving the Work, but 64 | excluding communication that is conspicuously marked or otherwise 65 | designated in writing by the copyright owner as "Not a Contribution." 66 | 67 | "Contributor" shall mean Licensor and any individual or Legal Entity 68 | on behalf of whom a Contribution has been received by Licensor and 69 | subsequently incorporated within the Work. 70 | 71 | 2. Grant of Copyright License. Subject to the terms and conditions of 72 | this License, each Contributor hereby grants to You a perpetual, 73 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 74 | copyright license to reproduce, prepare Derivative Works of, 75 | publicly display, publicly perform, sublicense, and distribute the 76 | Work and such Derivative Works in Source or Object form. 77 | 78 | 3. Grant of Patent License. Subject to the terms and conditions of 79 | this License, each Contributor hereby grants to You a perpetual, 80 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 81 | (except as stated in this section) patent license to make, have made, 82 | use, offer to sell, sell, import, and otherwise transfer the Work, 83 | where such license applies only to those patent claims licensable 84 | by such Contributor that are necessarily infringed by their 85 | Contribution(s) alone or by combination of their Contribution(s) 86 | with the Work to which such Contribution(s) was submitted. 
If You 87 | institute patent litigation against any entity (including a 88 | cross-claim or counterclaim in a lawsuit) alleging that the Work 89 | or a Contribution incorporated within the Work constitutes direct 90 | or contributory patent infringement, then any patent licenses 91 | granted to You under this License for that Work shall terminate 92 | as of the date such litigation is filed. 93 | 94 | 4. Redistribution. You may reproduce and distribute copies of the 95 | Work or Derivative Works thereof in any medium, with or without 96 | modifications, and in Source or Object form, provided that You 97 | meet the following conditions: 98 | 99 | (a) You must give any other recipients of the Work or 100 | Derivative Works a copy of this License; and 101 | 102 | (b) You must cause any modified files to carry prominent notices 103 | stating that You changed the files; and 104 | 105 | (c) You must retain, in the Source form of any Derivative Works 106 | that You distribute, all copyright, patent, trademark, and 107 | attribution notices from the Source form of the Work, 108 | excluding those notices that do not pertain to any part of 109 | the Derivative Works; and 110 | 111 | (d) If the Work includes a "NOTICE" text file as part of its 112 | distribution, then any Derivative Works that You distribute must 113 | include a readable copy of the attribution notices contained 114 | within such NOTICE file, excluding those notices that do not 115 | pertain to any part of the Derivative Works, in at least one 116 | of the following places: within a NOTICE text file distributed 117 | as part of the Derivative Works; within the Source form or 118 | documentation, if provided along with the Derivative Works; or, 119 | within a display generated by the Derivative Works, if and 120 | wherever such third-party notices normally appear. The contents 121 | of the NOTICE file are for informational purposes only and 122 | do not modify the License. You may add Your own attribution 123 | notices within Derivative Works that You distribute, alongside 124 | or as an addendum to the NOTICE text from the Work, provided 125 | that such additional attribution notices cannot be construed 126 | as modifying the License. 127 | 128 | You may add Your own copyright statement to Your modifications and 129 | may provide additional or different license terms and conditions 130 | for use, reproduction, or distribution of Your modifications, or 131 | for any such Derivative Works as a whole, provided Your use, 132 | reproduction, and distribution of the Work otherwise complies with 133 | the conditions stated in this License. 134 | 135 | 5. Submission of Contributions. Unless You explicitly state otherwise, 136 | any Contribution intentionally submitted for inclusion in the Work 137 | by You to the Licensor shall be under the terms and conditions of 138 | this License, without any additional terms or conditions. 139 | Notwithstanding the above, nothing herein shall supersede or modify 140 | the terms of any separate license agreement you may have executed 141 | with Licensor regarding such Contributions. 142 | 143 | 6. Trademarks. This License does not grant permission to use the trade 144 | names, trademarks, service marks, or product names of the Licensor, 145 | except as required for reasonable and customary use in describing the 146 | origin of the Work and reproducing the content of the NOTICE file. 147 | 148 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 149 | agreed to in writing, Licensor provides the Work (and each 150 | Contributor provides its Contributions) on an "AS IS" BASIS, 151 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 152 | implied, including, without limitation, any warranties or conditions 153 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 154 | PARTICULAR PURPOSE. You are solely responsible for determining the 155 | appropriateness of using or redistributing the Work and assume any 156 | risks associated with Your exercise of permissions under this License. 157 | 158 | 8. Limitation of Liability. In no event and under no legal theory, 159 | whether in tort (including negligence), contract, or otherwise, 160 | unless required by applicable law (such as deliberate and grossly 161 | negligent acts) or agreed to in writing, shall any Contributor be 162 | liable to You for damages, including any direct, indirect, special, 163 | incidental, or consequential damages of any character arising as a 164 | result of this License or out of the use or inability to use the 165 | Work (including but not limited to damages for loss of goodwill, 166 | work stoppage, computer failure or malfunction, or any and all 167 | other commercial damages or losses), even if such Contributor 168 | has been advised of the possibility of such damages. 169 | 170 | 9. Accepting Warranty or Additional Liability. While redistributing 171 | the Work or Derivative Works thereof, You may choose to offer, 172 | and charge a fee for, acceptance of support, warranty, indemnity, 173 | or other liability obligations and/or rights consistent with this 174 | License. However, in accepting such obligations, You may act only 175 | on Your own behalf and on Your sole responsibility, not on behalf 176 | of any other Contributor, and only if You agree to indemnify, 177 | defend, and hold each Contributor harmless for any liability 178 | incurred by, or claims asserted against, such Contributor by reason 179 | of your accepting any such warranty or additional liability. 180 | 181 | END OF TERMS AND CONDITIONS 182 | 183 | APPENDIX: How to apply the Apache License to your work. 184 | 185 | To apply the Apache License to your work, attach the following 186 | boilerplate notice, with the fields enclosed by brackets "[]" 187 | replaced with your own identifying information. (Don't include 188 | the brackets!) The text should be enclosed in the appropriate 189 | comment syntax for the file format. We also recommend that a 190 | file or class name and description of purpose be included on the 191 | same "printed page" as the copyright notice for easier 192 | identification within third-party archives. 193 | 194 | Copyright [yyyy] [name of copyright owner] 195 | 196 | Licensed under the Apache License, Version 2.0 (the "License"); 197 | you may not use this file except in compliance with the License. 198 | You may obtain a copy of the License at 199 | 200 | http://www.apache.org/licenses/LICENSE-2.0 201 | 202 | Unless required by applicable law or agreed to in writing, software 203 | distributed under the License is distributed on an "AS IS" BASIS, 204 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 205 | See the License for the specific language governing permissions and 206 | limitations under the License. 
207 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export("%>%") 4 | export(as.VectorSpaceModel) 5 | export(closest_to) 6 | export(cosineDist) 7 | export(cosineSimilarity) 8 | export(distend) 9 | export(filter_to_rownames) 10 | export(improve_vectorspace) 11 | export(magnitudes) 12 | export(nearest_to) 13 | export(normalize_lengths) 14 | export(prep_word2vec) 15 | export(project) 16 | export(read.binary.vectors) 17 | export(read.vectors) 18 | export(reject) 19 | export(train_word2vec) 20 | export(word2phrase) 21 | export(write.binary.word2vec) 22 | exportClasses(VectorSpaceModel) 23 | exportMethods(plot) 24 | importFrom(magrittr,"%>%") 25 | useDynLib(wordVectors) 26 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # VERSION 2.0 2 | 3 | Upgrade focusing on ease of use and CRAN-ability. Bumping major version because of a breaking change in the behavior of `closest_to`, which now returns a data.frame. 4 | 5 | # Changes 6 | 7 | ## New default function: closest_to. 8 | 9 | `nearest_to` was previously the easiest way to interact with cosine similarity functions. That's been deprecated 10 | in favor of a new function, `closest_to`. (I would have changed the name but for back-compatibility reasons.) 11 | The data.frame columns have elaborate names so they can easily be manipulated with dplyr, and/or plotted with ggplot. 12 | `nearest_to` is now just a wrapped version of the new function. 13 | 14 | ## New syntax for vector addition. 15 | 16 | This package now allows formula scoping for the most common operations, and string inputs to access in the context of a particular matrix. This makes this much nicer for handling the bread and butter word2vec operations. 17 | 18 | For instance, instead of writing 19 | ```R 20 | vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",]) 21 | ``` 22 | 23 | (whew!), you can write 24 | 25 | ```R 26 | vectors %>% closest_to(~"king" - "man" + "woman") 27 | ``` 28 | 29 | 30 | ## Reading tweaks. 31 | 32 | In keeping with the goal of allowing manipulation of models in low-memory environments, it's now possible to read only rows with words matching certain criteria by passing an argument to read.binary.vectors(); either `rowname_list` for a fixed list, or `rowname_regexp` for a regular expression. (You could, say, read only the gerunds from a file by entering `rowname_regexp = "*.ing"`). 33 | 34 | ## Test Suite 35 | 36 | The package now includes a test suite. 37 | 38 | ## Other materials for rOpenScience and JOSS. 39 | 40 | This package has enough users it might be nice to get it on CRAN. I'm trying doing so through rOpenSci. That requires a lot of small files scattered throughout. 41 | 42 | 43 | # VERSION 1.3 44 | 45 | Two significant performance improvements. 46 | 1. Row magnitudes for a `VectorSpaceModel` object are now **cached** in an environment that allows some pass-by-reference editing. This means that the most time-consuming part of any comparison query is only done once for any given vector set; subsequent queries are at least an order of magnitude (10-20x)? faster. 47 | 48 | Although this is a big performance improvement, certain edge cases might not wipe the cache clear. 
**In particular, assignment inside a VSM object might cause incorrect calculations.** I can't see why anyone would be in the habit of manually tweaking a row or block (rather than a whole matrix). 49 | 1. Access to rows in a `VectorSpaceModel` object is now handled through callNextMethod() rather than accessing the element's .Data slot. For reasons opaque to me, hitting the .Data slot seems to internally require copying the whole huge matrix internally. Now that no longer happens. 50 | 51 | 52 | # VERSION 1.2 53 | 54 | This release implements a number of incremental improvements and clears up some errors. 55 | - The package is now able to read and write in the binary word2vec format; since this is faster and takes much less hard drive space (down by about 66%) than writing out floats as text, it does so internally. 56 | - Several improvements to the C codebase to avoid warnings by @mukul13, described [here](https://github.com/bmschmidt/wordVectors/pull/9). (Does this address the `long long` issue?) 57 | - Subsetting with `[[` now takes an argument `average`; if false, rather than collapse a matrix down to a single row, it just extracts the elements that correspond to the words. 58 | - Added sample data in the object `demo_vectors`: the 999 words from the most common vectors. 59 | - Began adding examples to the codebase. 60 | - Tracking build success using Travis. 61 | - Dodging most warnings from R CMD check. 62 | 63 | Bug fixes 64 | - If the `dir.exists` function is undefined, it creates one for you. This should allow installation on R 3.1 and some lower versions. 65 | - `reject` and `project` are better about returning VSM objects, rather than dropping back into a matrix. 66 | 67 | # VERSION 1.1 68 | 69 | A few changes, primarily to the functions for _training_ vector spaces to produce higher quality models. A number of these changes are merged back in from the fork of this repo by github user @sunecasp . Thanks! 70 | 71 | ## Some bug fixes 72 | 73 | Filenames can now be up to 1024 characters. Some parameters on alpha decay may be fixed; I'm not entirely sure what sunecasp's changes do. 74 | 75 | ## Changes to default number of iterations. 76 | 77 | Models now default to 5 iterations through the text rather than 1. That means training may take 5 times as long; but particularly for small corpora, the vectors should be of higher quality. See below for an example. 78 | 79 | ## More training arguments 80 | 81 | You can now specify more flags to the word2vec code. `?train_word2vec` gives a full list, but particularly useful are: 82 | 1. `window` now accurately sets the window size. 83 | 2. `iter` sets the number of iterations. For very large corpora, `iter=1` will train most quickly; for very small corpora, `iter=15` will give substantially better vectors. (See below). You should set this as high as you can stand within reason (Setting `iter` to a number higher than `window` is probably not that useful). But more text is better than more iterations. 84 | 3. `min_count` gives a cutoff for vocabulary size. Tokens occurring fewer than `min_count` times will be dropped from the model. Setting this high can be useful. (But note that a trained model is sorted in order of frequency, so if you have the RAM to train a big model you can reduce it in size for analysis by just subsetting to the first 10,000 or whatever rows). 85 | 86 | ## Example of vectors 87 | 88 | Here's an example of training on a small set (c. 1000 speeches on the floor of the house of commons from the early 19th century). 
89 | 90 | > proc.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)}) 91 | > Error in train_word2vec("~/tmp2.txt", "~/1_iter.vectors", iter = 1) : 92 | > The output file '~/1_iter.vectors' already exists: delete or give a new destination. 93 | > proc.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)}) 94 | > Starting training using file /Users/bschmidt/tmp2.txt 95 | > Vocab size: 4469 96 | > Words in train file: 407583 97 | > Alpha: 0.000711 Progress: 99.86% Words/thread/sec: 67.51k 98 | > Error in proc.time({ : 1 argument passed to 'proc.time' which requires 0 99 | > ?proc.time 100 | > system.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)}) 101 | > Starting training using file /Users/bschmidt/tmp2.txt 102 | > Vocab size: 4469 103 | > Words in train file: 407583 104 | > Alpha: 0.000711 Progress: 99.86% Words/thread/sec: 66.93k user system elapsed 105 | > 6.753 0.055 6.796 106 | > system.time({two = train_word2vec("~/tmp2.txt","~/2_iter.vectors",iter = 3)}) 107 | > Starting training using file /Users/bschmidt/tmp2.txt 108 | > Vocab size: 4469 109 | > Words in train file: 407583 110 | > Alpha: 0.000237 Progress: 99.95% Words/thread/sec: 67.15k user system elapsed 111 | > 18.846 0.085 18.896 112 | > 113 | > two %>% nearest_to(two["debt"]) %>% round(3) 114 | > debt remainder Jan including drawback manufactures prisoners mercantile subsisting 115 | > 0.000 0.234 0.256 0.281 0.291 0.293 0.297 0.314 0.314 116 | > Dec 117 | > 0.318 118 | > one %>% nearest_to(one[["debt"]]) %>% round(3) 119 | > debt Christmas exception preventing Indies import remainder eye eighteen labouring 120 | > 0.000 0.150 0.210 0.214 0.215 0.220 0.221 0.223 0.225 0.227 121 | > 122 | > system.time({ten = train_word2vec("~/tmp2.txt","~/10_iter.vectors",iter = 10)}) 123 | > Starting training using file /Users/bschmidt/tmp2.txt 124 | > Vocab size: 4469 125 | > Words in train file: 407583 126 | > Alpha: 0.000071 Progress: 99.98% Words/thread/sec: 66.13k user system elapsed 127 | > 62.070 0.296 62.333 128 | > 129 | > ten %>% nearest_to(ten[["debt"]]) %>% round(3) 130 | > debt surplus Dec remainder manufacturing grants Jan drawback prisoners 131 | > 0.000 0.497 0.504 0.510 0.519 0.520 0.533 0.536 0.546 132 | > compelling 133 | > 0.553 134 | 135 | ``` 136 | ``` 137 | 138 | -------------------------------------------------------------------------------- /R/data.R: -------------------------------------------------------------------------------- 1 | #' 999 vectors trained on teaching evaluations 2 | #' 3 | #' A sample VectorSpaceModel object trained on about 15 million 4 | #' teaching evaluations, limited to the 999 most common words. 5 | #' Included for demonstration purposes only: there's only so much you can 6 | #' do with a 999 length vocabulary. 7 | #' 8 | #' You're best off downloading a real model to work with, 9 | #' such as the precompiled vectors distributed by Google 10 | #' at https://code.google.com/archive/p/word2vec/ 11 | #' 12 | #' @format A VectorSpaceModel object of 999 words and 500 vectors 13 | #' 14 | #' @source Trained by package author. 15 | "demo_vectors" 16 | -------------------------------------------------------------------------------- /R/matrixFunctions.R: -------------------------------------------------------------------------------- 1 | #' Improve a vectorspace by removing common elements. 2 | #' 3 | #' 4 | #' @param vectorspace A VectorSpacemodel to be improved. 5 | #' @param D The number of principal components to eliminate. 
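#'   Defaults to \code{round(ncol(vectorspace)/100)}, i.e. roughly one component per hundred columns.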
6 | #' 7 | #' @description See reference for a full description. Supposedly, these operations will improve performance on analogy tasks. 8 | #' 9 | #' @references Jiaqi Mu, Suma Bhat, Pramod Viswanath. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. https://arxiv.org/abs/1702.01417. 10 | #' @return A VectorSpaceModel object, transformed from the original. 11 | #' @export 12 | #' 13 | #' @examples 14 | #' 15 | #' closest_to(demo_vectors,"great") 16 | #' # stopwords like "and" and "very" are no longer top ten. 17 | #' # I don't know if this is really better, though. 18 | #' 19 | #' closest_to(improve_vectorspace(demo_vectors),"great") 20 | #' 21 | improve_vectorspace = function(vectorspace,D=round(ncol(vectorspace)/100)) { 22 | mean = methods::new("VectorSpaceModel", 23 | matrix(apply(vectorspace,2,mean), 24 | ncol=ncol(vectorspace)) 25 | ) 26 | vectorspace = vectorspace-mean 27 | pca = stats::prcomp(vectorspace) 28 | 29 | # I don't totally understand the recommended operation in the source paper, but this seems to do much 30 | # the same thing using the internal functions of the package to reject the top i dimensions one at a time. 31 | drop_top_i = function(vspace,i) { 32 | if (i<=0) {vspace} else if (i==1) { 33 | reject(vspace,pca$rotation[,i]) 34 | } else { 35 | drop_top_i(reject(vspace,pca$rotation[,i]), i-1) 36 | } 37 | } 38 | better = drop_top_i(vectorspace,D) 39 | } 40 | 41 | 42 | #' Internal function to subsitute strings for a tree. Allows arithmetic on words. 43 | #' 44 | #' @noRd 45 | #' 46 | #' @param tree an expression from a formula 47 | #' @param context the VSM context in which to parse it. 48 | #' 49 | #' @return a tree 50 | sub_out_tree = function(tree, context) { 51 | # This is a whitelist of operators that I think are basic for vector math. 52 | # It's possible it could be expanded. 53 | 54 | # This might fail if you try to pass a reference to a basic 55 | # arithmetic operator, or something crazy like that. 56 | 57 | if (deparse(tree[[1]]) %in% c("+","*","-","/","^","log","sqrt","(")) { 58 | for (i in 2:length(tree)) { 59 | tree[[i]] <- sub_out_tree(tree[[i]],context) 60 | } 61 | } 62 | if (is.character(tree)) { 63 | return(context[[tree]]) 64 | } 65 | return(tree) 66 | } 67 | 68 | #' Internal function to wrap for sub_out_tree. Allows arithmetic on words. 69 | #' 70 | #' @noRd 71 | #' 72 | #' @param formula A formula; string arithmetic on the LHS, no RHS. 73 | #' @param context the VSM context in which to parse it. 74 | #' 75 | #' @return an evaluated formula. 76 | 77 | sub_out_formula = function(formula,context) { 78 | # Despite the name, this will work on something that 79 | # isn't a formula. That's by design: we want to allow 80 | # basic reference passing, and also to allow simple access 81 | # to words. 82 | 83 | if (class(context) != "VectorSpaceModel") {return(formula)} 84 | if (class(formula)=="formula") { 85 | formula[[2]] <- sub_out_tree(formula[[2]],context) 86 | return(eval(formula[[2]])) 87 | } 88 | if (is.character(formula)) {return(context[[formula]])} 89 | return(formula) 90 | } 91 | 92 | #' Vector Space Model class 93 | #' 94 | #' @description A class for describing and accessing Vector Space Models like Word2Vec. 95 | #' The base object is simply a matrix with columns describing dimensions and unique rownames 96 | #' as the names of vectors. This package gives a number of convenience functions for printing 97 | #' and, most importantly, accessing these objects. 
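#' For example, \code{model[["king"]]} returns the single row whose rowname is "king",
#' and \code{closest_to(model, "king")} ranks the other rows by cosine similarity to it.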
98 | #' @slot magnitudes The cached sum-of-squares for each row in the matrix. Can be cached to 99 | #' speed up similarity calculations 100 | #' @return An object of class "VectorSpaceModel" 101 | #' @exportClass VectorSpaceModel 102 | setClass("VectorSpaceModel",slots = c(".cache"="environment"),contains="matrix") 103 | #setClass("NormalizedVectorSpaceModel",contains="VectorSpaceModel") 104 | 105 | # This is Steve Lianoglu's method for associating a cache with an object 106 | # http://r.789695.n4.nabble.com/Change-value-of-a-slot-of-an-S4-object-within-a-method-td2338484.html 107 | setMethod("initialize", "VectorSpaceModel", 108 | function(.Object, ..., .cache=new.env()) { 109 | methods::callNextMethod(.Object, .cache=.cache, ...) 110 | }) 111 | 112 | #' Square Magnitudes with caching 113 | #' 114 | #' @param VectorSpaceModel A matrix or VectorSpaceModel object 115 | #' @description square_magnitudes Returns the square magnitudes and 116 | #' caches them if necessary 117 | #' @return A vector of the square magnitudes for each row 118 | #' @keywords internal 119 | square_magnitudes = function(object) { 120 | if (class(object)[1]=="VectorSpaceModel") { 121 | if (methods::.hasSlot(object, ".cache")) { 122 | if (is.null(object@.cache$magnitudes)) { 123 | object@.cache$magnitudes = rowSums(object^2) 124 | } 125 | return(object@.cache$magnitudes) 126 | } else { 127 | message("You seem to be using a VectorSpaceModel saved from an earlier version of this package.") 128 | message("To turn on caching on your model, which greatly speeds up queries, type") 129 | message("yourobjectname@.cache = new.env()") 130 | message("(replacing 'yourobjectname' with your actual model name)") 131 | return(rowSums(object^2)) 132 | } 133 | } else { 134 | return(rowSums(object^2)) 135 | } 136 | } 137 | 138 | #' VectorSpaceModel indexing 139 | #' 140 | #' @description Reduce a VectorSpaceModel to a smaller one 141 | #' @param x The vectorspace model to subset 142 | #' @param i The row numbers to extract 143 | #' @param j The column numbers to extract 144 | #' @param ... Other arguments passed to extract (unlikely to be useful). 145 | #' 146 | #' @param drop Whether to drop columns. This parameter is ignored. 147 | #' @return A VectorSpaceModel 148 | #' 149 | setMethod("[","VectorSpaceModel",function(x,i,j,...,drop) { 150 | nextup = methods::callNextMethod() 151 | if (!is.matrix(nextup)) { 152 | # A verbose way of effectively changing drop from TRUE to FALSE; 153 | # I don't want one-dimensional matrices turned to vectors. 154 | # I can't figure out how to do this more simply 155 | if (missing(j)) { 156 | nextup = matrix(nextup,ncol=ncol(x)) 157 | } else { 158 | nextup = matrix(nextup,ncol=j) 159 | } 160 | } 161 | methods::new("VectorSpaceModel",nextup) 162 | }) 163 | 164 | #' VectorSpaceModel subtraction 165 | #' 166 | #' @description Keep the VSM class when doing subtraction operations; 167 | #' make it possible to subtract a single row from an entire model. 168 | #' @param e1 A vector space model 169 | #' @param e2 A vector space model of equal size OR a vector 170 | #' space model of a single row. If the latter (which is more likely) 171 | #' the specified row will be subtracted from each row. 172 | #' 173 | #' 174 | #' @return A VectorSpaceModel of the same dimensions and rownames 175 | #' as e1 176 | #' 177 | #' I believe this is necessary, but honestly am not sure. 
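#' For example, \code{demo_vectors - demo_vectors[["good"]]} subtracts the single
#' "good" row from every row of the bundled demo model, returning a VectorSpaceModel.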
178 | #' 179 | setMethod("-",signature(e1="VectorSpaceModel",e2="VectorSpaceModel"),function(e1,e2) { 180 | if (nrow(e1)==nrow(e2) && ncol(e1)==ncol(e2)) { 181 | return (methods::new("VectorSpaceModel",e1@.Data-e2@.Data)) 182 | } 183 | if (nrow(e2)==1) { 184 | return( 185 | methods::new("VectorSpaceModel",e1 - matrix(rep(e2,each=nrow(e1)),nrow=nrow(e1))) 186 | ) 187 | } 188 | stop("Vector space model subtraction must use models of equal dimensions") 189 | }) 190 | 191 | #' VectorSpaceModel subsetting 192 | #' 193 | # @description Reduce a VectorSpaceModel to a single object. 194 | #' @param x The object being subsetted. 195 | #' @param i A character vector: the words to use as rownames. 196 | #' @param average Whether to collapse down to a single vector, 197 | #' or to return a subset of one row for each asked for. 198 | #' 199 | #' @return A VectorSpaceModel of a single row. 200 | setMethod("[[","VectorSpaceModel",function(x,i,average=TRUE) { 201 | # The wordvec class can extract a row from the matrix 202 | # by accessing the rownames. x[["king"]] gives the row 203 | # for which the rowname is "king"; x[[c("king","queen")]] gives 204 | # the midpoint of x[["king"]] and x[["queen"]], which can occasionally 205 | # be useful. 206 | if(typeof(i)=="character") { 207 | matching_rows = x[rownames(x) %in% i,] 208 | if (average) { 209 | val = matrix( 210 | colMeans(matching_rows) 211 | ,nrow=1 212 | ,dimnames = list( 213 | c(),colnames(x)) 214 | ) 215 | } else { 216 | val=matching_rows 217 | rownames(val) = rownames(x)[rownames(x) %in% i] 218 | } 219 | 220 | return(methods::new("VectorSpaceModel",val)) 221 | } 222 | 223 | else if (typeof(i)=="integer") { 224 | return(x[i,]) 225 | } else { 226 | stop("VectorSpaceModel objects are accessed by vectors of numbers or words") 227 | } 228 | }) 229 | 230 | setMethod("show","VectorSpaceModel",function(object) { 231 | dims = dim(object) 232 | cat("A VectorSpaceModel object of ",dims[1]," words and ", dims[2], " vectors\n") 233 | methods::show(unclass(object[1:min(nrow(object),10),1:min(ncol(object),6),drop=F])) 234 | }) 235 | 236 | #' Plot a Vector Space Model. 237 | #' 238 | #' Visualizing a model as a whole is sort of undefined. I think the 239 | #' sanest thing to do is reduce the full model down to two dimensions 240 | #' using T-SNE, which preserves some of the local clusters. 241 | #' 242 | #' For individual subsections, it can make sense to do a principal components 243 | #' plot of the space of just those letters. This is what happens if method 244 | #' is pca. On the full vocab, it's kind of a mess. 245 | #' 246 | #' This plots only the first 300 words in the model. 247 | #' 248 | #' @param x The model to plot 249 | #' @param method The method to use for plotting. "pca" is principal components, "tsne" is t-sne 250 | #' @param ... Further arguments passed to the plotting method. 251 | #' 252 | #' @return The TSNE model (silently.) 253 | #' @export 254 | setMethod("plot","VectorSpaceModel",function(x,method="tsne",...) { 255 | if (method=="tsne") { 256 | message("Attempting to use T-SNE to plot the vector representation") 257 | message("Cancel if this is taking too long") 258 | message("Or run 'install.packages' tsne if you don't have it.") 259 | x = as.matrix(x) 260 | short = x[1:min(300,nrow(x)),] 261 | m = tsne::tsne(short,...) 
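    # Lay out the 2-d t-SNE coordinates and label each point with its token;
    # label size shrinks for later (less frequent) rows.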
262 | graphics::plot(m,type='n',main="A two dimensional reduction of the vector space model using t-SNE") 263 | graphics::text(m,rownames(short),cex = ((400:1)/200)^(1/3)) 264 | rownames(m)=rownames(short) 265 | silent = m 266 | } else if (method=="pca") { 267 | vectors = stats::predict(stats::prcomp(x))[,1:2] 268 | graphics::plot(vectors,type='n') 269 | graphics::text(vectors,labels=rownames(vectors)) 270 | } 271 | }) 272 | 273 | #' Convert to a Vector Space Model 274 | #' 275 | #' @param matrix A matrix to coerce. 276 | #' 277 | #' @return An object of class "VectorSpaceModel" 278 | #' @export as.VectorSpaceModel 279 | as.VectorSpaceModel = function(matrix) { 280 | return(methods::new("VectorSpaceModel",matrix)) 281 | } 282 | 283 | #' Read VectorSpaceModel 284 | #' 285 | #' Read a VectorSpaceModel from a file exported from word2vec or a similar output format. 286 | #' 287 | #' @param filename The file to read in. 288 | #' @param vectors The number of dimensions word2vec calculated. Imputed automatically if not specified. 289 | #' @param binary Read in the binary word2vec form. (Wraps `read.binary.vectors`) By default, function 290 | #' guesses based on file suffix. 291 | #' @param ... Further arguments passed to read.table or read.binary.vectors. 292 | #' Note that both accept 'nrows' as an argument. Word2vec produces 293 | #' by default frequency sorted output. Therefore 'read.vectors("file.bin", nrows=500)', for example, 294 | #' will return the vectors for the top 500 words. This can be useful on machines with limited 295 | #' memory. 296 | #' @export 297 | #' @return An matrixlike object of class `VectorSpaceModel` 298 | #' 299 | read.vectors <- function(filename,vectors=guess_n_cols(),binary=NULL,...) { 300 | if(rev(strsplit(filename,"\\.")[[1]])[1] =="bin" && is.null(binary)) { 301 | message("Filename ends with .bin, so reading in binary format") 302 | binary=TRUE 303 | } 304 | 305 | if(binary) { 306 | return(read.binary.vectors(filename,...)) 307 | } 308 | 309 | # Figure out how many dimensions. 310 | guess_n_cols = function() { 311 | # if cols is not defined 312 | test = utils::read.table(filename,header=F,skip=1, 313 | nrows=1,quote="",comment.char="") 314 | return(ncol(test)-1) 315 | } 316 | vectors_matrix = utils::read.table(filename,header=F,skip=1, 317 | colClasses = c("character",rep("numeric",vectors)), 318 | quote="",comment.char="",...) 319 | names(vectors_matrix)[1] = "word" 320 | vectors_matrix$word[is.na(vectors_matrix$word)] = "NA" 321 | matrix = as.matrix(vectors_matrix[,colnames(vectors_matrix)!="word"]) 322 | rownames(matrix) = vectors_matrix$word 323 | colnames(matrix) = paste0("V",1:vectors) 324 | return(methods::new("VectorSpaceModel",matrix)) 325 | } 326 | 327 | #' Read binary word2vec format files 328 | #' 329 | #' @param filename A file in the binary word2vec format to import. 330 | #' @param nrows Optionally, a number of rows to stop reading after. 331 | #' Word2vec sorts by frequency, so limiting to the first 1000 rows will 332 | #' give the thousand most-common words; it can be useful not to load 333 | #' the whole matrix into memory. This limit is applied BEFORE `name_list` and 334 | #' `name_regexp`. 335 | #' @param cols The column numbers to read. Default is "All"; 336 | #' if you are in a memory-limited environment, 337 | #' you can limit the number of columns you read in by giving a vector of column integers 338 | #' @param rowname_list A whitelist of words. 
If you wish to read in only a few dozen words, 339 | #' all other rows will be skipped and only these read in. 340 | #' @param rowname_regexp A regular expression specifying a pattern for rows to read in. Row 341 | #' names matching that pattern will be included in the read; all others will be skipped. 342 | #' @return A VectorSpaceModel object 343 | #' @export 344 | 345 | read.binary.vectors = function(filename,nrows=Inf,cols="All", rowname_list = NULL, rowname_regexp = NULL) { 346 | if (!is.null(rowname_list) && !is.null(rowname_regexp)) {stop("Specify a whitelist of names or a regular expression to be applied to all input, not both.")} 347 | a = file(filename,'rb') 348 | rows = "" 349 | mostRecent="" 350 | while(mostRecent!=" ") { 351 | mostRecent = readChar(a,1) 352 | rows = paste0(rows,mostRecent) 353 | } 354 | rows = as.integer(rows) 355 | 356 | col_number = "" 357 | while(mostRecent!="\n") { 358 | mostRecent = readChar(a,1) 359 | col_number = paste0(col_number,mostRecent) 360 | } 361 | col_number = as.integer(col_number) 362 | 363 | if(nrows% closest_to("good") 778 | #' 779 | nearest_to = function(...) { 780 | vals = closest_to(...,fancy_names = F) 781 | returnable = 1 - vals$similarity 782 | names(returnable) = vals$word 783 | returnable 784 | } 785 | -------------------------------------------------------------------------------- /R/utils.R: -------------------------------------------------------------------------------- 1 | #' @importFrom magrittr %>% 2 | #' @export 3 | magrittr::`%>%` 4 | -------------------------------------------------------------------------------- /R/word2vec.R: -------------------------------------------------------------------------------- 1 | ##' Train a model by word2vec. 2 | ##' 3 | ##' The word2vec tool takes a text corpus as input and produces the 4 | ##' word vectors as output. It first constructs a vocabulary from the 5 | ##' training text data and then learns vector representation of words. 6 | ##' The resulting word vector file can be used as features in many 7 | ##' natural language processing and machine learning applications. 8 | ##' 9 | ##' 10 | ##' 11 | ##' @title Train a model by word2vec. 12 | ##' @param train_file Path of a single .txt file for training. Tokens are split on spaces. 13 | ##' @param output_file Path of the output file. 14 | ##' @param vectors The number of vectors to output. Defaults to 100. 15 | ##' More vectors usually means more precision, but also more random error, higher memory usage, and slower operations. 16 | ##' Sensible choices are probably in the range 100-500. 17 | ##' @param threads Number of threads to run training process on. 18 | ##' Defaults to 1; up to the number of (virtual) cores on your machine may speed things up. 19 | ##' @param window The size of the window (in words) to use in training. 20 | ##' @param classes Number of classes for k-means clustering. Not documented/tested. 21 | ##' @param cbow If 1, use a continuous-bag-of-words model instead of skip-grams. 22 | ##' Defaults to false (recommended for newcomers). 23 | ##' @param min_count Minimum times a word must appear to be included in the samples. 24 | ##' High values help reduce model size. 25 | ##' @param iter Number of passes to make over the corpus in training. 26 | ##' @param force Whether to overwrite existing model files. 27 | ##' @param negative_samples Number of negative samples to take in skip-gram training. 0 means full sampling, while lower numbers 28 | ##' give faster training. 
For large corpora 2-5 may work; for smaller corpora, 5-15 is reasonable. 29 | ##' @return A VectorSpaceModel object. 30 | ##' @author Jian Li <\email{rweibo@@sina.com}>, Ben Schmidt <\email{bmchmidt@@gmail.com}> 31 | ##' @references \url{https://code.google.com/p/word2vec/} 32 | ##' @export 33 | ##' 34 | ##' @useDynLib wordVectors 35 | ##' 36 | ##' @examples \dontrun{ 37 | ##' model = train_word2vec(system.file("examples", "rfaq.txt", package = "wordVectors")) 38 | ##' } 39 | train_word2vec <- function(train_file, output_file = "vectors.bin",vectors=100,threads=1,window=12, 40 | classes=0,cbow=0,min_count=5,iter=5,force=F, negative_samples=5) 41 | { 42 | if (!file.exists(train_file)) stop("Can't find the training file!") 43 | if (file.exists(output_file) && !force) stop("The output file '", 44 | output_file , 45 | "' already exists: give a new destination or run with 'force=TRUE'.") 46 | 47 | train_dir <- dirname(train_file) 48 | 49 | # cat HDA15/data/Dickens/* | perl -pe 'print "1\t"' | egrep "[a-z]" | bookworm tokenize token_stream > ~/test.txt 50 | 51 | if(missing(output_file)) { 52 | output_file <- gsub(gsub("^.*\\.", "", basename(train_file)), "bin", basename(train_file)) 53 | output_file <- file.path(train_dir, output_file) 54 | } 55 | 56 | outfile_dir <- dirname(output_file) 57 | if (!file.exists(outfile_dir)) dir.create(outfile_dir, recursive = TRUE) 58 | 59 | train_file <- normalizePath(train_file, winslash = "/", mustWork = FALSE) 60 | output_file <- normalizePath(output_file, winslash = "/", mustWork = FALSE) 61 | # Whether to output binary, default is 1 means binary. 62 | binary = 1 63 | 64 | OUT <- .C("CWrapper_word2vec", 65 | train_file = as.character(train_file), 66 | output_file = as.character(output_file), 67 | binary = as.character(binary), 68 | dims=as.character(vectors), 69 | threads=as.character(threads), 70 | window=as.character(window), 71 | classes=as.character(classes), 72 | cbow=as.character(cbow), 73 | min_count=as.character(min_count), 74 | iter=as.character(iter), 75 | neg_samples=as.character(negative_samples) 76 | ) 77 | 78 | read.vectors(output_file) 79 | } 80 | 81 | #' Prepare documents for word2Vec 82 | #' 83 | #' @description This function exports a directory or document to a single file 84 | #' suitable to Word2Vec run on. That means a single, seekable txt file 85 | #' with tokens separated by spaces. (For example, punctuation is removed 86 | #' rather than attached to the end of words.) 87 | #' This function is extraordinarily inefficient: in most real-world cases, you'll be 88 | #' much better off preparing the documents using python, perl, awk, or any other 89 | #' scripting language that can reasonable read things in line-by-line. 90 | #' 91 | #' @param origin A text file or a directory of text files 92 | #' to be used in training the model 93 | #' @param destination The location for output text. 94 | #' @param lowercase Logical. Should uppercase characters be converted to lower? 95 | #' @param bundle_ngrams Integer. Statistically significant phrases of up to this many words 96 | #' will be joined with underscores: e.g., "United States" will usually be changed to "United_States" 97 | #' if it appears frequently in the corpus. This calls word2phrase once if bundle_ngrams is 2, 98 | #' twice if bundle_ngrams is 3, and so forth; see that function for more details. 99 | #' @param ... Further arguments passed to word2phrase when bundle_ngrams is 100 | #' greater than 1. 101 | #' 102 | #' @export 103 | #' 104 | #' @return The file name (silently). 
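#' @examples
#' \dontrun{
#' # A minimal sketch: "cookbooks/" and "cookbooks.txt" are hypothetical paths
#' # standing in for your own corpus directory and output file.
#' prep_word2vec("cookbooks/", "cookbooks.txt", lowercase = TRUE, bundle_ngrams = 2)
#' }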
105 | prep_word2vec <- function(origin,destination,lowercase=F, 106 | bundle_ngrams=1, ...) 107 | { 108 | # strsplit chokes on large lines. I would not have gone down this path if I knew this 109 | # to begin with. 110 | 111 | 112 | 113 | message("Beginning tokenization to text file at ", destination) 114 | if (!exists("dir.exists")) { 115 | # Use the version from devtools if in R < 3.2.0 116 | dir.exists <- function (x) 117 | { 118 | res <- file.exists(x) & file.info(x)$isdir 119 | stats::setNames(res, x) 120 | } 121 | } 122 | 123 | if (dir.exists(origin)) { 124 | origin = list.files(origin,recursive=T,full.names = T) 125 | } 126 | 127 | if (file.exists(destination)) file.remove(destination) 128 | 129 | tokenize_words = function (x, lowercase = TRUE) { 130 | # This is an abbreviated version of the "tokenizers" package version to remove the dependency. 131 | # Sorry, Lincoln, it was failing some tests. 132 | if (lowercase) x <- stringi::stri_trans_tolower(x) 133 | out <- stringi::stri_split_boundaries(x, type = "word", skip_word_none = TRUE) 134 | unlist(out) 135 | } 136 | 137 | prep_single_file <- function(file_in, file_out, lowercase) { 138 | message("Prepping ", file_in) 139 | 140 | text <- file_in %>% 141 | readr::read_file() %>% 142 | tokenize_words(lowercase) %>% 143 | stringr::str_c(collapse = " ") 144 | 145 | stopifnot(length(text) == 1) 146 | readr::write_lines(text, file_out, append = TRUE) 147 | return(TRUE) 148 | } 149 | 150 | 151 | Map(prep_single_file, origin, lowercase=lowercase, file_out=destination) 152 | 153 | # Save the ultimate output 154 | real_destination_name = destination 155 | 156 | # Repeatedly build bigrams, trigrams, etc. 157 | if (bundle_ngrams > 1) { 158 | while(bundle_ngrams > 1) { 159 | old_destination = destination 160 | destination = paste0(destination,"_") 161 | word2phrase(old_destination,destination,...) 162 | file.remove(old_destination) 163 | bundle_ngrams = bundle_ngrams - 1 164 | } 165 | file.rename(destination,real_destination_name) 166 | } 167 | 168 | silent = real_destination_name 169 | } 170 | 171 | 172 | #' Convert words to phrases in a text file. 173 | #' 174 | #' This function attempts to learn phrases given a text document. 175 | #' It does so by progressively joining adjacent pairs of words with an '_' character. 176 | #' You can then run the code multiple times to create multiword phrases. 177 | #' Wrapper around code from the Mikolov's original word2vec release. 178 | #' 179 | #' @title Convert words to phrases 180 | #' @author Tomas Mikolov 181 | #' @param train_file Path of a single .txt file for training. 182 | #' Tokens are split on spaces. 183 | #' @param output_file Path of output file 184 | #' @param debug_mode debug mode. Must be 0, 1 or 2. 0 is silent; 1 print summary statistics; 185 | #' prints progress regularly. 186 | #' @param min_count Minimum times a word must appear to be included in the samples. 187 | #' High values help reduce model size. 188 | #' @param threshold Threshold value for determining if pairs of words are phrases. 189 | #' @param force Whether to overwrite existing files at the output location. Default FALSE 190 | #' 191 | #' @return The name of output_file, the trained file where common phrases are now joined. 
192 | #' 193 | #' @export 194 | #' @examples 195 | #' \dontrun{ 196 | #' model=word2phrase("text8","vec.txt") 197 | #' } 198 | 199 | word2phrase=function(train_file,output_file,debug_mode=0,min_count=5,threshold=100,force=FALSE) 200 | { 201 | if (!file.exists(train_file)) stop("Can't find the training file!") 202 | if (file.exists(output_file) && !force) stop("The output file '", 203 | output_file , 204 | "' already exists: give a new destination or run with 'force=TRUE'.") 205 | OUT=.C("word2phrase",rtrain_file=as.character(train_file), 206 | rdebug_mode=as.integer(debug_mode), 207 | routput_file=as.character(output_file), 208 | rmin_count=as.integer(min_count), 209 | rthreshold=as.double(threshold)) 210 | return(output_file) 211 | } 212 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Word Vectors 2 | 3 | [![Build Status](https://travis-ci.org/bmschmidt/wordVectors.svg?branch=master)](https://travis-ci.org/bmschmidt/wordVectors) 4 | 5 | An R package for building and exploring word embedding models. 6 | 7 | # Description 8 | 9 | This package does three major things to make it easier to work with word2vec and other vectorspace models of language. 10 | 11 | 1. [Trains word2vec models](#creating-text-vectors) using an extended Jian Li's word2vec code; reads and writes the binary word2vec format so that you can import pre-trained models such as Google's; and provides tools for reading only *part* of a model (rows or columns) so you can explore a model in memory-limited situations. 12 | 2. [Creates a new `VectorSpaceModel` class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods.](#vectorspacemodel-object) For example, instead of writing 13 | 14 | > `model[rownames(model)=="king",]`, 15 | 16 | you can write 17 | 18 | > `model[["king"]]`, 19 | 20 | and instead of writing 21 | 22 | > `vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])` (whew!), 23 | 24 | you can write 25 | 26 | > `vectors %>% closest_to(~"king" - "man" + "woman")`. 27 | 28 | 3. [Implements several basic matrix operations that are useful in exploring word embedding models including cosine similarity, nearest neighbor, and vector projection](#useful-matrix-operations) with some caching that makes them much faster than the simplest implementations. 29 | 30 | ### Quick start 31 | 32 | For a step-by-step interactive demo that includes installation and training a model on 77 historical cookbooks from Michigan State University, [see the introductory vignette.](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd). 33 | 34 | ### Credit 35 | 36 | This includes an altered version of Tomas Mikolov's original C code for word2vec; those wrappers were origally written by Jian Li, and I've only tweaked them a little. Several other users have improved that code since I posted it here. 37 | 38 | Right now, it [does not (I don't think) install under Windows 8](https://github.com/bmschmidt/wordVectors/issues/2). Help appreciated on that thread. OS X, Windows 7, Windows 10, and Linux install perfectly well, with one or two exceptions. 39 | 40 | It's not extremely fast, but once the data is loaded in most operations happen in suitable time for exploratory data analysis (under a second on my laptop.) 
41 | 42 | For high-performance analysis of models, C or python's numpy/gensim will likely be better than this package, in part because R doesn't have support for single-precision floats. The goal of this package is to facilitate clear code and exploratory data analysis of models. 43 | 44 | Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms. 45 | 46 | ## Creating text vectors. 47 | 48 | One portion of this is an expanded version of the code from Jian Li's `word2vec` package with a few additional parameters enabled as the function `train_word2vec`. 49 | 50 | The input must still be in a single file and pre-tokenized, but it uses the existing word2vec C code. For online data processing, I like the gensim python implementation, but I don't plan to link that to R. 51 | 52 | In RStudio I've noticed that this appears to hang, but if you check processors it actually still runs. Try it on smaller portions first, and then let it take time: the training function can take hours for tens of thousands of books. 53 | 54 | ## VectorSpaceModel object 55 | 56 | The package loads in the word2vec binary format with the format `read.vectors` into a new object called a "VectorSpaceModel" object. It's a light superclass of the standard R matrix object. Anything you can do with matrices, you can do with VectorSpaceModel objects. 57 | 58 | It has a few convenience functions as well. 59 | 60 | ### Faster Access to text vectors 61 | 62 | The rownames of a VectorSpaceModel object are presumed to be tokens in a vector space model and therefore semantically useful. The classic word2vec demonstration is that vector('king') - vector('man') + vector('woman') =~ vector('queen'). With a standard matrix, the vector on the right-hand side of the equation would be described as 63 | 64 | ```{r, include=F,show=T} 65 | vector_set[rownames(vector_set)=="king",] - vector_set[rownames(vector_set)=="man",] + vector_set[rownames(vector_set)=="woman",] 66 | ``` 67 | 68 | In this package, you can simply access it by using the double brace operators: 69 | 70 | ```{r, include=F,show=T} 71 | vector_set[["king"]] - vector_set[["man"]] + vector_set[["woman"]] 72 | ``` 73 | 74 | (And in the context of the custom functions, as a formula like `~"king" - "man" + "woman"`: see below). 75 | 76 | Since frequently an average of two vectors provides a better indication, multiple words can be collapsed into a single vector by specifying multiple labels. For example, this may provide a slightly better gender vector: 77 | 78 | ```{r} 79 | vector_set[["king"]] - vector_set[[c("man","men")]] + vector_set[[c("woman","women")]] 80 | ``` 81 | 82 | Sometimes you want to subset *without* averaging. You can do this with the argument `average==FALSE` to the subset. This is particularly useful for comparing slices of the matrix to itself in similarity operations. 83 | 84 | ```{r} 85 | cosineSimilarity(vector_set[[c("man","men","king"),average=F]], vector_set[[c("woman","women","queen"),average=F]] 86 | ``` 87 | 88 | ## A few native functions defined on the VectorSpaceModel object. 89 | 90 | The native `show` method just prints the dimensions; the native `plot` method does some crazy reductions with the T-SNE package (installation required for functionality) because T-SNE is a nice way to reduce down the size of vectors, **or** lets you pass `method='pca'` to array a full set or subset by the first two principal components. 
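
For example, with the bundled `demo_vectors` model (a rough sketch; any `VectorSpaceModel` works the same way, and the t-SNE call needs the `tsne` package installed):

```{r}
# t-SNE layout of the model (only the first 300 words are plotted)
plot(demo_vectors)

# principal-components plot of a hand-picked subset
demo_vectors[[c("good","bad","he","she"), average=FALSE]] %>% plot(method="pca")
```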
91 | 92 | 93 | ## Useful matrix operations 94 | 95 | One challenge of vector-space models of texts is that it takes some basic matrix multiplication functions to make them dance around in an entertaining way. 96 | 97 | This package bundles the ones I think are the most useful. 98 | Each takes a `VectorSpaceModel` as its first argument. Sometimes, it's appropriate for the VSM to be your entire data set; other times, it's sensible to limit it to just one or a few vectors. Where appropriate, the functions can also take vectors or matrices as inputs. 99 | 100 | * `cosineSimilarity(VSM_1,VSM_2)` calculates the cosine similarity of every vector in one vector space model to every vector in another. This is `n^2` complexity. With a vocabulary size of 20,000 or so, it can be reasonable to compare an entire set to itself; or you can compare a larger set to a smaller one to search for particular terms of interest. 101 | * `cosineDist(VSM_1,VSM_2)` is the inverse of `cosineSimilarity`. It's not really a distance metric, but can be used as one for clustering and the like. 102 | * `closest_to(VSM,vector,n)` wraps a particularly common use case for `cosineSimilarity`: finding the top `n` terms in a `VectorSpaceModel` closest to a given vector. 103 | * `project(VSM,vector)` takes a `VectorSpaceModel` and returns the portion parallel to the vector `vector`. 104 | * `reject(VSM,vector)` is the inverse of `project`; it takes a `VectorSpaceModel` and returns the portion orthogonal to the vector `vector`. This makes it possible, for example, to collapse a vector space by removing certain distinctions of meaning. 105 | * `magnitudes` calculates the magnitude of each row in a VSM. This is useful in many operations. 106 | 107 | All of these functions place the VSM object as the first argument. This makes it easy to chain together operations using the `magrittr` package. For example, beginning with a single vector set, one could find the nearest words in a set to a version of the vector for "bank" that has been decomposed to remove any semantic similarity to the banking sector. 108 | 109 | ``` {r} 110 | library(magrittr) 111 | not_that_kind_of_bank = chronam_vectors[["bank"]] %>% 112 | reject(chronam_vectors[["cashier"]]) %>% 113 | reject(chronam_vectors[["depositors"]]) %>% 114 | reject(chronam_vectors[["check"]]) 115 | chronam_vectors %>% closest_to(not_that_kind_of_bank) 116 | ``` 117 | 118 | These functions also allow an additional layer of syntactic sugar when working with word vectors. 119 | 120 | If you're working entirely with a single model, you can express the whole operation as a formula, so you don't have to keep referring to the model by name; the formula interface reduces typing and increases clarity. 121 | 122 | ```{r} 123 | vectors %>% closest_to(~ "king" - "man" + "woman") 124 | ``` 125 | 126 | 127 | # Quick start 128 | 129 | ## Install the wordVectors package. 130 | 131 | One of the major hurdles to running word2vec for ordinary people is that it requires compiling a C program. For many people, it may be easier to install it in R. 132 | 133 | 1. If you haven't already, [install R](https://cran.rstudio.com/) and then [install RStudio](https://www.rstudio.com/products/rstudio/download/). 134 | 2. Open R, and get a command-line prompt (the thing with a `>` on the left hand side.) This is where you'll be copy-pasting commands. 135 | 3. Install (if you don't already have it) the package `devtools` by pasting the following 136 | ```R 137 | install.packages("devtools") 138 | ``` 139 | 140 | 4. 
Install the latest version of this package from Github by pasting in the following. 141 | ```R 142 | devtools::install_github("bmschmidt/wordVectors") 143 | ``` 144 | Windows users may need to install "Rtools" as well: if so, a message to this effect should appear in red on the screen. This may cycle through a very large number of warnings: so long as it says "warning" and not "error", you're probably OK. 145 | 146 | ## Train a model. 147 | 148 | For instructions on training, see the [introductory vignette](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd) 149 | 150 | ## Explore an existing model. 151 | 152 | For instructions on exploration, see the end of the [introductory vignette](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd), or the slower-paced [vignette on exploration](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/exploration.Rmd) 153 | -------------------------------------------------------------------------------- /data/demo_vectors.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bmschmidt/wordVectors/ad127c1badc1f9303a2b0d34e65acfc8759d2e3f/data/demo_vectors.rda -------------------------------------------------------------------------------- /inst/doc/exploration.R: -------------------------------------------------------------------------------- 1 | ## ------------------------------------------------------------------------ 2 | library(wordVectors) 3 | library(magrittr) 4 | 5 | ## ------------------------------------------------------------------------ 6 | demo_vectors[["good"]] 7 | 8 | ## ------------------------------------------------------------------------ 9 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 10 | 11 | ## ------------------------------------------------------------------------ 12 | demo_vectors %>% closest_to("bad") 13 | 14 | ## ------------------------------------------------------------------------ 15 | 16 | demo_vectors %>% closest_to(~"good"+"bad") 17 | 18 | # The same thing could be written as: 19 | # demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]]) 20 | 21 | ## ------------------------------------------------------------------------ 22 | demo_vectors %>% closest_to(~"good" - "bad") 23 | 24 | ## ------------------------------------------------------------------------ 25 | demo_vectors %>% closest_to(~ "bad" - "good") 26 | 27 | ## ------------------------------------------------------------------------ 28 | demo_vectors %>% closest_to(~ "he" - "she") 29 | demo_vectors %>% closest_to(~ "she" - "he") 30 | 31 | ## ------------------------------------------------------------------------ 32 | demo_vectors %>% closest_to(~ "guy" - "he" + "she") 33 | 34 | ## ------------------------------------------------------------------------ 35 | demo_vectors %>% closest_to(~ "guy" + ("she" - "he")) 36 | 37 | ## ------------------------------------------------------------------------ 38 | 39 | demo_vectors[[c("lady","woman","man","he","she","guy","man"), average=F]] %>% 40 | plot(method="pca") 41 | 42 | 43 | ## ------------------------------------------------------------------------ 44 | top_evaluative_words = demo_vectors %>% 45 | closest_to(~ "good"+"bad",n=75) 46 | 47 | goodness = demo_vectors %>% 48 | closest_to(~ "good"-"bad",n=Inf) 49 | 50 | femininity = demo_vectors %>% 51 | closest_to(~ "she" - "he", n=Inf) 52 | 53 | ## 
------------------------------------------------------------------------ 54 | library(ggplot2) 55 | library(dplyr) 56 | 57 | top_evaluative_words %>% 58 | inner_join(goodness) %>% 59 | inner_join(femininity) %>% 60 | ggplot() + 61 | geom_text(aes(x=`similarity to "she" - "he"`, 62 | y=`similarity to "good" - "bad"`, 63 | label=word)) 64 | 65 | -------------------------------------------------------------------------------- /inst/doc/exploration.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec Workshop" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Vignette Title} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Exploring Word2Vec models 13 | 14 | R is a great language for *exploratory data analysis* in particular. If you're going to use a word2vec model in a larger pipeline, it may be important (intellectually or ethically) to spend a little while understanding what kind of model of language you've learned. 15 | 16 | This package makes it easy to do so, both by allowing you to read word2vec models into and out of R, and by giving some syntactic sugar that lets you describe vector-space models concisely and clearly. 17 | 18 | Note that these functions may still be useful if you're a data analyst training word2vec models elsewhere (say, in gensim.) I'm also hopeful this can be a good way of interacting with varied vector models in a workshop session. 19 | 20 | If you want to train your own model or need help setting up the package, read the introductory vignette. Aside from the installation, it assumes more knowledge of R than this walkthrough. 21 | 22 | ## Why explore? 23 | 24 | In this vignette we're going to look at (a small portion of) a model trained on teaching evaluations. It's an interesting set, but it's also one that shows the importance of exploring vector space models before you use them. Exploration is important because: 25 | 26 | 1. If you're a humanist or social scientist, it can tell you something about the *sources* by letting you see how they use language. These co-occurrence patterns can then be better investigated through close reading or more traditional collocation scores, which are potentially more reliable but also much slower and less flexible. 27 | 28 | 2. If you're an engineer, it can help you understand some of the biases built into a model that you're using in a larger pipeline. This can be both technically and ethically important: you don't want, for instance, to build a job-recommendation system which is disinclined to offer programming jobs to women because it has learned that women are underrepresented in CS jobs already. 29 | (On this point in word2vec in particular, see [here](https://freedom-to-tinker.com/blog/randomwalker/language-necessarily-contains-human-biases-and-so-will-machines-trained-on-language-corpora/) and [here](https://arxiv.org/abs/1607.06520).) 30 | 31 | ## Getting started. 32 | 33 | First we'll load this package, and the recommended package `magrittr`, which lets us pipe objects from one function to the next. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | The basic element of any vector space model is a *vector* for each word. In the demo data included with this package, an object called 'demo_vectors', each word's vector consists of 500 numbers: you can start to examine them, if you wish, by hand. So let's consider just one of these--the vector for 'good'.
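Before zooming in on 'good', a quick base-R sanity check shows the overall shape of the demo data. (This is a minimal sketch; nothing here is specific to the package, since a VectorSpaceModel behaves like an ordinary matrix.)

```{r}
dim(demo_vectors)            # how many words, and how many dimensions per word
head(rownames(demo_vectors)) # the first few tokens the model knows about
```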
41 | 42 | In R's ordinary matrix syntax, you could write that out laboriously as `demo_vectors[rownames(demo_vectors)=="good",]`. `wordVectors` provides a shorthand using double brackets: 43 | 44 | ```{r} 45 | demo_vectors[["good"]] 46 | ``` 47 | 48 | These numbers are meaningless on their own. But in the vector space, we can find similar words. 49 | 50 | ```{r} 51 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 52 | ``` 53 | 54 | The `%>%` is the pipe operator from magrittr; it helps to keep things organized, and is particularly useful with some of the things we'll see later. The 'similarity' scores here are cosine similarity in a vector space; 1.0 represents perfect similarity, 0 is no correlation, and -1.0 is complete opposition. In practice, vector "opposition" is different from the colloquial use of "opposite," and very rare. You'll only occasionally see vector scores below 0--as you can see above, "bad" is actually one of the most similar words to "good." 55 | 56 | When interactively exploring a single model (rather than comparing *two* models), it can be a pain to keep retyping words over and over. Rather than operate on the vectors, this package also lets you access the word directly by using R's formula notation: putting a tilde in front of it. For a single word, you can even access it directly, like so. 57 | 58 | ```{r} 59 | demo_vectors %>% closest_to("bad") 60 | ``` 61 | 62 | ## Vector math 63 | 64 | The tildes are necessary syntax where things get interesting--you can do **math** on these vectors. So if we want to find the words that are closest to the *combination* of "good" and "bad" (which is to say, words that get used in evaluation) we can write (see where the tilde is?): 65 | 66 | ```{r} 67 | 68 | demo_vectors %>% closest_to(~"good"+"bad") 69 | 70 | # The same thing could be written as: 71 | # demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]]) 72 | ``` 73 | 74 | Those are words that are common to both "good" and "bad". We could also find words that are shaded towards just good but *not* bad by using subtraction. 75 | 76 | ```{r} 77 | demo_vectors %>% closest_to(~"good" - "bad") 78 | ``` 79 | 80 | > What does this "subtraction" vector mean? 81 | > In practice, the easiest way to think of it is probably simply as 'similar to 82 | > good' and 'dissimilar to bad'. Omer Levy's papers suggest this interpretation. 83 | > But taking the vectors more seriously means you can think of it geometrically: "good"-"bad" is 84 | > a vector that describes the difference between positive and negative. 85 | > Similarity to this vector means, technically, the portion of a word's vector 86 | > whose multidimensional path lies largely along the direction between the two words. 87 | 88 | Again, you can easily switch the order to the opposite: here are a bunch of bad words: 89 | 90 | ```{r} 91 | demo_vectors %>% closest_to(~ "bad" - "good") 92 | ``` 93 | 94 | All sorts of binaries are captured in word2vec models. One of the most famous, since Mikolov's original word2vec paper, is *gender*. If you ask for similarity to "he"-"she", for example, you get words that appear mostly in a *male* context. 
Since these examples are from teaching evaluations, after just a few straightforwardly gendered words, we start to get things that only men are called ("arrogant") or fields where there are very few women in the university ("physics"). 95 | 96 | ```{r} 97 | demo_vectors %>% closest_to(~ "he" - "she") 98 | demo_vectors %>% closest_to(~ "she" - "he") 99 | ``` 100 | 101 | ## Analogies 102 | 103 | We can expand out the match to perform analogies. Men tend to be called 'guys'. 104 | What's the female equivalent? 105 | In an SAT-style analogy, you might write `he:guy::she:???`. 106 | In vector math, we think of this as moving between points. 107 | 108 | If you're using the mental framework of positive as 'similarity' and 109 | negative as 'dissimilarity,' you can think of this as starting at "guy", 110 | removing its similarity to "he", and adding a similarity to "she". 111 | 112 | This yields the answer: the most similar term to "guy" for a woman is "lady." 113 | 114 | ```{r} 115 | demo_vectors %>% closest_to(~ "guy" - "he" + "she") 116 | ``` 117 | 118 | If you're using the other mental framework, of thinking of these as real vectors, 119 | you might phrase this in a slightly different way. 120 | You have a gender vector `("female" - "male")` that represents the *direction* from masculinity 121 | to femininity. You can then add this vector to "guy", and that will take you to a new neighborhood. You might phrase that this way: note that the math is exactly equivalent, and 122 | only the grouping is different. 123 | 124 | ```{r} 125 | demo_vectors %>% closest_to(~ "guy" + ("she" - "he")) 126 | ``` 127 | 128 | Principal components can let you plot a subset of these vectors to see how they relate. You can imagine an arrow from "he" to "she", from "guy" to "lady", and from "man" to "woman"; all run in roughly the same direction. 129 | 130 | ```{r} 131 | 132 | demo_vectors[[c("lady","woman","man","he","she","guy","man"), average=F]] %>% 133 | plot(method="pca") 134 | 135 | ``` 136 | 137 | These lists of ten words at a time are useful for interactive exploration, but sometimes we might want to say 'n=Inf' to return the full list. For instance, we can combine these two methods to look at positive and negative words used to evaluate teachers. 138 | 139 | First we build up three data frames: first, a list of the 75 top evaluative words, and then complete lists of similarity to `"good" - "bad"` and `"she" - "he"`. 140 | 141 | ```{r} 142 | top_evaluative_words = demo_vectors %>% 143 | closest_to(~ "good"+"bad",n=75) 144 | 145 | goodness = demo_vectors %>% 146 | closest_to(~ "good"-"bad",n=Inf) 147 | 148 | femininity = demo_vectors %>% 149 | closest_to(~ "she" - "he", n=Inf) 150 | ``` 151 | 152 | Then we can use tidyverse packages to join and plot these. 153 | An `inner_join` restricts us down to just those top 75 words, and ggplot 154 | can array the words on axes. 
155 | 156 | ```{r} 157 | library(ggplot2) 158 | library(dplyr) 159 | 160 | top_evaluative_words %>% 161 | inner_join(goodness) %>% 162 | inner_join(femininity) %>% 163 | ggplot() + 164 | geom_text(aes(x=`similarity to "she" - "he"`, 165 | y=`similarity to "good" - "bad"`, 166 | label=word)) 167 | ``` 168 | 169 | -------------------------------------------------------------------------------- /inst/doc/introduction.R: -------------------------------------------------------------------------------- 1 | ## ------------------------------------------------------------------------ 2 | if (!require(wordVectors)) { 3 | if (!(require(devtools))) { 4 | install.packages("devtools") 5 | } 6 | devtools::install_github("bmschmidt/wordVectors") 7 | } 8 | 9 | 10 | 11 | ## ------------------------------------------------------------------------ 12 | library(wordVectors) 13 | library(magrittr) 14 | 15 | ## ------------------------------------------------------------------------ 16 | if (!file.exists("cookbooks.zip")) { 17 | download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip") 18 | } 19 | unzip("cookbooks.zip",exdir="cookbooks") 20 | 21 | ## ------------------------------------------------------------------------ 22 | if (!file.exists("cookbooks.txt")) prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=2) 23 | 24 | ## ------------------------------------------------------------------------ 25 | if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin") 26 | 27 | 28 | ## ------------------------------------------------------------------------ 29 | model %>% closest_to("fish") 30 | 31 | ## ------------------------------------------------------------------------ 32 | model %>% 33 | closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50) 34 | 35 | ## ------------------------------------------------------------------------ 36 | some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150) 37 | fishy = model[[some_fish$word,average=F]] 38 | plot(fishy,method="pca") 39 | 40 | ## ------------------------------------------------------------------------ 41 | set.seed(10) 42 | centers = 150 43 | clustering = kmeans(model,centers=centers,iter.max = 40) 44 | 45 | ## ------------------------------------------------------------------------ 46 | sapply(sample(1:centers,10),function(n) { 47 | names(clustering$cluster[clustering$cluster==n][1:10]) 48 | }) 49 | 50 | ## ------------------------------------------------------------------------ 51 | ingredients = c("madeira","beef","saucepan","carrots") 52 | term_set = lapply(ingredients, 53 | function(ingredient) { 54 | nearest_words = model %>% closest_to(model[[ingredient]],20) 55 | nearest_words$word 56 | }) %>% unlist 57 | 58 | subset = model[[term_set,average=F]] 59 | 60 | subset %>% 61 | cosineDist(subset) %>% 62 | as.dist %>% 63 | hclust %>% 64 | plot 65 | 66 | 67 | ## ------------------------------------------------------------------------ 68 | tastes = model[[c("sweet","salty"),average=F]] 69 | 70 | # model[1:3000,] here restricts to the 3000 most common words in the set. 71 | sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes) 72 | 73 | # Filter to the top 20 sweet or salty. 
74 | sweet_and_saltiness = sweet_and_saltiness[ 75 | rank(-sweet_and_saltiness[,1])<20 | 76 | rank(-sweet_and_saltiness[,2])<20, 77 | ] 78 | 79 | plot(sweet_and_saltiness,type='n') 80 | text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness)) 81 | 82 | 83 | ## ------------------------------------------------------------------------ 84 | 85 | tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]] 86 | 87 | # model[1:3000,] here restricts to the 3000 most common words in the set. 88 | common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes) 89 | 90 | common_similarities_tastes[20:30,] 91 | 92 | ## ------------------------------------------------------------------------ 93 | high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) < 75,] 94 | 95 | high_similarities_to_tastes %>% 96 | prcomp %>% 97 | biplot(main="Fifty words in a\nprojection of flavor space") 98 | 99 | ## ------------------------------------------------------------------------ 100 | plot(model,perplexity=50) 101 | 102 | -------------------------------------------------------------------------------- /inst/doc/introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec introduction" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Vignette Title} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Intro 13 | 14 | This vignette walks you through training a word2vec model, and using that model to search for similarities, to build clusters, and to visualize vocabulary relationships of that model in two dimensions. If you are working with pre-trained vectors, you might want to jump straight to the "exploration" vignette; it is a little slower-paced, but doesn't show off quite so many features of the package. 15 | 16 | # Package installation 17 | 18 | If you have not installed this package, paste the below. More detailed installation instructions are at the end of the [package README](https://github.com/bmschmidt/wordVectors). 19 | 20 | ```{r} 21 | if (!require(wordVectors)) { 22 | if (!(require(devtools))) { 23 | install.packages("devtools") 24 | } 25 | devtools::install_github("bmschmidt/wordVectors") 26 | } 27 | 28 | 29 | ``` 30 | 31 | # Building test data 32 | 33 | We begin by importing the `wordVectors` package and the `magrittr` package, because its pipe operator makes it easier to work with data. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | First we build up a test file to train on. 41 | As an example, we'll use a collection of cookbooks from Michigan State University. 42 | This has to download from the Internet if it doesn't already exist. 43 | 44 | ```{r} 45 | if (!file.exists("cookbooks.zip")) { 46 | download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip") 47 | } 48 | unzip("cookbooks.zip",exdir="cookbooks") 49 | ``` 50 | 51 | 52 | Then we *prepare* a single file for word2vec to read in. This does a couple things: 53 | 54 | 1. Creates a single text file with the contents of every file in the original document; 55 | 2. Uses the `tokenizers` package to clean and lowercase the original text, 56 | 3. If `bundle_ngrams` is greater than 1, joins together common bigrams into a single word. For example, "olive oil" may be joined together into "olive_oil" wherever it occurs. 
57 | 58 | You can also do this in another language: particularly for large files, that will be **much** faster. (For reference: in a console, `perl -ne 's/[^A-Za-z_0-9 \n]/ /g; print lc $_;' cookbooks/*.txt > cookbooks.txt` will do much the same thing on ASCII text in a couple seconds.) If you do this and want to bundle ngrams, you'll then need to call `word2phrase("cookbooks.txt","cookbook_bigrams.txt",...)` to build up the bigrams; call it twice if you want 3-grams, and so forth. 59 | 60 | 61 | ```{r} 62 | if (!file.exists("cookbooks.txt")) prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=2) 63 | ``` 64 | 65 | To train a word2vec model, use the function `train_word2vec`. This actually builds up the model. It uses an on-disk file as an intermediary and then reads that file into memory. 66 | 67 | ```{r} 68 | if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin") 69 | 70 | ``` 71 | 72 | A few notes: 73 | 74 | 1. The `vectors` parameter is the dimensionality of the representation. More vectors usually means more precision, but also more random error and slower operations. Likely choices are probably in the range 100-500. 75 | 2. The `threads` parameter is the number of processors to use on your computer. On a modern laptop, the fastest results will probably be between 2 and 8 threads, depending on the number of cores. 76 | 3. `iter` is how many times to read through the corpus. With fewer than 100 books, it can greatly help to increase the number of passes; if you're working with billions of words, it probably matters less. One danger of too low a number of iterations is that words that aren't closely related will seem to be closer than they are. 77 | 4. Training can take a while. On my laptop, it takes a few minutes to train these cookbooks; larger models take proportionally more time. Because of the importance of more iterations to reducing noise, don't be afraid to set things up to require a lot of training time (as much as a day!) 78 | 5. One of the best things about the word2vec algorithm is that it *does* work on extremely large corpora in linear time. 79 | 6. In RStudio I've noticed that this sometimes appears to hang after a while; the percentage bar stops updating. If you check system activity it actually is still running, and will complete. 80 | 7. If at any point you want to *read in* a previously trained model, you can do so by typing `model = read.vectors("cookbook_vectors.bin")`. 81 | 82 | Now we have a model in memory, trained on about 10 million words from 77 cookbooks. What can it tell us about food? 83 | 84 | ## Similarity searches 85 | 86 | Well, you can run some basic operations to find the nearest elements: 87 | 88 | ```{r} 89 | model %>% closest_to("fish") 90 | ``` 91 | 92 | With that list, you can expand out further to search for multiple words: 93 | 94 | ```{r} 95 | model %>% 96 | closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50) 97 | ``` 98 | 99 | Now we have a pretty expansive list of potential fish-related words from old cookbooks. This can be useful for a few different things: 100 | 101 | 1. As a list of potential query terms for keyword search. 102 | 2. As a batch of words to use as seed to some other text mining operation; for example, you could pull all paragraphs surrounding these to find ways that fish are cooked. 103 | 3. 
As a source for visualization. 104 | 105 | Or we can just arrange them somehow. In this case, it doesn't look like much of anything. 106 | 107 | ```{r} 108 | some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150) 109 | fishy = model[[some_fish$word,average=F]] 110 | plot(fishy,method="pca") 111 | ``` 112 | 113 | ## Clustering 114 | 115 | We can use standard clustering algorithms, like kmeans, to find groups of terms that fit together. You can think of this as a sort of topic model, although unlike more sophisticated topic modeling algorithms like Latent Dirichlet Allocation, each word must be tied to a single topic. 116 | 117 | ```{r} 118 | set.seed(10) 119 | centers = 150 120 | clustering = kmeans(model,centers=centers,iter.max = 40) 121 | ``` 122 | 123 | Here are ten random "topics" produced through this method. Each column shows the ten most frequent words in one random cluster. 124 | 125 | ```{r} 126 | sapply(sample(1:centers,10),function(n) { 127 | names(clustering$cluster[clustering$cluster==n][1:10]) 128 | }) 129 | ``` 130 | 131 | These can be useful for figuring out, at a glance, what some of the overall common clusters in your corpus are. 132 | 133 | Clusters need not be derived at the level of the full model. We can take, for instance, 134 | the 20 words closest to each of four different kinds of words. 135 | 136 | ```{r} 137 | ingredients = c("madeira","beef","saucepan","carrots") 138 | term_set = lapply(ingredients, 139 | function(ingredient) { 140 | nearest_words = model %>% closest_to(model[[ingredient]],20) 141 | nearest_words$word 142 | }) %>% unlist 143 | 144 | subset = model[[term_set,average=F]] 145 | 146 | subset %>% 147 | cosineDist(subset) %>% 148 | as.dist %>% 149 | hclust %>% 150 | plot 151 | 152 | ``` 153 | 154 | 155 | # Visualization 156 | 157 | ## Relationship planes. 158 | 159 | One of the basic strategies you can take is to try to project the high-dimensional space here into a plane you can look at. 160 | 161 | For instance, we can take the words "sweet" and "salty," find the twenty words most similar to either of them, and plot those in a sweet-salty plane. 162 | 163 | ```{r} 164 | tastes = model[[c("sweet","salty"),average=F]] 165 | 166 | # model[1:3000,] here restricts to the 3000 most common words in the set. 167 | sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes) 168 | 169 | # Filter to the top 20 sweet or salty. 170 | sweet_and_saltiness = sweet_and_saltiness[ 171 | rank(-sweet_and_saltiness[,1])<20 | 172 | rank(-sweet_and_saltiness[,2])<20, 173 | ] 174 | 175 | plot(sweet_and_saltiness,type='n') 176 | text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness)) 177 | 178 | ``` 179 | 180 | 181 | There's no limit to how complicated this can get. For instance, there are really *five* tastes: sweet, salty, bitter, sour, and savory. (Savory is usually called 'umami' nowadays, but that word will not appear in historic cookbooks.) 182 | 183 | Rather than use a base matrix of the whole set, we can shrink down to just five dimensions: how similar every word in our set is to each of these five. (I'm using cosine similarity here, so the closer a number is to one, the more similar it is.) 185 | 186 | ```{r} 187 | 188 | tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]] 189 | 190 | # model[1:3000,] here restricts to the 3000 most common words in the set.
190 | common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes) 191 | 192 | common_similarities_tastes[20:30,] 193 | ``` 194 | 195 | Now we can filter down to the 50 words that are closest to *any* of these (that's what the apply-max function below does), and 196 | use a PCA biplot to look at just 50 words in a flavor plane. 197 | 198 | ```{r} 199 | high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) < 75,] 200 | 201 | high_similarities_to_tastes %>% 202 | prcomp %>% 203 | biplot(main="Fifty words in a\nprojection of flavor space") 204 | ``` 205 | 206 | This tells us a few things. One is that (in some runnings of the model, at least--there is some random chance built in here) "sweet" and "sour" are closely aligned. Is this a unique feature of American cooking? A relationship that changes over time? These would require more investigation. 207 | 208 | Second is that "savory" really is an active category in these cookbooks, even without the precision of 'umami' as a word to express it. Anchovy, the flavor most closely associated with savoriness, shows up as fairly characteristic of the flavor, along with a variety of herbs. 209 | 210 | Finally, words characteristic of meals seem to show up in the upper realms of the plot. 211 | 212 | # Catchall reduction: TSNE 213 | 214 | Last but not least, there is a catchall method built into the library 215 | to visualize a single overall decent plane for viewing the model: TSNE dimensionality reduction. 216 | 217 | Just calling "plot" will display the equivalent of a word cloud with individual tokens grouped relatively close to each other based on their proximity in the higher dimensional space. 218 | 219 | "Perplexity" is the optimal number of neighbors for each word. By default it's 50; smaller numbers may cause clusters to appear more dramatically at the cost of overall coherence. 220 | 221 | ```{r} 222 | plot(model,perplexity=50) 223 | ``` 224 | 225 | A few notes on this method: 226 | 227 | 1. If you don't get local clusters, it is not working. You might need to reduce the perplexity so that clusters are smaller; or you might not have good local similarities. 228 | 229 | 2. If you're plotting only a small set of words, you're better off trying to plot a `VectorSpaceModel` with `method="pca"`, which locates the points using principal components analysis. 230 | -------------------------------------------------------------------------------- /inst/paper.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'WordVectors: an R environment for training and exploring word2vec models' 3 | tags: 4 | - Natural Language Processing 5 | - Vector Space Models 6 | - word2vec 7 | authors: 8 | - name: Benjamin M Schmidt 9 | orcid: 0000-0002-1142-5720 10 | affiliation: 1 11 | affiliations: 12 | - name: Northeastern University 13 | index: 1 14 | date: 24 January 2017 15 | bibliography: paper.bib 16 | --- 17 | 18 | # Summary 19 | 20 | This is an R package for training and exploring word2vec models.
It provides wrappers for the reference word2vec implementation released by Google to enable training of vectors from R.[@mikolov_efficient_2013] It also provides a variety of functions enabling exploratory data analysis of word2vec models in an R environment, including 1) functions for reading and writing word2vec's binary form, 2) standard linear algebra functions not bundled in base R (such as cosine similarity) with speed optimizations, and 3) a streamlined syntax for performing vector arithmetic in a vocabulary space. 21 | 22 | # References 23 | 24 | -------------------------------------------------------------------------------- /man/VectorSpaceModel-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{-,VectorSpaceModel,VectorSpaceModel-method} 5 | \alias{-,VectorSpaceModel,VectorSpaceModel-method} 6 | \title{VectorSpaceModel subtraction} 7 | \usage{ 8 | \S4method{-}{VectorSpaceModel,VectorSpaceModel}(e1, e2) 9 | } 10 | \arguments{ 11 | \item{e1}{A vector space model} 12 | 13 | \item{e2}{A vector space model of equal size OR a vector 14 | space model of a single row. If the latter (which is more likely) 15 | the specified row will be subtracted from each row.} 16 | } 17 | \value{ 18 | A VectorSpaceModel of the same dimensions and rownames 19 | as e1 20 | 21 | I believe this is necessary, but honestly am not sure. 22 | } 23 | \description{ 24 | Keep the VSM class when doing subtraction operations; 25 | make it possible to subtract a single row from an entire model. 26 | } 27 | -------------------------------------------------------------------------------- /man/VectorSpaceModel-class.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{class} 4 | \name{VectorSpaceModel-class} 5 | \alias{VectorSpaceModel-class} 6 | \title{Vector Space Model class} 7 | \value{ 8 | An object of class "VectorSpaceModel" 9 | } 10 | \description{ 11 | A class for describing and accessing Vector Space Models like Word2Vec. 12 | The base object is simply a matrix with columns describing dimensions and unique rownames 13 | as the names of vectors. This package gives a number of convenience functions for printing 14 | and, most importantly, accessing these objects. 15 | } 16 | \section{Slots}{ 17 | 18 | \describe{ 19 | \item{\code{magnitudes}}{The cached sum-of-squares for each row in the matrix. 
Can be cached to 20 | speed up similarity calculations} 21 | }} 22 | 23 | -------------------------------------------------------------------------------- /man/as.VectorSpaceModel.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{as.VectorSpaceModel} 4 | \alias{as.VectorSpaceModel} 5 | \title{Convert to a Vector Space Model} 6 | \usage{ 7 | as.VectorSpaceModel(matrix) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix to coerce.} 11 | } 12 | \value{ 13 | An object of class "VectorSpaceModel" 14 | } 15 | \description{ 16 | Convert to a Vector Space Model 17 | } 18 | -------------------------------------------------------------------------------- /man/closest_to.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{closest_to} 4 | \alias{closest_to} 5 | \title{Return the n closest words in a VectorSpaceModel to a given vector.} 6 | \usage{ 7 | closest_to(matrix, vector, n = 10, fancy_names = TRUE) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or a string or a formula coercable to a vector) 13 | of the same length as the VectorSpaceModel. See below.} 14 | 15 | \item{n}{The number of closest words to include.} 16 | 17 | \item{fancy_names}{If true (the default) the data frame will have descriptive names like 18 | 'similarity to "king+queen-man"'; otherwise, just 'similarity.' The default can speed up 19 | interactive exploration.} 20 | } 21 | \value{ 22 | A sorted data.frame with columns for the words and their similarity 23 | to the target vector. (Or, if as_df==FALSE, a named vector of similarities.) 24 | } 25 | \description{ 26 | This is a convenience wrapper around the most common use of 27 | 'cosineSimilarity'; the listing of several words similar to a given vector. 28 | Unlike cosineSimilarity, it returns a data.frame object instead of a matrix. 29 | cosineSimilarity is more powerful, because it can compare two matrices to 30 | each other; closest_to can only take a vector or vectorlike object as its second argument. 31 | But with (or without) the argument n=Inf, closest_to is often better for 32 | plugging directly into a plot. 33 | 34 | As with cosineSimilarity, the second argument can take several forms. If it's a vector or 35 | matrix slice, it will be taken literally. If it's a character string, it will 36 | be interpreted as a word and the associated vector from `matrix` will be used. If 37 | a formula, any strings in the formula will be converted to rows in the associated `matrix` 38 | before any math happens. 39 | } 40 | \examples{ 41 | 42 | # Synonyms and similar words 43 | closest_to(demo_vectors,demo_vectors[["good"]]) 44 | 45 | # If 'matrix' is a VectorSpaceModel object, 46 | # you can also just enter a string directly, and 47 | # it will be evaluated in the context of the passed matrix. 48 | 49 | closest_to(demo_vectors,"good") 50 | 51 | # You can also express more complicated formulas. 52 | 53 | closest_to(demo_vectors,"good") 54 | 55 | # Something close to the classic king:man::queen:woman; 56 | # What's the equivalent word for a female teacher that "guy" is for 57 | # a male one? 
58 | 59 | closest_to(demo_vectors,~ "guy" - "man" + "woman") 60 | 61 | } 62 | -------------------------------------------------------------------------------- /man/cosineDist.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{cosineDist} 4 | \alias{cosineDist} 5 | \title{Cosine Distance} 6 | \usage{ 7 | cosineDist(x, y) 8 | } 9 | \arguments{ 10 | \item{x}{A matrix, VectorSpaceModel, or vector.} 11 | 12 | \item{y}{A matrix, VectorSpaceModel, or vector.} 13 | } 14 | \value{ 15 | A matrix whose dimnames are rownames(x), rownames(y) and whose entires are 16 | the associated distance. 17 | } 18 | \description{ 19 | Calculate the cosine distance between two vectors. 20 | 21 | Not an actual distance metric, but can be used in similar contexts. 22 | It is calculated as simply the inverse of cosine similarity, 23 | and falls in a fixed range of 0 (identical) to 2 (completely opposite in direction.) 24 | } 25 | -------------------------------------------------------------------------------- /man/cosineSimilarity.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{cosineSimilarity} 4 | \alias{cosineSimilarity} 5 | \title{Cosine Similarity} 6 | \usage{ 7 | cosineSimilarity(x, y) 8 | } 9 | \arguments{ 10 | \item{x}{A matrix or VectorSpaceModel object} 11 | 12 | \item{y}{A vector, matrix or VectorSpaceModel object. 13 | 14 | Vector inputs are coerced to single-row matrices; y must have the 15 | same number of dimensions as x.} 16 | } 17 | \value{ 18 | A matrix. Rows correspond to entries in x; columns to entries in y. 19 | } 20 | \description{ 21 | Calculate the cosine similarity of two matrices or a matrix and a vector. 22 | } 23 | \examples{ 24 | 25 | # Inspect the similarity of several academic disciplines by hand. 26 | subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=FALSE]] 27 | similarities = cosineSimilarity(subjects,subjects) 28 | 29 | # Use 'closest_to' to build up a large list of similar words to a seed set. 30 | subjects = demo_vectors[[c("history","literature","biology","math","stats"),average=TRUE]] 31 | new_subject_list = closest_to(demo_vectors,subjects,20) 32 | new_subjects = demo_vectors[[new_subject_list$word,average=FALSE]] 33 | 34 | # Plot the cosineDistance of these as a dendrogram. 35 | plot(hclust(as.dist(cosineDist(new_subjects,new_subjects)))) 36 | 37 | } 38 | -------------------------------------------------------------------------------- /man/demo_vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/data.R 3 | \docType{data} 4 | \name{demo_vectors} 5 | \alias{demo_vectors} 6 | \title{999 vectors trained on teaching evaluations} 7 | \format{A VectorSpaceModel object of 999 words and 500 vectors} 8 | \source{ 9 | Trained by package author. 10 | } 11 | \usage{ 12 | demo_vectors 13 | } 14 | \description{ 15 | A sample VectorSpaceModel object trained on about 15 million 16 | teaching evaluations, limited to the 999 most common words. 17 | Included for demonstration purposes only: there's only so much you can 18 | do with a 999 length vocabulary. 
19 | } 20 | \details{ 21 | You're best off downloading a real model to work with, 22 | such as the precompiled vectors distributed by Google 23 | at https://code.google.com/archive/p/word2vec/ 24 | } 25 | \keyword{datasets} 26 | -------------------------------------------------------------------------------- /man/distend.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{distend} 4 | \alias{distend} 5 | \title{Compress or expand a vector space model along a vector.} 6 | \usage{ 7 | distend(matrix, vector, multiplier) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or an object coercable to a vector, see project) 13 | of the same length as the VectorSpaceModel.} 14 | 15 | \item{multiplier}{A scaling factor. See below.} 16 | } 17 | \value{ 18 | A new matrix or VectorSpaceModel of the same dimensions as `matrix`, 19 | distended along the vector 'vector' by factor 'multiplier'. 20 | 21 | See `project` for more details and usage. 22 | } 23 | \description{ 24 | This is an experimental function that might be useful sometimes. 25 | 'Reject' flatly eliminates a particular dimension from a vectorspace, essentially 26 | squashing out a single dimension; 'distend' gives finer grained control, making it 27 | possible to stretch out or compress in the same space. High values of 'multiplier' 28 | make a given vector more prominent; 1 keeps the original matrix untransformed; values 29 | less than one compress distances along the vector; and 0 is the same as "reject," 30 | eliminating a vector entirely. Values less than zero will do some type of mirror-image 31 | universe thing, but probably aren't useful? 32 | } 33 | \examples{ 34 | closest_to(demo_vectors,"sweet") 35 | 36 | # Stretch out the vectorspace 4x longer along the gender direction. 37 | more_sexist = distend(demo_vectors, ~ "man" + "he" - "she" -"woman", 4) 38 | 39 | closest_to(more_sexist,"sweet") 40 | 41 | } 42 | -------------------------------------------------------------------------------- /man/filter_to_rownames.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{filter_to_rownames} 4 | \alias{filter_to_rownames} 5 | \title{Reduce by rownames} 6 | \usage{ 7 | filter_to_rownames(matrix, words) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel object} 11 | 12 | \item{words}{A list of rownames or VectorSpaceModel names} 13 | } 14 | \value{ 15 | An object of the same class as matrix, consisting 16 | of the rows that match its rownames. 
17 | 18 | Deprecated: use instead VSM[[c("word1","word2",...),average=FALSE]] 19 | } 20 | \description{ 21 | Reduce by rownames 22 | } 23 | -------------------------------------------------------------------------------- /man/improve_vectorspace.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{improve_vectorspace} 4 | \alias{improve_vectorspace} 5 | \title{Improve a vectorspace by removing common elements.} 6 | \usage{ 7 | improve_vectorspace(vectorspace, D = round(ncol(vectorspace)/100)) 8 | } 9 | \arguments{ 10 | \item{vectorspace}{A VectorSpacemodel to be improved.} 11 | 12 | \item{D}{The number of principal components to eliminate.} 13 | } 14 | \value{ 15 | A VectorSpaceModel object, transformed from the original. 16 | } 17 | \description{ 18 | See reference for a full description. Supposedly, these operations will improve performance on analogy tasks. 19 | } 20 | \examples{ 21 | 22 | closest_to(demo_vectors,"great") 23 | # stopwords like "and" and "very" are no longer top ten. 24 | # I don't know if this is really better, though. 25 | 26 | closest_to(improve_vectorspace(demo_vectors),"great") 27 | 28 | } 29 | \references{ 30 | Jiaqi Mu, Suma Bhat, Pramod Viswanath. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. https://arxiv.org/abs/1702.01417. 31 | } 32 | -------------------------------------------------------------------------------- /man/magnitudes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{magnitudes} 4 | \alias{magnitudes} 5 | \title{Vector Magnitudes} 6 | \usage{ 7 | magnitudes(matrix) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel object.} 11 | } 12 | \value{ 13 | A vector consisting of the magnitudes of each row. 14 | 15 | This is an extraordinarily simple function. 16 | } 17 | \description{ 18 | Vector Magnitudes 19 | } 20 | -------------------------------------------------------------------------------- /man/nearest_to.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{nearest_to} 4 | \alias{nearest_to} 5 | \title{Nearest vectors to a word} 6 | \usage{ 7 | nearest_to(...) 8 | } 9 | \arguments{ 10 | \item{...}{See `closest_to`} 11 | } 12 | \value{ 13 | a names vector of cosine similarities. See 'nearest_to' for more details. 14 | } 15 | \description{ 16 | This a wrapper around closest_to, included for back-compatibility. Use 17 | closest_to for new applications. 
18 | } 19 | \examples{ 20 | 21 | # Recommended usage in 1.0: 22 | nearest_to(demo_vectors, demo_vectors[["good"]]) 23 | 24 | # Recommended usage in 2.0: 25 | demo_vectors \%>\% closest_to("good") 26 | 27 | } 28 | -------------------------------------------------------------------------------- /man/normalize_lengths.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{normalize_lengths} 4 | \alias{normalize_lengths} 5 | \title{Matrix normalization.} 6 | \usage{ 7 | normalize_lengths(matrix) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel object} 11 | } 12 | \value{ 13 | An object of the same class as matrix 14 | } 15 | \description{ 16 | Normalize a matrix so that all rows are of unit length. 17 | } 18 | -------------------------------------------------------------------------------- /man/plot-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{plot,VectorSpaceModel-method} 5 | \alias{plot,VectorSpaceModel-method} 6 | \title{Plot a Vector Space Model.} 7 | \usage{ 8 | \S4method{plot}{VectorSpaceModel}(x, method = "tsne", ...) 9 | } 10 | \arguments{ 11 | \item{x}{The model to plot} 12 | 13 | \item{method}{The method to use for plotting. "pca" is principal components, "tsne" is t-sne} 14 | 15 | \item{...}{Further arguments passed to the plotting method.} 16 | } 17 | \value{ 18 | The TSNE model (silently.) 19 | } 20 | \description{ 21 | Visualizing a model as a whole is sort of undefined. I think the 22 | sanest thing to do is reduce the full model down to two dimensions 23 | using T-SNE, which preserves some of the local clusters. 24 | } 25 | \details{ 26 | For individual subsections, it can make sense to do a principal components 27 | plot of the space of just those letters. This is what happens if method 28 | is pca. On the full vocab, it's kind of a mess. 29 | 30 | This plots only the first 300 words in the model. 31 | } 32 | -------------------------------------------------------------------------------- /man/prep_word2vec.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word2vec.R 3 | \name{prep_word2vec} 4 | \alias{prep_word2vec} 5 | \title{Prepare documents for word2Vec} 6 | \usage{ 7 | prep_word2vec(origin, destination, lowercase = F, bundle_ngrams = 1, ...) 8 | } 9 | \arguments{ 10 | \item{origin}{A text file or a directory of text files 11 | to be used in training the model} 12 | 13 | \item{destination}{The location for output text.} 14 | 15 | \item{lowercase}{Logical. Should uppercase characters be converted to lower?} 16 | 17 | \item{bundle_ngrams}{Integer. Statistically significant phrases of up to this many words 18 | will be joined with underscores: e.g., "United States" will usually be changed to "United_States" 19 | if it appears frequently in the corpus. This calls word2phrase once if bundle_ngrams is 2, 20 | twice if bundle_ngrams is 3, and so forth; see that function for more details.} 21 | 22 | \item{...}{Further arguments passed to word2phrase when bundle_ngrams is 23 | greater than 1.} 24 | } 25 | \value{ 26 | The file name (silently). 
27 | } 28 | \description{ 29 | This function exports a directory or document to a single file 30 | suitable to Word2Vec run on. That means a single, seekable txt file 31 | with tokens separated by spaces. (For example, punctuation is removed 32 | rather than attached to the end of words.) 33 | This function is extraordinarily inefficient: in most real-world cases, you'll be 34 | much better off preparing the documents using python, perl, awk, or any other 35 | scripting language that can reasonable read things in line-by-line. 36 | } 37 | -------------------------------------------------------------------------------- /man/project.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{project} 4 | \alias{project} 5 | \title{Project each row of an input matrix along a vector.} 6 | \usage{ 7 | project(matrix, vector) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or object coercable to a vector) 13 | of the same length as the VectorSpaceModel.} 14 | } 15 | \value{ 16 | A new matrix or VectorSpaceModel of the same dimensions as `matrix`, 17 | each row of which is parallel to vector. 18 | 19 | If the input is a matrix, the output will be a matrix: if a VectorSpaceModel, 20 | it will be a VectorSpaceModel. 21 | } 22 | \description{ 23 | As with 'cosineSimilarity 24 | } 25 | -------------------------------------------------------------------------------- /man/read.binary.vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{read.binary.vectors} 4 | \alias{read.binary.vectors} 5 | \title{Read binary word2vec format files} 6 | \usage{ 7 | read.binary.vectors(filename, nrows = Inf, cols = "All", 8 | rowname_list = NULL, rowname_regexp = NULL) 9 | } 10 | \arguments{ 11 | \item{filename}{A file in the binary word2vec format to import.} 12 | 13 | \item{nrows}{Optionally, a number of rows to stop reading after. 14 | Word2vec sorts by frequency, so limiting to the first 1000 rows will 15 | give the thousand most-common words; it can be useful not to load 16 | the whole matrix into memory. This limit is applied BEFORE `name_list` and 17 | `name_regexp`.} 18 | 19 | \item{cols}{The column numbers to read. Default is "All"; 20 | if you are in a memory-limited environment, 21 | you can limit the number of columns you read in by giving a vector of column integers} 22 | 23 | \item{rowname_list}{A whitelist of words. If you wish to read in only a few dozen words, 24 | all other rows will be skipped and only these read in.} 25 | 26 | \item{rowname_regexp}{A regular expression specifying a pattern for rows to read in. 
Row 27 | names matching that pattern will be included in the read; all others will be skipped.} 28 | } 29 | \value{ 30 | A VectorSpaceModel object 31 | } 32 | \description{ 33 | Read binary word2vec format files 34 | } 35 | -------------------------------------------------------------------------------- /man/read.vectors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{read.vectors} 4 | \alias{read.vectors} 5 | \title{Read VectorSpaceModel} 6 | \usage{ 7 | read.vectors(filename, vectors = guess_n_cols(), binary = NULL, ...) 8 | } 9 | \arguments{ 10 | \item{filename}{The file to read in.} 11 | 12 | \item{vectors}{The number of dimensions word2vec calculated. Imputed automatically if not specified.} 13 | 14 | \item{binary}{Read in the binary word2vec form. (Wraps `read.binary.vectors`) By default, function 15 | guesses based on file suffix.} 16 | 17 | \item{...}{Further arguments passed to read.table or read.binary.vectors. 18 | Note that both accept 'nrows' as an argument. Word2vec produces 19 | by default frequency sorted output. Therefore 'read.vectors("file.bin", nrows=500)', for example, 20 | will return the vectors for the top 500 words. This can be useful on machines with limited 21 | memory.} 22 | } 23 | \value{ 24 | An matrixlike object of class `VectorSpaceModel` 25 | } 26 | \description{ 27 | Read a VectorSpaceModel from a file exported from word2vec or a similar output format. 28 | } 29 | -------------------------------------------------------------------------------- /man/reexports.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/utils.R 3 | \docType{import} 4 | \name{reexports} 5 | \alias{reexports} 6 | \alias{\%>\%} 7 | \title{Objects exported from other packages} 8 | \keyword{internal} 9 | \description{ 10 | These objects are imported from other packages. Follow the links 11 | below to see their documentation. 12 | 13 | \describe{ 14 | \item{magrittr}{\code{\link[magrittr]{\%>\%}}} 15 | }} 16 | 17 | -------------------------------------------------------------------------------- /man/reject.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{reject} 4 | \alias{reject} 5 | \title{Return a vector rejection for each element in a VectorSpaceModel} 6 | \usage{ 7 | reject(matrix, vector) 8 | } 9 | \arguments{ 10 | \item{matrix}{A matrix or VectorSpaceModel} 11 | 12 | \item{vector}{A vector (or an object coercable to a vector, see project) 13 | of the same length as the VectorSpaceModel.} 14 | } 15 | \value{ 16 | A new matrix or VectorSpaceModel of the same dimensions as `matrix`, 17 | each row of which is orthogonal to the `vector` object. 18 | 19 | This is defined simply as `matrix-project(matrix,vector)`, but having a separate 20 | name may make for cleaner code. 21 | 22 | See `project` for more details. 
23 | } 24 | \description{ 25 | Return a vector rejection for each element in a VectorSpaceModel 26 | } 27 | \examples{ 28 | closest_to(demo_vectors,demo_vectors[["man"]]) 29 | 30 | genderless = reject(demo_vectors,demo_vectors[["he"]] - demo_vectors[["she"]]) 31 | closest_to(genderless,genderless[["man"]]) 32 | 33 | } 34 | -------------------------------------------------------------------------------- /man/square_magnitudes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{square_magnitudes} 4 | \alias{square_magnitudes} 5 | \title{Square Magnitudes with caching} 6 | \usage{ 7 | square_magnitudes(object) 8 | } 9 | \arguments{ 10 | \item{VectorSpaceModel}{A matrix or VectorSpaceModel object} 11 | } 12 | \value{ 13 | A vector of the square magnitudes for each row 14 | } 15 | \description{ 16 | square_magnitudes Returns the square magnitudes and 17 | caches them if necessary 18 | } 19 | \keyword{internal} 20 | -------------------------------------------------------------------------------- /man/sub-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{[,VectorSpaceModel-method} 5 | \alias{[,VectorSpaceModel-method} 6 | \title{VectorSpaceModel indexing} 7 | \usage{ 8 | \S4method{[}{VectorSpaceModel}(x, i, j, ..., drop = TRUE) 9 | } 10 | \arguments{ 11 | \item{x}{The vectorspace model to subset} 12 | 13 | \item{i}{The row numbers to extract} 14 | 15 | \item{j}{The column numbers to extract} 16 | 17 | \item{...}{Other arguments passed to extract (unlikely to be useful).} 18 | 19 | \item{drop}{Whether to drop columns. This parameter is ignored.} 20 | } 21 | \value{ 22 | A VectorSpaceModel 23 | } 24 | \description{ 25 | Reduce a VectorSpaceModel to a smaller one 26 | } 27 | -------------------------------------------------------------------------------- /man/sub-sub-VectorSpaceModel-method.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \docType{methods} 4 | \name{[[,VectorSpaceModel-method} 5 | \alias{[[,VectorSpaceModel-method} 6 | \title{VectorSpaceModel subsetting} 7 | \usage{ 8 | \S4method{[[}{VectorSpaceModel}(x, i, average = TRUE) 9 | } 10 | \arguments{ 11 | \item{x}{The object being subsetted.} 12 | 13 | \item{i}{A character vector: the words to use as rownames.} 14 | 15 | \item{average}{Whether to collapse down to a single vector, 16 | or to return a subset of one row for each asked for.} 17 | } 18 | \value{ 19 | A VectorSpaceModel of a single row. 
20 | } 21 | \description{ 22 | VectorSpaceModel subsetting 23 | } 24 | -------------------------------------------------------------------------------- /man/train_word2vec.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word2vec.R 3 | \name{train_word2vec} 4 | \alias{train_word2vec} 5 | \title{Train a model by word2vec.} 6 | \usage{ 7 | train_word2vec(train_file, output_file = "vectors.bin", vectors = 100, 8 | threads = 1, window = 12, classes = 0, cbow = 0, min_count = 5, 9 | iter = 5, force = F, negative_samples = 5) 10 | } 11 | \arguments{ 12 | \item{train_file}{Path of a single .txt file for training. Tokens are split on spaces.} 13 | 14 | \item{output_file}{Path of the output file.} 15 | 16 | \item{vectors}{The number of vectors to output. Defaults to 100. 17 | More vectors usually means more precision, but also more random error, higher memory usage, and slower operations. 18 | Sensible choices are probably in the range 100-500.} 19 | 20 | \item{threads}{Number of threads to run training process on. 21 | Defaults to 1; up to the number of (virtual) cores on your machine may speed things up.} 22 | 23 | \item{window}{The size of the window (in words) to use in training.} 24 | 25 | \item{classes}{Number of classes for k-means clustering. Not documented/tested.} 26 | 27 | \item{cbow}{If 1, use a continuous-bag-of-words model instead of skip-grams. 28 | Defaults to false (recommended for newcomers).} 29 | 30 | \item{min_count}{Minimum times a word must appear to be included in the samples. 31 | High values help reduce model size.} 32 | 33 | \item{iter}{Number of passes to make over the corpus in training.} 34 | 35 | \item{force}{Whether to overwrite existing model files.} 36 | 37 | \item{negative_samples}{Number of negative samples to take in skip-gram training. 0 means full sampling, while lower numbers 38 | give faster training. For large corpora 2-5 may work; for smaller corpora, 5-15 is reasonable.} 39 | } 40 | \value{ 41 | A VectorSpaceModel object. 42 | } 43 | \description{ 44 | Train a model by word2vec. 45 | } 46 | \details{ 47 | The word2vec tool takes a text corpus as input and produces the 48 | word vectors as output. It first constructs a vocabulary from the 49 | training text data and then learns vector representation of words. 50 | The resulting word vector file can be used as features in many 51 | natural language processing and machine learning applications. 52 | } 53 | \examples{ 54 | \dontrun{ 55 | model = train_word2vec(system.file("examples", "rfaq.txt", package = "wordVectors")) 56 | } 57 | } 58 | \references{ 59 | \url{https://code.google.com/p/word2vec/} 60 | } 61 | \author{ 62 | Jian Li <\email{rweibo@sina.com}>, Ben Schmidt <\email{bmchmidt@gmail.com}> 63 | } 64 | -------------------------------------------------------------------------------- /man/word2phrase.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/word2vec.R 3 | \name{word2phrase} 4 | \alias{word2phrase} 5 | \title{Convert words to phrases} 6 | \usage{ 7 | word2phrase(train_file, output_file, debug_mode = 0, min_count = 5, 8 | threshold = 100, force = FALSE) 9 | } 10 | \arguments{ 11 | \item{train_file}{Path of a single .txt file for training. 
12 | Tokens are split on spaces.} 13 | 14 | \item{output_file}{Path of output file} 15 | 16 | \item{debug_mode}{Debug mode. Must be 0, 1, or 2: 0 is silent; 1 prints summary statistics; 17 | 2 prints progress regularly.} 18 | 19 | \item{min_count}{Minimum times a word must appear to be included in the samples. 20 | High values help reduce model size.} 21 | 22 | \item{threshold}{Threshold value for determining if pairs of words are phrases.} 23 | 24 | \item{force}{Whether to overwrite existing files at the output location. Default FALSE.} 25 | } 26 | \value{ 27 | The name of output_file: the processed file in which common phrases are now joined. 28 | } 29 | \description{ 30 | Convert words to phrases in a text file. 31 | } 32 | \details{ 33 | This function attempts to learn phrases given a text document. 34 | It does so by progressively joining adjacent pairs of words with an '_' character. 35 | You can then run the code multiple times to create multiword phrases. 36 | Wrapper around code from Mikolov's original word2vec release. 37 | } 38 | \examples{ 39 | \dontrun{ 40 | model=word2phrase("text8","vec.txt") 41 | } 42 | } 43 | \author{ 44 | Tomas Mikolov 45 | } 46 | -------------------------------------------------------------------------------- /man/write.binary.word2vec.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/matrixFunctions.R 3 | \name{write.binary.word2vec} 4 | \alias{write.binary.word2vec} 5 | \title{Write in word2vec binary format} 6 | \usage{ 7 | write.binary.word2vec(model, filename) 8 | } 9 | \arguments{ 10 | \item{model}{The wordVectors model you wish to save. (This can actually be any matrix with rownames, 11 | if you want a smaller binary serialization in single-precision floats.)} 12 | 13 | \item{filename}{The file to save the vectors to.
I recommend ".bin" as a suffix.} 14 | } 15 | \value{ 16 | Nothing 17 | } 18 | \description{ 19 | Write in word2vec binary format 20 | } 21 | -------------------------------------------------------------------------------- /src/Makevars.win: -------------------------------------------------------------------------------- 1 | 2 | PKG_CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w 3 | PKG_LIBS = -pthread 4 | 5 | -------------------------------------------------------------------------------- /src/tmcn_word2vec.c: -------------------------------------------------------------------------------- 1 | #include "R.h" 2 | #include "Rmath.h" 3 | #include "word2vec.h" 4 | 5 | void tmcn_word2vec(char *train_file0, char *output_file0, 6 | char *binary0, char *dims0, char *threads, 7 | char *window0, char *classes0, char *cbow0, 8 | char *min_count0, char *iter0, char *neg_samples0) 9 | { 10 | int i; 11 | layer1_size = atoll(dims0); 12 | num_threads = atoi(threads); 13 | window=atoi(window0); 14 | binary = atoi(binary0); 15 | classes = atoi(classes0); 16 | cbow = atoi(cbow0); 17 | min_count = atoi(min_count0); 18 | iter = atoll(iter0); 19 | negative = atoi(neg_samples0); 20 | strcpy(train_file, train_file0); 21 | strcpy(output_file, output_file0); 22 | 23 | 24 | alpha = 0.025; 25 | starting_alpha = alpha; 26 | word_count_actual = 0; 27 | 28 | vocab = (struct vocab_word *)calloc(vocab_max_size, sizeof(struct vocab_word)); 29 | vocab_hash = (int *)calloc(vocab_hash_size, sizeof(int)); 30 | expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real)); 31 | for (i = 0; i < EXP_TABLE_SIZE; i++) { 32 | expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); // Precompute the exp() table 33 | expTable[i] = expTable[i] / (expTable[i] + 1); // Precompute f(x) = x / (x + 1) 34 | } 35 | TrainModel(); 36 | } 37 | 38 | 39 | void CWrapper_word2vec(char **train_file, char **output_file, 40 | char **binary, char **dims, char **threads, 41 | char **window, char **classes, char **cbow, char **min_count, char **iter, char **neg_samples) 42 | { 43 | tmcn_word2vec(*train_file, *output_file, *binary, *dims, *threads,*window,*classes,*cbow,*min_count,*iter, *neg_samples); 44 | } 45 | 46 | -------------------------------------------------------------------------------- /src/word2phrase.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
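// Orientation: this file is Mikolov's original word2phrase tool, lightly adapted
// for R. printf output is routed through Rprintf, and the command-line argument
// parsing has been replaced by the word2phrase() entry point at the bottom of the
// file, which the R-level wrapper (documented in man/word2phrase.Rd) presumably
// reaches through .C(). The trailing "1" on the identifiers here (vocab1,
// train_file1, min_count1, ...) appears to keep these globals from colliding with
// the similarly named globals in word2vec.h, since both files are compiled into
// the same shared library.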
14 | #include "R.h" 15 | #include "Rmath.h" 16 | #include 17 | #include 18 | #include 19 | #include 20 | //#include 21 | 22 | #define MAX_STRING1 60 23 | 24 | const int vocab_hash1_size1 = 500000000; // Maximum 500M entries in the vocabulary 25 | 26 | typedef float real; // Precision of float numbers 27 | 28 | struct vocab_word1 { 29 | long long cn; 30 | char *word; 31 | }; 32 | 33 | char train_file1[MAX_STRING1], output_file1[MAX_STRING1]; 34 | struct vocab_word1 *vocab1; 35 | int debug_mode1 = 2, min_count1 = 5, *vocab_hash1, min_reduce1 = 1; 36 | long long vocab_max_size1 = 10000, vocab_size1 = 0; 37 | long long train_words1 = 0; 38 | real threshold1 = 100; 39 | 40 | unsigned long long next_random = 1; 41 | 42 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 43 | void ReadWord1(char *word, FILE *fin) { 44 | int a = 0, ch; 45 | while (!feof(fin)) { 46 | ch = fgetc(fin); 47 | if (ch == 13) continue; 48 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 49 | if (a > 0) { 50 | if (ch == '\n') ungetc(ch, fin); 51 | break; 52 | } 53 | if (ch == '\n') { 54 | strcpy(word, (char *)""); 55 | return; 56 | } else continue; 57 | } 58 | word[a] = ch; 59 | a++; 60 | if (a >= MAX_STRING1 - 1) a--; // Truncate too long words 61 | } 62 | word[a] = 0; 63 | } 64 | 65 | // Returns hash value of a word 66 | int GetWordHash1(char *word) { 67 | unsigned long long a, hash = 1; 68 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 69 | hash = hash % vocab_hash1_size1; 70 | return hash; 71 | } 72 | 73 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 74 | int SearchVocab1(char *word) { 75 | unsigned int hash = GetWordHash1(word); 76 | while (1) { 77 | if (vocab_hash1[hash] == -1) return -1; 78 | if (!strcmp(word, vocab1[vocab_hash1[hash]].word)) return vocab_hash1[hash]; 79 | hash = (hash + 1) % vocab_hash1_size1; 80 | } 81 | return -1; 82 | } 83 | 84 | // Reads a word and returns its index in the vocabulary 85 | int ReadWord1Index1(FILE *fin) { 86 | char word[MAX_STRING1]; 87 | ReadWord1(word, fin); 88 | if (feof(fin)) return -1; 89 | return SearchVocab1(word); 90 | } 91 | 92 | // Adds a word to the vocabulary 93 | int AddWordToVocab1(char *word) { 94 | unsigned int hash, length = strlen(word) + 1; 95 | if (length > MAX_STRING1) length = MAX_STRING1; 96 | vocab1[vocab_size1].word = (char *)calloc(length, sizeof(char)); 97 | strcpy(vocab1[vocab_size1].word, word); 98 | vocab1[vocab_size1].cn = 0; 99 | vocab_size1++; 100 | // Reallocate memory if needed 101 | if (vocab_size1 + 2 >= vocab_max_size1) { 102 | vocab_max_size1 += 10000; 103 | vocab1=(struct vocab_word1 *)realloc(vocab1, vocab_max_size1 * sizeof(struct vocab_word1)); 104 | } 105 | hash = GetWordHash1(word); 106 | while (vocab_hash1[hash] != -1) hash = (hash + 1) % vocab_hash1_size1; 107 | vocab_hash1[hash]=vocab_size1 - 1; 108 | return vocab_size1 - 1; 109 | } 110 | 111 | // Used later for sorting by word counts 112 | int VocabCompare1(const void *a, const void *b) { 113 | return ((struct vocab_word1 *)b)->cn - ((struct vocab_word1 *)a)->cn; 114 | } 115 | 116 | // Sorts the vocabulary by frequency using word counts 117 | void SortVocab1() { 118 | int a; 119 | unsigned int hash; 120 | // Sort the vocabulary and keep at the first position 121 | qsort(&vocab1[1], vocab_size1 - 1, sizeof(struct vocab_word1), VocabCompare1); 122 | for (a = 0; a < vocab_hash1_size1; a++) vocab_hash1[a] = -1; 123 | for (a = 0; a < vocab_size1; a++) { 124 | // Words occuring less than 
min_count1 times will be discarded from the vocab 125 | if (vocab1[a].cn < min_count1) { 126 | vocab_size1--; 127 | free(vocab1[vocab_size1].word); 128 | } else { 129 | // Hash will be re-computed, as after the sorting it is not actual 130 | hash = GetWordHash1(vocab1[a].word); 131 | while (vocab_hash1[hash] != -1) hash = (hash + 1) % vocab_hash1_size1; 132 | vocab_hash1[hash] = a; 133 | } 134 | } 135 | vocab1 = (struct vocab_word1 *)realloc(vocab1, vocab_size1 * sizeof(struct vocab_word1)); 136 | } 137 | 138 | // Reduces the vocabulary by removing infrequent tokens 139 | void ReduceVocab1() { 140 | int a, b = 0; 141 | unsigned int hash; 142 | for (a = 0; a < vocab_size1; a++) if (vocab1[a].cn > min_reduce1) { 143 | vocab1[b].cn = vocab1[a].cn; 144 | vocab1[b].word = vocab1[a].word; 145 | b++; 146 | } else free(vocab1[a].word); 147 | vocab_size1 = b; 148 | for (a = 0; a < vocab_hash1_size1; a++) vocab_hash1[a] = -1; 149 | for (a = 0; a < vocab_size1; a++) { 150 | // Hash will be re-computed, as it is not actual 151 | hash = GetWordHash1(vocab1[a].word); 152 | while (vocab_hash1[hash] != -1) hash = (hash + 1) % vocab_hash1_size1; 153 | vocab_hash1[hash] = a; 154 | } 155 | //fflush(stdout); 156 | min_reduce1++; 157 | } 158 | 159 | void LearnVocabFromTrainFile1() { 160 | char word[MAX_STRING1], last_word[MAX_STRING1], bigram_word[MAX_STRING1 * 2]; 161 | FILE *fin; 162 | long long a, i, start = 1; 163 | for (a = 0; a < vocab_hash1_size1; a++) vocab_hash1[a] = -1; 164 | fin = fopen(train_file1, "rb"); 165 | if (fin == NULL) { 166 | Rprintf("ERROR: training data file not found!\n"); 167 | return; 168 | } 169 | vocab_size1 = 0; 170 | AddWordToVocab1((char *)""); 171 | while (1) { 172 | ReadWord1(word, fin); 173 | if (feof(fin)) break; 174 | if (!strcmp(word, "")) { 175 | start = 1; 176 | continue; 177 | } else start = 0; 178 | train_words1++; 179 | if ((debug_mode1 > 1) && (train_words1 % 100000 == 0)) { 180 | Rprintf("Words processed: %lldK Vocab size: %lldK %c", train_words1 / 1000, vocab_size1 / 1000, 13); 181 | // fflush(stdout); 182 | } 183 | i = SearchVocab1(word); 184 | if (i == -1) { 185 | a = AddWordToVocab1(word); 186 | vocab1[a].cn = 1; 187 | } else vocab1[i].cn++; 188 | if (start) continue; 189 | sprintf(bigram_word, "%s_%s", last_word, word); 190 | bigram_word[MAX_STRING1 - 1] = 0; 191 | strcpy(last_word, word); 192 | i = SearchVocab1(bigram_word); 193 | if (i == -1) { 194 | a = AddWordToVocab1(bigram_word); 195 | vocab1[a].cn = 1; 196 | } else vocab1[i].cn++; 197 | if (vocab_size1 > vocab_hash1_size1 * 0.7) ReduceVocab1(); 198 | } 199 | SortVocab1(); 200 | if (debug_mode1 > 0) { 201 | Rprintf("\nVocab size (unigrams + bigrams): %lld\n", vocab_size1); 202 | Rprintf("Words in train file: %lld\n", train_words1); 203 | } 204 | fclose(fin); 205 | } 206 | 207 | void TrainModel1() { 208 | long long pa = 0, pb = 0, pab = 0, oov, i, li = -1, cn = 0; 209 | char word[MAX_STRING1], last_word[MAX_STRING1], bigram_word[MAX_STRING1 * 2]; 210 | real score; 211 | FILE *fo, *fin; 212 | Rprintf("Starting training using file %s\n", train_file1); 213 | LearnVocabFromTrainFile1(); 214 | fin = fopen(train_file1, "rb"); 215 | fo = fopen(output_file1, "wb"); 216 | word[0] = 0; 217 | while (1) { 218 | strcpy(last_word, word); 219 | ReadWord1(word, fin); 220 | if (feof(fin)) break; 221 | if (!strcmp(word, "")) { 222 | fprintf(fo, "\n"); 223 | continue; 224 | } 225 | cn++; 226 | if ((debug_mode1 > 1) && (cn % 100000 == 0)) { 227 | Rprintf("Words written: %lldK%c", cn / 1000, 13); 228 | // fflush(stdout); 229 | 
} 230 | oov = 0; 231 | i = SearchVocab1(word); 232 | if (i == -1) oov = 1; else pb = vocab1[i].cn; 233 | if (li == -1) oov = 1; 234 | li = i; 235 | sprintf(bigram_word, "%s_%s", last_word, word); 236 | bigram_word[MAX_STRING1 - 1] = 0; 237 | i = SearchVocab1(bigram_word); 238 | if (i == -1) oov = 1; else pab = vocab1[i].cn; 239 | if (pa < min_count1) oov = 1; 240 | if (pb < min_count1) oov = 1; 241 | if (oov) score = 0; else score = (pab - min_count1) / (real)pa / (real)pb * (real)train_words1; 242 | if (score > threshold1) { 243 | fprintf(fo, "_%s", word); 244 | pb = 0; 245 | } else fprintf(fo, " %s", word); 246 | pa = pb; 247 | } 248 | fclose(fo); 249 | fclose(fin); 250 | } 251 | 252 | 253 | void word2phrase(char **rtrain_file,int *rdebug_mode,char **routput_file,int *rmin_count,double *rthreshold) { 254 | /* 255 | if (argc == 1) { 256 | printf("WORD2PHRASE tool v0.1a\n\n"); 257 | printf("Options:\n"); 258 | printf("Parameters for training:\n"); 259 | printf("\t-train \n"); 260 | printf("\t\tUse text data from to train the model\n"); 261 | printf("\t-output \n"); 262 | printf("\t\tUse to save the resulting word vectors / word clusters / phrases\n"); 263 | printf("\t-min-count \n"); 264 | printf("\t\tThis will discard words that appear less than times; default is 5\n"); 265 | printf("\t-threshold1 \n"); 266 | printf("\t\t The value represents threshold1 for forming the phrases (higher means less phrases); default 100\n"); 267 | printf("\t-debug \n"); 268 | printf("\t\tSet the debug mode (default = 2 = more info during training)\n"); 269 | printf("\nExamples:\n"); 270 | printf("./word2phrase -train text.txt -output phrases.txt -threshold1 100 -debug 2\n\n"); 271 | return 0; 272 | } 273 | */ 274 | if (*rtrain_file[0]!='0') strcpy(train_file1, *rtrain_file); 275 | if (rdebug_mode[0]!=0) debug_mode1 = rdebug_mode[0]; 276 | if (*routput_file[0]!='0') strcpy(output_file1, *routput_file); 277 | if (rmin_count[0]!=0) min_count1 = rmin_count[0]; 278 | if (rthreshold[0]!= 0) threshold1=rthreshold[0]; 279 | vocab1 = (struct vocab_word1 *)calloc(vocab_max_size1, sizeof(struct vocab_word1)); 280 | vocab_hash1 = (int *)calloc(vocab_hash1_size1, sizeof(int)); 281 | TrainModel1(); 282 | 283 | } 284 | -------------------------------------------------------------------------------- /src/word2vec.h: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
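// Orientation: this header carries the bulk of Google's original word2vec
// implementation: vocabulary construction, the Huffman tree used for hierarchical
// softmax, the unigram table used for negative sampling, and the CBOW / skip-gram
// training threads. It is included by tmcn_word2vec.c, whose tmcn_word2vec()
// function fills in the globals declared below (layer1_size, num_threads, window,
// cbow, min_count, iter, negative, ...) and then calls TrainModel().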
14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include "R.h" 22 | #include "Rmath.h" 23 | 24 | 25 | #define MAX_STRING 100 26 | #define EXP_TABLE_SIZE 1000 27 | #define MAX_EXP 6 28 | #define MAX_SENTENCE_LENGTH 1000 29 | #define MAX_CODE_LENGTH 40 30 | 31 | const int vocab_hash_size = 30000000; // Maximum 30 * 0.7 = 21M words in the vocabulary 32 | 33 | typedef float real; // Precision of float numbers 34 | 35 | struct vocab_word { 36 | long long cn; 37 | int *point; 38 | char *word, *code, codelen; 39 | }; 40 | 41 | char train_file[1024], output_file[1024]; 42 | char save_vocab_file[MAX_STRING], read_vocab_file[MAX_STRING]; 43 | struct vocab_word *vocab; 44 | int binary = 0, cbow = 0, debug_mode = 2, window = 12, min_count = 5, num_threads = 1, min_reduce = 1; 45 | int *vocab_hash; 46 | long long vocab_max_size = 1000, vocab_size = 0, layer1_size = 100; 47 | long long train_words = 0, word_count_actual = 0, iter = 5, file_size = 0, classes = 0; 48 | real alpha = 0.025, starting_alpha, sample = 0; 49 | real *syn0, *syn1, *syn1neg, *expTable; 50 | clock_t start; 51 | 52 | int hs = 1, negative = 0; 53 | const int table_size = 1e8; 54 | int *table; 55 | 56 | 57 | void InitUnigramTable() { 58 | int a, i; 59 | long long train_words_pow = 0; 60 | real d1, power = 0.75; 61 | table = (int *)malloc(table_size * sizeof(int)); 62 | for (a = 0; a < vocab_size; a++) train_words_pow += pow(vocab[a].cn, power); 63 | i = 0; 64 | d1 = pow(vocab[i].cn, power) / (real)train_words_pow; 65 | for (a = 0; a < table_size; a++) { 66 | table[a] = i; 67 | if (a / (real)table_size > d1) { 68 | i++; 69 | d1 += pow(vocab[i].cn, power) / (real)train_words_pow; 70 | } 71 | if (i >= vocab_size) i = vocab_size - 1; 72 | } 73 | } 74 | 75 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 76 | void ReadWord(char *word, FILE *fin) { 77 | int a = 0, ch; 78 | while (!feof(fin)) { 79 | ch = fgetc(fin); 80 | if (ch == 13) continue; 81 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 82 | if (a > 0) { 83 | if (ch == '\n') ungetc(ch, fin); 84 | break; 85 | } 86 | if (ch == '\n') { 87 | strcpy(word, (char *)""); 88 | return; 89 | } else continue; 90 | } 91 | word[a] = ch; 92 | a++; 93 | if (a >= MAX_STRING - 1) a--; // Truncate too long words 94 | } 95 | word[a] = 0; 96 | } 97 | 98 | // Returns hash value of a word 99 | int GetWordHash(char *word) { 100 | unsigned long long a, hash = 0; 101 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 102 | hash = hash % vocab_hash_size; 103 | return hash; 104 | } 105 | 106 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 107 | int SearchVocab(char *word) { 108 | unsigned int hash = GetWordHash(word); 109 | while (1) { 110 | if (vocab_hash[hash] == -1) return -1; 111 | if (!strcmp(word, vocab[vocab_hash[hash]].word)) return vocab_hash[hash]; 112 | hash = (hash + 1) % vocab_hash_size; 113 | } 114 | return -1; 115 | } 116 | 117 | // Reads a word and returns its index in the vocabulary 118 | int ReadWordIndex(FILE *fin) { 119 | char word[MAX_STRING]; 120 | ReadWord(word, fin); 121 | if (feof(fin)) return -1; 122 | return SearchVocab(word); 123 | } 124 | 125 | // Adds a word to the vocabulary 126 | int AddWordToVocab(char *word) { 127 | unsigned int hash, length = strlen(word) + 1; 128 | if (length > MAX_STRING) length = MAX_STRING; 129 | vocab[vocab_size].word = (char *)calloc(length, sizeof(char)); 130 | strcpy(vocab[vocab_size].word, word); 131 | 
vocab[vocab_size].cn = 0; 132 | vocab_size++; 133 | // Reallocate memory if needed 134 | if (vocab_size + 2 >= vocab_max_size) { 135 | vocab_max_size += 1000; 136 | vocab = (struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word)); 137 | } 138 | hash = GetWordHash(word); 139 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 140 | vocab_hash[hash] = vocab_size - 1; 141 | return vocab_size - 1; 142 | } 143 | 144 | // Used later for sorting by word counts 145 | int VocabCompare(const void *a, const void *b) { 146 | return ((struct vocab_word *)b)->cn - ((struct vocab_word *)a)->cn; 147 | } 148 | 149 | // Sorts the vocabulary by frequency using word counts 150 | void SortVocab() { 151 | int a, size; 152 | unsigned int hash; 153 | // Sort the vocabulary and keep at the first position 154 | qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare); 155 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 156 | size = vocab_size; 157 | train_words = 0; 158 | for (a = 0; a < size; a++) { 159 | // Words occuring less than min_count times will be discarded from the vocab 160 | if (vocab[a].cn < min_count) { 161 | vocab_size--; 162 | free(vocab[vocab_size].word); 163 | } else { 164 | // Hash will be re-computed, as after the sorting it is not actual 165 | hash=GetWordHash(vocab[a].word); 166 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 167 | vocab_hash[hash] = a; 168 | train_words += vocab[a].cn; 169 | } 170 | } 171 | vocab = (struct vocab_word *)realloc(vocab, (vocab_size + 1) * sizeof(struct vocab_word)); 172 | // Allocate memory for the binary tree construction 173 | for (a = 0; a < vocab_size; a++) { 174 | vocab[a].code = (char *)calloc(MAX_CODE_LENGTH, sizeof(char)); 175 | vocab[a].point = (int *)calloc(MAX_CODE_LENGTH, sizeof(int)); 176 | } 177 | } 178 | 179 | // Reduces the vocabulary by removing infrequent tokens 180 | void ReduceVocab() { 181 | int a, b = 0; 182 | unsigned int hash; 183 | for (a = 0; a < vocab_size; a++) if (vocab[a].cn > min_reduce) { 184 | vocab[b].cn = vocab[a].cn; 185 | vocab[b].word = vocab[a].word; 186 | b++; 187 | } else free(vocab[a].word); 188 | vocab_size = b; 189 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 190 | for (a = 0; a < vocab_size; a++) { 191 | // Hash will be re-computed, as it is not actual 192 | hash = GetWordHash(vocab[a].word); 193 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 194 | vocab_hash[hash] = a; 195 | } 196 | fflush(NULL); 197 | min_reduce++; 198 | } 199 | 200 | // Create binary Huffman tree using the word counts 201 | // Frequent words will have short uniqe binary codes 202 | void CreateBinaryTree() { 203 | long long a, b, i, min1i, min2i, pos1, pos2, point[MAX_CODE_LENGTH]; 204 | char code[MAX_CODE_LENGTH]; 205 | long long *count = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 206 | long long *binary = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 207 | long long *parent_node = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long)); 208 | for (a = 0; a < vocab_size; a++) count[a] = vocab[a].cn; 209 | for (a = vocab_size; a < vocab_size * 2; a++) count[a] = 1e15; 210 | pos1 = vocab_size - 1; 211 | pos2 = vocab_size; 212 | // Following algorithm constructs the Huffman tree by adding one node at a time 213 | for (a = 0; a < vocab_size - 1; a++) { 214 | // First, find two smallest nodes 'min1, min2' 215 | if (pos1 >= 0) { 216 | if (count[pos1] < count[pos2]) { 217 | min1i = pos1; 218 | 
pos1--; 219 | } else { 220 | min1i = pos2; 221 | pos2++; 222 | } 223 | } else { 224 | min1i = pos2; 225 | pos2++; 226 | } 227 | if (pos1 >= 0) { 228 | if (count[pos1] < count[pos2]) { 229 | min2i = pos1; 230 | pos1--; 231 | } else { 232 | min2i = pos2; 233 | pos2++; 234 | } 235 | } else { 236 | min2i = pos2; 237 | pos2++; 238 | } 239 | count[vocab_size + a] = count[min1i] + count[min2i]; 240 | parent_node[min1i] = vocab_size + a; 241 | parent_node[min2i] = vocab_size + a; 242 | binary[min2i] = 1; 243 | } 244 | // Now assign binary code to each vocabulary word 245 | for (a = 0; a < vocab_size; a++) { 246 | b = a; 247 | i = 0; 248 | while (1) { 249 | code[i] = binary[b]; 250 | point[i] = b; 251 | i++; 252 | b = parent_node[b]; 253 | if (b == vocab_size * 2 - 2) break; 254 | } 255 | vocab[a].codelen = i; 256 | vocab[a].point[0] = vocab_size - 2; 257 | for (b = 0; b < i; b++) { 258 | vocab[a].code[i - b - 1] = code[b]; 259 | vocab[a].point[i - b] = point[b] - vocab_size; 260 | } 261 | } 262 | free(count); 263 | free(binary); 264 | free(parent_node); 265 | } 266 | 267 | void LearnVocabFromTrainFile() { 268 | char word[MAX_STRING]; 269 | FILE *fin; 270 | long long a, i; 271 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 272 | fin = fopen(train_file, "rb"); 273 | if (fin == NULL) { 274 | Rprintf("ERROR: training data file not found!\n"); 275 | Rf_error("Error!"); 276 | } 277 | vocab_size = 0; 278 | AddWordToVocab((char *)""); 279 | while (1) { 280 | ReadWord(word, fin); 281 | if (feof(fin)) break; 282 | train_words++; 283 | if ((debug_mode > 1) && (train_words % 100000 == 0)) { 284 | Rprintf("%lldK%c", train_words / 1000, 13); 285 | fflush(NULL); 286 | } 287 | i = SearchVocab(word); 288 | if (i == -1) { 289 | a = AddWordToVocab(word); 290 | vocab[a].cn = 1; 291 | } else vocab[i].cn++; 292 | if (vocab_size > vocab_hash_size * 0.7) ReduceVocab(); 293 | } 294 | SortVocab(); 295 | if (debug_mode > 0) { 296 | Rprintf("Vocab size: %lld\n", vocab_size); 297 | Rprintf("Words in train file: %lld\n", train_words); 298 | } 299 | file_size = ftell(fin); 300 | fclose(fin); 301 | } 302 | 303 | void SaveVocab() { 304 | long long i; 305 | FILE *fo = fopen(save_vocab_file, "wb"); 306 | for (i = 0; i < vocab_size; i++) fprintf(fo, "%s %lld\n", vocab[i].word, vocab[i].cn); 307 | fclose(fo); 308 | } 309 | 310 | void ReadVocab() { 311 | long long a, i = 0; 312 | char c; 313 | char word[MAX_STRING]; 314 | FILE *fin = fopen(read_vocab_file, "rb"); 315 | if (fin == NULL) { 316 | Rprintf("Vocabulary file not found\n"); 317 | Rf_error("Error!"); 318 | } 319 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 320 | vocab_size = 0; 321 | while (1) { 322 | ReadWord(word, fin); 323 | if (feof(fin)) break; 324 | a = AddWordToVocab(word); 325 | if(fscanf(fin, "%lld%c", &vocab[a].cn, &c)==1) 326 | ; 327 | i++; 328 | } 329 | SortVocab(); 330 | if (debug_mode > 0) { 331 | Rprintf("Vocab size: %lld\n", vocab_size); 332 | Rprintf("Words in train file: %lld\n", train_words); 333 | } 334 | fin = fopen(train_file, "rb"); 335 | if (fin == NULL) { 336 | Rprintf("ERROR: training data file not found!\n"); 337 | Rf_error("Error!"); 338 | } 339 | fseek(fin, 0, SEEK_END); 340 | file_size = ftell(fin); 341 | fclose(fin); 342 | } 343 | 344 | void InitNet() { 345 | long long a, b; 346 | unsigned long long next_random = 1; 347 | // a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real)); 348 | #ifdef _WIN32 349 | syn0 = (real *)_aligned_malloc((long long)vocab_size * layer1_size * 
sizeof(real), 128); 350 | #else 351 | a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real)); 352 | #endif 353 | 354 | if (syn0 == NULL) {Rprintf("Memory allocation failed\n"); Rf_error("Error!");} 355 | if (hs) { 356 | // a = posix_memalign((void **)&syn1, 128, (long long)vocab_size * layer1_size * sizeof(real)); 357 | #ifdef _WIN32 358 | syn1 = (real *)_aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128); 359 | #else 360 | a = posix_memalign((void **)&(syn1), 128, (long long)vocab_size * layer1_size * sizeof(real)); 361 | #endif 362 | 363 | if (syn1 == NULL) {Rprintf("Memory allocation failed\n"); Rf_error("Error!");} 364 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 365 | syn1[a * layer1_size + b] = 0; 366 | } 367 | if (negative>0) { 368 | // a = posix_memalign((void **)&syn1neg, 128, (long long)vocab_size * layer1_size * sizeof(real)); 369 | #ifdef _WIN32 370 | syn1neg = (real *)_aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128); 371 | #else 372 | a = posix_memalign((void **)&(syn1neg), 128, (long long)vocab_size * layer1_size * sizeof(real)); 373 | #endif 374 | 375 | if (syn1neg == NULL) {Rprintf("Memory allocation failed\n"); Rf_error("Error!");} 376 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) 377 | syn1neg[a * layer1_size + b] = 0; 378 | } 379 | for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) { 380 | next_random = next_random * (unsigned long long)25214903917 + 11; 381 | syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size; 382 | } 383 | CreateBinaryTree(); 384 | } 385 | 386 | void *TrainModelThread(void *id) { 387 | long long a, b, d, word, last_word, sentence_length = 0, sentence_position = 0; 388 | long long word_count = 0, last_word_count = 0, sen[MAX_SENTENCE_LENGTH + 1]; 389 | long long l1, l2, c, target, label, local_iter = iter; 390 | unsigned long long next_random = (long long)id; 391 | // real doneness_f, speed_f; 392 | // For writing to R. 
393 | real f, g; 394 | clock_t now; 395 | real *neu1 = (real *)calloc(layer1_size, sizeof(real)); 396 | real *neu1e = (real *)calloc(layer1_size, sizeof(real)); 397 | FILE *fi = fopen(train_file, "rb"); 398 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 399 | while (1) { 400 | if (word_count - last_word_count > 10000) { 401 | word_count_actual += word_count - last_word_count; 402 | last_word_count = word_count; 403 | /* if ((debug_mode > 1)) { */ 404 | /* now=clock(); */ 405 | /* doneness_f = word_count_actual / (real)(iter * train_words + 1) * 100; */ 406 | /* speed_f = word_count_actual / ((real)(now - start + 1) / (real)CLOCKS_PER_SEC * 1000); */ 407 | 408 | /* Rprintf("%cAlpha: %f Progress: %.2f%% Words/thread/sec: %.2fk ", 13, alpha, */ 409 | /* doneness_f, speed_f); */ 410 | 411 | /* fflush(NULL); */ 412 | /* } */ 413 | alpha = starting_alpha * (1 - word_count_actual / (real)(iter * train_words + 1)); 414 | if (alpha < starting_alpha * 0.0001) alpha = starting_alpha * 0.0001; 415 | } 416 | if (sentence_length == 0) { 417 | while (1) { 418 | word = ReadWordIndex(fi); 419 | if (feof(fi)) break; 420 | if (word == -1) continue; 421 | word_count++; 422 | if (word == 0) break; 423 | // The subsampling randomly discards frequent words while keeping the ranking same 424 | if (sample > 0) { 425 | real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn; 426 | next_random = next_random * (unsigned long long)25214903917 + 11; 427 | if (ran < (next_random & 0xFFFF) / (real)65536) continue; 428 | } 429 | sen[sentence_length] = word; 430 | sentence_length++; 431 | if (sentence_length >= MAX_SENTENCE_LENGTH) break; 432 | } 433 | sentence_position = 0; 434 | } 435 | if (feof(fi) || (word_count > train_words / num_threads)) { 436 | word_count_actual += word_count - last_word_count; 437 | local_iter--; 438 | if (local_iter == 0) break; 439 | word_count = 0; 440 | last_word_count = 0; 441 | sentence_length = 0; 442 | fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET); 443 | continue; 444 | } 445 | word = sen[sentence_position]; 446 | if (word == -1) continue; 447 | for (c = 0; c < layer1_size; c++) neu1[c] = 0; 448 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 449 | next_random = next_random * (unsigned long long)25214903917 + 11; 450 | b = next_random % window; 451 | if (cbow) { //train the cbow architecture 452 | // in -> hidden 453 | for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { 454 | c = sentence_position - window + a; 455 | if (c < 0) continue; 456 | if (c >= sentence_length) continue; 457 | last_word = sen[c]; 458 | if (last_word == -1) continue; 459 | for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; 460 | } 461 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 462 | f = 0; 463 | l2 = vocab[word].point[d] * layer1_size; 464 | // Propagate hidden -> output 465 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; 466 | if (f <= -MAX_EXP) continue; 467 | else if (f >= MAX_EXP) continue; 468 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 469 | // 'g' is the gradient multiplied by the learning rate 470 | g = (1 - vocab[word].code[d] - f) * alpha; 471 | // Propagate errors output -> hidden 472 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 473 | // Learn weights hidden -> output 474 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c]; 475 | } 476 | // NEGATIVE SAMPLING 477 | if 
(negative > 0) for (d = 0; d < negative + 1; d++) { 478 | if (d == 0) { 479 | target = word; 480 | label = 1; 481 | } else { 482 | next_random = next_random * (unsigned long long)25214903917 + 11; 483 | target = table[(next_random >> 16) % table_size]; 484 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 485 | if (target == word) continue; 486 | label = 0; 487 | } 488 | l2 = target * layer1_size; 489 | f = 0; 490 | for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2]; 491 | if (f > MAX_EXP) g = (label - 1) * alpha; 492 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 493 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 494 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 495 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c]; 496 | } 497 | // hidden -> in 498 | for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { 499 | c = sentence_position - window + a; 500 | if (c < 0) continue; 501 | if (c >= sentence_length) continue; 502 | last_word = sen[c]; 503 | if (last_word == -1) continue; 504 | for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c]; 505 | } 506 | } else { //train skip-gram 507 | for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { 508 | c = sentence_position - window + a; 509 | if (c < 0) continue; 510 | if (c >= sentence_length) continue; 511 | last_word = sen[c]; 512 | if (last_word == -1) continue; 513 | l1 = last_word * layer1_size; 514 | for (c = 0; c < layer1_size; c++) neu1e[c] = 0; 515 | // HIERARCHICAL SOFTMAX 516 | if (hs) for (d = 0; d < vocab[word].codelen; d++) { 517 | f = 0; 518 | l2 = vocab[word].point[d] * layer1_size; 519 | // Propagate hidden -> output 520 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2]; 521 | if (f <= -MAX_EXP) continue; 522 | else if (f >= MAX_EXP) continue; 523 | else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]; 524 | // 'g' is the gradient multiplied by the learning rate 525 | g = (1 - vocab[word].code[d] - f) * alpha; 526 | // Propagate errors output -> hidden 527 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2]; 528 | // Learn weights hidden -> output 529 | for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1]; 530 | } 531 | // NEGATIVE SAMPLING 532 | if (negative > 0) for (d = 0; d < negative + 1; d++) { 533 | if (d == 0) { 534 | target = word; 535 | label = 1; 536 | } else { 537 | next_random = next_random * (unsigned long long)25214903917 + 11; 538 | target = table[(next_random >> 16) % table_size]; 539 | if (target == 0) target = next_random % (vocab_size - 1) + 1; 540 | if (target == word) continue; 541 | label = 0; 542 | } 543 | l2 = target * layer1_size; 544 | f = 0; 545 | for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2]; 546 | if (f > MAX_EXP) g = (label - 1) * alpha; 547 | else if (f < -MAX_EXP) g = (label - 0) * alpha; 548 | else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; 549 | for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; 550 | for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1]; 551 | } 552 | // Learn weights input -> hidden 553 | for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c]; 554 | } 555 | } 556 | sentence_position++; 557 | if (sentence_position >= sentence_length) { 558 | sentence_length = 0; 559 | continue; 560 | } 561 | } 562 | fclose(fi); 563 | free(neu1); 564 | free(neu1e); 565 | 
pthread_exit(NULL); 566 | } 567 | 568 | void TrainModel() { 569 | long a, b, c, d; 570 | FILE *fo; 571 | pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t)); 572 | Rprintf("Starting training using file %s\n", train_file); 573 | starting_alpha = alpha; 574 | if (read_vocab_file[0] != 0) ReadVocab(); else LearnVocabFromTrainFile(); 575 | if (save_vocab_file[0] != 0) SaveVocab(); 576 | if (output_file[0] == 0) return; 577 | InitNet(); 578 | if (negative > 0) InitUnigramTable(); 579 | start = clock(); 580 | for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a); 581 | for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL); 582 | fo = fopen(output_file, "wb"); 583 | if (classes == 0) { 584 | // Save the word vectors 585 | fprintf(fo, "%lld %lld\n", vocab_size, layer1_size); 586 | for (a = 0; a < vocab_size; a++) { 587 | fprintf(fo, "%s ", vocab[a].word); 588 | if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo); 589 | else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]); 590 | fprintf(fo, "\n"); 591 | } 592 | } else { 593 | // Run K-means on the word vectors 594 | int clcn = classes, iter = 10, closeid; 595 | int *centcn = (int *)malloc(classes * sizeof(int)); 596 | int *cl = (int *)calloc(vocab_size, sizeof(int)); 597 | real closev, x; 598 | real *cent = (real *)calloc(classes * layer1_size, sizeof(real)); 599 | for (a = 0; a < vocab_size; a++) cl[a] = a % clcn; 600 | for (a = 0; a < iter; a++) { 601 | for (b = 0; b < clcn * layer1_size; b++) cent[b] = 0; 602 | for (b = 0; b < clcn; b++) centcn[b] = 1; 603 | for (c = 0; c < vocab_size; c++) { 604 | for (d = 0; d < layer1_size; d++) { 605 | cent[layer1_size * cl[c] + d] += syn0[c * layer1_size + d]; 606 | centcn[cl[c]]++; 607 | } 608 | } 609 | for (b = 0; b < clcn; b++) { 610 | closev = 0; 611 | for (c = 0; c < layer1_size; c++) { 612 | cent[layer1_size * b + c] /= centcn[b]; 613 | closev += cent[layer1_size * b + c] * cent[layer1_size * b + c]; 614 | } 615 | closev = sqrt(closev); 616 | for (c = 0; c < layer1_size; c++) cent[layer1_size * b + c] /= closev; 617 | } 618 | for (c = 0; c < vocab_size; c++) { 619 | closev = -10; 620 | closeid = 0; 621 | for (d = 0; d < clcn; d++) { 622 | x = 0; 623 | for (b = 0; b < layer1_size; b++) x += cent[layer1_size * d + b] * syn0[c * layer1_size + b]; 624 | if (x > closev) { 625 | closev = x; 626 | closeid = d; 627 | } 628 | } 629 | cl[c] = closeid; 630 | } 631 | } 632 | // Save the K-means classes 633 | for (a = 0; a < vocab_size; a++) fprintf(fo, "%s %d\n", vocab[a].word, cl[a]); 634 | free(centcn); 635 | free(cent); 636 | free(cl); 637 | } 638 | fclose(fo); 639 | } 640 | 641 | int ArgPos(char *str, int argc, char **argv) { 642 | int a; 643 | for (a = 1; a < argc; a++) if (!strcmp(str, argv[a])) { 644 | if (a == argc - 1) { 645 | Rprintf("Argument missing for %s\n", str); 646 | Rf_error("Error!"); 647 | } 648 | return a; 649 | } 650 | return -1; 651 | } 652 | 653 | -------------------------------------------------------------------------------- /tests/run-all.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | test_check("wordVectors") 3 | -------------------------------------------------------------------------------- /tests/testthat/test-linear-algebra-functions.R: -------------------------------------------------------------------------------- 1 | context("VectorSpaceModel Linear Algebra is sensible") 2 | 
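# The range checks below make sense if cosineDist is 1 - cosineSimilarity, since
# cosine similarity lies in [-1, 1]. The following sketch (an illustrative
# addition, not one of the original tests) spells that relationship out for a
# single pair of words.
test_that("cosineDist looks like 1 - cosineSimilarity for a single pair",
  expect_lt(
    abs(as.numeric(cosineDist(demo_vectors[["good"]], demo_vectors[["bad"]])) -
      (1 - as.numeric(cosineSimilarity(demo_vectors[["good"]], demo_vectors[["bad"]])))),
    1e-07
  )
)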
3 | test_that("Vectors are near to themselves", 4 | expect_lt( 5 | cosineDist(demo_vectors[1,],demo_vectors[1,]), 6 | 1e-07 7 | ) 8 | ) 9 | 10 | test_that("Distance is between 0 and 2 (pt 1)", 11 | expect_gt( 12 | min(cosineDist(demo_vectors,demo_vectors)), 13 | -1e-07 14 | ) 15 | ) 16 | 17 | test_that("Distance is between 0 and 2 (pt 1)", 18 | expect_lt( 19 | max(cosineDist(demo_vectors,demo_vectors)), 20 | 2 + 1e-07) 21 | ) 22 | 23 | 24 | test_that("Distance is between 0 and 2 (pt 1)", 25 | expect_lt( 26 | max(abs(1-square_magnitudes(normalize_lengths(demo_vectors)))), 27 | 1e-07) 28 | ) 29 | -------------------------------------------------------------------------------- /tests/testthat/test-name-collapsing.r: -------------------------------------------------------------------------------- 1 | context("Name collapsing") 2 | 3 | test_that("name substitution works", 4 | expect_equivalent( 5 | demo_vectors %>% closest_to(~"good") 6 | , 7 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 8 | ) 9 | ) 10 | 11 | test_that("character substitution works", 12 | expect_equivalent( 13 | demo_vectors %>% closest_to("good") 14 | , 15 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 16 | ) 17 | ) 18 | 19 | test_that("addition works in substitutions", 20 | expect_equivalent( 21 | demo_vectors %>% closest_to(~ "good" + "bad") 22 | , 23 | demo_vectors %>% closest_to(demo_vectors[["good"]] + demo_vectors[["bad"]]) 24 | ) 25 | ) 26 | 27 | test_that("addition provides correct results", 28 | expect_gt( 29 | demo_vectors[["good"]] %>% cosineSimilarity(demo_vectors[["good"]] + demo_vectors[["bad"]]) 30 | , 31 | .8)) 32 | 33 | test_that("single-argument negation works", 34 | expect_equivalent( 35 | demo_vectors %>% closest_to(~ -("good"-"bad")) 36 | , 37 | demo_vectors %>% closest_to(~ "bad"-"good") 38 | 39 | )) 40 | 41 | test_that("closest_to can wrap in function", 42 | expect_equal( 43 | {function(x) {closest_to(x,~ "class" + "school")}}(demo_vectors), 44 | closest_to(demo_vectors,~ "class" + "school") 45 | ) 46 | ) 47 | 48 | test_that("Name substitution is occurring", 49 | expect_equivalent( 50 | cosineSimilarity(demo_vectors,"good"), 51 | cosineSimilarity(demo_vectors,demo_vectors[["good"]]) 52 | )) 53 | 54 | test_that("reference in functional scope is passed along", 55 | expect_equivalent( 56 | lapply(c("good"),function(referenced_word) 57 | {demo_vectors %>% closest_to(demo_vectors[[referenced_word]])})[[1]], 58 | demo_vectors %>% closest_to("good") 59 | ) 60 | ) 61 | -------------------------------------------------------------------------------- /tests/testthat/test-read-write.R: -------------------------------------------------------------------------------- 1 | context("Read and Write works") 2 | 3 | ## TODO: Add tests for non-binary format; check actual value of results; test reading of slices. 
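# A sketch toward the TODO above (checking actual values on a round trip). The
# file name and tolerance are illustrative choices, and this assumes the binary
# writer stores the values as-is in single precision rather than normalizing them.
test_that("Binary round trip approximately preserves values (sketch)", {
  write.binary.word2vec(demo_vectors[1:100,], "tmp.bin")
  roundtrip <- read.binary.vectors("tmp.bin")
  expect_lt(max(abs(demo_vectors[1:100,] - roundtrip)), 1e-06)
})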
4 | 5 | test_that("Writing works", 6 | expect_null( 7 | write.binary.word2vec(demo_vectors[1:100,],"binary.bin"), 8 | 1e-07 9 | ) 10 | ) 11 | 12 | test_that("Reading Works", 13 | expect_s4_class( 14 | read.binary.vectors("binary.bin"), 15 | "VectorSpaceModel" 16 | ) 17 | ) 18 | 19 | -------------------------------------------------------------------------------- /tests/testthat/test-rejection.R: -------------------------------------------------------------------------------- 1 | context("Rejection Works") 2 | 3 | test_that("Rejection works along gender binary", 4 | expect_gt( 5 | { 6 | rejected_frame <- demo_vectors %>% reject(~ "man" - "woman") 7 | cosineDist(demo_vectors[["he"]],demo_vectors[["she"]] ) - 8 | cosineDist(rejected_frame[["he"]],rejected_frame[["she"]] ) 9 | }, 10 | .4 11 | ) 12 | ) 13 | -------------------------------------------------------------------------------- /tests/testthat/test-train.R: -------------------------------------------------------------------------------- 1 | context("Training Functions Work") 2 | 3 | # This fails on Travis. I'll worry about this later. 4 | demo = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. 5 | 6 | Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. 7 | 8 | But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth. 
9 | " 10 | 11 | message("In directory", getwd()) 12 | cat(demo,file = "input.txt") 13 | if (file.exists("tmp.txt")) file.remove("tmp.txt") 14 | 15 | test_that("Preparation produces file", 16 | expect_equal( 17 | prep_word2vec("input.txt","tmp.txt"), 18 | "tmp.txt" 19 | ) 20 | ) 21 | 22 | test_that("Preparation produces file", 23 | expect_equal( 24 | prep_word2vec("input.txt","tmp.txt"), 25 | "tmp.txt" 26 | ) 27 | ) 28 | 29 | test_that("Tokenization is the right length", 30 | expect_lt( 31 | 2, 32 | 272 - length(stringr::str_split(readr::read_file("tmp.txt"), " ")) 33 | ) 34 | ) 35 | if (FALSE) { 36 | test_that("Bundling works on multiple levels", 37 | expect_equal( 38 | prep_word2vec("input.txt","tmp.txt",bundle_ngrams = 3), 39 | "tmp.txt" 40 | ) 41 | ) 42 | } 43 | test_that("Training Works", 44 | expect_s4_class( 45 | train_word2vec("tmp.txt"), 46 | "VectorSpaceModel" 47 | ) 48 | ) 49 | 50 | -------------------------------------------------------------------------------- /tests/testthat/test-types.R: -------------------------------------------------------------------------------- 1 | context("VectorSpaceModel Class Works") 2 | 3 | test_that("Class Exists", 4 | expect_s4_class( 5 | demo_vectors, 6 | "VectorSpaceModel" 7 | ) 8 | ) 9 | 10 | test_that("Class inherits addition", 11 | expect_s4_class( 12 | demo_vectors+1, 13 | "VectorSpaceModel" 14 | ) 15 | ) 16 | 17 | test_that("Class inherits slices", 18 | expect_s4_class( 19 | demo_vectors[1,], 20 | "VectorSpaceModel" 21 | ) 22 | ) 23 | 24 | test_that("Slices aren't dropped in dimensionality", 25 | expect_s4_class( 26 | demo_vectors[1,], 27 | "matrix" 28 | ) 29 | ) 30 | -------------------------------------------------------------------------------- /vignettes/exploration.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec Workshop" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Vignette Title} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Exploring Word2Vec models 13 | 14 | R is a great language for *exploratory data analysis* in particular. If you're going to use a word2vec model in a larger pipeline, it may be important (intellectually or ethically) to spend a little while understanding what kind of model of language you've learned. 15 | 16 | This package makes it easy to do so, both by allowing you to read word2vec models to and from R, and by giving some syntactic sugar that lets you describe vector-space models concisely and clearly. 17 | 18 | Note that these functions may still be useful if you're a data analyst training word2vec models elsewhere (say, in gensim.) I'm also hopeful this can be a good way of interacting with varied vector models in a workshop session. 19 | 20 | If you want to train your own model or need help setting up the package, read the introductory vignette. Aside from the installation, it assumes more knowledge of R than this walkthrough. 21 | 22 | ## Why explore? 23 | 24 | In this vignette we're going to look at (a small portion of) a model trained on teaching evaluations. It's an interesting set, but it's also one that shows the importance of exploring vector space models before you use them. Exploration is important because: 25 | 26 | 1. If you're a humanist or social scientist, it can tell you something about the *sources* by letting you see how they use language. 
These co-occurrence patterns can then be better investigated through close reading or more traditional collocation scores, which potentially more reliable but also much slower and less flexible. 27 | 28 | 2. If you're an engineer, it can help you understand some of biases built into a model that you're using in a larger pipeline. This can be both technically and ethically important: you don't want, for instance, to build a job-recommendation system which is disinclined to offer programming jobs to women because it has learned that women are unrepresented in CS jobs already. 29 | (On this point in word2vec in particular, see [here](https://freedom-to-tinker.com/blog/randomwalker/language-necessarily-contains-human-biases-and-so-will-machines-trained-on-language-corpora/) and [here](https://arxiv.org/abs/1607.06520).) 30 | 31 | ## Getting started. 32 | 33 | First we'll load this package, and the recommended package `magrittr`, which lets us pass these arguments around. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | The basic element of any vector space model is a *vectors.* for each word. In the demo data included with this package, an object called 'demo_vectors,' there are 500 numbers: you can start to examine them, if you with, by hand. So let's consider just one of these--the vector for 'good'. 41 | 42 | In R's ordinary matrix syntax, you could write that out laboriously as `demo_vectors[rownames(demo_vectors)=="good",]`. `WordVectors` provides a shorthand using double braces: 43 | 44 | ```{r} 45 | demo_vectors[["good"]] 46 | ``` 47 | 48 | These numbers are meaningless on their own. But in the vector space, we can find similar words. 49 | 50 | ```{r} 51 | demo_vectors %>% closest_to(demo_vectors[["good"]]) 52 | ``` 53 | 54 | The `%>%` is the pipe operator from magrittr; it helps to keep things organized, and is particularly useful with some of the things we'll see later. The 'similarity' scores here are cosine similarity in a vector space; 1.0 represents perfect similarity, 0 is no correlation, and -1.0 is complete opposition. In practice, vector "opposition" is different from the colloquial use of "opposite," and very rare. You'll only occasionally see vector scores below 0--as you can see above, "bad" is actually one of the most similar words to "good." 55 | 56 | When interactively exploring a single model (rather than comparing *two* models), it can be a pain to keep retyping words over and over. Rather than operate on the vectors, this package also lets you access the word directly by using R's formula notation: putting a tilde in front of it. For a single word, you can even access it directly, as so. 57 | 58 | ```{r} 59 | demo_vectors %>% closest_to("bad") 60 | ``` 61 | 62 | ## Vector math 63 | 64 | The tildes are necessary syntax where things get interesting--you can do **math** on these vectors. So if we want to find the words that are closest to the *combination* of "good" and "bad" (which is to say, words that get used in evaluation) we can write (see where the tilde is?): 65 | 66 | ```{r} 67 | 68 | demo_vectors %>% closest_to(~"good"+"bad") 69 | 70 | # The same thing could be written as: 71 | # demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]]) 72 | ``` 73 | 74 | Those are words that are common to both "good" and "bad". We could also find words that are shaded towards just good but *not* bad by using subtraction. 
75 | 76 | ```{r} 77 | demo_vectors %>% closest_to(~"good" - "bad") 78 | ``` 79 | 80 | > What does this "subtraction" vector mean? 81 | > In practice, the easiest way to think of it is probably simply as 'similar to 82 | > good and dissimilar to bad'. Omer Levy's papers suggest this interpretation. 83 | > But taking the vectors more seriously means you can think of it geometrically: "good"-"bad" is 84 | > a vector that describes the difference between positive and negative. 85 | > Similarity to this vector means, technically, the portion of a word's vector 86 | > whose multidimensional path lies largely along the direction between the two words. 87 | 88 | Again, you can easily switch the order to the opposite: here are a bunch of bad words: 89 | 90 | ```{r} 91 | demo_vectors %>% closest_to(~ "bad" - "good") 92 | ``` 93 | 94 | All sorts of binaries are captured in word2vec models. One of the most famous, since Mikolov's original word2vec paper, is *gender*. If you ask for similarity to "he"-"she", for example, you get words that appear mostly in a *male* context. Since these examples are from teaching evaluations, after just a few straightforwardly gendered words, we start to get words that are mostly applied to men ("arrogant") or fields where there are very few women in the university ("physics"). 95 | 96 | ```{r} 97 | demo_vectors %>% closest_to(~ "he" - "she") 98 | demo_vectors %>% closest_to(~ "she" - "he") 99 | ``` 100 | 101 | ## Analogies 102 | 103 | We can expand out the match to perform analogies. Men tend to be called 'guys'. 104 | What's the female equivalent? 105 | In an SAT-style analogy, you might write `he:guy::she:???`. 106 | In vector math, we think of this as moving between points. 107 | 108 | If you're using the mental framework of positive as 'similarity' and 109 | negative as 'dissimilarity,' you can think of this as starting at "guy", 110 | removing its similarity to "he", and adding a similarity to "she". 111 | 112 | This yields the answer: the most similar term to "guy" for a woman is "lady." 113 | 114 | ```{r} 115 | demo_vectors %>% closest_to(~ "guy" - "he" + "she") 116 | ``` 117 | 118 | If you're using the other mental framework, of thinking of these as real vectors, 119 | you might phrase this in a slightly different way. 120 | You have a gender vector `("she" - "he")` that represents the *direction* from masculinity 121 | to femininity. You can then add this vector to "guy", and that will take you to a new neighborhood. You might phrase that this way: note that the math is exactly equivalent, and 122 | only the grouping is different. 123 | 124 | ```{r} 125 | demo_vectors %>% closest_to(~ "guy" + ("she" - "he")) 126 | ``` 127 | 128 | Principal components can let you plot a subset of these vectors to see how they relate. You can imagine an arrow from "he" to "she", from "guy" to "lady", and from "man" to "woman"; all run in roughly the same direction. 129 | 130 | ```{r} 131 | 132 | demo_vectors[[c("lady","woman","man","he","she","guy"), average=F]] %>% 133 | plot(method="pca") 134 | 135 | ``` 136 | 137 | These lists of ten words at a time are useful for interactive exploration, but sometimes we might want to say 'n=Inf' to return the full list. For instance, we can combine these two methods to look at positive and negative words used to evaluate teachers. 138 | 139 | First we build up three data frames: a list of the top 75 evaluative words, and then complete lists of similarity to `"good" - "bad"` and `"she" - "he"`.
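One brief aside before the code that builds them (this note is an addition to the walkthrough): `closest_to` returns an ordinary data frame, and its similarity column is named after the expression you searched with. That is why the ggplot call further down has to wrap column names like `similarity to "good" - "bad"` in backticks. A quick way to see this for yourself:

```{r}
# The column names carry the search expression itself.
demo_vectors %>% closest_to(~ "good" - "bad", n=3) %>% names()
```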
140 | 141 | ```{r} 142 | top_evaluative_words = demo_vectors %>% 143 | closest_to(~ "good"+"bad",n=75) 144 | 145 | goodness = demo_vectors %>% 146 | closest_to(~ "good"-"bad",n=Inf) 147 | 148 | femininity = demo_vectors %>% 149 | closest_to(~ "she" - "he", n=Inf) 150 | ``` 151 | 152 | Then we can use tidyverse packages to join and plot these. 153 | An `inner_join` restricts us down to just those top 75 words, and ggplot 154 | can array the words on axes. 155 | 156 | ```{r} 157 | library(ggplot2) 158 | library(dplyr) 159 | 160 | top_evaluative_words %>% 161 | inner_join(goodness) %>% 162 | inner_join(femininity) %>% 163 | ggplot() + 164 | geom_text(aes(x=`similarity to "she" - "he"`, 165 | y=`similarity to "good" - "bad"`, 166 | label=word)) 167 | ``` 168 | 169 | -------------------------------------------------------------------------------- /vignettes/introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Word2Vec introduction" 3 | author: "Ben Schmidt" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Word2Vec introduction} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | # Intro 13 | 14 | This vignette walks you through training a word2vec model, and using that model to search for similarities, to build clusters, and to visualize the vocabulary relationships of that model in two dimensions. If you are working with pre-trained vectors, you might want to jump straight to the "exploration" vignette; it is a little slower-paced, but doesn't show off quite so many features of the package. 15 | 16 | # Package installation 17 | 18 | If you have not installed this package, paste in the code below. More detailed installation instructions are at the end of the [package README](https://github.com/bmschmidt/wordVectors). 19 | 20 | ```{r} 21 | if (!require(wordVectors)) { 22 | if (!(require(devtools))) { 23 | install.packages("devtools") 24 | } 25 | devtools::install_github("bmschmidt/wordVectors") 26 | } 27 | 28 | 29 | ``` 30 | 31 | # Building test data 32 | 33 | We begin by importing the `wordVectors` package and the `magrittr` package, because its pipe operator makes it easier to work with data. 34 | 35 | ```{r} 36 | library(wordVectors) 37 | library(magrittr) 38 | ``` 39 | 40 | First we build up a test file to train on. 41 | As an example, we'll use a collection of cookbooks from Michigan State University. 42 | This will be downloaded from the Internet if it isn't already present. 43 | 44 | ```{r} 45 | if (!file.exists("cookbooks.zip")) { 46 | download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip") 47 | } 48 | unzip("cookbooks.zip",exdir="cookbooks") 49 | ``` 50 | 51 | 52 | Then we *prepare* a single file for word2vec to read in. This does a couple of things: 53 | 54 | 1. Creates a single text file with the contents of every file in the original directory; 55 | 2. Uses the `tokenizers` package to clean and lowercase the original text; 56 | 3. If `bundle_ngrams` is greater than 1, joins together common bigrams into a single word. For example, "olive oil" may be joined together into "olive_oil" wherever it occurs. 57 | 58 | You can also do this preparation in another language: particularly for large files, that will be **much** faster. (For reference: in a console, `perl -ne 's/[^A-Za-z_0-9 \n]/ /g; print lc $_;' cookbooks/*.txt > cookbooks.txt` will do much the same thing on ASCII text in a couple of seconds.)
If you do this and want to bundle ngrams, you'll then need to call `word2phrase("cookbooks.txt","cookbook_bigrams.txt",...)` to build up the bigrams; call it twice if you want 3-grams, and so forth. 59 | 60 | 61 | ```{r} 62 | if (!file.exists("cookbooks.txt")) prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=2) 63 | ``` 64 | 65 | To train a word2vec model, use the function `train_word2vec`. This actually builds up the model. It uses an on-disk file as an intermediary and then reads that file into memory. 66 | 67 | ```{r} 68 | if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin") 69 | 70 | ``` 71 | 72 | A few notes: 73 | 74 | 1. The `vectors` parameter is the dimensionality of the representation. More vectors usually means more precision, but also more random error and slower operations. Likely choices are probably in the range 100-500. 75 | 2. The `threads` parameter is the number of processors to use on your computer. On a modern laptop, the fastest results will probably come from using between 2 and 8 threads, depending on the number of cores. 76 | 3. `iter` is how many times to read through the corpus. With fewer than 100 books, it can greatly help to increase the number of passes; if you're working with billions of words, it probably matters less. One danger of too few iterations is that words that aren't closely related will seem to be closer than they are. 77 | 4. Training can take a while. On my laptop, it takes a few minutes to train these cookbooks; larger models take proportionally more time. Because of the importance of more iterations to reducing noise, don't be afraid to set things up to require a lot of training time (as much as a day!). 78 | 5. One of the best things about the word2vec algorithm is that it *does* work on extremely large corpora in linear time. 79 | 6. In RStudio I've noticed that this sometimes appears to hang after a while; the percentage bar stops updating. If you check system activity it actually is still running, and will complete. 80 | 7. If at any point you want to *read in* a previously trained model, you can do so by typing `model = read.vectors("cookbook_vectors.bin")`. 81 | 82 | Now we have a model in memory, trained on about 10 million words from 77 cookbooks. What can it tell us about food? 83 | 84 | ## Similarity searches 85 | 86 | Well, you can run some basic operations to find the nearest elements: 87 | 88 | ```{r} 89 | model %>% closest_to("fish") 90 | ``` 91 | 92 | With that list, you can expand out further to search for multiple words: 93 | 94 | ```{r} 95 | model %>% 96 | closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50) 97 | ``` 98 | 99 | Now we have a pretty expansive list of potential fish-related words from old cookbooks. This can be useful for a few different things: 100 | 101 | 1. As a list of potential query terms for keyword search. 102 | 2. As a batch of words to use as a seed for some other text-mining operation; for example, you could pull all the paragraphs surrounding these terms to find the ways that fish are cooked. (A rough sketch of this idea appears just below the list.) 103 | 3. As a source for visualization. 104 | 105 | Or we can simply plot them to see how they arrange themselves; in this case, as the principal-components plot in the code block after that sketch shows, they don't form much visible structure.
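Here is one very rough version of that seed-word idea. This block is an addition to the walkthrough rather than part of the original text: the names `fish_terms`, `fish_pattern`, and `example_file` are made up for the example, and it assumes the `cookbooks/` directory unzipped earlier is still on disk. The original principal-components plot of the fish words follows it.

```{r}
# Rough sketch: use the expanded fish vocabulary as a crude keyword filter
# over one of the raw cookbook files.
fish_terms = model %>%
  closest_to(model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],50)
fish_pattern = paste0("\\b(", paste(fish_terms$word, collapse="|"), ")\\b")
example_file = list.files("cookbooks", full.names=TRUE)[1]
fishy_lines = grep(fish_pattern, tolower(readLines(example_file)), value=TRUE)
head(fishy_lines)
```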
106 | 107 | ```{r} 108 | some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150) 109 | fishy = model[[some_fish$word,average=F]] 110 | plot(fishy,method="pca") 111 | ``` 112 | 113 | ## Clustering 114 | 115 | We can use standard clustering algorithms, like kmeans, to find groups of terms that fit together. You can think of this as a sort of topic model, although unlike more sophisticated topic modeling algorithms like Latent Dirichlet Allocation, each word must be tied to a single topic. 116 | 117 | ```{r} 118 | set.seed(10) 119 | centers = 150 120 | clustering = kmeans(model,centers=centers,iter.max = 40) 121 | ``` 122 | 123 | Here are ten random "topics" produced through this method. Each column contains the ten most frequent words in one randomly chosen cluster. 124 | 125 | ```{r} 126 | sapply(sample(1:centers,10),function(n) { 127 | names(clustering$cluster[clustering$cluster==n][1:10]) 128 | }) 129 | ``` 130 | 131 | These can be useful for figuring out, at a glance, what some of the overall common clusters in your corpus are. 132 | 133 | Clusters need not be derived at the level of the full model. We can take, for instance, 134 | the 20 words closest to each of four different kinds of words. 135 | 136 | ```{r} 137 | ingredients = c("madeira","beef","saucepan","carrots") 138 | term_set = lapply(ingredients, 139 | function(ingredient) { 140 | nearest_words = model %>% closest_to(model[[ingredient]],20) 141 | nearest_words$word 142 | }) %>% unlist 143 | 144 | subset = model[[term_set,average=F]] 145 | 146 | subset %>% 147 | cosineDist(subset) %>% 148 | as.dist %>% 149 | hclust %>% 150 | plot 151 | 152 | ``` 153 | 154 | 155 | # Visualization 156 | 157 | ## Relationship planes. 158 | 159 | One of the basic strategies you can take is to try to project the high-dimensional space here into a plane you can look at. 160 | 161 | For instance, we can take the words "sweet" and "salty," find the twenty words most similar to either of them, and plot those in a sweet-salty plane. 162 | 163 | ```{r} 164 | tastes = model[[c("sweet","salty"),average=F]] 165 | 166 | # model[1:3000,] here restricts to the 3000 most common words in the set. 167 | sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes) 168 | 169 | # Filter to the top 20 sweet or salty. 170 | sweet_and_saltiness = sweet_and_saltiness[ 171 | rank(-sweet_and_saltiness[,1])<20 | 172 | rank(-sweet_and_saltiness[,2])<20, 173 | ] 174 | 175 | plot(sweet_and_saltiness,type='n') 176 | text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness)) 177 | 178 | ``` 179 | 180 | 181 | There's no limit to how complicated this can get. For instance, there are really *five* tastes: sweet, salty, bitter, sour, and savory. (Savory is usually called 'umami' nowadays, but that word will not appear in historic cookbooks.) 182 | 183 | Rather than use a base matrix of the whole set, we can shrink down to just five dimensions: how similar every word in our set is to each of these five. (I'm using cosine similarity here, so the closer a number is to one, the more similar it is.) 184 | 185 | ```{r} 186 | 187 | tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]] 188 | 189 | # model[1:3000,] here restricts to the 3000 most common words in the set.
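# A note for clarity (added to this walkthrough): the cosineSimilarity() call
# below returns a matrix with one row for each of the 3000 common words and one
# column for each of the five taste words, so every entry is a single
# word-to-taste similarity. dim(common_similarities_tastes) should therefore
# come out to 3000 x 5.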
190 | 191 | common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes) 192 | 193 | common_similarities_tastes[20:30,] 194 | ``` 195 | 196 | Now we can filter down to the 75 words that are closest to *any* of these tastes (that's what the apply-max filter below does), and 197 | use a PCA biplot to look at those words in a flavor plane. 198 | 199 | ```{r} 200 | high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) <= 75,] 201 | 202 | high_similarities_to_tastes %>% 203 | prcomp %>% 204 | biplot(main="Seventy-five words in a\nprojection of flavor space") 205 | ``` 206 | 207 | This tells us a few things. One is that (in some runs of the model, at least--there is some random chance built in here) "sweet" and "sour" are closely aligned. Is this a unique feature of American cooking? A relationship that changes over time? These questions would require more investigation. 208 | 209 | Second is that "savory" really is an operative category in these cookbooks, even without the precision of 'umami' as a word to express it. Anchovy, the flavor most closely associated with savoriness, shows up as fairly characteristic of the flavor, along with a variety of herbs. 210 | 211 | Finally, words characteristic of whole meals seem to show up in the upper region of the plot. 212 | 213 | # Catchall reduction: TSNE 214 | 215 | Last but not least, there is a catchall method built into the package 216 | to visualize a single, reasonably good plane for viewing the whole model: TSNE dimensionality reduction. 217 | 218 | Just calling "plot" will display the equivalent of a word cloud with individual tokens grouped relatively close to each other based on their proximity in the higher-dimensional space. 219 | 220 | "Perplexity" is, roughly, the number of neighbors the algorithm tries to preserve for each word. By default it's 50; smaller numbers may cause clusters to appear more dramatically at the cost of overall coherence. 221 | 222 | ```{r} 223 | plot(model,perplexity=50) 224 | ``` 225 | 226 | A few notes on this method: 227 | 228 | 1. If you don't get local clusters, it is not working. You might need to reduce the perplexity so that clusters are smaller; or you might not have good local similarities. 229 | 2. If you're plotting only a small set of words, you're better off trying to plot a `VectorSpaceModel` with `method="pca"`, which locates the points using principal components analysis.
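To make that last note concrete, here is one way it might look. This example is an addition to the walkthrough rather than part of the original text; the word list is arbitrary (just the five taste words plus "anchovy"), and it assumes the `model` object trained above is still in memory.

```{r}
# Sketch of note 2: a PCA plot of a small, hand-picked subset of words.
small_set = model[[c("sweet","salty","savory","bitter","sour","anchovy"), average=F]]
plot(small_set, method="pca")
```
230 | --------------------------------------------------------------------------------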