├── LICENSE ├── PhraseAnalysis.ipynb ├── README.md ├── Word2Vec.ipynb ├── WordCloud.ipynb ├── alltitles.npy ├── alltitles.txt ├── arXivHarvest.py ├── askmodel.py ├── caltechmask.png ├── caltechwordcloud.png ├── condmat-model-window-10-mincount-5-size-100 ├── helper.py ├── numpapers.png ├── parsetitles.py └── trainmodel.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Everard van Nieuwenburg 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # physics2vec 2 | Things to do with arXiv metadata :-) 3 | 4 | ## Summary 5 | This repository is (currently) a collection of python scripts and notebooks that 6 | 1. Do a **Word2Vec encoding** of physics jargon (using gensim's CBOW or skip-gram, if you care for specifics). 7 | 8 | Examples: "particle + charge = electron" and "majorana + braiding = non-abelian" 9 | Remark: These examples were _learned_ from the cond-mat section titles only. 10 | 11 | 2. Analyze the **n-grams** (i.e. fixed n-word expressions) in the titles over the years (what should we work on? ;-)) 12 | 3. Produce a **WordCloud** of your favorite arXiv section (such as the above, from the cond-mat section) 13 | ![alt text](https://raw.githubusercontent.com/everthemore/physics2vec/master/caltechwordcloud.png "arXiv:cond-mat wordcloud") 14 | 15 | ## Notes 16 | These scripts were tested and run using **Python 3**. I have not checked backwards compatibility, but I have heard from people who managed to get it to work in **Python 2** too! Feel free to reach out to me in case things don't work out-of-the-box. I have not (yet) tried to make the scripts and notebooks super user-friendly, though I did try to comment the code such that you may figure things out by 17 | trial-and-error. 18 | 19 | ## Quickstart ## 20 | If you're already familiar with python, all you need to have are the modules numpy, pyoai, inflect and gensim. These should all be easy to install using pip/pip3. Then the workflow is as follows (I used python3): 21 | 1. python arXivHarvest.py --section physics:cond-mat --output condmattitles.txt 22 | 2. python parsetitles.py --input condmattitles.txt --output condmattitles.npy 23 | 3. 
python trainmodel.py --input condmattitles.npy --size 100 --window 10 --mincount 5 --output condmatmodel-100-10-5
24 | 4. python askmodel.py --input condmatmodel-100-10-5 --add particle charge
25 | 
26 | In step 1, we get the titles from arXiv. This is a time-consuming step; it took 1.5hrs for the physics:cond-mat section, and so I've provided the files for those in the repository already (i.e. you can skip steps 1 and 2). In step 2 we take out the weird symbols etc, and parse it into a \*.npy file. In the third step, we train a model with vector size 100, window size 10 and minimum count for words to participate of 5. Step 4 can be repeated as often as one desires.
27 | 
28 | ## More details
29 | Apart from the above scripts, I provide 3 python notebooks that perform more than just the analysis of arXiv titles. I highly
30 | recommend using notebooks: they are easy to install and super useful. See here: http://jupyter.org/. You can also just copy-and-paste the code from the notebooks into a \*.py script and run those.
31 | 
32 | You are going to need the following python modules in addition, all installable using pip3 (sudo pip3 install [module-name]).
33 | 
34 | 1. numpy
35 | 
36 | Must-have for anything scientific you want to do with python (arrays, linalg)
37 | Numpy (http://www.numpy.org/)
38 | 
39 | 2. pyoai
40 | 
41 | Open Archive Initiative module for querying the arXiv servers for metadata
42 | https://pypi.python.org/pypi/pyoai
43 | 
44 | 3. inflect
45 | 
46 | Module for generating/checking plural/singular versions of words
47 | https://pypi.python.org/pypi/inflect
48 | 
49 | 4. gensim
50 | 
51 | Very versatile module for topic modelling (analyzing basically anything you want from text, including word2vec)
52 | https://radimrehurek.com/gensim/
53 | 
54 | Not required, but highly recommended is the module "matplotlib" for creating plots. You can comment/remove the
55 | sections in the code that refer to it if you really don't want to.
56 | 
57 | Optionally, if you wish to make a WordCloud, you will need
58 | 
59 | 5. Matplotlib (https://matplotlib.org/)
60 | 6. PIL (http://www.pythonware.com/products/pil/)
61 | 7. WordCloud (https://github.com/amueller/word_cloud)
62 | 
--------------------------------------------------------------------------------
/Word2Vec.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import numpy as np\n",
12 | "\n",
13 | "# Be sure to restart the notebook kernel if you make changes to the helper module (helper.py)\n",
14 | "# Re-running this cell does not re-load the module otherwise\n",
15 | "from helper import *\n",
16 | "\n",
17 | "# We use matplotlib for plotting. You can basically get any plot layout/style\n",
18 | "# etc you want with this module. I'm setting it up for basics here, meaning\n",
19 | "# that I want it to parse LaTeX and use the LaTeX font family for all text.\n",
20 | "# !! If you don't have a LaTeX distribution installed, this notebook may\n",
21 | "# throw errors when it tries to create the plots. If that happens, \n",
22 | "# either install a LaTeX distribution or remove/comment the \n",
23 | "# matplotlib.rcParams.update(...)
line.\n", 24 | "# In both cases, restart the kernel of this notebook afterwards.\n", 25 | "import matplotlib\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "%matplotlib inline\n", 28 | "\n", 29 | "rcparams = { \n", 30 | " \"pgf.texsystem\": \"pdflatex\", # change this if using xetex or lautex\n", 31 | " \"text.usetex\": True, # use LaTeX to write all text\n", 32 | " \"font.family\": \"lmodern\",\n", 33 | " \"font.serif\": [], # blank entries should cause plots to inherit fonts from the document\n", 34 | " \"font.sans-serif\": [],\n", 35 | " \"font.monospace\": [], \n", 36 | " \"font.size\": 12,\n", 37 | " \"legend.fontsize\": 12, \n", 38 | " \"xtick.labelsize\": 12,\n", 39 | " \"ytick.labelsize\": 12,\n", 40 | " \"pgf.preamble\": [\n", 41 | " r\"\\usepackage[utf8x]{inputenc}\", # use utf8 fonts becasue your computer can handle it :)\n", 42 | " r\"\\usepackage[T1]{fontenc}\", # plots will be generated using this preamble\n", 43 | " ]\n", 44 | "}\n", 45 | "matplotlib.rcParams.update(rcparams)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "# Load the title dataset" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": { 59 | "collapsed": true 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "re_parse = False\n", 64 | "if re_parse:\n", 65 | " all_titles = load_and_parse_all_titles('alltitles.txt')\n", 66 | " # Save to a file, so we can load it much faster than having\n", 67 | " # to re-parse the raw data.\n", 68 | " np.save(\"alltitles.npy\", all_titles)\n", 69 | "else:\n", 70 | " # Load the titles from the file.\n", 71 | " # The atleast_2d is a hack for correctly loading the dictionary...\n", 72 | " all_titles = np.atleast_2d(np.load(\"alltitles.npy\"))[0][0]" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "# Check the available years\n", 90 | "all_years = sorted(list(all_titles.keys()))\n", 91 | "print(all_years)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "# Train Word2Vec" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "metadata": { 105 | "collapsed": true, 106 | "scrolled": true 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "titles = get_titles_for_years(all_titles, all_years)\n", 111 | "ngram_titles, bigrams, ngrams = get_ngrams(titles)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 14, 117 | "metadata": { 118 | "collapsed": true 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "# train word2vec \n", 123 | "model = gensim.models.Word2Vec(ngram_titles, window=25, min_count=5, size=100)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 19, 129 | "metadata": { 130 | "scrolled": true 131 | }, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "Similarity: \n", 138 | "A superconductor is similar to: \n", 139 | " [*] layered_superconductor \t (0.7722122669219971)\n", 140 | " [*] superconducting \t (0.7567377090454102)\n", 141 | " [*] unconventional_superconductor \t (0.7531686425209045)\n", 142 | " [*] cuprate_superconductor \t 
(0.7451467514038086)\n", 143 | " [*] superconductivity \t (0.733198881149292)\n", 144 | " [*] multiband_superconductor \t (0.7226791977882385)\n", 145 | " [*] superconducting_gap \t (0.7021770477294922)\n", 146 | " [*] cuprate \t (0.6682584285736084)\n", 147 | " [*] weyl_semimetal \t (0.6146906614303589)\n", 148 | " [*] noncentrosymmetric_superconductor \t (0.612571120262146)\n", 149 | "Majorana is similar to: \n", 150 | " [*] majorana_fermion \t (0.8646294474601746)\n", 151 | " [*] majorana_mode \t (0.8107954859733582)\n", 152 | " [*] non_abelian \t (0.7987779974937439)\n", 153 | " [*] braiding \t (0.7585236430168152)\n", 154 | " [*] topologically_protected \t (0.7555981278419495)\n", 155 | " [*] parity \t (0.7497479915618896)\n", 156 | " [*] andreev \t (0.73931485414505)\n", 157 | " [*] majorana_bound \t (0.7324416041374207)\n", 158 | " [*] kramer_pair \t (0.7297208309173584)\n", 159 | " [*] protected \t (0.7294089198112488)\n", 160 | "Topological is similar to: \n", 161 | " [*] topological_insulator \t (0.6998741626739502)\n", 162 | " [*] weyl \t (0.6658810973167419)\n", 163 | " [*] majorana \t (0.6639574766159058)\n", 164 | " [*] topologically_protected \t (0.6560168266296387)\n", 165 | " [*] chiral \t (0.6515534520149231)\n", 166 | " [*] floquet_topological \t (0.6458501219749451)\n", 167 | " [*] non_abelian \t (0.6349783539772034)\n", 168 | " [*] gapless \t (0.630128026008606)\n", 169 | " [*] topologically \t (0.627392590045929)\n", 170 | " [*] majorana_fermion \t (0.6268280744552612)\n", 171 | "A phonon is similar to: \n", 172 | " [*] optical_absorption \t (0.5985944271087646)\n", 173 | " [*] plasmon \t (0.586416482925415)\n", 174 | " [*] ionized_impurity \t (0.5812559127807617)\n", 175 | " [*] acoustic_phonon \t (0.5808508396148682)\n", 176 | " [*] carrier \t (0.569900631904602)\n", 177 | " [*] intraband \t (0.5654160380363464)\n", 178 | " [*] raman \t (0.5651483535766602)\n", 179 | " [*] incoherent \t (0.5587544441223145)\n", 180 | " [*] photoexcited \t (0.5581139326095581)\n", 181 | " [*] charge_carrier \t (0.5440508127212524)\n", 182 | "\n", 183 | "\n", 184 | "Arithmetics: \n", 185 | "Majorana + Braiding = \n", 186 | " [*] majorana_mode \t (0.8562889099121094)\n", 187 | " [*] non_abelian \t (0.8480815887451172)\n", 188 | "wave + lattice + force = \n", 189 | " [*] breather \t (0.5430172085762024)\n", 190 | " [*] charged_particle \t (0.5301461219787598)\n", 191 | " [*] vortice \t (0.5298101305961609)\n", 192 | "particle + charge = \n", 193 | " [*] electron \t (0.5534933805465698)\n", 194 | " [*] charged_particle \t (0.4921059310436249)\n", 195 | "electron - charge = \n", 196 | " [*] many_body \t (0.5787447690963745)\n", 197 | " [*] qubit_gate \t (0.5545492768287659)\n", 198 | "2D + electrons + magnetic field = \n", 199 | " [*] landau_level \t (0.6205390095710754)\n", 200 | " [*] carrier_density \t (0.5904487371444702)\n", 201 | "Electron + Hole = \n", 202 | " [*] carrier \t (0.6992154717445374)\n", 203 | " [*] gaas \t (0.6459619998931885)\n", 204 | "Superconductor + Topological = \n", 205 | " [*] weyl_semimetal \t (0.7405728697776794)\n", 206 | " [*] topological_insulator \t (0.7248612642288208)\n", 207 | "Spin + Magnetic Field = \n", 208 | " [*] magnetization \t (0.6871200203895569)\n", 209 | " [*] antiferromagnetic \t (0.6532838940620422)\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "print(\"Similarity: \")\n", 215 | "print(\"A superconductor is similar to: \")\n", 216 | "for s in model.most_similar(positive=['superconductor'], topn=10):\n", 217 | " print(\" [*] {0:35} 
\\t ({1})\".format(s[0], s[1]))\n", 218 | " \n", 219 | "print(\"Majorana is similar to: \")\n", 220 | "for s in model.most_similar(positive=['majorana'], topn=10):\n", 221 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 222 | "\n", 223 | "print(\"Topological is similar to: \") \n", 224 | "for s in model.most_similar(positive=['topological'], topn=10):\n", 225 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 226 | "\n", 227 | "print(\"A phonon is similar to: \")\n", 228 | "for s in model.most_similar(positive=['phonon'], topn=10):\n", 229 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 230 | " \n", 231 | "print(\"\\n\")\n", 232 | "print(\"Arithmetics: \")\n", 233 | "print(\"Majorana + Braiding = \")\n", 234 | "for s in model.most_similar(positive=['majorana', 'braiding'], topn=2):\n", 235 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 236 | " \n", 237 | "print(\"particle + charge = \")\n", 238 | "for s in model.most_similar(positive=['particle', 'charge'], topn=2):\n", 239 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 240 | " \n", 241 | "print(\"electron - charge = \")\n", 242 | "for s in model.most_similar(positive=['electron', 'positive'], negative=['negative'], topn=2):\n", 243 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 244 | " \n", 245 | "print(\"2D + electrons + magnetic field = \")\n", 246 | "for s in model.most_similar(positive=['two_dimensional', 'electron', 'magnetic_field'], topn=2):\n", 247 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 248 | "\n", 249 | "print(\"Electron + Hole = \")\n", 250 | "for s in model.most_similar(positive=['electron', 'hole'], topn=2):\n", 251 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 252 | " \n", 253 | "print(\"Superconductor + Topological = \")\n", 254 | "for s in model.most_similar(positive=['superconductor', 'topological'], topn=2):\n", 255 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 256 | " \n", 257 | "print(\"Spin + Magnetic Field = \")\n", 258 | "for s in model.most_similar(positive=['spin', 'magnetic_field'], topn=2):\n", 259 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 13, 265 | "metadata": { 266 | "collapsed": true 267 | }, 268 | "outputs": [], 269 | "source": [ 270 | "# If you want to save and/or load a model:\n", 271 | "model.save(\"condmat-model-window-25-mincount-5-size-100\")\n", 272 | "#model = gensim.models.Word2Vec.load(\"condmat-model-window-10-mincount-5-size-100\")" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 11, 278 | "metadata": { 279 | "collapsed": true 280 | }, 281 | "outputs": [], 282 | "source": [] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": { 288 | "collapsed": true 289 | }, 290 | "outputs": [], 291 | "source": [] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "## Clustering" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 11, 303 | "metadata": { 304 | "collapsed": true 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "from sklearn.cluster import KMeans\n", 309 | "kmeans = KMeans(n_clusters=500, random_state=0).fit(model.wv.syn0)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 12, 315 | "metadata": { 316 | "collapsed": true 317 | }, 318 | "outputs": [], 319 | "source": [ 320 | "sets = {}\n", 321 | "for l in 
np.unique(kmeans.labels_):\n", 322 | " sets[l] = []\n", 323 | "for idx,l in enumerate(sorted(kmeans.labels_)):\n", 324 | " sets[l].append(model.wv.index2word[idx])" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 13, 330 | "metadata": { 331 | "scrolled": true 332 | }, 333 | "outputs": [ 334 | { 335 | "name": "stdout", 336 | "output_type": "stream", 337 | "text": [ 338 | "dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499])\n", 339 | "0 6974\n", 340 | "1 101\n", 341 | "2 3\n", 342 | "3 76\n", 343 | "4 173\n", 344 | "5 563\n", 345 | "6 2\n", 346 | "7 11\n", 347 | "8 20\n", 348 | "9 11\n", 349 | "10 4\n", 350 | "11 27\n", 351 | "12 31\n", 352 | "13 28\n", 353 | "14 4\n", 354 | "15 9\n", 355 | "16 30\n", 356 | "17 2\n", 357 | "18 6\n", 358 | "19 9\n", 359 | "20 4\n", 360 | "21 7\n", 361 | "22 57\n", 362 | "23 13\n", 363 | "24 1\n", 364 | "25 19\n", 365 | "26 7\n", 366 | "27 1\n", 367 | "28 22\n", 368 | "29 6\n", 369 | "30 1\n", 370 | "31 2\n", 371 | "32 31\n", 372 | "33 1\n", 373 | "34 68\n", 374 | "35 1\n", 375 | "36 565\n", 376 | "37 145\n", 377 | "38 1\n", 378 | "39 222\n", 379 | "40 9\n", 380 | "41 6\n", 381 | "42 3\n", 382 | "43 1\n", 383 | "44 1\n", 384 | 
"45 14129\n", 385 | "46 1\n", 386 | "47 10\n", 387 | "48 1\n", 388 | "49 5\n", 389 | "50 7\n", 390 | "51 30\n", 391 | "52 27\n", 392 | "53 4\n", 393 | "54 104\n", 394 | "55 2\n", 395 | "56 8\n", 396 | "57 7\n", 397 | "58 1\n", 398 | "59 1\n", 399 | "60 1\n", 400 | "61 93\n", 401 | "62 1\n", 402 | "63 6\n", 403 | "64 5\n", 404 | "65 6\n", 405 | "66 14\n", 406 | "67 3\n", 407 | "68 65\n", 408 | "69 1\n", 409 | "70 4\n", 410 | "71 1\n", 411 | "72 283\n", 412 | "73 1\n", 413 | "74 1\n", 414 | "75 4\n", 415 | "76 1\n", 416 | "77 1\n", 417 | "78 1\n", 418 | "79 1\n", 419 | "80 17\n", 420 | "81 2\n", 421 | "82 31\n", 422 | "83 2\n", 423 | "84 973\n", 424 | "85 2\n", 425 | "86 6\n", 426 | "87 4\n", 427 | "88 13\n", 428 | "89 2\n", 429 | "90 1\n", 430 | "91 24\n", 431 | "92 1\n", 432 | "93 1\n", 433 | "94 1\n", 434 | "95 1\n", 435 | "96 1\n", 436 | "97 10\n", 437 | "98 28\n", 438 | "99 1\n", 439 | "100 3\n", 440 | "101 2\n", 441 | "102 1\n", 442 | "103 1\n", 443 | "104 188\n", 444 | "105 1\n", 445 | "106 2\n", 446 | "107 2\n", 447 | "108 2\n", 448 | "109 1\n", 449 | "110 1\n", 450 | "111 14\n", 451 | "112 13\n", 452 | "113 1\n", 453 | "114 1\n", 454 | "115 1\n", 455 | "116 5\n", 456 | "117 10\n", 457 | "118 12\n", 458 | "119 1\n", 459 | "120 1\n", 460 | "121 21\n", 461 | "122 1\n", 462 | "123 1\n", 463 | "124 1\n", 464 | "125 1\n", 465 | "126 9\n", 466 | "127 1\n", 467 | "128 1\n", 468 | "129 1\n", 469 | "130 8\n", 470 | "131 1\n", 471 | "132 1\n", 472 | "133 1\n", 473 | "134 1\n", 474 | "135 1\n", 475 | "136 1\n", 476 | "137 1\n", 477 | "138 1\n", 478 | "139 1\n", 479 | "140 1\n", 480 | "141 9\n", 481 | "142 5\n", 482 | "143 159\n", 483 | "144 1\n", 484 | "145 1\n", 485 | "146 5\n", 486 | "147 52\n", 487 | "148 1\n", 488 | "149 1\n", 489 | "150 1\n", 490 | "151 1\n", 491 | "152 1\n", 492 | "153 1\n", 493 | "154 2\n", 494 | "155 1\n", 495 | "156 1\n", 496 | "157 1\n", 497 | "158 58\n", 498 | "159 1\n", 499 | "160 12\n", 500 | "161 1\n", 501 | "162 2\n", 502 | "163 3\n", 503 | "164 10\n", 504 | "165 1\n", 505 | "166 1\n", 506 | "167 1\n", 507 | "168 1\n", 508 | "169 1\n", 509 | "170 206\n", 510 | "171 1\n", 511 | "172 2\n", 512 | "173 1\n", 513 | "174 1\n", 514 | "175 5\n", 515 | "176 3\n", 516 | "177 1\n", 517 | "178 3\n", 518 | "179 7\n", 519 | "180 6\n", 520 | "181 1\n", 521 | "182 1\n", 522 | "183 497\n", 523 | "184 1\n", 524 | "185 1\n", 525 | "186 1\n", 526 | "187 1\n", 527 | "188 2\n", 528 | "189 1\n", 529 | "190 1\n", 530 | "191 2\n", 531 | "192 1\n", 532 | "193 1\n", 533 | "194 1\n", 534 | "195 1\n", 535 | "196 1\n", 536 | "197 8\n", 537 | "198 1\n", 538 | "199 1\n", 539 | "200 1\n", 540 | "201 1\n", 541 | "202 1\n", 542 | "203 2\n", 543 | "204 1\n", 544 | "205 13\n", 545 | "206 1\n", 546 | "207 1\n", 547 | "208 2\n", 548 | "209 1\n", 549 | "210 1\n", 550 | "211 1\n", 551 | "212 1\n", 552 | "213 2\n", 553 | "214 3105\n", 554 | "215 1\n", 555 | "216 1\n", 556 | "217 1\n", 557 | "218 1\n", 558 | "219 102\n", 559 | "220 1\n", 560 | "221 11\n", 561 | "222 4\n", 562 | "223 2\n", 563 | "224 5\n", 564 | "225 3\n", 565 | "226 53\n", 566 | "227 46\n", 567 | "228 7\n", 568 | "229 1\n", 569 | "230 1\n", 570 | "231 1\n", 571 | "232 8\n", 572 | "233 1\n", 573 | "234 1\n", 574 | "235 18\n", 575 | "236 17\n", 576 | "237 1\n", 577 | "238 1\n", 578 | "239 78\n", 579 | "240 1\n", 580 | "241 8\n", 581 | "242 1\n", 582 | "243 1\n", 583 | "244 1\n", 584 | "245 1\n", 585 | "246 1\n", 586 | "247 91\n", 587 | "248 1\n", 588 | "249 1\n", 589 | "250 1\n", 590 | "251 1\n", 591 | "252 9\n", 592 | "253 1\n", 593 | "254 
27\n", 594 | "255 1\n", 595 | "256 6\n", 596 | "257 2\n", 597 | "258 1\n", 598 | "259 1\n", 599 | "260 1\n", 600 | "261 1\n", 601 | "262 1\n", 602 | "263 1\n", 603 | "264 1\n", 604 | "265 1\n", 605 | "266 1\n", 606 | "267 12\n", 607 | "268 1\n", 608 | "269 1\n", 609 | "270 1\n", 610 | "271 2\n", 611 | "272 33\n", 612 | "273 2\n", 613 | "274 12\n", 614 | "275 1\n", 615 | "276 1\n", 616 | "277 1\n", 617 | "278 1\n", 618 | "279 1\n", 619 | "280 1\n", 620 | "281 1\n", 621 | "282 1\n", 622 | "283 1\n", 623 | "284 24\n", 624 | "285 7\n", 625 | "286 1\n", 626 | "287 1\n", 627 | "288 4\n", 628 | "289 1\n", 629 | "290 8\n", 630 | "291 1\n", 631 | "292 2510\n", 632 | "293 175\n", 633 | "294 1\n", 634 | "295 99\n", 635 | "296 1\n", 636 | "297 1\n", 637 | "298 1\n", 638 | "299 289\n", 639 | "300 15\n", 640 | "301 1\n", 641 | "302 1\n", 642 | "303 1\n", 643 | "304 2\n", 644 | "305 1\n", 645 | "306 1\n", 646 | "307 9\n", 647 | "308 2\n", 648 | "309 1\n", 649 | "310 1\n", 650 | "311 1\n", 651 | "312 1\n", 652 | "313 1\n", 653 | "314 2\n", 654 | "315 1\n", 655 | "316 1\n", 656 | "317 1\n", 657 | "318 1\n", 658 | "319 39\n", 659 | "320 6\n", 660 | "321 1\n", 661 | "322 1\n", 662 | "323 1\n", 663 | "324 2\n", 664 | "325 34\n", 665 | "326 1\n", 666 | "327 1\n", 667 | "328 1\n", 668 | "329 1\n", 669 | "330 1\n", 670 | "331 1\n", 671 | "332 91\n", 672 | "333 2\n", 673 | "334 1\n", 674 | "335 1\n", 675 | "336 2\n", 676 | "337 1\n", 677 | "338 2\n", 678 | "339 1\n", 679 | "340 1\n", 680 | "341 1\n", 681 | "342 4\n", 682 | "343 1\n", 683 | "344 1\n", 684 | "345 1\n", 685 | "346 1\n", 686 | "347 5\n", 687 | "348 3\n", 688 | "349 88\n", 689 | "350 6\n", 690 | "351 7\n", 691 | "352 1\n", 692 | "353 217\n", 693 | "354 1\n", 694 | "355 1\n", 695 | "356 1\n", 696 | "357 1\n", 697 | "358 1\n", 698 | "359 1\n", 699 | "360 1315\n", 700 | "361 3\n", 701 | "362 13\n", 702 | "363 14\n", 703 | "364 1\n", 704 | "365 1\n", 705 | "366 29\n", 706 | "367 1\n", 707 | "368 1\n", 708 | "369 1\n", 709 | "370 1\n", 710 | "371 1\n", 711 | "372 3\n", 712 | "373 1\n", 713 | "374 1\n", 714 | "375 1\n", 715 | "376 1\n", 716 | "377 1\n", 717 | "378 12\n", 718 | "379 2\n", 719 | "380 1\n", 720 | "381 2\n", 721 | "382 1\n", 722 | "383 4\n", 723 | "384 1\n", 724 | "385 2\n", 725 | "386 1\n", 726 | "387 7\n", 727 | "388 1\n", 728 | "389 1\n", 729 | "390 2\n", 730 | "391 8\n", 731 | "392 1\n", 732 | "393 1\n", 733 | "394 6\n", 734 | "395 3\n", 735 | "396 9\n", 736 | "397 18\n", 737 | "398 1\n", 738 | "399 1\n", 739 | "400 771\n", 740 | "401 1\n", 741 | "402 470\n", 742 | "403 1\n", 743 | "404 1\n", 744 | "405 1\n", 745 | "406 1\n", 746 | "407 1\n", 747 | "408 1\n", 748 | "409 1\n", 749 | "410 44\n", 750 | "411 2\n", 751 | "412 3\n", 752 | "413 28\n", 753 | "414 16\n", 754 | "415 1\n", 755 | "416 1\n", 756 | "417 43\n", 757 | "418 78\n", 758 | "419 2\n", 759 | "420 4\n", 760 | "421 18\n", 761 | "422 5\n", 762 | "423 1\n", 763 | "424 2\n", 764 | "425 2\n", 765 | "426 1\n", 766 | "427 2\n", 767 | "428 1\n", 768 | "429 14\n", 769 | "430 1\n", 770 | "431 1\n", 771 | "432 1\n", 772 | "433 42\n", 773 | "434 1\n", 774 | "435 1\n", 775 | "436 1\n", 776 | "437 1\n", 777 | "438 1\n", 778 | "439 1\n", 779 | "440 1\n", 780 | "441 2\n", 781 | "442 1\n", 782 | "443 650\n", 783 | "444 1\n", 784 | "445 1\n", 785 | "446 6\n", 786 | "447 1\n", 787 | "448 1\n", 788 | "449 1\n", 789 | "450 3\n", 790 | "451 1\n", 791 | "452 1\n", 792 | "453 1\n", 793 | "454 1\n", 794 | "455 1\n", 795 | "456 1\n", 796 | "457 17\n", 797 | "458 1\n", 798 | "459 1\n", 799 | "460 1\n", 800 
| "461 2\n", 801 | "462 3\n", 802 | "463 1\n", 803 | "464 1\n", 804 | "465 1\n", 805 | "466 29\n", 806 | "467 1\n", 807 | "468 1\n", 808 | "469 1\n", 809 | "470 2\n", 810 | "471 2\n", 811 | "472 1\n", 812 | "473 5\n", 813 | "474 3\n", 814 | "475 4\n", 815 | "476 1\n", 816 | "477 24\n", 817 | "478 1\n", 818 | "479 1\n", 819 | "480 10\n", 820 | "481 1\n", 821 | "482 1\n", 822 | "483 1\n", 823 | "484 1\n", 824 | "485 2\n", 825 | "486 2\n", 826 | "487 1\n", 827 | "488 1\n", 828 | "489 50\n", 829 | "490 1\n", 830 | "491 4\n", 831 | "492 12\n", 832 | "493 2\n", 833 | "494 2\n", 834 | "495 1\n", 835 | "496 5\n", 836 | "497 1\n", 837 | "498 2\n", 838 | "499 1\n" 839 | ] 840 | } 841 | ], 842 | "source": [ 843 | "print(sets.keys())\n", 844 | "for k in sets.keys():\n", 845 | " print(k, len(sets[k]))" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": 14, 851 | "metadata": {}, 852 | "outputs": [ 853 | { 854 | "name": "stdout", 855 | "output_type": "stream", 856 | "text": [ 857 | "['freestanding_graphene', 'coexistent', 'fermi_contour', 'nonanalytic', 'lorenz', 'weak_value', 'leq_x', 'satisfiability_problem', 'are_there', 'simultaneously', 'spinel_oxide', 'oscillator_strength', 'transmon', 'microwave_photoresistance', 'valley_filter', 'nb_film', 'trial', 'screened_exchange', 'to_generate', 'minkowski', 'diffusional', 'pin', 'magnu_force', 'laser_excited', 'competing_species', 'classical_correspondence', 'paramagnon']\n" 858 | ] 859 | } 860 | ], 861 | "source": [ 862 | "print(sets[11])" 863 | ] 864 | }, 865 | { 866 | "cell_type": "code", 867 | "execution_count": null, 868 | "metadata": { 869 | "collapsed": true 870 | }, 871 | "outputs": [], 872 | "source": [] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": null, 877 | "metadata": { 878 | "collapsed": true 879 | }, 880 | "outputs": [], 881 | "source": [ 882 | "parsed_abstracts = parse_abstract('allabstracts.txt')" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": null, 888 | "metadata": { 889 | "collapsed": true 890 | }, 891 | "outputs": [], 892 | "source": [ 893 | "# This takes a very long time!\n", 894 | "re_parse = False\n", 895 | "if re_parse:\n", 896 | " parsed_abstracts = parse_abstract('allabstracts.txt')\n", 897 | " # Save to a file, so we can load it much faster than having\n", 898 | " # to re-parse the raw data.\n", 899 | " np.save(\"parsed_abstracts.npy\", parsed_abstracts)\n", 900 | "else:\n", 901 | " # Load the titles from the file.\n", 902 | " # The atleast_2d is a hack for correctly loading the dictionary...\n", 903 | " parsed_abstracts = np.atleast_2d(np.load(\"allabstracts.npy\"))[0][0]" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": { 910 | "collapsed": true 911 | }, 912 | "outputs": [], 913 | "source": [ 914 | "parsed_abstracts = np.atleast_2d(np.load(\"allabstracts.npy\"))[0][0]" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": null, 920 | "metadata": { 921 | "collapsed": true 922 | }, 923 | "outputs": [], 924 | "source": [ 925 | "ngram_abstr, bigrams_abstr, ngrams_abstr = get_ngrams(abstr)" 926 | ] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "execution_count": null, 931 | "metadata": { 932 | "collapsed": true 933 | }, 934 | "outputs": [], 935 | "source": [ 936 | "# train word2vec \n", 937 | "abstrmodel = gensim.models.Word2Vec(ngram_abstr, window=25, min_count=5, size=100)" 938 | ] 939 | } 940 | ], 941 | "metadata": { 942 | "kernelspec": { 943 | "display_name": "Python 3", 
944 | "language": "python", 945 | "name": "python3" 946 | }, 947 | "language_info": { 948 | "codemirror_mode": { 949 | "name": "ipython", 950 | "version": 3 951 | }, 952 | "file_extension": ".py", 953 | "mimetype": "text/x-python", 954 | "name": "python", 955 | "nbconvert_exporter": "python", 956 | "pygments_lexer": "ipython3", 957 | "version": "3.5.2" 958 | } 959 | }, 960 | "nbformat": 4, 961 | "nbformat_minor": 2 962 | } 963 | -------------------------------------------------------------------------------- /alltitles.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/alltitles.npy -------------------------------------------------------------------------------- /arXivHarvest.py: -------------------------------------------------------------------------------- 1 | #-------------------------------------------------------------------- 2 | # arXivHarvest.py 3 | # 4 | # Harvests (using OAI metadata available through an arXiv URL) 5 | # the titles and abstracts of a given arXiv section (cond-mat, 6 | # quant-ph, etc). 7 | # 8 | # The result of running this script will be two .txt files, 9 | # one containing the titles, and the other the corresponding 10 | # abstracts. 11 | # 12 | # The title.txt file is structured as (example): 13 | # 2017 3 This is the title of a paper published in 2017 14 | # that was too long to fit on a single line, so it con- 15 | # tinues with two whitespaces on the next line 16 | # 1998 12 This one is older but has a much shorter title 17 | # 18 | # The abstract.txt file has no year/month information, and 19 | # is ordered the same way as the title.txt file (so first 20 | # abstract belongs to the first title, etc). 
21 | #--------------------------------------------------------------------
22 | # Import modules
23 | from oaipmh.client import Client
24 | from oaipmh.metadata import MetadataRegistry, MetadataReader
25 | import time
26 | import argparse
27 | 
28 | parser = argparse.ArgumentParser(description="Harvest an arXiv subsection's titles")
29 | parser.add_argument('--section', type=str, required=False, default=None,
30 |                     help='arXiv set to harvest, e.g. physics:cond-mat')
31 | parser.add_argument('--output', type=str, required=False, default=None,
32 |                     help='output filename for the harvested titles (*.txt file)')
33 | 
34 | args = parser.parse_args()
35 | section = args.section
36 | output = args.output
37 | 
38 | # Change this to harvest a different arXiv set
39 | section="physics:cond-mat" if section == None else section
40 | # And change these to specify the txt file to save the data in
41 | title_file = "all_cond_mat_titles.txt" if output == None else output
42 | 
43 | #abstr_file = "all_cond_mat_abstracts.txt"
44 | 
45 | # Create a new MetadataReader, and list just the fields we are interested in
46 | oai_dc_reader = MetadataReader(
47 |     fields={
48 |         'title': ('textList', 'oai_dc:dc/dc:title/text()'),
49 |         'abstract': ('textList', 'oai_dc:dc/dc:description/text()'),
50 |         'date': ('textList', 'oai_dc:dc/dc:date/text()'),
51 |     },
52 |     namespaces={
53 |         'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
54 |         'dc' : 'http://purl.org/dc/elements/1.1/'}
55 | )
56 | 
57 | # And create a registry for parsing the oai info, linked to the reader
58 | registry = MetadataRegistry()
59 | registry.registerReader('oai_dc', oai_dc_reader)
60 | 
61 | # arXiv OAI url we will query
62 | URL = "http://export.arxiv.org/oai2"
63 | # Create OAI client; now we're all set for listing some records
64 | client = Client(URL, registry)
65 | 
66 | # Open files for writing
67 | titlef = open(title_file, 'w')
68 | #abstractf = open(abstr_file, 'w')
69 | 
70 | # Keep track of run-time and number of papers
71 | start_time = time.time()
72 | count = 0
73 | 
74 | # Harvest
75 | for record in client.listRecords(metadataPrefix='oai_dc', set=section):
76 |     try:
77 |         # Extract the title
78 |         title = record[1].getField('title')[0]
79 |         # Extract the abstract
80 |         abstract = record[1].getField('abstract')[0]
81 |         # And get the date (this is stored as yyyy-mm-dd in the arXiv metadata)
82 |         date = record[1].getField('date')[0]
83 |         year = int(date[0:4])
84 |         month = int(date[5:7])
85 | 
86 |         # Write to file (add year info to the titles)
87 |         titlef.write("%d %d "%(year,month) + title + "\n")
88 |         # abstractf.write(abstract + "\n")
89 | 
90 |         count += 1
91 |         # Flush every 100 papers to the files
92 |         if count % 100 == 0 and count > 1:
93 |             print("Harvested {0} papers so far (elapsed time = {1})".format(count, time.time() - start_time))
94 |             titlef.flush(); #abstractf.flush()
95 |     except Exception as e:
96 |         print("Encountered error whilst reading record: ", record)
97 |         print("Exception: ", e)
98 |         continue
99 | 
100 | 
101 | # Close files
102 | #abstractf.close()
103 | titlef.close()
104 | 
105 | # Report runtime and number of papers processed
106 | runtime = time.time() - start_time
107 | print("It took {} seconds to collect {} titles and abstracts".format(runtime, count))
108 | 
--------------------------------------------------------------------------------
/askmodel.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import argparse
3 | import gensim
4 | 
5 | parser = argparse.ArgumentParser(description="Ask a
trained Word2Vec model some questions")
6 | parser.add_argument('--input', type=str, required=True,
7 |                     help='a trained model file')
8 | parser.add_argument('--add', type=str, nargs='*', default="",
9 |                     help='word(s) to add in the query')
10 | parser.add_argument('--subtract', type=str, nargs='*', default="",
11 |                     help='word(s) to subtract in the query')
12 | 
13 | args = parser.parse_args()
14 | inputfile = args.input
15 | positive = args.add
16 | negative = args.subtract
17 | 
18 | # Load the model
19 | model = gensim.models.Word2Vec.load(inputfile)
20 | 
21 | # Build a nicer query string
22 | querystring = ""
23 | for i in range(len(positive)):
24 |     querystring = querystring + positive[i]
25 | 
26 |     if i < len(positive) - 1:
27 |         querystring = querystring + " + "
28 | 
29 | if len(negative) != 0:
30 |     querystring = querystring + " - "
31 | 
32 | for i in range(len(negative)):
33 |     querystring = querystring + negative[i]
34 | 
35 |     if i < len(negative) - 1:
36 |         querystring = querystring + " - "
37 | 
38 | print(querystring + " = \n")
39 | 
40 | # Get and display the answers
41 | result = model.most_similar(positive=positive, negative=negative, topn=10)
42 | for r in result:
43 |     print("{0:40} (with similarity score {1})".format(r[0], r[1]))
44 | print("\n")
45 | 
--------------------------------------------------------------------------------
/caltechmask.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/caltechmask.png
--------------------------------------------------------------------------------
/caltechwordcloud.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/caltechwordcloud.png
--------------------------------------------------------------------------------
/condmat-model-window-10-mincount-5-size-100:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/condmat-model-window-10-mincount-5-size-100
--------------------------------------------------------------------------------
/helper.py:
--------------------------------------------------------------------------------
1 | # Regular expressions
2 | import re
3 | # Use Inflect for singular-izing words
4 | import inflect
5 | # Gensim for learning phrases and word2vec
6 | import gensim
7 | 
8 | # For some reason, inflect thinks that there is a singular form of 'mass', namely 'mas'
9 | # and similarly for gas. Please add any other exceptions to this list!
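# (Illustration only: a hypothetical extra exception would be added with the same
#  pattern as the calls below, e.g. p.defnoun('lens', 'lens|lenses').)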
10 | p = inflect.engine()
11 | p.defnoun('mass', 'mass|masses')
12 | p.defnoun('gas', 'gas|gases')
13 | p.defnoun('gas', 'gas|gasses') # Other spelling
14 | p.defnoun('gaas', 'gaas') #GaAs ;)
15 | p.defnoun('gapless', 'gapless')
16 | p.defnoun('haas', 'haas')
17 | 
18 | # Check if a string has digits
19 | def hasNumbers(inputString):
20 |     return any(char.isdigit() for char in inputString)
21 | 
22 | # Return the singular form of a word, if it exists
23 | def singularize(word):
24 |     try:
25 |         # p.singular_noun() returns the singular form, but
26 |         # returns False if there is no singular form (or already singular)
27 | 
28 |         # So, if the word is already singular, just return the word
29 |         if not p.singular_noun(word):
30 |             return word
31 |         else:
32 |             # And otherwise return the singular version
33 |             return p.singular_noun(word)
34 | 
35 |     except Exception as e:
36 |         print("Euh? What's this? %s"%word)
37 |         print("This caused an exception: ", e)
38 |         return word
39 | 
40 | def stripchars(w, chars):
41 |     return "".join( [c for c in w if c not in chars] ).strip('\n')
42 | 
43 | # Parse a title into words
44 | def parse_title(title):
45 |     # Extract the year
46 |     year, rest = title.split(' ', 1)
47 |     year = int(year[0:])
48 |     # Then the month
49 |     month, title = rest.split(' ', 1)
50 |     month = int(month[0:])
51 | 
52 |     # Then, for every word in the title:
53 |     # 1) Split the title into words, by splitting it on spaces ' ' and on '-' (de-hyphenate words).
54 |     # 2) Turn each of those resulting words into lowercase letters only
55 |     # 3) Strip out any weird symbols (we don't want parenthesized words, no ';' at the end of a word, etc)
56 |     # 4) Also, we don't want to have digits.. my apologies to all the material studies on interesting compounds!
57 |     words = re.split( ' |-|\\|/', title.lower() )
58 |     wordlist = []
59 |     for i in range(len(words)):
60 |         w = words[i]
61 | 
62 |         # Skip if there is no word, or if we have numbers
63 |         if len(w) < 1 or hasNumbers(w):
64 |             continue
65 | 
66 |         # If it is (probably) math, let's skip it
67 |         if w[0] == '$' and w[-1] == '$':
68 |             continue
69 | 
70 |         # Remove other unwanted characters
71 |         w = stripchars(w, '\\/$(){}.<>,;:_"|\'\n `?!#%')
72 |         # Get singular form
73 |         w = singularize(w)
74 | 
75 |         # Skip if nothing left, or just an empty space
76 |         if len(w) < 1 or w == ' ':
77 |             continue
78 | 
79 |         # Append to the list
80 |         wordlist.append(w)
81 | 
82 |     return year, month, wordlist
83 | 
84 |     # Previous versions
85 |     #return year, month [singularize(stripchars(w, ['\\/$(){}.<>,;:"|\'\n '])) for w in re.split(' |-|\\|/',title.lower()) if not hasNumbers(w)]
86 |     #return year, month, [singularize(w.strip("\\/$|[](){}\n;:\"\',")) for w in re.split(' |-',title.lower()) if not hasNumbers(w)]
87 | 
88 | 
89 | def load_and_parse_all_titles(file):
90 |     """ Read title info from file, and parse the titles into words """
91 | 
92 |     # Buffer for storing the file
93 |     all_lines = []
94 |     # Read file into the buffer
95 |     with open(file, "r") as f:
96 |         for i,line in enumerate(f):
97 |             all_lines.append(line)
98 | 
99 |     # An empty dictionary for storing all the title info. This
100 |     # dictionary will have the year of the title as the key, and will
101 |     # itself hold dictionaries that have the month as a key.
102 | # For example: 103 | # all_titles[2007] = dictionary with months as keys 104 | # all_titles[2007][3] = list of titles from march 2007 105 | all_titles = {} 106 | 107 | # Keep track of the number of titles 108 | num_titles = 0 109 | 110 | # The title.txt file should be organized such, that new 111 | # titles start with year and month, and that titles that continue 112 | # on the next line start with two empty spaces. 113 | # So we're going to loop through the lines, and append the current 114 | # line to the previous title if it started with two empty spaces. 115 | # If not, it means we have found the start of a new title. 116 | title = all_lines[0] 117 | previous_title = "" 118 | 119 | # Scan each line 120 | i = 1 121 | while (i < (len(all_lines)-1)): 122 | 123 | # If we find a new title (no empty spaces at the start) 124 | if all_lines[i][0:2] != " ": 125 | 126 | # The title we have so far can be parsed and added 127 | # to the title-list 128 | year, month, title = parse_title(title) 129 | 130 | # If we have not seen this year before, create a new 131 | # dictionary entry with this year as the key. 132 | if year not in all_titles: 133 | all_titles[year] = {} 134 | 135 | # And do the same with the month, if we haven't seen it. 136 | if month not in all_titles[year]: 137 | all_titles[year][month] = [] 138 | 139 | # Now that we're sure the key pair [year][month] exists as a list 140 | # we can add the title to it. 141 | all_titles[year][month].append(title) 142 | num_titles += 1 143 | 144 | # Then start the next one 145 | title = all_lines[i] 146 | previous_title = title 147 | else: 148 | # We are still on the same title 149 | title = previous_title + all_lines[i][1:] 150 | previous_title = title 151 | 152 | # Go to the next line 153 | i += 1 154 | 155 | print("Read and parsed %d titles"%(num_titles)) 156 | return all_titles 157 | 158 | def get_titles_for_years(all_titles, years): 159 | """ Return list of all titles for given years (must be a list, even if only one)""" 160 | collectedtitles = [] 161 | for k in years: 162 | allmonthtitles = [] 163 | for m in all_titles[k].keys(): 164 | allmonthtitles = allmonthtitles + all_titles[k][m] 165 | 166 | collectedtitles = collectedtitles + allmonthtitles 167 | return collectedtitles 168 | 169 | def get_ngrams(sentences): 170 | """ Detects n-grams with n up to 4, and replaces those in the titles. """ 171 | # Train a 2-word (bigram) phrase-detector 172 | bigram_phrases = gensim.models.phrases.Phrases(sentences) 173 | 174 | # And construct a phraser from that (an object that will take a sentence 175 | # and replace in it the bigrams that it knows by single objects) 176 | bigram = gensim.models.phrases.Phraser(bigram_phrases) 177 | 178 | # Repeat that for trigrams; the input now are the bigrammed-titles 179 | ngram_phrases = gensim.models.phrases.Phrases(bigram[sentences]) 180 | ngram = gensim.models.phrases.Phraser(ngram_phrases) 181 | 182 | # !! If you want to have more than 4-grams, just repeat the structure of the 183 | # above two lines. That is, train another Phrases on the ngram_phrases[titles], 184 | # that will get you up to 8-grams. 185 | 186 | # Now that we have phrasers for bi- and trigrams, let's analyze them 187 | # The phrases.export_phrases(x) function returns pairs of phrases and their 188 | # certainty scores from x. 
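    # (Illustration only: in the gensim 2.x/3.x API used here, each item yielded by
    #  export_phrases() is roughly a (bytes, float) pair such as (b'magnetic_field', 123.4),
    #  i.e. the detected phrase with its words joined by '_' plus its score; the exact
    #  values depend on the corpus.)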
189 | bigram_info = {} 190 | for b, score in bigram_phrases.export_phrases(sentences): 191 | bigram_info[b] = [score, bigram_info.get(b,[0,0])[1] + 1] 192 | 193 | ngram_info = {} 194 | for b, score in ngram_phrases.export_phrases(bigram[sentences]): 195 | ngram_info[b] = [score, ngram_info.get(b,[0,0])[1] + 1] 196 | 197 | # Return a list of 'n-grammed' titles, and the bigram and trigram info 198 | return [ngram[t] for t in sentences], bigram_info, ngram_info 199 | 200 | # !!! THIS SECTION HAS NOT YET BEEN UPDATED 201 | # !!! IT WILL WORK, BUT IT TAKES A *VERY* LONG 202 | # !!! TIME. HAS TO SWITCH TO LIST COMPREHENSION 203 | 204 | # Parse abstract into sentences 205 | def parse_abstract(file): 206 | # Buffer for storing the file 207 | abstr = open(file, "r").read() 208 | 209 | sentences = [] 210 | 211 | # Clean up abstract 212 | abstr.lower() 213 | abstr.replace('\'', '') 214 | abstr.replace('\"', '') 215 | 216 | # Extract sentences and split into words 217 | end = abstr.find('.') 218 | while end != -1: 219 | sentence = abstr[:end].replace('\n', ' ') 220 | 221 | # Sanitize the words 222 | words = re.split( ' |-|\\|/', sentence.lower() ) 223 | wordlist = [] 224 | for i in range(len(words)): 225 | w = words[i] 226 | 227 | # Skip if there is no word, or if we have numbers 228 | if len(w) < 1 or hasNumbers(w): 229 | continue 230 | 231 | # If it is (probably) math, let's skip it 232 | if w[0] == '$' and w[-1] == '$': 233 | continue 234 | 235 | # Remove other unwanted characters 236 | w = stripchars(w, '\\/$(){}.<>,;:_"|\'\n `?!#%') 237 | # Get singular form 238 | w = singularize(w) 239 | 240 | # Skip if nothing left, or just an empty space 241 | if len(w) < 1 or w == ' ': 242 | continue 243 | 244 | # Append to the list 245 | wordlist.append(w) 246 | 247 | sentences.append( wordlist ) 248 | abstr = abstr[end+1:] 249 | end = abstr.find('.') 250 | 251 | return sentences -------------------------------------------------------------------------------- /numpapers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/numpapers.png -------------------------------------------------------------------------------- /parsetitles.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from helper import * 3 | import argparse 4 | 5 | parser = argparse.ArgumentParser(description="Parse titles into a *.npy file") 6 | parser.add_argument('--input', type=str, required=True, 7 | help='text file with titles from harvest') 8 | parser.add_argument('--output', type=str, required=True, 9 | help='output filename for *.npy file') 10 | 11 | args = parser.parse_args() 12 | inputfile = args.input 13 | outputfile = args.output 14 | 15 | print("Parsing file.. 
(may take a short while)")
16 | all_titles = load_and_parse_all_titles(inputfile)
17 | np.save(outputfile, all_titles)
18 | print("Done!")
19 | 
--------------------------------------------------------------------------------
/trainmodel.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import argparse
3 | from helper import *
4 | 
5 | parser = argparse.ArgumentParser(description="Train a Word2Vec encoding on input, and store the resulting model in output")
6 | parser.add_argument('--input', type=str, required=True,
7 |                     help='a *.npy file with parsed titles')
8 | parser.add_argument('--size', type=int, default=100,
9 |                     help='size of encoding vectors')
10 | parser.add_argument('--window', type=int, default=10,
11 |                     help='size of window scanning over text')
12 | parser.add_argument('--mincount', type=int, default=5,
13 |                     help='minimum number of times a word has to appear to participate')
14 | parser.add_argument('--output', type=str, required=True,
15 |                     help='output filename for saving the model')
16 | 
17 | args = parser.parse_args()
18 | inputfile = args.input
19 | size = args.size
20 | window = args.window
21 | mincount = args.mincount
22 | outputfile = args.output
23 | 
24 | print("Training model with\n")
25 | print("{0:30} = {1}".format("input", inputfile))
26 | print("{0:30} = {1}".format("size", size))
27 | print("{0:30} = {1}".format("window", window))
28 | print("{0:30} = {1}".format("mincount", mincount))
29 | 
30 | all_titles = np.atleast_2d(np.load(inputfile))[0][0]
31 | all_years = sorted(list(all_titles.keys()))
32 | titles = get_titles_for_years(all_titles, all_years)
33 | ngram_titles, bigrams, ngrams = get_ngrams(titles)
34 | model = gensim.models.Word2Vec(ngram_titles, window=window, min_count=mincount, size=size)
35 | print("Saving to {0}".format(outputfile))
36 | model.save(outputfile)
37 | print("Done!")
38 | 
--------------------------------------------------------------------------------
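
A quick way to sanity-check a model produced by trainmodel.py (a minimal sketch; it only uses calls that already appear in askmodel.py and the Word2Vec notebook, and the query word 'electron' is just an example, so substitute any word that survived the --mincount cutoff):

import gensim

# Load the model written by trainmodel.py (filename taken from the README quickstart)
model = gensim.models.Word2Vec.load("condmatmodel-100-10-5")

# Print the three nearest neighbours of an example word
for word, score in model.most_similar(positive=['electron'], topn=3):
    print("{0:40} (similarity {1:.3f})".format(word, score))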