├── LICENSE ├── PhraseAnalysis.ipynb ├── README.md ├── Word2Vec.ipynb ├── WordCloud.ipynb ├── alltitles.npy ├── alltitles.txt ├── arXivHarvest.py ├── askmodel.py ├── caltechmask.png ├── caltechwordcloud.png ├── condmat-model-window-10-mincount-5-size-100 ├── helper.py ├── numpapers.png ├── parsetitles.py └── trainmodel.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Everard van Nieuwenburg 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # physics2vec 2 | Things to do with arXiv metadata :-) 3 | 4 | ## Summary 5 | This repository is (currently) a collection of python scripts and notebooks that 6 | 1. Do a **Word2Vec encoding** of physics jargon (using gensim's CBOW or skip-gram, if you care for specifics). 7 | 8 | Examples: "particle + charge = electron" and "majorana + braiding = non-abelian" 9 | Remark: These examples were _learned_ from the cond-mat section titles only. 10 | 11 | 2. Analyze the **n-grams** (i.e. fixed n-word expressions) in the titles over the years (what should we work on? ;-)) 12 | 3. Produce a **WordCloud** of your favorite arXiv section (such as the above, from the cond-mat section) 13 | ![alt text](https://raw.githubusercontent.com/everthemore/physics2vec/master/caltechwordcloud.png "arXiv:cond-mat wordcloud") 14 | 15 | ## Notes 16 | These scripts were tested and run using **Python 3**. I have not checked backwards compatibility, but I have heard from people who managed to get it to work in **Python 2** too! Feel free to reach out to me in case things don't work out-of-the-box. I have not (yet) tried to make the scripts and notebooks super user-friendly, though I did try to comment the code such that you may figure things out by 17 | trial-and-error. 18 | 19 | ## Quickstart ## 20 | If you're already familiar with python, all you need to have are the modules numpy, pyoai, inflect and gensim. These should all be easy to install using pip/pip3. Then the workflow is as follows (I used python3): 21 | 1. python arXivHarvest.py --section physics:cond-mat --output condmattitles.txt 22 | 2. python parsetitles.py --input condmattitles.txt --output condmattitles.npy 23 | 3. 
python trainmodel.py --input condmattitles.npy --size 100 --window 10 --mincount 5 --output condmatmodel-100-10-5
24 | 4. python askmodel.py --input condmatmodel-100-10-5 --add particle charge
25 | 
26 | In step 1, we get the titles from arXiv. This is a time-consuming step; it took 1.5hrs for the physics:cond-mat section, and so I've provided the files for those in the repository already (i.e. you can skip steps 1 and 2). In step 2 we take out the weird symbols etc, and parse it into a \*.npy file. In the third step, we train a model with vector size 100, window size 10 and minimum count for words to participate of 5. Step 4 can be repeated as often as one desires.
27 | 
28 | ## More details
29 | Apart from the above scripts, I provide 3 python notebooks that perform more than just the analysis of arXiv titles. I highly
30 | recommend using notebooks: they are easy to install and super useful. See here: http://jupyter.org/. You can also just copy-and-paste the code from the notebooks into a \*.py script and run those.
31 | 
32 | You are going to need the following python modules in addition, all installable using pip3 (sudo pip3 install [module-name]).
33 | 
34 | 1. numpy
35 | 
36 | Must-have for anything scientific you want to do with python (arrays, linalg)
37 | Numpy (http://www.numpy.org/)
38 | 
39 | 2. pyoai
40 | 
41 | Open Archive Initiative module for querying the arXiv servers for metadata
42 | https://pypi.python.org/pypi/pyoai
43 | 
44 | 3. inflect
45 | 
46 | Module for generating/checking plural/singular versions of words
47 | https://pypi.python.org/pypi/inflect
48 | 
49 | 4. gensim
50 | 
51 | Very versatile module for topic modelling (analyzing basically anything you want from text, including word2vec)
52 | https://radimrehurek.com/gensim/
53 | 
54 | Not required, but highly recommended is the module "matplotlib" for creating plots. You can comment/remove the
55 | sections in the code that refer to it if you really don't want to.
56 | 
57 | Optionally, if you wish to make a WordCloud, you will need
58 | 
59 | 5. Matplotlib (https://matplotlib.org/)
60 | 6. PIL (http://www.pythonware.com/products/pil/)
61 | 7. WordCloud (https://github.com/amueller/word_cloud)
62 | 
--------------------------------------------------------------------------------
/Word2Vec.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import numpy as np\n",
12 | "\n",
13 | "# Be sure to restart the notebook kernel if you make changes to the helper module (helper.py)\n",
14 | "# Re-running this cell does not re-load the module otherwise\n",
15 | "from helper import *\n",
16 | "\n",
17 | "# We use matplotlib for plotting. You can basically get any plot layout/style\n",
18 | "# etc you want with this module. I'm setting it up for basics here, meaning\n",
19 | "# that I want it to parse LaTeX and use the LaTeX font family for all text.\n",
20 | "# !! If you don't have a LaTeX distribution installed, this notebook may\n",
21 | "# throw errors when it tries to create the plots. If that happens, \n",
22 | "# either install a LaTeX distribution or remove/comment the \n",
23 | "# matplotlib.rcParams.update(...)
line.\n", 24 | "# In both cases, restart the kernel of this notebook afterwards.\n", 25 | "import matplotlib\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "%matplotlib inline\n", 28 | "\n", 29 | "rcparams = { \n", 30 | " \"pgf.texsystem\": \"pdflatex\", # change this if using xetex or lautex\n", 31 | " \"text.usetex\": True, # use LaTeX to write all text\n", 32 | " \"font.family\": \"lmodern\",\n", 33 | " \"font.serif\": [], # blank entries should cause plots to inherit fonts from the document\n", 34 | " \"font.sans-serif\": [],\n", 35 | " \"font.monospace\": [], \n", 36 | " \"font.size\": 12,\n", 37 | " \"legend.fontsize\": 12, \n", 38 | " \"xtick.labelsize\": 12,\n", 39 | " \"ytick.labelsize\": 12,\n", 40 | " \"pgf.preamble\": [\n", 41 | " r\"\\usepackage[utf8x]{inputenc}\", # use utf8 fonts becasue your computer can handle it :)\n", 42 | " r\"\\usepackage[T1]{fontenc}\", # plots will be generated using this preamble\n", 43 | " ]\n", 44 | "}\n", 45 | "matplotlib.rcParams.update(rcparams)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "# Load the title dataset" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": { 59 | "collapsed": true 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "re_parse = False\n", 64 | "if re_parse:\n", 65 | " all_titles = load_and_parse_all_titles('alltitles.txt')\n", 66 | " # Save to a file, so we can load it much faster than having\n", 67 | " # to re-parse the raw data.\n", 68 | " np.save(\"alltitles.npy\", all_titles)\n", 69 | "else:\n", 70 | " # Load the titles from the file.\n", 71 | " # The atleast_2d is a hack for correctly loading the dictionary...\n", 72 | " all_titles = np.atleast_2d(np.load(\"alltitles.npy\"))[0][0]" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "# Check the available years\n", 90 | "all_years = sorted(list(all_titles.keys()))\n", 91 | "print(all_years)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "# Train Word2Vec" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "metadata": { 105 | "collapsed": true, 106 | "scrolled": true 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "titles = get_titles_for_years(all_titles, all_years)\n", 111 | "ngram_titles, bigrams, ngrams = get_ngrams(titles)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 14, 117 | "metadata": { 118 | "collapsed": true 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "# train word2vec \n", 123 | "model = gensim.models.Word2Vec(ngram_titles, window=25, min_count=5, size=100)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 19, 129 | "metadata": { 130 | "scrolled": true 131 | }, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "Similarity: \n", 138 | "A superconductor is similar to: \n", 139 | " [*] layered_superconductor \t (0.7722122669219971)\n", 140 | " [*] superconducting \t (0.7567377090454102)\n", 141 | " [*] unconventional_superconductor \t (0.7531686425209045)\n", 142 | " [*] cuprate_superconductor \t 
(0.7451467514038086)\n", 143 | " [*] superconductivity \t (0.733198881149292)\n", 144 | " [*] multiband_superconductor \t (0.7226791977882385)\n", 145 | " [*] superconducting_gap \t (0.7021770477294922)\n", 146 | " [*] cuprate \t (0.6682584285736084)\n", 147 | " [*] weyl_semimetal \t (0.6146906614303589)\n", 148 | " [*] noncentrosymmetric_superconductor \t (0.612571120262146)\n", 149 | "Majorana is similar to: \n", 150 | " [*] majorana_fermion \t (0.8646294474601746)\n", 151 | " [*] majorana_mode \t (0.8107954859733582)\n", 152 | " [*] non_abelian \t (0.7987779974937439)\n", 153 | " [*] braiding \t (0.7585236430168152)\n", 154 | " [*] topologically_protected \t (0.7555981278419495)\n", 155 | " [*] parity \t (0.7497479915618896)\n", 156 | " [*] andreev \t (0.73931485414505)\n", 157 | " [*] majorana_bound \t (0.7324416041374207)\n", 158 | " [*] kramer_pair \t (0.7297208309173584)\n", 159 | " [*] protected \t (0.7294089198112488)\n", 160 | "Topological is similar to: \n", 161 | " [*] topological_insulator \t (0.6998741626739502)\n", 162 | " [*] weyl \t (0.6658810973167419)\n", 163 | " [*] majorana \t (0.6639574766159058)\n", 164 | " [*] topologically_protected \t (0.6560168266296387)\n", 165 | " [*] chiral \t (0.6515534520149231)\n", 166 | " [*] floquet_topological \t (0.6458501219749451)\n", 167 | " [*] non_abelian \t (0.6349783539772034)\n", 168 | " [*] gapless \t (0.630128026008606)\n", 169 | " [*] topologically \t (0.627392590045929)\n", 170 | " [*] majorana_fermion \t (0.6268280744552612)\n", 171 | "A phonon is similar to: \n", 172 | " [*] optical_absorption \t (0.5985944271087646)\n", 173 | " [*] plasmon \t (0.586416482925415)\n", 174 | " [*] ionized_impurity \t (0.5812559127807617)\n", 175 | " [*] acoustic_phonon \t (0.5808508396148682)\n", 176 | " [*] carrier \t (0.569900631904602)\n", 177 | " [*] intraband \t (0.5654160380363464)\n", 178 | " [*] raman \t (0.5651483535766602)\n", 179 | " [*] incoherent \t (0.5587544441223145)\n", 180 | " [*] photoexcited \t (0.5581139326095581)\n", 181 | " [*] charge_carrier \t (0.5440508127212524)\n", 182 | "\n", 183 | "\n", 184 | "Arithmetics: \n", 185 | "Majorana + Braiding = \n", 186 | " [*] majorana_mode \t (0.8562889099121094)\n", 187 | " [*] non_abelian \t (0.8480815887451172)\n", 188 | "wave + lattice + force = \n", 189 | " [*] breather \t (0.5430172085762024)\n", 190 | " [*] charged_particle \t (0.5301461219787598)\n", 191 | " [*] vortice \t (0.5298101305961609)\n", 192 | "particle + charge = \n", 193 | " [*] electron \t (0.5534933805465698)\n", 194 | " [*] charged_particle \t (0.4921059310436249)\n", 195 | "electron - charge = \n", 196 | " [*] many_body \t (0.5787447690963745)\n", 197 | " [*] qubit_gate \t (0.5545492768287659)\n", 198 | "2D + electrons + magnetic field = \n", 199 | " [*] landau_level \t (0.6205390095710754)\n", 200 | " [*] carrier_density \t (0.5904487371444702)\n", 201 | "Electron + Hole = \n", 202 | " [*] carrier \t (0.6992154717445374)\n", 203 | " [*] gaas \t (0.6459619998931885)\n", 204 | "Superconductor + Topological = \n", 205 | " [*] weyl_semimetal \t (0.7405728697776794)\n", 206 | " [*] topological_insulator \t (0.7248612642288208)\n", 207 | "Spin + Magnetic Field = \n", 208 | " [*] magnetization \t (0.6871200203895569)\n", 209 | " [*] antiferromagnetic \t (0.6532838940620422)\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "print(\"Similarity: \")\n", 215 | "print(\"A superconductor is similar to: \")\n", 216 | "for s in model.most_similar(positive=['superconductor'], topn=10):\n", 217 | " print(\" [*] {0:35} 
\\t ({1})\".format(s[0], s[1]))\n", 218 | " \n", 219 | "print(\"Majorana is similar to: \")\n", 220 | "for s in model.most_similar(positive=['majorana'], topn=10):\n", 221 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 222 | "\n", 223 | "print(\"Topological is similar to: \") \n", 224 | "for s in model.most_similar(positive=['topological'], topn=10):\n", 225 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 226 | "\n", 227 | "print(\"A phonon is similar to: \")\n", 228 | "for s in model.most_similar(positive=['phonon'], topn=10):\n", 229 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 230 | " \n", 231 | "print(\"\\n\")\n", 232 | "print(\"Arithmetics: \")\n", 233 | "print(\"Majorana + Braiding = \")\n", 234 | "for s in model.most_similar(positive=['majorana', 'braiding'], topn=2):\n", 235 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 236 | " \n", 237 | "print(\"particle + charge = \")\n", 238 | "for s in model.most_similar(positive=['particle', 'charge'], topn=2):\n", 239 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 240 | " \n", 241 | "print(\"electron - charge = \")\n", 242 | "for s in model.most_similar(positive=['electron', 'positive'], negative=['negative'], topn=2):\n", 243 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 244 | " \n", 245 | "print(\"2D + electrons + magnetic field = \")\n", 246 | "for s in model.most_similar(positive=['two_dimensional', 'electron', 'magnetic_field'], topn=2):\n", 247 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 248 | "\n", 249 | "print(\"Electron + Hole = \")\n", 250 | "for s in model.most_similar(positive=['electron', 'hole'], topn=2):\n", 251 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 252 | " \n", 253 | "print(\"Superconductor + Topological = \")\n", 254 | "for s in model.most_similar(positive=['superconductor', 'topological'], topn=2):\n", 255 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))\n", 256 | " \n", 257 | "print(\"Spin + Magnetic Field = \")\n", 258 | "for s in model.most_similar(positive=['spin', 'magnetic_field'], topn=2):\n", 259 | " print(\" [*] {0:35} \\t ({1})\".format(s[0], s[1]))" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 13, 265 | "metadata": { 266 | "collapsed": true 267 | }, 268 | "outputs": [], 269 | "source": [ 270 | "# If you want to save and/or load a model:\n", 271 | "model.save(\"condmat-model-window-25-mincount-5-size-100\")\n", 272 | "#model = gensim.models.Word2Vec.load(\"condmat-model-window-10-mincount-5-size-100\")" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 11, 278 | "metadata": { 279 | "collapsed": true 280 | }, 281 | "outputs": [], 282 | "source": [] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": { 288 | "collapsed": true 289 | }, 290 | "outputs": [], 291 | "source": [] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "## Clustering" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 11, 303 | "metadata": { 304 | "collapsed": true 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "from sklearn.cluster import KMeans\n", 309 | "kmeans = KMeans(n_clusters=500, random_state=0).fit(model.wv.syn0)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 12, 315 | "metadata": { 316 | "collapsed": true 317 | }, 318 | "outputs": [], 319 | "source": [ 320 | "sets = {}\n", 321 | "for l in 
np.unique(kmeans.labels_):\n", 322 | " sets[l] = []\n", 323 | "for idx,l in enumerate(sorted(kmeans.labels_)):\n", 324 | " sets[l].append(model.wv.index2word[idx])" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 13, 330 | "metadata": { 331 | "scrolled": true 332 | }, 333 | "outputs": [ 334 | { 335 | "name": "stdout", 336 | "output_type": "stream", 337 | "text": [ 338 | "dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499])\n", 339 | "0 6974\n", 340 | "1 101\n", 341 | "2 3\n", 342 | "3 76\n", 343 | "4 173\n", 344 | "5 563\n", 345 | "6 2\n", 346 | "7 11\n", 347 | "8 20\n", 348 | "9 11\n", 349 | "10 4\n", 350 | "11 27\n", 351 | "12 31\n", 352 | "13 28\n", 353 | "14 4\n", 354 | "15 9\n", 355 | "16 30\n", 356 | "17 2\n", 357 | "18 6\n", 358 | "19 9\n", 359 | "20 4\n", 360 | "21 7\n", 361 | "22 57\n", 362 | "23 13\n", 363 | "24 1\n", 364 | "25 19\n", 365 | "26 7\n", 366 | "27 1\n", 367 | "28 22\n", 368 | "29 6\n", 369 | "30 1\n", 370 | "31 2\n", 371 | "32 31\n", 372 | "33 1\n", 373 | "34 68\n", 374 | "35 1\n", 375 | "36 565\n", 376 | "37 145\n", 377 | "38 1\n", 378 | "39 222\n", 379 | "40 9\n", 380 | "41 6\n", 381 | "42 3\n", 382 | "43 1\n", 383 | "44 1\n", 384 | 
"45 14129\n", 385 | "46 1\n", 386 | "47 10\n", 387 | "48 1\n", 388 | "49 5\n", 389 | "50 7\n", 390 | "51 30\n", 391 | "52 27\n", 392 | "53 4\n", 393 | "54 104\n", 394 | "55 2\n", 395 | "56 8\n", 396 | "57 7\n", 397 | "58 1\n", 398 | "59 1\n", 399 | "60 1\n", 400 | "61 93\n", 401 | "62 1\n", 402 | "63 6\n", 403 | "64 5\n", 404 | "65 6\n", 405 | "66 14\n", 406 | "67 3\n", 407 | "68 65\n", 408 | "69 1\n", 409 | "70 4\n", 410 | "71 1\n", 411 | "72 283\n", 412 | "73 1\n", 413 | "74 1\n", 414 | "75 4\n", 415 | "76 1\n", 416 | "77 1\n", 417 | "78 1\n", 418 | "79 1\n", 419 | "80 17\n", 420 | "81 2\n", 421 | "82 31\n", 422 | "83 2\n", 423 | "84 973\n", 424 | "85 2\n", 425 | "86 6\n", 426 | "87 4\n", 427 | "88 13\n", 428 | "89 2\n", 429 | "90 1\n", 430 | "91 24\n", 431 | "92 1\n", 432 | "93 1\n", 433 | "94 1\n", 434 | "95 1\n", 435 | "96 1\n", 436 | "97 10\n", 437 | "98 28\n", 438 | "99 1\n", 439 | "100 3\n", 440 | "101 2\n", 441 | "102 1\n", 442 | "103 1\n", 443 | "104 188\n", 444 | "105 1\n", 445 | "106 2\n", 446 | "107 2\n", 447 | "108 2\n", 448 | "109 1\n", 449 | "110 1\n", 450 | "111 14\n", 451 | "112 13\n", 452 | "113 1\n", 453 | "114 1\n", 454 | "115 1\n", 455 | "116 5\n", 456 | "117 10\n", 457 | "118 12\n", 458 | "119 1\n", 459 | "120 1\n", 460 | "121 21\n", 461 | "122 1\n", 462 | "123 1\n", 463 | "124 1\n", 464 | "125 1\n", 465 | "126 9\n", 466 | "127 1\n", 467 | "128 1\n", 468 | "129 1\n", 469 | "130 8\n", 470 | "131 1\n", 471 | "132 1\n", 472 | "133 1\n", 473 | "134 1\n", 474 | "135 1\n", 475 | "136 1\n", 476 | "137 1\n", 477 | "138 1\n", 478 | "139 1\n", 479 | "140 1\n", 480 | "141 9\n", 481 | "142 5\n", 482 | "143 159\n", 483 | "144 1\n", 484 | "145 1\n", 485 | "146 5\n", 486 | "147 52\n", 487 | "148 1\n", 488 | "149 1\n", 489 | "150 1\n", 490 | "151 1\n", 491 | "152 1\n", 492 | "153 1\n", 493 | "154 2\n", 494 | "155 1\n", 495 | "156 1\n", 496 | "157 1\n", 497 | "158 58\n", 498 | "159 1\n", 499 | "160 12\n", 500 | "161 1\n", 501 | "162 2\n", 502 | "163 3\n", 503 | "164 10\n", 504 | "165 1\n", 505 | "166 1\n", 506 | "167 1\n", 507 | "168 1\n", 508 | "169 1\n", 509 | "170 206\n", 510 | "171 1\n", 511 | "172 2\n", 512 | "173 1\n", 513 | "174 1\n", 514 | "175 5\n", 515 | "176 3\n", 516 | "177 1\n", 517 | "178 3\n", 518 | "179 7\n", 519 | "180 6\n", 520 | "181 1\n", 521 | "182 1\n", 522 | "183 497\n", 523 | "184 1\n", 524 | "185 1\n", 525 | "186 1\n", 526 | "187 1\n", 527 | "188 2\n", 528 | "189 1\n", 529 | "190 1\n", 530 | "191 2\n", 531 | "192 1\n", 532 | "193 1\n", 533 | "194 1\n", 534 | "195 1\n", 535 | "196 1\n", 536 | "197 8\n", 537 | "198 1\n", 538 | "199 1\n", 539 | "200 1\n", 540 | "201 1\n", 541 | "202 1\n", 542 | "203 2\n", 543 | "204 1\n", 544 | "205 13\n", 545 | "206 1\n", 546 | "207 1\n", 547 | "208 2\n", 548 | "209 1\n", 549 | "210 1\n", 550 | "211 1\n", 551 | "212 1\n", 552 | "213 2\n", 553 | "214 3105\n", 554 | "215 1\n", 555 | "216 1\n", 556 | "217 1\n", 557 | "218 1\n", 558 | "219 102\n", 559 | "220 1\n", 560 | "221 11\n", 561 | "222 4\n", 562 | "223 2\n", 563 | "224 5\n", 564 | "225 3\n", 565 | "226 53\n", 566 | "227 46\n", 567 | "228 7\n", 568 | "229 1\n", 569 | "230 1\n", 570 | "231 1\n", 571 | "232 8\n", 572 | "233 1\n", 573 | "234 1\n", 574 | "235 18\n", 575 | "236 17\n", 576 | "237 1\n", 577 | "238 1\n", 578 | "239 78\n", 579 | "240 1\n", 580 | "241 8\n", 581 | "242 1\n", 582 | "243 1\n", 583 | "244 1\n", 584 | "245 1\n", 585 | "246 1\n", 586 | "247 91\n", 587 | "248 1\n", 588 | "249 1\n", 589 | "250 1\n", 590 | "251 1\n", 591 | "252 9\n", 592 | "253 1\n", 593 | "254 
27\n", 594 | "255 1\n", 595 | "256 6\n", 596 | "257 2\n", 597 | "258 1\n", 598 | "259 1\n", 599 | "260 1\n", 600 | "261 1\n", 601 | "262 1\n", 602 | "263 1\n", 603 | "264 1\n", 604 | "265 1\n", 605 | "266 1\n", 606 | "267 12\n", 607 | "268 1\n", 608 | "269 1\n", 609 | "270 1\n", 610 | "271 2\n", 611 | "272 33\n", 612 | "273 2\n", 613 | "274 12\n", 614 | "275 1\n", 615 | "276 1\n", 616 | "277 1\n", 617 | "278 1\n", 618 | "279 1\n", 619 | "280 1\n", 620 | "281 1\n", 621 | "282 1\n", 622 | "283 1\n", 623 | "284 24\n", 624 | "285 7\n", 625 | "286 1\n", 626 | "287 1\n", 627 | "288 4\n", 628 | "289 1\n", 629 | "290 8\n", 630 | "291 1\n", 631 | "292 2510\n", 632 | "293 175\n", 633 | "294 1\n", 634 | "295 99\n", 635 | "296 1\n", 636 | "297 1\n", 637 | "298 1\n", 638 | "299 289\n", 639 | "300 15\n", 640 | "301 1\n", 641 | "302 1\n", 642 | "303 1\n", 643 | "304 2\n", 644 | "305 1\n", 645 | "306 1\n", 646 | "307 9\n", 647 | "308 2\n", 648 | "309 1\n", 649 | "310 1\n", 650 | "311 1\n", 651 | "312 1\n", 652 | "313 1\n", 653 | "314 2\n", 654 | "315 1\n", 655 | "316 1\n", 656 | "317 1\n", 657 | "318 1\n", 658 | "319 39\n", 659 | "320 6\n", 660 | "321 1\n", 661 | "322 1\n", 662 | "323 1\n", 663 | "324 2\n", 664 | "325 34\n", 665 | "326 1\n", 666 | "327 1\n", 667 | "328 1\n", 668 | "329 1\n", 669 | "330 1\n", 670 | "331 1\n", 671 | "332 91\n", 672 | "333 2\n", 673 | "334 1\n", 674 | "335 1\n", 675 | "336 2\n", 676 | "337 1\n", 677 | "338 2\n", 678 | "339 1\n", 679 | "340 1\n", 680 | "341 1\n", 681 | "342 4\n", 682 | "343 1\n", 683 | "344 1\n", 684 | "345 1\n", 685 | "346 1\n", 686 | "347 5\n", 687 | "348 3\n", 688 | "349 88\n", 689 | "350 6\n", 690 | "351 7\n", 691 | "352 1\n", 692 | "353 217\n", 693 | "354 1\n", 694 | "355 1\n", 695 | "356 1\n", 696 | "357 1\n", 697 | "358 1\n", 698 | "359 1\n", 699 | "360 1315\n", 700 | "361 3\n", 701 | "362 13\n", 702 | "363 14\n", 703 | "364 1\n", 704 | "365 1\n", 705 | "366 29\n", 706 | "367 1\n", 707 | "368 1\n", 708 | "369 1\n", 709 | "370 1\n", 710 | "371 1\n", 711 | "372 3\n", 712 | "373 1\n", 713 | "374 1\n", 714 | "375 1\n", 715 | "376 1\n", 716 | "377 1\n", 717 | "378 12\n", 718 | "379 2\n", 719 | "380 1\n", 720 | "381 2\n", 721 | "382 1\n", 722 | "383 4\n", 723 | "384 1\n", 724 | "385 2\n", 725 | "386 1\n", 726 | "387 7\n", 727 | "388 1\n", 728 | "389 1\n", 729 | "390 2\n", 730 | "391 8\n", 731 | "392 1\n", 732 | "393 1\n", 733 | "394 6\n", 734 | "395 3\n", 735 | "396 9\n", 736 | "397 18\n", 737 | "398 1\n", 738 | "399 1\n", 739 | "400 771\n", 740 | "401 1\n", 741 | "402 470\n", 742 | "403 1\n", 743 | "404 1\n", 744 | "405 1\n", 745 | "406 1\n", 746 | "407 1\n", 747 | "408 1\n", 748 | "409 1\n", 749 | "410 44\n", 750 | "411 2\n", 751 | "412 3\n", 752 | "413 28\n", 753 | "414 16\n", 754 | "415 1\n", 755 | "416 1\n", 756 | "417 43\n", 757 | "418 78\n", 758 | "419 2\n", 759 | "420 4\n", 760 | "421 18\n", 761 | "422 5\n", 762 | "423 1\n", 763 | "424 2\n", 764 | "425 2\n", 765 | "426 1\n", 766 | "427 2\n", 767 | "428 1\n", 768 | "429 14\n", 769 | "430 1\n", 770 | "431 1\n", 771 | "432 1\n", 772 | "433 42\n", 773 | "434 1\n", 774 | "435 1\n", 775 | "436 1\n", 776 | "437 1\n", 777 | "438 1\n", 778 | "439 1\n", 779 | "440 1\n", 780 | "441 2\n", 781 | "442 1\n", 782 | "443 650\n", 783 | "444 1\n", 784 | "445 1\n", 785 | "446 6\n", 786 | "447 1\n", 787 | "448 1\n", 788 | "449 1\n", 789 | "450 3\n", 790 | "451 1\n", 791 | "452 1\n", 792 | "453 1\n", 793 | "454 1\n", 794 | "455 1\n", 795 | "456 1\n", 796 | "457 17\n", 797 | "458 1\n", 798 | "459 1\n", 799 | "460 1\n", 800 
| "461 2\n", 801 | "462 3\n", 802 | "463 1\n", 803 | "464 1\n", 804 | "465 1\n", 805 | "466 29\n", 806 | "467 1\n", 807 | "468 1\n", 808 | "469 1\n", 809 | "470 2\n", 810 | "471 2\n", 811 | "472 1\n", 812 | "473 5\n", 813 | "474 3\n", 814 | "475 4\n", 815 | "476 1\n", 816 | "477 24\n", 817 | "478 1\n", 818 | "479 1\n", 819 | "480 10\n", 820 | "481 1\n", 821 | "482 1\n", 822 | "483 1\n", 823 | "484 1\n", 824 | "485 2\n", 825 | "486 2\n", 826 | "487 1\n", 827 | "488 1\n", 828 | "489 50\n", 829 | "490 1\n", 830 | "491 4\n", 831 | "492 12\n", 832 | "493 2\n", 833 | "494 2\n", 834 | "495 1\n", 835 | "496 5\n", 836 | "497 1\n", 837 | "498 2\n", 838 | "499 1\n" 839 | ] 840 | } 841 | ], 842 | "source": [ 843 | "print(sets.keys())\n", 844 | "for k in sets.keys():\n", 845 | " print(k, len(sets[k]))" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": 14, 851 | "metadata": {}, 852 | "outputs": [ 853 | { 854 | "name": "stdout", 855 | "output_type": "stream", 856 | "text": [ 857 | "['freestanding_graphene', 'coexistent', 'fermi_contour', 'nonanalytic', 'lorenz', 'weak_value', 'leq_x', 'satisfiability_problem', 'are_there', 'simultaneously', 'spinel_oxide', 'oscillator_strength', 'transmon', 'microwave_photoresistance', 'valley_filter', 'nb_film', 'trial', 'screened_exchange', 'to_generate', 'minkowski', 'diffusional', 'pin', 'magnu_force', 'laser_excited', 'competing_species', 'classical_correspondence', 'paramagnon']\n" 858 | ] 859 | } 860 | ], 861 | "source": [ 862 | "print(sets[11])" 863 | ] 864 | }, 865 | { 866 | "cell_type": "code", 867 | "execution_count": null, 868 | "metadata": { 869 | "collapsed": true 870 | }, 871 | "outputs": [], 872 | "source": [] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": null, 877 | "metadata": { 878 | "collapsed": true 879 | }, 880 | "outputs": [], 881 | "source": [ 882 | "parsed_abstracts = parse_abstract('allabstracts.txt')" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": null, 888 | "metadata": { 889 | "collapsed": true 890 | }, 891 | "outputs": [], 892 | "source": [ 893 | "# This takes a very long time!\n", 894 | "re_parse = False\n", 895 | "if re_parse:\n", 896 | " parsed_abstracts = parse_abstract('allabstracts.txt')\n", 897 | " # Save to a file, so we can load it much faster than having\n", 898 | " # to re-parse the raw data.\n", 899 | " np.save(\"parsed_abstracts.npy\", parsed_abstracts)\n", 900 | "else:\n", 901 | " # Load the titles from the file.\n", 902 | " # The atleast_2d is a hack for correctly loading the dictionary...\n", 903 | " parsed_abstracts = np.atleast_2d(np.load(\"allabstracts.npy\"))[0][0]" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": { 910 | "collapsed": true 911 | }, 912 | "outputs": [], 913 | "source": [ 914 | "parsed_abstracts = np.atleast_2d(np.load(\"allabstracts.npy\"))[0][0]" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": null, 920 | "metadata": { 921 | "collapsed": true 922 | }, 923 | "outputs": [], 924 | "source": [ 925 | "ngram_abstr, bigrams_abstr, ngrams_abstr = get_ngrams(abstr)" 926 | ] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "execution_count": null, 931 | "metadata": { 932 | "collapsed": true 933 | }, 934 | "outputs": [], 935 | "source": [ 936 | "# train word2vec \n", 937 | "abstrmodel = gensim.models.Word2Vec(ngram_abstr, window=25, min_count=5, size=100)" 938 | ] 939 | } 940 | ], 941 | "metadata": { 942 | "kernelspec": { 943 | "display_name": "Python 3", 
944 | "language": "python", 945 | "name": "python3" 946 | }, 947 | "language_info": { 948 | "codemirror_mode": { 949 | "name": "ipython", 950 | "version": 3 951 | }, 952 | "file_extension": ".py", 953 | "mimetype": "text/x-python", 954 | "name": "python", 955 | "nbconvert_exporter": "python", 956 | "pygments_lexer": "ipython3", 957 | "version": "3.5.2" 958 | } 959 | }, 960 | "nbformat": 4, 961 | "nbformat_minor": 2 962 | } 963 | -------------------------------------------------------------------------------- /alltitles.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/alltitles.npy -------------------------------------------------------------------------------- /arXivHarvest.py: -------------------------------------------------------------------------------- 1 | #-------------------------------------------------------------------- 2 | # arXivHarvest.py 3 | # 4 | # Harvests (using OAI metadata available through an arXiv URL) 5 | # the titles and abstracts of a given arXiv section (cond-mat, 6 | # quant-ph, etc). 7 | # 8 | # The result of running this script will be two .txt files, 9 | # one containing the titles, and the other the corresponding 10 | # abstracts. 11 | # 12 | # The title.txt file is structured as (example): 13 | # 2017 3 This is the title of a paper published in 2017 14 | # that was too long to fit on a single line, so it con- 15 | # tinues with two whitespaces on the next line 16 | # 1998 12 This one is older but has a much shorter title 17 | # 18 | # The abstract.txt file has no year/month information, and 19 | # is ordered the same way as the title.txt file (so first 20 | # abstract belongs to the first title, etc). 
21 | #--------------------------------------------------------------------
22 | # Import modules
23 | from oaipmh.client import Client
24 | from oaipmh.metadata import MetadataRegistry, MetadataReader
25 | import time
26 | import argparse
27 | 
28 | parser = argparse.ArgumentParser(description="Harvest an arXiv subsection's titles")
29 | parser.add_argument('--section', type=str, required=False, default=None,
30 |                     help='arXiv set to harvest, e.g. physics:cond-mat')
31 | parser.add_argument('--output', type=str, required=False, default=None,
32 |                     help='output filename for the harvested titles (*.txt file)')
33 | 
34 | args = parser.parse_args()
35 | section = args.section
36 | output = args.output
37 | 
38 | # Change this to harvest a different arXiv set
39 | section="physics:cond-mat" if section == None else section
40 | # And change these to specify the txt file to save the data in
41 | title_file = "all_cond_mat_titles.txt" if output == None else output
42 | 
43 | #abstr_file = "all_cond_mat_abstracts.txt"
44 | 
45 | # Create a new MetadataReader, and list just the fields we are interested in
46 | oai_dc_reader = MetadataReader(
47 |     fields={
48 |         'title': ('textList', 'oai_dc:dc/dc:title/text()'),
49 |         'abstract': ('textList', 'oai_dc:dc/dc:description/text()'),
50 |         'date': ('textList', 'oai_dc:dc/dc:date/text()'),
51 |     },
52 |     namespaces={
53 |         'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
54 |         'dc' : 'http://purl.org/dc/elements/1.1/'}
55 | )
56 | 
57 | # And create a registry for parsing the oai info, linked to the reader
58 | registry = MetadataRegistry()
59 | registry.registerReader('oai_dc', oai_dc_reader)
60 | 
61 | # arXiv OAI url we will query
62 | URL = "http://export.arxiv.org/oai2"
63 | # Create OAI client; now we're all set for listing some records
64 | client = Client(URL, registry)
65 | 
66 | # Open files for writing
67 | titlef = open(title_file, 'w')
68 | #abstractf = open(abstr_file, 'w')
69 | 
70 | # Keep track of run-time and number of papers
71 | start_time = time.time()
72 | count = 0
73 | 
74 | # Harvest
75 | for record in client.listRecords(metadataPrefix='oai_dc', set=section):
76 |     try:
77 |         # Extract the title
78 |         title = record[1].getField('title')[0]
79 |         # Extract the abstract
80 |         abstract = record[1].getField('abstract')[0]
81 |         # And get the date (this is stored as yyyy-mm-dd in the arXiv metadata)
82 |         date = record[1].getField('date')[0]
83 |         year = int(date[0:4])
84 |         month = int(date[5:7])
85 | 
86 |         # Write to file (add year info to the titles)
87 |         titlef.write("%d %d "%(year,month) + title + "\n")
88 |         # abstractf.write(abstract + "\n")
89 | 
90 |         count += 1
91 |         # Flush every 100 papers to the files
92 |         if count % 100 == 0 and count > 1:
93 |             print("Harvested {0} papers so far (elapsed time = {1})".format(count, time.time() - start_time))
94 |             titlef.flush(); #abstractf.flush()
95 |     except Exception as e:
96 |         print("Encountered error whilst reading record: ", record)
97 |         print("Exception: ", e)
98 |         continue
99 | 
100 | 
101 | # Close files
102 | #abstractf.close()
103 | titlef.close()
104 | 
105 | # Report runtime and number of papers processed
106 | runtime = time.time() - start_time
107 | print("It took {} seconds to collect {} titles and abstracts".format(runtime, count))
108 | 
--------------------------------------------------------------------------------
/askmodel.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import argparse
3 | import gensim
4 | 
5 | parser = argparse.ArgumentParser(description="Ask a
trained Word2Vec model some questions")
6 | parser.add_argument('--input', type=str, required=True,
7 |                     help='a trained model file')
8 | parser.add_argument('--add', type=str, nargs='*', default="",
9 |                     help='word(s) to add in the query')
10 | parser.add_argument('--subtract', type=str, nargs='*', default="",
11 |                     help='word(s) to subtract in the query')
12 | 
13 | args = parser.parse_args()
14 | inputfile = args.input
15 | positive = args.add
16 | negative = args.subtract
17 | 
18 | # Load the model
19 | model = gensim.models.Word2Vec.load(inputfile)
20 | 
21 | # Build a nicer query string
22 | querystring = ""
23 | for i in range(len(positive)):
24 |     querystring = querystring + positive[i]
25 | 
26 |     if i < len(positive) - 1:
27 |         querystring = querystring + " + "
28 | 
29 | if len(negative) != 0:
30 |     querystring = querystring + " - "
31 | 
32 | for i in range(len(negative)):
33 |     querystring = querystring + negative[i]
34 | 
35 |     if i < len(negative) - 1:
36 |         querystring = querystring + " - "
37 | 
38 | print(querystring + " = \n")
39 | 
40 | # Get and display the answers
41 | result = model.most_similar(positive=positive, negative=negative, topn=10)
42 | for r in result:
43 |     print("{0:40} (with similarity score {1})".format(r[0], r[1]))
44 | print("\n")
45 | 
--------------------------------------------------------------------------------
/caltechmask.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/caltechmask.png
--------------------------------------------------------------------------------
/caltechwordcloud.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/caltechwordcloud.png
--------------------------------------------------------------------------------
/condmat-model-window-10-mincount-5-size-100:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/condmat-model-window-10-mincount-5-size-100
--------------------------------------------------------------------------------
/helper.py:
--------------------------------------------------------------------------------
1 | # Regular expressions
2 | import re
3 | # Use Inflect for singular-izing words
4 | import inflect
5 | # Gensim for learning phrases and word2vec
6 | import gensim
7 | 
8 | # For some reason, inflect thinks that there is a singular form of 'mass', namely 'mas'
9 | # and similarly for gas. Please add any other exceptions to this list!
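# (Illustration only: a hypothetical extra exception would be added with the same
#  pattern as the calls below, e.g. p.defnoun('lens', 'lens|lenses').)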
10 | p = inflect.engine()
11 | p.defnoun('mass', 'mass|masses')
12 | p.defnoun('gas', 'gas|gases')
13 | p.defnoun('gas', 'gas|gasses') # Other spelling
14 | p.defnoun('gaas', 'gaas') #GaAs ;)
15 | p.defnoun('gapless', 'gapless')
16 | p.defnoun('haas', 'haas')
17 | 
18 | # Check if a string has digits
19 | def hasNumbers(inputString):
20 |     return any(char.isdigit() for char in inputString)
21 | 
22 | # Return the singular form of a word, if it exists
23 | def singularize(word):
24 |     try:
25 |         # p.singular_noun() returns the singular form, but
26 |         # returns False if there is no singular form (or already singular)
27 | 
28 |         # So, if the word is already singular, just return the word
29 |         if not p.singular_noun(word):
30 |             return word
31 |         else:
32 |             # And otherwise return the singular version
33 |             return p.singular_noun(word)
34 | 
35 |     except Exception as e:
36 |         print("Euh? What's this? %s"%word)
37 |         print("This caused an exception: ", e)
38 |         return word
39 | 
40 | def stripchars(w, chars):
41 |     return "".join( [c for c in w if c not in chars] ).strip('\n')
42 | 
43 | # Parse a title into words
44 | def parse_title(title):
45 |     # Extract the year
46 |     year, rest = title.split(' ', 1)
47 |     year = int(year[0:])
48 |     # Then the month
49 |     month, title = rest.split(' ', 1)
50 |     month = int(month[0:])
51 | 
52 |     # Then, for every word in the title:
53 |     # 1) Split the title into words, by splitting it on spaces ' ' and on '-' (de-hyphenate words).
54 |     # 2) Turn each of those resulting words into lowercase letters only
55 |     # 3) Strip out any weird symbols (we don't want parenthesized words, no ';' at the end of a word, etc)
56 |     # 4) Also, we don't want to have digits.. my apologies to all the material studies on interesting compounds!
57 |     words = re.split( ' |-|\\|/', title.lower() )
58 |     wordlist = []
59 |     for i in range(len(words)):
60 |         w = words[i]
61 | 
62 |         # Skip if there is no word, or if we have numbers
63 |         if len(w) < 1 or hasNumbers(w):
64 |             continue
65 | 
66 |         # If it is (probably) math, let's skip it
67 |         if w[0] == '$' and w[-1] == '$':
68 |             continue
69 | 
70 |         # Remove other unwanted characters
71 |         w = stripchars(w, '\\/$(){}.<>,;:_"|\'\n `?!#%')
72 |         # Get singular form
73 |         w = singularize(w)
74 | 
75 |         # Skip if nothing left, or just an empty space
76 |         if len(w) < 1 or w == ' ':
77 |             continue
78 | 
79 |         # Append to the list
80 |         wordlist.append(w)
81 | 
82 |     return year, month, wordlist
83 | 
84 |     # Previous versions
85 |     #return year, month [singularize(stripchars(w, ['\\/$(){}.<>,;:"|\'\n '])) for w in re.split(' |-|\\|/',title.lower()) if not hasNumbers(w)]
86 |     #return year, month, [singularize(w.strip("\\/$|[](){}\n;:\"\',")) for w in re.split(' |-',title.lower()) if not hasNumbers(w)]
87 | 
88 | 
89 | def load_and_parse_all_titles(file):
90 |     """ Read title info from file, and parse the titles into words """
91 | 
92 |     # Buffer for storing the file
93 |     all_lines = []
94 |     # Read file into the buffer
95 |     with open(file, "r") as f:
96 |         for i,line in enumerate(f):
97 |             all_lines.append(line)
98 | 
99 |     # An empty dictionary for storing all the title info. This
100 |     # dictionary will have the year of the title as the key, and will
101 |     # itself hold dictionaries that have the month as a key.
102 | # For example: 103 | # all_titles[2007] = dictionary with months as keys 104 | # all_titles[2007][3] = list of titles from march 2007 105 | all_titles = {} 106 | 107 | # Keep track of the number of titles 108 | num_titles = 0 109 | 110 | # The title.txt file should be organized such, that new 111 | # titles start with year and month, and that titles that continue 112 | # on the next line start with two empty spaces. 113 | # So we're going to loop through the lines, and append the current 114 | # line to the previous title if it started with two empty spaces. 115 | # If not, it means we have found the start of a new title. 116 | title = all_lines[0] 117 | previous_title = "" 118 | 119 | # Scan each line 120 | i = 1 121 | while (i < (len(all_lines)-1)): 122 | 123 | # If we find a new title (no empty spaces at the start) 124 | if all_lines[i][0:2] != " ": 125 | 126 | # The title we have so far can be parsed and added 127 | # to the title-list 128 | year, month, title = parse_title(title) 129 | 130 | # If we have not seen this year before, create a new 131 | # dictionary entry with this year as the key. 132 | if year not in all_titles: 133 | all_titles[year] = {} 134 | 135 | # And do the same with the month, if we haven't seen it. 136 | if month not in all_titles[year]: 137 | all_titles[year][month] = [] 138 | 139 | # Now that we're sure the key pair [year][month] exists as a list 140 | # we can add the title to it. 141 | all_titles[year][month].append(title) 142 | num_titles += 1 143 | 144 | # Then start the next one 145 | title = all_lines[i] 146 | previous_title = title 147 | else: 148 | # We are still on the same title 149 | title = previous_title + all_lines[i][1:] 150 | previous_title = title 151 | 152 | # Go to the next line 153 | i += 1 154 | 155 | print("Read and parsed %d titles"%(num_titles)) 156 | return all_titles 157 | 158 | def get_titles_for_years(all_titles, years): 159 | """ Return list of all titles for given years (must be a list, even if only one)""" 160 | collectedtitles = [] 161 | for k in years: 162 | allmonthtitles = [] 163 | for m in all_titles[k].keys(): 164 | allmonthtitles = allmonthtitles + all_titles[k][m] 165 | 166 | collectedtitles = collectedtitles + allmonthtitles 167 | return collectedtitles 168 | 169 | def get_ngrams(sentences): 170 | """ Detects n-grams with n up to 4, and replaces those in the titles. """ 171 | # Train a 2-word (bigram) phrase-detector 172 | bigram_phrases = gensim.models.phrases.Phrases(sentences) 173 | 174 | # And construct a phraser from that (an object that will take a sentence 175 | # and replace in it the bigrams that it knows by single objects) 176 | bigram = gensim.models.phrases.Phraser(bigram_phrases) 177 | 178 | # Repeat that for trigrams; the input now are the bigrammed-titles 179 | ngram_phrases = gensim.models.phrases.Phrases(bigram[sentences]) 180 | ngram = gensim.models.phrases.Phraser(ngram_phrases) 181 | 182 | # !! If you want to have more than 4-grams, just repeat the structure of the 183 | # above two lines. That is, train another Phrases on the ngram_phrases[titles], 184 | # that will get you up to 8-grams. 185 | 186 | # Now that we have phrasers for bi- and trigrams, let's analyze them 187 | # The phrases.export_phrases(x) function returns pairs of phrases and their 188 | # certainty scores from x. 
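    # (Illustration only: in the gensim 2.x/3.x API used here, each item yielded by
    #  export_phrases() is roughly a (bytes, float) pair such as (b'magnetic_field', 123.4),
    #  i.e. the detected phrase with its words joined by '_' plus its score; the exact
    #  values depend on the corpus.)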
189 | bigram_info = {} 190 | for b, score in bigram_phrases.export_phrases(sentences): 191 | bigram_info[b] = [score, bigram_info.get(b,[0,0])[1] + 1] 192 | 193 | ngram_info = {} 194 | for b, score in ngram_phrases.export_phrases(bigram[sentences]): 195 | ngram_info[b] = [score, ngram_info.get(b,[0,0])[1] + 1] 196 | 197 | # Return a list of 'n-grammed' titles, and the bigram and trigram info 198 | return [ngram[t] for t in sentences], bigram_info, ngram_info 199 | 200 | # !!! THIS SECTION HAS NOT YET BEEN UPDATED 201 | # !!! IT WILL WORK, BUT IT TAKES A *VERY* LONG 202 | # !!! TIME. HAS TO SWITCH TO LIST COMPREHENSION 203 | 204 | # Parse abstract into sentences 205 | def parse_abstract(file): 206 | # Buffer for storing the file 207 | abstr = open(file, "r").read() 208 | 209 | sentences = [] 210 | 211 | # Clean up abstract 212 | abstr.lower() 213 | abstr.replace('\'', '') 214 | abstr.replace('\"', '') 215 | 216 | # Extract sentences and split into words 217 | end = abstr.find('.') 218 | while end != -1: 219 | sentence = abstr[:end].replace('\n', ' ') 220 | 221 | # Sanitize the words 222 | words = re.split( ' |-|\\|/', sentence.lower() ) 223 | wordlist = [] 224 | for i in range(len(words)): 225 | w = words[i] 226 | 227 | # Skip if there is no word, or if we have numbers 228 | if len(w) < 1 or hasNumbers(w): 229 | continue 230 | 231 | # If it is (probably) math, let's skip it 232 | if w[0] == '$' and w[-1] == '$': 233 | continue 234 | 235 | # Remove other unwanted characters 236 | w = stripchars(w, '\\/$(){}.<>,;:_"|\'\n `?!#%') 237 | # Get singular form 238 | w = singularize(w) 239 | 240 | # Skip if nothing left, or just an empty space 241 | if len(w) < 1 or w == ' ': 242 | continue 243 | 244 | # Append to the list 245 | wordlist.append(w) 246 | 247 | sentences.append( wordlist ) 248 | abstr = abstr[end+1:] 249 | end = abstr.find('.') 250 | 251 | return sentences -------------------------------------------------------------------------------- /numpapers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/everthemore/physics2vec/d437efa21cb56423fba6653cf48af3e528b26114/numpapers.png -------------------------------------------------------------------------------- /parsetitles.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from helper import * 3 | import argparse 4 | 5 | parser = argparse.ArgumentParser(description="Parse titles into a *.npy file") 6 | parser.add_argument('--input', type=str, required=True, 7 | help='text file with titles from harvest') 8 | parser.add_argument('--output', type=str, required=True, 9 | help='output filename for *.npy file') 10 | 11 | args = parser.parse_args() 12 | inputfile = args.input 13 | outputfile = args.output 14 | 15 | print("Parsing file.. 
(may take a short while)")
16 | all_titles = load_and_parse_all_titles(inputfile)
17 | np.save(outputfile, all_titles)
18 | print("Done!")
19 | 
--------------------------------------------------------------------------------
/trainmodel.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import argparse
3 | from helper import *
4 | 
5 | parser = argparse.ArgumentParser(description="Train a Word2Vec encoding on input, and store the resulting model in output")
6 | parser.add_argument('--input', type=str, required=True,
7 |                     help='a *.npy file with parsed titles')
8 | parser.add_argument('--size', type=int, default=100,
9 |                     help='size of encoding vectors')
10 | parser.add_argument('--window', type=int, default=10,
11 |                     help='size of window scanning over text')
12 | parser.add_argument('--mincount', type=int, default=5,
13 |                     help='minimum number of times a word has to appear to participate')
14 | parser.add_argument('--output', type=str, required=True,
15 |                     help='output filename for saving the model')
16 | 
17 | args = parser.parse_args()
18 | inputfile = args.input
19 | size = args.size
20 | window = args.window
21 | mincount = args.mincount
22 | outputfile = args.output
23 | 
24 | print("Training model with\n")
25 | print("{0:30} = {1}".format("input", inputfile))
26 | print("{0:30} = {1}".format("size", size))
27 | print("{0:30} = {1}".format("window", window))
28 | print("{0:30} = {1}".format("mincount", mincount))
29 | 
30 | all_titles = np.atleast_2d(np.load(inputfile))[0][0]
31 | all_years = sorted(list(all_titles.keys()))
32 | titles = get_titles_for_years(all_titles, all_years)
33 | ngram_titles, bigrams, ngrams = get_ngrams(titles)
34 | model = gensim.models.Word2Vec(ngram_titles, window=window, min_count=mincount, size=size)
35 | print("Saving to {0}".format(outputfile))
36 | model.save(outputfile)
37 | print("Done!")
38 | 
--------------------------------------------------------------------------------
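
A quick way to sanity-check a model produced by trainmodel.py (a minimal sketch; it only uses calls that already appear in askmodel.py and the Word2Vec notebook, and the query word 'electron' is just an example, so substitute any word that survived the --mincount cutoff):

import gensim

# Load the model written by trainmodel.py (filename taken from the README quickstart)
model = gensim.models.Word2Vec.load("condmatmodel-100-10-5")

# Print the three nearest neighbours of an example word
for word, score in model.most_similar(positive=['electron'], topn=3):
    print("{0:40} (similarity {1:.3f})".format(word, score))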