├── Beyond Word2vec Setup.ipynb
├── Beyond Word2vec.ipynb
├── README.md
└── images
    ├── SVD.png
    ├── Word2vec1.png
    ├── Word2vec2.png
    ├── Word2vec3.png
    ├── aboutme.png
    ├── bag_of_words.png
    ├── centroid.png
    ├── centroid10.png
    ├── centroid11.png
    ├── centroid2.png
    ├── centroid3.png
    ├── centroid4.png
    ├── centroid5.png
    ├── centroid6.png
    ├── centroid7.png
    ├── centroid8.png
    ├── centroid9.png
    ├── cluster.png
    ├── cluster1.png
    ├── cluster2.png
    ├── cluster3.png
    ├── cluster4.png
    ├── continuous.png
    ├── doc2vec.png
    ├── doc2vecpaper.png
    ├── docvecuse.png
    ├── docvecuse1.png
    ├── docvecuse2.png
    ├── docvecuse3.png
    ├── docvecuse4.png
    ├── firth.jpg
    ├── implicitfactorization.png
    ├── island.png
    ├── island2.png
    ├── jaggi.png
    ├── jaggipaper.png
    ├── linkedin.png
    ├── metis.png
    ├── odsc.png
    ├── parse.png
    ├── rnn.png
    ├── sent2vec.png
    ├── tfidf.png
    ├── tfidf1.png
    ├── tfidf10.png
    ├── tfidf2.png
    ├── tfidf3.png
    ├── tfidf4.png
    ├── tfidf5.png
    ├── tfidf6.png
    ├── tfidf7.png
    ├── tfidf8.png
    ├── tfidf9.png
    ├── wilkins.png
    ├── wordvectors.png
    └── world.png

/Beyond Word2vec Setup.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond Word2vec: Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1) Clone repo \n",
    "\n",
    "```bash\n",
    "git clone git@github.com:andrewdblevins/beyond_word2vec.git\n",
    "```\n",
    "\n",
    "Once you have cloned this repo, you can follow along with these instructions in this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2) Python Installs: \n",
    "\n",
    "Please note that we'll need several packages installed on your laptop.\n",
    "\n",
    "If you have Python, gensim, and Keras installed, you are probably good to go.\n",
    "\n",
    "If you haven't performed the installs yet, here are some steps to follow:\n",
    "\n",
    "I generally recommend using the [Anaconda distribution](https://anaconda.org/anaconda/python).\n",
    "\n",
    "Create a conda environment: \n",
    "\n",
    "```bash\n",
    "conda create -n beyondw2v python=3 \n",
    "source activate beyondw2v\n",
    "conda install anaconda \n",
    "```\n",
    "Then install additional packages: \n",
    "```bash\n",
    "conda install gensim\n",
    "conda install -c conda-forge keras \n",
    "conda install -c conda-forge theano\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3) Pretrained Word Vectors\n",
    "Download the pretrained Google News word vectors.\n",
    "\n",
    "**Warning: this file is large (about 3.6GB once uncompressed)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!wget https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Optional:** you may want to test other pretrained vectors. If so, download them from:\n",
    "\n",
    "* GloVe: http://nlp.stanford.edu/projects/glove/\n",
    "* sense2vec: https://github.com/explosion/sense2vec\n",
    "* fastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md\n",
    "* metaembeddings: http://cistern.cis.lmu.de/meta-emb/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4) Evaluation Dataset\n",
    "\n",
    "The evaluation dataset is the Stanford Natural Language Inference (SNLI) corpus: https://nlp.stanford.edu/projects/snli/\n",
    "\n",
    "Download the following file (~100MB):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# beyond_word2vec
Slides for my doc2vec workshop/talk

# Setup Instructions

### 1) Clone repo

```bash
git clone git@github.com:andrewdblevins/beyond_word2vec.git
```

Once you have cloned this repo, you can follow along with these instructions in the `Beyond Word2vec Setup.ipynb` notebook.

### 2) Python Installs:

Please note that we'll need several packages installed on your laptop.

If you have Python, gensim, and Keras installed, you are probably good to go.

If you haven't performed the installs yet, here are some steps to follow:

I generally recommend using the [Anaconda distribution](https://anaconda.org/anaconda/python).

Create a conda environment:

```bash
conda create -n beyondw2v python=3
source activate beyondw2v
conda install anaconda
```
Then install additional packages:
```bash
conda install gensim
conda install -c conda-forge keras
conda install -c conda-forge theano
```

### 3) Pretrained Word Vectors
Download the pretrained Google News word vectors.

**Warning: this file is large (about 3.6GB once uncompressed)**

```bash
wget https://s3.amazonaws.com/mordecai-geo/GoogleNews-vectors-negative300.bin.gz
```

**Optional:** you may want to test other pretrained vectors. If so, download them from:

* GloVe: http://nlp.stanford.edu/projects/glove/
* sense2vec: https://github.com/explosion/sense2vec
* fastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
* metaembeddings: http://cistern.cis.lmu.de/meta-emb/

### 4) Evaluation Dataset

The evaluation dataset is the Stanford Natural Language Inference (SNLI) corpus: https://nlp.stanford.edu/projects/snli/

Download the following file (~100MB):

```bash
wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip
```
--------------------------------------------------------------------------------
/images/SVD.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/SVD.png
--------------------------------------------------------------------------------
/images/Word2vec1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/Word2vec1.png -------------------------------------------------------------------------------- /images/Word2vec2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/Word2vec2.png -------------------------------------------------------------------------------- /images/Word2vec3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/Word2vec3.png -------------------------------------------------------------------------------- /images/aboutme.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/aboutme.png -------------------------------------------------------------------------------- /images/bag_of_words.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/bag_of_words.png -------------------------------------------------------------------------------- /images/centroid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid.png -------------------------------------------------------------------------------- /images/centroid10.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid10.png -------------------------------------------------------------------------------- /images/centroid11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid11.png -------------------------------------------------------------------------------- /images/centroid2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid2.png -------------------------------------------------------------------------------- /images/centroid3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid3.png -------------------------------------------------------------------------------- /images/centroid4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid4.png -------------------------------------------------------------------------------- /images/centroid5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid5.png -------------------------------------------------------------------------------- /images/centroid6.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid6.png -------------------------------------------------------------------------------- /images/centroid7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid7.png -------------------------------------------------------------------------------- /images/centroid8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid8.png -------------------------------------------------------------------------------- /images/centroid9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/centroid9.png -------------------------------------------------------------------------------- /images/cluster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/cluster.png -------------------------------------------------------------------------------- /images/cluster1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/cluster1.png -------------------------------------------------------------------------------- /images/cluster2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/cluster2.png 
-------------------------------------------------------------------------------- /images/cluster3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/cluster3.png -------------------------------------------------------------------------------- /images/cluster4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/cluster4.png -------------------------------------------------------------------------------- /images/continuous.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/continuous.png -------------------------------------------------------------------------------- /images/doc2vec.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/doc2vec.png -------------------------------------------------------------------------------- /images/doc2vecpaper.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/doc2vecpaper.png -------------------------------------------------------------------------------- /images/docvecuse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/docvecuse.png -------------------------------------------------------------------------------- /images/docvecuse1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/docvecuse1.png -------------------------------------------------------------------------------- /images/docvecuse2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/docvecuse2.png -------------------------------------------------------------------------------- /images/docvecuse3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/docvecuse3.png -------------------------------------------------------------------------------- /images/docvecuse4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/docvecuse4.png -------------------------------------------------------------------------------- /images/firth.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/firth.jpg -------------------------------------------------------------------------------- /images/implicitfactorization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/implicitfactorization.png -------------------------------------------------------------------------------- /images/island.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/island.png -------------------------------------------------------------------------------- /images/island2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/island2.png -------------------------------------------------------------------------------- /images/jaggi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/jaggi.png -------------------------------------------------------------------------------- /images/jaggipaper.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/jaggipaper.png -------------------------------------------------------------------------------- /images/linkedin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/linkedin.png -------------------------------------------------------------------------------- /images/metis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/metis.png -------------------------------------------------------------------------------- /images/odsc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/odsc.png 
-------------------------------------------------------------------------------- /images/parse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/parse.png -------------------------------------------------------------------------------- /images/rnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/rnn.png -------------------------------------------------------------------------------- /images/sent2vec.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/sent2vec.png -------------------------------------------------------------------------------- /images/tfidf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf.png -------------------------------------------------------------------------------- /images/tfidf1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf1.png -------------------------------------------------------------------------------- /images/tfidf10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf10.png -------------------------------------------------------------------------------- /images/tfidf2.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf2.png -------------------------------------------------------------------------------- /images/tfidf3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf3.png -------------------------------------------------------------------------------- /images/tfidf4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf4.png -------------------------------------------------------------------------------- /images/tfidf5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf5.png -------------------------------------------------------------------------------- /images/tfidf6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf6.png -------------------------------------------------------------------------------- /images/tfidf7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf7.png -------------------------------------------------------------------------------- /images/tfidf8.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf8.png -------------------------------------------------------------------------------- /images/tfidf9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/tfidf9.png -------------------------------------------------------------------------------- /images/wilkins.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/wilkins.png -------------------------------------------------------------------------------- /images/wordvectors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/wordvectors.png -------------------------------------------------------------------------------- /images/world.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewdblevins/beyond_word2vec/b232460eabcbc6cd1aef3ce873b805eddb194ff5/images/world.png --------------------------------------------------------------------------------
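Once `snli_1.0.zip` from step 4 of the setup has been downloaded, a quick sanity check is to read a few sentence pairs straight out of the archive. The sketch below is not part of the repo: `iter_snli_pairs` is a hypothetical helper, and it assumes the archive stores each split as `snli_1.0/snli_1.0_<split>.jsonl` with `sentence1`, `sentence2`, and `gold_label` fields, which is the layout described on the SNLI project page. Adjust the member name if your copy differs.

```python
import io
import json
import zipfile


def iter_snli_pairs(zip_path, split="dev"):
    """Yield (sentence1, sentence2, gold_label) tuples from one SNLI split.

    `zip_path` may be a filesystem path or a binary file-like object.
    The member path below is an assumption about the archive layout.
    """
    member = f"snli_1.0/snli_1.0_{split}.jsonl"
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                if not line.strip():
                    continue  # tolerate blank lines
                record = json.loads(line)
                # A gold label of "-" marks pairs without annotator consensus.
                if record["gold_label"] == "-":
                    continue
                yield record["sentence1"], record["sentence2"], record["gold_label"]
```

Calling `iter_snli_pairs("snli_1.0.zip", split="train")` streams the training pairs without unpacking the archive to disk, which keeps the setup to the single `wget` download.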