├── .gitignore ├── _toc.yml ├── environment.yml ├── _config.yml ├── README.md └── word-embeddings-workshop.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | data/ 3 | -------------------------------------------------------------------------------- /_toc.yml: -------------------------------------------------------------------------------- 1 | - file: word-embeddings-workshop 2 | title: Word Embeddings Workshop 3 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: emb-workshop 2 | channels: 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - python=3.8 7 | - gensim 8 | - jupyterlab 9 | - scikit-learn 10 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | title: "CSS Workshop: Word Embeddings for the Social Sciences" 2 | author: "Connor Gilroy" 3 | copyright: "2021" 4 | only_build_toc_files: true 5 | page_titles: toc 6 | execute: 7 | execute_notebooks: "off" 8 | parse: 9 | myst_extended_syntax: true 10 | myst_url_schemes: [mailto, http, https] 11 | repository: 12 | url: https://github.com/ccgilroy/word-embeddings-workshop 13 | path_to_book: "/" 14 | branch: "main" 15 | html: 16 | use_repository_button: true 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CSS Workshop: Word Embeddings for the Social Sciences 2 | 3 | Author: Connor Gilroy 4 | 5 | This repository contains the materials for a 1-hour word embeddings workshop to be held on 3/19/21, organized by the Computational Social Science reading group at TechnoSoc. Follow the instructions below to run the notebook, word-embeddings-workshop.ipynb. 6 | 7 | View a static web version of the workshop tutorial at https://ccgilroy.github.io/word-embeddings-workshop/. Introductory slides are [here](https://docs.google.com/presentation/d/1V4SaADerFMph9wB7pES76vYUWhIzvQ_LMpSyAIB6f_o/edit?usp=sharing). 8 | 9 | ## Setup 10 | 11 | If you use [conda](https://docs.conda.io/en/latest/) to manage your Python environments, you can install the key packages you need for this workshop by running this command in your terminal: 12 | 13 | ``` 14 | conda env create -f environment.yml 15 | ``` 16 | 17 | The whatlies package isn't available through conda. For the optional part of the workshop that uses whatlies, install it with `pip install whatlies` (after activating the conda environment!). 18 | 19 | To open the notebook from your terminal, first activate the new environment, then run `jupyter lab`: 20 | 21 | ``` 22 | conda activate emb-workshop 23 | jupyter lab 24 | ``` 25 | 26 | --- 27 | 28 | (Here's the full code to create the conda environment. If creating it from the environment file works, you won't need this.)
29 | 30 | ``` 31 | conda create -n emb-workshop 32 | conda activate emb-workshop 33 | conda config --env --add channels conda-forge 34 | conda config --env --set channel_priority strict 35 | conda install python=3.8 jupyterlab gensim scikit-learn 36 | # optional, to use whatlies 37 | pip install whatlies 38 | ``` 39 | -------------------------------------------------------------------------------- /word-embeddings-workshop.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Word Embeddings Workshop\n", 8 | "\n", 9 | "March 19, 2021\n", 10 | "\n", 11 | "Instructor: Connor Gilroy, Department of Sociology, University of Washington \n", 12 | "TA: Nga Than, Sociology program, CUNY Graduate Center" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "## Introduction\n", 20 | "\n", 21 | "The goal of this workshop is to give you an intuitive and practical introduction to what word embeddings are and what they can be used for in the social sciences.\n", 22 | "\n", 23 | "You can find a short introductory slide deck [here](https://docs.google.com/presentation/d/1V4SaADerFMph9wB7pES76vYUWhIzvQ_LMpSyAIB6f_o/edit?usp=sharing)." 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Packages used\n", 31 | "\n", 32 | "This workshop primarily teaches and uses the **gensim** package. The main corpus (the 20 Newsgroups data set) comes from scikit-learn. \n", 33 | "\n", 34 | "An optional part of the tutorial also uses the *whatlies* package. For the exercises, you may find it helpful to load some common Python data science packages as well, if you have those installed in your environment." 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "import gensim\n", 44 | "import sklearn\n", 45 | "\n", 46 | "# import numpy as np\n", 47 | "# import pandas as pd\n", 48 | "# import matplotlib.pyplot as plt\n", 49 | "# import seaborn as sns\n", 50 | "# import altair as alt" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "## 1. Using a pretrained model (GloVe)\n", 58 | "\n", 59 | "Things to know about a pretrained model: \n", 60 | "\n", 61 | "- What's the overall modeling approach?\n", 62 | "- What vector size?\n", 63 | "- What vocabulary size?\n", 64 | "- What other parameters might affect substantive results? \n", 65 | "- What data was it trained on?\n", 66 | "\n", 67 | "Some of those things will be well-documented or obvious, some won't be.\n", 68 | "\n", 69 | "The GloVe models are originally documented [on a project page](https://nlp.stanford.edu/projects/glove/) from the Stanford NLP Group, but the gensim package also stores data for these and other models [in the gensim-data repository](https://github.com/RaRe-Technologies/gensim-data). We'll download and load a small pretrained model from the latter; click the link to read more about it. \n", 70 | "\n", 71 | "This model is around 66 MB. It's relatively small because each word is represented by a vector of only 50 numbers. Larger vectors (150-300 dimensions) are more common in practice." 
72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "import gensim.downloader as api\n", 81 | "glove_embeddings = api.load('glove-wiki-gigaword-50')" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "We can confirm that these vectors do indeed have 50 dimensions:" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "glove_embeddings.vector_size" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "How many words are in the vocabulary?" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "len(glove_embeddings.vocab)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "This is a 400,000 x 50 matrix, which can be accessed through `glove_embeddings.vectors`: " 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "glove_embeddings.vectors.shape" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "If a word is in the vocabulary, you can extract its embedding from the model like this:" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "glove_embeddings[\"society\"]" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "Different word, different vector: " 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "glove_embeddings[\"individual\"]" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "0.555 corresponds to 0.208, 0.493 corresponds to 0.895, and so on. (What do those individual positions mean? Not much on their own!)\n", 169 | "\n", 170 | "This is what unlocks the key innovation of word embeddings: we can calculate the similarity or distance between two words using their vector representations. This is usually done with a metric called **cosine similarity**, which ranges from -1 to 1 (where 1 means perfectly similar)." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "glove_embeddings.similarity(\"society\", \"individual\")" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Of course, a word is exactly similar to itself: " 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "glove_embeddings.similarity(\"society\", \"society\")" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "# try out calculating the similarities between a few other pairs of words here!\n",
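                "# for example (a sketch only -- uncomment to run; any pair of words in the GloVe vocabulary will work):\n",
                "# glove_embeddings.similarity(\"culture\", \"economy\")"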
205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "(Technical aside: cosine similarity is the dot product of two vectors, divided by the product of their L2 norms.)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "from numpy import dot\n", 221 | "from numpy.linalg import norm" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "(dot(glove_embeddings[\"society\"], glove_embeddings[\"individual\"]) / \n", 231 | " (norm(glove_embeddings[\"society\"], ord=2) * norm(glove_embeddings[\"individual\"], ord=2)))" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "To understand how a model represents a particular word, we can look at which words are the most similar to it according to that model.\n", 239 | "\n", 240 | "This is the foundation for making substantive claims about meaning: for instance, what a word means in a given set of documents, or how the meaning of a word has changed over time. \n", 241 | "\n", 242 | "Here are the 10 most similar words to \"society\" in the GloVe vocabulary: " 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "glove_embeddings.most_similar(\"society\", topn=10)" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "And the 10 most similar words to \"individual\":" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "glove_embeddings.most_similar(\"individual\", topn=10)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "Relations of similarity are the basic operation, but we're not limited to the individual word vectors we started with. We can do algebra on those vectors to build up new vectors that represent more complex meanings. For instance, we can take the *difference* between two vectors, in order to represent a binary opposition between them. \n", 275 | "\n", 276 | "Researchers build on this basic idea to create more robust vectors representing concepts that can be thought of in a binary way. Kozlowski et al 2019, for example, represent multiple dimensions of social class by averaging different pairs of antonyms. 
They then show how the associations between these dimensions change over the course of the 20th century.\n", 277 | "\n", 278 | "The next example constructs an opposition between \"society\" and \"individual\" -- this will give us the words that are the closest to the \"society\" end of that dimension, or the \"individual\" end.\n", 279 | "\n", 280 | "(You might be able to tell that these vectors were trained on Wikipedia entries, because the vocabulary includes many rare words -- we might get more interesting and meaningful results if we filtered the 400,000-word vocabulary first.)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "glove_embeddings.most_similar(positive=[\"society\"], negative=[\"individual\"])" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [ 298 | "glove_embeddings.most_similar(positive=[\"individual\"], negative=[\"society\"])" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "What's happening here, behind the scenes, is that one vector is being subtracted from the other. \n", 306 | "\n", 307 | "This creates a new vector, like this:" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "glove_embeddings[\"society\"] - glove_embeddings[\"individual\"]" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "You might want to use that derived vector as part of your model, so here's how to add it into the model's overall set of vectors. \n", 324 | "\n", 325 | "(This next part is a bit technical and not substantively interesting, so I'll gloss over the details. For an alternative, check out the whatlies package below.) " 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "diff = glove_embeddings[\"society\"] - glove_embeddings[\"individual\"]\n", 335 | "glove_embeddings.add(\"society - individual\", diff)\n", 336 | "glove_embeddings[\"society - individual\"]" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "In order for `most_similar()` to actually work with the \"society - individual\" vector, there's one more necessary step to run, in the next cell. \n", 344 | "\n", 345 | "(Why? Adding the vector doesn't automatically create an L2-normalized version of it, which `most_similar()` needs. `init_sims()` will recalculate, but only if the rest of the normed vectors are removed first.) 
" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "glove_embeddings.vectors_norm = None\n", 355 | "glove_embeddings.init_sims()" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "Now we can query this single vector and get the same results: " 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "glove_embeddings.most_similar(positive=[\"society - individual\"])" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "glove_embeddings.most_similar(negative=[\"society - individual\"])" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "### Exercise\n", 388 | "\n", 389 | "(Adapted from Kozlowski et al 2019) Make a list of sports. Construct a \"rich\" vs \"poor\" dimension to proxy social class. Which sports have the strongest class associations in either direction?" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "# add to this list!\n", 399 | "sports = [\"hockey\", \"baseball\", ] " 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "# create a rich - poor vector \n", 409 | "# then add it to the GloVe vectors" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": null, 415 | "metadata": {}, 416 | "outputs": [], 417 | "source": [ 418 | "# calculate similarity between class dimension and each sport\n", 419 | "sport_similarities = []\n", 420 | "for sport in sports:\n", 421 | " pass # replace with code to calculate similarity - you " 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "## 2. Creating a locally trained model (Word2Vec)\n", 429 | "\n", 430 | "If you're interested in word associations in a particular collection of texts, or if you suspect that those texts use language really differently from the sources pretrained models derive from (largely different kinds of contemporary internet data), then you might want to train your own model." 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "### Getting a corpus (20 Newsgroups)\n", 438 | "\n", 439 | "To train a new model on a corpus, we need a corpus. We'll use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/), which was created in 1995 for use in text-related machine learning research. Because the posts are partitioned into different groups, it's mainly used in classification and clustering applications. For an interesting contemporary example, which uses clusters of pretrained embeddings analogously to topic models, see [Sia et al 2020](http://arxiv.org/abs/2004.14914). \n", 440 | "\n", 441 | "(Sociologically, I find Usenet interesting as a historical precursor to contemporary online communities. 
Check out Nancy Baym's 1994 paper or her 2000 book if you'd like more ethnographic context, or Avery Dame-Griff 2019 if you'd like a historically-informed computational analysis.)\n", 442 | "\n", 443 | "Running this code will download the 20 Newsgroups data set:" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "from sklearn.datasets import fetch_20newsgroups\n", 453 | "\n", 454 | "twenty_newsgroups = fetch_20newsgroups(data_home=\"data\",\n", 455 | " subset=\"all\",\n", 456 | " shuffle=False,\n", 457 | " remove=('headers', 'footers', 'quotes'),\n", 458 | " download_if_missing=True)" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "The data set has about 18000 posts. The way the `remove` argument parses the posts is approximate, so some of those posts wind up being empty strings." 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "len(twenty_newsgroups.data)" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "Let's look at the data by printing out a few posts at random." 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "print(twenty_newsgroups.data[0])" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "print(twenty_newsgroups.data[5])" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "# Look at a few more!" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "You might notice some obvious themes or topics in the text. These are the groups included in the data set: " 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "twenty_newsgroups.target_names" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "Preprocessing is an important (and sometimes underappreciated) step in text analysis, which [can have downstream consequences](https://github.com/matthewjdenny/preText). \n", 532 | "\n", 533 | "We'll preprocess each Usenet post pretty minimally -- remove punctuation and special characters, lowercase, and tokenize it into a list of words: " 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "from gensim.utils import simple_preprocess\n", 543 | "\n", 544 | "print(simple_preprocess(twenty_newsgroups.data[5], min_len=1, max_len=20))" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "preprocessed_docs = []\n", 554 | "for doc in twenty_newsgroups.data:\n", 555 | " preprocessed_docs.append(simple_preprocess(doc, min_len=1, max_len=20))" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "### Fitting the model\n", 563 | "\n", 564 | "That was a lot of data work, but now we're ready to train a word2vec model. 
\n", 565 | "\n", 566 | "We'll use skip-gram (`sg=1`) with negative sampling (`negative=5`). We'll use a small vector size (`size=50`) because it's faster to train. We'll make a few passes over the data set (`iter=10`), and we'll ignore any words appearing fewer than 5 times total (`min_count=5`). \n", 567 | "\n", 568 | "To start with, let's try a small context window around each word (`window=5`). \n", 569 | "\n", 570 | "**Note:** This small vector size, combined with this small window size, isn't something we'd expect to yield a great model, but it's relatively quick to train. (On my laptop, that means less than two minutes -- it might vary for your machine. `workers=3` means that it runs on three processes.)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [ 579 | "from gensim.models import Word2Vec\n", 580 | "\n", 581 | "w2v_model1 = Word2Vec(sentences=preprocessed_docs, \n", 582 | " size=50, \n", 583 | " window=5, \n", 584 | " min_count=5,\n", 585 | " workers=3,\n", 586 | " sg=1, \n", 587 | " hs=0, \n", 588 | " negative=5,\n", 589 | " iter=10)" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "Here are some basic properties of the trained model. Note that the embeddings themselves are accessed through the `wv` property. " 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "w2v_model1.corpus_count" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": null, 611 | "metadata": {}, 612 | "outputs": [], 613 | "source": [ 614 | "w2v_model1.total_train_time" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": null, 620 | "metadata": {}, 621 | "outputs": [], 622 | "source": [ 623 | "w2v_model1.wv.vectors.shape" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "The vector for a given word will be a different string of numbers from the previous GloVe model. (And there's no reason for the dimensions to correspond, either!)" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": null, 636 | "metadata": {}, 637 | "outputs": [], 638 | "source": [ 639 | "w2v_model1.wv[\"society\"]" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "metadata": {}, 645 | "source": [ 646 | "Instead, we'll look at the most similar words again, to get a qualitative impression for whether this model is encoding a meaning similar to the GloVe model. " 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": null, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "w2v_model1.wv.most_similar(\"society\")" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "Again, for the word \"individual\": " 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": null, 668 | "metadata": {}, 669 | "outputs": [], 670 | "source": [ 671 | "w2v_model1.wv.most_similar(\"individual\")" 672 | ] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "metadata": {}, 677 | "source": [ 678 | "We won't repeat every step from Part 1 above. \n", 679 | "\n", 680 | "Instead, let's experiment with changing one parameter for training the model, the context window around each word. We'll fit a new model with `window=10`. 
" 681 | ] 682 | }, 683 | { 684 | "cell_type": "code", 685 | "execution_count": null, 686 | "metadata": {}, 687 | "outputs": [], 688 | "source": [ 689 | "w2v_model2 = Word2Vec(sentences=preprocessed_docs, \n", 690 | " size=50, \n", 691 | " window=10, \n", 692 | " min_count=5,\n", 693 | " workers=3,\n", 694 | " sg=1, \n", 695 | " hs=0, \n", 696 | " negative=5,\n", 697 | " iter=10)" 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": null, 703 | "metadata": {}, 704 | "outputs": [], 705 | "source": [ 706 | "w2v_model2.total_train_time" 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "execution_count": null, 712 | "metadata": {}, 713 | "outputs": [], 714 | "source": [ 715 | "w2v_model2.wv.most_similar(\"society\")" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "Qualitatively, smaller windows (e.g. 5) tend to encode more syntactic similarities (words that are substitutes), whereas larger windows (e.g. 50) encode more semantic similarities (words that are topically similar). See Rodriguez and Spirling 2020 or [this PyData talk from 2017](https://www.youtube.com/watch?v=tAxrlAVw-Tk&t=648s)." 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": {}, 728 | "source": [ 729 | "### Exercise\n", 730 | "\n", 731 | "(Adapted from Rodriguez and Spirling 2020) How correlated are those two word2vec models? Pick a focal word like \"society\" and calculate cosine similarity between that word and every word in the entire model vocabulary, for each model. Then calculate the correlation between those two measures.\n", 732 | "\n", 733 | "(You might use `most_similar()` with the appropriate value for `topn`, or `similarity()` with a for-loop. The first is probably more efficient, but then you'll need to sort or join the two lists based on the words. Python data science packages like pandas may be helpful.) \n", 734 | "\n", 735 | "---\n", 736 | "\n", 737 | "On your own, after the workshop: \n", 738 | "\n", 739 | "- Try a larger vector size (e.g. 100, 150, 200, 300)\n", 740 | "- Try a larger window size (e.g. 20, 50) \n", 741 | "- Try reducing min_count (e.g. from 5 to 2)\n", 742 | "\n", 743 | "**Note: all of these parameter changes make the model take longer to train!**" 744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": null, 749 | "metadata": {}, 750 | "outputs": [], 751 | "source": [ 752 | "from scipy.stats import pearsonr" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": null, 758 | "metadata": {}, 759 | "outputs": [], 760 | "source": [ 761 | "# write your code here" 762 | ] 763 | }, 764 | { 765 | "cell_type": "markdown", 766 | "metadata": {}, 767 | "source": [ 768 | "## 3. Visualizing embeddings (whatlies) [OPTIONAL]\n", 769 | "\n", 770 | "The [whatlies package](https://rasahq.github.io/whatlies/) is relatively new (Warmerdam et al 2020); it's meant to facilitate exploring and visualizing embeddings.\n", 771 | "\n", 772 | "It works as a wrapper around embeddings from different packages, including gensim. We'll turn the word2vec model from Part 2 into a `whatlies.EmbeddingSet`." 
773 | ] 774 | }, 775 | { 776 | "cell_type": "code", 777 | "execution_count": null, 778 | "metadata": {}, 779 | "outputs": [], 780 | "source": [ 781 | "from whatlies import Embedding, EmbeddingSet" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "emb_w2v = EmbeddingSet.from_names_X(names=w2v_model1.wv.index2word, \n", 791 | " X=w2v_model1.wv.vectors)" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "whatlies has built-in functions for plotting things like cosine similarities (built on the Python visualization packages matplotlib and altair):" 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": null, 804 | "metadata": {}, 805 | "outputs": [], 806 | "source": [ 807 | "emb_w2v.embset_similar(\"society\", n=10, metric=\"cosine\").plot_similarity()" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "metadata": {}, 814 | "outputs": [], 815 | "source": [ 816 | "emb_w2v.embset_similar(\"individual\", n=10, metric=\"cosine\").plot_similarity()" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "It also implements vector algebra on embeddings in a nice way: " 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": null, 829 | "metadata": {}, 830 | "outputs": [], 831 | "source": [ 832 | "emb_w2v.embset_similar(emb_w2v[\"society\"] - emb_w2v[\"individual\"]).plot_similarity()" 833 | ] 834 | }, 835 | { 836 | "cell_type": "markdown", 837 | "metadata": {}, 838 | "source": [ 839 | "We can plot the similarities of different vectors against each other on the axes as well:" 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": null, 845 | "metadata": {}, 846 | "outputs": [], 847 | "source": [ 848 | "(emb_w2v[\"society\", \"nation\", \"individual\", \"exclusive\"]\n", 849 | " .plot(x_axis=\"society\", \n", 850 | " y_axis=\"individual\", \n", 851 | " axis_metric=\"cosine_similarity\"))" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "Or, we can transform the local vector space. Mouse over the points below to see the words.\n", 859 | "\n", 860 | "(You could, of course, use any dimensionality reduction technique on the entire vector space as well.)" 861 | ] 862 | }, 863 | { 864 | "cell_type": "code", 865 | "execution_count": null, 866 | "metadata": {}, 867 | "outputs": [], 868 | "source": [ 869 | "from whatlies.transformers import Pca" 870 | ] 871 | }, 872 | { 873 | "cell_type": "code", 874 | "execution_count": null, 875 | "metadata": {}, 876 | "outputs": [], 877 | "source": [ 878 | "(emb_w2v\n", 879 | " .embset_similar(\"society\", n=100, metric=\"cosine\")\n", 880 | " .transform(Pca(2))\n", 881 | " .plot_interactive(annot=False))" 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": {}, 887 | "source": [ 888 | "### Exercise\n", 889 | "\n", 890 | "Write code to convert the GloVe embeddings from a gensim `KeyedVectors` object to a whatlies `EmbeddingSet`. (Note: you don't need to use the `wv` attribute. Instead, access `glove_embeddings.index2word` directly. That's what the warning message \"use self\" means.)\n", 891 | "\n", 892 | "Then, select some neighborhood of words (maybe a larger one than n=100) around a focal word (like \"society\"), transform those vectors with PCA, and plot them. 
\n" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": null, 898 | "metadata": {}, 899 | "outputs": [], 900 | "source": [ 901 | "# convert glove embeddings to an EmbeddingSet" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": null, 907 | "metadata": {}, 908 | "outputs": [], 909 | "source": [ 910 | "# select, transform, and plot" 911 | ] 912 | }, 913 | { 914 | "cell_type": "markdown", 915 | "metadata": {}, 916 | "source": [ 917 | "## Wrapping up" 918 | ] 919 | }, 920 | { 921 | "cell_type": "markdown", 922 | "metadata": {}, 923 | "source": [ 924 | "### Main takeaways\n", 925 | "\n", 926 | "- Word embeddings have become one tool in the computational text analysis toolkit.\n", 927 | "- They've been successfully applied in the social sciences, in cases where the relational meanings of words and concepts are of substantive interest." 928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "### What to do next\n", 935 | "\n", 936 | "- Keep playing around in this notebook, with different words and parameters!\n", 937 | "- Apply a pretrained model to or train a local model on your own corpus.\n", 938 | "- Have a look at the code for some of the papers listed below.\n", 939 | "- Read through my [experimental notebooks](https://ccgilroy.github.io/community-discourse/). I wrote those notebooks for myself, to explore how these methods could be applied to a topic I'm interested in, but you might find them useful.\n", 940 | "\n", 941 | "The main additional topic I would have liked to cover in this workshop, given more time, is **document-level comparisons**. There are two interesting methods in gensim, Word Mover's Distance and Doc2Vec, which do pretty different things. I'm happy to discuss those methods during the Q&A. " 942 | ] 943 | }, 944 | { 945 | "cell_type": "markdown", 946 | "metadata": {}, 947 | "source": [ 948 | "### Further reading\n", 949 | "\n", 950 | "I've loosely categorized some relevant papers into three groups. I'd recommend starting with papers in the first category, though there are some interesting methodological ideas in the second and third categories.\n", 951 | "\n", 952 | "You'll notice that **time** is the covariate of choice for many of the applied papers. I'm excited to see what other sources of variation researchers can come up with. 
\n", 953 | "\n", 954 | "Social science papers: \n", 955 | "\n", 956 | "- Kozlowski et al 2019 (\"Geometry of Culture\") ([GitHub](https://github.com/KnowledgeLab/GeometryofCulture/))\n", 957 | "- Stoltz and Taylor 2019, Taylor and Stoltz 2020a, Taylor and Stoltz 2020b, Stoltz and Taylor forthcoming (Concept Mover's Distance and extensions) ([GitHub](https://github.com/dustinstoltz/CMDist/))\n", 958 | "- Jones et al 2019 (gender stereotypes decrease over time) ([GitHub](https://github.com/ruhulsbu/StereotypicalGenderAssociationsInLanguage))\n", 959 | "- Rheault and Cochrane 2020 (ideology and parliamentary corpora) ([GitHub](https://github.com/lrheault/partyembed))\n", 960 | "- Rodriguez and Spirling 2020 (methodological comparisons for political science research) ([GitHub](https://github.com/ArthurSpirling/EmbeddingsPaper))\n", 961 | "- Arseniev-Koehler and Foster 2020 (cultural learning and what it means to be fat) ([GitHub](https://github.com/arsena-k/Word2Vec-bias-extraction))\n", 962 | "- Nelson 2021 (machine learning and intersectionality) ([GitHub](https://github.com/lknelson/measuring_intersectionality))\n", 963 | "\n", 964 | "Social application papers from NLP/CS researchers: \n", 965 | "\n", 966 | "- Bamman et al 2014 (embedding decomposition and geographic variation) ([GitHub](https://github.com/dbamman/geoSGLM))\n", 967 | "- Kulkarni et al 2015 (historical semantic change over time) ([GitHub](https://github.com/viveksck/langchangetrack))\n", 968 | "- Hamilton et al 2016a, 2016b (histwords - semantic change) ([website](https://nlp.stanford.edu/projects/histwords/))\n", 969 | "- Garg et al 2017 (stereotypes and semantic change) ([GitHub](https://github.com/nikhgarg/EmbeddingDynamicStereotypes))\n", 970 | "- Gonen and Goldberg 2019 (\"Lipstick on a pig\" - \"debiasing\" embeddings in terms of gender) ([GitHub](https://github.com/gonenhila/gender_bias_lipstick))\n", 971 | "- Giulianelli et al 2020 (contextual embeddings for semantic change) ([GitHub](https://github.com/glnmario/cwr4lsc))\n", 972 | "- Mendelsohn et al 2020 (dehumanization and linguistic change)\n", 973 | "- Waller and Anderson 2020 (community embeddings)\n", 974 | "- Soni et al 2021 (language change in abolitionist newspapers) ([GitHub](https://github.com/sandeepsoni/semantic-leadership-network))\n", 975 | "\n", 976 | "Fundamental NLP/CS papers:\n", 977 | "\n", 978 | "- Mikolov et al 2013 (word2vec) ([gensim tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html))\n", 979 | "- Pennington et al 2014 (GloVe) ([website](https://nlp.stanford.edu/projects/glove/))\n", 980 | "- Dai and Le 2015 (paragraph vectors / doc2vec) ([gensim tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html))\n", 981 | "- Kusner et al 2015 (Word Mover's Distance) ([gensim tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_wmd.html))\n", 982 | "- Antoniak and Mimno 2018 (stability of embeddings) ([GitHub](https://github.com/maria-antoniak/word-embedding-stability))\n", 983 | "- Sia et al 2020 (embeddings as topic models) ([GitHub](https://github.com/adalmia96/Cluster-Analysis))\n", 984 | "- Warmerdam et al 2020 (whatlies package) ([website](https://rasahq.github.io/whatlies/))" 985 | ] 986 | }, 987 | { 988 | "cell_type": "markdown", 989 | "metadata": {}, 990 | "source": [ 991 | "## Contact information and acknowledgments\n", 992 | "\n", 993 | "Connor Gilroy \n", 994 | "email: cgilroy at uw dot edu \n", 995 | "twitter: @ccgilroy\n", 996 | "\n", 997 | "My thinking on 
word embeddings is indebted to SICSS-2017, a 2018 Text as Data course with John Wilkerson, many conversations (with Kate Stovel, Jeff Lockhart, Ian Kennedy, Nga Than, and others), and the code and papers listed above.\n", 998 | "\n", 999 | "My research is partially supported by several grants, including an NIH NICHD training grant (T32 HD101442-01) to CSDE at the University of Washington and ARO Grant W911NF-19-1-0407." 1000 | ] 1001 | } 1002 | ], 1003 | "metadata": { 1004 | "kernelspec": { 1005 | "display_name": "Python 3", 1006 | "language": "python", 1007 | "name": "python3" 1008 | }, 1009 | "language_info": { 1010 | "codemirror_mode": { 1011 | "name": "ipython", 1012 | "version": 3 1013 | }, 1014 | "file_extension": ".py", 1015 | "mimetype": "text/x-python", 1016 | "name": "python", 1017 | "nbconvert_exporter": "python", 1018 | "pygments_lexer": "ipython3", 1019 | "version": "3.8.8" 1020 | }, 1021 | "toc-autonumbering": false, 1022 | "toc-showcode": false, 1023 | "toc-showmarkdowntxt": false 1024 | }, 1025 | "nbformat": 4, 1026 | "nbformat_minor": 4 1027 | } 1028 | --------------------------------------------------------------------------------