├── .gitignore ├── LICENSE ├── README.md ├── code ├── 1-Processing.ipynb ├── 2-Extraction.ipynb ├── 3-Classifiers.ipynb ├── 4-Social-networks.ipynb ├── README.md ├── data │ ├── sample_text.txt │ ├── sentiment-analysis │ │ ├── testing_set.csv │ │ └── training_set.csv │ └── similarity │ │ ├── dif.csv │ │ ├── dif_2.csv │ │ ├── full_ds.csv │ │ ├── sim.csv │ │ ├── sim_2.csv │ │ └── simple.dnd └── images │ └── UKDS_Logos_Col_Grey_300dpi.png ├── environment.yml ├── postBuild └── webinars ├── 2020 ├── Text-Mining_Advanced_widescreen.pdf ├── Text-Mining_Advanced_widescreen.pptx ├── Text-Mining_Basics_widescreen.pdf ├── Text-Mining_Basics_widescreen.pptx ├── Text-Mining_Intro_widescreen.pdf └── Text-Mining_Intro_widescreen.pptx ├── 2023 └── Text-Mining_Nham_DataFest_2023.pptx └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 UK Data Service Open 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![UKDS Logo](./code/images/UKDS_Logos_Col_Grey_300dpi.png)
2 |
3 | # Text-mining for Social Science Research 4 | 5 | Text-mining is one of many data-mining techniques that social scientists are using to turn unstructured (or more accurately, semi-unstructured) material into structured material that can be analysed statistically. In this way, researchers are gaining access to new materials and methods that were previously unavailable. As such, it is increasingly important that social scientists have a clear understanding of what text-mining is (and what it isn't) as well as how to use text-mining to achieve some basic and more advanced research outcomes. 6 | 7 | ## Topics 8 | 9 | The following topics are covered in this training series: 10 | 1. **Introduction to Text-Mining** - covers the concepts behind fully structured and semi-unstructured data, the theory behind capturing and amplifying existing structure, and the four basic steps involved in any text-mining project. 11 | 2. **Text-Mining: Basic Processes** - learn how to do some of the most common text-mining analyses using Python. 12 | 3. **Text-Mining: Advanced Options** - understand the concepts behind more advanced text-mining analyses. 13 | 14 | ## Materials 15 | 16 | The training materials - including webinar recordings, slides, and sample Python code - can be found in the following folders: 17 | * [code](./code) - run and/or download text-mining code using our Jupyter notebook resources. 18 | * [webinars](./webinars) - watch recordings of our webinars and download the underpinning slides. 19 | 20 | ## Acknowledgements 21 | 22 | We are grateful to UKRI through the Economic and Social Research Council for their generous funding of this training series. 23 | 24 | ## Further Information 25 | 26 | * To access learning materials from the wider *Computational Social Science* training series: [Training Materials] 27 | * To keep up to date with upcoming and past training events: [Events] 28 | * To get in contact with feedback, ideas or to seek assistance: [Help] 29 | 30 | Thank you and good luck on your journey exploring new forms of data!
31 | 32 | Dr Julia Kasmire and Dr Diarmuid McDonnell
33 | UK Data Service
34 | University of Manchester
35 | -------------------------------------------------------------------------------- /code/1-Processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "slideshow": { 14 | "slide_type": "-" 15 | } 16 | }, 17 | "source": [ 18 | "# Text-mining: Basics" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Welcome to this UK Data Service *Computational Social Science* training series! \n", 26 | "\n", 27 | "The various *Computational Social Science* training series, all of which guide you through some of the popular and useful computational techniques, tools, methods and concepts that social science research might want to use. For example, this series covers collecting data from websites and social media platorms, working with text data, conducting simulations (agent based modelling), and more. The series includes recorded video webinars, interactive notebooks containing live programming code, reading lists and more.\n", 28 | "\n", 29 | "* To access training materials on our GitHub site: [Training Materials]\n", 30 | "\n", 31 | "* To keep up to date with upcoming and past training events: [Events]\n", 32 | "\n", 33 | "* To get in contact with feedback, ideas or to seek assistance: [Help]\n", 34 | "\n", 35 | "Dr J. Kasmire
\n", 36 | "UK Data Service
\n", 37 | "University of Manchester
" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "toc": true 44 | }, 45 | "source": [ 46 | "

Table of Contents

\n", 47 | "
" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "\n", 55 | "There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars, if unsure hover your mouse over the buttons). " 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "-------------------------------------\n", 63 | "\n", 64 | "
This is notebook 1 of 2 in this lesson
\n", 65 | "\n", 66 | "-------------------------------------" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Introduction\n", 74 | "\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "This is the first in a series of jupyter notebooks on text-mining that cover basic preparation processes, common natural language processing tasks, and some more advanced natural language tasks. These interactive code-along notebooks use python as a programming language, but introduce various packages related to text-mining and text processing. Most of those tasks could be done in other packages, so please be aware that the options demonstrated here are not the only way, or even the best way, to accomplish a text-mining task. " 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### Guide to using this resource\n", 89 | "\n", 90 | "This learning resource was built using Jupyter Notebook, an open-source software application that allows you to mix code, results and narrative in a single document. As Barba et al. (2019) espouse:\n", 91 | "> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.\n", 92 | "\n", 93 | "If you are familiar with Jupyter notebooks then skip ahead to the main content (*Retrieval*). Otherwise, the following is a quick guide to navigating and interacting with the notebook." 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### Interaction\n", 101 | "\n", 102 | "**You only need to execute the code that is contained in sections which are marked by `In []`.**\n", 103 | "\n", 104 | "To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).\n", 105 | "\n", 106 | "Try it for yourself:" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "print(\"Enter your name and press enter:\")\n", 116 | "name = input()\n", 117 | "print(\"\\r\")\n", 118 | "print(\"Hello {}, enjoy learning more about Python and computational social science!\".format(name))" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### Learn more\n", 126 | "\n", 127 | "Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the materials provided by Dani Arribas-Bel at the University of Liverpool." 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## Retrieval\n" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "The first step in text-mining, or any form of data-mining, is retrieving a data set to work with. Within text-mining, or any language analysis context, one data set is usually referred to as 'a corpus' while multiple data sets are referred to as 'corpora'. 'Corpus' is a latin-root word and therefore has a funny plural. 
\n", 142 | "\n", 143 | "For text-mining, a corpus can be:\n", 144 | "- a set of tweets, \n", 145 | "- the full text of an 18th centrury novel,\n", 146 | "- the contents of a page in the dictionary, \n", 147 | "- minutes of local council meetings, \n", 148 | "- random gibberish letters and numbers, or\n", 149 | "- just about anything else in text format. \n", 150 | "\n", 151 | "\n", 152 | "Retrieval is a very important step, but it is not the focus of this particular training series. If you are interested in creating a corpus from internet data, then you may want to check out our previous training series that covers Web-scraping (available as recordings of webinars or as a code-along jupyter notebook like this one) and API's (also as recording or jupyter notebook). Both of these demonstrate and discuss ways to get data from the internet that you could use to build a corpus. \n", 153 | "\n", 154 | "Instead, for the purposes of this session, we will assume that you already have a corpus to analyse. This is easy for us to assume, because we have provided a sample text file that we can use as a corpus for these exercises. \n", 155 | "\n", 156 | "First, let's check that it is there. To do that, click in the code cell below and hit the 'Run' button at the top of this page or by holding down the 'Shift' key and hitting the 'Enter' key. \n", 157 | "\n", 158 | "For the rest of this notebook, I will use 'Run/Shift+Enter' as short hand for:\n", 159 | "* click in the code cell below and hit the 'Run' button at the top of this page, or\n", 160 | "* click in the code cell below and hold down the 'Shift' key while hitting the 'Enter' key'. " 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "# It is good practice to always start by importing the modules and packages you will need. \n", 170 | "\n", 171 | "import os # os is a module for navigating your machine (e.g., file directories).\n", 172 | "import nltk # nltk stands for natural language tool kit and is useful for text-mining. \n", 173 | "import re # re is for regular expressions, which we use later \n", 174 | "\n", 175 | "print(\"1. Succesfully imported necessary modules\") # The print statement is just a bit of encouragement!\n", 176 | "\n", 177 | "print(\"\")\n", 178 | "\n", 179 | "# List all of the files in the \"data\" folder that is provided to you\n", 180 | "for file in os.listdir(\"./data\"):\n", 181 | " print(\"2. One of the files in ./data is...\", file)\n", 182 | "print(\"\")\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "_______________________________________________________________________________________________________________________________\n", 190 | "Great! We have imported a useful module and used it to check that we have access to the sample_text file. \n", 191 | "\n", 192 | "Now we need to load that sample_text file into a variable that we can work with in python. Time to Run/Shift+Enter again!" 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "# Open the \"sample_text\" file and read (import) its contents to a variable called \"corpus\"\n", 202 | "with open(\"./data/sample_text.txt\", \"r\", encoding = \"ISO-8859-1\") as f:\n", 203 | " corpus = f.read()\n", 204 | " \n", 205 | " print(corpus)" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "_______________________________________________________________________________________________________________________________\n", 213 | "Hmm. Not excellent literature, but it will do for our purposes. \n", 214 | "\n", 215 | "A quick look tells us that there are capital letters, contractions, punctuation, numbers as digits, numbers written out, abbreviations, and other things that, as humans, we know are equivalent but that computers do not know about. \n", 216 | "\n", 217 | "Before we go further, it helps to know what kind of variable corpus is. Run/Shift+Enter the next code block to find out!" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "type(corpus)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "_______________________________________________________________________________________________________________________________\n", 234 | "This tells us that 'corpus' is one very long string of text characters. \n", 235 | "\n", 236 | "Congratulations! We are done with the retreival portion of this process. The rest won't be quite so straightforward because next up... Processing. \n", 237 | "\n", 238 | "Processing is about cleaning, correcting, standardizing and formatting the raw data returned from the retrieval process. " 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "## Processing\n", 246 | "\n" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "_______________________________________________________________________________________________________________________________\n", 254 | "The string we have as our corpus is a good starting point, but it is not perfect. It has a bunch of errors and punctuation which need to be corrected. But even worse, it is 'one long thing' when statistical analysis typically requires 'lots of short things'. \n", 255 | "\n", 256 | "So, clearly, we have a few steps to go through with our raw text. \n", 257 | "- Tokenisation, (or splitting text into various kinds of 'short things' that can be statistically analysed).\n", 258 | "- Standardising the next (including converting uppercase to lower, correcting spelling, find-and-replace operations to remove abbreviations, etc.). \n", 259 | "- Removing irrelevancies (anything from punctuation to stopwords like 'the' or 'to' that are unhelpful for many kinds of analysis).\n", 260 | "- Consolidating (including stemming and lemmatisation that strip words back to their 'root'). \n", 261 | "- Basic NLP (that put some of the small things back together into logically useful medium things, like multi-word noun or verb phrases and proper names).\n", 262 | "\n", 263 | "In practice, most text-mining work will require that any given corpus undergo multiple steps, but the exact steps and the exact order of steps depends on the desired analysis to be done. 
Thus, some of the examples that follow will use the raw text corpus as an input to the process while others use a processed corpus as an input. \n", 264 | "\n", 265 | "As a side note, it is good practice to create new variables whenever you manipulate an existing variable rather than write over the original. This means that you keep the original and can go back to it anytime you need to if you want to try a different manipulation or correct an error. You will see how this works as we progress through the processing steps. " 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "### Tokenisation" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "Our first step is to cut our 'one big thing' into tokens, or 'lots of little things'. As an example, one project I worked on involved downloading a file with hundreds of recorded chess games, which I then divided into individual text files with one game each. The games had a very standard format, with every game ending with either '1-0', '0-1' or '1/2-1/2'. Thus, I was able to use regular expressions (covered in more detail later) to iterate over the file, selecting everything until it found an instance of '1-0', '0-1' or '1/2-1/2', at which point it would cut what it had selected, write it to a blank file, save it, and start iterating over the original file again. \n", 280 | "\n", 281 | "Other options that might make more sense with other kinds of files would be to cut and write from the large file to new files after a specified number of lines or characters. \n", 282 | "\n", 283 | "Whether you have one big file or many smaller ones, most text-mining work will also want to divide the corpus into what are known as 'tokens'. These 'tokens' are the unit of analysis, which might be chapters, sections, paragraphs, sentences, words, or something else. \n", 284 | "\n", 285 | "Since we have one file already loaded as a corpus, we can skip right to tokenising that text into sentences and words. Both options are functions available through the nltk package that we imported earlier. These are both useful tokens in their own way, so we will see how to produce both kinds. \n", 286 | " \n", 287 | "We start by dividing our corpus into words, splitting the string into substrings whenever 'word_tokenize' detects a word. \n", 288 | "\n", 289 | "Let's try that. But this time, let's just have a look at the first 100 things it finds instead of the entire text.\n", 290 | "Run/Shift+Enter." 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "nltk.download('punkt')\n", 300 | "from nltk import word_tokenize # importing the word_tokenize function from nltk\n", 301 | "\n", 302 | "corpus_words = word_tokenize(corpus) # Pass the corpus through word tokenize \n", 303 | "print(corpus_words[:100]) # the [:100] within the print statement says \n", 304 | " # to print only the first 100 items in the list \n", 305 | "print(\"...\") # the print(\"...\") just improves output readability\n", 306 | "type(corpus_words) # Always good to know your variable type!\n" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "Let's have a look. \n", 314 | "\n", 315 | "We can see that corpus_words is a list of strings. 
We know it is a list because it starts and ends with square brackets and we know the things in that list are strings because they are surrounded by single quotes. \n", 316 | "\n", 317 | "We can also see that puctuation marks are counted as tokens in that list. For example, the full stop at the end of the first sentence appears as its own token because word_tokenize knows that it does not count as part of the previous word. Interestingly, 'U.K.' is all one token, despite having full stops in. Clever stuff, this tokenisation function!\n", 318 | "\n", 319 | "Word_tokenize is a useful function if you want to take a 'bag of words' approach to text-mining. This reduces a lot of the contextual information within the original corpus because it ignores how the words were used or in what order they originally appeared, making it easy to count how often each word occurrs. There is a surprising amount of insight to be gained here, but it does mean that 'building' in the next two sentences will be counted as the \"same\" word. \n", 320 | "- \"He is building a diorama for a school project.\" where 'building' is a verb\n", 321 | "- \"The building is a clear example of brutalist architecture.\" where 'building' is a noun\n", 322 | "\n", 323 | "There are other kinds of analyses that you could do if you want verb-building and noun-building to be counted as different words. That usually starts with tokenising differently, for example into sentences rather than words. \n", 324 | "Let's see what that looks like by running the same basic analysis again, but this time with sentence-token things instead of word-token things. \n", 325 | "\n", 326 | "Do that funky Run/Shift+Enter thing! " 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [ 335 | "# importing sent_tokenize from nltk\n", 336 | "from nltk import sent_tokenize\n", 337 | "\n", 338 | "# Same again, but this time broken into sentences\n", 339 | "corpus_sentences = sent_tokenize(corpus)\n", 340 | "print(corpus_sentences[:10]) # Since these are sentences instead of words, \n", 341 | " # we only want the first 10 items instead of 100.\n", 342 | "print(\"...\") \n", 343 | "type(corpus_sentences)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "_______________________________________________________________________________________________________________________________\n", 351 | "\n", 352 | "Corpus_sentences is also a list of strings (starts and ends with square brackets, each item is surrounded by single quotes). \n", 353 | "\n", 354 | "This time, the full stops at the end of each sentence are included within the sentence token, which makes sense. \n", 355 | "\n", 356 | "Moving forward, some of the next steps make more sense to do on the word-tokens while others on sentence-tokens." 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "### Standardising\n", 364 | "#### Remove uppercase letters" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "If we want to focus on the 'bag of words' approach, we don't really care about uppercase or lowercase distinctions. For example, we want 'Privacy' to count as the same word as 'privacy', rather than as two different words. \n", 372 | "\n", 373 | "We can remove all uppercase letters with a built in python command on corpus_words. 
Do this in the next code cell, again returning just the first 100 items instead of the whole thing. \n", 374 | "\n", 375 | "Do the Run/Shift+Enter thing. " 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "# You can see that I created a new variable called corpus_lower rather than edit corpus_words directly.\n", 385 | "# This means I can easily compare two different processes or correct something without going back and re-running earlier steps. \n", 386 | "\n", 387 | "corpus_lower = [word.lower() for word in corpus_words]\n", 388 | "print(corpus_lower[:100])" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "_______________________________________________________________________________________________________________________________\n", 396 | "Great! This is another step in the right direction. \n", 397 | "\n", 398 | "If you want a bit more practice, you can copy/paste/edit the command above to create a second version that applies to corpus_sentences instead of corpus_words. You will have to think for yourself whether this makes sense to do or not. Uppercase letters are potentially useful in an analysis that looks at sentences, but since the tokens already capture sentences, maybe that value is no longer useful. \n", 399 | "\n", 400 | "Anyway, have a go. Knock yourself out! " 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "#### Spelling correction" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "_______________________________________________________________________________________________________________________________\n", 415 | "Everybody loves spelling... RIGHT?!?\n", 416 | "\n", 417 | "Fortunately, there are several decent spellchecking packages written for python. They are not automatically installed and ready to import in the same way that the 'os' or 'nltk' packages were, but we just need to install the packages and import the functions we need through an installer called 'pip'. You will see 'pip' in the next code block, but since this is in jupyter notebook rather than directly in a python shell, we need to put a '!' in front of the 'pip' function. Don't worry too much about that now, I just mention it here in case you find it interesting to know. \n", 418 | "\n", 419 | "The next code cell:\n", 420 | "- installs the 'autocorrect' package,\n", 421 | "- imports the Speller function, and\n", 422 | "- creates a one-word command that specifies that the Speller function should use English language. \n", 423 | "\n", 424 | "Run/Shift+Enter, as per usual. " 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "!pip install autocorrect\n", 434 | "from autocorrect import Speller\n", 435 | "check = Speller(lang='en')" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "_______________________________________________________________________________________________________________________________\n", 443 | "Super. Creating that one-word command saves us some time, which is maybe less important here but is a good skill to be aware of if you are working on text-mining every day for weeks on end. Always be on the look out for good ways to save time. 
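Before applying it to the whole corpus, it can help to see what the new `check` command does to a single token. The snippet below is a supplementary sketch (not part of the original notebook): the misspelled words are invented for illustration, and autocorrect's suggestions can vary between versions, so treat the expected outputs as a guide rather than a guarantee.

```python
# Try the freshly created 'check' function on a few deliberately misspelled words.
# Outputs depend on the autocorrect version, so the comments are only likely results.
print(check("speling"))   # likely 'spelling'
print(check("korpus"))    # autocorrect guesses the closest common word
print(check("mesage"))    # likely 'message'
```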
\n", 444 | "\n", 445 | "Moving on, we need to iterate over our corpus, checking and correcting each token. This is easy to do if you start with a new, empty list (I called mine 'corpus_correct_spell'). As I work through corpus_words, one token at a time, we append (which is just fancy for 'add to the end') the corrected word to our new blank list. \n", 446 | "\n", 447 | "Then, as usual, we have a quick look at the first 100 entries in the new 'corpus_correct_spell'. \n", 448 | "\n", 449 | "Run/Shift+Enter. You know how to do it. Don't worry if it takes a while... Checking the spelling on each word is not a cakewalk. " 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "corpus_correct_spell = []\n", 459 | "\n", 460 | "for word in corpus_words:\n", 461 | " corpus_correct_spell.append(check(word)) \n", 462 | "\n", 463 | "print(corpus_correct_spell[:100])" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "_______________________________________________________________________________________________________________________________\n", 471 | "How did it do? Well, this spell-checker replaced 'haz' with 'had' rather than 'has'. That is ok, I guess. No automatic spelling correction programme will get it 100% right 100% of the time. Maybe your project has specific research questions that won't work with this decision. \n", 472 | "\n", 473 | "In that case, you would have to check out some other spell-checkers like textblob or pyspellchecker. You might even want to custom build or adapt your own spell-checker, especially if you were working with very non-standard text, like comment boards that use a bunch of slang, common typos, or specific terms. \n", 474 | "\n", 475 | "But take a moment here and consider the following questions... \n", 476 | "- Can you apply this spell-checker to corpus_sentences rather than corpus_words? If you are not sure what happens, try it out by copying, editing and re-running the above code block. \n", 477 | "- Should you have appled this spell-checker to corpus_lower rather than corpus_words? What difference would it make? Again, try it out if you are not sure. \n", 478 | "\n", 479 | "Next up, specific replacements with RegEx! " 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "#### RegEx replacements" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "RegEx stands for REGular EXpressions, which is probably familiar to you as the basis for how find-and-replace works in text documents. I mentioned this above when I talked about cutting up a large file into smaller files whenever the computer iterating over the large file found one of three specific combinations of numbers and symbols. \n", 494 | "\n", 495 | "But RegEx is actually stronger than that because you can use it to identify combinations of letters, numbers, symbols, spaces and more, some of which can be repeated more than once or can be optional. I won't go into RegEx too much more here, because that is a whole set of lessons on its own. But here are a couple of examples that you might find useful in a text like ours where we know that there are mixtures of numbers written as numbers, numbers spelled out, geographic abbreviations and more.\n", 496 | "\n", 497 | "As you might expect, do the Run/Shift+Enter thing. 
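As a supplementary aside (not part of the original notebook), here is a tiny self-contained illustration of the 'repeated' and 'optional' ideas mentioned above, using the `re` module imported at the start of the notebook; the example string and pattern are invented purely for demonstration. The substitution cell that follows then applies the same module to our corpus.

```python
# '\d+' matches one or more digits; '[- ]?' makes a hyphen or space optional.
# The example string is made up for illustration only.
example = "Call 0161 275 2000 or 0161-275-2000 for details."
pattern = re.compile(r"\d+[- ]?\d+[- ]?\d+")
print(pattern.findall(example))   # finds both spellings of the telephone number
```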
" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": null, 503 | "metadata": {}, 504 | "outputs": [], 505 | "source": [ 506 | "corpus_numbers = [re.sub(r\"ninety-six\", \"96\", word) for word in corpus_words] # Defines a new variable create by substituting\n", 507 | " # '96' for 'ninety-six' in corpus_words\n", 508 | "\n", 509 | "print(corpus_numbers[:100]) # Prints the first 100 items in the newly created corpus\n" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "Super! Now, this only works on 'ninety-six', but there might be other numbers spelled out in the text. We would have to look at it all to be sure, either manually or by using word frequency tables (we'll get to that). If we were to find some, we would have to revise our RegEx to capture more things and substitute them properly. \n", 517 | "\n", 518 | "One way to do that might be to define multiple terms to replace and what to replace them with. To do that, I searched on stack overflow and found a function written to multiple items by RegEx in a string. \n", 519 | "\n", 520 | "Run/Shift+Enter below!" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": {}, 526 | "source": [ 527 | "Now let's try editing this. \n", 528 | "What happens when we use lowercase letters instead of uppercase letters in \"United Kingdom\"?\n", 529 | "What happens if you change the order of the entries in 'dict'. What happens if you reverse the order of \n", 530 | "- \"United Kingdom of Great Britain and Northern Ireland\" : \"U.K.\", and \n", 531 | "- \"United Kingdom of Great Britain\" : \"U.K.\", ?\n", 532 | "\n", 533 | "You should also feel free to add your own lines to 'dict' to exact some substitutions of your own. \n", 534 | "\n", 535 | "Note: this function works on strings, so I applied it to 'corpus' our original raw text. \n", 536 | "We can either put a step like this as the first step in a pipeline, or we can adapt the code to iterate over a list of strings. Both have pros and cons. What do you think those pros and cons might be?" 
537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "execution_count": null, 542 | "metadata": {}, 543 | "outputs": [], 544 | "source": [ 545 | "def multiple_replace(dict, text):\n", 546 | " # Create a regular expression from the dictionary keys\n", 547 | " regex = re.compile(\"(%s)\" % \"|\".join(map(re.escape, dict.keys())))\n", 548 | "\n", 549 | " # For each match, look-up corresponding value in dictionary\n", 550 | " return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) \n", 551 | "\n", 552 | "if __name__ == \"__main__\": \n", 553 | "\n", 554 | " dict = {\n", 555 | " \"CA\" : \"California\",\n", 556 | " \"United Kingdom\" : \"U.K.\",\n", 557 | " \"United Kingdom of Great Britain and Northern Ireland\" : \"U.K.\",\n", 558 | " \"United Kingdom of Great Britain\" : \"U.K.\",\n", 559 | " \"UK\" : \"U.K.\",\n", 560 | " \"Privacy Policy\" : \"noodle soup\",\n", 561 | " } \n", 562 | "\n", 563 | "corpus_replace = multiple_replace(dict, corpus)\n", 564 | "print(corpus_replace)\n" 565 | ] 566 | }, 567 | { 568 | "cell_type": "markdown", 569 | "metadata": {}, 570 | "source": [ 571 | "### Removing irrelevancies" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "#### Remove punctuation" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "Punctuation is not always very useful for understanding text, especially if you look at words as tokens because lots of the punctuation ends up being tokenised on its own. \n", 586 | "\n", 587 | "We could use RegEx to replace all punctuation with nothing, and that is a valid approach. But, just for variety sake, I demonstrate another way here." 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "_______________________________________________________________________________________________________________________________\n", 595 | "Forging ahead, let's filter out punctuation. We can define a string that includes all the standard English language punctuation, and then use that to iterate over corpus_words, removing anything that matches.\n", 596 | "\n", 597 | "But wait... Do we really want to remove the:\n", 598 | "- hyphen in 'ninety-six' or words like 'lactose-free'? \n", 599 | "- full stops in 'u.k.'? \n", 600 | "- the apostrophe in contractions or possessives?\n", 601 | "\n", 602 | "There are no right or wrong answers here. Every project will have to decide, based on the research questions, what is the right choice for the specific context. In this case, we want to remove the full stops, even from 'u.k.' so that it becomes identical to 'uk'. \n", 603 | "\n", 604 | "But, at the same time, we don't necessarily want to remove dashes or apostrophes. Those are punctuation marks that occur in the middle of words and do add meaning to the word, so I want to keep them. \n", 605 | "\n", 606 | "Run/Shift+Enter, as is tradition. 
" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "English_punctuation = \"!\\\"#$%&()*+,./:;<=>?@[\\]^_`{|}~“”\" # Define a variable with all the punctuation to remove.\n", 616 | "print(English_punctuation) # Print that defined variable, just to check it is correct.\n", 617 | "print(\"...\") # Print an ellipsis, just to make the output more readable.\n", 618 | "\n", 619 | "table_punctuation = str.maketrans('','', English_punctuation) # The python function 'maketrans' creates a table that maps\n", 620 | "print(table_punctuation) # the punctation marks to 'None'. Print the table to check. \n", 621 | "print(\"...\") # Just to be clear, '!' is 33 in Unicode, and '\\' is 34, etc.\n", 622 | " # 'None' is python for nothing, not a string of the word \"none\".\n", 623 | " \n", 624 | "corpus_no_punct = [w.translate(table_punctuation) for w in corpus_words] \n", 625 | " # Iterate over corpus_words, turning punctuation to nothing.\n", 626 | "print(corpus_no_punct[:100]) # Print the 1st 100 items in corpus_no_punct to check." 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": {}, 632 | "source": [ 633 | "_______________________________________________________________________________________________________________________________\n", 634 | "Super! \n", 635 | "\n", 636 | "Do you want to try something else? How about you create a version that *does* filter out dashes and apostrophes. \n", 637 | "\n", 638 | "C'mon. You know you can do it. \n", 639 | "\n", 640 | "Take each of the steps above and copy/paste/edit them as needed. \n", 641 | "- Create a copy of the line that defines the English_punctuation variable and edit it to define an All_English_Punctuation variable that includes more punctuation.\n", 642 | "- Then create a copy of the line that defines the table_punctuation variable and have it create a table_all_punctuation variable.\n", 643 | "- Then create a copy of the line that creates the corpus_no_punct variable and have it create an absolutely_no_punct variable.\n", 644 | "- Then ask for the first 100 items of absolutely_no_punct. \n", 645 | "\n", 646 | "Feel free to change the variable names as you like. I am going for clarity, but you might prefer brevity. " 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "Did you notice that removing the punctuation has left list items that are empty strings. Between 'corpus' and 'it', for example, is an item shown as ''. This is an empty string item that was a full stop before we removed the punctuation. \n", 654 | "\n", 655 | "Why do you think these empty string items are included in the output list? \n", 656 | "Can you think of how we might remove this?\n", 657 | "Since those empty strings are python-recognised instances of 'None', python can find and filter them out. \n", 658 | "\n", 659 | "Let's give it a try. Run/Shift+Enter. Do it!" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "metadata": {}, 666 | "outputs": [], 667 | "source": [ 668 | "corpus_no_space = list(filter(None, corpus_no_punct)) # This filters out the empty string from the no_punct list.\n", 669 | "\n", 670 | "print(corpus_no_space[:100])" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "Now we are cooking with gas (unless that saying is no longer environmentally sustainable? Hmmm. ). 
\n", 678 | "\n", 679 | "(A quick clarification on that last step: filter(None, ...) keeps every item that is truthy, so it drops the empty strings because they are falsy, not because they are literally None.) But we are not done yet! Next up... Stopwords!" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "#### Stopwords" 687 | ] 688 | }, 689 | { 690 | "cell_type": "markdown", 691 | "metadata": {}, 692 | "source": [ 693 | "Stopwords are typically conjunctions ('and', 'or'), prepositions ('to', 'around'), determiners ('the', 'an'), possessives ('s) and the like. They are **REALLY** common in all languages, and tend to occur at about the same ratio in all kinds of writing, regardless of who did the writing or what it is about. These words are definitely important for structure as they make all the difference between \"Freeze *or* I'll shoot!\" and \"Freeze *and* I'll shoot!\". \n", 694 | "\n", 695 | "Buuuut... For many text-mining analyses, especially those that take the bag of words approach, these words don't have a whole lot of meaning in and of themselves. Thus, we want to remove them. \n", 696 | "\n", 697 | "Let's start by downloading the basic stopwords provided by nltk and storing the English language ones in a set called, appropriately enough, 'stop_words'. \n", 698 | "\n", 699 | "Then let's have a look at what is in that set with a print command by doing the whole Run/Shift+Enter thing in the next two (two?!?) code cells. " 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "metadata": {}, 706 | "outputs": [], 707 | "source": [ 708 | "nltk.download('stopwords')" 709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": null, 714 | "metadata": {}, 715 | "outputs": [], 716 | "source": [ 717 | "from nltk.corpus import stopwords\n", 718 | "stop_words = set(stopwords.words('english'))\n", 719 | "print(sorted(stop_words))\n" 720 | ] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "metadata": {}, 725 | "source": [ 726 | "_____________________________________________________________________________________________________________________________\n", 727 | "Great. Now let's remove those stop_words by creating another list called corpus_no_stop_words. Then, we iterate over corpus_lower, looking at the words one by one and appending them to corpus_no_stop_words if and only if they do not match any of the items in the stop_words set. \n", 728 | "\n", 729 | "As you might expect, you should do the whole Run/Shift+Enter thing. Again. (I know, I know...)" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": { 736 | "scrolled": true 737 | }, 738 | "outputs": [], 739 | "source": [ 740 | "corpus_no_stop_words = []\n", 741 | "\n", 742 | "for word in corpus_lower:\n", 743 | " if word not in stop_words:\n", 744 | " corpus_no_stop_words.append(word)\n", 745 | " \n", 746 | " \n", 747 | "print(corpus_no_stop_words[:100])" 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": {}, 753 | "source": [ 754 | "_______________________________________________________________________________________________________________________________\n", 755 | "Hey now! That looks pretty good. Not perfect, but good.\n", 756 | "\n", 757 | "Want to try more? Run the same code above, but on 'corpus_words' rather than 'corpus_lower'. What happens? Why do you think that is?"
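An earlier aside mentioned word frequency tables as a way of spotting things like spelled-out numbers. As a supplementary sketch (not part of the original notebook), here is one quick way to build such a table from the stopword-free tokens created above, using nltk's `FreqDist`:

```python
# Count how often each remaining token occurs; assumes corpus_no_stop_words
# was created by the previous code cell.
from nltk import FreqDist

freq = FreqDist(corpus_no_stop_words)
print(freq.most_common(20))   # the 20 most frequent tokens and their counts
```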
758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": {}, 763 | "source": [ 764 | "### Consolidation\n", 765 | "#### Stemming words" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "You can probably imagine what comes next by now. We import a specific tool from nltk (it is not called the natural language tool kit for nuthin'), define a function, create a fresh new corpus by applying the function to an existing corpus, and print the first hundred items to have a nosey. \n", 773 | "\n", 774 | "Go ahead. Run/Shift+Enter" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": {}, 781 | "outputs": [], 782 | "source": [ 783 | "from nltk.stem.porter import PorterStemmer\n", 784 | "\n", 785 | "porter = PorterStemmer()\n", 786 | "corpus_stemmed = [porter.stem(word) for word in corpus_no_space]\n", 787 | "print(corpus_stemmed[:100])" 788 | ] 789 | }, 790 | { 791 | "cell_type": "markdown", 792 | "metadata": {}, 793 | "source": [ 794 | "We see that 'sample' has become 'sampl', which collapses 'sampled' together with 'samples' and 'sampling' and 'sample'. This puts plurals and verb tenses all in the same form so they can be counted as instances of the \"same\" word.\n", 795 | "\n", 796 | "If we are happy with this stemming process, we might decide that we are done with the cleaning and can dive into the text-mining. \n", 797 | "\n", 798 | "Alternatively, we might decide to do a bit more cleaning, perhaps by downloading packages that replace contractions, so that 'haven't' would become 'have' and 'not'. There are many potentially useful changes like these that you may want to make. \n", 799 | "\n", 800 | "Buuuuuuuuuuuuuut... maybe we want to keep the count the verbs together and the nouns separetely? For that, we need the slightly more sophisticated approach of 'lemmatisation'. " 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "metadata": {}, 806 | "source": [ 807 | "#### Lemmatisation" 808 | ] 809 | }, 810 | { 811 | "cell_type": "markdown", 812 | "metadata": {}, 813 | "source": [ 814 | "Lemmatisation is similar to stemming, in that it aims to turn various forms of the same word into a single form. However, lemmatisation is a bit more sophisticated because: \n", 815 | "- It recognises irregular plurals and returns the correct singular form. Example = 'rocks' --> 'rock' but 'corpora' --> 'corpus' \n", 816 | "- If part of speech tags are supplied, it treats verbs, adjectives and nouns differenly, even if they have the same surface form. Example - 'caring' would not be changed if used as an adjective (as in 'his caring manner') but would go to 'care' if it was a verb (as in 'he is caring for baby squirrels'. In contrast, stemming would remove the 'ing' and turn 'caring' into 'car'. \n", 817 | "- If no part of speech tags are supplied, lemmatisation tools tend to assume words as nouns, so the process becomes a sophisticated de-pluraliser. \n", 818 | "\n", 819 | "Again, you import a specific tool from nltk, define a short form for its use, apply it to the relevant input variable, saving the output as a new variable with a suitable name. \n", 820 | "\n", 821 | "Once more, unto the Run/Shift+Enter!" 
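Before running the lemmatiser below, one brief aside: the stemming section above mentioned packages that replace contractions (so that 'haven't' becomes 'have' and 'not'). The snippet here is a supplementary sketch (not part of the original notebook); it assumes the third-party 'contractions' package, which would need to be installed first, and other packages offer similar functions.

```python
# Expand contractions before tokenising; best applied to the raw 'corpus' string.
!pip install contractions
import contractions

print(contractions.fix("We haven't finished, but it's close."))
# likely output: "We have not finished, but it is close."
```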
822 | ] 823 | }, 824 | { 825 | "cell_type": "code", 826 | "execution_count": null, 827 | "metadata": {}, 828 | "outputs": [], 829 | "source": [ 830 | "nltk.download('wordnet')\n", 831 | "from nltk.corpus import wordnet\n", 832 | "from nltk.stem import WordNetLemmatizer\n", 833 | "lemmatizer = WordNetLemmatizer() \n", 834 | " \n", 835 | "print('rocks :', lemmatizer.lemmatize('rocks')) #a few examples of lemmatising as a de-pluraliser\n", 836 | "print('corpora :', lemmatizer.lemmatize('corpora'))\n", 837 | "print('cares :', lemmatizer.lemmatize('cares')) #no part of speech tag supplied, so 'cares' is treated as noun\n", 838 | "print('caring :', lemmatizer.lemmatize('caring', pos = \"v\")) #when part of speech tag added, 'caring' is treated as verb \n", 839 | "print('cared :', lemmatizer.lemmatize('cared', pos = \"v\"))" 840 | ] 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": {}, 845 | "source": [ 846 | "The results show that our examples produce good output - 'rocks', 'corpora' and 'cares' are all de-pluralised correctly. The examples with part of speech tags also show that 'caring' and 'cared' are both correctly converted to 'care' as the base verb. \n", 847 | "\n", 848 | "Let's try it on our corpus, this time applying it to the 'corpus_no_space' variable, which has not had the stemming process applied to it. \n", 849 | "\n", 850 | "Run/Shift+Enter. " 851 | ] 852 | }, 853 | { 854 | "cell_type": "code", 855 | "execution_count": null, 856 | "metadata": {}, 857 | "outputs": [], 858 | "source": [ 859 | "corpus_lemmed = [lemmatizer.lemmatize(word) for word in corpus_no_space]\n", 860 | "\n", 861 | "print(corpus_lemmed[:100])" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": {}, 867 | "source": [ 868 | "Well, the results are a bit mixed. There were no part of speech tags in our corpus, so everything was treated as nouns. The corpus has been effectively de-pluralised, but all of the different verb tenses remain. So, I guess we need to mark the corpus for part of speech tags, usually abbreviated to POS. \n", 869 | "\n", 870 | "But that is a topic for the next section!" 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": {}, 876 | "source": [ 877 | "## Conclusions" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "We have achieved a whole lot already! This is great work! \n", 885 | "\n", 886 | "Now, you will have to think carefully about:\n", 887 | "- what processes you will need for the analysis you want to run, \n", 888 | "- what is the right order of processes for your corpus/corpora and your research questions, and \n", 889 | "- how will you keep track of which processes you run and in which order. Replicability demands clear step-by-steps!\n", 890 | "\n" 891 | ] 892 | }, 893 | { 894 | "cell_type": "markdown", 895 | "metadata": {}, 896 | "source": [ 897 | "## Further reading and resources" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "Books, tutorials, package recommendations, etc. for Python\n", 905 | "- Programming with Python for Social Scientists. Brooker, 2020. https://study.sagepub.com/brooker\n", 906 | "- Automate the Boring Stuff with Python: Practical Programming for Total Beginners, Sweigart, 2019. 
ISBN-13: 9781593279929\n", 907 | "- SentDex, python programming tutorials on YouTube https://www.youtube.com/user/sentdex\n", 908 | "- nltk (Natural Language Toolkit) https://www.nltk.org/book/ch01.html\n", 909 | "- nltk.corpus http://www.nltk.org/howto/corpus.html\n", 910 | "- spaCy https://nlpforhackers.io/complete-guide-to-spacy/\n", 911 | "\n", 912 | "Books and package recommendations for R\n", 913 | "- Quanteda, an R package for text analysis https://quanteda.io/​\n", 914 | "- Text Mining with R, a free online book https://www.tidytextmining.com/​" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "metadata": {}, 920 | "source": [ 921 | "
Next section: Extraction
" 922 | ] 923 | } 924 | ], 925 | "metadata": { 926 | "kernelspec": { 927 | "display_name": "Python 3 (ipykernel)", 928 | "language": "python", 929 | "name": "python3" 930 | }, 931 | "language_info": { 932 | "codemirror_mode": { 933 | "name": "ipython", 934 | "version": 3 935 | }, 936 | "file_extension": ".py", 937 | "mimetype": "text/x-python", 938 | "name": "python", 939 | "nbconvert_exporter": "python", 940 | "pygments_lexer": "ipython3", 941 | "version": "3.11.9" 942 | }, 943 | "toc": { 944 | "base_numbering": 1, 945 | "nav_menu": {}, 946 | "number_sections": true, 947 | "sideBar": true, 948 | "skip_h1_title": true, 949 | "title_cell": "Table of Contents", 950 | "title_sidebar": "Contents", 951 | "toc_cell": true, 952 | "toc_position": {}, 953 | "toc_section_display": true, 954 | "toc_window_display": false 955 | }, 956 | "varInspector": { 957 | "cols": { 958 | "lenName": 16, 959 | "lenType": 16, 960 | "lenVar": 40 961 | }, 962 | "kernels_config": { 963 | "python": { 964 | "delete_cmd_postfix": "", 965 | "delete_cmd_prefix": "del ", 966 | "library": "var_list.py", 967 | "varRefreshCmd": "print(var_dic_list())" 968 | }, 969 | "r": { 970 | "delete_cmd_postfix": ") ", 971 | "delete_cmd_prefix": "rm(", 972 | "library": "var_list.r", 973 | "varRefreshCmd": "cat(var_dic_list()) " 974 | } 975 | }, 976 | "types_to_exclude": [ 977 | "module", 978 | "function", 979 | "builtin_function_or_method", 980 | "instance", 981 | "_Feature" 982 | ], 983 | "window_display": false 984 | } 985 | }, 986 | "nbformat": 4, 987 | "nbformat_minor": 2 988 | } 989 | -------------------------------------------------------------------------------- /code/2-Extraction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "slideshow": { 14 | "slide_type": "-" 15 | } 16 | }, 17 | "source": [ 18 | "# Text-mining: Basics" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Welcome to this UK Data Service *Computational Social Science* training series! \n", 26 | "\n", 27 | "The various *Computational Social Science* training series, all of which guide you through some of the popular and useful computational techniques, tools, methods and concepts that social science research might want to use. For example, this series covers collecting data from websites and social media platorms, working with text data, conducting simulations (agent based modelling), and more. The series includes recorded video webinars, interactive notebooks containing live programming code, reading lists and more.\n", 28 | "\n", 29 | "* To access training materials on our GitHub site: [Training Materials]\n", 30 | "\n", 31 | "* To keep up to date with upcoming and past training events: [Events]\n", 32 | "\n", 33 | "* To get in contact with feedback, ideas or to seek assistance: [Help]\n", 34 | "\n", 35 | "Dr J. Kasmire
\n", 36 | "UK Data Service
\n", 37 | "University of Manchester
" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "toc": true 44 | }, 45 | "source": [ 46 | "

Table of Contents

\n", 47 | "
" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "\n", 55 | "There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars, if unsure hover your mouse over the buttons). " 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "-------------------------------------\n", 63 | "\n", 64 | "
This is notebook 2 of 2 in this lesson
\n", 65 | "\n", 66 | "-------------------------------------" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Introduction" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "At the end of the last section, we got held up by the need to do part-of-speech (POS) tagging in order to get a really effective lemmatisation process. POS tagging is actually basic NLP, as opposed to the kind of cleaning and regularising that we were doing in the last section. So why should a NLP process be needed before the preparatory processing is done? \n", 81 | "\n", 82 | "Well, it comes done to choices. Not every analysis will need a sophisticated lemmatiser, so those projects may have a nice finish to the processing step and the start of the extraction or NLP step. Others willneed the lemmatiser or name entity recognisers or other advanced preparatory steps. Those projects will have a less clear distinction between preparation for NLP and NLP. \n", 83 | "\n", 84 | "But even if the project ends up having a clear distinction between the processes, researchers may find that after they start doing some NLP processes, they need to go back and run different preparatory processes instead of or in addition to the ones they chose earlier. \n", 85 | "\n", 86 | "The main takeaway point here is that researchers need to know that developing a text-mining project can be messy, iterative, and complicated. I recommend that you think about each step as elements in a pipeline (or in multiple pipelines). I recommend that you build your own code functions that concatenate the steps, running each one from the output of the previous one. \n", 87 | "\n", 88 | "In this way, you get a fresh clean output at the end of the pipeline each time whenever you need one. It also means that everything you apply the pipeline to gets treated in the same way, with each process done in the same order. This helps replicability!\n" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "### Guide to using this resource\n", 96 | "\n", 97 | "This learning resource was built using Jupyter Notebook, an open-source software application that allows you to mix code, results and narrative in a single document. As Barba et al. (2019) espouse:\n", 98 | "> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.\n", 99 | "\n", 100 | "If you are familiar with Jupyter notebooks then skip ahead to the main content (*Preliminary NLP*). Otherwise, the following is a quick guide to navigating and interacting with the notebook." 
101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "### Interaction\n", 108 | "\n", 109 | "**You only need to execute the code that is contained in sections which are marked by `In []`.**\n", 110 | "\n", 111 | "To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).\n", 112 | "\n", 113 | "Try it for yourself:" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "print(\"Enter your name and press enter:\")\n", 123 | "name = input()\n", 124 | "print(\"\\r\")\n", 125 | "print(\"Hello {}, enjoy learning more about Python and computational social science!\".format(name)) " 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "### Learn more\n", 133 | "\n", 134 | "Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the materials provided by Dani Arribas-Bel at the University of Liverpool." 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "## Preliminary NLP (or finishing up the processing)" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "Let's start off by importing and downloading all the things we will need. \n", 149 | "\n", 150 | "Run/Shift+Enter." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "import nltk # get nltk \n", 160 | "from nltk import word_tokenize # and some of its key functions\n", 161 | "from nltk import sent_tokenize \n", 162 | "\n", 163 | "!pip install autocorrect \n", 164 | "from autocorrect import Speller # things we need for spell checking\n", 165 | "check = Speller(lang='en')\n", 166 | "\n", 167 | "import re # things we need for RegEx corrections\n", 168 | "def multiple_replace(dict, text):\n", 169 | " regex = re.compile(\"(%s)\" % \"|\".join(map(re.escape, dict.keys())))\n", 170 | " return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) \n", 171 | "\n", 172 | "if __name__ == \"__main__\": \n", 173 | " dict = {\n", 174 | " \"CA\" : \"California\",\n", 175 | " \"United Kingdom\" : \"U.K.\",\n", 176 | " \"United Kingdom of Great Britain and Northern Ireland\" : \"U.K.\",\n", 177 | " \"United Kingdom of Great Britain\" : \"U.K.\",\n", 178 | " \"UK\" : \"U.K.\",\n", 179 | " \"Privacy Policy\" : \"noodle soup\",}\n", 180 | "\n", 181 | "English_punctuation = \"-!\\\"#$%&()'*+,./:;<=>?@[\\]^_`{|}~''“”\" # Things for removing punctuation, stopwords and empty strings\n", 182 | "table_punctuation = str.maketrans('','', English_punctuation) \n", 183 | "\n", 184 | "nltk.download('stopwords')\n", 185 | "nltk.download('punkt')\n", 186 | "nltk.download('wordnet')\n", 187 | "nltk.download('webtext')\n", 188 | "\n", 189 | "from nltk.corpus import stopwords\n", 190 | "stop_words = set(stopwords.words('english'))\n", 191 | "\n", 192 | "from nltk.corpus import wordnet # Finally, things we need for lemmatising!\n", 193 | "from nltk.stem import WordNetLemmatizer\n", 194 | "lemmatizer = WordNetLemmatizer() \n", 195 | "nltk.download('averaged_perceptron_tagger') # Like a POS-tagger...\n", 196 | "\n", 197 | "print(\"Succesfully imported 
necessary modules\") # The print statement is just a bit of encouragement!" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "### POS - part of speech tagging" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Now, let's get back to where we were when we left off last time - with a tokenised corpus on which we need to run a POS tagger. \n", 212 | "\n", 213 | "Run/Shift+Enter, as above!" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "with open(\"./data/sample_text.txt\", \"r\", encoding = \"ISO-8859-1\") as f:\n", 223 | " corpus = f.read()\n", 224 | " \n", 225 | "corpus_words = word_tokenize(corpus)\n", 226 | "\n", 227 | "corpus_lower = [word.lower() for word in corpus_words]\n", 228 | "\n", 229 | "corpus_correct_spell = []\n", 230 | "for word in corpus_lower:\n", 231 | " corpus_correct_spell.append(check(word)) \n", 232 | "\n", 233 | "corpus_no_stopwords = []\n", 234 | "for word in corpus_correct_spell:\n", 235 | " if word not in stop_words:\n", 236 | " corpus_no_stopwords.append(word)\n", 237 | " \n", 238 | "corpus_no_punct = [w.translate(table_punctuation) for w in corpus_no_stopwords] \n", 239 | "corpus_no_space = list(filter(None, corpus_no_punct)) \n", 240 | " \n", 241 | "print(corpus_no_space[:100])" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "Excellent. Now it is time to tag that corpus with POS-tags. This is pretty easy, as nltk comes with a POS-tagger. \n", 249 | "\n", 250 | "Run/Shift+Enter, as you would expect. " 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "corpus_pos_tagged = nltk.pos_tag(corpus_no_space) \n", 260 | "print(corpus_pos_tagged[:100])" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "Excellent. That has successfully added POS tags to all off the words in our corpus. Now, let's try lemmatising again with the POS tags. \n", 268 | "\n", 269 | "Despite what seems obvious, the nltk POS tagger does not use the same POS tags that the nltk lemmatize function needs. Why? I have no idea. \n", 270 | "\n", 271 | "But to move forward, I need a to define a quick little function called get_wordnet_pos to convert the tag format to the right one. I tell a lie. I did not write this function but copied it off of Stack Overflow. This is not cheating so much as being economical. A HUGE number of the things you want to do or the problems you want to solve will be discussed on Stack Overflow. Just use a popular search engine to find them, read through all the answers, try them out. \n", 272 | "\n", 273 | "Having defined the get_wordnet_pos function, the code belowe then creates a new, blank list called corpus_lemmed. \n", 274 | "After that, the code iterates over corpus_pos_tagged, looking at each word and POS-tag pair, uses the get_wordnet_pos function to convert the POS-tag to the right format, and using that to lemmatize correctly. \n", 275 | "\n", 276 | "At the end, the lemmatised word is appended to the new list we created. \n", 277 | "\n", 278 | "Go ahead. \n", 279 | "Run/Shift+Enter. \n", 280 | "You know you want to!" 
281 |    ] 282 |   }, 283 |   { 284 |    "cell_type": "code", 285 |    "execution_count": null, 286 |    "metadata": {}, 287 |    "outputs": [], 288 |    "source": [ 289 |     "def get_wordnet_pos(word):\n", 290 |     "    \"\"\"Map POS tag to first character lemmatize() accepts\"\"\"\n", 291 |     "    tag = nltk.pos_tag([word])[0][1][0].upper()\n", 292 |     "    tag_dict = {\"J\": wordnet.ADJ,\n", 293 |     "                \"N\": wordnet.NOUN,\n", 294 |     "                \"V\": wordnet.VERB,\n", 295 |     "                \"R\": wordnet.ADV}\n", 296 |     "    return tag_dict.get(tag, wordnet.NOUN)\n", 297 |     "\n", 298 |     "corpus_lemmed = []\n", 299 |     "for pair in corpus_pos_tagged:\n", 300 |     "    corpus_lemmed.append(lemmatizer.lemmatize(pair[0], get_wordnet_pos(pair[0]))) \n", 301 |     "print(corpus_lemmed[:100])\n", 302 |     "\n", 303 |     "#corpus_lemmed_tagged = [] \n", 304 |     "#for pair in corpus_pos_tagged:\n", 305 |     "#    corpus_lemmed_tagged.append([(lemmatizer.lemmatize(pair[0], get_wordnet_pos(pair[0])), pair[1])]) \n", 306 |     "#print(corpus_lemmed_tagged[:100])" 307 |    ] 308 |   }, 309 |   { 310 |    "cell_type": "markdown", 311 |    "metadata": {}, 312 |    "source": [ 313 |     "The code above returns a list of words only, without any POS-tags. \n", 314 |     "\n", 315 |     "If you want to keep the corpus in pairs of word and POS-tag, you will need to activate the second, commented-out lines. This means you will need to remove the '#' in front of each line of code starting with 'corpus_lemmed_tagged' and re-run the code.\n", 316 |     "\n", 317 |     "Give it a try!" 318 |    ] 319 |   }, 320 |   { 321 |    "cell_type": "markdown", 322 |    "metadata": {}, 323 |    "source": [ 324 |     "### Named Entity Recognition and chunking" 325 |    ] 326 |   }, 327 |   { 328 |    "cell_type": "markdown", 329 |    "metadata": {}, 330 |    "source": [ 331 |     "We are really getting somewhere now! Let's try another basic NLP process - CHUNKING!\n", 332 |     "\n", 333 |     "Named Entity Recognition is a specific kind of 'chunk' operation. Chunking operations iterate over a corpus that has been word tokenised and POS-tagged and put it all back together into sentences. Named Entity Recognition does this too, with special attention to building up the noun phrases that capture well-known entities or organisations. \n", 334 |     "\n", 335 |     "The chunks are returned within sets of nested brackets (both square and round) to capture different levels of nesting. \n", 336 |     "\n", 337 |     "So, 'The Cat in the Hat' would come out as (S The/DT (ORGANIZATION Cat/NNP) in/IN the/DT Hat/NNP). \n", 338 |     "The 'S' at the beginning stands for 'sentence', which is the highest level grouping that the chunker can find. \n", 339 |     "The 'Cat' is recognised as the key entity, so is tagged with ORGANIZATION. \n", 340 |     "The 'in the hat' part is captured as belonging to a noun phrase, same as the cat, but it recognises that this is a sentence about a cat, not a sentence about a hat. \n", 341 |     "\n", 342 |     "Clever, eh?\n", 343 |     "Let's try it!" 344 |    ] 345 |   }, 346 |   { 347 |    "cell_type": "code", 348 |    "execution_count": null, 349 |    "metadata": {}, 350 |    "outputs": [], 351 |    "source": [ 352 |     "#importing chunk library from nltk\n", 353 |     "nltk.download('words')\n", 354 |     "nltk.download('maxent_ne_chunker')\n", 355 |     "from nltk import ne_chunk # ne_chunk is 'named entity chunk'. Other chunkers are available.\n", 356 |     "\n", 357 |     "# NER and other chunkers only work on word tokenised and POS tagged corpora... 
\n", 358 | "corpus_pos_tagged2 = nltk.pos_tag(corpus_words)\n", 359 | "corpus_chunked = ne_chunk(corpus_pos_tagged2)\n", 360 | "print(corpus_chunked[57:88])" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "This time, I specifically only asked for a printout of a key range in the resulting corpus. I wanted to highlight here how the word \"Tree\" preceeds those ORGANIZATION entities that hang together as multi-word entities. See, for example, how 'United Kingdom', 'Great Britain' and 'Northern Ireland' are each within square brackets to identify them as the multi-word entity captured by the ORGANIZATION tag. \n", 368 | "\n", 369 | "You might also have noticed that this chunking function is run on corpus_pos_tagged2, which is simply the corpus_words that has been put through the nltk.pos_tag function. This means that corpus_pos_tagged2 still has its stopwords, punctuation, etc. \n", 370 | "\n", 371 | "Why do you think this is? What do you think would happen if you ran the chunking procedure on corpus_pos_tagged which DOES have all the stopwords and punctuation removed?\n", 372 | "\n", 373 | "Well, guess what? You can find out by doing that Run/Shift+Enter thing!" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "corpus_chunked_extra_processes = ne_chunk(corpus_pos_tagged)\n", 383 | "print(corpus_chunked_extra_processes[:100])" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "Hmmm. No 'Tree' markers, no 'ORGANIZATION' markers, etc. \n", 391 | "\n", 392 | "This is because some chunking processes use some of the stopwords (especially determiners like 'an' and 'the') and punctuation, etc. to be useful in determining appropriate chunks. \n", 393 | "\n", 394 | "This may create some challenges for your corpus. For example, if you want to: \n", 395 | "- Count words, then you probably want to remove stopwords, punctuation, etc. \n", 396 | "- Identify chunks, like named entities, then you probably want to leave some or all of the stopwords, punctuation, etc.\n", 397 | "- Count chunks (e.g. count named entities), you probably want to combine the processes in the right order. \n", 398 | "\n", 399 | "Good to know!" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "## Counts and (relative) frequency" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "Excellent! Now, you might be surprised, but a very important function of NLP for analysing text boils down to counting things, often words. This is why so much attention in the last section was focussed on making sure all the words that we want to be counted as 'the same word' appeared in the same form while all the words that we want to count as 'different words' appear in different forms. \n", 414 | "\n", 415 | "Thus, we want to apply the count functions to a corpus that has had some of that standardisation, consolidation, lemmatised (or at least stemming) processes applied already. \n", 416 | "\n", 417 | "First, we import some counting functions, then we apply them to corpus_lemmed. 
\n", 418 | "Run/Shift+Enter" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": {}, 425 | "outputs": [], 426 | "source": [ 427 | "from collections import Counter\n", 428 | "corpus_counts = Counter(corpus_lemmed)\n", 429 | "print(corpus_counts)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "Great! You may have noticed that we applied this count function to a list of words rather than word and POS-tag pairs. This is on purpose, but the code could be written so that it only looks at the first item in each word and POS-tag pairs. \n", 437 | "\n", 438 | "If you want to try that, go ahead. You may want to refer back to the code block where we defined get_wordnet_pos because the code to create corpus_lemmed_tagged uses indices (in [square brackets]) to refer to only one element within a pair. \n", 439 | "\n", 440 | "But, for now, let's have a closer look at the 100 most common words in our corpus by using the 'most_common' function from Counter. \n", 441 | "Run/Shift+Enter!" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": null, 447 | "metadata": { 448 | "scrolled": false 449 | }, 450 | "outputs": [], 451 | "source": [ 452 | "print(corpus_counts.most_common(100))" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "Just for comparison, let's find the 100 most common words in 'Emma' by Jane Austen. \n", 460 | "We do need to import the text as a corpus and process it in the same way as we did your corpus so that they can be seen as comparable. \n", 461 | "\n", 462 | "Run/Shift+Enter - but be patient. This is a lot of processes to run. " 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "nltk.download('gutenberg')\n", 472 | "import nltk.corpus\n", 473 | "emma = nltk.corpus.gutenberg.raw('austen-emma.txt')\n", 474 | "\n", 475 | "emma_words = word_tokenize(emma)\n", 476 | "\n", 477 | "emma_lower = [word.lower() for word in emma_words]\n", 478 | "\n", 479 | "emma_correct_spell = []\n", 480 | "for word in emma_lower:\n", 481 | " emma_correct_spell.append(check(word)) \n", 482 | "\n", 483 | " emma_no_stopwords = []\n", 484 | "for word in emma_lower:\n", 485 | " if word not in stop_words:\n", 486 | " emma_no_stopwords.append(word)\n", 487 | " \n", 488 | "emma_no_punct = [w.translate(table_punctuation) for w in emma_no_stopwords] \n", 489 | "emma_no_space = list(filter(None, emma_no_punct)) \n", 490 | "emma_pos_tagged = nltk.pos_tag(emma_no_space) \n", 491 | "\n", 492 | "emma_lemmed = []\n", 493 | "for pair in emma_pos_tagged:\n", 494 | " emma_lemmed.append(lemmatizer.lemmatize(pair[0], get_wordnet_pos(pair[0]))) \n", 495 | " \n", 496 | "emma_counts = Counter(emma_lemmed)\n", 497 | "print(emma_counts)" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "Excellent! Clearly, the words in Emma are very different than those of our sample corpus, and those words that appear in both occur in very different relative frequencies (not least because one is a page of babble and the other is a full novel).\n", 505 | "\n", 506 | "To get a better idea, how about we compare the 20 most common words from both corpora. 
\n", 507 | "Run/Shift+Enter" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": null, 513 | "metadata": {}, 514 | "outputs": [], 515 | "source": [ 516 | "print(corpus_counts.most_common(20))\n", 517 | "print(emma_counts.most_common(20))" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "Ok. Clearly very different. But let's try one more thing for now... Let's count how many times each of these texts use the word 'personal'. We could use any word as the target word, but I happen to know that there is a non-zero result for these two texts for this word. \n", 525 | "\n", 526 | "Run/Shift+Enter!!!" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": {}, 533 | "outputs": [], 534 | "source": [ 535 | "print(corpus_counts['personal'])\n", 536 | "print(emma_counts['personal'])" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "So, despite being much shorter, the sample text corpus uses the word 'personal' over 8 times more often. \n", 544 | "\n", 545 | "That sounds very personal. \n", 546 | "\n", 547 | "Feel free to choose other words and re-run the code. " 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "## Similarity" 555 | ] 556 | }, 557 | { 558 | "cell_type": "markdown", 559 | "metadata": {}, 560 | "source": [ 561 | "Now, comparing the most common words in two documents is one way to compare how similar they are, but there are more sophisticated ways. \n", 562 | "\n", 563 | "spaCy is a relatively new option for text-mining in python, but it is very powerful. First off, we need to download and import a few things. \n", 564 | "\n", 565 | "Run/Shift+Enter (you are already so good at this!)" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": null, 571 | "metadata": {}, 572 | "outputs": [], 573 | "source": [ 574 | "!pip install spacy -q\n", 575 | "import spacy\n", 576 | "!python -m spacy download en_core_web_lg -q\n", 577 | "from nltk.corpus import webtext" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "Super. Now, let's load the model via spacy-load, and then test it on a trivial corpus that has only three words. \n", 585 | "\n", 586 | "Run/Shift+Enter already!" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "metadata": {}, 593 | "outputs": [], 594 | "source": [ 595 | "nlp = spacy.load('en_core_web_lg')\n", 596 | "\n", 597 | "word_similarity = nlp(\"troll elf rabbit\")\n", 598 | "\n", 599 | "\n", 600 | "for word1 in word_similarity:\n", 601 | " for word2 in word_similarity:\n", 602 | " print(word1.text, word2.text, word1.similarity(word2))" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": {}, 608 | "source": [ 609 | "This code does a few things. First, it loads a model of common words in English (this is the 'en_core_web_lg') that has 300 dimension vectors for each word. If that sounds like nonsense, don't worry too much. \n", 610 | "\n", 611 | "What it means is that the model has a list of lots of common words in English, each of which comes with a 'scorecard' of how they rank on 300 different features which is a sort of abstract way of capturing the meaning of the word. 
This is not derived from logical scoring by people, but through an AI sort of analysis of how the words are used in LOTS of text, which finds patterns like:\n", 612 | "- Is the target word used more often like a noun or a verb? \n", 613 | "- Is it usually plural (if a noun) or in gerrund ('ing'-form, if a verb)? \n", 614 | "- Is it frequently preceded by adjectives like 'little' or 'unprecedented' or adverbs like 'always' or 'never'? \n", 615 | "\n", 616 | "What comes out of this code is a pair-wise comparison of all the vectors, or scorecards, for the words in our little three-word corpus. This comes out as a number between 0 and 1, with 0 being totally different (or not found in the model) and 1 being a perfect match. \n", 617 | "\n", 618 | "Looking at the results, we see that comparing a word to itself (e.g. the first line which has 'dog dog 1.0') scores a 1, or 100 percent match. Not surprising. \n", 619 | "\n", 620 | "We also see that 'dog' and 'cat' are a pretty good match at 0.8. Both words are likely to be used in similar ways. For example, both would fit equally well into a sentence like \"I really want to get a pet (dog/cat), but I just don't spend enough time at home to take care of it properly.\"\n", 621 | "\n", 622 | "We also see that 'banana' is closer to 'cat' than to 'dog', but not by much. Presumably, bananas are more likely to sit around like cats than to run around like dogs? No idea. \n", 623 | "\n", 624 | "Feel free to edit the little three-word corpus and re-run the similarity test. Try 'puppy' to see if it is closer to 'dog' than 'dog' is to 'cat'? Try adding 'apple'? Or 'unprecedented'? Or anything else?" 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "Of course, comparing individual words is all well and good, but what you probably want to compare is one text to another. To do that, first we need to prepare a few texts to do some comparing. You have already seen our sample corpus and 'Emma' by Jane Austen, but let's also add 'Persuasion' by Jane Austen and a selection of text from another nltk.corpus of texts from the web. The specific text is called 'firefox'. \n", 632 | "\n", 633 | "All of these texts need to be put through the nlp function we created from spaCy so that it creates a document vector. \n", 634 | "\n", 635 | "Document vectors are much like word vectors in that they score the document on a large number of dimensions. However, instead of coming packaged with spaCy, they are created from the text that you pass to spaCy. Which is what we are doing now.\n", 636 | "\n", 637 | "Run/Shift+Enter below!" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": {}, 644 | "outputs": [], 645 | "source": [ 646 | "SimEmma = nlp(nltk.corpus.gutenberg.raw('austen-emma.txt'))\n", 647 | "SimPers = nlp(nltk.corpus.gutenberg.raw('austen-persuasion.txt'))\n", 648 | "SimFire = nlp(nltk.corpus.webtext.raw('firefox.txt'))\n", 649 | "SimCorp = nlp(corpus)" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "You can, of course, have a peek at the contents of 'Persuasion' or 'firefox' if you like. You probably know how to do that with a print command, but maybe you want to run some of the other operations on the text too. \n", 657 | "\n", 658 | "Or you can plow on ahead and Run/Shift+Enter to run the document vector similarity comparisons below. 
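Before running the comparisons in the next cell, a short aside on what `.similarity()` is actually doing. For a model like this one, each document object carries a single vector (the average of its tokens' 300-dimension word vectors), and the similarity score is the cosine of the angle between two such vectors. A minimal check, assuming the `SimEmma` and `SimPers` objects created above are in memory:

```python
import numpy as np

# Each spaCy Doc carries one vector; for this model it is the average of the
# token vectors, so it has the same 300 dimensions as the word vectors.
print(SimEmma.vector.shape)

# Cosine similarity computed by hand...
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(SimEmma.vector, SimPers.vector))
# ...should match (up to floating-point precision) what the next cell prints
# for the Emma vs Persuasion comparison.
print(SimEmma.similarity(SimPers))
```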
" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": null, 664 | "metadata": {}, 665 | "outputs": [], 666 | "source": [ 667 | "print(SimEmma.similarity(SimPers))\n", 668 | "print(SimEmma.similarity(SimFire))\n", 669 | "print(SimEmma.similarity(SimCorp))" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "Each of these compares 'Emma' to one of the other texts (the ones in parentheses at the end). \n", 677 | "\n", 678 | "Are you surprised by the results? Feel free to try comparing the other texts to each other, rather than just to 'Emma'. \n", 679 | "\n" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "## Discovery\n" 687 | ] 688 | }, 689 | { 690 | "cell_type": "markdown", 691 | "metadata": {}, 692 | "source": [ 693 | "Now, for the final bit of NLP that we cover here, let's talk about discovery. This is about identifying patterns that reveal relationships and applying it more widely to discover additional relationships. Let's start by importing a few things that we need. \n", 694 | "\n", 695 | "Run/Shift+Enter. " 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "metadata": {}, 702 | "outputs": [], 703 | "source": [ 704 | "import re \n", 705 | "import string \n", 706 | "import nltk \n", 707 | "import spacy \n", 708 | "import pandas as pd \n", 709 | "import numpy as np \n", 710 | "import math \n", 711 | "from tqdm import tqdm \n", 712 | "\n", 713 | "from spacy.matcher import Matcher \n", 714 | "from spacy.tokens import Span \n", 715 | "from spacy import displacy \n", 716 | "\n", 717 | "pd.set_option('display.max_colwidth', 200)" 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "metadata": {}, 723 | "source": [ 724 | "Now, let's take a look at 'Emma'. We start by tokenising the raw text into sentences, then we create a list of all sentences that contain the sub-string \"like a\", then we create run our list through our nlp function from spaCy. \n", 725 | "\n", 726 | "Run/Shift+Enter. " 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": null, 732 | "metadata": {}, 733 | "outputs": [], 734 | "source": [ 735 | "# sample text \n", 736 | "emma_sentences = nltk.sent_tokenize(nltk.corpus.gutenberg.raw('austen-emma.txt'))\n", 737 | "emma_such_as =\"\"\n", 738 | "for sentence in emma_sentences:\n", 739 | " if \"like a \" in sentence:\n", 740 | " emma_such_as += sentence\n", 741 | " \n", 742 | "# create a spaCy object \n", 743 | "doc = nlp(emma_such_as)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "Now, let's take a closer look at context around those instances of \"like a\". \n", 751 | "\n", 752 | "To do that, we use some spaCy functions that print the word, print its role in the sentence, and print its POS-tag. \n", 753 | "\n", 754 | "This lets us see if we can find any patterns in the word roles or POS-tags that might help us understand the patterns relating to \"like a\". \n", 755 | "\n", 756 | "Run/Shift+Enter." 
757 | ] 758 | }, 759 | { 760 | "cell_type": "code", 761 | "execution_count": null, 762 | "metadata": {}, 763 | "outputs": [], 764 | "source": [ 765 | "# print token, dependency, POS tag \n", 766 | "for tok in doc: \n", 767 | " print(tok.text, \"-->\",tok.dep_,\"-->\", tok.pos_)" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "It seems like a good start would be to define a pattern with \"like a\" followed by a noun. So first, we define that pattern. \n", 775 | "\n", 776 | "Run/Shift+Enter. " 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "metadata": {}, 783 | "outputs": [], 784 | "source": [ 785 | "#define the pattern \n", 786 | "pattern = [{'LOWER': 'like'}, \n", 787 | " {'LOWER': 'a'}, \n", 788 | " {'POS': 'NOUN'}]" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "Next, we run a function called Matcher over the text that returns all of the substrings that match the pattern. \n", 796 | "\n", 797 | "Run/Shift+Enter. " 798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": null, 803 | "metadata": {}, 804 | "outputs": [], 805 | "source": [ 806 | "# Matcher class object \n", 807 | "matcher = Matcher(nlp.vocab) \n", 808 | "matcher.add(\"matching_1\", [pattern]) \n", 809 | "\n", 810 | "matches = matcher(doc) \n", 811 | "for match_id, start, end in matches:\n", 812 | " string_id = nlp.vocab.strings[match_id] # Get string representation\n", 813 | " span = doc[start:end] # The matched span\n", 814 | " print(span.text)" 815 | ] 816 | }, 817 | { 818 | "cell_type": "markdown", 819 | "metadata": {}, 820 | "source": [ 821 | "As you would expect, we get a list of substrings that match the pattern we have defined. Let's create a little more ambitious pattern. \n", 822 | "\n", 823 | "This time, we want to capture verbs followed by \"like a\" followed by up to three optional modifiers (adverbs and adjectives) and finally followed by a noun. \n", 824 | "\n", 825 | "Run/Shift+Enter. " 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": null, 831 | "metadata": {}, 832 | "outputs": [], 833 | "source": [ 834 | "# Matcher class object\n", 835 | "matcher = Matcher(nlp.vocab)\n", 836 | "\n", 837 | "#redefine the pattern\n", 838 | "pattern2 = [{'POS':'VERB'},\n", 839 | " {'LOWER': 'like'},\n", 840 | " {'LOWER': 'a'},\n", 841 | " {'DEP':'amod', 'OP':\"?\"},\n", 842 | " {'DEP':'amod', 'OP':\"?\"},\n", 843 | " {'DEP':'amod', 'OP':\"?\"},\n", 844 | " {'POS': 'NOUN'}]\n", 845 | "\n", 846 | "matcher.add(\"matching_1\", [pattern2])\n", 847 | "matches = matcher(doc)\n", 848 | "\n", 849 | "for match_id, start, end in matches:\n", 850 | " string_id = nlp.vocab.strings[match_id] # Get string representation\n", 851 | " span = doc[start:end] # The matched span\n", 852 | " print(span.text)" 853 | ] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": {}, 858 | "source": [ 859 | "Well, this is more interesting. We see that some one can 'look like a sensible young man', can 'argue like a young man' and can 'write like a sensible man'. This suggests that the author, or possible society at the time of publication, associates these verbs and these adjectives with men. Potentially, the analysis is even more complicated in that young men might argue while old men do not, or that sensible men write much more often than other men. 
\n", 860 | "\n", 861 | "If we continued to analyse the text in this way, we might also find a similar combinations for verbs and adjectives associated with women. Will there be any evidence that (young) women are sensible? Or that women write or argue in ways that are comparable to how men write and argue? \n" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": {}, 867 | "source": [ 868 | "## Conclusions" 869 | ] 870 | }, 871 | { 872 | "cell_type": "markdown", 873 | "metadata": {}, 874 | "source": [ 875 | "We have only started to dip our toes into what NLP can do, but hopefully this will whet your appetite to know more. \n", 876 | "\n", 877 | "As before, these exercises and this sample code should highlight to you that you need to think about:\n", 878 | "- your research questions and what you want to show, explore or understand, \n", 879 | "- your data, texts, corpus, or other research materials to analyse etc. \n", 880 | "- how your processes are related to your reserch questions, and \n", 881 | "- how your processes and data can be made available and reproducible. " 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": {}, 887 | "source": [ 888 | "## Further reading and resources" 889 | ] 890 | }, 891 | { 892 | "cell_type": "markdown", 893 | "metadata": {}, 894 | "source": [ 895 | "Books, tutorials, package recommendations, etc. for Python\n", 896 | "- Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper, http://www.nltk.org/book/\n", 897 | "- Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze, https://nlp.stanford.edu/fsnlp/promo/\n", 898 | "- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf\n", 899 | "- Deep Learning in Natural Language Processing by Li Deng, Yang Liu, https://lidengsite.wordpress.com/book-chapters/\n", 900 | "- nltk.corpus http://www.nltk.org/howto/corpus.html\n", 901 | "- spaCy https://nlpforhackers.io/complete-guide-to-spacy/\n", 902 | "\n", 903 | "Books and package recommendations for R\n", 904 | "- Quanteda, an R package for text analysis https://quanteda.io/​\n", 905 | "- Text Mining with R, a free online book https://www.tidytextmining.com/​" 906 | ] 907 | }, 908 | { 909 | "cell_type": "markdown", 910 | "metadata": {}, 911 | "source": [ 912 | "
Previous section: Processing text
" 913 | ] 914 | } 915 | ], 916 | "metadata": { 917 | "kernelspec": { 918 | "display_name": "Python 3 (ipykernel)", 919 | "language": "python", 920 | "name": "python3" 921 | }, 922 | "language_info": { 923 | "codemirror_mode": { 924 | "name": "ipython", 925 | "version": 3 926 | }, 927 | "file_extension": ".py", 928 | "mimetype": "text/x-python", 929 | "name": "python", 930 | "nbconvert_exporter": "python", 931 | "pygments_lexer": "ipython3", 932 | "version": "3.11.9" 933 | }, 934 | "toc": { 935 | "base_numbering": 1, 936 | "nav_menu": {}, 937 | "number_sections": true, 938 | "sideBar": true, 939 | "skip_h1_title": true, 940 | "title_cell": "Table of Contents", 941 | "title_sidebar": "Contents", 942 | "toc_cell": true, 943 | "toc_position": {}, 944 | "toc_section_display": true, 945 | "toc_window_display": false 946 | }, 947 | "varInspector": { 948 | "cols": { 949 | "lenName": 16, 950 | "lenType": 16, 951 | "lenVar": 40 952 | }, 953 | "kernels_config": { 954 | "python": { 955 | "delete_cmd_postfix": "", 956 | "delete_cmd_prefix": "del ", 957 | "library": "var_list.py", 958 | "varRefreshCmd": "print(var_dic_list())" 959 | }, 960 | "r": { 961 | "delete_cmd_postfix": ") ", 962 | "delete_cmd_prefix": "rm(", 963 | "library": "var_list.r", 964 | "varRefreshCmd": "cat(var_dic_list()) " 965 | } 966 | }, 967 | "types_to_exclude": [ 968 | "module", 969 | "function", 970 | "builtin_function_or_method", 971 | "instance", 972 | "_Feature" 973 | ], 974 | "window_display": false 975 | } 976 | }, 977 | "nbformat": 4, 978 | "nbformat_minor": 2 979 | } 980 | -------------------------------------------------------------------------------- /code/3-Classifiers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "slideshow": { 14 | "slide_type": "-" 15 | } 16 | }, 17 | "source": [ 18 | "# Text-mining: Classifiers and sentiment analysis" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Welcome to this UK Data Service *Computational Social Science* training series! \n", 26 | "\n", 27 | "The various *Computational Social Science* training series, all of which guide you through some of the popular and useful computational techniques, tools, methods and concepts that social science research might want to use. For example, this series covers collecting data from websites and social media platorms, working with text data, conducting simulations (agent based modelling), and more. The series includes recorded video webinars, interactive notebooks containing live programming code, reading lists and more.\n", 28 | "\n", 29 | "* To access training materials on our GitHub site: [Training Materials]\n", 30 | "\n", 31 | "* To keep up to date with upcoming and past training events: [Events]\n", 32 | "\n", 33 | "* To get in contact with feedback, ideas or to seek assistance: [Help]\n", 34 | "\n", 35 | "Dr J. Kasmire
\n", 36 | "UK Data Service
\n", 37 | "University of Manchester
" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "toc": true 44 | }, 45 | "source": [ 46 | "

Table of Contents

\n", 47 | "
" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "\n", 55 | "There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars, if unsure hover your mouse over the buttons). " 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Introduction" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "Sentiment analysis is a commonly used example of automatic classification. To be clear, automatic classification means that a model or learning algorithm has been trained on correctly classified documents and it uses this training to return a probability assessment of what class a new document should belong to. \n", 70 | "\n", 71 | "Sentiment analysis works the same way, but usually only has two classes - positive and negative. A trained model looks at new data and says whether that new data is likely to be positive or negative. Let's take a look!" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Guide to using this resource\n", 79 | "\n", 80 | "This learning resource was built using Jupyter Notebook, an open-source software application that allows you to mix code, results and narrative in a single document. As Barba et al. (2019) espouse:\n", 81 | "> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.\n", 82 | "\n", 83 | "If you are familiar with Jupyter notebooks then skip ahead to the main content (*Sentiment Analysis as an example of machine learning/deep learning classification*). Otherwise, the following is a quick guide to navigating and interacting with the notebook." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "### Interaction\n", 91 | "\n", 92 | "**You only need to execute the code that is contained in sections which are marked by `In []`.**\n", 93 | "\n", 94 | "To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).\n", 95 | "\n", 96 | "Try it for yourself:" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "print(\"Enter your name and press enter:\")\n", 106 | "name = input()\n", 107 | "print(\"\\r\")\n", 108 | "print(\"Hello {}, enjoy learning more about Python and computational social science!\".format(name)) " 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "### Learn more\n", 116 | "\n", 117 | "Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the materials provided by Dani Arribas-Bel at the University of Liverpool." 
118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Sentiment Analysis as an example of machine learning/deep learning classification" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "Let's start off by importing and downloading some useful packages, including `textblob`: it is based on `nltk` and has built in sentiment analysis tools. \n", 132 | "\n", 133 | "To import the packages, click in the code cell below and hit the 'Run' button at the top of this page or by holding down the 'Shift' key and hitting the 'Enter' key. \n", 134 | "\n", 135 | "For the rest of this notebook, I will use 'Run/Shift+Enter' as short hand for 'click in the code cell below and hit the 'Run' button at the top of this page or by hold down the 'Shift' key while hitting the 'Enter' key'. \n", 136 | "\n", 137 | "Run/Shift+Enter." 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "import os # os is a module for navigating your machine (e.g., file directories).\n", 147 | "import nltk # nltk stands for natural language tool kit and is useful for text-mining. \n", 148 | "import csv # csv is for importing and working with csv files\n", 149 | "import statistics\n", 150 | "\n", 151 | "# List all of the files in the \"data\" folder that is provided to you\n", 152 | "\n", 153 | "for file in os.listdir(\"./data/sentiment-analysis\"):\n", 154 | " print(\"A file we can use is... \", file)\n", 155 | "print(\"\")" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "!pip install -U textblob -q\n", 165 | "!python -m textblob.download_corpora -q\n", 166 | "from textblob import TextBlob" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "## Analyse trivial documents with built-in sentiment analysis tool" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "Now, lets get some data.\n", 181 | "\n", 182 | "Run/Shift+Enter, as above!" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "Doc1 = TextBlob(\"Textblob is just super. I love it!\") # Convert a few basic strings into Textblobs \n", 192 | "Doc2 = TextBlob(\"Cabbages are the worst. Say no to cabbages!\") # Textblobs, like other text-mining objects, are often called\n", 193 | "Doc3 = TextBlob(\"Paris is the capital of France. \") # 'documents'\n", 194 | "print(\"...\")\n", 195 | "type(Doc1)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "Docs 1 through 3 are Textblobs, which we can see by the output of type(Doc1). \n", 203 | "\n", 204 | "We get a Textblob by passing a string to the function that we imported above. Specifically, this is done by using this format --> Textblob('string goes here'). Textblobs are ready for analysis through the textblob tools, such as the built-in sentiment analysis tool that we see in the code below. \n", 205 | "\n", 206 | "Run/Shift+Enter on those Textblobs." 
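A small aside before the sentiment scores in the next cell: sentiment is not the only thing a Textblob exposes. The properties below are standard `textblob` features and none of them are needed for what follows, but they show that the same object also carries tokenised, sentence-split and POS-tagged views of the text.

```python
print(Doc1.words)      # tokenised words
print(Doc1.sentences)  # sentence objects
print(Doc1.tags)       # (word, POS-tag) pairs
```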
207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "print(Doc1.sentiment)\n", 216 | "print(Doc2.sentiment)\n", 217 | "print(Doc3.sentiment)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "The output of the previous code returns two values for each Textblob object. Polarity refers to a positive-negative spectrum while subjectivity refers to an opinion-fact spectrum. \n", 225 | "\n", 226 | "We can see, for example, that Doc1 is fairly positive but also quite subjective while Doc2 is very negative and very subjective. Doc3, in contrast, is both neutral and factual. \n", 227 | "\n", 228 | "Maybe you don't need both polarity and subjectivity. For example, if you are trying to categorise opinions, you don't need the subjectivity score and would only want the polarity. \n", 229 | "\n", 230 | "To get only one of the two values, you can call the appropriate sub-function as shown below. \n", 231 | "\n", 232 | "Run/Shift+Enter for sub-functional fun. " 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "print(Doc1.sentiment.polarity)\n", 242 | "print(Doc1.sentiment.subjectivity)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "## Acquire and analyse trivial documents" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "Super. We have imported some documents (in our case, just sentences in string format) to textblob and analysed it using the built-in sentiment analyser. But we don't want to import documents one string at a time...that would take forever!\n", 257 | "\n", 258 | "Let's import data in .csv format instead! The data here comes from a set of customer reviews of Amazon products. Naturally, not all of the comments in the product reviews are really on topic, but it does not actually matter for our purposes. But, I think it is only fair to warn you...there is some foul language and potentially objectionable personal opinions in the texts if you go through it all. \n", 259 | "\n", 260 | "Run/Shift+Enter (if you dare!)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "with open('./data/sentiment-analysis/training_set.csv', newline='', encoding = 'ISO-8859-1') as f: # Import a csv of scored \"product reviews\"\n", 270 | " reader = csv.reader(f)\n", 271 | " Doc_set = list(reader)\n", 272 | "\n", 273 | "print(Doc_set[45:55]) # Look at a subset of the imported data" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "A very good start (although you will see what I mean about the off-topic comments and foul language). \n", 281 | "\n", 282 | "Now, the .csv file has multiple strings per row, the first of which we want to pass to `texblob` to create a Textblob object. The second is a number representing the class that the statement belongs to. '4' represents 'positive', '2' represents neutral and '0' represents negative. Don't worry about this for now as we will come to that in a moment. \n", 283 | "\n", 284 | "The code below creates a new list that has the text string and the sentiment score for each item in the imported Doc_set, and also shows you the first 20 results of that new list to look at. 
\n", 285 | "\n", 286 | "Run/Shift+Enter" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "Doc_set_analysed = []\n", 296 | "\n", 297 | "for item in Doc_set:\n", 298 | " Doc_set_analysed.append([item[0], item[1], TextBlob(item[0]).sentiment])\n", 299 | "\n", 300 | "print(Doc_set_analysed[45:55])" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "Now, edit the code above so that Doc_set_analysed only has the text string, the number string and the Textblob polarity. \n", 308 | "\n", 309 | "We will want to use that to get a sense of whether the polarity judgements are accurate or not. Thus, we want to know whether the judgement assigned to each statement (the '4', '2' or '0') matches with the polarity assigned by the `textblob` sentiment analyser. \n", 310 | "\n", 311 | "To do this, we need to convert the second item (the '4', '2' or '0') to a 1, 0 or -1 to match what we get back from the sentiment analyser, compare them to find the difference and then find the average difference. \n", 312 | "\n", 313 | "Run\\Shift+Enter. " 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "Doc_set_polarity_accuracy = []\n", 323 | "\n", 324 | "for item in Doc_set_analysed:\n", 325 | " if (item[1] >= '4'): # this code checks the string with the provided judgement\n", 326 | " x = 1 # and replaces it with a number matching textblob's polarity\n", 327 | " elif (item[1] == '2'):\n", 328 | " x = 0\n", 329 | " else:\n", 330 | " x = -1\n", 331 | " y = item[2].polarity\n", 332 | " Doc_set_polarity_accuracy.append(abs(x-y)) # unless my math is entirely wrong, this returns 'accuracy' or\n", 333 | " # the difference between the provided and calculated polarity\n", 334 | " # Exact matches (-1 and -1 or 1 and 1) return 0, complete opposites\n", 335 | " # (1 and -1 or -1 and 1) returning 2, all else proportionally in between. \n", 336 | " \n", 337 | "\n", 338 | "print(statistics.mean(Doc_set_polarity_accuracy)) # Finding the average of all accuracy shows ... it is not great. " 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "Hmmm. If the sentiment analyser were:\n", 346 | "- entirely accurate, we would have an average difference of 0\n", 347 | "- entirely inaccurate, we would have an average difference of 2\n", 348 | "- entirely random, we would expect an average difference of 1\n", 349 | "\n", 350 | "As it stands, we have an average difference that suggests we are a bit more accurate than chance... but not my much. \n", 351 | "\n", 352 | "However, it is important to remember that we are testing an assigned class against a probable class... The assigned class (the '4', '2' or '0' in the original data set) is an absolute judgement and so is always *exactly* 4, 2, or 0 but never 2.8 or 0.05. In contrast, the polarity judgement returned by the sentiment analyser is a probability: it is 1 if the sentiment analyser is absolutely confident that the statement is positive but only .5 if the sentiment analyser is fairly confident that the statement is positive. \n", 353 | "\n", 354 | "In light of this, the fact that we got a better than chance score on our average accuracy test may mean we are doing quite well. 
We could test this, of course, and convert the polarity scores from the sentiment analyser into 1, 0 or -1 or even into 4, 2 and 0 and then compare those. \n", 355 | "\n", 356 | "Heck. Why not? Let's have a go. \n", 357 | "Run\\Shift+Enter. \n" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "Doc_set_polarity_accuracy_2 = []\n", 367 | "\n", 368 | "for item in Doc_set_analysed:\n", 369 | " x = item[1] # This code sets the original judgement assigned to each statement as x\n", 370 | " if (item[2].polarity > 0): # then converts polarity scores of more than 0 to '4'\n", 371 | " y = '4' \n", 372 | " elif (item[2].polarity == 0 ): # converts polarity scores of exactly 0 to '2'\n", 373 | " y = '2'\n", 374 | " else: # and converts negative polarity scores to '0'\n", 375 | " y = '0'\n", 376 | " if x == y: # then compares the assigned judgement to the converted polarity score\n", 377 | " Doc_set_polarity_accuracy_2.append(1) # and adds a 1 if they match exactly\n", 378 | " else:\n", 379 | " Doc_set_polarity_accuracy_2.append(0) # or adds a 0 if they do not match exactly. \n", 380 | "\n", 381 | "print(statistics.mean(Doc_set_polarity_accuracy_2)) # Finds the average of the match rate. Still not great. " 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "Well, an average close to 1 would be entirely accurate while close to 0 would be entirely wrong (and to be fair, *entirely* wrong would also be accurate too...in a sense). \n", 389 | "\n", 390 | "Our average though suggests that our accuracy is still not great. Ah well. " 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "## Train and test a sentiment analysis tool with trivial data" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "Now that we know how to use the built-in analyser, let's have a look back at the sentiment analysis scores for Doc1 and Doc2. \n", 405 | "- Doc1 = 'Textblob is just super. I love it!' which scored scored .48 on polarity... halfway between neutral and positive. \n", 406 | "- Doc2 = 'Cabbages are the worst. Say no to cabbages!' which scored -1 on polarity... the most negative it could score. \n", 407 | "\n", 408 | "Do we really think Doc2 is so much more negative than Doc1 is positive? Hmmmm. The built-in sentiment analyser is clearly not as accurate as we would want. Let's try to train our own, starting with a small set of trivial training and testing data sets. \n", 409 | "\n", 410 | "The following code does a few different things:\n", 411 | "- It defines 'train' as a data set with 10 sentences, each of which is marked as 'pos' or 'neg'.\n", 412 | "- It defines 'test' as a data set with 6 completely different sentences, also marked as 'pos' or 'neg'. \n", 413 | "- It imports NaiveBayesClassifier from the textblob.classifiers.\n", 414 | "- It defines 'cl' as a brand new NaiveBayesClassifier that is trained on the 'train' data set. \n", 415 | "\n", 416 | "Run/Shift+Enter to make it so. 
" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": {}, 423 | "outputs": [], 424 | "source": [ 425 | "train = [\n", 426 | " ('I love this sandwich.', 'pos'),\n", 427 | " ('this is an amazing place!', 'pos'),\n", 428 | " ('I feel very good about these beers.', 'pos'),\n", 429 | " ('this is my best work.', 'pos'),\n", 430 | " (\"what an awesome view\", 'pos'),\n", 431 | " ('I do not like this restaurant', 'neg'),\n", 432 | " ('I am tired of this stuff.', 'neg'),\n", 433 | " (\"I can't deal with this\", 'neg'),\n", 434 | " ('he is my sworn enemy!', 'neg'),\n", 435 | " ('my boss is horrible.', 'neg')]\n", 436 | "test = [\n", 437 | " ('the beer was good.', 'pos'),\n", 438 | " ('I do not enjoy my job', 'neg'),\n", 439 | " (\"I ain't feeling dandy today.\", 'neg'),\n", 440 | " (\"I feel amazing!\", 'pos'),\n", 441 | " ('Gary is a friend of mine.', 'pos'),\n", 442 | " (\"I can't believe I'm doing this.\", 'neg')]\n", 443 | "\n", 444 | "\n", 445 | "from textblob.classifiers import NaiveBayesClassifier\n", 446 | "cl = NaiveBayesClassifier(train)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "Hmm. The code ran but there is nothing to see. This is because we have no output! Let's get some output and see what it did. \n", 454 | "\n", 455 | "The next code block plays around with 'cl', the classifier we trained on our 'train' data set.\n", 456 | "\n", 457 | "The first line asks 'cl' to return a judgment of one sentence about a library. \n", 458 | "\n", 459 | "Then, we ask it to return a judgement of another sentence about something being a doozy. Although both times we get a judgement on whether the sentence is 'pos' or 'neg', the second one has more detailed sub-judgements we can analyse that show us how the positive and negative the sentence is so we can see whether the overall judgement is close or not. \n", 460 | "\n", 461 | "Do the Run/Shift+Enter thing that you are so good at doing!" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [ 470 | "print(\"Our 'cl' classifier says 'This is an amazing library!' is \", cl.classify(\"This is an amazing library!\"))\n", 471 | "print('...')\n", 472 | "\n", 473 | "prob_dist = cl.prob_classify(\"This one is a doozy.\")\n", 474 | "print(\"Our 'cl' classifier says 'This one is a doozy.' is probably\",\n", 475 | " prob_dist.max(), \"because its positive score is \",\n", 476 | " round(prob_dist.prob(\"pos\"), 2),\n", 477 | " \" and its negative score is \",\n", 478 | " round(prob_dist.prob(\"neg\"), 2),\n", 479 | " \".\")" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "Super. Now... What if we want to apply our 'cl' classifier to a document with multiple sentences... What kind of judgements can we get with that? \n", 487 | "\n", 488 | "Well, `textblob` is sophisticated enough to give an overall 'pos' or 'neg' judgement, as well as a sentence-by-sentence judgement. \n", 489 | "\n", 490 | "Run/Shift+Enter, buddy. " 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "blob = TextBlob(\"The beer is good. 
But the hangover is horrible.\", classifier=cl)\n", 500 |     "\n", 501 |     "print(\"Overall, 'blob' is \", blob.classify(), \" because its sentences are ...\")\n", 502 |     "for s in blob.sentences:\n", 503 |     "    print(s)\n", 504 |     "    print(s.classify())" 505 |    ] 506 |   }, 507 |   { 508 |    "cell_type": "markdown", 509 |    "metadata": {}, 510 |    "source": [ 511 |     "What if we try to classify a document that we converted to Textblob format with the built-in sentiment analyser?\n", 512 |     "\n", 513 |     "Well, we still have Doc1 to try it on.\n", 514 |     "\n", 515 |     "Run/Shift+Enter" 516 |    ] 517 |   }, 518 |   { 519 |    "cell_type": "code", 520 |    "execution_count": null, 521 |    "metadata": {}, 522 |    "outputs": [], 523 |    "source": [ 524 |     "print(Doc1)\n", 525 |     "Doc1.classify()" 526 |    ] 527 |   }, 528 |   { 529 |    "cell_type": "markdown", 530 |    "metadata": {}, 531 |    "source": [ 532 |     "Uh huh. We get an error. \n", 533 |     "\n", 534 |     "The error message says the blob known as Doc1 has no classifier. It suggests we train one first, but we can just apply 'cl'. \n", 535 |     "\n", 536 |     "Run/Shift+Enter" 537 |    ] 538 |   }, 539 |   { 540 |    "cell_type": "code", 541 |    "execution_count": null, 542 |    "metadata": {}, 543 |    "outputs": [], 544 |    "source": [ 545 |     "cl_Doc1 = TextBlob('Textblob is just super. I love it!', classifier=cl)\n", 546 |     "cl_Doc1.classify()" 547 |    ] 548 |   }, 549 |   { 550 |    "cell_type": "markdown", 551 |    "metadata": {}, 552 |    "source": [ 553 |     "Unsurprisingly, when we classify the string that originally went into Doc1 using our 'cl' classifier, we still get a positive judgement. \n", 554 |     "\n", 555 |     "Now, what about accuracy? We have been using 'cl' even though it is trained on a REALLY tiny training data set. What does that do to our accuracy? For that, we need to run an accuracy challenge using our test data set. This time, we are using a built-in accuracy protocol which deals with negative values and everything for us. This means we want our result to be as close to 1 as possible. \n", 556 |     "\n", 557 |     "Run/Shift+Enter" 558 |    ] 559 |   }, 560 |   { 561 |    "cell_type": "code", 562 |    "execution_count": null, 563 |    "metadata": {}, 564 |    "outputs": [], 565 |    "source": [ 566 |     "cl.accuracy(test)\n" 567 |    ] 568 |   }, 569 |   { 570 |    "cell_type": "markdown", 571 |    "metadata": {}, 572 |    "source": [ 573 |     "Hmmm. Not perfect.\n", 574 |     "\n", 575 |     "Fortunately, we can add more training data and try again. The code below defines a new training data set and then runs a re-training function called 'update' on our 'cl' classifier. \n", 576 |     "\n", 577 |     "Run/Shift+Enter." 578 |    ] 579 |   }, 580 |   { 581 |    "cell_type": "code", 582 |    "execution_count": null, 583 |    "metadata": {}, 584 |    "outputs": [], 585 |    "source": [ 586 |     "new_data = [('She is my best friend.', 'pos'),\n", 587 |     "            (\"I'm happy to have a new friend.\", 'pos'),\n", 588 |     "            (\"Stay thirsty, my friend.\", 'pos'),\n", 589 |     "            (\"He ain't from around here.\", 'neg')]\n", 590 |     "\n", 591 |     "cl.update(new_data)" 592 |    ] 593 |   }, 594 |   { 595 |    "cell_type": "markdown", 596 |    "metadata": {}, 597 |    "source": [ 598 |     "Now, copy the code we ran before to get the accuracy check. Paste it into the next code block and Run/Shift+Enter it. \n", 599 |     "\n", 600 |     "Not only will this tell us if updating 'cl' with 'new_data' has improved the accuracy, it is also a chance for you to create a code block of your own. Well done, you (I assume). 
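Once you have re-run the accuracy check in the (currently empty) exercise cell below, it can also be revealing to peek inside the classifier itself. `textblob`'s NaiveBayesClassifier can list the word features that most strongly separate 'pos' from 'neg':

```python
# Show the five most informative word features the classifier learned.
cl.show_informative_features(5)
```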
" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "# Copy and paste the accuracy challenge from above into this cell and re-run it to get an updated accuracy score. \n" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "## You can train and test a sentiment analysis tool with more interesting data too..." 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "This is all well and good, but 'cl' is trained on some seriously trivial data. What if we want to use some more interesting data, like the Doc_set that we imported from .csv earlier?\n", 624 | "\n", 625 | "Well, we are in luck! Sort of...\n", 626 | "\n", 627 | "We can definitely train a classifier on Doc_set, but let's just have a closer look at Doc_set before we jump right in and try that. \n" 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "print(Doc_set[45:55])\n", 637 | "print('...')\n", 638 | "print(len(Doc_set))" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "Doc_set is a set of comments that come from 'product reviews'. As we saw earlier, each item has two strings, the first of which is the comment and the second of which is a number 4, 2 or 0 which is written as a string. The second item, the number-written-as-a-string, is the class judgement. These scores may have been manually created, or may be the result of a semi-manual or supervised automation process. Excellent for our purposes, but not ideal because:\n", 646 | "- These scores are strings rather than integers. You can tell because they are enclosed in quotes.\n", 647 | "- These scores range from 0 (negative) to 4 (positive) and also contains 2 (neutral), while the textblob sentiment analysis and classifier functions we have been using return scores from -1 (negative) through 0 (neutral) to 1 (positive). \n", 648 | "\n", 649 | "Well, we could change 4 to 1, 2 to 0 and 0 to -1 with the use of regular expressions (RegEx) if we wanted. But as you will see, this is not strictly necessary. \n", 650 | "\n", 651 | "However, there is another issue. Doc_set has 20,000 items. This is big, but this is actually MUCH smaller than it could be. This is a subset of a 1,000,000+ item data set that you can download for free (see extra resources and reading at the end). The original data set was way too big for Jupyter notebook and was even too big for me to analyse on my laptop. I know because I tried. When you find yourself in a situation like this, you can try: \n", 652 | "- Accessing proper research computing facilities (good for real research, too much for a code demo). \n", 653 | "- Dividing a too big data set into chunks, and train/update a chunk at a time. \n", 654 | "- Processing a too big data set to remove punctuation, stop words, urls, twitter handles, etc. (saving computer power for what matters).\n", 655 | "- Or a combination of these options. \n", 656 | "\n", 657 | "But, you can try training a classifier on the much smaller 'testing_set' if you like. That set has under 5000 entries and so does not max out the computer's memory. \n", 658 | "\n", 659 | "I have provided the code below to load 'testing_set' into a new variable called Doc_set_2. 
Feel free to run the code below, then add more code blocks with processes copied from above. " 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "metadata": {}, 666 | "outputs": [], 667 | "source": [ 668 | "with open('./data/sentiment-analysis/testing_set.csv', newline='') as f: # Import a csv of scored \"product reviews\"\n", 669 | " reader = csv.reader(f)\n", 670 | " Doc_set_2 = list(reader)\n", 671 | "\n", 672 | "print(Doc_set_2[45:55]) # Look at a subset of the imported data" 673 | ] 674 | }, 675 | { 676 | "cell_type": "markdown", 677 | "metadata": {}, 678 | "source": [ 679 | "## Conclusions" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "You can train a classifier on whatever data you want and with whatever categories you want. \n", 687 | "\n", 688 | "Want to train a classifier to recognise sarcasm? Go for it. \n", 689 | "How about recognising lies in political speeches? Good idea. \n", 690 | "How about tweets from bots or from real people? Definitely useful. \n", 691 | "\n", 692 | "The hard part is actually getting the data ready to feed to train your classifier. Depending on what you want to train your classifier to do, you may have to manually tag a whole lotta data. But it is always a good idea to start small. 10 items? 100? What can you do quickly that will give you enough of an idea to see if it is worth investing more time. \n", 693 | "\n", 694 | "Good luck!" 695 | ] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "metadata": {}, 700 | "source": [ 701 | "## Further reading and resources" 702 | ] 703 | }, 704 | { 705 | "cell_type": "markdown", 706 | "metadata": {}, 707 | "source": [ 708 | "Books, tutorials, package recommendations, etc. for Python\n", 709 | "\n", 710 | "- Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper, http://www.nltk.org/book/\n", 711 | "- Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze, https://nlp.stanford.edu/fsnlp/promo/\n", 712 | "- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Dan Jurafsky and James H. 
Martin, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf\n", 713 | "- Deep Learning in Natural Language Processing by Li Deng, Yang Liu, https://lidengsite.wordpress.com/book-chapters/\n", 714 | "- Sentiment Analysis data sets https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124\n", 715 | "\n", 716 | "NLTK options\n", 717 | "- nltk.corpus http://www.nltk.org/howto/corpus.html\n", 718 | "- Data Camp tutorial on sentiment analysis with nltk https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python\n", 719 | "- Vader sentiment analysis script available on github (nltk) https://www.nltk.org/_modules/nltk/sentiment/vader.html\n", 720 | "- TextBlob https://textblob.readthedocs.io/en/dev/\n", 721 | "- Flair, a NLP script available on github https://github.com/flairNLP/flair\n", 722 | "\n", 723 | "spaCy options\n", 724 | "- spaCy https://nlpforhackers.io/complete-guide-to-spacy/\n", 725 | "- Data Quest tutorial on sentiment analysis with spaCy https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/\n", 726 | "\n", 727 | "\n", 728 | "Books and package recommendations for R\n", 729 | "- Quanteda, an R package for text analysis https://quanteda.io/​\n", 730 | "- Text Mining with R, a free online book https://www.tidytextmining.com/​" 731 | ] 732 | }, 733 | { 734 | "cell_type": "markdown", 735 | "metadata": {}, 736 | "source": [ 737 | "
Next section: Extracting text
" 738 | ] 739 | } 740 | ], 741 | "metadata": { 742 | "kernelspec": { 743 | "display_name": "Python 3 (ipykernel)", 744 | "language": "python", 745 | "name": "python3" 746 | }, 747 | "language_info": { 748 | "codemirror_mode": { 749 | "name": "ipython", 750 | "version": 3 751 | }, 752 | "file_extension": ".py", 753 | "mimetype": "text/x-python", 754 | "name": "python", 755 | "nbconvert_exporter": "python", 756 | "pygments_lexer": "ipython3", 757 | "version": "3.11.9" 758 | }, 759 | "toc": { 760 | "base_numbering": 1, 761 | "nav_menu": {}, 762 | "number_sections": true, 763 | "sideBar": true, 764 | "skip_h1_title": true, 765 | "title_cell": "Table of Contents", 766 | "title_sidebar": "Contents", 767 | "toc_cell": true, 768 | "toc_position": {}, 769 | "toc_section_display": true, 770 | "toc_window_display": false 771 | }, 772 | "varInspector": { 773 | "cols": { 774 | "lenName": 16, 775 | "lenType": 16, 776 | "lenVar": 40 777 | }, 778 | "kernels_config": { 779 | "python": { 780 | "delete_cmd_postfix": "", 781 | "delete_cmd_prefix": "del ", 782 | "library": "var_list.py", 783 | "varRefreshCmd": "print(var_dic_list())" 784 | }, 785 | "r": { 786 | "delete_cmd_postfix": ") ", 787 | "delete_cmd_prefix": "rm(", 788 | "library": "var_list.r", 789 | "varRefreshCmd": "cat(var_dic_list()) " 790 | } 791 | }, 792 | "types_to_exclude": [ 793 | "module", 794 | "function", 795 | "builtin_function_or_method", 796 | "instance", 797 | "_Feature" 798 | ], 799 | "window_display": false 800 | } 801 | }, 802 | "nbformat": 4, 803 | "nbformat_minor": 2 804 | } 805 | -------------------------------------------------------------------------------- /code/4-Social-networks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "slideshow": { 14 | "slide_type": "-" 15 | } 16 | }, 17 | "source": [ 18 | "# Text-mining: Named Entity Extraction and Social Network Creation" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Welcome to this UK Data Service *Computational Social Science* training series! \n", 26 | "\n", 27 | "The various *Computational Social Science* training series, all of which guide you through some of the popular and useful computational techniques, tools, methods and concepts that social science research might want to use. For example, this series covers collecting data from websites and social media platorms, working with text data, conducting simulations (agent based modelling), and more. The series includes recorded video webinars, interactive notebooks containing live programming code, reading lists and more.\n", 28 | "\n", 29 | "* To access training materials on our GitHub site: [Training Materials]\n", 30 | "\n", 31 | "* To keep up to date with upcoming and past training events: [Events]\n", 32 | "\n", 33 | "* To get in contact with feedback, ideas or to seek assistance: [Help]\n", 34 | "\n", 35 | "Dr J. Kasmire
\n", 36 | "UK Data Service
\n", 37 | "University of Manchester
" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "toc": true 44 | }, 45 | "source": [ 46 | "

Table of Contents

\n", 47 | "
" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "\n", 55 | "There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars, if unsure hover your mouse over the buttons). " 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Introduction" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "We already saw how to add part-of-speech tags and named entity labels to a corpus, but now we will explore a practical applications for those POS-tags and entity labels - extracted the names of people and creating a social network based on which people are mentioned in the same document. \n", 70 | "\n", 71 | "There are, of course, many other very useful and practical applications for POS-tags and/or entity labels. However, as always, readers should interpret this notebook as being a demonstration of a popular option rather than an exhaustive or comprehensive guide to all possibilities." 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "## Acquire and prepare a set of documents with named entities." 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "As always, let's start by importing and downloading some useful packages. Many of these will be familiar to you if you have worked through the 'Basic Extraction' notebook, but there are always some new tools to explore. \n", 86 | "\n", 87 | "Run/Shift+Enter." 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "import nltk # get nltk \n", 97 | "from nltk import word_tokenize, pos_tag, ne_chunk # import some of our old favourte functions\n", 98 | "from nltk import Tree # and import some new functions\n", 99 | "nltk.download('punkt')\n", 100 | "nltk.download('averaged_perceptron_tagger')\n", 101 | "nltk.download('maxent_ne_chunker')\n", 102 | "nltk.download('words')\n", 103 | "\n", 104 | "from itertools import chain, tee\n", 105 | "from operator import itemgetter\n", 106 | "\n", 107 | "!pip install networkx\n", 108 | "import networkx as nx # Just notice this line for now... We will refer to it again later.\n", 109 | "from networkx.algorithms import community\n", 110 | "\n", 111 | "import matplotlib.pyplot as plt" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "social_data = ['Archibald walked through Manchester with Beryl.', \n", 121 | " 'Tariq saw Beryl when she was playing tennis.', \n", 122 | " 'Archibald shares a house with Beryl and Cerys.',\n", 123 | " 'Cerys works with both Tariq and Edith.', \n", 124 | " 'Edith drives past Archibald and Denise on her morning commute.',\n", 125 | " 'Fadwa listens to podcasts while running.', \n", 126 | " 'Guo-feng and Hita often drive to the Welsh coast at weekends.',\n", 127 | " 'Icarus shops at the same supermarket as Janyu and Edith.', \n", 128 | " 'Kelsey and Edith both used to live in London.',\n", 129 | " 'Laia and Icarus are on the same bowling team.', \n", 130 | " 'Archibald and Kelsey are both keen gardeners.',\n", 131 | " 'Laia and Montserrat are both Catalonians. 
'\n", 132 | " ]" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "Named Entity chunkers require data that is:\n", 140 | "* word tokenised and \n", 141 | "* POS-tagged. \n", 142 | "\n", 143 | "Named Entity chunkers return a list of nested trees. " 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "tagged_chunked_data = []\n", 153 | "for item in social_data:\n", 154 | " tokens = word_tokenize(item)\n", 155 | " tags = nltk.pos_tag(tokens)\n", 156 | " chunks = ne_chunk(tags)\n", 157 | " tagged_chunked_data.append(chunks)\n", 158 | " \n", 159 | "print(tagged_chunked_data) # print everything, since this is a small enough list" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "Looking at the results, we can see that each item in the list is a tree because each sentence starts with a \"Tree('S')\" indicating that the highest level of tree is a Sentence Tree. \n", 167 | "\n", 168 | "But Sentence Trees can have sub-trees. Each sub-tree also starts with \"Tree('TREE_TYPE')\" and we can see that there are \"Tree('PERSON')\", \"Tree('ORGANISATION')\" and \"Tree('GPE')\". Unfortunately, the ne_chunker is not perfect. 'Edith' is listed as a 'GPE' in one place but a 'PERSON' in another. Likewise, 'Archibald' is both a 'PERSON' and an 'ORGANISATION', while 'Guo-feng' and 'Montserrat' are not identified as sub-trees at all. You can probably find more mis-classifications. \n", 169 | "\n", 170 | "Let's take a closer look at those sentences. Run/Shift+Enter." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "print(tagged_chunked_data[0])\n" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Feel free to change the number in the above code block and re-run it to look at other sentences. \n", 187 | "\n", 188 | "Ultimately, a NER chunker is a classifier and can be trained on custom data as we have already seen how to do (check the classifier jupyter notebook in the same folder as this one!). \n", 189 | "\n", 190 | "Feel free to try training your own NER chunker classifier. You'll need to put social_data through a word_tokenisation process, then use its output to create a training data set. Then train a classifier on your training data set. \n", 191 | "\n", 192 | "But for now, we carry on!" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "## Extract the desired chunks" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "Now that we have some reasonable chunks all chunked up nicely, we want to extract the desired chunks so that they can become the nodes in our network. \n", 207 | "\n", 208 | "Run/Shift+Enter!" 
209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "extracted_persons = []\n", 218 | "for tagged_tree in tagged_chunked_data:\n", 219 | " people = []\n", 220 | " for leaf in tagged_tree.leaves():\n", 221 | " if 'NNP' in leaf[1]:\n", 222 | " people.append(leaf[0]) \n", 223 | " extracted_persons.append(sorted(people))\n", 224 | " \n", 225 | "print(extracted_persons)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "Have a look at the results of the code above. It is a list of lists. Each of the sub-lists contains all the proper nouns that occur in a given sentence. For example, the first sentence was 'Archibald walked through Manchester with Beryl.' and the first sub-list in our extracted_persons list contains 'Archibald', 'Beryl', 'Manchester'. Seems about right. \n", 233 | "\n", 234 | "Well, maybe not perfect. Perhaps we have been a bit too generous... Extracted_persons has accurately extracted all the people's names, but it has also extracted place names and proper noun categories too. 'Manchester', 'Welsh', 'London' and 'Catalonians' are extracted too. This is because the above code looks for chunks that have POS-tags indicating they are proper nouns (NNP) and extracts those, rather than strictly looking for the names of people. \n", 235 | "\n", 236 | "We could try to use the Named Entity Recognition labels, extracting only those labelled as 'PERSON'... But we saw that those are not working reliably for our data set. As an alternative to training our own NER chunker, we could just manually review the list and remove the place or category names.\n", 237 | "\n", 238 | "But... maybe we are happy to leave the place names and proper noun categories too. After all, we can infer a kind of relationship between people and places or categories. When Archibald and Beryl go walking through Manchester, they have a relationship with Manchester. If someone else also has a relationship with Manchester, then it is not exactly wrong to suggest that the third person has a somewhat distant relationship with Archibald and Beryl by virtue of their shared link to Manchester. \n", 239 | "\n", 240 | "So, as an executive decision, I am going to leave all the proper nouns in. They will all become nodes in our network. But to do that, we need to find all of the unique entries, which I find is helpful to view in alphabetical order. \n", 241 | "\n", 242 | "Run/Shift+Enter, duuuuuuuuuuuuude!" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "unique_people = sorted(list(set(chain(*extracted_persons))))\n", 252 | "print(unique_people)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## Identify which extracted proper nouns are named in the same document" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "Now, the alphabetised list of all unique proper nouns will become the nodes list when we go to create a network. But we also need an edge list telling the network which nodes are connected. In practical terms, an edge list is a list of tuples, each containing two nodes. 
\n", 267 | "\n", 268 | "To do this, we can use the itertools.permutations which looks a list and creates a new list of tuples with all possible permutations of a fixed length that can be made from the original list. To be specific, the code below looks at each of the sub-lists of extracted_persons and creates 2 item tuples from all possible combinations of the items in the sub-list. \n", 269 | "\n", 270 | "Run/Shift+Enter" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": null, 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "import itertools\n", 280 | "\n", 281 | "co_occurring_pairs = []\n", 282 | "for people in extracted_persons:\n", 283 | " for each_permutation in itertools.permutations(people, 2):\n", 284 | " co_occurring_pairs.append(each_permutation)\n", 285 | "\n", 286 | "print(co_occurring_pairs)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "Let's take a closer look. \n", 294 | "\n", 295 | "The first sub-list in extracted_persons was:\n", 296 | "- 'Archibald', 'Beryl', 'Manchester'\n", 297 | "\n", 298 | "At the start of our co_occurring_pairs, we have:\n", 299 | "- ('Archibald', 'Beryl'),\n", 300 | "- ('Archibald', 'Manchester'),\n", 301 | "- ('Beryl', 'Archibald'),\n", 302 | "- ('Beryl', 'Manchester'),\n", 303 | "- ('Manchester', 'Archibald'),\n", 304 | "- ('Manchester', 'Beryl')\n", 305 | "\n", 306 | "This means that we have an edge between 'Archibald' and 'Beryl', but also another edge between 'Beryl' and 'Archibald'.\n", 307 | "\n", 308 | "There are also no edges involving 'Fadwa' as she only appears in one sentence that has no other proper nouns. Thus, we will have at least one node with no edges. \n", 309 | "\n", 310 | "These two points may or may not be a problem for you, depending on how you want your network to function. For example, you may want some or all of your links to be directed, meaning that the link only goes one way. \n", 311 | "In our network, this might be reasonable for sentences like 'Tariq saw Beryl when she was playing tennis.' since we don't know that Beryl also saw Tariq. Directed links like this would be especially important for networks based on scientific citations or other links that are clearly one-way.\n", 312 | "\n", 313 | "Likewise, if you want your network to be weighted, you may want to add a third value to the tuples with how strong you want the link to be. When you go to create the network, you will need to sum up edges so that multiple instances of an edge between the same two nodes has a higher weight. \n", 314 | "\n", 315 | "To create a weighted edge list, run the code below. \n", 316 | "\n", 317 | "Run/Shift+Enter" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "co_occurring_pairs_weighted = []\n", 327 | "\n", 328 | "for pair in co_occurring_pairs:\n", 329 | " x = pair[0], pair[1], 1\n", 330 | " co_occurring_pairs_weighted.append(x)\n", 331 | " \n", 332 | "print(co_occurring_pairs_weighted)\n", 333 | " " 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "## Create a network graph and add the nodes" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "First, we initialise an empty network graph object. The 'nx' part of 'nx.Graph()' relies on code at the start of the notebook that imported networkx as nx. 
If you prefer, you can replace 'nx' with 'networkx'. \n", 348 | "\n", 349 | "Run/Shift+Enter" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": {}, 356 | "outputs": [], 357 | "source": [ 358 | "social_network = nx.Graph() # Initialize an empty networkx graph object called 'social_network'" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "Now, we need to start filling up our empty graph object with details like the nodes list. \n", 366 | "\n", 367 | "Run/Shift+Enter" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": {}, 374 | "outputs": [], 375 | "source": [ 376 | "social_network.add_nodes_from(unique_people) # Add nodes to social_network from our extracted 'unique_people' list" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "Hmmm. Nothing happened.\n", 384 | "\n", 385 | "Well, that is not true. Something did happen, but we have to call extra functions to see what happened. " 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "social_network.nodes # Use a graph object functions to see the nodes" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "## Add edges to the network graph" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "Assuming that we don't want a directed or weighted graph, we can add edges quite simply with code that is almost identical to the code we used to add nodes. \n", 409 | "\n", 410 | "This time, there is another line of code that calls on a different, but obviously similar, function to look at the edges. \n", 411 | "\n", 412 | "Run/Shift+Enter" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "social_network.add_edges_from(co_occurring_pairs) # Add edges to social_network from our co-occurrence tuples\n", 422 | "social_network.edges # Another quick look, this time at the just-imported edges" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "Not only does this list all of the edges in a helpful way (alphabetically by the first node) it is clear that there is only one edge between any two pairs of nodes. \n", 430 | "\n", 431 | "There is an Archibald-Beryl edge, but no Beryl-Archibald edge. Good to know, eh?\n", 432 | "\n", 433 | "But what if we want a weighted graph?\n", 434 | "\n", 435 | "Well, we need to run a more complicated code that checks if an edge exists and then either creates it or adds weight to it, as appropriate. To keep our weighted and unweighted graphs separate, we will also create a new empty graph called social_network_weighted, add nodes, and then add edges by checking to see if one already exists first. 
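The notebook's version of that check-then-add loop is in the next code cell. For comparison only, here is a more compact sketch that builds the same kind of weighted, undirected graph by counting each unordered pair once with itertools.combinations and collections.Counter. Note that its weights count each sentence once, whereas the permutation-based loop below counts both orderings, so the numbers will differ by a factor of two.

```python
# Alternative sketch: a weighted, undirected co-occurrence graph from pair counts.
from collections import Counter
from itertools import combinations
import networkx as nx

pair_counts = Counter()
for people in extracted_persons:
    # combinations() yields each unordered pair once, so (A, B) and (B, A) are not double-counted
    pair_counts.update(combinations(sorted(set(people)), 2))

alt_weighted = nx.Graph()
alt_weighted.add_nodes_from(unique_people)
alt_weighted.add_weighted_edges_from((a, b, count) for (a, b), count in pair_counts.items())

print(alt_weighted.edges(data=True))
```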
\n", 436 | "\n", 437 | "Run/Shift+Enter" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "social_network_weighted = nx.Graph()\n", 447 | "social_network_weighted.add_nodes_from(unique_people) \n", 448 | "\n", 449 | "for edge_pair in co_occurring_pairs_weighted:\n", 450 | " if social_network_weighted.has_edge(edge_pair[0], edge_pair[1]):\n", 451 | " # we added this one before, just increase the weight by one\n", 452 | " w = int(edge_pair[2])\n", 453 | " social_network_weighted[edge_pair[0]][edge_pair[1]]['weight'] += edge_pair[2]\n", 454 | " else:\n", 455 | " # new edge. add with weight=1\n", 456 | " social_network_weighted.add_edge(edge_pair[0], edge_pair[1], weight = edge_pair[2])\n", 457 | " \n", 458 | "social_network_weighted.edges" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "## Have a look at graph info" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "First, some basic info about our two graphs using the nx.info function. \n", 473 | "\n", 474 | "Run/Shift+Enter" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "print(social_network) # the nx.info prints some basics about social_network\n", 484 | "print('...') # nx.info doesn't have everything you might want...\n", 485 | "\n", 486 | "print(social_network_weighted) " 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "Of course, you may want some additional info that is not included in the basics. \n", 494 | "\n", 495 | "Run/Shift+Enter" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": null, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "print(\"Network density:\", nx.density(social_network)) # but extra info is easy to get.\n", 505 | "print(\"Network density:\", nx.density(social_network_weighted)) " 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "## Draw the graph" 513 | ] 514 | }, 515 | { 516 | "cell_type": "markdown", 517 | "metadata": {}, 518 | "source": [ 519 | "Now, we need different visualisations for our different graphs. The two code blocks below show various ways that you can change the graph visualisation (layout, colour, node size, etc. )\n", 520 | "\n", 521 | "Run/Shift+Ente in the next 2 blocks. 
" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": { 528 | "scrolled": true 529 | }, 530 | "outputs": [], 531 | "source": [ 532 | "social_network_positions = nx.circular_layout(social_network) # Define positions for nodes according to a circular layout\n", 533 | "\n", 534 | "nx.draw_networkx_nodes(social_network, social_network_positions, # draw nodes according to position, size\n", 535 | " node_size=700) \n", 536 | "nx.draw_networkx_edges(social_network, social_network_positions, # draw edges according to position, line width \n", 537 | " width=2) \n", 538 | "nx.draw_networkx_labels(social_network, social_network_positions, # draw labels according to position, font choices\n", 539 | " font_size=10, font_family='sans-serif')\n", 540 | "plt.show() # show the network as drawn" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": null, 546 | "metadata": {}, 547 | "outputs": [], 548 | "source": [ 549 | "from pylab import rcParams\n", 550 | "rcParams['figure.figsize'] = 10, 10\n", 551 | "\n", 552 | "weighted_pos = nx.kamada_kawai_layout(social_network_weighted) # Define positions for force directed node layout\n", 553 | "\n", 554 | "elarge = [(u, v) for (u, v, d) in social_network_weighted.edges(data=True) if d['weight'] > 2] #define a 'heavy edge' style\n", 555 | "esmall = [(u, v) for (u, v, d) in social_network_weighted.edges(data=True) if d['weight'] <= 2] # and a 'light edge' style\n", 556 | "# nodes\n", 557 | "nx.draw_networkx_nodes(social_network_weighted, weighted_pos, node_size=400) # draw the nodes\n", 558 | "\n", 559 | "# edges\n", 560 | "nx.draw_networkx_edges(social_network_weighted, weighted_pos, edgelist=elarge, #draw the heavy edges\n", 561 | " width=3)\n", 562 | "nx.draw_networkx_edges(social_network_weighted, weighted_pos, edgelist=esmall, #draw the light edges \n", 563 | " width=1, alpha=0.5, edge_color='b', style='dashed')\n", 564 | "\n", 565 | "# labels\n", 566 | "nx.draw_networkx_labels(social_network_weighted, weighted_pos, font_size=15, font_family='serif')\n", 567 | "\n", 568 | "plt.axis('off')\n", 569 | "figure = plt.show()" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "## Conclusions" 577 | ] 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": {}, 582 | "source": [ 583 | "Hopefully, this will give you some ideas about what to do with the NLP processes that you have put your corpus through. There is clear value there, related to which things occur together. Different kinds of processing might help you get directed graphs (although that would take some clever classification relating to subjects, objects, etc. ). \n", 584 | "\n", 585 | "Please do feel free to start back at the beginning, adding more sentences with the same names or even with new names. \n", 586 | "\n", 587 | "As before, these exercises and this sample code should highlight to you that you need to think about:\n", 588 | "- your research questions and what you want to show, explore or understand, \n", 589 | "- your data, texts, corpus, or other research materials to analyse etc. \n", 590 | "- how your processes are related to your reserch questions, and \n", 591 | "- how your processes and data can be made available and reproducible. 
" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": {}, 597 | "source": [ 598 | "## Further reading" 599 | ] 600 | }, 601 | { 602 | "cell_type": "markdown", 603 | "metadata": {}, 604 | "source": [ 605 | "Books, tutorials, package recommendations, etc. for Python\n", 606 | "\n", 607 | "- Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper, http://www.nltk.org/book/\n", 608 | "- Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze, https://nlp.stanford.edu/fsnlp/promo/\n", 609 | "- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf\n", 610 | "- Deep Learning in Natural Language Processing by Li Deng, Yang Liu, https://lidengsite.wordpress.com/book-chapters/\n", 611 | "- Sentiment Analysis data sets https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124\n", 612 | "\n", 613 | "NLTK options\n", 614 | "- nltk.corpus http://www.nltk.org/howto/corpus.html\n", 615 | "- Data Camp tutorial on sentiment analysis with nltk https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python\n", 616 | "- Vader sentiment analysis script available on github (nltk) https://www.nltk.org/_modules/nltk/sentiment/vader.html\n", 617 | "- TextBlob https://textblob.readthedocs.io/en/dev/\n", 618 | "- Flair, a NLP script available on github https://github.com/flairNLP/flair\n", 619 | "\n", 620 | "networkx\n", 621 | "- package details https://networkx.github.io/documentation/stable/index.html\n", 622 | "- info about drawing graphs, including links to dedicated graph visualisation software https://networkx.github.io/documentation/stable/reference/drawing.html\n", 623 | "- drawing examples and specific tutorials https://networkx.github.io/documentation/latest/auto_examples/index.html\n", 624 | "- All the graph measures you can ask for https://networkx.github.io/documentation/stable/reference/algorithms/index.html\n", 625 | "\n", 626 | "\n", 627 | "Books and package recommendations for R\n", 628 | "- Quanteda, an R package for text analysis https://quanteda.io/​\n", 629 | "- Text Mining with R, a free online book https://www.tidytextmining.com/​" 630 | ] 631 | } 632 | ], 633 | "metadata": { 634 | "kernelspec": { 635 | "display_name": "Python 3 (ipykernel)", 636 | "language": "python", 637 | "name": "python3" 638 | }, 639 | "language_info": { 640 | "codemirror_mode": { 641 | "name": "ipython", 642 | "version": 3 643 | }, 644 | "file_extension": ".py", 645 | "mimetype": "text/x-python", 646 | "name": "python", 647 | "nbconvert_exporter": "python", 648 | "pygments_lexer": "ipython3", 649 | "version": "3.11.9" 650 | }, 651 | "toc": { 652 | "base_numbering": 1, 653 | "nav_menu": {}, 654 | "number_sections": true, 655 | "sideBar": true, 656 | "skip_h1_title": true, 657 | "title_cell": "Table of Contents", 658 | "title_sidebar": "Contents", 659 | "toc_cell": true, 660 | "toc_position": {}, 661 | "toc_section_display": true, 662 | "toc_window_display": false 663 | }, 664 | "varInspector": { 665 | "cols": { 666 | "lenName": 16, 667 | "lenType": 16, 668 | "lenVar": 40 669 | }, 670 | "kernels_config": { 671 | "python": { 672 | "delete_cmd_postfix": "", 673 | "delete_cmd_prefix": "del ", 674 | "library": "var_list.py", 675 | "varRefreshCmd": "print(var_dic_list())" 676 | }, 677 | "r": { 678 | 
"delete_cmd_postfix": ") ", 679 | "delete_cmd_prefix": "rm(", 680 | "library": "var_list.r", 681 | "varRefreshCmd": "cat(var_dic_list()) " 682 | } 683 | }, 684 | "types_to_exclude": [ 685 | "module", 686 | "function", 687 | "builtin_function_or_method", 688 | "instance", 689 | "_Feature" 690 | ], 691 | "window_display": false 692 | } 693 | }, 694 | "nbformat": 4, 695 | "nbformat_minor": 2 696 | } 697 | -------------------------------------------------------------------------------- /code/README.md: -------------------------------------------------------------------------------- 1 | # Interactive coding materials 2 | 3 | We have developed a number of Jupyter notebooks containing a mix of Python code, narrative and output. 4 | 5 | If you would like to run and/or edit the code without installing any software on your machine, click on the button below. This launches a **Binder** service allowing you to interact with the code through your web browser - as it is temporary, you will lose your work when you log out and you will be booted out if you don’t do anything for a long time. 6 | 7 | Once Binder has been launched, click on the notebook you want to run. (*Don't worry if takes up to a minute to launch*) 8 | 9 | ### Launch Text-mining for Social Science Research as a Binder service: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/UKDataServiceOpen/text-mining/HEAD)
10 | 11 | Alternatively, you can download the notebook files and run them on your own machine. See our guidance on installing Python and Jupyter [here](https://github.com/UKDataServiceOpen/computational-social-science/blob/master/installation.md). 12 | 13 | 14 | **NOTE: If you encounter any errors when running the code in Binder (or when running this in Jupyter notebooks) please either fork the code and let us know or contact us via: louise.capener@manchester.ac.uk and I'll sort it!** 15 | -------------------------------------------------------------------------------- /code/data/sample_text.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/code/data/sample_text.txt -------------------------------------------------------------------------------- /code/data/sentiment-analysis/testing_set.csv: -------------------------------------------------------------------------------- 1 | "@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.",4 2 | Reading my kindle2... Love it... Lee childs is good read.,4 3 | "Ok, first assesment of the #kindle2 ...it fucking rocks!!!",4 4 | @kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :),4 5 | @mikefish Fair enough. But i have the Kindle2 and I think it's perfect :),4 6 | @richardebaker no. it is too big. I'm quite happy with the Kindle2.,4 7 | Fuck this economy. I hate aig and their non loan given asses.,0 8 | Jquery is my new best friend.,4 9 | Loves twitter,4 10 | how can you not love Obama? he makes jokes about himself.,4 11 | Check this video out -- President Obama at the White House Correspondents' Dinner http://bit.ly/IMXUM,2 12 | "@Karoli I firmly believe that Obama/Pelosi have ZERO desire to be civil. It's a charade and a slogan, but they want to destroy conservatism",0 13 | "House Correspondents dinner was last night whoopi, barbara & sherri went, Obama got a standing ovation",4 14 | Watchin Espn..Jus seen this new Nike Commerical with a Puppet Lebron..sh*t was hilarious...LMAO!!!,4 15 | "dear nike, stop with the flywire. that shit is a waste of science. and ugly. love, @vincentx24x",0 16 | "#lebron best athlete of our generation, if not all time (basketball related) I don't want to get into inter-sport debates about __1/2",4 17 | I was talking to this guy last night and he was telling me that he is a die hard Spurs fan. He also told me that he hates LeBron James.,0 18 | i love lebron. http://bit.ly/PdHur,4 19 | "@ludajuice Lebron is a Beast, but I'm still cheering 4 the A..til the end.",0 20 | @Pmillzz lebron IS THE BOSS,4 21 | "@sketchbug Lebron is a hometown hero to me, lol I love the Lakers but let's go Cavs, lol",4 22 | lebron and zydrunas are such an awesome duo,4 23 | @wordwhizkid Lebron is a beast... nobody in the NBA comes even close.,4 24 | downloading apps for my iphone! So much fun :-) There literally is an app for just about anything.,4 25 | "good news, just had a call from the Visa office, saying everything is fine.....what a relief! I am sick of scams out there! Stealing!",4 26 | http://twurl.nl/epkr4b - awesome come back from @biz (via @fredwilson),4 27 | In montreal for a long weekend of R&R. Much needed.,4 28 | Booz Allen Hamilton has a bad ass homegrown social collaboration platform. Way cool! 
#ttiv,4 29 | [#MLUC09] Customer Innovation Award Winner: Booz Allen Hamilton -- http://ping.fm/c2hPP,4 30 | "@SoChi2 I current use the Nikon D90 and love it, but not as much as the Canon 40D/50D. I chose the D90 for the video feature. My mistake.",4 31 | need suggestions for a good IR filter for my canon 40D ... got some? pls DM,2 32 | @surfit: I just checked my google for my business- blip shows up as the second entry! Huh. Is that a good or ba... ? http://blip.fm/~6emhv,2 33 | "@phyreman9 Google is always a good place to look. Should've mentioned I worked on the Mustang w/ my Dad, @KimbleT.",4 34 | Played with an android google phone. The slide out screen scares me I would break that fucker so fast. Still prefer my iPhone.,0 35 | US planning to resume the military tribunals at Guantanamo Bay... only this time those on trial will be AIG execs and Chrysler debt holders,0 36 | omg so bored & my tattoooos are so itchy!! help! aha =),0 37 | I'm itchy and miserable!,0 38 | "@sekseemess no. I'm not itchy for now. Maybe later, lol.",0 39 | RT @jessverr I love the nerdy Stanford human biology videos - makes me miss school. http://bit.ly/13t7NR,4 40 | "@spinuzzi: Has been a bit crazy, with steep learning curve, but LyX is really good for long docs. For anything shorter, it would be insane.",4 41 | "I'm listening to ""P.Y.T"" by Danny Gokey <3 <3 <3 Aww, he's so amazing. I <3 him so much :)",4 42 | is going to sleep then on a bike ride:],4 43 | cant sleep... my tooth is aching.,0 44 | "Blah, blah, blah same old same old. No plans today, going back to sleep I guess.",0 45 | "glad i didnt do Bay to Breakers today, it's 1000 freaking degrees in San Francisco wtf",0 46 | is in San Francisco at Bay to Breakers.,2 47 | just landed at San Francisco,2 48 | San Francisco today. Any suggestions?,2 49 | ?Obama Administration Must Stop Bonuses to AIG Ponzi Schemers ... http://bit.ly/2CUIg,0 50 | started to think that Citi is in really deep s&^t. Are they gonna survive the turmoil or are they gonna be the next AIG?,0 51 | ShaunWoo hate'n on AiG,0 52 | @YarnThing you will not regret going to see Star Trek. It was AWESOME!,4 53 | On my way to see Star Trek @ The Esquire.,2 54 | Going to see star trek soon with my dad.,2 55 | annoying new trend on the internets: people picking apart michael lewis and malcolm gladwell. nobody wants to read that.,0 56 | Bill Simmons in conversation with Malcolm Gladwell http://bit.ly/j9o50,2 57 | Highly recommend: http://tinyurl.com/HowDavidBeatsGoliath by Malcolm Gladwell,4 58 | Blink by malcolm gladwell amazing book and The tipping point!,4 59 | Malcolm Gladwell might be my new man crush,4 60 | omg. The commercials alone on ESPN are going to drive me nuts.,0 61 | @robmalon Playing with Twitter API sounds fun. May need to take a class or find a new friend who like to generate results with API code.,4 62 | playing with cURL and the Twitter API,2 63 | Hello Twitter API ;),4 64 | playing with Java and the Twitter API,2 65 | @morind45 Because the twitter api is slow and most client's aren't good.,0 66 | yahoo answers can be a butt sometimes,0 67 | is scrapbooking with Nic =D,4 68 | RT @mashable: Five Things Wolfram Alpha Does Better (And Vastly Different) Than Google - http://bit.ly/6nSnR,4 69 | just changed my default pic to a Nike basketball cause bball is awesome!!!!!,4 70 | "Nike owns NBA Playoffs ads w/ LeBron, Kobe, Carmelo? 
http://ow.ly/7Uiy #Adidas #Billups #Howard #Marketing #Branding",2 71 | "'Next time, I'll call myself Nike'",2 72 | New blog post: Nike SB Dunk Low Premium 'White Gum' http://tr.im/lOtT,2 73 | RT @SmartChickPDX: Was just told that Nike layoffs started today :-(,0 74 | Back when I worked for Nike we had one fav word : JUST DO IT! :),4 75 | "By the way, I'm totally inspired by this freaky Nike commercial: http://snurl.com/icgj9",4 76 | "giving weka an app engine interface, using the bird strike data for the tests, the logo is a given.",2 77 | "Brand New Canon EOS 50D 15MP DSLR Camera Canon 17-85mm IS Lens ...: Web Technology Thread, Brand New Canon EOS 5.. http://u.mavrev.com/5a3t",2 78 | Class... The 50d is supposed to come today :),4 79 | needs someone to explain lambda calculus to him! :(,0 80 | Took the Graduate Field Exam for Computer Science today. Nothing makes you feel like more of an idiot than lambda calculus.,0 81 | SHOUT OUTS TO ALL EAST PALO ALTO FOR BEING IN THE BUILDIN KARIZMAKAZE 50CAL GTA! ALSO THANKS TO PROFITS OF DOOM UNIVERSAL HEMPZ CRACKA......,4 82 | "@legalgeekery Yeahhhhhhhhh, I wouldn't really have lived in East Palo Alto if I could have avoided it. I guess it's only for the summer.",0 83 | @accannis @edog1203 Great Stanford course. Thanks for making it available to the public! Really helpful and informative for starting off!,4 84 | "NVIDIA Names Stanford's Bill Dally Chief Scientist, VP Of Research http://bit.ly/Fvvg9",2 85 | New blog post: Harvard Versus Stanford - Who Wins? http://bit.ly/MCoCo,2 86 | @ work til 6pm... lets go lakers!!!,4 87 | Damn you North Korea. http://bit.ly/KtMeQ,0 88 | Can we just go ahead and blow North Korea off the map already?,0 89 | "North Korea, please cease this douchebaggery. China doesn't even like you anymore. http://bit.ly/NeHSl",0 90 | Why the hell is Pelosi in freakin China? and on whose dime?,0 91 | "Are YOU burning more cash $$$ than Chrysler and GM? Stop the financial tsunami. Where ""bailout"" means taking a handout!",0 92 | insects have infected my spinach plant :(,0 93 | wish i could catch every mosquito in the world n burn em slowly.they been bitin the shit outta me 2day.mosquitos are the assholes of insects,0 94 | "just got back from church, and I totally hate insects.",0 95 | Just got mcdonalds goddam those eggs make me sick. O yeah Laker up date go lakers. Not much of an update? Well it's true so suck it,0 96 | omgg i ohhdee want mcdonalds damn i wonder if its open lol =],4 97 | History exam studying ugh,0 98 | "I hate revision, it's so boring! I am totally unprepared for my exam tomorrow :( Things are not looking good...",0 99 | "Higher physics exam tommorow, not lookin forward to it much :(",0 100 | "It's a bank holiday, yet I'm only out of work now. Exam season sucks:(",0 101 | Cheney and Bush are the real culprits - http://fwix.com/article/939496,0 102 | Life?s a bitch? and so is Dick Cheney. #p2 #bipart #tlot #tcot #hhrs #GOP #DNC http://is.gd/DjyQ,0 103 | "Dick Cheney's dishonest speech about torture, terror, and Obama. -Fred Kaplan Slate. http://is.gd/DiHg",0 104 | """The Republican party is a bunch of anti-abortion zealots who couldn't draw flies to a dump."" -- Neal Boortz (just now, on the radio)",0 105 | is Twitter's connections API broken? Some tweets didn't make it to Twitter...,0 106 | "i srsly hate the stupid twitter API timeout thing, soooo annoying!!!!! :(",0 107 | "@psychemedia I really liked @kswedberg's ""Learning jQuery"" book. 
http://bit.ly/pg0lT is worth a look too",4 108 | jQuery UI 1.6 Book Review - http://cfbloggers.org/?c=30631,2 109 | "Very Interesting Ad from Adobe by Goodby, Silverstein & Partners - YouTube - Adobe CS4: Le Sens Propre http://bit.ly/VprpT",4 110 | Goodby Silverstein agency new site! http://www.goodbysilverstein.com/ Great!,4 111 | "RT @designplay Goodby, Silverstein's new site: http://www.goodbysilverstein.com/ I enjoy it. *nice find!*",4 112 | The ever amazing Psyop and Goodby Silverstein & Partners for HP! http://bit.ly/g2rU8 Have to go play with After Effects now!,4 113 | top ten most watched on Viral-Video Chart. Love the nike #mostvaluablepuppets campaign from Wieden & Kennedy http://bit.ly/nR1n9,4 114 | zomg!!! I have a G2!!!!!!!,4 115 | Ok so lots of buzz from IO2009 but how lucky are they - a Free G2!! http://is.gd/Hyzl,4 116 | just got a free G2 android at google i/o!!!,4 117 | Guess I'll be retiring my G1 and start using my developer G2 woot #googleio,4 118 | At GWT fireside chat @googleio,2 119 | I am happy for Philip being at GoogleIO today,4 120 | Lakers played great! Cannot wait for Thursday night Lakers vs. ???,4 121 | "Hi there, does anyone have a great source for advice on viral marketing?... http://link.gs/YtZ8",2 122 | Judd Apatow creates fake sitcom on NBC.com to market his new movie... viral marketing at its best. http://is.gd/K0yK,4 123 | "Here's A case study on how to use viral marketing to add over 10,000 people to your list http://snipr.com/i50oz",2 124 | VIRAL MARKETING FAIL. This Acia Pills brand oughta get shut down for hacking into people's messenger's. i get 5-6 msgs in a day! Arrrgh!,0 125 | watching Night at The Museum . Lmao,4 126 | i loved night at the museum!!!,4 127 | going to see the new night at the museum movie with my family oh boy a three year old in the movies fuin,2 128 | just got back from the movies. went to see the new night at the museum with rachel. it was good,4 129 | Just saw the new Night at the Museum movie...it was...okay...lol 7\10,2 130 | Going to see night at the museum 2 with tall boy,2 131 | @shannyoday I will take you on a date to see night at the museum 2 whenever you want...it looks soooooo good,4 132 | no watching The Night At The Museum. Getting Really Good,4 133 | "Night at the Museum, Wolverine and junk food - perfect monday!",4 134 | saw night at the museum 2 last night.. pretty crazy movie.. but the cast was awesome so it was well worth it. Robin Williams forever!,4 135 | I saw Night at the Museum: Battle of the Swithsonian today. It was okay. Your typical [kids] Ben Stiller movie.,2 136 | Taking Katie to see Night at the Museum. (she picked it),2 137 | Night at the Museum tonite instead of UP. :( oh well. that 4 yr old better enjoy it. LOL,0 138 | GM says expects announcment on sale of Hummer soon - Reuters: WDSUGM says expects announcment on sale of Hummer .. http://bit.ly/4E1Fv,2 139 | It's unfortunate that after the Stimulus plan was put in place twice to help GM on the back of the American people has led to the inevitable,0 140 | Tell me again why we are giving more $$ to GM?? We should use that $ for all the programs that support the unemployed.,0 141 | @jdreiss oh yes but if GM dies it will only be worth more boo hahaha,0 142 | Time Warner cable is down again 3rd time since Memorial Day bummer!,0 143 | "I would rather pay reasonable yearly taxes for ""free"" fast internet, than get gouged by Time Warner for a slow connection.",0 144 | NOOOOOOO my DVR just died and I was only half way through the EA presser. 
Hate you Time Warner,0 145 | F*ck Time Warner Cable!!! You f*cking suck balls!!! I have a $700 HD tv & my damn HD channels hardly ever come in. Bullshit!!,0 146 | time warner has the worse customer service ever. I will never use them again,0 147 | Time warner is the devil. Worst possible time for the Internet to go out.,0 148 | Fuck no internet damn time warner!,0 149 | time warner really picks the worst time to not work. all i want to do is get to mtv.com so i can watch the hills. wtfffff.,0 150 | I hate Time Warner! Soooo wish I had Vios. Cant watch the fricken Mets game w/o buffering. I feel like im watching free internet porn.,0 151 | Ahh...got rid of stupid time warner today & now taking a nap while the roomies cook for me. Pretty good end for a monday :),0 152 | Time Warner's HD line up is crap.,0 153 | is being fucked by time warner cable. didnt know modems could explode. and Susan Boyle sucks too!,0 154 | Time Warner Cable Pulls the Plug on 'The Girlfriend Experience' - (www.tinyurl.com/m595fk),2 155 | Time Warner Cable slogan: Where calling it a day at 2pm Happens.,0 156 | "Rocawear Heads to China, Building 300 Stores - http://tinyurl.com/nofet3",2 157 | "Climate focus turns to Beijing: The United Nations, the US and European governments have called on China to co-o.. http://tinyurl.com/lto92n",2 158 | myfoxdc Barrie Students Back from Trip to China: A Silver Spring high school's class trip to China has en.. http://tinyurl.com/nlhqba,2 159 | "Three China aerospace giants develop Tianjin Binhai New Area, 22.9 B yuan invested http://bit.ly/mMiDv",2 160 | http://xi.gs/04FO GM CEO: China will continue to be key partner,2 161 | RT @LATimesautos is now the time to buy a GM car? http://bit.ly/nRzlu,2 162 | Recovering from surgery..wishing @julesrenner was here :(,0 163 | "My wrist still hurts. I have to get it looked at. I HATE the dr/dentist/scary places. :( Time to watch Eagle eye. If you want to join, txt!",4 164 | Dentist tomorrow. Have to brush well in the morning. Like I make my hair all nice before I get it cut. Why?,2 165 | "THE DENTIST LIED! "" U WON'T FEEL ANY DISCOMORT! PROB WON'T EVEN NEED PAIN PILLS"" MAN U TWIPPIN THIS SHIT HURT!! HOW MANY PILLS CAN I TAKE!!",0 166 | @kirstiealley my dentist is great but she's expensive...=(,0 167 | @kirstiealley Pet Dentist http://www.funnyville.com/fv/pictures/dogdentures.shtml,2 168 | is studing math ;) tomorrow exam and dentist :),4 169 | my dentist was wrong... WRONG,0 170 | Going to the dentist later.:|,0 171 | Son has me looking at cars online. I hate car shopping. Would rather go to the dentist! Anyone with a good car at a good price to sell?,0 172 | NCAA Baseball Super Regional - Rams Club http://bit.ly/Ro7nx,2 173 | just started playing Major League Baseball 2K9. http://raptr.com/H3LLGWAR,2 174 | Cardinals baseball advance to Super Regionals. Face CS-Fullerton Friday.,2 175 | Sony coupon code.. Expires soon.. http://www.coupondork.com/r/1796,2 176 | waiting in line at safeway.,2 177 | luke and i got stopped walking out of safeway and asked to empty our pockets and lift our shirts. how jacked up is that?,0 178 | Did not realize there is a gym above Safeway!,2 179 | "@XPhile1908 I have three words for you: ""Safeway dot com""",2 180 | Safeway is very rock n roll tonight,4 181 | Bout to hit safeway I gotta eat,2 182 | Jake's going to safeway!,2 183 | Found a safeway. 
Picking up a few staples.,2 184 | Safeway Super-marketing via mobile coupons http://bit.ly/ONH7w,2 185 | The safeway bathroom still smells like ass!,0 186 | "At safeway on elkhorn, they move like they're dead!",0 187 | Your Normal Weight (and How to Get There) ? Normal Eating Blog http://bit.ly/ZeT8O,2 188 | Is Eating and Watching Movies....,2 189 | eating sashimi,2 190 | is eating home made yema,2 191 | eating cake,2 192 | i love Dwight Howard's vitamin water commercial... now i wish he was with NIKE and not adidas. lol.,4 193 | Found NOTHING at Nike Factory :/ Off to Banana Republic Outlet! http://myloc.me/2zic,0 194 | iPhone May Get Radio Tagging and Nike : Recently-released iTunes version 8.2 suggests that VoiceOver functional.. http://tinyurl.com/oq5ctc,2 195 | is lovin his Nike already and that's only from running on the spot in his bedroom,4 196 | Launched! http://imgsearch.net #imgsearch #ajax #jquery #webapp,2 197 | @matthewcyan I finally got around to using jquery to make my bio collapse. Yay for slide animations.,4 198 | RT @jquery: The Ultimate jQuery List - http://jquerylist.com/,2 199 | I just extracted and open-sourced a jQuery plugin from Stormweight to highlight text with a regular expression: http://bit.ly/ybJKb,2 200 | @anna_debenham what was the php jquery hack?,2 201 | jQuery Cheat Sheet http://www.javascripttoolbox.com/jquery/cheatsheet/,2 202 | Beginning JavaScript and CSS Development with jQuery #javascript #css #jquery http://bit.ly/TO3e5,2 203 | "@PDubyaD right!!! LOL we'll get there!! I have high expectations, Warren Buffet style.",4 204 | "RT @blknprecious1: RT GREAT @dbroos ""Someone's sitting in the shade today because someone planted a tree a long time ago.""- Warren Buffet",4 205 | Warren Buffet on the economy http://ping.fm/Lau0p,2 206 | "Warren Buffet became (for a time) the richest man in the United States, not by working but investing in 1 Big idea which lead to the fortune",4 207 | "According to the create a school, Notre Dame will have 7 receivers in NCAA 10 at 84 or higher rating :) *sweet*",4 208 | All-Star Basketball Classic Tuesday Features Top Talent: Chattanooga's Notre Dame High School will play host.. http://bit.ly/qltJA,2 209 | @BlondeBroad it's definitely under warranty & my experience is the amazon support for kindle is great! had to contact them about my kindle2,4 210 | "RT Look, Available !Amazon Kindle2 & Kindle DX, Get it Here: http://short.to/87ub The Top Electronic Book Reader Period, free 2 day ship ...",2 211 | Time Warner Road Runner customer support here absolutely blows. I hate not having other high-speed net options. I'm ready to go nuclear.,0 212 | Time Warner cable phone reps r dumber than nails!!!!! UGH! Cable was working 10 mins ago now its not WTF!,0 213 | @siratomofbones we tried but Time Warner wasn't being nice so we recorded today. :),0 214 | OMG - time warner f'ed up my internet install - instead of today its now NEXT saturday - another week w/o internet! &$*ehfa^V9fhg[*# fml.,0 215 | "wth..i have never seen a line this loooong at time warner before, ugh.",0 216 | Impatiently awaiting the arrival of the time warner guy. It's way too pretty to be inside all afternoon,0 217 | Man accosts Roger Federer during French Open http://ff.im/3HCPT,2 218 | Naive Bayes using EM for Text Classification. Really Frustrating...,0 219 | We went to Stanford University today. Got a tour. Made me want to go back to college. 
It's also decided all of our kids will go there.,4 220 | Investigation pending on death of Stanford CS prof / Google mentor Rajeev Motwani http://bit.ly/LwOUR tip @techmeme,2 221 | "I'm going to bed. It was a successful weekend. Stanford, here I come.",2 222 | "@KarrisFoxy If you're being harassed by calls about your car warranty, changing your number won't fix that. They call every number. #d-bags",0 223 | Just blocked United Blood Services using Google Voice. They call more than those Car Warranty guys.,0 224 | #at&t is complete fail.,0 225 | @broskiii OH SNAP YOU WORK AT AT&T DON'T YOU,0 226 | @Mbjthegreat i really dont want AT&T phone service..they suck when it comes to having a signal,0 227 | "I say we just cut out the small talk: AT&T's new slogan: F__k you, give us your money. (Apologies to Bob Geldof.)",0 228 | pissed about at&t's mid-contract upgrade price for the iPhone (it's $200 more) I'm not going to pay $499 for something I thought was $299,0 229 | Safari 4 is fast :) Even on my shitty AT&T tethering.,0 230 | @ims What is AT&T fucking up?,0 231 | @springsingfiend @dvyers @sethdaggett @jlshack AT&T dropped the ball and isn't supporting crap with the new iPhone 3.0... FAIL #att SUCKS!!!,0 232 | "@MMBarnhill yay, glad you got the phone! Still, damn you, AT&T.",0 233 | Google Wave Developer Sandbox Account Request http://bit.ly/2NYlc,2 234 | "Talk is Cheap: Bing that, I?ll stick with Google. http://bit.ly/XC3C8",0 235 | "@defsounds WTF is the point of deleting tweets if they can still be found in summize and searches? Twitter, please fix that. Thanks and bye",0 236 | @mattcutts have google profiles stopped showing up in searches? cant see them anymore,2 237 | @ArunBasilLal I love Google Translator too ! :D Good day mate !,4 238 | reading on my new Kindle2!,4 239 | My Kindle2 came and I LOVE it! :),4 240 | "LOVING my new Kindle2. Named her Kendra in case u were wondering. The ""cookbook"" is THE tool cuz it tells u all the tricks! Best gift EVR!",4 241 | The real AIG scandal / http://bit.ly/b82Px,0 242 | Any twitter to aprs apps yet?,2 243 | 45 Pros You Should Be Following on Twitter - http://is.gd/sMbZ,2 244 | Obama is quite a good comedian! check out his dinner speech on CNN :) very funny jokes.,4 245 | "' Barack Obama shows his funny side "" >> http://tr.im/l0gY !! Great speech..",4 246 | "I like this guy : ' Barack Obama shows his funny side "" >> http://tr.im/l0gY !!",4 247 | Obama's speech was pretty awesome last night! http://bit.ly/IMXUM,4 248 | "Reading ""Bill Clinton Fail - Obama Win?"" http://tinyurl.com/pcyxj7",4 249 | Obama More Popular Than U.S. Among Arabs: Survey: President Barack Obama's popularity in leading Arab countries .. http://tinyurl.com/prlvqu,4 250 | Obama's got JOKES!! haha just got to watch a bit of his after dinner speech from last night... i'm in love with mr. president ;),4 251 | LEbron james got in a car accident i guess..just heard it on evening news...wow i cant believe it..will he be ok ? http://twtad.com/69750,0 252 | is it me or is this the best the playoffs have been in years oh yea lebron and melo in the finals,4 253 | "@khalid0456 No, Lebron is the best",4 254 | @the_real_usher LeBron is cool. I like his personality...he has good character.,4 255 | Watching Lebron highlights. Damn that niggas good,4 256 | @Lou911 Lebron is MURDERING shit.,4 257 | @uscsports21 LeBron is a monsta and he is only 24. SMH The world ain't ready.,4 258 | @cthagod when Lebron is done in the NBA he will probably be greater than Kobe. 
Like u said Kobe is good but there alot of 'good' players.,4 259 | KOBE IS GOOD BT LEBRON HAS MY VOTE,4 260 | Kobe is the best in the world not lebron .,0 261 | "@asherroth World Cup 2010 Access?? Damn, that's a good look!",4 262 | Just bought my tickets for the 2010 FIFA World Cup in South Africa. Its going to be a great summer. http://bit.ly/9GEZI,4 263 | Share: Disruption...Fred Wilson's slides for his talk at Google HQ http://bit.ly/Bo8PG,2 264 | I have to go to Booz Allen Hamilton for a 2hr meeting :( But then i get to go home :),0 265 | "The great Indian tamasha truly will unfold from May 16, the result day for Indian General Election.",4 266 | "@crlane I have the Kindle2. I've seen pictures of the DX, but haven't seen it in person. I love my Kindle - I'm on it everyday.",4 267 | @criticalpath Such an awesome idea - the continual learning program with a Kindle2 http://bit.ly/1ZLfF,4 268 | ok.. do nothing.. just thinking about 40D,2 269 | "@faithbabywear Ooooh, what model are you getting??? I have the 40D and LOVE LOVE LOVE LOVE it!",4 270 | The Times of India: The wonder that is India's election. http://bit.ly/p7u1H,4 271 | http://is.gd/ArUJ Good video from Google on using search options.,4 272 | @ambcharlesfield lol. Ah my skin is itchy :( damn lawnmowing.,0 273 | itchy back!! dont ya hate it!,0 274 | Stanford Charity Fashion Show a top draw http://cli.gs/NeNuAH,4 275 | Stanford University?s Facebook Profile is One of the Most Popular Official University Pages - http://tinyurl.com/p5b3fl,4 276 | Lyx is cool.,4 277 | SOOO DISSAPOiNTED THEY SENT DANNY GOKEY HOME... YOU STiLL ROCK ...DANNY ... MY HOMETOWN HERO !! YEAH MiLROCKEE!!,4 278 | "RT @PassionModel 'American Idol' fashion: Adam Lambert tones down, Danny Gokey cute ... http://cli.gs/7JWSHV",4 279 | @dannygokey I love you DANNY GOKEY!! :),4 280 | RT @justindavey: RT @tweetmeme GM OnStar now instantly sends accident location coordinates to 911 | GPS Obsessed http://bit.ly/16szL1,2 281 | so tired. i didn't sleep well at all last night.,0 282 | Boarding plane for San Francisco in 1 hour; 6 hr flight. Blech.,0 283 | bonjour San Francisco. My back hurts from last night..,0 284 | "breakers. in San Francisco, CA http://loopt.us/4v88Bw.t",2 285 | Heading to San Francisco,2 286 | With my best girl for a few more hours in San francisco. Mmmmmfamily is wonderful!,4 287 | "F*** up big, or go home - AIG",0 288 | Went to see the Star Trek movie last night. Very satisfying.,4 289 | "I can't wait, going to see star trek tonight!!",4 290 | Star Trek was as good as everyone said!!,4 291 | am loving new malcolm gladwell book - outliers,4 292 | I highly recommend Malcolm Gladwell's 'The Tipping Point.' My next audiobook will probably be one of his as well.,4 293 | Malcolm Gladwell is a genius at tricking people into not realizing he's a fucking idiot,0 294 | "@sportsguy33 hey no offense but malcolm gladwell is a pretenious, annoying cunt and he brings you down. cant read his shit",0 295 | RT @clashmore: http://bit.ly/SOYv7 Great article by Malcolm Gladwell.,4 296 | I seriously underestimated Malcolm Gladwell. I want to meet this dude.,4 297 | i hate comcast right now. everything is down cable internet & phone....ughh what am i to do,0 298 | Comcast sucks.,0 299 | The day I never have to deal with Comcast again will rank as one of the best days of my life.,0 300 | @Dommm did comcast fail again??,0 301 | How do you use the twitter API?... 
http://bit.ly/4VBhH,2 302 | curses the Twitter API limit,0 303 | "Now I can see why Dave Winer screams about lack of Twitter API, its limitations and access throttles!",0 304 | testing Twitter API,2 305 | Arg. Twitter API is making me crazy.,0 306 | Testing Twitter API. Remote Update,2 307 | I'm really loving the new search site Wolfram/Alpha. Makes Google seem so ... quaint. http://www72.wolframalpha.com/,4 308 | "#wolfram Alpha SUCKS! Even for researchers the information provided is less than you can get from #google or #wikipedia, totally useless!",0 309 | Off to the NIKE factory!!!,4 310 | New nike muppet commercials are pretty cute. Why do we live together again?,4 311 | New blog post: Nike Zoom LeBron Soldier 3 (III) - White / Black - Teal http://bit.ly/rouUS,2 312 | New blog post: Nike Trainer 1 http://bit.ly/394bp,2 313 | @Fraggle312 oh those are awesome! i so wish they weren't owned by nike :(,0 314 | @tonyhawk http://twitpic.com/5c7uj - AWESOME!!! Seeing the show Friday at the Shoreline Amphitheatre. Never seen NIN before. Can't wait. ...,4 315 | "arhh, It's weka bug. = ="" and I spent almost two hours to find that out. crappy me",0 316 | "@mitzs hey bud :) np I do so love my 50D, although I'd love a 5D mkII more",4 317 | @jonduenas @robynlyn just got us a 50D for the office. :D,4 318 | Just picked up my new Canon 50D...it's beautiful!! Prepare for some seriously awesome photography!,4 319 | Just got my new toy. Canon 50D. Love love love it!,4 320 | Learning about lambda calculus :),4 321 | "#jobs #sittercity Help with taking care of sick child (East Palo Alto, CA) http://tinyurl.com/qwrr2m",2 322 | I'm moving to East Palo Alto!,4 323 | @ atebits I just finished watching your Stanford iPhone Class session. I really appreciate it. You Rock!,4 324 | @jktweet Hi! Just saw your Stanford talk and really liked your advice. Just saying Hi from Singapore (yes the videos do get around),4 325 | #MBA Admissions Tips Stanford GSB Deadlines and Essay Topics 2009-2010 http://tinyurl.com/pet4fd,2 326 | Ethics and nonprofits - http://bit.ly/qsXRp #stanford #socialentrepreneurship,2 327 | LAKERS tonight let's go!!!!,4 328 | Will the Lakers kick the Nuggets ass tonight?,4 329 | Oooooooh... North Korea is in troubleeeee! http://bit.ly/19epAH,0 330 | Wat the heck is North Korea doing!!??!! They just conducted powerful nuclear tests! Follow the link: http://www.msnbc.msn.com/id/30921379,0 331 | Listening to Obama... Friggin North Korea...,0 332 | "I just realized we three monkeys in the white Obama.Biden,Pelosi . Sarah Palin 2012",0 333 | @foxnews Pelosi should stay in China and never come back.,0 334 | Nancy Pelosi gave the worst commencement speech I've ever heard. Yes I'm still bitter about this,0 335 | ugh. the amount of times these stupid insects have bitten me. Grr..,0 336 | Prettiest insects EVER - Pink Katydids: http://bit.ly/2Upw2p,4 337 | Just got barraged by a horde of insects hungry for my kitchen light. So scary.,0 338 | Just had McDonalds for dinner. :D It was goooood. Big Mac Meal. ;),4 339 | AHH YES LOL IMA TELL MY HUBBY TO GO GET ME SUM MCDONALDS =],4 340 | Stopped to have lunch at McDonalds. Chicken Nuggetssss! :) yummmmmy.,4 341 | Could go for a lot of McDonalds. i mean A LOT.,4 342 | my exam went good. @HelloLeonie: your prayers worked (:,4 343 | "Only one exam left, and i am so happy for it :D",4 344 | Math review. Im going to fail the exam.,0 345 | Colin Powell rocked yesterday on CBS. 
Cheney needs to shut the hell up and go home.Powell is a man of Honor and served our country proudly,0 346 | obviously not siding with Cheney here: http://bit.ly/19j2d,0 347 | Absolutely hilarious!!! from @mashable: http://bit.ly/bccWt,4 348 | @mashable I never did thank you for including me in your Top 100 Twitter Authors! You Rock! (& I New Wave :-D) http://bit.ly/EOrFV,4 349 | Learning jQuery 1.3 Book Review - http://cfbloggers.org/?c=30629,2 350 | RT @shrop: Awesome JQuery reference book for Coda! http://www.macpeeps.com/coda/ #webdesign,4 351 | I've been sending e-mails like crazy today to my contacts...does anyone have a contact at Goodby SIlverstein...I'd love to speak to them,4 352 | Adobe CS4 commercial by Goodby Silverstein: http://bit.ly/1aikhF,2 353 | "Goodby, Silverstein's new site... http://www.goodbysilverstein.com/ I enjoy it.",4 354 | Wow everyone at the Google I/O conference got free G2's with a month of unlimited service,4 355 | @vkerkez dood I got a free google android phone at the I/O conference. The G2!,4 356 | "@Orli the G2 is amazing btw, a HUGE improvement over the G1",4 357 | "HTML 5 Demos! Lots of great stuff to come! Yes, I'm excited. :) http://htmlfive.appspot.com #io2009 #googleio",4 358 | @googleio http://twitpic.com/62shi - Yay! Happy place! Place place! I love Google!,4 359 | #GoogleIO | O3D - Bringing 3d graphics to the browser. Very nice tbh. Funfun.,4 360 | "Awesome viral marketing for ""Funny People"" http://www.nbc.com/yo-teach/",4 361 | "Watching a programme about the life of Hitler, its only enhancing my geekiness of history.",2 362 | saw night at the museum out of sheer desperation. who is funding these movies?,0 363 | Night At The Museum 2? Pretty furkin good.,4 364 | Watching Night at the Museum - giggling.,4 365 | "@pambeeslyjenna Jenna, I went to see Night At The Museum 2 today and I was so surprised to see three cast members from The Office...",2 366 | About to watch Night at the Museum with Ryan and Stacy,2 367 | "Getting ready to go watch Night at the Museum 2. Dum dum, you give me gum gum!",2 368 | "Back from seeing 'Star Trek' and 'Night at the Museum.' 'Star Trek' was amazing, but 'Night at the Museum' was; eh.",0 369 | just watched night at the museum 2! so stinkin cute!,4 370 | "So, Night at the Museum 2 was AWESOME! Much better than part 1. Next weekend we'll see Up.",4 371 | "I think I may have a new favorite restaurant. On our way to see ""Night at the Museum 2"".",2 372 | "UP! was sold out, so i'm seeing Night At The Museum 2. I'm __ years old.",2 373 | saw the new Night at the Museum and i loved it. Next is to go see UP in 3D,4 374 | It is a shame about GM. What if they are forced to make only cars the White House THINKS will sell? What do you think?,0 375 | "As u may have noticed, not too happy about the GM situation, nor AIG, Lehman, et al",0 376 | Obama: Nationalization of GM to be short-term (AP) http://tinyurl.com/md347r,2 377 | @Pittstock $GM good riddance. sad though.,0 378 | "I Will NEVER Buy a Government Motors Vehicle: Until just recently, I drove GM cars. Since 1988, when I bought a .. http://tinyurl.com/lulsw8",0 379 | Having the old Coca-Cola guy on the GM board is stupid has heck! #tcot #ala,0 380 | #RantsAndRaves The worst thing about GM (concord / pleasant hill / martinez): is the fucking UAW. .. http://buzzup.com/4ueb,0 381 | "Give a man a fish, u feed him for the day. Teach him to fish, u feed him for life. Buy him GM, and u F**K him over for good.",0 382 | "The more I hear about this GM thing the more angry I get. 
Billions wasted, more bullshit. All for something like 40k employees and all the..",0 383 | @QuantTrader i own a GM car and it is junk as far as quality compared to a honda,0 384 | sad day...bankrupt GM,0 385 | is upset about the whole GM thing. life as i know it is so screwed up,0 386 | whoever is running time warner needs to be repeatedly raped by a rhino so they understand the consequences of putting out shitty cable svcs,0 387 | "Time Warner CEO hints at online fees for magazines (AP) - Read from Mountain View,United States. Views 16209 http://bit.ly/UdFCH",2 388 | #WFTB Joining a bit late. My connection was down (boo time warner),0 389 | Cox or Time Warner? Cox is cheaper and gets a B on dslreports. TW is more expensive and gets a C.,0 390 | i am furious with time warner and their phone promotions!,0 391 | Just got home from chick-fil-a with the boys. Damn my internets down =( stupid time warner,0 392 | could time-warner cable suck more? NO.,0 393 | Pissed at Time Warner for causin me to have slow internet problems,0 394 | "@sportsguy33 Ummm, having some Time Warner problems?",0 395 | You guys see this? Why does Time Warner have to suck so much ass? Really wish I could get U-Verse at my apartment. http://bit.ly/s594j,0 396 | "RT @sportsguy33 The upside to Time Warner: unhelpful phone operators superslow on-site service. Crap, that's not an upside.",0 397 | "RT @sportsguy33: New Time Warner slogan: ""Time Warner, where we make you long for the days before cable.""",0 398 | "confirmed: it's Time Warner's fault, not Facebook's, that fb is taking about 3 minutes to load. so tempted to switch to verizon =/",0 399 | @sportsguy33 Time Warner = epic fail,0 400 | Lawson to head Newedge Hong Kong http://bit.ly/xLQSD #business #china,2 401 | Weird Piano Guitar House in China! http://u2s.me/72i8,2 402 | Send us your GM/Chevy photos http://tinyurl.com/luzkpq,2 403 | I know. How sad is that? RT @caseymercier: 1st day of hurricane season. That's less scarey than govt taking over GM.,0 404 | "GM files Bankruptcy, not a good sign...",0 405 | yankees won mets lost. its a good day.,4 406 | My dentist appt today was actually quite enjoyable.,4 407 | I hate the effing dentist.,0 408 | @stevemoakler i had a dentist appt this morning and had the same conversation!,2 409 | @kirstiealley I hate going to the dentist.. !!!,0 410 | i hate the dentist....who invented them anyways?,0 411 | this dentist's office is cold :/,0 412 | Check this video out -- David After Dentist http://bit.ly/47aW2,2 413 | First dentist appointment [in years] on Wednesday possibly.,2 414 | Tom Shanahan's latest column on SDSU and its NCAA Baseball Regional appearance: http://ow.ly/axhu,2 415 | BaseballAmerica.com: Blog: Baseball America Prospects Blog ? Blog ... http://bit.ly/EtT8a,2 416 | Portland city politics may undo baseball park http://tinyurl.com/lpjquj,2 417 | "RT @WaterSISWEB: CA Merced's water bottled by Safeway, resold at a profit: Wells are drying up across the county http://tinyurl.com/mb573s",2 418 | dropped her broccoli walking home from safeway! ;( so depressed,2 419 | @ronjon we don't have Safeway.,2 420 | Just applied at Safeway!(: Yeeeee!,4 421 | @ Safeway. Place is a nightmare right now. Bumming.,0 422 | at safeway with dad,2 423 | "HATE safeway select green tea icecream! bought two cartons, what a waste of money. 
>_<",0 424 | "Safeway with Marvin, Janelle, and Auntie Lhu",2 425 | Safeway offering mobile coupons http://bit.ly/ONH7w,2 426 | "Phillies Driving in the Cadillac with the Top Down in Cali, Win 5-3 - http://tinyurl.com/nzcjqa",2 427 | Saved money by opting for grocery store trip and stocking food in hotel room fridge vs. eating out every night while out of town.,2 428 | "Lounging around, eating Taco Bell and watching NCIS before work tonight. Need help staying awake.",2 429 | eating breakfast and then school,2 430 | still hungry after eating....,2 431 | 10 tips for healthy eating ? ResultsBy Fitness Blog :: Fitness ... http://bit.ly/62gFn,2 432 | "with the boyfriend, eating a quesadilla",2 433 | "Eating dinner. Meat, chips, and risotto.",2 434 | got a new pair of nike shoes. pics up later,2 435 | "Nike SB Blazer High ""ACG"" Custom - Brad Douglas - http://timesurl.at/45a448",2 436 | Nike rocks. I'm super grateful for what I've done with them :) & the European Division of NIKE is BEYOND! @whitSTYLES @muchasmuertes,4 437 | Nike Air Yeezy Khaki/Pink Colorway Release - http://shar.es/bjfN,2 438 | @evelynbyrne have you tried Nike ? V. addictive.,4 439 | @erickoston That looks an awful lot like one of Nike's private jets....I'm just sayin....,2 440 | The Nike Training Club (beta) iPhone app looks very interesting.,4 441 | argghhhh why won't my jquery appear in safari bad safari !!!,0 442 | DevSnippets : jQuery Tools - Javascript UI Components for the Web... http://inblogs.org/go/hfuqt,2 443 | "all about Ajax,jquery ,css ,JavaScript and more... (many examples) http://ajaxian.com/",2 444 | "I'm ready to drop the pretenses, I am forever in love with jQuery, and I want to marry it. Sorry ladies, this nerd is jquery.spokenFor.js",4 445 | "This is cold.. I was looking at google's chart//visualization API and found this jQuery ""wrapper"" for the API... http://tinyurl.com/mq52bq",2 446 | I spent most of my day reading a jQuery book. Now to start drinking some delirium tremens.,2 447 | jquery Selectors http://codylindley.com/jqueryselectors/,2 448 | How to implement a news ticker with jQuery and ten lines of code http://bit.ly/CZnFJ,2 449 | What's Buffet Doing? Warren Buffett Kicks Butt In Battle of the Boots: Posted By:Alex Crippe.. http://bit.ly/AUIzO,2 450 | "SUPER INVESTORS: A great weekend read here from Warren Buffet. Oldie, but a goodie. http://tinyurl.com/oqxgga",4 451 | I'm truly braindead. I couldn't come up with Warren Buffet's name to save my soul,2 452 | "reading Michael Palin book, The Python Years...great book. I also recommend Warren Buffet & Nelson Mandela's bio",4 453 | "I mean, I'm down with Notre Dame if I have to. It's a good school, I'd be closer to Dan, I'd enjoy it.",4 454 | "I can't watch TV without a Tivo. And after all these years, the Time/Warner DVR STILL sucks. http://www.davehitt.com/march03/twdvr.html",0 455 | I'd say some sports writers are idiots for saying Roger Federer is one of the best ever in Tennis. Roger Federer is THE best ever in Tennis,4 456 | I still love my Kindle2 but reading The New York Times on it does not feel natural. I miss the Bloomingdale ads.,0 457 | I love my Kindle2. No more stacks of books to trip over on the way to the loo.,4 458 | "Although today's keynote rocked, for every great announcement, AT&T shit on us just a little bit more.",0 459 | "@sheridanmarfil - its not so much my obsession with cell phones, but the iphone! i'm a slave to at&t forever because of it. :)",0 460 | @freitasm oh I see. 
I thought AT&T were 900MHz WCDMA?,2 461 | @Plip Where did you read about tethering support Phil? Just AT&T or will O2 be joining in?,2 462 | Fuzzball is more fun than AT&T ;P http://fuzz-ball.com/twitter,0 463 | "Today is a good day to dislike AT&T. Vote out of office indeed, @danielpunkass",0 464 | GOT MY WAVE SANDBOX INVITE! Extra excited! Too bad I have class now... but I'll play with it soon enough! #io2009 #wave,4 465 | looks like summize has gone down. too many tweets from WWDC perhaps?,0 466 | I hope the girl at work buys my Kindle2,2 467 | Missed this insight-filled May column: One smart guy looking closely at why he's impressed with Kindle2 http://bit.ly/i0peY @wroush,2 468 | "@sklososky Thanks so much!!! ...from one of your *very* happy Kindle2 winners ; ) I was so surprised, fabulous. Thank you! Best, Kathleen",4 469 | Man I kinda dislike Apple right now. Case in point: the iPhone 3GS. Wish there was a video recorder app. Please?? http://bit.ly/DZm1T,0 470 | @cwong08 I have a Kindle2 (& Sony PRS-500). Like it! Physical device feels good. Font is nice. Pg turns are snappy enuf. UI a little klunky.,4 471 | "The #Kindle2 seems the best eReader, but will it work in the UK and where can I get one?",4 472 | "I have a google addiction. Thank you for pointing that out, @annamartin123. Hahaha.",4 473 | @ruby_gem My primary debit card is Visa Electron.,2 474 | Off to the bank to get my new visa platinum card,2 475 | "dearest @google, you rich bastards! the VISA card you sent me doesn't work. why screw a little guy like me?",0 476 | has a date with bobby flay and gut fieri from food network,2 477 | Excited about seeing Bobby Flay and Guy Fieri tomorrow at the Great American Food & Music Fest!,4 478 | Gonna go see Bobby Flay 2moro at Shoreline. Eat and drink. Gonna be good.,4 479 | can't wait for the great american food and music festival at shoreline tomorrow. mmm...katz pastrami and bobby flay. yes please.,4 480 | "My dad was in NY for a day, we ate at MESA grill last night and met Bobby Flay. So much fun, except I completely lost my voice today.",4 481 | Fighting with LaTex. Again...,0 482 | @Iheartseverus we love you too and don't want you to die!!!!!! Latex = the devil,0 483 | "7 hours. 7 hours of inkscape crashing, normally solid as a rock. 7 hours of LaTeX complaining at the slightest thing. I can't take any more.",0 484 | How to Track Iran with Social Media: http://bit.ly/2BoqU,2 485 | Shit's hitting the fan in Iran...craziness indeed #iranelection,0 486 | Monday already. Iran may implode. Kitchen is a disaster. @annagoss seems happy. @sebulous had a nice weekend and @goldpanda is great. whoop.,0 487 | Twitter Stock buzz: $AAPL $ES_F $SPY $SPX $PALM (updated: 12:00 PM),2 488 | getting ready to test out some burger receipes this weekend. Bobby Flay has some great receipes to try. Thanks Bobby.,4 489 | @johncmayer is Bobby Flay joining you?,2 490 | i lam so in love with Bobby Flay... he is my favorite. RT @terrysimpson: @bflay you need a place in Phoenix. We have great peppers here!,4 491 | "I just created my first LaTeX file from scratch. That didn't work out very well. (See @amandabittner , it's a great time waster)",0 492 | using Linux and loving it - so much nicer than windows... Looking forward to using the wysiwyg latex editor!,4 493 | "After using LaTeX a lot, any other typeset mathematics just looks hideous.",4 494 | Ask Programming: LaTeX or InDesign?: submitted by calcio1 [link] [1 comment] http://tinyurl.com/myfmf7,2 495 | "On that note, I hate Word. I hate Pages. 
I hate LaTeX. There, I said it. I hate LaTeX. All you TEXN3RDS can come kill me now.",0 496 | Ahhh... back in a *real* text editing environment. I <3 LaTeX.,4 497 | "Trouble in Iran, I see. Hmm. Iran. Iran so far away. #flockofseagullsweregeopoliticallycorrect",0 498 | Reading the tweets coming out of Iran... The whole thing is terrifying and incredibly sad...,0 499 | -------------------------------------------------------------------------------- /code/data/sentiment-analysis/training_set.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/code/data/sentiment-analysis/training_set.csv -------------------------------------------------------------------------------- /code/data/similarity/simple.dnd: -------------------------------------------------------------------------------- 1 | (((A,B),(C,D)),(E,F,G)); -------------------------------------------------------------------------------- /code/images/UKDS_Logos_Col_Grey_300dpi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/code/images/UKDS_Logos_Col_Grey_300dpi.png -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: text-env 2 | channels: 3 | - conda-forge 4 | 5 | dependencies: 6 | - python=3.11.9 7 | - jupyter=1.0.0 8 | - jupyter_contrib_nbextensions=0.7.0 9 | - pandas=2.2.2 10 | - nltk=3.8.1 11 | - autocorrect=2.6.1 12 | - spacy=3.7.3 13 | - tqdm=4.66.4 14 | - numpy=1.26.4 15 | - scipy=1.13.0 16 | - matplotlib=3.8.4 17 | - networkx=3.3 18 | - rise=5.7.1 -------------------------------------------------------------------------------- /postBuild: -------------------------------------------------------------------------------- 1 | ## Enable Table of Contents 2 | 3 | jupyter contrib nbextension install --user 4 | jupyter nbextension enable --py widgetsnbextension 5 | jupyter nbextension enable rise 6 | jupyter nbextension enable toc2/main -------------------------------------------------------------------------------- /webinars/2020/Text-Mining_Advanced_widescreen.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/webinars/2020/Text-Mining_Advanced_widescreen.pdf -------------------------------------------------------------------------------- /webinars/2020/Text-Mining_Advanced_widescreen.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/webinars/2020/Text-Mining_Advanced_widescreen.pptx -------------------------------------------------------------------------------- /webinars/2020/Text-Mining_Basics_widescreen.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/webinars/2020/Text-Mining_Basics_widescreen.pdf -------------------------------------------------------------------------------- /webinars/2020/Text-Mining_Basics_widescreen.pptx: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/webinars/2020/Text-Mining_Basics_widescreen.pptx -------------------------------------------------------------------------------- /webinars/2020/Text-Mining_Intro_widescreen.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/webinars/2020/Text-Mining_Intro_widescreen.pdf -------------------------------------------------------------------------------- /webinars/2020/Text-Mining_Intro_widescreen.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/webinars/2020/Text-Mining_Intro_widescreen.pptx -------------------------------------------------------------------------------- /webinars/2023/Text-Mining_Nham_DataFest_2023.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UKDataServiceOpen/text-mining/407d16015ba270b4e39462c20de9b370c4e78563/webinars/2023/Text-Mining_Nham_DataFest_2023.pptx -------------------------------------------------------------------------------- /webinars/README.md: -------------------------------------------------------------------------------- 1 | # Webinars 2 | 3 | ## 1. Introduction to Text-Mining 4 | The first webinar covers the concepts behind fully structured and semi-unstructured data, the theory behind capturing and amplifying existing structure, and the four basic steps involved in any text-mining project. 5 | * [Watch recording](https://www.youtube.com/watch?v=wFz1n-z_dvY) 6 | * [Download slides](./2020/Text-Mining_Intro_widescreen.pdf) 7 | 8 | ## 2. Text-Mining: Basic Processes 9 | This webinar dives into the steps needed to do some of the most common text-mining analyses and is accompanied by an online interactive notebook that allows participants to see, edit and execute the demonstrated code. 10 | * [Watch recording](https://www.youtube.com/watch?v=T6K7BibhSTA) 11 | * [Download slides](./2020/Text-Mining_Basics_widescreen.pdf) 12 | 13 | ## 3. Text-Mining: Advanced Options 14 | This webinar rounds off the series by diving into the concepts behind more advanced text-mining analyses, presenting some sample code that participants may find useful, and introducing some work that provides further learning opportunities. This webinar is also accompanied by an online interactive notebook that allows participants to see, edit and execute the demonstrated code. 15 | * [Watch recording](https://www.youtube.com/watch?v=pEs3jOlwbaI) 16 | * [Download slides](./2020/Text-Mining_Advanced_widescreen.pdf) 17 | --------------------------------------------------------------------------------
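
Getting started with the bundled data: the pinned environment in `environment.yml` can be recreated with `conda env create -f environment.yml` followed by `conda activate text-env`, and the sentiment-analysis files under `code/data/sentiment-analysis/` store one tweet per row with a numeric code after the final comma. The snippet below is a minimal sketch of loading those files with pandas, not code taken from the repository's notebooks: the column names, the no-header assumption, and the reading of the codes as 0 = negative, 2 = neutral, 4 = positive are assumptions made for illustration.

```python
# Minimal, illustrative sketch (not taken from the repository's notebooks).
# Assumptions: the CSVs have no header row, each row ends in a numeric code,
# and the codes follow the common 0 = negative, 2 = neutral, 4 = positive scheme.
from pathlib import Path

import pandas as pd

DATA_DIR = Path("code/data/sentiment-analysis")             # relative to the repository root
LABEL_NAMES = {0: "negative", 2: "neutral", 4: "positive"}  # assumed meaning of the codes


def load_sentiment_csv(path: Path) -> pd.DataFrame:
    """Read a (text, label) CSV into a DataFrame with a readable sentiment column."""
    df = pd.read_csv(
        path,
        header=None,
        names=["text", "label"],
        on_bad_lines="skip",   # tolerate the occasional malformed row
    )
    df["sentiment"] = df["label"].map(LABEL_NAMES)
    return df


if __name__ == "__main__":
    train = load_sentiment_csv(DATA_DIR / "training_set.csv")
    test = load_sentiment_csv(DATA_DIR / "testing_set.csv")
    print(train.shape, test.shape)
    print(train["sentiment"].value_counts(dropna=False))
```

If the header layout or label scheme differs from these assumptions, adjust `names=` and `LABEL_NAMES` accordingly; the notebooks in `code/` remain the authoritative reference for how the training and testing sets are actually used.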