Twitter Sociolinguistics in the Unix Shell
================

In this note, I'll show how to do a very rough sociolinguistic study of data acquired from the Twitter API, using shell scripting tools only.
The data comes in JSON format, so I'll focus on parsing that.
If you want to learn more about the Unix shell, Ken Church's classic [Unix for Poets](http://www.cs.upc.edu/~padro/Unixforpoets.pdf) is a great way to start.

The sociolinguistic hypothesis that we'll test is that the English future tense system is shifting towards increasing
use of the less formal "going to" in place of the more formal "will" (see, e.g., [Tagliamonte and D'Arcy 2009](http://www.ling.upenn.edu/~wlabov/L660/Peaks.pdf)).
According to the sociolinguistic method of *apparent time*, we expect to find that older people remain more likely to use the outgoing form "will", while younger people are more likely to use the incoming form "going to". Let's see whether this holds up on Twitter.

We will use the classic shell tools ```grep, sed, awk, cut, sort```, etc.
I am not a shell expert, and there are likely better ways to do many of these things.
You can get lots of opinions about these topics on Stack Exchange and Stack Overflow, among other places.
(Thanks to @WladimirSidorenko for several helpful comments on improving the examples and presentation!)

# Looking at the data

Twitter data is stored on our group's server in gzipped text files that contain one [JSON](http://www.json.org/)
object per line.
If we combine ```zcat``` (```cat``` for gzipped files) with ```head```,
we can examine the first line of one such file:

```shell
[jeisenstein3@conair new-archive]$ zcat tweets-Feb-14-16-03-35.gz | head -1
","id":58371299,"id_str":"58371299","indices":[3,17]}],"symbols":[],"media":[{"id":696796012514406400,"id_str":"696796012514406400","indices":[36,59],"media_url":"http:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","url":"https:\/\/t.co\/GhVa5HQ5ok","display_url":"pic.twitter.com\/GhVa5HQ5ok","expanded_url":"http:\/\/twitter.com\/TheybelikeDar\/status\/696796019800014848\/photo\/1","type":"photo","sizes":{"small":{"w":340,"h":628,"resize":"fit"},"large":{"w":554,"h":1024,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":554,"h":1024,"resize":"fit"}},"source_status_id":696796019800014848,"source_status_id_str":"696796019800014848","source_user_id":58371299,"source_user_id_str":"58371299"}]},"extended_entities":{"media":[{"id":696796012514406400,"id_str":"696796012514406400","indices":[36,59],"media_url":"http:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","url":"https:\/\/t.co\/GhVa5HQ5ok","display_url":"pic.twitter.com\/GhVa5HQ5ok","expanded_url":"http:\/\/twitter.com\/TheybelikeDar\/status\/696796019800014848\/photo\/1","type":"photo","sizes":{"small":{"w":340,"h":628,"resize":"fit"},"large":{"w":554,"h":1024,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":554,"h":1024,"resize":"fit"}},"source_status_id":696796019800014848,"source_status_id_str":"696796019800014848","source_user_id":58371299,"source_user_id_str":"58371299"}]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1455353581659"} 27 | ``` 28 | 29 | This is a simple pipeline, which pipes the output of ```zcat``` through the command ```head -1```. (Check out the comment from @WladimirSidorenko on the dangers of using pipes.) Anyway, the structure of the JSON file is: 30 | 31 | ```{"key1":"value1","key2":"value2",etc}``` 32 | 33 | However, the values can themselves contain nested JSON objects. Anyway, we want to pull out two fields: "text" and "user.name". 34 | 35 | ## Using a JSON parser 36 | 37 | The easy way to do this is using a json parser, like [jq](https://stedolan.github.io/jq). The following command will pull out the text field from the first three json lines that contain the substring "going to be". 38 | 39 | ```shell 40 | zgrep 'going to be' tweets-Feb-14-16-03-35.gz | head -3 | jq .text 41 | ``` 42 | 43 | The result is: 44 | 45 | ``` 46 | "@StJuliansFC @CwmbranTown hope our games on going to be a nightmare catching up on all these games!" 47 | "im going to be by myself though im scared" 48 | "RT @carterreynolds: Magcon may never be the same but we're ALWAYS going to be one big happy family 😊" 49 | ``` 50 | 51 | To get the text and the username, you can do this: 52 | 53 | ```shell 54 | zgrep 'going to be' tweets-Feb-14-16-03-35.gz | head -3 | jq --raw-output '"\(.user.name)\t\(.text)"' 55 | ``` 56 | Results: 57 | 58 | ``` 59 | machen afc @StJuliansFC @CwmbranTown hope our games on going to be a nightmare catching up on all these games!nicola im going to be by myself though im scared 60 | Love yoΟ… CΞ±ΠΌ RT @carterreynolds: Magcon may never be the same but we're ALWAYS going to be one big happy family 😊 61 | ``` 62 | 63 | The format of the ```jq``` command is a little funny: I don't know why you have to escape the opening parentheses, but not the closing ones. 
# Getting the text with Sed and Grep

If you don't have ```jq```, or you want to learn to use more general tools, you can use Sed (**s**treaming **ed**itor) to capture each of these fields. We're going to use Sed's "s" command. The syntax of this command is essentially:

```shell
sed s/find/replace/options
```

where ```find``` specifies a pattern to match and ```replace``` specifies what to replace that pattern with. We'll always use a single option, ```g```, which says to execute this replacement "globally" -- every time possible in each line.

## One key-value pair per line

As a first step, let's try to break up these big JSON blocks into one key-value pair per line.

```shell
zcat tweets-Feb-14-16-03-35.gz | sed 's/,\"/\n/g' | head -n 5
```

Results:

```
"
id":58371299
id_str":"58371299"
indices":[3,17]}]
symbols":[]
```

Note that this command does not respect JSON's nested structure. For our purposes, that won't matter, but if there were nested "text" fields, we might be in trouble.

## Capture groups

Next, to get the relevant text, we can use a *capture group*. We want to capture the field after the "text" key. Here's how we'll do it:

```shell
sed s/.*\"text\":\"\([^\"]*\)\".*/\1/g
```

This says:

- match all characters before observing the string ```"text":"``` (note the escaped quotation marks).
- then match a sequence of non-quotation characters, ```[^\"]```. The brackets indicate a character class (this is a regex), the caret indicates negation, and then we have the escaped quotation mark.
- by putting (escaped) parens around this pattern, we indicate that we want to capture it: ```\([^\"]*\)```
- then match the closing quote, and all other characters in the line
- in the replace string, ```\1``` means print the capture group
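Before pointing this at real data, it can help to sanity-check the pattern on a single handmade line; here's a minimal sketch (the one-line "tweet" below is made up):

```shell
# a fabricated one-line "tweet", just to exercise the capture group
echo '{"id":123,"text":"hello world","user":{"name":"Jane Doe"}}' | sed 's/.*\"text\":\"\([^\"]*\)\".*/\1/g'
```

The output should be just ```hello world```.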
We call:

```shell
zcat tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\".*/\1/g' | grep -E '^[A-Za-z]' | head -3
```

which uses ```grep``` to keep only lines that start with an alphabetic character. (When the sed pattern fails to match, the original JSON line passes through unchanged, and those lines start with a curly brace.) Here are the first three results:

```
https:\/\/t.co\/QuRvL8exJZ
Que alguien me duerma de una \ud83d\udc4a, \ud83d\ude4f.
relationship status: https:\/\/t.co\/eON4iSSjvz
```

Now let's use a more complicated capture pattern to get the name too. For the name, we will require that it be two capitalized alphabetic strings, ```[A-Z][a-z]* [A-Z][a-z]*```. This is a trick to trade recall for precision, since there are a lot of garbage names on Twitter. Here's what we run:

```shell
zcat tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | head -3
```

Notice that the replace string now is ```\1\t\2```: print the two capture groups, with a tab between them. (The ```\t``` escape in the replacement works in GNU sed; in BSD sed you would insert a literal tab character instead.) Here's the output:

```
https:\/\/t.co\/QuRvL8exJZ	Selena Gomez
Gusto ko ng katext	Roymar Buenvenida
RT @NikitaLovebird: \u0915\u0947\u091c\u0930\u0940\u0935\u093e\u0932~ \u0926\u093e\u0926\u0930\u0940 \u091a\u0932\u094b\u0917\u0947 \n\u0911\u091f\u094b\u0935\u093e\u0932\u093e~ \u0939\u093e \u0939\u093e \u0939\u093e \u0939\u093e \n.\n.\n\u0915\u0947\u091c\u0930\u0940\u0935\u093e\u0932~ JNU \u091a\u0932\u094b\u0917\u0947 \n\u0911\u091f\u094b\u0935\u093e\u0932\u093e~ \u0928\u0939\u0940\u0902\n\u0907\u0938\u092e\u0947 \u0915\u0947\u091c\u0930\u0940 \u0905\u0902\u0915\u0932 \u0915\u094d\u092f\u093e \u0915\u0930 \u0938\u0915\u0924\u0947 \u0939\u0948???\n\u2026	Sahil Kataria
```

## Putting zgrep in front

We're wasting time processing a lot of text strings that are not of interest. So let's put a grep at the front of the pipeline:

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | head -3
```

Here are the results:

```
MGBACKTOWEMBLEY	Matt Goss
God Brat 1, can spew contempt. World champion at ten. He's going to be a fun teenager.	Craig Short
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
```

Notice that the first example doesn't include "going to be" in the text! The string must appear somewhere else in the JSON, maybe in the profile.

## Postfiltering with grep

The solution is to use grep twice: once as a prefiltering step, and once as a postfiltering step. This is a typical design pattern when streaming through big data: use a fast, low-precision filter first, then a slower, high-precision filter at the end.

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'going to be ' | head -5
```

Here are the results:

```
God Brat 1, can spew contempt. World champion at ten. He's going to be a fun teenager.	Craig Short
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
I'm going to be on Broadcasting House on Radio 4 tomorrow explaining why I actually quite enjoy Valentine's Day *ducks*	Henry Jeffreys
```

This time I took the first five hits, because the toaster tweet got repeated three times for some reason. We'll deal with that later.
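In fact, if you don't mind losing the original ordering of the tweets, exact duplicate lines can be collapsed right away with ```sort -u```; here's a sketch of the same pipeline:

```shell
# same pipeline, but collapsing exact duplicate text/name pairs;
# sort -u sorts the lines and keeps one copy of each
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz \
    | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' \
    | grep -E '^[A-Za-z].*' | grep 'going to be ' \
    | sort -u | head -5
```

We'll use ```sort -u``` again below when we collect names.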
# Collecting names

We only want the names of the individuals using these words.
We can collect these using ```cut```.

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'going to be ' | head -5 | cut -f 2
```

Results:

```
Craig Short
Boring Tweeter
Boring Tweeter
Boring Tweeter
Henry Jeffreys
```

Let's write these to a file, one for each pattern:

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'going to be ' | cut -f 2 | tee ~/going-to-be-all-names.txt
```
```shell
zgrep 'will be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'will be ' | cut -f 2 | tee ~/will-be-all-names.txt
```

Instead of the usual ```>``` redirect, I've used ```tee```, which prints to standard out as well as writing to the file.

## Filtering names

One thing we notice is that one guy, "Cameron Dallas", seems to account for a lot of these messages. Let's just count each name once.

```shell
sort -u ~/will-be-all-names.txt | cut -f 1 -d\  | uniq -c | tee ~/will-be-name-counts.txt
```

(We run the same pipeline on ```~/going-to-be-all-names.txt``` to produce ```~/going-to-be-name-counts.txt```.)

Here's what's happening in this pipeline:

- ```sort -u``` alphabetically sorts all names, and returns the unique entries
- ```cut -f 1 -d\ ``` selects only the first name, by cutting on a single whitespace delimiter. This is valid because we are only admitting names that contain exactly two whitespace-delimited tokens.
- ```uniq -c``` computes the *count* of each unique entry. ```uniq``` requires that its input already be sorted, but we have this from the ```sort``` call at the beginning of the pipeline.
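To see what each stage contributes, you can run the same pipeline on a tiny handmade name list (the names below are invented):

```shell
# sort -u keeps one copy of each full name; cut keeps the first name;
# uniq -c then counts how many distinct full names share each first name
printf 'Ann Smith\nAnn Jones\nAnn Smith\nBob Smith\n' | sort -u | cut -f 1 -d' ' | uniq -c
```

This prints a count of 2 for "Ann" (two distinct full names) and 1 for "Bob"; the repeated "Ann Smith" is counted only once.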
# Looking up ages per name

Let's get the top names for each set:

```shell
sort -nr ~/will-be-name-counts.txt | head -5
```

Results:

```
10 The
10 David
8 Michael
8 Chris
7 Mark
```

```shell
sort -nr ~/going-to-be-name-counts.txt | head -5
```

Results:

```
3 Taylor
3 Nathan
3 Josh
3 Jon
3 Hannah
```

Which names are older? Some answers are online: [http://rhiever.github.io/name-age-calculator/index.html?Gender=M&Name=Nathan](http://rhiever.github.io/name-age-calculator/index.html?Gender=M&Name=Nathan)

## Pulling web data

We can also pull data directly from this site, using the helpful ```curl``` command:

```shell
curl http://rhiever.github.io/name-age-calculator/names/M/N/Nathan.txt
```

This data is comma-delimited. We will use ```awk``` to reorganize it so that the column containing the counts of living people comes first. Then we will sort on this column, to find the year in which the greatest number of living people with that name was born:

```shell
curl http://rhiever.github.io/name-age-calculator/names/M/N/Nathan.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -n 1
```

Results: ```2004```

The ```awk``` command ```awk -F, '{ print $3"\t"$1 }'``` says:

- Use the comma as a field delimiter
- Print the third field, then a tab, then the first field

## Pulling data for each name

We want to get this data for all names. Let's see how we can iterate over the lines in a file. We'll use the ```for``` command, with ```$(command)``` substitution supplying the list of values to iterate over, taken from the output of a pipeline of other unix commands:

```shell
for name in $(sort -nr ~/will-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do echo $name; done
```

Next, we'll take these names, and dynamically construct appropriate URLs to pull their age statistics from the website. Then we'll run our awk postprocessing script to get the ages.

```shell
for name in $(sort -nr ~/will-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    curl -s http://rhiever.github.io/name-age-calculator/names/M/${name:0:1}/$name.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1;
done
```

Note that the URLs require us to specify male or female. The above command only gets male results. Let's add another ```for``` loop to iterate over genders. (@WladimirSidorenko notes that the iteration style ```for gender in {'M','F'}``` has only been supported since Bash 4.0, so you may prefer the "old school" alternative ```for gender in M F```.)

```shell
for name in $(sort -nr ~/will-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    for gender in {'M','F'}; do
        curl -s http://rhiever.github.io/name-age-calculator/names/$gender/${name:0:1}/$name.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1;
    done;
done
```

Surprisingly enough, most of these names have significant counts for both girls and boys. But here are the final results:

```
The
ul { list-style: none; margin: 25px 0; padding: 0; }
ul { list-style: none; margin: 25px 0; padding: 0; }
David
1960
1983
Michael
1970
1986
Chris
1961
1961
Mark
1960
1968
```

We get garbage for the non-name "The": the site returns an HTML error page, fragments of which leak into our output. For the names that do work, the peak birth years are mostly in the 1960s. Now for "going to be":

```shell
for name in $(sort -nr ~/going-to-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    for gender in {'M','F'}; do
        curl -s http://rhiever.github.io/name-age-calculator/names/$gender/${name:0:1}/$name.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1;
    done;
done
```

```
Taylor
1992
1993
Nathan
2004
1985
Josh
1979
ul { list-style: none; margin: 25px 0; padding: 0; }
Jon
1964
1958
Hannah
2004
2000
```

Taylor, Nathan, and Hannah all peaked after 1990; Josh peaked in the late 1970s, and only Jon dates back to the 1960s.
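If you want to suppress that HTML noise, one option is to append a ```grep``` that keeps only lines consisting entirely of digits; a sketch (unknown names then simply produce no output):

```shell
for name in $(sort -nr ~/going-to-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    for gender in {'M','F'}; do
        # keep only lines that look like years; HTML fragments are dropped
        curl -s http://rhiever.github.io/name-age-calculator/names/$gender/${name:0:1}/$name.txt \
            | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1 \
            | grep -E '^[0-9]+$';
    done;
done
```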
This data, limited though it is, supports the hypothesis: "going to be" is favored by younger writers on Twitter.

# Next steps

Now that you know *how* to do variationist sociolinguistic analysis in the Unix shell, you may wonder whether one *should* do this.

It will hopefully be obvious that the analysis presented here is not robust enough to include in a research paper. A classical variationist sociolinguistic approach would be to treat the future form as a linguistic variable, and run a logistic regression to estimate the impact of age on the form of this variable. Even better would be a mixed effects model, to control for author-level idiosyncrasies. I won't say this is impossible in the Unix shell, but at the very least, it is the type of thing that one only does when one is trying hard to make some kind of point. A more realistic path forward would be to use the Unix shell to select a relevant set of data, and then write the output to CSV files that could easily be opened as data frames in R or Python.
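For instance, here is a minimal sketch of that last step, building one CSV from the two name files we wrote earlier (the output file name and the variant labels are just illustrative):

```shell
# one row per matched tweet: the variant used, plus the author's name
echo 'variant,name' > ~/future-variants.csv
sed 's/^/going_to,/' ~/going-to-be-all-names.txt >> ~/future-variants.csv
sed 's/^/will,/' ~/will-be-all-names.txt >> ~/future-variants.csv
```

The result could then be loaded with ```read.csv``` in R or ```pandas.read_csv``` in Python, with age looked up per name as above.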