Twitter Sociolinguistics in the Unix Shell
================

In this note, I'll show how to do a very rough sociolinguistic study of data acquired from the Twitter API, using shell scripting tools only.
The data comes in JSON format, so I'll focus on parsing that.
If you want to learn more about the Unix shell, Ken Church's classic [Unix for Poets](http://www.cs.upc.edu/~padro/Unixforpoets.pdf) is a great way to start.

The sociolinguistic hypothesis that we'll test is that the English future tense system is shifting towards increasing
use of the less formal "going to" in place of the more formal "will" (see, e.g., [Tagliamonte and D'Arcy 2009](http://www.ling.upenn.edu/~wlabov/L660/Peaks.pdf)).
According to the sociolinguistic method of *apparent time*, we expect to find that older people remain more likely to use the outgoing form "will", while younger people are more likely to use the incoming form "going to". Let's see whether this holds up on Twitter.

We will use the classic shell tools ```grep, sed, awk, cut, sort```, etc.
I am not a shell expert, and there are likely better ways to do many of these things.
You can get lots of opinions about these topics on Stack Exchange and Stack Overflow, among other places.
(Thanks to @WladimirSidorenko for several helpful comments on improving the examples and presentation!)

# Looking at the data

Twitter data is stored on our group's server in gzipped text files that contain one [JSON](http://www.json.org/)
object per line.
If we combine ```zcat``` (```cat``` for gzipped files) with ```head```,
we can examine the first line of one such file:

```shell
[jeisenstein3@conair new-archive]$ zcat tweets-Feb-14-16-03-35.gz | head -1
","id":58371299,"id_str":"58371299","indices":[3,17]}],"symbols":[],"media":[{"id":696796012514406400,"id_str":"696796012514406400","indices":[36,59],"media_url":"http:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","url":"https:\/\/t.co\/GhVa5HQ5ok","display_url":"pic.twitter.com\/GhVa5HQ5ok","expanded_url":"http:\/\/twitter.com\/TheybelikeDar\/status\/696796019800014848\/photo\/1","type":"photo","sizes":{"small":{"w":340,"h":628,"resize":"fit"},"large":{"w":554,"h":1024,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":554,"h":1024,"resize":"fit"}},"source_status_id":696796019800014848,"source_status_id_str":"696796019800014848","source_user_id":58371299,"source_user_id_str":"58371299"}]},"extended_entities":{"media":[{"id":696796012514406400,"id_str":"696796012514406400","indices":[36,59],"media_url":"http:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/CauESBbUcAAyeu-.jpg","url":"https:\/\/t.co\/GhVa5HQ5ok","display_url":"pic.twitter.com\/GhVa5HQ5ok","expanded_url":"http:\/\/twitter.com\/TheybelikeDar\/status\/696796019800014848\/photo\/1","type":"photo","sizes":{"small":{"w":340,"h":628,"resize":"fit"},"large":{"w":554,"h":1024,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":554,"h":1024,"resize":"fit"}},"source_status_id":696796019800014848,"source_status_id_str":"696796019800014848","source_user_id":58371299,"source_user_id_str":"58371299"}]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1455353581659"} 27 | ``` 28 | 29 | This is a simple pipeline, which pipes the output of ```zcat``` through the command ```head -1```. (Check out the comment from @WladimirSidorenko on the dangers of using pipes.) Anyway, the structure of the JSON file is: 30 | 31 | ```{"key1":"value1","key2":"value2",etc}``` 32 | 33 | However, the values can themselves contain nested JSON objects. Anyway, we want to pull out two fields: "text" and "user.name". 34 | 35 | ## Using a JSON parser 36 | 37 | The easy way to do this is using a json parser, like [jq](https://stedolan.github.io/jq). The following command will pull out the text field from the first three json lines that contain the substring "going to be". 38 | 39 | ```shell 40 | zgrep 'going to be' tweets-Feb-14-16-03-35.gz | head -3 | jq .text 41 | ``` 42 | 43 | The result is: 44 | 45 | ``` 46 | "@StJuliansFC @CwmbranTown hope our games on going to be a nightmare catching up on all these games!" 47 | "im going to be by myself though im scared" 48 | "RT @carterreynolds: Magcon may never be the same but we're ALWAYS going to be one big happy family 😊" 49 | ``` 50 | 51 | To get the text and the username, you can do this: 52 | 53 | ```shell 54 | zgrep 'going to be' tweets-Feb-14-16-03-35.gz | head -3 | jq --raw-output '"\(.user.name)\t\(.text)"' 55 | ``` 56 | Results: 57 | 58 | ``` 59 | machen afc @StJuliansFC @CwmbranTown hope our games on going to be a nightmare catching up on all these games!nicola im going to be by myself though im scared 60 | Love yoΟ… CΞ±ΠΌ RT @carterreynolds: Magcon may never be the same but we're ALWAYS going to be one big happy family 😊 61 | ``` 62 | 63 | The format of the ```jq``` command is a little funny: I don't know why you have to escape the opening parentheses, but not the closing ones. 
# Getting the text with Sed and Grep

If you don't have ```jq```, or you want to learn to use more general tools, you can use Sed (**s**treaming **ed**itor) to capture each of these fields. We're going to use Sed's "s" command. The syntax of this command is essentially:

```shell
sed s/find/replace/options
```

where ```find``` specifies a pattern to match and ```replace``` specifies what to replace that pattern with. We'll always use a single option, ```g```, which says to execute this replacement "globally" -- every time possible in each line.

## One key-value pair per line

As a first step, let's try to break up these big JSON blocks into one key-value pair per line.

```shell
zcat tweets-Feb-14-16-03-35.gz | sed 's/,\"/\n/g' | head -n 5
```

Results:

```
"
id":58371299
id_str":"58371299"
indices":[3,17]}]
symbols":[]
```

Note that this command does not respect JSON's nested structure. For our purposes, that won't matter, but if there were nested "text" fields, we might be in trouble.

## Capture groups

Next, to get the relevant text, we can use a *capture group*. We want to capture the field after the "text" key. Here's how we'll do it:

```shell
sed s/.*\"text\":\"\([^\"]*\)\".*/\1/g
```

This says:

- match all characters before observing the string ```"text":"``` (note the escaped quotation marks).
- then match a sequence of non-quotation characters, ```[^\"]```. The brackets indicate a character class (this is a regex), the caret indicates negation, and then we have the escaped quotation mark.
- by putting (escaped) parens around this pattern, we indicate that we want to capture it: ```\([^\"]*\)```
- then match the closing quote, and all other characters in the line
- in the replace string, ```\1``` means print the capture group
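Before pointing this at real data, it can help to sanity-check the pattern on a single handmade line; here's a minimal sketch (the one-line "tweet" below is made up):

```shell
# a fabricated one-line "tweet", just to exercise the capture group
echo '{"id":123,"text":"hello world","user":{"name":"Jane Doe"}}' | sed 's/.*\"text\":\"\([^\"]*\)\".*/\1/g'
```

The output should be just ```hello world```.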
We call:

```shell
zcat tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\".*/\1/g' | grep -E '^[A-Za-z]' | head -3
```

which uses ```grep``` to keep only lines that start with an alphabetic character. (When the sed pattern fails to match, the original JSON line passes through unchanged, and those lines start with a curly brace.) Here are the first three results:

```
https:\/\/t.co\/QuRvL8exJZ
Que alguien me duerma de una \ud83d\udc4a, \ud83d\ude4f.
relationship status: https:\/\/t.co\/eON4iSSjvz
```

Now let's use a more complicated capture pattern to get the name too. For the name, we will require that it be two capitalized alphabetic strings, ```[A-Z][a-z]* [A-Z][a-z]*```. This is a trick to trade recall for precision, since there are a lot of garbage names on Twitter. Here's what we run:

```shell
zcat tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | head -3
```

Notice that the replace string now is ```\1\t\2```: print the two capture groups, with a tab between them. (The ```\t``` escape in the replacement works in GNU sed; in BSD sed you would insert a literal tab character instead.) Here's the output:

```
https:\/\/t.co\/QuRvL8exJZ	Selena Gomez
Gusto ko ng katext	Roymar Buenvenida
RT @NikitaLovebird: \u0915\u0947\u091c\u0930\u0940\u0935\u093e\u0932~ \u0926\u093e\u0926\u0930\u0940 \u091a\u0932\u094b\u0917\u0947 \n\u0911\u091f\u094b\u0935\u093e\u0932\u093e~ \u0939\u093e \u0939\u093e \u0939\u093e \u0939\u093e \n.\n.\n\u0915\u0947\u091c\u0930\u0940\u0935\u093e\u0932~ JNU \u091a\u0932\u094b\u0917\u0947 \n\u0911\u091f\u094b\u0935\u093e\u0932\u093e~ \u0928\u0939\u0940\u0902\n\u0907\u0938\u092e\u0947 \u0915\u0947\u091c\u0930\u0940 \u0905\u0902\u0915\u0932 \u0915\u094d\u092f\u093e \u0915\u0930 \u0938\u0915\u0924\u0947 \u0939\u0948???\n\u2026	Sahil Kataria
```

## Putting zgrep in front

We're wasting time processing a lot of text strings that are not of interest. So let's put a grep at the front of the pipeline:

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | head -3
```

Here are the results:

```
MGBACKTOWEMBLEY	Matt Goss
God Brat 1, can spew contempt. World champion at ten. He's going to be a fun teenager.	Craig Short
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
```

Notice that the first example doesn't include "going to be" in the text! The string must appear somewhere else in the JSON, maybe in the profile.

## Postfiltering with grep

The solution is to use grep twice: once as a prefiltering step, and once as a postfiltering step. This is a typical design pattern when streaming through big data: use a fast, low-precision filter first, then a slower, high-precision filter at the end.

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'going to be ' | head -5
```

Here are the results:

```
God Brat 1, can spew contempt. World champion at ten. He's going to be a fun teenager.	Craig Short
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
I've got a new toaster. I've a feeling finding an optimum setting is going to be quite the journey. What a time to be alive.	Boring Tweeter
I'm going to be on Broadcasting House on Radio 4 tomorrow explaining why I actually quite enjoy Valentine's Day *ducks*	Henry Jeffreys
```

This time I took the first five hits, because the toaster tweet got repeated three times for some reason. We'll deal with that later.
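In fact, if you don't mind losing the original ordering of the tweets, exact duplicate lines can be collapsed right away with ```sort -u```; here's a sketch of the same pipeline:

```shell
# same pipeline, but collapsing exact duplicate text/name pairs;
# sort -u sorts the lines and keeps one copy of each
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz \
    | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' \
    | grep -E '^[A-Za-z].*' | grep 'going to be ' \
    | sort -u | head -5
```

We'll use ```sort -u``` again below when we collect names.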
# Collecting names

We only want the names of the individuals using these words.
We can collect these using ```cut```.

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'going to be ' | head -5 | cut -f 2
```

Results:

```
Craig Short
Boring Tweeter
Boring Tweeter
Boring Tweeter
Henry Jeffreys
```

Let's write these to a file, one for each pattern:

```shell
zgrep 'going to be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'going to be ' | cut -f 2 | tee ~/going-to-be-all-names.txt
```
```shell
zgrep 'will be ' tweets-Feb-14-16-03-35.gz | sed 's/.*\"text\":\"\([^\"]*\)\",.*name\":\"\([A-Z][a-z]* [A-Z][a-z]*\)\".*/\1\t\2/g' | grep -E '^[A-Za-z].*' | grep 'will be ' | cut -f 2 | tee ~/will-be-all-names.txt
```

Instead of the usual ```>``` redirect, I've used ```tee```, which prints to standard out as well as writing to the file.

## Filtering names

One thing we notice is that one guy, "Cameron Dallas", seems to account for a lot of these messages. Let's just count each name once.

```shell
sort -u ~/will-be-all-names.txt | cut -f 1 -d\  | uniq -c | tee ~/will-be-name-counts.txt
```

(We run the same pipeline on ```~/going-to-be-all-names.txt``` to produce ```~/going-to-be-name-counts.txt```.)

Here's what's happening in this pipeline:

- ```sort -u``` alphabetically sorts all names, and returns the unique entries
- ```cut -f 1 -d\ ``` selects only the first name, by cutting on a single whitespace delimiter. This is valid because we are only admitting names that contain exactly two whitespace-delimited tokens.
- ```uniq -c``` computes the *count* of each unique entry. ```uniq``` requires that its input already be sorted, but we have this from the ```sort``` call at the beginning of the pipeline.
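To see what each stage contributes, you can run the same pipeline on a tiny handmade name list (the names below are invented):

```shell
# sort -u keeps one copy of each full name; cut keeps the first name;
# uniq -c then counts how many distinct full names share each first name
printf 'Ann Smith\nAnn Jones\nAnn Smith\nBob Smith\n' | sort -u | cut -f 1 -d' ' | uniq -c
```

This prints a count of 2 for "Ann" (two distinct full names) and 1 for "Bob"; the repeated "Ann Smith" is counted only once.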
# Looking up ages per name

Let's get the top names for each set:

```shell
sort -nr ~/will-be-name-counts.txt | head -5
```

Results:

```
10 The
10 David
8 Michael
8 Chris
7 Mark
```

```shell
sort -nr ~/going-to-be-name-counts.txt | head -5
```

Results:

```
3 Taylor
3 Nathan
3 Josh
3 Jon
3 Hannah
```

Which names are older? Some answers are online: [http://rhiever.github.io/name-age-calculator/index.html?Gender=M&Name=Nathan](http://rhiever.github.io/name-age-calculator/index.html?Gender=M&Name=Nathan)

## Pulling web data

We can also pull data directly from this site, using the helpful ```curl``` command:

```shell
curl http://rhiever.github.io/name-age-calculator/names/M/N/Nathan.txt
```

This data is comma-delimited. We will use ```awk``` to reorganize it so that the column containing the counts of living people comes first. Then we will sort on this column, to find the year in which the greatest number of living people with that name was born:

```shell
curl http://rhiever.github.io/name-age-calculator/names/M/N/Nathan.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -n 1
```

Results: ```2004```

The ```awk``` command ```awk -F, '{ print $3"\t"$1 }'``` says:

- Use the comma as a field delimiter
- Print the third field, then a tab, then the first field

## Pulling data for each name

We want to get this data for all names. Let's see how we can iterate over the lines in a file. We'll use the ```for``` command, with ```$(command)``` substitution supplying the list of values to iterate over, taken from the output of a pipeline of other unix commands:

```shell
for name in $(sort -nr ~/will-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do echo $name; done
```

Next, we'll take these names, and dynamically construct appropriate URLs to pull their age statistics from the website. Then we'll run our awk postprocessing script to get the ages.

```shell
for name in $(sort -nr ~/will-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    curl -s http://rhiever.github.io/name-age-calculator/names/M/${name:0:1}/$name.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1;
done
```

Note that the URLs require us to specify male or female. The above command only gets male results. Let's add another ```for``` loop to iterate over genders. (@WladimirSidorenko notes that the iteration style ```for gender in {'M','F'}``` has only been supported since Bash 4.0, so you may prefer the "old school" alternative ```for gender in M F```.)

```shell
for name in $(sort -nr ~/will-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    for gender in {'M','F'}; do
        curl -s http://rhiever.github.io/name-age-calculator/names/$gender/${name:0:1}/$name.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1;
    done;
done
```

Surprisingly enough, most of these names have significant counts for both girls and boys. But here are the final results:

```
The
ul { list-style: none; margin: 25px 0; padding: 0; }
ul { list-style: none; margin: 25px 0; padding: 0; }
David
1960
1983
Michael
1970
1986
Chris
1961
1961
Mark
1960
1968
```

We get garbage for the non-name "The": the site returns an HTML error page, fragments of which leak into our output. For the names that do work, the peak birth years are mostly in the 1960s. Now for "going to be":

```shell
for name in $(sort -nr ~/going-to-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    for gender in {'M','F'}; do
        curl -s http://rhiever.github.io/name-age-calculator/names/$gender/${name:0:1}/$name.txt | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1;
    done;
done
```

```
Taylor
1992
1993
Nathan
2004
1985
Josh
1979
ul { list-style: none; margin: 25px 0; padding: 0; }
Jon
1964
1958
Hannah
2004
2000
```

Taylor, Nathan, and Hannah all peaked after 1990; Josh peaked in the late 1970s, and only Jon dates back to the 1960s.
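If you want to suppress that HTML noise, one option is to append a ```grep``` that keeps only lines consisting entirely of digits; a sketch (unknown names then simply produce no output):

```shell
for name in $(sort -nr ~/going-to-be-name-counts.txt | head -5 | sed 's/\([^[:alpha:]]*\)//g'); do
    echo $name;
    for gender in {'M','F'}; do
        # keep only lines that look like years; HTML fragments are dropped
        curl -s http://rhiever.github.io/name-age-calculator/names/$gender/${name:0:1}/$name.txt \
            | awk -F, '{ print $3"\t"$1 }' | sort -n | cut -f 2 | tail -1 \
            | grep -E '^[0-9]+$';
    done;
done
```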
This data, limited though it is, supports the hypothesis: "going to be" is favored by younger writers on Twitter.

# Next steps

Now that you know *how* to do variationist sociolinguistic analysis in the Unix shell, you may wonder whether one *should* do this.

It will hopefully be obvious that the analysis presented here is not robust enough to include in a research paper. A classical variationist sociolinguistic approach would be to treat the future form as a linguistic variable, and run a logistic regression to estimate the impact of age on the form of this variable. Even better would be a mixed effects model, to control for author-level idiosyncrasies. I won't say this is impossible in the Unix shell, but at the very least, it is the type of thing that one only does when one is trying hard to make some kind of point. A more realistic path forward would be to use the Unix shell to select a relevant set of data, and then write the output to CSV files that could easily be opened as data frames in R or Python.
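For instance, here is a minimal sketch of that last step, building one CSV from the two name files we wrote earlier (the output file name and the variant labels are just illustrative):

```shell
# one row per matched tweet: the variant used, plus the author's name
echo 'variant,name' > ~/future-variants.csv
sed 's/^/going_to,/' ~/going-to-be-all-names.txt >> ~/future-variants.csv
sed 's/^/will,/' ~/will-be-all-names.txt >> ~/future-variants.csv
```

The result could then be loaded with ```read.csv``` in R or ```pandas.read_csv``` in Python, with age looked up per name as above.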