├── .gitignore
├── LICENSE
├── README.md
├── notebooks
│   ├── T02 - DataSources.ipynb
│   ├── T03 - Parsing Twitter Data.ipynb
│   ├── T04-08 - Twitter Analytics.ipynb
│   └── files
│       ├── FacebookInstructions_f1.png
│       ├── FacebookLogo.jpg
│       ├── Me2.jpg
│       ├── RedditLogo.jpg
│       ├── TwitterInstructions_f1.png
│       ├── TwitterInstructions_f2.png
│       ├── TwitterInstructions_f3.png
│       ├── TwitterInstructions_f4.png
│       ├── TwitterInstructions_f5.png
│       ├── TwitterInstructions_f6.png
│       ├── TwitterLogo.png
│       ├── fb_screens
│       │   └── Screen Shot 2016-05-25 at 3.17.49 AM.png
│       ├── intermission.jpg
│       ├── reddit_screens
│       │   ├── 0-001.png
│       │   ├── 1-002.png
│       │   └── 1-003.png
│       └── twitter_screens
│           ├── Screen Shot 2016-05-25 at 8.56.39 AM.png
│           ├── Screen Shot 2016-05-25 at 8.56.41 AM.png
│           └── Screen Shot 2016-05-25 at 8.56.57 AM.png
└── slides
    ├── 00 - Introduction.key
    ├── 00 - Introduction.pdf
    ├── 01 - Data Acquisition.key
    ├── 01 - Data Acquisition.pdf
    ├── 02 - Advanced Analysis.key
    └── 02 - Advanced Analysis.pdf
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "{}"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright {yyyy} {name of copyright owner}
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Tutorial on Social Media Analytics During Crisis
2 |
3 | This material is from a tutorial I taught at the 33rd HCIL Symposium at the University of Maryland in May 2016.
4 | I've included both the tutorial's slides and Jupyter notebooks used to demonstrate methods for pulling your own data from Reddit, Facebook, and Twitter.
5 | I didn't see much in the way of other material (beyond what I've posted before about Twitter analysis during Ferguson), so feel free to re-use and redistribute at your leisure!
6 |
7 | ## Tutorial Overview
8 |
9 | This tutorial will build practical experience in using Python + Jupyter Notebooks to analyze and discover insights from social media during times of crisis and social unrest.
10 | We demonstrate how temporal, network, sentiment, and geographic analyses on Twitter can aid in understanding contentious events and enhance storytelling around them.
11 | Examples of events we might cover include the Boston Marathon Bombing and the Charlie Hebdo Attack.
12 | Demonstrations will include hands-on exercises in extracting tweets by location, sentiment analysis, network analysis to visualize groups taking part in the discussion, and detecting high-impact moments in the data.
13 | Most of the work will be performed in the Jupyter notebook framework to aid in repeatable research and support dissemination of results to others.
14 |
15 | ## Material Overview
16 |
17 | ### Tutorial Introduction
18 | - Slides/00 - Introduction.key
19 | - Terror Data sets
20 | - Boston Marathon
21 | - 15 April 2013, 14:49 EDT -> 18:49 UTC
22 | - Charlie Hebdo
23 | - 7 January 2015, 11:30 CET -> 10:30 UTC
24 | - Paris Nov. attacks
25 | - 13 November 2015, 21:20 CET -> 20:20 UTC (until 23:58 UTC)
26 | - Brussels
27 | - 22 March 2016, 7:58 CET -> 6:58 UTC (and 08:11 UTC)
28 |
29 | ### Data Acquisition
30 | - Covered under Slides/01 - Data Acquisition.key
31 | - Topic 1: Introducing the Jupyter Notebook
32 | - Jupyter notebook gallery
33 | - Topic 2: Data sources and collection
34 | - Notebook: __T02 - DataSources.ipynb__
35 | - Data sources:
36 | - Twitter
37 | - Reddit
38 | - Facebook
39 | - Topic 3: Parsing Twitter data
40 | - Notebook: __T03 - Parsing Twitter Data.ipynb__
41 | - JSON format
42 | - Python json.load
43 |
44 | ### Data Analytics
45 | - Notebook: __T04-08 - Twitter Analytics.ipynb__
46 | - Topic 4: Simple frequency analysis
47 | - Top hash tags
48 | - Most common keywords
49 | - Top URLs
50 | - Top images
51 | - Top users
52 | - Top languages
53 | - Most retweeted tweet
54 | - Topic 5: Geographic information systems
55 | - General plotting
56 | - Country plotting
57 | - Images from target location
58 | - Topic 6: Sentiment analysis
59 | - Subjectivity/Objectivity w/ TextBlob
60 | - Topic 7: Other content analysis
61 | - Topics in relevant data
62 | - Topic 8: Network analysis
63 | - Building interaction networks
64 | - Central accounts
65 | - Visualization
66 |
--------------------------------------------------------------------------------
/notebooks/T02 - DataSources.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": false
8 | },
9 | "outputs": [],
10 | "source": [
11 | "%matplotlib inline\n",
12 | "\n",
13 | "import json\n",
14 | "import codecs"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {
20 | "collapsed": true
21 | },
22 | "source": [
23 | "# Topic 2: Collecting Social Media Data\n",
24 | "\n",
25 | "This notebook contains examples for using web-based APIs (Application Programmer Interfaces) to download data from social media platforms.\n",
26 | "Our examples will include:\n",
27 | "\n",
28 | "- Reddit\n",
29 | "- Facebook\n",
30 | "- Twitter\n",
31 | "\n",
32 | "For most services, we need to register with the platform in order to use their API.\n",
33 | "Instructions for the registration processes are outlined in each specific section below.\n",
34 | "\n",
35 | "We will use APIs because they *can* be much faster than manually copying and pasting data from the web site, APIs provide uniform methods for accessing resources (searching for keywords, places, or dates), and it should conform to the platform's terms of service (important for partnering and publications).\n",
36 | "Note however that each of these platforms has strict limits on access times: e.g., requests per hour, search history depth, maximum number of items returned per request, and similar."
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "
\n",
44 | "
\n",
45 | "\n",
46 | "## Topic 2.1: Reddit API\n",
47 | "\n",
48 | "Reddit's API used to be the easiest to use since it did not require credentials to access data on its subreddit pages.\n",
49 | "Unfortunately, this process has been changed, and developers now need to create a Reddit application on Reddit's app page located here: (https://www.reddit.com/prefs/apps/)."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {
56 | "collapsed": false
57 | },
58 | "outputs": [],
59 | "source": [
60 | "# For our first piece of code, we need to import the package \n",
61 | "# that connects to Reddit. Praw is a thin wrapper around reddit's \n",
62 | "# web APIs and works well\n",
63 | "\n",
64 | "import praw"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "### Creating a Reddit Application\n",
72 | "Go to https://www.reddit.com/prefs/apps/.\n",
73 | "Scroll down to \"create application\", select \"web app\", and provide a name, description, and URL (which can be anything).\n",
74 | "\n",
75 | "After you press \"create app\", you will be redirected to a new page with information about your application. Copy the unique identifiers below \"web app\" and beside \"secret\". These are your client_id and client_secret values, which you need below."
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "
\n",
83 | "
\n",
84 | "
"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {
91 | "collapsed": false
92 | },
93 | "outputs": [],
94 | "source": [
95 | "# Now we specify a \"unique\" user agent for our code\n",
96 | "# This is primarily for identification, I think, and some\n",
97 | "# user-agents of bad actors might be blocked\n",
98 | "redditApi = praw.Reddit(client_id='OdpBKZ1utVJw8Q',\n",
99 | " client_secret='KH5zzauulUBG45W-XYeAS5a2EdA',\n",
100 | " user_agent='crisis_informatics_v01')"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "### Capturing Reddit Posts\n",
108 | "\n",
109 | "Now for a given subreddit, we can get the newest posts to that sub. \n",
110 | "Post titles are generally short, so you could treat them as something similar to a tweet."
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {
117 | "collapsed": false
118 | },
119 | "outputs": [],
120 | "source": [
121 | "subreddit = \"worldnews\"\n",
122 | "\n",
123 | "targetSub = redditApi.subreddit(subreddit)\n",
124 | "\n",
125 | "submissions = targetSub.new(limit=10)\n",
126 | "for post in submissions:\n",
127 | " print(post.title)"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "### Leveraging Reddit's Voting\n",
135 | "\n",
136 | "Getting the new posts gives us the most up-to-date information. \n",
137 | "You can also get the \"hot\" posts, \"top\" posts, etc. that should be of higher quality. \n",
138 | "In theory.\n",
139 | "__Caveat emptor__"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {
146 | "collapsed": false
147 | },
148 | "outputs": [],
149 | "source": [
150 | "subreddit = \"worldnews\"\n",
151 | "\n",
152 | "targetSub = redditApi.subreddit(subreddit)\n",
153 | "\n",
154 | "submissions = targetSub.hot(limit=5)\n",
155 | "for post in submissions:\n",
156 | " print(post.title)"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "### Following Multiple Subreddits\n",
164 | "\n",
165 | "Reddit has a mechanism called \"multireddits\" that essentially allow you to view multiple reddits together as though they were one.\n",
166 | "To do this, you need to concatenate your subreddits of interesting using the \"+\" sign."
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": null,
172 | "metadata": {
173 | "collapsed": false
174 | },
175 | "outputs": [],
176 | "source": [
177 | "subreddit = \"worldnews+aww\"\n",
178 | "\n",
179 | "targetSub = redditApi.subreddit(subreddit)\n",
180 | "submissions = targetSub.new(limit=10)\n",
181 | "for post in submissions:\n",
182 | " print(post.title)"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "### Accessing Reddit Comments\n",
190 | "\n",
191 | "While you're never supposed to read the comments, for certain live streams or new and rising posts, the comments may provide useful insight into events on the ground or people's sentiment.\n",
192 | "New posts may not have comments yet though.\n",
193 | "\n",
194 | "Comments are attached to the post title, so for a given submission, you can pull its comments directly.\n",
195 | "\n",
196 | "Note Reddit returns pages of comments to prevent server overload, so you will not get all comments at once and will have to write code for getting more comments than the top ones returned at first.\n",
197 | "This pagination is performed using the MoreXYZ objects (e.g., MoreComments or MorePosts)."
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {
204 | "collapsed": false,
205 | "scrolled": false
206 | },
207 | "outputs": [],
208 | "source": [
209 | "subreddit = \"worldnews\"\n",
210 | "\n",
211 | "breadthCommentCount = 5\n",
212 | "\n",
213 | "targetSub = redditApi.subreddit(subreddit)\n",
214 | "submissions = targetSub.hot(limit=1)\n",
215 | "for post in submissions:\n",
216 | " print (post.title)\n",
217 | " \n",
218 | " post.comment_limit = breadthCommentCount\n",
219 | " \n",
220 | " # Get the top few comments\n",
221 | " for comment in post.comments.list():\n",
222 | " if isinstance(comment, praw.models.MoreComments):\n",
223 | " continue\n",
224 | " \n",
225 | " print (\"---\", comment.name, \"---\")\n",
226 | " print (\"\\t\", comment.body)\n",
227 | " \n",
228 | " for reply in comment.replies.list():\n",
229 | " if isinstance(reply, praw.models.MoreComments):\n",
230 | " continue\n",
231 | " \n",
232 | " print (\"\\t\", \"---\", reply.name, \"---\")\n",
233 | " print (\"\\t\\t\", reply.body)"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "### Other Functionality\n",
241 | "\n",
242 | "Reddit has a deep comment structure, and the code above only goes two levels down (top comment and top comment reply). \n",
243 | "You can view Praw's additional functionality, replete with examples on its website here: http://praw.readthedocs.io/"
244 | ]
245 | },
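{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small, hedged sketch (not part of the original tutorial), the cell below shows one way to pull more than the first page of comments by resolving PRAW's MoreComments placeholders with replace_more(). The subreddit name and limits are arbitrary examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: resolve MoreComments placeholders so we can walk deeper\n",
"# comment threads. Assumes redditApi was created above; the\n",
"# subreddit and limits here are arbitrary examples.\n",
"subreddit = \"worldnews\"\n",
"\n",
"targetSub = redditApi.subreddit(subreddit)\n",
"\n",
"for post in targetSub.hot(limit=1):\n",
"    print (post.title)\n",
"    \n",
"    # replace_more(limit=0) fetches the comment pages hidden\n",
"    # behind every MoreComments placeholder\n",
"    post.comments.replace_more(limit=0)\n",
"    \n",
"    # Flatten the full comment tree and print the first few comments\n",
"    for comment in post.comments.list()[:10]:\n",
"        print (\"---\", comment.name, \"---\")\n",
"        print (\"\\t\", comment.body)"
]
},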
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "
\n",
251 | "
\n",
252 | "\n",
253 | "## Topic 2.2: Facebook API\n",
254 | "\n",
255 | "Getting access to Facebook's API is slightly easier than Twitter's in that you can go to the Graph API explorer, grab an access token, and immediately start playing around with the API.\n",
256 | "The access token isn't good forever though, so if you plan on doing long-term analysis or data capture, you'll need to go the full OAuth route and generate tokens using the approved paths."
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "collapsed": false
264 | },
265 | "outputs": [],
266 | "source": [
267 | "# As before, the first thing we do is import the Facebook\n",
268 | "# wrapper\n",
269 | "\n",
270 | "import facebook"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "### Connecting to the Facebook Graph\n",
278 | "\n",
279 | "Facebook has a \"Graph API\" that lets you explore its social graph. \n",
280 | "For privacy concerns, however, Facebook's Graph API is extremely limited in the kinds of data it can view.\n",
281 | "For instance, Graph API applications can now only view profiles of people who already have installed that particular application.\n",
282 | "These restrictions make it quite difficult to see a lot of Facebook's data.\n",
283 | "\n",
284 | "That being said, Facebook does have many popular public pages (e.g., BBC World News), and articles or messages posted by these public pages are accessible.\n",
285 | "In addition, many posts and comments made in reply to these public posts are also publically available for us to explore.\n",
286 | "\n",
287 | "To connect to Facebook's API though, we need an access token (unlike Reddit's API).\n",
288 | "Fortunately, for research and testing purposes, getting an access token is very easy.\n",
289 | "\n",
290 | "#### Acquiring a Facebook Access Token\n",
291 | "\n",
292 | "1. Log in to your Facebook account\n",
293 | "1. Go to Facebook's Graph Explorer (https://developers.facebook.com/tools/explorer/)\n",
294 | "1. Copy the *long* string out of \"Access Token\" box and paste it in the code cell bedlow\n",
295 | "\n",
296 | "
"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": null,
302 | "metadata": {
303 | "collapsed": true
304 | },
305 | "outputs": [],
306 | "source": [
307 | "fbAccessToken = \"EAACEdEose0cBAKZAZBoGzF6ZAJBk3uSB0gXSgxPrZBJ5nsZCXkM25xZBT0GzVABvsZBOvARxRukoLxhVEyO42QO1D1IInuE1ZBgQfffxh10BC0iHJmnKfNGHn9bY6ioZA8gHTYAXoOGL0A07hZBKXxMKO1yS3ZAPDB50MVGLBxDjJJDWAYBFhUIoeaAaMAZAzxcT4lMZD\""
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "Now we can use the Facebook Graph API with this temporary access token (it does expire after maybe 15 minutes)."
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {
321 | "collapsed": false
322 | },
323 | "outputs": [],
324 | "source": [
325 | "# Connect to the graph API, note we use version 2.5\n",
326 | "graph = facebook.GraphAPI(access_token=fbAccessToken, version='2.5')"
327 | ]
328 | },
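{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Graph Explorer token above is short-lived. As noted earlier, longer-term collection means going the full OAuth route; the cell below is a hedged sketch (not part of the original tutorial) that exchanges a short-lived token for a longer-lived one via Facebook's documented oauth/access_token endpoint. fbAppId and fbAppSecret are placeholders for your own app's credentials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: exchange the short-lived token for a longer-lived one.\n",
"# fbAppId and fbAppSecret are placeholders for your own app's ID\n",
"# and secret from https://developers.facebook.com/apps/\n",
"import requests\n",
"\n",
"fbAppId = \"\"\n",
"fbAppSecret = \"\"\n",
"\n",
"resp = requests.get(\"https://graph.facebook.com/v2.5/oauth/access_token\",\n",
"                    params={\n",
"                        \"grant_type\": \"fb_exchange_token\",\n",
"                        \"client_id\": fbAppId,\n",
"                        \"client_secret\": fbAppSecret,\n",
"                        \"fb_exchange_token\": fbAccessToken,\n",
"                    })\n",
"\n",
"# The response body contains the new, longer-lived access token\n",
"print (resp.text)"
]
},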
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "### Parsing Posts from a Public Page\n",
334 | "\n",
335 | "To get a public page's posts, all you need is the name of the page. \n",
336 | "Then we can pull the page's feed, and for each post on the page, we can pull its comments and the name of the comment's author.\n",
337 | "While it's unlikely that we can get more user information than that, author name and sentiment or text analytics can give insight into bursting topics and demographics."
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {
344 | "collapsed": false,
345 | "scrolled": true
346 | },
347 | "outputs": [],
348 | "source": [
349 | "# What page to look at?\n",
350 | "targetPage = \"nytimes\"\n",
351 | "\n",
352 | "# Other options for pages:\n",
353 | "# nytimes, bbc, bbcamerica, bbcafrica, redcross, disaster\n",
354 | "\n",
355 | "maxPosts = 10 # How many posts should we pull?\n",
356 | "maxComments = 5 # How many comments for each post?\n",
357 | "\n",
358 | "post = graph.get_object(id=targetPage + '/feed')\n",
359 | "\n",
360 | "# For each post, print its message content and its ID\n",
361 | "for v in post[\"data\"][:maxPosts]:\n",
362 | " print (\"---\")\n",
363 | " print (v[\"message\"], v[\"id\"])\n",
364 | " \n",
365 | " # For each comment on this post, print its number, \n",
366 | " # the name of the author, and the message content\n",
367 | " print (\"Comments:\")\n",
368 | " comments = graph.get_object(id='%s/comments' % v[\"id\"])\n",
369 | " for (i, comment) in enumerate(comments[\"data\"][:maxComments]):\n",
370 | " print (\"\\t\", i, comment[\"from\"][\"name\"], comment[\"message\"])\n"
371 | ]
372 | },
373 | {
374 | "cell_type": "markdown",
375 | "metadata": {},
376 | "source": [
377 | "
\n",
378 | "
\n",
379 | "\n",
380 | "## Topic 2.1: Twitter API\n",
381 | "\n",
382 | "Twitter's API is probably the most useful and flexible but takes several steps to configure. \n",
383 | "To get access to the API, you first need to have a Twitter account and have a mobile phone number (or any number that can receive text messages) attached to that account.\n",
384 | "Then, we'll use Twitter's developer portal to create an \"app\" that will then give us the keys tokens and keys (essentially IDs and passwords) we will need to connect to the API.\n",
385 | "\n",
386 | "So, in summary, the general steps are:\n",
387 | "\n",
388 | "0. Have a Twitter account,\n",
389 | "1. Configure your Twitter account with your mobile number,\n",
390 | "2. Create an app on Twitter's developer site, and\n",
391 | "3. Generate consumer and access keys and secrets.\n",
392 | "\n",
393 | "We will then plug these four strings into the code below."
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {
400 | "collapsed": false
401 | },
402 | "outputs": [],
403 | "source": [
404 | "# For our first piece of code, we need to import the package \n",
405 | "# that connects to Twitter. Tweepy is a popular and fully featured\n",
406 | "# implementation.\n",
407 | "\n",
408 | "import tweepy"
409 | ]
410 | },
411 | {
412 | "cell_type": "markdown",
413 | "metadata": {},
414 | "source": [
415 | "### Creating Twitter Credentials\n",
416 | "\n",
417 | "For more in-depth instructions for creating a Twitter account and/or setting up a Twitter account to use the following code, I will provide a walkthrough on configuring and generating this information.\n",
418 | "\n",
419 | "First, we assume you already have a Twitter account.\n",
420 | "If this is not true, either create one real quick or follow along.\n",
421 | "See the attached figures.\n",
422 | "\n",
423 | "- __Step 1. Create a Twitter account__ If you haven't already done this, do this now at Twitter.com.\n",
424 | "\n",
425 | "- __Step 2. Setting your mobile number__ Log into Twitter and go to \"Settings.\" From there, click \"Mobile\" and fill in an SMS-enabled phone number. You will be asked to confirm this number once it's set, and you'll need to do so before you can create any apps for the next step.\n",
426 | "\n",
427 | "
\n",
428 | "
\n",
429 | "\n",
430 | "- __Step 3. Create an app in Twitter's Dev site__ Go to (apps.twitter.com), and click the \"Create New App\" button. Fill in the \"Name,\" \"Description,\" and \"Website\" fields, leaving the callback one blank (we're not going to use it). Note that the website __must__ be a fully qualified URL, so it should look like: http://test.url.com. Then scroll down and read the developer agreement, checking that agree, and finally click \"Create your Twitter application.\"\n",
431 | "\n",
432 | "
\n",
433 | "
\n",
434 | "\n",
435 | "- __Step 4. Generate keys and tokens with this app__ After your application has been created, you will see a summary page like the one below. Click \"Keys and Access Tokens\" to view and manage keys. Scroll down and click \"Create my access token.\" After a moment, your page should refresh, and it should show you four long strings of characters and numbers, a consume key, consumer secret, an access token, and an access secret (note these are __case-sensitive__!). Copy and past these four strings into the quotes in the code cell below.\n",
436 | "\n",
437 | "
\n",
438 | "
"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": null,
444 | "metadata": {
445 | "collapsed": false
446 | },
447 | "outputs": [],
448 | "source": [
449 | "# Use the strings from your Twitter app webpage to populate these four \n",
450 | "# variables. Be sure and put the strings BETWEEN the quotation marks\n",
451 | "# to make it a valid Python string.\n",
452 | "\n",
453 | "consumer_key = \"IQ03DPOdXz95N3rTm2iMNE8va\"\n",
454 | "consumer_secret = \"0qGHOXVSX1D1ffP7BfpIxqFalLfgVIqpecXQy9SrUVCGkJ8hmo\"\n",
455 | "access_token = \"867193453159096320-6oUq9riQW8UBa6nD3davJ0SUe9MvZrZ\"\n",
456 | "access_secret = \"5zMwq2DVhxBnvjabM5SU2Imkoei3AE6UtdeOQ0tzR9eNU\""
457 | ]
458 | },
459 | {
460 | "cell_type": "markdown",
461 | "metadata": {},
462 | "source": [
463 | "### Connecting to Twitter\n",
464 | "\n",
465 | "Once we have the authentication details set, we can connect to Twitter using the Tweepy OAuth handler, as below."
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {
472 | "collapsed": false
473 | },
474 | "outputs": [],
475 | "source": [
476 | "# Now we use the configured authentication information to connect\n",
477 | "# to Twitter's API\n",
478 | "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n",
479 | "auth.set_access_token(access_token, access_secret)\n",
480 | "\n",
481 | "api = tweepy.API(auth)\n",
482 | "\n",
483 | "print(\"Connected to Twitter!\")"
484 | ]
485 | },
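{
"cell_type": "markdown",
"metadata": {},
"source": [
"Twitter rate-limits most endpoints (see the access limits mentioned at the start of this notebook). As an optional, hedged sketch, Tweepy 3.x can be told to sleep through rate limits automatically instead of raising an error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Optional: re-create the API object so Tweepy sleeps through rate\n",
"# limits automatically instead of raising an error when Twitter\n",
"# rejects a request\n",
"api = tweepy.API(auth,\n",
"                 wait_on_rate_limit=True,\n",
"                 wait_on_rate_limit_notify=True)\n",
"\n",
"print(\"Connected to Twitter with rate-limit handling!\")"
]
},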
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "### Testing our Connection\n",
491 | "\n",
492 | "Now that we are connected to Twitter, let's do a brief check that we can read tweets by pulling the first few tweets from our own timeline (or the account associated with your Twitter app) and printing them."
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "metadata": {
499 | "collapsed": false
500 | },
501 | "outputs": [],
502 | "source": [
503 | "# Get tweets from our timeline\n",
504 | "public_tweets = api.home_timeline()\n",
505 | "\n",
506 | "# print the first five authors and tweet texts\n",
507 | "for tweet in public_tweets[:5]:\n",
508 | " print (tweet.author.screen_name, tweet.author.name, \"said:\", tweet.text)"
509 | ]
510 | },
511 | {
512 | "cell_type": "markdown",
513 | "metadata": {},
514 | "source": [
515 | "### Searching Twitter for Keywords\n",
516 | "\n",
517 | "Now that we're connected, we can search Twitter for specific keywords with relative ease just like you were using Twitter's search box.\n",
518 | "While this search only goes back 7 days and/or 1,500 tweets (whichever is less), it can be powerful if an event you want to track just started.\n",
519 | "\n",
520 | "Note that you might have to deal with paging if you get lots of data. Twitter will only return you one page of up to 100 tweets at a time."
521 | ]
522 | },
523 | {
524 | "cell_type": "code",
525 | "execution_count": null,
526 | "metadata": {
527 | "collapsed": false
528 | },
529 | "outputs": [],
530 | "source": [
531 | "# Our search string\n",
532 | "queryString = \"earthquake\"\n",
533 | "\n",
534 | "# Perform the search\n",
535 | "matchingTweets = api.search(queryString)\n",
536 | "\n",
537 | "print (\"Searched for:\", queryString)\n",
538 | "print (\"Number found:\", len(matchingTweets))\n",
539 | "\n",
540 | "# For each tweet that matches our query, print the author and text\n",
541 | "print (\"\\nTweets:\")\n",
542 | "for tweet in matchingTweets:\n",
543 | " print (tweet.author.screen_name, tweet.text)"
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 | "### More Complex Queries\n",
551 | "\n",
552 | "Twitter's Search API exposes many capabilities, like filtering for media, links, mentions, geolocations, dates, etc.\n",
553 | "We can access these capabilities directly with the search function.\n",
554 | "\n",
555 | "For a list of operators Twitter supports, go here: https://dev.twitter.com/rest/public/search"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": null,
561 | "metadata": {
562 | "collapsed": false
563 | },
564 | "outputs": [],
565 | "source": [
566 | "# Lets find only media or links about earthquakes\n",
567 | "queryString = \"earthquake (filter:media OR filter:links)\"\n",
568 | "\n",
569 | "# Perform the search\n",
570 | "matchingTweets = api.search(queryString)\n",
571 | "\n",
572 | "print (\"Searched for:\", queryString)\n",
573 | "print (\"Number found:\", len(matchingTweets))\n",
574 | "\n",
575 | "# For each tweet that matches our query, print the author and text\n",
576 | "print (\"\\nTweets:\")\n",
577 | "for tweet in matchingTweets:\n",
578 | " print (tweet.author.screen_name, tweet.text)"
579 | ]
580 | },
581 | {
582 | "cell_type": "markdown",
583 | "metadata": {},
584 | "source": [
585 | "### Dealing with Pages\n",
586 | "\n",
587 | "As mentioned, Twitter serves results in pages. \n",
588 | "To get all results, we can use Tweepy's Cursor implementation, which handles this iteration through pages for us in the background."
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "execution_count": null,
594 | "metadata": {
595 | "collapsed": false
596 | },
597 | "outputs": [],
598 | "source": [
599 | "# Lets find only media or links about earthquakes\n",
600 | "queryString = \"earthquake (filter:media OR filter:links)\"\n",
601 | "\n",
602 | "# How many tweets should we fetch? Upper limit is 1,500\n",
603 | "maxToReturn = 100\n",
604 | "\n",
605 | "# Perform the search, and for each tweet that matches our query, \n",
606 | "# print the author and text\n",
607 | "print (\"\\nTweets:\")\n",
608 | "for status in tweepy.Cursor(api.search, q=queryString).items(maxToReturn):\n",
609 | " print (status.author.screen_name, status.text)"
610 | ]
611 | },
612 | {
613 | "cell_type": "markdown",
614 | "metadata": {},
615 | "source": [
616 | "### Other Search Functionality\n",
617 | "\n",
618 | "The Tweepy wrapper and Twitter API is pretty extensive.\n",
619 | "You can do things like pull the last 3,200 tweets from other users' timelines, find all retweets of your account, get follower lists, search for users matching a query, etc.\n",
620 | "\n",
621 | "More information on Tweepy's capabilities are available at its documentation page: (http://tweepy.readthedocs.io/en/v3.5.0/api.html)\n",
622 | "\n",
623 | "Other information on the Twitter API is available here: (https://dev.twitter.com/rest/public/search)."
624 | ]
625 | },
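{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, hedged illustration of the timeline functionality mentioned above, the cell below pulls a few recent tweets from another account's public timeline using the same Cursor pattern; the screen name and count are arbitrary examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Pull a few recent tweets from a public account's timeline.\n",
"# The screen name and count are arbitrary examples.\n",
"targetUser = \"nytimes\"\n",
"tweetCount = 10\n",
"\n",
"for status in tweepy.Cursor(api.user_timeline, screen_name=targetUser).items(tweetCount):\n",
"    print (status.created_at, status.author.screen_name, status.text)"
]
},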
626 | {
627 | "cell_type": "markdown",
628 | "metadata": {},
629 | "source": [
630 | "### Twitter Streaming\n",
631 | "\n",
632 | "Up to this point, all of our work has been retrospective. \n",
633 | "An event has occurred, and we want to see how Twitter responded over some period of time. \n",
634 | "\n",
635 | "To follow an event in real time, Twitter and Tweepy support Twitter streaming.\n",
636 | "Streaming is a bit complicated, but it essentially lets of track a set of keywords, places, or users.\n",
637 | "\n",
638 | "To keep things simple, I will provide a simple class and show methods for printing the first few tweets.\n",
639 | "Larger solutions exist specifically for handling Twitter streaming.\n",
640 | "\n",
641 | "You could take this code though and easily extend it by writing data to a file rather than the console.\n",
642 | "I've marked where that code could be inserted."
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": null,
648 | "metadata": {
649 | "collapsed": false
650 | },
651 | "outputs": [],
652 | "source": [
653 | "# First, we need to create our own listener for the stream\n",
654 | "# that will stop after a few tweets\n",
655 | "class LocalStreamListener(tweepy.StreamListener):\n",
656 | " \"\"\"A simple stream listener that breaks out after X tweets\"\"\"\n",
657 | " \n",
658 | " # Max number of tweets\n",
659 | " maxTweetCount = 10\n",
660 | " \n",
661 | " # Set current counter\n",
662 | " def __init__(self):\n",
663 | " tweepy.StreamListener.__init__(self)\n",
664 | " self.currentTweetCount = 0\n",
665 | " \n",
666 | " # For writing out to a file\n",
667 | " self.filePtr = None\n",
668 | " \n",
669 | " # Create a log file\n",
670 | " def set_log_file(self, newFile):\n",
671 | " if ( self.filePtr ):\n",
672 | " self.filePtr.close()\n",
673 | " \n",
674 | " self.filePtr = newFile\n",
675 | " \n",
676 | " # Close log file\n",
677 | " def close_log_file(self):\n",
678 | " if ( self.filePtr ):\n",
679 | " self.filePtr.close()\n",
680 | " \n",
681 | " # Pass data up to parent then check if we should stop\n",
682 | " def on_data(self, data):\n",
683 | "\n",
684 | " print (self.currentTweetCount)\n",
685 | " \n",
686 | " tweepy.StreamListener.on_data(self, data)\n",
687 | " \n",
688 | " if ( self.currentTweetCount >= self.maxTweetCount ):\n",
689 | " return False\n",
690 | "\n",
691 | " # Increment the number of statuses we've seen\n",
692 | " def on_status(self, status):\n",
693 | " self.currentTweetCount += 1\n",
694 | " \n",
695 | " # Could write this status to a file instead of to the console\n",
696 | " print (status.text)\n",
697 | " \n",
698 | " # If we have specified a file, write to it\n",
699 | " if ( self.filePtr ):\n",
700 | " self.filePtr.write(\"%s\\n\" % status._json)\n",
701 | " \n",
702 | " # Error handling below here\n",
703 | " def on_exception(self, exc):\n",
704 | " print (exc)\n",
705 | "\n",
706 | " def on_limit(self, track):\n",
707 | " \"\"\"Called when a limitation notice arrives\"\"\"\n",
708 | " print (\"Limit\", track)\n",
709 | " return\n",
710 | "\n",
711 | " def on_error(self, status_code):\n",
712 | " \"\"\"Called when a non-200 status code is returned\"\"\"\n",
713 | " print (\"Error:\", status_code)\n",
714 | " return False\n",
715 | "\n",
716 | " def on_timeout(self):\n",
717 | " \"\"\"Called when stream connection times out\"\"\"\n",
718 | " print (\"Timeout\")\n",
719 | " return\n",
720 | "\n",
721 | " def on_disconnect(self, notice):\n",
722 | " \"\"\"Called when twitter sends a disconnect notice\n",
723 | " \"\"\"\n",
724 | " print (\"Disconnect:\", notice)\n",
725 | " return\n",
726 | "\n",
727 | " def on_warning(self, notice):\n",
728 | " print (\"Warning:\", notice)\n",
729 | " \"\"\"Called when a disconnection warning message arrives\"\"\"\n",
730 | "\n"
731 | ]
732 | },
733 | {
734 | "cell_type": "markdown",
735 | "metadata": {},
736 | "source": [
737 | "Now we set up the stream using the listener above"
738 | ]
739 | },
740 | {
741 | "cell_type": "code",
742 | "execution_count": null,
743 | "metadata": {
744 | "collapsed": true
745 | },
746 | "outputs": [],
747 | "source": [
748 | "listener = LocalStreamListener()\n",
749 | "localStream = tweepy.Stream(api.auth, listener)"
750 | ]
751 | },
752 | {
753 | "cell_type": "code",
754 | "execution_count": null,
755 | "metadata": {
756 | "collapsed": false,
757 | "scrolled": false
758 | },
759 | "outputs": [],
760 | "source": [
761 | "# Stream based on keywords\n",
762 | "localStream.filter(track=['earthquake', 'disaster'])"
763 | ]
764 | },
765 | {
766 | "cell_type": "code",
767 | "execution_count": null,
768 | "metadata": {
769 | "collapsed": false
770 | },
771 | "outputs": [],
772 | "source": [
773 | "listener = LocalStreamListener()\n",
774 | "localStream = tweepy.Stream(api.auth, listener)\n",
775 | "\n",
776 | "# List of screen names to track\n",
777 | "screenNames = ['bbcbreaking', 'CNews', 'bbc', 'nytimes']\n",
778 | "\n",
779 | "# Twitter stream uses user IDs instead of names\n",
780 | "# so we must convert\n",
781 | "userIds = []\n",
782 | "for sn in screenNames:\n",
783 | " user = api.get_user(sn)\n",
784 | " userIds.append(user.id_str)\n",
785 | "\n",
786 | "# Stream based on users\n",
787 | "localStream.filter(follow=userIds)"
788 | ]
789 | },
790 | {
791 | "cell_type": "code",
792 | "execution_count": null,
793 | "metadata": {
794 | "collapsed": false
795 | },
796 | "outputs": [],
797 | "source": [
798 | "listener = LocalStreamListener()\n",
799 | "localStream = tweepy.Stream(api.auth, listener)\n",
800 | "\n",
801 | "# Specify coordinates for a bounding box around area of interest\n",
802 | "# In this case, we use San Francisco\n",
803 | "swCornerLat = 36.8\n",
804 | "swCornerLon = -122.75\n",
805 | "neCornerLat = 37.8\n",
806 | "neCornerLon = -121.75\n",
807 | "\n",
808 | "boxArray = [swCornerLon, swCornerLat, neCornerLon, neCornerLat]\n",
809 | "\n",
810 | "# Say we want to write these tweets to a file\n",
811 | "listener.set_log_file(codecs.open(\"tweet_log.json\", \"w\", \"utf8\"))\n",
812 | "\n",
813 | "# Stream based on location\n",
814 | "localStream.filter(locations=boxArray)\n",
815 | "\n",
816 | "# Close the log file\n",
817 | "listener.close_log_file()"
818 | ]
819 | },
820 | {
821 | "cell_type": "code",
822 | "execution_count": null,
823 | "metadata": {
824 | "collapsed": true
825 | },
826 | "outputs": [],
827 | "source": []
828 | }
829 | ],
830 | "metadata": {
831 | "kernelspec": {
832 | "display_name": "Python 3",
833 | "language": "python",
834 | "name": "python3"
835 | },
836 | "language_info": {
837 | "codemirror_mode": {
838 | "name": "ipython",
839 | "version": 3
840 | },
841 | "file_extension": ".py",
842 | "mimetype": "text/x-python",
843 | "name": "python",
844 | "nbconvert_exporter": "python",
845 | "pygments_lexer": "ipython3",
846 | "version": "3.6.0"
847 | }
848 | },
849 | "nbformat": 4,
850 | "nbformat_minor": 0
851 | }
852 |
--------------------------------------------------------------------------------
/notebooks/T03 - Parsing Twitter Data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": false
8 | },
9 | "outputs": [],
10 | "source": [
11 | "%matplotlib inline\n",
12 | "\n",
13 | "import time\n",
14 | "import calendar\n",
15 | "import codecs\n",
16 | "import datetime\n",
17 | "import sys\n",
18 | "import gzip\n",
19 | "import string\n",
20 | "import glob\n",
21 | "import os\n",
22 | "\n",
23 | "# For parsing JSON\n",
24 | "import json"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "# Topic 3. JSON - JavaScript Object Notation\n",
32 | "\n",
33 | "Much of the data with which we will work comes in the JavaScript Object Notation (JSON) format.\n",
34 | "JSON is a lightweight text format that allows one to describe objects by __keys__ and __values__ without needing to specify a schema beforehand (as compared to XML).\n",
35 | "\n",
36 | "Many \"RESTful\" APIs available on the web today return data in JSON format, and the data we have stored from Twitter follows this rule as well.\n",
37 | "\n",
38 | "Python's JSON support is relatively robust and is included in the language under the json package.\n",
39 | "This package allows us to read and write JSON to/from a string or file and convert many of Python's types into a text format."
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## JSON and Keys/Values\n",
47 | "\n",
48 | "The main idea here is that JSON allows one to specify a key, or name, for some data and then that data's value as a string, number, or object.\n",
49 | "\n",
50 | "An example line of JSON might look like:\n",
51 | "\n",
52 | "> {\"key\": \"value\"}"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": null,
58 | "metadata": {
59 | "collapsed": false
60 | },
61 | "outputs": [],
62 | "source": [
63 | "jsonString = '{\"key\": \"value\"}'\n",
64 | "\n",
65 | "# Parse the JSON string\n",
66 | "dictFromJson = json.loads(jsonString)\n",
67 | "\n",
68 | "# Python now has a dictionary representing this data\n",
69 | "print (\"Resulting dictionary object:\\n\", dictFromJson)\n",
70 | "\n",
71 | "# Will print the value\n",
72 | "print (\"Data stored in \\\"key\\\":\\n\", dictFromJson[\"key\"])\n",
73 | "\n",
74 | "# This will cause an error!\n",
75 | "print (\"Data stored in \\\"value\\\":\\n\", dictFromJson[\"value\"])"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "## Multile Keys and Values\n",
83 | "\n",
84 | "A JSON string/file can have many keys and values, but a key should always have a value.\n",
85 | "We can have values without keys if we're doing arrays, but this can be awkward.\n",
86 | "\n",
87 | "An example of JSON string with multiple keys is below:\n",
88 | "\n",
89 | "``\n",
90 | "{\n",
91 | "\"name\": \"Cody\",\n",
92 | "\"occupation\": \"Student\",\n",
93 | "\"goal\": \"PhD\"\n",
94 | "}\n",
95 | "``\n",
96 | "\n",
97 | "Note the __comma__ after the first two values. \n",
98 | "These commas are needed for valid JSON and to separate keys from other values."
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "metadata": {
105 | "collapsed": false
106 | },
107 | "outputs": [],
108 | "source": [
109 | "jsonString = '{ \"name\": \"Cody\", \"occupation\": \"PostDoc\", \"goal\": \"Tenure\" }'\n",
110 | "\n",
111 | "# Parse the JSON string\n",
112 | "dictFromJson = json.loads(jsonString)\n",
113 | "\n",
114 | "# Python now has a dictionary representing this data\n",
115 | "print (\"Resulting dictionary object:\\n\", dictFromJson)"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "## JSON and Arrays\n",
123 | "\n",
124 | "The above JSON string describes an __object__ whose name is \"Cody\".\n",
125 | "How would we describe a list of similar students?\n",
126 | "Arrays are useful here and are denoted with \"[]\" rather than the \"{}\" object notation.\n",
127 | "For example:\n",
128 | "\n",
129 | "``\n",
130 | "{\n",
131 | " \"students\": [\n",
132 | " {\n",
133 | " \"name\": \"Cody\",\n",
134 | " \"occupation\": \"Student\",\n",
135 | " \"goal\": \"PhD\"\n",
136 | " },\n",
137 | " {\n",
138 | " \"name\": \"Scott\",\n",
139 | " \"occupation\": \"Student\",\n",
140 | " \"goal\": \"Masters\"\n",
141 | " }\n",
142 | " ]\n",
143 | "}\n",
144 | "``\n",
145 | "\n",
146 | "Again, note the comma between the \"}\" and \"{\" separating the two student objects and how they are both surrounded by \"[]\"."
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {
153 | "collapsed": false
154 | },
155 | "outputs": [],
156 | "source": [
157 | "jsonString = '{\"students\": [{\"name\": \"Cody\", \"occupation\": \"PostDoc\", \"goal\": \"Tenure\"}, {\"name\": \"Scott\", \"occupation\": \"Student\", \"goal\": \"Masters\"}]}'\n",
158 | "\n",
159 | "# Parse the JSON string\n",
160 | "dictFromJson = json.loads(jsonString)\n",
161 | "\n",
162 | "# Python now has a dictionary representing this data\n",
163 | "print (\"Resulting array:\\n\", dictFromJson)\n",
164 | "\n",
165 | "print (\"Each student:\")\n",
166 | "for student in dictFromJson[\"students\"]:\n",
167 | " print (student)"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "## More JSON + Arrays\n",
175 | "\n",
176 | "A couple of things to note:\n",
177 | "1. JSON does not *need* a name for the array. It could be declared just as an array.\n",
178 | "1. The student objects need not be identical.\n",
179 | "\n",
180 | "As an example:\n",
181 | "\n",
182 | "``\n",
183 | "[\n",
184 | " {\n",
185 | " \"name\": \"Cody\",\n",
186 | " \"occupation\": \"Student\",\n",
187 | " \"goal\": \"PhD\"\n",
188 | " },\n",
189 | " {\n",
190 | " \"name\": \"Scott\",\n",
191 | " \"occupation\": \"Student\",\n",
192 | " \"goal\": \"Masters\",\n",
193 | " \"completed\": true\n",
194 | " }\n",
195 | "]\n",
196 | "``"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {
203 | "collapsed": false
204 | },
205 | "outputs": [],
206 | "source": [
207 | "jsonString = '[{\"name\": \"Cody\",\"occupation\": \"PostDoc\",\"goal\": \"Tenure\"},{\"name\": \"Scott\",\"occupation\": \"Student\",\"goal\": \"Masters\",\"completed\": true}]'\n",
208 | "\n",
209 | "# Parse the JSON string\n",
210 | "arrFromJson = json.loads(jsonString)\n",
211 | "\n",
212 | "# Python now has an array representing this data\n",
213 | "print (\"Resulting array:\\n\", arrFromJson)\n",
214 | "\n",
215 | "print (\"Each student:\")\n",
216 | "for student in arrFromJson:\n",
217 | " print (student)"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "## Nested JSON Objects\n",
225 | "\n",
226 | "We've shown you can have an array as a value, and you can do the same with objects.\n",
227 | "In fact, one of the powers of JSON is its essentially infinite depth/expressability. \n",
228 | "You can very easily nest objects within objects, and JSON in the wild relies on this heavily.\n",
229 | "\n",
230 | "An example:\n",
231 | "\n",
232 | "``\n",
233 | "{\n",
234 | " \"disasters\" : [\n",
235 | " {\n",
236 | " \"event\": \"Nepal Earthquake\",\n",
237 | " \"date\": \"25 April 2015\",\n",
238 | " \"casualties\": 8964,\n",
239 | " \"magnitude\": 7.8,\n",
240 | " \"affectedAreas\": [\n",
241 | " {\n",
242 | " \"country\": \"Nepal\",\n",
243 | " \"capital\": \"Kathmandu\",\n",
244 | " \"population\": 26494504\n",
245 | " },\n",
246 | " {\n",
247 | " \"country\": \"India\",\n",
248 | " \"capital\": \"New Dehli\",\n",
249 | " \"population\": 1276267000\n",
250 | " },\n",
251 | " {\n",
252 | " \"country\": \"China\",\n",
253 | " \"capital\": \"Beijing\",\n",
254 | " \"population\": 1376049000\n",
255 | " },\n",
256 | " {\n",
257 | " \"country\": \"Bangladesh\",\n",
258 | " \"capital\": \"Dhaka\",\n",
259 | " \"population\": 168957745\n",
260 | " }\n",
261 | " ]\n",
262 | " }\n",
263 | " ]\n",
264 | "}\n",
265 | "``"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "metadata": {
272 | "collapsed": false
273 | },
274 | "outputs": [],
275 | "source": [
276 | "jsonString = '{\"disasters\" : [{\"event\": \"Nepal Earthquake\",\"date\": \"25 April 2015\",\"casualties\": 8964,\"magnitude\": 7.8,\"affectedAreas\": [{\"country\": \"Nepal\",\"capital\": \"Kathmandu\",\"population\": 26494504},{\"country\": \"India\",\"capital\": \"New Dehli\",\"population\": 1276267000},{\"country\": \"China\",\"capital\": \"Beijing\",\"population\": 1376049000},{\"country\": \"Bangladesh\",\"capital\": \"Dhaka\",\"population\": 168957745}]}]}'\n",
277 | "\n",
278 | "disasters = json.loads(jsonString)\n",
279 | "\n",
280 | "for disaster in disasters[\"disasters\"]:\n",
281 | " print (disaster[\"event\"])\n",
282 | " print (disaster[\"date\"])\n",
283 | " \n",
284 | " for country in disaster[\"affectedAreas\"]:\n",
285 | " print (country[\"country\"])"
286 | ]
287 | },
288 | {
289 | "cell_type": "markdown",
290 | "metadata": {},
291 | "source": [
292 | "## From Python Dictionaries to JSON\n",
293 | "\n",
294 | "We can also go from a Python object to JSON with relative ease."
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "metadata": {
301 | "collapsed": false
302 | },
303 | "outputs": [],
304 | "source": [
305 | "exObj = {\n",
306 | " \"event\": \"Nepal Earthquake\",\n",
307 | " \"date\": \"25 April 2015\",\n",
308 | " \"casualties\": 8964,\n",
309 | " \"magnitude\": 7.8\n",
310 | "}\n",
311 | "\n",
312 | "print (\"Python Object:\", exObj, \"\\n\")\n",
313 | "\n",
314 | "# now we can convert to JSON\n",
315 | "print (\"Object JSON:\")\n",
316 | "print (json.dumps(exObj), \"\\n\")\n",
317 | "\n",
318 | "# We can also pretty-print the JSON\n",
319 | "print (\"Readable JSON:\")\n",
320 | "print (json.dumps(exObj, indent=4)) # Indent adds space"
321 | ]
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {},
326 | "source": [
327 | "## Reading Twitter JSON\n",
328 | "\n",
329 | "We should now have all the tools necessary to understand how Python can read Twitter JSON data.\n",
330 | "To show this, we'll read in a single tweet from the Ferguson, MO protests review its format, and parse it with Python's JSON loader."
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [],
340 | "source": [
341 | "tweetFilename = \"first_BlackLivesMatter.json\"\n",
342 | "\n",
343 | "# Use Python's os.path.join to account for Windows, OSX/Linux differences\n",
344 | "tweetFilePath = os.path.join(\"..\", \"00_data\", \"ferguson\", tweetFilename)\n",
345 | "\n",
346 | "print (\"Opening\", tweetFilePath)\n",
347 | "\n",
348 | "# We use codecs to ensure we open the file in Unicode format,\n",
349 | "# which supports larger character encodings\n",
350 | "tweetFile = codecs.open(tweetFilePath, \"r\", \"utf8\")\n",
351 | "\n",
352 | "# Read in the whole file, which contains ONE tweet and close\n",
353 | "tweetFileContent = tweetFile.read()\n",
354 | "tweetFile.close()\n",
355 | "\n",
356 | "# Print the raw json\n",
357 | "print (\"Raw Tweet JSON:\\n\")\n",
358 | "print (tweetFileContent)\n",
359 | "\n",
360 | "# Convert the JSON to a Python object\n",
361 | "tweet = json.loads(tweetFileContent)\n",
362 | "print (\"Tweet Object:\\n\")\n",
363 | "print (tweet)\n",
364 | "\n",
365 | "# We could have done this in one step with json.load() \n",
366 | "# called on the open file, but our data files have\n",
367 | "# a single tweet JSON per line, so this is more consistent"
368 | ]
369 | },
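370 |   {
371 |    "cell_type": "markdown",
372 |    "metadata": {},
373 |    "source": [
374 |     "As the comment above notes, our larger data files store one tweet JSON object per line. A minimal sketch of that per-line pattern (the filename is hypothetical):\n",
375 |     "\n",
376 |     "```python\n",
377 |     "with codecs.open(\"many_tweets.json\", \"r\", \"utf8\") as tweetFile:\n",
378 |     "    tweets = [json.loads(line) for line in tweetFile if line.strip()]\n",
379 |     "```"
380 |    ]
381 |   },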
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "## Twitter JSON Fields\n",
375 | "\n",
376 | "This tweet is pretty big, but we can still see some of the fields it contains. \n",
377 | "Note it also has many nested fields.\n",
378 | "We'll go through some of the more important fields below."
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": null,
384 | "metadata": {
385 | "collapsed": false
386 | },
387 | "outputs": [],
388 | "source": [
389 | "# What fields can we see?\n",
390 | "print (\"Keys:\")\n",
391 | "for k in sorted(tweet.keys()):\n",
392 | " print (\"\\t\", k)\n",
393 | "\n",
394 | "print (\"Tweet Text:\", tweet[\"text\"])\n",
395 | "print (\"User Name:\", tweet[\"user\"][\"screen_name\"])\n",
396 | "print (\"Author:\", tweet[\"user\"][\"name\"])\n",
397 | "print(\"Source:\", tweet[\"source\"])\n",
398 | "print(\"Retweets:\", tweet[\"retweet_count\"])\n",
399 | "print(\"Favorited:\", tweet[\"favorite_count\"])\n",
400 | "print(\"Tweet Location:\", tweet[\"place\"])\n",
401 | "print(\"Tweet GPS Coordinates:\", tweet[\"coordinates\"])\n",
402 | "print(\"Twitter's Guessed Language:\", tweet[\"lang\"])\n",
403 | "\n",
404 | "# Tweets have a list of hashtags, mentions, URLs, and other\n",
405 | "# attachments in \"entities\" field\n",
406 | "print (\"\\n\", \"Entities:\")\n",
407 | "for eType in tweet[\"entities\"]:\n",
408 | " print (\"\\t\", eType)\n",
409 | " \n",
410 | " for e in tweet[\"entities\"][eType]:\n",
411 | " print (\"\\t\\t\", e)"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": null,
417 | "metadata": {
418 | "collapsed": true
419 | },
420 | "outputs": [],
421 | "source": []
422 | }
423 | ],
424 | "metadata": {
425 | "kernelspec": {
426 | "display_name": "Python 3",
427 | "language": "python",
428 | "name": "python3"
429 | },
430 | "language_info": {
431 | "codemirror_mode": {
432 | "name": "ipython",
433 | "version": 3
434 | },
435 | "file_extension": ".py",
436 | "mimetype": "text/x-python",
437 | "name": "python",
438 | "nbconvert_exporter": "python",
439 | "pygments_lexer": "ipython3",
440 | "version": "3.6.0"
441 | }
442 | },
443 | "nbformat": 4,
444 | "nbformat_minor": 0
445 | }
446 |
--------------------------------------------------------------------------------
/notebooks/T04-08 - Twitter Analytics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": false
8 | },
9 | "outputs": [],
10 | "source": [
11 | "%matplotlib inline\n",
12 | "\n",
13 | "import time\n",
14 | "import calendar\n",
15 | "import codecs\n",
16 | "import datetime\n",
17 | "import json\n",
18 | "import sys\n",
19 | "import gzip\n",
20 | "import string\n",
21 | "import glob\n",
22 | "import requests\n",
23 | "import os\n",
24 | "\n",
25 | "import numpy as np"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "# Twitter Crisis Anlytics\n",
33 | "\n",
34 | "The following notebook walks us through a number of capabilities or common pieces of functionality one may want when analyzing Twitter following a crisis.\n",
35 | "We will start by defining information for a set of events for which we have data."
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "crisisInfo = {\n",
47 | " \"boston\": {\n",
48 | " \"name\": \"Boston Marathon Bombing\",\n",
49 | " \"time\": 1366051740, # Timestamp in seconds since 1/1/1970, UTC\n",
50 | " # 15 April 2013, 14:49 EDT -> 18:49 UTC\n",
51 | " \"directory\": \"boston\",\n",
52 | " \"keywords\": [\"boston\", \"exploision\", \"bomb\", \"marathon\"],\n",
53 | " \"box\": { # Bounding box for geographic limits\n",
54 | " \"lowerLeftLon\": -124.848974,\n",
55 | " \"lowerLeftLat\": 24.396308,\n",
56 | " \"upperRightLon\": -66.885444,\n",
57 | " \"upperRightLat\": 49.384358,\n",
58 | " }\n",
59 | " },\n",
60 | " \n",
61 | " \"paris_hebdo\": {\n",
62 | " \"name\": \"Charlie Hebdo Attack\",\n",
63 | " \"time\": 1420626600, # Timestamp in seconds since 1/1/1970, UTC\n",
64 | " # 7 January 2015, 11:30 CET -> 10:30 UTC\n",
65 | " \"directory\": \"paris_hebdo\",\n",
66 | " \"keywords\": [\"paris\", \"hebdo\"],\n",
67 | " \"box\": {\n",
68 | " \"lowerLeftLon\": -5.1406,\n",
69 | " \"lowerLeftLat\": 41.33374,\n",
70 | " \"upperRightLon\": 9.55932,\n",
71 | " \"upperRightLat\": 51.089062,\n",
72 | " }\n",
73 | " },\n",
74 | " \n",
75 | " \"nepal\": {\n",
76 | " \"name\": \"Nepal Earthquake\",\n",
77 | " \"time\": 1429942286, # Timestamp in seconds since 1/1/1970, UTC\n",
78 | " # 25 April 2015, 6:11:26 UTC\n",
79 | " \"directory\": \"nepal\",\n",
80 | " \"keywords\": [\"nepal\", \"earthquake\", \"quake\", \"nsgs\"],\n",
81 | " \"box\": {\n",
82 | " \"lowerLeftLon\": 80.0562,\n",
83 | " \"lowerLeftLat\": 26.3565,\n",
84 | " \"upperRightLon\": 88.1993,\n",
85 | " \"upperRightLat\": 30.4330,\n",
86 | " }\n",
87 | " },\n",
88 | " \n",
89 | " \"paris_nov\": {\n",
90 | " \"name\": \"Paris November Attacks\",\n",
91 | " \"time\": 1447446000, # Timestamp in seconds since 1/1/1970, UTC\n",
92 | " # 13 November 2015, 20:20 UTC to 23:58 UTC\n",
93 | " \"directory\": \"paris_nov\",\n",
94 | " \"keywords\": [\"paris\", \"shots\", \"explosion\"],\n",
95 | " \"box\": {\n",
96 | " \"lowerLeftLon\": -5.1406,\n",
97 | " \"lowerLeftLat\": 41.33374,\n",
98 | " \"upperRightLon\": 9.55932,\n",
99 | " \"upperRightLat\": 51.089062,\n",
100 | " }\n",
101 | " },\n",
102 | " \n",
103 | " \"brussels\": {\n",
104 | " \"name\": \"Brussels Transit Attacks\",\n",
105 | " \"time\": 1458629880, # Timestamp in seconds since 1/1/1970, UTC\n",
106 | " # 22 March 2016, 6:58 UTC to 08:11 UTC\n",
107 | " \"directory\": \"brussels\",\n",
108 | " \"keywords\": [\"brussels\", \"bomb\", \"belgium\", \"explosion\"],\n",
109 | " \"box\": {\n",
110 | " \"lowerLeftLon\": 2.54563,\n",
111 | " \"lowerLeftLat\": 49.496899,\n",
112 | " \"upperRightLon\": 6.40791,\n",
113 | " \"upperRightLat\": 51.5050810,\n",
114 | " }\n",
115 | " },\n",
116 | "}"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "## Choose Your Crisis\n",
124 | "\n",
125 | "Since we have several disasters we can look at and don't have time to explore them all, you can pick one and follow along with our analysis on the crisis that interests you.\n",
126 | "\n",
127 | "To select the crisis you want, pick from the list printed below."
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": null,
133 | "metadata": {
134 | "collapsed": false
135 | },
136 | "outputs": [],
137 | "source": [
138 | "print (\"Available Crisis Names:\")\n",
139 | "for k in sorted(crisisInfo.keys()):\n",
140 | " print (\"\\t\", k)"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {
147 | "collapsed": true
148 | },
149 | "outputs": [],
150 | "source": [
151 | "# Replace the name below with your selected crisis\n",
152 | "selectedCrisis = \"nepal\""
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "
\n",
160 | "\n",
161 | "## Topic 3.1: Reading Tweets\n",
162 | "\n",
163 | "The first thing we do is read in tweets from a directory of compressed files. Our collection of compressed tweets is in the 00_data directory, so we'll use pattern matching (called \"globbing\") to find all the tweet files in the given directory.\n",
164 | "\n",
165 | "Then, for each file, we'll open it, read each line (which is a tweet in JSON form), and build an object out of it. As part of this process, we will extract each tweet's post time and create a map from minute timestamps to the tweets posted during that minute."
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {
172 | "collapsed": false
173 | },
174 | "outputs": [],
175 | "source": [
176 | "# Determine host-specific location of data\n",
177 | "tweetDirectory = crisisInfo[selectedCrisis][\"directory\"]\n",
178 | "tweetGlobPath = os.path.join(\"..\", \"00_data\", tweetDirectory, \"statuses.log.*.gz\")\n",
179 | "\n",
180 | "print (\"Reading files from:\", tweetGlobPath)\n",
181 | "\n",
182 | "# Dictionary for mapping dates to data\n",
183 | "frequencyMap = {}\n",
184 | "\n",
185 | "# For counting tweets\n",
186 | "globalTweetCounter = 0\n",
187 | "\n",
188 | "# Twitter's time format, for parsing the created_at date\n",
189 | "timeFormat = \"%a %b %d %H:%M:%S +0000 %Y\"\n",
190 | "\n",
191 | "reader = codecs.getreader(\"utf-8\")\n",
192 | "\n",
193 | "for tweetFilePath in glob.glob(tweetGlobPath):\n",
194 | " print (\"Reading File:\", tweetFilePath)\n",
195 | "\n",
196 | " for line in gzip.open(tweetFilePath, 'rb'):\n",
197 | "\n",
198 | " # Try to read tweet JSON into object\n",
199 | " tweetObj = None\n",
200 | " try:\n",
201 | " tweetObj = json.loads(reader.decode(line)[0])\n",
202 | " except Exception as e:\n",
203 | " continue\n",
204 | "\n",
205 | " # Deleted status messages and protected status must be skipped\n",
206 | " if ( \"delete\" in tweetObj.keys() or \"status_withheld\" in tweetObj.keys() ):\n",
207 | " continue\n",
208 | "\n",
209 | " # Try to extract the time of the tweet\n",
210 | " try:\n",
211 | " currentTime = datetime.datetime.strptime(tweetObj['created_at'], timeFormat)\n",
212 | " except:\n",
213 | " print (line)\n",
214 | " raise\n",
215 | "\n",
216 | " currentTime = currentTime.replace(second=0)\n",
217 | "\n",
218 | " # Increment tweet count\n",
219 | " globalTweetCounter += 1\n",
220 | "\n",
221 | " # If our frequency map already has this time, use it, otherwise add\n",
222 | " if ( currentTime in frequencyMap.keys() ):\n",
223 | " timeMap = frequencyMap[currentTime]\n",
224 | " timeMap[\"count\"] += 1\n",
225 | " timeMap[\"list\"].append(tweetObj)\n",
226 | " else:\n",
227 | " frequencyMap[currentTime] = {\"count\":1, \"list\":[tweetObj]}\n",
228 | "\n",
229 | "# Fill in any gaps\n",
230 | "times = sorted(frequencyMap.keys())\n",
231 | "firstTime = times[0]\n",
232 | "lastTime = times[-1]\n",
233 | "thisTime = firstTime\n",
234 | "\n",
235 | "# We want to look at per-minute data, so we fill in any missing minutes\n",
236 | "timeIntervalStep = datetime.timedelta(0, 60) # Time step in seconds\n",
237 | "while ( thisTime <= lastTime ):\n",
238 | " if ( thisTime not in frequencyMap.keys() ):\n",
239 | " frequencyMap[thisTime] = {\"count\":0, \"list\":[]}\n",
240 | " \n",
241 | " thisTime = thisTime + timeIntervalStep\n",
242 | "\n",
243 | "print (\"Processed Tweet Count:\", globalTweetCounter)"
244 | ]
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {
249 | "collapsed": true
250 | },
251 | "source": [
252 | "
\n",
253 | "# Topic 4: Simple Frequency Analysis\n",
254 | "\n",
255 | "In this section, we will cover a few simple analysis techniques to garner some small insights rapidly.\n",
256 | "\n",
257 | "- Frequency Graph\n",
258 | "- Top users\n",
259 | "- Top hash tags\n",
260 | "- Top URLs\n",
261 | "- Top images\n",
262 | "- Most retweeted tweet\n",
263 | "- Keyword Frequency"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "### Twitter Timeline \n",
271 | "\n",
272 | "To build a timeline of Twitter usage, we can simply plot the number of tweets posted per minute."
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": null,
278 | "metadata": {
279 | "collapsed": false
280 | },
281 | "outputs": [],
282 | "source": [
283 | "import matplotlib.pyplot as plt\n",
284 | "\n",
285 | "crisisMoment = crisisInfo[selectedCrisis][\"time\"]\n",
286 | "crisisTime = datetime.datetime.utcfromtimestamp(crisisMoment)\n",
287 | "crisisTime = crisisTime.replace(second=0)\n",
288 | "print (\"Crisis Time:\", crisisTime)\n",
289 | "\n",
290 | "fig, ax = plt.subplots()\n",
291 | "fig.set_size_inches(11, 8.5)\n",
292 | "\n",
293 | "plt.title(\"Tweet Frequency\")\n",
294 | "\n",
295 | "# Sort the times into an array for future use\n",
296 | "sortedTimes = sorted(frequencyMap.keys())\n",
297 | "\n",
298 | "# What time span do these tweets cover?\n",
299 | "print (\"Time Frame:\", sortedTimes[0], sortedTimes[-1])\n",
300 | "\n",
301 | "# Get a count of tweets per minute\n",
302 | "postFreqList = [frequencyMap[x][\"count\"] for x in sortedTimes]\n",
303 | "\n",
304 | "# We'll have ticks every few minutes (more clutters the graph)\n",
305 | "smallerXTicks = range(0, len(sortedTimes), 10)\n",
306 | "plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)\n",
307 | "\n",
308 | "# Plot the post frequency\n",
309 | "yData = [x if x > 0 else 0 for x in postFreqList]\n",
310 | "ax.plot(range(len(frequencyMap)), yData, color=\"blue\", label=\"Posts\")\n",
311 | "\n",
312 | "crisisXCoord = sortedTimes.index(crisisTime)\n",
313 | "ax.scatter([crisisXCoord], [np.mean(yData)], c=\"r\", marker=\"x\", s=100, label=\"Crisis\")\n",
314 | "\n",
315 | "ax.grid(b=True, which=u'major')\n",
316 | "ax.legend()\n",
317 | "\n",
318 | "plt.show()"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "### Top Twitter Users\n",
326 | "\n",
327 | "Finding good sources of information is really important during crises. \n",
328 | "On Twitter, the loudest or most prolific users are not necessarily good sources though.\n",
329 | "We first check who these prolific users are by determining who was tweeting the most during this particular time span."
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": null,
335 | "metadata": {
336 | "collapsed": false
337 | },
338 | "outputs": [],
339 | "source": [
340 | "# Create maps for holding counts and tweets for each user\n",
341 | "globalUserCounter = {}\n",
342 | "globalUserMap = {}\n",
343 | "\n",
344 | "# Iterate through the time stamps\n",
345 | "for t in sortedTimes:\n",
346 | " timeObj = frequencyMap[t]\n",
347 | " \n",
348 | " # For each tweet, pull the screen name and add it to the list\n",
349 | " for tweet in timeObj[\"list\"]:\n",
350 | " user = tweet[\"user\"][\"screen_name\"]\n",
351 | " \n",
352 | " if ( user not in globalUserCounter ):\n",
353 | " globalUserCounter[user] = 1\n",
354 | " globalUserMap[user] = [tweet]\n",
355 | " else:\n",
356 | " globalUserCounter[user] += 1\n",
357 | " globalUserMap[user].append(tweet)\n",
358 | "\n",
359 | "print (\"Unique Users:\", len(globalUserCounter.keys()))"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": null,
365 | "metadata": {
366 | "collapsed": false
367 | },
368 | "outputs": [],
369 | "source": [
370 | "sortedUsers = sorted(globalUserCounter, key=globalUserCounter.get, reverse=True)\n",
371 | "print (\"Top Ten Most Prolific Users:\")\n",
372 | "for u in sortedUsers[:10]:\n",
373 | " print (u, globalUserCounter[u], \n",
374 | " \"\\n\\t\", \"Random Tweet:\", globalUserMap[u][0][\"text\"], \"\\n----------\")"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "Many of these tweets are not relevant to the event at hand.\n",
382 | "Twitter is a very noisy place.\n",
383 | "\n",
384 | "Hashtags, however, are high signal keywords. \n",
385 | "Maybe the most common hashtags will be more informative."
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {
392 | "collapsed": false
393 | },
394 | "outputs": [],
395 | "source": [
396 | "# A map for hashtag counts\n",
397 | "hashtagCounter = {}\n",
398 | "\n",
399 | "# For each minute, pull the list of hashtags and add to the counter\n",
400 | "for t in sortedTimes:\n",
401 | " timeObj = frequencyMap[t]\n",
402 | " \n",
403 | " for tweet in timeObj[\"list\"]:\n",
404 | " hashtagList = tweet[\"entities\"][\"hashtags\"]\n",
405 | " \n",
406 | " for hashtagObj in hashtagList:\n",
407 | " \n",
408 | " # We lowercase the hashtag to avoid duplicates (e.g., #MikeBrown vs. #mikebrown)\n",
409 | " hashtagString = hashtagObj[\"text\"].lower()\n",
410 | " \n",
411 | " if ( hashtagString not in hashtagCounter ):\n",
412 | " hashtagCounter[hashtagString] = 1\n",
413 | " else:\n",
414 | " hashtagCounter[hashtagString] += 1\n",
415 | "\n",
416 | "print (\"Unique Hashtags:\", len(hashtagCounter.keys()))\n",
417 | "sortedHashtags = sorted(hashtagCounter, key=hashtagCounter.get, reverse=True)\n",
418 | "print (\"Top Twenty Hashtags:\")\n",
419 | "for ht in sortedHashtags[:20]:\n",
420 | " print (\"\\t\", \"#\" + ht, hashtagCounter[ht])"
421 | ]
422 | },
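423 |   {
424 |    "cell_type": "markdown",
425 |    "metadata": {},
426 |    "source": [
427 |     "As an aside, the check-then-increment pattern above is exactly what `collections.Counter` provides. A sketch of the equivalent tally using only the standard library:\n",
428 |     "\n",
429 |     "```python\n",
430 |     "from collections import Counter\n",
431 |     "\n",
432 |     "hashtagCounter = Counter(h[\"text\"].lower()\n",
433 |     "                         for t in sortedTimes\n",
434 |     "                         for tweet in frequencyMap[t][\"list\"]\n",
435 |     "                         for h in tweet[\"entities\"][\"hashtags\"])\n",
436 |     "\n",
437 |     "print(hashtagCounter.most_common(20))\n",
438 |     "```"
439 |    ]
440 |   },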
423 | {
424 | "cell_type": "markdown",
425 | "metadata": {},
426 | "source": [
427 | "We can do the same with URLs to find the most shared URL."
428 | ]
429 | },
430 | {
431 | "cell_type": "code",
432 | "execution_count": null,
433 | "metadata": {
434 | "collapsed": false
435 | },
436 | "outputs": [],
437 | "source": [
438 | "# A map for hashtag counts\n",
439 | "urlCounter = {}\n",
440 | "\n",
441 | "# For each minute, pull the list of hashtags and add to the counter\n",
442 | "for t in sortedTimes:\n",
443 | " timeObj = frequencyMap[t]\n",
444 | " \n",
445 | " for tweet in timeObj[\"list\"]:\n",
446 | " urlList = tweet[\"entities\"][\"urls\"]\n",
447 | " \n",
448 | " for url in urlList:\n",
449 | " urlStr = url[\"url\"]\n",
450 | " \n",
451 | " if ( urlStr not in urlCounter ):\n",
452 | " urlCounter[urlStr] = 1\n",
453 | " else:\n",
454 | " urlCounter[urlStr] += 1\n",
455 | "\n",
456 | "print (\"Unique URLs:\", len(urlCounter.keys()))\n",
457 | "sortedUrls = sorted(urlCounter, key=urlCounter.get, reverse=True)\n",
458 | "print (\"Top Twenty URLs:\")\n",
459 | "for url in sortedUrls[:20]:\n",
460 | " print (\"\\t\", url, urlCounter[url])"
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "Note how each URL is shortened using Twitter's shortener. \n",
468 | "To get a better idea of the content, we should expand the url."
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": null,
474 | "metadata": {
475 | "collapsed": false
476 | },
477 | "outputs": [],
478 | "source": [
479 | "print (\"Top Expanded URLs:\")\n",
480 | "for url in sortedUrls[:10]:\n",
481 | " try:\n",
482 | " r = requests.get(url)\n",
483 | " realUrl = r.url\n",
484 | " print (\"\\t\", url, urlCounter[url], \"->\", realUrl)\n",
485 | " except:\n",
486 | " print (\"\\t\", url, urlCounter[url], \"->\", \"UNKNOWN Failure\")"
487 | ]
488 | },
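489 |   {
490 |    "cell_type": "markdown",
491 |    "metadata": {},
492 |    "source": [
493 |     "Each `requests.get()` call above downloads the whole page just to learn the final URL. A lighter-weight sketch asks only for the headers while still following redirects (some servers handle HEAD requests poorly, so a GET fallback may still be needed):\n",
494 |     "\n",
495 |     "```python\n",
496 |     "r = requests.head(url, allow_redirects=True)\n",
497 |     "realUrl = r.url\n",
498 |     "```"
499 |    ]
500 |   },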
489 | {
490 | "cell_type": "markdown",
491 | "metadata": {},
492 | "source": [
493 | "Since URLs and Hashtags are both entities, we can do the same for other entities, like mentions and media."
494 | ]
495 | },
496 | {
497 | "cell_type": "code",
498 | "execution_count": null,
499 | "metadata": {
500 | "collapsed": false
501 | },
502 | "outputs": [],
503 | "source": [
504 | "# A map for mention counts\n",
505 | "mentionCounter = {}\n",
506 | "\n",
507 | "# For each minute, pull the list of mentions and add to the counter\n",
508 | "for t in sortedTimes:\n",
509 | " timeObj = frequencyMap[t]\n",
510 | " \n",
511 | " for tweet in timeObj[\"list\"]:\n",
512 | " mentions = tweet[\"entities\"][\"user_mentions\"]\n",
513 | " \n",
514 | " for mention in mentions:\n",
515 | " mentionStr = mention[\"screen_name\"]\n",
516 | " \n",
517 | " if ( mentionStr not in mentionCounter ):\n",
518 | " mentionCounter[mentionStr] = 1\n",
519 | " else:\n",
520 | " mentionCounter[mentionStr] += 1\n",
521 | "\n",
522 | "print (\"Unique Mentions:\", len(mentionCounter.keys()))\n",
523 | "sortedMentions = sorted(mentionCounter, key=mentionCounter.get, reverse=True)\n",
524 | "print (\"Top Twenty Mentions:\")\n",
525 | "for mention in sortedMentions[:20]:\n",
526 | " print (\"\\t\", mention, mentionCounter[mention])"
527 | ]
528 | },
529 | {
530 | "cell_type": "code",
531 | "execution_count": null,
532 | "metadata": {
533 | "collapsed": false
534 | },
535 | "outputs": [],
536 | "source": [
537 | "# A map for media counts\n",
538 | "mediaCounter = {}\n",
539 | "\n",
540 | "# For each minute, pull the list of media and add to the counter\n",
541 | "for t in sortedTimes:\n",
542 | " timeObj = frequencyMap[t]\n",
543 | " \n",
544 | " for tweet in timeObj[\"list\"]:\n",
545 | " if ( \"media\" not in tweet[\"entities\"] ):\n",
546 | " continue\n",
547 | " \n",
548 | " mediaList = tweet[\"entities\"][\"media\"]\n",
549 | " \n",
550 | " for media in mediaList:\n",
551 | " mediaStr = media[\"media_url\"]\n",
552 | " \n",
553 | " if ( mediaStr not in mediaCounter ):\n",
554 | " mediaCounter[mediaStr] = 1\n",
555 | " else:\n",
556 | " mediaCounter[mediaStr] += 1\n",
557 | "\n",
558 | "print (\"Unique Media:\", len(mediaCounter.keys()))\n",
559 | "sortedMedia = sorted(mediaCounter, key=mediaCounter.get, reverse=True)\n",
560 | "print (\"Top Twenty Media:\")\n",
561 | "for media in sortedMedia[:20]:\n",
562 | " print (\"\\t\", media, mediaCounter[media])"
563 | ]
564 | },
565 | {
566 | "cell_type": "markdown",
567 | "metadata": {},
568 | "source": [
569 | "We can some data is relevant, both in pictures and in hashtags and URLs. \n",
570 | "Are the most retweeted retweets also useful? \n",
571 | "Or are they expressing condolence?\n",
572 | "Or completely unrelated?"
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "execution_count": null,
578 | "metadata": {
579 | "collapsed": false
580 | },
581 | "outputs": [],
582 | "source": [
583 | "# A map for media counts\n",
584 | "tweetRetweetCountMap = {}\n",
585 | "rtList = []\n",
586 | "\n",
587 | "# For each minute, pull the list of hashtags and add to the counter\n",
588 | "for t in sortedTimes:\n",
589 | " timeObj = frequencyMap[t]\n",
590 | " \n",
591 | " for tweet in timeObj[\"list\"]:\n",
592 | " tweetId = tweet[\"id_str\"]\n",
593 | " rtCount = tweet[\"retweet_count\"]\n",
594 | " \n",
595 | " if ( \"retweeted_status\" in tweet ):\n",
596 | " tweetId = tweet[\"retweeted_status\"][\"id_str\"]\n",
597 | " rtCount = tweet[\"retweeted_status\"][\"retweet_count\"]\n",
598 | " \n",
599 | " tweetRetweetCountMap[tweetId] = rtCount\n",
600 | " rtList.append(rtCount)\n",
601 | " \n",
602 | "sortedRetweets = sorted(tweetRetweetCountMap, key=tweetRetweetCountMap.get, reverse=True)\n",
603 | "print (\"Top Ten Retweets:\")\n",
604 | "for tweetId in sortedRetweets[:10]:\n",
605 | " thisTweet = None\n",
606 | " \n",
607 | " for t in reversed(sortedTimes):\n",
608 | " for tweet in frequencyMap[t][\"list\"]:\n",
609 | " if ( tweet[\"id_str\"] == tweetId ):\n",
610 | " thisTweet = tweet\n",
611 | " break\n",
612 | " \n",
613 | " if ( \"retweeted_status\" in tweet and tweet[\"retweeted_status\"][\"id_str\"] == tweetId ):\n",
614 | " thisTweet = tweet[\"retweeted_status\"]\n",
615 | " break\n",
616 | " \n",
617 | " if ( thisTweet is not None ):\n",
618 | " break\n",
619 | " \n",
620 | " print (\"\\t\", tweetId, tweetRetweetCountMap[tweetId], thisTweet[\"text\"])"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "Retweets seem to be dominated by recent elements. \n",
628 | "To correct for this, we should remove retweets that are older than the event."
629 | ]
630 | },
631 | {
632 | "cell_type": "code",
633 | "execution_count": null,
634 | "metadata": {
635 | "collapsed": false
636 | },
637 | "outputs": [],
638 | "source": [
639 | "print (\"Top Ten RECENT Retweets:\")\n",
640 | "\n",
641 | "foundTweets = 0\n",
642 | "for tweetId in sortedRetweets:\n",
643 | " thisTweet = None\n",
644 | " \n",
645 | " # Find the most recent copy of the tweet\n",
646 | " for t in reversed(sortedTimes):\n",
647 | " for tweet in frequencyMap[t][\"list\"]:\n",
648 | " if ( tweet[\"id_str\"] == tweetId ):\n",
649 | " thisTweet = tweet\n",
650 | " break\n",
651 | " \n",
652 | " if ( \"retweeted_status\" in tweet and tweet[\"retweeted_status\"][\"id_str\"] == tweetId ):\n",
653 | " thisTweet = tweet[\"retweeted_status\"]\n",
654 | " break\n",
655 | " \n",
656 | " if ( thisTweet is not None ):\n",
657 | " break\n",
658 | " \n",
659 | " createdTime = datetime.datetime.strptime(thisTweet['created_at'], timeFormat)\n",
660 | " \n",
661 | " # If tweet creation time is before the crisis, assume irrelevant\n",
662 | " if ( createdTime < crisisTime ):\n",
663 | " continue\n",
664 | " \n",
665 | " print (\"\\t\", tweetId, tweetRetweetCountMap[tweetId], thisTweet[\"text\"])\n",
666 | " \n",
667 | " foundTweets += 1\n",
668 | " \n",
669 | " if ( foundTweets > 10 ):\n",
670 | " break"
671 | ]
672 | },
673 | {
674 | "cell_type": "markdown",
675 | "metadata": {},
676 | "source": [
677 | "### Event Detection w/ Keyword Frequency\n",
678 | "\n",
679 | "Twitter is good for breaking news. When an impactful event occurs, we often see a spike on Twitter of the usage of a related keyword. Some examples are below."
680 | ]
681 | },
682 | {
683 | "cell_type": "code",
684 | "execution_count": null,
685 | "metadata": {
686 | "collapsed": false
687 | },
688 | "outputs": [],
689 | "source": [
690 | "# What keywords are we interested in?\n",
691 | "targetKeywords = crisisInfo[selectedCrisis][\"keywords\"]\n",
692 | "\n",
693 | "# Build an empty map for each keyword we are seaching for\n",
694 | "targetCounts = {x:[] for x in targetKeywords}\n",
695 | "totalCount = []\n",
696 | "\n",
697 | "# For each minute, pull the tweet text and search for the keywords we want\n",
698 | "for t in sortedTimes:\n",
699 | " timeObj = frequencyMap[t]\n",
700 | " \n",
701 | " # Temporary counter for this minute\n",
702 | " localTargetCounts = {x:0 for x in targetKeywords}\n",
703 | " localTotalCount = 0\n",
704 | " \n",
705 | " for tweetObj in timeObj[\"list\"]:\n",
706 | " tweetString = tweetObj[\"text\"].lower()\n",
707 | "\n",
708 | " localTotalCount += 1\n",
709 | " \n",
710 | " # Add to the counter if the target keyword is in this tweet\n",
711 | " for keyword in targetKeywords:\n",
712 | " if ( keyword in tweetString ):\n",
713 | " localTargetCounts[keyword] += 1\n",
714 | " \n",
715 | " # Add the counts for this minute to the main counter\n",
716 | " totalCount.append(localTotalCount)\n",
717 | " for keyword in targetKeywords:\n",
718 | " targetCounts[keyword].append(localTargetCounts[keyword])\n",
719 | " \n",
720 | "# Now plot the total frequency and frequency of each keyword\n",
721 | "fig, ax = plt.subplots()\n",
722 | "fig.set_size_inches(11, 8.5)\n",
723 | "\n",
724 | "plt.title(\"Tweet Frequency\")\n",
725 | "plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)\n",
726 | "\n",
727 | "ax.semilogy(range(len(frequencyMap)), totalCount, label=\"Total\")\n",
728 | "\n",
729 | "ax.scatter([crisisXCoord], [100], c=\"r\", marker=\"x\", s=100, label=\"Crisis\")\n",
730 | "\n",
731 | "for keyword in targetKeywords:\n",
732 | " ax.semilogy(range(len(frequencyMap)), targetCounts[keyword], label=keyword)\n",
733 | "ax.legend()\n",
734 | "ax.grid(b=True, which=u'major')\n",
735 | "\n",
736 | "plt.show()"
737 | ]
738 | },
739 | {
740 | "cell_type": "markdown",
741 | "metadata": {},
742 | "source": [
743 | "
\n",
744 | "
\n",
745 | "\n",
746 | "## Time for a break!"
747 | ]
748 | },
749 | {
750 | "cell_type": "markdown",
751 | "metadata": {},
752 | "source": [
753 | "
\n",
754 | "# Topic 5: Geographic Data\n",
755 | "\n",
756 | "Data in social media can be relevant to an event in three ways: __temporally__ relevant, __geographically__ relevant, or __topically__ relevant.\n",
757 | "So far, we've looked at temporally relevant data, or data that was posted at about the same time as the target event.\n",
758 | "Now we'll explore geographically relevant data, or data posted near the event.\n",
759 | "\n",
760 | "Twitter allows users to share their GPS locations when tweeting, but only about 2% of tweets have this information. \n",
761 | "We can extract this geospatial data to look at patterns in different locations. \n",
762 | "\n",
763 | "- General plotting\n",
764 | "- Filtering by a bounding box\n",
765 | "- Images from target location"
766 | ]
767 | },
768 | {
769 | "cell_type": "markdown",
770 | "metadata": {},
771 | "source": [
772 | "### Plotting GPS Data\n",
773 | "\n",
774 | "Each tweet has a field called \"coordinates\" describing from where the tweet was posted. \n",
775 | "The field might be null if the tweet contains no location data, or it could contain bounding box information, place information, or GPS coordinates in the form of (longitude, latitude). \n",
776 | "We want tweets with this GPS data.\n",
777 | "\n",
778 | "For more information on tweet JSON formats, check out https://dev.twitter.com/overview/api/tweets"
779 | ]
780 | },
781 | {
782 | "cell_type": "code",
783 | "execution_count": null,
784 | "metadata": {
785 | "collapsed": false
786 | },
787 | "outputs": [],
788 | "source": [
789 | "# A frequency map for timestamps to geo-coded tweets\n",
790 | "geoFrequencyMap = {}\n",
791 | "geoCount = 0\n",
792 | "\n",
793 | "# Save only those tweets with tweet['coordinate']['coordinate'] entity\n",
794 | "for t in sortedTimes:\n",
795 | " geos = list(filter(lambda tweet: tweet[\"coordinates\"] != None and \n",
796 | " \"coordinates\" in tweet[\"coordinates\"], \n",
797 | " frequencyMap[t][\"list\"]))\n",
798 | " geoCount += len(geos)\n",
799 | " \n",
800 | " # Add to the timestamp map\n",
801 | " geoFrequencyMap[t] = {\"count\": len(geos), \"list\": geos}\n",
802 | "\n",
803 | "print (\"Number of Geo Tweets:\", geoCount)"
804 | ]
805 | },
806 | {
807 | "cell_type": "markdown",
808 | "metadata": {},
809 | "source": [
810 | "#### GPS Frequency\n",
811 | "\n",
812 | "What is the frequency of GPS-coded tweets?"
813 | ]
814 | },
815 | {
816 | "cell_type": "code",
817 | "execution_count": null,
818 | "metadata": {
819 | "collapsed": false
820 | },
821 | "outputs": [],
822 | "source": [
823 | "fig, ax = plt.subplots()\n",
824 | "fig.set_size_inches(11, 8.5)\n",
825 | "\n",
826 | "plt.title(\"Geo Tweet Frequency\")\n",
827 | "\n",
828 | "gpsFreqList = [geoFrequencyMap[x][\"count\"] for x in sortedTimes]\n",
829 | "postFreqList = [frequencyMap[x][\"count\"] for x in sortedTimes]\n",
830 | "\n",
831 | "plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)\n",
832 | "\n",
833 | "xData = range(len(geoFrequencyMap))\n",
834 | "gpsYData = [x if x > 0 else 0 for x in gpsFreqList]\n",
835 | "freqYData = [x if x > 0 else 0 for x in postFreqList]\n",
836 | "\n",
837 | "ax.semilogy(xData, freqYData, color=\"blue\", label=\"Posts\")\n",
838 | "ax.semilogy(xData, gpsYData, color=\"green\", label=\"GPS Posts\")\n",
839 | "ax.scatter([crisisXCoord], [100], c=\"r\", marker=\"x\", s=100, label=\"Crisis\")\n",
840 | "\n",
841 | "ax.grid(b=True, which=u'major')\n",
842 | "ax.legend()\n",
843 | "\n",
844 | "plt.show()"
845 | ]
846 | },
847 | {
848 | "cell_type": "markdown",
849 | "metadata": {},
850 | "source": [
851 | "### Plotting GPS Data\n",
852 | "\n",
853 | "Now that we have a list of all the tweets with GPS coordinates, we can plot from where in the world these tweets were posted. \n",
854 | "To make this plot, we can leverage the Basemap package to make a map of the world and convert GPS coordinates to *(x, y)* coordinates we can then plot."
855 | ]
856 | },
857 | {
858 | "cell_type": "code",
859 | "execution_count": null,
860 | "metadata": {
861 | "collapsed": false
862 | },
863 | "outputs": [],
864 | "source": [
865 | "import matplotlib\n",
866 | "import functools\n",
867 | "\n",
868 | "from mpl_toolkits.basemap import Basemap\n",
869 | "\n",
870 | "# Create a list of all geo-coded tweets\n",
871 | "tmpGeoList = [geoFrequencyMap[t][\"list\"] for t in sortedTimes]\n",
872 | "geoTweets = functools.reduce(lambda x, y: x + y, tmpGeoList)\n",
873 | "\n",
874 | "# For each geo-coded tweet, extract its GPS coordinates\n",
875 | "geoCoord = [x[\"coordinates\"][\"coordinates\"] for x in geoTweets]\n",
876 | "\n",
877 | "# Now we build a map of the world using Basemap\n",
878 | "land_color = 'lightgray'\n",
879 | "water_color = 'lightblue'\n",
880 | "\n",
881 | "fig, ax = plt.subplots(figsize=(24,24))\n",
882 | "worldMap = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80,\n",
883 | " llcrnrlon=-180, urcrnrlon=180, resolution='l')\n",
884 | "\n",
885 | "worldMap.fillcontinents(color=land_color, lake_color=water_color, zorder=1)\n",
886 | "worldMap.drawcoastlines()\n",
887 | "worldMap.drawparallels(np.arange(-90.,120.,30.))\n",
888 | "worldMap.drawmeridians(np.arange(0.,420.,60.))\n",
889 | "worldMap.drawmapboundary(fill_color=water_color, zorder=0)\n",
890 | "ax.set_title('World Tweets')\n",
891 | "\n",
892 | "# Convert points from GPS coordinates to (x,y) coordinates\n",
893 | "convPoints = [worldMap(p[0], p[1]) for p in geoCoord]\n",
894 | "x = [p[0] for p in convPoints]\n",
895 | "y = [p[1] for p in convPoints]\n",
896 | "worldMap.scatter(x, y, s=100, marker='x', color=\"red\", zorder=2)\n",
897 | "\n",
898 | "plt.show()"
899 | ]
900 | },
901 | {
902 | "cell_type": "markdown",
903 | "metadata": {},
904 | "source": [
905 | "### Filtering By Location\n",
906 | "\n",
907 | "We can use existing Geographic Information System (GIS) tools to determine from where a tweet was posted.\n",
908 | "For example, we could ask whether a particular tweet was posted from the United States. \n",
909 | "This filtering is often performed using shape files.\n",
910 | "For our purposes though, we established a bounding box along with the crisis data, so we'll use that as our filter for simplicity."
911 | ]
912 | },
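913 |   {
914 |    "cell_type": "markdown",
915 |    "metadata": {},
916 |    "source": [
917 |     "For an axis-aligned box like ours, the containment test itself needs no map projection; a sketch of the check in raw longitude/latitude (the helper name is ours, not used in the code below):\n",
918 |     "\n",
919 |     "```python\n",
920 |     "def inBoundingBox(lon, lat, box):\n",
921 |     "    return (box[\"lowerLeftLon\"] <= lon <= box[\"upperRightLon\"] and\n",
922 |     "            box[\"lowerLeftLat\"] <= lat <= box[\"upperRightLat\"])\n",
923 |     "\n",
924 |     "# e.g., Kathmandu falls inside the \"nepal\" crisis box\n",
925 |     "print(inBoundingBox(85.3, 27.7, crisisInfo[\"nepal\"][\"box\"]))\n",
926 |     "```\n",
927 |     "\n",
928 |     "The code below converts everything into Basemap's projected coordinates instead, because it also wants to plot the matching points on the map."
929 |    ]
930 |   },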
913 | {
914 | "cell_type": "code",
915 | "execution_count": null,
916 | "metadata": {
917 | "collapsed": false
918 | },
919 | "outputs": [],
920 | "source": [
921 | "# Get the bounding box for our crisis\n",
922 | "bBox = crisisInfo[selectedCrisis][\"box\"]\n",
923 | "\n",
924 | "fig, ax = plt.subplots(figsize=(11,8.5))\n",
925 | "\n",
926 | "# Create a new map to hold the shape file data\n",
927 | "targetMap = Basemap(llcrnrlon=bBox[\"lowerLeftLon\"], \n",
928 | " llcrnrlat=bBox[\"lowerLeftLat\"], \n",
929 | " urcrnrlon=bBox[\"upperRightLon\"], \n",
930 | " urcrnrlat=bBox[\"upperRightLat\"], \n",
931 | " projection='merc',\n",
932 | " resolution='i', area_thresh=10000)\n",
933 | "\n",
934 | "targetMap.fillcontinents(color=land_color, lake_color=water_color, \n",
935 | " zorder=1)\n",
936 | "targetMap.drawcoastlines()\n",
937 | "targetMap.drawparallels(np.arange(-90.,120.,30.))\n",
938 | "targetMap.drawmeridians(np.arange(0.,420.,60.))\n",
939 | "targetMap.drawmapboundary(fill_color=water_color, zorder=0)\n",
940 | "targetMap.drawcountries()\n",
941 | "\n",
942 | "# Now we build the polygon for filtering\n",
943 | "# Convert from lon, lat of lower-left to x,y coordinates\n",
944 | "llcCoord = targetMap(bBox[\"lowerLeftLon\"], bBox[\"lowerLeftLat\"])\n",
945 | "\n",
946 | "# Same for upper-right corner\n",
947 | "urcCoord = targetMap(bBox[\"upperRightLon\"], bBox[\"upperRightLat\"])\n",
948 | "\n",
949 | "# Now make the polygon we'll us for filtering\n",
950 | "boxPoints = np.array([[llcCoord[0], llcCoord[1]], \n",
951 | " [llcCoord[0], urcCoord[1]], \n",
952 | " [urcCoord[0], urcCoord[1]], \n",
953 | " [urcCoord[0], llcCoord[1]]])\n",
954 | "boundingBox = matplotlib.patches.Polygon(boxPoints)\n",
955 | "\n",
956 | "# Maps of timestamps to tweets for inside/outside Ferguson\n",
957 | "inTargetFreqMap = {}\n",
958 | "plottablePointsX = []\n",
959 | "plottablePointsY = []\n",
960 | "\n",
961 | "# For each geo-coded tweet, extract coordinates and convert \n",
962 | "# them to the Basemap space\n",
963 | "for t in sortedTimes:\n",
964 | " geos = geoFrequencyMap[t][\"list\"]\n",
965 | " convPoints = [(targetMap(tw[\"coordinates\"][\"coordinates\"][0], tw[\"coordinates\"][\"coordinates\"][1]), tw) for tw in geos]\n",
966 | "\n",
967 | " # Local counters for this time\n",
968 | " inTargetFreqMap[t] = {\"count\": 0, \"list\": []}\n",
969 | " \n",
970 | " # For each point, check if it is within the bounding box or not\n",
971 | " for point in convPoints:\n",
972 | " x = point[0][0]\n",
973 | " y = point[0][1]\n",
974 | "\n",
975 | " if ( boundingBox.contains_point((x, y))):\n",
976 | " inTargetFreqMap[t][\"list\"].append(point[1])\n",
977 | " plottablePointsX.append(x)\n",
978 | " plottablePointsY.append(y)\n",
979 | "\n",
980 | "# Plot points in our target\n",
981 | "targetMap.scatter(plottablePointsX, plottablePointsY, s=100, marker='x', color=\"red\", zorder=2)\n",
982 | " \n",
983 | "# Count the number of tweets that fall in the area\n",
984 | "targetTweetCount = np.sum([len(inTargetFreqMap[t][\"list\"]) for t in sortedTimes])\n",
985 | " \n",
986 | "print (\"Tweets in Target Area:\", targetTweetCount)\n",
987 | "print (\"Tweets outside:\", (geoCount - targetTweetCount))\n",
988 | "\n",
989 | "plt.show()"
990 | ]
991 | },
992 | {
993 | "cell_type": "markdown",
994 | "metadata": {},
995 | "source": [
996 | "### Geographically Relevant Tweet Content\n",
997 | "\n",
998 | "Now that we have a list of tweets from the target area, what are they saying?"
999 | ]
1000 | },
1001 | {
1002 | "cell_type": "code",
1003 | "execution_count": null,
1004 | "metadata": {
1005 | "collapsed": false
1006 | },
1007 | "outputs": [],
1008 | "source": [
1009 | "# Merge our list of relevant tweets\n",
1010 | "geoRelevantTweets = [tw for x in sortedTimes for tw in inTargetFreqMap[x][\"list\"]]\n",
1011 | "\n",
1012 | "print(\"Time of Crisis:\", crisisTime)\n",
1013 | "\n",
1014 | "# Print the first few tweets\n",
1015 | "for tweet in geoRelevantTweets[:10]:\n",
1016 | " print(\"Tweet By:\", tweet[\"user\"][\"screen_name\"])\n",
1017 | " print(\"\\t\", \"Tweet Text:\", tweet[\"text\"])\n",
1018 | " print(\"\\t\", \"Tweet Time:\", tweet[\"created_at\"])\n",
1019 | " print(\"\\t\", \"Source:\", tweet[\"source\"])\n",
1020 | " print(\"\\t\", \"Retweets:\", tweet[\"retweet_count\"])\n",
1021 | " print(\"\\t\", \"Favorited:\", tweet[\"favorite_count\"])\n",
1022 | " print(\"\\t\", \"Twitter's Guessed Language:\", tweet[\"lang\"])\n",
1023 | " if ( \"place\" in tweet ):\n",
1024 | " print(\"\\t\", \"Tweet Location:\", tweet[\"place\"][\"full_name\"])\n",
1025 | " print(\"-----\")"
1026 | ]
1027 | },
1028 | {
1029 | "cell_type": "markdown",
1030 | "metadata": {},
1031 | "source": [
1032 | "### Media from Within Target\n",
1033 | "\n",
1034 | "With this filtered list of tweets, we can extract media posted from the evnet."
1035 | ]
1036 | },
1037 | {
1038 | "cell_type": "code",
1039 | "execution_count": null,
1040 | "metadata": {
1041 | "collapsed": false
1042 | },
1043 | "outputs": [],
1044 | "source": [
1045 | "from IPython.display import display\n",
1046 | "from IPython.display import Image\n",
1047 | "\n",
1048 | "geoTweetsWithMedia = list(filter(lambda tweet: \"media\" in tweet[\"entities\"], geoRelevantTweets))\n",
1049 | "print (\"Tweets with Media:\", len(geoTweetsWithMedia))\n",
1050 | "\n",
1051 | "if ( len(geoTweetsWithMedia) == 0 ):\n",
1052 | " print (\"Sorry, not tweets with media...\")\n",
1053 | "\n",
1054 | "for tweet in geoTweetsWithMedia:\n",
1055 | " imgUrl = tweet[\"entities\"][\"media\"][0][\"media_url\"]\n",
1056 | " print (tweet[\"text\"])\n",
1057 | " display(Image(url=imgUrl))"
1058 | ]
1059 | },
1060 | {
1061 | "cell_type": "markdown",
1062 | "metadata": {},
1063 | "source": [
1064 | "---\n",
1065 | "# Topic 6: Content and Sentiment Analysis\n",
1066 | "\n",
1067 | "Another popular type of analysis people do on social networks is \"sentiment analysis,\" which is used to figure out how people **feel** about a specific topic.\n",
1068 | "Some tools also provide measurements like subjectivity/objectivity of text content.\n",
1069 | "\n",
1070 | "We'll cover:\n",
1071 | "\n",
1072 | "- Topically Relevant Filtering\n",
1073 | "- Sentiment, Subjectivity, and Objectivity"
1074 | ]
1075 | },
1076 | {
1077 | "cell_type": "markdown",
1078 | "metadata": {},
1079 | "source": [
1080 | "### Topically Relevant Tweets\n",
1081 | "\n",
1082 | "Before we filter for sentiment and such, we've seen that Twitter has a lot of noise and irrelevant data.\n",
1083 | "We should clean this data a bit before this analysis.\n",
1084 | "To do so, we'll filter our data so that it only contains tweets with relevant keywords."
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": null,
1090 | "metadata": {
1091 | "collapsed": false
1092 | },
1093 | "outputs": [],
1094 | "source": [
1095 | "# What keywords are we interested in?\n",
1096 | "targetKeywords = crisisInfo[selectedCrisis][\"keywords\"]\n",
1097 | "\n",
1098 | "# Map for storing topically relevant data\n",
1099 | "topicRelevantMap = {}\n",
1100 | "\n",
1101 | "# For each minute, pull the tweet text and search for the keywords we want\n",
1102 | "for t in sortedTimes:\n",
1103 | " timeObj = frequencyMap[t]\n",
1104 | " topicRelevantMap[t] = {\"count\": 0, \"list\": []}\n",
1105 | " \n",
1106 | " for tweetObj in timeObj[\"list\"]:\n",
1107 | " tweetString = tweetObj[\"text\"].lower()\n",
1108 | "\n",
1109 | " # Add to the counter if the target keyword is in this tweet\n",
1110 | " for keyword in targetKeywords:\n",
1111 | " if ( keyword.lower() in tweetString ):\n",
1112 | " topicRelevantMap[t][\"list\"].append(tweetObj)\n",
1113 | " topicRelevantMap[t][\"count\"] += 1\n",
1114 | " \n",
1115 | " break\n",
1116 | "\n",
1117 | " \n",
1118 | "# Now plot the total frequency and frequency of each keyword\n",
1119 | "fig, ax = plt.subplots()\n",
1120 | "fig.set_size_inches(11, 8.5)\n",
1121 | "\n",
1122 | "plt.title(\"Tweet Frequency\")\n",
1123 | "plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)\n",
1124 | "\n",
1125 | "ax.semilogy(range(len(frequencyMap)), totalCount, label=\"Total\")\n",
1126 | "\n",
1127 | "ax.scatter([crisisXCoord], [100], c=\"r\", marker=\"x\", s=100, label=\"Crisis\")\n",
1128 | "\n",
1129 | "relYData = [topicRelevantMap[t][\"count\"] for t in sortedTimes]\n",
1130 | "ax.semilogy(range(len(relYData)), relYData, label=\"Relevant\")\n",
1131 | "\n",
1132 | "ax.legend()\n",
1133 | "ax.grid(b=True, which=u'major')\n",
1134 | "\n",
1135 | "plt.show()"
1136 | ]
1137 | },
1138 | {
1139 | "cell_type": "markdown",
1140 | "metadata": {},
1141 | "source": [
1142 | "### Highly Important Relevant Tweets"
1143 | ]
1144 | },
1145 | {
1146 | "cell_type": "code",
1147 | "execution_count": null,
1148 | "metadata": {
1149 | "collapsed": false
1150 | },
1151 | "outputs": [],
1152 | "source": [
1153 | "allTweets = [x for t in sortedTimes for x in topicRelevantMap[t][\"list\"]]\n",
1154 | "\n",
1155 | "# get the top retweeted tweets\n",
1156 | "onlyRetweets = filter(lambda x: \"retweeted_status\" in x, allTweets)\n",
1157 | "topTweets = sorted(onlyRetweets, key=lambda x: x[\"retweeted_status\"][\"retweet_count\"], \n",
1158 | " reverse=True)[:10]\n",
1159 | "\n",
1160 | "print(\"Top Retweets:\")\n",
1161 | "for x in topTweets:\n",
1162 | " print(x[\"id\"], x[\"user\"][\"screen_name\"], x[\"retweeted_status\"][\"retweet_count\"], x[\"text\"])\n",
1163 | "\n",
1164 | "# get tweets from users with the msot followers\n",
1165 | "topTweets = sorted(allTweets, key=lambda x: x[\"user\"][\"followers_count\"], reverse=True)[:10]\n",
1166 | "\n",
1167 | "print()\n",
1168 | "print(\"Top Accounts:\")\n",
1169 | "for x in topTweets:\n",
1170 | " print(x[\"id\"], x[\"user\"][\"screen_name\"], x[\"user\"][\"followers_count\"], x[\"text\"])\n",
1171 | " \n",
1172 | " \n",
1173 | "# get the top retweeted tweets but only from verified accounts\n",
1174 | "verifiedTweets = filter(lambda x: x[\"retweeted_status\"][\"user\"][\"verified\"], onlyRetweets)\n",
1175 | "topTweets = sorted(verifiedTweets, key=lambda x: x[\"retweeted_status\"][\"retweet_count\"], \n",
1176 | " reverse=True)[:10]\n",
1177 | "\n",
1178 | "print()\n",
1179 | "print(\"Top Retweets from Verified Accounts:\")\n",
1180 | "for x in verifiedTweets:\n",
1181 | " print(x[\"id\"], x[\"user\"][\"screen_name\"], x[\"retweet_count\"], x[\"text\"])"
1182 | ]
1183 | },
1184 | {
1185 | "cell_type": "markdown",
1186 | "metadata": {},
1187 | "source": [
1188 | "### Quick Geo-data Comparison\n",
1189 | "\n",
1190 | "An interesting comparison might be to look at the areas of concentration of relevant tweets."
1191 | ]
1192 | },
1193 | {
1194 | "cell_type": "code",
1195 | "execution_count": null,
1196 | "metadata": {
1197 | "collapsed": false
1198 | },
1199 | "outputs": [],
1200 | "source": [
1201 | "# A frequency map for timestamps to geo-coded tweets\n",
1202 | "relGeoFreqMap = {}\n",
1203 | "relGeoCount = 0\n",
1204 | "\n",
1205 | "# Save only those tweets with tweet['coordinate']['coordinate'] entity\n",
1206 | "for t in sortedTimes:\n",
1207 | " geos = list(filter(lambda tweet: tweet[\"coordinates\"] != None and \n",
1208 | " \"coordinates\" in tweet[\"coordinates\"], \n",
1209 | " topicRelevantMap[t][\"list\"]))\n",
1210 | " relGeoCount += len(geos)\n",
1211 | " \n",
1212 | " # Add to the timestamp map\n",
1213 | " relGeoFreqMap[t] = {\"count\": len(geos), \"list\": geos}\n",
1214 | "\n",
1215 | "print (\"Number of Relevant Geo Tweets:\", relGeoCount)\n",
1216 | "\n",
1217 | "# Create a list of all geo-coded tweets\n",
1218 | "tmpGeoList = [relGeoFreqMap[t][\"list\"] for t in sortedTimes]\n",
1219 | "relGeoTweets = functools.reduce(lambda x, y: x + y, tmpGeoList)\n",
1220 | "\n",
1221 | "# For each geo-coded tweet, extract its GPS coordinates\n",
1222 | "relGeoCoord = [x[\"coordinates\"][\"coordinates\"] for x in relGeoTweets]\n",
1223 | "\n",
1224 | "fig, ax = plt.subplots(figsize=(24,24))\n",
1225 | "worldMap = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80,\n",
1226 | " llcrnrlon=-180, urcrnrlon=180, resolution='l')\n",
1227 | "\n",
1228 | "worldMap.fillcontinents(color=land_color, lake_color=water_color, zorder=1)\n",
1229 | "worldMap.drawcoastlines()\n",
1230 | "worldMap.drawparallels(np.arange(-90.,120.,30.))\n",
1231 | "worldMap.drawmeridians(np.arange(0.,420.,60.))\n",
1232 | "worldMap.drawmapboundary(fill_color=water_color, zorder=0)\n",
1233 | "worldMap.drawcountries()\n",
1234 | "ax.set_title('Global Relevant Tweets')\n",
1235 | "\n",
1236 | "# Convert points from GPS coordinates to (x,y) coordinates\n",
1237 | "allConvPoints = [worldMap(p[0], p[1]) for p in geoCoord]\n",
1238 | "x = [p[0] for p in allConvPoints]\n",
1239 | "y = [p[1] for p in allConvPoints]\n",
1240 | "worldMap.scatter(x, y, s=100, marker='x', color=\"blue\", zorder=2)\n",
1241 | "\n",
1242 | "# Convert points from GPS coordinates to (x,y) coordinates\n",
1243 | "relConvPoints = [worldMap(p[0], p[1]) for p in relGeoCoord]\n",
1244 | "x = [p[0] for p in relConvPoints]\n",
1245 | "y = [p[1] for p in relConvPoints]\n",
1246 | "worldMap.scatter(x, y, s=100, marker='x', color=\"red\", zorder=2)\n",
1247 | "\n",
1248 | "plt.show()"
1249 | ]
1250 | },
1251 | {
1252 | "cell_type": "markdown",
1253 | "metadata": {},
1254 | "source": [
1255 | "__Observation:__ Most topically relevant tweets are not geotagged."
1256 | ]
1257 | },
1258 | {
1259 | "cell_type": "markdown",
1260 | "metadata": {},
1261 | "source": [
1262 | "### Sentiment Analysis w/ TextBlob\n",
1263 | "\n",
1264 | "TextBlob is a nice Python package that provides a number of useful text processing capabilities.\n",
1265 | "We will use it for sentiment analysis to calculate polarity and subjectivity for each relevant tweet."
1266 | ]
1267 | },
1268 | {
1269 | "cell_type": "code",
1270 | "execution_count": null,
1271 | "metadata": {
1272 | "collapsed": false
1273 | },
1274 | "outputs": [],
1275 | "source": [
1276 | "from textblob import TextBlob\n",
1277 | "\n",
1278 | "# Sentiment values\n",
1279 | "polarVals = []\n",
1280 | "objVals = []\n",
1281 | "\n",
1282 | "# For each minute, pull the tweet text and search for the keywords we want\n",
1283 | "for t in sortedTimes:\n",
1284 | " timeObj = topicRelevantMap[t]\n",
1285 | " \n",
1286 | " # For calculating averages\n",
1287 | " localPolarVals = []\n",
1288 | " localObjVals = []\n",
1289 | " \n",
1290 | " for tweetObj in timeObj[\"list\"]:\n",
1291 | " tweetString = tweetObj[\"text\"].lower()\n",
1292 | "\n",
1293 | " blob = TextBlob(tweetString)\n",
1294 | " polarity = blob.sentiment.polarity\n",
1295 | " objectivity = blob.sentiment.subjectivity\n",
1296 | " \n",
1297 | " localPolarVals.append(polarity)\n",
1298 | " localObjVals.append(objectivity)\n",
1299 | " \n",
1300 | " # Add data to the polarity and objectivity measure arrays\n",
1301 | " if ( len(timeObj[\"list\"]) > 10 ):\n",
1302 | " polarVals.append(np.mean(localPolarVals))\n",
1303 | " objVals.append(np.mean(localObjVals))\n",
1304 | " else:\n",
1305 | " polarVals.append(0.0)\n",
1306 | " objVals.append(0.0)\n",
1307 | "\n",
1308 | " \n",
1309 | "# Now plot this sentiment data\n",
1310 | "fig, ax = plt.subplots()\n",
1311 | "fig.set_size_inches(11, 8.5)\n",
1312 | "\n",
1313 | "plt.title(\"Sentiment\")\n",
1314 | "plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)\n",
1315 | "\n",
1316 | "xData = range(len(sortedTimes))\n",
1317 | "\n",
1318 | "ax.scatter([crisisXCoord], [0], c=\"r\", marker=\"x\", s=100, label=\"Crisis\")\n",
1319 | "\n",
1320 | "# Polarity is scaled [-1, 1], for negative and positive polarity\n",
1321 | "ax.plot(xData, polarVals, label=\"Polarity\")\n",
1322 | "\n",
1323 | "# Subjetivity is scaled [0, 1], with 0 = objective, 1 = subjective\n",
1324 | "ax.plot(xData, objVals, label=\"Subjectivity\")\n",
1325 | "\n",
1326 | "ax.legend()\n",
1327 | "ax.grid(b=True, which=u'major')\n",
1328 | "\n",
1329 | "plt.show()"
1330 | ]
1331 | },
1332 | {
1333 | "cell_type": "markdown",
1334 | "metadata": {},
1335 | "source": [
1336 | "## Sentiment Analysis with Vader"
1337 | ]
1338 | },
1339 | {
1340 | "cell_type": "code",
1341 | "execution_count": null,
1342 | "metadata": {
1343 | "collapsed": false
1344 | },
1345 | "outputs": [],
1346 | "source": [
1347 | "import nltk\n",
1348 | "nltk.download(\"vader_lexicon\")\n",
1349 | "import nltk.sentiment.util\n",
1350 | "import nltk.sentiment.vader"
1351 | ]
1352 | },
1353 | {
1354 | "cell_type": "code",
1355 | "execution_count": null,
1356 | "metadata": {
1357 | "collapsed": false
1358 | },
1359 | "outputs": [],
1360 | "source": [
1361 | "vader = nltk.sentiment.vader.SentimentIntensityAnalyzer()"
1362 | ]
1363 | },
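1364 |   {
1365 |    "cell_type": "markdown",
1366 |    "metadata": {},
1367 |    "source": [
1368 |     "Before scoring every tweet, it helps to see what the analyzer returns for a single example sentence (the sentence is ours). `polarity_scores()` reports negative/neutral/positive proportions plus a normalized `compound` score in [-1, 1]:\n",
1369 |     "\n",
1370 |     "```python\n",
1371 |     "print(vader.polarity_scores(\"Rescue teams are doing amazing work in Kathmandu\"))\n",
1372 |     "```"
1373 |    ]
1374 |   },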
1364 | {
1365 | "cell_type": "code",
1366 | "execution_count": null,
1367 | "metadata": {
1368 | "collapsed": false
1369 | },
1370 | "outputs": [],
1371 | "source": [
1372 | "# Sentiment values\n",
1373 | "polarVals = []\n",
1374 | "\n",
1375 | "# For each minute, pull the tweet text and search for the keywords we want\n",
1376 | "for t in sortedTimes:\n",
1377 | " timeObj = topicRelevantMap[t]\n",
1378 | " \n",
1379 | " # For calculating averages\n",
1380 | " localPolarVals = []\n",
1381 | " \n",
1382 | " for tweetObj in timeObj[\"list\"]:\n",
1383 | " tweetString = tweetObj[\"text\"].lower()\n",
1384 | "\n",
1385 | " polarity = vader.polarity_scores(tweetString)[\"compound\"]\n",
1386 | " \n",
1387 | " localPolarVals.append(polarity)\n",
1388 | " \n",
1389 | " # Add data to the polarity and objectivity measure arrays\n",
1390 | " if ( len(timeObj[\"list\"]) > 10 ):\n",
1391 | " polarVals.append(np.mean(localPolarVals))\n",
1392 | " else:\n",
1393 | " polarVals.append(0.0)\n",
1394 | "\n",
1395 | " \n",
1396 | "# Now plot this sentiment data\n",
1397 | "fig, ax = plt.subplots()\n",
1398 | "fig.set_size_inches(11, 8.5)\n",
1399 | "\n",
1400 | "plt.title(\"Sentiment\")\n",
1401 | "plt.xticks(smallerXTicks, [sortedTimes[x] for x in smallerXTicks], rotation=90)\n",
1402 | "\n",
1403 | "xData = range(len(sortedTimes))\n",
1404 | "\n",
1405 | "ax.scatter([crisisXCoord], [0], c=\"r\", marker=\"x\", s=100, label=\"Crisis\")\n",
1406 | "\n",
1407 | "# Polarity is scaled [-1, 1], for negative and positive polarity\n",
1408 | "ax.plot(xData, polarVals, label=\"Polarity\")\n",
1409 | "\n",
1410 | "ax.legend()\n",
1411 | "ax.grid(b=True, which=u'major')\n",
1412 | "\n",
1413 | "plt.ylim((-0.3, 0.55))\n",
1414 | "plt.show()"
1415 | ]
1416 | },
1417 | {
1418 | "cell_type": "markdown",
1419 | "metadata": {},
1420 | "source": [
1421 | "---\n",
1422 | "# Topic 7: Topic Modeling\n",
1423 | "\n",
1424 | "Along with sentiment analysis, a question often asked of social networks is \"What are people talking about?\" \n",
1425 | "We can answer this question using tools from topic modeling and natural language processing.\n",
1426 | "With crises, people can have many responses, from sharing specific data about the event, sharing condolonces, or opening their homes to those in need.\n",
1427 | "\n",
1428 | "To generate these topic models, we will use the Gensim package's implementation of Latent Dirichlet Allocation (LDA), which basically constructs a set of topics where each topic is described as a probability distribution over the words in our tweets. \n",
1429 | "Several other methods for topic modeling exist as well."
1430 | ]
1431 | },
1432 | {
1433 | "cell_type": "code",
1434 | "execution_count": null,
1435 | "metadata": {
1436 | "collapsed": false
1437 | },
1438 | "outputs": [],
1439 | "source": [
1440 | "# Gotta pull in a bunch of packages for this\n",
1441 | "import gensim.models.ldamulticore\n",
1442 | "import gensim.matutils\n",
1443 | "import sklearn.cluster\n",
1444 | "import sklearn.feature_extraction \n",
1445 | "import sklearn.feature_extraction.text\n",
1446 | "import sklearn.metrics\n",
1447 | "import sklearn.preprocessing"
1448 | ]
1449 | },
1450 | {
1451 | "cell_type": "code",
1452 | "execution_count": null,
1453 | "metadata": {
1454 | "collapsed": false
1455 | },
1456 | "outputs": [],
1457 | "source": [
1458 | "nltk.download(\"stopwords\")\n",
1459 | "from nltk.corpus import stopwords"
1460 | ]
1461 | },
1462 | {
1463 | "cell_type": "markdown",
1464 | "metadata": {},
1465 | "source": [
1466 | "We first extract all relevant tweets' text for building our models."
1467 | ]
1468 | },
1469 | {
1470 | "cell_type": "code",
1471 | "execution_count": null,
1472 | "metadata": {
1473 | "collapsed": false
1474 | },
1475 | "outputs": [],
1476 | "source": [
1477 | "# Get all tweets and conver to lowercase\n",
1478 | "allTweetText = [x[\"text\"].lower() for t in sortedTimes for x in topicRelevantMap[t][\"list\"]]\n",
1479 | "\n",
1480 | "print (\"All Tweet Count:\", len(allTweetText))"
1481 | ]
1482 | },
1483 | {
1484 | "cell_type": "markdown",
1485 | "metadata": {},
1486 | "source": [
1487 | "Now we build a list of stop words (words we don't care about) and build a feature generator (the vectorizer) that assigns integer keys to tokens and counts the number of each token."
1488 | ]
1489 | },
1490 | {
1491 | "cell_type": "code",
1492 | "execution_count": null,
1493 | "metadata": {
1494 | "collapsed": false
1495 | },
1496 | "outputs": [],
1497 | "source": [
1498 | "enStop = stopwords.words('english')\n",
1499 | "esStop = stopwords.words('spanish')\n",
1500 | "\n",
1501 | "# Skip stop words, retweet signs, @ symbols, and URL headers\n",
1502 | "stopList = enStop + esStop + [\"http\", \"https\", \"rt\", \"@\", \":\", \"co\"]\n",
1503 | "\n",
1504 | "vectorizer = sklearn.feature_extraction.text.CountVectorizer(strip_accents='unicode', \n",
1505 | " tokenizer=None,\n",
1506 | " token_pattern='(?u)#?\\\\b\\\\w+[\\'-]?\\\\w+\\\\b',\n",
1507 | " stop_words=stopList)\n",
1508 | "\n",
1509 | "# Analyzer\n",
1510 | "analyze = vectorizer.build_analyzer() \n",
1511 | "\n",
1512 | "# Create a vectorizer for all our content\n",
1513 | "vectorizer.fit(allTweetText)\n",
1514 | "\n",
1515 | "# Get all the words in our text\n",
1516 | "names = vectorizer.get_feature_names()\n",
1517 | "\n",
1518 | "# Create a map for vectorizer IDs to words\n",
1519 | "id2WordDict = dict(enumerate(names))"
1520 | ]
1521 | },
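{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the analyzer actually produces, we can run it on a single made-up tweet. This is just a quick sanity check; the tweet text below is purely hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A hypothetical tweet, purely for illustration\n",
"sampleTweet = \"RT @RedCross: Shelters are open downtown tonight https://example.com/shelters #safety\"\n",
"\n",
"# The analyzer lowercases, strips accents, tokenizes, and drops stop-listed words\n",
"print (analyze(sampleTweet))"
]
},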
1522 | {
1523 | "cell_type": "markdown",
1524 | "metadata": {},
1525 | "source": [
1526 | "We then use the vectorizer to transform our tweet text into a feature matrix: one row per tweet, one column per keyword, and each cell holding the number of times that keyword appears in that tweet.\n",
1527 | "\n",
1528 | "We then convert that matrix into a corpus the Gensim package can handle, apply LDA, and print the top 10 topics along with the 10 words that best describe each topic."
1529 | ]
1530 | },
1531 | {
1532 | "cell_type": "code",
1533 | "execution_count": null,
1534 | "metadata": {
1535 | "collapsed": false
1536 | },
1537 | "outputs": [],
1538 | "source": [
1539 | "# Create a Gensim-compatible corpus from the vectorized tweet text\n",
1540 | "corpus = vectorizer.transform(allTweetText)\n",
1541 | "gsCorpus = gensim.matutils.Sparse2Corpus(corpus, documents_columns=False)\n",
1542 | " \n",
1543 | "lda = gensim.models.LdaMulticore(gsCorpus, \n",
1544 | " id2word=id2WordDict,\n",
1545 | " num_topics=20, \n",
1546 | " passes=2) # ++ passes for better results\n",
1547 | "\n",
1548 | "ldaTopics = lda.show_topics(num_topics=10, \n",
1549 | " num_words=10, \n",
1550 | " formatted=False)\n",
1551 | "\n",
1552 | "for (i, tokenList) in ldaTopics:\n",
1553 | " print (\"Topic %d:\" % i, ' '.join([pair[0] for pair in tokenList]))"
1554 | ]
1555 | },
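{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond listing the topics themselves, we can ask which topics a particular tweet expresses. Below is a minimal sketch, assuming the `lda` model and `gsCorpus` from the cell above and using Gensim's `get_document_topics()`; the choice of the first tweet is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Grab the bag-of-words vector for the first tweet in our Gensim corpus\n",
"firstBow = next(iter(gsCorpus))\n",
"\n",
"# Print the tweet and the topics it loads on (ignoring very low-probability topics)\n",
"print (allTweetText[0])\n",
"print (lda.get_document_topics(firstBow, minimum_probability=0.1))"
]
},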
1556 | {
1557 | "cell_type": "markdown",
1558 | "metadata": {},
1559 | "source": [
1560 | "We can also be a little more strict and cut down on noise by looking only at words longer than four characters (matching the filter in the code below).\n",
1561 | "Stop words are often short, so by putting a floor on token length, we can theoretically get higher-signal data."
1562 | ]
1563 | },
1564 | {
1565 | "cell_type": "code",
1566 | "execution_count": null,
1567 | "metadata": {
1568 | "collapsed": false
1569 | },
1570 | "outputs": [],
1571 | "source": [
1572 | "docArrays = [y for x in allTweetText for y in analyze(x) if len(y) > 4]\n",
1573 | "fd = nltk.FreqDist(docArrays)\n",
1574 | "\n",
1575 | "print (\"Most common from analyzer:\")\n",
1576 | "for x in fd.most_common(20):\n",
1577 | " print (x[0], x[1])"
1578 | ]
1579 | },
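{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned at the start of this section, LDA is not the only topic modeling method. Below is a minimal sketch of one alternative, non-negative matrix factorization (NMF) from scikit-learn, reusing the `corpus` count matrix and `names` vocabulary from the cells above; the number of topics (10) is an arbitrary choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sklearn.decomposition\n",
"\n",
"# Factor the tweet-by-term count matrix into 10 topics (arbitrary choice)\n",
"nmf = sklearn.decomposition.NMF(n_components=10, random_state=0)\n",
"nmf.fit(corpus)\n",
"\n",
"# Each row of components_ is a topic; print its ten highest-weighted words\n",
"for (topicId, weights) in enumerate(nmf.components_):\n",
"    topWords = [names[i] for i in weights.argsort()[::-1][:10]]\n",
"    print (\"NMF Topic %d:\" % topicId, ' '.join(topWords))"
]
},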
1580 | {
1581 | "cell_type": "markdown",
1582 | "metadata": {},
1583 | "source": [
1584 | "---\n",
1585 | "# Topic 8: Network Analysis\n",
1586 | "\n",
1587 | "Information flows and social networks are important considerations during crises, when people are trying to get updates on safe spaces, loved ones, places of shelter, etc.\n",
1588 | "Twitter is noisy, though, and much of the data may be irrelevant, consist only of condolences or thoughts from celebrities, or be otherwise uninformative.\n",
1589 | "Using network analysis, we can get some idea about who the most important Twitter users were during this time, and how people split into groups online.\n",
1590 | "\n",
1591 | "For this analysis, we'll use the NetworkX package to construct a social graph of how people interact. Each person in our Twitter data will be a node in our graph, and edges in the graph will represent mentions during this timeframe.\n",
1592 | "Then we will explore a few simple analytical methods in network analysis, including:\n",
1593 | "\n",
1594 | "- Central accounts\n",
1595 | "- Visualization"
1596 | ]
1597 | },
1598 | {
1599 | "cell_type": "markdown",
1600 | "metadata": {},
1601 | "source": [
1602 | "### Graph Building\n",
1603 | "\n",
1604 | "To limit the amount of data we're looking at, we'll only build the network for people who tweeted about a relevant keyword and the people they mention. \n",
1605 | "We build this network by iterating through all the tweets in our relevant list and extracting the \"user_mentions\" list from the \"entities\" section of each tweet object.\n",
1606 | "For each mention a user makes, we add an edge from that user to the mentioned user."
1607 | ]
1608 | },
1609 | {
1610 | "cell_type": "code",
1611 | "execution_count": null,
1612 | "metadata": {
1613 | "collapsed": false
1614 | },
1615 | "outputs": [],
1616 | "source": [
1617 | "import networkx as nx\n",
1618 | "\n",
1619 | "# We'll use a directed graph since mentions/retweets are directional\n",
1620 | "graph = nx.DiGraph()\n",
1621 | " \n",
1622 | "for tweet in [x for t in sortedTimes for x in topicRelevantMap[t][\"list\"]]:\n",
1623 | " userName = tweet[\"user\"][\"screen_name\"]\n",
1624 | " graph.add_node(userName)\n",
1625 | "\n",
1626 | " mentionList = tweet[\"entities\"][\"user_mentions\"]\n",
1627 | "\n",
1628 | " for otherUser in mentionList:\n",
1629 | " otherUserName = otherUser[\"screen_name\"]\n",
1630 | "        if not graph.has_node(otherUserName):\n",
1631 | " graph.add_node(otherUserName)\n",
1632 | " graph.add_edge(userName, otherUserName)\n",
1633 | " \n",
1634 | "print (\"Number of Users:\", len(graph.node))"
1635 | ]
1636 | },
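{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we cared about how often one user mentions another, not just whether a mention occurred, a small variation on the loop above can accumulate a weight on each edge. This is only a sketch of that variant; the rest of the tutorial continues with the unweighted graph."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Variant: track repeated mentions as edge weights\n",
"weightedGraph = nx.DiGraph()\n",
"\n",
"for tweet in [x for t in sortedTimes for x in topicRelevantMap[t][\"list\"]]:\n",
"    userName = tweet[\"user\"][\"screen_name\"]\n",
"\n",
"    for otherUser in tweet[\"entities\"][\"user_mentions\"]:\n",
"        otherUserName = otherUser[\"screen_name\"]\n",
"        if weightedGraph.has_edge(userName, otherUserName):\n",
"            weightedGraph[userName][otherUserName][\"weight\"] += 1\n",
"        else:\n",
"            weightedGraph.add_edge(userName, otherUserName, weight=1)\n",
"\n",
"print (\"Users in weighted graph:\", weightedGraph.number_of_nodes())"
]
},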
1637 | {
1638 | "cell_type": "markdown",
1639 | "metadata": {},
1640 | "source": [
1641 | "### Central Users\n",
1642 | "\n",
1643 | "In network analysis, \"centrality\" is used to measure the importance of a given node. \n",
1644 | "Many different types of centrality exist, however, each capturing a different kind of importance.\n",
1645 | "Examples include \"closeness centrality,\" which measures how close a node is to all other nodes in the network, and \"betweenness centrality,\" which measures how many shortest paths run through the given node.\n",
1646 | "Nodes with high closeness centrality are important for rapidly disseminating information (or spreading disease), whereas nodes with high betweenness are important for keeping the network connected; we will compute both on our graph below for comparison.\n",
1647 | "\n",
1648 | "PageRank is another algorithm for measuring importance, proposed by Sergey Brin and Larry Page as the core of Google's early search engine.\n",
1649 | "NetworkX has an implementation of the PageRank algorithm that we can use to look at the most important/authoritative users on Twitter based on their connections to other users."
1650 | ]
1651 | },
1652 | {
1653 | "cell_type": "code",
1654 | "execution_count": null,
1655 | "metadata": {
1656 | "collapsed": false
1657 | },
1658 | "outputs": [],
1659 | "source": [
1660 | "# Now we prune the graph for performance reasons:\n",
1661 | "# remove all nodes with fewer than two edges\n",
1662 | "\n",
1663 | "nodeList = [n for n,d in graph.degree_iter() if d<2]\n",
1664 | "graph.remove_nodes_from(nodeList)\n",
1665 | "print (\"Number of Remaining Users:\", len(graph.node))"
1666 | ]
1667 | },
1668 | {
1669 | "cell_type": "code",
1670 | "execution_count": null,
1671 | "metadata": {
1672 | "collapsed": false
1673 | },
1674 | "outputs": [],
1675 | "source": [
1676 | "# This may take a while\n",
1677 | "pageRankList = nx.pagerank_numpy(graph)"
1678 | ]
1679 | },
1680 | {
1681 | "cell_type": "code",
1682 | "execution_count": null,
1683 | "metadata": {
1684 | "collapsed": false
1685 | },
1686 | "outputs": [],
1687 | "source": [
1688 | "highRankNodes = sorted(pageRankList.keys(), key=pageRankList.get, reverse=True)\n",
1689 | "for x in highRankNodes[:20]:\n",
1690 | " print (x, pageRankList[x])\n",
1691 | " "
1692 | ]
1693 | },
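{
"cell_type": "markdown",
"metadata": {},
"source": [
"PageRank is only one of the centrality measures described above. As a rough comparison, here is a minimal sketch of closeness and betweenness centrality on the same pruned graph (betweenness in particular can be slow on larger graphs)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Closeness and betweenness centrality on the pruned mention graph\n",
"closeness = nx.closeness_centrality(graph)\n",
"betweenness = nx.betweenness_centrality(graph)\n",
"\n",
"print (\"Top users by closeness centrality:\")\n",
"for x in sorted(closeness, key=closeness.get, reverse=True)[:10]:\n",
"    print (x, closeness[x])\n",
"\n",
"print (\"Top users by betweenness centrality:\")\n",
"for x in sorted(betweenness, key=betweenness.get, reverse=True)[:10]:\n",
"    print (x, betweenness[x])"
]
},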
1694 | {
1695 | "cell_type": "code",
1696 | "execution_count": null,
1697 | "metadata": {
1698 | "collapsed": false
1699 | },
1700 | "outputs": [],
1701 | "source": [
1702 | "plt.figure(figsize=(8,8))\n",
1703 | "pos = nx.spring_layout(graph, scale=100, iterations=100, k=0.2)\n",
1704 | "nx.draw(graph, \n",
1705 | " pos, \n",
1706 | " node_color='#A0CBE2', \n",
1707 | " width=1, \n",
1708 | " with_labels=False,\n",
1709 | " node_size=50)\n",
1710 | "\n",
1711 | "hrNames = highRankNodes[:20]\n",
1712 | "hrDict = dict(zip(hrNames, hrNames))\n",
1713 | "hrValues = [pageRankList[x] for x in hrNames]\n",
1714 | "\n",
1715 | "nx.draw_networkx_nodes(graph,pos,nodelist=hrNames,\n",
1716 | " node_size=200,\n",
1717 | " node_color=hrValues,\n",
1718 | " cmap=plt.cm.Reds_r)\n",
1719 | "\n",
1720 | "nx.draw_networkx_labels(graph,\n",
1721 | " pos,\n",
1722 | " labels=hrDict,\n",
1723 | "                        font_size=36,\n",
1724 | " font_color=\"g\")\n",
1725 | "\n",
1726 | "plt.axis('off')\n",
1727 | "plt.show()"
1728 | ]
1729 | },
1730 | {
1731 | "cell_type": "code",
1732 | "execution_count": null,
1733 | "metadata": {
1734 | "collapsed": true
1735 | },
1736 | "outputs": [],
1737 | "source": []
1738 | }
1739 | ],
1740 | "metadata": {
1741 | "kernelspec": {
1742 | "display_name": "Python 3",
1743 | "language": "python",
1744 | "name": "python3"
1745 | },
1746 | "language_info": {
1747 | "codemirror_mode": {
1748 | "name": "ipython",
1749 | "version": 3
1750 | },
1751 | "file_extension": ".py",
1752 | "mimetype": "text/x-python",
1753 | "name": "python",
1754 | "nbconvert_exporter": "python",
1755 | "pygments_lexer": "ipython3",
1756 | "version": "3.6.0"
1757 | }
1758 | },
1759 | "nbformat": 4,
1760 | "nbformat_minor": 0
1761 | }
1762 |
--------------------------------------------------------------------------------
/notebooks/files/FacebookInstructions_f1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/FacebookInstructions_f1.png
--------------------------------------------------------------------------------
/notebooks/files/FacebookLogo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/FacebookLogo.jpg
--------------------------------------------------------------------------------
/notebooks/files/Me2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/Me2.jpg
--------------------------------------------------------------------------------
/notebooks/files/RedditLogo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/RedditLogo.jpg
--------------------------------------------------------------------------------
/notebooks/files/TwitterInstructions_f1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/TwitterInstructions_f1.png
--------------------------------------------------------------------------------
/notebooks/files/TwitterInstructions_f2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/TwitterInstructions_f2.png
--------------------------------------------------------------------------------
/notebooks/files/TwitterInstructions_f3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/TwitterInstructions_f3.png
--------------------------------------------------------------------------------
/notebooks/files/TwitterInstructions_f4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/TwitterInstructions_f4.png
--------------------------------------------------------------------------------
/notebooks/files/TwitterInstructions_f5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/TwitterInstructions_f5.png
--------------------------------------------------------------------------------
/notebooks/files/TwitterInstructions_f6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/TwitterInstructions_f6.png
--------------------------------------------------------------------------------
/notebooks/files/TwitterLogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/TwitterLogo.png
--------------------------------------------------------------------------------
/notebooks/files/fb_screens/Screen Shot 2016-05-25 at 3.17.49 AM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/fb_screens/Screen Shot 2016-05-25 at 3.17.49 AM.png
--------------------------------------------------------------------------------
/notebooks/files/intermission.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/intermission.jpg
--------------------------------------------------------------------------------
/notebooks/files/reddit_screens/0-001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/reddit_screens/0-001.png
--------------------------------------------------------------------------------
/notebooks/files/reddit_screens/1-002.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/reddit_screens/1-002.png
--------------------------------------------------------------------------------
/notebooks/files/reddit_screens/1-003.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/reddit_screens/1-003.png
--------------------------------------------------------------------------------
/notebooks/files/twitter_screens/Screen Shot 2016-05-25 at 8.56.39 AM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/twitter_screens/Screen Shot 2016-05-25 at 8.56.39 AM.png
--------------------------------------------------------------------------------
/notebooks/files/twitter_screens/Screen Shot 2016-05-25 at 8.56.41 AM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/twitter_screens/Screen Shot 2016-05-25 at 8.56.41 AM.png
--------------------------------------------------------------------------------
/notebooks/files/twitter_screens/Screen Shot 2016-05-25 at 8.56.57 AM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/notebooks/files/twitter_screens/Screen Shot 2016-05-25 at 8.56.57 AM.png
--------------------------------------------------------------------------------
/slides/00 - Introduction.key:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/slides/00 - Introduction.key
--------------------------------------------------------------------------------
/slides/00 - Introduction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/slides/00 - Introduction.pdf
--------------------------------------------------------------------------------
/slides/01 - Data Acquisition.key:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/slides/01 - Data Acquisition.key
--------------------------------------------------------------------------------
/slides/01 - Data Acquisition.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/slides/01 - Data Acquisition.pdf
--------------------------------------------------------------------------------
/slides/02 - Advanced Analysis.key:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/slides/02 - Advanced Analysis.key
--------------------------------------------------------------------------------
/slides/02 - Advanced Analysis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cbuntain/TutorialSocialMediaCrisis/4657d8c850d54d7effe8f7f78d514c2897169492/slides/02 - Advanced Analysis.pdf
--------------------------------------------------------------------------------