├── .gitignore ├── CHANGES.txt ├── LICENSE.txt ├── MANIFEST.in ├── README ├── README.md ├── example_config_file ├── gnip_filter_analysis.py ├── gnip_search.py ├── gnip_time_series.py ├── img ├── earthquake_cycle_trend_line.png ├── earthquake_time_line.png └── earthquake_time_peaks_line.png ├── job.json ├── rules.txt ├── search ├── __init__.py ├── api.py ├── results.py ├── test_api.py └── test_results.py ├── setup.cfg ├── setup.py └── test_search.sh /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | .gnip 3 | *.csv 4 | *.png 5 | *.swp 6 | *.pickle 7 | *.log 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Packages 13 | *.egg 14 | *.egg-info 15 | dist 16 | build 17 | eggs 18 | parts 19 | bin 20 | var 21 | sdist 22 | develop-eggs 23 | .installed.cfg 24 | lib 25 | lib64 26 | 27 | # Installer logs 28 | pip-log.txt 29 | 30 | # Unit test / coverage reports 31 | .coverage 32 | .tox 33 | nosetests.xml 34 | 35 | # Translations 36 | *.mo 37 | 38 | # Mr Developer 39 | .mr.developer.cfg 40 | .project 41 | .pydevproject 42 | MANIFEST 43 | -------------------------------------------------------------------------------- /CHANGES.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/CHANGES.txt -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2012, Scott Hendrickson 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the FreeBSD Project. 
27 | 


--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include *.txt
2 | recursive-include docs *.txt


--------------------------------------------------------------------------------
/README:
--------------------------------------------------------------------------------
1 | See README.md


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Gnip Python Search API Utilities
  2 | ================================
  3 | 
  4 | This package includes two utilities:
  5 | - Gnip Search API interaction, including Search V2 and paging support
  6 | - Timeseries analysis and plotting
  7 | 
  8 | #### Installation
  9 | Install from PyPI with `pip install gapi`
 10 | Or, to use the full timeline capability, `pip install gapi[timeline]`
 11 | 
 12 | ## Search API
 13 | 
 14 | Usage:
 15 | 
 16 | $ gnip_search.py -h
 17 | 
 18 | usage: gnip_search.py [-h] [-a] [-c] [-b COUNT_BUCKET] [-e END] [-f FILTER]
 19 |                       [-l STREAM_URL] [-n MAX] [-N HARD_MAX] [-p PASSWORD]
 20 |                       [-q] [-s START] [-u USER] [-w OUTPUT_FILE_PATH] [-t]
 21 |                       USE_CASE
 22 | 
 23 | GnipSearch supports the following use cases: ['json', 'wordcount', 'users',
 24 | 'rate', 'links', 'timeline', 'geo', 'audience']
 25 | 
 26 | positional arguments:
 27 |   USE_CASE              Use case for this search.
 28 | 
 29 | optional arguments:
 30 |   -h, --help            show this help message and exit
 31 |   -a, --paged           Paged access to ALL available results (Warning: this
 32 |                         makes many requests)
 33 |   -c, --csv             Return comma-separated 'date,counts' or geo data.
 34 |   -b COUNT_BUCKET, --bucket COUNT_BUCKET
 35 |                         Bucket size for counts query. Options are day, hour,
 36 |                         minute (default is 'day').
 37 |   -e END, --end-date END
 38 |                         End of datetime window, format 'YYYY-mm-DDTHH:MM'
 39 |                         (default: most recent activities)
 40 |   -f FILTER, --filter FILTER
 41 |                         PowerTrack filter rule (See: http://support.gnip.com/c
 42 |                         ustomer/portal/articles/901152-powertrack-operators)
 43 |   -l STREAM_URL, --stream-url STREAM_URL
 44 |                         Url of search endpoint. (See your Gnip console.)
 45 |   -n MAX, --results-max MAX
 46 |                         Maximum results to return per page (default 100; max
 47 |                         500)
 48 |   -N HARD_MAX, --hard-max HARD_MAX
 49 |                         Maximum results to return for all pages; see -a option
 50 |   -p PASSWORD, --password PASSWORD
 51 |                         Password
 52 |   -q, --query           View API query (no data)
 53 |   -s START, --start-date START
 54 |                         Start of datetime window, format 'YYYY-mm-DDTHH:MM'
 55 |                         (default: 30 days ago)
 56 |   -u USER, --user-name USER
 57 |                         User name
 58 |   -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH
 59 |                         Create files in ./OUTPUT-FILE-PATH. This path must
 60 |                         exist and will not be created. This option is
 61 |                         available only with the -a option. Default is no output
 62 |                         files.
 63 |   -t, --search-v2       Use the search API v2 endpoint. [This is deprecated and
 64 |                         is automatically set based on endpoint.]
 65 | 
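
The command-line tool is a thin wrapper around the `search.results.Results` class from the bundled `search` package. If you would rather call it from Python, the sketch below mirrors the constructor call made in gnip_search.py; the credentials, endpoint URL, and filter are placeholders, and the argument meanings are inferred from that call rather than from separate API documentation:

    from search.results import Results

    # Placeholders -- substitute your credentials and the endpoint from your Gnip console.
    results = Results(
        "me@example.com"            # user
        , "XXXXXXXX"                # password
        , "https://gnip-api.twitter.com/search/30day/accounts/<account>/<label>.json"
        , False                     # paged access to all results (-a)
        , None                      # output file path (-w)
        , pt_filter="from:gnip"
        , max_results=100
        , start=None
        , end=None
        , count_bucket=None         # None returns activities; "day", "hour" or "minute" returns counts
        , show_query=False
        , hard_max=None
        )

    for activity in results.get_activities():
        print(activity)

gnip_filter_analysis.py drives the same class with `count_bucket` set and reads bucketed counts back through `get_time_series()` instead of `get_activities()`.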
 66 | 
 67 | ## Using a configuration file
 68 | 
 69 | To avoid entering the -u, -p and -l options for every command, create a configuration file named ".gnip"
 70 | in the directory where you will run the code. When this file contains the correct parameters, you can omit
 71 | these command line parameters.
 72 | 
 73 | Use this template:
 74 | 
 75 | # export GNIP_CONFIG_FILE=
 76 | #
 77 | [creds]
 78 | un = 
 79 | pwd = 
 80 | 
 81 | [endpoint]
 82 | # replace with your endpoint
 83 | url = https://search.gnip.com/accounts/shendrickson/search/wayback.json
 84 | 
 85 | ### Use cases
 86 | 
 87 | #### JSON
 88 | 
 89 | Return full, enriched, Activity Streams-format JSON payloads from the Search API endpoint. Run gnip_search.py from the Gnip-Python-Search-API-Utilities directory:
 90 | 
 91 | Note: If you have a GNIP_CONFIG_FILE defined (try echo $GNIP_CONFIG_FILE, it should return the path to the config that you created), the -u and -p arguments are not necessary.
 92 | 
 93 | $ ./gnip_search.py -uXXX -pXXX -f"from:Gnip" json
 94 | {"body": "RT @bbi: The #BigBoulder bloggers have been busy. Head to http://t.co/Rwve0dVA82 for recaps of the Sina Weibo, Tumblr & Academic Research s\u2026", "retweetCount": 3, "generator": {"link": "http://twitter.com", "displayName": "Twitter Web Client"}, "twitter_filter_level": "medium", "gnip": {"klout_profile": {"link": "http://klout.com/user/id/651348", "topics": [{"link": "http://klout.com/topic/id/5144818194631006088", "displayName": "Software", "
 95 | ...
 96 | 
 97 | *Notes*
 98 | 
 99 | The ``-a`` option (paging) collects _all_ results before printing to stdout/file and also forces ``-n 500`` per request. The paging
100 | option will collect up to 500,000 tweets, which may take hours and be very costly.
101 | 
102 | 
103 | #### Wordcount
104 | 
105 | Return top 1- and 2-grams - with counts and document frequency - from matching activities. You can modify the settings in the simple ngrams package (``sngrams``) to change the range of output.
106 | 107 | $ ./gnip_search.py -uXXX -pXXX -f"world cup" -n200 wordcount 108 | ------------------------------------------------------------ 109 | terms -- mentions activities (200) 110 | ------------------------------------------------------------ 111 | world -- 203 11.41% 198 99.00% 112 | cup -- 203 11.41% 198 99.00% 113 | ceremony -- 46 2.59% 45 22.50% 114 | opening -- 45 2.53% 45 22.50% 115 | fifa -- 25 1.41% 25 12.50% 116 | 2014 -- 22 1.24% 22 11.00% 117 | brazil -- 20 1.12% 19 9.50% 118 | watching -- 15 0.84% 12 6.00% 119 | ready -- 14 0.79% 14 7.00% 120 | tonight -- 11 0.62% 11 5.50% 121 | game -- 11 0.62% 11 5.50% 122 | wait -- 10 0.56% 10 5.00% 123 | million -- 10 0.56% 8 4.00% 124 | first -- 10 0.56% 10 5.00% 125 | indonesia -- 10 0.56% 2 1.00% 126 | time -- 10 0.56% 9 4.50% 127 | niallofficial -- 9 0.51% 9 4.50% 128 | here -- 9 0.51% 9 4.50% 129 | majooooorr -- 9 0.51% 9 4.50% 130 | braziiiilllll -- 9 0.51% 9 4.50% 131 | world cup -- 198 12.54% 196 98.00% 132 | opening ceremony -- 33 2.09% 33 16.50% 133 | cup opening -- 23 1.46% 23 11.50% 134 | fifa world -- 23 1.46% 23 11.50% 135 | cup 2014 -- 13 0.82% 13 6.50% 136 | ready world -- 12 0.76% 12 6.00% 137 | cup tonight -- 11 0.70% 11 5.50% 138 | niallofficial first -- 9 0.57% 9 4.50% 139 | cima majooooorr -- 9 0.57% 9 4.50% 140 | cmon braziiiilllll -- 9 0.57% 9 4.50% 141 | tonight wait -- 9 0.57% 9 4.50% 142 | wait pra -- 9 0.57% 9 4.50% 143 | majooooorr cmon -- 9 0.57% 9 4.50% 144 | game world -- 9 0.57% 9 4.50% 145 | pra cima -- 9 0.57% 9 4.50% 146 | watching world -- 9 0.57% 7 3.50% 147 | first game -- 9 0.57% 9 4.50% 148 | indonesia indonesia -- 8 0.51% 2 1.00% 149 | watch world -- 8 0.51% 8 4.00% 150 | ceremony world -- 7 0.44% 7 3.50% 151 | ------------------------------------------------------------ 152 | 153 | 154 | 155 | #### Users 156 | 157 | Return the most common usernames occuring in matching activities 158 | 159 | $ ./gnip_search.py -uXXX -pXXX -f"obama" -n500 users 160 | ------------------------------------------------------------ 161 | terms -- mentions activities (500) 162 | ------------------------------------------------------------ 163 | tsalazar66 -- 5 1.00% 5 1.00% 164 | sunnyherring1 -- 5 1.00% 5 1.00% 165 | debwilliams57 -- 3 0.60% 3 0.60% 166 | tattooq -- 2 0.40% 2 0.40% 167 | carlanae -- 2 0.40% 2 0.40% 168 | miisslys -- 2 0.40% 2 0.40% 169 | celtic_norse -- 2 0.40% 2 0.40% 170 | tvkoolturaldgoh -- 2 0.40% 2 0.40% 171 | tarynmorman -- 2 0.40% 2 0.40% 172 | __coleston_s__ -- 2 0.40% 2 0.40% 173 | alinka2linka -- 2 0.40% 2 0.40% 174 | falakhzafrieyl -- 2 0.40% 2 0.40% 175 | coolstoryluk -- 2 0.40% 2 0.40% 176 | law_colorado -- 2 0.40% 2 0.40% 177 | genelingerfelt -- 2 0.40% 2 0.40% 178 | annerkissed69 -- 2 0.40% 2 0.40% 179 | shotoftheweek -- 2 0.40% 2 0.40% 180 | matemary1 -- 2 0.40% 2 0.40% 181 | orlando_ooh -- 2 0.40% 2 0.40% 182 | c0nt0stavl0s__ -- 2 0.40% 2 0.40% 183 | ------------------------------------------------------------ 184 | 185 | 186 | 187 | #### Rate 188 | 189 | Calculate the approximate activity rate from matched activities. 
190 | 191 | $ ./gnip_search.py -uXXX -pXXX -f"from:jrmontag" -n500 rate 192 | ------------------------------------------------------------ 193 | PowerTrack Rule: "from:jrmontag" 194 | Oldest Tweet (UTC): 2014-05-13 02:14:44 195 | Newest Tweet (UTC): 2014-06-12 18:41:44.306984 196 | Now (UTC): 2014-06-12 18:41:55 197 | 254 Tweets: 0.345 Tweets/Hour 198 | ------------------------------------------------------------ 199 | 200 | 201 | 202 | #### Links 203 | 204 | Return the most frequently observed links - count and document frequency - in matching activities 205 | 206 | $ ./gnip_search.py -uXXX -pXXX -f"from:drskippy" -n500 links 207 | --------------------------------------------------------------------------------------------------------------------------------- 208 | links -- mentions activities (31) 209 | --------------------------------------------------------------------------------------------------------------------------------- 210 | nolinks -- 9 27.27% 9 26.47% 211 | http://twitter.com/mutualmind/status/476460889147600896/photo/1 -- 1 3.03% 1 2.94% 212 | http://thenewinquiry.com/essays/the-anxieties-of-big-data/ -- 1 3.03% 1 2.94% 213 | http://www.nytimes.com/2014/05/30/opinion/krugman-cutting-back-on-carbon.html?hp&rref=opinion&_r=0 -- 1 3.03% 1 2.94% 214 | http://twitter.com/mdcin303/status/474991971170131968/photo/1 -- 1 3.03% 1 2.94% 215 | http://twitter.com/notfromshrek/status/475034884189085696/photo/1 -- 1 3.03% 1 2.94% 216 | https://github.com/dlwh/epic -- 1 3.03% 1 2.94% 217 | http://twitter.com/jrmontag/status/471762525449900032/photo/1 -- 1 3.03% 1 2.94% 218 | http://pandas.pydata.org/pandas-docs/stable/whatsnew.html -- 1 3.03% 1 2.94% 219 | http://www.economist.com/blogs/graphicdetail/2014/06/daily-chart-1 -- 1 3.03% 1 2.94% 220 | http://www.zdnet.com/google-turns-to-machine-learning-to-build-a-better-datacentre-7000029930/ -- 1 3.03% 1 2.94% 221 | https://groups.google.com/forum/#!topic/scalanlp-discuss/bd9jhmm2nxc -- 1 3.03% 1 2.94% 222 | http://www.ladamic.com/wordpress/?p=681 -- 1 3.03% 1 2.94% 223 | http://www.linkedin.com/today/post/article/20140407232811-442872-do-your-analysts-really-analyze -- 1 3.03% 1 2.94% 224 | http://twitter.com/giorgiocaviglia/status/474319737761980417/photo/1 -- 1 3.03% 1 2.94% 225 | http://faculty.washington.edu/kstarbi/starbird_iconference2014-final.pdf -- 1 3.03% 1 2.94% 226 | http://twitter.com/drskippy/status/474903707407384576/photo/1 -- 1 3.03% 1 2.94% 227 | http://en.wikipedia.org/wiki/lissajous_curve#logos_and_other_uses -- 1 3.03% 1 2.94% 228 | http://datacolorado.com/knitr_test/ -- 1 3.03% 1 2.94% 229 | http://opendata-hackday.de/?page_id=227 -- 1 3.03% 1 2.94% 230 | --------------------------------------------------------------------------------------------------------------------------------- 231 | 232 | 233 | 234 | #### Timeline 235 | 236 | Return a count timeline of matching activities. Without further options, results are returned in JSON format... 237 | 238 | $ ./gnip_search.py -uXXX -pXXX -f"@cia" timeline 239 | {"results": [{"count": 32, "timePeriod": "201405130000"}, {"count": 31, "timePeriod": "201405140000"}, 240 | 241 | Results can be returned in comma-delimited format with the ``-c`` option: 242 | 243 | $ ./gnip_search.py -uXXX -pXXX -f"@cia" timeline -c 244 | 2014-05-13T00:00:00,32 245 | 2014-05-14T00:00:00,31 246 | 2014-05-15T00:00:00,23 247 | 2014-05-16T00:00:00,81 248 | ... 
249 | 250 | 251 | And bucket size can be adjusted with ``-b``: 252 | 253 | $ ./gnip_search.py -uXXX -pXXX -f"@cia" timeline -c -b hour 254 | ... 255 | 2014-06-06T11:00:00,0 256 | 2014-06-06T12:00:00,0 257 | 2014-06-06T13:00:00,0 258 | 2014-06-06T14:00:00,0 259 | 2014-06-06T15:00:00,1 260 | 2014-06-06T16:00:00,0 261 | 2014-06-06T17:00:00,7234 262 | 2014-06-06T18:00:00,77403 263 | 2014-06-06T19:00:00,44704 264 | 2014-06-06T20:00:00,38512 265 | 2014-06-06T21:00:00,23463 266 | 2014-06-06T22:00:00,17458 267 | 2014-06-06T23:00:00,13352 268 | 2014-06-07T00:00:00,12618 269 | 2014-06-07T01:00:00,11373 270 | 2014-06-07T02:00:00,10641 271 | 2014-06-07T03:00:00,9457 272 | ... 273 | 274 | 275 | #### Geo 276 | 277 | Return JSON payloads with the latitude, longitude, timestamp, and activity id for matching activities 278 | 279 | $ ./gnip_search.py -uXXX -pXXX -f"vamos has:geo" geo 280 | {"latitude": 4.6662819, "postedTime": "2014-06-12T18:52:48", "id": "477161613775351808", "longitude": -74.0557122} 281 | {"latitude": null, "postedTime": "2014-06-12T18:52:48", "id": "477161614354165760", "longitude": null} 282 | {"latitude": -24.4162955, "postedTime": "2014-06-12T18:52:47", "id": "477161609786568704", "longitude": -53.5296426} 283 | {"latitude": 14.66637167, "postedTime": "2014-06-12T18:52:47", "id": "477161607299342336", "longitude": -90.52661} 284 | {"latitude": -22.94064485, "postedTime": "2014-06-12T18:52:45", "id": "477161600429088769", "longitude": -43.05257938} 285 | ... 286 | 287 | 288 | This can also be output in delimited format: 289 | 290 | $ ./gnip_search.py -uXXX -pXXX -f"vamos has:geo" geo -c 291 | 477161971364933632,2014-06-12T18:54:13,-6.350394,38.926667 292 | 477161943015636992,2014-06-12T18:54:07,-46.60175585,-23.63230955 293 | 477161939647623168,2014-06-12T18:54:06,-49.0363085,-26.6042339 294 | 477161938833907712,2014-06-12T18:54:06,-1.5364198,53.9949317 295 | 477161936938094592,2014-06-12T18:54:05,-76.06161259,1.84834405 296 | 477161932806692865,2014-06-12T18:54:04,None,None 297 | 477161928377516032,2014-06-12T18:54:03,-51.08593214,0.03778787 298 | 299 | #### Audience 300 | 301 | Return the list of all of the users ids represented by matching activities 302 | 303 | $ ./gnip_search.py -n15 -f "call mom" audience 304 | -------------------------------------------------------------------------------- 305 | 229152598 306 | 458139782 307 | 1371311486 308 | 356605896 309 | 1214494260 310 | 2651237064 311 | 2468197068 312 | 1473613993 313 | 408876524 314 | 245142830 315 | 2158092706 316 | 119980244 317 | 2207663371 318 | 291388723 319 | 3106639108 320 | 321 | ### Simple Timeseries Analysis 322 | 323 | Usage: 324 | 325 | $ gnip_time_series.py -h 326 | 327 |
328 | usage: gnip_time_series.py [-h] [-b COUNT_BUCKET] [-e END] [-f FILTER]
329 |                            [-g SECOND_FILTER] [-l STREAM_URL] [-p PASSWORD]
330 |                            [-s START] [-u USER] [-t] [-w OUTPUT_FILE_PATH]
331 | 
332 | GnipSearch timeline tools
333 | 
334 | optional arguments:
335 |   -h, --help            show this help message and exit
336 |   -b COUNT_BUCKET, --bucket COUNT_BUCKET
337 |                         Bucket size for counts query. Options are day, hour,
338 |                         minute (default is 'day').
339 |   -e END, --end-date END
340 |                         End of datetime window, format 'YYYY-mm-DDTHH:MM'
341 |                         (default: most recent activities)
342 |   -f FILTER, --filter FILTER
343 |                         PowerTrack filter rule (See: http://support.gnip.com/c
344 |                         ustomer/portal/articles/901152-powertrack-operators)
345 |   -g SECOND_FILTER, --second_filter SECOND_FILTER
346 |                         Use a second filter to show correlation plots of -f
347 |                         timeline vs -g timeline.
348 |   -l STREAM_URL, --stream-url STREAM_URL
349 |                         Url of search endpoint. (See your Gnip console.)
350 |   -p PASSWORD, --password PASSWORD
351 |                         Password
352 |   -s START, --start-date START
353 |                         Start of datetime window, format 'YYYY-mm-DDTHH:MM'
354 |                         (default: 30 days ago)
355 |   -u USER, --user-name USER
356 |                         User name
357 |   -t, --get-topics      Set flag to evaluate peak topics (this may take a few
358 |                         minutes)
359 |   -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH
360 |                         Create files in ./OUTPUT-FILE-PATH. This path must
361 |                         exist and will not be created.
362 | 
363 | 
364 | #### Example Plots
365 | 
366 | 
367 | Example output from the command:
368 | 
369 | gnip_time_series.py -f "earthquake" -s2015-10-01T00:00:00 -e2015-11-18T00:00:00 -t -bhour
370 | 
371 | ![Image of Earthquake Timeline](/img/earthquake_time_line.png)
372 | 
373 | ![Image of Earthquake Trend and Variation](/img/earthquake_cycle_trend_line.png)
374 | 
375 | ![Image of Earthquake Peaks ](/img/earthquake_time_peaks_line.png)
376 | 
377 | #### Dependencies
378 | Gnip's Search 2.0 API access is required.
379 | 
380 | In addition to the basic Gnip Search utility described immediately above, this package
381 | depends on a number of other large packages:
382 | 
383 | * matplotlib
384 | * numpy
385 | * pandas
386 | * statsmodels
387 | * scipy
388 | 
389 | #### Notes
390 | * You should create the path "plots" in the directory where you run the utility. This directory will contain the
391 | time series and analysis plots.
392 | * This utility creates an extensive log file named time_series.log. It contains many details of parameter
393 | settings and intermediate outputs.
394 | * On a remote machine or server, change your matplotlib backend by creating a local matplotlibrc file (an in-code alternative is sketched after the example below). Create Gnip-Python-Search-API-Utilities/matplotlibrc:
395 | 
396 | 
397 |   # Change the backend to Agg to avoid errors when matplotlib cannot display the plots
398 |   # More information on creating and editing a matplotlibrc file at: http://matplotlib.org/users/customizing.html
399 |   backend      : Agg
400 | 
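
Alternatively, if you are running a locally modified copy of gnip_time_series.py on a headless server, the same backend can be selected in code before pyplot is imported (this is standard matplotlib usage, not something the script does for you):

    import matplotlib
    matplotlib.use("Agg")               # must run before "import matplotlib.pyplot"
    import matplotlib.pyplot as plt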
401 | 
402 | ### Filter Analysis
403 | 
404 | $ ./gnip_filter_analysis.py -h
405 | 
406 | usage: gnip_filter_analysis.py [-h] [-j JOB_DESCRIPTION] [-b COUNT_BUCKET]
407 |                                [-l STREAM_URL] [-p PASSWORD] [-r RANK_SAMPLE]
408 |                                [-q] [-u USER] [-w OUTPUT_FILE_PATH]
409 | 
410 | Creates an aggregated filter statistics summary from filter rules and date
411 | periods in the job description.
412 | 
413 | optional arguments:
414 |   -h, --help            show this help message and exit
415 |   -j JOB_DESCRIPTION, --job_description JOB_DESCRIPTION
416 |                         JSON formatted job description file
417 |   -b COUNT_BUCKET, --bucket COUNT_BUCKET
418 |                         Bucket size for counts query. Options are day, hour,
419 |                         minute (default is 'day').
420 |   -l STREAM_URL, --stream-url STREAM_URL
421 |                         Url of search endpoint. (See your Gnip console.)
422 |   -p PASSWORD, --password PASSWORD
423 |                         Password
424 |   -r RANK_SAMPLE, --rank_sample RANK_SAMPLE
425 |                         Rank inclusive sampling depth. Default is None. This
426 |                         runs filter rule production for rank1, rank1 OR rank2,
427 |                         rank1 OR rank2 OR rank3, etc., to the depth specified.
428 |   -q, --query           View API query (no data)
429 |   -u USER, --user-name USER
430 |                         User name
431 |   -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH
432 |                         Create files in ./OUTPUT-FILE-PATH. This path must
433 |                         exist and will not be created. Default is ./data
434 | 
435 | 
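
For example, with `-r 3` the tool issues one additional counts query per depth, OR-ing together the highest-volume rules. A rough sketch of how those filters are composed (the rule text is illustrative, and the grouping mirrors the rank-sampling code in gnip_filter_analysis.py):

    # Top three rules by count from a previous run, in rank order (illustrative).
    rank_list = ["dog", "cat", "puppy OR kitten"]

    for depth in range(3):              # -r 3
        filter_str = "((" + ") OR (".join(rank_list[:depth + 1]) + "))"
        print(filter_str)
    # ((dog))
    # ((dog) OR (cat))
    # ((dog) OR (cat) OR (puppy OR kitten))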
436 | 
437 | Example output to compare 7 rules across 2 time periods:
438 | 
439 | job.json:
440 | 
441 | 
442 | {
443 |   "date_ranges": [
444 |     {
445 |       "end": "2015-06-01T00:00:00",
446 |       "start": "2015-05-01T00:00:00"
447 |     },
448 |     {
449 |       "end": "2015-12-01T00:00:00",
450 |       "start": "2015-11-01T00:00:00"
451 |     }
452 |   ],
453 |   "rules": [
454 |     {
455 |       "tag": "common pet",
456 |       "value": "dog"
457 |     },
458 |     {
459 |       "tag": "common pet",
460 |       "value": "cat"
461 |     },
462 |     {
463 |       "tag": "common pet",
464 |       "value": "hamster"
465 |     },
466 |     {
467 |       "tag": "abstract pet",
468 |       "value": "pet"
469 |     },
470 |     {
471 |       "tag": "pet owner destination",
472 |       "value": "vet"
473 |     },
474 |     {
475 |       "tag": "pet owner destination",
476 |       "value": "kennel"
477 |     },
478 |     {
479 |       "tag": "diminutives",
480 |       "value": "puppy OR kitten"
481 |     }
482 |   ]
483 | }
484 | 
485 | 
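
A job description is a JSON object with required "rules" and "date_ranges" lists and an optional "negation_rules" list (used with the -n flag); gnip_filter_analysis.py exits if either required key is missing. A minimal pre-flight check, assuming the file is named job.json:

    import json

    with open("job.json") as f:
        job = json.load(f)

    # gnip_filter_analysis.py refuses to run without both of these keys
    assert all(key in job for key in ("rules", "date_ranges")), "incomplete job description"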
486 | 
487 | Output:
488 | 
489 | 
490 | $ ./gnip_filter_analysis.py -r 3
491 | ...
492 | start_date                                          2015-05-01T00:00:00  2015-11-01T00:00:00       All
493 | filter                                                                                                
494 | All                                                            42691589             46780243  89471832
495 | dog OR cat OR hamster OR pet OR vet OR kennel O...             20864710             22831053  43695763
496 | dog                                                             8096637              9218028  17314665
497 | cat                                                             8378681              8705244  17083925
498 | puppy OR kitten                                                 2392041              2659051   5051092
499 | pet                                                             2101044              2345140   4446184
500 | vet                                                              620178               749802   1369980
501 | hamster                                                          199634               226864    426498
502 | kennel                                                            38664                45061     83725
503 | 
504 | start_date                                          2015-05-01T00:00:00  2015-11-01T00:00:00        All
505 | filter                                                                                                 
506 | All                                                            63640524             69822220  133462744
507 | dog OR cat OR hamster OR pet OR vet OR kennel O...             20864710             22831053   43695763
508 | dog OR cat OR puppy OR kitten                                  18410402             20096764   38507166
509 | dog OR cat                                                     16268900             17662083   33930983
510 | dog                                                             8096512              9232320   17328832
511 | </pre>
512 | 
513 | So for this rule set, the redundancy is 89471832/43695763 - 1 = 1.0476, and the
514 | 3-rule approximation to the corpus gives 38507166/43695763 = 0.8813, or about 88%
515 | of the tweets of the full rule set.
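
Those two figures can be reproduced directly from the tables above (values copied from the pivot output; `float()` avoids integer division under Python 2):

    table_total   = 89471832   # "All" margin of the first table (combined rule plus each individual rule)
    combined_rule = 43695763   # the single OR-ed rule covering the whole set
    top3_rules    = 38507166   # "dog OR cat OR puppy OR kitten" from the ranked table

    print(table_total / float(combined_rule) - 1)   # ~1.048, the redundancy
    print(top3_rules / float(combined_rule))        # ~0.881, i.e. the 3-rule filter keeps ~88% of the tweets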
516 | 
517 | Additionally, csv output of the raw counts and a csv version of the pivot table are
518 | written to the specified data directory.
519 | 
520 | #### Dependencies
521 | Gnip's Search 2.0 API access is required.
522 | 
523 | In addition to the basic Gnip Search utility described immediately above, this package
524 | depends on a number of other large packages:
525 | 
526 | * numpy
527 | * pandas
528 | 
529 | #### Notes
530 | * Unlike the other utilities provided, the default file path is set to "./data" to provide
531 | full access to output results. Therefore, you should create the path "data" in the directory
532 | where you run the utility. It will contain the data outputs.
533 | 
534 | ## License
535 | Gnip-Python-Search-API-Utilities by Scott Hendrickson, Josh Montague and Jeff Kolb is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
536 | 


--------------------------------------------------------------------------------
/example_config_file:
--------------------------------------------------------------------------------
 1 | # use your credentials and endpoint url to configure command line access
 2 | # so the tools work without the command line options
 3 | # Either (1) rename this file .gnip in this directory or, to run from anywhere,
 4 | # export GNIP_CONFIG_FILE=
 5 | #
 6 | [creds]
 7 | un = 
 8 | pwd = 
 9 | 
10 | [endpoint]
11 | # replace with your endpoint
12 | url = https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json
13 | 
14 | [defaults]
15 | # none
16 | 
17 | [tmp]
18 | # none
19 | 
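
Both command-line tools load these settings with Python's ConfigParser: the .gnip file in the working directory is read first, and GNIP_CONFIG_FILE is consulted only when that file has no [creds] section. A trimmed sketch of the lookup, mirroring the config_file() method in gnip_search.py and gnip_filter_analysis.py:

    import os
    try:
        import ConfigParser as configparser    # Python 2
    except ImportError:
        import configparser                    # Python 3

    config = configparser.ConfigParser()
    config.read("./.gnip")                                  # (1) local file takes precedence
    if not config.has_section("creds") and "GNIP_CONFIG_FILE" in os.environ:
        config.read(os.environ["GNIP_CONFIG_FILE"])         # (2) fall back to the exported path
    if config.has_section("creds") and config.has_section("endpoint"):
        user = config.get("creds", "un")
        password = config.get("creds", "pwd")
        url = config.get("endpoint", "url")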


--------------------------------------------------------------------------------
/gnip_filter_analysis.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: UTF-8 -*-
  3 | __author__="Scott Hendrickson, Josh Montague" 
  4 | 
  5 | import sys
  6 | import json
  7 | import codecs
  8 | import argparse
  9 | import datetime
 10 | import time
 11 | import numbers
 12 | import os, re   # re is used below to mask passwords in logged option values
 13 | import ConfigParser
 14 | import logging
 15 | try:
 16 |         from cStringIO import StringIO
 17 | except:
 18 |         from StringIO import StringIO
 19 | 
 20 | import pandas as pd
 21 | import numpy as np
 22 | 
 23 | from search.results import *
 24 | 
 25 | reload(sys)
 26 | sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
 27 | sys.stdin = codecs.getreader('utf-8')(sys.stdin)
 28 | 
 29 | DEFAULT_CONFIG_FILENAME = "./.gnip"
 30 | LOG_FILE_PATH = os.path.join(".","filter_analysis.log")
 31 | 
 32 | # set up simple logging
 33 | logging.basicConfig(filename=LOG_FILE_PATH,level=logging.DEBUG)
 34 | logging.info("#"*70)
 35 | logging.info("################# started {} #################".format(datetime.datetime.now()))
 36 | 
 37 | class GnipSearchCMD():
 38 | 
 39 |     def __init__(self, token_list_size=20):
 40 |         # default tokenizer and character limit
 41 |         char_upper_cutoff = 20  # longer than for normal words because of user names
 42 |         self.token_list_size = int(token_list_size)
 43 |         #############################################
 44 |         # CONFIG FILE/COMMAND LINE OPTIONS PATTERN
 45 |         # parse config file
 46 |         config_from_file = self.config_file()
 47 |         # set required fields to None.  Sequence of setting is:
 48 |         #  (1) config file
 49 |         #  (2) command line
 50 |         # if still none, then fail
 51 |         self.user = None
 52 |         self.password = None
 53 |         self.stream_url = None
 54 |         if config_from_file is not None:
 55 |             try:
 56 |                 # command line options take precedence if they exist
 57 |                 self.user = config_from_file.get('creds', 'un')
 58 |                 self.password = config_from_file.get('creds', 'pwd')
 59 |                 self.stream_url = config_from_file.get('endpoint', 'url')
 60 |             except (ConfigParser.NoOptionError,
 61 |                     ConfigParser.NoSectionError) as e:
 62 |                 logging.debug(u"Error reading configuration file ({}), ignoring configuration file.".format(e))
 63 |         # parse the command line options
 64 |         self.options = self.args().parse_args()
 65 |         # set up the job
 66 |         # over ride config file with command line args if present
 67 |         if self.options.user is not None:
 68 |             self.user = self.options.user
 69 |         if self.options.password is not None:
 70 |             self.password = self.options.password
 71 |         if self.options.stream_url is not None:
 72 |             self.stream_url = self.options.stream_url
 73 |         #
 74 |         # Search v2 uses a different url
 75 |         if "data-api.twitter.com" in self.stream_url:
 76 |             self.options.search_v2 = True
 77 |         else:
 78 |             logging.debug(u"Requires search v2, but your URL appears to point to a v1 endpoint. Exiting.")
 79 |             print >> sys.stderr, "Requires search v2, but your URL appears to point to a v1 endpoint. Exiting."
 80 |             sys.exit(-1)
 81 |         # defaults
 82 |         self.options.paged = True
 83 |         self.options.max = 500
 84 |         # 
 85 |         # check paths
 86 |         if self.options.output_file_path is not None:
 87 |             if not os.path.exists(self.options.output_file_path):
 88 |                 logging.debug(u"Path {} doesn't exist. Please create it and try again. Exiting.".format(
 89 |                     self.options.output_file_path))
 90 |                 sys.stderr.write("Path {} doesn't exist. Please create it and try again. Exiting.\n".format(
 91 |                     self.options.output_file_path))
 92 |                 sys.exit(-1)
 93 |         #
 94 |         # log the attributes of this class including all of the options
 95 |         for v in dir(self):
 96 |             # except don't log the password!
 97 |             if not v.startswith('__') and not callable(getattr(self,v)) and not v.lower().startswith('password'):
 98 |                 tmp = str(getattr(self,v))
 99 |                 tmp = re.sub("password=.*,", "password=XXXXXXX,", tmp) 
100 |                 logging.debug(u"  {}={}".format(v, tmp))
101 |         #
102 |         self.job = self.read_job_description(self.options.job_description)
103 | 
104 |     def config_file(self):
105 |         config = ConfigParser.ConfigParser()
106 |         # (1) default file name takes precedence
107 |         config.read(DEFAULT_CONFIG_FILENAME)
108 |         if not config.has_section("creds"):
109 |             # (2) environment variable file name second
110 |             if 'GNIP_CONFIG_FILE' in os.environ:
111 |                 config_filename = os.environ['GNIP_CONFIG_FILE']
112 |                 config.read(config_filename)
113 |         if config.has_section("creds") and config.has_section("endpoint"):
114 |             return config
115 |         else:
116 |             return None
117 | 
118 |     def args(self):
119 |         twitter_parser = argparse.ArgumentParser(
120 |                 description="Creates an aggregated filter statistics summary from \
121 |                     filter rules and date periods in the job description.")
122 |         twitter_parser.add_argument("-j", "--job_description", dest="job_description",
123 |                 default="./job.json",
124 |                 help="JSON formatted job description file")
125 |         twitter_parser.add_argument("-b", "--bucket", dest="count_bucket", 
126 |                 default="day", 
127 |                 help="Bucket size for counts query. Options are day, hour, \
128 |                     minute (default is 'day').")
129 |         twitter_parser.add_argument("-l", "--stream-url", dest="stream_url", 
130 |                 default=None,
131 |                 help="Url of search endpoint. (See your Gnip console.)")
132 |         twitter_parser.add_argument("-p", "--password", dest="password", default=None, 
133 |                 help="Password")
134 |         twitter_parser.add_argument("-r", "--rank_sample", dest="rank_sample"
135 |                 , default=None
136 |                 , help="Rank inclusive sampling depth. Default is None. This runs filter rule \
137 |                     production for rank1, rank1 OR rank2, rank1 OR rank2 OR rank3, etc., to \
138 |                     the depth specified.")
139 |         twitter_parser.add_argument("-m", "--rank_negation_sample", dest="rank_negation_sample"
140 |                 , default=False
141 |                 , action="store_true"
142 |                 , help="Like rank inclusive sampling, but rules of higher ranks are negated \
143 |                     on successive retrievals. Uses rank_sample setting.")
144 |         twitter_parser.add_argument("-n", "--negation_rules", dest="negation_rules"
145 |                 , default=False
146 |                 , action="store_true"
147 |                 , help="Apply entire negation rules list to all queries")
148 |         twitter_parser.add_argument("-q", "--query", dest="query", action="store_true", 
149 |                 default=False, help="View API query (no data)")
150 |         twitter_parser.add_argument("-u", "--user-name", dest="user", default=None,
151 |                 help="User name")
152 |         twitter_parser.add_argument("-w", "--output-file-path", dest="output_file_path", 
153 |                 default="./data",
154 |                 help="Create files in ./OUTPUT-FILE-PATH. This path must exist and will \
155 |                     not be created. Default is ./data")
156 | 
157 |         return twitter_parser
158 | 
159 |     def read_job_description(self, job_description):
160 |         with codecs.open(job_description, "rb", "utf-8") as f:
161 |             self.job_description = json.load(f)
162 |         if not all([x in self.job_description for x in ("rules", "date_ranges")]):
163 |             print >>sys.stderr, '"rules" or "date_ranges" missing from your job description file. Exiting.\n'
164 |             logging.error('"rules" or "date_ranges" missing from your job description file. Exiting')
165 |             sys.exit(-1)
166 |     
167 |     def get_date_ranges_for_rule(self, rule, base_rule, tag=None):
168 |         res = []
169 |         for dates_dict in self.job_description["date_ranges"]:
170 |             start_date = dates_dict["start"]
171 |             end_date = dates_dict["end"]
172 |             logging.debug(u"getting date range for {} through {}".format(start_date, end_date))
173 |             results = Results(
174 |                 self.user
175 |                 , self.password
176 |                 , self.stream_url
177 |                 , self.options.paged
178 |                 , self.options.output_file_path
179 |                 , pt_filter=rule
180 |                 , max_results=int(self.options.max)
181 |                 , start=start_date
182 |                 , end=end_date
183 |                 , count_bucket=self.options.count_bucket
184 |                 , show_query=self.options.query
185 |                 , search_v2=self.options.search_v2
186 |                 )
187 |             for x in results.get_time_series():
188 |                 res.append(x + [rule, tag,  start_date, end_date, base_rule])
189 |         return res
190 | 
191 |     def get_pivot_table(self, res):
192 |         df = pd.DataFrame(res
193 |             , columns=("bucket_datetag"
194 |                     ,"counts"
195 |                     ,"bucket_datetime"
196 |                     ,"filter"
197 |                     ,"filter_tag"
198 |                     ,"start_date"
199 |                     ,"end_date"
200 |                     ,"base_rule"))
201 |         pdf = pd.pivot_table(df
202 |             , values="counts"
203 |             , index=["filter", "base_rule"]
204 |             , columns = ["start_date"]
205 |             , margins = True
206 |             , aggfunc=np.sum)
207 |         pdf.sort_values("All"
208 |             , inplace=True
209 |             , ascending=False)
210 |         logging.debug(u"pivot tables calculated with shape(df)={} and shape(pdf)={}".format(df.shape, pdf.shape))
211 |         return df, pdf
212 | 
213 |     def write_output_files(self, df, pdf, pre=""):
214 |         if pre != "":
215 |             pre += "_"
216 |         logging.debug(u"Writing raw and pivot data to {}...".format(self.options.output_file_path))
217 |         with open("{}/{}_{}raw_data.csv".format(
218 |                     self.options.output_file_path
219 |                     , datetime.datetime.now().strftime("%Y%m%d_%H%M")
220 |                     , pre)
221 |                 , "wb") as f:
222 |             f.write(df.to_csv(encoding='utf-8'))
223 |         with open("{}/{}_{}pivot_data.csv".format(
224 |                     self.options.output_file_path
225 |                     , datetime.datetime.now().strftime("%Y%m%d_%H%M")
226 |                     , pre)
227 |                 , "wb") as f:
228 |             f.write(pdf.to_csv(encoding='utf-8'))
229 | 
230 |     def get_result(self):
231 |         if self.options.negation_rules and self.job_description.get("negation_rules") is not None:
232 |             negation_rules = [x["value"] for x in self.job_description["negation_rules"]]
233 |             negation_clause = " -(" + " OR ".join(negation_rules) + ")"
234 |         else:
235 |             negation_clause = ""
236 |         all_rules = []
237 |         res = []
238 |         for rule_dict in self.job_description["rules"]:
239 |             # in the case that rule is compound, ensure grouping
240 |             rule = u"(" + rule_dict["value"] + u")" + negation_clause
241 |             logging.debug(u"rule str={}".format(rule))
242 |             all_rules.append(rule_dict["value"])
243 |             tag = None
244 |             if "tag" in rule_dict:
245 |                 tag = rule_dict["tag"]
246 |             res.extend(self.get_date_ranges_for_rule(
247 |                 rule
248 |                 , rule_dict["value"]
249 |                 , tag=tag
250 |                 ))
251 |         # All rules 
252 |         all_rules_res = []
253 |         sub_all_rules = []
254 |         filter_str_last = u"(" + u" OR ".join(sub_all_rules) + u")"
255 |         for rule in all_rules:
256 |             # try adding one more rule
257 |             sub_all_rules.append(rule)
258 |             filter_str = u"(" + u" OR ".join(sub_all_rules) + u")"
259 |             if len(filter_str + negation_clause) > 2048:
260 |                 # back up one rule if the length is too long
261 |                 filter_str = filter_str_last
262 |                 logging.debug(u"All rules str={}".format(filter_str + negation_clause))
263 |                 all_rules_res = self.get_date_ranges_for_rule(
264 |                     filter_str + negation_clause
265 |                     , filter_str
266 |                     , tag=None
267 |                     )
268 |                 # start a new sublist
269 |                 sub_all_rules = [rule]
270 |                 filter_str = u"(" + u" OR ".join(sub_all_rules) + u")"
271 |             filter_str_last = filter_str
272 |         res.extend(all_rules_res)
273 |         df, pdf = self.get_pivot_table(res)
274 |         if self.options.output_file_path is not None:
275 |             self.write_output_files(df, pdf)
276 |         # rank inclusive results
277 |         rdf, rpdf = None, None
278 |         if self.options.rank_sample is not None:
279 |             # because margin = True, we have an "all" row at the top
280 |             # the second row will be the all_rules results, skip these too
281 |             # therefore, start at the third row 
282 |             rank_list = [x[1] for x in pdf.index.values[2:2+int(self.options.rank_sample)]]
283 |             res = all_rules_res
284 |             for i in range(int(self.options.rank_sample)):
285 |                 if self.options.rank_negation_sample:
286 |                     filter_str = "((" + u") -(".join(rank_list[i+1::-1]) + "))"
287 |                 else:
288 |                     filter_str = "((" + u") OR (".join(rank_list[:i+1]) + "))"
289 |                 logging.debug(u"rank rules str={}".format(filter_str + negation_clause))
290 |                 res.extend(self.get_date_ranges_for_rule(
291 |                     filter_str + negation_clause
292 |                     , filter_str
293 |                     , tag=None
294 |                     ))
295 |             rdf, rpdf = self.get_pivot_table(res)
296 |             if self.options.output_file_path is not None:
297 |                 self.write_output_files(rdf, rpdf, pre="ranked")
298 |         return df, pdf, rdf, rpdf
299 | 
300 | if __name__ == "__main__":
301 |     g = GnipSearchCMD()
302 |     df, pdf, rdf, rpdf = g.get_result()
303 |     sys.stdout.write(pdf.to_string())
304 |     print
305 |     print
306 |     if rpdf is not None:
307 |         sys.stdout.write(rpdf.to_string())
308 |         print
309 | 


--------------------------------------------------------------------------------
/gnip_search.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: UTF-8 -*-
  3 | __author__="Scott Hendrickson, Jeff Kolb, Josh Montague" 
  4 | 
  5 | import sys
  6 | import json
  7 | import codecs
  8 | import argparse
  9 | import datetime
 10 | import time
 11 | import os
 12 | 
 13 | if sys.version_info.major == 2:
 14 |     import ConfigParser as configparser
 15 | else:
 16 |     import configparser
 17 | 
 18 | from search.results import * 
 19 | 
 20 | if (sys.version_info[0]) < 3:
 21 |     try:
 22 |         reload(sys)
 23 |         sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
 24 |         sys.stdin = codecs.getreader('utf-8')(sys.stdin)
 25 |     except NameError:
 26 |         pass
 27 | 
 28 | DEFAULT_CONFIG_FILENAME = "./.gnip"
 29 | 
 30 | class GnipSearchCMD():
 31 | 
 32 |     USE_CASES = ["json", "wordcount","users", "rate", "links", "timeline", "geo", "audience"]
 33 |     
 34 |     def __init__(self, token_list_size=40):
 35 |         # default tokenizer and character limit
 36 |         char_upper_cutoff = 20  # longer than for normal words because of user names
 37 |         self.token_list_size = int(token_list_size)
 38 |         #############################################
 39 |         # CONFIG FILE/COMMAND LINE OPTIONS PATTERN
 40 |         # parse config file
 41 |         config_from_file = self.config_file()
 42 |         # set required fields to None.  Sequence of setting is:
 43 |         #  (1) config file
 44 |         #  (2) command line
 45 |         # if still none, then fail
 46 |         self.user = None
 47 |         self.password = None
 48 |         self.stream_url = None
 49 |         if config_from_file is not None:
 50 |             try:
 51 |                 # command line options take precedence if they exist
 52 |                 self.user = config_from_file.get('creds', 'un')
 53 |                 self.password = config_from_file.get('creds', 'pwd')
 54 |                 self.stream_url = config_from_file.get('endpoint', 'url')
 55 |             except (configparser.NoOptionError,
 56 |                     configparser.NoSectionError) as e:
 57 |                 sys.stderr.write("Error reading configuration file ({}), ignoring configuration file.".format(e))
 58 |         # parse the command line options
 59 |         self.options = self.args().parse_args()
 60 |         if int(sys.version_info[0]) < 3:
 61 |             self.options.filter = self.options.filter.decode("utf-8")
 62 |         # set up the job
 63 |         # over ride config file with command line args if present
 64 |         if self.options.user is not None:
 65 |             self.user = self.options.user
 66 |         if self.options.password is not None:
 67 |             self.password = self.options.password
 68 |         if self.options.stream_url is not None:
 69 |             self.stream_url = self.options.stream_url
 70 | 
 71 |         # exit if the config file isn't set
 72 |         if (self.stream_url is None) or (self.user is None) or (self.password is None):
 73 |             sys.stderr.write("Something is wrong with your configuration. It's possible that we can't find your config file.\n")
 74 |             sys.exit(-1)
 75 | 
 76 |         # Gnacs is not yet upgraded to python3, so don't allow CSV output option (which uses Gnacs) if python3
 77 |         if self.options.csv_flag and sys.version_info.major == 3:
 78 |             raise ValueError("CSV option not yet available for Python3")
 79 | 
 80 |     def config_file(self):
 81 |         config = configparser.ConfigParser()
 82 |         # (1) default file name takes precedence
 83 |         config.read(DEFAULT_CONFIG_FILENAME)
 84 |         if not config.has_section("creds"):
 85 |             # (2) environment variable file name second
 86 |             if 'GNIP_CONFIG_FILE' in os.environ:
 87 |                 config_filename = os.environ['GNIP_CONFIG_FILE']
 88 |                 config.read(config_filename)
 89 |         if config.has_section("creds") and config.has_section("endpoint"):
 90 |             return config
 91 |         else:
 92 |             return None
 93 | 
 94 |     def args(self):
 95 |         twitter_parser = argparse.ArgumentParser(
 96 |                 description="GnipSearch supports the following use cases: %s"%str(self.USE_CASES))
 97 |         twitter_parser.add_argument("use_case", metavar= "USE_CASE", choices=self.USE_CASES, 
 98 |                 help="Use case for this search.")
 99 |         twitter_parser.add_argument("-a", "--paged", dest="paged", action="store_true", 
100 |                 default=False, help="Paged access to ALL available results (Warning: this makes many requests)")
101 |         twitter_parser.add_argument("-c", "--csv", dest="csv_flag", action="store_true", 
102 |                 default=False,
103 |                 help="Return comma-separated 'date,counts' or geo data.")
104 |         twitter_parser.add_argument("-b", "--bucket", dest="count_bucket", 
105 |                 default="day", 
106 |                 help="Bucket size for counts query. Options are day, hour, minute (default is 'day').")
107 |         twitter_parser.add_argument("-e", "--end-date", dest="end", 
108 |                 default=None,
109 |                 help="End of datetime window, format 'YYYY-mm-DDTHH:MM' (default: most recent activities)")
110 |         twitter_parser.add_argument("-f", "--filter", dest="filter", default="from:jrmontag OR from:gnip",
111 |                 help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)")
112 |         twitter_parser.add_argument("-l", "--stream-url", dest="stream_url", 
113 |                 default=None,
114 |                 help="Url of search endpoint. (See your Gnip console.)")
115 |         twitter_parser.add_argument("-n", "--results-max", dest="max", default=100, 
116 |                 help="Maximum results to return per page (default 100; max 500)")
117 |         twitter_parser.add_argument("-N", "--hard-max", dest="hard_max", default=None, type=int,
118 |                 help="Maximum results to return for all pages; see -a option")
119 |         twitter_parser.add_argument("-p", "--password", dest="password", default=None, 
120 |                 help="Password")
121 |         twitter_parser.add_argument("-q", "--query", dest="query", action="store_true", 
122 |                 default=False, help="View API query (no data)")
123 |         twitter_parser.add_argument("-s", "--start-date", dest="start", 
124 |                 default=None,
125 |                 help="Start of datetime window, format 'YYYY-mm-DDTHH:MM' (default: 30 days ago)")
126 |         twitter_parser.add_argument("-u", "--user-name", dest="user", default=None,
127 |                 help="User name")
128 |         twitter_parser.add_argument("-w", "--output-file-path", dest="output_file_path", default=None,
129 |                 help="Create files in ./OUTPUT-FILE-PATH. This path must exist and will not be created. This option is available only with the -a option. Default is no output files.")
130 |         # deprecated... leave in for compatibility
131 |         twitter_parser.add_argument("-t", "--search-v2", dest="search_v2", action="store_true",
132 |                 default=False, 
133 |                 help="Use the search API v2 endpoint. [This is deprecated and is automatically set based on endpoint.]")
134 |         return twitter_parser
135 |     
136 |     def get_result(self):
137 |         WIDTH = 80
138 |         BIG_COLUMN = 32
139 |         res = [u"-"*WIDTH]
140 |         if self.options.use_case.startswith("time"):
141 |             self.results = Results(
142 |                 self.user
143 |                 , self.password
144 |                 , self.stream_url
145 |                 , self.options.paged
146 |                 , self.options.output_file_path
147 |                 , pt_filter=self.options.filter
148 |                 , max_results=int(self.options.max)
149 |                 , start=self.options.start
150 |                 , end=self.options.end
151 |                 , count_bucket=self.options.count_bucket
152 |                 , show_query=self.options.query
153 |                 , hard_max=self.options.hard_max
154 |                 )
155 |             res = []
156 |             if self.options.csv_flag:
157 |                 for x in self.results.get_time_series():
158 |                     res.append("{:%Y-%m-%dT%H:%M:%S},{},{}".format(x[2], x[0], x[1]))
159 |             else:
160 |                 res = [x for x in self.results.get_activities()]
161 |                 return '{"results":' + json.dumps(res) + "}"
162 | 
163 |         else:
164 |             self.results = Results(
165 |                 self.user
166 |                 , self.password
167 |                 , self.stream_url
168 |                 , self.options.paged
169 |                 , self.options.output_file_path
170 |                 , pt_filter=self.options.filter
171 |                 , max_results=int(self.options.max)
172 |                 , start=self.options.start
173 |                 , end=self.options.end
174 |                 , count_bucket=None
175 |                 , show_query=self.options.query
176 |                 , hard_max=self.options.hard_max
177 |                 )
178 |             if self.options.use_case.startswith("rate"):
179 |                 rate = self.results.query.get_rate()
180 |                 unit = "Tweets/Minute"
181 |                 if rate < 0.01:
182 |                     rate *= 60.
183 |                     unit = "Tweets/Hour"
184 |                 res.append("     PowerTrack Rule: \"%s\""%self.options.filter)
185 |                 res.append("  Oldest Tweet (UTC): %s"%str(self.results.query.oldest_t))
186 |                 res.append("  Newest Tweet (UTC): %s"%str(self.results.query.newest_t))
187 |                 res.append("           Now (UTC): %s"%str(datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
188 |                 res.append("        %5d Tweets: %6.3f %s"%(len(self.results), rate, unit))
189 |                 res.append("-"*WIDTH)
190 |             elif self.options.use_case.startswith("geo"):
191 |                 res = []
192 |                 for x in self.results.get_geo():
193 |                     if self.options.csv_flag:
194 |                         try:
195 |                             res.append("{},{},{},{}".format(x["id"], x["postedTime"], x["longitude"], x["latitude"]))
196 |                         except KeyError as e:
197 |                             sys.stderr.write(str(e) + "\n")
198 |                     else:
199 |                         res.append(json.dumps(x))
200 |             elif self.options.use_case.startswith("json"):
201 |                 res = [json.dumps(x) for x in self.results.get_activities()]
202 |                 if self.options.csv_flag:
203 |                     res = ["|".join(x) for x in self.results.query.get_list_set()]
204 |             elif self.options.use_case.startswith("word"):
205 |                 fmt_str = u"%{}s -- %10s     %8s ".format(BIG_COLUMN)
206 |                 res.append(fmt_str%( "terms", "mentions", "activities"))
207 |                 res.append("-"*WIDTH)
208 |                 fmt_str =  u"%{}s -- %4d  %5.2f%% %4d  %5.2f%%".format(BIG_COLUMN)
209 |                 for x in self.results.get_top_grams(n=self.token_list_size):
210 |                     res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.))
211 |                 res.append("    TOTAL: %d activities"%len(self.results))
212 |                 res.append("-"*WIDTH)
213 |             elif self.options.use_case.startswith("user"):
214 |                 fmt_str = u"%{}s -- %10s     %8s ".format(BIG_COLUMN)
215 |                 res.append(fmt_str%( "terms", "mentions", "activities"))
216 |                 res.append("-"*WIDTH)
217 |                 fmt_str =  u"%{}s -- %4d  %5.2f%% %4d  %5.2f%%".format(BIG_COLUMN)
218 |                 for x in self.results.get_top_users(n=self.token_list_size):
219 |                     res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.))
220 |                 res.append("    TOTAL: %d activities"%len(self.results))
221 |                 res.append("-"*WIDTH)
222 |             elif self.options.use_case.startswith("link"):
223 |                 res[-1]+=u"-"*WIDTH
224 |                 res.append(u"%100s -- %10s     %8s (%d)"%("links", "mentions", "activities", len(self.results)))
225 |                 res.append("-"*2*WIDTH)
226 |                 for x in self.results.get_top_links(n=self.token_list_size):
227 |                     res.append(u"%100s -- %4d  %5.2f%% %4d  %5.2f%%"%(x[4], x[0], x[1]*100., x[2], x[3]*100.))
228 |                 res.append("-"*WIDTH)
229 |             elif self.options.use_case.startswith("audie"):
230 |                 for x in self.results.get_users():
231 |                     res.append(u"{}".format(x))
232 |                 res.append("-"*WIDTH)
233 |         return u"\n".join(res)
234 | 
235 | if __name__ == "__main__":
236 |     g = GnipSearchCMD()
237 |     print(g.get_result())
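# Rate example (added note, illustrative numbers): 120 Tweets collected over a 300-minute
# window give get_rate() = 0.4 Tweets/Minute; rates below 0.01 are rescaled by 60 and
# reported as Tweets/Hour in the "rate" use case above.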
238 | 


--------------------------------------------------------------------------------
/gnip_time_series.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: UTF-8 -*-
  3 | #######################################################
  4 | # This script wraps simple timeseries analysis tools
  5 | # and access to the Gnip Search API into a simple tool
  6 | # to help the analyst quickly iterate on filters
  7 | # and understand time series trends and events.
  8 | #
  9 | # If you find this useful or find a bug you don't want
 10 | # to fix for yourself, please let me know at @drskippy
 11 | #######################################################
 12 | __author__="Scott Hendrickson" 
 13 | 
 14 | # other imports
 15 | import sys
 16 | import argparse
 17 | import calendar
 18 | import codecs
 19 | import csv
 20 | import datetime
 21 | import json
 22 | import logging
 23 | import matplotlib
 24 | import matplotlib.pyplot as plt
 25 | import numpy as np
 26 | import os
 27 | import pandas as pd
 28 | import re
 29 | import statsmodels.api as sm
 30 | import string
 31 | import time
 32 | from functools import partial
 33 | from operator import itemgetter
 34 | from scipy import signal
 35 | from search.results import *
 36 | 
 37 | # fixes an annoying warning that scipy is throwing 
 38 | import warnings
 39 | warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd driver lwork query error")
 40 | 
 41 | # handle Python 3 specific imports
 42 | if sys.version_info[0] == 2:
 43 |     import ConfigParser
 44 | elif sys.version_info[0] == 3:
 45 |     import configparser as ConfigParser
 46 |     #from imp import reload
 47 | 
 48 | # Python 2 specific setup (Py3 the utf-8 stuff is handled)
 49 | if sys.version_info[0] == 2:
 50 |     reload(sys)
 51 |     sys.stdin = codecs.getreader('utf-8')(sys.stdin)
 52 |     sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
 53 | 
 54 | # basic defaults
 55 | FROM_PICKLE = False
 56 | DEFAULT_CONFIG_FILENAME = os.path.join(".",".gnip")
 57 | DATE_FMT = "%Y%m%d%H%M"
 58 | DATE_FMT2 = "%Y-%m-%dT%H:%M:%S"
 59 | LOG_FILE_PATH = os.path.join(".","time_series.log")
 60 | 
 61 | # set up simple logging
 62 | logging.basicConfig(filename=LOG_FILE_PATH,level=logging.DEBUG)
 63 | logging.info("#"*70)
 64 | logging.info("################# started {} #################".format(datetime.datetime.now()))
 65 | 
 66 | # tunable defaults
 67 | CHAR_UPPER_CUTOFF = 20          # don't include tokens longer than CHAR_UPPER_CUTOFF
 68 | TWEET_SAMPLE = 4000             # tweets to collect for peak topics
 69 | MIN_SNR = 2.0                   # signal to noise threshold for peak detection
 70 | MAX_N_PEAKS = 7                 # maximum number of peaks to output
 71 | MAX_PEAK_WIDTH = 20             # max peak width in periods
 72 | MIN_PEAK_WIDTH = 1              # min peak width in periods
 73 | SEARCH_PEAK_WIDTH = 3           # min peak width in periods
 74 | N_MOVING = 4                    # average over buckets
 75 | OUTLIER_FRAC = 0.8              # drop values deviating from the bucket average by more than (1 + OUTLIER_FRAC)
 76 | PLOTS_PREFIX = os.path.join(".","plots")
 77 | PLOT_DELTA_Y = 1.2              # spacing of y values in dotplot
 78 | 
 79 | logging.debug("CHAR_UPPER_CUTOFF={},TWEET_SAMPLE={},MIN_SNR={},MAX_N_PEAKS={},MAX_PEAK_WIDTH={},MIN_PEAK_WIDTH={},SEARCH_PEAK_WIDTH={},N_MOVING={},OUTLIER_FRAC={},PLOTS_PREFIX={},PLOT_DELTA_Y={}".format(
 80 |     CHAR_UPPER_CUTOFF 
 81 |     , TWEET_SAMPLE 
 82 |     , MIN_SNR 
 83 |     , MAX_N_PEAKS 
 84 |     , MAX_PEAK_WIDTH 
 85 |     , MIN_PEAK_WIDTH 
 86 |     , SEARCH_PEAK_WIDTH
 87 |     , N_MOVING 
 88 |     , OUTLIER_FRAC 
 89 |     , PLOTS_PREFIX 
 90 |     , PLOT_DELTA_Y ))
 91 | 
 92 | class TimeSeries():
 93 |     """Container class for data collected from the API and associated analysis outputs."""
 94 |     pass
 95 | 
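# Note (added for orientation): get_results() below fills a TimeSeries instance with
# attributes such as ts.dates, ts.x, ts.counts, ts.counts_no_cycle_trend, ts.moving,
# ts.top_peaks, ts.cycle, ts.trend, ts.topics and, when a second filter is given,
# ts.second_counts.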
 96 | class GnipSearchTimeseries():
 97 | 
 98 |     def __init__(self, token_list_size=40):
 99 |         """Retrieve and analyze a timeseries and its associated interesting trends, spikes and tweet content."""
100 |         # default tokenizer and character limit
101 |         char_upper_cutoff = CHAR_UPPER_CUTOFF
102 |         self.token_list_size = int(token_list_size)
103 |         #############################################
104 |         # CONFIG FILE/COMMAND LINE OPTIONS PATTERN
105 |         # parse config file
106 |         config_from_file = self.config_file()
107 |         # set required fields to None.  Sequence of setting is:
108 |         #  (1) config file
109 |         #  (2) command line
110 |         # if still none, then fail
111 |         self.user = None
112 |         self.password = None
113 |         self.stream_url = None
114 |         if config_from_file is not None:
115 |             try:
116 |                 # command line options take precedence if they exist
117 |                 self.user = config_from_file.get('creds', 'un')
118 |                 self.password = config_from_file.get('creds', 'pwd')
119 |                 self.stream_url = config_from_file.get('endpoint', 'url')
120 |             except (ConfigParser.NoOptionError,
121 |                     ConfigParser.NoSectionError) as e:
122 |                 logging.warn("Error reading configuration file ({}), ignoring configuration file.".format(e))
123 |         # parse the command line options
124 |         self.options = self.args().parse_args()
125 |         # decode step should not be included for python 3
126 |         if sys.version_info[0] == 2: 
127 |             self.options.filter = self.options.filter.decode("utf-8")
128 |             self.options.second_filter = self.options.second_filter.decode("utf-8")
129 |         # set up the job
130 |         # override config file with command line args if present
131 |         if self.options.user is not None:
132 |             self.user = self.options.user
133 |         if self.options.password is not None:
134 |             self.password = self.options.password
135 |         if self.options.stream_url is not None:
136 |             self.stream_url = self.options.stream_url
137 |         
138 |         # search v2 uses a different url
139 |         if "gnip-api.twitter.com" not in self.stream_url:
140 |             logging.error("gnipSearch timeline tools require Search V2. Exiting.")
141 |             logging.error("Your URL should look like: https://gnip-api.twitter.com/search/fullarchive/accounts//dev.json")
142 |             sys.stderr.write("gnipSearch timeline tools require Search V2. Exiting.\n")
143 |             sys.stderr.write("Your URL should look like: https://gnip-api.twitter.com/search/fullarchive/accounts//dev.json")
144 |             sys.exit(-1)
145 | 
146 |         # set some options that should not be changed for this analysis
147 |         self.options.paged = True
148 |         self.options.search_v2 = True
149 |         self.options.max = 500
150 |         self.options.query = False
151 | 
152 |         # check paths
153 |         if self.options.output_file_path is not None:
154 |             if not os.path.exists(self.options.output_file_path):
155 |                 logging.error("Path {} doesn't exist. Please create it and try again. Exiting.".format(
156 |                     self.options.output_file_path))
157 |                 sys.stderr.write("Path {} doesn't exist. Please create it and try again. Exiting.\n".format(
158 |                     self.options.output_file_path))
159 |                 sys.exit(-1)
160 | 
161 |         if not os.path.exists(PLOTS_PREFIX):
162 |             logging.error("Path {} doesn't exist. Please create it and try again. Exiting.".format(
163 |                 PLOTS_PREFIX))
164 |             sys.stderr.write("Path {} doesn't exist. Please create it and try again. Exiting.\n".format(
165 |                 PLOTS_PREFIX))
166 |             sys.exit(-1)
167 | 
168 |         # log the attributes of this class including all of the options
169 |         for v in dir(self):
170 |             # except don't log the password!
171 |             if not v.startswith('__') and not callable(getattr(self,v)) and not v.lower().startswith('password'):
172 |                 tmp = str(getattr(self,v))
173 |                 tmp = re.sub("password=.*,", "password=XXXXXXX,", tmp) 
174 |                 logging.debug("  {}={}".format(v, tmp))
175 | 
176 |     def config_file(self):
177 |         """Search for a valid config file in the standard locations."""
178 |         config = ConfigParser.ConfigParser()
179 |         # (1) the default file name takes precedence
180 |         config.read(DEFAULT_CONFIG_FILENAME)
181 |         logging.info("attempting to read config file {}".format(DEFAULT_CONFIG_FILENAME))
182 |         if not config.has_section("creds"):
183 |             # (2) environment variable file name second
184 |             if 'GNIP_CONFIG_FILE' in os.environ:
185 |                 config_filename = os.environ['GNIP_CONFIG_FILE']
186 |                 logging.info("attempting to read config file {}".format(config_filename))
187 |                 config.read(config_filename)
188 |         if config.has_section("creds") and config.has_section("endpoint"):
189 |             return config
190 |         else:
191 |             logging.warn("no creds or endpoint section found in config file, attempting to proceed without config info from file")
192 |             return None
193 | 
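    # Example layout of the config file read above (illustrative values; an
    # example_config_file also ships with this repository):
    #   [creds]
    #   un = me@example.com
    #   pwd = XXXXXXXX
    #   [endpoint]
    #   url = https://gnip-api.twitter.com/search/fullarchive/accounts/<account>/prod.json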
194 |     def args(self):
195 |         """Set up the command line arguments and the associated help strings."""
196 |         twitter_parser = argparse.ArgumentParser(
197 |                 description="GnipSearch timeline tools")
198 |         twitter_parser.add_argument("-b", "--bucket", dest="count_bucket", 
199 |                 default="day", 
200 |                 help="Bucket size for counts query. Options are day, hour, minute (default is 'day').")
201 |         twitter_parser.add_argument("-e", "--end-date", dest="end", 
202 |                 default=None,
203 |                 help="End of datetime window, format 'YYYY-mm-DDTHH:MM' (default: most recent activities)")
204 |         twitter_parser.add_argument("-f", "--filter", dest="filter", 
205 |                 default="from:jrmontag OR from:gnip",
206 |                 help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)")
207 |         twitter_parser.add_argument("-g", "--second_filter", dest="second_filter", 
208 |                 default=None,
209 |                 help="Use a second filter to show correlation plots of -f timeline vs -g timeline.")
210 |         twitter_parser.add_argument("-l", "--stream-url", dest="stream_url", 
211 |                 default=None,
212 |                 help="URL of the search endpoint. (See your Gnip console.)")
213 |         twitter_parser.add_argument("-p", "--password", dest="password", default=None, 
214 |                 help="Password")
215 |         twitter_parser.add_argument("-s", "--start-date", dest="start", 
216 |                 default=None,
217 |                 help="Start of datetime window, format 'YYYY-mm-DDTHH:MM' (default: 30 days ago)")
218 |         twitter_parser.add_argument("-u", "--user-name", dest="user", 
219 |                 default=None,
220 |                 help="User name")
221 |         twitter_parser.add_argument("-t", "--get-topics", dest="get_topics", action="store_true", 
222 |                 default=False,
223 |                 help="Set flag to evaluate peak topics (this may take a few minutes)")
224 |         twitter_parser.add_argument("-w", "--output-file-path", dest="output_file_path", 
225 |                 default=None,
226 |                 help="Create files in ./OUTPUT-FILE-PATH. This path must exist and will not be created. Default is no output files.")
227 |         return twitter_parser
228 |     
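    # Example invocation (hypothetical filter and dates, shown for illustration only):
    #   ./gnip_time_series.py -f "earthquake" -b hour -s 2016-01-01T00:00 -e 2016-01-08T00:00 -t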
229 |     def get_results(self):
230 |         """Execute API calls to retrieve the timeseries data and tweet data we need for analysis. Perform analysis
231 |         as we go because we often need the results for the next steps."""
232 |         ######################
233 |         # (1) Get the timeline
234 |         ######################
235 |         logging.info("retrieving timeline counts")
236 |         results_timeseries = Results( self.user
237 |             , self.password
238 |             , self.stream_url
239 |             , self.options.paged
240 |             , self.options.output_file_path
241 |             , pt_filter=self.options.filter
242 |             , max_results=int(self.options.max)
243 |             , start=self.options.start
244 |             , end=self.options.end
245 |             , count_bucket=self.options.count_bucket
246 |             , show_query=self.options.query
247 |             )
248 |         # sort by date
249 |         res_timeseries = sorted(results_timeseries.get_time_series(), key = itemgetter(0))
250 |         # if we only have one activity, probably don't do all of this
251 |         if len(res_timeseries) <= 1:
252 |             raise ValueError("You've only pulled {} Tweets; time series analysis isn't what you want.".format(len(res_timeseries)))
253 |         # calculate total time interval span
254 |         time_min_date = min(res_timeseries, key = itemgetter(2))[2]
255 |         time_max_date = max(res_timeseries, key = itemgetter(2))[2]
256 |         time_min = float(calendar.timegm(time_min_date.timetuple()))
257 |         time_max = float(calendar.timegm(time_max_date.timetuple()))
258 |         time_span = time_max - time_min
259 |         logging.debug("time_min = {}, time_max = {}, time_span = {}".format(time_min, time_max, time_span))
260 |         # create a simple object to hold our data 
261 |         ts = TimeSeries()
262 |         ts.dates = []
263 |         ts.x = []
264 |         ts.counts = []
265 |         # load and format data
266 |         for i in res_timeseries:
267 |             ts.dates.append(i[2])
268 |             ts.counts.append(float(i[1]))
269 |             # create an independent variable in the interval [0.0, 1.0]
270 |             ts.x.append((calendar.timegm(datetime.datetime.strptime(i[0], DATE_FMT).timetuple()) - time_min)/time_span)
271 |         logging.info("read {} time items from search API".format(len(ts.dates)))
272 |         if len(ts.dates) < 35:
273 |             logging.warn("peak detection with fewer than ~35 points is unreliable!")
274 |         logging.debug('dates: ' + ','.join(map(str, ts.dates[:10])) + "...")
275 |         logging.debug('counts: ' + ','.join(map(str, ts.counts[:10])) + "...")
276 |         logging.debug('indep var: ' + ','.join(map(str, ts.x[:10])) + "...")
277 |         ######################
278 |         # (1.1) Get a second timeline?
279 |         ######################
280 |         if self.options.second_filter is not None:
281 |             logging.info("retrieving second timeline counts")
282 |             results_timeseries = Results( self.user
283 |                 , self.password
284 |                 , self.stream_url
285 |                 , self.options.paged
286 |                 , self.options.output_file_path
287 |                 , pt_filter=self.options.second_filter
288 |                 , max_results=int(self.options.max)
289 |                 , start=self.options.start
290 |                 , end=self.options.end
291 |                 , count_bucket=self.options.count_bucket
292 |                 , show_query=self.options.query
293 |                 )
294 |             # sort by date
295 |             second_res_timeseries = sorted(results_timeseries.get_time_series(), key = itemgetter(0))
296 |             if len(second_res_timeseries) != len(res_timeseries):
297 |                 logging.error("time series of different sizes not allowed")
298 |             else:
299 |                 ts.second_counts = []
300 |                 # load and format data
301 |                 for i in second_res_timeseries:
302 |                     ts.second_counts.append(float(i[1]))
303 |                 logging.info("read {} time items from search API".format(len(ts.second_counts)))
304 |                 logging.debug('second counts: ' + ','.join(map(str, ts.second_counts[:10])) + "...")
305 |         ######################
306 |         # (2) Detrend and remove prominent period
307 |         ######################
308 |         logging.info("detrending timeline counts")
309 |         no_trend = signal.detrend(np.array(ts.counts))
310 |         # determine period of data
311 |         df = (ts.dates[1] - ts.dates[0]).total_seconds()
312 |         if df == 86400:
313 |             # day counts, average over week
314 |             n_buckets = 7
315 |             n_avgs = {i:[] for i in range(n_buckets)}
316 |             for t,c in zip(ts.dates, no_trend):
317 |                 n_avgs[t.weekday()].append(c)
318 |         elif df == 3600:
319 |             # hour counts, average over day
320 |             n_buckets = 24
321 |             n_avgs = {i:[] for i in range(n_buckets)}
322 |             for t,c in zip(ts.dates, no_trend):
323 |                 n_avgs[t.hour].append(c)
324 |         elif df == 60:
325 |             # minute counts, average over the hour
326 |             n_buckets = 60
327 |             n_avgs = {i:[] for i in range(n_buckets)}
328 |             for t,c in zip(ts.dates, no_trend):
329 |                 n_avgs[t.minute].append(c)
330 |         else:
331 |             sys.stderr.write("Weird interval problem! Exiting.\n")
332 |             logging.error("Weird interval problem! Exiting.\n")
333 |             sys.exit()
334 |         logging.info("averaging over periods of {} buckets".format(n_buckets))
335 |         # remove upper outliers from averages 
336 |         df_avg_all = {i:np.average(n_avgs[i]) for i in range(n_buckets)}
337 |         logging.debug("bucket averages: {}".format(','.join(map(str, [df_avg_all[i] for i in df_avg_all]))))
338 |         n_avgs_remove_outliers = {i: [j for j in n_avgs[i] 
339 |             if  abs(j - df_avg_all[i])/df_avg_all[i] < (1. + OUTLIER_FRAC) ]
340 |             for i in range(n_buckets)}
341 |         df_avg = {i:np.average(n_avgs_remove_outliers[i]) for i in range(n_buckets)}
342 |         logging.debug("bucket averages w/o outliers: {}".format(','.join(map(str, [df_avg[i] for i in df_avg]))))
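        # Worked example (illustrative): with OUTLIER_FRAC = 0.8 the filter above keeps a
        # detrended value j only while abs(j - bucket_average)/bucket_average < 1.8, so a
        # value deviating from its bucket average by 180% or more of that average is
        # dropped before re-averaging.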
343 | 
344 |         # flatten cycle
345 |         ts.counts_no_cycle_trend = np.array([no_trend[i] - df_avg[ts.dates[i].hour] for i in range(len(ts.counts))])
346 |         logging.debug('no trend: ' + ','.join(map(str, ts.counts_no_cycle_trend[:10])) + "...")
347 | 
348 |         ######################
349 |         # (3) Moving average 
350 |         ######################
351 |         ts.moving = np.convolve(ts.counts, np.ones((N_MOVING,))/N_MOVING, mode='valid')
352 |         logging.debug('moving ({}): '.format(N_MOVING) + ','.join(map(str, ts.moving[:10])) + "...")
353 | 
354 |         ######################
355 |         # (4) Peak detection
356 |         ######################
357 |         peakind = signal.find_peaks_cwt(ts.counts_no_cycle_trend, np.arange(MIN_PEAK_WIDTH, MAX_PEAK_WIDTH), min_snr = MIN_SNR)
358 |         n_peaks = min(MAX_N_PEAKS, len(peakind))
359 |         logging.debug('peaks ({}): '.format(n_peaks) + ','.join(map(str, peakind)))
360 |         logging.debug('peaks ({}): '.format(n_peaks) + ','.join(map(str, [ts.dates[i] for i in peakind])))
361 |         
362 |         # top peaks determined by peak volume, better way?
363 |         # peak detector algorithm:
364 |         #      * middle of peak (of unknown width)
365 |         #      * finds peaks up to MAX_PEAK_WIDTH wide
366 |         #
367 |         #   algorithm for getting peak start, peak and end parameters:
368 |         #      find max, find fwhm, 
369 |         #      find start, step past peak, keep track of volume and peak height, 
370 |         #      stop at end of period or when timeseries turns upward
371 |     
372 |         peaks = []
373 |         for i in peakind:
374 |             # find the first max in the possible window
375 |             i_start = max(0, i - SEARCH_PEAK_WIDTH)
376 |             i_finish = min(len(ts.counts) - 1, i + SEARCH_PEAK_WIDTH)
377 |             p_max = max(ts.counts[i_start:i_finish])
378 |             h_max = p_max/2.
379 |             # i_max is the index of the local max, not necessarily the window center
380 |             i_max = i_start + ts.counts[i_start:i_finish].index(p_max)
381 |             i_start, i_finish = i_max, i_max
382 |             # start at peak, and go back and forward to find start and end
383 |             while i_start >= 1:
384 |                 if (ts.counts[i_start - 1] <= h_max or 
385 |                         ts.counts[i_start - 1] >= ts.counts[i_start] or
386 |                         i_start - 1 <= 0):
387 |                     break
388 |                 i_start -= 1
389 |             while i_finish < len(ts.counts) - 1:
390 |                 if (ts.counts[i_finish + 1] <= h_max or
391 |                         ts.counts[i_finish + 1] >= ts.counts[i_finish] or
392 |                         i_finish + 1 >= len(ts.counts)):
393 |                     break
394 |                 i_finish += 1
395 |             # i is center of peak so balance window
396 |             delta_i = max(1, i - i_start)
397 |             if i_finish - i > delta_i:
398 |                 delta_i = i_finish - i
399 |             # final est of start and finish
400 |             i_finish = min(len(ts.counts) - 1, i + delta_i)
401 |             i_start = max(0, i - delta_i)
402 |             p_volume = sum(ts.counts[i_start:i_finish])
403 |             peaks.append([ i , p_volume , (i, i_start, i_max, i_finish
404 |                                             , h_max  , p_max, p_volume
405 |                                             , ts.dates[i_start], ts.dates[i_max], ts.dates[i_finish])])
406 |         # top n_peaks by volume
407 |         top_peaks = sorted(peaks, key = itemgetter(1))[-n_peaks:]
408 |         # re-sort peaks by date
409 |         ts.top_peaks = sorted(top_peaks, key = itemgetter(0))
410 |         logging.debug('top peaks ({}): '.format(len(ts.top_peaks)) + ','.join(map(str, ts.top_peaks[:4])) + "...")
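        # For reference (added note): each entry of ts.top_peaks has the layout
        #   [i, p_volume, (i, i_start, i_max, i_finish, h_max, p_max, p_volume,
        #                  date_start, date_max, date_finish)]
        # which is why the topic-retrieval loop below indexes a[2][5], a[2][8] and a[2][9].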
411 |     
412 |         ######################
413 |         # (5) high/low frequency 
414 |         ######################
415 |         ts.cycle, ts.trend = sm.tsa.filters.hpfilter(np.array(ts.counts))
416 |         logging.debug('cycle: ' + ','.join(map(str, ts.cycle[:10])) + "...")
417 |         logging.debug('trend: ' + ','.join(map(str, ts.trend[:10])) + "...")
418 |     
419 |         ######################
420 |         # (6) n-grams for top peaks
421 |         ######################
422 |         ts.topics = []
423 |         if self.options.get_topics:
424 |             logging.info("retrieving tweets for peak topics")
425 |             for a in ts.top_peaks:
426 |                 # start at peak
427 |                 ds = datetime.datetime.strftime(a[2][8], DATE_FMT2)
428 |                 # estimate how long to get TWEET_SAMPLE tweets
429 |                 # a[2][5] is max tweets per period
430 |                 if a[2][5] > 0:
431 |                     est_periods = float(TWEET_SAMPLE)/a[2][5]
432 |                 else:
433 |                     logging.warn("peak with zero max tweets ({}), setting est_periods to 1".format(a))
434 |                     est_periods = 1
435 |                 # df comes from above, in seconds
436 |                 # enforce a minimum time window of 60 seconds
437 |                 est_time = max(int(est_periods * df), 60)
438 |                 logging.debug("est_periods={}, est_time={}".format(est_periods, est_time))
439 |                 #
440 |                 if a[2][8] + datetime.timedelta(seconds=est_time) < a[2][9]:
441 |                     de = datetime.datetime.strftime(a[2][8] + datetime.timedelta(seconds=est_time), DATE_FMT2)
442 |                 elif a[2][8] < a[2][9]:
443 |                     de = datetime.datetime.strftime(a[2][9], DATE_FMT2)
444 |                 else:
445 |                     de = datetime.datetime.strftime(a[2][8] + datetime.timedelta(seconds=60), DATE_FMT2)
446 |                 logging.info("retrieve data for peak index={} in date range [{},{}]".format(a[0], ds, de))
447 |                 res = Results(
448 |                     self.user
449 |                     , self.password
450 |                     , self.stream_url
451 |                     , self.options.paged
452 |                     , self.options.output_file_path
453 |                     , pt_filter=self.options.filter
454 |                     , max_results=int(self.options.max)
455 |                     , start=ds
456 |                     , end=de
457 |                     , count_bucket=None
458 |                     , show_query=self.options.query
459 |                     , hard_max = TWEET_SAMPLE
460 |                     )
461 |                 logging.info("retrieved {} records".format(len(res)))
462 |                 n_grams_counts = list(res.get_top_grams(n=self.token_list_size))
463 |                 ts.topics.append(n_grams_counts)
464 |                 logging.debug('n_grams for peak index={}: '.format(a[0]) + ','.join(
465 |                     map(str, [i[4].encode("utf-8","ignore") for i in n_grams_counts][:10])) + "...")
466 |         return ts
467 | 
468 |     def dotplot(self, x, labels, path = "dotplot.png"):
469 |         """Makeshift dotplots in matplotlib. This is not completely general and encodes labels and
470 |         parameter selections that are particular to n-gram dotplots."""
471 |         logging.info("dotplot called, writing image to path={}".format(path))
472 |         if len(x) <= 1 or len(labels) <= 1:
473 |             raise ValueError("cannot make a dot plot with only 1 point")
474 |         # split n_gram_counts into 2 data sets
475 |         n = int(len(labels)/2)
476 |         x1, x2 = x[:n], x[n:]
477 |         labels1, labels2 = labels[:n], labels[n:]
478 |         # create enough equally spaced y values for the horizontal lines
479 |         ys = [r*PLOT_DELTA_Y for r in range(1,len(labels2)+1)]
480 |         # give ourselves a little extra room on the plot
481 |         maxx = max(x)*1.05
482 |         maxy = max(ys)*1.05
483 |         # set up plots to be a factor taller than the default size
484 |         # make factor proportional to the number of n-grams plotted
485 |         size = plt.gcf().get_size_inches()
486 |         # factor of n/10 is empirical
487 |         scale_denom = 10
488 |         fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1,figsize=(size[0], size[1]*n/scale_denom))
489 |         logging.debug("plotting top {} terms".format(n))
490 |         logging.debug("plot size=({},{})".format(size[0], size[1]*n/scale_denom))
491 |         #  first plot 1-grams
492 |         ax1.set_xlim(0,maxx)
493 |         ax1.set_ylim(0,maxy)
494 |         ticks = ax1.yaxis.set_ticks(ys)
495 |         text = ax1.yaxis.set_ticklabels(labels1)
496 |         for ct, item in enumerate(labels1):
497 |             ax1.hlines(ys[ct], 0, maxx, linestyle='dashed', color='0.9')
498 |         ax1.plot(x1, ys, 'ko')
499 |         ax1.set_title("1-grams")
500 |         # second plot 2-grams
501 |         ax2.set_xlim(0,maxx)
502 |         ax2.set_ylim(0,maxy)
503 |         ticks = ax2.yaxis.set_ticks(ys)
504 |         text = ax2.yaxis.set_ticklabels(labels2)
505 |         for ct, item in enumerate(labels2):
506 |             ax2.hlines(ys[ct], 0, maxx, linestyle='dashed', color='0.9')
507 |         ax2.plot(x2, ys, 'ko')
508 |         ax2.set_title("2-grams")
509 |         ax2.set_xlabel("Fraction of Mentions")
510 |         #
511 |         plt.tight_layout()
512 |         plt.savefig(path)
513 |         plt.close("all")
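        # Example call (hypothetical data): the first half of the inputs is plotted as
        # 1-grams and the second half as 2-grams, e.g.
        #   self.dotplot([0.12, 0.08, 0.10, 0.05],
        #                ["quake", "magnitude", "big one", "felt it"],
        #                path=os.path.join(PLOTS_PREFIX, "example_peak_0.png"))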
514 | 
515 |     def plots(self, ts, out_type="png"):
516 |         """Basic plots for the analysis. If you wish to extend this class,
517 |         override this method."""
518 |         # create a valid file name; an additional requirement in this case is no spaces
519 |         valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
520 |         filter_prefix_name = ''.join(c for c in self.options.filter if c in valid_chars)
521 |         filter_prefix_name = filter_prefix_name.replace(" ", "_")
522 |         if len(filter_prefix_name) > 16:
523 |             filter_prefix_name = filter_prefix_name[:16]
524 |         if self.options.second_filter is not None:
525 |             second_filter_prefix_name = ''.join(c for c in self.options.second_filter if c in valid_chars)
526 |             second_filter_prefix_name = second_filter_prefix_name.replace(" ", "_")
527 |             if len(second_filter_prefix_name) > 16:
528 |                 second_filter_prefix_name = second_filter_prefix_name[:16]
529 |         ######################
530 |         # timeline
531 |         ######################
532 |         df0 = pd.Series(ts.counts, index=ts.dates)
533 |         df0.plot()
534 |         plt.ylabel("Counts")
535 |         plt.title(filter_prefix_name)
536 |         plt.tight_layout()
537 |         plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "time_line", out_type)))
538 |         plt.close("all")
539 |         ######################
540 |         # cycle and trend
541 |         ######################
542 |         df1 = pd.DataFrame({"cycle":ts.cycle, "trend":ts.trend}, index=ts.dates)
543 |         df1.plot()
544 |         plt.ylabel("Counts")
545 |         plt.title(filter_prefix_name)
546 |         plt.tight_layout()
547 |         plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "cycle_trend_line", out_type)))
548 |         plt.close("all")
549 |         ######################
550 |         # moving avg
551 |         ######################
552 |         if len(ts.moving) <= 3:
553 |             logging.warn("Too little data for a moving average")
554 |         else:
555 |             df2 = pd.DataFrame({"moving":ts.moving}, index=ts.dates[:len(ts.moving)])
556 |             df2.plot()
557 |             plt.ylabel("Counts")
558 |             plt.title(filter_prefix_name)
559 |             plt.tight_layout()
560 |             plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "mov_avg_line", out_type)))
561 |             plt.close("all")
562 |         ######################
563 |         # timeline with peaks marked by vertical bands
564 |         ######################
565 |         df3 = pd.Series(ts.counts, index=ts.dates)
566 |         df3.plot()
567 |         # peaks
568 |         for a in ts.top_peaks:
569 |             xs = a[2][7]
570 |             xp = a[2][8]
571 |             xe = a[2][9]
572 |             y = a[2][5]
573 |             # need to get x and y locs
574 |             plt.axvspan(xs, xe, ymin=0, ymax = y, linewidth=1, color='g', alpha=0.2)
575 |             plt.axvline(xp, ymin=0, ymax = y, linewidth=1, color='y')
576 |         plt.ylabel("Counts")
577 |         plt.title(filter_prefix_name)
578 |         plt.tight_layout()
579 |         plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "time_peaks_line", out_type)))
580 |         plt.close("all")
581 |         ######################
582 |         # n-grams to help determine topics of peaks
583 |         ######################
584 |         for n, p in enumerate(ts.topics):
585 |             x = []
586 |             labels = []
587 |             for i in p:
588 |                 x.append(i[1])
589 |                 labels.append(i[4])
590 |             try:
591 |                 logging.info("creating n-grams dotplot for peak {}".format(n))
592 |                 path = os.path.join(PLOTS_PREFIX, "{}_{}_{}.{}".format(filter_prefix_name, "peak", n, out_type))
593 |                 self.dotplot(x, labels, path)
594 |             except ValueError as e:
595 |                 logging.error("{} - plot path={} skipped".format(e, path))
596 |         ######################
597 |         # x vs y scatter plot for correlations 
598 |         ######################
599 |         if self.options.second_filter is not None:
600 |             logging.info("creating scatter for queries {} and {}".format(self.options.filter, self.options.second_filter))
601 |             df4 = pd.DataFrame({filter_prefix_name: ts.counts, second_filter_prefix_name:ts.second_counts})
602 |             df4.plot(kind='scatter', x=filter_prefix_name, y=second_filter_prefix_name)
603 |             plt.ylabel(second_filter_prefix_name)
604 |             plt.xlabel(filter_prefix_name)
605 |             plt.xlim([0, 1.05 * max(ts.counts)])
606 |             plt.ylim([0, 1.05 * max(ts.second_counts)])
607 |             plt.title("{} vs. {}".format(second_filter_prefix_name, filter_prefix_name))
608 |             plt.tight_layout()
609 |             plt.savefig(os.path.join(PLOTS_PREFIX, '{}_v_{}_{}.{}'.format(filter_prefix_name, 
610 |                                 second_filter_prefix_name, 
611 |                                 "scatter", 
612 |                                 out_type)))
613 |             plt.close("all")
614 | 
615 | if __name__ == "__main__":
616 |     """ Simple command line utility."""
617 |     import pickle
618 |     g = GnipSearchTimeseries()
619 |     if FROM_PICKLE:
620 |         ts = pickle.load(open("./time_series.pickle", "rb"))
621 |     else:
622 |         ts = g.get_results()
623 |         pickle.dump(ts,open("./time_series.pickle", "wb"))
624 |     g.plots(ts)
625 | 


--------------------------------------------------------------------------------
/img/earthquake_cycle_trend_line.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/img/earthquake_cycle_trend_line.png


--------------------------------------------------------------------------------
/img/earthquake_time_line.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/img/earthquake_time_line.png


--------------------------------------------------------------------------------
/img/earthquake_time_peaks_line.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/img/earthquake_time_peaks_line.png


--------------------------------------------------------------------------------
/job.json:
--------------------------------------------------------------------------------
 1 | {
 2 |     "rules":[
 3 |               {"value":"dog", "tag":"common pet"}
 4 |             , {"value":"cat", "tag":"common pet"}
 5 |             , {"value":"hamster", "tag":"common pet"}
 6 |             , {"value":"pet", "tag":"abstract pet"}
 7 |             , {"value":"vet", "tag":"pet owner destination"}
 8 |             , {"value":"kennel", "tag":"pet owner destination"}
 9 |             , {"value":"puppy OR kitten", "tag":"diminutives"}
10 |             ],
11 |     "negation_rules":[
12 |               {"value":"tracter", "tag":"type of cat"}
13 |             , {"value":"dozer", "tag":"type of cat"}
14 |             , {"value":"grader", "tag":"type of cat"}
15 |             , {"value":"\"skid loader\"", "tag":"type of cat"}
16 |             ],
17 |     "date_ranges": [
18 |             {"start":"2015-05-01T00:00:00", "end":"2015-06-01T00:00:00"}
19 |             , {"start":"2015-11-01T00:00:00", "end":"2015-12-01T00:00:00"}
20 |             ]
21 | }
22 | 
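A minimal sketch (not taken from gnip_filter_analysis.py, which is not reproduced here) of
how a job description like the one above could be iterated, assuming every rule is
evaluated over every date range; negation_rules follow the same shape:

    #!/usr/bin/env python
    import json

    with open("job.json") as f:
        job = json.load(f)

    # walk every (rule, date range) combination described by the job file
    for date_range in job["date_ranges"]:
        for rule in job["rules"]:
            print(rule["value"], rule["tag"], date_range["start"], date_range["end"])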


--------------------------------------------------------------------------------
/rules.txt:
--------------------------------------------------------------------------------
1 | (from:drskippy27 OR from:gnip) data
2 | obama bieber
3 | 


--------------------------------------------------------------------------------
/search/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['api'
2 |             , 'results']
3 | 


--------------------------------------------------------------------------------
/search/api.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: UTF-8 -*-
  3 | __author__="Scott Hendrickson, Josh Montague" 
  4 | 
  5 | import sys
  6 | import requests
  7 | import json
  8 | import codecs
  9 | import datetime
 10 | import time
 11 | import os
 12 | import re
 13 | import unicodedata
 14 | 
 15 | from acscsv.twitter_acs import TwacsCSV
 16 | 
 17 | ## update for python3
 18 | if sys.version_info[0] == 2:
 19 |     reload(sys)
 20 |     sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
 21 |     sys.stdin = codecs.getreader('utf-8')(sys.stdin)
 22 | 
 23 | #remove this
 24 | requests.packages.urllib3.disable_warnings()
 25 | 
 26 | # formatter of data from API 
 27 | TIME_FORMAT_SHORT = "%Y%m%d%H%M"
 28 | TIME_FORMAT_LONG = "%Y-%m-%dT%H:%M:%S.000Z"
 29 | PAUSE = 1 # seconds between page requests
 30 | POSTED_TIME_IDX = 1
 31 | #date time parsing utility regex
 32 | DATE_TIME_RE = re.compile("([0-9]{4}).([0-9]{2}).([0-9]{2}).([0-9]{2}):([0-9]{2})")
 33 | 
 34 | class Query(object):
 35 |     """Object represents a single search API query and provides utilities for
 36 |        managing parameters, executing the query and parsing the results."""
 37 |     
 38 |     def __init__(self
 39 |             , user
 40 |             , password
 41 |             , stream_url
 42 |             , paged = False
 43 |             , output_file_path = None
 44 |             , hard_max = None
 45 |             ):
 46 |         """A Query requires at least a valid user name, password and endpoint url.
 47 |            The URL of the endpoint should be the JSON records endpoint, not the counts
 48 |            endpoint.
 49 | 
 50 |            Additional parameters specifying paged search and output file path allow
 51 |            for making queries which return more than the 500 activity limit imposed by
 52 |            a single call to the API. This is called paging or paged search. Setting 
 53 |            paged = True will enable the token interpretation 
 54 |            functionality provided in the API to return a seamless set of activities.
 55 | 
 56 |            Once the object is created, it can be used for repeated access to the
 57 |            configured end point with the same connection configuration set at
 58 |            creation."""
 59 |         self.output_file_path = output_file_path
 60 |         self.paged = paged
 61 |         self.hard_max = hard_max
 62 |         self.paged_file_list = []
 63 |         self.user = user
 64 |         self.password = password
 65 |         self.end_point = stream_url # activities end point NOT the counts end point
 66 |         # get a parser for the twitter columns
 67 |         # TODO: use the updated retrieval methods in gnacs instead of this?
 68 |         self.twitter_parser = TwacsCSV(",", None, False, True, False, True, False, False, False)
 69 |         # Flag for post processing tweet timeline from tweet times
 70 |         self.tweet_times_flag = False
 71 | 
 72 |     def set_dates(self, start, end):
 73 |         """Utility function to set dates from strings. Given string-formatted
 74 |            dates for start date time and end date time, extract the required
 75 |            date string format for use in the API query and make sure they
 76 |            are valid dates. 
 77 | 
 78 |            Sets class fromDate and toDate date strings."""
 79 |         if start:
 80 |             dt = re.search(DATE_TIME_RE, start)
 81 |             if not dt:
 82 |                 raise ValueError("Error. Invalid start-date format: %s \n"%str(start))
 83 |             else:
 84 |                 f =''
 85 |                 for i in range(re.compile(DATE_TIME_RE).groups):
 86 |                     f += dt.group(i+1) 
 87 |                 self.fromDate = f
 88 |                 # make sure this is a valid date
 89 |                 tmp_start = datetime.datetime.strptime(f, TIME_FORMAT_SHORT)
 90 | 
 91 |         if end:
 92 |             dt = re.search(DATE_TIME_RE, end)
 93 |             if not dt:
 94 |                 raise ValueError("Error. Invalid end-date format: %s \n"%str(end))
 95 |             else:
 96 |                 e =''
 97 |                 for i in range(re.compile(DATE_TIME_RE).groups):
 98 |                     e += dt.group(i+1) 
 99 |                 self.toDate = e
100 |                 # make sure this is a valid date
101 |                 tmp_end = datetime.datetime.strptime(e, TIME_FORMAT_SHORT)
102 |                 if start:
103 |                     if tmp_start >= tmp_end:
104 |                         raise ValueError("Error. Start date greater than end date.\n")
105 | 
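    # Example (illustrative): set_dates("2015-05-01T00:00", "2015-06-01T00:00") leaves
    # self.fromDate == "201505010000" and self.toDate == "201506010000", the compact
    # TIME_FORMAT_SHORT form placed into the API payload by execute().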
106 |     def name_munger(self, f):
107 |         """Utility function to create a valid, friendly file name base 
108 |            string from an input rule."""
109 |         f = re.sub(' +','_',f)
110 |         f = f.replace(':','_')
111 |         f = f.replace('"','_Q_')
112 |         f = f.replace('(','_p_') 
113 |         f = f.replace(')','_p_') 
114 |         self.file_name_prefix = unicodedata.normalize(
115 |                 "NFKD", f[:42]).encode(
116 |                         "ascii", "ignore").decode("ascii")
117 | 
118 |     def request(self):
119 |         """HTTP request based on class variables for rule_payload, 
120 |            stream_url, user and password"""
121 |         try:
122 |             s = requests.Session()
123 |             s.headers = {'Accept-encoding': 'gzip'}
124 |             s.auth = (self.user, self.password)
125 |             res = s.post(self.stream_url, data=json.dumps(self.rule_payload))
126 |             if res.status_code != 200:
127 |                 sys.stderr.write("Exiting with HTTP error code {}\n".format(res.status_code))
128 |                 sys.stderr.write("ERROR Message: {}\n".format(res.json()["error"]["message"]))
129 |                 if 1==1: #self.return_incomplete:
130 |                     sys.stderr.write("Returning incomplete dataset.\n")
131 |                     return(res.content.decode(res.encoding))
132 |                 sys.exit(-1)
133 |         except requests.exceptions.ConnectionError as e:
134 |             e.msg = "Error (%s). Exiting without results."%str(e)
135 |             raise e
136 |         except requests.exceptions.HTTPError as e:
137 |             e.msg = "Error (%s). Exiting without results."%str(e)
138 |             raise e
139 |         except requests.exceptions.MissingSchema as e:
140 |             e.msg = "Error (%s). Exiting without results."%str(e)
141 |             raise e
142 |         #Don't use res.text as it creates encoding challenges!
143 |         return(res.content.decode(res.encoding))
144 | 
145 |     def parse_responses(self, count_bucket):
146 |         """Parse returned responses.
147 | 
148 |            When paged=True, manage paging using the API token mechanism
149 |            
150 |            When output file is set, write output files for paged output."""
151 |         acs = []
152 |         repeat = True
153 |         page_count = 1
154 |         self.paged_file_list = []
155 |         while repeat:
156 |             doc = self.request()
157 |             tmp_response = json.loads(doc)
158 |             if "results" in tmp_response:
159 |                 acs.extend(tmp_response["results"])
160 |             else:
161 |                 raise ValueError("Invalid request\nQuery: %s\nResponse: %s"%(self.rule_payload, doc))
162 |             if self.hard_max is None or len(acs) < self.hard_max:
163 |                 repeat = False
164 |                 if self.paged or count_bucket:
165 |                     if len(acs) > 0:
166 |                         if self.output_file_path is not None:
167 |                             # writing to file
168 |                             file_name = self.output_file_path + "/{0}_{1}.json".format(
169 |                                     str(datetime.datetime.utcnow().strftime(
170 |                                         "%Y%m%d%H%M%S"))
171 |                                   , str(self.file_name_prefix))
172 |                             with codecs.open(file_name, "wb","utf-8") as out:
173 |                                 for item in tmp_response["results"]:
174 |                                     out.write(json.dumps(item)+"\n")
175 |                             self.paged_file_list.append(file_name)
176 |                             # if writing to file, don't keep track of all the data in memory
177 |                             acs = []
178 |                         else:
179 |                             # storing in memory, so give some feedback as to size
180 |                             sys.stderr.write("[{0:8d} bytes] {1:5d} total activities retrieved...\n".format(
181 |                                                                 sys.getsizeof(acs)
182 |                                                               , len(acs)))
183 |                     else:
184 |                         sys.stderr.write( "No results returned for rule:{0}\n".format(str(self.rule_payload)) ) 
185 |                     if "next" in tmp_response:
186 |                         self.rule_payload["next"]=tmp_response["next"]
187 |                         repeat = True
188 |                         page_count += 1
189 |                         sys.stderr.write( "Fetching page {}...\n".format(page_count) )
190 |                     else:
191 |                         if "next" in self.rule_payload:
192 |                             del self.rule_payload["next"]
193 |                         repeat = False
194 |                     time.sleep(PAUSE)
195 |             else:
196 |                 # stop iterating after reaching hard_max
197 |                 repeat = False
198 |         return acs
199 | 
200 |     def get_time_series(self):
201 |         if self.paged and self.output_file_path is not None:
202 |             for file_name in self.paged_file_list:
203 |                 with codecs.open(file_name,"rb") as f:
204 |                     for res in f:
205 |                         rec = json.loads(res.decode('utf-8').strip())
206 |                         t = datetime.datetime.strptime(rec["timePeriod"], TIME_FORMAT_SHORT)
207 |                         yield [rec["timePeriod"], rec["count"], t]
208 |         else:
209 |             if self.tweet_times_flag:
210 |                 # todo: list of tweets, aggregate by bucket
211 |                 raise NotImplementedError("Aggregated buckets on json tweets not implemented!")
212 |             else:
213 |                 for i in self.time_series:
214 |                     yield i
215 | 
216 | 
217 |     def get_activity_set(self):
218 |         """Generator iterates through the entire activity set from memory or disk."""
219 |         if self.paged and self.output_file_path is not None:
220 |             for file_name in self.paged_file_list:
221 |                 with codecs.open(file_name,"rb") as f:
222 |                     for res in f:
223 |                         yield json.loads(res.decode('utf-8'))
224 |         else:
225 |             for res in self.rec_dict_list:
226 |                 yield res
227 | 
228 |     def get_list_set(self):
229 |         """Like get_activity_set, but returns a list containing values parsed by 
230 |            current Twacs parser configuration."""
231 |         for rec in self.get_activity_set():
232 |             yield self.twitter_parser.get_source_list(rec)
233 | 
234 |     def execute(self
235 |             , pt_filter
236 |             , max_results = 100
237 |             , start = None
238 |             , end = None
239 |             , count_bucket = None # None is json
240 |             , show_query = False):
241 |         """Execute a query with filter, maximum results, start and end dates.
242 | 
243 |            Count_bucket determines the bucket size for the counts endpoint.
244 |            If the count_bucket variable is set to a valid bucket size such 
245 |            as minute, hour or day, then the activity counts endpoint will be used.
246 |            Otherwise, the data endpoint is used."""
247 |         # set class start and stop datetime variables
248 |         self.set_dates(start, end)
249 |         # make a friendlier file name from the rules
250 |         self.name_munger(pt_filter)
251 |         if self.paged or max_results > 500:
252 |             # avoid making many small requests
253 |             max_results = 500
254 |         self.rule_payload = {
255 |                     'query': pt_filter
256 |             }
257 |         self.rule_payload["maxResults"] = int(max_results)
258 |         if start:
259 |             self.rule_payload["fromDate"] = self.fromDate
260 |         if end:
261 |             self.rule_payload["toDate"] = self.toDate
262 |         # use the proper endpoint url
263 |         self.stream_url = self.end_point
264 |         if count_bucket:
265 |             if not self.end_point.endswith("counts.json"): 
266 |                 self.stream_url = self.end_point[:-5] + "/counts.json"
267 |             if count_bucket not in ['day', 'minute', 'hour']:
268 |                 raise ValueError("Error. Invalid count bucket: %s \n"%str(count_bucket))
269 |             self.rule_payload["bucket"] = count_bucket
270 |             self.rule_payload.pop("maxResults",None)
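        # Example payloads built above (illustrative values):
        #   data endpoint:   {"query": "earthquake", "maxResults": 500,
        #                     "fromDate": "201505010000", "toDate": "201506010000"}
        #   counts endpoint: {"query": "earthquake", "bucket": "hour",
        #                     "fromDate": "201505010000", "toDate": "201506010000"}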
271 |         # for testing, show the query JSON and stop
272 |         if show_query:
273 |             sys.stderr.write("API query:\n")
274 |             sys.stderr.write(json.dumps(self.rule_payload) + '\n')
275 |             sys.exit() 
276 |         # set up variable to catch the data in 3 formats
277 |         self.time_series = []
278 |         self.rec_dict_list = []
279 |         self.rec_list_list = []
280 |         self.res_cnt = 0
281 |         # timing
282 |         self.delta_t = 1    # keeps us from crashing 
283 |         # actual oldest tweet before now
284 |         self.oldest_t = datetime.datetime.utcnow()
285 |         # actual newest tweet more recent that 30 days ago
286 |         # self.newest_t = datetime.datetime.utcnow() - datetime.timedelta(days=30)
287 |         # search v2: newest date is more recent than 2006-03-01T00:00:00
288 |         self.newest_t = datetime.datetime.strptime("2006-03-01T00:00:00.000z", TIME_FORMAT_LONG)
289 |         #
290 |         for rec in self.parse_responses(count_bucket):
291 |             # parse_responses returns only the last set of activities retrieved, not all paged results.
292 |             # to access the entire set, use the helper functions get_activity_set and get_list_set!
293 |             self.res_cnt += 1
294 |             self.rec_dict_list.append(rec)
295 |             if count_bucket:
296 |                 # timeline data
297 |                 t = datetime.datetime.strptime(rec["timePeriod"], TIME_FORMAT_SHORT)
298 |                 tmp_tl_list = [rec["timePeriod"], rec["count"], t]
299 |                 self.tweet_times_flag = False
300 |             else:
301 |                 # json activities
302 |                 # keep track of tweet times for time calculation
303 |                 tmp_list = self.twitter_parser.procRecordToList(rec)
304 |                 self.rec_list_list.append(tmp_list)
305 |                 t = datetime.datetime.strptime(tmp_list[POSTED_TIME_IDX], TIME_FORMAT_LONG)
306 |                 tmp_tl_list = [tmp_list[POSTED_TIME_IDX], 1, t]
307 |                 self.tweet_times_flag = True
308 |             # this list is ***either*** list of buckets or list of tweet times!
309 |             self.time_series.append(tmp_tl_list)
310 |             # timeline requests don't return activities!
311 |             if t < self.oldest_t:
312 |                 self.oldest_t = t
313 |             if t > self.newest_t:
314 |                 self.newest_t = t
315 |             self.delta_t = (self.newest_t - self.oldest_t).total_seconds()/60.
316 |         return 
317 | 
318 |     def get_rate(self):
319 |         """Returns rate from last query executed"""
320 |         if self.delta_t != 0:
321 |             return float(self.res_cnt)/self.delta_t
322 |         else:
323 |             return None
324 | 
325 |     def __len__(self):
326 |         """Returns the size of the results set when len(Query) is called."""
327 |         try:
328 |             return self.res_cnt
329 |         except AttributeError:
330 |             return 0
331 | 
332 |     def __repr__(self):
333 |         """Returns a string representation of the result set."""
334 |         try:
335 |             return "\n".join([json.dumps(x) for x in self.rec_dict_list])
336 |         except AttributeError:
337 |             return "No query completed."
338 | 
339 | if __name__ == "__main__":
340 |     g = Query("shendrickson@gnip.com"
341 |             , "XXXXXPASSWORDXXXXX"
342 |             , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json")
343 |     g.execute("bieber", 10)
344 |     for x in g.get_activity_set():
345 |         print(x)
346 |     print(g)
347 |     print(g.get_rate())
348 |     g.execute("bieber", count_bucket = "hour")
349 |     print(g)
350 |     print(len(g))
351 |     pg = Query("shendrickson@gnip.com"
352 |             , "XXXXXPASSWORDXXXXX"
353 |             , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json"
354 |             , paged = True 
355 |             , output_file_path = "../data/")
356 |     now_date = datetime.datetime.now()
357 |     pg.execute("bieber"
358 |             , end=now_date.strftime(TIME_FORMAT_LONG)
359 |             , start=(now_date - datetime.timedelta(seconds=200)).strftime(TIME_FORMAT_LONG))
360 |     for x in pg.get_activity_set():
361 |         print(x)
362 |     g.execute("bieber", show_query=True)
363 | 


--------------------------------------------------------------------------------
/search/results.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: UTF-8 -*-
  3 | __author__="Scott Hendrickson, Josh Montague" 
  4 | 
  5 | import sys
  6 | import codecs
  7 | import datetime
  8 | import time
  9 | import os
 10 | import re
 11 | 
 12 | from .api import *
 13 | from simple_n_grams.simple_n_grams import SimpleNGrams
 14 | 
 15 | if sys.version_info[0] < 3:
 16 |     try:
 17 |         reload(sys)
 18 |         sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
 19 |         sys.stdin = codecs.getreader('utf-8')(sys.stdin)
 20 |     except NameError:
 21 |         pass
 22 | 
 23 | #############################################
 24 | # Some constants to configure column retrieval from TwacsCSV
 25 | DATE_INDEX = 1
 26 | TEXT_INDEX = 2
 27 | LINKS_INDEX = 3
 28 | USER_NAME_INDEX = 7 
 29 | USER_ID_INDEX = 8
 30 | OUTPUT_PAGE_WIDTH = 120 
 31 | BIG_COLUMN_WIDTH = 32
 32 | 
 33 | class Results():
 34 |     """Class for aggregating and accessing search result sets and
 35 |        subsets.  Returns derived values for the query specified."""
 36 | 
 37 |     def __init__(self
 38 |             , user
 39 |             , password
 40 |             , stream_url
 41 |             , paged = False
 42 |             , output_file_path = None
 43 |             , pt_filter = None
 44 |             , max_results = 100
 45 |             , start = None
 46 |             , end = None
 47 |             , count_bucket = None
 48 |             , show_query = False
 49 |             , hard_max = None
 50 |             ):
 51 |         """Create a result set by passing all of the required parameters
 52 |            for a query. The Results class runs an API query once when 
 53 |            initialized. This allows one to make multiple calls 
 54 |            to analytics methods on a single query.
 55 |         """
 56 |         # run the query
 57 |         self.query = Query(user, password, stream_url, paged, output_file_path, hard_max)
 58 |         self.query.execute(
 59 |             pt_filter=pt_filter
 60 |             , max_results = max_results
 61 |             , start = start
 62 |             , end = end
 63 |             , count_bucket = count_bucket
 64 |             , show_query = show_query
 65 |             )
 66 |         self.freq = None
 67 | 
 68 |     def get_activities(self):
 69 |         """Generator of query results."""
 70 |         for x in self.query.get_activity_set():
 71 |             yield x
 72 | 
 73 |     def get_time_series(self):
 74 |         """Generator of time series for query results."""
 75 |         for x in self.query.get_time_series():
 76 |             yield x
 77 | 
 78 |     def get_top_links(self, n=20):
 79 |         """Returns the links most shared in the retrieved data set, in
 80 |            descending order of how many times each was shared."""
 81 |         self.freq = SimpleNGrams(char_upper_cutoff=100, tokenizer="space")
 82 |         for x in self.query.get_list_set():
 83 |             link_str = x[LINKS_INDEX]
 84 |             if link_str != "GNIPEMPTYFIELD" and link_str != "None":
 85 |                 self.freq.add(link_str)
 86 |             else:
 87 |                 self.freq.add("NoLinks")
 88 |         return self.freq.get_tokens(n)
 89 | 
 90 |     def get_top_users(self, n=50):
 91 |         """Returns the users tweeting most often in the retrieved data
 92 |            set. Users are returned in descending order of how many
 93 |            tweets they posted."""
 94 |         self.freq = SimpleNGrams(char_upper_cutoff=20, tokenizer="twitter")
 95 |         for x in self.query.get_list_set():
 96 |             self.freq.add(x[USER_NAME_INDEX])
 97 |         return self.freq.get_tokens(n) 
 98 | 
 99 |     def get_users(self, n=None):
100 |         """Returns the set of unique user ids for the tweets collected."""
101 |         uniq_users = set()
102 |         for x in self.query.get_list_set():
103 |             uniq_users.add(x[USER_ID_INDEX])
104 |         return uniq_users
105 | 
106 |     def get_top_grams(self, n=20):
107 |         self.freq = SimpleNGrams(char_upper_cutoff=20, tokenizer="twitter")
108 |         self.freq.sl.add_session_stop_list(["http", "https", "amp", "htt"])
109 |         for x in self.query.get_list_set():
110 |             self.freq.add(x[TEXT_INDEX])
111 |         return self.freq.get_tokens(n) 
112 |             
113 |     def get_geo(self):
114 |         for rec in self.query.get_activity_set():
115 |             lat, lng = None, None
116 |             if "geo" in rec:
117 |                 if "coordinates" in rec["geo"]:
118 |                     [lat,lng] = rec["geo"]["coordinates"]
119 |                     activity = { "id": rec["id"].split(":")[2]
120 |                         , "postedTime": rec["postedTime"].replace(".000Z", "")  # replace(), not strip(), so trailing zeros survive
121 |                         , "latitude": lat
122 |                         , "longitude": lng }
123 |                     yield activity
124 |  
125 |     def get_frequency_items(self, size = 20):
126 |         """Retrieve the token list structure from the last query"""
127 |         if self.freq is None:
128 |             raise ValueError("No frequency list available; run get_top_links, get_top_users, or get_top_grams first")
129 |         return self.freq.get_tokens(size)
130 | 
131 |     def __len__(self):
132 |         return len(self.query)
133 | 
134 |     def __repr__(self):
135 |         if self.query.last_query_params["count_bucket"] is None:
136 |             res = [u"-"*OUTPUT_PAGE_WIDTH]
137 |             rate = self.query.get_rate()
138 |             unit = "Tweets/Minute"
139 |             if rate < 0.01:
140 |                 rate *= 60.
141 |                 unit = "Tweets/Hour"
142 |             res.append("     PowerTrack Rule: \"%s\""%self.query.last_query_params["pt_filter"])
143 |             res.append("  Oldest Tweet (UTC): %s"%str(self.query.oldest_t))
144 |             res.append("  Newest Tweet (UTC): %s"%str(self.query.newest_t))
145 |             res.append("           Now (UTC): %s"%str(datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
146 |             res.append("        %5d Tweets: %6.3f %s"%(self.query.res_cnt, rate, unit))
147 |             res.append("-"*OUTPUT_PAGE_WIDTH)
148 |             #
149 |             self.get_top_users()
150 |             fmt_str = u"%{}s -- %10s     %8s (%d)".format(BIG_COLUMN_WIDTH)
151 |             res.append(fmt_str%( "users", "tweets", "activities", self.query.res_cnt))
152 |             res.append("-"*OUTPUT_PAGE_WIDTH)
153 |             fmt_str =  u"%{}s -- %4d  %5.2f%% %4d  %5.2f%%".format(BIG_COLUMN_WIDTH)
154 |             for x in self.freq.get_tokens(20):
155 |                 res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.))
156 |             res.append("-"*OUTPUT_PAGE_WIDTH)
157 |             #
158 |             self.get_top_links()
159 |             fmt_str = u"%{}s -- %10s     %8s (%d)".format(int(2.5*BIG_COLUMN_WIDTH))
160 |             res.append(fmt_str%( "links", "mentions", "activities", self.query.res_cnt))
161 |             res.append("-"*OUTPUT_PAGE_WIDTH)
162 |             fmt_str =  u"%{}s -- %4d  %5.2f%% %4d  %5.2f%%".format(int(2.5*BIG_COLUMN_WIDTH))
163 |             for x in self.freq.get_tokens(20):
164 |                 res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.))
165 |             res.append("-"*OUTPUT_PAGE_WIDTH)
166 |             #
167 |             self.get_top_grams()
168 |             fmt_str = u"%{}s -- %10s     %8s (%d)".format(BIG_COLUMN_WIDTH)
169 |             res.append(fmt_str%( "terms", "mentions", "activities", self.query.res_cnt))
170 |             res.append("-"*OUTPUT_PAGE_WIDTH)
171 |             fmt_str =u"%{}s -- %4d  %5.2f%% %4d  %6.2f%%".format(BIG_COLUMN_WIDTH)
172 |             for x in self.freq.get_tokens(20):
173 |                 res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.))
174 |             res.append("-"*OUTPUT_PAGE_WIDTH)
175 |         else:
176 |             res = ["{:%Y-%m-%dT%H:%M:%S},{}".format(x[2], x[1])
177 |                         for x in self.get_time_series()]
178 |         return u"\n".join(res)
179 | 
180 | if __name__ == "__main__":
181 |     # query parameters are fixed at construction; build a separate instance
182 |     # with count_bucket="hour" to use get_time_series()
183 |     g = Results("shendrickson@gnip.com"
184 |             , "XXXXXPASSWORDXXXXX"
185 |             , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json"
186 |             , pt_filter = "bieber has:geo"
187 |             , max_results = 100)
188 |     print(g)
189 |     print( list(g.get_activities()) )
190 |     print( list(g.get_geo()) )
191 |     print( list(g.get_top_links(n = 30)) )
192 |     print( list(g.get_top_users(n = 30)) )
193 |     print( list(g.get_top_grams(n = 50)) )
194 |     print( list(g.get_frequency_items(10)) )
195 |     print(g)
196 | 
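
The Results class above runs its Query once at construction and then serves repeated analytics calls (activities, geo records, time series, top links/users/terms) from that single result set. A minimal usage sketch, assuming the package is importable as search and using placeholder credentials, account, and label rather than anything from this repository:

    # sketch only: the e-mail, password, ACCOUNT, and LABEL are placeholders
    from search.results import Results

    res = Results(
          user="me@example.com"
        , password="not-a-real-password"
        , stream_url="https://gnip-api.twitter.com/search/30day/accounts/ACCOUNT/LABEL.json"
        , pt_filter="bieber has:geo"
        , max_results=200)

    print(len(res))                        # number of activities returned
    for tweet in res.get_activities():     # raw activity dicts
        print(tweet["id"])
    print(list(res.get_top_users(n=10)))   # token/count tuples from SimpleNGrams
    print(list(res.get_top_links(n=10)))
    print(list(res.get_geo()))             # id / postedTime / latitude / longitude records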


--------------------------------------------------------------------------------
/search/test_api.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: UTF-8 -*-
  3 | __author__="Scott Hendrickson, Josh Montague" 
  4 | 
  5 | import requests
  6 | import unittest
  7 | import os
  8 | 
  9 | # establish import context and then import explicitly 
 10 | #from .context import gpt
 11 | #from gpt.rules import rules as gpt_r
 12 | from api import *
 13 | 
 14 | class TestQuery(unittest.TestCase):
 15 |     
 16 |     def setUp(self):
 17 |         self.g = Query("shendrickson@gnip.com"
 18 |             , "XXXXXXXXX"
 19 |             , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json")
 20 |         self.g_paged = Query("shendrickson@gnip.com"
 21 |             , "XXXXXXXXX"
 22 |             , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json"
 23 |             , paged = True
 24 |             , output_file_path = ".")
 25 | 
 26 |     def tearDown(self):
 27 |         # remove stray files
 28 |         for f in os.listdir("."):
 29 |             if re.search("bieber.json", f):
 30 |                 os.remove(os.path.join(".", f))
 31 | 
 32 |     def test_set_dates(self):
 33 |         s = "2014-11-01T00:00:30"
 34 |         e = "2014-11-02T00:20:00"
 35 |         self.g.set_dates(s,e)
 36 |         self.assertEquals(self.g.fromDate, "201411010000")
 37 |         self.assertEquals(self.g.toDate, "201411020020")
 38 |         with self.assertRaises(ValueError) as cm:
 39 |             self.g.set_dates(e,s)
 40 |         e = "201/1/0T00:20:00"
 41 |         with self.assertRaises(ValueError) as cm:
 42 |             self.g.set_dates(s,e)
 43 | 
 44 |     def test_name_munger(self):
 45 |         self.g.name_munger("adsfadsfa")
 46 |         self.assertEquals("adsfadsfa", self.g.file_name_prefix)
 47 |         self.g.name_munger('adsf"adsfa')
 48 |         self.assertEquals("adsf_Q_adsfa", self.g.file_name_prefix)
 49 |         self.g.name_munger("adsf(dsfa")
 50 |         self.assertEquals("adsf_p_dsfa", self.g.file_name_prefix)
 51 |         self.g.name_munger("adsf)dsfa")
 52 |         self.assertEquals("adsf_p_dsfa", self.g.file_name_prefix)
 53 |         self.g.name_munger("adsf:dsfa")
 54 |         self.assertEquals("adsf_dsfa", self.g.file_name_prefix)
 55 |         self.g.name_munger("adsf dsfa")
 56 |         self.assertEquals("adsf_dsfa", self.g.file_name_prefix)
 57 |         self.g.name_munger("adsf  dsfa")
 58 |         self.assertEquals("adsf_dsfa", self.g.file_name_prefix)
 59 | 
 60 |     def test_req(self):
 61 |         self.g.rule_payload = {'query': 'bieber', 'maxResults': 10, 'publisher': 'twitter'}
 62 |         self.g.stream_url = self.g.end_point
 63 |         self.assertEquals(10, len(json.loads(self.g.request())["results"]))
 64 |         self.g.stream_url = "adsfadsf"
 65 |         with self.assertRaises(requests.exceptions.MissingSchema) as cm:
 66 |             self.g.request()
 67 |         self.g.stream_url = "http://ww.thisreallydoesn'texist.com"
 68 |         with self.assertRaises(requests.exceptions.ConnectionError) as cm:
 69 |             self.g.request()
 70 |         self.g.stream_url = "https://ww.thisreallydoesntexist.com"
 71 |         with self.assertRaises(requests.exceptions.ConnectionError) as cm:
 72 |             self.g.request()
 73 |         
 74 |     def test_parse_responses(self):
 75 |         self.g.rule_payload = {'query': 'bieber', 'maxResults': 10, 'publisher': 'twitter'}
 76 |         self.g.stream_url = self.g.end_point
 77 |         self.assertEquals(len(self.g.parse_responses()), 10)
 78 |         self.g.rule_payload = {'maxResults': 10, 'publisher': 'twitter'}
 79 |         self.g.stream_url = self.g.end_point
 80 |         with self.assertRaises(ValueError) as cm:
 81 |             self.g.parse_responses()
 82 |         #TODO graceful way to test write to file functionality here
 83 | 
 84 |     def test_get_activity_set(self):
 85 |         self.g.execute("bieber", max_results=10)
 86 |         self.assertEquals(len(list(self.g.get_activity_set())), 10)
 87 |         # seconds of bieber
 88 |         tmp_start =  datetime.datetime.strftime(
 89 |                     datetime.datetime.now() + datetime.timedelta(seconds = -60)
 90 |                     ,"%Y-%m-%dT%H:%M:%S")
 91 |         tmp_end = datetime.datetime.strftime(
 92 |                     datetime.datetime.now() 
 93 |                     ,"%Y-%m-%dT%H:%M:%S")
 94 |         print("bieber from", tmp_start, "to", tmp_end, file=sys.stderr)
 95 |         self.g_paged.execute("bieber"
 96 |                 , start = tmp_start
 97 |                 , end = tmp_end)
 98 |         self.assertGreater(len(list(self.g_paged.get_activity_set())), 500)
 99 | 
100 |     def test_execute(self):
101 |         #
102 |         tmp = { "pt_filter": "bieber"
103 |                 , "max_results" : 100
104 |                 , "start" : None
105 |                 , "end" : None
106 |                 , "count_bucket" : None # None is json
107 |                 , "show_query" : False }
108 |         self.g.execute(**tmp)
109 |         self.assertEquals(len(self.g), 100)
110 |         self.assertEquals(len(self.g.rec_list_list), 100)
111 |         self.assertEquals(len(self.g.rec_dict_list), 100)
112 |         self.assertEquals(self.g.rule_payload, {'query': 'bieber', 'maxResults': 100, 'publisher': 'twitter'})
113 |         #
114 |         tmp = { "pt_filter": "bieber"
115 |                 , "max_results" : 600
116 |                 , "start" : None
117 |                 , "end" : None
118 |                 , "count_bucket" : None # None is json
119 |                 , "show_query" : False }
120 |         self.g.execute(**tmp)
121 |         self.assertEquals(len(self.g), 500)
122 |         self.assertEquals(len(self.g.time_series), 500)
123 |         self.assertEquals(len(self.g.rec_list_list), 500)
124 |         self.assertEquals(len(self.g.rec_dict_list), 500)
125 |         self.assertEquals(self.g.rule_payload, {'query': 'bieber', 'maxResults': 500, 'publisher': 'twitter'})
126 |         #
127 |         tmp = datetime.datetime.now() + datetime.timedelta(seconds = -60)
128 |         tmp_start = datetime.datetime.strftime(
129 |                     tmp
130 |                     , "%Y-%m-%dT%H:%M:%S")
131 |         tmp_start_cmp =  datetime.datetime.strftime(
132 |                     tmp
133 |                     ,"%Y%m%d%H%M")
134 |         tmp = datetime.datetime.now() 
135 |         tmp_end = datetime.datetime.strftime(
136 |                     tmp
137 |                     ,"%Y-%m-%dT%H:%M:%S")
138 |         tmp_end_cmp = datetime.datetime.strftime(
139 |                     tmp
140 |                     ,"%Y%m%d%H%M")
141 |         tmp = { "pt_filter": "bieber"
142 |                 , "max_results" : 500
143 |                 , "start" : tmp_start 
144 |                 , "end" : tmp_end
145 |                 , "count_bucket" : None # None is json
146 |                 , "show_query" : False }
147 |         self.g.execute(**tmp)
148 |         self.assertEquals(len(self.g), 500)
149 |         self.assertEquals(len(self.g.time_series), 500)
150 |         self.assertEquals(len(self.g.rec_list_list), 500)
151 |         self.assertEquals(len(self.g.rec_dict_list), 500)
152 |         self.assertEquals(self.g.rule_payload, {'query': 'bieber'
153 |                                     , 'maxResults': 500
154 |                                     , 'toDate': tmp_end_cmp
155 |                                     , 'fromDate': tmp_start_cmp
156 |                                     , 'publisher': 'twitter'})
157 |         self.assertIsNotNone(self.g.fromDate)
158 |         self.assertIsNotNone(self.g.toDate)
159 |         self.assertGreater(self.g.delta_t, 0) # delta_t in minutes 
160 |         self.assertGreater(1.1, self.g.delta_t) # delta_t in minutes 
161 |         #
162 |         tmp = { "pt_filter": "bieber"
163 |                 , "max_results" : 100
164 |                 , "start" : None
165 |                 , "end" : None
166 |                 , "count_bucket" : "fortnight"
167 |                 , "show_query" : False }
168 |         with self.assertRaises(ValueError) as cm:
169 |             self.g.execute(**tmp)
170 |         #
171 |         tmp = { "pt_filter": "bieber"
172 |                 , "start" : None
173 |                 , "end" : None
174 |                 , "count_bucket" : "hour"
175 |                 , "show_query" : False }
176 |         self.g.execute(**tmp)
177 |         self.assertEquals(len(self.g), 24*30 + datetime.datetime.utcnow().hour + 1)
178 |         self.assertGreater(self.g.delta_t, 24*30*60) # delta_t in minutes 
179 | 
180 |     def test_get_rate(self):
181 |         self.g.res_cnt = 100
182 |         self.g.delta_t = 10
183 |         self.assertEquals(self.g.get_rate(), 10)
184 |         self.g.delta_t = 11
185 |         self.assertAlmostEquals(self.g.get_rate(), 9.09090909091)
186 | 
187 |     def test_len(self):
188 |         self.assertEquals(0, len(self.g))
189 |         tmp = { "pt_filter": "bieber"
190 |                 , "max_results" : 500
191 |                 , "count_bucket" : None # None is json
192 |                 , "show_query" : False }
193 |         self.g.execute(**tmp)
194 |         self.assertEquals(self.g.res_cnt, len(self.g))
195 | 
196 |     def test_repr(self):
197 |         self.assertIsNotNone(str(self.g))
198 |         tmp = { "pt_filter": "bieber"
199 |                 , "max_results" : 500
200 |                 , "count_bucket" : None # None is json
201 |                 , "show_query" : False }
202 |         self.g.execute(**tmp)
203 |         self.assertIsNotNone(str(self.g))
204 |         self.assertTrue('\n' in str(self.g))
205 |         self.assertEquals(str(self.g).count('\n'), len(self.g)-1)
206 | 
207 | if __name__ == "__main__":
208 |     unittest.main()
209 | 
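
These tests exercise the live Search API, so the placeholder credentials above have to be replaced with a working account and the machine needs network access. Because the module uses a flat "from api import *" import, the tests are presumably meant to be run from inside the search/ directory, for example with "python test_api.py" or "python -m unittest test_api".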


--------------------------------------------------------------------------------
/search/test_results.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: UTF-8 -*-
  3 | __author__="Scott Hendrickson, Josh Montague" 
  4 | 
  5 | import requests
  6 | import unittest
  7 | import os
  8 | import copy
  9 | import time
 10 | 
 11 | # establish import context and then import explicitly 
 12 | #from .context import gpt
 13 | #from gpt.rules import rules as gpt_r
 14 | from results import *
 15 | 
 16 | class TestResults(unittest.TestCase):
 17 |     
 18 |     def setUp(self):
 19 |         self.params = { 
 20 |               "user":"shendrickson@gnip.com"
 21 |             , "password":"XXXXXXXXX"
 22 |             , "stream_url":"https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json" 
 23 |             }
 24 | 
 25 |     def tearDown(self):
 26 |         # remove stray files
 27 |         for f in os.listdir("."):
 28 |             if re.search("bieber.json", f):
 29 |                 os.remove(os.path.join(".", f))
 30 | 
 31 |     def test_get(self):
 32 |         self.g = Results(
 33 |             pt_filter = "bieber" 
 34 |             , max_results = 10
 35 |             , start = None
 36 |             , end = None
 37 |             , count_bucket = None
 38 |             , show_query = False
 39 |             , **self.params)
 40 |         self.assertEquals(len(self.g), 10)
 41 | 
 42 |     def test_get_activities(self):
 43 |         self.g = Results(
 44 |                  pt_filter = "bieber"
 45 |                 , max_results = 10
 46 |                 , start = None
 47 |                 , end = None
 48 |                 , count_bucket = None
 49 |                 , show_query = False
 50 |                 , **self.params)
 51 |         for x in self.g.get_activities():
 52 |             self.assertTrue("id" in x)
 53 |         self.assertEqual(len(list(self.g.get_activities())), 10)
 54 |         # seconds of bieber
 55 |         tmp_start =  datetime.datetime.strftime(
 56 |                     datetime.datetime.now() + datetime.timedelta(seconds = -60)
 57 |                     ,"%Y-%m-%dT%H:%M:%S")
 58 |         tmp_end = datetime.datetime.strftime(
 59 |                     datetime.datetime.now() 
 60 |                     ,"%Y-%m-%dT%H:%M:%S")
 61 |         self.g_paged = Results(
 62 |                 pt_filter = "bieber"
 63 |                 , max_results = 500
 64 |                 , start = tmp_start
 65 |                 , end = tmp_end
 66 |                 , count_bucket = None
 67 |                 , show_query = False
 68 |                 , paged = True
 69 |                 , **self.params)
 70 |         tmp = len(list(self.g_paged.get_activities())) 
 71 |         self.assertGreater(tmp, 1000)
 72 |         self.g_paged = Results(
 73 |                 pt_filter = "bieber"
 74 |                 , max_results = 500
 75 |                 , start = tmp_start
 76 |                 , end = tmp_end
 77 |                 , count_bucket = None
 78 |                 , show_query = False
 79 |                 , paged = True
 80 |                 , output_file_path = "."
 81 |                 , **self.params)
 82 |         self.assertEqual(len(list(self.g_paged.get_activities())), tmp)
 83 |         
 84 |     def test_get_time_series(self):
 85 |         self.g = Results(
 86 |                 pt_filter = "bieber"
 87 |                 , max_results = 10
 88 |                 , start = None
 89 |                 , end = None
 90 |                 , count_bucket = "hour"
 91 |                 , show_query = False
 92 |                 , **self.params)
 93 |         self.assertGreater(len(list(self.g.get_time_series())), 24*30)
 94 | 
 95 |     def test_get_top_links(self):
 96 |         self.g = Results(
 97 |                 pt_filter = "bieber"
 98 |                 , max_results = 200
 99 |                 , start = None
100 |                 , end = None
101 |                 , count_bucket = None
102 |                 , show_query = False
103 |                 , **self.params)
104 |         self.assertEqual(len(list(self.g.get_top_links(n = 5))), 5)
105 |         self.assertEqual(len(list(self.g.get_top_links(n = 10))),10)
106 |         #
107 |         tmp_start = datetime.datetime.strftime(
108 |                     datetime.datetime.now() + datetime.timedelta(seconds = -60)
109 |                     ,"%Y-%m-%dT%H:%M:%S")
110 |         tmp_end = datetime.datetime.strftime(
111 |                     datetime.datetime.now() 
112 |                     ,"%Y-%m-%dT%H:%M:%S")
113 |         self.g_paged = Results(
114 |                 pt_filter = "bieber"
115 |                 , max_results = 500
116 |                 , start = tmp_start 
117 |                 , end = tmp_end
118 |                 , count_bucket = None
119 |                 , show_query = False
120 |                 , paged = True
121 |                 , **self.params)
122 |         self.assertEqual(len(list(self.g_paged.get_top_links(n = 100))), 100)
123 | 
124 |     def test_top_users(self):
125 |         self.g = Results(
126 |                 pt_filter = "bieber"
127 |                 , max_results = 200
128 |                 , start = None
129 |                 , end = None
130 |                 , count_bucket = None
131 |                 , show_query = False
132 |                 , **self.params)
133 |         self.assertEqual(len(list(self.g.get_top_users(n = 5))), 5)
134 |         self.assertEqual(len(list(self.g.get_top_users(n = 10))), 10)
135 |         #
136 |         tmp_start = datetime.datetime.strftime(
137 |                     datetime.datetime.now() + datetime.timedelta(seconds = -60)
138 |                     ,"%Y-%m-%dT%H:%M:%S")
139 |         tmp_end = datetime.datetime.strftime(
140 |                     datetime.datetime.now() 
141 |                     ,"%Y-%m-%dT%H:%M:%S")
142 |         self.g_paged = Results(
143 |                 pt_filter = "bieber"
144 |                 , max_results = 500
145 |                 , start = tmp_start 
146 |                 , end = tmp_end
147 |                 , count_bucket = None
148 |                 , show_query = False
149 |                 , paged = True
150 |                 , **self.params)
151 |         self.assertEqual(len(list(self.g_paged.get_top_users(n = 100))), 100)
152 |         self.assertEqual(len(list(self.g.get_frequency_items(8))), 8)
153 | 
154 |     def test_top_grams(self):
155 |         self.g = Results(
156 |                 pt_filter = "bieber"
157 |                 , max_results = 200
158 |                 , start = None
159 |                 , end = None
160 |                 , count_bucket = None
161 |                 , show_query = False
162 |                 , **self.params)
163 |         self.assertEqual(len(list(self.g.get_top_grams(n = 5)))  , 10)
164 |         self.assertEqual(len(list(self.g.get_top_grams(n = 10))) , 20)
165 |         self.assertEqual(len(list(self.g.get_frequency_items(8))), 16)
166 |         #
167 |         tmp_start = datetime.datetime.strftime(
168 |                     datetime.datetime.now() + datetime.timedelta(seconds = -60)
169 |                     ,"%Y-%m-%dT%H:%M:%S")
170 |         tmp_end = datetime.datetime.strftime(
171 |                     datetime.datetime.now() 
172 |                     ,"%Y-%m-%dT%H:%M:%S")
173 |         self.g_paged = Results(
174 |                 pt_filter = "bieber"
175 |                 , max_results = 500
176 |                 , start = tmp_start 
177 |                 , end = tmp_end
178 |                 , count_bucket = None
179 |                 , show_query = False
180 |                 , paged = True
181 |                 , **self.params)
182 |         self.assertEqual(len(list(self.g_paged.get_top_grams(n = 100))), 200)
183 |         
184 |     def test_get_geo(self):
185 |         self.g = Results(
186 |                 pt_filter = "bieber has:geo"
187 |                 , max_results = 200
188 |                 , start = None
189 |                 , end = None
190 |                 , count_bucket = None
191 |                 , show_query = False
192 |                 , **self.params)
193 |         tmp = len(list(self.g.get_geo()))
194 |         self.assertGreater(201, tmp)
195 |         self.assertGreater(tmp, 10)
196 | 
197 | if __name__ == "__main__":
198 |     unittest.main()
199 | 


--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
3 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from distutils.core import setup
 2 | 
 3 | setup(
 4 |     name='gapi',
 5 |     version='1.0.2',
 6 |     author='Scott Hendrickson, Josh Montague, Jeff Kolb',
 7 |     author_email='scott@drskippy.net',
 8 |     packages=['search'],
 9 |     scripts=['gnip_search.py', 'gnip_time_series.py'],
10 |     url='https://github.com/DrSkippy/Gnip-Python-Search-API-Utilities',
11 |     download_url='https://github.com/DrSkippy/Gnip-Python-Search-API-Utilities/tags/',
12 |     license='LICENSE.txt',
13 |     description='Simple utilities to explore the Gnip search API',
14 |     install_requires=[
15 |         "gnacs >= 1.1.0"
16 |         , "sngrams >= 0.2.0"
17 |         , "requests > 2.4.0"
18 |         ],
19 |     extras_require = {
20 |         'timeseries':  ["numpy >= 1.10.1"
21 |                 , "scipy >= 0.16.1"
22 |                 , "statsmodels >= 0.6.1"
23 |                 , "matplotlib >= 1.5.0"
24 |                 , "pandas >= 0.17.0"
25 |                 ],
26 |         }
27 |     )
28 | 
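
The extras_require block above keeps the plotting stack optional: a plain "pip install gapi" (or "pip install ." from a source checkout) pulls in only gnacs, sngrams, and requests, while "pip install gapi[timeseries]" adds the numpy, scipy, statsmodels, matplotlib, and pandas dependencies used by the time-series tooling.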


--------------------------------------------------------------------------------
/test_search.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env bash
 2 | 
 3 | ###
 4 | ### edit creds
 5 | ###
 6 | un=email
 7 | un=shendrickson@gnip.com
 8 | paswd=password
 9 | paswd=$1
10 | 
11 | if [ ! -d data ]; then
12 |     mkdir data
13 | fi
14 | 
15 | rulez="bieber OR bieber"
16 | if [ $(uname) == "Linux" ]; then
17 |     dt1=$(date --date="1 day ago" +%Y-%m-%dT00:00:00)
18 |     dt2=$(date --date="2 days ago" +%Y-%m-%dT00:00:00)
19 |     dt3=$(date --date="2 days ago" +%Y-%m-%dT23:55:00)
20 | else
21 |     dt1=$(date -v-1d +%Y-%m-%dT00:00:00)
22 |     dt2=$(date -v-2d +%Y-%m-%dT00:00:00)
23 |     dt3=$(date -v-2d +%Y-%m-%dT23:55:00)
24 | fi
25 | 
26 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -q json
27 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 json
28 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 geo
29 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 wordcount
30 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 timeline
31 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 users
32 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -c geo
33 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -c timeline 
34 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" json
35 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" geo
36 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" wordcount
37 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" users
38 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -s"$dt3" -e"$dt1" -aw ./data json
39 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -s"$dt3" -e"$dt1" -aw ./data geo
40 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -s"$dt3" -e"$dt1" -a users
41 | 
42 | export GNIP_CONFIG_FILE=./.gnip
43 | ./gnip_search.py -f"has:geo $rulez"  -n10 -q json
44 | ./gnip_search.py -f"has:geo $rulez"  -n10 json
45 | ./gnip_search.py -f"has:geo $rulez"  -n10 geo
46 | ./gnip_search.py -f"has:geo $rulez"  -n10 wordcount
47 | ./gnip_search.py -f"has:geo $rulez"  -n10 timeline
48 | ./gnip_search.py -f"has:geo $rulez"  -n10 users
49 | ./gnip_search.py -f"has:geo $rulez"  -n10 -c geo
50 | ./gnip_search.py -f"has:geo $rulez"  -n10 -c timeline 
51 | ./gnip_search.py -f"has:geo $rulez"  -n10 -s"$dt2" -e"$dt1" json
52 | ./gnip_search.py -f"has:geo $rulez"  -n10 -s"$dt2" -e"$dt1" geo
53 | ./gnip_search.py -f"has:geo $rulez"  -n10 -s"$dt2" -e"$dt1" wordcount
54 | ./gnip_search.py -f"has:geo $rulez"  -n10 -s"$dt2" -e"$dt1" users
55 | ./gnip_search.py -f"has:geo $rulez"  -s"$dt3" -e"$dt1" -aw ./data json
56 | 
57 | 


--------------------------------------------------------------------------------