├── .gitignore ├── CHANGES.txt ├── LICENSE.txt ├── MANIFEST.in ├── README ├── README.md ├── example_config_file ├── gnip_filter_analysis.py ├── gnip_search.py ├── gnip_time_series.py ├── img ├── earthquake_cycle_trend_line.png ├── earthquake_time_line.png └── earthquake_time_peaks_line.png ├── job.json ├── rules.txt ├── search ├── __init__.py ├── api.py ├── results.py ├── test_api.py └── test_results.py ├── setup.cfg ├── setup.py └── test_search.sh /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | .gnip 3 | *.csv 4 | *.png 5 | *.swp 6 | *.pickle 7 | *.log 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Packages 13 | *.egg 14 | *.egg-info 15 | dist 16 | build 17 | eggs 18 | parts 19 | bin 20 | var 21 | sdist 22 | develop-eggs 23 | .installed.cfg 24 | lib 25 | lib64 26 | 27 | # Installer logs 28 | pip-log.txt 29 | 30 | # Unit test / coverage reports 31 | .coverage 32 | .tox 33 | nosetests.xml 34 | 35 | # Translations 36 | *.mo 37 | 38 | # Mr Developer 39 | .mr.developer.cfg 40 | .project 41 | .pydevproject 42 | MANIFEST 43 | -------------------------------------------------------------------------------- /CHANGES.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/CHANGES.txt -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2012, Scott Hendrickson 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the FreeBSD Project. 
27 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include *.txt 2 | recursive-include docs *.txt 3 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | See README.md 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Gnip Python Search API Utilities 2 | ================================ 3 | 4 | This package includes two utilities: 5 | - Gnip Search API interactions include Search V2 and paging support 6 | - Timeseries analysis and plotting 7 | 8 | #### Installation 9 | Install from PyPI with `pip install gapi` 10 | Or to use the full time line capability, `pip install gapi[timeline]` 11 | 12 | ## Search API 13 | 14 | Usage: 15 | 16 | $ gnip_search.py -h 17 |
18 | usage: gnip_search.py [-h] [-a] [-c] [-b COUNT_BUCKET] [-e END] [-f FILTER] 19 | [-l STREAM_URL] [-n MAX] [-N HARD_MAX] [-p PASSWORD] 20 | [-q] [-s START] [-u USER] [-w OUTPUT_FILE_PATH] [-t] 21 | USE_CASE 22 | 23 | GnipSearch supports the following use cases: ['json', 'wordcount', 'users', 24 | 'rate', 'links', 'timeline', 'geo', 'audience'] 25 | 26 | positional arguments: 27 | USE_CASE Use case for this search. 28 | 29 | optional arguments: 30 | -h, --help show this help message and exit 31 | -a, --paged Paged access to ALL available results (Warning: this 32 | makes many requests) 33 | -c, --csv Return comma-separated 'date,counts' or geo data. 34 | -b COUNT_BUCKET, --bucket COUNT_BUCKET 35 | Bucket size for counts query. Options are day, hour, 36 | minute (default is 'day'). 37 | -e END, --end-date END 38 | End of datetime window, format 'YYYY-mm-DDTHH:MM' 39 | (default: most recent activities) 40 | -f FILTER, --filter FILTER 41 | PowerTrack filter rule (See: http://support.gnip.com/c 42 | ustomer/portal/articles/901152-powertrack-operators) 43 | -l STREAM_URL, --stream-url STREAM_URL 44 | Url of search endpoint. (See your Gnip console.) 45 | -n MAX, --results-max MAX 46 | Maximum results to return per page (default 100; max 47 | 500) 48 | -N HARD_MAX, --hard-max HARD_MAX 49 | Maximum results to return for all pages; see -a option 50 | -p PASSWORD, --password PASSWORD 51 | Password 52 | -q, --query View API query (no data) 53 | -s START, --start-date START 54 | Start of datetime window, format 'YYYY-mm-DDTHH:MM' 55 | (default: 30 days ago) 56 | -u USER, --user-name USER 57 | User name 58 | -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH 59 | Create files in ./OUTPUT-FILE-PATH. This path must 60 | exists and will not be created. This options is 61 | available only with -a option. Default is no output 62 | files. 63 | -t, --search-v2 Using search API v2 endpoint. [This is depricated and 64 | is automatically set based on endpoint.] 65 | </pre> 66 | 67 | ## Using a configuration file 68 | 69 | To avoid entering the -u, -p and -l options for every command, create a configuration file named ".gnip" 70 | in the directory where you will run the code. When this file contains the correct parameters, you can omit 71 | these command line parameters. 72 | 73 | Use this template: 74 | 75 | # export GNIP_CONFIG_FILE=
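# (The rest of this template mirrors the example_config_file shipped with this
#  repository; fill in your own credentials and endpoint URL.)
[creds]
un =
pwd =

[endpoint]
# replace with your endpoint
url = https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json

[defaults]
# none

[tmp]
# none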
328 | usage: gnip_time_series.py [-h] [-b COUNT_BUCKET] [-e END] [-f FILTER] 329 | [-g SECOND_FILTER] [-l STREAM_URL] [-p PASSWORD] 330 | [-s START] [-u USER] [-t] [-w OUTPUT_FILE_PATH] 331 | 332 | GnipSearch timeline tools 333 | 334 | optional arguments: 335 | -h, --help show this help message and exit 336 | -b COUNT_BUCKET, --bucket COUNT_BUCKET 337 | Bucket size for counts query. Options are day, hour, 338 | minute (default is 'day'). 339 | -e END, --end-date END 340 | End of datetime window, format 'YYYY-mm-DDTHH:MM' 341 | (default: most recent activities) 342 | -f FILTER, --filter FILTER 343 | PowerTrack filter rule (See: http://support.gnip.com/c 344 | ustomer/portal/articles/901152-powertrack-operators) 345 | -g SECOND_FILTER, --second_filter SECOND_FILTER 346 | Use a second filter to show correlation plots of -f 347 | timeline vs -g timeline. 348 | -l STREAM_URL, --stream-url STREAM_URL 349 | Url of search endpoint. (See your Gnip console.) 350 | -p PASSWORD, --password PASSWORD 351 | Password 352 | -s START, --start-date START 353 | Start of datetime window, format 'YYYY-mm-DDTHH:MM' 354 | (default: 30 days ago) 355 | -u USER, --user-name USER 356 | User name 357 | -t, --get-topics Set flag to evaluate peak topics (this may take a few 358 | minutes) 359 | -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH 360 | Create files in ./OUTPUT-FILE-PATH. This path must 361 | exists and will not be created. 362 | </pre> 363 | 364 | #### Example Plots 365 | 366 | 367 | Example output from command: 368 | 369 | gnip_time_series.py -f "earthquake" -s2015-10-01T00:00:00 -e2015-11-18T00:00:00 -t -bhour 370 | 371 | ![Time line plot](img/earthquake_time_line.png) 372 | 373 | ![Cycle and trend plot](img/earthquake_cycle_trend_line.png) 374 | 375 | ![Time line with peaks plot](img/earthquake_time_peaks_line.png) 376 | 377 | #### Dependencies 378 | Gnip's Search 2.0 API access is required. 379 | 380 | In addition to the basic Gnip Search utility described immediately above, this package 381 | depends on a number of other large packages: 382 | 383 | * matplotlib 384 | * numpy 385 | * pandas 386 | * statsmodels 387 | * scipy 388 | 389 | #### Notes 390 | * You should create the path "plots" in the directory where you run the utility. This will contain the plots of 391 | time series and analysis. 392 | * This utility creates an extensive log file named time_series.log. It contains many details of parameter 393 | settings and intermediate outputs. 394 | * On a remote machine or server, change your matplotlib backend by creating a local matplotlibrc file. Create Gnip-Python-Search-API-Utilities/matplotlibrc: 395 | 396 | <pre>
397 | # Change the backend to Agg to avoid errors when matplotlib cannot display the plots 398 | # More information on creating and editing a matplotlibrc file at: http://matplotlib.org/users/customizing.html 399 | backend : Agg 400 |401 | 402 | ### Filter Analysis 403 | 404 | $ ./gnip_filter_analysis.py -h 405 |
406 | usage: gnip_filter_analysis.py [-h] [-j JOB_DESCRIPTION] [-b COUNT_BUCKET] 407 | [-l STREAM_URL] [-p PASSWORD] [-r RANK_SAMPLE] 408 | [-q] [-u USER] [-w OUTPUT_FILE_PATH] 409 | 410 | Creates an aggregated filter statistics summary from filter rules and date 411 | periods in the job description. 412 | 413 | optional arguments: 414 | -h, --help show this help message and exit 415 | -j JOB_DESCRIPTION, --job_description JOB_DESCRIPTION 416 | JSON formatted job description file 417 | -b COUNT_BUCKET, --bucket COUNT_BUCKET 418 | Bucket size for counts query. Options are day, hour, 419 | minute (default is 'day'). 420 | -l STREAM_URL, --stream-url STREAM_URL 421 | Url of search endpoint. (See your Gnip console.) 422 | -p PASSWORD, --password PASSWORD 423 | Password 424 | -r RANK_SAMPLE, --rank_sample RANK_SAMPLE 425 | Rank inclusive sampling depth. Default is None. This 426 | runs filter rule production for rank1, rank1 OR rank2, 427 | rank1 OR rank2 OR rank3, etc.to the depths specifed. 428 | -q, --query View API query (no data) 429 | -u USER, --user-name USER 430 | User name 431 | -w OUTPUT_FILE_PATH, --output-file-path OUTPUT_FILE_PATH 432 | Create files in ./OUTPUT-FILE-PATH. This path must 433 | exists and will not be created. Default is ./data 434 | 435 |436 | 437 | Example output to compare 7 rules across 2 time periods: 438 | 439 | job.json: 440 | 441 |
442 | { 443 | "date_ranges": [ 444 | { 445 | "end": "2015-06-01T00:00:00", 446 | "start": "2015-05-01T00:00:00" 447 | }, 448 | { 449 | "end": "2015-12-01T00:00:00", 450 | "start": "2015-11-01T00:00:00" 451 | } 452 | ], 453 | "rules": [ 454 | { 455 | "tag": "common pet", 456 | "value": "dog" 457 | }, 458 | { 459 | "tag": "common pet", 460 | "value": "cat" 461 | }, 462 | { 463 | "tag": "common pet", 464 | "value": "hamster" 465 | }, 466 | { 467 | "tag": "abstract pet", 468 | "value": "pet" 469 | }, 470 | { 471 | "tag": "pet owner destination", 472 | "value": "vet" 473 | }, 474 | { 475 | "tag": "pet owner destination", 476 | "value": "kennel" 477 | }, 478 | { 479 | "tag": "diminutives", 480 | "value": "puppy OR kitten" 481 | } 482 | ] 483 | } 484 | 485 |486 | 487 | Output: 488 | 489 |
490 | $ ./gnip_filter_analysis.py -r 3 491 | ... 492 | start_date 2015-05-01T00:00:00 2015-11-01T00:00:00 All 493 | filter 494 | All 42691589 46780243 89471832 495 | dog OR cat OR hamster OR pet OR vet OR kennel O... 20864710 22831053 43695763 496 | dog 8096637 9218028 17314665 497 | cat 8378681 8705244 17083925 498 | puppy OR kitten 2392041 2659051 5051092 499 | pet 2101044 2345140 4446184 500 | vet 620178 749802 1369980 501 | hamster 199634 226864 426498 502 | kennel 38664 45061 83725 503 | 504 | start_date 2015-05-01T00:00:00 2015-11-01T00:00:00 All 505 | filter 506 | All 63640524 69822220 133462744 507 | dog OR cat OR hamster OR pet OR vet OR kennel O... 20864710 22831053 43695763 508 | dog OR cat OR puppy OR kitten 18410402 20096764 38507166 509 | dog OR cat 16268900 17662083 33930983 510 | dog 8096512 9232320 17328832 511 | </pre> 512 | 513 | So for this rule set, the redundancy is 89471832/43695763 - 1 = 1.0476088722835666, and the 514 | 3-rule approximation for the corpus gives 38507166/43695763 = 0.8812562902265832, or 88% of 515 | the tweets of the full rule set (the short snippet at the end of this README reproduces this arithmetic). 516 | 517 | Additionally, CSV output of the raw counts and a CSV version of the pivot table are 518 | written to the specified data directory. 519 | 520 | #### Dependencies 521 | Gnip's Search 2.0 API access is required. 522 | 523 | In addition to the basic Gnip Search utility described immediately above, this package 524 | depends on a number of other large packages: 525 | 526 | * numpy 527 | * pandas 528 | 529 | #### Notes 530 | * Unlike other utilities provided, the default file path is set to "./data" to provide 531 | full access to output results. Therefore, you should create the path "data" in the directory 532 | where you run the utility. This will contain the data outputs. 533 | 534 | ## License 535 | Gnip-Python-Search-API-Utilities by Scott Hendrickson, Josh Montague and Jeff Kolb is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
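For reference, the redundancy and coverage figures quoted in the filter analysis example above can be reproduced with a few lines of Python. The constants are simply the "All"-column values copied from the example pivot tables; this snippet is illustrative only and is not part of the package.

<pre>
#!/usr/bin/env python
# Reproduce the redundancy/coverage arithmetic from the filter analysis example.
# The three constants are copied from the "All" column of the example pivot tables.

grand_total  = 89471832.   # "All"/"All" cell of the first pivot table
all_rules_or = 43695763.   # the OR of all seven rules (each tweet counted once)
top_three_or = 38507166.   # "dog OR cat OR puppy OR kitten" row of the ranked table

redundancy = grand_total / all_rules_or - 1.   # 1.0476... as quoted above
coverage   = top_three_or / all_rules_or       # 0.8812..., i.e. ~88% of the full rule set

print("redundancy      = {:.4f}".format(redundancy))
print("3-rule coverage = {:.1%}".format(coverage))
</pre>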
536 | -------------------------------------------------------------------------------- /example_config_file: -------------------------------------------------------------------------------- 1 | # user your credentials and end point url to configure command line access 2 | # to work without the command line options 3 | # Either (1) rename this file .gnip in this directlry or, to run from anywhere 4 | # export GNIP_CONFIG_FILE=5 | # 6 | [creds] 7 | un = 8 | pwd = 9 | 10 | [endpoint] 11 | # replace with your endpoint 12 | url = https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json 13 | 14 | [defaults] 15 | # none 16 | 17 | [tmp] 18 | # none 19 | -------------------------------------------------------------------------------- /gnip_filter_analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | __author__="Scott Hendrickson, Josh Montague" 4 | 5 | import sys 6 | import json 7 | import codecs 8 | import argparse 9 | import datetime 10 | import time 11 | import numbers 12 | import os 13 | import ConfigParser 14 | import logging 15 | try: 16 | from cStringIO import StringIO 17 | except: 18 | from StringIO import StringIO 19 | 20 | import pandas as pd 21 | import numpy as np 22 | 23 | from search.results import * 24 | 25 | reload(sys) 26 | sys.stdout = codecs.getwriter('utf-8')(sys.stdout) 27 | sys.stdin = codecs.getreader('utf-8')(sys.stdin) 28 | 29 | DEFAULT_CONFIG_FILENAME = "./.gnip" 30 | LOG_FILE_PATH = os.path.join(".","filter_analysis.log") 31 | 32 | # set up simple logging 33 | logging.basicConfig(filename=LOG_FILE_PATH,level=logging.DEBUG) 34 | logging.info("#"*70) 35 | logging.info("################# started {} #################".format(datetime.datetime.now())) 36 | 37 | class GnipSearchCMD(): 38 | 39 | def __init__(self, token_list_size=20): 40 | # default tokenizer and character limit 41 | char_upper_cutoff = 20 # longer than for normal words because of user names 42 | self.token_list_size = int(token_list_size) 43 | ############################################# 44 | # CONFIG FILE/COMMAND LINE OPTIONS PATTERN 45 | # parse config file 46 | config_from_file = self.config_file() 47 | # set required fields to None. 
Sequence of setting is: 48 | # (1) config file 49 | # (2) command line 50 | # if still none, then fail 51 | self.user = None 52 | self.password = None 53 | self.stream_url = None 54 | if config_from_file is not None: 55 | try: 56 | # command line options take presidence if they exist 57 | self.user = config_from_file.get('creds', 'un') 58 | self.password = config_from_file.get('creds', 'pwd') 59 | self.stream_url = config_from_file.get('endpoint', 'url') 60 | except (ConfigParser.NoOptionError, 61 | ConfigParser.NoSectionError) as e: 62 | logging.debug(u"Error reading configuration file ({}), ignoring configuration file.".format(e)) 63 | # parse the command line options 64 | self.options = self.args().parse_args() 65 | # set up the job 66 | # over ride config file with command line args if present 67 | if self.options.user is not None: 68 | self.user = self.options.user 69 | if self.options.password is not None: 70 | self.password = self.options.password 71 | if self.options.stream_url is not None: 72 | self.stream_url = self.options.stream_url 73 | # 74 | # Search v2 uses a different url 75 | if "data-api.twitter.com" in self.stream_url: 76 | self.options.search_v2 = True 77 | else: 78 | logging.debug(u"Requires search v2, but your URL appears to point to a v1 endpoint. Exiting.") 79 | print >> sys.stderr, "Requires search v2, but your URL appears to point to a v1 endpoint. Exiting." 80 | sys.exit(-1) 81 | # defaults 82 | self.options.paged = True 83 | self.options.max = 500 84 | # 85 | # check paths 86 | if self.options.output_file_path is not None: 87 | if not os.path.exists(self.options.output_file_path): 88 | logging.debug(u"Path {} doesn't exist. Please create it and try again. Exiting.".format( 89 | self.options.output_file_path)) 90 | sys.stderr.write("Path {} doesn't exist. Please create it and try again. Exiting.\n".format( 91 | self.options.output_file_path)) 92 | sys.exit(-1) 93 | # 94 | # log the attributes of this class including all of the options 95 | for v in dir(self): 96 | # except don't log the password! 97 | if not v.startswith('__') and not callable(getattr(self,v)) and not v.lower().startswith('password'): 98 | tmp = str(getattr(self,v)) 99 | tmp = re.sub("password=.*,", "password=XXXXXXX,", tmp) 100 | logging.debug(u" {}={}".format(v, tmp)) 101 | # 102 | self.job = self.read_job_description(self.options.job_description) 103 | 104 | def config_file(self): 105 | config = ConfigParser.ConfigParser() 106 | # (1) default file name precidence 107 | config.read(DEFAULT_CONFIG_FILENAME) 108 | if not config.has_section("creds"): 109 | # (2) environment variable file name second 110 | if 'GNIP_CONFIG_FILE' in os.environ: 111 | config_filename = os.environ['GNIP_CONFIG_FILE'] 112 | config.read(config_filename) 113 | if config.has_section("creds") and config.has_section("endpoint"): 114 | return config 115 | else: 116 | return None 117 | 118 | def args(self): 119 | twitter_parser = argparse.ArgumentParser( 120 | description="Creates an aggregated filter statistics summary from \ 121 | filter rules and date periods in the job description.") 122 | twitter_parser.add_argument("-j", "--job_description", dest="job_description", 123 | default="./job.json", 124 | help="JSON formatted job description file") 125 | twitter_parser.add_argument("-b", "--bucket", dest="count_bucket", 126 | default="day", 127 | help="Bucket size for counts query. 
Options are day, hour, \ 128 | minute (default is 'day').") 129 | twitter_parser.add_argument("-l", "--stream-url", dest="stream_url", 130 | default=None, 131 | help="Url of search endpoint. (See your Gnip console.)") 132 | twitter_parser.add_argument("-p", "--password", dest="password", default=None, 133 | help="Password") 134 | twitter_parser.add_argument("-r", "--rank_sample", dest="rank_sample" 135 | , default=None 136 | , help="Rank inclusive sampling depth. Default is None. This runs filter rule \ 137 | production for rank1, rank1 OR rank2, rank1 OR rank2 OR rank3, etc.to \ 138 | the depths specifed.") 139 | twitter_parser.add_argument("-m", "--rank_negation_sample", dest="rank_negation_sample" 140 | , default=False 141 | , action="store_true" 142 | , help="Like rank inclusive sampling, but rules of higher ranks are negated \ 143 | on successive retrievals. Uses rank_sample setting.") 144 | twitter_parser.add_argument("-n", "--negation_rules", dest="negation_rules" 145 | , default=False 146 | , action="store_true" 147 | , help="Apply entire negation rules list to all queries") 148 | twitter_parser.add_argument("-q", "--query", dest="query", action="store_true", 149 | default=False, help="View API query (no data)") 150 | twitter_parser.add_argument("-u", "--user-name", dest="user", default=None, 151 | help="User name") 152 | twitter_parser.add_argument("-w", "--output-file-path", dest="output_file_path", 153 | default="./data", 154 | help="Create files in ./OUTPUT-FILE-PATH. This path must exists and will \ 155 | not be created. Default is ./data") 156 | 157 | return twitter_parser 158 | 159 | def read_job_description(self, job_description): 160 | with codecs.open(job_description, "rb", "utf-8") as f: 161 | self.job_description = json.load(f) 162 | if not all([x in self.job_description for x in ("rules", "date_ranges")]): 163 | print >>sys.stderr, '"rules" or "date_ranges" missing from you job description file. Exiting.\n' 164 | logging.error('"rules" or "date_ranges" missing from you job description file. 
Exiting') 165 | sys.exit(-1) 166 | 167 | def get_date_ranges_for_rule(self, rule, base_rule, tag=None): 168 | res = [] 169 | for dates_dict in self.job_description["date_ranges"]: 170 | start_date = dates_dict["start"] 171 | end_date = dates_dict["end"] 172 | logging.debug(u"getting date range for {} through {}".format(start_date, end_date)) 173 | results = Results( 174 | self.user 175 | , self.password 176 | , self.stream_url 177 | , self.options.paged 178 | , self.options.output_file_path 179 | , pt_filter=rule 180 | , max_results=int(self.options.max) 181 | , start=start_date 182 | , end=end_date 183 | , count_bucket=self.options.count_bucket 184 | , show_query=self.options.query 185 | , search_v2=self.options.search_v2 186 | ) 187 | for x in results.get_time_series(): 188 | res.append(x + [rule, tag, start_date, end_date, base_rule]) 189 | return res 190 | 191 | def get_pivot_table(self, res): 192 | df = pd.DataFrame(res 193 | , columns=("bucket_datetag" 194 | ,"counts" 195 | ,"bucket_datetime" 196 | ,"filter" 197 | ,"filter_tag" 198 | ,"start_date" 199 | ,"end_date" 200 | ,"base_rule")) 201 | pdf = pd.pivot_table(df 202 | , values="counts" 203 | , index=["filter", "base_rule"] 204 | , columns = ["start_date"] 205 | , margins = True 206 | , aggfunc=np.sum) 207 | pdf.sort_values("All" 208 | , inplace=True 209 | , ascending=False) 210 | logging.debug(u"pivot tables calculated with shape(df)={} and shape(pdf)={}".format(df.shape, pdf.shape)) 211 | return df, pdf 212 | 213 | def write_output_files(self, df, pdf, pre=""): 214 | if pre != "": 215 | pre += "_" 216 | logging.debug(u"Writing raw and pivot data to {}...".format(self.options.output_file_path)) 217 | with open("{}/{}_{}raw_data.csv".format( 218 | self.options.output_file_path 219 | , datetime.datetime.now().strftime("%Y%m%d_%H%M") 220 | , pre) 221 | , "wb") as f: 222 | f.write(df.to_csv(encoding='utf-8')) 223 | with open("{}/{}_{}pivot_data.csv".format( 224 | self.options.output_file_path 225 | , datetime.datetime.now().strftime("%Y%m%d_%H%M") 226 | , pre) 227 | , "wb") as f: 228 | f.write(pdf.to_csv(encoding='utf-8')) 229 | 230 | def get_result(self): 231 | if self.options.negation_rules and self.job_description["negation_rules"] is not None: 232 | negation_rules = [x["value"] for x in self.job_description["negation_rules"]] 233 | negation_clause = " -(" + " OR ".join(negation_rules) + ")" 234 | else: 235 | negation_clause = "" 236 | all_rules = [] 237 | res = [] 238 | for rule_dict in self.job_description["rules"]: 239 | # in the case that rule is compound, ensure grouping 240 | rule = u"(" + rule_dict["value"] + u")" + negation_clause 241 | logging.debug(u"rule str={}".format(rule)) 242 | all_rules.append(rule_dict["value"]) 243 | tag = None 244 | if "tag" in rule_dict: 245 | tag = rule_dict["tag"] 246 | res.extend(self.get_date_ranges_for_rule( 247 | rule 248 | , rule_dict["value"] 249 | , tag=tag 250 | )) 251 | # All rules 252 | all_rules_res = [] 253 | sub_all_rules = [] 254 | filter_str_last = u"(" + u" OR ".join(sub_all_rules) + u")" 255 | for rule in all_rules: 256 | # try adding one more rule 257 | sub_all_rules.append(rule) 258 | filter_str = u"(" + u" OR ".join(sub_all_rules) + u")" 259 | if len(filter_str + negation_clause) > 2048: 260 | # back up one rule if the length is too too long 261 | filter_str = filter_str_last 262 | logging.debug(u"All rules str={}".format(filter_str + negation_clause)) 263 | all_rules_res = self.get_date_ranges_for_rule( 264 | filter_str + negation_clause 265 | , filter_str 266 | , tag=None 
267 | ) 268 | # start a new sublist 269 | sub_all_rules = [rule] 270 | filter_str = u"(" + u" OR ".join(sub_all_rules) + u")" 271 | filter_str_last = filter_str 272 | res.extend(all_rules_res) 273 | df, pdf = self.get_pivot_table(res) 274 | if self.options.output_file_path is not None: 275 | self.write_output_files(df, pdf) 276 | # rank inclusive results 277 | rdf, rpdf = None, None 278 | if self.options.rank_sample is not None: 279 | # because margin = True, we have an "all" row at the top 280 | # the second row will be the all_rules results, skip these too 281 | # therefore, start at the third row 282 | rank_list = [x[1] for x in pdf.index.values[2:2+int(self.options.rank_sample)]] 283 | res = all_rules_res 284 | for i in range(int(self.options.rank_sample)): 285 | if self.options.rank_negation_sample: 286 | filter_str = "((" + u") -(".join(rank_list[i+1::-1]) + "))" 287 | else: 288 | filter_str = "((" + u") OR (".join(rank_list[:i+1]) + "))" 289 | logging.debug(u"rank rules str={}".format(filter_str + negation_clause)) 290 | res.extend(self.get_date_ranges_for_rule( 291 | filter_str + negation_clause 292 | , filter_str 293 | , tag=None 294 | )) 295 | rdf, rpdf = self.get_pivot_table(res) 296 | if self.options.output_file_path is not None: 297 | self.write_output_files(rdf, rpdf, pre="ranked") 298 | return df, pdf, rdf, rpdf 299 | 300 | if __name__ == "__main__": 301 | g = GnipSearchCMD() 302 | df, pdf, rdf, rpdf = g.get_result() 303 | sys.stdout.write(pdf.to_string()) 304 | print 305 | print 306 | if rpdf is not None: 307 | sys.stdout.write(rpdf.to_string()) 308 | print 309 | -------------------------------------------------------------------------------- /gnip_search.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | __author__="Scott Hendrickson, Jeff Kolb, Josh Montague" 4 | 5 | import sys 6 | import json 7 | import codecs 8 | import argparse 9 | import datetime 10 | import time 11 | import os 12 | 13 | if sys.version_info.major == 2: 14 | import ConfigParser as configparser 15 | else: 16 | import configparser 17 | 18 | from search.results import * 19 | 20 | if (sys.version_info[0]) < 3: 21 | try: 22 | reload(sys) 23 | sys.stdout = codecs.getwriter('utf-8')(sys.stdout) 24 | sys.stdin = codecs.getreader('utf-8')(sys.stdin) 25 | except NameError: 26 | pass 27 | 28 | DEFAULT_CONFIG_FILENAME = "./.gnip" 29 | 30 | class GnipSearchCMD(): 31 | 32 | USE_CASES = ["json", "wordcount","users", "rate", "links", "timeline", "geo", "audience"] 33 | 34 | def __init__(self, token_list_size=40): 35 | # default tokenizer and character limit 36 | char_upper_cutoff = 20 # longer than for normal words because of user names 37 | self.token_list_size = int(token_list_size) 38 | ############################################# 39 | # CONFIG FILE/COMMAND LINE OPTIONS PATTERN 40 | # parse config file 41 | config_from_file = self.config_file() 42 | # set required fields to None. 
Sequence of setting is: 43 | # (1) config file 44 | # (2) command line 45 | # if still none, then fail 46 | self.user = None 47 | self.password = None 48 | self.stream_url = None 49 | if config_from_file is not None: 50 | try: 51 | # command line options take presidence if they exist 52 | self.user = config_from_file.get('creds', 'un') 53 | self.password = config_from_file.get('creds', 'pwd') 54 | self.stream_url = config_from_file.get('endpoint', 'url') 55 | except (configparser.NoOptionError, 56 | configparser.NoSectionError) as e: 57 | sys.stderr.write("Error reading configuration file ({}), ignoring configuration file.".format(e)) 58 | # parse the command line options 59 | self.options = self.args().parse_args() 60 | if int(sys.version_info[0]) < 3: 61 | self.options.filter = self.options.filter.decode("utf-8") 62 | # set up the job 63 | # over ride config file with command line args if present 64 | if self.options.user is not None: 65 | self.user = self.options.user 66 | if self.options.password is not None: 67 | self.password = self.options.password 68 | if self.options.stream_url is not None: 69 | self.stream_url = self.options.stream_url 70 | 71 | # exit if the config file isn't set 72 | if (self.stream_url is None) or (self.user is None) or (self.password is None): 73 | sys.stderr.write("Something is wrong with your configuration. It's possible that the we can't find your config file.") 74 | sys.exit(-1) 75 | 76 | # Gnacs is not yet upgraded to python3, so don't allow CSV output option (which uses Gnacs) if python3 77 | if self.options.csv_flag and sys.version_info.major == 3: 78 | raise ValueError("CSV option not yet available for Python3") 79 | 80 | def config_file(self): 81 | config = configparser.ConfigParser() 82 | # (1) default file name precidence 83 | config.read(DEFAULT_CONFIG_FILENAME) 84 | if not config.has_section("creds"): 85 | # (2) environment variable file name second 86 | if 'GNIP_CONFIG_FILE' in os.environ: 87 | config_filename = os.environ['GNIP_CONFIG_FILE'] 88 | config.read(config_filename) 89 | if config.has_section("creds") and config.has_section("endpoint"): 90 | return config 91 | else: 92 | return None 93 | 94 | def args(self): 95 | twitter_parser = argparse.ArgumentParser( 96 | description="GnipSearch supports the following use cases: %s"%str(self.USE_CASES)) 97 | twitter_parser.add_argument("use_case", metavar= "USE_CASE", choices=self.USE_CASES, 98 | help="Use case for this search.") 99 | twitter_parser.add_argument("-a", "--paged", dest="paged", action="store_true", 100 | default=False, help="Paged access to ALL available results (Warning: this makes many requests)") 101 | twitter_parser.add_argument("-c", "--csv", dest="csv_flag", action="store_true", 102 | default=False, 103 | help="Return comma-separated 'date,counts' or geo data.") 104 | twitter_parser.add_argument("-b", "--bucket", dest="count_bucket", 105 | default="day", 106 | help="Bucket size for counts query. Options are day, hour, minute (default is 'day').") 107 | twitter_parser.add_argument("-e", "--end-date", dest="end", 108 | default=None, 109 | help="End of datetime window, format 'YYYY-mm-DDTHH:MM' (default: most recent activities)") 110 | twitter_parser.add_argument("-f", "--filter", dest="filter", default="from:jrmontag OR from:gnip", 111 | help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)") 112 | twitter_parser.add_argument("-l", "--stream-url", dest="stream_url", 113 | default=None, 114 | help="Url of search endpoint. 
(See your Gnip console.)") 115 | twitter_parser.add_argument("-n", "--results-max", dest="max", default=100, 116 | help="Maximum results to return per page (default 100; max 500)") 117 | twitter_parser.add_argument("-N", "--hard-max", dest="hard_max", default=None, type=int, 118 | help="Maximum results to return for all pages; see -a option") 119 | twitter_parser.add_argument("-p", "--password", dest="password", default=None, 120 | help="Password") 121 | twitter_parser.add_argument("-q", "--query", dest="query", action="store_true", 122 | default=False, help="View API query (no data)") 123 | twitter_parser.add_argument("-s", "--start-date", dest="start", 124 | default=None, 125 | help="Start of datetime window, format 'YYYY-mm-DDTHH:MM' (default: 30 days ago)") 126 | twitter_parser.add_argument("-u", "--user-name", dest="user", default=None, 127 | help="User name") 128 | twitter_parser.add_argument("-w", "--output-file-path", dest="output_file_path", default=None, 129 | help="Create files in ./OUTPUT-FILE-PATH. This path must exists and will not be created. This options is available only with -a option. Default is no output files.") 130 | # depricated... leave in for compatibility 131 | twitter_parser.add_argument("-t", "--search-v2", dest="search_v2", action="store_true", 132 | default=False, 133 | help="Using search API v2 endpoint. [This is depricated and is automatically set based on endpoint.]") 134 | return twitter_parser 135 | 136 | def get_result(self): 137 | WIDTH = 80 138 | BIG_COLUMN = 32 139 | res = [u"-"*WIDTH] 140 | if self.options.use_case.startswith("time"): 141 | self.results = Results( 142 | self.user 143 | , self.password 144 | , self.stream_url 145 | , self.options.paged 146 | , self.options.output_file_path 147 | , pt_filter=self.options.filter 148 | , max_results=int(self.options.max) 149 | , start=self.options.start 150 | , end=self.options.end 151 | , count_bucket=self.options.count_bucket 152 | , show_query=self.options.query 153 | , hard_max=self.options.hard_max 154 | ) 155 | res = [] 156 | if self.options.csv_flag: 157 | for x in self.results.get_time_series(): 158 | res.append("{:%Y-%m-%dT%H:%M:%S},{},{}".format(x[2], x[0], x[1])) 159 | else: 160 | res = [x for x in self.results.get_activities()] 161 | return '{"results":' + json.dumps(res) + "}" 162 | 163 | else: 164 | self.results = Results( 165 | self.user 166 | , self.password 167 | , self.stream_url 168 | , self.options.paged 169 | , self.options.output_file_path 170 | , pt_filter=self.options.filter 171 | , max_results=int(self.options.max) 172 | , start=self.options.start 173 | , end=self.options.end 174 | , count_bucket=None 175 | , show_query=self.options.query 176 | , hard_max=self.options.hard_max 177 | ) 178 | if self.options.use_case.startswith("rate"): 179 | rate = self.results.query.get_rate() 180 | unit = "Tweets/Minute" 181 | if rate < 0.01: 182 | rate *= 60. 
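# rates below 0.01 Tweets/Minute are rescaled and reported as Tweets/Hour for readability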
183 | unit = "Tweets/Hour" 184 | res.append(" PowerTrack Rule: \"%s\""%self.options.filter) 185 | res.append(" Oldest Tweet (UTC): %s"%str(self.results.query.oldest_t)) 186 | res.append(" Newest Tweet (UTC): %s"%str(self.results.query.newest_t)) 187 | res.append(" Now (UTC): %s"%str(datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"))) 188 | res.append(" %5d Tweets: %6.3f %s"%(len(self.results), rate, unit)) 189 | res.append("-"*WIDTH) 190 | elif self.options.use_case.startswith("geo"): 191 | res = [] 192 | for x in self.results.get_geo(): 193 | if self.options.csv_flag: 194 | try: 195 | res.append("{},{},{},{}".format(x["id"], x["postedTime"], x["longitude"], x["latitude"])) 196 | except KeyError as e: 197 | print >> sys.stderr, str(e) 198 | else: 199 | res.append(json.dumps(x)) 200 | elif self.options.use_case.startswith("json"): 201 | res = [json.dumps(x) for x in self.results.get_activities()] 202 | if self.options.csv_flag: 203 | res = ["|".join(x) for x in self.results.query.get_list_set()] 204 | elif self.options.use_case.startswith("word"): 205 | fmt_str = u"%{}s -- %10s %8s ".format(BIG_COLUMN) 206 | res.append(fmt_str%( "terms", "mentions", "activities")) 207 | res.append("-"*WIDTH) 208 | fmt_str = u"%{}s -- %4d %5.2f%% %4d %5.2f%%".format(BIG_COLUMN) 209 | for x in self.results.get_top_grams(n=self.token_list_size): 210 | res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.)) 211 | res.append(" TOTAL: %d activities"%len(self.results)) 212 | res.append("-"*WIDTH) 213 | elif self.options.use_case.startswith("user"): 214 | fmt_str = u"%{}s -- %10s %8s ".format(BIG_COLUMN) 215 | res.append(fmt_str%( "terms", "mentions", "activities")) 216 | res.append("-"*WIDTH) 217 | fmt_str = u"%{}s -- %4d %5.2f%% %4d %5.2f%%".format(BIG_COLUMN) 218 | for x in self.results.get_top_users(n=self.token_list_size): 219 | res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.)) 220 | res.append(" TOTAL: %d activities"%len(self.results)) 221 | res.append("-"*WIDTH) 222 | elif self.options.use_case.startswith("link"): 223 | res[-1]+=u"-"*WIDTH 224 | res.append(u"%100s -- %10s %8s (%d)"%("links", "mentions", "activities", len(self.results))) 225 | res.append("-"*2*WIDTH) 226 | for x in self.results.get_top_links(n=self.token_list_size): 227 | res.append(u"%100s -- %4d %5.2f%% %4d %5.2f%%"%(x[4], x[0], x[1]*100., x[2], x[3]*100.)) 228 | res.append("-"*WIDTH) 229 | elif self.options.use_case.startswith("audie"): 230 | for x in self.results.get_users(): 231 | res.append(u"{}".format(x)) 232 | res.append("-"*WIDTH) 233 | return u"\n".join(res) 234 | 235 | if __name__ == "__main__": 236 | g = GnipSearchCMD() 237 | print(g.get_result()) 238 | -------------------------------------------------------------------------------- /gnip_time_series.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | ####################################################### 4 | # This script wraps simple timeseries analysis tools 5 | # and access to the Gnip Search API into a simple tool 6 | # to help the analysis quickly iterate on filters 7 | # a and understand time series trend and events. 
8 | # 9 | # If you find this useful or find a bug you don't want 10 | # to fix for yourself, please let me know at @drskippy 11 | ####################################################### 12 | __author__="Scott Hendrickson" 13 | 14 | # other imports 15 | import sys 16 | import argparse 17 | import calendar 18 | import codecs 19 | import csv 20 | import datetime 21 | import json 22 | import logging 23 | import matplotlib 24 | import matplotlib.pyplot as plt 25 | import numpy as np 26 | import os 27 | import pandas as pd 28 | import re 29 | import statsmodels.api as sm 30 | import string 31 | import time 32 | from functools import partial 33 | from operator import itemgetter 34 | from scipy import signal 35 | from search.results import * 36 | 37 | # fixes an annoying warning that scipy is throwing 38 | import warnings 39 | warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd driver lwork query error") 40 | 41 | # handle Python 3 specific imports 42 | if sys.version_info[0] == 2: 43 | import ConfigParser 44 | elif sys.version_info[0] == 3: 45 | import configparser as ConfigParser 46 | #from imp import reload 47 | 48 | # Python 2 specific setup (Py3 the utf-8 stuff is handled) 49 | if sys.version_info[0] == 2: 50 | reload(sys) 51 | sys.stdin = codecs.getreader('utf-8')(sys.stdin) 52 | sys.stdout = codecs.getwriter('utf-8')(sys.stdout) 53 | 54 | # basic defaults 55 | FROM_PICKLE = False 56 | DEFAULT_CONFIG_FILENAME = os.path.join(".",".gnip") 57 | DATE_FMT = "%Y%m%d%H%M" 58 | DATE_FMT2 = "%Y-%m-%dT%H:%M:%S" 59 | LOG_FILE_PATH = os.path.join(".","time_series.log") 60 | 61 | # set up simple logging 62 | logging.basicConfig(filename=LOG_FILE_PATH,level=logging.DEBUG) 63 | logging.info("#"*70) 64 | logging.info("################# started {} #################".format(datetime.datetime.now())) 65 | 66 | # tunable defaults 67 | CHAR_UPPER_CUTOFF = 20 # don't include tokens longer than CHAR_UPPER_CUTOFF 68 | TWEET_SAMPLE = 4000 # tweets to collect for peak topics 69 | MIN_SNR = 2.0 # signal to noise threshold for peak detection 70 | MAX_N_PEAKS = 7 # maximum number of peaks to output 71 | MAX_PEAK_WIDTH = 20 # max peak width in periods 72 | MIN_PEAK_WIDTH = 1 # min peak width in periods 73 | SEARCH_PEAK_WIDTH = 3 # min peak width in periods 74 | N_MOVING = 4 # average over buckets 75 | OUTLIER_FRAC = 0.8 # cut off values over 80% above or below the average 76 | PLOTS_PREFIX = os.path.join(".","plots") 77 | PLOT_DELTA_Y = 1.2 # spacing of y values in dotplot 78 | 79 | logging.debug("CHAR_UPPER_CUTOFF={},TWEET_SAMPLE={},MIN_SNR={},MAX_N_PEAKS={},MAX_PEAK_WIDTH={},MIN_PEAK_WIDTH={},SEARCH_PEAK_WIDTH={},N_MOVING={},OUTLIER_FRAC={},PLOTS_PREFIX={},PLOT_DELTA_Y={}".format( 80 | CHAR_UPPER_CUTOFF 81 | , TWEET_SAMPLE 82 | , MIN_SNR 83 | , MAX_N_PEAKS 84 | , MAX_PEAK_WIDTH 85 | , MIN_PEAK_WIDTH 86 | , SEARCH_PEAK_WIDTH 87 | , N_MOVING 88 | , OUTLIER_FRAC 89 | , PLOTS_PREFIX 90 | , PLOT_DELTA_Y )) 91 | 92 | class TimeSeries(): 93 | """Containter class for data collected from the API and associated analysis outputs""" 94 | pass 95 | 96 | class GnipSearchTimeseries(): 97 | 98 | def __init__(self, token_list_size=40): 99 | """Retrieve and analysis timesseries and associated interesting trends, spikes and tweet content.""" 100 | # default tokenizer and character limit 101 | char_upper_cutoff = CHAR_UPPER_CUTOFF 102 | self.token_list_size = int(token_list_size) 103 | ############################################# 104 | # CONFIG FILE/COMMAND LINE OPTIONS PATTERN 105 | # parse config file 106 
| config_from_file = self.config_file() 107 | # set required fields to None. Sequence of setting is: 108 | # (1) config file 109 | # (2) command line 110 | # if still none, then fail 111 | self.user = None 112 | self.password = None 113 | self.stream_url = None 114 | if config_from_file is not None: 115 | try: 116 | # command line options take presidence if they exist 117 | self.user = config_from_file.get('creds', 'un') 118 | self.password = config_from_file.get('creds', 'pwd') 119 | self.stream_url = config_from_file.get('endpoint', 'url') 120 | except (ConfigParser.NoOptionError, 121 | ConfigParser.NoSectionError) as e: 122 | logging.warn("Error reading configuration file ({}), ignoring configuration file.".format(e)) 123 | # parse the command line options 124 | self.options = self.args().parse_args() 125 | # decode step should not be included for python 3 126 | if sys.version_info[0] == 2: 127 | self.options.filter = self.options.filter.decode("utf-8") 128 | self.options.second_filter = self.options.second_filter.decode("utf-8") 129 | # set up the job 130 | # over ride config file with command line args if present 131 | if self.options.user is not None: 132 | self.user = self.options.user 133 | if self.options.password is not None: 134 | self.password = self.options.password 135 | if self.options.stream_url is not None: 136 | self.stream_url = self.options.stream_url 137 | 138 | # search v2 uses a different url 139 | if "gnip-api.twitter.com" not in self.stream_url: 140 | logging.error("gnipSearch timeline tools require Search V2. Exiting.") 141 | logging.error("Your URL should look like: https://gnip-api.twitter.com/search/fullarchive/accounts/ /dev.json") 142 | sys.stderr.write("gnipSearch timeline tools require Search V2. Exiting.\n") 143 | sys.stderr.write("Your URL should look like: https://gnip-api.twitter.com/search/fullarchive/accounts/ /dev.json") 144 | sys.exit(-1) 145 | 146 | # set some options that should not be changed for this anaysis 147 | self.options.paged = True 148 | self.options.search_v2 = True 149 | self.options.max = 500 150 | self.options.query = False 151 | 152 | # check paths 153 | if self.options.output_file_path is not None: 154 | if not os.path.exists(self.options.output_file_path): 155 | logging.error("Path {} doesn't exist. Please create it and try again. Exiting.".format( 156 | self.options.output_file_path)) 157 | sys.stderr.write("Path {} doesn't exist. Please create it and try again. Exiting.\n".format( 158 | self.options.output_file_path)) 159 | sys.exit(-1) 160 | 161 | if not os.path.exists(PLOTS_PREFIX): 162 | logging.error("Path {} doesn't exist. Please create it and try again. Exiting.".format( 163 | PLOTS_PREFIX)) 164 | sys.stderr.write("Path {} doesn't exist. Please create it and try again. Exiting.\n".format( 165 | PLOTS_PREFIX)) 166 | sys.exit(-1) 167 | 168 | # log the attributes of this class including all of the options 169 | for v in dir(self): 170 | # except don't log the password! 
171 | if not v.startswith('__') and not callable(getattr(self,v)) and not v.lower().startswith('password'): 172 | tmp = str(getattr(self,v)) 173 | tmp = re.sub("password=.*,", "password=XXXXXXX,", tmp) 174 | logging.debug(" {}={}".format(v, tmp)) 175 | 176 | def config_file(self): 177 | """Search for a valid config file in the standard locations.""" 178 | config = ConfigParser.ConfigParser() 179 | # (1) default file name precidence 180 | config.read(DEFAULT_CONFIG_FILENAME) 181 | logging.info("attempting to read config file {}".format(DEFAULT_CONFIG_FILENAME)) 182 | if not config.has_section("creds"): 183 | # (2) environment variable file name second 184 | if 'GNIP_CONFIG_FILE' in os.environ: 185 | config_filename = os.environ['GNIP_CONFIG_FILE'] 186 | logging.info("attempting to read config file {}".format(config_filename)) 187 | config.read(config_filename) 188 | if config.has_section("creds") and config.has_section("endpoint"): 189 | return config 190 | else: 191 | logging.warn("no creds or endpoint section found in config file, attempting to proceed without config info from file") 192 | return None 193 | 194 | def args(self): 195 | "Set up the command line argments and the associated help strings.""" 196 | twitter_parser = argparse.ArgumentParser( 197 | description="GnipSearch timeline tools") 198 | twitter_parser.add_argument("-b", "--bucket", dest="count_bucket", 199 | default="day", 200 | help="Bucket size for counts query. Options are day, hour, minute (default is 'day').") 201 | twitter_parser.add_argument("-e", "--end-date", dest="end", 202 | default=None, 203 | help="End of datetime window, format 'YYYY-mm-DDTHH:MM' (default: most recent activities)") 204 | twitter_parser.add_argument("-f", "--filter", dest="filter", 205 | default="from:jrmontag OR from:gnip", 206 | help="PowerTrack filter rule (See: http://support.gnip.com/customer/portal/articles/901152-powertrack-operators)") 207 | twitter_parser.add_argument("-g", "--second_filter", dest="second_filter", 208 | default=None, 209 | help="Use a second filter to show correlation plots of -f timeline vs -g timeline.") 210 | twitter_parser.add_argument("-l", "--stream-url", dest="stream_url", 211 | default=None, 212 | help="Url of search endpoint. (See your Gnip console.)") 213 | twitter_parser.add_argument("-p", "--password", dest="password", default=None, 214 | help="Password") 215 | twitter_parser.add_argument("-s", "--start-date", dest="start", 216 | default=None, 217 | help="Start of datetime window, format 'YYYY-mm-DDTHH:MM' (default: 30 days ago)") 218 | twitter_parser.add_argument("-u", "--user-name", dest="user", 219 | default=None, 220 | help="User name") 221 | twitter_parser.add_argument("-t", "--get-topics", dest="get_topics", action="store_true", 222 | default=False, 223 | help="Set flag to evaluate peak topics (this may take a few minutes)") 224 | twitter_parser.add_argument("-w", "--output-file-path", dest="output_file_path", 225 | default=None, 226 | help="Create files in ./OUTPUT-FILE-PATH. This path must exists and will not be created. This options is available only with -a option. Default is no output files.") 227 | return twitter_parser 228 | 229 | def get_results(self): 230 | """Execute API calls to the timeseries data and tweet data we need for analysis. 
Perform analysis 231 | as we go because we often need results for next steps.""" 232 | ###################### 233 | # (1) Get the timeline 234 | ###################### 235 | logging.info("retrieving timeline counts") 236 | results_timeseries = Results( self.user 237 | , self.password 238 | , self.stream_url 239 | , self.options.paged 240 | , self.options.output_file_path 241 | , pt_filter=self.options.filter 242 | , max_results=int(self.options.max) 243 | , start=self.options.start 244 | , end=self.options.end 245 | , count_bucket=self.options.count_bucket 246 | , show_query=self.options.query 247 | ) 248 | # sort by date 249 | res_timeseries = sorted(results_timeseries.get_time_series(), key = itemgetter(0)) 250 | # if we only have one activity, probably don't do all of this 251 | if len(res_timeseries) <= 1: 252 | raise ValueError("You've only pulled {} Tweets. time series analysis isn't what you want.".format(len(res_timeseries))) 253 | # calculate total time interval span 254 | time_min_date = min(res_timeseries, key = itemgetter(2))[2] 255 | time_max_date = max(res_timeseries, key = itemgetter(2))[2] 256 | time_min = float(calendar.timegm(time_min_date.timetuple())) 257 | time_max = float(calendar.timegm(time_max_date.timetuple())) 258 | time_span = time_max - time_min 259 | logging.debug("time_min = {}, time_max = {}, time_span = {}".format(time_min, time_max, time_span)) 260 | # create a simple object to hold our data 261 | ts = TimeSeries() 262 | ts.dates = [] 263 | ts.x = [] 264 | ts.counts = [] 265 | # load and format data 266 | for i in res_timeseries: 267 | ts.dates.append(i[2]) 268 | ts.counts.append(float(i[1])) 269 | # create a independent variable in interval [0.0,1.0] 270 | ts.x.append((calendar.timegm(datetime.datetime.strptime(i[0], DATE_FMT).timetuple()) - time_min)/time_span) 271 | logging.info("read {} time items from search API".format(len(ts.dates))) 272 | if len(ts.dates) < 35: 273 | logging.warn("peak detection with with fewer than ~35 points is unreliable!") 274 | logging.debug('dates: ' + ','.join(map(str, ts.dates[:10])) + "...") 275 | logging.debug('counts: ' + ','.join(map(str, ts.counts[:10])) + "...") 276 | logging.debug('indep var: ' + ','.join(map(str, ts.x[:10])) + "...") 277 | ###################### 278 | # (1.1) Get a second timeline? 
279 | ###################### 280 | if self.options.second_filter is not None: 281 | logging.info("retrieving second timeline counts") 282 | results_timeseries = Results( self.user 283 | , self.password 284 | , self.stream_url 285 | , self.options.paged 286 | , self.options.output_file_path 287 | , pt_filter=self.options.second_filter 288 | , max_results=int(self.options.max) 289 | , start=self.options.start 290 | , end=self.options.end 291 | , count_bucket=self.options.count_bucket 292 | , show_query=self.options.query 293 | ) 294 | # sort by date 295 | second_res_timeseries = sorted(results_timeseries.get_time_series(), key = itemgetter(0)) 296 | if len(second_res_timeseries) != len(res_timeseries): 297 | logging.error("time series of different sizes not allowed") 298 | else: 299 | ts.second_counts = [] 300 | # load and format data 301 | for i in second_res_timeseries: 302 | ts.second_counts.append(float(i[1])) 303 | logging.info("read {} time items from search API".format(len(ts.second_counts))) 304 | logging.debug('second counts: ' + ','.join(map(str, ts.second_counts[:10])) + "...") 305 | ###################### 306 | # (2) Detrend and remove prominent period 307 | ###################### 308 | logging.info("detrending timeline counts") 309 | no_trend = signal.detrend(np.array(ts.counts)) 310 | # determine period of data 311 | df = (ts.dates[1] - ts.dates[0]).total_seconds() 312 | if df == 86400: 313 | # day counts, average over week 314 | n_buckets = 7 315 | n_avgs = {i:[] for i in range(n_buckets)} 316 | for t,c in zip(ts.dates, no_trend): 317 | n_avgs[t.weekday()].append(c) 318 | elif df == 3600: 319 | # hour counts, average over day 320 | n_buckets = 24 321 | n_avgs = {i:[] for i in range(n_buckets)} 322 | for t,c in zip(ts.dates, no_trend): 323 | n_avgs[t.hour].append(c) 324 | elif df == 60: 325 | # minute counts; average over day 326 | n_buckets = 24*60 327 | n_avgs = {i:[] for i in range(n_buckets)} 328 | for t,c in zip(ts.dates, no_trend): 329 | n_avgs[t.minute].append(c) 330 | else: 331 | sys.stderr.write("Weird interval problem! Exiting.\n") 332 | logging.error("Weird interval problem! Exiting.\n") 333 | sys.exit() 334 | logging.info("averaging over periods of {} buckets".format(n_buckets)) 335 | # remove upper outliers from averages 336 | df_avg_all = {i:np.average(n_avgs[i]) for i in range(n_buckets)} 337 | logging.debug("bucket averages: {}".format(','.join(map(str, [df_avg_all[i] for i in df_avg_all])))) 338 | n_avgs_remove_outliers = {i: [j for j in n_avgs[i] 339 | if abs(j - df_avg_all[i])/df_avg_all[i] < (1. 
+ OUTLIER_FRAC) ] 340 | for i in range(n_buckets)} 341 | df_avg = {i:np.average(n_avgs_remove_outliers[i]) for i in range(n_buckets)} 342 | logging.debug("bucket averages w/o outliers: {}".format(','.join(map(str, [df_avg[i] for i in df_avg])))) 343 | 344 | # flatten cycle 345 | ts.counts_no_cycle_trend = np.array([no_trend[i] - df_avg[ts.dates[i].hour] for i in range(len(ts.counts))]) 346 | logging.debug('no trend: ' + ','.join(map(str, ts.counts_no_cycle_trend[:10])) + "...") 347 | 348 | ###################### 349 | # (3) Moving average 350 | ###################### 351 | ts.moving = np.convolve(ts.counts, np.ones((N_MOVING,))/N_MOVING, mode='valid') 352 | logging.debug('moving ({}): '.format(N_MOVING) + ','.join(map(str, ts.moving[:10])) + "...") 353 | 354 | ###################### 355 | # (4) Peak detection 356 | ###################### 357 | peakind = signal.find_peaks_cwt(ts.counts_no_cycle_trend, np.arange(MIN_PEAK_WIDTH, MAX_PEAK_WIDTH), min_snr = MIN_SNR) 358 | n_peaks = min(MAX_N_PEAKS, len(peakind)) 359 | logging.debug('peaks ({}): '.format(n_peaks) + ','.join(map(str, peakind))) 360 | logging.debug('peaks ({}): '.format(n_peaks) + ','.join(map(str, [ts.dates[i] for i in peakind]))) 361 | 362 | # top peaks determined by peak volume, better way? 363 | # peak detector algorithm: 364 | # * middle of peak (of unknown width) 365 | # * finds peaks up to MAX_PEAK_WIDTH wide 366 | # 367 | # algorithm for geting peak start, peak and end parameters: 368 | # find max, find fwhm, 369 | # find start, step past peak, keep track of volume and peak height, 370 | # stop at end of period or when timeseries turns upward 371 | 372 | peaks = [] 373 | for i in peakind: 374 | # find the first max in the possible window 375 | i_start = max(0, i - SEARCH_PEAK_WIDTH) 376 | i_finish = min(len(ts.counts) - 1, i + SEARCH_PEAK_WIDTH) 377 | p_max = max(ts.counts[i_start:i_finish]) 378 | h_max = p_max/2. 
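# h_max is half the local peak height: the loops below walk outward from the peak and stop once counts fall to h_max or the series turns back upward (a rough FWHM-style estimate of the peak start and end)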
379 | # i_max not center 380 | i_max = i_start + ts.counts[i_start:i_finish].index(p_max) 381 | i_start, i_finish = i_max, i_max 382 | # start at peak, and go back and forward to find start and end 383 | while i_start >= 1: 384 | if (ts.counts[i_start - 1] <= h_max or 385 | ts.counts[i_start - 1] >= ts.counts[i_start] or 386 | i_start - 1 <= 0): 387 | break 388 | i_start -= 1 389 | while i_finish < len(ts.counts) - 1: 390 | if (ts.counts[i_finish + 1] <= h_max or 391 | ts.counts[i_finish + 1] >= ts.counts[i_finish] or 392 | i_finish + 1 >= len(ts.counts)): 393 | break 394 | i_finish += 1 395 | # i is center of peak so balance window 396 | delta_i = max(1, i - i_start) 397 | if i_finish - i > delta_i: 398 | delta_i = i_finish - i 399 | # final est of start and finish 400 | i_finish = min(len(ts.counts) - 1, i + delta_i) 401 | i_start = max(0, i - delta_i) 402 | p_volume = sum(ts.counts[i_start:i_finish]) 403 | peaks.append([ i , p_volume , (i, i_start, i_max, i_finish 404 | , h_max , p_max, p_volume 405 | , ts.dates[i_start], ts.dates[i_max], ts.dates[i_finish])]) 406 | # top n_peaks by volume 407 | top_peaks = sorted(peaks, key = itemgetter(1))[-n_peaks:] 408 | # re-sort peaks by date 409 | ts.top_peaks = sorted(top_peaks, key = itemgetter(0)) 410 | logging.debug('top peaks ({}): '.format(len(ts.top_peaks)) + ','.join(map(str, ts.top_peaks[:4])) + "...") 411 | 412 | ###################### 413 | # (5) high/low frequency 414 | ###################### 415 | ts.cycle, ts.trend = sm.tsa.filters.hpfilter(np.array(ts.counts)) 416 | logging.debug('cycle: ' + ','.join(map(str, ts.cycle[:10])) + "...") 417 | logging.debug('trend: ' + ','.join(map(str, ts.trend[:10])) + "...") 418 | 419 | ###################### 420 | # (6) n-grams for top peaks 421 | ###################### 422 | ts.topics = [] 423 | if self.options.get_topics: 424 | logging.info("retrieving tweets for peak topics") 425 | for a in ts.top_peaks: 426 | # start at peak 427 | ds = datetime.datetime.strftime(a[2][8], DATE_FMT2) 428 | # estimate how long to get TWEET_SAMPLE tweets 429 | # a[1][5] is max tweets per period 430 | if a[2][5] > 0: 431 | est_periods = float(TWEET_SAMPLE)/a[2][5] 432 | else: 433 | logging.warn("peak with zero max tweets ({}), setting est_periods to 1".format(a)) 434 | est_periods = 1 435 | # df comes from above, in seconds 436 | # time resolution is hours 437 | est_time = max(int(est_periods * df), 60) 438 | logging.debug("est_periods={}, est_time={}".format(est_periods, est_time)) 439 | # 440 | if a[2][8] + datetime.timedelta(seconds=est_time) < a[2][9]: 441 | de = datetime.datetime.strftime(a[2][8] + datetime.timedelta(seconds=est_time), DATE_FMT2) 442 | elif a[2][8] < a[2][9]: 443 | de = datetime.datetime.strftime(a[2][9], DATE_FMT2) 444 | else: 445 | de = datetime.datetime.strftime(a[2][8] + datetime.timedelta(seconds=60), DATE_FMT2) 446 | logging.info("retreive data for peak index={} in date range [{},{}]".format(a[0], ds, de)) 447 | res = Results( 448 | self.user 449 | , self.password 450 | , self.stream_url 451 | , self.options.paged 452 | , self.options.output_file_path 453 | , pt_filter=self.options.filter 454 | , max_results=int(self.options.max) 455 | , start=ds 456 | , end=de 457 | , count_bucket=None 458 | , show_query=self.options.query 459 | , hard_max = TWEET_SAMPLE 460 | ) 461 | logging.info("retrieved {} records".format(len(res))) 462 | n_grams_counts = list(res.get_top_grams(n=self.token_list_size)) 463 | ts.topics.append(n_grams_counts) 464 | logging.debug('n_grams for peak index={}: 
'.format(a[0]) + ','.join( 465 | map(str, [i[4].encode("utf-8","ignore") for i in n_grams_counts][:10])) + "...") 466 | return ts 467 | 468 | def dotplot(self, x, labels, path = "dotplot.png"): 469 | """Makeshift dotplots in matplotlib. This is not completely general and encodes labels and 470 | parameter selections that are particular to n-gram dotplots.""" 471 | logging.info("dotplot called, writing image to path={}".format(path)) 472 | if len(x) <= 1 or len(labels) <= 1: 473 | raise ValueError("cannot make a dot plot with only 1 point") 474 | # split n_gram_counts into 2 data sets 475 | n = int(len(labels)/2) 476 | x1, x2 = x[:n], x[n:] 477 | labels1, labels2 = labels[:n], labels[n:] 478 | # create enough equally spaced y values for the horizontal lines 479 | ys = [r*PLOT_DELTA_Y for r in range(1,len(labels2)+1)] 480 | # give ourselves a little extra room on the plot 481 | maxx = max(x)*1.05 482 | maxy = max(ys)*1.05 483 | # set up plots to be a factor taller than the default size 484 | # make factor proportional to the number of n-grams plotted 485 | size = plt.gcf().get_size_inches() 486 | # factor of n/10 is empirical 487 | scale_denom = 10 488 | fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1,figsize=(size[0], size[1]*n/scale_denom)) 489 | logging.debug("plotting top {} terms".format(n)) 490 | logging.debug("plot size=({},{})".format(size[0], size[1]*n/scale_denom)) 491 | # first plot 1-grams 492 | ax1.set_xlim(0,maxx) 493 | ax1.set_ylim(0,maxy) 494 | ticks = ax1.yaxis.set_ticks(ys) 495 | text = ax1.yaxis.set_ticklabels(labels1) 496 | for ct, item in enumerate(labels1): 497 | ax1.hlines(ys[ct], 0, maxx, linestyle='dashed', color='0.9') 498 | ax1.plot(x1, ys, 'ko') 499 | ax1.set_title("1-grams") 500 | # second plot 2-grams 501 | ax2.set_xlim(0,maxx) 502 | ax2.set_ylim(0,maxy) 503 | ticks = ax2.yaxis.set_ticks(ys) 504 | text = ax2.yaxis.set_ticklabels(labels2) 505 | for ct, item in enumerate(labels2): 506 | ax2.hlines(ys[ct], 0, maxx, linestyle='dashed', color='0.9') 507 | ax2.plot(x2, ys, 'ko') 508 | ax2.set_title("2-grams") 509 | ax2.set_xlabel("Fraction of Mentions") 510 | # 511 | plt.tight_layout() 512 | plt.savefig(path) 513 | plt.close("all") 514 | 515 | def plots(self, ts, out_type="png"): 516 | """Basic choice for plotting analysis. 
If you wish to extend this class, over- 517 | write this method.""" 518 | # creat a valid file name, in this case and additional requirement is no spaces 519 | valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits) 520 | filter_prefix_name = ''.join(c for c in self.options.filter if c in valid_chars) 521 | filter_prefix_name = filter_prefix_name.replace(" ", "_") 522 | if len(filter_prefix_name) > 16: 523 | filter_prefix_name = filter_prefix_name[:16] 524 | if self.options.second_filter is not None: 525 | second_filter_prefix_name = ''.join(c for c in self.options.second_filter if c in valid_chars) 526 | second_filter_prefix_name = second_filter_prefix_name.replace(" ", "_") 527 | if len(second_filter_prefix_name) > 16: 528 | second_filter_prefix_name = second_filter_prefix_name[:16] 529 | ###################### 530 | # timeline 531 | ###################### 532 | df0 = pd.Series(ts.counts, index=ts.dates) 533 | df0.plot() 534 | plt.ylabel("Counts") 535 | plt.title(filter_prefix_name) 536 | plt.tight_layout() 537 | plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "time_line", out_type))) 538 | plt.close("all") 539 | ###################### 540 | # cycle and trend 541 | ###################### 542 | df1 = pd.DataFrame({"cycle":ts.cycle, "trend":ts.trend}, index=ts.dates) 543 | df1.plot() 544 | plt.ylabel("Counts") 545 | plt.title(filter_prefix_name) 546 | plt.tight_layout() 547 | plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "cycle_trend_line", out_type))) 548 | plt.close("all") 549 | ###################### 550 | # moving avg 551 | ###################### 552 | if len(ts.moving) <= 3: 553 | logging.warn("Too little data for a moving average") 554 | else: 555 | df2 = pd.DataFrame({"moving":ts.moving}, index=ts.dates[:len(ts.moving)]) 556 | df2.plot() 557 | plt.ylabel("Counts") 558 | plt.title(filter_prefix_name) 559 | plt.tight_layout() 560 | plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "mov_avg_line", out_type))) 561 | plt.close("all") 562 | ###################### 563 | # timeline with peaks marked by vertical bands 564 | ###################### 565 | df3 = pd.Series(ts.counts, index=ts.dates) 566 | df3.plot() 567 | # peaks 568 | for a in ts.top_peaks: 569 | xs = a[2][7] 570 | xp = a[2][8] 571 | xe = a[2][9] 572 | y = a[2][5] 573 | # need to get x and y locs 574 | plt.axvspan(xs, xe, ymin=0, ymax = y, linewidth=1, color='g', alpha=0.2) 575 | plt.axvline(xp, ymin=0, ymax = y, linewidth=1, color='y') 576 | plt.ylabel("Counts") 577 | plt.title(filter_prefix_name) 578 | plt.tight_layout() 579 | plt.savefig(os.path.join(PLOTS_PREFIX, '{}_{}.{}'.format(filter_prefix_name, "time_peaks_line", out_type))) 580 | plt.close("all") 581 | ###################### 582 | # n-grams to help determine topics of peaks 583 | ###################### 584 | for n, p in enumerate(ts.topics): 585 | x = [] 586 | labels = [] 587 | for i in p: 588 | x.append(i[1]) 589 | labels.append(i[4]) 590 | try: 591 | logging.info("creating n-grams dotplot for peak {}".format(n)) 592 | path = os.path.join(PLOTS_PREFIX, "{}_{}_{}.{}".format(filter_prefix_name, "peak", n, out_type)) 593 | self.dotplot(x, labels, path) 594 | except ValueError as e: 595 | logging.error("{} - plot path={} skipped".format(e, path)) 596 | ###################### 597 | # x vs y scatter plot for correlations 598 | ###################### 599 | if self.options.second_filter is not None: 600 | logging.info("creating scatter for queries {} and 
{}".format(self.options.filter, self.options.second_filter)) 601 | df4 = pd.DataFrame({filter_prefix_name: ts.counts, second_filter_prefix_name:ts.second_counts}) 602 | df4.plot(kind='scatter', x=filter_prefix_name, y=second_filter_prefix_name) 603 | plt.ylabel(second_filter_prefix_name) 604 | plt.xlabel(filter_prefix_name) 605 | plt.xlim([0, 1.05 * max(ts.counts)]) 606 | plt.ylim([0, 1.05 * max(ts.second_counts)]) 607 | plt.title("{} vs. {}".format(second_filter_prefix_name, filter_prefix_name)) 608 | plt.tight_layout() 609 | plt.savefig(os.path.join(PLOTS_PREFIX, '{}_v_{}_{}.{}'.format(filter_prefix_name, 610 | second_filter_prefix_name, 611 | "scatter", 612 | out_type))) 613 | plt.close("all") 614 | 615 | if __name__ == "__main__": 616 | """ Simple command line utility.""" 617 | import pickle 618 | g = GnipSearchTimeseries() 619 | if FROM_PICKLE: 620 | ts = pickle.load(open("./time_series.pickle", "rb")) 621 | else: 622 | ts = g.get_results() 623 | pickle.dump(ts,open("./time_series.pickle", "wb")) 624 | g.plots(ts) 625 | -------------------------------------------------------------------------------- /img/earthquake_cycle_trend_line.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/img/earthquake_cycle_trend_line.png -------------------------------------------------------------------------------- /img/earthquake_time_line.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/img/earthquake_time_line.png -------------------------------------------------------------------------------- /img/earthquake_time_peaks_line.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DrSkippy/Gnip-Python-Search-API-Utilities/30c3780220bbeba384815ccbc4ce1d567bfa934c/img/earthquake_time_peaks_line.png -------------------------------------------------------------------------------- /job.json: -------------------------------------------------------------------------------- 1 | { 2 | "rules":[ 3 | {"value":"dog", "tag":"common pet"} 4 | , {"value":"cat", "tag":"common pet"} 5 | , {"value":"hamster", "tag":"common pet"} 6 | , {"value":"pet", "tag":"abstract pet"} 7 | , {"value":"vet", "tag":"pet owner destination"} 8 | , {"value":"kennel", "tag":"pet owner destination"} 9 | , {"value":"puppy OR kitten", "tag":"diminutives"} 10 | ], 11 | "negation_rules":[ 12 | {"value":"tracter", "tag":"type of cat"} 13 | , {"value":"dozer", "tag":"type of cat"} 14 | , {"value":"grader", "tag":"type of cat"} 15 | , {"value":"\"skid loader\"", "tag":"type of cat"} 16 | ], 17 | "date_ranges": [ 18 | {"start":"2015-05-01T00:00:00", "end":"2015-06-01T00:00:00"} 19 | , {"start":"2015-11-01T00:00:00", "end":"2015-12-01T00:00:00"} 20 | ] 21 | } 22 | -------------------------------------------------------------------------------- /rules.txt: -------------------------------------------------------------------------------- 1 | (from:drskippy27 OR from:gnip) data 2 | obama bieber 3 | -------------------------------------------------------------------------------- /search/__init__.py: -------------------------------------------------------------------------------- 1 | __all__ = ['api' 2 | , 'results'] 3 | 
-------------------------------------------------------------------------------- /search/api.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | __author__="Scott Hendrickson, Josh Montague" 4 | 5 | import sys 6 | import requests 7 | import json 8 | import codecs 9 | import datetime 10 | import time 11 | import os 12 | import re 13 | import unicodedata 14 | 15 | from acscsv.twitter_acs import TwacsCSV 16 | 17 | ## update for python3 18 | if sys.version_info[0] == 2: 19 | reload(sys) 20 | sys.stdout = codecs.getwriter('utf-8')(sys.stdout) 21 | sys.stdin = codecs.getreader('utf-8')(sys.stdin) 22 | 23 | #remove this 24 | requests.packages.urllib3.disable_warnings() 25 | 26 | # formatter of data from API 27 | TIME_FORMAT_SHORT = "%Y%m%d%H%M" 28 | TIME_FORMAT_LONG = "%Y-%m-%dT%H:%M:%S.000Z" 29 | PAUSE = 1 # seconds between page requests 30 | POSTED_TIME_IDX = 1 31 | #date time parsing utility regex 32 | DATE_TIME_RE = re.compile("([0-9]{4}).([0-9]{2}).([0-9]{2}).([0-9]{2}):([0-9]{2})") 33 | 34 | class Query(object): 35 | """Object represents a single search API query and provides utilities for 36 | managing parameters, executing the query and parsing the results.""" 37 | 38 | def __init__(self 39 | , user 40 | , password 41 | , stream_url 42 | , paged = False 43 | , output_file_path = None 44 | , hard_max = None 45 | ): 46 | """A Query requires at least a valid user name, password and endpoint url. 47 | The URL of the endpoint should be the JSON records endpoint, not the counts 48 | endpoint. 49 | 50 | Additional parambers specifying paged search and output file path allow 51 | for making queries which return more than the 500 activity limit imposed by 52 | a single call to the API. This is called paging or paged search. Setting 53 | paged = True will enable the token interpretation 54 | functionality provided in the API to return a seamless set of activites. 55 | 56 | Once the object is created, it can be used for repeated access to the 57 | configured end point with the same connection configuration set at 58 | creation.""" 59 | self.output_file_path = output_file_path 60 | self.paged = paged 61 | self.hard_max = hard_max 62 | self.paged_file_list = [] 63 | self.user = user 64 | self.password = password 65 | self.end_point = stream_url # activities end point NOT the counts end point 66 | # get a parser for the twitter columns 67 | # TODO: use the updated retriveal methods in gnacs instead of this? 68 | self.twitter_parser = TwacsCSV(",", None, False, True, False, True, False, False, False) 69 | # Flag for post processing tweet timeline from tweet times 70 | self.tweet_times_flag = False 71 | 72 | def set_dates(self, start, end): 73 | """Utility function to set dates from strings. Given string-formated 74 | dates for start date time and end date time, extract the required 75 | date string format for use in the API query and make sure they 76 | are valid dates. 77 | 78 | Sets class fromDate and toDate date strings.""" 79 | if start: 80 | dt = re.search(DATE_TIME_RE, start) 81 | if not dt: 82 | raise ValueError("Error. Invalid start-date format: %s \n"%str(start)) 83 | else: 84 | f ='' 85 | for i in range(re.compile(DATE_TIME_RE).groups): 86 | f += dt.group(i+1) 87 | self.fromDate = f 88 | # make sure this is a valid date 89 | tmp_start = datetime.datetime.strptime(f, TIME_FORMAT_SHORT) 90 | 91 | if end: 92 | dt = re.search(DATE_TIME_RE, end) 93 | if not dt: 94 | raise ValueError("Error. 
Invalid end-date format: %s \n"%str(end)) 95 | else: 96 | e ='' 97 | for i in range(re.compile(DATE_TIME_RE).groups): 98 | e += dt.group(i+1) 99 | self.toDate = e 100 | # make sure this is a valid date 101 | tmp_end = datetime.datetime.strptime(e, TIME_FORMAT_SHORT) 102 | if start: 103 | if tmp_start >= tmp_end: 104 | raise ValueError("Error. Start date greater than end date.\n") 105 | 106 | def name_munger(self, f): 107 | """Utility function to create a valid, friendly file name base 108 | string from an input rule.""" 109 | f = re.sub(' +','_',f) 110 | f = f.replace(':','_') 111 | f = f.replace('"','_Q_') 112 | f = f.replace('(','_p_') 113 | f = f.replace(')','_p_') 114 | self.file_name_prefix = unicodedata.normalize( 115 | "NFKD",f[:42]).encode( 116 | "ascii","ignore") 117 | 118 | def request(self): 119 | """HTTP request based on class variables for rule_payload, 120 | stream_url, user and password""" 121 | try: 122 | s = requests.Session() 123 | s.headers = {'Accept-encoding': 'gzip'} 124 | s.auth = (self.user, self.password) 125 | res = s.post(self.stream_url, data=json.dumps(self.rule_payload)) 126 | if res.status_code != 200: 127 | sys.stderr.write("Exiting with HTTP error code {}\n".format(res.status_code)) 128 | sys.stderr.write("ERROR Message: {}\n".format(res.json()["error"]["message"])) 129 | if 1==1: #self.return_incomplete: 130 | sys.stderr.write("Returning incomplete dataset.") 131 | return(res.content.decode(res.encoding)) 132 | sys.exit(-1) 133 | except requests.exceptions.ConnectionError as e: 134 | e.msg = "Error (%s). Exiting without results."%str(e) 135 | raise e 136 | except requests.exceptions.HTTPError as e: 137 | e.msg = "Error (%s). Exiting without results."%str(e) 138 | raise e 139 | except requests.exceptions.MissingSchema as e: 140 | e.msg = "Error (%s). Exiting without results."%str(e) 141 | raise e 142 | #Don't use res.text as it creates encoding challenges! 143 | return(res.content.decode(res.encoding)) 144 | 145 | def parse_responses(self, count_bucket): 146 | """Parse returned responses. 
147 | 148 | When paged=True, manage paging using the API token mechanism 149 | 150 | When output file is set, write output files for paged output.""" 151 | acs = [] 152 | repeat = True 153 | page_count = 1 154 | self.paged_file_list = [] 155 | while repeat: 156 | doc = self.request() 157 | tmp_response = json.loads(doc) 158 | if "results" in tmp_response: 159 | acs.extend(tmp_response["results"]) 160 | else: 161 | raise ValueError("Invalid request\nQuery: %s\nResponse: %s"%(self.rule_payload, doc)) 162 | if self.hard_max is None or len(acs) < self.hard_max: 163 | repeat = False 164 | if self.paged or count_bucket: 165 | if len(acs) > 0: 166 | if self.output_file_path is not None: 167 | # writing to file 168 | file_name = self.output_file_path + "/{0}_{1}.json".format( 169 | str(datetime.datetime.utcnow().strftime( 170 | "%Y%m%d%H%M%S")) 171 | , str(self.file_name_prefix)) 172 | with codecs.open(file_name, "wb","utf-8") as out: 173 | for item in tmp_response["results"]: 174 | out.write(json.dumps(item)+"\n") 175 | self.paged_file_list.append(file_name) 176 | # if writing to file, don't keep track of all the data in memory 177 | acs = [] 178 | else: 179 | # storing in memory, so give some feedback as to size 180 | sys.stderr.write("[{0:8d} bytes] {1:5d} total activities retrieved...\n".format( 181 | sys.getsizeof(acs) 182 | , len(acs))) 183 | else: 184 | sys.stderr.write( "No results returned for rule:{0}\n".format(str(self.rule_payload)) ) 185 | if "next" in tmp_response: 186 | self.rule_payload["next"]=tmp_response["next"] 187 | repeat = True 188 | page_count += 1 189 | sys.stderr.write( "Fetching page {}...\n".format(page_count) ) 190 | else: 191 | if "next" in self.rule_payload: 192 | del self.rule_payload["next"] 193 | repeat = False 194 | time.sleep(PAUSE) 195 | else: 196 | # stop iterating after reaching hard_max 197 | repeat = False 198 | return acs 199 | 200 | def get_time_series(self): 201 | if self.paged and self.output_file_path is not None: 202 | for file_name in self.paged_file_list: 203 | with codecs.open(file_name,"rb") as f: 204 | for res in f: 205 | rec = json.loads(res.decode('utf-8').strip()) 206 | t = datetime.datetime.strptime(rec["timePeriod"], TIME_FORMAT_SHORT) 207 | yield [rec["timePeriod"], rec["count"], t] 208 | else: 209 | if self.tweet_times_flag: 210 | # todo: list of tweets, aggregate by bucket 211 | raise NotImplementedError("Aggregated buckets on json tweets not implemented!") 212 | else: 213 | for i in self.time_series: 214 | yield i 215 | 216 | 217 | def get_activity_set(self): 218 | """Generator iterates through the entire activity set from memory or disk.""" 219 | if self.paged and self.output_file_path is not None: 220 | for file_name in self.paged_file_list: 221 | with codecs.open(file_name,"rb") as f: 222 | for res in f: 223 | yield json.loads(res.decode('utf-8')) 224 | else: 225 | for res in self.rec_dict_list: 226 | yield res 227 | 228 | def get_list_set(self): 229 | """Like get_activity_set, but returns a list containing values parsed by 230 | current Twacs parser configuration.""" 231 | for rec in self.get_activity_set(): 232 | yield self.twitter_parser.get_source_list(rec) 233 | 234 | def execute(self 235 | , pt_filter 236 | , max_results = 100 237 | , start = None 238 | , end = None 239 | , count_bucket = None # None is json 240 | , show_query = False): 241 | """Execute a query with filter, maximum results, start and end dates. 242 | 243 | Count_bucket determines the bucket size for the counts endpoint. 
244 | If the count_bucket variable is set to a valid bucket size, such 245 | as minute, hour, or day, then the activity counts endpoint will be used. 246 | Otherwise, the data endpoint is used.""" 247 | # set class start and stop datetime variables 248 | self.set_dates(start, end) 249 | # make a friendlier file name from the rules 250 | self.name_munger(pt_filter) 251 | if self.paged or max_results > 500: 252 | # avoid making many small requests 253 | max_results = 500 254 | self.rule_payload = { 255 | 'query': pt_filter 256 | } 257 | self.rule_payload["maxResults"] = int(max_results) 258 | if start: 259 | self.rule_payload["fromDate"] = self.fromDate 260 | if end: 261 | self.rule_payload["toDate"] = self.toDate 262 | # use the proper endpoint url 263 | self.stream_url = self.end_point 264 | if count_bucket: 265 | if not self.end_point.endswith("counts.json"): 266 | self.stream_url = self.end_point[:-5] + "/counts.json" 267 | if count_bucket not in ['day', 'minute', 'hour']: 268 | raise ValueError("Error. Invalid count bucket: %s \n"%str(count_bucket)) 269 | self.rule_payload["bucket"] = count_bucket 270 | self.rule_payload.pop("maxResults",None) 271 | # for testing, show the query JSON and stop 272 | if show_query: 273 | sys.stderr.write("API query:\n") 274 | sys.stderr.write(json.dumps(self.rule_payload) + '\n') 275 | sys.exit() 276 | # set up variables to catch the data in 3 formats 277 | self.time_series = [] 278 | self.rec_dict_list = [] 279 | self.rec_list_list = [] 280 | self.res_cnt = 0 281 | # timing 282 | self.delta_t = 1 # keeps us from crashing 283 | # actual oldest tweet before now 284 | self.oldest_t = datetime.datetime.utcnow() 285 | # actual newest tweet more recent than 30 days ago 286 | # self.newest_t = datetime.datetime.utcnow() - datetime.timedelta(days=30) 287 | # search v2: newest date is more recent than 2006-03-01T00:00:00 288 | self.newest_t = datetime.datetime.strptime("2006-03-01T00:00:00.000z", TIME_FORMAT_LONG) 289 | # 290 | for rec in self.parse_responses(count_bucket): 291 | # parse_responses returns only the last set of activities retrieved, not all paged results. 292 | # to access the entire set, use the helper functions get_activity_set and get_list_set! 293 | self.res_cnt += 1 294 | self.rec_dict_list.append(rec) 295 | if count_bucket: 296 | # timeline data 297 | t = datetime.datetime.strptime(rec["timePeriod"], TIME_FORMAT_SHORT) 298 | tmp_tl_list = [rec["timePeriod"], rec["count"], t] 299 | self.tweet_times_flag = False 300 | else: 301 | # json activities 302 | # keep track of tweet times for time calculation 303 | tmp_list = self.twitter_parser.procRecordToList(rec) 304 | self.rec_list_list.append(tmp_list) 305 | t = datetime.datetime.strptime(tmp_list[POSTED_TIME_IDX], TIME_FORMAT_LONG) 306 | tmp_tl_list = [tmp_list[POSTED_TIME_IDX], 1, t] 307 | self.tweet_times_flag = True 308 | # this list is ***either*** list of buckets or list of tweet times! 309 | self.time_series.append(tmp_tl_list) 310 | # timeline requests don't return activities! 311 | if t < self.oldest_t: 312 | self.oldest_t = t 313 | if t > self.newest_t: 314 | self.newest_t = t 315 | self.delta_t = (self.newest_t - self.oldest_t).total_seconds()/60.
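# note: delta_t is measured in minutes (oldest to newest timestamp seen), so
# get_rate() below reports results per minute over the span actually retrieved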
316 | return 317 | 318 | def get_rate(self): 319 | """Returns rate from last query executed""" 320 | if self.delta_t != 0: 321 | return float(self.res_cnt)/self.delta_t 322 | else: 323 | return None 324 | 325 | def __len__(self): 326 | """Returns the size of the results set when len(Query) is called.""" 327 | try: 328 | return self.res_cnt 329 | except AttributeError: 330 | return 0 331 | 332 | def __repr__(self): 333 | """Returns a string represenataion of the result set.""" 334 | try: 335 | return "\n".join([json.dumps(x) for x in self.rec_dict_list]) 336 | except AttributeError: 337 | return "No query completed." 338 | 339 | if __name__ == "__main__": 340 | g = Query("shendrickson@gnip.com" 341 | , "XXXXXPASSWORDXXXXX" 342 | , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json") 343 | g.execute("bieber", 10) 344 | for x in g.get_activity_set(): 345 | print(x) 346 | print(g) 347 | print(g.get_rate()) 348 | g.execute("bieber", count_bucket = "hour") 349 | print(g) 350 | print(len(g)) 351 | pg = Query("shendrickson@gnip.com" 352 | , "XXXXXPASSWORDXXXXX" 353 | , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json" 354 | , paged = True 355 | , output_file_path = "../data/") 356 | now_date = datetime.datetime.now() 357 | pg.execute("bieber" 358 | , end=now_date.strftime(TIME_FORMAT_LONG) 359 | , start=(now_date - datetime.timedelta(seconds=200)).strftime(TIME_FORMAT_LONG)) 360 | for x in pg.get_activity_set(): 361 | print(x) 362 | g.execute("bieber", show_query=True) 363 | -------------------------------------------------------------------------------- /search/results.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | __author__="Scott Hendrickson, Josh Montague" 4 | 5 | import sys 6 | import codecs 7 | import datetime 8 | import time 9 | import os 10 | import re 11 | 12 | from .api import * 13 | from simple_n_grams.simple_n_grams import SimpleNGrams 14 | 15 | if sys.version_info[0] < 3: 16 | try: 17 | reload(sys) 18 | sys.stdout = codecs.getwriter('utf-8')(sys.stdout) 19 | sys.stdin = codecs.getreader('utf-8')(sys.stdin) 20 | except NameError: 21 | pass 22 | 23 | ############################################# 24 | # Some constants to configure column retrieval from TwacsCSV 25 | DATE_INDEX = 1 26 | TEXT_INDEX = 2 27 | LINKS_INDEX = 3 28 | USER_NAME_INDEX = 7 29 | USER_ID_INDEX = 8 30 | OUTPUT_PAGE_WIDTH = 120 31 | BIG_COLUMN_WIDTH = 32 32 | 33 | class Results(): 34 | """Class for aggregating and accessing search result sets and 35 | subsets. Returns derived values for the query specified.""" 36 | 37 | def __init__(self 38 | , user 39 | , password 40 | , stream_url 41 | , paged = False 42 | , output_file_path = None 43 | , pt_filter = None 44 | , max_results = 100 45 | , start = None 46 | , end = None 47 | , count_bucket = None 48 | , show_query = False 49 | , hard_max = None 50 | ): 51 | """Create a result set by passing all of the require parameters 52 | for a query. The Results class runs an API query once when 53 | initialized. This allows one to make multiple calls 54 | to analytics methods on a single query. 
55 | """ 56 | # run the query 57 | self.query = Query(user, password, stream_url, paged, output_file_path, hard_max) 58 | self.query.execute( 59 | pt_filter=pt_filter 60 | , max_results = max_results 61 | , start = start 62 | , end = end 63 | , count_bucket = count_bucket 64 | , show_query = show_query 65 | ) 66 | self.freq = None 67 | 68 | def get_activities(self): 69 | """Generator of query results.""" 70 | for x in self.query.get_activity_set(): 71 | yield x 72 | 73 | def get_time_series(self): 74 | """Generator of time series for query results.""" 75 | for x in self.query.get_time_series(): 76 | yield x 77 | 78 | def get_top_links(self, n=20): 79 | """Returns the links most shared in the data set retrieved in 80 | the order of how many times each was shared.""" 81 | self.freq = SimpleNGrams(char_upper_cutoff=100, tokenizer="space") 82 | for x in self.query.get_list_set(): 83 | link_str = x[LINKS_INDEX] 84 | if link_str != "GNIPEMPTYFIELD" and link_str != "None": 85 | self.freq.add(link_str) 86 | else: 87 | self.freq.add("NoLinks") 88 | return self.freq.get_tokens(n) 89 | 90 | def get_top_users(self, n=50): 91 | """Returns the users tweeting the most in the data set retrieved 92 | in the data set. Users are returned in descending order of how 93 | many times they were tweeted.""" 94 | self.freq = SimpleNGrams(char_upper_cutoff=20, tokenizer="twitter") 95 | for x in self.query.get_list_set(): 96 | self.freq.add(x[USER_NAME_INDEX]) 97 | return self.freq.get_tokens(n) 98 | 99 | def get_users(self, n=None): 100 | """Returns the user ids for the tweets collected""" 101 | uniq_users = set() 102 | for x in self.query.get_list_set(): 103 | uniq_users.add(x[USER_ID_INDEX]) 104 | return uniq_users 105 | 106 | def get_top_grams(self, n=20): 107 | self.freq = SimpleNGrams(char_upper_cutoff=20, tokenizer="twitter") 108 | self.freq.sl.add_session_stop_list(["http", "https", "amp", "htt"]) 109 | for x in self.query.get_list_set(): 110 | self.freq.add(x[TEXT_INDEX]) 111 | return self.freq.get_tokens(n) 112 | 113 | def get_geo(self): 114 | for rec in self.query.get_activity_set(): 115 | lat, lng = None, None 116 | if "geo" in rec: 117 | if "coordinates" in rec["geo"]: 118 | [lat,lng] = rec["geo"]["coordinates"] 119 | activity = { "id": rec["id"].split(":")[2] 120 | , "postedTime": rec["postedTime"].strip(".000Z") 121 | , "latitude": lat 122 | , "longitude": lng } 123 | yield activity 124 | 125 | def get_frequency_items(self, size = 20): 126 | """Retrieve the token list structure from the last query""" 127 | if self.freq is None: 128 | raise VallueError("No frequency available for use case") 129 | return self.freq.get_tokens(size) 130 | 131 | def __len__(self): 132 | return len(self.query) 133 | 134 | def __repr__(self): 135 | if self.last_query_params["count_bucket"] is None: 136 | res = [u"-"*OUTPUT_PAGE_WIDTH] 137 | rate = self.query.get_rate() 138 | unit = "Tweets/Minute" 139 | if rate < 0.01: 140 | rate *= 60. 
141 | unit = "Tweets/Hour" 142 | res.append(" PowerTrack Rule: \"%s\""%self.last_query_params["pt_filter"]) 143 | res.append(" Oldest Tweet (UTC): %s"%str(self.query.oldest_t)) 144 | res.append(" Newest Tweet (UTC): %s"%str(self.query.newest_t)) 145 | res.append(" Now (UTC): %s"%str(datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"))) 146 | res.append(" %5d Tweets: %6.3f %s"%(self.query.res_cnt, rate, unit)) 147 | res.append("-"*OUTPUT_PAGE_WIDTH) 148 | # 149 | self.query.get_top_users() 150 | fmt_str = u"%{}s -- %10s %8s (%d)".format(BIG_COLUMN_WIDTH) 151 | res.append(fmt_str%( "users", "tweets", "activities", self.res_cnt)) 152 | res.append("-"*OUTPUT_PAGE_WIDTH) 153 | fmt_str = u"%{}s -- %4d %5.2f%% %4d %5.2f%%".format(BIG_COLUMN_WIDTH) 154 | for x in self.freq.get_tokens(20): 155 | res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.)) 156 | res.append("-"*OUTPUT_PAGE_WIDTH) 157 | # 158 | self.query.get_top_links() 159 | fmt_str = u"%{}s -- %10s %8s (%d)".format(int(2.5*BIG_COLUMN_WIDTH)) 160 | res.append(fmt_str%( "links", "mentions", "activities", self.res_cnt)) 161 | res.append("-"*OUTPUT_PAGE_WIDTH) 162 | fmt_str = u"%{}s -- %4d %5.2f%% %4d %5.2f%%".format(int(2.5*BIG_COLUMN_WIDTH)) 163 | for x in self.freq.get_tokens(20): 164 | res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.)) 165 | res.append("-"*OUTPUT_PAGE_WIDTH) 166 | # 167 | self.query.get_top_grams() 168 | fmt_str = u"%{}s -- %10s %8s (%d)".format(BIG_COLUMN_WIDTH) 169 | res.append(fmt_str%( "terms", "mentions", "activities", self.res_cnt)) 170 | res.append("-"*OUTPUT_PAGE_WIDTH) 171 | fmt_str =u"%{}s -- %4d %5.2f%% %4d %6.2f%%".format(BIG_COLUMN_WIDTH) 172 | for x in self.freq.get_tokens(20): 173 | res.append(fmt_str%(x[4], x[0], x[1]*100., x[2], x[3]*100.)) 174 | res.append("-"*OUTPUT_PAGE_WIDTH) 175 | else: 176 | res = ["{:%Y-%m-%dT%H:%M:%S},{}".format(x[2], x[1]) 177 | for x in self.get_time_series()] 178 | return u"\n".join(res) 179 | 180 | if __name__ == "__main__": 181 | g = Results("shendrickson@gnip.com" 182 | , "XXXXXPASSWORDXXXXX" 183 | , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json") 184 | #list(g.get_time_series(pt_filter="bieber", count_bucket="hour")) 185 | print(g) 186 | print( list(g.get_activities(pt_filter="bieber", max_results = 10)) ) 187 | print( list(g.get_geo(pt_filter = "bieber has:geo", max_results = 10)) ) 188 | print( list(g.get_time_series(pt_filter="beiber", count_bucket="hour")) ) 189 | print( list(g.get_top_links(pt_filter="beiber", max_results=100, n=30)) ) 190 | print( list(g.get_top_users(pt_filter="beiber", max_results=100, n=30)) ) 191 | print( list(g.get_top_grams(pt_filter="bieber", max_results=100, n=50)) ) 192 | print( list(g.get_frequency_items(10)) ) 193 | print(g) 194 | print(g.get_rate()) 195 | g.execute(pt_filter="bieber", query=True) 196 | -------------------------------------------------------------------------------- /search/test_api.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | __author__="Scott Hendrickson, Josh Montague" 4 | 5 | import requests 6 | import unittest 7 | import os 8 | 9 | # establish import context and then import explicitly 10 | #from .context import gpt 11 | #from gpt.rules import rules as gpt_r 12 | from api import * 13 | 14 | class TestQuery(unittest.TestCase): 15 | 16 | def setUp(self): 17 | self.g = Query("shendrickson@gnip.com" 18 | , "XXXXXXXXX" 19 | , 
"https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json") 20 | self.g_paged = Query("shendrickson@gnip.com" 21 | , "XXXXXXXXX" 22 | , "https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json" 23 | , paged = True 24 | , output_file_path = ".") 25 | 26 | def tearDown(self): 27 | # remove stray files 28 | for f in os.listdir("."): 29 | if re.search("bieber.json", f): 30 | os.remove(os.path.join(".", f)) 31 | 32 | def test_set_dates(self): 33 | s = "2014-11-01T00:00:30" 34 | e = "2014-11-02T00:20:00" 35 | self.g.set_dates(s,e) 36 | self.assertEquals(self.g.fromDate, "201411010000") 37 | self.assertEquals(self.g.toDate, "201411020020") 38 | with self.assertRaises(ValueError) as cm: 39 | self.g.set_dates(e,s) 40 | e = "201/1/0T00:20:00" 41 | with self.assertRaises(ValueError) as cm: 42 | self.g.set_dates(s,e) 43 | 44 | def test_name_munger(self): 45 | self.g.name_munger("adsfadsfa") 46 | self.assertEquals("adsfadsfa", self.g.file_name_prefix) 47 | self.g.name_munger('adsf"adsfa') 48 | self.assertEquals("adsf_Q_adsfa", self.g.file_name_prefix) 49 | self.g.name_munger("adsf(dsfa") 50 | self.assertEquals("adsf_p_dsfa", self.g.file_name_prefix) 51 | self.g.name_munger("adsf)dsfa") 52 | self.assertEquals("adsf_p_dsfa", self.g.file_name_prefix) 53 | self.g.name_munger("adsf:dsfa") 54 | self.assertEquals("adsf_dsfa", self.g.file_name_prefix) 55 | self.g.name_munger("adsf dsfa") 56 | self.assertEquals("adsf_dsfa", self.g.file_name_prefix) 57 | self.g.name_munger("adsf dsfa") 58 | self.assertEquals("adsf_dsfa", self.g.file_name_prefix) 59 | 60 | def test_req(self): 61 | self.g.rule_payload = {'query': 'bieber', 'maxResults': 10, 'publisher': 'twitter'} 62 | self.g.stream_url = self.g.end_point 63 | self.assertEquals(10, len(json.loads(self.g.request())["results"])) 64 | self.g.stream_url = "adsfadsf" 65 | with self.assertRaises(requests.exceptions.MissingSchema) as cm: 66 | self.g.request() 67 | self.g.stream_url = "http://ww.thisreallydoesn'texist.com" 68 | with self.assertRaises(requests.exceptions.ConnectionError) as cm: 69 | self.g.request() 70 | self.g.stream_url = "https://ww.thisreallydoesntexist.com" 71 | with self.assertRaises(requests.exceptions.ConnectionError) as cm: 72 | self.g.request() 73 | 74 | def test_parse_responses(self): 75 | self.g.rule_payload = {'query': 'bieber', 'maxResults': 10, 'publisher': 'twitter'} 76 | self.g.stream_url = self.g.end_point 77 | self.assertEquals(len(self.g.parse_responses()), 10) 78 | self.g.rule_payload = {'maxResults': 10, 'publisher': 'twitter'} 79 | self.g.stream_url = self.g.end_point 80 | with self.assertRaises(ValueError) as cm: 81 | self.g.parse_responses() 82 | #TODO graceful way to test write to file functionality here 83 | 84 | def test_get_activity_set(self): 85 | self.g.execute("bieber", max_results=10) 86 | self.assertEquals(len(list(self.g.get_activity_set())), 10) 87 | # seconds of bieber 88 | tmp_start = datetime.datetime.strftime( 89 | datetime.datetime.now() + datetime.timedelta(seconds = -60) 90 | ,"%Y-%m-%dT%H:%M:%S") 91 | tmp_end = datetime.datetime.strftime( 92 | datetime.datetime.now() 93 | ,"%Y-%m-%dT%H:%M:%S") 94 | print >> sys.stderr, "bieber from ", tmp_start, " to ", tmp_end 95 | self.g_paged.execute("bieber" 96 | , start = tmp_start 97 | , end = tmp_end) 98 | self.assertGreater(len(list(self.g_paged.get_activity_set())), 500) 99 | 100 | def test_execute(self): 101 | # 102 | tmp = { "pt_filter": "bieber" 103 | , "max_results" : 100 104 | , "start" : None 105 | , "end" : None 106 | , 
"count_bucket" : None # None is json 107 | , "show_query" : False } 108 | self.g.execute(**tmp) 109 | self.assertEquals(len(self.g), 100) 110 | self.assertEquals(len(self.g.rec_list_list), 100) 111 | self.assertEquals(len(self.g.rec_dict_list), 100) 112 | self.assertEquals(self.g.rule_payload, {'query': 'bieber', 'maxResults': 100, 'publisher': 'twitter'}) 113 | # 114 | tmp = { "pt_filter": "bieber" 115 | , "max_results" : 600 116 | , "start" : None 117 | , "end" : None 118 | , "count_bucket" : None # None is json 119 | , "show_query" : False } 120 | self.g.execute(**tmp) 121 | self.assertEquals(len(self.g), 500) 122 | self.assertEquals(len(self.g.time_series), 500) 123 | self.assertEquals(len(self.g.rec_list_list), 500) 124 | self.assertEquals(len(self.g.rec_dict_list), 500) 125 | self.assertEquals(self.g.rule_payload, {'query': 'bieber', 'maxResults': 500, 'publisher': 'twitter'}) 126 | # 127 | tmp = datetime.datetime.now() + datetime.timedelta(seconds = -60) 128 | tmp_start = datetime.datetime.strftime( 129 | tmp 130 | , "%Y-%m-%dT%H:%M:%S") 131 | tmp_start_cmp = datetime.datetime.strftime( 132 | tmp 133 | ,"%Y%m%d%H%M") 134 | tmp = datetime.datetime.now() 135 | tmp_end = datetime.datetime.strftime( 136 | tmp 137 | ,"%Y-%m-%dT%H:%M:%S") 138 | tmp_end_cmp = datetime.datetime.strftime( 139 | tmp 140 | ,"%Y%m%d%H%M") 141 | tmp = { "pt_filter": "bieber" 142 | , "max_results" : 500 143 | , "start" : tmp_start 144 | , "end" : tmp_end 145 | , "count_bucket" : None # None is json 146 | , "show_query" : False } 147 | self.g.execute(**tmp) 148 | self.assertEquals(len(self.g), 500) 149 | self.assertEquals(len(self.g.time_series), 500) 150 | self.assertEquals(len(self.g.rec_list_list), 500) 151 | self.assertEquals(len(self.g.rec_dict_list), 500) 152 | self.assertEquals(self.g.rule_payload, {'query': 'bieber' 153 | , 'maxResults': 500 154 | , 'toDate': tmp_end_cmp 155 | , 'fromDate': tmp_start_cmp 156 | , 'publisher': 'twitter'}) 157 | self.assertIsNotNone(self.g.fromDate) 158 | self.assertIsNotNone(self.g.toDate) 159 | self.assertGreater(self.g.delta_t, 0) # delta_t in minutes 160 | self.assertGreater(1.1, self.g.delta_t) # delta_t in minutes 161 | # 162 | tmp = { "pt_filter": "bieber" 163 | , "max_results" : 100 164 | , "start" : None 165 | , "end" : None 166 | , "count_bucket" : "fortnight" 167 | , "show_query" : False } 168 | with self.assertRaises(ValueError) as cm: 169 | self.g.execute(**tmp) 170 | # 171 | tmp = { "pt_filter": "bieber" 172 | , "start" : None 173 | , "end" : None 174 | , "count_bucket" : "hour" 175 | , "show_query" : False } 176 | self.g.execute(**tmp) 177 | self.assertEquals(len(self.g), 24*30 + datetime.datetime.utcnow().hour + 1) 178 | self.assertGreater(self.g.delta_t, 24*30*60) # delta_t in minutes 179 | 180 | def test_get_rate(self): 181 | self.g.res_cnt = 100 182 | self.g.delta_t = 10 183 | self.assertEquals(self.g.get_rate(), 10) 184 | self.g.delta_t = 11 185 | self.assertAlmostEquals(self.g.get_rate(), 9.09090909091) 186 | 187 | def test_len(self): 188 | self.assertEquals(0, len(self.g)) 189 | tmp = { "pt_filter": "bieber" 190 | , "max_results" : 500 191 | , "count_bucket" : None # None is json 192 | , "show_query" : False } 193 | self.g.execute(**tmp) 194 | self.assertEquals(self.g.res_cnt, len(self.g)) 195 | 196 | def test_repr(self): 197 | self.assertIsNotNone(str(self.g)) 198 | tmp = { "pt_filter": "bieber" 199 | , "max_results" : 500 200 | , "count_bucket" : None # None is json 201 | , "show_query" : False } 202 | self.g.execute(**tmp) 203 | 
self.assertIsNotNone(str(self.g)) 204 | self.assertTrue('\n' in str(self.g)) 205 | self.assertEquals(str(self.g).count('\n'), len(self.g)-1) 206 | 207 | if __name__ == "__main__": 208 | unittest.main() 209 | -------------------------------------------------------------------------------- /search/test_results.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | __author__="Scott Hendrickson, Josh Montague" 4 | 5 | import requests 6 | import unittest 7 | import os 8 | import copy 9 | import time 10 | 11 | # establish import context and then import explicitly 12 | #from .context import gpt 13 | #from gpt.rules import rules as gpt_r 14 | from results import * 15 | 16 | class TestResults(unittest.TestCase): 17 | 18 | def setUp(self): 19 | self.params = { 20 | "user":"shendrickson@gnip.com" 21 | , "password":"XXXXXXXXX" 22 | , "stream_url":"https://gnip-api.twitter.com/search/30day/accounts/shendrickson/wayback.json" 23 | } 24 | 25 | def tearDown(self): 26 | # remove stray files 27 | for f in os.listdir("."): 28 | if re.search("bieber.json", f): 29 | os.remove(os.path.join(".", f)) 30 | 31 | def test_get(self): 32 | self.g = Results( 33 | pt_filter = "bieber" 34 | , max_results = 10 35 | , start = None 36 | , end = None 37 | , count_bucket = None 38 | , show_query = False 39 | , **self.params) 40 | self.assertEquals(len(self.g), 10) 41 | 42 | def test_get_activities(self): 43 | self.g = Results( 44 | pt_filter = "bieber" 45 | , max_results = 10 46 | , start = None 47 | , end = None 48 | , count_bucket = None 49 | , show_query = False 50 | , **self.params) 51 | for x in self.g.get_activities(): 52 | self.assertTrue("id" in x) 53 | self.assertEqual(len(list(self.g.get_activities())), 10) 54 | # seconds of bieber 55 | tmp_start = datetime.datetime.strftime( 56 | datetime.datetime.now() + datetime.timedelta(seconds = -60) 57 | ,"%Y-%m-%dT%H:%M:%S") 58 | tmp_end = datetime.datetime.strftime( 59 | datetime.datetime.now() 60 | ,"%Y-%m-%dT%H:%M:%S") 61 | self.g_paged = Results( 62 | pt_filter = "bieber" 63 | , max_results = 500 64 | , start = tmp_start 65 | , end = tmp_end 66 | , count_bucket = None 67 | , show_query = False 68 | , paged = True 69 | , **self.params) 70 | tmp = len(list(self.g_paged.get_activities())) 71 | self.assertGreater(tmp, 1000) 72 | self.g_paged = Results( 73 | pt_filter = "bieber" 74 | , max_results = 500 75 | , start = tmp_start 76 | , end = tmp_end 77 | , count_bucket = None 78 | , show_query = False 79 | , paged = True 80 | , output_file_path = "." 
81 | , **self.params) 82 | self.assertEqual(len(list(self.g_paged.get_activities())), tmp) 83 | 84 | def test_get_time_series(self): 85 | self.g = Results( 86 | pt_filter = "bieber" 87 | , max_results = 10 88 | , start = None 89 | , end = None 90 | , count_bucket = "hour" 91 | , show_query = False 92 | , **self.params) 93 | self.assertGreater(len(list(self.g.get_time_series())), 24*30) 94 | 95 | def test_get_top_links(self): 96 | self.g = Results( 97 | pt_filter = "bieber" 98 | , max_results = 200 99 | , start = None 100 | , end = None 101 | , count_bucket = None 102 | , show_query = False 103 | , **self.params) 104 | self.assertEqual(len(list(self.g.get_top_links(n = 5))), 5) 105 | self.assertEqual(len(list(self.g.get_top_links(n = 10))),10) 106 | # 107 | tmp_start = datetime.datetime.strftime( 108 | datetime.datetime.now() + datetime.timedelta(seconds = -60) 109 | ,"%Y-%m-%dT%H:%M:%S") 110 | tmp_end = datetime.datetime.strftime( 111 | datetime.datetime.now() 112 | ,"%Y-%m-%dT%H:%M:%S") 113 | self.g_paged = Results( 114 | pt_filter = "bieber" 115 | , max_results = 500 116 | , start = tmp_start 117 | , end = tmp_end 118 | , count_bucket = None 119 | , show_query = False 120 | , paged = True 121 | , **self.params) 122 | self.assertEqual(len(list(self.g_paged.get_top_links(n = 100))), 100) 123 | 124 | def test_top_users(self): 125 | self.g = Results( 126 | pt_filter = "bieber" 127 | , max_results = 200 128 | , start = None 129 | , end = None 130 | , count_bucket = None 131 | , show_query = False 132 | , **self.params) 133 | self.assertEqual(len(list(self.g.get_top_users(n = 5))), 5) 134 | self.assertEqual(len(list(self.g.get_top_users(n = 10))), 10) 135 | # 136 | tmp_start = datetime.datetime.strftime( 137 | datetime.datetime.now() + datetime.timedelta(seconds = -60) 138 | ,"%Y-%m-%dT%H:%M:%S") 139 | tmp_end = datetime.datetime.strftime( 140 | datetime.datetime.now() 141 | ,"%Y-%m-%dT%H:%M:%S") 142 | self.g_paged = Results( 143 | pt_filter = "bieber" 144 | , max_results = 500 145 | , start = tmp_start 146 | , end = tmp_end 147 | , count_bucket = None 148 | , show_query = False 149 | , paged = True 150 | , **self.params) 151 | self.assertEqual(len(list(self.g_paged.get_top_users(n = 100))), 100) 152 | self.assertEqual(len(list(self.g.get_frequency_items(8))), 8) 153 | 154 | def test_top_grams(self): 155 | self.g = Results( 156 | pt_filter = "bieber" 157 | , max_results = 200 158 | , start = None 159 | , end = None 160 | , count_bucket = None 161 | , show_query = False 162 | , **self.params) 163 | self.assertEqual(len(list(self.g.get_top_grams(n = 5))) , 10) 164 | self.assertEqual(len(list(self.g.get_top_grams(n = 10))) , 20) 165 | self.assertEqual(len(list(self.g.get_frequency_items(8))), 16) 166 | # 167 | tmp_start = datetime.datetime.strftime( 168 | datetime.datetime.now() + datetime.timedelta(seconds = -60) 169 | ,"%Y-%m-%dT%H:%M:%S") 170 | tmp_end = datetime.datetime.strftime( 171 | datetime.datetime.now() 172 | ,"%Y-%m-%dT%H:%M:%S") 173 | self.g_paged = Results( 174 | pt_filter = "bieber" 175 | , max_results = 500 176 | , start = tmp_start 177 | , end = tmp_end 178 | , count_bucket = None 179 | , show_query = False 180 | , paged = True 181 | , **self.params) 182 | self.assertEqual(len(list(self.g_paged.get_top_grams(n = 100))), 200) 183 | 184 | def test_get_geo(self): 185 | self.g = Results( 186 | pt_filter = "bieber has:geo" 187 | , max_results = 200 188 | , start = None 189 | , end = None 190 | , count_bucket = None 191 | , show_query = False 192 | , **self.params) 193 | tmp = 
len(list(self.g.get_geo())) 194 | self.assertGreater(201, tmp) 195 | self.assertGreater(tmp, 10) 196 | 197 | if __name__ == "__main__": 198 | unittest.main() 199 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description-file = README.md 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from distutils.core import setup 2 | 3 | setup( 4 | name='gapi', 5 | version='1.0.2', 6 | author='Scott Hendrickson, Josh Montague, Jeff Kolb', 7 | author_email='scott@drskippy.net', 8 | packages=['search'], 9 | scripts=['gnip_search.py', 'gnip_time_series.py'], 10 | url='https://github.com/DrSkippy/Gnip-Python-Search-API-Utilities', 11 | download_url='https://github.com/DrSkippy/Gnip-Python-Search-API-Utilities/tags/', 12 | license='LICENSE.txt', 13 | description='Simple utilties to explore the Gnip search API', 14 | install_requires=[ 15 | "gnacs >= 1.1.0" 16 | , "sngrams >= 0.2.0" 17 | , "requests > 2.4.0" 18 | ], 19 | extras_require = { 20 | 'timeseries': ["numpy >= 1.10.1" 21 | , "scipy >= 0.16.1" 22 | , "statsmodels >= 0.6.1" 23 | , "matplotlib >= 1.5.0" 24 | , "pandas >= 0.17.0" 25 | ], 26 | } 27 | ) 28 | -------------------------------------------------------------------------------- /test_search.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | ### 4 | ### edit creds 5 | ### 6 | un=email 7 | un=shendrickson@gnip.com 8 | paswd=password 9 | paswd=$1 10 | 11 | if [ ! -d data ]; then 12 | mkdir data 13 | fi 14 | 15 | rulez="bieber OR bieber" 16 | if [ $(uname) == "Linux" ]; then 17 | dt1=$(date --date="1 day ago" +%Y-%m-%dT00:00:00) 18 | dt2=$(date --date="2 days ago" +%Y-%m-%dT00:00:00) 19 | dt3=$(date --date="2 days ago" +%Y-%m-%dT23:55:00) 20 | else 21 | dt1=$(date -v-1d +%Y-%m-%dT00:00:00) 22 | dt2=$(date -v-2d +%Y-%m-%dT00:00:00) 23 | dt3=$(date -v-2d +%Y-%m-%dT23:55:00) 24 | fi 25 | 26 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -q json 27 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 json 28 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 geo 29 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 wordcount 30 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 timeline 31 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 users 32 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -c geo 33 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -c timeline 34 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" json 35 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" geo 36 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" wordcount 37 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -n10 -s"$dt2" -e"$dt1" users 38 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -s"$dt3" -e"$dt1" -aw ./data json 39 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -s"$dt3" -e"$dt1" -aw ./data geo 40 | ./gnip_search.py -f"has:geo $rulez" -u ${un} -p ${paswd} -s"$dt3" -e"$dt1" -a users 41 | 42 | export GNIP_CONFIG_FILE=./.gnip 43 | ./gnip_search.py -f"has:geo $rulez" -n10 -q json 44 | ./gnip_search.py -f"has:geo $rulez" -n10 json 45 | 
./gnip_search.py -f"has:geo $rulez" -n10 geo 46 | ./gnip_search.py -f"has:geo $rulez" -n10 wordcount 47 | ./gnip_search.py -f"has:geo $rulez" -n10 timeline 48 | ./gnip_search.py -f"has:geo $rulez" -n10 users 49 | ./gnip_search.py -f"has:geo $rulez" -n10 -c geo 50 | ./gnip_search.py -f"has:geo $rulez" -n10 -c timeline 51 | ./gnip_search.py -f"has:geo $rulez" -n10 -s"$dt2" -e"$dt1" json 52 | ./gnip_search.py -f"has:geo $rulez" -n10 -s"$dt2" -e"$dt1" geo 53 | ./gnip_search.py -f"has:geo $rulez" -n10 -s"$dt2" -e"$dt1" wordcount 54 | ./gnip_search.py -f"has:geo $rulez" -n10 -s"$dt2" -e"$dt1" users 55 | ./gnip_search.py -f"has:geo $rulez" -s"$dt3" -e"$dt1" -aw ./data json 56 | 57 | --------------------------------------------------------------------------------