├── Dockerfile
├── README.md
└── twecoll

/Dockerfile:
--------------------------------------------------------------------------------
# pinned to a Debian release that still ships the Python 2 toolchain
FROM debian:stretch

USER root

ENV DEBIAN_FRONTEND noninteractive
ENV PATH /twecoll:$PATH

RUN apt-get update && apt-get install -y \
    build-essential libxml2-dev zlib1g-dev python-dev python-pip \
    pkg-config libffi-dev libcairo-dev git
RUN pip install python-igraph
RUN pip install --upgrade cffi
RUN pip install cairocffi

# the clone lands in /twecoll, which is already on the PATH
RUN git clone https://github.com/jdevoo/twecoll.git
# reuse the OAuth credentials twecoll writes to ~/.twecoll on first run
ADD .twecoll /root

WORKDIR /app
VOLUME /app

ENTRYPOINT ["twecoll"]
CMD ["-v"]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Twecoll is a Twitter command-line tool written in Python. It can be used to retrieve data from Twitter and to purge likes (its only data-altering feature). It follows a sub-command principle: each call to twecoll starts with a keyword that tells twecoll what to do. Below is a list of examples followed by a brief explanation of each command. Running twecoll requires Python 2.7 and the argparse library. It was tested with igraph 0.6 and 0.7.1. The igraph library is optional and is used to generate a clustered graph of the network.

Note: I am not planning to maintain this further. If you don't care about the likes, have a look at [nucoll](https://github.com/jdevoo/nucoll), which provides similar functionality.

## Contributors

Thank you to [@lucahammer](https://github.com/lucahammer) and [@PeterTheOne](https://github.com/PeterTheOne) for contributing time, feedback & pull requests to this project.

## Installation

Place twecoll in your path and create a working directory for the data it collects. Twecoll creates a number of files and folders there:

* fdat: directory containing friends-of-friends files
* img: directory containing avatar images of friends
* .dat: extension of the account details file (friends, followers, avatar URL, etc. for account friends)
* .twt: extension of the tweets file (timestamp, tweet)
* .fav: extension of the likes file (id, timestamp, user id, screen name, tweet)
* .gml: extension of the edgelist file (nodes and edges)
* .f: friends data (stored under fdat)

Twecoll uses OAuth and has been updated to support version 1.1 of the Twitter REST API. Register your own copy of twecoll on http://apps.twitter.com and copy the consumer key and secret.

The first time you run a twecoll command, it asks for the consumer key and consumer secret and then retrieves the OAuth token. Follow the instructions on the console. An HTTP Error 401 is thrown if the key and secret cannot be used to retrieve the access token details.

## Examples

#### Download and Purge Likes
Historically, this was twecoll's main use: downloading all favorited/liked tweets into a file for search purposes. Let's take the handle 'jdevoo' as an example.

```
$ twecoll likes jdevoo
```

This produces a jdevoo.fav file containing all likes, each with tweet ID, timestamp, user ID, handle and text (utf-8).
In order to purge the likes, twecoll needs the .fav file. You can then execute:

```
$ twecoll likes -p jdevoo
```

This is the only command that alters account data. You will need to select the Read+Write permission model when registering twecoll for this to work.
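
Because the .fav file uses fixed-width columns, it can be sliced in a few lines of Python for ad-hoc searching. Below is a minimal sketch under that assumption; `read_likes` is a hypothetical helper (not part of twecoll), and the offsets follow the `'%-20s %30s %-12s @%-20s %s'` format string twecoll writes with:

```
# Hypothetical helper: slice the fixed-width columns of a .fav file
# (tweet id, timestamp, user id, @handle, text) written by `twecoll likes`.
import codecs

def read_likes(filename):
    for line in codecs.open(filename, 'r', encoding='utf-8'):
        yield (line[0:20].strip(),     # tweet id
               line[21:51].strip(),    # created_at timestamp
               line[52:64].strip(),    # user id
               line[66:86].strip(),    # screen name, skipping the '@'
               line[87:].rstrip('\n')) # tweet text

# Example: grep your likes for a keyword.
for tweet_id, created, user_id, handle, text in read_likes('jdevoo.fav'):
    if 'python' in text.lower():
        print('%s @%s: %s' % (created, handle, text))
```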

#### Downloading Tweets
Twecoll can download up to 3200 tweets for a handle (the ceiling imposed by the Twitter API) or run search queries.

```
$ twecoll tweets jdevoo
```

This generates a jdevoo.twt file containing all tweets with timestamp and text (utf-8).
In order to search for tweets related to a certain hashtag, or to run a more advanced query, use the -q switch with double-quotes around the query string:

```
$ twecoll tweets -q "#dg2g"
```

This will also generate a .twt file, named after the URL-encoded search string.

#### Generating a Graph
It is possible to generate a GML file of your first- and second-degree relationships on Twitter. This is a two-step process that takes time due to API throttling by Twitter. In order to generate the graph, twecoll retrieves the handle's friends (or followers) and all friends of friends (2nd-degree relationships). It then calculates the relations between those, ignoring 2nd-degree relationships to which the handle is not connected. In other words, it looks only for friend relationships among the friends/followers of the handle or the query tweets initially supplied.

First, retrieve the handle details:

```
$ twecoll init jdevoo
```

This generates a jdevoo.dat file and populates an img directory with avatar images. It is also possible to initialize from a .twt file using the -q option. Next, retrieve the friends of each entry in the .dat file:

```
$ twecoll fetch jdevoo
```

This populates the fdat directory. You can now generate the graph file using the defaults:

```
$ twecoll edgelist jdevoo
```

This generates a jdevoo.gml file in Graph Modeling Language (GML). If the Python version of igraph is installed, a .png visualization of the GML data is also generated. You can also use other packages, e.g. Gephi, to visualize the GML file.
The GML file includes the friends, followers, memberships and statuses counts as node properties. Where the relevant counts are non-zero, the friends-to-followers and listed-to-followers ratios are also calculated.
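
Beyond the built-in rendering, the generated GML file can be post-processed with any graph library. Below is a minimal sketch assuming python-igraph is installed and that jdevoo.gml was produced as above; the output file name and layout choice are only examples:

```
# Sketch: inspect and re-render the graph produced by `twecoll edgelist jdevoo`.
import igraph

g = igraph.load('jdevoo.gml')
print('%d handles, %d follow relationships' % (g.vcount(), g.ecount()))

# Node attributes written by twecoll include label, friends, followers, ffr and lfr.
top = sorted(g.vs, key=lambda v: v.indegree(), reverse=True)[:10]
for v in top:
    print('%-20s in-degree %d' % (v['label'], v.indegree()))

# Re-render with the Fruchterman-Reingold layout instead of twecoll's default (kk).
igraph.plot(g, 'jdevoo_fr.png', layout=g.layout('fr'), bbox=(800, 800))
```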

See also the [wiki](https://github.com/jdevoo/twecoll/wiki) section for more ideas.

## Usage

Twecoll has built-in help, version and API status switches, invoked with -h, -v and -s respectively. Each command can also be invoked with the help switch for additional information about its sub-options.

```
$ twecoll -h
usage: twecoll [-h] [-v] [-s]
               {resolve,init,fetch,tweets,likes,edgelist} ...

Twitter Collection Tool

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -s, --stats           show Twitter throttling stats and exit

sub-commands:
  {resolve,init,fetch,tweets,likes,edgelist}
    resolve             retrieve user_id for screen_name or vice versa
    init                retrieve friends data for screen_name
    fetch               retrieve friends of handles in .dat file
    tweets              retrieve tweets
    likes               retrieve likes
    edgelist            generate graph in GML format
```

## Changes

* Version 1.1
  - Initial commit
* Version 1.2
  - Added option to init to retrieve followers instead of friends
* Version 1.3
  - Simplified metrics now included in GML file
* Version 1.4
  - Simplified membership retrieval and improved graphs
* Version 1.5
  - Changes to community finding and visualization
* Version 1.6
  - Added support for multiple arguments in edgelist
* Version 1.7
  - Added ability to add list members to dat file
* Version 1.8
  - Fetch tweets from list for a given user
* Version 1.9
  - Renamed favorites to likes
* Version 1.10
  - Restored possibility to mix files using edgelist
* Version 1.11
  - Suppress nodes with missing data in edgelist by default
* Version 1.12
  - Improved init
* Version 1.13
  - Added option to skip mentions from queries in init
--------------------------------------------------------------------------------
/twecoll:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | '''
3 | Permission is hereby granted, free of charge, to any person obtaining a
4 | copy of this software and associated documentation files (the "Software"),
5 | to deal in the Software without restriction, including without limitation
6 | the rights to use, copy, modify, merge, publish, distribute, sublicense,
7 | and/or sell copies of the Software, and to permit persons to whom the
8 | Software is furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in
11 | all copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
18 | FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
19 | DEALINGS IN THE SOFTWARE.
20 | ''' 21 | 22 | import argparse 23 | import urlparse 24 | import urllib2 25 | import urllib 26 | import hashlib 27 | import base64 28 | import hmac 29 | import json 30 | import sys 31 | import os 32 | import time 33 | import datetime 34 | import random 35 | import math 36 | import csv 37 | import re 38 | import codecs 39 | import errno 40 | import socket 41 | 42 | __version__ = '1.13' 43 | ALPHANUM = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' 44 | 45 | FDAT_DIR = 'fdat' 46 | IMG_DIR = 'img' 47 | TT_EXT = '.dat' 48 | FDAT_EXT = '.f' 49 | FAV_EXT = '.fav' 50 | TWT_EXT = '.twt' 51 | FMAX = 5000 52 | EDG_EXT = '.gml' # Graph Modeling Language 53 | WAIT_CODES = (400, 503, 429) 54 | RETRY_CODES = (500, 502, 110) 55 | SKIP_CODES = (401, 403, 404) 56 | SHAPES = ['box', 'circle', 'rect', 'triangle-up', 'diamond', 'triangle-down'] 57 | 58 | from collections import namedtuple 59 | Consumer = namedtuple('Consumer', 'key secret') 60 | Token = namedtuple('Token', 'key secret') 61 | 62 | def _quote(text): 63 | return urllib.quote(text, '-._~') 64 | 65 | def _encode(params): 66 | return '&'.join(['%s=%s' % (k, v) for k, v in params]) 67 | 68 | def _parse_uri(req): 69 | method = req.get_method() 70 | if method == 'POST': 71 | uri = req.get_full_url() 72 | query = req.get_data() or '' 73 | else: 74 | url = req.get_full_url() 75 | if url.find('?') != -1: 76 | uri, query = req.get_full_url().split('?', 1) 77 | else: 78 | uri = url 79 | query = '' 80 | return method, uri, query 81 | 82 | class Request(urllib2.Request): 83 | def __init__(self, url, \ 84 | data=None, headers={}, origin_req_host=None, unverifiable=False, \ 85 | method=None, oauth_params={}): 86 | urllib2.Request.__init__( \ 87 | self, url, data, headers, origin_req_host, unverifiable) 88 | self.method = method 89 | self.oauth_params = oauth_params 90 | 91 | def get_method(self): 92 | if self.method is not None: 93 | return self.method 94 | if self.has_data(): 95 | return 'POST' 96 | else: 97 | return 'GET' 98 | 99 | class OAuthHandler(urllib2.BaseHandler): 100 | def __init__(self, consumer, token=None, timeout=None): 101 | self.consumer = consumer 102 | self.token = token 103 | self.timeout = timeout 104 | 105 | def get_signature(self, method, uri, query): 106 | key = '%s&' % _quote(self.consumer.secret) 107 | if self.token is not None: 108 | key += _quote(self.token.secret) 109 | signature_base = '&'.join((method.upper(), _quote(uri), _quote(query))) 110 | signature = hmac.new(str(key), signature_base, hashlib.sha1) 111 | return base64.b64encode(signature.digest()) 112 | 113 | def http_request(self, req): 114 | if not req.has_header('Host'): 115 | req.add_header('Host', req.get_host()) 116 | method, uri, query = _parse_uri(req) 117 | if method == 'POST': 118 | req.add_header('Content-type', 'application/x-www-form-urlencoded') 119 | 120 | query = map(lambda (k, v): (k, urllib.quote(v)), urlparse.parse_qsl(query)) 121 | 122 | oauth_params = [ 123 | ('oauth_consumer_key', self.consumer.key), 124 | ('oauth_signature_method', 'HMAC-SHA1'), 125 | ('oauth_timestamp', int(time.time())), 126 | ('oauth_nonce', ''.join([random.choice(ALPHANUM) for i in range(16)])), 127 | ('oauth_version', '1.0')] 128 | if self.token is not None: 129 | oauth_params.append(('oauth_token', self.token.key)) 130 | if hasattr(req, 'oauth_params'): 131 | oauth_params += req.oauth_params.items() 132 | 133 | query += oauth_params 134 | query.sort() 135 | signature = self.get_signature(method, uri, _encode(query)) 136 | 137 | 
oauth_params.append(('oauth_signature', _quote(signature))) 138 | oauth_params.sort() 139 | 140 | auth = ', '.join(['%s="%s"' % (k, v) for k, v in oauth_params]) 141 | req.headers['Authorization'] = 'OAuth ' + auth 142 | 143 | req = Request(req.get_full_url(), \ 144 | data=req.get_data(), \ 145 | headers=req.headers, \ 146 | origin_req_host=req.get_origin_req_host(), \ 147 | unverifiable=req.is_unverifiable(), \ 148 | method=method) 149 | 150 | req.timeout = self.timeout 151 | return req 152 | 153 | def https_request(self, req): 154 | return self.http_request(req) 155 | 156 | def _replace_opener(): 157 | filename = os.path.expanduser('~')+'/.'+os.path.basename(sys.argv[0]) 158 | if os.path.isfile(filename): 159 | f = open(filename, 'r') 160 | lines = f.readlines() 161 | key = lines[0].strip() 162 | secret = lines[1].strip() 163 | else: 164 | sys.stderr.write('''TWITTER API AUTHENTICATION SETUP 165 | (1) Open the following link in your browser and register this script... 166 | '>>> https://apps.twitter.com/\n''') 167 | sys.stderr.write('What is its consumer key? ') 168 | key = sys.stdin.readline().rstrip('\r\n') 169 | sys.stderr.write('What is its consumer secret? ') 170 | secret = sys.stdin.readline().rstrip('\r\n') 171 | lines = [key, secret] 172 | consumer = Consumer(key, secret) 173 | try: 174 | oauth = lines[2].strip() 175 | oauth_secret = lines[3].strip() 176 | atoken = Token(oauth, oauth_secret) 177 | except IndexError: 178 | opener = urllib2.build_opener(OAuthHandler(consumer)) 179 | resp = opener.open(Request('https://api.twitter.com/oauth/request_token')) 180 | rtoken = urlparse.parse_qs(resp.read()) 181 | rtoken = Token(rtoken['oauth_token'][0], rtoken['oauth_token_secret'][0]) 182 | sys.stderr.write('''(2) Now, open this link and authorize the script... 183 | '>>> https://api.twitter.com/oauth/authorize?oauth_token=%s\n''' % rtoken.key) 184 | sys.stderr.write('What is the PIN? 
') 185 | verifier = sys.stdin.readline().rstrip('\r\n') 186 | opener = urllib2.build_opener(OAuthHandler(consumer, rtoken)) 187 | resp = opener.open( \ 188 | Request('https://api.twitter.com/oauth/access_token', \ 189 | oauth_params={'oauth_verifier': verifier})) 190 | atoken = urlparse.parse_qs(resp.read()) 191 | atoken = Token(atoken['oauth_token'][0], atoken['oauth_token_secret'][0]) 192 | f = open(filename, 'w') 193 | f.write(key+'\n') 194 | f.write(secret+'\n') 195 | f.write(atoken.key+'\n') 196 | f.write(atoken.secret+'\n') 197 | f.close() 198 | sys.stderr.write('Setup complete and %s created.\n' % filename) 199 | opener = urllib2.build_opener(OAuthHandler(consumer, atoken)) 200 | urllib2.install_opener(opener) 201 | 202 | def _fetch_img(user_id, avatar): 203 | conn = urllib.urlopen(avatar) 204 | data = conn.read() 205 | info = conn.info().get('Content-Type').lower() 206 | conn.close() 207 | if not os.path.exists(IMG_DIR): 208 | os.makedirs(IMG_DIR) 209 | filename = IMG_DIR+'/'+str(user_id) 210 | if info == 'image/gif': 211 | filename += '.gif' 212 | elif info == 'image/jpeg' or info == 'image/pjpeg': 213 | filename += '.jpg' 214 | elif info == 'image/png': 215 | filename += '.png' 216 | file = open(filename, 'wb') 217 | file.write(data) 218 | file.close() 219 | return filename 220 | 221 | def resolve(args): 222 | for sn in args.screen_name: 223 | try: 224 | if sn.isdigit(): 225 | url = 'https://api.twitter.com/1.1/users/show.json?user_id=%s' 226 | else: 227 | url = 'https://api.twitter.com/1.1/users/show.json?screen_name=%s' 228 | conn = urllib2.urlopen(url % sn) 229 | except urllib2.HTTPError, e: 230 | if e.code in SKIP_CODES: 231 | sys.stdout.write('HTTPError %s with %s. Skipping...\n' % (e.code, sn)) 232 | continue 233 | else: 234 | raise 235 | data = json.loads(conn.read()) 236 | conn.close() 237 | if sn.isdigit(): 238 | sys.stdout.write(data['screen_name']+' ') 239 | else: 240 | sys.stdout.write(data['id_str']+' ') 241 | sys.stdout.write('(%s friends | %s followers | %s memberships | %s tweets)\n' % (\ 242 | data['friends_count'], data['followers_count'], data['listed_count'], data['statuses_count'])) 243 | 244 | class CursorError(urllib2.HTTPError): 245 | def __init__(self, code, cursor=None, res=None, headers=None): 246 | urllib2.HTTPError.__init__(self, None, code, None, None, None) 247 | self.cursor = cursor 248 | self.res = res 249 | self.headers = headers 250 | 251 | # Results are given in groups of 5,000 252 | def _ids(relation, param, c=None, init=None): 253 | res = init or [] 254 | cursor = c or -1 255 | while True: 256 | try: 257 | url = 'https://api.twitter.com/1.1/%s/ids.json?%s&cursor=%s' 258 | conn = urllib2.urlopen(url % (relation, param, cursor)) 259 | except urllib2.HTTPError, e: 260 | raise CursorError(e.code, cursor, res, e.headers) 261 | data = json.loads(conn.read()) 262 | conn.close() 263 | res = res + data['ids'] 264 | if data['next_cursor'] != 0: 265 | cursor = data['next_cursor'] 266 | else: 267 | break 268 | return res 269 | 270 | def _members(param, c=None, init=None): 271 | res = init or [] 272 | cursor = c or -1 273 | while True: 274 | try: 275 | url = 'https://api.twitter.com/1.1/lists/members.json?%s&cursor=%s&include_entities=false&skip_status=true' 276 | conn = urllib2.urlopen(url % (param, cursor)) 277 | except urllib2.HTTPError, e: 278 | raise CursorError(e.code, cursor, res) 279 | data = json.loads(conn.read()) 280 | conn.close() 281 | for user in data['users']: 282 | res.append(user['id']) 283 | if data['next_cursor'] != 0: 284 | cursor = 
data['next_cursor'] 285 | else: 286 | break 287 | return res 288 | 289 | # first retrieve list of IDs from relation, tweets or list memberships 290 | # then retrieve details for each 100 at a time 291 | def init(args): 292 | if args.l: 293 | type = args.l 294 | else: 295 | type = 'followers' if args.followers else 'friends' 296 | if args.query: 297 | # retrieve handles from file 298 | bag = [] 299 | users = [] 300 | for tweet in open(args.screen_name+TWT_EXT): 301 | handles = re.findall(r'@([\w]+)', tweet.lower()) 302 | if len(handles) > 0: 303 | if args.nomention: 304 | handles = [handles[0]] 305 | bag += handles 306 | users.append(handles[0]) 307 | bag = list(set(bag)) 308 | users = list(set(users)) 309 | else: 310 | # retrieve IDs from relationship 311 | cursor = res = None 312 | while True: 313 | try: 314 | if args.l: 315 | bag = _members('slug='+args.l+'&owner_screen_name='+args.screen_name, cursor, res) 316 | else: 317 | url = 'https://api.twitter.com/1.1/users/show.json?screen_name=%s' 318 | conn = urllib2.urlopen(url % args.screen_name) 319 | data = json.loads(conn.read()) 320 | conn.close() 321 | bag = [data['id']] + _ids(type, 'screen_name='+args.screen_name, cursor, res) 322 | break 323 | except CursorError, e: 324 | if e.code in WAIT_CODES: 325 | waiting_time = int(e.headers['x-rate-limit-reset']) - int(round(time.time())) + random.randint(10,30) 326 | sys.stderr.write('HTTPError %s at %s. Waiting %sm to resume...' % \ 327 | (e.code, time.strftime('%H:%M', time.localtime()), waiting_time/60)) 328 | time.sleep(waiting_time) 329 | cursor = e.cursor 330 | res = e.res 331 | sys.stderr.write('\n') 332 | continue 333 | else: 334 | raise 335 | filename = args.screen_name+TT_EXT 336 | if not args.force and os.path.isfile(filename): 337 | # ignore previously processed data found in file 338 | f = open(filename, 'r+') 339 | for item in csv.reader(f): 340 | try: 341 | bag.remove(item[1].lower() if args.query else int(item[0])) 342 | except: 343 | continue 344 | else: 345 | f = open(filename, 'w') 346 | # retrieve details for all IDs in the bag 347 | while len(bag) > 0: 348 | next_items = [] 349 | if len(bag) > 100: 350 | while len(next_items) < 100: 351 | next_items = [bag.pop() for _ in xrange(100)] 352 | else: 353 | next_items = bag 354 | bag = [] 355 | sys.stdout.write('Processing %i starting from %s...\n' % (len(next_items), next_items[0])) 356 | try: 357 | if isinstance(next_items[0], str): 358 | url = 'https://api.twitter.com/1.1/users/lookup.json?screen_name=%s&include_entities=false' 359 | items_str = ','.join(next_items) 360 | else: 361 | url = 'https://api.twitter.com/1.1/users/lookup.json?user_id=%s&include_entities=false' 362 | items_str = ','.join(map(str, next_items)) 363 | conn = urllib2.urlopen(url % items_str) 364 | except (urllib2.HTTPError, socket.error) as e: 365 | if hasattr(e, 'code') and e.code in SKIP_CODES: 366 | # 404 if no lookup criteria could be satisified 367 | sys.stdout.write('HTTPError %s starting from %s. Skipping...\n' % (e.code, next_items[0])) 368 | continue 369 | elif (hasattr(e, 'code') and e.code in RETRY_CODES) or \ 370 | (hasattr(e, 'errno') and e.errno in (errno.ECONNRESET, errno.EAGAIN)): 371 | bag += next_items 372 | sys.stderr.write('HTTPError %s starting from %s. Deferred...' 
% (e.code, next_items[0])) 373 | sys.stderr.flush() 374 | time.sleep(random.randint(10,30)) 375 | sys.stderr.write('\n') 376 | continue 377 | elif hasattr(e, 'code') and e.code in WAIT_CODES: 378 | bag += next_items 379 | waiting_time = int(e.headers['x-rate-limit-reset']) - int(round(time.time())) + random.randint(10,30) 380 | sys.stderr.write('HTTPError %s at %s. Waiting %sm to resume (%s items left)...' % (e.code, time.strftime('%H:%M', time.localtime()), waiting_time/60, len(bag))) 381 | sys.stderr.flush() 382 | time.sleep(waiting_time) 383 | sys.stderr.write('\n') 384 | continue 385 | else: 386 | sys.stderr.write('\n') 387 | raise 388 | response = json.loads(conn.read()) 389 | conn.close() 390 | # write details for each block to disk 391 | for data in response: 392 | user_id = data['id_str'] 393 | name = data['screen_name'] 394 | url = data['url'] if data['url'] is not None else '' 395 | avatar = data['profile_image_url'] if data['profile_image_url'] is not None else '' 396 | if avatar != '': 397 | try: 398 | avatar = _fetch_img(user_id, avatar) 399 | except: 400 | avatar = '' 401 | location = data['location'].replace(',', ' ').replace('\n', ' ').replace('\r', ' ') 402 | f.write(user_id+ \ 403 | ','+name+ \ 404 | ','+('mention' if args.query and name.lower() not in users else type)+ \ 405 | ','+str(data['friends_count'])+ \ 406 | ','+str(data['followers_count'])+ \ 407 | ','+str(data['listed_count'])+ \ 408 | ','+str(data['statuses_count'])+ \ 409 | ','+data['created_at']+ \ 410 | ','+url.encode('ascii', 'replace')+ \ 411 | ','+avatar+ \ 412 | ','+location.encode('ascii', 'replace')+'\n') 413 | 414 | # used for igraph visualization 415 | def _palette(rng): 416 | import colorsys 417 | HSV_tuples = [(t*1.0/rng, 0.75, 1.0) for t in range(rng)] 418 | return map(lambda x: colorsys.hsv_to_rgb(*x), HSV_tuples) 419 | 420 | # requires igraph 421 | def _draw(args): 422 | import igraph 423 | g = igraph.load('_'.join(args.screen_name)+EDG_EXT) 424 | sys.stdout.write('%s Handles\n%s Follow Relationships\n' % (g.vcount(), g.ecount())) 425 | sys.stdout.write('Avg Shortest Path = %.6f\n' % g.average_path_length()) 426 | sys.stdout.write('In-Degree Distribution mean = %.6f, sd = %.6f\n' % (g.degree_distribution(mode=igraph.IN).mean, g.degree_distribution(mode=igraph.IN).sd)) 427 | sys.stdout.write('Clustering Coefficient = %.6f\n' % g.transitivity_undirected()) 428 | sys.stdout.write('Degree Assortativity = %.6f\n' % g.assortativity_degree()) 429 | width = height = int(750+g.vcount()*4.73-g.vcount()**2*1.55e-3) 430 | comp = g.clusters(igraph.STRONG if args.strong else igraph.WEAK) 431 | gc = comp.giant() 432 | if len(args.screen_name) > 1 and not args.strong: 433 | RGB_dict = dict(zip(SHAPES, _palette(len(SHAPES)))) 434 | for v in g.vs: 435 | v['color'] = RGB_dict[v['shape']] 436 | else: 437 | for v in g.vs: 438 | v['color'] = (.8, .8, .8) 439 | if gc.ecount() > 0: 440 | gc_size = gc.vcount() 441 | sys.stdout.write('Fraction Handles in Giant Component = %.2f%%\n' % (100.0*gc_size/g.vcount())) 442 | sys.stdout.write('GC Diameter %i (unweighted)\n' % gc.diameter()) 443 | gc_vert = comp[comp.sizes().index(gc_size)] 444 | cim = gc.community_infomap(edge_weights=gc.es['weight'], vertex_weights=gc.vs['lfr']) 445 | RGB_tuples = _palette(max(cim.membership)+1) 446 | for v in g.vs: 447 | gc_v = gc.vs.select(id_eq=v['id']) 448 | if len(gc_v) > 0: 449 | v['color'] = RGB_tuples[cim.membership[gc_v[0].index]] 450 | g.es['color'] = [g.vs['color'][e.target] for e in g.es] 451 | E = max(g.es['weight']) 452 | 
g.es['width'] = [(not args.transparent)*max(1, 10*e['weight']/E) for e in g.es] 453 | g.es['arrow_size'] = [max(1, 3*e['weight']/E) for e in g.es] 454 | ID = max(g.vs.indegree()) or 1 455 | g.vs['label_size'] = [max(12, 36*v.indegree()/ID) for v in g.vs] 456 | for v in g.vs: 457 | if v['type'] == 'mention': 458 | v['label_color'] = 'blue' 459 | filename = '%s/%s%s' % (FDAT_DIR, v['userid'], FDAT_EXT) 460 | if not os.path.isfile(filename): 461 | v['color'] = (0, 0, 0) 462 | filename = '_'.join(args.screen_name) + \ 463 | ('_s' if args.strong else '_w') + \ 464 | ('_t' if args.transparent else '') + \ 465 | '.' + args.format 466 | igraph.plot(g, filename, layout=g.layout(args.layout), bbox=(width, height), margin=50) 467 | 468 | def _days(created): 469 | t = time.localtime() 470 | c = time.strptime(created, "%a %b %d %H:%M:%S +0000 %Y") 471 | return datetime.timedelta(seconds=(time.mktime(t)-time.mktime(c))).days 472 | 473 | # support commenting out lines in file 474 | def _skip_hash(iterable): 475 | for line in iterable: 476 | if not line.startswith('#'): 477 | yield line 478 | 479 | # generate GML file readable by Gephi or igraph 480 | def edgelist(args): 481 | dat = {} 482 | i = 0 483 | for sn in args.screen_name: 484 | for item in csv.reader(_skip_hash(open(sn+TT_EXT))): 485 | if not dat.has_key(item[0]): 486 | dat[item[0]] = [i]+item[1:11]+[sn] 487 | i = i + 1 488 | # key: user id 489 | # col 0: GML id 490 | # col 1: screen name 491 | # col 2: friends or followers or list name 492 | # col 3: friend count 493 | # col 4: follower count 494 | # col 5: listed count 495 | # col 6: tweets 496 | # col 7: date joined 497 | # col 8: url 498 | # col 9: avatar 499 | # col 10: location 500 | # col 11: handle arg 501 | id0 = {user_id: val[0] for user_id, val in dat.iteritems() if val[1] in args.screen_name} 502 | e = open('_'.join(args.screen_name)+EDG_EXT, 'w') 503 | e.write('graph [\n directed 1\n') 504 | for user_id, val in dat.iteritems(): 505 | if not args.ego and val[1] in args.screen_name: 506 | continue 507 | if not args.missing and not os.path.isfile(FDAT_DIR+'/'+str(user_id)+FDAT_EXT): 508 | continue 509 | e.write(''' node [ 510 | id %s 511 | userid "%s" 512 | file "%s.dat" 513 | label "%s" 514 | image "%s" 515 | type "%s" 516 | statuses %s 517 | friends %s 518 | followers %s 519 | listed %s''' % (val[0], user_id, val[11], val[1], os.path.abspath(val[9]), val[2], val[6], val[3], val[4], val[5])) 520 | ffr = (float(val[4])/float(val[3])) if float(val[3]) > 0 else 0 521 | lfr = (10*float(val[5])/float(val[4])) if float(val[4]) > 0 else 0 522 | e.write('\n ffr %.4f' % ffr) 523 | e.write('\n lfr %.4f' % lfr) 524 | if len(args.screen_name) > 1: 525 | sid = [i for i,s in enumerate(args.screen_name) if s == val[11]][0] 526 | e.write('\n shape "%s"' % SHAPES[sid]) 527 | else: 528 | if ffr > 1: 529 | e.write('\n shape "triangle-up"') 530 | else: 531 | e.write('\n shape "triangle-down"') 532 | e.write('\n ]\n') 533 | for id1, val in dat.iteritems(): 534 | if id1 in id0.keys(): 535 | continue 536 | if not args.missing and not os.path.isfile(FDAT_DIR+'/'+str(id1)+FDAT_EXT): 537 | continue 538 | if args.ego: 539 | for idzero in id0.values(): 540 | d = _days(val[7]) 541 | e.write(''' edge [ 542 | source %s 543 | target %s 544 | weight %s 545 | ] 546 | ''' % (idzero, val[0], float(val[6])/d if float(d) > 0 else 0)) 547 | filename = FDAT_DIR+'/'+str(id1)+FDAT_EXT 548 | if os.path.isfile(filename): 549 | fdat = open(filename).readlines() 550 | for line in fdat: 551 | id2 = line.strip() 552 | if 
dat.has_key(id2): 553 | if not args.ego and id2 in id0.keys(): 554 | continue 555 | if not args.missing and not os.path.isfile(FDAT_DIR+'/'+str(id2)+FDAT_EXT): 556 | continue 557 | d = _days(dat[id2][7]) 558 | e.write(''' edge [ 559 | source %s 560 | target %s 561 | weight %s 562 | ] 563 | ''' % (val[0], dat[id2][0], float(dat[id2][6])/d if float(d) > 0 else 0)) 564 | else: 565 | sys.stdout.write('Missing data for %s\n' % dat[id1][1]) 566 | e.write(']\n') 567 | e.close() 568 | sys.stdout.write('GML file created.\n') 569 | try: 570 | if len(args.screen_name) > 1: 571 | sys.stdout.write('Shapes mapping: %s -> %s\n' % (args.screen_name, SHAPES[0:len(args.screen_name)])) 572 | _draw(args) 573 | except: 574 | sys.stderr.write('Visualization skipped.\n') 575 | raise 576 | 577 | # retrieve friends 578 | def fetch(args): 579 | dat = [item for item in csv.reader(_skip_hash(open(args.screen_name+TT_EXT)))] 580 | if not os.path.exists(FDAT_DIR): 581 | os.makedirs(FDAT_DIR) 582 | cursor = res = None 583 | while len(dat) > 0: 584 | if cursor is None: 585 | item = dat.pop() 586 | if int(item[3]) > args.count: 587 | sys.stdout.write('Skipping %s (%s %s)\n' % (item[1], item[3], 'friends')) 588 | continue 589 | filename = FDAT_DIR+'/'+str(item[0])+FDAT_EXT 590 | if args.force or not os.path.isfile(filename): 591 | sys.stdout.write('Processing %s...\n' % item[0]) 592 | try: 593 | bag = _ids('friends', 'user_id='+str(item[0]), cursor, res) 594 | cursor = res = None 595 | except (CursorError, socket.error) as e: 596 | if hasattr(e, 'code') and e.code in SKIP_CODES: 597 | sys.stdout.write('HTTPError %s with %s. Skipping...\n' % \ 598 | (e.code, item[0])) 599 | cursor = res = None 600 | continue 601 | elif (hasattr(e, 'code') and e.code in RETRY_CODES) or \ 602 | (hasattr(e, 'errno') and e.errno in (errno.ECONNRESET, errno.EAGAIN)): 603 | sys.stderr.write('HTTPError %s with %s. Deferred...' % \ 604 | (e.code, item[0])) 605 | sys.stderr.flush() 606 | dat.append(item) 607 | time.sleep(random.randint(10,30)) 608 | cursor = res = None 609 | sys.stderr.write('\n') 610 | continue 611 | elif hasattr(e, 'code') and e.code in WAIT_CODES: 612 | waiting_time = int(e.headers['x-rate-limit-reset']) - int(round(time.time())) + random.randint(10,30) 613 | sys.stderr.write('HTTPError %s at %s. Waiting %sm to resume (%s items left)...' % (e.code, time.strftime('%H:%M', time.localtime()), waiting_time/60, len(dat)+1)) 614 | sys.stderr.flush() 615 | time.sleep(waiting_time) 616 | cursor = e.cursor 617 | res = e.res 618 | sys.stderr.write('\n') 619 | continue 620 | else: 621 | sys.stderr.write('\n') 622 | raise 623 | f = open(filename, 'w') 624 | for item in bag: 625 | f.write(str(item)+'\n') 626 | f.close() 627 | 628 | def _destroy(screen_name): 629 | filename = screen_name+FAV_EXT 630 | dat = [line[:20].strip() for line in open(filename, 'rU')] 631 | while len(dat) > 0: 632 | item = dat.pop() 633 | sys.stdout.write('Purging %s...\n' % item) 634 | try: 635 | url = 'https://api.twitter.com/1.1/favorites/destroy.json' 636 | data = {'id': item} 637 | conn = urllib2.urlopen(url, urllib.urlencode(data)) 638 | except urllib2.HTTPError, e: 639 | if e.code in SKIP_CODES: 640 | sys.stdout.write('HTTPError %s with %s. Skipping...\n' % (e.code, item)) 641 | continue 642 | elif e.code in RETRY_CODES: 643 | sys.stderr.write('HTTPError %s. Deferred...' 
% e.code) 644 | sys.stderr.flush() 645 | time.sleep(random.randint(10,30)) 646 | dat.append(item) 647 | sys.stderr.write('\n') 648 | continue 649 | elif e.code in WAIT_CODES: 650 | waiting_time = int(e.headers['x-rate-limit-reset']) - int(round(time.time())) + random.randint(10, 30) 651 | sys.stderr.write('HTTPError %s at %s. Waiting %sm to resume (%s items left)...' % (e.code, time.strftime('%H:%M', time.localtime()), waiting_time/60, len(dat)+1)) 652 | sys.stderr.flush() 653 | time.sleep(waiting_time + random.randint(10,30)) 654 | dat.append(item) 655 | sys.stderr.write('\n') 656 | continue 657 | else: 658 | sys.stderr.write('\n') 659 | raise 660 | data = json.loads(conn.read()) 661 | conn.close() 662 | 663 | # retrieve and optionally purge favorites 664 | def likes(args): 665 | if args.purge: 666 | sys.stderr.write('Press any key to purge likes or Ctrl-C to abort...') 667 | sys.stderr.flush() 668 | while not sys.stdin.read(1): 669 | pass 670 | _destroy(args.screen_name) 671 | else: 672 | filename = args.screen_name+FAV_EXT 673 | f = codecs.open(filename, 'w', encoding='utf-8') 674 | max_id = None 675 | sys.stderr.write('Fetching likes') 676 | sys.stderr.flush() 677 | while True: 678 | try: 679 | url = 'https://api.twitter.com/1.1/favorites/list.json?count=%s&screen_name=%s' 680 | if max_id: 681 | url += '&max_id=%i' % max_id 682 | conn = urllib2.urlopen(url % (200, args.screen_name)) 683 | except urllib2.HTTPError, e: 684 | if e.code in RETRY_CODES: 685 | sys.stderr.write('\nHTTPError %s. Retrying' % e.code) 686 | sys.stderr.flush() 687 | time.sleep(random.randint(10,30)) 688 | sys.stderr.write('\n') 689 | continue 690 | elif e.code in WAIT_CODES: 691 | waiting_time = int(e.headers['x-rate-limit-reset']) - int(round(time.time())) + random.randint(10,30) 692 | sys.stderr.write('\nHTTPError %s at %s. Waiting %sm to resume...' % \ 693 | (e.code, time.strftime('%H:%M', time.localtime()), waiting_time/60)) 694 | sys.stderr.flush() 695 | time.sleep(waiting_time) 696 | sys.stderr.write('\n') 697 | continue 698 | else: 699 | sys.stderr.write('\n') 700 | raise 701 | data = json.loads(conn.read()) 702 | conn.close() 703 | sys.stderr.write('.') 704 | sys.stderr.flush() 705 | if len(data) == 0: 706 | sys.stderr.write('\n') 707 | return 708 | for tweet in data: 709 | if tweet['id'] == max_id: 710 | if len(data) == 1: 711 | sys.stderr.write('\n') 712 | return 713 | else: 714 | continue 715 | max_id = min(max_id, tweet['id']) or tweet['id'] 716 | f.write('%-20s %30s %-12s @%-20s %s\n' % ( \ 717 | tweet['id_str'], \ 718 | tweet['created_at'], \ 719 | tweet['user']['id_str'], \ 720 | tweet['user']['screen_name'], \ 721 | tweet['text'].replace('\n', ' '))) 722 | 723 | # retrieve tweets 724 | def tweets(args): 725 | f = codecs.open(args.screen_name+TWT_EXT, 'w', encoding='utf-8') 726 | max_id = None 727 | sys.stderr.write('Fetching tweets') 728 | sys.stderr.flush() 729 | while True: 730 | try: 731 | if args.query: 732 | url = 'https://api.twitter.com/1.1/search/tweets.json?q=%s&result_type=recent&count=100&tweet_mode=extended' 733 | elif args.l: 734 | url = 'https://api.twitter.com/1.1/lists/statuses.json?slug='+args.l+'&owner_screen_name=%s&count=100&tweet_mode=extended' 735 | else: 736 | url = 'https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=%s&count=200&tweet_mode=extended' 737 | if max_id: 738 | url += '&max_id=%i' % max_id 739 | conn = urllib2.urlopen(url % args.screen_name) 740 | except urllib2.HTTPError, e: 741 | if e.code in RETRY_CODES: 742 | sys.stderr.write('\nHTTPError %s. 
Retrying' % e.code) 743 | sys.stderr.flush() 744 | continue 745 | elif e.code in WAIT_CODES: 746 | waiting_time = int(e.headers['x-rate-limit-reset']) - int(round(time.time())) + random.randint(10,30) 747 | sys.stderr.write('\nHTTPError %s at %s. Waiting %sm to resume...' % \ 748 | (e.code, time.strftime('%H:%M', time.localtime()), waiting_time/60)) 749 | sys.stderr.flush() 750 | time.sleep(waiting_time) 751 | sys.stderr.write('\n') 752 | continue 753 | else: 754 | sys.stderr.write('\n') 755 | raise 756 | data = json.loads(conn.read()) 757 | if args.query: 758 | data = data['statuses'] 759 | conn.close() 760 | sys.stderr.write('.') 761 | sys.stderr.flush() 762 | if len(data) == 0: 763 | sys.stderr.write('\n') 764 | return 765 | for tweet in data: 766 | if tweet['id'] == max_id: 767 | if len(data) == 1: 768 | sys.stderr.write('\n') 769 | return 770 | else: 771 | continue 772 | max_id = min(max_id, tweet['id']) or tweet['id'] 773 | if args.query or args.l: 774 | f.write('%30s @%-20s %s\n' % (tweet['created_at'], \ 775 | tweet['user']['screen_name'], tweet['full_text'].replace('\n', ' '))) 776 | else: 777 | f.write('%30s %s\n' % (tweet['created_at'], tweet['full_text'].replace('\n', ' '))) 778 | 779 | # find out remaining calls left by endpoint 780 | class StatsAction(argparse.Action): 781 | def __call__(self, parse, namespace, values, option_string=None): 782 | _replace_opener() 783 | url = 'https://api.twitter.com/1.1/application/rate_limit_status.json?resources=users,friends,statuses,search,favorites' 784 | conn = urllib2.urlopen(url) 785 | data = json.loads(conn.read()) 786 | conn.close() 787 | sys.stdout.write('init (%s), ' % \ 788 | data['resources']['users']['/users/lookup']['remaining']) 789 | sys.stdout.write('fetch (%s), ' % \ 790 | data['resources']['friends']['/friends/ids']['remaining']) 791 | sys.stdout.write('tweets (%s, -q %s), ' % ( \ 792 | data['resources']['statuses']['/statuses/user_timeline']['remaining'], \ 793 | data['resources']['search']['/search/tweets']['remaining'])) 794 | sys.stdout.write('resolve (%s), ' % \ 795 | data['resources']['users']['/users/show/:id']['remaining']) 796 | sys.stdout.write('likes (%s)\n' % \ 797 | data['resources']['favorites']['/favorites/list']['remaining']) 798 | sys.exit(0) 799 | 800 | class QuoteAction(argparse.Action): 801 | def __call__(self, parse, namespace, values, option_string=None): 802 | if isinstance(values, list): 803 | setattr(namespace, self.dest, map(lambda v: urllib.quote(v), values)) 804 | else: 805 | setattr(namespace, self.dest, urllib.quote(values)) 806 | 807 | def main(): 808 | parser = argparse.ArgumentParser(description='Twitter Collection Tool') 809 | sp = parser.add_subparsers(dest='cmd', title='sub-commands') 810 | 811 | sp_resolve = sp.add_parser('resolve', \ 812 | help='retrieve user_id for screen_name or vice versa') 813 | sp_resolve.add_argument(dest='screen_name', nargs='+', action=QuoteAction, \ 814 | help='Twitter screen name') 815 | sp_resolve.set_defaults(func=resolve) 816 | 817 | sp_init = sp.add_parser('init', \ 818 | help='retrieve friends data for screen_name') 819 | sp_init.add_argument('-o', '--followers', action='store_true', \ 820 | help='retrieve followers (default: friends)') 821 | sp_init.add_argument('-q', '--query', action='store_true', \ 822 | help='extract handles from query (default: screen_name)') 823 | sp_init.add_argument('-n', '--nomention', action='store_true', \ 824 | help='ignore mentions from query (default: false)') 825 | sp_init.add_argument('-m', '--members', dest='l', \ 826 
| help='extract member handles from list named L') 827 | sp_init.add_argument('-f', '--force', action='store_true', \ 828 | help='ignore existing %s file (default: False)' % TT_EXT) 829 | sp_init.add_argument(dest='screen_name', action=QuoteAction, \ 830 | help='Twitter screen name') 831 | sp_init.set_defaults(func=init) 832 | 833 | sp_fetch = sp.add_parser('fetch', \ 834 | help='retrieve friends of handles in %s file' % TT_EXT) 835 | sp_fetch.add_argument('-f', '--force', action='store_true', \ 836 | help='ignore existing %s files (default: False)' % FDAT_EXT) 837 | sp_fetch.add_argument('-c', dest='count', type=int, default=FMAX, \ 838 | help='skip if friends above count (default: %(FMAX)i)' % globals()) 839 | sp_fetch.add_argument(dest='screen_name', action=QuoteAction, \ 840 | help='Twitter screen name') 841 | sp_fetch.set_defaults(func=fetch) 842 | 843 | sp_tweets = sp.add_parser('tweets', \ 844 | help='retrieve tweets') 845 | sp_tweets.add_argument('-q', '--query', action='store_true', \ 846 | help='argument is a query (default: screen_name)') 847 | sp_tweets.add_argument('-m', '--members', dest='l', \ 848 | help='extract tweets from list named L') 849 | sp_tweets.add_argument(dest='screen_name', action=QuoteAction, \ 850 | help='Twitter screen name') 851 | sp_tweets.set_defaults(func=tweets) 852 | 853 | sp_likes = sp.add_parser('likes', \ 854 | help='retrieve likes') 855 | sp_likes.add_argument('-p', '--purge', action='store_true', \ 856 | help='destroy likes (default: False)') 857 | sp_likes.add_argument(dest='screen_name', action=QuoteAction, \ 858 | help='Twitter screen name') 859 | sp_likes.set_defaults(func=likes) 860 | 861 | sp_edgelist = sp.add_parser('edgelist', \ 862 | help='generate graph in GML format', \ 863 | description='Vertices have the following attributes: twitter user_id, .dat source file, label, link to image file, type, statuses, friends, followers, listed, ffr and lfr. Vertices with a friends-to-followers ratio > 1 have a triangle-up shape, otherwise triangle-down. Label size is proportional to the in-degree. Edges have a width based on their weight attribute. Weight corresponds to the number of tweets per day since the account was created. Vertices included in the giant component are colored according to their membership to a particular community. Otherwise, they are colored in grey. Community finding is based on infomap and applied to members of the giant component. Black is used for vertices with no edges, with more than %s friends or set to private.' 
% FMAX) 864 | sp_edgelist.add_argument('-f', '--format', default='png', \ 865 | choices=('png', 'pdf', 'ps'), \ 866 | help='graph output format (default: png)') 867 | sp_edgelist.add_argument('-e', '--ego', action='store_true', \ 868 | help='include screen_name (default: False)') 869 | sp_edgelist.add_argument('-l', '--layout', default='kk', \ 870 | choices=('circle', 'fr', 'kk'), \ 871 | help='igraph layout (default: kk)') 872 | sp_edgelist.add_argument('-s', '--strong', action='store_true', \ 873 | help='use strong ties to isolate clusters (default: weak)') 874 | sp_edgelist.add_argument('-t', '--transparent', action='store_true', \ 875 | help='hide edges in graph (default: visible)') 876 | sp_edgelist.add_argument(dest='screen_name', nargs='+', action=QuoteAction, \ 877 | help='Twitter screen name') 878 | sp_edgelist.add_argument('-m', '--missing', action='store_true', \ 879 | help='include missing handles (default: False)') 880 | sp_edgelist.set_defaults(func=edgelist) 881 | 882 | parser.add_argument('-s', '--stats', action=StatsAction, nargs=0, \ 883 | help='show Twitter throttling stats and exit') 884 | parser.add_argument('-v', '--version', action='version', \ 885 | version='%(prog)s v'+'%(__version__)s' % globals()) 886 | 887 | try: 888 | args = parser.parse_args() 889 | _replace_opener() 890 | args.func(args) 891 | except KeyboardInterrupt: 892 | sys.stderr.write('\n') 893 | return 2 894 | except Exception, err: 895 | sys.stderr.write(str(err)+'\n') 896 | return 1 897 | else: 898 | sys.stderr.write('Done.\n') 899 | return 0 900 | 901 | if __name__ == '__main__': 902 | sys.exit(main()) 903 | --------------------------------------------------------------------------------