├── .dockerignore ├── .gitignore ├── Dockerfile ├── README.md ├── __init__.py ├── build_status_attr.py ├── build_user_followers_list.py ├── build_user_friendids_list.py ├── build_user_timeline_list.py ├── config ├── __init__.py ├── aws_config.py ├── esconn.py ├── essetup.py ├── s3conn.py └── twitter_config.py ├── crontab ├── get_stream_output_handles.py ├── get_stream_output_results.py ├── index_twitter_search.py ├── index_twitter_stream.py ├── requirements.txt ├── topics.txt └── tweet_model.py /.dockerignore: -------------------------------------------------------------------------------- 1 | */.idea 2 | config_local.py -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | # PyCharm 92 | .idea/ 93 | /config_local.py 94 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM alpine:latest 2 | 3 | RUN mkdir discursive 4 | 5 | COPY . /discursive 6 | 7 | RUN apk add --update \ 8 | tini \ 9 | python \ 10 | py2-pip && \ 11 | adduser -D aws 12 | 13 | WORKDIR /home/aws 14 | 15 | RUN mkdir aws && \ 16 | pip install --upgrade pip && \ 17 | pip install awscli && \ 18 | pip install -q --upgrade pip && \ 19 | pip install -q --upgrade setuptools && \ 20 | pip install -q -r /discursive/requirements.txt && \ 21 | crontab /discursive/crontab 22 | 23 | CMD ["/sbin/tini", "--", "crond", "-f"] 24 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # discursive 2 | 3 | This tool searches Twitter for a collection of topics and stores the Tweet data in an Elasticsearch index _and_ an S3 bucket. The intended use case is for social network composition and Tweet text analysis. 4 | 5 | ## Setup 6 | 7 | Everything you see here runs on AWS EC2 and the AWS Elasticsearch service. 
Currently, it runs just fine in the free tier. Things you will need include: 8 | 9 | - An AWS account 10 | - An [AWS Elasticsearch domain](http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-gsg.html) 11 | - To [configure an access policy for Kibana](https://aws.amazon.com/blogs/security/how-to-control-access-to-your-amazon-elasticsearch-service-domain/) if you want to use that 12 | - An EC2 Linux image (this was tested on Ubuntu) 13 | - A Twitter account and associated [application](https://apps.twitter.com/) auth keys/tokens 14 | - Some grit and [determination](http://www.memecenter.com/fun/333919/determination) 15 | 16 | This sounds like a lot, but it was quite quick to cobble together. 17 | 18 | ## Scouring the Twitterverse 19 | 20 | Once you have cloned the repo you're ready to rock: 21 | 22 | 1. Install Docker on your EC2 instance using [instructions appropriate for your OS](https://docs.docker.com/engine/getstarted/step_one/#/docker-for-linux) (the code in this repo is run using Ubuntu). 23 | 24 | 2. Change into the Discursive directory (i.e. `cd discursive/`). 25 | 26 | 3. Run `essetup.py`, located in the `/config` directory, to generate the Elasticsearch index with the appropriate mappings. 27 | 28 | 4. Update the `aws_config.py`, `twitter_config.py`, `esconn.py` and `s3conn.py` files located in the `/config` directory with your credentials (note that `aws_config.py` reads the AWS access keys from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables). 29 | 30 | 5. Put your desired keyword(s) in the `topics.txt` file (one term per line). 31 | 32 | 6. Edit the `crontab` file to run at your desired intervals. The default will run every fifteen minutes. 33 | 34 | 7. Run `sudo docker build -t discursive .` 35 | 36 | 8. Run `sudo docker run discursive` 37 | 38 | 9. If all went well you're watching Tweets stream into your Elasticsearch index! Alternatively, run `index_twitter_search.py` to search for specific topic(s) and bulk insert the data into your Elasticsearch index (and see the messages from Elasticsearch returned to your console). 39 | 40 | 10. There are several options you may want to configure/tweak. For instance, you may want to turn off printing to console (which you can do in `index_twitter_search.py`) or run the container as a detached process (there is an example in the Docker commands at the bottom of this README). Please do jump into our Slack channel #assemble if you have any questions, or log an issue! 41 | 42 | ## Explore Twitter networks 43 | 44 | A warning: **this is experimental**, so check back often for updates. There are five important files for exploring the network of Tweets we've collected: 45 | 46 | * `get_stream_output_results.py` - returns distinct handles (usernames in Twitter) and Tweet IDs (statuses in Twitter) from the Elasticsearch index for a specified number of Tweets 47 | * `build_user_followers_list.py` - takes the result from `get_stream_output_results.py` and returns a list of followers for each input handle 48 | * `build_user_friendids_list.py` - takes the result from `get_stream_output_results.py` and returns a list of friends for each input handle 49 | * `build_user_timeline_list.py` - takes the result from `get_stream_output_results.py` and returns a list of tweets for each input handle 50 | * `build_status_attr.py` - takes the result from `get_stream_output_results.py` and returns the full Tweet object (see the Twitter API docs for details) 51 | 52 | So, with some additional munging, you can use the above to build a graph of users, their followers and friends. When combined with the additional data we collect (tweet text, retweets, followers count, etc.),
this represents the beginning of our effort to enable analysts by providing curated, network-analysis-ready data! (A minimal edge-list sketch using these scripts is included at the end of this document.) 53 | 54 | ## Where to find help 55 | 56 | There is a chance setting all this up gives you problems. How great a chance? I don't care to speculate publicly. I'm @nick on our Slack, or you can file an issue here (please, for my sanity, just join us on Slack and let's talk there). 57 | 58 | ## Want to use our infra? 59 | 60 | I am a-ok with sharing access to the running instance of Elasticsearch until we get new infra up. I am even happy to take your search term requests, type them into my functioning configuration of this thing and have them indexed if you want to send them to me. I will do this for free because we're fam. Just ping me. 61 | 62 | ## Current Work & Roadmap 63 | 64 | - Migrate the data collection components of this project to [assemble](https://github.com/Data4Democracy/assemble). This includes the underlying 65 | infrastructure and associated codebase. This repo will then become home to curated Twitter datasets and analytical products (contact @bstarling, @asragab or @natalia on the #assemble channel on Slack) 66 | - https://github.com/Data4Democracy/discursive/issues/11 67 | - https://github.com/Data4Democracy/discursive/issues/13 68 | - https://github.com/Data4Democracy/discursive/issues/14 69 | - https://github.com/Data4Democracy/discursive/issues/15 70 | - Design, develop and maintain a robust Natural Language Processing (NLP) capability for Twitter data (contact @divya or @wwymak on the #nlp-twitter channel on Slack) 71 | - https://github.com/Data4Democracy/discursive/issues/17 72 | - Design, develop and maintain a community detection and network analysis capability for Twitter data (contact @alarcj or @zac_bohon on the #discursive-commdetect channel on Slack) 73 | - https://github.com/Data4Democracy/discursive/issues/4 74 | 75 | ## Working with Docker 76 | 77 | Once you have Docker up and running, here are a few useful commands: 78 | ``` 79 | # build the image 80 | sudo docker build -t discursive .
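# run the container detached in the background (one way to handle the detached-process option mentioned in step 10)
sudo docker run -d -v $HOME/.aws:/home/aws/.aws discursive

# list running containers and follow the logs of one
sudo docker ps
sudo docker logs -f <container_id>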
81 | 82 | # list all images 83 | sudo docker images -a 84 | 85 | # if you need to remove the image 86 | sudo docker rmi -f 4f0d819f7a47 87 | 88 | # run the container for reals; prints a bunch of junk 89 | sudo docker run -it -v $HOME/.aws:/home/aws/.aws discursive python /discursive/index_twitter_stream.py 90 | ``` 91 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Data4Democracy/discursive/b1633d547cc5a614264552e66e2b13f9a977ade5/__init__.py -------------------------------------------------------------------------------- /build_status_attr.py: -------------------------------------------------------------------------------- 1 | import tweepy 2 | from config import esconn, twitter_config 3 | from get_stream_output_results import getStreamResultStatusIDs 4 | 5 | # unicode mgmt 6 | import sys 7 | reload(sys) 8 | sys.setdefaultencoding('utf8') 9 | 10 | # go get elasticsearch connection 11 | es = esconn.esconn() 12 | 13 | # auth & api handlers 14 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 15 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 16 | api = tweepy.API(auth) 17 | 18 | 19 | def getAttributesbyStatusID(statuses): 20 | search = api.statuses_lookup(statuses, include_entities='yes') 21 | for item in search: 22 | # print str(item.id) + ': ' + item.text 23 | yield item 24 | 25 | # load Twitter screen_name & build a search 26 | status_id_list = getStreamResultStatusIDs(size=10) 27 | output = {item for item in getAttributesbyStatusID(status_id_list)} 28 | 29 | print output 30 | -------------------------------------------------------------------------------- /build_user_followers_list.py: -------------------------------------------------------------------------------- 1 | import tweepy 2 | from config import esconn, aws_config, twitter_config 3 | from elasticsearch import Elasticsearch,helpers 4 | from get_stream_output_handles import getStreamResultHandles 5 | 6 | # unicode mgmt 7 | import sys 8 | reload(sys) 9 | sys.setdefaultencoding('utf8') 10 | 11 | # go get elasticsearch connection 12 | es = esconn.esconn() 13 | 14 | # auth & api handlers 15 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 16 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 17 | api = tweepy.API(auth) 18 | 19 | # load Twitter screen_name & build a search 20 | 21 | #search = api.user_timeline('hadoopjax', count=2) 22 | #screen_name_list = ['hadoopjax'] 23 | screen_name_list = getStreamResultHandles() 24 | 25 | def getFollowersbyHandle(handles): 26 | for handle in screen_name_list: 27 | search = tweepy.Cursor(api.followers, screen_name=handle, count=200).items() 28 | for user in search: 29 | print user.screen_name 30 | 31 | output = set() 32 | for handle in screen_name_list: 33 | output.add(getFollowersbyHandle(screen_name_list)) 34 | 35 | print output 36 | -------------------------------------------------------------------------------- /build_user_friendids_list.py: -------------------------------------------------------------------------------- 1 | import time 2 | import tweepy 3 | from config import esconn, aws_config, twitter_config 4 | from elasticsearch import Elasticsearch,helpers 5 | from get_stream_output_handles import getStreamResultHandles 6 | 7 | 8 | # unicode mgmt 9 | import sys 10 | 
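# note: the reload/setdefaultencoding pair below is a Python 2-only workaround that switches the default codec to UTF-8, so printing tweets with non-ASCII characters does not raise UnicodeEncodeError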
reload(sys) 11 | sys.setdefaultencoding('utf8') 12 | 13 | # go get elasticsearch connection 14 | es = esconn.esconn() 15 | 16 | # auth & api handlers 17 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 18 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 19 | api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, retry_count=3, retry_delay=60) 20 | 21 | # load Twitter handle(s) & build a search for friends 22 | screen_name_list = getStreamResultHandles() 23 | 24 | def getFriendsbyHandle(handles): 25 | for handle in screen_name_list: 26 | ids = [] 27 | page_count = 0 28 | for page in tweepy.Cursor(api.friends_ids, screen_name=handle, count=5000).pages(): 29 | page_count += 1 30 | print 'Getting page {} for followers ids'.format(page_count) 31 | ids.extend(page) 32 | return ids 33 | 34 | # print for review 35 | for x in screen_name_list: 36 | print (getFriendsbyHandle(screen_name_list)) 37 | 38 | -------------------------------------------------------------------------------- /build_user_timeline_list.py: -------------------------------------------------------------------------------- 1 | import json 2 | import tweepy 3 | from config import esconn, aws_config, twitter_config 4 | from elasticsearch import Elasticsearch,helpers 5 | from get_stream_output_handles import getStreamResultHandles 6 | 7 | # unicode mgmt 8 | import sys 9 | reload(sys) 10 | sys.setdefaultencoding('utf8') 11 | 12 | # go get elasticsearch connection 13 | es = esconn.esconn() 14 | 15 | # auth & api handlers 16 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 17 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 18 | api = tweepy.API(auth) 19 | 20 | # load Twitter handle(s) & build a search for followers 21 | screen_name_list = getStreamResultHandles() 22 | 23 | def getTweetsbyHandle(handles): 24 | for handle in screen_name_list: 25 | search = api.user_timeline(screen_name=handle, count=200, include_rts=True) 26 | for status in search: 27 | print status.text.encode('utf8') + ' ' + handle 28 | 29 | def getUserInfobyHandle(handles): 30 | for handle in handles: 31 | user_info = {handle: []} 32 | try: 33 | user = api.get_user(handle, include_entities=1) 34 | user_info[handle].append( 35 | {'active': 'True', 36 | 'bio': user.description, 37 | 'location': user.location, 38 | 'followers': user.followers_count, 39 | 'following': user.friends_count, 40 | 'image': user.profile_image_url, 41 | }) 42 | except: 43 | user_info[handle].append( 44 | {'active': 'False', 45 | 'bio': 'none', 46 | 'location': 'none', 47 | 'followers': 'none', 48 | 'following': 'none', 49 | 'image': 'none', 50 | 51 | }) 52 | yield user_info 53 | 54 | 55 | # print for review 56 | #for handle in screen_name_list: 57 | # print getTweetsbyHandle(json.dumps(screen_name_list)) 58 | 59 | users_info = getUserInfobyHandle(screen_name_list) 60 | print users_info.next() 61 | with open('user_info.json', 'w') as fp: 62 | for user_info in users_info: 63 | json.dump(user_info, fp, indent=1) -------------------------------------------------------------------------------- /config/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Data4Democracy/discursive/b1633d547cc5a614264552e66e2b13f9a977ade5/config/__init__.py -------------------------------------------------------------------------------- /config/aws_config.py: 
-------------------------------------------------------------------------------- 1 | import os 2 | 3 | # AWS Config 4 | # or your settings here 5 | 6 | access_id = os.environ["AWS_ACCESS_KEY_ID"] 7 | access_secret = os.environ["AWS_SECRET_ACCESS_KEY"] 8 | s3_endpoint_url = " " 9 | es_host = " " 10 | -------------------------------------------------------------------------------- /config/esconn.py: -------------------------------------------------------------------------------- 1 | from elasticsearch import Elasticsearch, RequestsHttpConnection 2 | from requests_aws4auth import AWS4Auth 3 | import aws_config 4 | 5 | host = aws_config.es_host 6 | awsauth = AWS4Auth(aws_config.access_id, aws_config.access_secret, 'us-west-2', 'es') 7 | 8 | 9 | def esconn(): 10 | es = Elasticsearch( 11 | hosts=[{'host': host, 'port': 443}], 12 | http_auth=awsauth, 13 | use_ssl=True, 14 | verify_certs=True, 15 | connection_class=RequestsHttpConnection 16 | ) 17 | return es 18 | -------------------------------------------------------------------------------- /config/essetup.py: -------------------------------------------------------------------------------- 1 | # go get elasticsearch connection 2 | from esconn import esconn 3 | 4 | es = esconn() 5 | 6 | # use this to delete an index 7 | if es.indices.exists(index='twitter'): 8 | es.indices.delete(index='twitter') 9 | 10 | # use this to create an index 11 | settings = { 12 | 'settings': { 13 | 'number_of_shards': 1, 14 | 'number_of_replicas': 0 15 | }, 16 | 'mappings': { 17 | 'tweets': { 18 | 'properties': { 19 | 'name': {'type': 'string'}, 20 | 'message': {'type': 'string'}, 21 | 'description': {'type': 'string'}, 22 | 'loc': {'type': 'string'}, 23 | 'text': {'type': 'string', 'store': 'true'}, 24 | 'user_created': {'type': 'date'}, 25 | 'followers': {'type': 'long'}, 26 | 'id_str': {'type': 'string'}, 27 | 'created': {'type': 'date', 'store': 'true'}, 28 | 'retweet_count': {'type': 'long'}, 29 | 'friends_count': {'type': 'long'}, 30 | 31 | # These fields are synthesized from other metadata 32 | 'topics': {'type': 'string', 'store': 'true'}, 33 | 'retweet': {'type': 'string'}, 34 | 'hashtags': {'type': 'string', 'store': 'true'}, 35 | 'original_id': {'type': 'string'}, 36 | 'original_name': {'type': 'string'} 37 | } 38 | } 39 | } 40 | } 41 | es.indices.create(index='twitter', body=settings) 42 | 43 | # check if the index now exists 44 | if es.indices.exists(index='twitter'): 45 | print 'Created the index' 46 | else: 47 | print 'Something went wrong. The index was not created.' 
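# to double-check the result, one option is to print the mapping back out, e.g.:
# print es.indices.get_mapping(index='twitter')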
48 | -------------------------------------------------------------------------------- /config/s3conn.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from botocore.client import ClientError 3 | import aws_config as config 4 | 5 | 6 | def get_s3_bucket(bucket_name): 7 | s3 = boto3.resource('s3', 8 | aws_access_key_id=config.access_id, 9 | aws_secret_access_key=config.access_secret 10 | ) 11 | try: 12 | bucket = s3.create_bucket(Bucket=bucket_name, 13 | CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}) 14 | return bucket 15 | except ClientError as e: 16 | # if the bucket is created will raise this error, return it, else return False 17 | error_code = e.response['Error']['Code'] 18 | if error_code == 'BucketAlreadyOwnedByYou': 19 | return s3.Bucket(bucket_name) 20 | else: 21 | return False 22 | 23 | 24 | # should replace bucket name here with correct one 25 | def write_file_to_s3(data, key, bucket_name="discursive"): 26 | bucket = get_s3_bucket(bucket_name) 27 | if bucket is not False: 28 | bucket.put_object(Key=key, Body=data) 29 | else: 30 | print bucket_name + " bucket could not be found" 31 | -------------------------------------------------------------------------------- /config/twitter_config.py: -------------------------------------------------------------------------------- 1 | # Enter Twitter application credentials 2 | 3 | ACCESS_TOKEN = " " 4 | ACCESS_TOKEN_SECRET = " " 5 | CONSUMER_KEY = " " 6 | CONSUMER_SECRET = " " 7 | -------------------------------------------------------------------------------- /crontab: -------------------------------------------------------------------------------- 1 | 0,15,30,45 * * * * python /discursive/index_twitter_stream.py /discursive/topics.txt 2 | -------------------------------------------------------------------------------- /get_stream_output_handles.py: -------------------------------------------------------------------------------- 1 | import json 2 | from elasticsearch import Elasticsearch, helpers 3 | from config import esconn 4 | 5 | # get Elasticsearch connection 6 | es = esconn.esconn() 7 | 8 | def getStreamResultHandles(): 9 | resp = es.search(index="twitter", doc_type="message", size="100", filter_path=['hits.hits._source.name']) 10 | output = set() 11 | for doc in resp['hits']['hits']: 12 | output.add(doc['_source']['name']) 13 | return list(output) 14 | 15 | # write output to file 16 | #with open('stream_output_handles.txt', 'w') as f: 17 | # f.write(str(getStreamResultHandles()) 18 | print list(getStreamResultHandles()) 19 | -------------------------------------------------------------------------------- /get_stream_output_results.py: -------------------------------------------------------------------------------- 1 | import json 2 | from elasticsearch import Elasticsearch, helpers 3 | from config import esconn 4 | 5 | # get Elasticsearch connection 6 | es = esconn.esconn() 7 | 8 | def getStreamResultHandles(): 9 | resp = es.search(index="twitter", doc_type="tweets", size="100", filter_path=['hits.hits._source.name']) 10 | output = set() 11 | for doc in resp['hits']['hits']: 12 | output.add(doc['_source']['name']) 13 | return list(output) 14 | 15 | def getStreamResultStatusIDs(size): 16 | resp = es.search(index="twitter", doc_type="tweets", size=size, filter_path=['hits.hits._source.id_str']) 17 | output = set() 18 | for doc in resp['hits']['hits']: 19 | output.add(doc['_source']['id_str']) 20 | return list(output) 21 | 22 | # write output to file 23 | #with 
open('stream_output_handles.txt', 'w') as f: 24 | # f.write(str(getStreamResultHandles()) 25 | #print list(getStreamResultStatusIDs()) 26 | 27 | -------------------------------------------------------------------------------- /index_twitter_search.py: -------------------------------------------------------------------------------- 1 | import tweepy 2 | from config import esconn, aws_config, twitter_config 3 | from elasticsearch import helpers 4 | from tweet_model import map_tweet_for_es 5 | 6 | # unicode mgmt 7 | import sys 8 | reload(sys) 9 | sys.setdefaultencoding('utf8') 10 | 11 | # go get elasticsearch connection 12 | es = esconn.esconn() 13 | 14 | # auth & api handlers 15 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 16 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 17 | api = tweepy.API(auth) 18 | 19 | # load topics & build a search 20 | topics = ["oath keeper"] 21 | search = api.search(q=topics, count=100) 22 | 23 | 24 | # function for screen_name, text, search topic 25 | def tweet_text(): 26 | for tweet in search: 27 | if (not tweet.retweeted) and ('RT @' not in tweet.text): 28 | yield map_tweet_for_es(tweet, topics) 29 | 30 | # bulk insert into twitter index 31 | helpers.bulk(es, tweet_text(), index='twitter', doc_type='tweets') 32 | 33 | # view the message field in the twitter index 34 | messages = es.search(index="twitter", size=1000, _source=['message']) 35 | print messages 36 | 37 | -------------------------------------------------------------------------------- /index_twitter_stream.py: -------------------------------------------------------------------------------- 1 | import json 2 | import tweepy 3 | from config import esconn, aws_config, twitter_config 4 | import os 5 | from datetime import datetime as dt 6 | from tweet_model import map_tweet_for_es 7 | from config import s3conn 8 | 9 | # unicode mgmt 10 | import sys 11 | reload(sys) 12 | sys.setdefaultencoding('utf8') 13 | 14 | # Twitter auth and api call setup 15 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 16 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 17 | api = tweepy.API(auth) 18 | 19 | # Get elasticsearch connection 20 | es = esconn.esconn() 21 | 22 | if len(sys.argv) > 2: 23 | sys.exit('ERROR: Received 2 or more arguments: {} {} {} Expected 1: Topic file name'.format(sys.argv[0], sys.argv[1], sys.argv[2])) 24 | 25 | elif len(sys.argv) == 2: 26 | try: 27 | with open(sys.argv[1]) as f: 28 | topics = f.readlines() 29 | except Exception: 30 | sys.exit('ERROR: Expected topic file %s not found' % sys.argv[1]) 31 | else: 32 | try: 33 | with open('topics.txt') as f: 34 | topics = f.readlines() 35 | except: 36 | sys.exit('ERROR: Default topics.txt not found. 
No alternate topic file was provided') 37 | 38 | 39 | TOPICS = [topic.replace('\n', '').strip() for topic in topics] 40 | 41 | 42 | class StreamListener(tweepy.StreamListener): 43 | def __init__(self, api=None): 44 | super(StreamListener, self).__init__() 45 | self.counter = 0 46 | self.limit = 500 47 | self.tweet_list = [] 48 | 49 | def on_status(self, status): 50 | if self.counter < self.limit: 51 | extra = create_extra_fields(status) 52 | tweet = map_tweet_for_es(status, TOPICS, extra) 53 | 54 | # append to instance attribute and then index to elasticsearch (rethink if limit scales up significantly) 55 | self.tweet_list.append(tweet) 56 | dump_to_elastic(tweet) 57 | 58 | print 'Tweet Count# ' + str(self.counter) + ' ' + json.dumps(fix_date_for_tweet(tweet)) 59 | else: 60 | # if limit reached write saved tweets to s3 61 | dump_to_s3(self.tweet_list) 62 | return False 63 | 64 | self.counter += 1 65 | 66 | def on_error(self, status_code): 67 | # Twitter is rate limiting, exit 68 | if status_code == 420: 69 | print('Twitter rate limit error_code {}, exiting...'.format(status_code)) 70 | return False 71 | 72 | 73 | def create_extra_fields(status): 74 | # check if retweet, assign attributes 75 | if hasattr(status, 'retweeted_status'): 76 | retweet = 'Y' 77 | original_id = status.retweeted_status.user.id 78 | original_name = status.retweeted_status.user.name 79 | else: 80 | retweet = 'N' 81 | original_id = None 82 | original_name = None 83 | 84 | # check for hashtags and save as list 85 | if hasattr(status, 'entities'): 86 | hashtags = [] 87 | for tag in status.entities['hashtags']: 88 | hashtags.append(tag['text']) 89 | hashtags = json.dumps(hashtags) 90 | 91 | return { 92 | 'retweet': retweet, 93 | 'hashtags': hashtags, 94 | 'original_id': original_id, 95 | 'original_name': original_name 96 | } 97 | 98 | 99 | def search(): 100 | stream_listener = StreamListener() 101 | stream = tweepy.Stream(auth=api.auth, listener=stream_listener) 102 | stream.filter(track=TOPICS) 103 | return 104 | 105 | 106 | def dump_to_elastic(bodydata): 107 | es.index(index='twitter', doc_type="message", body=bodydata) 108 | 109 | 110 | def dump_to_s3(data): 111 | filename, ext = ("tweets", ".json") 112 | 113 | local_file = dump_to_file(data, filename + ext) 114 | tweets_file = open(local_file, 'rb') 115 | 116 | key = create_key(filename, ext) 117 | s3conn.write_file_to_s3(tweets_file, key) 118 | 119 | 120 | def dump_to_file(data, filename): 121 | # fix dates and dump to json 122 | tweet_list = json.dumps(fix_dates_for_dump(data)) 123 | 124 | # get current working directory and write file to local path 125 | cwd = os.path.dirname(os.path.abspath(__file__)) 126 | path = os.path.join(cwd, filename) 127 | try: 128 | with open(path, 'w') as fw: 129 | fw.write(tweet_list) 130 | return path 131 | except (IOError, OSError) as ex: 132 | print str(ex) 133 | 134 | 135 | def fix_dates_for_dump(data): 136 | # json.dumps can't natively serialize datetime obj converting to str before 137 | for tweet in data: 138 | tweet["user_created"] = str(tweet["user_created"]) 139 | tweet["created"] = str(tweet["created"]) 140 | return data 141 | 142 | def fix_date_for_tweet(tweet): 143 | tweet["user_created"] = str(tweet["user_created"]) 144 | tweet["created"] = str(tweet["created"]) 145 | return tweet 146 | 147 | 148 | def create_key(filename, ext): 149 | now = dt.now() 150 | # Ex: '2017/1/9/21/tweets-26.json' 151 | # This key generates a 'directory' structure in s3 that can be navigated as such 152 | key = str(now.year) + "/" + \ 153 | 
str(now.month) + "/" + \ 154 | str(now.day) + "/" + \ 155 | str(now.hour) + "/" + \ 156 | filename + "-" + \ 157 | str(now.minute) + ext 158 | return key 159 | 160 | 161 | search() 162 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | boto3==1.4.3 2 | botocore==1.4.93 3 | dataset==0.7.1 4 | elasticsearch==5.1.0 5 | requests_aws4auth==0.9 6 | tweepy==3.5.0 7 | -------------------------------------------------------------------------------- /topics.txt: -------------------------------------------------------------------------------- 1 | cucks 2 | breitbart 3 | -------------------------------------------------------------------------------- /tweet_model.py: -------------------------------------------------------------------------------- 1 | def map_tweet_for_es(tweet, topics, extra=None): 2 | tweet_dict = { 3 | 'name': tweet.user.screen_name, 4 | 'message': tweet.text, 5 | 'description': tweet.user.description, 6 | 'loc': tweet.user.location, 7 | 'text': tweet.text, 8 | 'user_created': tweet.user.created_at, 9 | 'followers': tweet.user.followers_count, 10 | 'id_str': tweet.id_str, 11 | 'created': tweet.created_at, 12 | 'retweet_count': tweet.retweet_count, 13 | 'friends_count': tweet.user.friends_count, 14 | 'topics': topics 15 | } 16 | 17 | if extra is not None: 18 | final = tweet_dict.copy() 19 | final.update(extra) 20 | return final 21 | else: 22 | return tweet_dict 23 | --------------------------------------------------------------------------------
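A minimal sketch of the "additional munging" mentioned in the Explore Twitter networks section of the README: it reuses `getStreamResultHandles` and the same `friends/ids` cursor pattern as `build_user_friendids_list.py` to write a simple (handle, friend id) edge list. This is only an illustrative starting point, not part of the pipeline, and the `edges.csv` file name is an arbitrary placeholder.

```
# sketch: build a (source handle -> friend user id) edge list from handles already indexed in Elasticsearch
import csv

import tweepy

from config import twitter_config
from get_stream_output_results import getStreamResultHandles

auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET)
auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET)
# wait_on_rate_limit keeps the loop alive when the friends/ids rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

with open('edges.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['source_handle', 'target_user_id'])
    for handle in getStreamResultHandles():
        # friends/ids returns up to 5000 ids per page
        for page in tweepy.Cursor(api.friends_ids, screen_name=handle, count=5000).pages():
            for friend_id in page:
                writer.writerow([handle, friend_id])
```

The resulting CSV is a directed edge list that can be loaded straight into networkx or Gephi for the community detection and network analysis work described in the roadmap.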