├── .dockerignore ├── .gitignore ├── Dockerfile ├── README.md ├── __init__.py ├── build_status_attr.py ├── build_user_followers_list.py ├── build_user_friendids_list.py ├── build_user_timeline_list.py ├── config ├── __init__.py ├── aws_config.py ├── esconn.py ├── essetup.py ├── s3conn.py └── twitter_config.py ├── crontab ├── get_stream_output_handles.py ├── get_stream_output_results.py ├── index_twitter_search.py ├── index_twitter_stream.py ├── requirements.txt ├── topics.txt └── tweet_model.py /.dockerignore: -------------------------------------------------------------------------------- 1 | */.idea 2 | config_local.py -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | # PyCharm 92 | .idea/ 93 | /config_local.py 94 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM alpine:latest 2 | 3 | RUN mkdir discursive 4 | 5 | COPY . /discursive 6 | 7 | RUN apk add --update \ 8 | tini \ 9 | python \ 10 | py2-pip && \ 11 | adduser -D aws 12 | 13 | WORKDIR /home/aws 14 | 15 | RUN mkdir aws && \ 16 | pip install --upgrade pip && \ 17 | pip install awscli && \ 18 | pip install -q --upgrade pip && \ 19 | pip install -q --upgrade setuptools && \ 20 | pip install -q -r /discursive/requirements.txt && \ 21 | crontab /discursive/crontab 22 | 23 | CMD ["/sbin/tini", "--", "crond", "-f"] 24 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # discursive 2 | 3 | This tool searches Twitter for a collection of topics and stores the Tweet data in an Elasticsearch index _and_ an S3 bucket. The intended use case is for social network composition and Tweet text analysis. 4 | 5 | ## Setup 6 | 7 | Everything you see here runs on AWS EC2 and the AWS Elasticsearch service. 
Currently, it runs just fine in the free tier. Things you will need include: 8 | 9 | - An AWS account 10 | - An [AWS Elasticsearch domain](http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-gsg.html) 11 | - To [configure an access policy for Kibana](https://aws.amazon.com/blogs/security/how-to-control-access-to-your-amazon-elasticsearch-service-domain/) if you want to use that 12 | - An EC2 Linux image (this was tested on Ubuntu) 13 | - A Twitter account and associated [application](https://apps.twitter.com/) auth keys/tokens 14 | - Some grit and [determination](http://www.memecenter.com/fun/333919/determination) 15 | 16 | This sounds like a lot, but it was quite quick to cobble together. 17 | 18 | ## Scouring the Twitterverse 19 | 20 | Once you have cloned the repo you're ready to rock: 21 | 22 | 1. Install Docker on your EC2 instance using [instructions appropriate for your OS](https://docs.docker.com/engine/getstarted/step_one/#/docker-for-linux) (the code in this repo is run using Ubuntu). 23 | 24 | 2. Change into the Discursive directory (i.e. `cd discursive/`). 25 | 26 | 3. Run `essetup.py`, located in the `/config` directory, to generate the Elasticsearch index with the appropriate mappings. 27 | 28 | 4. Update the `aws_config.py`, `twitter_config.py`, `esconn.py` and `s3conn.py` files located in the `/config` directory with your credentials (note that `aws_config.py` reads the AWS access keys from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables). 29 | 30 | 5. Put your desired keyword(s) in the `topics.txt` file (one term per line). 31 | 32 | 6. Edit the `crontab` file to run at your desired intervals. The default will run every fifteen minutes. 33 | 34 | 7. Run `sudo docker build -t discursive .` 35 | 36 | 8. Run `sudo docker run discursive` 37 | 38 | 9. If all went well you're watching Tweets stream into your Elasticsearch index! Alternatively, run `index_twitter_search.py` to search for specific topic(s) and bulk insert the data into your Elasticsearch index (and see the messages from Elasticsearch returned to your console). 39 | 40 | 10. There are several options you may want to configure/tweak. For instance, you may want to turn off printing to console (which you can do in `index_twitter_search.py`) or run the container as a detached process (there is an example in the Docker commands at the bottom of this README). Please do jump into our Slack channel #assemble if you have any questions, or log an issue! 41 | 42 | ## Explore Twitter networks 43 | 44 | A warning: **this is experimental**, so check back often for updates. There are five important files for exploring the network of Tweets we've collected: 45 | 46 | * `get_stream_output_results.py` - returns distinct handles (usernames in Twitter) and Tweet IDs (statuses in Twitter) from the Elasticsearch index for a specified number of Tweets 47 | * `build_user_followers_list.py` - takes the result from `get_stream_output_results.py` and returns a list of followers for each input handle 48 | * `build_user_friendids_list.py` - takes the result from `get_stream_output_results.py` and returns a list of friends for each input handle 49 | * `build_user_timeline_list.py` - takes the result from `get_stream_output_results.py` and returns a list of tweets for each input handle 50 | * `build_status_attr.py` - takes the result from `get_stream_output_results.py` and returns the full Tweet object (see the Twitter API docs for details) 51 | 52 | So, with some additional munging, you can use the above to build a graph of users, their followers and friends. When combined with the additional data we collect (tweet text, retweets, followers count, etc.),
this represents the beginning of our effort to enable analysts by providing curated, network-analysis-ready data! (A minimal edge-list sketch using these scripts is included at the end of this document.) 53 | 54 | ## Where to find help 55 | 56 | There is a chance setting all this up gives you problems. How great a chance? I don't care to speculate publicly. I'm @nick on our Slack, or you can file an issue here (please, for my sanity, just join us on Slack and let's talk there). 57 | 58 | ## Want to use our infra? 59 | 60 | I am a-ok with sharing access to the running instance of Elasticsearch until we get new infra up. I am even happy to take your search term requests, type them into my functioning configuration of this thing and have them indexed if you want to send them to me. I will do this for free because we're fam. Just ping me. 61 | 62 | ## Current Work & Roadmap 63 | 64 | - Migrate the data collection components of this project to [assemble](https://github.com/Data4Democracy/assemble). This includes the underlying 65 | infrastructure and associated codebase. This repo will then become home to curated Twitter datasets and analytical products (contact @bstarling, @asragab or @natalia on the #assemble channel on Slack) 66 | - https://github.com/Data4Democracy/discursive/issues/11 67 | - https://github.com/Data4Democracy/discursive/issues/13 68 | - https://github.com/Data4Democracy/discursive/issues/14 69 | - https://github.com/Data4Democracy/discursive/issues/15 70 | - Design, develop and maintain a robust Natural Language Processing (NLP) capability for Twitter data (contact @divya or @wwymak on the #nlp-twitter channel on Slack) 71 | - https://github.com/Data4Democracy/discursive/issues/17 72 | - Design, develop and maintain a community detection and network analysis capability for Twitter data (contact @alarcj or @zac_bohon on the #discursive-commdetect channel on Slack) 73 | - https://github.com/Data4Democracy/discursive/issues/4 74 | 75 | ## Working with Docker 76 | 77 | Once you have Docker up and running, here are a few useful commands: 78 | ``` 79 | # build the image 80 | sudo docker build -t discursive .
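# run the container detached in the background (one way to handle the detached-process option mentioned in step 10)
sudo docker run -d -v $HOME/.aws:/home/aws/.aws discursive

# list running containers and follow the logs of one
sudo docker ps
sudo docker logs -f <container_id>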
81 | 82 | # list all images 83 | sudo docker images -a 84 | 85 | # if you need to remove the image 86 | sudo docker rmi -f 4f0d819f7a47 87 | 88 | # run the container for reals; prints a bunch of junk 89 | sudo docker run -it -v $HOME/.aws:/home/aws/.aws discursive python /discursive/index_twitter_stream.py 90 | ``` 91 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Data4Democracy/discursive/b1633d547cc5a614264552e66e2b13f9a977ade5/__init__.py -------------------------------------------------------------------------------- /build_status_attr.py: -------------------------------------------------------------------------------- 1 | import tweepy 2 | from config import esconn, twitter_config 3 | from get_stream_output_results import getStreamResultStatusIDs 4 | 5 | # unicode mgmt 6 | import sys 7 | reload(sys) 8 | sys.setdefaultencoding('utf8') 9 | 10 | # go get elasticsearch connection 11 | es = esconn.esconn() 12 | 13 | # auth & api handlers 14 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 15 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 16 | api = tweepy.API(auth) 17 | 18 | 19 | def getAttributesbyStatusID(statuses): 20 | search = api.statuses_lookup(statuses, include_entities='yes') 21 | for item in search: 22 | # print str(item.id) + ': ' + item.text 23 | yield item 24 | 25 | # load Twitter screen_name & build a search 26 | status_id_list = getStreamResultStatusIDs(size=10) 27 | output = {item for item in getAttributesbyStatusID(status_id_list)} 28 | 29 | print output 30 | -------------------------------------------------------------------------------- /build_user_followers_list.py: -------------------------------------------------------------------------------- 1 | import tweepy 2 | from config import esconn, aws_config, twitter_config 3 | from elasticsearch import Elasticsearch,helpers 4 | from get_stream_output_handles import getStreamResultHandles 5 | 6 | # unicode mgmt 7 | import sys 8 | reload(sys) 9 | sys.setdefaultencoding('utf8') 10 | 11 | # go get elasticsearch connection 12 | es = esconn.esconn() 13 | 14 | # auth & api handlers 15 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 16 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 17 | api = tweepy.API(auth) 18 | 19 | # load Twitter screen_name & build a search 20 | 21 | #search = api.user_timeline('hadoopjax', count=2) 22 | #screen_name_list = ['hadoopjax'] 23 | screen_name_list = getStreamResultHandles() 24 | 25 | def getFollowersbyHandle(handles): 26 | for handle in screen_name_list: 27 | search = tweepy.Cursor(api.followers, screen_name=handle, count=200).items() 28 | for user in search: 29 | print user.screen_name 30 | 31 | output = set() 32 | for handle in screen_name_list: 33 | output.add(getFollowersbyHandle(screen_name_list)) 34 | 35 | print output 36 | -------------------------------------------------------------------------------- /build_user_friendids_list.py: -------------------------------------------------------------------------------- 1 | import time 2 | import tweepy 3 | from config import esconn, aws_config, twitter_config 4 | from elasticsearch import Elasticsearch,helpers 5 | from get_stream_output_handles import getStreamResultHandles 6 | 7 | 8 | # unicode mgmt 9 | import sys 10 | 
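# note: the reload/setdefaultencoding pair below is a Python 2-only workaround that switches the default codec to UTF-8, so printing tweets with non-ASCII characters does not raise UnicodeEncodeError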
reload(sys) 11 | sys.setdefaultencoding('utf8') 12 | 13 | # go get elasticsearch connection 14 | es = esconn.esconn() 15 | 16 | # auth & api handlers 17 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 18 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 19 | api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, retry_count=3, retry_delay=60) 20 | 21 | # load Twitter handle(s) & build a search for friends 22 | screen_name_list = getStreamResultHandles() 23 | 24 | def getFriendsbyHandle(handles): 25 | for handle in screen_name_list: 26 | ids = [] 27 | page_count = 0 28 | for page in tweepy.Cursor(api.friends_ids, screen_name=handle, count=5000).pages(): 29 | page_count += 1 30 | print 'Getting page {} for followers ids'.format(page_count) 31 | ids.extend(page) 32 | return ids 33 | 34 | # print for review 35 | for x in screen_name_list: 36 | print (getFriendsbyHandle(screen_name_list)) 37 | 38 | -------------------------------------------------------------------------------- /build_user_timeline_list.py: -------------------------------------------------------------------------------- 1 | import json 2 | import tweepy 3 | from config import esconn, aws_config, twitter_config 4 | from elasticsearch import Elasticsearch,helpers 5 | from get_stream_output_handles import getStreamResultHandles 6 | 7 | # unicode mgmt 8 | import sys 9 | reload(sys) 10 | sys.setdefaultencoding('utf8') 11 | 12 | # go get elasticsearch connection 13 | es = esconn.esconn() 14 | 15 | # auth & api handlers 16 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 17 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 18 | api = tweepy.API(auth) 19 | 20 | # load Twitter handle(s) & build a search for followers 21 | screen_name_list = getStreamResultHandles() 22 | 23 | def getTweetsbyHandle(handles): 24 | for handle in screen_name_list: 25 | search = api.user_timeline(screen_name=handle, count=200, include_rts=True) 26 | for status in search: 27 | print status.text.encode('utf8') + ' ' + handle 28 | 29 | def getUserInfobyHandle(handles): 30 | for handle in handles: 31 | user_info = {handle: []} 32 | try: 33 | user = api.get_user(handle, include_entities=1) 34 | user_info[handle].append( 35 | {'active': 'True', 36 | 'bio': user.description, 37 | 'location': user.location, 38 | 'followers': user.followers_count, 39 | 'following': user.friends_count, 40 | 'image': user.profile_image_url, 41 | }) 42 | except: 43 | user_info[handle].append( 44 | {'active': 'False', 45 | 'bio': 'none', 46 | 'location': 'none', 47 | 'followers': 'none', 48 | 'following': 'none', 49 | 'image': 'none', 50 | 51 | }) 52 | yield user_info 53 | 54 | 55 | # print for review 56 | #for handle in screen_name_list: 57 | # print getTweetsbyHandle(json.dumps(screen_name_list)) 58 | 59 | users_info = getUserInfobyHandle(screen_name_list) 60 | print users_info.next() 61 | with open('user_info.json', 'w') as fp: 62 | for user_info in users_info: 63 | json.dump(user_info, fp, indent=1) -------------------------------------------------------------------------------- /config/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Data4Democracy/discursive/b1633d547cc5a614264552e66e2b13f9a977ade5/config/__init__.py -------------------------------------------------------------------------------- /config/aws_config.py: 
-------------------------------------------------------------------------------- 1 | import os 2 | 3 | # AWS Config 4 | # or your settings here 5 | 6 | access_id = os.environ["AWS_ACCESS_KEY_ID"] 7 | access_secret = os.environ["AWS_SECRET_ACCESS_KEY"] 8 | s3_endpoint_url = " " 9 | es_host = " " 10 | -------------------------------------------------------------------------------- /config/esconn.py: -------------------------------------------------------------------------------- 1 | from elasticsearch import Elasticsearch, RequestsHttpConnection 2 | from requests_aws4auth import AWS4Auth 3 | import aws_config 4 | 5 | host = aws_config.es_host 6 | awsauth = AWS4Auth(aws_config.access_id, aws_config.access_secret, 'us-west-2', 'es') 7 | 8 | 9 | def esconn(): 10 | es = Elasticsearch( 11 | hosts=[{'host': host, 'port': 443}], 12 | http_auth=awsauth, 13 | use_ssl=True, 14 | verify_certs=True, 15 | connection_class=RequestsHttpConnection 16 | ) 17 | return es 18 | -------------------------------------------------------------------------------- /config/essetup.py: -------------------------------------------------------------------------------- 1 | # go get elasticsearch connection 2 | from esconn import esconn 3 | 4 | es = esconn() 5 | 6 | # use this to delete an index 7 | if es.indices.exists(index='twitter'): 8 | es.indices.delete(index='twitter') 9 | 10 | # use this to create an index 11 | settings = { 12 | 'settings': { 13 | 'number_of_shards': 1, 14 | 'number_of_replicas': 0 15 | }, 16 | 'mappings': { 17 | 'tweets': { 18 | 'properties': { 19 | 'name': {'type': 'string'}, 20 | 'message': {'type': 'string'}, 21 | 'description': {'type': 'string'}, 22 | 'loc': {'type': 'string'}, 23 | 'text': {'type': 'string', 'store': 'true'}, 24 | 'user_created': {'type': 'date'}, 25 | 'followers': {'type': 'long'}, 26 | 'id_str': {'type': 'string'}, 27 | 'created': {'type': 'date', 'store': 'true'}, 28 | 'retweet_count': {'type': 'long'}, 29 | 'friends_count': {'type': 'long'}, 30 | 31 | # These fields are synthesized from other metadata 32 | 'topics': {'type': 'string', 'store': 'true'}, 33 | 'retweet': {'type': 'string'}, 34 | 'hashtags': {'type': 'string', 'store': 'true'}, 35 | 'original_id': {'type': 'string'}, 36 | 'original_name': {'type': 'string'} 37 | } 38 | } 39 | } 40 | } 41 | es.indices.create(index='twitter', body=settings) 42 | 43 | # check if the index now exists 44 | if es.indices.exists(index='twitter'): 45 | print 'Created the index' 46 | else: 47 | print 'Something went wrong. The index was not created.' 
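# to double-check the result, one option is to print the mapping back out, e.g.:
# print es.indices.get_mapping(index='twitter')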
48 | -------------------------------------------------------------------------------- /config/s3conn.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from botocore.client import ClientError 3 | import aws_config as config 4 | 5 | 6 | def get_s3_bucket(bucket_name): 7 | s3 = boto3.resource('s3', 8 | aws_access_key_id=config.access_id, 9 | aws_secret_access_key=config.access_secret 10 | ) 11 | try: 12 | bucket = s3.create_bucket(Bucket=bucket_name, 13 | CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}) 14 | return bucket 15 | except ClientError as e: 16 | # if the bucket is created will raise this error, return it, else return False 17 | error_code = e.response['Error']['Code'] 18 | if error_code == 'BucketAlreadyOwnedByYou': 19 | return s3.Bucket(bucket_name) 20 | else: 21 | return False 22 | 23 | 24 | # should replace bucket name here with correct one 25 | def write_file_to_s3(data, key, bucket_name="discursive"): 26 | bucket = get_s3_bucket(bucket_name) 27 | if bucket is not False: 28 | bucket.put_object(Key=key, Body=data) 29 | else: 30 | print bucket_name + " bucket could not be found" 31 | -------------------------------------------------------------------------------- /config/twitter_config.py: -------------------------------------------------------------------------------- 1 | # Enter Twitter application credentials 2 | 3 | ACCESS_TOKEN = " " 4 | ACCESS_TOKEN_SECRET = " " 5 | CONSUMER_KEY = " " 6 | CONSUMER_SECRET = " " 7 | -------------------------------------------------------------------------------- /crontab: -------------------------------------------------------------------------------- 1 | 0,15,30,45 * * * * python /discursive/index_twitter_stream.py /discursive/topics.txt 2 | -------------------------------------------------------------------------------- /get_stream_output_handles.py: -------------------------------------------------------------------------------- 1 | import json 2 | from elasticsearch import Elasticsearch, helpers 3 | from config import esconn 4 | 5 | # get Elasticsearch connection 6 | es = esconn.esconn() 7 | 8 | def getStreamResultHandles(): 9 | resp = es.search(index="twitter", doc_type="message", size="100", filter_path=['hits.hits._source.name']) 10 | output = set() 11 | for doc in resp['hits']['hits']: 12 | output.add(doc['_source']['name']) 13 | return list(output) 14 | 15 | # write output to file 16 | #with open('stream_output_handles.txt', 'w') as f: 17 | # f.write(str(getStreamResultHandles()) 18 | print list(getStreamResultHandles()) 19 | -------------------------------------------------------------------------------- /get_stream_output_results.py: -------------------------------------------------------------------------------- 1 | import json 2 | from elasticsearch import Elasticsearch, helpers 3 | from config import esconn 4 | 5 | # get Elasticsearch connection 6 | es = esconn.esconn() 7 | 8 | def getStreamResultHandles(): 9 | resp = es.search(index="twitter", doc_type="tweets", size="100", filter_path=['hits.hits._source.name']) 10 | output = set() 11 | for doc in resp['hits']['hits']: 12 | output.add(doc['_source']['name']) 13 | return list(output) 14 | 15 | def getStreamResultStatusIDs(size): 16 | resp = es.search(index="twitter", doc_type="tweets", size=size, filter_path=['hits.hits._source.id_str']) 17 | output = set() 18 | for doc in resp['hits']['hits']: 19 | output.add(doc['_source']['id_str']) 20 | return list(output) 21 | 22 | # write output to file 23 | #with 
open('stream_output_handles.txt', 'w') as f: 24 | # f.write(str(getStreamResultHandles()) 25 | #print list(getStreamResultStatusIDs()) 26 | 27 | -------------------------------------------------------------------------------- /index_twitter_search.py: -------------------------------------------------------------------------------- 1 | import tweepy 2 | from config import esconn, aws_config, twitter_config 3 | from elasticsearch import helpers 4 | from tweet_model import map_tweet_for_es 5 | 6 | # unicode mgmt 7 | import sys 8 | reload(sys) 9 | sys.setdefaultencoding('utf8') 10 | 11 | # go get elasticsearch connection 12 | es = esconn.esconn() 13 | 14 | # auth & api handlers 15 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 16 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 17 | api = tweepy.API(auth) 18 | 19 | # load topics & build a search 20 | topics = ["oath keeper"] 21 | search = api.search(q=topics, count=100) 22 | 23 | 24 | # function for screen_name, text, search topic 25 | def tweet_text(): 26 | for tweet in search: 27 | if (not tweet.retweeted) and ('RT @' not in tweet.text): 28 | yield map_tweet_for_es(tweet, topics) 29 | 30 | # bulk insert into twitter index 31 | helpers.bulk(es, tweet_text(), index='twitter', doc_type='tweets') 32 | 33 | # view the message field in the twitter index 34 | messages = es.search(index="twitter", size=1000, _source=['message']) 35 | print messages 36 | 37 | -------------------------------------------------------------------------------- /index_twitter_stream.py: -------------------------------------------------------------------------------- 1 | import json 2 | import tweepy 3 | from config import esconn, aws_config, twitter_config 4 | import os 5 | from datetime import datetime as dt 6 | from tweet_model import map_tweet_for_es 7 | from config import s3conn 8 | 9 | # unicode mgmt 10 | import sys 11 | reload(sys) 12 | sys.setdefaultencoding('utf8') 13 | 14 | # Twitter auth and api call setup 15 | auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET) 16 | auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET) 17 | api = tweepy.API(auth) 18 | 19 | # Get elasticsearch connection 20 | es = esconn.esconn() 21 | 22 | if len(sys.argv) > 2: 23 | sys.exit('ERROR: Received 2 or more arguments: {} {} {} Expected 1: Topic file name'.format(sys.argv[0], sys.argv[1], sys.argv[2])) 24 | 25 | elif len(sys.argv) == 2: 26 | try: 27 | with open(sys.argv[1]) as f: 28 | topics = f.readlines() 29 | except Exception: 30 | sys.exit('ERROR: Expected topic file %s not found' % sys.argv[1]) 31 | else: 32 | try: 33 | with open('topics.txt') as f: 34 | topics = f.readlines() 35 | except: 36 | sys.exit('ERROR: Default topics.txt not found. 
No alternate topic file was provided') 37 | 38 | 39 | TOPICS = [topic.replace('\n', '').strip() for topic in topics] 40 | 41 | 42 | class StreamListener(tweepy.StreamListener): 43 | def __init__(self, api=None): 44 | super(StreamListener, self).__init__() 45 | self.counter = 0 46 | self.limit = 500 47 | self.tweet_list = [] 48 | 49 | def on_status(self, status): 50 | if self.counter < self.limit: 51 | extra = create_extra_fields(status) 52 | tweet = map_tweet_for_es(status, TOPICS, extra) 53 | 54 | # append to instance attribute and then index to elasticsearch (rethink if limit scales up significantly) 55 | self.tweet_list.append(tweet) 56 | dump_to_elastic(tweet) 57 | 58 | print 'Tweet Count# ' + str(self.counter) + ' ' + json.dumps(fix_date_for_tweet(tweet)) 59 | else: 60 | # if limit reached write saved tweets to s3 61 | dump_to_s3(self.tweet_list) 62 | return False 63 | 64 | self.counter += 1 65 | 66 | def on_error(self, status_code): 67 | # Twitter is rate limiting, exit 68 | if status_code == 420: 69 | print('Twitter rate limit error_code {}, exiting...'.format(status_code)) 70 | return False 71 | 72 | 73 | def create_extra_fields(status): 74 | # check if retweet, assign attributes 75 | if hasattr(status, 'retweeted_status'): 76 | retweet = 'Y' 77 | original_id = status.retweeted_status.user.id 78 | original_name = status.retweeted_status.user.name 79 | else: 80 | retweet = 'N' 81 | original_id = None 82 | original_name = None 83 | 84 | # check for hashtags and save as list 85 | if hasattr(status, 'entities'): 86 | hashtags = [] 87 | for tag in status.entities['hashtags']: 88 | hashtags.append(tag['text']) 89 | hashtags = json.dumps(hashtags) 90 | 91 | return { 92 | 'retweet': retweet, 93 | 'hashtags': hashtags, 94 | 'original_id': original_id, 95 | 'original_name': original_name 96 | } 97 | 98 | 99 | def search(): 100 | stream_listener = StreamListener() 101 | stream = tweepy.Stream(auth=api.auth, listener=stream_listener) 102 | stream.filter(track=TOPICS) 103 | return 104 | 105 | 106 | def dump_to_elastic(bodydata): 107 | es.index(index='twitter', doc_type="message", body=bodydata) 108 | 109 | 110 | def dump_to_s3(data): 111 | filename, ext = ("tweets", ".json") 112 | 113 | local_file = dump_to_file(data, filename + ext) 114 | tweets_file = open(local_file, 'rb') 115 | 116 | key = create_key(filename, ext) 117 | s3conn.write_file_to_s3(tweets_file, key) 118 | 119 | 120 | def dump_to_file(data, filename): 121 | # fix dates and dump to json 122 | tweet_list = json.dumps(fix_dates_for_dump(data)) 123 | 124 | # get current working directory and write file to local path 125 | cwd = os.path.dirname(os.path.abspath(__file__)) 126 | path = os.path.join(cwd, filename) 127 | try: 128 | with open(path, 'w') as fw: 129 | fw.write(tweet_list) 130 | return path 131 | except (IOError, OSError) as ex: 132 | print str(ex) 133 | 134 | 135 | def fix_dates_for_dump(data): 136 | # json.dumps can't natively serialize datetime obj converting to str before 137 | for tweet in data: 138 | tweet["user_created"] = str(tweet["user_created"]) 139 | tweet["created"] = str(tweet["created"]) 140 | return data 141 | 142 | def fix_date_for_tweet(tweet): 143 | tweet["user_created"] = str(tweet["user_created"]) 144 | tweet["created"] = str(tweet["created"]) 145 | return tweet 146 | 147 | 148 | def create_key(filename, ext): 149 | now = dt.now() 150 | # Ex: '2017/1/9/21/tweets-26.json' 151 | # This key generates a 'directory' structure in s3 that can be navigated as such 152 | key = str(now.year) + "/" + \ 153 | 
str(now.month) + "/" + \ 154 | str(now.day) + "/" + \ 155 | str(now.hour) + "/" + \ 156 | filename + "-" + \ 157 | str(now.minute) + ext 158 | return key 159 | 160 | 161 | search() 162 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | boto3==1.4.3 2 | botocore==1.4.93 3 | dataset==0.7.1 4 | elasticsearch==5.1.0 5 | requests_aws4auth==0.9 6 | tweepy==3.5.0 7 | -------------------------------------------------------------------------------- /topics.txt: -------------------------------------------------------------------------------- 1 | cucks 2 | breitbart 3 | -------------------------------------------------------------------------------- /tweet_model.py: -------------------------------------------------------------------------------- 1 | def map_tweet_for_es(tweet, topics, extra=None): 2 | tweet_dict = { 3 | 'name': tweet.user.screen_name, 4 | 'message': tweet.text, 5 | 'description': tweet.user.description, 6 | 'loc': tweet.user.location, 7 | 'text': tweet.text, 8 | 'user_created': tweet.user.created_at, 9 | 'followers': tweet.user.followers_count, 10 | 'id_str': tweet.id_str, 11 | 'created': tweet.created_at, 12 | 'retweet_count': tweet.retweet_count, 13 | 'friends_count': tweet.user.friends_count, 14 | 'topics': topics 15 | } 16 | 17 | if extra is not None: 18 | final = tweet_dict.copy() 19 | final.update(extra) 20 | return final 21 | else: 22 | return tweet_dict 23 | --------------------------------------------------------------------------------
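A minimal sketch of the "additional munging" mentioned in the Explore Twitter networks section of the README: it reuses `getStreamResultHandles` and the same `friends/ids` cursor pattern as `build_user_friendids_list.py` to write a simple (handle, friend id) edge list. This is only an illustrative starting point, not part of the pipeline, and the `edges.csv` file name is an arbitrary placeholder.

```
# sketch: build a (source handle -> friend user id) edge list from handles already indexed in Elasticsearch
import csv

import tweepy

from config import twitter_config
from get_stream_output_results import getStreamResultHandles

auth = tweepy.OAuthHandler(twitter_config.CONSUMER_KEY, twitter_config.CONSUMER_SECRET)
auth.set_access_token(twitter_config.ACCESS_TOKEN, twitter_config.ACCESS_TOKEN_SECRET)
# wait_on_rate_limit keeps the loop alive when the friends/ids rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

with open('edges.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['source_handle', 'target_user_id'])
    for handle in getStreamResultHandles():
        # friends/ids returns up to 5000 ids per page
        for page in tweepy.Cursor(api.friends_ids, screen_name=handle, count=5000).pages():
            for friend_id in page:
                writer.writerow([handle, friend_id])
```

The resulting CSV is a directed edge list that can be loaded straight into networkx or Gephi for the community detection and network analysis work described in the roadmap.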