├── .gitignore
├── README.md
├── requirements.txt
└── tweets-nlp-elasticsearch
    ├── config
    │   └── config.yml.template
    ├── docker
    │   └── dockerup.py
    └── tweets-nlp-es.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
config.yml
*.pyc
tweetstream.json
log/
.idea/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# A Python 3.6 project that loads near-realtime filtered Twitter API data into Elasticsearch and visualizes the results in Kibana.
Uses an ELK (Elasticsearch/Logstash/Kibana) 7.x Docker container for easy reproducibility.
Tested with Python 3.6 (Anaconda distribution) on Windows 10 and macOS 10.13.3.

To get Elasticsearch and Kibana up and running locally, run the `docker pull` command below, followed by the `docker run` command for your OS.

Note: Make sure you have at least 4GB of RAM assigned to Docker, and increase the limit on mmap counts to 262,144 or more on Mac or Linux (the Mac 'docker run' command below applies this workaround). See here for more info:
http://elk-docker.readthedocs.io/
Warning: Do not pull this public image on a low-bandwidth or metered Internet connection. It pulls down well over 5GB of traffic in total.
```
docker pull sebp/elk:721
```
Windows:
```
docker run -p 5601:5601 -p 9200:9200 -p 5044:5044 -p 5000:5000 -it --name elk sebp/elk:721
```
Mac:
```
docker run -p 5601:5601 -p 9200:9200 -p 5044:5044 -p 5000:5000 -e MAX_MAP_COUNT="262144" -it --name elk sebp/elk:721
```
More info on this Docker container can be found here: https://hub.docker.com/r/sebp/elk/

After the container is up and running, you should be able to hit these 2 URLs if everything is working properly:

Elasticsearch: http://localhost:9200/

Which should return a response similar to this:
~~~
{
  "name" : "elk",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "randomstring",
  "version" : {
    "number" : "7.2.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "fe6cb20",
    "build_date" : "2019-07-24T17:58:29.979462Z",
    "build_snapshot" : false,
    "lucene_version" : "8.0.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
~~~

Kibana: http://localhost:5601/

This should return the Kibana web GUI.
When you first open the Kibana web GUI, enter the text 'twitteranalysis' where it asks for the index name or pattern on the Management tab (on the left).

After cloning this repo locally, run pip install to load all of the Python libraries into your active Python environment:
`pip install -r requirements.txt`

Python libraries used:
~~~
https://github.com/sloria/textblob
https://github.com/tweepy/tweepy
https://elasticsearch-py.readthedocs.io/en/master/api.html
~~~
After filling in your Twitter API credentials in config.yml (rename config.yml.template to config.yml), let the script run for a few minutes to load up the index in Elasticsearch.
After data (analyzed tweets) is loaded into Elasticsearch, try a few URLs like these:

http://localhost:9200/twitteranalysis/_search?q=python
http://localhost:9200/twitteranalysis/_search?q=sentiment:Positive%20AND%20tweet_text:python
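If you would rather query from Python, here is a minimal sketch using the same elasticsearch client pinned in requirements.txt. The index name and the sentiment/tweet_text fields match what tweets-nlp-es.py writes; everything else is illustrative:
```
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to http://localhost:9200

# Count all analyzed tweets indexed so far
print(es.count(index="twitteranalysis")["count"])

# Fetch positive tweets mentioning python, mirroring the URL query above
hits = es.search(index="twitteranalysis",
                 q="sentiment:Positive AND tweet_text:python")
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["tweet_text"])
```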
Feel free to open an issue if you have any problems getting this up and running.
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
elasticsearch>=7.0.0,<8.0.0
nltk==3.6.6
oauthlib==2.0.2
PyYAML==5.4
requests==2.31.0
requests-oauthlib==0.8.0
six==1.10.0
textblob==0.12.0
tweepy==3.5.0
urllib3==1.26.18
--------------------------------------------------------------------------------
/tweets-nlp-elasticsearch/config/config.yml.template:
--------------------------------------------------------------------------------
# You can populate these after creating an App here: https://apps.twitter.com
twitter_consumer_key: ""
twitter_consumer_secret: ""

twitter_access_token: ""
twitter_access_token_secret: ""

# Comma-separated list of terms to track, e.g. "python,elasticsearch"
twitter_terms_to_track: "python"
--------------------------------------------------------------------------------
/tweets-nlp-elasticsearch/docker/dockerup.py:
--------------------------------------------------------------------------------
import json

import requests


def get_docker_auth_token(auth_url, image_name):
    get_auth_token_payload = {
        'service': 'registry.docker.io',
        'scope': 'repository:library/{}:pull'.format(image_name)
    }

    token_resp = requests.get(auth_url + '/token', params=get_auth_token_payload)
    if token_resp.status_code != 200:
        print("Error status code: {} returned when trying to get token".format(token_resp.status_code))
        raise Exception("Error: Could not get an auth token!")

    resp_json = token_resp.json()
    return resp_json['token']


def fetch_versions(index_url, token, image_name):
    h = {'Authorization': "Bearer {}".format(token)}
    resp_tags = requests.get('{}/v2/library/{}/tags/list'.format(index_url, image_name),
                             headers=h)
    return resp_tags.json()


def fetch_catalog(index_url, token):
    h = {'Authorization': "Bearer {}".format(token)}
    resp = requests.get('{}/v2/_catalog'.format(index_url),
                        headers=h)
    return resp.json()
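
# A hedged sketch, not part of the original script: the functions above only
# handle official images under the 'library/' namespace. For a user-owned
# repository such as the sebp/elk image used in the README, the token scope and
# tags URL take the full repository path instead. The helper name below is
# hypothetical.
def fetch_versions_for_repo(auth_url, index_url, repo):
    # e.g. repo = "sebp/elk"
    token_resp = requests.get(auth_url + '/token',
                              params={'service': 'registry.docker.io',
                                      'scope': 'repository:{}:pull'.format(repo)})
    token = token_resp.json()['token']
    h = {'Authorization': "Bearer {}".format(token)}
    resp = requests.get('{}/v2/{}/tags/list'.format(index_url, repo), headers=h)
    return resp.json()
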
if __name__ == "__main__":
    name = "alpine"
    index_url = 'https://index.docker.io'
    auth_url = 'https://auth.docker.io'

    token = get_docker_auth_token(auth_url=auth_url, image_name=name)
    print(token)

    print("Get versions")
    versions = fetch_versions(index_url, token, name)
    print(json.dumps(versions, indent=2))
    print("----")

    print("Get catalog")
    catalog = fetch_catalog(index_url, token)
    print(json.dumps(catalog, indent=2))
--------------------------------------------------------------------------------
/tweets-nlp-elasticsearch/tweets-nlp-es.py:
--------------------------------------------------------------------------------
"""
Purpose: Insert tweets into an Elasticsearch cluster and visualize the results in Kibana.
Python 3.6
1. pip install -r requirements.txt
2. cp ./config/config.yml.template ./config/config.yml
3. nano ./config/config.yml
4. Paste your Twitter API credentials into the config.yml
5. Make sure the ELK 7.x stack is up and running.
6. Run this script to ingest filtered tweets into Elasticsearch in realtime.
7. Visualize the NLP results in Kibana.
"""

import json
import logging
import os.path
import sys
import time

from logging import handlers

import yaml

from elasticsearch import Elasticsearch

from textblob import TextBlob

from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

# config.yml should exist in the config directory next to this file
if not os.path.isfile(os.path.join('config', 'config.yml')):
    print('config.yml was not found. You probably need to rename config.yml.template to config.yml ' +
          'and insert your Twitter credentials in that config file')
    sys.exit()

logs_dir_name = 'log'
if not os.path.exists(logs_dir_name):
    os.makedirs(logs_dir_name)

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
LOG_FORMAT = logging.Formatter('%(asctime)-15s %(levelname)s: %(message)s')

stdout_logger = logging.StreamHandler(sys.stdout)
stdout_logger.setFormatter(LOG_FORMAT)
logger.addHandler(stdout_logger)

file_logger = handlers.RotatingFileHandler(os.path.join(logs_dir_name, 'tweets-nlp-es.log'),
                                           maxBytes=(1048576 * 5),
                                           backupCount=3)
file_logger.setFormatter(LOG_FORMAT)
logger.addHandler(file_logger)

es = Elasticsearch()
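
# A hedged, optional sketch, not in the original project: create the index up
# front with an explicit mapping so Kibana treats tweet_timestamp_ms as a date
# and sentiment as a keyword. Field names match the analyzed_tweet dict built
# below; the body follows the Elasticsearch 7.x mappings API. The helper name
# is hypothetical, and nothing calls it by default.
def create_index_with_mapping(index_name="twitteranalysis"):
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name,
                          body={"mappings": {"properties": {
                              "tweet_timestamp_ms": {"type": "date", "format": "epoch_millis"},
                              "sentiment": {"type": "keyword"},
                              "polarity": {"type": "float"},
                              "subjectivity": {"type": "float"}
                          }}})
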
class TweetStreamListener(StreamListener):
    def on_data(self, data):

        # Load the JSON payload into a dict to make it easy to parse
        tweet_json = json.loads(data)

        # Short-circuit exit if no text is found in the tweet item
        if 'text' not in tweet_json.keys():
            logger.warning('Text not found in this tweet. Skipping it.')
            return True

        tweet_raw_text = tweet_json["text"]

        # Load the text of the tweet into a TextBlob so it can be analyzed
        tweet_text_blob = TextBlob(tweet_raw_text)

        # Value between -1 and 1 - TextBlob polarity explained in layman's
        # terms: http://planspace.org/20150607-textblob_sentiment/
        text_polarity = tweet_text_blob.sentiment.polarity
        logger.debug(
            'Tweet Polarity: {} - on Tweet Text: {}'.format(text_polarity,
                                                            tweet_raw_text.encode('UTF-8', 'replace')))

        if text_polarity > 0:
            sentiment = "Positive"
        elif text_polarity < 0:
            sentiment = "Negative"
        else:
            sentiment = "Neutral"

        logger.debug('TextBlob Analysis Sentiment: {}'.format(sentiment))

        analyzed_tweet = {
            "tweet_id": tweet_json["id_str"],
            "tweet_timestamp_ms": tweet_json["timestamp_ms"],
            "tweet_date": tweet_json["created_at"],
            "is_quote_status": tweet_json["is_quote_status"],
            "in_reply_to_status_id": tweet_json["in_reply_to_status_id"],
            "in_reply_to_screen_name": tweet_json["in_reply_to_screen_name"],
            "favorite_count": tweet_json["favorite_count"],
            "author": tweet_json["user"]["screen_name"],
            "tweet_text": tweet_json["text"],
            "retweeted": tweet_json["retweeted"],
            "retweet_count": tweet_json["retweet_count"],
            "geo": tweet_json["geo"],
            "place": tweet_json["place"],
            "coordinates": tweet_json["coordinates"],
            "polarity": text_polarity,
            "subjectivity": tweet_text_blob.sentiment.subjectivity,
            "sentiment": sentiment,
            "epoch_time_ingested": int(time.time())
        }

        # We can write the analyzed tweet to ES, a static file, or both
        write_tweet_to_json_file(analyzed_tweet)
        write_analyzed_tweet_to_es(analyzed_tweet)

        return True

    def on_error(self, status):
        logger.error("Fatal Error: {}".format(status))
        # Returning False disconnects the stream
        return False


# Helper functions for dealing with the processed tweet data
def write_tweet_to_json_file(tweet_data):
    try:
        # Write one JSON object per line; str(tweet_data) would produce a
        # Python dict repr, which is not valid JSON
        with open('tweetstream.json', 'a') as out_file:
            out_file.write(json.dumps(tweet_data) + '\n')
    except BaseException as err:
        logger.exception("Exception writing tweet to JSON file: {}".format(err))


def write_analyzed_tweet_to_es(tweet_data):
    try:
        # Send the analyzed tweet into an ES index for visualization in Kibana.
        # doc_type is omitted because mapping types are deprecated in Elasticsearch 7.x.
        es.index(index="twitteranalysis",
                 body=tweet_data)
    except BaseException as err:
        logger.exception("Exception writing tweet to ES: {}".format(err))


def get_config():
    try:
        with open(os.path.join("config", "config.yml"), "r") as yaml_config_file:
            _config = yaml.load(yaml_config_file, Loader=yaml.SafeLoader)
        return _config
    except (IOError, yaml.YAMLError):
        logger.exception('config.yml file cannot be found or read. '
                         'You might need to fill in the config.yml.template and then rename it to config.yml')
        sys.exit(1)
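
# A hedged alternative, not in the original project: if per-tweet es.index()
# calls become a bottleneck, analyzed tweets can be buffered and flushed in
# batches with the bulk helper that ships with elasticsearch-py. The function
# name is hypothetical; nothing calls it by default.
from elasticsearch import helpers  # kept here for the sketch; imports normally live at the top


def write_analyzed_tweets_bulk(tweets):
    # Each action targets the same index the streaming writer uses
    actions = [{"_index": "twitteranalysis", "_source": tweet} for tweet in tweets]
    helpers.bulk(es, actions)
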
if __name__ == '__main__':
    config = get_config()

    # Read Twitter API access info from the config file
    twitter_auth = OAuthHandler(config['twitter_consumer_key'], config['twitter_consumer_secret'])
    twitter_auth.set_access_token(config['twitter_access_token'], config['twitter_access_token_secret'])

    logger.debug('Creating Listener')
    # Create an instance of the tweepy tweet stream listener
    twitter_listener = TweetStreamListener()

    logger.debug('Creating Stream')
    # Create an instance of the tweepy raw stream
    tw_stream = Stream(twitter_auth, twitter_listener)

    # Stream that is filtered on keywords
    logger.debug('Starting the Filtered Stream')

    # tweepy expects a list of phrases; the config stores a comma-separated
    # string, and passing a bare string would be joined character by character
    twitter_terms_to_track = [term.strip() for term in config['twitter_terms_to_track'].split(',')]
    logger.info('Tracking Terms: {}'.format(twitter_terms_to_track))

    tw_stream.filter(track=twitter_terms_to_track, languages=['en'])
--------------------------------------------------------------------------------