├── .gitignore
├── README.md
├── requirements.txt
└── tweets-nlp-elasticsearch
    ├── config
    │   └── config.yml.template
    ├── docker
    │   └── dockerup.py
    └── tweets-nlp-es.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
config.yml
*.pyc
tweetstream.json
log/
.idea/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# A Python 3.6 project that loads near-realtime filtered Twitter API data into Elasticsearch and visualizes the results in Kibana.
Uses an ELK (Elasticsearch/Logstash/Kibana) 7.x Docker container for easy reproducibility.
Tested with Python 3.6 (Anaconda distribution) on Windows 10 and macOS 10.13.3.

To get Elasticsearch and Kibana up and running locally, run the `docker pull` command below, followed by the `docker run` command for your OS.

Note: Make sure you have at least 4GB of RAM assigned to Docker, and increase the limit on mmap counts to 262,144 or more on Mac or Linux (the Mac 'docker run' command below applies this workaround). See here for more info:
http://elk-docker.readthedocs.io/
Warning: Do not pull this public image on a low-bandwidth or metered Internet connection. It pulls down well over 5GB of traffic in total.
```
docker pull sebp/elk:721
```
Windows:
```
docker run -p 5601:5601 -p 9200:9200 -p 5044:5044 -p 5000:5000 -it --name elk sebp/elk:721
```
Mac:
```
docker run -p 5601:5601 -p 9200:9200 -p 5044:5044 -p 5000:5000 -e MAX_MAP_COUNT="262144" -it --name elk sebp/elk:721
```
More info on this Docker container can be found here: https://hub.docker.com/r/sebp/elk/

After the container is up and running, you should be able to hit these 2 URLs if everything is working properly:

Elasticsearch: http://localhost:9200/

Which should return a response similar to this:
~~~
{
  "name" : "elk",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "randomstring",
  "version" : {
    "number" : "7.2.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "fe6cb20",
    "build_date" : "2019-07-24T17:58:29.979462Z",
    "build_snapshot" : false,
    "lucene_version" : "8.0.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
~~~

Kibana: http://localhost:5601/

This should return the Kibana web GUI.
When you first open the Kibana web GUI, enter the text 'twitteranalysis' where it asks for the index name or pattern on the Management tab (on the left).

After cloning this repo locally, run pip install to load all of the Python libraries into your active Python environment:
`pip install -r requirements.txt`

Python libraries used:
~~~
https://github.com/sloria/textblob
https://github.com/tweepy/tweepy
https://elasticsearch-py.readthedocs.io/en/master/api.html
~~~
After filling in your Twitter API credentials in config.yml (rename config.yml.template to config.yml), let the script run for a few minutes to load up the index in Elasticsearch.
After data (analyzed tweets) is loaded into Elasticsearch, try a few URLs like these:

http://localhost:9200/twitteranalysis/_search?q=python
http://localhost:9200/twitteranalysis/_search?q=sentiment:Positive%20AND%20tweet_text:python
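If you would rather query from Python, here is a minimal sketch using the same elasticsearch client pinned in requirements.txt. The index name and the sentiment/tweet_text fields match what tweets-nlp-es.py writes; everything else is illustrative:
```
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to http://localhost:9200

# Count all analyzed tweets indexed so far
print(es.count(index="twitteranalysis")["count"])

# Fetch positive tweets mentioning python, mirroring the URL query above
hits = es.search(index="twitteranalysis",
                 q="sentiment:Positive AND tweet_text:python")
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["tweet_text"])
```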
Feel free to open an issue if you have any problems getting this up and running.
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
elasticsearch>=7.0.0,<8.0.0
nltk==3.6.6
oauthlib==2.0.2
PyYAML==5.4
requests==2.31.0
requests-oauthlib==0.8.0
six==1.10.0
textblob==0.12.0
tweepy==3.5.0
urllib3==1.26.18
--------------------------------------------------------------------------------
/tweets-nlp-elasticsearch/config/config.yml.template:
--------------------------------------------------------------------------------
# You can populate these after creating an App here: https://apps.twitter.com
twitter_consumer_key: ""
twitter_consumer_secret: ""

twitter_access_token: ""
twitter_access_token_secret: ""

# Comma-separated list of terms to track, e.g. "python,elasticsearch"
twitter_terms_to_track: "python"
--------------------------------------------------------------------------------
/tweets-nlp-elasticsearch/docker/dockerup.py:
--------------------------------------------------------------------------------
import json

import requests


def get_docker_auth_token(auth_url, image_name):
    get_auth_token_payload = {
        'service': 'registry.docker.io',
        'scope': 'repository:library/{}:pull'.format(image_name)
    }

    token_resp = requests.get(auth_url + '/token', params=get_auth_token_payload)
    if token_resp.status_code != 200:
        print("Error status code: {} returned when trying to get token".format(token_resp.status_code))
        raise Exception("Error: Could not get an auth token!")

    resp_json = token_resp.json()
    return resp_json['token']


def fetch_versions(index_url, token, image_name):
    h = {'Authorization': "Bearer {}".format(token)}
    resp_tags = requests.get('{}/v2/library/{}/tags/list'.format(index_url, image_name),
                             headers=h)
    return resp_tags.json()


def fetch_catalog(index_url, token):
    h = {'Authorization': "Bearer {}".format(token)}
    resp = requests.get('{}/v2/_catalog'.format(index_url),
                        headers=h)
    return resp.json()
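
# A hedged sketch, not part of the original script: the functions above only
# handle official images under the 'library/' namespace. For a user-owned
# repository such as the sebp/elk image used in the README, the token scope and
# tags URL take the full repository path instead. The helper name below is
# hypothetical.
def fetch_versions_for_repo(auth_url, index_url, repo):
    # e.g. repo = "sebp/elk"
    token_resp = requests.get(auth_url + '/token',
                              params={'service': 'registry.docker.io',
                                      'scope': 'repository:{}:pull'.format(repo)})
    token = token_resp.json()['token']
    h = {'Authorization': "Bearer {}".format(token)}
    resp = requests.get('{}/v2/{}/tags/list'.format(index_url, repo), headers=h)
    return resp.json()
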
if __name__ == "__main__":
    name = "alpine"
    index_url = 'https://index.docker.io'
    auth_url = 'https://auth.docker.io'

    token = get_docker_auth_token(auth_url=auth_url, image_name=name)
    print(token)

    print("Get versions")
    versions = fetch_versions(index_url, token, name)
    print(json.dumps(versions, indent=2))
    print("----")

    print("Get catalog")
    catalog = fetch_catalog(index_url, token)
    print(json.dumps(catalog, indent=2))
--------------------------------------------------------------------------------
/tweets-nlp-elasticsearch/tweets-nlp-es.py:
--------------------------------------------------------------------------------
"""
Purpose: Insert tweets into an Elasticsearch cluster and visualize the results in Kibana.
Python 3.6
1. pip install -r requirements.txt
2. cp ./config/config.yml.template ./config/config.yml
3. nano ./config/config.yml
4. Paste your Twitter API credentials into the config.yml
5. Make sure the ELK 7.x stack is up and running.
6. Run this script to ingest filtered tweets into Elasticsearch in realtime.
7. Visualize the NLP results in Kibana.
"""

import json
import logging
import os.path
import sys
import time

from logging import handlers

import yaml

from elasticsearch import Elasticsearch

from textblob import TextBlob

from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

# config.yml should exist in the config directory next to this file
if not os.path.isfile(os.path.join('config', 'config.yml')):
    print('config.yml was not found. You probably need to rename config.yml.template to config.yml ' +
          'and insert your Twitter credentials in that config file')
    sys.exit()

logs_dir_name = 'log'
if not os.path.exists(logs_dir_name):
    os.makedirs(logs_dir_name)

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
LOG_FORMAT = logging.Formatter('%(asctime)-15s %(levelname)s: %(message)s')

stdout_logger = logging.StreamHandler(sys.stdout)
stdout_logger.setFormatter(LOG_FORMAT)
logger.addHandler(stdout_logger)

file_logger = handlers.RotatingFileHandler(os.path.join(logs_dir_name, 'tweets-nlp-es.log'),
                                           maxBytes=(1048576 * 5),
                                           backupCount=3)
file_logger.setFormatter(LOG_FORMAT)
logger.addHandler(file_logger)

es = Elasticsearch()
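
# A hedged, optional sketch, not in the original project: create the index up
# front with an explicit mapping so Kibana treats tweet_timestamp_ms as a date
# and sentiment as a keyword. Field names match the analyzed_tweet dict built
# below; the body follows the Elasticsearch 7.x mappings API. The helper name
# is hypothetical, and nothing calls it by default.
def create_index_with_mapping(index_name="twitteranalysis"):
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name,
                          body={"mappings": {"properties": {
                              "tweet_timestamp_ms": {"type": "date", "format": "epoch_millis"},
                              "sentiment": {"type": "keyword"},
                              "polarity": {"type": "float"},
                              "subjectivity": {"type": "float"}
                          }}})
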
class TweetStreamListener(StreamListener):
    def on_data(self, data):

        # Load the JSON payload into a dict to make it easy to parse
        tweet_json = json.loads(data)

        # Short-circuit exit if no text is found in the tweet item
        if 'text' not in tweet_json.keys():
            logger.warning('Text not found in this tweet. Skipping it.')
            return True

        tweet_raw_text = tweet_json["text"]

        # Load the text of the tweet into a TextBlob so it can be analyzed
        tweet_text_blob = TextBlob(tweet_raw_text)

        # Value between -1 and 1 - TextBlob polarity explained in layman's
        # terms: http://planspace.org/20150607-textblob_sentiment/
        text_polarity = tweet_text_blob.sentiment.polarity
        logger.debug(
            'Tweet Polarity: {} - on Tweet Text: {}'.format(text_polarity,
                                                            tweet_raw_text.encode('UTF-8', 'replace')))

        if text_polarity > 0:
            sentiment = "Positive"
        elif text_polarity < 0:
            sentiment = "Negative"
        else:
            sentiment = "Neutral"

        logger.debug('TextBlob Analysis Sentiment: {}'.format(sentiment))

        analyzed_tweet = {
            "tweet_id": tweet_json["id_str"],
            "tweet_timestamp_ms": tweet_json["timestamp_ms"],
            "tweet_date": tweet_json["created_at"],
            "is_quote_status": tweet_json["is_quote_status"],
            "in_reply_to_status_id": tweet_json["in_reply_to_status_id"],
            "in_reply_to_screen_name": tweet_json["in_reply_to_screen_name"],
            "favorite_count": tweet_json["favorite_count"],
            "author": tweet_json["user"]["screen_name"],
            "tweet_text": tweet_json["text"],
            "retweeted": tweet_json["retweeted"],
            "retweet_count": tweet_json["retweet_count"],
            "geo": tweet_json["geo"],
            "place": tweet_json["place"],
            "coordinates": tweet_json["coordinates"],
            "polarity": text_polarity,
            "subjectivity": tweet_text_blob.sentiment.subjectivity,
            "sentiment": sentiment,
            "epoch_time_ingested": int(time.time())
        }

        # We can write the analyzed tweet to ES, a static file, or both
        write_tweet_to_json_file(analyzed_tweet)
        write_analyzed_tweet_to_es(analyzed_tweet)

        return True

    def on_error(self, status):
        logger.error("Fatal Error: {}".format(status))
        # Returning False disconnects the stream
        return False


# Helper functions for dealing with the processed tweet data
def write_tweet_to_json_file(tweet_data):
    try:
        # Write one JSON object per line; str(tweet_data) would produce a
        # Python dict repr, which is not valid JSON
        with open('tweetstream.json', 'a') as out_file:
            out_file.write(json.dumps(tweet_data) + '\n')
    except BaseException as err:
        logger.exception("Exception writing tweet to JSON file: {}".format(err))


def write_analyzed_tweet_to_es(tweet_data):
    try:
        # Send the analyzed tweet into an ES index for visualization in Kibana.
        # doc_type is omitted because mapping types are deprecated in Elasticsearch 7.x.
        es.index(index="twitteranalysis",
                 body=tweet_data)
    except BaseException as err:
        logger.exception("Exception writing tweet to ES: {}".format(err))


def get_config():
    try:
        with open(os.path.join("config", "config.yml"), "r") as yaml_config_file:
            _config = yaml.load(yaml_config_file, Loader=yaml.SafeLoader)
        return _config
    except (IOError, yaml.YAMLError):
        logger.exception('config.yml file cannot be found or read. '
                         'You might need to fill in the config.yml.template and then rename it to config.yml')
        sys.exit(1)
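
# A hedged alternative, not in the original project: if per-tweet es.index()
# calls become a bottleneck, analyzed tweets can be buffered and flushed in
# batches with the bulk helper that ships with elasticsearch-py. The function
# name is hypothetical; nothing calls it by default.
from elasticsearch import helpers  # kept here for the sketch; imports normally live at the top


def write_analyzed_tweets_bulk(tweets):
    # Each action targets the same index the streaming writer uses
    actions = [{"_index": "twitteranalysis", "_source": tweet} for tweet in tweets]
    helpers.bulk(es, actions)
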
if __name__ == '__main__':
    config = get_config()

    # Read Twitter API access info from the config file
    twitter_auth = OAuthHandler(config['twitter_consumer_key'], config['twitter_consumer_secret'])
    twitter_auth.set_access_token(config['twitter_access_token'], config['twitter_access_token_secret'])

    logger.debug('Creating Listener')
    # Create an instance of the tweepy tweet stream listener
    twitter_listener = TweetStreamListener()

    logger.debug('Creating Stream')
    # Create an instance of the tweepy raw stream
    tw_stream = Stream(twitter_auth, twitter_listener)

    # Stream that is filtered on keywords
    logger.debug('Starting the Filtered Stream')

    # tweepy expects a list of phrases; the config stores a comma-separated
    # string, and passing a bare string would be joined character by character
    twitter_terms_to_track = [term.strip() for term in config['twitter_terms_to_track'].split(',')]
    logger.info('Tracking Terms: {}'.format(twitter_terms_to_track))

    tw_stream.filter(track=twitter_terms_to_track, languages=['en'])
--------------------------------------------------------------------------------