├── 1805 - Identifying Forced And Fake Twitter Trends
│   ├── README.md
│   └── Twitter Fake Trend Analysis.R
├── 1806 - Instagram Hashtag Analysis In Python
│   ├── Instagram Hashtag Analysis.py
│   └── README.md
├── 1807 - Reading Habit Analysis Using Pocket API And Python
│   ├── Pocket API Analysis.py
│   └── README.md
├── LICENSE
└── README.md

/1805 - Identifying Forced And Fake Twitter Trends/README.md:
--------------------------------------------------------------------------------
# Identifying Forced And Fake Twitter Trends Using R

Read the post here: https://www.everydayplots.com/identifying-forced-fake-twitter-trends-r/

### Introduction

Twitter has become a toxic place. There, I said it. It is no longer the fun and happy place it used to be a few years back, certainly not in India. It is now full of trolls, rude and nasty people, and politicians and companies busy selling their products or spreading propaganda.

But I still love Twitter. Partly because it is not Facebook (that’s a good enough reason). However, it pains me to see the negativity every time I visit it. As a user, it appears that the Twitter team isn’t moving fast and hard enough to eliminate the problem of trolls and propaganda. So I decided to approach this problem on my own, doing what I do best – data analysis. In this post, I use Twitter data and some basic analysis in R to examine a very specific part of the problem – unnaturally trending hashtags and topics on Twitter.

### Summary

In this post, I've explained a way to use R with Twitter API data to identify potentially fake / forced trends on Twitter, based on the posting frequency pattern and the duplicity of the tweets. We are trying to answer the following question using data:

- How can we identify and differentiate real / natural Twitter trends from the forced / propaganda / fake trends showing up in the ‘trending’ tab?

### Output

The output contains the following values:
- Tweet frequency pattern diagrams
- A table of duplicate posts with frequency

We can export this list to a CSV file and analyse it further in Excel.

Read the post to understand the process in detail!

--------------------------------------------------------------------------------
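The score at the heart of this analysis is simple: the number of distinct tweets posted under a trend divided by the total number of tweets. Before the full R script below, here is a minimal illustration of that idea in Python (the tweet list is invented, and the texts are assumed to be already cleaned and lower-cased):

```python
from collections import Counter

def uniqueness_score(tweets):
    """Distinct tweets / total tweets: close to 1 looks organic, close to 0 looks copy-pasted."""
    tweets = [t for t in tweets if t]          # drop empty strings, as the R script does
    if not tweets:
        return None
    return len(Counter(tweets)) / len(tweets)

# A forced trend dominated by a single template scores low:
sample = ["support the xyz movement now"] * 8 + ["my honest take on xyz", "why is xyz trending"]
print(uniqueness_score(sample))                # 0.3
```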
/1805 - Identifying Forced And Fake Twitter Trends/Twitter Fake Trend Analysis.R:
--------------------------------------------------------------------------------
# Identifying Forced And Fake Twitter Trends
# CREATED BY: Ayush Kumar
# WEBSITE: https://everydayplots.com
# GITHUB: https://github.com/kumaagx

# Description: Input any Twitter trend or search term and see whether it is trending naturally
# or being forced to trend. A genuine trend should have a variety of unique tweets distributed
# over time; a forced or fake trend, however, has a lot of unevenly distributed duplicate tweets,
# because people / bots copy-paste from pre-created templates in bulk.

# Features:
# 1. Check the currently trending topics within R
# 2. Extract & clean up original tweets for trends / search terms
# 3. Visualize the tweet frequency for a trend over time
# 4. Analyze for duplicity / fakeness and assign a score

# --------------------- #
# INITIALIZE
# --------------------- #

setwd('C:/Analysis Folder')

# Load the required R libraries
# install.packages("twitteR")
library(twitteR)
library(stringr)

download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

# OAuth endpoints for the Twitter API
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

# Get your keys from https://apps.twitter.com/
consumerKey <- "XXXXX"
consumerSecret <- "XXXXX"
accessToken <- "XXXXX"
accessTokenSecret <- "XXXXX"

setup_twitter_oauth(consumerKey,
                    consumerSecret,
                    accessToken,
                    accessTokenSecret)

# --------------------- #
# PART 1 - SEARCH AVAILABLE TRENDS
# --------------------- #

# Search for trending topics
woeid <- availableTrendLocations()
woeid[woeid$country == 'India', ]

tr <- getTrends(23424848)  # Trends in India (WOEID 23424848)
tr[, 1]

# --------------------- #
# PART 2 - DEFINE FUNCTION
# --------------------- #

IdentifyFakeTrend <- function(trend) {

  # twitteR documentation: https://www.rdocumentation.org/packages/twitteR/versions/1.1.9
  # twi1 <- searchTwitter("#SampleHashtag", n = 9000, lang = NULL, resultType = "recent")
  twi_all <- searchTwitter(trend, n = 10000, lang = NULL, since = '2018-05-01')  # Extract tweets for a trend
  twi_ori <- strip_retweets(twi_all)  # Keep only "pure" original tweets

  twi_all_df <- twListToDF(twi_all)  # Convert tweet list to data frame
  twi_ori_df <- twListToDF(twi_ori)  # Convert tweet list to data frame
  print(paste("Total tweets about", trend, "are:", nrow(twi_all_df),
              "( Tweets -", nrow(twi_ori_df), "| RTs -", nrow(twi_all_df) - nrow(twi_ori_df), ")"))

  # Plot the tweet frequency pattern over time (original tweets only, 5-minute buckets)
  freqplot <- subset(twi_ori_df, select = c("created", "id"))
  timest <- as.POSIXct(freqplot$created)
  attributes(timest)$tzone <- "Asia/Calcutta"
  brks <- trunc(range(timest), "mins")
  hist(timest, freq = TRUE, breaks = seq(brks[1], brks[2] + 3600, by = "5 min"),
       main = paste("Pattern of tweets on", trend), xlab = "Time")

  # Extract essential columns: text, created, id, screenName
  twi_clean <- subset(twi_ori_df, select = c("text", "created", "id", "screenName"))

  # Clean up the tweet text
  twi_clean$text <- gsub("[^[:alnum:][:space:]]*", "", twi_clean$text)
  twi_clean$text <- gsub("http\\w*", "", twi_clean$text)
  twi_clean$text <- gsub("\\n", " ", twi_clean$text)
  twi_clean$text <- gsub("\\s+", " ", str_trim(twi_clean$text))
  twi_clean$text <- tolower(twi_clean$text)

  # Create a sorted table of unique tweets with counts
  twfreq <- as.data.frame(table(twi_clean$text))  # create a frequency table of duplicates
  twfreq <- twfreq[!(twfreq$Var1 == ""), ]        # remove blanks
  twfreq <- twfreq[order(-twfreq$Freq), ]         # sort by frequency of duplicates
  # print(head(twfreq))

  # Calculate the tweet uniqueness score:
  # count of unique tweets / count of total tweets.
  # The uniqueness score of natural trends is closer to 1;
  # the uniqueness score of fake / forced trends is closer to 0.
  uniqueness <- nrow(twfreq) / sum(twfreq$Freq)

  # For verifying manually
  # write.csv(twi_clean, file = "twi_clean.csv")
  # write.csv(twfreq, file = "twfreq.csv")

  # print(uniqueness)
  print(paste("The uniqueness score is", uniqueness))

  return(twfreq)
}

# --------------------- #
# PART 3 - TEST THE FUNCTION
# --------------------- #

twfreq <- IdentifyFakeTrend('#SampleHashtag')
twfreq <- IdentifyFakeTrend('#MondayMotivation')

--------------------------------------------------------------------------------
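The frequency-pattern step above is essentially a histogram of tweet timestamps in 5-minute buckets. For readers more comfortable with Python, here is a rough equivalent sketch using pandas (the timestamps are invented for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented timestamps; in practice these would come from the Twitter API response
times = pd.to_datetime([
    "2018-05-20 10:01", "2018-05-20 10:02", "2018-05-20 10:02",
    "2018-05-20 10:31", "2018-05-20 11:05",
])

# Count tweets per 5-minute bucket, mirroring the hist() call in the R script
counts = pd.Series(1, index=times).resample("5min").sum()

counts.plot(kind="bar", title="Pattern of tweets (5-minute buckets)")
plt.xlabel("Time")
plt.ylabel("Tweets")
plt.tight_layout()
plt.show()
```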
/1806 - Instagram Hashtag Analysis In Python/Instagram Hashtag Analysis.py:
--------------------------------------------------------------------------------
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import datetime


driver = webdriver.Chrome()

# Extract the description of a post from an Instagram link
driver.get('https://www.instagram.com/p/BiRnjDsFKzl/')
soup = BeautifulSoup(driver.page_source, "lxml")
desc = " "

for item in soup.findAll('a'):
    desc = desc + " " + str(item.string)

# Extract the tag list from the Instagram post description
taglist = desc.split()
taglist = [x.strip('#') for x in taglist if x.startswith('#')]

# (OR) Copy-paste your tag list manually here
# taglist = ['art', 'instaart', 'iblackwork']

print(taglist)


# Define a dataframe to store hashtag information
tag_df = pd.DataFrame(columns=['Hashtag', 'Number of Posts', 'Posting Freq (mins)'])

# Loop over each hashtag to extract information
for tag in taglist:

    driver.get('https://www.instagram.com/explore/tags/' + str(tag))
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Extract the current hashtag name
    tagname = tag

    # Extract the total number of posts in this hashtag
    # NOTE: The class name may change in the website code;
    # get the latest class name by inspecting the web page
    nposts = soup.find('span', {'class': 'g47SY'}).text

    # Extract all post links from the 'explore tags' page
    # (needed to work out the posting frequency of recent posts)
    myli = []
    for a in soup.find_all('a', href=True):
        myli.append(a['href'])

    # Keep the links of only the 1st and 9th most recent posts:
    # skip the first 9 links (typically the 'Top Posts' grid), keep the next 9
    # (most recent posts), then drop everything between the 1st and 9th of those
    newmyli = [x for x in myli if x.startswith('/p/')]
    del newmyli[:9]
    del newmyli[9:]
    del newmyli[1:8]

    timediff = []

    # Extract the posting time of the 1st and 9th most recent post for this tag
    for j in range(len(newmyli)):
        driver.get('https://www.instagram.com' + str(newmyli[j]))
        soup = BeautifulSoup(driver.page_source, "lxml")

        for i in soup.findAll('time'):
            if i.has_attr('datetime'):
                timediff.append(i['datetime'])
                # print(i['datetime'])

    # Calculate the time difference between the two posts
    # to obtain the posting frequency
    datetimeFormat = '%Y-%m-%dT%H:%M:%S.%fZ'
    diff = datetime.datetime.strptime(timediff[0], datetimeFormat) \
        - datetime.datetime.strptime(timediff[1], datetimeFormat)
    pfreq = int(diff.total_seconds() / (9 * 60))

    # Add the hashtag info to the dataframe
    tag_df.loc[len(tag_df)] = [tagname, nposts, pfreq]

driver.quit()

# Check the final dataframe
print(tag_df)

# CSV output for hashtag analysis
tag_df.to_csv('hashtag_list.csv')

--------------------------------------------------------------------------------
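Once hashtag_list.csv has been written, the "popular vs niche" question can be answered directly in pandas instead of Excel. A small follow-up sketch, not part of the original script, assuming the 'Number of Posts' column comes back as a display string such as "1,234,567":

```python
import pandas as pd

tags = pd.read_csv("hashtag_list.csv", index_col=0)

# Instagram displays post counts with thousands separators, so normalise to integers first
tags["Number of Posts"] = (tags["Number of Posts"]
                           .astype(str)
                           .str.replace(",", "", regex=False)
                           .astype(int))

# Popular tags have the most posts; crowded, fast-moving tags have the smallest posting interval
print(tags.sort_values("Number of Posts", ascending=False).head(10))
print(tags.sort_values("Posting Freq (mins)").head(10))
```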
/1806 - Instagram Hashtag Analysis In Python/README.md:
--------------------------------------------------------------------------------
# Instagram Hashtag Analysis In Python

Read the post here: https://www.everydayplots.com/instagram-hashtag-analysis-python/

### Introduction

Not many people know this, but apart from being a data analyst, I am an artist too. This means that I regularly create art and post it on my Instagram account. Making art, just like doing an analysis, takes a lot of time and effort. And it makes me sad when I’m not able to get enough social validation in the form of likes, comments or new followers on my posts.

So I keep trying out different methods to increase my following and post engagement. One of the methods I use is to include relevant Instagram hashtags in my posts. But the biggest struggle is finding the most relevant hashtags for a particular post. How do I know whether the hashtags I’m using are effective enough? So I decided to tackle this problem by doing what I do best (apart from making art!) – I wrote some Python code to do my own Instagram hashtag analysis!

### Summary

In this post, I've explained a way to use Python with Selenium + BeautifulSoup + ChromeDriver to extract and analyse hashtag data from the Instagram website. We are trying to answer these questions using data:

- Where can I find and extract a list of hashtags to analyse?
- How can I identify which of my hashtags are popular / niche / relevant?
- How do I do all this in a fast and automated way?

### Output

The output contains the following values:
- List of hashtags
- Total number of posts under each hashtag
- Hashtag posting frequency

We can export this list to a CSV file and analyse it further in Excel.

Read the post to understand the process in detail!

--------------------------------------------------------------------------------
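For reference, the hashtag posting frequency reported above is simply the time gap between a tag's 1st and 9th most recent posts, divided across those nine posts and expressed in minutes. A tiny worked example with made-up timestamps in the same format the script parses:

```python
import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
newest = datetime.datetime.strptime("2018-06-01T10:45:00.000Z", fmt)
ninth = datetime.datetime.strptime("2018-06-01T10:00:00.000Z", fmt)

# Nine posts in 45 minutes -> roughly one new post every 5 minutes on this tag
interval_mins = (newest - ninth).total_seconds() / (9 * 60)
print(interval_mins)  # 5.0
```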
/1807 - Reading Habit Analysis Using Pocket API And Python/Pocket API Analysis.py:
--------------------------------------------------------------------------------
import requests
import pandas as pd
import json
import datetime
import matplotlib.pyplot as plt


# STEP 1: Get a consumer_key by creating a new Pocket application
# Link: https://getpocket.com/developer/apps/new

# STEP 2: Get a request token
# Connect to the Pocket API
# The pocket_api variable stores the HTTP response
pocket_api = requests.post('https://getpocket.com/v3/oauth/request',
                           data={'consumer_key': '12345-23ae05df52291ea13b135dff',
                                 'redirect_uri': 'https://google.com'})

# Check the response: 200 means all OK
pocket_api.status_code

# Check the error reason, if any
# print(pocket_api.headers['X-Error'])

# Here is your request_token
# It is part of the HTTP response stored in pocket_api.text
pocket_api.text

# STEP 3: Authenticate
# Modify and paste the link below in the browser and authenticate
# Replace the text after "?request_token=" with the request_token generated above
# https://getpocket.com/auth/authorize?request_token=PASTE-YOUR-REQUEST-TOKEN-HERE&redirect_uri=https://getpocket.com/connected_applications


# STEP 4: Generate an access_token
# After authenticating in the browser, return here
# Use your consumer_key and request_token below
pocket_auth = requests.post('https://getpocket.com/v3/oauth/authorize',
                            data={'consumer_key': '12345-23ae05df52291ea13b135dff',
                                  'code': 'a1dc2a39-abcd-af28-e235-25ddd4'})

# Check the response: 200 means all OK
# pocket_auth.status_code

# Check the error reason, if any
# print(pocket_auth.headers['X-Error'])

# Finally, here is your access_token
# We're done authenticating
pocket_auth.text


# Get data from the API
# Reference: https://getpocket.com/developer/docs/v3/retrieve
pocket_add = requests.post('https://getpocket.com/v3/get',
                           data={'consumer_key': '12345-23ae05df52291ea13b135dff',
                                 'access_token': 'b07ff4be-abcd-4685-2d70-d47816',
                                 'state': 'all',
                                 'detailType': 'simple'})

# Check the response: 200 means all OK
# pocket_add.status_code

# Here is your fetched JSON data
pocket_add.text


# Prepare the dataframe: convert JSON to table format
json_data = json.loads(pocket_add.text)

df_temp = pd.DataFrame()
df = pd.DataFrame()
for key in json_data['list'].keys():
    df_temp = pd.DataFrame(json_data['list'][key], index=[0])
    df = pd.concat([df, df_temp])

df = df[['item_id', 'status', 'favorite', 'given_title', 'given_url', 'resolved_url',
         'time_added', 'time_read', 'time_to_read', 'word_count']]
df.head(5)

# Clean up the dataset
df.dtypes
df[['status', 'favorite', 'word_count']] = df[['status', 'favorite', 'word_count']].astype(int)
df['time_added'] = pd.to_datetime(df['time_added'], unit='s')
df['time_read'] = pd.to_datetime(df['time_read'], unit='s')
df['date_added'] = df['time_added'].dt.date
df['date_read'] = df['time_read'].dt.date

# Save the dataframe as CSV locally
df.to_csv('pocket_list.csv')

# Check the data types again
df.dtypes


# Answer questions using data

# How many items are there in my Pocket?
print(df['item_id'].count())

# What % of articles have been read?
print((df['status'].sum() * 100) / df['item_id'].count())

# How long is the average article in my Pocket? (minutes)
df['time_to_read'].describe()

# How long is the average article in my Pocket? (word count)
df['word_count'].describe()

# What is the % of favorites?
print((df['favorite'].sum() * 100) / df['item_id'].count())

# How many words have I read to date?
print(df.loc[df['status'] == 1, 'word_count'].sum())

# How many books is this equivalent to? (assuming roughly 64,000 words per book)
print(df.loc[df['status'] == 1, 'word_count'].sum() / 64000)

# How were articles added over time?
plot_added = df.groupby('date_added')['item_id'].count()
plot_added.describe()
# plot_added.head(10)

# How were articles read over time?
plot_read = df.groupby('date_read')['status'].sum()
plot_read.describe()
# plot_read.head(10)

# Word cloud of the topics I read about
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color='white',
                      stopwords=stopwords,
                      max_words=300,
                      max_font_size=40,
                      random_state=42
                      ).generate(str(df['given_title']))

print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
fig.savefig("Pocket Wordcloud.png", dpi=900)

--------------------------------------------------------------------------------
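The script above summarises plot_added and plot_read with describe() but stops short of charting them. A possible follow-up sketch that reads the saved pocket_list.csv back in and plots both series (column names as written by the script above):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("pocket_list.csv", parse_dates=["time_added", "time_read"])
df["date_added"] = df["time_added"].dt.date
df["date_read"] = df["time_read"].dt.date

added = df.groupby("date_added")["item_id"].count()   # articles saved per day
read = df.groupby("date_read")["status"].sum()        # articles read per day

ax = added.plot(figsize=(10, 4), label="Added per day")
read.plot(ax=ax, label="Read per day")
ax.set_ylabel("Articles")
ax.legend()
plt.show()
```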
/1807 - Reading Habit Analysis Using Pocket API And Python/README.md:
--------------------------------------------------------------------------------
# Reading Habit Analysis Using Pocket API And Python

Still working on this post. Check back in a while!

Meanwhile, read my previous post here: https://www.everydayplots.com/instagram-hashtag-analysis-python/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# EverydayPlots.com - All Analysis Codes

This GitHub repo contains all the data analysis code (Python / R) for the posts published on my data analytics blog: https://www.everydayplots.com

Here is a list of the blog posts I've published in 2018. For descriptions, read the README files in the individual post folders.

Post 1: [Identifying Forced And Fake Twitter Trends Using R](https://www.everydayplots.com/identifying-forced-fake-twitter-trends-r/)
Post 2: [Instagram Hashtag Analysis In Python](https://www.everydayplots.com/instagram-hashtag-analysis-python/)
Post 3: Reading Habit Analysis Using Pocket API And Python (Coming soon!)

More posts coming soon!

--------------------------------------------------------------------------------