├── 1805 - Identifying Forced And Fake Twitter Trends
│   ├── README.md
│   └── Twitter Fake Trend Analysis.R
├── 1806 - Instagram Hashtag Analysis In Python
│   ├── Instagram Hashtag Analysis.py
│   └── README.md
├── 1807 - Reading Habit Analysis Using Pocket API And Python
│   ├── Pocket API Analysis.py
│   └── README.md
├── LICENSE
└── README.md

/1805 - Identifying Forced And Fake Twitter Trends/README.md:
--------------------------------------------------------------------------------
# Identifying Forced And Fake Twitter Trends Using R

Read the post here: https://www.everydayplots.com/identifying-forced-fake-twitter-trends-r/

### Introduction

Twitter has become a toxic place. There, I said it. It is no longer the fun and happy place it used to be a few years back, certainly not in India. It is now full of trolls, rude and nasty people, and politicians and companies busy selling their products or spreading propaganda.

But I still love Twitter. Partly because it is not Facebook (that’s a good enough reason). However, it pains me to see the negativity every time I visit it. As a user, it appears that the Twitter team isn’t moving fast and hard enough to eliminate the problem of trolls and propaganda. So I decided to approach this problem on my own, doing what I do best – data analysis. In this post, I use Twitter data and some basic analysis in R to examine a very specific part of the problem – unnaturally trending hashtags and topics on Twitter.

### Summary

In this post, I've explained a way to use R with Twitter API data to identify potentially fake / forced trends on Twitter, based on the posting frequency pattern and the duplicity of the tweets. We are trying to answer the following question using data:

- How can we identify and differentiate real / natural Twitter trends from the forced / propaganda / fake trends showing up in the ‘trending’ tab?

### Output

The output contains the following values:
- Tweet frequency pattern diagrams
- A table of duplicate posts with frequency

We can export this list to a CSV file and analyse it further in Excel.

Read the post to understand the process in detail!

--------------------------------------------------------------------------------
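The score at the heart of this analysis is simple: the number of distinct tweets posted under a trend divided by the total number of tweets. Before the full R script below, here is a minimal illustration of that idea in Python (the tweet list is invented, and the texts are assumed to be already cleaned and lower-cased):

```python
from collections import Counter

def uniqueness_score(tweets):
    """Distinct tweets / total tweets: close to 1 looks organic, close to 0 looks copy-pasted."""
    tweets = [t for t in tweets if t]          # drop empty strings, as the R script does
    if not tweets:
        return None
    return len(Counter(tweets)) / len(tweets)

# A forced trend dominated by a single template scores low:
sample = ["support the xyz movement now"] * 8 + ["my honest take on xyz", "why is xyz trending"]
print(uniqueness_score(sample))                # 0.3
```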
/1805 - Identifying Forced And Fake Twitter Trends/Twitter Fake Trend Analysis.R:
--------------------------------------------------------------------------------
# Identifying Forced And Fake Twitter Trends
# CREATED BY: Ayush Kumar
# WEBSITE: https://everydayplots.com
# GITHUB: https://github.com/kumaagx

# Description: Input any Twitter trend or search term and see whether it is trending naturally
# or being forced to trend. A genuine trend should have a variety of unique tweets distributed
# over time; a forced or fake trend, however, has a lot of unevenly distributed duplicate tweets,
# because people / bots copy-paste from pre-created templates in bulk.

# Features:
# 1. Check the currently trending topics within R
# 2. Extract & clean up original tweets for trends / search terms
# 3. Visualize the tweet frequency for a trend over time
# 4. Analyze for duplicity / fakeness and assign a score

# --------------------- #
# INITIALIZE
# --------------------- #

setwd('C:/Analysis Folder')

# Load the required R libraries
# install.packages("twitteR")
library(twitteR)
library(stringr)

download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

# OAuth endpoints for the Twitter API
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

# Get your keys from https://apps.twitter.com/
consumerKey <- "XXXXX"
consumerSecret <- "XXXXX"
accessToken <- "XXXXX"
accessTokenSecret <- "XXXXX"

setup_twitter_oauth(consumerKey,
                    consumerSecret,
                    accessToken,
                    accessTokenSecret)

# --------------------- #
# PART 1 - SEARCH AVAILABLE TRENDS
# --------------------- #

# Search for trending topics
woeid <- availableTrendLocations()
woeid[woeid$country == 'India', ]

tr <- getTrends(23424848)  # Trends in India (WOEID 23424848)
tr[, 1]

# --------------------- #
# PART 2 - DEFINE FUNCTION
# --------------------- #

IdentifyFakeTrend <- function(trend) {

  # twitteR documentation: https://www.rdocumentation.org/packages/twitteR/versions/1.1.9
  # twi1 <- searchTwitter("#SampleHashtag", n = 9000, lang = NULL, resultType = "recent")
  twi_all <- searchTwitter(trend, n = 10000, lang = NULL, since = '2018-05-01')  # Extract tweets for a trend
  twi_ori <- strip_retweets(twi_all)  # Keep only "pure" original tweets

  twi_all_df <- twListToDF(twi_all)  # Convert tweet list to data frame
  twi_ori_df <- twListToDF(twi_ori)  # Convert tweet list to data frame
  print(paste("Total tweets about", trend, "are:", nrow(twi_all_df),
              "( Tweets -", nrow(twi_ori_df), "| RTs -", nrow(twi_all_df) - nrow(twi_ori_df), ")"))

  # Plot the tweet frequency pattern over time (original tweets only, 5-minute buckets)
  freqplot <- subset(twi_ori_df, select = c("created", "id"))
  timest <- as.POSIXct(freqplot$created)
  attributes(timest)$tzone <- "Asia/Calcutta"
  brks <- trunc(range(timest), "mins")
  hist(timest, freq = TRUE, breaks = seq(brks[1], brks[2] + 3600, by = "5 min"),
       main = paste("Pattern of tweets on", trend), xlab = "Time")

  # Extract essential columns: text, created, id, screenName
  twi_clean <- subset(twi_ori_df, select = c("text", "created", "id", "screenName"))

  # Clean up the tweet text
  twi_clean$text <- gsub("[^[:alnum:][:space:]]*", "", twi_clean$text)
  twi_clean$text <- gsub("http\\w*", "", twi_clean$text)
  twi_clean$text <- gsub("\\n", " ", twi_clean$text)
  twi_clean$text <- gsub("\\s+", " ", str_trim(twi_clean$text))
  twi_clean$text <- tolower(twi_clean$text)

  # Create a sorted table of unique tweets with counts
  twfreq <- as.data.frame(table(twi_clean$text))  # create a frequency table of duplicates
  twfreq <- twfreq[!(twfreq$Var1 == ""), ]        # remove blanks
  twfreq <- twfreq[order(-twfreq$Freq), ]         # sort by frequency of duplicates
  # print(head(twfreq))

  # Calculate the tweet uniqueness score:
  # count of unique tweets / count of total tweets.
  # The uniqueness score of natural trends is closer to 1;
  # the uniqueness score of fake / forced trends is closer to 0.
  uniqueness <- nrow(twfreq) / sum(twfreq$Freq)

  # For verifying manually
  # write.csv(twi_clean, file = "twi_clean.csv")
  # write.csv(twfreq, file = "twfreq.csv")

  # print(uniqueness)
  print(paste("The uniqueness score is", uniqueness))

  return(twfreq)
}

# --------------------- #
# PART 3 - TEST THE FUNCTION
# --------------------- #

twfreq <- IdentifyFakeTrend('#SampleHashtag')
twfreq <- IdentifyFakeTrend('#MondayMotivation')

--------------------------------------------------------------------------------
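The frequency-pattern step above is essentially a histogram of tweet timestamps in 5-minute buckets. For readers more comfortable with Python, here is a rough equivalent sketch using pandas (the timestamps are invented for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented timestamps; in practice these would come from the Twitter API response
times = pd.to_datetime([
    "2018-05-20 10:01", "2018-05-20 10:02", "2018-05-20 10:02",
    "2018-05-20 10:31", "2018-05-20 11:05",
])

# Count tweets per 5-minute bucket, mirroring the hist() call in the R script
counts = pd.Series(1, index=times).resample("5min").sum()

counts.plot(kind="bar", title="Pattern of tweets (5-minute buckets)")
plt.xlabel("Time")
plt.ylabel("Tweets")
plt.tight_layout()
plt.show()
```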
/1806 - Instagram Hashtag Analysis In Python/Instagram Hashtag Analysis.py:
--------------------------------------------------------------------------------
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import datetime


driver = webdriver.Chrome()

# Extract the description of a post from an Instagram link
driver.get('https://www.instagram.com/p/BiRnjDsFKzl/')
soup = BeautifulSoup(driver.page_source, "lxml")
desc = " "

for item in soup.findAll('a'):
    desc = desc + " " + str(item.string)

# Extract the tag list from the Instagram post description
taglist = desc.split()
taglist = [x.strip('#') for x in taglist if x.startswith('#')]

# (OR) Copy-paste your tag list manually here
# taglist = ['art', 'instaart', 'iblackwork']

print(taglist)


# Define a dataframe to store hashtag information
tag_df = pd.DataFrame(columns=['Hashtag', 'Number of Posts', 'Posting Freq (mins)'])

# Loop over each hashtag to extract information
for tag in taglist:

    driver.get('https://www.instagram.com/explore/tags/' + str(tag))
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Extract the current hashtag name
    tagname = tag

    # Extract the total number of posts in this hashtag
    # NOTE: The class name may change in the website code;
    # get the latest class name by inspecting the web page
    nposts = soup.find('span', {'class': 'g47SY'}).text

    # Extract all post links from the 'explore tags' page
    # (needed to work out the posting frequency of recent posts)
    myli = []
    for a in soup.find_all('a', href=True):
        myli.append(a['href'])

    # Keep the links of only the 1st and 9th most recent posts:
    # skip the first 9 links (typically the 'Top Posts' grid), keep the next 9
    # (most recent posts), then drop everything between the 1st and 9th of those
    newmyli = [x for x in myli if x.startswith('/p/')]
    del newmyli[:9]
    del newmyli[9:]
    del newmyli[1:8]

    timediff = []

    # Extract the posting time of the 1st and 9th most recent post for this tag
    for j in range(len(newmyli)):
        driver.get('https://www.instagram.com' + str(newmyli[j]))
        soup = BeautifulSoup(driver.page_source, "lxml")

        for i in soup.findAll('time'):
            if i.has_attr('datetime'):
                timediff.append(i['datetime'])
                # print(i['datetime'])

    # Calculate the time difference between the two posts
    # to obtain the posting frequency
    datetimeFormat = '%Y-%m-%dT%H:%M:%S.%fZ'
    diff = datetime.datetime.strptime(timediff[0], datetimeFormat) \
        - datetime.datetime.strptime(timediff[1], datetimeFormat)
    pfreq = int(diff.total_seconds() / (9 * 60))

    # Add the hashtag info to the dataframe
    tag_df.loc[len(tag_df)] = [tagname, nposts, pfreq]

driver.quit()

# Check the final dataframe
print(tag_df)

# CSV output for hashtag analysis
tag_df.to_csv('hashtag_list.csv')

--------------------------------------------------------------------------------
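Once hashtag_list.csv has been written, the "popular vs niche" question can be answered directly in pandas instead of Excel. A small follow-up sketch, not part of the original script, assuming the 'Number of Posts' column comes back as a display string such as "1,234,567":

```python
import pandas as pd

tags = pd.read_csv("hashtag_list.csv", index_col=0)

# Instagram displays post counts with thousands separators, so normalise to integers first
tags["Number of Posts"] = (tags["Number of Posts"]
                           .astype(str)
                           .str.replace(",", "", regex=False)
                           .astype(int))

# Popular tags have the most posts; crowded, fast-moving tags have the smallest posting interval
print(tags.sort_values("Number of Posts", ascending=False).head(10))
print(tags.sort_values("Posting Freq (mins)").head(10))
```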
/1806 - Instagram Hashtag Analysis In Python/README.md:
--------------------------------------------------------------------------------
# Instagram Hashtag Analysis In Python

Read the post here: https://www.everydayplots.com/instagram-hashtag-analysis-python/

### Introduction

Not many people know this, but apart from being a data analyst, I am an artist too. This means that I regularly create art and post it on my Instagram account. Making art, just like doing an analysis, takes a lot of time and effort. And it makes me sad when I’m not able to get enough social validation in the form of likes, comments or new followers on my posts.

So I keep trying out different methods to increase my following and post engagement. One of the methods I use is to include relevant Instagram hashtags in my posts. But the biggest struggle is finding the most relevant hashtags for a particular post. How do I know whether the hashtags I’m using are effective enough? So I decided to tackle this problem by doing what I do best (apart from making art!) – I wrote some Python code to do my own Instagram hashtag analysis!

### Summary

In this post, I've explained a way to use Python with Selenium + BeautifulSoup + ChromeDriver to extract and analyse hashtag data from the Instagram website. We are trying to answer these questions using data:

- Where can I find and extract a list of hashtags to analyse?
- How can I identify which of my hashtags are popular / niche / relevant?
- How do I do all this in a fast and automated way?

### Output

The output contains the following values:
- List of hashtags
- Total number of posts under each hashtag
- Hashtag posting frequency

We can export this list to a CSV file and analyse it further in Excel.

Read the post to understand the process in detail!

--------------------------------------------------------------------------------
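For reference, the hashtag posting frequency reported above is simply the time gap between a tag's 1st and 9th most recent posts, divided across those nine posts and expressed in minutes. A tiny worked example with made-up timestamps in the same format the script parses:

```python
import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
newest = datetime.datetime.strptime("2018-06-01T10:45:00.000Z", fmt)
ninth = datetime.datetime.strptime("2018-06-01T10:00:00.000Z", fmt)

# Nine posts in 45 minutes -> roughly one new post every 5 minutes on this tag
interval_mins = (newest - ninth).total_seconds() / (9 * 60)
print(interval_mins)  # 5.0
```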
/1807 - Reading Habit Analysis Using Pocket API And Python/Pocket API Analysis.py:
--------------------------------------------------------------------------------
import requests
import pandas as pd
import json
import datetime
import matplotlib.pyplot as plt


# STEP 1: Get a consumer_key by creating a new Pocket application
# Link: https://getpocket.com/developer/apps/new

# STEP 2: Get a request token
# Connect to the Pocket API
# The pocket_api variable stores the HTTP response
pocket_api = requests.post('https://getpocket.com/v3/oauth/request',
                           data={'consumer_key': '12345-23ae05df52291ea13b135dff',
                                 'redirect_uri': 'https://google.com'})

# Check the response: 200 means all OK
pocket_api.status_code

# Check the error reason, if any
# print(pocket_api.headers['X-Error'])

# Here is your request_token
# It is part of the HTTP response stored in pocket_api.text
pocket_api.text

# STEP 3: Authenticate
# Modify and paste the link below in the browser and authenticate
# Replace the text after "?request_token=" with the request_token generated above
# https://getpocket.com/auth/authorize?request_token=PASTE-YOUR-REQUEST-TOKEN-HERE&redirect_uri=https://getpocket.com/connected_applications


# STEP 4: Generate an access_token
# After authenticating in the browser, return here
# Use your consumer_key and request_token below
pocket_auth = requests.post('https://getpocket.com/v3/oauth/authorize',
                            data={'consumer_key': '12345-23ae05df52291ea13b135dff',
                                  'code': 'a1dc2a39-abcd-af28-e235-25ddd4'})

# Check the response: 200 means all OK
# pocket_auth.status_code

# Check the error reason, if any
# print(pocket_auth.headers['X-Error'])

# Finally, here is your access_token
# We're done authenticating
pocket_auth.text


# Get data from the API
# Reference: https://getpocket.com/developer/docs/v3/retrieve
pocket_add = requests.post('https://getpocket.com/v3/get',
                           data={'consumer_key': '12345-23ae05df52291ea13b135dff',
                                 'access_token': 'b07ff4be-abcd-4685-2d70-d47816',
                                 'state': 'all',
                                 'detailType': 'simple'})

# Check the response: 200 means all OK
# pocket_add.status_code

# Here is your fetched JSON data
pocket_add.text


# Prepare the dataframe: convert JSON to table format
json_data = json.loads(pocket_add.text)

df_temp = pd.DataFrame()
df = pd.DataFrame()
for key in json_data['list'].keys():
    df_temp = pd.DataFrame(json_data['list'][key], index=[0])
    df = pd.concat([df, df_temp])

df = df[['item_id', 'status', 'favorite', 'given_title', 'given_url', 'resolved_url',
         'time_added', 'time_read', 'time_to_read', 'word_count']]
df.head(5)

# Clean up the dataset
df.dtypes
df[['status', 'favorite', 'word_count']] = df[['status', 'favorite', 'word_count']].astype(int)
df['time_added'] = pd.to_datetime(df['time_added'], unit='s')
df['time_read'] = pd.to_datetime(df['time_read'], unit='s')
df['date_added'] = df['time_added'].dt.date
df['date_read'] = df['time_read'].dt.date

# Save the dataframe as CSV locally
df.to_csv('pocket_list.csv')

# Check the data types again
df.dtypes


# Answer questions using data

# How many items are there in my Pocket?
print(df['item_id'].count())

# What % of articles have been read?
print((df['status'].sum() * 100) / df['item_id'].count())

# How long is the average article in my Pocket? (minutes)
df['time_to_read'].describe()

# How long is the average article in my Pocket? (word count)
df['word_count'].describe()

# What is the % of favorites?
print((df['favorite'].sum() * 100) / df['item_id'].count())

# How many words have I read to date?
print(df.loc[df['status'] == 1, 'word_count'].sum())

# How many books is this equivalent to? (assuming roughly 64,000 words per book)
print(df.loc[df['status'] == 1, 'word_count'].sum() / 64000)

# How were articles added over time?
plot_added = df.groupby('date_added')['item_id'].count()
plot_added.describe()
# plot_added.head(10)

# How were articles read over time?
plot_read = df.groupby('date_read')['status'].sum()
plot_read.describe()
# plot_read.head(10)

# Word cloud of the topics I read about
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color='white',
                      stopwords=stopwords,
                      max_words=300,
                      max_font_size=40,
                      random_state=42
                      ).generate(str(df['given_title']))

print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
fig.savefig("Pocket Wordcloud.png", dpi=900)

--------------------------------------------------------------------------------
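The script above summarises plot_added and plot_read with describe() but stops short of charting them. A possible follow-up sketch that reads the saved pocket_list.csv back in and plots both series (column names as written by the script above):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("pocket_list.csv", parse_dates=["time_added", "time_read"])
df["date_added"] = df["time_added"].dt.date
df["date_read"] = df["time_read"].dt.date

added = df.groupby("date_added")["item_id"].count()   # articles saved per day
read = df.groupby("date_read")["status"].sum()        # articles read per day

ax = added.plot(figsize=(10, 4), label="Added per day")
read.plot(ax=ax, label="Read per day")
ax.set_ylabel("Articles")
ax.legend()
plt.show()
```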
/1807 - Reading Habit Analysis Using Pocket API And Python/README.md:
--------------------------------------------------------------------------------
# Reading Habit Analysis Using Pocket API And Python

Still working on this post. Check back in a while!

Meanwhile, read my previous post here: https://www.everydayplots.com/instagram-hashtag-analysis-python/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# EverydayPlots.com - All Analysis Codes

This GitHub repo contains all the data analysis code (Python / R) for the posts published on my data analytics blog: https://www.everydayplots.com

Here is a list of the blog posts I've published in 2018. For descriptions, read the README files in the individual post folders.

Post 1: [Identifying Forced And Fake Twitter Trends Using R](https://www.everydayplots.com/identifying-forced-fake-twitter-trends-r/)
Post 2: [Instagram Hashtag Analysis In Python](https://www.everydayplots.com/instagram-hashtag-analysis-python/)
Post 3: Reading Habit Analysis Using Pocket API And Python (Coming soon!)

More posts coming soon!

--------------------------------------------------------------------------------