├── A Good Dog.jpg ├── act_report.pdf ├── wrangle_report.pdf ├── Good Dog Brents.jpg ├── Distribution of Dog Stages.JPG ├── Distribution of Tweet Image Number.JPG ├── Linear Correlation Between Retweet count and Favorite count.JPG ├── README.md ├── wrangle_report.ipynb └── act_report.ipynb
/A Good Dog.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Twitter-Data-Wrangling-Project/HEAD/A Good Dog.jpg
--------------------------------------------------------------------------------
/act_report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Twitter-Data-Wrangling-Project/HEAD/act_report.pdf
--------------------------------------------------------------------------------
/wrangle_report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Twitter-Data-Wrangling-Project/HEAD/wrangle_report.pdf
--------------------------------------------------------------------------------
/Good Dog Brents.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Twitter-Data-Wrangling-Project/HEAD/Good Dog Brents.jpg
--------------------------------------------------------------------------------
/Distribution of Dog Stages.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Twitter-Data-Wrangling-Project/HEAD/Distribution of Dog Stages.JPG
--------------------------------------------------------------------------------
/Distribution of Tweet Image Number.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Twitter-Data-Wrangling-Project/HEAD/Distribution of Tweet Image Number.JPG
--------------------------------------------------------------------------------
/Linear Correlation Between Retweet count and Favorite count.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Twitter-Data-Wrangling-Project/HEAD/Linear Correlation Between Retweet count and Favorite count.JPG
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Twitter-Data-Wrangling-Project
2 | 
3 | **This is my second project as a Udacity Scholar in my Udacity Data Analyst Nanodegree Program.**
4 | 
5 | ## Introduction
6 | 
7 | Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.
8 | 
9 | The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why?
Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.
10 | 
11 | WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.
12 | 
13 | ## Project Details
14 | 
15 | In this project, the tasks you will work on are:
16 | 
17 | - Gathering Data
18 | - Assessing Data
19 | - Cleaning Data
20 | - Storing Data
21 | - Analyzing and Visualizing Data
22 | - Reporting on
23 |   - your data wrangling efforts
24 |   - your data analyses and visualizations
25 | 
26 | ## Gathering the Data for this Project
27 | 
28 | Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:
29 | 
30 | 1. **twitter_archive_enhanced.csv**: It's instructed that this file be downloaded manually from a URL provided by Udacity.
31 | 
32 | 2. **The tweet image predictions** (image_predictions.tsv), i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: [https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv]
33 | 
34 | 3. **JSON file from Twitter API**: This file will contain each tweet's retweet count and favorite (i.e. "like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.
35 | 
36 | ## Assessing the Data for this Project
37 | 
38 | After gathering all three pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least **eight (8) quality issues** and **two (2) tidiness issues** in the "Assessing Data" section in the `wrangle_act.ipynb` Jupyter Notebook.
39 | 
40 | ## Key Points
41 | 
42 | Key points to keep in mind when data wrangling for this project:
43 | 
44 | - You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
45 | 
46 | - Fully assessing and cleaning the entire dataset would require exceptional effort, so only a subset of its issues (eight (8) quality issues and two (2) tidiness issues at minimum) needs to be assessed and cleaned.
47 | 
48 | - Cleaning includes merging individual pieces of data according to the rules of tidy data.
49 | 
50 | - The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
51 | 
52 | - You do not need to gather tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
53 | 
54 | ## Cleaning the Data for this Project
55 | 
56 | Clean each of the issues you documented while assessing. Perform this cleaning in the "Cleaning Data" section of the wrangle_act.ipynb Jupyter Notebook, following a Define, Code, and Test pattern for each issue.
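As an illustration only (a minimal sketch, not the required solution for wrangle_act.ipynb), one such Define, Code, Test step could look like the snippet below. The column names (`retweeted_status_id`, `tweet_id`, `timestamp`) come from the enhanced archive described above; adapt the file name and columns to your own data.

```python
import pandas as pd

# Column names follow the WeRateDogs enhanced archive described in this README;
# adjust them if your copy of the data differs.
archive = pd.read_csv('twitter_archive_enhanced.csv')

# Define: keep only original tweets (retweets carry a non-null retweeted_status_id),
# store tweet_id as a string, and parse timestamp as a datetime.

# Code
archive_clean = archive.copy()
archive_clean = archive_clean[archive_clean['retweeted_status_id'].isnull()]
archive_clean['tweet_id'] = archive_clean['tweet_id'].astype(str)
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])

# Test
assert archive_clean['retweeted_status_id'].notnull().sum() == 0
assert archive_clean['tweet_id'].dtype == object
```

When you then analyze and visualize the cleaned data in the Analyzing and Visualizing Data section of the same notebook: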
57 | 
58 | - You must produce at least three (3) insights and one (1) visualization.
59 | - You must clearly document the piece of assessed and cleaned (if necessary) data used to make each analysis and visualization.
60 | 
61 | ## Storing, Analyzing, and Visualizing Data for this Project
62 | 
63 | Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database.
64 | 
65 | Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.
66 | 
67 | ## Reporting for this Project
68 | 
69 | Create a 300-600 word written report called wrangle_report.pdf that briefly describes your wrangling efforts. This is to be framed as an internal document. Create a >250 word written report called act_report.pdf that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.
70 | 
71 | ## Medium Blog on this Project
72 | 
73 | You can also read my [blog](https://medium.com/@chisompromise/twitter-data-analysis-weratedogs-1fb8b65da7fa) post on this project and feel free to connect with me on [LinkedIn](https://www.linkedin.com/in/chisom-promise/).
74 | --------------------------------------------------------------------------------
/wrangle_report.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "0c829f51",
6 |    "metadata": {},
7 |    "source": [
8 |     "# DATA WRANGLING REPORT\n",
9 |     "\n",
10 |     "#### Created by Chisom Promise Nnamani, Udacity Scholar\n",
11 |     "\n",
12 |     "The purpose of this project is to put into practice what I have learned from the Data Wrangling section of the Udacity Data Analyst Nanodegree program. The dataset that is wrangled is the tweet archive of the Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dogs. These ratings almost always have a denominator of 10.\n",
13 |     "\n",
14 |     "### Project Goal:\n",
15 |     "\n",
16 |     "The goal of this project is to effectively wrangle data related to dog ratings. The data is sourced from the Twitter account [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). Once we have effectively gathered, assessed, and cleaned our data in this project, it can be used for our analysis.\n",
17 |     "\n",
18 |     "This report briefly describes my wrangling efforts.\n",
19 |     "\n",
20 |     "### Project Details:\n",
21 |     "\n",
22 |     "**The tasks of this project are as follows:**\n",
23 |     "\n",
24 |     "- Gathering Data\n",
25 |     "- Assessing Data\n",
26 |     "- Cleaning Data\n",
27 |     "\n",
28 |     "### Gathering Data\n",
29 |     "\n",
30 |     "The data used for this project consisted of three different datasets that were obtained as follows:\n",
31 |     "\n",
32 |     "`Twitter archive file`: This data was provided in the project guideline. I downloaded it to my workspace by clicking on the `jupyter` icon and then Upload.
I imported the Python `pandas` library as `pd` and used the pandas read_csv() function to read the file into a dataframe named `twitter_archive`.\n",
33 |     "\n",
34 |     "\n",
35 |     "`Tweet image prediction file`: I imported the Python `requests` and `os` libraries. With the get() function of the requests library, I requested the data through its URL and saved it in a response variable. The response returned status code `200`, meaning that the request was successful.\n",
36 |     "\n",
37 |     "\n",
38 |     "Using Python's `with open()` statement, I wrote the response's content to a `tsv` file in the same working directory. I then read the downloaded tsv file into a dataframe named `image_prediction`.\n",
39 |     "\n",
40 |     "\n",
41 |     "`Tweet_Json text`: I created a Twitter developer account and created an application for the project. I used the app credentials (consumer_key, consumer_secret, access_token, and access_secret) for the Twitter API authentication. I imported `tweepy` and `json`, authenticated with tweepy.OAuthHandler, and set `wait_on_rate_limit` to `True` in the API constructor so that the code waits whenever the rate limit (900 requests per window) is reached and continues automatically at the end of the waiting time. I took the tweet IDs to query from the archive in the first dataset, created an empty dictionary to save failed tweets, and set up a timer for the start and end time.\n",
42 |     "\n",
43 |     "\n",
44 |     "With Python's `with open()` statement, I created `tweet_json.txt` and wrote each tweet's JSON data to it on its own line; the failed ones were appended to the empty dictionary created above. I printed the time taken and the failed dictionary.\n",
45 |     "\n",
46 |     "\n",
47 |     "With `with open()` again and a `for` loop, I read `tweet_json.txt` line by line and parsed each line as JSON. I saved each tweet_id, retweet_count, favorite_count, followers_count and friends_count, which I later converted to a dataframe named `tweet_json`.\n",
48 |     "\n",
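    "As a rough illustration of this last step (a minimal sketch rather than the exact code in wrangle_act.ipynb, and it assumes the per-line JSON objects expose the `id`, `retweet_count` and `favorite_count` keys as returned by the Twitter API), reading the line-per-tweet file back into a dataframe can look like this:\n",
    "\n",
    "```python\n",
    "import json\n",
    "import pandas as pd\n",
    "\n",
    "records = []\n",
    "with open('tweet_json.txt') as file:\n",
    "    for line in file:\n",
    "        tweet = json.loads(line)  # one JSON object per line\n",
    "        # key names assume the standard Twitter API status fields\n",
    "        records.append({'tweet_id': tweet['id'],\n",
    "                        'retweet_count': tweet['retweet_count'],\n",
    "                        'favorite_count': tweet['favorite_count']})\n",
    "\n",
    "tweet_json = pd.DataFrame(records)\n",
    "```\n",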
49 |     "\n",
50 |     "### Assessing Data\n",
51 |     "\n",
52 |     "Once the three tables were obtained, I assessed the data as follows:\n",
53 |     "\n",
54 |     "**Visually:** I printed the three different dataframes individually in a Jupyter notebook and scrolled through them left and right, up and down. Secondly, I visually assessed the csv files in an Excel spreadsheet.\n",
55 |     "\n",
56 |     "\n",
57 |     "**Programmatically:** I did various programmatic assessments with Python and pandas methods and attributes such as .info(), .describe(), .isnull(), .head(), .tail(), .sample(), .duplicated(), .value_counts() and .shape.\n",
58 |     "\n",
59 |     "\n",
60 |     "### Cleaning Data\n",
61 |     "\n",
62 |     "This part of the data wrangling process was divided into three parts: `Define`, `Code` and `Test`.\n",
63 |     "\n",
64 |     "These three steps were applied to each of the issues stated in the assessment section.\n",
65 |     "\n",
66 |     "First, I made a copy of the original three datasets. \n",
67 |     "\n",
68 |     "Twitter_archive = df1_clean\n",
69 |     "Image_predictions = df2_clean\n",
70 |     "Tweet_json = df3_clean\n",
71 |     "\n",
72 |     "\n",
73 |     "Then, I followed the `Define`, `Code` and `Test` process and made the following cleaning efforts:\n",
74 |     "\n",
75 |     " - I removed retweets, which won't be used for the analysis. I was able to do this using the tweet IDs.\n",
76 |     "\n",
77 |     " - I dropped the retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, in_reply_to_status_id and in_reply_to_user_id columns because each of them has over 90% missing values.\n",
78 |     "\n",
79 |     " - I combined the four dog stages spread across four columns into one single column.\n",
80 |     "\n",
81 |     " - I dropped the followers_count and friends_count columns as they don't contain values that would be relevant to the analysis.\n",
82 |     "\n",
83 |     " - I converted the timestamp column to the datetime data type.\n",
84 |     "\n",
85 |     " - I converted the tweet_id column from integer to string.\n",
86 |     "\n",
87 |     " - I dropped all values in the name column that start with lowercase letters, because it was confirmed that those names weren't dog names.\n",
88 |     "\n",
89 |     " - I converted the tweet_id column in the image prediction table to a string.\n",
90 |     "\n",
91 |     " - I changed all p1, p2, and p3 values to lowercase.\n",
92 |     "\n",
93 |     " - I converted the tweet_id column in the tweet_json dataframe from integer to string.\n",
94 |     "\n",
95 |     " - I changed the column label from 'id' to 'tweet_id' in the tweet_json (df3) dataset.\n",
96 |     "\n",
97 |     " - I merged the three dataframes into one dataframe on the tweet_id column.\n",
98 |     "\n",
99 |     "\n",
100 |     "### Storing the Data\n",
101 |     "\n",
102 |     "After gathering, assessing and cleaning the data, I saved the merged data in a csv file named `twitter_archive_master.csv`.\n",
103 |     "\n",
104 |     "\n",
105 |     "### Conclusion\n",
106 |     "\n",
107 |     "This project was so much fun for me! Yes, there were situations where I encountered errors and would always have to calm down and trace their source, which is definitely part of the process.\n",
108 |     "\n",
109 |     "**Data Wrangling is a core skill that anyone who handles data should be familiar with.**\n",
110 |     "\n",
111 |     "I was able to further polish my skills in using the Python programming language and its packages to successfully wrangle data and gain insights from it.\n"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": null,
117 |    "id": "3a8b58fe",
118 |    "metadata": {},
119 |    "outputs": [],
120 |    "source": []
121 |   }
122 |  ],
123 |  "metadata": {
124 |   "kernelspec": {
125 |    "display_name": "Python [conda env:StarNPMS] *",
126 |    "language": "python",
127 |    "name": "conda-env-StarNPMS-py"
128 |   },
129 |   "language_info": {
130 |    "codemirror_mode": {
131 |     "name": "ipython",
132 |     "version": 3
133 |    },
134 |    "file_extension": ".py",
135 |    "mimetype": "text/x-python",
136 |    "name": "python",
137 |    "nbconvert_exporter": "python",
138 |    "pygments_lexer": "ipython3",
139 |    "version": "3.9.12"
140 |   }
141 |  },
142 |  "nbformat": 4,
143 |  "nbformat_minor": 5
144 | }
145 | --------------------------------------------------------------------------------
/act_report.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "6e6a01ef",
6 |    "metadata": {},
7 |    "source": [
8 |     "# Report: act_report\n",
9 |     "\n",
10 |     "Create a **250-word-minimum written report** called \"act_report.pdf\" or \"act_report.html\" that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example."
11 |    ]
12 |   },
13 |   {
14 |    "cell_type": "markdown",
15 |    "id": "7bae629f",
16 |    "metadata": {},
17 |    "source": [
18 |     ""
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "markdown",
23 |    "id": "aefdc753",
24 |    "metadata": {},
25 |    "source": [
26 |     "**This act report includes a summary of the data analysis process that was carried out for the data wrangling project.**\n",
27 |     "\n",
28 |     "In this project, I worked with three datasets.\n",
29 |     "\n",
30 |     "Udacity provided the first dataset, which is a csv file named `twitter_archive_enhanced.csv`. It contains basic information about 2356 tweets and was downloaded manually. \n",
31 |     "\n",
32 |     "The second dataset was a tsv file named `image_prediction.tsv`, which was hosted on Udacity's servers and which I downloaded programmatically. It contains 2075 predictions made by a neural network that can classify dog breeds. \n",
33 |     "\n",
34 |     "For the third dataset, I queried the Twitter API using Python's Tweepy library. This third dataset contains information like the retweet count, favorite count, followers count and friends count that each tweet received, for 2327 tweets, in the file \"tweet_json.txt\".\n",
35 |     "\n",
36 |     "While assessing the data, I found 10 quality issues and 4 tidiness issues.
I used a variety of pandas methods to clean them up.\n",
37 |     "\n",
38 |     "**Here are some insights and visualizations that I got after I merged the three datasets into a master dataset named `twitter_archive_master.csv`.**\n",
39 |     "\n",
40 |     "First, I loaded the master dataset into a pandas dataframe."
41 |    ]
42 |   },
43 |   {
44 |    "cell_type": "code",
45 |    "execution_count": 1,
46 |    "id": "fb470441",
47 |    "metadata": {},
48 |    "outputs": [],
49 |    "source": [
50 |     "import pandas as pd\n",
51 |     "data = pd.read_csv(\"twitter_archive_master.csv\")"
52 |    ]
53 |   },
54 |   {
55 |    "cell_type": "code",
56 |    "execution_count": 2,
57 |    "id": "2dd363f3",
58 |    "metadata": {},
59 |    "outputs": [
60 |     {
61 |      "data": {
62 |       "text/html": [
63 |        "<div>\n", "
\n", 64 | "\n", 77 | "\n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | "
tweet_idrating_numeratorrating_denominatorimg_nump1_confp2_confp3_confretweet_countfavorite_count
count1.986000e+031986.0000001986.0000001986.0000001986.0000001.986000e+031.986000e+031986.0000001986.000000
mean7.356142e+1712.28197410.5342401.2034240.5934521.344853e-016.034994e-022244.6319237713.266868
std6.740686e+1641.5811807.3353690.5614920.2719611.005944e-015.091948e-024020.79407111378.788558
min6.660209e+170.0000002.0000001.0000000.0443331.011300e-081.740170e-1011.00000066.000000
25%6.758214e+1710.00000010.0000001.0000000.3626565.407533e-021.624755e-02495.5000001637.750000
50%7.082494e+1711.00000010.0000001.0000000.5873571.175370e-014.952715e-021080.0000003468.000000
75%7.873791e+1712.00000010.0000001.0000000.8449201.951377e-019.166433e-022559.2500009562.750000
max8.924206e+171776.000000170.0000004.0000001.0000004.880140e-012.734190e-0170778.000000144938.000000
\n", 191 | "
" 192 | ], 193 | "text/plain": [ 194 | " tweet_id rating_numerator rating_denominator img_num \\\n", 195 | "count 1.986000e+03 1986.000000 1986.000000 1986.000000 \n", 196 | "mean 7.356142e+17 12.281974 10.534240 1.203424 \n", 197 | "std 6.740686e+16 41.581180 7.335369 0.561492 \n", 198 | "min 6.660209e+17 0.000000 2.000000 1.000000 \n", 199 | "25% 6.758214e+17 10.000000 10.000000 1.000000 \n", 200 | "50% 7.082494e+17 11.000000 10.000000 1.000000 \n", 201 | "75% 7.873791e+17 12.000000 10.000000 1.000000 \n", 202 | "max 8.924206e+17 1776.000000 170.000000 4.000000 \n", 203 | "\n", 204 | " p1_conf p2_conf p3_conf retweet_count favorite_count \n", 205 | "count 1986.000000 1.986000e+03 1.986000e+03 1986.000000 1986.000000 \n", 206 | "mean 0.593452 1.344853e-01 6.034994e-02 2244.631923 7713.266868 \n", 207 | "std 0.271961 1.005944e-01 5.091948e-02 4020.794071 11378.788558 \n", 208 | "min 0.044333 1.011300e-08 1.740170e-10 11.000000 66.000000 \n", 209 | "25% 0.362656 5.407533e-02 1.624755e-02 495.500000 1637.750000 \n", 210 | "50% 0.587357 1.175370e-01 4.952715e-02 1080.000000 3468.000000 \n", 211 | "75% 0.844920 1.951377e-01 9.166433e-02 2559.250000 9562.750000 \n", 212 | "max 1.000000 4.880140e-01 2.734190e-01 70778.000000 144938.000000 " 213 | ] 214 | }, 215 | "execution_count": 2, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "data.describe()" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "id": "e5902cdf", 227 | "metadata": {}, 228 | "source": [ 229 | "#### Insights\n", 230 | "\n", 231 | " - The minimum favorite count is 66, mean is 7714, and the maximum favorite count is 144955\n", 232 | " \n", 233 | " - The minimum retweet count is 11, mean is 2245, and the maximum retweet count is 70786\n", 234 | " \n", 235 | " - About 32% of the dogs have no name\n", 236 | " \n", 237 | " - Image number 1 is the most prominent (frequent)\n", 238 | "\n", 239 | " - The merged dataset has 21 columns and 1986 rows, all the rows except for the dog stage column are completely filed with no missing value.\n", 240 | "\n", 241 | " - The columns are 'tweet_id', 'timestamp', 'source', 'text', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name', 'stage', 'retweet_count', 'favorite_count', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'.\n", 242 | "\n", 243 | " - Nine of the columns are object data type (string), one is datetime, five are integer data types, three are floats, and the remaining three are boolean data types." 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "id": "51a9400f", 249 | "metadata": {}, 250 | "source": [ 251 | "#### Visualizations" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "id": "1040b2da", 257 | "metadata": {}, 258 | "source": [ 259 | " 1. The most occcuring image number that corresponds to each tweet's most confident prediction is 1." 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "id": "94b36de0", 265 | "metadata": {}, 266 | "source": [ 267 | "" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "id": "d222853e", 273 | "metadata": {}, 274 | "source": [ 275 | " 2. The most popular dog stage that were rated by the WeRateDogs Twitter account was pupper, follwed by doggo and then puppo." 
276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "id": "85ba3377", 281 | "metadata": {}, 282 | "source": [ 283 | "" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "id": "8bda88b3", 289 | "metadata": {}, 290 | "source": [ 291 | " 3. From the graph below, there is a positive linear relationship between retweet_count and favorite_count.\n", 292 | " \n", 293 | "A reasonable hypothesis is that the most popular tweets get the highest number of retweet count and favorite count. I tested the correlation between retweet_count and favorite_count and the r^2 is 0.928. That is a high value showing a strong correlation between them." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "id": "9f3fe5a9", 299 | "metadata": {}, 300 | "source": [ 301 | "" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "id": "40c1176a", 307 | "metadata": {}, 308 | "source": [ 309 | "**That is the summary of the Data Wrangling process!**" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "0623dff0", 315 | "metadata": {}, 316 | "source": [ 317 | "\n" 318 | ] 319 | } 320 | ], 321 | "metadata": { 322 | "kernelspec": { 323 | "display_name": "Python [conda env:StarNPMS] *", 324 | "language": "python", 325 | "name": "conda-env-StarNPMS-py" 326 | }, 327 | "language_info": { 328 | "codemirror_mode": { 329 | "name": "ipython", 330 | "version": 3 331 | }, 332 | "file_extension": ".py", 333 | "mimetype": "text/x-python", 334 | "name": "python", 335 | "nbconvert_exporter": "python", 336 | "pygments_lexer": "ipython3", 337 | "version": "3.9.12" 338 | } 339 | }, 340 | "nbformat": 4, 341 | "nbformat_minor": 5 342 | } 343 | --------------------------------------------------------------------------------