├── Data_Analysis_with_Python.PNG ├── LICENSE ├── README.md ├── demographic_data_analyzer.md ├── demographic_data_analyzer.py ├── mean_var_std.md ├── mean_var_std.py ├── medical_data_visualizer.md ├── medical_data_visualizer.py ├── sea_level_predictor.md ├── sea_level_predictor.py ├── time_series_visualizer.md └── time_series_visualizer.py /Data_Analysis_with_Python.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mjcs-95/FreeCodeCamp_Data_Analysis_with_Python/2740865094d63e6bfabc3e1ad7bbcb0aef2963c9/Data_Analysis_with_Python.PNG -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Manuel J. Corbacho 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![Certificate](Data_Analysis_with_Python.PNG) 2 | https://freecodecamp.org/certification/mjcs-95/data-analysis-with-python-v7 3 | -------------------------------------------------------------------------------- /demographic_data_analyzer.md: -------------------------------------------------------------------------------- 1 | ### Assignment 2 | 3 | # Demographic Data Analyzer 4 | 5 | In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database. Here is a sample of what the data looks like: 6 | 7 | | | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | 8 | |---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------| 9 | | 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 10 | | 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 11 | | 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 12 | | 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 13 | | 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 14 | 15 | 16 | You must use 
Pandas to answer the following questions: 17 | * How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (`race` column) 18 | * What is the average age of men? 19 | * What is the percentage of people who have a Bachelor's degree? 20 | * What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K? 21 | * What percentage of people without advanced education make more than 50K? 22 | * What is the minimum number of hours a person works per week? 23 | * What percentage of the people who work the minimum number of hours per week have a salary of more than 50K? 24 | * What country has the highest percentage of people that earn >50K and what is that percentage? 25 | * Identify the most popular occupation for those who earn >50K in India. 26 | 27 | Use the starter code in the file `demographic_data_analyzer.py`. Update the code so all variables set to "None" are set to the appropriate calculation or code. Round all decimals to the nearest tenth. 28 | 29 | Unit tests are written for you under `test_module.py`. 30 | 31 | ### Development 32 | 33 | For development, you can use `main.py` to test your functions. Click the "run" button and `main.py` will run. 34 | 35 | ### Testing 36 | 37 | We imported the tests from `test_module.py` to `main.py` for your convenience. The tests will run automatically whenever you hit the "run" button. 38 | 39 | ### Submitting 40 | 41 | Copy your project's URL and submit it to freeCodeCamp. 42 | 43 | ### Dataset Source 44 | 45 | Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 
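As a quick illustration of the Pandas idioms these questions call for (the DataFrame below is a toy stand-in, not the real census file), the Bachelor's-degree and advanced-education percentages could be sketched like this:

```python
import pandas as pd

# Toy stand-in for the census data; only the two columns used here.
df = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters'],
    'salary':    ['<=50K',     '<=50K',   '>50K',      '>50K'],
})

# Percentage of people with a Bachelor's degree, rounded to the nearest tenth.
percentage_bachelors = round(100.0 * (df['education'] == 'Bachelors').mean(), 1)

# Percentage of people with advanced education earning >50K.
advanced = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
higher_education_rich = round(100.0 * (advanced['salary'] == '>50K').mean(), 1)

print(percentage_bachelors)   # 50.0
print(higher_education_rich)  # 66.7
```

Comparing a column to a value yields a boolean Series, and `.mean()` on booleans is the fraction of `True` rows, which keeps the percentage code short.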
-------------------------------------------------------------------------------- /demographic_data_analyzer.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | 4 | def calculate_demographic_data(print_data=True): 5 | # Read data from file 6 | df = pd.read_csv("adult.data.csv") 7 | 8 | # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. 9 | race_count = df['race'].value_counts() 10 | 11 | # What is the average age of men? 12 | average_age_men = round( df.groupby('sex')['age'].mean()['Male'], 1 ) 13 | 14 | # What is the percentage of people who have a Bachelor's degree? 15 | percentage_bachelors = round( df['education'].value_counts(normalize=True)['Bachelors'] * 100.0, 1 ) 16 | 17 | # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K? 18 | # What percentage of people without advanced education make more than 50K? 19 | 20 | 21 | 22 | # with and without `Bachelors`, `Masters`, or `Doctorate` 23 | higher_education = df.loc[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])] 24 | lower_education = df.loc[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])] 25 | 26 | # percentage with salary >50K 27 | higher_education_rich = round(100.0 * (higher_education['salary'] == '>50K').sum() / len(higher_education), 1 ) 28 | lower_education_rich = round(100.0 * (lower_education['salary'] == '>50K').sum() / len(lower_education), 1 ) 29 | 30 | 31 | # What is the minimum number of hours a person works per week (hours-per-week feature)? 32 | min_work_hours = df['hours-per-week'].min() 33 | 34 | # What percentage of the people who work the minimum number of hours per week have a salary of >50K? 
35 | num_min_workers = df.loc[df['hours-per-week'] == min_work_hours] 36 | 37 | rich_percentage = round(100.0 * (num_min_workers['salary'] == '>50K').sum() / len(num_min_workers) , 1 ) 38 | 39 | # What country has the highest percentage of people that earn >50K? 40 | highest_earning_country = None 41 | highest_earning_country_percentage = 0.0 42 | for country, data in df.groupby('native-country'): 43 | percentage = (data['salary'] == '>50K').sum() / data['salary'].count() 44 | if highest_earning_country_percentage < percentage: 45 | highest_earning_country_percentage = percentage 46 | highest_earning_country = country 47 | highest_earning_country_percentage = round(100 * highest_earning_country_percentage,1) 48 | 49 | 50 | # Identify the most popular occupation for those who earn >50K in India. 51 | top_IN_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation'].value_counts().keys()[0] 52 | 53 | # DO NOT MODIFY BELOW THIS LINE 54 | 55 | if print_data: 56 | print("Number of each race:\n", race_count) 57 | print("Average age of men:", average_age_men) 58 | print(f"Percentage with Bachelors degrees: {percentage_bachelors}%") 59 | print(f"Percentage with higher education that earn >50K: {higher_education_rich}%") 60 | print(f"Percentage without higher education that earn >50K: {lower_education_rich}%") 61 | print(f"Min work time: {min_work_hours} hours/week") 62 | print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%") 63 | print("Country with highest percentage of rich:", highest_earning_country) 64 | print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%") 65 | print("Top occupations in India:", top_IN_occupation) 66 | 67 | return { 68 | 'race_count': race_count, 69 | 'average_age_men': average_age_men, 70 | 'percentage_bachelors': percentage_bachelors, 71 | 'higher_education_rich': higher_education_rich, 72 | 'lower_education_rich': lower_education_rich, 73 | 
'min_work_hours': min_work_hours, 74 | 'rich_percentage': rich_percentage, 75 | 'highest_earning_country': highest_earning_country, 76 | 'highest_earning_country_percentage': 77 | highest_earning_country_percentage, 78 | 'top_IN_occupation': top_IN_occupation 79 | } 80 | -------------------------------------------------------------------------------- /mean_var_std.md: -------------------------------------------------------------------------------- 1 | ### Assignment 2 | 3 | Create a function named `calculate()` in `mean_var_std.py` that uses Numpy to output the mean, variance, standard deviation, max, min, and sum of the rows, columns, and elements in a 3 x 3 matrix. 4 | 5 | The input of the function should be a list containing 9 digits. The function should convert the list into a 3 x 3 Numpy array, and then return a dictionary containing the mean, variance, standard deviation, max, min, and sum along both axes and for the flattened matrix. 6 | 7 | The returned dictionary should follow this format: 8 | ```py 9 | { 10 | 'mean': [axis1, axis2, flattened], 11 | 'variance': [axis1, axis2, flattened], 12 | 'standard deviation': [axis1, axis2, flattened], 13 | 'max': [axis1, axis2, flattened], 14 | 'min': [axis1, axis2, flattened], 15 | 'sum': [axis1, axis2, flattened] 16 | } 17 | ``` 18 | 19 | If a list containing less than 9 elements is passed into the function, it should raise a `ValueError` exception with the message: "List must contain nine numbers." The values in the returned dictionary should be lists and not Numpy arrays. 
20 | 21 | For example, `calculate([0,1,2,3,4,5,6,7,8])` should return: 22 | ```py 23 | { 24 | 'mean': [[3.0, 4.0, 5.0], [1.0, 4.0, 7.0], 4.0], 25 | 'variance': [[6.0, 6.0, 6.0], [0.6666666666666666, 0.6666666666666666, 0.6666666666666666], 6.666666666666667], 26 | 'standard deviation': [[2.449489742783178, 2.449489742783178, 2.449489742783178], [0.816496580927726, 0.816496580927726, 0.816496580927726], 2.581988897471611], 27 | 'max': [[6, 7, 8], [2, 5, 8], 8], 28 | 'min': [[0, 1, 2], [0, 3, 6], 0], 29 | 'sum': [[9, 12, 15], [3, 12, 21], 36] 30 | } 31 | ``` 32 | 33 | The unit tests for this project are in `test_module.py`. 34 | 35 | ### Development 36 | 37 | For development, you can use `main.py` to test your `calculate()` function. Click the "run" button and `main.py` will run. 38 | 39 | ### Testing 40 | 41 | We imported the tests from `test_module.py` to `main.py` for your convenience. The tests will run automatically whenever you hit the "run" button. 42 | 43 | ### Submitting 44 | 45 | Copy your project's URL and submit it to freeCodeCamp. 
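One detail worth pinning down before the implementation: in the expected output above, "axis1" corresponds to NumPy's `axis=0` (aggregating down each column) and "axis2" to `axis=1` (across each row), while the flattened statistic is a single scalar. A minimal sketch, assuming nothing beyond NumPy itself:

```python
import numpy as np

# The 3 x 3 matrix built from [0, 1, ..., 8].
data = np.arange(9).reshape(3, 3)

col_means = np.mean(data, axis=0).tolist()  # down the columns
row_means = np.mean(data, axis=1).tolist()  # across the rows
overall = np.mean(data).tolist()            # scalar; .tolist() gives a plain float

print(col_means)  # [3.0, 4.0, 5.0]
print(row_means)  # [1.0, 4.0, 7.0]
print(overall)    # 4.0
```

`.tolist()` converts the NumPy arrays (and NumPy scalars) into the plain Python values the spec asks for.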
46 | -------------------------------------------------------------------------------- /mean_var_std.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def calculate(numbers): 4 | if len(numbers) != 9: 5 | raise ValueError("List must contain nine numbers.") 6 | data = np.reshape(np.array(numbers),(3,3)) 7 | calculations = {} 8 | calculations['mean'] = [np.mean(data, axis=0).tolist(), np.mean(data, axis=1).tolist(), np.mean(data.flatten()).tolist()] 9 | calculations['variance'] = [np.var(data, axis=0).tolist(), np.var(data, axis=1).tolist(), np.var(data.flatten()).tolist()] 10 | calculations['standard deviation'] = [np.std(data, axis=0).tolist(), np.std(data, axis=1).tolist(), np.std(data.flatten()).tolist()] 11 | calculations['max'] = [np.max(data, axis=0).tolist(), np.max(data, axis=1).tolist(), np.max(data.flatten()).tolist()] 12 | calculations['min'] = [np.min(data, axis=0).tolist(), np.min(data, axis=1).tolist(), np.min(data.flatten()).tolist()] 13 | calculations['sum'] = [np.sum(data, axis=0).tolist(), np.sum(data, axis=1).tolist(), np.sum(data.flatten()).tolist()] 14 | return calculations -------------------------------------------------------------------------------- /medical_data_visualizer.md: -------------------------------------------------------------------------------- 1 | ### Assignment 2 | 3 | In this project, you will visualize and make calculations from medical examination data using matplotlib, seaborn, and pandas. The dataset values were collected during medical examinations. 4 | 5 | #### Data description 6 | 7 | The rows in the dataset represent patients and the columns represent information like body measurements, results from various blood tests, and lifestyle choices. You will use the dataset to explore the relationship between cardiac disease, body measurements, blood markers, and lifestyle choices. 
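The 'overweight' task further down hinges on a unit conversion: height is stored in centimeters, so it must be divided by 100 before squaring. A minimal sketch on made-up rows (not from the real CSV):

```python
import pandas as pd

# Hypothetical patients: height in cm, weight in kg, as in the dataset.
df = pd.DataFrame({'height': [170, 160], 'weight': [80.0, 60.0]})

# BMI = kg / m^2; flag BMI > 25 as overweight (1), otherwise 0.
df['overweight'] = (df['weight'] / (df['height'] / 100) ** 2 > 25).astype(int)

print(df['overweight'].tolist())  # [1, 0]
```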
8 | 9 | File name: medical_examination.csv 10 | 11 | | Feature | Variable Type | Variable | Value Type | 12 | |:-------:|:------------:|:-------------:|:----------:| 13 | | Age | Objective Feature | age | int (days) | 14 | | Height | Objective Feature | height | int (cm) | 15 | | Weight | Objective Feature | weight | float (kg) | 16 | | Gender | Objective Feature | gender | categorical code | 17 | | Systolic blood pressure | Examination Feature | ap_hi | int | 18 | | Diastolic blood pressure | Examination Feature | ap_lo | int | 19 | | Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal | 20 | | Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal | 21 | | Smoking | Subjective Feature | smoke | binary | 22 | | Alcohol intake | Subjective Feature | alco | binary | 23 | | Physical activity | Subjective Feature | active | binary | 24 | | Presence or absence of cardiovascular disease | Target Variable | cardio | binary | 25 | 26 | #### Tasks 27 | 28 | Create a chart similar to `examples/Figure_1.png`, where we show the counts of good and bad outcomes for the cholesterol, gluc, alco, active, and smoke variables for patients with cardio=1 and cardio=0 in different panels. 29 | 30 | Use the data to complete the following tasks in `medical_data_visualizer.py`: 31 | * Add an 'overweight' column to the data. To determine if a person is overweight, first calculate their BMI by dividing their weight in kilograms by the square of their height in meters. If that value is > 25 then the person is overweight. Use the value 0 for NOT overweight and the value 1 for overweight. 32 | * Normalize data by making 0 always good and 1 always bad. If the value of 'cholesterol' or 'gluc' is 1, make the value 0. If the value is more than 1, make the value 1. 33 | * Convert the data into long format and create a chart that shows the value counts of the categorical features using seaborn's `catplot()`. 
The dataset should be split by 'cardio' so there is one chart for each 'cardio' value. The chart should look like "examples/Figure_1.png". 34 | * Clean the data. Filter out the following patient segments that represent incorrect data: 35 | - diastolic pressure is higher than systolic (Keep the correct data with `df['ap_lo'] <= df['ap_hi']`) 36 | - height is less than the 2.5th percentile (Keep the correct data with `(df['height'] >= df['height'].quantile(0.025))`) 37 | - height is more than the 97.5th percentile 38 | - weight is less than the 2.5th percentile 39 | - weight is more than the 97.5th percentile 40 | * Create a correlation matrix using the dataset. Plot the correlation matrix using seaborn's `heatmap()`. Mask the upper triangle. The chart should look like "examples/Figure_2.png". 41 | 42 | Any time a variable is set to 'None', make sure to set it to the correct code. 43 | 44 | Unit tests are written for you under `test_module.py`. 45 | 46 | ### Development 47 | 48 | For development, you can use `main.py` to test your functions. Click the "run" button and `main.py` will run. 49 | 50 | ### Testing 51 | 52 | We imported the tests from `test_module.py` to `main.py` for your convenience. The tests will run automatically whenever you hit the "run" button. 53 | 54 | ### Submitting 55 | 56 | Copy your project's URL and submit it to freeCodeCamp. 57 | -------------------------------------------------------------------------------- /medical_data_visualizer.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | import numpy as np 5 | 6 | # Import data 7 | df = pd.read_csv('medical_examination.csv') 8 | 9 | # Add 'overweight' column 10 | df['overweight'] = (df['weight']/((df['height']/100)**2) > 25).astype(int) 11 | 12 | # Normalize data by making 0 always good and 1 always bad. If the value of 'cholesterol' or 'gluc' is 1, make the value 0. 
If the value is more than 1, make the value 1. 13 | df[['gluc','cholesterol']] = (df[['gluc','cholesterol']] > 1).astype(int) 14 | 15 | # Draw Categorical Plot 16 | def draw_cat_plot(): 17 | # Create DataFrame for cat plot using `pd.melt` using just the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'. 18 | df_cat = pd.melt(df, id_vars=['cardio'], value_vars=['active', 'alco', 'cholesterol', 'gluc', 'overweight', 'smoke']) 19 | 20 | 21 | # Group and reformat the data to split it by 'cardio'. Show the counts of each feature. You will have to rename one of the columns for the catplot to work correctly. 22 | 23 | 24 | # Draw the catplot with 'sns.catplot()' 25 | fig = sns.catplot(data = df_cat, kind='count', x='variable', hue='value', col='cardio').set(ylabel = 'total').fig 26 | 27 | 28 | # Do not modify the next two lines 29 | fig.savefig('catplot.png') 30 | return fig 31 | 32 | 33 | # Draw Heat Map 34 | def draw_heat_map(): 35 | # Clean the data 36 | df_heat = df[ 37 | ( df['ap_lo'] <= df['ap_hi'] ) & 38 | ( df['height'] >= df['height'].quantile(0.025) ) & 39 | ( df['height'] <= df['height'].quantile(0.975) ) & 40 | ( df['weight'] >= df['weight'].quantile(0.025) ) & 41 | ( df['weight'] <= df['weight'].quantile(0.975) ) 42 | ] 43 | 44 | # Calculate the correlation matrix 45 | corr = df_heat.corr() 46 | 47 | # Generate a mask for the upper triangle 48 | mask = np.triu(corr) 49 | 50 | 51 | # Set up the matplotlib figure 52 | fig, ax = plt.subplots() 53 | 54 | # Draw the heatmap with 'sns.heatmap()' 55 | ax = sns.heatmap(corr, mask=mask, annot=True, fmt='0.1f', square=True) 56 | 57 | 58 | # Do not modify the next two lines 59 | fig.savefig('heatmap.png') 60 | return fig 61 | -------------------------------------------------------------------------------- /sea_level_predictor.md: -------------------------------------------------------------------------------- 1 | ### Assignment 2 | 3 | You will analyze a dataset of the 
global average sea level change since 1880. You will use the data to predict the sea level change through year 2050. 4 | 5 | Use the data to complete the following tasks: 6 | * Use Pandas to import the data from `epa-sea-level.csv`. 7 | * Use matplotlib to create a scatter plot using the "Year" column as the x-axis and the "CSIRO Adjusted Sea Level" column as the y-axis. 8 | * Use the `linregress` function from `scipy.stats` to get the slope and y-intercept of the line of best fit. Plot the line of best fit over the top of the scatter plot. Make the line go through the year 2050 to predict the sea level rise in 2050. 9 | * Plot a new line of best fit just using the data from year 2000 through the most recent year in the dataset. Make the line also go through the year 2050 to predict the sea level rise in 2050 if the rate of rise continues as it has since the year 2000. 10 | * The x label should be "Year", the y label should be "Sea Level (inches)", and the title should be "Rise in Sea Level". 11 | 12 | Unit tests are written for you under `test_module.py`. 13 | 14 | ### Development 15 | 16 | For development, you can use `main.py` to test your functions. Click the "run" button and `main.py` will run. 17 | 18 | ### Testing 19 | 20 | We imported the tests from `test_module.py` to `main.py` for your convenience. The tests will run automatically whenever you hit the "run" button. 21 | 22 | ### Submitting 23 | 24 | Copy your project's URL and submit it to freeCodeCamp. 25 | 26 | ### Data Source 27 | Global Average Absolute Sea Level Change, 1880-2014 from the US Environmental Protection Agency using data from CSIRO, 2015; NOAA, 2015. 
28 | https://datahub.io/core/sea-level-rise 29 | -------------------------------------------------------------------------------- /sea_level_predictor.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | from scipy.stats import linregress 4 | 5 | def draw_plot(): 6 | # Read data from file 7 | df = pd.read_csv('epa-sea-level.csv') 8 | # Create scatter plot 9 | fig, ax = plt.subplots() 10 | ax.scatter(x= "Year", y = "CSIRO Adjusted Sea Level", data = df) 11 | # Create first line of best fit 12 | slope, intercept, r_value, p_value, std_err = linregress(df["Year"], df["CSIRO Adjusted Sea Level"]) 13 | years = pd.Series(range(1880, 2051))  # range end is exclusive, so 2051 includes the year 2050 14 | ax.plot(years, intercept + slope*years, 'r', label='first line of best fit') 15 | # Create second line of best fit 16 | df2 = df.loc[df["Year"] >= 2000] 17 | slope2, intercept2, r_value2, p_value2, std_err2 = linregress(df2["Year"], df2["CSIRO Adjusted Sea Level"]) 18 | years2 = pd.Series(range(2000, 2051)) 19 | ax.plot(years2, intercept2 + slope2*years2, 'b', label='second line of best fit') 20 | # Add labels and title 21 | ax.set(xlabel="Year", ylabel="Sea Level (inches)", title="Rise in Sea Level") 22 | # Save plot and return data for testing (DO NOT MODIFY) 23 | plt.savefig('sea_level_plot.png') 24 | return plt.gca() -------------------------------------------------------------------------------- /time_series_visualizer.md: -------------------------------------------------------------------------------- 1 | ### Assignment 2 | 3 | For this project you will visualize time series data using a line chart, bar chart, and box plots. You will use Pandas, Matplotlib, and Seaborn to visualize a dataset containing the number of page views each day on the freeCodeCamp.org forum from 2016-05-09 to 2019-12-03. The data visualizations will help you understand the patterns in visits and identify yearly and monthly growth. 
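The extrapolation idea in `sea_level_predictor.py` above — fit once, then evaluate the fitted line at future years — can be isolated in a few lines. This sketch uses made-up collinear points, not the EPA data; note that `range`/`arange` end bounds are exclusive, so reaching the year 2050 requires an end of 2051:

```python
import numpy as np
from scipy.stats import linregress

# Made-up points lying on y = 0.06 * x - 100 (not the EPA sea-level data).
years = np.array([1880, 1900, 1950, 2000, 2014])
level = 0.06 * years - 100

fit = linregress(years, level)

# Extend the fitted line through 2050; arange's end is exclusive.
xs = np.arange(1880, 2051)
trend = fit.intercept + fit.slope * xs
prediction_2050 = float(trend[-1])  # xs[-1] == 2050

print(round(prediction_2050, 2))  # 23.0
```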
4 | 5 | Use the data to complete the following tasks: 6 | * Use Pandas to import the data from "fcc-forum-pageviews.csv". Set the index to the "date" column. 7 | * Clean the data by filtering out days when the page views were in the top 2.5% of the dataset or bottom 2.5% of the dataset. 8 | * Create a `draw_line_plot` function that uses Matplotlib to draw a line chart similar to "examples/Figure_1.png". The title should be "Daily freeCodeCamp Forum Page Views 5/2016-12/2019". The label on the x axis should be "Date" and the label on the y axis should be "Page Views". 9 | * Create a `draw_bar_plot` function that draws a bar chart similar to "examples/Figure_2.png". It should show average daily page views for each month grouped by year. The legend should show month labels and have a title of "Months". On the chart, the label on the x axis should be "Years" and the label on the y axis should be "Average Page Views". 10 | * Create a `draw_box_plot` function that uses Seaborn to draw two adjacent box plots similar to "examples/Figure_3.png". These box plots should show how the values are distributed within a given year or month and how they compare over time. The title of the first chart should be "Year-wise Box Plot (Trend)" and the title of the second chart should be "Month-wise Box Plot (Seasonality)". Make sure the month labels on bottom start at "Jan" and the x and y axes are labeled correctly. 11 | 12 | For each chart, make sure to use a copy of the data frame. Unit tests are written for you under `test_module.py`. 13 | 14 | ### Development 15 | 16 | For development, you can use `main.py` to test your functions. Click the "run" button and `main.py` will run. 17 | 18 | ### Testing 19 | 20 | We imported the tests from `test_module.py` to `main.py` for your convenience. The tests will run automatically whenever you hit the "run" button. 21 | 22 | ### Submitting 23 | 24 | Copy your project's URL and submit it to freeCodeCamp. 
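The bar chart described above needs average daily page views per (year, month); `sns.barplot` computes that mean implicitly, but it can be checked explicitly with a groupby. A sketch on synthetic data (the real file has a 'date' index and a 'value' column):

```python
import pandas as pd

# Synthetic daily page views for two months.
idx = pd.date_range('2016-05-01', '2016-06-30', freq='D')
df = pd.DataFrame({'value': range(len(idx))}, index=idx)

# Average daily page views for each month, grouped by year.
monthly = df.groupby([df.index.year, df.index.month])['value'].mean()
monthly.index.names = ['year', 'month']

print(monthly.loc[(2016, 5)])  # 15.0
```

Pivoting this Series with `.unstack()` would give the year-by-month table that the grouped bar chart draws.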
25 | -------------------------------------------------------------------------------- /time_series_visualizer.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import pandas as pd 3 | import seaborn as sns 4 | from pandas.plotting import register_matplotlib_converters 5 | register_matplotlib_converters() 6 | 7 | # Import data (Make sure to parse dates. Consider setting index column to 'date'.) 8 | df = pd.read_csv("fcc-forum-pageviews.csv", index_col="date" , parse_dates=True) 9 | 10 | # Clean data 11 | df = df[ df["value"].between( df["value"].quantile(.025), df["value"].quantile(.975) ) ] 12 | months= ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] 13 | def draw_line_plot(): 14 | # Draw line plot 15 | fig, ax = plt.subplots(figsize=(15,5)) 16 | ax = sns.lineplot(data = df, legend="brief") 17 | ax.set(title='Daily freeCodeCamp Forum Page Views 5/2016-12/2019') 18 | ax.set(xlabel = "Date",ylabel = "Page Views") 19 | # Save image and return fig (don't change this part) 20 | fig.savefig('line_plot.png') 21 | return fig 22 | 23 | def draw_bar_plot(): 24 | # Copy and modify data for monthly bar plot 25 | df_bar = df.copy() 26 | df_bar["year"] = df.index.year.values 27 | df_bar["month"] = df.index.month_name() 28 | # Draw bar plot 29 | fig, ax = plt.subplots(figsize=(15,5)) 30 | 31 | ax = sns.barplot(x="year", hue="month", y="value", data=df_bar, hue_order = months, ci=None ) 32 | ax.set(xlabel = "Years",ylabel = "Average Page Views") 33 | # Save image and return fig (don't change this part) 34 | fig.savefig('bar_plot.png') 35 | return fig 36 | 37 | 38 | def draw_box_plot(): 39 | # Prepare data for box plots (this part is done!) 
40 | df_box = df.copy() 41 | df_box.reset_index(inplace=True) 42 | df_box['year'] = [d.year for d in df_box.date] 43 | df_box['month'] = [d.strftime('%b') for d in df_box.date] 44 | 45 | # Draw box plots (using Seaborn) 46 | df_box['monthnumber'] = df.index.month 47 | df_box = df_box.sort_values('monthnumber') 48 | fig, ax = plt.subplots(1,2,figsize=(16,6)) 49 | sns.boxplot(y = "value", x = "year", data = df_box, ax = ax[0] ) 50 | ax[0].set(xlabel="Year", ylabel="Page Views", title="Year-wise Box Plot (Trend)") 51 | sns.boxplot(y = "value", x = "month", data = df_box, ax = ax[1]) 52 | ax[1].set(xlabel="Month", ylabel="Page Views", title="Month-wise Box Plot (Seasonality)") 53 | # Save image and return fig (don't change this part) 54 | fig.savefig('box_plot.png') 55 | return fig 56 | --------------------------------------------------------------------------------