├── 01 - Introduction to Python
│   ├── Chapter 1 - Python basics.txt
│   ├── Chapter 2 - Python Lists.txt
│   ├── Chapter 3 - Functions and Packages.txt
│   ├── Chapter 4 - Numpy.txt
│   └── about course
├── 02 - Intermediate Python
│   ├── Chapter 1 - Matplotlib .txt
│   ├── Chapter 2 -Dictionaries & Pandas .txt
│   ├── Chapter 3 - Logic, Control Flow and Filtering.txt
│   ├── Chapter 4 - Loops.txt
│   ├── Chapter 5 - case study hacker statistics.txt
│   └── key points
├── 03 - Introduction to Data Visualization using Matplotlib
│   ├── Chapter 1 - Introduction to Matplotlib.txt
│   ├── Chapter 2 - Plotting time-series.txt
│   ├── Chapter 3 - Quantitative Comparisions and statistical visualizations.txt
│   ├── Chapter 4 - sharing visualizations with others.txt
│   └── contents
├── 04 - Introduction to Data Visualization with Seaborn
│   ├── Chapter 1 - Introduction to Seaborn.txt
│   ├── Chapter 2 - Visualization two quantitative variables.txt
│   ├── Chapter 3 - Visualization a categorical and a quantitative variables.txt
│   ├── Chapter 4 - customizing seaborn plots.txt
│   └── key points
├── 05 - Python Data Science Toolbox (Part 1)
│   ├── Chapter 1 - Writing your own functions .txt
│   ├── Chapter 2 - Default arguments variable length arguments and scope .txt
│   ├── Chapter 3 - lambda functions and error handling.txt
│   └── key points
├── 06 - Python Data Science Toolbox (Part 2)
│   ├── Chapter 1 - Using Iterators in Pythonland .txt
│   ├── Chapter 2 - List Comprehensions and Generators.txt
│   ├── Chapter 3 - Bringing it all together.txt
│   └── key points
├── 07 - Intermediate Data Visualization with Seaborn
│   ├── Chapter 1 - seaborn introduction .txt
│   ├── Chapter 2 - Customizing Seaborn plots.txt
│   ├── Chapter 3 -additional plot types.txt
│   ├── Chapter 4 -creating plots on data aware grids.txt
│   └── key points
├── 08 - Introduction to Import data in python
│   ├── Chapter 1 - introduction and flat files 1.txt
│   ├── Chapter 2 - importing data from other file types 2.txt
│   ├── Chapter 3 - working with relational databases in python 3.txt
│   └── key points
├── 09 - Intermediate importing data in python
│   ├── Chapter 1 - Importing data from the Internet.txt
│   ├── Chapter 2 - intracting with apis to import data from web.txt
│   ├── Chapter 3 - Diving deep into the twitter api.txt
│   └── key points
├── 10 - Cleaning Data in Python
│   ├── Chapter 1 - Common Data Problems.txt
│   ├── Chapter 2 - Text and categorical data problems.txt
│   ├── Chapter 3 - Advanced data problems.txt
│   ├── Chapter 4 - Record Linkage.txt
│   └── key points
├── 11 - Working with Dates and Times in Python
│   ├── Chapter 1 - Dates and Calenders .txt
│   ├── Chapter 2 - Combining dates and times.txt
│   ├── Chapter 3 - time zones and daylight saving.txt
│   └── key points
├── 12 - Writing functions in Python
│   ├── Chapter 1 - Best Practices.txt
│   ├── Chapter 2 - Using Context Managers.txt
│   ├── Chapter 3 - Decorators.txt
│   ├── Chapter 4 - More on Decorators.txt
│   └── key points
├── 13 - Exploratory Data Analysis in Python
│   ├── Chapter 1 - Read clean and validate.txt
│   ├── Chapter 2 - Distributions.txt
│   ├── Chapter 3 - Relationships.txt
│   ├── Chapter 4 - Multivariate Thinking.txt
│   └── key points
├── 14 - Analyzing Police Activity with pandas
│   ├── Chapter 1 - preparing data for analysis.txt
│   ├── Chapter 2 - Exploring the Relationship between gender and policing.txt
│   ├── Chapter 3 - Visual Exploratory data analysis.txt
│   ├── Chapter 4 - Analyzing the effect of weather on policing.txt
│   └── key points
├── 15 - Statistical Thinking in Python (Part 1)
│   ├── Chapter 1 - Graphical Exploratory Data Analysis .txt
│   ├── Chapter 2 - Quantitative Exploratory Data Analysis.txt
│   ├── Chapter 3 - Thinking probabilistically discrete variables.txt
│   ├── Chapter 4 - Thinking probabilistically continuous variables.txt
│   └── key points
├── 16 - Statistical Thinking in Python (Part 2)
│   ├── Chapter 1 - Parameter estimation by optimization.txt
│   ├── Chapter 2 - Bootstrap confidence intervals.txt
│   ├── Chapter 3 - Introduction to hypothesis testing.txt
│   └── key points
├── 17 - Supervised Learning with Scikit-learn
│   ├── Chapter 1 - Classification.txt
│   ├── Chapter 2 - Regression.txt
│   ├── Chapter 3 - Fine Tuning your model.txt
│   ├── Chapter 4 - Preprocessing and Pipelines.txt
│   └── key points
├── 18 - Unsupervised Learning in Python
│   ├── Chapter 1 - Clustering for dataset exploration.txt
│   ├── Chapter 2 - Visualization with Hierarchical clustering and t-sne.txt
│   ├── Chapter 3 - Decorrelating your data and dimension reduction.txt
│   ├── Chapter 4 - Discovering Interpretable features.txt
│   └── key points
├── 19 - Machine learning with tree-based models in python
│   ├── Chapter 1 - Classification and regression trees.txt
│   ├── Chapter 2 - The bias-variance Tradeoff.txt
│   ├── Chapter 3 - Bagging and Random Forests.txt
│   ├── Chapter 4 - Boosting.txt
│   ├── Chapter 5 - Model Tuning.txt
│   └── key points
├── 20 - Cluster Analysis in Python
│   ├── Chapter 1 - Introduction to clustering.txt
│   ├── Chapter 2 - Hierarchical Clustering.txt
│   ├── Chapter 3 - K-means Clustering.txt
│   ├── Chapter 4 - Clustering in Real World.txt
│   └── key points
└── README.md

/01 - Introduction to Python/Chapter 1 - Python basics.txt:
--------------------------------------------------------------------------------
1.
# Example, do not modify!
print(5 / 8)

# Print the sum of 7 and 10
print(7 + 10)
____________________________________________________
2.
# Division
print(5 / 8)

# Addition
print(7 + 10)
____________________________________________________
3.
# Addition, subtraction
print(5 + 5)
print(5 - 5)

# Multiplication, division, modulo, and exponentiation
print(3 * 5)
print(10 / 2)
print(18 % 7)
print(4 ** 2)

# How much is your $100 worth after 7 years?
print(100 * (1.1 ** 7))
____________________________________________________
4.
# Create a variable savings
savings = 100

# Print out savings
print(savings)
____________________________________________________
5.
# Create a variable savings
savings = 100

# Create a variable growth_multiplier
growth_multiplier = 1.1

# Calculate result
result = savings * (growth_multiplier ** 7)

# Print out result
print(result)
____________________________________________________
6.
# Create a variable desc
desc = "compound interest"

# Create a variable profitable
profitable = True
____________________________________________________
7.
savings = 100
growth_multiplier = 1.1
desc = "compound interest"

# Assign product of growth_multiplier and savings to year1
year1 = growth_multiplier * savings

# Print the type of year1
print(type(year1))

# Assign sum of desc and desc to doubledesc
doubledesc = desc + desc

# Print out doubledesc
print(doubledesc)
____________________________________________________
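Aside - a quick standalone sketch (not one of the course exercises) of why exercise 7 behaves as it does: the + operator adds numbers but concatenates strings, and mixing the two types needs an explicit str() conversion, which exercise 8 below relies on.

# + adds numbers but concatenates strings
print(3 + 3)              # 6
print("3" + "3")          # 33
print(type(3), type("3"))

# Mixing the two types raises a TypeError, hence str():
print("total: " + str(3 + 3))
____________________________________________________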
8.
# Definition of savings and result
savings = 100
result = 100 * 1.10 ** 7

# Fix the printout
print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!")

# Definition of pi_string
pi_string = "3.1415926"

# Convert pi_string into float: pi_float
pi_float = float(pi_string)
____________________________________________________

/01 - Introduction to Python/Chapter 2 - Python Lists.txt:
--------------------------------------------------------------------------------
1.
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Create list areas
areas = [hall, kit, liv, bed, bath]

# Print areas
print(areas)
____________________________________________________
2.
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Adapt list areas
areas = ["hallway", hall, "kitchen", kit, "living room", liv, "bedroom", bed, "bathroom", bath]

# Print areas
print(areas)
____________________________________________________
3.
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# house information as list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]]

# Print out house
print(house)

# Print out the type of house
print(type(house))
____________________________________________________
4.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Print out second element from areas
print(areas[1])

# Print out last element from areas
print(areas[-1])

# Print out the area of the living room
print(areas[5])
____________________________________________________
5.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Sum of kitchen and bedroom area: eat_sleep_area
eat_sleep_area = areas[3] + areas[7]

# Print the variable eat_sleep_area
print(eat_sleep_area)
____________________________________________________
6.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Use slicing to create downstairs
downstairs = areas[:6]

# Use slicing to create upstairs
upstairs = areas[6:]

# Print out downstairs and upstairs
print(downstairs, upstairs)
____________________________________________________
7.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Alternative slicing to create downstairs
downstairs = areas[:6]

# Alternative slicing to create upstairs
upstairs = areas[6:]
____________________________________________________
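Aside - a small standalone sketch of Python's half-open slicing, which is why areas[:6] and areas[6:] in exercises 6-7 split the list cleanly with no overlap:

nums = [0, 1, 2, 3, 4, 5, 6]
print(nums[:3])    # [0, 1, 2] -> the stop index is excluded
print(nums[3:])    # [3, 4, 5, 6] -> starts exactly where the first slice stopped
print(nums[:3] + nums[3:] == nums)    # True: the two slices tile the list
____________________________________________________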
8.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Correct the bathroom area
areas[-1] = 10.50

# Change "living room" to "chill zone"
areas[4] = "chill zone"
____________________________________________________
9.
# Create the areas list and make some changes
areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0,
         "bedroom", 10.75, "bathroom", 10.50]

# Add poolhouse data to areas, new list is areas_1
areas_1 = areas + ["poolhouse", 24.5]

# Add garage data to areas_1, new list is areas_2
areas_2 = areas_1 + ["garage", 15.45]
____________________________________________________
10.
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Create areas_copy (a real copy, not a second name for the same list)
areas_copy = list(areas)

# Change areas_copy
areas_copy[0] = 5.0

# Print areas: unchanged, because areas_copy is a separate list
print(areas)
____________________________________________________

/01 - Introduction to Python/Chapter 3 - Functions and Packages.txt:
--------------------------------------------------------------------------------
1.
# Create variables var1 and var2
var1 = [1, 2, 3, 4]
var2 = True

# Print out type of var1
print(type(var1))

# Print out length of var1
print(len(var1))

# Convert var2 to an integer: out2
out2 = int(var2)
____________________________________________________
2.
# Create lists first and second
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]

# Paste together first and second: full
full = first + second

# Sort full in descending order: full_sorted
full_sorted = sorted(full, reverse=True)

# Print out full_sorted
print(full_sorted)
____________________________________________________
3.
# string to experiment with: place
place = "poolhouse"

# Use upper() on place: place_up
place_up = place.upper()

# Print out place and place_up
print(place, place_up)

# Print out the number of o's in place
print(place.count('o'))
____________________________________________________
4.
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Print out the index of the element 20.0
print(areas.index(20.0))

# Print out how often 9.50 appears in areas
print(areas.count(9.50))
____________________________________________________
5.
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Use append twice to add poolhouse and garage size
areas.append(24.5)
areas.append(15.45)

# Print out areas
print(areas)

# Reverse the order of the elements in areas
# (reverse() works in place and returns None, so don't reassign it)
areas.reverse()

# Print out areas
print(areas)
____________________________________________________
6.
# Definition of radius
r = 0.43

# Import the math package
import math

# Calculate C
C = 2 * math.pi * r

# Calculate A
A = math.pi * r * r

# Build printout
print("Circumference: " + str(C))
print("Area: " + str(A))
____________________________________________________
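Aside - a short standalone sketch of the common import styles; exercise 6 above uses the full-module form, exercise 7 below uses the selective form, and the alias form is shown only for comparison:

import math                # qualified access: math.pi
from math import radians   # bare name: radians(...)
import math as m           # aliased access: m.sqrt(...)

print(math.pi)
print(radians(180))        # pi, since 180 degrees = pi radians
print(m.sqrt(2))
____________________________________________________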
7.
# Definition of radius
r = 192500
phi = 12

# Import radians function of math package
from math import radians

# Travel distance of Moon over 12 degrees. Store in dist.
dist = r * radians(phi)

# Print out dist
print(dist)
____________________________________________________

/01 - Introduction to Python/about course:
--------------------------------------------------------------------------------
This course consists of 4 major parts:
1. Python basics
2. Python Lists
3. Functions and Packages
4. Numpy

/02 - Intermediate Python/Chapter 3 - Logic, Control Flow and Filtering.txt:
--------------------------------------------------------------------------------
1. Equality
# Comparison of booleans
print(True == False)

# Comparison of integers
print(-5 * 15 != 75)

# Comparison of strings
print('pyscript' == 'PyScript')

# Compare a boolean with an integer
print(True == 1)
____________________________________________________
2. Greater and less than
# Comparison of integers
x = -3 * 6

# Comparison of strings
y = "test"

# Comparison of booleans
print(x >= -10)
print(y >= 'test', True > False)
____________________________________________________
3. Compare arrays
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than or equal to 18
print(my_house >= 18)

# my_house less than your_house
print(my_house < your_house)
____________________________________________________
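Aside - a standalone sketch showing that comparing a numpy array produces an elementwise boolean array, which is what the prints in exercise 3 display and what the filtering exercises later in this chapter build on:

import numpy as np

a = np.array([18.0, 20.0, 10.75, 9.50])
mask = a >= 18
print(mask)      # [ True  True False False] -- one boolean per element
print(a[mask])   # [18. 20.] -- a boolean mask can index the array directly
____________________________________________________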
4. Boolean operators
# Define variables
my_kitchen = 18.0
your_kitchen = 14.0

# my_kitchen bigger than 10 and smaller than 18?
print(my_kitchen > 10 and my_kitchen < 18)

# my_kitchen smaller than 14 or bigger than 17?
print(my_kitchen < 14 or my_kitchen > 17)

# Double my_kitchen smaller than triple your_kitchen?
print(my_kitchen * 2 < your_kitchen * 3)
____________________________________________________
5. Boolean operators with Numpy
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))
____________________________________________________
6. if
# Define variables
room = "kit"
area = 14.0

# if statement for room
if room == "kit":
    print("looking around in the kitchen.")

# if statement for area
if area > 15.0:
    print("big place!")
____________________________________________________
7. Add else
# Define variables
room = "kit"
area = 14.0

# if-else construct for room
if room == "kit":
    print("looking around in the kitchen.")
else:
    print("looking around elsewhere.")

# if-else construct for area
if area > 15:
    print("big place!")
else:
    print("pretty small.")
____________________________________________________
8. Customize further: elif
# Define variables
room = "bed"
area = 14.0

# if-elif-else construct for room
if room == "kit":
    print("looking around in the kitchen.")
elif room == "bed":
    print("looking around in the bedroom.")
else:
    print("looking around elsewhere.")

# if-elif-else construct for area
if area > 15:
    print("big place!")
elif area > 10:
    print("medium size, nice!")
else:
    print("pretty small.")
____________________________________________________
9. Driving right (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Extract drives_right column as Series: dr
dr = cars['drives_right']

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel
print(sel)
____________________________________________________
10. Driving right (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Convert code to a one-liner
sel = cars[cars['drives_right']]

# Print sel
print(sel)
____________________________________________________
11. Cars per capita (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
many_cars = cpc > 500
car_maniac = cars[many_cars]

# Print car_maniac
print(car_maniac)
____________________________________________________
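Aside - a self-contained sketch of the mask-then-subset pattern from exercises 9-11; the numbers here are made-up stand-ins, since cars.csv itself is not included in these notes:

import pandas as pd

# Hypothetical stand-in for cars.csv
cars = pd.DataFrame({'cars_per_cap': [809, 731, 588, 18, 200]},
                    index=['US', 'AUS', 'JAP', 'IN', 'RU'])

many_cars = cars['cars_per_cap'] > 500   # boolean Series, one flag per row
print(cars[many_cars])                   # keeps only rows where the mask is True
____________________________________________________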
12. Cars per capita (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Import numpy, you'll need this
import numpy as np

# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]

# Print medium
print(medium)
____________________________________________________

/02 - Intermediate Python/Chapter 4 - Loops.txt:
--------------------------------------------------------------------------------
1. Basic while loop
# Initialize offset
offset = 8

# Code the while loop
while offset != 0:
    print("correcting...")
    offset = offset - 1
    print(offset)
____________________________________________________
2. Add conditionals
# Initialize offset
offset = -6

# Code the while loop
while offset != 0:
    print("correcting...")
    if offset > 0:
        offset = offset - 1
    else:
        offset = offset + 1
    print(offset)
____________________________________________________
3. Loop over a list
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Code the for loop
for area in areas:
    print(area)
____________________________________________________
4. Indexes and values (1)
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Change for loop to use enumerate() and update print()
for index, area in enumerate(areas):
    print("room " + str(index) + ": " + str(area))
____________________________________________________
5. Indexes and values (2)
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Code the for loop
for index, area in enumerate(areas):
    print("room " + str(index + 1) + ": " + str(area))
____________________________________________________
6. Loop over list of lists
# house list of lists
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["living room", 20.0],
         ["bedroom", 10.75],
         ["bathroom", 9.50]]

# Build a for loop from scratch
for x in house:
    print("the " + x[0] + " is " + str(x[1]) + " sqm")
____________________________________________________
7. Loop over dictionary
# Definition of dictionary
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin',
          'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw', 'austria': 'vienna'}

# Iterate over europe
for key, value in europe.items():
    print("the capital of " + str(key) + " is " + str(value))
____________________________________________________
8. Loop over Numpy array
# Import numpy as np
import numpy as np

# np_height and np_baseball are pre-loaded numpy arrays in the course environment

# For loop over np_height
for x in np_height:
    print(str(x) + " inches")

# For loop over np_baseball
for i in np.nditer(np_baseball):
    print(i)
____________________________________________________
9. Loop over DataFrame (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Iterate over rows of cars
for lab, row in cars.iterrows():
    print(lab)
    print(row)
____________________________________________________
10. Loop over DataFrame (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Adapt for loop
for lab, row in cars.iterrows():
    print(lab + ": " + str(row['cars_per_cap']))
____________________________________________________
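Aside - a self-contained sketch of what iterrows() actually yields in exercises 9-10, again with made-up stand-in data for cars.csv:

import pandas as pd

# Hypothetical stand-in for cars.csv
cars = pd.DataFrame({'cars_per_cap': [809, 731]}, index=['US', 'AUS'])

# iterrows() yields (row label, row as a Series) pairs
for lab, row in cars.iterrows():
    print(lab, '->', row['cars_per_cap'])
____________________________________________________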
11. Add column (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
    cars.loc[lab, "COUNTRY"] = row["country"].upper()

# Print cars
print(cars)
____________________________________________________
12. Add column (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
____________________________________________________

/02 - Intermediate Python/key points:
--------------------------------------------------------------------------------
Consists of 5 chapters:
1- Matplotlib
2- Dictionaries & Pandas
3- Logic, Control Flow and Filtering
4- Loops
5- Case Study: Hacker Statistics

/03 - Introduction to Data Visualization using Matplotlib/Chapter 1 - Introduction to Matplotlib.txt:
--------------------------------------------------------------------------------
1. Using the matplotlib.pyplot interface
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Call the show function to show the result
plt.show()
____________________________________________________
2. Adding data to an Axes object
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# seattle_weather and austin_weather are pre-loaded DataFrames in the course environment

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Plot MLY-PRCP-NORMAL from seattle_weather against the MONTH
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])

# Plot MLY-PRCP-NORMAL from austin_weather against MONTH
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])

# Call the show function
plt.show()
____________________________________________________
3. Customizing data appearance
# Plot Seattle data, setting data appearance
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"],
        color="b", marker='o', linestyle='--')

# Plot Austin data, setting data appearance
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"],
        color="r", marker='v', linestyle='--')

# Call show to display the resulting plot
plt.show()
____________________________________________________
4. Customizing axis labels and adding titles
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])

# Customize the x-axis label
ax.set_xlabel("Time (months)")

# Customize the y-axis label
ax.set_ylabel("Precipitation (inches)")

# Add the title
ax.set_title("Weather patterns in Austin and Seattle")

# Display the figure
plt.show()
____________________________________________________
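Aside - a fully self-contained sketch of the Figure/Axes split used throughout this chapter: the Figure is the canvas, each Axes is one plot on it, and data, labels, and titles all attach to the Axes (the numbers are illustrative only):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()         # one Figure, one Axes
ax.plot([1, 2, 3], [2, 4, 8])    # data goes on the Axes
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Data and labels live on the Axes")
plt.show()
____________________________________________________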
5. Creating small multiples with plt.subplots
# Create a Figure and an array of subplots with 2 rows and 2 columns
fig, ax = plt.subplots(2, 2)

# Addressing the top left Axes as index 0, 0, plot month and Seattle precipitation
ax[0, 0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])

# In the top right (index 0, 1), plot month and Seattle temperatures
ax[0, 1].plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])

# In the bottom left (1, 0) plot month and Austin precipitations
ax[1, 0].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])

# In the bottom right (1, 1) plot month and Austin temperatures
ax[1, 1].plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
plt.show()
____________________________________________________
6. Small multiples with shared y axis
# Create a figure and an array of axes: 2 rows, 1 column with shared y axis
fig, ax = plt.subplots(2, 1, sharey=True)

# Plot Seattle precipitation in the top axes
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], color='b')
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-25PCTL"], color='b', linestyle='--')
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-75PCTL"], color='b', linestyle='--')

# Plot Austin precipitation in the bottom axes
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"], color='r')
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-25PCTL"], color='r', linestyle='--')
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-75PCTL"], color='r', linestyle='--')

plt.show()
____________________________________________________

/03 - Introduction to Data Visualization using Matplotlib/Chapter 2 - Plotting time-series.txt:
--------------------------------------------------------------------------------
1. Read data with a time index
# Import pandas as pd
import pandas as pd

# Read the data from file using read_csv
climate_change = pd.read_csv('climate_change.csv', parse_dates=["date"], index_col="date")
____________________________________________________
2. Plot time-series data
import matplotlib.pyplot as plt
fig, ax = plt.subplots()

# Add the time-series for "relative_temp" to the plot
ax.plot(climate_change.index, climate_change['relative_temp'])

# Set the x-axis label
ax.set_xlabel('Time')

# Set the y-axis label
ax.set_ylabel('Relative temperature (Celsius)')

# Show the figure
plt.show()
____________________________________________________
3. Using a time index to zoom in
import matplotlib.pyplot as plt

# Use plt.subplots to create fig and ax
fig, ax = plt.subplots()

# Create variable seventies with data from "1970-01-01" to "1979-12-31"
seventies = climate_change["1970-01-01":"1979-12-31"]

# Add the time-series for "co2" data from seventies to the plot
ax.plot(seventies.index, seventies["co2"])

# Show the figure
plt.show()
____________________________________________________
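Aside - a self-contained sketch of the string-slicing on a DatetimeIndex used in exercise 3, built on a synthetic monthly series instead of climate_change.csv:

import pandas as pd

idx = pd.date_range("1970-01-01", periods=240, freq="MS")   # monthly timestamps
ts = pd.Series(range(240), index=idx)

# With a DatetimeIndex, date strings select whole ranges by label
seventies = ts["1970-01-01":"1979-12-31"]
print(seventies.head())
____________________________________________________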
4. Plotting two variables
import matplotlib.pyplot as plt

# Initialize a Figure and Axes
fig, ax = plt.subplots()

# Plot the CO2 variable in blue
ax.plot(climate_change.index, climate_change["co2"], color='blue')

# Create a twin Axes that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature in red
ax2.plot(climate_change.index, climate_change["relative_temp"], color='red')

plt.show()
____________________________________________________
5. Defining a function that plots time-series data
# Define a function called plot_timeseries
def plot_timeseries(axes, x, y, color, xlabel, ylabel):

    # Plot the inputs x, y in the provided color
    axes.plot(x, y, color=color)

    # Set the x-axis label
    axes.set_xlabel(xlabel)

    # Set the y-axis label
    axes.set_ylabel(ylabel, color=color)

    # Set the colors tick params for y-axis
    axes.tick_params('y', colors=color)
____________________________________________________
6. Using a plotting function
fig, ax = plt.subplots()

# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change['co2'], "blue", 'Time (years)', 'CO2 levels')

# Create a twin Axes object that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature data in red (on ax2, the twin Axes)
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], "red", "Time (years)", "Relative temperature (Celsius)")

plt.show()
____________________________________________________
7. Annotating a plot of time-series data
fig, ax = plt.subplots()

# Plot the relative temperature data
ax.plot(climate_change.index, climate_change['relative_temp'])

# Annotate the date at which temperatures exceeded 1 degree
ax.annotate('>1 degree', (pd.Timestamp('2015-10-06'), 1))

plt.show()
____________________________________________________
8. Plotting time-series: putting it all together
fig, ax = plt.subplots()

# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change['co2'], 'blue', "Time (years)", "CO2 levels")

# Create an Axes object that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature data in red
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], 'red', 'Time (years)', 'Relative temp (Celsius)')

# Annotate point with relative temperature >1 degree
ax2.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1),
             xytext=(pd.Timestamp('2008-10-06'), -0.2),
             arrowprops={'arrowstyle': '->', 'color': 'gray'})

plt.show()
____________________________________________________
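Aside - a self-contained recap of the twin-axes pattern from exercises 4, 6, and 8: twinx() adds a second y-axis that shares the x-axis, so two differently scaled series can overlay. The numbers below are illustrative, not course data:

import matplotlib.pyplot as plt

years = [2000, 2010, 2020]
fig, ax = plt.subplots()
ax.plot(years, [370, 390, 414], color='blue')    # a CO2-like series
ax.set_ylabel("CO2 (ppm)", color='blue')

ax2 = ax.twinx()                                 # second y-axis, shared x-axis
ax2.plot(years, [0.4, 0.7, 1.0], color='red')    # a temperature-like series
ax2.set_ylabel("Relative temp (C)", color='red')
plt.show()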
/03 - Introduction to Data Visualization using Matplotlib/Chapter 3 - Quantitative Comparisions and statistical visualizations.txt:
--------------------------------------------------------------------------------
1. Bar chart
fig, ax = plt.subplots()

# Plot a bar-chart of gold medals as a function of country
ax.bar(medals.index, medals['Gold'])

# Set the x-axis tick labels to the country names
ax.set_xticklabels(medals.index, rotation=90)

# Set the y-axis label
ax.set_ylabel("Number of medals")

plt.show()
____________________________________________________
2. Stacked bar chart
# Add bars for "Gold" with the label "Gold"
ax.bar(medals.index, medals['Gold'], label='Gold')

# Stack bars for "Silver" on top with label "Silver"
ax.bar(medals.index, medals['Silver'], bottom=medals['Gold'], label="Silver")

# Stack bars for "Bronze" on top of that with label "Bronze"
ax.bar(medals.index, medals['Bronze'], bottom=medals['Gold'] + medals['Silver'], label='Bronze')

# Display the legend
ax.legend()

plt.show()
____________________________________________________
3. Creating histograms
fig, ax = plt.subplots()

# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"])

# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"])

# Set the x-axis label to "Weight (kg)"
ax.set_xlabel("Weight (kg)")

# Set the y-axis label to "# of observations"
ax.set_ylabel("# of observations")

plt.show()
____________________________________________________
4. "Step" histogram
fig, ax = plt.subplots()

# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"], label='Rowing', bins=5, histtype='step')

# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"], label='Gymnastics', bins=5, histtype='step')

ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")

# Add the legend and show the Figure
ax.legend()
plt.show()
____________________________________________________
5. Adding error-bars to a bar chart
fig, ax = plt.subplots()

# Add a bar for the rowing "Height" column mean/std
ax.bar("Rowing", mens_rowing["Height"].mean(), yerr=mens_rowing["Height"].std())

# Add a bar for the gymnastics "Height" column mean/std
ax.bar("Gymnastics", mens_gymnastics["Height"].mean(), yerr=mens_gymnastics["Height"].std())

# Label the y-axis
ax.set_ylabel("Height (cm)")

plt.show()
____________________________________________________
6. Adding error-bars to a plot
fig, ax = plt.subplots()

# Add Seattle temperature data in each month with error bars
ax.errorbar(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"],
            yerr=seattle_weather["MLY-TAVG-STDDEV"])

# Add Austin temperature data in each month with error bars
ax.errorbar(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"],
            yerr=austin_weather["MLY-TAVG-STDDEV"])

# Set the y-axis label
ax.set_ylabel("Temperature (Fahrenheit)")

plt.show()
____________________________________________________
7. Creating boxplots
fig, ax = plt.subplots()

# Add a boxplot for the "Height" column in the DataFrames
ax.boxplot([mens_rowing["Height"], mens_gymnastics["Height"]])

# Add x-axis tick labels:
ax.set_xticklabels(["Rowing", "Gymnastics"])

# Add a y-axis label
ax.set_ylabel("Height (cm)")

plt.show()
____________________________________________________
8. Simple scatter plot
fig, ax = plt.subplots()

# Add data: "co2" on x-axis, "relative_temp" on y-axis
ax.scatter(climate_change["co2"], climate_change["relative_temp"])

# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")

# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")

plt.show()
____________________________________________________
9. Encoding time by color
fig, ax = plt.subplots()

# Add data: "co2", "relative_temp" as x-y, index as color
ax.scatter(climate_change["co2"], climate_change["relative_temp"], c=climate_change.index)

# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")

# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")

plt.show()
____________________________________________________

/03 - Introduction to Data Visualization using Matplotlib/Chapter 4 - sharing visualizations with others.txt:
--------------------------------------------------------------------------------
1. Switching between styles
# Use the "ggplot" style and create new Figure/Axes
plt.style.use('ggplot')
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
plt.show()
____________________________________________________
2. Saving a file several times
fig.savefig('my_figure.png')
fig.savefig('my_figure_300dpi.png', dpi=300)
____________________________________________________
3. Save a figure with different sizes
fig.set_size_inches([3, 5])
fig.savefig("figure_3_5.png")
____________________________________________________
4. Unique values of a column
# Extract the "Sport" column
sports_column = summer_2016_medals['Sport']

# Find the unique values of the "Sport" column
sports = sports_column.unique()

# Print out the unique sports values
print(sports)
____________________________________________________
5. Automate your visualization
fig, ax = plt.subplots()

# Loop over the different sports branches
for sport in sports:
    # Extract the rows only for this sport
    sport_df = summer_2016_medals[summer_2016_medals["Sport"] == sport]
    # Add a bar for the "Weight" mean with std y error bar
    ax.bar(sport, sport_df["Weight"].mean(), yerr=sport_df["Weight"].std())

ax.set_ylabel("Weight")
ax.set_xticklabels(sports, rotation=90)

# Save the figure to file
fig.savefig("sports_weights.png")
____________________________________________________
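Aside - a self-contained sketch combining exercises 2-3 above: set the Figure size first, then save; dpi controls the pixel density of the output file (the plotted numbers are illustrative only):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

fig.set_size_inches([3, 5])               # width, height in inches
fig.savefig("figure_3_5.png", dpi=300)    # roughly (3*300) x (5*300) pixels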
/03 - Introduction to Data Visualization using Matplotlib/contents:
--------------------------------------------------------------------------------

/04 - Introduction to Data Visualization with Seaborn/Chapter 1 - Introduction to Seaborn.txt:
--------------------------------------------------------------------------------
1. Making a scatter plot with lists
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# gdp and percent_literate are pre-loaded lists in the course environment

# Change this scatter plot to have percent literate on the y-axis
sns.scatterplot(x=gdp, y=percent_literate)

# Show plot
plt.show()
____________________________________________________
2. Making a count plot with a list
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create count plot with region on the y-axis (region is a pre-loaded list)
sns.countplot(y=region)

# Show plot
plt.show()
____________________________________________________
3. "Tidy" vs. "untidy" data
# Import Pandas
import pandas as pd

# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)

# Print the head of df
print(df.head())
____________________________________________________
4. Making a count plot with a DataFrame
# Import Matplotlib, Pandas, and Seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)

# Create a count plot with "Spiders" on the x-axis
sns.countplot(x="Spiders", data=df)

# Display the plot
plt.show()
____________________________________________________
5. Hue and scatter plots
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Change the legend order in the scatter plot
sns.scatterplot(x="absences", y="G3", data=student_data,
                hue="location", hue_order=["Rural", "Urban"])

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/Chapter 2 - Visualization two quantitative variables.txt:
--------------------------------------------------------------------------------
1. Creating subplots with col and row
# Change this scatter plot to arrange the plots in rows instead of columns
sns.relplot(x="absences", y="G3",
            data=student_data,
            kind="scatter",
            row="study_time")

# Show plot
plt.show()
____________________________________________________
2. Creating two-factor subplots
# Adjust further to add subplots based on family support
sns.relplot(x="G1", y="G3",
            data=student_data,
            kind="scatter",
            col="schoolsup",
            row='famsup',
            col_order=["yes", "no"],
            row_order=['yes', 'no'])

# Show plot
plt.show()
____________________________________________________
3. Changing the size of scatter plot points
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot of horsepower vs. mpg
sns.relplot(x="horsepower", y="mpg",
            data=mpg, kind="scatter",
            size="cylinders",
            hue='cylinders')

# Show plot
plt.show()
____________________________________________________
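Aside - a self-contained sketch of the relplot() faceting from exercises 1-2 above, using the tips dataset that ships with seaborn so it runs without the course's student_data:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # bundled with seaborn

# col/row create one subplot per level of the given categorical column
sns.relplot(x="total_bill", y="tip", data=tips,
            kind="scatter", col="time", row="smoker")
plt.show()
____________________________________________________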
4. Changing the style of scatter plot points
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create a scatter plot of acceleration vs. mpg
sns.relplot(x='acceleration', y='mpg', data=mpg,
            kind='scatter', style='origin', hue='origin')

# Show plot
plt.show()
____________________________________________________
5. Interpreting line plots
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create line plot
sns.relplot(x='model_year', y='mpg', data=mpg, kind='line')

# Show plot
plt.show()
____________________________________________________
6. Visualizing standard deviation with line plots
# Make the shaded area show the standard deviation
sns.relplot(x="model_year", y="mpg", data=mpg, kind="line", ci='sd')

# Show plot
plt.show()
____________________________________________________
7. Plotting subgroups in line plots
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Add markers and make each line have the same style
sns.relplot(x="model_year", y="horsepower",
            data=mpg, kind="line", ci=None, style="origin", hue="origin",
            markers=True, dashes=False)

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/Chapter 3 - Visualization a categorical and a quantitative variables.txt:
--------------------------------------------------------------------------------
1. Count plots
# Create column subplots based on age category
sns.catplot(y="Internet usage", data=survey_data, kind="count", col='Age Category')
plt.tight_layout()

# Show plot
plt.show()
____________________________________________________
2. Bar plots with percentages
# Create a bar plot of interest in math, separated by gender
sns.catplot(x='Gender', y='Interested in Math', data=survey_data, kind='bar')

# Show plot
plt.show()
____________________________________________________
3. Customizing bar plots
# Turn off the confidence intervals
sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar",
            order=["<2 hours",
                   "2 to 5 hours",
                   "5 to 10 hours",
                   ">10 hours"],
            ci=None)

# Show plot
plt.show()
____________________________________________________
4. Create and interpret a box plot
# Specify the category ordering
study_time_order = ["<2 hours", "2 to 5 hours",
                    "5 to 10 hours", ">10 hours"]

# Create a box plot and set the order of the categories
sns.catplot(x='study_time', y='G3',
            data=student_data,
            kind='box',
            order=study_time_order)

# Show plot
plt.show()
____________________________________________________
5. Omitting outliers
# Create a box plot with subgroups and omit the outliers
sns.catplot(x='internet', y='G3',
            data=student_data,
            kind='box',
            hue='location',
            sym='')

# Show plot
plt.show()
____________________________________________________
6. Adjusting the whiskers
# Set the whiskers at the min and max values
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[0, 100])

# Show plot
plt.show()
____________________________________________________
7. Customizing point plots
# Remove the lines joining the points
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2,
            join=False)

# Show plot
plt.show()
____________________________________________________
8. Point plots with subgroups
# Import median function from numpy
from numpy import median

# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None,
            estimator=median)

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/Chapter 4 - customizing seaborn plots.txt:
--------------------------------------------------------------------------------
1. Changing style and palette
# Change the color palette to "RdBu"
sns.set_style("whitegrid")
sns.set_palette("RdBu")

# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes",
                  "Often", "Always"]

sns.catplot(x="Parents Advice",
            data=survey_data,
            kind="count",
            order=category_order)

# Show plot
plt.show()
____________________________________________________
2. Changing the scale
# Change the context to "poster"
sns.set_context("poster")

# Create bar plot
sns.catplot(x="Number of Siblings", y="Feels Lonely",
            data=survey_data, kind="bar")

# Show plot
plt.show()
____________________________________________________
3. Using a custom palette
# Set the context to "notebook"
sns.set_context("notebook")

# Set the style to "darkgrid"
sns.set_style('darkgrid')

# Set a custom color palette
sns.set_palette(['#39A7D0', '#36ADA4'])

# Create the box plot of age distribution by gender
sns.catplot(x="Gender", y="Age",
            data=survey_data, kind="box")

# Show plot
plt.show()
____________________________________________________
4. FacetGrids vs. AxesSubplots
# Create scatter plot
g = sns.relplot(x="weight",
                y="horsepower",
                data=mpg,
                kind="scatter")

# Identify plot type
type_of_g = type(g)

# Print type
print(type_of_g)
____________________________________________________
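Aside - a self-contained sketch of the distinction exercise 4 above tests, again on seaborn's bundled tips data: figure-level functions like relplot() return a FacetGrid (title set via g.fig.suptitle), while axes-level functions like scatterplot() return a matplotlib Axes (title set via ax.set_title):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

g = sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter")
print(type(g))                 # seaborn FacetGrid

plt.figure()                   # fresh figure for the axes-level plot
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
print(type(ax))                # matplotlib Axes
plt.show()
____________________________________________________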
5. Adding a title to a FacetGrid object
# Create scatter plot
g = sns.relplot(x="weight",
                y="horsepower",
                data=mpg,
                kind="scatter")

# Add a title "Car Weight vs. Horsepower"
g.fig.suptitle('Car Weight vs. Horsepower')

# Show plot
plt.show()
____________________________________________________
6. Adding a title and axis labels
# Create line plot
g = sns.lineplot(x="model_year", y="mpg_mean",
                 data=mpg_mean,
                 hue="origin")

# Add a title "Average MPG Over Time"
g.set_title("Average MPG Over Time")

# Add x-axis and y-axis labels
g.set(xlabel="Car Model Year",
      ylabel="Average MPG")

# Show plot
plt.show()
____________________________________________________
7. Rotating x-tick labels
# Create point plot
sns.catplot(x="origin",
            y="acceleration",
            data=mpg,
            kind="point",
            join=False,
            capsize=0.1)

# Rotate x-tick labels
plt.xticks(rotation=90)

# Show plot
plt.show()
____________________________________________________
8. Box plot with subgroups
# Set palette to "Blues"
sns.set_palette("Blues")

# Adjust to add subgroups based on "Interested in Pets"
g = sns.catplot(x="Gender",
                y="Age", data=survey_data,
                kind="box", hue='Interested in Pets', aspect=1.5)

# Set title to "Age of Those Interested in Pets vs. Not"
g.fig.suptitle("Age of Those Interested in Pets vs. Not")

# Show plot
plt.show()
____________________________________________________
9. Bar plot with subgroups and subplots
# Set the figure style to "dark"
plt.style.use('seaborn')
sns.set_style('dark')

# Adjust to add subplots per gender
g = sns.catplot(x="Village - town", y="Likes Techno",
                data=survey_data, kind="bar",
                col='Gender')

# Add title and axis labels
g.fig.suptitle("Percentage of Young People Who Like Techno", y=1.02)
g.set(xlabel="Location of Residence",
      ylabel="% Who Like Techno")

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/key points:
--------------------------------------------------------------------------------

/05 - Python Data Science Toolbox (Part 1)/Chapter 1 - Writing your own functions .txt:
--------------------------------------------------------------------------------
1. Write a simple function
# Define the function shout
def shout():
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = 'congratulations' + '!!!'

    # Print shout_word
    print(shout_word)

# Call shout
shout()
____________________________________________________
2. Single-parameter functions
# Define shout with the parameter, word
def shout(word):
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + '!!!'

    # Print shout_word
    print(shout_word)

# Call shout with the string 'congratulations'
shout('congratulations')
____________________________________________________
3. Functions that return single values
# Define shout with the parameter, word
def shout(word):
    """Return a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + '!!!'

    # Replace print with return
    return shout_word

# Pass 'congratulations' to shout: yell
yell = shout('congratulations')

# Print yell
print(yell)
____________________________________________________
4. Functions with multiple parameters
# Define shout with parameters word1 and word2
def shout(word1, word2):
    """Concatenate strings with three exclamation marks"""
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + '!!!'

    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + '!!!'

    # Concatenate shout1 with shout2: new_shout
    new_shout = shout1 + shout2

    # Return new_shout
    return new_shout

# Pass 'congratulations' and 'you' to shout: yell
yell = shout('congratulations', 'you')

# Print yell
print(yell)
____________________________________________________
5. A brief introduction to tuples
# Unpack nums into num1, num2, and num3
num1, num2, num3 = nums

# Construct even_nums
even_nums = (2, num2, num3)
____________________________________________________
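Aside - a standalone sketch expanding on exercise 5: tuples are immutable, so you "change" one by unpacking it and building a new tuple:

nums = (3, 4, 6)
num1, num2, num3 = nums        # unpacking: one name per element
print(num1, num2, num3)

even_nums = (2, num2, num3)    # a new tuple; the original is untouched
print(even_nums)
print(nums)                    # still (3, 4, 6)
____________________________________________________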
6. Functions that return multiple values
# Define shout_all with parameters word1 and word2
def shout_all(word1, word2):
    """Return a tuple of strings"""
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + '!!!'

    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + '!!!'

    # Construct a tuple with shout1 and shout2: shout_words
    shout_words = (shout1, shout2)

    # Return shout_words
    return shout_words

# Pass 'congratulations' and 'you' to shout_all(): yell1, yell2
yell1, yell2 = shout_all('congratulations', 'you')

# Print yell1 and yell2
print(yell1)
print(yell2)
____________________________________________________
7. Bringing it all together (1)
# Import pandas
import pandas as pd

# Import Twitter data as DataFrame: df
df = pd.read_csv('tweets.csv')

# Initialize an empty dictionary: langs_count
langs_count = {}

# Extract column from DataFrame: col
col = df['lang']

# Iterate over lang column in DataFrame
for entry in col:

    # If the language is in langs_count, add 1
    if entry in langs_count.keys():
        langs_count[entry] += 1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry] = 1

# Print the populated dictionary
print(langs_count)
____________________________________________________
8. Bringing it all together (2)
# Define count_entries()
def count_entries(df, col_name):
    """Return a dictionary with counts of
    occurrences as value for each key."""

    # Initialize an empty dictionary: langs_count
    langs_count = {}

    # Extract column from DataFrame: col
    col = df[col_name]

    # Iterate over lang column in DataFrame
    for entry in col:

        # If the language is in langs_count, add 1
        if entry in langs_count.keys():
            langs_count[entry] += 1
        # Else add the language to langs_count, set the value to 1
        else:
            langs_count[entry] = 1

    # Return the langs_count dictionary
    return langs_count

# Call count_entries(): result (tweets_df is pre-loaded in the course environment)
result = count_entries(tweets_df, 'lang')

# Print the result
print(result)
____________________________________________________

/05 - Python Data Science Toolbox (Part 1)/key points:
--------------------------------------------------------------------------------

/06 - Python Data Science Toolbox (Part 2)/Chapter 2 - List Comprehensions and Generators.txt:
--------------------------------------------------------------------------------
1. Writing list comprehensions
# Create list comprehension: squares
squares = [i ** 2 for i in range(0, 10)]
____________________________________________________
2. Nested list comprehensions
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix
for row in matrix:
    print(row)
____________________________________________________
3. Using conditionals in comprehensions (1)
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]

# Print the new list
print(new_fellowship)
____________________________________________________
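Aside - a standalone sketch of what the comprehension in exercise 3 above is shorthand for; the loop form below produces exactly the same list:

fellowship = ['frodo', 'samwise', 'merry', 'aragorn',
              'legolas', 'boromir', 'gimli']

new_fellowship = []
for member in fellowship:
    if len(member) >= 7:             # the trailing "if" of the comprehension
        new_fellowship.append(member)
print(new_fellowship)
____________________________________________________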
_________________________________________________________________________________ 23 | 4.Using conditionals in comprehensions (2) 24 | # Create a list of strings: fellowship 25 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 26 | 27 | # Create list comprehension: new_fellowship 28 | new_fellowship = [member if len(member) >= 7 else member.replace( 29 | member, '') for member in fellowship] 30 | 31 | # Print the new list 32 | print(new_fellowship) 33 | _________________________________________________________________________________ 34 | 5.Dict comprehensions 35 | # Create a list of strings: fellowship 36 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 37 | 38 | # Create dict comprehension: new_fellowship 39 | new_fellowship = {member: len(member) for member in fellowship} 40 | 41 | # Print the new dictionary 42 | print(new_fellowship) 43 | _________________________________________________________________________________ 44 | 6.Write your own generator expressions 45 | # Create generator object: result 46 | result = (num for num in range(31)) 47 | 48 | # Print the first 5 values 49 | print(next(result)) 50 | print(next(result)) 51 | print(next(result)) 52 | print(next(result)) 53 | print(next(result)) 54 | 55 | # Print the rest of the values 56 | for value in result: 57 | print(value) 58 | _________________________________________________________________________________ 59 | 7.Changing the output in generator expressions 60 | # Create a list of strings: lannister 61 | lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey'] 62 | 63 | # Create a generator object: lengths 64 | lengths = (len(person) for person in lannister) 65 | 66 | # Iterate over and print the values in lengths 67 | for value in lengths: 68 | print(value) 69 | _________________________________________________________________________________ 70 | 8.Build a generator 71 | # Create a list of strings 72 | lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey'] 73 | 74 | # Define generator function get_lengths 75 | def get_lengths(input_list): 76 | """Generator function that yields the 77 | length of the strings in input_list.""" 78 | 79 | # Yield the length of a string 80 | for person in input_list: 81 | yield len(person) 82 | 83 | # Print the values generated by get_lengths() 84 | for value in get_lengths(lannister): 85 | print(value) 86 | _________________________________________________________________________________ 87 | 9.List comprehensions for time-stamped data 88 | # Extract the created_at column from df: tweet_time 89 | tweet_time = df['created_at'] 90 | 91 | # Extract the clock time: tweet_clock_time 92 | tweet_clock_time = [entry[11:19] for entry in tweet_time] 93 | 94 | # Print the extracted times 95 | print(tweet_clock_time) 96 | _________________________________________________________________________________ 97 | 10.Conditional list comprehensions for time-stamped data 98 | # Extract the created_at column from df: tweet_time 99 | tweet_time = df['created_at'] 100 | 101 | # Extract the clock time: tweet_clock_time 102 | tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19'] 103 | 104 | # Print the extracted times 105 | print(tweet_clock_time) 106 | _________________________________________________________________________________ -------------------------------------------------------------------------------- /06 - Python Data Science Toolbox (Part 2)/key points: 
-------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/Chapter 1 - seaborn introduction .txt: -------------------------------------------------------------------------------- 1 | 1.Reading a csv file 2 | # import all modules 3 | import pandas as pd 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | # Read in the DataFrame 8 | df = pd.read_csv(grant_file) 9 | ______________________________________________________________________________ 10 | 2.Comparing a histogram and distplot 11 | # Display pandas histogram 12 | df['Award_Amount'].plot.hist() 13 | plt.show() 14 | 15 | # Clear out the pandas histogram 16 | plt.clf() 17 | 18 | # Display a Seaborn distplot 19 | sns.distplot(df['Award_Amount']) 20 | plt.show() 21 | 22 | # Clear the distplot 23 | plt.clf() 24 | ______________________________________________________________________________ 25 | 3.Plot a histogram 26 | # Create a distplot 27 | sns.distplot(df['Award_Amount'], 28 | kde=False, 29 | bins=20) 30 | 31 | # Display the plot 32 | plt.show() 33 | ______________________________________________________________________________ 34 | 4.Rug plot and kde shading 35 | # Create a distplot of the Award Amount 36 | sns.distplot(df['Award_Amount'], 37 | hist=False, 38 | rug=True, 39 | kde_kws={'shade':True}) 40 | 41 | # Plot the results 42 | plt.show() 43 | ______________________________________________________________________________ 44 | 5.Create a regression plot 45 | # Create a regression plot of premiums vs. insurance_losses 46 | sns.regplot(x="insurance_losses",y = "premiums", data = df) 47 | 48 | # Display the plot 49 | plt.show() 50 | 51 | # Create an lmplot of premiums vs. 
insurance_losses 52 | sns.lmplot(x="insurance_losses",y="premiums",data=df) 53 | 54 | # Display the second plot 55 | plt.show() 56 | ______________________________________________________________________________ 57 | 6.Plotting multiple variables 58 | # Create a regression plot using hue 59 | sns.lmplot(data=df, 60 | x="insurance_losses", 61 | y="premiums", 62 | hue="Region") 63 | 64 | # Show the results 65 | plt.show() 66 | ______________________________________________________________________________ 67 | 7.Facetting multiple regressions 68 | # Create a regression plot with multiple rows 69 | sns.lmplot(data=df, 70 | x="insurance_losses", 71 | y="premiums", 72 | row="Region") 73 | 74 | # Show the plot 75 | plt.show() 76 | ______________________________________________________________________________ -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/Chapter 2 - Customizing Seaborn plots.txt: -------------------------------------------------------------------------------- 1 | 1.Setting the default style 2 | # Plot the pandas histogram 3 | df['fmr_2'].plot.hist() 4 | plt.show() 5 | plt.clf() 6 | 7 | # Set the default seaborn style 8 | sns.set() 9 | 10 | # Plot the pandas histogram again 11 | df['fmr_2'].plot.hist() 12 | plt.show() 13 | plt.clf() 14 | _____________________________________________________________________________ 15 | 2.Comparing styles 16 | # Plot with a dark style 17 | sns.set_style('dark') 18 | sns.distplot(df['fmr_2']) 19 | plt.show() 20 | 21 | # Clear the figure 22 | plt.clf() 23 | /********/ 24 | sns.set_style('whitegrid') 25 | sns.distplot(df['fmr_2']) 26 | plt.show() 27 | 28 | # Clear the figure 29 | plt.clf() 30 | _____________________________________________________________________________ 31 | 3.Removing spines 32 | # Set the style to white 33 | sns.set_style('white') 34 | 35 | # Create a regression plot 36 | sns.lmplot(data=df, 37 | x='pop2010', 38 | y='fmr_2') 39 | 40 | # Remove the spines 41 | sns.despine() 42 | 43 | # Show the plot and clear the figure 44 | plt.show() 45 | plt.clf() 46 | _____________________________________________________________________________ 47 | 4.Matplotlib color codes 48 | # Set style, enable color code, and create a magenta distplot 49 | sns.set(color_codes=True) 50 | sns.distplot(df['fmr_3'], color='m') 51 | 52 | # Show the plot 53 | plt.show() 54 | _____________________________________________________________________________ 55 | 5.Using default palettes 56 | # Loop through differences between bright and colorblind palettes 57 | for p in ['bright', 'colorblind']: 58 | sns.set_palette(p) 59 | sns.distplot(df['fmr_3']) 60 | plt.show() 61 | 62 | # Clear the plots 63 | plt.clf() 64 | _____________________________________________________________________________ 65 | 6.Creating Custom Palettes 66 | # Create the Purples palette with 8 colors 67 | sns.palplot(sns.color_palette("Purples", 8)) 68 | plt.show() 69 | /*************/ 70 | sns.palplot(sns.color_palette("husl", 10)) 71 | plt.show() 72 | /*************/ 73 | sns.palplot(sns.color_palette("coolwarm", 6)) 74 | plt.show() 75 | _____________________________________________________________________________ 76 | 7.Using matplotlib axes 77 | # Create a figure and axes 78 | fig, ax = plt.subplots() 79 | 80 | # Plot the distribution of data 81 | sns.distplot(df['fmr_3'], ax=ax) 82 | 83 | # Create a more descriptive x axis label 84 | ax.set(xlabel="3 Bedroom Fair Market Rent") 85 | 86 | # Show the plot 87 | 
plt.show() 88 | _____________________________________________________________________________ 89 | 8.Additional plot customizations 90 | # Create a figure and axes 91 | fig, ax = plt.subplots() 92 | 93 | # Plot the distribution of 1 bedroom rents 94 | sns.distplot(df['fmr_1'], ax=ax) 95 | 96 | # Modify the properties of the plot 97 | ax.set(xlabel="1 Bedroom Fair Market Rent", 98 | xlim=(100,1500), 99 | title="US Rent") 100 | 101 | # Display the plot 102 | plt.show() 103 | _____________________________________________________________________________ 104 | 9.Adding annotations 105 | # Create a figure and axes. Then plot the data 106 | fig, ax = plt.subplots() 107 | sns.distplot(df['fmr_1'], ax=ax) 108 | 109 | # Customize the labels and limits 110 | ax.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100,1500), title="US Rent") 111 | 112 | # Add vertical lines for the median and mean 113 | ax.axvline(x=634.0, color='m', label='Median', linestyle='--', linewidth=2) 114 | ax.axvline(x=706.3254351016984, color='b', label='Mean', linestyle='-', linewidth=2) 115 | 116 | # Show the legend and plot the data 117 | ax.legend() 118 | plt.show() 119 | _____________________________________________________________________________ 120 | 10.Multiple plots 121 | # Create a plot with 1 row and 2 columns that share the y axis label 122 | fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True) 123 | 124 | # Plot the distribution of 1 bedroom apartments on ax0 125 | sns.distplot(df['fmr_1'], ax=ax0) 126 | ax0.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100,1500)) 127 | 128 | # Plot the distribution of 2 bedroom apartments on ax1 129 | sns.distplot(df['fmr_2'], ax=ax1) 130 | ax1.set(xlabel="2 Bedroom Fair Market Rent", xlim=(100,1500)) 131 | 132 | # Display the plot 133 | plt.show() 134 | _____________________________________________________________________________ -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/Chapter 3 -additional plot types.txt: -------------------------------------------------------------------------------- 1 | 1.stripplot() and swarmplot() 2 | # Create the stripplot 3 | sns.stripplot(data=df, 4 | x='Award_Amount', 5 | y='Model Selected', 6 | jitter=True) 7 | 8 | plt.show() 9 | /*********/ 10 | # Create and display a swarmplot with hue set to the Region 11 | sns.swarmplot(data=df, 12 | x='Award_Amount', 13 | y='Model Selected', 14 | hue='Region') 15 | 16 | plt.show() 17 | ___________________________________________________________________ 18 | 2.boxplots, violinplots and lvplots 19 | # Create a boxplot 20 | sns.boxplot(data=df, 21 | x='Award_Amount', 22 | y='Model Selected') 23 | 24 | plt.show() 25 | plt.clf() 26 | /***************/ 27 | # Create a violinplot with the husl palette 28 | sns.violinplot(data=df, 29 | x='Award_Amount', 30 | y='Model Selected', 31 | palette='husl') 32 | 33 | plt.show() 34 | plt.clf() 35 | /****************/ 36 | # Create a lvplot with the Paired palette and the Region column as the hue 37 | sns.lvplot(data=df, 38 | x='Award_Amount', 39 | y='Model Selected', 40 | palette='Paired', 41 | hue='Region') 42 | 43 | plt.show() 44 | plt.clf() 45 | ___________________________________________________________________ 46 | 3.Regression and residual plots 47 | # Display a regression plot for Tuition 48 | sns.regplot(data=df, 49 | y='Tuition', 50 | x="SAT_AVG_ALL", 51 | marker='^', 52 | color='g') 53 | 54 | plt.show() 55 | plt.clf() 56 | /**************/ 57 | # Display the residual plot 
58 | sns.residplot(data=df, 59 | y='Tuition', 60 | x="SAT_AVG_ALL", 61 | color='g') 62 | 63 | plt.show() 64 | plt.clf() 65 | ___________________________________________________________________ 66 | 4.Regression plot parameters 67 | # Plot a regression plot of Tuition and the Percentage of Pell Grants 68 | sns.regplot(data=df, 69 | y='Tuition', 70 | x="PCTPELL") 71 | 72 | plt.show() 73 | plt.clf() 74 | /**************/ 75 | # Create another plot that estimates the tuition by PCTPELL 76 | sns.regplot(data=df, 77 | y='Tuition', 78 | x='PCTPELL', 79 | x_bins=5) 80 | 81 | plt.show() 82 | plt.clf() 83 | /****************/ 84 | # The final plot should include a line using a 2nd order polynomial 85 | sns.regplot(data=df, 86 | y='Tuition', 87 | x="PCTPELL", 88 | x_bins=5, 89 | order=2) 90 | 91 | plt.show() 92 | plt.clf() 93 | ___________________________________________________________________ 94 | 5.Binning data 95 | # Create a scatter plot by disabling the regression line 96 | sns.regplot(data=df, 97 | y='Tuition', 98 | x="UG", 99 | fit_reg=False) 100 | 101 | plt.show() 102 | plt.clf() 103 | /************/ 104 | # Create a scatter plot and bin the data into 5 bins 105 | sns.regplot(data=df, 106 | y='Tuition', 107 | x="UG", 108 | x_bins=5) 109 | 110 | plt.show() 111 | plt.clf() 112 | /************/ 113 | # Create a regplot and bin the data into 8 bins 114 | sns.regplot(data=df, 115 | y='Tuition', 116 | x="UG", 117 | x_bins=8) 118 | 119 | plt.show() 120 | plt.clf() 121 | ___________________________________________________________________ 122 | 6.Creating heatmaps 123 | # Create a crosstab table of the data 124 | pd_crosstab = pd.crosstab(df["Group"], df["YEAR"]) 125 | print(pd_crosstab) 126 | 127 | # Plot a heatmap of the table 128 | sns.heatmap(pd_crosstab) 129 | 130 | # Rotate tick marks for visibility 131 | plt.yticks(rotation=0) 132 | plt.xticks(rotation=90) 133 | 134 | plt.show() 135 | ___________________________________________________________________ 136 | 7.Customizing heatmaps 137 | # Create the crosstab DataFrame 138 | pd_crosstab = pd.crosstab(df["Group"], df["YEAR"]) 139 | 140 | # Plot a heatmap of the table with no color bar and using the BuGn palette 141 | sns.heatmap(pd_crosstab, cbar=False, cmap="BuGn", linewidths=0.3) 142 | 143 | # Rotate tick marks for visibility 144 | plt.yticks(rotation=0) 145 | plt.xticks(rotation=90) 146 | 147 | #Show the plot 148 | plt.show() 149 | plt.clf() 150 | ___________________________________________________________________ -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /08 - Introduction to Import data in python/Chapter 1 - introduction and flat files 1.txt: -------------------------------------------------------------------------------- 1 | 1.Importing entire text files 2 | # Open a file: file 3 | file = open('moby_dick.txt', mode='r') 4 | 5 | # Print it 6 | print(file.read()) 7 | 8 | # Check whether file is closed 9 | print(file.closed) 10 | 11 | # Close file 12 | file.close() 13 | 14 | # Check whether file is closed 15 | print(file.closed) 16 | 17 | _______________________________________________________________________________ 18 | 2.Importing text files line by line 19 | # Read & print the first 3 lines 20 | with open('moby_dick.txt') as file: 21 | print(file.readline()) 22 | 
print(file.readline()) 23 | print(file.readline()) 24 | 25 | _______________________________________________________________________________ 26 | 3.Using NumPy to import flat files 27 | # Import package 28 | import numpy as np 29 | 30 | # Assign filename to variable: file 31 | file = 'digits.csv' 32 | 33 | # Load file as array: digits 34 | digits = np.loadtxt(file, delimiter=',') 35 | 36 | # Print datatype of digits 37 | print(type(digits)) 38 | 39 | # Select and reshape a row 40 | im = digits[21, 1:] 41 | im_sq = np.reshape(im, (28, 28)) 42 | 43 | # Plot reshaped data (matplotlib.pyplot already loaded as plt) 44 | plt.imshow(im_sq, cmap='Greys', interpolation='nearest') 45 | plt.show() 46 | 47 | _______________________________________________________________________________ 48 | 4.Customizing your NumPy import 49 | # Import numpy 50 | import numpy as np 51 | 52 | # Assign the filename: file 53 | file = 'digits_header.txt' 54 | 55 | # Load the data: data 56 | data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2]) 57 | 58 | # Print data 59 | print(data) 60 | 61 | _______________________________________________________________________________ 62 | 5.Importing different datatypes 63 | # Assign filename: file 64 | file = 'seaslug.txt' 65 | 66 | # Import file: data 67 | data = np.loadtxt(file, delimiter='\t', dtype=str) 68 | 69 | # Print the first element of data 70 | print(data[0]) 71 | 72 | # Import data as floats and skip the first row: data_float 73 | data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1) 74 | 75 | # Print the 10th element of data_float 76 | print(data_float[9]) 77 | 78 | # Plot a scatterplot of the data 79 | plt.scatter(data_float[:, 0], data_float[:, 1]) 80 | plt.xlabel('time (min.)') 81 | plt.ylabel('percentage of larvae') 82 | plt.show() 83 | 84 | _______________________________________________________________________________ 85 | 6.Working with mixed datatypes (2) 86 | # Assign the filename: file 87 | file = 'titanic.csv' 88 | 89 | # Import file using np.recfromcsv: d 90 | d = np.recfromcsv(file) 91 | 92 | # Print out first three entries of d 93 | print(d[:3]) 94 | 95 | _______________________________________________________________________________ 96 | 7.Using pandas to import flat files as DataFrames (1) 97 | # Import pandas 98 | import pandas as pd 99 | 100 | # Assign the filename: file 101 | file = 'titanic.csv' 102 | 103 | # Read the file into a DataFrame: df 104 | df = pd.read_csv(file) 105 | 106 | # View the head of the DataFrame 107 | print(df.head()) 108 | 109 | _______________________________________________________________________________ 110 | 8.Using pandas to import flat files as DataFrames (2) 111 | # Assign the filename: file 112 | file = 'digits.csv' 113 | 114 | # Read the first 5 rows of the file into a DataFrame: data 115 | data = pd.read_csv(file, nrows=5, header=None) 116 | 117 | # Build a numpy array from the DataFrame: data_array 118 | data_array = data.values 119 | 120 | # Print the datatype of data_array to the shell 121 | print(type(data_array)) 122 | 123 | _______________________________________________________________________________ 124 | 9.Customizing your pandas import 125 | # Import matplotlib.pyplot as plt 126 | import matplotlib.pyplot as plt 127 | 128 | # Assign filename: file 129 | file = 'titanic_corrupt.txt' 130 | 131 | # Import file: data 132 | data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing']) 133 | 134 | # Print the head of the DataFrame 135 | print(data.head()) 136 | 137 | # Plot 
'Age' variable in a histogram 138 | pd.DataFrame.hist(data[['Age']]) 139 | plt.xlabel('Age (years)') 140 | plt.ylabel('count') 141 | plt.show() 142 | 143 | _______________________________________________________________________________ -------------------------------------------------------------------------------- /08 - Introduction to Import data in python/Chapter 2 - importing data from other file types 2.txt: -------------------------------------------------------------------------------- 1 | 1.Loading a pickled file 2 | # Import pickle package 3 | import pickle 4 | 5 | # Open pickle file and load data 6 | with open('data.pkl', 'rb') as file: 7 | d = pickle.load(file) 8 | 9 | # Print data 10 | print(d) 11 | 12 | # Print datatype 13 | print(type(d)) 14 | ________________________________________________________________________________ 15 | 2.Listing sheets in Excel files 16 | # Import pandas 17 | import pandas as pd 18 | 19 | # Assign spreadsheet filename: file 20 | file = 'battledeath.xlsx' 21 | 22 | # Load spreadsheet: xls 23 | xls = pd.ExcelFile(file) 24 | 25 | # Print sheet names 26 | print(xls.sheet_names) 27 | 28 | ________________________________________________________________________________ 29 | 3.Importing sheets from Excel files 30 | # Load a sheet into a DataFrame by name: df1 31 | df1 = xls.parse('2004') 32 | 33 | # Print the head of the DataFrame df1 34 | print(df1.head()) 35 | 36 | # Load a sheet into a DataFrame by index: df2 37 | df2 = xls.parse(0) 38 | 39 | # Print the head of the DataFrame df2 40 | print(df2.head()) 41 | ________________________________________________________________________________ 42 | 4.Customizing your spreadsheet import 43 | # Parse the first sheet and rename the columns: df1 44 | df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)']) 45 | 46 | # Print the head of the DataFrame df1 47 | print(df1.head()) 48 | 49 | # Parse the first column of the second sheet and rename the column: df2 50 | df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country']) 51 | 52 | # Print the head of the DataFrame df2 53 | print(df2.head()) 54 | ________________________________________________________________________________ 55 | 5.Importing SAS files 56 | # Import sas7bdat package 57 | from sas7bdat import SAS7BDAT 58 | 59 | # Save file to a DataFrame: df_sas 60 | with SAS7BDAT('sales.sas7bdat') as file: 61 | df_sas = file.to_data_frame() 62 | 63 | # Print head of DataFrame 64 | print(df_sas.head()) 65 | 66 | # Plot histograms of a DataFrame feature (pandas and pyplot already imported) 67 | pd.DataFrame.hist(df_sas[['P']]) 68 | plt.ylabel('count') 69 | plt.show() 70 | ________________________________________________________________________________ 71 | 6.Importing Stata files 72 | # Import pandas 73 | import pandas as pd 74 | 75 | # Load Stata file into a pandas DataFrame: df 76 | df = pd.read_stata('disarea.dta') 77 | 78 | # Print the head of the DataFrame df 79 | print(df.head()) 80 | 81 | # Plot histogram of one column of the DataFrame 82 | pd.DataFrame.hist(df[['disa10']]) 83 | plt.xlabel('Extent of disease') 84 | plt.ylabel('Number of countries') 85 | plt.show() 86 | ________________________________________________________________________________ 87 | 7.Using h5py to import HDF5 files 88 | # Import packages 89 | import numpy as np 90 | import h5py 91 | 92 | # Assign filename: file 93 | file = 'LIGO_data.hdf5' 94 | 95 | # Load file: data 96 | data = h5py.File(file, 'r') 97 | 98 | # Print the datatype of the loaded file 99 | 
print(type(data)) 100 | 101 | # Print the keys of the file 102 | for key in data.keys(): 103 | print(key) 104 | ________________________________________________________________________________ 105 | 8.Extracting data from your HDF5 file 106 | # Get the HDF5 group: group 107 | group = data['strain'] 108 | 109 | # Check out keys of group 110 | for key in group.keys(): 111 | print(key) 112 | 113 | # Set variable equal to time series data: strain 114 | strain = data['strain']['Strain'].value 115 | 116 | # Set number of time points to sample: num_samples 117 | num_samples = 10000 118 | 119 | # Set time vector 120 | time = np.arange(0, 1, 1/num_samples) 121 | 122 | # Plot data 123 | plt.plot(time, strain[:num_samples]) 124 | plt.xlabel('GPS Time (s)') 125 | plt.ylabel('strain') 126 | plt.show() 127 | ________________________________________________________________________________ 128 | 9.Loading .mat files 129 | # Import package 130 | import scipy.io 131 | 132 | # Load MATLAB file: mat 133 | mat = scipy.io.loadmat('albeck_gene_expression.mat') 134 | 135 | # Print the datatype type of mat 136 | print(type(mat)) 137 | ________________________________________________________________________________ 138 | 10.The structure of .mat in Python 139 | # Print the keys of the MATLAB dictionary 140 | print(mat.keys()) 141 | 142 | # Print the type of the value corresponding to the key 'CYratioCyt' 143 | print(type(mat['CYratioCyt'])) 144 | 145 | # Print the shape of the value corresponding to the key 'CYratioCyt' 146 | print(np.shape(mat['CYratioCyt'])) 147 | 148 | # Subset the array and plot it 149 | data = mat['CYratioCyt'][25, 5:] 150 | fig = plt.figure() 151 | plt.plot(data) 152 | plt.xlabel('time (min.)') 153 | plt.ylabel('normalized fluorescence (measure of expression)') 154 | plt.show() 155 | ________________________________________________________________________________ -------------------------------------------------------------------------------- /08 - Introduction to Import data in python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /09 - Intermediate importing data in python/Chapter 2 - intracting with apis to import data from web.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Anirudh-Chauhan/Data-Scientist-with-Python-DataCamp/2b254aab79c5c7420c9fd96aeeab81a020420a27/09 - Intermediate importing data in python/Chapter 2 - intracting with apis to import data from web.txt -------------------------------------------------------------------------------- /09 - Intermediate importing data in python/Chapter 3 - Diving deep into the twitter api.txt: -------------------------------------------------------------------------------- 1 | 1.API Authentication 2 | # Import package 3 | import tweepy,json 4 | 5 | # Store OAuth authentication credentials in relevant variables 6 | access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy" 7 | access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx" 8 | consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM" 9 | consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i" 10 | 11 | # Pass OAuth details to tweepy's OAuth handler 12 | auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 13 | auth.set_access_token(access_token,access_token_secret) 14 | ___________________________________________________________________________ 15 | 
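Note: exercises 2 and 5 below rely on two helpers that the course defines beforehand and that are not reproduced in these solutions: a MyStreamListener class and a word_in_text() function. Minimal sketches follow; the output file name, the 100-tweet cap, and the exact signatures are assumptions, and the class targets tweepy's pre-4.0 StreamListener API.

import json
import re
import tweepy

class MyStreamListener(tweepy.StreamListener):
    """Write streamed tweets to a file, stopping after an assumed cap of 100."""
    def __init__(self, api=None):
        super().__init__(api)
        self.num_tweets = 0
        self.file = open('tweets.txt', 'w')  # assumed output file

    def on_status(self, status):
        # Store each tweet's raw JSON on its own line
        self.file.write(json.dumps(status._json) + '\n')
        self.num_tweets += 1
        if self.num_tweets >= 100:
            self.file.close()
            return False  # returning False disconnects the stream
        return True

def word_in_text(word, text):
    """Return True if `word` occurs in `text`, ignoring case."""
    return re.search(word.lower(), text.lower()) is not None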
2.Streaming tweets
16 | # Initialize Stream listener
17 | l = MyStreamListener()
18 | 
19 | # Create your Stream object with authentication
20 | stream = tweepy.Stream(auth, l)
21 | 
22 | # Filter Twitter Streams to capture data by the keywords:
23 | s = ['clinton', 'trump', 'sanders', 'cruz']
24 | stream.filter(track = s)
25 | ___________________________________________________________________________
26 | 3.Load and explore your Twitter data
27 | # Import package
28 | import json
30 | 
31 | # String of path to file: tweets_data_path
32 | tweets_data_path = 'tweets.txt'
33 | 
34 | # Initialize empty list to store tweets: tweets_data
35 | tweets_data = []
36 | 
37 | # Open connection to file
38 | tweets_file = open(tweets_data_path, "r")
39 | 
40 | # Read in tweets and store in list: tweets_data
41 | for line in tweets_file:
42 |     tweet = json.loads(line)
43 |     tweets_data.append(tweet)
44 | 
45 | # Close connection to file
46 | tweets_file.close()
47 | 
48 | # Print the keys of the first tweet dict
49 | print(tweets_data[0].keys())
55 | ___________________________________________________________________________
56 | 4.Twitter data to DataFrame
57 | # Import package
58 | import pandas as pd
59 | 
60 | # Build DataFrame of tweet texts and languages
61 | df = pd.DataFrame(tweets_data, columns=['text', 'lang'])
62 | 
63 | # Print head of DataFrame
64 | print(df.head())
65 | ___________________________________________________________________________
66 | 5.A little bit of Twitter text analysis
67 | # Initialize the candidate mention counts
68 | [clinton, trump, sanders, cruz] = [0, 0, 0, 0]
69 | 
70 | # Iterate through df, counting the number of tweets in which
71 | # each candidate is mentioned
72 | for index, row in df.iterrows():
73 |     clinton += word_in_text('clinton', row['text'])
74 |     trump += word_in_text('trump', row['text'])
75 |     sanders += word_in_text('sanders', row['text'])
76 |     cruz += word_in_text('cruz', row['text'])
77 | ___________________________________________________________________________
78 | 6.Plotting your Twitter data
79 | # Import packages
80 | import seaborn as sns
81 | import matplotlib.pyplot as plt
82 | 
83 | # Set seaborn style
84 | sns.set(color_codes=True)
85 | 
86 | # Create a list of labels: cd
87 | cd = ['clinton', 'trump', 'sanders', 'cruz']
88 | 
89 | # Plot the bar chart
90 | ax = sns.barplot(cd, [clinton, trump, sanders, cruz])
91 | ax.set(ylabel="count")
92 | plt.show()
93 | ___________________________________________________________________________
--------------------------------------------------------------------------------
/09 - Intermediate importing data in python/key points:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/10 - Cleaning Data in Python/Chapter 1 - Common Data Problems.txt:
--------------------------------------------------------------------------------
1 | 1.Numeric data or ... ?
2 | # Print the information of ride_sharing 3 | print(ride_sharing.info()) 4 | 5 | # Print summary statistics of user_type column 6 | print(ride_sharing['user_type'].describe()) 7 | 8 | # Convert user_type from integer to category 9 | ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category') 10 | 11 | # Write an assert statement confirming the change 12 | assert ride_sharing['user_type_cat'].dtype == 'category' 13 | 14 | # Print new summary statistics 15 | print(ride_sharing['user_type_cat'].describe()) 16 | ________________________________________________________________________ 17 | 2.Summing strings and concatenating numbers 18 | # Strip duration of minutes 19 | ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes") 20 | 21 | # Convert duration to integer 22 | ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype("int") 23 | 24 | # Write an assert statement making sure of conversion 25 | assert ride_sharing['duration_time'].dtype == "int" 26 | 27 | # Print formed columns and calculate average ride duration 28 | print(ride_sharing[['duration','duration_trim','duration_time']]) 29 | print(ride_sharing[['duration','duration_trim','duration_time']].mean()) 30 | ________________________________________________________________________ 31 | 3.Tire size constraints 32 | # Convert tire_sizes to integer 33 | ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int') 34 | 35 | # Set all values above 27 to 27 36 | ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27 37 | 38 | # Reconvert tire_sizes back to categorical 39 | ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category') 40 | 41 | # Print tire size description 42 | print(ride_sharing['tire_sizes'].describe()) 43 | ________________________________________________________________________ 44 | 4.Back to the future 45 | # Convert ride_date to datetime 46 | ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']) 47 | 48 | # Save today's date 49 | today = dt.date.today() 50 | 51 | # Set all in the future to today's date 52 | ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today 53 | 54 | # Print maximum of ride_dt column 55 | print(ride_sharing['ride_dt'].max()) 56 | ________________________________________________________________________ 57 | 5.Finding duplicates 58 | # Find duplicates 59 | duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False) 60 | 61 | # Sort your duplicated rides 62 | duplicated_rides = ride_sharing[duplicates].sort_values('ride_id') 63 | 64 | # Print relevant columns 65 | print(duplicated_rides[['ride_id','duration','user_birth_year']]) 66 | ____________________________________________________________________________ 67 | 6.Treating duplicates 68 | # Drop complete duplicates from ride_sharing 69 | ride_dup = ride_sharing.drop_duplicates() 70 | 71 | # Create statistics dictionary for aggregation function 72 | statistics = {'user_birth_year': 'min', 'duration': 'mean'} 73 | 74 | # Group by ride_id and compute new statistics 75 | ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index() 76 | 77 | # Find duplicated values again 78 | duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False) 79 | duplicated_rides = ride_unique[duplicates == True] 80 | 81 | # Assert duplicates are processed 82 | assert duplicated_rides.shape[0] == 0 83 | ____________________________________________________________________________ 84 | 
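A note on exercise 6 above: drop_duplicates() only removes rows that are complete copies, while groupby().agg() collapses rows that share a ride_id but disagree elsewhere. The same pattern on a self-contained toy frame (made-up data):

import pandas as pd

# One complete duplicate of ride 'A', plus one partial duplicate with different stats
rides = pd.DataFrame({'ride_id': ['A', 'A', 'A', 'B'],
                      'duration': [10, 10, 12, 7],
                      'user_birth_year': [1990, 1990, 1985, 2000]})

# Complete duplicates collapse to a single row
ride_dup = rides.drop_duplicates()

# Partial duplicates collapse via per-column aggregation
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
print(ride_unique)  # one row per ride_id, e.g. ride 'A' -> birth year 1985, duration 11.0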
-------------------------------------------------------------------------------- /10 - Cleaning Data in Python/Chapter 2 - Text and categorical data problems.txt: -------------------------------------------------------------------------------- 1 | 1.Finding consistency 2 | # Print categories DataFrame 3 | print(categories) 4 | 5 | # Print unique values of survey columns in airlines 6 | print('Cleanliness: ', airlines['cleanliness'].unique(), "\n") 7 | print('Safety: ', airlines['safety'].unique(), "\n") 8 | print('Satisfaction: ', airlines['satisfaction'].unique(), "\n") 9 | 10 | /*******************************/ 11 | # Find the cleanliness category in airlines not in categories 12 | cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness']) 13 | 14 | # Find rows with that category 15 | cat_clean_rows = airlines['cleanliness'].isin(cat_clean) 16 | 17 | # Print rows with inconsistent category 18 | print(airlines[cat_clean_rows]) 19 | 20 | /*******************************/ 21 | # Find the cleanliness category in airlines not in categories 22 | cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness']) 23 | 24 | # Find rows with that category 25 | cat_clean_rows = airlines['cleanliness'].isin(cat_clean) 26 | 27 | # Print rows with inconsistent category 28 | print(airlines[cat_clean_rows]) 29 | 30 | # Print rows with consistent categories only 31 | print(airlines[~cat_clean_rows]) 32 | ______________________________________________________________________________ 33 | 2.Inconsistent categories 34 | # Print unique values of both columns 35 | print(airlines['dest_region'].unique()) 36 | print(airlines['dest_size'].unique()) 37 | /***********************************/ 38 | # Print unique values of both columns 39 | print(airlines['dest_region'].unique()) 40 | print(airlines['dest_size'].unique()) 41 | 42 | # Lower dest_region column and then replace "eur" with "europe" 43 | airlines['dest_region'] = airlines['dest_region'].str.lower() 44 | airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'}) 45 | /************************************/ 46 | # Print unique values of both columns 47 | print(airlines['dest_region'].unique()) 48 | print(airlines['dest_size'].unique()) 49 | 50 | # Lower dest_region column and then replace "eur" with "europe" 51 | airlines['dest_region'] = airlines['dest_region'].str.lower() 52 | airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'}) 53 | 54 | # Remove white spaces from `dest_size` 55 | airlines['dest_size'] = airlines['dest_size'].str.strip() 56 | 57 | # Verify changes have been effected 58 | print(airlines['dest_region'].unique()) 59 | print(airlines['dest_size'].unique()) 60 | ___________________________________________________________________________ 61 | 3.Remapping categories 62 | # Create ranges for categories 63 | label_ranges = [0, 60, 180, np.inf] 64 | label_names = ['short', 'medium', 'long'] 65 | 66 | # Create wait_type column 67 | airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 68 | labels = label_names) 69 | 70 | # Create mappings and replace 71 | mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 72 | 'Thursday': 'weekday', 'Friday': 'weekday', 73 | 'Saturday': 'weekend', 'Sunday': 'weekend'} 74 | 75 | airlines['day_week'] = airlines['day'].replace(mappings) 76 | ____________________________________________________________________________ 77 | 4.Removing titles and taking names 78 | # Replace "Dr." 
with empty string ""
79 | airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")
80 | 
81 | # Replace "Mr." with empty string ""
82 | airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")
83 | 
84 | # Replace "Miss" with empty string ""
85 | airlines['full_name'] = airlines['full_name'].str.replace("Miss","")
86 | 
87 | # Replace "Ms." with empty string ""
88 | airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")
89 | 
90 | # Assert that full_name has no honorifics
91 | assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False
92 | ____________________________________________________________________________
93 | 5.Keeping it descriptive
94 | # Store length of each row in survey_response column
95 | resp_length = airlines['survey_response'].str.len()
96 | 
97 | # Find rows in airlines where resp_length > 40
98 | airlines_survey = airlines[resp_length > 40]
99 | 
100 | # Assert minimum survey_response length is > 40
101 | assert airlines_survey['survey_response'].str.len().min() > 40
102 | 
103 | # Print new survey_response column
104 | print(airlines_survey['survey_response'])
105 | ____________________________________________________________________________
--------------------------------------------------------------------------------
/10 - Cleaning Data in Python/Chapter 3 - Advanced data problems.txt:
--------------------------------------------------------------------------------
1 | 1.Uniform currencies
2 | # Find values of acct_cur that are equal to 'euro'
3 | acct_eu = banking['acct_cur'] == 'euro'
4 | 
5 | # Convert acct_amount where it is in euro to dollars
6 | banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1
7 | 
8 | # Unify acct_cur column by changing 'euro' values to 'dollar'
9 | banking.loc[acct_eu, 'acct_cur'] = 'dollar'
10 | 
11 | # Assert that only dollar currency remains
12 | assert banking['acct_cur'].unique() == 'dollar'
13 | ______________________________________________________________________
14 | 2.Uniform dates
15 | # Print the header of account_opened
16 | print(banking['account_opened'].head())
17 | 
18 | # Convert account_opened to datetime
19 | banking['account_opened'] = pd.to_datetime(banking['account_opened'],
20 |                                            # Infer datetime format
21 |                                            infer_datetime_format = True,
22 |                                            # Return missing value for error
23 |                                            errors = 'coerce')
24 | 
25 | # Get year of account opened
26 | banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
27 | 
28 | # Print acct_year
29 | print(banking['acct_year'])
30 | ______________________________________________________________________
31 | 3.How's our data integrity?
32 | # Store fund columns to sum against
33 | fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']
34 | 
35 | # Find rows where fund_columns row sum == inv_amount
36 | inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']
37 | 
38 | # Store consistent and inconsistent data
39 | consistent_inv = banking[inv_equ]
40 | inconsistent_inv = banking[~inv_equ]
41 | 
42 | # Print the number of inconsistent investments
43 | print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
44 | ______________________________________________________________________
45 | 4.Missing investors
46 | # Print number of missing values in banking
47 | print(banking.isna().sum())
48 | 
49 | # Visualize missingness matrix
50 | msno.matrix(banking)
51 | plt.show()
52 | 
53 | # Isolate missing and non missing values of inv_amount
54 | missing_investors = banking[banking['inv_amount'].isna()]
55 | investors = banking[~banking['inv_amount'].isna()]
56 | 
57 | # Sort banking by age and visualize
58 | banking_sorted = banking.sort_values(by = 'age')
59 | msno.matrix(banking_sorted)
60 | plt.show()
61 | ______________________________________________________________________
62 | 5.Follow the money
63 | # Drop missing values of cust_id
64 | banking_fullid = banking.dropna(subset = ['cust_id'])
65 | 
66 | # Compute estimated acct_amount
67 | acct_imp = banking_fullid['inv_amount'] * 5
68 | 
69 | # Impute missing acct_amount with corresponding acct_imp
70 | banking_imputed = banking_fullid.fillna({'acct_amount': acct_imp})
71 | 
72 | # Print number of missing values
73 | print(banking_imputed.isna().sum())
74 | ______________________________________________________________________
75 | 
--------------------------------------------------------------------------------
/10 - Cleaning Data in Python/Chapter 4 - Record Linkage.txt:
--------------------------------------------------------------------------------
1 | 1.The cutoff point
2 | # Import process from fuzzywuzzy
3 | from fuzzywuzzy import process
4 | 
5 | # Store the unique values of cuisine_type in unique_types
6 | unique_types = restaurants['cuisine_type'].unique()
7 | 
8 | # Calculate similarity of 'asian' to all values of unique_types
9 | print(process.extract('asian', unique_types, limit = len(unique_types)))
10 | 
11 | # Calculate similarity of 'american' to all values of unique_types
12 | print(process.extract('american', unique_types, limit = len(unique_types)))
13 | 
14 | # Calculate similarity of 'italian' to all values of unique_types
15 | print(process.extract('italian', unique_types, limit = len(unique_types)))
16 | _____________________________________________________________________________
17 | 2.Remapping categories II
18 | # Iterate through categories
19 | for cuisine in categories:
20 |     # Create a list of matches, comparing cuisine with the cuisine_type column
21 |     matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
22 | 
23 |     # Iterate through the list of matches
24 |     for match in matches:
25 |         # Check whether the similarity score is greater than or equal to 80
26 |         if match[1] >= 80:
27 |             # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
28 |             restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
29 | 
30 | # Inspect the final result
31 | restaurants['cuisine_type'].unique()
32 | _____________________________________________________________________________
33 | 3.Pairs of restaurants
34 | # Create an indexer and object and find
possible pairs 35 | indexer = recordlinkage.Index() 36 | 37 | # Block pairing on cuisine_type 38 | indexer.block('cuisine_type') 39 | 40 | # Generate pairs 41 | pairs = indexer.index(restaurants, restaurants_new) 42 | _____________________________________________________________________________ 43 | 4.Similar restaurants 44 | # Create a comparison object 45 | comp_cl = recordlinkage.Compare() 46 | 47 | # Find exact matches on city, cuisine_types - 48 | comp_cl.exact('city', 'city', label='city') 49 | comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type') 50 | 51 | # Find similar matches of rest_name 52 | comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8) 53 | 54 | # Get potential matches and print 55 | potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new) 56 | print(potential_matches) 57 | _____________________________________________________________________________ 58 | 5.Linking them together! 59 | # Isolate potential matches with row sum >=3 60 | matches = potential_matches[potential_matches.sum(axis = 1) >= 3] 61 | 62 | # Get values of second column index of matches 63 | matching_indices = matches.index.get_level_values(1) 64 | 65 | # Subset restaurants_new based on non-duplicate values 66 | non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)] 67 | 68 | # Append non_dup to restaurants 69 | full_restaurants = restaurants.append(non_dup) 70 | print(full_restaurants) 71 | _____________________________________________________________________________ -------------------------------------------------------------------------------- /10 - Cleaning Data in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /11 - Working with Dates and Times in Python/Chapter 1 - Dates and Calenders .txt: -------------------------------------------------------------------------------- 1 | 1.Which day of the week? 2 | # Import date from datetime 3 | from datetime import date 4 | 5 | # Create a date object 6 | hurricane_andrew = date(1992, 8, 24) 7 | 8 | # Which day of the week is the date? 9 | print(hurricane_andrew.weekday()) 10 | ____________________________________________________________________________ 11 | 2.How many hurricanes come early? 
12 | # Counter for how many before June 1 13 | early_hurricanes = 0 14 | 15 | # We loop over the dates 16 | for hurricane in florida_hurricane_dates: 17 | # Check if the month is before June (month number 6) 18 | if hurricane.month < 6: 19 | early_hurricanes = early_hurricanes + 1 20 | 21 | print(early_hurricanes) 22 | ____________________________________________________________________________ 23 | 3.Subtracting dates 24 | # Import date 25 | from datetime import date 26 | 27 | # Create a date object for May 9th, 2007 28 | start = date(2007, 5, 9) 29 | 30 | # Create a date object for December 13th, 2007 31 | end = date(2007, 12, 13) 32 | 33 | # Subtract the two dates and print the number of days 34 | print((end - start).days) 35 | ____________________________________________________________________________ 36 | 4.Counting events per calendar month 37 | # A dictionary to count hurricanes per calendar month 38 | hurricanes_each_month = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6:0, 39 | 7: 0, 8:0, 9:0, 10:0, 11:0, 12:0} 40 | 41 | # Loop over all hurricanes 42 | for hurricane in florida_hurricane_dates: 43 | # Pull out the month 44 | month = hurricane.month 45 | # Increment the count in your dictionary by one 46 | hurricanes_each_month[month] += 1 47 | 48 | print(hurricanes_each_month) 49 | ____________________________________________________________________________ 50 | 5.Putting a list of dates in order 51 | # Print the first and last scrambled dates 52 | print(dates_scrambled[0]) 53 | print(dates_scrambled[-1]) 54 | 55 | # Put the dates in order 56 | dates_ordered = sorted(dates_scrambled) 57 | 58 | # Print the first and last ordered dates 59 | print(dates_ordered[0]) 60 | print(dates_ordered[-1]) 61 | ____________________________________________________________________________ 62 | 6.Printing dates in a friendly format 63 | # Assign the earliest date to first_date 64 | first_date = min(florida_hurricane_dates) 65 | 66 | # Convert to ISO and US formats 67 | iso = "Our earliest hurricane date: " + first_date.isoformat() 68 | us = "Our earliest hurricane date: " + first_date.strftime("%m/%d/%Y") 69 | 70 | print("ISO: " + iso) 71 | print("US: " + us) 72 | ____________________________________________________________________________ 73 | 7.Representing dates in different ways 74 | # Import date 75 | from datetime import date 76 | 77 | # Create a date object 78 | andrew = date(1992, 8, 26) 79 | 80 | # Print the date in the format 'YYYY-MM' 81 | print(andrew.strftime('%Y-%m')) 82 | ____________________________________________________________________________ -------------------------------------------------------------------------------- /11 - Working with Dates and Times in Python/Chapter 2 - Combining dates and times.txt: -------------------------------------------------------------------------------- 1 | 1.Creating datetimes by hand 2 | # Import datetime 3 | from datetime import datetime 4 | 5 | # Create a datetime object 6 | dt = datetime(2017, 10, 1, 15, 26, 26) 7 | 8 | # Print the results in ISO 8601 format 9 | print(dt.isoformat()) 10 | _______________________________________________________________________________ 11 | 2.Counting events before and after noon 12 | # Create dictionary to hold results 13 | trip_counts = {'AM': 0, 'PM': 0} 14 | 15 | # Loop over all trips 16 | for trip in onebike_datetimes: 17 | # Check to see if the trip starts before noon 18 | if trip['start'].hour < 12: 19 | # Increment the counter for before noon 20 | trip_counts['AM'] += 1 21 | else: 22 | # Increment the counter for 
after noon 23 | trip_counts['PM'] += 1 24 | 25 | print(trip_counts) 26 | _______________________________________________________________________________ 27 | 3.Turning strings into datetimes 28 | # Import the datetime class 29 | from datetime import datetime 30 | 31 | # Starting string, in YYYY-MM-DD HH:MM:SS format 32 | s = '2017-02-03 00:00:01' 33 | 34 | # Write a format string to parse s 35 | fmt = '%Y-%m-%d %H:%M:%S' 36 | 37 | # Create a datetime object d 38 | d = datetime.strptime(s, fmt) 39 | 40 | # Print d 41 | print(d) 42 | _______________________________________________________________________________ 43 | 4.Parsing pairs of strings as datetimes 44 | # Write down the format string 45 | fmt = "%Y-%m-%d %H:%M:%S" 46 | 47 | # Initialize a list for holding the pairs of datetime objects 48 | onebike_datetimes = [] 49 | 50 | # Loop over all trips 51 | for (start, end) in onebike_datetime_strings: 52 | trip = {'start': datetime.strptime(start, fmt), 53 | 'end': datetime.strptime(end, fmt)} 54 | 55 | # Append the trip 56 | onebike_datetimes.append(trip) 57 | _______________________________________________________________________________ 58 | 5.Recreating ISO format with strftime() 59 | # Import datetime 60 | from datetime import datetime 61 | 62 | # Pull out the start of the first trip 63 | first_start = onebike_datetimes[0]['start'] 64 | 65 | # Format to feed to strftime() 66 | fmt = "%Y-%m-%dT%H:%M:%S" 67 | 68 | # Print out date with .isoformat(), then with .strftime() to compare 69 | print(first_start.isoformat()) 70 | print(first_start.strftime(fmt)) 71 | _______________________________________________________________________________ 72 | 6.Unix timestamps 73 | # Import datetime 74 | from datetime import datetime 75 | 76 | # Starting timestamps 77 | timestamps = [1514665153, 1514664543] 78 | 79 | # Datetime objects 80 | dts = [] 81 | 82 | # Loop 83 | for ts in timestamps: 84 | dts.append(datetime.fromtimestamp(ts)) 85 | 86 | # Print results 87 | print(dts) 88 | _______________________________________________________________________________ 89 | 7.Turning pairs of datetimes into durations 90 | # Initialize a list for all the trip durations 91 | onebike_durations = [] 92 | 93 | for trip in onebike_datetimes: 94 | # Create a timedelta object corresponding to the length of the trip 95 | trip_duration = trip['end'] - trip['start'] 96 | 97 | # Get the total elapsed seconds in trip_duration 98 | trip_length_seconds = trip_duration.total_seconds() 99 | 100 | # Append the results to our list 101 | onebike_durations.append(trip_length_seconds) 102 | _______________________________________________________________________________ 103 | 8.Average trip time 104 | # What was the total duration of all trips? 105 | total_elapsed_time = sum(onebike_durations) 106 | 107 | # What was the total number of trips? 
108 | number_of_trips = len(onebike_durations) 109 | 110 | # Divide the total duration by the number of trips 111 | print(total_elapsed_time / number_of_trips) 112 | _______________________________________________________________________________ 113 | 9.The long and the short of why time is hard 114 | # Calculate shortest and longest trips 115 | shortest_trip = min(onebike_durations) 116 | longest_trip = max(onebike_durations) 117 | 118 | # Print out the results 119 | print("The shortest trip was " + str(shortest_trip) + " seconds") 120 | print("The longest trip was " + str(longest_trip) + " seconds") 121 | _______________________________________________________________________________ -------------------------------------------------------------------------------- /11 - Working with Dates and Times in Python/Chapter 3 - time zones and daylight saving.txt: -------------------------------------------------------------------------------- 1 | 1.Creating timezone aware datetimes 2 | # Import datetime, timezone 3 | from datetime import datetime, timezone 4 | 5 | # October 1, 2017 at 15:26:26, UTC 6 | dt = datetime(2017, 10, 1, 15, 26, 26, tzinfo=timezone.utc) 7 | 8 | # Print results 9 | print(dt.isoformat()) 10 | ____________________________________________________________________________ 11 | 2.What time did the bike leave in UTC? 12 | # Loop over the trips 13 | for trip in onebike_datetimes[:10]: 14 | # Pull out the start 15 | dt = trip['start'] 16 | # Move dt to be in UTC 17 | dt = dt.astimezone(timezone.utc) 18 | 19 | # Print the start time in UTC 20 | print('Original:', trip['start'], '| UTC:', dt.isoformat()) 21 | ____________________________________________________________________________ 22 | 3.Putting the bike trips into the right time zone 23 | # Import tz 24 | from dateutil import tz 25 | 26 | # Create a timezone object for Eastern Time 27 | et = tz.gettz('America/New_York') 28 | 29 | # Loop over trips, updating the datetimes to be in Eastern Time 30 | for trip in onebike_datetimes[:10]: 31 | # Update trip['start'] and trip['end'] 32 | trip['start'] = trip['start'].replace(tzinfo = et) 33 | trip['end'] = trip['end'].replace(tzinfo = et) 34 | ____________________________________________________________________________ 35 | 4.What time did the bike leave? (Global edition) 36 | # Create the timezone object 37 | uk = tz.gettz('Europe/London') 38 | 39 | # Pull out the start of the first trip 40 | local = onebike_datetimes[0]['start'] 41 | 42 | # What time was it in the UK? 43 | notlocal = local.astimezone(uk) 44 | 45 | # Print them out and see the difference 46 | print(local.isoformat()) 47 | print(notlocal.isoformat()) 48 | ____________________________________________________________________________ 49 | 5.How many hours elapsed around daylight saving? 
50 | # Import datetime, timedelta, tz, timezone
51 | from datetime import datetime, timedelta, timezone
52 | from dateutil import tz
53 | 
54 | # Start on March 12, 2017, midnight, then add 6 hours
55 | start = datetime(2017, 3, 12, tzinfo = tz.gettz('America/New_York'))
56 | end = start + timedelta(hours=6)
57 | print(start.isoformat() + " to " + end.isoformat())
58 | ____________________________________________________________________________
59 | 6.March 29, throughout a decade
60 | # Import datetime and tz
61 | from datetime import datetime
62 | from dateutil import tz
63 | 
64 | # Create starting date
65 | dt = datetime(2000, 3, 29, tzinfo = tz.gettz('Europe/London'))
66 | 
67 | # Loop over the dates, replacing the year, and print the ISO timestamp
68 | for y in range(2000, 2011):
69 |     print(dt.replace(year=y).isoformat())
70 | ____________________________________________________________________________
71 | 7.Finding ambiguous datetimes
72 | # Loop over trips
73 | for trip in onebike_datetimes:
74 |     # Rides with ambiguous start
75 |     if tz.datetime_ambiguous(trip['start']):
76 |         print("Ambiguous start at " + str(trip['start']))
77 |     # Rides with ambiguous end
78 |     if tz.datetime_ambiguous(trip['end']):
79 |         print("Ambiguous end at " + str(trip['end']))
80 | ____________________________________________________________________________
81 | 8.Cleaning daylight saving data with fold
82 | trip_durations = []
83 | for trip in onebike_datetimes:
84 |     # When the start is later than the end, set the fold to be 1
85 |     if trip['start'] > trip['end']:
86 |         trip['end'] = tz.enfold(trip['end'])
87 |     # Convert to UTC
88 |     start = trip['start'].astimezone(tz.UTC)
89 |     end = trip['end'].astimezone(tz.UTC)
90 | 
91 |     # Subtract to get the trip length in seconds
92 |     trip_length_seconds = (end-start).total_seconds()
93 |     trip_durations.append(trip_length_seconds)
94 | 
95 | # Take the shortest trip duration
96 | print("Shortest trip: " + str(min(trip_durations)))
97 | ____________________________________________________________________________
--------------------------------------------------------------------------------
/11 - Working with Dates and Times in Python/key points:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/12 - Writing functions in Python/Chapter 1 - Best Practices.txt:
--------------------------------------------------------------------------------
1 | 1.Crafting a docstring
2 | def count_letter(content, letter):
3 |     """Count the number of times `letter` appears in `content`.
4 | 
5 |     Args:
6 |         content (str): The string to search.
7 |         letter (str): The letter to search for.
8 | 
9 |     Returns:
10 |         int
11 | 
12 |     # Add a section detailing what errors might be raised
13 |     Raises:
14 |         ValueError: If `letter` is not a one-character string.
15 |     """
16 |     if (not isinstance(letter, str)) or len(letter) != 1:
17 |         raise ValueError('`letter` must be a single character string.')
18 |     return len([char for char in content if char == letter])
19 | __________________________________________________________
20 | 2.Retrieving docstrings
21 | def build_tooltip(function):
22 |     """Create a tooltip for any function that shows the
23 |     function's docstring.
24 | 
25 |     Args:
26 |         function (callable): The function we want a tooltip for.
27 | 28 | Returns: 29 | str 30 | """ 31 | # Use 'inspect' to get the docstring 32 | docstring = inspect.getdoc(function) 33 | border = '#' * 28 34 | return '{}\n{}\n{}'.format(border, docstring, border) 35 | 36 | print(build_tooltip(count_letter)) 37 | print(build_tooltip(range)) 38 | print(build_tooltip(print)) __________________________________________________________
39 | 3.Extract a function 40 | def standardize(column): 41 | """Standardize the values in a column. 42 | 43 | Args: 44 | column (pandas Series): The data to standardize. 45 | 46 | Returns: 47 | pandas Series: the values as z-scores 48 | """ 49 | # Finish the function so that it returns the z-scores 50 | z_score = (column - column.mean()) / column.std() 51 | return z_score 52 | 53 | # Use the standardize() function to calculate the z-scores 54 | df['y1_z'] = standardize(df.y1_gpa) 55 | df['y2_z'] = standardize(df.y2_gpa) 56 | df['y3_z'] = standardize(df.y3_gpa) 57 | df['y4_z'] = standardize(df.y4_gpa) 58 | _____________________________________________________________
59 | 4.Split up a function 60 | def mean(values): 61 | """Get the mean of a list of values 62 | 63 | Args: 64 | values (iterable of float): A list of numbers 65 | 66 | Returns: 67 | float 68 | """ 69 | # Write the mean() function 70 | mean = sum(values) / len(values) 71 | return mean 72 | def median(values): 73 | """Get the median of a list of values 74 | 75 | Args: 76 | values (iterable of float): A list of numbers 77 | 78 | Returns: 79 | float 80 | """ 81 | # Write the median() function (note: this assumes `values` is already sorted) 82 | midpoint = int(len(values) / 2) 83 | if len(values) % 2 == 0: 84 | median = (values[midpoint - 1] + values[midpoint]) / 2 85 | else: 86 | median = values[midpoint] 87 | return median 88 | ___________________________________________________________________
89 | 5.Best practice for default arguments 90 | # Use an immutable variable for the default argument 91 | def better_add_column(values, df=None): 92 | """Add a column of `values` to a DataFrame `df`. 93 | The column will be named "col_n" where "n" is 94 | the numerical index of the column. 95 | 96 | Args: 97 | values (iterable): The values of the new column 98 | df (DataFrame, optional): The DataFrame to update. 99 | If no DataFrame is passed, one is created by default.
100 | 101 | Returns: 102 | DataFrame 103 | """ 104 | # Update the function to create a default DataFrame 105 | if df is None: 106 | df = pandas.DataFrame() 107 | df['col_{}'.format(len(df.columns))] = values 108 | return df 109 | ___________________________________________________________________ 110 |
-------------------------------------------------------------------------------- /12 - Writing functions in Python/Chapter 2 - Using Context Managers.txt: --------------------------------------------------------------------------------
1 | 1.The number of cats 2 | # Open "alice.txt" and assign the file to "file" 3 | with open('alice.txt') as file: 4 | text = file.read() 5 | 6 | n = 0 7 | for word in text.split(): 8 | if word.lower() in ['cat', 'cats']: 9 | n += 1 10 | 11 | print('Lewis Carroll uses the word "cat" {} times'.format(n)) 12 | _______________________________________________________________________
13 | 2.The speed of cats 14 | image = get_image_from_instagram() 15 | 16 | # Time how long process_with_numpy(image) takes to run 17 | with timer(): 18 | print('Numpy version') 19 | process_with_numpy(image) 20 | 21 | # Time how long process_with_pytorch(image) takes to run 22 | with timer(): 23 | print('Pytorch version') 24 | process_with_pytorch(image) 25 | _______________________________________________________________________
26 | 3.The timer() context manager 27 | # Add a decorator that will make timer() a context manager 28 | @contextlib.contextmanager 29 | def timer(): 30 | """Time the execution of a context block. 31 | 32 | Yields: 33 | None 34 | """ 35 | start = time.time() 36 | # Send control back to the context block 37 | yield 38 | end = time.time() 39 | print('Elapsed: {:.2f}s'.format(end - start)) 40 | 41 | with timer(): 42 | print('This should take approximately 0.25 seconds') 43 | time.sleep(0.25) 44 | _______________________________________________________________________
45 | 4.A read-only open() context manager 46 | @contextlib.contextmanager 47 | def open_read_only(filename): 48 | """Open a file in read-only mode. 49 | 50 | Args: 51 | filename (str): The location of the file to read 52 | 53 | Yields: 54 | file object 55 | """ 56 | read_only_file = open(filename, mode='r') 57 | # Yield read_only_file so it can be assigned to my_file 58 | yield read_only_file 59 | # Close read_only_file 60 | read_only_file.close() 61 | 62 | with open_read_only('my_file.txt') as my_file: 63 | print(my_file.read()) 64 | _______________________________________________________________________
65 | 5.Scraping the NASDAQ 66 | # Use the "stock('NVDA')" context manager 67 | # and assign the result to the variable "nvda" 68 | with stock('NVDA') as nvda: 69 | # Open 'NVDA.txt' for writing as f_out 70 | with open('NVDA.txt', 'w') as f_out: 71 | for _ in range(10): 72 | value = nvda.price() 73 | print('Logging ${:.2f} for NVDA'.format(value)) 74 | f_out.write('{:.2f}\n'.format(value)) 75 | _______________________________________________________________________
76 | 6.Changing the working directory
@contextlib.contextmanager  # added: required for in_dir() to work in a `with` block, like timer() above
77 | def in_dir(directory): 78 | """Change current working directory to `directory`, 79 | allow the user to run some code, and change back. 80 | 81 | Args: 82 | directory (str): The path to a directory to work in.
83 | """ 84 | current_dir = os.getcwd() 85 | os.chdir(directory) 86 | 87 | # Add code that lets you handle errors 88 | try: 89 | yield 90 | # Ensure the directory is reset, 91 | # whether there was an error or not 92 | finally: 93 | os.chdir(current_dir) 94 | _______________________________________________________________________ -------------------------------------------------------------------------------- /12 - Writing functions in Python/Chapter 3 - Decorators.txt: -------------------------------------------------------------------------------- 1 | 1.Building a command line data app 2 | # Add the missing function references to the function map 3 | function_map = { 4 | 'mean': mean, 5 | 'std': std, 6 | 'minimum': minimum, 7 | 'maximum': maximum 8 | } 9 | 10 | data = load_data() 11 | print(data) 12 | 13 | func_name = get_user_input() 14 | 15 | # Call the chosen function and pass "data" as an argument 16 | function_map[func_name](data) 17 | ________________________________________________________________________ 18 | 2.Reviewing your co-worker's code 19 | # Call has_docstring() on the load_and_plot_data() function 20 | ok = has_docstring(load_and_plot_data) 21 | 22 | if not ok: 23 | print("load_and_plot_data() doesn't have a docstring!") 24 | else: 25 | print("load_and_plot_data() looks ok") 26 | /***************************************/ 27 | # Call has_docstring() on the as_2D() function 28 | ok = has_docstring(as_2D) 29 | 30 | if not ok: 31 | print("as_2D() doesn't have a docstring!") 32 | else: 33 | print("as_2D() looks ok") 34 | /**************************************/ 35 | # Call has_docstring() on the log_product() function 36 | ok = has_docstring(log_product) 37 | 38 | if not ok: 39 | print("log_product() doesn't have a docstring!") 40 | else: 41 | print("log_product() looks ok") 42 | ________________________________________________________________________ 43 | 3.Returning functions for a math game 44 | def create_math_function(func_name): 45 | if func_name == 'add': 46 | def add(a, b): 47 | return a + b 48 | return add 49 | elif func_name == 'subtract': 50 | # Define the subtract() function 51 | def subtract(a, b): 52 | return a - b 53 | return subtract 54 | else: 55 | print("I don't know that one") 56 | 57 | add = create_math_function('add') 58 | print('5 + 2 = {}'.format(add(5, 2))) 59 | 60 | subtract = create_math_function('subtract') 61 | print('5 - 2 = {}'.format(subtract(5, 2))) 62 | ________________________________________________________________________ 63 | 4.Modifying variables outside local scope 64 | def wait_until_done(): 65 | def check_is_done(): 66 | # Add a keyword so that wait_until_done() 67 | # doesn't run forever 68 | global done 69 | if random.random() < 0.1: 70 | done = True 71 | 72 | while not done: 73 | check_is_done() 74 | 75 | done = False 76 | wait_until_done() 77 | 78 | print('Work done? 
{}'.format(done)) 79 | ________________________________________________________________________ 80 | 5.Checking for closure 81 | def return_a_func(arg1, arg2): 82 | def new_func(): 83 | print('arg1 was {}'.format(arg1)) 84 | print('arg2 was {}'.format(arg2)) 85 | return new_func 86 | 87 | my_func = return_a_func(2, 17) 88 | 89 | print(my_func.__closure__ is not None) 90 | print(len(my_func.__closure__) == 2) 91 | 92 | # Get the values of the variables in the closure 93 | closure_values = [ 94 | my_func.__closure__[i].cell_contents for i in range(2) 95 | ] 96 | print(closure_values == [2, 17]) 97 | ________________________________________________________________________ 98 | 6.Closures keep your values safe 99 | def my_special_function(): 100 | print('You are running my_special_function()') 101 | 102 | def get_new_func(func): 103 | def call_func(): 104 | func() 105 | return call_func 106 | 107 | new_func = get_new_func(my_special_function) 108 | 109 | # Redefine my_special_function() to just print "hello" 110 | def my_special_function(): 111 | print("hello") 112 | 113 | new_func() 114 | /********************************/ 115 | def my_special_function(): 116 | print('You are running my_special_function()') 117 | 118 | def get_new_func(func): 119 | def call_func(): 120 | func() 121 | return call_func 122 | 123 | new_func = get_new_func(my_special_function) 124 | 125 | # Delete my_special_function() 126 | del(my_special_function) 127 | 128 | new_func() 129 | /********************************/ 130 | def my_special_function(): 131 | print('You are running my_special_function()') 132 | 133 | def get_new_func(func): 134 | def call_func(): 135 | func() 136 | return call_func 137 | 138 | # Overwrite `my_special_function` with the new function 139 | my_special_function = get_new_func(my_special_function) 140 | 141 | my_special_function() 142 | ________________________________________________________________________ 143 | 7.Using decorator syntax 144 | # Decorate my_function() with the print_args() decorator 145 | @print_args 146 | def my_function(a, b, c): 147 | print(a + b + c) 148 | 149 | my_function(1, 2, 3) 150 | ________________________________________________________________________ 151 | 8.Defining a decorator 152 | def print_before_and_after(func): 153 | def wrapper(*args): 154 | print('Before {}'.format(func.__name__)) 155 | # Call the function being decorated with *args 156 | func(*args) 157 | print('After {}'.format(func.__name__)) 158 | # Return the nested function 159 | return wrapper 160 | 161 | @print_before_and_after 162 | def multiply(a, b): 163 | print(a * b) 164 | 165 | multiply(5, 10) 166 | ________________________________________________________________________ -------------------------------------------------------------------------------- /12 - Writing functions in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 1 - Read clean and validate.txt: -------------------------------------------------------------------------------- 1 | 1.Exploring the NSFG data 2 | # Display the number of rows and columns 3 | nsfg.shape 4 | 5 | # Display the names of the columns 6 | nsfg.columns 7 | 8 | # Select column birthwgt_oz1: ounces 9 | ounces = nsfg['birthwgt_oz1'] 10 | 11 | # Print the first 5 elements of ounces 12 | print(ounces.head()) 13 | 
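# Added check (not from the original exercise): value_counts() exposes the
# special codes in this column (the NSFG codebook uses values like 98/99 for
# refused/don't know), which later cleaning steps replace with NaN:
# print(ounces.value_counts().sort_index())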
___________________________________________________________________ 14 | 2.Clean a variable 15 | # Replace the value 8 with NaN 16 | nsfg['nbrnaliv'].replace([8], np.nan, inplace=True) 17 | 18 | # Print the values and their frequencies 19 | print(nsfg['nbrnaliv'].value_counts()) 20 | ___________________________________________________________________ 21 | 3.Compute a variable 22 | # Select the columns and divide by 100 23 | agecon = nsfg['agecon'] / 100 24 | agepreg = nsfg['agepreg'] / 100 25 | 26 | # Compute the difference 27 | preg_length = agepreg - agecon 28 | 29 | # Compute summary statistics 30 | print(preg_length.describe()) 31 | ___________________________________________________________________ 32 | 4.Make a histogram 33 | # Plot the histogram 34 | plt.hist(agecon, bins=20, histtype='step') 35 | 36 | # Label the axes 37 | plt.xlabel('Age at conception') 38 | plt.ylabel('Number of pregnancies') 39 | 40 | # Show the figure 41 | plt.show() 42 | ___________________________________________________________________ 43 | 5.Compute birth weight 44 | # Create a Boolean Series for full-term babies 45 | full_term = nsfg['prglngth'] >= 37 46 | 47 | # Select the weights of full-term babies 48 | full_term_weight = birth_weight[full_term] 49 | 50 | # Compute the mean weight of full-term babies 51 | print(full_term_weight.mean()) 52 | ___________________________________________________________________ 53 | 6.Filter 54 | # Filter full-term babies 55 | full_term = nsfg['prglngth'] >= 37 56 | 57 | # Filter single births 58 | single = nsfg['nbrnaliv'] == 1 59 | 60 | # Compute birth weight for single full-term babies 61 | single_full_term_weight = birth_weight[single & full_term] 62 | print('Single full-term mean:', single_full_term_weight.mean()) 63 | 64 | # Compute birth weight for multiple full-term babies 65 | mult_full_term_weight = birth_weight[~single & full_term] 66 | print('Multiple full-term mean:', mult_full_term_weight.mean()) 67 | ___________________________________________________________________ -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 2 - Distributions.txt: -------------------------------------------------------------------------------- 1 | 1.Make a PMF 2 | # Compute the PMF for year 3 | pmf_year = Pmf(gss['year'], normalize=False) 4 | 5 | # Print the result 6 | print(pmf_year) 7 | ____________________________________________________ 8 | 2.Plot a PMF 9 | # Select the age column 10 | age = gss['age'] 11 | 12 | # Make a PMF of age 13 | pmf_age = Pmf(age) 14 | 15 | # Plot the PMF 16 | pmf_age.bar() 17 | 18 | # Label the axes 19 | plt.xlabel('Age') 20 | plt.ylabel('PMF') 21 | plt.show() 22 | 3.Make a CDF 23 | # Select the age column 24 | age = gss['age'] 25 | 26 | # Compute the CDF of age 27 | cdf_age = Cdf(age) 28 | 29 | # Calculate the CDF of 30 30 | print(cdf_age(30)) 31 | ____________________________________________________ 32 | 4.Compute IQR 33 | # Calculate the 75th percentile 34 | percentile_75th = cdf_income.inverse(0.75) 35 | 36 | # Calculate the 25th percentile 37 | percentile_25th = cdf_income.inverse(0.25) 38 | 39 | # Calculate the interquartile range 40 | iqr = percentile_75th - percentile_25th 41 | 42 | # Print the interquartile range 43 | print(iqr) 44 | 5.Plot a CDF 45 | # Select realinc 46 | income = gss['realinc'] 47 | 48 | # Make the CDF 49 | cdf_income = Cdf(income) 50 | 51 | # Plot it 52 | cdf_income.plot() 53 | 54 | # Label the axes 55 | plt.xlabel('Income (1986 USD)') 56 | 
plt.ylabel('CDF') 57 | plt.show() 58 | ____________________________________________________ 59 | 6.Extract education levels 60 | # Select educ 61 | educ = gss['educ'] 62 | 63 | # Bachelor's degree 64 | bach = (educ >= 16) 65 | 66 | # Associate degree 67 | assc = (educ >= 14) & (educ < 16) 68 | 69 | # High school 70 | high = (educ <= 12) 71 | print(high.mean()) 72 | ____________________________________________________ 73 | 7.Plot income CDFs 74 | income = gss['realinc'] 75 | 76 | # Plot the CDFs 77 | Cdf(income[high]).plot(label='High school') 78 | Cdf(income[assc]).plot(label='Associate') 79 | Cdf(income[bach]).plot(label='Bachelor') 80 | 81 | # Label the axes 82 | plt.xlabel('Income (1986 USD)') 83 | plt.ylabel('CDF') 84 | plt.legend() 85 | plt.show() 86 | ____________________________________________________ 87 | 8.Distribution of income 88 | # Extract realinc and compute its log 89 | income = gss['realinc'] 90 | log_income = np.log10(income) 91 | 92 | # Compute mean and standard deviation 93 | mean = log_income.mean() 94 | std = log_income.std() 95 | print(mean, std) 96 | 97 | # Make a norm object 98 | from scipy.stats import norm 99 | dist = norm(mean, std) 100 | _________________________________________________________________________ 101 | 9.Comparing CDFs 102 | # Evaluate the model CDF 103 | xs = np.linspace(2, 5.5) 104 | ys = dist.cdf(xs) 105 | 106 | # Plot the model CDF 107 | plt.clf() 108 | plt.plot(xs, ys, color='gray') 109 | 110 | # Create and plot the Cdf of log_income 111 | Cdf(log_income).plot() 112 | 113 | # Label the axes 114 | plt.xlabel('log10 of realinc') 115 | plt.ylabel('CDF') 116 | plt.show() 117 | ____________________________________________________ 118 | 10.Comparing PDFs 119 | # Evaluate the normal PDF 120 | xs = np.linspace(2, 5.5) 121 | ys = dist.pdf(xs) 122 | 123 | # Plot the model PDF 124 | plt.clf() 125 | plt.plot(xs, ys, color='gray') 126 | 127 | # Plot the data KDE 128 | sns.kdeplot(log_income) 129 | 130 | # Label the axes 131 | plt.xlabel('log10 of realinc') 132 | plt.ylabel('PDF') 133 | plt.show() 134 | ____________________________________________________ -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 3 - Relationships.txt: -------------------------------------------------------------------------------- 1 | 1.PMF of age 2 | # Extract AGE 3 | age = brfss['AGE'] 4 | 5 | # Plot the PMF 6 | Pmf(age).bar() 7 | 8 | # Label the axes 9 | plt.xlabel('Age in years') 10 | plt.ylabel('PMF') 11 | plt.show() 12 | ______________________________________________________ 13 | 2.Scatter plot 14 | # Select the first 1000 respondents 15 | brfss = brfss[:1000] 16 | 17 | # Extract age and weight 18 | age = brfss['AGE'] 19 | weight = brfss['WTKG3'] 20 | 21 | # Make a scatter plot 22 | plt.plot(age, weight, 'o', alpha=0.1) 23 | 24 | plt.xlabel('Age in years') 25 | plt.ylabel('Weight in kg') 26 | 27 | plt.show() 28 | ______________________________________________________ 29 | 3.Jittering 30 | # Select the first 1000 respondents 31 | brfss = brfss[:1000] 32 | 33 | # Add jittering to age 34 | age = brfss['AGE'] + np.random.normal(0, 2.5, size=len(brfss)) 35 | # Extract weight 36 | weight = brfss['WTKG3'] 37 | 38 | # Make a scatter plot 39 | plt.plot(age, weight, 'o', markersize=5, alpha=0.2) 40 | 41 | plt.xlabel('Age in years') 42 | plt.ylabel('Weight in kg') 43 | plt.show() 44 | ______________________________________________________ 45 | 4.Height and weight 46 | # Drop rows with missing data 47 | 
data = brfss.dropna(subset=['_HTMG10', 'WTKG3']) 48 | 49 | # Make a box plot 50 | sns.boxplot(x='_HTMG10', y='WTKG3', data=data, whis=10) 51 | 52 | # Plot the y-axis on a log scale 53 | plt.yscale('log') 54 | 55 | # Remove unneeded lines and label axes 56 | sns.despine(left=True, bottom=True) 57 | plt.xlabel('Height in cm') 58 | plt.ylabel('Weight in kg') 59 | plt.show() 60 | ______________________________________________________ 61 | 5.Distribution of income 62 | # Extract income 63 | income = brfss['INCOME2'] 64 | 65 | # Plot the PMF 66 | Pmf(income).bar() 67 | 68 | # Label the axes 69 | plt.xlabel('Income level') 70 | plt.ylabel('PMF') 71 | plt.show() 72 | ______________________________________________________ 73 | 6.Income and height 74 | # Drop rows with missing data 75 | data = brfss.dropna(subset=['INCOME2', 'HTM4']) 76 | 77 | # Make a violin plot 78 | sns.violinplot(x='INCOME2', y='HTM4', data=data, inner=None) 79 | 80 | # Remove unneeded lines and label axes 81 | sns.despine(left=True, bottom=True) 82 | plt.xlabel('Income level') 83 | plt.ylabel('Height in cm') 84 | plt.show() 85 | ______________________________________________________ 86 | 7.Computing correlations 87 | # Select columns 88 | columns = ['AGE', 'INCOME2', '_VEGESU1'] 89 | subset = brfss[columns] 90 | 91 | # Compute the correlation matrix 92 | print(subset.corr()) 93 | ______________________________________________________ 94 | 8.Income and vegetables 95 | from scipy.stats import linregress 96 | 97 | # Extract the variables 98 | subset = brfss.dropna(subset=['INCOME2', '_VEGESU1']) 99 | xs = subset['INCOME2'] 100 | ys = subset['_VEGESU1'] 101 | 102 | # Compute the linear regression 103 | res = linregress(xs, ys) 104 | print(res) 105 | ______________________________________________________ 106 | 9.Fit a line 107 | # Plot the scatter plot 108 | plt.clf() 109 | x_jitter = xs + np.random.normal(0, 0.15, len(xs)) 110 | plt.plot(x_jitter, ys, 'o', alpha=0.2) 111 | 112 | # Plot the line of best fit 113 | fx = np.array([xs.min(), xs.max()]) 114 | fy = res.intercept + res.slope * fx 115 | plt.plot(fx, fy, '-', alpha=0.7) 116 | 117 | plt.xlabel('Income code') 118 | plt.ylabel('Vegetable servings per day') 119 | plt.ylim([0, 6]) 120 | plt.show() 121 | ______________________________________________________ -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 4 - Multivariate Thinking.txt: -------------------------------------------------------------------------------- 1 | 1.Using StatsModels 2 | from scipy.stats import linregress 3 | import statsmodels.formula.api as smf 4 | 5 | # Run regression with linregress 6 | subset = brfss.dropna(subset=['INCOME2', '_VEGESU1']) 7 | xs = subset['INCOME2'] 8 | ys = subset['_VEGESU1'] 9 | res = linregress(xs, ys) 10 | print(res) 11 | 12 | # Run regression with StatsModels 13 | results = smf.ols('_VEGESU1 ~ INCOME2', data=brfss).fit() 14 | print(results.params) 15 | ________________________________________________________________ 16 | 2.Plot income and education 17 | # Group by educ 18 | grouped = gss.groupby('educ') 19 | 20 | # Compute mean income in each group 21 | mean_income_by_educ = grouped['realinc'].mean() 22 | 23 | # Plot mean income as a scatter plot 24 | plt.clf() 25 | plt.plot(mean_income_by_educ, 'o', alpha=0.5) 26 | 27 | # Label the axes 28 | plt.xlabel('Education (years)') 29 | plt.ylabel('Income (1986 $)') 30 | plt.show() 31 | ________________________________________________________________ 32 | 
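(Editor's note) The regression formula in the next exercise also uses an 'age2' column, which is assumed to already exist in gss; if it does not, create it the same way 'educ2' is created:
gss['age2'] = gss['age']**2
________________________________________________________________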
3.Non-linear model of education 33 | import statsmodels.formula.api as smf 34 | 35 | # Add a new column with educ squared 36 | gss['educ2'] = gss['educ']**2 37 | 38 | # Run a regression model with educ, educ2, age, and age2 39 | results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit() 40 | 41 | # Print the estimated parameters 42 | print(results.params) 43 | ________________________________________________________________
44 | 4.Making predictions 45 | # Run a regression model with educ, educ2, age, and age2 46 | results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit() 47 | 48 | # Make the DataFrame 49 | df = pd.DataFrame() 50 | df['educ'] = np.linspace(0, 20) 51 | df['age'] = 30 52 | df['educ2'] = df['educ']**2 53 | df['age2'] = df['age']**2 54 | 55 | # Generate the predictions (plotted in the next exercise) 56 | pred = results.predict(df) 57 | print(pred.head()) 58 | ________________________________________________________________
59 | 5.Visualizing predictions 60 | # Plot mean income in each education group 61 | plt.clf() 62 | grouped = gss.groupby('educ') 63 | mean_income_by_educ = grouped['realinc'].mean() 64 | plt.plot(mean_income_by_educ, 'o', alpha=0.5) 65 | 66 | # Plot the predictions 67 | pred = results.predict(df) 68 | plt.plot(df['educ'], pred, label='Age 30') 69 | 70 | # Label axes 71 | plt.xlabel('Education (years)') 72 | plt.ylabel('Income (1986 $)') 73 | plt.legend() 74 | plt.show() 75 | ________________________________________________________________
76 | 6.Predicting a binary variable 77 | # Recode grass 78 | gss['grass'].replace(2, 0, inplace=True) 79 | 80 | # Run logistic regression 81 | results = smf.logit('grass ~ age + age2 + educ + educ2 + C(sex)', data=gss).fit() 82 | results.params 83 | 84 | # Make a DataFrame with a range of ages 85 | df = pd.DataFrame() 86 | df['age'] = np.linspace(18, 89) 87 | df['age2'] = df['age']**2 88 | 89 | # Set the education level to 12 90 | df['educ'] = 12 91 | df['educ2'] = df['educ']**2 92 | 93 | # Generate predictions for men and women 94 | df['sex'] = 1 95 | pred1 = results.predict(df) 96 | 97 | df['sex'] = 2 98 | pred2 = results.predict(df) 99 | 100 | plt.clf() 101 | grouped = gss.groupby('age') 102 | favor_by_age = grouped['grass'].mean() 103 | plt.plot(favor_by_age, 'o', alpha=0.5) 104 | 105 | plt.plot(df['age'], pred1, label='Male') 106 | plt.plot(df['age'], pred2, label='Female') 107 | 108 | plt.xlabel('Age') 109 | plt.ylabel('Probability of favoring legalization') 110 | plt.legend() 111 | plt.show() 112 | ________________________________________________________________
-------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/key points: -------------------------------------------------------------------------------- 1 | 2 |
-------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 1 - preparing data for analysis.txt: --------------------------------------------------------------------------------
1 | 1.Examining the dataset 2 | # Import the pandas library as pd 3 | import pandas as pd 4 | 5 | # Read 'police.csv' into a DataFrame named ri 6 | ri = pd.read_csv('police.csv') 7 | 8 | # Examine the head of the DataFrame 9 | print(ri.head()) 10 | 11 | # Count the number of missing values in each column 12 | print(ri.isnull().sum()) 13 | ______________________________________________________________________
14 | 2.Dropping columns 15 | # Examine the shape of the DataFrame 16 |
print(ri.shape) 17 | 18 | # Drop the 'county_name' and 'state' columns 19 | ri.drop(['county_name', 'state'], axis='columns', inplace=True) 20 | 21 | # Examine the shape of the DataFrame (again) 22 | print(ri.shape) 23 | ______________________________________________________________________ 24 | 3.Dropping rows 25 | # Count the number of missing values in each column 26 | print(ri.isnull().sum()) 27 | 28 | # Drop all rows that are missing 'driver_gender' 29 | ri.dropna(subset=['driver_gender'], inplace=True) 30 | 31 | # Count the number of missing values in each column (again) 32 | print(ri.isnull().sum()) 33 | 34 | # Examine the shape of the DataFrame 35 | print(ri.shape) 36 | ______________________________________________________________________ 37 | 4.Fixing a data type 38 | # Examine the head of the 'is_arrested' column 39 | print(ri.is_arrested.head()) 40 | 41 | # Change the data type of 'is_arrested' to 'bool' 42 | ri['is_arrested'] = ri.is_arrested.astype('bool') 43 | 44 | # Check the data type of 'is_arrested' 45 | print(ri.is_arrested.dtype) 46 | ______________________________________________________________________ 47 | 5.Combining object columns 48 | # Concatenate 'stop_date' and 'stop_time' (separated by a space) 49 | combined = ri.stop_date.str.cat(ri.stop_time, sep=' ') 50 | 51 | # Convert 'combined' to datetime format 52 | ri['stop_datetime'] = pd.to_datetime(combined) 53 | 54 | # Examine the data types of the DataFrame 55 | print(ri.dtypes) 56 | ______________________________________________________________________ 57 | 6.Setting the index 58 | # Set 'stop_datetime' as the index 59 | ri.set_index('stop_datetime', inplace=True) 60 | 61 | # Examine the index 62 | print(ri.index) 63 | 64 | # Examine the columns 65 | print(ri.columns) 66 | ______________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 2 - Exploring the Relationship between gender and policing.txt: -------------------------------------------------------------------------------- 1 | 1.Examining traffic violations 2 | # Count the unique values in 'violation' 3 | print(ri.violation.value_counts()) 4 | 5 | # Express the counts as proportions 6 | print(ri.violation.value_counts(normalize=True)) 7 | _______________________________________________________________________ 8 | 2.Comparing violations by gender 9 | # Create a DataFrame of female drivers 10 | female = ri[ri.driver_gender == 'F'] 11 | 12 | # Create a DataFrame of male drivers 13 | male = ri[ri.driver_gender == 'M'] 14 | 15 | # Compute the violations by female drivers (as proportions) 16 | print(female.violation.value_counts(normalize=True)) 17 | 18 | # Compute the violations by male drivers (as proportions) 19 | print(male.violation.value_counts(normalize=True)) 20 | _______________________________________________________________________ 21 | 3.Comparing speeding outcomes by gender 22 | # Create a DataFrame of female drivers stopped for speeding 23 | female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')] 24 | 25 | # Create a DataFrame of male drivers stopped for speeding 26 | male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')] 27 | 28 | # Compute the stop outcomes for female drivers (as proportions) 29 | print(female_and_speeding.stop_outcome.value_counts(normalize=True)) 30 | 31 | # Compute the stop outcomes for male drivers (as proportions) 32 | 
print(male_and_speeding.stop_outcome.value_counts(normalize=True)) 33 | _______________________________________________________________________ 34 | 4.Calculating the search rate 35 | # Check the data type of 'search_conducted' 36 | print(ri.search_conducted.dtype) 37 | 38 | # Calculate the search rate by counting the values 39 | print(ri.search_conducted.value_counts(normalize=True)) 40 | 41 | # Calculate the search rate by taking the mean 42 | print(ri.search_conducted.mean()) 43 | _______________________________________________________________________ 44 | 5.Comparing search rates by gender 45 | # Calculate the search rate for both groups simultaneously 46 | print(ri.groupby('driver_gender').search_conducted.mean()) 47 | _______________________________________________________________________ 48 | 6.Adding a second factor to the analysis 49 | # Reverse the ordering to group by violation before gender 50 | print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean()) 51 | _______________________________________________________________________ 52 | 7.Counting protective frisks 53 | # Count the 'search_type' values 54 | print(ri.search_type.value_counts()) 55 | 56 | # Check if 'search_type' contains the string 'Protective Frisk' 57 | ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False) 58 | 59 | # Check the data type of 'frisk' 60 | print(ri.frisk.dtype) 61 | 62 | # Take the sum of 'frisk' 63 | print(ri.frisk.sum()) 64 | _______________________________________________________________________ 65 | 8.Comparing frisk rates by gender 66 | # Create a DataFrame of stops in which a search was conducted 67 | searched = ri[ri.search_conducted == True] 68 | 69 | # Calculate the overall frisk rate by taking the mean of 'frisk' 70 | print(searched.frisk.mean()) 71 | 72 | # Calculate the frisk rate for each gender 73 | print(searched.groupby('driver_gender').frisk.mean()) 74 | _______________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 3 - Visual Exploratory data analysis.txt: -------------------------------------------------------------------------------- 1 | 1.Calculating the hourly arrest rate 2 | # Calculate the overall arrest rate 3 | print(ri.is_arrested.mean()) 4 | 5 | # Calculate the hourly arrest rate 6 | print(ri.groupby(ri.index.hour).is_arrested.mean()) 7 | 8 | # Save the hourly arrest rate 9 | hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean() 10 | __________________________________________________________________________ 11 | 2.Plotting the hourly arrest rate 12 | # Import matplotlib.pyplot as plt 13 | import matplotlib.pyplot as plt 14 | 15 | # Create a line plot of 'hourly_arrest_rate' 16 | hourly_arrest_rate.plot() 17 | 18 | # Add the xlabel, ylabel, and title 19 | plt.xlabel('Hour') 20 | plt.ylabel('Arrest Rate') 21 | plt.title('Arrest Rate by Time of Day') 22 | 23 | # Display the plot 24 | plt.show() 25 | __________________________________________________________________________ 26 | 3.Plotting drug-related stops 27 | # Calculate the annual rate of drug-related stops 28 | print(ri.drugs_related_stop.resample('A').mean()) 29 | 30 | # Save the annual rate of drug-related stops 31 | annual_drug_rate = ri.drugs_related_stop.resample('A').mean() 32 | 33 | # Create a line plot of 'annual_drug_rate' 34 | annual_drug_rate.plot() 35 | 36 | # Display the plot 37 | plt.show() 38 | 
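# Added note: the 'A' (annual) resampling alias used above is deprecated in
# recent pandas (2.2+); the modern equivalent is 'YE' (year-end), e.g.
# annual_drug_rate = ri.drugs_related_stop.resample('YE').mean()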
__________________________________________________________________________ 39 | 4.Comparing drug and search rates 40 | # Calculate and save the annual search rate 41 | annual_search_rate = ri.search_conducted.resample('A').mean() 42 | 43 | # Concatenate 'annual_drug_rate' and 'annual_search_rate' 44 | annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns') 45 | 46 | # Create subplots from 'annual' 47 | annual.plot(subplots=True) 48 | 49 | # Display the subplots 50 | plt.show() 51 | __________________________________________________________________________ 52 | 5.Tallying violations by district 53 | # Create a frequency table of districts and violations 54 | print(pd.crosstab(ri.district, ri.violation)) 55 | 56 | # Save the frequency table as 'all_zones' 57 | all_zones = pd.crosstab(ri.district, ri.violation) 58 | 59 | # Select rows 'Zone K1' through 'Zone K3' 60 | print(all_zones.loc['Zone K1':'Zone K3']) 61 | 62 | # Save the smaller table as 'k_zones' 63 | k_zones = all_zones.loc['Zone K1':'Zone K3'] 64 | __________________________________________________________________________ 65 | 6.Plotting violations by district 66 | # Create a stacked bar plot of 'k_zones' 67 | k_zones.plot(kind='bar', stacked=True) 68 | 69 | # Display the plot 70 | plt.show() 71 | __________________________________________________________________________ 72 | 7.Converting stop durations to numbers 73 | # Print the unique values in 'stop_duration' 74 | print(ri.stop_duration.unique()) 75 | 76 | # Create a dictionary that maps strings to integers 77 | mapping = {'0-15 Min':8, '16-30 Min':23, '30+ Min':45} 78 | 79 | # Convert the 'stop_duration' strings to integers using the 'mapping' 80 | ri['stop_minutes'] = ri.stop_duration.map(mapping) 81 | 82 | # Print the unique values in 'stop_minutes' 83 | print(ri.stop_minutes.unique()) 84 | __________________________________________________________________________ 85 | 8.Plotting stop length 86 | # Calculate the mean 'stop_minutes' for each value in 'violation_raw' 87 | print(ri.groupby('violation_raw').stop_minutes.mean()) 88 | 89 | # Save the resulting Series as 'stop_length' 90 | stop_length = ri.groupby('violation_raw').stop_minutes.mean() 91 | 92 | # Sort 'stop_length' by its values and create a horizontal bar plot 93 | stop_length.sort_values().plot(kind='barh') 94 | 95 | # Display the plot 96 | plt.show() 97 | __________________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 4 - Analyzing the effect of weather on policing.txt: -------------------------------------------------------------------------------- 1 | 1.Plotting the temperature 2 | # Read 'weather.csv' into a DataFrame named 'weather' 3 | weather = pd.read_csv('weather.csv') 4 | 5 | # Describe the temperature columns 6 | print(weather[['TMIN', 'TAVG', 'TMAX']].describe()) 7 | 8 | # Create a box plot of the temperature columns 9 | weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box') 10 | 11 | # Display the plot 12 | plt.show() 13 | ___________________________________________________________________________ 14 | 2.Plotting the temperature difference 15 | # Create a 'TDIFF' column that represents temperature difference 16 | weather['TDIFF'] = weather.TMAX - weather.TMIN 17 | 18 | # Describe the 'TDIFF' column 19 | print(weather.TDIFF.describe()) 20 | 21 | # Create a histogram with 20 bins to visualize 'TDIFF' 22 | weather.TDIFF.plot(kind='hist', bins=20) 
23 | 24 | # Display the plot 25 | plt.show() 26 | ___________________________________________________________________________
27 | 3.Counting bad weather conditions 28 | # Copy 'WT01' through 'WT22' to a new DataFrame 29 | WT = weather.loc[:, 'WT01':'WT22'] 30 | 31 | # Calculate the sum of each row in 'WT' 32 | weather['bad_conditions'] = WT.sum(axis='columns') 33 | 34 | # Replace missing values in 'bad_conditions' with '0' 35 | weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int') 36 | 37 | # Create a histogram to visualize 'bad_conditions' 38 | weather.bad_conditions.plot(kind='hist') 39 | 40 | # Display the plot 41 | plt.show() 42 | ___________________________________________________________________________
43 | 4.Rating the weather conditions 44 | # Count the unique values in 'bad_conditions' and sort the index 45 | print(weather.bad_conditions.value_counts().sort_index()) 46 | 47 | # Create a dictionary that maps integers to strings 48 | mapping = {0:'good', 1:'bad', 2:'bad', 3:'bad', 4:'bad', 5:'worse', 6:'worse', 7:'worse', 8:'worse', 9:'worse'} 49 | 50 | # Convert the 'bad_conditions' integers to strings using the 'mapping' 51 | weather['rating'] = weather.bad_conditions.map(mapping) 52 | 53 | # Count the unique values in 'rating' 54 | print(weather.rating.value_counts()) 55 | ___________________________________________________________________________
56 | 5.Changing the data type to category 57 | # Create a list of weather ratings in logical order 58 | cats = ['good', 'bad', 'worse'] 59 | 60 | # Change the data type of 'rating' to an ordered category 61 | weather['rating'] = weather.rating.astype(pd.CategoricalDtype(categories=cats, ordered=True))  # updated: the astype('category', ordered=True, categories=cats) signature was removed in newer pandas 62 | 63 | # Examine the head of 'rating' 64 | print(weather.rating.head()) 65 | ___________________________________________________________________________
66 | 6.Preparing the DataFrames 67 | # Reset the index of 'ri' 68 | ri.reset_index(inplace=True) 69 | 70 | # Examine the head of 'ri' 71 | print(ri.head()) 72 | 73 | # Create a DataFrame from the 'DATE' and 'rating' columns 74 | weather_rating = weather[['DATE', 'rating']] 75 | 76 | # Examine the head of 'weather_rating' 77 | print(weather_rating.head()) 78 | ___________________________________________________________________________
79 | 7.Merging the DataFrames 80 | # Examine the shape of 'ri' 81 | print(ri.shape) 82 | 83 | # Merge 'ri' and 'weather_rating' using a left join 84 | ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left') 85 | 86 | # Examine the shape of 'ri_weather' 87 | print(ri_weather.shape) 88 | 89 | # Set 'stop_datetime' as the index of 'ri_weather' 90 | ri_weather.set_index('stop_datetime', inplace=True) 91 | ___________________________________________________________________________
92 | 8.Comparing arrest rates by weather rating 93 | # Calculate the arrest rate for each 'violation' and 'rating' 94 | print(ri_weather.groupby(['violation', 'rating']).is_arrested.mean()) 95 | ___________________________________________________________________________
96 | 9.Selecting from a multi-indexed Series 97 | # Save the output of the groupby operation from the last exercise 98 | arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean() 99 | 100 | # Print the 'arrest_rate' Series 101 | print(arrest_rate) 102 | 103 | # Print the arrest rate for moving violations in bad weather 104 | print(arrest_rate.loc['Moving violation', 'bad']) 105 | 106 | # Print the arrest rates for speeding violations in all three
weather conditions 107 | print(arrest_rate.loc['Speeding']) 108 | ___________________________________________________________________________ 109 | 10.Reshaping the arrest rate data 110 | # Unstack the 'arrest_rate' Series into a DataFrame 111 | print(arrest_rate.unstack()) 112 | 113 | # Create the same DataFrame using a pivot table 114 | print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested')) 115 | ___________________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 1 - Graphical Exploratory Data Analysis .txt: -------------------------------------------------------------------------------- 1 | 1.Plotting a histogram of iris data 2 | # Import plotting modules 3 | import matplotlib.pyplot as plt 4 | import seaborn as sns 5 | 6 | # Set default Seaborn style 7 | sns.set() 8 | 9 | # Plot histogram of versicolor petal lengths 10 | _ = plt.hist(versicolor_petal_length) 11 | 12 | # Show histogram 13 | plt.show() 14 | __________________________________________________________________________ 15 | 2.Axis labels! 16 | # Plot histogram of versicolor petal lengths 17 | _ = plt.hist(versicolor_petal_length) 18 | 19 | # Label axes 20 | _ = plt.xlabel('petal length (cm)') 21 | _ = plt.ylabel('count') 22 | 23 | # Show histogram 24 | plt.show() 25 | __________________________________________________________________________ 26 | 3.Adjusting the number of bins in a histogram 27 | # Import numpy 28 | import numpy as np 29 | 30 | # Compute number of data points: n_data 31 | n_data = len(versicolor_petal_length) 32 | 33 | # Number of bins is the square root of number of data points: n_bins 34 | n_bins = np.sqrt(n_data) 35 | 36 | # Convert number of bins to integer: n_bins 37 | n_bins = int(n_bins) 38 | 39 | # Plot the histogram 40 | _ = plt.hist(versicolor_petal_length, bins=n_bins) 41 | 42 | # Label axes 43 | _ = plt.xlabel('petal length (cm)') 44 | _ = plt.ylabel('count') 45 | 46 | # Show histogram 47 | plt.show() 48 | __________________________________________________________________________ 49 | 4.Bee swarm plot 50 | # Create bee swarm plot with Seaborn's default settings 51 | _ = sns.swarmplot(x='species', y='petal length (cm)', data=df) 52 | 53 | # Label the axes 54 | _ = plt.xlabel('species') 55 | _ = plt.ylabel('petal length (cm)') 56 | 57 | # Show the plot 58 | plt.show() 59 | __________________________________________________________________________ 60 | 5.Computing the ECDF 61 | def ecdf(data): 62 | """Compute ECDF for a one-dimensional array of measurements.""" 63 | # Number of data points: n 64 | n = len(data) 65 | 66 | # x-data for the ECDF: x 67 | x = np.sort(data) 68 | 69 | # y-data for the ECDF: y 70 | y = np.arange(1, n+1) / n 71 | 72 | return x, y 73 | __________________________________________________________________________ 74 | 6.Plotting the ECDF 75 | # Compute ECDF for versicolor data: x_vers, y_vers 76 | x_vers, y_vers = ecdf(versicolor_petal_length) 77 | 78 | # Generate plot 79 | _ = plt.plot(x_vers, y_vers, marker='.', linestyle='none') 80 | 81 | # Label the axes 82 | _ = plt.xlabel('petal length (cm)') 83 | _ = plt.ylabel('ECDF') 84 | 85 | # Display the plot 86 | plt.show() 87 | 
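# Added sanity check for the ecdf() helper above (assumes numpy imported as np):
# ecdf(np.array([3, 1, 2])) returns (array([1, 2, 3]), array([0.333..., 0.666..., 1.0])),
# i.e. each sorted value paired with the fraction of observations <= it.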
__________________________________________________________________________ 88 | 7.Comparison of ECDFs 89 | # Compute ECDFs 90 | x_set, y_set = ecdf(setosa_petal_length) 91 | x_vers, y_vers = ecdf(versicolor_petal_length) 92 | x_virg, y_virg = ecdf(virginica_petal_length) 93 | 94 | # Plot all ECDFs on the same plot 95 | _ = plt.plot(x_set, y_set, marker='.', linestyle='none') 96 | _ = plt.plot(x_vers, y_vers, marker='.', linestyle='none') 97 | _ = plt.plot(x_virg, y_virg, marker='.', linestyle='none') 98 | 99 | # Annotate the plot 100 | _ = plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right') 101 | _ = plt.xlabel('petal length (cm)') 102 | _ = plt.ylabel('ECDF') 103 | 104 | # Display the plot 105 | plt.show() 106 | __________________________________________________________________________ 107 | -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 2 - Quantitative Exploratory Data Analysis.txt: -------------------------------------------------------------------------------- 1 | 1.Computing means 2 | # Compute the mean 3 | mean_length_vers = np.mean(versicolor_petal_length) 4 | 5 | # Print the results with some nice formatting 6 | print('I. versicolor:', mean_length_vers, 'cm') 7 | ___________________________________________________________________________ 8 | 2.Computing percentiles 9 | # Specify array of percentiles: percentiles 10 | percentiles = np.array([2.5, 25, 50, 75, 97.5]) 11 | 12 | # Compute percentiles: ptiles_vers 13 | ptiles_vers = np.percentile(versicolor_petal_length, percentiles) 14 | 15 | # Print the result 16 | print(ptiles_vers) 17 | ___________________________________________________________________________ 18 | 3.Comparing percentiles to ECDF 19 | # Plot the ECDF 20 | _ = plt.plot(x_vers, y_vers, '.') 21 | _ = plt.xlabel('petal length (cm)') 22 | _ = plt.ylabel('ECDF') 23 | 24 | # Overlay percentiles as red x's 25 | _ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red', 26 | linestyle='none') 27 | 28 | # Show the plot 29 | plt.show() 30 | ___________________________________________________________________________ 31 | 4.Box-and-whisker plot 32 | # Create box plot with Seaborn's default settings 33 | _ = sns.boxplot(x='species', y='petal length (cm)', data=df) 34 | 35 | # Label the axes 36 | _ = plt.xlabel('species') 37 | _ = plt.ylabel('petal length (cm)') 38 | 39 | # Show the plot 40 | plt.show() 41 | ___________________________________________________________________________ 42 | 5.Computing the variance 43 | # Array of differences to mean: differences 44 | differences = versicolor_petal_length - np.mean(versicolor_petal_length) 45 | 46 | # Square the differences: diff_sq 47 | diff_sq = differences**2 48 | 49 | # Compute the mean square difference: variance_explicit 50 | variance_explicit = np.mean(diff_sq) 51 | 52 | # Compute the variance using NumPy: variance_np 53 | variance_np = np.var(versicolor_petal_length) 54 | 55 | # Print the results 56 | print(variance_explicit, variance_np) 57 | ___________________________________________________________________________ 58 | 6.The standard deviation and the variance 59 | # Compute the variance: variance 60 | variance = np.var(versicolor_petal_length) 61 | 62 | # Print the square root of the variance 63 | print(np.sqrt(variance)) 64 | 65 | # Print the standard deviation 66 | print(np.std(versicolor_petal_length)) 67 | ___________________________________________________________________________ 68 | 7.Scatter plots 69 
| # Make a scatter plot 70 | _ = plt.plot(versicolor_petal_length, versicolor_petal_width, 71 | marker='.', linestyle='none') 72 | 73 | # Label the axes 74 | _ = plt.xlabel('petal length (cm)') 75 | _ = plt.ylabel('petal width (cm)') 76 | 77 | # Show the result 78 | plt.show() 79 | ___________________________________________________________________________ 80 | 8.Computing the covariance 81 | # Compute the covariance matrix: covariance_matrix 82 | covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width) 83 | 84 | # Print covariance matrix 85 | print(covariance_matrix) 86 | 87 | # Extract covariance of length and width of petals: petal_cov 88 | petal_cov = covariance_matrix[0,1] 89 | 90 | # Print the length/width covariance 91 | print(petal_cov) 92 | ___________________________________________________________________________ 93 | 9.Computing the Pearson correlation coefficient 94 | def pearson_r(x, y): 95 | """Compute Pearson correlation coefficient between two arrays.""" 96 | # Compute correlation matrix: corr_mat 97 | corr_mat = np.corrcoef(x, y) 98 | 99 | # Return entry [0,1] 100 | return corr_mat[0,1] 101 | 102 | # Compute Pearson correlation coefficient for I. versicolor 103 | r = pearson_r(versicolor_petal_width, versicolor_petal_length) 104 | 105 | # Print the result 106 | print(r) 107 | ___________________________________________________________________________ -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 3 - Thinking probabilistically discrete variables.txt: -------------------------------------------------------------------------------- 1 | 1.Generating random numbers using the np.random module 2 | # Seed the random number generator 3 | np.random.seed(42) 4 | 5 | # Initialize random numbers: random_numbers 6 | random_numbers = np.empty(100000) 7 | 8 | # Generate random numbers by looping over range(100000) 9 | for i in range(100000): 10 | random_numbers[i] = np.random.random() 11 | 12 | # Plot a histogram 13 | _ = plt.hist(random_numbers) 14 | 15 | # Show the plot 16 | plt.show() 17 | ______________________________________________________________________ 18 | 2.The np.random module and Bernoulli trials 19 | def perform_bernoulli_trials(n, p): 20 | """Perform n Bernoulli trials with success probability p 21 | and return number of successes.""" 22 | # Initialize number of successes: n_success 23 | n_success = 0 24 | 25 | # Perform trials 26 | for i in range(n): 27 | # Choose random number between zero and one: random_number 28 | random_number = np.random.random() 29 | 30 | # If less than p, it's a success so add one to n_success 31 | if random_number < p: 32 | n_success += 1 33 | 34 | return n_success 35 | ______________________________________________________________________ 36 | 3.How many defaults might we expect? 37 | # Seed random number generator 38 | np.random.seed(42) 39 | 40 | # Initialize the number of defaults: n_defaults 41 | n_defaults = np.empty(1000) 42 | 43 | # Compute the number of defaults 44 | for i in range(1000): 45 | n_defaults[i] = perform_bernoulli_trials(100, 0.05) 46 | 47 | # Plot the histogram with default number of bins; label your axes 48 | _ = plt.hist(n_defaults, normed=True) 49 | _ = plt.xlabel('number of defaults out of 100 loans') 50 | _ = plt.ylabel('probability') 51 | 52 | # Show the plot 53 | plt.show() 54 | ______________________________________________________________________ 55 | 4.Will the bank fail? 
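# Editor's note (added): several histograms in this chapter pass normed=True,
# which was removed in Matplotlib 3.x; substitute density=True, e.g.
# _ = plt.hist(n_defaults, density=True)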
56 | # Compute ECDF: x, y 57 | x, y = ecdf(n_defaults) 58 | 59 | # Plot the CDF with labeled axes 60 | _ = plt.plot(x, y, marker='.', linestyle='none') 61 | _ = plt.xlabel('number of defaults out of 100') 62 | _ = plt.ylabel('CDF') 63 | 64 | # Show the plot 65 | plt.show() 66 | 67 | # Compute the number of 100-loan simulations with 10 or more defaults: n_lose_money 68 | n_lose_money = np.sum(n_defaults >= 10) 69 | 70 | # Compute and print probability of losing money 71 | print('Probability of losing money =', n_lose_money / len(n_defaults)) 72 | ______________________________________________________________________ 73 | 5.Sampling out of the Binomial distribution 74 | # Take 10,000 samples out of the binomial distribution: n_defaults 75 | n_defaults = np.random.binomial(n=100, p=0.05, size=10000) 76 | 77 | # Compute CDF: x, y 78 | x, y = ecdf(n_defaults) 79 | 80 | # Plot the CDF with axis labels 81 | _ = plt.plot(x, y, marker='.', linestyle='none') 82 | _ = plt.xlabel('number of defaults out of 100 loans') 83 | _ = plt.ylabel('CDF') 84 | 85 | # Show the plot 86 | plt.show() 87 | ______________________________________________________________________ 88 | 6.Plotting the Binomial PMF 89 | # Compute bin edges: bins 90 | bins = np.arange(0, max(n_defaults) + 1.5) - 0.5 91 | 92 | # Generate histogram 93 | _ = plt.hist(n_defaults, normed=True, bins=bins) 94 | 95 | # Label axes 96 | _ = plt.xlabel('number of defaults out of 100 loans') 97 | _ = plt.ylabel('PMF') 98 | 99 | # Show the plot 100 | plt.show() 101 | ______________________________________________________________________ 102 | 7.Relationship between Binomial and Poisson distributions 103 | # Draw 10,000 samples out of Poisson distribution: samples_poisson 104 | samples_poisson = np.random.poisson(10, size=10000) 105 | 106 | # Print the mean and standard deviation 107 | print('Poisson: ', np.mean(samples_poisson), 108 | np.std(samples_poisson)) 109 | 110 | # Specify values of n and p to consider for Binomial: n, p 111 | n = [20, 100, 1000] 112 | p = [0.5, 0.1, 0.01] 113 | 114 | # Draw 10,000 samples for each n,p pair: samples_binomial 115 | for i in range(3): 116 | samples_binomial = np.random.binomial(n[i], p[i], size=10000) 117 | 118 | # Print results 119 | print('n =', n[i], 'Binom:', np.mean(samples_binomial), 120 | np.std(samples_binomial)) 121 | ______________________________________________________________________ 122 | 8.Was 2015 anomalous? 
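# Editor's note (added): 251/115 below is, per the course's dataset, the
# historical rate of no-hitters per season (251 no-hitters over 115 seasons);
# the simulation asks how often seven or more would occur in one season by chance.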
123 | # Draw 10,000 samples out of Poisson distribution: n_nohitters 124 | n_nohitters = np.random.poisson(251/115, size=10000) 125 | 126 | # Compute number of samples that are seven or greater: n_large 127 | n_large = np.sum(n_nohitters >= 7) 128 | 129 | # Compute probability of getting seven or more: p_large 130 | p_large = n_large / 10000 131 | 132 | # Print the result 133 | print('Probability of seven or more no-hitters:', p_large) 134 | ______________________________________________________________________ -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 4 - Thinking probabilistically continuous variables.txt: -------------------------------------------------------------------------------- 1 | 1.The Normal PDF 2 | # Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10 3 | samples_std1 = np.random.normal(20, 1, size=100000) 4 | samples_std3 = np.random.normal(20, 3, size=100000) 5 | samples_std10 = np.random.normal(20, 10, size=100000) 6 | 7 | # Make histograms 8 | _ = plt.hist(samples_std1, bins=100, normed=True, histtype='step') 9 | _ = plt.hist(samples_std3, bins=100, normed=True, histtype='step') 10 | _ = plt.hist(samples_std10, bins=100, normed=True, histtype='step') 11 | 12 | # Make a legend, set limits and show plot 13 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10')) 14 | plt.ylim(-0.01, 0.42) 15 | plt.show() 16 | ________________________________________________________________________ 17 | 2.The Normal CDF 18 | # Generate CDFs 19 | x_std1, y_std1 = ecdf(samples_std1) 20 | x_std3, y_std3 = ecdf(samples_std3) 21 | x_std10, y_std10 = ecdf(samples_std10) 22 | 23 | # Plot CDFs 24 | _ = plt.plot(x_std1, y_std1, marker='.', linestyle='none') 25 | _ = plt.plot(x_std3, y_std3, marker='.', linestyle='none') 26 | _ = plt.plot(x_std10, y_std10, marker='.', linestyle='none') 27 | 28 | # Make a legend and show the plot 29 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right') 30 | plt.show() 31 | ________________________________________________________________________ 32 | 3.Are the Belmont Stakes results Normally distributed? 33 | # Compute mean and standard deviation: mu, sigma 34 | mu = np.mean(belmont_no_outliers) 35 | sigma = np.std(belmont_no_outliers) 36 | 37 | # Sample out of a normal distribution with this mu and sigma: samples 38 | samples = np.random.normal(mu, sigma, size=10000) 39 | 40 | # Get the CDF of the samples and of the data 41 | x_theor, y_theor = ecdf(samples) 42 | x, y = ecdf(belmont_no_outliers) 43 | 44 | # Plot the CDFs and show the plot 45 | _ = plt.plot(x_theor, y_theor) 46 | _ = plt.plot(x, y, marker='.', linestyle='none') 47 | _ = plt.xlabel('Belmont winning time (sec.)') 48 | _ = plt.ylabel('CDF') 49 | plt.show() 50 | ________________________________________________________________________ 51 | 4.What are the chances of a horse matching or beating Secretariat's record? 52 | # Take a million samples out of the Normal distribution: samples 53 | samples = np.random.normal(mu, sigma, size=1000000) 54 | 55 | # Compute the fraction that are faster than 144 seconds: prob 56 | prob = np.sum(samples <= 144) / len(samples) 57 | 58 | # Print the result 59 | print('Probability of besting Secretariat:', prob) 60 | ________________________________________________________________________ 61 | 5.If you have a story, you can simulate it! 
62 | def successive_poisson(tau1, tau2, size=1): 63 | """Compute time for arrival of 2 successive Poisson processes.""" 64 | # Draw samples out of first exponential distribution: t1 65 | t1 = np.random.exponential(tau1, size=size) 66 | 67 | # Draw samples out of second exponential distribution: t2 68 | t2 = np.random.exponential(tau2, size=size) 69 | 70 | return t1 + t2 71 | ________________________________________________________________________ 72 | 6.Distribution of no-hitters and cycles 73 | # Draw samples of waiting times 74 | waiting_times = successive_poisson(764, 715, size=100000) 75 | 76 | # Make the histogram 77 | _ = plt.hist(waiting_times, bins=100, histtype='step', 78 | normed=True) 79 | 80 | # Label axes 81 | _ = plt.xlabel('total waiting time (games)') 82 | _ = plt.ylabel('PDF') 83 | 84 | # Show the plot 85 | plt.show() 86 | ________________________________________________________________________ -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /16 - Statistical Thinking in Python (Part 2)/Chapter 1 - Parameter estimation by optimization.txt: -------------------------------------------------------------------------------- 1 | 1.How often do we get no-hitters? 2 | # Seed random number generator 3 | np.random.seed(42) 4 | 5 | # Compute mean no-hitter time: tau 6 | tau = np.mean(nohitter_times) 7 | 8 | # Draw out of an exponential distribution with parameter tau: inter_nohitter_time 9 | inter_nohitter_time = np.random.exponential(tau, 100000) 10 | 11 | # Plot the PDF and label axes 12 | _ = plt.hist(inter_nohitter_time, 13 | bins=50, normed=True, histtype='step') 14 | _ = plt.xlabel('Games between no-hitters') 15 | _ = plt.ylabel('PDF') 16 | 17 | # Show the plot 18 | plt.show() 19 | _________________________________________________________________________ 20 | 2.Do the data follow our story? 21 | # Create an ECDF from real data: x, y 22 | x, y = ecdf(nohitter_times) 23 | 24 | # Create a CDF from theoretical samples: x_theor, y_theor 25 | x_theor, y_theor = ecdf(inter_nohitter_time) 26 | 27 | # Overlay the plots 28 | plt.plot(x_theor, y_theor) 29 | plt.plot(x, y, marker='.', linestyle='none') 30 | 31 | # Margins and axis labels 32 | plt.margins(0.02) 33 | plt.xlabel('Games between no-hitters') 34 | plt.ylabel('CDF') 35 | 36 | # Show the plot 37 | plt.show() 38 | _________________________________________________________________________ 39 | 3.How is this parameter optimal? 
39 | 3.How is this parameter optimal?
40 | # Plot the data and theoretical CDFs
41 | plt.plot(x_theor, y_theor)
42 | plt.plot(x, y, marker='.', linestyle='none')
43 | plt.margins(0.02)
44 | plt.xlabel('Games between no-hitters')
45 | plt.ylabel('CDF')
46 | 
47 | # Take samples with half tau: samples_half
48 | samples_half = np.random.exponential(tau/2, 10000)
49 | 
50 | # Take samples with double tau: samples_double
51 | samples_double = np.random.exponential(2*tau, 10000)
52 | 
53 | # Generate CDFs from these samples
54 | x_half, y_half = ecdf(samples_half)
55 | x_double, y_double = ecdf(samples_double)
56 | 
57 | # Plot these CDFs as lines
58 | _ = plt.plot(x_half, y_half)
59 | _ = plt.plot(x_double, y_double)
60 | 
61 | # Show the plot
62 | plt.show()
63 | _________________________________________________________________________
64 | 4.EDA of literacy/fertility data
65 | # Plot the illiteracy rate versus fertility
66 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
67 | 
68 | # Set the margins and label axes
69 | plt.margins(0.02)
70 | _ = plt.xlabel('percent illiterate')
71 | _ = plt.ylabel('fertility')
72 | 
73 | # Show the plot
74 | plt.show()
75 | 
76 | # Show the Pearson correlation coefficient (pearson_r is a helper written earlier in the course)
77 | print(pearson_r(illiteracy, fertility))
78 | _________________________________________________________________________
79 | 5.Linear regression
80 | # Plot the illiteracy rate versus fertility
81 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
82 | plt.margins(0.02)
83 | _ = plt.xlabel('percent illiterate')
84 | _ = plt.ylabel('fertility')
85 | 
86 | # Perform a linear regression using np.polyfit(): a, b
87 | a, b = np.polyfit(illiteracy, fertility, 1)
88 | 
89 | # Print the results to the screen
90 | print('slope =', a, 'children per woman / percent illiterate')
91 | print('intercept =', b, 'children per woman')
92 | 
93 | # Make theoretical line to plot
94 | x = np.array([0, 100])
95 | y = a * x + b
96 | 
97 | # Add regression line to your plot
98 | _ = plt.plot(x, y)
99 | 
100 | # Draw the plot
101 | plt.show()
102 | _________________________________________________________________________
103 | 6.How is it optimal?
104 | # Specify slopes to consider: a_vals 105 | a_vals = np.linspace(0, 0.1, 200) 106 | 107 | # Initialize sum of square of residuals: rss 108 | rss = np.empty_like(a_vals) 109 | 110 | # Compute sum of square of residuals for each value of a_vals 111 | for i, a in enumerate(a_vals): 112 | rss[i] = np.sum((fertility - a*illiteracy - b)**2) 113 | 114 | # Plot the RSS 115 | plt.plot(a_vals, rss, '-') 116 | plt.xlabel('slope (children per woman / percent illiterate)') 117 | plt.ylabel('sum of square of residuals') 118 | 119 | plt.show() 120 | _________________________________________________________________________ 121 | 7.Linear regression on appropriate Anscombe data 122 | # Perform linear regression: a, b 123 | a, b = np.polyfit(x, y, 1) 124 | 125 | # Print the slope and intercept 126 | print(a, b) 127 | 128 | # Generate theoretical x and y data: x_theor, y_theor 129 | x_theor = np.array([3, 15]) 130 | y_theor = a * x_theor + b 131 | 132 | # Plot the Anscombe data and theoretical line 133 | _ = plt.plot(x, y, marker='.', linestyle='none') 134 | _ = plt.plot(x_theor, y_theor) 135 | 136 | # Label the axes 137 | plt.xlabel('x') 138 | plt.ylabel('y') 139 | 140 | # Show the plot 141 | plt.show() 142 | _________________________________________________________________________ 143 | 8.Linear regression on all Anscombe data 144 | # Iterate through x,y pairs 145 | for x, y in zip(anscombe_x, anscombe_y): 146 | # Compute the slope and intercept: a, b 147 | a, b = np.polyfit(x, y, 1) 148 | 149 | # Print the result 150 | print('slope:', a, 'intercept:', b) 151 | 152 | _________________________________________________________________________ -------------------------------------------------------------------------------- /16 - Statistical Thinking in Python (Part 2)/Chapter 3 - Introduction to hypothesis testing.txt: -------------------------------------------------------------------------------- 1 | 1.Generating a permutation sample 2 | def permutation_sample(data1, data2): 3 | """Generate a permutation sample from two data sets.""" 4 | 5 | # Concatenate the data sets: data 6 | data = np.concatenate((data1, data2)) 7 | 8 | # Permute the concatenated array: permuted_data 9 | permuted_data = np.random.permutation(data) 10 | 11 | # Split the permuted array into two: perm_sample_1, perm_sample_2 12 | perm_sample_1 = permuted_data[:len(data1)] 13 | perm_sample_2 = permuted_data[len(data1):] 14 | 15 | return perm_sample_1, perm_sample_2 16 | ______________________________________________________________________ 17 | 2.Visualizing permutation sampling 18 | for _ in range(50): 19 | # Generate permutation samples 20 | perm_sample_1, perm_sample_2 = permutation_sample( 21 | rain_june, rain_november) 22 | 23 | # Compute ECDFs 24 | x_1, y_1 = ecdf(perm_sample_1) 25 | x_2, y_2 = ecdf(perm_sample_2) 26 | 27 | # Plot ECDFs of permutation sample 28 | _ = plt.plot(x_1, y_1, marker='.', linestyle='none', 29 | color='red', alpha=0.02) 30 | _ = plt.plot(x_2, y_2, marker='.', linestyle='none', 31 | color='blue', alpha=0.02) 32 | 33 | # Create and plot ECDFs from original data 34 | x_1, y_1 = ecdf(rain_june) 35 | x_2, y_2 = ecdf(rain_november) 36 | _ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red') 37 | _ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue') 38 | 39 | # Label axes, set margin, and show plot 40 | plt.margins(0.02) 41 | _ = plt.xlabel('monthly rainfall (mm)') 42 | _ = plt.ylabel('ECDF') 43 | plt.show() 44 | ______________________________________________________________________ 45 | 
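Note: these exercises lean on the ecdf() helper written back in Part 1 of the course, which is not reproduced in these files. A minimal reconstruction of that helper, assuming numpy is imported as np:

def ecdf(data):
    """Return x, y arrays for an empirical CDF (reconstruction of the course helper)."""
    x = np.sort(data)                      # sorted measurements
    y = np.arange(1, len(x) + 1) / len(x)  # fraction of points <= each x
    return x, y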
3.Generating permutation replicates 46 | def draw_perm_reps(data_1, data_2, func, size=1): 47 | """Generate multiple permutation replicates.""" 48 | 49 | # Initialize array of replicates: perm_replicates 50 | perm_replicates = np.empty(size) 51 | 52 | for i in range(size): 53 | # Generate permutation sample 54 | perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2) 55 | 56 | # Compute the test statistic 57 | perm_replicates[i] = func(perm_sample_1, perm_sample_2) 58 | 59 | return perm_replicates 60 | ______________________________________________________________________ 61 | 4.Look before you leap: EDA before hypothesis testing 62 | # Make bee swarm plot 63 | _ = sns.swarmplot(x='ID', y='impact_force', data=df) 64 | 65 | # Label axes 66 | _ = plt.xlabel('frog') 67 | _ = plt.ylabel('impact force (N)') 68 | 69 | # Show the plot 70 | plt.show() 71 | ______________________________________________________________________ 72 | 5.Permutation test on frog data 73 | def diff_of_means(data_1, data_2): 74 | """Difference in means of two arrays.""" 75 | 76 | # The difference of means of data_1, data_2: diff 77 | diff = np.mean(data_1) - np.mean(data_2) 78 | 79 | return diff 80 | 81 | # Compute difference of mean impact force from experiment: empirical_diff_means 82 | empirical_diff_means = diff_of_means(force_a, force_b) 83 | 84 | # Draw 10,000 permutation replicates: perm_replicates 85 | perm_replicates = draw_perm_reps(force_a, force_b, 86 | diff_of_means, size=10000) 87 | 88 | # Compute p-value: p 89 | p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates) 90 | 91 | # Print the result 92 | print('p-value =', p) 93 | ______________________________________________________________________ 94 | 6.A one-sample bootstrap hypothesis test 95 | # Make an array of translated impact forces: translated_force_b 96 | translated_force_b = force_b - np.mean(force_b) + 0.55 97 | 98 | # Take bootstrap replicates of Frog B's translated impact forces: bs_replicates 99 | bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000) 100 | 101 | # Compute fraction of replicates that are less than the observed Frog B force: p 102 | p = np.sum(bs_replicates <= np.mean(force_b)) / 10000 103 | 104 | # Print the p-value 105 | print('p = ', p) 106 | ______________________________________________________________________ 107 | 7.A two-sample bootstrap hypothesis test for difference of means 108 | # Compute mean of all forces: mean_force 109 | mean_force = np.mean(forces_concat) 110 | 111 | # Generate shifted arrays 112 | force_a_shifted = force_a - np.mean(force_a) + mean_force 113 | force_b_shifted = force_b - np.mean(force_b) + mean_force 114 | 115 | # Compute 10,000 bootstrap replicates from shifted arrays 116 | bs_replicates_a = draw_bs_reps(force_a_shifted, np.mean, size=10000) 117 | bs_replicates_b = draw_bs_reps(force_b_shifted, np.mean, size=10000) 118 | 119 | # Get replicates of difference of means: bs_replicates 120 | bs_replicates = bs_replicates_a - bs_replicates_b 121 | 122 | # Compute and print p-value: p 123 | p = np.sum(bs_replicates >= empirical_diff_means) / len(bs_replicates) 124 | print('p-value =', p) 125 | ______________________________________________________________________ -------------------------------------------------------------------------------- /16 - Statistical Thinking in Python (Part 2)/key points: -------------------------------------------------------------------------------- 1 | 2 | 
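Note: the one- and two-sample bootstrap tests above call draw_bs_reps(), which the course defines in the bootstrap chapter of this part (not included in this excerpt). A reconstruction of the helper, assuming numpy is imported as np:

def bootstrap_replicate_1d(data, func):
    """One bootstrap replicate: resample with replacement, then apply func."""
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    """Draw size bootstrap replicates of the statistic func."""
    bs_replicates = np.empty(size)
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)
    return bs_replicates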
-------------------------------------------------------------------------------- /17 - Supervised Learning with Scikit-learn/Chapter 1 - Classification.txt: -------------------------------------------------------------------------------- 1 | 1.k-Nearest Neighbors: Fit 2 | # Import KNeighborsClassifier from sklearn.neighbors 3 | from sklearn.neighbors import KNeighborsClassifier 4 | 5 | # Create arrays for the features and the response variable 6 | y = df['party'].values 7 | X = df.drop('party', axis=1).values 8 | 9 | # Create a k-NN classifier with 6 neighbors 10 | knn = KNeighborsClassifier(n_neighbors=6) 11 | 12 | # Fit the classifier to the data 13 | knn.fit(X, y) 14 | ______________________________________________________________________ 15 | 2.k-Nearest Neighbors: Predict 16 | # Import KNeighborsClassifier from sklearn.neighbors 17 | from sklearn.neighbors import KNeighborsClassifier 18 | 19 | # Create arrays for the features and the response variable 20 | y = df['party'].values 21 | X = df.drop('party', axis=1).values 22 | 23 | # Create a k-NN classifier with 6 neighbors: knn 24 | knn = KNeighborsClassifier(n_neighbors=6) 25 | 26 | # Fit the classifier to the data 27 | knn.fit(X, y) 28 | 29 | # Predict the labels for the training data X: y_pred 30 | y_pred = knn.predict(X) 31 | 32 | # Predict and print the label for the new data point X_new 33 | new_prediction = knn.predict(X_new) 34 | print("Prediction: {}".format(new_prediction)) 35 | ______________________________________________________________________ 36 | 3.The digits recognition dataset 37 | # Import necessary modules 38 | from sklearn import datasets 39 | import matplotlib.pyplot as plt 40 | 41 | # Load the digits dataset: digits 42 | digits = datasets.load_digits() 43 | 44 | # Print the keys and DESCR of the dataset 45 | print(digits.keys()) 46 | print(digits.DESCR) 47 | 48 | # Print the shape of the images and data keys 49 | print(digits.images.shape) 50 | print(digits.data.shape) 51 | 52 | # Display digit 1010 53 | plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest') 54 | plt.show() 55 | ______________________________________________________________________ 56 | 4.Train/Test Split + Fit/Predict/Accuracy 57 | # Import necessary modules 58 | from sklearn.neighbors import KNeighborsClassifier 59 | from sklearn.model_selection import train_test_split 60 | 61 | # Create feature and target arrays 62 | X = digits.data 63 | y = digits.target 64 | 65 | # Split into training and test set 66 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y) 67 | 68 | # Create a k-NN classifier with 7 neighbors: knn 69 | knn = KNeighborsClassifier(n_neighbors=7) 70 | 71 | # Fit the classifier to the training data 72 | knn.fit(X_train, y_train) 73 | 74 | # Print the accuracy 75 | print(knn.score(X_test, y_test)) 76 | ______________________________________________________________________ 77 | 5.Overfitting and underfitting 78 | # Setup arrays to store train and test accuracies 79 | neighbors = np.arange(1, 9) 80 | train_accuracy = np.empty(len(neighbors)) 81 | test_accuracy = np.empty(len(neighbors)) 82 | 83 | # Loop over different values of k 84 | for i, k in enumerate(neighbors): 85 | # Setup a k-NN Classifier with k neighbors: knn 86 | knn = KNeighborsClassifier(n_neighbors=k) 87 | 88 | # Fit the classifier to the training data 89 | knn.fit(X_train, y_train) 90 | 91 | #Compute accuracy on the training set 92 | train_accuracy[i] = knn.score(X_train, y_train) 93 | 94 | 
#Compute accuracy on the testing set 95 | test_accuracy[i] = knn.score(X_test, y_test) 96 | 97 | # Generate plot 98 | plt.title('k-NN: Varying Number of Neighbors') 99 | plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy') 100 | plt.plot(neighbors, train_accuracy, label = 'Training Accuracy') 101 | plt.legend() 102 | plt.xlabel('Number of Neighbors') 103 | plt.ylabel('Accuracy') 104 | plt.show() 105 | 106 | ______________________________________________________________________ -------------------------------------------------------------------------------- /17 - Supervised Learning with Scikit-learn/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/Chapter 1 - Clustering for dataset exploration.txt: -------------------------------------------------------------------------------- 1 | 1.Clustering 2D points 2 | # Import KMeans 3 | from sklearn.cluster import KMeans 4 | 5 | # Create a KMeans instance with 3 clusters: model 6 | model = KMeans(n_clusters=3) 7 | 8 | # Fit model to points 9 | model.fit(points) 10 | 11 | # Determine the cluster labels of new_points: labels 12 | labels = model.predict(new_points) 13 | 14 | # Print cluster labels of new_points 15 | print(labels) 16 | ___________________________________________________________________ 17 | 2.Inspect your clustering 18 | # Import pyplot 19 | from matplotlib import pyplot as plt 20 | 21 | # Assign the columns of new_points: xs and ys 22 | xs = new_points[:,0] 23 | ys = new_points[:,1] 24 | 25 | # Make a scatter plot of xs and ys, using labels to define the colors 26 | plt.scatter(xs, ys, c=labels, alpha=0.5) 27 | 28 | # Assign the cluster centers: centroids 29 | centroids = model.cluster_centers_ 30 | 31 | # Assign the columns of centroids: centroids_x, centroids_y 32 | centroids_x = centroids[:,0] 33 | centroids_y = centroids[:,1] 34 | 35 | # Make a scatter plot of centroids_x and centroids_y 36 | plt.scatter(centroids_x, centroids_y, marker='D', s=50) 37 | plt.show() 38 | ___________________________________________________________________ 39 | 3.How many clusters of grain? 
40 | ks = range(1, 6) 41 | inertias = [] 42 | 43 | for k in ks: 44 | # Create a KMeans instance with k clusters: model 45 | model = KMeans(n_clusters=k) 46 | 47 | # Fit model to samples 48 | model.fit(samples) 49 | 50 | # Append the inertia to the list of inertias 51 | inertias.append(model.inertia_) 52 | 53 | # Plot ks vs inertias 54 | plt.plot(ks, inertias, '-o') 55 | plt.xlabel('number of clusters, k') 56 | plt.ylabel('inertia') 57 | plt.xticks(ks) 58 | plt.show() 59 | ___________________________________________________________________ 60 | 4.Evaluating the grain clustering 61 | # Create a KMeans model with 3 clusters: model 62 | model = KMeans(n_clusters=3) 63 | 64 | # Use fit_predict to fit model and obtain cluster labels: labels 65 | labels = model.fit_predict(samples) 66 | 67 | # Create a DataFrame with clusters and varieties as columns: df 68 | df = pd.DataFrame({'labels': labels, 'varieties': varieties}) 69 | 70 | # Create crosstab: ct 71 | ct = pd.crosstab(df['labels'], df['varieties']) 72 | 73 | # Display ct 74 | print(ct) 75 | ___________________________________________________________________ 76 | 5.Scaling fish data for clustering 77 | # Perform the necessary imports 78 | from sklearn.pipeline import make_pipeline 79 | from sklearn.preprocessing import StandardScaler 80 | from sklearn.cluster import KMeans 81 | 82 | # Create scaler: scaler 83 | scaler = StandardScaler() 84 | 85 | # Create KMeans instance: kmeans 86 | kmeans = KMeans(n_clusters=4) 87 | 88 | # Create pipeline: pipeline 89 | pipeline = make_pipeline(scaler, kmeans) 90 | __________________________________________________________________ 91 | 6.Clustering the fish data 92 | # Import pandas 93 | import pandas as pd 94 | 95 | # Fit the pipeline to samples 96 | pipeline.fit(samples) 97 | 98 | # Calculate the cluster labels: labels 99 | labels = pipeline.predict(samples) 100 | 101 | # Create a DataFrame with labels and species as columns: df 102 | df = pd.DataFrame({'labels': labels, 'species': species}) 103 | 104 | # Create crosstab: ct 105 | ct = pd.crosstab(df['labels'], df['species']) 106 | 107 | # Display ct 108 | print(ct) 109 | ___________________________________________________________________ 110 | 7.Clustering stocks using KMeans 111 | # Import Normalizer 112 | from sklearn.preprocessing import Normalizer 113 | 114 | # Create a normalizer: normalizer 115 | normalizer = Normalizer() 116 | 117 | # Create a KMeans model with 10 clusters: kmeans 118 | kmeans = KMeans(n_clusters=10) 119 | 120 | # Make a pipeline chaining normalizer and kmeans: pipeline 121 | pipeline = make_pipeline(normalizer, kmeans) 122 | 123 | # Fit pipeline to the daily price movements 124 | pipeline.fit(movements) 125 | ___________________________________________________________________ 126 | 8.Which stocks move together? 
127 | # Import pandas 128 | import pandas as pd 129 | 130 | # Predict the cluster labels: labels 131 | labels = pipeline.predict(movements) 132 | 133 | # Create a DataFrame aligning labels and companies: df 134 | df = pd.DataFrame({'labels': labels, 'companies': companies}) 135 | 136 | # Display df sorted by cluster label 137 | print(df.sort_values('labels')) 138 | ___________________________________________________________________ -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/Chapter 2 - Visualization with Hierarchical clustering and t-sne.txt: -------------------------------------------------------------------------------- 1 | 1.Hierarchical clustering of the grain data 2 | # Perform the necessary imports 3 | from scipy.cluster.hierarchy import linkage, dendrogram 4 | import matplotlib.pyplot as plt 5 | 6 | # Calculate the linkage: mergings 7 | mergings = linkage(samples, method='complete') 8 | 9 | # Plot the dendrogram, using varieties as labels 10 | dendrogram(mergings, 11 | labels=varieties, 12 | leaf_rotation=90, 13 | leaf_font_size=6, 14 | ) 15 | plt.show() 16 | _________________________________________________________________ 17 | 2.Hierarchies of stocks 18 | # Import normalize 19 | from sklearn.preprocessing import normalize 20 | 21 | # Normalize the movements: normalized_movements 22 | normalized_movements = normalize(movements) 23 | 24 | # Calculate the linkage: mergings 25 | mergings = linkage(normalized_movements, method='complete') 26 | 27 | # Plot the dendrogram 28 | dendrogram( 29 | mergings, 30 | labels=companies, 31 | leaf_rotation=90, 32 | leaf_font_size=6 33 | ) 34 | plt.show() 35 | _________________________________________________________________ 36 | 3.Different linkage, different hierarchical clustering! 
37 | # Perform the necessary imports 38 | import matplotlib.pyplot as plt 39 | from scipy.cluster.hierarchy import linkage, dendrogram 40 | 41 | # Calculate the linkage: mergings 42 | mergings = linkage(samples, method='single') 43 | 44 | # Plot the dendrogram 45 | dendrogram(mergings, 46 | labels=country_names, 47 | leaf_rotation=90, 48 | leaf_font_size=6, 49 | ) 50 | plt.show() 51 | _________________________________________________________________ 52 | 4.Extracting the cluster labels 53 | # Perform the necessary imports 54 | import pandas as pd 55 | from scipy.cluster.hierarchy import fcluster 56 | 57 | # Use fcluster to extract labels: labels 58 | labels = fcluster(mergings, 6, criterion='distance') 59 | 60 | # Create a DataFrame with labels and varieties as columns: df 61 | df = pd.DataFrame({'labels': labels, 'varieties': varieties}) 62 | 63 | # Create crosstab: ct 64 | ct = pd.crosstab(df['labels'], df['varieties']) 65 | 66 | # Display ct 67 | print(ct) 68 | _________________________________________________________________ 69 | 5.t-SNE visualization of grain dataset 70 | # Import TSNE 71 | from sklearn.manifold import TSNE 72 | 73 | # Create a TSNE instance: model 74 | model = TSNE(learning_rate=200) 75 | 76 | # Apply fit_transform to samples: tsne_features 77 | tsne_features = model.fit_transform(samples) 78 | 79 | # Select the 0th feature: xs 80 | xs = tsne_features[:,0] 81 | 82 | # Select the 1st feature: ys 83 | ys = tsne_features[:,1] 84 | 85 | # Scatter plot, coloring by variety_numbers 86 | plt.scatter(xs, ys, c=variety_numbers) 87 | plt.show() 88 | _________________________________________________________________ 89 | 6.A t-SNE map of the stock market 90 | # Import TSNE 91 | from sklearn.manifold import TSNE 92 | 93 | # Create a TSNE instance: model 94 | model = TSNE(learning_rate=50) 95 | 96 | # Apply fit_transform to normalized_movements: tsne_features 97 | tsne_features = model.fit_transform(normalized_movements) 98 | 99 | # Select the 0th feature: xs 100 | xs = tsne_features[:,0] 101 | 102 | # Select the 1th feature: ys 103 | ys = tsne_features[:,1] 104 | 105 | # Scatter plot 106 | plt.scatter(xs, ys, alpha=0.5) 107 | 108 | # Annotate the points 109 | for x, y, company in zip(xs, ys, companies): 110 | plt.annotate(company, (x, y), fontsize=5, alpha=0.75) 111 | plt.show() 112 | _________________________________________________________________ -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/Chapter 4 - Discovering Interpretable features.txt: -------------------------------------------------------------------------------- 1 | 1.NMF applied to Wikipedia articles 2 | # Import NMF 3 | from sklearn.decomposition import NMF 4 | 5 | # Create an NMF instance: model 6 | model = NMF(n_components=6) 7 | 8 | # Fit the model to articles 9 | model.fit(articles) 10 | 11 | # Transform the articles: nmf_features 12 | nmf_features = model.transform(articles) 13 | 14 | # Print the NMF features 15 | print(nmf_features.round(2)) 16 | ________________________________________________________________ 17 | 2.NMF features of the Wikipedia articles 18 | # Import pandas 19 | import pandas as pd 20 | 21 | # Create a pandas DataFrame: df 22 | df = pd.DataFrame(nmf_features, index=titles) 23 | 24 | # Print the row for 'Anne Hathaway' 25 | print(df.loc['Anne Hathaway']) 26 | 27 | # Print the row for 'Denzel Washington' 28 | print(df.loc['Denzel Washington']) 29 | ________________________________________________________________ 30 | 
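Note: the NMF features and components factor the original document matrix, so each article row is approximately a non-negative weighted sum of topic components. A quick sketch of that reconstruction, assuming model and nmf_features from the exercises above:

# Hypothetical check: approximately rebuild the article vectors
import numpy as np
reconstruction = nmf_features @ model.components_  # shape matches the articles matrix
# each row of nmf_features holds topic weights; each row of components_ is a topic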
3.NMF learns topics of documents 31 | # Import pandas 32 | import pandas as pd 33 | 34 | # Create a DataFrame: components_df 35 | components_df = pd.DataFrame(model.components_, columns=words) 36 | 37 | # Print the shape of the DataFrame 38 | print(components_df.shape) 39 | 40 | # Select row 3: component 41 | component = components_df.iloc[3] 42 | 43 | # Print result of nlargest 44 | print(component.nlargest()) 45 | ________________________________________________________________ 46 | 4.Explore the LED digits dataset 47 | # Import pyplot 48 | from matplotlib import pyplot as plt 49 | 50 | # Select the 0th row: digit 51 | digit = samples[0,:] 52 | 53 | # Print digit 54 | print(digit) 55 | 56 | # Reshape digit to a 13x8 array: bitmap 57 | bitmap = digit.reshape((13, 8)) 58 | 59 | # Print bitmap 60 | print(bitmap) 61 | 62 | # Use plt.imshow to display bitmap 63 | plt.imshow(bitmap, cmap='gray', interpolation='nearest') 64 | plt.colorbar() 65 | plt.show() 66 | ________________________________________________________________ 67 | 5.NMF learns the parts of images 68 | # Import NMF 69 | from sklearn.decomposition import NMF 70 | 71 | # Create an NMF model: model 72 | model = NMF(n_components=7) 73 | 74 | # Apply fit_transform to samples: features 75 | features = model.fit_transform(samples) 76 | 77 | # Call show_as_image on each component 78 | for component in model.components_: 79 | show_as_image(component) 80 | 81 | # Select the 0th row of features: digit_features 82 | digit_features = features[0,:] 83 | 84 | # Print digit_features 85 | print(digit_features) 86 | ________________________________________________________________ 87 | 6.PCA doesn't learn parts 88 | # Import PCA 89 | from sklearn.decomposition import PCA 90 | 91 | # Create a PCA instance: model 92 | model = PCA(n_components=7) 93 | 94 | # Apply fit_transform to samples: features 95 | features = model.fit_transform(samples) 96 | 97 | # Call show_as_image on each component 98 | for component in model.components_: 99 | show_as_image(component) 100 | ________________________________________________________________ 101 | 7.Which articles are similar to 'Cristiano Ronaldo'? 
102 | # Perform the necessary imports 103 | import pandas as pd 104 | from sklearn.preprocessing import normalize 105 | 106 | # Normalize the NMF features: norm_features 107 | norm_features = normalize(nmf_features) 108 | 109 | # Create a DataFrame: df 110 | df = pd.DataFrame(norm_features, index=titles) 111 | 112 | # Select the row corresponding to 'Cristiano Ronaldo': article 113 | article = df.loc['Cristiano Ronaldo'] 114 | 115 | # Compute the dot products: similarities 116 | similarities = df.dot(article) 117 | 118 | # Display those with the largest cosine similarity 119 | print(similarities.nlargest()) 120 | ________________________________________________________________ 121 | 8.Recommend musical artists part I 122 | # Perform the necessary imports 123 | from sklearn.decomposition import NMF 124 | from sklearn.preprocessing import Normalizer, MaxAbsScaler 125 | from sklearn.pipeline import make_pipeline 126 | 127 | # Create a MaxAbsScaler: scaler 128 | scaler = MaxAbsScaler() 129 | 130 | # Create an NMF model: nmf 131 | nmf = NMF(n_components=20) 132 | 133 | # Create a Normalizer: normalizer 134 | normalizer = Normalizer() 135 | 136 | # Create a pipeline: pipeline 137 | pipeline = make_pipeline(scaler, nmf, normalizer) 138 | 139 | # Apply fit_transform to artists: norm_features 140 | norm_features = pipeline.fit_transform(artists) 141 | ________________________________________________________________ 142 | 9.Recommend musical artists part II 143 | # Import pandas 144 | import pandas as pd 145 | 146 | # Create a DataFrame: df 147 | df = pd.DataFrame(norm_features, index=artist_names) 148 | 149 | # Select row of 'Bruce Springsteen': artist 150 | artist = df.loc['Bruce Springsteen'] 151 | 152 | # Compute cosine similarities: similarities 153 | similarities = df.dot(artist) 154 | 155 | # Display those with highest cosine similarity 156 | print(similarities.nlargest()) 157 | ________________________________________________________________ -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 1 - Classification and regression trees.txt: -------------------------------------------------------------------------------- 1 | 1.Train your first classification tree 2 | #work with the Wisconsin Breast Cancer Dataset from the UCI machine learning repository. 
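# Note: the exercise environment pre-loads this dataset and provides
# X_train, X_test, y_train, y_test, plus the constant SEED used below.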
3 | # Import DecisionTreeClassifier from sklearn.tree
4 | from sklearn.tree import DecisionTreeClassifier
5 | 
6 | # Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
7 | dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)
8 | 
9 | # Fit dt to the training set
10 | dt.fit(X_train, y_train)
11 | 
12 | # Predict test set labels
13 | y_pred = dt.predict(X_test)
14 | print(y_pred[0:5])
15 | ______________________________________________________________________
16 | 2.Evaluate the classification tree
17 | # Import accuracy_score
18 | from sklearn.metrics import accuracy_score
19 | 
20 | # Predict test set labels
21 | y_pred = dt.predict(X_test)
22 | 
23 | # Compute test set accuracy
24 | acc = accuracy_score(y_test, y_pred)
25 | print("Test set accuracy: {:.2f}".format(acc))
26 | ______________________________________________________________________
27 | 3.Logistic regression vs classification tree
28 | # Import LogisticRegression from sklearn.linear_model
29 | from sklearn.linear_model import LogisticRegression
30 | 
31 | # Instantiate logreg
32 | logreg = LogisticRegression(random_state=1)
33 | 
34 | # Fit logreg to the training set
35 | logreg.fit(X_train, y_train)
36 | 
37 | # Define a list called clfs containing the two classifiers logreg and dt
38 | clfs = [logreg, dt]
39 | 
40 | # Review the decision regions of the two classifiers (plot_labeled_decision_regions is provided by the exercise)
41 | plot_labeled_decision_regions(X_test, y_test, clfs)
42 | ______________________________________________________________________
43 | 4.Using entropy as a criterion
44 | # Import DecisionTreeClassifier from sklearn.tree
45 | from sklearn.tree import DecisionTreeClassifier
46 | 
47 | # Instantiate dt_entropy, set 'entropy' as the information criterion
48 | dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)
49 | 
50 | # Fit dt_entropy to the training set
51 | dt_entropy.fit(X_train, y_train)
52 | ______________________________________________________________________
53 | 5.Entropy vs Gini index
54 | # Import accuracy_score from sklearn.metrics
55 | from sklearn.metrics import accuracy_score
56 | 
57 | # Use dt_entropy to predict test set labels
58 | y_pred = dt_entropy.predict(X_test)
59 | 
60 | # Evaluate accuracy_entropy
61 | accuracy_entropy = accuracy_score(y_test, y_pred)
62 | 
63 | # Print accuracy_entropy
64 | print('Accuracy achieved by using entropy: ', accuracy_entropy)
65 | 
66 | # Print accuracy_gini (computed in a previous exercise)
67 | print('Accuracy achieved by using the gini index: ', accuracy_gini)
______________________________________________________________________
68 | 6.Train your first regression tree
69 | # Import DecisionTreeRegressor from sklearn.tree
70 | from sklearn.tree import DecisionTreeRegressor
71 | 
72 | # Instantiate dt
73 | dt = DecisionTreeRegressor(max_depth=8,
74 | min_samples_leaf=0.13,
75 | random_state=3)
76 | 
77 | # Fit dt to the training set
78 | dt.fit(X_train, y_train)
79 | ______________________________________________________________________
80 | 7.Evaluate the regression tree
81 | # Import mean_squared_error from sklearn.metrics as MSE
82 | from sklearn.metrics import mean_squared_error as MSE
83 | 
84 | # Compute y_pred
85 | y_pred = dt.predict(X_test)
86 | 
87 | # Compute mse_dt
88 | mse_dt = MSE(y_test, y_pred)
89 | 
90 | # Compute rmse_dt
91 | rmse_dt = mse_dt**(1/2)
92 | 
93 | # Print rmse_dt
94 | print("Test set RMSE of dt: {:.2f}".format(rmse_dt))
95 | ______________________________________________________________________
96 | 8.Linear regression vs regression tree
97 | # Predict test set labels (lr is a LinearRegression model fitted by the exercise)
98 | y_pred_lr = lr.predict(X_test)
99 | 
100 | # Compute mse_lr
101 | 
mse_lr = MSE(y_test, y_pred_lr) 102 | 103 | # Compute rmse_lr 104 | rmse_lr = mse_lr**(1/2) 105 | 106 | # Print rmse_lr 107 | print('Linear Regression test set RMSE: {:.2f}'.format(rmse_lr)) 108 | 109 | # Print rmse_dt 110 | print('Regression Tree test set RMSE: {:.2f}'.format(rmse_dt)) 111 | ______________________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 2 - The bias-variance Tradeoff.txt: -------------------------------------------------------------------------------- 1 | 1.Instantiate the model 2 | # Import train_test_split from sklearn.model_selection 3 | from sklearn.model_selection import train_test_split 4 | 5 | # Set SEED for reproducibility 6 | SEED = 1 7 | 8 | # Split the data into 70% train and 30% test 9 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED) 10 | 11 | # Instantiate a DecisionTreeRegressor dt 12 | dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED) 13 | _________________________________________________________________________ 14 | 2.Evaluate the 10-fold CV error 15 | # Compute the array containing the 10-folds CV MSEs 16 | MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, 17 | scoring='neg_mean_squared_error', 18 | n_jobs=-1) 19 | 20 | # Compute the 10-folds CV RMSE 21 | RMSE_CV = (MSE_CV_scores.mean())**(1/2) 22 | 23 | # Print RMSE_CV 24 | print('CV RMSE: {:.2f}'.format(RMSE_CV)) 25 | _________________________________________________________________________ 26 | 3.Evaluate the training error 27 | # Import mean_squared_error from sklearn.metrics as MSE 28 | from sklearn.metrics import mean_squared_error as MSE 29 | 30 | # Fit dt to the training set 31 | dt.fit(X_train, y_train) 32 | 33 | # Predict the labels of the training set 34 | y_pred_train = dt.predict(X_train) 35 | 36 | # Evaluate the training set RMSE of dt 37 | RMSE_train = (MSE(y_train, y_pred_train))**(1/2) 38 | 39 | # Print RMSE_train 40 | print('Train RMSE: {:.2f}'.format(RMSE_train)) 41 | _________________________________________________________________________ 42 | 4.Define the ensemble 43 | # Set seed for reproducibility 44 | SEED=1 45 | 46 | # Instantiate lr 47 | lr = LogisticRegression(random_state=SEED) 48 | 49 | # Instantiate knn 50 | knn = KNN(n_neighbors=27) 51 | 52 | # Instantiate dt 53 | dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED) 54 | 55 | # Define the list classifiers 56 | classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)] 57 | _________________________________________________________________________ 58 | 5.Evaluate individual classifiers 59 | # Iterate over the pre-defined list of classifiers 60 | for clf_name, clf in classifiers: 61 | 62 | # Fit clf to the training set 63 | clf.fit(X_train, y_train) 64 | 65 | # Predict y_pred 66 | y_pred = clf.predict(X_test) 67 | 68 | # Calculate accuracy 69 | accuracy = accuracy_score(y_test, y_pred) 70 | 71 | # Evaluate clf's accuracy on the test set 72 | print('{:s} : {:.3f}'.format(clf_name, accuracy)) 73 | _________________________________________________________________________ 74 | 6.Better performance with a Voting Classifier 75 | # Import VotingClassifier from sklearn.ensemble 76 | from sklearn.ensemble import VotingClassifier 77 | 78 | # Instantiate a VotingClassifier vc 79 | vc = VotingClassifier(estimators=classifiers) 80 | 81 | # 
Fit vc to the training set 82 | vc.fit(X_train, y_train) 83 | 84 | # Evaluate the test set predictions 85 | y_pred = vc.predict(X_test) 86 | 87 | # Calculate accuracy score 88 | accuracy = accuracy_score(y_test, y_pred) 89 | print('Voting Classifier: {:.3f}'.format(accuracy)) 90 | _________________________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 3 - Bagging and Random Forests.txt: -------------------------------------------------------------------------------- 1 | 1.Define the bagging classifier 2 | # Indian Liver Patient dataset from the UCI machine learning repository. 3 | # Import DecisionTreeClassifier 4 | from sklearn.tree import DecisionTreeClassifier 5 | 6 | # Import BaggingClassifier 7 | from sklearn.ensemble import BaggingClassifier 8 | 9 | # Instantiate dt 10 | dt = DecisionTreeClassifier(random_state=1) 11 | 12 | # Instantiate bc 13 | bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=1) 14 | _____________________________________________________________ 15 | 2.Evaluate Bagging performance 16 | # Fit bc to the training set 17 | bc.fit(X_train, y_train) 18 | 19 | # Predict test set labels 20 | y_pred = bc.predict(X_test) 21 | 22 | # Evaluate acc_test 23 | acc_test = accuracy_score(y_test, y_pred) 24 | print('Test set accuracy of bc: {:.2f}'.format(acc_test)) 25 | _____________________________________________________________ 26 | 3.Prepare the ground 27 | # Import DecisionTreeClassifier 28 | from sklearn.tree import DecisionTreeClassifier 29 | 30 | # Import BaggingClassifier 31 | from sklearn.ensemble import BaggingClassifier 32 | 33 | # Instantiate dt 34 | dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1) 35 | 36 | # Instantiate bc 37 | bc = BaggingClassifier(base_estimator=dt, 38 | n_estimators=50, 39 | oob_score=True, 40 | random_state=1) 41 | _____________________________________________________________ 42 | 4.OOB Score vs Test Set Score 43 | # Fit bc to the training set 44 | bc.fit(X_train, y_train) 45 | 46 | # Predict test set labels 47 | y_pred = bc.predict(X_test) 48 | 49 | # Evaluate test set accuracy 50 | acc_test = accuracy_score(y_test, y_pred) 51 | 52 | # Evaluate OOB accuracy 53 | acc_oob = bc.oob_score_ 54 | 55 | # Print acc_test and acc_oob 56 | print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob)) 57 | _____________________________________________________________ 58 | 5.Train an RF regressor 59 | #using historical weather data from the Bike Sharing Demand dataset available through Kaggle 60 | # Import RandomForestRegressor 61 | from sklearn.ensemble import RandomForestRegressor 62 | 63 | # Instantiate rf 64 | rf = RandomForestRegressor(n_estimators=25, 65 | random_state=2) 66 | 67 | # Fit rf to the training set 68 | rf.fit(X_train, y_train) 69 | _____________________________________________________________ 70 | 6.Evaluate the RF regressor 71 | # Import mean_squared_error as MSE 72 | from sklearn.metrics import mean_squared_error as MSE 73 | 74 | # Predict the test set labels 75 | y_pred = rf.predict(X_test) 76 | 77 | # Evaluate the test set RMSE 78 | rmse_test = MSE(y_test, y_pred)**(1/2) 79 | 80 | # Print rmse_test 81 | print('Test set RMSE of rf: {:.2f}'.format(rmse_test)) 82 | _____________________________________________________________ 83 | 7.Visualizing features importances 84 | # Create a pd.Series of features importances 85 | importances = 
pd.Series(data=rf.feature_importances_, 86 | index= X_train.columns) 87 | 88 | # Sort importances 89 | importances_sorted = importances.sort_values() 90 | 91 | # Draw a horizontal barplot of importances_sorted 92 | importances_sorted.plot(kind='barh', color='lightgreen') 93 | plt.title('Features Importances') 94 | plt.show() 95 | _____________________________________________________________ 96 | -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 4 - Boosting.txt: -------------------------------------------------------------------------------- 1 | 1.Define the AdaBoost classifier 2 | #the Indian Liver Patient dataset 3 | # Import DecisionTreeClassifier 4 | from sklearn.tree import DecisionTreeClassifier 5 | 6 | # Import AdaBoostClassifier 7 | from sklearn.ensemble import AdaBoostClassifier 8 | 9 | # Instantiate dt 10 | dt = DecisionTreeClassifier(max_depth=2, random_state=1) 11 | 12 | # Instantiate ada 13 | ada = AdaBoostClassifier(base_estimator=dt, n_estimators=180, random_state=1) 14 | _______________________________________________________________ 15 | 2.Train the AdaBoost classifier 16 | # Fit ada to the training set 17 | ada.fit(X_train, y_train) 18 | 19 | # Compute the probabilities of obtaining the positive class 20 | y_pred_proba = ada.predict_proba(X_test)[:,1] 21 | _______________________________________________________________ 22 | 3.Evaluate the AdaBoost classifier 23 | # Import roc_auc_score 24 | from sklearn.metrics import roc_auc_score 25 | 26 | # Evaluate test-set roc_auc_score 27 | ada_roc_auc = roc_auc_score(y_test, y_pred_proba) 28 | 29 | # Print roc_auc_score 30 | print('ROC AUC score: {:.2f}'.format(ada_roc_auc)) 31 | _______________________________________________________________ 32 | 4.Define the GB regressor 33 | #the Bike Sharing Demand dataset 34 | # Import GradientBoostingRegressor 35 | from sklearn.ensemble import GradientBoostingRegressor 36 | 37 | # Instantiate gb 38 | gb = GradientBoostingRegressor(max_depth=4, 39 | n_estimators=200, 40 | random_state=2) 41 | _______________________________________________________________ 42 | 5.Train the GB regressor 43 | # Fit gb to the training set 44 | gb.fit(X_train, y_train) 45 | 46 | # Predict test set labels 47 | y_pred = gb.predict(X_test) 48 | _______________________________________________________________ 49 | 6.Evaluate the GB regressor 50 | # Import mean_squared_error as MSE 51 | from sklearn.metrics import mean_squared_error as MSE 52 | 53 | # Compute MSE 54 | mse_test = MSE(y_test, y_pred) 55 | 56 | # Compute RMSE 57 | rmse_test = mse_test**(1/2) 58 | 59 | # Print RMSE 60 | 61 | print('Test set RMSE of gb: {:.3f}'.format(rmse_test)) 62 | _______________________________________________________________ 63 | 7.Regression with SGB 64 | # Import GradientBoostingRegressor 65 | from sklearn.ensemble import GradientBoostingRegressor 66 | 67 | # Instantiate sgbr 68 | sgbr = GradientBoostingRegressor(max_depth=4, 69 | subsample=0.9, 70 | max_features=0.75, 71 | n_estimators=200, 72 | random_state=2) 73 | _______________________________________________________________ 74 | 8.Train the SGB regressor 75 | # Fit sgbr to the training set 76 | sgbr.fit(X_train, y_train) 77 | 78 | # Predict test set labels 79 | y_pred = sgbr.predict(X_test) 80 | _______________________________________________________________ 81 | 9.Evaluate the SGB regressor 82 | # Import mean_squared_error as MSE 83 | from sklearn.metrics import mean_squared_error as 
MSE 84 | 85 | # Compute test set MSE 86 | mse_test = MSE(y_test, y_pred) 87 | 88 | # Compute test set RMSE 89 | rmse_test = mse_test**(1/2) 90 | 91 | # Print rmse_test 92 | print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test)) 93 | _______________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 5 - Model Tuning.txt: -------------------------------------------------------------------------------- 1 | 1.Set the tree's hyperparameter grid 2 | # Define params_dt 3 | params_dt = { 4 | 'max_depth': [2, 3, 4], 5 | 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18] 6 | } 7 | _______________________________________________________________ 8 | 2.Search for the optimal tree 9 | # Import GridSearchCV 10 | from sklearn.model_selection import GridSearchCV 11 | 12 | # Instantiate grid_dt 13 | grid_dt = GridSearchCV(estimator=dt, 14 | param_grid=params_dt, 15 | scoring='roc_auc', 16 | cv=5, 17 | n_jobs=-1) 18 | _______________________________________________________________ 19 | 3.Evaluate the optimal tree 20 | # Import roc_auc_score from sklearn.metrics 21 | from sklearn.metrics import roc_auc_score 22 | 23 | # Extract the best estimator 24 | best_model = grid_dt.best_estimator_ 25 | 26 | # Predict the test set probabilities of the positive class 27 | y_pred_proba = best_model.predict_proba(X_test)[:,1] 28 | 29 | # Compute test_roc_auc 30 | test_roc_auc = roc_auc_score(y_test, y_pred_proba) 31 | 32 | # Print test_roc_auc 33 | print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc)) 34 | _______________________________________________________________ 35 | 4.Set the hyperparameter grid of RF 36 | # Define the dictionary 'params_rf' 37 | params_rf = { 38 | 'n_estimators': [100, 350, 500], 39 | 'max_features': ['log2', 'auto', 'sqrt'], 40 | 'min_samples_leaf': [2, 10, 30], 41 | } 42 | _______________________________________________________________ 43 | 5.Search for the optimal forest 44 | # Import GridSearchCV 45 | from sklearn.model_selection import GridSearchCV 46 | 47 | # Instantiate grid_rf 48 | grid_rf = GridSearchCV(estimator=rf, 49 | param_grid=params_rf, 50 | scoring='neg_mean_squared_error', 51 | cv=3, 52 | verbose=1, 53 | n_jobs=-1) 54 | _______________________________________________________________ 55 | 6.Evaluate the optimal forest 56 | # Import mean_squared_error from sklearn.metrics as MSE 57 | from sklearn.metrics import mean_squared_error as MSE 58 | 59 | # Extract the best estimator 60 | best_model = grid_rf.best_estimator_ 61 | 62 | # Predict test set labels 63 | y_pred = best_model.predict(X_test) 64 | 65 | # Compute rmse_test 66 | rmse_test = MSE(y_test, y_pred)**(1/2) 67 | 68 | # Print rmse_test 69 | print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 70 | _______________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 1 - Introduction to clustering.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Anirudh-Chauhan/Data-Scientist-with-Python-DataCamp/2b254aab79c5c7420c9fd96aeeab81a020420a27/20 - Cluster Analysis in 
Python/Chapter 1 - Introduction to clustering.txt -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 2 - Hierarchical Clustering.txt: -------------------------------------------------------------------------------- 
1 | 1.Hierarchical clustering: ward method
2 | # Import the fcluster and linkage functions
3 | from scipy.cluster.hierarchy import fcluster, linkage
4 | 
5 | # Use the linkage() function
6 | distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'ward', metric = 'euclidean')
7 | 
8 | # Assign cluster labels
9 | comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
10 | 
11 | # Plot clusters
12 | sns.scatterplot(x='x_scaled', y='y_scaled',
13 | hue='cluster_labels', data = comic_con)
14 | plt.show()
15 | ____________________________________________________________________________
16 | 2.Hierarchical clustering: single method
17 | # Import the fcluster and linkage functions
18 | from scipy.cluster.hierarchy import fcluster, linkage
19 | 
20 | # Use the linkage() function
21 | distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'single', metric = 'euclidean')
22 | 
23 | # Assign cluster labels
24 | comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
25 | 
26 | # Plot clusters
27 | sns.scatterplot(x='x_scaled', y='y_scaled',
28 | hue='cluster_labels', data = comic_con)
29 | plt.show()
30 | ____________________________________________________________________________
31 | 3.Hierarchical clustering: complete method
32 | # Import the fcluster and linkage functions
33 | from scipy.cluster.hierarchy import fcluster, linkage
34 | 
35 | # Use the linkage() function
36 | distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'complete', metric = 'euclidean')
37 | 
38 | # Assign cluster labels
39 | comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
40 | 
41 | # Plot clusters
42 | sns.scatterplot(x='x_scaled', y='y_scaled',
43 | hue='cluster_labels', data = comic_con)
44 | plt.show()
45 | ____________________________________________________________________________
46 | 4.Visualize clusters with matplotlib
47 | # Import the pyplot module
48 | from matplotlib import pyplot as plt
49 | 
50 | # Define a colors dictionary for clusters
51 | colors = {1:'red', 2:'blue'}
52 | 
53 | # Plot a scatter plot
54 | comic_con.plot.scatter(x='x_scaled',
55 | y='y_scaled',
56 | c=comic_con['cluster_labels'].apply(lambda x: colors[x]))
57 | plt.show()
58 | ____________________________________________________________________________
59 | 5.Visualize clusters with seaborn
60 | # Import the seaborn module
61 | import seaborn as sns
62 | 
63 | # Plot a scatter plot using seaborn
64 | sns.scatterplot(x='x_scaled',
65 | y='y_scaled',
66 | hue='cluster_labels',
67 | data=comic_con)
68 | plt.show()
69 | ____________________________________________________________________________
70 | 6.Create a dendrogram
71 | # Import the dendrogram function
72 | from scipy.cluster.hierarchy import dendrogram
73 | 
74 | # Create a dendrogram
75 | dn = dendrogram(distance_matrix)
76 | 
77 | # Display the dendrogram
78 | plt.show()
79 | ____________________________________________________________________________
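Note: a dendrogram is usually read together with fcluster(): cutting the tree at a chosen height yields flat cluster labels. A minimal sketch, assuming distance_matrix is a linkage output from the exercises above (the threshold value is purely illustrative):

# Hypothetical follow-up: cut the dendrogram at a chosen height
from scipy.cluster.hierarchy import fcluster
height = 5.0  # assumed threshold, read off the dendrogram by eye
labels = fcluster(distance_matrix, height, criterion='distance')
print(len(set(labels)), 'clusters at height', height)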
80 | 7.FIFA 18: exploring defenders
81 | # Fit the data into a hierarchical clustering algorithm
82 | distance_matrix = linkage(fifa[['scaled_sliding_tackle', 'scaled_aggression']], 'ward')
83 | 
84 | # Assign cluster labels to each row of data
85 | fifa['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')
86 | 
87 | # Display cluster centers of each cluster
88 | print(fifa[['scaled_sliding_tackle', 'scaled_aggression', 'cluster_labels']].groupby('cluster_labels').mean())
89 | 
90 | # Create a scatter plot through seaborn
91 | sns.scatterplot(x='scaled_sliding_tackle', y='scaled_aggression', hue='cluster_labels', data=fifa)
92 | plt.show()
93 | ____________________________________________________________________________ -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 3 - K-means Clustering.txt: -------------------------------------------------------------------------------- 
1 | 1.K-means clustering: first exercise
2 | # Import the kmeans and vq functions
3 | from scipy.cluster.vq import kmeans, vq
4 | 
5 | # Generate cluster centers
6 | cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
7 | 
8 | # Assign cluster labels
9 | comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)
10 | 
11 | # Plot clusters
12 | sns.scatterplot(x='x_scaled', y='y_scaled',
13 | hue='cluster_labels', data = comic_con)
14 | plt.show()
15 | _________________________________________________________________
16 | 2.Elbow method on distinct clusters
17 | distortions = []
18 | num_clusters = range(1, 7)
19 | 
20 | # Create a list of distortions from the kmeans function
21 | for i in num_clusters:
22 | cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
23 | distortions.append(distortion)
24 | 
25 | # Create a data frame with two lists - num_clusters, distortions
26 | elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})
27 | 
28 | # Create a line plot of num_clusters and distortions
29 | sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
30 | plt.xticks(num_clusters)
31 | plt.show()
32 | _________________________________________________________________
33 | 3.Impact of seeds on distinct clusters
34 | # Import numpy's random module
35 | from numpy import random
36 | 
37 | # Initialize seed
38 | random.seed([1, 2, 1000])
39 | 
40 | # Run kmeans clustering
41 | cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
42 | comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)
43 | 
44 | # Plot the scatterplot
45 | sns.scatterplot(x='x_scaled', y='y_scaled',
46 | hue='cluster_labels', data = comic_con)
47 | plt.show()
48 | _________________________________________________________________
49 | 4.Uniform clustering patterns
50 | # Import the kmeans and vq functions
51 | from scipy.cluster.vq import kmeans, vq
52 | 
53 | # Generate cluster centers
54 | cluster_centers, distortion = kmeans(mouse[['x_scaled', 'y_scaled']], 3)
55 | 
56 | # Assign cluster labels
57 | mouse['cluster_labels'], distortion_list = vq(mouse[['x_scaled', 'y_scaled']], cluster_centers)
58 | 
59 | # Plot clusters
60 | sns.scatterplot(x='x_scaled', y='y_scaled',
61 | hue='cluster_labels', data = mouse)
62 | plt.show()
63 | _________________________________________________________________
64 | 5.FIFA 18: defenders revisited
65 | # Set up a random seed in numpy
66 | random.seed([1000,2000])
67 | 
68 | # Fit the data into a k-means algorithm
69 | cluster_centers,_ = kmeans(fifa[['scaled_def', 'scaled_phy']], 3)
70 | 
71 | # Assign cluster labels
72 | fifa['cluster_labels'], _ = vq(fifa[['scaled_def', 'scaled_phy']],
cluster_centers)
73 | 
74 | # Display cluster centers
75 | print(fifa[['scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean())
76 | 
77 | # Create a scatter plot through seaborn
78 | sns.scatterplot(x='scaled_def', y='scaled_phy', hue='cluster_labels', data=fifa)
79 | plt.show()
80 | _________________________________________________________________ -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 4 - Clustering in Real World.txt: -------------------------------------------------------------------------------- 
1 | 1.Extract RGB values from image
2 | # Import the image module of matplotlib
3 | import matplotlib.image as img
4 | 
5 | # Read batman image and print dimensions
6 | batman_image = img.imread('batman.jpg')
7 | print(batman_image.shape)
8 | 
9 | # Store RGB values of all pixels in lists r, g and b (pre-created as empty lists in the exercise)
10 | for row in batman_image:
11 | for temp_r, temp_g, temp_b in row:
12 | r.append(temp_r)
13 | g.append(temp_g)
14 | b.append(temp_b)
15 | _________________________________________________________________________
16 | 2.How many dominant colors?
17 | distortions = []
18 | num_clusters = range(1, 7)
19 | 
20 | # Create a list of distortions from the kmeans function
21 | for i in num_clusters:
22 | cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_blue', 'scaled_green']], i)
23 | distortions.append(distortion)
24 | 
25 | # Create a data frame with two lists, num_clusters and distortions
26 | elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})
27 | 
28 | # Create a line plot of num_clusters and distortions
29 | sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
30 | plt.xticks(num_clusters)
31 | plt.show()
32 | _________________________________________________________________________
33 | 3.Display dominant colors
34 | # Get standard deviations of each color (colors is pre-created as an empty list in the exercise)
35 | r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std()
36 | 
37 | for cluster_center in cluster_centers:
38 | scaled_r, scaled_g, scaled_b = cluster_center
39 | # Convert each standardized value to scaled value
40 | colors.append((
41 | scaled_r * r_std / 255,
42 | scaled_g * g_std / 255,
43 | scaled_b * b_std / 255
44 | ))
45 | 
46 | # Display colors of cluster centers
47 | plt.imshow([colors])
48 | plt.show()
49 | _________________________________________________________________________
50 | 4.TF-IDF of movie plots
51 | # Import TfidfVectorizer class from sklearn
52 | from sklearn.feature_extraction.text import TfidfVectorizer
53 | 
54 | # Initialize TfidfVectorizer (remove_noise is a tokenizer defined earlier in the course)
55 | tfidf_vectorizer = TfidfVectorizer(max_df=0.75, max_features=50,
56 | min_df=0.1, tokenizer=remove_noise)
57 | 
58 | # Use the .fit_transform() method on the list plots
59 | tfidf_matrix = tfidf_vectorizer.fit_transform(plots)
60 | _________________________________________________________________________
61 | 5.Top terms in movie clusters
62 | num_clusters = 2
63 | 
64 | # Generate cluster centers through the kmeans function
65 | cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)
66 | 
67 | # Generate terms from the tfidf_vectorizer object (renamed get_feature_names_out in scikit-learn >= 1.0)
68 | terms = tfidf_vectorizer.get_feature_names()
69 | 
70 | for i in range(num_clusters):
71 | # Sort the terms and print top 3 terms
72 | center_terms = dict(zip(terms, list(cluster_centers[i])))
73 | sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
74 | print(sorted_terms[:3])
75 | 
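Note: to see how the movies split across the two clusters, the vq() function from the earlier exercises applies here as well. A small sketch, assuming tfidf_matrix and cluster_centers from the exercise above:

# Hypothetical follow-up: assign each plot to its nearest cluster center
import numpy as np
from scipy.cluster.vq import vq
labels, _ = vq(np.asarray(tfidf_matrix.todense()), cluster_centers)
print(np.bincount(labels))  # number of movies per cluster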
_________________________________________________________________________ 76 | 6.Basic checks on clusters 77 | # Print the size of the clusters 78 | print(fifa.groupby('cluster_labels')['ID'].count()) 79 | 80 | # Print the mean value of wages in each cluster 81 | print(fifa.groupby('cluster_labels')['eur_wage'].mean()) 82 | _________________________________________________________________________ 83 | 7.FIFA 18: what makes a complete player? 84 | # Create centroids with kmeans for 2 clusters 85 | cluster_centers,_ = kmeans(fifa[scaled_features], 2) 86 | 87 | # Assign cluster labels and print cluster centers 88 | fifa['cluster_labels'], _ = vq(fifa[scaled_features], cluster_centers) 89 | print(fifa.groupby('cluster_labels')[scaled_features].mean()) 90 | 91 | # Plot cluster centers to visualize clusters 92 | fifa.groupby('cluster_labels')[scaled_features].mean().plot(legend=True, kind='bar') 93 | plt.show() 94 | 95 | # Get the name column of first 5 players in each cluster 96 | for cluster in fifa['cluster_labels'].unique(): 97 | print(cluster, fifa[fifa['cluster_labels'] == cluster]['name'].values[:5]) 98 | _________________________________________________________________________ -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | --------------------------------------------------------------------------------