├── 01 - Introduction to Python
│   ├── Chapter 1 - Python basics.txt
│   ├── Chapter 2 - Python Lists.txt
│   ├── Chapter 3 - Functions and Packages.txt
│   ├── Chapter 4 - Numpy.txt
│   └── about course
├── 02 - Intermediate Python
│   ├── Chapter 1 - Matplotlib .txt
│   ├── Chapter 2 -Dictionaries & Pandas .txt
│   ├── Chapter 3 - Logic, Control Flow and Filtering.txt
│   ├── Chapter 4 - Loops.txt
│   ├── Chapter 5 - case study hacker statistics.txt
│   └── key points
├── 03 - Introduction to Data Visualization using Matplotlib
│   ├── Chapter 1 - Introduction to Matplotlib.txt
│   ├── Chapter 2 - Plotting time-series.txt
│   ├── Chapter 3 - Quantitative Comparisions and statistical visualizations.txt
│   ├── Chapter 4 - sharing visualizations with others.txt
│   └── contents
├── 04 - Introduction to Data Visualization with Seaborn
│   ├── Chapter 1 - Introduction to Seaborn.txt
│   ├── Chapter 2 - Visualization two quantitative variables.txt
│   ├── Chapter 3 - Visualization a categorical and a quantitative variables.txt
│   ├── Chapter 4 - customizing seaborn plots.txt
│   └── key points
├── 05 - Python Data Science Toolbox (Part 1)
│   ├── Chapter 1 - Writing your own functions .txt
│   ├── Chapter 2 - Default arguments variable length arguments and scope .txt
│   ├── Chapter 3 - lambda functions and error handling.txt
│   └── key points
├── 06 - Python Data Science Toolbox (Part 2)
│   ├── Chapter 1 - Using Iterators in Pythonland .txt
│   ├── Chapter 2 - List Comprehensions and Generators.txt
│   ├── Chapter 3 - Bringing it all together.txt
│   └── key points
├── 07 - Intermediate Data Visualization with Seaborn
│   ├── Chapter 1 - seaborn introduction .txt
│   ├── Chapter 2 - Customizing Seaborn plots.txt
│   ├── Chapter 3 -additional plot types.txt
│   ├── Chapter 4 -creating plots on data aware grids.txt
│   └── key points
├── 08 - Introduction to Import data in python
│   ├── Chapter 1 - introduction and flat files 1.txt
│   ├── Chapter 2 - importing data from other file types 2.txt
│   ├── Chapter 3 - working with relational databases in python 3.txt
│   └── key points
├── 09 - Intermediate importing data in python
│   ├── Chapter 1 - Importing data from the Internet.txt
│   ├── Chapter 2 - intracting with apis to import data from web.txt
│   ├── Chapter 3 - Diving deep into the twitter api.txt
│   └── key points
├── 10 - Cleaning Data in Python
│   ├── Chapter 1 - Common Data Problems.txt
│   ├── Chapter 2 - Text and categorical data problems.txt
│   ├── Chapter 3 - Advanced data problems.txt
│   ├── Chapter 4 - Record Linkage.txt
│   └── key points
├── 11 - Working with Dates and Times in Python
│   ├── Chapter 1 - Dates and Calenders .txt
│   ├── Chapter 2 - Combining dates and times.txt
│   ├── Chapter 3 - time zones and daylight saving.txt
│   └── key points
├── 12 - Writing functions in Python
│   ├── Chapter 1 - Best Practices.txt
│   ├── Chapter 2 - Using Context Managers.txt
│   ├── Chapter 3 - Decorators.txt
│   ├── Chapter 4 - More on Decorators.txt
│   └── key points
├── 13 - Exploratory Data Analysis in Python
│   ├── Chapter 1 - Read clean and validate.txt
│   ├── Chapter 2 - Distributions.txt
│   ├── Chapter 3 - Relationships.txt
│   ├── Chapter 4 - Multivariate Thinking.txt
│   └── key points
├── 14 - Analyzing Police Activity with pandas
│   ├── Chapter 1 - preparing data for analysis.txt
│   ├── Chapter 2 - Exploring the Relationship between gender and policing.txt
│   ├── Chapter 3 - Visual Exploratory data analysis.txt
│   ├── Chapter 4 - Analyzing the effect of weather on policing.txt
│   └── key points
├── 15 - Statistical Thinking in Python (Part 1)
│   ├── Chapter 1 - Graphical Exploratory Data Analysis .txt
│   ├── Chapter 2 - Quantitative Exploratory Data Analysis.txt
│   ├── Chapter 3 - Thinking probabilistically discrete variables.txt
│   ├── Chapter 4 - Thinking probabilistically continuous variables.txt
│   └── key points
├── 16 - Statistical Thinking in Python (Part 2)
│   ├── Chapter 1 - Parameter estimation by optimization.txt
│   ├── Chapter 2 - Bootstrap confidence intervals.txt
│   ├── Chapter 3 - Introduction to hypothesis testing.txt
│   └── key points
├── 17 - Supervised Learning with Scikit-learn
│   ├── Chapter 1 - Classification.txt
│   ├── Chapter 2 - Regression.txt
│   ├── Chapter 3 - Fine Tuning your model.txt
│   ├── Chapter 4 - Preprocessing and Pipelines.txt
│   └── key points
├── 18 - Unsupervised Learning in Python
│   ├── Chapter 1 - Clustering for dataset exploration.txt
│   ├── Chapter 2 - Visualization with Hierarchical clustering and t-sne.txt
│   ├── Chapter 3 - Decorrelating your data and dimension reduction.txt
│   ├── Chapter 4 - Discovering Interpretable features.txt
│   └── key points
├── 19 - Machine learning with tree-based models in python
│   ├── Chapter 1 - Classification and regression trees.txt
│   ├── Chapter 2 - The bias-variance Tradeoff.txt
│   ├── Chapter 3 - Bagging and Random Forests.txt
│   ├── Chapter 4 - Boosting.txt
│   ├── Chapter 5 - Model Tuning.txt
│   └── key points
├── 20 - Cluster Analysis in Python
│   ├── Chapter 1 - Introduction to clustering.txt
│   ├── Chapter 2 - Hierarchical Clustering.txt
│   ├── Chapter 3 - K-means Clustering.txt
│   ├── Chapter 4 - Clustering in Real World.txt
│   └── key points
└── README.md

/01 - Introduction to Python/Chapter 1 - Python basics.txt:
--------------------------------------------------------------------------------
1.
# Example, do not modify!
print(5 / 8)

# Print the sum of 7 and 10
print(7 + 10)
____________________________________________________
2.
# Division
print(5 / 8)

# Addition
print(7 + 10)
____________________________________________________
3.
# Addition, subtraction
print(5 + 5)
print(5 - 5)

# Multiplication, division, modulo, and exponentiation
print(3 * 5)
print(10 / 2)
print(18 % 7)
print(4 ** 2)

# How much is your $100 worth after 7 years?
print(100 * (1.1 ** 7))
____________________________________________________
4.
# Create a variable savings
savings = 100

# Print out savings
print(savings)
____________________________________________________
5.
# Create a variable savings
savings = 100

# Create a variable growth_multiplier
growth_multiplier = 1.1

# Calculate result
result = savings * (growth_multiplier ** 7)

# Print out result
print(result)
____________________________________________________
6.
# Create a variable desc
desc = "compound interest"

# Create a variable profitable
profitable = True
____________________________________________________
7.
savings = 100
growth_multiplier = 1.1
desc = "compound interest"

# Assign product of growth_multiplier and savings to year1
year1 = growth_multiplier * savings

# Print the type of year1
print(type(year1))

# Assign sum of desc and desc to doubledesc
doubledesc = desc + desc

# Print out doubledesc
print(doubledesc)
____________________________________________________
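Aside - a quick standalone sketch (not one of the course exercises) of why exercise 7 behaves as it does: the + operator adds numbers but concatenates strings, and mixing the two types needs an explicit str() conversion, which exercise 8 below relies on.

# + adds numbers but concatenates strings
print(3 + 3)              # 6
print("3" + "3")          # 33
print(type(3), type("3"))

# Mixing the two types raises a TypeError, hence str():
print("total: " + str(3 + 3))
____________________________________________________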
8.
# Definition of savings and result
savings = 100
result = 100 * 1.10 ** 7

# Fix the printout
print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!")

# Definition of pi_string
pi_string = "3.1415926"

# Convert pi_string into float: pi_float
pi_float = float(pi_string)
____________________________________________________

/01 - Introduction to Python/Chapter 2 - Python Lists.txt:
--------------------------------------------------------------------------------
1.
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Create list areas
areas = [hall, kit, liv, bed, bath]

# Print areas
print(areas)
____________________________________________________
2.
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Adapt list areas
areas = ["hallway", hall, "kitchen", kit, "living room", liv, "bedroom", bed, "bathroom", bath]

# Print areas
print(areas)
____________________________________________________
3.
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# house information as list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]]

# Print out house
print(house)

# Print out the type of house
print(type(house))
____________________________________________________
4.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Print out second element from areas
print(areas[1])

# Print out last element from areas
print(areas[-1])

# Print out the area of the living room
print(areas[5])
____________________________________________________
5.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Sum of kitchen and bedroom area: eat_sleep_area
eat_sleep_area = areas[3] + areas[7]

# Print the variable eat_sleep_area
print(eat_sleep_area)
____________________________________________________
6.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Use slicing to create downstairs
downstairs = areas[:6]

# Use slicing to create upstairs
upstairs = areas[6:]

# Print out downstairs and upstairs
print(downstairs, upstairs)
____________________________________________________
7.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Alternative slicing to create downstairs
downstairs = areas[:6]

# Alternative slicing to create upstairs
upstairs = areas[6:]
____________________________________________________
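Aside - a small standalone sketch of Python's half-open slicing, which is why areas[:6] and areas[6:] in exercises 6-7 split the list cleanly with no overlap:

nums = [0, 1, 2, 3, 4, 5, 6]
print(nums[:3])    # [0, 1, 2] -> the stop index is excluded
print(nums[3:])    # [3, 4, 5, 6] -> starts exactly where the first slice stopped
print(nums[:3] + nums[3:] == nums)    # True: the two slices tile the list
____________________________________________________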
8.
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Correct the bathroom area
areas[-1] = 10.50

# Change "living room" to "chill zone"
areas[4] = "chill zone"
____________________________________________________
9.
# Create the areas list and make some changes
areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0,
         "bedroom", 10.75, "bathroom", 10.50]

# Add poolhouse data to areas, new list is areas_1
areas_1 = areas + ["poolhouse", 24.5]

# Add garage data to areas_1, new list is areas_2
areas_2 = areas_1 + ["garage", 15.45]
____________________________________________________
10.
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Create areas_copy (a real copy, not a second name for the same list)
areas_copy = list(areas)

# Change areas_copy
areas_copy[0] = 5.0

# Print areas: unchanged, because areas_copy is a separate list
print(areas)
____________________________________________________

/01 - Introduction to Python/Chapter 3 - Functions and Packages.txt:
--------------------------------------------------------------------------------
1.
# Create variables var1 and var2
var1 = [1, 2, 3, 4]
var2 = True

# Print out type of var1
print(type(var1))

# Print out length of var1
print(len(var1))

# Convert var2 to an integer: out2
out2 = int(var2)
____________________________________________________
2.
# Create lists first and second
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]

# Paste together first and second: full
full = first + second

# Sort full in descending order: full_sorted
full_sorted = sorted(full, reverse=True)

# Print out full_sorted
print(full_sorted)
____________________________________________________
3.
# string to experiment with: place
place = "poolhouse"

# Use upper() on place: place_up
place_up = place.upper()

# Print out place and place_up
print(place, place_up)

# Print out the number of o's in place
print(place.count('o'))
____________________________________________________
4.
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Print out the index of the element 20.0
print(areas.index(20.0))

# Print out how often 9.50 appears in areas
print(areas.count(9.50))
____________________________________________________
5.
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Use append twice to add poolhouse and garage size
areas.append(24.5)
areas.append(15.45)

# Print out areas
print(areas)

# Reverse the order of the elements in areas
# (reverse() works in place and returns None, so don't reassign it)
areas.reverse()

# Print out areas
print(areas)
____________________________________________________
6.
# Definition of radius
r = 0.43

# Import the math package
import math

# Calculate C
C = 2 * math.pi * r

# Calculate A
A = math.pi * r * r

# Build printout
print("Circumference: " + str(C))
print("Area: " + str(A))
____________________________________________________
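Aside - a short standalone sketch of the common import styles; exercise 6 above uses the full-module form, exercise 7 below uses the selective form, and the alias form is shown only for comparison:

import math                # qualified access: math.pi
from math import radians   # bare name: radians(...)
import math as m           # aliased access: m.sqrt(...)

print(math.pi)
print(radians(180))        # pi, since 180 degrees = pi radians
print(m.sqrt(2))
____________________________________________________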
7.
# Definition of radius
r = 192500
phi = 12

# Import radians function of math package
from math import radians

# Travel distance of Moon over 12 degrees. Store in dist.
dist = r * radians(phi)

# Print out dist
print(dist)
____________________________________________________

/01 - Introduction to Python/about course:
--------------------------------------------------------------------------------
This course consists of 4 major parts:
1. Python basics
2. Python Lists
3. Functions and Packages
4. Numpy

/02 - Intermediate Python/Chapter 3 - Logic, Control Flow and Filtering.txt:
--------------------------------------------------------------------------------
1. Equality
# Comparison of booleans
print(True == False)

# Comparison of integers
print(-5 * 15 != 75)

# Comparison of strings
print('pyscript' == 'PyScript')

# Compare a boolean with an integer
print(True == 1)
____________________________________________________
2. Greater and less than
# Comparison of integers
x = -3 * 6

# Comparison of strings
y = "test"

# Comparison of booleans
print(x >= -10)
print(y >= 'test', True > False)
____________________________________________________
3. Compare arrays
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than or equal to 18
print(my_house >= 18)

# my_house less than your_house
print(my_house < your_house)
____________________________________________________
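Aside - a standalone sketch showing that comparing a numpy array produces an elementwise boolean array, which is what the prints in exercise 3 display and what the filtering exercises later in this chapter build on:

import numpy as np

a = np.array([18.0, 20.0, 10.75, 9.50])
mask = a >= 18
print(mask)      # [ True  True False False] -- one boolean per element
print(a[mask])   # [18. 20.] -- a boolean mask can index the array directly
____________________________________________________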
4. Boolean operators
# Define variables
my_kitchen = 18.0
your_kitchen = 14.0

# my_kitchen bigger than 10 and smaller than 18?
print(my_kitchen > 10 and my_kitchen < 18)

# my_kitchen smaller than 14 or bigger than 17?
print(my_kitchen < 14 or my_kitchen > 17)

# Double my_kitchen smaller than triple your_kitchen?
print(my_kitchen * 2 < your_kitchen * 3)
____________________________________________________
5. Boolean operators with Numpy
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))
____________________________________________________
6. if
# Define variables
room = "kit"
area = 14.0

# if statement for room
if room == "kit":
    print("looking around in the kitchen.")

# if statement for area
if area > 15.0:
    print("big place!")
____________________________________________________
7. Add else
# Define variables
room = "kit"
area = 14.0

# if-else construct for room
if room == "kit":
    print("looking around in the kitchen.")
else:
    print("looking around elsewhere.")

# if-else construct for area
if area > 15:
    print("big place!")
else:
    print("pretty small.")
____________________________________________________
8. Customize further: elif
# Define variables
room = "bed"
area = 14.0

# if-elif-else construct for room
if room == "kit":
    print("looking around in the kitchen.")
elif room == "bed":
    print("looking around in the bedroom.")
else:
    print("looking around elsewhere.")

# if-elif-else construct for area
if area > 15:
    print("big place!")
elif area > 10:
    print("medium size, nice!")
else:
    print("pretty small.")
____________________________________________________
9. Driving right (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Extract drives_right column as Series: dr
dr = cars['drives_right']

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel
print(sel)
____________________________________________________
10. Driving right (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Convert code to a one-liner
sel = cars[cars['drives_right']]

# Print sel
print(sel)
____________________________________________________
11. Cars per capita (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
many_cars = cpc > 500
car_maniac = cars[many_cars]

# Print car_maniac
print(car_maniac)
____________________________________________________
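Aside - a self-contained sketch of the mask-then-subset pattern from exercises 9-11; the numbers here are made-up stand-ins, since cars.csv itself is not included in these notes:

import pandas as pd

# Hypothetical stand-in for cars.csv
cars = pd.DataFrame({'cars_per_cap': [809, 731, 588, 18, 200]},
                    index=['US', 'AUS', 'JAP', 'IN', 'RU'])

many_cars = cars['cars_per_cap'] > 500   # boolean Series, one flag per row
print(cars[many_cars])                   # keeps only rows where the mask is True
____________________________________________________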
12. Cars per capita (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Import numpy, you'll need this
import numpy as np

# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]

# Print medium
print(medium)
____________________________________________________

/02 - Intermediate Python/Chapter 4 - Loops.txt:
--------------------------------------------------------------------------------
1. Basic while loop
# Initialize offset
offset = 8

# Code the while loop
while offset != 0:
    print("correcting...")
    offset = offset - 1
    print(offset)
____________________________________________________
2. Add conditionals
# Initialize offset
offset = -6

# Code the while loop
while offset != 0:
    print("correcting...")
    if offset > 0:
        offset = offset - 1
    else:
        offset = offset + 1
    print(offset)
____________________________________________________
3. Loop over a list
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Code the for loop
for area in areas:
    print(area)
____________________________________________________
4. Indexes and values (1)
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Change for loop to use enumerate() and update print()
for index, area in enumerate(areas):
    print("room " + str(index) + ": " + str(area))
____________________________________________________
5. Indexes and values (2)
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Code the for loop
for index, area in enumerate(areas):
    print("room " + str(index + 1) + ": " + str(area))
____________________________________________________
6. Loop over list of lists
# house list of lists
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["living room", 20.0],
         ["bedroom", 10.75],
         ["bathroom", 9.50]]

# Build a for loop from scratch
for x in house:
    print("the " + x[0] + " is " + str(x[1]) + " sqm")
____________________________________________________
7. Loop over dictionary
# Definition of dictionary
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin',
          'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw', 'austria': 'vienna'}

# Iterate over europe
for key, value in europe.items():
    print("the capital of " + str(key) + " is " + str(value))
____________________________________________________
8. Loop over Numpy array
# Import numpy as np
import numpy as np

# np_height and np_baseball are pre-loaded numpy arrays in the course environment

# For loop over np_height
for x in np_height:
    print(str(x) + " inches")

# For loop over np_baseball
for i in np.nditer(np_baseball):
    print(i)
____________________________________________________
9. Loop over DataFrame (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Iterate over rows of cars
for lab, row in cars.iterrows():
    print(lab)
    print(row)
____________________________________________________
10. Loop over DataFrame (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Adapt for loop
for lab, row in cars.iterrows():
    print(lab + ": " + str(row['cars_per_cap']))
____________________________________________________
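Aside - a self-contained sketch of what iterrows() actually yields in exercises 9-10, again with made-up stand-in data for cars.csv:

import pandas as pd

# Hypothetical stand-in for cars.csv
cars = pd.DataFrame({'cars_per_cap': [809, 731]}, index=['US', 'AUS'])

# iterrows() yields (row label, row as a Series) pairs
for lab, row in cars.iterrows():
    print(lab, '->', row['cars_per_cap'])
____________________________________________________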
11. Add column (1)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
    cars.loc[lab, "COUNTRY"] = row["country"].upper()

# Print cars
print(cars)
____________________________________________________
12. Add column (2)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)

# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
____________________________________________________

/02 - Intermediate Python/key points:
--------------------------------------------------------------------------------
Consists of 5 chapters:
1- Matplotlib
2- Dictionaries & Pandas
3- Logic, Control Flow and Filtering
4- Loops
5- Case Study: Hacker Statistics

/03 - Introduction to Data Visualization using Matplotlib/Chapter 1 - Introduction to Matplotlib.txt:
--------------------------------------------------------------------------------
1. Using the matplotlib.pyplot interface
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Call the show function to show the result
plt.show()
____________________________________________________
2. Adding data to an Axes object
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt

# seattle_weather and austin_weather are pre-loaded DataFrames in the course environment

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Plot MLY-PRCP-NORMAL from seattle_weather against the MONTH
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])

# Plot MLY-PRCP-NORMAL from austin_weather against MONTH
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])

# Call the show function
plt.show()
____________________________________________________
3. Customizing data appearance
# Plot Seattle data, setting data appearance
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"],
        color="b", marker='o', linestyle='--')

# Plot Austin data, setting data appearance
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"],
        color="r", marker='v', linestyle='--')

# Call show to display the resulting plot
plt.show()
____________________________________________________
4. Customizing axis labels and adding titles
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])

# Customize the x-axis label
ax.set_xlabel("Time (months)")

# Customize the y-axis label
ax.set_ylabel("Precipitation (inches)")

# Add the title
ax.set_title("Weather patterns in Austin and Seattle")

# Display the figure
plt.show()
____________________________________________________
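Aside - a fully self-contained sketch of the Figure/Axes split used throughout this chapter: the Figure is the canvas, each Axes is one plot on it, and data, labels, and titles all attach to the Axes (the numbers are illustrative only):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()         # one Figure, one Axes
ax.plot([1, 2, 3], [2, 4, 8])    # data goes on the Axes
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Data and labels live on the Axes")
plt.show()
____________________________________________________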
5. Creating small multiples with plt.subplots
# Create a Figure and an array of subplots with 2 rows and 2 columns
fig, ax = plt.subplots(2, 2)

# Addressing the top left Axes as index 0, 0, plot month and Seattle precipitation
ax[0, 0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])

# In the top right (index 0, 1), plot month and Seattle temperatures
ax[0, 1].plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])

# In the bottom left (1, 0) plot month and Austin precipitations
ax[1, 0].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])

# In the bottom right (1, 1) plot month and Austin temperatures
ax[1, 1].plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
plt.show()
____________________________________________________
6. Small multiples with shared y axis
# Create a figure and an array of axes: 2 rows, 1 column with shared y axis
fig, ax = plt.subplots(2, 1, sharey=True)

# Plot Seattle precipitation in the top axes
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], color='b')
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-25PCTL"], color='b', linestyle='--')
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-75PCTL"], color='b', linestyle='--')

# Plot Austin precipitation in the bottom axes
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"], color='r')
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-25PCTL"], color='r', linestyle='--')
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-75PCTL"], color='r', linestyle='--')

plt.show()
____________________________________________________

/03 - Introduction to Data Visualization using Matplotlib/Chapter 2 - Plotting time-series.txt:
--------------------------------------------------------------------------------
1. Read data with a time index
# Import pandas as pd
import pandas as pd

# Read the data from file using read_csv
climate_change = pd.read_csv('climate_change.csv', parse_dates=["date"], index_col="date")
____________________________________________________
2. Plot time-series data
import matplotlib.pyplot as plt
fig, ax = plt.subplots()

# Add the time-series for "relative_temp" to the plot
ax.plot(climate_change.index, climate_change['relative_temp'])

# Set the x-axis label
ax.set_xlabel('Time')

# Set the y-axis label
ax.set_ylabel('Relative temperature (Celsius)')

# Show the figure
plt.show()
____________________________________________________
3. Using a time index to zoom in
import matplotlib.pyplot as plt

# Use plt.subplots to create fig and ax
fig, ax = plt.subplots()

# Create variable seventies with data from "1970-01-01" to "1979-12-31"
seventies = climate_change["1970-01-01":"1979-12-31"]

# Add the time-series for "co2" data from seventies to the plot
ax.plot(seventies.index, seventies["co2"])

# Show the figure
plt.show()
____________________________________________________
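Aside - a self-contained sketch of the string-slicing on a DatetimeIndex used in exercise 3, built on a synthetic monthly series instead of climate_change.csv:

import pandas as pd

idx = pd.date_range("1970-01-01", periods=240, freq="MS")   # monthly timestamps
ts = pd.Series(range(240), index=idx)

# With a DatetimeIndex, date strings select whole ranges by label
seventies = ts["1970-01-01":"1979-12-31"]
print(seventies.head())
____________________________________________________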
4. Plotting two variables
import matplotlib.pyplot as plt

# Initialize a Figure and Axes
fig, ax = plt.subplots()

# Plot the CO2 variable in blue
ax.plot(climate_change.index, climate_change["co2"], color='blue')

# Create a twin Axes that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature in red
ax2.plot(climate_change.index, climate_change["relative_temp"], color='red')

plt.show()
____________________________________________________
5. Defining a function that plots time-series data
# Define a function called plot_timeseries
def plot_timeseries(axes, x, y, color, xlabel, ylabel):

    # Plot the inputs x, y in the provided color
    axes.plot(x, y, color=color)

    # Set the x-axis label
    axes.set_xlabel(xlabel)

    # Set the y-axis label
    axes.set_ylabel(ylabel, color=color)

    # Set the colors tick params for y-axis
    axes.tick_params('y', colors=color)
____________________________________________________
6. Using a plotting function
fig, ax = plt.subplots()

# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change['co2'], "blue", 'Time (years)', 'CO2 levels')

# Create a twin Axes object that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature data in red (on ax2, the twin Axes)
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], "red", "Time (years)", "Relative temperature (Celsius)")

plt.show()
____________________________________________________
7. Annotating a plot of time-series data
fig, ax = plt.subplots()

# Plot the relative temperature data
ax.plot(climate_change.index, climate_change['relative_temp'])

# Annotate the date at which temperatures exceeded 1 degree
ax.annotate('>1 degree', (pd.Timestamp('2015-10-06'), 1))

plt.show()
____________________________________________________
8. Plotting time-series: putting it all together
fig, ax = plt.subplots()

# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change['co2'], 'blue', "Time (years)", "CO2 levels")

# Create an Axes object that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature data in red
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], 'red', 'Time (years)', 'Relative temp (Celsius)')

# Annotate point with relative temperature >1 degree
ax2.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1),
             xytext=(pd.Timestamp('2008-10-06'), -0.2),
             arrowprops={'arrowstyle': '->', 'color': 'gray'})

plt.show()
____________________________________________________
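Aside - a self-contained recap of the twin-axes pattern from exercises 4, 6, and 8: twinx() adds a second y-axis that shares the x-axis, so two differently scaled series can overlay. The numbers below are illustrative, not course data:

import matplotlib.pyplot as plt

years = [2000, 2010, 2020]
fig, ax = plt.subplots()
ax.plot(years, [370, 390, 414], color='blue')    # a CO2-like series
ax.set_ylabel("CO2 (ppm)", color='blue')

ax2 = ax.twinx()                                 # second y-axis, shared x-axis
ax2.plot(years, [0.4, 0.7, 1.0], color='red')    # a temperature-like series
ax2.set_ylabel("Relative temp (C)", color='red')
plt.show()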
/03 - Introduction to Data Visualization using Matplotlib/Chapter 3 - Quantitative Comparisions and statistical visualizations.txt:
--------------------------------------------------------------------------------
1. Bar chart
fig, ax = plt.subplots()

# Plot a bar-chart of gold medals as a function of country
ax.bar(medals.index, medals['Gold'])

# Set the x-axis tick labels to the country names
ax.set_xticklabels(medals.index, rotation=90)

# Set the y-axis label
ax.set_ylabel("Number of medals")

plt.show()
____________________________________________________
2. Stacked bar chart
# Add bars for "Gold" with the label "Gold"
ax.bar(medals.index, medals['Gold'], label='Gold')

# Stack bars for "Silver" on top with label "Silver"
ax.bar(medals.index, medals['Silver'], bottom=medals['Gold'], label="Silver")

# Stack bars for "Bronze" on top of that with label "Bronze"
ax.bar(medals.index, medals['Bronze'], bottom=medals['Gold'] + medals['Silver'], label='Bronze')

# Display the legend
ax.legend()

plt.show()
____________________________________________________
3. Creating histograms
fig, ax = plt.subplots()

# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"])

# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"])

# Set the x-axis label to "Weight (kg)"
ax.set_xlabel("Weight (kg)")

# Set the y-axis label to "# of observations"
ax.set_ylabel("# of observations")

plt.show()
____________________________________________________
4. "Step" histogram
fig, ax = plt.subplots()

# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"], label='Rowing', bins=5, histtype='step')

# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"], label='Gymnastics', bins=5, histtype='step')

ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")

# Add the legend and show the Figure
ax.legend()
plt.show()
____________________________________________________
5. Adding error-bars to a bar chart
fig, ax = plt.subplots()

# Add a bar for the rowing "Height" column mean/std
ax.bar("Rowing", mens_rowing["Height"].mean(), yerr=mens_rowing["Height"].std())

# Add a bar for the gymnastics "Height" column mean/std
ax.bar("Gymnastics", mens_gymnastics["Height"].mean(), yerr=mens_gymnastics["Height"].std())

# Label the y-axis
ax.set_ylabel("Height (cm)")

plt.show()
____________________________________________________
6. Adding error-bars to a plot
fig, ax = plt.subplots()

# Add Seattle temperature data in each month with error bars
ax.errorbar(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"],
            yerr=seattle_weather["MLY-TAVG-STDDEV"])

# Add Austin temperature data in each month with error bars
ax.errorbar(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"],
            yerr=austin_weather["MLY-TAVG-STDDEV"])

# Set the y-axis label
ax.set_ylabel("Temperature (Fahrenheit)")

plt.show()
____________________________________________________
7. Creating boxplots
fig, ax = plt.subplots()

# Add a boxplot for the "Height" column in the DataFrames
ax.boxplot([mens_rowing["Height"], mens_gymnastics["Height"]])

# Add x-axis tick labels:
ax.set_xticklabels(["Rowing", "Gymnastics"])

# Add a y-axis label
ax.set_ylabel("Height (cm)")

plt.show()
____________________________________________________
8. Simple scatter plot
fig, ax = plt.subplots()

# Add data: "co2" on x-axis, "relative_temp" on y-axis
ax.scatter(climate_change["co2"], climate_change["relative_temp"])

# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")

# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")

plt.show()
____________________________________________________
9. Encoding time by color
fig, ax = plt.subplots()

# Add data: "co2", "relative_temp" as x-y, index as color
ax.scatter(climate_change["co2"], climate_change["relative_temp"], c=climate_change.index)

# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")

# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")

plt.show()
____________________________________________________

/03 - Introduction to Data Visualization using Matplotlib/Chapter 4 - sharing visualizations with others.txt:
--------------------------------------------------------------------------------
1. Switching between styles
# Use the "ggplot" style and create new Figure/Axes
plt.style.use('ggplot')
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
plt.show()
____________________________________________________
2. Saving a file several times
fig.savefig('my_figure.png')
fig.savefig('my_figure_300dpi.png', dpi=300)
____________________________________________________
3. Save a figure with different sizes
fig.set_size_inches([3, 5])
fig.savefig("figure_3_5.png")
____________________________________________________
4. Unique values of a column
# Extract the "Sport" column
sports_column = summer_2016_medals['Sport']

# Find the unique values of the "Sport" column
sports = sports_column.unique()

# Print out the unique sports values
print(sports)
____________________________________________________
5. Automate your visualization
fig, ax = plt.subplots()

# Loop over the different sports branches
for sport in sports:
    # Extract the rows only for this sport
    sport_df = summer_2016_medals[summer_2016_medals["Sport"] == sport]
    # Add a bar for the "Weight" mean with std y error bar
    ax.bar(sport, sport_df["Weight"].mean(), yerr=sport_df["Weight"].std())

ax.set_ylabel("Weight")
ax.set_xticklabels(sports, rotation=90)

# Save the figure to file
fig.savefig("sports_weights.png")
____________________________________________________
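Aside - a self-contained sketch combining exercises 2-3 above: set the Figure size first, then save; dpi controls the pixel density of the output file (the plotted numbers are illustrative only):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

fig.set_size_inches([3, 5])               # width, height in inches
fig.savefig("figure_3_5.png", dpi=300)    # roughly (3*300) x (5*300) pixels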
/03 - Introduction to Data Visualization using Matplotlib/contents:
--------------------------------------------------------------------------------

/04 - Introduction to Data Visualization with Seaborn/Chapter 1 - Introduction to Seaborn.txt:
--------------------------------------------------------------------------------
1. Making a scatter plot with lists
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# gdp and percent_literate are pre-loaded lists in the course environment

# Change this scatter plot to have percent literate on the y-axis
sns.scatterplot(x=gdp, y=percent_literate)

# Show plot
plt.show()
____________________________________________________
2. Making a count plot with a list
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create count plot with region on the y-axis (region is a pre-loaded list)
sns.countplot(y=region)

# Show plot
plt.show()
____________________________________________________
3. "Tidy" vs. "untidy" data
# Import Pandas
import pandas as pd

# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)

# Print the head of df
print(df.head())
____________________________________________________
4. Making a count plot with a DataFrame
# Import Matplotlib, Pandas, and Seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)

# Create a count plot with "Spiders" on the x-axis
sns.countplot(x="Spiders", data=df)

# Display the plot
plt.show()
____________________________________________________
5. Hue and scatter plots
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Change the legend order in the scatter plot
sns.scatterplot(x="absences", y="G3", data=student_data,
                hue="location", hue_order=["Rural", "Urban"])

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/Chapter 2 - Visualization two quantitative variables.txt:
--------------------------------------------------------------------------------
1. Creating subplots with col and row
# Change this scatter plot to arrange the plots in rows instead of columns
sns.relplot(x="absences", y="G3",
            data=student_data,
            kind="scatter",
            row="study_time")

# Show plot
plt.show()
____________________________________________________
2. Creating two-factor subplots
# Adjust further to add subplots based on family support
sns.relplot(x="G1", y="G3",
            data=student_data,
            kind="scatter",
            col="schoolsup",
            row='famsup',
            col_order=["yes", "no"],
            row_order=['yes', 'no'])

# Show plot
plt.show()
____________________________________________________
3. Changing the size of scatter plot points
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot of horsepower vs. mpg
sns.relplot(x="horsepower", y="mpg",
            data=mpg, kind="scatter",
            size="cylinders",
            hue='cylinders')

# Show plot
plt.show()
____________________________________________________
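Aside - a self-contained sketch of the relplot() faceting from exercises 1-2 above, using the tips dataset that ships with seaborn so it runs without the course's student_data:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # bundled with seaborn

# col/row create one subplot per level of the given categorical column
sns.relplot(x="total_bill", y="tip", data=tips,
            kind="scatter", col="time", row="smoker")
plt.show()
____________________________________________________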
4. Changing the style of scatter plot points
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create a scatter plot of acceleration vs. mpg
sns.relplot(x='acceleration', y='mpg', data=mpg,
            kind='scatter', style='origin', hue='origin')

# Show plot
plt.show()
____________________________________________________
5. Interpreting line plots
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create line plot
sns.relplot(x='model_year', y='mpg', data=mpg, kind='line')

# Show plot
plt.show()
____________________________________________________
6. Visualizing standard deviation with line plots
# Make the shaded area show the standard deviation
sns.relplot(x="model_year", y="mpg", data=mpg, kind="line", ci='sd')

# Show plot
plt.show()
____________________________________________________
7. Plotting subgroups in line plots
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Add markers and make each line have the same style
sns.relplot(x="model_year", y="horsepower",
            data=mpg, kind="line", ci=None, style="origin", hue="origin",
            markers=True, dashes=False)

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/Chapter 3 - Visualization a categorical and a quantitative variables.txt:
--------------------------------------------------------------------------------
1. Count plots
# Create column subplots based on age category
sns.catplot(y="Internet usage", data=survey_data, kind="count", col='Age Category')
plt.tight_layout()

# Show plot
plt.show()
____________________________________________________
2. Bar plots with percentages
# Create a bar plot of interest in math, separated by gender
sns.catplot(x='Gender', y='Interested in Math', data=survey_data, kind='bar')

# Show plot
plt.show()
____________________________________________________
3. Customizing bar plots
# Turn off the confidence intervals
sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar",
            order=["<2 hours",
                   "2 to 5 hours",
                   "5 to 10 hours",
                   ">10 hours"],
            ci=None)

# Show plot
plt.show()
____________________________________________________
4. Create and interpret a box plot
# Specify the category ordering
study_time_order = ["<2 hours", "2 to 5 hours",
                    "5 to 10 hours", ">10 hours"]

# Create a box plot and set the order of the categories
sns.catplot(x='study_time', y='G3',
            data=student_data,
            kind='box',
            order=study_time_order)

# Show plot
plt.show()
____________________________________________________
5. Omitting outliers
# Create a box plot with subgroups and omit the outliers
sns.catplot(x='internet', y='G3',
            data=student_data,
            kind='box',
            hue='location',
            sym='')

# Show plot
plt.show()
____________________________________________________
6. Adjusting the whiskers
# Set the whiskers at the min and max values
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[0, 100])

# Show plot
plt.show()
____________________________________________________
7. Customizing point plots
# Remove the lines joining the points
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2,
            join=False)

# Show plot
plt.show()
____________________________________________________
8. Point plots with subgroups
# Import median function from numpy
from numpy import median

# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None,
            estimator=median)

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/Chapter 4 - customizing seaborn plots.txt:
--------------------------------------------------------------------------------
1. Changing style and palette
# Change the color palette to "RdBu"
sns.set_style("whitegrid")
sns.set_palette("RdBu")

# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes",
                  "Often", "Always"]

sns.catplot(x="Parents Advice",
            data=survey_data,
            kind="count",
            order=category_order)

# Show plot
plt.show()
____________________________________________________
2. Changing the scale
# Change the context to "poster"
sns.set_context("poster")

# Create bar plot
sns.catplot(x="Number of Siblings", y="Feels Lonely",
            data=survey_data, kind="bar")

# Show plot
plt.show()
____________________________________________________
3. Using a custom palette
# Set the context to "notebook"
sns.set_context("notebook")

# Set the style to "darkgrid"
sns.set_style('darkgrid')

# Set a custom color palette
sns.set_palette(['#39A7D0', '#36ADA4'])

# Create the box plot of age distribution by gender
sns.catplot(x="Gender", y="Age",
            data=survey_data, kind="box")

# Show plot
plt.show()
____________________________________________________
4. FacetGrids vs. AxesSubplots
# Create scatter plot
g = sns.relplot(x="weight",
                y="horsepower",
                data=mpg,
                kind="scatter")

# Identify plot type
type_of_g = type(g)

# Print type
print(type_of_g)
____________________________________________________
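Aside - a self-contained sketch of the distinction exercise 4 above tests, again on seaborn's bundled tips data: figure-level functions like relplot() return a FacetGrid (title set via g.fig.suptitle), while axes-level functions like scatterplot() return a matplotlib Axes (title set via ax.set_title):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

g = sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter")
print(type(g))                 # seaborn FacetGrid

plt.figure()                   # fresh figure for the axes-level plot
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
print(type(ax))                # matplotlib Axes
plt.show()
____________________________________________________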
5. Adding a title to a FacetGrid object
# Create scatter plot
g = sns.relplot(x="weight",
                y="horsepower",
                data=mpg,
                kind="scatter")

# Add a title "Car Weight vs. Horsepower"
g.fig.suptitle('Car Weight vs. Horsepower')

# Show plot
plt.show()
____________________________________________________
6. Adding a title and axis labels
# Create line plot
g = sns.lineplot(x="model_year", y="mpg_mean",
                 data=mpg_mean,
                 hue="origin")

# Add a title "Average MPG Over Time"
g.set_title("Average MPG Over Time")

# Add x-axis and y-axis labels
g.set(xlabel="Car Model Year",
      ylabel="Average MPG")

# Show plot
plt.show()
____________________________________________________
7. Rotating x-tick labels
# Create point plot
sns.catplot(x="origin",
            y="acceleration",
            data=mpg,
            kind="point",
            join=False,
            capsize=0.1)

# Rotate x-tick labels
plt.xticks(rotation=90)

# Show plot
plt.show()
____________________________________________________
8. Box plot with subgroups
# Set palette to "Blues"
sns.set_palette("Blues")

# Adjust to add subgroups based on "Interested in Pets"
g = sns.catplot(x="Gender",
                y="Age", data=survey_data,
                kind="box", hue='Interested in Pets', aspect=1.5)

# Set title to "Age of Those Interested in Pets vs. Not"
g.fig.suptitle("Age of Those Interested in Pets vs. Not")

# Show plot
plt.show()
____________________________________________________
9. Bar plot with subgroups and subplots
# Set the figure style to "dark"
plt.style.use('seaborn')
sns.set_style('dark')

# Adjust to add subplots per gender
g = sns.catplot(x="Village - town", y="Likes Techno",
                data=survey_data, kind="bar",
                col='Gender')

# Add title and axis labels
g.fig.suptitle("Percentage of Young People Who Like Techno", y=1.02)
g.set(xlabel="Location of Residence",
      ylabel="% Who Like Techno")

# Show plot
plt.show()
____________________________________________________

/04 - Introduction to Data Visualization with Seaborn/key points:
--------------------------------------------------------------------------------

/05 - Python Data Science Toolbox (Part 1)/Chapter 1 - Writing your own functions .txt:
--------------------------------------------------------------------------------
1. Write a simple function
# Define the function shout
def shout():
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = 'congratulations' + '!!!'

    # Print shout_word
    print(shout_word)

# Call shout
shout()
____________________________________________________
2. Single-parameter functions
# Define shout with the parameter, word
def shout(word):
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + '!!!'

    # Print shout_word
    print(shout_word)

# Call shout with the string 'congratulations'
shout('congratulations')
____________________________________________________
3. Functions that return single values
# Define shout with the parameter, word
def shout(word):
    """Return a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + '!!!'

    # Replace print with return
    return shout_word

# Pass 'congratulations' to shout: yell
yell = shout('congratulations')

# Print yell
print(yell)
____________________________________________________
4. Functions with multiple parameters
# Define shout with parameters word1 and word2
def shout(word1, word2):
    """Concatenate strings with three exclamation marks"""
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + '!!!'

    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + '!!!'

    # Concatenate shout1 with shout2: new_shout
    new_shout = shout1 + shout2

    # Return new_shout
    return new_shout

# Pass 'congratulations' and 'you' to shout: yell
yell = shout('congratulations', 'you')

# Print yell
print(yell)
____________________________________________________
5. A brief introduction to tuples
# Unpack nums into num1, num2, and num3
num1, num2, num3 = nums

# Construct even_nums
even_nums = (2, num2, num3)
____________________________________________________
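Aside - a standalone sketch expanding on exercise 5: tuples are immutable, so you "change" one by unpacking it and building a new tuple:

nums = (3, 4, 6)
num1, num2, num3 = nums        # unpacking: one name per element
print(num1, num2, num3)

even_nums = (2, num2, num3)    # a new tuple; the original is untouched
print(even_nums)
print(nums)                    # still (3, 4, 6)
____________________________________________________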
6. Functions that return multiple values
# Define shout_all with parameters word1 and word2
def shout_all(word1, word2):
    """Return a tuple of strings"""
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + '!!!'

    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + '!!!'

    # Construct a tuple with shout1 and shout2: shout_words
    shout_words = (shout1, shout2)

    # Return shout_words
    return shout_words

# Pass 'congratulations' and 'you' to shout_all(): yell1, yell2
yell1, yell2 = shout_all('congratulations', 'you')

# Print yell1 and yell2
print(yell1)
print(yell2)
____________________________________________________
7. Bringing it all together (1)
# Import pandas
import pandas as pd

# Import Twitter data as DataFrame: df
df = pd.read_csv('tweets.csv')

# Initialize an empty dictionary: langs_count
langs_count = {}

# Extract column from DataFrame: col
col = df['lang']

# Iterate over lang column in DataFrame
for entry in col:

    # If the language is in langs_count, add 1
    if entry in langs_count.keys():
        langs_count[entry] += 1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry] = 1

# Print the populated dictionary
print(langs_count)
____________________________________________________
8. Bringing it all together (2)
# Define count_entries()
def count_entries(df, col_name):
    """Return a dictionary with counts of
    occurrences as value for each key."""

    # Initialize an empty dictionary: langs_count
    langs_count = {}

    # Extract column from DataFrame: col
    col = df[col_name]

    # Iterate over lang column in DataFrame
    for entry in col:

        # If the language is in langs_count, add 1
        if entry in langs_count.keys():
            langs_count[entry] += 1
        # Else add the language to langs_count, set the value to 1
        else:
            langs_count[entry] = 1

    # Return the langs_count dictionary
    return langs_count

# Call count_entries(): result (tweets_df is pre-loaded in the course environment)
result = count_entries(tweets_df, 'lang')

# Print the result
print(result)
____________________________________________________

/05 - Python Data Science Toolbox (Part 1)/key points:
--------------------------------------------------------------------------------

/06 - Python Data Science Toolbox (Part 2)/Chapter 2 - List Comprehensions and Generators.txt:
--------------------------------------------------------------------------------
1. Writing list comprehensions
# Create list comprehension: squares
squares = [i ** 2 for i in range(0, 10)]
____________________________________________________
2. Nested list comprehensions
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix
for row in matrix:
    print(row)
____________________________________________________
3. Using conditionals in comprehensions (1)
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]

# Print the new list
print(new_fellowship)
____________________________________________________
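Aside - a standalone sketch of what the comprehension in exercise 3 above is shorthand for; the loop form below produces exactly the same list:

fellowship = ['frodo', 'samwise', 'merry', 'aragorn',
              'legolas', 'boromir', 'gimli']

new_fellowship = []
for member in fellowship:
    if len(member) >= 7:             # the trailing "if" of the comprehension
        new_fellowship.append(member)
print(new_fellowship)
____________________________________________________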
_________________________________________________________________________________ 23 | 4.Using conditionals in comprehensions (2) 24 | # Create a list of strings: fellowship 25 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 26 | 27 | # Create list comprehension: new_fellowship 28 | new_fellowship = [member if len(member) >= 7 else member.replace( 29 | member, '') for member in fellowship] 30 | 31 | # Print the new list 32 | print(new_fellowship) 33 | _________________________________________________________________________________ 34 | 5.Dict comprehensions 35 | # Create a list of strings: fellowship 36 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 37 | 38 | # Create dict comprehension: new_fellowship 39 | new_fellowship = {member: len(member) for member in fellowship} 40 | 41 | # Print the new dictionary 42 | print(new_fellowship) 43 | _________________________________________________________________________________ 44 | 6.Write your own generator expressions 45 | # Create generator object: result 46 | result = (num for num in range(31)) 47 | 48 | # Print the first 5 values 49 | print(next(result)) 50 | print(next(result)) 51 | print(next(result)) 52 | print(next(result)) 53 | print(next(result)) 54 | 55 | # Print the rest of the values 56 | for value in result: 57 | print(value) 58 | _________________________________________________________________________________ 59 | 7.Changing the output in generator expressions 60 | # Create a list of strings: lannister 61 | lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey'] 62 | 63 | # Create a generator object: lengths 64 | lengths = (len(person) for person in lannister) 65 | 66 | # Iterate over and print the values in lengths 67 | for value in lengths: 68 | print(value) 69 | _________________________________________________________________________________ 70 | 8.Build a generator 71 | # Create a list of strings 72 | lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey'] 73 | 74 | # Define generator function get_lengths 75 | def get_lengths(input_list): 76 | """Generator function that yields the 77 | length of the strings in input_list.""" 78 | 79 | # Yield the length of a string 80 | for person in input_list: 81 | yield len(person) 82 | 83 | # Print the values generated by get_lengths() 84 | for value in get_lengths(lannister): 85 | print(value) 86 | _________________________________________________________________________________ 87 | 9.List comprehensions for time-stamped data 88 | # Extract the created_at column from df: tweet_time 89 | tweet_time = df['created_at'] 90 | 91 | # Extract the clock time: tweet_clock_time 92 | tweet_clock_time = [entry[11:19] for entry in tweet_time] 93 | 94 | # Print the extracted times 95 | print(tweet_clock_time) 96 | _________________________________________________________________________________ 97 | 10.Conditional list comprehensions for time-stamped data 98 | # Extract the created_at column from df: tweet_time 99 | tweet_time = df['created_at'] 100 | 101 | # Extract the clock time: tweet_clock_time 102 | tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19'] 103 | 104 | # Print the extracted times 105 | print(tweet_clock_time) 106 | _________________________________________________________________________________ -------------------------------------------------------------------------------- /06 - Python Data Science Toolbox (Part 2)/key points: 
-------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/Chapter 1 - seaborn introduction .txt: -------------------------------------------------------------------------------- 1 | 1.Reading a csv file 2 | # import all modules 3 | import pandas as pd 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | # Read in the DataFrame 8 | df = pd.read_csv(grant_file) 9 | ______________________________________________________________________________ 10 | 2.Comparing a histogram and distplot 11 | # Display pandas histogram 12 | df['Award_Amount'].plot.hist() 13 | plt.show() 14 | 15 | # Clear out the pandas histogram 16 | plt.clf() 17 | 18 | # Display a Seaborn distplot 19 | sns.distplot(df['Award_Amount']) 20 | plt.show() 21 | 22 | # Clear the distplot 23 | plt.clf() 24 | ______________________________________________________________________________ 25 | 3.Plot a histogram 26 | # Create a distplot 27 | sns.distplot(df['Award_Amount'], 28 | kde=False, 29 | bins=20) 30 | 31 | # Display the plot 32 | plt.show() 33 | ______________________________________________________________________________ 34 | 4.Rug plot and kde shading 35 | # Create a distplot of the Award Amount 36 | sns.distplot(df['Award_Amount'], 37 | hist=False, 38 | rug=True, 39 | kde_kws={'shade':True}) 40 | 41 | # Plot the results 42 | plt.show() 43 | ______________________________________________________________________________ 44 | 5.Create a regression plot 45 | # Create a regression plot of premiums vs. insurance_losses 46 | sns.regplot(x="insurance_losses",y = "premiums", data = df) 47 | 48 | # Display the plot 49 | plt.show() 50 | 51 | # Create an lmplot of premiums vs. 
insurance_losses 52 | sns.lmplot(x="insurance_losses",y="premiums",data=df) 53 | 54 | # Display the second plot 55 | plt.show() 56 | ______________________________________________________________________________ 57 | 6.Plotting multiple variables 58 | # Create a regression plot using hue 59 | sns.lmplot(data=df, 60 | x="insurance_losses", 61 | y="premiums", 62 | hue="Region") 63 | 64 | # Show the results 65 | plt.show() 66 | ______________________________________________________________________________ 67 | 7.Facetting multiple regressions 68 | # Create a regression plot with multiple rows 69 | sns.lmplot(data=df, 70 | x="insurance_losses", 71 | y="premiums", 72 | row="Region") 73 | 74 | # Show the plot 75 | plt.show() 76 | ______________________________________________________________________________ -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/Chapter 2 - Customizing Seaborn plots.txt: -------------------------------------------------------------------------------- 1 | 1.Setting the default style 2 | # Plot the pandas histogram 3 | df['fmr_2'].plot.hist() 4 | plt.show() 5 | plt.clf() 6 | 7 | # Set the default seaborn style 8 | sns.set() 9 | 10 | # Plot the pandas histogram again 11 | df['fmr_2'].plot.hist() 12 | plt.show() 13 | plt.clf() 14 | _____________________________________________________________________________ 15 | 2.Comparing styles 16 | # Plot with a dark style 17 | sns.set_style('dark') 18 | sns.distplot(df['fmr_2']) 19 | plt.show() 20 | 21 | # Clear the figure 22 | plt.clf() 23 | /********/ 24 | sns.set_style('whitegrid') 25 | sns.distplot(df['fmr_2']) 26 | plt.show() 27 | 28 | # Clear the figure 29 | plt.clf() 30 | _____________________________________________________________________________ 31 | 3.Removing spines 32 | # Set the style to white 33 | sns.set_style('white') 34 | 35 | # Create a regression plot 36 | sns.lmplot(data=df, 37 | x='pop2010', 38 | y='fmr_2') 39 | 40 | # Remove the spines 41 | sns.despine() 42 | 43 | # Show the plot and clear the figure 44 | plt.show() 45 | plt.clf() 46 | _____________________________________________________________________________ 47 | 4.Matplotlib color codes 48 | # Set style, enable color code, and create a magenta distplot 49 | sns.set(color_codes=True) 50 | sns.distplot(df['fmr_3'], color='m') 51 | 52 | # Show the plot 53 | plt.show() 54 | _____________________________________________________________________________ 55 | 5.Using default palettes 56 | # Loop through differences between bright and colorblind palettes 57 | for p in ['bright', 'colorblind']: 58 | sns.set_palette(p) 59 | sns.distplot(df['fmr_3']) 60 | plt.show() 61 | 62 | # Clear the plots 63 | plt.clf() 64 | _____________________________________________________________________________ 65 | 6.Creating Custom Palettes 66 | # Create the Purples palette with 8 colors 67 | sns.palplot(sns.color_palette("Purples", 8)) 68 | plt.show() 69 | /*************/ 70 | sns.palplot(sns.color_palette("husl", 10)) 71 | plt.show() 72 | /*************/ 73 | sns.palplot(sns.color_palette("coolwarm", 6)) 74 | plt.show() 75 | _____________________________________________________________________________ 76 | 7.Using matplotlib axes 77 | # Create a figure and axes 78 | fig, ax = plt.subplots() 79 | 80 | # Plot the distribution of data 81 | sns.distplot(df['fmr_3'], ax=ax) 82 | 83 | # Create a more descriptive x axis label 84 | ax.set(xlabel="3 Bedroom Fair Market Rent") 85 | 86 | # Show the plot 87 | 
plt.show() 88 | _____________________________________________________________________________ 89 | 8.Additional plot customizations 90 | # Create a figure and axes 91 | fig, ax = plt.subplots() 92 | 93 | # Plot the distribution of 1 bedroom rents 94 | sns.distplot(df['fmr_1'], ax=ax) 95 | 96 | # Modify the properties of the plot 97 | ax.set(xlabel="1 Bedroom Fair Market Rent", 98 | xlim=(100,1500), 99 | title="US Rent") 100 | 101 | # Display the plot 102 | plt.show() 103 | _____________________________________________________________________________ 104 | 9.Adding annotations 105 | # Create a figure and axes. Then plot the data 106 | fig, ax = plt.subplots() 107 | sns.distplot(df['fmr_1'], ax=ax) 108 | 109 | # Customize the labels and limits 110 | ax.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100,1500), title="US Rent") 111 | 112 | # Add vertical lines for the median and mean 113 | ax.axvline(x=634.0, color='m', label='Median', linestyle='--', linewidth=2) 114 | ax.axvline(x=706.3254351016984, color='b', label='Mean', linestyle='-', linewidth=2) 115 | 116 | # Show the legend and plot the data 117 | ax.legend() 118 | plt.show() 119 | _____________________________________________________________________________ 120 | 10.Multiple plots 121 | # Create a plot with 1 row and 2 columns that share the y axis label 122 | fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True) 123 | 124 | # Plot the distribution of 1 bedroom apartments on ax0 125 | sns.distplot(df['fmr_1'], ax=ax0) 126 | ax0.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100,1500)) 127 | 128 | # Plot the distribution of 2 bedroom apartments on ax1 129 | sns.distplot(df['fmr_2'], ax=ax1) 130 | ax1.set(xlabel="2 Bedroom Fair Market Rent", xlim=(100,1500)) 131 | 132 | # Display the plot 133 | plt.show() 134 | _____________________________________________________________________________ -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/Chapter 3 -additional plot types.txt: -------------------------------------------------------------------------------- 1 | 1.stripplot() and swarmplot() 2 | # Create the stripplot 3 | sns.stripplot(data=df, 4 | x='Award_Amount', 5 | y='Model Selected', 6 | jitter=True) 7 | 8 | plt.show() 9 | /*********/ 10 | # Create and display a swarmplot with hue set to the Region 11 | sns.swarmplot(data=df, 12 | x='Award_Amount', 13 | y='Model Selected', 14 | hue='Region') 15 | 16 | plt.show() 17 | ___________________________________________________________________ 18 | 2.boxplots, violinplots and lvplots 19 | # Create a boxplot 20 | sns.boxplot(data=df, 21 | x='Award_Amount', 22 | y='Model Selected') 23 | 24 | plt.show() 25 | plt.clf() 26 | /***************/ 27 | # Create a violinplot with the husl palette 28 | sns.violinplot(data=df, 29 | x='Award_Amount', 30 | y='Model Selected', 31 | palette='husl') 32 | 33 | plt.show() 34 | plt.clf() 35 | /****************/ 36 | # Create a lvplot with the Paired palette and the Region column as the hue 37 | sns.lvplot(data=df, 38 | x='Award_Amount', 39 | y='Model Selected', 40 | palette='Paired', 41 | hue='Region') 42 | 43 | plt.show() 44 | plt.clf() 45 | ___________________________________________________________________ 46 | 3.Regression and residual plots 47 | # Display a regression plot for Tuition 48 | sns.regplot(data=df, 49 | y='Tuition', 50 | x="SAT_AVG_ALL", 51 | marker='^', 52 | color='g') 53 | 54 | plt.show() 55 | plt.clf() 56 | /**************/ 57 | # Display the residual plot 
58 | sns.residplot(data=df, 59 | y='Tuition', 60 | x="SAT_AVG_ALL", 61 | color='g') 62 | 63 | plt.show() 64 | plt.clf() 65 | ___________________________________________________________________ 66 | 4.Regression plot parameters 67 | # Plot a regression plot of Tuition and the Percentage of Pell Grants 68 | sns.regplot(data=df, 69 | y='Tuition', 70 | x="PCTPELL") 71 | 72 | plt.show() 73 | plt.clf() 74 | /**************/ 75 | # Create another plot that estimates the tuition by PCTPELL 76 | sns.regplot(data=df, 77 | y='Tuition', 78 | x='PCTPELL', 79 | x_bins=5) 80 | 81 | plt.show() 82 | plt.clf() 83 | /****************/ 84 | # The final plot should include a line using a 2nd order polynomial 85 | sns.regplot(data=df, 86 | y='Tuition', 87 | x="PCTPELL", 88 | x_bins=5, 89 | order=2) 90 | 91 | plt.show() 92 | plt.clf() 93 | ___________________________________________________________________ 94 | 5.Binning data 95 | # Create a scatter plot by disabling the regression line 96 | sns.regplot(data=df, 97 | y='Tuition', 98 | x="UG", 99 | fit_reg=False) 100 | 101 | plt.show() 102 | plt.clf() 103 | /************/ 104 | # Create a scatter plot and bin the data into 5 bins 105 | sns.regplot(data=df, 106 | y='Tuition', 107 | x="UG", 108 | x_bins=5) 109 | 110 | plt.show() 111 | plt.clf() 112 | /************/ 113 | # Create a regplot and bin the data into 8 bins 114 | sns.regplot(data=df, 115 | y='Tuition', 116 | x="UG", 117 | x_bins=8) 118 | 119 | plt.show() 120 | plt.clf() 121 | ___________________________________________________________________ 122 | 6.Creating heatmaps 123 | # Create a crosstab table of the data 124 | pd_crosstab = pd.crosstab(df["Group"], df["YEAR"]) 125 | print(pd_crosstab) 126 | 127 | # Plot a heatmap of the table 128 | sns.heatmap(pd_crosstab) 129 | 130 | # Rotate tick marks for visibility 131 | plt.yticks(rotation=0) 132 | plt.xticks(rotation=90) 133 | 134 | plt.show() 135 | ___________________________________________________________________ 136 | 7.Customizing heatmaps 137 | # Create the crosstab DataFrame 138 | pd_crosstab = pd.crosstab(df["Group"], df["YEAR"]) 139 | 140 | # Plot a heatmap of the table with no color bar and using the BuGn palette 141 | sns.heatmap(pd_crosstab, cbar=False, cmap="BuGn", linewidths=0.3) 142 | 143 | # Rotate tick marks for visibility 144 | plt.yticks(rotation=0) 145 | plt.xticks(rotation=90) 146 | 147 | #Show the plot 148 | plt.show() 149 | plt.clf() 150 | ___________________________________________________________________ -------------------------------------------------------------------------------- /07 - Intermediate Data Visualization with Seaborn/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /08 - Introduction to Import data in python/Chapter 1 - introduction and flat files 1.txt: -------------------------------------------------------------------------------- 1 | 1.Importing entire text files 2 | # Open a file: file 3 | file = open('moby_dick.txt', mode='r') 4 | 5 | # Print it 6 | print(file.read()) 7 | 8 | # Check whether file is closed 9 | print(file.closed) 10 | 11 | # Close file 12 | file.close() 13 | 14 | # Check whether file is closed 15 | print(file.closed) 16 | 17 | _______________________________________________________________________________ 18 | 2.Importing text files line by line 19 | # Read & print the first 3 lines 20 | with open('moby_dick.txt') as file: 21 | print(file.readline()) 22 | 
print(file.readline()) 23 | print(file.readline()) 24 | 25 | _______________________________________________________________________________ 26 | 3.Using NumPy to import flat files 27 | # Import package 28 | import numpy as np 29 | 30 | # Assign filename to variable: file 31 | file = 'digits.csv' 32 | 33 | # Load file as array: digits 34 | digits = np.loadtxt(file, delimiter=',') 35 | 36 | # Print datatype of digits 37 | print(type(digits)) 38 | 39 | # Select and reshape a row 40 | im = digits[21, 1:] 41 | im_sq = np.reshape(im, (28, 28)) 42 | 43 | # Plot reshaped data (matplotlib.pyplot already loaded as plt) 44 | plt.imshow(im_sq, cmap='Greys', interpolation='nearest') 45 | plt.show() 46 | 47 | _______________________________________________________________________________ 48 | 4.Customizing your NumPy import 49 | # Import numpy 50 | import numpy as np 51 | 52 | # Assign the filename: file 53 | file = 'digits_header.txt' 54 | 55 | # Load the data: data 56 | data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2]) 57 | 58 | # Print data 59 | print(data) 60 | 61 | _______________________________________________________________________________ 62 | 5.Importing different datatypes 63 | # Assign filename: file 64 | file = 'seaslug.txt' 65 | 66 | # Import file: data 67 | data = np.loadtxt(file, delimiter='\t', dtype=str) 68 | 69 | # Print the first element of data 70 | print(data[0]) 71 | 72 | # Import data as floats and skip the first row: data_float 73 | data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1) 74 | 75 | # Print the 10th element of data_float 76 | print(data_float[9]) 77 | 78 | # Plot a scatterplot of the data 79 | plt.scatter(data_float[:, 0], data_float[:, 1]) 80 | plt.xlabel('time (min.)') 81 | plt.ylabel('percentage of larvae') 82 | plt.show() 83 | 84 | _______________________________________________________________________________ 85 | 6.Working with mixed datatypes (2) 86 | # Assign the filename: file 87 | file = 'titanic.csv' 88 | 89 | # Import file using np.recfromcsv: d 90 | d = np.recfromcsv(file) 91 | 92 | # Print out first three entries of d 93 | print(d[:3]) 94 | 95 | _______________________________________________________________________________ 96 | 7.Using pandas to import flat files as DataFrames (1) 97 | # Import pandas 98 | import pandas as pd 99 | 100 | # Assign the filename: file 101 | file = 'titanic.csv' 102 | 103 | # Read the file into a DataFrame: df 104 | df = pd.read_csv(file) 105 | 106 | # View the head of the DataFrame 107 | print(df.head()) 108 | 109 | _______________________________________________________________________________ 110 | 8.Using pandas to import flat files as DataFrames (2) 111 | # Assign the filename: file 112 | file = 'digits.csv' 113 | 114 | # Read the first 5 rows of the file into a DataFrame: data 115 | data = pd.read_csv(file, nrows=5, header=None) 116 | 117 | # Build a numpy array from the DataFrame: data_array 118 | data_array = data.values 119 | 120 | # Print the datatype of data_array to the shell 121 | print(type(data_array)) 122 | 123 | _______________________________________________________________________________ 124 | 9.Customizing your pandas import 125 | # Import matplotlib.pyplot as plt 126 | import matplotlib.pyplot as plt 127 | 128 | # Assign filename: file 129 | file = 'titanic_corrupt.txt' 130 | 131 | # Import file: data 132 | data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing']) 133 | 134 | # Print the head of the DataFrame 135 | print(data.head()) 136 | 137 | # Plot 
'Age' variable in a histogram 138 | pd.DataFrame.hist(data[['Age']]) 139 | plt.xlabel('Age (years)') 140 | plt.ylabel('count') 141 | plt.show() 142 | 143 | _______________________________________________________________________________ -------------------------------------------------------------------------------- /08 - Introduction to Import data in python/Chapter 2 - importing data from other file types 2.txt: -------------------------------------------------------------------------------- 1 | 1.Loading a pickled file 2 | # Import pickle package 3 | import pickle 4 | 5 | # Open pickle file and load data 6 | with open('data.pkl', 'rb') as file: 7 | d = pickle.load(file) 8 | 9 | # Print data 10 | print(d) 11 | 12 | # Print datatype 13 | print(type(d)) 14 | ________________________________________________________________________________ 15 | 2.Listing sheets in Excel files 16 | # Import pandas 17 | import pandas as pd 18 | 19 | # Assign spreadsheet filename: file 20 | file = 'battledeath.xlsx' 21 | 22 | # Load spreadsheet: xls 23 | xls = pd.ExcelFile(file) 24 | 25 | # Print sheet names 26 | print(xls.sheet_names) 27 | 28 | ________________________________________________________________________________ 29 | 3.Importing sheets from Excel files 30 | # Load a sheet into a DataFrame by name: df1 31 | df1 = xls.parse('2004') 32 | 33 | # Print the head of the DataFrame df1 34 | print(df1.head()) 35 | 36 | # Load a sheet into a DataFrame by index: df2 37 | df2 = xls.parse(0) 38 | 39 | # Print the head of the DataFrame df2 40 | print(df2.head()) 41 | ________________________________________________________________________________ 42 | 4.Customizing your spreadsheet import 43 | # Parse the first sheet and rename the columns: df1 44 | df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)']) 45 | 46 | # Print the head of the DataFrame df1 47 | print(df1.head()) 48 | 49 | # Parse the first column of the second sheet and rename the column: df2 50 | df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country']) 51 | 52 | # Print the head of the DataFrame df2 53 | print(df2.head()) 54 | ________________________________________________________________________________ 55 | 5.Importing SAS files 56 | # Import sas7bdat package 57 | from sas7bdat import SAS7BDAT 58 | 59 | # Save file to a DataFrame: df_sas 60 | with SAS7BDAT('sales.sas7bdat') as file: 61 | df_sas = file.to_data_frame() 62 | 63 | # Print head of DataFrame 64 | print(df_sas.head()) 65 | 66 | # Plot histograms of a DataFrame feature (pandas and pyplot already imported) 67 | pd.DataFrame.hist(df_sas[['P']]) 68 | plt.ylabel('count') 69 | plt.show() 70 | ________________________________________________________________________________ 71 | 6.Importing Stata files 72 | # Import pandas 73 | import pandas as pd 74 | 75 | # Load Stata file into a pandas DataFrame: df 76 | df = pd.read_stata('disarea.dta') 77 | 78 | # Print the head of the DataFrame df 79 | print(df.head()) 80 | 81 | # Plot histogram of one column of the DataFrame 82 | pd.DataFrame.hist(df[['disa10']]) 83 | plt.xlabel('Extent of disease') 84 | plt.ylabel('Number of countries') 85 | plt.show() 86 | ________________________________________________________________________________ 87 | 7.Using h5py to import HDF5 files 88 | # Import packages 89 | import numpy as np 90 | import h5py 91 | 92 | # Assign filename: file 93 | file = 'LIGO_data.hdf5' 94 | 95 | # Load file: data 96 | data = h5py.File(file, 'r') 97 | 98 | # Print the datatype of the loaded file 99 | 
print(type(data)) 100 | 101 | # Print the keys of the file 102 | for key in data.keys(): 103 | print(key) 104 | ________________________________________________________________________________ 105 | 8.Extracting data from your HDF5 file 106 | # Get the HDF5 group: group 107 | group = data['strain'] 108 | 109 | # Check out keys of group 110 | for key in group.keys(): 111 | print(key) 112 | 113 | # Set variable equal to time series data: strain 114 | strain = data['strain']['Strain'].value 115 | 116 | # Set number of time points to sample: num_samples 117 | num_samples = 10000 118 | 119 | # Set time vector 120 | time = np.arange(0, 1, 1/num_samples) 121 | 122 | # Plot data 123 | plt.plot(time, strain[:num_samples]) 124 | plt.xlabel('GPS Time (s)') 125 | plt.ylabel('strain') 126 | plt.show() 127 | ________________________________________________________________________________ 128 | 9.Loading .mat files 129 | # Import package 130 | import scipy.io 131 | 132 | # Load MATLAB file: mat 133 | mat = scipy.io.loadmat('albeck_gene_expression.mat') 134 | 135 | # Print the datatype type of mat 136 | print(type(mat)) 137 | ________________________________________________________________________________ 138 | 10.The structure of .mat in Python 139 | # Print the keys of the MATLAB dictionary 140 | print(mat.keys()) 141 | 142 | # Print the type of the value corresponding to the key 'CYratioCyt' 143 | print(type(mat['CYratioCyt'])) 144 | 145 | # Print the shape of the value corresponding to the key 'CYratioCyt' 146 | print(np.shape(mat['CYratioCyt'])) 147 | 148 | # Subset the array and plot it 149 | data = mat['CYratioCyt'][25, 5:] 150 | fig = plt.figure() 151 | plt.plot(data) 152 | plt.xlabel('time (min.)') 153 | plt.ylabel('normalized fluorescence (measure of expression)') 154 | plt.show() 155 | ________________________________________________________________________________ -------------------------------------------------------------------------------- /08 - Introduction to Import data in python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /09 - Intermediate importing data in python/Chapter 2 - intracting with apis to import data from web.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Anirudh-Chauhan/Data-Scientist-with-Python-DataCamp/2b254aab79c5c7420c9fd96aeeab81a020420a27/09 - Intermediate importing data in python/Chapter 2 - intracting with apis to import data from web.txt -------------------------------------------------------------------------------- /09 - Intermediate importing data in python/Chapter 3 - Diving deep into the twitter api.txt: -------------------------------------------------------------------------------- 1 | 1.API Authentication 2 | # Import package 3 | import tweepy,json 4 | 5 | # Store OAuth authentication credentials in relevant variables 6 | access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy" 7 | access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx" 8 | consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM" 9 | consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i" 10 | 11 | # Pass OAuth details to tweepy's OAuth handler 12 | auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 13 | auth.set_access_token(access_token,access_token_secret) 14 | ___________________________________________________________________________ 15 | 
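Note: exercises 2 and 5 below rely on two helpers that the course defines beforehand and that are not reproduced in these solutions: a MyStreamListener class and a word_in_text() function. Minimal sketches follow; the output file name, the 100-tweet cap, and the exact signatures are assumptions, and the class targets tweepy's pre-4.0 StreamListener API.

import json
import re
import tweepy

class MyStreamListener(tweepy.StreamListener):
    """Write streamed tweets to a file, stopping after an assumed cap of 100."""
    def __init__(self, api=None):
        super().__init__(api)
        self.num_tweets = 0
        self.file = open('tweets.txt', 'w')  # assumed output file

    def on_status(self, status):
        # Store each tweet's raw JSON on its own line
        self.file.write(json.dumps(status._json) + '\n')
        self.num_tweets += 1
        if self.num_tweets >= 100:
            self.file.close()
            return False  # returning False disconnects the stream
        return True

def word_in_text(word, text):
    """Return True if `word` occurs in `text`, ignoring case."""
    return re.search(word.lower(), text.lower()) is not None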
2.Streaming tweets
16 | # Initialize Stream listener
17 | l = MyStreamListener()
18 | 
19 | # Create your Stream object with authentication
20 | stream = tweepy.Stream(auth, l)
21 | 
22 | # Filter Twitter Streams to capture data by the keywords:
23 | s = ['clinton', 'trump', 'sanders', 'cruz']
24 | stream.filter(track = s)
25 | ___________________________________________________________________________
26 | 3.Load and explore your Twitter data
27 | # Import package
28 | import json
30 | 
31 | # String of path to file: tweets_data_path
32 | tweets_data_path = 'tweets.txt'
33 | 
34 | # Initialize empty list to store tweets: tweets_data
35 | tweets_data = []
36 | 
37 | # Open connection to file
38 | tweets_file = open(tweets_data_path, "r")
39 | 
40 | # Read in tweets and store in list: tweets_data
41 | for line in tweets_file:
42 |     tweet = json.loads(line)
43 |     tweets_data.append(tweet)
44 | 
45 | # Close connection to file
46 | tweets_file.close()
47 | 
48 | # Print the keys of the first tweet dict
49 | print(tweets_data[0].keys())
55 | ___________________________________________________________________________
56 | 4.Twitter data to DataFrame
57 | # Import package
58 | import pandas as pd
59 | 
60 | # Build DataFrame of tweet texts and languages
61 | df = pd.DataFrame(tweets_data, columns=['text', 'lang'])
62 | 
63 | # Print head of DataFrame
64 | print(df.head())
65 | ___________________________________________________________________________
66 | 5.A little bit of Twitter text analysis
67 | # Initialize the candidate mention counts
68 | [clinton, trump, sanders, cruz] = [0, 0, 0, 0]
69 | 
70 | # Iterate through df, counting the number of tweets in which
71 | # each candidate is mentioned
72 | for index, row in df.iterrows():
73 |     clinton += word_in_text('clinton', row['text'])
74 |     trump += word_in_text('trump', row['text'])
75 |     sanders += word_in_text('sanders', row['text'])
76 |     cruz += word_in_text('cruz', row['text'])
77 | ___________________________________________________________________________
78 | 6.Plotting your Twitter data
79 | # Import packages
80 | import seaborn as sns
81 | import matplotlib.pyplot as plt
82 | 
83 | # Set seaborn style
84 | sns.set(color_codes=True)
85 | 
86 | # Create a list of labels: cd
87 | cd = ['clinton', 'trump', 'sanders', 'cruz']
88 | 
89 | # Plot the bar chart
90 | ax = sns.barplot(cd, [clinton, trump, sanders, cruz])
91 | ax.set(ylabel="count")
92 | plt.show()
93 | ___________________________________________________________________________
--------------------------------------------------------------------------------
/09 - Intermediate importing data in python/key points:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/10 - Cleaning Data in Python/Chapter 1 - Common Data Problems.txt:
--------------------------------------------------------------------------------
1 | 1.Numeric data or ... ?
2 | # Print the information of ride_sharing 3 | print(ride_sharing.info()) 4 | 5 | # Print summary statistics of user_type column 6 | print(ride_sharing['user_type'].describe()) 7 | 8 | # Convert user_type from integer to category 9 | ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category') 10 | 11 | # Write an assert statement confirming the change 12 | assert ride_sharing['user_type_cat'].dtype == 'category' 13 | 14 | # Print new summary statistics 15 | print(ride_sharing['user_type_cat'].describe()) 16 | ________________________________________________________________________ 17 | 2.Summing strings and concatenating numbers 18 | # Strip duration of minutes 19 | ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes") 20 | 21 | # Convert duration to integer 22 | ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype("int") 23 | 24 | # Write an assert statement making sure of conversion 25 | assert ride_sharing['duration_time'].dtype == "int" 26 | 27 | # Print formed columns and calculate average ride duration 28 | print(ride_sharing[['duration','duration_trim','duration_time']]) 29 | print(ride_sharing[['duration','duration_trim','duration_time']].mean()) 30 | ________________________________________________________________________ 31 | 3.Tire size constraints 32 | # Convert tire_sizes to integer 33 | ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int') 34 | 35 | # Set all values above 27 to 27 36 | ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27 37 | 38 | # Reconvert tire_sizes back to categorical 39 | ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category') 40 | 41 | # Print tire size description 42 | print(ride_sharing['tire_sizes'].describe()) 43 | ________________________________________________________________________ 44 | 4.Back to the future 45 | # Convert ride_date to datetime 46 | ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']) 47 | 48 | # Save today's date 49 | today = dt.date.today() 50 | 51 | # Set all in the future to today's date 52 | ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today 53 | 54 | # Print maximum of ride_dt column 55 | print(ride_sharing['ride_dt'].max()) 56 | ________________________________________________________________________ 57 | 5.Finding duplicates 58 | # Find duplicates 59 | duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False) 60 | 61 | # Sort your duplicated rides 62 | duplicated_rides = ride_sharing[duplicates].sort_values('ride_id') 63 | 64 | # Print relevant columns 65 | print(duplicated_rides[['ride_id','duration','user_birth_year']]) 66 | ____________________________________________________________________________ 67 | 6.Treating duplicates 68 | # Drop complete duplicates from ride_sharing 69 | ride_dup = ride_sharing.drop_duplicates() 70 | 71 | # Create statistics dictionary for aggregation function 72 | statistics = {'user_birth_year': 'min', 'duration': 'mean'} 73 | 74 | # Group by ride_id and compute new statistics 75 | ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index() 76 | 77 | # Find duplicated values again 78 | duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False) 79 | duplicated_rides = ride_unique[duplicates == True] 80 | 81 | # Assert duplicates are processed 82 | assert duplicated_rides.shape[0] == 0 83 | ____________________________________________________________________________ 84 | 
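A note on exercise 6 above: drop_duplicates() only removes rows that are complete copies, while groupby().agg() collapses rows that share a ride_id but disagree elsewhere. The same pattern on a self-contained toy frame (made-up data):

import pandas as pd

# One complete duplicate of ride 'A', plus one partial duplicate with different stats
rides = pd.DataFrame({'ride_id': ['A', 'A', 'A', 'B'],
                      'duration': [10, 10, 12, 7],
                      'user_birth_year': [1990, 1990, 1985, 2000]})

# Complete duplicates collapse to a single row
ride_dup = rides.drop_duplicates()

# Partial duplicates collapse via per-column aggregation
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
print(ride_unique)  # one row per ride_id, e.g. ride 'A' -> birth year 1985, duration 11.0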
-------------------------------------------------------------------------------- /10 - Cleaning Data in Python/Chapter 2 - Text and categorical data problems.txt: -------------------------------------------------------------------------------- 1 | 1.Finding consistency 2 | # Print categories DataFrame 3 | print(categories) 4 | 5 | # Print unique values of survey columns in airlines 6 | print('Cleanliness: ', airlines['cleanliness'].unique(), "\n") 7 | print('Safety: ', airlines['safety'].unique(), "\n") 8 | print('Satisfaction: ', airlines['satisfaction'].unique(), "\n") 9 | 10 | /*******************************/ 11 | # Find the cleanliness category in airlines not in categories 12 | cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness']) 13 | 14 | # Find rows with that category 15 | cat_clean_rows = airlines['cleanliness'].isin(cat_clean) 16 | 17 | # Print rows with inconsistent category 18 | print(airlines[cat_clean_rows]) 19 | 20 | /*******************************/ 21 | # Find the cleanliness category in airlines not in categories 22 | cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness']) 23 | 24 | # Find rows with that category 25 | cat_clean_rows = airlines['cleanliness'].isin(cat_clean) 26 | 27 | # Print rows with inconsistent category 28 | print(airlines[cat_clean_rows]) 29 | 30 | # Print rows with consistent categories only 31 | print(airlines[~cat_clean_rows]) 32 | ______________________________________________________________________________ 33 | 2.Inconsistent categories 34 | # Print unique values of both columns 35 | print(airlines['dest_region'].unique()) 36 | print(airlines['dest_size'].unique()) 37 | /***********************************/ 38 | # Print unique values of both columns 39 | print(airlines['dest_region'].unique()) 40 | print(airlines['dest_size'].unique()) 41 | 42 | # Lower dest_region column and then replace "eur" with "europe" 43 | airlines['dest_region'] = airlines['dest_region'].str.lower() 44 | airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'}) 45 | /************************************/ 46 | # Print unique values of both columns 47 | print(airlines['dest_region'].unique()) 48 | print(airlines['dest_size'].unique()) 49 | 50 | # Lower dest_region column and then replace "eur" with "europe" 51 | airlines['dest_region'] = airlines['dest_region'].str.lower() 52 | airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'}) 53 | 54 | # Remove white spaces from `dest_size` 55 | airlines['dest_size'] = airlines['dest_size'].str.strip() 56 | 57 | # Verify changes have been effected 58 | print(airlines['dest_region'].unique()) 59 | print(airlines['dest_size'].unique()) 60 | ___________________________________________________________________________ 61 | 3.Remapping categories 62 | # Create ranges for categories 63 | label_ranges = [0, 60, 180, np.inf] 64 | label_names = ['short', 'medium', 'long'] 65 | 66 | # Create wait_type column 67 | airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 68 | labels = label_names) 69 | 70 | # Create mappings and replace 71 | mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 72 | 'Thursday': 'weekday', 'Friday': 'weekday', 73 | 'Saturday': 'weekend', 'Sunday': 'weekend'} 74 | 75 | airlines['day_week'] = airlines['day'].replace(mappings) 76 | ____________________________________________________________________________ 77 | 4.Removing titles and taking names 78 | # Replace "Dr." 
with empty string ""
79 | airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")
80 | 
81 | # Replace "Mr." with empty string ""
82 | airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")
83 | 
84 | # Replace "Miss" with empty string ""
85 | airlines['full_name'] = airlines['full_name'].str.replace("Miss","")
86 | 
87 | # Replace "Ms." with empty string ""
88 | airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")
89 | 
90 | # Assert that full_name has no honorifics
91 | assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False
92 | ____________________________________________________________________________
93 | 5.Keeping it descriptive
94 | # Store length of each row in survey_response column
95 | resp_length = airlines['survey_response'].str.len()
96 | 
97 | # Find rows in airlines where resp_length > 40
98 | airlines_survey = airlines[resp_length > 40]
99 | 
100 | # Assert minimum survey_response length is > 40
101 | assert airlines_survey['survey_response'].str.len().min() > 40
102 | 
103 | # Print new survey_response column
104 | print(airlines_survey['survey_response'])
105 | ____________________________________________________________________________
--------------------------------------------------------------------------------
/10 - Cleaning Data in Python/Chapter 3 - Advanced data problems.txt:
--------------------------------------------------------------------------------
1 | 1.Uniform currencies
2 | # Find values of acct_cur that are equal to 'euro'
3 | acct_eu = banking['acct_cur'] == 'euro'
4 | 
5 | # Convert acct_amount where it is in euro to dollars
6 | banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1
7 | 
8 | # Unify acct_cur column by changing 'euro' values to 'dollar'
9 | banking.loc[acct_eu, 'acct_cur'] = 'dollar'
10 | 
11 | # Assert that only dollar currency remains
12 | assert banking['acct_cur'].unique() == 'dollar'
13 | ______________________________________________________________________
14 | 2.Uniform dates
15 | # Print the header of account_opened
16 | print(banking['account_opened'].head())
17 | 
18 | # Convert account_opened to datetime
19 | banking['account_opened'] = pd.to_datetime(banking['account_opened'],
20 |                                            # Infer datetime format
21 |                                            infer_datetime_format = True,
22 |                                            # Return missing value for error
23 |                                            errors = 'coerce')
24 | 
25 | # Get year of account opened
26 | banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
27 | 
28 | # Print acct_year
29 | print(banking['acct_year'])
30 | ______________________________________________________________________
31 | 3.How's our data integrity?
32 | # Store fund columns to sum against
33 | fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']
34 | 
35 | # Find rows where fund_columns row sum == inv_amount
36 | inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']
37 | 
38 | # Store consistent and inconsistent data
39 | consistent_inv = banking[inv_equ]
40 | inconsistent_inv = banking[~inv_equ]
41 | 
42 | # Print the number of inconsistent investments
43 | print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
44 | ______________________________________________________________________
45 | 4.Missing investors
46 | # Print number of missing values in banking
47 | print(banking.isna().sum())
48 | 
49 | # Visualize missingness matrix
50 | msno.matrix(banking)
51 | plt.show()
52 | 
53 | # Isolate missing and non missing values of inv_amount
54 | missing_investors = banking[banking['inv_amount'].isna()]
55 | investors = banking[~banking['inv_amount'].isna()]
56 | 
57 | # Sort banking by age and visualize
58 | banking_sorted = banking.sort_values(by = 'age')
59 | msno.matrix(banking_sorted)
60 | plt.show()
61 | ______________________________________________________________________
62 | 5.Follow the money
63 | # Drop missing values of cust_id
64 | banking_fullid = banking.dropna(subset = ['cust_id'])
65 | 
66 | # Compute estimated acct_amount
67 | acct_imp = banking_fullid['inv_amount'] * 5
68 | 
69 | # Impute missing acct_amount with corresponding acct_imp
70 | banking_imputed = banking_fullid.fillna({'acct_amount': acct_imp})
71 | 
72 | # Print number of missing values
73 | print(banking_imputed.isna().sum())
74 | ______________________________________________________________________
75 | 
--------------------------------------------------------------------------------
/10 - Cleaning Data in Python/Chapter 4 - Record Linkage.txt:
--------------------------------------------------------------------------------
1 | 1.The cutoff point
2 | # Import process from fuzzywuzzy
3 | from fuzzywuzzy import process
4 | 
5 | # Store the unique values of cuisine_type in unique_types
6 | unique_types = restaurants['cuisine_type'].unique()
7 | 
8 | # Calculate similarity of 'asian' to all values of unique_types
9 | print(process.extract('asian', unique_types, limit = len(unique_types)))
10 | 
11 | # Calculate similarity of 'american' to all values of unique_types
12 | print(process.extract('american', unique_types, limit = len(unique_types)))
13 | 
14 | # Calculate similarity of 'italian' to all values of unique_types
15 | print(process.extract('italian', unique_types, limit = len(unique_types)))
16 | _____________________________________________________________________________
17 | 2.Remapping categories II
18 | # Iterate through categories
19 | for cuisine in categories:
20 |     # Create a list of matches, comparing cuisine with the cuisine_type column
21 |     matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
22 | 
23 |     # Iterate through the list of matches
24 |     for match in matches:
25 |         # Check whether the similarity score is greater than or equal to 80
26 |         if match[1] >= 80:
27 |             # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
28 |             restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
29 | 
30 | # Inspect the final result
31 | restaurants['cuisine_type'].unique()
32 | _____________________________________________________________________________
33 | 3.Pairs of restaurants
34 | # Create an indexer and object and find
possible pairs 35 | indexer = recordlinkage.Index() 36 | 37 | # Block pairing on cuisine_type 38 | indexer.block('cuisine_type') 39 | 40 | # Generate pairs 41 | pairs = indexer.index(restaurants, restaurants_new) 42 | _____________________________________________________________________________ 43 | 4.Similar restaurants 44 | # Create a comparison object 45 | comp_cl = recordlinkage.Compare() 46 | 47 | # Find exact matches on city, cuisine_types - 48 | comp_cl.exact('city', 'city', label='city') 49 | comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type') 50 | 51 | # Find similar matches of rest_name 52 | comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8) 53 | 54 | # Get potential matches and print 55 | potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new) 56 | print(potential_matches) 57 | _____________________________________________________________________________ 58 | 5.Linking them together! 59 | # Isolate potential matches with row sum >=3 60 | matches = potential_matches[potential_matches.sum(axis = 1) >= 3] 61 | 62 | # Get values of second column index of matches 63 | matching_indices = matches.index.get_level_values(1) 64 | 65 | # Subset restaurants_new based on non-duplicate values 66 | non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)] 67 | 68 | # Append non_dup to restaurants 69 | full_restaurants = restaurants.append(non_dup) 70 | print(full_restaurants) 71 | _____________________________________________________________________________ -------------------------------------------------------------------------------- /10 - Cleaning Data in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /11 - Working with Dates and Times in Python/Chapter 1 - Dates and Calenders .txt: -------------------------------------------------------------------------------- 1 | 1.Which day of the week? 2 | # Import date from datetime 3 | from datetime import date 4 | 5 | # Create a date object 6 | hurricane_andrew = date(1992, 8, 24) 7 | 8 | # Which day of the week is the date? 9 | print(hurricane_andrew.weekday()) 10 | ____________________________________________________________________________ 11 | 2.How many hurricanes come early? 
12 | # Counter for how many before June 1 13 | early_hurricanes = 0 14 | 15 | # We loop over the dates 16 | for hurricane in florida_hurricane_dates: 17 | # Check if the month is before June (month number 6) 18 | if hurricane.month < 6: 19 | early_hurricanes = early_hurricanes + 1 20 | 21 | print(early_hurricanes) 22 | ____________________________________________________________________________ 23 | 3.Subtracting dates 24 | # Import date 25 | from datetime import date 26 | 27 | # Create a date object for May 9th, 2007 28 | start = date(2007, 5, 9) 29 | 30 | # Create a date object for December 13th, 2007 31 | end = date(2007, 12, 13) 32 | 33 | # Subtract the two dates and print the number of days 34 | print((end - start).days) 35 | ____________________________________________________________________________ 36 | 4.Counting events per calendar month 37 | # A dictionary to count hurricanes per calendar month 38 | hurricanes_each_month = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6:0, 39 | 7: 0, 8:0, 9:0, 10:0, 11:0, 12:0} 40 | 41 | # Loop over all hurricanes 42 | for hurricane in florida_hurricane_dates: 43 | # Pull out the month 44 | month = hurricane.month 45 | # Increment the count in your dictionary by one 46 | hurricanes_each_month[month] += 1 47 | 48 | print(hurricanes_each_month) 49 | ____________________________________________________________________________ 50 | 5.Putting a list of dates in order 51 | # Print the first and last scrambled dates 52 | print(dates_scrambled[0]) 53 | print(dates_scrambled[-1]) 54 | 55 | # Put the dates in order 56 | dates_ordered = sorted(dates_scrambled) 57 | 58 | # Print the first and last ordered dates 59 | print(dates_ordered[0]) 60 | print(dates_ordered[-1]) 61 | ____________________________________________________________________________ 62 | 6.Printing dates in a friendly format 63 | # Assign the earliest date to first_date 64 | first_date = min(florida_hurricane_dates) 65 | 66 | # Convert to ISO and US formats 67 | iso = "Our earliest hurricane date: " + first_date.isoformat() 68 | us = "Our earliest hurricane date: " + first_date.strftime("%m/%d/%Y") 69 | 70 | print("ISO: " + iso) 71 | print("US: " + us) 72 | ____________________________________________________________________________ 73 | 7.Representing dates in different ways 74 | # Import date 75 | from datetime import date 76 | 77 | # Create a date object 78 | andrew = date(1992, 8, 26) 79 | 80 | # Print the date in the format 'YYYY-MM' 81 | print(andrew.strftime('%Y-%m')) 82 | ____________________________________________________________________________ -------------------------------------------------------------------------------- /11 - Working with Dates and Times in Python/Chapter 2 - Combining dates and times.txt: -------------------------------------------------------------------------------- 1 | 1.Creating datetimes by hand 2 | # Import datetime 3 | from datetime import datetime 4 | 5 | # Create a datetime object 6 | dt = datetime(2017, 10, 1, 15, 26, 26) 7 | 8 | # Print the results in ISO 8601 format 9 | print(dt.isoformat()) 10 | _______________________________________________________________________________ 11 | 2.Counting events before and after noon 12 | # Create dictionary to hold results 13 | trip_counts = {'AM': 0, 'PM': 0} 14 | 15 | # Loop over all trips 16 | for trip in onebike_datetimes: 17 | # Check to see if the trip starts before noon 18 | if trip['start'].hour < 12: 19 | # Increment the counter for before noon 20 | trip_counts['AM'] += 1 21 | else: 22 | # Increment the counter for 
after noon 23 | trip_counts['PM'] += 1 24 | 25 | print(trip_counts) 26 | _______________________________________________________________________________ 27 | 3.Turning strings into datetimes 28 | # Import the datetime class 29 | from datetime import datetime 30 | 31 | # Starting string, in YYYY-MM-DD HH:MM:SS format 32 | s = '2017-02-03 00:00:01' 33 | 34 | # Write a format string to parse s 35 | fmt = '%Y-%m-%d %H:%M:%S' 36 | 37 | # Create a datetime object d 38 | d = datetime.strptime(s, fmt) 39 | 40 | # Print d 41 | print(d) 42 | _______________________________________________________________________________ 43 | 4.Parsing pairs of strings as datetimes 44 | # Write down the format string 45 | fmt = "%Y-%m-%d %H:%M:%S" 46 | 47 | # Initialize a list for holding the pairs of datetime objects 48 | onebike_datetimes = [] 49 | 50 | # Loop over all trips 51 | for (start, end) in onebike_datetime_strings: 52 | trip = {'start': datetime.strptime(start, fmt), 53 | 'end': datetime.strptime(end, fmt)} 54 | 55 | # Append the trip 56 | onebike_datetimes.append(trip) 57 | _______________________________________________________________________________ 58 | 5.Recreating ISO format with strftime() 59 | # Import datetime 60 | from datetime import datetime 61 | 62 | # Pull out the start of the first trip 63 | first_start = onebike_datetimes[0]['start'] 64 | 65 | # Format to feed to strftime() 66 | fmt = "%Y-%m-%dT%H:%M:%S" 67 | 68 | # Print out date with .isoformat(), then with .strftime() to compare 69 | print(first_start.isoformat()) 70 | print(first_start.strftime(fmt)) 71 | _______________________________________________________________________________ 72 | 6.Unix timestamps 73 | # Import datetime 74 | from datetime import datetime 75 | 76 | # Starting timestamps 77 | timestamps = [1514665153, 1514664543] 78 | 79 | # Datetime objects 80 | dts = [] 81 | 82 | # Loop 83 | for ts in timestamps: 84 | dts.append(datetime.fromtimestamp(ts)) 85 | 86 | # Print results 87 | print(dts) 88 | _______________________________________________________________________________ 89 | 7.Turning pairs of datetimes into durations 90 | # Initialize a list for all the trip durations 91 | onebike_durations = [] 92 | 93 | for trip in onebike_datetimes: 94 | # Create a timedelta object corresponding to the length of the trip 95 | trip_duration = trip['end'] - trip['start'] 96 | 97 | # Get the total elapsed seconds in trip_duration 98 | trip_length_seconds = trip_duration.total_seconds() 99 | 100 | # Append the results to our list 101 | onebike_durations.append(trip_length_seconds) 102 | _______________________________________________________________________________ 103 | 8.Average trip time 104 | # What was the total duration of all trips? 105 | total_elapsed_time = sum(onebike_durations) 106 | 107 | # What was the total number of trips? 
108 | number_of_trips = len(onebike_durations) 109 | 110 | # Divide the total duration by the number of trips 111 | print(total_elapsed_time / number_of_trips) 112 | _______________________________________________________________________________ 113 | 9.The long and the short of why time is hard 114 | # Calculate shortest and longest trips 115 | shortest_trip = min(onebike_durations) 116 | longest_trip = max(onebike_durations) 117 | 118 | # Print out the results 119 | print("The shortest trip was " + str(shortest_trip) + " seconds") 120 | print("The longest trip was " + str(longest_trip) + " seconds") 121 | _______________________________________________________________________________ -------------------------------------------------------------------------------- /11 - Working with Dates and Times in Python/Chapter 3 - time zones and daylight saving.txt: -------------------------------------------------------------------------------- 1 | 1.Creating timezone aware datetimes 2 | # Import datetime, timezone 3 | from datetime import datetime, timezone 4 | 5 | # October 1, 2017 at 15:26:26, UTC 6 | dt = datetime(2017, 10, 1, 15, 26, 26, tzinfo=timezone.utc) 7 | 8 | # Print results 9 | print(dt.isoformat()) 10 | ____________________________________________________________________________ 11 | 2.What time did the bike leave in UTC? 12 | # Loop over the trips 13 | for trip in onebike_datetimes[:10]: 14 | # Pull out the start 15 | dt = trip['start'] 16 | # Move dt to be in UTC 17 | dt = dt.astimezone(timezone.utc) 18 | 19 | # Print the start time in UTC 20 | print('Original:', trip['start'], '| UTC:', dt.isoformat()) 21 | ____________________________________________________________________________ 22 | 3.Putting the bike trips into the right time zone 23 | # Import tz 24 | from dateutil import tz 25 | 26 | # Create a timezone object for Eastern Time 27 | et = tz.gettz('America/New_York') 28 | 29 | # Loop over trips, updating the datetimes to be in Eastern Time 30 | for trip in onebike_datetimes[:10]: 31 | # Update trip['start'] and trip['end'] 32 | trip['start'] = trip['start'].replace(tzinfo = et) 33 | trip['end'] = trip['end'].replace(tzinfo = et) 34 | ____________________________________________________________________________ 35 | 4.What time did the bike leave? (Global edition) 36 | # Create the timezone object 37 | uk = tz.gettz('Europe/London') 38 | 39 | # Pull out the start of the first trip 40 | local = onebike_datetimes[0]['start'] 41 | 42 | # What time was it in the UK? 43 | notlocal = local.astimezone(uk) 44 | 45 | # Print them out and see the difference 46 | print(local.isoformat()) 47 | print(notlocal.isoformat()) 48 | ____________________________________________________________________________ 49 | 5.How many hours elapsed around daylight saving? 
50 | # Import datetime, timedelta, tz, timezone
51 | from datetime import datetime, timedelta, timezone
52 | from dateutil import tz
53 | 
54 | # Start on March 12, 2017, midnight, then add 6 hours
55 | start = datetime(2017, 3, 12, tzinfo = tz.gettz('America/New_York'))
56 | end = start + timedelta(hours=6)
57 | print(start.isoformat() + " to " + end.isoformat())
58 | ____________________________________________________________________________
59 | 6.March 29, throughout a decade
60 | # Import datetime and tz
61 | from datetime import datetime
62 | from dateutil import tz
63 | 
64 | # Create starting date
65 | dt = datetime(2000, 3, 29, tzinfo = tz.gettz('Europe/London'))
66 | 
67 | # Loop over the dates, replacing the year, and print the ISO timestamp
68 | for y in range(2000, 2011):
69 |     print(dt.replace(year=y).isoformat())
70 | ____________________________________________________________________________
71 | 7.Finding ambiguous datetimes
72 | # Loop over trips
73 | for trip in onebike_datetimes:
74 |     # Rides with ambiguous start
75 |     if tz.datetime_ambiguous(trip['start']):
76 |         print("Ambiguous start at " + str(trip['start']))
77 |     # Rides with ambiguous end
78 |     if tz.datetime_ambiguous(trip['end']):
79 |         print("Ambiguous end at " + str(trip['end']))
80 | ____________________________________________________________________________
81 | 8.Cleaning daylight saving data with fold
82 | trip_durations = []
83 | for trip in onebike_datetimes:
84 |     # When the start is later than the end, set the fold to be 1
85 |     if trip['start'] > trip['end']:
86 |         trip['end'] = tz.enfold(trip['end'])
87 |     # Convert to UTC
88 |     start = trip['start'].astimezone(tz.UTC)
89 |     end = trip['end'].astimezone(tz.UTC)
90 | 
91 |     # Subtract to get the trip length in seconds
92 |     trip_length_seconds = (end-start).total_seconds()
93 |     trip_durations.append(trip_length_seconds)
94 | 
95 | # Take the shortest trip duration
96 | print("Shortest trip: " + str(min(trip_durations)))
97 | ____________________________________________________________________________
--------------------------------------------------------------------------------
/11 - Working with Dates and Times in Python/key points:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/12 - Writing functions in Python/Chapter 1 - Best Practices.txt:
--------------------------------------------------------------------------------
1 | 1.Crafting a docstring
2 | def count_letter(content, letter):
3 |     """Count the number of times `letter` appears in `content`.
4 | 
5 |     Args:
6 |         content (str): The string to search.
7 |         letter (str): The letter to search for.
8 | 
9 |     Returns:
10 |         int
11 | 
12 |     # Add a section detailing what errors might be raised
13 |     Raises:
14 |         ValueError: If `letter` is not a one-character string.
15 |     """
16 |     if (not isinstance(letter, str)) or len(letter) != 1:
17 |         raise ValueError('`letter` must be a single character string.')
18 |     return len([char for char in content if char == letter])
19 | __________________________________________________________
20 | 2.Retrieving docstrings
21 | def build_tooltip(function):
22 |     """Create a tooltip for any function that shows the
23 |     function's docstring.
24 | 
25 |     Args:
26 |         function (callable): The function we want a tooltip for.
27 | 28 | Returns: 29 | str 30 | """ 31 | # Use 'inspect' to get the docstring 32 | docstring = inspect.getdoc(function) 33 | border = '#' * 28 34 | return '{}\n{}\n{}'.format(border, docstring, border) 35 | 36 | print(build_tooltip(count_letter)) 37 | print(build_tooltip(range)) 38 | print(build_tooltip(print)) __________________________________________________________
39 | 3.Extract a function 40 | def standardize(column): 41 | """Standardize the values in a column. 42 | 43 | Args: 44 | column (pandas Series): The data to standardize. 45 | 46 | Returns: 47 | pandas Series: the values as z-scores 48 | """ 49 | # Finish the function so that it returns the z-scores 50 | z_score = (column - column.mean()) / column.std() 51 | return z_score 52 | 53 | # Use the standardize() function to calculate the z-scores 54 | df['y1_z'] = standardize(df.y1_gpa) 55 | df['y2_z'] = standardize(df.y2_gpa) 56 | df['y3_z'] = standardize(df.y3_gpa) 57 | df['y4_z'] = standardize(df.y4_gpa) 58 | _____________________________________________________________
59 | 4.Split up a function 60 | def mean(values): 61 | """Get the mean of a list of values 62 | 63 | Args: 64 | values (iterable of float): A list of numbers 65 | 66 | Returns: 67 | float 68 | """ 69 | # Write the mean() function 70 | mean = sum(values) / len(values) 71 | return mean 72 | def median(values): 73 | """Get the median of a list of values 74 | 75 | Args: 76 | values (iterable of float): A list of numbers 77 | 78 | Returns: 79 | float 80 | """ 81 | # Write the median() function (note: this assumes `values` is already sorted) 82 | midpoint = int(len(values) / 2) 83 | if len(values) % 2 == 0: 84 | median = (values[midpoint - 1] + values[midpoint]) / 2 85 | else: 86 | median = values[midpoint] 87 | return median 88 | ___________________________________________________________________
89 | 5.Best practice for default arguments 90 | # Use an immutable variable for the default argument 91 | def better_add_column(values, df=None): 92 | """Add a column of `values` to a DataFrame `df`. 93 | The column will be named "col_n" where "n" is 94 | the numerical index of the column. 95 | 96 | Args: 97 | values (iterable): The values of the new column 98 | df (DataFrame, optional): The DataFrame to update. 99 | If no DataFrame is passed, one is created by default.
100 | 101 | Returns: 102 | DataFrame 103 | """ 104 | # Update the function to create a default DataFrame 105 | if df is None: 106 | df = pandas.DataFrame() 107 | df['col_{}'.format(len(df.columns))] = values 108 | return df 109 | ___________________________________________________________________ 110 |
-------------------------------------------------------------------------------- /12 - Writing functions in Python/Chapter 2 - Using Context Managers.txt: --------------------------------------------------------------------------------
1 | 1.The number of cats 2 | # Open "alice.txt" and assign the file to "file" 3 | with open('alice.txt') as file: 4 | text = file.read() 5 | 6 | n = 0 7 | for word in text.split(): 8 | if word.lower() in ['cat', 'cats']: 9 | n += 1 10 | 11 | print('Lewis Carroll uses the word "cat" {} times'.format(n)) 12 | _______________________________________________________________________
13 | 2.The speed of cats 14 | image = get_image_from_instagram() 15 | 16 | # Time how long process_with_numpy(image) takes to run 17 | with timer(): 18 | print('Numpy version') 19 | process_with_numpy(image) 20 | 21 | # Time how long process_with_pytorch(image) takes to run 22 | with timer(): 23 | print('Pytorch version') 24 | process_with_pytorch(image) 25 | _______________________________________________________________________
26 | 3.The timer() context manager 27 | # Add a decorator that will make timer() a context manager 28 | @contextlib.contextmanager 29 | def timer(): 30 | """Time the execution of a context block. 31 | 32 | Yields: 33 | None 34 | """ 35 | start = time.time() 36 | # Send control back to the context block 37 | yield 38 | end = time.time() 39 | print('Elapsed: {:.2f}s'.format(end - start)) 40 | 41 | with timer(): 42 | print('This should take approximately 0.25 seconds') 43 | time.sleep(0.25) 44 | _______________________________________________________________________
45 | 4.A read-only open() context manager 46 | @contextlib.contextmanager 47 | def open_read_only(filename): 48 | """Open a file in read-only mode. 49 | 50 | Args: 51 | filename (str): The location of the file to read 52 | 53 | Yields: 54 | file object 55 | """ 56 | read_only_file = open(filename, mode='r') 57 | # Yield read_only_file so it can be assigned to my_file 58 | yield read_only_file 59 | # Close read_only_file 60 | read_only_file.close() 61 | 62 | with open_read_only('my_file.txt') as my_file: 63 | print(my_file.read()) 64 | _______________________________________________________________________
65 | 5.Scraping the NASDAQ 66 | # Use the "stock('NVDA')" context manager 67 | # and assign the result to the variable "nvda" 68 | with stock('NVDA') as nvda: 69 | # Open 'NVDA.txt' for writing as f_out 70 | with open('NVDA.txt', 'w') as f_out: 71 | for _ in range(10): 72 | value = nvda.price() 73 | print('Logging ${:.2f} for NVDA'.format(value)) 74 | f_out.write('{:.2f}\n'.format(value)) 75 | _______________________________________________________________________
76 | 6.Changing the working directory
@contextlib.contextmanager  # added: required for in_dir() to work in a `with` block, like timer() above
77 | def in_dir(directory): 78 | """Change current working directory to `directory`, 79 | allow the user to run some code, and change back. 80 | 81 | Args: 82 | directory (str): The path to a directory to work in.
83 | """ 84 | current_dir = os.getcwd() 85 | os.chdir(directory) 86 | 87 | # Add code that lets you handle errors 88 | try: 89 | yield 90 | # Ensure the directory is reset, 91 | # whether there was an error or not 92 | finally: 93 | os.chdir(current_dir) 94 | _______________________________________________________________________ -------------------------------------------------------------------------------- /12 - Writing functions in Python/Chapter 3 - Decorators.txt: -------------------------------------------------------------------------------- 1 | 1.Building a command line data app 2 | # Add the missing function references to the function map 3 | function_map = { 4 | 'mean': mean, 5 | 'std': std, 6 | 'minimum': minimum, 7 | 'maximum': maximum 8 | } 9 | 10 | data = load_data() 11 | print(data) 12 | 13 | func_name = get_user_input() 14 | 15 | # Call the chosen function and pass "data" as an argument 16 | function_map[func_name](data) 17 | ________________________________________________________________________ 18 | 2.Reviewing your co-worker's code 19 | # Call has_docstring() on the load_and_plot_data() function 20 | ok = has_docstring(load_and_plot_data) 21 | 22 | if not ok: 23 | print("load_and_plot_data() doesn't have a docstring!") 24 | else: 25 | print("load_and_plot_data() looks ok") 26 | /***************************************/ 27 | # Call has_docstring() on the as_2D() function 28 | ok = has_docstring(as_2D) 29 | 30 | if not ok: 31 | print("as_2D() doesn't have a docstring!") 32 | else: 33 | print("as_2D() looks ok") 34 | /**************************************/ 35 | # Call has_docstring() on the log_product() function 36 | ok = has_docstring(log_product) 37 | 38 | if not ok: 39 | print("log_product() doesn't have a docstring!") 40 | else: 41 | print("log_product() looks ok") 42 | ________________________________________________________________________ 43 | 3.Returning functions for a math game 44 | def create_math_function(func_name): 45 | if func_name == 'add': 46 | def add(a, b): 47 | return a + b 48 | return add 49 | elif func_name == 'subtract': 50 | # Define the subtract() function 51 | def subtract(a, b): 52 | return a - b 53 | return subtract 54 | else: 55 | print("I don't know that one") 56 | 57 | add = create_math_function('add') 58 | print('5 + 2 = {}'.format(add(5, 2))) 59 | 60 | subtract = create_math_function('subtract') 61 | print('5 - 2 = {}'.format(subtract(5, 2))) 62 | ________________________________________________________________________ 63 | 4.Modifying variables outside local scope 64 | def wait_until_done(): 65 | def check_is_done(): 66 | # Add a keyword so that wait_until_done() 67 | # doesn't run forever 68 | global done 69 | if random.random() < 0.1: 70 | done = True 71 | 72 | while not done: 73 | check_is_done() 74 | 75 | done = False 76 | wait_until_done() 77 | 78 | print('Work done? 
{}'.format(done)) 79 | ________________________________________________________________________ 80 | 5.Checking for closure 81 | def return_a_func(arg1, arg2): 82 | def new_func(): 83 | print('arg1 was {}'.format(arg1)) 84 | print('arg2 was {}'.format(arg2)) 85 | return new_func 86 | 87 | my_func = return_a_func(2, 17) 88 | 89 | print(my_func.__closure__ is not None) 90 | print(len(my_func.__closure__) == 2) 91 | 92 | # Get the values of the variables in the closure 93 | closure_values = [ 94 | my_func.__closure__[i].cell_contents for i in range(2) 95 | ] 96 | print(closure_values == [2, 17]) 97 | ________________________________________________________________________ 98 | 6.Closures keep your values safe 99 | def my_special_function(): 100 | print('You are running my_special_function()') 101 | 102 | def get_new_func(func): 103 | def call_func(): 104 | func() 105 | return call_func 106 | 107 | new_func = get_new_func(my_special_function) 108 | 109 | # Redefine my_special_function() to just print "hello" 110 | def my_special_function(): 111 | print("hello") 112 | 113 | new_func() 114 | /********************************/ 115 | def my_special_function(): 116 | print('You are running my_special_function()') 117 | 118 | def get_new_func(func): 119 | def call_func(): 120 | func() 121 | return call_func 122 | 123 | new_func = get_new_func(my_special_function) 124 | 125 | # Delete my_special_function() 126 | del(my_special_function) 127 | 128 | new_func() 129 | /********************************/ 130 | def my_special_function(): 131 | print('You are running my_special_function()') 132 | 133 | def get_new_func(func): 134 | def call_func(): 135 | func() 136 | return call_func 137 | 138 | # Overwrite `my_special_function` with the new function 139 | my_special_function = get_new_func(my_special_function) 140 | 141 | my_special_function() 142 | ________________________________________________________________________ 143 | 7.Using decorator syntax 144 | # Decorate my_function() with the print_args() decorator 145 | @print_args 146 | def my_function(a, b, c): 147 | print(a + b + c) 148 | 149 | my_function(1, 2, 3) 150 | ________________________________________________________________________ 151 | 8.Defining a decorator 152 | def print_before_and_after(func): 153 | def wrapper(*args): 154 | print('Before {}'.format(func.__name__)) 155 | # Call the function being decorated with *args 156 | func(*args) 157 | print('After {}'.format(func.__name__)) 158 | # Return the nested function 159 | return wrapper 160 | 161 | @print_before_and_after 162 | def multiply(a, b): 163 | print(a * b) 164 | 165 | multiply(5, 10) 166 | ________________________________________________________________________ -------------------------------------------------------------------------------- /12 - Writing functions in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 1 - Read clean and validate.txt: -------------------------------------------------------------------------------- 1 | 1.Exploring the NSFG data 2 | # Display the number of rows and columns 3 | nsfg.shape 4 | 5 | # Display the names of the columns 6 | nsfg.columns 7 | 8 | # Select column birthwgt_oz1: ounces 9 | ounces = nsfg['birthwgt_oz1'] 10 | 11 | # Print the first 5 elements of ounces 12 | print(ounces.head()) 13 | 
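# Added check (not from the original exercise): value_counts() exposes the
# special codes in this column (the NSFG codebook uses values like 98/99 for
# refused/don't know), which later cleaning steps replace with NaN:
# print(ounces.value_counts().sort_index())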
___________________________________________________________________ 14 | 2.Clean a variable 15 | # Replace the value 8 with NaN 16 | nsfg['nbrnaliv'].replace([8], np.nan, inplace=True) 17 | 18 | # Print the values and their frequencies 19 | print(nsfg['nbrnaliv'].value_counts()) 20 | ___________________________________________________________________ 21 | 3.Compute a variable 22 | # Select the columns and divide by 100 23 | agecon = nsfg['agecon'] / 100 24 | agepreg = nsfg['agepreg'] / 100 25 | 26 | # Compute the difference 27 | preg_length = agepreg - agecon 28 | 29 | # Compute summary statistics 30 | print(preg_length.describe()) 31 | ___________________________________________________________________ 32 | 4.Make a histogram 33 | # Plot the histogram 34 | plt.hist(agecon, bins=20, histtype='step') 35 | 36 | # Label the axes 37 | plt.xlabel('Age at conception') 38 | plt.ylabel('Number of pregnancies') 39 | 40 | # Show the figure 41 | plt.show() 42 | ___________________________________________________________________ 43 | 5.Compute birth weight 44 | # Create a Boolean Series for full-term babies 45 | full_term = nsfg['prglngth'] >= 37 46 | 47 | # Select the weights of full-term babies 48 | full_term_weight = birth_weight[full_term] 49 | 50 | # Compute the mean weight of full-term babies 51 | print(full_term_weight.mean()) 52 | ___________________________________________________________________ 53 | 6.Filter 54 | # Filter full-term babies 55 | full_term = nsfg['prglngth'] >= 37 56 | 57 | # Filter single births 58 | single = nsfg['nbrnaliv'] == 1 59 | 60 | # Compute birth weight for single full-term babies 61 | single_full_term_weight = birth_weight[single & full_term] 62 | print('Single full-term mean:', single_full_term_weight.mean()) 63 | 64 | # Compute birth weight for multiple full-term babies 65 | mult_full_term_weight = birth_weight[~single & full_term] 66 | print('Multiple full-term mean:', mult_full_term_weight.mean()) 67 | ___________________________________________________________________ -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 2 - Distributions.txt: -------------------------------------------------------------------------------- 1 | 1.Make a PMF 2 | # Compute the PMF for year 3 | pmf_year = Pmf(gss['year'], normalize=False) 4 | 5 | # Print the result 6 | print(pmf_year) 7 | ____________________________________________________ 8 | 2.Plot a PMF 9 | # Select the age column 10 | age = gss['age'] 11 | 12 | # Make a PMF of age 13 | pmf_age = Pmf(age) 14 | 15 | # Plot the PMF 16 | pmf_age.bar() 17 | 18 | # Label the axes 19 | plt.xlabel('Age') 20 | plt.ylabel('PMF') 21 | plt.show() 22 | 3.Make a CDF 23 | # Select the age column 24 | age = gss['age'] 25 | 26 | # Compute the CDF of age 27 | cdf_age = Cdf(age) 28 | 29 | # Calculate the CDF of 30 30 | print(cdf_age(30)) 31 | ____________________________________________________ 32 | 4.Compute IQR 33 | # Calculate the 75th percentile 34 | percentile_75th = cdf_income.inverse(0.75) 35 | 36 | # Calculate the 25th percentile 37 | percentile_25th = cdf_income.inverse(0.25) 38 | 39 | # Calculate the interquartile range 40 | iqr = percentile_75th - percentile_25th 41 | 42 | # Print the interquartile range 43 | print(iqr) 44 | 5.Plot a CDF 45 | # Select realinc 46 | income = gss['realinc'] 47 | 48 | # Make the CDF 49 | cdf_income = Cdf(income) 50 | 51 | # Plot it 52 | cdf_income.plot() 53 | 54 | # Label the axes 55 | plt.xlabel('Income (1986 USD)') 56 | 
plt.ylabel('CDF') 57 | plt.show() 58 | ____________________________________________________ 59 | 6.Extract education levels 60 | # Select educ 61 | educ = gss['educ'] 62 | 63 | # Bachelor's degree 64 | bach = (educ >= 16) 65 | 66 | # Associate degree 67 | assc = (educ >= 14) & (educ < 16) 68 | 69 | # High school 70 | high = (educ <= 12) 71 | print(high.mean()) 72 | ____________________________________________________ 73 | 7.Plot income CDFs 74 | income = gss['realinc'] 75 | 76 | # Plot the CDFs 77 | Cdf(income[high]).plot(label='High school') 78 | Cdf(income[assc]).plot(label='Associate') 79 | Cdf(income[bach]).plot(label='Bachelor') 80 | 81 | # Label the axes 82 | plt.xlabel('Income (1986 USD)') 83 | plt.ylabel('CDF') 84 | plt.legend() 85 | plt.show() 86 | ____________________________________________________ 87 | 8.Distribution of income 88 | # Extract realinc and compute its log 89 | income = gss['realinc'] 90 | log_income = np.log10(income) 91 | 92 | # Compute mean and standard deviation 93 | mean = log_income.mean() 94 | std = log_income.std() 95 | print(mean, std) 96 | 97 | # Make a norm object 98 | from scipy.stats import norm 99 | dist = norm(mean, std) 100 | _________________________________________________________________________ 101 | 9.Comparing CDFs 102 | # Evaluate the model CDF 103 | xs = np.linspace(2, 5.5) 104 | ys = dist.cdf(xs) 105 | 106 | # Plot the model CDF 107 | plt.clf() 108 | plt.plot(xs, ys, color='gray') 109 | 110 | # Create and plot the Cdf of log_income 111 | Cdf(log_income).plot() 112 | 113 | # Label the axes 114 | plt.xlabel('log10 of realinc') 115 | plt.ylabel('CDF') 116 | plt.show() 117 | ____________________________________________________ 118 | 10.Comparing PDFs 119 | # Evaluate the normal PDF 120 | xs = np.linspace(2, 5.5) 121 | ys = dist.pdf(xs) 122 | 123 | # Plot the model PDF 124 | plt.clf() 125 | plt.plot(xs, ys, color='gray') 126 | 127 | # Plot the data KDE 128 | sns.kdeplot(log_income) 129 | 130 | # Label the axes 131 | plt.xlabel('log10 of realinc') 132 | plt.ylabel('PDF') 133 | plt.show() 134 | ____________________________________________________ -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 3 - Relationships.txt: -------------------------------------------------------------------------------- 1 | 1.PMF of age 2 | # Extract AGE 3 | age = brfss['AGE'] 4 | 5 | # Plot the PMF 6 | Pmf(age).bar() 7 | 8 | # Label the axes 9 | plt.xlabel('Age in years') 10 | plt.ylabel('PMF') 11 | plt.show() 12 | ______________________________________________________ 13 | 2.Scatter plot 14 | # Select the first 1000 respondents 15 | brfss = brfss[:1000] 16 | 17 | # Extract age and weight 18 | age = brfss['AGE'] 19 | weight = brfss['WTKG3'] 20 | 21 | # Make a scatter plot 22 | plt.plot(age, weight, 'o', alpha=0.1) 23 | 24 | plt.xlabel('Age in years') 25 | plt.ylabel('Weight in kg') 26 | 27 | plt.show() 28 | ______________________________________________________ 29 | 3.Jittering 30 | # Select the first 1000 respondents 31 | brfss = brfss[:1000] 32 | 33 | # Add jittering to age 34 | age = brfss['AGE'] + np.random.normal(0, 2.5, size=len(brfss)) 35 | # Extract weight 36 | weight = brfss['WTKG3'] 37 | 38 | # Make a scatter plot 39 | plt.plot(age, weight, 'o', markersize=5, alpha=0.2) 40 | 41 | plt.xlabel('Age in years') 42 | plt.ylabel('Weight in kg') 43 | plt.show() 44 | ______________________________________________________ 45 | 4.Height and weight 46 | # Drop rows with missing data 47 | 
data = brfss.dropna(subset=['_HTMG10', 'WTKG3']) 48 | 49 | # Make a box plot 50 | sns.boxplot(x='_HTMG10', y='WTKG3', data=data, whis=10) 51 | 52 | # Plot the y-axis on a log scale 53 | plt.yscale('log') 54 | 55 | # Remove unneeded lines and label axes 56 | sns.despine(left=True, bottom=True) 57 | plt.xlabel('Height in cm') 58 | plt.ylabel('Weight in kg') 59 | plt.show() 60 | ______________________________________________________ 61 | 5.Distribution of income 62 | # Extract income 63 | income = brfss['INCOME2'] 64 | 65 | # Plot the PMF 66 | Pmf(income).bar() 67 | 68 | # Label the axes 69 | plt.xlabel('Income level') 70 | plt.ylabel('PMF') 71 | plt.show() 72 | ______________________________________________________ 73 | 6.Income and height 74 | # Drop rows with missing data 75 | data = brfss.dropna(subset=['INCOME2', 'HTM4']) 76 | 77 | # Make a violin plot 78 | sns.violinplot(x='INCOME2', y='HTM4', data=data, inner=None) 79 | 80 | # Remove unneeded lines and label axes 81 | sns.despine(left=True, bottom=True) 82 | plt.xlabel('Income level') 83 | plt.ylabel('Height in cm') 84 | plt.show() 85 | ______________________________________________________ 86 | 7.Computing correlations 87 | # Select columns 88 | columns = ['AGE', 'INCOME2', '_VEGESU1'] 89 | subset = brfss[columns] 90 | 91 | # Compute the correlation matrix 92 | print(subset.corr()) 93 | ______________________________________________________ 94 | 8.Income and vegetables 95 | from scipy.stats import linregress 96 | 97 | # Extract the variables 98 | subset = brfss.dropna(subset=['INCOME2', '_VEGESU1']) 99 | xs = subset['INCOME2'] 100 | ys = subset['_VEGESU1'] 101 | 102 | # Compute the linear regression 103 | res = linregress(xs, ys) 104 | print(res) 105 | ______________________________________________________ 106 | 9.Fit a line 107 | # Plot the scatter plot 108 | plt.clf() 109 | x_jitter = xs + np.random.normal(0, 0.15, len(xs)) 110 | plt.plot(x_jitter, ys, 'o', alpha=0.2) 111 | 112 | # Plot the line of best fit 113 | fx = np.array([xs.min(), xs.max()]) 114 | fy = res.intercept + res.slope * fx 115 | plt.plot(fx, fy, '-', alpha=0.7) 116 | 117 | plt.xlabel('Income code') 118 | plt.ylabel('Vegetable servings per day') 119 | plt.ylim([0, 6]) 120 | plt.show() 121 | ______________________________________________________ -------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/Chapter 4 - Multivariate Thinking.txt: -------------------------------------------------------------------------------- 1 | 1.Using StatsModels 2 | from scipy.stats import linregress 3 | import statsmodels.formula.api as smf 4 | 5 | # Run regression with linregress 6 | subset = brfss.dropna(subset=['INCOME2', '_VEGESU1']) 7 | xs = subset['INCOME2'] 8 | ys = subset['_VEGESU1'] 9 | res = linregress(xs, ys) 10 | print(res) 11 | 12 | # Run regression with StatsModels 13 | results = smf.ols('_VEGESU1 ~ INCOME2', data=brfss).fit() 14 | print(results.params) 15 | ________________________________________________________________ 16 | 2.Plot income and education 17 | # Group by educ 18 | grouped = gss.groupby('educ') 19 | 20 | # Compute mean income in each group 21 | mean_income_by_educ = grouped['realinc'].mean() 22 | 23 | # Plot mean income as a scatter plot 24 | plt.clf() 25 | plt.plot(mean_income_by_educ, 'o', alpha=0.5) 26 | 27 | # Label the axes 28 | plt.xlabel('Education (years)') 29 | plt.ylabel('Income (1986 $)') 30 | plt.show() 31 | ________________________________________________________________ 32 | 
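(Editor's note) The regression formula in the next exercise also uses an 'age2' column, which is assumed to already exist in gss; if it does not, create it the same way 'educ2' is created:
gss['age2'] = gss['age']**2
________________________________________________________________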
3.Non-linear model of education 33 | import statsmodels.formula.api as smf 34 | 35 | # Add a new column with educ squared 36 | gss['educ2'] = gss['educ']**2 37 | 38 | # Run a regression model with educ, educ2, age, and age2 39 | results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit() 40 | 41 | # Print the estimated parameters 42 | print(results.params) 43 | ________________________________________________________________
44 | 4.Making predictions 45 | # Run a regression model with educ, educ2, age, and age2 46 | results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit() 47 | 48 | # Make the DataFrame 49 | df = pd.DataFrame() 50 | df['educ'] = np.linspace(0, 20) 51 | df['age'] = 30 52 | df['educ2'] = df['educ']**2 53 | df['age2'] = df['age']**2 54 | 55 | # Generate the predictions (plotted in the next exercise) 56 | pred = results.predict(df) 57 | print(pred.head()) 58 | ________________________________________________________________
59 | 5.Visualizing predictions 60 | # Plot mean income in each education group 61 | plt.clf() 62 | grouped = gss.groupby('educ') 63 | mean_income_by_educ = grouped['realinc'].mean() 64 | plt.plot(mean_income_by_educ, 'o', alpha=0.5) 65 | 66 | # Plot the predictions 67 | pred = results.predict(df) 68 | plt.plot(df['educ'], pred, label='Age 30') 69 | 70 | # Label axes 71 | plt.xlabel('Education (years)') 72 | plt.ylabel('Income (1986 $)') 73 | plt.legend() 74 | plt.show() 75 | ________________________________________________________________
76 | 6.Predicting a binary variable 77 | # Recode grass 78 | gss['grass'].replace(2, 0, inplace=True) 79 | 80 | # Run logistic regression 81 | results = smf.logit('grass ~ age + age2 + educ + educ2 + C(sex)', data=gss).fit() 82 | results.params 83 | 84 | # Make a DataFrame with a range of ages 85 | df = pd.DataFrame() 86 | df['age'] = np.linspace(18, 89) 87 | df['age2'] = df['age']**2 88 | 89 | # Set the education level to 12 90 | df['educ'] = 12 91 | df['educ2'] = df['educ']**2 92 | 93 | # Generate predictions for men and women 94 | df['sex'] = 1 95 | pred1 = results.predict(df) 96 | 97 | df['sex'] = 2 98 | pred2 = results.predict(df) 99 | 100 | plt.clf() 101 | grouped = gss.groupby('age') 102 | favor_by_age = grouped['grass'].mean() 103 | plt.plot(favor_by_age, 'o', alpha=0.5) 104 | 105 | plt.plot(df['age'], pred1, label='Male') 106 | plt.plot(df['age'], pred2, label='Female') 107 | 108 | plt.xlabel('Age') 109 | plt.ylabel('Probability of favoring legalization') 110 | plt.legend() 111 | plt.show() 112 | ________________________________________________________________
-------------------------------------------------------------------------------- /13 - Exploratory Data Analysis in Python/key points: -------------------------------------------------------------------------------- 1 | 2 |
-------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 1 - preparing data for analysis.txt: --------------------------------------------------------------------------------
1 | 1.Examining the dataset 2 | # Import the pandas library as pd 3 | import pandas as pd 4 | 5 | # Read 'police.csv' into a DataFrame named ri 6 | ri = pd.read_csv('police.csv') 7 | 8 | # Examine the head of the DataFrame 9 | print(ri.head()) 10 | 11 | # Count the number of missing values in each column 12 | print(ri.isnull().sum()) 13 | ______________________________________________________________________
14 | 2.Dropping columns 15 | # Examine the shape of the DataFrame 16 |
print(ri.shape) 17 | 18 | # Drop the 'county_name' and 'state' columns 19 | ri.drop(['county_name', 'state'], axis='columns', inplace=True) 20 | 21 | # Examine the shape of the DataFrame (again) 22 | print(ri.shape) 23 | ______________________________________________________________________ 24 | 3.Dropping rows 25 | # Count the number of missing values in each column 26 | print(ri.isnull().sum()) 27 | 28 | # Drop all rows that are missing 'driver_gender' 29 | ri.dropna(subset=['driver_gender'], inplace=True) 30 | 31 | # Count the number of missing values in each column (again) 32 | print(ri.isnull().sum()) 33 | 34 | # Examine the shape of the DataFrame 35 | print(ri.shape) 36 | ______________________________________________________________________ 37 | 4.Fixing a data type 38 | # Examine the head of the 'is_arrested' column 39 | print(ri.is_arrested.head()) 40 | 41 | # Change the data type of 'is_arrested' to 'bool' 42 | ri['is_arrested'] = ri.is_arrested.astype('bool') 43 | 44 | # Check the data type of 'is_arrested' 45 | print(ri.is_arrested.dtype) 46 | ______________________________________________________________________ 47 | 5.Combining object columns 48 | # Concatenate 'stop_date' and 'stop_time' (separated by a space) 49 | combined = ri.stop_date.str.cat(ri.stop_time, sep=' ') 50 | 51 | # Convert 'combined' to datetime format 52 | ri['stop_datetime'] = pd.to_datetime(combined) 53 | 54 | # Examine the data types of the DataFrame 55 | print(ri.dtypes) 56 | ______________________________________________________________________ 57 | 6.Setting the index 58 | # Set 'stop_datetime' as the index 59 | ri.set_index('stop_datetime', inplace=True) 60 | 61 | # Examine the index 62 | print(ri.index) 63 | 64 | # Examine the columns 65 | print(ri.columns) 66 | ______________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 2 - Exploring the Relationship between gender and policing.txt: -------------------------------------------------------------------------------- 1 | 1.Examining traffic violations 2 | # Count the unique values in 'violation' 3 | print(ri.violation.value_counts()) 4 | 5 | # Express the counts as proportions 6 | print(ri.violation.value_counts(normalize=True)) 7 | _______________________________________________________________________ 8 | 2.Comparing violations by gender 9 | # Create a DataFrame of female drivers 10 | female = ri[ri.driver_gender == 'F'] 11 | 12 | # Create a DataFrame of male drivers 13 | male = ri[ri.driver_gender == 'M'] 14 | 15 | # Compute the violations by female drivers (as proportions) 16 | print(female.violation.value_counts(normalize=True)) 17 | 18 | # Compute the violations by male drivers (as proportions) 19 | print(male.violation.value_counts(normalize=True)) 20 | _______________________________________________________________________ 21 | 3.Comparing speeding outcomes by gender 22 | # Create a DataFrame of female drivers stopped for speeding 23 | female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')] 24 | 25 | # Create a DataFrame of male drivers stopped for speeding 26 | male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')] 27 | 28 | # Compute the stop outcomes for female drivers (as proportions) 29 | print(female_and_speeding.stop_outcome.value_counts(normalize=True)) 30 | 31 | # Compute the stop outcomes for male drivers (as proportions) 32 | 
print(male_and_speeding.stop_outcome.value_counts(normalize=True)) 33 | _______________________________________________________________________ 34 | 4.Calculating the search rate 35 | # Check the data type of 'search_conducted' 36 | print(ri.search_conducted.dtype) 37 | 38 | # Calculate the search rate by counting the values 39 | print(ri.search_conducted.value_counts(normalize=True)) 40 | 41 | # Calculate the search rate by taking the mean 42 | print(ri.search_conducted.mean()) 43 | _______________________________________________________________________ 44 | 5.Comparing search rates by gender 45 | # Calculate the search rate for both groups simultaneously 46 | print(ri.groupby('driver_gender').search_conducted.mean()) 47 | _______________________________________________________________________ 48 | 6.Adding a second factor to the analysis 49 | # Reverse the ordering to group by violation before gender 50 | print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean()) 51 | _______________________________________________________________________ 52 | 7.Counting protective frisks 53 | # Count the 'search_type' values 54 | print(ri.search_type.value_counts()) 55 | 56 | # Check if 'search_type' contains the string 'Protective Frisk' 57 | ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False) 58 | 59 | # Check the data type of 'frisk' 60 | print(ri.frisk.dtype) 61 | 62 | # Take the sum of 'frisk' 63 | print(ri.frisk.sum()) 64 | _______________________________________________________________________ 65 | 8.Comparing frisk rates by gender 66 | # Create a DataFrame of stops in which a search was conducted 67 | searched = ri[ri.search_conducted == True] 68 | 69 | # Calculate the overall frisk rate by taking the mean of 'frisk' 70 | print(searched.frisk.mean()) 71 | 72 | # Calculate the frisk rate for each gender 73 | print(searched.groupby('driver_gender').frisk.mean()) 74 | _______________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 3 - Visual Exploratory data analysis.txt: -------------------------------------------------------------------------------- 1 | 1.Calculating the hourly arrest rate 2 | # Calculate the overall arrest rate 3 | print(ri.is_arrested.mean()) 4 | 5 | # Calculate the hourly arrest rate 6 | print(ri.groupby(ri.index.hour).is_arrested.mean()) 7 | 8 | # Save the hourly arrest rate 9 | hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean() 10 | __________________________________________________________________________ 11 | 2.Plotting the hourly arrest rate 12 | # Import matplotlib.pyplot as plt 13 | import matplotlib.pyplot as plt 14 | 15 | # Create a line plot of 'hourly_arrest_rate' 16 | hourly_arrest_rate.plot() 17 | 18 | # Add the xlabel, ylabel, and title 19 | plt.xlabel('Hour') 20 | plt.ylabel('Arrest Rate') 21 | plt.title('Arrest Rate by Time of Day') 22 | 23 | # Display the plot 24 | plt.show() 25 | __________________________________________________________________________ 26 | 3.Plotting drug-related stops 27 | # Calculate the annual rate of drug-related stops 28 | print(ri.drugs_related_stop.resample('A').mean()) 29 | 30 | # Save the annual rate of drug-related stops 31 | annual_drug_rate = ri.drugs_related_stop.resample('A').mean() 32 | 33 | # Create a line plot of 'annual_drug_rate' 34 | annual_drug_rate.plot() 35 | 36 | # Display the plot 37 | plt.show() 38 | 
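# Added note: the 'A' (annual) resampling alias used above is deprecated in
# recent pandas (2.2+); the modern equivalent is 'YE' (year-end), e.g.
# annual_drug_rate = ri.drugs_related_stop.resample('YE').mean()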
__________________________________________________________________________ 39 | 4.Comparing drug and search rates 40 | # Calculate and save the annual search rate 41 | annual_search_rate = ri.search_conducted.resample('A').mean() 42 | 43 | # Concatenate 'annual_drug_rate' and 'annual_search_rate' 44 | annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns') 45 | 46 | # Create subplots from 'annual' 47 | annual.plot(subplots=True) 48 | 49 | # Display the subplots 50 | plt.show() 51 | __________________________________________________________________________ 52 | 5.Tallying violations by district 53 | # Create a frequency table of districts and violations 54 | print(pd.crosstab(ri.district, ri.violation)) 55 | 56 | # Save the frequency table as 'all_zones' 57 | all_zones = pd.crosstab(ri.district, ri.violation) 58 | 59 | # Select rows 'Zone K1' through 'Zone K3' 60 | print(all_zones.loc['Zone K1':'Zone K3']) 61 | 62 | # Save the smaller table as 'k_zones' 63 | k_zones = all_zones.loc['Zone K1':'Zone K3'] 64 | __________________________________________________________________________ 65 | 6.Plotting violations by district 66 | # Create a stacked bar plot of 'k_zones' 67 | k_zones.plot(kind='bar', stacked=True) 68 | 69 | # Display the plot 70 | plt.show() 71 | __________________________________________________________________________ 72 | 7.Converting stop durations to numbers 73 | # Print the unique values in 'stop_duration' 74 | print(ri.stop_duration.unique()) 75 | 76 | # Create a dictionary that maps strings to integers 77 | mapping = {'0-15 Min':8, '16-30 Min':23, '30+ Min':45} 78 | 79 | # Convert the 'stop_duration' strings to integers using the 'mapping' 80 | ri['stop_minutes'] = ri.stop_duration.map(mapping) 81 | 82 | # Print the unique values in 'stop_minutes' 83 | print(ri.stop_minutes.unique()) 84 | __________________________________________________________________________ 85 | 8.Plotting stop length 86 | # Calculate the mean 'stop_minutes' for each value in 'violation_raw' 87 | print(ri.groupby('violation_raw').stop_minutes.mean()) 88 | 89 | # Save the resulting Series as 'stop_length' 90 | stop_length = ri.groupby('violation_raw').stop_minutes.mean() 91 | 92 | # Sort 'stop_length' by its values and create a horizontal bar plot 93 | stop_length.sort_values().plot(kind='barh') 94 | 95 | # Display the plot 96 | plt.show() 97 | __________________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/Chapter 4 - Analyzing the effect of weather on policing.txt: -------------------------------------------------------------------------------- 1 | 1.Plotting the temperature 2 | # Read 'weather.csv' into a DataFrame named 'weather' 3 | weather = pd.read_csv('weather.csv') 4 | 5 | # Describe the temperature columns 6 | print(weather[['TMIN', 'TAVG', 'TMAX']].describe()) 7 | 8 | # Create a box plot of the temperature columns 9 | weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box') 10 | 11 | # Display the plot 12 | plt.show() 13 | ___________________________________________________________________________ 14 | 2.Plotting the temperature difference 15 | # Create a 'TDIFF' column that represents temperature difference 16 | weather['TDIFF'] = weather.TMAX - weather.TMIN 17 | 18 | # Describe the 'TDIFF' column 19 | print(weather.TDIFF.describe()) 20 | 21 | # Create a histogram with 20 bins to visualize 'TDIFF' 22 | weather.TDIFF.plot(kind='hist', bins=20) 
23 | 24 | # Display the plot 25 | plt.show() 26 | ___________________________________________________________________________
27 | 3.Counting bad weather conditions 28 | # Copy 'WT01' through 'WT22' to a new DataFrame 29 | WT = weather.loc[:, 'WT01':'WT22'] 30 | 31 | # Calculate the sum of each row in 'WT' 32 | weather['bad_conditions'] = WT.sum(axis='columns') 33 | 34 | # Replace missing values in 'bad_conditions' with '0' 35 | weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int') 36 | 37 | # Create a histogram to visualize 'bad_conditions' 38 | weather.bad_conditions.plot(kind='hist') 39 | 40 | # Display the plot 41 | plt.show() 42 | ___________________________________________________________________________
43 | 4.Rating the weather conditions 44 | # Count the unique values in 'bad_conditions' and sort the index 45 | print(weather.bad_conditions.value_counts().sort_index()) 46 | 47 | # Create a dictionary that maps integers to strings 48 | mapping = {0:'good', 1:'bad', 2:'bad', 3:'bad', 4:'bad', 5:'worse', 6:'worse', 7:'worse', 8:'worse', 9:'worse'} 49 | 50 | # Convert the 'bad_conditions' integers to strings using the 'mapping' 51 | weather['rating'] = weather.bad_conditions.map(mapping) 52 | 53 | # Count the unique values in 'rating' 54 | print(weather.rating.value_counts()) 55 | ___________________________________________________________________________
56 | 5.Changing the data type to category 57 | # Create a list of weather ratings in logical order 58 | cats = ['good', 'bad', 'worse'] 59 | 60 | # Change the data type of 'rating' to an ordered category 61 | weather['rating'] = weather.rating.astype(pd.CategoricalDtype(categories=cats, ordered=True))  # updated: the astype('category', ordered=True, categories=cats) signature was removed in newer pandas 62 | 63 | # Examine the head of 'rating' 64 | print(weather.rating.head()) 65 | ___________________________________________________________________________
66 | 6.Preparing the DataFrames 67 | # Reset the index of 'ri' 68 | ri.reset_index(inplace=True) 69 | 70 | # Examine the head of 'ri' 71 | print(ri.head()) 72 | 73 | # Create a DataFrame from the 'DATE' and 'rating' columns 74 | weather_rating = weather[['DATE', 'rating']] 75 | 76 | # Examine the head of 'weather_rating' 77 | print(weather_rating.head()) 78 | ___________________________________________________________________________
79 | 7.Merging the DataFrames 80 | # Examine the shape of 'ri' 81 | print(ri.shape) 82 | 83 | # Merge 'ri' and 'weather_rating' using a left join 84 | ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left') 85 | 86 | # Examine the shape of 'ri_weather' 87 | print(ri_weather.shape) 88 | 89 | # Set 'stop_datetime' as the index of 'ri_weather' 90 | ri_weather.set_index('stop_datetime', inplace=True) 91 | ___________________________________________________________________________
92 | 8.Comparing arrest rates by weather rating 93 | # Calculate the arrest rate for each 'violation' and 'rating' 94 | print(ri_weather.groupby(['violation', 'rating']).is_arrested.mean()) 95 | ___________________________________________________________________________
96 | 9.Selecting from a multi-indexed Series 97 | # Save the output of the groupby operation from the last exercise 98 | arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean() 99 | 100 | # Print the 'arrest_rate' Series 101 | print(arrest_rate) 102 | 103 | # Print the arrest rate for moving violations in bad weather 104 | print(arrest_rate.loc['Moving violation', 'bad']) 105 | 106 | # Print the arrest rates for speeding violations in all three
weather conditions 107 | print(arrest_rate.loc['Speeding']) 108 | ___________________________________________________________________________ 109 | 10.Reshaping the arrest rate data 110 | # Unstack the 'arrest_rate' Series into a DataFrame 111 | print(arrest_rate.unstack()) 112 | 113 | # Create the same DataFrame using a pivot table 114 | print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested')) 115 | ___________________________________________________________________________ -------------------------------------------------------------------------------- /14 - Analyzing Police Activity with pandas/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 1 - Graphical Exploratory Data Analysis .txt: -------------------------------------------------------------------------------- 1 | 1.Plotting a histogram of iris data 2 | # Import plotting modules 3 | import matplotlib.pyplot as plt 4 | import seaborn as sns 5 | 6 | # Set default Seaborn style 7 | sns.set() 8 | 9 | # Plot histogram of versicolor petal lengths 10 | _ = plt.hist(versicolor_petal_length) 11 | 12 | # Show histogram 13 | plt.show() 14 | __________________________________________________________________________ 15 | 2.Axis labels! 16 | # Plot histogram of versicolor petal lengths 17 | _ = plt.hist(versicolor_petal_length) 18 | 19 | # Label axes 20 | _ = plt.xlabel('petal length (cm)') 21 | _ = plt.ylabel('count') 22 | 23 | # Show histogram 24 | plt.show() 25 | __________________________________________________________________________ 26 | 3.Adjusting the number of bins in a histogram 27 | # Import numpy 28 | import numpy as np 29 | 30 | # Compute number of data points: n_data 31 | n_data = len(versicolor_petal_length) 32 | 33 | # Number of bins is the square root of number of data points: n_bins 34 | n_bins = np.sqrt(n_data) 35 | 36 | # Convert number of bins to integer: n_bins 37 | n_bins = int(n_bins) 38 | 39 | # Plot the histogram 40 | _ = plt.hist(versicolor_petal_length, bins=n_bins) 41 | 42 | # Label axes 43 | _ = plt.xlabel('petal length (cm)') 44 | _ = plt.ylabel('count') 45 | 46 | # Show histogram 47 | plt.show() 48 | __________________________________________________________________________ 49 | 4.Bee swarm plot 50 | # Create bee swarm plot with Seaborn's default settings 51 | _ = sns.swarmplot(x='species', y='petal length (cm)', data=df) 52 | 53 | # Label the axes 54 | _ = plt.xlabel('species') 55 | _ = plt.ylabel('petal length (cm)') 56 | 57 | # Show the plot 58 | plt.show() 59 | __________________________________________________________________________ 60 | 5.Computing the ECDF 61 | def ecdf(data): 62 | """Compute ECDF for a one-dimensional array of measurements.""" 63 | # Number of data points: n 64 | n = len(data) 65 | 66 | # x-data for the ECDF: x 67 | x = np.sort(data) 68 | 69 | # y-data for the ECDF: y 70 | y = np.arange(1, n+1) / n 71 | 72 | return x, y 73 | __________________________________________________________________________ 74 | 6.Plotting the ECDF 75 | # Compute ECDF for versicolor data: x_vers, y_vers 76 | x_vers, y_vers = ecdf(versicolor_petal_length) 77 | 78 | # Generate plot 79 | _ = plt.plot(x_vers, y_vers, marker='.', linestyle='none') 80 | 81 | # Label the axes 82 | _ = plt.xlabel('petal length (cm)') 83 | _ = plt.ylabel('ECDF') 84 | 85 | # Display the plot 86 | plt.show() 87 | 
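# Added sanity check for the ecdf() helper above (assumes numpy imported as np):
# ecdf(np.array([3, 1, 2])) returns (array([1, 2, 3]), array([0.333..., 0.666..., 1.0])),
# i.e. each sorted value paired with the fraction of observations <= it.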
__________________________________________________________________________ 88 | 7.Comparison of ECDFs 89 | # Compute ECDFs 90 | x_set, y_set = ecdf(setosa_petal_length) 91 | x_vers, y_vers = ecdf(versicolor_petal_length) 92 | x_virg, y_virg = ecdf(virginica_petal_length) 93 | 94 | # Plot all ECDFs on the same plot 95 | _ = plt.plot(x_set, y_set, marker='.', linestyle='none') 96 | _ = plt.plot(x_vers, y_vers, marker='.', linestyle='none') 97 | _ = plt.plot(x_virg, y_virg, marker='.', linestyle='none') 98 | 99 | # Annotate the plot 100 | _ = plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right') 101 | _ = plt.xlabel('petal length (cm)') 102 | _ = plt.ylabel('ECDF') 103 | 104 | # Display the plot 105 | plt.show() 106 | __________________________________________________________________________ 107 | -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 2 - Quantitative Exploratory Data Analysis.txt: -------------------------------------------------------------------------------- 1 | 1.Computing means 2 | # Compute the mean 3 | mean_length_vers = np.mean(versicolor_petal_length) 4 | 5 | # Print the results with some nice formatting 6 | print('I. versicolor:', mean_length_vers, 'cm') 7 | ___________________________________________________________________________ 8 | 2.Computing percentiles 9 | # Specify array of percentiles: percentiles 10 | percentiles = np.array([2.5, 25, 50, 75, 97.5]) 11 | 12 | # Compute percentiles: ptiles_vers 13 | ptiles_vers = np.percentile(versicolor_petal_length, percentiles) 14 | 15 | # Print the result 16 | print(ptiles_vers) 17 | ___________________________________________________________________________ 18 | 3.Comparing percentiles to ECDF 19 | # Plot the ECDF 20 | _ = plt.plot(x_vers, y_vers, '.') 21 | _ = plt.xlabel('petal length (cm)') 22 | _ = plt.ylabel('ECDF') 23 | 24 | # Overlay percentiles as red x's 25 | _ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red', 26 | linestyle='none') 27 | 28 | # Show the plot 29 | plt.show() 30 | ___________________________________________________________________________ 31 | 4.Box-and-whisker plot 32 | # Create box plot with Seaborn's default settings 33 | _ = sns.boxplot(x='species', y='petal length (cm)', data=df) 34 | 35 | # Label the axes 36 | _ = plt.xlabel('species') 37 | _ = plt.ylabel('petal length (cm)') 38 | 39 | # Show the plot 40 | plt.show() 41 | ___________________________________________________________________________ 42 | 5.Computing the variance 43 | # Array of differences to mean: differences 44 | differences = versicolor_petal_length - np.mean(versicolor_petal_length) 45 | 46 | # Square the differences: diff_sq 47 | diff_sq = differences**2 48 | 49 | # Compute the mean square difference: variance_explicit 50 | variance_explicit = np.mean(diff_sq) 51 | 52 | # Compute the variance using NumPy: variance_np 53 | variance_np = np.var(versicolor_petal_length) 54 | 55 | # Print the results 56 | print(variance_explicit, variance_np) 57 | ___________________________________________________________________________ 58 | 6.The standard deviation and the variance 59 | # Compute the variance: variance 60 | variance = np.var(versicolor_petal_length) 61 | 62 | # Print the square root of the variance 63 | print(np.sqrt(variance)) 64 | 65 | # Print the standard deviation 66 | print(np.std(versicolor_petal_length)) 67 | ___________________________________________________________________________ 68 | 7.Scatter plots 69 
| # Make a scatter plot 70 | _ = plt.plot(versicolor_petal_length, versicolor_petal_width, 71 | marker='.', linestyle='none') 72 | 73 | # Label the axes 74 | _ = plt.xlabel('petal length (cm)') 75 | _ = plt.ylabel('petal width (cm)') 76 | 77 | # Show the result 78 | plt.show() 79 | ___________________________________________________________________________ 80 | 8.Computing the covariance 81 | # Compute the covariance matrix: covariance_matrix 82 | covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width) 83 | 84 | # Print covariance matrix 85 | print(covariance_matrix) 86 | 87 | # Extract covariance of length and width of petals: petal_cov 88 | petal_cov = covariance_matrix[0,1] 89 | 90 | # Print the length/width covariance 91 | print(petal_cov) 92 | ___________________________________________________________________________ 93 | 9.Computing the Pearson correlation coefficient 94 | def pearson_r(x, y): 95 | """Compute Pearson correlation coefficient between two arrays.""" 96 | # Compute correlation matrix: corr_mat 97 | corr_mat = np.corrcoef(x, y) 98 | 99 | # Return entry [0,1] 100 | return corr_mat[0,1] 101 | 102 | # Compute Pearson correlation coefficient for I. versicolor 103 | r = pearson_r(versicolor_petal_width, versicolor_petal_length) 104 | 105 | # Print the result 106 | print(r) 107 | ___________________________________________________________________________ -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 3 - Thinking probabilistically discrete variables.txt: -------------------------------------------------------------------------------- 1 | 1.Generating random numbers using the np.random module 2 | # Seed the random number generator 3 | np.random.seed(42) 4 | 5 | # Initialize random numbers: random_numbers 6 | random_numbers = np.empty(100000) 7 | 8 | # Generate random numbers by looping over range(100000) 9 | for i in range(100000): 10 | random_numbers[i] = np.random.random() 11 | 12 | # Plot a histogram 13 | _ = plt.hist(random_numbers) 14 | 15 | # Show the plot 16 | plt.show() 17 | ______________________________________________________________________ 18 | 2.The np.random module and Bernoulli trials 19 | def perform_bernoulli_trials(n, p): 20 | """Perform n Bernoulli trials with success probability p 21 | and return number of successes.""" 22 | # Initialize number of successes: n_success 23 | n_success = 0 24 | 25 | # Perform trials 26 | for i in range(n): 27 | # Choose random number between zero and one: random_number 28 | random_number = np.random.random() 29 | 30 | # If less than p, it's a success so add one to n_success 31 | if random_number < p: 32 | n_success += 1 33 | 34 | return n_success 35 | ______________________________________________________________________ 36 | 3.How many defaults might we expect? 37 | # Seed random number generator 38 | np.random.seed(42) 39 | 40 | # Initialize the number of defaults: n_defaults 41 | n_defaults = np.empty(1000) 42 | 43 | # Compute the number of defaults 44 | for i in range(1000): 45 | n_defaults[i] = perform_bernoulli_trials(100, 0.05) 46 | 47 | # Plot the histogram with default number of bins; label your axes 48 | _ = plt.hist(n_defaults, normed=True) 49 | _ = plt.xlabel('number of defaults out of 100 loans') 50 | _ = plt.ylabel('probability') 51 | 52 | # Show the plot 53 | plt.show() 54 | ______________________________________________________________________ 55 | 4.Will the bank fail? 
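# Editor's note (added): several histograms in this chapter pass normed=True,
# which was removed in Matplotlib 3.x; substitute density=True, e.g.
# _ = plt.hist(n_defaults, density=True)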
56 | # Compute ECDF: x, y 57 | x, y = ecdf(n_defaults) 58 | 59 | # Plot the CDF with labeled axes 60 | _ = plt.plot(x, y, marker='.', linestyle='none') 61 | _ = plt.xlabel('number of defaults out of 100') 62 | _ = plt.ylabel('CDF') 63 | 64 | # Show the plot 65 | plt.show() 66 | 67 | # Compute the number of 100-loan simulations with 10 or more defaults: n_lose_money 68 | n_lose_money = np.sum(n_defaults >= 10) 69 | 70 | # Compute and print probability of losing money 71 | print('Probability of losing money =', n_lose_money / len(n_defaults)) 72 | ______________________________________________________________________ 73 | 5.Sampling out of the Binomial distribution 74 | # Take 10,000 samples out of the binomial distribution: n_defaults 75 | n_defaults = np.random.binomial(n=100, p=0.05, size=10000) 76 | 77 | # Compute CDF: x, y 78 | x, y = ecdf(n_defaults) 79 | 80 | # Plot the CDF with axis labels 81 | _ = plt.plot(x, y, marker='.', linestyle='none') 82 | _ = plt.xlabel('number of defaults out of 100 loans') 83 | _ = plt.ylabel('CDF') 84 | 85 | # Show the plot 86 | plt.show() 87 | ______________________________________________________________________ 88 | 6.Plotting the Binomial PMF 89 | # Compute bin edges: bins 90 | bins = np.arange(0, max(n_defaults) + 1.5) - 0.5 91 | 92 | # Generate histogram 93 | _ = plt.hist(n_defaults, normed=True, bins=bins) 94 | 95 | # Label axes 96 | _ = plt.xlabel('number of defaults out of 100 loans') 97 | _ = plt.ylabel('PMF') 98 | 99 | # Show the plot 100 | plt.show() 101 | ______________________________________________________________________ 102 | 7.Relationship between Binomial and Poisson distributions 103 | # Draw 10,000 samples out of Poisson distribution: samples_poisson 104 | samples_poisson = np.random.poisson(10, size=10000) 105 | 106 | # Print the mean and standard deviation 107 | print('Poisson: ', np.mean(samples_poisson), 108 | np.std(samples_poisson)) 109 | 110 | # Specify values of n and p to consider for Binomial: n, p 111 | n = [20, 100, 1000] 112 | p = [0.5, 0.1, 0.01] 113 | 114 | # Draw 10,000 samples for each n,p pair: samples_binomial 115 | for i in range(3): 116 | samples_binomial = np.random.binomial(n[i], p[i], size=10000) 117 | 118 | # Print results 119 | print('n =', n[i], 'Binom:', np.mean(samples_binomial), 120 | np.std(samples_binomial)) 121 | ______________________________________________________________________ 122 | 8.Was 2015 anomalous? 
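# Editor's note (added): 251/115 below is, per the course's dataset, the
# historical rate of no-hitters per season (251 no-hitters over 115 seasons);
# the simulation asks how often seven or more would occur in one season by chance.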
123 | # Draw 10,000 samples out of Poisson distribution: n_nohitters 124 | n_nohitters = np.random.poisson(251/115, size=10000) 125 | 126 | # Compute number of samples that are seven or greater: n_large 127 | n_large = np.sum(n_nohitters >= 7) 128 | 129 | # Compute probability of getting seven or more: p_large 130 | p_large = n_large / 10000 131 | 132 | # Print the result 133 | print('Probability of seven or more no-hitters:', p_large) 134 | ______________________________________________________________________ -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/Chapter 4 - Thinking probabilistically continuous variables.txt: -------------------------------------------------------------------------------- 1 | 1.The Normal PDF 2 | # Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10 3 | samples_std1 = np.random.normal(20, 1, size=100000) 4 | samples_std3 = np.random.normal(20, 3, size=100000) 5 | samples_std10 = np.random.normal(20, 10, size=100000) 6 | 7 | # Make histograms 8 | _ = plt.hist(samples_std1, bins=100, normed=True, histtype='step') 9 | _ = plt.hist(samples_std3, bins=100, normed=True, histtype='step') 10 | _ = plt.hist(samples_std10, bins=100, normed=True, histtype='step') 11 | 12 | # Make a legend, set limits and show plot 13 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10')) 14 | plt.ylim(-0.01, 0.42) 15 | plt.show() 16 | ________________________________________________________________________ 17 | 2.The Normal CDF 18 | # Generate CDFs 19 | x_std1, y_std1 = ecdf(samples_std1) 20 | x_std3, y_std3 = ecdf(samples_std3) 21 | x_std10, y_std10 = ecdf(samples_std10) 22 | 23 | # Plot CDFs 24 | _ = plt.plot(x_std1, y_std1, marker='.', linestyle='none') 25 | _ = plt.plot(x_std3, y_std3, marker='.', linestyle='none') 26 | _ = plt.plot(x_std10, y_std10, marker='.', linestyle='none') 27 | 28 | # Make a legend and show the plot 29 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right') 30 | plt.show() 31 | ________________________________________________________________________ 32 | 3.Are the Belmont Stakes results Normally distributed? 33 | # Compute mean and standard deviation: mu, sigma 34 | mu = np.mean(belmont_no_outliers) 35 | sigma = np.std(belmont_no_outliers) 36 | 37 | # Sample out of a normal distribution with this mu and sigma: samples 38 | samples = np.random.normal(mu, sigma, size=10000) 39 | 40 | # Get the CDF of the samples and of the data 41 | x_theor, y_theor = ecdf(samples) 42 | x, y = ecdf(belmont_no_outliers) 43 | 44 | # Plot the CDFs and show the plot 45 | _ = plt.plot(x_theor, y_theor) 46 | _ = plt.plot(x, y, marker='.', linestyle='none') 47 | _ = plt.xlabel('Belmont winning time (sec.)') 48 | _ = plt.ylabel('CDF') 49 | plt.show() 50 | ________________________________________________________________________ 51 | 4.What are the chances of a horse matching or beating Secretariat's record? 52 | # Take a million samples out of the Normal distribution: samples 53 | samples = np.random.normal(mu, sigma, size=1000000) 54 | 55 | # Compute the fraction that are faster than 144 seconds: prob 56 | prob = np.sum(samples <= 144) / len(samples) 57 | 58 | # Print the result 59 | print('Probability of besting Secretariat:', prob) 60 | ________________________________________________________________________ 61 | 5.If you have a story, you can simulate it! 
62 | def successive_poisson(tau1, tau2, size=1): 63 | """Compute time for arrival of 2 successive Poisson processes.""" 64 | # Draw samples out of first exponential distribution: t1 65 | t1 = np.random.exponential(tau1, size=size) 66 | 67 | # Draw samples out of second exponential distribution: t2 68 | t2 = np.random.exponential(tau2, size=size) 69 | 70 | return t1 + t2 71 | ________________________________________________________________________ 72 | 6.Distribution of no-hitters and cycles 73 | # Draw samples of waiting times 74 | waiting_times = successive_poisson(764, 715, size=100000) 75 | 76 | # Make the histogram 77 | _ = plt.hist(waiting_times, bins=100, histtype='step', 78 | normed=True) 79 | 80 | # Label axes 81 | _ = plt.xlabel('total waiting time (games)') 82 | _ = plt.ylabel('PDF') 83 | 84 | # Show the plot 85 | plt.show() 86 | ________________________________________________________________________ -------------------------------------------------------------------------------- /15 - Statistical Thinking in Python (Part 1)/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /16 - Statistical Thinking in Python (Part 2)/Chapter 1 - Parameter estimation by optimization.txt: -------------------------------------------------------------------------------- 1 | 1.How often do we get no-hitters? 2 | # Seed random number generator 3 | np.random.seed(42) 4 | 5 | # Compute mean no-hitter time: tau 6 | tau = np.mean(nohitter_times) 7 | 8 | # Draw out of an exponential distribution with parameter tau: inter_nohitter_time 9 | inter_nohitter_time = np.random.exponential(tau, 100000) 10 | 11 | # Plot the PDF and label axes 12 | _ = plt.hist(inter_nohitter_time, 13 | bins=50, normed=True, histtype='step') 14 | _ = plt.xlabel('Games between no-hitters') 15 | _ = plt.ylabel('PDF') 16 | 17 | # Show the plot 18 | plt.show() 19 | _________________________________________________________________________ 20 | 2.Do the data follow our story? 21 | # Create an ECDF from real data: x, y 22 | x, y = ecdf(nohitter_times) 23 | 24 | # Create a CDF from theoretical samples: x_theor, y_theor 25 | x_theor, y_theor = ecdf(inter_nohitter_time) 26 | 27 | # Overlay the plots 28 | plt.plot(x_theor, y_theor) 29 | plt.plot(x, y, marker='.', linestyle='none') 30 | 31 | # Margins and axis labels 32 | plt.margins(0.02) 33 | plt.xlabel('Games between no-hitters') 34 | plt.ylabel('CDF') 35 | 36 | # Show the plot 37 | plt.show() 38 | _________________________________________________________________________ 39 | 3.How is this parameter optimal? 
39 | 3.How is this parameter optimal?
40 | # Plot the data and theoretical CDFs
41 | plt.plot(x_theor, y_theor)
42 | plt.plot(x, y, marker='.', linestyle='none')
43 | plt.margins(0.02)
44 | plt.xlabel('Games between no-hitters')
45 | plt.ylabel('CDF')
46 | 
47 | # Take samples with half tau: samples_half
48 | samples_half = np.random.exponential(tau/2, 10000)
49 | 
50 | # Take samples with double tau: samples_double
51 | samples_double = np.random.exponential(2*tau, 10000)
52 | 
53 | # Generate CDFs from these samples
54 | x_half, y_half = ecdf(samples_half)
55 | x_double, y_double = ecdf(samples_double)
56 | 
57 | # Plot these CDFs as lines
58 | _ = plt.plot(x_half, y_half)
59 | _ = plt.plot(x_double, y_double)
60 | 
61 | # Show the plot
62 | plt.show()
63 | _________________________________________________________________________
64 | 4.EDA of literacy/fertility data
65 | # Plot the illiteracy rate versus fertility
66 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
67 | 
68 | # Set the margins and label axes
69 | plt.margins(0.02)
70 | _ = plt.xlabel('percent illiterate')
71 | _ = plt.ylabel('fertility')
72 | 
73 | # Show the plot
74 | plt.show()
75 | 
76 | # Show the Pearson correlation coefficient (pearson_r is a helper written earlier in the course)
77 | print(pearson_r(illiteracy, fertility))
78 | _________________________________________________________________________
79 | 5.Linear regression
80 | # Plot the illiteracy rate versus fertility
81 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
82 | plt.margins(0.02)
83 | _ = plt.xlabel('percent illiterate')
84 | _ = plt.ylabel('fertility')
85 | 
86 | # Perform a linear regression using np.polyfit(): a, b
87 | a, b = np.polyfit(illiteracy, fertility, 1)
88 | 
89 | # Print the results to the screen
90 | print('slope =', a, 'children per woman / percent illiterate')
91 | print('intercept =', b, 'children per woman')
92 | 
93 | # Make theoretical line to plot
94 | x = np.array([0, 100])
95 | y = a * x + b
96 | 
97 | # Add regression line to your plot
98 | _ = plt.plot(x, y)
99 | 
100 | # Draw the plot
101 | plt.show()
102 | _________________________________________________________________________
103 | 6.How is it optimal?
104 | # Specify slopes to consider: a_vals 105 | a_vals = np.linspace(0, 0.1, 200) 106 | 107 | # Initialize sum of square of residuals: rss 108 | rss = np.empty_like(a_vals) 109 | 110 | # Compute sum of square of residuals for each value of a_vals 111 | for i, a in enumerate(a_vals): 112 | rss[i] = np.sum((fertility - a*illiteracy - b)**2) 113 | 114 | # Plot the RSS 115 | plt.plot(a_vals, rss, '-') 116 | plt.xlabel('slope (children per woman / percent illiterate)') 117 | plt.ylabel('sum of square of residuals') 118 | 119 | plt.show() 120 | _________________________________________________________________________ 121 | 7.Linear regression on appropriate Anscombe data 122 | # Perform linear regression: a, b 123 | a, b = np.polyfit(x, y, 1) 124 | 125 | # Print the slope and intercept 126 | print(a, b) 127 | 128 | # Generate theoretical x and y data: x_theor, y_theor 129 | x_theor = np.array([3, 15]) 130 | y_theor = a * x_theor + b 131 | 132 | # Plot the Anscombe data and theoretical line 133 | _ = plt.plot(x, y, marker='.', linestyle='none') 134 | _ = plt.plot(x_theor, y_theor) 135 | 136 | # Label the axes 137 | plt.xlabel('x') 138 | plt.ylabel('y') 139 | 140 | # Show the plot 141 | plt.show() 142 | _________________________________________________________________________ 143 | 8.Linear regression on all Anscombe data 144 | # Iterate through x,y pairs 145 | for x, y in zip(anscombe_x, anscombe_y): 146 | # Compute the slope and intercept: a, b 147 | a, b = np.polyfit(x, y, 1) 148 | 149 | # Print the result 150 | print('slope:', a, 'intercept:', b) 151 | 152 | _________________________________________________________________________ -------------------------------------------------------------------------------- /16 - Statistical Thinking in Python (Part 2)/Chapter 3 - Introduction to hypothesis testing.txt: -------------------------------------------------------------------------------- 1 | 1.Generating a permutation sample 2 | def permutation_sample(data1, data2): 3 | """Generate a permutation sample from two data sets.""" 4 | 5 | # Concatenate the data sets: data 6 | data = np.concatenate((data1, data2)) 7 | 8 | # Permute the concatenated array: permuted_data 9 | permuted_data = np.random.permutation(data) 10 | 11 | # Split the permuted array into two: perm_sample_1, perm_sample_2 12 | perm_sample_1 = permuted_data[:len(data1)] 13 | perm_sample_2 = permuted_data[len(data1):] 14 | 15 | return perm_sample_1, perm_sample_2 16 | ______________________________________________________________________ 17 | 2.Visualizing permutation sampling 18 | for _ in range(50): 19 | # Generate permutation samples 20 | perm_sample_1, perm_sample_2 = permutation_sample( 21 | rain_june, rain_november) 22 | 23 | # Compute ECDFs 24 | x_1, y_1 = ecdf(perm_sample_1) 25 | x_2, y_2 = ecdf(perm_sample_2) 26 | 27 | # Plot ECDFs of permutation sample 28 | _ = plt.plot(x_1, y_1, marker='.', linestyle='none', 29 | color='red', alpha=0.02) 30 | _ = plt.plot(x_2, y_2, marker='.', linestyle='none', 31 | color='blue', alpha=0.02) 32 | 33 | # Create and plot ECDFs from original data 34 | x_1, y_1 = ecdf(rain_june) 35 | x_2, y_2 = ecdf(rain_november) 36 | _ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red') 37 | _ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue') 38 | 39 | # Label axes, set margin, and show plot 40 | plt.margins(0.02) 41 | _ = plt.xlabel('monthly rainfall (mm)') 42 | _ = plt.ylabel('ECDF') 43 | plt.show() 44 | ______________________________________________________________________ 45 | 
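Note: these exercises lean on the ecdf() helper written back in Part 1 of the course, which is not reproduced in these files. A minimal reconstruction of that helper, assuming numpy is imported as np:

def ecdf(data):
    """Return x, y arrays for an empirical CDF (reconstruction of the course helper)."""
    x = np.sort(data)                      # sorted measurements
    y = np.arange(1, len(x) + 1) / len(x)  # fraction of points <= each x
    return x, y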
3.Generating permutation replicates 46 | def draw_perm_reps(data_1, data_2, func, size=1): 47 | """Generate multiple permutation replicates.""" 48 | 49 | # Initialize array of replicates: perm_replicates 50 | perm_replicates = np.empty(size) 51 | 52 | for i in range(size): 53 | # Generate permutation sample 54 | perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2) 55 | 56 | # Compute the test statistic 57 | perm_replicates[i] = func(perm_sample_1, perm_sample_2) 58 | 59 | return perm_replicates 60 | ______________________________________________________________________ 61 | 4.Look before you leap: EDA before hypothesis testing 62 | # Make bee swarm plot 63 | _ = sns.swarmplot(x='ID', y='impact_force', data=df) 64 | 65 | # Label axes 66 | _ = plt.xlabel('frog') 67 | _ = plt.ylabel('impact force (N)') 68 | 69 | # Show the plot 70 | plt.show() 71 | ______________________________________________________________________ 72 | 5.Permutation test on frog data 73 | def diff_of_means(data_1, data_2): 74 | """Difference in means of two arrays.""" 75 | 76 | # The difference of means of data_1, data_2: diff 77 | diff = np.mean(data_1) - np.mean(data_2) 78 | 79 | return diff 80 | 81 | # Compute difference of mean impact force from experiment: empirical_diff_means 82 | empirical_diff_means = diff_of_means(force_a, force_b) 83 | 84 | # Draw 10,000 permutation replicates: perm_replicates 85 | perm_replicates = draw_perm_reps(force_a, force_b, 86 | diff_of_means, size=10000) 87 | 88 | # Compute p-value: p 89 | p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates) 90 | 91 | # Print the result 92 | print('p-value =', p) 93 | ______________________________________________________________________ 94 | 6.A one-sample bootstrap hypothesis test 95 | # Make an array of translated impact forces: translated_force_b 96 | translated_force_b = force_b - np.mean(force_b) + 0.55 97 | 98 | # Take bootstrap replicates of Frog B's translated impact forces: bs_replicates 99 | bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000) 100 | 101 | # Compute fraction of replicates that are less than the observed Frog B force: p 102 | p = np.sum(bs_replicates <= np.mean(force_b)) / 10000 103 | 104 | # Print the p-value 105 | print('p = ', p) 106 | ______________________________________________________________________ 107 | 7.A two-sample bootstrap hypothesis test for difference of means 108 | # Compute mean of all forces: mean_force 109 | mean_force = np.mean(forces_concat) 110 | 111 | # Generate shifted arrays 112 | force_a_shifted = force_a - np.mean(force_a) + mean_force 113 | force_b_shifted = force_b - np.mean(force_b) + mean_force 114 | 115 | # Compute 10,000 bootstrap replicates from shifted arrays 116 | bs_replicates_a = draw_bs_reps(force_a_shifted, np.mean, size=10000) 117 | bs_replicates_b = draw_bs_reps(force_b_shifted, np.mean, size=10000) 118 | 119 | # Get replicates of difference of means: bs_replicates 120 | bs_replicates = bs_replicates_a - bs_replicates_b 121 | 122 | # Compute and print p-value: p 123 | p = np.sum(bs_replicates >= empirical_diff_means) / len(bs_replicates) 124 | print('p-value =', p) 125 | ______________________________________________________________________ -------------------------------------------------------------------------------- /16 - Statistical Thinking in Python (Part 2)/key points: -------------------------------------------------------------------------------- 1 | 2 | 
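Note: the one- and two-sample bootstrap tests above call draw_bs_reps(), which the course defines in the bootstrap chapter of this part (not included in this excerpt). A reconstruction of the helper, assuming numpy is imported as np:

def bootstrap_replicate_1d(data, func):
    """One bootstrap replicate: resample with replacement, then apply func."""
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    """Draw size bootstrap replicates of the statistic func."""
    bs_replicates = np.empty(size)
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)
    return bs_replicates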
-------------------------------------------------------------------------------- /17 - Supervised Learning with Scikit-learn/Chapter 1 - Classification.txt: -------------------------------------------------------------------------------- 1 | 1.k-Nearest Neighbors: Fit 2 | # Import KNeighborsClassifier from sklearn.neighbors 3 | from sklearn.neighbors import KNeighborsClassifier 4 | 5 | # Create arrays for the features and the response variable 6 | y = df['party'].values 7 | X = df.drop('party', axis=1).values 8 | 9 | # Create a k-NN classifier with 6 neighbors 10 | knn = KNeighborsClassifier(n_neighbors=6) 11 | 12 | # Fit the classifier to the data 13 | knn.fit(X, y) 14 | ______________________________________________________________________ 15 | 2.k-Nearest Neighbors: Predict 16 | # Import KNeighborsClassifier from sklearn.neighbors 17 | from sklearn.neighbors import KNeighborsClassifier 18 | 19 | # Create arrays for the features and the response variable 20 | y = df['party'].values 21 | X = df.drop('party', axis=1).values 22 | 23 | # Create a k-NN classifier with 6 neighbors: knn 24 | knn = KNeighborsClassifier(n_neighbors=6) 25 | 26 | # Fit the classifier to the data 27 | knn.fit(X, y) 28 | 29 | # Predict the labels for the training data X: y_pred 30 | y_pred = knn.predict(X) 31 | 32 | # Predict and print the label for the new data point X_new 33 | new_prediction = knn.predict(X_new) 34 | print("Prediction: {}".format(new_prediction)) 35 | ______________________________________________________________________ 36 | 3.The digits recognition dataset 37 | # Import necessary modules 38 | from sklearn import datasets 39 | import matplotlib.pyplot as plt 40 | 41 | # Load the digits dataset: digits 42 | digits = datasets.load_digits() 43 | 44 | # Print the keys and DESCR of the dataset 45 | print(digits.keys()) 46 | print(digits.DESCR) 47 | 48 | # Print the shape of the images and data keys 49 | print(digits.images.shape) 50 | print(digits.data.shape) 51 | 52 | # Display digit 1010 53 | plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest') 54 | plt.show() 55 | ______________________________________________________________________ 56 | 4.Train/Test Split + Fit/Predict/Accuracy 57 | # Import necessary modules 58 | from sklearn.neighbors import KNeighborsClassifier 59 | from sklearn.model_selection import train_test_split 60 | 61 | # Create feature and target arrays 62 | X = digits.data 63 | y = digits.target 64 | 65 | # Split into training and test set 66 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y) 67 | 68 | # Create a k-NN classifier with 7 neighbors: knn 69 | knn = KNeighborsClassifier(n_neighbors=7) 70 | 71 | # Fit the classifier to the training data 72 | knn.fit(X_train, y_train) 73 | 74 | # Print the accuracy 75 | print(knn.score(X_test, y_test)) 76 | ______________________________________________________________________ 77 | 5.Overfitting and underfitting 78 | # Setup arrays to store train and test accuracies 79 | neighbors = np.arange(1, 9) 80 | train_accuracy = np.empty(len(neighbors)) 81 | test_accuracy = np.empty(len(neighbors)) 82 | 83 | # Loop over different values of k 84 | for i, k in enumerate(neighbors): 85 | # Setup a k-NN Classifier with k neighbors: knn 86 | knn = KNeighborsClassifier(n_neighbors=k) 87 | 88 | # Fit the classifier to the training data 89 | knn.fit(X_train, y_train) 90 | 91 | #Compute accuracy on the training set 92 | train_accuracy[i] = knn.score(X_train, y_train) 93 | 94 | 
#Compute accuracy on the testing set 95 | test_accuracy[i] = knn.score(X_test, y_test) 96 | 97 | # Generate plot 98 | plt.title('k-NN: Varying Number of Neighbors') 99 | plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy') 100 | plt.plot(neighbors, train_accuracy, label = 'Training Accuracy') 101 | plt.legend() 102 | plt.xlabel('Number of Neighbors') 103 | plt.ylabel('Accuracy') 104 | plt.show() 105 | 106 | ______________________________________________________________________ -------------------------------------------------------------------------------- /17 - Supervised Learning with Scikit-learn/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/Chapter 1 - Clustering for dataset exploration.txt: -------------------------------------------------------------------------------- 1 | 1.Clustering 2D points 2 | # Import KMeans 3 | from sklearn.cluster import KMeans 4 | 5 | # Create a KMeans instance with 3 clusters: model 6 | model = KMeans(n_clusters=3) 7 | 8 | # Fit model to points 9 | model.fit(points) 10 | 11 | # Determine the cluster labels of new_points: labels 12 | labels = model.predict(new_points) 13 | 14 | # Print cluster labels of new_points 15 | print(labels) 16 | ___________________________________________________________________ 17 | 2.Inspect your clustering 18 | # Import pyplot 19 | from matplotlib import pyplot as plt 20 | 21 | # Assign the columns of new_points: xs and ys 22 | xs = new_points[:,0] 23 | ys = new_points[:,1] 24 | 25 | # Make a scatter plot of xs and ys, using labels to define the colors 26 | plt.scatter(xs, ys, c=labels, alpha=0.5) 27 | 28 | # Assign the cluster centers: centroids 29 | centroids = model.cluster_centers_ 30 | 31 | # Assign the columns of centroids: centroids_x, centroids_y 32 | centroids_x = centroids[:,0] 33 | centroids_y = centroids[:,1] 34 | 35 | # Make a scatter plot of centroids_x and centroids_y 36 | plt.scatter(centroids_x, centroids_y, marker='D', s=50) 37 | plt.show() 38 | ___________________________________________________________________ 39 | 3.How many clusters of grain? 
40 | ks = range(1, 6) 41 | inertias = [] 42 | 43 | for k in ks: 44 | # Create a KMeans instance with k clusters: model 45 | model = KMeans(n_clusters=k) 46 | 47 | # Fit model to samples 48 | model.fit(samples) 49 | 50 | # Append the inertia to the list of inertias 51 | inertias.append(model.inertia_) 52 | 53 | # Plot ks vs inertias 54 | plt.plot(ks, inertias, '-o') 55 | plt.xlabel('number of clusters, k') 56 | plt.ylabel('inertia') 57 | plt.xticks(ks) 58 | plt.show() 59 | ___________________________________________________________________ 60 | 4.Evaluating the grain clustering 61 | # Create a KMeans model with 3 clusters: model 62 | model = KMeans(n_clusters=3) 63 | 64 | # Use fit_predict to fit model and obtain cluster labels: labels 65 | labels = model.fit_predict(samples) 66 | 67 | # Create a DataFrame with clusters and varieties as columns: df 68 | df = pd.DataFrame({'labels': labels, 'varieties': varieties}) 69 | 70 | # Create crosstab: ct 71 | ct = pd.crosstab(df['labels'], df['varieties']) 72 | 73 | # Display ct 74 | print(ct) 75 | ___________________________________________________________________ 76 | 5.Scaling fish data for clustering 77 | # Perform the necessary imports 78 | from sklearn.pipeline import make_pipeline 79 | from sklearn.preprocessing import StandardScaler 80 | from sklearn.cluster import KMeans 81 | 82 | # Create scaler: scaler 83 | scaler = StandardScaler() 84 | 85 | # Create KMeans instance: kmeans 86 | kmeans = KMeans(n_clusters=4) 87 | 88 | # Create pipeline: pipeline 89 | pipeline = make_pipeline(scaler, kmeans) 90 | __________________________________________________________________ 91 | 6.Clustering the fish data 92 | # Import pandas 93 | import pandas as pd 94 | 95 | # Fit the pipeline to samples 96 | pipeline.fit(samples) 97 | 98 | # Calculate the cluster labels: labels 99 | labels = pipeline.predict(samples) 100 | 101 | # Create a DataFrame with labels and species as columns: df 102 | df = pd.DataFrame({'labels': labels, 'species': species}) 103 | 104 | # Create crosstab: ct 105 | ct = pd.crosstab(df['labels'], df['species']) 106 | 107 | # Display ct 108 | print(ct) 109 | ___________________________________________________________________ 110 | 7.Clustering stocks using KMeans 111 | # Import Normalizer 112 | from sklearn.preprocessing import Normalizer 113 | 114 | # Create a normalizer: normalizer 115 | normalizer = Normalizer() 116 | 117 | # Create a KMeans model with 10 clusters: kmeans 118 | kmeans = KMeans(n_clusters=10) 119 | 120 | # Make a pipeline chaining normalizer and kmeans: pipeline 121 | pipeline = make_pipeline(normalizer, kmeans) 122 | 123 | # Fit pipeline to the daily price movements 124 | pipeline.fit(movements) 125 | ___________________________________________________________________ 126 | 8.Which stocks move together? 
127 | # Import pandas 128 | import pandas as pd 129 | 130 | # Predict the cluster labels: labels 131 | labels = pipeline.predict(movements) 132 | 133 | # Create a DataFrame aligning labels and companies: df 134 | df = pd.DataFrame({'labels': labels, 'companies': companies}) 135 | 136 | # Display df sorted by cluster label 137 | print(df.sort_values('labels')) 138 | ___________________________________________________________________ -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/Chapter 2 - Visualization with Hierarchical clustering and t-sne.txt: -------------------------------------------------------------------------------- 1 | 1.Hierarchical clustering of the grain data 2 | # Perform the necessary imports 3 | from scipy.cluster.hierarchy import linkage, dendrogram 4 | import matplotlib.pyplot as plt 5 | 6 | # Calculate the linkage: mergings 7 | mergings = linkage(samples, method='complete') 8 | 9 | # Plot the dendrogram, using varieties as labels 10 | dendrogram(mergings, 11 | labels=varieties, 12 | leaf_rotation=90, 13 | leaf_font_size=6, 14 | ) 15 | plt.show() 16 | _________________________________________________________________ 17 | 2.Hierarchies of stocks 18 | # Import normalize 19 | from sklearn.preprocessing import normalize 20 | 21 | # Normalize the movements: normalized_movements 22 | normalized_movements = normalize(movements) 23 | 24 | # Calculate the linkage: mergings 25 | mergings = linkage(normalized_movements, method='complete') 26 | 27 | # Plot the dendrogram 28 | dendrogram( 29 | mergings, 30 | labels=companies, 31 | leaf_rotation=90, 32 | leaf_font_size=6 33 | ) 34 | plt.show() 35 | _________________________________________________________________ 36 | 3.Different linkage, different hierarchical clustering! 
37 | # Perform the necessary imports 38 | import matplotlib.pyplot as plt 39 | from scipy.cluster.hierarchy import linkage, dendrogram 40 | 41 | # Calculate the linkage: mergings 42 | mergings = linkage(samples, method='single') 43 | 44 | # Plot the dendrogram 45 | dendrogram(mergings, 46 | labels=country_names, 47 | leaf_rotation=90, 48 | leaf_font_size=6, 49 | ) 50 | plt.show() 51 | _________________________________________________________________ 52 | 4.Extracting the cluster labels 53 | # Perform the necessary imports 54 | import pandas as pd 55 | from scipy.cluster.hierarchy import fcluster 56 | 57 | # Use fcluster to extract labels: labels 58 | labels = fcluster(mergings, 6, criterion='distance') 59 | 60 | # Create a DataFrame with labels and varieties as columns: df 61 | df = pd.DataFrame({'labels': labels, 'varieties': varieties}) 62 | 63 | # Create crosstab: ct 64 | ct = pd.crosstab(df['labels'], df['varieties']) 65 | 66 | # Display ct 67 | print(ct) 68 | _________________________________________________________________ 69 | 5.t-SNE visualization of grain dataset 70 | # Import TSNE 71 | from sklearn.manifold import TSNE 72 | 73 | # Create a TSNE instance: model 74 | model = TSNE(learning_rate=200) 75 | 76 | # Apply fit_transform to samples: tsne_features 77 | tsne_features = model.fit_transform(samples) 78 | 79 | # Select the 0th feature: xs 80 | xs = tsne_features[:,0] 81 | 82 | # Select the 1st feature: ys 83 | ys = tsne_features[:,1] 84 | 85 | # Scatter plot, coloring by variety_numbers 86 | plt.scatter(xs, ys, c=variety_numbers) 87 | plt.show() 88 | _________________________________________________________________ 89 | 6.A t-SNE map of the stock market 90 | # Import TSNE 91 | from sklearn.manifold import TSNE 92 | 93 | # Create a TSNE instance: model 94 | model = TSNE(learning_rate=50) 95 | 96 | # Apply fit_transform to normalized_movements: tsne_features 97 | tsne_features = model.fit_transform(normalized_movements) 98 | 99 | # Select the 0th feature: xs 100 | xs = tsne_features[:,0] 101 | 102 | # Select the 1th feature: ys 103 | ys = tsne_features[:,1] 104 | 105 | # Scatter plot 106 | plt.scatter(xs, ys, alpha=0.5) 107 | 108 | # Annotate the points 109 | for x, y, company in zip(xs, ys, companies): 110 | plt.annotate(company, (x, y), fontsize=5, alpha=0.75) 111 | plt.show() 112 | _________________________________________________________________ -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/Chapter 4 - Discovering Interpretable features.txt: -------------------------------------------------------------------------------- 1 | 1.NMF applied to Wikipedia articles 2 | # Import NMF 3 | from sklearn.decomposition import NMF 4 | 5 | # Create an NMF instance: model 6 | model = NMF(n_components=6) 7 | 8 | # Fit the model to articles 9 | model.fit(articles) 10 | 11 | # Transform the articles: nmf_features 12 | nmf_features = model.transform(articles) 13 | 14 | # Print the NMF features 15 | print(nmf_features.round(2)) 16 | ________________________________________________________________ 17 | 2.NMF features of the Wikipedia articles 18 | # Import pandas 19 | import pandas as pd 20 | 21 | # Create a pandas DataFrame: df 22 | df = pd.DataFrame(nmf_features, index=titles) 23 | 24 | # Print the row for 'Anne Hathaway' 25 | print(df.loc['Anne Hathaway']) 26 | 27 | # Print the row for 'Denzel Washington' 28 | print(df.loc['Denzel Washington']) 29 | ________________________________________________________________ 30 | 
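Note: the NMF features and components factor the original document matrix, so each article row is approximately a non-negative weighted sum of topic components. A quick sketch of that reconstruction, assuming model and nmf_features from the exercises above:

# Hypothetical check: approximately rebuild the article vectors
import numpy as np
reconstruction = nmf_features @ model.components_  # shape matches the articles matrix
# each row of nmf_features holds topic weights; each row of components_ is a topic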
3.NMF learns topics of documents 31 | # Import pandas 32 | import pandas as pd 33 | 34 | # Create a DataFrame: components_df 35 | components_df = pd.DataFrame(model.components_, columns=words) 36 | 37 | # Print the shape of the DataFrame 38 | print(components_df.shape) 39 | 40 | # Select row 3: component 41 | component = components_df.iloc[3] 42 | 43 | # Print result of nlargest 44 | print(component.nlargest()) 45 | ________________________________________________________________ 46 | 4.Explore the LED digits dataset 47 | # Import pyplot 48 | from matplotlib import pyplot as plt 49 | 50 | # Select the 0th row: digit 51 | digit = samples[0,:] 52 | 53 | # Print digit 54 | print(digit) 55 | 56 | # Reshape digit to a 13x8 array: bitmap 57 | bitmap = digit.reshape((13, 8)) 58 | 59 | # Print bitmap 60 | print(bitmap) 61 | 62 | # Use plt.imshow to display bitmap 63 | plt.imshow(bitmap, cmap='gray', interpolation='nearest') 64 | plt.colorbar() 65 | plt.show() 66 | ________________________________________________________________ 67 | 5.NMF learns the parts of images 68 | # Import NMF 69 | from sklearn.decomposition import NMF 70 | 71 | # Create an NMF model: model 72 | model = NMF(n_components=7) 73 | 74 | # Apply fit_transform to samples: features 75 | features = model.fit_transform(samples) 76 | 77 | # Call show_as_image on each component 78 | for component in model.components_: 79 | show_as_image(component) 80 | 81 | # Select the 0th row of features: digit_features 82 | digit_features = features[0,:] 83 | 84 | # Print digit_features 85 | print(digit_features) 86 | ________________________________________________________________ 87 | 6.PCA doesn't learn parts 88 | # Import PCA 89 | from sklearn.decomposition import PCA 90 | 91 | # Create a PCA instance: model 92 | model = PCA(n_components=7) 93 | 94 | # Apply fit_transform to samples: features 95 | features = model.fit_transform(samples) 96 | 97 | # Call show_as_image on each component 98 | for component in model.components_: 99 | show_as_image(component) 100 | ________________________________________________________________ 101 | 7.Which articles are similar to 'Cristiano Ronaldo'? 
102 | # Perform the necessary imports 103 | import pandas as pd 104 | from sklearn.preprocessing import normalize 105 | 106 | # Normalize the NMF features: norm_features 107 | norm_features = normalize(nmf_features) 108 | 109 | # Create a DataFrame: df 110 | df = pd.DataFrame(norm_features, index=titles) 111 | 112 | # Select the row corresponding to 'Cristiano Ronaldo': article 113 | article = df.loc['Cristiano Ronaldo'] 114 | 115 | # Compute the dot products: similarities 116 | similarities = df.dot(article) 117 | 118 | # Display those with the largest cosine similarity 119 | print(similarities.nlargest()) 120 | ________________________________________________________________ 121 | 8.Recommend musical artists part I 122 | # Perform the necessary imports 123 | from sklearn.decomposition import NMF 124 | from sklearn.preprocessing import Normalizer, MaxAbsScaler 125 | from sklearn.pipeline import make_pipeline 126 | 127 | # Create a MaxAbsScaler: scaler 128 | scaler = MaxAbsScaler() 129 | 130 | # Create an NMF model: nmf 131 | nmf = NMF(n_components=20) 132 | 133 | # Create a Normalizer: normalizer 134 | normalizer = Normalizer() 135 | 136 | # Create a pipeline: pipeline 137 | pipeline = make_pipeline(scaler, nmf, normalizer) 138 | 139 | # Apply fit_transform to artists: norm_features 140 | norm_features = pipeline.fit_transform(artists) 141 | ________________________________________________________________ 142 | 9.Recommend musical artists part II 143 | # Import pandas 144 | import pandas as pd 145 | 146 | # Create a DataFrame: df 147 | df = pd.DataFrame(norm_features, index=artist_names) 148 | 149 | # Select row of 'Bruce Springsteen': artist 150 | artist = df.loc['Bruce Springsteen'] 151 | 152 | # Compute cosine similarities: similarities 153 | similarities = df.dot(artist) 154 | 155 | # Display those with highest cosine similarity 156 | print(similarities.nlargest()) 157 | ________________________________________________________________ -------------------------------------------------------------------------------- /18 - Unsupervised Learning in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 1 - Classification and regression trees.txt: -------------------------------------------------------------------------------- 1 | 1.Train your first classification tree 2 | #work with the Wisconsin Breast Cancer Dataset from the UCI machine learning repository. 
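# Note: the exercise environment pre-loads this dataset and provides
# X_train, X_test, y_train, y_test, plus the constant SEED used below.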
3 | # Import DecisionTreeClassifier from sklearn.tree
4 | from sklearn.tree import DecisionTreeClassifier
5 | 
6 | # Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
7 | dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)
8 | 
9 | # Fit dt to the training set
10 | dt.fit(X_train, y_train)
11 | 
12 | # Predict test set labels
13 | y_pred = dt.predict(X_test)
14 | print(y_pred[0:5])
15 | ______________________________________________________________________
16 | 2.Evaluate the classification tree
17 | # Import accuracy_score
18 | from sklearn.metrics import accuracy_score
19 | 
20 | # Predict test set labels
21 | y_pred = dt.predict(X_test)
22 | 
23 | # Compute test set accuracy
24 | acc = accuracy_score(y_test, y_pred)
25 | print("Test set accuracy: {:.2f}".format(acc))
26 | ______________________________________________________________________
27 | 3.Logistic regression vs classification tree
28 | # Import LogisticRegression from sklearn.linear_model
29 | from sklearn.linear_model import LogisticRegression
30 | 
31 | # Instantiate logreg
32 | logreg = LogisticRegression(random_state=1)
33 | 
34 | # Fit logreg to the training set
35 | logreg.fit(X_train, y_train)
36 | 
37 | # Define a list called clfs containing the two classifiers logreg and dt
38 | clfs = [logreg, dt]
39 | 
40 | # Review the decision regions of the two classifiers (plot_labeled_decision_regions is provided by the exercise)
41 | plot_labeled_decision_regions(X_test, y_test, clfs)
42 | ______________________________________________________________________
43 | 4.Using entropy as a criterion
44 | # Import DecisionTreeClassifier from sklearn.tree
45 | from sklearn.tree import DecisionTreeClassifier
46 | 
47 | # Instantiate dt_entropy, set 'entropy' as the information criterion
48 | dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)
49 | 
50 | # Fit dt_entropy to the training set
51 | dt_entropy.fit(X_train, y_train)
52 | ______________________________________________________________________
53 | 5.Entropy vs Gini index
54 | # Import accuracy_score from sklearn.metrics
55 | from sklearn.metrics import accuracy_score
56 | 
57 | # Use dt_entropy to predict test set labels
58 | y_pred = dt_entropy.predict(X_test)
59 | 
60 | # Evaluate accuracy_entropy
61 | accuracy_entropy = accuracy_score(y_test, y_pred)
62 | 
63 | # Print accuracy_entropy
64 | print('Accuracy achieved by using entropy: ', accuracy_entropy)
65 | 
66 | # Print accuracy_gini (computed in a previous exercise)
67 | print('Accuracy achieved by using the gini index: ', accuracy_gini)
______________________________________________________________________
68 | 6.Train your first regression tree
69 | # Import DecisionTreeRegressor from sklearn.tree
70 | from sklearn.tree import DecisionTreeRegressor
71 | 
72 | # Instantiate dt
73 | dt = DecisionTreeRegressor(max_depth=8,
74 | min_samples_leaf=0.13,
75 | random_state=3)
76 | 
77 | # Fit dt to the training set
78 | dt.fit(X_train, y_train)
79 | ______________________________________________________________________
80 | 7.Evaluate the regression tree
81 | # Import mean_squared_error from sklearn.metrics as MSE
82 | from sklearn.metrics import mean_squared_error as MSE
83 | 
84 | # Compute y_pred
85 | y_pred = dt.predict(X_test)
86 | 
87 | # Compute mse_dt
88 | mse_dt = MSE(y_test, y_pred)
89 | 
90 | # Compute rmse_dt
91 | rmse_dt = mse_dt**(1/2)
92 | 
93 | # Print rmse_dt
94 | print("Test set RMSE of dt: {:.2f}".format(rmse_dt))
95 | ______________________________________________________________________
96 | 8.Linear regression vs regression tree
97 | # Predict test set labels (lr is a LinearRegression model fitted by the exercise)
98 | y_pred_lr = lr.predict(X_test)
99 | 
100 | # Compute mse_lr
101 | 
mse_lr = MSE(y_test, y_pred_lr) 102 | 103 | # Compute rmse_lr 104 | rmse_lr = mse_lr**(1/2) 105 | 106 | # Print rmse_lr 107 | print('Linear Regression test set RMSE: {:.2f}'.format(rmse_lr)) 108 | 109 | # Print rmse_dt 110 | print('Regression Tree test set RMSE: {:.2f}'.format(rmse_dt)) 111 | ______________________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 2 - The bias-variance Tradeoff.txt: -------------------------------------------------------------------------------- 1 | 1.Instantiate the model 2 | # Import train_test_split from sklearn.model_selection 3 | from sklearn.model_selection import train_test_split 4 | 5 | # Set SEED for reproducibility 6 | SEED = 1 7 | 8 | # Split the data into 70% train and 30% test 9 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED) 10 | 11 | # Instantiate a DecisionTreeRegressor dt 12 | dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED) 13 | _________________________________________________________________________ 14 | 2.Evaluate the 10-fold CV error 15 | # Compute the array containing the 10-folds CV MSEs 16 | MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, 17 | scoring='neg_mean_squared_error', 18 | n_jobs=-1) 19 | 20 | # Compute the 10-folds CV RMSE 21 | RMSE_CV = (MSE_CV_scores.mean())**(1/2) 22 | 23 | # Print RMSE_CV 24 | print('CV RMSE: {:.2f}'.format(RMSE_CV)) 25 | _________________________________________________________________________ 26 | 3.Evaluate the training error 27 | # Import mean_squared_error from sklearn.metrics as MSE 28 | from sklearn.metrics import mean_squared_error as MSE 29 | 30 | # Fit dt to the training set 31 | dt.fit(X_train, y_train) 32 | 33 | # Predict the labels of the training set 34 | y_pred_train = dt.predict(X_train) 35 | 36 | # Evaluate the training set RMSE of dt 37 | RMSE_train = (MSE(y_train, y_pred_train))**(1/2) 38 | 39 | # Print RMSE_train 40 | print('Train RMSE: {:.2f}'.format(RMSE_train)) 41 | _________________________________________________________________________ 42 | 4.Define the ensemble 43 | # Set seed for reproducibility 44 | SEED=1 45 | 46 | # Instantiate lr 47 | lr = LogisticRegression(random_state=SEED) 48 | 49 | # Instantiate knn 50 | knn = KNN(n_neighbors=27) 51 | 52 | # Instantiate dt 53 | dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED) 54 | 55 | # Define the list classifiers 56 | classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)] 57 | _________________________________________________________________________ 58 | 5.Evaluate individual classifiers 59 | # Iterate over the pre-defined list of classifiers 60 | for clf_name, clf in classifiers: 61 | 62 | # Fit clf to the training set 63 | clf.fit(X_train, y_train) 64 | 65 | # Predict y_pred 66 | y_pred = clf.predict(X_test) 67 | 68 | # Calculate accuracy 69 | accuracy = accuracy_score(y_test, y_pred) 70 | 71 | # Evaluate clf's accuracy on the test set 72 | print('{:s} : {:.3f}'.format(clf_name, accuracy)) 73 | _________________________________________________________________________ 74 | 6.Better performance with a Voting Classifier 75 | # Import VotingClassifier from sklearn.ensemble 76 | from sklearn.ensemble import VotingClassifier 77 | 78 | # Instantiate a VotingClassifier vc 79 | vc = VotingClassifier(estimators=classifiers) 80 | 81 | # 
Fit vc to the training set 82 | vc.fit(X_train, y_train) 83 | 84 | # Evaluate the test set predictions 85 | y_pred = vc.predict(X_test) 86 | 87 | # Calculate accuracy score 88 | accuracy = accuracy_score(y_test, y_pred) 89 | print('Voting Classifier: {:.3f}'.format(accuracy)) 90 | _________________________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 3 - Bagging and Random Forests.txt: -------------------------------------------------------------------------------- 1 | 1.Define the bagging classifier 2 | # Indian Liver Patient dataset from the UCI machine learning repository. 3 | # Import DecisionTreeClassifier 4 | from sklearn.tree import DecisionTreeClassifier 5 | 6 | # Import BaggingClassifier 7 | from sklearn.ensemble import BaggingClassifier 8 | 9 | # Instantiate dt 10 | dt = DecisionTreeClassifier(random_state=1) 11 | 12 | # Instantiate bc 13 | bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=1) 14 | _____________________________________________________________ 15 | 2.Evaluate Bagging performance 16 | # Fit bc to the training set 17 | bc.fit(X_train, y_train) 18 | 19 | # Predict test set labels 20 | y_pred = bc.predict(X_test) 21 | 22 | # Evaluate acc_test 23 | acc_test = accuracy_score(y_test, y_pred) 24 | print('Test set accuracy of bc: {:.2f}'.format(acc_test)) 25 | _____________________________________________________________ 26 | 3.Prepare the ground 27 | # Import DecisionTreeClassifier 28 | from sklearn.tree import DecisionTreeClassifier 29 | 30 | # Import BaggingClassifier 31 | from sklearn.ensemble import BaggingClassifier 32 | 33 | # Instantiate dt 34 | dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1) 35 | 36 | # Instantiate bc 37 | bc = BaggingClassifier(base_estimator=dt, 38 | n_estimators=50, 39 | oob_score=True, 40 | random_state=1) 41 | _____________________________________________________________ 42 | 4.OOB Score vs Test Set Score 43 | # Fit bc to the training set 44 | bc.fit(X_train, y_train) 45 | 46 | # Predict test set labels 47 | y_pred = bc.predict(X_test) 48 | 49 | # Evaluate test set accuracy 50 | acc_test = accuracy_score(y_test, y_pred) 51 | 52 | # Evaluate OOB accuracy 53 | acc_oob = bc.oob_score_ 54 | 55 | # Print acc_test and acc_oob 56 | print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob)) 57 | _____________________________________________________________ 58 | 5.Train an RF regressor 59 | #using historical weather data from the Bike Sharing Demand dataset available through Kaggle 60 | # Import RandomForestRegressor 61 | from sklearn.ensemble import RandomForestRegressor 62 | 63 | # Instantiate rf 64 | rf = RandomForestRegressor(n_estimators=25, 65 | random_state=2) 66 | 67 | # Fit rf to the training set 68 | rf.fit(X_train, y_train) 69 | _____________________________________________________________ 70 | 6.Evaluate the RF regressor 71 | # Import mean_squared_error as MSE 72 | from sklearn.metrics import mean_squared_error as MSE 73 | 74 | # Predict the test set labels 75 | y_pred = rf.predict(X_test) 76 | 77 | # Evaluate the test set RMSE 78 | rmse_test = MSE(y_test, y_pred)**(1/2) 79 | 80 | # Print rmse_test 81 | print('Test set RMSE of rf: {:.2f}'.format(rmse_test)) 82 | _____________________________________________________________ 83 | 7.Visualizing features importances 84 | # Create a pd.Series of features importances 85 | importances = 
pd.Series(data=rf.feature_importances_, 86 | index= X_train.columns) 87 | 88 | # Sort importances 89 | importances_sorted = importances.sort_values() 90 | 91 | # Draw a horizontal barplot of importances_sorted 92 | importances_sorted.plot(kind='barh', color='lightgreen') 93 | plt.title('Features Importances') 94 | plt.show() 95 | _____________________________________________________________ 96 | -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 4 - Boosting.txt: -------------------------------------------------------------------------------- 1 | 1.Define the AdaBoost classifier 2 | #the Indian Liver Patient dataset 3 | # Import DecisionTreeClassifier 4 | from sklearn.tree import DecisionTreeClassifier 5 | 6 | # Import AdaBoostClassifier 7 | from sklearn.ensemble import AdaBoostClassifier 8 | 9 | # Instantiate dt 10 | dt = DecisionTreeClassifier(max_depth=2, random_state=1) 11 | 12 | # Instantiate ada 13 | ada = AdaBoostClassifier(base_estimator=dt, n_estimators=180, random_state=1) 14 | _______________________________________________________________ 15 | 2.Train the AdaBoost classifier 16 | # Fit ada to the training set 17 | ada.fit(X_train, y_train) 18 | 19 | # Compute the probabilities of obtaining the positive class 20 | y_pred_proba = ada.predict_proba(X_test)[:,1] 21 | _______________________________________________________________ 22 | 3.Evaluate the AdaBoost classifier 23 | # Import roc_auc_score 24 | from sklearn.metrics import roc_auc_score 25 | 26 | # Evaluate test-set roc_auc_score 27 | ada_roc_auc = roc_auc_score(y_test, y_pred_proba) 28 | 29 | # Print roc_auc_score 30 | print('ROC AUC score: {:.2f}'.format(ada_roc_auc)) 31 | _______________________________________________________________ 32 | 4.Define the GB regressor 33 | #the Bike Sharing Demand dataset 34 | # Import GradientBoostingRegressor 35 | from sklearn.ensemble import GradientBoostingRegressor 36 | 37 | # Instantiate gb 38 | gb = GradientBoostingRegressor(max_depth=4, 39 | n_estimators=200, 40 | random_state=2) 41 | _______________________________________________________________ 42 | 5.Train the GB regressor 43 | # Fit gb to the training set 44 | gb.fit(X_train, y_train) 45 | 46 | # Predict test set labels 47 | y_pred = gb.predict(X_test) 48 | _______________________________________________________________ 49 | 6.Evaluate the GB regressor 50 | # Import mean_squared_error as MSE 51 | from sklearn.metrics import mean_squared_error as MSE 52 | 53 | # Compute MSE 54 | mse_test = MSE(y_test, y_pred) 55 | 56 | # Compute RMSE 57 | rmse_test = mse_test**(1/2) 58 | 59 | # Print RMSE 60 | 61 | print('Test set RMSE of gb: {:.3f}'.format(rmse_test)) 62 | _______________________________________________________________ 63 | 7.Regression with SGB 64 | # Import GradientBoostingRegressor 65 | from sklearn.ensemble import GradientBoostingRegressor 66 | 67 | # Instantiate sgbr 68 | sgbr = GradientBoostingRegressor(max_depth=4, 69 | subsample=0.9, 70 | max_features=0.75, 71 | n_estimators=200, 72 | random_state=2) 73 | _______________________________________________________________ 74 | 8.Train the SGB regressor 75 | # Fit sgbr to the training set 76 | sgbr.fit(X_train, y_train) 77 | 78 | # Predict test set labels 79 | y_pred = sgbr.predict(X_test) 80 | _______________________________________________________________ 81 | 9.Evaluate the SGB regressor 82 | # Import mean_squared_error as MSE 83 | from sklearn.metrics import mean_squared_error as 
MSE 84 | 85 | # Compute test set MSE 86 | mse_test = MSE(y_test, y_pred) 87 | 88 | # Compute test set RMSE 89 | rmse_test = mse_test**(1/2) 90 | 91 | # Print rmse_test 92 | print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test)) 93 | _______________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/Chapter 5 - Model Tuning.txt: -------------------------------------------------------------------------------- 1 | 1.Set the tree's hyperparameter grid 2 | # Define params_dt 3 | params_dt = { 4 | 'max_depth': [2, 3, 4], 5 | 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18] 6 | } 7 | _______________________________________________________________ 8 | 2.Search for the optimal tree 9 | # Import GridSearchCV 10 | from sklearn.model_selection import GridSearchCV 11 | 12 | # Instantiate grid_dt 13 | grid_dt = GridSearchCV(estimator=dt, 14 | param_grid=params_dt, 15 | scoring='roc_auc', 16 | cv=5, 17 | n_jobs=-1) 18 | _______________________________________________________________ 19 | 3.Evaluate the optimal tree 20 | # Import roc_auc_score from sklearn.metrics 21 | from sklearn.metrics import roc_auc_score 22 | 23 | # Extract the best estimator 24 | best_model = grid_dt.best_estimator_ 25 | 26 | # Predict the test set probabilities of the positive class 27 | y_pred_proba = best_model.predict_proba(X_test)[:,1] 28 | 29 | # Compute test_roc_auc 30 | test_roc_auc = roc_auc_score(y_test, y_pred_proba) 31 | 32 | # Print test_roc_auc 33 | print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc)) 34 | _______________________________________________________________ 35 | 4.Set the hyperparameter grid of RF 36 | # Define the dictionary 'params_rf' 37 | params_rf = { 38 | 'n_estimators': [100, 350, 500], 39 | 'max_features': ['log2', 'auto', 'sqrt'], 40 | 'min_samples_leaf': [2, 10, 30], 41 | } 42 | _______________________________________________________________ 43 | 5.Search for the optimal forest 44 | # Import GridSearchCV 45 | from sklearn.model_selection import GridSearchCV 46 | 47 | # Instantiate grid_rf 48 | grid_rf = GridSearchCV(estimator=rf, 49 | param_grid=params_rf, 50 | scoring='neg_mean_squared_error', 51 | cv=3, 52 | verbose=1, 53 | n_jobs=-1) 54 | _______________________________________________________________ 55 | 6.Evaluate the optimal forest 56 | # Import mean_squared_error from sklearn.metrics as MSE 57 | from sklearn.metrics import mean_squared_error as MSE 58 | 59 | # Extract the best estimator 60 | best_model = grid_rf.best_estimator_ 61 | 62 | # Predict test set labels 63 | y_pred = best_model.predict(X_test) 64 | 65 | # Compute rmse_test 66 | rmse_test = MSE(y_test, y_pred)**(1/2) 67 | 68 | # Print rmse_test 69 | print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 70 | _______________________________________________________________ -------------------------------------------------------------------------------- /19 - Machine learning with tree-based models in python/key points: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 1 - Introduction to clustering.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Anirudh-Chauhan/Data-Scientist-with-Python-DataCamp/2b254aab79c5c7420c9fd96aeeab81a020420a27/20 - Cluster Analysis in 
Python/Chapter 1 - Introduction to clustering.txt -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 2 - Hierarchical Clustering.txt: -------------------------------------------------------------------------------- 
1 | 1.Hierarchical clustering: ward method
2 | # Import the fcluster and linkage functions
3 | from scipy.cluster.hierarchy import fcluster, linkage
4 | 
5 | # Use the linkage() function
6 | distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'ward', metric = 'euclidean')
7 | 
8 | # Assign cluster labels
9 | comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
10 | 
11 | # Plot clusters
12 | sns.scatterplot(x='x_scaled', y='y_scaled',
13 | hue='cluster_labels', data = comic_con)
14 | plt.show()
15 | ____________________________________________________________________________
16 | 2.Hierarchical clustering: single method
17 | # Import the fcluster and linkage functions
18 | from scipy.cluster.hierarchy import fcluster, linkage
19 | 
20 | # Use the linkage() function
21 | distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'single', metric = 'euclidean')
22 | 
23 | # Assign cluster labels
24 | comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
25 | 
26 | # Plot clusters
27 | sns.scatterplot(x='x_scaled', y='y_scaled',
28 | hue='cluster_labels', data = comic_con)
29 | plt.show()
30 | ____________________________________________________________________________
31 | 3.Hierarchical clustering: complete method
32 | # Import the fcluster and linkage functions
33 | from scipy.cluster.hierarchy import fcluster, linkage
34 | 
35 | # Use the linkage() function
36 | distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'complete', metric = 'euclidean')
37 | 
38 | # Assign cluster labels
39 | comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
40 | 
41 | # Plot clusters
42 | sns.scatterplot(x='x_scaled', y='y_scaled',
43 | hue='cluster_labels', data = comic_con)
44 | plt.show()
45 | ____________________________________________________________________________
46 | 4.Visualize clusters with matplotlib
47 | # Import the pyplot module
48 | from matplotlib import pyplot as plt
49 | 
50 | # Define a colors dictionary for clusters
51 | colors = {1:'red', 2:'blue'}
52 | 
53 | # Plot a scatter plot
54 | comic_con.plot.scatter(x='x_scaled',
55 | y='y_scaled',
56 | c=comic_con['cluster_labels'].apply(lambda x: colors[x]))
57 | plt.show()
58 | ____________________________________________________________________________
59 | 5.Visualize clusters with seaborn
60 | # Import the seaborn module
61 | import seaborn as sns
62 | 
63 | # Plot a scatter plot using seaborn
64 | sns.scatterplot(x='x_scaled',
65 | y='y_scaled',
66 | hue='cluster_labels',
67 | data=comic_con)
68 | plt.show()
69 | ____________________________________________________________________________
70 | 6.Create a dendrogram
71 | # Import the dendrogram function
72 | from scipy.cluster.hierarchy import dendrogram
73 | 
74 | # Create a dendrogram
75 | dn = dendrogram(distance_matrix)
76 | 
77 | # Display the dendrogram
78 | plt.show()
79 | ____________________________________________________________________________
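Note: a dendrogram is usually read together with fcluster(): cutting the tree at a chosen height yields flat cluster labels. A minimal sketch, assuming distance_matrix is a linkage output from the exercises above (the threshold value is purely illustrative):

# Hypothetical follow-up: cut the dendrogram at a chosen height
from scipy.cluster.hierarchy import fcluster
height = 5.0  # assumed threshold, read off the dendrogram by eye
labels = fcluster(distance_matrix, height, criterion='distance')
print(len(set(labels)), 'clusters at height', height)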
80 | 7.FIFA 18: exploring defenders
81 | # Fit the data into a hierarchical clustering algorithm
82 | distance_matrix = linkage(fifa[['scaled_sliding_tackle', 'scaled_aggression']], 'ward')
83 | 
84 | # Assign cluster labels to each row of data
85 | fifa['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')
86 | 
87 | # Display cluster centers of each cluster
88 | print(fifa[['scaled_sliding_tackle', 'scaled_aggression', 'cluster_labels']].groupby('cluster_labels').mean())
89 | 
90 | # Create a scatter plot through seaborn
91 | sns.scatterplot(x='scaled_sliding_tackle', y='scaled_aggression', hue='cluster_labels', data=fifa)
92 | plt.show()
93 | ____________________________________________________________________________ -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 3 - K-means Clustering.txt: -------------------------------------------------------------------------------- 
1 | 1.K-means clustering: first exercise
2 | # Import the kmeans and vq functions
3 | from scipy.cluster.vq import kmeans, vq
4 | 
5 | # Generate cluster centers
6 | cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
7 | 
8 | # Assign cluster labels
9 | comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)
10 | 
11 | # Plot clusters
12 | sns.scatterplot(x='x_scaled', y='y_scaled',
13 | hue='cluster_labels', data = comic_con)
14 | plt.show()
15 | _________________________________________________________________
16 | 2.Elbow method on distinct clusters
17 | distortions = []
18 | num_clusters = range(1, 7)
19 | 
20 | # Create a list of distortions from the kmeans function
21 | for i in num_clusters:
22 | cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
23 | distortions.append(distortion)
24 | 
25 | # Create a data frame with two lists - num_clusters, distortions
26 | elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})
27 | 
28 | # Create a line plot of num_clusters and distortions
29 | sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
30 | plt.xticks(num_clusters)
31 | plt.show()
32 | _________________________________________________________________
33 | 3.Impact of seeds on distinct clusters
34 | # Import numpy's random module
35 | from numpy import random
36 | 
37 | # Initialize seed
38 | random.seed([1, 2, 1000])
39 | 
40 | # Run kmeans clustering
41 | cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
42 | comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)
43 | 
44 | # Plot the scatterplot
45 | sns.scatterplot(x='x_scaled', y='y_scaled',
46 | hue='cluster_labels', data = comic_con)
47 | plt.show()
48 | _________________________________________________________________
49 | 4.Uniform clustering patterns
50 | # Import the kmeans and vq functions
51 | from scipy.cluster.vq import kmeans, vq
52 | 
53 | # Generate cluster centers
54 | cluster_centers, distortion = kmeans(mouse[['x_scaled', 'y_scaled']], 3)
55 | 
56 | # Assign cluster labels
57 | mouse['cluster_labels'], distortion_list = vq(mouse[['x_scaled', 'y_scaled']], cluster_centers)
58 | 
59 | # Plot clusters
60 | sns.scatterplot(x='x_scaled', y='y_scaled',
61 | hue='cluster_labels', data = mouse)
62 | plt.show()
63 | _________________________________________________________________
64 | 5.FIFA 18: defenders revisited
65 | # Set up a random seed in numpy
66 | random.seed([1000,2000])
67 | 
68 | # Fit the data into a k-means algorithm
69 | cluster_centers,_ = kmeans(fifa[['scaled_def', 'scaled_phy']], 3)
70 | 
71 | # Assign cluster labels
72 | fifa['cluster_labels'], _ = vq(fifa[['scaled_def', 'scaled_phy']],
cluster_centers)
73 | 
74 | # Display cluster centers
75 | print(fifa[['scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean())
76 | 
77 | # Create a scatter plot through seaborn
78 | sns.scatterplot(x='scaled_def', y='scaled_phy', hue='cluster_labels', data=fifa)
79 | plt.show()
80 | _________________________________________________________________ -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/Chapter 4 - Clustering in Real World.txt: -------------------------------------------------------------------------------- 
1 | 1.Extract RGB values from image
2 | # Import the image module of matplotlib
3 | import matplotlib.image as img
4 | 
5 | # Read batman image and print dimensions
6 | batman_image = img.imread('batman.jpg')
7 | print(batman_image.shape)
8 | 
9 | # Store RGB values of all pixels in lists r, g and b (pre-created as empty lists in the exercise)
10 | for row in batman_image:
11 | for temp_r, temp_g, temp_b in row:
12 | r.append(temp_r)
13 | g.append(temp_g)
14 | b.append(temp_b)
15 | _________________________________________________________________________
16 | 2.How many dominant colors?
17 | distortions = []
18 | num_clusters = range(1, 7)
19 | 
20 | # Create a list of distortions from the kmeans function
21 | for i in num_clusters:
22 | cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_blue', 'scaled_green']], i)
23 | distortions.append(distortion)
24 | 
25 | # Create a data frame with two lists, num_clusters and distortions
26 | elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})
27 | 
28 | # Create a line plot of num_clusters and distortions
29 | sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
30 | plt.xticks(num_clusters)
31 | plt.show()
32 | _________________________________________________________________________
33 | 3.Display dominant colors
34 | # Get standard deviations of each color (colors is pre-created as an empty list in the exercise)
35 | r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std()
36 | 
37 | for cluster_center in cluster_centers:
38 | scaled_r, scaled_g, scaled_b = cluster_center
39 | # Convert each standardized value to scaled value
40 | colors.append((
41 | scaled_r * r_std / 255,
42 | scaled_g * g_std / 255,
43 | scaled_b * b_std / 255
44 | ))
45 | 
46 | # Display colors of cluster centers
47 | plt.imshow([colors])
48 | plt.show()
49 | _________________________________________________________________________
50 | 4.TF-IDF of movie plots
51 | # Import TfidfVectorizer class from sklearn
52 | from sklearn.feature_extraction.text import TfidfVectorizer
53 | 
54 | # Initialize TfidfVectorizer (remove_noise is a tokenizer defined earlier in the course)
55 | tfidf_vectorizer = TfidfVectorizer(max_df=0.75, max_features=50,
56 | min_df=0.1, tokenizer=remove_noise)
57 | 
58 | # Use the .fit_transform() method on the list plots
59 | tfidf_matrix = tfidf_vectorizer.fit_transform(plots)
60 | _________________________________________________________________________
61 | 5.Top terms in movie clusters
62 | num_clusters = 2
63 | 
64 | # Generate cluster centers through the kmeans function
65 | cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)
66 | 
67 | # Generate terms from the tfidf_vectorizer object (renamed get_feature_names_out in scikit-learn >= 1.0)
68 | terms = tfidf_vectorizer.get_feature_names()
69 | 
70 | for i in range(num_clusters):
71 | # Sort the terms and print top 3 terms
72 | center_terms = dict(zip(terms, list(cluster_centers[i])))
73 | sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
74 | print(sorted_terms[:3])
75 | 
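Note: to see how the movies split across the two clusters, the vq() function from the earlier exercises applies here as well. A small sketch, assuming tfidf_matrix and cluster_centers from the exercise above:

# Hypothetical follow-up: assign each plot to its nearest cluster center
import numpy as np
from scipy.cluster.vq import vq
labels, _ = vq(np.asarray(tfidf_matrix.todense()), cluster_centers)
print(np.bincount(labels))  # number of movies per cluster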
_________________________________________________________________________ 76 | 6.Basic checks on clusters 77 | # Print the size of the clusters 78 | print(fifa.groupby('cluster_labels')['ID'].count()) 79 | 80 | # Print the mean value of wages in each cluster 81 | print(fifa.groupby('cluster_labels')['eur_wage'].mean()) 82 | _________________________________________________________________________ 83 | 7.FIFA 18: what makes a complete player? 84 | # Create centroids with kmeans for 2 clusters 85 | cluster_centers,_ = kmeans(fifa[scaled_features], 2) 86 | 87 | # Assign cluster labels and print cluster centers 88 | fifa['cluster_labels'], _ = vq(fifa[scaled_features], cluster_centers) 89 | print(fifa.groupby('cluster_labels')[scaled_features].mean()) 90 | 91 | # Plot cluster centers to visualize clusters 92 | fifa.groupby('cluster_labels')[scaled_features].mean().plot(legend=True, kind='bar') 93 | plt.show() 94 | 95 | # Get the name column of first 5 players in each cluster 96 | for cluster in fifa['cluster_labels'].unique(): 97 | print(cluster, fifa[fifa['cluster_labels'] == cluster]['name'].values[:5]) 98 | _________________________________________________________________________ -------------------------------------------------------------------------------- /20 - Cluster Analysis in Python/key points: -------------------------------------------------------------------------------- 1 | 2 | --------------------------------------------------------------------------------