├── P0-Explore-Weather-Trends
│   ├── Explore Weather Trends.ipynb
│   ├── README.md
│   ├── city_data.csv
│   ├── city_list.csv
│   └── global_data.csv
├── P1-Investigate-A-Dataset
│   ├── Investigate Gapminder Data.html
│   ├── Investigate Gapminder Data.ipynb
│   └── README.md
├── P2-Analyze-A-B-Test-Results
│   ├── Analyze_ab_test_results_notebook.html
│   ├── Analyze_ab_test_results_notebook.ipynb
│   ├── README.md
│   ├── ab_data.csv
│   └── countries.csv
├── P3-Analyze-Twitter-Data
│   ├── README.md
│   ├── image_predictions.tsv
│   ├── tweet_json.txt
│   ├── twitter-archive-enhanced.csv
│   ├── twitter_archive_master.csv
│   └── wrangle_act.ipynb
├── P4-Communicate-Data-Findings
│   ├── Communicate Data Findings.html
│   ├── Communicate Data Findings.ipynb
│   ├── Data
│   │   ├── README.md
│   │   └── gather_goBike_data.py
│   ├── Explanatory Visualization.html
│   ├── Explanatory Visualization.ipynb
│   ├── Explanatory Visualization.slides.html
│   ├── Images
│   │   ├── README.md
│   │   ├── east_bay_500.PNG
│   │   ├── san_francisco_1000.PNG
│   │   ├── san_jose_200.PNG
│   │   ├── stations_1.PNG
│   │   ├── stations_2.png
│   │   └── stations_kepler.png
│   └── README.md
├── README.md
├── global_weather_trend.png
├── life_expectancy_to_income_2018.png
├── mean_of_retweets_per_month-year_combination.png
├── rel_userfreq_by_gender_and_area.png
└── sampling_dist.png
/P0-Explore-Weather-Trends/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## P0: Explore Weather Trends
3 |
4 | ### Prerequisites
5 |
6 | Additional installations:
7 |
8 | * [Missingno](https://github.com/ResidentMario/missingno)
9 | * [sklearn](https://scikit-learn.org/)
10 |
11 | ## Project Overview
12 |
13 | This first project required the following steps:
14 | * Extract data from a database using a SQL query
15 | * Calculate a moving average
16 | * Create a line chart
17 |
18 | I analyzed local and global temperature data and compared the temperature trends of three German cities to the overall global temperature trend. After some data cleaning, I created a function to assist with data processing and visualization, including several options to experiment with. It also calculates a simple linear regression to visualize trends.
19 |
20 | 
21 |
22 | ### Data Sources
23 |
24 | **Name:** city_data.csv
25 | * Definition: Overall city temperature data
26 | * Source: Udacity
27 | * Version: 1
28 | * Method of gathering: SQL
29 |
30 | **Name:** global_data.csv
31 | * Definition: Global temperature data
32 | * Source: Udacity
33 | * Version: 1
34 | * Method of gathering: SQL
35 |
36 | **Name:** city_list.csv
37 | * Definition: List of cities in this dataset
38 | * Source: Udacity
39 | * Version: 1
40 | * Method of gathering: SQL
41 |
42 | ### Wrangling
43 | * Cut data to years 1750 - 2013
44 |
45 | ### Summary
46 |
47 | > To conclude, a clear overall uptrend is visible, which means that the average global temperature is increasing, and at an accelerating pace.
48 |
49 | The German cities Hamburg, Berlin, and Munich were compared to the global data (1750 - 2013); a sketch of how these comparisons can be computed follows the list:
50 |
51 | - The slope of the global trend is higher than that of the German cities, so over this long time period the global average temperature is increasing faster
52 | - Berlin has the highest average temperature among the German cities and is the only one of them whose average temperature exceeds the global average
53 | - Hamburg is closest to the global average temperature, while Munich has the lowest average temperature but also the highest correlation with the global data among the three German cities
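
A rough sketch of how these slope and correlation comparisons can be computed, assuming `city_data.csv` contains `year`, `city`, and `avg_temp` columns (its schema is not shown in this listing, so the column names are assumptions):

```python
import pandas as pd
import numpy as np

city_df = pd.read_csv("city_data.csv")      # assumed columns: year, city, avg_temp
global_df = pd.read_csv("global_data.csv")  # columns: year, avg_temp

for city in ["Berlin", "Hamburg", "Munich"]:
    # Align the city series with the global series on year
    merged = (city_df[city_df["city"] == city]
              .merge(global_df, on="year", suffixes=("_city", "_global"))
              .dropna())
    city_slope, _ = np.polyfit(merged["year"], merged["avg_temp_city"], deg=1)
    global_slope, _ = np.polyfit(merged["year"], merged["avg_temp_global"], deg=1)
    corr = merged["avg_temp_city"].corr(merged["avg_temp_global"])
    print(f"{city}: city slope {city_slope:.4f} °C/yr, global slope {global_slope:.4f} °C/yr, "
          f"correlation with global {corr:.2f}")
```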
54 |
55 | ### Authors
56 |
57 | * Christoph Lindstädt
58 | * Udacity
59 |
60 | ## License
61 |
62 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
63 |
64 |
65 |
66 |
67 |
--------------------------------------------------------------------------------
/P0-Explore-Weather-Trends/city_list.csv:
--------------------------------------------------------------------------------
1 | city,country
2 | Abidjan,Côte D'Ivoire
3 | Abu Dhabi,United Arab Emirates
4 | Abuja,Nigeria
5 | Accra,Ghana
6 | Adana,Turkey
7 | Adelaide,Australia
8 | Agra,India
9 | Ahmadabad,India
10 | Albuquerque,United States
11 | Alexandria,Egypt
12 | Alexandria,United States
13 | Algiers,Algeria
14 | Allahabad,India
15 | Almaty,Kazakhstan
16 | Amritsar,India
17 | Amsterdam,Netherlands
18 | Ankara,Turkey
19 | Anshan,China
20 | Antananarivo,Madagascar
21 | Arlington,United States
22 | Asmara,Eritrea
23 | Astana,Kazakhstan
24 | Athens,Greece
25 | Atlanta,United States
26 | Austin,United States
27 | Baghdad,Iraq
28 | Baku,Azerbaijan
29 | Baltimore,United States
30 | Bamako,Mali
31 | Bandung,Indonesia
32 | Bangalore,India
33 | Bangkok,Thailand
34 | Bangui,Central African Republic
35 | Barcelona,Spain
36 | Barcelona,Venezuela
37 | Barquisimeto,Venezuela
38 | Barranquilla,Colombia
39 | Beirut,Lebanon
40 | Belfast,United Kingdom
41 | Belgrade,Serbia
42 | Belo Horizonte,Brazil
43 | Benghazi,Libya
44 | Berlin,Germany
45 | Bern,Switzerland
46 | Bhopal,India
47 | Birmingham,United Kingdom
48 | Birmingham,United States
49 | Bissau,Guinea Bissau
50 | Boston,United States
51 | Bratislava,Slovakia
52 | Brazzaville,Congo
53 | Brisbane,Australia
54 | Brussels,Belgium
55 | Bucharest,Romania
56 | Budapest,Hungary
57 | Bujumbura,Burundi
58 | Bursa,Turkey
59 | Cairo,Egypt
60 | Cali,Colombia
61 | Campinas,Brazil
62 | Canberra,Australia
63 | Caracas,Venezuela
64 | Cardiff,United Kingdom
65 | Casablanca,Morocco
66 | Changchun,China
67 | Changzhou,China
68 | Charlotte,United States
69 | Chelyabinsk,Russia
70 | Chengdu,China
71 | Chicago,United States
72 | Chisinau,Moldova
73 | Colombo,Brazil
74 | Colombo,Sri Lanka
75 | Colorado Springs,United States
76 | Columbus,United States
77 | Conakry,Guinea
78 | Copenhagen,Denmark
79 | Cordoba,Argentina
80 | Curitiba,Brazil
81 | Dakar,Senegal
82 | Dalian,China
83 | Dallas,United States
84 | Damascus,Syria
85 | Dar Es Salaam,Tanzania
86 | Datong,China
87 | Delhi,India
88 | Denver,United States
89 | Detroit,United States
90 | Dhaka,Bangladesh
91 | Doha,Qatar
92 | Douala,Cameroon
93 | Dublin,Ireland
94 | Durban,South Africa
95 | Dushanbe,Tajikistan
96 | Ecatepec,Mexico
97 | Edinburgh,United Kingdom
98 | El Paso,United States
99 | Faisalabad,Pakistan
100 | Fort Worth,United States
101 | Fortaleza,Brazil
102 | Foshan,China
103 | Freetown,Sierra Leone
104 | Fresno,United States
105 | Fuzhou,China
106 | Gaborone,Botswana
107 | Georgetown,Guyana
108 | Guadalajara,Mexico
109 | Guangzhou,China
110 | Guarulhos,Brazil
111 | Guatemala City,Guatemala
112 | Guayaquil,Ecuador
113 | Guiyang,China
114 | Gujranwala,Pakistan
115 | Hamburg,Germany
116 | Handan,China
117 | Hangzhou,China
118 | Hanoi,Vietnam
119 | Haora,India
120 | Harare,Zimbabwe
121 | Harbin,China
122 | Hefei,China
123 | Helsinki,Finland
124 | Hiroshima,Japan
125 | Ho Chi Minh City,Vietnam
126 | Houston,United States
127 | Hyderabad,India
128 | Hyderabad,Pakistan
129 | Ibadan,Nigeria
130 | Indianapolis,United States
131 | Indore,India
132 | Irbil,Iraq
133 | Islamabad,Pakistan
134 | Istanbul,Turkey
135 | Izmir,Turkey
136 | Jacksonville,United States
137 | Jaipur,India
138 | Jakarta,Indonesia
139 | Jilin,China
140 | Jinan,China
141 | Johannesburg,South Africa
142 | Juba,Sudan
143 | Kabul,Afghanistan
144 | Kaduna,Nigeria
145 | Kampala,Uganda
146 | Kano,Nigeria
147 | Kanpur,India
148 | Kansas City,United States
149 | Kaohsiung,Taiwan
150 | Karachi,Pakistan
151 | Kathmandu,Nepal
152 | Kawasaki,Japan
153 | Kazan,Russia
154 | Khartoum,Sudan
155 | Khulna,Bangladesh
156 | Kiev,Ukraine
157 | Kigali,Rwanda
158 | Kingston,Canada
159 | Kingston,Jamaica
160 | Kinshasa,Congo (Democratic Republic Of The)
161 | Kitakyushu,Japan
162 | Kobe,Japan
163 | Kuala Lumpur,Malaysia
164 | Kunming,China
165 | La Paz,Bolivia
166 | La Paz,Mexico
167 | Lagos,Nigeria
168 | Lahore,Pakistan
169 | Lanzhou,China
170 | Las Vegas,United States
171 | Libreville,Gabon
172 | Lilongwe,Malawi
173 | Lima,Peru
174 | Lisbon,Portugal
175 | Ljubljana,Slovenia
176 | London,Canada
177 | London,United Kingdom
178 | Long Beach,United States
179 | Los Angeles,Chile
180 | Los Angeles,United States
181 | Louisville,United States
182 | Luanda,Angola
183 | Lubumbashi,Congo (Democratic Republic Of The)
184 | Ludhiana,India
185 | Luoyang,China
186 | Lusaka,Zambia
187 | Madrid,Spain
188 | Maiduguri,Nigeria
189 | Malabo,Equatorial Guinea
190 | Managua,Nicaragua
191 | Manama,Bahrain
192 | Manaus,Brazil
193 | Manila,Philippines
194 | Maputo,Mozambique
195 | Maracaibo,Venezuela
196 | Maseru,Lesotho
197 | Mashhad,Iran
198 | Mecca,Saudi Arabia
199 | Medan,Indonesia
200 | Melbourne,Australia
201 | Memphis,United States
202 | Mesa,United States
203 | Mexicali,Mexico
204 | Miami,United States
205 | Milan,Italy
206 | Milwaukee,United States
207 | Minneapolis,United States
208 | Minsk,Belarus
209 | Mogadishu,Somalia
210 | Monrovia,Liberia
211 | Monterrey,Mexico
212 | Montevideo,Uruguay
213 | Montreal,Canada
214 | Moscow,Russia
215 | Multan,Pakistan
216 | Munich,Germany
217 | Nagoya,Japan
218 | Nagpur,India
219 | Nairobi,Kenya
220 | Nanchang,China
221 | Nanjing,China
222 | Nanning,China
223 | Nashville,United States
224 | Nassau,Bahamas
225 | New Delhi,India
226 | New Orleans,United States
227 | New York,United States
228 | Niamey,Niger
229 | Nouakchott,Mauritania
230 | Novosibirsk,Russia
231 | Oakland,United States
232 | Oklahoma City,United States
233 | Omaha,United States
234 | Omsk,Russia
235 | Oslo,Norway
236 | Ottawa,Canada
237 | Ouagadougou,Burkina Faso
238 | Palembang,Indonesia
239 | Paramaribo,Suriname
240 | Paris,France
241 | Patna,India
242 | Perm,Russia
243 | Perth,Australia
244 | Peshawar,Pakistan
245 | Philadelphia,United States
246 | Phoenix,United States
247 | Podgorica,Montenegro
248 | Port Au Prince,Haiti
249 | Port Harcourt,Nigeria
250 | Port Louis,Mauritius
251 | Port Moresby,Papua New Guinea
252 | Portland,United States
253 | Porto Alegre,Brazil
254 | Prague,Czech Republic
255 | Pretoria,South Africa
256 | Pristina,Serbia
257 | Puebla,Mexico
258 | Pune,India
259 | Qingdao,China
260 | Qiqihar,China
261 | Quito,Ecuador
262 | Rabat,Morocco
263 | Rajkot,India
264 | Raleigh,United States
265 | Ranchi,India
266 | Rawalpindi,Pakistan
267 | Recife,Brazil
268 | Riga,Latvia
269 | Rio De Janeiro,Brazil
270 | Riyadh,Saudi Arabia
271 | Rome,Italy
272 | Rosario,Argentina
273 | Sacramento,United States
274 | Salvador,Brazil
275 | Samara,Russia
276 | San Antonio,United States
277 | San Diego,United States
278 | San Francisco,United States
279 | San Jose,United States
280 | San Salvador,El Salvador
281 | Santa Cruz,Philippines
282 | Santiago,Chile
283 | Santiago,Dominican Republic
284 | Santiago,Philippines
285 | Santo Domingo,Dominican Republic
286 | Santo Domingo,Ecuador
287 | Sarajevo,Bosnia And Herzegovina
288 | Seattle,United States
289 | Semarang,Indonesia
290 | Seoul,South Korea
291 | Shanghai,China
292 | Shenyang,China
293 | Shenzhen,Hong Kong
294 | Shiraz,Iran
295 | Singapore,Singapore
296 | Skopje,Macedonia
297 | Sofia,Bulgaria
298 | Soweto,South Africa
299 | Stockholm,Sweden
300 | Surabaya,Indonesia
301 | Surat,India
302 | Suzhou,China
303 | Sydney,Australia
304 | Tabriz,Iran
305 | Taichung,Taiwan
306 | Taipei,Taiwan
307 | Taiyuan,China
308 | Tallinn,Estonia
309 | Tangshan,China
310 | Tashkent,Uzbekistan
311 | Tbilisi,Georgia
312 | Tegucigalpa,Honduras
313 | Tianjin,China
314 | Tijuana,Mexico
315 | Tirana,Albania
316 | Tokyo,Japan
317 | Toronto,Canada
318 | Tripoli,Libya
319 | Tucson,United States
320 | Tulsa,United States
321 | Tunis,Tunisia
322 | Ufa,Russia
323 | Ulaanbaatar,Mongolia
324 | Vadodara,India
325 | Valencia,Spain
326 | Valencia,Venezuela
327 | Varanasi,India
328 | Victoria,Canada
329 | Vienna,Austria
330 | Vientiane,Laos
331 | Vilnius,Lithuania
332 | Virginia Beach,United States
333 | Volgograd,Russia
334 | Warsaw,Poland
335 | Washington,United States
336 | Wellington,New Zealand
337 | Wichita,United States
338 | Windhoek,Namibia
339 | Wuhan,China
340 | Wuxi,China
341 | Xian,China
342 | Xuzhou,China
343 | Yamoussoukro,Côte D'Ivoire
344 | Yerevan,Armenia
345 | Zagreb,Croatia
346 | Zapopan,Mexico
347 |
--------------------------------------------------------------------------------
/P0-Explore-Weather-Trends/global_data.csv:
--------------------------------------------------------------------------------
1 | year,avg_temp
2 | 1750,8.72
3 | 1751,7.98
4 | 1752,5.78
5 | 1753,8.39
6 | 1754,8.47
7 | 1755,8.36
8 | 1756,8.85
9 | 1757,9.02
10 | 1758,6.74
11 | 1759,7.99
12 | 1760,7.19
13 | 1761,8.77
14 | 1762,8.61
15 | 1763,7.50
16 | 1764,8.40
17 | 1765,8.25
18 | 1766,8.41
19 | 1767,8.22
20 | 1768,6.78
21 | 1769,7.69
22 | 1770,7.69
23 | 1771,7.85
24 | 1772,8.19
25 | 1773,8.22
26 | 1774,8.77
27 | 1775,9.18
28 | 1776,8.30
29 | 1777,8.26
30 | 1778,8.54
31 | 1779,8.98
32 | 1780,9.43
33 | 1781,8.10
34 | 1782,7.90
35 | 1783,7.68
36 | 1784,7.86
37 | 1785,7.36
38 | 1786,8.26
39 | 1787,8.03
40 | 1788,8.45
41 | 1789,8.33
42 | 1790,7.98
43 | 1791,8.23
44 | 1792,8.09
45 | 1793,8.23
46 | 1794,8.53
47 | 1795,8.35
48 | 1796,8.27
49 | 1797,8.51
50 | 1798,8.67
51 | 1799,8.51
52 | 1800,8.48
53 | 1801,8.59
54 | 1802,8.58
55 | 1803,8.50
56 | 1804,8.84
57 | 1805,8.56
58 | 1806,8.43
59 | 1807,8.28
60 | 1808,7.63
61 | 1809,7.08
62 | 1810,6.92
63 | 1811,6.86
64 | 1812,7.05
65 | 1813,7.74
66 | 1814,7.59
67 | 1815,7.24
68 | 1816,6.94
69 | 1817,6.98
70 | 1818,7.83
71 | 1819,7.37
72 | 1820,7.62
73 | 1821,8.09
74 | 1822,8.19
75 | 1823,7.72
76 | 1824,8.55
77 | 1825,8.39
78 | 1826,8.36
79 | 1827,8.81
80 | 1828,8.17
81 | 1829,7.94
82 | 1830,8.52
83 | 1831,7.64
84 | 1832,7.45
85 | 1833,8.01
86 | 1834,8.15
87 | 1835,7.39
88 | 1836,7.70
89 | 1837,7.38
90 | 1838,7.51
91 | 1839,7.63
92 | 1840,7.80
93 | 1841,7.69
94 | 1842,8.02
95 | 1843,8.17
96 | 1844,7.65
97 | 1845,7.85
98 | 1846,8.55
99 | 1847,8.09
100 | 1848,7.98
101 | 1849,7.98
102 | 1850,7.90
103 | 1851,8.18
104 | 1852,8.10
105 | 1853,8.04
106 | 1854,8.21
107 | 1855,8.11
108 | 1856,8.00
109 | 1857,7.76
110 | 1858,8.10
111 | 1859,8.25
112 | 1860,7.96
113 | 1861,7.85
114 | 1862,7.56
115 | 1863,8.11
116 | 1864,7.98
117 | 1865,8.18
118 | 1866,8.29
119 | 1867,8.44
120 | 1868,8.25
121 | 1869,8.43
122 | 1870,8.20
123 | 1871,8.12
124 | 1872,8.19
125 | 1873,8.35
126 | 1874,8.43
127 | 1875,7.86
128 | 1876,8.08
129 | 1877,8.54
130 | 1878,8.83
131 | 1879,8.17
132 | 1880,8.12
133 | 1881,8.27
134 | 1882,8.13
135 | 1883,7.98
136 | 1884,7.77
137 | 1885,7.92
138 | 1886,7.95
139 | 1887,7.91
140 | 1888,8.09
141 | 1889,8.32
142 | 1890,7.97
143 | 1891,8.02
144 | 1892,8.07
145 | 1893,8.06
146 | 1894,8.16
147 | 1895,8.15
148 | 1896,8.21
149 | 1897,8.29
150 | 1898,8.18
151 | 1899,8.40
152 | 1900,8.50
153 | 1901,8.54
154 | 1902,8.30
155 | 1903,8.22
156 | 1904,8.09
157 | 1905,8.23
158 | 1906,8.38
159 | 1907,7.95
160 | 1908,8.19
161 | 1909,8.18
162 | 1910,8.22
163 | 1911,8.18
164 | 1912,8.17
165 | 1913,8.30
166 | 1914,8.59
167 | 1915,8.59
168 | 1916,8.23
169 | 1917,8.02
170 | 1918,8.13
171 | 1919,8.38
172 | 1920,8.36
173 | 1921,8.57
174 | 1922,8.41
175 | 1923,8.42
176 | 1924,8.51
177 | 1925,8.53
178 | 1926,8.73
179 | 1927,8.52
180 | 1928,8.63
181 | 1929,8.24
182 | 1930,8.63
183 | 1931,8.72
184 | 1932,8.71
185 | 1933,8.34
186 | 1934,8.63
187 | 1935,8.52
188 | 1936,8.55
189 | 1937,8.70
190 | 1938,8.86
191 | 1939,8.76
192 | 1940,8.76
193 | 1941,8.77
194 | 1942,8.73
195 | 1943,8.76
196 | 1944,8.85
197 | 1945,8.58
198 | 1946,8.68
199 | 1947,8.80
200 | 1948,8.75
201 | 1949,8.59
202 | 1950,8.37
203 | 1951,8.63
204 | 1952,8.64
205 | 1953,8.87
206 | 1954,8.56
207 | 1955,8.63
208 | 1956,8.28
209 | 1957,8.73
210 | 1958,8.77
211 | 1959,8.73
212 | 1960,8.58
213 | 1961,8.80
214 | 1962,8.75
215 | 1963,8.86
216 | 1964,8.41
217 | 1965,8.53
218 | 1966,8.60
219 | 1967,8.70
220 | 1968,8.52
221 | 1969,8.60
222 | 1970,8.70
223 | 1971,8.60
224 | 1972,8.50
225 | 1973,8.95
226 | 1974,8.47
227 | 1975,8.74
228 | 1976,8.35
229 | 1977,8.85
230 | 1978,8.69
231 | 1979,8.73
232 | 1980,8.98
233 | 1981,9.17
234 | 1982,8.64
235 | 1983,9.03
236 | 1984,8.69
237 | 1985,8.66
238 | 1986,8.83
239 | 1987,8.99
240 | 1988,9.20
241 | 1989,8.92
242 | 1990,9.23
243 | 1991,9.18
244 | 1992,8.84
245 | 1993,8.87
246 | 1994,9.04
247 | 1995,9.35
248 | 1996,9.04
249 | 1997,9.20
250 | 1998,9.52
251 | 1999,9.29
252 | 2000,9.20
253 | 2001,9.41
254 | 2002,9.57
255 | 2003,9.53
256 | 2004,9.32
257 | 2005,9.70
258 | 2006,9.53
259 | 2007,9.73
260 | 2008,9.43
261 | 2009,9.51
262 | 2010,9.70
263 | 2011,9.52
264 | 2012,9.51
265 | 2013,9.61
266 | 2014,9.57
267 | 2015,9.83
268 |
--------------------------------------------------------------------------------
/P1-Investigate-A-Dataset/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## P1: Investigate A Dataset (Gapminder World Dataset)
3 |
4 | ### Prerequisites
5 |
6 | Additional installations:
7 |
8 | * [Missingno](https://github.com/ResidentMario/missingno)
9 | * [pycountry](https://bitbucket.org/flyingcircus/pycountry)
10 | * [pycountry-convert](https://pycountry-convert.readthedocs.io/en/latest/)
11 |
12 | ## Project Overview
13 |
14 | ### Data Sources
15 |
16 | **Name: Population total**
17 |
18 | - Definition: Total population
19 | - Source: http://gapm.io/dpop
20 | - Version: 5
21 |
22 | **Name: Life expectancy (years)**
23 |
24 | - Definition: The average number of years a newborn child would live if current mortality patterns were to stay the same.
25 | - Source: http://gapm.io/ilex
26 | - Version: 9
27 |
28 | **Name: Income per person (GDP/capita, PPP$ inflation-adjusted)**
29 |
30 | - Definition: Gross domestic product per person adjusted for differences in purchasing power (in international dollars, fixed 2011 prices, PPP based on 2011 ICP).
31 | - Source: http://gapm.io/dgdppc
32 | - Version: 25
33 |
34 | ### Wrangling
35 |
36 | For the analysis, the following countries were dropped from the dataframe because too much of their data was missing (a sketch of this step follows the list):
37 |
38 | - Andorra, Dominica, Holy See, Liechtenstein, Marshall Islands, Monaco, Nauru, Palau, San Marino, St. Kitts and Nevis, Tuvalu
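
A minimal sketch of this cleaning step, assuming the merged Gapminder data has already been loaded into a pandas dataframe with a `country` column (the function and column names are assumptions):

```python
import pandas as pd

# Countries removed because too much of their data is missing
SPARSE_COUNTRIES = [
    "Andorra", "Dominica", "Holy See", "Liechtenstein", "Marshall Islands",
    "Monaco", "Nauru", "Palau", "San Marino", "St. Kitts and Nevis", "Tuvalu",
]

def drop_sparse_countries(df: pd.DataFrame, country_col: str = "country") -> pd.DataFrame:
    """Return a copy of the dataframe without the sparsely populated countries above."""
    return df[~df[country_col].isin(SPARSE_COUNTRIES)].reset_index(drop=True)
```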
39 |
40 | ### Summary
41 |
42 | - We can observe an overall and ongoing uptrend in the world population, income per person, and life expectancy
43 |
44 | - The period between roughly 1950 and 1975 in particular marked the start of a strong increase in all three metrics
45 |
46 | - The world population is growing strongly, and one observable reason for this is the increase in overall life expectancy
47 |
48 | - We also found a relationship between income per person and life expectancy: rising income is no guarantee of rising life expectancy, but the two are correlated (see the sketch below)
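
A short sketch of how the correlation claim in the last bullet can be checked, assuming the wrangled data is in a dataframe with `income_per_person` and `life_expectancy` columns (all names here are assumptions):

```python
import numpy as np
import pandas as pd

def income_life_correlation(df: pd.DataFrame):
    """Pearson correlation of income vs. life expectancy, raw and with log-transformed income."""
    raw = df["income_per_person"].corr(df["life_expectancy"])
    # Income is heavily right-skewed, so the log of income is often the more informative measure
    logged = np.log(df["income_per_person"]).corr(df["life_expectancy"])
    return raw, logged
```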
49 |
50 | ### Authors
51 |
52 | * Christoph Lindstädt
53 | * Udacity
54 |
55 | ## License
56 |
57 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
58 |
59 |
60 |
61 |
62 |
--------------------------------------------------------------------------------
/P2-Analyze-A-B-Test-Results/Analyze_ab_test_results_notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Analyze A/B Test Results\n",
8 | "\n",
9 | "You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page. Either way assure that your code passes the project [RUBRIC](https://review.udacity.com/#!/projects/37e27304-ad47-4eb0-a1ab-8c12f60e43d0/rubric). **Please save regularly.**\n",
10 | "\n",
11 | "This project will assure you have mastered the subjects covered in the statistics lessons. The hope is to have this project be as comprehensive of these topics as possible. Good luck!\n",
12 | "\n",
13 | "## Table of Contents\n",
14 | "- [Introduction](#intro)\n",
15 | "- [Part I - Probability](#probability)\n",
16 | "- [Part II - A/B Test](#ab_test)\n",
17 | "- [Part III - Regression](#regression)\n",
18 | "\n",
19 | "\n",
20 | "\n",
21 | "### Introduction\n",
22 | "\n",
23 | "A/B tests are very commonly performed by data analysts and data scientists. It is important that you get some practice working with the difficulties of these \n",
24 | "\n",
25 | "For this project, you will be working to understand the results of an A/B test run by an e-commerce website. Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.\n",
26 | "\n",
27 | "**As you work through this notebook, follow along in the classroom and answer the corresponding quiz questions associated with each question.** The labels for each classroom concept are provided for each question. This will assure you are on the right track as you work through the project, and you can feel more confident in your final submission meeting the criteria. As a final check, assure you meet all the criteria on the [RUBRIC](https://review.udacity.com/#!/projects/37e27304-ad47-4eb0-a1ab-8c12f60e43d0/rubric).\n",
28 | "\n",
29 | "\n",
30 | "#### Part I - Probability\n",
31 | "\n",
32 | "To get started, let's import our libraries."
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "import pandas as pd\n",
42 | "import numpy as np\n",
43 | "import random\n",
44 | "import matplotlib.pyplot as plt\n",
45 | "%matplotlib inline\n",
46 | "#We are setting the seed to assure you get the same answers on quizzes as we set up\n",
47 | "random.seed(42)"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "`1.` Now, read in the `ab_data.csv` data. Store it in `df`. **Use your dataframe to answer the questions in Quiz 1 of the classroom.**\n",
55 | "\n",
56 | "a. Read in the dataset and take a look at the top few rows here:"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 2,
62 | "metadata": {},
63 | "outputs": [
64 | {
65 | "data": {
66 | "text/html": [
67 | "
\n",
68 | "\n",
81 | "
\n",
82 | " \n",
83 | " \n",
84 | " | \n",
85 | " user_id | \n",
86 | " timestamp | \n",
87 | " group | \n",
88 | " landing_page | \n",
89 | " converted | \n",
90 | "
\n",
91 | " \n",
92 | " \n",
93 | " \n",
94 | " 0 | \n",
95 | " 851104 | \n",
96 | " 2017-01-21 22:11:48.556739 | \n",
97 | " control | \n",
98 | " old_page | \n",
99 | " 0 | \n",
100 | "
\n",
101 | " \n",
102 | " 1 | \n",
103 | " 804228 | \n",
104 | " 2017-01-12 08:01:45.159739 | \n",
105 | " control | \n",
106 | " old_page | \n",
107 | " 0 | \n",
108 | "
\n",
109 | " \n",
110 | " 2 | \n",
111 | " 661590 | \n",
112 | " 2017-01-11 16:55:06.154213 | \n",
113 | " treatment | \n",
114 | " new_page | \n",
115 | " 0 | \n",
116 | "
\n",
117 | " \n",
118 | " 3 | \n",
119 | " 853541 | \n",
120 | " 2017-01-08 18:28:03.143765 | \n",
121 | " treatment | \n",
122 | " new_page | \n",
123 | " 0 | \n",
124 | "
\n",
125 | " \n",
126 | " 4 | \n",
127 | " 864975 | \n",
128 | " 2017-01-21 01:52:26.210827 | \n",
129 | " control | \n",
130 | " old_page | \n",
131 | " 1 | \n",
132 | "
\n",
133 | " \n",
134 | "
\n",
135 | "
"
136 | ],
137 | "text/plain": [
138 | " user_id timestamp group landing_page converted\n",
139 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0\n",
140 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0\n",
141 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0\n",
142 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0\n",
143 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1"
144 | ]
145 | },
146 | "execution_count": 2,
147 | "metadata": {},
148 | "output_type": "execute_result"
149 | }
150 | ],
151 | "source": [
152 | "#read data\n",
153 | "df = pd.read_csv(\"ab_data.csv\")\n",
154 | "df.head()"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "b. Use the cell below to find the number of rows in the dataset."
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 3,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "name": "stdout",
171 | "output_type": "stream",
172 | "text": [
173 | "\n",
174 | "RangeIndex: 294478 entries, 0 to 294477\n",
175 | "Data columns (total 5 columns):\n",
176 | "user_id 294478 non-null int64\n",
177 | "timestamp 294478 non-null object\n",
178 | "group 294478 non-null object\n",
179 | "landing_page 294478 non-null object\n",
180 | "converted 294478 non-null int64\n",
181 | "dtypes: int64(2), object(3)\n",
182 | "memory usage: 11.2+ MB\n"
183 | ]
184 | }
185 | ],
186 | "source": [
187 | "#show the df info\n",
188 | "df.info()"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "c. The number of unique users in the dataset."
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 4,
201 | "metadata": {},
202 | "outputs": [
203 | {
204 | "data": {
205 | "text/plain": [
206 | "290584"
207 | ]
208 | },
209 | "execution_count": 4,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "#choose user_id column and show the number of unique entries\n",
216 | "df.user_id.nunique()"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {
222 | "collapsed": true
223 | },
224 | "source": [
225 | "d. The proportion of users converted."
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 5,
231 | "metadata": {},
232 | "outputs": [
233 | {
234 | "data": {
235 | "text/plain": [
236 | "0.11965919355605512"
237 | ]
238 | },
239 | "execution_count": 5,
240 | "metadata": {},
241 | "output_type": "execute_result"
242 | }
243 | ],
244 | "source": [
245 | "#choose converted column and calculate the mean\n",
246 | "df.converted.mean()"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {
252 | "collapsed": true
253 | },
254 | "source": [
255 | "e. The number of times the `new_page` and `treatment` don't match."
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 6,
261 | "metadata": {},
262 | "outputs": [
263 | {
264 | "data": {
265 | "text/html": [
266 | "\n",
267 | "\n",
280 | "
\n",
281 | " \n",
282 | " \n",
283 | " | \n",
284 | " | \n",
285 | " user_id | \n",
286 | " timestamp | \n",
287 | " converted | \n",
288 | "
\n",
289 | " \n",
290 | " group | \n",
291 | " landing_page | \n",
292 | " | \n",
293 | " | \n",
294 | " | \n",
295 | "
\n",
296 | " \n",
297 | " \n",
298 | " \n",
299 | " control | \n",
300 | " new_page | \n",
301 | " 1928 | \n",
302 | " 1928 | \n",
303 | " 1928 | \n",
304 | "
\n",
305 | " \n",
306 | " old_page | \n",
307 | " 145274 | \n",
308 | " 145274 | \n",
309 | " 145274 | \n",
310 | "
\n",
311 | " \n",
312 | " treatment | \n",
313 | " new_page | \n",
314 | " 145311 | \n",
315 | " 145311 | \n",
316 | " 145311 | \n",
317 | "
\n",
318 | " \n",
319 | " old_page | \n",
320 | " 1965 | \n",
321 | " 1965 | \n",
322 | " 1965 | \n",
323 | "
\n",
324 | " \n",
325 | "
\n",
326 | "
"
327 | ],
328 | "text/plain": [
329 | " user_id timestamp converted\n",
330 | "group landing_page \n",
331 | "control new_page 1928 1928 1928\n",
332 | " old_page 145274 145274 145274\n",
333 | "treatment new_page 145311 145311 145311\n",
334 | " old_page 1965 1965 1965"
335 | ]
336 | },
337 | "execution_count": 6,
338 | "metadata": {},
339 | "output_type": "execute_result"
340 | }
341 | ],
342 | "source": [
343 | "#group the dataframe by the group and landing page and count the entries for the resulting combinations\n",
344 | "df.groupby([\"group\", \"landing_page\"]).count()"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": 7,
350 | "metadata": {},
351 | "outputs": [
352 | {
353 | "data": {
354 | "text/plain": [
355 | "3893"
356 | ]
357 | },
358 | "execution_count": 7,
359 | "metadata": {},
360 | "output_type": "execute_result"
361 | }
362 | ],
363 | "source": [
364 | "1928+1965"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "f. Do any of the rows have missing values?"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": 8,
377 | "metadata": {},
378 | "outputs": [
379 | {
380 | "name": "stdout",
381 | "output_type": "stream",
382 | "text": [
383 | "\n",
384 | "RangeIndex: 294478 entries, 0 to 294477\n",
385 | "Data columns (total 5 columns):\n",
386 | "user_id 294478 non-null int64\n",
387 | "timestamp 294478 non-null object\n",
388 | "group 294478 non-null object\n",
389 | "landing_page 294478 non-null object\n",
390 | "converted 294478 non-null int64\n",
391 | "dtypes: int64(2), object(3)\n",
392 | "memory usage: 11.2+ MB\n"
393 | ]
394 | }
395 | ],
396 | "source": [
397 | "df.info() #no"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "`2.` For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, we cannot be sure if this row truly received the new or old page. Use **Quiz 2** in the classroom to figure out how we should handle these rows. \n",
405 | "\n",
406 | "a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz. Store your new dataframe in **df2**."
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 9,
412 | "metadata": {},
413 | "outputs": [
414 | {
415 | "data": {
416 | "text/html": [
417 | "\n",
418 | "\n",
431 | "
\n",
432 | " \n",
433 | " \n",
434 | " | \n",
435 | " user_id | \n",
436 | " timestamp | \n",
437 | " group | \n",
438 | " landing_page | \n",
439 | " converted | \n",
440 | "
\n",
441 | " \n",
442 | " \n",
443 | " \n",
444 | " 0 | \n",
445 | " 851104 | \n",
446 | " 2017-01-21 22:11:48.556739 | \n",
447 | " control | \n",
448 | " old_page | \n",
449 | " 0 | \n",
450 | "
\n",
451 | " \n",
452 | " 1 | \n",
453 | " 804228 | \n",
454 | " 2017-01-12 08:01:45.159739 | \n",
455 | " control | \n",
456 | " old_page | \n",
457 | " 0 | \n",
458 | "
\n",
459 | " \n",
460 | " 2 | \n",
461 | " 661590 | \n",
462 | " 2017-01-11 16:55:06.154213 | \n",
463 | " treatment | \n",
464 | " new_page | \n",
465 | " 0 | \n",
466 | "
\n",
467 | " \n",
468 | " 3 | \n",
469 | " 853541 | \n",
470 | " 2017-01-08 18:28:03.143765 | \n",
471 | " treatment | \n",
472 | " new_page | \n",
473 | " 0 | \n",
474 | "
\n",
475 | " \n",
476 | " 4 | \n",
477 | " 864975 | \n",
478 | " 2017-01-21 01:52:26.210827 | \n",
479 | " control | \n",
480 | " old_page | \n",
481 | " 1 | \n",
482 | "
\n",
483 | " \n",
484 | "
\n",
485 | "
"
486 | ],
487 | "text/plain": [
488 | " user_id timestamp group landing_page converted\n",
489 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0\n",
490 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0\n",
491 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0\n",
492 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0\n",
493 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1"
494 | ]
495 | },
496 | "execution_count": 9,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "df.head()"
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": 10,
508 | "metadata": {},
509 | "outputs": [
510 | {
511 | "name": "stdout",
512 | "output_type": "stream",
513 | "text": [
514 | "Length of df: 294478\n",
515 | "Number of false rows in df: 3893\n",
516 | "Removed rows in df2: 3893\n"
517 | ]
518 | }
519 | ],
520 | "source": [
521 | "#get all the indices of wrong entries\n",
522 | "false_index = df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False].index\n",
523 | "\n",
524 | "print(\"Length of df: \", len(df))\n",
525 | "print(\"Number of false rows in df: \",len(false_index))\n",
526 | "\n",
527 | "#drop these indices out\n",
528 | "df2 = df.drop(false_index)\n",
529 | "\n",
530 | "print(\"Removed rows in df2: \",len(df) - len(df2))"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": 11,
536 | "metadata": {},
537 | "outputs": [
538 | {
539 | "data": {
540 | "text/plain": [
541 | "0"
542 | ]
543 | },
544 | "execution_count": 11,
545 | "metadata": {},
546 | "output_type": "execute_result"
547 | }
548 | ],
549 | "source": [
550 | "# Double Check all of the correct rows were removed - this should be 0\n",
551 | "df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]"
552 | ]
553 | },
554 | {
555 | "cell_type": "markdown",
556 | "metadata": {},
557 | "source": [
558 | "`3.` Use **df2** and the cells below to answer questions for **Quiz3** in the classroom."
559 | ]
560 | },
561 | {
562 | "cell_type": "markdown",
563 | "metadata": {},
564 | "source": [
565 | "a. How many unique **user_id**s are in **df2**?"
566 | ]
567 | },
568 | {
569 | "cell_type": "code",
570 | "execution_count": 12,
571 | "metadata": {},
572 | "outputs": [
573 | {
574 | "data": {
575 | "text/plain": [
576 | "290584"
577 | ]
578 | },
579 | "execution_count": 12,
580 | "metadata": {},
581 | "output_type": "execute_result"
582 | }
583 | ],
584 | "source": [
585 | "df2.user_id.nunique()"
586 | ]
587 | },
588 | {
589 | "cell_type": "markdown",
590 | "metadata": {
591 | "collapsed": true
592 | },
593 | "source": [
594 | "b. There is one **user_id** repeated in **df2**. What is it?"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 13,
600 | "metadata": {},
601 | "outputs": [
602 | {
603 | "data": {
604 | "text/html": [
605 | "\n",
606 | "\n",
619 | "
\n",
620 | " \n",
621 | " \n",
622 | " | \n",
623 | " user_id | \n",
624 | " timestamp | \n",
625 | " group | \n",
626 | " landing_page | \n",
627 | " converted | \n",
628 | "
\n",
629 | " \n",
630 | " \n",
631 | " \n",
632 | " 2893 | \n",
633 | " 773192 | \n",
634 | " 2017-01-14 02:55:59.590927 | \n",
635 | " treatment | \n",
636 | " new_page | \n",
637 | " 0 | \n",
638 | "
\n",
639 | " \n",
640 | "
\n",
641 | "
"
642 | ],
643 | "text/plain": [
644 | " user_id timestamp group landing_page converted\n",
645 | "2893 773192 2017-01-14 02:55:59.590927 treatment new_page 0"
646 | ]
647 | },
648 | "execution_count": 13,
649 | "metadata": {},
650 | "output_type": "execute_result"
651 | }
652 | ],
653 | "source": [
654 | "#show duplicated user ids\n",
655 | "df2[df2.duplicated(subset = [\"user_id\"])] #user id 773192"
656 | ]
657 | },
658 | {
659 | "cell_type": "markdown",
660 | "metadata": {},
661 | "source": [
662 | "c. What is the row information for the repeat **user_id**? "
663 | ]
664 | },
665 | {
666 | "cell_type": "code",
667 | "execution_count": 14,
668 | "metadata": {},
669 | "outputs": [
670 | {
671 | "data": {
672 | "text/html": [
673 | "\n",
674 | "\n",
687 | "
\n",
688 | " \n",
689 | " \n",
690 | " | \n",
691 | " user_id | \n",
692 | " timestamp | \n",
693 | " group | \n",
694 | " landing_page | \n",
695 | " converted | \n",
696 | "
\n",
697 | " \n",
698 | " \n",
699 | " \n",
700 | " 1899 | \n",
701 | " 773192 | \n",
702 | " 2017-01-09 05:37:58.781806 | \n",
703 | " treatment | \n",
704 | " new_page | \n",
705 | " 0 | \n",
706 | "
\n",
707 | " \n",
708 | " 2893 | \n",
709 | " 773192 | \n",
710 | " 2017-01-14 02:55:59.590927 | \n",
711 | " treatment | \n",
712 | " new_page | \n",
713 | " 0 | \n",
714 | "
\n",
715 | " \n",
716 | "
\n",
717 | "
"
718 | ],
719 | "text/plain": [
720 | " user_id timestamp group landing_page converted\n",
721 | "1899 773192 2017-01-09 05:37:58.781806 treatment new_page 0\n",
722 | "2893 773192 2017-01-14 02:55:59.590927 treatment new_page 0"
723 | ]
724 | },
725 | "execution_count": 14,
726 | "metadata": {},
727 | "output_type": "execute_result"
728 | }
729 | ],
730 | "source": [
731 | "df2[df2.duplicated(subset = [\"user_id\"], keep = False)] #different timestamp"
732 | ]
733 | },
734 | {
735 | "cell_type": "markdown",
736 | "metadata": {},
737 | "source": [
738 | "d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**."
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": 15,
744 | "metadata": {},
745 | "outputs": [
746 | {
747 | "name": "stdout",
748 | "output_type": "stream",
749 | "text": [
750 | "Length before drop: 290585\n",
751 | "Length after drop: 290584\n"
752 | ]
753 | }
754 | ],
755 | "source": [
756 | "print(\"Length before drop: \", len(df2))\n",
757 | "#drop duplicated user ids\n",
758 | "df2 = df2.drop(df2[df2.duplicated(subset = [\"user_id\"])].index)\n",
759 | "print(\"Length after drop: \", len(df2))"
760 | ]
761 | },
762 | {
763 | "cell_type": "code",
764 | "execution_count": 16,
765 | "metadata": {},
766 | "outputs": [
767 | {
768 | "data": {
769 | "text/html": [
770 | "\n",
771 | "\n",
784 | "
\n",
785 | " \n",
786 | " \n",
787 | " | \n",
788 | " user_id | \n",
789 | " timestamp | \n",
790 | " group | \n",
791 | " landing_page | \n",
792 | " converted | \n",
793 | "
\n",
794 | " \n",
795 | " \n",
796 | " \n",
797 | "
\n",
798 | "
"
799 | ],
800 | "text/plain": [
801 | "Empty DataFrame\n",
802 | "Columns: [user_id, timestamp, group, landing_page, converted]\n",
803 | "Index: []"
804 | ]
805 | },
806 | "execution_count": 16,
807 | "metadata": {},
808 | "output_type": "execute_result"
809 | }
810 | ],
811 | "source": [
812 | "df2[df2.duplicated(subset = [\"user_id\"], keep = False)] "
813 | ]
814 | },
815 | {
816 | "cell_type": "markdown",
817 | "metadata": {},
818 | "source": [
819 | "`4.` Use **df2** in the cells below to answer the quiz questions related to **Quiz 4** in the classroom.\n",
820 | "\n",
821 | "a. What is the probability of an individual converting regardless of the page they receive?"
822 | ]
823 | },
824 | {
825 | "cell_type": "code",
826 | "execution_count": 17,
827 | "metadata": {},
828 | "outputs": [
829 | {
830 | "name": "stdout",
831 | "output_type": "stream",
832 | "text": [
833 | "The probability of an individual converting regardless of the page they receive is: 0.119597087245\n"
834 | ]
835 | }
836 | ],
837 | "source": [
838 | "overall_conv_prob = df2.converted.mean()\n",
839 | "print(\"The probability of an individual converting regardless of the page they receive is: \", overall_conv_prob)"
840 | ]
841 | },
842 | {
843 | "cell_type": "markdown",
844 | "metadata": {},
845 | "source": [
846 | "b. Given that an individual was in the `control` group, what is the probability they converted?"
847 | ]
848 | },
849 | {
850 | "cell_type": "code",
851 | "execution_count": 18,
852 | "metadata": {},
853 | "outputs": [
854 | {
855 | "name": "stdout",
856 | "output_type": "stream",
857 | "text": [
858 | "The probability of converting in the control group is: 0.1203863045\n"
859 | ]
860 | }
861 | ],
862 | "source": [
863 | "controlgrp_conv_prob = df2.query(\"group == 'control'\").converted.mean()\n",
864 | "print(\"The probability of converting in the control group is: \", controlgrp_conv_prob)"
865 | ]
866 | },
867 | {
868 | "cell_type": "markdown",
869 | "metadata": {},
870 | "source": [
871 | "c. Given that an individual was in the `treatment` group, what is the probability they converted?"
872 | ]
873 | },
874 | {
875 | "cell_type": "code",
876 | "execution_count": 19,
877 | "metadata": {},
878 | "outputs": [
879 | {
880 | "name": "stdout",
881 | "output_type": "stream",
882 | "text": [
883 | "The probability of converting in the treatment group is: 0.118808065515\n"
884 | ]
885 | }
886 | ],
887 | "source": [
888 | "controlgrp_conv_prob = df2.query(\"group == 'treatment'\").converted.mean()\n",
889 | "print(\"The probability of converting in the treatment group is: \", controlgrp_conv_prob)"
890 | ]
891 | },
892 | {
893 | "cell_type": "markdown",
894 | "metadata": {},
895 | "source": [
896 | "d. What is the probability that an individual received the new page?"
897 | ]
898 | },
899 | {
900 | "cell_type": "code",
901 | "execution_count": 20,
902 | "metadata": {},
903 | "outputs": [
904 | {
905 | "name": "stdout",
906 | "output_type": "stream",
907 | "text": [
908 | "The probability of receiving the new page is: 0.5000619442226688\n"
909 | ]
910 | }
911 | ],
912 | "source": [
913 | "overall_conv_prob = len(df2.query(\"landing_page == 'new_page'\"))/len(df2)\n",
914 | "print(\"The probability of receiving the new page is: \", overall_conv_prob)"
915 | ]
916 | },
917 | {
918 | "cell_type": "code",
919 | "execution_count": 21,
920 | "metadata": {},
921 | "outputs": [
922 | {
923 | "data": {
924 | "text/plain": [
925 | "17264"
926 | ]
927 | },
928 | "execution_count": 21,
929 | "metadata": {},
930 | "output_type": "execute_result"
931 | }
932 | ],
933 | "source": [
934 | "num_conv_treat = df2.query(\"group == 'treatment' and converted == 1\").count()[0]\n",
935 | "num_conv_treat"
936 | ]
937 | },
938 | {
939 | "cell_type": "code",
940 | "execution_count": 22,
941 | "metadata": {},
942 | "outputs": [
943 | {
944 | "data": {
945 | "text/plain": [
946 | "17489"
947 | ]
948 | },
949 | "execution_count": 22,
950 | "metadata": {},
951 | "output_type": "execute_result"
952 | }
953 | ],
954 | "source": [
955 | "num_conv_control = df2.query(\"group == 'control' and converted == 1\").count()[0]\n",
956 | "num_conv_control"
957 | ]
958 | },
959 | {
960 | "cell_type": "markdown",
961 | "metadata": {},
962 | "source": [
963 | "e. Consider your results from parts (a) through (d) above, and explain below whether you think there is sufficient evidence to conclude that the new treatment page leads to more conversions."
964 | ]
965 | },
966 | {
967 | "cell_type": "markdown",
968 | "metadata": {},
969 | "source": [
970 | "**To sum up the results:**\n",
971 | "\n",
972 | " - The Probability of converting regardless of page is: 11.96%
\n",
973 | " - Given an individual received the control page, the probability of converting is: 12.04% (17723 people in the control group converted)
\n",
974 | " - Given that an individual received the treatment page, the probability of converting is: 11.88% (17514 people in the treatment group converted)
\n",
975 | " - The probability of receiving the new page is: 50.01%
\n",
976 | "
\n",
977 | "\n",
978 | "The first positive thing to mention is, that the users received the new or the old page in a ration very close to 50/50. The probabilities of converting in the control group and the treatment group are very close to each other, with a difference of 0.16%. This small difference could also appear by chance, therefore we don't have sufficient evidence to conclude that the new treatment page leads to more conversions than the old page. "
979 | ]
980 | },
981 | {
982 | "cell_type": "markdown",
983 | "metadata": {},
984 | "source": [
985 | "\n",
986 | "### Part II - A/B Test\n",
987 | "\n",
988 | "Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed. \n",
989 | "\n",
990 | "However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time? How long do you run to render a decision that neither page is better than another? \n",
991 | "\n",
992 | "These questions are the difficult parts associated with A/B tests in general. \n",
993 | "\n",
994 | "\n",
995 | "`1.` For now, consider you need to make the decision just based on all the data provided. If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state your hypothesis in terms of words or in terms of **$p_{old}$** and **$p_{new}$**, which are the converted rates for the old and new pages."
996 | ]
997 | },
998 | {
999 | "cell_type": "markdown",
1000 | "metadata": {},
1001 | "source": [
1002 | "\n",
1003 | " \n",
1004 | "**$$H_{0}: p_{new} - p_{old} <= 0$$**\n",
1005 | "\n",
1006 | "**$$H_{1}: p_{new} - p_{old} > 0$$**\n",
1007 | "\n"
1008 | ]
1009 | },
1010 | {
1011 | "cell_type": "markdown",
1012 | "metadata": {},
1013 | "source": [
1014 | "`2.` Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have \"true\" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page.
\n",
1015 | "\n",
1016 | "Use a sample size for each page equal to the ones in **ab_data.csv**.
\n",
1017 | "\n",
1018 | "Perform the sampling distribution for the difference in **converted** between the two pages over 10,000 iterations of calculating an estimate from the null.
\n",
1019 | "\n",
1020 | "Use the cells below to provide the necessary parts of this simulation. If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem. You can use **Quiz 5** in the classroom to make sure you are on the right track.
"
1021 | ]
1022 | },
1023 | {
1024 | "cell_type": "markdown",
1025 | "metadata": {},
1026 | "source": [
1027 | "a. What is the **conversion rate** for $p_{new}$ under the null? "
1028 | ]
1029 | },
1030 | {
1031 | "cell_type": "code",
1032 | "execution_count": 23,
1033 | "metadata": {},
1034 | "outputs": [
1035 | {
1036 | "data": {
1037 | "text/plain": [
1038 | "0.11959708724499628"
1039 | ]
1040 | },
1041 | "execution_count": 23,
1042 | "metadata": {},
1043 | "output_type": "execute_result"
1044 | }
1045 | ],
1046 | "source": [
1047 | "p_new = df2.converted.mean()\n",
1048 | "p_new"
1049 | ]
1050 | },
1051 | {
1052 | "cell_type": "markdown",
1053 | "metadata": {},
1054 | "source": [
1055 | "b. What is the **conversion rate** for $p_{old}$ under the null?
"
1056 | ]
1057 | },
1058 | {
1059 | "cell_type": "code",
1060 | "execution_count": 24,
1061 | "metadata": {},
1062 | "outputs": [
1063 | {
1064 | "data": {
1065 | "text/plain": [
1066 | "0.11959708724499628"
1067 | ]
1068 | },
1069 | "execution_count": 24,
1070 | "metadata": {},
1071 | "output_type": "execute_result"
1072 | }
1073 | ],
1074 | "source": [
1075 | "p_old = df2.converted.mean()\n",
1076 | "p_old"
1077 | ]
1078 | },
1079 | {
1080 | "cell_type": "markdown",
1081 | "metadata": {},
1082 | "source": [
1083 | "c. What is $n_{new}$, the number of individuals in the treatment group?"
1084 | ]
1085 | },
1086 | {
1087 | "cell_type": "code",
1088 | "execution_count": 25,
1089 | "metadata": {},
1090 | "outputs": [
1091 | {
1092 | "data": {
1093 | "text/plain": [
1094 | "145310"
1095 | ]
1096 | },
1097 | "execution_count": 25,
1098 | "metadata": {},
1099 | "output_type": "execute_result"
1100 | }
1101 | ],
1102 | "source": [
1103 | "n_new = df2.query(\"group == 'treatment'\").user_id.nunique()\n",
1104 | "n_new"
1105 | ]
1106 | },
1107 | {
1108 | "cell_type": "markdown",
1109 | "metadata": {},
1110 | "source": [
1111 | "d. What is $n_{old}$, the number of individuals in the control group?"
1112 | ]
1113 | },
1114 | {
1115 | "cell_type": "code",
1116 | "execution_count": 26,
1117 | "metadata": {},
1118 | "outputs": [
1119 | {
1120 | "data": {
1121 | "text/plain": [
1122 | "145274"
1123 | ]
1124 | },
1125 | "execution_count": 26,
1126 | "metadata": {},
1127 | "output_type": "execute_result"
1128 | }
1129 | ],
1130 | "source": [
1131 | "n_old = df2.query(\"group == 'control'\").user_id.nunique()\n",
1132 | "n_old"
1133 | ]
1134 | },
1135 | {
1136 | "cell_type": "markdown",
1137 | "metadata": {},
1138 | "source": [
1139 | "e. Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in **new_page_converted**."
1140 | ]
1141 | },
1142 | {
1143 | "cell_type": "code",
1144 | "execution_count": 27,
1145 | "metadata": {},
1146 | "outputs": [
1147 | {
1148 | "data": {
1149 | "text/plain": [
1150 | "17272"
1151 | ]
1152 | },
1153 | "execution_count": 27,
1154 | "metadata": {},
1155 | "output_type": "execute_result"
1156 | }
1157 | ],
1158 | "source": [
1159 | "#simulate n transactions with a conversion rate of p with np.random.choice\n",
1160 | "new_page_converted = np.random.choice([1,0], size = n_new, replace = True, p = (p_new, 1-p_new))\n",
1161 | "new_page_converted.sum()"
1162 | ]
1163 | },
1164 | {
1165 | "cell_type": "markdown",
1166 | "metadata": {},
1167 | "source": [
1168 | "f. Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in **old_page_converted**."
1169 | ]
1170 | },
1171 | {
1172 | "cell_type": "code",
1173 | "execution_count": 28,
1174 | "metadata": {},
1175 | "outputs": [
1176 | {
1177 | "data": {
1178 | "text/plain": [
1179 | "17436"
1180 | ]
1181 | },
1182 | "execution_count": 28,
1183 | "metadata": {},
1184 | "output_type": "execute_result"
1185 | }
1186 | ],
1187 | "source": [
1188 | "old_page_converted = np.random.choice([1,0], size = n_old, replace = True, p = [p_old, (1-p_old)])\n",
1189 | "old_page_converted.sum()"
1190 | ]
1191 | },
1192 | {
1193 | "cell_type": "markdown",
1194 | "metadata": {},
1195 | "source": [
1196 | "g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f)."
1197 | ]
1198 | },
1199 | {
1200 | "cell_type": "code",
1201 | "execution_count": 29,
1202 | "metadata": {},
1203 | "outputs": [
1204 | {
1205 | "data": {
1206 | "text/plain": [
1207 | "-0.0011583564321773071"
1208 | ]
1209 | },
1210 | "execution_count": 29,
1211 | "metadata": {},
1212 | "output_type": "execute_result"
1213 | }
1214 | ],
1215 | "source": [
1216 | "new_page_converted.mean() - old_page_converted.mean()"
1217 | ]
1218 | },
1219 | {
1220 | "cell_type": "markdown",
1221 | "metadata": {},
1222 | "source": [
1223 | "h. Create 10,000 $p_{new}$ - $p_{old}$ values using the same simulation process you used in parts (a) through (g) above. Store all 10,000 values in a NumPy array called **p_diffs**."
1224 | ]
1225 | },
1226 | {
1227 | "cell_type": "code",
1228 | "execution_count": 30,
1229 | "metadata": {},
1230 | "outputs": [],
1231 | "source": [
1232 | "#creating the sampling distribution with 10000 simulations of the steps before\n",
1233 | "p_diffs = []\n",
1234 | "for _ in range(10000):\n",
1235 | " new_page_converted = np.random.choice([1,0], size = n_new, replace = True, p = (p_new, 1-p_new))\n",
1236 | " old_page_converted = np.random.choice([1,0], size = n_old, replace = True, p = (p_old, 1-p_old))\n",
1237 | " diff = new_page_converted.mean() - old_page_converted.mean()\n",
1238 | " p_diffs.append(diff)"
1239 | ]
1240 | },
1241 | {
1242 | "cell_type": "markdown",
1243 | "metadata": {},
1244 | "source": [
1245 | "i. Plot a histogram of the **p_diffs**. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here."
1246 | ]
1247 | },
1248 | {
1249 | "cell_type": "code",
1250 | "execution_count": 31,
1251 | "metadata": {},
1252 | "outputs": [
1253 | {
1254 | "data": {
1255 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAD8CAYAAAB+UHOxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAELVJREFUeJzt3X+s3XV9x/Hna0UwmzqKvbCurSuaLln5Y8gaZHF/sLBBKYbiHyaQTBs0qckg0cxlqfIHRkOCOn+EzGFQG0uGIpsaG+mGlbgYkwEtDIFaWa9Q5dqO1tWgi4kL+N4f51s59N7ee+6Pc8+9fJ6P5OR8z/v7+f769Oa++v1+vud7U1VIktrzW6PeAUnSaBgAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEadMeodmM6qVatq/fr1o94NSVpWHn744Z9W1dhM7ZZ0AKxfv579+/ePejckaVlJ8qNB2nkJSJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGrWkvwkszWT9jntHtu3Dt141sm1LC8EzAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKB8HLc3RqB5F7WOotVA8A5CkRs0YAEnWJfl2koNJDiR5T1f/YJKfJHm0e23pW+b9ScaTPJnkir765q42nmTHcA5JkjSIQS4BPQ+8r6oeSfJq4OEke7t5n6yqv+9vnGQjcC1wAfD7wLeS/GE3+9PAXwITwL4ku6vq+wtxIJKk2ZkxAKrqKHC0m/5FkoPAmmkW2QrcXVW/Ap5OMg5c3M0br6qnAJLc3bU1ACRpBGY1BpBkPfBG4MGudGOSx5LsTLKyq60BnulbbKKrna4uSRqBgQMgyauArwDvraqfA7cDbwAupHeG8PGTTadYvKapn7qd7Un2J9l//PjxQXdPkjRLAwVAklfQ++V/V1V9FaCqnq2qF6rq18BnefEyzwSwrm/xtcCRaeovUVV3VNWmqto0NjY22+ORJA1okLuAAnweOFhVn+irr+5r9lbgiW56N3BtkrOSnA9sAB4C9gEbkpyf5Ex6A8W7F+YwJEmzNchdQG8G3g48nuTRrvYB4LokF9K7jHMYeDdAVR1Icg+9wd3ngRuq6gWAJDcC9wErgJ1VdWABj0WSNAuD3AX0Xaa+fr9nmmVuAW6Zor5nuuUkSYvHbwJLUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUTMGQJJ1Sb6d5GCSA0ne09XPSbI3yaHufWVXT5LbkowneSzJRX3r2ta1P5Rk2/AOS5I0k0HOAJ4H3ldVfwRcAtyQZCOwA7i/qjYA93efAa4ENnSv7cDt0AsM4GbgTcDFwM0nQ0OStPhmDICqOlpVj3TTvwAOAmuArcCurtku4JpueitwZ/U8AJydZDVwBbC3qk5U1c+AvcDmBT0aSdLAZjUGkGQ98EbgQeC8qjoKvZAAzu2arQGe6Vtsoqudri5JGoGBAyDJq4CvAO+tqp9P13SKWk1TP3U725PsT7L/+PHjg+6eJGmWBgqAJK+g98v/rqr6ald+tru0Q/d+rKtPAOv6Fl8LHJmm/hJVdUdVbaqqTWNjY7M5FknSLAxyF1CAzwMHq+oTfbN2Ayfv5NkGfL2v/o7ubqBLgOe6S0T3AZcnWdkN/l7e1SRJI3DGAG3eDLwdeDzJo13tA8CtwD1J3gX8GHhbN28PsAUYB34JXA9QVSeSfBjY17X7UFWdWJCjkCTN2owBUFXfZerr9wCXTdG+gBtOs66dwM7Z7KAkaTj8JrAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDVqkEdBSDNav+PeUe+CpFnyDECSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqNmDIAkO5McS/JEX+2DSX6S5NHutaVv3vuTjCd5MskVffXNXW08yY6FPxRJ0mwMcgbwBWDzFPVPVtWF3WsPQJKNwLXABd0y/5hkRZIVwKeBK4GNwHVdW0nSiMz4R+Gr6jtJ1g+4vq3A3VX1K+DpJOPAxd288ap6CiDJ3V3b7896jyVJC2I+YwA3Jnmsu0S0squtAZ7pazPR1U5XlySNyFwD4HbgDcCFwFHg4109U7StaeqTJNmeZH+S/cePH5/j7kmSZjKnAKiqZ6vqhar6NfBZXrzMMwGs62u6FjgyTX2qdd9RVZuqatPY2Nhcdk+SNIA5BUCS1X0f3wqcvENoN3BtkrOSnA9sAB4C9gEbkpyf5Ex6A8W7577bkqT5mnEQOMmXgEuBVUkmgJuBS5NcSO8yzmHg3QBVdSDJPfQGd58HbqiqF7r13AjcB6wAdlbVgQU/GknSwAa5C+i6Kcqfn6b9LcAtU9T3AHtmtXeSpKHxm8CS1CgDQJIaZQBIUqNmHAOQtLSs33HvyLZ9+NarRrZtLTzPACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUTMGQJKdSY4leaKvdk6SvUkOde8ru3qS3JZkPMljSS7qW2Zb1/5Qkm3DORxJ0qAGOQP4ArD5lNoO4P6q2gDc330GuBLY0L22A7dDLzCAm4E3ARcDN58MDUnSaMwYAFX1HeDEKeWtwK5uehdwTV/9zup5ADg7yWrgCmBvVZ2oqp8Be5kcKpKkRTTXMYDzquooQPd+bldfAzzT126iq52uPkmS7Un2J9l//PjxOe6eJGkmCz0InClqNU19crHqjqraVFWbxsbGFnTnJEkvmmsAPNtd2qF7P9bVJ4B1fe3WAkemqUuSRmSuAbAbOHknzzbg6331d3R3A10CPNddIroPuDzJym7w9/KuJkkakTNmapDkS8ClwKokE/Tu5rkVuCfJu4AfA2/rmu8BtgDjwC+B6wGq6kSSDwP7unYfqqpTB5YlSYtoxgCoqutOM+uyKdoWcMNp1rMT2DmrvZMkDY3fBJakRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGzfgXwbS8rN9x76h3QdIy4RmAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEbNKwCSHE
7yeJJHk+zvauck2ZvkUPe+sqsnyW1JxpM8luSihTgASdLcLMQZwJ9X1YVVtan7vAO4v6o2APd3nwGuBDZ0r+3A7QuwbUnSHA3jEtBWYFc3vQu4pq9+Z/U8AJydZPUQti9JGsB8A6CAbyZ5OMn2rnZeVR0F6N7P7eprgGf6lp3oapKkEZjv46DfXFVHkpwL7E3yg2naZopaTWrUC5LtAK973evmuXuSpNOZ1xlAVR3p3o8BXwMuBp49eWmnez/WNZ8A1vUtvhY4MsU676iqTVW1aWxsbD67J0maxpwDIMnvJHn1yWngcuAJYDewrWu2Dfh6N70beEd3N9AlwHMnLxVJkhbffC4BnQd8LcnJ9Xyxqv4tyT7gniTvAn4MvK1rvwfYAowDvwSun8e2JUnzNOcAqKqngD+eov4/wGVT1Au4Ya7bkyQtLP8msKSBjepvTh++9aqRbPflzkdBSFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY3ybwIPwaj+bqokzYZnAJLUKANAkhplAEhSoxwDkLTkjXJc7fCtV41s28PmGYAkNcoAkKRGGQCS1KhFD4Akm5M8mWQ8yY7F3r4kqWdRAyDJCuDTwJXARuC6JBsXcx8kST2LfRfQxcB4VT0FkORuYCvw/WFszG/kStLpLXYArAGe6fs8AbxpkfdBkgY2qv9ILsbtp4sdAJmiVi9pkGwHtncf/zfJk0Pep1XAT4e8jeXGPpnMPpnMPplswfokH5nX4n8wSKPFDoAJYF3f57XAkf4GVXUHcMdi7VCS/VW1abG2txzYJ5PZJ5PZJ5Mttz5Z7LuA9gEbkpyf5EzgWmD3Iu+DJIlFPgOoqueT3AjcB6wAdlbVgcXcB0lSz6I/C6iq9gB7Fnu701i0y03LiH0ymX0ymX0y2bLqk1TVzK0kSS87PgpCkhr1sg2AJOck2ZvkUPe+8jTttnVtDiXZ1lf/kySPd4+suC1JTlnub5NUklXDPpaFMqw+SfKxJD9I8liSryU5e7GOaa5meiRJkrOSfLmb/2CS9X3z3t/Vn0xyxaDrXOoWuk+SrEvy7SQHkxxI8p7FO5qFMYyfk27eiiT/meQbwz+KaVTVy/IFfBTY0U3vAD4yRZtzgKe695Xd9Mpu3kPAn9L77sK/Alf2LbeO3kD2j4BVoz7WUfcJcDlwRjf9kanWu5Re9G5A+CHweuBM4HvAxlPa/DXwmW76WuDL3fTGrv1ZwPndelYMss6l/BpSn6wGLuravBr4r9b7pG+5vwG+CHxjlMf4sj0DoPeIiV3d9C7gminaXAHsraoTVfUzYC+wOclq4DVV9R/V+9e685TlPwn8Had8iW0ZGEqfVNU3q+r5bvkH6H2/Yyn7zSNJqur/gJOPJOnX31f/AlzWnfFsBe6uql9V1dPAeLe+Qda5lC14n1TV0ap6BKCqfgEcpPc0gOViGD8nJFkLXAV8bhGOYVov5wA4r6qOAnTv507RZqpHU6zpXhNT1ElyNfCTqvreMHZ6yIbSJ6d4J72zg6XsdMc4ZZsu3J4DXjvNsoOscykbRp/8Rndp5I3Agwu4z8M2rD75FL3/QP564Xd5dpb1n4RM8i3g96aYddOgq5iiVqerJ/ntbt2XD7j+RbfYfXLKtm8CngfuGnBbozLjsUzT5nT1qf4ztZzOEIfRJ72FklcBXwHeW1U/n/MeLr4F75MkbwGOVdXDSS6d5/7N27IOgKr6i9PNS/JsktVVdbS7fHFsimYTwKV9n9cC/97V155SPwK8gd71vO91459rgUeSXFxV/z2PQ1kwI+iTk+veBrwFuKy7RLSUzfhIkr42E0nOAH4XODHDsjOtcykbSp8keQW9X/53VdVXh7PrQzOMPrkauDrJFuCVwGuS/FNV/dVwDmEGox5oGdYL+BgvHfD86BRtzgGepjfYubKbPqebtw+4hBcHPLdMsfxhltcg8FD6BNhM75HeY6M+xgH74Qx6g9vn8+Lg3gWntLmBlw7u3dNNX8BLB/eeojdYOOM6l/JrSH0SemNFnxr18S2VPjll2UsZ8SDwyDt5iP94rwXuBw517yd/iW0CPtfX7p30BmjGgev76puAJ+iN3v8D3ZfmTtnGcguAofRJ1+4Z4NHu9ZlRH+sAfbGF3l0pPwRu6mofAq7upl8J/HN3bA8Br+9b9qZuuSd56d1hk9a5nF4L3SfAn9G7HPJY38/GpP9ILeXXMH5O+uaPPAD8JrAkNerlfBeQJGkaBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY36fwOV9rCNLfoUAAAAAElFTkSuQmCC\n",
1256 | "text/plain": [
1257 | ""
1258 | ]
1259 | },
1260 | "metadata": {
1261 | "needs_background": "light"
1262 | },
1263 | "output_type": "display_data"
1264 | }
1265 | ],
1266 | "source": [
1267 | "plt.hist(p_diffs);"
1268 | ]
1269 | },
1270 | {
1271 | "cell_type": "markdown",
1272 | "metadata": {},
1273 | "source": [
1274 | "j. What proportion of the **p_diffs** are greater than the actual difference observed in **ab_data.csv**?"
1275 | ]
1276 | },
1277 | {
1278 | "cell_type": "code",
1279 | "execution_count": 32,
1280 | "metadata": {},
1281 | "outputs": [
1282 | {
1283 | "name": "stdout",
1284 | "output_type": "stream",
1285 | "text": [
1286 | "Number of converted persons in control group: 17489 | p_old: 0.1203863045\n",
1287 | "Number of converted persons in treatment group: 17264 | p_new: 0.118808065515\n",
1288 | "Actual difference: -0.00157823898536\n"
1289 | ]
1290 | }
1291 | ],
1292 | "source": [
1293 | "p_actual_old = df2.query(\"group == 'control'\").converted.mean()\n",
1294 | "p_actual_new = df2.query(\"group == 'treatment'\").converted.mean()\n",
1295 | "actual_diff = p_actual_new - p_actual_old\n",
1296 | "print(\"Number of converted persons in control group: \",num_conv_control, \"| p_old: \", p_actual_old)\n",
1297 | "print(\"Number of converted persons in treatment group: \",num_conv_treat, \"| p_new: \", p_actual_new)\n",
1298 | "print(\"Actual difference: \", actual_diff)"
1299 | ]
1300 | },
1301 | {
1302 | "cell_type": "code",
1303 | "execution_count": 33,
1304 | "metadata": {},
1305 | "outputs": [
1306 | {
1307 | "data": {
1308 | "text/plain": [
1309 | ""
1310 | ]
1311 | },
1312 | "execution_count": 33,
1313 | "metadata": {},
1314 | "output_type": "execute_result"
1315 | },
1316 | {
1317 | "data": {
1318 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAD8CAYAAAB+UHOxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAENdJREFUeJzt3X+s3XV9x/Hna1QwmzqKLaxr64qmM4M/hqxBFvcHCxuUYgD/MIFk2qBJTQaJZi5LlT8wGhLQ+SNkDle1sWQosqmxgW5YicaYDGhhCFRkXKHKtR2tYtDFxAX33h/nWz3cnnvvuT/OPbf9PB/JN+d73t/P9/v9fD+9ua9+f5xzU1VIktrzW+PugCRpPAwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqNWjLsDM1m1alVt2LBh3N3QXD35ZO/19a8fbz+kRj300EM/rqrVs7Vb1gGwYcMG9u/fP+5uaK4uuqj3+s1vjrMXUrOS/GCYdl4CkqRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRi3rTwJLs9mw/Z6x7fvgzZePbd/SYvAMQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1yq+DluZpXF9F7ddQa7F4BiBJjTIAJKlRBoAkNWrWAEiyPsk3kjyR5ECSd3f1DyT5UZJHumlL3zrvSzKR5Mkkl/bVN3e1iSTbR3NIkqRhDHMT+EXgvVX1cJJXAg8l2dst+3hV/X1/4yTnAFcD5wK/D3w9yR92iz8J/CUwCexLsruqvrsYByJJmptZA6CqDgOHu/mfJ3kCWDvDKlcCd1bVL4FnkkwAF3TLJqrqaYAkd3ZtDQBJGoM53QNIsgF4A/BAV7o+yaNJdiZZ2dXWAs/2rTbZ1aarT93HtiT7k+w/evToXLonSZqDoQMgySuALwHvqaqfAbcBrwPOo3eG8NFjTQesXjPUX1qo2lFVm6pq0+rVq4ftniRpjob6IFiSl9H75X9HVX0ZoKqe61v+aeDu7u0ksL5v9XXAoW5+urokaYkN8xRQgM8CT1TVx/rqa/qavQV4vJvfDVyd5LQkZwMbgQeBfcDGJGcnOZXejeLdi3MYkqS5GuYM4E3A24DHkjzS1d4PXJPkPHqXcQ4C7wKoqgNJ7qJ3c/dF4Lqq+hVAkuuBe4FTgJ1VdWARj0WSNAfDPAX0bQZfv98zwzo3ATcNqO+ZaT1J0tLxk8CS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElq1KwBkGR9km8keSLJgSTv7upnJNmb5KnudWVXT5Jbk0wkeTTJ+X3b2tq1fyrJ1tEdliRpNsOcAbwIvLeq/gi4ELguyTnAduC+qtoI3Ne9B7gM2NhN24DboBcYwI3AG4ELgBuPhYYkaenNGgBVdbiqHu7mfw48AawFrgR2dc12AVd181cCt1fP/cDpSdYAlwJ7q+r5qvopsBfYvKhHI0ka2pzuASTZALwBeAA4q6oOQy8kgDO7ZmuBZ/tWm+xq09Wn7mNbkv1J9h89enQu3ZMkzcHQAZDkFcCXgPdU1c9majqgVjPUX1qo2lFVm6pq0+rVq4ftniRpjoYKgCQvo/fL/46q+nJXfq67tEP3eqSrTwLr+1ZfBxyaoS5JGoNhngIK8Fngiar6WN+i3cCxJ3m2Al/tq7+9exroQuCF7hLRvcAlSVZ2N38v6WqSpDFYMUSbNwFvAx5L8khXez9wM3BXkncCPwTe2i3bA2wBJoBfANcCVNXzST4E7OvafbCqnl+Uo5AkzdmsAVBV32bw9XuAiwe0L+C6aba1E9g5lw5KkkbDTwJLUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUqGH+IIw0qw3b7/n1/J1P/wSAq/tqkpYfzwAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGzRoASXYmOZLk8b7aB5L8KMkj3bSlb9n7kkwkeTLJpX31zV1tIsn2xT8USdJcDHMG8Dlg84D6x6vqvG7aA5DkHOBq4NxunX9MckqSU4BPApcB5wDXdG0lSWMy67eBVtW3kmwYcntXAndW1S+BZ5JMABd0yyaq6mmAJHd2bb875x5LkhbFQu4BXJ/k0e4S0cquthZ4tq/NZFebri5JGpP5BsBtwOuA84DDwEe7ega0rRnqx0myLcn+JPuPHj06z+5JkmYzrz8IU1XPHZtP8mng7u7tJLC+r+k64FA3P1196rZ3ADsANm3aNDAkpJZtGOMf2jl48+Vj27cW37zOAJKs6Xv7FuDYE0K7gauTnJbkbGAj8CCwD9iY5Owkp9K7Ubx7/t2WJC3UrGcASb4AXASsSjIJ3AhclOQ8epdxDgLvAqiqA0nuondz90Xguqr6Vbed64F7gVOAnVV1YNGPRpI0tGGeArpmQPmzM7S/CbhpQH0PsGdOvZMkjYyfBJakRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjZg2AJDuTHEnyeF/tjCR7kzzVva7s6klya5KJJI8mOb9vna1d+6eSbB3N4UiShjXMGcDngM1TatuB+6pqI3Bf9x7gMmBjN20DboNeYAA3Am8ELgBuPBYakqTxmDUAqupbwPNTylcCu7r5XcBVffXbq+d+4PQka4BLgb1V9XxV/RTYy/GhIklaQvO9B3BWVR0G6F7P7OprgWf72k12tenqkqQxWeybwBlQqxnqx28g2ZZkf5L9R48eXdTOSZJ+Y74B8Fx3aYfu9UhXnwTW97VbBxyaoX6cqtpRVZuqatPq1avn2T1J0mzmGwC7gWNP8mwFvtpXf3v3NNCFwAvdJaJ7gUuSrOxu/l7S1SRJY7JitgZJvgBcBKxKMknvaZ6bgbuSvBP4IfDWrvkeYAswAfwCuBagqp5P8iFgX9fug1U19cayJGkJzRoAVXXNNIsuHtC2gOum2c5OYOeceidJGhk/CSxJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWrUinF3QItrw/Z7xt
0FSScIzwAkqVEGgCQ1ygCQpEYtKACSHEzyWJJHkuzvamck2Zvkqe51ZVdPkluTTCR5NMn5i3EAkqT5WYwzgD+vqvOqalP3fjtwX1VtBO7r3gNcBmzspm3AbYuwb0nSPI3iEtCVwK5ufhdwVV/99uq5Hzg9yZoR7F+SNISFPgZawNeSFPBPVbUDOKuqDgNU1eEkZ3Zt1wLP9q072dUOL7APkpbIuB4zPnjz5WPZ78luoQHwpqo61P2S35vkezO0zYBaHdco2UbvEhGvec1rFtg9SdJ0FnQJqKoOda9HgK8AFwDPHbu0070e6ZpPAuv7Vl8HHBqwzR1VtamqNq1evXoh3ZMkzWDeAZDkd5K88tg8cAnwOLAb2No12wp8tZvfDby9exroQuCFY5eKJElLbyGXgM4CvpLk2HY+X1X/nmQfcFeSdwI/BN7atd8DbAEmgF8A1y5g35KkBZp3AFTV08AfD6j/BLh4QL2A6+a7P0nS4vKTwJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVErxt2Bk9GG7feMuwuSNCvPACSpUZ4BSFr2xnlWffDmy8e271HzDECSGmUASFKjDABJatSSB0CSzUmeTDKRZPtS71+S1LOkAZDkFOCTwGXAOcA1Sc5Zyj5IknqW+gzgAmCiqp6uqv8F7gSuXOI+SJJY+sdA1wLP9r2fBN44qp35gSxJCzWu3yNL8fjpUgdABtTqJQ2SbcC27u3/JHly5L36jVXAj5dwfyeCOY/Jnx6bueXNi96ZZcKfk+M5Jsdb0JjklgXt+w+GabTUATAJrO97vw441N+gqnYAO5ayU8ck2V9Vm8ax7+XKMTmeY3I8x+R4J8KYLPU9gH3AxiRnJzkVuBrYvcR9kCSxxGcAVfVikuuBe4FTgJ1VdWAp+yBJ6lny7wKqqj3AnqXe75DGculpmXNMjueYHM8xOd6yH5NU1eytJEknHb8KQpIaddIHQJIzkuxN8lT3unKadlu7Nk8l2dpX/5Mkj3VfXXFrkkxZ72+TVJJVoz6WxTKqMUnykSTfS/Jokq8kOX2pjmm+ZvtqkiSnJflit/yBJBv6lr2vqz+Z5NJht7ncLfaYJFmf5BtJnkhyIMm7l+5oFscofk66Zack+c8kd4/+KAaoqpN6Aj4MbO/mtwO3DGhzBvB097qym1/ZLXuQ3qPtAf4NuKxvvfX0bmj/AFg17mMd95gAlwAruvlbBm13OU30HkT4PvBa4FTgO8A5U9r8NfCpbv5q4Ivd/Dld+9OAs7vtnDLMNpfzNKIxWQOc37V5JfBfrY9J33p/A3weuHscx3bSnwHQ+6qJXd38LuCqAW0uBfZW1fNV9VNgL7A5yRrgVVX1H9X717p9yvofB/6OKR9mOwGMZEyq6mtV9WK3/v30PuexnA3z1ST9Y/WvwMXdGc+VwJ1V9cuqegaY6LZ3on/dyaKPSVUdrqqHAarq58AT9L4V4EQxip8TkqwDLgc+swTHMFALAXBWVR0G6F7PHNBm0FdUrO2myQF1klwB/KiqvjOKTo/YSMZkinfQOztYzqY7xoFtunB7AXj1DOsOs83lbBRj8mvdpZE3AA8sYp9HbVRj8gl6/4H8v8Xv8nBOij8JmeTrwO8NWHTDsJsYUKvp6kl+u9v2JUNuf8kt9ZhM2fcNwIvAHUPua1xmPZYZ2kxXH/SfqhPpDHEUY9JbKXkF8CXgPVX1s3n3cOkt+pgkeTNwpKoeSnLRAvs3bydFAFTVX0y3LMlzSdZU1eHu8sWRAc0mgYv63q8DvtnV102pHwJeR+963ne6+5/rgIeTXFBV/72AQ1k0YxiTY9veCrwZuLi7RLSczfrVJH1tJpOsAH4XeH6WdWfb5nI2kjFJ8jJ6v/zvqKovj6brIzOKMbkCuCLJFuDlwKuS/HNV/dVoDmEa477BMuoJ+AgvveH54QFtzgCeoXezc2U3f0a3bB9wIb+54bllwPoHObFuAo9kTIDNwHeB1eM+xiHHYQW9m9tn85ube+dOaXMdL725d1c3fy4vvbn3NL2bhbNuczlPIxqT0LtX9IlxH99yGZMp617EmG4Cj31wl+Af79XAfcBT3euxX2KbgM/0tXsHvRs0E8C1ffVNwOP07t7/A92H56bs40QLgJGMSdfuWeCRbvrUuI91iLHYQu+plO8DN3S1DwJXdPMvB/6lO7YHgdf2rXtDt96TvPTpsOO2eSJNiz0mwJ/RuxzyaN/PxnH/kVrO0yh+TvqWjy0A/CSwJDWqhaeAJEkDGACS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXq/wGsL/KbIPpI+AAAAABJRU5ErkJggg==\n",
1319 | "text/plain": [
1320 | ""
1321 | ]
1322 | },
1323 | "metadata": {
1324 | "needs_background": "light"
1325 | },
1326 | "output_type": "display_data"
1327 | }
1328 | ],
1329 | "source": [
1330 | "p_diffs = np.array(p_diffs)\n",
1331 | "#calcualte the null_vals based on the std of the p_diffs array\n",
1332 | "null_vals = np.random.normal(0, p_diffs.std(), p_diffs.size)\n",
1333 | "plt.hist(null_vals);\n",
1334 | "plt.axvline(actual_diff, color = 'r')"
1335 | ]
1336 | },
1337 | {
1338 | "cell_type": "code",
1339 | "execution_count": 34,
1340 | "metadata": {},
1341 | "outputs": [
1342 | {
1343 | "data": {
1344 | "text/plain": [
1345 | "0.90949999999999998"
1346 | ]
1347 | },
1348 | "execution_count": 34,
1349 | "metadata": {},
1350 | "output_type": "execute_result"
1351 | }
1352 | ],
1353 | "source": [
1354 | "(null_vals > actual_diff).mean()"
1355 | ]
1356 | },
1357 | {
1358 | "cell_type": "markdown",
1359 | "metadata": {},
1360 | "source": [
1361 | "k. Please explain using the vocabulary you've learned in this course what you just computed in part **j.** What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?"
1362 | ]
1363 | },
1364 | {
1365 | "cell_type": "markdown",
1366 | "metadata": {},
1367 | "source": [
1368 | "In j. we calculated the p-value with 0.9095.\n",
1369 | "\n",
1370 | "What exactly did we do?\n",
1371 | "\n",
1372 | "We assumed that the null hypothesis is true. With that, we assume that p_old = p_new, so both pages have the same converting rates over the whole sample. Therefore we also assume, that the individual converting probability of each page is equal to the one of the whole sample. Based on that, we bootstrapped a sampling distribution for both pages and calculated the differences in the converting probability per page with n equal to the original number of people who received each page and a converting probability of 0.119597. With the resulting standard deviation of the differences (which is coming from the simulated population), we then calcualted values coming from a normal distribution around 0. As last step we calculated the proportion of values which are bigger than the actually observed difference. The calculated p-value now tells us the probability of receiving this observed statistic if the null hypothesis is true. With a Type-I-Error-Rate of 0.05, we can say that 0.9095 > 0.05, therefore we don't have enough evidence to reject the null hypothesis."
1373 | ]
1374 | },
1375 | {
1376 | "cell_type": "markdown",
1377 | "metadata": {},
1378 | "source": [
1379 | "l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let `n_old` and `n_new` refer the the number of rows associated with the old page and new pages, respectively."
1380 | ]
1381 | },
1382 | {
1383 | "cell_type": "code",
1384 | "execution_count": 35,
1385 | "metadata": {},
1386 | "outputs": [
1387 | {
1388 | "name": "stderr",
1389 | "output_type": "stream",
1390 | "text": [
1391 | "/opt/conda/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n",
1392 | " from pandas.core import datetools\n"
1393 | ]
1394 | }
1395 | ],
1396 | "source": [
1397 | "import statsmodels.api as sm\n",
1398 | "\n",
1399 | "convert_old = df2.query(\"group == 'control'\").converted.sum()\n",
1400 | "convert_new = df2.query(\"group == 'treatment'\").converted.sum()\n",
1401 | "n_old = df2.query(\"landing_page == 'old_page'\").count()[0]\n",
1402 | "n_new = df2.query(\"landing_page == 'new_page'\").count()[0]"
1403 | ]
1404 | },
1405 | {
1406 | "cell_type": "markdown",
1407 | "metadata": {},
1408 | "source": [
1409 | "m. Now use `stats.proportions_ztest` to compute your test statistic and p-value. [Here](http://knowledgetack.com/python/statsmodels/proportions_ztest/) is a helpful link on using the built in."
1410 | ]
1411 | },
1412 | {
1413 | "cell_type": "code",
1414 | "execution_count": 36,
1415 | "metadata": {},
1416 | "outputs": [
1417 | {
1418 | "name": "stdout",
1419 | "output_type": "stream",
1420 | "text": [
1421 | "Z-Score: 1.31092419842 \n",
1422 | "Critical Z-Score: 1.64485362695 \n",
1423 | "P-Value: 0.905058312759\n"
1424 | ]
1425 | }
1426 | ],
1427 | "source": [
1428 | "#https://machinelearningmastery.com/critical-values-for-statistical-hypothesis-testing/\n",
1429 | "from scipy.stats import norm\n",
1430 | "\n",
1431 | "#calculate z-test\n",
1432 | "z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative=\"smaller\")\n",
1433 | "\n",
1434 | "#calculate the critical z_term\n",
1435 | "z_critical=norm.ppf(1-(0.05))\n",
1436 | "\n",
1437 | "print(\"Z-Score: \",z_score, \"\\nCritical Z-Score: \", z_critical, \"\\nP-Value: \", p_value)\n",
1438 | "\n"
1439 | ]
1440 | },
1441 | {
1442 | "cell_type": "markdown",
1443 | "metadata": {},
1444 | "source": [
1445 | "n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts **j.** and **k.**?"
1446 | ]
1447 | },
1448 | {
1449 | "cell_type": "markdown",
1450 | "metadata": {},
1451 | "source": [
1452 | "The p-value here agrees with our findings in j. Also the calculated Z-Score is smaller than the Critical Z - Score, so we also fail to reject the null hypothesis based on the Z-test. \n",
1453 | "\n",
1454 | "In conclusion we accept the null hypothesis that the coversion rates of the old page are equal or better than the conversion rates of the new page."
1455 | ]
1456 | },
1457 | {
1458 | "cell_type": "markdown",
1459 | "metadata": {},
1460 | "source": [
1461 | "\n",
1462 | "### Part III - A regression approach\n",
1463 | "\n",
1464 | "`1.` In this final part, you will see that the result you achieved in the A/B test in Part II above can also be achieved by performing regression.
\n",
1465 | "\n",
1466 | "a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?"
1467 | ]
1468 | },
1469 | {
1470 | "cell_type": "markdown",
1471 | "metadata": {},
1472 | "source": [
1473 | "Since this case is binary, a logistic regression should be performed."
1474 | ]
1475 | },
1476 | {
1477 | "cell_type": "markdown",
1478 | "metadata": {},
1479 | "source": [
1480 | "b. The goal is to use **statsmodels** to fit the regression model you specified in part **a.** to see if there is a significant difference in conversion based on which page a customer receives. However, you first need to create in df2 a column for the intercept, and create a dummy variable column for which page each user received. Add an **intercept** column, as well as an **ab_page** column, which is 1 when an individual receives the **treatment** and 0 if **control**."
1481 | ]
1482 | },
1483 | {
1484 | "cell_type": "code",
1485 | "execution_count": 37,
1486 | "metadata": {},
1487 | "outputs": [
1488 | {
1489 | "data": {
1490 | "text/html": [
1491 | "[HTML table rendering omitted: the markup was garbled in this export; see the text/plain output below]"
1560 | ],
1561 | "text/plain": [
1562 | " user_id timestamp group landing_page converted\n",
1563 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0\n",
1564 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0\n",
1565 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0\n",
1566 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0\n",
1567 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1"
1568 | ]
1569 | },
1570 | "execution_count": 37,
1571 | "metadata": {},
1572 | "output_type": "execute_result"
1573 | }
1574 | ],
1575 | "source": [
1576 | "df_log = df2.copy()\n",
1577 | "df_log.head()"
1578 | ]
1579 | },
1580 | {
1581 | "cell_type": "code",
1582 | "execution_count": 38,
1583 | "metadata": {},
1584 | "outputs": [],
1585 | "source": [
1586 | "#add intercept\n",
1587 | "df_log[\"intercept\"] = 1\n",
1588 | "\n",
1589 | "#get dummies and rename\n",
1590 | "df_log = df_log.join(pd.get_dummies(df_log['group']))\n",
1591 | "df_log.rename(columns = {\"treatment\": \"ab_page\"}, inplace=True)"
1592 | ]
1593 | },
1594 | {
1595 | "cell_type": "code",
1596 | "execution_count": 39,
1597 | "metadata": {},
1598 | "outputs": [
1599 | {
1600 | "data": {
1601 | "text/html": [
1602 | "[HTML table rendering omitted: the markup was garbled in this export; see the text/plain output below]"
1689 | ],
1690 | "text/plain": [
1691 | " user_id timestamp group landing_page converted \\\n",
1692 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n",
1693 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n",
1694 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n",
1695 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n",
1696 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n",
1697 | "\n",
1698 | " intercept control ab_page \n",
1699 | "0 1 1 0 \n",
1700 | "1 1 1 0 \n",
1701 | "2 1 0 1 \n",
1702 | "3 1 0 1 \n",
1703 | "4 1 1 0 "
1704 | ]
1705 | },
1706 | "execution_count": 39,
1707 | "metadata": {},
1708 | "output_type": "execute_result"
1709 | }
1710 | ],
1711 | "source": [
1712 | "df_log.head()"
1713 | ]
1714 | },
1715 | {
1716 | "cell_type": "markdown",
1717 | "metadata": {},
1718 | "source": [
1719 | "c. Use **statsmodels** to instantiate your regression model on the two columns you created in part b., then fit the model using the two columns you created in part **b.** to predict whether or not an individual converts. "
1720 | ]
1721 | },
1722 | {
1723 | "cell_type": "code",
1724 | "execution_count": 40,
1725 | "metadata": {},
1726 | "outputs": [
1727 | {
1728 | "name": "stdout",
1729 | "output_type": "stream",
1730 | "text": [
1731 | "Optimization terminated successfully.\n",
1732 | " Current function value: 0.366118\n",
1733 | " Iterations 6\n"
1734 | ]
1735 | }
1736 | ],
1737 | "source": [
1738 | "y = df_log[\"converted\"]\n",
1739 | "x = df_log[[\"intercept\", \"ab_page\"]]\n",
1740 | "\n",
1741 | "#load model\n",
1742 | "log_mod = sm.Logit(y,x)\n",
1743 | "\n",
1744 | "#fit model\n",
1745 | "result = log_mod.fit()\n"
1746 | ]
1747 | },
1748 | {
1749 | "cell_type": "markdown",
1750 | "metadata": {},
1751 | "source": [
1752 | "d. Provide the summary of your model below, and use it as necessary to answer the following questions."
1753 | ]
1754 | },
1755 | {
1756 | "cell_type": "code",
1757 | "execution_count": 41,
1758 | "metadata": {},
1759 | "outputs": [
1760 | {
1761 | "data": {
1762 | "text/html": [
1763 | "[HTML rendering of the 'Logit Regression Results' table omitted (markup garbled in this export); see the text/plain summary below]"
1798 | ],
1799 | "text/plain": [
1800 | "\n",
1801 | "\"\"\"\n",
1802 | " Logit Regression Results \n",
1803 | "==============================================================================\n",
1804 | "Dep. Variable: converted No. Observations: 290584\n",
1805 | "Model: Logit Df Residuals: 290582\n",
1806 | "Method: MLE Df Model: 1\n",
1807 | "Date: Sun, 20 Jan 2019 Pseudo R-squ.: 8.077e-06\n",
1808 | "Time: 16:32:06 Log-Likelihood: -1.0639e+05\n",
1809 | "converged: True LL-Null: -1.0639e+05\n",
1810 | " LLR p-value: 0.1899\n",
1811 | "==============================================================================\n",
1812 | " coef std err z P>|z| [0.025 0.975]\n",
1813 | "------------------------------------------------------------------------------\n",
1814 | "intercept -1.9888 0.008 -246.669 0.000 -2.005 -1.973\n",
1815 | "ab_page -0.0150 0.011 -1.311 0.190 -0.037 0.007\n",
1816 | "==============================================================================\n",
1817 | "\"\"\""
1818 | ]
1819 | },
1820 | "execution_count": 41,
1821 | "metadata": {},
1822 | "output_type": "execute_result"
1823 | }
1824 | ],
1825 | "source": [
1826 | "result.summary()"
1827 | ]
1828 | },
1829 | {
1830 | "cell_type": "markdown",
1831 | "metadata": {},
1832 | "source": [
1833 | "e. What is the p-value associated with **ab_page**? Why does it differ from the value you found in **Part II**?
**Hint**: What are the null and alternative hypotheses associated with your regression model, and how do they compare to the null and alternative hypotheses in **Part II**?"
1834 | ]
1835 | },
1836 | {
1837 | "cell_type": "markdown",
1838 | "metadata": {},
1839 | "source": [
1840 | "The p-value associated with ab_page is 0.19. This is because the approach of calculating the p-value is different for each case. For the first case we calculate the probability receiving a observed statistic if the null hypothesis is true. Therefore this is a one-sided test. However, the ab_page p-value is the result of a two sided test, because the null hypothesis for this case is, that there is no significant relationship between the conversion rate and ab_page. Therefore give us a variable with a low p value \"a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable\" (http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients).\n",
1841 | "\n",
1842 | "Based on that p_value we can say, that the conversion is not significant dependent on the page. \n"
1843 | ]
1844 | },
1845 | {
1846 | "cell_type": "markdown",
1847 | "metadata": {},
1848 | "source": [
1849 | "f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?"
1850 | ]
1851 | },
1852 | {
1853 | "cell_type": "markdown",
1854 | "metadata": {},
1855 | "source": [
1856 | "Other features to consider could be extracts of the time stamp, for example the day of the week or the gender/income infrastructure (if this data would be available). This could lead to more precise results and a higher accuracy. The disadvantages are the increasing complexity of interpretation and the possible introduction of multicollinearity. However, the last problem can be solved with calculating the VIF's. "
1857 | ]
1858 | },
1859 | {
1860 | "cell_type": "markdown",
1861 | "metadata": {},
1862 | "source": [
1863 | "g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the **countries.csv** dataset and merge together your datasets on the appropriate rows. [Here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) are the docs for joining tables. \n",
1864 | "\n",
1865 | "Does it appear that country had an impact on conversion? Don't forget to create dummy variables for these country columns - **Hint: You will need two columns for the three dummy variables.** Provide the statistical output as well as a written response to answer this question."
1866 | ]
1867 | },
1868 | {
1869 | "cell_type": "code",
1870 | "execution_count": 42,
1871 | "metadata": {},
1872 | "outputs": [
1873 | {
1874 | "data": {
1875 | "text/html": [
1876 | "[HTML table rendering omitted: the markup was garbled in this export; see the text/plain output below]"
1927 | ],
1928 | "text/plain": [
1929 | " user_id country\n",
1930 | "0 834778 UK\n",
1931 | "1 928468 US\n",
1932 | "2 822059 UK\n",
1933 | "3 711597 UK\n",
1934 | "4 710616 UK"
1935 | ]
1936 | },
1937 | "execution_count": 42,
1938 | "metadata": {},
1939 | "output_type": "execute_result"
1940 | }
1941 | ],
1942 | "source": [
1943 | "df_countries = pd.read_csv(\"countries.csv\")\n",
1944 | "df_countries.head()"
1945 | ]
1946 | },
1947 | {
1948 | "cell_type": "code",
1949 | "execution_count": 43,
1950 | "metadata": {},
1951 | "outputs": [],
1952 | "source": [
1953 | "#merge the dataframes together\n",
1954 | "df_log_country = df_log.merge(df_countries, on=\"user_id\", how = \"left\")"
1955 | ]
1956 | },
1957 | {
1958 | "cell_type": "code",
1959 | "execution_count": 44,
1960 | "metadata": {},
1961 | "outputs": [
1962 | {
1963 | "data": {
1964 | "text/html": [
1965 | "[HTML table rendering omitted: the markup was garbled in this export; see the text/plain output below]"
2076 | ],
2077 | "text/plain": [
2078 | " user_id timestamp group landing_page converted \\\n",
2079 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n",
2080 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n",
2081 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n",
2082 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n",
2083 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n",
2084 | "\n",
2085 | " intercept control ab_page country CA UK US \n",
2086 | "0 1 1 0 US 0 0 1 \n",
2087 | "1 1 1 0 US 0 0 1 \n",
2088 | "2 1 0 1 US 0 0 1 \n",
2089 | "3 1 0 1 US 0 0 1 \n",
2090 | "4 1 1 0 US 0 0 1 "
2091 | ]
2092 | },
2093 | "execution_count": 44,
2094 | "metadata": {},
2095 | "output_type": "execute_result"
2096 | }
2097 | ],
2098 | "source": [
2099 | "df_log_country = df_log_country.join(pd.get_dummies(df_log_country['country']))\n",
2100 | "df_log_country.head()"
2101 | ]
2102 | },
2103 | {
2104 | "cell_type": "code",
2105 | "execution_count": 45,
2106 | "metadata": {},
2107 | "outputs": [
2108 | {
2109 | "name": "stdout",
2110 | "output_type": "stream",
2111 | "text": [
2112 | "Optimization terminated successfully.\n",
2113 | " Current function value: 0.366113\n",
2114 | " Iterations 6\n"
2115 | ]
2116 | },
2117 | {
2118 | "data": {
2119 | "text/html": [
2120 | "[HTML rendering of the 'Logit Regression Results' table omitted (markup garbled in this export); see the text/plain summary below]"
2161 | ],
2162 | "text/plain": [
2163 | "\n",
2164 | "\"\"\"\n",
2165 | " Logit Regression Results \n",
2166 | "==============================================================================\n",
2167 | "Dep. Variable: converted No. Observations: 290584\n",
2168 | "Model: Logit Df Residuals: 290580\n",
2169 | "Method: MLE Df Model: 3\n",
2170 | "Date: Sun, 20 Jan 2019 Pseudo R-squ.: 2.323e-05\n",
2171 | "Time: 16:32:11 Log-Likelihood: -1.0639e+05\n",
2172 | "converged: True LL-Null: -1.0639e+05\n",
2173 | " LLR p-value: 0.1760\n",
2174 | "==============================================================================\n",
2175 | " coef std err z P>|z| [0.025 0.975]\n",
2176 | "------------------------------------------------------------------------------\n",
2177 | "intercept -1.9893 0.009 -223.763 0.000 -2.007 -1.972\n",
2178 | "ab_page -0.0149 0.011 -1.307 0.191 -0.037 0.007\n",
2179 | "CA -0.0408 0.027 -1.516 0.130 -0.093 0.012\n",
2180 | "UK 0.0099 0.013 0.743 0.457 -0.016 0.036\n",
2181 | "==============================================================================\n",
2182 | "\"\"\""
2183 | ]
2184 | },
2185 | "execution_count": 45,
2186 | "metadata": {},
2187 | "output_type": "execute_result"
2188 | }
2189 | ],
2190 | "source": [
2191 | "y = df_log_country[\"converted\"]\n",
2192 | "x = df_log_country[[\"intercept\", \"ab_page\", \"CA\", \"UK\"]]\n",
2193 | "\n",
2194 | "log_mod = sm.Logit(y,x)\n",
2195 | "results = log_mod.fit()\n",
2196 | "results.summary()"
2197 | ]
2198 | },
2199 | {
2200 | "cell_type": "markdown",
2201 | "metadata": {},
2202 | "source": [
2203 | "Based on the country-features p-values we can say, that these features also doens't have a significant impact on the coversion rate. \n",
2204 | "\n",
2205 | "However, we could interpret these coefficients as follows:"
2206 | ]
2207 | },
2208 | {
2209 | "cell_type": "code",
2210 | "execution_count": 46,
2211 | "metadata": {},
2212 | "outputs": [
2213 | {
2214 | "name": "stdout",
2215 | "output_type": "stream",
2216 | "text": [
2217 | "ab_page reciprocal exponential: 1.01501155838 - A conversion is 1.015 times less likely, if a user receives the treatment page, holding all other variables constant\n",
2218 | "\n",
2219 | "CA reciprocal exponential: 1.04164375596 - A conversion is 1.042 times less likely, if the user lives in CA and not the US.\n",
2220 | "\n",
2221 | "UK exponential: 1.00994916712 - A conversion is 1.00995 times more likely, if the user lives in UK and not the US.\n"
2222 | ]
2223 | }
2224 | ],
2225 | "source": [
2226 | "print(\"ab_page reciprocal exponential: \", 1/np.exp(-0.0149), \"-\", \"A conversion is 1.015 times less likely, if a user receives the treatment page, holding all other variables constant\\n\"\n",
2227 | " \"\\nCA reciprocal exponential: \", 1/np.exp(-0.0408), \"-\", \"A conversion is 1.042 times less likely, if the user lives in CA and not the US.\\n\"\n",
2228 | " \"\\nUK exponential: \",np.exp(0.0099),\"-\", \"A conversion is 1.00995 times more likely, if the user lives in UK and not the US.\")"
2229 | ]
2230 | },
2231 | {
2232 | "cell_type": "markdown",
2233 | "metadata": {},
2234 | "source": [
2235 | "h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there significant effects on conversion. Create the necessary additional columns, and fit the new model. \n",
2236 | "\n",
2237 | "Provide the summary results, and your conclusions based on the results."
2238 | ]
2239 | },
2240 | {
2241 | "cell_type": "code",
2242 | "execution_count": 47,
2243 | "metadata": {},
2244 | "outputs": [
2245 | {
2246 | "data": {
2247 | "text/html": [
2248 | "[HTML table rendering omitted: the markup was garbled in this export; see the text/plain output below]"
2359 | ],
2360 | "text/plain": [
2361 | " user_id timestamp group landing_page converted \\\n",
2362 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n",
2363 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n",
2364 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n",
2365 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n",
2366 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n",
2367 | "\n",
2368 | " intercept control ab_page country CA UK US \n",
2369 | "0 1 1 0 US 0 0 1 \n",
2370 | "1 1 1 0 US 0 0 1 \n",
2371 | "2 1 0 1 US 0 0 1 \n",
2372 | "3 1 0 1 US 0 0 1 \n",
2373 | "4 1 1 0 US 0 0 1 "
2374 | ]
2375 | },
2376 | "execution_count": 47,
2377 | "metadata": {},
2378 | "output_type": "execute_result"
2379 | }
2380 | ],
2381 | "source": [
2382 | "df_log_country.head()"
2383 | ]
2384 | },
2385 | {
2386 | "cell_type": "code",
2387 | "execution_count": 48,
2388 | "metadata": {},
2389 | "outputs": [],
2390 | "source": [
2391 | "#create the interaction higher order term for the ab_page and country columns\n",
2392 | "df_log_country[\"CA_page\"], df_log_country[\"UK_page\"] = df_log_country[\"CA\"] * df_log_country[\"ab_page\"], df_log_country[\"UK\"] * df_log_country[\"ab_page\"]"
2393 | ]
2394 | },
2395 | {
2396 | "cell_type": "code",
2397 | "execution_count": 49,
2398 | "metadata": {},
2399 | "outputs": [
2400 | {
2401 | "data": {
2402 | "text/html": [
2403 | "[HTML table rendering omitted: the markup was garbled in this export; see the text/plain output below]"
2526 | ],
2527 | "text/plain": [
2528 | " user_id timestamp group landing_page converted \\\n",
2529 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n",
2530 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n",
2531 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n",
2532 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n",
2533 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n",
2534 | "\n",
2535 | " intercept control ab_page country CA UK US CA_page UK_page \n",
2536 | "0 1 1 0 US 0 0 1 0 0 \n",
2537 | "1 1 1 0 US 0 0 1 0 0 \n",
2538 | "2 1 0 1 US 0 0 1 0 0 \n",
2539 | "3 1 0 1 US 0 0 1 0 0 \n",
2540 | "4 1 1 0 US 0 0 1 0 0 "
2541 | ]
2542 | },
2543 | "execution_count": 49,
2544 | "metadata": {},
2545 | "output_type": "execute_result"
2546 | }
2547 | ],
2548 | "source": [
2549 | "df_log_country.head()"
2550 | ]
2551 | },
2552 | {
2553 | "cell_type": "code",
2554 | "execution_count": 50,
2555 | "metadata": {},
2556 | "outputs": [
2557 | {
2558 | "name": "stdout",
2559 | "output_type": "stream",
2560 | "text": [
2561 | "Optimization terminated successfully.\n",
2562 | " Current function value: 0.366109\n",
2563 | " Iterations 6\n"
2564 | ]
2565 | },
2566 | {
2567 | "data": {
2568 | "text/html": [
2569 | "[HTML rendering of the 'Logit Regression Results' table omitted (markup garbled in this export); see the text/plain summary below]"
2616 | ],
2617 | "text/plain": [
2618 | "\n",
2619 | "\"\"\"\n",
2620 | " Logit Regression Results \n",
2621 | "==============================================================================\n",
2622 | "Dep. Variable: converted No. Observations: 290584\n",
2623 | "Model: Logit Df Residuals: 290578\n",
2624 | "Method: MLE Df Model: 5\n",
2625 | "Date: Sun, 20 Jan 2019 Pseudo R-squ.: 3.482e-05\n",
2626 | "Time: 16:32:16 Log-Likelihood: -1.0639e+05\n",
2627 | "converged: True LL-Null: -1.0639e+05\n",
2628 | " LLR p-value: 0.1920\n",
2629 | "==============================================================================\n",
2630 | " coef std err z P>|z| [0.025 0.975]\n",
2631 | "------------------------------------------------------------------------------\n",
2632 | "intercept -1.9865 0.010 -206.344 0.000 -2.005 -1.968\n",
2633 | "ab_page -0.0206 0.014 -1.505 0.132 -0.047 0.006\n",
2634 | "CA -0.0175 0.038 -0.465 0.642 -0.091 0.056\n",
2635 | "UK -0.0057 0.019 -0.306 0.760 -0.043 0.031\n",
2636 | "CA_page -0.0469 0.054 -0.872 0.383 -0.152 0.059\n",
2637 | "UK_page 0.0314 0.027 1.181 0.238 -0.021 0.084\n",
2638 | "==============================================================================\n",
2639 | "\"\"\""
2640 | ]
2641 | },
2642 | "execution_count": 50,
2643 | "metadata": {},
2644 | "output_type": "execute_result"
2645 | }
2646 | ],
2647 | "source": [
2648 | "y = df_log_country[\"converted\"]\n",
2649 | "x = df_log_country[[\"intercept\", \"ab_page\", \"CA\", \"UK\", \"CA_page\", \"UK_page\"]]\n",
2650 | "\n",
2651 | "log_mod = sm.Logit(y,x)\n",
2652 | "results = log_mod.fit()\n",
2653 | "results.summary()"
2654 | ]
2655 | },
2656 | {
2657 | "cell_type": "markdown",
2658 | "metadata": {},
2659 | "source": [
2660 | "Based on these results, we can see that the p_values for the interaction terms are definietly not significant and even decrease the significance of the original \"CA\" and \"UK\" columns. Therefore we should not include these higher order terms in our model."
2661 | ]
2662 | },
2663 | {
2664 | "cell_type": "code",
2665 | "execution_count": 165,
2666 | "metadata": {},
2667 | "outputs": [
2668 | {
2669 | "data": {
2670 | "text/plain": [
2671 | "0"
2672 | ]
2673 | },
2674 | "execution_count": 165,
2675 | "metadata": {},
2676 | "output_type": "execute_result"
2677 | }
2678 | ],
2679 | "source": [
2680 | "from subprocess import call\n",
2681 | "call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])"
2682 | ]
2683 | },
2684 | {
2685 | "cell_type": "code",
2686 | "execution_count": null,
2687 | "metadata": {},
2688 | "outputs": [],
2689 | "source": []
2690 | }
2691 | ],
2692 | "metadata": {
2693 | "kernelspec": {
2694 | "display_name": "Python 3",
2695 | "language": "python",
2696 | "name": "python3"
2697 | },
2698 | "language_info": {
2699 | "codemirror_mode": {
2700 | "name": "ipython",
2701 | "version": 3
2702 | },
2703 | "file_extension": ".py",
2704 | "mimetype": "text/x-python",
2705 | "name": "python",
2706 | "nbconvert_exporter": "python",
2707 | "pygments_lexer": "ipython3",
2708 | "version": "3.6.3"
2709 | }
2710 | },
2711 | "nbformat": 4,
2712 | "nbformat_minor": 2
2713 | }
2714 |
--------------------------------------------------------------------------------
/P2-Analyze-A-B-Test-Results/README.md:
--------------------------------------------------------------------------------
1 | ## P2: Analyze A-B Test Results
2 |
3 | ### Prerequisites
4 |
5 | Additional installations:
6 |
7 | * None
8 |
9 | ## Project Overview
10 |
11 | ### Data Sources
12 |
13 | **Name: ab_data.csv**
14 | * Source: Udacity
15 |
16 | ### Authors
17 |
18 | * Christoph Lindstädt
19 | * Udacity
20 |
21 | ## License
22 |
23 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
24 |
25 |
26 |
27 |
28 |
--------------------------------------------------------------------------------
/P3-Analyze-Twitter-Data/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## P3: Analyze Twitter Data
3 |
4 | ### Prerequisites
5 |
6 | Additional installations:
7 |
8 | * [Missingno](https://github.com/ResidentMario/missingno)
9 | * [Tweepy](http://www.tweepy.org/)
10 | * [Requests](http://docs.python-requests.org/en/master/)
11 |
12 | ## Project Overview
13 |
14 | ### Data Sources
15 |
16 | **Name: WeRateDogs™ Twitter Archive (twitter-archive-enhanced.csv)**
17 |
18 | - Source: Udacity
19 | - Version: Latest (Download 09.02.2019)
20 | - Method of gathering: Manual download
21 |
22 | **Name: Tweet image predictions (image_predictions.tsv)**
23 |
24 | - Source: Udacity
25 | - Version: Latest (Download 09.02.2019)
26 | - Method of gathering: Programmatical download via Requests
27 |
28 | **Name: Additional Twitter data (tweet_json.txt)**
29 |
30 | - Source: WeRateDogs™
31 | - Version: Latest (Gathered 09.02.2019)
32 | - Method of gathering: API via Tweepy
33 |
34 | ### Wrangling
35 |
36 | **Cleaning steps** (a small pandas sketch of two of these steps follows the list):
37 | 
38 | - Merge the tables together
39 | - Drop the replies, retweets and their corresponding columns, and also drop the tweets without an image or with images that don't show doggos
40 | - Clean the datatypes of the columns
41 | - Clean the wrong rating numerators: replace the ones that are actually floats, and drop the ones where the rating pattern occurs multiple times
42 | - Extract the tweet source from the HTML code
43 | - Split the text range into two separate columns
44 | - Remove the "None" values from the doggo, floofer, pupper and puppo columns and merge them into one column
45 | - Remove the incorrect names from the name column
46 | - Reduce the prediction columns to two: breed and conf
47 | - Clean the new breed column by replacing "_" with a whitespace and converting all values to lowercase
48 |
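A minimal pandas sketch of two of these steps (extracting the tweet source and merging the dog-stage columns), assuming the column names of `twitter-archive-enhanced.csv`; the notebook's actual implementation may differ in the details:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("twitter-archive-enhanced.csv")

# extract the readable source from the HTML anchor tag,
# e.g. '<a href="...">Twitter for iPhone</a>' -> 'Twitter for iPhone'
df["source"] = df["source"].str.extract(r">(.+?)<", expand=False)

# merge the four dog-stage columns into one, treating "None" as missing
stages = ["doggo", "floofer", "pupper", "puppo"]
df[stages] = df[stages].replace("None", np.nan)
df["dog_stage"] = df[stages].apply(
    lambda row: ",".join(row.dropna()) if row.notna().any() else np.nan,
    axis=1,
)
df = df.drop(columns=stages)
```
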
49 | ### Summary
50 |
51 | **Questions:**
52 |
53 | > Based on the predicted, most likely dog breed: Which breed gets retweeted and favorited the most overall?
54 | - The winner in our analysis was the labrador retriever.
55 | > How did the account develop (in terms of the number of tweets, retweets, favorites, images and the length of the tweets)?
56 | - We found that the number of tweets per month decreased, while retweets and favorites show an uptrend. There is no clear trend in the number of images, and the length of the tweets moved a bit closer to the maximum of 130 in the second half of the dataset.
57 | > Is there a pattern visible in the timing of the tweets?
58 | - There are nearly no tweets at all between 5 and 15 o'clock. Most tweets are posted between 0 and 4 o'clock and then again between 15 and 23 o'clock.
59 |
60 | ### Authors
61 |
62 | * Christoph Lindstädt
63 | * Udacity
64 |
65 | ## License
66 |
67 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
68 |
69 |
70 |
71 |
72 |
73 |
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Data/README.md:
--------------------------------------------------------------------------------
1 | # Data and License
2 |
3 | The Data can be downloaded here.
4 |
5 | **License:** Ford GoBike Data License Agreement
6 |
7 | You can also run "gather_goBike_data.py" to request the data.
8 |
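A small illustrative sketch (not part of the original project files) of how the downloaded monthly archives could be combined into the single `result.csv` referenced in the project README; the file names follow the pattern used by `gather_goBike_data.py`:

```python
from glob import glob

import pandas as pd

# pandas reads a zip archive containing a single CSV directly
files = sorted(glob("*-fordgobike-tripdata.csv.zip"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True, sort=False)
df.to_csv("result.csv", index=False)
```
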
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Data/gather_goBike_data.py:
--------------------------------------------------------------------------------
1 | import requests
2 |
3 | # define the year-month codes to gather (2018-01 .. 2018-12 and 2019-01 .. 2019-02)
4 | year_data = [x for x in range(201801, 201813)] + [x for x in range(201901, 201903)]
5 |
6 | #loop over all files
7 | for year in year_data:
8 |
9 | # build the download URL for the given year-month code
10 | url = f"https://s3.amazonaws.com/fordgobike-data/{year}-fordgobike-tripdata.csv.zip"
11 |
12 | #request the url
13 | response = requests.get(url)
14 |
15 | # open a file with the same name as the downloaded one and write the content to it
16 | with open(f"{year}-fordgobike-tripdata.csv.zip", mode = "wb") as file:
17 | file.write(response.content)
18 |
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Images/README.md:
--------------------------------------------------------------------------------
1 | Source: kepler.gl
2 |
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Images/east_bay_500.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/east_bay_500.PNG
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Images/san_francisco_1000.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/san_francisco_1000.PNG
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Images/san_jose_200.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/san_jose_200.PNG
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Images/stations_1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/stations_1.PNG
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Images/stations_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/stations_2.png
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/Images/stations_kepler.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/stations_kepler.png
--------------------------------------------------------------------------------
/P4-Communicate-Data-Findings/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## P4: Communicate Data Findings (Ford GoBike Data)
3 |
4 | ### Prerequisites
5 |
6 | Additional installations:
7 |
8 | * [Missingno](https://github.com/ResidentMario/missingno)
9 |
10 | ## Project Overview
11 |
12 | ### Data Sources
13 |
14 | **Name:** result.csv
15 | * Definition: Ford GoBike System - Data
16 | * Source: https://www.fordgobike.com/system-data
17 | * Version: Files from 01.2018 - 02.2019
18 |
19 | ### Wrangling
20 |
21 | ### Analysis
22 |
23 | ### Summary
24 |
25 | ### Authors
26 |
27 | * Christoph Lindstädt
28 | * Udacity
29 |
30 | ## License
31 |
32 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
33 |
34 |
35 |
36 |
37 |
38 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # [Udacity Data Analyst Nanodegree](https://www.udacity.com/course/data-analyst-nanodegree--nd002)
2 |
3 | > Discover insights from data via Python and SQL.
4 |
5 | ## Skills Acquired (Summary)
6 |
7 |
8 | ### Prerequisites
9 |
10 | You'll need to install:
11 |
12 | * [Python (3.x or higher)](https://www.python.org/downloads/)
13 | * [Jupyter Notebook](https://jupyter.org/)
14 | * [Numpy](http://www.numpy.org/)
15 | * [Pandas](http://pandas.pydata.org/)
16 | * [Matplotlib](https://matplotlib.org/)
17 | * [Seaborn](https://seaborn.pydata.org/)
18 |
19 | And additional libraries defined in each project.
20 |
21 | Recommended:
22 |
23 | * [Anaconda](https://www.anaconda.com/distribution/#download-section)
24 |
25 | ## Project Overview
26 | ### P0: Explore Weather Trends
27 |
28 | The first chapter served as an introduction to the projects that follow in the Data Analyst Nanodegree.
29 |
30 | The first chapter's project was about weather trends - it required applying (at least) the following steps:
31 | * Extract data from a database using a SQL query
32 | * Calculate a moving average
33 | * Create a line chart
34 |
35 | I analyzed local and global temperature data and compared the temperature trends in three German cities to the overall global trend. After cleaning the data, I created a function that handles all the tasks needed to plot the data - for example, calculating the linear trend and the rolling average. The function also offers various visualization options to produce different graphs.
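A minimal sketch of what such a plotting helper could look like (the column names `year`, `avg_temp` and `city`, as well as the window size, are assumptions rather than the exact ones used in the notebook):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_trend(df, city=None, window=10):
    """Plot a moving average of yearly temperatures plus a simple linear trend."""
    data = df if city is None else df[df["city"] == city]
    years, temps = data["year"], data["avg_temp"]

    rolling = temps.rolling(window=window).mean()        # smooth out year-to-year noise
    slope, intercept = np.polyfit(years, temps, deg=1)   # simple linear regression

    plt.plot(years, rolling, label=f"{window}-year moving average")
    plt.plot(years, slope * years + intercept, "--",
             label=f"linear trend ({slope:.4f} °C/year)")
    plt.xlabel("Year")
    plt.ylabel("Avg. temperature (°C)")
    plt.legend()
    plt.show()

# e.g. plot_trend(temperature_df, city="Berlin")  # temperature_df is a hypothetical DataFrame
```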
36 |
37 | **Key findings**:
38 | - the average global temperature is increasing, and the rate of increase itself is growing
39 | - Berlin is the only German city in this dataset with a higher average temperature than the global average
40 |
41 | 
42 |
43 | ### P1: Investigate a Dataset (Gapminder World Dataset)
44 |
45 | This chapter covered the data analysis process as a whole: everything from gathering, assessing, cleaning and wrangling data to exploring and visualizing it, along with the programming workflow and the communication of results.
46 |
47 | The project therefore covered all steps of the typical data analysis process:
48 | - posing questions
49 | - gathering, wrangling and cleaning data
50 | - communicating answers to the questions,
51 |   assisted by visualizations and statistics
52 |
53 | From the project itself:
54 |
55 | > This project will examine datasets available at Gapminder. To be more specific, it will take a closer look at the life expectancy of the population of different countries and the influence of other variables. It will also take a look at how these variables develop over time.
56 | >
57 | >**What is Gapminder?**
58 | "Gapminder is an independent Swedish foundation with no political, religious or economic affiliations. Gapminder is a fact tank, not a think tank. Gapminder fights devastating misconceptions about global development." (https://www.gapminder.org/about-gapminder/)
59 |
60 | Here we were confronted with the full joy of a real-life dataset: a hard-to-analyze structure and missing, messy, dirty data - and, after finally being done with the data wrangling, the reward of interesting insights.
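As a hedged illustration of the kind of bivariate view shown below, here is a sketch of plotting life expectancy against income for a single year; the file and column names are placeholders, not the ones actually used in the notebook:

```python
import pandas as pd
import matplotlib.pyplot as plt

gap = pd.read_csv("gapminder_2018.csv")       # hypothetical pre-wrangled file

plt.scatter(gap["income_per_person"], gap["life_expectancy"], alpha=0.5)
plt.xscale("log")                             # income is strongly right-skewed
plt.xlabel("Income per person (log scale)")
plt.ylabel("Life expectancy (years)")
plt.title("Life expectancy vs. income, 2018")
plt.show()
```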
61 |
62 | 
63 |
64 | ### P2: Analyze A/B Test Results
65 |
66 | The following chapter was filled with *a lot* of information. We talked about: Data Types, Notation, Mean, Standard Deviation, Correlation, Data Shapes, Outliers, Bias, Dangers, Probability and Bayes' Rule, Distributions, the Central Limit Theorem, Bootstrapping, Confidence Intervals, Hypothesis Testing, A/B Tests, Linear Regression, Logistic Regression and more... *heavy breathing*
67 |
68 | The goal of the project in this chapter was to gain experience with A/B testing and its difficulties and drawbacks. First, we learned what A/B testing is all about - including metrics like the Click-Through Rate (CTR) and how to analyze these metrics properly. Second, we learned about drawbacks such as the novelty effect and change aversion.
69 |
70 | In the end, we brought everything we had learned together to analyze this A/B test properly.
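For illustration, here is a hedged sketch of simulating a sampling distribution for the difference in conversion rates under the null hypothesis; the column names `group` and `converted` are assumptions about the experiment data, and the notebook's exact approach may differ:

```python
import numpy as np

def simulate_null_diffs(df, n_iterations=10_000):
    """Sampling distribution of the difference in conversion rates under the null
    hypothesis that both groups convert at the same (pooled) rate."""
    n_treat = (df["group"] == "treatment").sum()
    n_ctrl = (df["group"] == "control").sum()
    p_pooled = df["converted"].mean()          # pooled conversion rate under H0

    sim_treat = np.random.binomial(n_treat, p_pooled, n_iterations) / n_treat
    sim_ctrl = np.random.binomial(n_ctrl, p_pooled, n_iterations) / n_ctrl
    return sim_treat - sim_ctrl

# p-value: share of simulated differences at least as extreme as the observed one
# diffs = simulate_null_diffs(df)
# p_value = (diffs >= observed_diff).mean()
```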
71 |
72 | 
73 |
74 | ### P3: Gather, Clean and Analyze Twitter Data (WeRateDogs™ (@dog_rates))
75 |
76 | This chapter was a deep dive into the data wrangling part of the data analysis process. We learned about the difference between messy and dirty data, what tidy data should look like, the assessing, defining, cleaning and testing process, and more. Moreover, we talked about many different file types and different methods of gathering data.
77 |
78 | In this project we had to deal with the reality of dirty and messy data (again). We gathered data from different sources (for example, the Twitter API) and identified tidiness and quality issues in the dataset. Afterwards, we resolved these issues while documenting each step. The last part of the project then focused on exploring the data.
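A hedged sketch of how the gathered API output could be turned into an analysis-ready DataFrame, assuming the tweets were stored as one JSON object per line; the file name and the selected fields are illustrative:

```python
import json
import pandas as pd

rows = []
with open("tweet_json.txt") as f:              # line-delimited tweet JSON (illustrative name)
    for line in f:
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "created_at": tweet["created_at"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })

api_df = pd.DataFrame(rows)
api_df["created_at"] = pd.to_datetime(api_df["created_at"])

# e.g. mean retweet count per month/year combination, as in the chart below
retweets_by_month = (api_df
                     .groupby(api_df["created_at"].dt.to_period("M"))["retweet_count"]
                     .mean())
```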
79 |
80 | 
81 |
82 | ### P4: Communicate Data Findings
83 |
84 | The final chapter focused on the proper visualization of data. We learned about chart junk, uni-, bi- and multivariate visualization, the use of color, the data-ink ratio, the lie factor, other encodings, [...].
85 |
86 | The task of the final project was to analyze and visualize real-world data. I chose the Ford GoBike dataset.
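As a hedged illustration of the kind of comparison shown below, here is a sketch using seaborn; `result.csv` and both column names are assumptions about the wrangled dataset, and the actual chart plots relative frequencies rather than raw counts:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rides = pd.read_csv("result.csv")              # combined GoBike trip data (assumed file name)

# Trip counts per area, split by rider gender (column names are assumptions)
sns.countplot(data=rides, x="area", hue="member_gender")
plt.ylabel("Number of trips")
plt.title("Trips by area and rider gender")
plt.show()
```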
87 |
88 | 
89 |
90 | ## License
91 |
92 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
93 |
94 |
95 |
96 |
97 |
--------------------------------------------------------------------------------
/global_weather_trend.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/global_weather_trend.png
--------------------------------------------------------------------------------
/life_expectancy_to_income_2018.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/life_expectancy_to_income_2018.png
--------------------------------------------------------------------------------
/mean_of_retweets_per_month-year_combination.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/mean_of_retweets_per_month-year_combination.png
--------------------------------------------------------------------------------
/rel_userfreq_by_gender_and_area.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/rel_userfreq_by_gender_and_area.png
--------------------------------------------------------------------------------
/sampling_dist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/sampling_dist.png
--------------------------------------------------------------------------------