├── P0-Explore-Weather-Trends ├── Explore Weather Trends.ipynb ├── README.md ├── city_data.csv ├── city_list.csv └── global_data.csv ├── P1-Investigate-A-Dataset ├── Investigate Gapminder Data.html ├── Investigate Gapminder Data.ipynb └── README.md ├── P2-Analyze-A-B-Test-Results ├── Analyze_ab_test_results_notebook.html ├── Analyze_ab_test_results_notebook.ipynb ├── README.md ├── ab_data.csv └── countries.csv ├── P3-Analyze-Twitter-Data ├── README.md ├── image_predictions.tsv ├── tweet_json.txt ├── twitter-archive-enhanced.csv ├── twitter_archive_master.csv └── wrangle_act.ipynb ├── P4-Communicate-Data-Findings ├── Communicate Data Findings.html ├── Communicate Data Findings.ipynb ├── Data │ ├── README.md │ └── gather_goBike_data.py ├── Explanatory Visualization.html ├── Explanatory Visualization.ipynb ├── Explanatory Visualization.slides.html ├── Images │ ├── README.md │ ├── east_bay_500.PNG │ ├── san_francisco_1000.PNG │ ├── san_jose_200.PNG │ ├── stations_1.PNG │ ├── stations_2.png │ └── stations_kepler.png └── README.md ├── README.md ├── global_weather_trend.png ├── life_expectancy_to_income_2018.png ├── mean_of_retweets_per_month-year_combination.png ├── rel_userfreq_by_gender_and_area.png └── sampling_dist.png /P0-Explore-Weather-Trends/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## P0: Explore Weather Trends 3 | 4 | ### Prerequisites 5 | 6 | Additional installations: 7 | 8 | * [Missingno](https://github.com/ResidentMario/missingno) 9 | * [sklearn](https://scikit-learn.org/) 10 | 11 | ## Project Overview 12 | 13 | This first project required the following steps: 14 | * Extract data from a database using a SQL query 15 | * Calculate a moving average 16 | * Create a line chart 17 | 18 | I analyzed local and global temperature data and compared the temperature trends in three german cities to overall global temperature trends. After some data cleaning I created a function to assist the data processing and visualization with some options to play around with. This included also the calculation of a simple linear regression to visualize trends. 19 | 20 | ![Global Weather Trend](https://github.com/DataLind/Udacity-Data-Analyst-Nanodegree/blob/master/global_weather_trend.png) 21 | 22 | ### Data Sources 23 | 24 | **Name:** city_data.csv 25 | * Definition: Overall city temperature data 26 | * Source: Udacity 27 | * Version: 1 28 | * Method of gathering: SQL 29 | 30 | **Name:** global_data.csv 31 | * Definition: Global temperature data 32 | * Source: Udacity 33 | * Version: 1 34 | * Method of gathering: SQL 35 | 36 | **Name:** city_list.csv 37 | * Definition: List of cities in this dataset 38 | * Source: Udacity 39 | * Version: 1 40 | * Method of gathering: SQL 41 | 42 | ### Wrangling 43 | * Cut data to years 1750 - 2013 44 | 45 | ### Summary 46 | 47 | > To conclude, there is a clear overall uptrend visible, what means, that the average global temperature is increasing, with an also increasing tempo. 
48 | 49 | The german cities Hamburg, Berlin and Munich got compared to the global data (1750 - 2013): 50 | 51 | - The slope of the global trend is higher than compared to the german cities, so the global average temperature is increasing faster (looking at this long time period) 52 | - Berlin has the highest average temperature among the german cities, making Berlin the only city that has a higher average temperature than the global 53 | - Hamburg is the closest to the global average temperature, while Munich has the lowest average temperature, but also the highest correlation to the global data compared to the other two german cities 54 | 55 | ### Authors 56 | 57 | * Christoph Lindstädt 58 | * Udacity 59 | 60 | ## License 61 | 62 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License 63 | 64 | 65 | Creative Commons License 66 | 67 | -------------------------------------------------------------------------------- /P0-Explore-Weather-Trends/city_list.csv: -------------------------------------------------------------------------------- 1 | city,country 2 | Abidjan,Côte D'Ivoire 3 | Abu Dhabi,United Arab Emirates 4 | Abuja,Nigeria 5 | Accra,Ghana 6 | Adana,Turkey 7 | Adelaide,Australia 8 | Agra,India 9 | Ahmadabad,India 10 | Albuquerque,United States 11 | Alexandria,Egypt 12 | Alexandria,United States 13 | Algiers,Algeria 14 | Allahabad,India 15 | Almaty,Kazakhstan 16 | Amritsar,India 17 | Amsterdam,Netherlands 18 | Ankara,Turkey 19 | Anshan,China 20 | Antananarivo,Madagascar 21 | Arlington,United States 22 | Asmara,Eritrea 23 | Astana,Kazakhstan 24 | Athens,Greece 25 | Atlanta,United States 26 | Austin,United States 27 | Baghdad,Iraq 28 | Baku,Azerbaijan 29 | Baltimore,United States 30 | Bamako,Mali 31 | Bandung,Indonesia 32 | Bangalore,India 33 | Bangkok,Thailand 34 | Bangui,Central African Republic 35 | Barcelona,Spain 36 | Barcelona,Venezuela 37 | Barquisimeto,Venezuela 38 | Barranquilla,Colombia 39 | Beirut,Lebanon 40 | Belfast,United Kingdom 41 | Belgrade,Serbia 42 | Belo Horizonte,Brazil 43 | Benghazi,Libya 44 | Berlin,Germany 45 | Bern,Switzerland 46 | Bhopal,India 47 | Birmingham,United Kingdom 48 | Birmingham,United States 49 | Bissau,Guinea Bissau 50 | Boston,United States 51 | Bratislava,Slovakia 52 | Brazzaville,Congo 53 | Brisbane,Australia 54 | Brussels,Belgium 55 | Bucharest,Romania 56 | Budapest,Hungary 57 | Bujumbura,Burundi 58 | Bursa,Turkey 59 | Cairo,Egypt 60 | Cali,Colombia 61 | Campinas,Brazil 62 | Canberra,Australia 63 | Caracas,Venezuela 64 | Cardiff,United Kingdom 65 | Casablanca,Morocco 66 | Changchun,China 67 | Changzhou,China 68 | Charlotte,United States 69 | Chelyabinsk,Russia 70 | Chengdu,China 71 | Chicago,United States 72 | Chisinau,Moldova 73 | Colombo,Brazil 74 | Colombo,Sri Lanka 75 | Colorado Springs,United States 76 | Columbus,United States 77 | Conakry,Guinea 78 | Copenhagen,Denmark 79 | Cordoba,Argentina 80 | Curitiba,Brazil 81 | Dakar,Senegal 82 | Dalian,China 83 | Dallas,United States 84 | Damascus,Syria 85 | Dar Es Salaam,Tanzania 86 | Datong,China 87 | Delhi,India 88 | Denver,United States 89 | Detroit,United States 90 | Dhaka,Bangladesh 91 | Doha,Qatar 92 | Douala,Cameroon 93 | Dublin,Ireland 94 | Durban,South Africa 95 | Dushanbe,Tajikistan 96 | Ecatepec,Mexico 97 | Edinburgh,United Kingdom 98 | El Paso,United States 99 | Faisalabad,Pakistan 100 | Fort Worth,United States 101 | Fortaleza,Brazil 102 | Foshan,China 103 | Freetown,Sierra Leone 104 | Fresno,United States 105 | Fuzhou,China 106 | Gaborone,Botswana 107 | 
Georgetown,Guyana 108 | Guadalajara,Mexico 109 | Guangzhou,China 110 | Guarulhos,Brazil 111 | Guatemala City,Guatemala 112 | Guayaquil,Ecuador 113 | Guiyang,China 114 | Gujranwala,Pakistan 115 | Hamburg,Germany 116 | Handan,China 117 | Hangzhou,China 118 | Hanoi,Vietnam 119 | Haora,India 120 | Harare,Zimbabwe 121 | Harbin,China 122 | Hefei,China 123 | Helsinki,Finland 124 | Hiroshima,Japan 125 | Ho Chi Minh City,Vietnam 126 | Houston,United States 127 | Hyderabad,India 128 | Hyderabad,Pakistan 129 | Ibadan,Nigeria 130 | Indianapolis,United States 131 | Indore,India 132 | Irbil,Iraq 133 | Islamabad,Pakistan 134 | Istanbul,Turkey 135 | Izmir,Turkey 136 | Jacksonville,United States 137 | Jaipur,India 138 | Jakarta,Indonesia 139 | Jilin,China 140 | Jinan,China 141 | Johannesburg,South Africa 142 | Juba,Sudan 143 | Kabul,Afghanistan 144 | Kaduna,Nigeria 145 | Kampala,Uganda 146 | Kano,Nigeria 147 | Kanpur,India 148 | Kansas City,United States 149 | Kaohsiung,Taiwan 150 | Karachi,Pakistan 151 | Kathmandu,Nepal 152 | Kawasaki,Japan 153 | Kazan,Russia 154 | Khartoum,Sudan 155 | Khulna,Bangladesh 156 | Kiev,Ukraine 157 | Kigali,Rwanda 158 | Kingston,Canada 159 | Kingston,Jamaica 160 | Kinshasa,Congo (Democratic Republic Of The) 161 | Kitakyushu,Japan 162 | Kobe,Japan 163 | Kuala Lumpur,Malaysia 164 | Kunming,China 165 | La Paz,Bolivia 166 | La Paz,Mexico 167 | Lagos,Nigeria 168 | Lahore,Pakistan 169 | Lanzhou,China 170 | Las Vegas,United States 171 | Libreville,Gabon 172 | Lilongwe,Malawi 173 | Lima,Peru 174 | Lisbon,Portugal 175 | Ljubljana,Slovenia 176 | London,Canada 177 | London,United Kingdom 178 | Long Beach,United States 179 | Los Angeles,Chile 180 | Los Angeles,United States 181 | Louisville,United States 182 | Luanda,Angola 183 | Lubumbashi,Congo (Democratic Republic Of The) 184 | Ludhiana,India 185 | Luoyang,China 186 | Lusaka,Zambia 187 | Madrid,Spain 188 | Maiduguri,Nigeria 189 | Malabo,Equatorial Guinea 190 | Managua,Nicaragua 191 | Manama,Bahrain 192 | Manaus,Brazil 193 | Manila,Philippines 194 | Maputo,Mozambique 195 | Maracaibo,Venezuela 196 | Maseru,Lesotho 197 | Mashhad,Iran 198 | Mecca,Saudi Arabia 199 | Medan,Indonesia 200 | Melbourne,Australia 201 | Memphis,United States 202 | Mesa,United States 203 | Mexicali,Mexico 204 | Miami,United States 205 | Milan,Italy 206 | Milwaukee,United States 207 | Minneapolis,United States 208 | Minsk,Belarus 209 | Mogadishu,Somalia 210 | Monrovia,Liberia 211 | Monterrey,Mexico 212 | Montevideo,Uruguay 213 | Montreal,Canada 214 | Moscow,Russia 215 | Multan,Pakistan 216 | Munich,Germany 217 | Nagoya,Japan 218 | Nagpur,India 219 | Nairobi,Kenya 220 | Nanchang,China 221 | Nanjing,China 222 | Nanning,China 223 | Nashville,United States 224 | Nassau,Bahamas 225 | New Delhi,India 226 | New Orleans,United States 227 | New York,United States 228 | Niamey,Niger 229 | Nouakchott,Mauritania 230 | Novosibirsk,Russia 231 | Oakland,United States 232 | Oklahoma City,United States 233 | Omaha,United States 234 | Omsk,Russia 235 | Oslo,Norway 236 | Ottawa,Canada 237 | Ouagadougou,Burkina Faso 238 | Palembang,Indonesia 239 | Paramaribo,Suriname 240 | Paris,France 241 | Patna,India 242 | Perm,Russia 243 | Perth,Australia 244 | Peshawar,Pakistan 245 | Philadelphia,United States 246 | Phoenix,United States 247 | Podgorica,Montenegro 248 | Port Au Prince,Haiti 249 | Port Harcourt,Nigeria 250 | Port Louis,Mauritius 251 | Port Moresby,Papua New Guinea 252 | Portland,United States 253 | Porto Alegre,Brazil 254 | Prague,Czech Republic 255 | Pretoria,South Africa 256 | 
Pristina,Serbia 257 | Puebla,Mexico 258 | Pune,India 259 | Qingdao,China 260 | Qiqihar,China 261 | Quito,Ecuador 262 | Rabat,Morocco 263 | Rajkot,India 264 | Raleigh,United States 265 | Ranchi,India 266 | Rawalpindi,Pakistan 267 | Recife,Brazil 268 | Riga,Latvia 269 | Rio De Janeiro,Brazil 270 | Riyadh,Saudi Arabia 271 | Rome,Italy 272 | Rosario,Argentina 273 | Sacramento,United States 274 | Salvador,Brazil 275 | Samara,Russia 276 | San Antonio,United States 277 | San Diego,United States 278 | San Francisco,United States 279 | San Jose,United States 280 | San Salvador,El Salvador 281 | Santa Cruz,Philippines 282 | Santiago,Chile 283 | Santiago,Dominican Republic 284 | Santiago,Philippines 285 | Santo Domingo,Dominican Republic 286 | Santo Domingo,Ecuador 287 | Sarajevo,Bosnia And Herzegovina 288 | Seattle,United States 289 | Semarang,Indonesia 290 | Seoul,South Korea 291 | Shanghai,China 292 | Shenyang,China 293 | Shenzhen,Hong Kong 294 | Shiraz,Iran 295 | Singapore,Singapore 296 | Skopje,Macedonia 297 | Sofia,Bulgaria 298 | Soweto,South Africa 299 | Stockholm,Sweden 300 | Surabaya,Indonesia 301 | Surat,India 302 | Suzhou,China 303 | Sydney,Australia 304 | Tabriz,Iran 305 | Taichung,Taiwan 306 | Taipei,Taiwan 307 | Taiyuan,China 308 | Tallinn,Estonia 309 | Tangshan,China 310 | Tashkent,Uzbekistan 311 | Tbilisi,Georgia 312 | Tegucigalpa,Honduras 313 | Tianjin,China 314 | Tijuana,Mexico 315 | Tirana,Albania 316 | Tokyo,Japan 317 | Toronto,Canada 318 | Tripoli,Libya 319 | Tucson,United States 320 | Tulsa,United States 321 | Tunis,Tunisia 322 | Ufa,Russia 323 | Ulaanbaatar,Mongolia 324 | Vadodara,India 325 | Valencia,Spain 326 | Valencia,Venezuela 327 | Varanasi,India 328 | Victoria,Canada 329 | Vienna,Austria 330 | Vientiane,Laos 331 | Vilnius,Lithuania 332 | Virginia Beach,United States 333 | Volgograd,Russia 334 | Warsaw,Poland 335 | Washington,United States 336 | Wellington,New Zealand 337 | Wichita,United States 338 | Windhoek,Namibia 339 | Wuhan,China 340 | Wuxi,China 341 | Xian,China 342 | Xuzhou,China 343 | Yamoussoukro,Côte D'Ivoire 344 | Yerevan,Armenia 345 | Zagreb,Croatia 346 | Zapopan,Mexico 347 | -------------------------------------------------------------------------------- /P0-Explore-Weather-Trends/global_data.csv: -------------------------------------------------------------------------------- 1 | year,avg_temp 2 | 1750,8.72 3 | 1751,7.98 4 | 1752,5.78 5 | 1753,8.39 6 | 1754,8.47 7 | 1755,8.36 8 | 1756,8.85 9 | 1757,9.02 10 | 1758,6.74 11 | 1759,7.99 12 | 1760,7.19 13 | 1761,8.77 14 | 1762,8.61 15 | 1763,7.50 16 | 1764,8.40 17 | 1765,8.25 18 | 1766,8.41 19 | 1767,8.22 20 | 1768,6.78 21 | 1769,7.69 22 | 1770,7.69 23 | 1771,7.85 24 | 1772,8.19 25 | 1773,8.22 26 | 1774,8.77 27 | 1775,9.18 28 | 1776,8.30 29 | 1777,8.26 30 | 1778,8.54 31 | 1779,8.98 32 | 1780,9.43 33 | 1781,8.10 34 | 1782,7.90 35 | 1783,7.68 36 | 1784,7.86 37 | 1785,7.36 38 | 1786,8.26 39 | 1787,8.03 40 | 1788,8.45 41 | 1789,8.33 42 | 1790,7.98 43 | 1791,8.23 44 | 1792,8.09 45 | 1793,8.23 46 | 1794,8.53 47 | 1795,8.35 48 | 1796,8.27 49 | 1797,8.51 50 | 1798,8.67 51 | 1799,8.51 52 | 1800,8.48 53 | 1801,8.59 54 | 1802,8.58 55 | 1803,8.50 56 | 1804,8.84 57 | 1805,8.56 58 | 1806,8.43 59 | 1807,8.28 60 | 1808,7.63 61 | 1809,7.08 62 | 1810,6.92 63 | 1811,6.86 64 | 1812,7.05 65 | 1813,7.74 66 | 1814,7.59 67 | 1815,7.24 68 | 1816,6.94 69 | 1817,6.98 70 | 1818,7.83 71 | 1819,7.37 72 | 1820,7.62 73 | 1821,8.09 74 | 1822,8.19 75 | 1823,7.72 76 | 1824,8.55 77 | 1825,8.39 78 | 1826,8.36 79 | 1827,8.81 80 | 1828,8.17 81 | 
1829,7.94 82 | 1830,8.52 83 | 1831,7.64 84 | 1832,7.45 85 | 1833,8.01 86 | 1834,8.15 87 | 1835,7.39 88 | 1836,7.70 89 | 1837,7.38 90 | 1838,7.51 91 | 1839,7.63 92 | 1840,7.80 93 | 1841,7.69 94 | 1842,8.02 95 | 1843,8.17 96 | 1844,7.65 97 | 1845,7.85 98 | 1846,8.55 99 | 1847,8.09 100 | 1848,7.98 101 | 1849,7.98 102 | 1850,7.90 103 | 1851,8.18 104 | 1852,8.10 105 | 1853,8.04 106 | 1854,8.21 107 | 1855,8.11 108 | 1856,8.00 109 | 1857,7.76 110 | 1858,8.10 111 | 1859,8.25 112 | 1860,7.96 113 | 1861,7.85 114 | 1862,7.56 115 | 1863,8.11 116 | 1864,7.98 117 | 1865,8.18 118 | 1866,8.29 119 | 1867,8.44 120 | 1868,8.25 121 | 1869,8.43 122 | 1870,8.20 123 | 1871,8.12 124 | 1872,8.19 125 | 1873,8.35 126 | 1874,8.43 127 | 1875,7.86 128 | 1876,8.08 129 | 1877,8.54 130 | 1878,8.83 131 | 1879,8.17 132 | 1880,8.12 133 | 1881,8.27 134 | 1882,8.13 135 | 1883,7.98 136 | 1884,7.77 137 | 1885,7.92 138 | 1886,7.95 139 | 1887,7.91 140 | 1888,8.09 141 | 1889,8.32 142 | 1890,7.97 143 | 1891,8.02 144 | 1892,8.07 145 | 1893,8.06 146 | 1894,8.16 147 | 1895,8.15 148 | 1896,8.21 149 | 1897,8.29 150 | 1898,8.18 151 | 1899,8.40 152 | 1900,8.50 153 | 1901,8.54 154 | 1902,8.30 155 | 1903,8.22 156 | 1904,8.09 157 | 1905,8.23 158 | 1906,8.38 159 | 1907,7.95 160 | 1908,8.19 161 | 1909,8.18 162 | 1910,8.22 163 | 1911,8.18 164 | 1912,8.17 165 | 1913,8.30 166 | 1914,8.59 167 | 1915,8.59 168 | 1916,8.23 169 | 1917,8.02 170 | 1918,8.13 171 | 1919,8.38 172 | 1920,8.36 173 | 1921,8.57 174 | 1922,8.41 175 | 1923,8.42 176 | 1924,8.51 177 | 1925,8.53 178 | 1926,8.73 179 | 1927,8.52 180 | 1928,8.63 181 | 1929,8.24 182 | 1930,8.63 183 | 1931,8.72 184 | 1932,8.71 185 | 1933,8.34 186 | 1934,8.63 187 | 1935,8.52 188 | 1936,8.55 189 | 1937,8.70 190 | 1938,8.86 191 | 1939,8.76 192 | 1940,8.76 193 | 1941,8.77 194 | 1942,8.73 195 | 1943,8.76 196 | 1944,8.85 197 | 1945,8.58 198 | 1946,8.68 199 | 1947,8.80 200 | 1948,8.75 201 | 1949,8.59 202 | 1950,8.37 203 | 1951,8.63 204 | 1952,8.64 205 | 1953,8.87 206 | 1954,8.56 207 | 1955,8.63 208 | 1956,8.28 209 | 1957,8.73 210 | 1958,8.77 211 | 1959,8.73 212 | 1960,8.58 213 | 1961,8.80 214 | 1962,8.75 215 | 1963,8.86 216 | 1964,8.41 217 | 1965,8.53 218 | 1966,8.60 219 | 1967,8.70 220 | 1968,8.52 221 | 1969,8.60 222 | 1970,8.70 223 | 1971,8.60 224 | 1972,8.50 225 | 1973,8.95 226 | 1974,8.47 227 | 1975,8.74 228 | 1976,8.35 229 | 1977,8.85 230 | 1978,8.69 231 | 1979,8.73 232 | 1980,8.98 233 | 1981,9.17 234 | 1982,8.64 235 | 1983,9.03 236 | 1984,8.69 237 | 1985,8.66 238 | 1986,8.83 239 | 1987,8.99 240 | 1988,9.20 241 | 1989,8.92 242 | 1990,9.23 243 | 1991,9.18 244 | 1992,8.84 245 | 1993,8.87 246 | 1994,9.04 247 | 1995,9.35 248 | 1996,9.04 249 | 1997,9.20 250 | 1998,9.52 251 | 1999,9.29 252 | 2000,9.20 253 | 2001,9.41 254 | 2002,9.57 255 | 2003,9.53 256 | 2004,9.32 257 | 2005,9.70 258 | 2006,9.53 259 | 2007,9.73 260 | 2008,9.43 261 | 2009,9.51 262 | 2010,9.70 263 | 2011,9.52 264 | 2012,9.51 265 | 2013,9.61 266 | 2014,9.57 267 | 2015,9.83 268 | -------------------------------------------------------------------------------- /P1-Investigate-A-Dataset/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## P1: Investigate A Dataset (Gapminder World Dataset) 3 | 4 | ### Prerequisites 5 | 6 | Additional installations: 7 | 8 | * [Missingno](https://github.com/ResidentMario/missingno) 9 | * [pycountry](https://bitbucket.org/flyingcircus/pycountry) 10 | * [pycountry-convert](https://pycountry-convert.readthedocs.io/en/latest/) 11 | 12 | ## Project Overview 13 | 14 | ### Data 
Sources 15 | 16 | **Name: Population total** 17 | 18 | - Definition: Total population 19 | - Source: http://gapm.io/dpop 20 | - Version: 5 21 | 22 | **Name: Life expectancy (years)** 23 | 24 | - Definition: The average number of years a newborn child would live if current mortality patterns were to stay the same. 25 | - Source: http://gapm.io/ilex 26 | - Version: 9 27 | 28 | **Name: Income per person (GPD/Capita, PPP$ inflation-adjusted)** 29 | 30 | - Definition: Gross domestic product per person adjusted for differences in purchasing power (in international dollars, fixed 2011 prices, PPP based on 2011 ICP). 31 | - Source: http://gapm.io/dgdppc 32 | - Version: 25 33 | 34 | ### Wrangling 35 | 36 | For the analysis, following countries were dropped out of the dataframe due to too much missing data: 37 | 38 | - Andorra, Dominica, Holy See, Liechtenstein, Marshall Islands, Monaco, Nauru, Palau, San Marino, St. Kitts and Nevis, Tuvalu 39 | 40 | ### Summary 41 | 42 | - We can observe an overall and ongoing uptrend for the world population, the income per person and the life expectation 43 | 44 | - especially between 1950 and 1975 was a starting point for a strong increase in all three metrics 45 | 46 | - the world population is increasing strongly and one observable reason for that is the increase in the overall life expectancy 47 | 48 | - we also found out that there is a relationship between the income per person and the life expectancy. An increasing income is no guarantee for an also increasing life expectancy, but it is correlated 49 | 50 | ### Authors 51 | 52 | * Christoph Lindstädt 53 | * Udacity 54 | 55 | ## License 56 | 57 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License 58 | 59 | 60 | Creative Commons License 61 | 62 | -------------------------------------------------------------------------------- /P2-Analyze-A-B-Test-Results/Analyze_ab_test_results_notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Analyze A/B Test Results\n", 8 | "\n", 9 | "You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page. Either way assure that your code passes the project [RUBRIC](https://review.udacity.com/#!/projects/37e27304-ad47-4eb0-a1ab-8c12f60e43d0/rubric). **Please save regularly.**\n", 10 | "\n", 11 | "This project will assure you have mastered the subjects covered in the statistics lessons. The hope is to have this project be as comprehensive of these topics as possible. Good luck!\n", 12 | "\n", 13 | "## Table of Contents\n", 14 | "- [Introduction](#intro)\n", 15 | "- [Part I - Probability](#probability)\n", 16 | "- [Part II - A/B Test](#ab_test)\n", 17 | "- [Part III - Regression](#regression)\n", 18 | "\n", 19 | "\n", 20 | "\n", 21 | "### Introduction\n", 22 | "\n", 23 | "A/B tests are very commonly performed by data analysts and data scientists. It is important that you get some practice working with the difficulties of these \n", 24 | "\n", 25 | "For this project, you will be working to understand the results of an A/B test run by an e-commerce website. 
Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.\n", 26 | "\n", 27 | "**As you work through this notebook, follow along in the classroom and answer the corresponding quiz questions associated with each question.** The labels for each classroom concept are provided for each question. This will assure you are on the right track as you work through the project, and you can feel more confident in your final submission meeting the criteria. As a final check, assure you meet all the criteria on the [RUBRIC](https://review.udacity.com/#!/projects/37e27304-ad47-4eb0-a1ab-8c12f60e43d0/rubric).\n", 28 | "\n", 29 | "\n", 30 | "#### Part I - Probability\n", 31 | "\n", 32 | "To get started, let's import our libraries." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 1, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "import pandas as pd\n", 42 | "import numpy as np\n", 43 | "import random\n", 44 | "import matplotlib.pyplot as plt\n", 45 | "%matplotlib inline\n", 46 | "#We are setting the seed to assure you get the same answers on quizzes as we set up\n", 47 | "random.seed(42)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "`1.` Now, read in the `ab_data.csv` data. Store it in `df`. **Use your dataframe to answer the questions in Quiz 1 of the classroom.**\n", 55 | "\n", 56 | "a. Read in the dataset and take a look at the top few rows here:" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 2, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "data": { 66 | "text/html": [ 67 | "
\n", 68 | "\n", 81 | "\n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | "
user_idtimestampgrouplanding_pageconverted
08511042017-01-21 22:11:48.556739controlold_page0
18042282017-01-12 08:01:45.159739controlold_page0
26615902017-01-11 16:55:06.154213treatmentnew_page0
38535412017-01-08 18:28:03.143765treatmentnew_page0
48649752017-01-21 01:52:26.210827controlold_page1
\n", 135 | "
" 136 | ], 137 | "text/plain": [ 138 | " user_id timestamp group landing_page converted\n", 139 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0\n", 140 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0\n", 141 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0\n", 142 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0\n", 143 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1" 144 | ] 145 | }, 146 | "execution_count": 2, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "#read data\n", 153 | "df = pd.read_csv(\"ab_data.csv\")\n", 154 | "df.head()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "b. Use the cell below to find the number of rows in the dataset." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 3, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "\n", 174 | "RangeIndex: 294478 entries, 0 to 294477\n", 175 | "Data columns (total 5 columns):\n", 176 | "user_id 294478 non-null int64\n", 177 | "timestamp 294478 non-null object\n", 178 | "group 294478 non-null object\n", 179 | "landing_page 294478 non-null object\n", 180 | "converted 294478 non-null int64\n", 181 | "dtypes: int64(2), object(3)\n", 182 | "memory usage: 11.2+ MB\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "#show the df info\n", 188 | "df.info()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "c. The number of unique users in the dataset." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 4, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "data": { 205 | "text/plain": [ 206 | "290584" 207 | ] 208 | }, 209 | "execution_count": 4, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "#choose user_id column and show the number of unique entries\n", 216 | "df.user_id.nunique()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": { 222 | "collapsed": true 223 | }, 224 | "source": [ 225 | "d. The proportion of users converted." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 5, 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/plain": [ 236 | "0.11965919355605512" 237 | ] 238 | }, 239 | "execution_count": 5, 240 | "metadata": {}, 241 | "output_type": "execute_result" 242 | } 243 | ], 244 | "source": [ 245 | "#choose converted column and calculate the mean\n", 246 | "df.converted.mean()" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "source": [ 255 | "e. The number of times the `new_page` and `treatment` don't match." 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 6, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "text/html": [ 266 | "
\n", 267 | "\n", 280 | "\n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | "
user_idtimestampconverted
grouplanding_page
controlnew_page192819281928
old_page145274145274145274
treatmentnew_page145311145311145311
old_page196519651965
\n", 326 | "
" 327 | ], 328 | "text/plain": [ 329 | " user_id timestamp converted\n", 330 | "group landing_page \n", 331 | "control new_page 1928 1928 1928\n", 332 | " old_page 145274 145274 145274\n", 333 | "treatment new_page 145311 145311 145311\n", 334 | " old_page 1965 1965 1965" 335 | ] 336 | }, 337 | "execution_count": 6, 338 | "metadata": {}, 339 | "output_type": "execute_result" 340 | } 341 | ], 342 | "source": [ 343 | "#group the dataframe by the group and landing page and count the entries for the resulting combinations\n", 344 | "df.groupby([\"group\", \"landing_page\"]).count()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 7, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "3893" 356 | ] 357 | }, 358 | "execution_count": 7, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "1928+1965" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "f. Do any of the rows have missing values?" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 8, 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "name": "stdout", 381 | "output_type": "stream", 382 | "text": [ 383 | "\n", 384 | "RangeIndex: 294478 entries, 0 to 294477\n", 385 | "Data columns (total 5 columns):\n", 386 | "user_id 294478 non-null int64\n", 387 | "timestamp 294478 non-null object\n", 388 | "group 294478 non-null object\n", 389 | "landing_page 294478 non-null object\n", 390 | "converted 294478 non-null int64\n", 391 | "dtypes: int64(2), object(3)\n", 392 | "memory usage: 11.2+ MB\n" 393 | ] 394 | } 395 | ], 396 | "source": [ 397 | "df.info() #no" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "`2.` For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, we cannot be sure if this row truly received the new or old page. Use **Quiz 2** in the classroom to figure out how we should handle these rows. \n", 405 | "\n", 406 | "a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz. Store your new dataframe in **df2**." 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 9, 412 | "metadata": {}, 413 | "outputs": [ 414 | { 415 | "data": { 416 | "text/html": [ 417 | "
\n", 418 | "\n", 431 | "\n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | "
user_idtimestampgrouplanding_pageconverted
08511042017-01-21 22:11:48.556739controlold_page0
18042282017-01-12 08:01:45.159739controlold_page0
26615902017-01-11 16:55:06.154213treatmentnew_page0
38535412017-01-08 18:28:03.143765treatmentnew_page0
48649752017-01-21 01:52:26.210827controlold_page1
\n", 485 | "
" 486 | ], 487 | "text/plain": [ 488 | " user_id timestamp group landing_page converted\n", 489 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0\n", 490 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0\n", 491 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0\n", 492 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0\n", 493 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1" 494 | ] 495 | }, 496 | "execution_count": 9, 497 | "metadata": {}, 498 | "output_type": "execute_result" 499 | } 500 | ], 501 | "source": [ 502 | "df.head()" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 10, 508 | "metadata": {}, 509 | "outputs": [ 510 | { 511 | "name": "stdout", 512 | "output_type": "stream", 513 | "text": [ 514 | "Length of df: 294478\n", 515 | "Number of false rows in df: 3893\n", 516 | "Removed rows in df2: 3893\n" 517 | ] 518 | } 519 | ], 520 | "source": [ 521 | "#get all the indices of wrong entries\n", 522 | "false_index = df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False].index\n", 523 | "\n", 524 | "print(\"Length of df: \", len(df))\n", 525 | "print(\"Number of false rows in df: \",len(false_index))\n", 526 | "\n", 527 | "#drop these indices out\n", 528 | "df2 = df.drop(false_index)\n", 529 | "\n", 530 | "print(\"Removed rows in df2: \",len(df) - len(df2))" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": 11, 536 | "metadata": {}, 537 | "outputs": [ 538 | { 539 | "data": { 540 | "text/plain": [ 541 | "0" 542 | ] 543 | }, 544 | "execution_count": 11, 545 | "metadata": {}, 546 | "output_type": "execute_result" 547 | } 548 | ], 549 | "source": [ 550 | "# Double Check all of the correct rows were removed - this should be 0\n", 551 | "df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "`3.` Use **df2** and the cells below to answer questions for **Quiz3** in the classroom." 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": {}, 564 | "source": [ 565 | "a. How many unique **user_id**s are in **df2**?" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 12, 571 | "metadata": {}, 572 | "outputs": [ 573 | { 574 | "data": { 575 | "text/plain": [ 576 | "290584" 577 | ] 578 | }, 579 | "execution_count": 12, 580 | "metadata": {}, 581 | "output_type": "execute_result" 582 | } 583 | ], 584 | "source": [ 585 | "df2.user_id.nunique()" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": { 591 | "collapsed": true 592 | }, 593 | "source": [ 594 | "b. There is one **user_id** repeated in **df2**. What is it?" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 13, 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "data": { 604 | "text/html": [ 605 | "
\n", 606 | "\n", 619 | "\n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | "
user_idtimestampgrouplanding_pageconverted
28937731922017-01-14 02:55:59.590927treatmentnew_page0
\n", 641 | "
" 642 | ], 643 | "text/plain": [ 644 | " user_id timestamp group landing_page converted\n", 645 | "2893 773192 2017-01-14 02:55:59.590927 treatment new_page 0" 646 | ] 647 | }, 648 | "execution_count": 13, 649 | "metadata": {}, 650 | "output_type": "execute_result" 651 | } 652 | ], 653 | "source": [ 654 | "#show duplicated user ids\n", 655 | "df2[df2.duplicated(subset = [\"user_id\"])] #user id 773192" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "c. What is the row information for the repeat **user_id**? " 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 14, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "data": { 672 | "text/html": [ 673 | "
\n", 674 | "\n", 687 | "\n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | "
user_idtimestampgrouplanding_pageconverted
18997731922017-01-09 05:37:58.781806treatmentnew_page0
28937731922017-01-14 02:55:59.590927treatmentnew_page0
\n", 717 | "
" 718 | ], 719 | "text/plain": [ 720 | " user_id timestamp group landing_page converted\n", 721 | "1899 773192 2017-01-09 05:37:58.781806 treatment new_page 0\n", 722 | "2893 773192 2017-01-14 02:55:59.590927 treatment new_page 0" 723 | ] 724 | }, 725 | "execution_count": 14, 726 | "metadata": {}, 727 | "output_type": "execute_result" 728 | } 729 | ], 730 | "source": [ 731 | "df2[df2.duplicated(subset = [\"user_id\"], keep = False)] #different timestamp" 732 | ] 733 | }, 734 | { 735 | "cell_type": "markdown", 736 | "metadata": {}, 737 | "source": [ 738 | "d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**." 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": 15, 744 | "metadata": {}, 745 | "outputs": [ 746 | { 747 | "name": "stdout", 748 | "output_type": "stream", 749 | "text": [ 750 | "Length before drop: 290585\n", 751 | "Length after drop: 290584\n" 752 | ] 753 | } 754 | ], 755 | "source": [ 756 | "print(\"Length before drop: \", len(df2))\n", 757 | "#drop duplicated user ids\n", 758 | "df2 = df2.drop(df2[df2.duplicated(subset = [\"user_id\"])].index)\n", 759 | "print(\"Length after drop: \", len(df2))" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": 16, 765 | "metadata": {}, 766 | "outputs": [ 767 | { 768 | "data": { 769 | "text/html": [ 770 | "
\n", 771 | "\n", 784 | "\n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | "
user_idtimestampgrouplanding_pageconverted
\n", 798 | "
" 799 | ], 800 | "text/plain": [ 801 | "Empty DataFrame\n", 802 | "Columns: [user_id, timestamp, group, landing_page, converted]\n", 803 | "Index: []" 804 | ] 805 | }, 806 | "execution_count": 16, 807 | "metadata": {}, 808 | "output_type": "execute_result" 809 | } 810 | ], 811 | "source": [ 812 | "df2[df2.duplicated(subset = [\"user_id\"], keep = False)] " 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": {}, 818 | "source": [ 819 | "`4.` Use **df2** in the cells below to answer the quiz questions related to **Quiz 4** in the classroom.\n", 820 | "\n", 821 | "a. What is the probability of an individual converting regardless of the page they receive?" 822 | ] 823 | }, 824 | { 825 | "cell_type": "code", 826 | "execution_count": 17, 827 | "metadata": {}, 828 | "outputs": [ 829 | { 830 | "name": "stdout", 831 | "output_type": "stream", 832 | "text": [ 833 | "The probability of an individual converting regardless of the page they receive is: 0.119597087245\n" 834 | ] 835 | } 836 | ], 837 | "source": [ 838 | "overall_conv_prob = df2.converted.mean()\n", 839 | "print(\"The probability of an individual converting regardless of the page they receive is: \", overall_conv_prob)" 840 | ] 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": {}, 845 | "source": [ 846 | "b. Given that an individual was in the `control` group, what is the probability they converted?" 847 | ] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": 18, 852 | "metadata": {}, 853 | "outputs": [ 854 | { 855 | "name": "stdout", 856 | "output_type": "stream", 857 | "text": [ 858 | "The probability of converting in the control group is: 0.1203863045\n" 859 | ] 860 | } 861 | ], 862 | "source": [ 863 | "controlgrp_conv_prob = df2.query(\"group == 'control'\").converted.mean()\n", 864 | "print(\"The probability of converting in the control group is: \", controlgrp_conv_prob)" 865 | ] 866 | }, 867 | { 868 | "cell_type": "markdown", 869 | "metadata": {}, 870 | "source": [ 871 | "c. Given that an individual was in the `treatment` group, what is the probability they converted?" 872 | ] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": 19, 877 | "metadata": {}, 878 | "outputs": [ 879 | { 880 | "name": "stdout", 881 | "output_type": "stream", 882 | "text": [ 883 | "The probability of converting in the treatment group is: 0.118808065515\n" 884 | ] 885 | } 886 | ], 887 | "source": [ 888 | "controlgrp_conv_prob = df2.query(\"group == 'treatment'\").converted.mean()\n", 889 | "print(\"The probability of converting in the treatment group is: \", controlgrp_conv_prob)" 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "metadata": {}, 895 | "source": [ 896 | "d. What is the probability that an individual received the new page?" 
897 | ] 898 | }, 899 | { 900 | "cell_type": "code", 901 | "execution_count": 20, 902 | "metadata": {}, 903 | "outputs": [ 904 | { 905 | "name": "stdout", 906 | "output_type": "stream", 907 | "text": [ 908 | "The probability of receiving the new page is: 0.5000619442226688\n" 909 | ] 910 | } 911 | ], 912 | "source": [ 913 | "overall_conv_prob = len(df2.query(\"landing_page == 'new_page'\"))/len(df2)\n", 914 | "print(\"The probability of receiving the new page is: \", overall_conv_prob)" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": 21, 920 | "metadata": {}, 921 | "outputs": [ 922 | { 923 | "data": { 924 | "text/plain": [ 925 | "17264" 926 | ] 927 | }, 928 | "execution_count": 21, 929 | "metadata": {}, 930 | "output_type": "execute_result" 931 | } 932 | ], 933 | "source": [ 934 | "num_conv_treat = df2.query(\"group == 'treatment' and converted == 1\").count()[0]\n", 935 | "num_conv_treat" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": 22, 941 | "metadata": {}, 942 | "outputs": [ 943 | { 944 | "data": { 945 | "text/plain": [ 946 | "17489" 947 | ] 948 | }, 949 | "execution_count": 22, 950 | "metadata": {}, 951 | "output_type": "execute_result" 952 | } 953 | ], 954 | "source": [ 955 | "num_conv_control = df2.query(\"group == 'control' and converted == 1\").count()[0]\n", 956 | "num_conv_control" 957 | ] 958 | }, 959 | { 960 | "cell_type": "markdown", 961 | "metadata": {}, 962 | "source": [ 963 | "e. Consider your results from parts (a) through (d) above, and explain below whether you think there is sufficient evidence to conclude that the new treatment page leads to more conversions." 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "**To sum up the results:**\n", 971 | "\n", 977 | "\n", 978 | "The first positive thing to mention is, that the users received the new or the old page in a ration very close to 50/50. The probabilities of converting in the control group and the treatment group are very close to each other, with a difference of 0.16%. This small difference could also appear by chance, therefore we don't have sufficient evidence to conclude that the new treatment page leads to more conversions than the old page. " 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": {}, 984 | "source": [ 985 | "\n", 986 | "### Part II - A/B Test\n", 987 | "\n", 988 | "Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed. \n", 989 | "\n", 990 | "However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time? How long do you run to render a decision that neither page is better than another? \n", 991 | "\n", 992 | "These questions are the difficult parts associated with A/B tests in general. \n", 993 | "\n", 994 | "\n", 995 | "`1.` For now, consider you need to make the decision just based on all the data provided. If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state your hypothesis in terms of words or in terms of **$p_{old}$** and **$p_{new}$**, which are the converted rates for the old and new pages." 
996 | ] 997 | }, 998 | { 999 | "cell_type": "markdown", 1000 | "metadata": {}, 1001 | "source": [ 1002 | "\n", 1003 | " \n", 1004 | "**$$H_{0}: p_{new} - p_{old} <= 0$$**\n", 1005 | "\n", 1006 | "**$$H_{1}: p_{new} - p_{old} > 0$$**\n", 1007 | "\n" 1008 | ] 1009 | }, 1010 | { 1011 | "cell_type": "markdown", 1012 | "metadata": {}, 1013 | "source": [ 1014 | "`2.` Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have \"true\" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page.

\n", 1015 | "\n", 1016 | "Use a sample size for each page equal to the ones in **ab_data.csv**.

\n", 1017 | "\n", 1018 | "Perform the sampling distribution for the difference in **converted** between the two pages over 10,000 iterations of calculating an estimate from the null.

\n", 1019 | "\n", 1020 | "Use the cells below to provide the necessary parts of this simulation. If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem. You can use **Quiz 5** in the classroom to make sure you are on the right track.

" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "markdown", 1025 | "metadata": {}, 1026 | "source": [ 1027 | "a. What is the **conversion rate** for $p_{new}$ under the null? " 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "code", 1032 | "execution_count": 23, 1033 | "metadata": {}, 1034 | "outputs": [ 1035 | { 1036 | "data": { 1037 | "text/plain": [ 1038 | "0.11959708724499628" 1039 | ] 1040 | }, 1041 | "execution_count": 23, 1042 | "metadata": {}, 1043 | "output_type": "execute_result" 1044 | } 1045 | ], 1046 | "source": [ 1047 | "p_new = df2.converted.mean()\n", 1048 | "p_new" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "markdown", 1053 | "metadata": {}, 1054 | "source": [ 1055 | "b. What is the **conversion rate** for $p_{old}$ under the null?

" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "code", 1060 | "execution_count": 24, 1061 | "metadata": {}, 1062 | "outputs": [ 1063 | { 1064 | "data": { 1065 | "text/plain": [ 1066 | "0.11959708724499628" 1067 | ] 1068 | }, 1069 | "execution_count": 24, 1070 | "metadata": {}, 1071 | "output_type": "execute_result" 1072 | } 1073 | ], 1074 | "source": [ 1075 | "p_old = df2.converted.mean()\n", 1076 | "p_old" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "markdown", 1081 | "metadata": {}, 1082 | "source": [ 1083 | "c. What is $n_{new}$, the number of individuals in the treatment group?" 1084 | ] 1085 | }, 1086 | { 1087 | "cell_type": "code", 1088 | "execution_count": 25, 1089 | "metadata": {}, 1090 | "outputs": [ 1091 | { 1092 | "data": { 1093 | "text/plain": [ 1094 | "145310" 1095 | ] 1096 | }, 1097 | "execution_count": 25, 1098 | "metadata": {}, 1099 | "output_type": "execute_result" 1100 | } 1101 | ], 1102 | "source": [ 1103 | "n_new = df2.query(\"group == 'treatment'\").user_id.nunique()\n", 1104 | "n_new" 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "markdown", 1109 | "metadata": {}, 1110 | "source": [ 1111 | "d. What is $n_{old}$, the number of individuals in the control group?" 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "code", 1116 | "execution_count": 26, 1117 | "metadata": {}, 1118 | "outputs": [ 1119 | { 1120 | "data": { 1121 | "text/plain": [ 1122 | "145274" 1123 | ] 1124 | }, 1125 | "execution_count": 26, 1126 | "metadata": {}, 1127 | "output_type": "execute_result" 1128 | } 1129 | ], 1130 | "source": [ 1131 | "n_old = df2.query(\"group == 'control'\").user_id.nunique()\n", 1132 | "n_old" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "markdown", 1137 | "metadata": {}, 1138 | "source": [ 1139 | "e. Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in **new_page_converted**." 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "code", 1144 | "execution_count": 27, 1145 | "metadata": {}, 1146 | "outputs": [ 1147 | { 1148 | "data": { 1149 | "text/plain": [ 1150 | "17272" 1151 | ] 1152 | }, 1153 | "execution_count": 27, 1154 | "metadata": {}, 1155 | "output_type": "execute_result" 1156 | } 1157 | ], 1158 | "source": [ 1159 | "#simulate n transactions with a conversion rate of p with np.random.choice\n", 1160 | "new_page_converted = np.random.choice([1,0], size = n_new, replace = True, p = (p_new, 1-p_new))\n", 1161 | "new_page_converted.sum()" 1162 | ] 1163 | }, 1164 | { 1165 | "cell_type": "markdown", 1166 | "metadata": {}, 1167 | "source": [ 1168 | "f. Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in **old_page_converted**." 1169 | ] 1170 | }, 1171 | { 1172 | "cell_type": "code", 1173 | "execution_count": 28, 1174 | "metadata": {}, 1175 | "outputs": [ 1176 | { 1177 | "data": { 1178 | "text/plain": [ 1179 | "17436" 1180 | ] 1181 | }, 1182 | "execution_count": 28, 1183 | "metadata": {}, 1184 | "output_type": "execute_result" 1185 | } 1186 | ], 1187 | "source": [ 1188 | "old_page_converted = np.random.choice([1,0], size = n_old, replace = True, p = [p_old, (1-p_old)])\n", 1189 | "old_page_converted.sum()" 1190 | ] 1191 | }, 1192 | { 1193 | "cell_type": "markdown", 1194 | "metadata": {}, 1195 | "source": [ 1196 | "g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f)." 
1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "code", 1201 | "execution_count": 29, 1202 | "metadata": {}, 1203 | "outputs": [ 1204 | { 1205 | "data": { 1206 | "text/plain": [ 1207 | "-0.0011583564321773071" 1208 | ] 1209 | }, 1210 | "execution_count": 29, 1211 | "metadata": {}, 1212 | "output_type": "execute_result" 1213 | } 1214 | ], 1215 | "source": [ 1216 | "new_page_converted.mean() - old_page_converted.mean()" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "markdown", 1221 | "metadata": {}, 1222 | "source": [ 1223 | "h. Create 10,000 $p_{new}$ - $p_{old}$ values using the same simulation process you used in parts (a) through (g) above. Store all 10,000 values in a NumPy array called **p_diffs**." 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": 30, 1229 | "metadata": {}, 1230 | "outputs": [], 1231 | "source": [ 1232 | "#creating the sampling distribution with 10000 simulations of the steps before\n", 1233 | "p_diffs = []\n", 1234 | "for _ in range(10000):\n", 1235 | " new_page_converted = np.random.choice([1,0], size = n_new, replace = True, p = (p_new, 1-p_new))\n", 1236 | " old_page_converted = np.random.choice([1,0], size = n_old, replace = True, p = (p_old, 1-p_old))\n", 1237 | " diff = new_page_converted.mean() - old_page_converted.mean()\n", 1238 | " p_diffs.append(diff)" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "markdown", 1243 | "metadata": {}, 1244 | "source": [ 1245 | "i. Plot a histogram of the **p_diffs**. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here." 1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "code", 1250 | "execution_count": 31, 1251 | "metadata": {}, 1252 | "outputs": [ 1253 | { 1254 | "data": { 1255 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYAAAAD8CAYAAAB+UHOxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAELVJREFUeJzt3X+s3XV9x/Hna0UwmzqKvbCurSuaLln5Y8gaZHF/sLBBKYbiHyaQTBs0qckg0cxlqfIHRkOCOn+EzGFQG0uGIpsaG+mGlbgYkwEtDIFaWa9Q5dqO1tWgi4kL+N4f51s59N7ee+6Pc8+9fJ6P5OR8z/v7+f769Oa++v1+vud7U1VIktrzW6PeAUnSaBgAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEadMeodmM6qVatq/fr1o94NSVpWHn744Z9W1dhM7ZZ0AKxfv579+/ePejckaVlJ8qNB2nkJSJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGrWkvwkszWT9jntHtu3Dt141sm1LC8EzAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKB8HLc3RqB5F7WOotVA8A5CkRs0YAEnWJfl2koNJDiR5T1f/YJKfJHm0e23pW+b9ScaTPJnkir765q42nmTHcA5JkjSIQS4BPQ+8r6oeSfJq4OEke7t5n6yqv+9vnGQjcC1wAfD7wLeS/GE3+9PAXwITwL4ku6vq+wtxIJKk2ZkxAKrqKHC0m/5FkoPAmmkW2QrcXVW/Ap5OMg5c3M0br6qnAJLc3bU1ACRpBGY1BpBkPfBG4MGudGOSx5LsTLKyq60BnulbbKKrna4uSRqBgQMgyauArwDvraqfA7cDbwAupHeG8PGTTadYvKapn7qd7Un2J9l//PjxQXdPkjRLAwVAklfQ++V/V1V9FaCqnq2qF6rq18BnefEyzwSwrm/xtcCRaeovUVV3VNWmqto0NjY22+ORJA1okLuAAnweOFhVn+irr+5r9lbgiW56N3BtkrOSnA9sAB4C9gEbkpyf5Ex6A8W7F+YwJEmzNchdQG8G3g48nuTRrvYB4LokF9K7jHMYeDdAVR1Icg+9wd3ngRuq6gWAJDcC9wErgJ1VdWABj0WSNAuD3AX0Xaa+fr9nmmVuAW6Zor5nuuUkSYvHbwJLUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUTMGQJJ1Sb6d5GCSA0ne09XPSbI3yaHufWVXT5LbkowneSzJRX3r2ta1P5Rk2/AOS5I0k0HOAJ4H3ldVfwRcAtyQZCOwA7i/qjYA93efAa4ENnSv7cDt0AsM4GbgTcDFwM0nQ0OStPhmDICqOlpVj3TTvwAOAmuArcCurtku4JpueitwZ/U8AJydZDVwBbC3qk5U1c+AvcDmBT0aSdLAZjUGkGQ98EbgQeC8qjoKvZAAzu2arQGe6Vtsoqudri5JGoGBAyDJq4CvAO+tqp9P13SKWk1TP3U725PsT7L/+PHjg+6eJGmWBgqAJK+g98v/rqr6ald+tru0Q/d+rKtPAOv6Fl8LHJmm/hJVdUdVbaqqTWNjY7M5FknSLAxyF1CAzwMHq+oTfbN2Ayfv5NkGfL2v/o7ubqBLgOe6S0T3AZcnWdkN/l7e1SRJI3DGAG3eDLwdeDzJo13tA8CtwD1J3gX8GHhbN28PsAUYB34JXA9QVSeSfBjY17X7UFWdWJCjkCTN2owBUFXfZerr9wCXTdG+gBtOs66dwM7Z7KAkaTj8JrAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDVqkEdBSDNav+PeUe+CpFnyDECSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqNmDIAkO5McS/JEX+2DSX6S5NHutaVv3vuTjCd5MskVffXNXW08yY6FPxRJ0mwMcgbwBWDzFPVPVtWF3WsPQJKNwLXABd0y/5hkRZIVwKeBK4GNwHVdW0nSiMz4R+Gr6jtJ1g+4vq3A3VX1K+DpJOPAxd288ap6CiDJ3V3b7896jyVJC2I+YwA3Jnmsu0S0squtAZ7pazPR1U5XlySNyFwD4HbgDcCFwFHg4109U7StaeqTJNmeZH+S/cePH5/j7kmSZjKnAKiqZ6vqhar6NfBZXrzMMwGs62u6FjgyTX2qdd9RVZuqatPY2Nhcdk+SNIA5BUCS1X0f3wqcvENoN3BtkrOSnA9sAB4C9gEbkpyf5Ex6A8W7577bkqT5mnEQOMmXgEuBVUkmgJuBS5NcSO8yzmHg3QBVdSDJPfQGd58HbqiqF7r13AjcB6wAdlbVgQU/GknSwAa5C+i6Kcqfn6b9LcAtU9T3AHtmtXeSpKHxm8CS1CgDQJIaZQBIUqNmHAOQtLSs33HvyLZ9+NarRrZtLTzPACSpUQaAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUTMGQJKdSY4leaKvdk6SvUkOde8ru3qS3JZkPMljSS7qW2Zb1/5Qkm3DORxJ0qAGOQP4ArD5lNoO4P6q2gDc330GuBLY0L22A7dDLzCAm4E3ARcDN58MDUnSaMwYAFX1HeDEKeWtwK5uehdwTV/9zup5ADg7yWrgCmBvVZ2oqp8Be5kcKpKkRTTXMYDzquooQPd+bldfAzzT126iq52uPkmS7Un2J9l//PjxOe6eJGkmCz0InClqNU19crHqjqraVFWbxsbGFnTnJEkvmmsAPNtd2qF7P9bVJ4B1fe3WAkemqUuSRmSuAbAbOHknzzbg6331d3R3A10CPNddIroPuDzJym7w9/KuJkkakTNmapDkS8ClwKokE/Tu5rkVuCfJu4AfA2/rmu8BtgDjwC+B6wGq6kSSDwP7unYfqqpTB5YlSYtoxgCoqutOM+uyKdoWcMNp1rMT2DmrvZMkDY3fBJakRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGzfgXwbS8rN9x76h3QdIy4RmAJDXKAJCkRhkAktQoA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEbNKwCSHE7yeJJHk+zvauck2ZvkUP
e+sqsnyW1JxpM8luSihTgASdLcLMQZwJ9X1YVVtan7vAO4v6o2APd3nwGuBDZ0r+3A7QuwbUnSHA3jEtBWYFc3vQu4pq9+Z/U8AJydZPUQti9JGsB8A6CAbyZ5OMn2rnZeVR0F6N7P7eprgGf6lp3oapKkEZjv46DfXFVHkpwL7E3yg2naZopaTWrUC5LtAK973evmuXuSpNOZ1xlAVR3p3o8BXwMuBp49eWmnez/WNZ8A1vUtvhY4MsU676iqTVW1aWxsbD67J0maxpwDIMnvJHn1yWngcuAJYDewrWu2Dfh6N70beEd3N9AlwHMnLxVJkhbffC4BnQd8LcnJ9Xyxqv4tyT7gniTvAn4MvK1rvwfYAowDvwSun8e2JUnzNOcAqKqngD+eov4/wGVT1Au4Ya7bkyQtLP8msKSBjepvTh++9aqRbPflzkdBSFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSowwASWqUASBJjTIAJKlRBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY3ybwIPwaj+bqokzYZnAJLUKANAkhplAEhSoxwDkLTkjXJc7fCtV41s28PmGYAkNcoAkKRGGQCS1KhFD4Akm5M8mWQ8yY7F3r4kqWdRAyDJCuDTwJXARuC6JBsXcx8kST2LfRfQxcB4VT0FkORuYCvw/WFszG/kStLpLXYArAGe6fs8AbxpkfdBkgY2qv9ILsbtp4sdAJmiVi9pkGwHtncf/zfJk0Pep1XAT4e8jeXGPpnMPpnMPplswfokH5nX4n8wSKPFDoAJYF3f57XAkf4GVXUHcMdi7VCS/VW1abG2txzYJ5PZJ5PZJ5Mttz5Z7LuA9gEbkpyf5EzgWmD3Iu+DJIlFPgOoqueT3AjcB6wAdlbVgcXcB0lSz6I/C6iq9gB7Fnu701i0y03LiH0ymX0ymX0y2bLqk1TVzK0kSS87PgpCkhr1sg2AJOck2ZvkUPe+8jTttnVtDiXZ1lf/kySPd4+suC1JTlnub5NUklXDPpaFMqw+SfKxJD9I8liSryU5e7GOaa5meiRJkrOSfLmb/2CS9X3z3t/Vn0xyxaDrXOoWuk+SrEvy7SQHkxxI8p7FO5qFMYyfk27eiiT/meQbwz+KaVTVy/IFfBTY0U3vAD4yRZtzgKe695Xd9Mpu3kPAn9L77sK/Alf2LbeO3kD2j4BVoz7WUfcJcDlwRjf9kanWu5Re9G5A+CHweuBM4HvAxlPa/DXwmW76WuDL3fTGrv1ZwPndelYMss6l/BpSn6wGLuravBr4r9b7pG+5vwG+CHxjlMf4sj0DoPeIiV3d9C7gminaXAHsraoTVfUzYC+wOclq4DVV9R/V+9e685TlPwn8Had8iW0ZGEqfVNU3q+r5bvkH6H2/Yyn7zSNJqur/gJOPJOnX31f/AlzWnfFsBe6uql9V1dPAeLe+Qda5lC14n1TV0ap6BKCqfgEcpPc0gOViGD8nJFkLXAV8bhGOYVov5wA4r6qOAnTv507RZqpHU6zpXhNT1ElyNfCTqvreMHZ6yIbSJ6d4J72zg6XsdMc4ZZsu3J4DXjvNsoOscykbRp/8Rndp5I3Agwu4z8M2rD75FL3/QP564Xd5dpb1n4RM8i3g96aYddOgq5iiVqerJ/ntbt2XD7j+RbfYfXLKtm8CngfuGnBbozLjsUzT5nT1qf4ztZzOEIfRJ72FklcBXwHeW1U/n/MeLr4F75MkbwGOVdXDSS6d5/7N27IOgKr6i9PNS/JsktVVdbS7fHFsimYTwKV9n9cC/97V155SPwK8gd71vO91459rgUeSXFxV/z2PQ1kwI+iTk+veBrwFuKy7RLSUzfhIkr42E0nOAH4XODHDsjOtcykbSp8keQW9X/53VdVXh7PrQzOMPrkauDrJFuCVwGuS/FNV/dVwDmEGox5oGdYL+BgvHfD86BRtzgGepjfYubKbPqebtw+4hBcHPLdMsfxhltcg8FD6BNhM75HeY6M+xgH74Qx6g9vn8+Lg3gWntLmBlw7u3dNNX8BLB/eeojdYOOM6l/JrSH0SemNFnxr18S2VPjll2UsZ8SDwyDt5iP94rwXuBw517yd/iW0CPtfX7p30BmjGgev76puAJ+iN3v8D3ZfmTtnGcguAofRJ1+4Z4NHu9ZlRH+sAfbGF3l0pPwRu6mofAq7upl8J/HN3bA8Br+9b9qZuuSd56d1hk9a5nF4L3SfAn9G7HPJY38/GpP9ILeXXMH5O+uaPPAD8JrAkNerlfBeQJGkaBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY36fwOV9rCNLfoUAAAAAElFTkSuQmCC\n", 1256 | "text/plain": [ 1257 | "" 1258 | ] 1259 | }, 1260 | "metadata": { 1261 | "needs_background": "light" 1262 | }, 1263 | "output_type": "display_data" 1264 | } 1265 | ], 1266 | "source": [ 1267 | "plt.hist(p_diffs);" 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "markdown", 1272 | "metadata": {}, 1273 | "source": [ 1274 | "j. What proportion of the **p_diffs** are greater than the actual difference observed in **ab_data.csv**?" 
1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": 32, 1280 | "metadata": {}, 1281 | "outputs": [ 1282 | { 1283 | "name": "stdout", 1284 | "output_type": "stream", 1285 | "text": [ 1286 | "Number of converted persons in control group: 17489 | p_old: 0.1203863045\n", 1287 | "Number of converted persons in treatment group: 17264 | p_new: 0.118808065515\n", 1288 | "Actual difference: -0.00157823898536\n" 1289 | ] 1290 | } 1291 | ], 1292 | "source": [ 1293 | "p_actual_old = df2.query(\"group == 'control'\").converted.mean()\n", 1294 | "p_actual_new = df2.query(\"group == 'treatment'\").converted.mean()\n", 1295 | "actual_diff = p_actual_new - p_actual_old\n", 1296 | "print(\"Number of converted persons in control group: \",num_conv_control, \"| p_old: \", p_actual_old)\n", 1297 | "print(\"Number of converted persons in treatment group: \",num_conv_treat, \"| p_new: \", p_actual_new)\n", 1298 | "print(\"Actual difference: \", actual_diff)" 1299 | ] 1300 | }, 1301 | { 1302 | "cell_type": "code", 1303 | "execution_count": 33, 1304 | "metadata": {}, 1305 | "outputs": [ 1306 | { 1307 | "data": { 1308 | "text/plain": [ 1309 | "" 1310 | ] 1311 | }, 1312 | "execution_count": 33, 1313 | "metadata": {}, 1314 | "output_type": "execute_result" 1315 | }, 1316 | { 1317 | "data": { 1318 | "image/png": "[base64-encoded PNG omitted: histogram of the simulated null differences with the observed difference marked by a red vertical line]\n",
1319 | "text/plain": [ 1320 | "" 1321 | ] 1322 | }, 1323 | "metadata": { 1324 | "needs_background": "light" 1325 | }, 1326 | "output_type": "display_data" 1327 | } 1328 | ], 1329 | "source": [ 1330 | "p_diffs = np.array(p_diffs)\n", 1331 | "#calculate the null_vals based on the std of the p_diffs array\n", 1332 | "null_vals = np.random.normal(0, p_diffs.std(), p_diffs.size)\n", 1333 | "plt.hist(null_vals);\n", 1334 | "plt.axvline(actual_diff, color = 'r')" 1335 | ] 1336 | }, 1337 | { 1338 | "cell_type": "code", 1339 | "execution_count": 34, 1340 | "metadata": {}, 1341 | "outputs": [ 1342 | { 1343 | "data": { 1344 | "text/plain": [ 1345 | "0.90949999999999998" 1346 | ] 1347 | }, 1348 | "execution_count": 34, 1349 | "metadata": {}, 1350 | "output_type": "execute_result" 1351 | } 1352 | ], 1353 | "source": [ 1354 | "(null_vals > actual_diff).mean()" 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "markdown", 1359 | "metadata": {}, 1360 | "source": [ 1361 | "k. Please explain using the vocabulary you've learned in this course what you just computed in part **j.** What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?" 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "markdown", 1366 | "metadata": {}, 1367 | "source": [ 1368 | "In j. we calculated the p-value: 0.9095.\n", 1369 | "\n", 1370 | "What exactly did we do?\n", 1371 | "\n", 1372 | "We assumed that the null hypothesis is true, i.e. that p_old = p_new, so both pages convert at the same rate as the whole sample, and the conversion probability of each page equals that of the whole sample. Based on that, we simulated a sampling distribution of the difference in conversion probability between the pages, with n equal to the original number of people who received each page and a conversion probability of 0.119597 for both. Using the standard deviation of these simulated differences, we then drew values from a normal distribution centered at 0, which represents the distribution of the difference under the null hypothesis. As the last step we calculated the proportion of these values that are larger than the actually observed difference. This proportion is the p-value: the probability of observing a statistic at least as extreme as ours if the null hypothesis is true. With a Type I error rate of 0.05, we have 0.9095 > 0.05, so we don't have enough evidence to reject the null hypothesis." 1373 | ] 1374 | }, 1375 | { 1376 | "cell_type": "markdown", 1377 | "metadata": {}, 1378 | "source": [ 1379 | "l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let `n_old` and `n_new` refer to the number of rows associated with the old page and new pages, respectively."
1380 | ] 1381 | }, 1382 | { 1383 | "cell_type": "code", 1384 | "execution_count": 35, 1385 | "metadata": {}, 1386 | "outputs": [ 1387 | { 1388 | "name": "stderr", 1389 | "output_type": "stream", 1390 | "text": [ 1391 | "/opt/conda/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n", 1392 | "  from pandas.core import datetools\n" 1393 | ] 1394 | } 1395 | ], 1396 | "source": [ 1397 | "import statsmodels.api as sm\n", 1398 | "\n", 1399 | "convert_old = df2.query(\"group == 'control'\").converted.sum()\n", 1400 | "convert_new = df2.query(\"group == 'treatment'\").converted.sum()\n", 1401 | "n_old = df2.query(\"landing_page == 'old_page'\").count()[0]\n", 1402 | "n_new = df2.query(\"landing_page == 'new_page'\").count()[0]" 1403 | ] 1404 | }, 1405 | { 1406 | "cell_type": "markdown", 1407 | "metadata": {}, 1408 | "source": [ 1409 | "m. Now use `stats.proportions_ztest` to compute your test statistic and p-value. [Here](http://knowledgetack.com/python/statsmodels/proportions_ztest/) is a helpful link on using the built in." 1410 | ] 1411 | }, 1412 | { 1413 | "cell_type": "code", 1414 | "execution_count": 36, 1415 | "metadata": {}, 1416 | "outputs": [ 1417 | { 1418 | "name": "stdout", 1419 | "output_type": "stream", 1420 | "text": [ 1421 | "Z-Score: 1.31092419842 \n", 1422 | "Critical Z-Score: 1.64485362695 \n", 1423 | "P-Value: 0.905058312759\n" 1424 | ] 1425 | } 1426 | ], 1427 | "source": [ 1428 | "#https://machinelearningmastery.com/critical-values-for-statistical-hypothesis-testing/\n", 1429 | "from scipy.stats import norm\n", 1430 | "\n", 1431 | "#calculate z-test\n", 1432 | "z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative=\"smaller\")\n", 1433 | "\n", 1434 | "#calculate the critical z_term\n", 1435 | "z_critical=norm.ppf(1-(0.05))\n", 1436 | "\n", 1437 | "print(\"Z-Score: \",z_score, \"\\nCritical Z-Score: \", z_critical, \"\\nP-Value: \", p_value)\n", 1438 | "\n" 1439 | ] 1440 | }, 1441 | { 1442 | "cell_type": "markdown", 1443 | "metadata": {}, 1444 | "source": [ 1445 | "n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts **j.** and **k.**?" 1446 | ] 1447 | }, 1448 | { 1449 | "cell_type": "markdown", 1450 | "metadata": {}, 1451 | "source": [ 1452 | "The p-value here agrees with our findings in j. The calculated Z-score is also smaller than the critical Z-score, so we likewise fail to reject the null hypothesis based on the Z-test. \n", 1453 | "\n", 1454 | "In conclusion, we fail to reject the null hypothesis that the conversion rate of the old page is equal to or better than the conversion rate of the new page." 1455 | ] 1456 | }, 1457 | { 1458 | "cell_type": "markdown", 1459 | "metadata": {}, 1460 | "source": [ 1461 | "\n", 1462 | "### Part III - A regression approach\n", 1463 | "\n", 1464 | "`1.` In this final part, you will see that the result you achieved in the A/B test in Part II above can also be achieved by performing regression.

\n", 1465 | "\n", 1466 | "a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?" 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "markdown", 1471 | "metadata": {}, 1472 | "source": [ 1473 | "Since this case is binary, a logistic regression should be performed." 1474 | ] 1475 | }, 1476 | { 1477 | "cell_type": "markdown", 1478 | "metadata": {}, 1479 | "source": [ 1480 | "b. The goal is to use **statsmodels** to fit the regression model you specified in part **a.** to see if there is a significant difference in conversion based on which page a customer receives. However, you first need to create in df2 a column for the intercept, and create a dummy variable column for which page each user received. Add an **intercept** column, as well as an **ab_page** column, which is 1 when an individual receives the **treatment** and 0 if **control**." 1481 | ] 1482 | }, 1483 | { 1484 | "cell_type": "code", 1485 | "execution_count": 37, 1486 | "metadata": {}, 1487 | "outputs": [ 1488 | { 1489 | "data": { 1490 | "text/html": [ 1491 | "
" 1560 | ], 1561 | "text/plain": [ 1562 | " user_id timestamp group landing_page converted\n", 1563 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0\n", 1564 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0\n", 1565 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0\n", 1566 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0\n", 1567 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1" 1568 | ] 1569 | }, 1570 | "execution_count": 37, 1571 | "metadata": {}, 1572 | "output_type": "execute_result" 1573 | } 1574 | ], 1575 | "source": [ 1576 | "df_log = df2.copy()\n", 1577 | "df_log.head()" 1578 | ] 1579 | }, 1580 | { 1581 | "cell_type": "code", 1582 | "execution_count": 38, 1583 | "metadata": {}, 1584 | "outputs": [], 1585 | "source": [ 1586 | "#add intercept\n", 1587 | "df_log[\"intercept\"] = 1\n", 1588 | "\n", 1589 | "#get dummies and rename\n", 1590 | "df_log = df_log.join(pd.get_dummies(df_log['group']))\n", 1591 | "df_log.rename(columns = {\"treatment\": \"ab_page\"}, inplace=True)" 1592 | ] 1593 | }, 1594 | { 1595 | "cell_type": "code", 1596 | "execution_count": 39, 1597 | "metadata": {}, 1598 | "outputs": [ 1599 | { 1600 | "data": { 1601 | "text/html": [ 1602 | "
" 1689 | ], 1690 | "text/plain": [ 1691 | " user_id timestamp group landing_page converted \\\n", 1692 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n", 1693 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n", 1694 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n", 1695 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n", 1696 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n", 1697 | "\n", 1698 | " intercept control ab_page \n", 1699 | "0 1 1 0 \n", 1700 | "1 1 1 0 \n", 1701 | "2 1 0 1 \n", 1702 | "3 1 0 1 \n", 1703 | "4 1 1 0 " 1704 | ] 1705 | }, 1706 | "execution_count": 39, 1707 | "metadata": {}, 1708 | "output_type": "execute_result" 1709 | } 1710 | ], 1711 | "source": [ 1712 | "df_log.head()" 1713 | ] 1714 | }, 1715 | { 1716 | "cell_type": "markdown", 1717 | "metadata": {}, 1718 | "source": [ 1719 | "c. Use **statsmodels** to instantiate your regression model on the two columns you created in part b., then fit the model using the two columns you created in part **b.** to predict whether or not an individual converts. " 1720 | ] 1721 | }, 1722 | { 1723 | "cell_type": "code", 1724 | "execution_count": 40, 1725 | "metadata": {}, 1726 | "outputs": [ 1727 | { 1728 | "name": "stdout", 1729 | "output_type": "stream", 1730 | "text": [ 1731 | "Optimization terminated successfully.\n", 1732 | " Current function value: 0.366118\n", 1733 | " Iterations 6\n" 1734 | ] 1735 | } 1736 | ], 1737 | "source": [ 1738 | "y = df_log[\"converted\"]\n", 1739 | "x = df_log[[\"intercept\", \"ab_page\"]]\n", 1740 | "\n", 1741 | "#load model\n", 1742 | "log_mod = sm.Logit(y,x)\n", 1743 | "\n", 1744 | "#fit model\n", 1745 | "result = log_mod.fit()\n" 1746 | ] 1747 | }, 1748 | { 1749 | "cell_type": "markdown", 1750 | "metadata": {}, 1751 | "source": [ 1752 | "d. Provide the summary of your model below, and use it as necessary to answer the following questions." 1753 | ] 1754 | }, 1755 | { 1756 | "cell_type": "code", 1757 | "execution_count": 41, 1758 | "metadata": {}, 1759 | "outputs": [ 1760 | { 1761 | "data": { 1762 | "text/html": [ 1763 | "\n", 1764 | "\n", 1765 | "\n", 1766 | " \n", 1767 | "\n", 1768 | "\n", 1769 | " \n", 1770 | "\n", 1771 | "\n", 1772 | " \n", 1773 | "\n", 1774 | "\n", 1775 | " \n", 1776 | "\n", 1777 | "\n", 1778 | " \n", 1779 | "\n", 1780 | "\n", 1781 | " \n", 1782 | "\n", 1783 | "\n", 1784 | " \n", 1785 | "\n", 1786 | "
" 1798 | ], 1799 | "text/plain": [ 1800 | "\n", 1801 | "\"\"\"\n", 1802 | " Logit Regression Results \n", 1803 | "==============================================================================\n", 1804 | "Dep. Variable: converted No. Observations: 290584\n", 1805 | "Model: Logit Df Residuals: 290582\n", 1806 | "Method: MLE Df Model: 1\n", 1807 | "Date: Sun, 20 Jan 2019 Pseudo R-squ.: 8.077e-06\n", 1808 | "Time: 16:32:06 Log-Likelihood: -1.0639e+05\n", 1809 | "converged: True LL-Null: -1.0639e+05\n", 1810 | " LLR p-value: 0.1899\n", 1811 | "==============================================================================\n", 1812 | " coef std err z P>|z| [0.025 0.975]\n", 1813 | "------------------------------------------------------------------------------\n", 1814 | "intercept -1.9888 0.008 -246.669 0.000 -2.005 -1.973\n", 1815 | "ab_page -0.0150 0.011 -1.311 0.190 -0.037 0.007\n", 1816 | "==============================================================================\n", 1817 | "\"\"\"" 1818 | ] 1819 | }, 1820 | "execution_count": 41, 1821 | "metadata": {}, 1822 | "output_type": "execute_result" 1823 | } 1824 | ], 1825 | "source": [ 1826 | "result.summary()" 1827 | ] 1828 | }, 1829 | { 1830 | "cell_type": "markdown", 1831 | "metadata": {}, 1832 | "source": [ 1833 | "e. What is the p-value associated with **ab_page**? Why does it differ from the value you found in **Part II**?

**Hint**: What are the null and alternative hypotheses associated with your regression model, and how do they compare to the null and alternative hypotheses in **Part II**?" 1834 | ] 1835 | }, 1836 | { 1837 | "cell_type": "markdown", 1838 | "metadata": {}, 1839 | "source": [ 1840 | "The p-value associated with ab_page is 0.19. It differs because the two approaches test different hypotheses. In Part II we calculated the probability of observing our statistic if the one-sided null hypothesis is true, so that was a one-sided test. The ab_page p-value, however, is the result of a two-sided test, because the null hypothesis here is that there is no relationship at all between ab_page and the conversion rate. A variable with a low p-value would therefore be \"a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable\" (http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients).\n", 1841 | "\n", 1842 | "Based on that p-value we can say that conversion does not depend significantly on the page. \n" 1843 | ] 1844 | }, 1845 | { 1846 | "cell_type": "markdown", 1847 | "metadata": {}, 1848 | "source": [ 1849 | "f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?" 1850 | ] 1851 | }, 1852 | { 1853 | "cell_type": "markdown", 1854 | "metadata": {}, 1855 | "source": [ 1856 | "Other features to consider could be values extracted from the timestamp, for example the day of the week, or demographic information such as gender or income (if this data were available). This could lead to more precise results and higher accuracy. The disadvantages are the increasing complexity of interpretation and the possible introduction of multicollinearity. The latter problem can, however, be detected by calculating the VIFs. " 1857 | ] 1858 | }, 1859 | { 1860 | "cell_type": "markdown", 1861 | "metadata": {}, 1862 | "source": [ 1863 | "g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the **countries.csv** dataset and merge together your datasets on the appropriate rows. [Here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) are the docs for joining tables. \n", 1864 | "\n", 1865 | "Does it appear that country had an impact on conversion? Don't forget to create dummy variables for these country columns - **Hint: You will need two columns for the three dummy variables.** Provide the statistical output as well as a written response to answer this question." 1866 | ] 1867 | }, 1868 | { 1869 | "cell_type": "code", 1870 | "execution_count": 42, 1871 | "metadata": {}, 1872 | "outputs": [ 1873 | { 1874 | "data": { 1875 | "text/html": [ 1876 | "
" 1927 | ], 1928 | "text/plain": [ 1929 | " user_id country\n", 1930 | "0 834778 UK\n", 1931 | "1 928468 US\n", 1932 | "2 822059 UK\n", 1933 | "3 711597 UK\n", 1934 | "4 710616 UK" 1935 | ] 1936 | }, 1937 | "execution_count": 42, 1938 | "metadata": {}, 1939 | "output_type": "execute_result" 1940 | } 1941 | ], 1942 | "source": [ 1943 | "df_countries = pd.read_csv(\"countries.csv\")\n", 1944 | "df_countries.head()" 1945 | ] 1946 | }, 1947 | { 1948 | "cell_type": "code", 1949 | "execution_count": 43, 1950 | "metadata": {}, 1951 | "outputs": [], 1952 | "source": [ 1953 | "#merge the dataframes together\n", 1954 | "df_log_country = df_log.merge(df_countries, on=\"user_id\", how = \"left\")" 1955 | ] 1956 | }, 1957 | { 1958 | "cell_type": "code", 1959 | "execution_count": 44, 1960 | "metadata": {}, 1961 | "outputs": [ 1962 | { 1963 | "data": { 1964 | "text/html": [ 1965 | "
" 2076 | ], 2077 | "text/plain": [ 2078 | " user_id timestamp group landing_page converted \\\n", 2079 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n", 2080 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n", 2081 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n", 2082 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n", 2083 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n", 2084 | "\n", 2085 | " intercept control ab_page country CA UK US \n", 2086 | "0 1 1 0 US 0 0 1 \n", 2087 | "1 1 1 0 US 0 0 1 \n", 2088 | "2 1 0 1 US 0 0 1 \n", 2089 | "3 1 0 1 US 0 0 1 \n", 2090 | "4 1 1 0 US 0 0 1 " 2091 | ] 2092 | }, 2093 | "execution_count": 44, 2094 | "metadata": {}, 2095 | "output_type": "execute_result" 2096 | } 2097 | ], 2098 | "source": [ 2099 | "df_log_country = df_log_country.join(pd.get_dummies(df_log_country['country']))\n", 2100 | "df_log_country.head()" 2101 | ] 2102 | }, 2103 | { 2104 | "cell_type": "code", 2105 | "execution_count": 45, 2106 | "metadata": {}, 2107 | "outputs": [ 2108 | { 2109 | "name": "stdout", 2110 | "output_type": "stream", 2111 | "text": [ 2112 | "Optimization terminated successfully.\n", 2113 | " Current function value: 0.366113\n", 2114 | " Iterations 6\n" 2115 | ] 2116 | }, 2117 | { 2118 | "data": { 2119 | "text/html": [ 2120 | "\n", 2121 | "\n", 2122 | "\n", 2123 | " \n", 2124 | "\n", 2125 | "\n", 2126 | " \n", 2127 | "\n", 2128 | "\n", 2129 | " \n", 2130 | "\n", 2131 | "\n", 2132 | " \n", 2133 | "\n", 2134 | "\n", 2135 | " \n", 2136 | "\n", 2137 | "\n", 2138 | " \n", 2139 | "\n", 2140 | "\n", 2141 | " \n", 2142 | "\n", 2143 | "
" 2161 | ], 2162 | "text/plain": [ 2163 | "\n", 2164 | "\"\"\"\n", 2165 | " Logit Regression Results \n", 2166 | "==============================================================================\n", 2167 | "Dep. Variable: converted No. Observations: 290584\n", 2168 | "Model: Logit Df Residuals: 290580\n", 2169 | "Method: MLE Df Model: 3\n", 2170 | "Date: Sun, 20 Jan 2019 Pseudo R-squ.: 2.323e-05\n", 2171 | "Time: 16:32:11 Log-Likelihood: -1.0639e+05\n", 2172 | "converged: True LL-Null: -1.0639e+05\n", 2173 | " LLR p-value: 0.1760\n", 2174 | "==============================================================================\n", 2175 | " coef std err z P>|z| [0.025 0.975]\n", 2176 | "------------------------------------------------------------------------------\n", 2177 | "intercept -1.9893 0.009 -223.763 0.000 -2.007 -1.972\n", 2178 | "ab_page -0.0149 0.011 -1.307 0.191 -0.037 0.007\n", 2179 | "CA -0.0408 0.027 -1.516 0.130 -0.093 0.012\n", 2180 | "UK 0.0099 0.013 0.743 0.457 -0.016 0.036\n", 2181 | "==============================================================================\n", 2182 | "\"\"\"" 2183 | ] 2184 | }, 2185 | "execution_count": 45, 2186 | "metadata": {}, 2187 | "output_type": "execute_result" 2188 | } 2189 | ], 2190 | "source": [ 2191 | "y = df_log_country[\"converted\"]\n", 2192 | "x = df_log_country[[\"intercept\", \"ab_page\", \"CA\", \"UK\"]]\n", 2193 | "\n", 2194 | "log_mod = sm.Logit(y,x)\n", 2195 | "results = log_mod.fit()\n", 2196 | "results.summary()" 2197 | ] 2198 | }, 2199 | { 2200 | "cell_type": "markdown", 2201 | "metadata": {}, 2202 | "source": [ 2203 | "Based on the country features' p-values we can say that these features also don't have a significant impact on the conversion rate. \n", 2204 | "\n", 2205 | "However, we could interpret these coefficients as follows:" 2206 | ] 2207 | }, 2208 | { 2209 | "cell_type": "code", 2210 | "execution_count": 46, 2211 | "metadata": {}, 2212 | "outputs": [ 2213 | { 2214 | "name": "stdout", 2215 | "output_type": "stream", 2216 | "text": [ 2217 | "ab_page reciprocal exponential: 1.01501155838 - A conversion is 1.015 times less likely, if a user receives the treatment page, holding all other variables constant\n", 2218 | "\n", 2219 | "CA reciprocal exponential: 1.04164375596 - A conversion is 1.042 times less likely, if the user lives in CA and not the US.\n", 2220 | "\n", 2221 | "UK exponential: 1.00994916712 - A conversion is 1.00995 times more likely, if the user lives in UK and not the US.\n" 2222 | ] 2223 | } 2224 | ], 2225 | "source": [ 2226 | "print(\"ab_page reciprocal exponential: \", 1/np.exp(-0.0149), \"-\", \"A conversion is 1.015 times less likely, if a user receives the treatment page, holding all other variables constant\\n\"\n", 2227 | " \"\\nCA reciprocal exponential: \", 1/np.exp(-0.0408), \"-\", \"A conversion is 1.042 times less likely, if the user lives in CA and not the US.\\n\"\n", 2228 | " \"\\nUK exponential: \",np.exp(0.0099),\"-\", \"A conversion is 1.00995 times more likely, if the user lives in UK and not the US.\")" 2229 | ] 2230 | }, 2231 | { 2232 | "cell_type": "markdown", 2233 | "metadata": {}, 2234 | "source": [ 2235 | "h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model. \n", 2236 | "\n", 2237 | "Provide the summary results, and your conclusions based on the results."
2238 | ] 2239 | }, 2240 | { 2241 | "cell_type": "code", 2242 | "execution_count": 47, 2243 | "metadata": {}, 2244 | "outputs": [ 2245 | { 2246 | "data": { 2247 | "text/html": [ 2248 | "
" 2359 | ], 2360 | "text/plain": [ 2361 | " user_id timestamp group landing_page converted \\\n", 2362 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n", 2363 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n", 2364 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n", 2365 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n", 2366 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n", 2367 | "\n", 2368 | " intercept control ab_page country CA UK US \n", 2369 | "0 1 1 0 US 0 0 1 \n", 2370 | "1 1 1 0 US 0 0 1 \n", 2371 | "2 1 0 1 US 0 0 1 \n", 2372 | "3 1 0 1 US 0 0 1 \n", 2373 | "4 1 1 0 US 0 0 1 " 2374 | ] 2375 | }, 2376 | "execution_count": 47, 2377 | "metadata": {}, 2378 | "output_type": "execute_result" 2379 | } 2380 | ], 2381 | "source": [ 2382 | "df_log_country.head()" 2383 | ] 2384 | }, 2385 | { 2386 | "cell_type": "code", 2387 | "execution_count": 48, 2388 | "metadata": {}, 2389 | "outputs": [], 2390 | "source": [ 2391 | "#create the interaction higher order term for the ab_page and country columns\n", 2392 | "df_log_country[\"CA_page\"], df_log_country[\"UK_page\"] = df_log_country[\"CA\"] * df_log_country[\"ab_page\"], df_log_country[\"UK\"] * df_log_country[\"ab_page\"]" 2393 | ] 2394 | }, 2395 | { 2396 | "cell_type": "code", 2397 | "execution_count": 49, 2398 | "metadata": {}, 2399 | "outputs": [ 2400 | { 2401 | "data": { 2402 | "text/html": [ 2403 | "
" 2526 | ], 2527 | "text/plain": [ 2528 | " user_id timestamp group landing_page converted \\\n", 2529 | "0 851104 2017-01-21 22:11:48.556739 control old_page 0 \n", 2530 | "1 804228 2017-01-12 08:01:45.159739 control old_page 0 \n", 2531 | "2 661590 2017-01-11 16:55:06.154213 treatment new_page 0 \n", 2532 | "3 853541 2017-01-08 18:28:03.143765 treatment new_page 0 \n", 2533 | "4 864975 2017-01-21 01:52:26.210827 control old_page 1 \n", 2534 | "\n", 2535 | " intercept control ab_page country CA UK US CA_page UK_page \n", 2536 | "0 1 1 0 US 0 0 1 0 0 \n", 2537 | "1 1 1 0 US 0 0 1 0 0 \n", 2538 | "2 1 0 1 US 0 0 1 0 0 \n", 2539 | "3 1 0 1 US 0 0 1 0 0 \n", 2540 | "4 1 1 0 US 0 0 1 0 0 " 2541 | ] 2542 | }, 2543 | "execution_count": 49, 2544 | "metadata": {}, 2545 | "output_type": "execute_result" 2546 | } 2547 | ], 2548 | "source": [ 2549 | "df_log_country.head()" 2550 | ] 2551 | }, 2552 | { 2553 | "cell_type": "code", 2554 | "execution_count": 50, 2555 | "metadata": {}, 2556 | "outputs": [ 2557 | { 2558 | "name": "stdout", 2559 | "output_type": "stream", 2560 | "text": [ 2561 | "Optimization terminated successfully.\n", 2562 | " Current function value: 0.366109\n", 2563 | " Iterations 6\n" 2564 | ] 2565 | }, 2566 | { 2567 | "data": { 2568 | "text/html": [ 2569 | "\n", 2570 | "\n", 2571 | "\n", 2572 | " \n", 2573 | "\n", 2574 | "\n", 2575 | " \n", 2576 | "\n", 2577 | "\n", 2578 | " \n", 2579 | "\n", 2580 | "\n", 2581 | " \n", 2582 | "\n", 2583 | "\n", 2584 | " \n", 2585 | "\n", 2586 | "\n", 2587 | " \n", 2588 | "\n", 2589 | "\n", 2590 | " \n", 2591 | "\n", 2592 | "
" 2616 | ], 2617 | "text/plain": [ 2618 | "\n", 2619 | "\"\"\"\n", 2620 | " Logit Regression Results \n", 2621 | "==============================================================================\n", 2622 | "Dep. Variable: converted No. Observations: 290584\n", 2623 | "Model: Logit Df Residuals: 290578\n", 2624 | "Method: MLE Df Model: 5\n", 2625 | "Date: Sun, 20 Jan 2019 Pseudo R-squ.: 3.482e-05\n", 2626 | "Time: 16:32:16 Log-Likelihood: -1.0639e+05\n", 2627 | "converged: True LL-Null: -1.0639e+05\n", 2628 | " LLR p-value: 0.1920\n", 2629 | "==============================================================================\n", 2630 | " coef std err z P>|z| [0.025 0.975]\n", 2631 | "------------------------------------------------------------------------------\n", 2632 | "intercept -1.9865 0.010 -206.344 0.000 -2.005 -1.968\n", 2633 | "ab_page -0.0206 0.014 -1.505 0.132 -0.047 0.006\n", 2634 | "CA -0.0175 0.038 -0.465 0.642 -0.091 0.056\n", 2635 | "UK -0.0057 0.019 -0.306 0.760 -0.043 0.031\n", 2636 | "CA_page -0.0469 0.054 -0.872 0.383 -0.152 0.059\n", 2637 | "UK_page 0.0314 0.027 1.181 0.238 -0.021 0.084\n", 2638 | "==============================================================================\n", 2639 | "\"\"\"" 2640 | ] 2641 | }, 2642 | "execution_count": 50, 2643 | "metadata": {}, 2644 | "output_type": "execute_result" 2645 | } 2646 | ], 2647 | "source": [ 2648 | "y = df_log_country[\"converted\"]\n", 2649 | "x = df_log_country[[\"intercept\", \"ab_page\", \"CA\", \"UK\", \"CA_page\", \"UK_page\"]]\n", 2650 | "\n", 2651 | "log_mod = sm.Logit(y,x)\n", 2652 | "results = log_mod.fit()\n", 2653 | "results.summary()" 2654 | ] 2655 | }, 2656 | { 2657 | "cell_type": "markdown", 2658 | "metadata": {}, 2659 | "source": [ 2660 | "Based on these results, we can see that the p-values for the interaction terms are definitely not significant and that they even decrease the significance of the original \"CA\" and \"UK\" columns. Therefore we should not include these higher-order terms in our model."
2661 | ] 2662 | }, 2663 | { 2664 | "cell_type": "code", 2665 | "execution_count": 165, 2666 | "metadata": {}, 2667 | "outputs": [ 2668 | { 2669 | "data": { 2670 | "text/plain": [ 2671 | "0" 2672 | ] 2673 | }, 2674 | "execution_count": 165, 2675 | "metadata": {}, 2676 | "output_type": "execute_result" 2677 | } 2678 | ], 2679 | "source": [ 2680 | "from subprocess import call\n", 2681 | "call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])" 2682 | ] 2683 | }, 2684 | { 2685 | "cell_type": "code", 2686 | "execution_count": null, 2687 | "metadata": {}, 2688 | "outputs": [], 2689 | "source": [] 2690 | } 2691 | ], 2692 | "metadata": { 2693 | "kernelspec": { 2694 | "display_name": "Python 3", 2695 | "language": "python", 2696 | "name": "python3" 2697 | }, 2698 | "language_info": { 2699 | "codemirror_mode": { 2700 | "name": "ipython", 2701 | "version": 3 2702 | }, 2703 | "file_extension": ".py", 2704 | "mimetype": "text/x-python", 2705 | "name": "python", 2706 | "nbconvert_exporter": "python", 2707 | "pygments_lexer": "ipython3", 2708 | "version": "3.6.3" 2709 | } 2710 | }, 2711 | "nbformat": 4, 2712 | "nbformat_minor": 2 2713 | } 2714 | -------------------------------------------------------------------------------- /P2-Analyze-A-B-Test-Results/README.md: -------------------------------------------------------------------------------- 1 | ## P2: Analyze A-B Test Results 2 | 3 | ### Prerequisites 4 | 5 | Additional installations: 6 | 7 | * None 8 | 9 | ## Project Overview 10 | 11 | ### Data Sources 12 | 13 | **Name: ab_data.csv** 14 | * Source: Udacity 15 | 16 | ### Authors 17 | 18 | * Christoph Lindstädt 19 | * Udacity 20 | 21 | ## License 22 | 23 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License 24 | 25 | 26 | Creative Commons License 27 | 28 | -------------------------------------------------------------------------------- /P3-Analyze-Twitter-Data/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## P3: Analyze Twitter Data 3 | 4 | ### Prerequisites 5 | 6 | Additional installations: 7 | 8 | * [Missingno](https://github.com/ResidentMario/missingno) 9 | * [Tweepy](http://www.tweepy.org/) 10 | * [Requests](http://docs.python-requests.org/en/master/) 11 | 12 | ## Project Overview 13 | 14 | ### Data Sources 15 | 16 | **Name: WeRateDogs™ Twitter Archive (twitter-archive-enhanced.csv)** 17 | 18 | - Source: Udacity 19 | - Version: Latest (Download 09.02.2019) 20 | - Method of gathering: Manual download 21 | 22 | **Name: Tweet image predictions (image_predictions.tsv)** 23 | 24 | - Source: Udacity 25 | - Version: Latest (Download 09.02.2019) 26 | - Method of gathering: Programmatic download via Requests 27 | 28 | **Name: Additional Twitter data (tweet_json.txt)** 29 | 30 | - Source: WeRateDogs™ 31 | - Version: Latest (Gathered 09.02.2019) 32 | - Method of gathering: API via Tweepy 33 | 34 | ### Wrangling 35 | 36 | **Cleaning steps:** 37 | 38 | - Merge the tables together 39 | - Drop the replies, retweets and the corresponding columns and also drop the tweets without an image or with images which don't display doggos 40 | - Clean the datatypes of the columns 41 | - Clean the wrong numerators - the floats on the one hand (replacement), the ones with multiple occurrences of the pattern on the other (drop) 42 | - Extract the source from html code 43 | - Split the text range into two separate columns 44 | - Remove the "None" out of the doggo, floofer, pupper and puppo columns and merge them into one
column 45 | - Remove the wrong names from the name column 46 | - Reduce the prediction columns to two - breed and conf 47 | - Clean the new breed column by replacing the "_" with a whitespace and making all values lowercase 48 | 49 | ### Summary 50 | 51 | **Questions:** 52 | 53 | > Based on the predicted, most likely dog breed: Which breed gets retweeted and favorited the most overall? 54 | - The winner for our analysis was the Labrador Retriever. 55 | > How did the account develop (speaking about number of tweets, retweets, favorites, image number and length of the tweets)? 56 | - We found that the number of tweets per month decreased, while the retweets and favorites show an uptrend. For the image numbers there is no clear trend visible; the length of the tweets moved a little closer to the maximum of 130 in the second half of the dataset. 57 | > Is there a pattern visible in the timing of the tweets? 58 | - There are nearly no tweets at all between 5 and 15 o'clock. Most tweets are posted between 0 and 4 o'clock and then again between 15 and 23 o'clock. 59 | 60 | ### Authors 61 | 62 | * Christoph Lindstädt 63 | * Udacity 64 | 65 | ## License 66 | 67 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License 68 | 69 | 70 | Creative Commons License 71 | 72 | 73 | -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Data/README.md: -------------------------------------------------------------------------------- 1 | # Data and License 2 | 3 | The data can be downloaded here. 4 | 5 | **License:** Ford GoBike Data License Agreement 6 | 7 | You can also run "gather_goBike_data.py" to request the data. 8 | -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Data/gather_goBike_data.py: -------------------------------------------------------------------------------- 1 | import requests 2 | 3 | #define the year-month codes to gather (2018-01 to 2018-12 and 2019-01 to 2019-02) 4 | year_data = [x for x in range(201801, 201813)] + [x for x in range(201901, 201903)] 5 | 6 | #loop over all files 7 | for year in year_data: 8 | 9 | #build the download url for this year-month 10 | url = f"https://s3.amazonaws.com/fordgobike-data/{year}-fordgobike-tripdata.csv.zip" 11 | 12 | #request the url 13 | response = requests.get(url) 14 | 15 | #open a file with the same name as the downloaded one and write the content to the file 16 | with open(f"{year}-fordgobike-tripdata.csv.zip", mode = "wb") as file: 17 | file.write(response.content) 18 | -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Images/README.md: -------------------------------------------------------------------------------- 1 | Source: kepler.gl 2 | -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Images/east_bay_500.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/east_bay_500.PNG -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Images/san_francisco_1000.PNG: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/san_francisco_1000.PNG -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Images/san_jose_200.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/san_jose_200.PNG -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Images/stations_1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/stations_1.PNG -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Images/stations_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/stations_2.png -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/Images/stations_kepler.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/P4-Communicate-Data-Findings/Images/stations_kepler.png -------------------------------------------------------------------------------- /P4-Communicate-Data-Findings/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## P4: Communicate Data Findings (Ford GoBike Data) 3 | 4 | ### Prerequisites 5 | 6 | Additional installations: 7 | 8 | * [Missingno](https://github.com/ResidentMario/missingno) 9 | 10 | ## Project Overview 11 | 12 | ### Data Sources 13 | 14 | **Name:** result.csv 15 | * Definition: Ford GoBike System - Data 16 | * Source: https://www.fordgobike.com/system-data 17 | * Version: Files from 01.2018 - 02.2019 18 | 19 | ### Wrangling 20 | 21 | ### Analysis 22 | 23 | ### Summary 24 | 25 | ### Authors 26 | 27 | * Christoph Lindstädt 28 | * Udacity 29 | 30 | ## License 31 | 32 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License 33 | 34 | 35 | Creative Commons License 36 | 37 | 38 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # [Udacity Data Analyst Nanodegree](https://www.udacity.com/course/data-analyst-nanodegree--nd002) 2 | 3 | > Discover insights from data via Python and SQL. 4 | 5 | ## Skills Acquired (Summary) 6 | 7 | 8 | ### Prerequisites 9 | 10 | You'll need to install: 11 | 12 | * [Python (3.x or higher)](https://www.python.org/downloads/) 13 | * [Jupyter Notebook](https://jupyter.org/) 14 | * [Numpy](http://www.numpy.org/) 15 | * [Pandas](http://pandas.pydata.org/) 16 | * [Matplotlib](https://matplotlib.org/) 17 | * [Seaborn](https://seaborn.pydata.org/) 18 | 19 | And additional libraries defined in each project. 
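A quick way to confirm the environment has everything it needs is to print the installed versions (a minimal sketch - the import names are the standard ones, the exact versions will differ per machine):

```python
# Print the version of each core library used across the projects.
import numpy, pandas, matplotlib, seaborn

for lib in (numpy, pandas, matplotlib, seaborn):
    print(f"{lib.__name__}: {lib.__version__}")
```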
20 | 21 | Recommended: 22 | 23 | * [Anaconda](https://www.anaconda.com/distribution/#download-section) 24 | 25 | ## Project Overview 26 | ### P0: Explore Weather Trends 27 | 28 | The first chapter was an introduction to the following projects of the Data Analyst Nanodegree. 29 | 30 | The first chapter's project was about weather trends - it required applying (at least) the following steps: 31 | * Extract data from a database using a SQL query 32 | * Calculate a moving average 33 | * Create a line chart 34 | 35 | I analyzed local and global temperature data and compared the temperature trends in three German cities to overall global temperature trends. After cleaning the data, I created a function to handle all the tasks needed to plot the data - for example calculating the linear trend and the rolling average. In addition, the function offered various visualization options to produce different graphs. 36 | 37 | **Key findings**: 38 | - the average global temperature is increasing, and at an increasing pace 39 | - Berlin is the only city in Germany in this dataset which has a higher average temperature than the global average 40 | 41 | ![Global Weather Trend](https://github.com/DataLind/Udacity-Data-Analyst-Nanodegree/blob/master/global_weather_trend.png) 42 | 43 | ### P1: Investigate a Dataset (Gapminder World Dataset) 44 | 45 | This chapter was all about the data analysis process as a whole: everything from gathering, assessing, cleaning and wrangling data to exploring and visualizing it, including the programming workflow and the communication of results. 46 | 47 | The project therefore included all steps of the typical data analysis process: 48 | - posing questions 49 | - gathering, wrangling and cleaning data 50 | - communicating answers to the questions 51 | - assisted by visualizations and statistics. 52 | 53 | Out of the project: 54 | 55 | > This project will examine datasets available at Gapminder. To be more specific, it will take a closer look on the life expectancy of the population from different countries and the influences from other variables. It will also take a look on the development of these variables over time. 56 | > 57 | >**What is Gapminder?** 58 | "Gapminder is an independent Swedish foundation with no political, religious or economic affiliations. Gapminder is a fact tank, not a think tank. Gapminder fights devastating misconceptions about global development." (https://www.gapminder.org/about-gapminder/) 59 | 60 | Here we were confronted with the full joy of a real-life dataset: from a hard-to-analyze structure and missing, messy, dirty data to - after finally being done with data wrangling - the reward of interesting insights. 61 | 62 | ![Life Expectancy To Income 2018](https://github.com/DataLind/Udacity-Data-Analyst-Nanodegree/blob/master/life_expectancy_to_income_2018.png) 63 | 64 | ### P2: Analyze A/B Test Results 65 | 66 | The following chapter was filled with *a lot* of information. We talked about: Data Types, Notation, Mean, Standard Deviation, Correlation, Data Shapes, Outliers, Bias, Dangers, Probability and Bayes, Distributions, Central Limit Theorem, Bootstrapping, Confidence Intervals, Hypothesis Testing, A/B Tests, Linear Regression, Logistic Regression and more.. *heavy breathing 67 | 68 | The goal of the project in this chapter was to get experience with A/B testing and its difficulties and drawbacks.
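At its core, the conversion comparison in this project boils down to a two-proportion z-test. The sketch below is only illustrative - it uses the same `statsmodels` call as the project notebook, but with made-up counts rather than the real numbers derived from `ab_data.csv`:

```python
# Two-proportion z-test: does the new page convert better than the old one?
# The counts are illustrative placeholders, roughly the scale of the project data.
import statsmodels.api as sm

convert_old, convert_new = 17500, 17300   # converted users in each group
n_old, n_new = 145000, 145000             # users who received each page

# alternative="smaller" tests H1: p_old < p_new (the new page is better)
z_score, p_value = sm.stats.proportions_ztest(
    [convert_old, convert_new], [n_old, n_new], alternative="smaller"
)
print(f"z = {z_score:.3f}, p = {p_value:.3f}")
```

A large p-value here means there is not enough evidence that the new page converts better - which is exactly the conclusion the project reaches.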
First, we learned what A/B testing is all about - including metrics like the Click Through Rate (CTR) and how to analyze these metrics properly. Second, we learned about drawbacks like the novelty effect and change aversion. 69 | 70 | In the end we brought everything we had learned together to analyze this A/B test properly. 71 | 72 | ![Sampling distribution](https://github.com/DataLind/Udacity-Data-Analyst-Nanodegree/blob/master/sampling_dist.png) 73 | 74 | ### P3: Gather, Clean and Analyze Twitter Data (WeRateDogs™ (@dog_rates)) 75 | 76 | This chapter was a deep dive into the data wrangling part of the data analysis process. We learned about the difference between messy and dirty data, what tidy data should look like, about the assessing, defining, cleaning and testing process, etc. Moreover, we talked about many different file types and different methods of gathering data. 77 | 78 | In this project we had to deal with the reality of dirty and messy data (again). We gathered data from different sources (for example the Twitter API) and identified issues with the dataset in terms of tidiness and quality. Afterwards we had to solve these problems while documenting each step. The end of the project was then focused on the exploration of the data. 79 | 80 | ![Mean of retweets](https://github.com/DataLind/Udacity-Data-Analyst-Nanodegree/blob/master/mean_of_retweets_per_month-year_combination.png) 81 | 82 | ### P4: Communicate Data Findings 83 | 84 | The final chapter was focused on proper visualization of data. We learned about chart junk, uni-, bi- and multivariate visualization, use of color, the data/ink ratio, the lie factor, other encodings, [...]. 85 | 86 | The task of the final project was to analyze and visualize real-world data. I chose the Ford GoBike dataset.
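A chart like the one below (relative user frequency by gender and area) takes only a few lines with pandas and seaborn. This is a minimal sketch, assuming the combined trip data was saved as `result.csv` (as listed in the P4 README) and that a `member_gender` column and a derived `area` column exist - those column names are assumptions, not guaranteed by the raw files:

```python
# Relative frequency of users by gender within each service area (column names assumed).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("result.csv")

# count trips per (area, gender) combination and normalize within each area
counts = df.groupby(["area", "member_gender"]).size().rename("n").reset_index()
counts["rel_freq"] = counts["n"] / counts.groupby("area")["n"].transform("sum")

sns.barplot(data=counts, x="area", y="rel_freq", hue="member_gender")
plt.ylabel("relative frequency")
plt.show()
```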
87 | 88 | ![Relative Userfrequncy by gender and area](https://github.com/DataLind/Udacity-Data-Analyst-Nanodegree/blob/master/rel_userfreq_by_gender_and_area.png) 89 | 90 | ## License 91 | 92 | * Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License 93 | 94 | 95 | Creative Commons License 96 | 97 | -------------------------------------------------------------------------------- /global_weather_trend.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/global_weather_trend.png -------------------------------------------------------------------------------- /life_expectancy_to_income_2018.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/life_expectancy_to_income_2018.png -------------------------------------------------------------------------------- /mean_of_retweets_per_month-year_combination.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/mean_of_retweets_per_month-year_combination.png -------------------------------------------------------------------------------- /rel_userfreq_by_gender_and_area.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/rel_userfreq_by_gender_and_area.png -------------------------------------------------------------------------------- /sampling_dist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chrislicodes/udacity-data-analyst-nanodegree/4f0ba7e3fad707371d9f985d53709ab081a59ac5/sampling_dist.png --------------------------------------------------------------------------------