├── Clase 10_Regresion
│   ├── C10_Metodos Supervisados I_Regresion lineal.pptx
│   └── Tutorial10_UBA_Regresion.ipynb
├── Clase 11_Intro Clasificación_Logit_KNN
│   ├── C11_Clasificacion 1_Bayes_Logit_kNN.pptx
│   └── Tutorial11_UBA_logit.ipynb
├── Clase 12_Clasificación_Curvas ROC
│   ├── C12_Clasificacion 2_Curvas_ROC_Comparaciones.pptx
│   ├── Tutorial12_UBA_ROC.ipynb
│   └── ~$C12_Clasificacion 2_LDA_QDA_ROC_Comparaciones.pptx
├── Clase 13_Enfoque_Validacion_CV
│   ├── C13_Enfoque_Validación_Cross_validation.pptx
│   └── Tutorial13_UBA_CV.ipynb
├── Clase 14_Regularizacion 1_Ridge
│   ├── C14_Regularizacion 1_Ridge.pptx
│   └── Tutorial14_UBA_Ridge.ipynb
├── Clase 15_Regularización 2_LASSO
│   ├── C15_Regularizacion 2 _LASSO.pptx
│   └── Tutorial15_UBA_LASSO.ipynb
├── Clase 16_CART
│   ├── C16_CART.pptx
│   ├── Hitters.csv
│   ├── Tutorial16_UBA_CART.ipynb
│   ├── test.csv
│   └── train.csv
├── Clase 1_Presentación del curso
│   └── C1_Curso y Intro Machine Learning_Visualizacion.pptx
├── Clase 2_Introduccion a Python y GitHub
│   └── Clase 2_UBA.ipynb
├── Clase 3_Intro Python II_Carga de base de datos
│   ├── Clase 3_UBA.ipynb
│   ├── archivo.txt
│   ├── ejemplo.csv
│   ├── ejemplo.dta
│   ├── ejemplo.xlsx
│   ├── exportar_ejemplo.xlsx
│   └── exportar_ejemplo2.xlsx
├── Clase 4_Pandas & Matplotlib
│   ├── Clase4_P1(Pandas).ipynb
│   ├── Clase4_P2(Matplotlib).ipynb
│   ├── potencia_instalada.xlsx
│   ├── tabla_ejemplo.xlsx
│   ├── tabla_ejemplo_2.xlsx
│   ├── tabla_ejemplo_3a.xlsx
│   ├── tabla_ejemplo_3b.xlsx
│   └── tabla_ejemplo_3c.xlsx
├── Clase 5_APIs & Webscrapping
│   ├── Clase5_UBA_APIs.ipynb
│   ├── Clase5_UBA_APIs_merdadolibre.ipynb
│   └── Clase5_UBA_WebScraping.ipynb
├── Clase 6_PCA
│   ├── C2_Metodos No Supervisados I_PCA.pptx
│   ├── Clase6_UBA_PCA.ipynb
│   ├── USArrests.csv
│   └── Wine.csv
├── Clase 7_Cluster
│   ├── C7_Metodos no supervisados II_Cluster.pptx
│   └── Clase7_UBA_Cluster.ipynb
├── Clase 8_Histogramas
│   ├── C8_Metodos No Paramétricos I_Histogramas.pptx
│   └── Tutorial8_UBA_Hist.ipynb
├── Clase 9_Kernels
│   ├── C9_Metodos No Paramétricos II_Kernels.pptx
│   └── Tutorial9_UBA_Kernels.ipynb
├── Clase17_Ensamble 1_Bagging
│   ├── C17_Metodos de Ensamble I_Bagging.pptx
│   └── Tutorial17_UBA_Bootstrapping & Bagging.ipynb
├── Guía para las exposiciones grupales.pdf
├── README.md
└── TPs
    ├── TP0_Practica opcional de Python
    │   └── TP0_UBA_Practica de Python.ipynb
    ├── TP1_Jugando con APIS y Webscrapping
    │   └── TP1-UBA.ipynb
    ├── TP2_Introducción a la EPH
    │   ├── Big Data_UBA_TP2.docx
    │   ├── Big Data_UBA_TP2.pdf
    │   └── ~$g Data_UBA_TP2.docx
    ├── TP3_EPH_Hist, Kernels & M. no supervisados
    │   ├── Big Data_UBA_TP3.docx
    │   ├── Big Data_UBA_TP3.pdf
    │   └── ~$g Data_UBA_TP3.docx
    └── TP4_Regresión&Clasificación
        ├── Big Data_UBA_TP4.docx
        ├── Big Data_UBA_TP4.pdf
        └── ~$g Data_UBA_TP4.docx

/Clase 10_Regresion/C10_Metodos Supervisados I_Regresion lineal.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 10_Regresion/C10_Metodos Supervisados I_Regresion lineal.pptx
--------------------------------------------------------------------------------

/Clase 11_Intro Clasificación_Logit_KNN/C11_Clasificacion 1_Bayes_Logit_kNN.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 11_Intro Clasificación_Logit_KNN/C11_Clasificacion 1_Bayes_Logit_kNN.pptx
--------------------------------------------------------------------------------

/Clase 12_Clasificación_Curvas ROC/C12_Clasificacion 2_Curvas_ROC_Comparaciones.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 12_Clasificación_Curvas ROC/C12_Clasificacion 2_Curvas_ROC_Comparaciones.pptx
--------------------------------------------------------------------------------

/Clase 12_Clasificación_Curvas ROC/~$C12_Clasificacion 2_LDA_QDA_ROC_Comparaciones.pptx:
--------------------------------------------------------------------------------
1 | Noelia Romero Noelia Romero
--------------------------------------------------------------------------------

/Clase 13_Enfoque_Validacion_CV/C13_Enfoque_Validación_Cross_validation.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 13_Enfoque_Validacion_CV/C13_Enfoque_Validación_Cross_validation.pptx
--------------------------------------------------------------------------------

/Clase 14_Regularizacion 1_Ridge/C14_Regularizacion 1_Ridge.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 14_Regularizacion 1_Ridge/C14_Regularizacion 1_Ridge.pptx
--------------------------------------------------------------------------------

/Clase 15_Regularización 2_LASSO/C15_Regularizacion 2 _LASSO.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 15_Regularización 2_LASSO/C15_Regularizacion 2 _LASSO.pptx
--------------------------------------------------------------------------------

/Clase 16_CART/C16_CART.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 16_CART/C16_CART.pptx
--------------------------------------------------------------------------------

/Clase 16_CART/Hitters.csv:
--------------------------------------------------------------------------------
1 | ,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
2 | 
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A 3 | 1,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N 4 | 2,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A 5 | 3,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N 6 | 4,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N 7 | 5,594,169,4,74,51,35,11,4408,1133,19,501,336,194,A,W,282,421,25,750.0,A 8 | 6,185,37,1,23,8,21,2,214,42,1,30,9,24,N,E,76,127,7,70.0,A 9 | 7,298,73,0,24,24,7,3,509,108,0,41,37,12,A,W,121,283,9,100.0,A 10 | 8,323,81,6,26,32,8,2,341,86,6,32,34,8,N,W,143,290,19,75.0,N 11 | 9,401,92,17,49,66,65,13,5206,1332,253,784,890,866,A,E,0,0,0,1100.0,A 12 | 10,574,159,21,107,75,59,10,4631,1300,90,702,504,488,A,E,238,445,22,517.143,A 13 | 11,202,53,4,31,26,27,9,1876,467,15,192,186,161,N,W,304,45,11,512.5,N 14 | 12,418,113,13,48,61,47,4,1512,392,41,205,204,203,N,E,211,11,7,550.0,N 15 | 13,239,60,0,30,11,22,6,1941,510,4,309,103,207,A,E,121,151,6,700.0,A 16 | 14,196,43,7,29,27,30,13,3231,825,36,376,290,238,N,E,80,45,8,240.0,N 17 | 15,183,39,3,20,15,11,3,201,42,3,20,16,11,A,W,118,0,0,,A 18 | 16,568,158,20,89,75,73,15,8068,2273,177,1045,993,732,N,W,105,290,10,775.0,N 19 | 17,190,46,2,24,8,15,5,479,102,5,65,23,39,A,W,102,177,16,175.0,A 20 | 18,407,104,6,57,43,65,12,5233,1478,100,643,658,653,A,W,912,88,9,,A 21 | 19,127,32,8,16,22,14,8,727,180,24,67,82,56,N,W,202,22,2,135.0,N 22 | 20,413,92,16,72,48,65,1,413,92,16,72,48,65,N,E,280,9,5,100.0,N 23 | 21,426,109,3,55,43,62,1,426,109,3,55,43,62,A,W,361,22,2,115.0,N 24 | 22,22,10,1,4,2,1,6,84,26,2,9,9,3,A,W,812,84,11,,A 25 | 23,472,116,16,60,62,74,6,1924,489,67,242,251,240,N,W,518,55,3,600.0,N 26 | 24,629,168,18,73,102,40,18,8424,2464,164,1008,1072,402,A,E,1067,157,14,776.6669999999999,A 27 | 25,587,163,4,92,51,70,6,2695,747,17,442,198,317,A,E,434,9,3,765.0,A 28 | 26,324,73,4,32,18,22,7,1931,491,13,291,108,180,N,E,222,3,3,708.3330000000001,N 29 | 
27,474,129,10,50,56,40,10,2331,604,61,246,327,166,N,W,732,83,13,750.0,N 30 | 28,550,152,6,92,37,81,5,2308,633,32,349,182,308,N,W,262,329,16,625.0,N 31 | 29,513,137,20,90,95,90,14,5201,1382,166,763,734,784,A,W,267,5,3,900.0,A 32 | 30,313,84,9,42,30,39,17,6890,1833,224,1033,864,1087,A,W,127,221,7,,A 33 | 31,419,108,6,55,36,22,3,591,149,8,80,46,31,N,W,226,7,4,110.0,N 34 | 32,517,141,27,70,87,52,9,3571,994,215,545,652,337,N,W,1378,102,8,,N 35 | 33,583,168,17,83,80,56,5,1646,452,44,219,208,136,A,E,109,292,25,612.5,A 36 | 34,204,49,6,23,25,12,7,1309,308,27,126,132,66,A,W,419,46,5,300.0,A 37 | 35,379,106,10,38,60,30,14,6207,1906,146,859,803,571,N,W,72,170,24,850.0,N 38 | 36,161,36,0,19,10,17,4,1053,244,3,156,86,107,A,E,70,149,12,,A 39 | 37,268,60,5,24,25,15,2,350,78,5,34,29,18,N,W,442,59,6,90.0,N 40 | 38,346,98,5,31,53,30,16,5913,1615,235,784,901,560,A,E,0,0,0,,A 41 | 39,241,61,1,34,12,14,1,241,61,1,34,12,14,N,W,166,172,10,,N 42 | 40,181,41,1,15,21,33,2,232,50,4,20,29,45,A,E,326,29,5,67.5,A 43 | 41,216,54,0,21,18,15,18,7318,1926,46,796,627,483,N,W,103,84,5,,N 44 | 42,200,57,6,23,14,14,9,2516,684,46,371,230,195,N,W,69,1,1,,N 45 | 43,217,46,7,32,19,9,4,694,160,32,86,76,32,A,E,307,25,1,180.0,A 46 | 44,194,40,7,19,29,30,11,4183,1069,64,486,493,608,A,E,325,22,2,,A 47 | 45,254,68,2,28,26,22,6,999,236,21,108,117,118,A,E,359,30,4,305.0,A 48 | 46,416,132,7,57,49,33,3,932,273,24,113,121,80,N,W,73,177,18,215.0,N 49 | 47,205,57,8,34,32,9,5,756,192,32,117,107,51,A,E,58,4,4,247.5,A 50 | 48,542,140,12,46,75,41,16,7099,2130,235,987,1089,431,A,E,697,61,9,,A 51 | 49,526,146,13,71,70,84,6,2648,715,77,352,342,289,N,W,303,9,9,815.0,N 52 | 50,457,101,14,42,63,22,17,6521,1767,281,1003,977,619,A,W,389,39,4,875.0,A 53 | 51,214,53,2,30,29,23,2,226,59,2,32,32,27,N,E,109,7,3,70.0,N 54 | 52,19,7,0,1,2,1,4,41,13,1,3,4,4,A,E,0,0,0,,A 55 | 53,591,168,19,80,72,39,9,4478,1307,113,634,563,319,A,W,67,147,4,1200.0,A 56 | 54,403,101,12,45,53,39,12,5150,1429,166,747,666,526,A,E,316,6,5,675.0,A 57 | 
55,405,102,18,49,85,20,6,950,231,29,99,138,64,N,W,161,10,3,415.0,N 58 | 56,244,58,9,28,25,35,4,1335,333,49,164,179,194,N,W,142,14,2,340.0,N 59 | 57,235,61,3,24,39,21,14,3926,1029,35,441,401,333,A,E,425,43,4,,A 60 | 58,313,78,6,32,41,12,12,3742,968,35,409,321,170,N,W,106,206,7,416.667,N 61 | 59,627,177,25,98,81,70,6,3210,927,133,529,472,313,A,E,240,482,13,1350.0,A 62 | 60,416,113,24,58,69,16,1,416,113,24,58,69,16,A,E,203,70,10,90.0,A 63 | 61,155,44,6,21,23,15,16,6631,1634,98,698,661,777,N,E,53,88,3,275.0,N 64 | 62,236,56,0,27,15,11,4,1115,270,1,116,64,57,A,W,125,199,13,230.0,A 65 | 63,216,53,1,31,15,22,4,926,210,9,118,69,114,N,W,73,152,11,225.0,N 66 | 64,24,3,0,1,0,2,3,159,28,0,20,12,9,A,W,80,4,0,,A 67 | 65,585,139,31,93,94,62,17,7546,1982,315,1141,1179,727,A,E,0,0,0,950.0,A 68 | 66,191,37,4,12,17,14,4,773,163,16,61,74,52,N,E,391,38,8,,N 69 | 67,199,53,5,29,22,21,3,514,120,8,57,40,39,A,W,152,3,5,75.0,A 70 | 68,521,142,20,67,86,45,4,815,205,22,99,103,78,A,E,107,242,23,105.0,A 71 | 69,419,113,1,44,27,44,12,4484,1231,32,612,344,422,A,E,211,2,1,,A 72 | 70,311,81,3,42,30,26,17,8247,2198,100,950,909,690,N,W,153,223,10,320.0,N 73 | 71,138,31,8,18,21,38,3,244,53,12,33,32,55,N,E,244,21,4,,N 74 | 72,512,131,26,69,96,52,14,5347,1397,221,712,815,548,A,W,119,216,12,850.0,A 75 | 73,507,122,29,78,85,91,18,7761,1947,347,1175,1152,1380,A,E,808,108,2,535.0,A 76 | 74,529,137,26,86,97,97,15,6661,1785,291,1082,949,989,A,E,280,10,5,933.333,A 77 | 75,424,119,6,57,46,13,9,3651,1046,32,461,301,112,A,E,224,286,8,850.0,N 78 | 76,351,97,4,55,29,39,4,1258,353,16,196,110,117,N,W,226,7,3,210.0,A 79 | 77,195,55,5,24,33,30,8,1313,338,25,144,149,153,N,E,83,2,1,,N 80 | 78,388,103,15,59,47,39,6,2174,555,80,285,274,186,A,W,182,9,4,325.0,A 81 | 79,339,96,4,37,29,23,4,1064,290,11,123,108,55,A,W,104,213,9,275.0,A 82 | 80,561,118,35,70,94,33,16,6677,1575,442,901,1210,608,A,W,463,32,8,,A 83 | 81,255,70,7,49,35,43,15,6311,1661,154,1019,608,820,N,E,51,54,8,450.0,N 84 | 
82,677,238,31,117,113,53,5,2223,737,93,349,401,171,A,E,1377,100,6,1975.0,A 85 | 83,227,46,7,23,20,12,5,1325,324,44,156,158,67,A,W,92,2,2,,A 86 | 84,614,163,29,89,83,75,11,5017,1388,266,813,822,617,N,W,303,6,6,1900.0,N 87 | 85,329,83,9,50,39,56,9,3828,948,145,575,528,635,A,W,276,6,2,600.0,A 88 | 86,637,174,31,89,116,56,14,6727,2024,247,978,1093,495,N,W,278,9,9,1041.667,N 89 | 87,280,82,16,44,45,47,2,428,113,25,61,70,63,A,E,148,4,2,110.0,A 90 | 88,155,41,12,21,29,22,16,5409,1338,181,746,805,875,A,W,165,9,1,260.0,A 91 | 89,458,114,13,67,57,48,4,1350,298,28,160,123,122,A,W,246,389,18,475.0,A 92 | 90,314,83,13,39,46,16,5,1457,405,28,156,159,76,A,W,533,40,4,431.5,A 93 | 91,475,123,27,76,93,72,4,1810,471,108,292,343,267,N,E,226,10,6,1220.0,N 94 | 92,317,78,7,35,35,32,1,317,78,7,35,35,32,A,E,45,122,26,70.0,A 95 | 93,511,138,25,76,96,61,3,592,164,28,87,110,71,A,W,157,7,8,145.0,A 96 | 94,278,69,3,24,21,29,8,2079,565,32,258,192,162,N,W,142,210,10,,N 97 | 95,382,119,13,54,58,36,12,2133,594,41,287,294,227,N,W,59,156,9,595.0,N 98 | 96,565,148,24,90,104,77,14,7287,2083,305,1135,1234,791,A,E,292,9,5,1861.46,A 99 | 97,277,71,2,27,29,14,15,5952,1647,60,753,596,259,N,W,360,32,5,,N 100 | 98,415,115,27,97,71,68,3,711,184,45,156,119,99,N,W,274,2,7,300.0,N 101 | 99,424,110,15,70,47,36,7,2130,544,38,335,174,258,N,W,292,6,3,490.0,N 102 | 100,495,151,17,61,84,78,10,5624,1679,275,884,1015,709,A,E,1045,88,13,2460.0,A 103 | 101,524,132,9,69,47,54,2,972,260,14,123,92,90,A,E,212,327,20,,A 104 | 102,233,49,2,41,23,18,8,1350,336,7,166,122,106,A,E,102,132,10,375.0,A 105 | 103,395,106,16,48,56,35,10,2303,571,86,266,323,248,A,E,709,41,7,,A 106 | 104,397,114,23,67,67,53,13,5589,1632,241,906,926,716,A,E,244,2,4,,A 107 | 105,210,37,8,15,19,15,6,994,244,36,107,114,53,A,E,40,115,15,,A 108 | 106,420,95,23,55,58,37,3,646,139,31,77,77,61,N,W,206,10,7,,N 109 | 107,566,154,22,76,84,43,14,6100,1583,131,743,693,300,A,W,316,439,10,750.0,A 110 | 
108,641,198,31,101,108,41,5,2129,610,92,297,319,117,A,E,269,17,10,1175.0,A 111 | 109,215,51,4,19,18,11,1,215,51,4,19,18,11,A,E,116,5,12,70.0,A 112 | 110,441,128,16,70,73,80,14,6675,2095,209,1072,1050,695,A,W,97,218,16,1500.0,A 113 | 111,325,76,16,33,52,37,5,1506,351,71,195,219,214,N,W,726,87,3,385.0,A 114 | 112,490,125,24,81,105,62,13,6063,1646,271,847,999,680,N,E,869,62,8,1925.571,N 115 | 113,574,152,31,91,101,64,3,985,260,53,148,173,95,N,W,1253,111,11,215.0,N 116 | 114,284,64,14,30,42,24,18,7023,1925,348,986,1239,666,N,E,96,4,4,,N 117 | 115,596,171,34,91,108,52,6,2862,728,107,361,401,224,A,W,118,334,21,900.0,A 118 | 116,472,118,12,63,54,30,4,793,187,14,102,80,50,A,W,228,377,26,155.0,A 119 | 117,283,77,14,45,47,26,16,6840,1910,259,915,1067,546,A,W,144,6,5,700.0,A 120 | 118,408,94,4,42,36,66,9,3573,866,59,429,365,410,N,W,282,487,19,535.0,N 121 | 119,327,85,3,30,44,20,8,2140,568,16,216,208,93,A,E,91,185,12,362.5,A 122 | 120,370,96,21,49,46,60,15,6986,1972,231,1070,955,921,N,E,137,5,9,733.3330000000001,N 123 | 121,354,77,16,36,55,41,20,8716,2172,384,1172,1267,1057,N,W,83,174,16,200.0,N 124 | 122,539,139,5,93,58,69,5,1469,369,12,247,126,198,A,W,462,9,7,400.0,A 125 | 123,340,84,11,62,33,47,5,1516,376,42,284,141,219,N,E,185,8,4,400.0,A 126 | 124,510,126,2,42,44,35,11,5562,1578,44,703,519,256,N,W,207,358,20,737.5,N 127 | 125,315,59,16,45,36,58,13,4677,1051,268,681,782,697,A,W,0,0,0,,A 128 | 126,282,78,13,37,51,29,5,1649,453,73,211,280,138,A,W,670,57,5,500.0,A 129 | 127,380,120,5,54,51,31,8,3118,900,92,444,419,240,A,W,237,8,1,600.0,A 130 | 128,584,158,15,70,84,42,5,2358,636,58,265,316,134,N,E,331,20,4,662.5,N 131 | 129,570,169,21,72,88,38,7,3754,1077,140,492,589,263,A,W,295,15,5,950.0,A 132 | 130,306,104,14,50,58,25,7,2954,822,55,313,377,187,N,E,116,222,15,750.0,N 133 | 131,220,54,10,30,39,31,5,1185,299,40,145,154,128,N,E,50,136,20,297.5,N 134 | 132,278,70,7,22,37,18,18,7186,2081,190,935,1088,643,A,W,0,0,0,325.0,A 135 | 
133,445,99,1,46,24,29,4,618,129,1,72,31,48,A,W,278,415,16,87.5,A 136 | 134,143,39,5,18,30,15,9,639,151,16,80,97,61,N,W,138,15,1,175.0,N 137 | 135,185,40,4,23,11,18,3,524,125,7,58,37,47,N,E,97,2,2,90.0,N 138 | 136,589,170,40,107,108,69,6,2325,634,128,371,376,238,A,E,368,20,3,1237.5,A 139 | 137,343,103,6,48,36,40,15,4338,1193,70,581,421,325,A,E,211,56,13,430.0,A 140 | 138,284,69,1,33,18,25,5,1407,361,6,139,98,111,A,E,122,140,5,,N 141 | 139,438,103,2,65,32,71,2,440,103,2,67,32,71,A,W,276,7,9,100.0,N 142 | 140,600,144,33,85,117,65,2,696,173,38,101,130,69,A,W,319,4,14,165.0,A 143 | 141,663,200,29,108,121,32,4,1447,404,57,210,222,68,A,E,241,8,6,250.0,A 144 | 142,232,55,9,34,23,45,12,4405,1213,194,702,705,625,N,E,623,35,3,1300.0,N 145 | 143,479,133,10,48,72,55,17,7472,2147,153,980,1032,854,N,W,237,5,4,773.3330000000001,N 146 | 144,209,45,0,38,19,42,10,3859,916,23,557,279,478,A,W,132,205,5,,A 147 | 145,528,132,21,61,74,41,6,2641,671,97,273,383,226,N,E,885,105,8,1008.333,N 148 | 146,160,39,8,18,31,22,14,2128,543,56,304,268,298,A,E,33,3,0,275.0,A 149 | 147,599,183,10,80,74,32,5,2482,715,27,330,326,158,A,E,231,374,18,775.0,A 150 | 148,497,136,7,58,38,26,11,3871,1066,40,450,367,241,A,E,304,347,10,850.0,A 151 | 149,210,70,13,32,51,28,15,4040,1130,97,544,462,551,A,E,0,0,0,365.0,A 152 | 150,225,61,5,32,26,26,11,1568,408,25,202,185,257,A,W,132,9,0,,A 153 | 151,151,41,4,26,21,19,2,288,68,9,45,39,35,A,W,28,56,2,95.0,A 154 | 152,278,86,4,33,38,45,1,278,86,4,33,38,45,N,W,102,4,2,110.0,N 155 | 153,341,95,6,48,42,20,10,2964,808,81,379,428,221,N,W,158,4,5,100.0,N 156 | 154,537,147,23,58,88,47,10,2744,730,97,302,351,174,N,E,92,257,20,277.5,N 157 | 155,399,102,3,56,34,34,5,670,167,4,89,48,54,A,W,211,9,3,80.0,A 158 | 156,309,94,5,37,32,26,13,4618,1330,57,616,522,436,N,E,161,3,3,600.0,N 159 | 157,401,100,2,60,19,28,4,876,238,2,126,44,55,N,E,193,11,4,,N 160 | 158,336,93,9,35,46,23,15,5779,1610,128,730,741,497,A,W,0,0,0,,A 161 | 
159,616,163,27,83,107,32,3,1437,377,65,181,227,82,A,W,110,308,15,200.0,A 162 | 160,219,47,8,24,26,17,12,1188,286,23,100,125,63,A,W,260,58,4,,A 163 | 161,579,174,7,67,78,58,6,3053,880,32,366,337,218,N,E,280,479,5,657.0,N 164 | 162,165,39,2,13,9,16,3,196,44,2,18,10,18,A,W,332,19,2,75.0,N 165 | 163,618,200,20,98,110,62,13,7127,2163,351,1104,1289,564,A,E,330,16,8,2412.5,A 166 | 164,257,66,5,31,26,32,14,3910,979,33,518,324,382,N,W,87,166,14,250.0,A 167 | 165,315,76,13,35,60,25,3,630,151,24,68,94,55,N,E,498,39,13,155.0,N 168 | 166,591,157,16,90,78,26,4,2020,541,52,310,226,91,N,E,290,440,25,640.0,N 169 | 167,404,92,11,54,49,18,6,1354,325,30,188,135,63,A,E,222,5,5,300.0,A 170 | 168,315,73,5,23,37,16,4,450,108,6,38,46,28,A,W,227,15,3,110.0,A 171 | 169,249,69,6,32,19,20,4,702,209,10,97,48,44,N,E,103,8,2,,N 172 | 170,429,91,12,41,42,57,13,5590,1397,83,578,579,644,A,W,686,46,4,825.0,N 173 | 171,212,54,13,28,44,18,2,233,59,13,31,46,20,A,E,243,23,5,,A 174 | 172,453,101,3,46,43,61,3,948,218,6,96,72,91,N,W,249,444,16,195.0,N 175 | 173,161,43,4,17,26,22,3,707,179,21,77,99,76,A,W,300,12,2,,A 176 | 174,184,47,5,20,28,18,11,3327,890,74,419,382,304,N,W,49,2,0,450.0,N 177 | 175,591,184,20,83,79,38,5,1689,462,40,219,195,82,N,W,303,12,5,630.0,N 178 | 176,181,58,6,34,23,22,1,181,58,6,34,23,22,N,W,88,0,3,86.5,N 179 | 177,441,118,28,84,86,68,8,2723,750,126,433,420,309,A,E,190,2,2,1300.0,A 180 | 178,490,150,21,69,58,35,14,6126,1839,121,983,707,600,A,E,96,5,3,1000.0,N 181 | 179,551,171,13,94,83,94,13,6090,1840,128,969,900,917,N,E,1199,149,5,1800.0,N 182 | 180,550,147,29,85,91,71,6,2816,815,117,405,474,319,A,W,1218,104,10,1310.0,A 183 | 181,283,74,4,34,29,22,10,3919,1062,85,505,456,283,N,W,145,5,7,737.5,N 184 | 182,560,161,26,89,96,66,4,1789,470,65,233,260,155,N,W,332,9,8,625.0,N 185 | 183,328,91,12,51,43,33,2,342,94,12,51,44,33,N,E,145,59,8,125.0,N 186 | 184,586,159,12,72,79,53,9,3082,880,83,363,477,295,N,E,181,13,4,1043.333,N 187 | 
185,503,136,5,62,48,83,10,3423,970,20,408,303,414,N,W,65,258,8,725.0,N 188 | 186,344,85,24,69,64,88,7,911,214,64,150,156,187,A,W,0,0,0,300.0,A 189 | 187,680,223,31,119,96,34,3,1928,587,35,262,201,91,A,W,429,8,6,365.0,A 190 | 188,279,64,0,31,26,30,1,279,64,0,31,26,30,N,W,107,205,16,75.0,N 191 | 189,484,127,20,66,65,67,7,3006,844,116,436,458,377,N,E,1231,80,7,1183.333,N 192 | 190,431,127,8,77,45,58,2,667,187,9,117,64,88,N,E,283,8,3,202.5,N 193 | 191,283,70,8,33,37,27,12,4479,1222,94,557,483,307,A,E,156,2,2,225.0,A 194 | 192,491,141,11,77,47,37,15,4291,1240,84,615,430,340,A,E,239,8,2,525.0,A 195 | 193,199,52,9,26,28,21,6,805,191,30,113,119,87,N,W,235,22,5,265.0,N 196 | 194,589,149,21,89,86,64,7,3558,928,102,513,471,351,A,E,371,6,6,787.5,A 197 | 195,327,84,22,53,62,38,10,4273,1123,212,577,700,334,A,E,483,48,6,800.0,N 198 | 196,464,128,28,67,94,52,13,5829,1552,210,740,840,452,A,W,0,0,0,587.5,A 199 | 197,166,34,0,20,13,17,1,166,34,0,20,13,17,N,E,64,119,9,,N 200 | 198,338,92,18,42,60,21,3,682,185,36,88,112,50,A,E,0,0,0,145.0,A 201 | 199,508,146,8,80,44,46,9,3148,915,41,571,289,326,A,W,245,5,9,,A 202 | 200,584,157,20,95,73,63,10,4704,1320,93,724,522,576,A,E,276,421,11,420.0,A 203 | 201,216,54,2,27,25,33,1,216,54,2,27,25,33,N,W,317,36,1,75.0,N 204 | 202,625,179,4,94,60,65,5,1696,476,12,216,163,166,A,E,303,450,14,575.0,A 205 | 203,243,53,4,18,26,27,4,853,228,23,101,110,76,N,E,107,3,3,,N 206 | 204,489,131,19,77,55,34,7,2051,549,62,300,263,153,A,W,310,9,9,780.0,A 207 | 205,209,56,12,22,36,19,2,216,58,12,24,37,19,N,E,201,6,3,90.0,N 208 | 206,407,93,8,47,30,30,2,969,230,14,121,69,68,N,W,172,317,25,150.0,N 209 | 207,490,148,14,64,78,49,13,3400,1000,113,445,491,301,A,E,0,0,0,700.0,N 210 | 208,209,59,6,20,37,27,4,884,209,14,66,106,92,N,E,415,35,3,,N 211 | 209,442,131,18,68,77,33,6,1416,398,47,210,203,136,A,E,233,7,7,550.0,A 212 | 210,317,88,3,40,32,19,8,2543,715,28,269,270,118,A,W,220,16,4,,A 213 | 211,288,65,8,30,36,27,9,2815,698,55,315,325,189,N,E,259,30,10,650.0,A 214 | 
212,209,54,3,25,14,12,1,209,54,3,25,14,12,A,W,102,6,3,68.0,A 215 | 213,303,71,3,18,30,36,3,344,76,3,20,36,45,N,E,468,47,6,100.0,N 216 | 214,330,77,19,47,53,27,6,1928,516,90,247,288,161,N,W,149,8,6,670.0,N 217 | 215,504,120,28,71,71,54,3,1085,259,54,150,167,114,A,E,103,283,19,175.0,A 218 | 216,258,60,8,28,33,18,3,638,170,17,80,75,36,A,W,358,32,8,137.0,A 219 | 217,20,1,0,0,0,0,2,41,9,2,6,7,4,N,E,78,220,6,2127.333,N 220 | 218,374,94,5,36,26,62,7,1968,519,26,181,199,288,N,W,756,64,15,875.0,N 221 | 219,211,43,10,26,35,39,3,498,116,14,59,55,78,A,W,463,32,8,120.0,A 222 | 220,299,75,6,38,23,26,3,580,160,8,71,33,44,N,E,212,1,2,140.0,N 223 | 221,576,167,8,89,49,57,4,822,232,19,132,83,79,N,E,325,12,8,210.0,N 224 | 222,381,110,9,61,45,32,7,3015,834,40,451,249,168,N,E,228,7,5,800.0,N 225 | 223,288,76,7,34,37,15,4,1644,408,16,198,120,113,N,W,203,3,3,240.0,N 226 | 224,369,93,9,43,42,49,5,1258,323,54,181,177,157,A,E,149,1,6,350.0,A 227 | 225,330,76,12,35,41,47,4,1367,326,55,167,198,167,N,W,512,30,5,,N 228 | 226,547,137,2,58,47,12,2,1038,271,3,129,80,24,A,W,261,459,22,175.0,A 229 | 227,572,152,18,105,49,65,2,978,249,36,168,91,101,A,W,325,13,3,200.0,A 230 | 228,359,84,4,46,27,21,12,4992,1257,37,699,386,387,N,W,151,8,5,,N 231 | 229,514,144,0,67,54,79,9,4739,1169,13,583,374,528,N,E,229,453,15,1940.0,N 232 | 230,359,80,15,45,48,63,7,1493,359,61,176,202,175,N,W,682,93,13,700.0,N 233 | 231,526,163,12,88,50,77,4,1556,470,38,245,167,174,A,W,250,11,1,750.0,A 234 | 232,313,83,9,43,41,30,14,5885,1543,104,751,714,535,N,W,58,141,23,450.0,N 235 | 233,540,135,30,82,88,55,1,540,135,30,82,88,55,A,W,157,6,14,172.0,A 236 | 234,437,123,9,62,55,40,9,4139,1203,79,676,390,364,A,E,82,170,15,1260.0,A 237 | 235,551,160,23,86,90,87,5,2235,602,75,278,328,273,A,W,1224,115,11,,A 238 | 236,237,52,0,15,25,30,24,14053,4256,160,2165,1314,1566,N,W,523,43,6,750.0,N 239 | 237,236,56,6,41,19,21,5,1257,329,24,166,125,105,A,E,172,1,4,190.0,A 240 | 238,473,154,6,61,48,29,6,1966,566,29,250,252,178,A,E,846,84,9,580.0,A 241 
| 239,309,72,0,33,31,26,5,354,82,0,41,32,26,N,E,117,269,12,130.0,N 242 | 240,271,77,5,35,29,33,12,4933,1358,48,630,435,403,A,W,62,90,3,450.0,A 243 | 241,357,96,7,50,45,39,5,1394,344,43,178,192,136,A,W,167,2,4,300.0,A 244 | 242,216,56,4,22,18,15,12,2796,665,43,266,304,198,A,E,391,44,4,250.0,A 245 | 243,256,70,13,42,36,44,16,7058,1845,312,965,1128,990,N,E,41,118,8,1050.0,A 246 | 244,466,108,33,75,86,72,3,652,142,44,102,109,102,A,E,286,8,8,215.0,A 247 | 245,327,68,13,42,29,45,18,3949,939,78,438,380,466,A,E,659,53,7,400.0,A 248 | 246,462,119,16,49,65,37,7,2131,583,69,244,288,150,A,E,866,65,6,,A 249 | 247,341,110,9,45,49,46,9,2331,658,50,249,322,274,A,E,251,9,4,560.0,A 250 | 248,608,160,28,130,74,89,8,4071,1182,103,862,417,708,A,E,426,4,6,1670.0,A 251 | 249,419,101,18,65,58,92,20,9528,2510,548,1509,1659,1342,A,W,0,0,0,487.5,A 252 | 250,33,6,0,2,4,7,1,33,6,0,2,4,7,A,W,205,5,4,,A 253 | 251,376,82,21,42,60,35,5,1770,408,115,238,299,157,A,W,0,0,0,425.0,A 254 | 252,486,145,11,51,76,40,11,3967,1102,67,410,497,284,N,E,88,204,16,500.0,A 255 | 253,186,44,7,28,16,11,1,186,44,7,28,16,11,N,W,99,3,1,,N 256 | 254,307,80,1,42,36,29,7,2421,656,18,379,198,184,A,W,145,2,2,,A 257 | 255,246,76,5,35,39,13,6,912,234,12,102,96,80,A,E,44,0,1,250.0,A 258 | 256,205,52,8,31,27,17,12,5134,1323,56,643,445,459,A,E,155,3,2,400.0,A 259 | 257,348,90,11,50,45,43,10,2288,614,43,295,273,269,A,E,60,176,6,450.0,A 260 | 258,523,135,8,52,44,52,9,3368,895,39,377,284,296,N,W,367,475,19,750.0,N 261 | 259,312,68,2,32,22,24,1,312,68,2,32,22,24,A,E,86,150,15,70.0,A 262 | 260,496,119,8,57,33,21,7,3358,882,36,365,280,165,N,W,155,371,29,875.0,N 263 | 261,126,27,3,8,10,5,4,239,49,3,16,13,14,N,E,190,2,9,190.0,N 264 | 262,275,68,5,42,42,61,6,961,238,16,128,104,172,N,E,181,3,2,191.0,N 265 | 263,627,178,14,68,76,46,6,3146,902,74,494,345,242,N,E,309,492,5,740.0,N 266 | 264,394,86,1,38,28,36,4,1089,267,3,94,71,76,N,E,203,369,16,250.0,N 267 | 265,208,57,8,32,25,18,3,653,170,17,98,54,62,N,E,42,94,13,140.0,N 268 | 
266,382,101,16,50,55,22,1,382,101,16,50,55,22,A,W,200,7,6,97.5,A 269 | 267,459,113,20,59,57,68,12,5348,1369,155,713,660,735,A,W,0,0,0,740.0,A 270 | 268,549,149,7,73,47,42,1,549,149,7,73,47,42,N,W,255,450,17,140.0,N 271 | 269,288,63,3,25,33,16,10,2682,667,38,315,259,204,A,W,135,257,7,341.667,A 272 | 270,303,84,4,35,32,23,2,312,87,4,39,32,23,N,W,179,5,3,,N 273 | 271,522,163,9,82,46,62,13,7037,2019,153,1043,827,535,A,E,352,9,1,1000.0,A 274 | 272,512,117,29,54,88,43,6,1750,412,100,204,276,155,A,W,1236,98,18,100.0,A 275 | 273,220,66,5,20,28,13,3,290,80,5,27,31,15,A,W,281,21,3,90.0,A 276 | 274,522,140,16,73,77,60,4,730,185,22,93,106,86,N,E,1320,166,17,200.0,N 277 | 275,461,112,18,54,54,35,2,680,160,24,76,75,49,A,W,111,226,11,135.0,A 278 | 276,581,145,17,66,68,21,2,831,210,21,106,86,40,N,E,320,465,32,155.0,N 279 | 277,530,159,3,82,50,47,6,1619,426,11,218,149,163,A,W,196,354,15,475.0,A 280 | 278,557,142,21,58,81,23,18,8759,2583,271,1138,1299,478,N,W,1160,53,7,1450.0,N 281 | 279,439,96,0,44,36,65,4,711,148,1,68,56,99,N,E,229,406,22,150.0,N 282 | 280,453,103,8,53,33,52,2,507,123,8,63,39,58,A,W,289,407,6,105.0,A 283 | 281,528,122,1,67,45,51,4,1716,403,12,211,146,155,A,W,209,372,17,350.0,A 284 | 282,633,210,6,91,56,59,6,3070,872,19,420,230,274,N,W,367,432,16,90.0,N 285 | 283,16,2,0,1,0,0,2,28,4,0,1,0,0,A,E,247,4,8,,A 286 | 284,562,169,17,88,73,53,8,3181,841,61,450,342,373,A,E,351,442,17,530.0,A 287 | 285,281,76,3,42,25,20,8,2658,657,48,324,300,179,A,E,106,144,7,341.667,A 288 | 286,593,152,23,69,75,53,6,2765,686,133,369,384,321,A,W,315,10,6,940.0,A 289 | 287,687,213,10,91,65,27,4,1518,448,15,196,137,89,A,E,294,445,13,350.0,A 290 | 288,368,103,3,48,28,54,8,1897,493,9,207,162,198,N,W,209,246,3,326.66700000000003,N 291 | 289,263,70,1,26,23,30,4,888,220,9,83,82,86,N,E,81,147,4,250.0,N 292 | 290,642,211,14,107,59,52,5,2364,770,27,352,230,193,N,W,337,19,4,740.0,N 293 | 291,265,68,8,26,30,29,7,1337,339,32,135,163,128,N,W,92,5,3,425.0,A 294 | 
292,289,63,7,36,41,44,17,7402,1954,195,1115,919,1153,A,W,166,211,7,,A 295 | 293,559,141,2,48,61,73,8,3162,874,16,421,349,359,N,E,352,414,9,925.0,N 296 | 294,520,120,17,53,44,21,4,927,227,22,106,80,52,A,W,70,144,11,185.0,A 297 | 295,19,4,1,2,3,1,1,19,4,1,2,3,1,N,W,692,70,8,920.0,A 298 | 296,205,43,2,24,17,20,7,854,219,12,105,99,71,N,E,131,6,1,286.66700000000003,N 299 | 297,193,47,10,21,29,24,6,1136,256,42,129,139,106,A,W,299,13,5,245.0,A 300 | 298,181,46,1,19,18,17,5,937,238,9,88,95,104,A,E,37,98,9,,A 301 | 299,213,61,4,17,22,3,17,4061,1145,83,488,491,244,A,W,178,45,4,235.0,A 302 | 300,510,147,10,56,52,53,7,2872,821,63,307,340,174,N,E,810,99,18,1150.0,N 303 | 301,578,138,1,56,59,34,3,1399,357,7,149,161,87,N,E,133,371,20,160.0,N 304 | 302,200,51,2,14,29,25,23,9778,2732,379,1272,1652,925,N,W,398,29,7,,N 305 | 303,441,113,5,76,52,76,5,1546,397,17,226,149,191,A,W,160,290,11,425.0,A 306 | 304,172,42,3,17,14,15,10,4086,1150,57,579,363,406,N,W,65,0,0,900.0,N 307 | 305,580,194,9,91,62,78,8,3372,1028,48,604,314,469,N,E,270,13,6,,N 308 | 306,127,32,4,14,25,12,19,8396,2402,242,1048,1348,819,N,W,167,18,6,500.0,N 309 | 307,279,69,4,35,31,32,4,1359,355,31,180,148,158,N,E,133,173,9,277.5,N 310 | 308,480,112,18,50,71,44,7,3031,771,110,338,406,239,N,E,94,270,16,750.0,N 311 | 309,600,139,0,94,29,60,2,1236,309,1,201,69,110,N,E,300,12,9,160.0,N 312 | 310,610,186,19,107,98,74,6,2728,753,69,399,366,286,N,E,1182,96,13,1300.0,N 313 | 311,360,81,5,37,44,37,7,2268,566,41,279,257,246,N,E,170,284,3,525.0,N 314 | 312,387,124,1,67,27,36,7,1775,506,6,272,125,194,N,E,186,290,17,550.0,N 315 | 313,580,207,8,107,71,105,5,2778,978,32,474,322,417,A,E,121,267,19,1600.0,A 316 | 314,408,117,11,66,41,34,1,408,117,11,66,41,34,N,W,942,72,11,120.0,N 317 | 315,593,172,22,82,100,57,1,593,172,22,82,100,57,A,W,1222,139,15,165.0,A 318 | 316,221,53,2,21,23,22,8,1063,283,15,107,124,106,N,E,325,58,6,,N 319 | 317,497,127,7,65,48,37,5,2703,806,32,379,311,138,N,E,325,9,3,700.0,N 320 | 
318,492,136,5,76,50,94,12,5511,1511,39,897,451,875,A,E,313,381,20,875.0,A 321 | 319,475,126,3,61,43,52,6,1700,433,7,217,93,146,A,W,37,113,7,385.0,A 322 | 320,573,144,9,85,60,78,8,3198,857,97,470,420,332,A,E,1314,131,12,960.0,A 323 | 321,631,170,9,77,44,31,11,4908,1457,30,775,357,249,A,W,408,4,3,1000.0,A 324 | -------------------------------------------------------------------------------- /Clase 16_CART/test.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked 2 | 892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q 3 | 893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S 4 | 894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q 5 | 895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S 6 | 896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S 7 | 897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S 8 | 898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q 9 | 899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S 10 | 900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C 11 | 901,3,"Davies, Mr. John Samuel",male,21,2,0,A/4 48871,24.15,,S 12 | 902,3,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,,S 13 | 903,1,"Jones, Mr. Charles Cresson",male,46,0,0,694,26,,S 14 | 904,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23,1,0,21228,82.2667,B45,S 15 | 905,2,"Howard, Mr. Benjamin",male,63,1,0,24065,26,,S 16 | 906,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)",female,47,1,0,W.E.P. 5734,61.175,E31,S 17 | 907,2,"del Carlo, Mrs. Sebastiano (Argenia Genovesi)",female,24,1,0,SC/PARIS 2167,27.7208,,C 18 | 908,2,"Keane, Mr. Daniel",male,35,0,0,233734,12.35,,Q 19 | 909,3,"Assaf, Mr. Gerios",male,21,0,0,2692,7.225,,C 20 | 910,3,"Ilmakangas, Miss. Ida Livija",female,27,1,0,STON/O2. 3101270,7.925,,S 21 | 911,3,"Assaf Khalil, Mrs. 
Mariana (Miriam"")""",female,45,0,0,2696,7.225,,C 22 | 912,1,"Rothschild, Mr. Martin",male,55,1,0,PC 17603,59.4,,C 23 | 913,3,"Olsen, Master. Artur Karl",male,9,0,1,C 17368,3.1708,,S 24 | 914,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S 25 | 915,1,"Williams, Mr. Richard Norris II",male,21,0,1,PC 17597,61.3792,,C 26 | 916,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48,1,3,PC 17608,262.375,B57 B59 B63 B66,C 27 | 917,3,"Robins, Mr. Alexander A",male,50,1,0,A/5. 3337,14.5,,S 28 | 918,1,"Ostby, Miss. Helene Ragnhild",female,22,0,1,113509,61.9792,B36,C 29 | 919,3,"Daher, Mr. Shedid",male,22.5,0,0,2698,7.225,,C 30 | 920,1,"Brady, Mr. John Bertram",male,41,0,0,113054,30.5,A21,S 31 | 921,3,"Samaan, Mr. Elias",male,,2,0,2662,21.6792,,C 32 | 922,2,"Louch, Mr. Charles Alexander",male,50,1,0,SC/AH 3085,26,,S 33 | 923,2,"Jefferys, Mr. Clifford Thomas",male,24,2,0,C.A. 31029,31.5,,S 34 | 924,3,"Dean, Mrs. Bertram (Eva Georgetta Light)",female,33,1,2,C.A. 2315,20.575,,S 35 | 925,3,"Johnston, Mrs. Andrew G (Elizabeth Lily"" Watson)""",female,,1,2,W./C. 6607,23.45,,S 36 | 926,1,"Mock, Mr. Philipp Edmund",male,30,1,0,13236,57.75,C78,C 37 | 927,3,"Katavelas, Mr. Vassilios (Catavelas Vassilios"")""",male,18.5,0,0,2682,7.2292,,C 38 | 928,3,"Roth, Miss. Sarah A",female,,0,0,342712,8.05,,S 39 | 929,3,"Cacic, Miss. Manda",female,21,0,0,315087,8.6625,,S 40 | 930,3,"Sap, Mr. Julius",male,25,0,0,345768,9.5,,S 41 | 931,3,"Hee, Mr. Ling",male,,0,0,1601,56.4958,,S 42 | 932,3,"Karun, Mr. Franz",male,39,0,1,349256,13.4167,,C 43 | 933,1,"Franklin, Mr. Thomas Parham",male,,0,0,113778,26.55,D34,S 44 | 934,3,"Goldsmith, Mr. Nathan",male,41,0,0,SOTON/O.Q. 3101263,7.85,,S 45 | 935,2,"Corbett, Mrs. Walter H (Irene Colvin)",female,30,0,0,237249,13,,S 46 | 936,1,"Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)",female,45,1,0,11753,52.5542,D19,S 47 | 937,3,"Peltomaki, Mr. Nikolai Johannes",male,25,0,0,STON/O 2. 3101291,7.925,,S 48 | 938,1,"Chevre, Mr. 
Paul Romaine",male,45,0,0,PC 17594,29.7,A9,C 49 | 939,3,"Shaughnessy, Mr. Patrick",male,,0,0,370374,7.75,,Q 50 | 940,1,"Bucknell, Mrs. William Robert (Emma Eliza Ward)",female,60,0,0,11813,76.2917,D15,C 51 | 941,3,"Coutts, Mrs. William (Winnie Minnie"" Treanor)""",female,36,0,2,C.A. 37671,15.9,,S 52 | 942,1,"Smith, Mr. Lucien Philip",male,24,1,0,13695,60,C31,S 53 | 943,2,"Pulbaum, Mr. Franz",male,27,0,0,SC/PARIS 2168,15.0333,,C 54 | 944,2,"Hocking, Miss. Ellen Nellie""""",female,20,2,1,29105,23,,S 55 | 945,1,"Fortune, Miss. Ethel Flora",female,28,3,2,19950,263,C23 C25 C27,S 56 | 946,2,"Mangiavacchi, Mr. Serafino Emilio",male,,0,0,SC/A.3 2861,15.5792,,C 57 | 947,3,"Rice, Master. Albert",male,10,4,1,382652,29.125,,Q 58 | 948,3,"Cor, Mr. Bartol",male,35,0,0,349230,7.8958,,S 59 | 949,3,"Abelseth, Mr. Olaus Jorgensen",male,25,0,0,348122,7.65,F G63,S 60 | 950,3,"Davison, Mr. Thomas Henry",male,,1,0,386525,16.1,,S 61 | 951,1,"Chaudanson, Miss. Victorine",female,36,0,0,PC 17608,262.375,B61,C 62 | 952,3,"Dika, Mr. Mirko",male,17,0,0,349232,7.8958,,S 63 | 953,2,"McCrae, Mr. Arthur Gordon",male,32,0,0,237216,13.5,,S 64 | 954,3,"Bjorklund, Mr. Ernst Herbert",male,18,0,0,347090,7.75,,S 65 | 955,3,"Bradley, Miss. Bridget Delia",female,22,0,0,334914,7.725,,Q 66 | 956,1,"Ryerson, Master. John Borie",male,13,2,2,PC 17608,262.375,B57 B59 B63 B66,C 67 | 957,2,"Corey, Mrs. Percy C (Mary Phyllis Elizabeth Miller)",female,,0,0,F.C.C. 13534,21,,S 68 | 958,3,"Burns, Miss. Mary Delia",female,18,0,0,330963,7.8792,,Q 69 | 959,1,"Moore, Mr. Clarence Bloomfield",male,47,0,0,113796,42.4,,S 70 | 960,1,"Tucker, Mr. Gilbert Milligan Jr",male,31,0,0,2543,28.5375,C53,C 71 | 961,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60,1,4,19950,263,C23 C25 C27,S 72 | 962,3,"Mulvihill, Miss. Bertha E",female,24,0,0,382653,7.75,,Q 73 | 963,3,"Minkoff, Mr. Lazar",male,21,0,0,349211,7.8958,,S 74 | 964,3,"Nieminen, Miss. Manta Josefina",female,29,0,0,3101297,7.925,,S 75 | 965,1,"Ovies y Rodriguez, Mr. 
Servando",male,28.5,0,0,PC 17562,27.7208,D43,C 76 | 966,1,"Geiger, Miss. Amalie",female,35,0,0,113503,211.5,C130,C 77 | 967,1,"Keeping, Mr. Edwin",male,32.5,0,0,113503,211.5,C132,C 78 | 968,3,"Miles, Mr. Frank",male,,0,0,359306,8.05,,S 79 | 969,1,"Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)",female,55,2,0,11770,25.7,C101,S 80 | 970,2,"Aldworth, Mr. Charles Augustus",male,30,0,0,248744,13,,S 81 | 971,3,"Doyle, Miss. Elizabeth",female,24,0,0,368702,7.75,,Q 82 | 972,3,"Boulos, Master. Akar",male,6,1,1,2678,15.2458,,C 83 | 973,1,"Straus, Mr. Isidor",male,67,1,0,PC 17483,221.7792,C55 C57,S 84 | 974,1,"Case, Mr. Howard Brown",male,49,0,0,19924,26,,S 85 | 975,3,"Demetri, Mr. Marinko",male,,0,0,349238,7.8958,,S 86 | 976,2,"Lamb, Mr. John Joseph",male,,0,0,240261,10.7083,,Q 87 | 977,3,"Khalil, Mr. Betros",male,,1,0,2660,14.4542,,C 88 | 978,3,"Barry, Miss. Julia",female,27,0,0,330844,7.8792,,Q 89 | 979,3,"Badman, Miss. Emily Louisa",female,18,0,0,A/4 31416,8.05,,S 90 | 980,3,"O'Donoghue, Ms. Bridget",female,,0,0,364856,7.75,,Q 91 | 981,2,"Wells, Master. Ralph Lester",male,2,1,1,29103,23,,S 92 | 982,3,"Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson)",female,22,1,0,347072,13.9,,S 93 | 983,3,"Pedersen, Mr. Olaf",male,,0,0,345498,7.775,,S 94 | 984,1,"Davidson, Mrs. Thornton (Orian Hays)",female,27,1,2,F.C. 12750,52,B71,S 95 | 985,3,"Guest, Mr. Robert",male,,0,0,376563,8.05,,S 96 | 986,1,"Birnbaum, Mr. Jakob",male,25,0,0,13905,26,,C 97 | 987,3,"Tenglin, Mr. Gunnar Isidor",male,25,0,0,350033,7.7958,,S 98 | 988,1,"Cavendish, Mrs. Tyrell William (Julia Florence Siegel)",female,76,1,0,19877,78.85,C46,S 99 | 989,3,"Makinen, Mr. Kalle Edvard",male,29,0,0,STON/O 2. 3101268,7.925,,S 100 | 990,3,"Braf, Miss. Elin Ester Maria",female,20,0,0,347471,7.8542,,S 101 | 991,3,"Nancarrow, Mr. William Henry",male,33,0,0,A./5. 3338,8.05,,S 102 | 992,1,"Stengel, Mrs. Charles Emil Henry (Annie May Morris)",female,43,1,0,11778,55.4417,C116,C 103 | 993,2,"Weisz, Mr. 
Leopold",male,27,1,0,228414,26,,S 104 | 994,3,"Foley, Mr. William",male,,0,0,365235,7.75,,Q 105 | 995,3,"Johansson Palmquist, Mr. Oskar Leander",male,26,0,0,347070,7.775,,S 106 | 996,3,"Thomas, Mrs. Alexander (Thamine Thelma"")""",female,16,1,1,2625,8.5167,,C 107 | 997,3,"Holthen, Mr. Johan Martin",male,28,0,0,C 4001,22.525,,S 108 | 998,3,"Buckley, Mr. Daniel",male,21,0,0,330920,7.8208,,Q 109 | 999,3,"Ryan, Mr. Edward",male,,0,0,383162,7.75,,Q 110 | 1000,3,"Willer, Mr. Aaron (Abi Weller"")""",male,,0,0,3410,8.7125,,S 111 | 1001,2,"Swane, Mr. George",male,18.5,0,0,248734,13,F,S 112 | 1002,2,"Stanton, Mr. Samuel Ward",male,41,0,0,237734,15.0458,,C 113 | 1003,3,"Shine, Miss. Ellen Natalia",female,,0,0,330968,7.7792,,Q 114 | 1004,1,"Evans, Miss. Edith Corse",female,36,0,0,PC 17531,31.6792,A29,C 115 | 1005,3,"Buckley, Miss. Katherine",female,18.5,0,0,329944,7.2833,,Q 116 | 1006,1,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63,1,0,PC 17483,221.7792,C55 C57,S 117 | 1007,3,"Chronopoulos, Mr. Demetrios",male,18,1,0,2680,14.4542,,C 118 | 1008,3,"Thomas, Mr. John",male,,0,0,2681,6.4375,,C 119 | 1009,3,"Sandstrom, Miss. Beatrice Irene",female,1,1,1,PP 9549,16.7,G6,S 120 | 1010,1,"Beattie, Mr. Thomson",male,36,0,0,13050,75.2417,C6,C 121 | 1011,2,"Chapman, Mrs. John Henry (Sara Elizabeth Lawry)",female,29,1,0,SC/AH 29037,26,,S 122 | 1012,2,"Watt, Miss. Bertha J",female,12,0,0,C.A. 33595,15.75,,S 123 | 1013,3,"Kiernan, Mr. John",male,,1,0,367227,7.75,,Q 124 | 1014,1,"Schabert, Mrs. Paul (Emma Mock)",female,35,1,0,13236,57.75,C28,C 125 | 1015,3,"Carver, Mr. Alfred John",male,28,0,0,392095,7.25,,S 126 | 1016,3,"Kennedy, Mr. John",male,,0,0,368783,7.75,,Q 127 | 1017,3,"Cribb, Miss. Laura Alice",female,17,0,1,371362,16.1,,S 128 | 1018,3,"Brobeck, Mr. Karl Rudolf",male,22,0,0,350045,7.7958,,S 129 | 1019,3,"McCoy, Miss. Alicia",female,,2,0,367226,23.25,,Q 130 | 1020,2,"Bowenur, Mr. Solomon",male,42,0,0,211535,13,,S 131 | 1021,3,"Petersen, Mr. 
Marius",male,24,0,0,342441,8.05,,S 132 | 1022,3,"Spinner, Mr. Henry John",male,32,0,0,STON/OQ. 369943,8.05,,S 133 | 1023,1,"Gracie, Col. Archibald IV",male,53,0,0,113780,28.5,C51,C 134 | 1024,3,"Lefebre, Mrs. Frank (Frances)",female,,0,4,4133,25.4667,,S 135 | 1025,3,"Thomas, Mr. Charles P",male,,1,0,2621,6.4375,,C 136 | 1026,3,"Dintcheff, Mr. Valtcho",male,43,0,0,349226,7.8958,,S 137 | 1027,3,"Carlsson, Mr. Carl Robert",male,24,0,0,350409,7.8542,,S 138 | 1028,3,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C 139 | 1029,2,"Schmidt, Mr. August",male,26,0,0,248659,13,,S 140 | 1030,3,"Drapkin, Miss. Jennie",female,23,0,0,SOTON/OQ 392083,8.05,,S 141 | 1031,3,"Goodwin, Mr. Charles Frederick",male,40,1,6,CA 2144,46.9,,S 142 | 1032,3,"Goodwin, Miss. Jessie Allis",female,10,5,2,CA 2144,46.9,,S 143 | 1033,1,"Daniels, Miss. Sarah",female,33,0,0,113781,151.55,,S 144 | 1034,1,"Ryerson, Mr. Arthur Larned",male,61,1,3,PC 17608,262.375,B57 B59 B63 B66,C 145 | 1035,2,"Beauchamp, Mr. Henry James",male,28,0,0,244358,26,,S 146 | 1036,1,"Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey"")""",male,42,0,0,17475,26.55,,S 147 | 1037,3,"Vander Planke, Mr. Julius",male,31,3,0,345763,18,,S 148 | 1038,1,"Hilliard, Mr. Herbert Henry",male,,0,0,17463,51.8625,E46,S 149 | 1039,3,"Davies, Mr. Evan",male,22,0,0,SC/A4 23568,8.05,,S 150 | 1040,1,"Crafton, Mr. John Bertram",male,,0,0,113791,26.55,,S 151 | 1041,2,"Lahtinen, Rev. William",male,30,1,1,250651,26,,S 152 | 1042,1,"Earnshaw, Mrs. Boulton (Olive Potter)",female,23,0,1,11767,83.1583,C54,C 153 | 1043,3,"Matinoff, Mr. Nicola",male,,0,0,349255,7.8958,,C 154 | 1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S 155 | 1045,3,"Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)",female,36,0,2,350405,12.1833,,S 156 | 1046,3,"Asplund, Master. Filip Oscar",male,13,4,2,347077,31.3875,,S 157 | 1047,3,"Duquemin, Mr. Joseph",male,24,0,0,S.O./P.P. 752,7.55,,S 158 | 1048,1,"Bird, Miss. 
Ellen",female,29,0,0,PC 17483,221.7792,C97,S 159 | 1049,3,"Lundin, Miss. Olga Elida",female,23,0,0,347469,7.8542,,S 160 | 1050,1,"Borebank, Mr. John James",male,42,0,0,110489,26.55,D22,S 161 | 1051,3,"Peacock, Mrs. Benjamin (Edith Nile)",female,26,0,2,SOTON/O.Q. 3101315,13.775,,S 162 | 1052,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q 163 | 1053,3,"Touma, Master. Georges Youssef",male,7,1,1,2650,15.2458,,C 164 | 1054,2,"Wright, Miss. Marion",female,26,0,0,220844,13.5,,S 165 | 1055,3,"Pearce, Mr. Ernest",male,,0,0,343271,7,,S 166 | 1056,2,"Peruschitz, Rev. Joseph Maria",male,41,0,0,237393,13,,S 167 | 1057,3,"Kink-Heilmann, Mrs. Anton (Luise Heilmann)",female,26,1,1,315153,22.025,,S 168 | 1058,1,"Brandeis, Mr. Emil",male,48,0,0,PC 17591,50.4958,B10,C 169 | 1059,3,"Ford, Mr. Edward Watson",male,18,2,2,W./C. 6608,34.375,,S 170 | 1060,1,"Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genevieve Fosdick)",female,,0,0,17770,27.7208,,C 171 | 1061,3,"Hellstrom, Miss. Hilda Maria",female,22,0,0,7548,8.9625,,S 172 | 1062,3,"Lithman, Mr. Simon",male,,0,0,S.O./P.P. 251,7.55,,S 173 | 1063,3,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,,C 174 | 1064,3,"Dyker, Mr. Adolf Fredrik",male,23,1,0,347072,13.9,,S 175 | 1065,3,"Torfa, Mr. Assad",male,,0,0,2673,7.2292,,C 176 | 1066,3,"Asplund, Mr. Carl Oscar Vilhelm Gustafsson",male,40,1,5,347077,31.3875,,S 177 | 1067,2,"Brown, Miss. Edith Eileen",female,15,0,2,29750,39,,S 178 | 1068,2,"Sincock, Miss. Maude",female,20,0,0,C.A. 33112,36.75,,S 179 | 1069,1,"Stengel, Mr. Charles Emil Henry",male,54,1,0,11778,55.4417,C116,C 180 | 1070,2,"Becker, Mrs. Allen Oliver (Nellie E Baumgardner)",female,36,0,3,230136,39,F4,S 181 | 1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ingersoll)",female,64,0,2,PC 17756,83.1583,E45,C 182 | 1072,2,"McCrie, Mr. James Matthew",male,30,0,0,233478,13,,S 183 | 1073,1,"Compton, Mr. Alexander Taylor Jr",male,37,1,1,PC 17756,83.1583,E52,C 184 | 1074,1,"Marvin, Mrs. 
Daniel Warner (Mary Graham Carmichael Farquarson)",female,18,1,0,113773,53.1,D30,S 185 | 1075,3,"Lane, Mr. Patrick",male,,0,0,7935,7.75,,Q 186 | 1076,1,"Douglas, Mrs. Frederick Charles (Mary Helene Baxter)",female,27,1,1,PC 17558,247.5208,B58 B60,C 187 | 1077,2,"Maybery, Mr. Frank Hubert",male,40,0,0,239059,16,,S 188 | 1078,2,"Phillips, Miss. Alice Frances Louisa",female,21,0,1,S.O./P.P. 2,21,,S 189 | 1079,3,"Davies, Mr. Joseph",male,17,2,0,A/4 48873,8.05,,S 190 | 1080,3,"Sage, Miss. Ada",female,,8,2,CA. 2343,69.55,,S 191 | 1081,2,"Veal, Mr. James",male,40,0,0,28221,13,,S 192 | 1082,2,"Angle, Mr. William A",male,34,1,0,226875,26,,S 193 | 1083,1,"Salomon, Mr. Abraham L",male,,0,0,111163,26,,S 194 | 1084,3,"van Billiard, Master. Walter John",male,11.5,1,1,A/5. 851,14.5,,S 195 | 1085,2,"Lingane, Mr. John",male,61,0,0,235509,12.35,,Q 196 | 1086,2,"Drew, Master. Marshall Brines",male,8,0,2,28220,32.5,,S 197 | 1087,3,"Karlsson, Mr. Julius Konrad Eugen",male,33,0,0,347465,7.8542,,S 198 | 1088,1,"Spedden, Master. Robert Douglas",male,6,0,2,16966,134.5,E34,C 199 | 1089,3,"Nilsson, Miss. Berta Olivia",female,18,0,0,347066,7.775,,S 200 | 1090,2,"Baimbrigge, Mr. Charles Robert",male,23,0,0,C.A. 31030,10.5,,S 201 | 1091,3,"Rasmussen, Mrs. (Lena Jacobsen Solvang)",female,,0,0,65305,8.1125,,S 202 | 1092,3,"Murphy, Miss. Nora",female,,0,0,36568,15.5,,Q 203 | 1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S 204 | 1094,1,"Astor, Col. John Jacob",male,47,1,0,PC 17757,227.525,C62 C64,C 205 | 1095,2,"Quick, Miss. Winifred Vera",female,8,1,1,26360,26,,S 206 | 1096,2,"Andrew, Mr. Frank Thomas",male,25,0,0,C.A. 34050,10.5,,S 207 | 1097,1,"Omont, Mr. Alfred Fernand",male,,0,0,F.C. 12998,25.7417,,C 208 | 1098,3,"McGowan, Miss. Katherine",female,35,0,0,9232,7.75,,Q 209 | 1099,2,"Collett, Mr. Sidney C Stuart",male,24,0,0,28034,10.5,,S 210 | 1100,1,"Rosenbaum, Miss. Edith Louise",female,33,0,0,PC 17613,27.7208,A11,C 211 | 1101,3,"Delalic, Mr. 
Redjo",male,25,0,0,349250,7.8958,,S 212 | 1102,3,"Andersen, Mr. Albert Karvin",male,32,0,0,C 4001,22.525,,S 213 | 1103,3,"Finoli, Mr. Luigi",male,,0,0,SOTON/O.Q. 3101308,7.05,,S 214 | 1104,2,"Deacon, Mr. Percy William",male,17,0,0,S.O.C. 14879,73.5,,S 215 | 1105,2,"Howard, Mrs. Benjamin (Ellen Truelove Arman)",female,60,1,0,24065,26,,S 216 | 1106,3,"Andersson, Miss. Ida Augusta Margareta",female,38,4,2,347091,7.775,,S 217 | 1107,1,"Head, Mr. Christopher",male,42,0,0,113038,42.5,B11,S 218 | 1108,3,"Mahon, Miss. Bridget Delia",female,,0,0,330924,7.8792,,Q 219 | 1109,1,"Wick, Mr. George Dennick",male,57,1,1,36928,164.8667,,S 220 | 1110,1,"Widener, Mrs. George Dunton (Eleanor Elkins)",female,50,1,1,113503,211.5,C80,C 221 | 1111,3,"Thomson, Mr. Alexander Morrison",male,,0,0,32302,8.05,,S 222 | 1112,2,"Duran y More, Miss. Florentina",female,30,1,0,SC/PARIS 2148,13.8583,,C 223 | 1113,3,"Reynolds, Mr. Harold J",male,21,0,0,342684,8.05,,S 224 | 1114,2,"Cook, Mrs. (Selena Rogers)",female,22,0,0,W./C. 14266,10.5,F33,S 225 | 1115,3,"Karlsson, Mr. Einar Gervasius",male,21,0,0,350053,7.7958,,S 226 | 1116,1,"Candee, Mrs. Edward (Helen Churchill Hungerford)",female,53,0,0,PC 17606,27.4458,,C 227 | 1117,3,"Moubarek, Mrs. George (Omine Amenia"" Alexander)""",female,,0,2,2661,15.2458,,C 228 | 1118,3,"Asplund, Mr. Johan Charles",male,23,0,0,350054,7.7958,,S 229 | 1119,3,"McNeill, Miss. Bridget",female,,0,0,370368,7.75,,Q 230 | 1120,3,"Everett, Mr. Thomas James",male,40.5,0,0,C.A. 6212,15.1,,S 231 | 1121,2,"Hocking, Mr. Samuel James Metcalfe",male,36,0,0,242963,13,,S 232 | 1122,2,"Sweet, Mr. George Frederick",male,14,0,0,220845,65,,S 233 | 1123,1,"Willard, Miss. Constance",female,21,0,0,113795,26.55,,S 234 | 1124,3,"Wiklund, Mr. Karl Johan",male,21,1,0,3101266,6.4958,,S 235 | 1125,3,"Linehan, Mr. Michael",male,,0,0,330971,7.8792,,Q 236 | 1126,1,"Cumings, Mr. John Bradley",male,39,1,0,PC 17599,71.2833,C85,C 237 | 1127,3,"Vendel, Mr. 
Olof Edvin",male,20,0,0,350416,7.8542,,S 238 | 1128,1,"Warren, Mr. Frank Manley",male,64,1,0,110813,75.25,D37,C 239 | 1129,3,"Baccos, Mr. Raffull",male,20,0,0,2679,7.225,,C 240 | 1130,2,"Hiltunen, Miss. Marta",female,18,1,1,250650,13,,S 241 | 1131,1,"Douglas, Mrs. Walter Donald (Mahala Dutton)",female,48,1,0,PC 17761,106.425,C86,C 242 | 1132,1,"Lindstrom, Mrs. Carl Johan (Sigrid Posse)",female,55,0,0,112377,27.7208,,C 243 | 1133,2,"Christy, Mrs. (Alice Frances)",female,45,0,2,237789,30,,S 244 | 1134,1,"Spedden, Mr. Frederic Oakley",male,45,1,1,16966,134.5,E34,C 245 | 1135,3,"Hyman, Mr. Abraham",male,,0,0,3470,7.8875,,S 246 | 1136,3,"Johnston, Master. William Arthur Willie""""",male,,1,2,W./C. 6607,23.45,,S 247 | 1137,1,"Kenyon, Mr. Frederick R",male,41,1,0,17464,51.8625,D21,S 248 | 1138,2,"Karnes, Mrs. J Frank (Claire Bennett)",female,22,0,0,F.C.C. 13534,21,,S 249 | 1139,2,"Drew, Mr. James Vivian",male,42,1,1,28220,32.5,,S 250 | 1140,2,"Hold, Mrs. Stephen (Annie Margaret Hill)",female,29,1,0,26707,26,,S 251 | 1141,3,"Khalil, Mrs. Betros (Zahie Maria"" Elias)""",female,,1,0,2660,14.4542,,C 252 | 1142,2,"West, Miss. Barbara J",female,0.92,1,2,C.A. 34651,27.75,,S 253 | 1143,3,"Abrahamsson, Mr. Abraham August Johannes",male,20,0,0,SOTON/O2 3101284,7.925,,S 254 | 1144,1,"Clark, Mr. Walter Miller",male,27,1,0,13508,136.7792,C89,C 255 | 1145,3,"Salander, Mr. Karl Johan",male,24,0,0,7266,9.325,,S 256 | 1146,3,"Wenzel, Mr. Linhart",male,32.5,0,0,345775,9.5,,S 257 | 1147,3,"MacKay, Mr. George William",male,,0,0,C.A. 42795,7.55,,S 258 | 1148,3,"Mahon, Mr. John",male,,0,0,AQ/4 3130,7.75,,Q 259 | 1149,3,"Niklasson, Mr. Samuel",male,28,0,0,363611,8.05,,S 260 | 1150,2,"Bentham, Miss. Lilian W",female,19,0,0,28404,13,,S 261 | 1151,3,"Midtsjo, Mr. Karl Albert",male,21,0,0,345501,7.775,,S 262 | 1152,3,"de Messemaeker, Mr. Guillaume Joseph",male,36.5,1,0,345572,17.4,,S 263 | 1153,3,"Nilsson, Mr. August Ferdinand",male,21,0,0,350410,7.8542,,S 264 | 1154,2,"Wells, Mrs. 
Arthur Henry (Addie"" Dart Trevaskis)""",female,29,0,2,29103,23,,S 265 | 1155,3,"Klasen, Miss. Gertrud Emilia",female,1,1,1,350405,12.1833,,S 266 | 1156,2,"Portaluppi, Mr. Emilio Ilario Giuseppe",male,30,0,0,C.A. 34644,12.7375,,C 267 | 1157,3,"Lyntakoff, Mr. Stanko",male,,0,0,349235,7.8958,,S 268 | 1158,1,"Chisholm, Mr. Roderick Robert Crispin",male,,0,0,112051,0,,S 269 | 1159,3,"Warren, Mr. Charles William",male,,0,0,C.A. 49867,7.55,,S 270 | 1160,3,"Howard, Miss. May Elizabeth",female,,0,0,A. 2. 39186,8.05,,S 271 | 1161,3,"Pokrnic, Mr. Mate",male,17,0,0,315095,8.6625,,S 272 | 1162,1,"McCaffry, Mr. Thomas Francis",male,46,0,0,13050,75.2417,C6,C 273 | 1163,3,"Fox, Mr. Patrick",male,,0,0,368573,7.75,,Q 274 | 1164,1,"Clark, Mrs. Walter Miller (Virginia McDowell)",female,26,1,0,13508,136.7792,C89,C 275 | 1165,3,"Lennon, Miss. Mary",female,,1,0,370371,15.5,,Q 276 | 1166,3,"Saade, Mr. Jean Nassr",male,,0,0,2676,7.225,,C 277 | 1167,2,"Bryhl, Miss. Dagmar Jenny Ingeborg ",female,20,1,0,236853,26,,S 278 | 1168,2,"Parker, Mr. Clifford Richard",male,28,0,0,SC 14888,10.5,,S 279 | 1169,2,"Faunthorpe, Mr. Harry",male,40,1,0,2926,26,,S 280 | 1170,2,"Ware, Mr. John James",male,30,1,0,CA 31352,21,,S 281 | 1171,2,"Oxenham, Mr. Percy Thomas",male,22,0,0,W./C. 14260,10.5,,S 282 | 1172,3,"Oreskovic, Miss. Jelka",female,23,0,0,315085,8.6625,,S 283 | 1173,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.775,,S 284 | 1174,3,"Fleming, Miss. Honora",female,,0,0,364859,7.75,,Q 285 | 1175,3,"Touma, Miss. Maria Youssef",female,9,1,1,2650,15.2458,,C 286 | 1176,3,"Rosblom, Miss. Salli Helena",female,2,1,1,370129,20.2125,,S 287 | 1177,3,"Dennis, Mr. William",male,36,0,0,A/5 21175,7.25,,S 288 | 1178,3,"Franklin, Mr. Charles (Charles Fardon)",male,,0,0,SOTON/O.Q. 3101314,7.25,,S 289 | 1179,1,"Snyder, Mr. John Pillsbury",male,24,1,0,21228,82.2667,B45,S 290 | 1180,3,"Mardirosian, Mr. Sarkis",male,,0,0,2655,7.2292,F E46,C 291 | 1181,3,"Ford, Mr. 
Arthur",male,,0,0,A/5 1478,8.05,,S 292 | 1182,1,"Rheims, Mr. George Alexander Lucien",male,,0,0,PC 17607,39.6,,S 293 | 1183,3,"Daly, Miss. Margaret Marcella Maggie""""",female,30,0,0,382650,6.95,,Q 294 | 1184,3,"Nasr, Mr. Mustafa",male,,0,0,2652,7.2292,,C 295 | 1185,1,"Dodge, Dr. Washington",male,53,1,1,33638,81.8583,A34,S 296 | 1186,3,"Wittevrongel, Mr. Camille",male,36,0,0,345771,9.5,,S 297 | 1187,3,"Angheloff, Mr. Minko",male,26,0,0,349202,7.8958,,S 298 | 1188,2,"Laroche, Miss. Louise",female,1,1,2,SC/Paris 2123,41.5792,,C 299 | 1189,3,"Samaan, Mr. Hanna",male,,2,0,2662,21.6792,,C 300 | 1190,1,"Loring, Mr. Joseph Holland",male,30,0,0,113801,45.5,,S 301 | 1191,3,"Johansson, Mr. Nils",male,29,0,0,347467,7.8542,,S 302 | 1192,3,"Olsson, Mr. Oscar Wilhelm",male,32,0,0,347079,7.775,,S 303 | 1193,2,"Malachard, Mr. Noel",male,,0,0,237735,15.0458,D,C 304 | 1194,2,"Phillips, Mr. Escott Robert",male,43,0,1,S.O./P.P. 2,21,,S 305 | 1195,3,"Pokrnic, Mr. Tome",male,24,0,0,315092,8.6625,,S 306 | 1196,3,"McCarthy, Miss. Catherine Katie""""",female,,0,0,383123,7.75,,Q 307 | 1197,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)",female,64,1,1,112901,26.55,B26,S 308 | 1198,1,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S 309 | 1199,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S 310 | 1200,1,"Hays, Mr. Charles Melville",male,55,1,1,12749,93.5,B69,S 311 | 1201,3,"Hansen, Mrs. Claus Peter (Jennie L Howard)",female,45,1,0,350026,14.1083,,S 312 | 1202,3,"Cacic, Mr. Jego Grga",male,18,0,0,315091,8.6625,,S 313 | 1203,3,"Vartanian, Mr. David",male,22,0,0,2658,7.225,,C 314 | 1204,3,"Sadowitz, Mr. Harry",male,,0,0,LP 1588,7.575,,S 315 | 1205,3,"Carr, Miss. Jeannie",female,37,0,0,368364,7.75,,Q 316 | 1206,1,"White, Mrs. John Stuart (Ella Holmes)",female,55,0,0,PC 17760,135.6333,C32,C 317 | 1207,3,"Hagardon, Miss. Kate",female,17,0,0,AQ/3. 30631,7.7333,,Q 318 | 1208,1,"Spencer, Mr. 
William Augustus",male,57,1,0,PC 17569,146.5208,B78,C 319 | 1209,2,"Rogers, Mr. Reginald Harry",male,19,0,0,28004,10.5,,S 320 | 1210,3,"Jonsson, Mr. Nils Hilding",male,27,0,0,350408,7.8542,,S 321 | 1211,2,"Jefferys, Mr. Ernest Wilfred",male,22,2,0,C.A. 31029,31.5,,S 322 | 1212,3,"Andersson, Mr. Johan Samuel",male,26,0,0,347075,7.775,,S 323 | 1213,3,"Krekorian, Mr. Neshan",male,25,0,0,2654,7.2292,F E57,C 324 | 1214,2,"Nesson, Mr. Israel",male,26,0,0,244368,13,F2,S 325 | 1215,1,"Rowe, Mr. Alfred G",male,33,0,0,113790,26.55,,S 326 | 1216,1,"Kreuchen, Miss. Emilie",female,39,0,0,24160,211.3375,,S 327 | 1217,3,"Assam, Mr. Ali",male,23,0,0,SOTON/O.Q. 3101309,7.05,,S 328 | 1218,2,"Becker, Miss. Ruth Elizabeth",female,12,2,1,230136,39,F4,S 329 | 1219,1,"Rosenshine, Mr. George (Mr George Thorne"")""",male,46,0,0,PC 17585,79.2,,C 330 | 1220,2,"Clarke, Mr. Charles Valentine",male,29,1,0,2003,26,,S 331 | 1221,2,"Enander, Mr. Ingvar",male,21,0,0,236854,13,,S 332 | 1222,2,"Davies, Mrs. John Morgan (Elizabeth Agnes Mary White) ",female,48,0,2,C.A. 33112,36.75,,S 333 | 1223,1,"Dulles, Mr. William Crothers",male,39,0,0,PC 17580,29.7,A18,C 334 | 1224,3,"Thomas, Mr. Tannous",male,,0,0,2684,7.225,,C 335 | 1225,3,"Nakid, Mrs. Said (Waika Mary"" Mowad)""",female,19,1,1,2653,15.7417,,C 336 | 1226,3,"Cor, Mr. Ivan",male,27,0,0,349229,7.8958,,S 337 | 1227,1,"Maguire, Mr. John Edward",male,30,0,0,110469,26,C106,S 338 | 1228,2,"de Brito, Mr. Jose Joaquim",male,32,0,0,244360,13,,S 339 | 1229,3,"Elias, Mr. Joseph",male,39,0,2,2675,7.2292,,C 340 | 1230,2,"Denbury, Mr. Herbert",male,25,0,0,C.A. 31029,31.5,,S 341 | 1231,3,"Betros, Master. Seman",male,,0,0,2622,7.2292,,C 342 | 1232,2,"Fillbrook, Mr. Joseph Charles",male,18,0,0,C.A. 15185,10.5,,S 343 | 1233,3,"Lundstrom, Mr. Thure Edvin",male,32,0,0,350403,7.5792,,S 344 | 1234,3,"Sage, Mr. John George",male,,1,9,CA. 2343,69.55,,S 345 | 1235,1,"Cardeza, Mrs. 
James Warburton Martinez (Charlotte Wardle Drake)",female,58,0,1,PC 17755,512.3292,B51 B53 B55,C 346 | 1236,3,"van Billiard, Master. James William",male,,1,1,A/5. 851,14.5,,S 347 | 1237,3,"Abelseth, Miss. Karen Marie",female,16,0,0,348125,7.65,,S 348 | 1238,2,"Botsford, Mr. William Hull",male,26,0,0,237670,13,,S 349 | 1239,3,"Whabee, Mrs. George Joseph (Shawneene Abi-Saab)",female,38,0,0,2688,7.2292,,C 350 | 1240,2,"Giles, Mr. Ralph",male,24,0,0,248726,13.5,,S 351 | 1241,2,"Walcroft, Miss. Nellie",female,31,0,0,F.C.C. 13528,21,,S 352 | 1242,1,"Greenfield, Mrs. Leo David (Blanche Strouse)",female,45,0,1,PC 17759,63.3583,D10 D12,C 353 | 1243,2,"Stokes, Mr. Philip Joseph",male,25,0,0,F.C.C. 13540,10.5,,S 354 | 1244,2,"Dibden, Mr. William",male,18,0,0,S.O.C. 14879,73.5,,S 355 | 1245,2,"Herman, Mr. Samuel",male,49,1,2,220845,65,,S 356 | 1246,3,"Dean, Miss. Elizabeth Gladys Millvina""""",female,0.17,1,2,C.A. 2315,20.575,,S 357 | 1247,1,"Julian, Mr. Henry Forbes",male,50,0,0,113044,26,E60,S 358 | 1248,1,"Brown, Mrs. John Murray (Caroline Lane Lamson)",female,59,2,0,11769,51.4792,C101,S 359 | 1249,3,"Lockyer, Mr. Edward",male,,0,0,1222,7.8792,,S 360 | 1250,3,"O'Keefe, Mr. Patrick",male,,0,0,368402,7.75,,Q 361 | 1251,3,"Lindell, Mrs. Edvard Bengtsson (Elin Gerda Persson)",female,30,1,0,349910,15.55,,S 362 | 1252,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.55,,S 363 | 1253,2,"Mallet, Mrs. Albert (Antoinette Magnin)",female,24,1,1,S.C./PARIS 2079,37.0042,,C 364 | 1254,2,"Ware, Mrs. John James (Florence Louise Long)",female,31,0,0,CA 31352,21,,S 365 | 1255,3,"Strilic, Mr. Ivan",male,27,0,0,315083,8.6625,,S 366 | 1256,1,"Harder, Mrs. George Achilles (Dorothy Annan)",female,25,1,0,11765,55.4417,E50,C 367 | 1257,3,"Sage, Mrs. John (Annie Bullen)",female,,1,9,CA. 2343,69.55,,S 368 | 1258,3,"Caram, Mr. Joseph",male,,1,0,2689,14.4583,,C 369 | 1259,3,"Riihivouri, Miss. Susanna Juhantytar Sanni""""",female,22,0,0,3101295,39.6875,,S 370 | 1260,1,"Gibson, Mrs. 
Leonard (Pauline C Boeson)",female,45,0,1,112378,59.4,,C 371 | 1261,2,"Pallas y Castello, Mr. Emilio",male,29,0,0,SC/PARIS 2147,13.8583,,C 372 | 1262,2,"Giles, Mr. Edgar",male,21,1,0,28133,11.5,,S 373 | 1263,1,"Wilson, Miss. Helen Alice",female,31,0,0,16966,134.5,E39 E41,C 374 | 1264,1,"Ismay, Mr. Joseph Bruce",male,49,0,0,112058,0,B52 B54 B56,S 375 | 1265,2,"Harbeck, Mr. William H",male,44,0,0,248746,13,,S 376 | 1266,1,"Dodge, Mrs. Washington (Ruth Vidaver)",female,54,1,1,33638,81.8583,A34,S 377 | 1267,1,"Bowen, Miss. Grace Scott",female,45,0,0,PC 17608,262.375,,C 378 | 1268,3,"Kink, Miss. Maria",female,22,2,0,315152,8.6625,,S 379 | 1269,2,"Cotterill, Mr. Henry Harry""""",male,21,0,0,29107,11.5,,S 380 | 1270,1,"Hipkins, Mr. William Edward",male,55,0,0,680,50,C39,S 381 | 1271,3,"Asplund, Master. Carl Edgar",male,5,4,2,347077,31.3875,,S 382 | 1272,3,"O'Connor, Mr. Patrick",male,,0,0,366713,7.75,,Q 383 | 1273,3,"Foley, Mr. Joseph",male,26,0,0,330910,7.8792,,Q 384 | 1274,3,"Risien, Mrs. Samuel (Emma)",female,,0,0,364498,14.5,,S 385 | 1275,3,"McNamee, Mrs. Neal (Eileen O'Leary)",female,19,1,0,376566,16.1,,S 386 | 1276,2,"Wheeler, Mr. Edwin Frederick""""",male,,0,0,SC/PARIS 2159,12.875,,S 387 | 1277,2,"Herman, Miss. Kate",female,24,1,2,220845,65,,S 388 | 1278,3,"Aronsson, Mr. Ernst Axel Algot",male,24,0,0,349911,7.775,,S 389 | 1279,2,"Ashby, Mr. John",male,57,0,0,244346,13,,S 390 | 1280,3,"Canavan, Mr. Patrick",male,21,0,0,364858,7.75,,Q 391 | 1281,3,"Palsson, Master. Paul Folke",male,6,3,1,349909,21.075,,S 392 | 1282,1,"Payne, Mr. Vivian Ponsonby",male,23,0,0,12749,93.5,B24,S 393 | 1283,1,"Lines, Mrs. Ernest H (Elizabeth Lindsey James)",female,51,0,1,PC 17592,39.4,D28,S 394 | 1284,3,"Abbott, Master. Eugene Joseph",male,13,0,2,C.A. 2673,20.25,,S 395 | 1285,2,"Gilbert, Mr. William",male,47,0,0,C.A. 30769,10.5,,S 396 | 1286,3,"Kink-Heilmann, Mr. Anton",male,29,3,1,315153,22.025,,S 397 | 1287,1,"Smith, Mrs. 
Lucien Philip (Mary Eloise Hughes)",female,18,1,0,13695,60,C31,S 398 | 1288,3,"Colbert, Mr. Patrick",male,24,0,0,371109,7.25,,Q 399 | 1289,1,"Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli)",female,48,1,1,13567,79.2,B41,C 400 | 1290,3,"Larsson-Rondberg, Mr. Edvard A",male,22,0,0,347065,7.775,,S 401 | 1291,3,"Conlon, Mr. Thomas Henry",male,31,0,0,21332,7.7333,,Q 402 | 1292,1,"Bonnell, Miss. Caroline",female,30,0,0,36928,164.8667,C7,S 403 | 1293,2,"Gale, Mr. Harry",male,38,1,0,28664,21,,S 404 | 1294,1,"Gibson, Miss. Dorothy Winifred",female,22,0,1,112378,59.4,,C 405 | 1295,1,"Carrau, Mr. Jose Pedro",male,17,0,0,113059,47.1,,S 406 | 1296,1,"Frauenthal, Mr. Isaac Gerald",male,43,1,0,17765,27.7208,D40,C 407 | 1297,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20,0,0,SC/PARIS 2166,13.8625,D38,C 408 | 1298,2,"Ware, Mr. William Jeffery",male,23,1,0,28666,10.5,,S 409 | 1299,1,"Widener, Mr. George Dunton",male,50,1,1,113503,211.5,C80,C 410 | 1300,3,"Riordan, Miss. Johanna Hannah""""",female,,0,0,334915,7.7208,,Q 411 | 1301,3,"Peacock, Miss. Treasteall",female,3,1,1,SOTON/O.Q. 3101315,13.775,,S 412 | 1302,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.75,,Q 413 | 1303,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37,1,0,19928,90,C78,Q 414 | 1304,3,"Henriksson, Miss. Jenny Lovisa",female,28,0,0,347086,7.775,,S 415 | 1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S 416 | 1306,1,"Oliva y Ocana, Dona. Fermina",female,39,0,0,PC 17758,108.9,C105,C 417 | 1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S 418 | 1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S 419 | 1309,3,"Peter, Master. 
Michael J",male,,1,1,2668,22.3583,,C 420 | -------------------------------------------------------------------------------- /Clase 1_Presentación del curso/C1_Curso y Intro Machine Learning_Visualizacion.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 1_Presentación del curso/C1_Curso y Intro Machine Learning_Visualizacion.pptx -------------------------------------------------------------------------------- /Clase 3_Intro Python II_Carga de base de datos/archivo.txt: -------------------------------------------------------------------------------- 1 | Escribimos las primeras palabras en el archivo. 2 | Luego, estas palabras se agregarán al final del archivo. -------------------------------------------------------------------------------- /Clase 3_Intro Python II_Carga de base de datos/ejemplo.csv: -------------------------------------------------------------------------------- 1 | nombre;nota;curso;grupo 2 | Luciano;10;Big Data;1 3 | Juana;8;Big Data;1 4 | Pedro;6;Big Data;1 5 | Joaquin;7;Big Data;2 6 | Maria;9;Big Data;2 7 | Tomas;5;Big Data;2 8 | Facundo;6;Big Data;3 9 | Lucia;8;Big Data;3 10 | Martina;9;Big Data;3 11 | Natalia;9;Big Data;4 12 | Ezequiel;8;Big Data;4 13 | Andres;7;Big Data;4 14 | Agustina;10;Big Data;5 15 | Ignacio;9;Big Data;5 16 | Martin;8;Big Data;5 17 | -------------------------------------------------------------------------------- /Clase 3_Intro Python II_Carga de base de datos/ejemplo.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 3_Intro Python II_Carga de base de datos/ejemplo.dta -------------------------------------------------------------------------------- /Clase 3_Intro Python II_Carga de base de datos/ejemplo.xlsx: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 3_Intro Python II_Carga de base de datos/ejemplo.xlsx -------------------------------------------------------------------------------- /Clase 3_Intro Python II_Carga de base de datos/exportar_ejemplo.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 3_Intro Python II_Carga de base de datos/exportar_ejemplo.xlsx -------------------------------------------------------------------------------- /Clase 3_Intro Python II_Carga de base de datos/exportar_ejemplo2.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 3_Intro Python II_Carga de base de datos/exportar_ejemplo2.xlsx -------------------------------------------------------------------------------- /Clase 4_Pandas & Matplotlib/Clase4_P1(Pandas).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Big Data y Machine Learning (UBA) 2025\n", 8 | "## Class 4 - Part 1\n", 9 | "\n", 10 | "This class focuses on three of the most widely used Python libraries: a brief mention of NumPy, followed by Pandas and Matplotlib."
11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "**NumPy** is a numerical computing library for Python; many of\n", 18 | "the main data analysis libraries are built on top of\n", 19 | "NumPy.\n", 20 | "\n", 21 | "**Pandas** is probably the library you will use most in your first steps with Python. It is designed for working with structured data tables: importing them, manipulating them, and analyzing them. \n", 22 | "\n", 23 | "**Matplotlib** is a plotting library (covered in Part 2 of this class).\n", 24 | "\n", 25 | "There is much more to learn about these libraries than this class covers. The goal is for you to know enough to start using them and to solve the most common tasks." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### Introduction to NumPy\n", 33 | "\n", 34 | "NumPy is a numerical computing library for Python, first released in 1995 (as its predecessor, Numeric). Its core is written in low-level languages (for example, C) so that mathematical operations run efficiently.\n", 35 | "\n", 36 | "Many widely used data analysis libraries take NumPy as their base library and add further functionality on top of it. Some examples are:\n", 37 | "- Pandas: for handling tabular data\n", 38 | "- Matplotlib: for plotting\n", 39 | "- SciPy: for scientific computing\n", 40 | "- scikit-learn: for machine learning\n" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Pandas\n", 48 | "\n", 49 | "Pandas is the most widely used library for handling tabular data; its functions and methods make working with this kind of data (DataFrames) much easier. 
It is used for data processing, analysis and visualization tasks.\n", 50 | "\n", 51 | "The two main types of Pandas objects are:\n", 52 | "- Series (a labeled one-dimensional array)\n", 53 | "- DataFrames (a labeled two-dimensional data structure whose columns may have different types).\n", 54 | " \n", 55 | "In **Class 3** we covered:\n", 56 | "- importing/exporting and creating DataFrames\n", 57 | "- exploring the data table\n", 58 | "\n", 59 | "In this tutorial we will use Pandas to:\n", 60 | "- filter data\n", 61 | "- add columns / rows\n", 62 | "- combine datasets (merge, join)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "#### Review\n", 70 | "\n", 71 | "Some of the Pandas methods we saw in Tutorial 1 are:\n", 72 | "- `pd.read_excel()`: open an .xls or .xlsx file\n", 73 | "- `pd.read_csv()`: open a .csv file\n", 74 | "- `pd.read_stata()`: open a .dta file\n", 75 | "- `df.head(N)`: show the first N rows\n", 76 | "- `df.tail(N)`: show the last N rows\n", 77 | "- `df.sample(N)`: show a random sample of N rows\n", 78 | "- `pd.DataFrame(columns=[\"AA\", \"BB\"])`: create a table with columns AA and BB\n", 79 | "- `df.to_excel()`: save a table to an .xls or .xlsx file\n", 80 | "- `df.to_csv()`: save a table to a .csv file" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 2, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "import pandas as pd\n", 90 | "import os" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "##### Exploring data\n", 98 | "We will use an example with records of made-up student groups" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "# Remember to keep the file in your working folder, or change your directory with:\n",
108 | "os.getcwd() # See where we are located\n", 109 | "os.chdir(\"/Users/mnromero/Dropbox/COURSES/2025 - S1- Big Data y Machine Learning (UBA)/Clases/Clase 4_Pandas & Matplotlib\") # Move to the folder with all the files for this class\n", 110 | "\n", 111 | "# Open the file and look at the first two rows\n", 112 | "df = pd.read_excel(\"tabla_ejemplo.xlsx\")\n", 113 | "df.head(2)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "The 0 and 1 in the margin are the row numbers" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "df.sample(3)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "When importing a df, it is good practice to print a few rows to check that it loaded correctly.\n", 137 | "It is also useful to print the list of column names and the data types:\n", 138 | "- `df.columns`: list of column names\n", 139 | "- `df.dtypes`: data type of each column\n", 140 | "- `df.shape`: number of rows and columns\n", 141 | "- `df.info()`: more detailed information about the df" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "print('Columns:', df.columns)\n", 151 | "print('\\nTypes:\\n', df.dtypes) # shows the data type of each column\n", 152 | "print('\\nShape:', df.shape) # shows how many rows and columns the table has\n", 153 | "print('\\nInfo:'); df.info() # info() is a method that prints its own summary, so we call it instead of printing it" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "df.info(verbose=True) # verbose prints the full summary" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "### Working with dataframes\n", 170 | "Next
we will go through a series of operations we can perform on dataframes:\n", 171 | "\n", 172 | "1. create columns\n", 173 | "2. data types\n", 174 | "3. apply functions\n", 175 | "4. select/filter data\n", 176 | "5. drop duplicates\n", 177 | "6. append\n", 178 | "7. join/merge\n", 179 | "8. group (aggregate)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Some of the operations we can perform on pandas dataframes:\n", 187 | "We can select a column in two ways:\n", 188 | "- **`df[\"column name\"]`**: this returns the column as a pandas **Series** object. It works with names that contain spaces, and it can be used to create columns and for operations with and between existing columns.\n", 189 | "- **`df.column_name`**: gives access to the column as an **attribute** of the dataframe. One limitation is that it cannot be used with names that contain spaces, nor to create new columns. It does work for operations with and between existing columns."
190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "# We will work with an example dataframe:\n", 199 | "df = pd.read_excel(\"tabla_ejemplo_2.xlsx\") # example 2\n", 200 | "print(df.columns)\n", 201 | "print(df.head(3))" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "#### 1) Create columns" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "# Create a new column with the total number of enrolled students, called \"inscriptos_total\"\n", 218 | "df[\"inscriptos_total\"] = df[\"inscriptos_ronda1\"] + df[\"inscriptos_ronda2\"]\n", 219 | "print(df.head(3))" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "# Add 2 enrolled students to each group in round 2 and recompute the total\n", 229 | "df[\"inscriptos_ronda2\"] = df[\"inscriptos_ronda2\"] + 2\n", 230 | "df[\"inscriptos_total\"] = df[\"inscriptos_ronda1\"] + df[\"inscriptos_ronda2\"]\n", 231 | "print(df.head(3))" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "# Build an id by concatenating area and subject, with a separator\n", 241 | "df[\"area_asignatura\"] = df[\"area\"]+\"_\"+df[\"asignatura\"]\n", 242 | "\n", 243 | "# Add a column indicating the enrollment status\n", 244 | "df[\"inscripcion_estado\"] = \"CERRADA\"\n", 245 | "\n", 246 | "df" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "#### 2) Data type\n", 254 | "We already saw how to use pandas to find the data type of every column, using df.info(verbose=True).\n", 255 | "To see the data type of a single column we can
use `df[\"column_name\"].dtype`:" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "df[\"inscriptos_total\"].dtype" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "#### 3) Apply functions to columns\n", 272 | "Columns are iterable objects: we can apply a function to a column and it will act on each of its elements.\n", 273 | "We can apply built-in functions as well as our own. Let us look at an example with a *built-in function*:" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [ 282 | "# round() rounds to the nearest integer (ties go to the nearest even number)\n", 283 | "df[\"edad_promedio_r\"] = round(df[\"edad_promedio\"])" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "df[\"edad_promedio_r\"]" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "We can also use `df[\"column_name\"].apply(function_x)`, where `apply()` passes the values of `df[\"column_name\"]` as the argument to `function_x()`.\n", 300 | "Apply iterates over the inputs and applies the function to each element."
301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "# Alternative 2:\n", 310 | "df['edad_promedio_r'] = df['edad_promedio'].apply(round)\n", 311 | "df" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "Now let us look at an example with our own function:" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "# First we define a simple function\n", 328 | "def clasificar_tamano(inscriptos):\n", 329 | "    if inscriptos < 30:\n", 330 | "        tamano = \"Chico\"\n", 331 | "    else:\n", 332 | "        tamano = \"Grande\"\n", 333 | "    return tamano" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "# Now we apply it\n", 343 | "df['inscriptos_total'].apply(clasificar_tamano)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "# Add that information as a new column\n", 353 | "df['tamano_clase'] = df['inscriptos_total'].apply(clasificar_tamano)\n", 354 | "df " 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "#### 4) Select columns and/or rows\n", 362 | "There are several ways to select a subset of rows and/or columns (*slicing*):" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "# Select one column\n", 372 | "df[\"grupo\"]" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# Select several columns (using a list of column names)\n", 382 |
"df[[\"grupo\",\"asignatura\",\"inscriptos_total\"]]" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [ 391 | "# Save a copy as another dataframe\n", 392 | "df_resumen = df[[\"grupo\",\"inscriptos_total\"]].copy()\n", 393 | "df_resumen" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": { 400 | "scrolled": true 401 | }, 402 | "outputs": [], 403 | "source": [ 404 | "# To select a subset of rows we can use their positions:\n", 405 | "\n", 406 | "# Select rows 3 to 5\n", 407 | "df[3:6] # the lower bound is included but the upper bound is not, i.e. the interval [3, 6)" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "# Select the first 3 rows\n", 417 | "df[:3]" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": null, 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "# Select the last 3 rows\n", 427 | "df[-3:]" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "To filter using conditions:\n", 435 | "- With a single condition: `df[condition]`\n", 436 | "- With more than one condition: `df[(condition1) & (condition2)]` or `df[(condition1) | (condition2)]`\n", 437 | "\n", 438 | "Each condition is boolean (evaluates to True or False); two or more conditions must be joined by a logical operator (\"&\" or \"|\"). Note that the operators are \"&\"/\"|\" and not \"and\"/\"or\": Pandas needs element-wise operators, whereas \"and\"/\"or\" try to reduce each whole Series to a single True/False, which raises an error. Each condition must also be wrapped in parentheses, because \"&\" and \"|\" bind more tightly than comparisons such as \"<\"." 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "df[df[\"area\"]!=\"Cs.
Sociales\"]" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "df[df[\"inscriptos_total\"]<30]" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": {}, 463 | "outputs": [], 464 | "source": [ 465 | "df[(df[\"inscriptos_total\"]<30) & (df[\"edad_promedio\"]>40)] # shows rows where the number enrolled is below 30 AND the average age is above 40" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "# We can also select on a condition like this:\n", 475 | "opciones = ['Física', 'Geografía']\n", 476 | "df[df['asignatura'].isin(opciones)] # isin() is a Pandas method for this kind of membership test" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "To combine row and column filters:\n", 484 | "- **`df.loc[rows, columns]`**: we give the **names/labels** of the columns and rows we want to select. Unlike plain slicing, with loc the upper bound IS included.\n", 485 | "- **`df.iloc[rows, columns]`**: we give the **positions** (as integers) of the columns and rows we want to select. With iloc the upper bound is NOT included.\n", 486 | "\n", 487 | "Note: a row's label and its position are often the same (both numeric)."
488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [ 496 | "# Compare slicing, loc and iloc\n", 497 | "display(df[3:6]) # interval [3, 6)\n", 498 | "display(df.loc[3:6]) # interval [3, 6]\n", 499 | "display(df.iloc[3:6]) # interval [3, 6)" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "In the previous example we used numbers with `loc` to select rows. Why? Because the row labels were numbers!\n", 507 | "If our df had text labels on its rows it would be different, for example:" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": null, 513 | "metadata": {}, 514 | "outputs": [], 515 | "source": [ 516 | "# Create a df with a labeled index\n", 517 | "df_label = df[:3].copy() # copy so that later changes cannot affect df\n", 518 | "index_ = ['Row_1', 'Row_2', 'Row_3']\n", 519 | "df_label.index = index_ # set the index\n", 520 | "df_label" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "metadata": {}, 527 | "outputs": [], 528 | "source": [ 529 | "#display(df_label.loc[:3]) # using loc with numbers now raises an error.
we must use labels!\n", 530 | "display(df_label.loc[[\"Row_1\", 'Row_2']])" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": {}, 537 | "outputs": [], 538 | "source": [ 539 | "# Selecting rows and columns\n", 540 | "df.loc[3:4,[\"grupo\",\"asignatura\"]]" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": null, 546 | "metadata": {}, 547 | "outputs": [], 548 | "source": [ 549 | "df.iloc[3:5,[0,4]]" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "#### 5) Drop duplicates" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "To drop rows and columns we can:\n", 564 | "- \"overwrite\" the dataframe with a selection of the rows/columns we DO want, saving the new copy over the old one.\n", 565 | "- use the `df.drop()` method.\n", 566 | "`.drop()` takes a list of labels as its argument and can be used to drop rows or columns. The \"axis\" parameter sets whether to drop rows (axis=0, the default) or columns (axis=1)." 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": null, 572 | "metadata": {}, 573 | "outputs": [], 574 | "source": [ 575 | "df" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "# Drop rows\n", 585 | "df.drop([0,3,12]) # implicitly uses axis=0; the change is not saved\n", 586 | "\n", 587 | "# Drop columns\n", 588 | "df.drop([\"inscripcion_estado\"], axis=1) # axis=1 marks it as a column; the change is not saved\n", 589 | "\n", 590 | "# To keep the change we can overwrite the object (df = ...)
or use inplace=True\n", 591 | "df.drop([\"inscripcion_estado\"], axis=1, inplace = True )\n", 592 | "display(df)\n", 593 | "\n", 594 | "df_cut = df.drop([0,3,12])\n", 595 | "display(df_cut)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "#### 6) Append\n", 603 | "Python lets us work with several datasets at once. To stack a set of datasets **one below the other**, we can use `pd.concat([df_a, df_b])` to append df_a and df_b (the older `df_a.append(df_b)` was removed in pandas 2.0)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": {}, 610 | "outputs": [], 611 | "source": [ 612 | "# Load a trimmed version of the dataset\n", 613 | "df_a = pd.read_excel(\"tabla_ejemplo_3a.xlsx\") # explore the contents\n", 614 | "print(df_a.shape)\n", 615 | "\n", 616 | "df_b = pd.read_excel(\"tabla_ejemplo_3b.xlsx\") # explore the contents\n", 617 | "print(df_b.shape)\n", 618 | "\n", 619 | "df = pd.concat([df_a, df_b]) # this step appends the rows of the second dataset at the end\n", 620 | "print(df.shape)" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": {}, 627 | "outputs": [], 628 | "source": [ 629 | "# explore to see the result" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "metadata": {}, 635 | "source": [ 636 | "It is good practice to inspect the results after an append or merge. For example, after the append the row index has repeated values, because the indices of the original datasets were preserved.
To fix this:" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": null, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "# after appending the data, you can view it with display\n", 646 | "display(df)\n", 647 | "df.reset_index(drop=True) # the original index is discarded; this returns a new dataframe, so assign it (df = ...) to keep the change" 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "#### 7) Join or link datasets (join/merge)\n", 655 | "We can use `.join()` and `.merge()` to **combine datasets horizontally** on a common identifier.\n", 656 | "Let us look at `.merge()` (join is similar; see the [Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).\n", 657 | "\n", 658 | "`merge` is called in the form:\n", 659 | "`df_a.merge(df_b, on=id_columns, how=join_type)`, where:\n", 660 | "* `id_columns` is the name(s) of the column(s) used to join the datasets. **These columns must be present in both DataFrames.** If no value is passed (and no index is being merged, i.e. left_index and right_index are False), the merge uses the intersection of the columns of the two DataFrames.\n", 661 | "* The types of joins are:\n", 662 | "  - **inner join**: only rows that match _in both datasets_ are kept. A row with no match in one of the two is dropped. When the keys are unique, the result has at most as many rows as the smaller dataset.\n", 663 | "  - **left join**: keeps 100% of the rows of the _left_ table in the merge, and adds the columns of the _right_ dataset with their values (when there is a match) or fills them with nulls (when there is no match).
If the right dataset has rows that are not present in the left dataset, they are simply not used. The result has the same number of rows as the left dataset, provided the keys in the right dataset are not repeated.\n", 664 | "  - **right join**: same as the previous one, but the rows of the right dataset are kept instead of those of the left.\n", 665 | "\n", 666 | "  - **outer join**: all rows are kept. Where there is a match the rows are combined, and where there is none they are stacked and filled with nulls.\n", 667 | "\n", 668 | "The join you will use most often is the **left join**, which is used when you have a table and want to enrich it with new columns.\n", 669 | "\n", 670 | "" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "metadata": {}, 678 | "outputs": [], 679 | "source": [ 680 | "# Merge example (linking)\n", 681 | "print(df.columns)\n", 682 | "\n", 683 | "# Open another dataset and look at its columns\n", 684 | "df_c = pd.read_excel(\"tabla_ejemplo_3c.xlsx\") # explore the contents\n", 685 | "print(df_c.columns)\n", 686 | "\n", 687 | "# Do the merge and look at the resulting columns\n", 688 | "df = df.merge(df_c) # explore to see the result\n", 689 | "print(df.columns)\n", 690 | "df" 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": {}, 696 | "source": [ 697 | "Depending on the data we are using, or if the datasets are not well curated, a merge can produce duplicated records.\n", 698 | "Here we simulate that case. To remove duplicated records:" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": {}, 705 | "outputs": [], 706 | "source": [ 707 | "display(df.tail(6))\n", 708 | "df.drop_duplicates(inplace=True)\n", 709 | "df # what does the row index look like now?"
710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": null, 715 | "metadata": {}, 716 | "outputs": [], 717 | "source": [ 718 | "df.reset_index(drop=True, inplace=True) # inplace saves the result\n", 719 | "df" 720 | ] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "metadata": {}, 725 | "source": [ 726 | "#### 8) Group (aggregate)\n", 727 | "A frequent operation on dataframes is aggregating the data: grouping by a set of variables and applying an aggregation function to others. In pandas this is done with `.groupby()`, in the form:\n", 728 | "`df.groupby(by=grouping_columns).agg(var_func_dict)`\n", 729 | "\n", 730 | "Here grouping_columns is a list with the names of the columns to group by, and var_func_dict is a \n", 731 | "dictionary mapping each variable to aggregate (key) to the function used to aggregate it (value).\n", 732 | "\n", 733 | "Some of the aggregation functions are:\n", 734 | "- sum\n", 735 | "- mean\n", 736 | "- count\n", 737 | "- first / last\n", 738 | "- min / max" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "# We continue with the same df defined above\n", 748 | "df_agg = df.groupby(by=[\"area\",\"asignatura\"]).agg({\"inscriptos_ronda1\":\"sum\",\"inscriptos_ronda2\":\"sum\",\"edad_promedio\":\"mean\"})\n", 749 | "\n", 750 | "# If you look at the result you will see that the grouping variables now define the index\n", 751 | "df_agg\n" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": null, 757 | "metadata": {}, 758 | "outputs": [], 759 | "source": [ 760 | "# To go back to a numeric index we do:\n", 761 | "df_agg.reset_index(inplace = True) # drop = False by default\n", 762 | "# Note that this way area and asignatura are columns of the table again\n", 763 | "df_agg" 764 | ] 765 | }, 766 | { 767 |
"cell_type": "code", 768 | "execution_count": null, 769 | "metadata": {}, 770 | "outputs": [], 771 | "source": [] 772 | } 773 | ], 774 | "metadata": { 775 | "kernelspec": { 776 | "display_name": "Python [conda env:base] *", 777 | "language": "python", 778 | "name": "conda-base-py" 779 | }, 780 | "language_info": { 781 | "codemirror_mode": { 782 | "name": "ipython", 783 | "version": 3 784 | }, 785 | "file_extension": ".py", 786 | "mimetype": "text/x-python", 787 | "name": "python", 788 | "nbconvert_exporter": "python", 789 | "pygments_lexer": "ipython3", 790 | "version": "3.12.4" 791 | } 792 | }, 793 | "nbformat": 4, 794 | "nbformat_minor": 4 795 | } 796 | -------------------------------------------------------------------------------- /Clase 4_Pandas & Matplotlib/Clase4_P2(Matplotlib).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "FjZT3QZcKOuf" 7 | }, 8 | "source": [ 9 | "# Big Data y Machine Learning (UBA) 2025\n", 10 | "## Class 4 - Part 2\n", 11 | "\n", 12 | "The goal is to plot with Matplotlib." 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "Matplotlib is the base plotting library, on top of which other libraries are built. Within Matplotlib we use the \"pyplot\" module, which is installed with the library. By convention we import it like this:" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "# First, install the library:\n", 29 | "#!pip install matplotlib\n", 30 | "import matplotlib.pyplot as plt # import the plotting library.
plt is the alias it is given by convention" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Matplotlib builds plots on two interrelated objects:\n", 38 | "- Figure: the blank sheet, the frame that contains the plot(s). In practice this happens behind the scenes, but it is what makes drawing the plot possible.\n", 39 | "- Axes: the plot itself, the axes and the plotted information. The representation of the information on axes." 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "The parts of a plot\n", 47 | "" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "There are essentially two ways to plot with Matplotlib:\n", 56 | "- **pyplot style**: simple and quick for figures that are not very advanced. Perhaps easier to start with.\n", 57 | "- **Object-oriented style**: a little more complex but necessary for figures that need a lot of customization.\n", 58 | "\n", 59 | "In terms of visual result, both can achieve the same quality. For first steps it does not matter which one you use.
However, the object-oriented style is needed for more complex figures with several plots (subplots), where different parameters must be set for each pair of (2D) axes" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": { 65 | "id": "_wHJy23yKOuh" 66 | }, 67 | "source": [ 68 | "### Plotting with Matplotlib" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "executionInfo": { 76 | "elapsed": 1094, 77 | "status": "ok", 78 | "timestamp": 1661282643386, 79 | "user": { 80 | "displayName": "Belén Michel Torino", 81 | "userId": "16232771333703850174" 82 | }, 83 | "user_tz": 240 84 | }, 85 | "id": "nrMwI5vWKOuh" 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "import matplotlib.pyplot as plt\n", 90 | "import pandas as pd" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "import os\n", 100 | "os.chdir(\"/Users/mnromero/Dropbox/COURSES/2025 - S1- Big Data y Machine Learning (UBA)/Clases/Clase 4_Pandas & Matplotlib\")" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": { 107 | "colab": { 108 | "base_uri": "https://localhost:8080/", 109 | "height": 353 110 | }, 111 | "executionInfo": { 112 | "elapsed": 289, 113 | "status": "error", 114 | "timestamp": 1661282644488, 115 | "user": { 116 | "displayName": "Belén Michel Torino", 117 | "userId": "16232771333703850174" 118 | }, 119 | "user_tz": 240 120 | }, 121 | "id": "JCYTjWGHKOui", 122 | "outputId": "4b13c0cd-44ad-47f2-d28e-9a79fc39c177" 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "# Open the file with the installed power capacity in the country\n", 127 | "df = pd.read_excel(\"potencia_instalada.xlsx\")" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | " # explore the dataset here.
Hint: tail, head, sample, info...\n" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "id": "XdiLXPy2KOui", 144 | "outputId": "b96cd9f3-7cb1-4f18-bc49-8a3ec1d1fb5c" 145 | }, 146 | "outputs": [], 147 | "source": [ 148 | "# Aggregate (collapse) at the level of source type\n", 149 | "df_fuente = df.groupby(by=[\"periodo\",\"fuente_generacion\"]).agg({\"potencia_instalada_mw\":\"sum\"})\n", 150 | "df_fuente.reset_index(inplace=True)\n", 151 | "df_fuente.sample(5)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": { 158 | "id": "NJ9utAqzKOuj", 159 | "outputId": "8bd123af-e0ba-4bc9-be38-52f1a3cb02ee" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "df_fuente.shape" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "We are going to plot two lines, so we define an X and a Y vector for each one. We will plot the installed generation capacity for the Renovable (renewable) and Térmica (thermal) sources:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "id": "jhirJJclKOuk" 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "# Define the data vectors for series 1 (renewable)\n", 182 | "y1 = df_fuente[df_fuente[\"fuente_generacion\"]==\"Renovable\"][\"potencia_instalada_mw\"]\n", 183 | "x1 = df_fuente[df_fuente[\"fuente_generacion\"]==\"Renovable\"][\"periodo\"]\n", 184 | "# Define the data vectors for series 2 (thermal)\n", 185 | "y2 = df_fuente[df_fuente[\"fuente_generacion\"]==\"Térmica\"][\"potencia_instalada_mw\"]\n", 186 | "x2 = df_fuente[df_fuente[\"fuente_generacion\"]==\"Térmica\"][\"periodo\"]\n", 187 | "\n", 188 | "# Note: df[condition][column] selects \"column\" from the dataset that results from applying the filter df[condition]."
189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": { 195 | "id": "Jriw9yo0KOul", 196 | "outputId": "439d3efc-8ea0-4cc0-a6d4-a815a549946c" 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "# Create the plot pyplot-style\n", 201 | "\n", 202 | "plt.plot(x1, y1, label=\"Renovable\") # series 1\n", 203 | "plt.plot(x2, y2, label=\"Térmica\") # series 2\n", 204 | "# These two lines will be drawn on the same plot\n", 205 | "\n", 206 | "# Set the labels\n", 207 | "plt.xlabel(\"Período\")\n", 208 | "plt.ylabel(\"Potencia Instalada (MW)\")\n", 209 | "plt.title(\"Producción Energética Argentina según Fuente\")\n", 210 | "\n", 211 | "# Add the legend\n", 212 | "plt.legend()\n", 213 | "plt.show() # displays the figure (required when running as a script)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "id": "GBdj3IlTKOum", 221 | "outputId": "e58796c6-687e-4229-ccfd-373f4f1352d5" 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "# Create the plot in OO (object-oriented) style\n", 226 | "\n", 227 | "# Create the figure and the axes\n", 228 | "fig, ax = plt.subplots() # create the objects\n", 229 | "\n", 230 | "# Define the series\n", 231 | "ax.plot(x1, y1, label=\"Renovable\") # Series 1\n", 232 | "ax.plot(x2, y2, label=\"Térmica\") # Series 2\n", 233 | "\n", 234 | "# Set the labels and title\n", 235 | "ax.set_xlabel(\"Período\")\n", 236 | "ax.set_ylabel(\"Potencia Instalada (MW)\")\n", 237 | "ax.set_title(\"Producción Energética Argentina según Fuente (v2)\")\n", 238 | "\n", 239 | "# Add the legend\n", 240 | "ax.legend()\n", 241 | "plt.show() # fig.show() warns under the notebook's inline (non-GUI) backend" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "id": "bFHB0axvKOum", 249 | "outputId": "5cd4d1c0-5cbe-408c-8060-0db99a254b5d" 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "# Draw multiple plots pyplot-style\n", 254 | "\n", 255 | "# example: 2 ax in one
fig\n", 256 | "plt.figure(figsize=(14, 5))\n", 257 | "\n", 258 | "# Definimos primer gráfico\n", 259 | "plt.subplot(121) # subplot(nrows, ncols, index, **kwargs) donde nrows=1, ncols=2, index=1\n", 260 | "plt.plot(x1, y1)\n", 261 | "plt.title(\"A. Fuente Renovable\")\n", 262 | "\n", 263 | "# Definimos segundo gráfico\n", 264 | "plt.subplot(122)\n", 265 | "plt.plot(x2, y2)\n", 266 | "plt.title(\"B. Fuente Térmica\")\n", 267 | "\n", 268 | "# Definimos título general de la figura\n", 269 | "plt.suptitle(\"Ejemplo dos gráficos en una figura\")\n", 270 | "plt.show()" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": null, 276 | "metadata": { 277 | "id": "bFHB0axvKOum", 278 | "outputId": "5cd4d1c0-5cbe-408c-8060-0db99a254b5d" 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "# Graficar múltiples gráficos estilo O-O\n", 283 | "\n", 284 | "# ejemplo 2 ax en un fig\n", 285 | "fig, ax = plt.subplots(figsize=(14, 5), ncols=2, nrows=1)\n", 286 | "\n", 287 | "# Definimos primer gráfico\n", 288 | "ax[0].plot(x1, y1)\n", 289 | "ax[0].set_title(\"Fuente Renovable\")\n", 290 | "\n", 291 | "# Definimos segundo gráfico\n", 292 | "ax[1].plot(x2, y2)\n", 293 | "ax[1].set_title(\"Fuente Térmica\")\n", 294 | "\n", 295 | "# Definimos título general de la figura\n", 296 | "fig.suptitle(\"Ejemplo dos gráficos en una figura\")\n", 297 | "fig.show()" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "Comentario: un gráfico orientado a objetos (O-O) en Matplotlib se refiere al uso explícito de los objetos Figure y Axes para crear y controlar un gráfico. Esto contrasta con el enfoque stateful (pyplot), que modifica el estado global de la visualización.\n" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "### Gráficos con ipywidgets\n", 312 | "\n", 313 | "Los widgets en Python son objetos que tienen una representación en el navegador. 
Por ejemplo, los widgets pueden tener forma de una caja de texto, un desplegable, una casilla de verificación, etc.\n", 314 | "Más info sobre ipywidgets [acá](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "import matplotlib.pyplot as plt\n", 324 | "import ipywidgets as widgets\n", 325 | "from IPython.display import display\n", 326 | "\n", 327 | "import datetime" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "pip install ipywidgets" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": null, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "lista_fuentes = list(set(df_fuente['fuente_generacion']))\n", 346 | "lista_fuentes" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "# Seleccionar tipo de fuente\n", 356 | "print(\"Seleccionar Fuente:\")\n", 357 | "fuente = widgets.Dropdown(\n", 358 | " options=['Nuclear','Renovable', 'Hidráulica', 'Térmica'],\n", 359 | " value='Nuclear', # \"Nuclear\" es la opción seleccionada de forma predeterminada cuando se crea el widget\n", 360 | " description='Fuente:',\n", 361 | " disabled=False # widget activo. si disabled=True, el widget se vuelve inactivo y el usuario no puede interactuar con él.\n", 362 | ")\n", 363 | "display(fuente) # Muestra el widget" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "fechas = list(set(df_fuente['periodo'].dt.strftime(\"%y-%m\"))) \n", 373 | "# set para eliminar duplicados\n", 374 | "# strftime() para formatear la fechas en un string según un formato deseado. 
\n", 375 | "# \"%y-%m\" formato tal que se muestren los últimos dos dígitos del año (%y) seguidos por el mes (%m).\n", 376 | "fechas.sort()\n", 377 | "fechas" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "select_fecha = widgets.SelectionRangeSlider(\n", 387 | " options=fechas,\n", 388 | " index=(0, len(fechas)-1),\n", 389 | " description='Fechas',\n", 390 | " disabled=False\n", 391 | ")\n", 392 | "display(select_fecha)" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "# Probamos si los valores quedaron actualizados\n", 402 | "# Para usar los valores definidos usamos .value\n", 403 | "\n", 404 | "print(\"El rango de fechas a usar es: \", select_fecha.value)\n", 405 | "print(\"La fuente a mostrar es: \", fuente.value)" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "# Creamos un dataframe con la selección de filas de la fuente elegida\n", 415 | "df_temp = df_fuente[df_fuente['fuente_generacion'] == fuente.value]\n", 416 | "\n", 417 | "# Extraemos el objeto fecha del string creado a partir del widget\n", 418 | "fecha_min = datetime.datetime.strptime(select_fecha.value[0], \"%y-%m\") \n", 419 | "fecha_max = datetime.datetime.strptime(select_fecha.value[1], \"%y-%m\")\n", 420 | "# con el módulo datetime y su función strptime creamos un objeto datetime (que contiene info de date y time)\n", 421 | "\n", 422 | "# Filtramos según fechas elegidas\n", 423 | "df_temp = df_temp[(df_temp['periodo']>fecha_min)&(df_temp['periodo']\n", 61 | "Fuente: Curso de Instituto Humai - APIs\n", 62 | "" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": { 68 | "id": "AmN5n_xDfpsj" 69 | }, 70 | "source": [ 71 | "Cada vez que vamos al navegador y escribimos la dirección de 
una página web, **estamos haciendo un GET request** a un servidor. Esto es una petición para adquirir el código de un recurso que queremos visualizar en el navegador.\n", 72 | "\n", 73 | "La URL es la parte más importante de la definición de un GET request (aunque el navegador agrega otras cosas también, que no vemos) y nos permite cambiar la representación deseada de un mismo recurso de distintas maneras:\n", 74 | "\n", 75 | "* https://deportes.mercadolibre.com.ar/pelotas-futbol pide al servidor pelotas de fútbol.\n", 76 | "* https://deportes.mercadolibre.com.ar/pelotas-futbol_OrderId_PRICE pide al servidor pelotas de fútbol ordenadas por precio.\n", 77 | "\n", 78 | "Cuando escribimos una URL en un navegador, la mayoría de las veces hacemos GET requests que devuelven código HTML (el código que da una estructura a una página web, tal como vimos en el video anterior cuando obteníamos el código HTML al hacer web scraping). Pero los GET requests pueden devolver datos en otros formatos (por ejemplo en JSON y en CSV).\n", 79 | "\n", 80 | "Las APIs REST que definen GET requests capaces de devolver datos en formato JSON y CSV, son particularmente útiles cuando queremos analizar datos.\n", 81 | "\n", 82 | "Ahora vamos a conocer algunas APIs." 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### API World Bank" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "Pueden ver la documentación [acá](https://wbdata.readthedocs.io/en/stable/)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "import sys\n", 106 | "!{sys.executable} -m pip install wbdata # a mi esta linea me anda bien para installar el nuevo paquete \"wbdata\". 
Este es un wrapper para usar la API del Banco Mundial (World Bank)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "#!pip install wbdata\n", 116 | "import wbdata\n", 117 | "import pandas as pd" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "help(wbdata)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "indicadores = {'HD.HCI.HLOS.FE':'scores_edu_fem','HD.HCI.HLOS.MA':'scores_edu_masc'}\n", 136 | "#HD.HCI.HLOS.FE Harmonized Test Scores, Female\n", 137 | "#HD.HCI.HLOS.MA Harmonized Test Scores, Male\n", 138 | "\n", 139 | "data = wbdata.get_dataframe(indicadores, country=['USA','ARG'])\n", 140 | "\n", 141 | "df = pd.DataFrame(data=data)" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "df.head()" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "df" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "ax = df.plot(kind='bar', title='Puntaje en educación')\n", 169 | "ax.set_xlabel('País-Año',color='grey')\n", 170 | "ax.set_ylabel('Puntaje',color='grey')\n", 171 | "ax.legend([\"Mujeres\",\"Varones\"])\n", 172 | "# Acá estamos usando el index del df como xticklabels" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "Ahora buscamos hacer un gráfico solo con datos del año 2020" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 
188 | "# Dejamos índice como columnas\n", 189 | "df.reset_index(inplace=True)\n", 190 | "df" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "print(df[\"date\"].dtype) # no es numérica\n", 200 | "df_2020 = df[df[\"date\"]==\"2020\"]\n", 201 | "df_2020" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "df_2020 = df_2020.set_index([\"country\", \"date\"])\n", 211 | "df_2020" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "# Graficamos\n", 221 | "ax = df_2020.plot(kind='bar', title='Puntaje en educación en 2020')\n", 222 | "ax.set_xlabel('País-Año',color='grey')\n", 223 | "ax.set_ylabel('Puntaje',color='grey')\n", 224 | "ax.tick_params(axis=\"x\", rotation=0)\n", 225 | "ax.legend([\"Mujeres\",\"Varones\"])" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": { 231 | "id": "AzeRiw9Yfpsk" 232 | }, 233 | "source": [ 234 | "### API Series de Tiempo\n", 235 | "\n", 236 | "La **[API Series de Tiempo de la Republica Argentina](https://apis.datos.gob.ar/series)** es una API REST desarrollada y mantenida por el Estado Nacional de Argentina para la consulta de estadísticas en formato de series de tiempo. Contiene series publicadas por organismos de la Administración Pública Nacional.\n", 237 | "\n", 238 | "La API permite:\n", 239 | "\n", 240 | "* [Buscar series](https://datosgobar.github.io/series-tiempo-ar-api/reference/search-reference/) por texto. 
También se pueden buscar en el sitio web de datos.gob.ar: https://datos.gob.ar/series\n", 241 | "* Cambiar la frecuencia (por ejemplo: convertir series diarias en mensuales)\n", 242 | "* Elegir la función de agregación de valores usada en el cambio de frecuencia (una serie se puede convertir de diaria a mensual promediando, sumando, o tomando el máximo, el mínimo, el último valor del período, etc.)\n", 243 | "* Filtrar por rango de fechas\n", 244 | "* Elegir el formato (CSV o JSON)\n", 245 | "* Cambiar la configuración del CSV (carácter separador, carácter decimal)\n", 246 | "\n", 247 | "En https://datos.gob.ar/series podés buscar series de tiempo publicadas por distintos organismos de la Administración Pública Nacional en Argentina y usar el link al CSV para leerlos directamente desde Python con pandas.\n", 248 | "\n", 249 | "También podés **buscar los ids de las series de interés** y juntarlos en la misma consulta para armar una tabla de hasta 40 series." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "import requests\n", 259 | "import pandas as pd\n", 260 | "import matplotlib.pyplot as plt" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "id": "nZsrFDOLfpsk", 268 | "outputId": "e4358abc-e3ed-4723-a13e-9e72789a6426" 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "# Un ejemplo\n", 273 | "\n", 274 | "url_arg = \"https://apis.datos.gob.ar/series/api/series?ids=105.1_I2N_2016_M_16,105.1_I2L_2016_M_14,105.1_I2L_2016_M_16&format=json\"\n", 275 | "\n", 276 | "response = requests.get(url_arg)\n", 277 | "print(response)\n", 278 | "\n", 279 | "datos = response.json()\n", 280 | "datos" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "id": "vt802wz-fpsk" 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "d = datos['data']\n", 292 | "data_arg = 
pd.DataFrame(d)\n", 293 | "data_arg.columns = ['fecha', 'IPC Limon', 'IPC Naranja', 'IPC Lechuga']" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": { 300 | "id": "c6TOFbtufpsk", 301 | "outputId": "9f9b5b15-3e10-4a9d-c7e3-63a6c7c9228e" 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "data_arg" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "data_arg['fecha'] = pd.to_datetime(data_arg['fecha'])\n", 315 | "\n", 316 | "# Creamos la figura y los axes\n", 317 | "fig, ax = plt.subplots() # Crear objetos\n", 318 | "\n", 319 | "# Definimos series\n", 320 | "ax.plot(data_arg['fecha'], data_arg['IPC Limon'], label=\"IPC Limón\", color = '#FFDE21') \n", 321 | "ax.plot(data_arg['fecha'], data_arg['IPC Naranja'], label=\"IPC Naranja\", color ='orange') \n", 322 | "ax.plot(data_arg['fecha'], data_arg['IPC Lechuga'], label=\"IPC Lechuga\", color = 'green')\n", 323 | "\n", 324 | "# Modificamos labels y título\n", 325 | "ax.set_xlabel(\"Mes\")\n", 326 | "ax.set_ylabel(\"Índice (Diciembre 2016 = 1)\")\n", 327 | "ax.set_title(\"Evolución IPC\")\n", 328 | "\n", 329 | "# Configuramos las etiquetas del eje X para que solo muestren los meses de enero\n", 330 | "data_arg_january = data_arg[data_arg['fecha'].dt.month == 1] # Filtramos solo los meses de enero\n", 331 | "ax.set_xticks(data_arg_january['fecha']) # Establecemos los ticks solo en enero\n", 332 | "ax.set_xticklabels(data_arg_january['fecha'].dt.strftime('%Y-%m')) # Mostramos solo el año y mes en formato 'YYYY-MM'\n", 333 | "plt.xticks(rotation=20)\n", 334 | "\n", 335 | "# Agregamos leyenda\n", 336 | "ax.legend()\n", 337 | "\n", 338 | "# Mostramos la figura\n", 339 | "plt.show()" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [] 348 | } 349 | ], 350 | "metadata": { 351 | 
"colab": { 352 | "provenance": [] 353 | }, 354 | "kernelspec": { 355 | "display_name": "Python [conda env:base] *", 356 | "language": "python", 357 | "name": "conda-base-py" 358 | }, 359 | "language_info": { 360 | "codemirror_mode": { 361 | "name": "ipython", 362 | "version": 3 363 | }, 364 | "file_extension": ".py", 365 | "mimetype": "text/x-python", 366 | "name": "python", 367 | "nbconvert_exporter": "python", 368 | "pygments_lexer": "ipython3", 369 | "version": "3.12.4" 370 | }, 371 | "varInspector": { 372 | "cols": { 373 | "lenName": 16, 374 | "lenType": 16, 375 | "lenVar": 40 376 | }, 377 | "kernels_config": { 378 | "python": { 379 | "delete_cmd_postfix": "", 380 | "delete_cmd_prefix": "del ", 381 | "library": "var_list.py", 382 | "varRefreshCmd": "print(var_dic_list())" 383 | }, 384 | "r": { 385 | "delete_cmd_postfix": ") ", 386 | "delete_cmd_prefix": "rm(", 387 | "library": "var_list.r", 388 | "varRefreshCmd": "cat(var_dic_list()) " 389 | } 390 | }, 391 | "position": { 392 | "height": "484.85px", 393 | "left": "554.2px", 394 | "right": "20px", 395 | "top": "114px", 396 | "width": "417px" 397 | }, 398 | "types_to_exclude": [ 399 | "module", 400 | "function", 401 | "builtin_function_or_method", 402 | "instance", 403 | "_Feature" 404 | ], 405 | "window_display": false 406 | } 407 | }, 408 | "nbformat": 4, 409 | "nbformat_minor": 4 410 | } 411 | -------------------------------------------------------------------------------- /Clase 6_PCA/C2_Metodos No Supervisados I_PCA.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 6_PCA/C2_Metodos No Supervisados I_PCA.pptx -------------------------------------------------------------------------------- /Clase 6_PCA/Clase6_UBA_PCA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | 
"metadata": { 6 | "id": "SnMR3sOW6j5y" 7 | }, 8 | "source": [ 9 | "# Big Data y Machine Learning (UBA) 2025\n", 10 | "## Clase 6 - Análisis de Componentes Principales (PCA)" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "id": "FFE_dv7D75Da" 17 | }, 18 | "source": [ 19 | "Componentes principales (PCA, en inglés) es una técnica de **aprendizaje no supervisado**. Es decir que nos encontramos en una situación donde tenemos información de un conjunto de variables o features ($X_1, X_2, ..., X_p$), pero no sobre una variable de resultado o outcome ($Y$). Vamos a tratar de ajustar algoritmos que interpreten la distribución de nuestros datos y encuentren relaciones interesantes entre éstos, trabajando con la naturaleza propia de los datos y sin un outcome de interés $Y$. \n", 20 | "Esto se diferencia del **aprendizaje supervisado**, caso en el cual los estimadores se usan para **predecir** resultados basados en datos que poseen un outcome o variable de resultado $Y$ (puede ser una etiqueta -clasificación- o un valor -regresión-).\n", 21 | "\n", 22 | "Los algoritmos de aprendizaje no supervisado pueden ser muy útiles para casos en los que se busca **reducir la dimensionalidad**, por ejemplo cuando se busca visualizar datos de gran dimensionalidad o se busca crear un índice. PCA suele emplearse como parte del **análisis descriptivo y exploratorio de datos**." 
23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Supongamos que tenemos $n$ observaciones y $p$ variables y queremos visualizarlas como parte de una análisis exploratorio de los datos.\n", 30 | "\n", 31 | "Podríamos realizar gráficos de a 2 variables, pero serían muchos si $p$ es grande...\n", 32 | "Entonces vamos a buscar una representación de los datos en menos dimensiones (2 usualmente) que capture la mayor información posible.\n", 33 | "\n", 34 | "Las dimensiones serán combinaciones lineales de las $p$ variables que tienen la mayor varianza posible." 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": { 40 | "id": "mWiCdrSF_fqo" 41 | }, 42 | "source": [ 43 | "Vamos a trabajar con la librería [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Ejemplo 1" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "#Instalamos los paquetes necesarios\n", 60 | "!pip install statsmodels\n", 61 | "!pip install scikit-learn\n", 62 | "!pip install ISLP\n", 63 | "\n", 64 | "# Alternativa\n", 65 | "#import sys\n", 66 | "#!{sys.executable} -m pip install statsmodels\n", 67 | "#!{sys.executable} -m pip install scikit-learn\n", 68 | "#!{sys.executable} -m pip install ISLP" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "import ISLP\n", 78 | "from ISLP import load_data\n", 79 | "from statsmodels.datasets import get_rdataset\n", 80 | "\n", 81 | "import numpy as np\n", 82 | "import pandas as pd\n", 83 | "import matplotlib.pyplot as plt\n", 84 | "from sklearn.preprocessing import StandardScaler\n", 85 | "from sklearn.decomposition import PCA" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | 
"execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "import os\n", 95 | "print(os.getcwd())" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "#Descargar los archivos de esta clase en una carpeta y poner el directorio\n", 105 | "os.chdir(\"/Users/mnromero/Dropbox/COURSES/2025 - S1- Big Data y Machine Learning (UBA)/Clases/Clase 6_PCA\") " 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "Vamos a trabajar con una base de datos que provee el [libro ISLP](https://islp.readthedocs.io/en/latest/datasets/USArrests.html).\n", 113 | "\n", 114 | "##### Violent Crime Rates by US State\n", 115 | "Contiene información sobre arrestos por asaltos, asesinatos y violaciones cada 100.000 habitantes en 50 estados de Estados Unidos en 1973. También tiene información sobre el porcentaje de población viviendo en zonas urbanas.\n", 116 | "\n", 117 | "- Murder: Murder arrests (per 100,000)\n", 118 | "- Assault: Assault arrests (per 100,000)\n", 119 | "- Rape: Rape arrests (per 100,000)\n", 120 | "- UrbanPop: Percent urban population" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "#### 1. 
Chequeamos la base de datos a usar" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": { 134 | "scrolled": true 135 | }, 136 | "outputs": [], 137 | "source": [ 138 | "arrests = get_rdataset('USArrests').data\n", 139 | "print(arrests.shape)\n", 140 | "arrests.info() # info() es un método; imprime el resumen directamente\n", 141 | "print(\"\\n\", arrests.dtypes)\n", 142 | "print(\"\\n\", arrests.head())" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "# Estadística descriptiva (promedio) de las variables originales\n", 152 | "print(arrests.mean())" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "##### 1.2 Transformamos las variables\n", 160 | "Escalamos las variables antes de usar PCA" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "# Inicializamos el transformador\n", 170 | "scaler = StandardScaler(with_std=True, with_mean=True) \n", 171 | "# Aplicamos fit_transform al DataFrame\n", 172 | "arrests_transformed = pd.DataFrame(scaler.fit_transform(arrests), columns=arrests.columns)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "# Chequeamos que tengan media 0 y desvío estándar 1\n", 182 | "print(\"Promedio luego de la transformación\\n\",arrests_transformed.mean()) # luego de la estandarización la media es cero\n", 183 | "print(\"Desvío estándar luego de la transformación\\n\",arrests_transformed.std(ddof=0)) # con ddof=0 (como usa StandardScaler) el desvío estándar es uno" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "# Visualizamos\n", 193 | "print(arrests_transformed.head())" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 
198 | "metadata": {}, 199 | "source": [ 200 | "Importante: PCA se define sobre variables centradas (media cero). `PCA()` de scikit-learn centra los datos automáticamente, pero no los escala. Las escalamos a desvío estándar 1 para que las variables con mayor varianza (por sus unidades de medida) no dominen los componentes." 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "### 2. Aplicamos PCA \n", 208 | "Buscamos combinaciones lineales de las variables con la mayor varianza posible, sujetas a la restricción de normalización de los loadings" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "# Ajustamos el modelo\n", 218 | "pca = PCA()\n", 219 | "arrests_pca = pca.fit_transform(arrests_transformed)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "#### 2.1 Veamos los distintos elementos de dicho método no supervisado" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "# Scores / índices (z_im): i indexa las filas (observaciones) y m, los componentes\n", 236 | "scores = arrests_pca\n", 237 | "print(scores)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "# Loading vectors\n", 247 | "loading_vectors = pca.components_ # cada fila corresponde a un CP y cada columna, a una variable\n", 248 | "print(\"Loadings:\\n\", pca.components_)\n", 249 | "print(\"Loadings del CP1:\\n\",pca.components_[0]) \n", 250 | "pca.components_[0,0] # loading de la variable 1 en el CP1\n" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "#### 2.2 Recordemos la restricción de normalización: la suma de los ponderadores al cuadrado es igual a 1 " 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "# Notar que si tomamos los 
loadings/ponderadores del primer componente principal, por ejemplo:\n", 267 | "(0.53589947)**2+(0.58318363)**2+(0.27819087)**2+(0.5434309)**2 \n", 268 | "# Vemos que la suma de sus cuadrados es igual a 1: ¡es la restricción que habíamos impuesto!" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "Tarea: probar que $\\phi_1'\\phi_2=0$ (los vectores de loadings del CP1 y del CP2 son ortogonales)" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "### 3. Gráfico de Dispersión (*Biplot*)\n", 283 | "Una ventaja de este método no supervisado es que podemos enfocarnos en los primeros dos componentes, que resumen gran parte de la información original de nuestra matriz X." 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "i, j = 0, 1 # Componentes\n", 293 | "fig, ax = plt.subplots(1, 1, figsize=(5, 5)) # creamos 1 subplot\n", 294 | "ax.scatter(scores[:,i], scores[:,j]) # graficamos los valores de los CP1 y CP2\n", 295 | "ax.set_xlabel('ComponentePrincipal%d' % (i+1))\n", 296 | "ax.set_ylabel('ComponentePrincipal%d' % (j+1))\n", 297 | "plt.title('Análisis de los primeros dos componentes principales')\n", 298 | "for k in range(pca.components_.shape[1]): # loop que itera por la cantidad de features\n", 299 | " ax.arrow(0, 0, pca.components_[i,k], pca.components_[j,k]) # flecha desde el origen (0) a las coordenadas\n", 300 | " ax.text(pca.components_[i,k], pca.components_[j,k], arrests.columns[k]) # al final de cada flecha, nombre de la variable" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "# Biplot\n", 310 | "# Ajustes: extendemos la longitud de las flechas e invertimos el eje y\n", 311 | "\n", 312 | "i, j = 0, 1 # Componentes\n", 313 | "\n", 314 | "scale_arrow = s_ = 2 # para extender la longitud de las flechas y que se vean mejor\n", 315 | 
"scores[:,1] *= -1\n", 316 | "pca.components_[1] *= -1 # gira el eje y (CP2)\n", 317 | "\n", 318 | "fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n", 319 | "ax.scatter(scores[:,0], scores[:,1]) \n", 320 | "ax.set_xlabel('ComponentePrincipal%d' % (i+1))\n", 321 | "ax.set_ylabel('ComponentePrincipal%d' % (j+1))\n", 322 | "plt.title('Analisis de los primeros dos componentes principales')\n", 323 | "for k in range(pca.components_.shape[1]):\n", 324 | " ax.arrow(0, 0, s_*pca.components_[i,k], s_*pca.components_[j,k])\n", 325 | " ax.text(s_*pca.components_[i,k], s_*pca.components_[j,k], arrests.columns[k])" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "#### 4. Proporcion de la Varianza explicada\n", 333 | "Para entender cuánta información de la matriz original $X$ resume cada uno de los componentes, podemos calcular y visualizar la varianza de la matriz original $X$ explicada por cada uno de los componentes." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "# % de la Varianza explicada por los componentes \n", 343 | "print(pca.explained_variance_ratio_) # CP1 explica el 62% de la varianza" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "%%capture \n", 353 | "fig, axes = plt.subplots(1, 2, figsize=(10, 4)) # 2 subplots uno al lado del otro\n", 354 | "ticks = np.arange(pca.n_components_)+1 # para crear ticks en el eje horizontal\n", 355 | "ax = axes[0]\n", 356 | "ax.plot(ticks, pca.explained_variance_ratio_ , marker='o')\n", 357 | "ax.set_xlabel('Componente principal');\n", 358 | "ax.set_ylabel('Prop. 
de la varianza explicada')\n", 359 | "ax.set_ylim([0,1])\n", 360 | "ax.set_xticks(ticks)\n", 361 | "# capture suprime la visualización de la figura parcialmente terminada" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "ax = axes[1]\n", 371 | "ax.plot(ticks, pca.explained_variance_ratio_.cumsum(), marker='o') \n", 372 | "ax.set_xlabel('Componente principal')\n", 373 | "ax.set_ylabel('Suma acumulada de la varianza explicada')\n", 374 | "ax.set_ylim([0, 1])\n", 375 | "ax.set_xticks(ticks)\n", 376 | "fig" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "Para decidir *qué número de componentes usar*, podemos consultar un scree plot que muestre la proporción de la varianza explicada por cada componente y la proporción acumulada de la varianza explicada.\n", 384 | "\n", 385 | "Típicamente se elige la cantidad de componentes a partir de la cual la proporción de la varianza explicada por cada componente principal adicional cae marcadamente (el codo del scree plot)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "### Ejemplo 2" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "import numpy as np\n", 402 | "import matplotlib.pyplot as plt\n", 403 | "import pandas as pd\n", 404 | "from sklearn.preprocessing import StandardScaler\n", 405 | "from sklearn.decomposition import PCA" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "Vamos a trabajar con un dataset de vinos (de scikit-learn). 
Contiene características de 178 vinos y a qué segmento de consumidores pertenecen" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "# Importar el dataset y breve exploración\n", 422 | "wine_data = pd.read_csv('Wine.csv')\n", 423 | "print(wine_data.shape)\n", 424 | "print(wine_data.dtypes)\n", 425 | "print(wine_data.head())" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "# Separamos datos entre X e Y (por ahora, haremos de cuenta que no contamos con Y)\n", 435 | "wine_features = wine_data.iloc[:, 0:13].values\n", 436 | "wine_customer_segment = wine_data.iloc[:, 13].values\n", 437 | "\n", 438 | "# Vemos las etiquetas posibles de customer segment\n", 439 | "wine_customer_segment_unique, counts = np.unique(wine_customer_segment, return_counts=True)\n", 440 | "for value, count in zip(wine_customer_segment_unique, counts):\n", 441 | " print(f\"Value: {value}, Count: {count}\")" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": null, 447 | "metadata": {}, 448 | "outputs": [], 449 | "source": [ 450 | "# Preprocesamiento. Estandarizar las variables\n", 451 | "# Iniciar scaler y aplicarlo\n", 452 | "sc = StandardScaler()\n", 453 | "wine_features_transformed = sc.fit_transform(wine_features)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "Por qué estandarizamos? 
El análisis es sensible a la varianza de las variables originales y eso puede ocasionar problemas a la hora de elegir los CPs" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "# Aplicar PCA\n", 470 | "\n", 471 | "pca = PCA(n_components = 2)\n", 472 | "\n", 473 | "wine_pca = pca.fit_transform(wine_features_transformed) # Obtenemos los scores" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": {}, 480 | "outputs": [], 481 | "source": [ 482 | "# Scores\n", 483 | "wine_scores = wine_pca\n", 484 | "\n", 485 | "# % de la Varianza explicada por los componentes\n", 486 | "print(\"Varianza explicada:\", pca.explained_variance_ratio_)\n", 487 | "# El primer componente principal explica el 36% de la varianza, mientras que el segundo, explica el 19%\n", 488 | "\n", 489 | "# Loading vectors\n", 490 | "loading_vectors = pca.components_ # cada fila corresponde a un CP y cada columna, a una variable\n", 491 | "print(\"Loadings:\\n\", pca.components_)\n", 492 | "print(\"Loadings del CP1:\\n\",pca.components_[0]) \n", 493 | "\n", 494 | "# Visualizamos features y loadings\n", 495 | "for i, loading_vector in enumerate(loading_vectors):\n", 496 | " print(f\"\\nLoading Vector CP{i+1}:\")\n", 497 | " for j, feature in enumerate(wine_data.columns[:-1]):\n", 498 | " print(f\"{feature}: {round(loading_vector[j],3)}\")\n", 499 | " print()" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "# Crear un DataFrame para los componentes principales\n", 509 | "pca_df = pd.DataFrame(data=wine_pca, columns=['Componente_1', 'Componente_2'])\n", 510 | "\n", 511 | "# Añadir la variable objetivo al DataFrame de los componentes principales\n", 512 | "pca_df['Customer_Segment'] = wine_customer_segment\n", 513 | "pca_df" 514 | ] 515 | }, 516 | { 517 | 
"cell_type": "code", 518 | "execution_count": null, 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "# Graficamos los componentes\n", 523 | "plt.figure(figsize=(6, 6))\n", 524 | "plt.scatter(pca_df['Componente_1'], pca_df['Componente_2'], c=wine_data['Customer_Segment'], cmap='viridis')\n", 525 | "plt.xlabel('Componente Principal 1', fontsize=11)\n", 526 | "plt.ylabel('Componente Principal 2', fontsize=11)\n", 527 | "plt.title('Análisis de los primeros dos componentes principales')\n", 528 | "plt.colorbar(label='Customer Segment')\n", 529 | "plt.show()" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "# Otra forma de graficar: un color por segmento\n", 539 | "plt.figure(figsize=(6, 6))\n", 540 | "plt.xlabel('Componente Principal 1', fontsize=11)\n", 541 | "plt.ylabel('Componente Principal 2', fontsize=11)\n", 542 | "plt.title(\"Wine Data\",fontsize=16)\n", 543 | "plt.xticks(fontsize=11)\n", 544 | "plt.yticks(fontsize=11)\n", 545 | "\n", 546 | "targets = [1, 2, 3]\n", 547 | "colors = ['red', 'green', 'blue']\n", 548 | "\n", 549 | "for target, color in zip(targets,colors):\n", 550 | "    indices_graf = pca_df['Customer_Segment'] == target\n", 551 | "    plt.scatter(pca_df.loc[indices_graf, 'Componente_1'], pca_df.loc[indices_graf, 'Componente_2'], c = color, s = 50)\n", 552 | "\n", 553 | "#plt.xlim(-4,4)\n", 554 | "#plt.ylim(-4,4)\n", 555 | "plt.legend(targets)" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "La representación bidimensional de los datos (originalmente de 13 variables) capta correctamente el patrón principal: las observaciones rojas, verdes y azules siguen apareciendo agrupadas por segmento en la representación bidimensional. \n", 563 | "\n", 564 | "Nota: aquí usamos dos componentes pero podríamos haber usado 1 o más de 2. 
" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": {}, 571 | "outputs": [], 572 | "source": [] 573 | } 574 | ], 575 | "metadata": { 576 | "colab": { 577 | "collapsed_sections": [ 578 | "NFyrvNVL26HH", 579 | "pEiQnszy3lmi", 580 | "1yaJI88z4TfW", 581 | "ZoQVkU4C-GNd" 582 | ], 583 | "name": "Aprendizaje_no_supervisado.ipynb", 584 | "provenance": [], 585 | "toc_visible": true 586 | }, 587 | "kernelspec": { 588 | "display_name": "Python [conda env:base] *", 589 | "language": "python", 590 | "name": "conda-base-py" 591 | }, 592 | "language_info": { 593 | "codemirror_mode": { 594 | "name": "ipython", 595 | "version": 3 596 | }, 597 | "file_extension": ".py", 598 | "mimetype": "text/x-python", 599 | "name": "python", 600 | "nbconvert_exporter": "python", 601 | "pygments_lexer": "ipython3", 602 | "version": "3.12.4" 603 | }, 604 | "vscode": { 605 | "interpreter": { 606 | "hash": "988c801e8fa6188d3e53012a7256361dd6100dad47899d4700f624e035bcb20b" 607 | } 608 | } 609 | }, 610 | "nbformat": 4, 611 | "nbformat_minor": 4 612 | } 613 | -------------------------------------------------------------------------------- /Clase 6_PCA/USArrests.csv: -------------------------------------------------------------------------------- 1 | "State","Murder","Assault","UrbanPop","Rape" 2 | "Alabama",13.2,236,58,21.2 3 | "Alaska",10,263,48,44.5 4 | "Arizona",8.1,294,80,31 5 | "Arkansas",8.8,190,50,19.5 6 | "California",9,276,91,40.6 7 | "Colorado",7.9,204,78,38.7 8 | "Connecticut",3.3,110,77,11.1 9 | "Delaware",5.9,238,72,15.8 10 | "Florida",15.4,335,80,31.9 11 | "Georgia",17.4,211,60,25.8 12 | "Hawaii",5.3,46,83,20.2 13 | "Idaho",2.6,120,54,14.2 14 | "Illinois",10.4,249,83,24 15 | "Indiana",7.2,113,65,21 16 | "Iowa",2.2,56,57,11.3 17 | "Kansas",6,115,66,18 18 | "Kentucky",9.7,109,52,16.3 19 | "Louisiana",15.4,249,66,22.2 20 | "Maine",2.1,83,51,7.8 21 | "Maryland",11.3,300,67,27.8 22 | "Massachusetts",4.4,149,85,16.3 23 | 
"Michigan",12.1,255,74,35.1 24 | "Minnesota",2.7,72,66,14.9 25 | "Mississippi",16.1,259,44,17.1 26 | "Missouri",9,178,70,28.2 27 | "Montana",6,109,53,16.4 28 | "Nebraska",4.3,102,62,16.5 29 | "Nevada",12.2,252,81,46 30 | "New Hampshire",2.1,57,56,9.5 31 | "New Jersey",7.4,159,89,18.8 32 | "New Mexico",11.4,285,70,32.1 33 | "New York",11.1,254,86,26.1 34 | "North Carolina",13,337,45,16.1 35 | "North Dakota",0.8,45,44,7.3 36 | "Ohio",7.3,120,75,21.4 37 | "Oklahoma",6.6,151,68,20 38 | "Oregon",4.9,159,67,29.3 39 | "Pennsylvania",6.3,106,72,14.9 40 | "Rhode Island",3.4,174,87,8.3 41 | "South Carolina",14.4,279,48,22.5 42 | "South Dakota",3.8,86,45,12.8 43 | "Tennessee",13.2,188,59,26.9 44 | "Texas",12.7,201,80,25.5 45 | "Utah",3.2,120,80,22.9 46 | "Vermont",2.2,48,32,11.2 47 | "Virginia",8.5,156,63,20.7 48 | "Washington",4,145,73,26.2 49 | "West Virginia",5.7,81,39,9.3 50 | "Wisconsin",2.6,53,66,10.8 51 | "Wyoming",6.8,161,60,15.6 52 | -------------------------------------------------------------------------------- /Clase 6_PCA/Wine.csv: -------------------------------------------------------------------------------- 1 | Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline,Customer_Segment 2 | 14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,1 3 | 13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050,1 4 | 13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185,1 5 | 14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480,1 6 | 13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735,1 7 | 14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450,1 8 | 14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290,1 9 | 14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295,1 10 | 14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045,1 11 | 13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.22,1.01,3.55,1045,1 12 | 
14.1,2.16,2.3,18,105,2.95,3.32,0.22,2.38,5.75,1.25,3.17,1510,1 13 | 14.12,1.48,2.32,16.8,95,2.2,2.43,0.26,1.57,5,1.17,2.82,1280,1 14 | 13.75,1.73,2.41,16,89,2.6,2.76,0.29,1.81,5.6,1.15,2.9,1320,1 15 | 14.75,1.73,2.39,11.4,91,3.1,3.69,0.43,2.81,5.4,1.25,2.73,1150,1 16 | 14.38,1.87,2.38,12,102,3.3,3.64,0.29,2.96,7.5,1.2,3,1547,1 17 | 13.63,1.81,2.7,17.2,112,2.85,2.91,0.3,1.46,7.3,1.28,2.88,1310,1 18 | 14.3,1.92,2.72,20,120,2.8,3.14,0.33,1.97,6.2,1.07,2.65,1280,1 19 | 13.83,1.57,2.62,20,115,2.95,3.4,0.4,1.72,6.6,1.13,2.57,1130,1 20 | 14.19,1.59,2.48,16.5,108,3.3,3.93,0.32,1.86,8.7,1.23,2.82,1680,1 21 | 13.64,3.1,2.56,15.2,116,2.7,3.03,0.17,1.66,5.1,0.96,3.36,845,1 22 | 14.06,1.63,2.28,16,126,3,3.17,0.24,2.1,5.65,1.09,3.71,780,1 23 | 12.93,3.8,2.65,18.6,102,2.41,2.41,0.25,1.98,4.5,1.03,3.52,770,1 24 | 13.71,1.86,2.36,16.6,101,2.61,2.88,0.27,1.69,3.8,1.11,4,1035,1 25 | 12.85,1.6,2.52,17.8,95,2.48,2.37,0.26,1.46,3.93,1.09,3.63,1015,1 26 | 13.5,1.81,2.61,20,96,2.53,2.61,0.28,1.66,3.52,1.12,3.82,845,1 27 | 13.05,2.05,3.22,25,124,2.63,2.68,0.47,1.92,3.58,1.13,3.2,830,1 28 | 13.39,1.77,2.62,16.1,93,2.85,2.94,0.34,1.45,4.8,0.92,3.22,1195,1 29 | 13.3,1.72,2.14,17,94,2.4,2.19,0.27,1.35,3.95,1.02,2.77,1285,1 30 | 13.87,1.9,2.8,19.4,107,2.95,2.97,0.37,1.76,4.5,1.25,3.4,915,1 31 | 14.02,1.68,2.21,16,96,2.65,2.33,0.26,1.98,4.7,1.04,3.59,1035,1 32 | 13.73,1.5,2.7,22.5,101,3,3.25,0.29,2.38,5.7,1.19,2.71,1285,1 33 | 13.58,1.66,2.36,19.1,106,2.86,3.19,0.22,1.95,6.9,1.09,2.88,1515,1 34 | 13.68,1.83,2.36,17.2,104,2.42,2.69,0.42,1.97,3.84,1.23,2.87,990,1 35 | 13.76,1.53,2.7,19.5,132,2.95,2.74,0.5,1.35,5.4,1.25,3,1235,1 36 | 13.51,1.8,2.65,19,110,2.35,2.53,0.29,1.54,4.2,1.1,2.87,1095,1 37 | 13.48,1.81,2.41,20.5,100,2.7,2.98,0.26,1.86,5.1,1.04,3.47,920,1 38 | 13.28,1.64,2.84,15.5,110,2.6,2.68,0.34,1.36,4.6,1.09,2.78,880,1 39 | 13.05,1.65,2.55,18,98,2.45,2.43,0.29,1.44,4.25,1.12,2.51,1105,1 40 | 13.07,1.5,2.1,15.5,98,2.4,2.64,0.28,1.37,3.7,1.18,2.69,1020,1 41 | 
14.22,3.99,2.51,13.2,128,3,3.04,0.2,2.08,5.1,0.89,3.53,760,1 42 | 13.56,1.71,2.31,16.2,117,3.15,3.29,0.34,2.34,6.13,0.95,3.38,795,1 43 | 13.41,3.84,2.12,18.8,90,2.45,2.68,0.27,1.48,4.28,0.91,3,1035,1 44 | 13.88,1.89,2.59,15,101,3.25,3.56,0.17,1.7,5.43,0.88,3.56,1095,1 45 | 13.24,3.98,2.29,17.5,103,2.64,2.63,0.32,1.66,4.36,0.82,3,680,1 46 | 13.05,1.77,2.1,17,107,3,3,0.28,2.03,5.04,0.88,3.35,885,1 47 | 14.21,4.04,2.44,18.9,111,2.85,2.65,0.3,1.25,5.24,0.87,3.33,1080,1 48 | 14.38,3.59,2.28,16,102,3.25,3.17,0.27,2.19,4.9,1.04,3.44,1065,1 49 | 13.9,1.68,2.12,16,101,3.1,3.39,0.21,2.14,6.1,0.91,3.33,985,1 50 | 14.1,2.02,2.4,18.8,103,2.75,2.92,0.32,2.38,6.2,1.07,2.75,1060,1 51 | 13.94,1.73,2.27,17.4,108,2.88,3.54,0.32,2.08,8.9,1.12,3.1,1260,1 52 | 13.05,1.73,2.04,12.4,92,2.72,3.27,0.17,2.91,7.2,1.12,2.91,1150,1 53 | 13.83,1.65,2.6,17.2,94,2.45,2.99,0.22,2.29,5.6,1.24,3.37,1265,1 54 | 13.82,1.75,2.42,14,111,3.88,3.74,0.32,1.87,7.05,1.01,3.26,1190,1 55 | 13.77,1.9,2.68,17.1,115,3,2.79,0.39,1.68,6.3,1.13,2.93,1375,1 56 | 13.74,1.67,2.25,16.4,118,2.6,2.9,0.21,1.62,5.85,0.92,3.2,1060,1 57 | 13.56,1.73,2.46,20.5,116,2.96,2.78,0.2,2.45,6.25,0.98,3.03,1120,1 58 | 14.22,1.7,2.3,16.3,118,3.2,3,0.26,2.03,6.38,0.94,3.31,970,1 59 | 13.29,1.97,2.68,16.8,102,3,3.23,0.31,1.66,6,1.07,2.84,1270,1 60 | 13.72,1.43,2.5,16.7,108,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285,1 61 | 12.37,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520,2 62 | 12.33,1.1,2.28,16,101,2.05,1.09,0.63,0.41,3.27,1.25,1.67,680,2 63 | 12.64,1.36,2.02,16.8,100,2.02,1.41,0.53,0.62,5.75,0.98,1.59,450,2 64 | 13.67,1.25,1.92,18,94,2.1,1.79,0.32,0.73,3.8,1.23,2.46,630,2 65 | 12.37,1.13,2.16,19,87,3.5,3.1,0.19,1.87,4.45,1.22,2.87,420,2 66 | 12.17,1.45,2.53,19,104,1.89,1.75,0.45,1.03,2.95,1.45,2.23,355,2 67 | 12.37,1.21,2.56,18.1,98,2.42,2.65,0.37,2.08,4.6,1.19,2.3,678,2 68 | 13.11,1.01,1.7,15,78,2.98,3.18,0.26,2.28,5.3,1.12,3.18,502,2 69 | 12.37,1.17,1.92,19.6,78,2.11,2,0.27,1.04,4.68,1.12,3.48,510,2 70 | 
13.34,0.94,2.36,17,110,2.53,1.3,0.55,0.42,3.17,1.02,1.93,750,2 71 | 12.21,1.19,1.75,16.8,151,1.85,1.28,0.14,2.5,2.85,1.28,3.07,718,2 72 | 12.29,1.61,2.21,20.4,103,1.1,1.02,0.37,1.46,3.05,0.906,1.82,870,2 73 | 13.86,1.51,2.67,25,86,2.95,2.86,0.21,1.87,3.38,1.36,3.16,410,2 74 | 13.49,1.66,2.24,24,87,1.88,1.84,0.27,1.03,3.74,0.98,2.78,472,2 75 | 12.99,1.67,2.6,30,139,3.3,2.89,0.21,1.96,3.35,1.31,3.5,985,2 76 | 11.96,1.09,2.3,21,101,3.38,2.14,0.13,1.65,3.21,0.99,3.13,886,2 77 | 11.66,1.88,1.92,16,97,1.61,1.57,0.34,1.15,3.8,1.23,2.14,428,2 78 | 13.03,0.9,1.71,16,86,1.95,2.03,0.24,1.46,4.6,1.19,2.48,392,2 79 | 11.84,2.89,2.23,18,112,1.72,1.32,0.43,0.95,2.65,0.96,2.52,500,2 80 | 12.33,0.99,1.95,14.8,136,1.9,1.85,0.35,2.76,3.4,1.06,2.31,750,2 81 | 12.7,3.87,2.4,23,101,2.83,2.55,0.43,1.95,2.57,1.19,3.13,463,2 82 | 12,0.92,2,19,86,2.42,2.26,0.3,1.43,2.5,1.38,3.12,278,2 83 | 12.72,1.81,2.2,18.8,86,2.2,2.53,0.26,1.77,3.9,1.16,3.14,714,2 84 | 12.08,1.13,2.51,24,78,2,1.58,0.4,1.4,2.2,1.31,2.72,630,2 85 | 13.05,3.86,2.32,22.5,85,1.65,1.59,0.61,1.62,4.8,0.84,2.01,515,2 86 | 11.84,0.89,2.58,18,94,2.2,2.21,0.22,2.35,3.05,0.79,3.08,520,2 87 | 12.67,0.98,2.24,18,99,2.2,1.94,0.3,1.46,2.62,1.23,3.16,450,2 88 | 12.16,1.61,2.31,22.8,90,1.78,1.69,0.43,1.56,2.45,1.33,2.26,495,2 89 | 11.65,1.67,2.62,26,88,1.92,1.61,0.4,1.34,2.6,1.36,3.21,562,2 90 | 11.64,2.06,2.46,21.6,84,1.95,1.69,0.48,1.35,2.8,1,2.75,680,2 91 | 12.08,1.33,2.3,23.6,70,2.2,1.59,0.42,1.38,1.74,1.07,3.21,625,2 92 | 12.08,1.83,2.32,18.5,81,1.6,1.5,0.52,1.64,2.4,1.08,2.27,480,2 93 | 12,1.51,2.42,22,86,1.45,1.25,0.5,1.63,3.6,1.05,2.65,450,2 94 | 12.69,1.53,2.26,20.7,80,1.38,1.46,0.58,1.62,3.05,0.96,2.06,495,2 95 | 12.29,2.83,2.22,18,88,2.45,2.25,0.25,1.99,2.15,1.15,3.3,290,2 96 | 11.62,1.99,2.28,18,98,3.02,2.26,0.17,1.35,3.25,1.16,2.96,345,2 97 | 12.47,1.52,2.2,19,162,2.5,2.27,0.32,3.28,2.6,1.16,2.63,937,2 98 | 11.81,2.12,2.74,21.5,134,1.6,0.99,0.14,1.56,2.5,0.95,2.26,625,2 99 | 
12.29,1.41,1.98,16,85,2.55,2.5,0.29,1.77,2.9,1.23,2.74,428,2 100 | 12.37,1.07,2.1,18.5,88,3.52,3.75,0.24,1.95,4.5,1.04,2.77,660,2 101 | 12.29,3.17,2.21,18,88,2.85,2.99,0.45,2.81,2.3,1.42,2.83,406,2 102 | 12.08,2.08,1.7,17.5,97,2.23,2.17,0.26,1.4,3.3,1.27,2.96,710,2 103 | 12.6,1.34,1.9,18.5,88,1.45,1.36,0.29,1.35,2.45,1.04,2.77,562,2 104 | 12.34,2.45,2.46,21,98,2.56,2.11,0.34,1.31,2.8,0.8,3.38,438,2 105 | 11.82,1.72,1.88,19.5,86,2.5,1.64,0.37,1.42,2.06,0.94,2.44,415,2 106 | 12.51,1.73,1.98,20.5,85,2.2,1.92,0.32,1.48,2.94,1.04,3.57,672,2 107 | 12.42,2.55,2.27,22,90,1.68,1.84,0.66,1.42,2.7,0.86,3.3,315,2 108 | 12.25,1.73,2.12,19,80,1.65,2.03,0.37,1.63,3.4,1,3.17,510,2 109 | 12.72,1.75,2.28,22.5,84,1.38,1.76,0.48,1.63,3.3,0.88,2.42,488,2 110 | 12.22,1.29,1.94,19,92,2.36,2.04,0.39,2.08,2.7,0.86,3.02,312,2 111 | 11.61,1.35,2.7,20,94,2.74,2.92,0.29,2.49,2.65,0.96,3.26,680,2 112 | 11.46,3.74,1.82,19.5,107,3.18,2.58,0.24,3.58,2.9,0.75,2.81,562,2 113 | 12.52,2.43,2.17,21,88,2.55,2.27,0.26,1.22,2,0.9,2.78,325,2 114 | 11.76,2.68,2.92,20,103,1.75,2.03,0.6,1.05,3.8,1.23,2.5,607,2 115 | 11.41,0.74,2.5,21,88,2.48,2.01,0.42,1.44,3.08,1.1,2.31,434,2 116 | 12.08,1.39,2.5,22.5,84,2.56,2.29,0.43,1.04,2.9,0.93,3.19,385,2 117 | 11.03,1.51,2.2,21.5,85,2.46,2.17,0.52,2.01,1.9,1.71,2.87,407,2 118 | 11.82,1.47,1.99,20.8,86,1.98,1.6,0.3,1.53,1.95,0.95,3.33,495,2 119 | 12.42,1.61,2.19,22.5,108,2,2.09,0.34,1.61,2.06,1.06,2.96,345,2 120 | 12.77,3.43,1.98,16,80,1.63,1.25,0.43,0.83,3.4,0.7,2.12,372,2 121 | 12,3.43,2,19,87,2,1.64,0.37,1.87,1.28,0.93,3.05,564,2 122 | 11.45,2.4,2.42,20,96,2.9,2.79,0.32,1.83,3.25,0.8,3.39,625,2 123 | 11.56,2.05,3.23,28.5,119,3.18,5.08,0.47,1.87,6,0.93,3.69,465,2 124 | 12.42,4.43,2.73,26.5,102,2.2,2.13,0.43,1.71,2.08,0.92,3.12,365,2 125 | 13.05,5.8,2.13,21.5,86,2.62,2.65,0.3,2.01,2.6,0.73,3.1,380,2 126 | 11.87,4.31,2.39,21,82,2.86,3.03,0.21,2.91,2.8,0.75,3.64,380,2 127 | 12.07,2.16,2.17,21,85,2.6,2.65,0.37,1.35,2.76,0.86,3.28,378,2 128 | 
12.43,1.53,2.29,21.5,86,2.74,3.15,0.39,1.77,3.94,0.69,2.84,352,2 129 | 11.79,2.13,2.78,28.5,92,2.13,2.24,0.58,1.76,3,0.97,2.44,466,2 130 | 12.37,1.63,2.3,24.5,88,2.22,2.45,0.4,1.9,2.12,0.89,2.78,342,2 131 | 12.04,4.3,2.38,22,80,2.1,1.75,0.42,1.35,2.6,0.79,2.57,580,2 132 | 12.86,1.35,2.32,18,122,1.51,1.25,0.21,0.94,4.1,0.76,1.29,630,3 133 | 12.88,2.99,2.4,20,104,1.3,1.22,0.24,0.83,5.4,0.74,1.42,530,3 134 | 12.81,2.31,2.4,24,98,1.15,1.09,0.27,0.83,5.7,0.66,1.36,560,3 135 | 12.7,3.55,2.36,21.5,106,1.7,1.2,0.17,0.84,5,0.78,1.29,600,3 136 | 12.51,1.24,2.25,17.5,85,2,0.58,0.6,1.25,5.45,0.75,1.51,650,3 137 | 12.6,2.46,2.2,18.5,94,1.62,0.66,0.63,0.94,7.1,0.73,1.58,695,3 138 | 12.25,4.72,2.54,21,89,1.38,0.47,0.53,0.8,3.85,0.75,1.27,720,3 139 | 12.53,5.51,2.64,25,96,1.79,0.6,0.63,1.1,5,0.82,1.69,515,3 140 | 13.49,3.59,2.19,19.5,88,1.62,0.48,0.58,0.88,5.7,0.81,1.82,580,3 141 | 12.84,2.96,2.61,24,101,2.32,0.6,0.53,0.81,4.92,0.89,2.15,590,3 142 | 12.93,2.81,2.7,21,96,1.54,0.5,0.53,0.75,4.6,0.77,2.31,600,3 143 | 13.36,2.56,2.35,20,89,1.4,0.5,0.37,0.64,5.6,0.7,2.47,780,3 144 | 13.52,3.17,2.72,23.5,97,1.55,0.52,0.5,0.55,4.35,0.89,2.06,520,3 145 | 13.62,4.95,2.35,20,92,2,0.8,0.47,1.02,4.4,0.91,2.05,550,3 146 | 12.25,3.88,2.2,18.5,112,1.38,0.78,0.29,1.14,8.21,0.65,2,855,3 147 | 13.16,3.57,2.15,21,102,1.5,0.55,0.43,1.3,4,0.6,1.68,830,3 148 | 13.88,5.04,2.23,20,80,0.98,0.34,0.4,0.68,4.9,0.58,1.33,415,3 149 | 12.87,4.61,2.48,21.5,86,1.7,0.65,0.47,0.86,7.65,0.54,1.86,625,3 150 | 13.32,3.24,2.38,21.5,92,1.93,0.76,0.45,1.25,8.42,0.55,1.62,650,3 151 | 13.08,3.9,2.36,21.5,113,1.41,1.39,0.34,1.14,9.4,0.57,1.33,550,3 152 | 13.5,3.12,2.62,24,123,1.4,1.57,0.22,1.25,8.6,0.59,1.3,500,3 153 | 12.79,2.67,2.48,22,112,1.48,1.36,0.24,1.26,10.8,0.48,1.47,480,3 154 | 13.11,1.9,2.75,25.5,116,2.2,1.28,0.26,1.56,7.1,0.61,1.33,425,3 155 | 13.23,3.3,2.28,18.5,98,1.8,0.83,0.61,1.87,10.52,0.56,1.51,675,3 156 | 12.58,1.29,2.1,20,103,1.48,0.58,0.53,1.4,7.6,0.58,1.55,640,3 157 | 
13.17,5.19,2.32,22,93,1.74,0.63,0.61,1.55,7.9,0.6,1.48,725,3 158 | 13.84,4.12,2.38,19.5,89,1.8,0.83,0.48,1.56,9.01,0.57,1.64,480,3 159 | 12.45,3.03,2.64,27,97,1.9,0.58,0.63,1.14,7.5,0.67,1.73,880,3 160 | 14.34,1.68,2.7,25,98,2.8,1.31,0.53,2.7,13,0.57,1.96,660,3 161 | 13.48,1.67,2.64,22.5,89,2.6,1.1,0.52,2.29,11.75,0.57,1.78,620,3 162 | 12.36,3.83,2.38,21,88,2.3,0.92,0.5,1.04,7.65,0.56,1.58,520,3 163 | 13.69,3.26,2.54,20,107,1.83,0.56,0.5,0.8,5.88,0.96,1.82,680,3 164 | 12.85,3.27,2.58,22,106,1.65,0.6,0.6,0.96,5.58,0.87,2.11,570,3 165 | 12.96,3.45,2.35,18.5,106,1.39,0.7,0.4,0.94,5.28,0.68,1.75,675,3 166 | 13.78,2.76,2.3,22,90,1.35,0.68,0.41,1.03,9.58,0.7,1.68,615,3 167 | 13.73,4.36,2.26,22.5,88,1.28,0.47,0.52,1.15,6.62,0.78,1.75,520,3 168 | 13.45,3.7,2.6,23,111,1.7,0.92,0.43,1.46,10.68,0.85,1.56,695,3 169 | 12.82,3.37,2.3,19.5,88,1.48,0.66,0.4,0.97,10.26,0.72,1.75,685,3 170 | 13.58,2.58,2.69,24.5,105,1.55,0.84,0.39,1.54,8.66,0.74,1.8,750,3 171 | 13.4,4.6,2.86,25,112,1.98,0.96,0.27,1.11,8.5,0.67,1.92,630,3 172 | 12.2,3.03,2.32,19,96,1.25,0.49,0.4,0.73,5.5,0.66,1.83,510,3 173 | 12.77,2.39,2.28,19.5,86,1.39,0.51,0.48,0.64,9.899999,0.57,1.63,470,3 174 | 14.16,2.51,2.48,20,91,1.68,0.7,0.44,1.24,9.7,0.62,1.71,660,3 175 | 13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.7,0.64,1.74,740,3 176 | 13.4,3.91,2.48,23,102,1.8,0.75,0.43,1.41,7.3,0.7,1.56,750,3 177 | 13.27,4.28,2.26,20,120,1.59,0.69,0.43,1.35,10.2,0.59,1.56,835,3 178 | 13.17,2.59,2.37,20,120,1.65,0.68,0.53,1.46,9.3,0.6,1.62,840,3 179 | 14.13,4.1,2.74,24.5,96,2.05,0.76,0.56,1.35,9.2,0.61,1.6,560,3 -------------------------------------------------------------------------------- /Clase 7_Cluster/C7_Metodos no supervisados II_Cluster.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 7_Cluster/C7_Metodos no supervisados II_Cluster.pptx 
-------------------------------------------------------------------------------- /Clase 8_Histogramas/C8_Metodos No Paramétricos I_Histogramas.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 8_Histogramas/C8_Metodos No Paramétricos I_Histogramas.pptx -------------------------------------------------------------------------------- /Clase 8_Histogramas/Tutorial8_UBA_Hist.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "ObeIwb_npgeZ" 7 | }, 8 | "source": [ 9 | "# Big Data y Machine Learning (UBA) 2025\n", 10 | "## Clase 8 - Histogramas y Visualización de datos\n", 11 | "\n", 12 | "**Objetivo:**\n", 13 | "Que se familiaricen con métodos no paramétricos para estimacion de la distribución de densidad de una variable aleatoria." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### Métodos no paramétricos\n", 21 | "El objetivo es predecir distribución de una variable de interés \n", 22 | "- 𝑌 variable aleatoria de interés\n", 23 | "- 𝑓(𝑌) distribución de densidad 𝑌\n", 24 | "\n", 25 | "Métodos\n", 26 | "- Repaso a Numpy vs. Pandas\n", 27 | "- Histogramas con Matplotlib\n", 28 | "- Histogramas mejorados con Seaborn y sus opciones\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "#### Repaso: NumPy y scikit-learn\n", 36 | "**El paquete NumPy** es fundamental en Python. Está escrito en lenguajes de bajo nivel, lo que permite realizar operaciones matemáticas de manera muy eficiente. 
Para más información, ver la [guía oficial de uso de NumPy](https://docs.scipy.org/doc/numpy/user/index.html).\n", 37 | "\n", 38 | "**El paquete scikit-learn** es una biblioteca de Python usada para machine learning, construida sobre NumPy y otros paquetes. Permite procesar datos, reducir la dimensionalidad de la base, implementar regresiones, clasificaciones, clustering y más. Pueden ver la [web de scikit-learn](https://scikit-learn.org/stable/)\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "# Instalamos el paquete necesario\n", 48 | "!pip install scikit-learn\n", 49 | "\n", 50 | "# Alternativa\n", 51 | "#import sys\n", 52 | "#!{sys.executable} -m pip install scikit-learn" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "# Importamos paquetes\n", 62 | "import numpy as np\n", 63 | "import pandas as pd\n", 64 | "import matplotlib.pyplot as plt" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "#### Repaso breve de NumPy\n", 72 | "\n", 73 | "A continuación crearemos dos vectores con los que trabajaremos en nuestra primera regresión lineal." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "x = np.array([5, 15, 25, 35, 45, 55])\n", 83 | "y = np.array([5, 20, 14, 32, 22, 38])\n", 84 | "\n", 85 | "print(x)\n", 86 | "print(y)\n", 87 | "# Ambos son arrays unidimensionales (1-D), de forma (6,)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Reshape para transformar x en un vector columna\n", 97 | "x = x.reshape((-1, 1)) # El -1 le pide a NumPy que infiera esa dimensión a partir del largo del array\n", 98 | "# Es equivalente a: x = x.reshape((6, 1))\n", 99 | "\n", 100 | "print(x)\n", 101 | "print(y)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "# Generamos datos aleatorios: mezcla de dos normales, N(0,1) y N(5,1), con 500 observaciones cada una\n", 111 | "np.random.seed(20)\n", 112 | "X = np.concatenate([np.random.normal(0,1,500), np.random.normal(5,1,500)]).reshape(-1,1)\n", 113 | "\n" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "##### Estadística descriptiva en NumPy vs. 
Pandas" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "# Estadística descriptiva en NumPy\n", 130 | "print(\"Media\", np.mean(X).round(2))\n", 131 | "print(\"Desvío Estándar (s.d.)\", np.std(X).round(2)) # np.std usa ddof=0; describe() de pandas usa ddof=1\n", 132 | "print(\"Mínimo\", np.min(X).round(2))\n", 133 | "print(\"Mediana\", np.percentile(X,50).round(2))\n", 134 | "print(\"Máximo\", np.max(X).round(2))" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# Convertimos la matriz en una base de datos (\"DataFrame\")\n", 144 | "df_X = pd.DataFrame(X,columns=['Var_Normal'])\n", 145 | "\n", 146 | "# Visualizamos\n", 147 | "print(df_X.head(3))\n", 148 | "\n", 149 | "# Obtenemos estadística descriptiva de las variables\n", 150 | "df_X.describe().round(2)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "## Histogramas con Matplotlib\n", 158 | "Como introdujimos en clases anteriores, el módulo Matplotlib nos ayuda a hacer gráficos y visualización de datos" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "Nuestro objetivo es estimar la función de densidad $f(y)$ de una variable aleatoria $Y$, con la siguiente aproximación no paramétrica:\n", 166 | "\n", 167 | "$$\n", 168 | "\\hat{f}(y) = \\frac{M}{n} \\sum_{i=1}^{n} I(Y_i \\in B_l)\n", 169 | "$$\n", 170 | "Con $B_l$ la barra (bin) $l$-ésima, que contiene a $y$, e $I(\\cdot)$ la función indicadora\n", 171 | "\n", 172 | "Podemos usar la función `hist` de Matplotlib. 
Ver documentación [acá](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "# Grafico\n", 182 | "plt.figure(figsize=(10,6))\n", 183 | "plt.hist(X, alpha=0.5, color='blue') # por default, 10 bins\n", 184 | "plt.xlabel('Valores')\n", 185 | "plt.ylabel('Frecuencia')\n", 186 | "plt.show()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "A mayor número de barras (bins en ingles), menos observaciones se acumulan en cada bin (notar diferencia de escala en el eje y), como muestra el siguiente gráfico" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "# Grafico\n", 203 | "plt.figure(figsize=(10,6))\n", 204 | "plt.hist(X, bins=50, alpha=0.5, color='blue', label='Histograma')\n", 205 | "plt.xlabel('Valores')\n", 206 | "plt.ylabel('Frecuencia')\n", 207 | "plt.show()" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# Grafico\n", 217 | "plt.figure(figsize=(10,6))\n", 218 | "plt.hist(X, bins=30, alpha=0.5, color='blue', label='Histograma')\n", 219 | "plt.xlabel('Valores')\n", 220 | "plt.ylabel('Frecuencia')\n", 221 | "\n", 222 | "# Agregamos línea vertical con la media\n", 223 | "mean_value = np.mean(X)\n", 224 | "plt.axvline(mean_value, color='red', linestyle='dashed', linewidth=1, label='Media')\n", 225 | "plt.legend() # Show legend with label for the mean line\n", 226 | "plt.show()" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "# Definimos un criterio para \"cortar\" outliers (por ejemplo, a 2 DE de la media)\n", 236 | "mean_value = np.mean(X)\n", 237 
| "std_dev = np.std(X)\n", 238 | "lower_bound = mean_value - 2 * std_dev\n", 239 | "upper_bound = mean_value + 2 * std_dev\n", 240 | "\n", 241 | "# Filtramos los datos\n", 242 | "X_filtered = X[(X >= lower_bound) & (X <= upper_bound)]\n", 243 | "\n", 244 | "# Graficamos los datos originales y los filtrados (el filtrado devuelve un array 1-D)\n", 245 | "plt.hist(X, bins=30, alpha=0.3, color='blue', label='Datos originales')\n", 246 | "plt.hist(X_filtered, bins=30, alpha=0.3, color='orange', label='Datos filtrados')\n", 247 | "plt.xlabel('Valores')\n", 248 | "plt.ylabel('Frecuencia')\n", 249 | "plt.legend()\n", "plt.show()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "También podemos usar algunas funciones de seaborn para graficar histogramas. Ver documentación [acá](https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot)\n" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "## Histogramas con Seaborn\n", 264 | "En este ejemplo utilizaremos una base de datos incluida en `seaborn`, un módulo también muy utilizado en procesamiento y visualización de datos en Python. 
Para más información ver [seaborn](https://seaborn.pydata.org/)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": null, 270 | "metadata": { 271 | "scrolled": true 272 | }, 273 | "outputs": [], 274 | "source": [ 275 | "# Primero, instalamos el paquete\n", 276 | "#!pip install seaborn" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "import seaborn as sns" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [ 294 | "tips = sns.load_dataset(\"tips\")\n", 295 | "tips" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "##### Pregunta: ¿Cómo verían la estadística descriptiva de esta base de datos de propinas?" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [ 311 | "# resolver aquí\n" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "#### Visualización de dos variables con Seaborn" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "Podemos hacer un lindo gráfico de dispersión entre dos variables rápidamente. Por ejemplo, entre la cuenta total (`total_bill`) y las propinas (`tip`)." 
326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "# Gráfico de dispersión\n", 335 | "sns.relplot(data=tips, x=\"total_bill\", y=\"tip\")" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "Ahora sí, utilicemos seaborn para hacer un histograma" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "sns.histplot(data=tips['tip'], stat='density') # función de histograma de Seaborn" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "Podemos modificar los ejes y otros elementos del gráfico usando las funciones de **Matplotlib**" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": {}, 365 | "outputs": [], 366 | "source": [ 367 | "sns.histplot(data=tips['tip'], stat='density') # función de histograma de Seaborn\n", 368 | "mean_tips = np.mean(tips['tip'])\n", 369 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 370 | "plt.title(\"Distribución de propinas (en USD)\")\n", 371 | "plt.xlabel(\"Propinas (en USD)\")\n", 372 | "plt.legend() # Nos muestra la leyenda para la media de tips" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "#### Histograma para dos categorías\n", 380 | "Podemos hacer el histograma por grupos (por ejemplo, varón y mujer) utilizando la opción `hue`." 
381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "sns.histplot(data=tips, x=\"tip\", stat='density', hue=\"sex\", multiple=\"stack\") \n", 390 | "plt.title(\"Distribución de propinas (en USD)\")\n", 391 | "plt.xlabel(\"Propinas (en USD)\")" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "También podemos hacerlo como dos paneles separados con la función `displot()` " 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": {}, 405 | "outputs": [], 406 | "source": [ 407 | "sns.displot(data=tips, x=\"tip\", stat='density', hue=\"sex\", col=\"sex\") " 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "# Ahora le agregamos la media a cada distribución\n", 417 | "g = sns.displot(data=tips, x=\"tip\", stat='density', hue=\"sex\", col=\"sex\") \n", 418 | "\n", 419 | "for ax, sex in zip(g.axes[0], g.col_names): # g.col_names respeta el orden de los paneles\n", 420 | "    media = tips[tips['sex'] == sex]['tip'].mean()\n", 421 | "    ax.axvline(media, color='r', linestyle='--', label=f'Media {sex}')\n", 422 | "    ax.legend()\n", 423 | "plt.show()" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "# Ahora le agregamos la media y el valor a cada distribución\n", 433 | "g = sns.displot(data=tips, x=\"tip\", stat='density', hue=\"sex\", col=\"sex\") \n", 434 | "\n", 435 | "for ax, sex in zip(g.axes[0], g.col_names): # g.col_names respeta el orden de los paneles\n", 436 | "    media = tips[tips['sex'] == sex]['tip'].mean()\n", 437 | "    ax.axvline(media, color='r', linestyle='--', label=f'Media {sex}')\n", 438 | "    ax.text(media, ax.get_ylim()[1], f'{media:.2f}', ha='left', va='top', color='r')\n", 439 | "\n", 440 | "g.axes[0][-1].legend() #muestra la leyenda en el último 
panel\n", 441 | "\n", 442 | "plt.show()" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "### Opciones de distribución de la variable\n", 450 | "En lugar de estimar la densidad $\\hat{f}(y)$, podemos elegir qué estadístico se muestra en cada barrita (bin) con la opción `stat`:\n", 451 | "- `density`: normaliza para que el área total del histograma sea 1 (esta es la que usamos hasta ahora)\n", 452 | "- `count`: cuenta el número de observaciones en cada barrita (bin)\n", 453 | "- `frequency`: muestra el número de observaciones dividido por el ancho de cada barrita (bin)\n", 454 | "- `probability` o `proportion`: normaliza de modo tal que la suma de las alturas sea 1\n", 455 | "- `percent`: normaliza de modo tal que la suma de las alturas sea 100\n" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "sns.histplot(data=tips['tip'], stat='count') # función de histograma de Seaborn\n", 465 | "mean_tips = np.mean(tips['tip'])\n", 466 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 467 | "plt.title(\"Distribución de propinas (en USD)\")\n", 468 | "plt.xlabel(\"Propinas (en USD)\")\n", 469 | "plt.legend() # Nos muestra la leyenda para la media de tips" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "sns.histplot(data=tips['tip'], stat='frequency') # función de histograma de Seaborn\n", 479 | "mean_tips = np.mean(tips['tip'])\n", 480 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 481 | "plt.title(\"Distribución de propinas (en USD)\")\n", 482 | "plt.xlabel(\"Propinas (en USD)\")\n", 483 | "plt.legend() # Nos muestra la leyenda para la media de tips" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | 
"execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "sns.histplot(data=tips['tip'], stat='probability') # funcion de histograma de Seaborn\n", 493 | "mean_tips = np.mean(tips['tip'])\n", 494 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 495 | "plt.title(\"Distribución de propinas (en USD)\")\n", 496 | "plt.xlabel(\"Propinas (en USD)\")\n", 497 | "plt.legend() # Nos muestra la leyenda para la media de tips" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "### Barritas (bins)\n", 505 | "Ahora juguemos con las opciones del número de barritas (`bins`)" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": null, 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "sns.histplot(data=tips['tip'], stat='density',bins=10) # funcion de histograma de Seaborn\n", 515 | "mean_tips = np.mean(tips['tip'])\n", 516 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 517 | "plt.title(\"Distribución de propinas (en USD)\")\n", 518 | "plt.xlabel(\"Propinas (en USD)\")\n", 519 | "plt.legend() # Nos muestra la leyenda para la media de tips" 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": {}, 525 | "source": [ 526 | "### Ancho de banda (binwidth)\n", 527 | "Ahora juguemos con las opciones del ancho de banda (`binwidth`)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": null, 533 | "metadata": {}, 534 | "outputs": [], 535 | "source": [ 536 | "sns.histplot(data=tips['tip'], stat='density',binwidth=0.25) # funcion de histograma de Seaborn\n", 537 | "mean_tips = np.mean(tips['tip'])\n", 538 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 539 | "plt.title(\"Distribución de propinas (en USD)\")\n", 540 | "plt.xlabel(\"Propinas (en USD)\")\n", 541 | "plt.legend() 
# Nos muestra la leyenda para la media de tips" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "### Clase que viene: Kernels\n", 549 | "Podemos sumar la estimación de la densidad usando un Kernel (Gaussiano) del cual veremos mas la clase que viene. " 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "sns.histplot(data=tips, x=\"tip\", hue=\"sex\", stat=\"density\", multiple=\"stack\", kde=True)" 559 | ] 560 | } 561 | ], 562 | "metadata": { 563 | "colab": { 564 | "provenance": [] 565 | }, 566 | "kernelspec": { 567 | "display_name": "Python [conda env:base] *", 568 | "language": "python", 569 | "name": "conda-base-py" 570 | }, 571 | "language_info": { 572 | "codemirror_mode": { 573 | "name": "ipython", 574 | "version": 3 575 | }, 576 | "file_extension": ".py", 577 | "mimetype": "text/x-python", 578 | "name": "python", 579 | "nbconvert_exporter": "python", 580 | "pygments_lexer": "ipython3", 581 | "version": "3.12.4" 582 | } 583 | }, 584 | "nbformat": 4, 585 | "nbformat_minor": 4 586 | } 587 | -------------------------------------------------------------------------------- /Clase 9_Kernels/C9_Metodos No Paramétricos II_Kernels.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase 9_Kernels/C9_Metodos No Paramétricos II_Kernels.pptx -------------------------------------------------------------------------------- /Clase 9_Kernels/Tutorial9_UBA_Kernels.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "ObeIwb_npgeZ" 7 | }, 8 | "source": [ 9 | "# Big Data y Machine Learning (UBA) 2025\n", 10 | "## Clase 9 - Kernels\n", 11 | "\n", 12 | "**Objetivo:**\n", 13 | "Que 
se familiaricen con el segundo método no paramétrico - Kernels - para la estimación de la función de densidad de una variable aleatoria." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "### Métodos no paramétricos\n", 21 | "El objetivo es estimar la distribución de una variable de interés \n", 22 | "- $Y$: variable aleatoria de interés\n", 23 | "- $f(y)$: función de densidad de $Y$\n", 24 | "\n", 25 | "##### Métodos\n", 26 | "- Breve repaso de Histogramas\n", 27 | "- Kernels con Scikit-learn\n", 28 | " - Tipo de funciones de kernels\n", 29 | " - Opciones de Kernels: Ancho de banda $h$\n", 30 | " - Simulación de datos: Sesgo de la estimación no paramétrica de Kernels\n", 31 | "- Kernels con Seaborn\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Importamos paquetes\n", 41 | "import numpy as np\n", 42 | "import pandas as pd\n", 43 | "import matplotlib.pyplot as plt\n", 44 | "from scipy.stats import norm # distribución normal de SciPy (la usamos más abajo para calcular la densidad verdadera)\n", 45 | "\n", 46 | "import seaborn as sns\n", 47 | "from sklearn.neighbors import KernelDensity" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "### Breve repaso de Histogramas\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Estimamos la función de densidad $f(y)$ de una variable aleatoria $Y$ con la siguiente aproximación no paramétrica:\n", 62 | "\n", 63 | "$$\n", 64 | "\\hat{f}(y) = \\frac{M}{n} \\sum_{i=1}^{n} I(Y_i \\in B_l) \n", 65 | "$$\n", 66 | "Con $B_l$ la $l$-ésima barra (bin)\n", 67 | "\n", 68 | "Podemos usar la función `hist` de Matplotlib. 
Ver documentación [acá](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "# Generamos datos\n", 78 | "np.random.seed(20)\n", 79 | "X = np.concatenate([np.random.normal(0,1,500), np.random.normal(5,1,500)]).reshape(-1,1)\n", 80 | "X" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "# Gráfico\n", 90 | "plt.figure(figsize=(10,6))\n", 91 | "plt.hist(X, bins=30, alpha=0.5, color='blue', label='Histograma')\n", 92 | "plt.xlabel('Valores')\n", 93 | "plt.ylabel('Frecuencia')\n", 94 | "\n", 95 | "# Agregamos línea vertical con la media\n", 96 | "mean_value = np.mean(X)\n", 97 | "plt.axvline(mean_value, color='red', linestyle='dashed', linewidth=1, label='Media')\n", 98 | "plt.legend() # Muestra la leyenda con la etiqueta de la media\n", 99 | "plt.show()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## Kernels" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "Kernel:\n", 121 | "A cada observación se le asigna una pequeña función de densidad y luego se suman todas esas funciones\n", 122 | "\n", 123 | "$$\n", 124 | "\\hat{f}(y_0) = \\frac{1}{n} \\sum_{i=1}^{n} \\frac{1}{h} K\\left(\\frac{Y_i - y_0}{h}\\right) \n", 125 | "$$\n", 126 | "\n", 127 | "- $K(z)$: función Kernel continua y (generalmente) simétrica \n", 128 | "- $h$: ancho de banda (smoothing bandwidth); controla qué tan “suave” es la densidad \n" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "Vamos a usar el [módulo neighbors de Scikit learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html)" 136 | ] 
137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "Para estimar una densidad usando kernels tenemos la siguiente función: \n", 143 | "\n", 144 | " sklearn.neighbors.KernelDensity(*, bandwidth=1.0, algorithm='auto', kernel='gaussian', metric='euclidean', atol=0, rtol=0, breadth_first=True, leaf_size=40, metric_params=None)\n", 145 | "\n", 146 | "donde algunos parámetros importantes son:\n", 147 | "- bandwidth (valor por default: 1.0)\n", 148 | "- kernel (valor por default: 'gaussian')\n", 149 | "\n", 150 | "Scikit learn nos permite cambiar el kernel y probar varios y cuál ajusta mejor a los datos" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "# Grafico\n", 160 | "plt.figure(figsize=(10,6))\n", 161 | "plt.hist(X, bins=30, density=True, alpha=0.3, color='blue', label='Histograma') # mantenemos el histogramas para comparar\n", 162 | "\n", 163 | "# Rango de valores para eje x (para graficar la funcion de Kernel)\n", 164 | "X_plot = np.linspace(min(X), max(X), 1000).reshape(-1,1)\n", 165 | "\n", 166 | "# Estimamos la funcion de Kernel Gaussiana y todas sus opciones de default\n", 167 | "kde = KernelDensity().fit(X)\n", 168 | " \n", 169 | "# Usar la KDE para estimar la densidad para cada valor de X\n", 170 | "log_densities = kde.score_samples(X_plot)\n", 171 | "densities = np.exp(log_densities)\n", 172 | " \n", 173 | "# Grafico de kernel\n", 174 | "plt.plot(X_plot[:,0], densities, color='red', label=f'Gaussian Kernel')\n", 175 | "\n", 176 | "# Agregamos línea vertical con la media\n", 177 | "mean_value = np.mean(X)\n", 178 | "plt.axvline(mean_value, color='green', linestyle='dashed', linewidth=1, label='Media')\n", 179 | "\n", 180 | "plt.legend()\n", 181 | "plt.title('Estimación con Kernel Gaussiano')" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "#### Tipos de kernels 
(disponibles en Scikit learn)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "# Kernels\n", 198 | "kernels = [\"gaussian\", \"tophat\", \"epanechnikov\", \"exponential\", \"linear\", \"cosine\"] \n", 199 | " \n", 200 | "# Figura con 3 filas y 2 columnas\n", 201 | "fig, ax = plt.subplots(3, 2) \n", 202 | "# Tamaño de la figura\n", 203 | "fig.set_figheight(15) \n", 204 | "fig.set_figwidth(10) \n", 205 | "# Título \n", 206 | "fig.suptitle(\"Tipos de kernels\") \n", 207 | "\n", 208 | "# 1D array de valores de x para graficar la distribución \n", 209 | "x_plot = np.linspace(-6, 6, 1000) # 1000 valores de -6 a 6 separados con la misma distancia entre sí\n", 210 | "x_plot = x_plot.reshape(-1,1) # formato 2D array (necesario para scikit learn)\n", 211 | "x_orig = np.zeros((1, 1)) # punto (0,0)\n", 212 | " \n", 213 | "# Graficamos usando los distintos kernels \n", 214 | "for i, kernel in enumerate(kernels): \n", 215 | " # Ajustamos el modelo \n", 216 | " kde = KernelDensity(kernel=kernel).fit(x_orig) # usamos el punto (0,0)\n", 217 | " # log de la densidad de probabilidad (PDF)\n", 218 | " log_dens = kde.score_samples(x_plot) \n", 219 | " \n", 220 | " # Distribuciones \n", 221 | " ax[i // 2, i % 2].fill(x_plot[:, 0], np.exp(log_dens)) \n", 222 | " # i//2 nos permite referirnos a la fila del subplot, e i%2 nos permite referirnos a la columna\n", 223 | " # Título y labels de los subplots \n", 224 | " ax[i // 2, i % 2].set_title(kernel.capitalize()) \n", 225 | " ax[i // 2, i % 2].set_xlim(-3, 3) \n", 226 | " ax[i // 2, i % 2].set_ylim(0, 1) \n", 227 | " ax[i // 2, i % 2].set_ylabel(\"Densidad\") \n", 228 | " ax[i // 2, i % 2].set_xlabel(\"x\") \n", 229 | "plt.show()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "De la misma forma, en un gráfico" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | 
"execution_count": null, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "# Kernels\n", 246 | "kernels = [\"gaussian\", \"tophat\", \"epanechnikov\", \"exponential\", \"linear\", \"cosine\"] \n", 247 | " \n", 248 | "# Grafico\n", 249 | "plt.figure(figsize=(10,6))\n", 250 | "\n", 251 | "for k in kernels:\n", 252 | " # Ajustamos el modelo \n", 253 | " kde = KernelDensity(kernel=k).fit(x_orig) # usamos el punto (0,0)\n", 254 | " # log de la densidad de probabilidad (PDF)\n", 255 | " log_dens = kde.score_samples(x_plot) \n", 256 | " \n", 257 | " # Graficar la estimacion para cada kernel\n", 258 | " plt.plot(x_plot[:,0], np.exp(log_dens), label=f'{k.capitalize()} Kernel')\n", 259 | "\n", 260 | "plt.legend()\n", 261 | "plt.title('Estimación con diferentes Kernels')\n", 262 | "plt.show()" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "Continuamos con el ejemplo de la variable X creada (en la clase pasada) probando los distintos tipos de kernels" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "# Lista de kernels a probar\n", 279 | "kernels = [\"gaussian\", \"tophat\", \"epanechnikov\", \"exponential\", \"linear\", \"cosine\"] \n", 280 | "\n", 281 | "# Grafico\n", 282 | "plt.figure(figsize=(10,6))\n", 283 | "plt.hist(X, bins=30, density=True, alpha=0.5, color='blue', label='Histograma')\n", 284 | "\n", 285 | "for k in kernels:\n", 286 | " kde = KernelDensity(kernel=k).fit(X)\n", 287 | " \n", 288 | " # Usar la KDE para estimar la densidad para cada valor de X\n", 289 | " log_densities = kde.score_samples(X_plot)\n", 290 | " densities = np.exp(log_densities)\n", 291 | " \n", 292 | " # Graficar para cada kernel\n", 293 | " plt.plot(X_plot[:,0], densities, label=f'{k.capitalize()} Kernel')\n", 294 | "\n", 295 | "plt.legend()\n", 296 | "plt.title('Estimación con diferentes Kernels')" 297 | ] 298 | }, 299 
| { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "#### Opciones de Kernels: Ancho de banda $h$\n", 304 | "Ahora veamos qué ocurre si para un mismo kernel, cambiamos los **anchos de banda**" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "# Anchos de banda\n", 314 | "bandwidths = [0.5, 0.75, 1, 1.25, 1.5, 1.75] \n", 315 | " \n", 316 | "# Figura con 3 filas y 2 columnas\n", 317 | "fig, ax = plt.subplots(3, 2) \n", 318 | "# Tamaño de la figura\n", 319 | "fig.set_figheight(15) \n", 320 | "fig.set_figwidth(10) \n", 321 | "# Título \n", 322 | "fig.suptitle('Kernel Gaussiano, con distintos anchos de banda')\n", 323 | "\n", 324 | "# Graficamos usando los distintos kernels \n", 325 | "for i, bw in enumerate(bandwidths): \n", 326 | " # Ajustamos el modelo \n", 327 | " kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x_orig) # usamos el punto (0,0)\n", 328 | " # log de la densidad de probabilidad (PDF)\n", 329 | " log_dens = kde.score_samples(x_plot) \n", 330 | " \n", 331 | " # Distribuciones \n", 332 | " ax[i // 2, i % 2].fill(x_plot[:, 0], np.exp(log_dens)) \n", 333 | " # i//2 nos permite referirnos a la fila del subplot, e i%2 nos permite referirnos a la columna\n", 334 | " # Título y labels de los subplots \n", 335 | " ax[i // 2, i % 2].set_title('Kernel Gaussiano con bandwidth='+str(bw)) \n", 336 | " ax[i // 2, i % 2].set_xlim(-3, 3) \n", 337 | " ax[i // 2, i % 2].set_ylim(0, 1) \n", 338 | " ax[i // 2, i % 2].set_ylabel('Densidad') \n", 339 | " ax[i // 2, i % 2].set_xlabel('x') \n", 340 | "plt.show()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "# Anchos de banda\n", 350 | "bandwidths = [0.5, 0.75, 1, 1.25, 1.5, 1.75] \n", 351 | " \n", 352 | "# Grafico\n", 353 | "plt.figure(figsize=(10,6))\n", 354 | "\n", 355 | "for bw in 
bandwidths:\n", 356 | " # Ajustamos el modelo \n", 357 | " kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x_orig) # usamos el punto (0,0)\n", 358 | " # log de la densidad de probabilidad (PDF)\n", 359 | " log_dens = kde.score_samples(x_plot) \n", 360 | " \n", 361 | " # Graficar la estimacion para cada kernel\n", 362 | " plt.plot(x_plot[:,0], np.exp(log_dens), label='Kernel Gaussiano con bandwidth='+str(bw))\n", 363 | "\n", 364 | "plt.legend()\n", 365 | "plt.title('Kernel Gaussiano, con distintos anchos de banda') \n", 366 | "plt.show()" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "# Anchos de banda\n", 376 | "bandwidths = [0.5, 0.75, 1, 1.25, 1.5, 1.75] \n", 377 | "\n", 378 | "# Grafico\n", 379 | "plt.figure(figsize=(10,6))\n", 380 | "plt.hist(X, bins=30, density=True, alpha=0.5, color='blue', label='Histograma')\n", 381 | "\n", 382 | "for bw in bandwidths:\n", 383 | " kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)\n", 384 | " \n", 385 | " # Usar la KDE para estimar la densidad para cada valor de X\n", 386 | " log_densities = kde.score_samples(X_plot)\n", 387 | " densities = np.exp(log_densities)\n", 388 | " \n", 389 | " # Graficar para cada kernel\n", 390 | " plt.plot(X_plot[:,0], densities, label='Bandwidth='+str(bw))\n", 391 | "\n", 392 | "plt.legend()\n", 393 | "plt.title('Estimación de Kernel Gaussiano con diferentes ancho de banda')" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### Simulación de datos: Sesgo de la estimación no parametrica de Kernels\n", 401 | "Ahora veamos un ejemplo donde creamos datos ficticios, esto implica que conocemos la verdadera forma en la que se generan los datos, para comparar la estimación no paramétrica de Kernels y su aproximación a la verdadera función de densidad. 
Se puede demostrar formalmente que la estimación no paramétrica por Kernels (y por histograma) es *sesgada*; aquí ilustramos ese concepto." 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "# Creamos una distribución\n", 411 | "n = 100\n", 412 | "np.random.seed(10)\n", 413 | "X = np.concatenate((np.random.normal(0, 1, int(0.6 * n)), np.random.normal(10, 1, int(0.4 * n)))) \n", 414 | "# Creamos X concatenando datos de dos distribuciones normales\n", 415 | "# primero 60 datos de una distribución normal con media 0 y desvío 1\n", 416 | "# luego, 40 datos de una normal con media 10 y desvío 1\n", 417 | "X = X.reshape(-1,1)\n", 418 | "\n", 419 | "X_plot = np.linspace(-5, 15, 1000).reshape(-1,1)\n", 420 | "# Usaremos X para estimar la densidad y calcularemos la densidad para los puntos de X_plot \n", 421 | "\n", 422 | "# Calcular la \"verdadera\" densidad para los puntos X_plot\n", 423 | "true_density = 0.6 * norm(0, 1).pdf(X_plot[:, 0]) + 0.4 * norm(10, 1).pdf(X_plot[:, 0]) \n", 424 | " \n", 425 | "# Gráfico\n", 426 | "fig, ax = plt.subplots() \n", 427 | " \n", 428 | "# Gráfico de la verdadera densidad \n", 429 | "ax.fill( \n", 430 | " X_plot[:, 0], true_density, \n", 431 | " fc='black', alpha=0.2, \n", 432 | " label='Verdadera Distribución'\n", 433 | ") \n", 434 | " \n", 435 | "# Estimar la densidad de X usando kernel gaussiano y bandwidth de 0.5 \n", 436 | "kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X) \n", 437 | "# Log de la PDF \n", 438 | "log_dens = kde.score_samples(X_plot) \n", 439 | " \n", 440 | "# Densidad \n", 441 | "ax.plot( \n", 442 | " X_plot[:, 0], np.exp(log_dens), \n", 443 | " color='blue', \n", 444 | " linestyle='-', \n", 445 | " label='Densidad con kernel Gaussiano'\n", 446 | ") \n", 447 | "ax.set_xlim(-4, 15) \n", 448 | "ax.set_ylim(0, 0.3) \n", 449 | "#ax.grid(True) \n", 450 | "ax.legend(loc='upper right')\n", 451 | 
"plt.title('Sesgo de la estimación por Kernel') \n", 452 | "\n", 453 | "plt.show()" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "El ancho de banda (bandwidth) se puede elegir con cross-validation (CV), que explicaremos en mayor detalle en la Clase 13 (Tutorial 13)." 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "# Tarea para la casa: mostrar el sesgo del histograma respecto de esta función verdadera." 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "## Kernels con Seaborn\n", 477 | "Continuaremos con el ejemplo utilizando la base de datos de propinas del módulo `seaborn`. Para más información ver [seaborn](https://seaborn.pydata.org/).\n", 478 | "La función de seaborn para graficar Kernels es [kdeplot()](https://seaborn.pydata.org/generated/seaborn.kdeplot.html)." 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "# Importamos la base de datos de propinas\n", 488 | "tips = sns.load_dataset(\"tips\")\n", 489 | "tips" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": null, 495 | "metadata": {}, 496 | "outputs": [], 497 | "source": [ 498 | "# Veamos la estadística descriptiva por grupo\n", 499 | "tips.groupby('sex').describe().round(2).T" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "# Estimamos y graficamos la función por Kernels\n", 509 | "sns.kdeplot(data=tips, x='tip') " 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "Nuevamente podemos mejorar el gráfico de seaborn usando las opciones de matplotlib" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": null, 522 | "metadata": {}, 
523 | "outputs": [], 524 | "source": [ 525 | "sns.kdeplot(data=tips, x='tip') # funcion de kernel de Seaborn\n", 526 | "mean_tips = np.mean(tips['tip'])\n", 527 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 528 | "plt.title(\"Distribución de propinas (en USD)\")\n", 529 | "plt.xlabel(\"Propinas (en USD)\")\n", 530 | "plt.legend() # Nos muestra la leyenda para la media de tips" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "También podemos hacer el grafico original del histograma (ver clase 8) y sumar con la opcion `kde=True`" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "metadata": {}, 544 | "outputs": [], 545 | "source": [ 546 | "sns.histplot(data=tips['tip'], stat='density', kde=True) # funcion de histograma de Seaborn\n", 547 | "mean_tips = np.mean(tips['tip'])\n", 548 | "plt.axvline(mean_tips, color='red', linestyle='dashed', linewidth=1, label='Tips promedio')\n", 549 | "plt.title(\"Distribución de propinas (en USD)\")\n", 550 | "plt.xlabel(\"Propinas (en USD)\")\n", 551 | "plt.legend() # Nos muestra la leyenda para la media de tips" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "Nuevamente, podemos comparar las distribuciones de densidad de kernel entre hombres y mujeres\n" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "metadata": {}, 565 | "outputs": [], 566 | "source": [ 567 | "# Checkeando el promedio de propinas\n", 568 | "mean_tips_male = tips[tips['sex'] == 'Male']['tip'].mean()\n", 569 | "print(mean_tips_male.round(2))\n", 570 | "mean_tips_female = tips[tips['sex'] == 'Female']['tip'].mean()\n", 571 | "print(mean_tips_female.round(2))" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": {}, 578 | "outputs": [], 579 | "source": [ 580 | 
"sns.kdeplot(data=tips, x='tip',hue=\"sex\", multiple=\"stack\", ) # funcion de kernel de Seaborn\n", 581 | "\n", 582 | "plt.axvline(mean_tips_male, color='blue', linestyle='dashed', linewidth=1, label='Male')\n", 583 | "plt.axvline(mean_tips_female, color='red', linestyle='dashed', linewidth=1, label='Female')\n", 584 | "\n", 585 | "plt.title(\"Distribución de propinas (en USD)\")\n", 586 | "plt.xlabel(\"Propinas (en USD)\")\n", 587 | "plt.legend() # Nos muestra la leyenda para la media de tips" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "metadata": {}, 594 | "outputs": [], 595 | "source": [ 596 | "# tarea para la casa: jugar con la opcion de ver este grafico en dos paneles y sumarle el valor del promedio de cada grupo" 597 | ] 598 | } 599 | ], 600 | "metadata": { 601 | "colab": { 602 | "provenance": [] 603 | }, 604 | "kernelspec": { 605 | "display_name": "Python [conda env:base] *", 606 | "language": "python", 607 | "name": "conda-base-py" 608 | }, 609 | "language_info": { 610 | "codemirror_mode": { 611 | "name": "ipython", 612 | "version": 3 613 | }, 614 | "file_extension": ".py", 615 | "mimetype": "text/x-python", 616 | "name": "python", 617 | "nbconvert_exporter": "python", 618 | "pygments_lexer": "ipython3", 619 | "version": "3.12.4" 620 | } 621 | }, 622 | "nbformat": 4, 623 | "nbformat_minor": 4 624 | } 625 | -------------------------------------------------------------------------------- /Clase17_Ensamble 1_Bagging/C17_Metodos de Ensamble I_Bagging.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Clase17_Ensamble 1_Bagging/C17_Metodos de Ensamble I_Bagging.pptx -------------------------------------------------------------------------------- /Guía para las exposiciones grupales.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/Guía para las exposiciones grupales.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Big Data y Machine Learning (UBA) 2025 2 | Este es el repositorio oficial del curso "Big Data y Machine Learning" (1er Semestre - 2025) para la Licenciatura en Economía (UBA). Ante cualquier duda o error, contactarse a m.n.romero91@gmail.com 3 | -------------------------------------------------------------------------------- /TPs/TP0_Practica opcional de Python/TP0_UBA_Practica de Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true, 7 | "id": "Dh8MkXaG-c9Y", 8 | "jupyter": { 9 | "outputs_hidden": true 10 | } 11 | }, 12 | "source": [ 13 | "# Big Data y Machine Learning (UBA) - 2025\n", 14 | "\n", 15 | "## Trabajo Práctico 0 (Opcional) - Práctica de Programación en Python\n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "RhBlm6mZ-c9e" 22 | }, 23 | "source": [ 24 | "### Reglas de formato y presentación\n", 25 | "- El trabajo debe estar debidamente documentado y comentado (utilizando #) para que tanto los docentes como sus compañeros puedan comprender el código fácilmente.\n", 26 | "\n", 27 | "- El mismo debe ser completado en este Jupyter Notebook y entregado como tal, es decir, en un archivo .ipynb\n" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "id": "ZEjGaa4U-c9g" 34 | }, 35 | "source": [ 36 | "### Fecha de realización:\n", 37 | "Viernes 18 de Marzo a las 13:00 hs" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "id": "ZXbrPraa-c9i" 44 | }, 45 | "source": [ 46 | "#### Ejercicio 1\n", 47 | "Este ejercicio simplemente busca repasar lo 
que aprendimos sobre definición de variables. Definir dos variables con un nombre combinado (al menos dos palabras), una que se pueda crear y otra que tenga un nombre inaceptable (genera error). Explicar por qué ocurre el error." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": { 54 | "id": "mb7PkXfN-c9j" 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "# Caso A\n", 59 | "\n", 60 | "\n", 61 | "# Caso B (acá debería saltar un error)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "#### Ejercicio 2\n", 69 | "Importar módulos. Usando el módulo math impriman la tangente de 1. ¿Cuál es el resultado?\n", 70 | "Hagan este cálculo de dos formas: primero importando el módulo math y usando la función correspondiente y luego solo importando la función específica que precisan para el cálculo." 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Caso A\n", 80 | "\n", 81 | "\n", 82 | "# Caso B" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": { 88 | "id": "GlNh0fyv-c9l" 89 | }, 90 | "source": [ 91 | "#### Ejercicio 3 \n", 92 | "Este ejercicio trata sobre lograr el intercambio de valores entre dos variables utilizando una variable temporal para hacerlo. Las variables temporales y la sustitución de valores termina siendo útil en algunos loops. Los pasos a seguir son: (a) definir variables A y B (cuyos valores buscaremos invertir); (b) definir una variable temporal que resguarde el valor de B; (c) sustitución (asignar B igual a A y también A igual al valor original de B); (d) imprimir valores para verificar." 
93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "id": "uWalSYFC-c9m" 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "# a) Definir variables A y B, \n", 104 | "a = 1\n", 105 | "b = 2\n", 106 | "\n", 107 | "# b) Definir variable temporal \"tmp\" igual a B (la variable, no el valor)\n", 108 | "\n", 109 | "\n", 110 | "# c) Ahora sustituir variables: variable B igual a variable A (la variable, no el\n", 111 | "# valor) y viceversa.\n", 112 | "\n", 113 | "\n", 114 | "# d) Verifiquemos resultados: imprimir variables A y B" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": { 120 | "id": "wXhAaRyN-c9p" 121 | }, 122 | "source": [ 123 | "#### Ejercicio 4\n", 124 | "En este ejercicio se busca poner en práctica el uso de range() en un for loop. \n", 125 | "\n", 126 | "Construir un for loop usando un range(). El range debe ser entre los valores que quieran (con una diferencia mínima de 15 entre start y stop), en incrementos de 3 unidades. Dentro del loop, implementar una sentencia condicional que imprima una leyenda indicando si el input es par o impar." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 9, 132 | "metadata": { 133 | "id": "_oE5sG0c-c9q" 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "# Resolver acá\n" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": { 143 | "id": "h3g5bXUB-c9u" 144 | }, 145 | "source": [ 146 | "#### Ejercicio 5\n", 147 | "Para practicar el uso de condiciones lógicas y la definición de funciones, construir una función con una sentencia condicional que verifique si un año es bisiesto o no. Para que un año sea bisiesto debe cumplir una de dos condiciones:\n", 148 | "\n", 149 | "(a) que sea divisible por 400; o\n", 150 | "\n", 151 | "(b) que sea divisible por 4 y no sea divisible por 100\n", 152 | "\n", 153 | "Notar que son dos condiciones, donde la segunda condición tiene dos componentes. 
Test the function with 3 values to verify that it works." 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": { 160 | "id": "7_MnILdz-c9v" 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "# Solve here" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": { 170 | "collapsed": true, 171 | "id": "P3a7bJkd-c9w", 172 | "jupyter": { 173 | "outputs_hidden": true 174 | } 175 | }, 176 | "source": [ 177 | "#### Exercise 6 \n", 178 | "Let's practice identifying the type() of each variable. Below is a list with elements of different types. Build a for loop that iterates over the list and prints a message indicating the data type or object found in each case." 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "id": "jDf4d_Wr-c9w" 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "mi_lista = [10, 34.5, 99999, 'abc', [1,2,3], ('ARG', 1810), {'pob': 45}, True]" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": { 196 | "id": "SAJgEiNEFQAS" 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "# Solve here" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": { 206 | "id": "y1lncitl-c9x" 207 | }, 208 | "source": [ 209 | "#### Exercise 7\n", 210 | "Now define a new list of your own, in which the first four elements are words (strings), the fifth element is not a string, and the sixth is a string. Build a for loop that runs through the list and prints each word and its length. The loop should contain a conditional statement that prints the message \"Element is not a string: < the element > | < class of the element >\" for the cases where the element evaluated is not a string."
211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": { 217 | "id": "bFOk9Os0-c9x" 218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "# Solve here" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "#### Exercise 8" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "Build a function called 'suma' that takes a variable number of parameters and returns their sum." 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "# Solve here" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": { 250 | "id": "PUpxDz72-c9x" 251 | }, 252 | "source": [ 253 | "#### Exercise 9\n", 254 | "Argentina has proportional legislative representation, in which the number of deputies should be adjusted to the population of each province. Another feature of the country's legislative representation is a minimum number of deputies per province (5). This creates an imbalance across provinces in the number of citizens per representative in Congress.\n", 255 | "\n", 256 | "Below we prepared some examples, dividing each province's population (2022 estimate) by the number of representatives in Congress for that jurisdiction. We also built the equivalent figure for the whole country and a dictionary with the provincial values.\n", 257 | "\n", 258 | "In this exercise, build a for loop that iterates over the dictionary defined below and compares each provincial value against the direct-proportionality value (the variable argentina). The loop must print a message indicating whether the province is overrepresented, underrepresented, or proportionally represented.
In addition, the same loop should compare the values of the overrepresented provinces and store the value of the province with the greatest overrepresentation. After the loop, print this value so we can check that it worked.\n", 259 | "\n", 260 | "Source for population: https://es.wikipedia.org/wiki/Demograf%C3%ADa_de_Argentina\n", 261 | "\n", 262 | "Source for representatives: https://es.wikipedia.org/wiki/C%C3%A1mara_de_Diputados_de_la_Naci%C3%B3n_Argentina" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": { 269 | "id": "hd8Z5AHs-c9y" 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "# Reference value: proportionality\n", 274 | "argentina = 46044703 / 257\n", 275 | "\n", 276 | "# Create variables for a selection of jurisdictions\n", 277 | "cordoba = 3978984 / 18\n", 278 | "santa_fe = 3556522 / 19\n", 279 | "mendoza = 2014533 / 10\n", 280 | "buenos_aires = 17569053 / 70\n", 281 | "entre_rios = 1426426 / 9\n", 282 | "santa_cruz = 333473 / 5\n", 283 | "formosa = 606041 / 5\n", 284 | "\n", 285 | "# Define the dictionary to iterate over\n", 286 | "dict_provincias = {\n", 287 | " \"Córdoba\": cordoba,\n", 288 | " \"Santa Fe\": santa_fe,\n", 289 | " \"Mendoza\": mendoza,\n", 290 | " \"Buenos Aires\": buenos_aires,\n", 291 | " \"Entre Ríos\": entre_rios,\n", 292 | " \"Santa Cruz\": santa_cruz,\n", 293 | " \"Formosa\": formosa\n", 294 | "}" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": { 301 | "id": "fsEHnxiY-c9y", 302 | "scrolled": true 303 | }, 304 | "outputs": [], 305 | "source": [ 306 | "# Solve here\n" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": { 312 | "id": "GokZe5tV-c9z" 313 | }, 314 | "source": [ 315 | "#### Exercise 10\n", 316 | "If we wanted to minimize the underrepresentation of Buenos Aires Province, how many representatives should it have according to the 2022 Census?
Let's use the same under/overrepresentation criterion as in the previous exercise: population divided by number of representatives.\n", 317 | "\n", 318 | "To answer this question, build a while loop that increases the Province's number of representatives one at a time until underrepresentation is minimized. In each iteration, print a message that says: \"One representative was added, the total is now X\", where X is the simulated number of representatives." 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": { 325 | "id": "x9DvuXa_-c9z" 326 | }, 327 | "outputs": [], 328 | "source": [ 329 | "# These are the province's values\n", 330 | "representantes_pba = 70\n", 331 | "poblacion2010_pba = 17594428\n", 332 | "\n", 333 | "\n", 334 | "# And the national reference value\n", 335 | "argentina = 46044703 / 257\n" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "# Solve here\n" 345 | ] 346 | } 347 | ], 348 | "metadata": { 349 | "anaconda-cloud": {}, 350 | "colab": { 351 | "name": "TP1 - Parte 1.ipynb", 352 | "provenance": [] 353 | }, 354 | "kernelspec": { 355 | "display_name": "Python [conda env:base] *", 356 | "language": "python", 357 | "name": "conda-base-py" 358 | }, 359 | "language_info": { 360 | "codemirror_mode": { 361 | "name": "ipython", 362 | "version": 3 363 | }, 364 | "file_extension": ".py", 365 | "mimetype": "text/x-python", 366 | "name": "python", 367 | "nbconvert_exporter": "python", 368 | "pygments_lexer": "ipython3", 369 | "version": "3.12.4" 370 | } 371 | }, 372 | "nbformat": 4, 373 | "nbformat_minor": 4 374 | } 375 | -------------------------------------------------------------------------------- /TPs/TP1_Jugando con APIS y Webscrapping/TP1-UBA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 |
"metadata": { 6 | "collapsed": true, 7 | "id": "Dh8MkXaG-c9Y", 8 | "jupyter": { 9 | "outputs_hidden": true 10 | } 11 | }, 12 | "source": [ 13 | "# Big Data and Machine Learning (UBA) - 2025\n", 14 | "\n", 15 | "## Practical Assignment 1 (TP1): Playing with APIs and Web Scraping " 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "RhBlm6mZ-c9e" 22 | }, 23 | "source": [ 24 | "### Formatting and submission rules\n", 25 | "- The work must be properly documented and commented (using #) so that both the instructors and your classmates can easily understand the code.\n", 26 | "\n", 27 | "- It must be completed in this Jupyter Notebook and submitted as such, that is, as an .ipynb file.\n" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "id": "ZEjGaa4U-c9g" 34 | }, 35 | "source": [ 36 | "### Due date:\n", 37 | "Friday, April 4, at 13:00" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "id": "N9TU2y7E-c9h" 44 | }, 45 | "source": [ 46 | "### How to submit\n", 47 | "- When you finish the assignment, make a final commit in your GitHub repository named “Entrega final del tp”. \n", 48 | "- Make sure you have created a folder named TP1. This Jupyter Notebook and the one corresponding to TP1 must be inside that folder.\n", 49 | "- You must also send the link to your repository, so that it can be cloned and graded, to my email 25RO35480961@campus.economicas.uba.ar. Use the email subject \"Big Data - TP 1 - Grupo #\" and name the file \"TP1_Grupo #\", where # is the group number assigned to you.\n", 50 | "- The latest version in the repository is the one that will be graded, so it is important that you: \n", 51 | " - Do not send the email until you have finished and are sure you have committed and pushed the final version you want to submit. \n", 52 | " - Do not push again after submitting your final version.
That would create confusion about which version you want graded.\n", 53 | "- In summary, the repository folder must include:\n", 54 | " - The code\n", 55 | " - A Word document (Part A) containing the figures and a brief description of each.\n", 56 | " - The Excel file with the scraped links (Part B)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### Part A" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": { 69 | "id": "ZXbrPraa-c9i" 70 | }, 71 | "source": [ 72 | "#### Exercise 1 - Playing with APIs\n", 73 | "Using the World Bank API [link](https://wbdata.readthedocs.io/en/stable/), obtain two indicator series for two countries of your choice in a single search query. You may choose any indicator series that interest you." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "# Solve here\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "#### Exercise 2 - Pandas review\n", 90 | "Compute descriptive statistics for both indicator series, comparing the two countries." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# Solve here\n" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "#### Exercise 3 - Practicing with Matplotlib\n", 107 | "Make two different plots using the Matplotlib library (review Clase 4).
Write one in the *pyplot* style and the other in the *object-oriented* style." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# Solve here, pyplot style\n" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "# Solve here, object-oriented style \n", 126 | "# Tip: take advantage of this style of building a figure to make it look nicer \n" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "### Part B" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "#### Exercise 4\n", 141 | "From the news site of the [newspaper La Nación](https://www.lanacion.com.ar/), or any newspaper you like, use web scraping tools to obtain the **links** to the front-page news stories. Store the links in a dataframe and export it to an Excel file.\n", 142 | "\n", 143 | "Note 1: you may obtain the links to the news stories without the domain \"https://www.lanacion.com.ar/\". If so, concatenate the domain with the path of each link so that the result is a link that can be opened. That is, the final strings will have the form: https://www.lanacion.com.ar/*scraped_text*\n", 144 | "\n", 145 | "Note 2: with your submission, attach a screenshot of the news page taken at the time you ran your code. This will help during grading to verify that the links obtained refer to the news of that day and time."
146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "# Solve here\n" 155 | ] 156 | } 157 | ], 158 | "metadata": { 159 | "anaconda-cloud": {}, 160 | "colab": { 161 | "name": "TP1 - Parte 1.ipynb", 162 | "provenance": [] 163 | }, 164 | "kernelspec": { 165 | "display_name": "Python [conda env:base] *", 166 | "language": "python", 167 | "name": "conda-base-py" 168 | }, 169 | "language_info": { 170 | "codemirror_mode": { 171 | "name": "ipython", 172 | "version": 3 173 | }, 174 | "file_extension": ".py", 175 | "mimetype": "text/x-python", 176 | "name": "python", 177 | "nbconvert_exporter": "python", 178 | "pygments_lexer": "ipython3", 179 | "version": "3.12.4" 180 | } 181 | }, 182 | "nbformat": 4, 183 | "nbformat_minor": 4 184 | } 185 | -------------------------------------------------------------------------------- /TPs/TP2_Introducción a la EPH/Big Data_UBA_TP2.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/TPs/TP2_Introducción a la EPH/Big Data_UBA_TP2.docx -------------------------------------------------------------------------------- /TPs/TP2_Introducción a la EPH/Big Data_UBA_TP2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/TPs/TP2_Introducción a la EPH/Big Data_UBA_TP2.pdf -------------------------------------------------------------------------------- /TPs/TP2_Introducción a la EPH/~$g Data_UBA_TP2.docx: -------------------------------------------------------------------------------- 1 | Microsoft Office UserMicrosoft Office User -------------------------------------------------------------------------------- /TPs/TP3_EPH_Hist, Kernels & M.
no supervisados/Big Data_UBA_TP3.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/TPs/TP3_EPH_Hist, Kernels & M. no supervisados/Big Data_UBA_TP3.docx -------------------------------------------------------------------------------- /TPs/TP3_EPH_Hist, Kernels & M. no supervisados/Big Data_UBA_TP3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/TPs/TP3_EPH_Hist, Kernels & M. no supervisados/Big Data_UBA_TP3.pdf -------------------------------------------------------------------------------- /TPs/TP3_EPH_Hist, Kernels & M. no supervisados/~$g Data_UBA_TP3.docx: -------------------------------------------------------------------------------- 1 | Microsoft Office UserMicrosoft Office User -------------------------------------------------------------------------------- /TPs/TP4_Regresión&Clasificación/Big Data_UBA_TP4.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/TPs/TP4_Regresión&Clasificación/Big Data_UBA_TP4.docx -------------------------------------------------------------------------------- /TPs/TP4_Regresión&Clasificación/Big Data_UBA_TP4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mnromero/BigData_2025_UBA/d5f21d4d600fb26480e2c69513dfa8813d7d8379/TPs/TP4_Regresión&Clasificación/Big Data_UBA_TP4.pdf -------------------------------------------------------------------------------- /TPs/TP4_Regresión&Clasificación/~$g Data_UBA_TP4.docx: -------------------------------------------------------------------------------- 1 | Microsoft Office UserMicrosoft Office User 
--------------------------------------------------------------------------------