├── .gitignore ├── names_add.csv ├── names_extra.csv ├── names_original.csv ├── dorms.csv ├── housing.csv ├── README.md └── Pandas.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | __pycache__/ 3 | -------------------------------------------------------------------------------- /names_add.csv: -------------------------------------------------------------------------------- 1 | First Name,Last Name 2 | Martin,Perez 3 | Menna,Elsayed -------------------------------------------------------------------------------- /names_extra.csv: -------------------------------------------------------------------------------- 1 | First Name,Last Name,Major 2 | Martin,Perez,Mechanical Engineering 3 | Menna,Elsayed,Sociology -------------------------------------------------------------------------------- /names_original.csv: -------------------------------------------------------------------------------- 1 | First Name,Last Name 2 | Lesley,Cordero 3 | Ojas,Sathe 4 | Helen,Chen 5 | Eli,Epperson 6 | Jacob,Greenberg -------------------------------------------------------------------------------- /dorms.csv: -------------------------------------------------------------------------------- 1 | Dorm,Street,Cost 2 | Broadway,114th,9000 3 | Shapiro,115th,9500 4 | Watt,113th,10500 5 | East Campus,116th,"11,000" 6 | Wallach,114th,9500 -------------------------------------------------------------------------------- /housing.csv: -------------------------------------------------------------------------------- 1 | Dorm,Name 2 | East Campus,Helen Chen 3 | Broadway,Danielle Jing 4 | Shapiro,Craig Rhodes 5 | Watt,Lesley Cordero 6 | East Campus,Martin Perez 7 | Broadway,Menna Elsayed 8 | Wallach,Will Essilfie 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Environment Setup 2 | 3 | In this 
tutorial, we'll be using Python to work with CSV files computationally. Specifically, we'll be working with pandas in Python. 4 | 5 | This guide was written in Python 3.6. If you haven't already, download [Python](https://www.python.org/downloads/release/python-361/) and [Pip](https://pypi.python.org/pypi/pip). Next, you’ll need to install the pandas module that we’ll use throughout this tutorial: 6 | 7 | ``` 8 | pip3 install pandas==0.20.3 9 | pip3 install jupyter==1.0.0 10 | ``` 11 | 12 | For this workshop, we'll be using five datasets. If you're familiar with git, the easiest way to retrieve them is to clone the repo with the following command: 13 | 14 | ``` 15 | git clone https://github.com/lesley2958/pandas-workshop 16 | ``` 17 | 18 | Otherwise, download the datasets [here](https://drive.google.com/open?id=0B4I1qITaz894NHBNX19LUjZaU0k). To make your life a bit easier, try to download these datasets onto your desktop, or somewhere you can easily access them with Python. 19 | 20 | Since we’ll be working with Python interactively, using the Jupyter Notebook is the best way to get the most out of this tutorial. You already installed it with pip3 above; now you just need to get it running. With that said, open up your terminal or command prompt and enter the following command: 21 | 22 | ``` 23 | jupyter notebook 24 | ``` 25 | 26 | And BOOM! It should have opened up in your default browser. Now we’re ready to go. 27 | 28 | ### A Quick Note on Jupyter 29 | 30 | For those of you who are unfamiliar with Jupyter notebooks, I’ve provided a brief review of the functions that will be particularly useful for moving along with this tutorial. 31 | 32 | In the image below, you’ll see three buttons labeled 1-3 that will be important for you to get a grasp of -- the save button (1), the add cell button (2), and the run cell button (3).
33 | 34 | ![alt text](https://www.twilio.com/blog/wp-content/uploads/2017/10/XCdsHFjl8mmg1BT55TZorU8dcUx-DsTlxGZJmeqQMbXk3vv0lPJ-O8YIHjSPwxZ8M2Nw1vOcByUHM_lIRuIpKQ6LASxtcnjyHpf1UpSRKbY0qF1bEgA_hzfRu1pX7y8cSXPgZPEG.png) 35 | 36 | The first button is the one you’ll use to save your work as you go along (1). I won’t give you directions as to when you should do this -- that’s up to you! 37 | 38 | Next, we have the “add cell” button (2). Cells are blocks of code that you can run together. These are the building blocks of Jupyter Notebook because they provide the option of running code incrementally without having to run all your code at once. Throughout this tutorial, you’ll see lines of code blocked off -- each one should correspond to a cell. 39 | 40 | Lastly, there’s the “run cell” button (3). Jupyter Notebook doesn’t automatically run your code for you; you have to tell it when by clicking this button. As with the add button, once you’ve written each block of code in this tutorial into a cell, you should then run it to see the output (if any). Note that any expected output will also be shown in this tutorial so you know what to expect. Make sure to run your code as you go along because many blocks of code in this tutorial rely on previous cells.
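
Before diving into the workshop, it can help to sanity-check the installation. This quick snippet is not part of the original workshop materials -- it simply imports pandas in a Python 3 session (or a notebook cell) and prints the installed version:

```python
# Quick sanity check that the pip3 installs above worked.
# The exact version printed depends on what you installed
# (the pinned install above targets 0.20.3).
import pandas as pd

print(pd.__version__)
```

If the import raises `ModuleNotFoundError`, pandas was installed into a different Python than the one you are running; re-run the `pip3 install` command with the interpreter you intend to use.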
41 | -------------------------------------------------------------------------------- /Pandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Exploring Data with Pandas" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Learning objectives\n", 15 | "\n", 16 | "* Understand the role of `pandas` in the Python ecosystem.\n", 17 | "* Read a flat file (CSV) dataset into Python with `pandas`.\n", 18 | "* Understand the basic features of a `DataFrame`.\n", 19 | "* Employ slicing to select subsets of data from a `DataFrame`.\n", 20 | "* Use label- and integer-based indexing to select ranges of data in a `DataFrame`." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Introduction \n", 28 | "\n", 29 | "Data scientists spend an incredible amount of time and skill acquiring, prepping, cleaning, and normalizing data. In this workshop, we'll review one of the most important tools used during this process, `pandas`. \n", 30 | "\n", 31 | "But first, let's review the differences between data acquisition, preparation, and cleaning.\n", 32 | "\n", 33 | "### Data Acquisition\n", 34 | "\n", 35 | "Many times, your data might already be given to you, in which case you won't have to worry about data acquisition. However, this is not always the case. The process of actually getting a dataset to work with is called **data acquisition**. For the purposes of this tutorial, we won't be providing an in-depth review of data acquisition methodology, but this should provide some context as to how `pandas` fits into the data science ecosystem. \n", 36 | "\n", 37 | "### Data Preparation\n", 38 | "\n", 39 | "Now, once a dataset is acquired, it might not be in a suitable format to work with.
In some cases, you might have scraped data from a website or simply downloaded a zipfile containing multiple CSV files. Luckily, `pandas` provides Python users with an important data type you can use to work with data in an easy manner. This process of converting data into a suitable format is called **data preparation**.\n", 40 | "\n", 41 | "\n", 42 | "### Data Cleaning\n", 43 | "\n", 44 | "Now that your data is being handled in a proper manner, your work with it still might not be done. You might have missing values, values that need normalizing, or values in the wrong data type. Fixing these inconsistencies before analysis is referred to as **data cleaning**. " 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Pandas\n", 52 | "\n", 53 | "Now that we've reviewed where `pandas` fits in as a Python tool, we will begin to review pandas and its different capabilities. \n", 54 | "\n", 55 | "Pandas allows us to deal with data in a user-friendly way by providing us with an important data type, referred to as a `DataFrame`, not otherwise available in Python. With `pandas` we can effortlessly import data from files such as CSVs into a DataFrame object, allowing us to quickly apply transformations, filters, and other data wrangling methodology to our data.\n", 56 | "\n", 57 | "To begin this tutorial, we'll first need to import `pandas`. Typically, the shorthand alias used is `pd`." 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 5, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "import pandas as pd" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Series\n", 74 | "\n", 75 | "Before we review the DataFrame object in `pandas`, we'll begin by reviewing the `Series` data type. \n", 76 | "\n", 77 | "In its simplest form, a Series is a one-dimensional object containing an array of data.
Below, you'll see that we use the `pandas.Series()` function to define an example containing 4 numbers. Notice that the input of this method *is* a Python list." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 6, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "<class 'list'>\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "l1 = [1,2,3]\n", 95 | "print(type(l1))" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 7, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "obj = pd.Series([4, 7, -5, 3])" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "A Series is incredibly similar to a typical Python list; however, it **is** a distinct type, which we can see by printing the data type of the series below." 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 8, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "name": "stdout", 121 | "output_type": "stream", 122 | "text": [ 123 | "<class 'pandas.core.series.Series'>\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "print(type(obj))" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "The reason for this distinction is that `pandas` allows us to attach our own custom indices, as shown below." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 9, 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "name": "stdout", 145 | "output_type": "stream", 146 | "text": [ 147 | "d 4\n", 148 | "b 7\n", 149 | "a -5\n", 150 | "c 3\n", 151 | "dtype: int64\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])\n", 157 | "print(obj2)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "You might be thinking, \"is this not the same thing as a dictionary?\" -- a fair question for
anyone beginning to learn `pandas`. Now, the useful thing is that we can indeed easily convert a dictionary to a Series, as you can see in the code below. " 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 10, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n", 174 | "obj3 = pd.Series(sdata)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "But notice that when we print each object, as well as its type, the format is much more readable with the Series object. This is incredibly useful when a user needs to visually interact with the data." 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 11, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "name": "stdout", 191 | "output_type": "stream", 192 | "text": [ 193 | "<class 'dict'>\n" 194 | ] 195 | } 196 | ], 197 | "source": [ 198 | "print(type(sdata))" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 12, 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n" 211 | ] 212 | } 213 | ], 214 | "source": [ 215 | "print(sdata)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 13, 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "<class 'pandas.core.series.Series'>\n" 228 | ] 229 | } 230 | ], 231 | "source": [ 232 | "print(type(obj3))" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 14, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "Ohio 35000\n", 245 | "Oregon 16000\n", 246 | "Texas 71000\n", 247 | "Utah 5000\n", 248 | "dtype: int64\n" 249 | ] 250 | } 251 | ], 252 | "source": [ 253 |
"print(obj3)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "## DataFrames\n", 261 | "\n", 262 | "Now that we've reviewed the simplest object in `pandas`, we're ready to tackle on the DataFrame object. A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).\n", 263 | "\n", 264 | "There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists." 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 15, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "name": "stdout", 274 | "output_type": "stream", 275 | "text": [ 276 | " pop state year\n", 277 | "0 1.5 Ohio 2000\n", 278 | "1 1.7 Ohio 2001\n", 279 | "2 3.6 Ohio 2002\n", 280 | "3 2.4 Nevada 2001\n", 281 | "4 2.9 Nevada 2002\n" 282 | ] 283 | } 284 | ], 285 | "source": [ 286 | "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], \n", 287 | " 'year': [2000, 2001, 2002, 2001, 2002], \n", 288 | " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n", 289 | "frame = pd.DataFrame(data)\n", 290 | "print(frame)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "To select a single column, the syntax is treated the same way you would a dictionary, with brackets inside which contains the column name. \n", 298 | "\n", 299 | "It's important to note that DataFrames are actually a collection of `series`, which we can see is true when we index one of the columns and print its type." 
300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 16, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "<class 'pandas.core.series.Series'>\n" 312 | ] 313 | } 314 | ], 315 | "source": [ 316 | "sel = frame['state']\n", 317 | "print(type(sel))" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "## Challenge Question\n", 325 | "\n", 326 | "True or False: Series and dictionaries can be converted to one another with one simple function call." 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## CSV Files\n", 334 | "\n", 335 | "It was briefly mentioned earlier that one of the advantages of `pandas` is the ability to easily convert a CSV file to a DataFrame. \n", 336 | "\n", 337 | "Using [this](https://s3.amazonaws.com/assets.datacamp.com/production/course_1639/datasets/world_ind_pop_data.csv) CSV file, which contains population data from the World Bank, we'll use pandas to convert the file to a DataFrame -- all in two lines of code.
" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 40, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | " CountryName CountryCode Year \\\n", 350 | "0 Arab World ARB 1960 \n", 351 | "1 Caribbean small states CSS 1960 \n", 352 | "2 Central Europe and the Baltics CEB 1960 \n", 353 | "3 East Asia & Pacific (all income levels) EAS 1960 \n", 354 | "4 East Asia & Pacific (developing only) EAP 1960 \n", 355 | "5 Euro area EMU 1960 \n", 356 | "6 Europe & Central Asia (all income levels) ECS 1960 \n", 357 | "7 Europe & Central Asia (developing only) ECA 1960 \n", 358 | "8 European Union EUU 1960 \n", 359 | "9 Fragile and conflict affected situations FCS 1960 \n", 360 | "10 Heavily indebted poor countries (HIPC) HPC 1960 \n", 361 | "11 High income HIC 1960 \n", 362 | "12 High income: nonOECD NOC 1960 \n", 363 | "13 High income: OECD OEC 1960 \n", 364 | "14 Latin America & Caribbean (all income levels) LCN 1960 \n", 365 | "15 Latin America & Caribbean (developing only) LAC 1960 \n", 366 | "16 Least developed countries: UN classification LDC 1960 \n", 367 | "17 Low & middle income LMY 1960 \n", 368 | "18 Low income LIC 1960 \n", 369 | "19 Lower middle income LMC 1960 \n", 370 | "20 Middle East & North Africa (all income levels) MEA 1960 \n", 371 | "21 Middle East & North Africa (developing only) MNA 1960 \n", 372 | "22 Middle income MIC 1960 \n", 373 | "23 North America NAC 1960 \n", 374 | "24 OECD members OED 1960 \n", 375 | "25 Other small states OSS 1960 \n", 376 | "26 Pacific island small states PSS 1960 \n", 377 | "27 Small states SST 1960 \n", 378 | "28 South Asia SAS 1960 \n", 379 | "29 Sub-Saharan Africa (all income levels) SSF 1960 \n", 380 | "... ... ... ... 
\n", 381 | "13344 Sweden SWE 2014 \n", 382 | "13345 Switzerland CHE 2014 \n", 383 | "13346 Syrian Arab Republic SYR 2014 \n", 384 | "13347 Tajikistan TJK 2014 \n", 385 | "13348 Tanzania TZA 2014 \n", 386 | "13349 Thailand THA 2014 \n", 387 | "13350 Timor-Leste TMP 2014 \n", 388 | "13351 Togo TGO 2014 \n", 389 | "13352 Tonga TON 2014 \n", 390 | "13353 Trinidad and Tobago TTO 2014 \n", 391 | "13354 Tunisia TUN 2014 \n", 392 | "13355 Turkey TUR 2014 \n", 393 | "13356 Turkmenistan TKM 2014 \n", 394 | "13357 Turks and Caicos Islands TCA 2014 \n", 395 | "13358 Tuvalu TUV 2014 \n", 396 | "13359 Uganda UGA 2014 \n", 397 | "13360 Ukraine UKR 2014 \n", 398 | "13361 United Arab Emirates ARE 2014 \n", 399 | "13362 United Kingdom GBR 2014 \n", 400 | "13363 United States USA 2014 \n", 401 | "13364 Uruguay URY 2014 \n", 402 | "13365 Uzbekistan UZB 2014 \n", 403 | "13366 Vanuatu VUT 2014 \n", 404 | "13367 Venezuela, RB VEN 2014 \n", 405 | "13368 Vietnam VNM 2014 \n", 406 | "13369 Virgin Islands (U.S.) VIR 2014 \n", 407 | "13370 West Bank and Gaza WBG 2014 \n", 408 | "13371 Yemen, Rep. 
YEM 2014 \n", 409 | "13372 Zambia ZMB 2014 \n", 410 | "13373 Zimbabwe ZWE 2014 \n", 411 | "\n", 412 | " Total Population Urban population (% of total) \n", 413 | "0 9.249590e+07 31.285384 \n", 414 | "1 4.190810e+06 31.597490 \n", 415 | "2 9.140158e+07 44.507921 \n", 416 | "3 1.042475e+09 22.471132 \n", 417 | "4 8.964930e+08 16.917679 \n", 418 | "5 2.653965e+08 62.096947 \n", 419 | "6 6.674890e+08 55.378977 \n", 420 | "7 1.553174e+08 38.066129 \n", 421 | "8 4.094985e+08 61.212898 \n", 422 | "9 1.203546e+08 17.891972 \n", 423 | "10 1.624912e+08 12.236046 \n", 424 | "11 9.075975e+08 62.680332 \n", 425 | "12 1.866767e+08 56.107863 \n", 426 | "13 7.209208e+08 64.285435 \n", 427 | "14 2.205642e+08 49.284688 \n", 428 | "15 1.776822e+08 44.863308 \n", 429 | "16 2.410728e+08 9.616261 \n", 430 | "17 2.127373e+09 21.272894 \n", 431 | "18 1.571884e+08 11.498396 \n", 432 | "19 9.429116e+08 19.810513 \n", 433 | "20 1.055126e+08 34.951334 \n", 434 | "21 9.786942e+07 33.875012 \n", 435 | "22 1.970185e+09 22.053114 \n", 436 | "23 1.986244e+08 69.918403 \n", 437 | "24 7.866482e+08 62.480915 \n", 438 | "25 6.590560e+06 14.337844 \n", 439 | "26 8.613780e+05 22.043762 \n", 440 | "27 1.164275e+07 21.120573 \n", 441 | "28 5.720361e+08 16.735545 \n", 442 | "29 2.282688e+08 14.631387 \n", 443 | "... ... ... 
\n", 444 | "13344 9.689555e+06 85.665000 \n", 445 | "13345 8.190229e+06 73.844000 \n", 446 | "13346 2.215780e+07 57.255000 \n", 447 | "13347 8.295840e+06 26.692000 \n", 448 | "13348 5.182262e+07 30.901000 \n", 449 | "13349 6.772598e+07 49.174000 \n", 450 | "13350 1.212107e+06 32.131000 \n", 451 | "13351 7.115163e+06 39.469000 \n", 452 | "13352 1.055860e+05 23.632000 \n", 453 | "13353 1.354483e+06 8.550000 \n", 454 | "13354 1.099660e+07 66.645000 \n", 455 | "13355 7.593235e+07 72.891000 \n", 456 | "13356 5.307188e+06 49.688000 \n", 457 | "13357 3.374000e+04 91.847000 \n", 458 | "13358 9.893000e+03 58.782000 \n", 459 | "13359 3.778297e+07 15.766000 \n", 460 | "13360 4.536290e+07 69.482000 \n", 461 | "13361 9.086139e+06 85.266000 \n", 462 | "13362 6.451038e+07 82.345000 \n", 463 | "13363 3.188571e+08 81.447000 \n", 464 | "13364 3.419516e+06 95.152000 \n", 465 | "13365 3.075770e+07 36.278000 \n", 466 | "13366 2.588830e+05 25.817000 \n", 467 | "13367 3.069383e+07 88.941000 \n", 468 | "13368 9.073000e+07 32.951000 \n", 469 | "13369 1.041700e+05 95.203000 \n", 470 | "13370 4.294682e+06 75.026000 \n", 471 | "13371 2.618368e+07 34.027000 \n", 472 | "13372 1.572134e+07 40.472000 \n", 473 | "13373 1.524586e+07 32.501000 \n", 474 | "\n", 475 | "[13374 rows x 5 columns]\n" 476 | ] 477 | } 478 | ], 479 | "source": [ 480 | "file = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1639/datasets/world_ind_pop_data.csv'\n", 481 | "df = pd.read_csv(file)\n", 482 | "\n", 483 | "print(df)" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "As you can see above, we used the built-in pandas function, `read_csv()` to input the file into Python. Just to confirm that the file was properly converted into a DataFrame, let's take a look at what the data type is." 
491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 18, 496 | "metadata": {}, 497 | "outputs": [ 498 | { 499 | "name": "stdout", 500 | "output_type": "stream", 501 | "text": [ 502 | "<class 'pandas.core.frame.DataFrame'>\n" 503 | ] 504 | } 505 | ], 506 | "source": [ 507 | "print(type(df))" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "Previously, we used the `type()` function to check the data type of our converted CSV file, but `pandas` actually provides users with an `info()` method to display a concise summary of a given DataFrame. Let's take a look at what `pandas` has to say about the World Bank dataset." 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 19, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "name": "stdout", 524 | "output_type": "stream", 525 | "text": [ 526 | "<class 'pandas.core.frame.DataFrame'>\n", 527 | "RangeIndex: 13374 entries, 0 to 13373\n", 528 | "Data columns (total 5 columns):\n", 529 | "CountryName 13374 non-null object\n", 530 | "CountryCode 13374 non-null object\n", 531 | "Year 13374 non-null int64\n", 532 | "Total Population 13374 non-null float64\n", 533 | "Urban population (% of total) 13374 non-null float64\n", 534 | "dtypes: float64(2), int64(1), object(2)\n", 535 | "memory usage: 522.5+ KB\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "df.info()" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "The very first line, as you can see, tells us what type it is -- awesome. Additionally, we have useful information involving the size of the data, the data types of each column, and more. This is a wonderful substitute for having to call different functions like `len()`, `type()`, etc., on our DataFrame." 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "Unless you opened up the link provided above, you've likely not seen what the actual data looks like.
To fix that, you can utilize another crucial `pandas` method called `head()`. By default, it prints the column names and the first five rows of the DataFrame you call the method on, but if you want to adjust the number of rows, you can pass it whatever number you'd like." 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 20, 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "name": "stdout", 564 | "output_type": "stream", 565 | "text": [ 566 | " CountryName CountryCode Year \\\n", 567 | "0 Arab World ARB 1960 \n", 568 | "1 Caribbean small states CSS 1960 \n", 569 | "2 Central Europe and the Baltics CEB 1960 \n", 570 | "3 East Asia & Pacific (all income levels) EAS 1960 \n", 571 | "4 East Asia & Pacific (developing only) EAP 1960 \n", 572 | "\n", 573 | " Total Population Urban population (% of total) \n", 574 | "0 9.249590e+07 31.285384 \n", 575 | "1 4.190810e+06 31.597490 \n", 576 | "2 9.140158e+07 44.507921 \n", 577 | "3 1.042475e+09 22.471132 \n", 578 | "4 8.964930e+08 16.917679 \n" 579 | ] 580 | } 581 | ], 582 | "source": [ 583 | "print(df.head())" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 21, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "data": { 593 | "text/html": [ 594 | "
\n", 595 | "\n", 608 | "\n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | "
CountryNameCountryCodeYearTotal PopulationUrban population (% of total)
0Arab WorldARB196092495902.031.285384
1Caribbean small statesCSS19604190810.031.597490
\n", 638 | "
" 639 | ], 640 | "text/plain": [ 641 | " CountryName CountryCode Year Total Population \\\n", 642 | "0 Arab World ARB 1960 92495902.0 \n", 643 | "1 Caribbean small states CSS 1960 4190810.0 \n", 644 | "\n", 645 | " Urban population (% of total) \n", 646 | "0 31.285384 \n", 647 | "1 31.597490 " 648 | ] 649 | }, 650 | "execution_count": 21, 651 | "metadata": {}, 652 | "output_type": "execute_result" 653 | } 654 | ], 655 | "source": [ 656 | "df.head(2)" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": {}, 662 | "source": [ 663 | "On the opposite end, you can also display the last 10 rows of a given DataFrame as well, using the `tail()` method." 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": 22, 669 | "metadata": {}, 670 | "outputs": [ 671 | { 672 | "data": { 673 | "text/html": [ 674 | "
\n", 675 | "\n", 688 | "\n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | "
CountryNameCountryCodeYearTotal PopulationUrban population (% of total)
13364UruguayURY20143419516.095.152
13365UzbekistanUZB201430757700.036.278
13366VanuatuVUT2014258883.025.817
13367Venezuela, RBVEN201430693827.088.941
13368VietnamVNM201490730000.032.951
13369Virgin Islands (U.S.)VIR2014104170.095.203
13370West Bank and GazaWBG20144294682.075.026
13371Yemen, Rep.YEM201426183676.034.027
13372ZambiaZMB201415721343.040.472
13373ZimbabweZWE201415245855.032.501
\n", 782 | "
" 783 | ], 784 | "text/plain": [ 785 | " CountryName CountryCode Year Total Population \\\n", 786 | "13364 Uruguay URY 2014 3419516.0 \n", 787 | "13365 Uzbekistan UZB 2014 30757700.0 \n", 788 | "13366 Vanuatu VUT 2014 258883.0 \n", 789 | "13367 Venezuela, RB VEN 2014 30693827.0 \n", 790 | "13368 Vietnam VNM 2014 90730000.0 \n", 791 | "13369 Virgin Islands (U.S.) VIR 2014 104170.0 \n", 792 | "13370 West Bank and Gaza WBG 2014 4294682.0 \n", 793 | "13371 Yemen, Rep. YEM 2014 26183676.0 \n", 794 | "13372 Zambia ZMB 2014 15721343.0 \n", 795 | "13373 Zimbabwe ZWE 2014 15245855.0 \n", 796 | "\n", 797 | " Urban population (% of total) \n", 798 | "13364 95.152 \n", 799 | "13365 36.278 \n", 800 | "13366 25.817 \n", 801 | "13367 88.941 \n", 802 | "13368 32.951 \n", 803 | "13369 95.203 \n", 804 | "13370 75.026 \n", 805 | "13371 34.027 \n", 806 | "13372 40.472 \n", 807 | "13373 32.501 " 808 | ] 809 | }, 810 | "execution_count": 22, 811 | "metadata": {}, 812 | "output_type": "execute_result" 813 | } 814 | ], 815 | "source": [ 816 | "df.tail(10)" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "Earlier in this tutorial, we selected a column from the DataFrame we built using a dictionary. The same can be done an **any** DataFrame, including those which began as a CSV file. With that said, we'll review an example by selecting the `Total Population` column from the World Bank dataset." 
824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": 23, 829 | "metadata": {}, 830 | "outputs": [ 831 | { 832 | "name": "stdout", 833 | "output_type": "stream", 834 | "text": [ 835 | "0 9.249590e+07\n", 836 | "1 4.190810e+06\n", 837 | "2 9.140158e+07\n", 838 | "3 1.042475e+09\n", 839 | "4 8.964930e+08\n", 840 | "Name: Total Population, dtype: float64\n" 841 | ] 842 | } 843 | ], 844 | "source": [ 845 | "print(df['Total Population'].head()) # called the head function so we don't have to view the entire column" 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "metadata": {}, 851 | "source": [ 852 | "Now that we've reviewed how to select columns from DataFrames, how do we go about selecting *rows* from a DataFrame? \n", 853 | "\n", 854 | "A simple way is to use the built-in `ix` indexer, which can take either a single label to select **one** row, or a range using slicing rules like those you've seen in regular Python lists. (Note that `ix` has since been deprecated in favor of `loc` for label-based indexing and `iloc` for positional indexing, as the warning in the output below points out.) \n", 855 | "\n", 856 | "Below we'll print rows 5 through 10 -- six rows, since label-based slicing includes both endpoints: " 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": 41, 862 | "metadata": {}, 863 | "outputs": [ 864 | { 865 | "name": "stdout", 866 | "output_type": "stream", 867 | "text": [ 868 | " CountryName CountryCode Year \\\n", 869 | "5 Euro area EMU 1960 \n", 870 | "6 Europe & Central Asia (all income levels) ECS 1960 \n", 871 | "7 Europe & Central Asia (developing only) ECA 1960 \n", 872 | "8 European Union EUU 1960 \n", 873 | "9 Fragile and conflict affected situations FCS 1960 \n", 874 | "10 Heavily indebted poor countries (HIPC) HPC 1960 \n", 875 | "\n", 876 | " Total Population Urban population (% of total) \n", 877 | "5 265396501.0 62.096947 \n", 878 | "6 667489033.0 55.378977 \n", 879 | "7 155317369.0 38.066129 \n", 880 | "8 409498462.0 61.212898 \n", 881 | "9 120354582.0 17.891972 \n", 882 | "10 162491185.0 12.236046 \n" 883 | ] 884 | }, 885 | { 886 | "name": "stderr", 887 | "output_type": "stream", 888 |
"text": [ 889 | "/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: DeprecationWarning: \n", 890 | ".ix is deprecated. Please use\n", 891 | ".loc for label based indexing or\n", 892 | ".iloc for positional indexing\n", 893 | "\n", 894 | "See the documentation here:\n", 895 | "http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated\n", 896 | " \"\"\"Entry point for launching an IPython kernel.\n" 897 | ] 898 | } 899 | ], 900 | "source": [ 901 | "print(df.ix[5:10])" 902 | ] 903 | }, 904 | { 905 | "cell_type": "markdown", 906 | "metadata": {}, 907 | "source": [ 908 | "### Apply\n", 909 | "\n", 910 | "Lets's generate a random dictionary:" 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": 45, 916 | "metadata": {}, 917 | "outputs": [ 918 | { 919 | "name": "stdout", 920 | "output_type": "stream", 921 | "text": [ 922 | " b d e\n", 923 | "Utah 1.824797 1.538739 1.277535\n", 924 | "Ohio -0.280938 0.933722 -0.450051\n", 925 | "Texas -0.015748 0.064026 0.158915\n", 926 | "Oregon -0.591756 0.595281 -0.143778\n" 927 | ] 928 | } 929 | ], 930 | "source": [ 931 | "import numpy as np\n", 932 | "frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n", 933 | "\n", 934 | "print(frame)" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "metadata": {}, 940 | "source": [ 941 | "With this, we can apply a function on a DataFrame:" 942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "execution_count": 46, 947 | "metadata": {}, 948 | "outputs": [ 949 | { 950 | "data": { 951 | "text/html": [ 952 | "
\n", 953 | "\n", 966 | "\n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | "
bde
Utah1.8247971.5387391.277535
Ohio0.2809380.9337220.450051
Texas0.0157480.0640260.158915
Oregon0.5917560.5952810.143778
\n", 1002 | "
" 1003 | ], 1004 | "text/plain": [ 1005 | " b d e\n", 1006 | "Utah 1.824797 1.538739 1.277535\n", 1007 | "Ohio 0.280938 0.933722 0.450051\n", 1008 | "Texas 0.015748 0.064026 0.158915\n", 1009 | "Oregon 0.591756 0.595281 0.143778" 1010 | ] 1011 | }, 1012 | "execution_count": 46, 1013 | "metadata": {}, 1014 | "output_type": "execute_result" 1015 | } 1016 | ], 1017 | "source": [ 1018 | "np.abs(frame)" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "markdown", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "We can also apply functions with the `apply()` method:" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": 47, 1031 | "metadata": {}, 1032 | "outputs": [ 1033 | { 1034 | "name": "stdout", 1035 | "output_type": "stream", 1036 | "text": [ 1037 | "b 2.416554\n", 1038 | "d 1.474713\n", 1039 | "e 1.727587\n", 1040 | "dtype: float64\n" 1041 | ] 1042 | } 1043 | ], 1044 | "source": [ 1045 | "f = lambda x: x.max() - x.min()\n", 1046 | "\n", 1047 | "print(frame.apply(f))" 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "code", 1052 | "execution_count": 28, 1053 | "metadata": {}, 1054 | "outputs": [ 1055 | { 1056 | "name": "stdout", 1057 | "output_type": "stream", 1058 | "text": [ 1059 | " b d e\n", 1060 | "Utah 0.866926 -0.104468 -1.201680\n", 1061 | "Ohio -0.031610 0.008895 0.100290\n", 1062 | "Texas -0.342475 0.341394 -2.020873\n", 1063 | "Oregon 0.666301 0.526818 0.749407\n" 1064 | ] 1065 | } 1066 | ], 1067 | "source": [ 1068 | "f = lambda x: np.abs(x)\n", 1069 | "\n", 1070 | "frame.apply(f)\n", 1071 | "\n", 1072 | "print(frame)" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "markdown", 1077 | "metadata": {}, 1078 | "source": [ 1079 | "## Challenge Question \n", 1080 | "\n", 1081 | "Does the `pandas.DataFrame.apply()` function modify the original DataFrame or return a new DataFrame?" 
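If you'd like to check your answer to the challenge empirically, here is a minimal sketch. It rebuilds a small random `frame` like the one above (the exact values will differ, since they are random) so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Rebuild a small random DataFrame like the `frame` used above.
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

before = frame.copy()
result = frame.apply(np.abs)

print(result is frame)            # False: apply() returns a new object
print(frame.equals(before))       # True: the original DataFrame is unchanged
print((result >= 0).all().all())  # True: the returned copy holds the absolute values
```

The same holds for calling `np.abs(frame)` directly: both return a new DataFrame and leave `frame` alone.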
1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "metadata": {}, 1087 | "source": [ 1088 | "#### Sorting\n", 1089 | "\n", 1090 | "To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "code", 1095 | "execution_count": 29, 1096 | "metadata": {}, 1097 | "outputs": [ 1098 | { 1099 | "data": { 1100 | "text/html": [ 1101 | "
\n", 1102 | "\n", 1115 | "\n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | "
bde
Ohio-0.0316100.0088950.100290
Oregon0.6663010.5268180.749407
Texas-0.3424750.341394-2.020873
Utah0.866926-0.104468-1.201680
\n", 1151 | "
" 1152 | ], 1153 | "text/plain": [ 1154 | " b d e\n", 1155 | "Ohio -0.031610 0.008895 0.100290\n", 1156 | "Oregon 0.666301 0.526818 0.749407\n", 1157 | "Texas -0.342475 0.341394 -2.020873\n", 1158 | "Utah 0.866926 -0.104468 -1.201680" 1159 | ] 1160 | }, 1161 | "execution_count": 29, 1162 | "metadata": {}, 1163 | "output_type": "execute_result" 1164 | } 1165 | ], 1166 | "source": [ 1167 | "frame.sort_index()" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "markdown", 1172 | "metadata": {}, 1173 | "source": [ 1174 | "## Data Merging\n", 1175 | "\n", 1176 | "If you encounter two different datasets that contain the same type of information, you might consider merging them for your analyses. This is yet another functionality built into `pandas`. \n", 1177 | "\n", 1178 | "Let's go through an example containing student data. `d1` contains 5 of the samples and `d2` contains 2 of them: " 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "code", 1183 | "execution_count": 30, 1184 | "metadata": {}, 1185 | "outputs": [ 1186 | { 1187 | "name": "stdout", 1188 | "output_type": "stream", 1189 | "text": [ 1190 | " First Name Last Name\n", 1191 | "0 Lesley Cordero\n", 1192 | "1 Ojas Sathe\n", 1193 | "2 Helen Chen\n", 1194 | "3 Eli Epperson\n", 1195 | "4 Jacob Greenberg\n" 1196 | ] 1197 | } 1198 | ], 1199 | "source": [ 1200 | "d1 = pd.read_csv(\"./names_original.csv\")\n", 1201 | "print(d1)" 1202 | ] 1203 | }, 1204 | { 1205 | "cell_type": "code", 1206 | "execution_count": 31, 1207 | "metadata": {}, 1208 | "outputs": [ 1209 | { 1210 | "name": "stdout", 1211 | "output_type": "stream", 1212 | "text": [ 1213 | "\n" 1214 | ] 1215 | } 1216 | ], 1217 | "source": [ 1218 | "print(type(d1))" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": 32, 1224 | "metadata": {}, 1225 | "outputs": [ 1226 | { 1227 | "name": "stdout", 1228 | "output_type": "stream", 1229 | "text": [ 1230 | " First Name Last Name\n", 1231 | "0 Martin Perez\n", 1232 | "1 Menna Elsayed\n" 1233 | ] 
1234 | } 1235 | ], 1236 | "source": [ 1237 | "d2 = pd.read_csv(\"./names_add.csv\")\n", 1238 | "print(d2)" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "markdown", 1243 | "metadata": {}, 1244 | "source": [ 1245 | "### Concatenation \n", 1246 | "\n", 1247 | "Instead of working with two separate datasets, it's often easier to combine them into a single DataFrame, which we can do with the `concat()` function:\n" 1248 | ] 1249 | }, 1250 | { 1251 | "cell_type": "code", 1252 | "execution_count": 33, 1253 | "metadata": {}, 1254 | "outputs": [ 1255 | { 1256 | "name": "stdout", 1257 | "output_type": "stream", 1258 | "text": [ 1259 | " First Name Last Name\n", 1260 | "0 Lesley Cordero\n", 1261 | "1 Ojas Sathe\n", 1262 | "2 Helen Chen\n", 1263 | "3 Eli Epperson\n", 1264 | "4 Jacob Greenberg\n", 1265 | "0 Martin Perez\n", 1266 | "1 Menna Elsayed\n" 1267 | ] 1268 | } 1269 | ], 1270 | "source": [ 1271 | "result = pd.concat([d1,d2])\n", 1272 | "print(result)" 1273 | ] 1274 | }, 1275 | { 1276 | "cell_type": "markdown", 1277 | "metadata": {}, 1278 | "source": [ 1279 | "Notice that the row labels 0 and 1 appear twice: `concat()` keeps each DataFrame's original index (pass `ignore_index=True` if you'd rather renumber the rows). Now, you might be asking what happens if one of the datasets has more columns than the other: will they still combine? 
Let's try this example with another dataset:" 1280 | ] 1281 | }, 1282 | { 1283 | "cell_type": "code", 1284 | "execution_count": 34, 1285 | "metadata": {}, 1286 | "outputs": [ 1287 | { 1288 | "name": "stdout", 1289 | "output_type": "stream", 1290 | "text": [ 1291 | " First Name Last Name Major\n", 1292 | "0 Martin Perez Mechanical Engineering\n", 1293 | "1 Menna Elsayed Sociology\n" 1294 | ] 1295 | } 1296 | ], 1297 | "source": [ 1298 | "d3 = pd.read_csv(\"./names_extra.csv\")\n", 1299 | "print(d3)" 1300 | ] 1301 | }, 1302 | { 1303 | "cell_type": "markdown", 1304 | "metadata": {}, 1305 | "source": [ 1306 | "If we use the same `concat()` function, we get:" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "code", 1311 | "execution_count": 35, 1312 | "metadata": {}, 1313 | "outputs": [ 1314 | { 1315 | "name": "stdout", 1316 | "output_type": "stream", 1317 | "text": [ 1318 | " First Name Last Name Major\n", 1319 | "0 Lesley Cordero NaN\n", 1320 | "1 Ojas Sathe NaN\n", 1321 | "2 Helen Chen NaN\n", 1322 | "3 Eli Epperson NaN\n", 1323 | "4 Jacob Greenberg NaN\n", 1324 | "0 Martin Perez Mechanical Engineering\n", 1325 | "1 Menna Elsayed Sociology\n" 1326 | ] 1327 | } 1328 | ], 1329 | "source": [ 1330 | "result1 = pd.concat([d1, d3])\n", 1331 | "print(result1)" 1332 | ] 1333 | }, 1334 | { 1335 | "cell_type": "markdown", 1336 | "metadata": {}, 1337 | "source": [ 1338 | "Notice the `NaN` values: `NaN` is how `pandas` marks missing data. Wherever a row has no entry for a column, `pandas` fills in `NaN` rather than failing." 1339 | ] 1340 | }, 1341 | { 1342 | "cell_type": "markdown", 1343 | "metadata": {}, 1344 | "source": [ 1345 | "## Challenge Question\n", 1346 | "\n", 1347 | "Does the `pandas.concat()` function modify the original DataFrames or return a new DataFrame?" 
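One note before you answer: `concat()` is a top-level pandas function, called as `pd.concat([...])`, not a method on a DataFrame. To check the challenge empirically, here is a minimal sketch using small inline stand-ins for the CSV data so it runs on its own:

```python
import pandas as pd

# Small inline stand-ins for names_original.csv and names_extra.csv.
d1 = pd.DataFrame({'First Name': ['Lesley', 'Ojas'],
                   'Last Name': ['Cordero', 'Sathe']})
d3 = pd.DataFrame({'First Name': ['Martin'],
                   'Last Name': ['Perez'],
                   'Major': ['Mechanical Engineering']})

result = pd.concat([d1, d3])

print(d1.shape)                      # (2, 2): d1 is untouched
print(result.shape)                  # (3, 3): a new DataFrame with the union of columns
print(result['Major'].isna().sum())  # 2: rows from d1 get NaN for the missing Major
```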
1348 | ] 1349 | }, 1350 | { 1351 | "cell_type": "markdown", 1352 | "metadata": {}, 1353 | "source": [ 1354 | "### Merging\n", 1355 | "\n", 1356 | "Now, how do we combine two datasets that hold different information but share a key column (here, `Dorm`)? Let's take a look at our datasets:" 1357 | ] 1358 | }, 1359 | { 1360 | "cell_type": "code", 1361 | "execution_count": 36, 1362 | "metadata": {}, 1363 | "outputs": [ 1364 | { 1365 | "name": "stdout", 1366 | "output_type": "stream", 1367 | "text": [ 1368 | " Dorm Name\n", 1369 | "0 East Campus Helen Chen\n", 1370 | "1 Broadway Danielle Jing\n", 1371 | "2 Shapiro Craig Rhodes\n", 1372 | "3 Watt Lesley Cordero\n", 1373 | "4 East Campus Martin Perez\n", 1374 | "5 Broadway Menna Elsayed\n", 1375 | "6 Wallach Will Essilfie\n" 1376 | ] 1377 | } 1378 | ], 1379 | "source": [ 1380 | "h1 = pd.read_csv(\"./housing.csv\")\n", 1381 | "print(h1)" 1382 | ] 1383 | }, 1384 | { 1385 | "cell_type": "code", 1386 | "execution_count": 37, 1387 | "metadata": {}, 1388 | "outputs": [ 1389 | { 1390 | "name": "stdout", 1391 | "output_type": "stream", 1392 | "text": [ 1393 | " Dorm Street Cost\n", 1394 | "0 Broadway 114th 9000\n", 1395 | "1 Shapiro 115th 9500\n", 1396 | "2 Watt 113th 10500\n", 1397 | "3 East Campus 116th 11,000\n", 1398 | "4 Wallach 114th 9500\n" 1399 | ] 1400 | } 1401 | ], 1402 | "source": [ 1403 | "h2 = pd.read_csv(\"./dorms.csv\")\n", 1404 | "print(h2)" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "markdown", 1409 | "metadata": {}, 1410 | "source": [ 1411 | "With the `merge()` function in pandas, we can specify which column to merge on and what kind of join to perform. 
By default, `merge()` performs an 'inner' join; here we request a left join instead, which keeps every row of `h1`:" 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "code", 1416 | "execution_count": 38, 1417 | "metadata": {}, 1418 | "outputs": [ 1419 | { 1420 | "name": "stdout", 1421 | "output_type": "stream", 1422 | "text": [ 1423 | " Dorm Name Street Cost\n", 1424 | "0 East Campus Helen Chen 116th 11,000\n", 1425 | "1 Broadway Danielle Jing 114th 9000\n", 1426 | "2 Shapiro Craig Rhodes 115th 9500\n", 1427 | "3 Watt Lesley Cordero 113th 10500\n", 1428 | "4 East Campus Martin Perez 116th 11,000\n", 1429 | "5 Broadway Menna Elsayed 114th 9000\n", 1430 | "6 Wallach Will Essilfie 114th 9500\n" 1431 | ] 1432 | } 1433 | ], 1434 | "source": [ 1435 | "house = pd.merge(h1, h2, on=\"Dorm\", how=\"left\")\n", 1436 | "print(house)" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "markdown", 1441 | "metadata": {}, 1442 | "source": [ 1443 | "## Challenge Question\n", 1444 | "\n", 1445 | "Does the `pandas.merge()` function modify the original DataFrames or return a new DataFrame?" 1446 | ] 1447 | }, 1448 | { 1449 | "cell_type": "markdown", 1450 | "metadata": { 1451 | "collapsed": true 1452 | }, 1453 | "source": [ 1454 | "## Resources\n", 1455 | "\n", 1456 | "[Pandas Learning Resources](https://chatbotslife.com/pandas-learning-resources-946540ba574e)
\n", 1457 | "[GeoPandas Tutorial](https://www.twilio.com/blog/2017/08/geospatial-analysis-python-geojson-geopandas.html)" 1458 | ] 1459 | }, 1460 | { 1461 | "cell_type": "code", 1462 | "execution_count": null, 1463 | "metadata": {}, 1464 | "outputs": [], 1465 | "source": [] 1466 | } 1467 | ], 1468 | "metadata": { 1469 | "anaconda-cloud": {}, 1470 | "kernelspec": { 1471 | "display_name": "Python 3", 1472 | "language": "python", 1473 | "name": "python3" 1474 | }, 1475 | "language_info": { 1476 | "codemirror_mode": { 1477 | "name": "ipython", 1478 | "version": 3 1479 | }, 1480 | "file_extension": ".py", 1481 | "mimetype": "text/x-python", 1482 | "name": "python", 1483 | "nbconvert_exporter": "python", 1484 | "pygments_lexer": "ipython3", 1485 | "version": "3.6.2" 1486 | } 1487 | }, 1488 | "nbformat": 4, 1489 | "nbformat_minor": 2 1490 | } 1491 | --------------------------------------------------------------------------------