├── .gitignore ├── names_add.csv ├── names_extra.csv ├── names_original.csv ├── dorms.csv ├── housing.csv ├── README.md └── Pandas.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | __pycache__/ 3 | -------------------------------------------------------------------------------- /names_add.csv: -------------------------------------------------------------------------------- 1 | First Name,Last Name 2 | Martin,Perez 3 | Menna,Elsayed -------------------------------------------------------------------------------- /names_extra.csv: -------------------------------------------------------------------------------- 1 | First Name,Last Name,Major 2 | Martin,Perez,Mechanical Engineering 3 | Menna,Elsayed,Sociology -------------------------------------------------------------------------------- /names_original.csv: -------------------------------------------------------------------------------- 1 | First Name,Last Name 2 | Lesley,Cordero 3 | Ojas,Sathe 4 | Helen,Chen 5 | Eli,Epperson 6 | Jacob,Greenberg -------------------------------------------------------------------------------- /dorms.csv: -------------------------------------------------------------------------------- 1 | Dorm,Street,Cost 2 | Broadway,114th,9000 3 | Shapiro,115th,9500 4 | Watt,113th,10500 5 | East Campus,116th,"11,000" 6 | Wallach,114th,9500 -------------------------------------------------------------------------------- /housing.csv: -------------------------------------------------------------------------------- 1 | Dorm,Name 2 | East Campus,Helen Chen 3 | Broadway,Danielle Jing 4 | Shapiro,Craig Rhodes 5 | Watt,Lesley Cordero 6 | East Campus,Martin Perez 7 | Broadway,Menna Elsayed 8 | Wallach,Will Essilfie 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Environment Setup 2 | 3 | In this 
tutorial, we'll be using Python to work with CSV files computationally. Specifically, we'll be working with pandas in Python. 4 | 5 | This guide was written in Python 3.6. If you haven't already, download [Python](https://www.python.org/downloads/release/python-361/) and [Pip](https://pypi.python.org/pypi/pip). Next, you’ll need to install the pandas module that we’ll use throughout this tutorial: 6 | 7 | ``` 8 | pip3 install pandas==0.20.3 9 | pip3 install jupyter==1.0.0 10 | ``` 11 | 12 | For this workshop, we'll be using five datasets. If you're familiar with git, the easiest way to retrieve them is to clone the repo with the following command: 13 | 14 | ``` 15 | git clone https://github.com/lesley2958/pandas-workshop 16 | ``` 17 | 18 | Otherwise, download the datasets [here](https://drive.google.com/open?id=0B4I1qITaz894NHBNX19LUjZaU0k). To make your life a bit easier, try to download these datasets onto your desktop, or somewhere you can easily access them with Python. 19 | 20 | Since we’ll be working with Python interactively, using the Jupyter Notebook is the best way to get the most out of this tutorial. You already installed it with pip3 above; now you just need to get it running. With that said, open up your terminal or command prompt and enter the following command: 21 | 22 | ``` 23 | jupyter notebook 24 | ``` 25 | 26 | And BOOM! It should have opened up in your default browser. Now we’re ready to go. 27 | 28 | ### A Quick Note on Jupyter 29 | 30 | For those of you who are unfamiliar with Jupyter notebooks, I’ve provided a brief review of the functions that will be particularly useful for moving along with this tutorial. 31 | 32 | In the image below, you’ll see three buttons labeled 1-3 that will be important for you to get a grasp of -- the save button (1), the add cell button (2), and the run cell button (3).
33 | 34 | ![alt text](https://www.twilio.com/blog/wp-content/uploads/2017/10/XCdsHFjl8mmg1BT55TZorU8dcUx-DsTlxGZJmeqQMbXk3vv0lPJ-O8YIHjSPwxZ8M2Nw1vOcByUHM_lIRuIpKQ6LASxtcnjyHpf1UpSRKbY0qF1bEgA_hzfRu1pX7y8cSXPgZPEG.png) 35 | 36 | The first button is the one you’ll use to save your work as you go along (1). I won’t give you directions as to when you should do this -- that’s up to you! 37 | 38 | Next, we have the “add cell” button (2). Cells are blocks of code that you can run together. These are the building blocks of Jupyter Notebook because they provide the option of running code incrementally without having to run all your code at once. Throughout this tutorial, you’ll see lines of code blocked off -- each one should correspond to a cell. 39 | 40 | Lastly, there’s the “run cell” button (3). Jupyter Notebook doesn’t automatically run your code for you; you have to tell it when by clicking this button. As with the add button, once you’ve written each block of code in this tutorial into a cell, you should then run it to see the output (if any). Note that any expected output will also be shown in this tutorial so you know what to expect. Make sure to run your code as you go along because many blocks of code in this tutorial rely on previous cells.
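
Before diving into the workshop, it can help to sanity-check the installation. This quick snippet is not part of the original workshop materials -- it simply imports pandas in a Python 3 session (or a notebook cell) and prints the installed version:

```python
# Quick sanity check that the pip3 installs above worked.
# The exact version printed depends on what you installed
# (the pinned install above targets 0.20.3).
import pandas as pd

print(pd.__version__)
```

If the import raises `ModuleNotFoundError`, pandas was installed into a different Python than the one you are running; re-run the `pip3 install` command with the interpreter you intend to use.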
41 | -------------------------------------------------------------------------------- /Pandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Exploring Data with Pandas" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Learning objectives\n", 15 | "\n", 16 | "* Understand the role of `pandas` in the Python ecosystem.\n", 17 | "* Read a flat file (CSV) dataset into Python with `pandas`.\n", 18 | "* Understand the basic features of a `DataFrame`.\n", 19 | "* Employ slicing to select subsets of data from a `DataFrame`.\n", 20 | "* Use label- and integer-based indexing to select ranges of data in a `DataFrame`." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Introduction \n", 28 | "\n", 29 | "Data scientists spend an incredible amount of time and skill acquiring, prepping, cleaning, and normalizing data. In this workshop, we'll review one of the most important tools used during this process, `pandas`. \n", 30 | "\n", 31 | "But first, let's review the differences between data acquisition, preparation, and cleaning.\n", 32 | "\n", 33 | "### Data Acquisition\n", 34 | "\n", 35 | "Many times, your data might already be given to you, in which case you won't have to worry about data acquisition. However, this is not always the case. The process of actually getting a dataset to work with is called **data acquisition**. For the purposes of this tutorial, we won't be providing an in-depth review of data acquisition methodology, but this should provide some context as to how `pandas` fits into the data science ecosystem. \n", 36 | "\n", 37 | "### Data Preparation\n", 38 | "\n", 39 | "Now, once a dataset is acquired, it might not be in a suitable format to work with.
In some cases, you might have scraped data from a website or simply downloaded a zipfile containing multiple CSV files. Luckily, `pandas` provides Python users with an important data type you can use to work with data in an easy manner. This process of converting data into a suitable format is called **data preparation**.\n", 40 | "\n", 41 | "\n", 42 | "### Data Cleaning\n", 43 | "\n", 44 | "Now that your data is being handled in a proper manner, your work with it still might not be done. You might have missing values, values that need normalizing, or values in the wrong data type. Fixing these inconsistencies before analysis is referred to as **data cleaning**. " 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Pandas\n", 52 | "\n", 53 | "Now that we've reviewed where `pandas` fits in as a Python tool, we will begin to review pandas and its different capabilities. \n", 54 | "\n", 55 | "Pandas allows us to deal with data in a user-friendly way by providing us with an important data type, referred to as a `DataFrame`, not otherwise available in Python. With `pandas` we can effortlessly import data from files such as CSVs into a DataFrame object, allowing us to quickly apply transformations, filters, and other data wrangling methodology to our data.\n", 56 | "\n", 57 | "To begin this tutorial, we'll first need to import `pandas`. Typically, the shorthand alias used is `pd`." 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 5, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "import pandas as pd" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Series\n", 74 | "\n", 75 | "Before we review the DataFrame object in `pandas`, we'll begin by reviewing the `Series` data type. \n", 76 | "\n", 77 | "In its simplest form, a Series is a one-dimensional object containing an array of data.
Below, you'll see that we use the `pandas.Series()` function to define an example containing 4 numbers. Notice that the input of this method *is* a Python list." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 6, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "<class 'list'>\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "l1 = [1,2,3]\n", 95 | "print(type(l1))" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 7, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "obj = pd.Series([4, 7, -5, 3])" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "A Series is incredibly similar to a typical Python list; however, it **is** a distinct type, which we can see by printing the data type of the series below." 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 8, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "name": "stdout", 121 | "output_type": "stream", 122 | "text": [ 123 | "<class 'pandas.core.series.Series'>\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "print(type(obj))" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "The reason for this distinction is that `pandas` allows us to attach our own custom indices, as shown below." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 9, 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "name": "stdout", 145 | "output_type": "stream", 146 | "text": [ 147 | "d 4\n", 148 | "b 7\n", 149 | "a -5\n", 150 | "c 3\n", 151 | "dtype: int64\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])\n", 157 | "print(obj2)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "You might be thinking, \"is this not the same thing as a dictionary?\" -- a fair question for
anyone beginning to learn `pandas`. Now, the useful thing is that we can indeed easily convert a dictionary to a Series, as you can see in the code below. " 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 10, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n", 174 | "obj3 = pd.Series(sdata)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "But notice that when we print each object, as well as its type, the format is much more readable with the Series object. This is incredibly useful when a user needs to visually interact with the data." 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 11, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "name": "stdout", 191 | "output_type": "stream", 192 | "text": [ 193 | "<class 'dict'>\n" 194 | ] 195 | } 196 | ], 197 | "source": [ 198 | "print(type(sdata))" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 12, 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}\n" 211 | ] 212 | } 213 | ], 214 | "source": [ 215 | "print(sdata)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 13, 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "<class 'pandas.core.series.Series'>\n" 228 | ] 229 | } 230 | ], 231 | "source": [ 232 | "print(type(obj3))" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 14, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "Ohio 35000\n", 245 | "Oregon 16000\n", 246 | "Texas 71000\n", 247 | "Utah 5000\n", 248 | "dtype: int64\n" 249 | ] 250 | } 251 | ], 252 | "source": [ 253 |
"print(obj3)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "## DataFrames\n", 261 | "\n", 262 | "Now that we've reviewed the simplest object in `pandas`, we're ready to tackle on the DataFrame object. A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).\n", 263 | "\n", 264 | "There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists." 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 15, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "name": "stdout", 274 | "output_type": "stream", 275 | "text": [ 276 | " pop state year\n", 277 | "0 1.5 Ohio 2000\n", 278 | "1 1.7 Ohio 2001\n", 279 | "2 3.6 Ohio 2002\n", 280 | "3 2.4 Nevada 2001\n", 281 | "4 2.9 Nevada 2002\n" 282 | ] 283 | } 284 | ], 285 | "source": [ 286 | "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], \n", 287 | " 'year': [2000, 2001, 2002, 2001, 2002], \n", 288 | " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n", 289 | "frame = pd.DataFrame(data)\n", 290 | "print(frame)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "To select a single column, the syntax is treated the same way you would a dictionary, with brackets inside which contains the column name. \n", 298 | "\n", 299 | "It's important to note that DataFrames are actually a collection of `series`, which we can see is true when we index one of the columns and print its type." 
300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 16, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "<class 'pandas.core.series.Series'>\n" 312 | ] 313 | } 314 | ], 315 | "source": [ 316 | "sel = frame['state']\n", 317 | "print(type(sel))" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "## Challenge Question\n", 325 | "\n", 326 | "True or False: Series and dictionaries can be converted to one another with one simple function call." 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## CSV Files\n", 334 | "\n", 335 | "It was briefly mentioned earlier that one of the advantages of `pandas` is the ability to easily convert a CSV file to a DataFrame. \n", 336 | "\n", 337 | "Using [this](https://s3.amazonaws.com/assets.datacamp.com/production/course_1639/datasets/world_ind_pop_data.csv) CSV file, which contains population data from the World Bank, we'll use pandas to convert the file to a DataFrame -- all in two lines of code.
" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 40, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | " CountryName CountryCode Year \\\n", 350 | "0 Arab World ARB 1960 \n", 351 | "1 Caribbean small states CSS 1960 \n", 352 | "2 Central Europe and the Baltics CEB 1960 \n", 353 | "3 East Asia & Pacific (all income levels) EAS 1960 \n", 354 | "4 East Asia & Pacific (developing only) EAP 1960 \n", 355 | "5 Euro area EMU 1960 \n", 356 | "6 Europe & Central Asia (all income levels) ECS 1960 \n", 357 | "7 Europe & Central Asia (developing only) ECA 1960 \n", 358 | "8 European Union EUU 1960 \n", 359 | "9 Fragile and conflict affected situations FCS 1960 \n", 360 | "10 Heavily indebted poor countries (HIPC) HPC 1960 \n", 361 | "11 High income HIC 1960 \n", 362 | "12 High income: nonOECD NOC 1960 \n", 363 | "13 High income: OECD OEC 1960 \n", 364 | "14 Latin America & Caribbean (all income levels) LCN 1960 \n", 365 | "15 Latin America & Caribbean (developing only) LAC 1960 \n", 366 | "16 Least developed countries: UN classification LDC 1960 \n", 367 | "17 Low & middle income LMY 1960 \n", 368 | "18 Low income LIC 1960 \n", 369 | "19 Lower middle income LMC 1960 \n", 370 | "20 Middle East & North Africa (all income levels) MEA 1960 \n", 371 | "21 Middle East & North Africa (developing only) MNA 1960 \n", 372 | "22 Middle income MIC 1960 \n", 373 | "23 North America NAC 1960 \n", 374 | "24 OECD members OED 1960 \n", 375 | "25 Other small states OSS 1960 \n", 376 | "26 Pacific island small states PSS 1960 \n", 377 | "27 Small states SST 1960 \n", 378 | "28 South Asia SAS 1960 \n", 379 | "29 Sub-Saharan Africa (all income levels) SSF 1960 \n", 380 | "... ... ... ... 
\n", 381 | "13344 Sweden SWE 2014 \n", 382 | "13345 Switzerland CHE 2014 \n", 383 | "13346 Syrian Arab Republic SYR 2014 \n", 384 | "13347 Tajikistan TJK 2014 \n", 385 | "13348 Tanzania TZA 2014 \n", 386 | "13349 Thailand THA 2014 \n", 387 | "13350 Timor-Leste TMP 2014 \n", 388 | "13351 Togo TGO 2014 \n", 389 | "13352 Tonga TON 2014 \n", 390 | "13353 Trinidad and Tobago TTO 2014 \n", 391 | "13354 Tunisia TUN 2014 \n", 392 | "13355 Turkey TUR 2014 \n", 393 | "13356 Turkmenistan TKM 2014 \n", 394 | "13357 Turks and Caicos Islands TCA 2014 \n", 395 | "13358 Tuvalu TUV 2014 \n", 396 | "13359 Uganda UGA 2014 \n", 397 | "13360 Ukraine UKR 2014 \n", 398 | "13361 United Arab Emirates ARE 2014 \n", 399 | "13362 United Kingdom GBR 2014 \n", 400 | "13363 United States USA 2014 \n", 401 | "13364 Uruguay URY 2014 \n", 402 | "13365 Uzbekistan UZB 2014 \n", 403 | "13366 Vanuatu VUT 2014 \n", 404 | "13367 Venezuela, RB VEN 2014 \n", 405 | "13368 Vietnam VNM 2014 \n", 406 | "13369 Virgin Islands (U.S.) VIR 2014 \n", 407 | "13370 West Bank and Gaza WBG 2014 \n", 408 | "13371 Yemen, Rep. 
YEM 2014 \n", 409 | "13372 Zambia ZMB 2014 \n", 410 | "13373 Zimbabwe ZWE 2014 \n", 411 | "\n", 412 | " Total Population Urban population (% of total) \n", 413 | "0 9.249590e+07 31.285384 \n", 414 | "1 4.190810e+06 31.597490 \n", 415 | "2 9.140158e+07 44.507921 \n", 416 | "3 1.042475e+09 22.471132 \n", 417 | "4 8.964930e+08 16.917679 \n", 418 | "5 2.653965e+08 62.096947 \n", 419 | "6 6.674890e+08 55.378977 \n", 420 | "7 1.553174e+08 38.066129 \n", 421 | "8 4.094985e+08 61.212898 \n", 422 | "9 1.203546e+08 17.891972 \n", 423 | "10 1.624912e+08 12.236046 \n", 424 | "11 9.075975e+08 62.680332 \n", 425 | "12 1.866767e+08 56.107863 \n", 426 | "13 7.209208e+08 64.285435 \n", 427 | "14 2.205642e+08 49.284688 \n", 428 | "15 1.776822e+08 44.863308 \n", 429 | "16 2.410728e+08 9.616261 \n", 430 | "17 2.127373e+09 21.272894 \n", 431 | "18 1.571884e+08 11.498396 \n", 432 | "19 9.429116e+08 19.810513 \n", 433 | "20 1.055126e+08 34.951334 \n", 434 | "21 9.786942e+07 33.875012 \n", 435 | "22 1.970185e+09 22.053114 \n", 436 | "23 1.986244e+08 69.918403 \n", 437 | "24 7.866482e+08 62.480915 \n", 438 | "25 6.590560e+06 14.337844 \n", 439 | "26 8.613780e+05 22.043762 \n", 440 | "27 1.164275e+07 21.120573 \n", 441 | "28 5.720361e+08 16.735545 \n", 442 | "29 2.282688e+08 14.631387 \n", 443 | "... ... ... 
\n", 444 | "13344 9.689555e+06 85.665000 \n", 445 | "13345 8.190229e+06 73.844000 \n", 446 | "13346 2.215780e+07 57.255000 \n", 447 | "13347 8.295840e+06 26.692000 \n", 448 | "13348 5.182262e+07 30.901000 \n", 449 | "13349 6.772598e+07 49.174000 \n", 450 | "13350 1.212107e+06 32.131000 \n", 451 | "13351 7.115163e+06 39.469000 \n", 452 | "13352 1.055860e+05 23.632000 \n", 453 | "13353 1.354483e+06 8.550000 \n", 454 | "13354 1.099660e+07 66.645000 \n", 455 | "13355 7.593235e+07 72.891000 \n", 456 | "13356 5.307188e+06 49.688000 \n", 457 | "13357 3.374000e+04 91.847000 \n", 458 | "13358 9.893000e+03 58.782000 \n", 459 | "13359 3.778297e+07 15.766000 \n", 460 | "13360 4.536290e+07 69.482000 \n", 461 | "13361 9.086139e+06 85.266000 \n", 462 | "13362 6.451038e+07 82.345000 \n", 463 | "13363 3.188571e+08 81.447000 \n", 464 | "13364 3.419516e+06 95.152000 \n", 465 | "13365 3.075770e+07 36.278000 \n", 466 | "13366 2.588830e+05 25.817000 \n", 467 | "13367 3.069383e+07 88.941000 \n", 468 | "13368 9.073000e+07 32.951000 \n", 469 | "13369 1.041700e+05 95.203000 \n", 470 | "13370 4.294682e+06 75.026000 \n", 471 | "13371 2.618368e+07 34.027000 \n", 472 | "13372 1.572134e+07 40.472000 \n", 473 | "13373 1.524586e+07 32.501000 \n", 474 | "\n", 475 | "[13374 rows x 5 columns]\n" 476 | ] 477 | } 478 | ], 479 | "source": [ 480 | "file = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1639/datasets/world_ind_pop_data.csv'\n", 481 | "df = pd.read_csv(file)\n", 482 | "\n", 483 | "print(df)" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "As you can see above, we used the built-in pandas function, `read_csv()` to input the file into Python. Just to confirm that the file was properly converted into a DataFrame, let's take a look at what the data type is." 
491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 18, 496 | "metadata": {}, 497 | "outputs": [ 498 | { 499 | "name": "stdout", 500 | "output_type": "stream", 501 | "text": [ 502 | "<class 'pandas.core.frame.DataFrame'>\n" 503 | ] 504 | } 505 | ], 506 | "source": [ 507 | "print(type(df))" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "Previously, we used the `type()` function to check the data type of our converted CSV file, but `pandas` actually provides users with an `info()` method to display a concise summary of a given DataFrame. Let's take a look at what `pandas` has to say about the World Bank dataset." 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 19, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "name": "stdout", 524 | "output_type": "stream", 525 | "text": [ 526 | "<class 'pandas.core.frame.DataFrame'>\n", 527 | "RangeIndex: 13374 entries, 0 to 13373\n", 528 | "Data columns (total 5 columns):\n", 529 | "CountryName 13374 non-null object\n", 530 | "CountryCode 13374 non-null object\n", 531 | "Year 13374 non-null int64\n", 532 | "Total Population 13374 non-null float64\n", 533 | "Urban population (% of total) 13374 non-null float64\n", 534 | "dtypes: float64(2), int64(1), object(2)\n", 535 | "memory usage: 522.5+ KB\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "df.info()" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "The very first line, as you can see, tells us what type it is -- awesome. Additionally, we have useful information involving the size of the data, the data types of each column, and more. This is a wonderful substitute for having to call different functions like `len()`, `type()`, etc., on our DataFrame." 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "Unless you opened up the link provided above, you've likely not seen what the actual data looks like.
To fix that, you can utilize another crucial `pandas` method called `head()`. By default, it prints the column names and the first five rows of the DataFrame you call the method on, but if you want to adjust the number of rows, you can pass it whatever number you'd like." 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 20, 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "name": "stdout", 564 | "output_type": "stream", 565 | "text": [ 566 | " CountryName CountryCode Year \\\n", 567 | "0 Arab World ARB 1960 \n", 568 | "1 Caribbean small states CSS 1960 \n", 569 | "2 Central Europe and the Baltics CEB 1960 \n", 570 | "3 East Asia & Pacific (all income levels) EAS 1960 \n", 571 | "4 East Asia & Pacific (developing only) EAP 1960 \n", 572 | "\n", 573 | " Total Population Urban population (% of total) \n", 574 | "0 9.249590e+07 31.285384 \n", 575 | "1 4.190810e+06 31.597490 \n", 576 | "2 9.140158e+07 44.507921 \n", 577 | "3 1.042475e+09 22.471132 \n", 578 | "4 8.964930e+08 16.917679 \n" 579 | ] 580 | } 581 | ], 582 | "source": [ 583 | "print(df.head())" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 21, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "data": { 593 | "text/html": [ 594 | "
\n", 595 | "\n", 608 | "\n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | "
CountryNameCountryCodeYearTotal PopulationUrban population (% of total)
0Arab WorldARB196092495902.031.285384
1Caribbean small statesCSS19604190810.031.597490
\n", 638 | "
" 639 | ], 640 | "text/plain": [ 641 | " CountryName CountryCode Year Total Population \\\n", 642 | "0 Arab World ARB 1960 92495902.0 \n", 643 | "1 Caribbean small states CSS 1960 4190810.0 \n", 644 | "\n", 645 | " Urban population (% of total) \n", 646 | "0 31.285384 \n", 647 | "1 31.597490 " 648 | ] 649 | }, 650 | "execution_count": 21, 651 | "metadata": {}, 652 | "output_type": "execute_result" 653 | } 654 | ], 655 | "source": [ 656 | "df.head(2)" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": {}, 662 | "source": [ 663 | "On the opposite end, you can also display the last 10 rows of a given DataFrame as well, using the `tail()` method." 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": 22, 669 | "metadata": {}, 670 | "outputs": [ 671 | { 672 | "data": { 673 | "text/html": [ 674 | "
\n", 675 | "\n", 688 | "\n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | "
CountryNameCountryCodeYearTotal PopulationUrban population (% of total)
13364UruguayURY20143419516.095.152
13365UzbekistanUZB201430757700.036.278
13366VanuatuVUT2014258883.025.817
13367Venezuela, RBVEN201430693827.088.941
13368VietnamVNM201490730000.032.951
13369Virgin Islands (U.S.)VIR2014104170.095.203
13370West Bank and GazaWBG20144294682.075.026
13371Yemen, Rep.YEM201426183676.034.027
13372ZambiaZMB201415721343.040.472
13373ZimbabweZWE201415245855.032.501
\n", 782 | "
" 783 | ], 784 | "text/plain": [ 785 | " CountryName CountryCode Year Total Population \\\n", 786 | "13364 Uruguay URY 2014 3419516.0 \n", 787 | "13365 Uzbekistan UZB 2014 30757700.0 \n", 788 | "13366 Vanuatu VUT 2014 258883.0 \n", 789 | "13367 Venezuela, RB VEN 2014 30693827.0 \n", 790 | "13368 Vietnam VNM 2014 90730000.0 \n", 791 | "13369 Virgin Islands (U.S.) VIR 2014 104170.0 \n", 792 | "13370 West Bank and Gaza WBG 2014 4294682.0 \n", 793 | "13371 Yemen, Rep. YEM 2014 26183676.0 \n", 794 | "13372 Zambia ZMB 2014 15721343.0 \n", 795 | "13373 Zimbabwe ZWE 2014 15245855.0 \n", 796 | "\n", 797 | " Urban population (% of total) \n", 798 | "13364 95.152 \n", 799 | "13365 36.278 \n", 800 | "13366 25.817 \n", 801 | "13367 88.941 \n", 802 | "13368 32.951 \n", 803 | "13369 95.203 \n", 804 | "13370 75.026 \n", 805 | "13371 34.027 \n", 806 | "13372 40.472 \n", 807 | "13373 32.501 " 808 | ] 809 | }, 810 | "execution_count": 22, 811 | "metadata": {}, 812 | "output_type": "execute_result" 813 | } 814 | ], 815 | "source": [ 816 | "df.tail(10)" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "Earlier in this tutorial, we selected a column from the DataFrame we built using a dictionary. The same can be done an **any** DataFrame, including those which began as a CSV file. With that said, we'll review an example by selecting the `Total Population` column from the World Bank dataset." 
824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": 23, 829 | "metadata": {}, 830 | "outputs": [ 831 | { 832 | "name": "stdout", 833 | "output_type": "stream", 834 | "text": [ 835 | "0 9.249590e+07\n", 836 | "1 4.190810e+06\n", 837 | "2 9.140158e+07\n", 838 | "3 1.042475e+09\n", 839 | "4 8.964930e+08\n", 840 | "Name: Total Population, dtype: float64\n" 841 | ] 842 | } 843 | ], 844 | "source": [ 845 | "print(df['Total Population'].head()) # called the head function so we don't have to view the entire column" 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "metadata": {}, 851 | "source": [ 852 | "Now that we've reviewed how to select columns from DataFrames, how do we go about selecting *rows* from a DataFrame? \n", 853 | "\n", 854 | "A simple way is to use the built-in `ix` indexer, which can take either a single label to select **one** row, or a range using slicing rules like those you've seen in regular Python lists. (Note that `ix` has since been deprecated in favor of `loc` for label-based indexing and `iloc` for positional indexing, as the warning in the output below points out.) \n", 855 | "\n", 856 | "Below we'll print rows 5 through 10 -- six rows, since label-based slicing includes both endpoints: " 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": 41, 862 | "metadata": {}, 863 | "outputs": [ 864 | { 865 | "name": "stdout", 866 | "output_type": "stream", 867 | "text": [ 868 | " CountryName CountryCode Year \\\n", 869 | "5 Euro area EMU 1960 \n", 870 | "6 Europe & Central Asia (all income levels) ECS 1960 \n", 871 | "7 Europe & Central Asia (developing only) ECA 1960 \n", 872 | "8 European Union EUU 1960 \n", 873 | "9 Fragile and conflict affected situations FCS 1960 \n", 874 | "10 Heavily indebted poor countries (HIPC) HPC 1960 \n", 875 | "\n", 876 | " Total Population Urban population (% of total) \n", 877 | "5 265396501.0 62.096947 \n", 878 | "6 667489033.0 55.378977 \n", 879 | "7 155317369.0 38.066129 \n", 880 | "8 409498462.0 61.212898 \n", 881 | "9 120354582.0 17.891972 \n", 882 | "10 162491185.0 12.236046 \n" 883 | ] 884 | }, 885 | { 886 | "name": "stderr", 887 | "output_type": "stream", 888 |
"text": [ 889 | "/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: DeprecationWarning: \n", 890 | ".ix is deprecated. Please use\n", 891 | ".loc for label based indexing or\n", 892 | ".iloc for positional indexing\n", 893 | "\n", 894 | "See the documentation here:\n", 895 | "http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated\n", 896 | " \"\"\"Entry point for launching an IPython kernel.\n" 897 | ] 898 | } 899 | ], 900 | "source": [ 901 | "print(df.ix[5:10])" 902 | ] 903 | }, 904 | { 905 | "cell_type": "markdown", 906 | "metadata": {}, 907 | "source": [ 908 | "### Apply\n", 909 | "\n", 910 | "Lets's generate a random dictionary:" 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": 45, 916 | "metadata": {}, 917 | "outputs": [ 918 | { 919 | "name": "stdout", 920 | "output_type": "stream", 921 | "text": [ 922 | " b d e\n", 923 | "Utah 1.824797 1.538739 1.277535\n", 924 | "Ohio -0.280938 0.933722 -0.450051\n", 925 | "Texas -0.015748 0.064026 0.158915\n", 926 | "Oregon -0.591756 0.595281 -0.143778\n" 927 | ] 928 | } 929 | ], 930 | "source": [ 931 | "import numpy as np\n", 932 | "frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])\n", 933 | "\n", 934 | "print(frame)" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "metadata": {}, 940 | "source": [ 941 | "With this, we can apply a function on a DataFrame:" 942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "execution_count": 46, 947 | "metadata": {}, 948 | "outputs": [ 949 | { 950 | "data": { 951 | "text/html": [ 952 | "
\n", 953 | "\n", 966 | "\n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | "
bde
Utah1.8247971.5387391.277535
Ohio0.2809380.9337220.450051
Texas0.0157480.0640260.158915
Oregon0.5917560.5952810.143778
\n", 1002 | "
" 1003 | ], 1004 | "text/plain": [ 1005 | " b d e\n", 1006 | "Utah 1.824797 1.538739 1.277535\n", 1007 | "Ohio 0.280938 0.933722 0.450051\n", 1008 | "Texas 0.015748 0.064026 0.158915\n", 1009 | "Oregon 0.591756 0.595281 0.143778" 1010 | ] 1011 | }, 1012 | "execution_count": 46, 1013 | "metadata": {}, 1014 | "output_type": "execute_result" 1015 | } 1016 | ], 1017 | "source": [ 1018 | "np.abs(frame)" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "markdown", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "We can also apply functions with the `apply()` method:" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": 47, 1031 | "metadata": {}, 1032 | "outputs": [ 1033 | { 1034 | "name": "stdout", 1035 | "output_type": "stream", 1036 | "text": [ 1037 | "b 2.416554\n", 1038 | "d 1.474713\n", 1039 | "e 1.727587\n", 1040 | "dtype: float64\n" 1041 | ] 1042 | } 1043 | ], 1044 | "source": [ 1045 | "f = lambda x: x.max() - x.min()\n", 1046 | "\n", 1047 | "print(frame.apply(f))" 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "code", 1052 | "execution_count": 28, 1053 | "metadata": {}, 1054 | "outputs": [ 1055 | { 1056 | "name": "stdout", 1057 | "output_type": "stream", 1058 | "text": [ 1059 | " b d e\n", 1060 | "Utah 0.866926 -0.104468 -1.201680\n", 1061 | "Ohio -0.031610 0.008895 0.100290\n", 1062 | "Texas -0.342475 0.341394 -2.020873\n", 1063 | "Oregon 0.666301 0.526818 0.749407\n" 1064 | ] 1065 | } 1066 | ], 1067 | "source": [ 1068 | "f = lambda x: np.abs(x)\n", 1069 | "\n", 1070 | "frame.apply(f)\n", 1071 | "\n", 1072 | "print(frame)" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "markdown", 1077 | "metadata": {}, 1078 | "source": [ 1079 | "## Challenge Question \n", 1080 | "\n", 1081 | "Does the `pandas.DataFrame.apply()` function modify the original DataFrame or return a new DataFrame?" 
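If you'd like to check your answer to the challenge empirically, here is a minimal sketch. It rebuilds a small random `frame` like the one above (the exact values will differ, since they are random) so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Rebuild a small random DataFrame like the `frame` used above.
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

before = frame.copy()
result = frame.apply(np.abs)

print(result is frame)            # False: apply() returns a new object
print(frame.equals(before))       # True: the original DataFrame is unchanged
print((result >= 0).all().all())  # True: the returned copy holds the absolute values
```

The same holds for calling `np.abs(frame)` directly: both return a new DataFrame and leave `frame` alone.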
1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "metadata": {}, 1087 | "source": [ 1088 | "#### Sorting\n", 1089 | "\n", 1090 | "To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "code", 1095 | "execution_count": 29, 1096 | "metadata": {}, 1097 | "outputs": [ 1098 | { 1099 | "data": { 1100 | "text/html": [ 1101 | "
\n", 1102 | "\n", 1115 | "\n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | "
bde
Ohio-0.0316100.0088950.100290
Oregon0.6663010.5268180.749407
Texas-0.3424750.341394-2.020873
Utah0.866926-0.104468-1.201680
\n", 1151 | "
" 1152 | ], 1153 | "text/plain": [ 1154 | " b d e\n", 1155 | "Ohio -0.031610 0.008895 0.100290\n", 1156 | "Oregon 0.666301 0.526818 0.749407\n", 1157 | "Texas -0.342475 0.341394 -2.020873\n", 1158 | "Utah 0.866926 -0.104468 -1.201680" 1159 | ] 1160 | }, 1161 | "execution_count": 29, 1162 | "metadata": {}, 1163 | "output_type": "execute_result" 1164 | } 1165 | ], 1166 | "source": [ 1167 | "frame.sort_index()" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "markdown", 1172 | "metadata": {}, 1173 | "source": [ 1174 | "## Data Merging\n", 1175 | "\n", 1176 | "If you encounter two different datasets that contain the same type of information, you might consider merging them for your analyses. This is yet another functionality built into `pandas`. \n", 1177 | "\n", 1178 | "Let's go through an example containing student data. `d1` contains 5 of the samples and `d2` contains 2 of them: " 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "code", 1183 | "execution_count": 30, 1184 | "metadata": {}, 1185 | "outputs": [ 1186 | { 1187 | "name": "stdout", 1188 | "output_type": "stream", 1189 | "text": [ 1190 | " First Name Last Name\n", 1191 | "0 Lesley Cordero\n", 1192 | "1 Ojas Sathe\n", 1193 | "2 Helen Chen\n", 1194 | "3 Eli Epperson\n", 1195 | "4 Jacob Greenberg\n" 1196 | ] 1197 | } 1198 | ], 1199 | "source": [ 1200 | "d1 = pd.read_csv(\"./names_original.csv\")\n", 1201 | "print(d1)" 1202 | ] 1203 | }, 1204 | { 1205 | "cell_type": "code", 1206 | "execution_count": 31, 1207 | "metadata": {}, 1208 | "outputs": [ 1209 | { 1210 | "name": "stdout", 1211 | "output_type": "stream", 1212 | "text": [ 1213 | "\n" 1214 | ] 1215 | } 1216 | ], 1217 | "source": [ 1218 | "print(type(d1))" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": 32, 1224 | "metadata": {}, 1225 | "outputs": [ 1226 | { 1227 | "name": "stdout", 1228 | "output_type": "stream", 1229 | "text": [ 1230 | " First Name Last Name\n", 1231 | "0 Martin Perez\n", 1232 | "1 Menna Elsayed\n" 1233 | ] 
1234 | } 1235 | ], 1236 | "source": [ 1237 | "d2 = pd.read_csv(\"./names_add.csv\")\n", 1238 | "print(d2)" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "markdown", 1243 | "metadata": {}, 1244 | "source": [ 1245 | "### Concatenation \n", 1246 | "\n", 1247 | "Instead of working with two separate datasets, it's often easier to combine them into a single DataFrame, which we can do with the `concat()` function:\n" 1248 | ] 1249 | }, 1250 | { 1251 | "cell_type": "code", 1252 | "execution_count": 33, 1253 | "metadata": {}, 1254 | "outputs": [ 1255 | { 1256 | "name": "stdout", 1257 | "output_type": "stream", 1258 | "text": [ 1259 | " First Name Last Name\n", 1260 | "0 Lesley Cordero\n", 1261 | "1 Ojas Sathe\n", 1262 | "2 Helen Chen\n", 1263 | "3 Eli Epperson\n", 1264 | "4 Jacob Greenberg\n", 1265 | "0 Martin Perez\n", 1266 | "1 Menna Elsayed\n" 1267 | ] 1268 | } 1269 | ], 1270 | "source": [ 1271 | "result = pd.concat([d1,d2])\n", 1272 | "print(result)" 1273 | ] 1274 | }, 1275 | { 1276 | "cell_type": "markdown", 1277 | "metadata": {}, 1278 | "source": [ 1279 | "Notice that the row labels 0 and 1 appear twice: `concat()` keeps each DataFrame's original index (pass `ignore_index=True` if you'd rather renumber the rows). Now, you might be asking what happens if one of the datasets has more columns than the other: will they still combine? 
Let's try this example with another dataset:" 1280 | ] 1281 | }, 1282 | { 1283 | "cell_type": "code", 1284 | "execution_count": 34, 1285 | "metadata": {}, 1286 | "outputs": [ 1287 | { 1288 | "name": "stdout", 1289 | "output_type": "stream", 1290 | "text": [ 1291 | " First Name Last Name Major\n", 1292 | "0 Martin Perez Mechanical Engineering\n", 1293 | "1 Menna Elsayed Sociology\n" 1294 | ] 1295 | } 1296 | ], 1297 | "source": [ 1298 | "d3 = pd.read_csv(\"./names_extra.csv\")\n", 1299 | "print(d3)" 1300 | ] 1301 | }, 1302 | { 1303 | "cell_type": "markdown", 1304 | "metadata": {}, 1305 | "source": [ 1306 | "If we use the same `concat()` function, we get:" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "code", 1311 | "execution_count": 35, 1312 | "metadata": {}, 1313 | "outputs": [ 1314 | { 1315 | "name": "stdout", 1316 | "output_type": "stream", 1317 | "text": [ 1318 | " First Name Last Name Major\n", 1319 | "0 Lesley Cordero NaN\n", 1320 | "1 Ojas Sathe NaN\n", 1321 | "2 Helen Chen NaN\n", 1322 | "3 Eli Epperson NaN\n", 1323 | "4 Jacob Greenberg NaN\n", 1324 | "0 Martin Perez Mechanical Engineering\n", 1325 | "1 Menna Elsayed Sociology\n" 1326 | ] 1327 | } 1328 | ], 1329 | "source": [ 1330 | "result1 = pd.concat([d1, d3])\n", 1331 | "print(result1)" 1332 | ] 1333 | }, 1334 | { 1335 | "cell_type": "markdown", 1336 | "metadata": {}, 1337 | "source": [ 1338 | "Notice the `NaN` values: `NaN` is how `pandas` marks missing data. Wherever a row has no entry for a column, `pandas` fills in `NaN` rather than failing." 1339 | ] 1340 | }, 1341 | { 1342 | "cell_type": "markdown", 1343 | "metadata": {}, 1344 | "source": [ 1345 | "## Challenge Question\n", 1346 | "\n", 1347 | "Does the `pandas.concat()` function modify the original DataFrames or return a new DataFrame?" 
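One note before you answer: `concat()` is a top-level pandas function, called as `pd.concat([...])`, not a method on a DataFrame. To check the challenge empirically, here is a minimal sketch using small inline stand-ins for the CSV data so it runs on its own:

```python
import pandas as pd

# Small inline stand-ins for names_original.csv and names_extra.csv.
d1 = pd.DataFrame({'First Name': ['Lesley', 'Ojas'],
                   'Last Name': ['Cordero', 'Sathe']})
d3 = pd.DataFrame({'First Name': ['Martin'],
                   'Last Name': ['Perez'],
                   'Major': ['Mechanical Engineering']})

result = pd.concat([d1, d3])

print(d1.shape)                      # (2, 2): d1 is untouched
print(result.shape)                  # (3, 3): a new DataFrame with the union of columns
print(result['Major'].isna().sum())  # 2: rows from d1 get NaN for the missing Major
```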
1348 | ] 1349 | }, 1350 | { 1351 | "cell_type": "markdown", 1352 | "metadata": {}, 1353 | "source": [ 1354 | "### Merging\n", 1355 | "\n", 1356 | "Now, how do we combine two datasets that hold different information but share a key column (here, `Dorm`)? Let's take a look at our datasets:" 1357 | ] 1358 | }, 1359 | { 1360 | "cell_type": "code", 1361 | "execution_count": 36, 1362 | "metadata": {}, 1363 | "outputs": [ 1364 | { 1365 | "name": "stdout", 1366 | "output_type": "stream", 1367 | "text": [ 1368 | " Dorm Name\n", 1369 | "0 East Campus Helen Chen\n", 1370 | "1 Broadway Danielle Jing\n", 1371 | "2 Shapiro Craig Rhodes\n", 1372 | "3 Watt Lesley Cordero\n", 1373 | "4 East Campus Martin Perez\n", 1374 | "5 Broadway Menna Elsayed\n", 1375 | "6 Wallach Will Essilfie\n" 1376 | ] 1377 | } 1378 | ], 1379 | "source": [ 1380 | "h1 = pd.read_csv(\"./housing.csv\")\n", 1381 | "print(h1)" 1382 | ] 1383 | }, 1384 | { 1385 | "cell_type": "code", 1386 | "execution_count": 37, 1387 | "metadata": {}, 1388 | "outputs": [ 1389 | { 1390 | "name": "stdout", 1391 | "output_type": "stream", 1392 | "text": [ 1393 | " Dorm Street Cost\n", 1394 | "0 Broadway 114th 9000\n", 1395 | "1 Shapiro 115th 9500\n", 1396 | "2 Watt 113th 10500\n", 1397 | "3 East Campus 116th 11,000\n", 1398 | "4 Wallach 114th 9500\n" 1399 | ] 1400 | } 1401 | ], 1402 | "source": [ 1403 | "h2 = pd.read_csv(\"./dorms.csv\")\n", 1404 | "print(h2)" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "markdown", 1409 | "metadata": {}, 1410 | "source": [ 1411 | "With the `merge()` function in pandas, we can specify which column to merge on and what kind of join to perform. 
By default, `merge()` performs an 'inner' join; here we request a left join instead, which keeps every row of `h1`:" 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "code", 1416 | "execution_count": 38, 1417 | "metadata": {}, 1418 | "outputs": [ 1419 | { 1420 | "name": "stdout", 1421 | "output_type": "stream", 1422 | "text": [ 1423 | " Dorm Name Street Cost\n", 1424 | "0 East Campus Helen Chen 116th 11,000\n", 1425 | "1 Broadway Danielle Jing 114th 9000\n", 1426 | "2 Shapiro Craig Rhodes 115th 9500\n", 1427 | "3 Watt Lesley Cordero 113th 10500\n", 1428 | "4 East Campus Martin Perez 116th 11,000\n", 1429 | "5 Broadway Menna Elsayed 114th 9000\n", 1430 | "6 Wallach Will Essilfie 114th 9500\n" 1431 | ] 1432 | } 1433 | ], 1434 | "source": [ 1435 | "house = pd.merge(h1, h2, on=\"Dorm\", how=\"left\")\n", 1436 | "print(house)" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "markdown", 1441 | "metadata": {}, 1442 | "source": [ 1443 | "## Challenge Question\n", 1444 | "\n", 1445 | "Does the `pandas.merge()` function modify the original DataFrames or return a new DataFrame?" 1446 | ] 1447 | }, 1448 | { 1449 | "cell_type": "markdown", 1450 | "metadata": { 1451 | "collapsed": true 1452 | }, 1453 | "source": [ 1454 | "## Resources\n", 1455 | "\n", 1456 | "[Pandas Learning Resources](https://chatbotslife.com/pandas-learning-resources-946540ba574e)
\n", 1457 | "[GeoPandas Tutorial](https://www.twilio.com/blog/2017/08/geospatial-analysis-python-geojson-geopandas.html)" 1458 | ] 1459 | }, 1460 | { 1461 | "cell_type": "code", 1462 | "execution_count": null, 1463 | "metadata": {}, 1464 | "outputs": [], 1465 | "source": [] 1466 | } 1467 | ], 1468 | "metadata": { 1469 | "anaconda-cloud": {}, 1470 | "kernelspec": { 1471 | "display_name": "Python 3", 1472 | "language": "python", 1473 | "name": "python3" 1474 | }, 1475 | "language_info": { 1476 | "codemirror_mode": { 1477 | "name": "ipython", 1478 | "version": 3 1479 | }, 1480 | "file_extension": ".py", 1481 | "mimetype": "text/x-python", 1482 | "name": "python", 1483 | "nbconvert_exporter": "python", 1484 | "pygments_lexer": "ipython3", 1485 | "version": "3.6.2" 1486 | } 1487 | }, 1488 | "nbformat": 4, 1489 | "nbformat_minor": 2 1490 | } 1491 | --------------------------------------------------------------------------------