
├── .gitignore
├── 00-Introduction.ipynb
├── 01-Python.ipynb
├── 02-JupyterNotebooks.ipynb
├── 03-DataAnalysis.ipynb
├── 04-ScientificPython.ipynb
├── 05-DataGathering.ipynb
├── 06-DataWrangling.ipynb
├── 07-DataCleaning.ipynb
├── 08-DataPrivacy&Anonymization.ipynb
├── 09-DataVisualization.ipynb
├── 10-Distributions.ipynb
├── 11-TestingDistributions.ipynb
├── 12-StatisticalComparisons.ipynb
├── 13-OrdinaryLeastSquares.ipynb
├── 14-LinearModels.ipynb
├── 15-Clustering.ipynb
├── 16-DimensionalityReduction.ipynb
├── 17-Classification.ipynb
├── 18-NaturalLanguageProcessing.ipynb
├── A1-PythonPackages.ipynb
├── A2-Git.ipynb
├── LICENSE.txt
├── README.md
├── files
│   ├── ZillowNeighborhoods-CA.dbf
│   ├── ZillowNeighborhoods-CA.prj
│   ├── ZillowNeighborhoods-CA.shp
│   ├── ZillowNeighborhoods-CA.shx
│   ├── ZillowNeighborhoods-RI.dbf
│   ├── ZillowNeighborhoods-RI.prj
│   ├── ZillowNeighborhoods-RI.shp
│   ├── ZillowNeighborhoods-RI.shx
│   ├── book10k.txt
│   ├── data.csv
│   ├── data.json
│   ├── data.txt
│   ├── messy_data.csv
│   └── messy_data.json
└── img
    ├── anaconda.png
    ├── git.png
    ├── github.png
    ├── jupyter.png
    ├── matplotlib.png
    ├── numpy.png
    ├── pandas.png
    ├── python.png
    ├── scipy.png
    ├── sklearn.png
    └── sourcetree.png

/.gitignore: -------------------------------------------------------------------------------- 1 | # Ignore ipynb checkpoint files 2 | *ipynb_checkpoints/* 3 | # Ignore Mac Folder Attribute files 4 | *DS_Store* 5 | -------------------------------------------------------------------------------- /00-Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true, 7 | "nbpresent": { 8 | "id": "7fc0cefe-8b1c-4ca9-aa39-094614969842" 9 | } 10 | }, 11 | "source": [ 12 | "# Introduction\n", 13 | "\n", 14 | "Welcome to the hands-on materials for Data Science in Practice.\n", 15 | "\n", 16 | "This notebook will guide you through getting the tools you will need for working with 
these tutorials and assignments." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## Alerts" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "Throughout these tutorials, you will see colored 'alert' text:\n", 31 | "\n", 32 | "

\n", 33 | "Green alerts provide key information and definitions.\n", 34 | "

\n", 35 | "\n", 36 | "

\n", 37 | "Blue alerts provide links out to further \n", 38 | "resources. \n", 39 | "

" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": { 45 | "nbpresent": { 46 | "id": "b6153143-e694-4e86-96b0-243f56bad8d5" 47 | } 48 | }, 49 | "source": [ 50 | "## What do you need for these tutorials?\n", 51 | "\n", 52 | "### Software\n", 53 | "\n", 54 | "- Working install of Python (>= 3.6), with the Anaconda distribution\n", 55 | " - If you are in the official class, [datahub](http://datahub.ucsd.edu) satisfies this requirement\n", 56 | "- Jupyter Notebooks\n", 57 | " - Also satisfied by [datahub](http://datahub.ucsd.edu)\n", 58 | "- git and a GitHub account" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Prerequisites\n", 66 | "\n", 67 | "These tutorials presume that you already have some basic knowledge of programming. \n", 68 | "\n", 69 | "In particular, they assume knowledge of the Python programming language and standard library. \n", 70 | "\n", 71 | "If you are somewhat unfamiliar with Python, you can follow the links in the Python notebook to catch up." 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Computational Resources\n", 79 | "\n", 80 | "The examples throughout these tutorials and in the assignments are not computationally heavy. \n", 81 | "\n", 82 | "You should be able to run all these materials on any computer you have access to, assuming it can run the tools listed above. " 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### Installing Python\n", 90 | "\n", 91 | "- If you are running code locally, we recommend you install a new version of Python with Anaconda, as described below\n", 92 | " - If you are in the official course, you can use [datahub](http://datahub.ucsd.edu) for everything you need\n", 93 | "- If you are on a Mac, you have a native installation of Python. 
This native installation of Python may be older, will not include the extra packages that you will need for this class, and is best left untouched. \n", 94 | " - Installing Anaconda sets up a separate, independent installation of Python, leaving your native install untouched. \n", 95 | "- Windows does not rely on Python itself, so Python is not typically pre-installed." 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "## Tools\n", 103 | "\n", 104 | "The following is a series of tools that you will need for this class." 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "
\n", 112 | "
\n", 113 | "

\n", 114 | "
\n", 115 | "
" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "

\n", 123 | "Anaconda is an open-source distribution of Python, designed for scientific computing, data science and machine learning. \n", 124 | "

\n", 125 | "\n", 126 | "

" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "Anaconda itself is a distribution, meaning that it is a version of Python with a collection of packages that are curated and maintained together. \n", 139 | "\n", 140 | "Using a pre-built distribution is useful, as it comes with the packages that you need for data science.\n", 141 | "\n", 142 | "Anaconda also comes with `conda`, which is a package manager, allowing you to download, install, and manage other packages. \n", 143 | "\n", 144 | "The Anaconda distribution includes all packages that are needed for these tutorials." 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "
\n", 152 | "
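One way to confirm that the packages are actually available is to ask Python whether it can locate them. The following is a minimal sketch and not part of the original materials: the package list is an assumption based on the logos in this repository's `img` folder, and `check_packages` is a hypothetical helper name.

```python
import importlib.util

# Assumed package list, based on the logos in this repository's img folder
PACKAGES = ['numpy', 'pandas', 'scipy', 'sklearn', 'matplotlib']

def check_packages(names):
    # Map each top-level package name to True if Python can locate it
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, found in check_packages(PACKAGES).items():
    print(name, 'available' if found else 'MISSING')
```

Any package reported as missing can then be installed with `conda`, the package manager described above.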
\n", 153 | "

\n", 154 | "
\n", 155 | "
" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": { 161 | "nbpresent": { 162 | "id": "0f4dd046-4020-465c-85f6-3d92ac9fe145" 163 | } 164 | }, 165 | "source": [ 166 | "

\n", 167 | "Jupyter notebooks are a way to intermix code, outputs and plain text. \n", 168 | "They run in a web browser, and connect to a kernel to be able to execute code. \n", 169 | "

\n", 170 | "\n", 171 | "

\n", 172 | "The official Jupyter website is available \n", 173 | "here.\n", 174 | "

" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "Note that you do not need to download Jupyter separately, as it comes packaged with the Anaconda distribution." 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "#### Checking Your Python Version\n", 189 | "\n", 190 | "You can check which installation of Python you are using, and which version it is.\n", 191 | "\n", 192 | "Once you have installed Anaconda, you should see that the Python you are using lives in an anaconda folder. \n", 193 | "\n", 194 | "The version number that is printed should also be 3.6 or greater. " 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 1, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "/opt/anaconda3/bin/python\n", 207 | "Python 3.7.4\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "# Check the installed version of Python\n", 213 | "# Note: these are shell commands that may not work on Windows\n", 214 | "!which python\n", 215 | "!python --version" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "
\n", 223 | "
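Because `which` is a shell command that may be unavailable on Windows, the same information can also be read from inside Python. This is a minimal sketch rather than part of the original notebook; `python_info` is a hypothetical helper name, and the 3.6 threshold is the minimum version these tutorials ask for.

```python
import sys

def python_info():
    # Return the interpreter path and whether the version is at least 3.6
    return sys.executable, sys.version_info >= (3, 6)

path, recent_enough = python_info()

# With an Anaconda install, this path would be expected to sit in an anaconda folder
print(path)

# The version of the interpreter that is running this code
print(sys.version.split()[0])

print('OK' if recent_enough else 'Python 3.6 or greater is required')
```

This runs the same way on Mac, Linux, and Windows, since it relies only on the standard library's `sys` module.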
\n", 224 | "

\n", 225 | "
\n", 226 | "
" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": { 232 | "nbpresent": { 233 | "id": "6576af9c-b0f3-4cbe-9a02-06feaa61d0b0" 234 | } 235 | }, 236 | "source": [ 237 | "

\n", 238 | "Git is a tool, a software package, for version control. \n", 239 | "

\n", 240 | "\n", 241 | "

\n", 242 | "Install \n", 243 | "git,\n", 244 | "if you don't already have it.\n", 245 | "

" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "
\n", 253 | "
\n", 254 | "

\n", 255 | "
\n", 256 | "
" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "

\n", 264 | "GitHub is an online hosting service that can be used with git, and offers online tools for working with git. \n", 265 | "

\n", 266 | "\n", 267 | "

\n", 268 | "Create an account on \n", 269 | "GitHub.\n", 270 | "

" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "Git & GitHub are not the same thing, though in practice they are commonly used together: git is the tool that version-controls your code and manages the copies stored on your computer, while GitHub hosts remote repositories of that code.\n", 278 | "\n", 279 | "Note that while GitHub is a private company, git is an open-source tool, and can be used independently of GitHub." 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 2, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "name": "stdout", 289 | "output_type": "stream", 290 | "text": [ 291 | "git version 2.20.1 (Apple Git-117)\r\n" 292 | ] 293 | } 294 | ], 295 | "source": [ 296 | "# Check that you have git installed (which version doesn't really matter)\n", 297 | "!git --version" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "
\n", 305 | "
\n", 306 | "

\n", 307 | "
\n", 308 | "
" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "

\n", 316 | "SourceTree is a free graphical user interface (GUI) for managing repositories with git & GitHub. \n", 317 | "

\n", 318 | "\n", 319 | "

\n", 326 | "\n", 327 | "You don't need to use SourceTree (or any other GUI) if you know, or want to learn, how to use git from the command line." 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "## Environments" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "

\n", 342 | "Environments are isolated, independent installations of a programming language and groups of packages, that don't interfere with each other. \n", 343 | "

\n", 344 | "\n", 345 | "

\n", 346 | "Anaconda has detailed instructions on using environments available \n", 347 | "here.\n", 348 | "

" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "You do not need to use environments; however, you may find them useful if you want or need to maintain multiple versions of Python. \n", 356 | "\n", 357 | "If you want to use an environment, and already have conda, you can run this command from the command line:
\n", 358 | "\n", 359 | "``$ conda create --name *envname* python=3.7 anaconda``
\n", 360 | "\n", 361 | "^ Replace '*envname*' with a name to call this environment.
\n", 362 | "\n", 363 | "This will create a new environment, with Python 3.7 and the Anaconda distribution.\n", 364 | "\n", 365 | "You will then need to activate this environment every time you want to use it. \n", 366 | "\n", 367 | "To activate your environment:
\n", 368 | "``$ conda activate *envname*``\n", 369 | "\n", 370 | "To deactivate your environment:
\n", 371 | "``$ conda deactivate``" 372 | ] 373 | } 374 | ], 375 | "metadata": { 376 | "anaconda-cloud": {}, 377 | "kernelspec": { 378 | "display_name": "Python 3", 379 | "language": "python", 380 | "name": "python3" 381 | }, 382 | "language_info": { 383 | "codemirror_mode": { 384 | "name": "ipython", 385 | "version": 3 386 | }, 387 | "file_extension": ".py", 388 | "mimetype": "text/x-python", 389 | "name": "python", 390 | "nbconvert_exporter": "python", 391 | "pygments_lexer": "ipython3", 392 | "version": "3.7.4" 393 | }, 394 | "nbpresent": { 395 | "slides": { 396 | "3d09dc46-88c8-44cb-bc57-259db78a0e70": { 397 | "id": "3d09dc46-88c8-44cb-bc57-259db78a0e70", 398 | "prev": "8d1b5def-2290-42c9-8b06-1c6e0e495521", 399 | "regions": { 400 | "4601423d-c94d-46da-885b-fe33b0216c22": { 401 | "attrs": { 402 | "height": 1, 403 | "width": 1, 404 | "x": 0, 405 | "y": 0 406 | }, 407 | "content": { 408 | "cell": "0f4dd046-4020-465c-85f6-3d92ac9fe145", 409 | "part": "whole" 410 | }, 411 | "id": "4601423d-c94d-46da-885b-fe33b0216c22" 412 | } 413 | } 414 | }, 415 | "8d1b5def-2290-42c9-8b06-1c6e0e495521": { 416 | "id": "8d1b5def-2290-42c9-8b06-1c6e0e495521", 417 | "prev": "bc666852-d015-42a1-b679-eaf92d5eb643", 418 | "regions": { 419 | "d0118c2f-7757-4efa-a276-96f162d312ae": { 420 | "attrs": { 421 | "height": 1, 422 | "width": 1, 423 | "x": 0, 424 | "y": 0 425 | }, 426 | "content": { 427 | "cell": "d9d878d6-230b-4f1e-b2aa-2f152cb3fe8e", 428 | "part": "whole" 429 | }, 430 | "id": "d0118c2f-7757-4efa-a276-96f162d312ae" 431 | } 432 | } 433 | }, 434 | "b039dd05-8357-462a-9525-7f8103de436c": { 435 | "id": "b039dd05-8357-462a-9525-7f8103de436c", 436 | "prev": "3d09dc46-88c8-44cb-bc57-259db78a0e70", 437 | "regions": { 438 | "9180ab3f-f784-45a2-b2cc-a18aad800fc5": { 439 | "attrs": { 440 | "height": 1, 441 | "width": 1, 442 | "x": 0, 443 | "y": 0 444 | }, 445 | "content": { 446 | "cell": "b57ed03a-8c01-4e48-95e8-9c6753e35088", 447 | "part": "whole" 448 | }, 449 | "id": 
"9180ab3f-f784-45a2-b2cc-a18aad800fc5" 450 | } 451 | } 452 | }, 453 | "bc666852-d015-42a1-b679-eaf92d5eb643": { 454 | "id": "bc666852-d015-42a1-b679-eaf92d5eb643", 455 | "layout": "grid", 456 | "prev": null, 457 | "regions": { 458 | "31cd776f-cc93-49d6-a40c-c590805cfb8f": { 459 | "attrs": { 460 | "height": 0.8333333333333334, 461 | "pad": 0.01, 462 | "width": 0.8333333333333334, 463 | "x": 0.08333333333333333, 464 | "y": 0.08333333333333333 465 | }, 466 | "content": { 467 | "cell": "7fc0cefe-8b1c-4ca9-aa39-094614969842", 468 | "part": "whole" 469 | }, 470 | "id": "31cd776f-cc93-49d6-a40c-c590805cfb8f" 471 | }, 472 | "e1612c29-0f61-4692-9d6e-112e8d378e46": { 473 | "attrs": { 474 | "height": 0.8333333333333334, 475 | "pad": 0.01, 476 | "width": 0.8333333333333334, 477 | "x": 0.08333333333333333, 478 | "y": 0.08333333333333333 479 | }, 480 | "content": { 481 | "cell": "7fc0cefe-8b1c-4ca9-aa39-094614969842", 482 | "part": "whole" 483 | }, 484 | "id": "e1612c29-0f61-4692-9d6e-112e8d378e46" 485 | } 486 | } 487 | } 488 | }, 489 | "themes": {} 490 | } 491 | }, 492 | "nbformat": 4, 493 | "nbformat_minor": 1 494 | } 495 | -------------------------------------------------------------------------------- /01-Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": true 14 | }, 15 | "source": [ 16 | "
\n", 17 | "
\n", 18 | "

\n", 19 | "
\n", 20 | "
" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "source": [ 29 | "

\n", 38 | "\n", 39 | "

\n", 40 | "The official Python\n", 41 | "website.\n", 42 | "

" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Why Python\n", 50 | "\n", 51 | "- As a general purpose language, Python supports a large range of tasks.\n", 52 | " - Or put another way: 'Python isn't the best at anything, but it's second best at everything'\n", 53 | " - This is useful. A data science project may include everything from scraping data from the web, analyzing a mixture of text and numerical data, computing features, training a model, creating high-quality graphs, and then hosting a website with your results. \n", 54 | "- Python is, explicitly and by design, user-friendly.\n", 55 | "- Python also has a massive user community, who contribute to a large number of high-quality, well-maintained open-source tools.\n", 56 | " - The best language for your project is one which has the things you need.\n", 57 | "- In part for the reasons listed above, Python is heavily used in industry.\n" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "

\n", 65 | "The Python programming language is developed and maintained by the\n", 66 | "Python Software Foundation.\n", 67 | "

" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Python Versions\n", 75 | "\n", 76 | "This class uses Python3, the currently developed version of Python, and more specifically Python version 3.6 or above. \n", 77 | "\n", 78 | "Python2 has reached \"End of Life\", meaning it is no longer supported or maintained by the Python Organization. " 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## Python Resources\n", 86 | "\n", 87 | "These materials presume prior knowledge of the Python programming language. \n", 88 | "\n", 89 | "If you are not yet familiar, here are some entry-level materials for learning Python:\n", 90 | "\n", 91 | "- [Codecademy](https://www.codecademy.com/tracks/python) is good for a beginner's introduction to the language.\n", 92 | "- [The Official Beginners Guide](https://wiki.python.org/moin/BeginnersGuide) is supported by the Python organization.\n", 93 | "- [Whirlwind Tour of Python](https://github.com/jakevdp/WhirlwindTourOfPython) is a free collection of Jupyter notebooks that takes you through Python. \n", 94 | " - This book is specifically designed for, and especially good for, readers who have some programming experience in another language and want to quickly run through the specifics of Python." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "

\n", 102 | "A much broader list of resources and guides for learning Python is available \n", 103 | "here.\n", 104 | "

" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## Getting Un-Stuck\n", 112 | "\n", 113 | "At some point, you will get stuck. It happens. The internet is your friend. \n", 114 | "\n", 115 | "If you get an error, or aren't sure how to proceed, use {your favourite search engine} with specific search terms relating to what you are trying to do. Sometimes this just means searching for the error message that you got.\n", 116 | "\n", 117 | "You are likely to find responses on [StackOverflow](https://stackoverflow.com), which is basically a forum for programming questions and a good place to find answers. " 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Standard Library" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "

\n", 132 | "The Standard Library refers to everything that is part of the standard version and install of Python.\n", 133 | "

" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "

\n", 141 | "The Python \n", 142 | "Standard Library\n", 143 | "comes with a lot of basic functionality. \n", 144 | "

" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Part of what makes Python a powerful language is the standard library itself, which is a rich set of tools for programming. However, the standard library itself does not include data science tools, and a lot of the power of Python stems from a rich ecosystem of packages that can be added and used with Python. " 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "## Packages" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "

\n", 166 | "Packages are collections of code. Packages from outside the standard library can be installed and added to Python.\n", 167 | "

\n", 168 | "\n", 169 | "

\n", 170 | "For managing and installing packages, Anaconda comes with the \n", 171 | "conda\n", 172 | "package manager.\n", 173 | "

" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "## Scientific Python\n", 181 | "\n", 182 | "When we say that Python is good for data science, and scientific computing, what we really mean is that there is a rich ecosystem of available open-source external packages, that greatly expand the capacities of the language beyond the standard library. \n", 183 | "\n", 184 | "This set of packages, which we will introduce as we go through these materials, is sometimes referred to as 'Scientific Python', or the 'Scipy' ecosystem. \n", 185 | "\n", 186 | "For the purposes of these materials, the Anaconda distribution that we are using contains all the packages you need. " 187 | ] 188 | } 189 | ], 190 | "metadata": { 191 | "anaconda-cloud": {}, 192 | "kernelspec": { 193 | "display_name": "Python 3", 194 | "language": "python", 195 | "name": "python3" 196 | }, 197 | "language_info": { 198 | "codemirror_mode": { 199 | "name": "ipython", 200 | "version": 3 201 | }, 202 | "file_extension": ".py", 203 | "mimetype": "text/x-python", 204 | "name": "python", 205 | "nbconvert_exporter": "python", 206 | "pygments_lexer": "ipython3", 207 | "version": "3.7.4" 208 | } 209 | }, 210 | "nbformat": 4, 211 | "nbformat_minor": 1 212 | } 213 | -------------------------------------------------------------------------------- /02-JupyterNotebooks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Jupyter Notebooks\n", 8 | "\n", 9 | "
\n", 10 | "
\n", 11 | "

\n", 12 | "
\n", 13 | "
\n", 14 | "\n", 15 | "This is a quick introduction to Jupyter notebooks." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "

\n", 23 | "Jupyter notebooks are a way to combine executable code, code outputs, and text into one connected file.\n", 24 | "

\n", 25 | "\n", 26 | "

" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "## Menu Options & Shortcuts\n", 39 | "\n", 40 | "To get a quick tour of the Jupyter user-interface, click on the 'Help' menu, then click 'User Interface Tour'.\n", 41 | "\n", 42 | "There are also a large number of useful keyboard shortcuts. Click on the 'Help' menu, and then 'Keyboard Shortcuts' to see a list. " 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Cells" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "

\n", 57 | "The main organizational units of the notebook are 'cells'.\n", 58 | "

" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "Cells can be markdown (text) cells, like this one, or code cells (we'll get to those)." 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "### Markdown cells" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": { 78 | "slideshow": { 79 | "slide_type": "fragment" 80 | } 81 | }, 82 | "source": [ 83 | "Markdown cells are useful for communicating information about our notebooks.\n", 84 | "\n", 85 | "They support basic text formatting including italics, bold, headings, links and images.\n", 86 | "\n", 87 | "Double-click on any of the cells in this section to see what the plain text looks like. Then run the cell to see what the formatted Markdown text looks like." 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "slideshow": { 94 | "slide_type": "slide" 95 | } 96 | }, 97 | "source": [ 98 | "# This is a heading\n", 99 | "\n", 100 | "## This is a smaller heading\n", 101 | "\n", 102 | "### This is a really small heading" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": { 108 | "slideshow": { 109 | "slide_type": "slide" 110 | } 111 | }, 112 | "source": [ 113 | "We can italicize text either like *this* or like _this_." 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": { 119 | "slideshow": { 120 | "slide_type": "fragment" 121 | } 122 | }, 123 | "source": [ 124 | "We can embolden text either like **this** or like __this__." 
125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": { 130 | "slideshow": { 131 | "slide_type": "slide" 132 | } 133 | }, 134 | "source": [ 135 | "Here is an unordered list of items:\n", 136 | "* This is an item\n", 137 | "* This is an item\n", 138 | "* This is an item" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": { 144 | "slideshow": { 145 | "slide_type": "slide" 146 | } 147 | }, 148 | "source": [ 149 | "Here is an ordered list of items:\n", 150 | "1. This is my first item\n", 151 | "2. This is my second item\n", 152 | "3. This is my third item" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": { 158 | "slideshow": { 159 | "slide_type": "slide" 160 | } 161 | }, 162 | "source": [ 163 | "We can have a list of lists by using indentation:\n", 164 | "* This is an item\n", 165 | "* This is an item\n", 166 | "\t* This is an item\n", 167 | "\t* This is an item\n", 168 | "* This is an item" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": { 174 | "slideshow": { 175 | "slide_type": "slide" 176 | } 177 | }, 178 | "source": [ 179 | "We can also combine ordered and unordered lists:\n", 180 | "1. This is my first item\n", 181 | "2. This is my second item\n", 182 | "\t* This is an item\n", 183 | "\t* This is an item\n", 184 | "3. This is my third item" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": { 190 | "slideshow": { 191 | "slide_type": "slide" 192 | } 193 | }, 194 | "source": [ 195 | "We can make a link to this [useful markdown cheatsheet](https://www.markdownguide.org/cheat-sheet/) as such." 
196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": { 201 | "slideshow": { 202 | "slide_type": "fragment" 203 | } 204 | }, 205 | "source": [ 206 | "If we don't use the markdown syntax for links, it will just show the link itself as the link text: https://www.markdownguide.org/cheat-sheet/" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": { 212 | "slideshow": { 213 | "slide_type": "slide" 214 | } 215 | }, 216 | "source": [ 217 | "### LaTeX-formatted text" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": { 223 | "slideshow": { 224 | "slide_type": "fragment" 225 | } 226 | }, 227 | "source": [ 228 | "$$ P(A \\mid B) = \\frac{P(B \\mid A) \\, P(A)}{P(B)} $$" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "### Code Cells\n", 236 | "\n", 237 | "Code cells are cells that contain code, that can be executed. \n", 238 | "\n", 239 | "Comments can also be written in code cells, indicated by '#'. 
" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 1, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "# In a code cell, comments can be typed\n", 249 | "a = 1\n", 250 | "b = 2" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 2, 256 | "metadata": {}, 257 | "outputs": [ 258 | { 259 | "name": "stdout", 260 | "output_type": "stream", 261 | "text": [ 262 | "3\n" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "# Cells can also have output, that gets printed out below the cell.\n", 268 | "print(a + b)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 3, 274 | "metadata": { 275 | "slideshow": { 276 | "slide_type": "slide" 277 | } 278 | }, 279 | "outputs": [], 280 | "source": [ 281 | "# Define a variable in code\n", 282 | "my_string = 'hello world'" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 4, 288 | "metadata": { 289 | "slideshow": { 290 | "slide_type": "fragment" 291 | } 292 | }, 293 | "outputs": [ 294 | { 295 | "name": "stdout", 296 | "output_type": "stream", 297 | "text": [ 298 | "hello world\n" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "# Print out a variable\n", 304 | "print(my_string)" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 5, 310 | "metadata": { 311 | "slideshow": { 312 | "slide_type": "slide" 313 | } 314 | }, 315 | "outputs": [ 316 | { 317 | "data": { 318 | "text/plain": [ 319 | "'HELLO WORLD'" 320 | ] 321 | }, 322 | "execution_count": 5, 323 | "metadata": {}, 324 | "output_type": "execute_result" 325 | } 326 | ], 327 | "source": [ 328 | "# Operations that return objects get printed out as output\n", 329 | "my_string.upper()" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 6, 335 | "metadata": { 336 | "slideshow": { 337 | "slide_type": "slide" 338 | } 339 | }, 340 | "outputs": [], 341 | "source": [ 342 | "# Define a list variable\n", 343 | "my_list = 
['a','b','c']" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 7, 349 | "metadata": { 350 | "slideshow": { 351 | "slide_type": "fragment" 352 | } 353 | }, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "['a', 'b', 'c']\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "# Print out our list variable\n", 365 | "print(my_list)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "## Accessing Documentation" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "

\n", 380 | "Jupyter has useful shortcuts. Add a single '?' after a function or class to get a window with the documentation, or a double '??' to pull up the source code. \n", 381 | "
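The `?` and `??` shortcuts only work inside IPython and Jupyter. As a rough plain-Python counterpart (a sketch, not something the original notebook includes), the standard library's `inspect` module can fetch the same docstring and source text; `json.dumps` is used here only as a convenient example function.

```python
import inspect
import json

# Roughly what a single '?' shows: the docstring of the object
doc = inspect.getdoc(json.dumps)
print(doc.splitlines()[0])

# Roughly what a double '??' shows: the source code
# (only available for objects that are written in Python)
source = inspect.getsource(json.dumps)
print(source.splitlines()[0])
```

The built-in `help()` function prints similar documentation and also works outside of notebooks.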

" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 8, 387 | "metadata": {}, 388 | "outputs": [], 389 | "source": [ 390 | "# Import numpy for examples\n", 391 | "import numpy as np" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 9, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "# Check the docs for a numpy array\n", 401 | "np.array?" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 10, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "# Check the full source code for numpy append function\n", 411 | "np.append??" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 11, 417 | "metadata": { 418 | "slideshow": { 419 | "slide_type": "fragment" 420 | } 421 | }, 422 | "outputs": [], 423 | "source": [ 424 | "# Get information about a variable you've created\n", 425 | "my_string?" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "## Autocomplete" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "

\n", 440 | "Jupyter also has \n", 441 | "tab complete\n", 442 | "capabilities, which can autocomplete what you are typing, and/or be used to explore what code is available. \n", 443 | "

" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 12, 449 | "metadata": {}, 450 | "outputs": [ 451 | { 452 | "ename": "SyntaxError", 453 | "evalue": "invalid syntax (, line 2)", 454 | "output_type": "error", 455 | "traceback": [ 456 | "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m2\u001b[0m\n\u001b[0;31m np.\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "# Move your cursor just after the period, press tab, and a drop menu will appear showing all possible completions\n", 462 | "np." 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "# Autocomplete does not have to be at a period. Move to the end of 'ra' and hit tab to see completion options. \n", 472 | "ra" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "# If there is only one option, tab-complete will auto-complete what you are typing\n", 482 | "ran" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "## Kernel & Namespace\n", 490 | "\n", 491 | "You do not need to run cells in order! This is useful for flexibly testing and developing code. \n", 492 | "\n", 493 | "The numbers in the square brackets to the left of a cell show which cells have been run, and in what order.\n", 494 | "\n", 495 | "However, it can also be easy to lose track of what has already been declared / imported, leading to unexpected behaviour from running cells.\n", 496 | "\n", 497 | "The kernel is what connects the notebook to your computer behind-the-scenes to execute the code. \n", 498 | "\n", 499 | "It can be useful to clear and re-launch the kernel. 
You can do this from the 'kernel' drop-down menu at the top, optionally also clearing all outputs." 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "## Magic Commands" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "

\n", 514 | "'Magic Commands' are a special (command-line like) syntax in IPython/Jupyter to run special functionality. They can run on lines and/or entire cells. \n", 515 | "

\n", 516 | "\n", 517 | "

\n", 518 | "The IPython documentation has more information on magic commands.\n", 519 | "

" 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": { 525 | "slideshow": { 526 | "slide_type": "slide" 527 | } 528 | }, 529 | "source": [ 530 | "Magic commands are designed to succinctly solve various common problems in standard data analysis. Magic commands come in two flavors: line magics, which are denoted by a single % prefix and operate on a single line of input, and cell magics, which are denoted by a double %% prefix and operate on multiple lines of input." 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": { 537 | "slideshow": { 538 | "slide_type": "slide" 539 | } 540 | }, 541 | "outputs": [], 542 | "source": [ 543 | "# Access quick reference sheet for interactive Python (this opens a reference guide)\n", 544 | "%quickref" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "# Check a list of available magic commands\n", 554 | "%lsmagic" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "metadata": { 561 | "slideshow": { 562 | "slide_type": "slide" 563 | } 564 | }, 565 | "outputs": [], 566 | "source": [ 567 | "# Check the current working directory\n", 568 | "%pwd" 569 | ] 570 | }, 571 | { 572 | "cell_type": "code", 573 | "execution_count": null, 574 | "metadata": { 575 | "slideshow": { 576 | "slide_type": "fragment" 577 | } 578 | }, 579 | "outputs": [], 580 | "source": [ 581 | "# Check all currently defined variables\n", 582 | "%who" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": { 589 | "slideshow": { 590 | "slide_type": "fragment" 591 | } 592 | }, 593 | "outputs": [], 594 | "source": [ 595 | "# Chcek all variables, with more information about them\n", 596 | "%whos" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": { 603 | "slideshow": { 604 | 
"slide_type": "slide" 605 | } 606 | }, 607 | "outputs": [], 608 | "source": [ 609 | "# Check code history\n", 610 | "%hist" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "### Line Magics\n", 618 | "\n", 619 | "\n", 620 | "Line magics use a single '%', and apply to a single line. " 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": {}, 627 | "outputs": [], 628 | "source": [ 629 | "# For example, we can time how long it takes to create a large list\n", 630 | "%timeit list(range(100000))" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "### Cell Magics\n", 638 | "\n", 639 | "Cell magics use a double '%%', and apply to the whole cell. " 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": null, 645 | "metadata": {}, 646 | "outputs": [], 647 | "source": [ 648 | "%%timeit\n", 649 | "# For example, we could time a whole cell\n", 650 | "a = list(range(100000))\n", 651 | "b = [n + 1 for n in a]" 652 | ] 653 | }, 654 | { 655 | "cell_type": "markdown", 656 | "metadata": {}, 657 | "source": [ 658 | "### Running terminal commands\n", 659 | "\n", 660 | "Another nice thing about notebooks is being able to run terminals commands" 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": null, 666 | "metadata": {}, 667 | "outputs": [], 668 | "source": [ 669 | "# You can run a terminal command by adding '!' to the start of the line\n", 670 | "!pwd\n", 671 | "\n", 672 | "# Note that in this case, '!pwd' is equivalent to line magic '%pwd'. \n", 673 | "# The '!' 
syntax is more general though, allowing you to run anything you want through the command line " 674 | ] 675 | }, 676 | { 677 | "cell_type": "code", 678 | "execution_count": null, 679 | "metadata": {}, 680 | "outputs": [], 681 | "source": [ 682 | "%%bash\n", 683 | "# Equivalently, (for bash) use the %%bash cell magic to run a cell as bash (command-line)\n", 684 | "pwd" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": { 691 | "slideshow": { 692 | "slide_type": "fragment" 693 | } 694 | }, 695 | "outputs": [], 696 | "source": [ 697 | "# List files in directory\n", 698 | "!ls" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": { 705 | "slideshow": { 706 | "slide_type": "fragment" 707 | } 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "# Change current directory ('%cd' persists; '!cd' runs in a subshell and has no lasting effect)\n", 712 | "%cd ." 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": {}, 718 | "source": [ 719 | "

\n", 720 | "For more useful information, check out Jupyter Notebooks \n", 721 | "tips & tricks, and more information on how \n", 722 | "notebooks work.\n", 723 | "

" 724 | ] 725 | } 726 | ], 727 | "metadata": { 728 | "kernelspec": { 729 | "display_name": "Python 3", 730 | "language": "python", 731 | "name": "python3" 732 | }, 733 | "language_info": { 734 | "codemirror_mode": { 735 | "name": "ipython", 736 | "version": 3 737 | }, 738 | "file_extension": ".py", 739 | "mimetype": "text/x-python", 740 | "name": "python", 741 | "nbconvert_exporter": "python", 742 | "pygments_lexer": "ipython3", 743 | "version": "3.7.4" 744 | } 745 | }, 746 | "nbformat": 4, 747 | "nbformat_minor": 2 748 | } 749 | -------------------------------------------------------------------------------- /05-DataGathering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Gathering" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "

\n", 15 | "Data Gathering is the process of accessing data and collecting it together.\n", 16 | "

" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "This notebook covers strategies for finding and gathering data.\n", 24 | "\n", 25 | "If you want to start by working on data analyses (with provided data) you can move onto the next tutorials, and come back to this one later.\n", 26 | "\n", 27 | "Data gathering can encompass many different strategies, including data collection, web scraping, accessing data from databases, and downloading data in bulk. Sometimes it even includes things like calling someone to ask if you can use some of their data, and asking them to send it over. " 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Where to get Data\n", 35 | "\n", 36 | "There are lots of way to get data, and lots of places to get it from. Typically, most of this data will be accessed through the internet, in one way or another, especially when pursuing indepent research projects. \n", 37 | "\n", 38 | "### Institutional Access\n", 39 | "\n", 40 | "If you are working with data as part of an institution, such as a company of research lab, the institution will typically have data it needs analyzing, that it collects in various ways. Keep in mind that even people working inside institutions, with access to local data, will data still seek to find and incorporate external datasets. \n", 41 | "\n", 42 | "### Data Repositories\n", 43 | "\n", 44 | "**Data repositories** are databases from which you can download data. Some data repositories allow you to explore available datasets and download datasets in bulk. Others may also offer **APIs**, through which you can request specific data from particular databases.\n", 45 | "\n", 46 | "### Web Scraping\n", 47 | "\n", 48 | "The web itself is full of unstructured data. 
**Web scraping** can be used to extract and collect data directly from websites.\n", 49 | "\n", 50 | "### Asking People for Data\n", 51 | "\n", 52 | "Not all data is indexed or accessible on the web, at least not publicly. Sometimes finding data means figuring out if any data is available, figuring out where it might be, and then reaching out and asking people directly about data access. If there is some particular data you need, you can try to figure out who might have it, and get in touch to see if it might be available." 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "### Data Gathering Skills\n", 60 | "\n", 61 | "Depending on your gathering method, you will likely have to do some combination of the following:\n", 62 | "\n", 63 | "- Directly download data files from repositories\n", 64 | "- Query databases & use APIs to extract and collect data of interest\n", 65 | "- Ask people for data, and go to pick up data with a hard drive\n", 66 | "\n", 67 | "Ultimately, the goal is to collect and curate data files, hopefully structured, that you can read into Python." 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## Definitions: Databases & Query Languages\n", 75 | "\n", 76 | "Here, we will introduce some useful definitions you will likely encounter when exploring how to gather data. \n", 77 | "\n", 78 | "Other than these definitions, we will not cover databases & query languages more in these tutorials. " 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "

\n", 86 | "A database is an organized collection of data. More formally, 'database' refers to a set of related data, and the way it is organized. \n", 87 | "

" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "

\n", 95 | "A query language is a language for operating with databases, such as retrieving, and sometimes modifying, information from databases.\n", 96 | "

" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "

\n", 104 | "SQL (pronounced 'sequel') is a common query language used to interact with databases, and request data.\n", 105 | "

\n", 106 | "\n", 107 | "

" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Data Repositories" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "

\n", 129 | "A Data Repository is basically just a place that data is stored. For our purposes, it is a place you can download data from. \n", 130 | "

\n", 131 | "\n", 132 | "

\n", 133 | "There is a curated list of good data sources included in the \n", 134 | "project materials.\n", 135 | "

" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "For our purposes, data repositories are places you can download data directly from, for example [data.gov](https://www.data.gov/)." 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "## Application Program Interfaces (APIs)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "

\n", 157 | "An API is basically a way for software to talk to software - it is an interface into an application / website / database designed for software.\n", 158 | "

\n", 159 | "\n", 160 | "

\n", 166 | "\n", 167 | "

\n", 168 | "This\n", 169 | "list\n", 170 | "includes a collection of commonly used and available APIs. \n", 171 | "

" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "APIs offer a lot of functionality - you can send requests to the application to do all kinds of actions. In fact, any application interface that is designed to be used programmatically is an API, including, for example, interfaces for using packages of code. \n", 179 | "\n", 180 | "One of the many things that APIs do, and offer, is a way to query and access data from particular applications / databases. For example, there is a an API for Google maps that allows for programmatically querying the latitude & longitude positions of given addresses. \n", 181 | "\n", 182 | "The benefit of using APIs for data gathering purposes is that they typically return data in nicely structured formats, that are relatively easy to analyze." 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "### Launching URL Requests from Python\n", 190 | "\n", 191 | "In order to use APIs, and for other approaches to collecting data, it may be useful to launch URL requests from Python.\n", 192 | "\n", 193 | "Note that by `URL`, we just mean a file or application that can be reached by a web address. Python can be used to organize and launch URL requests, triggering actions and collecting any returned data. \n", 194 | "\n", 195 | "In practice, APIs are usually special URLs that return raw data, such as `json` or `XML` files. This is compared to URLs we are typically more used to that return web pages as `html`, which can be rendered for human viewers (html). The key difference is that APIs return structured data files, where as `html` files are typically unstructured (more on that later, with web scraping). \n", 196 | "\n", 197 | "If you with to use an API, try and find the documentation for to see how you send requests to access whatever data you want. 
\n", 198 | "\n", 199 | "#### API Example\n", 200 | "\n", 201 | "For our example here, we will use the Github API. Note that the URL we use is `api.github.com`. This URL accesses the API, and will return structured data files, instead of the html that would be returned by the standard URL (github.com)." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 10, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "import pandas as pd\n", 211 | "\n", 212 | "# We will use the `requests` library to launch URL requests from Python\n", 213 | "import requests" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 11, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "# Request data from the Github API on a particular user\n", 223 | "page = requests.get('https://api.github.com/users/tomdonoghue')" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 12, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "data": { 233 | "text/plain": [ 234 | 
"b'{\"login\":\"TomDonoghue\",\"id\":7727566,\"node_id\":\"MDQ6VXNlcjc3Mjc1NjY=\",\"avatar_url\":\"https://avatars0.githubusercontent.com/u/7727566?v=4\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/users/TomDonoghue\",\"html_url\":\"https://github.com/TomDonoghue\",\"followers_url\":\"https://api.github.com/users/TomDonoghue/followers\",\"following_url\":\"https://api.github.com/users/TomDonoghue/following{/other_user}\",\"gists_url\":\"https://api.github.com/users/TomDonoghue/gists{/gist_id}\",\"starred_url\":\"https://api.github.com/users/TomDonoghue/starred{/owner}{/repo}\",\"subscriptions_url\":\"https://api.github.com/users/TomDonoghue/subscriptions\",\"organizations_url\":\"https://api.github.com/users/TomDonoghue/orgs\",\"repos_url\":\"https://api.github.com/users/TomDonoghue/repos\",\"events_url\":\"https://api.github.com/users/TomDonoghue/events{/privacy}\",\"received_events_url\":\"https://api.github.com/users/TomDonoghue/received_events\",\"type\":\"User\",\"site_admin\":false,\"name\":\"Tom\",\"company\":\"UC San Diego\",\"blog\":\"https://tomdonoghue.github.io\",\"location\":\"San Diego\",\"email\":null,\"hireable\":null,\"bio\":\"Cognitive Science Grad Student @ UC San Diego working on analyzing electrical brain activity. Also teaching Python & Data Science. 
\\\\r\\\\n\\\\r\\\\n\",\"twitter_username\":null,\"public_repos\":13,\"public_gists\":0,\"followers\":97,\"following\":83,\"created_at\":\"2014-05-28T20:20:48Z\",\"updated_at\":\"2020-06-19T21:35:12Z\"}'" 235 | ] 236 | }, 237 | "execution_count": 12, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "# In this case, the content we get back is a json file\n", 244 | "page.content" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 13, 250 | "metadata": {}, 251 | "outputs": [ 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "login TomDonoghue\n", 256 | "id 7727566\n", 257 | "node_id MDQ6VXNlcjc3Mjc1NjY=\n", 258 | "avatar_url https://avatars0.githubusercontent.com/u/77275...\n", 259 | "gravatar_id \n", 260 | "url https://api.github.com/users/TomDonoghue\n", 261 | "html_url https://github.com/TomDonoghue\n", 262 | "followers_url https://api.github.com/users/TomDonoghue/follo...\n", 263 | "following_url https://api.github.com/users/TomDonoghue/follo...\n", 264 | "gists_url https://api.github.com/users/TomDonoghue/gists...\n", 265 | "starred_url https://api.github.com/users/TomDonoghue/starr...\n", 266 | "subscriptions_url https://api.github.com/users/TomDonoghue/subsc...\n", 267 | "organizations_url https://api.github.com/users/TomDonoghue/orgs\n", 268 | "repos_url https://api.github.com/users/TomDonoghue/repos\n", 269 | "events_url https://api.github.com/users/TomDonoghue/event...\n", 270 | "received_events_url https://api.github.com/users/TomDonoghue/recei...\n", 271 | "type User\n", 272 | "site_admin False\n", 273 | "name Tom\n", 274 | "company UC San Diego\n", 275 | "blog https://tomdonoghue.github.io\n", 276 | "location San Diego\n", 277 | "email None\n", 278 | "hireable None\n", 279 | "bio Cognitive Science Grad Student @ UC San Diego ...\n", 280 | "twitter_username None\n", 281 | "public_repos 13\n", 282 | "public_gists 0\n", 283 | "followers 97\n", 284 | "following 83\n", 285 | 
"created_at 2014-05-28T20:20:48Z\n", 286 | "updated_at 2020-06-19T21:35:12Z\n", 287 | "dtype: object" 288 | ] 289 | }, 290 | "execution_count": 13, 291 | "metadata": {}, 292 | "output_type": "execute_result" 293 | } 294 | ], 295 | "source": [ 296 | "# We can read in the json data with pandas\n", 297 | "pd.read_json(page.content, typ='series')" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "As we can see above, in a couple lines of code, we can collect a lot of structured data about a particular user.\n", 305 | "\n", 306 | "If we wanted to do analyses of Github profiles and activity, we could use the Github API to collect information about a group of users, and then analyze and compare the collected data. " 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": { 312 | "collapsed": true 313 | }, 314 | "source": [ 315 | "## Web Scraping" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "

\n", 323 | "Web scraping is when you (programmatically) extract data from websites.\n", 324 | "

\n", 325 | "\n", 326 | "

\n", 327 | "Wikipedia\n", 328 | "has a useful page on web scraping.\n", 329 | "

" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "By web scraping, we typically mean something distinct from using the internet to access an API. Rather, web scraping refers to using code to systematically navigate the internet, and extract information of internet, from html or other available files. Note that in this case one is not interacting directly with a database, but simply exploring and collecting whatever is available on web pages.\n", 337 | "\n", 338 | "Note that the following section uses the 'BeautifulSoup' module, which is not part of the standard anaconda distribution. \n", 339 | "\n", 340 | "If you do not have BeautifulSoup, and want to get it to run this section, you can uncomment the cell below, and run it, to install BeautifulSoup in your current Python environment. You only have to do this once." 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 5, 346 | "metadata": { 347 | "collapsed": true 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "#import sys\n", 352 | "#!conda install --yes --prefix {sys.prefix} beautifulsoup4" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 6, 358 | "metadata": { 359 | "collapsed": true 360 | }, 361 | "outputs": [], 362 | "source": [ 363 | "# Import BeautifulSoup\n", 364 | "from bs4 import BeautifulSoup" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 7, 370 | "metadata": { 371 | "collapsed": true 372 | }, 373 | "outputs": [], 374 | "source": [ 375 | "# Set the URL for the page we wish to scrape\n", 376 | "site_url = 'https://en.wikipedia.org/wiki/Data_science'\n", 377 | "\n", 378 | "# Launch the URL request, to get the page\n", 379 | "page = requests.get(site_url)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 8, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/plain": [ 390 | "b'\\n\\n\\n\\nData science - 
Wikipedia\\n\\n