├── .dask └── config.yaml ├── .gitignore ├── 01-Intro.ipynb ├── 02-Extract.ipynb ├── 03-Features-Modeling.ipynb ├── 04-Lab-Modeling.ipynb ├── 04a-Solution-Modeling.ipynb ├── 05-Tuning.ipynb ├── 06-Lab-Tuning.ipynb ├── 07-Scoring-Orchestration.ipynb ├── 08-RaySGD-MLflow.ipynb ├── 09-Wrapup.ipynb ├── README.md ├── binder ├── apt.txt ├── environment.yml ├── jupyterlab-workspace.json ├── postBuild └── start ├── data ├── california │ ├── _SUCCESS │ ├── _committed_2595799468439767928 │ ├── _started_2595799468439767928 │ ├── part-00000-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-105-1-c000.snappy.parquet │ ├── part-00001-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-106-1-c000.snappy.parquet │ ├── part-00002-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-107-1-c000.snappy.parquet │ ├── part-00003-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-108-1-c000.snappy.parquet │ ├── part-00004-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-109-1-c000.snappy.parquet │ ├── part-00005-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-110-1-c000.snappy.parquet │ ├── part-00006-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-111-1-c000.snappy.parquet │ └── part-00007-tid-2595799468439767928-8acbc669-35b8-49ef-ae03-df77016f96f8-112-1-c000.snappy.parquet ├── diamonds.csv └── powerplant.csv └── images ├── cpv1.mp4 ├── dask-array.svg ├── dask-dataframe.svg ├── data.jpg ├── flow-analyze.png ├── flow-base.png ├── flow-extract.png ├── flow-model.png ├── flow-transform.png ├── largest.jpg └── psf-logo@2x.png /.dask/config.yaml: -------------------------------------------------------------------------------- 1 | distributed: 2 | dashboard: 3 | link: "{JUPYTERHUB_BASE_URL}user/{JUPYTERHUB_USER}/proxy/{port}/status" 4 | 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | *.tune_metadata 131 | checkpoints/ 132 | -------------------------------------------------------------------------------- /01-Intro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Techniques for Data Science with Big Datasets\n", 8 | "\n", 9 | "\n", 10 | "\n", 11 | "## Well... that sounds awfully vague, doesn't it?\n", 12 | "\n", 13 | "__Welcome to large-scale data engineering and data science in 2020__\n", 14 | "* End-to-end, \"single product\" platforms are no longer the leading options\n", 15 | "* In the open-source world, end-to-end may not even be possible for the near future\n", 16 | "\n", 17 | "__What does this mean in concrete terms?__\n", 18 | "* Focusing on OSS, Hadoop and Spark can no longer support our end-to-end needs\n", 19 | "* We need -- and want -- to learn how to assemble a suite of best-of-breed tools for data science with newer, simpler tools like\n", 20 | " * Dask\n", 21 | " * Ray\n", 22 | " * Horovod and others\n", 23 | "* ... 
while still using key features of mature tools like\n", 24 | "  * SparkSQL\n", 25 | "  * Hive\n", 26 | "  * Airflow and more\n", 27 | "  \n", 28 | "__As architects and practitioners, we need to assemble and leverage a suite of tools chosen for power and simplicity__\n", 29 | "\n", 30 | "This class is designed to help you become confident\n", 31 | "* making those tool choices\n", 32 | "* communicating about them with your team\n", 33 | "* migrating away from legacy systems to meet modern data science needs\n", 34 | "\n", 35 | "This class is *not* designed to\n", 36 | "* Go in depth on the APIs or internals of any specific tools (there's just not enough time)\n", 37 | "* \"Sell you\" on any specific open-source project or product\n", 38 | "  * We want to get comfortable discussing strengths and weaknesses, and then you can choose a solution that is right for you\n", 39 | "  \n", 40 | "*We'll make more room for questions and discussion than in most of my classes (which are heavier on code and internals)*" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## Catching up to large-scale data science in 2020: what's changed?\n", 48 | "\n", 49 | "A brief recap:\n", 50 | "* 2016 - broad adoption across industries of R, PyData, Apache Spark\n", 51 | "* 2017 - broad rise of deep learning\n", 52 | "* 2018-2019 - decline of Hadoop/Spark for data science\n", 53 | "* 2020-2021 - new open tools and hybrid architectures\n", 54 | "\n", 55 | "__Theme: best-of-breed__\n", 56 | "\n", 57 | "https://www.oreilly.com/radar/why-best-of-breed-is-a-better-choice-than-all-in-one-platforms-for-data-science/" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## Interactive Survey\n", 65 | "\n", 66 | "* What size datasets do you typically work with?\n", 67 | "* Where is (most of) your data stored?\n", 68 | "* How do you get data out of your data lake?\n", 69 | "* What tools do you typically use for\n", 70 | "  * feature engineering\n", 71 | "  * modeling" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "## The changing definition of large-scale data\n", 79 | "\n", 80 | "__Compute power has grown, but datasets have not__\n", 81 | "\n", 82 | "*(figure: KDnuggets poll results -- largest dataset analyzed)*\n", 83 | "\n", 84 | "\n", 85 | "*Source: https://www.kdnuggets.com/2020/07/poll-largest-dataset-analyzed-results.html*\n", 86 | "\n", 87 | "The gap between the largest ML datasets in use and the 
largest tractable on a single node (no cluster) has changed dramatically\n", 88 | "* Resulting in new definitions for small, medium, and big data\n", 89 | "* Avoid \"big data\" tools and their taxes when you can\n", 90 | "\n", 91 | "Some \"medium data\" approaches\n", 92 | "* Downsample\n", 93 | "* XGBoost external memory (out-of-core)\n", 94 | "* TF/PyTorch data loaders\n", 95 | "* sklearn + `partial_fit` (incrementalizable) algorithms\n", 96 | " * Simplify with Dask, though Dask not strictly necessary\n", 97 | "* Apache Arrow / PyArrow (https://arrow.apache.org/docs/python/memory.html)\n", 98 | "* Honorable mention for feature engineering: Vaex" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## Roadmap for large-scale tooling journey\n", 106 | "\n", 107 | "" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [] 116 | } 117 | ], 118 | "metadata": { 119 | "kernelspec": { 120 | "display_name": "Python 3", 121 | "language": "python", 122 | "name": "python3" 123 | }, 124 | "language_info": { 125 | "codemirror_mode": { 126 | "name": "ipython", 127 | "version": 3 128 | }, 129 | "file_extension": ".py", 130 | "mimetype": "text/x-python", 131 | "name": "python", 132 | "nbconvert_exporter": "python", 133 | "pygments_lexer": "ipython3", 134 | "version": "3.7.0" 135 | } 136 | }, 137 | "nbformat": 4, 138 | "nbformat_minor": 4 139 | } 140 | -------------------------------------------------------------------------------- /02-Extract.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Acquiring data (extraction)\n", 8 | "\n", 9 | "\n", 10 | "\n", 11 | "> Note: in some organizations, there is a data discovery system, like https://www.amundsen.io/amundsen/ upstream from this step. 
We're not covering that area due to scope constraints\n" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Goal: use SQL to efficiently retrieve data for further work\n", 19 | "\n", 20 | "### Legacy Tools\n", 21 | "\n", 22 | "Mostly: Apache Hive\n", 23 | "\n", 24 | "### Current Tools\n", 25 | "\n", 26 | "* SparkSQL\n", 27 | "* Presto\n", 28 | "* *Hive Metastore*\n", 29 | "\n", 30 | "### Rising/Future Tools\n", 31 | "\n", 32 | "* Kartothek, Intake\n", 33 | "* BlazingSQL\n", 34 | "* Dask-SQL\n", 35 | "\n", 36 | "*There are more non-SQL options, but support for SQL is a requirement in most large organizations, so we're sticking with SQL-capable tools for now*\n" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "import pyspark" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "spark = pyspark.sql.SparkSession.builder.appName(\"demo\").getOrCreate()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "spark.sql(\"SELECT * FROM parquet.`data/california`\").show()" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "query = \"\"\"\n", 73 | "SELECT origin, mean(delay) as delay, count(1) \n", 74 | "FROM parquet.`data/california` \n", 75 | "GROUP BY origin\n", 76 | "HAVING count(1) > 500\n", 77 | "ORDER BY delay DESC\n", 78 | "\"\"\"\n", 79 | "spark.sql(query).show()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "query = \"\"\"\n", 89 | "SELECT *\n", 90 | "FROM parquet.`data/california` \n", 91 | "WHERE origin in (\n", 92 | " SELECT origin \n", 93 | " FROM parquet.`data/california` \n", 94 | " GROUP BY origin \n", 95 | " HAVING count(1) > 500\n", 96 | ")\n", 97 | "\"\"\"\n", 98 | "spark.sql(query).write.mode('overwrite').option('header', 'true').csv('data/refined_flights/')" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "! 
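ls data/refined_flights/            # hypothetical quick check (our addition): Spark writes one CSV part file per partition, plus a _SUCCESS marker\n",
    "! 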
head data/refined_flights/*.csv" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [] 116 | } 117 | ], 118 | "metadata": { 119 | "kernelspec": { 120 | "display_name": "Python 3", 121 | "language": "python", 122 | "name": "python3" 123 | }, 124 | "language_info": { 125 | "codemirror_mode": { 126 | "name": "ipython", 127 | "version": 3 128 | }, 129 | "file_extension": ".py", 130 | "mimetype": "text/x-python", 131 | "name": "python", 132 | "nbconvert_exporter": "python", 133 | "pygments_lexer": "ipython3", 134 | "version": "3.7.0" 135 | } 136 | }, 137 | "nbformat": 4, 138 | "nbformat_minor": 4 139 | } 140 | -------------------------------------------------------------------------------- /03-Features-Modeling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Feature Engineering\n", 8 | "\n", 9 | "### Main flavors of data and feature engineering\n", 10 | "* Tabular: Dataframe model\n", 11 | " * \"Typical\" business data tables\n", 12 | "* Batch/Tensor/Vector: Array model\n", 13 | " * Numeric data, timeseries, scientific data, audio, images, video, geodata, etc.\n", 14 | "* Natural language\n", 15 | " * Batches of strings\n", 16 | " * Transformed into array data through NLP-specific techniques\n", 17 | " \n", 18 | "\n", 19 | "\n", 20 | "### \"Must-haves\" for feature engineering on large data\n", 21 | "\n", 22 | "* Some data representation for the large dataset\n", 23 | " * Likely distributed, out-of-core, lazy, streaming, etc.\n", 24 | "* Mechanism to load data from standard formats and locations into the representation\n", 25 | " * E.g., loading HDF5 in S3 or Parquet in HDFS\n", 26 | "* APIs to apply feature engineering transformations\n", 27 | " * Mathematical operations\n", 28 | " * String, date, etc.\n", 29 | " * Custom (\"user-defined\")\n", 30 | "* Integration to a modeling framework and/or ability to write to standard formats\n", 31 | "\n", 32 | "### \"Nice-to-haves\"\n", 33 | "\n", 34 | "* Intuitive data representation: similar to \"small data\" tooling\n", 35 | "* APIs that resemble those of the most common industry-standard libraries\n", 36 | "* Both modeling integration *and* ability to write out transformed data" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## Rise of Python\n", 51 | "\n", 52 | "Python has become the *lingua franca* or dominant cross-cutting language for data science.\n", 53 | "\n", 54 | ">\n", 55 | "> __Note__ this is not to imply Python is the best or only language, or that other languages might not be intrinsically better or even, in the future, more successful. 
\n", 56 | ">\n", 57 | "> There are wonderful things to be said for languages from Rust to R to Julia to many others, but for baseline data science capability and versatility in commercial enterprises today, it's Python\n", 58 | ">\n", 59 | "\n", 60 | "So we can turn to Python and look at the dominant libraries and tools within that ecosystem\n", 61 | "* Tabular data: Pandas\n", 62 | "* Array data: NumPy and derivatives like CuPy, JAX.numpy, etc.\n", 63 | "* Basic modeling: scikit-learn, XGBoost, etc.\n", 64 | "* Deep learning: PyTorch, Tensorflow\n", 65 | "* NLP: SpaCy, NLTK, Huggingface, etc.\n", 66 | "\n", 67 | "As we get into further parts of the workflow, like hyperparameter tuning or reinforcement learning there are more choices. \n", 68 | "\n", 69 | "For time reasons, we're going to stick to this core workflow of extraction through modeling and tuning, and not continue on into MLOps and deploment architectures, or meta-modeling platforms for experimentation, feature and provenance tracking, etc. That would be a bit too much to take on!\n", 70 | "\n", 71 | "__Bottom line__: We want a data representation and APIs that are fairly close to the Pandas / NumPy / scikit-learn (SciPy) workflow. And we want elegant bridges into things like PyTorch, XGBoost, NLP tools, and tuning tools.\n", 72 | "\n", 73 | "## Dask: SciPy at Scale\n", 74 | "\n", 75 | "Luckily, Dask is well placed to solve this problem. \n", 76 | "\n", 77 | "While enterprises were still wrestling with JVM-based tools over the past 5 years, scientists, researchers, and others in the PyData and SciPy communities were building Dask, a pure-Python distributed compute platform that integrates deeply with all of the standard SciPy tools.\n", 78 | "\n", 79 | "__What does this mean?__\n", 80 | "\n", 81 | "We can take many of our local workflows to large-scale data via Dask with fairly minimal effort -- because under the hood, Dask is designed to use those \"small data\" structures in federation to create arbitrarily large ones." 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "
\n", 89 | " \n", 90 | "\n", 91 | "\n", 92 | "
" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "As an added bonus, due to the Dask architecture, it can leverage GPU-enabled versions of the underlying libraries.\n", 100 | "* GPU + NumPy => CuPy\n", 101 | "* GPU + Pandas => cuDF (RAPIDS CUDA dataframe)\n", 102 | "* GPU + scikit-learn => cuML (RAPIDS CUDA algorithms)\n", 103 | "etc.\n", 104 | "\n", 105 | "### Using Dask for Feature Transformation\n", 106 | "\n", 107 | "* We need to be able to load data in a standard format\n", 108 | "* Manipulate it using dataframe or array APIs\n", 109 | "* Write it and/or pass it efficiently to a modeling framework" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "from dask import dataframe as ddf\n", 119 | "from dask.distributed import Client\n", 120 | "\n", 121 | "client = Client(n_workers=2, threads_per_worker=1, memory_limit='1GB')\n", 122 | "\n", 123 | "client" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "df = ddf.read_csv('data/diamonds.csv')\n", 133 | "df" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "df.head()" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "df = df.drop(columns=['Unnamed: 0'])\n", 152 | "df = df.categorize()\n", 153 | "\n", 154 | "df" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "prepared = ddf.reshape.get_dummies(df)\n", 164 | "\n", 165 | "prepared.head()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "# Modeling\n", 173 | "\n", 174 | "\n", 175 | "\n", 176 | "If Dask makes an easy choice for some feature engineering and preprocessing, we're back into the deep end making choices for modeling.\n", 177 | "\n", 178 | "Why?\n", 179 | "\n", 180 | "Simply put, different kinds of modeling are handled best by different tools, so we have a lot of choices to make.\n", 181 | "\n", 182 | "* \"Classic\" ML\n", 183 | " * Dask\n", 184 | " * Dask ML\n", 185 | " * XGBoost (with or without Dask)\n", 186 | "* Unsupervised learning and dimensionality reduction\n", 187 | " * Dask supports some algorithms\n", 188 | " * For others, we may want to scale a deep-learning tool (PyTorch/Tensorflow)\n", 189 | " * Horovod\n", 190 | " * Ray SGD\n", 191 | "* Deep learning (scaling PyTorch/TF easily)\n", 192 | " * Horovod\n", 193 | " * Ray SGD\n", 194 | " * Ray RLlib for deep reinforcement learning\n", 195 | "* Simulations and agent-based models\n", 196 | " * Ray for stateful-agent simulations\n", 197 | " * Dask Actors may be an option\n", 198 | "\n", 199 | "\n", 200 | "## Example: Linear Model with Dask" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "y = prepared.price.to_dask_array(lengths=True)\n", 210 | "arr = prepared.drop('price', axis=1).to_dask_array(lengths=True)\n", 211 | "\n", 212 | "arr[:4]" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "arr[:4].compute()" 222 | ] 223 | }, 224 | { 
225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "from dask_ml.model_selection import train_test_split\n", 231 | "\n", 232 | "X_train, X_test, y_train, y_test = train_test_split(arr, y, test_size=0.1)\n", 233 | "\n", 234 | "X_train" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "from dask_ml.linear_model import LinearRegression\n", 244 | "\n", 245 | "lr = LinearRegression(solver='lbfgs', max_iter=10)\n", 246 | "lr_model = lr.fit(X_train, y_train)" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "y_predicted = lr_model.predict(X_test)\n", 256 | "\n", 257 | "y_predicted" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "from dask_ml.metrics import mean_squared_error\n", 267 | "from math import sqrt\n", 268 | "\n", 269 | "sqrt(mean_squared_error(y_test, y_predicted))" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "client.close()" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "## What is Ray?\n", 286 | "\n", 287 | "Ray (https://ray.io/) is a scale-out computing system designed for high-throughput, resilient stateful-actor algorithms. Ray was design at UC Berkeley's RISE lab under the supervision of some of the same team that created Apache Spark. \n", 288 | "\n", 289 | "Ray supports a number of languages at the API layer (Python and Java today) while most of the engine is C++. Ray's stateful actor support makes it strong in a number of key areas, like distributed SGD and reinforcement learning.\n", 290 | "\n", 291 | "Let's try a reinforcement learning example!\n", 292 | "\n", 293 | "> __Reinforcement Learning__ is a family of techniques that train *agents* to act in an *environment* to maximize *reward*. Famous examples include agents that can play chess, go, or Atari games ... but the field is hot because those agents can also be robots learning to do work, autonomous vehicles driving, or even virtual salesmen learning to get the best price possible from a customer.\n", 294 | "\n", 295 | "Ray treats deep reinforcement learning (RL + deep learning) as a top-level use case and includes libraries that encapsulate many of the most popular algorithms.\n", 296 | "\n", 297 | "Here, to create a simple example, we'll use __Deep Q-Learning__ (a foundational deep RL algorithm) to learn OpenAI's \"cart-pole\" (https://gym.openai.com/envs/CartPole-v1/) environment, which you can visualize like this:\n", 298 | "\n", 299 | "