├── .gitbook
│   └── assets
│       ├── cleanshot-2020-06-08-at-12.01.58-2x.png
│       ├── cleanshot-2020-06-08-at-12.04.06-2x.png
│       ├── cleanshot-2020-06-16-at-17.42.58-2x.png
│       ├── cleanshot-2020-06-30-at-15.11.22-2x.png
│       ├── cleanshot-2020-06-30-at-15.20.32-2x.png
│       ├── cleanshot-2020-07-01-at-22.39.49-2x.png
│       ├── cleanshot-2020-07-01-at-22.41.53-2x.png
│       ├── cleanshot-2020-07-02-at-17.33.52-2x.png
│       ├── cleanshot-2020-07-02-at-17.34.50-2x.png
│       ├── cleanshot-2020-07-06-at-19.09.30-2x.png
│       ├── cleanshot-2020-07-06-at-19.10.25-2x.png
│       ├── cleanshot-2020-07-06-at-19.11.29-2x.png
│       ├── cleanshot-2020-07-06-at-19.14.01-2x.png
│       ├── cleanshot-2020-07-09-at-09.56.23-2x.jpg
│       ├── cleanshot-2020-07-16-at-12.49.50-2x.jpg
│       ├── data.png
│       ├── debugging_overview.jpg
│       ├── deployment.png
│       ├── image (1).png
│       ├── image (2).png
│       ├── image (3).png
│       ├── image.png
│       ├── ml-interview-tips.png
│       ├── mqan-architecture.png
│       ├── softwarev2.png
│       ├── team-structures.png
│       ├── tools.png
│       └── uber-cota.png
├── README.md
├── SUMMARY.md
├── _assets
│   ├── guest_lecturers.key
│   └── instructors.key
├── certification
│   ├── certification.md
│   └── exam-preparation.md
├── course-content
│   ├── data-management
│   │   ├── README.md
│   │   ├── labeling.md
│   │   ├── overview.md
│   │   ├── processing.md
│   │   ├── sources.md
│   │   ├── storage.md
│   │   └── versioning.md
│   ├── infrastructure-and-tooling
│   │   ├── README.md
│   │   ├── all-in-one-solutions.md
│   │   ├── experiment-management.md
│   │   ├── frameworks-and-distributed-training.md
│   │   ├── hardware.md
│   │   ├── hyperparameter-tuning.md
│   │   ├── overview.md
│   │   ├── resource-management.md
│   │   └── software-engineering.md
│   ├── labs.md
│   ├── ml-teams
│   │   ├── README.md
│   │   ├── hiring.md
│   │   ├── managing-projects.md
│   │   ├── overview.md
│   │   ├── roles.md
│   │   └── team-structure.md
│   ├── research-areas.md
│   ├── setting-up-machine-learning-projects
│   │   ├── README.md
│   │   ├── archetypes.md
│   │   ├── baselines.md
│   │   ├── lifecycle.md
│   │   ├── metrics.md
│   │   ├── overview.md
│   │   └── prioritizing.md
│   ├── testing-and-deployment
│   │   ├── README.md
│   │   ├── ci-testing.md
│   │   ├── docker.md
│   │   ├── hardware-mobile.md
│   │   ├── ml-test-score.md
│   │   ├── monitoring.md
│   │   ├── project-structure.md
│   │   └── web-deployment.md
│   ├── training-and-debugging
│   │   ├── README.md
│   │   ├── conclusion.md
│   │   ├── debug.md
│   │   ├── evaluate.md
│   │   ├── improve.md
│   │   ├── overview.md
│   │   ├── start-simple.md
│   │   └── tune.md
│   └── where-to-go-next.md
└── guest-lectures
    ├── andrej-karpathy-tesla.md
    ├── chip-huyen-nvidia.md
    ├── franziska-bell-toyota-research.md
    ├── jai-ranganathan-keeptruckin.md
    ├── jeremy-howard-fast.ai.md
    ├── lukas-biewald-weights-and-biases.md
    ├── raquel-urtasun-uber-atg.md
    ├── richard-socher-salesforce.md
    ├── xavier-amatriain.md
    └── yangqing-jia-alibaba.md

/.gitbook/assets/cleanshot-2020-06-08-at-12.01.58-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-06-08-at-12.01.58-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-06-08-at-12.04.06-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-06-08-at-12.04.06-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-06-16-at-17.42.58-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-06-16-at-17.42.58-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-06-30-at-15.11.22-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-06-30-at-15.11.22-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-06-30-at-15.20.32-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-06-30-at-15.20.32-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-01-at-22.39.49-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-01-at-22.39.49-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-01-at-22.41.53-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-01-at-22.41.53-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-02-at-17.33.52-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-02-at-17.33.52-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-02-at-17.34.50-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-02-at-17.34.50-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-06-at-19.09.30-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-06-at-19.09.30-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-06-at-19.10.25-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-06-at-19.10.25-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-06-at-19.11.29-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-06-at-19.11.29-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-06-at-19.14.01-2x.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-06-at-19.14.01-2x.png
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-09-at-09.56.23-2x.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-09-at-09.56.23-2x.jpg
--------------------------------------------------------------------------------
/.gitbook/assets/cleanshot-2020-07-16-at-12.49.50-2x.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/cleanshot-2020-07-16-at-12.49.50-2x.jpg
--------------------------------------------------------------------------------
/.gitbook/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/data.png
--------------------------------------------------------------------------------
/.gitbook/assets/debugging_overview.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/debugging_overview.jpg
--------------------------------------------------------------------------------
/.gitbook/assets/deployment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/deployment.png
--------------------------------------------------------------------------------
/.gitbook/assets/image (1).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/image (1).png
--------------------------------------------------------------------------------
/.gitbook/assets/image (2).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/image (2).png
--------------------------------------------------------------------------------
/.gitbook/assets/image (3).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/image (3).png
--------------------------------------------------------------------------------
/.gitbook/assets/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/image.png
--------------------------------------------------------------------------------
/.gitbook/assets/ml-interview-tips.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/ml-interview-tips.png
--------------------------------------------------------------------------------
/.gitbook/assets/mqan-architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/mqan-architecture.png
--------------------------------------------------------------------------------
/.gitbook/assets/softwarev2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/softwarev2.png
--------------------------------------------------------------------------------
/.gitbook/assets/team-structures.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/team-structures.png
--------------------------------------------------------------------------------
/.gitbook/assets/tools.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/tools.png
--------------------------------------------------------------------------------
/.gitbook/assets/uber-cota.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/.gitbook/assets/uber-cota.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
---
description: >-
  Full Stack Deep Learning helps you bridge the gap from training machine
  learning models to deploying AI systems in the real world.
---

# Full Stack Deep Learning

{% hint style="info" %}
We are teaching an updated and improved FSDL as an official UC Berkeley course in Spring 2021.

Sign up to receive updates on our lectures as they're released, and to optionally participate in a synchronous learning community.

[**Sign up for 2021**](https://forms.gle/xSrgSPyBCkD8KnV76)
{% endhint %}

[![Join the chat at https://gitter.im/full-stack-deep-learning/fsdl-course](https://badges.gitter.im/full-stack-deep-learning/fsdl-course.svg)](https://gitter.im/full-stack-deep-learning/fsdl-course?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

## About this course

Since 2012, deep learning has led to remarkable progress across a variety of challenging computing tasks, from image recognition to speech recognition, robotics, and audio synthesis.
Deep learning has the potential to enable a new set of previously infeasible technologies like autonomous vehicles, real-time translation, and voice assistants, and to help reinvent existing software categories.

There are many great courses that teach you how to train deep neural networks. However, training the model is just one part of shipping a deep learning project. This course teaches **full-stack production deep learning:**

* Formulating the **problem** and estimating project **cost**
* Finding, cleaning, labeling, and augmenting **data**
* Picking the right **framework** and compute **infrastructure**
* **Troubleshooting** training and ensuring **reproducibility**
* **Deploying** the model at scale

![](.gitbook/assets/image%20%282%29.png)

This course was originally taught as an in-person boot camp in Berkeley from 2018 to 2019. It was also taught as a University of Washington Computer Science [PMP course](https://bit.ly/uwfsdl) in Spring 2020.

The discussion page for the course is on [Gitter](https://gitter.im/full-stack-deep-learning/fsdl-course).

The course project is on [GitHub](https://github.com/full-stack-deep-learning/fsdl-text-recognizer-project).

{% hint style="info" %}
Please [submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) if any information is out of date or if you have good additional info to add!
{% endhint %}

## Who is this for

The course is aimed at people who already know the basics of deep learning and want to understand the rest of the process of creating production deep learning systems. You will get the most out of this course if you have:

* At least one year of experience programming in Python.
* At least one deep learning course \(at a university or online\).
* Experience with code versioning, Unix environments, and software engineering.

We will not review the fundamentals of deep learning \(gradient descent, backpropagation, convolutional neural networks, recurrent neural networks, etc.\), so you should review those materials first if you are rusty.
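As a quick self-check on that bar: if the following NumPy sketch of gradient descent on a linear least-squares model reads naturally to you, you are in good shape \(the data here is random and purely illustrative\):

```python
import numpy as np

# Random toy data standing in for a real regression problem.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

w = np.zeros(3)  # model parameters
lr = 0.1         # learning rate

for _ in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad                         # one gradient-descent update
```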

## Organizers

![](.gitbook/assets/cleanshot-2020-07-01-at-22.41.53-2x.png)

## Guest Lectures

![](.gitbook/assets/cleanshot-2020-07-16-at-12.49.50-2x.jpg)

## Newsletter

{% embed url="https://forms.gle/mDQZxsLZmep8JFgx9" caption="" %}

## Course Content

{% page-ref page="course-content/setting-up-machine-learning-projects/" %}

{% page-ref page="course-content/infrastructure-and-tooling/" %}

{% page-ref page="course-content/data-management/" %}

{% page-ref page="course-content/ml-teams/" %}

{% page-ref page="course-content/training-and-debugging/" %}

{% page-ref page="course-content/testing-and-deployment/" %}

{% page-ref page="course-content/research-areas.md" %}

## Guest Lectures

{% page-ref page="guest-lectures/xavier-amatriain.md" %}

{% page-ref page="guest-lectures/chip-huyen-nvidia.md" %}

{% page-ref page="guest-lectures/lukas-biewald-weights-and-biases.md" %}

{% page-ref page="guest-lectures/jeremy-howard-fast.ai.md" %}

{% page-ref page="guest-lectures/richard-socher-salesforce.md" %}

{% page-ref page="guest-lectures/raquel-urtasun-uber-atg.md" %}

{% page-ref page="guest-lectures/yangqing-jia-alibaba.md" %}

{% page-ref page="guest-lectures/andrej-karpathy-tesla.md" %}

{% page-ref page="guest-lectures/jai-ranganathan-keeptruckin.md" %}

{% page-ref page="guest-lectures/franziska-bell-toyota-research.md" %}

--------------------------------------------------------------------------------
/SUMMARY.md:
--------------------------------------------------------------------------------
# Table of contents

* [Full Stack Deep Learning](README.md)

## Course Content

* [Setting up Machine Learning Projects](course-content/setting-up-machine-learning-projects/README.md)
  * [Overview](course-content/setting-up-machine-learning-projects/overview.md)
  * [Lifecycle](course-content/setting-up-machine-learning-projects/lifecycle.md)
  * [Prioritizing](course-content/setting-up-machine-learning-projects/prioritizing.md)
  * [Archetypes](course-content/setting-up-machine-learning-projects/archetypes.md)
  * [Metrics](course-content/setting-up-machine-learning-projects/metrics.md)
  * [Baselines](course-content/setting-up-machine-learning-projects/baselines.md)
* [Infrastructure and Tooling](course-content/infrastructure-and-tooling/README.md)
  * [Overview](course-content/infrastructure-and-tooling/overview.md)
  * [Software Engineering](course-content/infrastructure-and-tooling/software-engineering.md)
  * [Computing and GPUs](course-content/infrastructure-and-tooling/hardware.md)
  * [Resource Management](course-content/infrastructure-and-tooling/resource-management.md)
  * [Frameworks and Distributed Training](course-content/infrastructure-and-tooling/frameworks-and-distributed-training.md)
  * [Experiment Management](course-content/infrastructure-and-tooling/experiment-management.md)
  * [Hyperparameter Tuning](course-content/infrastructure-and-tooling/hyperparameter-tuning.md)
  * [All-in-one Solutions](course-content/infrastructure-and-tooling/all-in-one-solutions.md)
* [Data Management](course-content/data-management/README.md)
  * [Overview](course-content/data-management/overview.md)
  * [Sources](course-content/data-management/sources.md)
  * [Labeling](course-content/data-management/labeling.md)
  * [Storage](course-content/data-management/storage.md)
  * [Versioning](course-content/data-management/versioning.md)
  * [Processing](course-content/data-management/processing.md)
* [Machine Learning Teams](course-content/ml-teams/README.md)
  * [Overview](course-content/ml-teams/overview.md)
  * [Roles](course-content/ml-teams/roles.md)
  * [Team Structure](course-content/ml-teams/team-structure.md)
  * [Managing Projects](course-content/ml-teams/managing-projects.md)
  * [Hiring](course-content/ml-teams/hiring.md)
* [Training and Debugging](course-content/training-and-debugging/README.md)
  * [Overview](course-content/training-and-debugging/overview.md)
  * [Start Simple](course-content/training-and-debugging/start-simple.md)
  * [Debug](course-content/training-and-debugging/debug.md)
  * [Evaluate](course-content/training-and-debugging/evaluate.md)
  * [Improve](course-content/training-and-debugging/improve.md)
  * [Tune](course-content/training-and-debugging/tune.md)
  * [Conclusion](course-content/training-and-debugging/conclusion.md)
* [Testing and Deployment](course-content/testing-and-deployment/README.md)
  * [Project Structure](course-content/testing-and-deployment/project-structure.md)
  * [ML Test Score](course-content/testing-and-deployment/ml-test-score.md)
  * [CI / Testing](course-content/testing-and-deployment/ci-testing.md)
  * [Docker](course-content/testing-and-deployment/docker.md)
  * [Web Deployment](course-content/testing-and-deployment/web-deployment.md)
  * [Monitoring](course-content/testing-and-deployment/monitoring.md)
  * [Hardware/Mobile](course-content/testing-and-deployment/hardware-mobile.md)
* [Research Areas](course-content/research-areas.md)
* [Labs](course-content/labs.md)
* [Where to go next](course-content/where-to-go-next.md)

## Guest Lectures

* [Xavier Amatriain \(Curai\)](guest-lectures/xavier-amatriain.md)
* [Chip Huyen \(Snorkel\)](guest-lectures/chip-huyen-nvidia.md)
* [Lukas Biewald \(Weights & Biases\)](guest-lectures/lukas-biewald-weights-and-biases.md)
* [Jeremy Howard \(Fast.ai\)](guest-lectures/jeremy-howard-fast.ai.md)
* [Richard Socher \(Salesforce\)](guest-lectures/richard-socher-salesforce.md)
* [Raquel Urtasun \(Uber ATG\)](guest-lectures/raquel-urtasun-uber-atg.md)
* [Yangqing Jia \(Alibaba\)](guest-lectures/yangqing-jia-alibaba.md)
* [Andrej Karpathy \(Tesla\)](guest-lectures/andrej-karpathy-tesla.md)
* [Jai Ranganathan \(KeepTruckin\)](guest-lectures/jai-ranganathan-keeptruckin.md)
* [Franziska Bell \(Toyota Research\)](guest-lectures/franziska-bell-toyota-research.md)

## Corporate Training and Certification

* [Corporate Training](certification/exam-preparation.md)
* [Certification](certification/certification.md)

--------------------------------------------------------------------------------
/_assets/guest_lecturers.key:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/_assets/guest_lecturers.key
--------------------------------------------------------------------------------
/_assets/instructors.key:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/the-full-stack/course-gitbook/ada9a9171301b733abc16709e4ff3f08fdc455ad/_assets/instructors.key
--------------------------------------------------------------------------------
/certification/certification.md:
--------------------------------------------------------------------------------
# Certification

If you are interested in receiving group training, or becoming certified for taking the course, please email us at [team+certification@fullstackdeeplearning.com](mailto:team+certification@fullstackdeeplearning.com?subject=”Interested%20in%20certification”).

--------------------------------------------------------------------------------
/certification/exam-preparation.md:
--------------------------------------------------------------------------------
# Corporate Training

If you are interested in receiving group training, or becoming certified for taking the course, please email us at [team+training@fullstackdeeplearning.com](mailto:team+training@fullstackdeeplearning.com?subject=”Interested%20in%20corporate%20training”).

--------------------------------------------------------------------------------
/course-content/data-management/README.md:
--------------------------------------------------------------------------------
---
description: The Data Phase of Your Machine Learning Workflow
---

# Data Management

![Data management concerns the storage, access, processing, versioning, and labeling of data.](../../.gitbook/assets/data.png)

{% hint style="info" %}
As always, please [submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) if any information is out of date!
{% endhint %}

## Slides

{% embed url="https://www.slideshare.net/sergeykarayev/data-management-full-stack-deep-learning" %}

## Videos

{% page-ref page="overview.md" %}

{% page-ref page="sources.md" %}

{% page-ref page="labeling.md" %}

{% page-ref page="storage.md" %}

{% page-ref page="versioning.md" %}

{% page-ref page="processing.md" %}

--------------------------------------------------------------------------------
/course-content/data-management/labeling.md:
--------------------------------------------------------------------------------
---
description: What are effective ways to label your data?
---

# Labeling

{% embed url="https://www.youtube.com/watch?v=S7KzXF9M7Zs" caption="Labeling - Data Management" %}

## Summary

* Data labeling requires a collection of data points such as images, text, or audio, and a qualified team of people to label each of the input points with meaningful information that will be used to train a machine learning model \(a sketch of one labeled record follows this list\).
* You can create a **user interface** with a standard set of features \(bounding boxes, segmentation, key points, cuboids, set of applicable classes…\) and train your own annotators to label the data.
* You can leverage other labor sources by either **hiring** your own annotators or **crowdsourcing** the annotators.
* You can also consult standalone **service companies**. Data labeling requires a separate software stack, temporary labor, and quality assurance, so it often makes sense to **outsource** it.
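To make that concrete, here is a minimal sketch \(in Python\) of the kind of record a labeling interface might emit. Every field name here is hypothetical, and real tools differ in the details:

```python
# One hypothetical bounding-box annotation record for a single image.
annotation = {
    "image_id": "frame_000123.png",
    "annotator": "worker_42",
    "labels": [
        # Boxes given as [x_min, y_min, width, height] in pixels.
        {"class": "car", "bbox": [34, 120, 85, 40]},
        {"class": "pedestrian", "bbox": [210, 95, 22, 60]},
    ],
}

def classes(record):
    """Sorted class list, used to compare two annotators' work."""
    return sorted(label["class"] for label in record["labels"])

# A simple quality-assurance check: flag images where annotators disagree.
def needs_review(record_a, record_b):
    return classes(record_a) != classes(record_b)
```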

--------------------------------------------------------------------------------
/course-content/data-management/overview.md:
--------------------------------------------------------------------------------
---
description: Why is data management important?
---

# Overview

{% embed url="https://www.youtube.com/watch?v=xz-Uzcpc4AE" caption="Overview - Data Management" %}

## Summary

* Data science has never been as much about machine learning as it has been about cleaning, shaping, and moving data from place to place.
* Here are the important concepts in data management:
  * **Sources -** how to get training data
  * **Labeling -** how to label proprietary data at scale
  * **Storage -** how to store data and metadata appropriately
  * **Versioning -** how to update data through user activity or additional labeling
  * **Processing -** how to aggregate and convert raw data and metadata

--------------------------------------------------------------------------------
/course-content/data-management/processing.md:
--------------------------------------------------------------------------------
---
description: What are efficient ways to process your data?
---

# Processing

{% embed url="https://www.youtube.com/watch?v=foD8r33JM\_8" caption="Processing - Data Management" %}

## Summary

* The simplest thing we can do is write a **Makefile** that specifies what each processing step depends on.
* You will probably need a workflow management system. **Airflow** is the current winner of this space \(a minimal example follows this list\).
* Try to keep things simple and don't over-engineer your processing pipeline.
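Here is a minimal sketch of such a pipeline as an Airflow DAG \(written against the Airflow 2.x API\); the DAG name, schedule, and task bodies are all hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # e.g., pull yesterday's raw logs from object storage

def transform():
    ...  # e.g., aggregate raw events into training features

def load():
    ...  # e.g., write the features to the data warehouse

with DAG(
    dag_id="nightly_feature_build",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Declare the dependency chain: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```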

--------------------------------------------------------------------------------
/course-content/data-management/sources.md:
--------------------------------------------------------------------------------
---
description: Where do the training data come from?
---

# Sources

{% embed url="https://www.youtube.com/watch?v=5rY5HPe9UzI" caption="Sources - Data Management" %}

## Summary

* Most deep learning applications require lots of labeled data. There are publicly available datasets that can serve as a starting point, but they offer no competitive advantage.
* Most companies usually spend a lot of money and time to label their own data.
* A **data flywheel** means harnessing the power of users to rapidly improve the whole machine learning system.
* **Semi-supervised learning** is a relatively recent learning technique where the training data is autonomously \(or automatically\) labeled.
* **Data augmentation** is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data.
* **Synthetic data** is data that’s generated programmatically, an underrated idea that is almost always worth starting with.

--------------------------------------------------------------------------------
/course-content/data-management/storage.md:
--------------------------------------------------------------------------------
---
description: What are appropriate ways to store your data?
---

# Storage

{% embed url="https://www.youtube.com/watch?v=HUYDy3NZkHU" caption="Storage - Data Management" %}

## Summary

* Data storage requirements for AI vary widely according to the application and the source material.
* The **filesystem** is the foundational layer of storage. Its fundamental unit is a “file” — which can be text or binary, is not versioned, and is easily overwritten.
* **Object storage** is an API over the filesystem that allows users to send commands on files \(GET, PUT, DELETE\) to a service, without worrying about where the files are actually stored. Its fundamental unit is an “object” — which is usually binary \(images, sound files…\). A short sketch of this pattern follows this list.
* The **database** is a persistent, fast, and scalable storage/retrieval of structured data. Its fundamental unit is a “row” \(unique IDs, references to other rows, values in columns\).
* A **data lake** is the unstructured aggregation of data from multiple sources \(databases, logs, expensive data transformations\). It operates under the concept of “schema-on-read” by dumping everything in and then transforming the data for specific needs later.
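As an illustration of the object-storage pattern, here is a minimal sketch using boto3 against S3; the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# PUT: store a binary blob under a key, without caring where it physically lives.
with open("frame_000123.png", "rb") as f:
    s3.put_object(Bucket="my-training-data", Key="images/frame_000123.png", Body=f)

# GET: retrieve the same object later by key.
response = s3.get_object(Bucket="my-training-data", Key="images/frame_000123.png")
image_bytes = response["Body"].read()

# DELETE: remove the object when it is no longer needed.
s3.delete_object(Bucket="my-training-data", Key="images/frame_000123.png")
```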

--------------------------------------------------------------------------------
/course-content/data-management/versioning.md:
--------------------------------------------------------------------------------
---
description: What are the different levels of versioning your data?
---

# Versioning

{% embed url="https://www.youtube.com/watch?v=T\_K7tiKM9nE" caption="Versioning - Data Management" %}

## Summary

* Data versioning refers to saving new copies of your data when you make changes so that you can go back and retrieve specific versions of your files later.
* In **Level 0**, the data lives on the filesystem and/or object storage and the database without being versioned.
* In **Level 1**, the data is versioned by storing a snapshot of everything at training time.
* In **Level 2**, the data is versioned as a mix of assets and code.
* **Level 3** requires specialized solutions for versioning data. You should avoid these until you can fully explain how they will improve your project.

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/README.md:
--------------------------------------------------------------------------------
---
description: The Training and Evaluation Phases of Your Machine Learning Workflow
---

# Infrastructure and Tooling

![The infrastructure landscape of deep learning.](../../.gitbook/assets/cleanshot-2020-07-06-at-19.11.29-2x.png)

{% hint style="info" %}
As always, please [submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) if any information is out of date!
{% endhint %}

## Slides

{% embed url="https://www.slideshare.net/sergeykarayev/infrastructure-and-tooling-full-stack-deep-learning" caption="Slides for the entire module. Go to the next page for the first video!" %}

## Videos

{% page-ref page="overview.md" %}

{% page-ref page="software-engineering.md" %}

{% page-ref page="hardware.md" %}

{% page-ref page="experiment-management.md" %}

{% page-ref page="hyperparameter-tuning.md" %}

{% page-ref page="all-in-one-solutions.md" %}

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/all-in-one-solutions.md:
--------------------------------------------------------------------------------
---
description: How to choose between different machine learning platforms?
---

# All-in-one Solutions

{% embed url="https://www.youtube.com/watch?v=xBfll7pt8RI" caption="All In One - Infrastructure and Tooling" %}

## Summary

* The “All-In-One” machine learning platforms provide a single system for everything: developing models, scaling experiments to many machines, tracking experiments and versioning models, deploying models, and monitoring model performance.
* **FBLearner Flow** is the workflow management platform at the heart of the Facebook ML engineering ecosystem.
* **Michelangelo**, Uber’s ML platform, supports the training and serving of thousands of models in production across the company.
* **TensorFlow Extended** \(TFX\) is a Google-production-scale ML platform based on TensorFlow.
* Another option from Google is its **Cloud AI Platform**, a managed service that enables you to easily build machine learning models that work on any type of data, of any size.
* **Amazon SageMaker** is one of the core AI offerings from AWS that helps teams through all stages in the machine learning life cycle.
* **Neptune** is a product that focuses on managing the experimentation process while remaining lightweight and easy to use by any data science team.
* **FloydHub** is another managed cloud platform for data scientists.
* **Paperspace** provides a solution for accessing computing power via the cloud and offers it through an easy-to-use console where everyday consumers can just click a button to log into their upgraded, more powerful remote machine.
* **Determined AI** is a startup that creates software to handle everything from managing cluster compute resources to automating workflows, thereby putting some of that big-company technology within reach of any organization.
* **Domino Data Lab** is an integrated end-to-end platform that is language agnostic, has rich functionality for version control and collaboration, and offers one-click infrastructure scalability, deployment, and publishing.

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/experiment-management.md:
--------------------------------------------------------------------------------
---
description: How to keep track of your model experiments?
---

# Experiment Management

{% embed url="https://www.youtube.com/watch?v=N83eejieeRg" caption="Experiment Management - Infrastructure and Tooling" %}

## Summary

* Getting your models to perform well is a very iterative process. If you don’t have a system for managing your experiments, it quickly gets out of control.
* **TensorBoard** is TensorFlow's visualization toolkit, which allows you to easily monitor your model in a browser.
* **Losswise** provides ML practitioners with a Python API and accompanying dashboard to visualize progress within and across training sessions.
* [**Comet.ml**](http://comet.ml/) is another platform that enables engineers and data scientists to efficiently maintain their preferred workflow and tools, track previous work, and collaborate throughout the iterative process.
* **Weights & Biases** is an experiment tracking tool for deep learning that allows you to \(1\) store all the hyper-parameters and output metrics in one place; \(2\) explore and compare every experiment with training/inference visualizations; and \(3\) create beautiful reports that showcase your work.
* **MLflow** is an open-source platform for the entire machine learning lifecycle started by Databricks. Its MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files from your model training process.
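Here is a minimal sketch of experiment logging with MLflow Tracking; the run name, parameters, and metric are arbitrary examples, and `train_one_epoch` is a stand-in for a real training step:

```python
import mlflow

def train_one_epoch():
    return 0.9  # placeholder for a real train/validate step returning a metric

with mlflow.start_run(run_name="baseline-cnn"):
    # Record the hyper-parameters that define this experiment...
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("batch_size", 64)

    # ...and the metrics it produces, one point per epoch.
    for epoch in range(10):
        val_accuracy = train_one_epoch()
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)
```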

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/frameworks-and-distributed-training.md:
--------------------------------------------------------------------------------
---
description: >-
  How to choose a deep learning framework? How to enable distributed training
  for your models?
---

# Frameworks and Distributed Training

{% embed url="https://www.youtube.com/watch?v=vlwf7wEVoW4" caption="Frameworks and Distributed Training - Infrastructure and Tooling" %}

## Summary

* Unless you have a good reason not to, you should use either **TensorFlow** or **PyTorch**.
* Both frameworks are converging to a point where they are good for research and production.
* [**fast.ai**](http://fast.ai) is a solid option for beginners who want to iterate quickly.
* Distributed training of neural networks can be approached in two ways: \(1\) data parallelism and \(2\) model parallelism \(a data-parallel sketch follows this list\).
* Practically, **data parallelism** is more popular and frequently employed in large organizations for executing production-level deep learning algorithms.
* **Model parallelism**, on the other hand, is only necessary when a model does not fit on a single GPU.
* [**Ray**](http://docs.ray.io/) is an open-source project for effortless, stateful, parallel, and distributed computing in Python.
* [**RaySGD**](https://docs.ray.io/en/latest/raysgd/raysgd_pytorch.html) is a library for distributed data parallel training that provides fault tolerance and seamless parallelization, built on top of [**Ray**](http://docs.ray.io/).
* **Horovod** is Uber’s open-source distributed deep learning framework that uses a standard multi-process communication framework, so it can be an easier experience for multi-node training.
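The simplest form of data parallelism is the single-node sketch below, in PyTorch; for serious multi-node work, tools like Horovod or PyTorch's DistributedDataParallel are the better fit, so treat this only as an illustration of the idea:

```python
import torch
import torch.nn as nn

# A toy model; each GPU will hold a full replica of it.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across the available GPUs,
    # runs the replicas in parallel, and gathers the outputs back together.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

batch = torch.randn(128, 784).to(device)  # dummy batch of 128 examples
logits = model(batch)                     # each GPU processes a shard of the batch
```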

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/hardware.md:
--------------------------------------------------------------------------------
---
description: >-
  How to choose appropriate hardware for your compute needs? Should you compute
  in the cloud or using your own GPUs?
---

# Computing and GPUs

{% embed url="https://youtu.be/S8nLrIj7sM0" caption="Computing and GPUs - Infrastructure and Tooling" %}

## Summary

* If you go the route of using your own GPUs, there are a lot of **NVIDIA** cards to choose from \(Kepler, Maxwell, Pascal, Volta, Turing\).
* If you go with a cloud provider, **Amazon Web Services** and **Google Cloud Platform** are the heavyweights, while startups such as **Paperspace** and **Lambda Labs** are also viable options.
* If you work solo or in a startup, you should build or buy a 4x recent-architecture PC for model development. For model training, if you run many experiments, you can either buy shared server machines or use cloud instances.
* If you work in a large company, you are more likely to rely on cloud instances for both model development and model training, as they provide proper provisioning and infrastructure to handle failures.

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/hyperparameter-tuning.md:
--------------------------------------------------------------------------------
---
description: How to tune your model hyper-parameters?
---

# Hyperparameter Tuning

{% embed url="https://www.youtube.com/watch?v=n-2HeifoItU" caption="Hyperparameter Tuning - Infrastructure and Tooling" %}

## Summary

* Deep learning models are full of hyper-parameters. Finding the best configuration for these variables in a high-dimensional space is not trivial.
* Searching for hyper-parameters is an iterative process constrained by computing power, money, and time. Therefore, it is really useful to have software that helps you search over hyper-parameter settings.
* **Hyperopt** is a Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions.
* **SigOpt** is an optimization-as-a-service API that allows users to seamlessly tune the configuration parameters in AI and ML models.
* [**Ray Tune**](https://docs.ray.io/en/latest/tune.html) is a Python library for hyperparameter tuning at any scale, integrating seamlessly with optimization libraries such as **Hyperopt** and **SigOpt**.
* **Weights & Biases** has a nice feature called “Hyperparameter Sweeps” — a way to efficiently select the right model for a given dataset using the tool.

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/overview.md:
--------------------------------------------------------------------------------
---
description: What are the components of a machine learning system?
---

# Overview

{% embed url="https://www.youtube.com/watch?v=\_pLe7\_b5tGc" caption="Overview - Infrastructure and Tooling" %}

## Summary

* Google's seminal paper "Machine Learning: The High-Interest Credit Card of Technical Debt" states that if we look at the whole machine learning system, the actual modeling code is very small. There is a lot of other code around it that configures the system, extracts the data/features, tests the model performance, manages processes/resources, and serves/deploys the model.
* The **data component:**
  * Data Storage - How to store the data?
  * Data Workflows - How to process the data?
  * Data Labeling - How to label the data?
  * Data Versioning - How to version the data?
* The **development component:**
  * Software Engineering - How to choose the proper engineering tools?
  * Frameworks - How to choose the right deep learning frameworks?
  * Distributed Training - How to train the models in a distributed fashion?
  * Resource Management - How to provision and manage distributed GPUs?
  * Experiment Management - How to manage and store model experiments?
  * Hyper-parameter Tuning - How to tune model hyper-parameters?
* The **deployment component:**
  * Continuous Integration and Testing - How to not break things as models are updated?
  * Web - How to deploy models to web services?
  * Hardware and Mobile - How to deploy models to embedded and mobile systems?
  * Interchange - How to deploy models across systems?
  * Monitoring - How to monitor model predictions?
* All-In-One: There are solutions that handle all of these components!

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/resource-management.md:
--------------------------------------------------------------------------------
---
description: How to effectively manage compute resources?
---

# Resource Management

{% embed url="https://www.youtube.com/watch?v=XmHRXktfwhM" caption="Resource Management - Infrastructure and Tooling" %}

## Summary

* Running complex deep learning models poses a very practical resource management problem: how to give every team the tools they need to train their models without requiring them to operate their own infrastructure?
* The most primitive approach is to use **spreadsheets** that allow people to reserve what resources they need to use.
* The next approach is to utilize a **SLURM Workload Manager**, a free and open-source job scheduler for Linux and Unix-like kernels.
* A very standard approach these days is to use Docker alongside Kubernetes.
  * **Docker** is a way to package up an entire dependency stack in a lighter-than-a-Virtual-Machine package.
  * **Kubernetes** is a way to run many Docker containers on top of a cluster.
* The last option is to use open-source projects.
  * Using **Kubeflow** allows you to run model training jobs at scale on containers with the same scalability of container orchestration that comes with Kubernetes.
  * **Polyaxon** is a self-service multi-user system, taking care of scheduling and managing jobs in order to make the best use of available cluster resources.

--------------------------------------------------------------------------------
/course-content/infrastructure-and-tooling/software-engineering.md:
--------------------------------------------------------------------------------
---
description: >-
  What are the good software engineering practices for Machine Learning
  developers?
---

# Software Engineering

{% embed url="https://www.youtube.com/watch?v=DHgC6\_JkaAU" caption="Software Engineering - Infrastructure and Tooling" %}

## Summary

* **Python** is the clear programming language of choice.
* **Visual Studio Code** makes for a very nice Python experience, with features such as built-in git staging and diffing, peek documentation, and linter code hints.
* **PyCharm** is a popular choice for Python developers.
* **Jupyter Notebooks** are the standard tool for quick prototyping and exploratory analysis, but they are not suitable for building machine learning products.
* **Streamlit** is a new tool that fulfills a common need: an interactive applet to communicate modeling results.
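Here is a minimal sketch of such an applet; save it as `app.py` and launch it with `streamlit run app.py`. The metric and the data are made up purely for illustration:

```python
import numpy as np
import streamlit as st

st.title("Predictions kept vs. confidence threshold")

# An interactive widget: the viewer drags the slider and the page re-renders.
threshold = st.slider("Confidence threshold", 0.0, 1.0, 0.5)

# Random scores standing in for real model confidence outputs.
scores = np.random.RandomState(0).rand(1000)
kept = scores[scores >= threshold]

st.write(f"Predictions kept at this threshold: {len(kept)} / {len(scores)}")
st.line_chart(np.sort(kept))
```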

--------------------------------------------------------------------------------
/course-content/labs.md:
--------------------------------------------------------------------------------
---
description: 'Course Project: Build and Deploy an End-to-End Deep Learning System'
---

# Labs

The course project is broken up into 8 labs: [https://github.com/full-stack-deep-learning/fsdl-text-recognizer-project](https://github.com/full-stack-deep-learning/fsdl-text-recognizer-project)

Feel free to follow along, and submit pull requests to the source repo if you'd like to make changes: [https://github.com/full-stack-deep-learning/fsdl-text-recognizer](https://github.com/full-stack-deep-learning/fsdl-text-recognizer)

![](../.gitbook/assets/cleanshot-2020-07-09-at-09.56.23-2x.jpg)

## Videos

{% embed url="https://www.youtube.com/playlist?list=PL1T8fO7ArWlfx84Kc1Ke4a3DTGAzwnApV" caption="Start watching our Labs playlist." %}

--------------------------------------------------------------------------------
/course-content/ml-teams/README.md:
--------------------------------------------------------------------------------
---
description: How To Build Your Machine Learning Teams Effectively
---

# Machine Learning Teams

![In this module, we explore ML roles, types of organizations, best management practices, and hiring.](../../.gitbook/assets/cleanshot-2020-07-06-at-19.10.25-2x.png)

{% hint style="info" %}
As always, please [submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) if any information is out of date!
{% endhint %}

## Slides

{% embed url="https://www.slideshare.net/sergeykarayev/machine-learning-teams-full-stack-deep-learning" %}

## Videos

{% page-ref page="overview.md" %}

{% page-ref page="roles.md" %}

{% page-ref page="team-structure.md" %}

{% page-ref page="managing-projects.md" %}

{% page-ref page="hiring.md" %}

--------------------------------------------------------------------------------
/course-content/ml-teams/hiring.md:
--------------------------------------------------------------------------------
---
description: >-
  How to source Machine Learning talent? How to interview Machine Learning
  candidates? How to find a job as a Machine Learning practitioner?
---

# Hiring

{% embed url="https://www.youtube.com/watch?v=-z8UmqCvOFs" caption="Hiring - ML Teams" %}

## Summary

* Machine Learning talent is scarce.
* As a manager, be specific about which skills are must-haves in your Machine Learning job descriptions.
* As a job seeker, it can be brutally challenging to break in as an outsider, so use projects as a signal to build awareness.

--------------------------------------------------------------------------------
/course-content/ml-teams/managing-projects.md:
--------------------------------------------------------------------------------
---
description: How to manage machine learning projects properly?
---

# Managing Projects

{% embed url="https://www.youtube.com/watch?v=di--TEEMV6U" caption="Managing - ML Teams" %}

## Summary

* Managing Machine Learning projects can be **very challenging**:
  * In Machine Learning, it is hard to tell in advance what’s hard and what’s easy.
  * Machine Learning progress is nonlinear.
  * There are cultural gaps between research and engineering because of different values, backgrounds, goals, and norms.
  * Often, leadership just does not understand it.
* The secret sauce is to plan the Machine Learning project **probabilistically**!
  * Attempt a portfolio of approaches.
  * Measure progress based on inputs, not results.
  * Have researchers and engineers work together.
  * Get end-to-end pipelines together quickly to demonstrate quick wins.
  * Educate leadership on Machine Learning timeline uncertainty.

--------------------------------------------------------------------------------
/course-content/ml-teams/overview.md:
--------------------------------------------------------------------------------
---
description: Why is running a Machine Learning team hard?
---

# Overview

{% embed url="https://www.youtube.com/watch?v=5owC6OBnEcU" caption="Overview - ML Teams" %}

## Summary

* Machine Learning talent is expensive and scarce.
* Machine Learning teams have a diverse set of roles.
* Machine Learning projects have unclear timelines and high uncertainty.
* Machine Learning is also the “[high-interest credit card of technical debt](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)."
* Leadership often doesn’t understand Machine Learning.

--------------------------------------------------------------------------------
/course-content/ml-teams/roles.md:
--------------------------------------------------------------------------------
---
description: >-
  What are the different roles inside a Machine Learning team? What skills are
  needed for each of them?
---

# Roles

{% embed url="https://www.youtube.com/watch?v=9QNNJRUqyiI" caption="Roles - ML Teams" %}

## Summary

* The **Machine Learning Product Manager** is someone who works with the Machine Learning team, as well as other business functions and the end-users.
  * This person designs docs, creates wireframes, and comes up with the plan to prioritize and execute Machine Learning projects.
  * The role is just like a traditional Product Manager, but with a deep knowledge of the Machine Learning development process and mindset.
* The **DevOps Engineer** is someone who deploys and monitors production systems.
  * This person handles the infrastructure that runs the deployed Machine Learning product.
  * This role is primarily a software engineering role, which often comes from a standard software engineering pipeline.
* The **Data Engineer** is someone who builds data pipelines, aggregates and collects data from storage, and monitors data behavior.
  * This person works with distributed systems such as Hadoop, Kafka, and Airflow.
  * This person belongs to the software engineering team that works actively with Machine Learning teams.
* The **Machine Learning Engineer** is someone who trains and deploys prediction models.
  * This person uses tools like TensorFlow and Docker to work with prediction systems running on real data in production.
  * This person is either an engineer with significant self-teaching OR a science/engineering Ph.D. who works as a traditional software engineer after graduate school.
* The **Machine Learning Researcher** is someone who trains prediction models, in work that is often forward-looking and not production-critical.
  * This person uses TensorFlow / PyTorch / Jupyter to build models and writes reports describing their experiments.
  * This person is a Machine Learning expert who usually has an MS or Ph.D. degree in Computer Science or Statistics, or has finished an industrial fellowship program.
* The **Data Scientist** is actually a blanket term used to describe all of the roles above.
  * In some organizations, this role actually entails answering business questions via analytics.
  * The role draws on a wide range of backgrounds, from undergraduates to Ph.D. students.

--------------------------------------------------------------------------------
/course-content/ml-teams/team-structure.md:
--------------------------------------------------------------------------------
---
description: How to structure a Machine Learning team inside an organization?
---

# Team Structure

{% embed url="https://www.youtube.com/watch?v=PVaKy5mVFDA" caption="Organizations - ML Teams" %}

## Summary

* The **Nascent and Ad-Hoc Machine Learning** organization:
  * No one is doing Machine Learning, or Machine Learning is done on an ad-hoc basis.
  * There is often low-hanging fruit for Machine Learning.
  * But there is little support for Machine Learning projects and it’s very difficult to hire and retain good talent.
* The **Research and Development Machine Learning** organization:
  * Machine Learning efforts are centered in the R&D arm of the organization. They often hire Machine Learning researchers and doctorate students with experience publishing papers.
  * They can hire experienced researchers and work on long-term business priorities to get big wins.
  * However, it is very difficult to get quality data. Most often, this type of research work rarely translates into actual business value, so usually the amount of investment remains small.
* The **Business and Product Embedded Machine Learning** organization:
  * Certain product teams or business units have Machine Learning expertise alongside their software or analytics talent. These Machine Learning individuals report up to the team’s engineering/tech lead.
  * Machine Learning improvements are likely to lead to business value. Furthermore, there is a tight feedback cycle between idea iteration and product improvement.
  * Unfortunately, it is still very hard to hire and develop top talent, and access to data & compute resources can lag. There are also potential conflicts between Machine Learning project cycles and engineering management, so long-term Machine Learning projects can be hard to justify.
* The **Independent Machine Learning** organization:
  * The Machine Learning division reports directly to senior leadership. The Machine Learning Product Managers work with Researchers and Engineers to build Machine Learning into client-facing products. They can sometimes publish long-term research.
  * Talent density allows them to hire and train top practitioners. Senior leaders can marshal data and compute resources. This allows the organization to invest in tooling, practices, and culture around Machine Learning development.
  * A disadvantage is that model handoffs to different business lines can be challenging, since users need to buy in to Machine Learning benefits and get educated on the model use. Also, feedback cycles can be slow.
27 | * The **Machine Learning First** organization: 28 | * The CEO invests in Machine Learning and there are experts across the business focusing on quick wins. The Machine Learning division works on challenging and long-term projects. 29 | * They have the best data access \(data thinking permeates the organization\), the most attractive recruiting funnel \(challenging Machine Learning problems tend to attract top talent\), and the easiest deployment procedure \(product teams understand Machine Learning well enough\). 30 | * This organizational archetype is hard to implement in practice since it is culturally difficult to embed Machine Learning thinking everywhere. 31 | * Organizational design follows 3 broad categories: 32 | * **Software Engineering vs Research**: To what extent is the Machine Learning team responsible for building or integrating with software? How important are Software Engineering skills on the team? 33 | * **Data Ownership**: How much control does the Machine Learning team have over data collection, warehousing, labeling, and pipelining? 34 | * **Model Ownership**: Is the Machine Learning team responsible for deploying models into production? Who maintains the deployed models? 35 | 36 | -------------------------------------------------------------------------------- /course-content/research-areas.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Professor Pieter Abbeel covers state-of-the-art deep learning methods that are 4 | just now becoming usable in production. 5 | --- 6 | 7 | # Research Areas 8 | 9 | ## Slides 10 | 11 | {% embed url="https://www.slideshare.net/sergeykarayev/research-directions-full-stack-deep-learning" %} 12 | 13 | ## Video 14 | 15 | {% embed url="https://youtu.be/OMraS0GRWK0" caption="Research Areas as explained by Professor Pieter Abbeel." %} 16 | 17 | ## Summary 18 | 19 | ### Few-Shot Learning 20 | 21 | * **Model-Agnostic Meta-Learning** is an end-to-end paradigm for learning a parameter vector that is a good initialization for fine-tuning on many tasks. 22 | * It works well for classification problems, optimization tasks, and generative models. 23 | 24 | ### Reinforcement Learning 25 | 26 | * Compared to supervised learning, reinforcement learning has additional challenges in terms of **credit assignment, stability,** and **exploration**. 27 | * Success stories of reinforcement learning are predominantly in the domains of **game-playing** and **robotics**. 28 | * **Meta-Reinforcement Learning** helps us develop "fast" reinforcement learning algorithms that can adapt to real-world scenarios. 29 | * **Multi-Armed Bandits** provide a solid evaluation scheme for reinforcement learning methods. 30 | * **Contextual Bandits** are basically a simpler version of reinforcement learning with no states. 31 | * In **the real world**, reinforcement learning works well whenever we have a great simulator/demonstration of the environment and an inexpensive data collection process. 32 | 33 | ### Imitation Learning 34 | 35 | * In the **imitation learning** paradigm, we collect many demonstrations and turn them into a policy that can interact with the environment \(a minimal training sketch follows this section\). 36 | * **One-Shot Imitation Learning** only needs a single demonstration of a new task to figure out the next action. 37 | * In **the real world,** imitation learning works well whenever we have access to previous data and it's easy to predict what happens next.
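To make that setup concrete, here is a minimal behavioral-cloning sketch in PyTorch: it treats the demonstrations as a supervised dataset of \(state, action\) pairs and fits a small policy network to them. The data shapes, network sizes, and training settings below are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

# Demonstrations as (state, action) pairs -- random placeholders here.
states = torch.randn(1000, 8)            # 1,000 demo states, 8 features each
actions = torch.randint(0, 4, (1000,))   # one of 4 discrete actions per state

# A small policy network mapping states to action logits.
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Behavioral cloning is ordinary supervised learning on the demonstrations.
for epoch in range(20):
    loss = loss_fn(policy(states), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained policy picks the next action from the current state.
next_action = policy(torch.randn(1, 8)).argmax(dim=-1)
```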
38 | 39 | ### Domain Randomization 40 | 41 | * The motivation here is **how can we learn useful real-world skills in the simulator?** 42 | * **Domain randomization** operates under the assumption that if the model sees enough simulated variation, the real world may look like just another simulation. 43 | * A well-known example is **OpenAI's robot hand** that solves a Rubik's Cube. 44 | 45 | ### Architecture Search 46 | 47 | * The idea behind **architecture search** is to use search algorithms to determine the optimal architecture for our neural networks. 48 | * We can use reinforcement learning to perform this search process. 49 | * Small architectures sometimes match the performance of bigger architectures. 50 | * Furthermore, we can use reinforcement learning to design the right **data augmentation** scheme to maximize performance. 51 | 52 | ### Unsupervised Learning 53 | 54 | * **Unsupervised learning** deals with unlabeled data. We can learn a network that embeds the data or learn the weights of a network architecture. 55 | * The main families of models include Variational Autoencoders, Generative Adversarial Networks, Exact Likelihood Models, and "puzzle"-solving models. 56 | * **Contrastive Predictive Coding** is an unsupervised learning scheme that breaks up the input into pieces, removes specific pieces, and asks the network to fill those pieces back in to produce the final output. 57 | * A well-known example is **OpenAI's GPT-2 system** that can generate text. 58 | 59 | ### Overall Research Theme 60 | 61 | * Researchers use more computing power to get better results. 62 | * You should focus on problem territory with a lot of **data** and **compute** rather than human ingenuity. 63 | 64 | ### Bridge The Research and Real-World Gap 65 | 66 | * **Research** is about going from 0 to 1. 67 | * In **real-world** applications, often 90% performance is not enough. 68 | 69 | ### How To Keep Up 70 | 71 | * Learn to read academic papers. 72 | * Helpful resources for finding papers include the Import AI newsletter, Arxiv Sanity, Twitter, Facebook groups, and the ML subreddit. 73 | * Form a reading group. 74 | 75 | -------------------------------------------------------------------------------- /course-content/setting-up-machine-learning-projects/README.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How To Set Your Machine Learning Projects Up For Success 3 | --- 4 | 5 | # Setting up Machine Learning Projects 6 | 7 | ![](../../.gitbook/assets/cleanshot-2020-06-08-at-12.01.58-2x.png) 8 | 9 | {% hint style="info" %} 10 | As always, please [submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) if any information is out of date! 11 | {% endhint %} 12 | 13 | ## Slides 14 | 15 | {% embed url="https://www.slideshare.net/sergeykarayev/setting-up-machine-learning-projects-full-stack-deep-learning" %} 16 | 17 | ## Videos 18 | 19 | {% page-ref page="overview.md" %} 20 | 21 | {% page-ref page="lifecycle.md" %} 22 | 23 | {% page-ref page="prioritizing.md" %} 24 | 25 | {% page-ref page="archetypes.md" %} 26 | 27 | {% page-ref page="metrics.md" %} 28 | 29 | {% page-ref page="baselines.md" %} 30 | 31 | 32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /course-content/setting-up-machine-learning-projects/archetypes.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: What are the different archetypes of machine learning projects?
3 | --- 4 | 5 | # Archetypes 6 | 7 | {% embed url="https://www.youtube.com/watch?v=QlcoixU55VU" caption="Archetypes - ML Projects" %} 8 | 9 | * Archetype 1 - Projects that **improve an existing process**: improving route optimization in a ride-sharing service, building a customized churn prevention model, building a better video game AI. 10 | * Archetype 2 - Projects that **augment a manual process**: turning mockup designs into application UI, building a sentence auto-completion feature, helping a doctor do his/her job more efficiently. 11 | * Archetype 3 - Projects that **automate a manual process**: developing autonomous vehicles, automating customer service, automating website design. 12 | * The **data flywheel** is the idea that more users lead to more data, more data leads to better models, and better models lead to more users. 13 | 14 | -------------------------------------------------------------------------------- /course-content/setting-up-machine-learning-projects/baselines.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | How to choose a good baseline to know whether your model is performing well or 4 | not? 5 | --- 6 | 7 | # Baselines 8 | 9 | {% embed url="https://www.youtube.com/watch?v=wfTk7Lb9uPg" caption="Baselines - ML Projects" %} 10 | 11 | * A **baseline** is a model that is both simple to set up and has a reasonable chance of providing decent results. It gives you a lower bound on expected model performance. 12 | * Your choice of a simple baseline depends on the kind of data you are working with and the kind of task you are targeting. 13 | * You can look for **external baselines** such as business and engineering requirements, as well as published results from academic papers that tackle your problem domain. 14 | * You can also look for **internal baselines** using simple models and human performance. 15 | * There is a tradeoff between cost and quality when designing human baselines. More specialized domains require more skilled labelers, so you should find cases where the model performs worse and concentrate the data collection effort there. 16 | 17 | -------------------------------------------------------------------------------- /course-content/setting-up-machine-learning-projects/lifecycle.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: What is the lifecycle of a machine learning project? 3 | --- 4 | 5 | # Lifecycle 6 | 7 | {% embed url="https://www.youtube.com/watch?v=JzfR8pOtxZc" caption="Lifecycle - ML Projects" %} 8 | 9 | * Phase 1 is **Project Planning and Project Setup**: At this phase, we want to decide the problem to work on, determine the requirements and goals, as well as figure out how to allocate resources properly. 10 | * Phase 2 is **Data Collection and Data Labeling**: At this phase, we want to collect training data \(images, text, tabular, etc.\) and potentially annotate them with ground truth, depending on the specific sources where they come from. 11 | * Phase 3 is **Model Training and Model Debugging**: At this phase, we want to implement baseline models quickly, find and reproduce state-of-the-art methods for the problem domain, debug our implementation, and improve the model performance for specific tasks. 12 | * Phase 4 is **Model Deployment and Model Testing**: At this phase, we want to pilot the model in a constrained environment, write tests to prevent regressions, and roll the model into production.
13 | 14 | -------------------------------------------------------------------------------- /course-content/setting-up-machine-learning-projects/metrics.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How do you pick metrics to optimize your machine learning project? 3 | --- 4 | 5 | # Metrics 6 | 7 | {% embed url="https://www.youtube.com/watch?v=-US7jrlz3wM" caption="Metrics - ML Projects" %} 8 | 9 | * In most real-world projects, you usually care about a lot of metrics. Because machine learning systems work best when optimizing a single number, you need to pick a formula for combining different metrics of interest. 10 | * The first way is to do a simple **average** \(or **weighted average**\) of these metrics. 11 | * The second way is to apply **thresholds** to some metrics \(requiring them to meet minimum acceptable values\) and optimize the remaining one. The choice of thresholding metrics is a domain judgment call, but you would probably want to threshold ones that are least sensitive to model choice and closest to their desirable values. 12 | * The third way is to use a more **complex / domain-specific** formula. A solid process here is to first enumerate all the project requirements, then evaluate your model's current performance, compare it to the requirements, and finally revisit the metric as your numbers improve. 13 | 14 | 15 | 16 | -------------------------------------------------------------------------------- /course-content/setting-up-machine-learning-projects/overview.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | According to a 2019 report, 85% of AI projects fail to deliver on their 4 | intended promises to the business. Why do so many projects fail? 5 | --- 6 | 7 | # Overview 8 | 9 | {% embed url="https://youtu.be/c5bZO95kKY8" caption="Overview - ML Projects" %} 10 | 11 | ## Summary 12 | 13 | * ML is still research; therefore, it is very challenging to aim for a 100% success rate. 14 | * Many ML projects are technically infeasible or poorly scoped. 15 | * Many ML projects never make the leap into production. 16 | * Many ML projects have unclear success criteria. 17 | * Many ML projects are poorly managed. 18 | 19 | -------------------------------------------------------------------------------- /course-content/setting-up-machine-learning-projects/prioritizing.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How do you decide which machine learning projects to work on? 3 | --- 4 | 5 | # Prioritizing 6 | 7 | {% embed url="https://www.youtube.com/watch?v=Dx3hKmNJ-1E" caption="Prioritizing - ML Projects" %} 8 | 9 | * The project should have **high impact,** where cheap prediction is valuable for the complex parts of your business process. 10 | * The project should have **high feasibility,** which is driven by the data availability, accuracy requirements, and problem difficulty. 11 | * Here are 3 types of hard machine learning problems: 12 | * \(1\) The output is complex. 13 | * \(2\) Reliability is required. 14 | * \(3\) Generalization is expected.
15 | 16 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/README.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: The Testing and Deployment Phase of Your Machine Learning Workflow 3 | --- 4 | 5 | # Testing and Deployment 6 | 7 | ![It's useful to break down the different systems and tests necessary for a successful ML project.](../../.gitbook/assets/cleanshot-2020-07-06-at-19.14.01-2x.png) 8 | 9 | {% hint style="info" %} 10 | As always, please [submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) if any information is out of date! 11 | {% endhint %} 12 | 13 | ## Slides 14 | 15 | {% embed url="https://www.slideshare.net/sergeykarayev/testing-and-deployment-full-stack-deep-learning" %} 16 | 17 | ## Videos 18 | 19 | {% page-ref page="project-structure.md" %} 20 | 21 | {% page-ref page="ml-test-score.md" %} 22 | 23 | {% page-ref page="ci-testing.md" %} 24 | 25 | {% page-ref page="docker.md" %} 26 | 27 | {% page-ref page="web-deployment.md" %} 28 | 29 | {% page-ref page="monitoring.md" %} 30 | 31 | {% page-ref page="hardware-mobile.md" %} 32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/ci-testing.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: What do testing and continuous integration mean? 3 | --- 4 | 5 | # CI / Testing 6 | 7 | {% embed url="https://youtu.be/VRe4xkgHjaM" caption="CI/Testing - Testing and Deployment" %} 8 | 9 | ## Summary 10 | 11 | * **Unit tests** are designed for specific module functionality. 12 | * **Integration tests** are designed for the whole system. 13 | * **Continuous integration** is an environment where tests are run every time new code is pushed to the repository, before the updated model is deployed. 14 | * A quick survey of continuous integration tools yields several options: CircleCI, Travis CI, Jenkins, and Buildkite. 15 | 16 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/docker.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: What is Docker? 3 | --- 4 | 5 | # Docker 6 | 7 | {% embed url="https://youtu.be/GJWPmff2df8" caption="Docker - Testing and Deployment" %} 8 | 9 | ## Summary 10 | 11 | * Docker is a computer program that performs operating-system-level virtualization, also known as **containerization**. 12 | * A container is a standardized unit of fully packaged software used for local development, shipping code, and deploying systems. 13 | * Though Docker solves the problem of packaging and running each individual microservice, we still need an orchestrator to handle the whole cluster of services. For that, **Kubernetes** is the open-source winner, and it has excellent support from the leading cloud vendors. 14 | 15 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/hardware-mobile.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to deploy your models to hardware and mobile devices?
3 | --- 4 | 5 | # Hardware/Mobile 6 | 7 | {% embed url="https://youtu.be/uq6soVz\_IuU" caption="Hardware and Mobile - Testing and Deployment" %} 8 | 9 | ## Summary 10 | 11 | * Embedded and mobile devices have low-powered processors and little memory, which makes computation slow and expensive. Often, we can try some tricks such as reducing network size, quantizing the weights, and distilling knowledge. 12 | * Both **pruning** and **quantization** are model compression techniques that make the model physically smaller to save disk space and reduce its memory footprint during computation so it runs faster. 13 | * **Knowledge distillation** is a compression technique in which a small “student” model is trained to reproduce the behavior of a large “teacher” model. 14 | * Embedded and mobile PyTorch/TensorFlow frameworks are less fully featured than the full PyTorch/TensorFlow frameworks. Therefore, we have to be careful with the model architecture. An alternative option is to use an interchange format. 15 | * **Mobile machine learning frameworks** are regularly in flux: Tensorflow Lite, PyTorch Mobile, CoreML, MLKit, FritzAI. 16 | * The best solution in the industry for **embedded** devices is NVIDIA. 17 | * The **Open Neural Network Exchange** \(ONNX for short\) is designed to allow framework interoperability. 18 | 19 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/ml-test-score.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How can you test your machine learning system? 3 | --- 4 | 5 | # ML Test Score 6 | 7 | {% embed url="https://youtu.be/SIoYEd7VPDQ" caption="ML Test Score - Testing and Deployment" %} 8 | 9 | ## Summary 10 | 11 | * [ML Test Score: A Rubric for Production Readiness and Technical Debt Reduction](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf) is an exhaustive framework/checklist from practitioners at Google. 12 | * The paper presents a rubric as a set of 28 actionable tests and offers a scoring system to measure how ready for production a given machine learning system is. These are categorized into 4 sections: \(1\) data tests, \(2\) model tests, \(3\) ML infrastructure tests, and \(4\) monitoring tests. 13 | * The scoring system gives ML system developers an incentive to achieve stable levels of reliability by providing a clear indicator of readiness and clear guidelines for how to improve. 14 | 15 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/monitoring.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to monitor your machine learning system? 3 | --- 4 | 5 | # Monitoring 6 | 7 | {% embed url="https://youtu.be/dYqhcXOi7JM" caption="Monitoring - Testing and Deployment" %} 8 | 9 | ## Summary 10 | 11 | * It is crucial to monitor serving systems, training pipelines, and input data. A typical monitoring system can **raise alarms** when things go wrong and provide records for tuning things. 12 | * Cloud providers have decent monitoring solutions. 13 | * Anything that can be logged can be monitored: dependency changes, distribution shift in data, model instabilities, etc. 14 | * **Data distribution monitoring** is an underserved need!
15 | * It is important to monitor the **business uses** of the model, not just its statistics. Furthermore, it is important to be able to **contribute failures** back to the dataset. 16 | 17 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/project-structure.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: What are the different components of a machine learning system? 3 | --- 4 | 5 | # Project Structure 6 | 7 | {% embed url="https://youtu.be/uctx9L0tuCc" caption="Project Structure - Testing and Deployment" %} 8 | 9 | ## Summary 10 | 11 | * The **prediction system** involves code to process input data, to construct networks with trained weights, and to make predictions. 12 | * The **training system** processes raw data, runs experiments, and manages results. 13 | * The goal of any prediction system is to be deployed into the **serving system**. Its purpose is to serve predictions and to scale to demand. 14 | * **Training and validation data** are used in conjunction with the training system to generate the prediction system. 15 | * At production time, we have **production data** that has not been seen before and can only be served by the serving system. 16 | * The prediction system should be tested by **functionality** to catch code regressions and by **validation** to catch model regressions. 17 | * The training system should have its own tests to catch upstream regressions \(changes in data sources, upgrades of dependencies\). 18 | * For production data, we need **monitoring** that raises alerts for downtime, errors, distribution shifts, etc., and catches service and data regressions. 19 | 20 | -------------------------------------------------------------------------------- /course-content/testing-and-deployment/web-deployment.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to deploy your models to the web? 3 | --- 4 | 5 | # Web Deployment 6 | 7 | {% embed url="https://youtu.be/1BegV7gjqDo" caption="Web Deployment - Testing and Deployment" %} 8 | 9 | ## Summary 10 | 11 | * For web deployment, you need to be familiar with the concept of a **REST API.** 12 | * You can deploy the code to Virtual Machines, and then scale by adding instances. 13 | * You can deploy the code as containers, and then scale via orchestration. 14 | * You can deploy the code as a “serverless function.” 15 | * You can deploy the code via a model serving solution. 16 | * If you are doing **CPU inference**, you can get away with scaling by launching more servers \(Docker\), or going serverless \(AWS Lambda\). 17 | * If you are using **GPU inference**, things like TF Serving and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html#rayserve) become useful with features such as adaptive batching. 18 | 19 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/README.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How To Troubleshoot Your Deep Learning Models 3 | --- 4 | 5 | # Training and Debugging 6 | 7 | ![We recommend a simple workflow for training and debugging neural networks.](../../.gitbook/assets/debugging_overview.jpg) 8 | 9 | {% hint style="info" %} 10 | As always, please [submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) if any information is out of date!
11 | {% endhint %} 12 | 13 | ## Slides 14 | 15 | {% embed url="https://www.slideshare.net/sergeykarayev/troubleshooting-deep-neural-networks-full-stack-deep-learning" %} 16 | 17 | ## Videos 18 | 19 | {% page-ref page="overview.md" %} 20 | 21 | {% page-ref page="start-simple.md" %} 22 | 23 | {% page-ref page="debug.md" %} 24 | 25 | {% page-ref page="evaluate.md" %} 26 | 27 | {% page-ref page="improve.md" %} 28 | 29 | {% page-ref page="tune.md" %} 30 | 31 | {% page-ref page="conclusion.md" %} 32 | 33 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/conclusion.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: What are the key takeaways to troubleshoot deep neural networks? 3 | --- 4 | 5 | # Conclusion 6 | 7 | {% embed url="https://www.youtube.com/watch?v=Ja414543TBM" caption="Conclusion - Troubleshooting" %} 8 | 9 | ## Summary 10 | 11 | * Deep learning debugging is hard due to many competing sources of error. 12 | * To train bug-free deep learning models, you need to treat building them as an iterative process. 13 | * Choose the simplest model and data possible. 14 | * Once the model runs, overfit a single batch and reproduce a known result. 15 | * Apply the bias-variance decomposition to decide what to do next. 16 | * Use coarse-to-fine random searches to tune the model’s hyper-parameters. 17 | * Make your model bigger if your model under-fits and add more data and/or regularization if your model over-fits. 18 | 19 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/debug.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to implement and debug deep learning models? 3 | --- 4 | 5 | # Debug 6 | 7 | {% embed url="https://www.youtube.com/watch?v=d07le7otRUM" caption="Debug - Troubleshooting" %} 8 | 9 | ## Summary 10 | 11 | * The 5 most common bugs in deep learning models include: 12 | * Incorrect shapes for tensors. 13 | * Pre-processing inputs incorrectly. 14 | * Incorrect input to the loss function. 15 | * Forgetting to set train mode for the network correctly. 16 | * Numerical instability - inf/NaN. 17 | * 3 pieces of general advice for implementing models: 18 | * Start with **a lightweight implementation**. 19 | * Use **off-the-shelf components** such as Keras if possible, since most of the stuff in Keras works well out-of-the-box. 20 | * Build **complicated data pipelines later**. 21 | * The first step is to **get the model to run**: 22 | * For **shape mismatch and casting issues**, you should step through your model creation and inference step-by-step in a debugger, checking for correct shapes and data types of your tensors. 23 | * For **out-of-memory issues**, you can scale back your memory-intensive operations one-by-one. 24 | * For **other issues**, simply Google it; Stack Overflow will have the answer most of the time. 25 | * The second step is to have the model **overfit a single batch**: 26 | * **Error goes up:** Commonly this is due to a flipped sign somewhere in the loss function/gradient. 27 | * **Error explodes:** This is usually a numerical issue, but can also be caused by a high learning rate. 28 | * **Error oscillates:** You can lower the learning rate and inspect the data for shuffled labels or incorrect data augmentation. 29 | * **Error plateaus:** You can increase the learning rate and get rid of regularization.
Then you can inspect the loss function and the data pipeline for correctness. 30 | * The third step is to **compare the model to a known result**: 31 | * The most useful results come from **an official model implementation** **evaluated on a similar dataset to yours**. 32 | * If you can’t find an official implementation on a similar dataset, you can compare your approach to results from **an official model implementation evaluated on a benchmark dataset**. 33 | * If there is no official implementation of your approach, you can compare it to results from **an unofficial model implementation**. 34 | * Then, you can compare to results from **a paper with no code**, results from **the model on a benchmark dataset**, and results from **a similar model on a similar dataset**. 35 | * An under-rated source of results comes from **simple baselines**, which can help make sure that your model is learning anything at all. 36 | 37 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/evaluate.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to evaluate deep learning models? 3 | --- 4 | 5 | # Evaluate 6 | 7 | {% embed url="https://www.youtube.com/watch?v=wP6BkXcB\_Xg" caption="Evaluate - Troubleshooting" %} 8 | 9 | ## Summary 10 | 11 | * You want to apply **the bias-variance decomposition** concept here: _Test error = irreducible error + bias + variance + validation overfitting_. 12 | * If the training, validation, and test sets come from different data distributions, then you should use **2 validation sets**: one set sampled from the training distribution, and the other set sampled from the test distribution. 13 | 14 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/improve.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to improve deep learning models? 3 | --- 4 | 5 | # Improve 6 | 7 | {% embed url="https://www.youtube.com/watch?v=rlFHwTE5qPE" caption="Improve - Troubleshooting" %} 8 | 9 | ## Summary 10 | 11 | * The first step is to **address under-fitting:** 12 | * Add model complexity → Reduce regularization → Error analysis → Choose a more complex architecture → Tune hyper-parameters → Add features. 13 | * The second step is to **address over-fitting:** 14 | * Add more training data → Add normalization → Add data augmentation → Increase regularization → Error analysis → Choose a more complex architecture → Tune hyper-parameters → Early stopping → Remove features → Reduce model size. 15 | * The third step is to **address the distribution shift** present in the data: 16 | * Analyze test-validation errors and collect more training data to compensate. 17 | * Analyze test-validation errors and synthesize more training data to compensate. 18 | * Apply domain adaptation techniques to training and test distributions. 19 | * The final step, if applicable, is to **rebalance your datasets:** 20 | * If the model's performance on the validation set is significantly better than its performance on the test set, you over-fit to the validation set. 21 | * When this happens, you can recollect the validation data by re-shuffling the test/validation split \(a small decision sketch follows below\).
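As a rough illustration of how the bias-variance decomposition drives these decisions, here is a small, hypothetical helper that compares the error gaps and suggests a next step. The `tolerance` value and the error numbers are arbitrary assumptions chosen for illustration; in practice you would judge the gaps in the context of your problem.

```python
def suggest_next_step(train_error, val_error, test_error,
                      irreducible_error=0.0, tolerance=0.02):
    """Map the bias-variance decomposition onto a concrete next action.

    Test error = irreducible error + bias + variance + validation overfitting.
    """
    bias = train_error - irreducible_error
    variance = val_error - train_error
    val_overfitting = test_error - val_error

    if bias > tolerance:
        return "Under-fitting: add model complexity or reduce regularization."
    if variance > tolerance:
        return "Over-fitting: add data, augmentation, or regularization."
    if val_overfitting > tolerance:
        return "Validation overfitting: re-shuffle the test/validation split."
    return "Errors look balanced: run error analysis on remaining mistakes."

print(suggest_next_step(train_error=0.01, val_error=0.08, test_error=0.09))
# -> "Over-fitting: add data, augmentation, or regularization."
```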
22 | 23 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/overview.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: Why is deep learning troubleshooting hard? 3 | --- 4 | 5 | # Overview 6 | 7 | {% embed url="https://www.youtube.com/watch?v=cyxtSCrm6tA" caption="Overview - Troubleshooting" %} 8 | 9 | ## Summary 10 | 11 | * A common sentiment among practitioners is that they spend 80–90% of their time debugging and tuning the models, and only 10–20% of their time deriving math equations and implementing things. 12 | * Reproducing results in deep learning can be challenging due to various factors including implementation bugs, choices of model hyper-parameters, data/model fit, and the construction of data. 13 | 14 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/start-simple.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to start simple with deep learning models? 3 | --- 4 | 5 | # Start Simple 6 | 7 | {% embed url="https://www.youtube.com/watch?v=xTjPrYUmPhk" caption="Start Simple - Troubleshooting" %} 8 | 9 | ## Summary 10 | 11 | * **Choose a simple architecture**: 12 | * LeNet/ResNet for images. 13 | * LSTM for sequences. 14 | * Fully-connected network with one hidden layer for all other tasks. 15 | * **Use sensible hyper-parameter defaults**: 16 | * Adam optimizer with a “magic” learning rate value of 3e-4. 17 | * ReLU activation for fully-connected and convolutional models and TanH activation for LSTM models. 18 | * He initialization for ReLU and Glorot initialization for TanH. 19 | * No regularization and no data normalization. 20 | * **Normalize data inputs**: subtracting the mean and dividing by the standard deviation. 21 | * **Simplify the problem**: 22 | * Working with a small training set of around 10,000 examples. 23 | * Using a fixed number of objects, classes, input size, etc. 24 | * Creating a simpler synthetic training set like in research labs. 25 | 26 | -------------------------------------------------------------------------------- /course-content/training-and-debugging/tune.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: How to tune deep learning models? 3 | --- 4 | 5 | # Tune 6 | 7 | {% embed url="https://www.youtube.com/watch?v=8jhKNAHVfno" caption="Tune - Troubleshooting" %} 8 | 9 | ## Summary 10 | 11 | * Choosing which hyper-parameters to optimize is not an easy task since some are more sensitive than others and are dependent upon the choice of model. 12 | * **Low sensitivity**: Optimizer, batch size, non-linearity. 13 | * **Medium sensitivity**: weight initialization, model depth, layer parameters, weight of regularization. 14 | * **High sensitivity**: learning rate, annealing schedule, loss function, layer size. 15 | * Method 1 is **manual optimization:** 16 | * For a skilled practitioner, this may require the least amount of computation to get good results. 17 | * However, the method is time-consuming and requires a detailed understanding of the algorithm. 18 | * Method 2 is **grid search:** 19 | * Grid search is super simple to implement and can produce good results. 20 | * Unfortunately, it’s not very efficient since we need to train the model on all cross-combinations of the hyper-parameters.
It also requires prior knowledge about the parameters to get good results. 21 | * Method 3 is **random search:** 22 | * Random search is also easy to implement and often produces better results than grid search. 23 | * But it is not very interpretable and may also require prior knowledge about the parameters to get good results. 24 | * Method 4 is **coarse-to-fine search:** 25 | * This strategy helps you narrow in on very high-performing hyper-parameters and is a common practice in the industry. 26 | * The only drawback is that it is somewhat of a manual process. 27 | * Method 5 is **Bayesian optimization search:** 28 | * Bayesian optimization is generally the most efficient hands-off way to choose hyper-parameters. 29 | * But it’s difficult to implement from scratch and can be hard to integrate with off-the-shelf tools. 30 | 31 | -------------------------------------------------------------------------------- /course-content/where-to-go-next.md: -------------------------------------------------------------------------------- 1 | # Where to go next 2 | 3 | Deep Learning has a strong open-source culture. Many great learning resources exist on blogs, lectures, tutorials, newsletters, course websites, and code repositories. 4 | 5 | {% hint style="info" %} 6 | This section is mostly up to you! [Submit a pull request](https://github.com/full-stack-deep-learning/course-gitbook) to add helpful resources. 7 | {% endhint %} 8 | 9 | ## Conferences 10 | 11 | * [ScaledML](https://info.matroid.com/scaledml-media-archive-2020) \(by Matroid\) 12 | * [MLOps Conference](https://www.youtube.com/playlist?list=PLH8M0UOY0uy6d_n3vEQe6J_gRBUrISF9m) \(by Iguazio\) 13 | * [Spark + AI Summit](https://databricks.com/sparkaisummit/north-america-2020/agenda) \(by Databricks\) 14 | 15 | ## Newsletters 16 | 17 | * [The Batch](https://www.deeplearning.ai/thebatch/) \(by deeplearning.ai\) 18 | * [Machine Learning In Production](https://mlinproduction.com/machine-learning-newsletter/) \(by Luigi Patruno\) 19 | * [Import AI](https://jack-clark.net/about/) \(by Jack Clark\) 20 | * [The Machine Learning Engineer Newsletter](https://ethical.institute/mle.html) \(by The Institute for Ethical AI & ML\) 21 | * [Projects To Know](https://projectstoknow.amplifypartners.com/ml-and-data) \(by Amplify Partners\) 22 | 23 | ## Personal Blogs 24 | 25 | * [Locally Optimistic](https://locallyoptimistic.com/) \(Data Leaders in NYC\) 26 | * [MLOps Tooling Landscape](https://huyenchip.com/2020/06/22/mlops.html) \(by Chip Huyen\) 27 | * [Three Risks in Building Machine Learning Systems](https://insights.sei.cmu.edu/sei_blog/2020/05/three-risks-in-building-machine-learning-systems.html) \(by Benjamin Cohen\) 28 | * [How To Serve Models](http://bugra.github.io/posts/2020/5/25/how-to-serve-model/) \(by Bugra Akyildiz\) 29 | * [Nitpicking ML Technical Debt](https://matthewmcateer.me/blog/machine-learning-technical-debt/) \(by Matthew McAteer\) 30 | * [Monitoring ML Models in Production](https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/) \(by Christopher Samiullah\) 31 | * [Models for integrating data science teams within organizations](https://medium.com/@djpardis/models-for-integrating-data-science-teams-within-organizations-7c5afa032ebd) \(Pardis Noorzad\) 32 | * [Data-as-a-Product vs Data-as-a-Service](https://medium.com/@itunpredictable/data-as-a-product-vs-data-as-a-service-d9f7e622dc55) \(Justin Gage\) 33 | 34 | ## Corporate Blogs 35 | 36 | * [The New Business of
AI](https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-its-different-from-traditional-software/) \(by a16z\) 37 | * [Long-Tailed AI Problems](https://a16z.com/2020/08/12/taming-the-tail-adventures-in-improving-ai-economics/) \(by a16z\) 38 | * [Rules of ML](https://developers.google.com/machine-learning/guides/rules-of-ml) \(by Google\) 39 | * [Continuous delivery and automation pipelines in machine learning](https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning) \(by Google\) 40 | * [Tecton: The Data Platform for Machine Learning](https://www.tecton.ai/blog/data-platform-ml/) \(by Tecton\) 41 | * [Why We Need DevOps for ML Data](https://www.tecton.ai/blog/devops-ml-data/) \(by Tecton\) 42 | * [Continuous Delivery for Machine Learning](https://martinfowler.com/articles/cd4ml.html) \(by ThoughtWorks\) 43 | * [Dagster: The Data Orchestrator](https://medium.com/dagster-io/dagster-the-data-orchestrator-5fe5cadb0dfb) \(by Elementl\) 44 | * [State of Machine Learning Model Servers In Production](https://anyscale.com/blog/heres-what-you-need-to-look-for-in-a-model-server-to-build-ml-powered-services/) \(by Anyscale\) 45 | 46 | ## Repositories 47 | 48 | * [Awesome Production Machine Learning](https://github.com/EthicalML/awesome-production-machine-learning) \(by The Institute for Ethical AI & Machine Learning\) 49 | * [MLOps References](https://ml-ops.org/content/references.html) \(by InnoQ\) 50 | * [ML Applied in Production](https://github.com/eugeneyan/applied-ml) \(by Eugene Yan\) 51 | * [Feature Stores for ML](http://featurestore.org/) \(by KTH Royal Institute of Technology\) 52 | * [Feature Store: The Missing Data Layer in ML Pipelines?](https://www.logicalclocks.com/blog/feature-store-the-missing-data-layer-in-ml-pipelines) \(by Logical Clocks\) 53 | 54 | ## Tutorials 55 | 56 | * [Deep Learning for OCR, Document Analysis, Text Recognition, and Language Modeling](https://github.com/tmbdev-tutorials/icdar2019-tutorial) \(ICDAR 2019\) 57 | * [Image Retrieval in the Wild](https://matsui528.github.io/cvpr2020_tutorial_retrieval/) \(CVPR 2020\) 58 | * [Handwritten Text Recognition \(OCR\) with MXNet Gluon](https://github.com/awslabs/handwritten-text-recognition-for-apache-mxnet) \(AWS Labs\) 59 | 60 | -------------------------------------------------------------------------------- /guest-lectures/andrej-karpathy-tesla.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: "Andrej is currently Senior Director of AI at Tesla, and was formerly a Research Scientist at OpenAI. His educational materials about deep learning remain among the most popular." 3 | --- 4 | 5 | # Andrej Karpathy \(Tesla\) 6 | 7 | {% embed url="https://youtu.be/ZVJTqAuPvTU" caption="Programming The Software 2.0 Stack" %} 8 | 9 | ### Andrej's Landmark Post - "[Software 2.0](https://medium.com/@karpathy/software-2-0-a64152b37c35)" 10 | 11 | * **Software 1.0** consists of explicit instructions to the computer written by a programmer. 12 | * **Software 2.0** can be written in much more abstract, human-unfriendly language, such as the weights of a neural network. 13 | * In software 2.0, we restrict the search to a continuous subset of the program space where the search process can be made efficient with back-propagation and stochastic gradient descent \(a toy sketch follows below\).
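As a toy illustration of that last bullet, the sketch below "writes" a Software 2.0 program by searching weight space with stochastic gradient descent instead of typing out the logic by hand. The task \(learning the logical AND function\) and all settings are illustrative assumptions, not examples from the lecture.

```python
import torch

# "Specify" the program by examples: the logical AND function.
inputs = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
targets = torch.tensor([[0.], [0.], [0.], [1.]])

# The "program" is a weight vector; gradient descent searches program space.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(500):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The searched-for "program" should now implement AND: [0, 0, 0, 1].
print(torch.sigmoid(model(inputs)).round())
```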
14 | 15 | ![Source: https://medium.com/@karpathy/software-2-0-a64152b37c35](../.gitbook/assets/softwarev2.png) 16 | 17 | ### Programming In The 2.0 Stack 18 | 19 | * If optimization is doing most of the coding, what are the humans doing? 20 | * 2.0 programmers label data 21 | * 1.0 programmers maintain the surrounding "dataset infrastructure": 22 | * Visualize data 23 | * Create and edit labels 24 | * Bubble up likely mislabeled examples 25 | * Suggest data to label 26 | * Flag labeler disagreements 27 | * **Data labeling** is highly iterative and non-trivial. 28 | * Lane lines are different across the world. 29 | * Cars have different shapes and sizes. 30 | * Even traffic lights and traffic signs can be ambiguous. 31 | * **Label imbalances** are very frequent. 32 | * **Data imbalances** are very common. 33 | 34 | ⇒ **Realistic datasets**: high label and data imbalances, noisy labels, highly multi-task, semi-supervised, active. 35 | 36 | ### 2.0 IDEs 37 | 38 | * Show a full inventory and statistics of the current dataset. 39 | * Create and edit annotation layers for any data point. 40 | * Flag, escalate, and resolve discrepancies in multiple labels. 41 | * Flag and escalate data points that are likely to be mislabeled. 42 | * Display predictions on an arbitrary set of test data points. 43 | * Auto-suggest data points that should be labeled. 44 | 45 | ⇒ **Can we build GitHub for Software 2.0?** 46 | 47 | -------------------------------------------------------------------------------- /guest-lectures/chip-huyen-nvidia.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Chip created the TensorFlow for Deep Learning Research course at Stanford 4 | University, has worked on production ML teams at Snorkel and Nvidia, and has 5 | published many popular resources for ML Engineers. 6 | --- 7 | 8 | # Chip Huyen \(Snorkel\) 9 | 10 | {% embed url="https://youtu.be/pli1K75PSa8" caption="Machine Learning Interviews" %} 11 | 12 | ### Machine Learning Jobs 13 | 14 | #### Research vs Applied Research 15 | 16 | * Research finds the answers to **fundamental** questions and expands the body of theoretical knowledge. Applied research finds solutions to **practical** problems. 17 | * Research focuses on **long-term** outcomes, while applied research focuses on **immediate** commercial outcomes. 18 | * Most cutting-edge research is spearheaded by **big corporations**. 19 | * Machine learning research is highly **empirical** at this point. 20 | 21 | #### Research Scientist vs Research Engineer 22 | 23 | * A research scientist develops **original** ideas, while a research engineer **actualizes** those ideas. 24 | * **A Ph.D. degree** is often required for research scientist roles. 25 | * Starting as a research engineer is a natural path to becoming a research scientist. 26 | * In some organizations, these roles often **overlap**. 27 | 28 | #### Data Scientist vs Machine Learning Engineer 29 | 30 | * A data scientist **extracts knowledge and insights** from data, while a machine learning engineer **builds models** to turn data into products. 31 | * An engineering skillset is **a top priority** for the latter. 32 | * ML Engineers at **startups** might spend more time on cleaning data, setting up infrastructure, and deploying models than training models. 33 | 34 | #### Big Companies vs Startups 35 | 36 | * Big companies can afford research, while startups cannot. 37 | * Big companies can afford specialists, while startups need generalists.
38 | * Big companies have a standardized hiring process, while startups make up the process as they go. 39 | 40 | ### Getting a Machine Learning Job 41 | 42 | #### Six Common Paths 43 | 44 | 1. BS/MS in ML → ML Engineer \(Tech Ivies → FAANG/Startups\) 45 | 2. Ph.D. in ML → ML Researcher \(Published at Top-Tier Conferences → FAANG/ML-First Startups\) 46 | 3. Data Scientist → On-The-Job Training → ML Engineer/ML Researcher \(Data Scientists in companies that want to start using ML\) 47 | 4. Software Engineer → Courses → ML Engineer \(Software Engineers who want to transition into ML\) 48 | 5. Adjacent Fields → On-The-Job Training → ML Researcher \(Ph.D. from fields like physics, math, neuroscience\) 49 | 6. Unrelated Fields → Residency/Fellowship → ML Researcher \(People in fields like healthcare, architecture, art, etc. who go through programs in big companies\) 50 | 51 | #### Senior Role vs Junior Role 52 | 53 | * Companies hire senior roles for **skills** and junior roles for **attitude**. 54 | 55 | #### Ph.D. or Nah? 56 | 57 | * The only role that might require a Ph.D. is **\(applied\) research scientist**. 58 | * We need **more engineers** to improve and productize research. 59 | 60 | ### Understanding The Interviewers' Mindset 61 | 62 | * **Companies Hate Hiring**: 63 | * Expensive for companies. 64 | * Stressful for hiring managers. 65 | * Boring for interviewers. 66 | * Companies want the best people who can do **a reasonable job** within **time and monetary constraints**. 67 | * Companies **don't know what they are hiring for**. The job descriptions are for reference purposes only. 68 | * Most recruiters rely on **weak signals** such as previous employers, degrees, awards/papers, GitHub/Kaggle, and referrals. 69 | * Placing too much importance on voluntary activities \(like contributing to open-source or participating in Kaggle competitions\) **punishes** candidates from **less privileged** backgrounds. 70 | * Most interviewers have **little or no training** \(even at big companies\). 71 | * The interview outcome depends on **many random variables**; it does not reflect your ability or your self-worth. 72 | 73 | ### Interview Process 74 | 75 | #### The Steps 76 | 77 | 1. Resume Screen 78 | 2. Phone Screen 79 | 3. Coding Challenges or Take-Home Assignments 80 | 4. Technical Offsite Interviews \(1-2\) 81 | 5. Onsite Interviews \(4-8\) 82 | 83 | #### Bad Interview Questions 84 | 85 | 1. Questions that ask for retention of knowledge that can be easily looked up. 86 | 2. Questions that evaluate irrelevant skills. 87 | 3. Questions whose solutions rely on a single insight. 88 | 4. Questions that try to evaluate multiple skills at once. 89 | 5. Questions that use specific hard-to-remember names. 90 | 6. Open-ended questions with one expected answer. 91 | 7. Easy questions during later interview rounds. 92 | 93 | ⇒ For **good interview questions,** check out [this list](https://github.com/chiphuyen/machine-learning-systems-design/blob/master/content/exercises.md) curated by Chip herself. 94 | 95 | #### Alternative Interview Formats 96 | 97 | * Multiple choice quiz 98 | * Code debugging 99 | * Pair programming 100 | * Good cop, bad cop 101 | 102 | ### Recruiting Pipeline 103 | 104 | * The higher the onsite-to-offer ratio, the more likely offers are accepted. 105 | * Most junior roles are sourced through campus or referrals. 106 | * "**Be so good they can't ignore you**" 107 | * Candidates with negative experiences are less likely to accept offers.
108 | 109 | ![Tips To Prepare for ML Interviews](../.gitbook/assets/ml-interview-tips.png) 110 | 111 | -------------------------------------------------------------------------------- /guest-lectures/franziska-bell-toyota-research.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Franziska is currently the Senior Director at Toyota Research Institute, 4 | and was formerly Director of Data Science at Uber 5 | --- 6 | 7 | # Franziska Bell \(Toyota Research\) 8 | 9 | Please stand by: we are working with Uber to be able to release Franziska's lecture video! 10 | 11 | -------------------------------------------------------------------------------- /guest-lectures/jai-ranganathan-keeptruckin.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Jai is currently SVP Product at KeepTruckin, and was formerly VP of various AI 4 | and Data matters at Uber. 5 | --- 6 | 7 | # Jai Ranganathan \(KeepTruckin\) 8 | 9 | {% embed url="https://youtu.be/8PjTWFfjkeY" caption="End-To-End Use Case of Uber\'s COTA system" %} 10 | 11 | ### Uber's [Customer Obsession Ticket Assistant](https://eng.uber.com/cota/) \(COTA\) 12 | 13 | * A tool that uses machine learning and natural language processing techniques to help agents deliver better customer support. 14 | * Enables quick and efficient issue resolution for more than 90 percent of Uber's inbound support tickets. 15 | 16 | ### Challenge 17 | 18 | As Uber grows, so does the volume of support tickets: 19 | 20 | * Millions of tickets from riders, drivers, and eaters per week 21 | * Global scale, serving 600+ cities 22 | * Thousands of different types of issues users may encounter 23 | * Multilingual support 24 | 25 | ![https://eng.uber.com/cota/](../.gitbook/assets/uber-cota.png) 26 | 27 | ### Customer Support Platform 28 | 29 | * Steps in the workflow 30 | * User → Select Flow Node → Write Message → Contact Ticket → Customer Support Representative → Select Contact Type → Lookup Info and Policies → Select Action → Write Response Using a Reply Template → Response → User 31 | * Problems to solve 32 | * Issue prediction 33 | * Issue categorization 34 | * Ticket routing 35 | * Ticket volume 36 | * Policy optimization 37 | * Auto-response 38 | 39 | ### Exploration 40 | 41 | * Identify the right problems to solve 42 | * Use **analytics** to understand the value before all else 43 | * Know what **metrics** to optimize for 44 | * Understand whether Machine Learning is a good fit 45 | * Build with an eye on the **probabilistic** nature of Machine Learning solutions 46 | 47 | ### Development 48 | 49 | * Many possible solutions including basic Machine Learning techniques 50 | * Understand the cost-benefit of compute time vs accuracy 51 | * Deep learning is a fast-evolving space - keep up with the literature to understand the latest advances 52 | * Validate your results with visualization 53 | 54 | ### Deployment 55 | 56 | * Architectural complexity: feature engineering and training have special needs 57 | * Deep learning is still slow!
Distributed deep learning can help a lot and is getting better 58 | * Good experiment design is required to validate the models 59 | 60 | ### Monitoring 61 | 62 | * Dynamic business problems require retraining strategies with well-thought-out safe deployment 63 | * Continuous improvement of labeling will make your models better 64 | * Look for edge cases where your models fail to find room for model improvements 65 | 66 | -------------------------------------------------------------------------------- /guest-lectures/jeremy-howard-fast.ai.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Jeremy Howard is the co-founder of fast.ai, a research institute dedicated to 4 | making deep learning more accessible. Previously, Jeremy founded the med tech 5 | startup Enlitic, and was President of Kaggle. 6 | --- 7 | 8 | # Jeremy Howard \(Fast.ai\) 9 | 10 | {% embed url="https://www.youtube.com/watch?v=hZd3X\_nGdew" caption="Tricks To Train Deep Learning Models " %} 11 | 12 | * Instead of automating the machine learning process, we should study how to augment it via **human-in-the-loop**. 13 | * [Platform.ai](http://platform.ai) is a unique visual and code-free tool that labels images and trains computer vision models. 14 | * Here are lessons learned from optimizing hyper-parameters for image datasets using [fast.ai](http://fast.ai): 15 | * Stick with a sensible learning rate \(most of the time, the default is good\). 16 | * With Test-Time Augmentation search, you can beat state-of-the-art results even if they use specialized models. 17 | * Progressive resizing is amazing. 18 | * Heatmaps are useful to visualize what's happening. 19 | * 1cycle is a big time-saver. 20 | * For transfer learning, always train later layers more: \(1\) gradual unfreezing and \(2\) discriminative learning rates. 21 | * Use AdamW optimizer. 22 | * If you are doing tons of epochs, consider clipping gradients or annealing Adam's episodes. 23 | 24 | -------------------------------------------------------------------------------- /guest-lectures/lukas-biewald-weights-and-biases.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Lukas is co-founder and CEO of Weights & Biases, an ML tooling company. He 4 | previously co-founded and led data labeling company Figure Eight (acquired by 5 | Appen). 6 | --- 7 | 8 | # Lukas Biewald \(Weights & Biases\) 9 | 10 | {% embed url="https://www.youtube.com/watch?v=25\_kBogrzrs" caption="Deep Learning In The Wild" %} 11 | 12 | * Machine Learning can be unpredictable and opaque. 13 | * Deep Learning can be vulnerable to hacking. 14 | * Machine Learning requires tons of clean training data. 15 | * Deep Learning and GPUs break a lot of assumptions. 16 | * Machine Learning can look at far more data than humans. 17 | * The combination of humans and computers is powerful. 18 | * **What's coming?** 19 | * Better tools and platforms. 20 | * More medical applications. 21 | * New solutions to training data. 22 | * How to \(**successfully**\) ship deep learning projects: 23 | * Pay a lot of attention to your training data. 24 | * Get something working end-to-end right away, then improve one thing at a time. 25 | * Look for graceful ways to handle the inevitable cases where the algorithm fails.
26 | 27 | Mentioned Resources: 28 | 29 | * [The State of Machine Intelligence](http://www.shivonzilis.com/machineintelligence) \(curated by [Shivon Zilis](https://twitter.com/shivon)\) 30 | * Read Lukas's article: "[Why are Machine Learning Projects so Hard to Manage?](https://medium.com/@l2k/why-are-machine-learning-projects-so-hard-to-manage-8e9b9cf49641)" 31 | * [Datasets over Algorithms](http://www.spacemachine.net/views/2016/3/datasets-over-algorithms) \(credit to [Alex Wissner-Gross](https://www.alexwg.org/)\) 32 | 33 | -------------------------------------------------------------------------------- /guest-lectures/raquel-urtasun-uber-atg.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Raquel is currently the Chief Scientist and Head of Uber ATG, and also a 4 | Professor at the University of Toronto 5 | --- 6 | 7 | # Raquel Urtasun \(Uber ATG\) 8 | 9 | Please stand by: we are working with Uber to be able to release Raquel's lecture video! 10 | 11 | -------------------------------------------------------------------------------- /guest-lectures/richard-socher-salesforce.md: -------------------------------------------------------------------------------- 1 | --- 2 | description: >- 3 | Richard is Chief Scientist at Salesforce, which he joined through acquisition 4 | of his startup Metamind. Previously, Richard was a professor in the Stanford 5 | CS department. 6 | --- 7 | 8 | # Richard Socher \(Salesforce\) 9 | 10 | {% embed url="https://www.youtube.com/watch?v=yvMgcLKuvVg" caption="decaNLP - A Benchmark for Generalized NLP" %} 11 | 12 | ### Why Unified Multi-Task Models for NLP? 13 | 14 | * **Multi-task learning** is a blocker for general NLP systems. 15 | * **Unified models** can decide how to transfer knowledge \(domain adaptation, weight sharing, transfer learning, and zero-shot learning\). 16 | * Unified **AND** multi-task models can: 17 | * More easily adapt to new tasks. 18 | * Make deploying to production X times simpler. 19 | * Lower the bar for more people to solve new tasks. 20 | * Potentially move towards continual learning. 21 | 22 | ### The 3 Major NLP Task Categories 23 | 24 | 1. **Sequence tagging**: named entity recognition, aspect-specific sentiment. 25 | 2. **Text classification**: dialogue state tracking, sentiment classification. 26 | 3. **Sequence-to-sequence**: machine translation, summarization, question answering. 27 | 28 | ⇒ They correspond to the 3 equivalent super-tasks of NLP: **Language Modeling**, **Question Answering**, and **Dialogue**. 29 | 30 | ### A Multi-Task Question Answering Network for [decaNLP](http://decanlp.com/) 31 | 32 | #### Methodology 33 | 34 | * Start with a context. 35 | * Ask a question. 36 | * Generate the answer one word at a time by: 37 | * Pointing to context. 38 | * Pointing to question. 39 | * Or choosing a word from an external vocabulary. 40 | * The Pointer Switch chooses between those three options for each output word. 41 | 42 | #### Architecture Design 43 | 44 | ![Multi-Task Question Answering Network Architecture](../.gitbook/assets/mqan-architecture.png) 45 | 46 | ### [decaNLP: A Benchmark for Generalized NLP](https://github.com/salesforce/decaNLP) 47 | 48 | * Train a single question answering model for multiple NLP tasks \(aka questions\). 49 | * Framework for tackling: 50 | * More general language understanding. 51 | * Multi-task learning. 52 | * Domain adaptation. 53 | * Transfer learning. 54 | * Weight-sharing, pre-training, fine-tuning.
  * Zero-shot learning.

--------------------------------------------------------------------------------
/guest-lectures/xavier-amatriain.md:
--------------------------------------------------------------------------------
---
description: >-
  Co-founder and CTO at Curai. Previously: VP of Engineering at Quora, led
  Algorithms Engineering at Netflix.
---

# Xavier Amatriain \(Curai\)

{% embed url="https://youtu.be/5ygO8FxNB8c" caption="Lessons Learned From Building Practical Deep Learning Systems" %}

### Lesson 1 - More Data or Better Models?

* **More data** is preferred when we have access to more features and our models have low bias.
* **Better models** are preferred when our feature set is low-dimensional.
* **Transfer learning** lowers the need for data. To use it effectively, we want to **fine-tune** the pre-trained models on **better data**.

### Lesson 2 - Simple Models >>> Complex Models

* **Occam's Razor**: given two models that perform more or less equally, you should always prefer the less complex one.
* Deep learning might not be the right choice, even if it squeezes out an extra 1% of accuracy.
* Reasons to use simple models include scalability, system complexity, maintenance, explainability, etc.

### Lesson 3 - Sometimes, You Need Complex Models

* More complex features may require a more complex model.
* A more complex model may not show improvements with a feature set that is too simple.

### Lesson 4 - You Should Care About Feature Engineering

* A **well-behaved** machine learning feature should be reusable, transformable, interpretable, and reliable.
* In deep learning, **architecture engineering** is the new feature engineering.

### Lesson 5 - Supervised vs Unsupervised Learning

* The most fascinating results of recent years come from **a combination** of the two approaches \(stacked autoencoders, unsupervised pre-training, etc.\).
* **Self-supervised learning** is a learning paradigm where we train a model using labels that are naturally part of the input data, rather than requiring separate external labels.

### Lesson 6 - Everything is an Ensemble

* Most practical applications of machine learning run an **ensemble**. You can use completely different approaches at the ensemble layer.
* An ensemble is also a way to turn any model into a feature!

### Lesson 7 - There are Biases in Your Data

* **Biases** can creep into the data labels, or even into how results are presented to end users.
* These biases lead to a lack of **fairness** in machine learning.

### Lesson 8 - Think About Models "In the Wild"

* Two desired properties of models in the wild are:
  * **Easily extensible**: incrementally/iteratively learn from a "human-in-the-loop" or from additional data.
  * **Knows what it does not know**: model the uncertainty of predictions and enable a fall-back to manual handling.

### Lesson 9 - Choose The Right Evaluation Approach

* Evaluation metrics used during **offline** and **online** experiments must match!
* **A/B tests** help measure differences in metrics across statistically identical populations that each experience a different algorithm \(a minimal sketch follows this list\).
* Use **long-term metrics** whenever possible.
* **Short-term metrics** can be informative and allow faster decisions.
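To make the A/B-testing point concrete, here is a minimal sketch of comparing a binary metric \(e.g., click-through rate\) between two statistically identical populations with a two-proportion z-test. The counts and the significance level are hypothetical, and the lecture does not prescribe this particular test; it simply illustrates the kind of online comparison described above.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-sided p-value
    return z, p_value

# Hypothetical experiment: algorithm B vs. algorithm A, 10,000 users each.
z, p = two_proportion_z_test(success_a=520, n_a=10_000, success_b=580, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # declare a winner at alpha = 0.05 only if p < 0.05
```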

### Lesson 10 - Do Not Underestimate the Value of Systems and Frameworks

* You should apply the best **software engineering** practices when designing machine learning systems \(encapsulation, abstraction, cohesion, low coupling, etc.\).
* However, **design patterns** for machine learning software are not yet well-known or well-documented.

### Lesson 11 - Your Machine Learning Infrastructure Will Have Two Masters

* Whenever you develop any ML infrastructure, you need to target two different modes:
  * **ML experimentation**, which emphasizes flexibility, reusability, and ease of use.
  * **ML production**, which adds a new layer of performance and scalability requirements.
* To combine the two:
  * Research should be done with the same tools that are used in production.
  * Abstraction layers should be implemented on top of the optimized production code so that it can be accessed from friendly experimentation tools.

### Lesson 12 - There is Machine Learning Beyond Deep Learning

* Examples of other ML approaches include XGBoost, tensor methods, factorization machines, non-parametric Bayesian methods, etc.
* Sometimes, deep learning methods do not outperform these simpler approaches.

--------------------------------------------------------------------------------
/guest-lectures/yangqing-jia-alibaba.md:
--------------------------------------------------------------------------------
---
description: >-
  Yangqing is currently the VP of AI / Big Data at Alibaba, and was formerly
  Director of AI Platform at Facebook. He co-created the Caffe and Caffe2 deep
  learning frameworks.
---

# Yangqing Jia \(Alibaba\)

{% embed url="https://youtu.be/s2MWUamDYUI" caption="What\'d Ya Mean by Frameworks?" %}

### The Progress of Deep Learning Frameworks

* 2008: Theano
* 2012: Torch7
* 2013: Caffe
* 2015: Keras, TensorFlow
* 2017: Caffe2, PyTorch, ONNX

### What Are The Deciding Factors?

* A framework is intended for model development.
* A framework helps to improve **developer efficiency**: trying out ideas faster \(debugging, interactive development, simplicity, intuitiveness\).
* A framework helps to improve **infrastructure efficiency**: running computation faster \(implementation, scalability, model definition, cross-platform requirements\).
* A good framework strikes a **balance** between developer efficiency and infrastructure efficiency.

#### Declarative Toolkits

* Examples include Theano, Caffe, MXNet, TensorFlow, and Caffe2.
* In these frameworks, we declare and compile models, then repeatedly execute them in a virtual machine.
* **Advantages**:
  * Easy to optimize.
  * Easy to serialize for production deployment.
* **Disadvantages**:
  * Non-intuitive programming model.
  * Difficult to design and maintain.

#### Imperative Toolkits

* Examples include PyTorch and Chainer.
* In these frameworks, we define and construct the models by running the computation; there is no separate execution engine \(both styles are sketched below\).
* **Advantages**:
  * Intuitive to write programs in.
  * Easy to design, debug, and iterate on.
* **Disadvantages**:
  * Difficult to optimize, since there is no domain-specific language.
  * Hard to deploy on multiple platforms.
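Below is a minimal sketch of the contrast between the two styles. It assumes a TensorFlow 1.x installation for the declarative half \(TensorFlow 2 made eager execution the default and removed `tf.Session` from the top-level API\); neither snippet comes from the lecture itself.

```python
# Declarative style (TensorFlow 1.x): declare a graph first, execute it later.
import tensorflow as tf  # assumes TensorFlow 1.x

x = tf.placeholder(tf.float32, shape=[None, 3])  # symbolic input, no data yet
w = tf.Variable(tf.ones([3, 1]))
y = tf.matmul(x, w)                              # declares the op, computes nothing

with tf.Session() as sess:                       # the separate execution engine
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))

# Imperative style (PyTorch): computation runs as each line executes.
import torch

x = torch.tensor([[1.0, 2.0, 3.0]])
w = torch.ones(3, 1, requires_grad=True)
y = x @ w  # executes immediately; easy to inspect and debug
print(y)
```

The graph in the first half can be optimized and serialized before it ever runs, which is exactly what makes declarative toolkits production-friendly and imperative ones easier to debug.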

#### Facebook Example

* Research to production at Facebook:
  * PyTorch → Caffe2 \(2017\): reimplementation took weeks or months.
  * PyTorch → ONNX → Caffe2 \(2018\): enabled transferring models or model fragments.
  * PyTorch + Caffe2 \(2019-present\): combines the advantages of developer efficiency and infrastructure efficiency.
* Many frameworks have started adopting such a combination:
  * Keras/TF-Eager + TensorFlow
  * Gluon + MXNet

### How To Choose Frameworks

* Understand your **need**:
  * Developer efficiency: algorithm research? A startup? A proof of concept?
  * Infrastructure efficiency: systems research? Cross-platform support? Scale?
* Learn one framework and **focus on your problem**.
* It's fine to switch.
* The goal of frameworks is to improve **productivity**.

### Beyond Frameworks

* Within the tech stack, there is a **libraries** layer on top of frameworks: TF-Serving, CoreML, Clipper, Ray, etc.
* On top of libraries, we have the **applications** layer: Detectron, FairSeq, Magenta, GluonNLP, etc.
* Below the frameworks, we have the layer of **runtimes, compilers, and optimizers**: cuDNN, NNPACK, TVM, ONNX, etc.
* The lowest layer is **hardware**: CPU, GPU, DSP, FPGA/ASIC.

### Thoughts Across The Stack

* Do NOT use AlexNet!
* Unifications help everyone: ONNX bridges the gap between high-level APIs and framework frontends on one side, and hardware vendor libraries and devices on the other.
* Invest in experiment management.
* Use conventional computer science wisdom: programming languages, compilers, scientific computation, databases, etc.
* Things change across the stack; quantization is a good example:
  * Applications layer: quantization trades a little accuracy for speed.
  * Libraries layer: auto-quantization interfaces increase ease of use.
  * Frameworks layer: quantized training, auto-scaling, etc.
  * Runtime, compilers, and optimizers layer: high-performance fixed-point math.
  * Hardware layer: quantized computation primitives.

--------------------------------------------------------------------------------