├── LICENSE ├── README.md ├── deployment └── README.md ├── example ├── Falcon_meets_IMDB.ipynb └── vision_and_scope.md ├── preparation └── README.md ├── training └── README.md └── v2_updates └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ### [Packt Conference : Put Generative AI to work on Oct 11-13 (Virtual)](https://packt.link/JGIEY) 3 | 4 |

[![Packt Conference](https://hub.packtpub.com/wp-content/uploads/2023/08/put-generative-ai-to-work-packt.png)](https://packt.link/JGIEY)

5 | 3 Days, 20+ AI Experts, 25+ Workshops and Power Talks 6 | 7 | Code: USD75OFF 8 | 9 | # Pretrain Vision and Large Language Models in Python 10 | 11 | 12 | This is the code repository for [Pretrain Vision and Large Language Models in Python](https://www.packtpub.com/product/pretrain-vision-and-large-language-models-in-python/9781804618257), published by Packt. 13 | 14 | **End-to-end techniques for building and deploying foundation models on AWS** 15 | 16 | ## What is this book about? 17 | 18 | Foundation models have forever changed machine learning. From BERT to ChatGPT, CLIP to Stable Diffusion, when billions of parameters are combined with large datasets and hundreds to thousands of GPUs, the result is nothing short of record-breaking. The recommendations, advice, and code samples in this book will help you pretrain and fine-tune your own foundation models from scratch on AWS and Amazon SageMaker, while applying them to hundreds of use cases across your organization. 19 | 20 | With advice from seasoned AWS and machine learning expert Emily Webber, this book helps you learn everything you need to go from project ideation to dataset preparation, training, evaluation, and deployment for large language, vision, and multimodal models. With step-by-step explanations of essential concepts and practical examples, you’ll go from mastering the concept of pretraining to preparing your dataset and model, configuring your environment, training, fine-tuning, evaluating, deploying, and optimizing your foundation models. 21 | 22 | You will learn how to apply the scaling laws to distributing your model and dataset over multiple GPUs, remove bias, achieve high throughput, and build deployment pipelines. 23 | 24 | By the end of this book, you’ll be well equipped to embark on your own project to pretrain and fine-tune the foundation models of the future. 25 | 26 | This book covers the following exciting features: 27 | * Find the right use cases and datasets for pretraining and fine-tuning 28 | * Prepare for large-scale training with custom accelerators and GPUs 29 | * Configure environments on AWS and SageMaker to maximize performance 30 | * Select hyperparameters based on your model and constraints 31 | * Distribute your model and dataset using many types of parallelism 32 | * Avoid pitfalls with job restarts, intermittent health checks, and more 33 | * Evaluate your model with quantitative and qualitative insights 34 | * Deploy your models with runtime improvements and monitoring pipelines 35 | 36 | If you feel this book is for you, get your [copy](https://www.amazon.com/Pretrain-Vision-Language-Models-Beginners/dp/180461825X) today! 37 | 38 | **Following is what you need for this book:** 39 | If you’re a machine learning researcher or enthusiast who wants to start a foundation modelling project, this book is for you. Applied scientists, data scientists, machine learning engineers, solution architects, product managers, and students will all benefit from this book. Intermediate Python is a must, along with introductory concepts of cloud computing. A strong understanding of deep learning fundamentals is needed, while advanced topics will be explained. The content covers advanced machine learning and cloud techniques, explaining them in an actionable, easy-to-understand way. 
40 | 41 | ### Software and Hardware List 42 | 43 | | Chapter | Software required | OS required | 44 | | -------- | -------------------------------------------------------------------------------------| -----------------------------------| 45 | | 1-15 | Amazon Web Services (AWS) with a recent version of a modern web browser (Chrome, Edge, etc.) | Any OS | 46 | 47 | 48 | ## Errata 49 | 50 | * Page 212 (Paragraph 5, line 1): **s this a good thing? Honestly, I’m happy to finally see the shift; I’ve been working on generating content with AI/ML models in some fashion since at least 2019, and as a writer and creative person myself, I’ve always thought this was the most interesting part of machine learning.** _should be_ **Is this a good thing? Honestly, I’m happy to finally see the shift; I’ve been working on generating content with AI/ML models in some fashion since at least 2019, and as a writer and creative person myself, I’ve always thought this was the most interesting part of machine learning.** 51 | 52 | 53 | 54 | ## Get to Know the Author 55 | **Emily Webber** is a principal machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). In her more than five years at AWS, she has assisted hundreds of customers on their journey to ML in the cloud, specializing in distributed training for large language and vision models. She mentors ML solution architects, authors countless feature designs for SageMaker and AWS, and guides the Amazon SageMaker product and engineering teams on best practices in regard to ML and customers. Emily is widely known in the AWS community for a 16-video YouTube series (https://www.youtube.com/playlist?list=PLhr1KZpdzukcOr_6j_zmSrvYnLUtqsZz) featuring SageMaker with 211,000 views, and for giving a keynote speech at O’Reilly AI London 2019 on a novel reinforcement learning approach she developed for public policy. 56 | 57 | 58 | 59 | # Note from Author 60 | 61 | 62 | Hi there! If you'd like some examples to get hands-on with pretraining your own foundation models 🧠, then you've come to the right place. This repository is meant to pair with my 2023 [book on the topic](https://bit.ly/dist-train-book), providing an end-to-end guide for foundation models on AWS. The book has detailed explanations and relevant examples of key topics in the lifecycle of foundation models, including where they come from, why you'd design your own, how to do that, and how to build it into an application. 63 | 64 | This repository contains examples to help you implement that guidance at scale. 🚀 65 | 66 | Eventually I'll have some net new examples here for you to follow step-by-step across the whole lifecycle. To get you moving between now and then, you'll find in this repository links to hands-on examples elsewhere that explain how to implement everything you learned about in the book. I'll start uploading these in July, and will be pushing them out slowly. 67 | 68 | The concepts, theory, and examples discussed in the book are general and extend to any compute environment. However, all of the hands-on guidance will focus explicitly on AWS and especially Amazon SageMaker. So let's get rolling! 🎸 69 | 70 | ## Book structure 📖 71 | The book is broken down into 15 chapters. Let's recap those here: 72 | 1. Introduction to pretraining foundation models 73 | 2. Picking the right use case and dataset 74 | 3. Picking the right model 75 | 4. Prepping your containers and accelerators on AWS 76 | 5. Distribution fundamentals: data and model parallel 77 | 6. 
Building a distributed data loader 78 | 7. Finding the right hyperparameters 79 | 8. Large-scale training on Amazon SageMaker 80 | 9. Advanced training concepts 81 | 10. Fine-tuning and evaluating models 82 | 11. Detecting and mitigating bias 83 | 12. Deploying your model on SageMaker 84 | 13. Prompt engineering 85 | 14. MLOps for foundation models on AWS 86 | 15. Future trends in pretraining foundation models 87 | 88 | You don't need coding examples for Chapters 1 and 15, because those are straightforward and high-level. That means you've got 13 more chapters to work through. I'll break these up into 3 groups for you: preparing your environment, training, and deploying. Your repository structure will then look like this. 89 | 90 | ## Repository structure 91 | I. [Preparation](https://github.com/PacktPublishing/Pretrain-Vision-and-Large-Language-Models-in-Python/tree/main/preparation) 92 | - Dataset analysis 93 | - Model analysis 94 | - Preparing your containers and accelerators on AWS 95 | - Basics of distribution 96 | 97 | II. [Training](https://github.com/PacktPublishing/Pretrain-Vision-and-Large-Language-Models-in-Python/tree/main/training) 98 | - Distributed data loader 99 | - Hyperparameters 100 | - Training foundation models on SageMaker 101 | - Compilation and throughput 102 | - Fine-tuning and evaluating 103 | 104 | 105 | III. [Deployment](https://github.com/PacktPublishing/Pretrain-Vision-and-Large-Language-Models-in-Python/tree/main/deployment) 106 | - Bias detection and mitigation 107 | - Hosting foundation models on SageMaker 108 | - Prompt engineering 109 | - MLOps 110 | 111 | Each of these directories will start with just a list of links to helpful hands-on tutorials to dive into the content. In July, I'll update these with new examples for pretraining in Python across vision and language on AWS. 112 | 113 | Happy trails! ⛰️ 114 | 115 | ### Asking questions and getting help 116 | If you're stuck, feel free to log an issue in the repo here and I'll try to address it. You can also reach me on Twitch; I'm live every Monday at 9am PST / 12pm EST / 6pm CET right [here](https://www.twitch.tv/aws/schedule). Bring your question and come hang out with us! The show is called ***Build on Generative AI*** with me and Darko Mesaros. You can also ping me on [LinkedIn right here.](https://www.linkedin.com/in/emily-webber-921b4969/) 117 | -------------------------------------------------------------------------------- /deployment/README.md: -------------------------------------------------------------------------------- 1 | # Deploying foundation models on AWS 2 | 3 | ### 1. Detecting and mitigating bias 4 | - Amazon's ***BOLD: Bias in Open-ended Language Generation Dataset*** GitHub project is [here](https://github.com/amazon-science/bold). 5 | - Princeton's ***REVISE: REvealing VIsual biaSEs*** GitHub project is [here](https://github.com/princetonvisualai/revise-tool). 6 | - Detecting bias in vision and language with SageMaker Clarify; [notebooks are here](https://github.com/aws/amazon-sagemaker-examples/tree/2e60fb1522d1b228a77d4979a0c4ae269a4afe9c/sagemaker-clarify). 7 | - Monitoring model drift and bias for hosted models with SageMaker; [documentation is here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-monitor-bias-drift.html). 8 | 9 | ### 2. 
Hosting foundation models on SageMaker 10 | - Notebook for [hosting distributed models](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/deepspeed/GPT-J-6B_DJLServing_with_PySDK.ipynb) on SageMaker with `deepspeed`; this example uses GPT-J 6B. 11 | - Using SageMaker JumpStart for hosting (and training) foundation models. This [repository has 7+ example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/introduction_to_amazon_algorithms/jumpstart-foundation-models); you can use these images and models directly. 12 | - Source code [link](https://github.com/aws/amazon-sagemaker-examples/tree/main/introduction_to_amazon_algorithms/jumpstart-foundation-models) for the large model serving container with SageMaker. 13 | 14 | ### 3. Prompt engineering 15 | - Repository with a [self-managed UI](https://github.com/aws-samples/prompt-engineering-playground-with-sagemaker) to run on Jupyter for prompting prebuilt foundation models with SageMaker JumpStart. 16 | - AWS [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-prompt-engineering.html) on prompt engineering for foundation models. 17 | - Prefix tuning for causal language modeling with the Hugging Face PEFT library: the notebook is [here](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_prefix_tuning_clm.ipynb). 18 | 19 | ### 4. MLOps for foundation models with SageMaker 20 | - Example notebooks for using [SageMaker Pipelines](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-pipelines) 21 | - Notebooks with [SageMaker Lineage Tracking](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-lineage), to automatically trace up and down your lifecycle and find key assets. 22 | -------------------------------------------------------------------------------- /example/Falcon_meets_IMDB.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b066b207-afa1-49b5-acc1-06f5f6575db6", 6 | "metadata": {}, 7 | "source": [ 8 | "# Falcon meets IMDB: Dataset exploration, model customizations, and evaluation\n", 9 | "The Internet Movie Database (IMDB) is a classic storehouse of rich NLP data used in many studies. While I would prefer to simply use what's already on the Hugging Face Hub [here](https://huggingface.co/datasets/imdb), sadly this isn't an option, because that sample is only set up for binary classification of the reviews.\n", 10 | "\n", 11 | "Instead, we'll go straight to the source. The IMDB group provides samples of their datasets for free [here](https://developer.imdb.com/non-commercial-datasets/). This includes the title and description of the film. We'll be interested in the description here. \n", 12 | "\n", 13 | "Another option for movie descriptions is of course Wikipedia articles. However, this isn't already organized by film titles, so we'll consider that a plan B.\n", 14 | "\n", 15 | "In this notebook, we want to see how well ***an open-source language model can generate new movie concepts.*** Specifically, we want to explore prompt engineering and fine-tuning with a state-of-the-art model backbone. These days that is TII's Falcon 40B. 
\n", 16 | "\n", 17 | "An interesting challenge to solve in this notebook is that ***there's not already a great way of defining a \"good\" movie description.*** This means we'll need to develop some new evaluation metric or method to take a basic natural language movie description, say with at least 5 sentences, and create some numeric signal for how good this is. " 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "id": "d78d73dc-928d-40ec-870a-5cf0e17653e2", 23 | "metadata": {}, 24 | "source": [ 25 | "If you're following along with me, I'm using a SageMaker Studio notebook, specifically an `ml.m5.2xlarge`. I start with the Python 3 Data Science kernel." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "id": "cc886c80-f2f4-47ab-9d80-79ff8b13d818", 31 | "metadata": {}, 32 | "source": [ 33 | "### Step 0. Define and install package requirements." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "id": "d9dd6697-9596-4307-b090-18dc09516cef", 40 | "metadata": { 41 | "tags": [] 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "%%writefile requirements.txt\n", 46 | "torch\n", 47 | "transformers\n", 48 | "datasets" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "id": "6f8f7ee4-0544-4977-80c1-cb254cb25457", 55 | "metadata": { 56 | "tags": [] 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "!pip install -r requirements.txt" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "id": "c4e5ec7b-2b70-4498-adcb-d43f89f80173", 66 | "metadata": {}, 67 | "source": [ 68 | "### Step 1. Download some of the `IMDB` non-commercial datasets and load into pandas.\n", 69 | "Specifically we'll download the `ratings` and `titles` datasets, then join these. After this, we will need to query Wikipedia for the articles, including the summary and full plot." 
70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "id": "fbfc97be-6cca-4e70-a621-70e82c4cd037", 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "!mkdir imdb" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 17, 85 | "id": "c84352b4-a011-4af6-a761-638058295847", 86 | "metadata": { 87 | "tags": [] 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "import os\n", 92 | "\n", 93 | "def download_imdb_set(file_name, local_dir):\n", 94 | "    # download the gzipped TSV into local_dir, then decompress it in place\n", 95 | "    msg1 = f'wget https://datasets.imdbws.com/{file_name} --directory-prefix {local_dir}/'\n", 96 | "    os.system(msg1)\n", 97 | "    msg2 = f'gunzip {local_dir}/{file_name}'\n", 98 | "    os.system(msg2)\n", 99 | "\n", 100 | "download_imdb_set(file_name='title.ratings.tsv.gz', local_dir='imdb')\n", "download_imdb_set(file_name='title.basics.tsv.gz', local_dir='imdb')\n", "# title.akas is needed for the US-region filter in the next cell\n", "download_imdb_set(file_name='title.akas.tsv.gz', local_dir='imdb')" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 16, 106 | "id": "b3a163c1-d02b-4bbb-a744-395c2c05e789", 107 | "metadata": { 108 | "tags": [] 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "import pandas as pd\n", 113 | "\n", 114 | "def format_imdb(table_name):\n", 115 | "    # only keep titles released in the US region\n", 116 | "    if 'title.akas.tsv' in table_name:\n", 117 | "        titles_df = pd.read_table('imdb/title.akas.tsv')\n", 118 | "        us_titles = titles_df.loc[titles_df[\"region\"]=='US']\n", 119 | "        us_titles.set_index('titleId', inplace=True)\n", 120 | "        return us_titles\n", 121 | "    \n", 122 | "    elif 'title.ratings.tsv' in table_name:\n", 123 | "        ratings_df = pd.read_table('imdb/title.ratings.tsv')\n", 124 | "        ratings_df.set_index('tconst', inplace=True)\n", 125 | "        return ratings_df\n", 126 | "    \n", 127 | "    elif 'title.basics.tsv' in table_name:\n", 128 | "        title_basics = pd.read_table('imdb/title.basics.tsv')\n", 129 | "        # filter out adult films and only grab movies\n", 130 | "        title_basics = title_basics[(title_basics.isAdult==0) & (title_basics.titleType == 'movie')]\n", 131 | "        title_basics.set_index('tconst', inplace=True)\n", 132 | "        return title_basics" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 41, 138 | "id": "7d4274f1-376b-4b33-9803-be06c707fef9", 139 | "metadata": { 140 | "tags": [] 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "# these names need to match the substrings checked in format_imdb above\n", "title_basics = format_imdb('title.basics.tsv')\n", 145 | "us_titles = format_imdb('title.akas.tsv')\n", 146 | "ratings_df = format_imdb('title.ratings.tsv')" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 62, 152 | "id": "c7a2fd6a-b4f2-46ab-9b1f-d3789d2c125a", 153 | "metadata": { 154 | "tags": [] 155 | }, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "(637227, 8)" 161 | ] 162 | }, 163 | "execution_count": 62, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "title_basics.shape" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 33, 175 | "id": "43eb4765-14bd-42bd-96b6-70f6f22efedc", 176 | "metadata": { 177 | "tags": [] 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "# grab only full-length feature US movies\n", 182 | "us_movies = us_titles[us_titles.index.isin(title_basics.index)]\n", 183 | "# join with the ratings\n", 184 | "df = us_movies.join(ratings_df)\n", 185 | "df.shape" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 47, 191 | "id": "801c93ff-945c-4b19-b2ca-7432e6472a08", 192 | "metadata": { 193 | "tags": [] 194 | }, 195 | "outputs": [ 196 | { 197 | "data": { 198 |
"text/html": [ 199 | "
\n", 200 | "\n", 213 | "\n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | "
orderingtitleregionlanguagetypesattributesisOriginalTitleaverageRatingnumVotes
tt00000095Miss JerryUS\\NimdbDisplay\\N05.3206.0
tt00001471The Corbett-Fitzsimmons FightUS\\NimdbDisplay\\N05.3475.0
tt00005747The Story of the Kelly GangUS\\NimdbDisplay\\N06.0832.0
tt00005913The Prodigal SonUS\\N\\N\\N04.420.0
tt00006304HamletUS\\N\\N\\N02.826.0
\n", 291 | "
" 292 | ], 293 | "text/plain": [ 294 | " ordering title region language \\\n", 295 | "tt0000009 5 Miss Jerry US \\N \n", 296 | "tt0000147 1 The Corbett-Fitzsimmons Fight US \\N \n", 297 | "tt0000574 7 The Story of the Kelly Gang US \\N \n", 298 | "tt0000591 3 The Prodigal Son US \\N \n", 299 | "tt0000630 4 Hamlet US \\N \n", 300 | "\n", 301 | " types attributes isOriginalTitle averageRating numVotes \n", 302 | "tt0000009 imdbDisplay \\N 0 5.3 206.0 \n", 303 | "tt0000147 imdbDisplay \\N 0 5.3 475.0 \n", 304 | "tt0000574 imdbDisplay \\N 0 6.0 832.0 \n", 305 | "tt0000591 \\N \\N 0 4.4 20.0 \n", 306 | "tt0000630 \\N \\N 0 2.8 26.0 " 307 | ] 308 | }, 309 | "execution_count": 47, 310 | "metadata": {}, 311 | "output_type": "execute_result" 312 | } 313 | ], 314 | "source": [ 315 | "df.head()" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "id": "f500a97d-96ff-40ac-a176-5ebacdd3d5bc", 321 | "metadata": {}, 322 | "source": [ 323 | "Wow, surprinsingly it looks like of the 1.4M US movies in this dataset sample, only 326K have ratings. Let's pick the best ones." 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 60, 329 | "id": "ddfb4312-25c9-4fb9-9a93-d40368316b3c", 330 | "metadata": { 331 | "tags": [] 332 | }, 333 | "outputs": [ 334 | { 335 | "data": { 336 | "text/plain": [ 337 | "(50676, 9)" 338 | ] 339 | }, 340 | "execution_count": 60, 341 | "metadata": {}, 342 | "output_type": "execute_result" 343 | } 344 | ], 345 | "source": [ 346 | "top_movies = df[df.averageRating >= 7.0]\n", 347 | "top_movies.shape" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 61, 353 | "id": "09391531-9b37-4127-8d22-215cbcfa0906", 354 | "metadata": { 355 | "tags": [] 356 | }, 357 | "outputs": [ 358 | { 359 | "data": { 360 | "text/html": [ 361 | "
\n", 362 | "\n", 375 | "\n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | "
orderingtitleregionlanguagetypesattributesisOriginalTitleaverageRatingnumVotes
tt000213015Dante's InfernoUS\\N\\N\\N07.03167.0
tt00023052Life of VillaUS\\NimdbDisplay\\N07.628.0
tt00026373ArizonaUS\\NimdbDisplay\\N07.218.0
tt00034561Trial by FireUS\\N\\N16mm release title07.213.0
tt00034562Through Fire and AirUS\\NimdbDisplay\\N07.213.0
\n", 453 | "
" 454 | ], 455 | "text/plain": [ 456 | " ordering title region language types \\\n", 457 | "tt0002130 15 Dante's Inferno US \\N \\N \n", 458 | "tt0002305 2 Life of Villa US \\N imdbDisplay \n", 459 | "tt0002637 3 Arizona US \\N imdbDisplay \n", 460 | "tt0003456 1 Trial by Fire US \\N \\N \n", 461 | "tt0003456 2 Through Fire and Air US \\N imdbDisplay \n", 462 | "\n", 463 | " attributes isOriginalTitle averageRating numVotes \n", 464 | "tt0002130 \\N 0 7.0 3167.0 \n", 465 | "tt0002305 \\N 0 7.6 28.0 \n", 466 | "tt0002637 \\N 0 7.2 18.0 \n", 467 | "tt0003456 16mm release title 0 7.2 13.0 \n", 468 | "tt0003456 \\N 0 7.2 13.0 " 469 | ] 470 | }, 471 | "execution_count": 61, 472 | "metadata": {}, 473 | "output_type": "execute_result" 474 | } 475 | ], 476 | "source": [ 477 | "top_movies.head()" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "id": "aadaf767-f097-4210-a090-53bce76822b8", 483 | "metadata": {}, 484 | "source": [ 485 | "Wow, we started at 637,227 feature-length movies in the IMDB dataset, and now have just about 50,000 US feature-length films with a rating at 7.0 or higher. Clearly there's a data issue here!" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "id": "2a5d28cf-96f4-4e25-877a-eaab912166ae", 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [] 495 | } 496 | ], 497 | "metadata": { 498 | "availableInstances": [ 499 | { 500 | "_defaultOrder": 0, 501 | "_isFastLaunch": true, 502 | "category": "General purpose", 503 | "gpuNum": 0, 504 | "hideHardwareSpecs": false, 505 | "memoryGiB": 4, 506 | "name": "ml.t3.medium", 507 | "vcpuNum": 2 508 | }, 509 | { 510 | "_defaultOrder": 1, 511 | "_isFastLaunch": false, 512 | "category": "General purpose", 513 | "gpuNum": 0, 514 | "hideHardwareSpecs": false, 515 | "memoryGiB": 8, 516 | "name": "ml.t3.large", 517 | "vcpuNum": 2 518 | }, 519 | { 520 | "_defaultOrder": 2, 521 | "_isFastLaunch": false, 522 | "category": "General purpose", 523 | "gpuNum": 0, 524 | "hideHardwareSpecs": false, 525 | "memoryGiB": 16, 526 | "name": "ml.t3.xlarge", 527 | "vcpuNum": 4 528 | }, 529 | { 530 | "_defaultOrder": 3, 531 | "_isFastLaunch": false, 532 | "category": "General purpose", 533 | "gpuNum": 0, 534 | "hideHardwareSpecs": false, 535 | "memoryGiB": 32, 536 | "name": "ml.t3.2xlarge", 537 | "vcpuNum": 8 538 | }, 539 | { 540 | "_defaultOrder": 4, 541 | "_isFastLaunch": true, 542 | "category": "General purpose", 543 | "gpuNum": 0, 544 | "hideHardwareSpecs": false, 545 | "memoryGiB": 8, 546 | "name": "ml.m5.large", 547 | "vcpuNum": 2 548 | }, 549 | { 550 | "_defaultOrder": 5, 551 | "_isFastLaunch": false, 552 | "category": "General purpose", 553 | "gpuNum": 0, 554 | "hideHardwareSpecs": false, 555 | "memoryGiB": 16, 556 | "name": "ml.m5.xlarge", 557 | "vcpuNum": 4 558 | }, 559 | { 560 | "_defaultOrder": 6, 561 | "_isFastLaunch": false, 562 | "category": "General purpose", 563 | "gpuNum": 0, 564 | "hideHardwareSpecs": false, 565 | "memoryGiB": 32, 566 | "name": "ml.m5.2xlarge", 567 | "vcpuNum": 8 568 | }, 569 | { 570 | "_defaultOrder": 7, 571 | "_isFastLaunch": false, 572 | "category": "General purpose", 573 | "gpuNum": 0, 574 | "hideHardwareSpecs": false, 575 | "memoryGiB": 64, 576 | "name": "ml.m5.4xlarge", 577 | "vcpuNum": 16 578 | }, 579 | { 580 | "_defaultOrder": 8, 581 | "_isFastLaunch": false, 582 | "category": "General purpose", 583 | "gpuNum": 0, 584 | "hideHardwareSpecs": false, 585 | "memoryGiB": 128, 586 | "name": "ml.m5.8xlarge", 587 | "vcpuNum": 32 588 | }, 589 | { 590 | 
"_defaultOrder": 9, 591 | "_isFastLaunch": false, 592 | "category": "General purpose", 593 | "gpuNum": 0, 594 | "hideHardwareSpecs": false, 595 | "memoryGiB": 192, 596 | "name": "ml.m5.12xlarge", 597 | "vcpuNum": 48 598 | }, 599 | { 600 | "_defaultOrder": 10, 601 | "_isFastLaunch": false, 602 | "category": "General purpose", 603 | "gpuNum": 0, 604 | "hideHardwareSpecs": false, 605 | "memoryGiB": 256, 606 | "name": "ml.m5.16xlarge", 607 | "vcpuNum": 64 608 | }, 609 | { 610 | "_defaultOrder": 11, 611 | "_isFastLaunch": false, 612 | "category": "General purpose", 613 | "gpuNum": 0, 614 | "hideHardwareSpecs": false, 615 | "memoryGiB": 384, 616 | "name": "ml.m5.24xlarge", 617 | "vcpuNum": 96 618 | }, 619 | { 620 | "_defaultOrder": 12, 621 | "_isFastLaunch": false, 622 | "category": "General purpose", 623 | "gpuNum": 0, 624 | "hideHardwareSpecs": false, 625 | "memoryGiB": 8, 626 | "name": "ml.m5d.large", 627 | "vcpuNum": 2 628 | }, 629 | { 630 | "_defaultOrder": 13, 631 | "_isFastLaunch": false, 632 | "category": "General purpose", 633 | "gpuNum": 0, 634 | "hideHardwareSpecs": false, 635 | "memoryGiB": 16, 636 | "name": "ml.m5d.xlarge", 637 | "vcpuNum": 4 638 | }, 639 | { 640 | "_defaultOrder": 14, 641 | "_isFastLaunch": false, 642 | "category": "General purpose", 643 | "gpuNum": 0, 644 | "hideHardwareSpecs": false, 645 | "memoryGiB": 32, 646 | "name": "ml.m5d.2xlarge", 647 | "vcpuNum": 8 648 | }, 649 | { 650 | "_defaultOrder": 15, 651 | "_isFastLaunch": false, 652 | "category": "General purpose", 653 | "gpuNum": 0, 654 | "hideHardwareSpecs": false, 655 | "memoryGiB": 64, 656 | "name": "ml.m5d.4xlarge", 657 | "vcpuNum": 16 658 | }, 659 | { 660 | "_defaultOrder": 16, 661 | "_isFastLaunch": false, 662 | "category": "General purpose", 663 | "gpuNum": 0, 664 | "hideHardwareSpecs": false, 665 | "memoryGiB": 128, 666 | "name": "ml.m5d.8xlarge", 667 | "vcpuNum": 32 668 | }, 669 | { 670 | "_defaultOrder": 17, 671 | "_isFastLaunch": false, 672 | "category": "General purpose", 673 | "gpuNum": 0, 674 | "hideHardwareSpecs": false, 675 | "memoryGiB": 192, 676 | "name": "ml.m5d.12xlarge", 677 | "vcpuNum": 48 678 | }, 679 | { 680 | "_defaultOrder": 18, 681 | "_isFastLaunch": false, 682 | "category": "General purpose", 683 | "gpuNum": 0, 684 | "hideHardwareSpecs": false, 685 | "memoryGiB": 256, 686 | "name": "ml.m5d.16xlarge", 687 | "vcpuNum": 64 688 | }, 689 | { 690 | "_defaultOrder": 19, 691 | "_isFastLaunch": false, 692 | "category": "General purpose", 693 | "gpuNum": 0, 694 | "hideHardwareSpecs": false, 695 | "memoryGiB": 384, 696 | "name": "ml.m5d.24xlarge", 697 | "vcpuNum": 96 698 | }, 699 | { 700 | "_defaultOrder": 20, 701 | "_isFastLaunch": false, 702 | "category": "General purpose", 703 | "gpuNum": 0, 704 | "hideHardwareSpecs": true, 705 | "memoryGiB": 0, 706 | "name": "ml.geospatial.interactive", 707 | "supportedImageNames": [ 708 | "sagemaker-geospatial-v1-0" 709 | ], 710 | "vcpuNum": 0 711 | }, 712 | { 713 | "_defaultOrder": 21, 714 | "_isFastLaunch": true, 715 | "category": "Compute optimized", 716 | "gpuNum": 0, 717 | "hideHardwareSpecs": false, 718 | "memoryGiB": 4, 719 | "name": "ml.c5.large", 720 | "vcpuNum": 2 721 | }, 722 | { 723 | "_defaultOrder": 22, 724 | "_isFastLaunch": false, 725 | "category": "Compute optimized", 726 | "gpuNum": 0, 727 | "hideHardwareSpecs": false, 728 | "memoryGiB": 8, 729 | "name": "ml.c5.xlarge", 730 | "vcpuNum": 4 731 | }, 732 | { 733 | "_defaultOrder": 23, 734 | "_isFastLaunch": false, 735 | "category": "Compute optimized", 736 | "gpuNum": 0, 737 | 
"hideHardwareSpecs": false, 738 | "memoryGiB": 16, 739 | "name": "ml.c5.2xlarge", 740 | "vcpuNum": 8 741 | }, 742 | { 743 | "_defaultOrder": 24, 744 | "_isFastLaunch": false, 745 | "category": "Compute optimized", 746 | "gpuNum": 0, 747 | "hideHardwareSpecs": false, 748 | "memoryGiB": 32, 749 | "name": "ml.c5.4xlarge", 750 | "vcpuNum": 16 751 | }, 752 | { 753 | "_defaultOrder": 25, 754 | "_isFastLaunch": false, 755 | "category": "Compute optimized", 756 | "gpuNum": 0, 757 | "hideHardwareSpecs": false, 758 | "memoryGiB": 72, 759 | "name": "ml.c5.9xlarge", 760 | "vcpuNum": 36 761 | }, 762 | { 763 | "_defaultOrder": 26, 764 | "_isFastLaunch": false, 765 | "category": "Compute optimized", 766 | "gpuNum": 0, 767 | "hideHardwareSpecs": false, 768 | "memoryGiB": 96, 769 | "name": "ml.c5.12xlarge", 770 | "vcpuNum": 48 771 | }, 772 | { 773 | "_defaultOrder": 27, 774 | "_isFastLaunch": false, 775 | "category": "Compute optimized", 776 | "gpuNum": 0, 777 | "hideHardwareSpecs": false, 778 | "memoryGiB": 144, 779 | "name": "ml.c5.18xlarge", 780 | "vcpuNum": 72 781 | }, 782 | { 783 | "_defaultOrder": 28, 784 | "_isFastLaunch": false, 785 | "category": "Compute optimized", 786 | "gpuNum": 0, 787 | "hideHardwareSpecs": false, 788 | "memoryGiB": 192, 789 | "name": "ml.c5.24xlarge", 790 | "vcpuNum": 96 791 | }, 792 | { 793 | "_defaultOrder": 29, 794 | "_isFastLaunch": true, 795 | "category": "Accelerated computing", 796 | "gpuNum": 1, 797 | "hideHardwareSpecs": false, 798 | "memoryGiB": 16, 799 | "name": "ml.g4dn.xlarge", 800 | "vcpuNum": 4 801 | }, 802 | { 803 | "_defaultOrder": 30, 804 | "_isFastLaunch": false, 805 | "category": "Accelerated computing", 806 | "gpuNum": 1, 807 | "hideHardwareSpecs": false, 808 | "memoryGiB": 32, 809 | "name": "ml.g4dn.2xlarge", 810 | "vcpuNum": 8 811 | }, 812 | { 813 | "_defaultOrder": 31, 814 | "_isFastLaunch": false, 815 | "category": "Accelerated computing", 816 | "gpuNum": 1, 817 | "hideHardwareSpecs": false, 818 | "memoryGiB": 64, 819 | "name": "ml.g4dn.4xlarge", 820 | "vcpuNum": 16 821 | }, 822 | { 823 | "_defaultOrder": 32, 824 | "_isFastLaunch": false, 825 | "category": "Accelerated computing", 826 | "gpuNum": 1, 827 | "hideHardwareSpecs": false, 828 | "memoryGiB": 128, 829 | "name": "ml.g4dn.8xlarge", 830 | "vcpuNum": 32 831 | }, 832 | { 833 | "_defaultOrder": 33, 834 | "_isFastLaunch": false, 835 | "category": "Accelerated computing", 836 | "gpuNum": 4, 837 | "hideHardwareSpecs": false, 838 | "memoryGiB": 192, 839 | "name": "ml.g4dn.12xlarge", 840 | "vcpuNum": 48 841 | }, 842 | { 843 | "_defaultOrder": 34, 844 | "_isFastLaunch": false, 845 | "category": "Accelerated computing", 846 | "gpuNum": 1, 847 | "hideHardwareSpecs": false, 848 | "memoryGiB": 256, 849 | "name": "ml.g4dn.16xlarge", 850 | "vcpuNum": 64 851 | }, 852 | { 853 | "_defaultOrder": 35, 854 | "_isFastLaunch": false, 855 | "category": "Accelerated computing", 856 | "gpuNum": 1, 857 | "hideHardwareSpecs": false, 858 | "memoryGiB": 61, 859 | "name": "ml.p3.2xlarge", 860 | "vcpuNum": 8 861 | }, 862 | { 863 | "_defaultOrder": 36, 864 | "_isFastLaunch": false, 865 | "category": "Accelerated computing", 866 | "gpuNum": 4, 867 | "hideHardwareSpecs": false, 868 | "memoryGiB": 244, 869 | "name": "ml.p3.8xlarge", 870 | "vcpuNum": 32 871 | }, 872 | { 873 | "_defaultOrder": 37, 874 | "_isFastLaunch": false, 875 | "category": "Accelerated computing", 876 | "gpuNum": 8, 877 | "hideHardwareSpecs": false, 878 | "memoryGiB": 488, 879 | "name": "ml.p3.16xlarge", 880 | "vcpuNum": 64 881 | }, 882 | { 883 | 
"_defaultOrder": 38, 884 | "_isFastLaunch": false, 885 | "category": "Accelerated computing", 886 | "gpuNum": 8, 887 | "hideHardwareSpecs": false, 888 | "memoryGiB": 768, 889 | "name": "ml.p3dn.24xlarge", 890 | "vcpuNum": 96 891 | }, 892 | { 893 | "_defaultOrder": 39, 894 | "_isFastLaunch": false, 895 | "category": "Memory Optimized", 896 | "gpuNum": 0, 897 | "hideHardwareSpecs": false, 898 | "memoryGiB": 16, 899 | "name": "ml.r5.large", 900 | "vcpuNum": 2 901 | }, 902 | { 903 | "_defaultOrder": 40, 904 | "_isFastLaunch": false, 905 | "category": "Memory Optimized", 906 | "gpuNum": 0, 907 | "hideHardwareSpecs": false, 908 | "memoryGiB": 32, 909 | "name": "ml.r5.xlarge", 910 | "vcpuNum": 4 911 | }, 912 | { 913 | "_defaultOrder": 41, 914 | "_isFastLaunch": false, 915 | "category": "Memory Optimized", 916 | "gpuNum": 0, 917 | "hideHardwareSpecs": false, 918 | "memoryGiB": 64, 919 | "name": "ml.r5.2xlarge", 920 | "vcpuNum": 8 921 | }, 922 | { 923 | "_defaultOrder": 42, 924 | "_isFastLaunch": false, 925 | "category": "Memory Optimized", 926 | "gpuNum": 0, 927 | "hideHardwareSpecs": false, 928 | "memoryGiB": 128, 929 | "name": "ml.r5.4xlarge", 930 | "vcpuNum": 16 931 | }, 932 | { 933 | "_defaultOrder": 43, 934 | "_isFastLaunch": false, 935 | "category": "Memory Optimized", 936 | "gpuNum": 0, 937 | "hideHardwareSpecs": false, 938 | "memoryGiB": 256, 939 | "name": "ml.r5.8xlarge", 940 | "vcpuNum": 32 941 | }, 942 | { 943 | "_defaultOrder": 44, 944 | "_isFastLaunch": false, 945 | "category": "Memory Optimized", 946 | "gpuNum": 0, 947 | "hideHardwareSpecs": false, 948 | "memoryGiB": 384, 949 | "name": "ml.r5.12xlarge", 950 | "vcpuNum": 48 951 | }, 952 | { 953 | "_defaultOrder": 45, 954 | "_isFastLaunch": false, 955 | "category": "Memory Optimized", 956 | "gpuNum": 0, 957 | "hideHardwareSpecs": false, 958 | "memoryGiB": 512, 959 | "name": "ml.r5.16xlarge", 960 | "vcpuNum": 64 961 | }, 962 | { 963 | "_defaultOrder": 46, 964 | "_isFastLaunch": false, 965 | "category": "Memory Optimized", 966 | "gpuNum": 0, 967 | "hideHardwareSpecs": false, 968 | "memoryGiB": 768, 969 | "name": "ml.r5.24xlarge", 970 | "vcpuNum": 96 971 | }, 972 | { 973 | "_defaultOrder": 47, 974 | "_isFastLaunch": false, 975 | "category": "Accelerated computing", 976 | "gpuNum": 1, 977 | "hideHardwareSpecs": false, 978 | "memoryGiB": 16, 979 | "name": "ml.g5.xlarge", 980 | "vcpuNum": 4 981 | }, 982 | { 983 | "_defaultOrder": 48, 984 | "_isFastLaunch": false, 985 | "category": "Accelerated computing", 986 | "gpuNum": 1, 987 | "hideHardwareSpecs": false, 988 | "memoryGiB": 32, 989 | "name": "ml.g5.2xlarge", 990 | "vcpuNum": 8 991 | }, 992 | { 993 | "_defaultOrder": 49, 994 | "_isFastLaunch": false, 995 | "category": "Accelerated computing", 996 | "gpuNum": 1, 997 | "hideHardwareSpecs": false, 998 | "memoryGiB": 64, 999 | "name": "ml.g5.4xlarge", 1000 | "vcpuNum": 16 1001 | }, 1002 | { 1003 | "_defaultOrder": 50, 1004 | "_isFastLaunch": false, 1005 | "category": "Accelerated computing", 1006 | "gpuNum": 1, 1007 | "hideHardwareSpecs": false, 1008 | "memoryGiB": 128, 1009 | "name": "ml.g5.8xlarge", 1010 | "vcpuNum": 32 1011 | }, 1012 | { 1013 | "_defaultOrder": 51, 1014 | "_isFastLaunch": false, 1015 | "category": "Accelerated computing", 1016 | "gpuNum": 1, 1017 | "hideHardwareSpecs": false, 1018 | "memoryGiB": 256, 1019 | "name": "ml.g5.16xlarge", 1020 | "vcpuNum": 64 1021 | }, 1022 | { 1023 | "_defaultOrder": 52, 1024 | "_isFastLaunch": false, 1025 | "category": "Accelerated computing", 1026 | "gpuNum": 4, 1027 | "hideHardwareSpecs": 
false, 1028 | "memoryGiB": 192, 1029 | "name": "ml.g5.12xlarge", 1030 | "vcpuNum": 48 1031 | }, 1032 | { 1033 | "_defaultOrder": 53, 1034 | "_isFastLaunch": false, 1035 | "category": "Accelerated computing", 1036 | "gpuNum": 4, 1037 | "hideHardwareSpecs": false, 1038 | "memoryGiB": 384, 1039 | "name": "ml.g5.24xlarge", 1040 | "vcpuNum": 96 1041 | }, 1042 | { 1043 | "_defaultOrder": 54, 1044 | "_isFastLaunch": false, 1045 | "category": "Accelerated computing", 1046 | "gpuNum": 8, 1047 | "hideHardwareSpecs": false, 1048 | "memoryGiB": 768, 1049 | "name": "ml.g5.48xlarge", 1050 | "vcpuNum": 192 1051 | }, 1052 | { 1053 | "_defaultOrder": 55, 1054 | "_isFastLaunch": false, 1055 | "category": "Accelerated computing", 1056 | "gpuNum": 8, 1057 | "hideHardwareSpecs": false, 1058 | "memoryGiB": 1152, 1059 | "name": "ml.p4d.24xlarge", 1060 | "vcpuNum": 96 1061 | }, 1062 | { 1063 | "_defaultOrder": 56, 1064 | "_isFastLaunch": false, 1065 | "category": "Accelerated computing", 1066 | "gpuNum": 8, 1067 | "hideHardwareSpecs": false, 1068 | "memoryGiB": 1152, 1069 | "name": "ml.p4de.24xlarge", 1070 | "vcpuNum": 96 1071 | } 1072 | ], 1073 | "instance_type": "ml.m5.2xlarge", 1074 | "kernelspec": { 1075 | "display_name": "Python 3 (Data Science)", 1076 | "language": "python", 1077 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 1078 | }, 1079 | "language_info": { 1080 | "codemirror_mode": { 1081 | "name": "ipython", 1082 | "version": 3 1083 | }, 1084 | "file_extension": ".py", 1085 | "mimetype": "text/x-python", 1086 | "name": "python", 1087 | "nbconvert_exporter": "python", 1088 | "pygments_lexer": "ipython3", 1089 | "version": "3.7.10" 1090 | } 1091 | }, 1092 | "nbformat": 4, 1093 | "nbformat_minor": 5 1094 | } 1095 | -------------------------------------------------------------------------------- /example/vision_and_scope.md: -------------------------------------------------------------------------------- 1 | # Movie Chat 2 | 3 | Every good project starts with a great vision for what the end product can be. For example, let's pick something wildly ambitious that we are somewhat reasonably confident is possible. ***Movie Chat!*** 4 | 5 | How does it work - Movie Chat is an interactive LLM-based agent you can interact with as a writer to easily and quickly craft your own stories. 6 | 7 | Movie Chat specializes in long-form film narrative. 8 | 9 | You’ll start by chatting with the agent about types of movies you like, what you are interested in, and what type of message you’d like to share with the world. 10 | 11 | Then you can work with Movie Chat to figure out what the basic plot of your narrative is. 12 | 13 | For most films that includes who are your main characters, what happens to them, what struggles they encounter, how they overcome those struggles, and how they change as a result. 14 | 15 | MovieChat is great at understanding the basics of narrative construction, so you don’t have to be a total professional writer. 16 | 17 | MovieChat is also really good at helping visually describe scenes, including camera angles and shots, to help make the movie come to life for the reader. 18 | 19 | You’ll have multiple chats available, letting you move between them as you work on different parts of the overall narrative. 20 | 21 | Once you have a clear written form, MovieChat can interact in the visual domain with you to generate visuals for your characters, scenes, and the overall storyboard. 
22 | 23 | # Datasets 24 | Now that we have a killer idea, let's flesh out a few example datasets we could use to take some steps towards actually building this. You are probably already thinking about the famous ***IMDB*** dataset, with a sample available on Hugging Face [here](https://huggingface.co/datasets/imdb). While it is a great set of reviews, I don't actually think those would help us with producing a net new story from scratch, let alone the storyboard. It might be useful eventually as a signal for the quality of the movies we write, so we'll keep it on the back burner. 25 | 26 | One amazing dataset I'd love to work with is the ***IMSDb***, which has the largest collection of screenplays available online for free [here](https://imsdb.com/). This will come in handy when we move into generating the dialogue for the movie, line-by-line. 27 | 28 | But what does it actually look like to generate a movie? Let's think about the user journey. 29 | 30 | # User journey 31 | On a side note, I actually took quite a few courses on playwriting in college, where I minored in creative writing. So I have some idea of what the actual creative process looks like for at least getting from an idea to a workable play! Now let's figure out how to do this using LLMs. You'll notice that, across the board, the writer is utterly at the helm of the entire process and operates as the final decision maker and owner. The technology, however, operates as a helpful frame of reference. It's a sidekick :). 32 | 33 | 1. I interact with an LLM to find a few cool ideas for a new movie I want to make. 34 | 2. I generate a short, 5-line synopsis I'm happy with. 35 | 3. From the short synopsis, I move into the character development. 36 | 4. Having some idea of the characters on paper, I use an image generator to create pictures of them! 37 | 5. I generate more detailed written descriptions of my characters: who they are, what they want, what they struggle with, and how they change. 38 | 6. Using all of this rich content about the characters, I move into the full plot. This is a detailed 500+ word description of the entire movie. It includes a clear beginning, middle, and end, with a strong plot narrative. The LLM and agent help me write all of this, piece by piece. It gives me a variety of options when I get stuck, including tips for what audiences usually like to see. 39 | 7. Once I have the full plot, I pick a few key scenes. This might be the opening, a plot twist, a reveal, a chase, and the conclusion. 40 | 8. With a few of these scenes in mind, I move back to the vision model. Using the images I generated of my characters, I create some pictures of the scenes themselves. 41 | 9. Working like this, I create the storyboard for most parts of the film. 42 | 10. Then I go back and write the dialogue for each scene, and I'm done! 43 | 44 | At the end of this journey, the user has everything they need to get started producing a net new movie! 45 | 46 | # Map key points in user journey back to concepts from the book 47 | Now that we know what product we want to build, let's map the steps back to what we learned from the book. Given the excellent state of many open-source foundation models today, including those in language and vision, I strongly doubt we'll need to pretrain one from scratch to solve this problem. 48 | 49 | However, for the sake of argument, let's imagine we had more than 1TB of relevant, rich data to work with, and we're interested in pretraining a model like this from scratch. How might we go about it? A quick back-of-the-envelope sketch of the sizing math is below, followed by the step-by-step plan. 
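As a minimal sketch of the scaling-law arithmetic referenced in step 3 below: every number here is an assumption for illustration only. The ~20-tokens-per-parameter heuristic is the Chinchilla rule of thumb (Hoffmann et al., 2022), and ~4 bytes per token is a rough average for English text.

```python
TOKENS_PER_PARAM = 20    # Chinchilla-optimal rule of thumb
BYTES_PER_TOKEN = 4      # rough average for English text

dataset_bytes = 1e12     # "more than 1TB of relevant, rich data"
total_tokens = dataset_bytes / BYTES_PER_TOKEN       # ~250B tokens
optimal_params = total_tokens / TOKENS_PER_PARAM     # ~12.5B parameters

# approximate training compute: ~6 FLOPs per parameter per token
train_flops = 6 * optimal_params * total_tokens      # ~1.9e22 FLOPs

print(f"{total_tokens:.2e} tokens supports ~{optimal_params / 1e9:.1f}B params")
print(f"expect on the order of {train_flops:.2e} training FLOPs")
```

Under these assumptions, 1TB of text supports a compute-optimal model of roughly 12.5B parameters. That's exactly the kind of sanity check worth running before committing to a cluster size.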
50 | 51 | ### Part One - Preparation 52 | 1. ***Dataset analysis.*** The first thing I'd do is look at our datasets in Python, just to get a sense of their dimensions and characteristics. This includes bias detection and mitigation. 53 | 2. ***Model analysis.*** Next, work with a few top LLMs and vision models on this dataset. I'll produce a chart that tells me how far prompt engineering and fine-tuning take me. 54 | 3. ***Scaling law analysis.*** Then, I'd run a few hypotheses through the scaling law equations to make sure I'm thinking about the overall training runs appropriately. 55 | 4. ***Script preparation.*** Once I know which models I want to work with and their target sizes, I get my training scripts ready. 56 | 57 | ### Part Two - Training 58 | 5. ***Hyperparameter tuning.*** Once my scripts are ready, I run some hyperparameter tuning to find the right model settings. 59 | 6. ***Compilation.*** I try to compile my model, including for the Trainium custom hardware on AWS, to max out my cost savings. 60 | 7. ***Large-scale training.*** Then I train my model on AWS! I use warm pools, CloudWatch logs, SageMaker Debugger, and other features to log my runs. 61 | 8. ***Fine-tuning and evaluation.*** With my finished model, I run some fine-tuning jobs to focus the LLM on my target use case. I evaluate it using standard KPIs, including a mix of quantitative and qualitative analysis. 62 | 63 | ### Part Three - Hosting 64 | 9. ***Bias detection and mitigation.*** I run more bias detection and mitigation analysis. 65 | 10. ***Hosting.*** I shrink my model and host it on SageMaker endpoints. 66 | 11. ***Prompt engineering.*** I interact with my model using prompting to get the best performance. 67 | 12. ***MLOps.*** I build a complete end-to-end pipeline that wraps this entire experience into an interface and pipeline that customers love to use! 68 | 69 | This repository will present scripts and examples that show you how to do each of these steps, in the context of this hypothetical MovieChat app. 70 | 71 | # Disclaimer 72 | Please remember this entire project is just a hypothetical product that could be built with LLMs. I think it's a fun and interesting illustration of the concepts discussed in the book, and as such we'll explore the idea throughout the repository. 73 | 74 | 75 | 76 | -------------------------------------------------------------------------------- /preparation/README.md: -------------------------------------------------------------------------------- 1 | # Preparing to pretrain vision and language foundation models on AWS 2 | 3 | ### 1. Dataset analysis 4 | - Working with [pandas](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/) 5 | - Working with Hugging Face [transformers](https://github.com/nlp-with-transformers/notebooks/blob/main/01_introduction.ipynb) 6 | - Dataset augmentation suggestions from [DataCamp](https://www.datacamp.com/tutorial/complete-guide-data-augmentation) 7 | ### 2. Model analysis 8 | - This relates to Chapter 3 in the book, which has a rich set of links and papers referenced. I'll update this with some [scaling law](https://arxiv.org/abs/2203.15556) examples soon. 9 | 10 | ### 3. Prepping your containers and accelerators 11 | - This refers to Chapter 4 in the book, which also has some great references. 12 | - Here's a [link](https://github.com/aws/deep-learning-containers) to the AWS Deep Learning Containers. Pro tip - all the frameworks and packages work really well here right off the bat! 
13 | - Some documentation for using [Docker containers with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html) 14 | - Here are a few [notebook examples](https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/custom-training-containers) for building your own Training containers on SageMaker 15 | - More notebook examples, this time for working with [Trainium on SageMaker](https://github.com/aws-samples/sagemaker-trainium-examples/tree/main) 16 | 17 | ### 4. Distribution fundamentals 18 | - Data parallel notebook examples [on SageMaker](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/data_parallel). Includes both vision and language foundation model training! 19 | - Model parallel examples are available right [here on SageMaker](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel). 20 | -------------------------------------------------------------------------------- /training/README.md: -------------------------------------------------------------------------------- 1 | # Training foundation models on AWS 2 | 3 | ### 1. Building a distributed data loader 4 | - Working with datasets and [dataloaders in PyTorch](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 5 | - Using the Hugging Face [`Dataset` object with PyTorch](https://huggingface.co/docs/datasets/use_with_pytorch) 6 | - Profiling a data loader with [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data-loading-time.html) 7 | - SageMaker [processing jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) 8 | - Example data pipeline in Python for [training GPT-2 on SageMaker](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/data_pipeline.py) 9 | 10 | ### 2. Picking the right hyperparameters 11 | - 10+ example notebooks for [hyperparameter tuning on SageMaker](https://github.com/aws/amazon-sagemaker-examples/tree/main/hyperparameter_tuning) 12 | - Hyperband automatic model tuning for [distributed training example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/2e60fb1522d1b228a77d4979a0c4ae269a4afe9c/hyperparameter_tuning/model_tuning_for_distributed_training/hyperparameter_tuning_for_distributed_training.ipynb#L7) 13 | - More [guidance](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) on updating learning rate and batch size as a function of the overall accelerator world size 14 | 15 | ### 3. 
Large-scale training on SageMaker 16 | - End-to-end notebook for training up to a [30B parameter GPT-based model on SageMaker](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-simple.ipynb) 17 | - Basic example of model parallel training on SageMaker with the [Hugging Face `Trainer` API](https://github.com/huggingface/notebooks/blob/main/sagemaker/04_distributed_training_model_parallelism/sagemaker-notebook.ipynb) 18 | - Notebook for fine-tuning or pretraining [Stable Diffusion on SageMaker](https://github.com/aws-samples/sagemaker-distributed-training-workshop/blob/main/1_data_parallel/Lab1_stable_diffusion/fine_tune_stable_diffusion.ipynb) 19 | - Using [SageMaker warm pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html) to enhance your development cycles for the Training API 20 | - Notebook for using [PyTorch FSDP with SageMaker and Hugging Face](https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/sagemaker-notebook.ipynb) for 20B+ parameter models 21 | 22 | ### 4. Compiling your model 23 | - Detailed [walk-through from Chaim Rand](https://towardsdatascience.com/tips-and-tricks-for-upgrading-to-pytorch-2-3127db1d1f3d) on upgrading to PyTorch 2.0 on AWS and SageMaker. 24 | - Example [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-training-compiler/huggingface/pytorch_single_gpu_single_node/albert-base-v2/albert-base-v2.ipynb) of changing batch size and learning rate as a function of model compilation; uses SageMaker Training Compiler. 25 | - Guidance [from the Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html) on supported models and on compiling your model for `Trainium`, the AWS custom machine learning accelerator optimized for training. 26 | 27 | ### 5. Measuring throughput 28 | - I'll add an example here of computing TFLOPS per accelerator; see the sketch at the end of this README 29 | - Some [guidance](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html) on working with CloudWatch and SageMaker 30 | - Using [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-framework-profiling.html) 31 | 32 | ### 6. Fine-tuning your model 33 | - Fine-tuning BLOOM with LoRA using Hugging Face's parameter-efficient fine-tuning [on SageMaker](https://github.com/huggingface/notebooks/blob/main/sagemaker/24_train_bloom_peft_lora/sagemaker-notebook.ipynb) 34 | - Fine-tuning `Donut` for [SOTA document parsing](https://github.com/huggingface/notebooks/blob/main/sagemaker/26_document_ai_donut/sagemaker-notebook.ipynb) with Hugging Face on SageMaker. 35 | 36 | ### 7. Evaluating your model 37 | - This is too broad to have a single example for everything - that would be like trying to compute the wave angles of every ocean continuously :). However, for some concrete examples in language, take a look at Hugging Face's repository from their book **Natural Language Processing with Transformers** right [here](https://github.com/nlp-with-transformers/notebooks). 38 | - Alternatively, you can jump straight to the Hugging Face `evaluate` library, with many examples and [tutorials](https://huggingface.co/docs/evaluate/index). 
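
### 8. A sketch for computing TFLOPS per accelerator
As promised in section 5 above, here's a minimal sketch of the arithmetic. It uses the common approximation of ~6 FLOPs per parameter per token for one forward plus backward pass of a decoder-only transformer; every input in the example is a hypothetical number you'd swap for measurements from your own training job.

```python
def tflops_per_accelerator(param_count: float,
                           global_batch_tokens: int,
                           step_time_seconds: float,
                           world_size: int) -> float:
    """Achieved training throughput per accelerator, in TFLOPS.

    Approximates the FLOPs of one optimizer step as
    6 * parameters * tokens (forward + backward).
    """
    total_flops = 6 * param_count * global_batch_tokens
    return total_flops / step_time_seconds / world_size / 1e12

# Hypothetical example: a 30B-parameter model, 4M tokens per optimizer step,
# 32 accelerators, and a measured 150 seconds per step -> ~150 TFLOPS each.
print(tflops_per_accelerator(30e9, 4_000_000, 150.0, 32))
```

Compare the result against your accelerator's peak rating (for example, roughly 312 TFLOPS in BF16 on an A100) to get a utilization-style number; well-tuned large jobs often land somewhere in the 30-60% range.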
39 | -------------------------------------------------------------------------------- /v2_updates/README.md: -------------------------------------------------------------------------------- 1 | # Updates to book 2 | As it goes in our field, new techniques, methods, and technologies that are helpful to the pretraining process emerge constantly. I'll track these here! 3 | 4 | #### How to request an addition 5 | If you'd like to see another paper, trend, model, or technique referenced in this title, feel free to [log an issue](https://github.com/PacktPublishing/Pretrain-Vision-and-Large-Language-Models-in-Python/issues) in this repository and I'll add it to this list. 6 | 7 | ## Papers to add 8 | - Inverse scaling laws for [CLIP](https://arxiv.org/pdf/2305.07017.pdf). Just wait, I am positive these will be explored in language. 9 | --------------------------------------------------------------------------------