├── README.md ├── day1 ├── ML_overview.ipynb └── slides.pdf ├── day2 ├── Intro_Machine_Learning_Part2.ipynb ├── nba_player_statistics.csv ├── slides_day2.pdf └── sms_spam.csv ├── day3 ├── Intro_Machine_Learning_Part3.ipynb └── slides_day3.pdf ├── day4 ├── intro_ml_day4_autoencoder_basic.ipynb ├── intro_ml_day4_basic_cnn_cifar_10.ipynb ├── intro_ml_day4_basic_gan_mnist.ipynb ├── intro_ml_day4_basic_lstm_airlines.ipynb ├── intro_ml_day4_variational_auteoencoder.ipynb ├── material_from_2023 │ ├── intro_to_ml_day4.ipynb │ └── intro_to_ml_day4.pdf ├── survey_of_nn_architectures.pdf └── transformers.ipynb └── past_hackathons ├── computer_vision_hackathon ├── intro_to_ML_day5_computer_vision_1.ipynb ├── intro_to_ML_day5_computer_vision_2.ipynb ├── intro_to_ML_day5_computer_vision_3.ipynb ├── intro_to_ml_day5_CNNs.pdf └── material_from_2023 │ ├── day5_computer_vision_hackathon_notebook1.ipynb │ ├── day5_computer_vision_hackathon_notebook2.ipynb │ ├── day5_computer_vision_hackathon_notebook3_transfer_learning.ipynb │ └── intro_to_ml_day5.pptx ├── diffusion_models_hackathon ├── diffusion_attention.ipynb └── diffusion_no_attention.ipynb ├── humanities_and_social_sciences └── 1-MNIST-demo-filled.ipynb ├── large_language_models_hackathon ├── LLM_Finetuning.ipynb ├── README.md └── llm_slides.pdf ├── natural_language_processing_hackathon ├── day5_nlp_movie_reviews_notebook1_bag_of_words.ipynb ├── day5_nlp_movie_reviews_notebook2_SOLUTION_and_llm_comparison.ipynb ├── day5_nlp_movie_reviews_notebook2_hackathon.ipynb └── day5_nlp_movie_reviews_notebook2_hackathon_HINTS.ipynb └── quarterback_performance_hackathon └── NFL_QB_Data.csv /README.md: -------------------------------------------------------------------------------- 1 | # A Hands-On Introduction to Machine Learning 2 | 3 | This mini-course provides a comprehensive introduction to machine learning. Part 1 introduces the machine learning process and shows participants how to train simple models. 
Part 2 covers model evaluation and refinement. Artificial neural networks are introduced in Part 3. A survey of different neural network architectures is presented in Part 4. The mini-course concludes with specialized sessions during Part 5 where participants will choose from one of multiple domains (natural language processing, graph neural networks, physical sciences). 4 | 5 | Attendees should have some familiarity with Python and basic calculus. This mini-course will be held during [Wintersession 2025](https://winter.princeton.edu). 6 | 7 | ### Days 1-4 8 | 9 |         [A Hands-On Introduction to Machine Learning](https://cglink.me/2gi/r1951382) 10 |         January 15, 16, 17, 21 (2025) at 2:00-4:00 PM 11 |         Location: Lewis Library 120 12 |         Instructors: 13 |         Julian Gold, DataX Data Scientist, CSML 14 |         Gage DeZoort, Postdoctoral Research Associate and Lecturer, Physics 15 | 16 | ### Day 5 (and 6) 17 | 18 | Choose one of these options: 19 | 20 | * [Getting Started with Large Language Models with Princeton Language and Intelligence](https://cglink.me/2gi/r1951386) (Parts 1 & 2) 21 | January 22-23, 2025 at 2:00-4:00 PM 22 | Location: Lewis Library 120 23 | Instructors: 24 | Simon Park, Graduate Student, Computer Science and PLI 25 | Abhishek Panigrahi, Graduate Student, Computer Science and PLI 26 | 27 | * [Machine Learning for the Physical Sciences](https://cglink.me/2gi/r1951387) 28 | Wednesday, January 22, 2025 at 2:00-3:30 PM 29 | Location: Lewis Library 134 30 | Instructors: 31 | Christian Jespersen, Graduate Student, Astrophysical Sciences 32 | Rafael Pastrana, Graduate Student, Architecture 33 | Quinn Gallagher, Graduate Student, Chemical and Biological Engineering 34 | Holly Johnson, Graduate Student, Electrical and Computer Engineering 35 | 36 | * [Graph Neural Networks for Your Research](https://cglink.me/2gi/r1951388) 37 | Wednesday, January 22, 2025 at 2:00-4:00 PM 38 | Location: Lewis Library 122 39 | Instructor: Gage 
DeZoort, Postdoctoral Research Associate and Lecturer, Physics 40 | 41 | ### Before the Mini-Course 42 | 43 | To prepare for this mini-course, consider attending: 44 | 45 |         [Introduction to Machine Learning for Humanists and Social Scientists](https://cglink.me/2gi/r1952533) (Parts 1 & 2) 46 |         January 13-14, 2025 at 10:00 AM-12:00 PM 47 |         Location: Arthur Lewis Auditorium in Robertson Hall 48 |         Instructor: Sarah-Jane Leslie, Professor of Philosophy and CSML, and NAM Co-Director 49 | 50 | ### After the Mini-Course 51 | 52 | Continue learning about machine learning and data science by attending the following: 53 | 54 |         [Introduction to Optimal Transport: Applications to Machine Learning, Cognitive Science, and Comp. Biology](https://cglink.me/2gi/r1952543) 55 |         Thursday, January 23, 2025 at 10:30 AM-1:30 PM 56 |         Location: Bendheim House 103 57 |         Instructors: 58 |         Sarah-Jane Leslie, Professor of Philosophy and CSML, and NAM Co-Director 59 |         Julian Gold, DataX Data Scientist, CSML 60 | 61 | ## Colab Not Working? 62 | 63 | You can run the notebooks for days 1 and 2 of this workshop using only a web browser thanks to jupyterlite. 64 | 65 | Step 1: Go to [https://jdh4.github.io/intro-ml](https://jdh4.github.io/intro-ml) 66 | 67 | Step 2: In the file browser on the left, double click on `ML_overview_2024.ipynb` for day 1 or `Intro_Machine_Learning_Part2_2024.ipynb` for day 2 . You can then run the notebook as usual without using Colab or explicitly installing anything. The notebooks will run on your local machine. 68 | 69 | [[2](https://colab.research.google.com/github/PrincetonUniversity/intro_machine_learning/blob/main/day2/Intro_Machine_Learning_Part2.ipynb)] [[3]](https://colab.research.google.com/github/PrincetonUniversity/intro_machine_learning/blob/main/day3/Intro_Machine_Learning_Part3.ipynb) 70 | 71 | ## Undergraduate A.I. 
Conference at Princeton 72 | 73 | The Princeton undergraduate student group, [Envision](https://www.envisionprinceton.com/#page1), is hosting a day-long student-focused conference on February 22, 2025. The conference aims to explore the intersection of A.I., information tech policy, and ethics, with the goal of educating, inspiring action, and shaping tomorrow’s leaders. 74 | 75 | ## Authorship 76 | 77 | The materials in this repository were created by Brian Arnold, Gage DeZoort, Julian Gold, 78 | Jonathan Halverson, Christina Peters, Jake Snell, Savannah Thias and Amy Winecoff. 79 | -------------------------------------------------------------------------------- /day1/slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/day1/slides.pdf -------------------------------------------------------------------------------- /day2/slides_day2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/day2/slides_day2.pdf -------------------------------------------------------------------------------- /day3/slides_day3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/day3/slides_day3.pdf 
-------------------------------------------------------------------------------- /day4/intro_ml_day4_basic_lstm_airlines.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "\"Open" 21 | ], 22 | "metadata": { 23 | "id": "zpCkWYWdmgbk" 24 | } 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "source": [ 29 | "# Long Short-Term Memory for Sequential Predictions\n", 30 | "Introduction to Machine Learning (Day 4)\\\n", 31 | "Princeton University Wintersession\\\n", 32 | "Gage DeZoort\\\n", 33 | "\\\n", 34 | "Based on several helpful tutorials:\\\n", 35 | "[1] [LSTM for Time Series Prediction in PyTorch](https://machinelearningmastery.com/lstm-for-time-series-prediction-in-pytorch/)\\\n", 36 | "[2] [Predicting airline passengers using LSTM and Tensorflow](https://matthewmacfarquhar.medium.com/predicting-airline-passengers-using-lstm-and-tensorflow-ab86347cf318)\n" 37 | ], 38 | "metadata": { 39 | "id": "EcsMvCfajcyc" 40 | } 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "source": [ 45 | "## Temperature Predictions" 46 | ], 47 | "metadata": { 48 | "id": "6kgvYxPHRET4" 49 | } 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "source": [ 54 | "This project is based on Long Short-Term Memory (LSTM) modules. LSTMs belong to the class of Recurrent Neural Networks (RNNs), which operate on sequential data (ordered data, indexed by time or space). 
For example, the daily temperature is a time series we all experience:\n", 55 | "\n", 56 | "![Weather](https://www.influxdata.com/wp-content/uploads/time-series-data-weather-data.png \"weather\")\n", 57 | "(Image from [this article](https://www.influxdata.com/what-is-time-series-data/))\n", 58 | "\n", 59 | "Sentences are another example of sequential data:\n", 60 | "\n", 61 | "![Sentences](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*phpgEszN4Q6n_Rtd24zpGw.png \"sentences\")\n", 62 | "(Image from [this article](https://bansalh944.medium.com/text-generation-using-lstm-b6ced8629b03))\n", 63 | "\n", 64 | "We see that sequential data is everywhere! RNNs have accordingly been applied to a wide variety of domains, including:\n", 65 | "\n", 66 | "- Natural Language Processing (NLP): translation, word prediction, sentiment analysis\n", 67 | "- Time-Series Analysis: financial predictions, weather/climate forecasting\n", 68 | "- Music Generation: e.g. composition\n", 69 | "- Robotics: e.g. path predictions\n", 70 | "\n", 71 | "How do RNNs work? Here's a helpful diagram:\n", 72 | "\n", 73 | "![RNN](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png \"rnn\")\n", 74 | "(Image from [this article](https://colah.github.io/posts/2015-08-Understanding-LSTMs/))\n", 75 | "\n", 76 | "In this diagram, $A$ represents the NN. Here, we take $x_t$ to represent the sequence of inputs ($x_0$, $x_1$, $x_2$,...,$x_N$), and $h_t$ its sequence of outputs ($h_0$, $h_1$, $h_2$,...,$h_N$). The sequential nature of the predictions is highlighted by the rightward arrows; the prediction at each timestep is informed by the prediction at the previous timestep. Unfortunately, it has been shown that simple RNNs *fail to learn long-term dependencies*. 
This was the motivation for developing LSTMs.\n", 77 | "\n", 78 | "Okay, let's switch to a bit of coding.\n" 79 | ], 80 | "metadata": { 81 | "id": "MqBfwwg3kDrf" 82 | } 83 | }, 84 | { 85 | "cell_type": "code", 86 | "source": [ 87 | "import torch\n", 88 | "import numpy as np\n", 89 | "import random\n", 90 | "import pandas as pd\n", 91 | "import matplotlib.pyplot as plt\n", 92 | "\n", 93 | "# grab data\n", 94 | "!wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv" 95 | ], 96 | "metadata": { 97 | "id": "-E3OAoyMka0_" 98 | }, 99 | "execution_count": null, 100 | "outputs": [] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "source": [ 105 | "## Dataset Preparation" 106 | ], 107 | "metadata": { 108 | "id": "0MIJL6AZRLBk" 109 | } 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "source": [ 114 | "In the last line of the cell above, we grabbed a CSV (comma-separated value) file called `airline-passengers.csv`. Let's use Pandas to explore the data." 115 | ], 116 | "metadata": { 117 | "id": "V86GylF4r2RI" 118 | } 119 | }, 120 | { 121 | "cell_type": "code", 122 | "source": [ 123 | "df = pd.read_csv(\"airline-passengers.csv\")\n", 124 | "df.head()" 125 | ], 126 | "metadata": { 127 | "id": "tC-c2ghnra2d" 128 | }, 129 | "execution_count": null, 130 | "outputs": [] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "source": [ 135 | "We see that this is a time-series, counting the number of passengers (in units of 1,000) between Jan. 1949 and Dec. 1960, corresponding to 12 years and 144 observations. 
Let's plot the trend:" 136 | ], 137 | "metadata": { 138 | "id": "VaZLbY4esK8P" 139 | } 140 | }, 141 | { 142 | "cell_type": "code", 143 | "source": [ 144 | "plt.plot(df.Passengers)\n", 145 | "plt.xlabel(\"Months Since 01/1949\")\n", 146 | "plt.ylabel(\"Airline Passengers / 1,000\")\n", 147 | "plt.show()" 148 | ], 149 | "metadata": { 150 | "id": "kqaIZNjirzTb" 151 | }, 152 | "execution_count": null, 153 | "outputs": [] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "source": [ 158 | "What trends do you observe?" 159 | ], 160 | "metadata": { 161 | "id": "S9BF-m-PuRla" 162 | } 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "source": [ 167 | "The upward trend will be difficult for the ML model to capture given the limited size of the dataset. We can simply remove it before training the algorithm, then add it back! Our goal will be to fit a quadratic to the data:\n", 168 | "\n", 169 | "`P(m) = x_0 + x_1 * m + x_2 * m^2`\n", 170 | "\n", 171 | "Where `P(m)` is the number of passengers in a given month `m`. Let's grab the regression coefficients:" 172 | ], 173 | "metadata": { 174 | "id": "wyj9QoI8QdWU" 175 | } 176 | }, 177 | { 178 | "cell_type": "code", 179 | "source": [ 180 | "N = len(df)\n", 181 | "ones, xrange = np.ones(N), np.arange(N)\n", 182 | "X = np.stack((ones, xrange, xrange**2)).T\n", 183 | "y = df.Passengers.to_numpy().reshape(-1,1)\n", 184 | "beta = (np.linalg.inv(X.T @ X)@X.T@y)\n", 185 | "x0, x1, x2 = beta[0][0], beta[1][0], beta[2][0]\n", 186 | "x0, x1, x2" 187 | ], 188 | "metadata": { 189 | "id": "kZ883Os7Ako0" 190 | }, 191 | "execution_count": null, 192 | "outputs": [] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "source": [ 197 | "### Exercise 1\n", 198 | "\n", 199 | "(3 mins) We've given you a plot of the original data (`passengers`) below; add to this plot 1) the regression line and 2) `passengers` with the regression line subtracted." 
200 | ], 201 | "metadata": { 202 | "id": "M5wpPvQURCle" 203 | } 204 | }, 205 | { 206 | "cell_type": "code", 207 | "source": [ 208 | "passengers = df.Passengers\n", 209 | "plt.plot(passengers, label=\"Raw Data\")\n", 210 | "\n", 211 | "# compute the regression line (\"trend\") as a function of xrange\n", 212 | "\n", 213 | "# compute passengers_c = passengers - trend\n", 214 | "\n", 215 | "# plot them both\n", 216 | "\n", 217 | "\n", 218 | "plt.xlabel(\"Months Since 01/1949\")\n", 219 | "plt.ylabel(\"Airline Passengers / 1,000\")\n", 220 | "plt.legend(loc=\"best\")\n", 221 | "plt.show()" 222 | ], 223 | "metadata": { 224 | "id": "LNi3ywgiBf3a" 225 | }, 226 | "execution_count": null, 227 | "outputs": [] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "source": [ 232 | "## Building Train/Test Sets" 233 | ], 234 | "metadata": { 235 | "id": "GmhrvNYJSpW7" 236 | } 237 | }, 238 | { 239 | "cell_type": "code", 240 | "source": [ 241 | "# convert everything to plain arrays\n", 242 | "passengers = passengers.values.astype(\"float32\").reshape(-1,1)\n", 243 | "passengers_c = passengers_c.values.astype(\"float32\").reshape(-1,1)\n", 244 | "passengers_c.shape" 245 | ], 246 | "metadata": { 247 | "id": "2kek8arZC-B0" 248 | }, 249 | "execution_count": null, 250 | "outputs": [] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "source": [ 255 | "Let's turn this into an ML task. Given 2/3 of the time series, can we predict the remaining 1/3? 
Fundamentally, that means we're doing regression.\n" 256 | ], 257 | "metadata": { 258 | "id": "2SOsKDecu1gI" 259 | } 260 | }, 261 | { 262 | "cell_type": "code", 263 | "source": [ 264 | "# split into train/test\n", 265 | "train_size = int(len(passengers_c)*0.67)\n", 266 | "train, test = passengers_c[:train_size], passengers_c[train_size:]\n", 267 | "train.shape, test.shape" 268 | ], 269 | "metadata": { 270 | "id": "bPpafstXuwlg" 271 | }, 272 | "execution_count": null, 273 | "outputs": [] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "source": [ 278 | "To train a model, we need to show it data in the time interval $[t-w, t-1]$, where $w$ is the window or \"lookback\" size, and ask it to make predictions for the timestep $t$. To do this, we need to turn our training data into $(X,y)$ pairs, $X\in\mathbb{R}^{w}$, $y\in\mathbb{R}$, where $X$ represents the inputs (the $w$ previous values) and $y$ represents the target (the value at timestep $t$)." 279 | ], 280 | "metadata": { 281 | "id": "MnJ6CLD8wnPC" 282 | } 283 | }, 284 | { 285 | "cell_type": "code", 286 | "source": [ 287 | "def create_dataset(dataset, w=1):\n", 288 | " X,Y = [], []\n", 289 | " for i in range(len(dataset)-w-1):\n", 290 | " X.append(dataset[i:(i+w), 0])\n", 291 | " Y.append(dataset[i+w, 0])\n", 292 | " X, Y = torch.Tensor(X), torch.Tensor(Y)\n", 293 | " return X, Y.reshape(len(Y),1)" 294 | ], 295 | "metadata": { 296 | "id": "p4ce4gDWvCCb" 297 | }, 298 | "execution_count": null, 299 | "outputs": [] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "source": [ 304 | "w = 3\n", 305 | "X_train, y_train = create_dataset(train, w=w)\n", 306 | "X_test, y_test = create_dataset(test, w=w)" 307 | ], 308 | "metadata": { 309 | "id": "r_e5FONTwNeJ" 310 | }, 311 | "execution_count": null, 312 | "outputs": [] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "source": [ 317 | "### Exercise 2\n", 318 | "\n", 319 | "Print the sizes of `X_train`, `y_train`, `X_test`, and `y_test`. Do they make sense? What happens if you increase/decrease `w`?" 
320 | ], 321 | "metadata": { 322 | "id": "jlxOy4syS86Y" 323 | } 324 | }, 325 | { 326 | "cell_type": "code", 327 | "source": [], 328 | "metadata": { 329 | "id": "fdGRmXTMTNS0" 330 | }, 331 | "execution_count": null, 332 | "outputs": [] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "source": [ 337 | "## Building an LSTM\n", 338 | "Now we're going to build an LSTM! Here's the PyTorch model:" 339 | ], 340 | "metadata": { 341 | "id": "NeSZ_2la3JRH" 342 | } 343 | }, 344 | { 345 | "cell_type": "code", 346 | "source": [ 347 | "import torch.nn as nn\n", 348 | "\n", 349 | "class AirModel(nn.Module):\n", 350 | " def __init__(self):\n", 351 | " super().__init__()\n", 352 | " self.lstm = nn.LSTM(\n", 353 | " input_size=w,\n", 354 | " hidden_size=64,\n", 355 | " num_layers=1,\n", 356 | " batch_first=False\n", 357 | " )\n", 358 | " self.linear1 = nn.Linear(\n", 359 | " in_features=64,\n", 360 | " out_features=1,\n", 361 | " )\n", 362 | " self.relu = torch.nn.ReLU()\n", 363 | "\n", 364 | " def forward(self, x):\n", 365 | " x, _ = self.lstm(x)\n", 366 | " return self.linear1(x)" 367 | ], 368 | "metadata": { 369 | "id": "lNupbk4E0YKH" 370 | }, 371 | "execution_count": null, 372 | "outputs": [] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "source": [ 377 | "This model has several components. The main workhorse is the **LSTM Module**: see the [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) for details of the implementation. Here's a diagram:\n", 378 | "\n", 379 | "![LSTM](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*kT7TJdlJflJJSnEJ6XRKug.png \"lstm\")\\\n", 380 | "(Image from [this article](https://bansalh944.medium.com/text-generation-using-lstm-b6ced8629b03))\n", 381 | "\\\n", 382 | "\\\n", 383 | "This is a single LSTM \"block\" corresponding to the timestep $t$. 
There's a lot going on here, but here's the gist:\n", 384 | "\n", 385 | "- The LSTM block at timestep $t$ is fed by the input $x_t$ (# passengers), the output from the previous block $h_{t-1}$, and the memory from the previous block $c_{t-1}$.\n", 386 | "- The LSTM block at timestep $t$ is composed of several logical gates. These include an input gate, a forget gate, a cell gate, and an output gate. The full system of equations is\n", 387 | "\n", 388 | "$$\n", 389 | "\\begin{align*}\n", 390 | " i_t &= \\sigma(W_{ii} x_t + b_{ii} + W_{hi}h_{t-1} + b_{hi})\\ &\\rightarrow \\ \\ \\text{input gate} \\\\\n", 391 | " f_t &= \\sigma(W_{if}x_t + b_{if} + W_{hf}h_{t-1} + b_{hf})\\ &\\rightarrow \\ \\ \\text{forget gate}\\\\\n", 392 | " g_t &= \\text{tanh}(W_{ig}x_t + b_{ig} + W_{hg}h_{t-1} + b_{hg})\\ &\\rightarrow \\ \\ \\text{cell features}\\\\\n", 393 | " o_t &= \\sigma(W_{io}x_t + b_{io} + W_{ho}h_{t-1} + b_{ho})\\ &\\rightarrow \\ \\ \\text{output gate}\\\\\n", 394 | " c_t &= f_t \\odot c_{t-1} + i_t \\odot g_t\\ &\\rightarrow \\ \\ \\text{cell state (memory)}\\\\\n", 395 | " h_t &= o_t \\odot \\text{tanh}(c_t) \\ &\\rightarrow \\ \\ \\text{hidden state}\n", 396 | "\\end{align*}\n", 397 | "$$\n", 398 | "\n", 399 | "In practice, the PyTorch module `nn.LSTM()` has inputs `input_size` corresponding to the dimension of $x_t$, `hidden_size` corresponding to the size of the outputs $h_{t}$, and `num_layers` corresponding to the number of \"stacked\" LSTM modules. 
Let's train our model:" 400 | ], 401 | "metadata": { 402 | "id": "2qkn_eY03ZA0" 403 | } 404 | }, 405 | { 406 | "cell_type": "code", 407 | "source": [ 408 | "import numpy as np\n", 409 | "import torch.optim as optim\n", 410 | "import torch.utils.data as data\n", 411 | "\n", 412 | "model = AirModel()\n", 413 | "optimizer = optim.Adam(model.parameters(), lr=10**-3)\n", 414 | "loss_fn = nn.MSELoss()\n", 415 | "loader = data.DataLoader(\n", 416 | " data.TensorDataset(X_train, y_train),\n", 417 | " shuffle=True,\n", 418 | " batch_size=8,\n", 419 | ")\n", 420 | "\n", 421 | "n_epochs = 1000\n", 422 | "for epoch in range(n_epochs):\n", 423 | " model.train()\n", 424 | " for X_batch, y_batch in loader:\n", 425 | " y_pred = model(X_batch)\n", 426 | " loss = loss_fn(y_pred, y_batch)\n", 427 | " optimizer.zero_grad()\n", 428 | " loss.backward()\n", 429 | " optimizer.step()\n", 430 | " # Validation\n", 431 | " if epoch % 50 != 0:\n", 432 | " continue\n", 433 | " model.eval()\n", 434 | " with torch.no_grad():\n", 435 | " y_pred = model(X_train)\n", 436 | " train_rmse = np.sqrt(loss_fn(y_pred, y_train))\n", 437 | " y_pred = model(X_test)\n", 438 | " test_rmse = np.sqrt(loss_fn(y_pred, y_test))\n", 439 | " print(\"Epoch %d: train RMSE %.4f, test RMSE %.4f\" % (epoch, train_rmse, test_rmse))" 440 | ], 441 | "metadata": { 442 | "id": "iWoGxJQx4WRM" 443 | }, 444 | "execution_count": null, 445 | "outputs": [] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "source": [ 450 | "with torch.no_grad():\n", 451 | " train_plot = np.ones(len(passengers_c)) * np.nan\n", 452 | " train_plot[w:train_size-1] = model(X_train).flatten()\n", 453 | " test_plot = np.ones(len(passengers_c)) * np.nan\n", 454 | " test_plot[train_size+w:len(passengers_c)-1] = model(X_test).flatten()\n", 455 | "\n", 456 | "plt.plot(passengers_c.flatten() + trend, c='b', label=\"Truth\")\n", 457 | "plt.plot(train_plot + trend, c='r', label=\"Training\")\n", 458 | "plt.plot(test_plot + trend, c='g', label=\"Predicted\")\n", 
459 | "plt.show()\n" 460 | ], 461 | "metadata": { 462 | "id": "C1DbmBP7_azt" 463 | }, 464 | "execution_count": null, 465 | "outputs": [] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "source": [ 470 | "Clearly, the LSTM understands how to model the trends in the data, but is having issues capturing the magnitude of the seasonal fluctuations." 471 | ], 472 | "metadata": { 473 | "id": "o8L0-OP5Ue3w" 474 | } 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "source": [ 479 | "### Exercise 3\n", 480 | "Can you improve the performance of the model? The full code from the notebook is reproduced below for your convenience. You may want to explore:\n", 481 | "\n", 482 | "0) Adding a second linear layer to the output of your network (see comments in the model below). \\\n", 483 | "1) Adjusting the learning rate.\\\n", 484 | "2) Increasing the size (# neurons, # layers) of the NN.\\\n", 485 | "3) Changing `w` (note that this alters the nature of the learning task, but may still be fun to explore).\\\n", 486 | "4) Changing the batch size." 
487 | ], 488 | "metadata": { 489 | "id": "NvNO2WdrUmBS" 490 | } 491 | }, 492 | { 493 | "cell_type": "code", 494 | "source": [ 495 | "import torch\n", 496 | "import numpy as np\n", 497 | "import random\n", 498 | "import pandas as pd\n", 499 | "import matplotlib.pyplot as plt\n", 500 | "import torch.nn as nn\n", 501 | "import torch.optim as optim\n", 502 | "import torch.utils.data as data\n", 503 | "\n", 504 | "df = pd.read_csv(\"airline-passengers.csv\")\n", 505 | "\n", 506 | "# fit the trend\n", 507 | "N = len(df)\n", 508 | "ones, xrange = np.ones(N), np.arange(N)\n", 509 | "X = np.stack((ones, xrange, xrange**2)).T\n", 510 | "y = df.Passengers.to_numpy().reshape(-1,1)\n", 511 | "beta = (np.linalg.inv(X.T @ X)@X.T@y)\n", 512 | "x0, x1, x2 = beta[0][0], beta[1][0], beta[2][0]\n", 513 | "\n", 514 | "# convert everything to plain arrays\n", 515 | "passengers = df.Passengers\n", 516 | "trend = x0 + xrange*x1 + xrange**2 * x2\n", 517 | "passengers_c = passengers - trend\n", 518 | "passengers = passengers.values.astype(\"float32\").reshape(-1,1)\n", 519 | "passengers_c = passengers_c.values.astype(\"float32\").reshape(-1,1)\n", 520 | "passengers_c.shape\n", 521 | "\n", 522 | "# split into train/test\n", 523 | "train_size = int(len(passengers_c)*0.67)\n", 524 | "train, test = passengers_c[:train_size], passengers_c[train_size:]\n", 525 | "train.shape, test.shape\n", 526 | "\n", 527 | "# define the model\n", 528 | "class AirModel(nn.Module):\n", 529 | " def __init__(self):\n", 530 | " super().__init__()\n", 531 | " self.lstm = nn.LSTM(\n", 532 | " input_size=w,\n", 533 | " hidden_size=64,\n", 534 | " num_layers=1,\n", 535 | " batch_first=False\n", 536 | " )\n", 537 | " self.linear1 = nn.Linear(\n", 538 | " in_features=64,\n", 539 | " out_features=1, #64,\n", 540 | " )\n", 541 | " #self.linear2 = nn.Linear(\n", 542 | " # in_features=64,\n", 543 | " # out_features=1,\n", 544 | " #)\n", 545 | " self.relu = torch.nn.ReLU()\n", 546 | "\n", 547 | " def forward(self, x):\n", 548 
| " x, _ = self.lstm(x)\n", 549 | " return self.linear1(x)\n", 550 | "\n", 551 | "model = AirModel()\n", 552 | "optimizer = optim.Adam(model.parameters(), lr=10**-3)\n", 553 | "loss_fn = nn.MSELoss()\n", 554 | "loader = data.DataLoader(\n", 555 | " data.TensorDataset(X_train, y_train),\n", 556 | " shuffle=True,\n", 557 | " batch_size=8,\n", 558 | ")\n", 559 | "\n", 560 | "n_epochs = 1000\n", 561 | "for epoch in range(n_epochs):\n", 562 | " model.train()\n", 563 | " for X_batch, y_batch in loader:\n", 564 | " y_pred = model(X_batch)\n", 565 | " loss = loss_fn(y_pred, y_batch)\n", 566 | " optimizer.zero_grad()\n", 567 | " loss.backward()\n", 568 | " optimizer.step()\n", 569 | " # Validation\n", 570 | " if epoch % 50 != 0:\n", 571 | " continue\n", 572 | " model.eval()\n", 573 | " with torch.no_grad():\n", 574 | " y_pred = model(X_train)\n", 575 | " train_rmse = np.sqrt(loss_fn(y_pred, y_train))\n", 576 | " y_pred = model(X_test)\n", 577 | " test_rmse = np.sqrt(loss_fn(y_pred, y_test))\n", 578 | " print(\"Epoch %d: train RMSE %.4f, test RMSE %.4f\" % (epoch, train_rmse, test_rmse))\n", 579 | "\n", 580 | "with torch.no_grad():\n", 581 | " train_plot = np.ones(len(passengers_c)) * np.nan\n", 582 | " train_plot[w:train_size-1] = model(X_train).flatten()\n", 583 | " test_plot = np.ones(len(passengers_c)) * np.nan\n", 584 | " test_plot[train_size+w:len(passengers_c)-1] = model(X_test).flatten()\n", 585 | "\n", 586 | "plt.plot(passengers_c.flatten() + trend, c='b', label=\"Truth\")\n", 587 | "plt.plot(train_plot + trend, c='r', label=\"Training\")\n", 588 | "plt.plot(test_plot + trend, c='g', label=\"Predicted\")\n", 589 | "plt.show()" 590 | ], 591 | "metadata": { 592 | "id": "_Hc3-ZEE_z9q" 593 | }, 594 | "execution_count": null, 595 | "outputs": [] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "source": [ 600 | "## Exercise Solutions" 601 | ], 602 | "metadata": { 603 | "id": "h9e5ZEp0STqi" 604 | } 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "source": [ 609 
| "```\n", 610 | "Exercise 1\n", 611 | "\n", 612 | "passengers = df.Passengers\n", 613 | "plt.plot(passengers, label=\"Raw Data\")\n", 614 | "trend = x0 + xrange*x1 + xrange**2 * x2\n", 615 | "passengers_c = passengers - trend\n", 616 | "plt.plot(trend, \"r-\", label=\"Global Trend\")\n", 617 | "plt.plot(passengers_c, \"g--\", label=\"Corrected\")\n", 618 | "plt.xlabel(\"Months Since 01/1949\")\n", 619 | "plt.ylabel(\"Airline Passengers / 1,000\")\n", 620 | "plt.legend(loc=\"best\")\n", 621 | "plt.show()\n", 622 | "\n", 623 | "```" 624 | ], 625 | "metadata": { 626 | "id": "KpmyXNSfSO0s" 627 | } 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "source": [ 632 | "```\n", 633 | "Exercise 2\n", 634 | "\n", 635 | "In summary, the size of `X` is [number of windows, w] and the size of `y` is [number of windows, 1].\n", 636 | "print(X_train.shape)\n", 637 | "print(y_train.shape)\n", 638 | "print(X_test.shape)\n", 639 | "print(y_test.shape)\n", 640 | "print(X_train[0])\n", 641 | "print(y_train[0])\n", 642 | "\n", 643 | "What do the shapes of these tensors tell us?\n", 644 | "- With `w = 3`, `X_train` has 92 entries of the form [t-3, t-2, t-1]. `y_train` has 92 entries of the form [t].\n", 645 | "The story is similar for the test set, which has 44 entries. Increasing `w` by one removes one entry from each set.\n", 646 | "```" 647 | ], 648 | "metadata": { 649 | "id": "w0_pRUrbTTeS" 650 | } 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "source": [ 655 | "```\n", 656 | "Exercise 3\n", 657 | "Adding a second linear layer definitely helps. 
Slowing the learning rate helps a bit too.\n", 658 | "```" 659 | ], 660 | "metadata": { 661 | "id": "oqOqCtrnmYHq" 662 | } 663 | }, 664 | { 665 | "cell_type": "code", 666 | "source": [], 667 | "metadata": { 668 | "id": "kFMnCMd0AHe2" 669 | }, 670 | "execution_count": null, 671 | "outputs": [] 672 | } 673 | ] 674 | } 675 | -------------------------------------------------------------------------------- /day4/intro_ml_day4_variational_auteoencoder.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "uk4qqxPI7Wy6" 7 | }, 8 | "source": [ 9 | "# Implementation of Variational AutoEncoder (VAE)\n", 10 | "\n", 11 | "This notebook is based on the [work of Jackson Kang](https://github.com/Jackson-Kang/Pytorch-VAE-tutorial).\n", 12 | "\n", 13 | "Below is a schematic illustration of a variational autoencoder ([image credit](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73)):" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "source": [ 19 | "![VAE](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*Qd1xKV9o-AnWtfIDhhNdFg@2x.png)" 20 | ], 21 | "metadata": { 22 | "id": "LBGxnRyR-9Df" 23 | } 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "id": "5WfQ_TkU7Wy-" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import torch\n", 34 | "import torch.nn as nn\n", 35 | "\n", 36 | "import numpy as np\n", 37 | "\n", 38 | "from tqdm import tqdm\n", 39 | "from torchvision.utils import save_image, make_grid" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "id": "bJ432hVf7WzA" 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "# Model Hyperparameters\n", 51 | "\n", 52 | "dataset_path = '~/datasets'\n", 53 | "\n", 54 | "cuda = torch.cuda.is_available()\n", 55 | "DEVICE = torch.device(\"cuda\" if cuda else \"cpu\")\n", 56 | 
"\n", 57 | "print(cuda)\n", 58 | "batch_size = 100\n", 59 | "\n", 60 | "x_dim = 784\n", 61 | "hidden_dim = 400\n", 62 | "latent_dim = 200\n", 63 | "\n", 64 | "lr = 1e-3\n", 65 | "\n", 66 | "epochs = 30 if cuda else 15" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": { 72 | "id": "xPhJrsqt7WzA" 73 | }, 74 | "source": [ 75 | "### Step 1. Load (or download) Dataset" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "id": "iB-v9gdV7WzB" 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "from torchvision.datasets import MNIST\n", 87 | "import torchvision.transforms as transforms\n", 88 | "from torch.utils.data import DataLoader\n", 89 | "\n", 90 | "\n", 91 | "mnist_transform = transforms.Compose([\n", 92 | " transforms.ToTensor(),\n", 93 | "])\n", 94 | "\n", 95 | "kwargs = {'num_workers': 1, 'pin_memory': True}\n", 96 | "\n", 97 | "train_dataset = MNIST(dataset_path, transform=mnist_transform, train=True, download=True)\n", 98 | "test_dataset = MNIST(dataset_path, transform=mnist_transform, train=False, download=True)\n", 99 | "\n", 100 | "train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, **kwargs)\n", 101 | "test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False, **kwargs)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": { 107 | "id": "AB3im3KB7WzB" 108 | }, 109 | "source": [ 110 | "### Step 2. 
Define our model: Variational AutoEncoder (VAE)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "id": "p7qgRe-37WzC" 118 | }, 119 | "outputs": [], 120 | "source": [ 121 | "\"\"\"\n", 122 | " A simple implementation of Gaussian MLP Encoder and Decoder\n", 123 | "\"\"\"\n", 124 | "\n", 125 | "class Encoder(nn.Module):\n", 126 | "\n", 127 | " def __init__(self, input_dim, hidden_dim, latent_dim):\n", 128 | " super(Encoder, self).__init__()\n", 129 | "\n", 130 | " self.FC_input = nn.Linear(input_dim, hidden_dim)\n", 131 | " self.FC_input2 = nn.Linear(hidden_dim, hidden_dim)\n", 132 | " self.FC_mean = nn.Linear(hidden_dim, latent_dim)\n", 133 | " self.FC_var = nn.Linear(hidden_dim, latent_dim)\n", 134 | "\n", 135 | " self.LeakyReLU = nn.LeakyReLU(0.2)\n", 136 | "\n", 137 | " self.training = True\n", 138 | "\n", 139 | " def forward(self, x):\n", 140 | " h_ = self.LeakyReLU(self.FC_input(x))\n", 141 | " h_ = self.LeakyReLU(self.FC_input2(h_))\n", 142 | " mean = self.FC_mean(h_)\n", 143 | " log_var = self.FC_var(h_) # encoder produces the mean and log-variance,\n", 144 | " # i.e., the parameters of a simple, tractable normal distribution \"q\"\n", 145 | "\n", 146 | " return mean, log_var" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "id": "RAs4xW647WzC" 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "class Decoder(nn.Module):\n", 158 | " def __init__(self, latent_dim, hidden_dim, output_dim):\n", 159 | " super(Decoder, self).__init__()\n", 160 | " self.FC_hidden = nn.Linear(latent_dim, hidden_dim)\n", 161 | " self.FC_hidden2 = nn.Linear(hidden_dim, hidden_dim)\n", 162 | " self.FC_output = nn.Linear(hidden_dim, output_dim)\n", 163 | "\n", 164 | " self.LeakyReLU = nn.LeakyReLU(0.2)\n", 165 | "\n", 166 | " def forward(self, x):\n", 167 | " h = self.LeakyReLU(self.FC_hidden(x))\n", 168 | " h = self.LeakyReLU(self.FC_hidden2(h))\n", 169 | "\n", 170 | "
x_hat = torch.sigmoid(self.FC_output(h))\n", 171 | " return x_hat\n" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": { 178 | "id": "GgcOElSB7WzD" 179 | }, 180 | "outputs": [], 181 | "source": [ 182 | "class Model(nn.Module):\n", 183 | " def __init__(self, Encoder, Decoder):\n", 184 | " super(Model, self).__init__()\n", 185 | " self.Encoder = Encoder\n", 186 | " self.Decoder = Decoder\n", 187 | "\n", 188 | " def reparameterization(self, mean, std):\n", 189 | " epsilon = torch.randn_like(std).to(DEVICE) # sampling epsilon ~ N(0, I)\n", 190 | " z = mean + std*epsilon # reparameterization trick\n", 191 | " return z\n", 192 | "\n", 193 | "\n", 194 | " def forward(self, x):\n", 195 | " mean, log_var = self.Encoder(x)\n", 196 | " z = self.reparameterization(mean, torch.exp(0.5 * log_var)) # exp(0.5 * log var) = standard deviation\n", 197 | " x_hat = self.Decoder(z)\n", 198 | "\n", 199 | " return x_hat, mean, log_var" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": { 206 | "id": "hGuYpI8e7WzE" 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "encoder = Encoder(input_dim=x_dim, hidden_dim=hidden_dim, latent_dim=latent_dim)\n", 211 | "decoder = Decoder(latent_dim=latent_dim, hidden_dim = hidden_dim, output_dim = x_dim)\n", 212 | "\n", 213 | "model = Model(Encoder=encoder, Decoder=decoder).to(DEVICE)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "id": "qVovnHE47WzE" 220 | }, 221 | "source": [ 222 | "### Step 3. Define Loss function (reprod. 
loss) and optimizer" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "id": "IyWKnGrO7WzE" 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "from torch.optim import Adam\n", 234 | "\n", 235 | "BCE_loss = nn.BCELoss()\n", 236 | "\n", 237 | "def loss_function(x, x_hat, mean, log_var):\n", 238 | " reproduction_loss = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')\n", 239 | " KLD = - 0.5 * torch.sum(1+ log_var - mean.pow(2) - log_var.exp())\n", 240 | "\n", 241 | " return reproduction_loss + KLD\n", 242 | "\n", 243 | "\n", 244 | "optimizer = Adam(model.parameters(), lr=lr)" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": { 250 | "id": "LVmiNVzk7WzE" 251 | }, 252 | "source": [ 253 | "### Step 4. Train Variational AutoEncoder (VAE)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": { 260 | "scrolled": false, 261 | "id": "UlLy0RAr7WzF" 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "print(\"Start training VAE...\")\n", 266 | "model.train()\n", 267 | "\n", 268 | "for epoch in range(epochs):\n", 269 | " overall_loss = 0\n", 270 | " for batch_idx, (x, _) in enumerate(train_loader):\n", 271 | " x = x.view(batch_size, x_dim)\n", 272 | " x = x.to(DEVICE)\n", 273 | "\n", 274 | " optimizer.zero_grad()\n", 275 | "\n", 276 | " x_hat, mean, log_var = model(x)\n", 277 | " loss = loss_function(x, x_hat, mean, log_var)\n", 278 | "\n", 279 | " overall_loss += loss.item()\n", 280 | "\n", 281 | " loss.backward()\n", 282 | " optimizer.step()\n", 283 | "\n", 284 | " print(f\"\\tEpoch {epoch + 1} of {epochs} complete\", \"\\tAverage Loss: \", overall_loss / ((batch_idx + 1) * batch_size))\n", 285 | "\n", 286 | "print(\"Finish!!\")" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": { 292 | "id": "TxynWFgu7WzG" 293 | }, 294 | "source": [ 295 | "### Step 5. 
Generate images from test dataset" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": { 302 | "id": "tQ88prZe7WzG" 303 | }, 304 | "outputs": [], 305 | "source": [ 306 | "import matplotlib.pyplot as plt" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": { 313 | "id": "-EHB0Grw7WzG" 314 | }, 315 | "outputs": [], 316 | "source": [ 317 | "model.eval()\n", 318 | "\n", 319 | "with torch.no_grad():\n", 320 | " for batch_idx, (x, _) in enumerate(tqdm(test_loader)):\n", 321 | " x = x.view(batch_size, x_dim)\n", 322 | " x = x.to(DEVICE)\n", 323 | "\n", 324 | " x_hat, _, _ = model(x)\n", 325 | "\n", 326 | "\n", 327 | " break" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "id": "7fi3nyUR7WzG" 335 | }, 336 | "outputs": [], 337 | "source": [ 338 | "def show_image(x, idx):\n", 339 | " x = x.view(batch_size, 28, 28)\n", 340 | "\n", 341 | " fig = plt.figure()\n", 342 | " plt.imshow(x[idx].cpu().numpy())" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": { 349 | "scrolled": true, 350 | "id": "TxZG3oWk7WzG" 351 | }, 352 | "outputs": [], 353 | "source": [ 354 | "show_image(x, idx=0)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": { 361 | "id": "N_bOza237WzH" 362 | }, 363 | "outputs": [], 364 | "source": [ 365 | "show_image(x_hat, idx=0)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": { 371 | "id": "wPJ8nR0y7WzH" 372 | }, 373 | "source": [ 374 | "### Step 6. 
Generate image from noise vector" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": { 380 | "id": "5fSLoKnG7WzH" 381 | }, 382 | "source": [ 383 | "**Please note that this is not the correct generative process.**\n", 384 | "\n", 385 | "* Even though we don't know the exact p(z|x), we can generate images from noise, because the KL term in the VAE training loss pushes q(z|x) (a simple, tractable approximate posterior) toward N(0, I). If q(z|x) is close \"enough\" to N(0, I) (but not so close that the posterior collapses), samples from N(0, I) can stand in for the encoder's output.\n", 386 | "\n", 387 | "* To show this, I simply feed the decoder a noise vector sampled from N(0, I), much as one would with a Generative Adversarial Network." 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": null, 393 | "metadata": { 394 | "id": "pd_Lgpeh7WzH" 395 | }, 396 | "outputs": [], 397 | "source": [ 398 | "with torch.no_grad():\n", 399 | " noise = torch.randn(batch_size, latent_dim).to(DEVICE)\n", 400 | " generated_images = decoder(noise)" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "id": "xrIQ8IUo7WzH" 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "save_image(generated_images.view(batch_size, 1, 28, 28), 'generated_sample.png')" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "id": "-CUpJ8J07WzH" 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "show_image(generated_images, idx=12)" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": { 429 | "id": "X_Wu0S8q7WzH" 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "show_image(generated_images, idx=0)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "id": "JMjmoG7-7WzI" 441 | }, 442 | "outputs": [], 443 | "source": [ 444 | "show_image(generated_images, idx=1)" 445 | ] 
446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": { 451 | "id": "srJ2q3zH7WzI" 452 | }, 453 | "outputs": [], 454 | "source": [ 455 | "show_image(generated_images, idx=10)" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "id": "7dt55g5B7WzI" 463 | }, 464 | "outputs": [], 465 | "source": [ 466 | "show_image(generated_images, idx=20)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "metadata": { 473 | "id": "Z2XQBWpX7WzI" 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "show_image(generated_images, idx=50)" 478 | ] 479 | } 480 | ], 481 | "metadata": { 482 | "kernelspec": { 483 | "display_name": "Python 3", 484 | "name": "python3" 485 | }, 486 | "language_info": { 487 | "codemirror_mode": { 488 | "name": "ipython", 489 | "version": 3 490 | }, 491 | "file_extension": ".py", 492 | "mimetype": "text/x-python", 493 | "name": "python", 494 | "nbconvert_exporter": "python", 495 | "pygments_lexer": "ipython3", 496 | "version": "3.7.9" 497 | }, 498 | "colab": { 499 | "provenance": [] 500 | } 501 | }, 502 | "nbformat": 4, 503 | "nbformat_minor": 0 504 | } -------------------------------------------------------------------------------- /day4/material_from_2023/intro_to_ml_day4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/day4/material_from_2023/intro_to_ml_day4.pdf -------------------------------------------------------------------------------- /day4/survey_of_nn_architectures.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/day4/survey_of_nn_architectures.pdf 
-------------------------------------------------------------------------------- /day4/transformers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "\"Open" 7 | ], 8 | "metadata": { 9 | "id": "zpCkWYWdmgbk" 10 | } 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "source": [ 15 | "# Transformers\n", 16 | "\n", 17 | "Gage DeZoort\n", 18 | "\n", 19 | "Wintersession 2025\n", 20 | "\n", 21 | "*Adapted from a helpful conversation with ChatGPT.*\n" 22 | ], 23 | "metadata": { 24 | "id": "3-poBvrs6mFy" 25 | } 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "source": [ 30 | "## 0. Imports" 31 | ], 32 | "metadata": { 33 | "id": "smb8fYeN7ENJ" 34 | } 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "id": "MAaLG3PSyc-w", 41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "%matplotlib inline\n", 46 | "\n", 47 | "!pip install datasets -q" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": { 53 | "id": "1F3gjqazyc-x" 54 | }, 55 | "source": [ 56 | "\n", 57 | "\n", 58 | "The goal of this tutorial is to train a sequence-to-sequence\n", 59 | "\n" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": { 65 | "id": "B-jbjvb8yc-y" 66 | }, 67 | "source": [ 68 | "## 1. The Learning Task\n", 69 | "\n", 70 | "\n", 71 | "\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": { 77 | "id": "l1yFgIZpyc-y" 78 | }, 79 | "source": [ 80 | "Given a word or sequence of words, how likely is some subsequent word? This is a fundamental language modeling task: assigning a probability to each word that could follow some input sequence.\n", 81 | "\n", 82 | "\n", 83 | "As an example, let's consider the following input sequence:\n", 84 | "\n", 85 | "*I need to take my dog to the vet because he is*\n", 86 | "\n", 87 | "What's the next word? *Hungry*? *Healthy*? 
*Sick*?\n", 88 | "\n", 89 | "You get the picture." 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "source": [ 95 | "### 1.1 Tokenization\n", 96 | "\n", 97 | "Machines need to analyze *tokenized* data. Tokens can be words, phrases, characters, etc. They have corresponding `IDs` that are stored in a lookup table.\n", 98 | "\n", 99 | "We're going to use the tokenizer from a model called *BERT* (Bidirectional Encoder Representations from Transformers). BERT is a transformer model whose tokenizer splits the input text into words and punctuation, ignoring whitespace. It also splits rare or unfamiliar words into subwords. See below how the string `\"deeeep\"`, which does not appear in the English language, is split into three tokens `['dee', '##ee', '##p']`. The latter two tokens are called *subwords*.\n", 100 | "\n", 101 | "BERT's vocabulary (of subwords) is built with Google's proprietary WordPiece algorithm, starting from an initial vocabulary of single-character tokens. Frequent character pairs are iteratively merged into new subwords until the roughly 30,000-token vocabulary is constructed.\n", 102 | "\n", 103 | "\n", 104 | "\n" 105 | ], 106 | "metadata": { 107 | "id": "H7xtViWL430Q" 108 | } 109 | }, 110 | { 111 | "cell_type": "code", 112 | "source": [ 113 | "from transformers import AutoTokenizer\n", 114 | "\n", 115 | "# Choose a pre-trained model tokenizer (e.g., BERT)\n", 116 | "model_name = \"bert-base-uncased\" # 100M parameters, not case-sensitive\n", 117 | "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", 118 | "\n", 119 | "# Example: Tokenizing text\n", 120 | "text = \"Transformers are a type of deeeep learning model used for NLP tasks. Epehmeral. 
Anachronism.\"\n", 121 | "tokens = tokenizer.tokenize(text)\n", 122 | "print(\"Tokens:\", tokens)\n", 123 | "\n", 124 | "# Converting tokens to IDs\n", 125 | "token_ids = tokenizer.convert_tokens_to_ids(tokens)\n", 126 | "print(\"Token IDs:\", token_ids)\n", 127 | "\n", 128 | "# Decoding token IDs back to text\n", 129 | "decoded_text = tokenizer.decode(token_ids)\n", 130 | "print(\"Decoded Text:\", decoded_text)" 131 | ], 132 | "metadata": { 133 | "id": "ISRStxoX6VIW" 134 | }, 135 | "execution_count": null, 136 | "outputs": [] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "source": [ 141 | "### 1.2 Sequence Data\n", 142 | "\n", 143 | "\n", 144 | "To create a coherent learning task, we need to take sequences of tokens and batch them into inputs with corresponding targets. Sequences are batched into uniform-length chunks. For example, consider two words written as sequences of tokens:\n", 145 | "\n", 146 | "Sequence #1: `[\"run\", \"##ner\"]`\n", 147 | "\n", 148 | "Sequence #2: `[\"d\", \"##run\", \"##k\", \"##en\"]`\n", 149 | "\n", 150 | "Our model will expect fixed-size sequences at input, say of size `max_length=3`. Sequence #1 is shorter than `max_length`, so we have to *pad* it with some default value. In BERT, this default value is `[PAD]`. Sequence #2, on the other hand, is longer than `max_length`, so we have to *truncate* it." 
151 | ], 152 | "metadata": { 153 | "id": "UyfvjiUldZhF" 154 | } 155 | }, 156 | { 157 | "cell_type": "code", 158 | "source": [ 159 | "# Padding and truncation\n", 160 | "\n", 161 | "sequence = tokenizer(text, padding=\"max_length\", truncation=True, max_length=10)\n", 162 | "print(\"Encoded Sequence:\", sequence)\n", 163 | "tokenizer.decode(sequence[\"input_ids\"])" 164 | ], 165 | "metadata": { 166 | "id": "tyLbb0tedgjU" 167 | }, 168 | "execution_count": null, 169 | "outputs": [] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "source": [ 174 | "Here, the `input_ids` are what the BERT transformer will actually process, the `token_type_ids` are used to demarcate segments (for next-sentence prediction), and the `attention_mask` indicates which tokens are padding (0). Note that BERT's tokenizer has added a few special tokens. `[CLS]` is a classification token marking the start of the sequence, and `[SEP]` is the separator token marking the end." 175 | ], 176 | "metadata": { 177 | "id": "C1Wh7pHz6aIp" 178 | } 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "source": [ 183 | "## 1. Transformer Models\n", 184 | "\n", 185 | "BERT is a pre-trained transformer model available for generic use cases. It takes as input the `sequence` data type we generated above and outputs embeddings for each token." 
186 | ], 187 | "metadata": { 188 | "id": "dEo-0Scnk89c" 189 | } 190 | }, 191 | { 192 | "cell_type": "code", 193 | "source": [ 194 | "# Load BERT and compute token embeddings\n", 195 | "import torch\n", 196 | "from transformers import AutoModel\n", 197 | "\n", 198 | "# Load a pre-trained model\n", 199 | "model = AutoModel.from_pretrained(model_name)\n", 200 | "\n", 201 | "# Example input\n", 202 | "inputs = tokenizer(\"The quick brown fox jumps over the lazy dog.\", return_tensors=\"pt\")\n", 203 | "\n", 204 | "# Forward pass through the model\n", 205 | "outputs = model(**inputs)\n", 206 | "\n", 207 | "# The model outputs embeddings\n", 208 | "print(\"Last hidden state shape:\", outputs.last_hidden_state.shape)" 209 | ], 210 | "metadata": { 211 | "id": "f8_30kqjgQp3" 212 | }, 213 | "execution_count": null, 214 | "outputs": [] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "source": [ 219 | "So we see that each of the 12 tokens (the nine words, the period, and the special `[CLS]` and `[SEP]` tokens) gets a 768-dimensional output embedding. This is a high dimension, so we'll have to use some specialized tools to get a closer look." 220 | ], 221 | "metadata": { 222 | "id": "tWGWW7r0nh2n" 223 | } 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "source": [ 228 | "## 1.1 Attention is All You Need\n", 229 | "\n", 230 | "Transformers use attention modules, which quantify how much tokens in a sequence focus on other tokens. Let's take a closer look at how attention works." 
231 | ], 232 | "metadata": { 233 | "id": "T2PfMFYNn2vp" 234 | } 235 | }, 236 | { 237 | "cell_type": "code", 238 | "source": [ 239 | "from transformers import AutoTokenizer, AutoModel\n", 240 | "import torch\n", 241 | "\n", 242 | "# Input text\n", 243 | "text = \"Transformers are powerful and versatile models.\"\n", 244 | "\n", 245 | "# Tokenize and extract embeddings\n", 246 | "inputs = tokenizer(text, return_tensors=\"pt\")\n", 247 | "outputs = model(**inputs, output_attentions=True)\n", 248 | "\n", 249 | "# Extract hidden states (last layer embeddings)\n", 250 | "token_embeddings = outputs.last_hidden_state.squeeze(0) # Shape: [sequence_length, hidden_size]\n", 251 | "print(token_embeddings.shape)" 252 | ], 253 | "metadata": { 254 | "id": "PICExuxUnXNM" 255 | }, 256 | "execution_count": null, 257 | "outputs": [] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "source": [ 262 | "Since the embeddings have such a high dimension, we need to use a dimensionality reduction technique called principal component analysis (PCA) to visualize them. PCA identifies mutually orthogonal directions ($< 768$ of them!) of large variance in the data, returning the projection onto this new, lower-dimensional basis." 
263 | ], 264 | "metadata": { 265 | "id": "7aqxG0IUpol_" 266 | } 267 | }, 268 | { 269 | "cell_type": "code", 270 | "source": [ 271 | "from sklearn.decomposition import PCA\n", 272 | "import matplotlib.pyplot as plt\n", 273 | "\n", 274 | "# Apply PCA to reduce dimensions to 2D\n", 275 | "pca = PCA(n_components=2)\n", 276 | "reduced_embeddings = pca.fit_transform(token_embeddings.detach().numpy())\n", 277 | "\n", 278 | "# Visualize the reduced embeddings\n", 279 | "tokens = tokenizer.convert_ids_to_tokens(inputs[\"input_ids\"][0])\n", 280 | "plt.figure(figsize=(10, 7))\n", 281 | "for i, token in enumerate(tokens):\n", 282 | " plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1])\n", 283 | " plt.text(reduced_embeddings[i, 0] + 0.01, reduced_embeddings[i, 1] + 0.01, token, fontsize=12)\n", 284 | "plt.title(\"2D Visualization of Token Embeddings\")\n", 285 | "plt.xlabel(\"PCA Dimension 1\")\n", 286 | "plt.ylabel(\"PCA Dimension 2\")\n", 287 | "plt.grid()\n", 288 | "plt.show()" 289 | ], 290 | "metadata": { 291 | "id": "Dr-e_2ugoEQR" 292 | }, 293 | "execution_count": null, 294 | "outputs": [] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "source": [ 299 | "We see that the tokens `[\"are\",\"versatile\", \"and\", \"powerful\", \"models\"]` all have very similar embeddings. The sentence start and end tokens, in addition to `\"transformers\"` and the punctuation \".\" are embedded elsewhere." 
300 | ], 301 | "metadata": { 302 | "id": "v0_URai2rWpw" 303 | } 304 | }, 305 | { 306 | "cell_type": "code", 307 | "source": [ 308 | "# Extract attention weights\n", 309 | "attention_weights = outputs.attentions # Tuple of num_layers tensors, each of shape [batch_size, num_heads, seq_len, seq_len]\n", 310 | "\n", 311 | "# Example: Visualize attention from the last layer, averaged across heads\n", 312 | "import seaborn as sns\n", 313 | "import numpy as np\n", 314 | "\n", 315 | "attention_last_layer = torch.mean(outputs.attentions[-1][0], dim=0).detach().numpy() # Shape: [seq_len, seq_len]\n", 316 | "\n", 317 | "plt.figure(figsize=(10, 8))\n", 318 | "sns.heatmap(attention_last_layer, annot=True, fmt=\".2f\", xticklabels=tokens, yticklabels=tokens, cmap=\"viridis\")\n", 319 | "plt.title(\"Attention Weights for the Last Layer (Averaged Across Heads)\")\n", 320 | "plt.xlabel(\"Key Tokens\")\n", 321 | "plt.ylabel(\"Query Tokens\")\n", 322 | "plt.show()" 323 | ], 324 | "metadata": { 325 | "id": "f3gwlb-PorKL" 326 | }, 327 | "execution_count": null, 328 | "outputs": [] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "source": [ 333 | "You may notice that `[CLS]` and `[SEP]` get the strongest attention weights. `[CLS]` is typically sent to a downstream classification module to analyze the sentiment/meaning of the sequence provided. It may also be used to compare two sequences, e.g. via cosine similarity. `[SEP]` is usually used in sentence-pair analysis; e.g. it can store information about how different two sentences are." 
334 | ], 335 | "metadata": { 336 | "id": "YGbqIQhx4q8g" 337 | } 338 | }, 339 | { 340 | "cell_type": "code", 341 | "source": [ 342 | "# Aggregate attention across heads for multiple layers\n", 343 | "for layer_idx in range(11):\n", 344 | " layer_attention = torch.mean(outputs.attentions[layer_idx][0], dim=0).detach().numpy()\n", 345 | " sns.heatmap(layer_attention, xticklabels=tokens, yticklabels=tokens, cmap=\"viridis\")\n", 346 | " plt.title(f\"Layer {layer_idx + 1} Attention (Averaged Across Heads)\")\n", 347 | " plt.xlabel(\"Key Tokens\")\n", 348 | " plt.ylabel(\"Query Tokens\")\n", 349 | " plt.show()" 350 | ], 351 | "metadata": { 352 | "id": "73eJ4QG_4E8H" 353 | }, 354 | "execution_count": null, 355 | "outputs": [] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "source": [ 360 | "### 1.2 Sentence Similarity\n", 361 | "\n", 362 | "Let's drill down on the embedding stored in `[CLS]` by evaluating several sentences that have (potentially) similar semantic structure." 363 | ], 364 | "metadata": { 365 | "id": "LqRX_zlG5ivt" 366 | } 367 | }, 368 | { 369 | "cell_type": "code", 370 | "source": [ 371 | "from torch.nn import CosineSimilarity\n", 372 | "\n", 373 | "s1 = \"Transformers are powerful and versatile models.\"\n", 374 | "s2 = \"Language models like transformers have diverse applications.\"\n", 375 | "s3 = \"Political polarization keeps us divided and blind to issues that really matter.\"\n", 376 | "\n", 377 | "# Tokenize and extract embeddings\n", 378 | "cls = []\n", 379 | "for s in [s1, s2, s3]:\n", 380 | " inputs = tokenizer(s, return_tensors=\"pt\")\n", 381 | " outputs = model(**inputs, output_attentions=True)\n", 382 | " token_embeddings = outputs.last_hidden_state.squeeze(0) # Shape: [sequence_length, hidden_size]\n", 383 | " cls.append(token_embeddings[0])\n", 384 | "\n", 385 | "# Cosine similarity of each sentence\n", 386 | "cos_sim = CosineSimilarity(dim=-1)\n", 387 | "for i in range(3):\n", 388 | " for j in range(3):\n", 389 | " print(i, j, 
cos_sim(cls[i], cls[j]))" 390 | ], 391 | "metadata": { 392 | "id": "LpOWFf5k4M7f" 393 | }, 394 | "execution_count": null, 395 | "outputs": [] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "source": [ 400 | "## 2. Fine-tuning\n", 401 | "\n", 402 | "We've got pre-trained models like BERT available to us. These models have been trained on massive corpora and have excellent general language capabilities. Fine-tuning is the process of further training a pre-trained model on a task-specific dataset, which is much more efficient than training a language model from scratch." 403 | ], 404 | "metadata": { 405 | "id": "R0KaKCMv8qFU" 406 | } 407 | }, 408 | { 409 | "cell_type": "code", 410 | "source": [ 411 | "from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments\n", 412 | "from datasets import load_dataset\n", 413 | "import torch\n", 414 | "\n", 415 | "!pip install evaluate\n", 416 | "import evaluate" 417 | ], 418 | "metadata": { 419 | "id": "rvg_2g-i6u5y" 420 | }, 421 | "execution_count": null, 422 | "outputs": [] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "source": [ 427 | "We're going to spin up a smaller version of BERT to fine-tune." 428 | ], 429 | "metadata": { 430 | "id": "_B_Bcep2BRzr" 431 | } 432 | }, 433 | { 434 | "cell_type": "code", 435 | "source": [ 436 | "# Load tokenizer and model\n", 437 | "model_name = \"distilbert-base-uncased\" # \"bert-base-uncased\"\n", 438 | "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", 439 | "model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) # Binary classification" 440 | ], 441 | "metadata": { 442 | "id": "kQB6L3x36hK-" 443 | }, 444 | "execution_count": null, 445 | "outputs": [] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "source": [ 450 | "The [IMDb dataset](https://huggingface.co/datasets/stanfordnlp/imdb) contains 50k movie reviews formatted as input sequences for downstream sentiment analysis. 
For example, what (0 or 1) do you think the training label would be for this review?\n", 451 | "\n", 452 | "*National Treasure is about as over-rated and over-hyped as they come. Nicholas Cage is in no way a believable action hero, and this film is no \"Indiana Jones\". People who have compared this movie to the Indian Jones classic trilogy have seriously fallen off their rocker...*" 453 | ], 454 | "metadata": { 455 | "id": "WtG6oU3XBV2x" 456 | } 457 | }, 458 | { 459 | "cell_type": "code", 460 | "source": [ 461 | "# Load IMDb dataset\n", 462 | "dataset = load_dataset(\"imdb\")" 463 | ], 464 | "metadata": { 465 | "id": "-S3P8SUF9F0R" 466 | }, 467 | "execution_count": null, 468 | "outputs": [] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "source": [ 473 | "# Take a small fraction of the dataset (e.g., 10%)\n", 474 | "fraction = 0.1\n", 475 | "small_train_dataset = dataset[\"train\"].shuffle(seed=42).select(range(int(len(dataset[\"train\"]) * fraction)))\n", 476 | "small_test_dataset = dataset[\"test\"].shuffle(seed=42).select(range(int(len(dataset[\"test\"]) * fraction)))\n", 477 | "\n", 478 | "# Verify the size\n", 479 | "print(f\"Train size: {len(small_train_dataset)}, Test size: {len(small_test_dataset)}\")\n", 480 | "\n", 481 | "# Tokenize the smaller datasets\n", 482 | "def preprocess_data(example):\n", 483 | " return tokenizer(example[\"text\"], padding=\"max_length\", truncation=True, max_length=128)\n", 484 | "\n", 485 | "small_train_dataset = small_train_dataset.map(preprocess_data, batched=True)\n", 486 | "small_test_dataset = small_test_dataset.map(preprocess_data, batched=True)\n", 487 | "\n", 488 | "# Convert to PyTorch format\n", 489 | "small_train_dataset = small_train_dataset.rename_column(\"label\", \"labels\")\n", 490 | "small_test_dataset = small_test_dataset.rename_column(\"label\", \"labels\")\n", 491 | "\n", 492 | "small_train_dataset.set_format(type=\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])\n", 493 | 
"small_test_dataset.set_format(type=\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])" 494 | ], 495 | "metadata": { 496 | "id": "uIRHqsw1_s37" 497 | }, 498 | "execution_count": null, 499 | "outputs": [] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "source": [ 504 | "import torch\n", 505 | "\n", 506 | "# Function to move tensors to the correct device (GPU/CPU)\n", 507 | "def move_to_device(batch):\n", 508 | " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 509 | " # Move tensor columns to the correct device\n", 510 | " batch = {key: value.to(device) if torch.is_tensor(value) else value for key, value in batch.items()}\n", 511 | " return batch\n", 512 | "\n", 513 | "# Apply this function to your dataset using `map`\n", 514 | "small_train_dataset = small_train_dataset.map(move_to_device, batched=True)\n", 515 | "small_test_dataset = small_test_dataset.map(move_to_device, batched=True)" 516 | ], 517 | "metadata": { 518 | "id": "jP-HyFoHIi7e" 519 | }, 520 | "execution_count": null, 521 | "outputs": [] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "source": [ 526 | "model = model.to(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 527 | "print(\"Model device:\", next(model.parameters()).device) # This should print \"cuda\" if using GPU" 528 | ], 529 | "metadata": { 530 | "id": "BTg7qVj4IvCy" 531 | }, 532 | "execution_count": null, 533 | "outputs": [] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "source": [ 538 | "from transformers import TrainingArguments\n", 539 | "\n", 540 | "accuracy = evaluate.load(\"accuracy\")\n", 541 | "\n", 542 | "from sklearn.metrics import accuracy_score\n", 543 | "\n", 544 | "def compute_metrics(p):\n", 545 | " predictions, labels = p\n", 546 | " preds = predictions.argmax(axis=-1) # Get the class with the highest probability\n", 547 | " return accuracy.compute(predictions=preds, references=labels)\n", 548 | "\n", 549 | "training_args = TrainingArguments(\n", 550 | " 
output_dir=\"./results\",\n", 551 | " evaluation_strategy=\"epoch\",\n", 552 | " save_strategy=\"epoch\",\n", 553 | " per_device_train_batch_size=16,\n", 554 | " per_device_eval_batch_size=32,\n", 555 | " num_train_epochs=10,\n", 556 | " logging_steps=10,\n", 557 | " fp16=torch.cuda.is_available(), # Enable mixed precision if on GPU\n", 558 | ")\n", 559 | "\n", 560 | "trainer = Trainer(\n", 561 | " model=model,\n", 562 | " args=training_args,\n", 563 | " train_dataset=small_train_dataset,\n", 564 | " eval_dataset=small_test_dataset,\n", 565 | " tokenizer=tokenizer,\n", 566 | " compute_metrics=compute_metrics,\n", 567 | ")\n", 568 | "\n", 569 | "# Train the model\n", 570 | "trainer.train()" 571 | ], 572 | "metadata": { 573 | "id": "ycHfWO3L9GIL" 574 | }, 575 | "execution_count": null, 576 | "outputs": [] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "source": [ 581 | "results = trainer.evaluate()\n", 582 | "print(results)" 583 | ], 584 | "metadata": { 585 | "id": "TyH_4UjF9O4n" 586 | }, 587 | "execution_count": null, 588 | "outputs": [] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "source": [], 593 | "metadata": { 594 | "id": "9mmHNaZOKD5O" 595 | }, 596 | "execution_count": null, 597 | "outputs": [] 598 | } 599 | ], 600 | "metadata": { 601 | "kernelspec": { 602 | "display_name": "Python 3", 603 | "name": "python3" 604 | }, 605 | "language_info": { 606 | "codemirror_mode": { 607 | "name": "ipython", 608 | "version": 3 609 | }, 610 | "file_extension": ".py", 611 | "mimetype": "text/x-python", 612 | "name": "python", 613 | "nbconvert_exporter": "python", 614 | "pygments_lexer": "ipython3", 615 | "version": "3.6.8" 616 | }, 617 | "colab": { 618 | "provenance": [], 619 | "gpuType": "T4" 620 | }, 621 | "accelerator": "GPU" 622 | }, 623 | "nbformat": 4, 624 | "nbformat_minor": 0 625 | } 626 | -------------------------------------------------------------------------------- /past_hackathons/computer_vision_hackathon/intro_to_ML_day5_computer_vision_2.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | }, 15 | "accelerator": "GPU", 16 | "gpuClass": "standard" 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "source": [ 22 | "#Introduction to Machine Learning \n", 23 | "Computer Vision Hackathon 2 \\\n", 24 | "Gage DeZoort and Jon Halverson\\\n", 25 | "Princeton University Wintersession \\\n", 26 | "January 22, 2024" 27 | ], 28 | "metadata": { 29 | "id": "aga1pGnHqFDc" 30 | } 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "source": [ 35 | "In this notebook you will create a CNN from scratch to distinguish cats from dogs." 36 | ], 37 | "metadata": { 38 | "id": "UNycfWo7M0TV" 39 | } 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "source": [ 44 | "# About Your Colab Session" 45 | ], 46 | "metadata": { 47 | "id": "2zeUMxrssifc" 48 | } 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "source": [ 53 | "Learn about the CPU-cores for your session:" 54 | ], 55 | "metadata": { 56 | "id": "DuQEJ5K4T6mr" 57 | } 58 | }, 59 | { 60 | "cell_type": "code", 61 | "source": [ 62 | "cat /proc/cpuinfo" 63 | ], 64 | "metadata": { 65 | "id": "kmhl7u9GTJdM" 66 | }, 67 | "execution_count": null, 68 | "outputs": [] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "source": [ 73 | "import os\n", 74 | "num_cores = min(os.cpu_count(), 2)\n", 75 | "print(num_cores)" 76 | ], 77 | "metadata": { 78 | "id": "pvmd8gdqqadV" 79 | }, 80 | "execution_count": null, 81 | "outputs": [] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "source": [ 86 | "Let's see which GPU we are using (probably a Tesla T4):" 87 | ], 88 | "metadata": { 89 | "id": "boWe_CxtT_NO" 90 | } 91 | }, 92 | { 93 | "cell_type": "code", 94 | "source": [ 95 | "!nvidia-smi" 96 | ], 97 | 
"metadata": { 98 | "id": "8yR2en5xCqsO" 99 | }, 100 | "execution_count": null, 101 | "outputs": [] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "source": [ 106 | "# Data Preparation" 107 | ], 108 | "metadata": { 109 | "id": "wGufTwtPso3h" 110 | } 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 4, 115 | "metadata": { 116 | "id": "bH3SrMfHBejx" 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "import torch\n", 121 | "import torch.nn as nn\n", 122 | "import torch.nn.functional as F\n", 123 | "import torch.optim as optim\n", 124 | "from torchvision import datasets, transforms\n", 125 | "from torch.optim.lr_scheduler import StepLR\n", 126 | "from PIL import Image" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "source": [ 132 | "We want to use a GPU when one is available:" 133 | ], 134 | "metadata": { 135 | "id": "OOKVq098sZne" 136 | } 137 | }, 138 | { 139 | "cell_type": "code", 140 | "source": [ 141 | "use_cuda = torch.cuda.is_available()\n", 142 | "print(use_cuda)" 143 | ], 144 | "metadata": { 145 | "id": "Nk3pkXNxCF5F", 146 | "colab": { 147 | "base_uri": "https://localhost:8080/" 148 | }, 149 | "outputId": "fb0e931d-34f0-4a37-f6df-48e2e839977f" 150 | }, 151 | "execution_count": 5, 152 | "outputs": [ 153 | { 154 | "output_type": "stream", 155 | "name": "stdout", 156 | "text": [ 157 | "True\n" 158 | ] 159 | } 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "source": [ 165 | "torch.manual_seed(42)\n", 166 | "device = torch.device(\"cuda\") if use_cuda else torch.device(\"cpu\")\n", 167 | "\n", 168 | "train_kwargs = {'batch_size': 64}\n", 169 | "test_kwargs = {'batch_size': 1000}\n", 170 | "if use_cuda:\n", 171 | " cuda_kwargs = {'num_workers': num_cores, 'pin_memory': True}\n", 172 | " train_kwargs.update(cuda_kwargs)\n", 173 | " test_kwargs.update(cuda_kwargs)" 174 | ], 175 | "metadata": { 176 | "id": "TVoXo2d6CVwC" 177 | }, 178 | "execution_count": 6, 179 | "outputs": [] 180 | }, 181 | { 182 | "cell_type": 
"markdown", 183 | "source": [ 184 | "Download and unpack the data:" 185 | ], 186 | "metadata": { 187 | "id": "oPIynRI6zWut" 188 | } 189 | }, 190 | { 191 | "cell_type": "code", 192 | "source": [ 193 | "!wget https://tigress-web.princeton.edu/~jdh4/cats_vs_dogs.tar\n", 194 | "!tar xf cats_vs_dogs.tar" 195 | ], 196 | "metadata": { 197 | "id": "UbRdgWThzTcG" 198 | }, 199 | "execution_count": null, 200 | "outputs": [] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "source": [ 205 | "transform=transforms.Compose([\n", 206 | " transforms.ToTensor(),\n", 207 | " #transforms.Normalize((0.1307,), (0.3081,))\n", 208 | "])\n", 209 | "train_set = datasets.ImageFolder(root=\"./training_set/\", transform=transform)\n", 210 | "test_set = datasets.ImageFolder(root=\"./test_set/\", transform=transform)\n", 211 | "\n", 212 | "train_loader = torch.utils.data.DataLoader(train_set, shuffle=True, **train_kwargs)\n", 213 | "test_loader = torch.utils.data.DataLoader(test_set, shuffle=True, **test_kwargs)\n", 214 | "\n", 215 | "image_0, label_0 = train_set[0]\n", 216 | "print(image_0.shape, label_0)" 217 | ], 218 | "metadata": { 219 | "id": "NgIFFS1TJp-O" 220 | }, 221 | "execution_count": null, 222 | "outputs": [] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "source": [ 227 | "There are roughly 4000 cat images and 4000 dog images in the training set. The test set is roughly 1000 images of each. All images have dimensions 128x128. The cat and dogs images are in color so they are composed of three layers (red, green, blue). The MNIST data set was grayscale so only a single layer was needed per image." 
228 | ], 229 | "metadata": { 230 | "id": "NO9cdWhN4ouV" 231 | } 232 | }, 233 | { 234 | "cell_type": "code", 235 | "source": [ 236 | "img = Image.open(\"./training_set/dogs/resized-dog.1001.jpg\")\n", 237 | "print(f\"Image height: {img.height}\")\n", 238 | "print(f\"Image width: {img.width}\")\n", 239 | "img" 240 | ], 241 | "metadata": { 242 | "id": "KkXNTCGY28xc" 243 | }, 244 | "execution_count": null, 245 | "outputs": [] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "source": [ 250 | "img = Image.open(\"./training_set/cats/resized-cat.1001.jpg\")\n", 251 | "print(f\"Image height: {img.height}\")\n", 252 | "print(f\"Image width: {img.width}\")\n", 253 | "img" 254 | ], 255 | "metadata": { 256 | "id": "vPX6m50p3laY" 257 | }, 258 | "execution_count": null, 259 | "outputs": [] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "source": [ 264 | "# Model Definition and Hackathon Project" 265 | ], 266 | "metadata": { 267 | "id": "NxvDfF0Ps9uZ" 268 | } 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "source": [ 273 | "The hackathon project is to create a convolutional neural network from scratch and train it so that it gives a sustained accuracy of 75% or higher on the test set. Your network should use at least 3 convolutional layers.\n", 274 | "\n", 275 | "You only need to write the Net class. The rest of the notebook does not need to be changed. Because the train and test functions below use F.nll_loss, the forward pass should return log-probabilities (for example, by applying F.log_softmax to the final layer). After writing the Net class, try running the notebook. Raise your hand if you have any questions for the instructor. We're happy to give hints as you work through the exercise." 
276 | ], 277 | "metadata": { 278 | "id": "SzBNqmv3M7o3" 279 | } 280 | }, 281 | { 282 | "cell_type": "code", 283 | "source": [ 284 | "class Net(nn.Module):\n", 285 | " def __init__(self):\n", 286 | " super(Net, self).__init__()\n", 287 | " # CREATE THE LAYERS HERE\n", 288 | "\n", 289 | " def forward(self, x):\n", 290 | " # DEFINE THE FORWARD PASS HERE\n", 291 | " return output" 292 | ], 293 | "metadata": { 294 | "id": "HYNJjPkeB4Rj" 295 | }, 296 | "execution_count": null, 297 | "outputs": [] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "source": [ 302 | "Instantiate the network and move it to the device (which is a GPU when available). Create the optimizer." 303 | ], 304 | "metadata": { 305 | "id": "QVGbz36OuS8O" 306 | } 307 | }, 308 | { 309 | "cell_type": "code", 310 | "source": [ 311 | "model = Net().to(device)\n", 312 | "optimizer = optim.Adadelta(model.parameters(), lr=1.0)" 313 | ], 314 | "metadata": { 315 | "id": "tvkGwJD_JGEY" 316 | }, 317 | "execution_count": null, 318 | "outputs": [] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "source": [ 323 | "from torchsummary import summary\n", 324 | "summary(model, input_size=(3, 128, 128))" 325 | ], 326 | "metadata": { 327 | "id": "kTfbe4QKRYLu" 328 | }, 329 | "execution_count": null, 330 | "outputs": [] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "source": [ 335 | "# Train and Test Methods" 336 | ], 337 | "metadata": { 338 | "id": "KYR03y9dvDEO" 339 | } 340 | }, 341 | { 342 | "cell_type": "code", 343 | "source": [ 344 | "def train(model, device, train_loader, optimizer, epoch):\n", 345 | " model.train() # sets the model in training mode (i.e., dropout enabled)\n", 346 | " for batch_idx, (data, target) in enumerate(train_loader):\n", 347 | " data, target = data.to(device), target.to(device)\n", 348 | " optimizer.zero_grad()\n", 349 | " output = model(data)\n", 350 | " loss = F.nll_loss(output, target)\n", 351 | " loss.backward()\n", 352 | " optimizer.step()\n", 353 | " if batch_idx % 100 == 
0:\n", 354 | " print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n", 355 | " epoch, batch_idx * len(data), len(train_loader.dataset),\n", 356 | " 100. * batch_idx / len(train_loader), loss.item()))" 357 | ], 358 | "metadata": { 359 | "id": "_PrPJRlsCCO5" 360 | }, 361 | "execution_count": null, 362 | "outputs": [] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "source": [ 367 | "def test(model, device, test_loader):\n", 368 | " model.eval() # sets the model in evaluation mode (i.e., dropout disabled)\n", 369 | " test_loss = 0\n", 370 | " correct = 0\n", 371 | " with torch.no_grad():\n", 372 | " for data, target in test_loader:\n", 373 | " data, target = data.to(device), target.to(device)\n", 374 | " output = model(data)\n", 375 | " test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss\n", 376 | " pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability\n", 377 | " correct += pred.eq(target.view_as(pred)).sum().item()\n", 378 | "\n", 379 | " test_loss /= len(test_loader.dataset)\n", 380 | "\n", 381 | " print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n", 382 | " test_loss, correct, len(test_loader.dataset),\n", 383 | " 100. 
* correct / len(test_loader.dataset)))" 384 | ], 385 | "metadata": { 386 | "id": "bns0Q8O-CFVM" 387 | }, 388 | "execution_count": null, 389 | "outputs": [] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "source": [ 394 | "Train for some number of epochs while reporting the accuracy on the test set periodically:" 395 | ], 396 | "metadata": { 397 | "id": "2Zxo7USTvKMv" 398 | } 399 | }, 400 | { 401 | "cell_type": "code", 402 | "source": [ 403 | "epochs = 12\n", 404 | "scheduler = StepLR(optimizer, step_size=1, gamma=0.7)\n", 405 | "for epoch in range(1, epochs + 1):\n", 406 | " train(model, device, train_loader, optimizer, epoch)\n", 407 | " test(model, device, test_loader)\n", 408 | " scheduler.step()" 409 | ], 410 | "metadata": { 411 | "id": "Yl8Lcz1RJCk9" 412 | }, 413 | "execution_count": null, 414 | "outputs": [] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "source": [], 419 | "metadata": { 420 | "id": "XGZrm2RazBTa" 421 | }, 422 | "execution_count": null, 423 | "outputs": [] 424 | } 425 | ] 426 | } -------------------------------------------------------------------------------- /past_hackathons/computer_vision_hackathon/intro_to_ML_day5_computer_vision_3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | }, 15 | "accelerator": "GPU", 16 | "gpuClass": "standard" 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "source": [ 22 | "#Introduction to Machine Learning \n", 23 | "Computer Vision Hackathon 3\\\n", 24 | "Jon Halverson and Gage DeZoort\\\n", 25 | "Princeton University Wintersession\\\n", 26 | "January 23, 2024" 27 | ], 28 | "metadata": { 29 | "id": "aga1pGnHqFDc" 30 | } 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "source": [ 35 | "In this notebook 
you will use transfer learning to get a very high accuracy on the cats versus dogs problem. The idea is to take a large CNN model trained on vast amounts of data and retrain only the top layers while freezing the lower layers. We are transferring the learning done previously to our problem. We will use the ResNet-50 model." 36 | ], 37 | "metadata": { 38 | "id": "UNycfWo7M0TV" 39 | } 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "source": [ 44 | "# About Your Colab Session" 45 | ], 46 | "metadata": { 47 | "id": "2zeUMxrssifc" 48 | } 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "source": [ 53 | "Learn about the CPU-cores for your session:" 54 | ], 55 | "metadata": { 56 | "id": "DuQEJ5K4T6mr" 57 | } 58 | }, 59 | { 60 | "cell_type": "code", 61 | "source": [ 62 | "cat /proc/cpuinfo" 63 | ], 64 | "metadata": { 65 | "id": "kmhl7u9GTJdM" 66 | }, 67 | "execution_count": null, 68 | "outputs": [] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "source": [ 73 | "import os\n", 74 | "num_cores = min(os.cpu_count(), 2)\n", 75 | "print(num_cores)" 76 | ], 77 | "metadata": { 78 | "id": "pvmd8gdqqadV" 79 | }, 80 | "execution_count": null, 81 | "outputs": [] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "source": [ 86 | "Let's see which GPU we are using (probably a Tesla T4):" 87 | ], 88 | "metadata": { 89 | "id": "boWe_CxtT_NO" 90 | } 91 | }, 92 | { 93 | "cell_type": "code", 94 | "source": [ 95 | "!nvidia-smi" 96 | ], 97 | "metadata": { 98 | "id": "8yR2en5xCqsO" 99 | }, 100 | "execution_count": null, 101 | "outputs": [] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "source": [ 106 | "# Data Preparation" 107 | ], 108 | "metadata": { 109 | "id": "wGufTwtPso3h" 110 | } 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": { 116 | "id": "bH3SrMfHBejx" 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "import torch\n", 121 | "import torch.nn as nn\n", 122 | "import torch.nn.functional as F\n", 123 | "import torch.optim as 
optim\n", 124 | "from torchvision import datasets, transforms, models\n", 125 | "from torch.optim.lr_scheduler import StepLR\n", 126 | "from PIL import Image" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "source": [ 132 | "We want to use a GPU when one is available:" 133 | ], 134 | "metadata": { 135 | "id": "OOKVq098sZne" 136 | } 137 | }, 138 | { 139 | "cell_type": "code", 140 | "source": [ 141 | "use_cuda = torch.cuda.is_available()\n", 142 | "print(use_cuda)" 143 | ], 144 | "metadata": { 145 | "id": "Nk3pkXNxCF5F", 146 | "outputId": "60c7b911-ecbc-4b8e-f2ea-e36f0b3c4cf3", 147 | "colab": { 148 | "base_uri": "https://localhost:8080/" 149 | } 150 | }, 151 | "execution_count": null, 152 | "outputs": [ 153 | { 154 | "output_type": "stream", 155 | "name": "stdout", 156 | "text": [ 157 | "True\n" 158 | ] 159 | } 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "source": [ 165 | "torch.manual_seed(42)\n", 166 | "device = torch.device(\"cuda\") if use_cuda else torch.device(\"cpu\")\n", 167 | "\n", 168 | "train_kwargs = {'batch_size': 64}\n", 169 | "test_kwargs = {'batch_size': 128}\n", 170 | "if use_cuda:\n", 171 | " cuda_kwargs = {'num_workers': num_cores, 'pin_memory': True}\n", 172 | " train_kwargs.update(cuda_kwargs)\n", 173 | " test_kwargs.update(cuda_kwargs)" 174 | ], 175 | "metadata": { 176 | "id": "TVoXo2d6CVwC" 177 | }, 178 | "execution_count": null, 179 | "outputs": [] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "source": [ 184 | "Download and unpack the data:" 185 | ], 186 | "metadata": { 187 | "id": "oPIynRI6zWut" 188 | } 189 | }, 190 | { 191 | "cell_type": "code", 192 | "source": [ 193 | "!wget https://tigress-web.princeton.edu/~jdh4/cats_vs_dogs.tar\n", 194 | "!tar xf cats_vs_dogs.tar" 195 | ], 196 | "metadata": { 197 | "id": "UbRdgWThzTcG" 198 | }, 199 | "execution_count": null, 200 | "outputs": [] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "source": [ 205 | "transform=transforms.Compose([\n", 206 | " 
transforms.ToTensor(),\n", 207 | " transforms.Normalize((0.1307,), (0.3081,))])\n", 208 | "dataset1 = datasets.ImageFolder(root=\"./training_set/\", transform=transform)\n", 209 | "dataset2 = datasets.ImageFolder(root=\"./test_set/\", transform=transform)\n", 210 | "\n", 211 | "train_loader = torch.utils.data.DataLoader(dataset1, shuffle=True, **train_kwargs)\n", 212 | "test_loader = torch.utils.data.DataLoader(dataset2, shuffle=True, **test_kwargs)" 213 | ], 214 | "metadata": { 215 | "id": "NgIFFS1TJp-O" 216 | }, 217 | "execution_count": null, 218 | "outputs": [] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "source": [ 223 | "There are roughly 4000 cat images and 4000 dog images in the training set. The test set has roughly 1000 images of each. All images have dimensions 128x128. The cat and dog images are in color, so each is composed of three channels (red, green, blue). The MNIST data set was grayscale, so only a single channel was needed per image." 224 | ], 225 | "metadata": { 226 | "id": "NO9cdWhN4ouV" 227 | } 228 | }, 229 | { 230 | "cell_type": "code", 231 | "source": [ 232 | "img = Image.open(\"./training_set/dogs/resized-dog.1001.jpg\")\n", 233 | "print(f\"Image height: {img.height}\")\n", 234 | "print(f\"Image width: {img.width}\")\n", 235 | "img" 236 | ], 237 | "metadata": { 238 | "id": "KkXNTCGY28xc" 239 | }, 240 | "execution_count": null, 241 | "outputs": [] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "source": [ 246 | "img = Image.open(\"./training_set/cats/resized-cat.1001.jpg\")\n", 247 | "print(f\"Image height: {img.height}\")\n", 248 | "print(f\"Image width: {img.width}\")\n", 249 | "img" 250 | ], 251 | "metadata": { 252 | "id": "vPX6m50p3laY" 253 | }, 254 | "execution_count": null, 255 | "outputs": [] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "source": [ 260 | "# Model Definition" 261 | ], 262 | "metadata": { 263 | "id": "NxvDfF0Ps9uZ" 264 | } 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "source": [ 269 | "Below 
the model is downloaded. We turn off gradient tracking for all model parameters except those of the two new linear layers that replace the final fc layer. The model is moved to the device (a GPU if one is available) and the optimizer is created." 270 | ], 271 | "metadata": { 272 | "id": "QVGbz36OuS8O" 273 | } 274 | }, 275 | { 276 | "cell_type": "code", 277 | "source": [ 278 | "model = models.resnet50(weights='DEFAULT')\n", 279 | "for param in model.parameters():\n", 280 | " param.requires_grad = False\n", 281 | "# use print(model) to see that the name of the last layer is fc\n", 282 | "# we redefine fc in the next line\n", 283 | "model.fc = nn.Sequential(nn.Linear(2048, 128), nn.ReLU(inplace=True), nn.Linear(128, 2))\n", 284 | "model = model.to(device)\n", 285 | "optimizer = optim.Adadelta(model.fc.parameters(), lr=1.0)" 286 | ], 287 | "metadata": { 288 | "id": "tvkGwJD_JGEY" 289 | }, 290 | "execution_count": null, 291 | "outputs": [] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "source": [ 296 | "from torchsummary import summary\n", 297 | "summary(model, input_size=(3, 128, 128))" 298 | ], 299 | "metadata": { 300 | "id": "kTfbe4QKRYLu" 301 | }, 302 | "execution_count": null, 303 | "outputs": [] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "source": [ 308 | "# Train and Test Methods" 309 | ], 310 | "metadata": { 311 | "id": "KYR03y9dvDEO" 312 | } 313 | }, 314 | { 315 | "cell_type": "code", 316 | "source": [ 317 | "def train(model, device, train_loader, optimizer, epoch):\n", 318 | " model.train() # sets the model in training mode (i.e., dropout enabled)\n", 319 | " for batch_idx, (data, target) in enumerate(train_loader):\n", 320 | " data, target = data.to(device), target.to(device)\n", 321 | " optimizer.zero_grad()\n", 322 | " output = model(data)\n", 323 | " loss = F.nll_loss(F.log_softmax(output, dim=1), target)\n", 324 | " loss.backward()\n", 325 | " optimizer.step()\n", 326 | " if batch_idx % 100 == 0:\n", 327 | " print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n", 
328 | " epoch, batch_idx * len(data), len(train_loader.dataset),\n", 329 | " 100. * batch_idx / len(train_loader), loss.item()))" 330 | ], 331 | "metadata": { 332 | "id": "_PrPJRlsCCO5" 333 | }, 334 | "execution_count": null, 335 | "outputs": [] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "source": [ 340 | "def test(model, device, test_loader):\n", 341 | " model.eval() # sets the model in evaluation mode (i.e., dropout disabled)\n", 342 | " test_loss = 0\n", 343 | " correct = 0\n", 344 | " with torch.no_grad():\n", 345 | " for data, target in test_loader:\n", 346 | " data, target = data.to(device), target.to(device)\n", 347 | " output = model(data) # raw logits from the redefined fc layers\n", 348 | " test_loss += F.nll_loss(F.log_softmax(output, dim=1), target, reduction='sum').item() # sum up batch loss (same loss as in train)\n", 349 | " pred = output.argmax(dim=1, keepdim=True) # get the index of the highest-scoring class\n", 350 | " correct += pred.eq(target.view_as(pred)).sum().item()\n", 351 | "\n", 352 | " test_loss /= len(test_loader.dataset)\n", 353 | "\n", 354 | " print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n", 355 | " test_loss, correct, len(test_loader.dataset),\n", 356 | " 100. 
* correct / len(test_loader.dataset)))" 357 | ], 358 | "metadata": { 359 | "id": "bns0Q8O-CFVM" 360 | }, 361 | "execution_count": null, 362 | "outputs": [] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "source": [ 367 | "Train for some number of epochs while reporting the accuracy on the test set periodically:" 368 | ], 369 | "metadata": { 370 | "id": "2Zxo7USTvKMv" 371 | } 372 | }, 373 | { 374 | "cell_type": "code", 375 | "source": [ 376 | "epochs = 12\n", 377 | "scheduler = StepLR(optimizer, step_size=1, gamma=0.7)\n", 378 | "for epoch in range(1, epochs + 1):\n", 379 | " train(model, device, train_loader, optimizer, epoch)\n", 380 | " test(model, device, test_loader)\n", 381 | " scheduler.step()" 382 | ], 383 | "metadata": { 384 | "id": "Yl8Lcz1RJCk9" 385 | }, 386 | "execution_count": null, 387 | "outputs": [] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "source": [], 392 | "metadata": { 393 | "id": "mvYEJybhHNz2" 394 | }, 395 | "execution_count": null, 396 | "outputs": [] 397 | } 398 | ] 399 | } -------------------------------------------------------------------------------- /past_hackathons/computer_vision_hackathon/intro_to_ml_day5_CNNs.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/past_hackathons/computer_vision_hackathon/intro_to_ml_day5_CNNs.pdf -------------------------------------------------------------------------------- /past_hackathons/computer_vision_hackathon/material_from_2023/day5_computer_vision_hackathon_notebook1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | }, 15 | "accelerator": "GPU", 16 | 
"gpuClass": "standard" 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "source": [ 22 | "#Introduction to Machine Learning \n", 23 | "**Computer Vision Hackathon \n", 24 | "Wintersession \n", 25 | "Tuesday, January 24, 2023**" 26 | ], 27 | "metadata": { 28 | "id": "aga1pGnHqFDc" 29 | } 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "source": [ 34 | "This notebook trains a simple CNN on the MNIST dataset. The code comes from a [PyTorch example on GitHub](https://github.com/pytorch/examples/blob/master/mnist/main.py)." 35 | ], 36 | "metadata": { 37 | "id": "UNycfWo7M0TV" 38 | } 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "source": [ 43 | "# About Your Colab Session" 44 | ], 45 | "metadata": { 46 | "id": "2zeUMxrssifc" 47 | } 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "source": [ 52 | "Learn about the CPU-cores for your session:" 53 | ], 54 | "metadata": { 55 | "id": "DuQEJ5K4T6mr" 56 | } 57 | }, 58 | { 59 | "cell_type": "code", 60 | "source": [ 61 | "cat /proc/cpuinfo" 62 | ], 63 | "metadata": { 64 | "id": "kmhl7u9GTJdM" 65 | }, 66 | "execution_count": null, 67 | "outputs": [] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "source": [ 72 | "import os\n", 73 | "num_cores = min(os.cpu_count(), 2)\n", 74 | "print(num_cores)" 75 | ], 76 | "metadata": { 77 | "id": "pvmd8gdqqadV" 78 | }, 79 | "execution_count": null, 80 | "outputs": [] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "source": [ 85 | "Let's see which GPU we are using (probably a Tesla T4):" 86 | ], 87 | "metadata": { 88 | "id": "boWe_CxtT_NO" 89 | } 90 | }, 91 | { 92 | "cell_type": "code", 93 | "source": [ 94 | "!nvidia-smi" 95 | ], 96 | "metadata": { 97 | "id": "8yR2en5xCqsO" 98 | }, 99 | "execution_count": null, 100 | "outputs": [] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "source": [ 105 | "# Data Preparation" 106 | ], 107 | "metadata": { 108 | "id": "wGufTwtPso3h" 109 | } 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | 
"metadata": { 115 | "id": "bH3SrMfHBejx" 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "import torch\n", 120 | "import torch.nn as nn\n", 121 | "import torch.nn.functional as F\n", 122 | "import torch.optim as optim\n", 123 | "from torchvision import datasets, transforms\n", 124 | "from torch.optim.lr_scheduler import StepLR\n", 125 | "from matplotlib import pyplot as plt" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "source": [ 131 | "We want to use a GPU when one is available:" 132 | ], 133 | "metadata": { 134 | "id": "OOKVq098sZne" 135 | } 136 | }, 137 | { 138 | "cell_type": "code", 139 | "source": [ 140 | "use_cuda = torch.cuda.is_available()\n", 141 | "print(use_cuda)" 142 | ], 143 | "metadata": { 144 | "id": "Nk3pkXNxCF5F" 145 | }, 146 | "execution_count": null, 147 | "outputs": [] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "source": [ 152 | "torch.manual_seed(42)\n", 153 | "device = torch.device(\"cuda\") if use_cuda else torch.device(\"cpu\")\n", 154 | "\n", 155 | "train_kwargs = {'batch_size': 64}\n", 156 | "test_kwargs = {'batch_size': 1000}\n", 157 | "if use_cuda:\n", 158 | " cuda_kwargs = {'num_workers': num_cores, 'pin_memory': True, 'shuffle': True}\n", 159 | " train_kwargs.update(cuda_kwargs)\n", 160 | " test_kwargs.update(cuda_kwargs)" 161 | ], 162 | "metadata": { 163 | "id": "TVoXo2d6CVwC" 164 | }, 165 | "execution_count": null, 166 | "outputs": [] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "source": [ 171 | "transform=transforms.Compose([\n", 172 | " transforms.ToTensor(),\n", 173 | " transforms.Normalize((0.1307,), (0.3081,))\n", 174 | "])\n", 175 | "train_data = datasets.MNIST('/tmp', train=True, download=True,\n", 176 | " transform=transform)\n", 177 | "test_data = datasets.MNIST('/tmp', train=False,\n", 178 | " transform=transform)\n", 179 | "train_loader = torch.utils.data.DataLoader(train_data, **train_kwargs)\n", 180 | "test_loader = torch.utils.data.DataLoader(test_data, **test_kwargs)" 181 | ], 182 | 
"metadata": { 183 | "id": "NgIFFS1TJp-O" 184 | }, 185 | "execution_count": null, 186 | "outputs": [] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "source": [ 191 | "# plot several random examples\n", 192 | "figure = plt.figure(figsize=(8, 8))\n", 193 | "cols, rows = 3, 3\n", 194 | "for i in range(1, cols * rows + 1):\n", 195 | " sample_idx = torch.randint(len(train_data), size=(1,)).item()\n", 196 | " img, label = train_data[sample_idx]\n", 197 | " figure.add_subplot(rows, cols, i)\n", 198 | " plt.axis(\"off\")\n", 199 | " plt.imshow(img.squeeze(), cmap=\"gray\")\n", 200 | "plt.show()" 201 | ], 202 | "metadata": { 203 | "id": "y-46k5yEHtyT" 204 | }, 205 | "execution_count": null, 206 | "outputs": [] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "source": [ 211 | "# Model Definition" 212 | ], 213 | "metadata": { 214 | "id": "NxvDfF0Ps9uZ" 215 | } 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "source": [ 220 | "Create a Python class called Net that derives from the nn.Module of PyTorch. The \\_\\_init__() method defines the network layers and regularization method while the forward method describes the forward pass." 
221 | ], 222 | "metadata": { 223 | "id": "SzBNqmv3M7o3" 224 | } 225 | }, 226 | { 227 | "cell_type": "code", 228 | "source": [ 229 | "class Net(nn.Module):\n", 230 | " def __init__(self):\n", 231 | " super(Net, self).__init__()\n", 232 | "\n", 233 | " # first convolutional layer\n", 234 | " self.conv1 = nn.Conv2d(in_channels=1, # input image is greyscale, each pixel has 1 dimension\n", 235 | " out_channels=32, # create 32 filters\n", 236 | " kernel_size=3, # each filter is 3x3x1\n", 237 | " stride=1) # slide the filters without making jumps\n", 238 | " # when you stack the feature maps, this outputs a 26x26x32 \"image\"\n", 239 | "\n", 240 | " # second convolutional layer\n", 241 | " self.conv2 = nn.Conv2d(in_channels=32, # we have 32 feature maps (26x26x32) from the last Conv2d\n", 242 | " out_channels=64, # create 64 filters \n", 243 | " kernel_size=3, # each filter is 3x3x32\n", 244 | " stride=1) # slide the filters without making jumps\n", 245 | " # when you stack the feature maps, this outputs a 24x24x64 \"image\"\n", 246 | "\n", 247 | " # dropout randomly \"drops out\" a tensor so that the model doesn't overtrain\n", 248 | " self.dropout1 = nn.Dropout(0.25) \n", 249 | " self.dropout2 = nn.Dropout(0.5) \n", 250 | "\n", 251 | " # flattened images are passed to the NN (after pooling, 12x12x64=9216)\n", 252 | " self.fc1 = nn.Linear(in_features=9216, # weights and biases \n", 253 | " out_features=128)\n", 254 | " self.fc2 = nn.Linear(in_features=128,\n", 255 | " out_features=10)\n", 256 | "\n", 257 | " def forward(self, x):\n", 258 | " # apply convolutional layers\n", 259 | " x = self.conv1(x)\n", 260 | " x = F.relu(x)\n", 261 | " x = self.conv2(x)\n", 262 | " x = F.relu(x)\n", 263 | " x = F.max_pool2d(x, kernel_size=2)\n", 264 | " x = self.dropout1(x)\n", 265 | " # flatten and feed to a NN\n", 266 | " x = torch.flatten(x, 1)\n", 267 | " x = self.fc1(x)\n", 268 | " x = F.relu(x)\n", 269 | " x = self.dropout2(x)\n", 270 | " x = self.fc2(x)\n", 271 | " output = 
F.log_softmax(x, dim=1) # log_softmax + nll_loss = cross entropy loss\n", 272 | " return output" 273 | ], 274 | "metadata": { 275 | "id": "HYNJjPkeB4Rj" 276 | }, 277 | "execution_count": null, 278 | "outputs": [] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "source": [ 283 | "Instantiate the network and move it to the device (which is a GPU when available). Create the optimizer." 284 | ], 285 | "metadata": { 286 | "id": "QVGbz36OuS8O" 287 | } 288 | }, 289 | { 290 | "cell_type": "code", 291 | "source": [ 292 | "model = Net().to(device)\n", 293 | "optimizer = optim.Adadelta(model.parameters(), lr=1.0)" 294 | ], 295 | "metadata": { 296 | "id": "tvkGwJD_JGEY" 297 | }, 298 | "execution_count": null, 299 | "outputs": [] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "source": [ 304 | "from torchsummary import summary\n", 305 | "summary(model, input_size=(1, 28, 28))" 306 | ], 307 | "metadata": { 308 | "id": "kTfbe4QKRYLu" 309 | }, 310 | "execution_count": null, 311 | "outputs": [] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "source": [ 316 | "# Train and Test Methods" 317 | ], 318 | "metadata": { 319 | "id": "KYR03y9dvDEO" 320 | } 321 | }, 322 | { 323 | "cell_type": "code", 324 | "source": [ 325 | "def train(model, device, train_loader, optimizer, epoch):\n", 326 | " model.train() # sets the model in training mode (i.e., dropout enabled)\n", 327 | " for batch_idx, (data, target) in enumerate(train_loader):\n", 328 | " data, target = data.to(device), target.to(device)\n", 329 | " optimizer.zero_grad()\n", 330 | " output = model(data)\n", 331 | " loss = F.nll_loss(output, target)\n", 332 | " loss.backward()\n", 333 | " optimizer.step()\n", 334 | " if batch_idx % 100 == 0:\n", 335 | " print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n", 336 | " epoch, batch_idx * len(data), len(train_loader.dataset),\n", 337 | " 100. 
* batch_idx / len(train_loader), loss.item()))" 338 | ], 339 | "metadata": { 340 | "id": "_PrPJRlsCCO5" 341 | }, 342 | "execution_count": null, 343 | "outputs": [] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "source": [ 348 | "def test(model, device, test_loader):\n", 349 | " model.eval() # sets the model in evaluation mode (i.e., dropout disabled)\n", 350 | " test_loss = 0\n", 351 | " correct = 0\n", 352 | " with torch.no_grad():\n", 353 | " for data, target in test_loader:\n", 354 | " data, target = data.to(device), target.to(device)\n", 355 | " output = model(data)\n", 356 | " test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss\n", 357 | " pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability\n", 358 | " correct += pred.eq(target.view_as(pred)).sum().item()\n", 359 | "\n", 360 | " test_loss /= len(test_loader.dataset)\n", 361 | "\n", 362 | " print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n", 363 | " test_loss, correct, len(test_loader.dataset),\n", 364 | " 100. 
* correct / len(test_loader.dataset)))" 365 | ], 366 | "metadata": { 367 | "id": "bns0Q8O-CFVM" 368 | }, 369 | "execution_count": null, 370 | "outputs": [] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "source": [ 375 | "Train for some number of epochs while reporting the accuracy on the test set periodically:" 376 | ], 377 | "metadata": { 378 | "id": "2Zxo7USTvKMv" 379 | } 380 | }, 381 | { 382 | "cell_type": "code", 383 | "source": [ 384 | "epochs = 5\n", 385 | "scheduler = StepLR(optimizer, step_size=1, gamma=0.7)\n", 386 | "for epoch in range(1, epochs + 1):\n", 387 | " train(model, device, train_loader, optimizer, epoch)\n", 388 | " test(model, device, test_loader)\n", 389 | " scheduler.step()" 390 | ], 391 | "metadata": { 392 | "id": "Yl8Lcz1RJCk9" 393 | }, 394 | "execution_count": null, 395 | "outputs": [] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "source": [], 400 | "metadata": { 401 | "id": "kQ1vBgwJR3VG" 402 | }, 403 | "execution_count": null, 404 | "outputs": [] 405 | } 406 | ] 407 | } -------------------------------------------------------------------------------- /past_hackathons/computer_vision_hackathon/material_from_2023/day5_computer_vision_hackathon_notebook2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | }, 15 | "accelerator": "GPU", 16 | "gpuClass": "standard" 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "source": [ 22 | "#Introduction to Machine Learning \n", 23 | "**Computer Vision Hackathon \n", 24 | "Wintersession \n", 25 | "Tuesday, January 24, 2023**" 26 | ], 27 | "metadata": { 28 | "id": "aga1pGnHqFDc" 29 | } 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "source": [ 34 | "In this notebook you will create a CNN from 
scratch to distinguish cats from dogs." 35 | ], 36 | "metadata": { 37 | "id": "UNycfWo7M0TV" 38 | } 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "source": [ 43 | "# About Your Colab Session" 44 | ], 45 | "metadata": { 46 | "id": "2zeUMxrssifc" 47 | } 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "source": [ 52 | "Learn about the CPU-cores for your session:" 53 | ], 54 | "metadata": { 55 | "id": "DuQEJ5K4T6mr" 56 | } 57 | }, 58 | { 59 | "cell_type": "code", 60 | "source": [ 61 | "cat /proc/cpuinfo" 62 | ], 63 | "metadata": { 64 | "id": "kmhl7u9GTJdM" 65 | }, 66 | "execution_count": null, 67 | "outputs": [] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "source": [ 72 | "import os\n", 73 | "num_cores = min(os.cpu_count(), 2)\n", 74 | "print(num_cores)" 75 | ], 76 | "metadata": { 77 | "id": "pvmd8gdqqadV" 78 | }, 79 | "execution_count": null, 80 | "outputs": [] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "source": [ 85 | "Let's see which GPU we are using (probably a Tesla T4):" 86 | ], 87 | "metadata": { 88 | "id": "boWe_CxtT_NO" 89 | } 90 | }, 91 | { 92 | "cell_type": "code", 93 | "source": [ 94 | "!nvidia-smi" 95 | ], 96 | "metadata": { 97 | "id": "8yR2en5xCqsO" 98 | }, 99 | "execution_count": null, 100 | "outputs": [] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "source": [ 105 | "# Data Preparation" 106 | ], 107 | "metadata": { 108 | "id": "wGufTwtPso3h" 109 | } 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": { 115 | "id": "bH3SrMfHBejx" 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "import torch\n", 120 | "import torch.nn as nn\n", 121 | "import torch.nn.functional as F\n", 122 | "import torch.optim as optim\n", 123 | "from torchvision import datasets, transforms\n", 124 | "from torch.optim.lr_scheduler import StepLR\n", 125 | "from PIL import Image" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "source": [ 131 | "We want to use a GPU when one is available:" 132 | ], 133 | 
"metadata": { 134 | "id": "OOKVq098sZne" 135 | } 136 | }, 137 | { 138 | "cell_type": "code", 139 | "source": [ 140 | "use_cuda = torch.cuda.is_available()\n", 141 | "print(use_cuda)" 142 | ], 143 | "metadata": { 144 | "id": "Nk3pkXNxCF5F" 145 | }, 146 | "execution_count": null, 147 | "outputs": [] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "source": [ 152 | "torch.manual_seed(42)\n", 153 | "device = torch.device(\"cuda\") if use_cuda else torch.device(\"cpu\")\n", 154 | "\n", 155 | "train_kwargs = {'batch_size': 64}\n", 156 | "test_kwargs = {'batch_size': 1000}\n", 157 | "if use_cuda:\n", 158 | " cuda_kwargs = {'num_workers': num_cores, 'pin_memory': True}\n", 159 | " train_kwargs.update(cuda_kwargs)\n", 160 | " test_kwargs.update(cuda_kwargs)" 161 | ], 162 | "metadata": { 163 | "id": "TVoXo2d6CVwC" 164 | }, 165 | "execution_count": null, 166 | "outputs": [] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "source": [ 171 | "Download and unpack the data:" 172 | ], 173 | "metadata": { 174 | "id": "oPIynRI6zWut" 175 | } 176 | }, 177 | { 178 | "cell_type": "code", 179 | "source": [ 180 | "!wget https://tigress-web.princeton.edu/~jdh4/cats_vs_dogs.tar\n", 181 | "!tar xf cats_vs_dogs.tar" 182 | ], 183 | "metadata": { 184 | "id": "UbRdgWThzTcG" 185 | }, 186 | "execution_count": null, 187 | "outputs": [] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "source": [ 192 | "transform=transforms.Compose([\n", 193 | " transforms.ToTensor(),\n", 194 | " #transforms.Normalize((0.1307,), (0.3081,))\n", 195 | "]) \n", 196 | "train_set = datasets.ImageFolder(root=\"./training_set/\", transform=transform)\n", 197 | "test_set = datasets.ImageFolder(root=\"./test_set/\", transform=transform)\n", 198 | "\n", 199 | "train_loader = torch.utils.data.DataLoader(train_set, shuffle=True, **train_kwargs)\n", 200 | "test_loader = torch.utils.data.DataLoader(test_set, shuffle=True, **test_kwargs)\n", 201 | "\n", 202 | "image_0, label_0 = train_set[0]\n", 203 | "print(image_0.shape, 
label_0)" 204 | ], 205 | "metadata": { 206 | "id": "NgIFFS1TJp-O" 207 | }, 208 | "execution_count": null, 209 | "outputs": [] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "source": [ 214 | "There are roughly 4000 cat images and 4000 dog images in the training set. The test set has roughly 1000 images of each. All images have dimensions 128x128. The cat and dog images are in color, so each is composed of three channels (red, green, blue). The MNIST data set was grayscale, so only a single channel was needed per image." 215 | ], 216 | "metadata": { 217 | "id": "NO9cdWhN4ouV" 218 | } 219 | }, 220 | { 221 | "cell_type": "code", 222 | "source": [ 223 | "img = Image.open(\"./training_set/dogs/resized-dog.1001.jpg\")\n", 224 | "print(f\"Image height: {img.height}\") \n", 225 | "print(f\"Image width: {img.width}\")\n", 226 | "img" 227 | ], 228 | "metadata": { 229 | "id": "KkXNTCGY28xc" 230 | }, 231 | "execution_count": null, 232 | "outputs": [] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "source": [ 237 | "img = Image.open(\"./training_set/cats/resized-cat.1001.jpg\")\n", 238 | "print(f\"Image height: {img.height}\") \n", 239 | "print(f\"Image width: {img.width}\")\n", 240 | "img" 241 | ], 242 | "metadata": { 243 | "id": "vPX6m50p3laY" 244 | }, 245 | "execution_count": null, 246 | "outputs": [] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "source": [ 251 | "# Model Definition and Hackathon Project" 252 | ], 253 | "metadata": { 254 | "id": "NxvDfF0Ps9uZ" 255 | } 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "source": [ 260 | "The hackathon project is to create a convolutional neural network from scratch and train it so that it sustains an accuracy of 75% or higher on the test set. Your network should use at least 3 convolutional layers.\n", 261 | "\n", 262 | "You only need to write the Net class. The rest of the notebook does not need to be changed. After writing the Net class, try running the notebook.
Raise your hand if you have any questions for the instructor. We're happy to give hints as you work through the exercise." 263 | ], 264 | "metadata": { 265 | "id": "SzBNqmv3M7o3" 266 | } 267 | }, 268 | { 269 | "cell_type": "code", 270 | "source": [ 271 | "class Net(nn.Module):\n", 272 | " def __init__(self):\n", 273 | " super(Net, self).__init__()\n", 274 | " # CREATE THE LAYERS HERE\n", 275 | "\n", 276 | " def forward(self, x):\n", 277 | " # DEFINE THE FORWARD PASS HERE\n", 278 | " return output" 279 | ], 280 | "metadata": { 281 | "id": "HYNJjPkeB4Rj" 282 | }, 283 | "execution_count": null, 284 | "outputs": [] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "source": [ 289 | "Instantiate the network and move it to the device (which is a GPU when available). Create the optimizer." 290 | ], 291 | "metadata": { 292 | "id": "QVGbz36OuS8O" 293 | } 294 | }, 295 | { 296 | "cell_type": "code", 297 | "source": [ 298 | "model = Net().to(device)\n", 299 | "optimizer = optim.Adadelta(model.parameters(), lr=1.0)" 300 | ], 301 | "metadata": { 302 | "id": "tvkGwJD_JGEY" 303 | }, 304 | "execution_count": null, 305 | "outputs": [] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "source": [ 310 | "from torchsummary import summary\n", 311 | "summary(model, input_size=(3, 128, 128))" 312 | ], 313 | "metadata": { 314 | "id": "kTfbe4QKRYLu" 315 | }, 316 | "execution_count": null, 317 | "outputs": [] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "source": [ 322 | "# Train and Test Methods" 323 | ], 324 | "metadata": { 325 | "id": "KYR03y9dvDEO" 326 | } 327 | }, 328 | { 329 | "cell_type": "code", 330 | "source": [ 331 | "def train(model, device, train_loader, optimizer, epoch):\n", 332 | " model.train() # sets the model in training mode (i.e., dropout enabled)\n", 333 | " for batch_idx, (data, target) in enumerate(train_loader):\n", 334 | " data, target = data.to(device), target.to(device)\n", 335 | " optimizer.zero_grad()\n", 336 | " output = model(data)\n", 337 | " loss 
= F.nll_loss(output, target)\n", 338 | " loss.backward()\n", 339 | " optimizer.step()\n", 340 | " if batch_idx % 100 == 0:\n", 341 | " print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n", 342 | " epoch, batch_idx * len(data), len(train_loader.dataset),\n", 343 | " 100. * batch_idx / len(train_loader), loss.item()))" 344 | ], 345 | "metadata": { 346 | "id": "_PrPJRlsCCO5" 347 | }, 348 | "execution_count": null, 349 | "outputs": [] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "source": [ 354 | "def test(model, device, test_loader):\n", 355 | " model.eval() # sets the model in evaluation mode (i.e., dropout disabled)\n", 356 | " test_loss = 0\n", 357 | " correct = 0\n", 358 | " with torch.no_grad():\n", 359 | " for data, target in test_loader:\n", 360 | " data, target = data.to(device), target.to(device)\n", 361 | " output = model(data)\n", 362 | " test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss\n", 363 | " pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability\n", 364 | " correct += pred.eq(target.view_as(pred)).sum().item()\n", 365 | "\n", 366 | " test_loss /= len(test_loader.dataset)\n", 367 | "\n", 368 | " print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n", 369 | " test_loss, correct, len(test_loader.dataset),\n", 370 | " 100. 
* correct / len(test_loader.dataset)))" 371 | ], 372 | "metadata": { 373 | "id": "bns0Q8O-CFVM" 374 | }, 375 | "execution_count": null, 376 | "outputs": [] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "source": [ 381 | "Train for some number of epochs while reporting the accuracy on the test set periodically:" 382 | ], 383 | "metadata": { 384 | "id": "2Zxo7USTvKMv" 385 | } 386 | }, 387 | { 388 | "cell_type": "code", 389 | "source": [ 390 | "epochs = 12\n", 391 | "scheduler = StepLR(optimizer, step_size=1, gamma=0.7)\n", 392 | "for epoch in range(1, epochs + 1):\n", 393 | " train(model, device, train_loader, optimizer, epoch)\n", 394 | " test(model, device, test_loader)\n", 395 | " scheduler.step()" 396 | ], 397 | "metadata": { 398 | "id": "Yl8Lcz1RJCk9" 399 | }, 400 | "execution_count": null, 401 | "outputs": [] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "source": [], 406 | "metadata": { 407 | "id": "XGZrm2RazBTa" 408 | }, 409 | "execution_count": null, 410 | "outputs": [] 411 | } 412 | ] 413 | } -------------------------------------------------------------------------------- /past_hackathons/computer_vision_hackathon/material_from_2023/day5_computer_vision_hackathon_notebook3_transfer_learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "authorship_tag": "ABX9TyOsGrgoWTZImD1zjdvDl42h", 8 | "include_colab_link": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | }, 17 | "accelerator": "GPU", 18 | "gpuClass": "standard" 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "source": [ 34 | "#Introduction to Machine Learning \n", 35 
| "**Computer Vision Hackathon \n", 36 | "Wintersession \n", 37 | "Tuesday, January 24, 2023**" 38 | ], 39 | "metadata": { 40 | "id": "aga1pGnHqFDc" 41 | } 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "source": [ 46 | "In this notebook you will use transfer learning to reach a very high accuracy on the cats versus dogs problem. The idea is to take a large CNN model trained on vast amounts of data and retrain only the top layers while freezing the lower layers. In this way we transfer the previously learned features to our problem. We will use the ResNet-50 model." 47 | ], 48 | "metadata": { 49 | "id": "UNycfWo7M0TV" 50 | } 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "source": [ 55 | "# About Your Colab Session" 56 | ], 57 | "metadata": { 58 | "id": "2zeUMxrssifc" 59 | } 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "source": [ 64 | "Learn about the CPU-cores for your session:" 65 | ], 66 | "metadata": { 67 | "id": "DuQEJ5K4T6mr" 68 | } 69 | }, 70 | { 71 | "cell_type": "code", 72 | "source": [ 73 | "cat /proc/cpuinfo" 74 | ], 75 | "metadata": { 76 | "id": "kmhl7u9GTJdM" 77 | }, 78 | "execution_count": null, 79 | "outputs": [] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "source": [ 84 | "import os\n", 85 | "num_cores = min(os.cpu_count(), 2)\n", 86 | "print(num_cores)" 87 | ], 88 | "metadata": { 89 | "id": "pvmd8gdqqadV" 90 | }, 91 | "execution_count": null, 92 | "outputs": [] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "source": [ 97 | "Let's see which GPU we are using (probably a Tesla T4):" 98 | ], 99 | "metadata": { 100 | "id": "boWe_CxtT_NO" 101 | } 102 | }, 103 | { 104 | "cell_type": "code", 105 | "source": [ 106 | "!nvidia-smi" 107 | ], 108 | "metadata": { 109 | "id": "8yR2en5xCqsO" 110 | }, 111 | "execution_count": null, 112 | "outputs": [] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "source": [ 117 | "# Data Preparation" 118 | ], 119 | "metadata": { 120 | "id": "wGufTwtPso3h" 121 | } 122 | }, 123 | { 124 | "cell_type": "code", 125 |
"execution_count": null, 126 | "metadata": { 127 | "id": "bH3SrMfHBejx" 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "import torch\n", 132 | "import torch.nn as nn\n", 133 | "import torch.nn.functional as F\n", 134 | "import torch.optim as optim\n", 135 | "from torchvision import datasets, transforms, models\n", 136 | "from torch.optim.lr_scheduler import StepLR\n", 137 | "from PIL import Image" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "source": [ 143 | "We want to use a GPU when one is available:" 144 | ], 145 | "metadata": { 146 | "id": "OOKVq098sZne" 147 | } 148 | }, 149 | { 150 | "cell_type": "code", 151 | "source": [ 152 | "use_cuda = torch.cuda.is_available()\n", 153 | "print(use_cuda)" 154 | ], 155 | "metadata": { 156 | "id": "Nk3pkXNxCF5F" 157 | }, 158 | "execution_count": null, 159 | "outputs": [] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "source": [ 164 | "torch.manual_seed(42)\n", 165 | "device = torch.device(\"cuda\") if use_cuda else torch.device(\"cpu\")\n", 166 | "\n", 167 | "train_kwargs = {'batch_size': 64}\n", 168 | "test_kwargs = {'batch_size': 128}\n", 169 | "if use_cuda:\n", 170 | " cuda_kwargs = {'num_workers': num_cores, 'pin_memory': True}\n", 171 | " train_kwargs.update(cuda_kwargs)\n", 172 | " test_kwargs.update(cuda_kwargs)" 173 | ], 174 | "metadata": { 175 | "id": "TVoXo2d6CVwC" 176 | }, 177 | "execution_count": null, 178 | "outputs": [] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "source": [ 183 | "Download and unpack the data:" 184 | ], 185 | "metadata": { 186 | "id": "oPIynRI6zWut" 187 | } 188 | }, 189 | { 190 | "cell_type": "code", 191 | "source": [ 192 | "!wget https://tigress-web.princeton.edu/~jdh4/cats_vs_dogs.tar\n", 193 | "!tar xf cats_vs_dogs.tar" 194 | ], 195 | "metadata": { 196 | "id": "UbRdgWThzTcG" 197 | }, 198 | "execution_count": null, 199 | "outputs": [] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "source": [ 204 | "transform=transforms.Compose([\n", 205 | " 
transforms.ToTensor(),\n", 206 | " transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))]) # ImageNet channel means and stds, as expected by the pretrained ResNet-50\n", 207 | "dataset1 = datasets.ImageFolder(root=\"./training_set/\", transform=transform)\n", 208 | "dataset2 = datasets.ImageFolder(root=\"./test_set/\", transform=transform)\n", 209 | "\n", 210 | "train_loader = torch.utils.data.DataLoader(dataset1, shuffle=True, **train_kwargs)\n", 211 | "test_loader = torch.utils.data.DataLoader(dataset2, shuffle=True, **test_kwargs)" 212 | ], 213 | "metadata": { 214 | "id": "NgIFFS1TJp-O" 215 | }, 216 | "execution_count": null, 217 | "outputs": [] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "source": [ 222 | "There are roughly 4000 cat images and 4000 dog images in the training set. The test set has roughly 1000 images of each. All images have dimensions 128x128. The cat and dog images are in color, so each is composed of three channels (red, green, blue). The MNIST data set was grayscale, so only a single channel was needed per image." 223 | ], 224 | "metadata": { 225 | "id": "NO9cdWhN4ouV" 226 | } 227 | }, 228 | { 229 | "cell_type": "code", 230 | "source": [ 231 | "img = Image.open(\"./training_set/dogs/resized-dog.1001.jpg\")\n", 232 | "print(f\"Image height: {img.height}\") \n", 233 | "print(f\"Image width: {img.width}\")\n", 234 | "img" 235 | ], 236 | "metadata": { 237 | "id": "KkXNTCGY28xc" 238 | }, 239 | "execution_count": null, 240 | "outputs": [] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "source": [ 245 | "img = Image.open(\"./training_set/cats/resized-cat.1001.jpg\")\n", 246 | "print(f\"Image height: {img.height}\") \n", 247 | "print(f\"Image width: {img.width}\")\n", 248 | "img" 249 | ], 250 | "metadata": { 251 | "id": "vPX6m50p3laY" 252 | }, 253 | "execution_count": null, 254 | "outputs": [] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "source": [ 259 | "# Model Definition" 260 | ], 261 | "metadata": { 262 | "id": "NxvDfF0Ps9uZ" 263 | } 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "source": [ 268 | "Below
the model is downloaded. We turn off gradient tracking for all model parameters except those of the last two linear layers, which replace the original fc layer. The model is moved to the device (which is a GPU if one is available) and the optimizer is created." 269 | ], 270 | "metadata": { 271 | "id": "QVGbz36OuS8O" 272 | } 273 | }, 274 | { 275 | "cell_type": "code", 276 | "source": [ 277 | "model = models.resnet50(weights='DEFAULT')\n", 278 | "for param in model.parameters():\n", 279 | " param.requires_grad = False\n", 280 | "# use print(model) to see that the name of the last layer is fc\n", 281 | "# we redefine fc in the next line\n", 282 | "model.fc = nn.Sequential(nn.Linear(2048, 128), nn.ReLU(inplace=True), nn.Linear(128, 2))\n", 283 | "model = model.to(device)\n", 284 | "optimizer = optim.Adadelta(model.fc.parameters(), lr=1.0)" 285 | ], 286 | "metadata": { 287 | "id": "tvkGwJD_JGEY" 288 | }, 289 | "execution_count": null, 290 | "outputs": [] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "source": [ 295 | "from torchsummary import summary\n", 296 | "summary(model, input_size=(3, 128, 128))" 297 | ], 298 | "metadata": { 299 | "id": "kTfbe4QKRYLu" 300 | }, 301 | "execution_count": null, 302 | "outputs": [] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "source": [ 307 | "# Train and Test Methods" 308 | ], 309 | "metadata": { 310 | "id": "KYR03y9dvDEO" 311 | } 312 | }, 313 | { 314 | "cell_type": "code", 315 | "source": [ 316 | "def train(model, device, train_loader, optimizer, epoch):\n", 317 | " model.train() # sets the model in training mode (i.e., dropout enabled)\n", 318 | " for batch_idx, (data, target) in enumerate(train_loader):\n", 319 | " data, target = data.to(device), target.to(device)\n", 320 | " optimizer.zero_grad()\n", 321 | " output = model(data)\n", 322 | " loss = F.nll_loss(F.log_softmax(output, dim=1), target)\n", 323 | " loss.backward()\n", 324 | " optimizer.step()\n", 325 | " if batch_idx % 100 == 0:\n", 326 | " print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n",
327 | " epoch, batch_idx * len(data), len(train_loader.dataset),\n", 328 | " 100. * batch_idx / len(train_loader), loss.item()))" 329 | ], 330 | "metadata": { 331 | "id": "_PrPJRlsCCO5" 332 | }, 333 | "execution_count": null, 334 | "outputs": [] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "source": [ 339 | "def test(model, device, test_loader):\n", 340 | " model.eval() # sets the model in evaluation mode (i.e., dropout disabled)\n", 341 | " test_loss = 0\n", 342 | " correct = 0\n", 343 | " with torch.no_grad():\n", 344 | " for data, target in test_loader:\n", 345 | " data, target = data.to(device), target.to(device)\n", 346 | " output = F.log_softmax(model(data), dim=1) # the model returns raw logits; convert to log-probabilities as in train()\n", 347 | " test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss\n", 348 | " pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability\n", 349 | " correct += pred.eq(target.view_as(pred)).sum().item()\n", 350 | "\n", 351 | " test_loss /= len(test_loader.dataset)\n", 352 | "\n", 353 | " print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n", 354 | " test_loss, correct, len(test_loader.dataset),\n", 355 | " 100.
* correct / len(test_loader.dataset)))" 356 | ], 357 | "metadata": { 358 | "id": "bns0Q8O-CFVM" 359 | }, 360 | "execution_count": null, 361 | "outputs": [] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "source": [ 366 | "Train for some number of epochs while reporting the accuracy on the test set periodically:" 367 | ], 368 | "metadata": { 369 | "id": "2Zxo7USTvKMv" 370 | } 371 | }, 372 | { 373 | "cell_type": "code", 374 | "source": [ 375 | "epochs = 12\n", 376 | "scheduler = StepLR(optimizer, step_size=1, gamma=0.7)\n", 377 | "for epoch in range(1, epochs + 1):\n", 378 | " train(model, device, train_loader, optimizer, epoch)\n", 379 | " test(model, device, test_loader)\n", 380 | " scheduler.step()" 381 | ], 382 | "metadata": { 383 | "id": "Yl8Lcz1RJCk9" 384 | }, 385 | "execution_count": null, 386 | "outputs": [] 387 | } 388 | ] 389 | } -------------------------------------------------------------------------------- /past_hackathons/computer_vision_hackathon/material_from_2023/intro_to_ml_day5.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/past_hackathons/computer_vision_hackathon/material_from_2023/intro_to_ml_day5.pptx -------------------------------------------------------------------------------- /past_hackathons/large_language_models_hackathon/LLM_Finetuning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "gpuType": "T4" 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | }, 16 | "accelerator": "GPU" 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "source": [ 22 | "# LLM Fine-tuning\n", 23 | "\n", 24 | "- Instructor: Jake Snell\n", 25 | "- Date: 
January 23, 2024\n", 26 | "\n", 27 | "The contents of this tutorial are based on the following guide:\n", 28 | "- https://huggingface.co/docs/transformers/v4.37.0/tasks/language_modeling" 29 | ], 30 | "metadata": { 31 | "id": "9CeMVZyKbGbz" 32 | } 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "source": [ 37 | "**Note:** Before beginning, be sure to select \"Runtime > Change runtime type > T4 GPU\". This will make fine-tuning a lot faster." 38 | ], 39 | "metadata": { 40 | "id": "_4tnLdZvex2N" 41 | } 42 | }, 43 | { 44 | "cell_type": "code", 45 | "source": [ 46 | "!pip install transformers[torch] datasets evaluate accelerate -U" 47 | ], 48 | "metadata": { 49 | "id": "eqgbwUVwVu7k" 50 | }, 51 | "execution_count": null, 52 | "outputs": [] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "source": [ 57 | "## Part 1: Text Generation\n", 58 | "\n", 59 | "First, we need to download an LLM from HuggingFace. Here we use GPT-2, but feel free to experiment with your own choice of LLM by browsing https://huggingface.co/models?pipeline_tag=text-generation&sort=trending."
60 | ], 61 | "metadata": { 62 | "id": "SR9NV8Q-f-mq" 63 | } 64 | }, 65 | { 66 | "cell_type": "code", 67 | "source": [ 68 | "# Let's grab GPT-2 from HuggingFace\n", 69 | "\n", 70 | "from transformers import AutoModelForCausalLM\n", 71 | "\n", 72 | "model = AutoModelForCausalLM.from_pretrained(\"gpt2\").to(\"cuda:0\")" 73 | ], 74 | "metadata": { 75 | "id": "FhLo3HjIek-C" 76 | }, 77 | "execution_count": null, 78 | "outputs": [] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "source": [ 83 | "# Let's get the corresponding tokenizer as well\n", 84 | "from transformers import AutoTokenizer\n", 85 | "\n", 86 | "tokenizer = AutoTokenizer.from_pretrained(\"gpt2\")" 87 | ], 88 | "metadata": { 89 | "id": "hIkJnA_8g0Qo" 90 | }, 91 | "execution_count": null, 92 | "outputs": [] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "source": [ 97 | "# Now let's tokenize a sample sentence\n", 98 | "tokenized_sentence = tokenizer(\"One good tokenizer is worth more than a hundred bad ones.\")\n", 99 | "tokenized_sentence" 100 | ], 101 | "metadata": { 102 | "id": "dXbjKmSTg5Pi" 103 | }, 104 | "execution_count": null, 105 | "outputs": [] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "source": [ 110 | "**Question**: Based on the tokenizer output, is your tokenizer a character-level, word-level, or subword-level tokenizer? How can you tell?" 111 | ], 112 | "metadata": { 113 | "id": "ASVY656BhTpG" 114 | } 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "source": [ 119 | "\n", 120 | "\n", 121 | "**Exercise.** There are several methods for sampling a text sequence from a language model. Using the guide at https://huggingface.co/blog/how-to-generate, choose at least 2 sampling methods and implement them. Which technique generates higher-quality text samples? What happens when you seed the text with different phrases, such as \"Wherefore art\", \"Four score and seven\", etc.? Does the output match what you would expect?"
122 | ], 123 | "metadata": { 124 | "id": "Kvv4_ykefRA_" 125 | } 126 | }, 127 | { 128 | "cell_type": "code", 129 | "source": [ 130 | "# Write your code to sample from the model here!" 131 | ], 132 | "metadata": { 133 | "id": "VWzDiClqfKBW" 134 | }, 135 | "execution_count": null, 136 | "outputs": [] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "source": [ 141 | "## Part 2: Basic Fine-tuning\n", 142 | "\n", 143 | "Now we will download a small dataset to fine-tune our LLM. We will use the [TinyShakespeare](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt) dataset here, but feel free to find your own dataset on HuggingFace: https://huggingface.co/datasets?sort=trending. If you do, be sure to choose one of the filters under \"Natural Language Processing\" so you get a text dataset.\n" 144 | ], 145 | "metadata": { 146 | "id": "OZ6wFrqSf36B" 147 | } 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "id": "Eg2KYgkaVZk_" 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "from datasets import load_dataset\n", 158 | "\n", 159 | "dataset = load_dataset(\"Trelis/tiny-shakespeare\")" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "source": [ 165 | "dataset" 166 | ], 167 | "metadata": { 168 | "id": "pbDKGeelVkaS" 169 | }, 170 | "execution_count": null, 171 | "outputs": [] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "source": [ 176 | "**Quick check**: Verify that the text in the dataset is what you expect. Based on the original [TinyShakespeare](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt) file, what strategy do you think was used to split into train and test? Would you have used this strategy, or something else?" 
177 | ], 178 | "metadata": { 179 | "id": "ltqYqHV2ittg" 180 | } 181 | }, 182 | { 183 | "cell_type": "code", 184 | "source": [ 185 | "# First, we will need to tokenize this dataset using our tokenizer\n", 186 | "tokenized_dataset = dataset.map(\n", 187 | " lambda example: tokenizer(example[\"Text\"]),\n", 188 | " batched=True,\n", 189 | " num_proc=4,\n", 190 | " remove_columns=dataset[\"train\"].column_names\n", 191 | ")" 192 | ], 193 | "metadata": { 194 | "id": "ZsbFeyu9ajmY" 195 | }, 196 | "execution_count": null, 197 | "outputs": [] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "source": [ 202 | "tokenized_dataset" 203 | ], 204 | "metadata": { 205 | "id": "CufJ3PKvbe0i" 206 | }, 207 | "execution_count": null, 208 | "outputs": [] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "source": [ 213 | "# Now, we will need to split the rows into blocks\n", 214 | "block_size = 128\n", 215 | "\n", 216 | "def group_texts(examples):\n", 217 | " # Concatenate all texts.\n", 218 | " concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}\n", 219 | " total_length = len(concatenated_examples[list(examples.keys())[0]])\n", 220 | " # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can\n", 221 | " # customize this part to your needs.\n", 222 | " if total_length >= block_size:\n", 223 | " total_length = (total_length // block_size) * block_size\n", 224 | " # Split by chunks of block_size.\n", 225 | " result = {\n", 226 | " k: [t[i : i + block_size] for i in range(0, total_length, block_size)]\n", 227 | " for k, t in concatenated_examples.items()\n", 228 | " }\n", 229 | " result[\"labels\"] = result[\"input_ids\"].copy()\n", 230 | " return result" 231 | ], 232 | "metadata": { 233 | "id": "NeSHOxayb0NW" 234 | }, 235 | "execution_count": null, 236 | "outputs": [] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "source": [ 241 | "finetuning_dataset = tokenized_dataset.map(group_texts, batched=True, 
num_proc=4)" 242 | ], 243 | "metadata": { 244 | "id": "4VqF4oR7ffjY" 245 | }, 246 | "execution_count": null, 247 | "outputs": [] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "source": [ 252 | "finetuning_dataset" 253 | ], 254 | "metadata": { 255 | "id": "cSZBzw0mfjO2" 256 | }, 257 | "execution_count": null, 258 | "outputs": [] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "source": [ 263 | "**Question:** Why is the number of rows different between `tokenized_dataset` and `finetuning_dataset`? Given the number of rows in a split from `tokenized_dataset`, can you write down an expression for the number of rows in the corresponding `finetuning_dataset` split?" 264 | ], 265 | "metadata": { 266 | "id": "4DScOzSeimSB" 267 | } 268 | }, 269 | { 270 | "cell_type": "code", 271 | "source": [ 272 | "# Here we set up the data collator to pass into the training loop\n", 273 | "from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer\n", 274 | "\n", 275 | "tokenizer.pad_token = tokenizer.eos_token\n", 276 | "data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)" 277 | ], 278 | "metadata": { 279 | "id": "Sozp6ERKgkD3" 280 | }, 281 | "execution_count": null, 282 | "outputs": [] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "source": [ 287 | "training_args = TrainingArguments(\n", 288 | " output_dir=\"my_awesome_shakespeare_model\",\n", 289 | " evaluation_strategy=\"epoch\",\n", 290 | " learning_rate=2e-5,\n", 291 | " weight_decay=0.01\n", 292 | ")\n", 293 | "\n", 294 | "trainer = Trainer(\n", 295 | " model=model,\n", 296 | " args=training_args,\n", 297 | " train_dataset=finetuning_dataset[\"train\"],\n", 298 | " eval_dataset=finetuning_dataset[\"test\"],\n", 299 | " data_collator=data_collator,\n", 300 | ")\n", 301 | "\n", 302 | "trainer.train()" 303 | ], 304 | "metadata": { 305 | "id": "pKDYb1JJfqc2" 306 | }, 307 | "execution_count": null, 308 | "outputs": [] 309 | }, 310 | { 311 | "cell_type": "code", 312 | 
"source": [ 313 | "# Here we evaluate perplexity\n", 314 | "import math\n", 315 | "\n", 316 | "eval_results = trainer.evaluate()\n", 317 | "print(f\"Perplexity: {math.exp(eval_results['eval_loss']):.2f}\")" 318 | ], 319 | "metadata": { 320 | "id": "t-jxQZZ6gPmI" 321 | }, 322 | "execution_count": null, 323 | "outputs": [] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "source": [ 328 | "**Exercise**: Modify the code above to experiment with different learning rates, weight decay, and/or number of epochs.\n", 329 | "\n", 330 | "1. How do these choices affect training loss and validation loss? Which fine-tuning strategy is best?\n", 331 | "2. Use your sampling strategies from Part 1 above to sample from your fine-tuned model. How do the samples compare?" 332 | ], 333 | "metadata": { 334 | "id": "CAnO64bRkXCK" 335 | } 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "source": [ 340 | "## Part 3: PEFT\n", 341 | "\n", 342 | "Another approach to fine-tuning is known as Parameter Efficient Fine-tuning (PEFT). See slides for a diagram of LoRA (Hu et al., 2021)\n", 343 | "\n", 344 | "**Exercise:** Use the HuggingFace [PEFT Guide](https://huggingface.co/docs/peft/quicktour) as a base to implement LoRA or the PEFT technique of your choice. Fine-tune your original LLM using PEFT, being sure to record training loss, validation loss. After PEFT fine-tuning is complete, generate some samples from your model.\n", 345 | "\n", 346 | "1. How do the training/validation losses and generated samples compare to your model from Part 2? Which model is better, in your opinion?\n", 347 | "2. How does the time taken for fine-tuning differ between ordinary fine-tuning and PEFT?\n", 348 | "3. What are some benefits of PEFT relative to ordinary fine-tuning? Which technique would you recommend to use in practice?" 
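To make the efficiency question concrete, here is a back-of-the-envelope sketch of how many parameters LoRA trains for a single weight matrix. LoRA (Hu et al., 2021) freezes the original d x k weight and learns a low-rank update made of two matrices, of shapes d x r and r x k. The 768 x 2304 shape is assumed to match the fused attention projection in GPT-2 small, and the rank r = 8 is an arbitrary choice for illustration; neither number comes from this notebook.

```python
def full_params(d, k):
    """Trainable parameters when the whole d x k weight matrix is fine-tuned."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for a rank-r LoRA update: a d x r matrix plus an r x k matrix."""
    return r * (d + k)


# Assumed shape and rank, for illustration only.
d, k, r = 768, 2304, 8
full = full_params(d, k)
lora = lora_params(d, k, r)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={r}): {lora:,} trainable parameters")
print(f"fraction trained: {lora / full:.2%}")
```

The same arithmetic applies to every matrix that LoRA is attached to, which is why the number of trainable parameters (and the optimizer state that goes with them) shrinks so sharply.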
349 | ], 350 | "metadata": { 351 | "id": "Cd3YKiTHlJR-" 352 | } 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "source": [ 357 | "## Homework (Optional)\n", 358 | "\n", 359 | "- Now that you know how to fine-tune a model for text generation, choose another text-based task. You could choose translation, text classification, or anything else that takes text as input. Fine-tune the LLM of your choice on this new task. How well does it perform? Did the promise of \"foundation model + adaptation\" live up to what you expected, or is there something still to be desired?\n", 360 | "- For a more in-depth look at transformers, you can check out [\"The Annotated Transformer\" tutorial](https://nlp.seas.harvard.edu/annotated-transformer/) by Sasha Rush. It covers masking and positional encoding in more detail, and also discusses advanced topics such as label smoothing and learning rate scheduling.\n", 361 | "\n", 362 | "# Thank you for joining us this Wintersession! We wish you the best of luck on your machine learning journey!\n", 363 | "\n", 364 | "If you have any questions or comments, please reach out to me at ." 365 | ], 366 | "metadata": { 367 | "id": "LDh_YQW1ju5q" 368 | } 369 | }, 370 | { 371 | "cell_type": "code", 372 | "source": [], 373 | "metadata": { 374 | "id": "2cg59FkCj2NY" 375 | }, 376 | "execution_count": null, 377 | "outputs": [] 378 | } 379 | ] 380 | } -------------------------------------------------------------------------------- /past_hackathons/large_language_models_hackathon/README.md: -------------------------------------------------------------------------------- 1 | # Large Language Models Hackathon 2 | 3 | - Date: Jan. 23, 2024 4 | - Session instructor: Jake Snell 5 | 6 | This session introduces the basics of language modeling using the transformer architecture. Participants will learn how to download and fine-tune a large language model using the Hugging Face library. 
7 | 8 | --- 9 | 10 | _The following is aimed at future instructors of this hackathon._ 11 | 12 | Some material that may be good to include in future iterations of this hackathon would be: 13 | - Metrics for language modeling (cross-entropy, perplexity) 14 | - The LLM zoo: GPT, Bloom, PaLM, BERT, RoBERTa, Llama, Mistral, Mixtral, Phi, etc. 15 | - Steps of training an LLM (e.g. supervised fine-tuning, RLHF) 16 | - Prompt engineering: what it is and why it is important 17 | - In-context learning 18 | - High-level overview of fine-tuning approaches: LoRA, prefix tuning, prompt tuning, IA3, etc. 19 | 20 | For a more hands-on introduction to LLMs, it may be worth considering incorporating aspects of [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/) into the hackathon. 21 | -------------------------------------------------------------------------------- /past_hackathons/large_language_models_hackathon/llm_slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PrincetonUniversity/intro_machine_learning/60dfeef803a04d9073e94584d1572a50d1b64f0f/past_hackathons/large_language_models_hackathon/llm_slides.pdf -------------------------------------------------------------------------------- /past_hackathons/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook1_bag_of_words.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "toc_visible": true, 8 | "authorship_tag": "ABX9TyPEd9Oono0VQxWVnVLgHzvt", 9 | "include_colab_link": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "view-in-github", 24 | "colab_type": "text" 25 | }, 26 | 
"source": [ 27 | "\"Open" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "source": [ 33 | "#Introduction to Machine Learning \n", 34 | "**Natural Language Processing Hackathon: Notebook 1 \n", 35 | "Wintersession \n", 36 | "Tuesday, January 24, 2023**" 37 | ], 38 | "metadata": { 39 | "id": "MH7MrrKyZ3dQ" 40 | } 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "source": [ 45 | "The material here is based on Chapter 8 of \n", 46 | "Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library." 47 | ], 48 | "metadata": { 49 | "id": "VkcS5HW9VrsR" 50 | } 51 | }, 52 | { 53 | "cell_type": "code", 54 | "source": [ 55 | "import re\n", 56 | "import pandas as pd\n", 57 | "import numpy as np\n", 58 | "import pprint\n", 59 | "import nltk\n", 60 | "from nltk.stem.porter import PorterStemmer\n", 61 | "from sklearn.feature_extraction.text import CountVectorizer\n", 62 | "from sklearn.feature_extraction.text import TfidfTransformer" 63 | ], 64 | "metadata": { 65 | "id": "UuDdLpWUaBRX" 66 | }, 67 | "execution_count": null, 68 | "outputs": [] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "source": [ 73 | "# How to process natural language using a computer?" 74 | ], 75 | "metadata": { 76 | "id": "HnX9D9Zta3Og" 77 | } 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "source": [ 82 | "Our focus for this project will be sentiment analysis or opinion mining. That is, for a given document, is the sentiment or tone of the document positive or negative?" 
83 | ], 84 | "metadata": { 85 | "id": "xRIImSCma80-" 86 | } 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "source": [ 91 | "\"best movie ever\" \n", 92 | "\"we found this movie to be very entertaining\" \n", 93 | "\"this movie was the worst movie ever\" " 94 | ], 95 | "metadata": { 96 | "id": "vOA93jYdbLh-" 97 | } 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "source": [ 102 | "In order to use computers to do natural language processing we need to convert the text to numbers. What simple approaches can one think of to do this?" 103 | ], 104 | "metadata": { 105 | "id": "qFB1aTjYbqa0" 106 | } 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "source": [ 111 | "# Bag of Words" 112 | ], 113 | "metadata": { 114 | "id": "X0ilCceydSks" 115 | } 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "source": [ 120 | "One approach is to count the number of times that each word appears in each document and associate these counts with the class label. This approach is called bag of words. Let's look at an example." 121 | ], 122 | "metadata": { 123 | "id": "uqMVGSXIdVyM" 124 | } 125 | }, 126 | { 127 | "cell_type": "code", 128 | "source": [ 129 | "df = pd.DataFrame({\"review\":[\"best movie ever\",\n", 130 | " \"we found this movie to be very entertaining\",\n", 131 | " \"this movie was the worst movie ever\"],\n", 132 | " \"sentiment\":[1, 1, 0]})\n", 133 | "df" 134 | ], 135 | "metadata": { 136 | "id": "dCiRWIxOeJqo" 137 | }, 138 | "execution_count": null, 139 | "outputs": [] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "source": [ 144 | "We'll use a tool called a CountVectorizer to perform the counting. See the documentation for the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)."
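Before reaching for the library, it may help to see the counting done by hand. The sketch below builds a bag-of-words table for the three example reviews using only the standard library. It splits on whitespace, which is a simplification: CountVectorizer's default settings also lowercase the text and drop single-character tokens, so the two can differ on other inputs. The names `reviews`, `vocabulary`, and `bag_of_words` are illustrative.

```python
from collections import Counter

reviews = [
    "best movie ever",
    "we found this movie to be very entertaining",
    "this movie was the worst movie ever",
]

# The vocabulary is the sorted set of all words seen in any review.
vocabulary = sorted({word for review in reviews for word in review.split()})


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


# One row of counts per review, with columns in vocabulary order.
vectors = [bag_of_words(review, vocabulary) for review in reviews]

print(vocabulary)
for row in vectors:
    print(row)
```

Each row is a feature vector for one review; word order is discarded, which is where the name "bag" of words comes from.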
145 | ], 146 | "metadata": { 147 | "id": "y9Ri8YtPn5A8" 148 | } 149 | }, 150 | { 151 | "cell_type": "code", 152 | "source": [ 153 | "count = CountVectorizer(stop_words=None)\n", 154 | "bag = count.fit_transform(df[\"review\"])" 155 | ], 156 | "metadata": { 157 | "id": "YJmG6rmSeys6" 158 | }, 159 | "execution_count": null, 160 | "outputs": [] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "source": [ 165 | "The dataframe below shows the term frequencies for each review:" 166 | ], 167 | "metadata": { 168 | "id": "fIwstwEokz5V" 169 | } 170 | }, 171 | { 172 | "cell_type": "code", 173 | "source": [ 174 | "numbers = pd.DataFrame(bag.toarray())\n", 175 | "numbers.columns = sorted(count.vocabulary_.keys())\n", 176 | "numbers" 177 | ], 178 | "metadata": { 179 | "id": "NnwITsTXkkOE" 180 | }, 181 | "execution_count": null, 182 | "outputs": [] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "source": [ 187 | "We now have features that can be used for training a machine learning model! Let's add a few more pieces." 188 | ], 189 | "metadata": { 190 | "id": "iVuJgsA3hZSl" 191 | } 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "source": [ 196 | "# Term Frequency-Inverse Document Frequency" 197 | ], 198 | "metadata": { 199 | "id": "33dRb1HPhk1D" 200 | } 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "source": [ 205 | "Some words appear in many of the reviews (or documents in general) while others only appear rarely. Let's come up with a scheme for up-weighting the rare words and down-weighting the common words. Our hypothesis is that the rare words have more importance.\n", 206 | "\n", 207 | "One solution is to multiply the term frequency of a given word in a document by the log of the ratio of the number of documents divided by the number of documents containing that word. 
Like this:\n", 208 | "\n", 209 | "tf(w, r) = count of word w in review r \n", 210 | "N = total number of reviews \n", 211 | "n(w) = number of reviews containing word w \n", 212 | "\n", 213 | "\n", 214 | "tf-idf = tf(w, r) * (log((N + 1) / (n(w) + 1)) + 1)\n", 215 | "\n", 216 | "The log of the ratio is used to prevent very rare words from getting excess weight. The trailing + 1 is the offset scikit-learn adds to the idf so that words appearing in every review are not zeroed out. Let's try it out and see if the results make sense." 217 | ], 218 | "metadata": { 219 | "id": "RQka29bvhpiM" 220 | } 221 | }, 222 | { 223 | "cell_type": "code", 224 | "source": [ 225 | "tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)\n", 226 | "tbl = tfidf.fit_transform(bag).toarray()\n", 227 | "numbers = pd.DataFrame(tbl)\n", 228 | "numbers.columns = sorted(count.vocabulary_.keys())\n", 229 | "numbers.round(decimals=2)" 230 | ], 231 | "metadata": { 232 | "id": "A0P7ehxOlcrw" 233 | }, 234 | "execution_count": null, 235 | "outputs": [] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "source": [ 240 | "In the first row above, \"best\" has the largest value. This makes sense since it appears once in that review and not in the others. The word \"movie\" appears in all reviews and its magnitude is smallest. In the third row, \"movie\" has the largest magnitude despite being a common word. This arises because it appears twice, so its term frequency is 2, which is high." 241 | ], 242 | "metadata": { 243 | "id": "NEq1PUAAXxss" 244 | } 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "source": [ 249 | "The values in the table above have been normalized by row. 
Let's check that each row is normalized:" 250 | ], 251 | "metadata": { 252 | "id": "oPIOrfa-ngsZ" 253 | } 254 | }, 255 | { 256 | "cell_type": "code", 257 | "source": [ 258 | "print([np.linalg.norm(tbl[i]) for i in [0, 1, 2]])" 259 | ], 260 | "metadata": { 261 | "id": "ubOEKze7mgC-" 262 | }, 263 | "execution_count": null, 264 | "outputs": [] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "source": [ 269 | "Note that using use_idf=False, norm=None and smooth_idf=False simply gives the word counts:" 270 | ], 271 | "metadata": { 272 | "id": "W3knL9XOpr29" 273 | } 274 | }, 275 | { 276 | "cell_type": "code", 277 | "source": [ 278 | "tfidf = TfidfTransformer(use_idf=False, norm=None, smooth_idf=False)\n", 279 | "print(tfidf.fit_transform(bag).toarray())" 280 | ], 281 | "metadata": { 282 | "id": "Gz7fzwgIpgja" 283 | }, 284 | "execution_count": null, 285 | "outputs": [] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "source": [ 290 | "# Stemming" 291 | ], 292 | "metadata": { 293 | "id": "aM0JA2CAqRiO" 294 | } 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "source": [ 299 | "Words like running and run are closely related. They derive from the same stem. We can reduce the number of words by applying stemming." 
300 | ], 301 | "metadata": { 302 | "id": "V_RmC88kqY-C" 303 | } 304 | }, 305 | { 306 | "cell_type": "code", 307 | "source": [ 308 | "porter = PorterStemmer()\n", 309 | "def tokenizer_porter(text):\n", 310 | " return [porter.stem(word) for word in text.split()]\n", 311 | "tokenizer_porter('runners like running and thus they run')" 312 | ], 313 | "metadata": { 314 | "id": "FzaRtOJYsESj" 315 | }, 316 | "execution_count": null, 317 | "outputs": [] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "source": [ 322 | "There is also the trivial tokenizer which does not perform stemming:" 323 | ], 324 | "metadata": { 325 | "id": "y5oKhZgRtuQZ" 326 | } 327 | }, 328 | { 329 | "cell_type": "code", 330 | "source": [ 331 | "def tokenizer(text):\n", 332 | " return text.split()\n", 333 | "tokenizer('runners like running and thus they run')\n", 334 | "['runners', 'like', 'running', 'and', 'thus', 'they', 'run']" 335 | ], 336 | "metadata": { 337 | "id": "aBUCpFhVtyYl" 338 | }, 339 | "execution_count": null, 340 | "outputs": [] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "source": [ 345 | "# Text Cleaning" 346 | ], 347 | "metadata": { 348 | "id": "99sxBuYuunAp" 349 | } 350 | }, 351 | { 352 | "cell_type": "code", 353 | "source": [ 354 | "def preprocessor(text):\n", 355 | " text = re.sub('<[^>]*>', '', text)\n", 356 | " emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text)\n", 357 | " text = (re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))\n", 358 | " return text" 359 | ], 360 | "metadata": { 361 | "id": "MXsMVM4wuqEF" 362 | }, 363 | "execution_count": null, 364 | "outputs": [] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "source": [ 369 | "preprocessor(\"This :) is :( a test :-)!\")" 370 | ], 371 | "metadata": { 372 | "id": "v4F-q3Xvurxa" 373 | }, 374 | "execution_count": null, 375 | "outputs": [] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "source": [ 380 | "Via the first regex, <[^>]*>, in the preceding code section, 
we tried to remove all of the HTML markup from the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. Since we are only interested in removing HTML markup and do not plan to use the HTML markup further, using regex to do the job should be acceptable. However, if you prefer to use sophisticated tools for removing HTML markup from text, you can take a look at Python’s HTML parser module, which is described at https://docs.python.org/3/library/html.parser.html. After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons. Next, we removed all non-word characters from the text via the regex [\\W]+ and converted the text into lowercase characters." 381 | ], 382 | "metadata": { 383 | "id": "BSkxW_3ku4kr" 384 | } 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "source": [ 389 | "# Stop-words" 390 | ], 391 | "metadata": { 392 | "id": "0QxyZSwIwOUR" 393 | } 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "source": [ 398 | "The most common words that may not contribute much information are called stop words. We may consider removing these when pre-processing the text." 
399 | ], 400 | "metadata": { 401 | "id": "8nGSSvUmxYMw" 402 | } 403 | }, 404 | { 405 | "cell_type": "code", 406 | "source": [ 407 | "nltk.download('stopwords')" 408 | ], 409 | "metadata": { 410 | "id": "DSCSHHQzwTHS" 411 | }, 412 | "execution_count": null, 413 | "outputs": [] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "source": [ 418 | "from nltk.corpus import stopwords\n", 419 | "stop = stopwords.words('english')\n", 420 | "[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]\n", 421 | "['runner', 'like', 'run', 'run', 'lot']" 422 | ], 423 | "metadata": { 424 | "id": "gkNi1jUKwciZ" 425 | }, 426 | "execution_count": null, 427 | "outputs": [] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "source": [ 432 | "', '.join(sorted(stop))" 433 | ], 434 | "metadata": { 435 | "id": "7GccDf1OwvEc" 436 | }, 437 | "execution_count": null, 438 | "outputs": [] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "source": [ 443 | "# n-grams" 444 | ], 445 | "metadata": { 446 | "id": "2O7HcPuRx0IC" 447 | } 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "source": [ 452 | "We can make tokens out of multiple words. This allows us to capture features like \"very bad\" or \"very good\"." 
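As a rough illustration of what an n-gram range means, the sketch below generates unigrams and bigrams from one of the example reviews in plain Python. The helper `ngrams` is hypothetical, written for this example; it only mimics the idea behind `ngram_range=(1, 2)` and is not how scikit-learn implements it.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "this movie was the worst movie ever".split()
features = ngrams(tokens, 1, 2)
print(features)
```

Bigrams such as "worst movie" keep a little of the word order that single-word counts throw away, at the cost of a much larger vocabulary.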
453 | ], 454 | "metadata": { 455 | "id": "dIcUMUHrxzOw" 456 | } 457 | }, 458 | { 459 | "cell_type": "code", 460 | "source": [ 461 | "count = CountVectorizer(stop_words=None, ngram_range=(1, 2))\n", 462 | "bag = count.fit_transform(df[\"review\"])" 463 | ], 464 | "metadata": { 465 | "id": "AfalFKIbeEfC" 466 | }, 467 | "execution_count": null, 468 | "outputs": [] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "source": [ 473 | "import pprint\n", 474 | "pprint.pprint(count.vocabulary_)" 475 | ], 476 | "metadata": { 477 | "id": "uUDXhKELfstg" 478 | }, 479 | "execution_count": null, 480 | "outputs": [] 481 | } 482 | ] 483 | } -------------------------------------------------------------------------------- /past_hackathons/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook2_SOLUTION_and_llm_comparison.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "authorship_tag": "ABX9TyPUEoS49PbBK1ZaDfydZrpJ", 8 | "include_colab_link": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "view-in-github", 23 | "colab_type": "text" 24 | }, 25 | "source": [ 26 | "\"Open" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "source": [ 32 | "#Introduction to Machine Learning \n", 33 | "**Natural Language Processing Hackathon: Hackathon Solution \n", 34 | "Wintersession 2023 \n", 35 | "Tuesday, January 24, 2023**" 36 | ], 37 | "metadata": { 38 | "id": "8YCte8fp2XDV" 39 | } 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "source": [ 44 | "The material here is based on Chapter 8 of \n", 45 | "Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. 
The book is available via the PU library.\n", 46 | "\n", 47 | "In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews." 48 | ], 49 | "metadata": { 50 | "id": "j5YLQlOs2jNv" 51 | } 52 | }, 53 | { 54 | "cell_type": "code", 55 | "source": [ 56 | "import re\n", 57 | "import textwrap\n", 58 | "import pandas as pd\n", 59 | "import numpy as np\n", 60 | "import nltk\n", 61 | "from nltk.corpus import stopwords\n", 62 | "from nltk.stem.porter import PorterStemmer\n", 63 | "from sklearn.model_selection import GridSearchCV\n", 64 | "from sklearn.pipeline import Pipeline\n", 65 | "from sklearn.linear_model import LogisticRegression\n", 66 | "from sklearn.feature_extraction.text import TfidfVectorizer" 67 | ], 68 | "metadata": { 69 | "id": "UuDdLpWUaBRX" 70 | }, 71 | "execution_count": null, 72 | "outputs": [] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "source": [ 77 | "# Download Data and Make Dataframe" 78 | ], 79 | "metadata": { 80 | "id": "3Kpy-8Ff8ZZA" 81 | } 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "source": [ 86 | "Download the data:" 87 | ], 88 | "metadata": { 89 | "id": "wjO7F84nz99c" 90 | } 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "id": "qoSng-U6VyvC" 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "source": [ 106 | "Read in the CSV file and print the first 5 rows of the Pandas dataframe:" 107 | ], 108 | "metadata": { 109 | "id": "peptRcYAdrSq" 110 | } 111 | }, 112 | { 113 | "cell_type": "code", 114 | "source": [ 115 | "df = pd.read_csv('movie_data.csv', encoding='utf-8')\n", 116 | "df.head(5)" 117 | ], 118 | "metadata": { 119 | "id": "DuYihEqqcBwN" 120 | }, 121 | "execution_count": null, 122 | "outputs": [] 123 | }, 124 | { 125 | 
"cell_type": "code", 126 | "source": [ 127 | "df[\"raw-review\"] = df[\"review\"]" 128 | ], 129 | "metadata": { 130 | "id": "wp-mY7nDFfsM" 131 | }, 132 | "execution_count": null, 133 | "outputs": [] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "source": [ 138 | "def remove_html_tags(text):\n", 139 | " \"\"\"Remove html tags from a string\"\"\"\n", 140 | " import re\n", 141 | " clean = re.compile('<.*?>')\n", 142 | " return re.sub(clean, '', text)" 143 | ], 144 | "metadata": { 145 | "id": "QQT-KU0-GflF" 146 | }, 147 | "execution_count": null, 148 | "outputs": [] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "source": [ 153 | "remove_html_tags('What is this, said the toad? Where is

the time probe?')" 154 | ], 155 | "metadata": { 156 | "id": "ZHKvtFBJHuPr" 157 | }, 158 | "execution_count": null, 159 | "outputs": [] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "source": [ 164 | "df[\"raw-review\"] = df[\"raw-review\"].apply(remove_html_tags)" 165 | ], 166 | "metadata": { 167 | "id": "jrOrrMkrGktN" 168 | }, 169 | "execution_count": null, 170 | "outputs": [] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "source": [ 175 | "Change the value of idx to vary that amount of train and test data. The default value is 25000 or a 50/50 split." 176 | ], 177 | "metadata": { 178 | "id": "yN0XyTfcggrf" 179 | } 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "source": [ 184 | "# Preprocessing and Train-Test Split" 185 | ], 186 | "metadata": { 187 | "id": "VjfO0RH38o6d" 188 | } 189 | }, 190 | { 191 | "cell_type": "code", 192 | "source": [ 193 | "def preprocessor(text):\n", 194 | " text = re.sub('<[^>]*>', '', text)\n", 195 | " emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text)\n", 196 | " text = (re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))\n", 197 | " return text" 198 | ], 199 | "metadata": { 200 | "id": "YSqs-9TYhKt6" 201 | }, 202 | "execution_count": null, 203 | "outputs": [] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "source": [ 208 | "df['review'] = df['review'].apply(preprocessor)" 209 | ], 210 | "metadata": { 211 | "id": "hDEbzfOahQOv" 212 | }, 213 | "execution_count": null, 214 | "outputs": [] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "source": [ 219 | "idx = 25000\n", 220 | "X_train = df.loc[:idx - 1, 'review'].values\n", 221 | "y_train = df.loc[:idx - 1, 'sentiment'].values\n", 222 | "X_test = df.loc[idx:, 'review'].values\n", 223 | "y_test = df.loc[idx:, 'sentiment'].values" 224 | ], 225 | "metadata": { 226 | "id": "kOOBt1t4ccFx" 227 | }, 228 | "execution_count": null, 229 | "outputs": [] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "source": [ 234 | "def 
tokenizer(text):\n", 235 | " return text.split()" 236 | ], 237 | "metadata": { 238 | "id": "L6zxfzFjhlhP" 239 | }, 240 | "execution_count": null, 241 | "outputs": [] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "source": [ 246 | "porter = PorterStemmer()\n", 247 | "def tokenizer_porter(text):\n", 248 | " return [porter.stem(word) for word in text.split()]" 249 | ], 250 | "metadata": { 251 | "id": "epJ9DjT31bp2" 252 | }, 253 | "execution_count": null, 254 | "outputs": [] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "source": [ 259 | "nltk.download('stopwords')\n", 260 | "stop = stopwords.words(\"english\")" 261 | ], 262 | "metadata": { 263 | "id": "9bee9sBr1DqL" 264 | }, 265 | "execution_count": null, 266 | "outputs": [] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "source": [ 271 | "# Preprocessing and Training Pipeline" 272 | ], 273 | "metadata": { 274 | "id": "Qui887XB8GFl" 275 | } 276 | }, 277 | { 278 | "cell_type": "code", 279 | "source": [ 280 | "tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)\n", 281 | "param_grid = [{'vect__ngram_range': [(1, 1)],\n", 282 | " 'vect__stop_words': [stop],\n", 283 | " 'vect__tokenizer': [tokenizer],\n", 284 | " 'vect__use_idf': [True],\n", 285 | " 'vect__norm': [None],\n", 286 | " 'clf__penalty': ['l2'],\n", 287 | " 'clf__C': [1.0]}]\n", 288 | "\n", 289 | "lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(solver='liblinear'))])\n", 290 | "gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)\n", 291 | "gs_lr_tfidf.fit(X_train, y_train)\n", 292 | "\n", 293 | "print(gs_lr_tfidf.best_params_)\n", 294 | "print(gs_lr_tfidf.best_score_)\n", 295 | "\n", 296 | "clf = gs_lr_tfidf.best_estimator_\n", 297 | "print('Accuracy (test):', clf.score(X_test, y_test))" 298 | ], 299 | "metadata": { 300 | "id": "ewiIlHtw1I-e" 301 | }, 302 | "execution_count": null, 303 | "outputs": [] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | 
"source": [ 308 | "Pipelines can be expensive to evaluate. Above, the param_grid contains a single combination of parameters. For a more extensive search, use the param_grid below:" 309 | ], 310 | "metadata": { 311 | "id": "GR1Dnz2uu-0L" 312 | } 313 | }, 314 | { 315 | "cell_type": "code", 316 | "source": [ 317 | "param_grid = [{'vect__ngram_range': [(1, 3)],\n", 318 | " 'vect__stop_words': [None],\n", 319 | " 'vect__tokenizer': [tokenizer, tokenizer_porter],\n", 320 | " 'clf__penalty': ['l2'],\n", 321 | " 'clf__C': [1.0, 10.0]},\n", 322 | " {'vect__ngram_range': [(1, 1)],\n", 323 | " 'vect__stop_words': [stop, None],\n", 324 | " 'vect__tokenizer': [tokenizer],\n", 325 | " 'vect__use_idf': [True, False],\n", 326 | " 'vect__norm': [None],\n", 327 | " 'clf__penalty': ['l2'],\n", 328 | " 'clf__C': [1.0, 10.0]}]" 329 | ], 330 | "metadata": { 331 | "id": "qv5C1CjT9c06" 332 | }, 333 | "execution_count": null, 334 | "outputs": [] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "source": [ 339 | "# Pretrained Large Language Model" 340 | ], 341 | "metadata": { 342 | "id": "sDg1yJd6-h0O" 343 | } 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "source": [ 348 | "For an introduction to transformers see the Colab notebook: https://tinyurl.com/hugfacetutorial" 349 | ], 350 | "metadata": { 351 | "id": "nDeq3gWJ-_yx" 352 | } 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "source": [ 357 | "For an introduction to transformers on the Princeton Research Computing clusters see this repo by David Turner of PNI: [GitHub](https://github.com/davidt0x/hf_tutorial). 
In particular, see slides.pptx" 358 | ], 359 | "metadata": { 360 | "id": "aY2cunOo5dsU" 361 | } 362 | }, 363 | { 364 | "cell_type": "code", 365 | "source": [ 366 | "%%capture\n", 367 | "%pip install transformers[sentencepiece]" 368 | ], 369 | "metadata": { 370 | "id": "xpuO1dXl_Xv6" 371 | }, 372 | "execution_count": null, 373 | "outputs": [] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "source": [ 378 | "from transformers import pipeline\n", 379 | "\n", 380 | "sentiment_pipeline = pipeline('text-classification', model=\"distilbert-base-uncased-finetuned-sst-2-english\")" 381 | ], 382 | "metadata": { 383 | "id": "kusC65__-mhL" 384 | }, 385 | "execution_count": null, 386 | "outputs": [] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "source": [ 391 | "review = df.loc[0]['raw-review']\n", 392 | "print(review)" 393 | ], 394 | "metadata": { 395 | "id": "gVlvEZHKAHpN" 396 | }, 397 | "execution_count": null, 398 | "outputs": [] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "source": [ 403 | "sentiment_pipeline(review)[0]['label']" 404 | ], 405 | "metadata": { 406 | "id": "SyVYEWCj-nWb" 407 | }, 408 | "execution_count": null, 409 | "outputs": [] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "source": [ 414 | "df[\"truncated-review\"] = df['raw-review'].apply(lambda x: x if len(x.split()) < 300 else ' '.join(x.split()[:300]))" 415 | ], 416 | "metadata": { 417 | "id": "kYPONnP4AsHR" 418 | }, 419 | "execution_count": null, 420 | "outputs": [] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "source": [ 425 | "df_sub = df[:250].copy()" 426 | ], 427 | "metadata": { 428 | "id": "SCGNiTMiCzT2" 429 | }, 430 | "execution_count": null, 431 | "outputs": [] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "source": [ 436 | "df_sub.head()" 437 | ], 438 | "metadata": { 439 | "id": "BMW4l4_lDNfa" 440 | }, 441 | "execution_count": null, 442 | "outputs": [] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "source": [ 447 | "df_sub[\"pretrained-distillbert-pred\"] = 
df_sub['truncated-review'].apply(lambda x: sentiment_pipeline(x)[0]['label'])" 448 | ], 449 | "metadata": { 450 | "id": "E8vu6XR2APgq" 451 | }, 452 | "execution_count": null, 453 | "outputs": [] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "source": [ 458 | "df_sub[\"pretrained-distillbert-pred\"].value_counts()" 459 | ], 460 | "metadata": { 461 | "id": "R9ePJ_pPIMQq" 462 | }, 463 | "execution_count": null, 464 | "outputs": [] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "source": [ 469 | "df_sub[\"pretrained-distillbert-pred\"] = df_sub[\"pretrained-distillbert-pred\"].apply(lambda x: 0 if x == 'NEGATIVE' else 1)" 470 | ], 471 | "metadata": { 472 | "id": "HuHm-pbXAhjC" 473 | }, 474 | "execution_count": null, 475 | "outputs": [] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "source": [ 480 | "distillbert_accuracy = df_sub[df_sub[\"pretrained-distillbert-pred\"] == df_sub[\"sentiment\"]].shape[0] / df_sub.shape[0]\n", 481 | "print(f'{100 * distillbert_accuracy}%')" 482 | ], 483 | "metadata": { 484 | "id": "K5NjrRDTEjdm" 485 | }, 486 | "execution_count": null, 487 | "outputs": [] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "source": [ 492 | "With no training on our data, the pretrained LLM gets almost the same accuracy (measured on the first 250 reviews) as our ML model." 493 | ], 494 | "metadata": { 495 | "id": "O_A6J81GJnzW" 496 | } 497 | }, 498 | { 499 | "cell_type": "code", 500 | "source": [], 501 | "metadata": { 502 | "id": "CufPereWLmBQ" 503 | }, 504 | "execution_count": null, 505 | "outputs": [] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "source": [ 510 | "Exercise: Use the LLM to summarize one of the reviews." 
511 | ], 512 | "metadata": { 513 | "id": "F5A0z32oLmxn" 514 | } 515 | }, 516 | { 517 | "cell_type": "code", 518 | "source": [ 519 | "summarization_pipeline = pipeline(\"summarization\", model=\"sshleifer/distilbart-cnn-12-6\")" 520 | ], 521 | "metadata": { 522 | "id": "z02Fqpw8Lsjb" 523 | }, 524 | "execution_count": null, 525 | "outputs": [] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "source": [ 530 | "review = df.loc[6][\"raw-review\"]\n", 531 | "review" 532 | ], 533 | "metadata": { 534 | "id": "svuOlXQDL3yc" 535 | }, 536 | "execution_count": null, 537 | "outputs": [] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "source": [ 542 | "outputs = summarization_pipeline(review, max_length=80, clean_up_tokenization_spaces=True)\n", 543 | "wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)\n", 544 | "print(wrapper.fill(outputs[0]['summary_text']))" 545 | ], 546 | "metadata": { 547 | "id": "HP6fyqseLvIh" 548 | }, 549 | "execution_count": null, 550 | "outputs": [] 551 | } 552 | ] 553 | } -------------------------------------------------------------------------------- /past_hackathons/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook2_hackathon.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "toc_visible": true, 8 | "authorship_tag": "ABX9TyMuOi57j8EtrC5kz4UbtZUF", 9 | "include_colab_link": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "view-in-github", 24 | "colab_type": "text" 25 | }, 26 | "source": [ 27 | "\"Open" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "source": [ 33 | "#Introduction to Machine Learning \n", 34 | "**Natural Language Processing 
Hackathon: Notebook 2 \n", 35 | "Wintersession \n", 36 | "Tuesday, January 24, 2023**" 37 | ], 38 | "metadata": { 39 | "id": "MH7MrrKyZ3dQ" 40 | } 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "source": [ 45 | "The material here is based on Chapter 8 of \n", 46 | "Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library.\n", 47 | "\n", 48 | "In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews." 49 | ], 50 | "metadata": { 51 | "id": "AcJNrVl84xDp" 52 | } 53 | }, 54 | { 55 | "cell_type": "code", 56 | "source": [ 57 | "import re\n", 58 | "import textwrap\n", 59 | "import pandas as pd\n", 60 | "import numpy as np\n", 61 | "import nltk\n", 62 | "from nltk.corpus import stopwords\n", 63 | "from nltk.stem.porter import PorterStemmer\n", 64 | "from sklearn.model_selection import GridSearchCV\n", 65 | "from sklearn.pipeline import Pipeline\n", 66 | "from sklearn.linear_model import LogisticRegression\n", 67 | "from sklearn.feature_extraction.text import TfidfVectorizer" 68 | ], 69 | "metadata": { 70 | "id": "UuDdLpWUaBRX" 71 | }, 72 | "execution_count": null, 73 | "outputs": [] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "source": [ 78 | "# Download and View the Data" 79 | ], 80 | "metadata": { 81 | "id": "RFLgSxPO2u-2" 82 | } 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "source": [ 87 | "Download the data set:" 88 | ], 89 | "metadata": { 90 | "id": "wjO7F84nz99c" 91 | } 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": { 97 | "id": "qoSng-U6VyvC" 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "source": [ 107 | "Read in the CSV file and print 
the first 5 rows of the Pandas dataframe:" 108 | ], 109 | "metadata": { 110 | "id": "peptRcYAdrSq" 111 | } 112 | }, 113 | { 114 | "cell_type": "code", 115 | "source": [ 116 | "df = pd.read_csv('movie_data.csv', encoding='utf-8')\n", 117 | "df.head(5)" 118 | ], 119 | "metadata": { 120 | "id": "DuYihEqqcBwN" 121 | }, 122 | "execution_count": null, 123 | "outputs": [] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "source": [ 128 | "Let's look at the number of total rows and the data types:" 129 | ], 130 | "metadata": { 131 | "id": "rlcaf5fad1VT" 132 | } 133 | }, 134 | { 135 | "cell_type": "code", 136 | "source": [ 137 | "df.info()" 138 | ], 139 | "metadata": { 140 | "id": "7tK0-ZCLdQVV" 141 | }, 142 | "execution_count": null, 143 | "outputs": [] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "source": [ 148 | "Let's check for class imbalance:" 149 | ], 150 | "metadata": { 151 | "id": "js0X9iZkda1v" 152 | } 153 | }, 154 | { 155 | "cell_type": "code", 156 | "source": [ 157 | "df[\"sentiment\"].value_counts()" 158 | ], 159 | "metadata": { 160 | "id": "yvB3XKdudStC" 161 | }, 162 | "execution_count": null, 163 | "outputs": [] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "source": [ 168 | "The classes are balanced so we do not need to worry about imbalance. Next, let's print some reviews to get a sense of the content." 
169 | ], 170 | "metadata": { 171 | "id": "4uycrR1jeHBr" 172 | } 173 | }, 174 | { 175 | "cell_type": "code", 176 | "source": [ 177 | "def print_reviews_and_sentiment(d, start_index=42, num=3, width=80):\n", 178 | " wrapper = textwrap.TextWrapper(width=width, break_long_words=False, break_on_hyphens=False)\n", 179 | " for i in range(start_index, start_index + num):\n", 180 | " print(wrapper.fill(str(d.loc[i][\"review\"])))\n", 181 | " print('------------')\n", 182 | " print(f'Sentiment: {d.loc[i][\"sentiment\"]}\\n')" 183 | ], 184 | "metadata": { 185 | "id": "NVoLU81BcQtK" 186 | }, 187 | "execution_count": null, 188 | "outputs": [] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "source": [ 193 | "print_reviews_and_sentiment(df, start_index=42, num=2)" 194 | ], 195 | "metadata": { 196 | "id": "cyhpu6ycjSNh" 197 | }, 198 | "execution_count": null, 199 | "outputs": [] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "source": [ 204 | "# Hackathon Project" 205 | ], 206 | "metadata": { 207 | "id": "pri7RiNL110z" 208 | } 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "source": [ 213 | "Train a classifier on the movie review data. See if you can get about 88% accuracy on the test set that you make. Use the techniques from the previous notebook and previous workshop days." 
214 | ], 215 | "metadata": { 216 | "id": "kgu5qQAh15ko" 217 | } 218 | } 219 | ] 220 | } -------------------------------------------------------------------------------- /past_hackathons/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook2_hackathon_HINTS.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "authorship_tag": "ABX9TyOgyIBfhW3MOO3ltL5zC8DS", 8 | "include_colab_link": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "view-in-github", 23 | "colab_type": "text" 24 | }, 25 | "source": [ 26 | "\"Open" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "source": [ 32 | "#Introduction to Machine Learning \n", 33 | "**Natural Language Processing Hackathon: Notebook 2 HINTS \n", 34 | "Wintersession \n", 35 | "Tuesday, January 24, 2023**" 36 | ], 37 | "metadata": { 38 | "id": "MH7MrrKyZ3dQ" 39 | } 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "source": [ 44 | "The material here is based on Chapter 8 of \n", 45 | "Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library.\n", 46 | "\n", 47 | "In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews." 
48 | ], 49 | "metadata": { 50 | "id": "W51U-7ZW4sNI" 51 | } 52 | }, 53 | { 54 | "cell_type": "code", 55 | "source": [ 56 | "import re\n", 57 | "import textwrap\n", 58 | "import pandas as pd\n", 59 | "import numpy as np\n", 60 | "import nltk\n", 61 | "from nltk.corpus import stopwords\n", 62 | "from nltk.stem.porter import PorterStemmer\n", 63 | "from sklearn.model_selection import GridSearchCV\n", 64 | "from sklearn.pipeline import Pipeline\n", 65 | "from sklearn.linear_model import LogisticRegression\n", 66 | "from sklearn.feature_extraction.text import TfidfVectorizer" 67 | ], 68 | "metadata": { 69 | "id": "UuDdLpWUaBRX" 70 | }, 71 | "execution_count": null, 72 | "outputs": [] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "source": [ 77 | "Download the data set:" 78 | ], 79 | "metadata": { 80 | "id": "wjO7F84nz99c" 81 | } 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": { 87 | "id": "qoSng-U6VyvC" 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "source": [ 97 | "Read in the CSV file and print the first 5 rows of the Pandas dataframe:" 98 | ], 99 | "metadata": { 100 | "id": "peptRcYAdrSq" 101 | } 102 | }, 103 | { 104 | "cell_type": "code", 105 | "source": [ 106 | "df = pd.read_csv('movie_data.csv', encoding='utf-8')\n", 107 | "df.head(5)" 108 | ], 109 | "metadata": { 110 | "id": "DuYihEqqcBwN" 111 | }, 112 | "execution_count": null, 113 | "outputs": [] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "source": [ 118 | "Let's look at the number of total rows and the data types:" 119 | ], 120 | "metadata": { 121 | "id": "rlcaf5fad1VT" 122 | } 123 | }, 124 | { 125 | "cell_type": "code", 126 | "source": [ 127 | "df.info()" 128 | ], 129 | "metadata": { 130 | "id": "7tK0-ZCLdQVV" 131 | }, 132 | "execution_count": null, 133 | "outputs": [] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "source": [ 
138 | "Let's check for class imbalance:" 139 | ], 140 | "metadata": { 141 | "id": "js0X9iZkda1v" 142 | } 143 | }, 144 | { 145 | "cell_type": "code", 146 | "source": [ 147 | "df[\"sentiment\"].value_counts()" 148 | ], 149 | "metadata": { 150 | "id": "yvB3XKdudStC" 151 | }, 152 | "execution_count": null, 153 | "outputs": [] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "source": [ 158 | "The classes are balanced so we do not need to worry about imbalance. Next, let's print some reviews to get a sense of the content." 159 | ], 160 | "metadata": { 161 | "id": "4uycrR1jeHBr" 162 | } 163 | }, 164 | { 165 | "cell_type": "code", 166 | "source": [ 167 | "def print_reviews_and_sentiment(d, start_index=42, num=3, width=80):\n", 168 | " wrapper = textwrap.TextWrapper(width=width, break_long_words=False, break_on_hyphens=False)\n", 169 | " for i in range(start_index, start_index + num):\n", 170 | " print(wrapper.fill(str(d.loc[i][\"review\"])))\n", 171 | " print('------------')\n", 172 | " print(f'Sentiment: {d.loc[i][\"sentiment\"]}\\n')" 173 | ], 174 | "metadata": { 175 | "id": "NVoLU81BcQtK" 176 | }, 177 | "execution_count": null, 178 | "outputs": [] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "source": [ 183 | "print_reviews_and_sentiment(df, start_index=42, num=2)" 184 | ], 185 | "metadata": { 186 | "id": "cyhpu6ycjSNh" 187 | }, 188 | "execution_count": null, 189 | "outputs": [] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "source": [ 194 | "Change the value of idx to vary the amount of training and test data. The default value is 25000, which gives a 50/50 split." 
195 | ], 196 | "metadata": { 197 | "id": "yN0XyTfcggrf" 198 | } 199 | }, 200 | { 201 | "cell_type": "code", 202 | "source": [ 203 | "def preprocessor(text):\n", 204 | " text = re.sub('<[^>]*>', '', text)\n", 205 | " emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text)\n", 206 | " text = (re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))\n", 207 | " return text" 208 | ], 209 | "metadata": { 210 | "id": "YSqs-9TYhKt6" 211 | }, 212 | "execution_count": null, 213 | "outputs": [] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "source": [ 218 | "Via the first regex, <[^>]*>, in the preceding code section, we tried to remove all of the HTML markup from the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. Since we are only interested in removing HTML markup and do not plan to use the HTML markup further, using regex to do the job should be acceptable. However, if you prefer to use sophisticated tools for removing HTML markup from text, you can take a look at Python’s HTML parser module, which is described at https://docs.python.org/3/library/html.parser.html. After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons. Next, we removed all non-word characters from the text via the regex [\\W]+ and converted the text into lowercase characters." 
219 | ], 220 | "metadata": { 221 | "id": "UhLHT8pu5uWY" 222 | } 223 | }, 224 | { 225 | "cell_type": "code", 226 | "source": [ 227 | "df['review'] = df['review'].apply(preprocessor)" 228 | ], 229 | "metadata": { 230 | "id": "hDEbzfOahQOv" 231 | }, 232 | "execution_count": null, 233 | "outputs": [] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "source": [ 238 | "print_reviews_and_sentiment(df, start_index=42, num=2)" 239 | ], 240 | "metadata": { 241 | "id": "OI-4WUWUimJw" 242 | }, 243 | "execution_count": null, 244 | "outputs": [] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "source": [ 249 | "Create a train-test split:" 250 | ], 251 | "metadata": { 252 | "id": "GZwH1OQLkBKB" 253 | } 254 | }, 255 | { 256 | "cell_type": "code", 257 | "source": [ 258 | "idx = 25000\n", 259 | "X_train = df.loc[:idx - 1, 'review'].values\n", 260 | "y_train = df.loc[:idx - 1, 'sentiment'].values\n", 261 | "X_test = df.loc[idx:, 'review'].values\n", 262 | "y_test = df.loc[idx:, 'sentiment'].values" 263 | ], 264 | "metadata": { 265 | "id": "kOOBt1t4ccFx" 266 | }, 267 | "execution_count": null, 268 | "outputs": [] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "source": [ 273 | "Let's try using the word counts as the features to get started:" 274 | ], 275 | "metadata": { 276 | "id": "J97KRI7pmnbS" 277 | } 278 | }, 279 | { 280 | "cell_type": "code", 281 | "source": [ 282 | "tfidf = TfidfVectorizer(use_idf=False, norm=None, smooth_idf=False)\n", 283 | "word_counts = tfidf.fit_transform(X_train)" 284 | ], 285 | "metadata": { 286 | "id": "yhgfDr2OpreS" 287 | }, 288 | "execution_count": null, 289 | "outputs": [] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "source": [ 294 | "type(word_counts)" 295 | ], 296 | "metadata": { 297 | "id": "Fxaw2kjoq_YC" 298 | }, 299 | "execution_count": null, 300 | "outputs": [] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "source": [ 305 | "word_counts.shape" 306 | ], 307 | "metadata": { 308 | "id": "5H1exZeIqP2K" 309 | }, 310 | 
"execution_count": null, 311 | "outputs": [] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "source": [ 316 | "list(tfidf.vocabulary_.items())[:10]" 317 | ], 318 | "metadata": { 319 | "id": "QJt0JBNZnzgN" 320 | }, 321 | "execution_count": null, 322 | "outputs": [] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "source": [ 327 | "print(df.loc[1][\"review\"])" 328 | ], 329 | "metadata": { 330 | "id": "xUqo7SwsseM3" 331 | }, 332 | "execution_count": null, 333 | "outputs": [] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "source": [ 338 | "print(word_counts[1,:])" 339 | ], 340 | "metadata": { 341 | "id": "ljeEWKf4sOYA" 342 | }, 343 | "execution_count": null, 344 | "outputs": [] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "source": [ 349 | "tfidf.vocabulary_[\"window\"]" 350 | ], 351 | "metadata": { 352 | "id": "XJctczjUpLn5" 353 | }, 354 | "execution_count": null, 355 | "outputs": [] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "source": [ 360 | "clf = LogisticRegression(C=1.0, solver='liblinear')\n", 361 | "clf = clf.fit(word_counts, y_train)" 362 | ], 363 | "metadata": { 364 | "id": "oWnjcA5wgz14" 365 | }, 366 | "execution_count": null, 367 | "outputs": [] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "source": [ 372 | "The accuracy on the test set is:" 373 | ], 374 | "metadata": { 375 | "id": "AB1o6eeOqBna" 376 | } 377 | }, 378 | { 379 | "cell_type": "code", 380 | "source": [ 381 | "clf.score(tfidf.transform(X_test), y_test)" 382 | ], 383 | "metadata": { 384 | "id": "UUZ9youasypj" 385 | }, 386 | "execution_count": null, 387 | "outputs": [] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "source": [ 392 | "Notice that the .transform() method was applied to the test set while .fit_transform() was applied to the train set. In this notebook we only worked with unnormalized word counts. We did nothing with stop-words, stemming, inverse document frequency weighting, n-grams, etc. 
The full solution in the next notebook uses a Pipeline to try out various combinations of these choices to find the best one." 393 | ], 394 | "metadata": { 395 | "id": "fnBoPcDfqPpf" 396 | } 397 | } 398 | ] 399 | } -------------------------------------------------------------------------------- /past_hackathons/quarterback_performance_hackathon/NFL_QB_Data.csv: -------------------------------------------------------------------------------- 1 | Year,Round,Pick,Team,Player,Pos,Age,LY,AllProYrs,ProBowls,YrsStrtr,wAV,DrAV,G,PaCmp,PaAtt,PaYds,PaTD,PaInt,RuAtt,RuYds,RuTD,Rec,ReYds,ReTD,Solo,Int,Sk,College/Univ 2 | 2011,1,1,CAR,Cam Newton,QB,22,2021,1,3,9,115,107,148,2682,4474,32382,194,123,1118,5628,75,3,68,1,,,,Auburn 3 | 2012,1,1,IND,Andrew Luck,QB,22,2018,0,4,5,72,72,86,2000,3290,23671,171,83,332,1590,14,1,4,0,,,,Stanford 4 | 2015,1,1,TAM,Jameis Winston,QB,21,2022,0,1,5,59,54,86,1738,2835,21840,139,96,293,1220,11,0,0,0,,,,Florida St. 5 | 2016,1,1,LAR,Jared Goff,QB,21,2022,0,2,5,71,48,100,2250,3502,25854,155,70,209,474,10,1,5,0,2,,,California 6 | 2018,1,1,CLE,Baker Mayfield,QB,23,2022,0,0,4,42,42,72,1386,2259,16288,102,64,189,660,6,1,17,0,1,,,Oklahoma 7 | 2019,1,1,ARI,Kyler Murray,QB,22,2022,0,2,3,51,51,57,1316,1971,13848,84,41,381,2204,23,0,7,0,,,,Oklahoma 8 | 2020,1,1,CIN,Joe Burrow,QB,23,2022,0,1,2,38,38,42,1044,1530,11774,82,31,152,517,10,0,0,0,1,,,LSU 9 | 2021,1,1,JAX,Trevor Lawrence,QB,21,2022,0,0,1,21,21,34,746,1186,7754,37,25,135,625,7,0,0,0,4,,,Clemson 10 | 2012,1,2,WAS,Robert Griffin III,QB,22,2020,0,1,3,36,31,56,799,1268,9271,43,30,307,1809,10,0,0,0,,,,Baylor 11 | 2015,1,2,TEN,Marcus Mariota,QB,21,2022,0,0,4,54,43,87,1312,2095,15656,92,54,349,2012,17,2,62,1,1,,,Oregon 12 | 2016,1,2,PHI,Carson Wentz,QB,23,2022,0,1,6,59,43,93,2056,3284,22129,151,66,337,1362,10,2,11,0,1,,,North Dakota St. 
13 | 2017,1,2,CHI,Mitchell Trubisky,QB,23,2022,0,1,4,36,33,64,1133,1765,11904,68,43,222,1119,11,0,0,0,,,,North Carolina 14 | 2021,1,2,NYJ,Zach Wilson,QB,22,2022,0,0,1,9,9,22,345,625,4022,15,18,57,287,5,1,2,1,1,,,BYU 15 | 2014,1,3,JAX,Blake Bortles,QB,22,2019,0,0,5,44,44,78,1562,2634,17649,103,75,283,1766,8,1,20,1,,,,Central Florida 16 | 2018,1,3,NYJ,Sam Darnold,QB,21,2022,0,0,4,25,16,56,1054,1765,11767,61,55,188,745,12,0,0,0,,,,USC 17 | 2021,1,3,SFO,Trey Lance,QB,21,2022,0,0,0,4,4,8,56,102,797,5,3,54,235,1,0,0,0,1,,,North Dakota St. 18 | 2020,1,5,MIA,Tua Tagovailoa,QB,22,2022,0,0,2,23,23,36,708,1078,8015,52,23,101,307,6,0,0,0,,,,Alabama 19 | 2019,1,6,NYG,Daniel Jones,QB,22,2022,0,0,3,38,38,54,1113,1740,11603,60,34,292,1708,12,1,16,0,,,,Duke 20 | 2020,1,6,LAC,Justin Herbert,QB,22,2022,0,1,2,43,43,49,1316,1966,14089,94,35,172,683,8,2,-10,0,2,,,Oregon 21 | 2018,1,7,BUF,Josh Allen,QB,22,2022,0,3,4,68,68,77,1604,2566,18397,138,60,546,3087,38,1,12,1,4,,,Wyoming 22 | 2011,1,8,TEN,Jake Locker,QB,23,2014,0,0,1,15,15,30,408,709,4967,27,22,95,644,5,0,0,0,,,,Washington 23 | 2012,1,8,MIA,Ryan Tannehill,QB,24,2022,0,1,9,88,47,145,2914,4534,33265,212,108,423,2029,26,4,8,1,3,,,Texas A&M 24 | 2011,1,10,JAX,Blaine Gabbert,QB,21,2022,0,0,2,16,7,67,864,1533,9302,51,47,194,640,3,1,-16,0,,,,Missouri 25 | 2017,1,10,KAN,Patrick Mahomes,QB,21,2022,1,5,4,85,85,80,1985,2993,24241,192,49,299,1547,12,1,6,0,2,,,Texas Tech 26 | 2018,1,10,ARI,Josh Rosen,QB,21,2021,0,0,1,3,2,24,277,513,2864,12,21,26,151,0,0,0,0,,,,UCLA 27 | 2021,1,11,CHI,Justin Fields,QB,22,2022,0,0,0,22,22,27,351,588,4112,24,21,232,1563,10,0,0,0,,,,Ohio St. 28 | 2011,1,12,MIN,Christian Ponder,QB,23,2014,0,0,3,22,22,38,632,1057,6658,38,36,126,639,7,1,-15,0,1,,,Florida St. 29 | 2017,1,12,HOU,Deshaun Watson,QB,21,2022,0,3,3,55,52,60,1285,1918,15641,111,41,343,1852,18,1,6,1,1,,,Clemson 30 | 2019,1,15,WAS,Dwayne Haskins,QB,22,2020,0,0,0,4,4,16,267,444,2804,12,14,40,147,1,0,0,0,,,,Ohio St. 
31 | 2021,1,15,NWE,Mac Jones,QB,23,2022,0,1,1,22,22,31,640,963,6798,36,24,91,231,1,0,0,0,1,,,Alabama 32 | 2013,1,16,BUF,EJ Manuel,QB,23,2017,0,0,1,10,10,30,343,590,3767,20,16,96,339,4,0,0,0,,,,Florida St. 33 | 2012,1,22,CLE,Brandon Weeden,QB,28,2018,0,0,1,13,10,35,559,965,6462,31,30,62,200,1,0,-9,0,,,,Oklahoma St. 34 | 2014,1,22,CLE,Johnny Manziel,QB,21,2015,0,0,0,4,4,14,147,258,1675,7,7,46,259,1,0,0,0,,,,Texas A&M 35 | 2016,1,26,DEN,Paxton Lynch,QB,22,2017,0,0,0,2,2,5,79,128,792,4,4,16,55,0,0,0,0,,,,Memphis 36 | 2020,1,26,GNB,Jordan Love,QB,21,2022,0,0,0,3,3,10,50,83,606,3,3,13,26,0,0,0,0,,,,Utah St. 37 | 2014,1,32,MIN,Teddy Bridgewater,QB,21,2022,0,1,4,49,22,78,1372,2067,15120,75,47,219,846,11,0,0,0,,,,Louisville 38 | 2018,1,32,BAL,Lamar Jackson,QB,21,2022,1,2,3,69,69,70,1055,1655,12209,101,38,727,4437,24,0,0,0,1,,,Louisville 39 | 2011,2,35,CIN,Andy Dalton,QB,23,2022,0,3,10,90,82,166,3374,5396,38150,244,144,468,1465,22,3,11,1,1,,,TCU 40 | 2011,2,36,SFO,Colin Kaepernick,QB,23,2016,0,0,4,45,45,69,1011,1692,12271,72,30,375,2300,13,0,0,0,,,,Nevada 41 | 2014,2,36,OAK,Derek Carr,QB,23,2022,0,3,8,82,82,142,3201,4958,35222,217,99,278,845,6,1,-9,0,,,,Fresno St. 42 | 2013,2,39,NYJ,Geno Smith,QB,22,2022,0,1,2,31,14,62,991,1578,11199,64,48,226,1067,9,1,13,0,1,,,West Virginia 43 | 2019,2,42,DEN,Drew Lock,QB,22,2021,0,0,1,12,12,24,421,710,4740,25,20,72,285,5,1,1,0,1,,,Missouri 44 | 2016,2,51,NYJ,Christian Hackenberg,QB,21,,0,0,0,,,,,,,,,,,,,,,,,,Penn St. 45 | 2017,2,52,CLE,DeShone Kizer,QB,21,2018,0,0,1,6,5,18,275,518,3081,11,24,82,458,5,0,0,0,,,,Notre Dame 46 | 2020,2,53,PHI,Jalen Hurts,QB,22,2022,0,1,1,40,40,45,648,1040,7906,44,19,367,1898,26,1,3,0,2,,,Oklahoma 47 | 2012,2,57,DEN,Brock Osweiler,QB,21,2018,0,0,1,14,7,49,697,1165,7418,37,31,92,266,4,1,-14,0,,,,Arizona St. 48 | 2014,2,62,NWE,Jimmy Garoppolo,QB,22,2022,0,0,2,46,2,74,1167,1726,14289,87,42,165,225,7,2,-3,0,,,,East. 
Illinois 49 | 2021,2,64,TAM,Kyle Trask,QB,23,2022,0,0,0,0,0,1,3,9,23,0,0,0,0,0,0,0,0,,,,Florida 50 | 2021,3,66,MIN,Kellen Mond,QB,22,2021,0,0,0,0,0,1,2,3,5,0,0,0,0,0,0,0,0,,,,Texas A&M 51 | 2021,3,67,HOU,Davis Mills,QB,22,2022,0,0,1,14,14,28,555,873,5782,33,25,50,152,2,0,0,0,2,,,Stanford 52 | 2013,3,73,TAM,Mike Glennon,QB,23,2021,0,0,1,9,11,40,689,1147,7025,47,35,56,140,1,0,0,0,1,,,North Carolina St. 53 | 2011,3,74,NWE,Ryan Mallett,QB,23,2017,0,0,0,4,0,21,190,345,1835,9,10,28,-5,1,0,0,0,,,,Arkansas 54 | 2012,3,75,SEA,Russell Wilson,QB,23,2022,0,9,10,130,125,173,3371,5218,40583,308,98,901,4966,26,5,21,1,,,,Wisconsin 55 | 2015,3,75,NOR,Garrett Grayson,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Colorado St. 56 | 2018,3,76,PIT,Mason Rudolph,QB,23,2021,0,0,1,5,5,17,236,384,2366,16,11,33,89,0,0,0,0,,,,Oklahoma St. 57 | 2017,3,87,NYG,Davis Webb,QB,22,2022,0,0,0,1,1,2,23,40,168,1,0,8,38,1,0,0,0,,,,California 58 | 2012,3,88,PHI,Nick Foles,QB,23,2022,0,1,3,32,24,71,1302,2087,14227,82,47,151,407,6,1,10,0,,,,Arizona 59 | 2015,3,89,STL,Sean Mannion,QB,23,2021,0,0,0,2,1,14,67,110,573,1,3,25,-3,0,0,0,0,,,,Oregon St. 60 | 2016,3,91,NWE,Jacoby Brissett,QB,23,2022,0,0,2,33,2,76,963,1577,10350,48,23,227,896,15,1,2,0,,,,North Carolina St. 61 | 2016,3,93,CLE,Cody Kessler,QB,23,2018,0,0,1,5,3,17,224,349,2215,8,5,31,140,0,0,0,0,,,,USC 62 | 2013,4,98,PHI,Matt Barkley,QB,22,2020,0,0,1,5,1,19,212,363,2699,11,22,23,-12,0,1,2,1,,,,USC 63 | 2016,4,100,OAK,Connor Cook,QB,23,2016,0,0,0,0,0,1,14,21,150,1,1,0,0,0,0,0,0,,,,Michigan St. 64 | 2019,3,100,CAR,Will Grier,QB,24,2019,0,0,0,1,1,2,28,52,228,0,4,7,22,0,0,0,0,,,,West Virginia 65 | 2012,4,102,WAS,Kirk Cousins,QB,24,2022,0,4,7,90,35,142,3249,4866,37140,252,105,290,933,19,1,-1,0,1,,,Michigan St. 66 | 2015,4,103,NYJ,Bryce Petty,QB,24,2017,0,0,0,4,4,10,130,245,1353,4,10,12,74,0,0,0,0,,,,Baylor 67 | 2017,3,104,SFO,C.J. 
Beathard,QB,23,2022,0,0,0,8,8,25,300,510,3537,18,14,56,231,4,0,0,0,,,,Iowa
68 | 2019,4,104,CIN,Ryan Finley,QB,24,2020,0,0,0,2,2,8,58,119,638,3,4,21,143,1,0,0,0,,,,North Carolina St.
69 | 2018,4,108,NYG,Kyle Lauletta,QB,23,2018,0,0,0,0,0,2,0,5,0,0,1,1,-2,0,0,0,0,,,,Richmond
70 | 2013,4,110,NYG,Ryan Nassib,QB,23,2015,0,0,0,0,0,5,9,10,128,1,0,2,-3,0,0,0,0,,,,Syracuse
71 | 2013,4,112,OAK,Tyler Wilson,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Arkansas
72 | 2013,4,115,PIT,Landry Jones,QB,24,2017,0,0,0,4,4,18,108,169,1310,8,7,19,-19,0,0,0,0,,,,Oklahoma
73 | 2014,4,120,ARI,Logan Thomas,QB,23,2022,0,0,1,11,0,78,3,11,124,1,0,3,5,0,164,1506,12,10,,,Virginia Tech
74 | 2020,4,122,IND,Jacob Eason,QB,22,2022,0,0,0,0,0,2,5,10,84,0,2,0,0,0,0,0,0,,,,Washington
75 | 2020,4,125,NYJ,James Morgan,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Florida International
76 | 2019,4,133,NWE,Jarrett Stidham,QB,23,2022,0,0,0,4,1,13,77,131,926,6,7,23,89,0,0,0,0,2,,,Auburn
77 | 2021,4,133,NOR,Ian Book,QB,23,2021,0,0,0,0,0,1,12,20,135,0,2,3,6,0,0,0,0,,,,Notre Dame
78 | 2011,5,135,KAN,Ricky Stanzi,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Iowa
79 | 2014,4,135,HOU,Tom Savage,QB,24,2017,0,0,1,3,3,13,181,315,2000,5,7,16,8,0,0,0,0,,,,Pittsburgh
80 | 2016,4,135,DAL,Dak Prescott,QB,23,2022,0,3,5,77,77,97,2185,3283,24943,166,65,352,1642,26,1,11,1,,,,Mississippi St.
81 | 2017,4,135,PIT,Joshua Dobbs,QB,22,2022,0,0,0,1,0,8,50,85,456,2,3,14,75,0,0,0,0,,,,Tennessee
82 | 2016,4,139,BUF,Cardale Jones,QB,23,2016,0,0,0,0,0,1,6,11,96,0,1,1,-1,0,0,0,0,,,,Ohio St.
83 | 2015,5,147,GNB,Brett Hundley,QB,22,2019,0,0,1,5,5,18,199,337,1902,9,13,46,309,2,1,10,0,,,,UCLA
84 | 2011,5,152,HOU,T.J. Yates,QB,24,2017,0,0,0,6,6,22,179,324,2057,10,11,28,107,1,0,0,0,,,,North Carolina
85 | 2011,5,160,CHI,Nathan Enderle,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Idaho
86 | 2016,5,162,KAN,Kevin Hogan,QB,23,2021,0,0,0,2,,9,60,101,621,4,7,18,176,1,0,0,0,,,,Stanford
87 | 2014,5,163,KAN,Aaron Murray,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Georgia
88 | 2014,5,164,CIN,A.J. McCarron,QB,23,2020,0,0,0,4,3,17,109,174,1173,6,3,22,68,1,0,0,0,,,,Alabama
89 | 2019,5,166,LAC,Easton Stick,QB,23,2020,0,0,0,0,0,1,1,1,4,0,0,1,-2,0,0,0,0,,,,North Dakota St.
90 | 2019,5,167,PHI,Clayton Thorson,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Northwestern
91 | 2020,5,167,BUF,Jake Fromm,QB,22,2021,0,0,0,1,,3,27,60,210,1,3,8,65,0,0,0,0,,,,Georgia
92 | 2017,5,171,BUF,Nathan Peterman,QB,23,2022,0,0,0,3,2,13,85,160,712,4,13,22,91,1,0,0,0,,,,Pittsburgh
93 | 2018,5,171,DAL,Mike White,QB,23,2022,0,0,0,5,,8,191,307,2145,8,12,11,8,1,0,0,0,,,,Western Kentucky
94 | 2014,6,178,TEN,Zach Mettenberger,QB,23,2015,0,0,1,1,1,14,208,345,2347,12,14,14,12,1,0,0,0,,,,LSU
95 | 2019,6,178,JAX,Gardner Minshew II,QB,23,2022,0,0,2,20,17,32,586,933,6632,44,15,112,521,2,1,0,0,1,,,Washington St.
96 | 2011,6,180,BAL,Tyrod Taylor,QB,22,2022,0,1,3,45,1,81,952,1550,10794,60,26,366,2071,19,2,10,0,,,,Virginia Tech
97 | 2014,6,183,CHI,David Fales,QB,23,2019,0,0,0,1,0,5,31,48,287,1,1,5,8,1,0,0,0,,,,San Jose St.
98 | 2012,6,185,ARI,Ryan Lindley,QB,23,2015,0,0,0,-4,-4,10,140,274,1372,3,11,4,7,0,0,0,0,,,,San Diego St.
99 | 2016,6,187,WAS,Nate Sudfeld,QB,22,2022,0,0,0,1,,6,25,37,188,1,1,10,28,0,0,0,0,,,,Indiana
100 | 2020,6,189,JAX,Jake Luton,QB,24,2020,0,0,0,2,2,3,60,110,624,2,6,1,13,1,0,0,0,,,,Oregon St.
101 | 2016,6,191,DET,Jake Rudock,QB,23,2017,0,0,0,0,0,3,3,5,24,0,1,0,0,0,0,0,0,,,,Michigan
102 | 2014,6,194,BAL,Keith Wenning,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Ball St.
103 | 2019,6,197,BAL,Trace McSorley,QB,24,2022,0,0,0,1,0,9,48,93,502,1,5,21,79,0,0,0,0,,,,Penn St.
104 | 2018,6,199,TEN,Luke Falk,QB,23,2019,0,0,0,1,,3,47,73,416,0,3,0,0,0,0,0,0,,,,Washington St.
105 | 2016,6,201,JAX,Brandon Allen,QB,24,2022,0,0,0,4,,15,149,263,1611,10,6,33,64,0,0,0,0,,,,Arkansas
106 | 2018,6,203,JAX,Tanner Lee,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Nebraska
107 | 2016,6,207,SFO,Jeff Driskel,QB,23,2022,0,0,0,7,,23,216,365,2228,14,8,73,384,3,2,10,0,,,,Louisiana Tech
108 | 2011,7,208,NYJ,Greg McElroy,QB,23,2012,0,0,0,1,1,2,19,31,214,1,1,8,30,0,0,0,0,,,,Alabama
109 | 2014,6,213,NYJ,Tajh Boyd,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Clemson
110 | 2014,6,214,STL,Garrett Gilbert,QB,23,2021,0,0,0,2,,8,43,75,477,1,1,6,25,0,0,0,0,,,,SMU
111 | 2017,6,215,DET,Brad Kaaya,QB,22,,0,0,0,,,,,,,,,,,,,,,,,,Miami (FL)
112 | 2021,6,218,IND,Sam Ehlinger,QB,22,2022,0,0,0,2,2,7,64,101,573,3,3,20,96,0,0,0,0,,,,Texas
113 | 2018,7,219,NWE,Danny Etling,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,LSU
114 | 2018,7,220,SEA,Alex McGough,QB,22,,0,0,0,,,,,,,,,,,,,,,,,,Florida International
115 | 2013,7,221,SDG,Brad Sorensen,QB,25,,0,0,0,,,,,,,,,,,,,,,,,,Southern Utah
116 | 2016,7,223,MIA,Brandon Doughty,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Western Kentucky
117 | 2020,7,224,TEN,Cole McDonald,QB,22,,0,0,0,,,,,,,,,,,,,,,,,,Hawaii
118 | 2020,7,231,DAL,Ben DiNucci,QB,23,2020,0,0,0,1,1,3,23,43,219,0,0,6,22,0,0,0,0,,,,James Madison
119 | 2013,7,234,DEN,Zac Dysert,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Miami (OH)
120 | 2013,7,237,SFO,B.J. Daniels,QB,24,2015,0,0,0,0,,8,1,2,7,0,0,6,6,0,2,18,0,1,,,South Florida
121 | 2020,7,240,NOR,Tommy Stevens,QB,23,2020,0,0,0,0,,1,0,0,0,0,0,4,24,0,0,0,0,,,,Mississippi St.
122 | 2012,7,243,GNB,B.J. Coleman,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Chattanooga
123 | 2020,7,244,MIN,Nate Stanley,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Iowa
124 | 2013,7,249,ATL,Sean Renfree,QB,23,2015,0,0,0,0,0,2,3,7,11,0,1,1,-4,0,0,0,0,,,,Duke
125 | 2018,7,249,CIN,Logan Woodside,QB,23,2022,0,0,0,0,,12,1,3,7,0,0,13,4,0,0,0,0,,,,Toledo
126 | 2015,7,250,DEN,Trevor Siemian,QB,23,2022,0,0,2,17,13,35,621,1055,7027,42,28,73,211,2,0,0,0,,,,Northwestern
127 | 2012,7,253,IND,Chandler Harnish,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Northern Illinois
128 | 2017,7,253,DEN,Chad Kelly,QB,23,2018,0,0,0,0,0,1,0,0,0,0,0,1,-1,0,0,0,0,,,,Mississippi
129 | --------------------------------------------------------------------------------