the time probe?')"
154 | ],
155 | "metadata": {
156 | "id": "ZHKvtFBJHuPr"
157 | },
158 | "execution_count": null,
159 | "outputs": []
160 | },
161 | {
162 | "cell_type": "code",
163 | "source": [
164 | "df[\"raw-review\"] = df[\"raw-review\"].apply(remove_html_tags)"
165 | ],
166 | "metadata": {
167 | "id": "jrOrrMkrGktN"
168 | },
169 | "execution_count": null,
170 | "outputs": []
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "source": [
175 |         "Change the value of idx to vary the amount of training and test data. The default value is 25000, which gives a 50/50 split."
176 | ],
177 | "metadata": {
178 | "id": "yN0XyTfcggrf"
179 | }
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "source": [
184 | "# Preprocessing and Train-Test Split"
185 | ],
186 | "metadata": {
187 | "id": "VjfO0RH38o6d"
188 | }
189 | },
190 | {
191 | "cell_type": "code",
192 | "source": [
193 | "def preprocessor(text):\n",
194 |         "    text = re.sub(r'<[^>]*>', '', text)\n",
195 |         "    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text)\n",
196 |         "    text = (re.sub(r'[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))\n",
197 | " return text"
198 | ],
199 | "metadata": {
200 | "id": "YSqs-9TYhKt6"
201 | },
202 | "execution_count": null,
203 | "outputs": []
204 | },
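    |     {
    |       "cell_type": "markdown",
    |       "source": [
    |         "As a quick sanity check, here is the preprocessor applied to a made-up snippet (not from the dataset). The HTML tag is stripped, the text is lowercased with non-word characters collapsed to spaces, and the emoticon is appended at the end without its \"nose\" character:"
    |       ],
    |       "metadata": {}
    |     },
    |     {
    |       "cell_type": "code",
    |       "source": [
    |         "preprocessor('<br />This movie was GREAT :-) I would watch it again!')"
    |       ],
    |       "metadata": {},
    |       "execution_count": null,
    |       "outputs": []
    |     },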
205 | {
206 | "cell_type": "code",
207 | "source": [
208 | "df['review'] = df['review'].apply(preprocessor)"
209 | ],
210 | "metadata": {
211 | "id": "hDEbzfOahQOv"
212 | },
213 | "execution_count": null,
214 | "outputs": []
215 | },
216 | {
217 | "cell_type": "code",
218 | "source": [
219 | "idx = 25000\n",
220 | "X_train = df.loc[:idx - 1, 'review'].values\n",
221 | "y_train = df.loc[:idx - 1, 'sentiment'].values\n",
222 | "X_test = df.loc[idx:, 'review'].values\n",
223 | "y_test = df.loc[idx:, 'sentiment'].values"
224 | ],
225 | "metadata": {
226 | "id": "kOOBt1t4ccFx"
227 | },
228 | "execution_count": null,
229 | "outputs": []
230 | },
231 | {
232 | "cell_type": "code",
233 | "source": [
234 | "def tokenizer(text):\n",
235 | " return text.split()"
236 | ],
237 | "metadata": {
238 | "id": "L6zxfzFjhlhP"
239 | },
240 | "execution_count": null,
241 | "outputs": []
242 | },
243 | {
244 | "cell_type": "code",
245 | "source": [
246 | "porter = PorterStemmer()\n",
247 | "def tokenizer_porter(text):\n",
248 | " return [porter.stem(word) for word in text.split()]"
249 | ],
250 | "metadata": {
251 | "id": "epJ9DjT31bp2"
252 | },
253 | "execution_count": null,
254 | "outputs": []
255 | },
256 | {
257 | "cell_type": "code",
258 | "source": [
259 | "nltk.download('stopwords')\n",
260 | "stop = stopwords.words(\"english\")"
261 | ],
262 | "metadata": {
263 | "id": "9bee9sBr1DqL"
264 | },
265 | "execution_count": null,
266 | "outputs": []
267 | },
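    |     {
    |       "cell_type": "markdown",
    |       "source": [
    |         "To see what these helpers do together, here is an illustrative one-liner (any short sentence works). The Porter stemmer reduces each word to its stem, and the stop-word list filters out common words such as \"a\" and \"and\":"
    |       ],
    |       "metadata": {}
    |     },
    |     {
    |       "cell_type": "code",
    |       "source": [
    |         "[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]"
    |       ],
    |       "metadata": {},
    |       "execution_count": null,
    |       "outputs": []
    |     },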
268 | {
269 | "cell_type": "markdown",
270 | "source": [
271 | "# Preprocessing and Training Pipeline"
272 | ],
273 | "metadata": {
274 | "id": "Qui887XB8GFl"
275 | }
276 | },
277 | {
278 | "cell_type": "code",
279 | "source": [
280 | "tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)\n",
281 | "param_grid = [{'vect__ngram_range': [(1, 1)],\n",
282 | " 'vect__stop_words': [stop],\n",
283 | " 'vect__tokenizer': [tokenizer],\n",
284 | " 'vect__use_idf': [True],\n",
285 | " 'vect__norm': [None],\n",
286 | " 'clf__penalty': ['l2'],\n",
287 | " 'clf__C': [1.0]}]\n",
288 | "\n",
289 | "lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(solver='liblinear'))])\n",
290 | "gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)\n",
291 | "gs_lr_tfidf.fit(X_train, y_train)\n",
292 | "\n",
293 | "print(gs_lr_tfidf.best_params_)\n",
294 | "print(gs_lr_tfidf.best_score_)\n",
295 | "\n",
296 | "clf = gs_lr_tfidf.best_estimator_\n",
297 | "print('Accuracy (test):', clf.score(X_test, y_test))"
298 | ],
299 | "metadata": {
300 | "id": "ewiIlHtw1I-e"
301 | },
302 | "execution_count": null,
303 | "outputs": []
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "source": [
308 |         "Pipelines can be expensive to evaluate. In the code above, param_grid contains only a single combination of parameters. For a more extensive search, use the param_grid below:"
309 | ],
310 | "metadata": {
311 | "id": "GR1Dnz2uu-0L"
312 | }
313 | },
314 | {
315 | "cell_type": "code",
316 | "source": [
317 | "param_grid = [{'vect__ngram_range': [(1, 3)],\n",
318 | " 'vect__stop_words': [None],\n",
319 | " 'vect__tokenizer': [tokenizer, tokenizer_porter],\n",
320 | " 'clf__penalty': ['l2'],\n",
321 | " 'clf__C': [1.0, 10.0]},\n",
322 | " {'vect__ngram_range': [(1, 1)],\n",
323 | " 'vect__stop_words': [stop, None],\n",
324 | " 'vect__tokenizer': [tokenizer],\n",
325 | " 'vect__use_idf': [True, False],\n",
326 | " 'vect__norm': [None],\n",
327 | " 'clf__penalty': ['l2'],\n",
328 | " 'clf__C': [1.0, 10.0]}]"
329 | ],
330 | "metadata": {
331 | "id": "qv5C1CjT9c06"
332 | },
333 | "execution_count": null,
334 | "outputs": []
335 | },
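    |     {
    |       "cell_type": "markdown",
    |       "source": [
    |         "A sketch of how you might re-run the search with this larger grid: it is the same GridSearchCV call as before, just pointed at the new param_grid. With 12 parameter combinations and 5-fold cross-validation it performs 60 fits, so expect a much longer runtime."
    |       ],
    |       "metadata": {}
    |     },
    |     {
    |       "cell_type": "code",
    |       "source": [
    |         "gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)\n",
    |         "gs_lr_tfidf.fit(X_train, y_train)\n",
    |         "\n",
    |         "print(gs_lr_tfidf.best_params_)\n",
    |         "print(gs_lr_tfidf.best_score_)"
    |       ],
    |       "metadata": {},
    |       "execution_count": null,
    |       "outputs": []
    |     },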
336 | {
337 | "cell_type": "markdown",
338 | "source": [
339 | "# Pretrained Large Language Model"
340 | ],
341 | "metadata": {
342 | "id": "sDg1yJd6-h0O"
343 | }
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "source": [
348 |         "For an introduction to transformers, see the Colab notebook: https://tinyurl.com/hugfacetutorial"
349 | ],
350 | "metadata": {
351 | "id": "nDeq3gWJ-_yx"
352 | }
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "source": [
357 |         "For an introduction to transformers on the Princeton Research Computing clusters, see this repo by David Turner of PNI: [GitHub](https://github.com/davidt0x/hf_tutorial). In particular, see slides.pptx."
358 | ],
359 | "metadata": {
360 | "id": "aY2cunOo5dsU"
361 | }
362 | },
363 | {
364 | "cell_type": "code",
365 | "source": [
366 | "%%capture\n",
367 | "%pip install transformers[sentencepiece]"
368 | ],
369 | "metadata": {
370 | "id": "xpuO1dXl_Xv6"
371 | },
372 | "execution_count": null,
373 | "outputs": []
374 | },
375 | {
376 | "cell_type": "code",
377 | "source": [
378 | "from transformers import pipeline\n",
379 | "\n",
380 | "sentiment_pipeline = pipeline('text-classification', model=\"distilbert-base-uncased-finetuned-sst-2-english\")"
381 | ],
382 | "metadata": {
383 | "id": "kusC65__-mhL"
384 | },
385 | "execution_count": null,
386 | "outputs": []
387 | },
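    |     {
    |       "cell_type": "markdown",
    |       "source": [
    |         "As a quick smoke test on a made-up sentence (any short text works), the pipeline returns a list with one dict per input, holding the predicted label and a confidence score:"
    |       ],
    |       "metadata": {}
    |     },
    |     {
    |       "cell_type": "code",
    |       "source": [
    |         "sentiment_pipeline('I loved every minute of this movie!')"
    |       ],
    |       "metadata": {},
    |       "execution_count": null,
    |       "outputs": []
    |     },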
388 | {
389 | "cell_type": "code",
390 | "source": [
391 | "review = df.loc[0]['raw-review']\n",
392 | "print(review)"
393 | ],
394 | "metadata": {
395 | "id": "gVlvEZHKAHpN"
396 | },
397 | "execution_count": null,
398 | "outputs": []
399 | },
400 | {
401 | "cell_type": "code",
402 | "source": [
403 | "sentiment_pipeline(review)[0]['label']"
404 | ],
405 | "metadata": {
406 | "id": "SyVYEWCj-nWb"
407 | },
408 | "execution_count": null,
409 | "outputs": []
410 | },
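    |     {
    |       "cell_type": "markdown",
    |       "source": [
    |         "One caveat: DistilBERT, like most BERT-style models, accepts at most 512 tokens, and some reviews are longer than that. As a quick sketch, you can count a review's tokens with the pipeline's own tokenizer; this is why the next cell truncates long reviews to roughly 300 words:"
    |       ],
    |       "metadata": {}
    |     },
    |     {
    |       "cell_type": "code",
    |       "source": [
    |         "# Number of subword tokens (including special tokens) the model would see for this review\n",
    |         "len(sentiment_pipeline.tokenizer(review)['input_ids'])"
    |       ],
    |       "metadata": {},
    |       "execution_count": null,
    |       "outputs": []
    |     },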
411 | {
412 | "cell_type": "code",
413 | "source": [
414 | "df[\"truncated-review\"] = df['raw-review'].apply(lambda x: x if len(x.split()) < 300 else ' '.join(x.split()[:300]))"
415 | ],
416 | "metadata": {
417 | "id": "kYPONnP4AsHR"
418 | },
419 | "execution_count": null,
420 | "outputs": []
421 | },
422 | {
423 | "cell_type": "code",
424 | "source": [
425 | "df_sub = df[:250].copy()"
426 | ],
427 | "metadata": {
428 | "id": "SCGNiTMiCzT2"
429 | },
430 | "execution_count": null,
431 | "outputs": []
432 | },
433 | {
434 | "cell_type": "code",
435 | "source": [
436 | "df_sub.head()"
437 | ],
438 | "metadata": {
439 | "id": "BMW4l4_lDNfa"
440 | },
441 | "execution_count": null,
442 | "outputs": []
443 | },
444 | {
445 | "cell_type": "code",
446 | "source": [
447 | "df_sub[\"pretrained-distillbert-pred\"] = df_sub['truncated-review'].apply(lambda x: sentiment_pipeline(x)[0]['label'])"
448 | ],
449 | "metadata": {
450 | "id": "E8vu6XR2APgq"
451 | },
452 | "execution_count": null,
453 | "outputs": []
454 | },
455 | {
456 | "cell_type": "code",
457 | "source": [
458 | "df_sub[\"pretrained-distillbert-pred\"].value_counts()"
459 | ],
460 | "metadata": {
461 | "id": "R9ePJ_pPIMQq"
462 | },
463 | "execution_count": null,
464 | "outputs": []
465 | },
466 | {
467 | "cell_type": "code",
468 | "source": [
469 | "df_sub[\"pretrained-distillbert-pred\"] = df_sub[\"pretrained-distillbert-pred\"].apply(lambda x: 0 if x == 'NEGATIVE' else 1)"
470 | ],
471 | "metadata": {
472 | "id": "HuHm-pbXAhjC"
473 | },
474 | "execution_count": null,
475 | "outputs": []
476 | },
477 | {
478 | "cell_type": "code",
479 | "source": [
480 | "distillbert_accuracy = df_sub[df_sub[\"pretrained-distillbert-pred\"] == df_sub[\"sentiment\"]].shape[0] / df_sub.shape[0]\n",
481 |         "print(f'{100 * distillbert_accuracy:.1f}%')"
482 | ],
483 | "metadata": {
484 | "id": "K5NjrRDTEjdm"
485 | },
486 | "execution_count": null,
487 | "outputs": []
488 | },
489 | {
490 | "cell_type": "markdown",
491 | "source": [
492 |         "The pretrained LLM gets nearly the same accuracy as our trained ML model, even though we did no training of our own."
493 | ],
494 | "metadata": {
495 | "id": "O_A6J81GJnzW"
496 | }
497 | },
507 | {
508 | "cell_type": "markdown",
509 | "source": [
510 | "Exercise: Use the LLM to summarize one of the reviews."
511 | ],
512 | "metadata": {
513 | "id": "F5A0z32oLmxn"
514 | }
515 | },
516 | {
517 | "cell_type": "code",
518 | "source": [
519 | "summarization_pipeline = pipeline(\"summarization\", model=\"sshleifer/distilbart-cnn-12-6\")"
520 | ],
521 | "metadata": {
522 | "id": "z02Fqpw8Lsjb"
523 | },
524 | "execution_count": null,
525 | "outputs": []
526 | },
527 | {
528 | "cell_type": "code",
529 | "source": [
530 | "review = df.loc[6][\"raw-review\"]\n",
531 | "review"
532 | ],
533 | "metadata": {
534 | "id": "svuOlXQDL3yc"
535 | },
536 | "execution_count": null,
537 | "outputs": []
538 | },
539 | {
540 | "cell_type": "code",
541 | "source": [
542 | "outputs = summarization_pipeline(review, max_length=80, clean_up_tokenization_spaces=True)\n",
543 | "wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)\n",
544 | "print(wrapper.fill(outputs[0]['summary_text']))"
545 | ],
546 | "metadata": {
547 | "id": "HP6fyqseLvIh"
548 | },
549 | "execution_count": null,
550 | "outputs": []
551 | }
552 | ]
553 | }
--------------------------------------------------------------------------------
/past_hackathons/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook2_hackathon.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "toc_visible": true,
8 | "authorship_tag": "ABX9TyMuOi57j8EtrC5kz4UbtZUF",
9 | "include_colab_link": true
10 | },
11 | "kernelspec": {
12 | "name": "python3",
13 | "display_name": "Python 3"
14 | },
15 | "language_info": {
16 | "name": "python"
17 | }
18 | },
19 | "cells": [
30 | {
31 | "cell_type": "markdown",
32 | "source": [
33 |         "# Introduction to Machine Learning \n",
34 | "**Natural Language Processing Hackathon: Notebook 2 \n",
35 | "Wintersession \n",
36 | "Tuesday, January 24, 2023**"
37 | ],
38 | "metadata": {
39 | "id": "MH7MrrKyZ3dQ"
40 | }
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "source": [
45 | "The material here is based on Chapter 8 of \n",
46 | "Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library.\n",
47 | "\n",
48 | "In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews."
49 | ],
50 | "metadata": {
51 | "id": "AcJNrVl84xDp"
52 | }
53 | },
54 | {
55 | "cell_type": "code",
56 | "source": [
57 | "import re\n",
58 | "import textwrap\n",
59 | "import pandas as pd\n",
60 | "import numpy as np\n",
61 | "import nltk\n",
62 | "from nltk.corpus import stopwords\n",
63 | "from nltk.stem.porter import PorterStemmer\n",
64 | "from sklearn.model_selection import GridSearchCV\n",
65 | "from sklearn.pipeline import Pipeline\n",
66 | "from sklearn.linear_model import LogisticRegression\n",
67 | "from sklearn.feature_extraction.text import TfidfVectorizer"
68 | ],
69 | "metadata": {
70 | "id": "UuDdLpWUaBRX"
71 | },
72 | "execution_count": null,
73 | "outputs": []
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "source": [
78 |         "# Download and View the Data"
79 | ],
80 | "metadata": {
81 | "id": "RFLgSxPO2u-2"
82 | }
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "source": [
87 | "Download the data set:"
88 | ],
89 | "metadata": {
90 | "id": "wjO7F84nz99c"
91 | }
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {
97 | "id": "qoSng-U6VyvC"
98 | },
99 | "outputs": [],
100 | "source": [
101 | "!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "source": [
107 | "Read in the CSV file and print the first 5 rows of the Pandas dataframe:"
108 | ],
109 | "metadata": {
110 | "id": "peptRcYAdrSq"
111 | }
112 | },
113 | {
114 | "cell_type": "code",
115 | "source": [
116 | "df = pd.read_csv('movie_data.csv', encoding='utf-8')\n",
117 | "df.head(5)"
118 | ],
119 | "metadata": {
120 | "id": "DuYihEqqcBwN"
121 | },
122 | "execution_count": null,
123 | "outputs": []
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "source": [
128 | "Let's look at the number of total rows and the data types:"
129 | ],
130 | "metadata": {
131 | "id": "rlcaf5fad1VT"
132 | }
133 | },
134 | {
135 | "cell_type": "code",
136 | "source": [
137 | "df.info()"
138 | ],
139 | "metadata": {
140 | "id": "7tK0-ZCLdQVV"
141 | },
142 | "execution_count": null,
143 | "outputs": []
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "source": [
148 | "Let's check for class imbalance:"
149 | ],
150 | "metadata": {
151 | "id": "js0X9iZkda1v"
152 | }
153 | },
154 | {
155 | "cell_type": "code",
156 | "source": [
157 | "df[\"sentiment\"].value_counts()"
158 | ],
159 | "metadata": {
160 | "id": "yvB3XKdudStC"
161 | },
162 | "execution_count": null,
163 | "outputs": []
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "source": [
168 |         "The classes are balanced, so we do not need to worry about imbalance. Next, let's print some reviews to get a sense of the content."
169 | ],
170 | "metadata": {
171 | "id": "4uycrR1jeHBr"
172 | }
173 | },
174 | {
175 | "cell_type": "code",
176 | "source": [
177 | "def print_reviews_and_sentiment(d, start_index=42, num=3, width=80):\n",
178 | " wrapper = textwrap.TextWrapper(width=width, break_long_words=False, break_on_hyphens=False)\n",
179 | " for i in range(start_index, start_index + num):\n",
180 | " print(wrapper.fill(str(d.loc[i][\"review\"])))\n",
181 | " print('------------')\n",
182 | " print(f'Sentiment: {d.loc[i][\"sentiment\"]}\\n')"
183 | ],
184 | "metadata": {
185 | "id": "NVoLU81BcQtK"
186 | },
187 | "execution_count": null,
188 | "outputs": []
189 | },
190 | {
191 | "cell_type": "code",
192 | "source": [
193 | "print_reviews_and_sentiment(df, start_index=42, num=2)"
194 | ],
195 | "metadata": {
196 | "id": "cyhpu6ycjSNh"
197 | },
198 | "execution_count": null,
199 | "outputs": []
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "source": [
204 | "# Hackathon Project"
205 | ],
206 | "metadata": {
207 | "id": "pri7RiNL110z"
208 | }
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "source": [
213 |         "Train a classifier on the movie review data. See if you can reach about 88% accuracy on the test set that you create. Use the techniques from the previous notebook and the previous workshop days."
214 | ],
215 | "metadata": {
216 | "id": "kgu5qQAh15ko"
217 | }
218 | }
219 | ]
220 | }
--------------------------------------------------------------------------------
/past_hackathons/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook2_hackathon_HINTS.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "authorship_tag": "ABX9TyOgyIBfhW3MOO3ltL5zC8DS",
8 | "include_colab_link": true
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
29 | {
30 | "cell_type": "markdown",
31 | "source": [
32 |         "# Introduction to Machine Learning \n",
33 | "**Natural Language Processing Hackathon: Notebook 2 HINTS \n",
34 | "Wintersession \n",
35 | "Tuesday, January 24, 2023**"
36 | ],
37 | "metadata": {
38 | "id": "MH7MrrKyZ3dQ"
39 | }
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "source": [
44 | "The material here is based on Chapter 8 of \n",
45 | "Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library.\n",
46 | "\n",
47 | "In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews."
48 | ],
49 | "metadata": {
50 | "id": "W51U-7ZW4sNI"
51 | }
52 | },
53 | {
54 | "cell_type": "code",
55 | "source": [
56 | "import re\n",
57 | "import textwrap\n",
58 | "import pandas as pd\n",
59 | "import numpy as np\n",
60 | "import nltk\n",
61 | "from nltk.corpus import stopwords\n",
62 | "from nltk.stem.porter import PorterStemmer\n",
63 | "from sklearn.model_selection import GridSearchCV\n",
64 | "from sklearn.pipeline import Pipeline\n",
65 | "from sklearn.linear_model import LogisticRegression\n",
66 | "from sklearn.feature_extraction.text import TfidfVectorizer"
67 | ],
68 | "metadata": {
69 | "id": "UuDdLpWUaBRX"
70 | },
71 | "execution_count": null,
72 | "outputs": []
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "source": [
77 | "Download the data set:"
78 | ],
79 | "metadata": {
80 | "id": "wjO7F84nz99c"
81 | }
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {
87 | "id": "qoSng-U6VyvC"
88 | },
89 | "outputs": [],
90 | "source": [
91 | "!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "source": [
97 | "Read in the CSV file and print the first 5 rows of the Pandas dataframe:"
98 | ],
99 | "metadata": {
100 | "id": "peptRcYAdrSq"
101 | }
102 | },
103 | {
104 | "cell_type": "code",
105 | "source": [
106 | "df = pd.read_csv('movie_data.csv', encoding='utf-8')\n",
107 | "df.head(5)"
108 | ],
109 | "metadata": {
110 | "id": "DuYihEqqcBwN"
111 | },
112 | "execution_count": null,
113 | "outputs": []
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "source": [
118 | "Let's look at the number of total rows and the data types:"
119 | ],
120 | "metadata": {
121 | "id": "rlcaf5fad1VT"
122 | }
123 | },
124 | {
125 | "cell_type": "code",
126 | "source": [
127 | "df.info()"
128 | ],
129 | "metadata": {
130 | "id": "7tK0-ZCLdQVV"
131 | },
132 | "execution_count": null,
133 | "outputs": []
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "source": [
138 | "Let's check for class imbalance:"
139 | ],
140 | "metadata": {
141 | "id": "js0X9iZkda1v"
142 | }
143 | },
144 | {
145 | "cell_type": "code",
146 | "source": [
147 | "df[\"sentiment\"].value_counts()"
148 | ],
149 | "metadata": {
150 | "id": "yvB3XKdudStC"
151 | },
152 | "execution_count": null,
153 | "outputs": []
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "source": [
158 |         "The classes are balanced, so we do not need to worry about imbalance. Next, let's print some reviews to get a sense of the content."
159 | ],
160 | "metadata": {
161 | "id": "4uycrR1jeHBr"
162 | }
163 | },
164 | {
165 | "cell_type": "code",
166 | "source": [
167 | "def print_reviews_and_sentiment(d, start_index=42, num=3, width=80):\n",
168 | " wrapper = textwrap.TextWrapper(width=width, break_long_words=False, break_on_hyphens=False)\n",
169 | " for i in range(start_index, start_index + num):\n",
170 | " print(wrapper.fill(str(d.loc[i][\"review\"])))\n",
171 | " print('------------')\n",
172 | " print(f'Sentiment: {d.loc[i][\"sentiment\"]}\\n')"
173 | ],
174 | "metadata": {
175 | "id": "NVoLU81BcQtK"
176 | },
177 | "execution_count": null,
178 | "outputs": []
179 | },
180 | {
181 | "cell_type": "code",
182 | "source": [
183 | "print_reviews_and_sentiment(df, start_index=42, num=2)"
184 | ],
185 | "metadata": {
186 | "id": "cyhpu6ycjSNh"
187 | },
188 | "execution_count": null,
189 | "outputs": []
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "source": [
194 |         "Change the value of idx to vary the amount of training and test data. The default value is 25000, which gives a 50/50 split."
195 | ],
196 | "metadata": {
197 | "id": "yN0XyTfcggrf"
198 | }
199 | },
200 | {
201 | "cell_type": "code",
202 | "source": [
203 | "def preprocessor(text):\n",
204 |         "    text = re.sub(r'<[^>]*>', '', text)\n",
205 |         "    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text)\n",
206 |         "    text = (re.sub(r'[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))\n",
207 | " return text"
208 | ],
209 | "metadata": {
210 | "id": "YSqs-9TYhKt6"
211 | },
212 | "execution_count": null,
213 | "outputs": []
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "source": [
218 | "Via the first regex, <[^>]*>, in the preceding code section, we tried to remove all of the HTML markup from the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. Since we are only interested in removing HTML markup and do not plan to use the HTML markup further, using regex to do the job should be acceptable. However, if you prefer to use sophisticated tools for removing HTML markup from text, you can take a look at Python’s HTML parser module, which is described at https://docs.python.org/3/library/html.parser.html. After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons. Next, we removed all non-word characters from the text via the regex [\\W]+ and converted the text into lowercase characters."
219 | ],
220 | "metadata": {
221 | "id": "UhLHT8pu5uWY"
222 | }
223 | },
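    |     {
    |       "cell_type": "markdown",
    |       "source": [
    |         "As a quick check of all three steps on an invented example (not from the dataset): the HTML tag disappears, the text is lowercased with punctuation removed, and the emoticon is appended at the end without its \"nose\" character:"
    |       ],
    |       "metadata": {}
    |     },
    |     {
    |       "cell_type": "code",
    |       "source": [
    |         "preprocessor('<br />This movie was GREAT :-) I would watch it again!')"
    |       ],
    |       "metadata": {},
    |       "execution_count": null,
    |       "outputs": []
    |     },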
224 | {
225 | "cell_type": "code",
226 | "source": [
227 | "df['review'] = df['review'].apply(preprocessor)"
228 | ],
229 | "metadata": {
230 | "id": "hDEbzfOahQOv"
231 | },
232 | "execution_count": null,
233 | "outputs": []
234 | },
235 | {
236 | "cell_type": "code",
237 | "source": [
238 | "print_reviews_and_sentiment(df, start_index=42, num=2)"
239 | ],
240 | "metadata": {
241 | "id": "OI-4WUWUimJw"
242 | },
243 | "execution_count": null,
244 | "outputs": []
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "source": [
249 | "Create a train-test split:"
250 | ],
251 | "metadata": {
252 | "id": "GZwH1OQLkBKB"
253 | }
254 | },
255 | {
256 | "cell_type": "code",
257 | "source": [
258 | "idx = 25000\n",
259 | "X_train = df.loc[:idx - 1, 'review'].values\n",
260 | "y_train = df.loc[:idx - 1, 'sentiment'].values\n",
261 | "X_test = df.loc[idx:, 'review'].values\n",
262 | "y_test = df.loc[idx:, 'sentiment'].values"
263 | ],
264 | "metadata": {
265 | "id": "kOOBt1t4ccFx"
266 | },
267 | "execution_count": null,
268 | "outputs": []
269 | },
270 | {
271 | "cell_type": "markdown",
272 | "source": [
273 |         "To get started, let's use raw word counts as the features:"
274 | ],
275 | "metadata": {
276 | "id": "J97KRI7pmnbS"
277 | }
278 | },
279 | {
280 | "cell_type": "code",
281 | "source": [
282 | "tfidf = TfidfVectorizer(use_idf=False, norm=None, smooth_idf=False)\n",
283 | "word_counts = tfidf.fit_transform(X_train)"
284 | ],
285 | "metadata": {
286 | "id": "yhgfDr2OpreS"
287 | },
288 | "execution_count": null,
289 | "outputs": []
290 | },
291 | {
292 | "cell_type": "code",
293 | "source": [
294 | "type(word_counts)"
295 | ],
296 | "metadata": {
297 | "id": "Fxaw2kjoq_YC"
298 | },
299 | "execution_count": null,
300 | "outputs": []
301 | },
302 | {
303 | "cell_type": "code",
304 | "source": [
305 | "word_counts.shape"
306 | ],
307 | "metadata": {
308 | "id": "5H1exZeIqP2K"
309 | },
310 | "execution_count": null,
311 | "outputs": []
312 | },
313 | {
314 | "cell_type": "code",
315 | "source": [
316 | "list(tfidf.vocabulary_.items())[:10]"
317 | ],
318 | "metadata": {
319 | "id": "QJt0JBNZnzgN"
320 | },
321 | "execution_count": null,
322 | "outputs": []
323 | },
324 | {
325 | "cell_type": "code",
326 | "source": [
327 | "print(df.loc[1][\"review\"])"
328 | ],
329 | "metadata": {
330 | "id": "xUqo7SwsseM3"
331 | },
332 | "execution_count": null,
333 | "outputs": []
334 | },
335 | {
336 | "cell_type": "code",
337 | "source": [
338 | "print(word_counts[1,:])"
339 | ],
340 | "metadata": {
341 | "id": "ljeEWKf4sOYA"
342 | },
343 | "execution_count": null,
344 | "outputs": []
345 | },
346 | {
347 | "cell_type": "code",
348 | "source": [
349 | "tfidf.vocabulary_[\"window\"]"
350 | ],
351 | "metadata": {
352 | "id": "XJctczjUpLn5"
353 | },
354 | "execution_count": null,
355 | "outputs": []
356 | },
357 | {
358 | "cell_type": "code",
359 | "source": [
360 | "clf = LogisticRegression(C=1.0, solver='liblinear')\n",
361 | "clf = clf.fit(word_counts, y_train)"
362 | ],
363 | "metadata": {
364 | "id": "oWnjcA5wgz14"
365 | },
366 | "execution_count": null,
367 | "outputs": []
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "source": [
372 | "The accuracy on the test set is:"
373 | ],
374 | "metadata": {
375 | "id": "AB1o6eeOqBna"
376 | }
377 | },
378 | {
379 | "cell_type": "code",
380 | "source": [
381 | "clf.score(tfidf.transform(X_test), y_test)"
382 | ],
383 | "metadata": {
384 | "id": "UUZ9youasypj"
385 | },
386 | "execution_count": null,
387 | "outputs": []
388 | },
389 | {
390 | "cell_type": "markdown",
391 | "source": [
392 |         "Notice that the .transform() method was applied to the test set while .fit_transform() was applied to the train set. In this notebook we only worked with unnormalized word counts. We did nothing with stop words, stemming, inverse document frequency weighting, n-grams, etc. The full solution in the next notebook uses a Pipeline to try out various combinations of these choices to find the best one."
393 | ],
394 | "metadata": {
395 | "id": "fnBoPcDfqPpf"
396 | }
397 |     },
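    |     {
    |       "cell_type": "markdown",
    |       "source": [
    |         "As a minimal illustration of that point (toy sentences, not the movie data): fit_transform learns the vocabulary, while transform only reuses it, so words never seen during fitting are silently ignored:"
    |       ],
    |       "metadata": {}
    |     },
    |     {
    |       "cell_type": "code",
    |       "source": [
    |         "demo = TfidfVectorizer(use_idf=False, norm=None)\n",
    |         "demo.fit_transform(['the cat sat', 'the dog sat'])\n",
    |         "print(demo.vocabulary_)\n",
    |         "print(demo.transform(['the cat and the dog']).toarray())  # 'and' was never seen, so it is dropped"
    |       ],
    |       "metadata": {},
    |       "execution_count": null,
    |       "outputs": []
    |     }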
398 | ]
399 | }
--------------------------------------------------------------------------------
/past_hackathons/quarterback_performance_hackathon/NFL_QB_Data.csv:
--------------------------------------------------------------------------------
1 | Year,Round,Pick,Team,Player,Pos,Age,LY,AllProYrs,ProBowls,YrsStrtr,wAV,DrAV,G,PaCmp,PaAtt,PaYds,PaTD,PaInt,RuAtt,RuYds,RuTD,Rec,ReYds,ReTD,Solo,Int,Sk,College/Univ
2 | 2011,1,1,CAR,Cam Newton,QB,22,2021,1,3,9,115,107,148,2682,4474,32382,194,123,1118,5628,75,3,68,1,,,,Auburn
3 | 2012,1,1,IND,Andrew Luck,QB,22,2018,0,4,5,72,72,86,2000,3290,23671,171,83,332,1590,14,1,4,0,,,,Stanford
4 | 2015,1,1,TAM,Jameis Winston,QB,21,2022,0,1,5,59,54,86,1738,2835,21840,139,96,293,1220,11,0,0,0,,,,Florida St.
5 | 2016,1,1,LAR,Jared Goff,QB,21,2022,0,2,5,71,48,100,2250,3502,25854,155,70,209,474,10,1,5,0,2,,,California
6 | 2018,1,1,CLE,Baker Mayfield,QB,23,2022,0,0,4,42,42,72,1386,2259,16288,102,64,189,660,6,1,17,0,1,,,Oklahoma
7 | 2019,1,1,ARI,Kyler Murray,QB,22,2022,0,2,3,51,51,57,1316,1971,13848,84,41,381,2204,23,0,7,0,,,,Oklahoma
8 | 2020,1,1,CIN,Joe Burrow,QB,23,2022,0,1,2,38,38,42,1044,1530,11774,82,31,152,517,10,0,0,0,1,,,LSU
9 | 2021,1,1,JAX,Trevor Lawrence,QB,21,2022,0,0,1,21,21,34,746,1186,7754,37,25,135,625,7,0,0,0,4,,,Clemson
10 | 2012,1,2,WAS,Robert Griffin III,QB,22,2020,0,1,3,36,31,56,799,1268,9271,43,30,307,1809,10,0,0,0,,,,Baylor
11 | 2015,1,2,TEN,Marcus Mariota,QB,21,2022,0,0,4,54,43,87,1312,2095,15656,92,54,349,2012,17,2,62,1,1,,,Oregon
12 | 2016,1,2,PHI,Carson Wentz,QB,23,2022,0,1,6,59,43,93,2056,3284,22129,151,66,337,1362,10,2,11,0,1,,,North Dakota St.
13 | 2017,1,2,CHI,Mitchell Trubisky,QB,23,2022,0,1,4,36,33,64,1133,1765,11904,68,43,222,1119,11,0,0,0,,,,North Carolina
14 | 2021,1,2,NYJ,Zach Wilson,QB,22,2022,0,0,1,9,9,22,345,625,4022,15,18,57,287,5,1,2,1,1,,,BYU
15 | 2014,1,3,JAX,Blake Bortles,QB,22,2019,0,0,5,44,44,78,1562,2634,17649,103,75,283,1766,8,1,20,1,,,,Central Florida
16 | 2018,1,3,NYJ,Sam Darnold,QB,21,2022,0,0,4,25,16,56,1054,1765,11767,61,55,188,745,12,0,0,0,,,,USC
17 | 2021,1,3,SFO,Trey Lance,QB,21,2022,0,0,0,4,4,8,56,102,797,5,3,54,235,1,0,0,0,1,,,North Dakota St.
18 | 2020,1,5,MIA,Tua Tagovailoa,QB,22,2022,0,0,2,23,23,36,708,1078,8015,52,23,101,307,6,0,0,0,,,,Alabama
19 | 2019,1,6,NYG,Daniel Jones,QB,22,2022,0,0,3,38,38,54,1113,1740,11603,60,34,292,1708,12,1,16,0,,,,Duke
20 | 2020,1,6,LAC,Justin Herbert,QB,22,2022,0,1,2,43,43,49,1316,1966,14089,94,35,172,683,8,2,-10,0,2,,,Oregon
21 | 2018,1,7,BUF,Josh Allen,QB,22,2022,0,3,4,68,68,77,1604,2566,18397,138,60,546,3087,38,1,12,1,4,,,Wyoming
22 | 2011,1,8,TEN,Jake Locker,QB,23,2014,0,0,1,15,15,30,408,709,4967,27,22,95,644,5,0,0,0,,,,Washington
23 | 2012,1,8,MIA,Ryan Tannehill,QB,24,2022,0,1,9,88,47,145,2914,4534,33265,212,108,423,2029,26,4,8,1,3,,,Texas A&M
24 | 2011,1,10,JAX,Blaine Gabbert,QB,21,2022,0,0,2,16,7,67,864,1533,9302,51,47,194,640,3,1,-16,0,,,,Missouri
25 | 2017,1,10,KAN,Patrick Mahomes,QB,21,2022,1,5,4,85,85,80,1985,2993,24241,192,49,299,1547,12,1,6,0,2,,,Texas Tech
26 | 2018,1,10,ARI,Josh Rosen,QB,21,2021,0,0,1,3,2,24,277,513,2864,12,21,26,151,0,0,0,0,,,,UCLA
27 | 2021,1,11,CHI,Justin Fields,QB,22,2022,0,0,0,22,22,27,351,588,4112,24,21,232,1563,10,0,0,0,,,,Ohio St.
28 | 2011,1,12,MIN,Christian Ponder,QB,23,2014,0,0,3,22,22,38,632,1057,6658,38,36,126,639,7,1,-15,0,1,,,Florida St.
29 | 2017,1,12,HOU,Deshaun Watson,QB,21,2022,0,3,3,55,52,60,1285,1918,15641,111,41,343,1852,18,1,6,1,1,,,Clemson
30 | 2019,1,15,WAS,Dwayne Haskins,QB,22,2020,0,0,0,4,4,16,267,444,2804,12,14,40,147,1,0,0,0,,,,Ohio St.
31 | 2021,1,15,NWE,Mac Jones,QB,23,2022,0,1,1,22,22,31,640,963,6798,36,24,91,231,1,0,0,0,1,,,Alabama
32 | 2013,1,16,BUF,EJ Manuel,QB,23,2017,0,0,1,10,10,30,343,590,3767,20,16,96,339,4,0,0,0,,,,Florida St.
33 | 2012,1,22,CLE,Brandon Weeden,QB,28,2018,0,0,1,13,10,35,559,965,6462,31,30,62,200,1,0,-9,0,,,,Oklahoma St.
34 | 2014,1,22,CLE,Johnny Manziel,QB,21,2015,0,0,0,4,4,14,147,258,1675,7,7,46,259,1,0,0,0,,,,Texas A&M
35 | 2016,1,26,DEN,Paxton Lynch,QB,22,2017,0,0,0,2,2,5,79,128,792,4,4,16,55,0,0,0,0,,,,Memphis
36 | 2020,1,26,GNB,Jordan Love,QB,21,2022,0,0,0,3,3,10,50,83,606,3,3,13,26,0,0,0,0,,,,Utah St.
37 | 2014,1,32,MIN,Teddy Bridgewater,QB,21,2022,0,1,4,49,22,78,1372,2067,15120,75,47,219,846,11,0,0,0,,,,Louisville
38 | 2018,1,32,BAL,Lamar Jackson,QB,21,2022,1,2,3,69,69,70,1055,1655,12209,101,38,727,4437,24,0,0,0,1,,,Louisville
39 | 2011,2,35,CIN,Andy Dalton,QB,23,2022,0,3,10,90,82,166,3374,5396,38150,244,144,468,1465,22,3,11,1,1,,,TCU
40 | 2011,2,36,SFO,Colin Kaepernick,QB,23,2016,0,0,4,45,45,69,1011,1692,12271,72,30,375,2300,13,0,0,0,,,,Nevada
41 | 2014,2,36,OAK,Derek Carr,QB,23,2022,0,3,8,82,82,142,3201,4958,35222,217,99,278,845,6,1,-9,0,,,,Fresno St.
42 | 2013,2,39,NYJ,Geno Smith,QB,22,2022,0,1,2,31,14,62,991,1578,11199,64,48,226,1067,9,1,13,0,1,,,West Virginia
43 | 2019,2,42,DEN,Drew Lock,QB,22,2021,0,0,1,12,12,24,421,710,4740,25,20,72,285,5,1,1,0,1,,,Missouri
44 | 2016,2,51,NYJ,Christian Hackenberg,QB,21,,0,0,0,,,,,,,,,,,,,,,,,,Penn St.
45 | 2017,2,52,CLE,DeShone Kizer,QB,21,2018,0,0,1,6,5,18,275,518,3081,11,24,82,458,5,0,0,0,,,,Notre Dame
46 | 2020,2,53,PHI,Jalen Hurts,QB,22,2022,0,1,1,40,40,45,648,1040,7906,44,19,367,1898,26,1,3,0,2,,,Oklahoma
47 | 2012,2,57,DEN,Brock Osweiler,QB,21,2018,0,0,1,14,7,49,697,1165,7418,37,31,92,266,4,1,-14,0,,,,Arizona St.
48 | 2014,2,62,NWE,Jimmy Garoppolo,QB,22,2022,0,0,2,46,2,74,1167,1726,14289,87,42,165,225,7,2,-3,0,,,,East. Illinois
49 | 2021,2,64,TAM,Kyle Trask,QB,23,2022,0,0,0,0,0,1,3,9,23,0,0,0,0,0,0,0,0,,,,Florida
50 | 2021,3,66,MIN,Kellen Mond,QB,22,2021,0,0,0,0,0,1,2,3,5,0,0,0,0,0,0,0,0,,,,Texas A&M
51 | 2021,3,67,HOU,Davis Mills,QB,22,2022,0,0,1,14,14,28,555,873,5782,33,25,50,152,2,0,0,0,2,,,Stanford
52 | 2013,3,73,TAM,Mike Glennon,QB,23,2021,0,0,1,9,11,40,689,1147,7025,47,35,56,140,1,0,0,0,1,,,North Carolina St.
53 | 2011,3,74,NWE,Ryan Mallett,QB,23,2017,0,0,0,4,0,21,190,345,1835,9,10,28,-5,1,0,0,0,,,,Arkansas
54 | 2012,3,75,SEA,Russell Wilson,QB,23,2022,0,9,10,130,125,173,3371,5218,40583,308,98,901,4966,26,5,21,1,,,,Wisconsin
55 | 2015,3,75,NOR,Garrett Grayson,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Colorado St.
56 | 2018,3,76,PIT,Mason Rudolph,QB,23,2021,0,0,1,5,5,17,236,384,2366,16,11,33,89,0,0,0,0,,,,Oklahoma St.
57 | 2017,3,87,NYG,Davis Webb,QB,22,2022,0,0,0,1,1,2,23,40,168,1,0,8,38,1,0,0,0,,,,California
58 | 2012,3,88,PHI,Nick Foles,QB,23,2022,0,1,3,32,24,71,1302,2087,14227,82,47,151,407,6,1,10,0,,,,Arizona
59 | 2015,3,89,STL,Sean Mannion,QB,23,2021,0,0,0,2,1,14,67,110,573,1,3,25,-3,0,0,0,0,,,,Oregon St.
60 | 2016,3,91,NWE,Jacoby Brissett,QB,23,2022,0,0,2,33,2,76,963,1577,10350,48,23,227,896,15,1,2,0,,,,North Carolina St.
61 | 2016,3,93,CLE,Cody Kessler,QB,23,2018,0,0,1,5,3,17,224,349,2215,8,5,31,140,0,0,0,0,,,,USC
62 | 2013,4,98,PHI,Matt Barkley,QB,22,2020,0,0,1,5,1,19,212,363,2699,11,22,23,-12,0,1,2,1,,,,USC
63 | 2016,4,100,OAK,Connor Cook,QB,23,2016,0,0,0,0,0,1,14,21,150,1,1,0,0,0,0,0,0,,,,Michigan St.
64 | 2019,3,100,CAR,Will Grier,QB,24,2019,0,0,0,1,1,2,28,52,228,0,4,7,22,0,0,0,0,,,,West Virginia
65 | 2012,4,102,WAS,Kirk Cousins,QB,24,2022,0,4,7,90,35,142,3249,4866,37140,252,105,290,933,19,1,-1,0,1,,,Michigan St.
66 | 2015,4,103,NYJ,Bryce Petty,QB,24,2017,0,0,0,4,4,10,130,245,1353,4,10,12,74,0,0,0,0,,,,Baylor
67 | 2017,3,104,SFO,C.J. Beathard,QB,23,2022,0,0,0,8,8,25,300,510,3537,18,14,56,231,4,0,0,0,,,,Iowa
68 | 2019,4,104,CIN,Ryan Finley,QB,24,2020,0,0,0,2,2,8,58,119,638,3,4,21,143,1,0,0,0,,,,North Carolina St.
69 | 2018,4,108,NYG,Kyle Lauletta,QB,23,2018,0,0,0,0,0,2,0,5,0,0,1,1,-2,0,0,0,0,,,,Richmond
70 | 2013,4,110,NYG,Ryan Nassib,QB,23,2015,0,0,0,0,0,5,9,10,128,1,0,2,-3,0,0,0,0,,,,Syracuse
71 | 2013,4,112,OAK,Tyler Wilson,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Arkansas
72 | 2013,4,115,PIT,Landry Jones,QB,24,2017,0,0,0,4,4,18,108,169,1310,8,7,19,-19,0,0,0,0,,,,Oklahoma
73 | 2014,4,120,ARI,Logan Thomas,QB,23,2022,0,0,1,11,0,78,3,11,124,1,0,3,5,0,164,1506,12,10,,,Virginia Tech
74 | 2020,4,122,IND,Jacob Eason,QB,22,2022,0,0,0,0,0,2,5,10,84,0,2,0,0,0,0,0,0,,,,Washington
75 | 2020,4,125,NYJ,James Morgan,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Florida International
76 | 2019,4,133,NWE,Jarrett Stidham,QB,23,2022,0,0,0,4,1,13,77,131,926,6,7,23,89,0,0,0,0,2,,,Auburn
77 | 2021,4,133,NOR,Ian Book,QB,23,2021,0,0,0,0,0,1,12,20,135,0,2,3,6,0,0,0,0,,,,Notre Dame
78 | 2011,5,135,KAN,Ricky Stanzi,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Iowa
79 | 2014,4,135,HOU,Tom Savage,QB,24,2017,0,0,1,3,3,13,181,315,2000,5,7,16,8,0,0,0,0,,,,Pittsburgh
80 | 2016,4,135,DAL,Dak Prescott,QB,23,2022,0,3,5,77,77,97,2185,3283,24943,166,65,352,1642,26,1,11,1,,,,Mississippi St.
81 | 2017,4,135,PIT,Joshua Dobbs,QB,22,2022,0,0,0,1,0,8,50,85,456,2,3,14,75,0,0,0,0,,,,Tennessee
82 | 2016,4,139,BUF,Cardale Jones,QB,23,2016,0,0,0,0,0,1,6,11,96,0,1,1,-1,0,0,0,0,,,,Ohio St.
83 | 2015,5,147,GNB,Brett Hundley,QB,22,2019,0,0,1,5,5,18,199,337,1902,9,13,46,309,2,1,10,0,,,,UCLA
84 | 2011,5,152,HOU,T.J. Yates,QB,24,2017,0,0,0,6,6,22,179,324,2057,10,11,28,107,1,0,0,0,,,,North Carolina
85 | 2011,5,160,CHI,Nathan Enderle,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Idaho
86 | 2016,5,162,KAN,Kevin Hogan,QB,23,2021,0,0,0,2,,9,60,101,621,4,7,18,176,1,0,0,0,,,,Stanford
87 | 2014,5,163,KAN,Aaron Murray,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Georgia
88 | 2014,5,164,CIN,A.J. McCarron,QB,23,2020,0,0,0,4,3,17,109,174,1173,6,3,22,68,1,0,0,0,,,,Alabama
89 | 2019,5,166,LAC,Easton Stick,QB,23,2020,0,0,0,0,0,1,1,1,4,0,0,1,-2,0,0,0,0,,,,North Dakota St.
90 | 2019,5,167,PHI,Clayton Thorson,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Northwestern
91 | 2020,5,167,BUF,Jake Fromm,QB,22,2021,0,0,0,1,,3,27,60,210,1,3,8,65,0,0,0,0,,,,Georgia
92 | 2017,5,171,BUF,Nathan Peterman,QB,23,2022,0,0,0,3,2,13,85,160,712,4,13,22,91,1,0,0,0,,,,Pittsburgh
93 | 2018,5,171,DAL,Mike White,QB,23,2022,0,0,0,5,,8,191,307,2145,8,12,11,8,1,0,0,0,,,,Western Kentucky
94 | 2014,6,178,TEN,Zach Mettenberger,QB,23,2015,0,0,1,1,1,14,208,345,2347,12,14,14,12,1,0,0,0,,,,LSU
95 | 2019,6,178,JAX,Gardner Minshew II,QB,23,2022,0,0,2,20,17,32,586,933,6632,44,15,112,521,2,1,0,0,1,,,Washington St.
96 | 2011,6,180,BAL,Tyrod Taylor,QB,22,2022,0,1,3,45,1,81,952,1550,10794,60,26,366,2071,19,2,10,0,,,,Virginia Tech
97 | 2014,6,183,CHI,David Fales,QB,23,2019,0,0,0,1,0,5,31,48,287,1,1,5,8,1,0,0,0,,,,San Jose St.
98 | 2012,6,185,ARI,Ryan Lindley,QB,23,2015,0,0,0,-4,-4,10,140,274,1372,3,11,4,7,0,0,0,0,,,,San Diego St.
99 | 2016,6,187,WAS,Nate Sudfeld,QB,22,2022,0,0,0,1,,6,25,37,188,1,1,10,28,0,0,0,0,,,,Indiana
100 | 2020,6,189,JAX,Jake Luton,QB,24,2020,0,0,0,2,2,3,60,110,624,2,6,1,13,1,0,0,0,,,,Oregon St.
101 | 2016,6,191,DET,Jake Rudock,QB,23,2017,0,0,0,0,0,3,3,5,24,0,1,0,0,0,0,0,0,,,,Michigan
102 | 2014,6,194,BAL,Keith Wenning,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Ball St.
103 | 2019,6,197,BAL,Trace McSorley,QB,24,2022,0,0,0,1,0,9,48,93,502,1,5,21,79,0,0,0,0,,,,Penn St.
104 | 2018,6,199,TEN,Luke Falk,QB,23,2019,0,0,0,1,,3,47,73,416,0,3,0,0,0,0,0,0,,,,Washington St.
105 | 2016,6,201,JAX,Brandon Allen,QB,24,2022,0,0,0,4,,15,149,263,1611,10,6,33,64,0,0,0,0,,,,Arkansas
106 | 2018,6,203,JAX,Tanner Lee,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Nebraska
107 | 2016,6,207,SFO,Jeff Driskel,QB,23,2022,0,0,0,7,,23,216,365,2228,14,8,73,384,3,2,10,0,,,,Louisiana Tech
108 | 2011,7,208,NYJ,Greg McElroy,QB,23,2012,0,0,0,1,1,2,19,31,214,1,1,8,30,0,0,0,0,,,,Alabama
109 | 2014,6,213,NYJ,Tajh Boyd,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Clemson
110 | 2014,6,214,STL,Garrett Gilbert,QB,23,2021,0,0,0,2,,8,43,75,477,1,1,6,25,0,0,0,0,,,,SMU
111 | 2017,6,215,DET,Brad Kaaya,QB,22,,0,0,0,,,,,,,,,,,,,,,,,,Miami (FL)
112 | 2021,6,218,IND,Sam Ehlinger,QB,22,2022,0,0,0,2,2,7,64,101,573,3,3,20,96,0,0,0,0,,,,Texas
113 | 2018,7,219,NWE,Danny Etling,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,LSU
114 | 2018,7,220,SEA,Alex McGough,QB,22,,0,0,0,,,,,,,,,,,,,,,,,,Florida International
115 | 2013,7,221,SDG,Brad Sorensen,QB,25,,0,0,0,,,,,,,,,,,,,,,,,,Southern Utah
116 | 2016,7,223,MIA,Brandon Doughty,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Western Kentucky
117 | 2020,7,224,TEN,Cole McDonald,QB,22,,0,0,0,,,,,,,,,,,,,,,,,,Hawaii
118 | 2020,7,231,DAL,Ben DiNucci,QB,23,2020,0,0,0,1,1,3,23,43,219,0,0,6,22,0,0,0,0,,,,James Madison
119 | 2013,7,234,DEN,Zac Dysert,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Miami (OH)
120 | 2013,7,237,SFO,B.J. Daniels,QB,24,2015,0,0,0,0,,8,1,2,7,0,0,6,6,0,2,18,0,1,,,South Florida
121 | 2020,7,240,NOR,Tommy Stevens,QB,23,2020,0,0,0,0,,1,0,0,0,0,0,4,24,0,0,0,0,,,,Mississippi St.
122 | 2012,7,243,GNB,B.J. Coleman,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Chattanooga
123 | 2020,7,244,MIN,Nate Stanley,QB,23,,0,0,0,,,,,,,,,,,,,,,,,,Iowa
124 | 2013,7,249,ATL,Sean Renfree,QB,23,2015,0,0,0,0,0,2,3,7,11,0,1,1,-4,0,0,0,0,,,,Duke
125 | 2018,7,249,CIN,Logan Woodside,QB,23,2022,0,0,0,0,,12,1,3,7,0,0,13,4,0,0,0,0,,,,Toledo
126 | 2015,7,250,DEN,Trevor Siemian,QB,23,2022,0,0,2,17,13,35,621,1055,7027,42,28,73,211,2,0,0,0,,,,Northwestern
127 | 2012,7,253,IND,Chandler Harnish,QB,24,,0,0,0,,,,,,,,,,,,,,,,,,Northern Illinois
128 | 2017,7,253,DEN,Chad Kelly,QB,23,2018,0,0,0,0,0,1,0,0,0,0,0,1,-1,0,0,0,0,,,,Mississippi
129 |
--------------------------------------------------------------------------------