├── starter-kit-plan.png
├── SECURITY.md
├── README.md
├── notebooks
│   ├── lesson_1.ipynb
│   ├── lesson_4.ipynb
│   ├── lesson_5.ipynb
│   └── lesson_3.ipynb
└── LICENSE
/starter-kit-plan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Nebius-Academy/ML-Starter-Pack/HEAD/starter-kit-plan.png
--------------------------------------------------------------------------------
/SECURITY.md:
--------------------------------------------------------------------------------
1 | # Reporting Security Issues
2 |
3 | The Nebius team takes security bugs seriously. We appreciate your efforts to responsibly disclose your findings, and will make every effort to acknowledge your contributions.
4 |
5 | To report a security issue, please use the GitHub Security Advisory ["Report a Vulnerability"](https://github.com/Nebius-Academy/ML-Starter-Pack/security/advisories/new) tab.
6 |
7 | The Nebius team will send a response indicating the next steps in handling your report. After the initial reply to your report, the Nebius team will keep you informed of the progress towards a fix and full announcement, and may ask for additional information or guidance.
8 |
9 | ## Learning More About Security in Nebius
10 |
11 | To learn more about security in Nebius, please see [this page](https://nebius.ai/docs/security).
12 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ML Starter Pack
2 |
3 | 
4 |
5 | ## Description
6 | If you always wanted to learn how AI works, this is the course for you. Our course will guide you through the basics of machine learning and show you ways it can be applied in real-world scenarios. No prior machine learning knowledge is required, but you’ll need a solid grasp of Python programming.
7 |
8 | We’ll start with simple examples and provide step-by-step explanations on training machine learning models to make predictions. Then we’ll introduce you to neural networks, where you’ll learn about more advanced models that drive recent AI advancements. Additionally, we’ll cover models like ChatGPT, showing you how they are trained and how to build applications using them.
9 |
10 | By the end of the course, you'll have practical skills and a solid foundation in machine learning, ready to create your own projects.
11 |
12 | ## Course Materials
13 |
14 | In this course, we’ll guide you through five clear and straightforward notebooks that introduce you to cutting-edge concepts. Apart from the introductory one, each notebook comes in two versions:
15 |
16 | - A basic version with several simple coding tasks that will help you get a grasp of the material.
17 | - A version with solutions, in which all the tasks are solved, so you can consult it if you get stuck or want to check your work.
18 |
19 | ## Key Topics
20 |
21 | - Basic machine learning principles
22 | - Linear classifier
23 | - Transformers for text and images
24 | - ChatGPT architecture
25 | - CLIP model architecture
26 | - Hugging Face libraries: datasets, transformers
27 | - OpenAI API, LangChain
28 |
29 | ## Synopsis
30 |
31 | ### Introduction to Machine Learning
32 |
33 | [Notebook](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_1.ipynb)
34 |
35 | - General principles of machine learning
36 | - Overview of different areas within machine learning
37 | - Introduction to text classification using a linear classifier
38 |
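To make the linear-classifier idea concrete, here is a toy "is this sci-fi?" scorer in plain Python. This is only an illustrative sketch: the word list and scores are invented here and are not part of the lesson, which is theory-only.

```python
# A toy linear text classifier in the spirit of Lesson 1:
# every dictionary word has a score, and the sign of the total decides the class.
# The words and scores below are made up purely for illustration.
word_scores = {"starship": 0.5, "cosmic": 0.3, "wizard": -0.7, "dragon": -0.6}

def is_sci_fi(text: str) -> bool:
    # Sum the scores of the dictionary words that appear in the text.
    score = sum(w for word, w in word_scores.items() if word in text.lower())
    return score > 0

print(is_sci_fi("The starship drifted through cosmic dust."))  # True
print(is_sci_fi("The wizard summoned a dragon."))              # False
```
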
39 | ### Hands-on Linear Models
40 |
41 | [Notebook](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_2.ipynb)
42 |
43 | [Notebook with solutions](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_2_with_solutions.ipynb)
44 |
45 | - Using Hugging Face datasets
46 | - Bag-of-words model and text vectorization with scikit-learn
47 | - Math behind a multiclass linear classifier
48 | - Gradient descent for parameter optimization
49 | - Training a linear model with scikit-learn
50 |
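As a taste of what this notebook builds toward, here is a minimal sketch of a bag-of-words linear classifier with scikit-learn. The toy texts and labels are invented for illustration and are not the course dataset:

```python
# Bag-of-words vectorization + a linear classifier, sketched with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "the starship left orbit",
    "the wizard cast a spell",
    "androids patrolled the colony",
    "a dragon guarded the castle",
]
labels = [1, 0, 1, 0]  # 1 = sci-fi, 0 = not sci-fi

vectorizer = CountVectorizer()             # turns texts into word-count features
X = vectorizer.fit_transform(texts)        # sparse matrix: one row per text
clf = LogisticRegression().fit(X, labels)  # linear model fitted by an iterative optimizer

print(clf.predict(vectorizer.transform(["the android flew the starship"])))  # likely [1]
```
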
51 | ### Transformers
52 |
53 | [Notebook](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_3.ipynb)
54 |
55 | [Notebook with solutions](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_3_with_solutions.ipynb)
56 |
57 | - Basics of neural networks
58 | - Text tokenization and embeddings
59 | - Attention mechanism in neural networks
60 | - Transformers architecture
61 | - Stochastic gradient descent for neural network training
62 | - Training and fine-tuning models with Hugging Face transformers
63 |
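For a flavor of the Hugging Face tooling used here, below is a minimal sketch that runs a pretrained text-classification transformer through the `pipeline` API (the notebook itself goes further and fine-tunes a model). The checkpoint name is simply a commonly used public model, not one prescribed by the course:

```python
# Running a pretrained transformer via the Hugging Face `transformers` pipeline.
from transformers import pipeline

# Any compatible text-classification checkpoint would work here.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("This course makes transformers feel approachable."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```
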
64 | ### ChatGPT
65 |
66 | [Notebook](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_4.ipynb)
67 |
68 | [Notebook with solutions](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_4_with_solutions.ipynb)
69 |
70 | - Pre-training the base GPT model using self-supervised learning
71 | - ChatGPT fine-tuning and alignment training
72 | - In-context learning with ChatGPT
73 | - Using the OpenAI API for text classification
74 | - Building applications with large language models using LangChain
75 |
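To illustrate the kind of API usage covered here, the sketch below classifies a text with a chat model through the OpenAI Python client. It assumes a valid `OPENAI_API_KEY` environment variable, and the model name is only an example, not the one used in the notebook:

```python
# Zero-shot text classification through the OpenAI chat completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_genre(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; substitute any chat model you can access
        messages=[
            {"role": "system", "content": "Reply with one word: sci-fi or not-sci-fi."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(classify_genre("The starship Orion cruised through swirling clouds of cosmic dust."))
```
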
76 | ### Computer Vision
77 |
78 | [Notebook](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_5.ipynb)
79 |
80 | [Notebook with solutions](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_5_with_solutions.ipynb)
81 |
82 | - Basics of image classification
83 | - Vision transformer architecture
84 | - Principles of transfer learning
85 | - Training and fine-tuning vision transformers
86 | - CLIP model architecture
87 | - Zero-shot image classification with CLIP
88 |
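As a small preview of the zero-shot idea, here is a sketch of CLIP-based zero-shot image classification using the Hugging Face pipeline. The checkpoint, sample image URL (a photo commonly used in Hugging Face docs), and candidate labels are placeholders for illustration:

```python
# Zero-shot image classification with CLIP via the `transformers` pipeline.
import requests
from PIL import Image
from transformers import pipeline

clip = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image of cats
image = Image.open(requests.get(url, stream=True).raw)

# CLIP scores the image against each label; the highest score is its zero-shot guess.
print(clip(image, candidate_labels=["a photo of cats", "a photo of a dog", "a city skyline"]))
```
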
89 | © Nebius BV, 2024
90 |
--------------------------------------------------------------------------------
/notebooks/lesson_1.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["
\n","
\n","\n","\n","Lesson 1: Introduction to Machine Learning\n","\n","\n","\n","\n","A crash-course on every hype of AI technology\n","\n",""],"metadata":{"id":"Bcfx5yMxZays"},"id":"Bcfx5yMxZays"},{"cell_type":"markdown","id":"6345005e","metadata":{"id":"6345005e"},"source":["# 1. Introduction"]},{"cell_type":"markdown","id":"b3eafcd5","metadata":{"id":"b3eafcd5"},"source":["## About This Course"]},{"cell_type":"markdown","source":["This course is about **Machine Learning** – often just called “ML” for short. But before we demystify ML, let’s start by talking about _what it can do_. Why learn about it at all? As it turns out, ML is the underlying technology behind many things in our lives: it gives us music recommendations, creates beautiful images and texts (well, it tries), and it’s even being used to help find cures for different diseases ([example](https://deepmind.google/discover/blog/stopping-malaria-in-its-tracks/)).\n","\n","ML is also worth learning about because it is now a critical component for many businesses and, beyond all this – it’s just plain interesting to work with! So, if this all sounds like a promising path for you, you’ve found the right course to help you start down this road."],"metadata":{"id":"kaRjWjJHlKe3"},"id":"kaRjWjJHlKe3"},{"cell_type":"markdown","source":["## Course Overview"],"metadata":{"id":"J0ke9TmT54lL"},"id":"J0ke9TmT54lL"},{"cell_type":"markdown","source":["There are 5 lessons in this course, including this introduction. In each lesson (starting from the next), we’ll guide you through solving some non-trivial problem using ML, all while explaining all the necessary theory. In these lessons you will:\n","\n","* Learn about the **current** **state of ML** and how it’s evolving\n","* Understand what you need to make ML work and the **tasks it can solve**\n","* **Practice implementing ML** systems in code\n","\n","And of course, along the way, we’ll also play with generative models like ChatGPT!\n","\n","Here’s an overview of the lesson contents:"],"metadata":{"id":"v1h2dt5smw5P"},"id":"v1h2dt5smw5P"},{"cell_type":"markdown","source":["
"],"metadata":{"id":"t6ByNXQOokiK"},"id":"t6ByNXQOokiK"},{"cell_type":"markdown","source":["_Disclaimer: While you’ll definitely learn a lot here by writing code and creating models, it probably won’t be enough to score a job. But, if you decide to continue your ML studies, we invite you to our [Practical Generative AI course](https://ai-dt.school/generative-ai/)._"],"metadata":{"id":"Fvzmerqgnax-"},"id":"Fvzmerqgnax-"},{"cell_type":"markdown","source":["## What to Expect"],"metadata":{"id":"0qQyIgks574F"},"id":"0qQyIgks574F"},{"cell_type":"markdown","id":"ba81f49e","metadata":{"id":"ba81f49e"},"source":["The lessons are made up of Jupyter notebooks with **theory** and **code**. Some of that code will be missing, and you’ll need to fill it in to make the notebook work. Then, after you’ve written your solution, go ahead and look at the notebooks with solutions (we will add links to them in the following lessons). Inside, you’ll find keys for all the missing pieces of code. (You can also just look at these if you get stuck, but we encourage you to really try and solve all the problems on your own first).\n","\n","This course is intended for learners of all different levels, and that means that you’ll sometimes encounter some rather complex **math explanations**. That said, these are not critical for this course, and you can feel free to skip them. In any case, these will also be accompanied by simplified summaries to convey the general idea."]},{"cell_type":"markdown","source":["## This Lesson Overview"],"metadata":{"id":"RJ7PY-Dmoydn"},"id":"RJ7PY-Dmoydn"},{"cell_type":"markdown","id":"3a4528cd","metadata":{"id":"3a4528cd"},"source":["In this lesson we’ll introduce machine learning in two sections:\n","\n","* In section 2, we’ll explain machine learning and give a wide overview of what it can be\n","* In section 3, we’ll take an example problem and look closely at how machine learning can solve it\n","\n","This lesson doesn’t feature any code, just theory – and with that, let’s jump in."]},{"cell_type":"markdown","source":["# 2. What is Machine Learning"],"metadata":{"id":"qqVFOeszo4OD"},"id":"qqVFOeszo4OD"},{"cell_type":"markdown","source":["To kick things off, let’s tackle an essential question."],"metadata":{"id":"BYyygn0vny6U"},"id":"BYyygn0vny6U"},{"cell_type":"markdown","source":["## Why do we need machine learning?"],"metadata":{"id":"CojXED80kPhv"},"id":"CojXED80kPhv"},{"cell_type":"markdown","id":"4146f933","metadata":{"id":"4146f933"},"source":["So, why even bother with machine learning? It all boils down to algorithms, and ML gives us a way of developing algorithms, most often for some task that we want to solve using software, but we **don't know how to directly implement the solution**.\n","\n","For example, let’s say we want to build an app that translates text from one language to another. At present, the exact algorithm to do this remains unknown – as a matter of fact, many linguists have worked on this problem but haven't managed to achieve the near-human level of translation provided by modern ML-based translators (though this research certainly helped with the development of ML-based methods).\n","\n","Now, operating within the ML paradigm, we can **solve this problem indirectly** by developing an algorithm that will create an algorithm _for us_. If that sounds even more complex, consider this metaphor: when building skyscrapers, people simply cannot lift some of the materials by hand, right? So, we build machines to do the physical heavy lifting for us. 
In terms of software, we’re making the machines do the \"informational heavy-lifting”.\n","\n","One crucial component when using machine learning is the **training data**. This is a collection of data related to the task, and the ML algorithm pulls the knowledge for creating the solution algorithm from this data.\n","\n","For example, to revisit the case of machine translation, the training data could be a collection of texts that were previously translated manually by human translators. Now, it’s worth noting that collecting this data is not always the easiest thing in the world, but it turns out that sometimes this is easier than trying to manually implement the algorithm."]},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"buxUcYLPmGNq"},"id":"buxUcYLPmGNq"},{"cell_type":"markdown","source":["So, to sum things up: ML developers design **algorithms that process training data and output new algorithms**. Then, these new algorithms are embedded into software (most often alongside with other algorithms) to solve the task at large.\n","\n","Now that we have a basic understanding of the machine learning paradigm, let's learn what ML models can look like in general. We’re not aiming to give a full overview of ML, we just want to highlight the most popular areas of application and show how diverse the field can be."],"metadata":{"id":"Ktq0oDjcmGY0"},"id":"Ktq0oDjcmGY0"},{"cell_type":"markdown","source":["## Supervised Learning"],"metadata":{"id":"oQcIcNVmoeb2"},"id":"oQcIcNVmoeb2"},{"cell_type":"markdown","source":["First, let's start with **supervised learning** tasks (and machine translation is actually one of these).\n","\n","Supervised learning tasks require an **object** and a **target**. The object can be essentially anything digital, while the target is something connected to this object that we ultimately want to predict.\n","\n","Here are some examples to help make this clear:"],"metadata":{"id":"f5ZONizwohLH"},"id":"f5ZONizwohLH"},{"cell_type":"markdown","id":"5de3e9ca","metadata":{"id":"5de3e9ca"},"source":["\n","
\n",""]},{"cell_type":"markdown","source":["The important characteristic of supervised learning tasks worth highlighting is that for every object there is an expected correct answer (and sometimes more than one, although that is not important), and we want a model that will be able to predict targets as close to this answer as possible.\n","\n","In order to train these kinds of models, we need **labeled** data – for each data point (object) there must be a correct answer (target) provided, also called **ground truth**.\n","\n","While training, the model tries to infer the algorithm that will take objects as inputs and output the proper targets. After that, when put into production, the algorithm will predict targets for new data points.\n","\n","For another example of a supervised learning model, let's consider a self-driving vehicle that has pedestrian detection as one of its sub-systems. Let’s organize this in terms of supervised learning\n","\n","\n","\n","* Each object is an image from the front camera (we will avoid considering other sensors for simplicity’s sake).\n","* The target is a list of coordinates and the sizes of any pedestrians in the image. (Usually represented as a list of boxes.)\n","* The training data is a set of real street photos where boxes have been manually drawn by human labelers around the pedestrians.\n","\n","And using data like this, a machine learning model will be able to detect pedestrians using just raw photos."],"metadata":{"id":"FGew5Lxqosrj"},"id":"FGew5Lxqosrj"},{"cell_type":"markdown","source":["\n","
\n","\n","*Image from [JAAD dataset](https://data.nvision2.eecs.yorku.ca/JAAD_dataset/)*\n",""],"metadata":{"id":"muT6jXJF5DcN"},"id":"muT6jXJF5DcN"},{"cell_type":"markdown","source":["## Generative Tasks"],"metadata":{"id":"EPVxm28ipXCs"},"id":"EPVxm28ipXCs"},{"cell_type":"markdown","source":["That was a more straightforward case, but sometimes our target is more ambiguous. For example, there are many correct answers to this question: \"What do you see in this picture?\" One might answer “a cat”, “a laptop”, “a city” – there are numerous responses that are ultimately correct in their own right."],"metadata":{"id":"U_zQ53zyo5cM"},"id":"U_zQ53zyo5cM"},{"cell_type":"markdown","source":["
\n","\n","
\n",""],"metadata":{"id":"aAQ10CTSpIq6"},"id":"aAQ10CTSpIq6"},{"cell_type":"markdown","source":["And if we ask a more broad question, \"Please describe what you see in the picture\", there could be even more options: \"A cat with a laptop\", \"A cyberpunk cat\", \"A cat with a city in the background\" and so on.\n","\n","In this case, we don't want the model to give some specific correct answer, we just want it to **generate** an answer that is reasonable, but this can be quite variable as well.\n","\n","These tasks are referred to as **generative** and, in this case, we want the model to learn the **distribution** of the data and then give us reasonable samples using it.\n","\n","Here are some examples of generative tasks:"],"metadata":{"id":"bwmlQ_PIpNou"},"id":"bwmlQ_PIpNou"},{"cell_type":"markdown","id":"f0fb8b60","metadata":{"id":"f0fb8b60"},"source":["\n","
\n",""]},{"cell_type":"markdown","source":["These tasks can be useful on their own, or they can be a part of a more complex model. We'll discuss generative tasks further in one of the following lessons."],"metadata":{"id":"ViqSBkggpRs0"},"id":"ViqSBkggpRs0"},{"cell_type":"markdown","source":["## Self-Supervised Learning"],"metadata":{"id":"4YogNxcvphiB"},"id":"4YogNxcvphiB"},{"cell_type":"markdown","source":["In machine learning, usually the more complex the model you want to create, the more data you need. However, if it's a supervised learning task, acquiring large amounts of labeled data may be too expensive.\n","\n","For example, if we want to create a ChatGPT model this way, we would need billions of questions with good-quality answers – which is essentially impossible to get.\n","\n","For tasks like that, it would be good to have some basic model that \"understands\" the domain in general, which we can then combine with something else to solve our task.\n","\n","If we linger on the question of chat models a bit more, it would be great to have a model that generally understands \"how to use language\": grammar, combining phrases into reasonable statements, logical reasoning, and so on.\n","\n","These models are often called **foundational models**, and they are rather hard to train, but can be used in a range of different applications.\n","\n","So how do we get data for that kind of model? One way is to take unlabeled data and create a new synthetic task that transforms this unlabeled data into labeled data.\n","\n","For example if we’re working with large amounts of random texts, we can scan them word-by-word and try to predict the next word by using all the words that were seen before. This becomes a supervised learning task, where objects are a list of words before a given word and targets are single words (following the corresponding prefixes). These following words usually can't be predicted with a high level of certainty, so the model is expected to simply learn which options are more probable, and which are less. And actually this task requires the model to have knowledge of grammar, sentence structure and logical reasoning that we discussed earlier."],"metadata":{"id":"l7dzNSn3pa8v"},"id":"l7dzNSn3pa8v"},{"cell_type":"markdown","id":"61a567f1","metadata":{"id":"61a567f1"},"source":["\n","
\n",""]},{"cell_type":"markdown","source":["This approach is called **self-supervised learning**, and it can be applied to many areas.\n","\n","For example, with videos, we can take a sequence of frames and try to predict the next one; with images we can remove some pixels (or even larger patches) and try to predict what should go there using all the other parts of the image.\n","\n","All in all, this is a very powerful method of learning because it can use large, unlabeled amounts of data. We’ll talk about how it can be incorporated into more complex models in one of the following sections."],"metadata":{"id":"07Esuq7lpfsr"},"id":"07Esuq7lpfsr"},{"cell_type":"markdown","source":["## Reinforcement Learning"],"metadata":{"id":"E01--2rSpk8z"},"id":"E01--2rSpk8z"},{"cell_type":"markdown","source":["Some models don't work with the regular data at all. For instance, this is the case with the **reinforcement learning** approach, which gets data in a more interactive way:\n","\n","* With this approach there is some kind of environment, for example, a video game.\n","* Models can interact with this environment: get inputs (also called **state** of the environment) and perform **actions**. In case of a video game, inputs could be frame-by-frame screenshots, and actions might be the game controls.\n","* During these interactions, the models receive special numeric values called **rewards** (and also punishments, which are just negative rewards). For a video game, they may get a small reward for picking up an in-game item, a big reward for winning the game, and a big punishment for losing.\n","* The model plays the game repeatedly trying to \"beat the game\" – that is, to develop an algorithm that will maximize its average reward."],"metadata":{"id":"ZItrD6Xspo34"},"id":"ZItrD6Xspo34"},{"cell_type":"markdown","id":"eaa2db9d","metadata":{"id":"eaa2db9d"},"source":["\n","
\n","\n","*Image from [Atari 2600 Pacman game](https://en.wikipedia.org/wiki/Pac-Man_(Atari_2600_video_game))*\n",""]},{"cell_type":"markdown","source":["These methods are usually capable of achieving very _interesting_ results (for example, beating all the top Go players. At the same time, they are quite difficult to set up and train."],"metadata":{"id":"kueGvYBepvYC"},"id":"kueGvYBepvYC"},{"cell_type":"markdown","source":["## Large Language Models"],"metadata":{"id":"JpfUq-aepoRW"},"id":"JpfUq-aepoRW"},{"cell_type":"markdown","source":["Finally, we want to devote a separate section to large language models (LLMs) because of how important they are in the current AI landscape. These models combine several approaches ,for example, ChatGPT was trained in several stages:\n","\n","* The GPT core model was trained in a self-supervised way. This gives the model a general understanding of how languages and texts work. At this stage, the model already has lots of useful knowledge, but has not been specifically trained to follow instructions; it also doesn’t yet know how to filter potentially harmful content.\n","* Next, the developers trained the model to follow instructions; performing alignment training to prevent the generation of harmful answers also occurs at this stage. There are two stages of training here:\n"," * First, the model was trained in a supervised way using examples of humans answering instructions. For this, the developers collected a dataset of prompts with detailed professional answers.\n"," * Then, the model was trained answering new instructions, but this time, instead of seeing the correct answers, special moderators looked through the dialogues and evaluated the quality of the answers. After, those evaluations were used as rewards and punishments, and the model was trained in a reinforcement learning paradigm.\n","\n","Multi-stage approaches like this are becoming more common and are usually necessary for models that aim to solve hard problems, utilizing as much data as possible. Those models are usually themselves quite big in order to be able to pull as much knowledge from the data as possible.\n","\n","To conclude this, let's just ask ChatGPT to explain how large language models are trained:"],"metadata":{"id":"_nzV7-HAvtZE"},"id":"_nzV7-HAvtZE"},{"cell_type":"markdown","id":"eed36043","metadata":{"id":"eed36043"},"source":["\n","**🤖 ChatGPT:**\n","\n","\n","Large language models are created through a two-step process: training and fine-tuning. In the training phase, the models are exposed to massive amounts of text data from the internet, books, articles, and more. They learn by predicting the next word in a sentence based on the context provided by the previous words. This helps them understand grammar, meaning, and relationships between words.\n"," \n","Once the initial training is complete, the models undergo fine-tuning. During this phase, they are exposed to more specific and carefully curated datasets to improve their performance in specific tasks or domains. This fine-tuning process allows the models to specialize in areas like translation, summarization, or even medical diagnosis, depending on the training data used.\n","\n"," \n","Both training and fine-tuning involve complex algorithms running on powerful computers with vast amounts of computational resources. The resulting models can then be used to interact with humans, providing assistance, generating text, or helping in various language-related tasks.\n","
"]},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"_IHN-NzWo0eH"},"id":"_IHN-NzWo0eH"},{"cell_type":"markdown","source":["# 3. Example: Text Classification"],"metadata":{"id":"1i9Qv4E3ptBe"},"id":"1i9Qv4E3ptBe"},{"cell_type":"markdown","id":"744a6008","metadata":{"id":"744a6008"},"source":["Now that we have an overview of what ML can do, to help us understand a little more how algorithms work, let's pick an example problem and look at it more closely.\n"]},{"cell_type":"markdown","source":["## Task Description"],"metadata":{"id":"0Ie88uR5pvaD"},"id":"0Ie88uR5pvaD"},{"cell_type":"markdown","id":"f036a255","metadata":{"id":"f036a255"},"source":["Let's imagine that we’re developing a website where users can share and discuss the stories they’ve written. The database is very big with lots of different data, and we want to add automatic genre recognition, in order to simplify the search and navigation.\n","\n","So, essentially, we have texts looking like this (generated with ChatGPT):\n","\n","\n","The starship Orion cruised through the vast expanse of space, its engines humming smoothly as it navigated through the swirling clouds of cosmic dust. Captain Rachel Ward stood on the bridge, gazing out at the endless stars through the viewport. She had been leading the Orion on its mission of exploration for years now, and she had come to love the solitude and the sense of purpose that came with the job.\n","
\n","\n","Our question, which genre does this text belong with, sci-fi, fantasy or horror?\n","\n","\n","\n","This task is a supervised learning problem and, more specifically, a **text classification** problem. This is because we have a set of **classes**, and we need to assign them to the texts.\n","\n","For simplicity, let's reduce this to just one genre problem: for a given text, we just want to determine if it’s sci-fi or not. This is called a **binary classification** problem because we need to decide between two classes: \"sci-fi\" and \"not sci-fi\".\n"]},{"cell_type":"markdown","source":["## The Direct Approach vs The ML-based Approach"],"metadata":{"id":"lloFxANipyEK"},"id":"lloFxANipyEK"},{"cell_type":"markdown","source":["
"],"metadata":{"id":"yW3545gMqbCQ"},"id":"yW3545gMqbCQ"},{"cell_type":"markdown","id":"91579ac8","metadata":{"id":"91579ac8"},"source":["To get a hold on the problem as a whole, let’s first start with a direct implementation approach that does not involve ML. We’ll write a bunch of rules using our general knowledge:\n","\n","\n","\n","* Words like \"spaceship\" or \"android\" are most likely from sci-fi\n","* If we see the words \"wizard\" or \"magic\", the text probably belongs to the fantasy category, and we can put the text into \"not sci-fi\" category\n","* ... and we can conceive a ton of similar rules.\n","\n","This will probably give us an _adequate-ish_ system with just a couple of hours of work. Naturally, the question follows: is this _good enough_? We need some kind of criterion to measure the **classification quality** of our program.\n","\n","And we can indeed measure it by using some labeled data – just manually picking some texts and deciding the most likely genre. With this data, we can pass it through our program and count the percentage of correct answers. This measure is called **accuracy**, and we want it to be as high as possible.\n","\n","However, simple, hand-crafted rules will probably have low accuracy. Sure, perhaps if we work on them long enough, we’ll end up with decent accuracy, but there are many tasks where this brute-force approach simply doesn't work (as with the machine translation example we mentioned earlier). Let's learn how to solve this with machine learning.\n","\n","If we want to solve this using ML, we want to create an algorithm that will look through training data, notice patterns between texts and their genres, and deduce rules using it (this is actually very similar to how humans learn to do it). So the next question is: how exactly does the machine learn those rules from training data? Setting up the algorithms to do this is exactly what ML jobs are about. Let's learn about one simple way to do this."]},{"cell_type":"markdown","source":["## Classification Algorithm"],"metadata":{"id":"VzOCEHgWp1It"},"id":"VzOCEHgWp1It"},{"cell_type":"markdown","source":["We want the machine learning algorithm to create a classification program (or **classifier** for short). Let's first decide how it can look. One simple way is to imitate what we did in the hand-crafted rules approach:\n","\n","* Let’s say we have a **dictionary** of words important for genre identification (\"spaceship\" or \"wizard\").\n","* Each word has its own **score**, which shows if the word is more likely within the sci-fi genre or not. For example, the word \"starship\" might have a score of +0.5 and the word wizard could have a score of -0.7\n","* The classifier searches through the text for the words from the dictionary and totals all the scores it found.\n","* If the final score is positive, it declares that the answer is \"sci-fi\", and if it is negative it says \"not sci-fi\"."],"metadata":{"id":"tE8Jjo-Fwgfd"},"id":"tE8Jjo-Fwgfd"},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"Lq3cABBPwnly"},"id":"Lq3cABBPwnly"},{"cell_type":"markdown","source":["Note that, at the moment, we have neither the dictionary nor the scores – we just decided how our classifier will look in general. We’ll deal with acquiring those values later.\n","\n","Also, let’s point out that the algorithm is quite simple and it is already easy to see some possible problems (no matter how good the dictionary and scores), because it ignores the context of the words completely. An interesting thing is that this level of algorithm can still be considerably better than hand-crafted rules. But in any case, we’re going to improve it later, and this algorithm is sufficient for explaining all the basics."],"metadata":{"id":"RfX9dm1wwpN2"},"id":"RfX9dm1wwpN2"},{"cell_type":"markdown","source":["\n"," An example when this algorithm can fail (click to expand).
\n"," \n","In a sci-fi story, one character compliments another character's skill by calling her a \"wizard\":\n"," \n","\n","\n","Captain Rourke observed Lieutenant Thompson as she effortlessly manipulated the complex holographic interface, her fingers dancing across the virtual controls with a grace that bordered on the supernatural. He marveled at her uncanny ability to navigate the intricate systems of their advanced starship, her expertise unparalleled.\n","\n","\"Thompson,\" he called out, admiration evident in his voice. \"You're a **wizard** with these systems, you know that?\"\n","\n","Thompson turned to face him, a grin spreading across her face. \"Captain, you flatter me,\" she replied, her eyes twinkling with a mix of amusement and pride. \"But it's not **magic**, just years of practice and a deep understanding of the technology.\"
\n"," \n","In response to the appearance of this “fantasy” term within this text, the algorithm may react as if it reduces the possibility of this text being sci-fi, which is obviously incorrect.\n","\n","
\n"," "],"metadata":{"id":"YX-nrd3ewsEG"},"id":"YX-nrd3ewsEG"},{"cell_type":"markdown","id":"be69b538","metadata":{"id":"be69b538"},"source":["Now we have a **model** – this is like the “concept” of how our algorithm will work, without defining the actual scores and dictionary, and also without thinking yet about how to get them. It is typical for ML models to be defined in this way: we first outline the algorithm, leaving some gaps (here dictionary and scores), and then we figure out how to fill them in using training data; these gaps are called the **parameters** of the model."]},{"cell_type":"markdown","source":["## The Linear Classifier"],"metadata":{"id":"p2s80C5yp56N"},"id":"p2s80C5yp56N"},{"cell_type":"markdown","source":["In this section, we want to transform how our model looks because we want to be able to apply math algorithms to find all the missing parts. What we'll get in the end is called a **linear classifier**.\n","\n","Let's focus on the list of important words and their scores (we’ll put in some arbitrary values for this example):"],"metadata":{"id":"t1LyyUENxIZy"},"id":"t1LyyUENxIZy"},{"cell_type":"markdown","source":["\n","| word | score |\n","|---|---|\n","| starship | 0.5 |\n","| cosmic | 0.3 |\n","| wizard | -0.7 |\n","| dragon | -0.6 |\n","| ... | ... |"],"metadata":{"id":"_OKdjoJwxLKU"},"id":"_OKdjoJwxLKU"},{"cell_type":"markdown","id":"f3700df7","metadata":{"id":"f3700df7"},"source":["\n","\n","Using this table, we can write the output of the classifier as a formula. To do so, let's define some variables:\n","\n","* The variable for _starship_ is 1.0 if this word is present in the text and 0.0 if it is not\n","* The same applies for _cosmic_, _wizard_ and all the other words in the dictionary\n","* The variable _score_ is the final score of the classifier (prediction);\n","* _pred_ is the prediction of the classifier (True if the text belongs to sci-fi and False otherwise).\n","\n","Then, we get:\n","\n","```Python\n","score = 0.5 * starship + 0.3 * cosmic - 0.7 * wizard - 0.6 * dragon + ...\n","pred = score > 0\n","```\n","\n","_Note: we are assuming that an exact zero score is rare, and we don't care what the answer is in that case._"]},{"cell_type":"markdown","source":["That is the formula for our specific example, we now need to make it more abstract. Let's create new variables for that:\n","\n","- $x_1, x_2, \\ldots, x_N$ – variables for words, where $N$ is the size of the dictionary. If $i$-th word from the dictionary is present in the given text, we get $x_i = 1$ and otherwise we get $x_i = 0$. These are the input variables, and they are also called **features** of the objects, it is common to use letter $x$ for those.\n","\n","- $s$ — variable for the output score.\n","\n","- $\\color{magenta}{w_1}, \\color{magenta}{w_2}, \\dots, \\color{magenta}{w_N}$ — scores assigned to the specific words, also using $N$ — the size of the dictionary. The letter $w$ comes from the word **weights**, which is a common notation for some parameters that are inside a formula (we are also highlighting the parameters in formulas with magenta for easier reading). 
Those are the values we will later acquire from the training data.\n","\n","Now we can write:\n","\n","$$\n","s = \\color{magenta}{w_1} x_1 + \\color{magenta}{w_2} x_2 + \\ldots + \\color{magenta}{w_N} x_N\n","$$\n","\n","That's it – this is the **linear classifier**, which has this name because the function $s(x_1, x_2, \\ldots, x_N)$ is linear."],"metadata":{"id":"zS50OcUXjHR6"},"id":"zS50OcUXjHR6"},{"cell_type":"markdown","source":["## Model Training"],"metadata":{"id":"6KjtEIxtqNz_"},"id":"6KjtEIxtqNz_"},{"cell_type":"markdown","source":["Now, let's finally talk about how to get the dictionary and the values of the weights. The process of getting them out of training data is called **training** or **learning**.\n","\n","We’re only going to briefly outline it now, but we’ll give a detailed explanation in the next lesson.\n","\n","First, to get the dictionary, a simple trick is usually used. Let's look through the training data and create a dictionary of _all_ the possible words from it (for simplicity let's assume we have some kind of tool that matches different forms of the same word). This may sound strange because obviously not all words are important, but the importance can actually be reflected by the weights. A word can be in the dictionary, but at the same time can have a weight close to 0 – this will mean it is not very important. So, we’ve just basically reduced two problems to one – we only need to understand how to acquire weights now.\n","\n","And at this point, let's get to the harder part – getting the weights. The training algorithm combines several components:\n","\n","\n","* First, it needs a formula that takes features and weights as input and outputs the scores. We already have it from before – this is our linear classifier formula.\n","* It plugs all the training data into this formula and gets the scores for all the examples from it. For now let's just assume that we use random values for weights. Remember that we also have the correct answers already prepared.\n","* We know that we want our model to have high _accuracy_, or a low percentage of errors. Let's plug the model's predictions together with the correct answers into the accuracy formula and get the error percentage. In the real training algorithm, instead of accuracy, another measure is used, but it’s quite close in terms of meaning, so we won't go into this for now. But, in the next lesson we will, and we’ll also explain why just using the accuracy doesn't work well.\n","* So, we get a list of error values for all the training examples. We then average them all to get one final error value for our dataset."],"metadata":{"id":"Wdw1xa2uyMVp"},"id":"Wdw1xa2uyMVp"},{"cell_type":"markdown","id":"eaaea09a","metadata":{"id":"eaaea09a"},"source":["\n","
\n",""]},{"cell_type":"markdown","source":["By doing all this, we get a single value that reflects how good the model is and that is directly connected to the model weights. Basically, it’s a function that takes weights and outputs their overall error. That is a very good result – that’s because now we can just apply some mathematical optimization methods. These take functions and find the inputs that minimize the output of the function (not always perfect in practice, but that doesn't matter just now). We’ll dive deeper into these types of algorithms in the next lesson.\n","\n","For now, this is it, the algorithm is ready. After training it using optimization methods, we get a model, inside of which is a formula with learned weights. The next step is to measure its accuracy, and if it is good enough, ship it into production. As this is a theoretical problem, we can't actually test the model, but in the next lesson, we will do all this in practice."],"metadata":{"id":"x4Ebg9rfyW-C"},"id":"x4Ebg9rfyW-C"},{"cell_type":"markdown","source":["# 4. Conclusion"],"metadata":{"id":"p04Lgg0LqcWf"},"id":"p04Lgg0LqcWf"},{"cell_type":"markdown","source":["Let's recap what we learned in this lesson.\n","\n","In the overview section, we talked about machine learning models in general and learned that ML software works differently from regular algorithms: instead of implementing the logic directly, we implement an algorithm that learns the logic from training data.\n","\n","This training data can come in different forms: it can be labeled (suitable for supervised learning) or unlabeled (suitable for self-supervised learning), and can even come in the form of interactive environments with rewards and penalties (reinforcement learning).\n","\n","We also learned that many modern models require large amounts of data to train, and they are frequently composed of different stages (as with ChatGPT):"],"metadata":{"id":"dbxPzk5-ynfC"},"id":"dbxPzk5-ynfC"},{"cell_type":"markdown","id":"3a5deb18","metadata":{"id":"3a5deb18"},"source":["\n","
\n","\n"]},{"cell_type":"markdown","source":["In the model example section, we went deeper into one specific problem and learned the steps necessary for designing a ML model:\n","\n","* We defined what is the problem we want to solve and what is the training data for it. We also decided how to understand how good the model is (model quality).\n","* We designed a machine learning model: we defined the features it has and how it converts them into predictions. We also defined the parameters of the model – the weights that we can change to tune the model to the data.\n","* We then rewrote the model as a formula and developed a training algorithm that uses formula optimization to find the best weights."],"metadata":{"id":"b81hB73JyybI"},"id":"b81hB73JyybI"},{"cell_type":"markdown","id":"3253b3a2","metadata":{"id":"3253b3a2"},"source":["\n","
\n","\n"]},{"cell_type":"markdown","source":["Great! In the [**next lesson**](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_2.ipynb) you will learn how to **train linear models** with Python and go even further into all the related math.\n","\n","Additionally, if you’re interested, here are some additional resources on the things we talked about in this lesson:\n","\n","* [A visualisation of the data distribution for a simple generative task: handwritten digits](https://n8python.github.io/mnistLatentSpace/)\n","* [Reinforcement Learning example video: AI learns to solve simple puzzles](https://www.youtube.com/watch?v=v3UBlEJDXR0)\n","* [Blogpost about training ChatGPT from OpenAI](https://openai.com/blog/chatgpt)"],"metadata":{"id":"gmeHOzNny5zi"},"id":"gmeHOzNny5zi"},{"cell_type":"code","source":[],"metadata":{"id":"QGhU5_W2kerq"},"id":"QGhU5_W2kerq","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.8"},"colab":{"provenance":[{"file_id":"1YTLU9kjk7Qt6u-rIYcyQBS2kXwmndKA8","timestamp":1708329781537}]}},"nbformat":4,"nbformat_minor":5}
2 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 | Preamble
9 |
10 | The GNU General Public License is a free, copyleft license for
11 | software and other kinds of works.
12 |
13 | The licenses for most software and other practical works are designed
14 | to take away your freedom to share and change the works. By contrast,
15 | the GNU General Public License is intended to guarantee your freedom to
16 | share and change all versions of a program--to make sure it remains free
17 | software for all its users. We, the Free Software Foundation, use the
18 | GNU General Public License for most of our software; it applies also to
19 | any other work released this way by its authors. You can apply it to
20 | your programs, too.
21 |
22 | When we speak of free software, we are referring to freedom, not
23 | price. Our General Public Licenses are designed to make sure that you
24 | have the freedom to distribute copies of free software (and charge for
25 | them if you wish), that you receive source code or can get it if you
26 | want it, that you can change the software or use pieces of it in new
27 | free programs, and that you know you can do these things.
28 |
29 | To protect your rights, we need to prevent others from denying you
30 | these rights or asking you to surrender the rights. Therefore, you have
31 | certain responsibilities if you distribute copies of the software, or if
32 | you modify it: responsibilities to respect the freedom of others.
33 |
34 | For example, if you distribute copies of such a program, whether
35 | gratis or for a fee, you must pass on to the recipients the same
36 | freedoms that you received. You must make sure that they, too, receive
37 | or can get the source code. And you must show them these terms so they
38 | know their rights.
39 |
40 | Developers that use the GNU GPL protect your rights with two steps:
41 | (1) assert copyright on the software, and (2) offer you this License
42 | giving you legal permission to copy, distribute and/or modify it.
43 |
44 | For the developers' and authors' protection, the GPL clearly explains
45 | that there is no warranty for this free software. For both users' and
46 | authors' sake, the GPL requires that modified versions be marked as
47 | changed, so that their problems will not be attributed erroneously to
48 | authors of previous versions.
49 |
50 | Some devices are designed to deny users access to install or run
51 | modified versions of the software inside them, although the manufacturer
52 | can do so. This is fundamentally incompatible with the aim of
53 | protecting users' freedom to change the software. The systematic
54 | pattern of such abuse occurs in the area of products for individuals to
55 | use, which is precisely where it is most unacceptable. Therefore, we
56 | have designed this version of the GPL to prohibit the practice for those
57 | products. If such problems arise substantially in other domains, we
58 | stand ready to extend this provision to those domains in future versions
59 | of the GPL, as needed to protect the freedom of users.
60 |
61 | Finally, every program is threatened constantly by software patents.
62 | States should not allow patents to restrict development and use of
63 | software on general-purpose computers, but in those that do, we wish to
64 | avoid the special danger that patents applied to a free program could
65 | make it effectively proprietary. To prevent this, the GPL assures that
66 | patents cannot be used to render the program non-free.
67 |
68 | The precise terms and conditions for copying, distribution and
69 | modification follow.
70 |
71 | TERMS AND CONDITIONS
72 |
73 | 0. Definitions.
74 |
75 | "This License" refers to version 3 of the GNU General Public License.
76 |
77 | "Copyright" also means copyright-like laws that apply to other kinds of
78 | works, such as semiconductor masks.
79 |
80 | "The Program" refers to any copyrightable work licensed under this
81 | License. Each licensee is addressed as "you". "Licensees" and
82 | "recipients" may be individuals or organizations.
83 |
84 | To "modify" a work means to copy from or adapt all or part of the work
85 | in a fashion requiring copyright permission, other than the making of an
86 | exact copy. The resulting work is called a "modified version" of the
87 | earlier work or a work "based on" the earlier work.
88 |
89 | A "covered work" means either the unmodified Program or a work based
90 | on the Program.
91 |
92 | To "propagate" a work means to do anything with it that, without
93 | permission, would make you directly or secondarily liable for
94 | infringement under applicable copyright law, except executing it on a
95 | computer or modifying a private copy. Propagation includes copying,
96 | distribution (with or without modification), making available to the
97 | public, and in some countries other activities as well.
98 |
99 | To "convey" a work means any kind of propagation that enables other
100 | parties to make or receive copies. Mere interaction with a user through
101 | a computer network, with no transfer of a copy, is not conveying.
102 |
103 | An interactive user interface displays "Appropriate Legal Notices"
104 | to the extent that it includes a convenient and prominently visible
105 | feature that (1) displays an appropriate copyright notice, and (2)
106 | tells the user that there is no warranty for the work (except to the
107 | extent that warranties are provided), that licensees may convey the
108 | work under this License, and how to view a copy of this License. If
109 | the interface presents a list of user commands or options, such as a
110 | menu, a prominent item in the list meets this criterion.
111 |
112 | 1. Source Code.
113 |
114 | The "source code" for a work means the preferred form of the work
115 | for making modifications to it. "Object code" means any non-source
116 | form of a work.
117 |
118 | A "Standard Interface" means an interface that either is an official
119 | standard defined by a recognized standards body, or, in the case of
120 | interfaces specified for a particular programming language, one that
121 | is widely used among developers working in that language.
122 |
123 | The "System Libraries" of an executable work include anything, other
124 | than the work as a whole, that (a) is included in the normal form of
125 | packaging a Major Component, but which is not part of that Major
126 | Component, and (b) serves only to enable use of the work with that
127 | Major Component, or to implement a Standard Interface for which an
128 | implementation is available to the public in source code form. A
129 | "Major Component", in this context, means a major essential component
130 | (kernel, window system, and so on) of the specific operating system
131 | (if any) on which the executable work runs, or a compiler used to
132 | produce the work, or an object code interpreter used to run it.
133 |
134 | The "Corresponding Source" for a work in object code form means all
135 | the source code needed to generate, install, and (for an executable
136 | work) run the object code and to modify the work, including scripts to
137 | control those activities. However, it does not include the work's
138 | System Libraries, or general-purpose tools or generally available free
139 | programs which are used unmodified in performing those activities but
140 | which are not part of the work. For example, Corresponding Source
141 | includes interface definition files associated with source files for
142 | the work, and the source code for shared libraries and dynamically
143 | linked subprograms that the work is specifically designed to require,
144 | such as by intimate data communication or control flow between those
145 | subprograms and other parts of the work.
146 |
147 | The Corresponding Source need not include anything that users
148 | can regenerate automatically from other parts of the Corresponding
149 | Source.
150 |
151 | The Corresponding Source for a work in source code form is that
152 | same work.
153 |
154 | 2. Basic Permissions.
155 |
156 | All rights granted under this License are granted for the term of
157 | copyright on the Program, and are irrevocable provided the stated
158 | conditions are met. This License explicitly affirms your unlimited
159 | permission to run the unmodified Program. The output from running a
160 | covered work is covered by this License only if the output, given its
161 | content, constitutes a covered work. This License acknowledges your
162 | rights of fair use or other equivalent, as provided by copyright law.
163 |
164 | You may make, run and propagate covered works that you do not
165 | convey, without conditions so long as your license otherwise remains
166 | in force. You may convey covered works to others for the sole purpose
167 | of having them make modifications exclusively for you, or provide you
168 | with facilities for running those works, provided that you comply with
169 | the terms of this License in conveying all material for which you do
170 | not control copyright. Those thus making or running the covered works
171 | for you must do so exclusively on your behalf, under your direction
172 | and control, on terms that prohibit them from making any copies of
173 | your copyrighted material outside their relationship with you.
174 |
175 | Conveying under any other circumstances is permitted solely under
176 | the conditions stated below. Sublicensing is not allowed; section 10
177 | makes it unnecessary.
178 |
179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180 |
181 | No covered work shall be deemed part of an effective technological
182 | measure under any applicable law fulfilling obligations under article
183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184 | similar laws prohibiting or restricting circumvention of such
185 | measures.
186 |
187 | When you convey a covered work, you waive any legal power to forbid
188 | circumvention of technological measures to the extent such circumvention
189 | is effected by exercising rights under this License with respect to
190 | the covered work, and you disclaim any intention to limit operation or
191 | modification of the work as a means of enforcing, against the work's
192 | users, your or third parties' legal rights to forbid circumvention of
193 | technological measures.
194 |
195 | 4. Conveying Verbatim Copies.
196 |
197 | You may convey verbatim copies of the Program's source code as you
198 | receive it, in any medium, provided that you conspicuously and
199 | appropriately publish on each copy an appropriate copyright notice;
200 | keep intact all notices stating that this License and any
201 | non-permissive terms added in accord with section 7 apply to the code;
202 | keep intact all notices of the absence of any warranty; and give all
203 | recipients a copy of this License along with the Program.
204 |
205 | You may charge any price or no price for each copy that you convey,
206 | and you may offer support or warranty protection for a fee.
207 |
208 | 5. Conveying Modified Source Versions.
209 |
210 | You may convey a work based on the Program, or the modifications to
211 | produce it from the Program, in the form of source code under the
212 | terms of section 4, provided that you also meet all of these conditions:
213 |
214 | a) The work must carry prominent notices stating that you modified
215 | it, and giving a relevant date.
216 |
217 | b) The work must carry prominent notices stating that it is
218 | released under this License and any conditions added under section
219 | 7. This requirement modifies the requirement in section 4 to
220 | "keep intact all notices".
221 |
222 | c) You must license the entire work, as a whole, under this
223 | License to anyone who comes into possession of a copy. This
224 | License will therefore apply, along with any applicable section 7
225 | additional terms, to the whole of the work, and all its parts,
226 | regardless of how they are packaged. This License gives no
227 | permission to license the work in any other way, but it does not
228 | invalidate such permission if you have separately received it.
229 |
230 | d) If the work has interactive user interfaces, each must display
231 | Appropriate Legal Notices; however, if the Program has interactive
232 | interfaces that do not display Appropriate Legal Notices, your
233 | work need not make them do so.
234 |
235 | A compilation of a covered work with other separate and independent
236 | works, which are not by their nature extensions of the covered work,
237 | and which are not combined with it such as to form a larger program,
238 | in or on a volume of a storage or distribution medium, is called an
239 | "aggregate" if the compilation and its resulting copyright are not
240 | used to limit the access or legal rights of the compilation's users
241 | beyond what the individual works permit. Inclusion of a covered work
242 | in an aggregate does not cause this License to apply to the other
243 | parts of the aggregate.
244 |
245 | 6. Conveying Non-Source Forms.
246 |
247 | You may convey a covered work in object code form under the terms
248 | of sections 4 and 5, provided that you also convey the
249 | machine-readable Corresponding Source under the terms of this License,
250 | in one of these ways:
251 |
252 | a) Convey the object code in, or embodied in, a physical product
253 | (including a physical distribution medium), accompanied by the
254 | Corresponding Source fixed on a durable physical medium
255 | customarily used for software interchange.
256 |
257 | b) Convey the object code in, or embodied in, a physical product
258 | (including a physical distribution medium), accompanied by a
259 | written offer, valid for at least three years and valid for as
260 | long as you offer spare parts or customer support for that product
261 | model, to give anyone who possesses the object code either (1) a
262 | copy of the Corresponding Source for all the software in the
263 | product that is covered by this License, on a durable physical
264 | medium customarily used for software interchange, for a price no
265 | more than your reasonable cost of physically performing this
266 | conveying of source, or (2) access to copy the
267 | Corresponding Source from a network server at no charge.
268 |
269 | c) Convey individual copies of the object code with a copy of the
270 | written offer to provide the Corresponding Source. This
271 | alternative is allowed only occasionally and noncommercially, and
272 | only if you received the object code with such an offer, in accord
273 | with subsection 6b.
274 |
275 | d) Convey the object code by offering access from a designated
276 | place (gratis or for a charge), and offer equivalent access to the
277 | Corresponding Source in the same way through the same place at no
278 | further charge. You need not require recipients to copy the
279 | Corresponding Source along with the object code. If the place to
280 | copy the object code is a network server, the Corresponding Source
281 | may be on a different server (operated by you or a third party)
282 | that supports equivalent copying facilities, provided you maintain
283 | clear directions next to the object code saying where to find the
284 | Corresponding Source. Regardless of what server hosts the
285 | Corresponding Source, you remain obligated to ensure that it is
286 | available for as long as needed to satisfy these requirements.
287 |
288 | e) Convey the object code using peer-to-peer transmission, provided
289 | you inform other peers where the object code and Corresponding
290 | Source of the work are being offered to the general public at no
291 | charge under subsection 6d.
292 |
293 | A separable portion of the object code, whose source code is excluded
294 | from the Corresponding Source as a System Library, need not be
295 | included in conveying the object code work.
296 |
297 | A "User Product" is either (1) a "consumer product", which means any
298 | tangible personal property which is normally used for personal, family,
299 | or household purposes, or (2) anything designed or sold for incorporation
300 | into a dwelling. In determining whether a product is a consumer product,
301 | doubtful cases shall be resolved in favor of coverage. For a particular
302 | product received by a particular user, "normally used" refers to a
303 | typical or common use of that class of product, regardless of the status
304 | of the particular user or of the way in which the particular user
305 | actually uses, or expects or is expected to use, the product. A product
306 | is a consumer product regardless of whether the product has substantial
307 | commercial, industrial or non-consumer uses, unless such uses represent
308 | the only significant mode of use of the product.
309 |
310 | "Installation Information" for a User Product means any methods,
311 | procedures, authorization keys, or other information required to install
312 | and execute modified versions of a covered work in that User Product from
313 | a modified version of its Corresponding Source. The information must
314 | suffice to ensure that the continued functioning of the modified object
315 | code is in no case prevented or interfered with solely because
316 | modification has been made.
317 |
318 | If you convey an object code work under this section in, or with, or
319 | specifically for use in, a User Product, and the conveying occurs as
320 | part of a transaction in which the right of possession and use of the
321 | User Product is transferred to the recipient in perpetuity or for a
322 | fixed term (regardless of how the transaction is characterized), the
323 | Corresponding Source conveyed under this section must be accompanied
324 | by the Installation Information. But this requirement does not apply
325 | if neither you nor any third party retains the ability to install
326 | modified object code on the User Product (for example, the work has
327 | been installed in ROM).
328 |
329 | The requirement to provide Installation Information does not include a
330 | requirement to continue to provide support service, warranty, or updates
331 | for a work that has been modified or installed by the recipient, or for
332 | the User Product in which it has been modified or installed. Access to a
333 | network may be denied when the modification itself materially and
334 | adversely affects the operation of the network or violates the rules and
335 | protocols for communication across the network.
336 |
337 | Corresponding Source conveyed, and Installation Information provided,
338 | in accord with this section must be in a format that is publicly
339 | documented (and with an implementation available to the public in
340 | source code form), and must require no special password or key for
341 | unpacking, reading or copying.
342 |
343 | 7. Additional Terms.
344 |
345 | "Additional permissions" are terms that supplement the terms of this
346 | License by making exceptions from one or more of its conditions.
347 | Additional permissions that are applicable to the entire Program shall
348 | be treated as though they were included in this License, to the extent
349 | that they are valid under applicable law. If additional permissions
350 | apply only to part of the Program, that part may be used separately
351 | under those permissions, but the entire Program remains governed by
352 | this License without regard to the additional permissions.
353 |
354 | When you convey a copy of a covered work, you may at your option
355 | remove any additional permissions from that copy, or from any part of
356 | it. (Additional permissions may be written to require their own
357 | removal in certain cases when you modify the work.) You may place
358 | additional permissions on material, added by you to a covered work,
359 | for which you have or can give appropriate copyright permission.
360 |
361 | Notwithstanding any other provision of this License, for material you
362 | add to a covered work, you may (if authorized by the copyright holders of
363 | that material) supplement the terms of this License with terms:
364 |
365 | a) Disclaiming warranty or limiting liability differently from the
366 | terms of sections 15 and 16 of this License; or
367 |
368 | b) Requiring preservation of specified reasonable legal notices or
369 | author attributions in that material or in the Appropriate Legal
370 | Notices displayed by works containing it; or
371 |
372 | c) Prohibiting misrepresentation of the origin of that material, or
373 | requiring that modified versions of such material be marked in
374 | reasonable ways as different from the original version; or
375 |
376 | d) Limiting the use for publicity purposes of names of licensors or
377 | authors of the material; or
378 |
379 | e) Declining to grant rights under trademark law for use of some
380 | trade names, trademarks, or service marks; or
381 |
382 | f) Requiring indemnification of licensors and authors of that
383 | material by anyone who conveys the material (or modified versions of
384 | it) with contractual assumptions of liability to the recipient, for
385 | any liability that these contractual assumptions directly impose on
386 | those licensors and authors.
387 |
388 | All other non-permissive additional terms are considered "further
389 | restrictions" within the meaning of section 10. If the Program as you
390 | received it, or any part of it, contains a notice stating that it is
391 | governed by this License along with a term that is a further
392 | restriction, you may remove that term. If a license document contains
393 | a further restriction but permits relicensing or conveying under this
394 | License, you may add to a covered work material governed by the terms
395 | of that license document, provided that the further restriction does
396 | not survive such relicensing or conveying.
397 |
398 | If you add terms to a covered work in accord with this section, you
399 | must place, in the relevant source files, a statement of the
400 | additional terms that apply to those files, or a notice indicating
401 | where to find the applicable terms.
402 |
403 | Additional terms, permissive or non-permissive, may be stated in the
404 | form of a separately written license, or stated as exceptions;
405 | the above requirements apply either way.
406 |
407 | 8. Termination.
408 |
409 | You may not propagate or modify a covered work except as expressly
410 | provided under this License. Any attempt otherwise to propagate or
411 | modify it is void, and will automatically terminate your rights under
412 | this License (including any patent licenses granted under the third
413 | paragraph of section 11).
414 |
415 | However, if you cease all violation of this License, then your
416 | license from a particular copyright holder is reinstated (a)
417 | provisionally, unless and until the copyright holder explicitly and
418 | finally terminates your license, and (b) permanently, if the copyright
419 | holder fails to notify you of the violation by some reasonable means
420 | prior to 60 days after the cessation.
421 |
422 | Moreover, your license from a particular copyright holder is
423 | reinstated permanently if the copyright holder notifies you of the
424 | violation by some reasonable means, this is the first time you have
425 | received notice of violation of this License (for any work) from that
426 | copyright holder, and you cure the violation prior to 30 days after
427 | your receipt of the notice.
428 |
429 | Termination of your rights under this section does not terminate the
430 | licenses of parties who have received copies or rights from you under
431 | this License. If your rights have been terminated and not permanently
432 | reinstated, you do not qualify to receive new licenses for the same
433 | material under section 10.
434 |
435 | 9. Acceptance Not Required for Having Copies.
436 |
437 | You are not required to accept this License in order to receive or
438 | run a copy of the Program. Ancillary propagation of a covered work
439 | occurring solely as a consequence of using peer-to-peer transmission
440 | to receive a copy likewise does not require acceptance. However,
441 | nothing other than this License grants you permission to propagate or
442 | modify any covered work. These actions infringe copyright if you do
443 | not accept this License. Therefore, by modifying or propagating a
444 | covered work, you indicate your acceptance of this License to do so.
445 |
446 | 10. Automatic Licensing of Downstream Recipients.
447 |
448 | Each time you convey a covered work, the recipient automatically
449 | receives a license from the original licensors, to run, modify and
450 | propagate that work, subject to this License. You are not responsible
451 | for enforcing compliance by third parties with this License.
452 |
453 | An "entity transaction" is a transaction transferring control of an
454 | organization, or substantially all assets of one, or subdividing an
455 | organization, or merging organizations. If propagation of a covered
456 | work results from an entity transaction, each party to that
457 | transaction who receives a copy of the work also receives whatever
458 | licenses to the work the party's predecessor in interest had or could
459 | give under the previous paragraph, plus a right to possession of the
460 | Corresponding Source of the work from the predecessor in interest, if
461 | the predecessor has it or can get it with reasonable efforts.
462 |
463 | You may not impose any further restrictions on the exercise of the
464 | rights granted or affirmed under this License. For example, you may
465 | not impose a license fee, royalty, or other charge for exercise of
466 | rights granted under this License, and you may not initiate litigation
467 | (including a cross-claim or counterclaim in a lawsuit) alleging that
468 | any patent claim is infringed by making, using, selling, offering for
469 | sale, or importing the Program or any portion of it.
470 |
471 | 11. Patents.
472 |
473 | A "contributor" is a copyright holder who authorizes use under this
474 | License of the Program or a work on which the Program is based. The
475 | work thus licensed is called the contributor's "contributor version".
476 |
477 | A contributor's "essential patent claims" are all patent claims
478 | owned or controlled by the contributor, whether already acquired or
479 | hereafter acquired, that would be infringed by some manner, permitted
480 | by this License, of making, using, or selling its contributor version,
481 | but do not include claims that would be infringed only as a
482 | consequence of further modification of the contributor version. For
483 | purposes of this definition, "control" includes the right to grant
484 | patent sublicenses in a manner consistent with the requirements of
485 | this License.
486 |
487 | Each contributor grants you a non-exclusive, worldwide, royalty-free
488 | patent license under the contributor's essential patent claims, to
489 | make, use, sell, offer for sale, import and otherwise run, modify and
490 | propagate the contents of its contributor version.
491 |
492 | In the following three paragraphs, a "patent license" is any express
493 | agreement or commitment, however denominated, not to enforce a patent
494 | (such as an express permission to practice a patent or covenant not to
495 | sue for patent infringement). To "grant" such a patent license to a
496 | party means to make such an agreement or commitment not to enforce a
497 | patent against the party.
498 |
499 | If you convey a covered work, knowingly relying on a patent license,
500 | and the Corresponding Source of the work is not available for anyone
501 | to copy, free of charge and under the terms of this License, through a
502 | publicly available network server or other readily accessible means,
503 | then you must either (1) cause the Corresponding Source to be so
504 | available, or (2) arrange to deprive yourself of the benefit of the
505 | patent license for this particular work, or (3) arrange, in a manner
506 | consistent with the requirements of this License, to extend the patent
507 | license to downstream recipients. "Knowingly relying" means you have
508 | actual knowledge that, but for the patent license, your conveying the
509 | covered work in a country, or your recipient's use of the covered work
510 | in a country, would infringe one or more identifiable patents in that
511 | country that you have reason to believe are valid.
512 |
513 | If, pursuant to or in connection with a single transaction or
514 | arrangement, you convey, or propagate by procuring conveyance of, a
515 | covered work, and grant a patent license to some of the parties
516 | receiving the covered work authorizing them to use, propagate, modify
517 | or convey a specific copy of the covered work, then the patent license
518 | you grant is automatically extended to all recipients of the covered
519 | work and works based on it.
520 |
521 | A patent license is "discriminatory" if it does not include within
522 | the scope of its coverage, prohibits the exercise of, or is
523 | conditioned on the non-exercise of one or more of the rights that are
524 | specifically granted under this License. You may not convey a covered
525 | work if you are a party to an arrangement with a third party that is
526 | in the business of distributing software, under which you make payment
527 | to the third party based on the extent of your activity of conveying
528 | the work, and under which the third party grants, to any of the
529 | parties who would receive the covered work from you, a discriminatory
530 | patent license (a) in connection with copies of the covered work
531 | conveyed by you (or copies made from those copies), or (b) primarily
532 | for and in connection with specific products or compilations that
533 | contain the covered work, unless you entered into that arrangement,
534 | or that patent license was granted, prior to 28 March 2007.
535 |
536 | Nothing in this License shall be construed as excluding or limiting
537 | any implied license or other defenses to infringement that may
538 | otherwise be available to you under applicable patent law.
539 |
540 | 12. No Surrender of Others' Freedom.
541 |
542 | If conditions are imposed on you (whether by court order, agreement or
543 | otherwise) that contradict the conditions of this License, they do not
544 | excuse you from the conditions of this License. If you cannot convey a
545 | covered work so as to satisfy simultaneously your obligations under this
546 | License and any other pertinent obligations, then as a consequence you may
547 | not convey it at all. For example, if you agree to terms that obligate you
548 | to collect a royalty for further conveying from those to whom you convey
549 | the Program, the only way you could satisfy both those terms and this
550 | License would be to refrain entirely from conveying the Program.
551 |
552 | 13. Use with the GNU Affero General Public License.
553 |
554 | Notwithstanding any other provision of this License, you have
555 | permission to link or combine any covered work with a work licensed
556 | under version 3 of the GNU Affero General Public License into a single
557 | combined work, and to convey the resulting work. The terms of this
558 | License will continue to apply to the part which is the covered work,
559 | but the special requirements of the GNU Affero General Public License,
560 | section 13, concerning interaction through a network will apply to the
561 | combination as such.
562 |
563 | 14. Revised Versions of this License.
564 |
565 | The Free Software Foundation may publish revised and/or new versions of
566 | the GNU General Public License from time to time. Such new versions will
567 | be similar in spirit to the present version, but may differ in detail to
568 | address new problems or concerns.
569 |
570 | Each version is given a distinguishing version number. If the
571 | Program specifies that a certain numbered version of the GNU General
572 | Public License "or any later version" applies to it, you have the
573 | option of following the terms and conditions either of that numbered
574 | version or of any later version published by the Free Software
575 | Foundation. If the Program does not specify a version number of the
576 | GNU General Public License, you may choose any version ever published
577 | by the Free Software Foundation.
578 |
579 | If the Program specifies that a proxy can decide which future
580 | versions of the GNU General Public License can be used, that proxy's
581 | public statement of acceptance of a version permanently authorizes you
582 | to choose that version for the Program.
583 |
584 | Later license versions may give you additional or different
585 | permissions. However, no additional obligations are imposed on any
586 | author or copyright holder as a result of your choosing to follow a
587 | later version.
588 |
589 | 15. Disclaimer of Warranty.
590 |
591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599 |
600 | 16. Limitation of Liability.
601 |
602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610 | SUCH DAMAGES.
611 |
612 | 17. Interpretation of Sections 15 and 16.
613 |
614 | If the disclaimer of warranty and limitation of liability provided
615 | above cannot be given local legal effect according to their terms,
616 | reviewing courts shall apply local law that most closely approximates
617 | an absolute waiver of all civil liability in connection with the
618 | Program, unless a warranty or assumption of liability accompanies a
619 | copy of the Program in return for a fee.
620 |
621 | END OF TERMS AND CONDITIONS
622 |
623 | How to Apply These Terms to Your New Programs
624 |
625 | If you develop a new program, and you want it to be of the greatest
626 | possible use to the public, the best way to achieve this is to make it
627 | free software which everyone can redistribute and change under these terms.
628 |
629 | To do so, attach the following notices to the program. It is safest
630 | to attach them to the start of each source file to most effectively
631 | state the exclusion of warranty; and each file should have at least
632 | the "copyright" line and a pointer to where the full notice is found.
633 |
634 |
635 | Copyright (C) <year> <name of author>
636 |
637 | This program is free software: you can redistribute it and/or modify
638 | it under the terms of the GNU General Public License as published by
639 | the Free Software Foundation, either version 3 of the License, or
640 | (at your option) any later version.
641 |
642 | This program is distributed in the hope that it will be useful,
643 | but WITHOUT ANY WARRANTY; without even the implied warranty of
644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
645 | GNU General Public License for more details.
646 |
647 | You should have received a copy of the GNU General Public License
648 | along with this program. If not, see <https://www.gnu.org/licenses/>.
649 |
650 | Also add information on how to contact you by electronic and paper mail.
651 |
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 |
655 | <program> Copyright (C) <year> <name of author>
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 |
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 |
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | <https://www.gnu.org/licenses/>.
668 |
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | <https://www.gnu.org/licenses/why-not-lgpl.html>.
675 |
--------------------------------------------------------------------------------
/notebooks/lesson_4.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"toc_visible":true},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["
\n","\n","
\n","\n","\n","Lesson 4: ChatGPT\n","\n","\n","\n","\n","Discovering the many things that one model can do\n","\n","\n","\n","\n"],"metadata":{"id":"oBVShEbLkQlx"}},{"cell_type":"markdown","source":["# 1. Introduction\n"],"metadata":{"id":"kYKZPaMmWUV_"}},{"cell_type":"markdown","source":["## Lesson Overview"],"metadata":{"id":"aNKTHB8srO6E"}},{"cell_type":"markdown","source":["This time, we’ll go deep into a machine learning model that’s made a lot of noise recently: ChatGPT. Fortunately, just like BERT from the previous lesson, it’s based on the **transformers** architecture, so it will be easier to understand what’s happening. In this lesson, you’ll:\n","\n","* Understand the structure of the ChatGPT model and how it’s **trained**\n","* Learn about **in-context learning** and how it relates to other types of learning\n","* Practice solving real-life problems using the **OpenAI API** and **LangChain** library\n","\n","You can find the notebook with solutions [here](https://colab.research.google.com/drive/1FkzEuJIjkJLXUFCMCpWpYbx1a7RkYwkP?usp=sharing)."],"metadata":{"id":"coG_sboIrWmS"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"M5YuOoyrZcOD"}},{"cell_type":"markdown","source":["# 2. Understanding ChatGPT\n"],"metadata":{"id":"ZlMq_Ia_XVb3"}},{"cell_type":"markdown","source":["## The Base GPT model"],"metadata":{"id":"8D5UbFit2xO6"}},{"cell_type":"markdown","source":["First, a quick review of the **transformers** architecture from the previous lesson:"],"metadata":{"id":"wfugvp2xZ1M_"}},{"cell_type":"markdown","source":["\n","
"],"metadata":{"id":"0vIKCExMrgVZ"}},{"cell_type":"markdown","source":["There are two major parts here:\n","\n","* An **encoder**, which takes a sequence of input tokens and generates a sequence of embedding vectors\n","* A **decoder**, which takes both the embedding vectors and the previous output tokens and generates next output tokens\n","\n","Note that there are three different types of embeddings: input token embeddings, encoder outputs, and output token embeddings. Token embeddings are connected to natural language, but encoder outputs are just vectors that only the model understands.\n","\n","In the previous lesson, we worked with a BERT model, which only has the encoder part; it builds the embedding vectors that can be used later for classifying the whole sequence, (for example, to detect the sentiments of the reviews).\n","\n","But what if instead, we build a model with only the decoder?\n","\n","This may sound a little strange – because how would we decode something without encoding it first? However, this way we actually arrive at a **generative model**. Rather than transforming an input sequence into an output sequence (like in a machine translation task), a generative model just continues its input sequence; it can even create a text from scratch. (We talked a little bit about models like this in lesson 1.)\n","\n","The following picture illustrates how it works:"],"metadata":{"id":"IoI75h5pdlD1"}},{"cell_type":"markdown","source":["\n","
\n","\n","\n","\n"],"metadata":{"id":"-ScU2J-Kd8C-"}},{"cell_type":"markdown","source":["A decoder-only model predicts the continuation of a text token by token, adding each new token to its input.\n","\n","Recent decoder-only models have billions of trainable parameters, and so they’re referred to as **Large Language Models** (**LLMs**).\n","\n","The emergence of LLMs was a huge breakthrough in machine learning, and these days, several Large Language Models appear every week. Let’s just mention a few of them:\n","\n","\n","\n","1. The **GPT** family, created by OpenAI, consists of decoder-only models (GPT means **Generative Pre-trained Transformer**). It includes very famous models like GPT-3.5, GPT-4 and GPT-4o\n","2. The **Claude** family by Anthropic\n","3. The **Gemini** family by Google\n","4. **Mistral** and **Mixtral**, which are open source and can be downloaded from Hugging Face.\n","\n","Let’s discuss the typical architecture for these models:\n","\n","\n","\n","* The input data is transformed into vectors using **token embeddings** and **positional embeddings**.\n","* These vectors go through a lot of big **transformer layers**. (Like, really huge layers with a ton of parameters.)\n","* The final vector embeddings are translated back into token probabilities using **softmax**.\n","\n","The training process for these models often consists of several steps (more details ahead!), but the first step is always the same: **pre-training**. During this step, the model browses through huge volumes of real texts and learns to reproduce them. Roughly speaking, for each text and for each token position of this text the model learns to solve a classification task: predicting the next token.\n","\n","After the pre-training stage we get the **base GPT model**."],"metadata":{"id":"bnlmtGr_s3Q3"}},{"cell_type":"markdown","source":["## From GPT to ChatGPT"],"metadata":{"id":"-VFFZmM3DQ6t"}},{"cell_type":"markdown","source":["Now, from the base GPT model, there are two more training steps to get the ChatGPT model."],"metadata":{"id":"PQjlJCX9SUHS"}},{"cell_type":"markdown","source":["**1. Supervised Fine-Tuning**\n","\n","The ability to continue a sentence is not enough – ChatGPT should complete tasks given by users and have the ability to keep up a conversation.\n","\n","To do that, OpenAI hired people who created specific fine tuning data: prompts paired with completions, this is where a prompt is given as a task and a completion solves this task. This could be something like this:\n","\n","
\n","\n","| Prompt | Completion |\n","| -------- | -------- |\n","| How can I train an LLM? | You need to start with pre-training and then... |\n","| I feel depressed :( | What's wrong? How can I help you? |\n","\n","
\n","\n","This step is just another supervised training step where the value comes from specific, high quality data."],"metadata":{"id":"PbndHlAJSeyk"}},{"cell_type":"markdown","source":["**2. Alignment training**\n","\n","We also expect an LLM not only to produce likely texts, but to be helpful, honest, and, well, “harmless”.\n","\n","Helpfulness can be achieved after Supervised Fine-Tuning, but after that the model can still produce harmful or toxic answers. (And we wouldn’t want ChatGPT to tell us how to assemble a bomb or to write an 18+ movie script, would we?)\n","\n","In other words, we need to ensure that the model aligns with human preferences. There are several ways of doing this; OpenAI used **RLHF** (**Reinforcement Learning on Human Feedback**).\n","\n","We won't go too far into the details about this type of learning, let’s just explain the basics.\n","\n","First, a **reward model** is trained. This is a model that is able to rank different completions of a prompt by assigning each a numerical score indicating how acceptable it is. So, a really toxic completion will receive a very low score, while a helpful, harmless one will get a much larger score.\n","\n","The reward model is trained on triplets (prompt, better completion, worse completion) labeled by human contractors.\n","\n","Next, RLHF trains the model to generate completions, maximizing the reward model score."],"metadata":{"id":"SxUUkQx7s0Ks"}},{"cell_type":"markdown","source":["\n","
\n","\n","*Image from [InstructGPT paper](https://arxiv.org/abs/2203.02155)*\n",""],"metadata":{"id":"VFy4zwEf8Kfn"}},{"cell_type":"markdown","source":["This whole thing is the process of fine-tuning a general GPT model to create a ChatGPT model with more specific goals. We’ll continue overviewing this terminology a little bit later in the lesson.\n","\n","Let's now pause and get a zoomed-out view on ChatGPT, because it’s easy to get lost in all the complex details. In a nutshell, this model is similar to the sentiment prediction linear model from lesson 2; this is a machine learning model with parameters that are trained on some data with a specific goal. ChatGPT obviously seems much more intelligent than a simple linear model, and the magic happens because ChatGPT has more parameters and a bigger training dataset – both larger by several magnitudes. The numbers on the right are for ChatGPT, based on GPT-3.5."],"metadata":{"id":"MLqzU5xG8PEV"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"5Ia7nTzHiKke"}},{"cell_type":"markdown","source":["Note that because ChatGPT is still a machine learning model, there can be some unexpected effects to this, for example:\n","\n","* The model does not actually give the best possible answers for the prompts, it instead tries to give the answer that was more suitable during training. The difference is subtle, but there are some known cases of “weird” behaviour. For example, you can add things like \"take a deep breath before answering\", or, \"I'll tip you $100 if you answer well\", to the prompt, and the results will improve. It's not entirely clear why the model does this, but this could be the result of seeing that humans answer better after those prompts.\n","* Even after Supervised Fine-Tuning and RLHF, the model can still make mistakes or **hallucinate** – in other words, create something entirely new. (For example, if you ask it to name several research papers about elven artifacts of Gondolin, it will probably make up some just to be helpful and nice.)"],"metadata":{"id":"Khd0OOGCkAFe"}},{"cell_type":"markdown","source":["## OpenAI API"],"metadata":{"id":"hLwj8Ufz2qWm"}},{"cell_type":"markdown","source":["Now, you’ve probably already tried out the ChatGPT in action (if not, we highly recommend doing so by visiting [https://chat.openai.com/](https://chat.openai.com/)). Now, we want to show you another way to use it: through OpenAI API, which is a more flexible endpoint for the same thing. Let's try it out, and we’ll also compare how the answers from the base GPT model and ChatGPT model differ.\n","\n","First, we need to install and import the “openai” package which is the wrapper for the API in Python."],"metadata":{"id":"YyP2OIzpntEG"}},{"cell_type":"code","source":["!pip install -qq openai"],"metadata":{"id":"HQp8BOhjhAfc"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["import openai"],"metadata":{"id":"8p9h3atchORt"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["In order to make requests to this API, we need to give it a proper API key so that it knows the account the requests should be attributed to, and charged from. (If you’ve never used the API platform before, you’ll get some free credits for the start that should be sufficient for our lessons.)"],"metadata":{"id":"-M_ZmWc3osiL"}},{"cell_type":"code","source":["OPENAI_API_KEY = \"\"\n","\n","client = openai.OpenAI(api_key=OPENAI_API_KEY)"],"metadata":{"id":"ZThVn_k1hqN9"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["\n","Now, let’s give ChatGPT a prompt and check out its answer. With the API, you need to explicitly state the model you want to use; we’ll be using ChatGPT-3.5 Turbo because it’s much cheaper than the more recent GPT-4o, yet it still provides good answers. 
Here’s the code for that (see the comments for explanations):"],"metadata":{"id":"dSnXx5FVpGDd"}},{"cell_type":"code","source":["response = client.chat.completions.create( # this function is for chat models\n"," model=\"gpt-3.5-turbo\", # this is the name of the model in the system\n"," messages=[ # prompts are given in a structured format of messages\n"," {\n"," \"role\": \"user\", # this means that the message is from user\n"," \"content\": \"Write a haiku\", # the actual text of the prompt\n"," }\n"," ]\n",")\n","response"],"metadata":{"id":"rxUtEvXTsbca"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["\n","The response object contains lots of meta-information, but for our purposes just the text is enough:"],"metadata":{"id":"3BfDGQ-ctJKf"}},{"cell_type":"code","source":["print(response.choices[0].message.content)"],"metadata":{"id":"XSNkOnvttIJY"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["As previously discussed, the ChatGPT model is built on top of a more general GPT model. The only base model available is the one for the GPT-3 model, because there is not much point in using those. But we can get access to that via the API and analyse its responses for a better understanding of ChatGPT:"],"metadata":{"id":"_EWWl0JYuZA6"}},{"cell_type":"code","source":["# this model uses a different, more simple legacy API\n","response = client.completions.create(\n"," model=\"davinci-002\", # the name of the base GPT-3 model\n"," prompt=\"Write a haiku\")\n","print(response.choices[0].text)"],"metadata":{"id":"HtiXtFWiu-xF"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["As you can see, the answer is very strange. This is because this model never learned to follow the instructions; rather, it just tries to continue text generation from a given start instead. From this point of view, the answer makes more sense.\n","\n","Note that if we phrase the prompt as if we are only waiting for completion, both models will do well with it:"],"metadata":{"id":"zfIl84UouZLF"}},{"cell_type":"code","source":["response = client.chat.completions.create(\n"," model=\"gpt-3.5-turbo\",\n"," messages=[\n"," {\"role\": \"user\", \"content\": \"From the moment I walked in, I\"}\n"," ],\n"," max_tokens=200, # let's limit the number of tokens to prevent generation of very long answers\n",")\n","print(response.choices[0].message.content)"],"metadata":{"id":"EZqtDbASv0Ds"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["response = client.completions.create(\n"," model=\"davinci-002\",\n"," prompt=\"From the moment I walked in, I\",\n"," max_tokens=200,\n",")\n","print(response.choices[0].text)"],"metadata":{"id":"UmoINDnTv6p-"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's do a small exercise to get more familiar with the API.\n","\n","There is a parameter in the API called `temperature` that controls the randomness of the answers. Do you recall that the model generates one output token at a time, and each time predicts the probability of each possible next token? 
Temperature controls how the actual generated token is chosen:\n","\n","\n","\n","* If the temperature is 0, the model just takes the token with the highest probability\n","* With smaller temperatures, the model generates a random token, but with a very high preference for high probability tokens (it amplifies their already high probabilities)\n","* Then, with higher temperatures, the model starts generating more low probability tokens, at some point starting to boost lower probability tokens to make them more even with others\n","\n","In the OpenAI API, this is described as a balance between consistency (low temperatures) and diversity or creativity (high temperatures)."],"metadata":{"id":"R01bOLoB6TaN"}},{"cell_type":"markdown","source":["**Exercise.** Try using different temperature values (it has a range from 0 to 2) with a single prompt and analyse how the results change. Use the parameter _n_ to generate several answers with one request."],"metadata":{"id":"XrWzIzrl6YPc"}},{"cell_type":"code","source":["# "],"metadata":{"id":"DMCRAbuPKtxB"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Well, it seems the maximum temperature results are totally out of control (creative), but this is an expected behaviour."],"metadata":{"id":"k2gq_EwIGHfk"}},{"cell_type":"markdown","source":["
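A possible sketch for the temperature exercise above, assuming the `client` object created earlier; the prompt text, the temperature values, and the token limit are arbitrary choices:

# Same prompt at several temperatures; the `n` parameter asks for
# several independent completions in a single request.
for temperature in [0.0, 0.7, 1.5, 2.0]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a haiku about machine learning"}],
        temperature=temperature,
        n=3,            # three completions per request
        max_tokens=60,  # keep the answers short
    )
    print(f"--- temperature = {temperature} ---")
    for choice in response.choices:
        print(choice.message.content)
        print()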
"],"metadata":{"id":"DfwNI6bugfK2"}},{"cell_type":"markdown","source":["If you want to play more with this sampling aspect, there’s another parameter called `top_p` that can cut-off lowest probability tokens (you can read about it in [API documentation](https://platform.openai.com/docs/api-reference/chat/create)).\n","\n","In the following sections, we’ll explore the capabilities of this model and see what applications can be built on top of it."],"metadata":{"id":"iSafb3j_rF1q"}},{"cell_type":"markdown","source":["# 3. In-context Learning"],"metadata":{"id":"yHTsScvuXrXP"}},{"cell_type":"markdown","source":["Before starting to build applications on top of ChatGPT, let's understand one important concept relevant to it: **in-context learning** (or **few-shot learning)**. This is the ability to temporarily learn new things from the data provided in the prompt. Let's see some examples; below is a typical question that pulls the information from all the base knowledge of GPT (with the addition of nice phrasing trained in ChatGPT):\n","\n","**👤 Prompt:**\n","\n","\n","What company created the IPhone?\n","
\n","\n","**🤖 ChatGPT:**\n","\n","\n","The iPhone was created by Apple Inc. Apple is a technology company that designs, manufactures, and markets consumer electronics, computer software, and other products, including the iPhone. The first iPhone was introduced by Apple's co-founder, Steve Jobs, and it was initially released on June 29, 2007. Since then, Apple has continued to release new models of the iPhone, making it one of the most popular and widely used smartphones globally.\n","
\n","\n","\n"],"metadata":{"id":"_gqGzB8-sCLx"}},{"cell_type":"markdown","source":["Now, let's make it harder and make the question a bit cryptic.\n","\n","**👤 Prompt:**\n","\n","\n","In this text hgieugheda means IPhone. What company created hgieugheda?\n","
\n","\n","**🤖 ChatGPT:**\n","\n","\n","The term \"hgieugheda\" does not correspond to any recognizable product or brand name, and as of my last knowledge update in January 2022, it is not associated with the iPhone or any known technology product. If \"hgieugheda\" is a fictional term or code you've created to represent the iPhone in a specific context, it's important to note that in reality, the iPhone is a product of Apple Inc. If there have been developments or changes after my last update, I would not be aware of them.\n","
\n","\n","Despite adding an unnecessary critique of our prompt, the model still managed to give the correct answer to our question. This is a primitive example of in-context learning: the model learned new \"rules\" of the prompt and answered accordingly."],"metadata":{"id":"13Laer5Y4Lp4"}},{"cell_type":"markdown","source":["We can make it even more explicit by giving training examples, and asking for predictions. Let's ask it to classify some food between fruits and vegetables, but without telling the model what the rule is, just assigning values 0 (vegetable) and 1 (fruit) to them:\n","\n","**👤 Prompt:**\n","\n","\n","tomato - 0\n","\n","orange - 1\n","\n","apple - 1\n","\n","cucumber - 0\n","\n","bellpepper - 0\n","\n","Now following the same rule, answer this:\n","\n","pear - ?\n","
\n","\n","**🤖 ChatGPT:**\n","\n","\n","The pattern in the provided list appears to be assigning a value of 1 to fruits and a value of 0 to vegetables. Based on this pattern:\n","\n","pear - 1\n","\n","So, according to the established rule, the value for \"pear\" would be 1.\n","
\n","\n","This is a simple example, but anyway we can treat this as learning and predicting."],"metadata":{"id":"NE_qH91z7mZ3"}},{"cell_type":"markdown","source":["However, there is a limit to how far we can take this:\n","\n","**👤 Prompt:**\n","\n","\n","In this text, iPhone means Microsoft Windows. What company created the iPhone?\n","
\n","\n","**🤖 ChatGPT:**\n","\n","\n","The iPhone was created by Apple Inc., not Microsoft Windows. Apple is the company responsible for designing and manufacturing the iPhone, which is a line of smartphones that run on the iOS operating system developed by Apple. Microsoft Windows, on the other hand, is an operating system developed by Microsoft for personal computers, not mobile devices like the iPhone. If your text is using \"iPhone\" to refer to Microsoft Windows, it may be a misrepresentation or a mistake, as these are products of different companies.\n","
\n","\n","We can see that the model tried really hard to give the correct answer, but the \"memory\" from training data went into conflict with the context, and the \"memory\" prevailed. This once again shows us the limits of what ChatGPT is capable of: it could be hard for it to generate an unlikely answer like \"Microsoft created IPhone\", even though in the given context this is the correct answer.\n"],"metadata":{"id":"Qr2vSHUu7hiH"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"6bNMApxSOLtl"}},{"cell_type":"markdown","source":["**Exercise.** Can you engineer a prompt that will help the model answer this question correctly? Try not to change the question, just add instructions about how to approach it."],"metadata":{"id":"LLpyNy4u5zs-"}},{"cell_type":"markdown","source":["Now, let's take a step back and overview different types of learning for NLP models:\n","\n","\n","\n","1. **Regular ML model training.** We collect a training set, then train the model from scratch on this training set. In the context of GPT, this is the **Pre-Training** stage, where an LLM is trained on a huge text dataset.\n","2. **Fine-tuning of a pre-trained model.** This is a model that was already trained on a general-purpose dataset, which is then tuned on smaller, task-specific data. ChatGPT is tuned to follow instructions during the **Supervised Fine-Tuning** stage and to align with human preferences during the **RLHF** stage. \\\n","(The fine-tuning of the BERT model for sentiment analysis from the previous lesson falls into the same category.)\n","3. **In-context learning.** This refers to a situation where we give the model all the training data (usually this is very small) inside the prompt and want the model to extract patterns from the prompt. This type of learning doesn't involve any model training in the technical sense because the model does not change during this process.\n","\n","At the beginning of the ML revolution, only the first type of training was used; nowadays, all three of them are well established. When going from regular training, to fine-tuning, to context learning, we make more and more shallow changes of the model behaviour, but at the same time, it requires less and less data and computing power. This is very important: training a large model on terabytes of data for half a year is something only few companies can afford!"],"metadata":{"id":"NbuXTSIDxV2k"}},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"weqJor7D9dO8"}},{"cell_type":"markdown","source":["# 4. ChatGPT as a General-Purpose AI\n","\n"],"metadata":{"id":"ohiEMbR5YqrR"}},{"cell_type":"markdown","source":["While it's easy to see that ChatGPT can give helpful answers to the most random questions, let's pose a different question: can it solve any of the practical tasks that other ML applications can? It turns out that, because the core GPT model was trained on a very diverse dataset of texts, the ChatGPT model can indeed solve a lot of text-based tasks.\n","\n","Let's return to our task related to understanding the sentiments from user reviews for various businesses from the Yelp dataset that we previously worked with. As a refresher, let's load dataset and take a look:"],"metadata":{"id":"_dfb8MDKdsM3"}},{"cell_type":"code","source":["!pip install -qq datasets"],"metadata":{"id":"jaSvErKolhyl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["from datasets import load_dataset\n","\n","dataset = load_dataset(\"yelp_review_full\")\n","dataset"],"metadata":{"id":"qS_w5O13llrO"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["import pandas as pd\n","import textwrap\n","\n","some_samples = [0, 72, 167, 213, 316]\n","\n","for ind in some_samples:\n"," print(\"Review:\", textwrap.fill(dataset['train'][ind]['text']))\n"," print(\"Rating:\", int(dataset['train'][ind]['label']) + 1)\n"," print()"],"metadata":{"id":"2N9OHWhYlytt"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now, in order to solve our sentiment analysis task – we’ll just ask ChatGPT to do it!"],"metadata":{"id":"CQD20E0_KIPX"}},{"cell_type":"code","source":["# remove unnecessary whitespaces using dedent function\n","prompt = textwrap.dedent(\"\"\"\n"," Here is a review of some company:\n","\n"," \"dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.\"\n","\n"," Predict how many stars from 1 to 5 this user rated this company.\n","\"\"\")\n","\n","response = client.chat.completions.create(\n"," model=\"gpt-3.5-turbo\",\n"," messages=[{\"role\": \"user\", \"content\": prompt}],\n"," temperature = 0.0)\n","\n","# use fill function for better readability\n","print(textwrap.fill(response.choices[0].message.content))"],"metadata":{"id":"dvMwE3aSKHWn"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["\n","And the answer is correct! Let's try another one with a rating of 1:"],"metadata":{"id":"WDYDnKoOLYNo"}},{"cell_type":"code","source":["# remove unnecessary whitespaces using dedent function\n","prompt = textwrap.dedent(\"\"\"\n"," Here is a review of some company:\n","\n"," \"The worse \"Chinese food\", even the white rice which came with the ogle sweet and sour chicken is really bad. 
Please don't eat this food.\"\n","\n"," Predict how many stars from 1 to 5 this user rated this company.\n","\"\"\")\n","\n","response = client.chat.completions.create(\n"," model=\"gpt-3.5-turbo\",\n"," messages=[{\"role\": \"user\", \"content\": prompt}],\n"," temperature = 0.0)\n","\n","# use fill function for better readability\n","print(textwrap.fill(response.choices[0].message.content))"],"metadata":{"id":"iJcfS4q0KHYq"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Also correct!\n","\n","Naturally, the next step will be to run an evaluation on the entire test dataset. However, we have a problem that needs to be addressed first: there is no uniform structure for the answers, meaning that it will be hard to extract and evaluate them with metrics. So, let’s try asking ChatGPT to format this properly:"],"metadata":{"id":"7t67Bhr1Lyih"}},{"cell_type":"code","source":["# remove unnecessary whitespaces using dedent function\n","prompt = textwrap.dedent(\"\"\"\n"," Here is a review of some company:\n","\n"," \"dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.\"\n","\n"," Predict how many stars from 1 to 5 this user rated this company.\n"," At the end of your answer, give your final decision about the rating in the following format: \"Final Prediction: X stars\"\n","\"\"\")\n","\n","response = client.chat.completions.create(\n"," model=\"gpt-3.5-turbo\",\n"," messages=[{\"role\": \"user\", \"content\": prompt}],\n"," temperature = 0.0)\n","\n","# use fill function for better readability\n","print(textwrap.fill(response.choices[0].message.content))"],"metadata":{"id":"tTzAeARNKHat"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["This seems to be working, though we cannot be sure that the model will be consistent with it."],"metadata":{"id":"W0BNdTOUMR-v"}},{"cell_type":"markdown","source":["**Exercise.** Write a function that takes the text of a review as input and returns the star rating as an integer. If the answer is not in any recognizable format, the function should raise an exception."],"metadata":{"id":"WURtoXFa8_Sd"}},{"cell_type":"code","source":["# Exception class for failed parsing\n","class ReviewParseError(Exception):\n"," pass\n","\n","# passing also the OpenAI client as parameter for better usability\n","def predict_review_rating(text: str, client) -> int:\n","\n"," # \n","\n"," pass"],"metadata":{"id":"QVkumwv19i8f"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's test out how it works:"],"metadata":{"id":"RBUTjjUZ-drW"}},{"cell_type":"code","source":["ind = 0\n","print(\"Review:\", textwrap.fill(dataset['train'][ind]['text']))\n","print(\"Rating:\", int(dataset['train'][ind]['label']) + 1)\n","print()\n","ans = predict_review_rating(dataset['train'][ind]['text'], client)\n","print(\"Answer:\", ans)"],"metadata":{"id":"PspGfFrLbKBQ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["**Exercise.** Evaluate this prediction function on the dataset. 
Determine both the accuracy and the percentage of parsing errors.\n","\n","_Note: This function will probably be very slow, but we don't want to bother too much with speed for now, so you can use around 100 examples for simplicity’s sake._\n","\n","\n"," Hint (click to expand).
\n","\n","Setting the temperature to 0 may help a lot.\n","\n"," "],"metadata":{"id":"lEuKFYko-rFr"}},{"cell_type":"code","source":["# "],"metadata":{"id":"mw_FhhlOPFGv"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["In our solution, we ended up with 65% accuracy and a 0% failure rate. However, since the ChatGPT is updated regularly, your results may vary. In any case, they are probably similar, and it’s quite impressive that this model has an accuracy level higher than the linear one, considering that there was no actual training performed, just a little bit of in-context learning."],"metadata":{"id":"w4OtDIiq6lDJ"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"ON7qRGKvz8gb"}},{"cell_type":"markdown","source":["There is one thing worth mentioning however: this experiment, making predictions with out-of-the-box ChatGPT, may show an incorrect result. That is because the dataset could have **leaked** into the GPT training data. We can't know this for certain, but given that the model was trained on a huge dataset of texts sourced from the Internet, (and given that Yelp dataset is not very novel), there’s a good chance that it ended up in the training data. The model hasn't been trained to predict the sentiment of these texts, but it may have seen the texts and their labels somewhere in close context.\n","\n","Potential test data leakage is actually a huge problem for benchmarking contemporary LLMs. Most of the state-of-the-art models are regularly updated with new data downloaded from the Internet so that they can answer prompts with more up-to-date information (and also just to get even more training data). So, if someone creates an open benchmark for testing LLMs, it will leak into the training data soon enough, and thus, invalidate all the testing. Unfortunately, as for now, there is no set protocol on how to do these things correctly."],"metadata":{"id":"o32SCHLDeOSK"}},{"cell_type":"markdown","source":["# 5. LangChain + OpenAI API\n"],"metadata":{"id":"aUIQ8Pf5YgxR"}},{"cell_type":"markdown","source":["## LangChain Basics"],"metadata":{"id":"DovnY0nYZGBP"}},{"cell_type":"markdown","source":["Now that we've tried writing applications on top of ChatGPT, let's learn how to do this using some even more powerful tools. **LangChain** is one of those tools, and it helps us build applications on top of language models. First, let's install it and set it up."],"metadata":{"id":"QbE-PUSrZPRW"}},{"cell_type":"code","source":["!pip install -qq langchain langchain-openai"],"metadata":{"id":"ay6uAs8ZsFYJ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Using this library, you can build what are called **chains**, that is, different steps for processing LLMs. Let's build a simple chain for our Yelp review prediction. When using the LangChain library our chain will contain 3 parts:\n","\n","1. A prompt template for converting input data (review) into the LLM prompt\n","2. An LLM interface for processing the prompt\n","3. An output parser for converting the LLMs output into a proper data type\n","\n","As you can see, this is very similar to what we've done before, just split into more standardized blocks.\n","\n","So, first, let's build a prompt template. 
And for better parsing quality, let's ask ChatGPT to output the results in JSON format, instead of plain text:"],"metadata":{"id":"KOgcieFXal3Y"}},{"cell_type":"code","source":["# just a general template for prompts\n","from langchain_core.prompts import ChatPromptTemplate\n","\n","# adding all our prompt text, plus a {review} template variable\n","json_prompt = ChatPromptTemplate.from_template(\n"," textwrap.dedent(\"\"\"Here is a review of some company:\n","\n"," \"{review}\"\n","\n"," Return a JSON object with a `rating` key that stores a\n"," prediction of how many stars from 1 to 5 this user rated this company.\n"," \"\"\")\n",")"],"metadata":{"id":"XrAlVJCXbxCR"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Next, let's create an LLM, which is basically just ChatGPT wrapped in a LangChain class:"],"metadata":{"id":"8o--0czTcJ5N"}},{"cell_type":"code","source":["from langchain_openai import ChatOpenAI\n","\n","llm = ChatOpenAI(model=\"gpt-3.5-turbo\",\n"," temperature=0.0,\n"," openai_api_key=OPENAI_API_KEY)"],"metadata":{"id":"A7p8xPmraan1"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Then, we’ll import an output parser, which is just an out-of-the box feature in LangChain for parsing JSON:"],"metadata":{"id":"iW5wtJjGce8j"}},{"cell_type":"code","source":["from langchain.output_parsers.json import SimpleJsonOutputParser\n","\n","json_parser = SimpleJsonOutputParser()"],"metadata":{"id":"xkX0ClRWcpa0"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now, the last step – connecting all the elements into one chain using this neat syntax:"],"metadata":{"id":"GDoOHIRsctWp"}},{"cell_type":"code","source":["chain = json_prompt | llm | json_parser"],"metadata":{"id":"mH_b4vvU1Pqi"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's test it! To run a chain, we call the function `invoke`, and we pass it our named parameters from the prompt:"],"metadata":{"id":"neUd4wh7c3ZY"}},{"cell_type":"code","source":["chain.invoke({\"review\": dataset['train'][0]['text']})"],"metadata":{"id":"wQyFbOtb0U5J"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["**Exercise.** Evaluate this new function and find out its accuracy and failure rate. Again, to save time you can just use 100 examples)."],"metadata":{"id":"swI6ZeV4P2Br"}},{"cell_type":"code","source":["# "],"metadata":{"id":"JTummTsnQGsm"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["acc, fail = evaluate_chatgpt_predictions_2(dataset['test'], chain, subset_size=100)\n","print()\n","print(\"Accuracy: \", acc)\n","print(\"Failed: \", fail)"],"metadata":{"id":"l8GcpWrXTb4e"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["The accuracy was probably very similar to those we saw before, but with much less manual parsing needed.\n","\n","There is also a way to improve LLM prediction results by using a technique called **Chain of Thought(s)** (CoT). This is when we provide some additional text in the prompt that helps the model elaborate more on the answer before arriving at the final result. In this task with Yelp, it doesn't seem to help much, but we’ll demonstrate how to do it with LangChain anyway because it’s a useful tool.\n","\n","We’ll use the simplest version of CoT by adding the phrase \"Let's think step by step\" to our prompt. This will encourage the model to elaborate more, but it will also ruin the JSON format. 
So, we’ll need to pass the output again to the LLM and ask it to build the answer in JSON format; this is done by creating two chains and passing one to the other:"],"metadata":{"id":"B__X-ytrNdmn"}},{"cell_type":"code","source":["from operator import itemgetter\n","from langchain_core.output_parsers import StrOutputParser\n","\n","# first prompt with elaboration and prediction\n","prompt_1 = ChatPromptTemplate.from_template(\n"," textwrap.dedent(\"\"\"Here is a review of some company:\n","\n"," \"{review}\"\n","\n"," You need to analyse the sentiment of this text. Then make a\n"," prediction of how many stars from 1 to 5 this user rated this company.\n"," Let's think step by step.\n"," \"\"\")\n",")\n","\n","# chain with simple string output\n","chain_1 = prompt_1 | llm | StrOutputParser()\n","\n","# second prompt just for JSON extraction\n","prompt_2 = ChatPromptTemplate.from_template(\n"," textwrap.dedent(\"\"\"\n"," Here is the analysis of the sentiment of a user review:\n","\n"," \"{cot}\"\n","\n"," Return a JSON object with a `rating` key that stores a\n"," prediction of how many stars from 1 to 5 this user rated this company.\n"," \"\"\")\n",")\n","\n","# second chain using syntax for passing results from one chain to another\n","chain_2 = {\"cot\": chain_1} | prompt_2 | llm | json_parser"],"metadata":{"id":"03EnKpOTUCjd"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's test how it works:"],"metadata":{"id":"4jorv8Pe9mZh"}},{"cell_type":"code","source":["chain_2.invoke({\"review\": dataset['train'][0]['text']})"],"metadata":{"id":"8ssubJ_hR5GL"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["acc, fail = evaluate_chatgpt_predictions_2(dataset['test'], chain_2, subset_size=100)\n","print()\n","print(\"Accuracy: \", acc)\n","print(\"Failed: \", fail)"],"metadata":{"id":"rLPZw4xJSgfX"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["**Bonus exercise.** Can you make use of this technique to actually improve the quality? Here’s [a guide on CoT](https://deepgram.com/learn/chain-of-thought-prompting-guide) that you can peruse if needed.\n","\n","_Note: the smaller the change in quality, the more test examples you’ll need in order to measure it reliably._"],"metadata":{"id":"ouQS6nbP7ibu"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"p1CtPwPMXRdo"}},{"cell_type":"markdown","source":["## RAG with LangChain"],"metadata":{"id":"ITDcwTDJZAih"}},{"cell_type":"markdown","source":["Just to see how powerful LangChain is, let's implement another thing using it: **Retrieval Augmented Generation**, or **RAG**. The idea behind this approach is to improve the LLMs question-answering abilities by adding search results from a search engine or a database as additional information. This way, we utilize both the LLMs ability to generate intelligent answers, and we also make sure that the information it uses is as up-to-date as possible.\n","\n","A simple RAG in LangChain will work almost out-of-the-box, so let's make one, and get a general overview of how it works.\n","\n","We’ll need to know about two new concepts for this task: **agents** and **tools**. Essentially, an agent is a high-level system that uses an LLM together with prompt templates and various tools to solve complex tasks. Tools are things like web search, Python code, various APIs (also known as \"plugins\" in ChatGPT web UI).\n","\n","So, in our case, we’ll just need to create an agent, and provide it with a web search tool and the appropriate prompts. We’ll make use DuckDuckGo search and a pre-made prompt from LangChain Hub designed for these kinds of agents:"],"metadata":{"id":"1vLpP9bjeHzp"}},{"cell_type":"code","source":["!pip install -qq duckduckgo-search langchainhub langchain-community"],"metadata":{"id":"IyPuxnrj8Pr1"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["from langchain.agents import AgentExecutor, create_react_agent\n","from langchain import hub\n","from langchain.tools import DuckDuckGoSearchRun\n","\n","# just loading a prompt for ReAct agents (Reasoning + Acting)\n","prompt = hub.pull(\"hwchase17/react\")\n","\n","# using only one tool: web search\n","tools = [DuckDuckGoSearchRun()]\n","\n","# pass everything to an agent (llm was already defined before)\n","agent = create_react_agent(llm, tools, prompt)\n","# just another wrapper, but we are setting verbose to true to see the details\n","agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)"],"metadata":{"id":"_XVieslPfyhz"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now, let's run the agent with a query that involves looking up some up-to-date information:"],"metadata":{"id":"L4UCSYd7BWrs"}},{"cell_type":"code","source":["agent_executor.invoke({\n"," \"input\": \"What is the most interesting scientific discovery of 2024?\"})"],"metadata":{"id":"ik0GG-FLge8x"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["From the output you can see that the agent:\n","\n","\n","\n","1. Understood what it needs to search for, and created a search query\n","2. Got the search results from the search engine\n","3. Analyzed the results and generated the answer\n","\n","So, there’s a simple way to upgrade a search engine with LLMs, but even more complex RAG's are getting lots of action in many applications right now!"],"metadata":{"id":"MAMwqKM4Drq0"}},{"cell_type":"markdown","source":["# 6. 
Conclusion"],"metadata":{"id":"367f8e3QaDik"}},{"cell_type":"markdown","source":["In this lesson, we talked a lot about ChatGPT and its applications, so let's wrap up what we've learned:\n","\n","\n","\n","* We understood how the GPT model is built using **transformer** architecture and how ChatGPT is built on top of it\n","* We learned about different kinds of model learning, and most interestingly for LLMs, **in-context learning**\n","* We practised writing applications on top of LLMs, using both a basic **API** and the **LangChain** library\n","\n","Up until this point, we’ve mostly been talking about NLP tasks, but in the [**next lesson**](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_5.ipynb), we’ll switch to a new kind of task: **image classification**."],"metadata":{"id":"sKz0ca90aG4I"}},{"cell_type":"markdown","source":["By the way, if you’re interested, here are some additional resources on the things we talked about in this lesson:\n","\n","* This [blog post from OpenAI](https://openai.com/blog/chatgpt) describes the ChatGPT model\n","* See a [GPT-2 implementation](https://jaykmody.com/blog/gpt-from-scratch/) from scratch in NumPy\n","* A [guide](https://deepgram.com/learn/chain-of-thought-prompting-guide) on the Chain-of-Thought technique\n","* A [simple guide](https://huggingface.co/learn/cookbook/en/rag_zephyr_langchain) on building RAGs for more specific applications than web search"],"metadata":{"id":"wP7uohwtWK4a"}}]}
2 |
--------------------------------------------------------------------------------
/notebooks/lesson_5.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[{"file_id":"1bmzk7M4XXQAF5DTv_t9epyJTbyfqyd06","timestamp":1708076437293},{"file_id":"1hY8zyxoeKa6tPRIOj2dIdN7uvCMRZaXx","timestamp":1692272241413}],"machine_shape":"hm","gpuType":"T4","toc_visible":true},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","source":["
\n","\n","\n","Lesson 5: Computer Vision\n","\n","\n","\n","\n","Applying what we've learned to a new task\n","\n",""],"metadata":{"id":"KGo6NsSJIkF4"}},{"cell_type":"markdown","source":["# 1. Introduction"],"metadata":{"id":"JriIf2J2I1Fo"}},{"cell_type":"markdown","source":["## Image Classification\n"],"metadata":{"id":"K2deQJgsETtN"}},{"cell_type":"markdown","source":["In this lesson, we’ll switch to a new task: **image classification**. Image classification is essentially the same as text classification, just with images as input. We’ll be working with a dataset that contains pictures of outdoor scenes, while the task will be understanding what is shown in the picture. Take a look at some examples:"],"metadata":{"id":"AM-VQh1MeXUl"}},{"cell_type":"markdown","source":["
\n","\n","This dataset is based on the [Kaggle Scene Classification dataset](https://www.kaggle.com/datasets/nitishabharathi/scene-classification)\n","\n","\n","\n"],"metadata":{"id":"Dy5u_OJ0_FSs"}},{"cell_type":"markdown","source":["So, to put it in more formal terms we first encountered in lesson 1, for our task:\n","\n","* The **objects** (inputs to the ML model) are images\n","* The **targets** (values we want to predict) are the classes (a number from 1-6, e.g. 1 - Building, 2 - Forest, and so on)"],"metadata":{"id":"u7cq1OykCC11"}},{"cell_type":"markdown","source":["## Lesson Overview"],"metadata":{"id":"nYDOYyrYFLFM"}},{"cell_type":"markdown","source":["We’ll try solving this image classification task using three different approaches:\n","\n","1. **Training** a neural network for image classification from scratch\n","2. Fine-tuning a network trained for a different, but similar, task; this technique is called **transfer learning**\n","3. Using the **CLIP** text-and-image model in a **zero-shot learning** setup\n","\n","Let’s note that this exercise was intentionally devised to be similar to the three ways of learning we saw in the previous lesson (training, fine-tuning, in-context learning), and we’ll explain this connection in more detail later!"],"metadata":{"id":"w-lZvAwqFM0Y"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"yhgb1ZwUOjRl"}},{"cell_type":"markdown","source":["So, in this lesson, you’ll:\n","\n","* Learn a **Vision Transformer** (ViT), and how to train it using HuggingFace\n","* Understand what **transfer learning** is and how to apply it to an image classification task\n","* Become familiar with the **CLIP** model and practice building classifiers with it in a **zero-shot** manner\n","\n","You can find the notebook with solutions [here](https://colab.research.google.com/drive/15s6SBd9ftYoM9QStvZVzWwlUZ5MB9Tp8?usp=sharing)."],"metadata":{"id":"4enbzJjAIqiR"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"qnjNxqvk1Rli"}},{"cell_type":"markdown","source":["# 2. Vision Transformer"],"metadata":{"id":"NQ_-GxM4Qn_Q"}},{"cell_type":"markdown","source":["## Loading and Exploring the Data"],"metadata":{"id":"T6X-5egu_CFd"}},{"cell_type":"markdown","source":["First, let's set up some of the general stuff that will be needed in the lesson:"],"metadata":{"id":"nvKuEmCjCFl2"}},{"cell_type":"code","source":["! pip install -qq transformers[torch] datasets"],"metadata":{"id":"v7Y8sroDUjvt"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"lF_rCVVZT3Hh"},"outputs":[],"source":["import numpy as np\n","import matplotlib.pyplot as plt\n","from tqdm.notebook import tqdm\n","\n","from PIL import Image\n","import os\n","\n","import torch\n","\n","# checking availability of GPU\n","device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","device"]},{"cell_type":"markdown","source":["Next, let's download the dataset and extract it. Note that we’re using a smaller version of the dataset here to speed up the training:"],"metadata":{"id":"FOSeuYWtSjPu"}},{"cell_type":"code","source":["# library for downloading from google drive\n","! pip install gdown\n","\n","# download this file by it's id:\n","# https://drive.google.com/file/d/1ez8ZDk94GtjxnN-D58CBF8flKuold0ZA/view?usp=drive_link\n","! gdown --id 1ez8ZDk94GtjxnN-D58CBF8flKuold0ZA -O scenes-tiny.zip"],"metadata":{"id":"OZxqck5s4jCe"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["! unzip -qq scenes-tiny.zip"],"metadata":{"id":"Wo_S46Qe4ksH"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["After running the code above, you should have a folder called \"scenes-small\" (if you are working in Google Colab, it will appear in your \"Files\" tab). This folder contains all the data in a format compatible with the HuggingFace dataset libraries. Let's take a look inside.\n","\n","The dataset is split into three parts: train and test, which you are already familiar with; as well as val, which stands for **validation**. This third set, validation, is an additional layer between the training and test set, and is usually used to compute metrics during training and for choosing the epoch at which to stop training; this helps with avoiding overfitting on the test set.\n","\n","Each of the sets have a separate directory with the following structure:\n","\n","
\n","\n","```\n","scenes-small/train\n"," |-- Building/\n"," |---- 7262.jpg\n"," |---- 7267.jpg\n"," ...\n"," |-- Forest/\n"," |---- 6805.jpg\n"," |---- 6812.jpg\n"," ...\n"," |-- Street/\n"," |---- 6158.jpg\n"," |---- 6160.jpg\n"," ...\n","```"],"metadata":{"id":"XNIZjf4VSQWF"}},{"cell_type":"markdown","source":["So, inside each main directory (\"scenes-small/train\", \"scenes-small/val\" or \"scenes-small/test\") there are 6 folders, each corresponding with one of the classes of scenes (\"Building\", \"Forest\", and so on). Each subdirectory contains pictures of the corresponding scenes.\n","\n","We can now load the dataset using the `load_dataset` function from Hugging Face using `imagefolder` as dataset name:"],"metadata":{"id":"ENAK1UUCUAz_"}},{"cell_type":"code","source":["from datasets import load_dataset\n","\n","dataset = load_dataset(\"imagefolder\", data_dir=\"./scenes-tiny\")"],"metadata":{"id":"-L-XA0sx4ozH"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["dataset"],"metadata":{"id":"Yzcey0XmKEvI"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's take a look at an example image and its label:"],"metadata":{"id":"eTaurW8LUii1"}},{"cell_type":"code","source":["# you can change num_image to view different images from data\n","num_image = 0\n","image = dataset['train'][num_image]['image']\n","image"],"metadata":{"id":"AnbAvK5oW_8l"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["label = dataset['train'][num_image]['label']\n","label"],"metadata":{"id":"9sxW3A4zVfZp"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We can also use the class names in the dataset, to make it more human readable:"],"metadata":{"id":"Dk-uRM7fVhpx"}},{"cell_type":"code","source":["class_names = dataset['train'].features['label'].names\n","class_names"],"metadata":{"id":"l9qvRXOdhqmn"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["label = dataset['train'][num_image]['label']\n","class_names[label]"],"metadata":{"id":"GE_hGaP3czM-"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Before you start training any model, it’s always a good idea to take a look at the data samples. With this in mind, here’s a function that displays a random set of images in a grid:"],"metadata":{"id":"FL5pk5CvhmWb"}},{"cell_type":"code","source":["def show_random_images(data):\n","\n"," # create a grid\n"," f, axes_grid = plt.subplots(4, 4, figsize=(7, 7))\n","\n"," # shuffling image indices\n"," image_indices = list(range(len(data)))\n"," np.random.shuffle(image_indices)\n","\n"," i = 0\n"," for axes_row in axes_grid:\n"," for ax in axes_row:\n"," img = data[image_indices[i]]['image']\n"," label = data[image_indices[i]]['label']\n","\n"," # display image\n"," ax.imshow(img)\n"," # set a label of an image as a title\n"," ax.set_title(class_names[label])\n"," # remove axis numerations for nicer view\n"," ax.get_xaxis().set_visible(False)\n"," ax.get_yaxis().set_visible(False)\n","\n"," i += 1\n","\n"," # show resulting grid\n"," plt.show()"],"metadata":{"id":"gwDWCpPIRtSh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["show_random_images(dataset['train'])"],"metadata":{"id":"CDo8hT1AiRCK"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Great! 
We’re ready to train a neural network model on our data."],"metadata":{"id":"8GCtS08VY9r8"}},{"cell_type":"markdown","source":["## What is a Vision Transformer?"],"metadata":{"id":"SnJOHfCKKiJw"}},{"cell_type":"markdown","source":["For our first approach, we’ll train a **Vision Transformer (ViT)** model on our dataset for image classification from scratch.\n","\n","Let's first take a moment to understand how ViT works. Basically, it does what its name implies and applies the **Transformer** architecture to computer vision tasks:\n","\n","1. The image is split into small **patches**, which are treated like the sequences of **tokens** we used for texts\n","2. Patches are **embedded** into vectors together with positional embeddings\n","3. The tokens are passed through a **transformer encoder** (multiple layers), just like the one we used in previous lessons for BERT and GPT\n","4. The final output embeddings go through additional **fully-connected feed-forward** layers (also known as a multilayer perceptron, or MLP, the oldest and most universal neural network architecture), and a **softmax** function to generate class probabilities\n","\n","A small code sketch of the first two steps (patching and embedding) is shown below."],"metadata":{"id":"OO7kIenDQspr"}},{"cell_type":"markdown","source":["
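To make steps 1 and 2 above more concrete, here is a rough PyTorch sketch of how an image can be cut into patches and turned into a sequence of embedding vectors. The sizes (224×224 images, 16×16 patches, 768-dimensional embeddings) follow the default ViT configuration, but this is only an illustration of the idea, not how the `transformers` library implements it internally.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: cutting an image into patches and embedding them.
# The sizes match the default ViT configuration; the transformers library does
# all of this internally, and the real model also prepends a special [CLS] token.

image = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_size, hidden_size = 16, 768
num_patches = (224 // patch_size) ** 2        # 14 * 14 = 196 patches

# step 1: split the image into non-overlapping 16x16 patches and flatten each one
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.reshape(1, 3, num_patches, patch_size * patch_size)
patches = patches.permute(0, 2, 1, 3).reshape(1, num_patches, -1)   # (1, 196, 768)

# step 2: embed each flattened patch linearly and add learnable position embeddings
patch_embedding = nn.Linear(3 * patch_size * patch_size, hidden_size)
position_embedding = nn.Parameter(torch.zeros(1, num_patches, hidden_size))
tokens = patch_embedding(patches) + position_embedding

print(tokens.shape)   # torch.Size([1, 196, 768]): a "sentence" of 196 patch tokens
```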
\n","\n","*Image from the original paper*\n",""],"metadata":{"id":"6vUs6bz8Dy-n"}},{"cell_type":"markdown","source":["As a reminder: the main ingredient in the transformer architecture is the operation called **attention** (here, it's \"multi-head attention\", but the difference isn’t important). Attention works by matching together queries and keys produced from different tokens (with linear multiplications alone) and generating outputs based on how similar they are. Here’s an overview of this concept from lesson 3:"],"metadata":{"id":"wZPEryDMfyvM"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"ckNGrUzobMJ2"}},{"cell_type":"markdown","source":["\n"," Why not convolutional architecture? (click to expand).
\n"," \n","Maybe you've heard about **convolutional neural networks (CNN)**. These were very popular in computer vision before transformers and are still used in many applications. However, we opted not to use them for this lesson. First of all, this is because the details about how CNN architectures work are beyond the scope of this material. Additionally, we think our approach is worthwhile because, these days, transformers are becoming very competitive in the world of computer vision.\n"," "],"metadata":{"id":"reHu3aAEMrin"}},{"cell_type":"markdown","source":["## Training the Model"],"metadata":{"id":"FkGc2PHdoi4t"}},{"cell_type":"markdown","source":["We’ll use the ViT model from the transformers library. This will be similar to how we trained the BERT model in lesson 3, however we won’t be using auto-model to automatically configure the mode, so we’ll also need to create a config class for the model."],"metadata":{"id":"xeKR874mXir8"}},{"cell_type":"code","source":["from transformers import ViTConfig, ViTForImageClassification"],"metadata":{"id":"ag9BJ7EzXsW9"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# using default configuration with a small update\n","configuration = ViTConfig()\n","configuration.num_labels = len(class_names)\n","\n","# creating a ViT model with configuration\n","model = ViTForImageClassification(configuration).to(device)"],"metadata":{"id":"2HjW-Ilbbnmn"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["When working with images, we also need to define the image transformations that we’ll use. Transformations usually include things like image resizing, value normalization, and so on. We won’t go too much into how to set all of this up, and instead, we’ll just use a default ViT image processor."],"metadata":{"id":"PpyElVHzkqEb"}},{"cell_type":"code","source":["from transformers import ViTImageProcessor\n","\n","image_processor = ViTImageProcessor()"],"metadata":{"id":"kncquKPJJTnA"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["\n","We also need to make a separate function for dataset transformation:"],"metadata":{"id":"y-JZwwcCh-H4"}},{"cell_type":"code","source":["def transform(dataset):\n","\n"," \"\"\"\n"," A function that takes a batch of data a of a format\n"," {\n"," 'image': [list of images],\n"," 'label': [list of labels]\n"," }\n"," and transforms all the images in it using image_processor.\n"," \"\"\"\n","\n"," # Transform images in dataset using image_processor\n"," transformed_dataset = image_processor(dataset['image'])\n","\n"," # setting labels of newly transformed dataset with labels from initial dataset\n"," transformed_dataset['label'] = dataset['label']\n","\n"," return transformed_dataset\n","\n","\n","transformed_dataset = dataset.with_transform(transform)"],"metadata":{"id":"n9srj5a-oGM9"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["The final thing that we need to do before training is to define the function to compute metrics. 
In image classification, we can use exactly the same **accuracy** metric as before with text (the ratio of the correctly classified objects)."],"metadata":{"id":"4trNAcyYlnH9"}},{"cell_type":"code","source":["from datasets import load_metric\n","\n","def compute_metrics(eval_pred):\n"," accuracy = load_metric(\"accuracy\", trust_remote_code=True)\n","\n"," logits, labels = eval_pred\n"," predictions = np.argmax(logits, axis=-1)\n"," acc = accuracy.compute(predictions=predictions, references=labels)[\"accuracy\"]\n"," return {\"accuracy\": acc}"],"metadata":{"id":"20uEyqp7URER"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now, before we start training the model, let’s review the training algorithm: **gradient descent**. This will help us better understand what’s going on during training.\n","\n","The algorithm takes two hyperparameters: the number of steps as $N$ and learning rate $\\alpha$, and works like this:\n","\n","\n","\n","1. Choose a random starting value for the weight vector $w$\n","\n","2. Repeat the following $N$ times:\n"," * Compute the loss function $L(w)$ and its gradient $\\nabla L(w)$.\n","\n"," * Update the weights by moving them into the direction of negative gradient, multiplying it by learning rate:\n","\n"," $$\n"," w = w - \\alpha \\nabla L(w)\n"," $$\n","\n","\n","Recall that the **learning rate** hyperparameter controls how large the weight updates will be and it can have a lot of influence on training. If it’s too small, each step will be small and the overall training will be slow. On the other hand, if it is too big, the steps may overshoot into making the loss higher, and training may fail. The typical values for the learning rate are between $10^{-5}$ and $10^{-2}$."],"metadata":{"id":"Tj0NqkclgVwy"}},{"cell_type":"markdown","source":["**Exercise.** Your task is to train the model and achieve at least 50% accuracy. We've set up all the code for training and evaluation, but you need to find the correct training hyperparameters. Here, the most important ones are the learning rate and the number of epochs, however, if you wish, you may also play with other hyperparameters.\n","\n","_Note: it should take no more than 10 minutes to train the model and get the desired accuracy._"],"metadata":{"id":"KDT_SZxw9YwJ"}},{"cell_type":"markdown","source":["\n"," Hint (click to expand).
\n"," \n","First, try increasing the number of epochs to be able to see how the accuracy changes over time. Then, play with the learning rate following the explanations above.\n"," "],"metadata":{"id":"LPwcuqBLCt32"}},{"cell_type":"code","source":["# edit the code below code\n","\n","from transformers import TrainingArguments, Trainer\n","\n","# re-run the model creation for multiple experiments\n","model = ViTForImageClassification(configuration).to(device)\n","\n","training_args = TrainingArguments(\n"," output_dir=\"./output\",\n"," overwrite_output_dir=True,\n"," per_device_train_batch_size=64,\n"," evaluation_strategy=\"steps\",\n"," logging_steps=10,\n"," eval_steps=10,\n"," remove_unused_columns=False, # utility argument for easier data processing\n","\n"," learning_rate=0.05,\n"," num_train_epochs=5,\n",")\n","\n","trainer = Trainer(\n"," model=model,\n"," args=training_args,\n"," compute_metrics=compute_metrics,\n"," train_dataset=transformed_dataset[\"train\"],\n"," eval_dataset=transformed_dataset[\"validation\"],\n",")\n","\n","train_results = trainer.train()\n","trainer.save_model()\n","print()\n","\n","metrics = trainer.evaluate(transformed_dataset['train'])\n","trainer.log_metrics(\"train\", metrics)\n","\n","metrics = trainer.evaluate(transformed_dataset['test'])\n","trainer.log_metrics(\"test\", metrics)"],"metadata":{"id":"OxkBt0CdPzZp"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We didn’t get the best results, and the network has already started to **overfit** (reminder: this means that the quality on the training set is better than the quality on the testing set). This is mainly due to the low amount of data for this rather complex task. In the next section, we’ll learn how to use another approach called transfer learning which will help us get a better performing model in less time and effort."],"metadata":{"id":"se1WpeeTfrXJ"}},{"cell_type":"markdown","source":["**Bonus task.** You can try to play with this model and see how high you can push its quality.\n","\n","\n","\n"," Some ideas on what you can do (click to expand).
\n","\n","\n","1. Change the number of layers or attention heads in the model by adjusting the `num_attention_heads` and `num_hidden_layers` parameters in the model configurations. You can use common sense here: if your model is easily overfitting, try to make the model smaller; if your model is underfitting, try to make the model bigger.\n","2. Adjust the size of the input images. If your model is easily overfitting, it's a good idea to try smaller image sizes (via the `image size` parameter in the configuration).\n","3. Add **data augmentation** to the training dataset. This is a powerful technique that adds randomized data transformations to make it look like the dataset is much larger. Here are some guides on data augmentation:\n"," - [What is image augmentation?](https://albumentations.ai/docs/introduction/image_augmentation/)\n"," - [PyTorch documentation on augmentation](https://pytorch.org/vision/main/transforms.html)\n"," - [ Augmentation techniques in PyTorch illustrated](https://pytorch.org/vision/main/auto_examples/plot_transforms.html#sphx-glr-auto-examples-plot-transforms-py)\n"," - In order to add augmentation to our model, you'll need to add augmenting transformations to our \"transform\" function. You can find an example of how to do that [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb)\n","\n"," "],"metadata":{"id":"KEHzdIT2gUKG"}},{"cell_type":"markdown","source":["# 3. Transfer Learning"],"metadata":{"id":"UyKu4rwYGYWi"}},{"cell_type":"markdown","source":["## Why do we need transfer learning?"],"metadata":{"id":"j9LeGHTbtWs1"}},{"cell_type":"markdown","source":["The big problem for many machine learning tasks is that there is **never enough data**. Consider these tasks:\n","\n","\n","\n","1. **Medical image classification**. As an example, take tumor detection in MRI images. This is a much harder task than just understanding a scene (e.g. distinguishing mountains from forests). Further MRI images can be taken in a lot of ways, and tumors come in all shapes and sizes. At the same time, the amount of data is very small, because labeling each sample takes a lot of work and involves highly qualified professionals.\n","2. **Machine translation for rare languages**. In order to train the model for direct translation, we need pairs of sentences in one language with its analogous translation in the second language. This is a difficult task by itself, even for languages with lots of translation examples, like English and French. If we want to train the model for some less common language, such as from Vietnamese to English, the amount of data shrinks considerably. Pairing two non-common languages makes the problem even more difficult!\n","\n","In general, almost any task can benefit from more data, so that's why transfer learning is such a useful tool! Now that we know why it’s useful, let’s try to understand how it works."],"metadata":{"id":"gMxcqEsGO_An"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"I04_V0bnUFqw"}},{"cell_type":"markdown","source":["## Fine-tuning as Transfer Learning"],"metadata":{"id":"1owV5q3kltyG"}},{"cell_type":"markdown","source":["Transfer learning is where we train a model on one dataset, then transfer the knowledge to another, related dataset. The simplest way to do this consists of two steps:\n","\n","\n","\n","1. **Pre-training** a general model on a big dataset that is only vaguely related to our specific task\n","2. **Fine-tuning** this model on a smaller dataset for our specific task\n","\n","We've already talked about this approach a little bit in previous lessons, now we want to go deeper.\n","\n","Let's discuss how transfer learning works in practice by looking at our task of scene classification.\n","\n","First, we need a dataset for **pre-training**, and it needs to be large and slightly relevant to our task. One possible option is to collect a large photo dataset composed of any kind of objects from online, no matter the context, then train a model to classify them (spoiler: a dataset like this does exist). The model wouldn't be able to predict what we want with a high level of precision, but at least it will be able to learn the basics about how the images are structured.\n","\n","Next, we take this model and **fine-tune** it on our smaller dataset, in a manner designed to specifically understand scenes from a small set of classes. This is the same model training, but starting with the weights acquired on the previous step, and with a much smaller amount of epochs."],"metadata":{"id":"yGMVPu6rUbZx"}},{"cell_type":"markdown","source":["\n","
"],"metadata":{"id":"yRKGHA9CM7ce"}},{"cell_type":"markdown","source":["To try to develop some intuition for how this fine-tuning technique works, consider the following: let's suppose you want to learn to drive a motorcycle. Compared to starting from scratch, it would be easier to learn if you already know how to drive a car because you already know how traffic regulations work, have some basic intuition about driving vehicles, and know how to understand and analyze what’s happening on the road. This is fundamental knowledge which you gain by driving any kind of vehicle (**pre-training**), so then, when switching from one vehicle to another, you only need to learn the residual aspects (**fine-tuning**). So, in the end, you'll require fewer lessons (\"less data\") to properly learn what you need to know.\n","\n","It’s obvious that, the **closer the two datasets** are (for pre-training and for fine-tuning), the better transfer learning will work. For example, if you know Spanish, it would be easier for you to learn Portuguese than Chinese. That said, in practice, transfer learning works well even for quite **different datasets**! This happens because low-level data details (e.g. images have contours that convey important information) are still quite universal, and make up a big portion of the difficulty for any task.\n","\n","Also, note that even if you have a fair amount of data for your task, it can still be a good idea to apply the transfer learning approach. As we said earlier, there is rarely enough data in the world of machine learning!"],"metadata":{"id":"QqpzYZwHiX5h"}},{"cell_type":"markdown","source":["## Transfer Learning for ViT"],"metadata":{"id":"cI4wJVfbtvXr"}},{"cell_type":"markdown","source":["Now, let's see how transfer learning for images works in our scene understanding task. We need a **big, general-task dataset** for pretraining – and luckily there is one we can easily use!\n","\n","That dataset is called **ImageNet**, and it is made up of images collected from the Internet, each labeled with information about what is being displayed. This dataset is very popular in computer vision because of its size and generality. We’ll be using a larger version of this dataset called Imagenet-21k; it has 14 million images and 21 thousand classes."],"metadata":{"id":"e_PNRiui2wpf"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"DvD2WOeZ4sao"}},{"cell_type":"markdown","source":["Obviously, we’re not going to be training an image classification model from scratch on our own — that would take a lot of computational resources, and the whole point of this section is to avoid that. Fortunately, HuggingFace already has an ViT model trained on ImageNet that we can use, so let's load it."],"metadata":{"id":"dahdchBy40y1"}},{"cell_type":"code","source":["from transformers import ViTForImageClassification\n","\n","model_name = 'google/vit-base-patch16-224-in21k'\n","\n","# we need to pass an amount of classes in our dataset to\n","# ViTForImageClassification so that our model is build\n","# for the specified number of classes\n","model = ViTForImageClassification.from_pretrained(\n"," model_name,\n"," num_labels=len(class_names)\n",")"],"metadata":{"id":"mNJC1WSrGxr3"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We need to re-define the image transformations that will be applied for the images in our dataset. Here's one important thing to keep in mind: it’s better to use the same image transformations that were used to pre-train the model on ImageNet. This is simply because, during pre-training, our model \"got used to\" images being transformed in a certain way. In other words, it became well-suited for images with certain characteristics, so it's better to transform images in the same way during fine-tuning, too."],"metadata":{"id":"GaLtMLVlT_8B"}},{"cell_type":"code","source":["from transformers import ViTFeatureExtractor\n","\n","# loading image processor with the same model name\n","image_processor = ViTFeatureExtractor.from_pretrained(model_name)"],"metadata":{"id":"Go45iB_jIGeJ"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["def transform(dataset):\n"," '''\n"," A function that takes a batch of data a of a format\n"," {\n"," 'image': [list of images],\n"," 'label': [list of labels]\n"," }\n"," and transforms all the images in it using image_processor\n"," '''\n","\n"," transformed_dataset = image_processor(dataset['image'])\n"," transformed_dataset['label'] = dataset['label']\n"," return transformed_dataset\n","\n","\n","transformed_dataset = dataset.with_transform(transform)"],"metadata":{"id":"UHE0UEQnWaEO"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["**Exercise.** Now fine-tune the model! Your goal now is much higher: try to get at least 90% accuracy. Additionally, try to achieve this accuracy in 5 epochs or less. As before, all the code is below, but you need to set the correct training hyperparameters.\n","\n","\n"," Hint (click to expand).
\n"," \n","This time you need to increase the learning rate – but not too much.\n"," "],"metadata":{"id":"vx-RIYCsv8KB"}},{"cell_type":"markdown","source":["**Bonus exercise.** Try to get 90% accuracy in just 2 epochs!\n","\n","\n","\n"," Hint (click to expand).
\n"," \n","Decreasing the size of the batch may help.\n"," "],"metadata":{"id":"ksy75sKxw5ma"}},{"cell_type":"code","source":["# edit the code below code\n","\n","\n","\n","from transformers import TrainingArguments, Trainer\n","\n","\n","model = ViTForImageClassification.from_pretrained(\n"," model_name,\n"," num_labels=len(class_names))\n","\n","training_args = TrainingArguments(\n"," output_dir=\"./logs\",\n"," evaluation_strategy=\"steps\",\n"," remove_unused_columns=False,\n","\n"," # more frequent steps for more information\n"," logging_steps=5,\n"," eval_steps=5,\n","\n"," per_device_train_batch_size=64,\n"," learning_rate=1e-6,\n"," num_train_epochs=10,\n",")\n","\n","trainer = Trainer(\n"," model=model,\n"," args=training_args,\n"," compute_metrics=compute_metrics,\n"," # datasets for training and validating\n"," train_dataset=transformed_dataset[\"train\"],\n"," eval_dataset=transformed_dataset[\"validation\"],\n",")\n","\n","\n","train_results = trainer.train()\n","trainer.save_model()\n","print()\n","\n","metrics = trainer.evaluate(transformed_dataset['train'])\n","trainer.log_metrics(\"train\", metrics)\n","metrics = trainer.evaluate(transformed_dataset['test'])\n","trainer.log_metrics(\"test\", metrics)"],"metadata":{"id":"0r6pjNhHpSCS"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["\n"," Exercise conclusions (click to expand).
\n","\n","Great! We got much better results with less training time (not counting a probably enormous pre-training time).\n"," \n"],"metadata":{"id":"IEAptZy4ONxo"}},{"cell_type":"markdown","source":["Fine-tuning a model usually takes less iterations over data, so it’s usually less prone to overfitting. However, this can still happen!\n","\n","There is a trick that can help prevent this, while at the same time speeding up training even more: **layer freezing**. This is when we only train a part of the network’s layers, while all other layers are kept fixed (i.e. \"frozen\"). In transfer learning, it usually makes sense to freeze everything except a couple of the last layers. The reasoning behind this is that the first layers of the network learn more low-level details, which are relevant for all kinds of various tasks (exactly what we want from pre-training). Meanwhile, the last layer corresponds with more high-level, abstract concepts, which we want to update for the new dataset (the goal of the fine-tuning)."],"metadata":{"id":"dMDHe7QMSJQF"}},{"cell_type":"markdown","source":["
"],"metadata":{"id":"s7ST5IUpvBS1"}},{"cell_type":"markdown","source":["Depending on how different the datasets are for pre-training and fine-tuning, we may want to freeze different amounts of layers: for example, we can freeze only 2 or 3 initial layers if the datasets are very different, or almost the entire network if they are very close.\n","\n","We won't be doing any layer freezing in this lesson, but if you want, you can try it yourself. Here’s a [thread](https://discuss.huggingface.co/t/freeze-lower-layers-with-auto-classification-model/11386) on the Hugging Face forums that may help!\n","\n","To conclude this section, it’s also worth mentioning that transfer learning is not only limited to classification tasks. For example, models pre-trained on ImageNet can be adapted for image segmentation, detection, and so on."],"metadata":{"id":"nN8QyVZvjKjI"}},{"cell_type":"markdown","source":["# 4. Classification Using CLIP"],"metadata":{"id":"J3M2byK9j-7t"}},{"cell_type":"markdown","source":["In this section, we'll use a pre-trained model called CLIP for our classification task. But CLIP isn’t a model for image classification, and we have something different in mind about how to use it. Before we jump in, let’s talk about what CLIP is."],"metadata":{"id":"N0oyZf3lGg20"}},{"cell_type":"markdown","source":["## Understanding CLIP"],"metadata":{"id":"J9NTCKCEirCp"}},{"cell_type":"markdown","source":["Remember how BERT created text embeddings and ViT created image embeddings (before classification, that is)? Well, CLIP is able to create both in a common embedding space! The idea is that, if a text $T$ is a good description of an image $I$, their embeddings are close to each other, while embeddings of images and random non-related texts are far from each other.\n","\n","CLIP consists of two neural networks:\n","\n","* An **image encoder**, which maps an image into a vector\n","* A **text encoder**, which maps text into a vector of the same length\n","\n","These two networks are trained together on a dataset of paired images and their descriptions $(I_1, T_1), \\ldots, (I_N, T_N)$. They use an unusual loss function, called **contrastive loss**. We won't go into much detail about this, but basically, it creates a table of embeddings for each pair of image $I_j$ and text $T_k$, and then:\n","\n","- if $i=j$, then the loss is **distance** between them (thus forcing them to be close to each other)\n","- if $i \\neq j$, then the loss is **negative distance** between them (thus forcing them to be far from each other)"],"metadata":{"id":"io5dSPEceqag"}},{"cell_type":"markdown","source":["
\n","\n","*Image from the original paper*\n","\n",""],"metadata":{"id":"jgT-vocSgaKy"}},{"cell_type":"markdown","source":["As a result, after this training, we get two models (text encoder and image encoder), which map related text and images into similar embeddings, and unrelated text and images into dissimilar embeddings."],"metadata":{"id":"B2SaRf4HFdp5"}},{"cell_type":"markdown","source":["## Zero-Shot Prediction with CLIP"],"metadata":{"id":"qiffLf8uOMbC"}},{"cell_type":"markdown","source":["Now, let's talk about how to use CLIP for image classification without any further training. This is called **zero-shot** prediction because we are not training the model in any way.\n","\n","It's simple: suppose we have an image classification task with 4 classes of images: \"plane\", \"car\", \"dog\", \"bird\". How can we understand for a given image $I$ which of the four classes this image belongs to?"],"metadata":{"id":"Bmyp2NMplbCR"}},{"cell_type":"markdown","source":["\n","\n","Let's do the following:\n","\n","- Get embedding $e_I$ of $I$ using the image encoder\n","- Convert class names $X$ into class descriptions with the form \"a photo/picture/image of a X\":\n","\n"," \"a photo of a plane\"\n","\n"," \"a photo of a car\"\n","\n"," \"a photo of a dog\"\n","\n"," \"a photo of a bird\"\n","\n","- Get embeddings $e_{plane}, e_{car}, e_{dog}, e_{bird}$ of all of the class descriptions\n","- Compute similarities between image embedding $e_I$ and the embeddings of the class descriptions:\n","\n"," $sim_{I, plane} = e_I \\cdot e_{plane}$\n","\n"," $sim_{I, car} = e_I \\cdot e_{car}$\n","\n"," $sim_{I, dog} = e_I \\cdot e_{dog}$\n","\n"," $sim_{I, bird} = e_I \\cdot e_{bird}$\n","\n"," …and then choose the class which has the highest similarity (e.g. if $sim_{I, car}$ has the highest value across all four, then we conclude that our image belongs to the \"car\" class.)\n"],"metadata":{"id":"TfaeCE4cORKL"}},{"cell_type":"markdown","source":["\n","\n","So, here,we’re directly utilizing the idea that a text related to the image content will have a CLIP embedding similar to CLIP image embedding. If our image is an image of a car, its CLIP embedding will be close to the CLIP embedding of a phrase \"a photo of a car\", and far from CLIP embeddings of other phrases."],"metadata":{"id":"sAAyjSJild92"}},{"cell_type":"markdown","source":["\n","
\n","\n","*Image from the original paper*\n",""],"metadata":{"id":"szxpwEormeBT"}},{"cell_type":"markdown","source":["Note that, as we said before, we don't need to perform any model training, we’re just using it in **zero-shot** mode! CLIP was trained with a contrastive loss objective for general similarity understanding, and we’re now adapting this to the image classification task; this approach is very similar to the **in-context** learning that we've done in the previous lesson with prompt engineering for ChatGPT."],"metadata":{"id":"IcH_46gCTJfy"}},{"cell_type":"markdown","source":["## Image Classification"],"metadata":{"id":"RLZr1Uwtnacf"}},{"cell_type":"markdown","source":["Let’s do what we’ve discussed for real with our scenes dataset. Yet again, the trained CLIP model already exists in the `transformers` library in HuggingFace."],"metadata":{"id":"piy2Z1yoTmO2"}},{"cell_type":"code","source":["from transformers import CLIPProcessor, CLIPModel\n","\n","# this is the original model by OpenAI\n","model_name = \"openai/clip-vit-base-patch32\"\n","\n","\n","model = CLIPModel.from_pretrained(model_name).to(device)\n","# This is for pre-processing text and images before passing to CLIP model\n","processor = CLIPProcessor.from_pretrained(model_name)"],"metadata":{"id":"XXmex-OPi_6h"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let’s next see how to pass images and texts to the model: first, we’ll take two random images from the dataset."],"metadata":{"id":"_jc8-FdEhJZt"}},{"cell_type":"code","source":["dataset['test'][0]['image']"],"metadata":{"id":"5InBUVmEhpLi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["dataset['test'][101]['image']"],"metadata":{"id":"MqHyp77whtc9"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["images = [dataset['test'][0]['image'], dataset['test'][101]['image']]"],"metadata":{"id":"g2se_l14hw27"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Do you remember how CLIP can work with arbitrary texts also? Let's create some examples."],"metadata":{"id":"xJfxi4XpiAWE"}},{"cell_type":"code","source":["texts = [\n"," \"some houses or whatever\", # this should be somewhat similar to the first image\n"," \"sea with waves\", # this should be somewhat similar to the second image\n"," \"transformer is a type of neural network architecture\" # this should not be similar to anything\n","]"],"metadata":{"id":"EmmravpHiZPj"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now, let's pass texts and images through the model; they’ll need to be passed through simultaneously to generate their similarity matrix."],"metadata":{"id":"AvesF2qMits8"}},{"cell_type":"code","source":["# prepare texts and images for the model\n","inputs = processor(text=texts, images=images, return_tensors=\"pt\", padding=True).to(device)\n","\n","# pass the dictionary as keywords\n","outputs = model(**inputs)"],"metadata":{"id":"U_yZG0wRgarY"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["This `outputs` variable contains lots of information that we don't need (you can try to print it and see). 
The similarity matrix that we need is saved in the property `logits_per_image`:"],"metadata":{"id":"xacuNqEAxa1P"}},{"cell_type":"code","source":["outputs.logits_per_image"],"metadata":{"id":"jzFpCxZ0xbfa"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["The numbers above don't have any specific meaning, the only thing that matters is how they compare to each other. There are 2 rows (for each image) and 3 columns (for each text), and each cell is the similarity between the corresponding image and text. As we can see, this similarity corresponds to what we expected: the first text is close to the first image, second to the second, the third text is not close to anything.\n","\n","With that, let's move on to our image classification task."],"metadata":{"id":"q-qHK2QwjL7l"}},{"cell_type":"markdown","source":["**Exercise.** Create a function `clip_predict()` that takes an image as input and returns its prediction of the class label. Use the code example above for passing images and texts through the model. You’ll also need to create a set of appropriate prompts with class descriptions to compare them with images.\n","\n","\n"," Hint (click to expand).
\n","\n","Here’s how to make prompts: for each class name, you need to create a text for the class with this format: \"A picture of ...\".\n"," \n","\n","\n"],"metadata":{"id":"GmoDiLrSkH76"}},{"cell_type":"code","source":["# your code here\n","\n","# extract the classes from\n","class_names = dataset['test'].features['label'].names\n","\n","# create a prompt for each class\n","prompts = ...\n","\n","def clip_predict(image):\n"," # pass all the prompts and the image to the processor\n"," inputs = ...\n","\n"," # pass the inputs to the model\n"," outputs = ...\n","\n"," # get the prompt with the highest value\n"," prediction = ...\n","\n"," return prediction"],"metadata":{"id":"6Bp8zApmk7jq"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# get class index and print its name\n","prediction = clip_predict(dataset['test'][0]['image'])\n","print(dataset['test'].features['label'].names[prediction])"],"metadata":{"id":"Kd1IjT27mJ6r"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["\n","Next up, let's test this function on our dataset!"],"metadata":{"id":"CkBsfW8lmPs_"}},{"cell_type":"markdown","source":["**Exercise.** Using the function `clip_predict()` from the previous exercise, write the function `evaluate_clip()` for computing the classification accuracy with CLIP."],"metadata":{"id":"aQoDiRgEl0YG"}},{"cell_type":"code","source":["# your code here\n","\n","def evaluate_clip(dataset):\n"," ...\n","\n","\n","evaluate_clip(dataset['test'])"],"metadata":{"id":"aEy_pMwSmxOG"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Voila! The CLIP model in **zero-shot** mode performs quite well: we end up with around 90% accuracy without any actual training on our data! This is even better than how ChatGPT performed on the Yelp reviews task from the previous lesson, (however, the scene classification task is probably simpler).\n"],"metadata":{"id":"EX4HUKkmvfbW"}},{"cell_type":"markdown","source":["# 6. 
Conclusion"],"metadata":{"id":"8jskQx4QZwxp"}},{"cell_type":"markdown","source":["That’s the end of our lesson on image classification, so let’s wrap up what we've learned:\n","\n","* We trained and tested the **Vision Transformer** network from scratch on our data\n","* We learned about concept of **Transfer Learning** and **fine-tuned** a ViT model using HuggingFace library\n","* We learned about **contrastive learning** and used a pre-trained **CLIP model** to solve an image classification task in **zero-shot** mode"],"metadata":{"id":"gtc35sQufUBc"}},{"cell_type":"markdown","source":["Here are some additional materials for further reading:\n","\n","\n","\n","* A [guide](https://viso.ai/deep-learning/vision-transformer-vit/) on Vision Transformers and [another guide](https://www.v7labs.com/blog/vision-transformer-guide) on ViT.\n","* A [guide](https://www.v7labs.com/blog/transfer-learning-guide) on Transfer Learning.\n","* A [blogpost](https://lilianweng.github.io/posts/2021-05-31-contrastive/) explaining Contrastive Loss.\n","\n","There are many other things you can do with CLIP, here are some resources that can help:\n","\n","\n","\n","* The [CLIP release page](https://openai.com/research/clip).\n","* An [overview](https://medium.com/nightcafe-creator/vqgan-clip-tutorial-a411402cf3ad) of an approach for image generating using CLIP, and [in this colab notebook](https://colab.research.google.com/drive/1wkF67ThUz37T2_oPIuSwuO4e_-0vjaLs?usp=sharing) you can generate images using CLIP yourself.\n","* [Here you can find more tasks](https://blog.roboflow.com/openai-clip/) that CLIP can help solve."],"metadata":{"id":"OHfaT5DwdrAY"}},{"cell_type":"code","source":[],"metadata":{"id":"plDgAAhOT-J2"},"execution_count":null,"outputs":[]}]}
--------------------------------------------------------------------------------
/notebooks/lesson_3.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","id":"ee9b364b","metadata":{"id":"ee9b364b"},"source":["\n","
\n","\n","\n","
\n","\n","\n","Lesson 3: Transformers\n","\n","\n","\n","\n","Understanding the architecture behind recent NLP breakthroughs\n","\n","\n","\n","\n"]},{"cell_type":"markdown","source":["# 1. Introduction"],"metadata":{"id":"IgPrmLGod0Jy"},"id":"IgPrmLGod0Jy"},{"cell_type":"markdown","source":["## Lesson Overview"],"metadata":{"id":"LUvv798Sd7kJ"},"id":"LUvv798Sd7kJ"},{"cell_type":"markdown","source":["In the previous lesson, we trained a linear model to make predictions based on some text data. Now, we’re moving on to more advanced models: **transformers**. In this lesson, you’ll:\n","\n","\n","\n","* Learn the basics of neural networks and how the differ from linear models\n","* Understand transformer architecture and how to use it with text data\n","* Practice training transformer models with the Hugging Face library\n","\n","You can find the notebook with solutions [here](https://colab.research.google.com/drive/1wSbezYwVv-kjN2BA_wFeYpuMJ5bOh-aY?usp=drive_link).\n","\n","_Note: in this lesson, we borrow some materials from the fantastic [NLP course](https://lena-voita.github.io/nlp_course.html) by Lena Voita. If you need some additional explanation after this lesson, we suggest taking a look – actually, it’s not a bad idea to look even if you feel solid._"],"metadata":{"id":"JSH54AIyd3hg"},"id":"JSH54AIyd3hg"},{"cell_type":"markdown","source":["
"],"metadata":{"id":"LXneaF45tEgb"},"id":"LXneaF45tEgb"},{"cell_type":"markdown","source":["## Why Use Neural Networks: Non-Linear Problems"],"metadata":{"id":"qqH5Wmn4eNRf"},"id":"qqH5Wmn4eNRf"},{"cell_type":"markdown","source":["Linear models are interpretable but too simple for most real-life tasks. Consider, for instance, the following 2-dimensional classification problem:"],"metadata":{"id":"PuQzpbstd_pY"},"id":"PuQzpbstd_pY"},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"j1ejAs70SMsy"},"id":"j1ejAs70SMsy"},{"cell_type":"markdown","source":["Obviously, no straight line could separate these two classes, so a linear classifier would be useless here. At the same time, the classification problem itself is very easy. Quite clearly, a dotted red circle perfectly separates the classes, so a simple rule that compares the _distance between the data point_\n","$x$\n","_and the origin_ with, say,\n","$0.6$\n"," will solve the problem.\n","\n","We can do a little hack and introduce a **new feature** $x_1^2 + x_2^2$\n"," (it's the squared distance from the origin):\n","\n","\n","$$\n","(x_1, x_2)\\longrightarrow (z_1, z_2, z_3) = (x_1, x_2, x_1^2 + x_2^2)\n","$$\n","\n","\n","Now, we can pass this 3-feature problem to a linear model, which will crack the task easily:\n","\n","\n","$$\n","0\\cdot z_1 + 0\\cdot z_2 + 1\\cdot z_3 - 0.6^2 > 0 \\text{ then class 1, else class 0}\n","$$"],"metadata":{"id":"CHvSLRaWeGRe"},"id":"CHvSLRaWeGRe"},{"cell_type":"markdown","source":["## Constructing New Features"],"metadata":{"id":"LrPD9jcIej_7"},"id":"LrPD9jcIej_7"},{"cell_type":"markdown","source":["If we have a real-world problem which is more complex, but not visually solvable, constructing new features can be tricky. There is even a separate craft known as **feature engineering** that combines some math and empirical knowledge to create sometimes very complicated features. And just to give you an idea: before the rise of neural networks, NLP specialists used word frequencies coupled with various normalizations, or even complex formulas with empiric coefficients like [BM25](https://en.wikipedia.org/wiki/Okapi_BM25). These carefully engineered features were then fed to something like a linear model, with the whole pipeline looking like this:"],"metadata":{"id":"CQLGrOTWeqyV"},"id":"CQLGrOTWeqyV"},{"cell_type":"markdown","id":"fb93a3c0","metadata":{"id":"fb93a3c0"},"source":["\n","\n","
\n","\n","\n","*Image from the [NLP For You course by Lena Voita](https://lena-voita.github.io/nlp_course.html).*\n",""]},{"cell_type":"markdown","source":["But deep learning changed all of that. One of the greatest things in deep learning is that we can simultaneously learn both **feature representations** and the **final classifier**. But how? Let's see in the next section!"],"metadata":{"id":"PrGojekxeyxI"},"id":"PrGojekxeyxI"},{"cell_type":"markdown","id":"9e6c17d1","metadata":{"id":"9e6c17d1"},"source":["# 2. A Gentle Intro to Neural Networks"]},{"cell_type":"markdown","source":["## Decomposing The Classification Rule"],"metadata":{"id":"GG34qLZ-goOl"},"id":"GG34qLZ-goOl"},{"cell_type":"markdown","source":["Let's return to our example with two circles, but this time, let's consider a different classification rule:"],"metadata":{"id":"QTXQngJye8W4"},"id":"QTXQngJye8W4"},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"qfykqjgIUJMv"},"id":"qfykqjgIUJMv"},{"cell_type":"markdown","id":"9ad06b62","metadata":{"id":"9ad06b62"},"source":["This also solves the problem, and it will be easier for us to work with. Let's write down this rule as a formula; one way to do so is to compute the following expression and later compare it with zero for the answer:\n","\n","\n","$$\n","\\max\\left(0,x+y-0.75\\right)+\\max\\left(0,x-y-0.75\\right)+\\max\\left(0,-x-y-0.75\\right)+\\max\\left(0,-x+y-0.75\\right)\n","$$\n","\n","If it’s positive, we predict class 1, and if it’s negative, we predict class 0. Feel free to verify that it works with any plotting tool, for example, with [Desmos](https://www.desmos.com/Calculator).\n","\n","While it looks scary, it can be decomposed into three simpler stages:\n","\n","
\n","\n","$$\n","(x_1, x_2)\\longrightarrow\n","$$\n","\n","
\n","\n","$$\\longrightarrow(z_{11}, z_{12}, z_{13}, z_{14}) = (x+y-0.75,x-y-0.75,-x-y-0.75,-x+y-0.75)\\longrightarrow$$\n","\n","
\n","\n","$$\\longrightarrow(z_{21}, z_{22}, z_{23}, z_{24}) = (\\max(0,z_{11}), \\max(0, z_{12}), \\max(0, z_{13}), \\max(0, z_{14}))\\longrightarrow$$\n","\n","
\n","\n","$$\\longrightarrow z_4 = z_{21} + z_{22} + z_{23} + z_{24}$$\n","\n","
\n","\n","The first and the third are linear transforms, while the second is an elementwise **nonlinearity**; in this case, the function is known in deep learning as **ReLU** (Rectified Linear Unit):\n","\n","\n","$$\n","\\text{ReLU}(t) = \\max(0, t)\n","$$"]},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"p910T2gAewLs"},"id":"p910T2gAewLs"},{"cell_type":"markdown","source":["If we want to predict the _probability of class 1_, we can apply the **sigmoid** function to the final result:\n","\n","$$\\sigma(z) = \\frac1{1 + e^{-z}}$$\n","\n","The sigmoid function has values between $0$ and $1$ (so it can be a probability); and moreover, the higher the value of $z$, the higher the predicted probability. Let's see the plot:\n","\n"],"metadata":{"id":"m42wIaWuftrX"},"id":"m42wIaWuftrX"},{"cell_type":"markdown","source":["\n","
\n","\n","\n"],"metadata":{"id":"z1ok9nYEe03p"},"id":"z1ok9nYEe03p"},{"cell_type":"markdown","source":["So, $\\sigma(z_4) > \\frac12$ for positive $z_4$ and vice versa.\n","\n","\n","
\n","\n","\n"," A note of the sigmoid-softmax relation (click to expand)
\n","\n","Note: the sigmoid function is closely related to the softmax function introduced in the previous lesson:\n","\n","$$\\text{softmax}(-z, z) = (1 - \\sigma(z), \\sigma(z))$$\n","\n","
\n"," "],"metadata":{"id":"kAzK2ioyf9QO"},"id":"kAzK2ioyf9QO"},{"cell_type":"markdown","id":"005d5fac","metadata":{"id":"005d5fac"},"source":["## Layers And Parameters"]},{"cell_type":"markdown","id":"a9b02004","metadata":{"id":"a9b02004"},"source":["That’s all good, but the lengthy formula above seemed to just appear miraculously. How can we arrive at the same function with machine learning?\n","\n","Remember, the linear models from the previous lesson are actually **parametric** (that is, they have parameters or weights that can be learned from data). So, let's make our current formula parametric too:\n","\n","
\n","\n","$$(x_1, x_2)\\longrightarrow$$\n","\n","
\n","\n","$$\\begin{align}\\longrightarrow(z_{11}, z_{12}, z_{13}, z_{14}) =\\,(\n","&\\color{magenta}{w_{11}}x_1+\\color{magenta}{w_{12}}x_2+\\color{magenta}{w_{10}},\\\\\n","&\\color{magenta}{w_{21}}x_1+\\color{magenta}{w_{22}}x_2+\\color{magenta}{w_{20}},\\\\\n","&\\color{magenta}{w_{31}}x_1+\\color{magenta}{w_{32}}x_2+\\color{magenta}{w_{30}},\\\\\n","&\\color{magenta}{w_{41}}x_1+\\color{magenta}{w_{42}}x_2+\\color{magenta}{w_{40}}\\,)\\longrightarrow\\end{align}$$\n","\n","
\n","\n","$$\\longrightarrow(z_{21}, z_{22}, z_{23}, z_{24}) = (\\max(0,z_{11}), \\max(0, z_{12}), \\max(0, z_{13}), \\max(0, z_{14}))\\longrightarrow$$\n","\n","
\n","\n","$$\\longrightarrow z_3 = \\color{magenta}{u_1}z_{21} + \\color{magenta}{u_2}z_{22} + \\color{magenta}{u_3}z_{23}+ \\color{magenta}{u_4}z_{24} + \\color{magenta}{u_0}$$\n","\n","
\n","\n","$$\\longrightarrow z_4 = \\sigma(z_3)$$\n","\n","
\n","\n","All the $\\color{magenta}{w}$'s and the $\\color{magenta}{u}$'s are trainable parameters here. And note a slight difference in the linear formula (the first step) from the previous lesson: we’ve also added a set of **bias** parameters $\\color{magenta}{w_{10}}, \\ldots, \\color{magenta}{w_{40}}$; these are not multiplied by any inputs and are just added to our formula; the bias is present in the actual full linear model, but we skipped it in the last lesson for simplicity."]},{"cell_type":"markdown","id":"fe98caa2","metadata":{"id":"fe98caa2"},"source":["And this is our first **neural network**! It has 4 **layers**:\n","\n","* `Linear(2, 4)` maps a length 2 vector to a length 4 vector and has trainable weights $\\color{magenta}{w}$\n","* `ReLU()` applies ReLU to every element of a vector, it has no trainable parameters\n","* `Linear(4, 1)` maps a length 4 vector to a length 1 vector and has trainable weights $\\color{magenta}{u}$\n","\n","* `Sigmoid()` applies $\\sigma$ to the previous result, it has no trainable parameters\n","\n","Notice here how we’ve denoted each layer as if we are initializing some class instance, passing all the necessary configurations into it. This is actually how it’s done in many deep learning frameworks, so we’ll stick with this notation.\n","\n","\n"," A note on our formula’s math notation\n"," (click to expand)
\n"," \n","The formula\n","$$\\begin{align}(z_{11}, z_{12}, z_{13}, z_{14}) =\\,(\n","&\\color{magenta}{w_{11}}x_1+\\color{magenta}{w_{12}}x_2+\\color{magenta}{w_{10}},\\\\\n","&\\color{magenta}{w_{21}}x_1+\\color{magenta}{w_{22}}x_2+\\color{magenta}{w_{20}},\\\\\n","&\\color{magenta}{w_{31}}x_1+\\color{magenta}{w_{32}}x_2+\\color{magenta}{w_{30}},\\\\\n","&\\color{magenta}{w_{41}}x_1+\\color{magenta}{w_{42}}x_2+\\color{magenta}{w_{40}}\\,)\\end{align}$$\n"," \n","may be written shorter using matrix multiplication:\n"," \n","$$(z_{11}, z_{12}, z_{13}, z_{14}) = (x_1, x_2)\\cdot\\color{magenta}{W} + \\color{magenta}{b},$$\n"," \n","where\n"," \n","$$\\color{magenta}{W} = \\begin{pmatrix}\n","\\color{magenta}{w_{11}} & \\color{magenta}{w_{12}}\\\\\n","\\color{magenta}{w_{21}} & \\color{magenta}{w_{22}}\\\\\n","\\color{magenta}{w_{31}} & \\color{magenta}{w_{32}}\\\\\n","\\color{magenta}{w_{41}} & \\color{magenta}{w_{42}}\n","\\end{pmatrix}$$\n"," \n","and\n"," \n","$$\\color{magenta}{b} = \\begin{pmatrix}\n","\\color{magenta}{w_{10}}\\\\\n","\\color{magenta}{w_{20}}\\\\\n","\\color{magenta}{w_{30}}\\\\\n","\\color{magenta}{w_{40}}\n","\\end{pmatrix}$$\n"," \n","The matrix and vector notation help a lot for shortening the formulas, for details, see [this page](https://en.wikipedia.org/wiki/Matrix_multiplication)\n"," "]},{"cell_type":"markdown","id":"c182fd2a","metadata":{"id":"c182fd2a"},"source":["To train this network, we take a suitable loss function and optimize all of the $\\color{magenta}{w}$'s and $\\color{magenta}{u}$'s with a gradient descent. Just a quick refresher on the general concept of gradient descent:\n","\n","1. Start with random weights\n","2. Compute the negative gradient of the loss function (the direction in which it decreases the fastest)\n","3. Move the current weights slightly in this direction\n","4. Repeat from step 2 until convergence\n","\n","And here’s some good news: allmodern neural network packages are able to compute the negative gradient automatically, so this algorithm is not very hard to implement."]},{"cell_type":"markdown","id":"63765e51","metadata":{"id":"63765e51"},"source":["## Constructing Neural Networks"]},{"cell_type":"markdown","id":"b293debb","metadata":{"id":"b293debb"},"source":["Neural networks are built from layers like a toy house is built from the pieces of a construction set. Right now, we have just a few layers to work with (`Linear`, `ReLU`, `Sigmoid`, and `Softmax`), but we can use them to create very complex architectures. For example, if we have a 5-class classification problem on a dataset with 100 features, we could use something like this:\n","\n","```{python}\n","Linear(100, 200)\n","ReLU()\n","Linear(200, 400)\n","ReLU()\n","Linear(400, 400)\n","ReLU()\n","Linear(400, 5)\n","Softmax()\n","```\n","\n","A neural network with many layers can be called \"deep\" – hence the term **“deep learning”**.\n","\n","Please note several important things:\n","\n","\n","\n","* The dimensions of adjacent linear layers correspond to each other. For instance, the first linear layer outputs a vector of length 200; ReLU doesn't change the length, so the input of the second linear layer should also be 200\n","* Between every two layers there is a nonlinearity. 
This may be something other than ReLU, but we don't stack linear layers without nonlinearity, because several linear layers stacked directly are equivalent to just one linear layer\n","* The output of the final linear layer has a dimension equal to the number of classes, and the softmax outputs class probabilities\n","* The training will be focused on parameters of four linear layers\n","\n","**Question**: How many trainable parameters does this network have?\n","\n","\n"," Answer (click to expand).
\n"," \n","A `Linear(m,n)` layer has $(m+1)n$ parameters. Indeed, each of the $n$ coordinates of the output equals $$z^{(next)}_{i} = \\color{magenta}{w_{i1}}z^{(prev)}_1+\\ldots+\\color{magenta}{w_{im}}z^{(prev)}_m + \\color{magenta}+\\color{magenta}{w_{i0}}$$ Don't forget about the bias! So, in our example we have\n","\n","$$\n","101\\cdot 200 + 201\\cdot 400 + 401\\cdot 400 + 401\\cdot 5 = 263005\n","$$\n","\n","This is actually quite an impressive number. And much more than just the 101 parameters of a linear model. This can allow the network to capture complex dependencies.\n"," "]},{"cell_type":"markdown","source":["Of course, the choice of architecture (the number of layers, the dimensions, nonlinearities, etc.) is crucial for the quality of the model. Moreover, you can't simply check all possible architectures because training each of them may take too much time and compute. That's why, excluding a thin stratum of tech companies and research labs, engineers typically don't invent their own architectures, instead reusing existing ones known for high performance. Accordingly, in this short course, we’ll also be working with existing models."],"metadata":{"id":"SeIh9G5LiLX2"},"id":"SeIh9G5LiLX2"},{"cell_type":"markdown","id":"36be90d8","metadata":{"id":"36be90d8"},"source":["## Is That It?"]},{"cell_type":"markdown","id":"8b433957","metadata":{"id":"8b433957"},"source":["The idea of stacking several transformations isn't too elaborate, so you might wonder why the deep learning revolution only happened in 2012.\n","\n","The truth is that multilayered neural networks first appeared as early as in the 1960s and significant research was being invested into them long before 2012. However, the lack of two things impeded the application of neural networks in real-world tasks:\n","\n","**1) Large datasets**. Deep neural networks usually have many trainable parameters. For example, our 8-layer network above has more than 200,000 parameters while the network AlexNet (which in 2012 was state-of-the-art in Computer Vision) had 60 million parameters. To train a network with so many parameters, we need millions of training images, otherwise the network will overfit. In Computer Vision, the first dataset that allowed us to train really large models was [ImageNet](https://image-net.org/). Since then, the models have only continued to grow in size. For example, ChatGPT-4 is said to have some 100 trillion parameters. So, large datasets remain a vital resource for the creation of state-of-the-art models.\n","\n","**2) Computational resources**. Even if you could fit AlexNet and all the byproducts of the training process in your laptop's memory, training on the CPU would take ages. Thus, deep learning researchers needed more advanced infrastructure. Luckily, using **GPUs** (Graphical Processing Units) greatly speeds up training and prediction for neural networks, and this made AlexNet possible. Now, most cloud providers (Google Colab included) allow the use of GPUs. That said, training of state-of-the-art models like GPT-4 can only be done on huge multi-GPU clusters."]},{"cell_type":"markdown","source":["\n","
\n","\n"],"metadata":{"id":"jaJwc-uLDaSt"},"id":"jaJwc-uLDaSt"},{"cell_type":"markdown","id":"f2dc6ca5","metadata":{"id":"f2dc6ca5"},"source":["# 3. Neural Networks for Texts"]},{"cell_type":"markdown","source":["In the examples above, we described neural networks for data with just two numerical features: $x$ and $y$. But how to fit a text into a model like this? We could use the Bag of Words approach, but as we mentioned before, there's a problem with it: the features extracted by it don't take word order into account. In other words, sentences \"The room was bad, I did not like it\" and \"The room was not bad, I did like it\" will give us the same feature vectors.\n","\n","Now that we’re using advanced models, we don't want to lose that much important information.\n","\n","In this section, we’ll learn how to work with texts in a more clever way.\n","\n","But first, we need some tools! Luckily, these days, we can work with neural networks without much coding – for example, [Hugging Face](https://huggingface.co/docs), which we used for datasets in the previous lesson, also contains many useful deep learning libraries. Here are the new ones we’ll be using in this lesson:\n","\n","\n","\n","* **Tokenizers** for building features from texts\n","* **Transformers** for more advanced machine learning models\n","\n","(Addressing all of the features of Hugging Face is beyond the scope of this text, but we’ll link to some deeper learning materials at the end of this lesson. "],"metadata":{"id":"A_iNFFSakwvP"},"id":"A_iNFFSakwvP"},{"cell_type":"markdown","id":"l2aHDnsekke5","metadata":{"id":"l2aHDnsekke5"},"source":["## Dataset"]},{"cell_type":"markdown","source":["Let's start by reviewing our dataset. We’ll be using the same Yelp reviews dataset as before, which contains review texts and their sentiments."],"metadata":{"id":"5IYX5zqvk6W0"},"id":"5IYX5zqvk6W0"},{"cell_type":"code","execution_count":null,"id":"U2eAPWg3kpIQ","metadata":{"id":"U2eAPWg3kpIQ","scrolled":true},"outputs":[],"source":["!pip install -qq datasets"]},{"cell_type":"code","source":["from datasets import load_dataset\n","dataset = load_dataset(\"yelp_review_full\")\n","dataset"],"metadata":{"id":"jESAZBVP1tDP"},"id":"jESAZBVP1tDP","execution_count":null,"outputs":[]},{"cell_type":"markdown","id":"xwh7jFZhNtvp","metadata":{"id":"xwh7jFZhNtvp"},"source":["\n","Let's also refresh our memory about the columns we’re working with:"]},{"cell_type":"code","execution_count":null,"id":"TA9ZejndnMz6","metadata":{"id":"TA9ZejndnMz6"},"outputs":[],"source":["dataset.column_names"]},{"cell_type":"markdown","id":"e86902f8","metadata":{"id":"e86902f8"},"source":["... and about the features format:"]},{"cell_type":"code","execution_count":null,"id":"cjHGHZ1pnM--","metadata":{"id":"cjHGHZ1pnM--"},"outputs":[],"source":["dataset['train'].features"]},{"cell_type":"markdown","source":["## Text Tokenization"],"metadata":{"id":"GKT3OVA2ctDM"},"id":"GKT3OVA2ctDM"},{"cell_type":"markdown","source":["We promised to show a clever way of getting numeric representations for texts, and here's that trick: _instead of making a feature vector for the whole text, we make a vector for each word!_ Like this:"],"metadata":{"id":"sRcLVyYxlFQu"},"id":"sRcLVyYxlFQu"},{"cell_type":"markdown","id":"a53da154","metadata":{"id":"a53da154"},"source":["\n","
\n","\n","*Image from the [NLP For You course by Lena Voita](https://lena-voita.github.io/nlp_course.html).*\n","\n","\n"]},{"cell_type":"markdown","source":["These vectors are created in a pretty straightforward way: we just create a dictionary with each word having its own vector (this is called **embedding**), and then we train these vectors together with the whole network (they become model **parameters**).\n","\n","We’ll return to these embeddings later, but right now, let's understand why working with words (as we did in the previous lesson) may actually not be the best idea:\n","\n","1. There are just too many words. Keeping so many vectors may be excessively memory intensive.\n","\n","2. Even if we could somehow, the list of words will be set after training . However, in real life it's not possible to make an exhaustive list of all words: names, typos, abbreviations, and constantly-updating slang will turn our life into a nightmare.\n","\n","This process of converting a raw text into some kind of units is called **tokenization** and each unit is called **token**. After this, we store the correspondence between tokens and some kind of vector in our dictionary."],"metadata":{"id":"JgS3OZ3XlS1G"},"id":"JgS3OZ3XlS1G"},{"cell_type":"markdown","source":["So, we don’t want to use the tokenization strategy from the previous lesson, and we need some other tokenization strategy.\n","\n","One possible solution could be **character level tokenization**. That is, each character is considered a separate token, and the whole text is treated just as a stream of characters. This could help to deal with misspellings and rare words, but the main drawback is that the task becomes harder — in addition to learning how words compose into sentences, the network would also need to learn how characters compose into words. This would require more complex models, more data, more compute power, and more memory. For this reason, character tokenization is rarely used in practice.\n","\n","Luckily, there are intermediate options between character and word tokenization that preserves all the input information and some of the input structure. These are **subword tokenization** strategies.These include BPE (Byte Pair Encoding) and WordPiece, which split some words into parts (sometimes into separate symbols), and leave their own tokens for other words. 
Here’s a rough idea on how to choose which words to split:\n","\n","\n","\n","* Keep frequent words unbroken, because there is a lot of data on their contexts, and it will help the neural network learn their connections to other words.\n","* Split rare words into smaller units, because it will help keep down the size of the vocabulary and maybe even allow the network to deduce some meaning of the word based on its parts.\n","\n","Let's check out an example to understand how it works: we’ll need to install a new library called `transformers` for this, and we’ll also use a model called BERT (we’ll explain what all of that means later in this lesson)."],"metadata":{"id":"E_4cxg1tpfug"},"id":"E_4cxg1tpfug"},{"cell_type":"code","execution_count":null,"id":"4e37d468","metadata":{"id":"4e37d468"},"outputs":[],"source":["!pip install -qq transformers[torch]"]},{"cell_type":"code","execution_count":null,"id":"b0a2d845","metadata":{"id":"b0a2d845"},"outputs":[],"source":["from transformers import AutoTokenizer\n","\n","bert_tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n","# there is a deliberate typo in word \"neural\"\n","example = \"In this course we unravel the mysteries of nevral networks.\"\n","encoding = bert_tokenizer(example)\n","print(type(encoding))"]},{"cell_type":"code","execution_count":null,"id":"17eac285","metadata":{"id":"17eac285"},"outputs":[],"source":["print(encoding.tokens())"]},{"cell_type":"markdown","id":"7c4d0865","metadata":{"id":"7c4d0865"},"source":["Some notes: the double hash sign `##` denotes that the token is not a new word, but a continuation of the previous one. The word ”unravel” is rare, so it was split into three tokens: “`un`”, “`##rave`”, “`##l`”. The word “nevral” has a typo, so it was also split: “`ne`”, “`##v`”, “`##ral`”.\n","\n","Note also that there are some special tokens in the list (`[CLS]`, `[SEP]`), which are not very important for us at the moment."]},{"cell_type":"markdown","source":["\n","
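Besides the token strings, the tokenizer also produces the integer ids that the model actually consumes. A quick look, reusing the `bert_tokenizer` and `encoding` from the cells above (standard tokenizer methods, shown here just for illustration):

```python
# the integer ids are what is actually fed into the network
print(encoding["input_ids"])

# ids and tokens are two views of the same thing
print(bert_tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# decode() maps ids back to a string, special tokens included
print(bert_tokenizer.decode(encoding["input_ids"]))

# the total number of tokens this tokenizer knows about
print(bert_tokenizer.vocab_size)
```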
\n","\n"],"metadata":{"id":"ybuykZeyhVAG"},"id":"ybuykZeyhVAG"},{"cell_type":"markdown","source":["Let's play with the BERT tokenizer a little bit to get more understanding how it works.\n","\n","As we've mentioned, the tokenizer tries to produce separate tokens for frequent words and split less frequent into parts."],"metadata":{"id":"UuKjkDEPfqLe"},"id":"UuKjkDEPfqLe"},{"cell_type":"markdown","source":["**Exercise.** Find 2 words that have their own tokens and 2 words (existing and without typos) that were split into two separate tokens."],"metadata":{"id":"7nYnSxdRgF8o"},"id":"7nYnSxdRgF8o"},{"cell_type":"code","source":["# \n","\n"],"metadata":{"id":"_VrRf1Gu2tau"},"id":"_VrRf1Gu2tau","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now, splitting words into parts for vocabulary reduction is great, but how useful are the created tokens for the downstream neural network? If some of the partial tokens repeat consistently between different words, it can be useful, as the network would be able to deduce some grammar rules from it for example."],"metadata":{"id":"_n6e90I33Omt"},"id":"_n6e90I33Omt"},{"cell_type":"markdown","source":["\n","\n","**Exercise.** Find 2 different words split into parts by the tokenizer where the first part is the same token. Then, find 2 different words with the last part being the same token."],"metadata":{"id":"dwp982AemL3o"},"id":"dwp982AemL3o"},{"cell_type":"code","source":["# \n","\n","\n"],"metadata":{"id":"cco_tUcA4tzv"},"id":"cco_tUcA4tzv","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Finally, let's think about this – does this way of splitting tokens actually help reduce the vocabulary size? Because, for many words, this tokenizer actually creates more words, and it will only help in the long run if there are many similar word parts."],"metadata":{"id":"ON7e01xkmQHR"},"id":"ON7e01xkmQHR"},{"cell_type":"markdown","source":["**Exercise.** Find out how the vocabulary size depends on the size of the dataset for simple word tokenization and our BERT tokenizer. Since training new tokenizers is very compute-heavy, we’ll just find the number of unique tokens for each of the tokenizers depending on the subset size. Plot those values as graphs to see the dynamics more easily. Tokenizing the entire dataset will take too much time, so just try to find a size where the size of the BERT vocabulary becomes smaller than the words vocabulary.\n","\n","We’ve already implemented both of the tokenizers as simple functions below."],"metadata":{"id":"Z8j85hpK56Ep"},"id":"Z8j85hpK56Ep"},{"cell_type":"code","source":["import re\n","\n","def word_tok(text):\n"," return re.findall('(?u)\\\\b\\\\w\\\\w+\\\\b', text)\n","\n","word_tok(\"In this course we unravel the mysteries of nevral networks.\")"],"metadata":{"id":"wgqVUnikx7n0"},"id":"wgqVUnikx7n0","execution_count":null,"outputs":[]},{"cell_type":"code","source":["def bert_tok(text):\n"," return bert_tokenizer(text).tokens()\n","\n","bert_tok(\"In this course we unravel the mysteries of nevral networks.\")"],"metadata":{"id":"yojlxWNC4iuL"},"id":"yojlxWNC4iuL","execution_count":null,"outputs":[]},{"cell_type":"code","source":["# \n","\n","\n"],"metadata":{"id":"ZvxFo-nR6_x9"},"id":"ZvxFo-nR6_x9","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Dataset Preprocessing"],"metadata":{"id":"g9mmrHJi8Q9v"},"id":"g9mmrHJi8Q9v"},{"cell_type":"markdown","id":"wYFxLC5ocfX5","metadata":{"id":"wYFxLC5ocfX5"},"source":["Let's return to our sentiment analysis problem. 
We need to tokenize our dataset before starting to run models on it. Training large models requires lots of computational power, so, for demonstration purposes we’ll just use a tiny fraction of the data."]},{"cell_type":"code","source":["small_train_dataset = dataset[\"train\"].shuffle(seed=42).select(range(1000))\n","small_eval_dataset = dataset[\"test\"].shuffle(seed=42).select(range(1000))"],"metadata":{"id":"ZZ9qwzsP82lT"},"id":"ZZ9qwzsP82lT","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["There is a function `map` in the Hugging Face datasets that we can use to apply the tokenizer to the whole set. We just need to define a mapping function:"],"metadata":{"id":"W4O9uVW_8eZ4"},"id":"W4O9uVW_8eZ4"},{"cell_type":"code","execution_count":null,"id":"C4AAlmkks-Pb","metadata":{"id":"C4AAlmkks-Pb"},"outputs":[],"source":["def preprocess_function(examples):\n"," return bert_tokenizer(\n"," examples[\"text\"],\n"," padding=\"max_length\", # see explanation below\n"," truncation=True) # see explanation below"]},{"cell_type":"markdown","source":["There are two additional parameters in the tokenizer that will help us later. In short, we want all of the texts in our dataset to have the same length, so we ask the tokenizer to pad small texts with special \"empty\" tokens and truncate texts that are too long.\n","\n","Having the exact same length is only important to get faster, more convenient model training; the model is still able to process texts of various lengths if necessary.\n","\n","Now, let's apply the mapping:"],"metadata":{"id":"qaECX9n291d2"},"id":"qaECX9n291d2"},{"cell_type":"code","source":["# batched=True just helps to process things faster\n","small_train_dataset_tok = small_train_dataset.map(preprocess_function, batched=True)\n","small_eval_dataset_tok = small_eval_dataset.map(preprocess_function, batched=True)"],"metadata":{"id":"oJfttSCK-YEL"},"id":"oJfttSCK-YEL","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Transformers"],"metadata":{"id":"xZemA0hN-j1i"},"id":"xZemA0hN-j1i"},{"cell_type":"markdown","source":["Now, the fun part: applying neural networks to our domain task. We’ll focus on the most popular architecture in NLP nowadays: **transformer-based** models.\n","\n","The original transformer model is based on on the **encoder-decoder** architecture:\n","\n","* **Encoder** part converts the input sequence of tokens into a sequence of embedding vectors.\n","* **Decoder** part generates an output sequence of tokens, one step at a time, using the encoder's output."],"metadata":{"id":"UvQmbw8R26VV"},"id":"UvQmbw8R26VV"},{"cell_type":"markdown","source":["\n","
"],"metadata":{"id":"y7el3Kgk28ve"},"id":"y7el3Kgk28ve"},{"cell_type":"markdown","id":"MLy10_nCiV34","metadata":{"id":"MLy10_nCiV34"},"source":["A good example of a task that uses both components is **machine translation**: a sequence of tokens in a source language is encoded via an encoder and later translated into tokens of another language using a decoder; the image above schematically represents English to French translation.\n","\n","However, in some tasks you don't need both parts to solve a problem:\n","\n","\n","\n","* **Decoder-only** models are used in generation tasks. For example, **GPT** (generative pre-trained transformer) models are decoder-only models (we’ll talk about this more in the next lesson).\n","* **Encoder-only** models are able to extract meaningful vector representations of tokens that can be used for tasks like text classification. (Examples include **BERT** and its variants, like **RoBERTa** and **DistilBERT**.) We will use encoder-only models in this lesson.\n","\n","Now, let's dive into the anatomy of encoder-only transformer models and we’ll understand how to apply them to the classification task."]},{"cell_type":"markdown","source":["### Attention Mechanism"],"metadata":{"id":"OT8b5BdFAN6G"},"id":"OT8b5BdFAN6G"},{"cell_type":"markdown","source":["Let's start with the core mechanism in the transformers architecture: **attention**.This processes sequences of embeddings together and enables the exchange of information between them. Roughly speaking, the embedding of each token \"_pays attention_\" to the embeddings of all other tokens and \"_draws conclusions_\" from it.\n","\n","\n"," A funny story about the name “attention” (click to expand)
\n"," \n","Attention mechanism was introduced before the transformers, and in other architectures it was used alongside other important sequence processing techniques. However, the paper that introduced transformers suggested to make attention the core mechanism of the architecture and remove most of the other techniques. Naturally, the paper was named [Attention Is All You Need](https://arxiv.org/abs/1706.03762).\n"," "],"metadata":{"id":"SU9wPkWjm4YG"},"id":"SU9wPkWjm4YG"},{"cell_type":"markdown","id":"_QEx9qhJiWKZ","metadata":{"id":"_QEx9qhJiWKZ"},"source":["We'll describe the most popular version of the attention mechanism: **scaled dot-product attention**. There are few main steps required to implement this mechanism:\n","\n","**Step 1.** Map each token's embedding into three vectors called **query**, **key**, and **value**. To do this we multiply each vector $x_i$ by trainable matrices:\n","\n","- $W_Q$ for queries\n","- $W_K$ for keys\n","- $W_V$ for values\n"]},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"dYCg6B9a51AX"},"id":"dYCg6B9a51AX"},{"cell_type":"markdown","source":["The names \"query\", \"key\", and \"value\" reflect what will be done with them:\n","\n","* **Query** $q_i = W_Qx_i$ is the \"probe\" used by the $i$-th token to \"look\" upon all other tokens.\n","* **Key** $k_i = W_Kx_i$ is the signal that $i$-th token sends to the probes to catch their attention. Both queries and keys are the vectors of the same length $d$."],"metadata":{"id":"hK8LT_Xh_FyD"},"id":"hK8LT_Xh_FyD"},{"cell_type":"markdown","source":["**Step 2.** The actual act of \"paying attention\" is performed by computing the scalar product\n","\n","$$\n","\\langle q_i, k_j\\rangle = q_{i1}k_{j1} + \\ldots + q_{id}k_{jd},\n","$$\n","\n","which tells us how much attention the $i$-th token will pay to the $j$-th token."],"metadata":{"id":"ihIdBhRu_QZx"},"id":"ihIdBhRu_QZx"},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"aKX4W77F_RPC"},"id":"aKX4W77F_RPC"},{"cell_type":"markdown","source":["**Step 3.** Now that all tokens have decided how much attention they should pay to each other, they’re ready to take the **values** $v_i = W_Vx_i$ and use the information inside them. Each token updates its embedding like so:\n","\n","\n","$$\n","x'_i = x_i + w_{i1}v_1 + w_{i2}v_2 + \\ldots + w_{iN}v_N,\n","$$\n","\n","\n","where $N$ is the sequence length and $w_{ij}$ are called **attention weights**, and these are computed this way:\n","\n","$$\n","(w_{i1}, \\ldots, w_{iN}) = \\text{Softmax}\\left(\\frac{\\langle q_i, k_1\\rangle}{\\sqrt{d}}, \\ldots, \\frac{\\langle q_i, k_N\\rangle}{\\sqrt{d}}\\right)\n","$$"],"metadata":{"id":"wEqYfJsNAOzp"},"id":"wEqYfJsNAOzp"},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"RezlrhZ6Apws"},"id":"RezlrhZ6Apws"},{"cell_type":"markdown","source":["So, we mix the information provided in the values with the coefficients that show how much attention we should pay for each of the tokens."],"metadata":{"id":"EJe4rzJToOd2"},"id":"EJe4rzJToOd2"},{"cell_type":"markdown","id":"68b4eaed","metadata":{"id":"68b4eaed"},"source":["**Why do we need Softmax?** Softmax is the function we mentioned in the previous lesson:\n","\n","$$\n","\\text{Softmax}(z_1,\\ldots,z_N) = \\left(\\frac{e^{z_1}}{\\sum_je^{z_j}}, \\ldots, \\frac{e^{z_N}}{\\sum_je^{z_j}}\\right)\n","$$\n","\n","It turns a tuple of numbers into a valid probability distribution, that is, a tuple of non-negative numbers that sum to 1.\n","\n","This helps the attention layers mix the values without changing the overall scale."]},{"cell_type":"markdown","source":["Here’s an overview diagram of the whole attention concept:"],"metadata":{"id":"O6Q4DD8noZdy"},"id":"O6Q4DD8noZdy"},{"cell_type":"markdown","source":["
"],"metadata":{"id":"PNqVjFzub9GV"},"id":"PNqVjFzub9GV"},{"cell_type":"markdown","source":["### Transformer Block"],"metadata":{"id":"S9AOz3-lmUSg"},"id":"S9AOz3-lmUSg"},{"cell_type":"markdown","source":["There are several other important parts in the architecture that help the attention mechanism process information better. Let’s discuss those.\n","\n","**Feed-forward layer:** This layer has three sublayers applied independently to each of the embeddings:\n","\n","1. Linear transformation\n","2. Non-linearity (ReLU in the original Transformer)\n","3. Linear transformation (again)\n","\n","This block is added after each attention block and allows the model to do some additional processing of the information gathered in between various tokens.\n","\n","**Residual connections:** After applying attention to the embeddings, we add the results to the initial $x_i$ to obtain $x_i'$ (without any weight). This has proven to be beneficial for training, and intuitively it allows us to pass more information about the initial vectors through the pipeline with the new information received from the attention mechanism."],"metadata":{"id":"jNjTaanImeyg"},"id":"jNjTaanImeyg"},{"cell_type":"markdown","source":["\n","
\n",""],"metadata":{"id":"s67Ury7YjWcF"},"id":"s67Ury7YjWcF"},{"cell_type":"markdown","source":["The mechanism of adding the initial vectors to the transformed ones is known as a **residual connection**, or **skip connection**, the same thing is also applied to the feed-forward layer.\n","\n","**Normalization:** There is one more technique that helps train this model better: normalizing the embedding vectors between meaningful layers. We won't dive too deep into how it works, but the core idea is to make the entire set of all embeddings have zero mean and unit variance. This helps the layer at least know the scale of values in order to expect them and help learn faster.\n","\n","All together, these things form a **transformer** building block, that we’ll later just copy and stack to get the full thing:\n"],"metadata":{"id":"s_MAK6h_otH4"},"id":"s_MAK6h_otH4"},{"cell_type":"markdown","source":["
"],"metadata":{"id":"o9wbkiDMngK2"},"id":"o9wbkiDMngK2"},{"cell_type":"markdown","source":["### Token Embeddings"],"metadata":{"id":"1sPt_pqSAbMl"},"id":"1sPt_pqSAbMl"},{"cell_type":"markdown","source":["As we said before, the input tokens are transformed into **embeddings** by just creating a dictionary of embedding vectors for each token and training them.\n","\n","However, perhaps you’ve noticed that the order of tokens doesn't matter anywhere in the transformer architecture, so we need an additional mechanism to account for them. This is done with special **positional embeddings **which are added to the initial token embeddings before feeding them to the encoder:"],"metadata":{"id":"aKAdf7Qwo0K7"},"id":"aKAdf7Qwo0K7"},{"cell_type":"markdown","id":"f93fd861","metadata":{"id":"f93fd861"},"source":["\n","
\n","\n"]},{"cell_type":"markdown","source":["### Overall Encoder Architecture\n"],"metadata":{"id":"aMM1q3LBdAAO"},"id":"aMM1q3LBdAAO"},{"cell_type":"markdown","source":["To get a transformer encoder we just need to stack several transformer blocks. Here is how the full thing looks like:"],"metadata":{"id":"4E2dfnbAo914"},"id":"4E2dfnbAo914"},{"cell_type":"markdown","id":"be4e6097","metadata":{"id":"be4e6097"},"source":["\n","
\n","\n"]},{"cell_type":"markdown","source":["_Note: We omitted several things crucial for training but not that important for your first acquaintance with the material._"],"metadata":{"id":"Ygccv10GpCAK"},"id":"Ygccv10GpCAK"},{"cell_type":"markdown","source":["## Classification with Transformers"],"metadata":{"id":"bK2RrMV3Ad98"},"id":"bK2RrMV3Ad98"},{"cell_type":"markdown","source":["In the previous section, we built a transformer encoder that takes a sequence of token embeddings as input and produces a new sequence of **output embeddings**. One last step is missing: how do we convert this sequence of output embeddings into label prediction for our classification task?\n","\n","The idea is to summarize all the information in those output embeddings and pass it into a linear classifier. We’ll take the following steps:\n","\n","\n","1. Take the average of all the output embeddings vectors (this is called **pooling**)\n","2. Add a `Linear(d, k)` layer where d is the dimension of the output embeddings and k is the number of classes\n","3. At the end, add `Softmax()`\n"],"metadata":{"id":"3HZROUPDpJRY"},"id":"3HZROUPDpJRY"},{"cell_type":"markdown","id":"10012dad","metadata":{"id":"10012dad"},"source":["
"]},{"cell_type":"markdown","source":["The resulting network is ready to predict class probabilities and can be trained with cross entropy loss and gradient descent as usual.\n","\n","Now that we understand how transformers work, let's do some training! However, training the whole transformer model from scratch is too computationally heavy, so we’ll need some other approach."],"metadata":{"id":"o4DKDY9wZQ3B"},"id":"o4DKDY9wZQ3B"},{"cell_type":"markdown","id":"886c0915","metadata":{"id":"886c0915"},"source":["# 4. Fine Tuning with Hugging Face"]},{"cell_type":"markdown","source":["If training a large model isn’t feasible, there’s another approach we can use that is called **fine-tuning**. The idea behind this method is taking a model that’s already been trained on another similar dataset (this is called a **pre-trained** **model**) and then do some additional training on our dataset.\n","\n","We’ll discuss pre-training and fine-tuning further in the following lessons, but for now, let's just take some already trained model and play with it. As we've already said, the model of choice for this lesson is called **BERT** (Bidirectional Encoder Representations from Transformers), which is just a good, reliable language model from Google that can be fine-tuned on any text task. As we see from its name, it’s a transformer-based model, with some tweaks that are not important for our task."],"metadata":{"id":"WP5wwzlaDDqh"},"id":"WP5wwzlaDDqh"},{"cell_type":"markdown","source":["## Before Training"],"metadata":{"id":"9DG27lBYLYZ6"},"id":"9DG27lBYLYZ6"},{"cell_type":"markdown","source":["Even for fine-tuning, we’ll need to use the GPU, otherwise training will take ages. Here’s the code to check if the GPU is available:"],"metadata":{"id":"cazLxtNpLdcj"},"id":"cazLxtNpLdcj"},{"cell_type":"code","source":["import torch\n","\n","# save the best available device to a variable to use over all the following code\n","# GPU is crucial for training, but we can do debugging on CPU\n","device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","device"],"metadata":{"id":"Dj3WUi9G3zT8"},"id":"Dj3WUi9G3zT8","execution_count":null,"outputs":[]},{"cell_type":"markdown","id":"izM-qkxqc9W8","metadata":{"id":"izM-qkxqc9W8"},"source":["Let's load the model (which already contains both the transformer and classifier) and specify the number of classes we need to predict:"]},{"cell_type":"code","execution_count":null,"id":"Y3oLiHuQF2Tn","metadata":{"id":"Y3oLiHuQF2Tn"},"outputs":[],"source":["from transformers import AutoModelForSequenceClassification\n","\n","# create and load the model\n","model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\", num_labels=5)\n","\n","# move the model to the device set up above\n","model.to(device)"]},{"cell_type":"markdown","source":["The auto-models in Hugging Face help us quickly create and load models, and in the code above we specify that:\n","\n","* The model type is \"sequence classification\"\n","* The pretrained encoder for the model is BERT\n","* The classification head has 5 outputs (5 classes) and it is not trained (not loading any weights)\n","\n","With the model ready, we need to set up the training. This is done using the `Trainer` class in transformers, which in turn needs a `TrainingArguments` object for initialization. 
All those classes are necessary because there are a lot of ways the training can be customized and tweaked, and this is the way the Hugging Face has provided an interface for this.\n","\n","Here’s a simple way of initializing those classes – let's take a look and then talk about what some of these arguments mean:"],"metadata":{"id":"pNnQ5inLE9Ej"},"id":"pNnQ5inLE9Ej"},{"cell_type":"code","source":["from transformers import TrainingArguments, Trainer\n","\n","# we will explain what is the batch size below\n","BATCH_SIZE = 16\n","\n","training_args = TrainingArguments(\n"," output_dir='./output', # a place to save some training stuff, not important to us\n"," per_device_train_batch_size=BATCH_SIZE, # just special names for batch size, not important to us\n"," per_device_eval_batch_size=BATCH_SIZE,\n"," num_train_epochs=5, # we will explain what epoch is below\n",")\n","\n","trainer = Trainer(\n"," model=model,\n"," args=training_args,\n"," train_dataset=small_train_dataset_tok,\n"," eval_dataset=small_eval_dataset_tok,\n",")"],"metadata":{"id":"nlEHxGp_MWZe"},"id":"nlEHxGp_MWZe","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's understand two very important hyperparameters of the training: **batch size** and **number of epochs**. First, a quick refresher about our optimization algorithm, **gradient descent**:\n","\n","\n","1. Choose a random starting value for the weight vector $w$.\n","\n","2. Repeat $N$ times:\n","\n"," - Compute the loss function $L(w)$ and its gradient $\\nabla L(w)$.\n","\n"," - Update the weights by moving them into the direction of negative gradient, multiplying it by learning rate:\n","\n"," $$\n"," w = w - \\alpha \\nabla L(w)\n"," $$\n","\n","The thing is, in practice, there’s a problem with this algorithm: if a dataset is very big, computing the loss function and the gradient on the whole dataset will take a lot of time. Luckily, there is a simple trick that can speed this up dramatically – instead of computing the gradient of the loss function on the entire dataset, we can approximate it on just a small random subset (called a **batch**), and get a new subset on each step.\n","\n","This new algorithm is called **stochastic gradient descent** or **SGD**. In practice, instead of choosing a new random subset on each step, we usually just shuffle the whole dataset, split it into batches, and go over them one by one. If we run out of batches and want to do more steps, we just reshuffle and re-split the dataset. Each pass over the whole dataset is called an **epoch**."],"metadata":{"id":"AIV0Gr53N1BN"},"id":"AIV0Gr53N1BN"},{"cell_type":"markdown","id":"p7UZVUygdTFQ","metadata":{"id":"p7UZVUygdTFQ"},"source":["Here are some important points on how to organize training with these hyperparameters:\n","\n","\n","\n","* **Role of batch size $B$: computation vs. stability**: The larger $B$, the more stable the optimization. Sometimes, you just can't achieve anything meaningful with too small of a batch size. However, with a large $B$, the optimization can simply fail due to memory exhaustion. So, sadly, you can find that some models just won't train at all on your hardware. There are some hacks, like gradient checkpointing, to tackle this, but it goes beyond the scope of this course.\n","* **Role of number of epochs**: We usually need several epochs (and probably lots of them) to train a model. However, with small data a model can overfit after too many epochs. 
(And during fine tuning you usually don't have too much data.)\n","* **Monitoring in progress**: The training process can take hours (or even days), so you probably want to monitor its intermediate steps. Usually this is done after each epoch on a dedicated **evaluation dataset**. It’s important to perform this independent evaluation.\n","* **Algorithm modifications**: If you just ran SGD on a real world task with a realistically small batch size, the training would most likely fail. Additional modifications and hacks are needed to ensure (or at least to get closer to) stable training – there are many SGD-based methods, such as _AdaGrad_ or _Adam_. However, all of them still have the overall structure of the same batch-and-epoch process.\n"]},{"cell_type":"markdown","source":["
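The `Trainer` takes care of all of this for us, but the batch-and-epoch loop itself is short. Here is a rough NumPy sketch on a toy least-squares problem (standing in for the real model and loss, so only the structure of the loop carries over):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                        # toy dataset: 1000 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)                                       # start from some initial weights
BATCH_SIZE, NUM_EPOCHS, lr = 16, 5, 0.05

for epoch in range(NUM_EPOCHS):                       # one epoch = one pass over the data
    order = rng.permutation(len(X))                   # reshuffle before every epoch
    for start in range(0, len(X), BATCH_SIZE):        # go over the batches one by one
        idx = order[start:start + BATCH_SIZE]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # gradient of the loss on this batch
        w -= lr * grad                                # move against the gradient
    print(f"epoch {epoch}: loss {((X @ w - y) ** 2).mean():.4f}")
```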
"],"metadata":{"id":"j6rbTXhM4nBP"},"id":"j6rbTXhM4nBP"},{"cell_type":"markdown","source":["## Model Training\n"],"metadata":{"id":"89IZFuVaXVTN"},"id":"89IZFuVaXVTN"},{"cell_type":"markdown","source":["Before running the training, let's set up some more important things.\n","\n","First, let's set up an accuracy metric (the ratio of the correct answers) in the format that the trainer can use for proper logging."],"metadata":{"id":"Qt5vS68zXVbx"},"id":"Qt5vS68zXVbx"},{"cell_type":"code","execution_count":null,"id":"pNjMmTCLF2V_","metadata":{"id":"pNjMmTCLF2V_"},"outputs":[],"source":["import numpy as np\n","from datasets import load_metric\n","\n","def compute_metrics(eval_pred):\n"," accuracy = load_metric(\"accuracy\", trust_remote_code=True)\n","\n"," logits, labels = eval_pred\n"," predictions = np.argmax(logits, axis=-1)\n"," acc = accuracy.compute(predictions=predictions, references=labels)[\"accuracy\"]\n"," return {\"accuracy\": acc}"]},{"cell_type":"markdown","source":["Now, let's set up the model and the trainer again, adding some more useful parameters.\n"],"metadata":{"id":"1jkuL9nQZCAd"},"id":"1jkuL9nQZCAd"},{"cell_type":"code","source":["# Important: you need to re-run this line before each training to reset weights.\n","# Otherwise the model will continue training from previously saved weights.\n","model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased\", num_labels=5).to(device)\n","\n","BATCH_SIZE = 16\n","\n","training_args = TrainingArguments(\n"," output_dir='./output',\n"," overwrite_output_dir=True,\n"," per_device_train_batch_size=BATCH_SIZE,\n"," per_device_eval_batch_size=BATCH_SIZE,\n"," num_train_epochs=5,\n"," # these lines will evaluate our model every 50 steps, so that we can see the progress\n"," logging_steps=50,\n"," evaluation_strategy=\"steps\",\n",")\n","\n","trainer = Trainer(\n"," model=model,\n"," args=training_args,\n"," train_dataset=small_train_dataset_tok,\n"," eval_dataset=small_eval_dataset_tok,\n"," compute_metrics=compute_metrics,\n",")"],"metadata":{"id":"YFE_4CKjY-6j"},"id":"YFE_4CKjY-6j","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["There are many more parameters in `TrainingArguments`, and if you’re interested you can see the full list of options in the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)\n","\n","Finally, let's train the model!"],"metadata":{"id":"T5cE9RecOBRn"},"id":"T5cE9RecOBRn"},{"cell_type":"code","source":["trainer.train()"],"metadata":{"id":"dvh2e3X3D15_"},"id":"dvh2e3X3D15_","execution_count":null,"outputs":[]},{"cell_type":"markdown","id":"c2QovdJodqPd","metadata":{"id":"c2QovdJodqPd"},"source":["We can see the metrics in the log, but let's also evaluate the final model:"]},{"cell_type":"code","execution_count":null,"id":"F-eMVf1DtmOH","metadata":{"id":"F-eMVf1DtmOH"},"outputs":[],"source":["trainer.evaluate()"]},{"cell_type":"markdown","source":["As you can see, the model improves over time, and the accuracy is rather good. However, in the previous lesson, we managed to get 60% with just the linear model, so this is kind of disappointing. Now, we’re obviously quite limited by the computational power here, but maybe it’s possible to get the quality just a little bit higher than the linear model? 
(We’ll see the full power of the model in the next section).\n","\n","Note: The quality above may already be above the linear model due to some randomness, so we’ll push it up a little further still in the exercise below."],"metadata":{"id":"KkPy4dCt8JPc"},"id":"KkPy4dCt8JPc"},{"cell_type":"markdown","source":["**Optional exercise.** Play with the training code and try to get the quality to at least 62% within a reasonable training time (less than 30 minutes).\n","\n","\n"," Hint (click to expand).
\n"," \n","While it’s possible to get the desired quality just by playing with the parameter values in the code above, there’s actually an implicit parameter that can help dramatically improve the quality!\n","
\n"," \n","\n","\n"," Hint 2 (click to expand).
\n"," \n","If the dataset is too big for training, going over the same examples several times doesn't make sense.\n","
\n"," "],"metadata":{"id":"w21mkHSSbTUI"},"id":"w21mkHSSbTUI"},{"cell_type":"code","source":["# your code here\n","\n","\n","\n","\n"],"metadata":{"id":"vcSxAQgJdPp-"},"id":"vcSxAQgJdPp-","execution_count":null,"outputs":[]},{"cell_type":"code","source":["trainer.train()"],"metadata":{"id":"xmS0pNLmdsRm"},"id":"xmS0pNLmdsRm","execution_count":null,"outputs":[]},{"cell_type":"code","source":["trainer.evaluate()"],"metadata":{"id":"VKS3I2iFduw-"},"id":"VKS3I2iFduw-","execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Running A Pre-Trained Model"],"metadata":{"id":"q8lqjyuwBY9A"},"id":"q8lqjyuwBY9A"},{"cell_type":"markdown","id":"w5OdZlFHEWxo","metadata":{"id":"w5OdZlFHEWxo"},"source":["Training a model from scratch, or fine-tuning one for a new large dataset of a different nature, will always take time (both in terms of engineering and compute). It will also require a lot of hardware, efficient pipelines, and a healthy amount of patience!\n","\n","Let's make use of a small hack and take a model that was pre-trained by an enthusiast on the same dataset for solving the sentiment analysis on Yelp data (and later published for community), and we’ll apply it to our test part, then evaluate the performance.\n","\n","(This is another great thing of the Hugging Face community – one can find a lot of open source models of different sizes, trained on various datasets for solving different tasks.)"]},{"cell_type":"code","execution_count":null,"id":"9OA8cre9G_Of","metadata":{"id":"9OA8cre9G_Of"},"outputs":[],"source":["from transformers import AutoTokenizer, AutoModelForSequenceClassification\n","\n","tuned_tokenizer = AutoTokenizer.from_pretrained(\"gilf/english-yelp-sentiment\")\n","tuned_model = AutoModelForSequenceClassification.from_pretrained(\"gilf/english-yelp-sentiment\").to(device)"]},{"cell_type":"markdown","id":"l8iQusLTGqS6","metadata":{"id":"l8iQusLTGqS6"},"source":["Notice that tokenizer and model are the same as before, just trained for longer time."]},{"cell_type":"code","execution_count":null,"id":"podRRL0P21hs","metadata":{"id":"podRRL0P21hs"},"outputs":[],"source":["tuned_tokenizer"]},{"cell_type":"code","execution_count":null,"id":"nbkeWJN53KQm","metadata":{"id":"nbkeWJN53KQm"},"outputs":[],"source":["tuned_model"]},{"cell_type":"markdown","source":["**Exercise.** Evaluate this model on a subset of random 1000 samples from the test dataset. You can do this using an API or write your own loop, choose the approach you prefer.\n","\n","**Important:** You cannot reuse the tokenized subset from the previous section because the tokenizer is different."],"metadata":{"id":"DdSELS53fyhg"},"id":"DdSELS53fyhg"},{"cell_type":"code","source":["# your code here\n","\n","\n","\n","\n"],"metadata":{"id":"iwh7Cli0gaH2"},"id":"iwh7Cli0gaH2","execution_count":null,"outputs":[]},{"cell_type":"markdown","id":"NQTHdJJuHmYH","metadata":{"id":"NQTHdJJuHmYH"},"source":["That's it! Pretty competitive quality for the vague task predicting sentiment on the scale from 1 to 5. However, there is always room for improvement: for instance, we could take a much bigger model and train it!"]},{"cell_type":"markdown","id":"8987f023","metadata":{"id":"8987f023"},"source":["# 5. 
Conclusion"]},{"cell_type":"markdown","source":["Let's wrap up what we've learned.\n","\n","First, there was a lot of theory this time:\n","\n","\n","\n","* Why using linear models isn’t enough for most tasks and why we need complex features to solve these tasks.\n","* How the concepts of **neural networks** and **deep learning** emerged and how we can train the classifier and features simultaneously.\n","* How the transformer model works and can be applied for text classification.\n","\n","Then, we had practice with some code:\n","\n","\n","\n","* We explored the Hugging Face libraries for working with tokenization and transformers.\n","* We **fine-tuned** a sequence classifier based on the pre-trained **BERT encoder**.\n","* We loaded the trained fine-tuned model and compared the quality of all the models.\n","\n","Now that we've learned the basics about neural networks and NLP, in the [**next lesson**](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_4.ipynb), we’ll talk about **ChatGPT**: how it works, and how we can use it in different tasks."],"metadata":{"id":"y7H9BT64jTxm"},"id":"y7H9BT64jTxm"},{"cell_type":"markdown","source":["If you’re interested, here are some additional resources on the things we talked about in this lesson:\n","\n","* A great [NLP Course by Lena Voita](https://lena-voita.github.io/nlp_course.html), with a dedicated [section for attention and transformers](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html).\n","* [NLP course by Hugging Face](https://huggingface.co/course/chapter1/2?fw=pt).\n","* [A tutorial](https://huggingface.co/docs/datasets/tutorial) on working with datasets using Hugging Face\n","* [A great visualization](https://3.bp.blogspot.com/-aZ3zvPiCoXM/WaiKQO7KRnI/AAAAAAAAB_8/7a1CYjp40nUg4lKpW7covGZJQAySxlg8QCLcBGAs/s640/transform20fps.gif) of the attention idea from the [blog post from Google](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html).\n","* [Tokenization internals in the Hugging Face](https://huggingface.co/learn/nlp-course/chapter6/1?fw=pt)."],"metadata":{"id":"pBrjuKTMjS3W"},"id":"pBrjuKTMjS3W"},{"cell_type":"code","execution_count":null,"id":"01a0b711","metadata":{"id":"01a0b711"},"outputs":[],"source":[]}],"metadata":{"colab":{"provenance":[],"gpuType":"T4"},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.8"},"accelerator":"GPU"},"nbformat":4,"nbformat_minor":5}
2 |
--------------------------------------------------------------------------------