├── decks ├── _all.apkg ├── 01_intro.apkg ├── 10_nlp.apkg ├── 03_ethics.apkg ├── 08_collab.apkg ├── 09_tabular.apkg ├── 02_production.apkg ├── 05_pet_breeds.apkg ├── 06_multicat.apkg ├── 12_nlp_dive.apkg ├── 04_mnist_basics.apkg ├── 13_convolutions.apkg └── 11_midlevel_data.apkg ├── src ├── img │ ├── 02-crop.png │ ├── 02-pad.png │ ├── 04-relu.png │ ├── 05-log.png │ ├── 02-squish.png │ ├── 01-ml-steps.png │ ├── 04-sigmoid.png │ ├── 12-stacked-rnn.png │ ├── 02-drivetrain-steps.png │ └── 08-collab-filtering-data.png ├── 18_CAM.md ├── 14_resnet.md ├── 19_learner.md ├── 16_accel_sgd.md ├── 15_arch_details.md ├── 17_foundations.md ├── 07_sizing_and_tta.md ├── 11_midlevel_data.md ├── 06_multicat.md ├── 05_pet_breeds.md ├── 10_nlp.md ├── 08_collab.md ├── 09_tabular.md ├── 13_convolutions.md ├── 03_ethics.md ├── 12_nlp_dive.md ├── 02_production.md ├── 04_mnist_basics.md └── 01_intro.md ├── .vscode └── settings.json ├── docs ├── pull_request_template.md ├── flashcards_deleted.md └── progress.md ├── CONTRIBUTING.md ├── CODE_OF_CONDUCT.md ├── README.md └── LICENSE /decks/_all.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/_all.apkg -------------------------------------------------------------------------------- /decks/01_intro.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/01_intro.apkg -------------------------------------------------------------------------------- /decks/10_nlp.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/10_nlp.apkg -------------------------------------------------------------------------------- /src/img/02-crop.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/02-crop.png -------------------------------------------------------------------------------- /src/img/02-pad.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/02-pad.png -------------------------------------------------------------------------------- /src/img/04-relu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/04-relu.png -------------------------------------------------------------------------------- /src/img/05-log.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/05-log.png -------------------------------------------------------------------------------- /decks/03_ethics.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/03_ethics.apkg -------------------------------------------------------------------------------- /decks/08_collab.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/08_collab.apkg -------------------------------------------------------------------------------- /decks/09_tabular.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/09_tabular.apkg -------------------------------------------------------------------------------- /src/img/02-squish.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/02-squish.png -------------------------------------------------------------------------------- /decks/02_production.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/02_production.apkg -------------------------------------------------------------------------------- /decks/05_pet_breeds.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/05_pet_breeds.apkg -------------------------------------------------------------------------------- /decks/06_multicat.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/06_multicat.apkg -------------------------------------------------------------------------------- /decks/12_nlp_dive.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/12_nlp_dive.apkg -------------------------------------------------------------------------------- /src/img/01-ml-steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/01-ml-steps.png -------------------------------------------------------------------------------- /src/img/04-sigmoid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/04-sigmoid.png -------------------------------------------------------------------------------- /decks/04_mnist_basics.apkg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/04_mnist_basics.apkg -------------------------------------------------------------------------------- /decks/13_convolutions.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/13_convolutions.apkg -------------------------------------------------------------------------------- /src/img/12-stacked-rnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/12-stacked-rnn.png -------------------------------------------------------------------------------- /decks/11_midlevel_data.apkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/decks/11_midlevel_data.apkg -------------------------------------------------------------------------------- /src/img/02-drivetrain-steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/02-drivetrain-steps.png -------------------------------------------------------------------------------- /src/img/08-collab-filtering-data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanielDamours/fastai-flashcards/HEAD/src/img/08-collab-filtering-data.png -------------------------------------------------------------------------------- /src/18_CAM.md: -------------------------------------------------------------------------------- 1 | # 18_CAM 2 | 3 | ## Hello :) 4 | 5 | Feel free to [contribute](https://github.com/NathanielDamours/fastai-flashcards/blob/main/CONTRIBUTING.md) 
6 | -------------------------------------------------------------------------------- /src/14_resnet.md: -------------------------------------------------------------------------------- 1 | # 14_resnet 2 | 3 | ## Hello :) 4 | 5 | Feel free to [contribute](https://github.com/NathanielDamours/fastai-flashcards/blob/main/CONTRIBUTING.md) 6 | -------------------------------------------------------------------------------- /src/19_learner.md: -------------------------------------------------------------------------------- 1 | # 19_learner 2 | 3 | ## Hello :) 4 | 5 | Feel free to [contribute](https://github.com/NathanielDamours/fastai-flashcards/blob/main/CONTRIBUTING.md) 6 | -------------------------------------------------------------------------------- /src/16_accel_sgd.md: -------------------------------------------------------------------------------- 1 | # 16_accel_sgd 2 | 3 | ## Hello :) 4 | 5 | Feel free to [contribute](https://github.com/NathanielDamours/fastai-flashcards/blob/main/CONTRIBUTING.md) 6 | -------------------------------------------------------------------------------- /src/15_arch_details.md: -------------------------------------------------------------------------------- 1 | # 15_arch_details 2 | 3 | ## Hello :) 4 | 5 | Feel free to [contribute](https://github.com/NathanielDamours/fastai-flashcards/blob/main/CONTRIBUTING.md) 6 | -------------------------------------------------------------------------------- /src/17_foundations.md: -------------------------------------------------------------------------------- 1 | # 17_foundations 2 | 3 | ## Hello :) 4 | 5 | Feel free to [contribute](https://github.com/NathanielDamours/fastai-flashcards/blob/main/CONTRIBUTING.md) 6 | -------------------------------------------------------------------------------- /src/07_sizing_and_tta.md: -------------------------------------------------------------------------------- 1 | # 07_sizing_and_tta 2 | 3 | ## Hello :) 4 | 5 | Feel free to 
[contribute](https://github.com/NathanielDamours/fastai-flashcards/blob/main/CONTRIBUTING.md) 6 | -------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "pasteImage.path": "${currentFileDir}/img", 3 | "[markdown]": { 4 | "editor.defaultFormatter": "DavidAnson.vscode-markdownlint", 5 | "editor.formatOnSave": true, 6 | "editor.formatOnPaste": true 7 | } 8 | } -------------------------------------------------------------------------------- /docs/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ## What does this PR do? 2 | 3 | 6 | 7 | ### Checklist 8 | 9 | 10 | 11 | - [ ] My code follows the [contributing guidelines](../CONTRIBUTING.md#guidelines) 12 | - [ ] I have added my sources for each new card 13 | - [ ] I have updated [progress.md](docs/../progress.md) 14 | -------------------------------------------------------------------------------- /docs/flashcards_deleted.md: -------------------------------------------------------------------------------- 1 | # Flashcards Deleted 2 | 3 | ❌ Some questions do not fit well on flashcards and are therefore removed. 4 | 5 | ## 1 - Your Deep Learning Journey 6 | 7 | - Follow through each cell of the stripped version of the notebook for this chapter. Before executing each cell, guess what will happen. 8 | - Complete the Jupyter Notebook online appendix. 9 | 10 | ## 2 - From Model to Production 11 | 12 | - Create an image recognition model using data you curate, and deploy it on the web. 13 | 14 | ## 3 - Data Ethics 15 | 16 | - In the paper ["Does Machine Learning Automate Moral Hazard and Error?"](https://scholar.harvard.edu/files/sendhil/files/aer.p20171084.pdf), why is sinusitis found to be predictive of a stroke? 
17 | 18 | ## 5 - Image Classification 19 | 20 | - If you are not familiar with regular expressions, find a regular expression tutorial, and some problem sets, and complete them. Have a look on the book's website for suggestions. 21 | - Look up the documentation for `L` and try using a few of the new methods that it adds. 22 | - Look up the documentation for the Python `pathlib` module and try using a few methods of the `Path` class. 23 | - Calculate the `exp` and `softmax` columns of <> yourself (i.e., in a spreadsheet, with a calculator, or in a notebook). 24 | 25 | ## 8 - Collaborative Filtering Deep Dive 26 | 27 | - The "Why" part of "What is a latent factor? Why is it 'latent'?" 28 | - Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!). 29 | - Create a class (without peeking, if possible!) and use it. 30 | - The "train a model with it" part of "Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it." 31 | - What is a good loss function to use for MovieLens? Why? 32 | - What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model? 33 | - What is another name for weight decay? 34 | 35 | ## 9 - Tabular Modeling Deep Dive 36 | 37 | - What is a continuous variable? 38 | - What is a categorical variable? 39 | - Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick? 40 | - In the section "Creating a Random Forest", just after <>, why did `preds.mean(0)` give the same result as our random forest? 41 | - Why do we ensure `saleElapsed` is a continuous variable, even although it has less than 9,000 distinct values? 
42 | 43 | ## 11 - Data Munging with fastai's Mid-Level API 44 | 45 | - Write a `Transform` that does the numericalization of tokenized texts (it should set its vocab automatically from the dataset seen and have a `decode` method) 46 | - Why can we easily apply fastai data augmentation transforms to the `SiamesePair` we built? 47 | 48 | ## 12 - A Language Model from Scratch 49 | 50 | - Write a module which predicts the third word given the previous two words of a sentence, without peeking 51 | - Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <>. 52 | - Experiment with `bernoulli_` to understand how it works. 53 | - Study the refactored version of `LSTMCell` carefully to ensure you understand how and why it does the same thing as the non-refactored version. 54 | 55 | ## 13 - Convolutional Neural Networks 56 | 57 | - "Where does it need to be included in the MNIST CNN? Why?" part of "What is `Flatten`? Where does it need to be included in the MNIST CNN? Why?" 58 | - Run *conv-example.xlsx* yourself and experiment with *trace precedents* 59 | - Have a look at Jeremy or Sylvain's list of recent Twitter "like"s, and see if you find any interesting resources or ideas there. 60 | -------------------------------------------------------------------------------- /src/11_midlevel_data.md: -------------------------------------------------------------------------------- 1 | # 11_midlevel_data 2 | 3 | ## What does fastai's *layered* API refer to? 4 | 5 | It refers to the **levels of fastai's API**: 6 | 7 | - Fastai's high-level API that allows you to train neural networks for common applications with just a few lines of code 8 | - Fastai's lower-level APIs that are more flexible and better for custom tasks 9 | 10 | ## Why does a `Transform` have a `decode` method? 11 | 12 | To allow us to **reverse** (if possible) the application of the transform. 
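For illustration, here is a plain-Python sketch of such an encode/decode pair (a hypothetical example, not fastai's actual `Transform` implementation): a category map that turns labels into indices and can reverse the mapping.

```python
# Hypothetical sketch of why a transform wants a `decode` method:
# `encode` maps a label to an index, and `decode` reverses it.
class CategoryMap:
    def __init__(self, vocab):
        self.vocab = list(vocab)                             # index -> label
        self.o2i = {v: i for i, v in enumerate(self.vocab)}  # label -> index

    def encode(self, label):
        return self.o2i[label]

    def decode(self, idx):
        return self.vocab[idx]  # reverses `encode`

cmap = CategoryMap(["cat", "dog", "fish"])
idx = cmap.encode("dog")
print(idx, cmap.decode(idx))  # -> 1 dog
```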
13 | 14 | ## How is `decode` often used? 15 | 16 | To convert predictions and mini-batches into **human**-understandable representations. 17 | 18 | ## Why does a `Transform` have a `setup` method? 19 | 20 | Because sometimes it is necessary to **initialize some inner state**, like the vocabulary for a tokenizer. The `setup` method handles this. 21 | 22 | ## How does a `Transform` work when called on a tuple? 23 | 24 | The `Transform` is always applied to each item of the tuple. If a type annotation is provided, the `Transform` is only applied to the items with the correct type. 25 | 26 | ## Which methods do you need to implement when writing your own `Transform`? 27 | 28 | Just the `encodes` method, and optionally the `decodes` method for it to be reversible, and `setups` for initializing an inner state. 29 | 30 | ## What is the operation that normalizes the items? 31 | 32 | ```py 33 | x = (x - x.mean()) / x.std() 34 | ``` 35 | 36 | ## Write a `Normalize` transform that fully normalizes items, and that can decode that behavior 37 | 38 | ```py 39 | class Normalize(Transform): 40 | def setups(self, items): 41 | self.mean = items.mean() 42 | self.std = items.std() 43 | 44 | def encodes(self, items): 45 | return (items - self.mean) / self.std 46 | 47 | def decodes(self, items): 48 | return items * self.std + self.mean 49 | ``` 50 | 51 | ## What is a `Pipeline`? 52 | 53 | The `Pipeline` class is meant for **composing several transforms together**. When you call `Pipeline` on an object, it will automatically call the transforms inside, in order. 54 | 55 | ```py 56 | >>> tfms = [RandomResizedCrop(224), FlipItem(0.5)] 57 | >>> Pipeline(tfms) 58 | Pipeline: FlipItem -- {'p': 0.5} -> RandomResizedCrop -- {'size': (224, 224), 'min_scale': 0.08, 'ratio': (0.75, 1.3333333333333333), 'resamples': (, ), 'val_xtra': 0.14, 'max_scale': 1.0, 'p': 1.0} 59 | ``` 60 | 61 | ## What is a `TfmdLists`? 62 | 63 | A **`Pipeline` of `Transform`s** applied to a collection of items. 
64 | 65 | [Source](https://docs.fast.ai/data.core.html#tfmdlists) 66 | 67 | ## What is a `Datasets`? 68 | 69 | A dataset that creates a tuple from each `tfms` (a list of `Transform`(s) or `Pipeline` to apply). 70 | 71 | [Source](https://github.com/fastai/fastai/blob/50c8a760bf4a1bfb2c24ef00b5c100a3d55b4389/fastai/data/core.py#L429) 72 | 73 | ## How is a `Datasets` different from a `TfmdLists`? 74 | 75 | `Datasets` will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the result. This is different from `TfmdLists`, which leads to two separate objects for the input and target. 76 | 77 | ## Why are `TfmdLists` and `Datasets` named with an "s"? 78 | 79 | Because they can handle a training and a validation set with the `splits` argument. 80 | 81 | ## How can you build a `DataLoaders` from a `TfmdLists` or a `Datasets`? 82 | 83 | You can call the `dataloaders` method. 84 | 85 | ## How do you pass `item_tfms` and `batch_tfms` when building a `DataLoaders` from a `TfmdLists` or a `Datasets`? 86 | 87 | You can pass `after_item` and `after_batch`, respectively, as arguments to the `dataloaders` method. 88 | 89 | ## What do you need to do when you want to have your custom items work with methods like `show_batch` or `show_results`? 90 | 91 | You need to create a **custom type with a `show` method**, since `TfmdLists`/`Datasets` will decode the items until it reaches a type with a `show` method. 92 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | - [Contributing](#contributing) - [🛠️ How to contribute](#️-how-to-contribute) - [🧭 Where to start?](#-where-to-start) - [📢 Guidelines](#-guidelines) - [Cards' format](#cards-format) - [Images](#images) - [⚠️ Significant changes](#️-significant-changes) 10 | 11 | ## 🛠️ How to contribute 12 | 13 |
14 | Quick: on GitHub 15 | 16 | 1. Select the `src` directory: 17 | - ![image](https://user-images.githubusercontent.com/88633026/210842064-d5ea1e87-fa4d-497b-baa4-1ee8979cdb6e.png) 18 | 2. Select the chapter's file you want to modify: 19 | - ![image](https://user-images.githubusercontent.com/88633026/210842429-64ae41f6-83de-4479-abd0-b50d8fee3e5f.png) 20 | 3. Hit the edit button: 21 | - ![image](https://user-images.githubusercontent.com/88633026/210845278-910e98ac-0df3-4b5d-a177-15a0c37feb1f.png) 22 | 4. Edit the file: 23 | - ![image](https://user-images.githubusercontent.com/88633026/210845525-3f378dce-a229-4c00-9dde-e5e80fe3bf06.png) 24 | 5. Name your changes and propose them: 25 | - ![image](https://user-images.githubusercontent.com/88633026/210848012-1c6a4bdf-e8ae-45fa-bc98-0fc56a8c30e2.png) 26 | 6. Create your pull request: 27 | 1. ![image](https://user-images.githubusercontent.com/88633026/210846256-199e41b0-2712-4004-b8c7-4ab794ef676b.png) 28 | 2. ![image](https://user-images.githubusercontent.com/88633026/210847584-e42c8d24-ec5f-4cc5-afc3-c1e3dbfcdd76.png) 29 | 30 |
31 | 32 |
33 | Average: locally 34 | 35 | Please follow [these steps](https://docs.github.com/en/get-started/quickstart/contributing-to-projects). 36 | 37 |
38 | 39 |
40 | Long: generate flashcards 42 | Requirements: 43 | 44 | 1. [VSCode](https://code.visualstudio.com/Download) (or [VSCodium](https://vscodium.com/)) >= 1.47 45 | 2. [Anki](https://apps.ankiweb.net/) >= 2.1.21 46 | 3. [AnkiConnect](https://ankiweb.net/shared/info/2055492159) >= 2020-07-13 47 | 48 | Create flashcards: 49 | 50 | 1. Follow the steps in the [average section](CONTRIBUTING.md#average-locally) 51 | 2. Launch Anki 52 | 3. Launch VSCode 53 | 4. Download the **Anki for VSCode** extension 54 | 5. Do `Ctrl + shift + p` in VSCode 55 | 6. Type *Anki* 56 | 7. Select **Anki: Sync Anki**. 57 | 8. Go to the Anki window. You should see a pop-up asking you to connect. 58 | 9. Create your Anki account. 59 | 10. Download these extensions: 60 | - [Docs Markdown](https://marketplace.visualstudio.com/items?itemName=docsmsft.docs-markdown) 61 | - [Markdown All in One](https://open-vsx.org/extension/yzhang/markdown-all-in-one) 62 | - [Paste Image](https://open-vsx.org/extension/mushan/vscode-paste-image) 63 | 
65 |
66 | 67 | ## 🧭 Where to start? 68 | 69 | Look at the [progress.md](docs/progress.md). 70 | 71 | ## 📢 Guidelines 72 | 73 | ### Cards' format 74 | 75 | - Question: a short `## Your question?` 76 | - Answer: 77 | 1. Start with a quick answer (important parts in **bold**) 78 | 2. Develop your answer 79 | 3. At the bottom of your answer add your source 80 | 4. Add an example (optional) 81 | 82 | For example, 83 | 84 | ```md 85 | ## What is a model's architecture? 86 | 87 | The architecture is the **functional form of the model**. Indeed, a model can be split into an architecture and parameter(s). The parameters are some variables that define how the architecture operates. 88 | 89 | For example, $y=ax+b$ is an architecture with the parameters $a$ and $b$ that change the behavior of the function. 90 | 91 | [Source](https://nathanieldamours.github.io/blog/deep%20learning%20for%20coders/jupyter/2021/12/17/dl_for_coders_01.html#Architecture-and-Parameters) 92 | ``` 93 | 94 | ### Images 95 | 96 | Do not add images unless it is **really** necessary, because we have to keep Anki decks' size small. 97 | 98 | ## ⚠️ Significant changes 99 | 100 | If you intend to make significant changes, please [open an issue](https://docs.github.com/en/enterprise-cloud@latest/issues/tracking-your-work-with-issues/creating-an-issue) so that we can discuss it first. Remember to [link it to your pull request](https://docs.github.com/en/free-pro-team@latest/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword). 101 | -------------------------------------------------------------------------------- /src/06_multicat.md: -------------------------------------------------------------------------------- 1 | # 06_multicat 2 | 3 | ## How could multi-label classification improve the usability of the bear classifier? 4 | 5 | This would **allow the model to indicate that no bear is present**. 
Otherwise, a multi-class classification model will predict the presence of a bear even if it's not there (unless a separate class is explicitly added). 6 | 7 | ## How do we encode the dependent variable in a multi-label classification problem? 8 | 9 | This is encoded **as a one-hot encoded vector**. Essentially, this means we have a zero vector with the same length as the number of classes, and ones are present at the indices for the classes that are present in the data. 10 | 11 | ## How do you access the rows and columns of a DataFrame as if it was a matrix? 12 | 13 | You can use `.iloc`. For example, `my_dataframe.iloc[10, 10]` will select the element in the 10th row and 10th column as if the DataFrame is a matrix. `iloc` stands for integer location. 14 | 15 | ## How do you get a column by name from a DataFrame? 16 | 17 | `my_dataframe["column_name"]`. However, a good practice would be to use `my_dataframe.loc[:, "column_name"]` for [explicitness](https://stackoverflow.com/a/38886211) and for [speed](https://stackoverflow.com/a/65875826). 18 | 19 | ## What is the difference between a `Dataset` and `DataLoader`? 20 | 21 | - `Dataset` is a collection which returns a tuple of your independent and dependent variable for a single item. 22 | - `DataLoader` is an extension of the `Dataset` functionality. It is an iterator which provides a stream of mini-batches, where each mini-batch is a pair of a batch of independent variables and a batch of dependent variables. 23 | 24 | ## What does a Datasets object normally contain? 25 | 26 | - Training set 27 | - Validation set 28 | 29 | ## What does a DataLoaders object normally contain? 30 | 31 | - Training dataloader 32 | - Validation dataloader 33 | 34 | ## What does lambda do in Python? 35 | 36 | Lambda is a **shortcut for writing functions** (writing one-liner functions). It is great for quick prototyping and iterating, but since it is not serializable, it cannot be used in deployment and production. 
37 | 38 | ```py 39 | >>> double = lambda x: 2 * x 40 | >>> double(4) 41 | 8 42 | ``` 43 | 44 | ## What are the methods to customise how the independent and dependent variables are created with the data block API? 45 | 46 | - `get_x`: specify how the independent variables are created 47 | - `get_y`: specify how the data is labelled 48 | 49 | ## Why is softmax not an appropriate output activation function when using a one hot encoded target? 50 | 51 | Softmax wants to make the model predict **only a single class**, which may not be true in a multi-label classification problem. In multi-label classification problems, the input data could have multiple labels or even no labels. 52 | 53 | ## Why is `nll_loss` not an appropriate loss function when using a one hot encoded target? 54 | 55 | Because `NLLLoss` expects the index representation of the labels. You could convert one-hot targets this way: 56 | 57 | ```py 58 | >>> target = torch.Tensor([[1, 0, 0, 1], [0, 0, 0, 1], [0, 0, 1, 0]]) 59 | >>> target = torch.argmax(target, axis=1) 60 | >>> target 61 | tensor([0, 3, 2]) 62 | ``` 63 | 64 | [Source](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) 65 | 66 | ## What is the difference between nn.BCELoss and nn.BCEWithLogitsLoss? 67 | 68 | - `nn.BCELoss` does not include the initial sigmoid. It assumes that the appropriate activation function (i.e., the sigmoid) has already been applied to the predictions. 69 | - `nn.BCEWithLogitsLoss` does both the sigmoid and the binary cross entropy in a single function. 70 | 71 | ## Why can't we use regular accuracy in a multi-label problem? 72 | 73 | Because the regular accuracy function **assumes that the final model-predicted class is the one with the highest activation**. However, in multi-label problems, there can be multiple labels. Therefore, a threshold for the activations needs to be set for choosing the final predicted classes based on the activations, before comparing them to the target classes. 
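The thresholding idea above can be sketched in plain Python (an illustration only, not fastai's actual `accuracy_multi` implementation):

```python
# Sketch: accuracy for multi-label problems, thresholding each
# (post-sigmoid) activation instead of taking a single argmax.
def accuracy_multi(activations, targets, threshold=0.5):
    correct = total = 0
    for acts, tgts in zip(activations, targets):
        for act, tgt in zip(acts, tgts):
            pred = 1 if act > threshold else 0  # threshold the activation
            correct += int(pred == tgt)
            total += 1
    return correct / total

# Two items, three possible labels each (hypothetical activations)
acts = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]]
tgts = [[1, 0, 1], [0, 1, 1]]
print(accuracy_multi(acts, tgts))  # -> 0.8333333333333334 (5 of 6 correct)
```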
74 | 75 | ## When is it okay to tune a hyperparameter on the validation set? 76 | 77 | When the relationship between the hyperparameter and the metric being observed is smooth; this avoids picking an inappropriate outlier. 78 | 79 | ## How is `y_range` implemented in fastai? 80 | 81 | `y_range` is implemented using `sigmoid_range` in fastai. 82 | 83 | ```py 84 | def sigmoid_range(x, low, high): 85 | return x.sigmoid() * (high-low) + low 86 | ``` 87 | 88 | ## What is a regression problem? 89 | 90 | A problem in which the labels are **continuous** values 91 | 92 | ## What loss function should you use for such a regression problem? 93 | 94 | The **mean squared error** loss function 95 | 96 | ## What do you need to do to make sure the fastai library applies the same data augmentation to your input images and your target point coordinates? 97 | 98 | You need to use the **correct DataBlock**. In this case, it is the `PointBlock`. This DataBlock automatically handles the application of data augmentation to the input images and the target point coordinates. 99 | -------------------------------------------------------------------------------- /src/05_pet_breeds.md: -------------------------------------------------------------------------------- 1 | # 05_pet_breeds 2 | 3 | ## Why do we first resize to a large size on the CPU, and then to a smaller size on the GPU? 4 | 5 | Because we want to **minimize data destruction**. Indeed, data augmentation can degrade the data and introduce artifacts, especially at the edges. Therefore, the augmentations are done on a larger image, and then `RandomResizedCrop` is performed to resize to the final image size. Also, resizing to a large size on the CPU, and then to a smaller size on the GPU is known as presizing. 
6 | 7 | Also, this presizing is done on the CPU, because the CPU can use higher-quality resampling filters, such as bicubic interpolation, which can produce smoother and more accurate results than the resampling filters available on the GPU. Resizing on the CPU can therefore **produce higher-quality images** that are more suitable for further processing on the GPU. 8 | 9 | Source: 10 | 11 | - [First paragraph](https://forums.fast.ai/t/fastbook-chapter-5-questionnaire-solutions-wiki/69301) 12 | - [Second paragraph](https://chat.openai.com/chat) 13 | 14 | ## What are the two ways in which data is most commonly provided, for most deep learning datasets? 15 | 16 | 1. **Individual files representing items of data**, such as text documents or images. 17 | 2. **A table of data**, such as in CSV format, where each row is an item and may include filenames providing a connection between the data in the table and data in other formats such as text documents and images. 18 | 19 | ## Give (2) examples of ways that image transformations can degrade the quality of the data 20 | 21 | 1. **Rotation** can leave empty areas in the final image 22 | 2. Other operations may require **interpolation**; the interpolated pixels are computed from the original image pixels but are still of lower image quality 23 | 24 | ## What method does fastai provide to view the data in a DataLoader? 25 | 26 | `DataLoader.show_batch` 27 | 28 | ## What method does fastai provide to help you debug a DataBlock? 29 | 30 | `DataBlock.summary` 31 | 32 | ## Should you hold off on training a model until you have thoroughly cleaned your data? 33 | 34 | **No**. It is best to create a baseline model as soon as possible. 35 | 36 | ## What are the (2) pieces that are combined into cross entropy loss in PyTorch? 37 | 38 | - Softmax function 39 | - Negative log likelihood loss 40 | 41 | ## What are the (2) properties of activations that softmax ensures? Why is this important? 
42 | 43 | - It makes the outputs for the classes **add up to one**. This means the model can **only predict one class**. 44 | - It **amplifies small changes** in the output activations. 45 | 46 | This is helpful because it means the model will **select a label with higher confidence** (good for problems with definite labels). 47 | 48 | ## When might you want your activations to not have the two properties that softmax ensures? 49 | 50 | When you have **multi-label** classification problems (more than one label possible). 51 | 52 | ## Why can’t we use `torch.where` to create a loss function for datasets where our label can have more than two categories? 53 | 54 | Because `torch.where` can only select between **two** possibilities while for multi-class classification, we have **multiple** possibilities. 55 | 56 | ## What is the value of log(-2)? Why? 57 | 58 | ![](img/05-log.png) 59 | 60 | This value is not defined. The logarithm is the inverse of the exponential function, and the exponential function is always positive no matter what value is passed. So the logarithm is not defined for negative values. 61 | 62 | ## What are (2) good rules of thumb for picking a learning rate from the learning rate finder? 63 | 64 | - One order of magnitude less than where the **minimum loss** was achieved (i.e. the minimum divided by 10) 65 | - The last point where the loss was clearly **decreasing** 66 | 67 | ## What (2) steps does the `fine_tune` method do? 68 | 69 | 1. Train the new **head** (with random weights) for one epoch 70 | 2. Unfreeze all the **layers** and train them all for the requested number of epochs 71 | 72 | ## In a Jupyter notebook, how do you get the source code for a method or function? 73 | 74 | Use `??` after the function. For example, `DataBlock.summary??` 75 | 76 | ## What are discriminative learning rates? 77 | 78 | Discriminative learning rates refer to the training trick of **using different learning rates for different layers** of the model. 
This is commonly used in transfer learning. The idea is that when you train a pre-trained model, you don't want to drastically change the earlier layers, as they contain information about simple features like edges and shapes. Later layers may be changed a little more, as they may contain information about facial features or other object features that may not be relevant to your task. Therefore, the earlier layers have a lower learning rate and the later layers have higher learning rates. 79 | 80 | ## How is a Python slice object interpreted when passed as a learning rate to fastai? 81 | 82 | - **First value**: learning rate for the earliest layer 83 | - **Second value**: learning rate for the last layer 84 | 85 | The layers in between will have learning rates that are multiplicatively equidistant throughout that range. 86 | 87 | ## Why is early stopping a poor choice when using one cycle training? 88 | 89 | Because the training may **not have time to reach lower learning rate** values in the learning rate schedule, which could easily continue to improve the model. Therefore, it is recommended to retrain the model from scratch and select the number of epochs based on where the previous best results were found. 90 | 91 | ## What is the difference between `resnet50` and `resnet101`? 92 | 93 | The numbers 50 and 101 refer to the **number of layers** in the models. Therefore, `resnet101` is a larger model with more layers than `resnet50`. These model variants are commonly used because ImageNet-pre-trained weights are available for them. 94 | 95 | ## What does `to_fp16` do? 96 | 97 | This enables mixed-precision training, in which less precise numbers are used in order to speed up training.
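To make the cross-entropy cards above concrete, here is a minimal pure-Python sketch of softmax followed by negative log likelihood (no PyTorch; the helper names and example activations are made up for illustration):

```py
import math

def softmax(activations):
    # Property 1: outputs are positive and add up to one
    # Property 2: exponentiation amplifies small differences between activations
    exps = [math.exp(a) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probs, target_idx):
    # Negative log likelihood of the probability assigned to the correct class
    return -math.log(probs[target_idx])

acts = [0.5, 2.0, -1.0]
probs = softmax(acts)
# Cross entropy loss = softmax, then negative log likelihood on the target class
loss = nll(probs, 1)
```

In practice, PyTorch's `nn.CrossEntropyLoss` combines both steps (via `log_softmax` plus `nll_loss`) for numerical stability rather than computing them separately as above.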
98 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the overall 27 | community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or advances of 32 | any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email address, 36 | without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and 
enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at 64 | [INSERT CONTACT METHOD]. 65 | All complaints will be reviewed and investigated promptly and fairly. 66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. 
Warning 85 | 86 | **Community Impact**: A violation through a single incident or series of 87 | actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or permanent 94 | ban. 95 | 96 | ### 3. Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within the 114 | community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.1, available at 120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. 
Translations are available at 127 | [https://www.contributor-covenant.org/translations][translations]. 128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 3 | # Fastai's Flashcards 4 | 5 | 📕 Flashcards for the book [Deep Learning for Coders with fastai and PyTorch](https://github.com/fastai/fastbook). 📕 6 | 7 |
8 | 9 | ## ⬇️ Download Decks 10 | 11 | [Download all decks](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/_all.apkg) or download the one you want: 12 | 13 | 1. [Your Deep Learning Journey](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/01_intro.apkg) 14 | 2. [From Model to Production](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/02_production.apkg) 15 | 3. [Data Ethics](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/03_ethics.apkg) 16 | 4. [Under the Hood: Training a Digit Classifier](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/04_mnist_basics.apkg) 17 | 5. [Image Classification](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/05_pet_breeds.apkg) 18 | 6. [Other Computer Vision Problems](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/06_multicat.apkg) 19 | 7. Training a State-of-the-Art Model 20 | 8. [Collaborative Filtering Deep Dive](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/08_collab.apkg) 21 | 9. [Tabular Modeling Deep Dive](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/09_tabular.apkg) 22 | 10. [NLP Deep Dive: RNNs](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/10_nlp.apkg) 23 | 11. [Data Munging with fastai's Mid-Level API](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/11_midlevel_data.apkg) 24 | 12. [A Language Model from Scratch](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/12_nlp_dive.apkg) 25 | 13. [Convolutional Neural Networks](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/13_convolutions.apkg) 26 | 14. ResNets 27 | 15. Application Architectures Deep Dive 28 | 16. The Training Process 29 | 17. A Neural Net from the Foundations 30 | 18. CNN Interpretation with CAM 31 | 19. 
A fastai Learner from Scratch 32 | 33 | ## 🚀 Get Started 34 | 35 | ### 💻 Desktop 36 | 37 | 1. Download [Anki](https://apps.ankiweb.net/) 38 | 2. Download [all decks](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/_all.apkg) or [one](README.md#download-decks) 39 | 3. Import the deck: 40 | - Click on *File* 41 | - Click on *Import...* 42 | 43 | ### 📱 Mobile 44 | 45 | 1. Download Anki ([iOS](https://apps.apple.com/us/app/ankimobile-flashcards/id373493387)/[Android](https://play.google.com/store/apps/details?id=com.ichi2.anki&hl=en&gl=us)) 46 | 2. Download [all decks](https://github.com/NathanielDamours/fastai-flashcards/raw/main/decks/_all.apkg) or [one](README.md#download-decks) 47 | 3. Import the deck 48 | 49 | > P.S.: If the desktop setup is already done, you can skip steps 2 and 3 by synchronizing your decks with an Anki account. 50 | 51 | ## 🛠️ Contributing 52 | 53 | If you want to contribute, review the [contribution guidelines](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md). 54 | 55 | [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](CODE_OF_CONDUCT.md) 56 | 57 | ## ⭐ Acknowledgments 58 | 59 |
60 | Fastai's contributors 61 | 62 | - [Jeremy Howard](https://github.com/jph00) 63 | - [Sylvain Gugger](https://github.com/sgugger) 64 | - Joe Bender 65 | - SOVIETIC-BOSS88 66 | - alvarotap 67 | - Brad S 68 | - Jakub Duchniewicz 69 | - Jonathan Sum 70 | - holynec 71 | - Phillip Chu 72 | - Vijayabhaskar 73 | - lgvaz 74 | - pfcrowe 75 | - ricardocalleja 76 | - sirbots 77 | - Happy Sugar Life 78 | - Rubens 79 | - Steven Borg 80 | - Yikai Zhao 81 | - Ashwin Jayaprakash 82 | - Ben 83 | - Ben Mainye 84 | - Hamel Husain 85 | - Jeff Kriske 86 | - Jithendra Yenugula 87 | - Jordi Villar 88 | - Lee Yi Jie Joel 89 | - Minjeong 90 | - Niyas Mohammed 91 | - Tanner Gilbert 92 | - AKAMath 93 | - Abhinav Misra 94 | - Abhishek Sankar 95 | - Albert Villanova del Moral 96 | - Alexander Walther 97 | - Almog Baku 98 | - Alok 99 | - Amrit Purshotam 100 | - Anshul Joshi 101 | - Anthony DePasquale 102 | - Armin Berres 103 | - Austin Taylor 104 | - Benjamin van der Burgh 105 | - Cleon W 106 | - Daniel Strobusch 107 | - Daniel Wehner 108 | - Dien Hoa TRUONG 109 | - Eduard 110 | - Eric Daniels 111 | - Fabrizio Damicelli 112 | - Faisal Sharji 113 | - Gilbert Tanner 114 | - Giovanni Ruggiero 115 | - Gregory Bruss 116 | - Henry Webel 117 | - Hiromi Suenaga 118 | - Jacopo Repossi 119 | - Jakub Halmeš 120 | - Jared 121 | - Jimgao 122 | - Joel Mathew 123 | - Johannes Stutz 124 | - John Wu 125 | - Jophel Lyles 126 | - Jorge Avila 127 | - Josh Kraft 128 | - Kaito 129 | - Karel Ha 130 | - Kartikeya Bhardwaj 131 | - Kasim Te 132 | - Katrin Leinweber 133 | - Kerrick Staley 134 | - Kofi Asiedu Brempong 135 | - Leozítor Floro de Souza 136 | - Lloyd Jones 137 | - Luca Martial 138 | - Lucas Vazquez 139 | - Ludwig Schmidt-Hackenberg 140 | - Luke Smith 141 | - Maria Rodriguez 142 | - Matus-Dubrava 143 | - Michael Becker 144 | - Michelangelo Bucci 145 | - Mircea Ilie Ploscaru 146 | - MrFabulous 147 | - Musab 148 | - Nelson Chen 149 | - Nghia 150 | - Noè Rosanas 151 | - Pablo Wolter 152 | - Parul Pandey 153 | 
- Pedro Pereira 154 | - Pete Cooper 155 | - Petr Simecek 156 | - Prith 157 | - Priya Gautam 158 | - Rahim Nathwani 159 | - Rehman Amjad 160 | - Ritobrata Ghosh 161 | - Samuel El-Borai 162 | - Sarada Lee 163 | - Sarah 164 | - Sayantan Karmakar 165 | - Shaojun 166 | - Shin 167 | - Sirish 168 | - Sofyan Hadi Ahmad 169 | - Somnath Rakshit 170 | - TannerGilbert 171 | - Vineet Ahuja 172 | - Void01 173 | - Yurij Mikhalevich 174 | - akarri2001 175 | - alephthoughts 176 | - booletic 177 | - brett koonce 178 | - franperic 179 | - invictus2010 180 | - jeffreytjs 181 | - jhrun 182 | - maxfdama 183 | - miwojc 184 | - pakgembus 185 | - prairie-guy 186 | - seovalue 187 | - sgugger 188 | - tylerpoelking 189 | - unknown 190 | - 蔡舒起 191 | - 송석리(Song Sukree) 192 | 193 |
194 | 195 |
196 | Tanishq Abraham's solutions 197 | 198 | 1. [Your Deep Learning Journey](https://forums.fast.ai/t/fastbook-chapter-1-questionnaire-solutions-wiki) 199 | 2. [From Model to Production](https://forums.fast.ai/t/fastbook-chapter-2-questionnaire-solutions-wiki) 200 | 3. [Data Ethics](https://forums.fast.ai/t/fastbook-chapter-3-questionnaire-solutions-wiki) 201 | 4. [Under the Hood: Training a Digit Classifier](https://forums.fast.ai/t/fastbook-chapter-4-questionnaire-solutions-wiki) 202 | 5. [Image Classification](https://forums.fast.ai/t/fastbook-chapter-5-questionnaire-solutions-wiki) 203 | 6. [Other Computer Vision Problems](https://forums.fast.ai/t/fastbook-chapter-6-questionnaire-solutions-wiki) 204 | 7. Skipped 205 | 8. [Collaborative Filtering Deep Dive](https://forums.fast.ai/t/fastbook-chapter-8-questionnaire-solutions-wiki) 206 | 9. [Tabular Modeling Deep Dive](https://forums.fast.ai/t/fastbook-chapter-9-questionnaire-solutions-wiki) 207 | 10. [NLP Deep Dive: RNNs](https://forums.fast.ai/t/fastbook-chapter-10-questionnaire-solutions-wiki) 208 | 11. [Data Munging with fastai's Mid-Level API](https://forums.fast.ai/t/fastbook-chapter-11-questionnaire-solutions-wiki) 209 | 12. [A Language Model from Scratch](https://forums.fast.ai/t/fastbook-chapter-12-questionnaire-wiki) 210 | 13. [Convolutional Neural Networks](https://forums.fast.ai/t/fastbook-chapter-13-questionnaire-wiki) 211 | 212 |
213 | -------------------------------------------------------------------------------- /src/10_nlp.md: -------------------------------------------------------------------------------- 1 | # 10_nlp 2 | 3 | ## What is self-supervised learning? 4 | 5 | Training a model **without** the use of **labels**. An example is a language model. 6 | 7 | ## What is a language model? 8 | 9 | A language model is a self-supervised model that tries to predict the next **word** of a given passage of text. 10 | 11 | ## Why is a language model considered self-supervised learning? 12 | 13 | Because there are **no labels** (ex: sentiment) provided during training. Instead, the model learns to predict the next word by reading lots of provided text with no labels. 14 | 15 | ## What are self-supervised models usually used for? 16 | 17 | Often, they are used as a **pre-trained model for transfer learning**. However, sometimes, they are used by themselves. For example, a language model can be used for autocomplete algorithms! 18 | 19 | ## When do we fine-tune language models? 20 | 21 | When we want to use a pre-trained language model on a **slightly different corpus** than the one for the current task. Indeed, we need to fine-tune the language model on the corpus of the desired downstream task in order to get a better performance. 22 | 23 | ## What are the three steps to create a state-of-the-art text classifier? 24 | 25 | 1. **Train** a language model on a large corpus of text (already done for ULM-FiT by Sebastian Ruder and Jeremy!) 26 | 2. **Fine-tune** the language model on text classification dataset 27 | 3. **Fine-tune** the language model as a text classifier instead. 28 | 29 | ## How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset? 30 | 31 | By allowing the model to learn how to predict the next word of a movie review. As a result, the model better understands the language style and structure of the text classification dataset. 
Therefore, the model can perform better when fine-tuned as a classifier. 32 | 33 | ## What are the three steps to prepare your data for a language model? 34 | 35 | 1. Tokenization 36 | 2. Numericalization 37 | 3. Language model DataLoader 38 | 39 | ## What is tokenization? 40 | 41 | The process of converting **text into a list of words**. 42 | 43 | ## Why do we need tokenization? 44 | 45 | Because we need a tokenizer that deals with **complicated cases like punctuation and hyphenated words**. Indeed, converting text into a list of words is not as simple as splitting on spaces. 46 | 47 | ## Name three different approaches to tokenization 48 | 49 | 1. Word-based tokenization 50 | 2. Subword-based tokenization 51 | 3. Character-based tokenization 52 | 53 | ## What is `xxbos`? 54 | 55 | This is a special token added by fastai that indicates the **beginning** of the text. 56 | 57 | ## List (4/8) rules that fastai applies to text during tokenization 58 | 59 | - `fix_html`: replace special HTML characters with a readable version 60 | - `replace_rep`: replace any character repeated three times or more with a special token for repetition (`xxrep`), the number of times it's repeated, then the character 61 | - `replace_wrep`: replace any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word 62 | - `spec_add_spaces`: add spaces around / and # 63 | - `rm_useless_spaces`: remove all repetitions of the space character 64 | - `replace_all_caps`: lowercase a word written in all caps and add a special token for all caps (`xxup`) in front of it 65 | - `replace_maj`: lowercase a capitalized word and add a special token for capitalized (`xxmaj`) in front of it 66 | - `lowercase`: lowercase all text and add a special token at the beginning (`xxbos`) and/or the end (`xxeos`) 67 | 68 | ## Why are repeated characters replaced with a token showing the number of repetitions, and the character that's
repeated? 69 | 70 | Because it allows the model's embedding matrix **to encode information about general concepts** such as repeated characters, which could have a special or different meaning than a single character. 71 | 72 | ## What is numericalization? 73 | 74 | This refers to the mapping of the **tokens to integers** to be passed into the model. 75 | 76 | ## Why might there be words that are replaced with the *unknown word* token? 77 | 78 | Because the embedding matrix would be very **large**, would increase **memory** usage, and would **slow** down training if all the words in the dataset had a token associated with them. Therefore, only words with more than `min_freq` occurrences are assigned a token and finally a number, while others are replaced with the *unknown word* token. 79 | 80 | ## What does the **first row** of the *first batch* contain (in the case where, with a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens of the dataset)? 81 | 82 | The *beginning* of the **first mini-stream** (tokens 1-64) 83 | 84 | Explanation : 85 | 86 | 1. The dataset is split into 64 mini-streams (batch size) 87 | 2. Each batch has 64 rows (batch size) and 64 columns (sequence length) 88 | 3. The **first row** of the *first batch* contains the *beginning* of the **first mini-stream** (tokens 1-64) 89 | 4. The **second row** of the *first batch* contains the *beginning* of the **second mini-stream** 90 | 5. The **first row** of the *second batch* contains the *second chunk* of the **first mini-stream** (tokens 65-128) 91 | 92 | ## What does the **second row** of the *first batch* contain (in the case where, with a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens of the dataset)? 93 | 94 | The *beginning* of the **second mini-stream** 95 | 96 | Explanation : 97 | 98 | 1. The dataset is split into 64 mini-streams (batch size) 99 | 2.
Each batch has 64 rows (batch size) and 64 columns (sequence length) 100 | 3. The **first row** of the *first batch* contains the *beginning* of the **first mini-stream** (tokens 1-64) 101 | 4. The **second row** of the *first batch* contains the *beginning* of the **second mini-stream** 102 | 5. The **first row** of the *second batch* contains the *second chunk* of the **first mini-stream** (tokens 65-128) 103 | 104 | ## What does the **first row** of the *second batch* contain (in the case where, with a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens of the dataset)? 105 | 106 | The *second chunk* of the **first mini-stream** (tokens 65-128) 107 | 108 | Explanation : 109 | 110 | 1. The dataset is split into 64 mini-streams (batch size) 111 | 2. Each batch has 64 rows (batch size) and 64 columns (sequence length) 112 | 3. The **first row** of the *first batch* contains the *beginning* of the **first mini-stream** (tokens 1-64) 113 | 4. The **second row** of the *first batch* contains the *beginning* of the **second mini-stream** 114 | 5. The **first row** of the *second batch* contains the *second chunk* of the **first mini-stream** (tokens 65-128) 115 | 116 | ## Why do we need padding for text classification? 117 | 118 | Because we have to **collate the batch** since the documents have **variable sizes**. Other approaches, like cropping or squishing, either negatively affect training or do not make sense in this context. Therefore, padding is used. 119 | 120 | ## Why don't we need padding for language modeling? 121 | 122 | It is not required for language modeling since the documents are all **concatenated**. 123 | 124 | ## What does an embedding matrix for NLP contain? 125 | 126 | It contains vector representations of **all tokens** in the vocabulary. 127 | 128 | ## What is the shape of an embedding matrix for NLP?
129 | 130 | `vocab_size` x `embedding_size`, where `vocab_size` is the length of the vocabulary, and `embedding_size` is an arbitrary number defining the number of latent factors of the tokens. 131 | 132 | ## What is perplexity? 133 | 134 | The **exponential of the loss**. Also, it's a commonly used metric in NLP for language models. 135 | 136 | ## Why do we have to pass the vocabulary of the language model to the classifier data block? 137 | 138 | Because it ensures the same **correspondence of tokens to index** so the model can appropriately use the embeddings learned during LM fine-tuning. 139 | 140 | ## What is gradual unfreezing? 141 | 142 | This refers to unfreezing one layer at a time and fine-tuning the pre-trained model. 143 | 144 | ## Why is text generation always likely to be ahead of automatic identification of machine generated texts? 145 | 146 | Because the classification models could be used to **improve** text generation algorithms (evading the classifier) so the text generation algorithms will always be ahead. 147 | -------------------------------------------------------------------------------- /src/08_collab.md: -------------------------------------------------------------------------------- 1 | # 08_collab 2 | 3 | ## What problem does collaborative filtering solve? 4 | 5 | The problem of predicting the interests of users based on the interests of other users and recommending items based on these interests 6 | 7 | ## How does collaborative filtering solve their problem? 8 | 9 | **By using `latent factors`**. The idea is that the model can tell what kind of items you may like (ex: you like sci-fi movies/books) and these kinds of factors are _learned_ (via basic gradient descent) based on what items other users like. 10 | 11 | ## Why might a collaborative filtering predictive model fail to be a very useful recommendation system? 
12 | 13 | Because: 14 | 15 | - there are not many recommendations to learn from 16 | - there is not enough data about the user to provide useful recommendations 17 | 18 | ## What does a crosstab representation of collaborative filtering data look like? 19 | 20 | ![](img/08-collab-filtering-data.png) 21 | 22 | The users and items are the rows and columns (or vice versa) of a large matrix with the values filled out based on the user's rating of the item. 23 | 24 | ## What is a latent factor? 25 | 26 | It's a factor that is not explicitly given to the model and is instead **learned** (hence "latent"). This factor is important for the prediction of the recommendations. 27 | 28 | For example, suppose that you observe the behaviour of a person or animal. You can only observe the behaviour. You cannot observe the internal state (e.g. the mood) of this person or animal. The mood is a hidden variable because it cannot be observed directly (but only indirectly through its consequences). 29 | 30 | Source : 31 | 32 | - [Description](https://forums.fast.ai/t/fastbook-chapter-8-questionnaire-solutions-wiki/69926) 33 | - [Example](https://ai.stackexchange.com/a/12505) 34 | 35 | ## What is a dot product? 36 | 37 | It's when you multiply the corresponding elements of two vectors and add them up. If we represent the vectors as lists of the same size, here is how we can perform a dot product: 38 | 39 | ```py 40 | >>> a = [1, 2, 3, 4] 41 | >>> b = [5, 6, 7, 8] 42 | >>> dot_product = sum(elements[0]*elements[1] for elements in zip(a, b)) 43 | >>> dot_product 44 | 70 45 | ``` 46 | 47 | ## What does `pandas.DataFrame.merge` do? 48 | 49 | It allows you to merge `DataFrames` into one `DataFrame`. 50 | 51 | ## What is an embedding matrix? 52 | 53 | What you multiply a one-hot-encoded matrix by and, in the case of this collaborative filtering problem, what is learned through training 54 | 55 | ## What is the relationship between an embedding and a matrix of one-hot encoded vectors?
56 | 57 | An embedding is a dense, continuous representation of a categorical variable, while a one-hot encoded matrix is a sparse, discrete representation of the same variable. An embedding is often used as an alternative to a one-hot encoded matrix in order to reduce the dimensionality and computational cost of a model. 58 | 59 | [Source](https://chat.openai.com/chat) 60 | 61 | ## Why do we need `Embedding` if we could use one-hot encoded vectors for the same thing? 62 | 63 | `Embedding` is **computationally more efficient**. The multiplication with one-hot encoded vectors is equivalent to indexing into the embedding matrix, and the `Embedding` layer does this. However, the gradient is calculated such that it is equivalent to the multiplication with the one-hot encoded vectors. 64 | 65 | ## What does an embedding contain before we start training (assuming we're not using a pre-trained model)? 66 | 67 | **Random values**. For example, 68 | 69 | ```py 70 | >>> embedding = Embedding(3, 2) 71 | >>> list(embedding.parameters()) 72 | [Parameter containing: 73 | tensor([[ 0.0102, 0.0115], 74 | [-0.0006, -0.0036], 75 | [ 0.0111, -0.0069]], requires_grad=True)] 76 | ``` 77 | 78 | ## What does `my_tensor[0, :]` return? 79 | 80 | All the columns of the first row of the matrix `my_tensor`. 
81 | 82 | ```py 83 | >>> my_tensor = torch.Tensor([[1, 2, 3], 84 | [4, 5, 6], 85 | [7, 8, 9]]) 86 | >>> my_tensor[0, :] 87 | tensor([1., 2., 3.]) 88 | ``` 89 | 90 | ## Rewrite the `DotProduct` class 91 | 92 | ```py 93 | class DotProduct(Module): 94 | def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)): 95 | self.user_factors = Embedding(n_users, n_factors) 96 | self.movie_factors = Embedding(n_movies, n_factors) 97 | self.y_range = y_range 98 | 99 | def forward(self, x): 100 | users = self.user_factors(x[:, 0]) 101 | movies = self.movie_factors(x[:, 1]) 102 | return sigmoid_range((users * movies).sum(dim=1), *self.y_range) 103 | ``` 104 | 105 | ## What is the use of bias in a dot product model? 106 | 107 | A bias will **compensate for the extreme values**. For example, for a movie classifier, a bias will compensate for the fact that some movies are just amazing or pretty bad. It will also compensate for users who often have more positive or negative recommendations in general. 108 | 109 | ## What is weight decay? 110 | 111 | A regularization technique that adds a penalty to the loss (usually the L2 norm). 112 | 113 | Therefore, `loss = loss + weight_decay_penalty` 114 | 115 | [Source](https://medium.com/datathings/dense-layers-explained-in-a-simple-way-62fe1db0ed75) 116 | 117 | ## Code a weight decay using L2 regularisation 118 | 119 | ```py 120 | l2_reg = weight_decay_param*(model_params**2).sum() 121 | loss = loss + l2_reg 122 | ``` 123 | 124 | ## What do we add to the gradient when we use weight decay? 125 | 126 | ```py 127 | 2*weight_decay_param*params 128 | ``` 129 | 130 | ## Why does the gradient of weight decay with L2 regularization help reduce weights? 131 | 132 | Because weight decay with L2 regularization **adds a penalty** to the loss based on the weights. The bigger they are, the bigger is the penalty. Therefore, in order to reduce the loss, the optimizer will usually reduce the weights. 
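The weight decay cards above can be demonstrated with a tiny, framework-free gradient descent loop (the toy data point, learning rate, and `weight_decay_param` value are all made up for illustration):

```py
def grad_step(w, x, y, lr, weight_decay_param):
    # Gradient of the squared error (w*x - y)**2 with respect to w
    grad = 2 * (w * x - y) * x
    # What weight decay adds to the gradient: 2*weight_decay_param*params
    grad += 2 * weight_decay_param * w
    return w - lr * grad

w_plain, w_decayed = 5.0, 5.0
for _ in range(100):
    w_plain = grad_step(w_plain, x=1.0, y=2.0, lr=0.1, weight_decay_param=0.0)
    w_decayed = grad_step(w_decayed, x=1.0, y=2.0, lr=0.1, weight_decay_param=0.5)

# Without decay, w converges to the value that fits the data point (2.0);
# with decay, the penalty pulls it toward zero, so it settles at a smaller weight.
```

Running both updates from the same starting point shows the penalty at work: the decayed weight ends up strictly smaller in magnitude than the plain one, which is exactly why the optimizer "usually reduces the weights".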
133 | 134 | ## Why does reducing weights lead to better generalization? 135 | 136 | Because it reduces the **degree of freedom** of the model to fit the training data. Therefore, the model will be less likely to overfit. 137 | 138 | ## What does `argsort` do in PyTorch? 139 | 140 | Returns the indices that would sort a tensor. 141 | 142 | ```py 143 | >>> a = torch.randn(4) 144 | >>> a 145 | tensor([ 0.1280, -1.5398, 0.4009, 2.9171]) 146 | >>> torch.argsort(a) 147 | tensor([1, 0, 2, 3]) 148 | ``` 149 | 150 | ## Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why? 151 | 152 | **No**, it means much more than that because it takes into account the genres, the actors or other factors. For example, movies with low bias means even if you like these types of movies you may not like this movie (and vice versa for movies with high bias). 153 | 154 | ## How do you print the names and details of the layers in a model? 155 | 156 | By typing `learn.model` 157 | 158 | ## What is the "bootstrapping problem" in collaborative filtering? 159 | 160 | That the model cannot make any recommendations or draw any inferences for users or items about which it has not yet gathered sufficient information. It's also called the cold start problem. 161 | 162 | ## How could you deal with the bootstrapping problem for new users or new movies? 163 | 164 | You could solve this by coming up with **an average embedding** for a user/movie. Or select a particular user/movie to represent the average user/movie. Additionally, you could come up with some questions that could help initialize the embedding vectors for new users and movies. 165 | 166 | ## How can feedback loops impact collaborative filtering systems? 167 | 168 | The recommendations may suffer from **representation bias** where a small number of people influence the system heavily. 
169 | 170 | E.g.: Highly enthusiastic anime fans who rate movies much more frequently than others may cause the system to recommend anime more often than expected (incl. to non-anime fans). 171 | 172 | ## When using a neural network in collaborative filtering, why can we have a different number of factors for movies and users? 173 | 174 | Because we are not taking the dot product but instead **concatenating the embedding matrices**. 175 | 176 | ## Why is there a `nn.Sequential` in the `CollabNN` model? 177 | 178 | Because this allows us to couple multiple `nn.Module` layers together to be used. In this case, the two linear layers are coupled together and the embeddings can be directly passed into the linear layers. 179 | 180 | ## What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filter model? 181 | 182 | A tabular model 183 | -------------------------------------------------------------------------------- /src/09_tabular.md: -------------------------------------------------------------------------------- 1 | # 09_tabular 2 | 3 | ## What is the difference between a continuous and a categorical variable? 4 | 5 | - **Continuous** variable: has a wide range of "continuous" values (ex: age) 6 | - **Categorical** variable: can take on discrete levels that correspond to different categories (ex: cat and dog) 7 | 8 | ## Provide (2) of the words that are used for the possible values of a categorical variable 9 | 10 | **Levels** or **categories** (ordinal or categorical). For example, movie ratings are ordinal variables and colors are categorical variables. 11 | 12 | ## What is a "dense layer"? 13 | 14 | A layer that is **deeply connected** with its preceding layer, which means the neurons of the layer are connected to every neuron of its preceding layer.
15 | 16 | [Source](https://analyticsindiamag.com/a-complete-understanding-of-dense-layers-in-neural-networks/) 17 | 18 | ## How do entity embeddings reduce memory usage and speed up neural networks? 19 | 20 | Using entity embeddings gives the data a **much more memory-efficient (dense) representation**. This will also lead to speed-ups for the model. On the other hand, especially for large datasets, representing the data as one-hot encoded vectors can be very inefficient (and also sparse). 21 | 22 | ## What kind of datasets are entity embeddings especially useful for? 23 | 24 | Datasets with features that have high levels of **cardinality** (the features have lots of possible categories like ZIP code: 90503). Other methods often overfit to data like this. 25 | 26 | ## What are the two main families of machine learning algorithms? 27 | 28 | - **Ensemble of decision trees**: best for structured data (tabular data) 29 | - **Multilayered neural networks**: best for unstructured data (audio, vision, text, etc.) 30 | 31 | ## Why do some categorical columns need a special ordering in their classes? 32 | 33 | Because some categories (ordinal variables) have an inherent order, and the model can use that information. 34 | 35 | ## How do you tell a pandas DataFrame that some categorical columns need a special ordering? 36 | 37 | By using `set_categories` with the argument `ordered=True` and passing in the ordered list of category levels. 38 | 39 | ## Summarize what a decision tree algorithm does 40 | 41 | It determines **how to group the data based on "questions"** that we ask about the data. That is, we keep splitting the data based on the levels or values of the features and generate predictions based on the average target value of the data points in that group. Here is the algorithm: 42 | 43 | 1. Loop through each column of the dataset in turn 44 | 2. For each column, loop through each possible level of that column in turn 45 | 3. 
Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable) 46 | 4. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple "model" where our predictions are simply the average sale price of the item's group 47 | 5. After looping through all of the columns and possible levels for each, pick the split point which gave the best predictions using our very simple model 48 | 6. We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each, by going back to step one for each group 49 | 7. Continue this process recursively until you have reached some stopping criterion for each group — for instance, stop splitting a group further when it has only 20 items in it. 50 | 51 | ## Why is a date different from a regular categorical or continuous variable? 52 | 53 | Some dates are **different** from others (ex: holidays, weekends, etc.) in ways that cannot be described as just an ordinal variable. 54 | 55 | ## How can you preprocess a date to allow it to be used in a model? 56 | 57 | We can generate many different categorical features about the **properties** of the given date (ex: is it a weekday? is it the end of the month? etc.) 58 | 59 | ## What is pickle? 60 | 61 | Pickle is a Python module that is used to **save nearly any Python object as a file**. Indeed, it is used to serialize and deserialize Python objects. Serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later.
62 | The opposite operation, extracting a data structure from a series of bytes, is deserialization. 63 | 64 | Source: 65 | 66 | - [First sentence](https://forums.fast.ai/t/fastbook-chapter-9-questionnaire-solutions-wiki/69932) 67 | - [The rest](https://en.wikipedia.org/wiki/Serialization) 68 | 69 | ## How are mse, samples, and values calculated in the decision tree drawn in this chapter? 70 | 71 | By traversing the tree based on answering questions about the data, we reach the **nodes that tell us** the average value of the data in that group, the mse, and the number of samples in that group. 72 | 73 | ## How do we deal with outliers in our training data, before building a decision tree? 74 | 75 | **You don't have to!** Indeed, in decision tree learning, you do splits based on a metric that depends on the proportions of the classes on the left and right leaves after the split (for instance, Gini impurity). If there are few outliers (which should be the case; otherwise no model would work well), then they will not be relevant to these proportions. For this reason, decision trees are robust to outliers. 76 | 77 | [Source](https://datascience.stackexchange.com/a/31439) 78 | 79 | ## How do we handle categorical variables in a decision tree? 80 | 81 | We convert the **categorical variables to integers**, where the integers correspond to the discrete levels of the categorical variable. Apart from that, there is nothing special that needs to be done to get it to work with decision trees (unlike neural networks, where we use embedding layers). 82 | 83 | ## What is bagging? 84 | 85 | Train multiple models on random subsets of the data, and use the ensemble of models for prediction. 86 | 87 | ## What is the difference between `max_samples` and `max_features` when creating a random forest? 88 | 89 | - `max_samples` defines how many **samples** we use for each decision tree. 90 | - `max_features` defines how many **features** we use for each decision tree.
91 | 92 | Don't forget that when training random forests, we train multiple decision trees on random subsets of the data. 93 | 94 | ## If you increase `n_estimators` to a very high value, can that lead to overfitting? Why? 95 | 96 | No, because the trees added due to the increase of `n_estimators` are independent of each other. 97 | 98 | ## What is *out of bag error*? 99 | 100 | The error on each row of the training data, computed using only the trees that did not include that row in their training subset. No validation set is needed. 101 | 102 | ## Give (2) reasons why a model's validation set error might be worse than the OOB error 103 | 104 | - The model does not **generalize** well. 105 | - The possibility that the validation data has a slightly different **distribution** than the data the model was trained on. 106 | 107 | ## Why are random forests well suited to show how confident we are in our predictions for a particular row of data? 108 | 109 | Because you just have to look at the standard deviation between the estimators. 110 | 111 | ## Why are random forests well suited to show, for a particular row of data, which factors were the most important and how they influenced the prediction? 112 | 113 | Because you just have to use the `treeinterpreter` package to check how the prediction changes as it goes through the tree, adding up the contributions from each split/feature. Use a waterfall plot to visualize. 114 | 115 | ## Why are random forests well suited to show which columns are the strongest predictors? 116 | 117 | Because you just have to look at **feature importance** 118 | 119 | ## Why are random forests well suited to show how predictions vary as we vary these columns? 120 | 121 | Look at partial dependence plots 122 | 123 | ## What's the purpose of removing unimportant variables? 124 | 125 | Sometimes, it is better to have a more **interpretable** model with fewer features, so removing unimportant variables helps in that regard.
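The bagging and out-of-bag ideas above can be made concrete with a toy, framework-free sketch. Everything here is invented for illustration: the data, the depth-1 "stump" models (real random forests use full decision trees), and all names. The key point is the last loop: each row is scored only by the models whose bootstrap sample did not contain it.

```py
import random
random.seed(0)

# Toy data: y = 2*x plus noise (hypothetical).
xs = [float(i) for i in range(20)]
ys = [2 * x + random.uniform(-1, 1) for x in xs]

n = len(xs)
n_trees = 50
models, oob_rows = [], []

for _ in range(n_trees):
    # Bagging: draw n rows with replacement (a bootstrap sample).
    idx = [random.randrange(n) for _ in range(n)]
    in_bag = set(idx)
    # "Model" is deliberately trivial: one split at the median x,
    # predicting the mean y on each side (a depth-1 stump).
    split = sorted(xs[i] for i in idx)[n // 2]
    left = [ys[i] for i in idx if xs[i] < split] or [0.0]
    right = [ys[i] for i in idx if xs[i] >= split] or [0.0]
    models.append((split, sum(left) / len(left), sum(right) / len(right)))
    oob_rows.append(set(range(n)) - in_bag)  # rows this model never saw

def predict(model, x):
    split, lmean, rmean = model
    return lmean if x < split else rmean

# OOB error: each row is scored only by models that did NOT train on it.
errs = []
for i in range(n):
    preds = [predict(m, xs[i]) for m, oob in zip(models, oob_rows) if i in oob]
    if preds:
        errs.append((sum(preds) / len(preds) - ys[i]) ** 2)
oob_mse = sum(errs) / len(errs)
print(round(oob_mse, 2))
```

In a library like scikit-learn the same idea is exposed as the `oob_score` option of a random forest, and the per-estimator spread of predictions gives the confidence measure from the card above.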
126 | 127 | ## What's a good type of plot for showing tree interpreter results? 128 | 129 | Waterfall plot 130 | 131 | ## What is the *extrapolation problem*? 132 | 133 | It is a problem encountered when it is hard for a model to extrapolate to data that's **outside the domain** of the training data. 134 | 135 | ## How can you tell if your test or validation set is distributed in a different way to your training set? 136 | 137 | We can do so by **training a model to classify if the data is training or validation data**. If the model can reliably tell the two apart, the datasets come from different distributions (the validation data is out-of-domain). 138 | 139 | ## What is boosting? 140 | 141 | We train a model that **underfits** the dataset, and train subsequent models that predict the **error** of the previous models. We then add the predictions of all the models to get the **final** prediction. 142 | 143 | ## How could we use embeddings with a random forest? 144 | 145 | Instead of passing in the raw categorical columns, the entity embeddings can be passed **into the random forest** model. 146 | 147 | ## Does using embeddings improve the performance of a random forest? 148 | 149 | Entity embeddings contain **richer representations** of the categorical features and can definitely improve the performance of other models like random forests. 150 | 151 | ## Why might we not always use a neural net for tabular modeling? 152 | 153 | Because they are the **hardest and slowest to train, and less well understood**. Instead, random forests should be the first choice/baseline, and neural networks could be tried to improve these results or add to an ensemble. 154 | -------------------------------------------------------------------------------- /src/13_convolutions.md: -------------------------------------------------------------------------------- 1 | # 13_convolutions 2 | 3 | ## What is a *feature*?
4 | 5 | A **transformation** of the data which is designed to make it easier for the model to learn from it 6 | 7 | ## Write out the convolutional kernel matrix for a top edge detector 8 | 9 | ```py 10 | [[-1, -1, -1], 11 | [ 0,  0,  0], 12 | [ 1,  1,  1]] 13 | ``` 14 | 15 | ## What is the value of a convolutional kernel applied to a 3×3 matrix of zeros? 16 | 17 | Zero: every elementwise product is zero, so their sum is zero 18 | 19 | ## What is *padding*? 20 | 21 | The additional **pixels** that are **added around** the outside of the image. They allow the kernel to be applied to the edge of the image for a convolution. 22 | 23 | ## What is *stride*? 24 | 25 | It refers to how many pixels at a time the kernel is **moved** during the convolution. 26 | 27 | ## Create a nested list comprehension 28 | 29 | ```py 30 | >>> x = [[i*3 + j for j in range(3)] for i in range(3)] 31 | >>> x 32 | [[0, 1, 2], [3, 4, 5], [6, 7, 8]] 33 | ``` 34 | 35 | ## What is a *channel*? 36 | 37 | It is the **number of activations** per grid cell after a convolution, that is, the size of the second axis of the weight matrix. Channel and feature are often used interchangeably. 38 | 39 | ## What are the shapes of the `input` and `weight` parameters to PyTorch's 2D convolution? 40 | 41 | - `input`: `(minibatch_size, in_channels, i_height, i_width)` 42 | - `weight`: `(out_channels, in_channels, k_height, k_width)` 43 | 44 | ## How is a convolution related to matrix multiplication? 45 | 46 | A convolution operation can be represented as a matrix multiplication. 47 | 48 | ## What is a "convolutional neural network"? 49 | 50 | A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a **grid-like topology**, such as an image. 51 | 52 | [Source](https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939) 53 | 54 | ## What are the (4) benefits of refactoring parts of your neural network definition?
55 | 56 | - You'll be **less** likely to get **errors** due to inconsistencies in your architecture. 57 | - It makes it more **obvious** to the reader which **parts of your layers** are actually **changing**. 58 | - It can help improve the **readability** and **maintainability** of your code. 59 | - It can make it easier to **debug** and **troubleshoot** issues in your model. 60 | 61 | ## What is `nn.Flatten`? 62 | 63 | It converts **multi-dimensional tensors into a 1D vector**. It's basically the same as PyTorch's `squeeze` method, but as a module. `nn.Flatten` is useful when you want to transform the multidimensional tensor output by a convolutional layer into a vector, which can be fed into a fully connected (linear) layer. 64 | 65 | [Source](https://chat.openai.com/chat) 66 | 67 | ## What does "NCHW" mean? 68 | 69 | It is an **abbreviation** for the axes of the input of the model: 70 | 71 | - N: batch size 72 | - C: channels 73 | - H: height 74 | - W: width 75 | 76 | ## Why does the third layer of the MNIST CNN have `7*7*(1168-16)` multiplications? 77 | 78 | There are 1168 parameters for that layer, and ignoring the 16 parameters (=number of filters) of the bias, the (1168-16) weight parameters are applied at each position of the 7x7 grid. 79 | 80 | ## What is a "receptive field"? 81 | 82 | It's the **area of an image** that is involved in the calculation of a layer. 83 | 84 | ## What is the size of the receptive field of an activation (size=7, k=7) after two stride-2 convolutions? Why?
85 | 86 | $7 \times 7$, because the formula for calculating the size of the receptive field after applying a convolution with stride $s$ is: 87 | 88 | $size' = size + (size - k + 2p)(s - 1)$ 89 | 90 | where: 91 | 92 | - size: size of the receptive field before applying the convolution 93 | - k: kernel size of the convolution 94 | - s: stride of the convolution 95 | - p: padding of the convolution 96 | 97 | Therefore, for each of the two convolutions, 98 | $size' = 7 + (7 - 7 + 0)(2 - 1)$ 99 | $= 7 + 0 \cdot (2 - 1)$ 100 | $= 7 + 0 = 7$ 101 | 102 | [Source](https://chat.openai.com/chat) 103 | 104 | ## How is a gray image represented as a tensor? 105 | 106 | It is a rank-3 tensor of shape (1, height, width) 107 | 108 | ## How is a color image represented as a tensor? 109 | 110 | It is a rank-3 tensor of shape (3, height, width) 111 | 112 | ## How does a convolution work with a color input? 113 | 114 | The convolutional kernel is of **size `(ch_out, ch_in, ks, ks)`**. For example, with a color input with a kernel size of 3x3 with 7 output channels, that would be (7, 3, 3, 3). The convolution filter for each of the `ch_in=3` channels is applied separately to each of the 3 color channels and summed up, and we have `ch_out` filters like this, giving us `ch_out` convolutional kernel tensors of size `ch_in=3 x ks x ks`. Thus the final size of this tensor is `(ch_out, ch_in, ks, ks)`. Additionally we would have a bias of size `ch_out`. 115 | 116 | ## What method can we use to see the data in `DataLoaders`? 117 | 118 | `show_batch` 119 | 120 | ## Why do we double the number of filters after each stride-2 convolution? 121 | 122 | Because we're **decreasing** the number of activations in the activation map by a factor of 4; we don't want to decrease the capacity of a layer by too much at a time 123 | 124 | ## Why do we use a larger kernel in the first conv with MNIST (28x28) (with `simple_cnn`)?
125 | 126 | Because this can help the neural network to **learn more effectively**. Indeed, with the first layer, if the kernel size is 3x3 with eight output filters, then nine pixels are being used to produce eight output numbers, so there is not much learning since input and output size are almost the same. Neural networks will only create useful features if they're forced to do so, that is, if the number of outputs from an operation is significantly smaller than the number of inputs. To fix this, we can use a larger kernel in the first layer. 127 | 128 | ## What information does `ActivationStats` save for each layer? 129 | 130 | - Mean 131 | - Standard deviation 132 | - Histogram of activations for the specified trainable layers in the model being tracked 133 | 134 | ## How can we access a learner's callback after training? 135 | 136 | They are available with the `Learner` object with the same name as the callback class, but in `snake_case`. For example, the `Recorder` callback is available through `learn.recorder`. 137 | 138 | ## What are the three statistics plotted by `plot_layer_stats`? 139 | 140 | - The mean of the activations 141 | - The standard deviation of the activations 142 | - The percentage of activations near zero 143 | 144 | ## What does the x-axis represent in `plot_layer_stats`? 145 | 146 | The progress of training (batch number) 147 | 148 | ## Why are activations near zero problematic? 149 | 150 | Because it means we have **computation in the model that's doing nothing at all** (since multiplying by zero gives zero). When you have some zeros in one layer, they will therefore generally carry over to the next layer... which will then create more zeros. 151 | 152 | ## What are the (2) upsides and (2) downsides of training with a larger batch size?
153 | 154 | - (+) **More accurate gradients** since they're calculated from more data 155 | - (+) **Faster training** because the model can process more data in each forward and backward pass 156 | - (-) **Fewer opportunities** for the model to **update weights** because there are fewer batches per epoch 157 | - (-) **Higher memory requirements** because the model has to store activations for more data points at the same time 158 | 159 | ## Why should we avoid using a high learning rate at the start of training? 160 | 161 | Because our **initial weights are not well suited** to the task we're trying to solve. Therefore, it is dangerous to begin training with a high learning rate: we may very well make the training diverge instantly. 162 | 163 | ## What is 1cycle training? 164 | 165 | It's a type of **learning rate schedule** developed by Leslie Smith that combines learning rate warmup and annealing, which allows us to train with higher learning rates. 166 | 167 | ## What are the (2) benefits of training with a high learning rate? 168 | 169 | - **Faster training** — a phenomenon Smith named *super-convergence*. 170 | - **Less overfitting** because we skip over the sharp local minima to end up in a smoother (and therefore more generalizable) part of the loss landscape. 171 | 172 | ## Why do we want to use a low learning rate at the end of training? 173 | 174 | Because it allows us to find the **best part** of the loss landscape and further minimize the loss. 175 | 176 | ## What is "cyclical momentum"? 177 | 178 | It suggests that the **momentum varies in the opposite direction of the learning rate**: when we are at high learning rates, we use less momentum, and we use more again in the annealing phase. 179 | 180 | ## What callback tracks hyperparameter values during training (along with other information)? 181 | 182 | The `Recorder` callback 183 | 184 | ## What does one column of pixels in the `color_dim` plot represent?
185 | 186 | The histogram of activations for the specified layer for that batch 187 | 188 | ## What does "bad training" look like in `color_dim`? Why? 189 | 190 | We would see a cycle of dark blue returning with bright yellow at the bottom, because this training is not smooth and effectively starts from scratch during these cycles. 191 | 192 | ## What trainable parameters does a batch normalization layer contain? 193 | 194 | - `beta` 195 | - `gamma` 196 | 197 | ## What do the trainable parameters of a batch normalization layer allow? 198 | 199 | They allow the model to have any **mean and variance for each layer**, which are learned during training. 200 | 201 | ## What statistics are used to normalize in batch normalization during training? 202 | 203 | The **mean** and **standard deviation** of the batch. 204 | 205 | ## What statistics are used to normalize in batch normalization during validation? 206 | 207 | The running mean of the statistics calculated during training. 208 | 209 | ## Why do models with batch normalization layers generalize better? 210 | 211 | Because **batch normalization adds some extra randomness** to the training process (at least, most researchers believe so) 212 | -------------------------------------------------------------------------------- /src/03_ethics.md: -------------------------------------------------------------------------------- 1 | # 03_ethics 2 | 3 | ## Does ethics provide a list of "right answers"? 4 | 5 | There is **no list of do's and don'ts**. Ethics is complicated, and context-dependent. It involves the perspectives of many stakeholders. Ethics is a muscle that you have to develop and practice. In this chapter, our goal is to provide some signposts to help you on that journey. 6 | 7 | ## How can working with people of different backgrounds help when considering ethical questions? 8 | 9 | Different people's backgrounds will help them to see things which may not be obvious to you.
Working with a team is helpful for many "muscle building" activities, including this one. 10 | 11 | ## What was the role of IBM in Nazi Germany? 12 | 13 | **IBM supplied the Nazis with data tabulation products necessary to track the extermination of Jews and other groups on a massive scale**. This was driven from the top of the company, with marketing to Hitler and his leadership team. Company President Thomas Watson personally approved the 1939 release of special IBM alphabetizing machines to help organize the deportation of Polish Jews. Hitler awarded Watson a special "Service to the Reich" medal in 1937. 14 | 15 | But it also happened throughout the organization. IBM and its subsidiaries provided regular training and maintenance on-site at the concentration camps: printing off cards, configuring machines, and repairing them as they broke frequently. IBM set up categorizations on their punch card system for the way that each person was killed, which group they were assigned to, and the logistical information necessary to track them through the vast Holocaust system. IBM's code for Jews in the concentration camps was 8, where around 6,000,000 were killed. Its code for Romanis was 12 (they were labeled by the Nazis as "asocials", with over 300,000 killed in the *Zigeunerlager*, or "Gypsy camp"). General executions were coded as 4, death in the gas chambers as 6. 16 | 17 | ## Why did the company and the workers participate as they did? 18 | 19 | Because they were **making huge profits**. 20 | 21 | Edwin Black, author of *IBM and the Holocaust*, said: 22 | > To the blind technocrat, the means were more important than the ends. The destruction of the Jewish people became even less important because the invigorating nature of IBM's technical achievement was only heightened by the fantastical profits to be made at a time when bread lines stretched across the world. 23 | 24 | ## What was the role of the first person jailed in the Volkswagen diesel scandal?
25 | 26 | It was one of the engineers, James Liang, who just did what he was told. 27 | 28 | ## What was the problem with a database of suspected gang members maintained by California law enforcement officials? 29 | 30 | It was found to be **full of errors**, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as "admitting to being gang members"). In this case, there was no process in place for correcting mistakes or removing people once they’d been added. 31 | 32 | ## Why did YouTube's recommendation algorithm recommend videos of partially clothed children to pedophiles? 33 | 34 | Because of the **centrality of metrics** in driving a financially important system. Indeed, when an algorithm has a metric to optimise, it will do everything it can to optimise that number. This tends to lead to all kinds of edge cases, and humans interacting with a system will search for, find, and exploit these edge cases and feedback loops for their advantage. 35 | 36 | ## What are (3/7) problems with the centrality of metrics? 37 | 38 | - Reliance on metrics can lead to a **narrow focus on measurable outcomes**, rather than broader goals or values. This can lead to a lack of attention to other important aspects of a situation or problem. 39 | 40 | - Over-reliance on metrics can create pressure for people **to conform to predetermined goals or targets**, rather than encouraging creativity, innovation, or independent thinking. 41 | 42 | - The use of metrics may create unintended consequences, such as **discrimination or bias**, if they are not carefully designed and implemented. 43 | 44 | - The use of metrics may **create a sense of competition or rivalry**, rather than collaboration or cooperation. 45 | 46 | - Metrics may **not accurately reflect the complexity or nuances of a situation**. They may oversimplify or obscure important factors or dynamics.
47 | 48 | - Metrics may **not always be applicable or relevant in all contexts**, and may not capture the full range of factors that are important to consider in a given situation. 49 | 50 | - Metrics may be **difficult to interpret or compare**, especially if they are not clearly defined or standardized. 51 | 52 | [Source](https://chat.openai.com/chat) 53 | 54 | ## Why did Meetup.com not include gender in their recommendation system for tech meetups? 55 | 56 | Because they were concerned that including gender in the recommendation algorithm would **create a self-reinforcing feedback loop where it would recommend Tech meetups mainly to men**, because Meetup had observed that men expressed more interest than women towards attending Tech meetups. To avoid this situation and continue to recommend Tech meetups to their users regardless of the gender, they simply decided to not include gender in the recommendation algorithm. 57 | 58 | ## What are the (6) types of bias in machine learning, according to Suresh and Guttag? 59 | 60 | - Historical 61 | - Measurement 62 | - Aggregation 63 | - Representation 64 | - Deployment 65 | - Evaluation 66 | 67 | ## What is historical bias? 68 | 69 | A bias that our datasets and models inherit from the real world. People are biased, processes are biased and society in general is biased. 70 | 71 | ## What is measurement bias? 72 | 73 | When we measure the wrong thing, incorporate the measurement inappropriately or measure it in the wrong way. An example is the stroke prediction model that includes information about whether a person went to a doctor in its prediction of whether a patient had a stroke. 74 | 75 | ## What is aggregation bias? 76 | 77 | When data is aggregated to the extent that it does not take the differences within a heterogeneous population into account.
An example is that the effectiveness of medical treatments for some diseases differs by gender and ethnicity, but those parameters are not present in the training data as they have been "aggregated away" 78 | 79 | ## What is representation bias? 80 | 81 | When the model emphasizes some property of the data as it seemingly has the closest correlation with the prediction, even though that might not be the truth. An example is the gender property in the occupation prediction model, where the model only predicted 11.6% of surgeons to be women whereas the real number was 14.6%. 82 | 83 | ## What is deployment bias? 84 | 85 | When there is a disparity between the intended purpose of a model and how it is actually used. In other words, a model is designed for a purpose that is not achieved after deployment. 86 | [Source](https://medium.com/unpackai/glimpse-of-different-types-of-bias-in-machine-learning-3e8767436aea) 87 | 88 | ## What is evaluation bias? 89 | 90 | It happens during the evaluation or iteration process. It often arises when the testing or external target population does not accurately represent the different segments of the user population. In addition, using inappropriate metrics for the intended use of the model can also lead to evaluation bias. 91 | 92 | [Source](https://medium.com/unpackai/glimpse-of-different-types-of-bias-in-machine-learning-3e8767436aea) 93 | 94 | ## Give (2) examples of historical race bias in the US 95 | 96 | - When doctors were shown identical files, they were much **less likely to recommend cardiac catheterization (a helpful procedure) to Black patients.** 97 | - An all-white jury was **16% more likely to convict a Black defendant than a white one**, but when a jury had at least one Black member, it convicted both at the same rate. 98 | 99 | ## Where are most images in Imagenet from?
100 | 101 | **The US and other Western countries.** This leads to models trained on the ImageNet dataset performing worse for other countries and cultures that don't have as much representation in the dataset. 102 | 103 | ## How are machines and people different, in terms of their use for making decisions? 104 | 105 | - People assume that algorithms are objective and/or error-free 106 | - Algorithmic systems are: 107 | - more likely to be implemented with a no-appeals process in place 108 | - often used at scale 109 | - cheap 110 | 111 | ## Is disinformation the same as "fake news"? 112 | 113 | No, it is not necessarily about getting someone to believe something false, but rather often **used to sow disharmony and uncertainty, and to get people to give up on seeking the truth**. To do that, disinformation often contains exaggerations, seeds of truth or half-truths taken out of context rather than just "fake news". Also, disinformation has a history stretching back hundreds or even thousands of years. 114 | 115 | ## Why is disinformation through auto-generated text a particularly significant issue? 116 | 117 | Due to the greatly **increased capability** provided by deep learning. 118 | 119 | ## What are the (5) ethical lenses described by the Markkula Center? 120 | 121 | - **The rights approach**: which option best respects the rights of all who have a stake? 122 | - **The justice approach**: which option treats people equally or proportionally? 123 | - **The utilitarian approach**: which option will produce the most good and do the least harm? 124 | - **The common good approach**: which option best serves the community as a whole, not just some members? 125 | - **The virtue approach**: which option leads me to act as the sort of person I want to be? 126 | 127 | The objective of looking through different ethical lenses when making a decision is to uncover concrete issues with the different options.
128 | 129 | ## When is policy an appropriate tool for addressing data ethics issues? 130 | 131 | When it's likely that **design fixes, self-regulation and technical approaches to addressing problems**, involving ethical uses of Machine Learning, **are not working**. While such measures can be useful, they will not be sufficient to address the underlying problems that have led to our current state. For example, as long as it is incredibly profitable to create addictive technology, companies will continue to do so, regardless of whether this has the side effect of promoting conspiracy theories and polluting our information ecosystem. While individual designers may try to tweak product designs, we will not see substantial changes until the underlying profit incentives change. 132 | 133 | Because of the above, it is almost certain that policies will have to be created by governments to address these issues. 134 | -------------------------------------------------------------------------------- /src/12_nlp_dive.md: -------------------------------------------------------------------------------- 1 | # 12_nlp_dive 2 | 3 | ## If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do? 4 | 5 | Perhaps create the **simplest possible dataset** that allows for quick and easy prototyping. For example, Jeremy created a *human numbers* dataset. 6 | 7 | ## Why do we concatenate the documents in our dataset before creating a language model? 8 | 9 | Because we need to create a continuous stream of input/target words, to be able to split it up into batches of significant size. 10 | 11 | ## What (2) tweaks do we need to make to use a standard fully connected network to predict the fourth word given the previous three words? 12 | 13 | - Use the **same** weight matrix for the three layers.
14 | - Use the **first** word's embeddings as activations to pass to the linear layer, add the second word's embeddings to the first layer's output activations, and continue for the rest of the words. 15 | 16 | ## How can we share a weight matrix across multiple layers in PyTorch? 17 | 18 | Define one layer in the PyTorch model class, and use it multiple times in the `forward` method. 19 | 20 | ## What is a recurrent neural network? 21 | 22 | A refactoring of a **multilayer** neural network using a `for` loop. 23 | 24 | [Source](https://github.com/fastai/fastbook/blob/9821dade6e6fb747cc30aceba2923805d5662192/12_nlp_dive.ipynb) 25 | 26 | ## What is "hidden state"? 27 | 28 | The **activations updated** after each RNN step. 29 | 30 | ## What is the equivalent of hidden state in `LMModel1`? 31 | 32 | `h` 33 | 34 | ## To maintain the state in an RNN, why is it important to pass the text to the model in order? 35 | 36 | Because the **state is maintained** over all batches independently of sequence length; this is only useful if the text is passed in order. 37 | 38 | ## What is an unrolled representation of an RNN? 39 | 40 | A representation **without loops**, depicted as a standard multilayer network. 41 | 42 | ## Why can maintaining the hidden state in an RNN lead to memory and performance problems? 43 | 44 | Because it has to use the **gradients from all the past calls** of the model when performing backpropagation, since the hidden state is maintained through every single call of the model. 45 | 46 | ## How do we fix the memory and performance problems when the hidden state in an RNN is maintained? 47 | 48 | When performing backpropagation with the model, after every call, the `detach` method is called to delete the gradient history of previous calls of the model. 49 | 50 | ## What is *BPTT*? 51 | 52 | Calculating backpropagation only for the **given batch**, and therefore only doing backpropagation for the defined sequence length of the batch.
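The ideas in the cards above (a single weight matrix reused in a `for` loop, a hidden state maintained across calls, and `detach` to truncate the gradient history) can be sketched in PyTorch. This is a minimal sketch with hypothetical names, not the book's exact `LMModel` code:

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """Minimal RNN sketch: one shared weight matrix, persistent hidden state."""
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input -> hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden -> hidden, reused at every step
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden -> output
        self.h = torch.zeros(1, n_hidden)            # hidden state, maintained across calls

    def forward(self, x):                            # x: (batch, seq_len) of token ids
        for i in range(x.shape[1]):                  # the RNN is just this for loop
            self.h = self.h + self.i_h(x[:, i])
            self.h = torch.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()                     # drop gradient history: truncated BPTT
        return out
```

Because `self.h` is detached after each call, backpropagation only runs through the current batch's sequence length, which is the truncation BPTT refers to.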
*BPTT* stands for BackPropagation Through Time. 53 | 54 | ## What does the `ModelResetter` callback do? 55 | 56 | It resets the hidden state of the model before every epoch and before every validation run. 57 | 58 | ## Why do we need the `ModelResetter` callback? 59 | 60 | Because it makes sense to reset the hidden state when you are working with **instances or batches** that are **not related in any meaningful way** (to make predictions), e.g. translating two different input instances in neural translation. You can think of the hidden state as limited memory that gets convoluted if the input is too long (and it can be if you combine multiple instances) and, as an end result, the final performance may decline. 61 | 62 | One more thing: when performing SGD, you assume batches are independent of one another. If you don't reset the hidden state between them, you lose the i.i.d. assumption. 63 | 64 | [Source](https://discuss.pytorch.org/t/in-lstm-why-should-i-reset-hidden-variables/94016/2) 65 | 66 | ## What are the downsides of predicting just one output word for each three input words? 67 | 68 | There are words in between that are **not being used** for training the model. 69 | 70 | ## How do we solve the downsides of predicting just one output word for each three input words? 71 | 72 | We apply the output layer to every hidden state produced to predict three output words for the three input words (offset by one). 73 | 74 | ## Why do we need a custom loss function for `LMModel4`? 75 | 76 | Because `CrossEntropyLoss` expects flattened tensors. 77 | 78 | ## Why is the training of `LMModel4` unstable? 79 | 80 | Because this network is **very deep** and this can lead to very small or very large gradients that don't train well. 81 | 82 | ## Why do we need to stack RNNs to get better results, even though a recurrent neural network in the unrolled representation has many layers? 83 | 84 | Because **only one weight matrix** is really being used.
Stacking multiple RNN layers improves on this. 85 | 86 | ## Imagine a representation of a stacked (multilayer) RNN 87 | 88 | ![](img/12-stacked-rnn.png) 89 | 90 | ## Why can a deep network result in very large or very small activations? 91 | 92 | Because, in deep networks, we have **repeated matrix multiplications** and, after repeated multiplications, numbers that are just slightly higher or lower than one can lead to the explosion or disappearance of numbers. 93 | 94 | ## Why do very large or very small activations matter? 95 | 96 | Because they could lead to vanishing or exploding gradients which **prevent training**. 97 | 98 | ## In a computer's floating point representation of numbers, which numbers are the most precise? 99 | 100 | Small numbers, provided they are not *too* close to **zero**. 101 | 102 | ## Why do vanishing gradients prevent training? 103 | 104 | Because the accumulation of small gradients results in a model that is **incapable of learning meaningful insights**, since the weights and biases of the initial layers, which tend to learn the core features from the input data, will not be updated effectively. In the worst-case scenario, the gradient will be 0, which stops the network from training further. 105 | 106 | [Source](https://towardsdatascience.com/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11) 107 | 108 | ## Why do exploding gradients prevent training? 109 | 110 | Because the accumulation of large derivatives results in the model being very **unstable and incapable of effective learning**. Indeed, the large changes in the model's weights create a very unstable network; at extreme values, the weights become so large that they cause overflow, resulting in NaN weight values that can no longer be updated.
111 | 112 | [Source](https://towardsdatascience.com/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11) 113 | 114 | ## What is the purpose of each of the (2) hidden states in the LSTM architecture? 115 | 116 | - One state remembers what happened **earlier** in the sentence 117 | - The other predicts the **next** token 118 | 119 | ## What are these two states called in an LSTM? 120 | 121 | - Cell state (long short-term memory) 122 | - Hidden state (predict next token) 123 | 124 | ## What is tanh, and how is it related to sigmoid? 125 | 126 | It's just a sigmoid function rescaled to the range of -1 to 1: $tanh(x) = 2 \cdot sigmoid(2x) - 1$. 127 | 128 | ## In `LSTMCell`, what is the purpose of this code: `h = torch.cat([h, input], dim=1)`? 129 | 130 | It **joins the hidden state and the new input**. Concatenating the hidden state and the new input allows the LSTM cell to incorporate the new input into the hidden state and update it accordingly. This allows the LSTM cell to process and make use of both short-term and long-term dependencies in the data, which can improve the model's performance. 131 | 132 | Source: 133 | 134 | - [First sentence](https://forums.fast.ai/t/fastbook-chapter-12-questionnaire-wiki/70516) 135 | - [The rest](https://chat.openai.com/chat) 136 | 137 | ## What does `chunk` do in PyTorch? 138 | 139 | It splits a tensor into a given number of equally sized chunks. 140 | 141 | ## Why can we use a higher learning rate for `LMModel6`? 142 | 143 | Because LSTM provides a **partial** solution to exploding/vanishing gradients. 144 | 145 | ## What are the (3) regularisation techniques used in an AWD-LSTM model? 146 | 147 | 1. Dropout 148 | 2. Activation regularization 149 | 3. Temporal activation regularization 150 | 151 | ## What is dropout? 152 | 153 | **Randomly** changing some **activations to zero** at training time. It's a regularization technique that was introduced by Geoffrey Hinton et al.
in [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). This makes sure all neurons actively work toward the output, as seen in <> (from "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" by Nitish Srivastava et al.). 154 | 155 | [Source](https://github.com/fastai/fastbook/blob/9821dade6e6fb747cc30aceba2923805d5662192/12_nlp_dive.ipynb) 156 | 157 | ## Why do we scale the weights with dropout? 158 | 159 | Because the **scale changes** if we sum up activations. Indeed, it makes a difference if all activations are present or they are dropped with probability $p$. To correct the scale, a division by $(1-p)$ is applied. 160 | 161 | ## When do we scale the weights with dropout? 162 | 163 | During **training**, in the standard *inverted dropout* implementation used by PyTorch and fastai: activations are divided by $(1-p)$ at training time, so nothing needs to change at inference. (The original dropout paper instead scaled the weights at inference time.) 164 | 165 | ## What is the purpose of this line from `Dropout`: `if not self.training: return x`? 166 | 167 | When not in training mode, don't apply dropout. 168 | 169 | ## How do you set your model in **training** mode in PyTorch? 170 | 171 | `Module.train()` 172 | 173 | ## How do you set your model in **evaluation** mode in PyTorch? 174 | 175 | `Module.eval()` 176 | 177 | ## What is the equation for l2 activation regularization? 178 | 179 | `loss = original_loss + alpha * activations.pow(2).mean()` 180 | 181 | or 182 | 183 | $$ loss \leftarrow original\_loss + \alpha \cdot \frac{1}{n}\sum_{i=1}^n{activation_i^2} $$ 184 | 185 | Side note: you could also sum the activations squared instead of taking their average. 186 | 187 | ## How is the equation for l2 activation regularization different from weight decay using L2 regularisation? 188 | 189 | The activation regularization is not decreasing the weights but the **activations**. 190 | 191 | ## What is the equation for temporal activation regularization (in python)?
192 | 193 | `loss = original_loss + beta * (activations[:, 1:] - activations[:, :-1]).pow(2).mean()` 194 | 195 | Side note: you could also use `sum()` instead of `mean()`. 196 | 197 | ## What is *weight tying* in a language model? 198 | 199 | It is a technique where we set the **input-to-hidden and hidden-to-output weights** to be **equal**. They are the same object in memory, the same tensor, playing both roles. The hypothesis is that, conceptually, predicting the next word in a language model (converting activations to English words) and converting embeddings to activations are essentially the same operation: these model tasks are fundamentally similar. It turns out that tying the weights indeed allows a model to train better. 200 | 201 | [Source](https://aiquizzes.com/questions/81) 202 | -------------------------------------------------------------------------------- /docs/progress.md: -------------------------------------------------------------------------------- 1 | # Progress 2 | 3 | ⁉️ The remaining cards to include or to complete. 4 | 5 | ## 7 - Training a State-of-the-Art Model 6 | 7 | - What is the difference between ImageNet and Imagenette? When is it better to experiment on one versus the other? 8 | - What is normalization? 9 | - Why didn't we have to care about normalization when using a pretrained model? 10 | - What is progressive resizing? 11 | - Implement progressive resizing in your own project. Did it help? 12 | - What is test time augmentation? How do you use it in fastai? 13 | - Is using TTA at inference slower or faster than regular inference? Why? 14 | - What is Mixup? How do you use it in fastai? 15 | - Why does Mixup prevent the model from being too confident? 16 | - Why does training with Mixup for five epochs end up worse than training without Mixup? 17 | - What is the idea behind label smoothing? 18 | - What problems in your data can label smoothing help with?
19 | - When using label smoothing with five categories, what is the target associated with the index 1? 20 | - What is the first step to take when you want to prototype quick experiments on a new dataset? 21 | 22 | ## 9 - Tabular Modeling Deep Dive 23 | 24 | - Make a list of reasons why a model's validation set error might be worse than the OOB error. How could you test your hypotheses? # TODO: add more reasons and add the test part. The question is renamed "Tell (2) reasons why a model's validation set error might be worse than the OOB error" 25 | 26 | ## 12 - A Language Model from Scratch 27 | 28 | - Why should we get better results in an RNN if we call `detach` less often? Why might this not happen in practice with a simple RNN? 29 | 30 | ## 14 - ResNets 31 | 32 | - How did we get to a single vector of activations in the CNNs used for MNIST in previous chapters? Why isn't that suitable for Imagenette? 33 | - What do we do for Imagenette instead? 34 | - What is "adaptive pooling"? 35 | - What is "average pooling"? 36 | - Why do we need `Flatten` after an adaptive average pooling layer? 37 | - What is a "skip connection"? 38 | - Why do skip connections allow us to train deeper models? 39 | - What does <> show? How did that lead to the idea of skip connections? 40 | - What is "identity mapping"? 41 | - What is the basic equation for a ResNet block (ignoring batchnorm and ReLU layers)? 42 | - What do ResNets have to do with residuals? 43 | - How do we deal with the skip connection when there is a stride-2 convolution? How about when the number of filters changes? 44 | - How can we express a 1×1 convolution in terms of a vector dot product? 45 | - Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image? 46 | - What does the `noop` function return? 47 | - Explain what is shown in <>. 48 | - When is top-5 accuracy a better metric than top-1 accuracy? 49 | - What is the "stem" of a CNN? 
50 | - Why do we use plain convolutions in the CNN stem, instead of ResNet blocks? 51 | - How does a bottleneck block differ from a plain ResNet block? 52 | - Why is a bottleneck block faster? 53 | - How do fully convolutional nets (and nets with adaptive pooling in general) allow for progressive resizing? 54 | 55 | ## 15 - Application Architectures Deep Dive 56 | 57 | - What is the "head" of a neural net? 58 | - What is the "body" of a neural net? 59 | - What is "cutting" a neural net? Why do we need to do this for transfer learning? 60 | - What is `model_meta`? Try printing it to see what's inside. 61 | - Read the source code for `create_head` and make sure you understand what each line does. 62 | - Look at the output of `create_head` and make sure you understand why each layer is there, and how the `create_head` source created it. 63 | - Figure out how to change the dropout, layer size, and number of layers created by `cnn_learner`, and see if you can find values that result in better accuracy from the pet recognizer. 64 | - What does `AdaptiveConcatPool2d` do? 65 | - What is "nearest neighbor interpolation"? How can it be used to upsample convolutional activations? 66 | - What is a "transposed convolution"? What is another name for it? 67 | - Create a conv layer with `transpose=True` and apply it to an image. Check the output shape. 68 | - Draw the U-Net architecture. 69 | - What is "BPTT for Text Classification" (BPT3C)? 70 | - How do we handle different length sequences in BPT3C? 71 | - Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step. 72 | - How is `self.layers` defined in `TabularModel`? 73 | - What are the five steps for preventing over-fitting? 74 | - Why don't we reduce architecture complexity before trying other approaches to preventing overfitting? 
75 | 76 | ## 16 - The Training Process 77 | 78 | - What is the equation for a step of SGD, in math or code (as you prefer)? 79 | - What do we pass to `cnn_learner` to use a non-default optimizer? 80 | - What are optimizer callbacks? 81 | - What does `zero_grad` do in an optimizer? 82 | - What does `step` do in an optimizer? How is it implemented in the general optimizer? 83 | - Rewrite `sgd_cb` to use the `+=` operator, instead of `add_`. 84 | - What is "momentum"? Write out the equation. 85 | - What's a physical analogy for momentum? How does it apply in our model training settings? 86 | - What does a bigger value for momentum do to the gradients? 87 | - What are the default values of momentum for 1cycle training? 88 | - What is RMSProp? Write out the equation. 89 | - What do the squared values of the gradients indicate? 90 | - How does Adam differ from momentum and RMSProp? 91 | - Write out the equation for Adam. 92 | - Calculate the values of `unbias_avg` and `w.avg` for a few batches of dummy values. 93 | - What's the impact of having a high `eps` in Adam? 94 | - Read through the optimizer notebook in fastai's repo, and execute it. 95 | - In what situations do dynamic learning rate methods like Adam change the behavior of weight decay? 96 | - What are the four steps of a training loop? 97 | - Why is using callbacks better than writing a new training loop for each tweak you want to add? 98 | - What aspects of the design of fastai's callback system make it as flexible as copying and pasting bits of code? 99 | - How can you get the list of events available to you when writing a callback? 100 | - Write the `ModelResetter` callback (without peeking). 101 | - How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcuts that go with them? 102 | - How can a callback influence the control flow of the training loop. 103 | - Write the `TerminateOnNaN` callback (without peeking, if possible). 
104 | - How do you make sure your callback runs after or before another callback? 105 | 106 | ## 17 - A Neural Net from the Foundations 107 | 108 | - Write the Python code to implement a single neuron. 109 | - Write the Python code to implement ReLU. 110 | - Write the Python code for a dense layer in terms of matrix multiplication. 111 | - Write the Python code for a dense layer in plain Python (that is, with list comprehensions and functionality built into Python). 112 | - What is the "hidden size" of a layer? 113 | - What does the `t` method do in PyTorch? 114 | - Why is matrix multiplication written in plain Python very slow? 115 | - In `matmul`, why is `ac==br`? 116 | - In Jupyter Notebook, how do you measure the time taken for a single cell to execute? 117 | - What is "elementwise arithmetic"? 118 | - Write the PyTorch code to test whether every element of `a` is greater than the corresponding element of `b`. 119 | - What is a rank-0 tensor? How do you convert it to a plain Python data type? 120 | - What does this return, and why? `tensor([1,2]) + tensor([1])` 121 | - What does this return, and why? `tensor([1,2]) + tensor([1,2,3])` 122 | - How does elementwise arithmetic help us speed up `matmul`? 123 | - What are the broadcasting rules? 124 | - What is `expand_as`? Show an example of how it can be used to match the results of broadcasting. 125 | - How does `unsqueeze` help us to solve certain broadcasting problems? 126 | - How can we use indexing to do the same operation as `unsqueeze`? 127 | - How do we show the actual contents of the memory used for a tensor? 128 | - When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.) 129 | - Do broadcasting and `expand_as` result in increased memory use? Why or why not? 130 | - Implement `matmul` using Einstein summation. 
131 | - What does a repeated index letter represent on the left-hand side of einsum? 132 | - What are the three rules of Einstein summation notation? Why? 133 | - What are the forward pass and backward pass of a neural network? 134 | - Why do we need to store some of the activations calculated for intermediate layers in the forward pass? 135 | - What is the downside of having activations with a standard deviation too far away from 1? 136 | - How can weight initialization help avoid this problem? 137 | - What is the formula to initialize weights such that we get a standard deviation of 1 for a plain linear layer, and for a linear layer followed by ReLU? 138 | - Why do we sometimes have to use the `squeeze` method in loss functions? 139 | - What does the argument to the `squeeze` method do? Why might it be important to include this argument, even though PyTorch does not require it? 140 | - What is the "chain rule"? Show the equation in either of the two forms presented in this chapter. 141 | - Show how to calculate the gradients of `mse(lin(l2, w2, b2), y)` using the chain rule. 142 | - What is the gradient of ReLU? Show it in math or code. (You shouldn't need to commit this to memory—try to figure it using your knowledge of the shape of the function.) 143 | - In what order do we need to call the `*_grad` functions in the backward pass? Why? 144 | - What is `__call__`? 145 | - What methods must we implement when writing a `torch.autograd.Function`? 146 | - Write `nn.Linear` from scratch, and test it works. 147 | - What is the difference between `nn.Module` and fastai's `Module`? 148 | 149 | ## 18 - CNN Interpretation with CAM 150 | 151 | - What is a "hook" in PyTorch? 152 | - Which layer does CAM use the outputs of? 153 | - Why does CAM require a hook? 154 | - Look at the source code of the `ActivationStats` class and see how it uses hooks. 155 | - Write a hook that stores the activations of a given layer in a model (without peeking, if possible). 
156 | - Why do we call `eval` before getting the activations? Why do we use `no_grad`? 157 | - Use `torch.einsum` to compute the "dog" or "cat" score of each of the locations in the last activation of the body of the model. 158 | - How do you check which order the categories are in (i.e., the correspondence of index->category)? 159 | - Why are we using `decode` when displaying the input image? 160 | - What is a "context manager"? What special methods need to be defined to create one? 161 | - Why can't we use plain CAM for the inner layers of a network? 162 | - Why do we need to register a hook on the backward pass in order to do Grad-CAM? 163 | - Why can't we call `output.backward()` when `output` is a rank-2 tensor of output activations per image per class? 164 | 165 | ## 19 - A fastai Learner from Scratch 166 | 167 | - What is `glob`? 168 | - How do you open an image with the Python imaging library? 169 | - What does `L.map` do? 170 | - What does `Self` do? 171 | - What is `L.val2idx`? 172 | - What methods do you need to implement to create your own `Dataset`? 173 | - Why do we call `convert` when we open an image from Imagenette? 174 | - What does `~` do? How is it useful for splitting training and validation sets? 175 | - Does `~` work with the `L` or `Tensor` classes? What about NumPy arrays, Python lists, or pandas DataFrames? 176 | - What is `ProcessPoolExecutor`? 177 | - How does `L.range(self.ds)` work? 178 | - What is `__iter__`? 179 | - What is `first`? 180 | - What is `permute`? Why is it needed? 181 | - What is a recursive function? How does it help us define the `parameters` method? 182 | - Write a recursive function that returns the first 20 items of the Fibonacci sequence. 183 | - What is `super`? 184 | - Why do subclasses of `Module` need to override `forward` instead of defining `__call__`? 185 | - In `ConvLayer`, why does `init` depend on `act`? 186 | - Why does `Sequential` need to call `register_modules`? 
187 | - Write a hook that prints the shape of every layer's activations. 188 | - What is "LogSumExp"? 189 | - Why is `log_softmax` useful? 190 | - What is `GetAttr`? How is it helpful for callbacks? 191 | - Reimplement one of the callbacks in this chapter without inheriting from `Callback` or `GetAttr`. 192 | - What does `Learner.__call__` do? 193 | - What is `getattr`? (Note the case difference to `GetAttr`!) 194 | - Why is there a `try` block in `fit`? 195 | - Why do we check for `model.training` in `one_batch`? 196 | - What is `store_attr`? 197 | - What is the purpose of `TrackResults.before_epoch`? 198 | - What does `model.cuda` do? How does it work? 199 | - Why do we need to check `model.training` in `LRFinder` and `OneCycle`? 200 | - Use cosine annealing in `OneCycle`. 201 | -------------------------------------------------------------------------------- /src/02_production.md: -------------------------------------------------------------------------------- 1 | # 02_production 2 | 3 | ## Provide (5) examples of where the bear classification model might work poorly, due to structural or style differences from the training data 4 | 5 | - The bear is partially **obstructed** 6 | - **Nighttime** images are passed into the model 7 | - **Low-resolution** images are passed into the model 8 | - The bear is **far away** from the camera 9 | - The bear training dataset is highly **biased** towards one type of feature (e.g. color) 10 | 11 | P.S.: these cases were not represented in the training data 12 | 13 | ## Where do text models currently have a major deficiency? 14 | 15 | Text models still struggle with ***correct* responses**. Given factual information (such as a knowledge base), it is still hard to generate responses that utilize this information to produce factually correct answers, even though the text can seem very compelling. This can be very dangerous, as the layman may not be able to evaluate the factual accuracy of the generated text.
On the other hand, text models can generate context-appropriate text (like replies or imitating author style). 16 | 17 | ## What are (2) possible negative societal implications of text generation models? 18 | 19 | - The ability for text generation models to generate context-aware, highly compelling responses can be used at a massive scale to spread disinformation ("fake news") and encourage conflict. 20 | - Models reinforce bias (like gender bias, racial bias) in training data and create a vicious cycle of biased outputs. 21 | 22 | ## In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process? 23 | 24 | The predictions of the model could be **reviewed by human experts** for them to evaluate the results and determine what the best next step is. 25 | 26 | This is especially true for applying machine learning for medical diagnoses. For example, a machine learning model for identifying strokes in CT scans can flag high-priority cases for expedited review, while other cases are still sent to radiologists for review. Other models can also augment the medical professional's abilities, reducing risk but still improving efficiency of the workflow. For example, deep learning models can provide useful measurements for radiologists or pathologists. 27 | 28 | ## What kind of tabular data is deep learning particularly good at? 29 | 30 | Tabular data with: 31 | 32 | - **natural language** 33 | - **high-cardinality categorical columns** (containing a larger number of discrete choices, like zip codes). 34 | 35 | ## What's a key downside of directly using a deep learning model for recommendation systems? 36 | 37 | Deep learning will often **only tell you what products a user might like**, rather than which recommendations would actually be helpful to the user.
38 | 39 | For example, if a user is familiar with other books from the same author, it isn't helpful to recommend those products even though the user bought the author's book. Likewise, it isn't helpful to recommend products a user may have already purchased. 40 | 41 | ## What are the steps of the Drivetrain approach? 42 | 43 | - **Objective**: what outcome am I trying to achieve? 44 | - **Levers**: what inputs can we control? 45 | - **Data**: what data can we collect? 46 | - **Models**: how do the levers influence the objective? 47 | 48 | ![](img/02-drivetrain-steps.png) 49 | 50 | ## How do the steps of the Drivetrain approach map to a recommendation system? 51 | 52 | - The **objective** of a recommendation engine is to drive additional sales by surprising and delighting the customer with recommendations of items they would not have purchased without the recommendation. 53 | - The **lever** is the ranking of the recommendations. 54 | - New **data** must be collected to generate recommendations that will *cause new sales*. This will require conducting many randomized experiments in order to collect data about a wide range of recommendations for a wide range of customers. This is a step that few organizations take; but without it, you don't have the information you need to actually optimize recommendations based on your true objective (more sales!) 55 | 56 | ## What is `DataLoaders`? 57 | 58 | The `DataLoaders` class is the class that passes the data to the fastai model. It is essentially a class that stores the required `DataLoader` objects (usually for train and validation sets). 59 | 60 | ## What four things do we need to tell fastai to create `DataLoaders`?
61 | 62 | - What kinds of data we are working with (`blocks`) 63 | - How to get the list of items (`get_items`) 64 | - How to label these items (`get_y`) 65 | - How to create the validation set (`splitter`) 66 | 67 | For example, 68 | 69 | ```py 70 | bears_datablock = DataBlock( 71 | blocks=(ImageBlock, CategoryBlock), 72 | get_items=get_image_files, 73 | get_y=parent_label, 74 | splitter=RandomSplitter(valid_pct=0.2, seed=123)) 75 | 76 | bears_dataloaders = bears_datablock.dataloaders(path) 77 | ``` 78 | 79 | ## What does the splitter parameter to `DataBlock` do? 80 | 81 | In a fastai `DataBlock`, the `splitter` argument provides fastai a way to **split up the dataset into subsets** (usually train and validation set). For example, to randomly split the data, you can use fastai's predefined `RandomSplitter` class, providing it with the proportion of the data used for validation. 82 | 83 | ## How do we ensure a random split always gives the same validation set? 84 | 85 | **By using a random seed**. Indeed, it turns out it is impossible for our computers to generate truly random numbers. Instead, they use a process known as a pseudo-random generator. However, this process can be controlled using a random seed. By setting a random seed value (e.g. `seed=123`), the pseudo-random generator will generate the "random" numbers in a fixed manner and it will be the same for every run. Using a random seed, we can generate a random split that always gives the same validation set. 86 | 87 | ## What letters are often used to signify the independent and dependent variables? 88 | 89 | - **x**: independent 90 | - **y**: dependent 91 | 92 | ## What is `crop` and its disadvantage? 93 | 94 | - `crop` is the default `Resize()` method, and it ***crops* the images** to fit a square shape of the size requested, using the full width or height. 95 | - Disadvantage: this can result in losing some important details.
For instance, if we were trying to recognize the breed of dog or cat, we may end up cropping out a key part of the body or the face necessary to distinguish between similar breeds. 96 | 97 | ![](img/02-crop.png) 98 | 99 | ## What is `pad` and its (2) disadvantages? 100 | 101 | - `pad` is an alternative `Resize()` method, which **pads the matrix** of the image's pixels with zeros (which shows as black when viewing the images). 102 | - Disadvantages: (1) padding leaves a whole lot of empty space, which is wasted computation for our model, and (2) it results in a lower effective resolution for the part of the image we actually use. 103 | 104 | ![](img/02-pad.png) 105 | 106 | ## What is `squish` and its disadvantage? 107 | 108 | - `squish` is another alternative `Resize()` method, which can **either squish or stretch** the image. 109 | - Disadvantage: this can cause the image to take on an unrealistic shape, leading to a model that learns that things look different from how they actually are, which we would expect to result in lower accuracy. 110 | 111 | ![](img/02-squish.png) 112 | 113 | ## What is `RandomResizedCrop`? 114 | 115 | A method in which we **crop a randomly selected region of the image**. So every epoch, the model will see a different part of the image and will learn accordingly. 116 | 117 | ## What is data augmentation? 118 | 119 | **It refers to creating random variations of our input data**, such that they appear different, but not so different that it changes the meaning of the data. Examples include flipping, rotation, perspective warping, brightness changes, etc. 120 | 121 | ## Why is data augmentation needed? 122 | 123 | Because it helps the model to **understand the basic concept of what an object is** and **how the objects of interest are represented in images**. Therefore, data augmentation allows machine learning models to *generalize*. This is especially important when it can be slow and expensive to label data.
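To make the idea concrete, here is a minimal, framework-free sketch of one label-preserving augmentation (a horizontal flip), assuming an image is represented as a nested list of pixel rows; the function name `horizontal_flip` is illustrative, not a fastai API:

```py
def horizontal_flip(image):
    """Flip an image (a list of pixel rows) left-to-right.

    A label-preserving transform: a flipped cat is still a cat,
    so the model sees a "new" input with the same meaning.
    """
    return [list(reversed(row)) for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
```

Real augmentation pipelines (such as fastai's `aug_transforms()`) apply many such random transforms on the fly, so each epoch sees slightly different versions of the same images.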
124 | 125 | ## What is the difference between `item_tfms` and `batch_tfms`? 126 | 127 | - `item_tfms` are transformations **applied to a single data sample `x` on the CPU**. `Resize()` is a common transform because the mini-batch of input images to a CNN must have the same dimensions. Assuming the images are RGB with 3 channels, then `Resize()` as an `item_tfms` will make sure the images have the same width and height. 128 | - `batch_tfms` are **applied to batched data samples** (aka individual samples that have been collated into a mini-batch) **on the GPU**. They are faster and more efficient than `item_tfms`. A good example of these are the ones provided by `aug_transforms()`. Inside are several batch-level augmentations that help many models. 129 | 130 | ## What is a confusion matrix? 131 | 132 | It's a **representation of the predictions made vs the correct labels**. The rows of the matrix represent the actual labels while the columns represent the predictions. Therefore, the number of images in the diagonal elements represents the number of correctly classified images, while the off-diagonal elements are incorrectly classified images. Confusion matrices provide useful information about how well the model is doing and which classes the model might be *confusing*. 133 | 134 | Here is a generic confusion matrix: 135 | 136 | | Total population (P + N) | Positive prediction | Negative prediction | 137 | | :----------------------: | :-----------------: | :-----------------: | 138 | | Actually positive (P) | True positive (TP) | False negative (FN) | 139 | | Actually negative (N) | False positive (FP) | True negative (TN) | 140 | 141 | And here, a practical example: 142 | 143 | | Total population (12) | Cancer prediction (9) | Non-cancer prediction (3) | 144 | | :-------------------: | :-------------------: | :-----------------------: | 145 | | Actual cancer (7) | 6 | 1 | 146 | | Actual non-cancer (5) | 3 | 2 | 147 | 148 | ## What does `export` save?
149 | 150 | - The architecture 151 | - The trained parameters of the neural network architecture 152 | - How the `DataLoaders` are defined 153 | 154 | ## What is it called when we use a model for getting predictions, instead of training? 155 | 156 | Inference 157 | 158 | ## What are IPython widgets? 159 | 160 | IPython widgets are JavaScript and Python combined functionalities that let us **build and interact with GUI components directly in a Jupyter notebook**. An example of this would be an upload button, which can be created with the Python function `widgets.FileUpload()`. 161 | 162 | ## When might you want to use GPU for deployment? When might CPU be better? 163 | 164 | - **GPUs are best for doing identical work in parallel**. GPUs could be used if you collect user responses into a batch, and perform inference on the whole batch at once. This may require the user to wait for model predictions. Additionally, there are many other complexities when it comes to GPU inference, like memory management and queuing of the batches. 165 | 166 | - **CPU** may be more cost-effective if you will be **analyzing single pieces of data at a time** (like a single image or single sentence), especially with more market competition for CPU servers versus GPU servers. 167 | 168 | ## What are the downsides of deploying your app to a server, instead of to a client (or edge) device such as a phone or PC? 169 | 170 | The application will: 171 | 172 | - require a network connection 173 | - incur extra network latency when submitting inputs and returning results 174 | - raise security concerns if you send private data to a network server 175 | 176 | Deploying a model to a server will: 177 | 178 | - make it easier to iterate and roll out new versions of a model. This is because you as a developer have full control over the server environment and only need to deploy a new version once, rather than having to make sure that all the endpoints (phones, PCs) upgrade their version individually.
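The batching idea above can be sketched without any framework: single requests are buffered until a full mini-batch is collected, then processed together (as a GPU would). The names `BatchCollector`, `submit`, and `flush`, and the batch size, are illustrative assumptions, not fastai or PyTorch APIs:

```py
class BatchCollector:
    """Buffer single inference requests into mini-batches.

    Illustrative sketch: `run_model` stands in for a real batched
    inference call (e.g. a model running on a GPU).
    """
    def __init__(self, run_model, batch_size=4):
        self.run_model = run_model
        self.batch_size = batch_size
        self.pending = []

    def submit(self, item):
        # Queue one request; run the model only once a full batch is ready.
        self.pending.append(item)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return None  # caller waits until the batch fills up

    def flush(self):
        # Process whatever is queued as one batch and clear the buffer.
        batch, self.pending = self.pending, []
        return self.run_model(batch)

# Toy "model" that doubles each input, applied to a whole batch at once.
collector = BatchCollector(lambda batch: [2 * x for x in batch], batch_size=3)
print(collector.submit(1))  # None (waiting)
print(collector.submit(2))  # None (waiting)
print(collector.submit(3))  # [2, 4, 6]
```

A real deployment would also flush on a timeout so early requests are not stuck waiting for stragglers — one of the queuing complexities mentioned above.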
179 | 180 | ## Name (3) examples of situations that could lead to poor performance when rolling out a bear warning system in practice? 181 | 182 | - Handling **night-time** images 183 | - Dealing with **low-resolution** images (ex: some smartphone images) 184 | - The model returns predictions too **slowly** to be useful 185 | 186 | ## What is "out-of-domain data"? 187 | 188 | Data that is **fundamentally different** in some aspect compared to the model's training data. For example, an object detector that was trained exclusively with outside daytime photos is given a photo taken at night. 189 | 190 | ## What is "domain shift"? 191 | 192 | This is **when the type of data changes gradually over time**. For example, an insurance company is using a deep learning model as part of its pricing algorithm, but over time its customers change, so the original training data is no longer representative of current data and the model is effectively being applied to out-of-domain data. 193 | 194 | ## What are the (3) steps in the deployment process? 195 | 196 | 1. **Manual process** – the model is run in parallel and not directly driving any actions, with humans still checking the model outputs. 197 | 2. **Limited scope deployment** – the model’s scope is limited and carefully supervised. For example, doing a geographically and time-constrained trial of model deployment. 198 | 3. **Gradual expansion** – The model scope is gradually increased, while good reporting systems are implemented in order to check for any significant changes to the actions taken compared to the manual process (i.e. the models should perform similarly to the humans, unless it is already anticipated to be better). 199 | 200 | ## For a project you’re interested in applying deep learning to, consider the thought experiment "what would happen if it went really, really well?" 201 | 202 | To be done by reader...
203 | 204 | ## Start a blog, and write your first blog post. For instance, write about what you think deep learning might be useful for in a domain you’re interested in 205 | 206 | To be done by reader... Check [this forum post](https://forums.fast.ai/t/fastai2-blog-posts-projects-and-tutorials/65827) for inspiration. 207 | -------------------------------------------------------------------------------- /src/04_mnist_basics.md: -------------------------------------------------------------------------------- 1 | # 04_mnist_basics 2 | 3 | ## How is an image represented on a computer? How about a gray image and a color image? 4 | 5 | - **Images** are represented by arrays with pixel values representing the content of the image. 6 | 7 | - For **grayscale images**, a 2-dimensional array is used with the pixels representing the grayscale values, with a range of 256 integers. A value of 0 would represent white, and a value of 255 represents black, and different shades of grayscale in between. 8 | 9 | - For **color images**, three color channels (red, green, blue) are typically used, with a separate 256-range 2D array used for each channel. A channel value of 0 represents no intensity, and 255 represents full intensity of red, green, or blue. The three 2D arrays form a final 3D array (a rank-3 tensor) representing the color image. 10 | 11 | ## How are the files and folders in the `MNIST_SAMPLE` dataset structured? Why? 12 | 13 | There are **two subfolders**, `train` and `valid`; the former contains the data for model training, the latter contains the data for validating model performance after each training step. It is structured this way because it **simplifies comparing results between implementations/publications**.
14 | 15 | Evaluating the model on the validation set serves two purposes: 16 | 17 | - to report a human-interpretable metric such as accuracy (in contrast to the often abstract loss functions used for training) 18 | - to facilitate the detection of overfitting by evaluating the model on a dataset it hasn't been trained on (in short, an overfitting model performs increasingly well on the training set but decreasingly so on the validation set). 19 | 20 | Of course, every practitioner could generate their own train/validation split of the data. As specified earlier, public datasets are usually pre-split to simplify comparing results between implementations/publications. 21 | 22 | Each subfolder has two subsubfolders `3` and `7` which contain the `.jpg` files for the respective class of images. This is a common way of organizing datasets comprised of pictures. For the full `MNIST` dataset there are 10 subsubfolders, one for the images of each digit. 23 | 24 | ## Explain how the "pixel similarity" approach to classifying digits works 25 | 26 | In the "pixel similarity" approach, we **generate an archetype for each class** we want to identify. In our case, we want to distinguish images of 3's from images of 7's. We define the archetypical 3 as the pixel-wise mean value of all 3's in the training set. Analogously for the 7's. You can visualize the two archetypes and see that they are in fact blurred versions of the numbers they represent. 27 | 28 | In order to tell if a previously unseen image is a 3 or a 7, we calculate its distance to the two archetypes (here: mean pixel-wise absolute difference). We say the new image is a 3 if its distance to the archetypical 3 is lower than to the archetypical 7. 29 | 30 | ## What is a list comprehension? 31 | 32 | A list comprehension is a Pythonic way of condensing the creation of a list using a `for`-loop into a single expression. List comprehensions will also often include if clauses for filtering.
33 | 34 | For example, 35 | 36 | ```py 37 | >>> zero_to_four = [i for i in range(5)] 38 | >>> zero_to_four 39 | [0, 1, 2, 3, 4] 40 | ``` 41 | 42 | is equal to: 43 | 44 | ```py 45 | >>> zero_to_four = [] 46 | >>> for i in range(5): 47 | ...     zero_to_four.append(i) 48 | >>> zero_to_four 49 | [0, 1, 2, 3, 4] 50 | ``` 51 | 52 | ## Create a list comprehension that selects odd numbers from an iterator (like `range()`) and doubles them 53 | 54 | ```py 55 | >>> odd_numbers_doubled = [2*i for i in range(5) if i%2 == 1] 56 | >>> odd_numbers_doubled 57 | [2, 6] 58 | ``` 59 | 60 | is equal to: 61 | 62 | ```py 63 | >>> odd_numbers_doubled = [] 64 | >>> for i in range(5): 65 | ...     if i%2 == 1: 66 | ...         odd_numbers_doubled.append(2*i) 67 | >>> odd_numbers_doubled 68 | [2, 6] 69 | ``` 70 | 71 | ## What is the rank of a tensor? 72 | 73 | It's the **number of dimensions** (or axes) it has. An easy way to identify the rank is the number of indices you would need to reference a number within a tensor. A scalar can be represented as a tensor of rank 0 (no index), a vector can be represented as a tensor of rank 1 (one index, e.g., `v[i]`), a matrix can be represented as a tensor of rank 2 (two indices, e.g., `a[i, j]`), and a tensor of rank 3 is a cuboid or a "stack of matrices" (three indices, e.g., `b[i, j, k]`). In particular, the rank of a tensor is independent of its shape or dimensionality, e.g., a tensor of shape 2x2x2 and a tensor of shape 3x5x7 both have rank 3. 74 | 75 | Note that the term "rank" has different meanings in the context of tensors and matrices (where it refers to the number of linearly independent column vectors). 76 | 77 | ## What is the difference between tensor rank and shape? 78 | 79 | - **Rank**: number of dimensions (or axes) in a tensor 80 | - **Shape**: size of each axis of a tensor 81 | 82 | ## How do you get the rank from the shape of a tensor? 83 | 84 | **The length of a tensor's shape** is its rank.
So if we have the images of the 3's folder from the `MNIST_SAMPLE` dataset in a tensor called `stacked_threes`, we find its shape like this: 85 | 86 | ```py 87 | >>> stacked_threes.shape 88 | torch.Size([6131, 28, 28]) 89 | ``` 90 | 91 | And its rank like this: 92 | 93 | ```py 94 | >>> len(stacked_threes.shape) 95 | 3 96 | ``` 97 | 98 | You can also get a tensor's rank directly with `ndim`: 99 | 100 | ```py 101 | >>> stacked_threes.ndim 102 | 3 103 | ``` 104 | 105 | ## What are RMSE and L1 norm? 106 | 107 | Root mean square error (RMSE), also called the L2 norm, and mean absolute difference (MAE), also called the L1 norm, are two commonly used **methods of measuring "distance"**. Simple differences do not work because some differences are positive and others are negative, canceling each other out. Therefore, a function that focuses on the magnitudes of the differences is needed to properly measure distances. The simplest would be to add the absolute values of the differences, which is what MAE is. RMSE takes the mean of the square (makes everything positive) and then takes the square root (undoes squaring). 108 | 109 | ## How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop? 110 | 111 | As loops are very slow in Python, it is best to **represent the operations as array operations** rather than looping through individual elements. If this can be done, then using NumPy or PyTorch will be thousands of times faster, as they use underlying C code which is much faster than pure Python. Even better, PyTorch allows you to run operations on GPU, which will have significant speedup if there are parallel operations that can be done. 112 | 113 | ## Create a 3x3 tensor or array containing the numbers from 1 to 9. Double it.
Select the bottom right 4 numbers 114 | 115 | ```py 116 | >>> tensor_3x3 = torch.Tensor(list(range(1, 10))).view(3, 3) 117 | >>> tensor_doubled = 2*tensor_3x3 118 | >>> tensor_bottom_right = tensor_doubled[1:, 1:] 119 | 120 | >>> tensor_3x3 121 | tensor([[1., 2., 3.], 122 | [4., 5., 6.], 123 | [7., 8., 9.]]) 124 | 125 | >>> tensor_doubled 126 | tensor([[ 2., 4., 6.], 127 | [ 8., 10., 12.], 128 | [14., 16., 18.]]) 129 | 130 | >>> tensor_bottom_right 131 | tensor([[10., 12.], 132 | [16., 18.]]) 133 | ``` 134 | 135 | ## What is broadcasting? 136 | 137 | In PyTorch, **tensors with smaller rank are expanded to have the same size as the larger-rank tensor**. In this way, operations can be performed between tensors with different ranks. Fortunately, scientific/numerical Python packages like NumPy and PyTorch implement broadcasting, which often makes code easier to write. 138 | 139 | ## Are metrics generally calculated using the training set, or the validation set? Why? 140 | 141 | **Metrics are generally calculated on a validation set**. As the validation set is unseen data for the model, evaluating the metrics on the validation set is better in order to determine if there is any overfitting and how well the model might generalize if given similar data. 142 | 143 | ## What is SGD? 144 | 145 | SGD, or stochastic gradient descent, is an **optimization algorithm**. Specifically, SGD is an algorithm that will update the parameters of a model in order to minimize a given loss function that was evaluated on the predictions and target. The key idea behind SGD (and many optimization algorithms, for that matter) is that the gradient of the loss function provides an indication of how that loss function changes in the parameter space, which we can use to determine how best to update the parameters in order to minimize the loss function. This is what SGD does. 146 | 147 | ## Why does SGD use mini-batches?
148 | 149 | Because, most of the time, it leads to **better performance**. If we computed the gradient from a single instance at a time, it would be unstable and imprecise, which is not suitable for training. On the other hand, if we calculated the gradient from the full dataset, we might get stuck in a local minimum (because we lose the power of the *randomness* of stochastic gradient descent). This (usually) means that the model won't perform its best. Therefore, we need to use mini-batches. 150 | 151 | ## What are the (7) steps in SGD for machine learning? 152 | 153 | 1. **Initialize the parameters**: random values often work best. 154 | 2. Calculate the **predictions**: this is done on the training set, one mini-batch at a time. 155 | 3. Calculate the **loss**: the average loss over the mini-batch is calculated 156 | 4. Calculate the **gradients**: this is an approximation of how the parameters need to change in order to minimize the loss function 157 | 5. **Step the weights**: update the parameters based on the calculated gradients and the learning rate 158 | 6. **Repeat** the process 159 | 7. **Stop**: in practice, this is either based on time constraints or usually based on when the training/validation losses and metrics stop improving. 160 | 161 | ## How do we initialize the weights in a model? 162 | 163 | **Random** weights work pretty well. 164 | 165 | ## What is "loss"? 166 | 167 | It's a function that returns a value based on the given predictions and targets, where lower values correspond to better model predictions. 168 | 169 | ## Why can’t we always use a high learning rate? 170 | 171 | **The loss may "bounce" around (oscillate) or even diverge**, as the optimizer is taking steps that are too large, and updating the parameters faster than it should be. 172 | 173 | ## What is a "gradient"? 174 | 175 | The gradient is a **measure that tells us how much we have to change each weight** to make our model better.
Indeed, this is a measure of how the loss function changes with changes of the weights of the model (the derivative). 176 | 177 | ## Do you need to know how to calculate gradients yourself? 178 | 179 | No, deep learning libraries will automatically calculate the gradients for you. This feature is known as automatic differentiation. In PyTorch, if a tensor has `requires_grad=True`, calling `loss.backward()` computes the gradients and stores them in the `.grad` attribute of each parameter. 180 | 181 | ## Why can’t we use accuracy as a loss function? 182 | 183 | Because a **loss function needs to change as the weights are being adjusted**. Accuracy only changes if the predictions of the model change. So if there are slight changes to the model that, say, improve confidence in a prediction, but do not change the prediction, the accuracy will still not change. Therefore, the gradients will be zero everywhere except when the actual predictions change. The model therefore cannot learn from gradients equal to zero, and the model’s weights will not update and will not train. A good loss function gives a slightly better loss when the model gives slightly better predictions. A slightly better prediction means the model is more confident about the correct answer. For example, predicting 0.9 vs 0.7 for the probability that an MNIST image is a 3 would be a slightly better prediction. The loss function needs to reflect that. 184 | 185 | ## Imagine the sigmoid function. What is special about its shape? 186 | 187 | ![](img/04-sigmoid.png) 188 | 189 | The sigmoid function is a **smooth curve that squishes all values into values from 0 to 1**. Most loss functions assume that the model is outputting some form of a probability or confidence level between 0 and 1 so we use a sigmoid function at the end of the model in order to do this. 190 | 191 | Its formula: 192 | $ \sigma(x) = \frac{1}{1 + e^{-x}} $ 193 | 194 | ## What is the difference between loss and metric?
195 | 196 | - **Metric**: drives human understanding 197 | - **Loss**: drives automated learning 198 | 199 | In order for loss to be useful for training, it needs to have a meaningful derivative. Many metrics, like accuracy, are not like that. Metrics instead are the numbers that humans care about, that reflect the performance of the model. 200 | 201 | ## What is the function to calculate new weights using a learning rate? 202 | 203 | The optimizer `step()` function 204 | 205 | ## What does the `DataLoader` class do? 206 | 207 | The `DataLoader` class can take any Python collection and turn it into an iterator over many batches. 208 | 209 | ## Write pseudo-code showing the basic steps taken each epoch for SGD 210 | 211 | ```py 212 | for inputs, targets in train_dataloaders: 213 | pred = model(inputs) 214 | loss = loss_func(pred, targets) 215 | loss.backward() 216 | model.params -= model.params.grad * lr 217 | ``` 218 | 219 | ## Create a function which, if passed two arguments `[1, 2, 3, 4]` and `'abcd'`, returns `[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]`. What is special about that output data structure? 220 | 221 | ```py 222 | def func(a, b): 223 | return list(zip(a, b)) 224 | ``` 225 | 226 | This data structure is useful for machine learning models when you need lists of tuples where each tuple would contain input data and a label. 227 | 228 | ## What does `view` do in PyTorch? 229 | 230 | It changes the **shape** of a tensor without changing its contents. 231 | 232 | ## What is the *bias* parameter in a neuron? 233 | 234 | It is a term $b$ that we add to the multiplication of the weight by the input ($wx$) in a neuron where its output is $y = f_{activation}(wx + b)$. 235 | 236 | ## Why do we need the *bias* parameter in a neuron? 237 | 238 | Because the bias adds **flexibility**. Indeed, without the bias parameter, if the input is zero, the output will always be zero, since $y = 0$ where $y = wx$ and $x = 0$.
On the other hand, if the bias is present, we will obtain $y = b$. 239 | 240 | ## What does the @ operator do in Python? 241 | 242 | This is the **matrix multiplication** operator. 243 | 244 | ## What does the `backward` method do? 245 | 246 | It computes the current **gradients** (of the loss with respect to the parameters) and stores them in each parameter's `.grad` attribute; it does not return them. 247 | 248 | ## Why do we have to zero the gradients? 249 | 250 | Because PyTorch will **add the gradients of a variable to any previously stored gradients**. If the training loop function is called multiple times without zeroing the gradients, the gradient of the current `loss` would be added to the previously stored gradient value. 251 | 252 | ## What information do we have to pass to `Learner`? 253 | 254 | - The `DataLoaders` 255 | - The model 256 | - The optimization function 257 | - The loss function 258 | - (optionally) any metrics to print. 259 | 260 | ## Show pseudo-code (inspired by Python) for the basic steps of a training loop 261 | 262 | ```py 263 | def training_loop(): 264 | for _ in range(n_epoch): 265 | for batch in training_data_loader: 266 | training_step(batch) 267 | 268 | def training_step(batch): 269 | inputs, targets = batch 270 | predictions = model(inputs) 271 | loss = loss_function(predictions, targets) 272 | loss.backward() 273 | model.params.data -= lr * model.params.grad.data 274 | model.params.grad = None 275 | ``` 276 | 277 | ## What is "ReLU"? Draw a plot of it for values from -2 to +2 278 | 279 | ![](img/04-relu.png) 280 | 281 | ReLU just means "replace any negative numbers with zero". It is a commonly used activation function. 282 | 283 | ## What is an "activation function"? 284 | 285 | The activation function of a neuron is a **function that modifies the output of this neuron**. For example, the sigmoid function (as you saw earlier) is an activation function. 286 | 287 | Usually, activation functions provide *non-linearity* to the model.
The idea is that without non-linear activation functions, we just have multiple linear functions of the form `y = mx + b`. However, a series of linear layers is equivalent to a single linear layer, so our model can only fit a line to the data. By introducing a non-linearity in between the linear layers, this is no longer true. Each layer is somewhat decoupled from the rest of the layers, and the model can now fit much more complex functions. In fact, it can be mathematically proven that such a model can solve any computable problem to an arbitrarily high accuracy, if the model is large enough with the correct weights. This is known as the universal approximation theorem. 288 | 289 | ## What’s the difference between `F.relu` and `nn.ReLU`? 290 | 291 | - `F.relu`: a Python function for the ReLU activation function 292 | - `nn.ReLU`: a PyTorch module. This means it is a Python class that can be called as a function in the same way as `F.relu` 293 | 294 | ## The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more? 295 | 296 | There are practical **performance** benefits to using more than one nonlinearity. We can use a deeper model with fewer parameters, better performance, faster training, and lower compute/memory requirements. 297 | -------------------------------------------------------------------------------- /src/01_intro.md: -------------------------------------------------------------------------------- 1 | # 01_intro 2 | 3 | ## Do you need *Lots of math* for deep learning? 4 | 5 | No 6 | 7 | ## Do you need *Lots of data* for deep learning? 8 | 9 | No 10 | 11 | ## Do you need *Lots of expensive computers* for deep learning? 12 | 13 | No 14 | 15 | ## Do you need a *PhD* for deep learning?
16 | 17 | No 18 | 19 | ## Name (5/8) areas where deep learning is now the best in the world 20 | 21 | - **Natural Language Processing** (NLP): Question Answering, Document Summarization and Classification, etc. 22 | - **Computer Vision**: Satellite and drone imagery interpretation, face detection and recognition, image captioning, etc. 23 | - **Medicine**: Finding anomalies in medical images (ex: CT, X-ray, MRI), detecting features in tissue slides (pathology), diagnosing diabetic retinopathy, etc. 24 | - **Biology**: Folding and classifying proteins, genomics tasks, cell classification, etc. 25 | - **Image generation/enhancement**: colorizing images, improving image resolution (super-resolution), removing noise from images (denoising), converting images to art in the style of famous artists (style transfer), etc. 26 | - **Recommendation systems**: web search, product recommendations, etc. 27 | - **Playing games**: Super-human performance in Chess, Go, Atari games, etc. 28 | - **Robotics**: handling objects that are challenging to locate (e.g. transparent, shiny, or lacking texture) or hard to pick up 29 | - **Other applications**: financial and logistical forecasting; text to speech; much much more. 30 | 31 | ## What was the name of the first device that was based on the principle of the artificial neuron? 32 | 33 | The Mark I Perceptron, built by Frank Rosenblatt 34 | 35 | ## What are the (8) requirements for "Parallel Distributed Processing" (based on the book of the same name)? 36 | 37 | 1. A set of **processing units** 38 | 2. A **state of activation** 39 | 3. An **output function** for each unit 40 | 4. A **pattern of connectivity** among units 41 | 5. A **propagation rule** for propagating patterns of activities through the network of connectivities 42 | 6. An **activation rule** for combining the inputs impinging on a unit with the current state of that unit to produce a new level of activation for the unit 43 | 7.
A **learning rule** whereby patterns of connectivity are modified by experience 44 | 8. An **environment** within which the system must operate 45 | 46 | ## What was the $1^{st}$ theoretical misunderstanding that held back the field of neural networks? 47 | 48 | In 1969, Marvin Minsky and Seymour Papert demonstrated in their book, **Perceptrons**, that a single layer of artificial neurons **cannot learn simple, critical mathematical functions like the XOR logic gate**. While they subsequently demonstrated in the same book that additional layers can solve this problem, only the first insight was recognized, leading to the start of the first AI winter. 49 | 50 | ## What was the $2^{nd}$ theoretical misunderstanding that held back the field of neural networks? 51 | 52 | In the 1980’s, models with two layers were being explored. Theoretically, it is possible to approximate any mathematical function using two layers of artificial neurons. However, in practice, these networks were **too big and too slow**. While it was demonstrated that adding additional layers improved performance, this insight was not acknowledged, and the second AI winter began. In the past decade, with increased data availability and improvements in computer hardware (both in CPU performance and, more importantly, in GPU performance), neural networks are finally living up to their potential. 53 | 54 | ## What is a GPU? 55 | 56 | A **Graphics Processing Unit** (also known as a graphics card). 57 | 58 | Standard computers have various components like CPUs, RAM, etc. CPUs, or central processing units, are the core units of all standard computers, and they execute the instructions that make up computer programs. GPUs, on the other hand, are specialized units meant for displaying graphics, especially the 3D graphics in modern computer games. The hardware optimizations used in GPUs allow them to handle thousands of tasks at the same time.
Incidentally, these optimizations allow us to run and train neural networks hundreds of times faster than on a regular CPU. 59 | 60 | ## When you open a notebook and execute a cell containing `1+1`, what happens? 61 | 62 | The code is run by Python and the output is displayed underneath the code cell (in this case: 2). 63 | 64 | ## Why is it hard to use a traditional computer program to recognize images in a photo? 65 | 66 | Because cats, dogs, or other objects have a **wide variety of shapes, textures, colors, and other features**, and it is close to impossible to manually encode this in a traditional computer program. 67 | 68 | For us humans, it is easy to identify objects in a photo, such as telling cats from dogs. This is because our brains have subconsciously learned which features define a cat or a dog, for example. But it is hard to define rules for a traditional computer program to recognize a cat or a dog. Can you think of a universal rule to determine if a photo contains a cat or dog? How would you encode that as a computer program? 69 | 70 | ## What did Samuel mean by "weight assignment"? 71 | 72 | **"Weight assignment" refers to the current values of the model parameters**. Arthur Samuel further mentions an "*automatic means of testing the effectiveness of any current weight assignment*" and a "*mechanism for altering the weight assignment so as to maximize the performance*". This refers to the evaluation and training of the model in order to obtain a set of parameter values that maximizes model performance. 73 | 74 | ## What term do we normally use in deep learning for what Samuel called "weights"? 75 | 76 | - **Parameters** 77 | 78 | In deep learning, the term "weights" has a separate meaning. (The neural network has various parameters that we fit our data to.
As shown in upcoming chapters, the two *types* of neural network parameters are weights and biases.) 79 | 80 | ## Imagine a picture that summarizes Arthur Samuel's view of a machine learning model 81 | 82 | ![](img/01-ml-steps.png) 83 | 84 | ## Why is it hard to understand why a deep learning model makes a particular prediction? 85 | 86 | Because of its **deep nature**. 87 | 88 | Think of a linear regression model. Simply put, we have some input variables/data that are multiplied by some weights, giving us an output. We can understand which variables are more important and which are less important based on their weights. A similar logic might apply to a small neural network with 1-3 layers. However, deep neural networks have hundreds, if not thousands, of layers. It is hard to determine which factors are important in determining the final output. The neurons in the network interact with each other, with the outputs of some neurons feeding into other neurons. Altogether, **due to the complex nature of deep learning models**, it is very difficult to understand why a neural network makes a given prediction. 89 | 90 | However, in some cases, recent research has made it easier to better understand a neural network's predictions. For example, as shown in this chapter, we can analyze the sets of weights and determine what kind of features activate the neurons. When applying CNNs to images, we can also see which parts of the images highly activate the model. We will see how we can make our models interpretable later in the book. In fact, the interpretability of deep learning models is a highly researched topic. 91 | 92 | ## What is the name of the theorem stating that a neural network can solve any mathematical problem to any level of accuracy? 93 | 94 | The **universal approximation theorem** 95 | 96 | ## What does the universal approximation theorem state? 97 | 98 | It states that **neural networks can theoretically represent any mathematical function**.
However, it is important to realize that, due to the limits of available data and computer hardware, it is impossible in practice to train a model to do so, but we can get very close. 99 | 100 | ## What are the (5) things that you need in order to train a model? 101 | 102 | - **Architecture** for the given problem 103 | - **Data** to input to your model 104 | - **Labels** (for most use cases) 105 | - **Loss** function that will quantitatively measure the performance of your model 106 | - A way to **update the parameters** of the model in order to improve its performance (this is known as an optimizer) 107 | 108 | ## How could a feedback loop impact the rollout of a predictive policing model? 109 | 110 | In a predictive policing model, we might end up with a positive feedback loop, leading to a **highly biased model with little predictive power**. For example, we may want a model that predicts crimes, but we use information on arrests as a *proxy*. However, this data itself is slightly biased due to the biases in existing policing processes. Training with this data leads to a biased model. Law enforcement might use the model to determine where to focus police activity, increasing arrests in those areas. These additional arrests would be used in training future iterations of models, leading to an even more biased model. This cycle continues as a *positive feedback loop*. 111 | 112 | ## Do we always have to use 224x224 pixel images with the cat recognition model? 113 | 114 | **No**, we do not. 224x224 is commonly used for historical reasons. You can increase the size and get better performance, but at the price of speed and memory consumption. 115 | 116 | ## What is the difference between classification and regression? 117 | 118 | - Classification is focused on predicting a **class or category** (ex: type of pet). 119 | - Regression is focused on predicting a **numeric quantity** (ex: age of pet). 120 | 121 | ## What is a validation set?
Why do we need it? 122 | 123 | The **validation set is the portion of the dataset that is not used for training the model**, but for evaluating the model during training, in order to prevent overfitting. This ensures that the model performance is not due to "cheating" or memorization of the dataset, but rather because it learns the appropriate features to use for prediction. 124 | 125 | ## What is a test set? Why do we need it? 126 | 127 | It is possible that we overfit the validation data because the human modeler is also part of the training process, adjusting *hyperparameters* and training procedures according to the validation performance. Therefore, **another unseen portion of the dataset**, the test set, is **used for final evaluation of the model**. This splitting of the dataset is necessary to ensure that the model *generalizes* to *unseen* data. 128 | 129 | ## What will fastai do if you don’t provide a validation set? 130 | 131 | fastai will automatically **create a validation dataset**. It will randomly take 20% of the data and assign it as the validation set (`valid_pct=0.2`). 132 | 133 | ## Can we always use a random sample for a validation set? Why? 134 | 135 | **No**, because a good validation or test set should be **representative of new data you will see in the future**. Sometimes this isn’t true if a random sample is used. For example, for time series data, selecting sets randomly does not make sense. Instead, defining different time periods for the train, validation, and test sets is a better approach. 136 | 137 | ## What is overfitting? 138 | 139 | Overfitting refers to **when the model fits too closely to a limited set of data but does not generalize well to unseen data**. This is especially important when it comes to neural networks, because neural networks can potentially *memorize* the dataset that the model was trained on, and will then perform terribly on unseen data, for which no answers have been memorized.
This is why a proper validation framework, splitting the data into training, validation, and test sets, is needed. 140 | 141 | P.S.: Overfitting is the most challenging issue when it comes to training machine learning models. 142 | 143 | Here's a real-world example: 144 | 145 | Little Bobby loves cookies. His mother bakes chocolate chip cookies for him every Sunday. But the world is not ideal, and the cookies do not turn out equally tasty every Sunday. Some Sundays they taste better; some Sundays, they don’t taste as good. Being the curious little boy that he is, Bobby decides to find out when the cookies turn out to be tasty and when they don’t. 146 | 147 | The first observation that he makes is that the number of chocolate chips in a cookie varies from cookie to cookie, and that is pretty much the only directly observable thing that varies across cookies. 148 | 149 | Now Bobby starts taking notes every Sunday. 150 | After six Sundays, his notes look something like this: 151 | 152 | Sunday 1 – No. of Chocolate chips: 7; Tastiness: Awesome 153 | Sunday 2 – No. of Chocolate chips: 4; Tastiness: Good 154 | Sunday 3 – No. of Chocolate chips: 2; Tastiness: Bad 155 | Sunday 4 – No. of Chocolate chips: 5; Tastiness: Terrible 156 | Sunday 5 – No. of Chocolate chips: 3; Tastiness: Average 157 | Sunday 6 – No. of Chocolate chips: 6; Tastiness: Terrible 158 | 159 | This looks pretty straightforward: the more chocolate chips, the tastier the cookie, except that the notes from Sunday 4 and Sunday 6 seem to contradict this hypothesis. What little Bobby doesn’t know is that his mother forgot to put sugar in the cookies on Sunday 4 and Sunday 6. 160 | 161 | Since Bobby is an innocent little kid, he doesn’t know that the world is far from ideal and that things like randomness and noise are an integral part of it. He also doesn’t know that there are factors that are not directly observable, but that do affect the outcomes of our experiments.
So, he goes on to conclude that the tastiness of cookies increases with the number of chocolate chips when the number of chips is less than 5 or more than 6, but drops drastically when the number of chocolate chips is 5 or 6. 162 | 163 | He has come up with an overly complex, and not to mention incorrect, hypothesis to explain how the tastiness of cookies varies, because he tried to explain and justify his notes from every single Sunday. This is called overfitting: trying to explain/justify as many observations as possible by coming up with an overly complex (and possibly incorrect) hypothesis. 164 | 165 | If, instead, he had treated Sunday 4 and Sunday 6 as just noise, his hypothesis would have been simpler and relatively more correct. 166 | 167 | Sources: 168 | 169 | - [The example](https://qr.ae/pryKEh) 170 | - [The rest](https://forums.fast.ai/t/fastbook-chapter-1-questionnaire-solutions-wiki/65647) 171 | 172 | ## What is a metric? How does it differ from "loss"? 173 | 174 | - A *metric* is a **function that measures the quality of the model's predictions using the validation set**. 175 | - A *loss* is also a measure of the performance of the model. However, the **loss is meant for the optimization algorithm** (like SGD) to efficiently update the model parameters, while **metrics are human-interpretable measures of performance**. Sometimes, a metric may also be a good choice for the loss. 176 | 177 | ## How can pre-trained models help? 178 | 179 | Pre-trained models have been **trained on other problems that may be quite similar to the current task**. For example, pre-trained image recognition models were often trained on the ImageNet dataset, which has 1000 classes covering many different types of visual objects. Pre-trained models are useful because they have already learned how to handle a lot of simple features like edge and color detection. However, since the model was trained for a task different from the current one, it cannot be used as is.
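The "keep the trained body, fit a new head" idea behind pre-trained models can be illustrated without any deep learning library. This is only a toy sketch: the one-weight "body", the data, and all numbers below are invented for illustration, not fastai code.

```python
# Toy sketch of transfer learning: a "pre-trained" body is frozen and
# only a freshly initialized head is fitted to the new task.

def body(x, w_body):
    """Stands in for a pre-trained feature extractor (e.g. resnet layers)."""
    return max(0.0, w_body * x)  # ReLU-style activation with a learned weight

def head(feature, w_head):
    """The new task-specific output layer: the only part we train."""
    return w_head * feature

w_body = 2.0   # frozen: pretend it was learned on a large dataset
w_head = 0.0   # initialized from scratch for the new task

# Tiny "dataset" for the new task: the target happens to be y = 6 * x
data = [(1.0, 6.0), (2.0, 12.0), (3.0, 18.0)]

lr = 0.01
for _ in range(200):                  # a few epochs of plain SGD
    for x, y in data:
        feat = body(x, w_body)        # frozen body does the heavy lifting
        pred = head(feat, w_head)
        grad = 2 * (pred - y) * feat  # d(squared error)/d(w_head)
        w_head -= lr * grad           # only the head is updated

print(round(w_head, 2))  # converges to 3.0, since the frozen body doubles x
```

Real transfer learning does something analogous at scale: the body's learned weights are kept, and only a new head sized for your labels starts from scratch.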
180 | 181 | ## What is the *head* of a model? 182 | 183 | **This is the top of a network, the output layer**. It is a bit arbitrary. 184 | 185 | For instance, on the bottom (where data comes in) you take the convolution layers of some model, say the resnet that we use in this course. If you call `ConvLearner.pretrained`, `ConvnetBuilder` will build a network with a head appropriate to your data. If you are working on a classification problem, it will create a head with a cross-entropy loss, and if you are working on a regression problem, it will create a head suited to that. 186 | 187 | But you could build a model that has multiple heads. The model could take inputs from the base network (resnet convolutional layers) and feed the activations to some model, say `head1`, and then the same data to `head2`. Or you could have some number of shared layers built on top of resnet and only those layers feeding to `head1` and `head2`. You could even have different layers feed to different heads! 188 | 189 | This is the general picture, but there are some nuances to it. For instance, with regards to the fastai lib, `ConvnetBuilder` will add an `AdaptivePooling` layer on top of the base network if you don’t specify the `custom_head` argument, and if you do, it won’t. 190 | 191 | [Source](https://forums.fast.ai/t/terminology-question-head-of-neural-network/14819/3) 192 | 193 | ## What kinds of features do the **early** layers of a CNN find? 194 | 195 | Simple features like diagonal, horizontal, and vertical edges. 196 | 197 | ## What kinds of features do the **later** layers of a CNN find? 198 | 199 | More advanced features like car wheels, flower petals, and even outlines of animals. 200 | 201 | ## Are image models only useful for photos? 202 | 203 | **Nope! Image models can be useful for other types of images like sketches, medical data, etc.** Moreover, a lot of information can be represented *as images*.
For example, a sound can be converted into a spectrogram, which is a visual representation of the audio. Time series (ex: financial data) can be converted to an image by plotting it on a graph. Even better, there are various transformations that generate images from time series and have achieved good results for time series classification. There are many other examples, and by being creative, it may be possible to formulate your problem as an image classification problem and use pre-trained image models to obtain state-of-the-art results! 204 | 205 | ## What is a model's architecture? 206 | 207 | The **architecture is the functional form of the model**. Indeed, a model can be split into an architecture and parameters. The parameters are variables that define how the architecture operates. 208 | 209 | For example, $y=ax+b$ is an architecture with the parameters $a$ and $b$ that change the behavior of the function. 210 | 211 | [Source](https://nathanieldamours.github.io/blog/deep%20learning%20for%20coders/jupyter/2021/12/17/dl_for_coders_01.html#Architecture-and-Parameters) 212 | 213 | ## What is segmentation? 214 | 215 | At its core, segmentation is a **pixelwise classification** problem. We attempt to predict a label for every single pixel in the image. This provides a mask indicating which parts of the image correspond to the given label. 216 | 217 | ## What is `y_range` used for? When do we need it? 218 | 219 | `y_range` is used to **limit the values predicted** when our problem is focused on predicting a numeric **value in a given range** (ex: predicting movie ratings, range of 0.5-5). 220 | 221 | ## What are "hyperparameters"? 222 | 223 | They are **parameters that define *how* the model is trained**. For example, we could define how long we train for (the number of epochs) or how fast the model parameters are allowed to change (the learning rate). 224 | 225 | ## What’s the best way to avoid failures when using AI in an organization? 226 | 227 | 1.
Make sure a training, validation, and testing set is defined properly in order to evaluate the model in an appropriate manner. 228 | 2. Try out a simple baseline, which future models should hopefully beat. Or even this simple baseline may be enough in some cases. 229 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. 
Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 
62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 
102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 
133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. 
You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 
196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 
229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 
256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 
    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Use with the GNU Affero General Public License.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time.  Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.
  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

  If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:

    <program>  Copyright (C) <year>  <name of author>
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".

  You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<https://www.gnu.org/licenses/>.

  The GNU General Public License does not permit incorporating your program
into proprietary programs.  If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library.  If this is what you want to do, use the GNU Lesser General
Public License instead of this License.  But first, please read
<https://www.gnu.org/licenses/why-not-lgpl.html>.