├── .gitignore
├── run-lab.sh
├── img
│   ├── ML-map.png
│   ├── Bayes_Theorem.jpg
│   ├── id3_algorithm.png
│   ├── categorical_tree.dia
│   ├── categorical_tree.png
│   ├── cross_validation.dia
│   ├── cross_validation.png
│   ├── ML-map.dot
│   └── random_variable.svg
├── Pipfile
├── tools
│   ├── pd_helpers.py
│   ├── hw.csv
│   ├── stats.py
│   ├── venn.py
│   └── plots.py
├── Homework.ipynb
├── README.md
├── Lab09-Exercises.ipynb
├── Lab05-Exercises.ipynb
├── Lab12-Exercises.ipynb
├── Lab04-Exercises.ipynb
├── Lab13-Exercises.ipynb
├── Lab11-Exercises.ipynb
├── Lab14-Exercises.ipynb
├── Lab07-Exercises.ipynb
├── Lab01-Exercises.ipynb
├── Lab06-Exercises.ipynb
├── Lab02-Exercises.ipynb
├── Lab05.ipynb
├── Lab03-Exercises.ipynb
├── Lab06.ipynb
├── extras
│   └── Lab-EM-Exercises.ipynb
└── Lab10-Exercises.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | *-Solutions.ipynb
3 |
--------------------------------------------------------------------------------
/run-lab.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | python3 -m pipenv run jupyter-lab
3 |
--------------------------------------------------------------------------------
/img/ML-map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/ML-map.png
--------------------------------------------------------------------------------
/img/Bayes_Theorem.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/Bayes_Theorem.jpg
--------------------------------------------------------------------------------
/img/id3_algorithm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/id3_algorithm.png
--------------------------------------------------------------------------------
/img/categorical_tree.dia:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/categorical_tree.dia
--------------------------------------------------------------------------------
/img/categorical_tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/categorical_tree.png
--------------------------------------------------------------------------------
/img/cross_validation.dia:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/cross_validation.dia
--------------------------------------------------------------------------------
/img/cross_validation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/cross_validation.png
--------------------------------------------------------------------------------
/Pipfile:
--------------------------------------------------------------------------------
1 | [[source]]
2 | name = "pypi"
3 | url = "https://pypi.org/simple"
4 | verify_ssl = true
5 |
6 | [dev-packages]
7 |
8 | [packages]
9 | jupyterlab = "*"
10 | matplotlib = "*"
11 | matplotlib-venn = "*"
12 | pandas = "*"
13 | scikit-learn = "*"
14 |
15 | [requires]
16 | python_version = "3.12"
17 |
--------------------------------------------------------------------------------
/tools/pd_helpers.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 |
3 | def apply_counts(df: pd.DataFrame, count_col: str):
4 |     """ Denormalise a dataframe that has a count column by
5 |     repeating each row as many times as its count and
6 |     dropping count_col. """
7 | feats = [c for c in df.columns if c != count_col]
8 | return pd.concat([
9 | pd.DataFrame([list(r[feats])] * r[count_col], columns=feats)
10 | for i, r in df.iterrows()
11 | ], ignore_index=True)
--------------------------------------------------------------------------------
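A minimal usage sketch for `apply_counts` (not part of the repository; the grouped frame below is made up for illustration):

```python
import pandas as pd

from tools.pd_helpers import apply_counts

# Made-up grouped frame: the 'C' column says how many times each row occurs.
grouped = pd.DataFrame({'X1': [0, 1], 'X2': [1, 1], 'C': [2, 3]})

# Denormalise: each row is repeated according to its count and 'C' is dropped.
d = apply_counts(grouped, 'C')
print(d)  # 5 rows: (0, 1) twice and (1, 1) three times
```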
/tools/hw.csv:
--------------------------------------------------------------------------------
1 | Lab,Problems
2 | 1,"7, 14"
3 | 1,"2, 16"
4 | 1,"5, 8"
5 | 1,"3, 15"
6 | 1,"12, 13"
7 | 2,"3, 5"
8 | 2,"2, 4"
9 | 2,"1, 3"
10 | 3,"1, 4"
11 | 3,"2, 5"
12 | 3,"3, 4"
13 | 4,"1, 3"
14 | 4,"2, 3"
15 | 5,"1, 2"
16 | 6,"1, 3"
17 | 6,"2, 4"
18 | 6,"3, 5"
19 | 6,"4, 6"
20 | 7,1
21 | 7,2
22 | 7,3
23 | 7,4
24 | 7,5
25 | 9,"1, 2"
26 | 10,"1, 2"
27 | 10,"3, 4"
28 | 10,"5, 6"
29 | 11,"2, 4"
30 | 11,"1, 3"
31 | 11,"2, 3"
32 | 12,"1, 3"
33 | 12,"2, 3"
34 | 13,"1, 4"
35 | 13,"2, 5"
36 | 14,"4, 5"
37 | 14,"1, 3"
38 |
39 |
--------------------------------------------------------------------------------
/tools/stats.py:
--------------------------------------------------------------------------------
1 | from typing import Set, Any
2 | from dataclasses import dataclass
3 |
4 | def probability(A: Set[Any], omega: Set[Any]):
5 | """ Probability for a uniform distribution
6 | in a finite space"""
7 | return len(A) / len(omega)
8 |
9 |
10 | @dataclass(frozen=True)
11 | class WeightedOutcome:
12 | """ Class adding a weight to any outcome. """
13 | weight: float
14 |
15 |
16 | def probability_weighted(A: Set[WeightedOutcome],
17 | omega: Set[WeightedOutcome]):
18 |     """ Probability for a weighted (non-uniform) distribution
19 |     in a finite space. A and omega are sets of WeightedOutcome
20 |     instances, each carrying its own weight. """
21 | A_weight = sum((o.weight for o in A))
22 | omega_weight = sum((o.weight for o in omega))
23 | return A_weight / omega_weight
--------------------------------------------------------------------------------
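A minimal usage sketch for both helpers (not part of the repository; the die and the biased coin below are made up for illustration):

```python
from tools.stats import WeightedOutcome, probability, probability_weighted

# Uniform case: probability that a fair six-sided die shows an even number.
omega = {1, 2, 3, 4, 5, 6}
evens = {2, 4, 6}
print(probability(evens, omega))  # 0.5

# Weighted case: a biased coin where heads carries twice the weight of tails.
heads, tails = WeightedOutcome(weight=2.0), WeightedOutcome(weight=1.0)
print(probability_weighted({heads}, {heads, tails}))  # 2/3
```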
/tools/venn.py:
--------------------------------------------------------------------------------
1 | from matplotlib_venn import venn2, venn2_circles
2 | import matplotlib.pyplot as plt
3 |
4 | omega = set(['10', '11', '01', '00'])
5 | A=set(['10', '11'])
6 | B=set(['11', '01'])
7 |
8 | def plot_venn(highlights):
9 |     """ Plot a Venn diagram with two intersecting sets A and B
10 | and highlight any combination of A, B and omega. """
11 | to_hide = set(['10', '11', '01'])-set(highlights)
12 | figure = plt.figure(figsize=(4, 3))
13 | ax=plt.gca()
14 | ax.text(0.7, 0.5, r'$\Omega$', fontsize=16)
15 | if '00' in highlights:
16 | figure.patch.set_facecolor('grey')
17 | subsets={'10': 1, '01': 1, '11': 1}
18 | v = venn2(subsets, set_labels = ('A', 'B'), alpha=1)
19 | for p in to_hide:
20 | v.get_patch_by_id(p).set_color('w')
21 | v.get_patch_by_id(p).set_alpha(1)
22 | for p in {'10', '11', '01'}:
23 | v.get_label_by_id(p).set_text('')
24 | venn2_circles(subsets)
25 | plt.show()
--------------------------------------------------------------------------------
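A minimal usage sketch (not part of the repository); it reuses the `A`, `B` and `omega` sets defined in the module:

```python
from tools.venn import A, B, omega, plot_venn

plot_venn(A | B)            # highlight the union of A and B
plot_venn(A & B)            # highlight only the intersection
plot_venn(omega - (A | B))  # highlight the region outside both sets
```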
/Homework.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "044dc450-8772-43ec-92ef-8ad6ed454b13",
6 | "metadata": {},
7 | "source": [
8 | "# What is my homework?"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": null,
14 | "id": "67163e92-9054-4df2-afa9-5e00f9ce5171",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "from hashlib import sha1\n",
20 | "\n",
21 | "def what_is_my_homework(email: str, lab_no: int):\n",
22 | " \"\"\"\n",
23 | " What is my assigned homework for the given lab?\n",
24 | " \n",
25 | " Print a message that displays the assigned homework for a specific\n",
26 | " email and a specific lab.\n",
27 | " \n",
28 | " Parameters\n",
29 | " ----------\n",
30 | " email : string\n",
31 | " The email you provided at the beginning of the semester.\n",
32 | " lab_no : int\n",
33 | " The lab number for which to assign homework.\n",
34 | " \n",
35 | " Examples\n",
36 | " --------\n",
37 | " >>> what_is_my_homework('my_email@example.com', 1)\n",
38 | " Your homework for lab 1: 1, 7.\n",
39 | " \"\"\"\n",
40 | " df = pd.read_csv(\"tools/hw.csv\")\n",
41 | " if lab_no not in df['Lab'].unique():\n",
42 | " print(\"No homework assigned for this lab!\")\n",
43 | " return\n",
44 | " df_lab = df[df['Lab'] == lab_no]\n",
45 | " key = ('2024'+email+str(lab_no)).encode(\"utf-8\")\n",
46 | " idx = int(sha1(key).hexdigest(), 16) % 2**10\n",
47 | " hw = df_lab.sample(n=1, random_state=idx).iloc[0, 1]\n",
48 | " print(f\"Your homework for lab {lab_no}: {hw}.\")\n",
49 | "\n",
50 | "what_is_my_homework('my_email@example.com', 1)"
51 | ]
52 | }
53 | ],
54 | "metadata": {
55 | "kernelspec": {
56 | "display_name": "Python 3 (ipykernel)",
57 | "language": "python",
58 | "name": "python3"
59 | },
60 | "language_info": {
61 | "codemirror_mode": {
62 | "name": "ipython",
63 | "version": 3
64 | },
65 | "file_extension": ".py",
66 | "mimetype": "text/x-python",
67 | "name": "python",
68 | "nbconvert_exporter": "python",
69 | "pygments_lexer": "ipython3",
70 | "version": "3.12.3"
71 | }
72 | },
73 | "nbformat": 4,
74 | "nbformat_minor": 5
75 | }
76 |
--------------------------------------------------------------------------------
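A minimal sketch (not part of the repository) of the deterministic index computed by `what_is_my_homework` above, assuming the same `'2024' + email + lab` key scheme; the email is a placeholder:

```python
from hashlib import sha1

email, lab_no = 'my_email@example.com', 1  # placeholder values
key = ('2024' + email + str(lab_no)).encode('utf-8')
idx = int(sha1(key).hexdigest(), 16) % 2**10
print(idx)  # the same email and lab number always yield the same index in [0, 1023]
```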
/README.md:
--------------------------------------------------------------------------------
1 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/spantiru/companion-lab/master)
2 |
3 | # JupyterLab for the Machine Learning seminar
4 |
5 | Ștefan Panțiru, Faculty of Computer Science, "Alexandru Ioan Cuza" University Iași
6 |
7 | To run this lab, either click the `launch binder` button above, or run the code on your machine following the instructions below.
8 |
9 | ## Contents
10 | * [Lab01 - Elementary Notions in Probability and Statistics](Lab01.ipynb)
11 | * [Lab02 - Decision Trees (part1)](Lab02.ipynb)
12 | * [Lab03 - Decision Trees (part2)](Lab03.ipynb)
13 | * [Lab04 - Decision Trees (part3)](Lab04.ipynb)
14 | * [Lab05 - Naive Bayes (part1)](Lab05.ipynb)
15 | * [Lab06 - Naive Bayes (part2)](Lab06.ipynb)
16 | * [Lab07 - Maximum Likelihood Estimation](Lab07.ipynb)
17 | * Week 8 - Midterm Exam
18 | * [Lab09 - Logistic Regression](Lab09.ipynb)
19 | * [Lab10 - k-Nearest Neighbour](Lab10.ipynb)
20 | * [Lab11 - AdaBoost](Lab11.ipynb)
21 | * [Lab12 - Hierarchical Clustering](Lab12.ipynb)
22 | * [Lab13 - k-Means (part1)](Lab13.ipynb)
23 | * [Lab14 - k-Means (part2)](Lab14.ipynb)
24 |
25 | ## Useful resources:
26 | * [ML Homework](https://forms.gle/EmPiJqABNR7MZWgf8)
27 | * [Register for Piazza](https://forms.gle/KQP3pwRQxvqxVvLz6)
28 | * [Resources uploaded to Piazza](https://piazza.com/info.uaic.ro/spring2024/ml2024f/resources)
29 | * [Python official documentation](https://docs.python.org/3.12/library/index.html)
30 | * [Scikit-learn library](https://scikit-learn.org/stable/getting_started.html) - machine learning library for Python
31 | * [Scipy statistical functions](https://docs.scipy.org/doc/scipy/reference/stats.html)
32 | * [Pandas library](https://pandas.pydata.org/docs/reference/index.html) - library providing data structures and analysis tools for Python
33 |
34 | ## Running locally
35 |
36 | This lab uses Python 3.12 and `pipenv`, so make sure they are available on your system.
37 |
38 | ```bash
39 | $ python3 --version
40 | Python 3.12.3
41 | $ python3 -m pipenv --version
42 | pipenv, version 2023.12.1
43 | ```
44 |
45 | Clone the repository using git:
46 |
47 | ```bash
48 | $ git clone https://github.com/spantiru/companion-lab.git
49 | ```
50 |
51 | Inside the project folder, create the pipenv environment:
52 |
53 | ```bash
54 | $ cd companion-lab
55 | $ python3 -m pipenv install # Might take a few seconds
56 | ```
57 |
58 | Run `jupyter-lab`, which should start in your default browser:
59 |
60 | ```bash
61 | $ python3 -m pipenv run jupyter-lab
62 | ```
63 |
--------------------------------------------------------------------------------
/img/ML-map.dot:
--------------------------------------------------------------------------------
1 | digraph G {
2 | rankdir=LR; // Left to right layout
3 | splines=false; // Make arrows straight lines
4 |
5 | // Global node styling for a professional look
6 | node [
7 | shape=box,
8 | style=filled,
9 | fillcolor=lightblue,
10 | color=black, // Black border
11 | fontname="Helvetica",
12 | fontsize=10,
13 | penwidth=0.8
14 | ];
15 |
16 | // Global edge styling
17 | edge [color=gray, arrowsize=0.8, penwidth=0.8];
18 |
19 | // Define the structure
20 | "Machine Learning" -> "Supervised Learning";
21 | "Machine Learning" -> "Unsupervised Learning";
22 |
23 | // Linear and Non-linear split under Supervised Learning
24 | "Supervised Learning" -> "Linear";
25 | "Supervised Learning" -> "Non-Linear";
26 |
27 | // Linear Supervised Learning Algorithms
28 | "Linear" -> "Linear Regression";
29 | "Linear" -> "Logistic Regression";
30 | "Linear" -> "Naive Bayes";
31 | "Linear" -> "Support Vector Machines (SVM)";
32 |
33 | // Highlighted Logistic Regression
34 | "Logistic Regression" [fontcolor=red];
35 |
36 | // Highlighted Naive Bayes
37 | "Naive Bayes" [fontcolor=red];
38 |
39 | // Non-Linear Supervised Learning Algorithms
40 | "Non-Linear" -> "Decision Trees";
41 | "Non-Linear" -> "k-Nearest Neighbors (k-NN)";
42 | "Non-Linear" -> "Neural Networks";
43 | "Non-Linear" -> "AdaBoost";
44 | "Non-Linear" -> "ARIMA"; // Time series forecasting
45 | "Non-Linear" -> "XGBoost";
46 |
47 | // Highlighted Decision Trees
48 | "Decision Trees" [fontcolor=red];
49 |
50 | // Highlighted k-NN
51 | "k-Nearest Neighbors (k-NN)" [fontcolor=red];
52 |
53 | // Highlighted AdaBoost
54 | "AdaBoost" [fontcolor=red];
55 |
56 | // Unsupervised Learning split into Clustering and Dimensionality Reduction
57 | "Unsupervised Learning" -> "Clustering";
58 | "Unsupervised Learning" -> "Dimensionality Reduction";
59 |
60 | // Clustering Algorithms
61 | "Clustering" -> "K-means";
62 | "Clustering" -> "Hierarchical Clustering";
63 | "Clustering" -> "Gaussian Mixture Models (GMM)";
64 |
65 | // Highlighted K-means
66 | "K-means" [fontcolor=red];
67 |
68 | // Highlighted Hierarchical Clustering
69 | "Hierarchical Clustering" [fontcolor=red];
70 |
71 | // Dimensionality Reduction Algorithms
72 | "Dimensionality Reduction" -> "Principal Component Analysis (PCA)";
73 | "Dimensionality Reduction" -> "t-SNE";
74 | }
75 |
76 |
--------------------------------------------------------------------------------
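A hedged sketch for regenerating `img/ML-map.png` from this DOT source, assuming Graphviz's `dot` executable is installed (it is not listed in the Pipfile):

```python
import subprocess

# Render the diagram with Graphviz's dot layout engine.
subprocess.run(['dot', '-Tpng', 'img/ML-map.dot', '-o', 'img/ML-map.png'], check=True)
```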
/Lab09-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "c6b704f5-a47c-4124-8be4-a27e13ca3503",
6 | "metadata": {},
7 | "source": [
8 | "# Logistic Regression\n",
9 | "\n",
10 | "## Exercise 1\n",
11 | "\n",
12 | "For the dataset below:\n",
13 | "\n",
14 | "1. Plot the decision surface of the Logistic Regression algorithm.\n",
15 | "2. Calculate the CVLOO error for Logistic Regression.\n",
16 | "3. Plot the decision surface of the ID3 algorithm (with entropy and no pruning).\n",
17 | "4. Calculate the CVLOO error for ID3."
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 1,
23 | "id": "d0a7eabd-0aa1-4809-b58d-00de769ca895",
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "from sklearn.datasets import make_moons\n",
28 | "\n",
29 | "X, y = make_moons(n_samples=200, noise=0.2, random_state=42)"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "id": "415d3052-6ae3-47c3-aa6e-8b7fec9a901c",
35 | "metadata": {},
36 | "source": [
37 | "## Exercise 2\n",
38 | "\n",
39 | "Given the dataset below, implement the gradient ascent formula from the lab. Starting from an initial $w=(0, 0, 0)$, apply 10 gradient ascent steps with $\\eta = 0.01$. What are the values of $w$ after the 10 steps? \n",
40 | "\n",
41 | "_Note: The component $x_0 = 1$ was already added to the dataset, so $w$ and $X$ have the same number of dimensions._"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 2,
47 | "id": "20b07227-d6a6-4c6c-8253-1ea8ded7b496",
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "from sklearn.datasets import make_blobs\n",
52 | "import numpy as np\n",
53 | "\n",
54 | "X, y = make_blobs(n_samples=200, cluster_std=3, centers=2, random_state=42)\n",
55 | "\n",
56 | "def add_intercept(X):\n",
57 | " \"\"\"Add 1 as the first column of X\"\"\"\n",
58 | " return np.hstack((np.ones((len(X), 1)), X))\n",
59 | "\n",
60 | "X = add_intercept(X)"
61 | ]
62 | }
63 | ],
64 | "metadata": {
65 | "kernelspec": {
66 | "display_name": "Python 3 (ipykernel)",
67 | "language": "python",
68 | "name": "python3"
69 | },
70 | "language_info": {
71 | "codemirror_mode": {
72 | "name": "ipython",
73 | "version": 3
74 | },
75 | "file_extension": ".py",
76 | "mimetype": "text/x-python",
77 | "name": "python",
78 | "nbconvert_exporter": "python",
79 | "pygments_lexer": "ipython3",
80 | "version": "3.12.3"
81 | }
82 | },
83 | "nbformat": 4,
84 | "nbformat_minor": 5
85 | }
86 |
--------------------------------------------------------------------------------
/Lab05-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "a90491ec-1f9f-479a-95b9-fd697241ae99",
6 | "metadata": {},
7 | "source": [
8 | "# Naive Bayes"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "5c43b866-e21d-4112-bb13-d7212ccefc62",
14 | "metadata": {},
15 | "source": [
16 | "## Exercise 1\n",
17 | "\n",
18 | "Given the following dataset, with input attributes $A$, $B$, and $C$ and target attribute $Y$, predict the entry $A=0, B=0, C=1$ using `BernoulliNB(alpha=1e-10)` and `predict_proba()`, then manually calculate the probabilities using the formulas."
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 3,
24 | "id": "d44a550d-e8a9-4df8-9c74-6fdaa470e9f6",
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "import pandas as pd\n",
29 | "d = pd.DataFrame({'A': [0, 0, 1, 0, 1, 1, 1],\n",
30 | " 'B': [0, 1, 1, 0, 1, 0, 1],\n",
31 | " 'C': [1, 0, 0, 1, 1, 0, 0],\n",
32 | " 'Y': [0, 0, 0, 1, 1, 1, 1]})"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "id": "3f18f1dd-f66d-45c9-abb9-5f80a4c95ddf",
38 | "metadata": {},
39 | "source": [
40 | "## Exercise 2\n",
41 | "\n",
42 | "Consider two random variables $X_1$ and $X_2$ and a label $Y$ assigned to each instance as in the dataset `d` created below.\n",
43 | "\n",
44 | "1. Classify the instance $X_1=0,X_2=0$ using Naive Bayes.\n",
45 | "\n",
46 | "1. According to Naive Bayes, what is the probability of this classification?\n",
47 | "\n",
48 | "1. How many probabilities are estimated by the model (check the `class_log_prior_` and `feature_log_prob_` attributes)?\n",
49 | "\n",
50 | "1. How many probabilities would be estimated by the model if there were $n$ features instead of 2?"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 4,
56 | "id": "282a321a-de22-41ec-85c7-12833b244a65",
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "import pandas as pd\n",
61 | "from tools.pd_helpers import apply_counts\n",
62 | "\n",
63 | "d_grouped = pd.DataFrame({\n",
64 | " 'X1': [0, 0, 1, 1, 0, 0, 1, 1],\n",
65 | " 'X2': [0, 0, 0, 0, 1, 1, 1, 1],\n",
66 | " 'C' : [2, 18, 4, 1, 4, 1, 2, 18],\n",
67 | " 'Y' : [0, 1, 0, 1, 0, 1, 0, 1]})\n",
68 | "d = apply_counts(d_grouped, 'C')"
69 | ]
70 | }
71 | ],
72 | "metadata": {
73 | "kernelspec": {
74 | "display_name": "Python 3 (ipykernel)",
75 | "language": "python",
76 | "name": "python3"
77 | },
78 | "language_info": {
79 | "codemirror_mode": {
80 | "name": "ipython",
81 | "version": 3
82 | },
83 | "file_extension": ".py",
84 | "mimetype": "text/x-python",
85 | "name": "python",
86 | "nbconvert_exporter": "python",
87 | "pygments_lexer": "ipython3",
88 | "version": "3.12.3"
89 | }
90 | },
91 | "nbformat": 4,
92 | "nbformat_minor": 5
93 | }
94 |
--------------------------------------------------------------------------------
/Lab12-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Exercise 1\n",
8 | "Agglomerative clustering on a 2d dataset\n",
9 | "\n",
10 | "Considering the points (-4, -2), (-3, -2), (-2, -2), (-1, -2), (1, -1), (1, 1), (2, 3), (3, 2), (3, 4), (4, 3):\n",
11 | "1. create a scatter plot using `pyplot`;\n",
12 | "1. create the dendrogram using `AgglomerativeClustering` with single-linkage and then color the scatter plot using the best 4 clusters;\n",
13 | "1. create the dendrogram using `AgglomerativeClustering` with complete-linkage and then color the scatter plot using the best 4 clusters;\n",
14 | "1. what is the difference in behaviour between the two types of linkage? What shapes do they tend to give to the clusters?"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "# Exercise 2\n",
22 | "\"Natural\" clusters\n",
23 | "\n",
24 | "Given the dataset {0, 4, 5, 20, 25, 39, 43, 44}:\n",
25 | "1. find the natural clusters using agglomerative clustering with single-linkage and plot the clusters using a scatter plot;\n",
26 | "1. find the natural clusters using agglomerative clustering with average-linkage and plot the clusters using a scatter plot."
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "# Exercise 3\n",
34 | "\n",
35 | "For each of the following two datasets `d1` and `d2`:\n",
36 | "1. plot the points using `pyplot` and highlight (by using different colours for the points) the 2 clusters found by agglomerative clustering using single linkage;\n",
37 | "1. plot the points using `pyplot` and highlight (by using different colours for the points) the 2 clusters found by agglomerative clustering using average linkage."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 1,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "from sklearn import datasets as ds\n",
47 | "import numpy as np\n",
48 | "import pandas as pd\n",
49 | "\n",
50 | "np.random.seed(0)\n",
51 | "X1, _ = ds.make_circles(n_samples=1500, factor=.5, noise=.05)\n",
52 | "X2, _ = ds.make_blobs(n_samples=1500,\n",
53 | " cluster_std=[1.0, 2.5, 0.5],\n",
54 | " random_state=170)\n",
55 | "\n",
56 | "d1 = pd.DataFrame(X1, columns=['X1', 'X2'])\n",
57 | "d2 = pd.DataFrame(X2, columns=['X1', 'X2'])"
58 | ]
59 | }
60 | ],
61 | "metadata": {
62 | "kernelspec": {
63 | "display_name": "Python 3 (ipykernel)",
64 | "language": "python",
65 | "name": "python3"
66 | },
67 | "language_info": {
68 | "codemirror_mode": {
69 | "name": "ipython",
70 | "version": 3
71 | },
72 | "file_extension": ".py",
73 | "mimetype": "text/x-python",
74 | "name": "python",
75 | "nbconvert_exporter": "python",
76 | "pygments_lexer": "ipython3",
77 | "version": "3.12.3"
78 | }
79 | },
80 | "nbformat": 4,
81 | "nbformat_minor": 4
82 | }
83 |
--------------------------------------------------------------------------------
/Lab04-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Exercise 1\n",
8 | "\n",
9 | "Given the following dataset with two input random variables $X_1$ and $X_2$ and a target variable $Y$, we want to compare two extreme decision tree algorithms:\n",
10 | "\n",
11 | "* OVERFIT will build a full standard ID3 decision tree, with no pruning;\n",
12 | "* UNDERFIT will make no splits at all, always having a single node (which acts as both root and decision node).\n",
13 | "\n",
14 | "1. Plot the full OVERFIT tree.\n",
15 | "1. What is the CVLOO error for OVERFIT?\n",
16 | "1. What is the CVLOO error for UNDERFIT?"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "import pandas as pd\n",
26 | "d = pd.DataFrame({'X1': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8],\n",
27 | " 'X2': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],\n",
28 | " 'Y' : [0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]})"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "## Exercise 2\n",
36 | "\n",
37 | "Suppose we learned a decision tree from a training set with binary output values (either 0 or 1). We find that for a leaf node $l$, \n",
38 | "\n",
39 | "* there are $M$ training examples falling into it (labeled either 0 or 1); \n",
40 | "* its entropy is $H$. \n",
41 | "\n",
42 | "1. Create a graph using `matplotlib` that shows the entropy $H$ as a function of the proportion of 1s in $M$. The proportion should be on the $x$ axis (from 0 to 1), while the entropy should be on the $y$ axis.\n",
43 | "1. Create a simple algorithm which takes as input $M$ and $H$ and that outputs the number of training examples misclassified by leaf node $l$.\n"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "## Exercise 3\n",
51 | "\n",
52 | "Given the dataset below:\n",
53 | "1. plot the points and the labels using `matplotlib.pyplot.scatter`;\n",
54 | "1. train a regular decision tree, then plot its decision surface;\n",
55 | "1. create a new dataset with 1000 random points with coordinates between 0 and 10, which the diagonal line $X1 = X2$ perfectly separates into two classes. See [numpy.random.random_sample](https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html#numpy.random.random_sample) for easily generating random numbers between 0 and 1.\n",
56 | "1. train a regular decision tree, then plot its decision surface on the new dataset."
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 2,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "import pandas as pd\n",
66 | "d = pd.DataFrame({'X1': [1, 2, 3, 3, 3, 4, 5, 5, 5],\n",
67 | " 'X2': [2, 3, 1, 2, 4, 4, 1, 2, 4],\n",
68 | " 'Y': [1, 1, 0, 0, 0, 0, 1, 1, 0]})"
69 | ]
70 | }
71 | ],
72 | "metadata": {
73 | "kernelspec": {
74 | "display_name": "Python 3 (ipykernel)",
75 | "language": "python",
76 | "name": "python3"
77 | },
78 | "language_info": {
79 | "codemirror_mode": {
80 | "name": "ipython",
81 | "version": 3
82 | },
83 | "file_extension": ".py",
84 | "mimetype": "text/x-python",
85 | "name": "python",
86 | "nbconvert_exporter": "python",
87 | "pygments_lexer": "ipython3",
88 | "version": "3.12.3"
89 | }
90 | },
91 | "nbformat": 4,
92 | "nbformat_minor": 4
93 | }
94 |
--------------------------------------------------------------------------------
/tools/plots.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from matplotlib.colors import ListedColormap
4 | from scipy.spatial import Voronoi, voronoi_plot_2d
5 | import matplotlib.pyplot as plt
6 | from scipy.cluster.hierarchy import dendrogram
7 | import matplotlib as mpl
8 |
9 | def add_ellipses(gmm, ax, colors):
10 | """ Draw 2d ellipses, on a given axis, corresponding to the
11 | covariances of the GMM. """
12 | for n, color in enumerate(colors):
13 | if gmm.covariance_type == 'full':
14 | covariances = gmm.covariances_[n][:2, :2]
15 | elif gmm.covariance_type == 'tied':
16 | covariances = gmm.covariances_[:2, :2]
17 | elif gmm.covariance_type == 'diag':
18 | covariances = np.diag(gmm.covariances_[n][:2])
19 | elif gmm.covariance_type == 'spherical':
20 | covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
21 | v, w = np.linalg.eigh(covariances)
22 | u = w[0] / np.linalg.norm(w[0])
23 | angle = np.arctan2(u[1], u[0])
24 | angle = 180 * angle / np.pi # convert to degrees
25 | v = 2. * np.sqrt(2.) * np.sqrt(v)
26 | ell = mpl.patches.Ellipse(gmm.means_[n, :2], v[0], v[1],
27 | angle=180 + angle, color=color)
28 | ell.set_clip_box(ax.bbox)
29 | ell.set_alpha(0.5)
30 | ax.add_artist(ell)
31 | ax.set_aspect('equal', 'datalim')
32 |
33 | def plot_dendrogram(model, **kwargs):
34 | # Create linkage matrix and then plot the dendrogram
35 |
36 | # create the counts of samples under each node
37 | counts = np.zeros(model.children_.shape[0])
38 | n_samples = len(model.labels_)
39 | for i, merge in enumerate(model.children_):
40 | current_count = 0
41 | for child_idx in merge:
42 | if child_idx < n_samples:
43 | current_count += 1 # leaf node
44 | else:
45 | current_count += counts[child_idx - n_samples]
46 | counts[i] = current_count
47 |
48 | linkage_matrix = np.column_stack([model.children_, model.distances_,
49 | counts]).astype(float)
50 |
51 | # Plot the corresponding dendrogram
52 | dendrogram(linkage_matrix, **kwargs)
53 |
54 | def plot_decision_surface(clas, X, Y):
55 | """Plot a decision surface for 2 classes. """
56 | # step size in the mesh
57 | h = .02
58 | # Create color maps
59 | cmap_light = ListedColormap(['lightgreen', 'lightcoral'])
60 | cmap_bold = ListedColormap(['green','red'])
61 | # Plot the decision boundary. For that, we will assign a color to each
62 | # point in the mesh [x_min, x_max]x[y_min, y_max].
63 | x_min, x_max = X['X1'].min() - 1, X['X1'].max() + 1
64 | y_min, y_max = X['X2'].min() - 1, X['X2'].max() + 1
65 | xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
66 | np.arange(y_min, y_max, h))
67 | Z = clas.predict(pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['X1', 'X2']))
68 |
69 | # Put the result into a color plot
70 | Z = Z.reshape(xx.shape)
71 | fig, ax = plt.subplots(figsize=(6, 6))
72 | plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
73 |
74 | # Plot also the training points
75 | plt.scatter(X['X1'], X['X2'], c=Y, cmap=cmap_bold, s=20)
76 | plt.xlabel('X1')
77 | plt.ylabel('X2')
78 | plt.xlim(xx.min(), xx.max())
79 | plt.ylim(yy.min(), yy.max())
80 | plt.title("Classification")
81 | plt.show()
82 |
83 | def plot_decision_surface_knn(knn, X, Y, voronoi=False):
84 | """Plot a decision surface for 2 classes, optionally
85 | overlaying the voronoi diagram. """
86 | # step size in the mesh
87 | h = .02
88 | # Create color maps
89 | cmap_light = ListedColormap(['lightgreen', 'lightcoral'])
90 | cmap_bold = ListedColormap(['green','red'])
91 | # Plot the decision boundary. For that, we will assign a color to each
92 | # point in the mesh [x_min, x_max]x[y_min, y_max].
93 | x_min, x_max = X['X1'].min() - 1, X['X1'].max() + 1
94 | y_min, y_max = X['X2'].min() - 1, X['X2'].max() + 1
95 | xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
96 | np.arange(y_min, y_max, h))
97 | Z = knn.predict(pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['X1', 'X2']))
98 |
99 | # Put the result into a color plot
100 | Z = Z.reshape(xx.shape)
101 | fig, ax = plt.subplots(figsize=(6, 6))
102 | plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
103 |
104 | if voronoi:
105 | vor = Voronoi(X)
106 | voronoi_plot_2d(vor, show_points=False, ax=ax)
107 | # Plot also the training points
108 | plt.scatter(X['X1'], X['X2'], c=Y, cmap=cmap_bold, s=20)
109 | plt.xlim(xx.min(), xx.max())
110 | plt.ylim(yy.min(), yy.max())
111 | plt.title("k-NN Classification")
112 | plt.show()
--------------------------------------------------------------------------------
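A minimal usage sketch for `plot_decision_surface` (not part of the repository; the toy dataset and classifier below are made up for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

from tools.plots import plot_decision_surface

# Made-up toy dataset using the 'X1'/'X2' column names the helper expects.
X = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6],
                  'X2': [1, 2, 1, 2, 1, 2]})
Y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier().fit(X, Y)
plot_decision_surface(clf, X, Y)  # shaded class regions plus the training points
```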
/Lab13-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# k-Means\n",
8 | "\n",
9 | "## Exercise 1\n",
10 | "k-Means on two-dimensional data with cluster separators\n",
11 | "\n",
12 | "For the dataset below, plot the clusters and the centroids of the k-means algorithm for each iteration until convergence. The initial centroids will be the points A, D and G (therefore the algorithm will find 3 clusters). Include in each plot the Voronoi diagram for the centroids, to highlight the cluster separation."
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 1,
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "import pandas as pd\n",
22 | "data = {\n",
23 | " 'A':[2, 10], 'B':[2, 5], 'C':[8, 4], 'D':[5, 8], \n",
24 | " 'E':[7, 5], 'F':[6, 4], 'G':[1, 2], 'H':[4, 9]\n",
25 | "}\n",
26 | "d = pd.DataFrame.from_dict(data, orient='index', columns=['X', 'Y'])"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "## Exercise 2\n",
34 | "k-Means on two-dimensional data\n",
35 | "\n",
36 | "For the dataset below and the initial centroids A, D and G, independently implement the k-Means algorithm (i.e. do not use the one from `sklearn`) and plot the clusters and centroids for each iteration."
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 2,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "import pandas as pd\n",
46 | "data = {\n",
47 | " 'A':[2, 10], 'B':[2, 5], 'C':[8, 4], 'D':[5, 8], \n",
48 | " 'E':[7, 5], 'F':[6, 4], 'G':[1, 2], 'H':[4, 9]\n",
49 | "}\n",
50 | "d = pd.DataFrame.from_dict(data, orient='index', columns=['X', 'Y'])"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "## Exercise 3\n",
58 | "k-Means on an external dataset with starting centroids\n",
59 | "\n",
60 | "Apply k-means on this [two-dimensional dataset](https://profs.info.uaic.ro/~ciortuz/ML.ex-book/res/CMU.2004f.TM+AM.HW3.pr5.cl.dat) using these [starting centroids](https://profs.info.uaic.ro/~ciortuz/ML.ex-book/res/CMU.2004f.TM+AM.HW3.pr5.init.dat). Plot the clusters and centroids after each iteration until convergence. What is unusual about the first iteration?"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "## Exercise 4\n",
68 | "Anisotropically distributed data\n",
69 | "\n",
70 | "Run the k-means algorithm for the datasets `d1` and `d2` with $k=3$ and the default parameters. \n",
71 | "1. Plot the resulting clusters.\n",
72 | "1. Which clusters look more 'natural' and why?"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 3,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "from sklearn.datasets import make_blobs\n",
82 | "import numpy as np\n",
83 | "\n",
84 | "n_samples = 1500\n",
85 | "random_state = 170\n",
86 | "X, y = make_blobs(n_samples=n_samples, random_state=random_state)\n",
87 | "# Anisotropically distributed data\n",
88 | "transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]\n",
89 | "anis = np.dot(X, transformation)\n",
90 | "# Compare these datasets\n",
91 | "d1, d2 = X, anis"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "## Exercise 5\n",
99 | "k-means and noise\n",
100 | "\n",
101 | "Consider the dataset `d` below and two sets of starting centroids `c1` and `c2`.\n",
102 | "\n",
103 | "1. Run k-means ($k=3$ and the default parameters), first starting with `c1` and then starting with `c2`. (You might want to also use `n_init=1` to prevent a warning.)\n",
104 | "1. Plot the resulting clusters for each of the two runs.\n",
105 | "1. In which of the two runs do the clusters look more 'natural', and why?"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 4,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "from sklearn.datasets import make_blobs\n",
115 | "import numpy as np\n",
116 | "\n",
117 | "n_samples = 1500\n",
118 | "random_state = 110\n",
119 | "d, _ = make_blobs(n_samples=n_samples, random_state=random_state)\n",
120 | "# Dataset\n",
121 | "d = np.append(d, [[-10, 15]], axis=0)\n",
122 | "# Starting centroids\n",
123 | "c1 = np.array([[-6, 2], [-10, 15], [3, 3]])\n",
124 | "c2 = np.array([[-10, 3], [-2, 2], [3, 3]])"
125 | ]
126 | }
127 | ],
128 | "metadata": {
129 | "kernelspec": {
130 | "display_name": "Python 3 (ipykernel)",
131 | "language": "python",
132 | "name": "python3"
133 | },
134 | "language_info": {
135 | "codemirror_mode": {
136 | "name": "ipython",
137 | "version": 3
138 | },
139 | "file_extension": ".py",
140 | "mimetype": "text/x-python",
141 | "name": "python",
142 | "nbconvert_exporter": "python",
143 | "pygments_lexer": "ipython3",
144 | "version": "3.12.3"
145 | }
146 | },
147 | "nbformat": 4,
148 | "nbformat_minor": 4
149 | }
150 |
--------------------------------------------------------------------------------
/Lab11-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Exercise 1\n",
8 | "AdaBoost on a uni-dimensional array\n",
9 | "\n",
10 | "Given the dataset below and the AdaBoost algorithm using the usual decision stumps as weak learners:\n",
11 | "\n",
12 | "1. Plot the dataset using `pyplot`.\n",
13 | "2. Draw the decision surface corresponding to the first weak learner.\n",
14 | "3. What are the values of $\\epsilon_1$ (training error of the first decision stump) and $\\alpha_1$ (the \"weight\" of the vote of the first decision stump)?\n",
15 | "4. What will be the updated weights of the training instances, after the first update?\n",
16 | "5. Draw the decision surface after adding the second weak learner."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "import pandas as pd\n",
26 | "d = pd.DataFrame({\n",
27 | " 'X': [-1, -0.7, -0.4, -0.1, 0.2, 0.5, 0.8],\n",
28 | " 'Y': [1, 1, 1, -1, -1, -1, 1]\n",
29 | "})\n",
30 | "X, Y = d[['X']], d['Y']"
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "# Exercise 2\n",
38 | "AdaBoost on a two-dimensional array\n",
39 | "\n",
40 | "Given the dataset below and the AdaBoost algorithm using the usual decision stumps as weak learners:\n",
41 | "1. Plot the dataset using `pyplot`.\n",
42 | "2. Draw the decision surface corresponding to the first weak learner as chosen by `AdaBoostClassifier` with the default `base_estimator`.\n",
43 | "3. Show why AdaBoost chose that learner, by plotting the decision surface of all the candidates and their corresponding error rate."
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 2,
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "import pandas as pd\n",
53 | "d = pd.DataFrame({\n",
54 | " 'X1': [1, 2, 2.75, 3.25, 4, 5],\n",
55 | " 'X2': [1, 2, 1.25, 2.75, 2.25, 3.5],\n",
56 | " 'Y': [1, 1, -1, 1, -1, -1]\n",
57 | "})\n",
58 | "X, Y = d[['X1', 'X2']], d['Y']"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "# Exercise 3\n",
66 | "AdaBoost vs ID3\n",
67 | "\n",
68 | "Given the dataset below:\n",
69 | "1. Plot the dataset using `pyplot`.\n",
70 | "2. Compare the training error of the AdaBoost algorithm (using the usual decision stumps as weak learners) and the ID3 algorithm.\n",
71 | "3. Compare the CVLOO error of the AdaBoost algorithm (using the usual decision stumps as weak learners) and the ID3 algorithm."
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 3,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "from scipy.stats import norm\n",
81 | "import pandas as pd\n",
82 | "import numpy as np\n",
83 | "x_red = norm.rvs(0, 1, 100, random_state=1)\n",
84 | "y_red = norm.rvs(0, 1, 100, random_state=2)\n",
85 | "x_green = norm.rvs(1, 1, 100, random_state=3)\n",
86 | "y_green = norm.rvs(1, 1, 100, random_state=4)\n",
87 | "d = pd.DataFrame({\n",
88 | " 'X1': np.concatenate([x_red,x_green]),\n",
89 | " 'X2': np.concatenate([y_red,y_green]),\n",
90 | " 'Y': [1]*100+[0]*100\n",
91 | "})\n",
92 | "X, Y = d[['X1', 'X2']], d['Y']"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "## Exercise 4\n",
100 | "Finding the optimum number of weak learners\n",
101 | "\n",
102 | "For the dataset below:\n",
103 | "1. plot the points using `pyplot.scatter`;\n",
104 | "1. plot a line chart using `pyplot.plot` that shows the training error and the CVLOO error of AdaBoost using between 1 and 15 weak learners.\n",
105 | "1. What is the best number of weak learners in this case?"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 4,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "from scipy.stats import norm\n",
115 | "import pandas as pd\n",
116 | "import numpy as np\n",
117 | "x_red = norm.rvs(0, 1, 100, random_state=1)\n",
118 | "y_red = norm.rvs(0, 1, 100, random_state=2)\n",
119 | "x_green = norm.rvs(1, 1, 100, random_state=3)\n",
120 | "y_green = norm.rvs(1, 1, 100, random_state=4)\n",
121 | "d = pd.DataFrame({\n",
122 | " 'X1': np.concatenate([x_red,x_green]),\n",
123 | " 'X2': np.concatenate([y_red,y_green]),\n",
124 | " 'Y': [1]*100+[0]*100\n",
125 | "})\n",
126 | "X, Y = d[['X1', 'X2']], d['Y']"
127 | ]
128 | }
129 | ],
130 | "metadata": {
131 | "kernelspec": {
132 | "display_name": "Python 3 (ipykernel)",
133 | "language": "python",
134 | "name": "python3"
135 | },
136 | "language_info": {
137 | "codemirror_mode": {
138 | "name": "ipython",
139 | "version": 3
140 | },
141 | "file_extension": ".py",
142 | "mimetype": "text/x-python",
143 | "name": "python",
144 | "nbconvert_exporter": "python",
145 | "pygments_lexer": "ipython3",
146 | "version": "3.12.3"
147 | }
148 | },
149 | "nbformat": 4,
150 | "nbformat_minor": 4
151 | }
152 |
--------------------------------------------------------------------------------
/Lab14-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# k-Means\n",
8 | "\n",
9 | "## Exercise 1\n",
10 | "Intra-cluster cohesion increase\n",
11 | "\n",
12 | "For the dataset `d` and `start_centroids` below, show that the $J$ criterion (or inertia) monotonically decreases for each successive iteration of the algorithm by plotting it on a line chart using `pyplot.plot`."
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 1,
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "from sklearn.datasets import make_blobs\n",
22 | "import numpy as np\n",
23 | "import pandas as pd\n",
24 | "\n",
25 | "n_samples = 1500\n",
26 | "random_state = 170\n",
27 | "X, y = make_blobs(n_samples=n_samples, random_state=random_state)\n",
28 | "d = pd.DataFrame(X, columns=['X1', 'X2'])\n",
29 | "start_centroids = np.array([[0, 0.1], [0, 0.2], [0, 0.3]])"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Exercise 2\n",
37 | "Intra-cluster cohesion calculation\n",
38 | "\n",
39 | "Given the dataset `d` and `start_centroids` below:\n",
40 | "1. Run the k-means algorithm from `sklearn` on the dataset `d` and plot the resulting clusters using a scatterplot.\n",
41 | "1. Print the intra-cluster cohesion as computed by the algorithm for the resulting clusters.\n",
42 | "1. Independently calculate the intra-cluster cohesion for the resulting clusters (it should match the one computed by the algorithm)."
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "from sklearn.datasets import make_blobs\n",
52 | "import numpy as np\n",
53 | "import pandas as pd\n",
54 | "\n",
55 | "n_samples = 1500\n",
56 | "random_state = 110\n",
57 | "X, y = make_blobs(n_samples=n_samples, random_state=random_state)\n",
58 | "d = pd.DataFrame(X, columns=['X1', 'X2'])\n",
59 | "start_centroids = np.array([[0, 0.1], [0, 0.2], [0, 0.3]])"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "## Exercise 3\n",
67 | "k-means++\n",
68 | "\n",
69 | "Show that _k-means++_ provides better centroid initialisation by comparing the average value of $J$ (i.e. the `inertia_` attribute) for _random_ initializations with _k-means++_ initialisations. More specifically:\n",
70 | "* on the dataset `d` below, run the k-means algorithm 1000 times using _random_ initialisation and record the `inertia_` attribute. Plot the histogram of the recorded values and print their mean;\n",
71 | "* repeat the process using _k-means++_;\n",
72 | "* Which method performs better?\n",
73 | "\n",
74 | "For the `KMeans` function, make sure to always use the parameters `max_iter=1, n_init=1, algorithm='full', n_clusters=3, random_state=None` to emphasize the effect."
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": [
83 | "from sklearn.datasets import make_blobs\n",
84 | "import numpy as np\n",
85 | "import pandas as pd\n",
86 | "\n",
87 | "n_samples = 1500\n",
88 | "random_state = 100\n",
89 | "X, y = make_blobs(n_samples=n_samples, random_state=random_state)\n",
90 | "d = pd.DataFrame(X, columns=['X1', 'X2'])"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "## Exercise 4\n",
98 | "k-means++ calculation\n",
99 | "\n",
100 | "Given a dataset containing the points A(-1, 0), B(1, 0), C(0, 1), D(3, 0) and E(3, 1):\n",
101 | "1. Plot the points using a scatterplot.\n",
102 | "1. Consider the k-means algorithm with $k=2$ and `random` initialisation. During the initial centroid selection (before the first iteration), if the first centroid was chosen at random to be the point A, what is the probability that the next centroid will be chosen from the set {B, C}?\n",
103 | "1. Calculate the same probability, but this time for `k-means++` instead of `random`."
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "## Exercise 5\n",
111 | "The best value for $k$\n",
112 | "\n",
113 | "For the dataset `d` below, find the number of clusters for the k-means algorithm using the \"elbow\" method. Make sure to plot the points using a scatterplot and the line chart for the $J$ criterion."
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": 4,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "from sklearn.datasets import make_blobs\n",
123 | "import numpy as np\n",
124 | "import pandas as pd\n",
125 | "\n",
126 | "n_samples = 1500\n",
127 | "random_state = 160\n",
128 | "X, y = make_blobs(n_samples=n_samples, centers=6, random_state=random_state)\n",
129 | "d = pd.DataFrame(X, columns=['X1', 'X2'])"
130 | ]
131 | }
132 | ],
133 | "metadata": {
134 | "kernelspec": {
135 | "display_name": "Python 3 (ipykernel)",
136 | "language": "python",
137 | "name": "python3"
138 | },
139 | "language_info": {
140 | "codemirror_mode": {
141 | "name": "ipython",
142 | "version": 3
143 | },
144 | "file_extension": ".py",
145 | "mimetype": "text/x-python",
146 | "name": "python",
147 | "nbconvert_exporter": "python",
148 | "pygments_lexer": "ipython3",
149 | "version": "3.12.3"
150 | }
151 | },
152 | "nbformat": 4,
153 | "nbformat_minor": 4
154 | }
155 |
--------------------------------------------------------------------------------
/Lab07-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Estimating the parameters of a distribution\n",
8 | "\n",
9 | "## Exercise 1\n",
10 | "Linked Bernoulli distributions\n",
11 | "\n",
12 | "Consider two coins. The probability of getting \"heads\" is $p$ for the first coin and $2p$ for the second. We then toss the first coin 5 times and the second 10 times, as simulated in the code below:"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 1,
18 | "metadata": {},
19 | "outputs": [
20 | {
21 | "name": "stdout",
22 | "output_type": "stream",
23 | "text": [
24 | "X: [0 1 0 0 0]\n",
25 | "Y: [1 1 1 1 1 1 1 0 1 1]\n"
26 | ]
27 | }
28 | ],
29 | "source": [
30 | "from scipy.stats import bernoulli\n",
31 | "p = 0.3\n",
32 | "X = bernoulli.rvs(p, size=5, random_state=1)\n",
33 | "Y = bernoulli.rvs(2*p, size=10, random_state=2)\n",
34 | "print('X:', X) # 1 = heads; 0 = tails\n",
35 | "print('Y:', Y)"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "1. Plot the log-likelihood of the data as a function of $\\hat{p}$. Note that the data consists of both `X` and `Y`, so the likelihood function becomes $L(\\hat{p} | X,Y)$\n",
43 | "1. Experimentally determine the MLE estimation for $p$ corresponding to the observations in `X` and `Y`.\n",
44 | "1. Analytically determine the MLE estimation for $p$."
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "## Exercise 2\n",
52 | "Poisson distribution\n",
53 | "\n",
54 | "A call centre keeps track of the number of phone calls received every day. In order to accurately plan the resources, the number of calls for the last 100 days are modelled as a random variable $X$ following a Poisson distribution of an unknown parameter $\\lambda$.\n",
55 | "\n",
56 | "We will simulate this for $\\lambda = 4$ using the `poisson.rvs` function:"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 2,
62 | "metadata": {},
63 | "outputs": [
64 | {
65 | "data": {
66 | "text/plain": [
67 | "array([2, 2, 3, 4, 4, 6, 1, 5, 3, 5])"
68 | ]
69 | },
70 | "execution_count": 2,
71 | "metadata": {},
72 | "output_type": "execute_result"
73 | }
74 | ],
75 | "source": [
76 | "from scipy.stats import poisson\n",
77 | "lambda_ = 4\n",
78 | "X = poisson.rvs(lambda_, size=100, random_state=1)\n",
79 | "X[:10] # Calls received in the first 10 days"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "1. Plot the histogram of the data.\n",
87 | "1. Plot the log-likelihood of the data as a function of $\\hat{\\lambda}$.\n",
88 | "1. Experimentally determine the MLE estimation for $\\lambda$ corresponding to the observations in `X`.\n",
89 | "1. Analytically determine the MLE estimation for $\\lambda$."
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "## Exercise 3\n",
97 | "Uniform distribution\n",
98 | "\n",
99 | "Consider a hashing function that returns a number in the interval $[-w, w]$ for any file. Any value in that interval is equally likely to appear, so the hash values follow a uniform distribution $U(-w,w)$. We can simulate the hashes for 100 files using the `uniform.rvs` function:"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 3,
105 | "metadata": {},
106 | "outputs": [
107 | {
108 | "data": {
109 | "text/plain": [
110 | "array([-1.65955991, 4.40648987, -9.9977125 ])"
111 | ]
112 | },
113 | "execution_count": 3,
114 | "metadata": {},
115 | "output_type": "execute_result"
116 | }
117 | ],
118 | "source": [
119 | "from scipy.stats import uniform\n",
120 | "w = 10\n",
121 | "X = uniform.rvs(-w, 2*w, size=100, random_state=1)\n",
122 | "X[:3] # The first 3 hashes"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "1. Plot the histogram of the data.\n",
130 | "1. Experimentally determine the MLE estimation for $w$ given the observations in `X`.\n",
131 | "1. Analytically determine the MLE estimation for $w$."
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "## Exercise 4\n",
139 | "Exponential distribution\n",
140 | "\n",
141 | "Seismologists are tracking the time interval between consecutive major earthquakes. They noticed that it follows an exponential distribution $Exp(\\lambda)$.\n",
142 | "\n",
143 | "To simulate the observed intervals between 100 earthquakes that occur on average once per year, we can use the `expon.rvs` function:"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 4,
149 | "metadata": {},
150 | "outputs": [
151 | {
152 | "data": {
153 | "text/plain": [
154 | "array([5.39605837e-01, 1.27412525e+00, 1.14381359e-04])"
155 | ]
156 | },
157 | "execution_count": 4,
158 | "metadata": {},
159 | "output_type": "execute_result"
160 | }
161 | ],
162 | "source": [
163 | "from scipy.stats import expon \n",
164 | "lambda_ = 1 # Once per year, on average\n",
165 | "X = expon.rvs(scale=1/lambda_, size=100, random_state=1)\n",
166 | "X[:3] # The first 3 intervals "
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "1. Plot the histogram of the data.\n",
174 | "1. Experimentally determine the MLE estimation for $\\lambda$ corresponding to the observations in `X`.\n",
175 | "1. Analytically determine the MLE estimation for $\\lambda$."
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "## Exercise 5\n",
183 | "Variance of a Gaussian distribution\n",
184 | "\n",
185 | "Consider a random variable $X$ representing the size of pollen grains, following a normal (Gaussian) distribution with known mean 0 and variance $\\sigma^2$, formally written as $X \\sim N(0, \\sigma^2)$. Load the observations for this variable as a `numpy` array by running the following code:"
186 | ]
187 | },
188 | {
189 | "cell_type": "code",
190 | "execution_count": 5,
191 | "metadata": {},
192 | "outputs": [],
193 | "source": [
194 | "from sklearn.datasets import fetch_openml\n",
195 | "pollen = fetch_openml('pollen', version=1, as_frame=False, parser='auto')\n",
196 | "X = pollen.data[:,1]"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "1. Plot the histogram of the data.\n",
204 | "1. Experimentally find the value of $\\hat{\\sigma}^2_\\text{MLE}$ by testing candidates in the interval $[1, 10]$. Note that since the dataset is quite large, calculating the likelihood of the data can quickly result in an underflow on most systems. Try using the log-likelihood instead.\n",
205 | "1. Analytically find the estimator $\\hat{\\sigma}^2_\\text{MLE}$ and apply the resulting formula to the dataset."
206 | ]
207 | }
208 | ],
209 | "metadata": {
210 | "kernelspec": {
211 | "display_name": "Python 3 (ipykernel)",
212 | "language": "python",
213 | "name": "python3"
214 | },
215 | "language_info": {
216 | "codemirror_mode": {
217 | "name": "ipython",
218 | "version": 3
219 | },
220 | "file_extension": ".py",
221 | "mimetype": "text/x-python",
222 | "name": "python",
223 | "nbconvert_exporter": "python",
224 | "pygments_lexer": "ipython3",
225 | "version": "3.12.3"
226 | }
227 | },
228 | "nbformat": 4,
229 | "nbformat_minor": 4
230 | }
231 |
--------------------------------------------------------------------------------
/Lab01-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Elementary Notions in Probability"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Exercise 1*\n",
15 | "(Events, implementation)\n",
16 | "\n",
17 | "Illustrate De Morgan's laws using the `plot_venn()` function and standard Python set operations:\n",
18 | "1. $\\neg (A\\cup B) = \\neg A \\cap \\neg B$\n",
19 | "1. $\\neg (A\\cap B) = \\neg A \\cup \\neg B$"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 1,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "from tools.venn import A, B, omega, plot_venn\n",
29 | "# First law\n",
30 | "\n",
31 | "# Second law"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "## Exercise 2\n",
39 | "(Product of sample spaces, implementation)\n",
40 | "\n",
41 | "Two dice are thrown simultaneously. Calculate the probability that the sum is 11."
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 2,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "from itertools import product\n",
51 | "\n",
52 | "# Code here"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "## Exercise 3\n",
60 | "(Conditional probabilities, implementation)\n",
61 | "\n",
62 | "The event S represents the sum of two dice. What is the probability that S=11 knowing that S is a prime?"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 3,
68 | "metadata": {},
69 | "outputs": [],
70 | "source": [
71 | "from itertools import product\n",
72 | "\n",
73 | "# Code here"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "## Exercise 4* - Monty Hall problem\n",
81 | "(Bayes' theorem, implementation and analysis)\n",
82 | "\n",
83 | "Suppose you are in a game show and you're given the choice of three doors; behind one is a car, behind the others, goats. You pick door no. 1, but don't open it. The game host (who knows what is behind each door) then opens one of the other doors, which always reveals a goat (in this case door no. 2), and asks whether you still want to open door no. 1 or switch to door no. 3.\n",
84 | "\n",
85 | "What are the probabilities of finding the car in the two cases?\n",
86 | "\n",
87 | "1. Create a Python simulation for 1000 games to estimate the answer.\n",
88 | "2. Find the answer using the `tool.stats.probability_weighted` function (see [this approach](http://web.mit.edu/neboat/Public/6.042/probabilityintro.pdf) for constructing the sample space).\n",
89 | "3. Find the answer mathematically by applying Bayes' theorem and the law of total probability."
90 | ]
91 | },
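{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A possible simulation sketch for part 1 (one way to do it, not the reference solution):\n",
"# import random\n",
"# wins_stay = wins_switch = 0\n",
"# for _ in range(1000):\n",
"#     car = random.randint(1, 3)\n",
"#     pick = 1\n",
"#     opened = next(d for d in (2, 3) if d != car)   # host opens a goat door, never yours\n",
"#     switch = ({1, 2, 3} - {pick, opened}).pop()    # the remaining closed door\n",
"#     wins_stay += (pick == car)\n",
"#     wins_switch += (switch == car)\n",
"# print(wins_stay / 1000, wins_switch / 1000)"
]
},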
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "## Exercise 5\n",
97 | "(Probabilities, analysis)\n",
98 | "\n",
99 | "Using the definition of the probability, prove that:\n",
100 | "\n",
101 | "1. $P(\\neg A) = 1-P(A)$\n",
102 | "1. $A \\subseteq B \\Rightarrow P(A) \\leq P(B)$\n",
103 | "\n",
104 | "## Exercise 6*\n",
105 | "(Probabilities, analysis)\n",
106 | "\n",
107 | "Using the definition of the probability, prove that:\n",
108 | "\n",
109 | "1. $P(A \\setminus B) = P(A) - P(A \\cap B)$\n",
110 | "1. $P(A \\cup B) = P(A) + P(B) - P(A \\cap B)$"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "## Exercise 7\n",
118 | "\n",
119 | "(Independent events, analysis)\n",
120 | "\n",
121 | "Two soldiers A and B are doing target practice. The probability that soldier A misses is 1/5. The probability that soldier B misses is 1/2. Probability that both miss at the same time is 1/10.\n",
122 | "\n",
123 | "1. Are the two events independent?\n",
124 | "1. What is the probability that at least one of the soldiers misses?\n",
125 | "1. What is the probability that exactly one of the soldiers misses?"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "## Exercise 8\n",
133 | "(Independent events, implementation)\n",
134 | "\n",
135 | "Consider the event space corresponding to two tosses of a fair coin, and the events A \"heads on toss 1\", B \"heads on toss 2\" and C \"the two tosses are equal\". Using the `tools.stats.probability` function, find if:\n",
136 | "\n",
137 | "1. events A and B are independent;\n",
138 | "1. events A and C are independent."
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 4,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "from tools.stats import probability\n",
148 | "\n",
149 | "# Code here"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "# Elementary Notions in Statistics"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "## Exercise 9*\n",
164 | "(Random variables, implementation)\n",
165 | "\n",
166 | "Give an example of a a real phenomenon modelled by the following discrete distributions and plot an illustrative pmf for that phenomenon using `matplotlib` and `scipy.stats` functions:\n",
167 | "\n",
168 | "1. binomial;\n",
169 | "2. geometric."
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "## Exercise 10*\n",
177 | "(Random variables, implementation)\n",
178 | "\n",
179 | "Give an example of a real phenomenon modelled by the following continuous distributions and plot an illustrative pdf for that phenomenon using `matplotlib` and `scipy.stats` functions:\n",
180 | "\n",
181 | "1. gamma;\n",
182 | "2. Pareto."
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "## Exercise 11*\n",
190 | "\n",
191 | "(Random variables, implementation)\n",
192 | "\n",
193 | "Suppose you measure the temperature 10 consecutive days with a thermometer that has a small random error. \n",
194 | "\n",
195 | "1. What is the mean temperature, knowing that the mean error is +1°C and the measurements are those in the variable $Y$ below.\n",
196 | "2. A second thermometer with a Fahrenheit scale ($T_{(°F)} = T_{(°C)} × 1.8 + 32$) measures the temperature in a different region. The variance measured by this thermometer (in Fahrenheit) is 8. Where is the temperature more stable: in your region, or in the region measured by the second thermometer?"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 1,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "Y = [21, 20, 22, 23, 20, 19, 19, 18, 19, 20]\n",
206 | "\n",
207 | "# Your code here"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "## Exercise 12\n",
215 | "(Random variable, implementation)\n",
216 | "\n",
217 | "Let $S$ be the outcome of a random variable describing the sum of two dice thrown independently.\n",
218 | "\n",
219 | "1. Print the probability distribution of $S$ graphically.\n",
220 | "1. Determine $E[S]$ and $Var(S)$."
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "## Exercise 13\n",
228 | "(Random variable, conceptual)\n",
229 | "\n",
230 | "The probability distribution of a discrete random variable $X$ is given by\n",
231 | "\n",
232 | "$P(X=-1)=1/5, P(X=0)=2/5, P(X=1)=2/5$.\n",
233 | "\n",
234 | "1. Compute $E[X]$.\n",
235 | "1. Give the probability distribution of $Y=X^2$ and compute $E[Y]$ using the distribution of $Y$.\n",
236 | "1. Determine $E[X^2]$ using the change-of-variable formula. Check your answer against the answer in 2.\n",
237 | "1. Determine $Var(X)$."
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "## Exercise 14\n",
245 | "(binomial distribution, applied)\n",
246 | "\n",
247 | "A sailor is trying to walk on a slippery deck, but because of the movements of the ship, he can make exactly one step every second, either forward (with probability $p=0.5$) or backward (with probability $1-p=0.5$). Using the `scipy.stats.binom` package, determine the probability that the sailor is in position +8 after 16 seconds."
248 | ]
249 | },
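{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: ending at position +8 after 16 steps means 12 steps forward and 4 backward,\n",
"# so the answer is the binomial pmf at k=12 with n=16, p=0.5.\n",
"# from scipy.stats import binom\n",
"# print(binom.pmf(12, 16, 0.5))"
]
},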
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "## Exercise 15\n",
255 | "(geometric distribution, applied)\n",
256 | "\n",
257 | "In order to finish a board game, a player must get an exact 3 on a regular die. Using the `scipy.stats.geom` package, determine how many tries will it take to win the game (on average)? What are the best and worst cases?"
258 | ]
259 | },
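{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the number of tries until the first 3 is geometric with p = 1/6, so the\n",
"# expected number of tries is 1/p = 6; the best case is a single try and the worst\n",
"# case is unbounded.\n",
"# from scipy.stats import geom\n",
"# print(geom.mean(1/6))"
]
},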
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "## Exercise 16\n",
265 | "(gamma distribution, applied)\n",
266 | "\n",
267 | "The grades from an exam roughly follow a Gamma distribution with parameters $k=9$ (shape parameter) and $\\theta=0.5$ (scale parameter). Using the `scipy.stats.gamma` package, determine what percentage of students will pass the exam, if the minimum score is 3."
268 | ]
269 | }
270 | ],
271 | "metadata": {
272 | "kernelspec": {
273 | "display_name": "Python 3 (ipykernel)",
274 | "language": "python",
275 | "name": "python3"
276 | },
277 | "language_info": {
278 | "codemirror_mode": {
279 | "name": "ipython",
280 | "version": 3
281 | },
282 | "file_extension": ".py",
283 | "mimetype": "text/x-python",
284 | "name": "python",
285 | "nbconvert_exporter": "python",
286 | "pygments_lexer": "ipython3",
287 | "version": "3.12.3"
288 | }
289 | },
290 | "nbformat": 4,
291 | "nbformat_minor": 4
292 | }
293 |
--------------------------------------------------------------------------------
/Lab06-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Naive Bayes"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Exercise 1\n",
15 | "\n",
16 | "Learned probabilities\n",
17 | "\n",
18 | "Given the following run of the Naive Bayes algorithm without smoothing:"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {},
25 | "outputs": [
26 | {
27 | "name": "stdout",
28 | "output_type": "stream",
29 | "text": [
30 | "[-1.09861229 -0.40546511]\n",
31 | "[[-2.87682072e-01 -1.38629436e+00 -1.38629436e+00]\n",
32 | " [-2.51052925e+01 -1.24997790e-11 -6.93147181e-01]]\n"
33 | ]
34 | }
35 | ],
36 | "source": [
37 | "import pandas as pd\n",
38 | "from sklearn.naive_bayes import BernoulliNB\n",
39 | "\n",
40 | "# Create the training set\n",
41 | "features = ['study', 'free', 'money']\n",
42 | "target = 'is_spam'\n",
43 | "messages = pd.DataFrame(\n",
44 | "[(1, 0, 0, 0),\n",
45 | "(0, 0, 1, 0),\n",
46 | "(1, 0, 0, 0),\n",
47 | "(1, 1, 0, 0)] +\n",
48 | "[(0, 1, 0, 1)] * 4 +\n",
49 | "[(0, 1, 1, 1)] * 4,\n",
50 | "columns=features+[target])\n",
51 | "\n",
52 | "# Create the prediction set\n",
53 | "X = messages[features]\n",
54 | "y = messages[target]\n",
55 | "cl = BernoulliNB(alpha=1e-10).fit(X, y)\n",
56 | "\n",
57 | "print(cl.class_log_prior_)\n",
58 | "print(cl.feature_log_prob_)"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "1. Write a function that independently calculates the value of the `class_log_prior_` attribute without smoothing using only `messages` as parameter. (These are the natural logarithms of class probabilities $P(v_j)$).\n",
66 | "2. Write a function that independently calculates the value of the `feature_log_prob_` attribute without smoothing using only `messages` as parameter. (These are the natural logarithms of attribute probabilities $P(a_i|v_j)$)."
67 | ]
68 | },
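{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One possible sketch, reusing the `features` and `target` names defined above\n",
"# (classes are sorted ascending; without smoothing a zero count gives log(0) = -inf):\n",
"# import numpy as np\n",
"# def class_log_prior(messages):\n",
"#     return np.log(messages[target].value_counts(normalize=True).sort_index().to_numpy())\n",
"# def feature_log_prob(messages):\n",
"#     return np.log(messages.groupby(target)[features].mean().to_numpy())"
]
},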
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Exercise 2\n",
74 | "Expected error rate in training\n",
75 | "\n",
76 | "Consider a binary classification problem with features $X_1$ and $X_2$ and label $Y$. The two features are assumed to be conditionally independent with respect to $Y$ . The prior probabilities $P(Y=0)$ and $P(Y=1)$ are both equal to 0.5. The conditional probabilities are:\n",
77 | "\n",
78 | "
\n",
79 | "
\n",
80 | "
P(X1|Y)
\n",
81 | "
Y=0
\n",
82 | "
Y=1
\n",
83 | "
\n",
84 | "
X1=0
\n",
85 | "
0.7
\n",
86 | "
0.2
\n",
87 | "
\n",
88 | "
X1=1
\n",
89 | "
0.3
\n",
90 | "
0.8
\n",
91 | "
\n",
92 | "
\n",
93 | "\n",
94 | "
\n",
95 | "
\n",
96 | "
P(X2|Y)
\n",
97 | "
Y=0
\n",
98 | "
Y=1
\n",
99 | "
\n",
100 | "
X2=0
\n",
101 | "
0.9
\n",
102 | "
0.5
\n",
103 | "
\n",
104 | "
X2=1
\n",
105 | "
0.1
\n",
106 | "
0.5
\n",
107 | "
\n",
108 | "
\n",
109 | "\n",
110 | " \n",
111 | "\n",
112 | "1. Generate a `DataFrame` with 1000 entries and three columns `['x1', 'x2', 'y']`, according to the description above, using the `bernoulli.rvs` function from `scipy`.\n",
113 | "1. After training on the DataFrame above, predict every combination of values for $X_1$ and $X_2$.\n",
114 | "1. Calculate the average error rate on the training dataset.\n",
115 | "1. Create a new attribute $X_3$ as a copy of $X_2$. What is the new average error rate on the training dataset?"
116 | ]
117 | },
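{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for part 1 (one possible way to sample the conditional distributions):\n",
"# import numpy as np\n",
"# from scipy.stats import bernoulli\n",
"# n = 1000\n",
"# y = bernoulli.rvs(0.5, size=n)\n",
"# x1 = bernoulli.rvs(np.where(y == 1, 0.8, 0.3))   # P(X1=1|Y=0)=0.3, P(X1=1|Y=1)=0.8\n",
"# x2 = bernoulli.rvs(np.where(y == 1, 0.5, 0.1))   # P(X2=1|Y=0)=0.1, P(X2=1|Y=1)=0.5\n",
"# df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})"
]
},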
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "## Exercise 3\n",
123 | "Joint Bayes\n",
124 | "\n",
125 | "Considering the dataset below:"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 2,
131 | "metadata": {},
132 | "outputs": [],
133 | "source": [
134 | "import pandas as pd\n",
135 | "from tools.pd_helpers import apply_counts\n",
136 | "\n",
137 | "d = pd.DataFrame({'X1': [0, 0, 1, 1, 0, 0, 1, 1],\n",
138 | " 'X2': [0, 0, 0, 0, 1, 1, 1, 1],\n",
139 | " 'C' : [2, 18, 4, 1, 4, 1, 2, 18],\n",
140 | " 'Y' : [0, 1, 0, 1, 0, 1, 0, 1]})\n",
141 | "d=apply_counts(d, 'C')"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "1. Implement a simple version of the Joint Bayes algorithm by creating the `BernoulliJB` class, similar to `BernoulliNB` from `scikit`, but only implement the `fit(X,y)` and `predict_proba(X)` without smoothing.\n",
149 | "1. How many probabilities are estimated by the the Joint Bayes algorithm?\n",
150 | "1. What are probability estimates for the instance $X_1 = 0$, $X_2 = 0$ calculated by `predict_proba(X)` from `BernoulliJB`?\n",
151 | "1. What are the predicted probabilities of Naive Bayes (using `predict_proba(X)` from `BernoulliNB`) without smoothing for this instance?"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "## Exercise 4\n",
159 | "Measuring the naivety assumption\n",
160 | "\n",
161 | "Consider a simple text classification that only considers two words: $w_1$ and $w_2$. The label $y$ will only be 1 if $w_1$ is present and $w_2$ is not, so the label is effectively the function $w1 \\land \\lnot w2$.\n",
162 | "\n",
163 | "The `correlated_df` function below will return such a dataset, with 10,000 entries and the columns `['w1', 'w2', 'y']`. The parameter `corr` specifies approximately how much correlation should exist between `w1` and `w2`."
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 3,
169 | "metadata": {},
170 | "outputs": [
171 | {
172 | "name": "stdout",
173 | "output_type": "stream",
174 | "text": [
175 | "Correlation: PearsonRResult(statistic=np.float64(0.469833135691177), pvalue=np.float64(0.0))\n"
176 | ]
177 | }
178 | ],
179 | "source": [
180 | "from scipy.stats import bernoulli\n",
181 | "from scipy.stats import pearsonr\n",
182 | "size = 10000\n",
183 | "\n",
184 | "def correlated_df(corr):\n",
185 | " w1 = bernoulli.rvs(0.5, size=size, random_state=1)\n",
186 | " d = pd.DataFrame({'w1': w1})\n",
187 | " mask = bernoulli.rvs(corr, size=size, random_state=2)\n",
188 | " random = bernoulli.rvs(0.5, size=size, random_state=3)\n",
189 | " d['w2'] = d['w1'] & mask | random & ~mask\n",
190 | " d['mask'] = mask\n",
191 | " d['random'] = random\n",
192 | " d['y'] = d['w1'] & ~ d['w2']\n",
193 | " return d\n",
194 | "\n",
195 | "d = correlated_df(0.5)\n",
196 | "\n",
197 | "# Check that the correlation is indeed close to 0.5\n",
198 | "print(\"Correlation: \", pearsonr(d['w1'], d['w2']))"
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "1. With the function above, create a line chart using `matplotlib` that shows how the correlation affects the training error of Naive Bayes (no smoothing).\n",
206 | "1. Using the function above, create a line chart that shows how the correlation affects the training error of a decision tree classifier (see the `DecisionTreeClassifier` class from `sklearn`)."
207 | ]
208 | },
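{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for part 1 (BernoulliNB is already imported in Exercise 1's cell):\n",
"# import matplotlib.pyplot as plt\n",
"# corrs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]\n",
"# errors = []\n",
"# for corr in corrs:\n",
"#     d = correlated_df(corr)\n",
"#     nb = BernoulliNB(alpha=1e-10).fit(d[['w1', 'w2']], d['y'])\n",
"#     errors.append(1 - nb.score(d[['w1', 'w2']], d['y']))\n",
"# plt.plot(corrs, errors)\n",
"# plt.xlabel('correlation between w1 and w2'); plt.ylabel('training error'); plt.show()"
]
},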
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "## Exercise 5\n",
214 | "Average error rate\n",
215 | "\n",
216 | "Given the function $Y = (A \\land B) \\lor \\neg(B \\lor C)$ where $A$, $B$ and $C$ are independent binary random variables, each of which having 50% chance of being 0 and 50% chance of being 1.\n",
217 | "\n",
218 | "1. Generate a DataFrame with 1000 entries and four columns `A`, `B`, `C` and `Y`, according to the description above, using the `bernoulli.rvs` function from `scipy`.\n",
219 | "1. Calculate the error rate for Naive Bayes on the training dataset.\n",
220 | "1. What is the average error rate on this training dataset for the Joint Bayes algorithm? (Note that you don't have to actually build the algorithm, just provide a theoretical justification.)"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "## Exercise 6\n",
228 | "Text classification\n",
229 | "\n",
230 | "A news company would like to automatically sort the news articles related to sport from those related to politics. They are using 8 key words ($w_1,...,w_8)$ and have annotated several articles in each category for training:"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": 4,
236 | "metadata": {},
237 | "outputs": [],
238 | "source": [
239 | "import pandas as pd\n",
240 | "\n",
241 | "features = [f'w{i}' for i in range(1,9)]\n",
242 | "\n",
243 | "politics=pd.DataFrame([\n",
244 | "(1, 0, 1, 1, 1, 0, 1, 1),\n",
245 | "(0, 0, 0, 1, 0, 0, 1, 1),\n",
246 | "(1, 0, 0, 1, 1, 0, 1, 0),\n",
247 | "(0, 1, 0, 0, 1, 1, 0, 1),\n",
248 | "(0, 0, 0, 1, 1, 0, 1, 1),\n",
249 | "(0, 0, 0, 1, 1, 0, 0, 1)],\n",
250 | "columns=features)\n",
251 | "\n",
252 | "sport=pd.DataFrame([\n",
253 | "(1, 1, 0, 0, 0, 0, 0, 0),\n",
254 | "(0, 0, 1, 0, 0, 0, 0, 0),\n",
255 | "(1, 1, 0, 1, 0, 0, 0, 0),\n",
256 | "(1, 1, 0, 1, 0, 0, 0, 1),\n",
257 | "(1, 1, 0, 1, 1, 0, 0, 0),\n",
258 | "(0, 0, 0, 1, 0, 1, 0, 0),\n",
259 | "(1, 1, 1, 1, 1, 0, 1, 0)],\n",
260 | "columns=features)"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "According to Naive Bayes (without smoothing), what is the probability that the document `x = (1, 0, 0, 1, 1, 1, 1, 0)` is about politics?"
268 | ]
269 | }
270 | ],
271 | "metadata": {
272 | "kernelspec": {
273 | "display_name": "Python 3 (ipykernel)",
274 | "language": "python",
275 | "name": "python3"
276 | },
277 | "language_info": {
278 | "codemirror_mode": {
279 | "name": "ipython",
280 | "version": 3
281 | },
282 | "file_extension": ".py",
283 | "mimetype": "text/x-python",
284 | "name": "python",
285 | "nbconvert_exporter": "python",
286 | "pygments_lexer": "ipython3",
287 | "version": "3.12.3"
288 | }
289 | },
290 | "nbformat": 4,
291 | "nbformat_minor": 4
292 | }
293 |
--------------------------------------------------------------------------------
/img/random_variable.svg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spantiru/companion-lab/HEAD/img/random_variable.svg
--------------------------------------------------------------------------------
/Lab02-Exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Introduction to Information Theory\n",
8 | "\n",
9 | "## Exercise 1\n",
10 | "\n",
11 | "(entropy, implementation)\n",
12 | "\n",
13 | "Consider two fair dice with 6 sides each.\n",
14 | "\n",
15 | "1. Print the probability distribution of the sum $S$ of the numbers obtained by throwing the two dice.\n",
16 | "1. What is the information content in bits of the events $S=2$, $S=11$, $S=5$, $S=7$.\n",
17 | "1. Calculate the entropy of S.\n",
18 | "1. Lets say you throw the die one at a time, and the first die shows 4. What is the entropy of S after this observation? Was any information gained/lost in the process of observing the outcome of the first die toss? If so, calculate how much information (in bits) was lost or gained."
19 | ]
20 | },
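{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for parts 1-3 (one possible approach):\n",
"# import numpy as np\n",
"# from collections import Counter\n",
"# from itertools import product\n",
"# counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))\n",
"# pmf = {s: c / 36 for s, c in sorted(counts.items())}\n",
"# info = {s: -np.log2(p) for s, p in pmf.items()}    # information content in bits\n",
"# H = -sum(p * np.log2(p) for p in pmf.values())     # entropy of S"
]
},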
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Exercise 2\n",
26 | "\n",
27 | "(information gain, implementation or analysis)\n",
28 | "\n",
29 | "Given the dataset below, calculate the information gain for the target variable 'Edible' and each feature ('Weight', 'Smell', 'Spots', 'Smooth'):"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 1,
35 | "metadata": {},
36 | "outputs": [
37 | {
38 | "name": "stdout",
39 | "output_type": "stream",
40 | "text": [
41 | " Weight Smell Spots Smooth Edible\n",
42 | "A 1 0 0 0 1\n",
43 | "B 1 0 1 0 1\n",
44 | "C 0 1 0 1 1\n",
45 | "D 0 0 0 1 0\n",
46 | "E 1 1 1 0 0\n",
47 | "F 1 0 1 1 0\n",
48 | "G 1 0 0 1 0\n",
49 | "H 0 1 0 0 0\n"
50 | ]
51 | }
52 | ],
53 | "source": [
54 | "import pandas as pd\n",
55 | "features = ['Weight', 'Smell', 'Spots', 'Smooth', 'Edible']\n",
56 | "mushrooms = pd.DataFrame([\n",
57 | " (1, 0, 0, 0, 1),\n",
58 | " (1, 0, 1, 0, 1),\n",
59 | " (0, 1, 0, 1, 1),\n",
60 | " (0, 0, 0, 1, 0),\n",
61 | " (1, 1, 1, 0, 0),\n",
62 | " (1, 0, 1, 1, 0),\n",
63 | " (1, 0, 0, 1, 0),\n",
64 | " (0, 1, 0, 0, 0)\n",
65 | "],\n",
66 | "index=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],\n",
67 | "columns=features)\n",
68 | "print(mushrooms)"
69 | ]
70 | },
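{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (one way to compute the information gain with pandas and numpy):\n",
"# import numpy as np\n",
"# def entropy(s):\n",
"#     p = s.value_counts(normalize=True)\n",
"#     return -(p * np.log2(p)).sum()\n",
"# def info_gain(df, feature, target='Edible'):\n",
"#     h_cond = sum(len(g) / len(df) * entropy(g[target]) for _, g in df.groupby(feature))\n",
"#     return entropy(df[target]) - h_cond\n",
"# {f: info_gain(mushrooms, f) for f in ['Weight', 'Smell', 'Spots', 'Smooth']}"
]
},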
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "## Exercise 3\n",
76 | "\n",
77 | "(entropy and information gain, implementation or analysis)\n",
78 | "\n",
79 | "The following code simulates the season results for football team F:"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 2,
85 | "metadata": {},
86 | "outputs": [
87 | {
88 | "data": {
89 | "text/html": [
90 | "
\n",
91 | "\n",
104 | "
\n",
105 | " \n",
106 | "
\n",
107 | "
\n",
108 | "
opponent
\n",
109 | "
stadium
\n",
110 | "
result
\n",
111 | "
\n",
112 | " \n",
113 | " \n",
114 | "
\n",
115 | "
0
\n",
116 | "
Team A
\n",
117 | "
Home
\n",
118 | "
Win
\n",
119 | "
\n",
120 | "
\n",
121 | "
1
\n",
122 | "
Team A
\n",
123 | "
Away
\n",
124 | "
Draw
\n",
125 | "
\n",
126 | "
\n",
127 | "
2
\n",
128 | "
Team B
\n",
129 | "
Home
\n",
130 | "
Draw
\n",
131 | "
\n",
132 | "
\n",
133 | "
3
\n",
134 | "
Team B
\n",
135 | "
Away
\n",
136 | "
Win
\n",
137 | "
\n",
138 | "
\n",
139 | "
4
\n",
140 | "
Team C
\n",
141 | "
Home
\n",
142 | "
Loss
\n",
143 | "
\n",
144 | "
\n",
145 | "
5
\n",
146 | "
Team C
\n",
147 | "
Away
\n",
148 | "
Loss
\n",
149 | "
\n",
150 | "
\n",
151 | "
6
\n",
152 | "
Team D
\n",
153 | "
Home
\n",
154 | "
Loss
\n",
155 | "
\n",
156 | "
\n",
157 | "
7
\n",
158 | "
Team D
\n",
159 | "
Away
\n",
160 | "
Draw
\n",
161 | "
\n",
162 | "
\n",
163 | "
8
\n",
164 | "
Team E
\n",
165 | "
Home
\n",
166 | "
Win
\n",
167 | "
\n",
168 | "
\n",
169 | "
9
\n",
170 | "
Team E
\n",
171 | "
Away
\n",
172 | "
Win
\n",
173 | "
\n",
174 | "
\n",
175 | "
10
\n",
176 | "
Team A
\n",
177 | "
Home
\n",
178 | "
Draw
\n",
179 | "
\n",
180 | "
\n",
181 | "
11
\n",
182 | "
Team A
\n",
183 | "
Away
\n",
184 | "
Loss
\n",
185 | "
\n",
186 | "
\n",
187 | "
12
\n",
188 | "
Team B
\n",
189 | "
Home
\n",
190 | "
Draw
\n",
191 | "
\n",
192 | "
\n",
193 | "
13
\n",
194 | "
Team B
\n",
195 | "
Away
\n",
196 | "
Win
\n",
197 | "
\n",
198 | "
\n",
199 | "
14
\n",
200 | "
Team C
\n",
201 | "
Home
\n",
202 | "
Loss
\n",
203 | "
\n",
204 | "
\n",
205 | "
15
\n",
206 | "
Team C
\n",
207 | "
Away
\n",
208 | "
Draw
\n",
209 | "
\n",
210 | "
\n",
211 | "
16
\n",
212 | "
Team D
\n",
213 | "
Home
\n",
214 | "
Win
\n",
215 | "
\n",
216 | "
\n",
217 | "
17
\n",
218 | "
Team D
\n",
219 | "
Away
\n",
220 | "
Draw
\n",
221 | "
\n",
222 | "
\n",
223 | "
18
\n",
224 | "
Team E
\n",
225 | "
Home
\n",
226 | "
Draw
\n",
227 | "
\n",
228 | "
\n",
229 | "
19
\n",
230 | "
Team E
\n",
231 | "
Away
\n",
232 | "
Win
\n",
233 | "
\n",
234 | " \n",
235 | "
\n",
236 | "
"
237 | ],
238 | "text/plain": [
239 | " opponent stadium result\n",
240 | "0 Team A Home Win\n",
241 | "1 Team A Away Draw\n",
242 | "2 Team B Home Draw\n",
243 | "3 Team B Away Win\n",
244 | "4 Team C Home Loss\n",
245 | "5 Team C Away Loss\n",
246 | "6 Team D Home Loss\n",
247 | "7 Team D Away Draw\n",
248 | "8 Team E Home Win\n",
249 | "9 Team E Away Win\n",
250 | "10 Team A Home Draw\n",
251 | "11 Team A Away Loss\n",
252 | "12 Team B Home Draw\n",
253 | "13 Team B Away Win\n",
254 | "14 Team C Home Loss\n",
255 | "15 Team C Away Draw\n",
256 | "16 Team D Home Win\n",
257 | "17 Team D Away Draw\n",
258 | "18 Team E Home Draw\n",
259 | "19 Team E Away Win"
260 | ]
261 | },
262 | "execution_count": 2,
263 | "metadata": {},
264 | "output_type": "execute_result"
265 | }
266 | ],
267 | "source": [
268 | "from itertools import product\n",
269 | "import pandas as pd\n",
270 | "import random\n",
271 | "random.seed(1)\n",
272 | "opponents = ['Team '+chr(ord('A') + i) for i in range(5)]\n",
273 | "stadiums = ['Home', 'Away']\n",
274 | "games = pd.DataFrame(list(product(opponents, stadiums))*2,\n",
275 | " columns=['opponent', 'stadium'])\n",
276 | "games['result'] = random.choices([\"Win\", \"Loss\", \"Draw\"],\n",
277 | " k=len(games))\n",
278 | "games"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "1. What is the entropy of the `result` $H(result)$ (ignoring all other variables)?\n",
286 | "1. What are the average conditional entropies $H(result | stadium)$ and $H(result | opponent)$?\n",
287 | "1. Which of the two variables is more important in deciding the result of a game? Answer this question by calculating the information gain for the two variables: $IG(result; stadium)$ and $IG(result;opponent)$."
288 | ]
289 | },
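{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (reusing an entropy helper like the one sketched in Exercise 2):\n",
"# H_result = entropy(games['result'])\n",
"# H_given_stadium = sum(len(g) / len(games) * entropy(g['result']) for _, g in games.groupby('stadium'))\n",
"# H_given_opponent = sum(len(g) / len(games) * entropy(g['result']) for _, g in games.groupby('opponent'))\n",
"# The information gains are H_result - H_given_stadium and H_result - H_given_opponent."
]
},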
290 | {
291 | "cell_type": "markdown",
292 | "metadata": {},
293 | "source": [
294 | "# Exercise 4\n",
295 | "\n",
296 | "(entropy, implementation or analysis)\n",
297 | "\n",
298 | "Consider the random variable $C$ \"a person has a cold\" and the random variable $T$ \"outside temperature\". The joint distribution of the two variables is given below:"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": 3,
304 | "metadata": {},
305 | "outputs": [
306 | {
307 | "data": {
308 | "text/html": [
309 | "
\n",
310 | "\n",
323 | "
\n",
324 | " \n",
325 | "
\n",
326 | "
\n",
327 | "
T_Sunny
\n",
328 | "
T_Rainy
\n",
329 | "
T_Snowy
\n",
330 | "
\n",
331 | " \n",
332 | " \n",
333 | "
\n",
334 | "
C_No
\n",
335 | "
0.30
\n",
336 | "
0.20
\n",
337 | "
0.1
\n",
338 | "
\n",
339 | "
\n",
340 | "
C_Yes
\n",
341 | "
0.05
\n",
342 | "
0.15
\n",
343 | "
0.2
\n",
344 | "
\n",
345 | " \n",
346 | "
\n",
347 | "
"
348 | ],
349 | "text/plain": [
350 | " T_Sunny T_Rainy T_Snowy\n",
351 | "C_No 0.30 0.20 0.1\n",
352 | "C_Yes 0.05 0.15 0.2"
353 | ]
354 | },
355 | "execution_count": 3,
356 | "metadata": {},
357 | "output_type": "execute_result"
358 | }
359 | ],
360 | "source": [
361 | "import pandas as pd\n",
362 | "d = pd.DataFrame({'T_Sunny': [0.3, 0.05], \n",
363 | " 'T_Rainy': [0.2, 0.15], \n",
364 | " 'T_Snowy': [0.1, 0.2]}, \n",
365 | " index=['C_No', 'C_Yes'])\n",
366 | "d"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "1. Plot the pmf of $C$ and $T$.\n",
374 | "1. Calculate $H(C)$, $H(T)$.\n",
375 | "1. Calculate $H(C|T)$, $H(T|C)$. Does the temperature (T) reduce the uncertainty regarding someone having a cold (C)?"
376 | ]
377 | },
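{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (one possible approach, working directly on the joint table `d`):\n",
"# import numpy as np\n",
"# p_C = d.sum(axis=1)      # marginal pmf of C\n",
"# p_T = d.sum(axis=0)      # marginal pmf of T\n",
"# H = lambda p: -(p * np.log2(p)).sum()\n",
"# H_C, H_T = H(p_C), H(p_T)\n",
"# H_C_given_T = sum(p_T[t] * H(d[t] / p_T[t]) for t in d.columns)"
]
},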
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": [
382 | "# Exercise 5\n",
383 | "\n",
384 | "(decision tree, implementation)\n",
385 | "\n",
386 | "Consider the Boolean expression $A \\lor (B \\land C)$. The corresponding truth table can be generated with:"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": 4,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": [
395 | "from itertools import product\n",
396 | "X = [list(c) for c in product([0,1], repeat=3)]\n",
397 | "y = [A or (B and C) for A, B, C in X]"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "1. Fit a decision tree classifier on the truth table above and visualise the resulting tree. Make sure to use the entropy as a metric.\n",
405 | "1. Is the tree above optimal? Can you find a decision tree with fewer levels or nodes that correctly represents this function?"
406 | ]
407 | }
408 | ],
409 | "metadata": {
410 | "kernelspec": {
411 | "display_name": "Python 3 (ipykernel)",
412 | "language": "python",
413 | "name": "python3"
414 | },
415 | "language_info": {
416 | "codemirror_mode": {
417 | "name": "ipython",
418 | "version": 3
419 | },
420 | "file_extension": ".py",
421 | "mimetype": "text/x-python",
422 | "name": "python",
423 | "nbconvert_exporter": "python",
424 | "pygments_lexer": "ipython3",
425 | "version": "3.12.3"
426 | }
427 | },
428 | "nbformat": 4,
429 | "nbformat_minor": 4
430 | }
431 |
--------------------------------------------------------------------------------
/Lab05.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "0de207cb-ff2c-498a-9e5b-f649ee51ee40",
6 | "metadata": {},
7 | "source": [
8 | "# Naive Bayes\n",
9 | "\n",
10 | "Here is a step by step explanation of the algorithm: https://youtu.be/O2L2Uv9pdDA\n",
11 | "\n",
12 | "Bayesian classifiers and in particular the naive Bayes classifier are a family of probabilistic classification algorithms particularly suited to problems like text classification.\n",
13 | "\n",
14 | "When to use it:\n",
15 | "\n",
16 | "* The target function $f$ takes value from a finite set $V=\\{v_1,...,v_k\\}$\n",
17 | "* Moderate or large training data set is available\n",
18 | "* The attributes $$ that describes instances are conditionally independent with respect to the given classification:\n",
19 | "\n",
20 | "$$P(a_1,a_2,...,a_n|v_j)=\\prod_i P(a_i|v_j)$$\n",
21 | "\n",
22 | "The most probable value of $f(x)$ is:\n",
23 | "\n",
24 | "\\begin{align}\n",
25 | "v_{MAP} &= \\mbox{argmax}_{v_j \\in V}P(v_j|a_1,a_2,...,a_n) \\\\\n",
26 | " &= \\mbox{argmax}_{v_j \\in V}\\frac{P(a_1,a_2,...,a_n|v_j)P(v_j)}{P(a_1,a_2,...,a_n)}\\\\\n",
27 | " &= \\mbox{argmax}_{v_j \\in V} P(a_1,a_2,...,a_n|v_j)P(v_j)\\\\\n",
28 | " &= \\mbox{argmax}_{v_j \\in V} \\prod_i P(a_i|v_j)P(v_j)\n",
29 | "\\end{align}\n",
30 | "\n",
31 | "where MAP stands for [_maximum a posteriori probability_](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation).\n",
32 | "\n",
33 | "As an example, let's consider a simplified dataset of only 12 messages, 8 of which are spam. For each message, only consider the words \"study\", \"free\" and \"money\":"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 14,
39 | "id": "167d846e-9dc3-4db9-a485-7417639ba786",
40 | "metadata": {},
41 | "outputs": [
42 | {
43 | "data": {
44 | "text/html": [
45 | "