11 |
12 | A description of the datasets and the file format can be found on this link.
13 | As the description states, do NOT use data from the last column (i.e., the class labels) in your calculations.
14 | In these files, all columns except for the last one contain example inputs. The last column contains the class label.
15 |
16 | ### Output
17 |
18 | The output of your code should contain one line for each dimension of each class. Such a line should look like this:
19 | Class %d, dimension %d, mean = %.2f, variance = %.2f
20 |
21 | In your answers.pdf document, provide the output produced by your program when given yeast_training.txt as the input file.
22 |
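As a rough illustration, here is a minimal Python sketch of this computation (numpy assumed; dimensions are numbered from 0 here, so adjust if the assignment counts from 1, and check whether the population or sample variance is required):

```python
import numpy as np

def report_gaussian_stats(path):
    """Print the mean and variance of each dimension, separately per class."""
    data = np.loadtxt(path)                      # last column = class label
    features, labels = data[:, :-1], data[:, -1].astype(int)
    for c in sorted(set(labels)):
        rows = features[labels == c]
        for d in range(rows.shape[1]):
            # Population variance (divide by N); use ddof=1 for the sample variance.
            print("Class %d, dimension %d, mean = %.2f, variance = %.2f"
                  % (c, d, rows[:, d].mean(), rows[:, d].var()))

# report_gaussian_stats("yeast_training.txt")
```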
--------------------------------------------------------------------------------
/01_MATLAB/01_Descision_Tree_and_Random_Forest/code/choose_attribute.m:
--------------------------------------------------------------------------------
1 | function[best_attribute,best_threshold,max_gain]=choose_attribute(data,attributes,option)
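% CHOOSE_ATTRIBUTE returns the attribute/threshold split with the highest
% information gain. Option 'randomized' samples one attribute at random and
% tries 50 evenly spaced thresholds; option 'optimized' searches every
% attribute exhaustively.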
2 | max_gain = -1;
3 | if isequal(option,'randomized')
4 | attribute = datasample(attributes,1,'Replace',true);
5 | L = min(data(:,attribute));
6 | M = max(data(:,attribute));
7 | for K=1:50
8 | threshold= L + K*(M - L)/51;
9 | gain=get_info_gain(data, attribute, threshold);
10 | if gain>max_gain
11 | max_gain=gain;
12 | best_attribute=attribute;
13 | best_threshold=threshold;
14 | end
15 | end
16 | %disp(best_attribute)
17 |
18 | elseif isequal(option,'optimized')
19 |
20 | for attribute = 1:size(data, 2)-1
21 | L=min(data(:, attribute));
22 | M=max(data(:, attribute));
23 | for K = 1:50
24 |             threshold = L + K*(M-L)/51;
25 | [gain] = get_info_gain(data, attribute, threshold);
26 | if gain > max_gain
27 | max_gain = gain;
28 | best_attribute = attribute;
29 | best_threshold = threshold;
30 | end
31 | end
32 | end
33 |
34 | end
35 |
36 | end
--------------------------------------------------------------------------------
/02_PYTHON/02_Gaussian_Naive_Bayes/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 |
3 | You must implement a program that learns a naive Bayes classifier for a classification problem, given some training data and some additional options.
4 | In particular, your program will be invoked as follows:
5 | - naive_bayes <training_file> <test_file> gaussians
6 |
7 | ### Training: Gaussians
8 |
9 | If the third command-line argument is gaussians, then you should model P(x | class) as a Gaussian separately for each dimension of the data.
10 | The output of the training phase should be a sequence of lines like this:
11 | Class %d, attribute %d, mean = %.2f, std = %.2f
12 | The output lines should be sorted by class number.
13 | Within the same class, lines should be sorted by attribute number.
14 | Attributes should be numbered starting from 0, not from 1.
15 | In certain cases, it is possible that the value computed for the standard deviation is equal to zero.
16 | Your code should make sure that the variance of the Gaussian is NEVER smaller than 0.0001.
17 | Since the variance is the square of the standard deviation, this means that the standard deviation should never be smaller than sqrt(0.0001) = 0.01.
18 | Any time the value for the standard deviation is computed to be smaller than 0.01, your code should replace that value with 0.01.
19 |
20 | In your answers.pdf document, provide the output produced by the training stage of your program when given yeast_training.txt as the input file.
21 |
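A minimal sketch of the training stage in Python, assuming numpy and whitespace-separated input (the function name is illustrative, not part of the spec):

```python
import numpy as np

def train_gaussians(path):
    """Fit one Gaussian per (class, attribute); floor the std at 0.01."""
    data = np.loadtxt(path)
    X, y = data[:, :-1], data[:, -1].astype(int)
    for c in sorted(set(y)):
        rows = X[y == c]
        for a in range(rows.shape[1]):           # attributes numbered from 0
            std = max(rows[:, a].std(), 0.01)    # variance never below 0.0001
            print("Class %d, attribute %d, mean = %.2f, std = %.2f"
                  % (c, a, rows[:, a].mean(), std))
```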
--------------------------------------------------------------------------------
/02_PYTHON/06_Error_Function_and_Regularisation/README.md:
--------------------------------------------------------------------------------
1 | # Calculating Error Function #
2 |
3 |
4 | We have a dataset of 40 training examples.
5 | The i-th training example is denoted as (xi, ti), where xi is the example input and ti is the target output.
6 | The training inputs xi can be downloaded from training_inputs1.txt.
7 | Each xi is a three-dimensional vector denoted as (xi,1, xi,2, xi,3).
8 | In file training_inputs1.txt, the number at row i and column j is the value for xi,j.
9 | The target outputs ti can be downloaded from training_outputs1.txt.
10 | Each ti is a real number. Row i of training_outputs1.txt contains the value for ti.
11 |
12 | Error function formula (the standard sum-of-squares error):
13 | E(w) = (1/2) * sum over i of (y(xi, w) - ti)^2
14 |
15 | Let w be a three-dimensional vector (w1, w2, w3).
16 | Define y(xi, w) as follows: y(xi, w) = w1 * xi,1 + w2 * xi,2 + w3 * xi,3.
17 | Part a: If w = (3, -1.5, -2), evaluate E(w)
18 | Part b: If w = (5.2, -2, 1), evaluate E(w)
19 |
20 | Regularisation formula (sum-of-squares error plus a quadratic penalty):
21 | E~(w) = (1/2) * sum over i of (y(xi, w) - ti)^2 + (λ/2) * ||w||^2
22 |
23 | Part c: If w = (3, -1.5, -2) and λ = 0.25, evaluate the alternative error
24 | Part d: If w = (5.2, -2, 1) and λ = 0.25, evaluate the alternative error
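A small Python sketch of Parts a–d under the formulas above (numpy assumed):

```python
import numpy as np

X = np.loadtxt("training_inputs1.txt")    # shape (40, 3): one xi per row
t = np.loadtxt("training_outputs1.txt")   # shape (40,): one ti per row

def error(w, lam=0.0):
    """Sum-of-squares error, plus the quadratic penalty when lam > 0."""
    w = np.asarray(w, dtype=float)
    residuals = X @ w - t                 # y(xi, w) - ti for every i
    return 0.5 * np.sum(residuals ** 2) + 0.5 * lam * np.dot(w, w)

for w in [(3, -1.5, -2), (5.2, -2, 1)]:
    print(w, "E =", error(w), "regularised E =", error(w, lam=0.25))
```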
--------------------------------------------------------------------------------
/02_PYTHON/05_Fitting_2D_Gaussian_to_data/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | Write code that fits 2-dimensional Gaussians to data.
3 | The input file will be a single text file, like the text files in the UCI datasets directory.
4 | A description of the datasets and the file format can be found on this link.
5 | Your code will be given as a command-line argument the path of a text file.
6 | This text file could be any of the six files in the UCI datasets directory, but it could also be ANY OTHER file using the same format as the files in the datasets directory.
7 |
8 |
13 | You should only fit a 2D Gaussian to the first two dimensions of the data. You can ignore the other dimensions.
14 |
15 | ### Output
16 | The output of your code should contain one line for each class. Such a line should look like this:
17 | Class %d, mean = [%.2f, %.2f], sigma = [%.2f, %.2f, %.2f, %.2f]
18 | Note that, in the above output sample, %d is a placeholder for an integer, and %.2f is a placeholder for a number with two decimal digits, following the Java and C printf conventions. With sigma we denote the covariance matrix.
19 | The values of sigma should be printed in this order: [(row=1 col=1), (row=1 col=2), (row=2 col=1), (row=2 col=2)].
20 | In your answers.pdf document, provide the output produced by your program when given satellite_training.txt as the input file.
21 |
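A possible Python sketch (numpy assumed); whether the covariance divides by N or N-1 is a convention to check against the course's definition:

```python
import numpy as np

def fit_2d_gaussians(path):
    """Fit a 2-D Gaussian to the first two dimensions, separately per class."""
    data = np.loadtxt(path)
    X, y = data[:, :2], data[:, -1].astype(int)   # ignore other dimensions
    for c in sorted(set(y)):
        rows = X[y == c]
        mu = rows.mean(axis=0)
        sigma = np.cov(rows, rowvar=False, bias=True)  # 2x2, divides by N
        print("Class %d, mean = [%.2f, %.2f], sigma = [%.2f, %.2f, %.2f, %.2f]"
              % (c, mu[0], mu[1], sigma[0, 0], sigma[0, 1], sigma[1, 0], sigma[1, 1]))

# fit_2d_gaussians("satellite_training.txt")
```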
--------------------------------------------------------------------------------
/02_PYTHON/07_Frequentist_Estimate/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | Implement a simulation, where you estimate the probability of a binary event using a frequentist approach.
3 | The data will be generated by your code using a specific probability distribution, but then your code will ESTIMATE that distribution based on the data that was generated.
4 | Naturally, the estimated distribution will probably not be identical to the true distribution that was used.
5 | First, your code generates a random string S whose length is 3100 characters: S = c1, c2, ..., c3100.
6 | The length of 3100 is based on the "snow in January" example covered in class (it corresponds to 100 years of weather records of January days).
7 | To generate the string S, follow these guidelines:
8 |
9 | Each character ci is either character 'a' or character 'b'.
10 | Each character c should be chosen randomly, so that the prior p(c = 'a') = 0.1. To do this, your code should:
11 | Generate a random number, drawn from a uniform distribution between 0 and 1.
12 | If that random number is less than or equal to 0.1, then the character should be set to 'a'. If the random number is greater than 0.1, the character should be set to 'b'.
13 | You should make sure that the choice of any character is independent from the choice of any other character. In other words: if i != j, then P(ci = 'a' | cj = 'a') = P(ci = 'a').
14 | After you generate the string S, you should estimate (using the frequentist approach) the probability p(c = 'a'), based on the characters of S. At the end, your code should print out the estimated probability.
15 | The program output should follow EXACTLY this format:
16 | p(c = 'a') = %.4f
17 |
--------------------------------------------------------------------------------
/01_MATLAB/04_Principal_Component_Analysis/code/main.m:
--------------------------------------------------------------------------------
1 | function [] = main(train_file, test_file, M, iterations)
2 |
3 | %===================================
4 | % Loading and Initialising Data
5 | %===================================
6 | train_data = load(train_file);
7 | train_data = double(train_data);
8 | test_data = load(test_file);
9 | test_data = test_data(:, 1:end-1);
10 | U = zeros(size(train_data, 2)-1, M);
11 | data = train_data(:, 1:end-1);
12 |
13 | %===================================
14 | % Calculating Eigen Vector - U
15 | %===================================
16 | for d = 1:M
17 |
18 | covariance = cov(data, 1);
19 | U(1:end, d) = power_method(covariance, iterations, size(data, 2));
20 | data = compute_X(data, U(1:end, d));
21 |
22 | end
23 |
24 | %===================================
25 | % Printing Eigen Vector - U
26 | %===================================
27 | for i = 1:M
28 | fprintf('Eigenvector %d \n', i);
29 | for j = 1:size(train_data, 2)-1
30 | fprintf('%d : %.4f \n', j, U(j, i));
31 | end
32 | end
33 |
34 | %=========================================================
35 | % Calculating Projection Matrix and Projection Values
36 | %=========================================================
37 | proj_mat = transpose(U);
38 | proj_value = proj_mat * transpose(test_data);
39 |
40 | %===================================
41 | % Displaying Results
42 | %===================================
43 | for i = 1: size(test_data, 1)
44 | fprintf('Test object %d \n', i-1);
45 | for j = 1:M
46 | fprintf('%d : %.4f\n', j, proj_value(j, i));
47 | end
48 | end
49 | end
50 |
--------------------------------------------------------------------------------
/01_MATLAB/01_Descision_Tree_and_Random_Forest/code/get_info_gain.m:
--------------------------------------------------------------------------------
1 | function [gain] = get_info_gain(data, attribute, threshold)
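% GET_INFO_GAIN computes the information gain of splitting DATA on
% ATTRIBUTE at THRESHOLD: the entropy of the full class distribution
% minus the size-weighted entropies of the two resulting subsets.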
2 | target = data(:,end);
3 | distribution = histc(target,unique(target));
4 | distribution = distribution / size(target,1);
5 | class_gain = 0;
6 |
7 | for i = 1:size(distribution, 1)
8 | if distribution(i) > 0
9 | class_gain = class_gain + distribution(i) * log2(distribution(i));
10 | end
11 | end
12 |
13 | class_gain = (-1)*class_gain;
14 |
15 | info_gain_1 = zeros(size(unique(target)));
16 | info_gain_2 = zeros(size(unique(target)));
17 |
18 | find_1 = data(1:end, attribute) < threshold;
19 | find_2 = data(1:end, attribute) >= threshold;
20 |
21 | left_data = data(find_1, :);
22 | right_data = data(find_2, :);
23 |
24 | left_class = histc(left_data(:, end), unique(target));
25 | right_class = histc(right_data(:, end), unique(target));
26 |
27 | if isrow(left_class)
28 | left_class = transpose(left_class);
29 | end
30 |
31 | if isrow(right_class)
32 | right_class = transpose(right_class);
33 | end
34 |
35 | left_probab = left_class / sum(left_class);
36 | left_probab(isnan(left_probab)) = 0;
37 |
38 | right_probab = right_class / sum(right_class);
39 | right_probab(isnan(right_probab)) = 0;
40 |
41 | for class = 1: size(unique(target), 1)
42 | info_gain_1(class, 1) = left_probab(class, 1)*(log2(left_probab(class, 1)));
43 | end
44 |
45 | info_gain_1(isnan(info_gain_1)) = 0;
46 | info_gain_1 = (-1)*info_gain_1;
47 | ig_1 = sum(info_gain_1);
48 |
49 | for class = 1: size(unique(target), 1)
50 | info_gain_2(class, 1) = right_probab(class, 1)*(log2(right_probab(class, 1)));
51 | end
52 |
53 | info_gain_2(isnan(info_gain_2)) = 0;
54 | info_gain_2 = (-1)*info_gain_2;
55 | ig_2 = sum(info_gain_2);
56 |
57 | entropy = (sum(left_class)/size(data, 1))*ig_1 + (sum(right_class)/size(data, 1))*ig_2;
58 |
59 | gain = class_gain - entropy;
60 |
61 |
62 | end
63 |
--------------------------------------------------------------------------------
/01_MATLAB/06_Linear_Regression/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | In this task you will implement linear regression.
3 | The data used is in the Data folder.
4 |
5 | ### Command-line Arguments
6 |
7 | You must implement a program that uses linear regression to fit a line or a second-degree polynomial to a set of training data.
8 | Your program can be invoked as follows:
9 | linear_regression <training_file> <degree> <λ>
10 | The arguments provide to the program the following information:
11 | The first argument is the path name of the training file, where the training data is stored.
12 | The path name can specify any file stored on the local computer.
13 | The second argument is a number. This number should be either 1 or 2.
14 | We will not test your code with any other values. If the number is 1, you should fit a line to the data.
15 | If the number is 2, you should fit a second-degree polynomial to the data.
16 | The third argument is a non-negative real number (zero or greater).
17 | This is the value of λ that you should use for regularization. If λ = 0, then no regularization is used.
18 | The training file is a text file, containing data in tabular format. Each value is a number, and values are separated by white space.
19 | Each row contains two numbers: the first of those numbers is the training input, and the second of those numbers is the target output.
20 |
21 | ### Output
22 | At the end, your program should print out the values of the weights that you have estimated.
23 |
24 | For C or C++, use:
25 | printf("w0=%.4lf\n", w0);
26 | printf("w1=%.4lf\n", w1);
27 | printf("w2=%.4lf\n", w2);
28 | For any other language, just make sure that you use formatting specifiers that produce aligned output that matches EXACTLY the specs given above for C. Note that you print the value of w2 even when the command-line degree argument is 1. In that case, just print 0 for w2.
29 |
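One way to compute the weights is the regularised normal equations, w = (λI + ΦᵀΦ)⁻¹Φᵀt, where each row of Φ is (1, x) for degree 1 and (1, x, x²) for degree 2. A minimal Python sketch under that assumption:

```python
import numpy as np

def linear_regression(path, degree, lam):
    data = np.loadtxt(path)
    x, t = data[:, 0], data[:, 1]
    # Design matrix: column d holds x^d, for d = 0..degree.
    phi = np.column_stack([x ** d for d in range(degree + 1)])
    w = np.linalg.solve(lam * np.eye(degree + 1) + phi.T @ phi, phi.T @ t)
    for i in range(3):   # always print w0, w1, w2 (w2 = 0 for degree 1)
        print("w%d=%.4f" % (i, w[i] if i < len(w) else 0.0))

# linear_regression("sample_data1.txt", 2, 0.0)
```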
30 | ### Running the Program
31 | Run linear_regression.m
32 | In command Window type :
33 | - linear_regression('sample_data1.txt', degree, lambda)
34 |
--------------------------------------------------------------------------------
/01_MATLAB/01_Descision_Tree_and_Random_Forest/code/make_tree.m:
--------------------------------------------------------------------------------
1 | function[tree,threshold,gain] = make_tree(data,pruning_threshold,option,attributes,class_max,tree,threshold,gain,index)
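% MAKE_TREE recursively grows a binary decision tree stored heap-style in
% parallel arrays: node INDEX has children 2*INDEX and 2*INDEX+1. Leaves
% (pruned or pure nodes) are marked with threshold = -1 and gain = -1; TREE
% holds the predicted class at leaves and the split attribute elsewhere.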
2 | target=(data(:,end));
3 | if size(data,1) <= pruning_threshold
4 | distribution=zeros(class_max,1);
5 | target = data(1:end,end);
6 |
7 | for j=1:size(distribution,1)
8 | column = find(target==j);
9 | probclass=size(column,1)/size(target,1);
10 | distribution(j,1)=probclass;
11 | end
12 |
13 | [m,i]=max(distribution);
14 |
15 | tree(index)=i;
16 |
17 | threshold(index)=-1;
18 |
19 | gain(index)=-1;
20 |
21 | elseif size(unique(target),1)==1
22 |
23 | tree(index)=target(1, 1);
24 | threshold(index)=-1;
25 | gain(index)=-1;
26 |
27 | else
28 | [best_attri,best_thresh,best_gain]= choose_attribute(data, attributes,option);
29 |
30 | tree(index)=best_attri;
31 |
32 | threshold(index)=best_thresh;
33 |
34 | gain(index)=best_gain;
35 |
36 | ryt=1;
37 |
38 | left=1;
39 |
40 | for i=1:size(data,1)
41 | if data(i,best_attri) >= best_thresh
42 | rightdata(ryt,:)=data(i,1:end);
43 | ryt = ryt + 1;
44 | else
45 | leftdata(left,:) = data(i,1:end);
46 | left = left + 1;
47 | end
48 | end
49 |
50 |     if exist('leftdata', 'var')
51 | [tree,threshold,gain]= make_tree(leftdata,pruning_threshold,option,attributes,class_max,tree,threshold,gain,2*index);
52 |
53 | else
54 | tree(2*index)=1;
55 | threshold(2*index)=-1;
56 | gain(2*index)=-1;
57 | end
58 |
59 |         if exist('rightdata', 'var')
60 | [tree,threshold,gain] = make_tree(rightdata,pruning_threshold,option,attributes,class_max,tree,threshold,gain,(2*index)+1);
61 | else
62 | tree((2*index)+1)=1;
63 | threshold((2*index)+1)=-1;
64 | gain((2*index)+1)=-1;
65 | end
66 | end
67 | end
--------------------------------------------------------------------------------
/01_MATLAB/08_K_Mean_Clustering/README.md:
--------------------------------------------------------------------------------
1 | ### Task 1
2 | In this task you will implement k-means clustering.
3 |
4 | ### Command-line Arguments
5 | Your program will be invoked as follows:
6 | k_means_cluster <data_file> <k> <iterations>
7 | The arguments provide to the program the following information:
8 | The first argument, <data_file>, is the path name of a file where the data is stored.
9 | The path name can specify any file stored on the local computer.
10 | The second argument, <k>, specifies the number of clusters.
11 | The third argument, <iterations>, specifies the number of iterations of the main loop.
12 | The initialization stage (giving a random assignment of objects to clusters, and computing the means of those random assignments) does not count as an iteration.
13 | The data file will follow the same format as the training and test files in the UCI datasets directory.
14 | A description of the datasets and the file format can be found on this link.
15 | Your code should also work with ANY OTHER training and test files using the same format as the files in the UCI datasets directory.
16 | As the description states, do NOT use data from the last column (i.e., the class labels) as features.
17 | In these files, all columns except for the last one contain example inputs. The last column contains the class label.
18 | The link to the dataset is given below:
19 |
20 |
23 |
24 | ### Implementation Guidelines
25 | Use the L2 distance (the Euclidean distance) for computing the distance between any two objects in the dataset.
26 |
27 | ### Output
28 |
29 | After the initialization stage, and after each iteration, you should print the value E(S1,S2,...,SK), as defined in page 27 of the clustering slides.
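(For reference: judging from the accompanying code/main.m, E(S1,S2,...,SK) is computed as the sum, over all objects, of the Euclidean distance from each object to the mean of its cluster, i.e. E = sum over clusters k, and over objects xi in Sk, of ||xi - mk||. Treat this as an assumption and verify it against the slides.)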
30 | The output should follow this format:
31 |
32 | After initialization: error = %.4f
33 | After iteration 1: error = %.4f
34 | After iteration 2: error = %.4f
35 | ...
36 |
37 | ### Running The Program
38 | 1) open main.m file
39 | 2) type main('yeast_test.txt', 2, 5) in command window
40 | 3) Press Enter
41 |
--------------------------------------------------------------------------------
/01_MATLAB/05_K_Nearest_Neigbour/code/knn.m:
--------------------------------------------------------------------------------
1 | function [] = knn(train_data, test_data, train_target, test_target, k)
2 | classification_accuracy = 0;
3 | for i = 1:size(test_data, 1)
4 | %============================================
5 | % Calculating Euclidean Distance
6 | %============================================
7 | D = test_data(i, :) - train_data(: , :);
8 | D = D.^2;
9 | dist_mat = sum(D, 2);
10 | dist_mat = sqrt(dist_mat);
11 | dist = [dist_mat train_target];
12 | %============================================
13 | % Sorting Row according to minimum distance
14 | %============================================
15 | dist = sortrows(dist, 1);
16 | %============================================
17 | % If K value is 1 Print Results
18 | %============================================
19 | if k == 1
20 | k_neighbours = dist(k, :);
21 | predicted = k_neighbours(1, 2);
22 | true = test_target(i, 1);
23 | if true == predicted
24 | accuracy = 1;
25 | classification_accuracy = classification_accuracy + accuracy;
26 | else
27 | accuracy = 0;
28 | end
29 | fprintf('ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f \n', i, predicted, true, accuracy)
30 | %===============================================
31 | % Else K value is greater then 1 Print Results
32 | %===============================================
33 | else
34 | k_neighbours = dist(1:k, :);
35 |         if numel(unique(k_neighbours(:, 2))) == 1
36 | predicted = unique(k_neighbours(:, 2));
37 |         elseif numel(unique(k_neighbours(:, 2))) == k
38 | predicted = k_neighbours(1, 2);
39 | else
40 |             predicted = mode(k_neighbours(:, 2));
41 | end
42 | true = test_target(i, 1);
43 | if true == predicted
44 | accuracy = 1;
45 | classification_accuracy = classification_accuracy + accuracy;
46 | else
47 | accuracy = 0;
48 | end
49 | fprintf('ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f \n', i, predicted, true, accuracy)
50 | end
51 | end
52 | fprintf('classification_accuracy=%6.4f \n', classification_accuracy/size(test_target, 1))
53 | end
54 |
--------------------------------------------------------------------------------
/01_MATLAB/01_Descision_Tree_and_Random_Forest/code/main.m:
--------------------------------------------------------------------------------
1 | function [] = main(string, training_file, testing_file, option, pruning_thr)
2 | there = strfind(option,'forest');
3 | if ~isempty(there)
4 | %disp('in if')
5 | option = 'forest3';
6 | end
7 | if isequal(option,'forest3')
8 | random(training_file, testing_file, 'randomized', pruning_thr);
9 |
10 | else
11 | %train_file = 'pendigits_training.txt';
12 | %test_file = 'pendigits_test.txt';
13 | train_file = training_file;
14 | test_file = testing_file;
15 | train_data = load(train_file);
16 | test_data = load(test_file);
17 | target = train_data(:, end);
18 | test_target = test_data(:, end);
19 | unique_class = unique(target);
20 | pruning_thrsld = pruning_thr;
21 | class_max = max(target);
22 | tree = [];
23 | thrsldeshold = [];
24 | gainin = [];
25 | index = 1;
26 | classification_acc = 0;
27 |
28 | attributes = zeros(1, size(train_data, 2)-1);
29 |
30 | for col = 1: size(train_data, 2)-1
31 | attributes(1, col) = col;
32 | end
33 |
34 | %attributes = [1, 2, 3, 4, 5 ,6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16];
35 |
36 | [tree,thrsldeshold,gainin] = make_tree(train_data,pruning_thrsld,option,attributes,class_max,tree,thrsldeshold,gainin,index);
37 |
38 | loop_end = size(tree,2);
39 |
40 | for i=1:loop_end
41 | if (tree(:,i)-1) ~= -1
42 | fprintf('tree=%2d, node=%3d, feature=%2d, thr=%6.2f, gain=%f\n',0,i,tree(:,i)-1,thrsldeshold(:,i),gainin(:,i));
43 | end
44 | end
45 |
46 | loop_end = size(test_data,1);
47 | for row=1:loop_end
48 | index=1;
49 | is_leaf=1;
50 | while is_leaf == 1
51 | attr=tree(index);
52 | thrsld=thrsldeshold(index);
53 | gain=gainin(index);
54 | if thrsld~=-1 && gain~=-1
55 | if test_data(row,attr) < thrsld
56 | index = (2*index);
57 | else
58 | index = (2*index)+1;
59 | end
60 | else
61 | test_label = test_data(row,end);
62 | if attr~=test_label
63 | acc = 0;
64 | else
65 | acc = 1;
66 | end
67 | classification_acc=classification_acc+acc;
68 | fprintf('ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f\n', row, attr, test_label, acc);
69 | is_leaf=0;
70 |
71 | end
72 | end
73 | end
74 |
75 | fprintf('classification accuracy=%6.4f\n',classification_acc/size(test_data,1));
76 |
77 | end
78 |
79 | end
80 |
81 |
--------------------------------------------------------------------------------
/01_MATLAB/08_K_Mean_Clustering/code/main.m:
--------------------------------------------------------------------------------
1 | function [] = main(file, k, iterations)
2 | %=======================================
3 | % Initialising Data
4 | %=======================================
5 | % k = 3;
6 | % iterations = 20;
7 | % file = 'yeast_test.txt';
8 | data = load(file);
9 | rows = size(data, 1);
10 | cols = size(data, 2);
11 | data = data(1:rows, 1:cols-1);
12 | clusters = randi([1 k], rows, 1);
13 | clustered_data = [data clusters];
14 | mean_matrix = zeros(k, cols-1);
15 |
16 | %=======================================
17 | % Calculating Mean
18 | %=======================================
19 | for i = 1:k
20 | index = clustered_data(:, end) == i;
21 | indexed_data = clustered_data(index, 1:end-1);
22 | mean_matrix(i, :) = mean(indexed_data);
23 |
24 | end
25 |
26 | %=======================================
27 | % Computing Error
28 | %=======================================
29 | error = get_error(clustered_data, mean_matrix);
30 | fprintf('After initialization: error = %.4f \n', error);
31 |
32 | %=======================================
33 | % Starting Iterations
34 | %=======================================
35 | for p = 1:iterations
36 | for q = 1:rows
37 | %=======================================
38 | % Deciding which Cluster data belongs
39 | %=======================================
40 | dist = get_euclidean(data(q, :), mean_matrix);
41 |         [min_dist, cluster_index] = min(dist);
42 |         clusters(q) = cluster_index;
43 | end
44 | clustered_data = [data, clusters];
45 |
46 | %=======================================
47 | % Calculating Mean
48 | %=======================================
49 | for i = 1:k
50 | index = clustered_data(:, end) == i;
51 | indexed_data = clustered_data(index, 1:end-1);
52 | mean_matrix(i, :) = mean(indexed_data);
53 | end
54 | %=======================================
55 | % Computing Error
56 | %=======================================
57 | error = get_error(clustered_data, mean_matrix);
58 | fprintf('After iteration %d: error = %.4f \n', p, error);
59 | end
60 | end
61 |
62 | function [error] = get_error(data, mean_matrix)
63 | %=======================================
64 | % Computing Error
65 | %=======================================
66 | error = 0;
67 | for j = 1: size(data, 1)
68 | c = data(j, end);
69 | dist = get_euclidean(data(j, 1:end-1), mean_matrix(c, 1:end));
70 | error = error + dist;
71 | end
72 | end
73 |
74 | function [distance_matrix] = get_euclidean(data, mean_matrix)
75 | %=======================================
76 | % Calculating Euclidean
77 | %=======================================
78 | dist = data - mean_matrix;
79 | dist = dist.^2;
80 | dist = sum(dist, 2);
81 | distance_matrix = sqrt(dist);
82 | end
--------------------------------------------------------------------------------
/01_MATLAB/10_Dynamic_Time_Warping/code/dtw.m:
--------------------------------------------------------------------------------
1 | function [] = dtw(train_data_col1, train_data_col2, train_class, train_length, test_data_col1, test_data_col2, test_class, test_length)
2 | classification_accuracy = 0;
3 | %===============================================
4 | % For every test object
5 | %===============================================
6 | for i = 1:test_length
7 | x = test_data_col1(i, :);
8 | x = transpose(x);
9 | y = test_data_col2(i, :);
10 | y = transpose(y);
11 | test_cord = [x, y];
12 | test_cord(all(test_cord == 0, 2), :)=[];
13 | n = size(test_cord, 1);
14 | %===============================================
15 | % For every train object
16 | %===============================================
17 |     for j = 1:train_length
18 | x = train_data_col1(j, :);
19 | x = transpose(x);
20 | y = train_data_col2(j, :);
21 | y = transpose(y);
22 | train_cord = [x, y];
23 | train_cord(all(train_cord == 0, 2), :)=[];
24 | m = size(train_cord, 1);
25 | c = zeros(m, n);
26 | x1 = train_cord(1, 1);
27 | x2 = test_cord(1, 1);
28 | y1 = train_cord(1, 2);
29 | y2 = test_cord(1, 2);
30 | c(1, 1) = dist(x1, x2, y1, y2);
31 | %===============================================
32 | % Filling the first column
33 | %===============================================
34 | for k = 2:m
35 | x1 = train_cord(k, 1);
36 | x2 = test_cord(1, 1);
37 | y1 = train_cord(k, 2);
38 | y2 = test_cord(1, 2);
39 | c(k, 1) = c(k-1, 1) + dist(x1, x2, y1, y2);
40 | end
41 |
42 | %===============================================
43 | % Filling the first row
44 | %===============================================
45 | for l = 2:n
46 | x1 = train_cord(1, 1);
47 | x2 = test_cord(l, 1);
48 | y1 = train_cord(1, 2);
49 | y2 = test_cord(l, 2);
50 | c(1, l) = c(1, l-1) + dist(x1, x2, y1, y2);
51 | end
52 |
53 | %===============================================
54 | % Filling the rest of the matrix
55 | %===============================================
56 | for p = 2:m
57 | for q = 2:n
58 | x1 = train_cord(p, 1);
59 | x2 = test_cord(q, 1);
60 | y1 = train_cord(p, 2);
61 | y2 = test_cord(q, 2);
62 | c(p, q)= min([c(p-1, q) c(p, q-1) c(p-1, q-1)]) + dist(x1, x2, y1, y2);
63 | end
64 | end
65 | cost(j, 1) = c(m, n);
66 | cost(j, 2) = train_class(j);
67 | end
68 | value = sortrows(cost, 1);
69 | distance = value(1, 1);
70 | predicted = value(1, 2);
71 | true = test_class(i);
72 | acc = 0;
73 | if true == predicted
74 | acc = 1;
75 | end
76 | fprintf('ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f, distance = %.2f \n', i, predicted, true, acc, distance);
77 | classification_accuracy = classification_accuracy + acc;
78 | end
79 | fprintf('classification accuracy=%6.4f\n', classification_accuracy/test_length);
80 | end
81 |
--------------------------------------------------------------------------------
/01_MATLAB/03_Singular_Value_Decomposition/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | In this task you will implement singular value decomposition (SVD).
3 | More specifically, you will compute the matrices U, S, V for a specific input matrix, and for a specific target dimensionality.
4 | Where needed, you will find eigenvectors using the power method.
5 |
6 | ### Command-line Arguments
7 |
8 | Your program will be invoked as follows:
9 | svd_power <input_file> <M> <iterations>
10 | The arguments provide to the program the following information:
11 | The first argument, <input_file>, is the path name of the file where the input matrix is stored.
12 | The path name can specify any file stored on the local computer.
13 | The data file will have as many lines as the rows of the input matrix.
14 | Line n will contain the values in the n-th row of the matrix.
15 | Within that line n, values will be separated by white space. An example data file is input1.txt.
16 | Values can be any real numbers, and the input matrix can have any number of rows and columns.
17 | The second argument, <M>, specifies the number of dimensions for the SVD output.
18 | In other words, the U matrix should have <M> columns, the V matrix should have <M> columns, and the S matrix should have <M> rows and <M> columns. Remember, the diagonal entries Sd,d of S should contain values that decrease as d increases.
19 | The third argument, <iterations>, is a number greater than or equal to 1, that specifies the number of iterations for the power method.
20 | Slide 44 in the slides on PCA describes how to use the power method to find the top eigenvector, using a sequence bk. You should stop calculating this sequence after the specified number of iterations, and use the last bk (where k = <iterations>) as the eigenvector.
21 |
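For reference, a minimal power-iteration sketch in Python (numpy assumed); the fixed all-ones starting vector and the per-step normalisation are implementation choices, not part of the spec:

```python
import numpy as np

def power_method(A, iterations):
    """Approximate the top eigenvector of a square matrix A by power iteration."""
    b = np.ones(A.shape[0])
    for _ in range(iterations):
        b = A @ b
        b = b / np.linalg.norm(b)   # renormalise so b_k does not overflow
    return b
```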
22 | ### Output
23 |
24 | After you compute matrices U, S, V, you need to print each of those matrices. You also need to print the reconstruction of the original matrix. This reconstruction is computed as U*S*VT. The output should follow this format:
25 | Matrix U:
26 | Row 1: %8.4f %8.4f ... %8.4f
27 | Row 2: %8.4f %8.4f ... %8.4f
28 | ...
29 | Row X: %8.4f %8.4f ... %8.4f
30 |
31 | Matrix S:
32 | Row 1: %8.4f %8.4f ... %8.4f
33 | Row 2: %8.4f %8.4f ... %8.4f
34 | ...
35 | Row M: %8.4f %8.4f ... %8.4f
36 |
37 | Matrix V:
38 | Row 1: %8.4f %8.4f ... %8.4f
39 | Row 2: %8.4f %8.4f ... %8.4f
40 | ...
41 | Row Y: %8.4f %8.4f ... %8.4f
42 |
43 | Reconstruction (U*S*V'):
44 | Row 1: %8.4f %8.4f ... %8.4f
45 | Row 2: %8.4f %8.4f ... %8.4f
46 | ...
47 | Row X: %8.4f %8.4f ... %8.4f
48 | In the above output template:
49 | X is the number of rows in the data file.
50 | Y is the number of columns in the data file.
51 | M is command-line argument <M>.
52 | In each line printing the row of a matrix, the row number should be printed using the %3d format specification (an integer with three allocated spaces). The actual values of the matrix should be printed with the %8.4f format specifier (or an equivalent format if using a language other than Java).
53 | In your answers.pdf file, include the output for the following invocations of your program:
54 |
55 | svd_power input1.txt 2 10
56 | svd_power input1.txt 4 100
57 |
58 | ### Running The Program
59 | IN command window type : main('input1.txt', 2, 10)
60 |
61 | Here :
62 | M = 2
63 | iterations = 10
64 |
65 |
--------------------------------------------------------------------------------
/01_MATLAB/05_K_Nearest_Neigbour/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | In this task you will implement k-nearest neighbor classification using Euclidean Distance.
3 |
4 | ### Command-line Arguments
5 |
6 | Your program will be invoked as follows:
7 | knn_classify <training_file> <test_file> <k>
8 | The arguments provide to the program the following information:
9 | The first argument, <training_file>, is the path name of the training file, where the training data is stored.
10 | The path name can specify any file stored on the local computer.
11 | The second argument, <test_file>, is the path name of the test file, where the test data is stored.
12 | The path name can specify any file stored on the local computer.
13 | The third argument specifies the value of k for the k-nearest neighbor classifier.
14 | The files used will be any of the files in the UCI datasets directory.
15 | A description of the datasets and the file format can be found at the link below.
16 |
17 |
20 |
21 | ### Implementation Guidelines
22 |
23 | Each dimension should be normalized, separately from all other dimensions.
24 | Specifically, for both training and test objects, each dimension should be transformed using function:
25 | - F(v) = (v - mean) / std, using the mean and std of the values of that dimension on the TRAINING data.
26 | To compute the std, divide by the number of training data (NOT the number of training data minus 1).
27 | Use the L2 distance (the Euclidean distance) for computing the nearest neighbors.
28 |
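A sketch of this normalisation step in Python (numpy assumed; the guard against zero std is an added safeguard, not in the spec):

```python
import numpy as np

def normalise(train, test):
    """Z-score each dimension with the TRAINING mean and population std."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)    # numpy divides by N by default (ddof=0)
    std[std == 0] = 1          # avoid division by zero on constant dimensions
    return (train - mean) / std, (test - mean) / std
```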
29 | ### Classification Stage
30 |
31 | For each test object you should print a line containing the following info:
32 | object ID. This is the line number where that object occurs in the test file. Start with 0 in numbering the objects, not with 1.
33 | predicted class (the result of the classification). If your classification result is a tie among two or more classes, choose one of them randomly.
34 | true class (from the last column of the test file).
35 | accuracy. This is defined as follows:
36 | If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
37 | If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
38 | If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
39 | If there were ties in your classification result, and the correct class was NOT one of the classes that tied for best, the accuracy is 0.
40 | To produce this output in a uniform manner, use these printing statements:
41 | For C or C++, use:
42 | printf("ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2lf\n", object_id, predicted_class, true_class, accuracy);
43 | For Java, use:
44 | System.out.printf("ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f\n", object_id, predicted_class, true_class, accuracy);
45 |
46 | For Python or any other language, just make sure that you use formatting specifiers that produce aligned output that matches the specs given for C and Java.
47 | After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object. To print the classification accuracy in a uniform manner, use these printing statements:
48 | Use:
49 | printf("classification accuracy=%6.4lf\n", classification_accuracy);
50 |
51 |
52 |
53 | ### Running The Program
54 | Run the main.m file
55 |
56 | main('pendigits_training.txt','pendigits_test.txt', 1)
57 |
58 | main('pendigits_training.txt','pendigits_test.txt', 3)
59 |
60 | main('pendigits_training.txt','pendigits_test.txt', 5)
61 |
--------------------------------------------------------------------------------
/01_MATLAB/03_Singular_Value_Decomposition/code/main.m:
--------------------------------------------------------------------------------
1 | function [] = main(train_file, M, iterations)
2 | % =================================================%
3 | % Loading and Initialising Data
4 | % =================================================%
5 | train_data = load(train_file);
6 | train_data = double(train_data);
7 | A = train_data*transpose(train_data);
8 | A_1 = train_data*transpose(train_data);
9 | U = zeros(size(train_data, 1), M);
10 |
11 | % =================================================%
12 | % Calculating Eigen Vectors - U
13 | % =================================================%
14 | for d = 1: M
15 | U(:, d) = svd_power(A*A', iterations);
16 | %disp(size(U))
17 | for i = 1:size(A, 1)
18 | value = transpose(U(:, d)) * transpose(A(i, 1:end))*U(:, d);
19 | A(i, :) = A(i, :) - transpose(value);
20 | end
21 | end
22 |
23 | % =================================================%
24 | % Calculating Lambda, Eigen Values, and S Diagonal
25 | % =================================================%
26 | lambda = transpose(U)*A_1*U;
27 | eigen_values = sqrt(max(lambda));
28 | S = zeros(M, M);
29 | for i = 1:M
30 | S(i, i) = eigen_values(1, i);
31 | end
32 |
33 | % =================================================%
34 | % Calculating V
35 | % =================================================%
36 | V = zeros(size(train_data, 2), M);
37 | A = transpose(train_data)*train_data;
38 | for d = 1: M
39 | V(:, d) = svd_power(A*A', iterations);
40 | %disp(size(U))
41 | for i = 1:size(A, 1)
42 | value = transpose(V(:, d)) * transpose(A(i, 1:end))*V(:, d);
43 | A(i, :) = A(i, :) - transpose(value);
44 | end
45 | end
46 |
47 | % =================================================%
48 | % Performing Reconstruction
49 | % =================================================%
50 | reconstruction = U*S*transpose(V);
51 | print_matrices(U, S, V, reconstruction)
52 | end
53 |
54 | function [] = print_matrices(U, S, V, reconstruction)
55 | % =================================================%
56 | % Displaying Eigen Vectors - U
57 | % =================================================%
58 | fprintf('Matrix U: \n')
59 | for row = 1:size(U, 1)
60 | fprintf('Row%3d:', row);
61 | for col = 1:size(U, 2)
62 | fprintf('%8.4f', U(row, col));
63 | end
64 | fprintf('\n')
65 | end
66 |
67 | % =================================================%
68 | % Displaying Diagonal Matrix - S
69 | % =================================================%
70 | fprintf('\n');
71 | fprintf('Matrix S: \n')
72 | for row = 1:size(S, 1)
73 | fprintf('Row%3d:', row);
74 | for col = 1:size(S, 2)
75 | fprintf('%8.4f', S(row, col));
76 | end
77 | fprintf('\n')
78 | end
79 |
80 | % =================================================%
81 | % Displaying Matrix - V
82 | % =================================================%
83 | fprintf('\n');
84 | fprintf('Matrix V: \n')
85 | for row = 1:size(V, 1)
86 | fprintf('Row%3d:', row);
87 | for col = 1:size(V, 2)
88 | fprintf('%8.4f', V(row, col));
89 | end
90 | fprintf('\n')
91 | end
92 |
93 |
94 | % =================================================%
95 | % Displaying Reconstruction Matrix
96 | % =================================================%
97 | fprintf('\n');
98 | fprintf('Reconstruction (U*S*V''): \n')
99 | for row = 1:size(reconstruction, 1)
100 | fprintf('Row%3d:', row);
101 | for col = 1:size(reconstruction, 2)
102 | fprintf('%8.4f', reconstruction(row, col));
103 | end
104 | fprintf('\n')
105 | end
106 |
107 | end
108 |
109 |
--------------------------------------------------------------------------------
/01_MATLAB/04_Principal_Component_Analysis/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | In this task you will implement principal component analysis (PCA).
3 |
4 | ### Command-line Arguments
5 |
6 | Your program will be invoked as follows:
7 | pca_power <training_file> <test_file> <M> <iterations>
8 | The arguments provide to the program the following information:
9 | The first argument, <training_file>, is the path name of the training file, where the training data is stored.
10 | The path name can specify any file stored on the local computer.
11 | The second argument, <test_file>, is the path name of the test file, where the test data is stored.
12 | The path name can specify any file stored on the local computer.
13 | The third argument, <M>, specifies the dimension of the output space of the projection.
14 | In other words, you will use the <M> eigenvectors with the largest eigenvalues to define the projection matrix.
15 | The fourth argument, <iterations>, is a number greater than or equal to 1, that specifies the number of iterations for the power method.
16 | Use the power method to find the top eigenvector, using a sequence bk.
17 | You should stop calculating this sequence after the specified number of iterations, and use the last bk (where k = <iterations>) as the eigenvector.
18 | The training and test files will follow the same format as the text files in the UCI datasets directory.
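A rough Python sketch of the training and projection computation (numpy assumed); deflation, i.e. subtracting each found component from the covariance, is one standard way to obtain successive eigenvectors:

```python
import numpy as np

def power_method(A, iterations):
    """Power iteration for the top eigenvector (see the SVD README sketch)."""
    b = np.ones(A.shape[0])
    for _ in range(iterations):
        b = A @ b
        b = b / np.linalg.norm(b)
    return b

def pca_project(train, test, M, iterations):
    """Project test objects onto the top-M eigenvectors of the training covariance."""
    C = np.cov(train, rowvar=False, bias=True)     # D x D covariance matrix
    U = np.zeros((C.shape[0], M))
    for d in range(M):
        U[:, d] = power_method(C, iterations)      # top eigenvector of current C
        lam = U[:, d] @ C @ U[:, d]                # Rayleigh-quotient eigenvalue
        C = C - lam * np.outer(U[:, d], U[:, d])   # deflate before the next pass
    return test @ U                                # one projected row per object
```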
19 | ### Training Stage Output
20 |
21 | After you compute the projection matrix using the training data, print out the top eigenvectors, in decreasing order of their eigenvalues. Note that you do not need to know the eigenvalues to specify this order. You just print out the eigenvectors in the order in which they have been calculated, based on the pseudocode in slide 54 of the slides on PCA. The output should follow this format:
22 | Eigenvector 1
23 | 1: %.4f
24 | 2: %.4f
25 | ...
26 | D: %.4f
27 |
28 | Eigenvector 2
29 | 1: %.4f
30 | 2: %.4f
31 | ...
32 | D: %.4f
33 |
34 | ...
35 |
36 | Eigenvector M
37 | 1: %.4f
38 | 2: %.4f
39 | ...
40 | D: %.4f
41 | In the above output template:
42 | D is the number of dimensions of the training data.
43 | M is command-line argument <M>.
44 | In each line containing a value of an eigenvector, the first number (the dimension index) should be printed using the %3d format specification (an integer with three allocated spaces). The second value is simply the value of the eigenvector in that dimension, with exactly four decimal digits.
45 |
46 | ### Test Stage
47 |
48 | For each test object (in the order in which test objects appear in the test file), you should print the projection of that test object based on the projection you computed during training.
49 | The output should follow this format:
50 |
51 | Test object 0
52 | 1: %.4f
53 | 2: %.4f
54 | ...
55 | M: %.4f
56 |
57 | Test object 1
58 | 1: %.4f
59 | 2: %.4f
60 | ...
61 | M: %.4f
62 |
63 | ...
64 | In the above output template:
65 | M is command-line argument <M>.
66 | In each line containing a value of the projection of an object, follow the same instructions as for printing values of eigenvectors at the end of the training stage.
67 | ### Output for answers.pdf
68 |
69 | In your answers.pdf document, you need to provide parts of the output for some invocations of your program listed below. For each invocation, provide:
70 | The full output of the training stage.
71 | ONLY THE PROJECTION OF THE FIRST OBJECT for the test stage.
72 | Include this output for the following invocations of your program:
73 | pca_power pendigits_training pendigits_test 1 10
74 | pca_power satellite_training satellite_test 2 20
75 | pca_power yeast_training yeast_test 3 30
76 |
77 | ### Running The Program
78 |
79 | Run any one of the below commands:
80 |
81 | main('pendigits_training.txt', 'pendigits_test.txt', 1, 10)
82 |
83 | main('satellite_training.txt', 'satellite_test.txt', 2, 20)
84 |
85 | main('yeast_training.txt', 'yeast_test.txt', 3, 30)
86 |
--------------------------------------------------------------------------------
/02_PYTHON/01_Naive_Bayes_Classifier/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | In this assignment you will implement naive Bayes classifiers based on histograms.
3 |
4 | ### Command-line Arguments
5 |
6 | You must implement a program that learns a naive Bayes classifier for a classification problem, given some training data and some additional options.
7 | In particular, your program will be invoked as follows:
8 | naive_bayes <training_file> <test_file> histograms <number_of_bins>
9 |
10 | ### Training: Histograms
11 |
12 | If the third command-line argument is histograms, then you should model P(x | class) as a histogram separately for each dimension of the data.
13 | The number of bins for each histogram is specified by the fourth command-line argument.
14 | Suppose that you are building a histogram of N bins for the j-th dimension of the data and for the c-th class.
15 | Let S be the smallest and L be the largest value in the j-th dimension among all training data belonging to the c-th class.
16 | Let G = (L-S)/(N-3). G will be the width of all bins, except for bin 0 and bin N-1, whose width is infinite.
17 | If you get a value of G that is less than 0.0001, then set G to 0.0001. Your bins should have the following ranges:
18 |
19 | Bin 0, covering interval (-infinity, S-G/2).
20 | Bin 1, covering interval [S-G/2, S+G/2).
21 | Bin 2, covering interval [S+G/2, S+G+G/2).
22 | Bin 3, covering interval [S+G+G/2, S+2G+G/2).
23 | ...
24 | Bin N-2, covering interval [S+(N-4)G+G/2, S+(N-3)G+G/2). This interval is the same as [L-G/2, L+G/2).
25 | Bin N-1, covering interval [S+(N-3)G+G/2, +infinity). This interval is the same as [L+G/2, +infinity).
26 |
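As a sanity check on these ranges, here is a hypothetical Python helper mapping a value to its bin index (the names are illustrative, not part of the spec):

```python
def bin_index(v, S, L, N):
    """Return the bin number (0..N-1) containing value v, per the ranges above."""
    G = max((L - S) / (N - 3), 0.0001)    # enforce the minimum bin width
    if v < S - G / 2:
        return 0                          # bin 0: (-infinity, S - G/2)
    k = 1 + int((v - (S - G / 2)) // G)   # bins 1..N-2 each have width G
    return min(k, N - 1)                  # bin N-1: [L + G/2, +infinity)
```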
27 | The output of the training phase should be a sequence of lines like this:
28 | Class %d, attribute %d, bin %d, P(bin | class) = %.2f
29 | The output lines should be sorted by class number. Within the same class, lines should be sorted by attribute number.
30 | Within the same attribute, lines should be sorted by bin number. Attributes and bins should be numbered starting from 0, not from 1.
31 | In computing the value that you store at each bin of each histogram, you must use Equation 2.241 on page 120 of the textbook.
32 | Notice that the width of the bin appears in the denominator of that equation. As mentioned above, the minimum width should be 0.0001. If your value of G is less than 0.0001, you should set G to 0.0001.
33 |
34 | In your answers.pdf document, provide the output produced by the training stage of your program when given yeast_training.txt as the input file, using seven bins for each histogram.
35 |
36 | ### Classification
37 |
38 | For each test object you should print a line containing the following info:
39 | object ID. This is the line number where that object occurs in the test file. Start with 0 in numbering the objects, not with 1.
40 | predicted class (the result of the classification). If your classification result is a tie among two or more classes, choose one of them randomly.
41 | probability of the predicted class given the data.
42 | true class (from the last column of the test file).
43 | accuracy. This is defined as follows:
44 | If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
45 | If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
46 | If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
47 | If there were ties in your classification result, and the correct class was NOT one of the classes that tied for best, the accuracy is 0.
48 | To produce this output in a uniform manner, use these printing statements:
49 | For C or C++, use:
50 | printf("ID=%5d, predicted=%3d, probability = %.4lf, true=%3d, accuracy=%4.2lf\n",
51 | object_id, probability, predicted_class, true_class, accuracy);
52 | For Java, use:
53 | System.out.printf("ID=%5d, predicted=%3d, probability = %.4f, true=%3d, accuracy=%4.2f\n",
54 | object_id, predicted_class, probability, true_class, accuracy);
55 |
--------------------------------------------------------------------------------
/02_PYTHON/07_Frequentist_Estimate/Frequentist_Estimate.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "c5b12fc7",
6 | "metadata": {
7 | "ExecuteTime": {
8 | "end_time": "2021-12-20T03:18:13.862346Z",
9 | "start_time": "2021-12-20T03:18:13.817895Z"
10 | }
11 | },
12 | "source": [
13 | ""
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 1,
19 | "id": "c7d6f26e",
20 | "metadata": {
21 | "ExecuteTime": {
22 | "end_time": "2021-12-20T05:46:00.120744Z",
23 | "start_time": "2021-12-20T05:46:00.084767Z"
24 | }
25 | },
26 | "outputs": [
27 | {
28 | "name": "stdout",
29 | "output_type": "stream",
30 | "text": [
31 | "p(c = 'a') = 0.1026\n"
32 | ]
33 | }
34 | ],
35 | "source": [
36 | "import random\n",
37 | "\n",
38 | "\n",
39 | "class Task1:\n",
40 | "\n",
41 | "\tdef __init__(self):\n",
42 | "\t\t\"\"\"This Function initialises the random uniform array and the string\"\"\"\n",
43 | "\t\t\n",
44 | "\t\tself.string = \"\"\n",
45 | "\n",
46 | "\tdef form_string(self):\n",
47 | "\t\t\"\"\" This Function will form the string which will contain a and b \"\"\"\n",
48 | "\t\t\n",
49 | "\t\tfor x in range(0, 3100):\n",
50 | "\t\t\trandom_number = random.uniform(0, 1)\n",
51 | "\t\t\tif random_number > 0.1:\n",
52 | "\t\t\t\tself.string += \"b\"\n",
53 | "\t\t\telse:\n",
54 | "\t\t\t\tself.string += \"a\"\n",
55 | "\n",
56 | "\tdef probability_of_character(self, character):\n",
57 | "\t\t\"\"\" This function calculates the probability of occurrence of certain character in the string \"\"\"\n",
58 | "\t\tcount_of_character = self.string.count(character)\n",
59 | "\t\tcount_of_character = float(count_of_character)\n",
60 | "\t\tprobability = count_of_character/len(self.string)\n",
61 | "\t\treturn probability\n",
62 | "\n",
63 | "if __name__ == \"__main__\":\n",
64 | "\t# start = time.time()\n",
65 | "\ttask1 = Task1()\n",
66 | "\ttask1.form_string()\n",
67 | "\tsolution1 = task1.probability_of_character(\"a\")\n",
68 | "\tprint(\"p(c = 'a') = %.4f\" % solution1)\n",
69 | "\t# end = time.time()\n",
70 | "\t# print(\"time\", end - start)\n"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "id": "5285289a",
77 | "metadata": {},
78 | "outputs": [],
79 | "source": []
80 | }
81 | ],
82 | "metadata": {
83 | "kernelspec": {
84 | "display_name": "Python 3",
85 | "language": "python",
86 | "name": "python3"
87 | },
88 | "language_info": {
89 | "codemirror_mode": {
90 | "name": "ipython",
91 | "version": 3
92 | },
93 | "file_extension": ".py",
94 | "mimetype": "text/x-python",
95 | "name": "python",
96 | "nbconvert_exporter": "python",
97 | "pygments_lexer": "ipython3",
98 | "version": "3.8.8"
99 | },
100 | "toc": {
101 | "base_numbering": 1,
102 | "nav_menu": {},
103 | "number_sections": true,
104 | "sideBar": true,
105 | "skip_h1_title": false,
106 | "title_cell": "Table of Contents",
107 | "title_sidebar": "Contents",
108 | "toc_cell": false,
109 | "toc_position": {},
110 | "toc_section_display": true,
111 | "toc_window_display": false
112 | },
113 | "varInspector": {
114 | "cols": {
115 | "lenName": 16,
116 | "lenType": 16,
117 | "lenVar": 40
118 | },
119 | "kernels_config": {
120 | "python": {
121 | "delete_cmd_postfix": "",
122 | "delete_cmd_prefix": "del ",
123 | "library": "var_list.py",
124 | "varRefreshCmd": "print(var_dic_list())"
125 | },
126 | "r": {
127 | "delete_cmd_postfix": ") ",
128 | "delete_cmd_prefix": "rm(",
129 | "library": "var_list.r",
130 | "varRefreshCmd": "cat(var_dic_list()) "
131 | }
132 | },
133 | "types_to_exclude": [
134 | "module",
135 | "function",
136 | "builtin_function_or_method",
137 | "instance",
138 | "_Feature"
139 | ],
140 | "window_display": false
141 | }
142 | },
143 | "nbformat": 4,
144 | "nbformat_minor": 5
145 | }
146 |
--------------------------------------------------------------------------------
/02_PYTHON/07_Frequentist_Estimate/old_Frequentist_Estimate.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "c5b12fc7",
6 | "metadata": {
7 | "ExecuteTime": {
8 | "end_time": "2021-12-20T03:18:13.862346Z",
9 | "start_time": "2021-12-20T03:18:13.817895Z"
10 | }
11 | },
12 | "source": [
13 | ""
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 1,
19 | "id": "c7d6f26e",
20 | "metadata": {
21 | "ExecuteTime": {
22 | "end_time": "2021-12-20T05:46:04.900907Z",
23 | "start_time": "2021-12-20T05:46:04.668788Z"
24 | }
25 | },
26 | "outputs": [
27 | {
28 | "name": "stdout",
29 | "output_type": "stream",
30 | "text": [
31 | "p(c = 'a') = 0.0977\n"
32 | ]
33 | }
34 | ],
35 | "source": [
36 | "import numpy as np\n",
37 | "import time\n",
38 | "\n",
39 | "\n",
40 | "class Task1:\n",
41 | "\n",
42 | "\tdef __init__(self):\n",
43 | "\t\t\"\"\"This Function initialises the random uniform array and the string\"\"\"\n",
44 | "\t\t\n",
45 | "\t\tself.randUniformArray = np.random.uniform(0, 1, 3100)\n",
46 | "\t\tself.string = \"\"\n",
47 | "\n",
48 | "\tdef form_string(self):\n",
49 | "\t\t\"\"\" This Function will form the string which will contain a and b \"\"\"\n",
50 | "\t\t\n",
51 | "\t\tfor x in range(0, 3100):\n",
52 | "\t\t\tif self.randUniformArray[x] > 0.1:\n",
53 | "\t\t\t\tself.string += \"b\"\n",
54 | "\t\t\telse:\n",
55 | "\t\t\t\tself.string += \"a\"\n",
56 | "\n",
57 | "\tdef probability_of_character(self, character):\n",
58 | "\t\t\"\"\" This function calculates the probability of occurrence of certain character in the string \"\"\"\n",
59 | "\t\t\n",
60 | "\t\tcount_of_character = self.string.count(character)\n",
61 | "\t\tcount_of_character = float(count_of_character)\n",
62 | "\t\tprobability = count_of_character/len(self.string)\n",
63 | "\t\treturn probability\n",
64 | "\n",
65 | "if __name__ == \"__main__\":\n",
66 | "\t# start = time.time()\n",
67 | "\ttask1 = Task1()\n",
68 | "\ttask1.form_string()\n",
69 | "\tsolution1 = task1.probability_of_character(\"a\")\n",
70 | "\tprint(\"p(c = 'a') = %.4f\" % solution1)\n",
71 | "\t# end = time.time()\n",
72 | "\t# print(\"time\", end - start)\n"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "id": "5285289a",
79 | "metadata": {},
80 | "outputs": [],
81 | "source": []
82 | }
83 | ],
84 | "metadata": {
85 | "kernelspec": {
86 | "display_name": "Python 3",
87 | "language": "python",
88 | "name": "python3"
89 | },
90 | "language_info": {
91 | "codemirror_mode": {
92 | "name": "ipython",
93 | "version": 3
94 | },
95 | "file_extension": ".py",
96 | "mimetype": "text/x-python",
97 | "name": "python",
98 | "nbconvert_exporter": "python",
99 | "pygments_lexer": "ipython3",
100 | "version": "3.8.8"
101 | },
102 | "toc": {
103 | "base_numbering": 1,
104 | "nav_menu": {},
105 | "number_sections": true,
106 | "sideBar": true,
107 | "skip_h1_title": false,
108 | "title_cell": "Table of Contents",
109 | "title_sidebar": "Contents",
110 | "toc_cell": false,
111 | "toc_position": {},
112 | "toc_section_display": true,
113 | "toc_window_display": false
114 | },
115 | "varInspector": {
116 | "cols": {
117 | "lenName": 16,
118 | "lenType": 16,
119 | "lenVar": 40
120 | },
121 | "kernels_config": {
122 | "python": {
123 | "delete_cmd_postfix": "",
124 | "delete_cmd_prefix": "del ",
125 | "library": "var_list.py",
126 | "varRefreshCmd": "print(var_dic_list())"
127 | },
128 | "r": {
129 | "delete_cmd_postfix": ") ",
130 | "delete_cmd_prefix": "rm(",
131 | "library": "var_list.r",
132 | "varRefreshCmd": "cat(var_dic_list()) "
133 | }
134 | },
135 | "types_to_exclude": [
136 | "module",
137 | "function",
138 | "builtin_function_or_method",
139 | "instance",
140 | "_Feature"
141 | ],
142 | "window_display": false
143 | }
144 | },
145 | "nbformat": 4,
146 | "nbformat_minor": 5
147 | }
148 |
--------------------------------------------------------------------------------
/02_PYTHON/03_Mixture_Of_Gaussian_Using_EM_Algortihm/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | You must implement a program that learns a naive Bayes classifier for a classification problem.
3 |
4 |
5 | ### Command-line Arguments
6 | Given some training data and some additional options, your program can be invoked as follows:
7 | naive_bayes <training_file> <test_file> mixtures <number_of_gaussians>
8 |
9 | ### Training: Mixtures of Gaussians
10 |
11 | If the third command-line argument is mixtures, then you should model P(x | class) as a mixture of Gaussians separately for each dimension of the data.
12 | The number of Gaussians for each mixture is specified by the fourth command-line argument.
13 | Suppose that you are building a mixture of N Gaussians for the j-th dimension of the data and for the c-th class.
14 | Let S be the smallest and L be the largest value in the j-th dimension among all training data belonging to the c-th class.
15 | Let G = (L-S)/N. Then, you should initialize all standard deviations of the mixture to 1, you should initialize all weights to 1/N, and you should initialize the means as follows:
16 |
17 | For the first Gaussian, the initial mean should be S + G/2.
18 | For the second Gaussian, the initial mean should be S + G + G/2.
19 | For the third Gaussian, the initial mean should be S + 2G + G/2.
20 | ...
21 | For the N-th Gaussian, the initial mean should be S + (N-1)G + G/2.
22 | You should repeat the main loop of the EM algorithm 50 times. So, no need to worry about any other stopping criterion.
23 | Your stopping criterion is simply that the loop has been executed 50 times.
24 | In certain cases, it is possible that the M-step computes a value for the standard deviation that is equal to zero.
25 | Your code should make sure that the variance of the Gaussian is NEVER smaller than 0.0001.
26 | Since the variance is the square of the standard deviation, this means that the standard deviation should never be smaller than sqrt(0.0001) = 0.01.
27 | Any time the M-step computes a value for the standard deviation that is smaller than 0.01, your code should replace that value with 0.01.
28 |
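To make the initialization and the variance clamp described above concrete, here is a minimal sketch in Python/NumPy for one (class, dimension) pair; the names `init_mixture` and `clamp_std` are illustrative, not part of the required interface:

```python
import numpy as np

# Illustrative helpers; the assignment does not mandate this interface.
MIN_STD = 0.01  # sqrt(0.0001): the required lower bound on the standard deviation

def init_mixture(values, n_gaussians):
    """Initialize one mixture of Gaussians for a single (class, dimension) pair.

    values: 1-D array of the training values in this dimension for this class.
    Returns (means, stds, weights) as specified above.
    """
    S, L = values.min(), values.max()
    G = (L - S) / n_gaussians
    # The k-th mean (k = 0 .. N-1) is S + k*G + G/2.
    means = S + np.arange(n_gaussians) * G + G / 2.0
    stds = np.ones(n_gaussians)                        # all stds start at 1
    weights = np.full(n_gaussians, 1.0 / n_gaussians)  # all weights start at 1/N
    return means, stds, weights

def clamp_std(stds):
    """Apply after every M-step so the variance never drops below 0.0001."""
    return np.maximum(stds, MIN_STD)
```

The EM loop then simply alternates the E-step and M-step 50 times, calling `clamp_std` after each M-step.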
29 | The output of the training phase should be a sequence of lines like this:
30 |
31 | Class %d, attribute %d, Gaussian %d, mean = %.2f, std = %.2f
32 | The output lines should be sorted by class number. Within the same class, lines should be sorted by attribute number.
33 | Within the same attribute, lines should be sorted by Gaussian number.
34 | Attributes and Gaussians should be numbered starting from 0, not from 1.
35 | In your answers.pdf document, provide the output produced by the training stage of your program when given yeast_training.txt as the input file, using three Gaussians for each mixture.
36 |
37 |
38 | ### Classification
39 |
40 | For each test object you should print a line containing the following info:
41 | - object ID. This is the line number where that object occurs in the test file. Start numbering the objects with 0, not with 1.
42 | - predicted class (the result of the classification). If your classification result is a tie among two or more classes, choose one of them randomly.
43 | - probability of the predicted class given the data.
44 | - true class (from the last column of the test file).
45 | - accuracy, defined as follows:
46 |   - If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
47 |   - If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
48 |   - If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
49 |   - If there were ties in your classification result, and the correct class was NOT one of the classes that tied for best, the accuracy is 0.
50 | To produce this output in a uniform manner, use this printing statement:
51 | 
52 | printf("ID=%5d, predicted=%3d, probability = %.4lf, true=%3d, accuracy=%4.2lf\n",
53 |        object_id, predicted_class, probability, true_class, accuracy);
54 |
55 | For Python or any other language, just make sure that you use format specifiers that produce aligned output matching the specs given for C and Java.
56 | Object IDs should be numbered starting from 0, not 1.
57 | After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object.
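As a hedged illustration of the tie handling and the aligned output in Python (the helper name `classify_and_score` and its inputs are assumptions, not mandated by the assignment):

```python
import random

def classify_and_score(posteriors, true_class):
    """Illustrative helper: posteriors maps class label -> P(class | x) for one test object.

    Returns (predicted_class, probability, accuracy) per the rules above.
    """
    best = max(posteriors.values())
    tied = [c for c, p in posteriors.items() if p == best]
    predicted = random.choice(tied)  # break ties randomly
    accuracy = 1.0 / len(tied) if true_class in tied else 0.0
    return predicted, best, accuracy

# Aligned output matching the C printf spec above:
# print("ID=%5d, predicted=%3d, probability = %.4f, true=%3d, accuracy=%4.2f"
#       % (object_id, predicted, probability, true_class, accuracy))
```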
58 |
--------------------------------------------------------------------------------
/01_MATLAB/01_Descision_Tree_and_Random_Forest/code/random.m:
--------------------------------------------------------------------------------
1 | function [] = random(training_file, testing_file, option, pruning_thr)
2 | %train_file = 'pendigits_training.txt';
3 | %test_file = 'pendigits_test.txt';
4 | train_file = training_file;
5 | test_file = testing_file;
6 | train_data = load(train_file);
7 | test_data = load(test_file);
8 | target = train_data(:, end);
9 | test_target = test_data(:, end);
10 | unique_class = unique(target);
11 | pruning_thrsld = pruning_thr;
12 | classificatio_acc=0;
13 | class_max = max(target);
14 |
15 | tree1 = [];
16 | thrsldeshold1 = [];
17 | gainin1 = [];
18 | index1 = 1;
19 |
20 |
21 | tree2 = [];
22 | thrsldeshold2 = [];
23 | gainin2 = [];
24 | index2 = 1;
25 |
26 |
27 | tree3 = [];
28 | thrsldeshold3 = [];
29 | gainin3 = [];
30 | index3 = 1;
31 |
32 | attributes = zeros(1, size(train_data, 2)-1);
33 | for col = 1: size(train_data, 2)-1
34 | attributes(1, col) = col;
35 | end
36 |
37 | % = [1, 2, 3, 4, 5 ,6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16];
38 |
39 |
40 |
41 | [tree1,thrsldeshold1,gainin1] = make_tree(train_data,pruning_thrsld,option,attributes,class_max,tree1,thrsldeshold1,gainin1,index1);
42 | [tree2,thrsldeshold2,gainin2] = make_tree(train_data,pruning_thrsld,option,attributes,class_max,tree2,thrsldeshold2,gainin2,index2);
43 | [tree3,thrsldeshold3,gainin3] = make_tree(train_data,pruning_thrsld,option,attributes,class_max,tree3,thrsldeshold3,gainin3,index3);
44 |
45 | for i=1:size(tree1,2)
46 | if (tree1(:,i)-1) ~= -1
47 | fprintf('tree=%2d, node=%3d, feature=%2d, thr=%6.2f, gain=%f\n',0,i,tree1(:,i)-1,thrsldeshold1(:,i),gainin1(:,i));
48 | end
49 | end
50 |
51 | for i=1:size(tree2, 2)
52 | if (tree2(:,i)-1) ~= -1
53 | fprintf('tree=%2d, node=%3d, feature=%2d, thr=%6.2f, gain=%f\n',1,i,tree2(:,i)-1,thrsldeshold2(:,i),gainin2(:,i));
54 | end
55 | end
56 |
57 | for i=1:size(tree3, 2)
58 | if (tree3(:,i)-1) ~= -1
59 | fprintf('tree=%2d, node=%3d, feature=%2d, thr=%6.2f, gain=%f\n',2,i,tree3(:,i)-1,thrsldeshold3(:,i),gainin3(:,i));
60 | end
61 | end
62 |
63 |
64 | for test_row=1:size(test_data,1)
65 | index=1;
66 | is_leaf=1;
67 | while is_leaf == 1
68 | atest_classr = tree1(index);
69 | thrsld = thrsldeshold1(index);
70 | gain = gainin1(index);
71 | if thrsld == -1 && gain == -1
72 | tree_1_values = atest_classr;
73 | is_leaf=0;
74 | else
75 | if (test_data(test_row,atest_classr))>=thrsld
76 | index=(2*index)+1;
77 | else
78 | index=(2*index);
79 | end
80 | end
81 | end
82 | index=1;
83 | is_leaf=1;
84 | while is_leaf == 1
85 | atest_classr = tree2(index);
86 | thrsld=thrsldeshold2(index);
87 | gain=gainin2(index);
88 | if thrsld ~= -1 && gain ~= -1
89 | if (test_data(test_row,atest_classr)) >= thrsld
39 | d = utility_matrix((row -1), col);
40 | end
41 | %=================================
42 | % Can Go Up ??
43 | %=================================
44 | if (row + 1) > size(utility_matrix,1) || utility_matrix((row + 1), col) == 2
45 | u = utility_matrix(row, col);
46 | %up = 1;
47 | elseif (row + 1) <= size(utility_matrix,1)
48 | u = utility_matrix((row + 1), col);
49 | end
50 | %=================================
51 | % Can Go Left ??
52 | %=================================
53 | if (col - 1) == 0 || utility_matrix(row, (col - 1)) == 2
54 | l = utility_matrix(row, col);
55 | %left = 1;
56 | elseif (col - 1) > 0
57 | l = utility_matrix(row, (col - 1));
58 | end
59 | %=================================
60 | % Can Go Right ??
61 | %=================================
62 | if (col + 1) > size(utility_matrix, 2) || utility_matrix(row, (col + 1)) == 2
63 | r = utility_matrix(row, col);
64 | %right = 1;
65 | elseif (col + 1) <= size(utility_matrix, 2)
66 | r = utility_matrix(row, (col + 1));
67 | end
68 | % right, left, up, down
69 | ut = [r, l, u, d];
70 | ut = ut';
71 |
72 | %=================================
73 | % Calculating Utility Sequence
74 | %=================================
75 | % right, left, up, down
76 | uti = zeros(4, 1);
77 | for x = 1:4
78 | uti(x, 1) = non_terminal + g*ut(x, 1);
79 | end
80 |
81 | %=================================
82 | % Calculating Expected Utility
83 | %=================================
84 | % right, left, up, down: the intended move happens with prob 0.8, each perpendicular move with prob 0.1
85 | util = zeros(4, 1);
86 | util(1, 1) = 0.8*uti(1, 1)+0.1*uti(3, 1)+0.1*uti(4, 1);
87 | util(2, 1) = 0.8*uti(2, 1)+0.1*uti(3, 1)+0.1*uti(4, 1);
88 | util(3, 1) = 0.8*uti(3, 1)+0.1*uti(1, 1)+0.1*uti(2, 1);
89 | util(4, 1) = 0.8*uti(4, 1)+0.1*uti(1, 1)+0.1*uti(2, 1);
90 |
91 | %=========================================
92 | % Selecting the Best Utility
93 | %=========================================
94 | max_value = max(util);
95 | dummy(row, col) = max_value;
96 |
97 | end
98 | end
99 | utility_matrix = dummy(:, :);
100 | end
101 | %=========================================
102 | % Displaying Results
103 | %=========================================
104 | result = utility_matrix(:, :);
105 | result(result == 2) = 0;
106 | for row = 1:size(result, 1)
107 | for col = 1:size(result, 2)
108 | if col == size(result, 2)
109 | fprintf('%6.3f ', result(row, col));
110 | else
111 | fprintf('%6.3f, ', result(row, col));
112 | end
113 | end
114 | fprintf('\n');
115 | end
116 | end
117 |
118 |
119 |
120 |
121 |
122 |
123 |
--------------------------------------------------------------------------------
/01_MATLAB/10_Dynamic_Time_Warping/README.md:
--------------------------------------------------------------------------------
1 | ### Task 1 (100 points)
2 |
3 | In this task you will implement 1-nearest neighbor classification of time series using dynamic time warping (DTW) as the distance measure.
4 | Your zip file should have a folder called dtw_classification, which contains your code and the README.txt file.
5 |
6 | ### Command-line Arguments
7 |
8 | Your program will be invoked as follows:
9 | dtw_classify <training_file> <test_file>
10 | The arguments provide the program with the following information:
11 | The first argument, <training_file>, is the path name of the training file, where the training data is stored. The path name can specify any file stored on the local computer.
12 | The second argument, <test_file>, is the path name of the test file, where the test data is stored. The path name can specify any file stored on the local computer.
13 | The training and test files will follow the same format as the text files asl_training.txt and asl_test.txt. For example, for test object 40, file asl_test.txt contains the following information:
14 | object ID: 40
15 | class label: 53
16 | sign meaning: advice
17 |
18 | dominant hand trajectory:
19 | -0.098205 0.584317
20 | -0.108025 0.554856
21 | -0.088384 0.535215
22 | -0.088384 0.535215
23 | -0.068743 0.545035
24 | -0.039282 0.574497
25 | 0.009820 0.574497
26 | 0.058923 0.643240
27 | 0.108025 0.702162
28 | 0.147307 0.751265
29 | 0.186589 0.790547
30 | 0.216050 0.829828
31 | 0.216050 0.859290
32 | 0.225870 0.888751
33 | 0.235691 0.918212
34 | 0.245511 0.928033
35 | 0.265152 0.947674
36 | The object ID for that object is 40. The class label is 53, so classification of that object is correct if and only if its nearest neighbor among the training objects also has class label 53. In the example training and test files, for every test object there are only two training objects with the same class label.
37 | Each time series is a sequence of two-dimensional vectors. For the example shown above (test object 40), after the line with text "dominant hand trajectory", there is a sequence of 17 lines. The n-th line in that sequence contains the value of the n-th vector in the time series. Different objects have different lengths.
38 |
39 | ### Implementation Guidelines
40 |
41 | In contrast to the previous assignment, do NOT do any type of normalization on the time series values that you read from the files. Just use those values as they are.
42 | Use the L2 distance (the Euclidean distance) for computing the cost of matching two 2D vectors to each other. Use DTW, as described in the course slides, to compute the distance between two time series.
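A minimal DTW sketch in Python under those guidelines, assuming each time series is a list of 2-D points; the function name `dtw_distance` is illustrative:

```python
import math

def dtw_distance(a, b):
    """Illustrative sketch: dynamic time warping distance between two
    2-D time series, with the L2 (Euclidean) distance as the local cost."""
    m, n = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = math.dist(a[i - 1], b[j - 1])   # cost of matching a[i-1] with b[j-1]
            D[i][j] = cost + min(D[i - 1][j],      # a[i-1] also matched to an earlier b
                                 D[i][j - 1],      # b[j-1] also matched to an earlier a
                                 D[i - 1][j - 1])  # fresh pairing
    return D[m][n]
```

The 1-nearest-neighbor classifier then computes this distance from each test series to every training series and predicts the label of the closest one.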
43 |
44 | ### Classification Stage
45 |
46 | For each test object you should print a line containing the following info:
47 | - object ID. This number is explicitly stated in the file for each object. Note that object IDs are numbered starting from 1.
48 | - predicted class (the result of the classification). If your classification result is a tie among two or more classes, choose one of them randomly.
49 | - true class (the class label stated in the test file for that object).
50 | - accuracy, defined as follows:
51 |   - If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
52 |   - If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
53 |   - If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
54 |   - If there were ties in your classification result, and the correct class was NOT one of the classes that tied for best, the accuracy is 0.
55 | - the DTW distance of the test object to its nearest neighbor among the training objects.
56 | To produce this output in a uniform manner, use this printing statement:
57 | 
58 | printf("ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2lf, distance = %.2lf\n", object_id, predicted_class, true_class, accuracy, distance);
59 |
60 |
61 | After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object. To print the classification accuracy in a uniform manner, use this printing statement:
62 | 
63 | printf("classification accuracy=%6.4lf\n", classification_accuracy);
64 |
65 | ### Output for answers.pdf
66 |
67 | In your answers.pdf document, you need to provide the COMPLETE output for the following invocation of your program:
68 | dtw_classify asl_training.txt asl_test.txt
69 |
70 | ### Grading
71 |
72 | 75 points: Correct implementation of the 1-nearest neighbor classifier with DTW as the distance measure.
73 | 25 points: Following the specifications in producing the required output and in producing the answers.pdf file.
74 |
75 | ### Running The Program
76 | 1) open "main.m"
77 | 2) type main('asl_training.txt', 'asl_test.txt') in the command window
78 | 3) press enter.
79 |
--------------------------------------------------------------------------------
/01_MATLAB/07_Logistic_Regression/code/logistic_regression.m:
--------------------------------------------------------------------------------
1 | function [] = logistic_regression(train_file, degree, test_file)
2 |
3 | %============================================
4 | % Initialising Values
5 | %============================================
6 | data = double(load(train_file));
7 | target = data(1: end, end);
8 | data = data(:,1:end-1);
9 | target(target > 1) = 0;
10 | prev_error=0;
11 | rows = size(data,1);
12 | cols = size(data,2);
13 | phi = zeros(rows,1);
14 |
15 | %============================================
16 | % Calculating Training Phi Matrix
17 | %============================================
18 | for row = 1: rows
19 | phi(row, 1) = 1;
20 | x = 2;
21 | for col = 1: cols
22 | for deg = 1:degree
23 | phi(row, x) = data(row, col)^deg;
24 | x = x+1;
25 | end
26 | end
27 | end
28 |
29 | %============================================
30 | % Training starts from this point
31 | %============================================
32 | condition = true;
33 | wght = zeros(cols*degree+1,1);
34 | phi_trans = transpose(phi);
35 | m = 1;
36 |
37 | while condition
38 | wghtT = transpose(wght);
39 |
40 | for i = 1:rows
41 | output(i,1) = wghtT * phi_trans(1:end,i);
42 | output(i,1) = 1 / (1 + exp(output(i,1)*(-1)));
43 | end
44 |
45 | %============================================
46 | % Calculating Error Matrix
47 | %============================================
48 | E = phi_trans * (output - target);
49 |
50 | %============================================
51 | % Calculating New Error (cross-entropy)
52 | %============================================
53 | new_error = -sum(target.*log(output + eps) + (1 - target).*log(1 - output + eps));
54 |
55 |
56 | %============================================
57 | % Calculating Error Difference
58 | %============================================
59 | error_diff = abs(new_error - prev_error);
60 |
61 | %============================================
62 | % Calculating R Diagonal Matrix
63 | %============================================
64 | R = zeros(rows,rows);
65 | for i = 1:rows
66 | R(i,i) = output(i,1) * (1 - output(i,1));
67 | end
68 |
69 | %============================================
70 | % Calculating New Weights
71 | %============================================
72 | new_wght = wght - pinv(phi_trans * R * phi) * E ;
73 |
74 | condition = sum(abs(new_wght - wght)) >= 0.001 && error_diff >= 0.001;
75 | if condition
76 | wght = new_wght;
77 | prev_error = new_error;
78 | end
79 | m = m + 1;
80 | end
81 |
82 | %============================================
83 | % Testing Initialisation
84 | %============================================
85 | test_data = double(load(test_file));
86 | target = test_data(1: end, end);
87 | test_data = test_data(:,1:end-1);
88 | rows = size(test_data,1);
89 | cols = size(test_data,2);
90 | phi = zeros(rows,1);
91 |
92 | %============================================
93 | % Calculating Testing Phi Matrix
94 | %============================================
95 | for row = 1: rows
96 | phi(row, 1) = 1;
97 | x = 2;
98 | for col = 1: cols
99 | for deg = 1:degree
100 | phi(row, x) = test_data(row, col)^deg;
101 | x = x+1;
102 | end
103 | end
104 | end
105 |
106 | %============================================
107 | % Calculating Testing Output Matrix
108 | %============================================
109 | phi_trans = transpose(phi);
110 | for i = 1:rows
111 | output(i,1) = transpose(new_wght) * phi_trans(1:end,i);
112 | output(i,1) = 1 / (1 + exp(output(i,1)*(-1)));
113 | end
114 |
115 | target(target > 1) = 0;
116 | predicted = zeros(size(output, 1), 1);
117 | accuracy = zeros(rows, 1);
118 |
119 | %============================================
120 | % Printing the Weights
121 | %============================================
122 | for i = 1:size(new_wght, 1)
123 | fprintf(' W%d = %.4f\n', i-1, new_wght(i, 1));
124 | end
125 |
126 | %============================================
127 | % Prediction
128 | %============================================
129 | for i = 1:rows
130 | first = transpose(new_wght) * transpose(phi(i, 1:end));
131 | second = output(i, 1) ;
132 | if (first > 0) && (second > 0.5)
133 | predicted(i, 1) = 1;
134 | if predicted(i, 1) == target(i, 1)
135 | accuracy(i, 1) = 1;
136 | end
137 | elseif (first < 0) && (1 - second > 0.5)
138 | predicted(i, 1) = 0;
139 | output(i, 1) = (1 - second);
140 | if predicted(i, 1) == target(i, 1)
141 | accuracy(i, 1) = 1;
142 | end
143 | else
144 | predicted(i, 1) = 1;
145 | accuracy(i, 1) = 0.5;
146 | end
147 | fprintf(' objectID=%5d, predicted=%3d, probability = %.4f, true=%3d, accuracy=%4.2f \n', i-1, predicted(i, 1), output(i, 1), target(i, 1), accuracy(i, 1));
148 | end
149 |
150 | num = sum(accuracy);
151 | den = size(accuracy, 1);
152 | final_acc = num/den;
153 |
154 | fprintf('classification accuracy=%6.4f \n', final_acc)
155 |
156 |
157 |
158 | end
159 |
--------------------------------------------------------------------------------
/01_MATLAB/07_Logistic_Regression/README.md:
--------------------------------------------------------------------------------
1 | ### Task
2 | In this task you will implement logistic regression using Iteratively Reweighted Least Squares (IRLS).
3 |
4 | ### Command-line Arguments
5 |
6 | You must implement a program that learns a logistic regression classifier, using either a degree-1 or a degree-2 polynomial feature expansion of the training data.
7 | Your program can be invoked as follows:
8 | logistic_regression <training_file> <degree> <test_file>
9 | The arguments provide the program with the following information:
10 | The first argument, <training_file>, is the path name of the training file, where the training data is stored.
11 | The path name can specify any file stored on the local computer.
12 | The second argument, <degree>, is a number equal to either 1 or 2.
13 | We will not test your code with any other values. The degree specifies what basis function φ you should use.
14 | Suppose that you have an input vector x = (x1, x2, ..., xD)^T.
15 | If the degree is 1, then φ(x) = (1, x1, x2, ..., xD)^T.
16 | If the degree is 2, then φ(x) = (1, x1, (x1)^2, x2, (x2)^2, ..., xD, (xD)^2)^T; see the sketch below.
17 | The third argument, <test_file>, is the path name of the test file, where the test data is stored.
18 | The path name can specify any file stored on the local computer.
19 | The training and test files will follow the same format as the text files in the UCI datasets directory.
20 | A description of the datasets and the file format can be found on this link.
21 | For each dataset, a training file and a test file are provided.
22 | The name of each file indicates what dataset the file belongs to, and whether the file contains training or test data.
23 | Your code should also work with ANY OTHER training and test files using the same format as the files in the UCI datasets directory.
25 |
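For concreteness, a small sketch of the φ mapping in Python/NumPy; `phi_row` is an illustrative name, not part of the required interface:

```python
import numpy as np

# Illustrative helper; the assignment does not mandate this interface.
def phi_row(x, degree):
    """Map one input vector x (length D) to its feature vector phi(x).

    degree 1: (1, x1, ..., xD)             -> D + 1 features
    degree 2: (1, x1, x1^2, ..., xD, xD^2) -> 2D + 1 features
    """
    features = [1.0]
    for value in x:
        for d in range(1, degree + 1):
            features.append(value ** d)
    return np.array(features)
```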
26 |
29 |
30 |
31 | ### Converting to Binary Classification Problem
32 |
33 | We have only covered logistic regression for binary classification problems. In this assignment, you should convert the class labels found in the files as follows:
34 | If the class label is equal to 1, it stays equal to 1.
35 | If the class label is not equal to 1, you must set it equal to 0.
36 | This way, your code will only see class labels that are 1 or 0.
37 |
38 | ### Weight Initialisation
39 | All weights must be initialized to 0.
40 |
41 |
42 | ### Stopping Criteria
43 |
44 | For logistic regression, the training goes through iterations. At each iteration, decide as follows whether to stop the training (a sketch of one update and these tests follows this list):
45 | - Compare the new weight values, computed at this iteration, with the previous weight values. If the sum of absolute values of the differences of individual weights is less than 0.001, stop the training.
46 | - Compute the cross-entropy error using the new weights computed at this iteration, and compare it with the cross-entropy error computed using the previous weights. If the change in the error is less than 0.001, stop the training.
47 |
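A hedged sketch of one IRLS (Newton) update together with the quantities these stopping tests need, assuming a design matrix `Phi` whose rows are φ(x) for each training input and a 0/1 target vector `t` (both names are assumptions); this mirrors the update w_new = w - (Φᵀ R Φ)⁻¹ Φᵀ (y - t):

```python
import numpy as np

# Illustrative sketch; names and interface are not mandated by the assignment.
def irls_step(Phi, t, w):
    """One Newton/IRLS update for logistic regression.

    Returns the new weight vector and the cross-entropy error under it.
    """
    y = 1.0 / (1.0 + np.exp(-Phi @ w))  # current sigmoid outputs
    grad = Phi.T @ (y - t)              # gradient of the cross-entropy error
    R = np.diag(y * (1.0 - y))          # IRLS reweighting matrix
    w_new = w - np.linalg.pinv(Phi.T @ R @ Phi) @ grad
    y_new = 1.0 / (1.0 + np.exp(-Phi @ w_new))
    eps = 1e-12                         # guard against log(0)
    error = -np.sum(t * np.log(y_new + eps) + (1 - t) * np.log(1 - y_new + eps))
    return w_new, error

# Stop when sum(abs(w_new - w)) < 0.001 or the change in error is < 0.001.
```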
48 | ### Numerical Issues for Yeast Dataset
49 |
50 | Your code will probably not work on the yeast dataset for degree=2. Don't worry about that, we will not test for that case. As an optional (and zero-credit) task, figure out how to make the code work in this case. Feel free to do any changes you want, including ignoring some dimensions of the data. If you succeed, describe in your answers.pdf file what you did.
51 |
52 |
53 | ### Output of Training Stage
54 | printf("w0=%.4lf\n", w0);
55 | printf("w1=%.4lf\n", w1);
56 | printf("w2=%.4lf\n", w2);
57 | ...
58 | For any other language, just make sure that you use format specifiers that produce aligned output matching EXACTLY the specs given above for C. You should print exactly as many lines as the number of weights you are estimating (D+1 weights if degree=1, 2D+1 weights if degree=2).
59 |
60 | ### Output of Test Stage
61 |
62 | After the training stage, you should apply the classifier that you have learned on the test data. For each test object (following the order in which each test object appears in the test file), you should print a line containing the following info:
63 | - object ID. This is the line number where that object occurs in the test file. Start numbering the objects with 0, not with 1.
64 | - predicted class (the result of the classification). If your classification result is a tie, choose one of them randomly.
65 | - probability of the predicted class given the data. This probability is the output of the classifier if the predicted class is 1;
66 |   if the predicted class is 0, then the probability is 1 minus the output of the classifier.
67 | - true class (should be binary, 0 or 1).
68 | - accuracy, defined as follows:
69 |   - If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
70 |   - If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
71 |   - If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
72 |   - If there were ties in your classification result, since we only have two classes, the accuracy is 0.5.
73 | To produce this output in a uniform manner, use this printing statement:
74 | 
75 | printf("ID=%5d, predicted=%3d, probability = %.4lf, true=%3d, accuracy=%4.2lf\n",
76 |        object_id, predicted_class, probability, true_class, accuracy);
77 |
78 | ### Running The Program
79 | 1) type "logistic_regression(training_filename, degree, testing_filename)
80 | 2) press enter.
81 | 3) for example : logistic_regression('pendigits_training.txt', 1, 'pendigits_test.txt')
82 |
--------------------------------------------------------------------------------
/01_MATLAB/01_Descision_Tree_and_Random_Forest/README.md:
--------------------------------------------------------------------------------
1 | ### Task 1
2 |
3 | In this task you will implement decision trees and decision forests.
4 | Your program will learn decision trees from training data and will apply decision trees and decision forests to classify test objects.
5 | Your zip file should have a folder called decision_trees, which contains your code and the README.txt file.
6 |
7 | ### Command-line Arguments
8 |
9 | Your program will be invoked as follows:
10 | dtree