├── .gitattributes ├── README.md ├── chp02 ├── binomial_tree.py ├── gibbs_gauss.py ├── imp_samp.py ├── mh_gauss2d.py ├── monte_carlo_pi.py └── random_walk.py ├── chp03 └── mean_field_mrf.py ├── chp04 ├── binary_search.py ├── binomial_coeffs.py ├── knapsack_greedy.py └── subset_gen.py ├── chp05 ├── cart.py ├── naive_bayes.py ├── perceptron.py ├── sgd_lr.py └── svm.py ├── chp06 ├── gp_reg.py ├── hierarchical_regression.py ├── knn_reg.py └── ridge_reg.py ├── chp07 ├── active_learning.py ├── adaboost_clf.py ├── bagging_clf.py ├── bayes_opt_sklearn.py ├── demo_logreg.py ├── hmm.py ├── page_rank.py ├── plot_smote_regular.py ├── plot_tomek_links.py └── stacked_clf.py ├── chp08 ├── dpmeans.py ├── gmm.py ├── manifold_learning.py └── pca.py ├── chp09 ├── ga.py ├── inv_cov.py ├── kde.py ├── lda.py ├── portfolio_opt.py └── sim_annealing.py ├── chp10 ├── image_search.py ├── keras_optimizers.py ├── lenet.py ├── lstm_sentiment.py ├── mlp.py └── multi_input_nn.py ├── chp11 ├── keras_mdn.py ├── lstm_vae.py ├── spektral_gnn.py └── transformer.py ├── data ├── NAB │ ├── artificialNoAnomaly │ │ └── art_daily_no_noise.csv │ ├── artificialWithAnomaly │ │ └── art_daily_jumpsup.csv │ └── labels │ │ └── combined_labels.json ├── cora │ ├── cora.cites │ └── cora.content └── radon.txt ├── figures ├── bayes.bmp └── meap.png └── requirements.txt /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Algorithms in Depth 2 | ML Algorithms in Depth: Bayesian Inference and Deep Learning 3 | 4 | **Chp02: Markov Chain Monte Carlo (MCMC)** 5 | - [Estimate Pi](./chp02/monte_carlo_pi.py): Monte Carlo estimate of Pi 6 | - [Binomial Tree Model](./chp02/binomial_tree.py): Monte Carlo simulation of binomial stock price 7 | - [Random Walk](./chp02/random_walk.py): self-avoiding random walk 8 | - [Gibbs Sampling](./chp02/gibbs_gauss.py): Gibbs sampling of multivariate Gaussian distribution 9 | - [Metropolis-Hastings Sampling](./chp02/mh_gauss2d.py): Metropolis-Hastings sampling of multivariate Gaussian mixture 10 | - [Importance Sampling](./chp02/imp_samp.py): importance sampling for finding expected value of a function 11 | 12 | **Chp03: Variational Inference (VI)** 13 | - [Mean Field VI](./chp03/mean_field_mrf.py): image denoising in Ising model 14 | 15 | **Chp04: Software Implementation** 16 | - [Subset Generation](./chp04/subset_gen.py): a complete search algorithm 17 | - [Fractional Knapsack](./chp04/knapsack_greedy.py): a greedy algorithm 18 | - [Binary Search](./chp04/binary_search.py): a divide and conquer algorithm 19 | - [Binomial Coefficients](./chp04/binomial_coeffs.py): a dynamic programming algorithm 20 | 21 | **Chp05: Classification Algorithms** 22 | - [Perceptron](./chp05/perceptron.py): perceptron algorithm 23 | - [SVM](./chp05/svm.py): support vector machine 24 | - [SGD-LR](./chp05/sgd_lr.py): stochastic gradient descent logistic regression 25 | - [Naive Bayes](./chp05/naive_bayes.py): Bernoulli Naive Bayes algorithm 26 | - [CART](./chp05/cart.py): decision tree classification algorithm 27 | 28 | **Chp06: Regression Algorithms** 29 | - [KNN](./chp06/knn_reg.py): K-Nearest Neighbors regression 30 | - [BLR](./chp06/ridge_reg.py): Bayesian linear regression 31 | - [HBR](./chp06/hierarchical_regression.py): 
Hierarchical Bayesian regression 32 | - [GPR](./chp06/gp_reg.py): Gaussian Process regression 33 | 34 | **Chp07: Selected Supervised Learning Algorithms** 35 | - [Page Rank](./chp07/page_rank.py): Google PageRank algorithm 36 | - [HMM](./chp07/hmm.py): EM algorithm for Hidden Markov Models 37 | - Imbalanced Learning: [Tomek Links](./chp07/plot_tomek_links.py), [SMOTE](./chp07/plot_smote_regular.py) 38 | - Active Learning: [LR](./chp07/demo_logreg.py) 39 | - Bayesian Optimization: [BO](./chp07/bayes_opt_sklearn.py) 40 | - Ensemble Learning: [Bagging](./chp07/bagging_clf.py), [Boosting](./chp07/adaboost_clf.py), [Stacking](./chp07/stacked_clf.py) 41 | 42 | **Chp08: Unsupervised Learning Algorithms** 43 | - [DP-Means](./chp08/dpmeans.py): Dirichlet Process (DP) K-Means 44 | - [EM-GMM](./chp08/gmm.py): EM algorithm for Gaussian Mixture Models 45 | - [PCA](./chp08/pca.py): Principal Component Analysis 46 | - [t-SNE](./chp08/manifold_learning.py): t-SNE manifold learning 47 | 48 | **Chp09: Selected Unsupervised Learning Algorithms** 49 | - [LDA](./chp09/lda.py): Variational Inference for Latent Dirichlet Allocation 50 | - [KDE](./chp09/kde.py): Kernel Density Estimator 51 | - [TPO](./chp09/portfolio_opt.py): Tangent Portfolio Optimization 52 | - [ICE](./chp09/inv_cov.py): Inverse Covariance Estimation 53 | - [SA](./chp09/sim_annealing.py): Simulated Annealing 54 | - [GA](./chp09/ga.py): Genetic Algorithm 55 | 56 | **Chp10: Fundamental Deep Learning Algorithms** 57 | - [MLP](./chp10/mlp.py): Multi-Layer Perceptron 58 | - [LeNet](./chp10/lenet.py): LeNet for MNIST digit classification 59 | - [ResNet](./chp10/image_search.py): ResNet50 image search on CalTech101 dataset 60 | - [LSTM](./chp10/lstm_sentiment.py): LSTM sentiment classification of IMDB movie dataset 61 | - [MINN](./chp10/multi_input_nn.py): Multi-Input Neural Net model for sequence similarity on the Quora question pairs dataset 62 | - [OPT](./chp10/keras_optimizers.py): Neural Net Optimizers 63 | 64 | **Chp11: Advanced Deep Learning Algorithms** 65 | - [LSTM-VAE](./chp11/lstm_vae.py): LSTM Variational Autoencoder for time-series anomaly detection 66 | - [MDN](./chp11/keras_mdn.py): Mixture Density Network 67 | - [Transformer](./chp11/transformer.py): Transformer for text classification 68 | - [GNN](./chp11/spektral_gnn.py): Graph Neural Network 69 | 70 | **Environment** 71 | 72 | To install the required libraries, run the following commands: 73 | 74 | ``` 75 | python3 -m venv ml-algo 76 | 77 | source ml-algo/bin/activate # in Linux 78 | .\ml-algo\Scripts\activate.bat # in Windows CMD 79 | .\ml-algo\Scripts\Activate.ps1 # in Windows PowerShell 80 | 81 | pip install -r requirements.txt 82 | ``` 83 | 84 | **Manning Early Access Program (MEAP)** 85 | 86 | This book is now available in the Manning Early Access Program (MEAP). 87 | Link to book: https://www.manning.com/books/machine-learning-algorithms-in-depth 88 | 89 |
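**Quick Start**

Once the environment above is activated and the requirements are installed, most chapter scripts are self-contained examples with their own `__main__` entry point and can be run directly from the repository root. A few illustrative invocations (most scripts display matplotlib figures when run):

```
python3 chp02/monte_carlo_pi.py   # Monte Carlo estimate of Pi
python3 chp05/perceptron.py       # perceptron classifier on the Iris dataset
python3 chp06/gp_reg.py           # Gaussian Process regression with a squared-exponential kernel
```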
90 | 91 |
92 | 93 | It will help you develop mathematical intuition for classic and modern ML algorithms, learn the fundamentals of Bayesian inference and deep learning, as well as data structures and algorithmic paradigms in ML! 94 | 95 | **Citation** 96 | 97 | You are welcome to cite the book as follows: 98 | 99 | ``` 100 | @book{MLAlgoInDepth, 101 | author = {Vadim Smolyakov}, 102 | title = {Machine Learning Algorithms in Depth}, 103 | year = {2023}, 104 | isbn = {9781633439214}, 105 | publisher = {Manning Publications} 106 | } 107 | ``` 108 | -------------------------------------------------------------------------------- /chp02/binomial_tree.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import seaborn as sns 4 | import matplotlib.pyplot as plt 5 | 6 | np.random.seed(42) 7 | 8 | def binomial_tree(mu, sigma, S0, N, T, step): 9 | 10 | #compute state price and probability 11 | u = np.exp(sigma * np.sqrt(step)) #up state price 12 | d = 1.0/u #down state price 13 | p = 0.5+0.5*(mu/sigma)*np.sqrt(step) #prob of up state 14 | 15 | #binomial tree simulation 16 | up_times = np.zeros((N, len(T))) 17 | down_times = np.zeros((N, len(T))) 18 | for idx in range(len(T)): 19 | up_times[:,idx] = np.random.binomial(T[idx]/step, p, N) 20 | down_times[:,idx] = T[idx]/step - up_times[:,idx] 21 | 22 | #compute terminal price 23 | ST = S0 * u**up_times * d**down_times 24 | 25 | #generate plots 26 | plt.figure() 27 | plt.plot(ST[:,0], color='b', alpha=0.5, label='1 month horizon') 28 | plt.plot(ST[:,1], color='r', alpha=0.5, label='1 year horizon') 29 | plt.xlabel('time step, day') 30 | plt.ylabel('price') 31 | plt.title('Binomial-Tree Stock Simulation') 32 | plt.legend() 33 | plt.show() 34 | 35 | plt.figure() 36 | plt.hist(ST[:,0], color='b', alpha=0.5, label='1 month horizon') 37 | plt.hist(ST[:,1], color='r', alpha=0.5, label='1 year horizon') 38 | plt.xlabel('price') 39 | plt.ylabel('count') 40 | plt.title('Binomial-Tree Stock Simulation') 41 | plt.legend() 42 | plt.show() 43 | 44 | 45 | if __name__ == "__main__": 46 | 47 | #model parameters 48 | mu = 0.1 #mean 49 | sigma = 0.15 #volatility 50 | S0 = 1 #starting price 51 | 52 | N = 10000 #number of simulations 53 | T = [21.0/252, 1.0] #time horizon in years 54 | step = 1.0/252 #time step in years 55 | 56 | binomial_tree(mu, sigma, S0, N, T, step) 57 | 58 | -------------------------------------------------------------------------------- /chp02/gibbs_gauss.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | import itertools 5 | from numpy.linalg import inv 6 | from scipy.stats import multivariate_normal 7 | 8 | np.random.seed(42) 9 | 10 | class gibbs_gauss: 11 | 12 | def gauss_conditional(self, mu, Sigma, setA, x): 13 | #computes P(X_A | X_B = x) = N(mu_{A|B}, Sigma_{A|B}) 14 | dim = len(mu) 15 | setU = set(range(dim)) 16 | setB = setU.difference(setA) 17 | muA = np.array([mu[item] for item in setA]).reshape(-1,1) 18 | muB = np.array([mu[item] for item in setB]).reshape(-1,1) 19 | xB = np.array([x[item] for item in setB]).reshape(-1,1) 20 | 21 | Sigma_AA = [] 22 | for (idx1, idx2) in itertools.product(setA, setA): 23 | Sigma_AA.append(Sigma[idx1][idx2]) 24 | Sigma_AA = np.array(Sigma_AA).reshape(len(setA),len(setA)) 25 | 26 | Sigma_AB = [] 27 | for (idx1, idx2) in itertools.product(setA, setB): 28 | Sigma_AB.append(Sigma[idx1][idx2]) 29 | Sigma_AB = np.array(Sigma_AB).reshape(len(setA),len(setB)) 30 
| 31 | Sigma_BB = [] 32 | for (idx1, idx2) in itertools.product(setB, setB): 33 | Sigma_BB.append(Sigma[idx1][idx2]) 34 | Sigma_BB = np.array(Sigma_BB).reshape(len(setB),len(setB)) 35 | 36 | Sigma_BB_inv = inv(Sigma_BB) 37 | mu_AgivenB = muA + np.matmul(np.matmul(Sigma_AB, Sigma_BB_inv), xB - muB) 38 | Sigma_AgivenB = Sigma_AA - np.matmul(np.matmul(Sigma_AB, Sigma_BB_inv), np.transpose(Sigma_AB)) 39 | 40 | return mu_AgivenB, Sigma_AgivenB 41 | 42 | def sample(self, mu, Sigma, xinit, num_samples): 43 | dim = len(mu) 44 | samples = np.zeros((num_samples, dim)) 45 | x = xinit 46 | for s in range(num_samples): 47 | for d in range(dim): 48 | mu_AgivenB, Sigma_AgivenB = self.gauss_conditional(mu, Sigma, set([d]), x) 49 | x[d] = np.random.normal(mu_AgivenB, np.sqrt(Sigma_AgivenB)) 50 | #end for 51 | samples[s,:] = np.transpose(x) 52 | #end for 53 | return samples 54 | 55 | if __name__ == "__main__": 56 | 57 | num_samples = 2000 58 | mu = [1, 1] 59 | Sigma = [[2,1], [1,1]] 60 | xinit = np.random.rand(len(mu),1) 61 | num_burnin = 1000 62 | 63 | gg = gibbs_gauss() 64 | gibbs_samples = gg.sample(mu, Sigma, xinit, num_samples) 65 | 66 | scipy_samples = multivariate_normal.rvs(mean=mu, cov=Sigma, size=num_samples, random_state=42) 67 | 68 | plt.figure() 69 | plt.scatter(gibbs_samples[num_burnin:,0], gibbs_samples[num_burnin:,1], c = 'blue', marker='s', alpha=0.8, label='Gibbs Samples') 70 | plt.scatter(scipy_samples[num_burnin:,0], scipy_samples[num_burnin:,1], c = 'red', alpha=0.8, label='Ground Truth Samples') 71 | plt.grid(True); plt.legend(); plt.xlim([-4,5]) 72 | plt.title("Gibbs Sampling of Multivariate Gaussian"); plt.xlabel("X1"); plt.ylabel("X2") 73 | #plt.savefig("./figures/gibbs_gauss.png") 74 | plt.show() 75 | 76 | 77 | -------------------------------------------------------------------------------- /chp02/imp_samp.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from scipy.integrate import quad 5 | from scipy.stats import multivariate_normal 6 | 7 | np.random.seed(42) 8 | 9 | class importance_sampler: 10 | # E[f(x)] = int_x f(x)p(x)dx = int_x f(x)[p(x)/q(x)]q(x) dx 11 | # = sum_i f(x_i)w(x_i), where x_i ~ q(x) 12 | # e.g. for f(x) = 1(x \in A), E[f(x)] = P(A) 13 | 14 | def __init__(self, k=1.5, mu=0.8, sigma=np.sqrt(1.5), c=3): 15 | #target params p(x) 16 | self.k = k 17 | 18 | #proposal params q(x) 19 | self.mu = mu 20 | self.sigma = sigma 21 | self.c = c #fix c, s.t. 
p(x) < c q(x) 22 | 23 | def target_pdf(self, x): 24 | #p(x) ~ Chi(k=1.5) 25 | return (x**(self.k-1)) * np.exp(-x**2/2.0) 26 | 27 | def proposal_pdf(self, x): 28 | #q(x) ~ N(mu,sigma) 29 | return self.c * 1.0/np.sqrt(2*np.pi*1.5) * np.exp(-(x-self.mu)**2/(2*self.sigma**2)) 30 | 31 | def fx(self, x): 32 | #function of interest f(x), x >= 0 33 | return 2*np.sin((np.pi/1.5)*x) 34 | 35 | def sample(self, num_samples): 36 | #sample from the proposal 37 | x = multivariate_normal.rvs(self.mu, self.sigma, num_samples) 38 | 39 | #discard netgative samples (since f(x) is defined for x >= 0) 40 | idx = np.where(x >= 0) 41 | x_pos = x[idx] 42 | 43 | #compute importance weights 44 | isw = self.target_pdf(x_pos) / self.proposal_pdf(x_pos) 45 | 46 | #compute E[f(x)] = sum_i f(x_i)w(x_i), where x_i ~ q(x) 47 | fw = (isw/np.sum(isw))*self.fx(x_pos) 48 | f_est = np.sum(fw) 49 | 50 | return isw, f_est 51 | 52 | 53 | if __name__ == "__main__": 54 | 55 | num_samples = [10, 100, 1000, 10000, 100000, 1000000] 56 | 57 | F_est_iter, IS_weights_var_iter = [], [] 58 | for k in num_samples: 59 | IS = importance_sampler() 60 | IS_weights, F_est = IS.sample(k) 61 | IS_weights_var = np.var(IS_weights/np.sum(IS_weights)) 62 | F_est_iter.append(F_est) 63 | IS_weights_var_iter.append(IS_weights_var) 64 | 65 | #ground truth (numerical integration) 66 | k = 1.5 67 | I_gt, _ = quad(lambda x: 2.0*np.sin((np.pi/1.5)*x)*(x**(k-1))*np.exp(-x**2/2.0), 0, 5) 68 | 69 | #generate plots 70 | plt.figure() 71 | xx = np.linspace(0,8,100) 72 | plt.plot(xx, IS.target_pdf(xx), '-r', label='target pdf p(x)') 73 | plt.plot(xx, IS.proposal_pdf(xx), '--b', label='proposal pdf q(x)') 74 | plt.plot(xx, IS.fx(xx) * IS.target_pdf(xx), ':k', label='p(x)f(x) integrand') 75 | plt.grid(True); plt.legend(); plt.xlabel("X1"); plt.ylabel("X2") 76 | plt.title("Importance Sampling Components") 77 | #plt.savefig('./figures/importance_sampling.png') 78 | plt.show() 79 | 80 | plt.figure() 81 | plt.hist(IS_weights, label = "IS weights") 82 | plt.grid(True); plt.legend(); 83 | plt.title("Importance Weights Histogram") 84 | #plt.savefig('./figures/importance_weights.png') 85 | plt.show() 86 | 87 | plt.figure() 88 | plt.semilogx(num_samples, F_est_iter, '-b', label = "IS Estimate of E[f(x)]") 89 | plt.semilogx(num_samples, I_gt*np.ones(len(num_samples)), '--r', label = "Ground Truth") 90 | plt.grid(True); plt.legend(); plt.xlabel('iterations'); plt.ylabel("E[f(x)] estimate") 91 | plt.title("IS Estimate of E[f(x)]") 92 | #plt.savefig('./figures/importance_estimate.png') 93 | plt.show() 94 | -------------------------------------------------------------------------------- /chp02/mh_gauss2d.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from scipy.stats import uniform 5 | from scipy.stats import multivariate_normal 6 | 7 | np.random.seed(42) 8 | 9 | class mh_gauss: 10 | def __init__(self, dim, K, num_samples, target_mu, target_sigma, target_pi, proposal_mu, proposal_sigma): 11 | #target params: p(x) = \sum_k pi(k) N(x; mu_k,Sigma_k) 12 | self.dim = dim 13 | self.K = K 14 | self.num_samples = num_samples 15 | self.target_mu = target_mu 16 | self.target_sigma = target_sigma 17 | self.target_pi = target_pi 18 | 19 | #proposal params: q(x) = N(x; mu, Sigma) 20 | self.proposal_mu = proposal_mu 21 | self.proposal_sigma = proposal_sigma 22 | 23 | #sample chain params 24 | self.n_accept = 0 25 | self.alpha = np.zeros(self.num_samples) 26 | self.mh_samples = 
np.zeros((self.num_samples, self.dim)) 27 | 28 | def target_pdf(self, x): 29 | #p(x) = \sum_k pi(k) N(x; mu_k,Sigma_k) 30 | prob = 0 31 | for k in range(self.K): 32 | prob += self.target_pi[k]*\ 33 | multivariate_normal.pdf(x,self.target_mu[:,k],self.target_sigma[:,:,k]) 34 | #end for 35 | return prob 36 | 37 | def proposal_pdf(self, x, mu): 38 | #q(x) = N(x; mu, Sigma) 39 | return multivariate_normal.pdf(x, mu, self.proposal_sigma) 40 | 41 | def sample(self): 42 | #draw init sample from proposal 43 | #import pdb; pdb.set_trace() 44 | x_init = multivariate_normal.rvs(self.proposal_mu, self.proposal_sigma, 1) 45 | self.mh_samples[0,:] = x_init 46 | 47 | for i in range(self.num_samples-1): 48 | x_curr = self.mh_samples[i,:] 49 | x_new = multivariate_normal.rvs(x_curr, self.proposal_sigma, 1) 50 | 51 | #MH ratio 52 | self.alpha[i] = self.proposal_pdf(x_curr, x_new) / self.proposal_pdf(x_new, x_curr) #q(x|x')/q(x'|x) 53 | self.alpha[i] = self.alpha[i] * (self.target_pdf(x_new)/self.target_pdf(x_curr)) #alpha x p(x')/p(x) 54 | 55 | #MH acceptance probability 56 | r = min(1, self.alpha[i]) 57 | u = uniform.rvs(loc=0, scale=1, size=1) 58 | if (u <= r): 59 | self.n_accept += 1 60 | self.mh_samples[i+1,:] = x_new #accept 61 | else: 62 | self.mh_samples[i+1,:] = x_curr #reject 63 | #end for 64 | print("MH acceptance ratio: ", self.n_accept/float(self.num_samples)) 65 | 66 | if __name__ == "__main__": 67 | 68 | dim = 2 69 | K = 2 70 | num_samples = 5000 71 | target_mu = np.zeros((dim,K)) 72 | target_mu[:,0] = [4,0] 73 | target_mu[:,1] = [-4,0] 74 | target_sigma = np.zeros((dim, dim, K)) 75 | target_sigma[:,:,0] = [[2,1],[1,1]] 76 | target_sigma[:,:,1] = [[1,0],[0,1]] 77 | target_pi = np.array([0.4, 0.6]) 78 | 79 | proposal_mu = np.zeros((dim,1)).flatten() 80 | proposal_sigma = 10*np.eye(dim) 81 | 82 | mhg = mh_gauss(dim, K, num_samples, target_mu, target_sigma, target_pi, proposal_mu, proposal_sigma) 83 | mhg.sample() 84 | 85 | plt.figure() 86 | plt.scatter(mhg.mh_samples[:,0], mhg.mh_samples[:,1], label='MH samples') 87 | plt.grid(True); plt.legend() 88 | plt.title("Metropolis-Hastings Sampling of 2D Gaussian Mixture") 89 | plt.xlabel("X1"); plt.ylabel("X2") 90 | #plt.savefig("./figures/mh_gauss2d.png") 91 | plt.show() -------------------------------------------------------------------------------- /chp02/monte_carlo_pi.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(42) 5 | 6 | def pi_est(radius=1, num_iter=int(1e4)): 7 | 8 | X = np.random.uniform(-radius,+radius,num_iter) 9 | Y = np.random.uniform(-radius,+radius,num_iter) 10 | 11 | R2 = X**2 + Y**2 12 | inside = R2 < radius**2 13 | outside = ~inside 14 | 15 | samples = (2*radius)*(2*radius)*inside 16 | 17 | I_hat = np.mean(samples) 18 | pi_hat = I_hat/radius ** 2 19 | pi_hat_se = np.std(samples)/np.sqrt(num_iter) 20 | print("pi est: {} +/- {:f}".format(pi_hat, pi_hat_se)) 21 | 22 | plt.figure() 23 | plt.scatter(X[inside],Y[inside], c='b', alpha=0.5) 24 | plt.scatter(X[outside],Y[outside], c='r', alpha=0.5) 25 | plt.show() 26 | 27 | if __name__ == "__main__": 28 | 29 | pi_est() 30 | 31 | -------------------------------------------------------------------------------- /chp02/random_walk.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | np.random.seed(42) 6 | 7 | def rand_walk(num_step, num_iter, moves): 8 
| 9 | #random walk stats 10 | square_dist = np.zeros(num_iter) 11 | weights = np.zeros(num_iter) 12 | 13 | for it in range(num_iter): 14 | 15 | trial = 0 16 | i = 1 17 | 18 | #iterate until we have a non-crossing random walk 19 | while i != num_step-1: 20 | 21 | #init 22 | X, Y = 0, 0 23 | weight = 1 24 | lattice = np.zeros((2*num_step+1, 2*num_step+1)) 25 | lattice[num_step+1,num_step+1] = 1 26 | path = np.array([0, 0]) 27 | xx = num_step + 1 + X 28 | yy = num_step + 1 + Y 29 | 30 | print("iter: %d, trial %d" %(it, trial)) 31 | 32 | for i in range(num_step): 33 | 34 | up = lattice[xx,yy+1] 35 | down = lattice[xx,yy-1] 36 | left = lattice[xx-1,yy] 37 | right = lattice[xx+1,yy] 38 | 39 | #compute available directions 40 | neighbors = np.array([1, 1, 1, 1]) - np.array([up, down, left, right]) 41 | 42 | #avoid self-loops 43 | if (np.sum(neighbors) == 0): 44 | i = 1 45 | break 46 | #end if 47 | 48 | #compute importance weights: d0 x d1 x ... x d_{n-1} 49 | weight = weight * np.sum(neighbors) 50 | 51 | #sample a move direction 52 | direction = np.where(np.random.rand() < np.cumsum(neighbors/float(sum(neighbors)))) 53 | 54 | X = X + moves[direction[0][0],0] 55 | Y = Y + moves[direction[0][0],1] 56 | 57 | #store sampled path 58 | path_new = np.array([X,Y]) 59 | path = np.vstack((path,path_new)) 60 | 61 | #update grid coordinates 62 | xx = num_step + 1 + X 63 | yy = num_step + 1 + Y 64 | lattice[xx,yy] = 1 65 | #end for 66 | 67 | trial = trial + 1 68 | #end while 69 | 70 | #compute square extension 71 | square_dist[it] = X**2 + Y**2 72 | 73 | #store importance weights 74 | weights[it] = weight 75 | #end for 76 | 77 | #compute mean square extension 78 | mean_square_dist = np.mean(weights * square_dist)/np.mean(weights) 79 | print("mean square dist: ", mean_square_dist) 80 | 81 | #generate plots 82 | plt.figure() 83 | for i in range(num_step-1): 84 | plt.plot(path[i,0], path[i,1], path[i+1,0], path[i+1,1], 'ob') 85 | plt.title('random walk with no overlaps') 86 | plt.xlabel('X') 87 | plt.ylabel('Y') 88 | plt.show() 89 | 90 | plt.figure() 91 | sns.displot(square_dist) 92 | plt.xlim(0,np.max(square_dist)) 93 | plt.title('square distance of the random walk') 94 | plt.xlabel('square distance (X^2 + Y^2)') 95 | plt.show() 96 | 97 | 98 | if __name__ == "__main__": 99 | 100 | num_step = 150 #number of steps in a random walk 101 | num_iter = 100 #number of iterations for averaging results 102 | moves = np.array([[0, 1],[0, -1],[-1, 0],[1, 0]]) #2-D moves 103 | 104 | rand_walk(num_step, num_iter, moves) -------------------------------------------------------------------------------- /chp03/mean_field_mrf.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from scipy.special import expit as sigmoid 10 | from scipy.stats import multivariate_normal 11 | 12 | np.random.seed(42) 13 | sns.set_style('whitegrid') 14 | 15 | class image_denoising: 16 | 17 | def __init__(self, img_binary, sigma=2, J=1): 18 | 19 | #mean-field parameters 20 | self.sigma = sigma #noise level 21 | self.y = img_binary + self.sigma*np.random.randn(M, N) #y_i ~ N(x_i; sigma^2); 22 | self.J = J #coupling strength (w_ij) 23 | self.rate = 0.5 #update smoothing rate 24 | self.max_iter = 15 25 | self.ELBO = np.zeros(self.max_iter) 26 | self.Hx_mean = np.zeros(self.max_iter) 27 | 28 | def mean_field(self): 29 | 30 | #Mean-Field VI 31 | 
print("running mean-field variational inference...") 32 | logodds = multivariate_normal.logpdf(self.y.flatten(), mean=+1, cov=self.sigma**2) - \ 33 | multivariate_normal.logpdf(self.y.flatten(), mean=-1, cov=self.sigma**2) 34 | logodds = np.reshape(logodds, (M, N)) 35 | 36 | #init 37 | p1 = sigmoid(logodds) 38 | mu = 2*p1-1 #mu_init 39 | 40 | a = mu + 0.5 * logodds 41 | qxp1 = sigmoid(+2*a) #q_i(x_i=+1) 42 | qxm1 = sigmoid(-2*a) #q_i(x_i=-1) 43 | 44 | logp1 = np.reshape(multivariate_normal.logpdf(self.y.flatten(), mean=+1, cov=self.sigma**2), (M, N)) 45 | logm1 = np.reshape(multivariate_normal.logpdf(self.y.flatten(), mean=-1, cov=self.sigma**2), (M, N)) 46 | 47 | for i in tqdm(range(self.max_iter)): 48 | muNew = mu 49 | for ix in range(N): 50 | for iy in range(M): 51 | pos = iy + M*ix 52 | neighborhood = pos + np.array([-1,1,-M,M]) 53 | boundary_idx = [iy!=0,iy!=M-1,ix!=0,ix!=N-1] 54 | neighborhood = neighborhood[np.where(boundary_idx)[0]] 55 | xx, yy = np.unravel_index(pos, (M,N), order='F') 56 | nx, ny = np.unravel_index(neighborhood, (M,N), order='F') 57 | 58 | Sbar = self.J*np.sum(mu[nx,ny]) 59 | muNew[xx,yy] = (1-self.rate)*muNew[xx,yy] + self.rate*np.tanh(Sbar + 0.5*logodds[xx,yy]) 60 | self.ELBO[i] = self.ELBO[i] + 0.5*(Sbar * muNew[xx,yy]) 61 | #end for 62 | #end for 63 | mu = muNew 64 | 65 | a = mu + 0.5 * logodds 66 | qxp1 = sigmoid(+2*a) #q_i(x_i=+1) 67 | qxm1 = sigmoid(-2*a) #q_i(x_i=-1) 68 | Hx = -qxm1*np.log(qxm1+1e-10) - qxp1*np.log(qxp1+1e-10) #entropy 69 | 70 | self.ELBO[i] = self.ELBO[i] + np.sum(qxp1*logp1 + qxm1*logm1) + np.sum(Hx) 71 | self.Hx_mean[i] = np.mean(Hx) 72 | #end for 73 | return mu 74 | 75 | if __name__ == "__main__": 76 | 77 | #load data 78 | print("loading data...") 79 | data = Image.open('./figures/bayes.bmp') 80 | img = np.double(data) 81 | img_mean = np.mean(img) 82 | img_binary = +1*(img>img_mean) + -1*(img x: 10 | return binary_search(arr, l, mid-1, x) 11 | else: 12 | return binary_search(arr, mid+1, r, x) 13 | #end if 14 | else: 15 | return -1 16 | 17 | if __name__ == "__main__": 18 | 19 | x = 5 20 | arr = sorted([1, 7, 8, 3, 2, 5]) 21 | 22 | print(arr) 23 | print("binary search:") 24 | result = binary_search(arr, 0, len(arr)-1, x) 25 | 26 | if result != -1: 27 | print("element {} is found at index {}.".format(x, result)) 28 | else: 29 | print("element is not found.") 30 | -------------------------------------------------------------------------------- /chp04/binomial_coeffs.py: -------------------------------------------------------------------------------- 1 | 2 | def binomial_coeffs1(n, k): 3 | #top down DP 4 | if (k == 0 or k == n): 5 | return 1 6 | if (memo[n][k] != -1): 7 | return memo[n][k] 8 | 9 | memo[n][k] = binomial_coeffs1(n-1, k-1) + binomial_coeffs1(n-1, k) 10 | return memo[n][k] 11 | 12 | def binomial_coeffs2(n, k): 13 | #bottom up DP 14 | for i in range(n+1): 15 | for j in range(min(i,k)+1): 16 | if (j == 0 or j == i): 17 | memo[i][j] = 1 18 | else: 19 | memo[i][j] = memo[i-1][j-1] + memo[i-1][j] 20 | #end if 21 | #end for 22 | #end for 23 | return memo[n][k] 24 | 25 | def print_array(memo): 26 | for i in range(len(memo)): 27 | print('\t'.join([str(x) for x in memo[i]])) 28 | 29 | 30 | if __name__ == "__main__": 31 | 32 | n = 5 33 | k = 2 34 | 35 | print("top down DP") 36 | memo = [[-1 for i in range(6)] for j in range(6)] 37 | nCk = binomial_coeffs1(n, k) 38 | print_array(memo) 39 | print("C(n={}, k={}) = {}".format(n,k,nCk)) 40 | 41 | print("bottom up DP") 42 | memo = [[-1 for i in range(6)] for j in range(6)] 43 | nCk = 
binomial_coeffs2(n, k) 44 | print_array(memo) 45 | print("C(n={}, k={}) = {}".format(n,k,nCk)) 46 | 47 | 48 | -------------------------------------------------------------------------------- /chp04/knapsack_greedy.py: -------------------------------------------------------------------------------- 1 | class Item: 2 | def __init__(self, wt, val, ind): 3 | self.wt = wt 4 | self.val = val 5 | self.ind = ind 6 | self.cost = val // wt 7 | 8 | def __lt__(self, other): 9 | return self.cost < other.cost 10 | 11 | class FractionalKnapSack: 12 | def get_max_value(self, wt, val, capacity): 13 | 14 | item_list = [] 15 | for i in range(len(wt)): 16 | item_list.append(Item(wt[i], val[i], i)) 17 | 18 | # sorting items by cost heuristic 19 | item_list.sort(reverse = True) #O(nlogn) 20 | 21 | total_value = 0 22 | for i in item_list: 23 | cur_wt = int(i.wt) 24 | cur_val = int(i.val) 25 | if capacity - cur_wt >= 0: 26 | capacity -= cur_wt 27 | total_value += cur_val 28 | else: 29 | fraction = capacity / cur_wt 30 | total_value += cur_val * fraction 31 | capacity = int(capacity - (cur_wt * fraction)) 32 | break 33 | return total_value 34 | 35 | if __name__ == "__main__": 36 | wt = [10, 20, 30] 37 | val = [60, 100, 120] 38 | capacity = 50 39 | 40 | fk = FractionalKnapSack() 41 | max_value = fk.get_max_value(wt, val, capacity) 42 | print("greedy fractional knapsack") 43 | print("maximum value: ", max_value) 44 | 45 | -------------------------------------------------------------------------------- /chp04/subset_gen.py: -------------------------------------------------------------------------------- 1 | def search(k, n): 2 | if (k == n): 3 | #process subset 4 | print(subset) 5 | else: 6 | search(k+1, n) 7 | subset.append(k) 8 | search(k+1, n) 9 | subset.pop() 10 | #end if 11 | 12 | def bitseq(n): 13 | for b in range(1 << n): 14 | subset = [] 15 | for i in range(n): 16 | if (b & 1 << i): 17 | subset.append(i) 18 | #end for 19 | print(subset) 20 | #end for 21 | 22 | if __name__ == "__main__": 23 | n = 4 24 | subset = [] 25 | search(0, n) #recursive 26 | 27 | #subset = [] 28 | #bitseq(n) #iterative 29 | -------------------------------------------------------------------------------- /chp05/cart.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.datasets import load_iris 5 | from sklearn.metrics import accuracy_score 6 | from sklearn.model_selection import train_test_split 7 | 8 | class TreeNode(): 9 | def __init__(self, gini, num_samples, num_samples_class, class_label): 10 | self.gini = gini #gini cost 11 | self.num_samples = num_samples #size of node 12 | self.num_samples_class = num_samples_class #number of node pts with label k 13 | self.class_label = class_label #predicted class label 14 | self.feature_idx = 0 #idx of feature to split on 15 | self.treshold = 0 #best threshold to split on 16 | self.left = None #left subtree pointer 17 | self.right = None #right subtree pointer 18 | 19 | class DecisionTreeClassifier(): 20 | def __init__(self, max_depth = None): 21 | self.max_depth = max_depth 22 | 23 | def best_split(self, X_train, y_train): 24 | m = y_train.size 25 | if (m <= 1): 26 | return None, None 27 | 28 | #number of points of class k 29 | mk = [np.sum(y_train == k) for k in range(self.num_classes)] 30 | 31 | #gini of current node 32 | best_gini = 1.0 - sum((n / m) ** 2 for n in mk) 33 | best_idx, best_thr = None, None 34 | 35 | #iterate over all features 36 | for idx in 
range(self.num_features): 37 | # sort data along selected feature 38 | thresholds, classes = zip(*sorted(zip(X[:, idx], y))) 39 | 40 | num_left = [0]*self.num_classes 41 | num_right = mk.copy() 42 | 43 | #iterate overall possible split positions 44 | for i in range(1, m): 45 | 46 | k = classes[i-1] 47 | 48 | num_left[k] += 1 49 | num_right[k] -= 1 50 | 51 | gini_left = 1.0 - sum( 52 | (num_left[x] / i) ** 2 for x in range(self.num_classes) 53 | ) 54 | 55 | gini_right = 1.0 - sum( 56 | (num_right[x] / (m - i)) ** 2 for x in range(self.num_classes) 57 | ) 58 | 59 | gini = (i * gini_left + (m - i) * gini_right) / m 60 | 61 | # check that we don't try to split two pts with identical values 62 | if thresholds[i] == thresholds[i - 1]: 63 | continue 64 | 65 | if (gini < best_gini): 66 | best_gini = gini 67 | best_idx = idx 68 | best_thr = (thresholds[i] + thresholds[i - 1]) / 2 # midpoint 69 | #end if 70 | #end for 71 | #end for 72 | return best_idx, best_thr 73 | 74 | def gini(self, y_train): 75 | m = y_train.size 76 | return 1.0 - sum((np.sum(y_train == k) / m) ** 2 for k in range(self.num_classes)) 77 | 78 | def fit(self, X_train, y_train): 79 | self.num_classes = len(set(y_train)) 80 | self.num_features = X_train.shape[1] 81 | self.tree = self.grow_tree(X_train, y_train) 82 | 83 | def grow_tree(self, X_train, y_train, depth=0): 84 | 85 | num_samples_class = [np.sum(y_train == k) for k in range(self.num_classes)] 86 | class_label = np.argmax(num_samples_class) 87 | 88 | node = TreeNode( 89 | gini=self.gini(y_train), 90 | num_samples=y_train.size, 91 | num_samples_class=num_samples_class, 92 | class_label=class_label, 93 | ) 94 | 95 | # split recursively until maximum depth is reached 96 | if depth < self.max_depth: 97 | idx, thr = self.best_split(X_train, y_train) 98 | if idx is not None: 99 | indices_left = X_train[:, idx] < thr 100 | X_left, y_left = X_train[indices_left], y_train[indices_left] 101 | X_right, y_right = X_train[~indices_left], y_train[~indices_left] 102 | node.feature_index = idx 103 | node.threshold = thr 104 | node.left = self.grow_tree(X_left, y_left, depth + 1) 105 | node.right = self.grow_tree(X_right, y_right, depth + 1) 106 | 107 | return node 108 | 109 | def predict(self, X_test): 110 | return [self.predict_helper(x_test) for x_test in X_test] 111 | 112 | def predict_helper(self, x_test): 113 | node = self.tree 114 | while node.left: 115 | if x_test[node.feature_index] < node.threshold: 116 | node = node.left 117 | else: 118 | node = node.right 119 | return node.class_label 120 | 121 | 122 | if __name__ == "__main__": 123 | 124 | #load data 125 | iris = load_iris() 126 | X = iris.data[:, [2,3]] 127 | y = iris.target 128 | 129 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 130 | 131 | print("decision tree classifier...") 132 | tree_clf = DecisionTreeClassifier(max_depth = 3) 133 | tree_clf.fit(X_train, y_train) 134 | 135 | print("prediction...") 136 | y_pred = tree_clf.predict(X_test) 137 | 138 | tree_clf_acc = accuracy_score(y_test, y_pred) 139 | print("test set accuracy: ", tree_clf_acc) 140 | -------------------------------------------------------------------------------- /chp05/naive_bayes.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from time import time 6 | from nltk.corpus import stopwords 7 | from nltk.tokenize import RegexpTokenizer 8 | 9 | from sklearn.metrics import accuracy_score 10 
| from sklearn.datasets import fetch_20newsgroups 11 | from sklearn.model_selection import train_test_split 12 | from sklearn.feature_extraction.text import CountVectorizer 13 | 14 | sns.set_style("whitegrid") 15 | tokenizer = RegexpTokenizer(r'\w+') 16 | stop_words = set(stopwords.words('english')) 17 | stop_words.update(['s','t','m','1','2']) 18 | 19 | class naive_bayes: 20 | def __init__(self, K, D): 21 | self.K = K #number of classes 22 | self.D = D #dictionary size 23 | 24 | self.pi = np.ones(K) #class priors 25 | self.theta = np.ones((self.D, self.K)) #bernoulli parameters 26 | 27 | def fit(self, X_train, y_train): 28 | 29 | num_docs = X_train.shape[0] 30 | for doc in range(num_docs): 31 | 32 | label = y_train[doc] 33 | self.pi[label] += 1 34 | 35 | for word in range(self.D): 36 | if (X_train[doc][word] > 0): 37 | self.theta[word][label] += 1 38 | #end if 39 | #end for 40 | #end for 41 | 42 | #normalize pi and theta 43 | self.pi = self.pi/np.sum(self.pi) 44 | self.theta = self.theta/np.sum(self.theta, axis=0) 45 | 46 | def predict(self, X_test): 47 | 48 | num_docs = X_test.shape[0] 49 | logp = np.zeros((num_docs,self.K)) 50 | for doc in range(num_docs): 51 | for kk in range(self.K): 52 | logp[doc][kk] = np.log(self.pi[kk]) 53 | for word in range(self.D): 54 | if (X_test[doc][word] > 0): 55 | logp[doc][kk] += np.log(self.theta[word][kk]) 56 | else: 57 | logp[doc][kk] += np.log(1-self.theta[word][kk]) 58 | #end if 59 | #end for 60 | #end for 61 | #end for 62 | return np.argmax(logp, axis=1) 63 | 64 | if __name__ == "__main__": 65 | 66 | import nltk 67 | nltk.download('stopwords') 68 | 69 | #load data 70 | print("loading 20 newsgroups dataset...") 71 | tic = time() 72 | classes = ['sci.space', 'comp.graphics', 'rec.autos', 'rec.sport.hockey'] 73 | dataset = fetch_20newsgroups(shuffle=True, random_state=0, remove=('headers','footers','quotes'), categories=classes) 74 | X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=0) 75 | toc = time() 76 | print("elapsed time: %.4f sec" %(toc - tic)) 77 | print("number of training docs: ", len(X_train)) 78 | print("number of test docs: ", len(X_test)) 79 | 80 | print("vectorizing input data...") 81 | cnt_vec = CountVectorizer(tokenizer=tokenizer.tokenize, analyzer='word', ngram_range=(1,1), max_df=0.8, min_df=2, max_features=1000, stop_words=stop_words) 82 | cnt_vec.fit(X_train) 83 | toc = time() 84 | print("elapsed time: %.2f sec" %(toc - tic)) 85 | vocab = cnt_vec.vocabulary_ 86 | idx2word = {val: key for (key, val) in vocab.items()} 87 | print("vocab size: ", len(vocab)) 88 | 89 | X_train_vec = cnt_vec.transform(X_train).toarray() 90 | X_test_vec = cnt_vec.transform(X_test).toarray() 91 | 92 | print("naive bayes model MLE inference...") 93 | K = len(set(y_train)) #number of classes 94 | D = len(vocab) #dictionary size 95 | nb_clf = naive_bayes(K, D) 96 | nb_clf.fit(X_train_vec, y_train) 97 | 98 | print("naive bayes prediction...") 99 | y_pred = nb_clf.predict(X_test_vec) 100 | nb_clf_acc = accuracy_score(y_test, y_pred) 101 | print("test set accuracy: ", nb_clf_acc) -------------------------------------------------------------------------------- /chp05/perceptron.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from scipy.stats import randint 6 | from sklearn.datasets import load_iris 7 | from sklearn.metrics import confusion_matrix 8 | from 
sklearn.model_selection import train_test_split 9 | 10 | class perceptron: 11 | def __init__(self, num_epochs, dim): 12 | self.num_epochs = num_epochs 13 | self.theta0 = 0 14 | self.theta = np.zeros(dim) 15 | 16 | def fit(self, X_train, y_train): 17 | n = X_train.shape[0] 18 | dim = X_train.shape[1] 19 | 20 | k = 1 21 | for epoch in range(self.num_epochs): 22 | for i in range(n): 23 | #sample random point 24 | idx = randint.rvs(0, n-1, size=1)[0] 25 | #hinge loss 26 | if (y_train[idx] * (np.dot(self.theta, X_train[idx,:]) + self.theta0) <= 0): 27 | #update learning rate 28 | eta = pow(k+1, -1) 29 | k += 1 30 | #print("eta: ", eta) 31 | 32 | #update theta 33 | self.theta = self.theta + eta * y_train[idx] * X_train[idx, :] 34 | self.theta0 = self.theta0 + eta * y_train[idx] 35 | #end if 36 | print("epoch: ", epoch) 37 | print("theta: ", self.theta) 38 | print("theta0: ", self.theta0) 39 | #end for 40 | #end for 41 | 42 | def predict(self, X_test): 43 | n = X_test.shape[0] 44 | dim = X_test.shape[1] 45 | 46 | y_pred = np.zeros(n) 47 | for idx in range(n): 48 | y_pred[idx] = np.sign(np.dot(self.theta, X_test[idx,:]) + self.theta0) 49 | #end for 50 | return y_pred 51 | 52 | if __name__ == "__main__": 53 | 54 | #load dataset 55 | iris = load_iris() 56 | X = iris.data[:100,:] 57 | y = 2*iris.target[:100] - 1 #map to {+1,-1} labels 58 | 59 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 60 | 61 | #perceptron (binary) classifier 62 | clf = perceptron(num_epochs=5, dim=X.shape[1]) 63 | clf.fit(X_train, y_train) 64 | y_pred = clf.predict(X_test) 65 | 66 | cmt = confusion_matrix(y_test, y_pred) 67 | acc = np.trace(cmt)/np.sum(np.sum(cmt)) 68 | print("percepton accuracy: ", acc) 69 | 70 | #generate plots 71 | plt.figure() 72 | sns.heatmap(cmt, annot=True, fmt="d") 73 | plt.title("Confusion Matrix"); plt.xlabel("predicted"); plt.ylabel("actual") 74 | #plt.savefig("./figures/perceptron_acc.png") 75 | plt.show() 76 | 77 | 78 | 79 | -------------------------------------------------------------------------------- /chp05/sgd_lr.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | def generate_data(): 5 | 6 | n = 1000 7 | mu1 = np.array([1,1]) 8 | mu2 = np.array([-1,-1]) 9 | pik = np.array([0.4,0.6]) 10 | 11 | X = np.zeros((n,2)) 12 | y = np.zeros((n,1)) 13 | 14 | for i in range(1,n): 15 | u = np.random.rand() 16 | idx = np.where(u < np.cumsum(pik))[0] 17 | 18 | if (len(idx)==1): 19 | X[i,:] = np.random.randn(1,2) + mu1 20 | y[i] = 1 21 | else: 22 | X[i,:] = np.random.randn(1,2) + mu2 23 | y[i] = -1 24 | 25 | return X, y 26 | 27 | 28 | class sgdlr: 29 | 30 | def __init__(self): 31 | 32 | self.num_iter = 100 33 | self.lmbda = 1e-9 34 | 35 | self.tau0 = 10 36 | self.kappa = 1 37 | self.eta = np.zeros(self.num_iter) 38 | 39 | self.batch_size = 200 40 | 41 | def fit(self, X, y): 42 | 43 | #random init 44 | theta = np.random.randn(X.shape[1],1) 45 | 46 | #learning rate schedule 47 | for i in range(self.num_iter): 48 | self.eta[i] = (self.tau0+i)**(-self.kappa) 49 | 50 | #divide data in batches 51 | batch_data, batch_labels = self.make_batches(X,y,self.batch_size) 52 | num_batches = batch_data.shape[0] 53 | num_updates = 0 54 | 55 | J_hist = np.zeros((self.num_iter * num_batches,1)) 56 | t_hist = np.zeros((self.num_iter * num_batches,1)) 57 | 58 | for itr in range(self.num_iter): 59 | for b in range(num_batches): 60 | Xb = batch_data[b] 61 | yb = batch_labels[b] 62 | 63 | 
J_cost, J_grad = self.lr_objective(theta, Xb, yb, self.lmbda) 64 | theta = theta - self.eta[itr]*(num_batches*J_grad) 65 | 66 | J_hist[num_updates] = J_cost 67 | t_hist[num_updates] = np.linalg.norm(theta,2) 68 | num_updates = num_updates + 1 69 | print("iteration %d, cost: %f" %(itr, J_cost)) 70 | 71 | y_pred = 2*(self.sigmoid(X.dot(theta)) > 0.5) - 1 72 | y_err = np.size(np.where(y_pred - y)[0])/float(y.shape[0]) 73 | print("classification error:", y_err) 74 | 75 | self.generate_plots(X, J_hist, t_hist, theta) 76 | return theta 77 | 78 | def make_batches(self, X, y, batch_size): 79 | n = X.shape[0] 80 | d = X.shape[1] 81 | num_batches = int(np.ceil(n/batch_size)) 82 | 83 | groups = np.tile(range(num_batches),batch_size) 84 | batch_data=np.zeros((num_batches,batch_size,d)) 85 | batch_labels=np.zeros((num_batches,batch_size,1)) 86 | 87 | for i in range(num_batches): 88 | batch_data[i,:,:] = X[groups==i,:] 89 | batch_labels[i,:] = y[groups==i] 90 | 91 | return batch_data, batch_labels 92 | 93 | def lr_objective(self, theta, X, y, lmbda): 94 | 95 | n = y.shape[0] 96 | y01 = (y+1)/2.0 97 | 98 | #compute the objective 99 | mu = self.sigmoid(X.dot(theta)) 100 | 101 | #bound away from 0 and 1 102 | eps = np.finfo(float).eps 103 | mu = np.maximum(mu,eps) 104 | mu = np.minimum(mu,1-eps) 105 | 106 | #compute cost 107 | cost = -(1/n)*np.sum(y01*np.log(mu)+(1-y01)*np.log(1-mu))+np.sum(lmbda*theta*theta) 108 | 109 | #compute the gradient of the lr objective 110 | grad = X.T.dot(mu-y01) + 2*lmbda*theta 111 | 112 | #compute the Hessian of the lr objective 113 | #H = X.T.dot(np.diag(np.diag( mu*(1-mu) ))).dot(X) + 2*lmbda*np.eye(np.size(theta)) 114 | 115 | return cost, grad 116 | 117 | def sigmoid(self, a): 118 | return 1/(1+np.exp(-a)) 119 | 120 | def generate_plots(self, X, J_hist, t_hist, theta): 121 | 122 | plt.figure() 123 | plt.plot(J_hist) 124 | plt.title("logistic regression") 125 | plt.xlabel('iterations') 126 | plt.ylabel('cost') 127 | #plt.savefig('./figures/lrsgd_loss.png') 128 | plt.show() 129 | 130 | plt.figure() 131 | plt.plot(t_hist) 132 | plt.title("logistic regression") 133 | plt.xlabel('iterations') 134 | plt.ylabel('theta l2 norm') 135 | #plt.savefig('./figures/lrsgd_theta_norm.png') 136 | plt.show() 137 | 138 | plt.figure() 139 | plt.plot(self.eta) 140 | plt.title("logistic regression") 141 | plt.xlabel('iterations') 142 | plt.ylabel('learning rate') 143 | #plt.savefig('./figures/lrsgd_learning_rate.png') 144 | plt.show() 145 | 146 | plt.figure() 147 | x1 = np.linspace(np.min(X[:,0])-1,np.max(X[:,0])+1,10) 148 | plt.scatter(X[:,0], X[:,1]) 149 | plt.plot(x1, -(theta[0]/theta[1])*x1) 150 | plt.title('logistic regression') 151 | plt.grid(True) 152 | plt.xlabel('X1') 153 | plt.ylabel('X2') 154 | #plt.savefig('./figures/lrsgd_clf.png') 155 | plt.show() 156 | 157 | if __name__ == "__main__": 158 | 159 | X, y = generate_data() 160 | sgd = sgdlr() 161 | theta = sgd.fit(X,y) 162 | 163 | -------------------------------------------------------------------------------- /chp05/svm.py: -------------------------------------------------------------------------------- 1 | import cvxopt 2 | import numpy as np 3 | 4 | from sklearn.svm import SVC #for comparison only 5 | from sklearn.datasets import load_iris 6 | from sklearn.metrics import accuracy_score 7 | from sklearn.model_selection import train_test_split 8 | 9 | def rbf_kernel(gamma, **kwargs): 10 | def f(x1, x2): 11 | distance = np.linalg.norm(x1 - x2) ** 2 12 | return np.exp(-gamma * distance) 13 | return f 14 | 15 | class 
SupportVectorMachine(object): 16 | def __init__(self, C=1, kernel=rbf_kernel, power=4, gamma=None, coef=4): 17 | self.C = C 18 | self.kernel = kernel 19 | self.power = power 20 | self.gamma = gamma 21 | self.coef = coef 22 | self.lagr_multipliers = None 23 | self.support_vectors = None 24 | self.support_vector_labels = None 25 | self.intercept = None 26 | 27 | def fit(self, X, y): 28 | 29 | n_samples, n_features = np.shape(X) 30 | 31 | # Set gamma to 1/n_features by default 32 | if not self.gamma: 33 | self.gamma = 1 / n_features 34 | 35 | # Initialize kernel method with parameters 36 | self.kernel = self.kernel( 37 | power=self.power, 38 | gamma=self.gamma, 39 | coef=self.coef) 40 | 41 | # Calculate kernel matrix 42 | kernel_matrix = np.zeros((n_samples, n_samples)) 43 | for i in range(n_samples): 44 | for j in range(n_samples): 45 | kernel_matrix[i, j] = self.kernel(X[i], X[j]) 46 | 47 | # Define the quadratic optimization problem 48 | P = cvxopt.matrix(np.outer(y, y) * kernel_matrix, tc='d') 49 | q = cvxopt.matrix(np.ones(n_samples) * -1) 50 | A = cvxopt.matrix(y, (1, n_samples), tc='d') 51 | b = cvxopt.matrix(0, tc='d') 52 | 53 | if not self.C: #if its empty 54 | G = cvxopt.matrix(np.identity(n_samples) * -1) 55 | h = cvxopt.matrix(np.zeros(n_samples)) 56 | else: 57 | G_max = np.identity(n_samples) * -1 58 | G_min = np.identity(n_samples) 59 | G = cvxopt.matrix(np.vstack((G_max, G_min))) 60 | h_max = cvxopt.matrix(np.zeros(n_samples)) 61 | h_min = cvxopt.matrix(np.ones(n_samples) * self.C) 62 | h = cvxopt.matrix(np.vstack((h_max, h_min))) 63 | 64 | # Solve the quadratic optimization problem using cvxopt 65 | minimization = cvxopt.solvers.qp(P, q, G, h, A, b) 66 | 67 | # Lagrange multipliers 68 | lagr_mult = np.ravel(minimization['x']) 69 | 70 | # Extract support vectors 71 | # Get indexes of non-zero lagr. multipiers 72 | idx = lagr_mult > 1e-11 73 | # Get the corresponding lagr. 
multipliers 74 | self.lagr_multipliers = lagr_mult[idx] 75 | # Get the samples that will act as support vectors 76 | self.support_vectors = X[idx] 77 | # Get the corresponding labels 78 | self.support_vector_labels = y[idx] 79 | 80 | # Calculate intercept with first support vector 81 | self.intercept = self.support_vector_labels[0] 82 | for i in range(len(self.lagr_multipliers)): 83 | self.intercept -= self.lagr_multipliers[i] * self.support_vector_labels[ 84 | i] * self.kernel(self.support_vectors[i], self.support_vectors[0]) 85 | 86 | 87 | def predict(self, X): 88 | y_pred = [] 89 | # Iterate through list of samples and make predictions 90 | for sample in X: 91 | prediction = 0 92 | # Determine the label of the sample by the support vectors 93 | for i in range(len(self.lagr_multipliers)): 94 | prediction += self.lagr_multipliers[i] * self.support_vector_labels[ 95 | i] * self.kernel(self.support_vectors[i], sample) 96 | prediction += self.intercept 97 | y_pred.append(np.sign(prediction)) 98 | return np.array(y_pred) 99 | 100 | 101 | def main(): 102 | 103 | #load dataset 104 | iris = load_iris() 105 | X = iris.data[:100,:] 106 | y = 2*iris.target[:100] - 1 #map to {+1,-1} labels 107 | 108 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4) 109 | clf = SupportVectorMachine(kernel=rbf_kernel, gamma = 1) 110 | clf.fit(X_train, y_train) 111 | y_pred = clf.predict(X_test) 112 | accuracy = accuracy_score(y_test, y_pred) 113 | print ("Accuracy (scratch):", accuracy) 114 | 115 | clf_sklearn = SVC(gamma = 'auto') 116 | clf_sklearn.fit(X_train, y_train) 117 | y_pred2 = clf_sklearn.predict(X_test) 118 | accuracy = accuracy_score(y_test, y_pred2) 119 | print ("Accuracy :", accuracy) 120 | 121 | if __name__ == "__main__": 122 | main() -------------------------------------------------------------------------------- /chp06/gp_reg.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from scipy.spatial.distance import cdist 4 | 5 | np.random.seed(42) 6 | 7 | class GPreg: 8 | 9 | def __init__(self, X_train, y_train, X_test): 10 | 11 | self.L = 1.0 12 | self.keps = 1e-8 13 | 14 | self.muFn = self.mean_func(X_test) 15 | self.Kfn = self.kernel_func(X_test, X_test) + 1e-15*np.eye(np.size(X_test)) 16 | 17 | self.X_train = X_train 18 | self.y_train = y_train 19 | self.X_test = X_test 20 | 21 | def mean_func(self, x): 22 | muFn = np.zeros(len(x)).reshape(-1,1) 23 | return muFn 24 | 25 | def kernel_func(self, x, z): 26 | sq_dist = cdist(x/self.L, z/self.L, 'euclidean')**2 27 | Kfn = 1.0 * np.exp(-sq_dist/2) 28 | return Kfn 29 | 30 | def compute_posterior(self): 31 | K = self.kernel_func(self.X_train, self.X_train) #K 32 | Ks = self.kernel_func(self.X_train, self.X_test) #K_* 33 | Kss = self.kernel_func(self.X_test, self.X_test) + self.keps*np.eye(np.size(self.X_test)) #K_** 34 | Ki = np.linalg.inv(K) #O(Ntrain^3) 35 | 36 | postMu = self.mean_func(self.X_test) + np.dot(np.transpose(Ks), np.dot(Ki, (self.y_train - self.mean_func(self.X_train)))) 37 | postCov = Kss - np.dot(np.transpose(Ks), np.dot(Ki, Ks)) 38 | 39 | self.muFn = postMu 40 | self.Kfn = postCov 41 | 42 | return None 43 | 44 | def generate_plots(self, X, num_samples=3): 45 | plt.figure() 46 | for i in range(num_samples): 47 | fs = self.gauss_sample(1) 48 | plt.plot(X, fs, '-k') 49 | #plt.plot(self.X_train, self.y_train, 'xk') 50 | 51 | mu = self.muFn.ravel() 52 | S2 = np.diag(self.Kfn) 53 | plt.fill(np.concatenate([X, X[::-1]]), 
np.concatenate([mu - 2*np.sqrt(S2), (mu + 2*np.sqrt(S2))[::-1]]), alpha=0.2, fc='b') 54 | plt.show() 55 | 56 | def gauss_sample(self, n): 57 | # returns n samples from a multivariate Gaussian distribution 58 | # S = AZ + mu 59 | A = np.linalg.cholesky(self.Kfn) 60 | Z = np.random.normal(loc=0, scale=1, size=(len(self.muFn),n)) 61 | S = np.dot(A,Z) + self.muFn 62 | return S 63 | 64 | def main(): 65 | 66 | # generate noise-less training data 67 | X_train = np.array([-4, -3, -2, -1, 1]) 68 | X_train = X_train.reshape(-1,1) 69 | y_train = np.sin(X_train) 70 | 71 | # generate test data 72 | X_test = np.linspace(-5, 5, 50) 73 | X_test = X_test.reshape(-1,1) 74 | 75 | gp = GPreg(X_train, y_train, X_test) 76 | gp.generate_plots(X_test,3) #samples from GP prior 77 | gp.compute_posterior() 78 | gp.generate_plots(X_test,3) #samples from GP posterior 79 | 80 | 81 | if __name__ == "__main__": 82 | main() 83 | 84 | -------------------------------------------------------------------------------- /chp06/hierarchical_regression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | import pymc3 as pm 8 | 9 | def main(): 10 | 11 | #load data 12 | data = pd.read_csv('./data/radon.txt') 13 | 14 | county_names = data.county.unique() 15 | county_idx = data['county_code'].values 16 | 17 | with pm.Model() as hierarchical_model: 18 | 19 | # Hyperpriors 20 | mu_a = pm.Normal('mu_alpha', mu=0., sd=100**2) 21 | sigma_a = pm.Uniform('sigma_alpha', lower=0, upper=100) 22 | mu_b = pm.Normal('mu_beta', mu=0., sd=100**2) 23 | sigma_b = pm.Uniform('sigma_beta', lower=0, upper=100) 24 | 25 | # Intercept for each county, distributed around group mean mu_a 26 | a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=len(data.county.unique())) 27 | # Slope for each county, distributed around group mean mu_b 28 | b = pm.Normal('beta', mu=mu_b, sd=sigma_b, shape=len(data.county.unique())) 29 | 30 | # Model error 31 | eps = pm.Uniform('eps', lower=0, upper=100) 32 | 33 | # Expected value 34 | radon_est = a[county_idx] + b[county_idx] * data.floor.values 35 | 36 | # Data likelihood 37 | y_like = pm.Normal('y_like', mu=radon_est, sd=eps, observed=data.log_radon) 38 | 39 | 40 | with hierarchical_model: 41 | # Use ADVI for initialization 42 | mu, sds, elbo = pm.variational.advi(n=100000) 43 | step = pm.NUTS(scaling=hierarchical_model.dict_to_array(sds)**2, is_cov=True) 44 | hierarchical_trace = pm.sample(5000, step, start=mu) 45 | 46 | 47 | pm.traceplot(hierarchical_trace[500:]) 48 | plt.show() 49 | 50 | if __name__ == "__main__": 51 | main() 52 | -------------------------------------------------------------------------------- /chp06/knn_reg.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn import datasets 5 | from sklearn.model_selection import train_test_split 6 | 7 | np.random.seed(42) 8 | 9 | class KNN(): 10 | 11 | def __init__(self, K): 12 | self.K = K 13 | 14 | def euclidean_distance(self, x1, x2): 15 | dist = 0 16 | for i in range(len(x1)): 17 | dist += np.power((x1[i] - x2[i]), 2) 18 | return np.sqrt(dist) 19 | 20 | def knn_search(self, X_train, y_train, Q): 21 | y_pred = np.empty(Q.shape[0]) 22 | 23 | for i, query in enumerate(Q): 24 | #get K nearest neighbors to query point 25 | idx = np.argsort([self.euclidean_distance(query, x) for x in X_train])[:self.K] 26 | 
#extract the labels of KNN training labels 27 | knn_labels = np.array([y_train[i] for i in idx]) 28 | #label query sample as the average of knn_labels 29 | y_pred[i] = np.mean(knn_labels) 30 | 31 | return y_pred 32 | 33 | 34 | if __name__ == "__main__": 35 | 36 | plt.close('all') 37 | 38 | #iris dataset 39 | iris = datasets.load_iris() 40 | X = iris.data[:,:2] 41 | y = iris.target 42 | 43 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 44 | 45 | K = 4 46 | knn = KNN(K) 47 | y_pred = knn.knn_search(X_train, y_train, X_test) 48 | 49 | plt.figure(1) 50 | plt.scatter(X_train[:,0], X_train[:,1], s = 100, marker = 'x', color = 'r', label = 'data') 51 | plt.scatter(X_test[:,0], X_test[:,1], s = 100, marker = 'o', color = 'b', label = 'query') 52 | plt.title('K Nearest Neighbors (K=%d)'% K) 53 | plt.legend() 54 | plt.xlabel('X1') 55 | plt.ylabel('X2') 56 | plt.grid(True) 57 | plt.show() 58 | 59 | 60 | 61 | -------------------------------------------------------------------------------- /chp06/ridge_reg.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import pandas as pd 4 | 5 | import matplotlib.pyplot as plt 6 | from sklearn.datasets import fetch_california_housing 7 | 8 | class ridge_reg(): 9 | 10 | def __init__(self, n_iter=20, learning_rate=1e-3, lmbda=0.1): 11 | self.n_iter = n_iter 12 | self.learning_rate = learning_rate 13 | self.lmbda = lmbda 14 | 15 | def fit(self, X, y): 16 | #insert const 1 for bias term 17 | X = np.insert(X, 0, 1, axis=1) 18 | 19 | self.loss = [] 20 | self.w = np.random.rand(X.shape[1]) 21 | 22 | for i in range(self.n_iter): 23 | y_pred = X.dot(self.w) 24 | mse = np.mean(0.5*(y - y_pred)**2 + 0.5*self.lmbda*self.w.T.dot(self.w)) 25 | self.loss.append(mse) 26 | print(" %d iter, mse: %.4f" %(i, mse)) 27 | #compute gradient of NLL(w) wrt w 28 | grad_w = - (y - y_pred).dot(X) + self.lmbda*self.w 29 | #update the weights 30 | self.w -= self.learning_rate * grad_w 31 | 32 | def predict(self, X): 33 | #insert const 1 for bias term 34 | X = np.insert(X, 0, 1, axis=1) 35 | y_pred = X.dot(self.w) 36 | return y_pred 37 | 38 | if __name__ == "__main__": 39 | 40 | X, y = fetch_california_housing(return_X_y=True) 41 | X_reg = X[:,2].reshape(-1,1) #avg number of rooms 42 | X_std = (X_reg - X_reg.mean())/X.std() #standard scaling 43 | y_std = (y - y.mean())/y.std() #standard scaling 44 | 45 | X_std = X_std[:200,:] 46 | y_std = y_std[:200] 47 | 48 | rr = ridge_reg() 49 | rr.fit(X_std, y_std) 50 | y_pred = rr.predict(X_std) 51 | 52 | print(rr.w) 53 | 54 | plt.figure() 55 | plt.plot(rr.loss) 56 | plt.xlabel('Epoch') 57 | plt.ylabel('Loss') 58 | plt.tight_layout() 59 | plt.show() 60 | 61 | plt.figure() 62 | plt.scatter(X_std, y_std) 63 | plt.plot(np.linspace(-1,1), rr.w[1]*np.linspace(-1,1)+rr.w[0], c='red') 64 | plt.xlim([-0.01,0.01]) 65 | plt.xlabel("scaled avg num of rooms") 66 | plt.ylabel("scaled house price") 67 | plt.show() -------------------------------------------------------------------------------- /chp07/active_learning.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division 2 | from scipy.sparse import csc_matrix, vstack 3 | from scipy.stats import entropy 4 | from collections import Counter 5 | import numpy as np 6 | 7 | 8 | class ActiveLearner(object): 9 | 10 | uncertainty_sampling_frameworks = [ 11 | 'entropy', 12 | 'max_margin', 13 | 'least_confident', 14 | ] 15 | 16 | 
query_by_committee_frameworks = [ 17 | 'vote_entropy', 18 | 'average_kl_divergence', 19 | ] 20 | 21 | def __init__(self, strategy='least_confident'): 22 | self.strategy = strategy 23 | 24 | def rank(self, clf, X_unlabeled, num_queries=None): 25 | 26 | if num_queries == None: 27 | num_queries = X_unlabeled.shape[0] 28 | 29 | elif type(num_queries) == float: 30 | num_queries = int(num_queries * X_unlabeled.shape[0]) 31 | 32 | if self.strategy in self.uncertainty_sampling_frameworks: 33 | scores = self.uncertainty_sampling(clf, X_unlabeled) 34 | 35 | elif self.strategy in self.query_by_committee_frameworks: 36 | scores = self.query_by_committee(clf, X_unlabeled) 37 | 38 | else: 39 | raise ValueError("this strategy is not implemented.") 40 | 41 | rankings = np.argsort(-scores)[:num_queries] 42 | return rankings 43 | 44 | def uncertainty_sampling(self, clf, X_unlabeled): 45 | probs = clf.predict_proba(X_unlabeled) 46 | 47 | if self.strategy == 'least_confident': 48 | return 1 - np.amax(probs, axis=1) 49 | 50 | elif self.strategy == 'max_margin': 51 | margin = np.partition(-probs, 1, axis=1) 52 | return -np.abs(margin[:,0] - margin[:, 1]) 53 | 54 | elif self.strategy == 'entropy': 55 | return np.apply_along_axis(entropy, 1, probs) 56 | 57 | def query_by_committee(self, clf, X_unlabeled): 58 | num_classes = len(clf[0].classes_) 59 | C = len(clf) 60 | preds = [] 61 | 62 | if self.strategy == 'vote_entropy': 63 | for model in clf: 64 | y_out = map(int, model.predict(X_unlabeled)) 65 | preds.append(np.eye(num_classes)[y_out]) 66 | 67 | votes = np.apply_along_axis(np.sum, 0, np.stack(preds)) / C 68 | return np.apply_along_axis(entropy, 1, votes) 69 | 70 | elif self.strategy == 'average_kl_divergence': 71 | for model in clf: 72 | preds.append(model.predict_proba(X_unlabeled)) 73 | 74 | consensus = np.mean(np.stack(preds), axis=0) 75 | divergence = [] 76 | for y_out in preds: 77 | divergence.append(entropy(consensus.T, y_out.T)) 78 | 79 | return np.apply_along_axis(np.mean, 0, np.stack(divergence)) 80 | -------------------------------------------------------------------------------- /chp07/adaboost_clf.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import numpy as np 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | import matplotlib.gridspec as gridspec 7 | 8 | from sklearn import datasets 9 | 10 | from sklearn.tree import DecisionTreeClassifier 11 | from sklearn.neighbors import KNeighborsClassifier 12 | from sklearn.linear_model import LogisticRegression 13 | 14 | from sklearn.ensemble import AdaBoostClassifier 15 | from sklearn.model_selection import cross_val_score, train_test_split 16 | 17 | from mlxtend.plotting import plot_learning_curves 18 | from mlxtend.plotting import plot_decision_regions 19 | 20 | def main(): 21 | 22 | iris = datasets.load_iris() 23 | X, y = iris.data[:, 0:2], iris.target 24 | 25 | #XOR dataset 26 | #X = np.random.randn(200, 2) 27 | #y = np.array(map(int,np.logical_xor(X[:, 0] > 0, X[:, 1] > 0))) 28 | 29 | clf = DecisionTreeClassifier(criterion='entropy', max_depth=1) 30 | 31 | num_est = [1, 2, 3, 10] 32 | label = ['AdaBoost (n_est=1)', 'AdaBoost (n_est=2)', 'AdaBoost (n_est=3)', 'AdaBoost (n_est=10)'] 33 | 34 | fig = plt.figure(figsize=(10, 8)) 35 | gs = gridspec.GridSpec(2, 2) 36 | grid = itertools.product([0,1],repeat=2) 37 | 38 | for n_est, label, grd in zip(num_est, label, grid): 39 | boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est) 40 | boosting.fit(X, y) 41 | ax 
= plt.subplot(gs[grd[0], grd[1]]) 42 | fig = plot_decision_regions(X=X, y=y, clf=boosting, legend=2) 43 | plt.title(label) 44 | 45 | plt.show() 46 | #plt.savefig('./figures/boosting_ensemble.png') 47 | 48 | #plot learning curves 49 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) 50 | 51 | boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=10) 52 | 53 | plt.figure() 54 | plot_learning_curves(X_train, y_train, X_test, y_test, boosting, print_model=False, style='ggplot') 55 | plt.show() 56 | #plt.savefig('./figures/boosting_ensemble_learning_curve.png') 57 | 58 | #Ensemble Size 59 | num_est = list(map(int, np.linspace(1,100,20))) 60 | bg_clf_cv_mean = [] 61 | bg_clf_cv_std = [] 62 | for n_est in num_est: 63 | print("num_est: ", n_est) 64 | ada_clf = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est) 65 | scores = cross_val_score(ada_clf, X, y, cv=3, scoring='accuracy') 66 | bg_clf_cv_mean.append(scores.mean()) 67 | bg_clf_cv_std.append(scores.std()) 68 | 69 | plt.figure() 70 | (_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5) 71 | for cap in caps: 72 | cap.set_markeredgewidth(1) 73 | plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('AdaBoost Ensemble'); 74 | plt.show() 75 | #plt.savefig('./figures/boosting_ensemble_size.png') 76 | 77 | if __name__ == "__main__": 78 | main() -------------------------------------------------------------------------------- /chp07/bagging_clf.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import numpy as np 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | import matplotlib.gridspec as gridspec 7 | 8 | from sklearn import datasets 9 | 10 | from sklearn.tree import DecisionTreeClassifier 11 | from sklearn.neighbors import KNeighborsClassifier 12 | from sklearn.linear_model import LogisticRegression 13 | from sklearn.ensemble import RandomForestClassifier 14 | 15 | from sklearn.ensemble import BaggingClassifier 16 | from sklearn.model_selection import cross_val_score, train_test_split 17 | 18 | from mlxtend.plotting import plot_learning_curves 19 | from mlxtend.plotting import plot_decision_regions 20 | 21 | def main(): 22 | 23 | iris = datasets.load_iris() 24 | X, y = iris.data[:, 0:2], iris.target 25 | 26 | clf1 = DecisionTreeClassifier(criterion='entropy', max_depth=None) 27 | clf2 = KNeighborsClassifier(n_neighbors=1) 28 | 29 | bagging1 = BaggingClassifier(base_estimator=clf1, n_estimators=10, max_samples=0.8, max_features=0.8) 30 | bagging2 = BaggingClassifier(base_estimator=clf2, n_estimators=10, max_samples=0.8, max_features=0.8) 31 | 32 | label = ['Decision Tree', 'K-NN', 'Bagging Tree', 'Bagging K-NN'] 33 | clf_list = [clf1, clf2, bagging1, bagging2] 34 | 35 | fig = plt.figure(figsize=(10, 8)) 36 | gs = gridspec.GridSpec(2, 2) 37 | grid = itertools.product([0,1],repeat=2) 38 | 39 | for clf, label, grd in zip(clf_list, label, grid): 40 | scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy') 41 | print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label)) 42 | 43 | clf.fit(X, y) 44 | ax = plt.subplot(gs[grd[0], grd[1]]) 45 | fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2) 46 | plt.title(label) 47 | 48 | plt.show() 49 | #plt.savefig('./figures/bagging_ensemble.png') 50 | 51 | #plot learning curves 52 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) 53 | 54 | plt.figure() 55 
| plot_learning_curves(X_train, y_train, X_test, y_test, bagging1, print_model=False, style='ggplot') 56 | plt.show() 57 | #plt.savefig('./figures/bagging_ensemble_learning_curve.png') 58 | 59 | #Ensemble Size 60 | num_est = list(map(int, np.linspace(1,100,20))) 61 | bg_clf_cv_mean = [] 62 | bg_clf_cv_std = [] 63 | for n_est in num_est: 64 | print("num_est: ", n_est) 65 | bg_clf = BaggingClassifier(base_estimator=clf1, n_estimators=n_est, max_samples=0.8, max_features=0.8) 66 | scores = cross_val_score(bg_clf, X, y, cv=3, scoring='accuracy') 67 | bg_clf_cv_mean.append(scores.mean()) 68 | bg_clf_cv_std.append(scores.std()) 69 | 70 | plt.figure() 71 | (_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5) 72 | for cap in caps: 73 | cap.set_markeredgewidth(1) 74 | plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('Bagging Tree Ensemble'); 75 | plt.show() 76 | #plt.savefig('./figures/bagging_ensemble_size.png') 77 | 78 | if __name__ == "__main__": 79 | main() -------------------------------------------------------------------------------- /chp07/bayes_opt_sklearn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | from sklearn.datasets import make_classification 8 | from sklearn.model_selection import cross_val_score 9 | from sklearn.ensemble import RandomForestClassifier as RFC 10 | from sklearn.svm import SVC 11 | 12 | from bayes_opt import BayesianOptimization 13 | 14 | np.random.seed(42) 15 | 16 | # Load data set and target values 17 | data, target = make_classification( 18 | n_samples=1000, 19 | n_features=45, 20 | n_informative=12, 21 | n_redundant=7 22 | ) 23 | target = target.ravel() 24 | 25 | def svccv(gamma): 26 | val = cross_val_score( 27 | SVC(gamma=gamma, random_state=0), 28 | data, target, scoring='f1', cv=2 29 | ).mean() 30 | 31 | return val 32 | 33 | def rfccv(n_estimators, max_depth): 34 | val = cross_val_score( 35 | RFC(n_estimators=int(n_estimators), 36 | max_depth=int(max_depth), 37 | random_state=0 38 | ), 39 | data, target, scoring='f1', cv=2 40 | ).mean() 41 | return val 42 | 43 | if __name__ == "__main__": 44 | 45 | gp_params = {"alpha": 1e-5} 46 | 47 | #SVM 48 | svcBO = BayesianOptimization(svccv, 49 | {'gamma': (0.00001, 0.1)}) 50 | 51 | svcBO.maximize(init_points=3, n_iter=4, **gp_params) 52 | 53 | #Random Forest 54 | rfcBO = BayesianOptimization( 55 | rfccv, 56 | {'n_estimators': (10, 300), 57 | 'max_depth': (2, 10) 58 | } 59 | ) 60 | rfcBO.maximize(init_points=4, n_iter=4, **gp_params) 61 | 62 | print('Final Results') 63 | print(svcBO.max) 64 | print(rfcBO.max) -------------------------------------------------------------------------------- /chp07/demo_logreg.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from active_learning import ActiveLearner 6 | from sklearn.metrics import accuracy_score 7 | from sklearn.datasets import make_classification 8 | from sklearn.linear_model import LogisticRegression 9 | from sklearn.model_selection import train_test_split 10 | 11 | np.random.seed(42) 12 | 13 | def main(): 14 | 15 | #number of labeled points 16 | num_queries = 30 17 | 18 | #generate data 19 | data, target = make_classification(n_samples=200, n_features=2, n_informative=2,\ 20 | n_redundant=0, n_classes=2, weights = [0.5, 
0.5], random_state=0) 21 | 22 | #split into labeled and unlabeled pools 23 | X_train, X_unlabeled, y_train, y_oracle = train_test_split(data, target, test_size=0.2, random_state=0) 24 | 25 | #random sub-sampling 26 | rnd_idx = np.random.randint(0, X_train.shape[0], num_queries) 27 | X1 = X_train[rnd_idx,:] 28 | y1 = y_train[rnd_idx] 29 | 30 | clf1 = LogisticRegression() 31 | clf1.fit(X1, y1) 32 | 33 | y1_preds = clf1.predict(X_unlabeled) 34 | score1 = accuracy_score(y_oracle, y1_preds) 35 | print("random subsampling accuracy: ", score1) 36 | 37 | #plot 2D decision boundary: w2x2 + w1x1 + w0 = 0 38 | w0 = clf1.intercept_ 39 | w1, w2 = clf1.coef_[0] 40 | xx = np.linspace(-1, 1, 100) 41 | decision_boundary = -w0/float(w2) - (w1/float(w2))*xx 42 | 43 | plt.figure() 44 | plt.scatter(data[rnd_idx,0], data[rnd_idx,1], c='black', marker='s', s=64, label='labeled') 45 | plt.scatter(data[target==0,0], data[target==0,1], c='blue', marker='o', alpha=0.5, label='class 0') 46 | plt.scatter(data[target==1,0], data[target==1,1], c='red', marker='o', alpha=0.5, label='class 1') 47 | plt.plot(xx, decision_boundary, linewidth = 2.0, c='black', linestyle = '--', label='log reg boundary') 48 | plt.title("Random Subsampling") 49 | plt.legend() 50 | plt.show() 51 | 52 | #active learning 53 | AL = ActiveLearner(strategy='entropy') 54 | al_idx = AL.rank(clf1, X_unlabeled, num_queries=num_queries) 55 | 56 | X2 = X_train[al_idx,:] 57 | y2 = y_train[al_idx] 58 | 59 | clf2 = LogisticRegression() 60 | clf2.fit(X2, y2) 61 | 62 | y2_preds = clf2.predict(X_unlabeled) 63 | score2 = accuracy_score(y_oracle, y2_preds) 64 | print("active learning accuracy: ", score2) 65 | 66 | #plot 2D decision boundary: w2x2 + w1x1 + w0 = 0 67 | w0 = clf2.intercept_ 68 | w1, w2 = clf2.coef_[0] 69 | xx = np.linspace(-1, 1, 100) 70 | decision_boundary = -w0/float(w2) - (w1/float(w2))*xx 71 | 72 | plt.figure() 73 | plt.scatter(data[al_idx,0], data[al_idx,1], c='black', marker='s', s=64, label='labeled') 74 | plt.scatter(data[target==0,0], data[target==0,1], c='blue', marker='o', alpha=0.5, label='class 0') 75 | plt.scatter(data[target==1,0], data[target==1,1], c='red', marker='o', alpha=0.5, label='class 1') 76 | plt.plot(xx, decision_boundary, linewidth = 2.0, c='black', linestyle = '--', label='log reg boundary') 77 | plt.title("Uncertainty Sampling") 78 | plt.legend() 79 | plt.show() 80 | 81 | if __name__ == "__main__": 82 | 83 | main() -------------------------------------------------------------------------------- /chp07/hmm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.sparse import coo_matrix 3 | import matplotlib.pyplot as plt 4 | 5 | np.random.seed(42) 6 | 7 | class HMM(): 8 | def __init__(self, d=3, k=2, n=10000): 9 | self.d = d #dimension of data 10 | self.k = k #dimension of latent state 11 | self.n = n #number of data points 12 | 13 | self.A = np.zeros((k,k)) #transition matrix 14 | self.E = np.zeros((k,d)) #emission matrix 15 | self.s = np.zeros(k) #initial state vector 16 | 17 | self.x = np.zeros(self.n) #emitted observations 18 | 19 | def normalize_mat(self, X, dim=1): 20 | z = np.sum(X, axis=dim) 21 | Xnorm = X/z.reshape(-1,1) 22 | return Xnorm 23 | 24 | def normalize_vec(self, v): 25 | z = sum(v) 26 | u = v / z 27 | return u, z 28 | 29 | def init_hmm(self): 30 | 31 | #initialize matrices at random 32 | self.A = self.normalize_mat(np.random.rand(self.k,self.k)) 33 | self.E = self.normalize_mat(np.random.rand(self.k,self.d)) 34 | self.s, _ = 
self.normalize_vec(np.random.rand(self.k)) 35 | 36 | #generate markov observations 37 | z = np.random.choice(self.k, size=1, p=self.s) 38 | self.x[0] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel()) 39 | for i in range(1, self.n): 40 | z = np.random.choice(self.k, size=1, p=self.A[z,:].ravel()) 41 | self.x[i] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel()) 42 | #end for 43 | 44 | def forward_backward(self): 45 | 46 | #construct sparse matrix X of emission indicators 47 | data = np.ones(self.n) 48 | row = self.x.astype(int) #sparse indices must be integers 49 | col = np.arange(self.n) 50 | X = coo_matrix((data, (row, col)), shape=(self.d, self.n)) 51 | 52 | M = self.E * X 53 | At = np.transpose(self.A) 54 | c = np.zeros(self.n) #normalization constants 55 | alpha = np.zeros((self.k, self.n)) #alpha = p(z_t = j | x_{1:t}) 56 | alpha[:,0], c[0] = self.normalize_vec(self.s * M[:,0]) 57 | for t in range(1, self.n): 58 | alpha[:,t], c[t] = self.normalize_vec(np.dot(At, alpha[:,t-1]) * M[:,t]) 59 | #end for 60 | 61 | beta = np.ones((self.k, self.n)) 62 | for t in range(self.n-2, -1, -1): #include t=0 so that beta[:,0] is computed 63 | beta[:,t] = np.dot(self.A, beta[:,t+1] * M[:,t+1])/c[t+1] 64 | #end for 65 | gamma = alpha * beta 66 | 67 | return gamma, alpha, beta, c 68 | 69 | def viterbi(self): 70 | 71 | #construct sparse matrix X of emission indicators 72 | data = np.ones(self.n) 73 | row = self.x.astype(int) 74 | col = np.arange(self.n) 75 | X = coo_matrix((data, (row, col)), shape=(self.d, self.n)) 76 | 77 | #log scale for numerical stability 78 | s = np.log(self.s) 79 | A = np.log(self.A) 80 | M = np.log(self.E * X) 81 | 82 | Z = np.zeros((self.k, self.n)) 83 | Z[:,0] = np.arange(self.k) 84 | v = s + M[:,0] 85 | for t in range(1, self.n): 86 | Av = A + v.reshape(-1,1) 87 | v = np.max(Av, axis=0) 88 | idx = np.argmax(Av, axis=0) 89 | v = v.reshape(-1,1) + M[:,t].reshape(-1,1) 90 | Z = Z[idx,:] 91 | Z[:,t] = np.arange(self.k) 92 | #end for 93 | llh = np.max(v) 94 | idx = np.argmax(v) 95 | z = Z[idx,:] 96 | 97 | return z, llh 98 | 99 | 100 | if __name__ == "__main__": 101 | 102 | hmm = HMM() 103 | hmm.init_hmm() 104 | 105 | gamma, alpha, beta, c = hmm.forward_backward() 106 | z, llh = hmm.viterbi() 107 | import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /chp07/page_rank.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from numpy.linalg import norm 3 | 4 | np.random.seed(42) 5 | 6 | class page_rank(): 7 | 8 | def __init__(self): 9 | self.max_iter = 100 10 | self.tolerance = 1e-5 11 | 12 | def power_iteration(self, A): 13 | n = np.shape(A)[0] 14 | v = np.random.rand(n) 15 | converged = False 16 | iter = 0 17 | 18 | while (not converged) and (iter < self.max_iter): 19 | old_v = v 20 | v = np.dot(A, v) 21 | v = v / norm(v) 22 | lambd = np.dot(v, np.dot(A, v)) 23 | converged = norm(v - old_v) < self.tolerance 24 | iter += 1 25 | #end while 26 | 27 | return lambd, v 28 | 29 | if __name__ == "__main__": 30 | 31 | #construct a symmetric real matrix 32 | X = np.random.rand(10,5) 33 | A = np.dot(X.T, X) 34 | 35 | pr = page_rank() 36 | lambd, v = pr.power_iteration(A) 37 | 38 | print(lambd) 39 | print(v) 40 | 41 | #compare against np.linalg implementation 42 | eigval, eigvec = np.linalg.eig(A) 43 | idx = np.argsort(np.abs(eigval))[::-1] 44 | top_lambd = eigval[idx][0] 45 | top_v = eigvec[:,idx][:,0] #dominant eigenvector is the first column after sorting 46 | 47 | assert np.allclose(lambd, top_lambd, 1e-3) 48 | assert np.allclose(np.abs(v), np.abs(top_v), 1e-3) #compare up to sign 49 | 50 | 51 | 52 | 53 | 
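Note: `page_rank.py` above exercises the power-iteration routine on a random symmetric matrix rather than on an actual link graph. The following is a minimal sketch (not part of the repository) of how the same power method produces PageRank scores for a link graph; the four-page link structure and the damping factor of 0.85 are illustrative assumptions.

```python
import numpy as np

#hypothetical 4-page link graph: links[i] lists the pages that page i points to
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, damping = 4, 0.85

#column-stochastic link matrix: M[j, i] = 1/outdegree(i) if page i links to page j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

#Google matrix: follow a link with probability `damping`, otherwise teleport uniformly
G = damping * M + (1.0 - damping) / n * np.ones((n, n))

#power iteration: the dominant eigenvector (eigenvalue 1) is the PageRank vector
r = np.ones(n) / n
for _ in range(100):
    r_next = G.dot(r)
    r_next /= r_next.sum()  #keep r a probability distribution
    if np.linalg.norm(r_next - r, 1) < 1e-10:
        break
    r = r_next

print("PageRank scores:", r_next)
```

Because the Google matrix G is column-stochastic, its dominant eigenvalue is 1 and the normalized fixed point of the iteration is the PageRank distribution.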
-------------------------------------------------------------------------------- /chp07/plot_smote_regular.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.datasets import make_classification 5 | from sklearn.decomposition import PCA 6 | 7 | from imblearn.over_sampling import SMOTE 8 | 9 | def plot_resampling(ax, X, y, title): 10 | c0 = ax.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", marker="o", alpha=0.5) 11 | c1 = ax.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", marker="s", alpha=0.5) 12 | ax.set_title(title) 13 | ax.spines['top'].set_visible(False) 14 | ax.spines['right'].set_visible(False) 15 | ax.get_xaxis().tick_bottom() 16 | ax.get_yaxis().tick_left() 17 | ax.spines['left'].set_position(('outward', 10)) 18 | ax.spines['bottom'].set_position(('outward', 10)) 19 | ax.set_xlim([-6, 8]) 20 | ax.set_ylim([-6, 6]) 21 | 22 | return c0, c1 23 | 24 | def main(): 25 | # generate the dataset 26 | X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7], 27 | n_informative=3, n_redundant=1, flip_y=0, 28 | n_features=20, n_clusters_per_class=1, 29 | n_samples=80, random_state=10) 30 | 31 | # fit PCA for visualization 32 | pca = PCA(n_components=2) 33 | X_vis = pca.fit_transform(X) 34 | 35 | # apply regular SMOTE 36 | method = SMOTE() 37 | X_res, y_res = method.fit_resample(X, y) 38 | X_res_vis = pca.transform(X_res) 39 | 40 | # generate plots 41 | f, (ax1, ax2) = plt.subplots(1, 2) 42 | c0, c1 = plot_resampling(ax1, X_vis, y, 'Original') 43 | plot_resampling(ax2, X_res_vis, y_res, 'SMOTE') 44 | ax1.legend((c0, c1), ('Class #0', 'Class #1')) 45 | plt.tight_layout() 46 | plt.show() 47 | 48 | if __name__ == "__main__": 49 | main() 50 | -------------------------------------------------------------------------------- /chp07/plot_tomek_links.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from sklearn.model_selection import train_test_split 6 | from sklearn.utils import shuffle 7 | from imblearn.under_sampling import TomekLinks 8 | 9 | rng = np.random.RandomState(42) 10 | 11 | def main(): 12 | 13 | #generate data 14 | n_samples_1 = 500 15 | n_samples_2 = 50 16 | X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2), 0.5 * rng.randn(n_samples_2, 2) + [2, 2]] 17 | y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2)) 18 | X_syn, y_syn = shuffle(X_syn, y_syn) 19 | X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn) 20 | 21 | # remove Tomek links 22 | tl = TomekLinks(sampling_strategy='auto') 23 | X_resampled, y_resampled = tl.fit_resample(X_syn, y_syn) 24 | idx_resampled = tl.sample_indices_ 25 | idx_samples_removed = np.setdiff1d(np.arange(X_syn.shape[0]),idx_resampled) 26 | 27 | #generate plots 28 | fig = plt.figure() 29 | ax = fig.add_subplot(1, 1, 1) 30 | 31 | idx_class_0 = y_resampled == 0 32 | plt.scatter(X_resampled[idx_class_0, 0], X_resampled[idx_class_0, 1], alpha=.8, marker = "o", label='Class #0') 33 | plt.scatter(X_resampled[~idx_class_0, 0], X_resampled[~idx_class_0, 1], alpha=.8, marker = "s", label='Class #1') 34 | plt.scatter(X_syn[idx_samples_removed, 0], X_syn[idx_samples_removed, 1], alpha=.8, marker = "v", label='Removed samples') 35 | plt.title('Undersampling: Tomek links') 36 | plt.legend() 37 | plt.show() 38 | 39 | if __name__ == "__main__": 40 | main() 
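Note: the two imbalanced-learning scripts above (`plot_smote_regular.py` and `plot_tomek_links.py`) delegate the resampling itself to imbalanced-learn. For intuition, here is a minimal sketch (not part of the repository, and not the imblearn implementation) of SMOTE's core interpolation step: each synthetic minority sample lies at a random point on the segment between a minority sample and one of its k nearest minority neighbors. The toy cluster, k=5, and one synthetic sample per point are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

#toy minority-class cluster in 2D (illustrative)
X_min = rng.randn(20, 2) + np.array([2.0, 2.0])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  #+1: each point is its own nearest neighbor
_, neigh_idx = nn.kneighbors(X_min)

#one synthetic sample per minority point, interpolated toward a random neighbor
X_syn = []
for i, x in enumerate(X_min):
    j = rng.choice(neigh_idx[i, 1:])   #pick one of the k nearest minority neighbors
    gap = rng.uniform(0.0, 1.0)        #random position along the segment
    X_syn.append(x + gap * (X_min[j] - x))
X_syn = np.array(X_syn)

print("original minority:", X_min.shape, "synthetic:", X_syn.shape)
```

Tomek-link removal works in the opposite direction: instead of adding minority points, it removes majority samples from pairs of opposite-class mutual nearest neighbors, which cleans up the class boundary.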
-------------------------------------------------------------------------------- /chp07/stacked_clf.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import numpy as np 3 | import seaborn as sns 4 | import matplotlib.pyplot as plt 5 | import matplotlib.gridspec as gridspec 6 | 7 | from sklearn import datasets 8 | 9 | from sklearn.linear_model import LogisticRegression 10 | from sklearn.neighbors import KNeighborsClassifier 11 | from sklearn.naive_bayes import GaussianNB 12 | from sklearn.ensemble import RandomForestClassifier 13 | from mlxtend.classifier import StackingClassifier 14 | 15 | from sklearn.model_selection import cross_val_score, train_test_split 16 | 17 | from mlxtend.plotting import plot_learning_curves 18 | from mlxtend.plotting import plot_decision_regions 19 | 20 | def main(): 21 | 22 | iris = datasets.load_iris() 23 | X, y = iris.data[:, 1:3], iris.target 24 | 25 | clf1 = KNeighborsClassifier(n_neighbors=1) 26 | clf2 = RandomForestClassifier(random_state=1) 27 | clf3 = GaussianNB() 28 | lr = LogisticRegression() 29 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 30 | meta_classifier=lr) 31 | 32 | label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier'] 33 | clf_list = [clf1, clf2, clf3, sclf] 34 | 35 | fig = plt.figure(figsize=(10,8)) 36 | gs = gridspec.GridSpec(2, 2) 37 | grid = itertools.product([0,1],repeat=2) 38 | 39 | clf_cv_mean = [] 40 | clf_cv_std = [] 41 | for clf, label, grd in zip(clf_list, label, grid): 42 | 43 | scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy') 44 | print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label)) 45 | clf_cv_mean.append(scores.mean()) 46 | clf_cv_std.append(scores.std()) 47 | 48 | clf.fit(X, y) 49 | ax = plt.subplot(gs[grd[0], grd[1]]) 50 | fig = plot_decision_regions(X=X, y=y, clf=clf) 51 | plt.title(label) 52 | 53 | plt.show() 54 | #plt.savefig("./figures/ensemble_stacking.png") 55 | 56 | #plot classifier accuracy 57 | plt.figure() 58 | (_, caps, _) = plt.errorbar(range(4), clf_cv_mean, yerr=clf_cv_std, c='blue', fmt='-o', capsize=5) 59 | for cap in caps: 60 | cap.set_markeredgewidth(1) 61 | plt.xticks(range(4), ['KNN', 'RF', 'NB', 'Stacking'], rotation='vertical') 62 | plt.ylabel('Accuracy'); plt.xlabel('Classifier'); plt.title('Stacking Ensemble'); 63 | plt.show() 64 | #plt.savefig('./figures/stacking_ensemble_size.png') 65 | 66 | #plot learning curves 67 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) 68 | 69 | plt.figure() 70 | plot_learning_curves(X_train, y_train, X_test, y_test, sclf, print_model=False, style='ggplot') 71 | plt.show() 72 | #plt.savefig('./figures/stacking_ensemble_learning_curve.png') 73 | 74 | 75 | if __name__ == "__main__": 76 | main() 77 | -------------------------------------------------------------------------------- /chp08/dpmeans.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | import time 5 | from sklearn import metrics 6 | from sklearn.datasets import load_iris 7 | 8 | np.random.seed(42) 9 | 10 | class dpmeans: 11 | 12 | def __init__(self,X): 13 | # Initialize parameters for DP means 14 | self.K = 1 15 | self.K_init = 4 16 | self.d = X.shape[1] 17 | self.z = np.mod(np.random.permutation(X.shape[0]),self.K)+1 18 | self.mu = np.random.standard_normal((self.K, self.d)) 19 | self.sigma = 1 20 | self.nk = np.zeros(self.K) 21 | self.pik = 
np.ones(self.K)/self.K 22 | 23 | #init mu 24 | self.mu = np.array([np.mean(X,0)]) 25 | 26 | #init lambda 27 | self.Lambda = self.kpp_init(X,self.K_init) 28 | 29 | self.max_iter = 100 30 | self.obj = np.zeros(self.max_iter) 31 | self.em_time = np.zeros(self.max_iter) 32 | 33 | def kpp_init(self,X,k): 34 | #k++ init 35 | #lambda is max distance to k++ means 36 | 37 | [n,d] = np.shape(X) 38 | mu = np.zeros((k,d)) 39 | dist = np.inf*np.ones(n) 40 | 41 | mu[0,:] = X[int(np.random.rand()*n-1),:] 42 | for i in range(1,k): 43 | D = X-np.tile(mu[i-1,:],(n,1)) 44 | dist = np.minimum(dist, np.sum(D*D,1)) 45 | idx = np.where(np.random.rand() < np.cumsum(dist/float(sum(dist)))) 46 | mu[i,:] = X[idx[0][0],:] 47 | Lambda = np.max(dist) 48 | 49 | print("Lambda: ", Lambda) 50 | 51 | return Lambda 52 | 53 | def fit(self,X): 54 | 55 | obj_tol = 1e-3 56 | max_iter = self.max_iter 57 | [n,d] = np.shape(X) 58 | 59 | obj = np.zeros(max_iter) 60 | em_time = np.zeros(max_iter) 61 | print('running dpmeans...') 62 | 63 | for iter in range(max_iter): 64 | tic = time.time() 65 | dist = np.zeros((n,self.K)) 66 | 67 | #assignment step 68 | for kk in range(self.K): 69 | Xm = X - np.tile(self.mu[kk,:],(n,1)) 70 | dist[:,kk] = np.sum(Xm*Xm,1) 71 | 72 | #update labels 73 | dmin = np.min(dist,1) 74 | self.z = np.argmin(dist,1) 75 | idx = np.where(dmin > self.Lambda) 76 | 77 | if (np.size(idx) > 0): 78 | self.K = self.K + 1 79 | self.z[idx[0]] = self.K-1 #cluster labels in [0,...,K-1] 80 | self.mu = np.vstack([self.mu,np.mean(X[idx[0],:],0)]) 81 | Xm = X - np.tile(self.mu[self.K-1,:],(n,1)) 82 | dist = np.hstack([dist, np.array([np.sum(Xm*Xm,1)]).T]) 83 | 84 | #update step 85 | self.nk = np.zeros(self.K) 86 | for kk in range(self.K): 87 | self.nk[kk] = self.z.tolist().count(kk) 88 | idx = np.where(self.z == kk) 89 | self.mu[kk,:] = np.mean(X[idx[0],:],0) 90 | 91 | self.pik = self.nk/float(np.sum(self.nk)) 92 | 93 | #compute objective 94 | for kk in range(self.K): 95 | idx = np.where(self.z == kk) 96 | obj[iter] = obj[iter] + np.sum(dist[idx[0],kk],0) 97 | obj[iter] = obj[iter] + self.Lambda * self.K 98 | 99 | #check convergence 100 | if (iter > 0 and np.abs(obj[iter]-obj[iter-1]) < obj_tol*obj[iter]): 101 | print('converged in %d iterations\n'% iter) 102 | break 103 | em_time[iter] = time.time()-tic 104 | #end for 105 | self.obj = obj 106 | self.em_time = em_time 107 | return self.z, obj, em_time 108 | 109 | def compute_nmi(self, z1, z2): 110 | # compute normalized mutual information 111 | 112 | n = np.size(z1) 113 | k1 = np.size(np.unique(z1)) 114 | k2 = np.size(np.unique(z2)) 115 | 116 | nk1 = np.zeros((k1,1)) 117 | nk2 = np.zeros((k2,1)) 118 | 119 | for kk in range(k1): 120 | nk1[kk] = np.sum(z1==kk) 121 | for kk in range(k2): 122 | nk2[kk] = np.sum(z2==kk) 123 | 124 | pk1 = nk1/float(np.sum(nk1)) 125 | pk2 = nk2/float(np.sum(nk2)) 126 | 127 | nk12 = np.zeros((k1,k2)) 128 | for ii in range(k1): 129 | for jj in range(k2): 130 | nk12[ii,jj] = np.sum((z1==ii)*(z2==jj)) 131 | pk12 = nk12/float(n) 132 | 133 | Hx = -np.sum(pk1 * np.log(pk1 + np.finfo(float).eps)) 134 | Hy = -np.sum(pk2 * np.log(pk2 + np.finfo(float).eps)) 135 | 136 | Hxy = -np.sum(pk12 * np.log(pk12 + np.finfo(float).eps)) 137 | 138 | MI = Hx + Hy - Hxy; 139 | nmi = MI/float(0.5*(Hx+Hy)) 140 | 141 | return nmi 142 | 143 | def generate_plots(self,X): 144 | 145 | plt.close('all') 146 | plt.figure(0) 147 | for kk in range(self.K): 148 | #idx = np.where(self.z == kk) 149 | plt.scatter(X[self.z == kk,0], X[self.z == kk,1], \ 150 | s = 100, marker = 'o', c = 
np.random.rand(3,), label = str(kk)) 151 | #end for 152 | plt.xlabel('X1') 153 | plt.ylabel('X2') 154 | plt.legend() 155 | plt.title('DP-means clusters') 156 | plt.grid(True) 157 | plt.show() 158 | 159 | plt.figure(1) 160 | plt.plot(self.obj) 161 | plt.title('DP-means objective function') 162 | plt.xlabel('iterations') 163 | plt.ylabel('penalized l2 squared distance') 164 | plt.grid(True) 165 | plt.show() 166 | 167 | if __name__ == "__main__": 168 | 169 | iris = load_iris() 170 | X = iris.data 171 | y = iris.target 172 | 173 | dp = dpmeans(X) 174 | labels, obj, em_time = dp.fit(X) 175 | dp.generate_plots(X) 176 | 177 | nmi = dp.compute_nmi(y,labels) 178 | ari = metrics.adjusted_rand_score(y,labels) 179 | 180 | print("NMI: %.4f" % nmi) 181 | print("ARI: %.4f" % ari) 182 | -------------------------------------------------------------------------------- /chp08/gmm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import matplotlib as mpl 4 | from sklearn.cluster import KMeans 5 | from scipy.stats import multivariate_normal 6 | from scipy.special import logsumexp 7 | from scipy import linalg 8 | 9 | np.random.seed(3) 10 | 11 | class GMM: 12 | 13 | def __init__(self, n=1e3, d=2, K=4): 14 | self.n = int(n) #number of data points 15 | self.d = d #data dimension 16 | self.K = K #number of clusters 17 | 18 | self.X = np.zeros((self.n, self.d)) 19 | 20 | self.mu = np.zeros((self.d, self.K)) 21 | self.sigma = np.zeros((self.d, self.d, self.K)) 22 | self.pik = np.ones(self.K)/K 23 | 24 | def generate_data(self): 25 | #GMM generative model 26 | alpha0 = np.ones(self.K) 27 | pi = np.random.dirichlet(alpha0) 28 | 29 | #ground truth mu and sigma 30 | mu0 = np.random.randint(0, 10, size=(self.d, self.K)) - 5*np.ones((self.d, self.K)) 31 | V0 = np.zeros((self.d, self.d, self.K)) 32 | for k in range(self.K): 33 | eigen_mean = 0 34 | Q = np.random.normal(loc=0, scale=1, size=(self.d, self.d)) 35 | D = np.diag(abs(eigen_mean + np.random.normal(loc=0, scale=1, size=self.d))) 36 | V0[:,:,k] = abs(np.transpose(Q)*D*Q) 37 | 38 | #sample data 39 | for i in range(self.n): 40 | z = np.random.multinomial(1,pi) 41 | k = np.nonzero(z)[0][0] 42 | self.X[i,:] = np.random.multivariate_normal(mean=mu0[:,k], cov=V0[:,:,k], size=1) 43 | 44 | plt.figure() 45 | plt.scatter(self.X[:,0], self.X[:,1], color='b', alpha=0.5) 46 | plt.title("Ground Truth Data"); plt.xlabel("X1"); plt.ylabel("X2") 47 | plt.show() 48 | 49 | return mu0, V0 50 | 51 | def gmm_em(self): 52 | 53 | #init mu with k-means 54 | kmeans = KMeans(n_clusters=self.K, random_state=42).fit(self.X) 55 | self.mu = np.transpose(kmeans.cluster_centers_) 56 | 57 | #init sigma 58 | for k in range(self.K): 59 | self.sigma[:,:,k] = np.eye(self.d) 60 | 61 | #EM algorithm 62 | max_iter = 10 63 | tol = 1e-5 64 | obj = np.zeros(max_iter) 65 | for iter in range(max_iter): 66 | print("EM iter ", iter) 67 | #E-step 68 | resp, llh = self.estep() 69 | #M-step 70 | self.mstep(resp) 71 | #check convergence 72 | obj[iter] = llh 73 | if (iter > 1 and obj[iter] - obj[iter-1] < tol*abs(obj[iter])): 74 | break 75 | #end if 76 | #end for 77 | plt.figure() 78 | plt.plot(obj) 79 | plt.title('EM-GMM objective'); plt.xlabel("iter"); plt.ylabel("log-likelihood") 80 | plt.show() 81 | 82 | def estep(self): 83 | 84 | log_r = np.zeros((self.n, self.K)) 85 | for k in range(self.K): 86 | log_r[:,k] = multivariate_normal.logpdf(self.X, mean=self.mu[:,k], cov=self.sigma[:,:,k]) 87 | #end for 88 | log_r = 
log_r + np.log(self.pik) 89 | L = logsumexp(log_r, axis=1) 90 | llh = np.sum(L)/self.n #log likelihood 91 | log_r = log_r - L.reshape(-1,1) #normalize 92 | resp = np.exp(log_r) 93 | return resp, llh 94 | 95 | def mstep(self, resp): 96 | 97 | nk = np.sum(resp, axis=0) 98 | self.pik = nk/self.n 99 | sqrt_resp = np.sqrt(resp) 100 | for k in range(self.K): 101 | #update mu 102 | rx = np.multiply(resp[:,k].reshape(-1,1), self.X) 103 | self.mu[:,k] = np.sum(rx, axis=0) / nk[k] 104 | 105 | #update sigma 106 | Xm = self.X - self.mu[:,k] 107 | Xm = np.multiply(sqrt_resp[:,k].reshape(-1,1), Xm) 108 | self.sigma[:,:,k] = np.maximum(0, np.dot(np.transpose(Xm), Xm) / nk[k] + 1e-5 * np.eye(self.d)) 109 | #end for 110 | 111 | if __name__ == '__main__': 112 | 113 | gmm = GMM() 114 | mu0, V0 = gmm.generate_data() 115 | gmm.gmm_em() 116 | 117 | for k in range(mu0.shape[1]): 118 | print("cluster ", k) 119 | print("-----------") 120 | print("ground truth means:") 121 | print(mu0[:,k]) 122 | print("ground truth covariance:") 123 | print(V0[:,:,k]) 124 | #end for 125 | 126 | for k in range(mu0.shape[1]): 127 | print("cluster ", k) 128 | print("-----------") 129 | print("GMM-EM means:") 130 | print(gmm.mu[:,k]) 131 | print("GMM-EM covariance:") 132 | print(gmm.sigma[:,:,k]) 133 | 134 | plt.figure() 135 | ax = plt.axes() 136 | plt.scatter(gmm.X[:,0], gmm.X[:,1], color='b', alpha=0.5) 137 | 138 | for k in range(mu0.shape[1]): 139 | 140 | v, w = linalg.eigh(gmm.sigma[:,:,k]) 141 | v = 2.0 * np.sqrt(2.0) * np.sqrt(v) 142 | u = w[0] / linalg.norm(w[0]) 143 | 144 | # plot an ellipse to show the Gaussian component 145 | angle = np.arctan(u[1] / u[0]) 146 | angle = 180.0 * angle / np.pi # convert to degrees 147 | ell = mpl.patches.Ellipse(gmm.mu[:,k], v[0], v[1], 180.0 + angle, color='r', alpha=0.5) 148 | ax.add_patch(ell) 149 | 150 | # plot cluster centroids 151 | plt.scatter(gmm.mu[0,k], gmm.mu[1,k], s=80, marker='x', color='k', alpha=1) 152 | plt.title("Gaussian Mixture Model"); plt.xlabel("X1"); plt.ylabel("X2") 153 | plt.show() 154 | -------------------------------------------------------------------------------- /chp08/manifold_learning.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from time import time 5 | from sklearn import manifold 6 | 7 | from sklearn.datasets import load_digits 8 | from sklearn.neighbors import KDTree 9 | 10 | def plot_digits(X): 11 | 12 | n_img_per_row = np.amin((20, np.int(np.sqrt(X.shape[0])))) 13 | img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row)) 14 | for i in range(n_img_per_row): 15 | ix = 10 * i + 1 16 | for j in range(n_img_per_row): 17 | iy = 10 * j + 1 18 | img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8)) 19 | 20 | plt.figure() 21 | plt.imshow(img, cmap=plt.cm.binary) 22 | plt.xticks([]) 23 | plt.yticks([]) 24 | plt.title('A selection from the 64-dimensional digits dataset') 25 | 26 | def mnist_manifold(): 27 | 28 | digits = load_digits() 29 | 30 | X = digits.data 31 | y = digits.target 32 | 33 | num_classes = np.unique(y).shape[0] 34 | 35 | plot_digits(X) 36 | 37 | #TSNE 38 | #Barnes-Hut: O(d NlogN) where d is dim and N is the number of samples 39 | #Exact: O(d N^2) 40 | t0 = time() 41 | tsne = manifold.TSNE(n_components = 2, init = 'pca', method = 'barnes_hut', verbose = 1) 42 | X_tsne = tsne.fit_transform(X) 43 | t1 = time() 44 | print('t-SNE: %.2f sec' %(t1-t0)) 45 | tsne.get_params() 46 | 47 | plt.figure() 48 | for k in range(num_classes): 49 | 
plt.plot(X_tsne[y==k,0], X_tsne[y==k,1],'o') 50 | plt.title('t-SNE embedding of digits dataset') 51 | plt.xlabel('X1') 52 | plt.ylabel('X2') 53 | axes = plt.gca() 54 | axes.set_xlim([X_tsne[:,0].min()-1,X_tsne[:,0].max()+1]) 55 | axes.set_ylim([X_tsne[:,1].min()-1,X_tsne[:,1].max()+1]) 56 | plt.show() 57 | 58 | #ISOMAP 59 | #1. Nearest neighbors search: O(d log k N log N) 60 | #2. Shortest path graph search: O(N^2(k+log(N)) 61 | #3. Partial eigenvalue decomposition: O(dN^2) 62 | 63 | t0 = time() 64 | isomap = manifold.Isomap(n_neighbors = 5, n_components = 2) 65 | X_isomap = isomap.fit_transform(X) 66 | t1 = time() 67 | print('Isomap: %.2f sec' %(t1-t0)) 68 | isomap.get_params() 69 | 70 | plt.figure() 71 | for k in range(num_classes): 72 | plt.plot(X_isomap[y==k,0], X_isomap[y==k,1], 'o', label=str(k), linewidth = 2) 73 | plt.title('Isomap embedding of the digits dataset') 74 | plt.xlabel('X1') 75 | plt.ylabel('X2') 76 | plt.show() 77 | 78 | #Use KD-tree to find k-nearest neighbors to a query image 79 | kdt = KDTree(X_isomap) 80 | Q = np.array([[-160, -30],[-102, 14]]) 81 | kdt_dist, kdt_idx = kdt.query(Q,k=20) 82 | plot_digits(X[kdt_idx.ravel(),:]) 83 | 84 | if __name__ == "__main__": 85 | mnist_manifold() 86 | 87 | -------------------------------------------------------------------------------- /chp08/pca.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(42) 5 | 6 | class PCA(): 7 | def __init__(self, n_components = 2): 8 | self.n_components = n_components 9 | 10 | def covariance_matrix(self, X, Y=None): 11 | if Y is None: 12 | Y = X 13 | n_samples = np.shape(X)[0] 14 | covariance_matrix = (1 / (n_samples-1)) * (X - X.mean(axis=0)).T.dot(Y - Y.mean(axis=0)) 15 | return covariance_matrix 16 | 17 | def transform(self, X): 18 | Sigma = self.covariance_matrix(X) 19 | eig_vals, eig_vecs = np.linalg.eig(Sigma) 20 | 21 | #sort from largest to smallest and select the first n_components 22 | idx = eig_vals.argsort()[::-1] 23 | eig_vals = eig_vals[idx][:self.n_components] 24 | eig_vecs = np.atleast_1d(eig_vecs[:,idx])[:, :self.n_components] 25 | 26 | #project the data onto principal components 27 | X_transformed = X.dot(eig_vecs) 28 | 29 | return X_transformed 30 | 31 | if __name__ == "__main__": 32 | 33 | n = 20 34 | d = 5 35 | X = np.random.rand(n,d) 36 | 37 | pca = PCA(n_components = 2) 38 | X_pca = pca.transform(X) 39 | 40 | print(X_pca) 41 | 42 | plt.figure() 43 | plt.scatter(X_pca[:,0], X_pca[:,1], color='b', alpha=0.5) 44 | plt.title("Principal Component Analysis"); plt.xlabel("X1"); plt.ylabel("X2") 45 | plt.show() -------------------------------------------------------------------------------- /chp09/ga.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import string 3 | 4 | class GeneticAlgorithm(): 5 | 6 | def __init__(self, target_string, population_size, mutation_rate): 7 | self.target = target_string 8 | self.population_size = population_size 9 | self.mutation_rate = mutation_rate 10 | self.letters = [" "] + list(string.ascii_letters) 11 | 12 | def initialize(self): 13 | # init population with random strings 14 | self.population = [] 15 | for _ in range(self.population_size): 16 | individual = "".join(np.random.choice(self.letters, size=len(self.target))) 17 | self.population.append(individual) 18 | 19 | def calculate_fitness(self): 20 | #calculate fitness of each individual in a population 21 | 
population_fitness = [] 22 | for individual in self.population: 23 | # calculate loss as the distance between characters 24 | loss = 0 25 | for i in range(len(individual)): 26 | letter_i1 = self.letters.index(individual[i]) 27 | letter_i2 = self.letters.index(self.target[i]) 28 | loss += abs(letter_i1 - letter_i2) 29 | fitness = 1 / (loss + 1e-6) 30 | population_fitness.append(fitness) 31 | return population_fitness 32 | 33 | def mutate(self, individual): 34 | #randomly change the characters with probability equal to mutation_rate 35 | individual = list(individual) 36 | for j in range(len(individual)): 37 | if np.random.random() < self.mutation_rate: 38 | individual[j] = np.random.choice(self.letters) 39 | return "".join(individual) 40 | 41 | def crossover(self, parent1, parent2): 42 | #create children from parents by crossover 43 | cross_i = np.random.randint(0, len(parent1)) 44 | child1 = parent1[:cross_i] + parent2[cross_i:] 45 | child2 = parent2[:cross_i] + parent1[cross_i:] 46 | return child1, child2 47 | 48 | def run(self, iterations): 49 | self.initialize() 50 | 51 | for epoch in range(iterations): 52 | population_fitness = self.calculate_fitness() 53 | 54 | fittest_individual = self.population[np.argmax(population_fitness)] 55 | highest_fitness = max(population_fitness) 56 | 57 | if fittest_individual == self.target: 58 | break 59 | 60 | #select individual as a parent proportional to individual's fitness 61 | parent_probabilities = [fitness / sum(population_fitness) for fitness in population_fitness] 62 | 63 | #next generation 64 | new_population = [] 65 | for i in np.arange(0, self.population_size, 2): 66 | #select two parents 67 | parent1, parent2 = np.random.choice(self.population, size=2, p=parent_probabilities, replace=False) 68 | #crossover to produce offspring 69 | child1, child2 = self.crossover(parent1, parent2) 70 | #save mutated offspring for next generation 71 | new_population += [self.mutate(child1), self.mutate(child2)] 72 | 73 | print("iter %d, closest candidate: %s, fitness: %.4f" %(epoch, fittest_individual, highest_fitness)) 74 | self.population = new_population 75 | 76 | print("iter %d, final candidate: %s" %(epoch, fittest_individual)) 77 | 78 | if __name__ == "__main__": 79 | 80 | target_string = "Genome" 81 | population_size = 50 82 | mutation_rate = 0.1 83 | 84 | ga = GeneticAlgorithm(target_string, population_size, mutation_rate) 85 | ga.run(iterations = 1000) 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | -------------------------------------------------------------------------------- /chp09/inv_cov.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from scipy import linalg 4 | 5 | from datetime import datetime 6 | import pytz 7 | 8 | from sklearn.datasets import make_sparse_spd_matrix 9 | from sklearn.covariance import GraphicalLassoCV, ledoit_wolf 10 | from sklearn.preprocessing import StandardScaler 11 | from sklearn import cluster, manifold 12 | 13 | import seaborn as sns 14 | import matplotlib.pyplot as plt 15 | from matplotlib.collections import LineCollection 16 | 17 | import pandas_datareader.data as web 18 | 19 | np.random.seed(42) 20 | 21 | def main(): 22 | 23 | #generate data (synthetic) 24 | #num_samples = 60 25 | #num_features = 20 26 | #prec = make_sparse_spd_matrix(num_features, alpha=0.95, smallest_coef=0.4, largest_coef=0.7) 27 | #cov = linalg.inv(prec) 28 | #X = np.random.multivariate_normal(np.zeros(num_features), cov, size=num_samples) 29 | #X = 
StandardScaler().fit_transform(X) 30 | 31 | #generate data (actual) 32 | STOCKS = { 33 | 'SPY': 'S&P500', 34 | 'LQD': 'Bond_Corp', 35 | 'TIP': 'Bond_Treas', 36 | 'GLD': 'Gold', 37 | 'MSFT': 'Microsoft', 38 | 'XOM': 'Exxon', 39 | 'AMZN': 'Amazon', 40 | 'BAC': 'BofA', 41 | 'NVS': 'Novartis'} 42 | 43 | symbols, names = np.array(list(STOCKS.items())).T 44 | 45 | #load data 46 | #year, month, day, hour, minute, second, microsecond 47 | start = datetime(2015, 1, 1, 0, 0, 0, 0, pytz.utc) 48 | end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc) 49 | 50 | qopen, qclose = [], [] 51 | data_close, data_open = pd.DataFrame(), pd.DataFrame() 52 | for ticker in symbols: 53 | price = web.DataReader(ticker, 'stooq', start, end) 54 | qopen.append(price['Open']) 55 | qclose.append(price['Close']) 56 | 57 | data_open = pd.concat(qopen, axis=1) 58 | data_open.columns = symbols 59 | data_close = pd.concat(qclose, axis=1) 60 | data_close.columns = symbols 61 | 62 | #per day variation in price for each symbol 63 | variation = data_close - data_open 64 | variation = variation.dropna() 65 | 66 | X = variation.values 67 | X /= X.std(axis=0) #standardize to use correlations rather than covariance 68 | 69 | #estimate inverse covariance 70 | graph = GraphicalLassoCV() 71 | graph.fit(X) 72 | 73 | gl_cov = graph.covariance_ 74 | gl_prec = graph.precision_ 75 | gl_alphas = graph.cv_alphas_ 76 | gl_scores = graph.cv_results_['mean_test_score'] 77 | 78 | plt.figure() 79 | sns.heatmap(gl_prec, xticklabels=names, yticklabels=names) 80 | plt.xticks(rotation=45) 81 | plt.yticks(rotation=45) 82 | plt.tight_layout() 83 | plt.show() 84 | 85 | plt.figure() 86 | plt.plot(gl_alphas, gl_scores, marker='o', color='b', lw=2.0, label='GraphLassoCV') 87 | plt.title("Graph Lasso Alpha Selection") 88 | plt.xlabel("alpha") 89 | plt.ylabel("score") 90 | plt.legend() 91 | plt.show() 92 | 93 | #cluster using affinity propagation 94 | _, labels = cluster.affinity_propagation(gl_cov) 95 | num_labels = np.max(labels) 96 | 97 | for i in range(num_labels+1): 98 | print("Cluster %i: %s" %((i+1), ', '.join(names[labels==i]))) 99 | 100 | #find a low dim embedding for visualization 101 | node_model = manifold.LocallyLinearEmbedding(n_components=2, n_neighbors=6, eigen_solver='dense') 102 | embedding = node_model.fit_transform(X.T).T 103 | 104 | #generate plots 105 | plt.figure() 106 | plt.clf() 107 | ax = plt.axes([0.,0.,1.,1.]) 108 | plt.axis('off') 109 | 110 | partial_corr = gl_prec 111 | d = 1 / np.sqrt(np.diag(partial_corr)) 112 | non_zero = (np.abs(np.triu(partial_corr, k=1)) > 0.02) #connectivity matrix 113 | 114 | #plot the nodes 115 | plt.scatter(embedding[0], embedding[1], s = 100*d**2, c = labels, cmap = plt.cm.Spectral) 116 | 117 | #plot the edges 118 | start_idx, end_idx = np.where(non_zero) 119 | segments = [[embedding[:,start], embedding[:,stop]] for start, stop in zip(start_idx, end_idx)] 120 | values = np.abs(partial_corr[non_zero]) 121 | lc = LineCollection(segments, zorder=0, cmap=plt.cm.hot_r, norm=plt.Normalize(0,0.7*values.max())) 122 | lc.set_array(values) 123 | lc.set_linewidths(2*values) 124 | ax.add_collection(lc) 125 | 126 | #plot the labels 127 | for index, (name, label, (x,y)) in enumerate(zip(names, labels, embedding.T)): 128 | plt.text(x,y,name,size=12) 129 | 130 | plt.show() 131 | 132 | if __name__ == "__main__": 133 | main() 134 | -------------------------------------------------------------------------------- /chp09/kde.py: -------------------------------------------------------------------------------- 1 | import numpy as 
np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(14) 5 | 6 | class KDE(): 7 | 8 | def __init__(self): 9 | #Histogram and Gaussian Kernel Estimator used to 10 | #analyze RNA-seq data for flux estimation of a T7 promoter 11 | self.G = 1e9 #length of genome in base pairs (bp) 12 | self.C = 1e3 #number of unique molecules 13 | self.L = 100 #length of a read, bp 14 | self.N = 1e6 #number of reads, L bp long 15 | self.M = 1e4 #number of unique read sequences, bp 16 | self.LN = 1000 #total length of assembled / mapped RNA-seq reads 17 | self.FDR = 0.05 #false discovery rate 18 | 19 | #uniform sampling (poisson model) 20 | self.lmbda = (self.N * self.L) / self.G #expected number of bases covered 21 | self.C_est = self.M/(1-np.exp(-self.lmbda)) #library size estimate 22 | self.C_cvrg = self.G - self.G * np.exp(-self.lmbda) #base coverage 23 | self.N_gaps = self.N * np.exp(-self.lmbda) #number of gaps (uncovered bases) 24 | 25 | #gamma prior sampling (negative binomial model) 26 | #X = "number of failures before rth success" 27 | self.k = 0.5 # dispersion parameter (fit to data) 28 | self.p = self.lmbda/(self.lmbda + 1/self.k) # success probability 29 | self.r = 1/self.k # number of successes 30 | 31 | #RNAP binding data (RNA-seq) 32 | self.data = np.random.negative_binomial(self.r, self.p, size=self.LN) 33 | 34 | def histogram(self): 35 | self.bin_delta = 1 #smoothing parameter 36 | self.bin_range = np.arange(1, np.max(self.data), self.bin_delta) 37 | self.bin_counts, _ = np.histogram(self.data, bins=self.bin_range) 38 | 39 | #histogram density estimation 40 | #P = integral_R p(x) dx, where X is in R^3 41 | #p(x) = K/(NxV), where K=number of points in region R 42 | #N=total number of points, V=volume of region R 43 | 44 | rnap_density_est = self.bin_counts/(sum(self.bin_counts) * self.bin_delta) 45 | return rnap_density_est 46 | 47 | def kernel(self): 48 | #Gaussian kernel density estimator with smoothing parameter h 49 | #sum N Guassians centered at each data point, parameterized by common std dev h 50 | 51 | x_dim = 1 #dimension of x 52 | h = 10 #standard deviation 53 | 54 | rnap_density_support = np.arange(np.max(self.data)) 55 | rnap_density_est = 0 56 | for i in range(np.sum(self.bin_counts)): 57 | rnap_density_est += (1/(2*np.pi*h**2)**(x_dim/2.0))*np.exp(-(rnap_density_support - self.data[i])**2 / (2.0*h**2)) 58 | #end for 59 | 60 | rnap_density_est = rnap_density_est / np.sum(rnap_density_est) 61 | return rnap_density_est 62 | 63 | if __name__ == "__main__": 64 | 65 | kde = KDE() 66 | est1 = kde.histogram() 67 | est2 = kde.kernel() 68 | 69 | plt.figure() 70 | plt.plot(est1, '-b', label='histogram') 71 | plt.plot(est2, '--r', label='gaussian kernel') 72 | plt.title("RNA-seq density estimate based on negative binomial model") 73 | plt.xlabel("read length, [base pairs]"); plt.ylabel("density"); plt.legend() 74 | plt.show() -------------------------------------------------------------------------------- /chp09/lda.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.datasets import fetch_20newsgroups 4 | from sklearn.feature_extraction.text import TfidfVectorizer 5 | from wordcloud import WordCloud 6 | from scipy.special import digamma, gammaln 7 | 8 | np.random.seed(12) 9 | 10 | class LDA: 11 | def __init__(self, A, K): 12 | self.N = A.shape[0] # word (dictionary size) 13 | self.D = A.shape[1] # number of documents 14 | self.K = num_topics # number of topics 15 | 16 | self.A 
= A #term-document matrix 17 | 18 | #init word distribution beta 19 | self.eta = np.ones(self.N) #uniform dirichlet prior on words 20 | self.beta = np.zeros((self.N, self.K)) #NxK topic matrix 21 | for k in range(self.K): 22 | self.beta[:,k] = np.random.dirichlet(self.eta) 23 | self.beta[:,k] = self.beta[:,k] + 1e-6 #to avoid zero entries 24 | self.beta[:,k] = self.beta[:,k]/np.sum(self.beta[:,k]) 25 | #end for 26 | 27 | #init topic proportions theta and cluster assignments z 28 | self.alpha = np.ones(self.K) #uniform dirichlet prior on topics 29 | self.z = np.zeros((self.N, self.D)) #cluster assignments z_{id} 30 | for d in range(self.D): 31 | theta = np.random.dirichlet(self.alpha) 32 | wdn_idx = np.nonzero(self.A[:,d])[0] 33 | for i in range(len(wdn_idx)): 34 | z_idx = np.argmax(np.random.multinomial(1, theta)) 35 | self.z[wdn_idx[i],d] = z_idx #topic id 36 | #end for 37 | #end for 38 | 39 | #init variational parameters 40 | self.gamma = np.ones((self.D, self.K)) #topic proportions 41 | for d in range(self.D): 42 | theta = np.random.dirichlet(self.alpha) 43 | self.gamma[d,:] = theta 44 | #end for 45 | 46 | self.lmbda = np.transpose(self.beta) #np.ones((self.K, self.N))/self.N #word frequencies 47 | 48 | self.phi = np.zeros((self.D, self.N, self.K)) #assignments 49 | for d in range(self.D): 50 | for w in range(self.N): 51 | theta = np.random.dirichlet(self.alpha) 52 | self.phi[d,w,:] = np.random.multinomial(1, theta) 53 | #end for 54 | #end for 55 | 56 | def variational_inference(self): 57 | 58 | var_iter = 10 59 | llh = np.zeros(var_iter) 60 | llh_delta = np.zeros(var_iter) 61 | 62 | for iter in range(var_iter): 63 | print("VI iter: ", iter) 64 | J_old = self.elbo_objective() 65 | self.mean_field_update() 66 | J_new = self.elbo_objective() 67 | 68 | llh[iter] = J_old 69 | llh_delta[iter] = J_new - J_old 70 | #end for 71 | 72 | #update alpha and beta 73 | for k in range(self.K): 74 | self.alpha[k] = np.sum(self.gamma[:,k]) 75 | self.beta[:,k] = self.lmbda[k,:] / np.sum(self.lmbda[k,:]) 76 | #end for 77 | 78 | #update topic assignments 79 | for d in range(self.D): 80 | wdn_idx = np.nonzero(self.A[:,d])[0] 81 | for i in range(len(wdn_idx)): 82 | z_idx = np.argmax(self.phi[d,wdn_idx[i],:]) 83 | self.z[wdn_idx[i],d] = z_idx #topic id 84 | #end for 85 | #end for 86 | 87 | plt.figure() 88 | plt.plot(llh); plt.title('LDA VI'); 89 | plt.xlabel('mean field iterations'); plt.ylabel("ELBO") 90 | plt.show() 91 | 92 | return llh 93 | 94 | def mean_field_update(self): 95 | 96 | ndw = np.zeros((self.D, self.N)) #word counts for each document 97 | for d in range(self.D): 98 | doc = self.A[:,d] 99 | wdn_idx = np.nonzero(doc)[0] 100 | 101 | for i in range(len(wdn_idx)): 102 | ndw[d,wdn_idx[i]] += 1 103 | #end for 104 | 105 | #update gamma 106 | for k in range(self.K): 107 | self.gamma[d,k] = self.alpha[k] + np.dot(ndw[d,:], self.phi[d,:,k]) 108 | #end for 109 | 110 | #update phi 111 | for w in range(len(wdn_idx)): 112 | self.phi[d,wdn_idx[w],:] = np.exp(digamma(self.gamma[d,:]) - digamma(np.sum(self.gamma[d,:])) + digamma(self.lmbda[:,wdn_idx[w]]) - digamma(np.sum(self.lmbda, axis=1))) 113 | if (np.sum(self.phi[d,wdn_idx[w],:]) > 0): #to avoid 0/0 114 | self.phi[d,wdn_idx[w],:] = self.phi[d,wdn_idx[w],:] / np.sum(self.phi[d,wdn_idx[w],:]) #normalize phi 115 | #end if 116 | #end for 117 | 118 | #end for 119 | 120 | #update lambda given ndw for all docs 121 | for k in range(self.K): 122 | self.lmbda[k,:] = self.eta 123 | for d in range(self.D): 124 | self.lmbda[k,:] += np.multiply(ndw[d,:], self.phi[d,:,k]) 
125 | #end for 126 | #end for 127 | 128 | def elbo_objective(self): 129 | #see Blei 2003 130 | 131 | T1_A = gammaln(np.sum(self.alpha)) - np.sum(gammaln(self.alpha)) 132 | T1_B = 0 133 | for k in range(self.K): 134 | T1_B += np.dot(self.alpha[k]-1, digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1))) 135 | T1 = T1_A + T1_B 136 | 137 | T2 = 0 138 | for n in range(self.N): 139 | for k in range(self.K): 140 | T2 += self.phi[:,n,k] * (digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1))) 141 | 142 | T3 = 0 143 | for n in range(self.N): 144 | for k in range(self.K): 145 | T3 += self.phi[:,n,k] * np.log(self.beta[n,k]) 146 | 147 | T4 = 0 148 | T4_A = -gammaln(np.sum(self.gamma, axis=1)) + np.sum(gammaln(self.gamma), axis=1) 149 | T4_B = 0 150 | for k in range(self.K): 151 | T4_B = -(self.gamma[:,k]-1) * (digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1))) 152 | T4 = T4_A + T4_B 153 | 154 | T5 = 0 155 | for n in range(self.N): 156 | for k in range(self.K): 157 | T5 += -np.multiply(self.phi[:,n,k], np.log(self.phi[:,n,k] + 1e-6)) 158 | 159 | T15 = T1 + T2 + T3 + T4 + T5 160 | J = sum(T15)/self.D #averaged over documents 161 | return J 162 | 163 | if __name__ == "__main__": 164 | 165 | #LDA parameters 166 | num_features = 1000 #vocabulary size 167 | num_topics = 4 #fixed for LD 168 | 169 | #20 newsgroups dataset 170 | categories = ['sci.crypt', 'comp.graphics', 'sci.space', 'talk.religion.misc'] 171 | 172 | newsgroups = fetch_20newsgroups(shuffle=True, random_state=42, subset='train', 173 | remove=('headers', 'footers', 'quotes'), categories=categories) 174 | 175 | vectorizer = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=2, stop_words = 'english') 176 | dataset = vectorizer.fit_transform(newsgroups.data) 177 | A = np.transpose(dataset.toarray()) #term-document matrix 178 | 179 | lda = LDA(A=A, K=num_topics) 180 | llh = lda.variational_inference() 181 | id2word = {v:k for k,v in vectorizer.vocabulary_.items()} 182 | 183 | #display topics 184 | for k in range(num_topics): 185 | print("topic: ", k) 186 | print("----------") 187 | topic_words = "" 188 | top_words = np.argsort(lda.lmbda[k,:])[-10:] 189 | for i in range(len(top_words)): 190 | topic_words += id2word[top_words[i]] + " " 191 | print(id2word[top_words[i]]) 192 | 193 | wordcloud = WordCloud(width = 800, height = 800, 194 | background_color ='white', 195 | min_font_size = 10).generate(topic_words) 196 | 197 | plt.figure() 198 | plt.imshow(wordcloud) 199 | plt.axis("off") 200 | plt.tight_layout(pad = 0) 201 | plt.show() 202 | -------------------------------------------------------------------------------- /chp09/portfolio_opt.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import pandas as pd 4 | import matplotlib.pyplot as plt 5 | 6 | from sklearn.neighbors import KDTree 7 | from pandas.plotting import scatter_matrix 8 | from scipy.spatial import ConvexHull 9 | 10 | import pandas_datareader.data as web 11 | from datetime import datetime 12 | import pytz 13 | 14 | STOCKS = ['SPY','LQD','TIP','GLD','MSFT'] 15 | 16 | np.random.seed(42) 17 | 18 | if __name__ == "__main__": 19 | 20 | plt.close("all") 21 | 22 | #load data 23 | #year, month, day, hour, minute, second, microsecond 24 | start = datetime(2012, 1, 1, 0, 0, 0, 0, pytz.utc) 25 | end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc) 26 | 27 | data = pd.DataFrame() 28 | series = [] 29 | for ticker in STOCKS: 30 | price = web.DataReader(ticker, 'stooq', start, end) 31 | 
series.append(price['Close']) 32 | 33 | data = pd.concat(series, axis=1) 34 | data.columns = STOCKS 35 | data = data.dropna() 36 | 37 | #plot data correlations 38 | scatter_matrix(data, alpha=0.2, diagonal='kde') 39 | plt.show() 40 | 41 | #get current portfolio 42 | cash = 10000 43 | num_assets = np.size(STOCKS) 44 | cur_value = (1e4-5e3)*np.random.rand(num_assets,1) + 5e3 45 | tot_value = np.sum(cur_value) 46 | weights = cur_value.ravel()/float(tot_value) 47 | 48 | #compute portfolio risk 49 | Sigma = data.cov().values 50 | Corr = data.corr().values 51 | volatility = np.sqrt(np.dot(weights.T, np.dot(Sigma, weights))) 52 | 53 | plt.figure() 54 | plt.title('Correlation Matrix') 55 | plt.imshow(Corr, cmap='gray') 56 | plt.xticks(range(len(STOCKS)),data.columns) 57 | plt.yticks(range(len(STOCKS)),data.columns) 58 | plt.colorbar() 59 | plt.show() 60 | 61 | #generate random portfolio weights 62 | num_trials = 1000 63 | W = np.random.rand(num_trials, np.size(weights)) 64 | W = W/np.sum(W,axis=1).reshape(num_trials,1) #normalize 65 | 66 | pv = np.zeros(num_trials) #portoflio value w'v 67 | ps = np.zeros(num_trials) #portfolio sigma: sqrt(w'Sw) 68 | 69 | avg_price = data.mean().values 70 | adj_price = avg_price 71 | 72 | for i in range(num_trials): 73 | pv[i] = np.sum(adj_price * W[i,:]) 74 | ps[i] = np.sqrt(np.dot(W[i,:].T, np.dot(Sigma, W[i,:]))) 75 | 76 | points = np.vstack((ps,pv)).T 77 | hull = ConvexHull(points) 78 | 79 | plt.figure() 80 | plt.scatter(ps, pv, marker='o', color='b', linewidth = 3.0, label = 'tangent portfolio') 81 | plt.scatter(volatility, np.sum(adj_price * weights), marker = 's', color = 'r', linewidth = 3.0, label = 'current') 82 | plt.plot(points[hull.vertices,0], points[hull.vertices,1], linewidth = 2.0) 83 | plt.title('expected return vs volatility') 84 | plt.ylabel('expected price') 85 | plt.xlabel('portfolio std dev') 86 | plt.legend() 87 | plt.grid(True) 88 | plt.show() 89 | 90 | #query for nearest neighbor portfolio 91 | knn = 5 92 | kdt = KDTree(points) 93 | query_point = np.array([2, 115]).reshape(1,-1) 94 | kdt_dist, kdt_idx = kdt.query(query_point,k=knn) 95 | print("top-%d closest to query portfolios:" %knn) 96 | print("values: ", pv[kdt_idx.ravel()]) 97 | print("sigmas: ", ps[kdt_idx.ravel()]) 98 | 99 | -------------------------------------------------------------------------------- /chp09/sim_annealing.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(42) 5 | 6 | class simulated_annealing(): 7 | def __init__(self): 8 | self.max_iter = 1000 9 | self.conv_thresh = 1e-4 10 | self.conv_window = 10 11 | 12 | self.samples = np.zeros((self.max_iter, 2)) 13 | self.energies = np.zeros(self.max_iter) 14 | self.temperatures = np.zeros(self.max_iter) 15 | 16 | def target(self, x, y): 17 | z = 3*(1-x)**2 * np.exp(-x**2 - (y+1)**2) \ 18 | - 10*(x/5 -x**3 - y**5) * np.exp(-x**2 - y**2) \ 19 | - (1/3)*np.exp(-(x+1)**2 - y**2) 20 | return z 21 | 22 | def proposal(self, x, y): 23 | mean = np.array([x, y]) 24 | cov = 1.1 * np.eye(2) 25 | x_new, y_new = np.random.multivariate_normal(mean, cov) 26 | return x_new, y_new 27 | 28 | def temperature_schedule(self, T, iter): 29 | return 0.9 * T 30 | 31 | def run(self, x_init, y_init): 32 | 33 | converged = False 34 | T = 1 35 | self.temperatures[0] = T 36 | num_accepted = 0 37 | x_old, y_old = x_init, y_init 38 | energy_old = self.target(x_init, y_init) 39 | 40 | iter = 1 41 | while not converged: 42 | print("iter: {:4d}, 
temp: {:.4f}, energy = {:.6f}".format(iter, T, energy_old)) 43 | x_new, y_new = self.proposal(x_old, y_old) 44 | energy_new = self.target(x_new, y_new) 45 | 46 | #check convergence 47 | if iter > 2*self.conv_window: 48 | vals = self.energies[iter-self.conv_window : iter-1] 49 | if (np.std(vals) < self.conv_thresh): 50 | converged = True 51 | #end if 52 | #end if 53 | 54 | alpha = np.exp((energy_old - energy_new)/T) 55 | r = np.minimum(1, alpha) 56 | u = np.random.uniform(0, 1) 57 | if u < r: 58 | x_old, y_old = x_new, y_new 59 | num_accepted += 1 60 | energy_old = energy_new 61 | #end if 62 | self.samples[iter, :] = np.array([x_old, y_old]) 63 | self.energies[iter] = energy_old 64 | 65 | T = self.temperature_schedule(T, iter) 66 | self.temperatures[iter] = T 67 | 68 | iter = iter + 1 69 | 70 | if (iter >= self.max_iter): converged = True #stop before indexing past the preallocated arrays 71 | #end while 72 | 73 | niter = iter - 1 74 | acceptance_rate = num_accepted / niter 75 | print("acceptance rate: ", acceptance_rate) 76 | 77 | x_opt, y_opt = x_old, y_old 78 | 79 | return x_opt, y_opt, self.samples[:niter,:], self.energies[:niter], self.temperatures[:niter] 80 | 81 | if __name__ == "__main__": 82 | 83 | SA = simulated_annealing() 84 | 85 | nx, ny = (1000, 1000) 86 | x = np.linspace(-2, 2, nx) 87 | y = np.linspace(-2, 2, ny) 88 | xv, yv = np.meshgrid(x, y) 89 | 90 | z = SA.target(xv, yv) 91 | plt.figure() 92 | plt.contourf(x, y, z) 93 | plt.title("energy landscape") 94 | plt.show() 95 | 96 | #find global minimum by exhaustive search 97 | min_search = np.min(z) 98 | argmin_search = np.argwhere(z == min_search) 99 | ymin, xmin = argmin_search[0][0], argmin_search[0][1] #rows of z index y, columns index x 100 | print("global minimum (exhaustive search): ", min_search) 101 | print("located at (x, y): ", x[xmin], y[ymin]) 102 | 103 | #find global minimum by simulated annealing 104 | x_init, y_init = 0, 0 105 | x_opt, y_opt, samples, energies, temperatures = SA.run(x_init, y_init) 106 | print("global minimum (simulated annealing): ", energies[-1]) 107 | print("located at (x, y): ", x_opt, y_opt) 108 | 109 | plt.figure() 110 | plt.plot(energies) 111 | plt.title("SA sampled energies") 112 | plt.show() 113 | 114 | plt.figure() 115 | plt.plot(temperatures) 116 | plt.title("Temperature Schedule") 117 | plt.show() 118 | -------------------------------------------------------------------------------- /chp10/image_search.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import tensorflow as tf 4 | from tensorflow import keras 5 | 6 | from keras import Model 7 | from keras.applications.resnet50 import ResNet50 8 | from keras.preprocessing import image 9 | from keras.applications.resnet50 import preprocess_input 10 | 11 | from keras.callbacks import ModelCheckpoint 12 | from keras.callbacks import TensorBoard 13 | from keras.callbacks import LearningRateScheduler 14 | from keras.callbacks import EarlyStopping 15 | 16 | import os 17 | import random 18 | from PIL import Image 19 | from scipy.spatial import distance 20 | from sklearn.decomposition import PCA 21 | 22 | import matplotlib.pyplot as plt 23 | 24 | tf.keras.utils.set_random_seed(42) 25 | 26 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 27 | DATA_PATH = "/content/drive/MyDrive/data/101_ObjectCategories/" 28 | 29 | def get_closest_images(acts, query_image_idx, num_results=5): 30 | 31 | num_images, dim = acts.shape 32 | distances = [] 33 | for image_idx in range(num_images): 34 |
distances.append(distance.euclidean(acts[query_image_idx, :], acts[image_idx, :])) 35 | #end for 36 | idx_closest = sorted(range(len(distances)), key=lambda k: distances[k])[1:num_results+1] 37 | 38 | return idx_closest 39 | 40 | def get_concatenated_images(images, indexes, thumb_height): 41 | 42 | thumbs = [] 43 | for idx in indexes: 44 | img = Image.open(images[idx]) 45 | img = img.resize((int(img.width * thumb_height / img.height), int(thumb_height)), Image.ANTIALIAS) 46 | if img.mode != "RGB": 47 | img = img.convert("RGB") 48 | thumbs.append(img) 49 | concat_image = np.concatenate([np.asarray(t) for t in thumbs], axis=1) 50 | 51 | return concat_image 52 | 53 | if __name__ == "__main__": 54 | 55 | num_images = 5000 56 | images = [os.path.join(dp,f) for dp, dn, filenames in os.walk(DATA_PATH) for f in filenames \ 57 | if os.path.splitext(f)[1].lower() in ['.jpg','.png','.jpeg']] 58 | images = [images[i] for i in sorted(random.sample(range(len(images)), num_images))] 59 | 60 | #CNN encodings 61 | base_model = ResNet50(weights='imagenet') 62 | model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output) 63 | 64 | activations = [] 65 | for idx, image_path in enumerate(images): 66 | if idx % 100 == 0: 67 | print('getting activations for %d/%d image...' %(idx,len(images))) 68 | img = image.load_img(image_path, target_size=(224, 224)) 69 | x = image.img_to_array(img) 70 | x = np.expand_dims(x, axis=0) 71 | x = preprocess_input(x) 72 | features = model.predict(x) 73 | activations.append(features.flatten().reshape(1,-1)) 74 | 75 | # reduce activation dimension 76 | print('computing PCA...') 77 | acts = np.concatenate(activations, axis=0) 78 | pca = PCA(n_components=300) 79 | pca.fit(acts) 80 | acts = pca.transform(acts) 81 | 82 | print('image search...') 83 | query_image_idx = int(num_images*random.random()) 84 | idx_closest = get_closest_images(acts, query_image_idx) 85 | query_image = get_concatenated_images(images, [query_image_idx], 300) 86 | results_image = get_concatenated_images(images, idx_closest, 300) 87 | 88 | plt.figure() 89 | plt.imshow(query_image) 90 | plt.title("query image (%d)" %query_image_idx) 91 | plt.show() 92 | #plt.savefig('./figures/query_image.png') 93 | 94 | plt.figure() 95 | plt.imshow(results_image) 96 | plt.title("result images") 97 | plt.show() 98 | #plt.savefig('./figures/result_images.png') -------------------------------------------------------------------------------- /chp10/keras_optimizers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import tensorflow as tf 4 | from tensorflow import keras 5 | 6 | from keras import backend as K 7 | from keras.models import Sequential 8 | from keras.layers import Dense, Dropout, Flatten 9 | from keras.layers import Conv2D, MaxPooling2D, Activation 10 | 11 | from keras.callbacks import ModelCheckpoint 12 | from keras.callbacks import TensorBoard 13 | from keras.callbacks import LearningRateScheduler 14 | from keras.callbacks import EarlyStopping 15 | 16 | import math 17 | import matplotlib.pyplot as plt 18 | 19 | tf.keras.utils.set_random_seed(42) 20 | 21 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 22 | 23 | def scheduler(epoch, lr): 24 | if epoch < 4: 25 | return lr 26 | else: 27 | return lr * tf.math.exp(-0.1) 28 | 29 | if __name__ == "__main__": 30 | 31 | img_rows, img_cols = 32, 32 32 | (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data() 33 | x_train = 
x_train.reshape(x_train.shape[0], img_rows, img_cols, 3).astype("float32") / 255 34 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 3).astype("float32") / 255 35 | 36 | y_train_label = keras.utils.to_categorical(y_train) 37 | y_test_label = keras.utils.to_categorical(y_test) 38 | num_classes = y_train_label.shape[1] 39 | 40 | #training parameters 41 | batch_size = 256 42 | num_epochs = 32 43 | 44 | #model parameters 45 | num_filters_l1 = 64 46 | num_filters_l2 = 128 47 | 48 | #CNN architecture 49 | cnn = Sequential() 50 | cnn.add(Conv2D(num_filters_l1, kernel_size = (5, 5), input_shape=(img_rows, img_cols, 3), padding='same')) 51 | cnn.add(Activation('relu')) 52 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 53 | 54 | cnn.add(Conv2D(num_filters_l2, kernel_size = (5, 5), padding='same')) 55 | cnn.add(Activation('relu')) 56 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 57 | 58 | cnn.add(Flatten()) 59 | cnn.add(Dense(128)) 60 | cnn.add(Activation('relu')) 61 | 62 | cnn.add(Dense(num_classes)) 63 | cnn.add(Activation('softmax')) 64 | 65 | #optimizers 66 | opt1 = tf.keras.optimizers.SGD() 67 | opt2 = tf.keras.optimizers.SGD(momentum=0.9, nesterov=True) 68 | opt3 = tf.keras.optimizers.RMSprop() 69 | opt4 = tf.keras.optimizers.Adam() 70 | 71 | optimizer_list = [opt1, opt2, opt3, opt4] 72 | 73 | history_list = [] 74 | 75 | for idx in range(len(optimizer_list)): 76 | 77 | K.clear_session() 78 | 79 | opt = optimizer_list[idx] 80 | 81 | cnn.compile( 82 | loss=keras.losses.CategoricalCrossentropy(), 83 | optimizer=opt, 84 | metrics=["accuracy"] 85 | ) 86 | 87 | #define callbacks 88 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 89 | callbacks_list = [reduce_lr] 90 | 91 | #training loop 92 | hist = cnn.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2) 93 | history_list.append(hist) 94 | 95 | #end for 96 | 97 | plt.figure() 98 | plt.plot(history_list[0].history['loss'], 'b', lw=2.0, label='SGD') 99 | plt.plot(history_list[1].history['loss'], '--r', lw=2.0, label='SGD Nesterov') 100 | plt.plot(history_list[2].history['loss'], ':g', lw=2.0, label='RMSProp') 101 | plt.plot(history_list[3].history['loss'], '-.k', lw=2.0, label='ADAM') 102 | plt.title('LeNet, CIFAR-100, Optimizers') 103 | plt.xlabel('Epochs') 104 | plt.ylabel('Cross-Entropy Training Loss') 105 | plt.legend(loc='upper right') 106 | plt.show() 107 | #plt.savefig('./figures/lenet_loss.png') 108 | 109 | plt.figure() 110 | plt.plot(history_list[0].history['val_accuracy'], 'b', lw=2.0, label='SGD') 111 | plt.plot(history_list[1].history['val_accuracy'], '--r', lw=2.0, label='SGD Nesterov') 112 | plt.plot(history_list[2].history['val_accuracy'], ':g', lw=2.0, label='RMSProp') 113 | plt.plot(history_list[3].history['val_accuracy'], '-.k', lw=2.0, label='ADAM') 114 | plt.title('LeNet, CIFAR-100, Optimizers') 115 | plt.xlabel('Epochs') 116 | plt.ylabel('Validation Accuracy') 117 | plt.legend(loc='upper right') 118 | plt.show() 119 | #plt.savefig('./figures/lenet_loss.png') 120 | 121 | plt.figure() 122 | plt.plot(history_list[0].history['lr'], 'b', lw=2.0, label='SGD') 123 | plt.plot(history_list[1].history['lr'], '--r', lw=2.0, label='SGD Nesterov') 124 | plt.plot(history_list[2].history['lr'], ':g', lw=2.0, label='RMSProp') 125 | plt.plot(history_list[3].history['lr'], '-.k', lw=2.0, label='ADAM') 126 | plt.title('LeNet, CIFAR-100, Optimizers') 127 | plt.xlabel('Epochs') 128 | plt.ylabel('Learning Rate Schedule') 129 | 
plt.legend(loc='upper right') 130 | plt.show() 131 | #plt.savefig('./figures/lenet_loss.png') 132 | 133 | -------------------------------------------------------------------------------- /chp10/lenet.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import tensorflow as tf 4 | from tensorflow import keras 5 | 6 | from keras.models import Sequential 7 | from keras.layers import Dense, Dropout, Flatten 8 | from keras.layers import Conv2D, MaxPooling2D, Activation 9 | 10 | from keras.callbacks import ModelCheckpoint 11 | from keras.callbacks import TensorBoard 12 | from keras.callbacks import LearningRateScheduler 13 | from keras.callbacks import EarlyStopping 14 | 15 | import math 16 | import matplotlib.pyplot as plt 17 | 18 | tf.keras.utils.set_random_seed(42) 19 | 20 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 21 | 22 | def scheduler(epoch, lr): 23 | if epoch < 4: 24 | return lr 25 | else: 26 | return lr * tf.math.exp(-0.1) 27 | 28 | 29 | if __name__ == "__main__": 30 | 31 | img_rows, img_cols = 28, 28 32 | (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data() 33 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1).astype("float32") / 255 34 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1).astype("float32") / 255 35 | 36 | y_train_label = keras.utils.to_categorical(y_train) 37 | y_test_label = keras.utils.to_categorical(y_test) 38 | num_classes = y_train_label.shape[1] 39 | 40 | #training parameters 41 | batch_size = 128 42 | num_epochs = 8 43 | 44 | #model parameters 45 | num_filters_l1 = 32 46 | num_filters_l2 = 64 47 | 48 | #CNN architecture 49 | cnn = Sequential() 50 | #CONV -> RELU -> MAXPOOL 51 | cnn.add(Conv2D(num_filters_l1, kernel_size = (5, 5), input_shape=(img_rows, img_cols, 1), padding='same')) 52 | cnn.add(Activation('relu')) 53 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 54 | 55 | #CONV -> RELU -> MAXPOOL 56 | cnn.add(Conv2D(num_filters_l2, kernel_size = (5, 5), padding='same')) 57 | cnn.add(Activation('relu')) 58 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 59 | 60 | #FC -> RELU 61 | cnn.add(Flatten()) 62 | cnn.add(Dense(128)) 63 | cnn.add(Activation('relu')) 64 | 65 | #Softmax Classifier 66 | cnn.add(Dense(num_classes)) 67 | cnn.add(Activation('softmax')) 68 | 69 | cnn.compile( 70 | loss=keras.losses.CategoricalCrossentropy(), 71 | optimizer=tf.keras.optimizers.Adam(), 72 | metrics=["accuracy"] 73 | ) 74 | 75 | cnn.summary() 76 | 77 | #define callbacks 78 | file_name = SAVE_PATH + 'lenet-weights-checkpoint.h5' 79 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 80 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 81 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 82 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 83 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 84 | 85 | hist = cnn.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2) 86 | 87 | test_scores = cnn.evaluate(x_test, y_test_label, verbose=2) 88 | 89 | print("Test loss:", test_scores[0]) 90 | print("Test accuracy:", test_scores[1]) 91 | 92 | y_prob = cnn.predict(x_test) 93 | y_pred = y_prob.argmax(axis=-1) 94 | 95 | #create submission 96 | submission = pd.DataFrame(index=pd.RangeIndex(start=1, stop=10001, step=1), columns=['Label']) 97 | 
submission['Label'] = y_pred.reshape(-1,1) 98 | submission.index.name = "ImageId" 99 | submission.to_csv(SAVE_PATH + '/lenet_pred.csv', index=True, header=True) 100 | 101 | plt.figure() 102 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 103 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 104 | plt.title('LeNet model') 105 | plt.xlabel('Epochs') 106 | plt.ylabel('Cross-Entropy Loss') 107 | plt.legend(loc='upper right') 108 | plt.show() 109 | #plt.savefig('./figures/lenet_loss.png') 110 | 111 | plt.figure() 112 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train') 113 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val') 114 | plt.title('LeNet model') 115 | plt.xlabel('Epochs') 116 | plt.ylabel('Accuracy') 117 | plt.legend(loc='upper left') 118 | plt.show() 119 | #plt.savefig('./figures/lenet_acc.png') 120 | 121 | plt.figure() 122 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 123 | plt.title('LeNet model') 124 | plt.xlabel('Epochs') 125 | plt.ylabel('Learning Rate') 126 | plt.legend() 127 | plt.show() 128 | #plt.savefig('./figures/lenet_learning_rate.png') 129 | -------------------------------------------------------------------------------- /chp10/lstm_sentiment.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | from keras.models import Sequential 8 | from keras.layers import LSTM, Bidirectional 9 | from keras.layers import Dense, Dropout, Activation, Embedding 10 | 11 | from keras import regularizers 12 | from keras.preprocessing import sequence 13 | from keras.utils import np_utils 14 | 15 | from keras.callbacks import ModelCheckpoint 16 | from keras.callbacks import TensorBoard 17 | from keras.callbacks import LearningRateScheduler 18 | from keras.callbacks import EarlyStopping 19 | 20 | import matplotlib.pyplot as plt 21 | 22 | tf.keras.utils.set_random_seed(42) 23 | 24 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 25 | 26 | def scheduler(epoch, lr): 27 | if epoch < 4: 28 | return lr 29 | else: 30 | return lr * tf.math.exp(-0.1) 31 | 32 | if __name__ == "__main__": 33 | 34 | #load dataset 35 | max_words = 20000 # top 20K most frequent words 36 | seq_len = 200 # first 200 words of each movie review 37 | (x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=max_words) 38 | 39 | x_train = keras.utils.pad_sequences(x_train, maxlen=seq_len) 40 | x_val = keras.utils.pad_sequences(x_val, maxlen=seq_len) 41 | 42 | #training params 43 | batch_size = 256 44 | num_epochs = 8 45 | 46 | #model parameters 47 | hidden_size = 64 48 | embed_dim = 128 49 | lstm_dropout = 0.2 50 | dense_dropout = 0.5 51 | weight_decay = 1e-3 52 | 53 | #LSTM architecture 54 | model = Sequential() 55 | model.add(Embedding(max_words, embed_dim, input_length=seq_len)) 56 | model.add(Bidirectional(LSTM(hidden_size, dropout=lstm_dropout, recurrent_dropout=lstm_dropout))) 57 | model.add(Dense(hidden_size, kernel_regularizer=regularizers.l2(weight_decay), activation='relu')) 58 | model.add(Dropout(dense_dropout)) 59 | model.add(Dense(hidden_size/4, kernel_regularizer=regularizers.l2(weight_decay), activation='relu')) 60 | model.add(Dense(1, activation='sigmoid')) 61 | 62 | model.compile( 63 | loss=keras.losses.BinaryCrossentropy(), 64 | optimizer=tf.keras.optimizers.Adam(), 65 | metrics=["accuracy"] 66 | ) 67 | 68 | model.summary() 69 | 70 | #define callbacks 71 | 
file_name = SAVE_PATH + 'lstm-weights-checkpoint.h5' 72 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 73 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 74 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 75 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 76 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 77 | 78 | hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_data=(x_val, y_val)) 79 | 80 | test_scores = model.evaluate(x_val, y_val, verbose=2) 81 | 82 | print("Test loss:", test_scores[0]) 83 | print("Test accuracy:", test_scores[1]) 84 | 85 | plt.figure() 86 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 87 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 88 | plt.title('LSTM model') 89 | plt.xlabel('Epochs') 90 | plt.ylabel('Cross-Entropy Loss') 91 | plt.legend(loc='upper right') 92 | plt.show() 93 | #plt.savefig('./figures/lstm_loss.png') 94 | 95 | plt.figure() 96 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train') 97 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val') 98 | plt.title('LSTM model') 99 | plt.xlabel('Epochs') 100 | plt.ylabel('Accuracy') 101 | plt.legend(loc='upper left') 102 | plt.show() 103 | #plt.savefig('./figures/lstm_acc.png') 104 | 105 | plt.figure() 106 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 107 | plt.title('LSTM model') 108 | plt.xlabel('Epochs') 109 | plt.ylabel('Learning Rate') 110 | plt.legend() 111 | plt.show() 112 | #plt.savefig('./figures/lstm_learning_rate.png') 113 | 114 | -------------------------------------------------------------------------------- /chp10/mlp.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | from tensorflow import keras 4 | 5 | from keras.models import Sequential 6 | from keras.layers import Dense, Dropout 7 | 8 | from keras.callbacks import ModelCheckpoint 9 | from keras.callbacks import TensorBoard 10 | from keras.callbacks import LearningRateScheduler 11 | from keras.callbacks import EarlyStopping 12 | 13 | import math 14 | import matplotlib.pyplot as plt 15 | 16 | tf.keras.utils.set_random_seed(42) 17 | 18 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 19 | 20 | def scheduler(epoch, lr): 21 | if epoch < 4: 22 | return lr 23 | else: 24 | return lr * tf.math.exp(-0.1) 25 | 26 | if __name__ == "__main__": 27 | 28 | (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data() 29 | x_train = x_train.reshape(60000, 784).astype("float32") / 255 30 | x_test = x_test.reshape(10000, 784).astype("float32") / 255 31 | 32 | y_train_label = keras.utils.to_categorical(y_train) 33 | y_test_label = keras.utils.to_categorical(y_test) 34 | num_classes = y_train_label.shape[1] 35 | 36 | #training params 37 | batch_size = 64 38 | num_epochs = 16 39 | 40 | model = Sequential() 41 | model.add(Dense(128, input_shape=(784, ), activation='relu')) 42 | model.add(Dense(64, activation='relu')) 43 | model.add(Dropout(0.5)) 44 | model.add(Dense(10, activation='softmax')) 45 | 46 | model.compile( 47 | loss=keras.losses.CategoricalCrossentropy(), 48 | optimizer=tf.keras.optimizers.RMSprop(), 49 | metrics=["accuracy"] 50 | ) 51 | 52 | model.summary() 53 | 54 | #define callbacks 55 | file_name = SAVE_PATH + 'mlp-weights-checkpoint.h5' 56 | checkpoint = 
ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 57 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 58 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 59 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 60 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 61 | 62 | hist = model.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2) 63 | 64 | test_scores = model.evaluate(x_test, y_test_label, verbose=2) 65 | 66 | print("Test loss:", test_scores[0]) 67 | print("Test accuracy:", test_scores[1]) 68 | 69 | plt.figure() 70 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 71 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 72 | plt.title('MLP model') 73 | plt.xlabel('Epochs') 74 | plt.ylabel('Cross-Entropy Loss') 75 | plt.legend(loc='upper right') 76 | plt.show() 77 | #plt.savefig('./figures/mlp_loss.png') 78 | 79 | plt.figure() 80 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train') 81 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val') 82 | plt.title('MLP model') 83 | plt.xlabel('Epochs') 84 | plt.ylabel('Accuracy') 85 | plt.legend(loc='upper left') 86 | plt.show() 87 | #plt.savefig('./figures/mlp_acc.png') 88 | 89 | plt.figure() 90 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 91 | plt.title('MLP model') 92 | plt.xlabel('Epochs') 93 | plt.ylabel('Learning Rate') 94 | plt.legend() 95 | plt.show() 96 | #plt.savefig('./figures/mlp_learning_rate.png') -------------------------------------------------------------------------------- /chp10/multi_input_nn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | import os 8 | import re 9 | import csv 10 | import codecs 11 | 12 | from keras.models import Model 13 | from keras.layers import Input, Flatten, Concatenate, LSTM, Lambda, Dropout 14 | from keras.layers import Dense, Dropout, Activation, Embedding 15 | from keras.layers import Conv1D, MaxPooling1D 16 | from keras.layers import TimeDistributed, Bidirectional, BatchNormalization 17 | 18 | from keras import backend as K 19 | from keras.preprocessing.text import Tokenizer 20 | from keras.utils import pad_sequences 21 | 22 | from nltk.corpus import stopwords 23 | from nltk.stem import SnowballStemmer 24 | 25 | from keras import regularizers 26 | from keras.preprocessing import sequence 27 | from keras.utils import np_utils 28 | 29 | from keras.callbacks import ModelCheckpoint 30 | from keras.callbacks import TensorBoard 31 | from keras.callbacks import LearningRateScheduler 32 | from keras.callbacks import EarlyStopping 33 | 34 | import matplotlib.pyplot as plt 35 | 36 | tf.keras.utils.set_random_seed(42) 37 | 38 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 39 | DATA_PATH = "/content/drive/MyDrive/data/" 40 | 41 | GLOVE_DIR = DATA_PATH 42 | TRAIN_DATA_FILE = DATA_PATH + 'quora_train.csv' 43 | TEST_DATA_FILE = DATA_PATH + 'quora_test.csv' 44 | MAX_SEQUENCE_LENGTH = 30 45 | MAX_NB_WORDS = 200000 46 | EMBEDDING_DIM = 300 47 | VALIDATION_SPLIT = 0.01 48 | 49 | def scheduler(epoch, lr): 50 | if epoch < 4: 51 | return lr 52 | else: 53 | return lr * tf.math.exp(-0.1) 54 | 55 | def text_to_wordlist(row, remove_stopwords=False, stem_words=False): 56 | # Clean the text, with the option to 
remove stopwords and to stem words. 57 | 58 | text = row['question'] 59 | # Convert words to lower case and split them 60 | if type(text) is str: 61 | text = text.lower().split() 62 | else: 63 | return " " 64 | 65 | # Optionally, remove stop words 66 | if remove_stopwords: 67 | stops = set(stopwords.words("english")) 68 | text = [w for w in text if not w in stops] 69 | 70 | text = " ".join(text) 71 | 72 | # Clean the text 73 | text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text) 74 | 75 | # Optionally, shorten words to their stems 76 | if stem_words: 77 | text = text.split() 78 | stemmer = SnowballStemmer('english') 79 | stemmed_words = [stemmer.stem(word) for word in text] 80 | text = " ".join(stemmed_words) 81 | 82 | # Return a list of words 83 | return(text) 84 | 85 | if __name__ == "__main__": 86 | 87 | #load embeddings 88 | print('Indexing word vectors...') 89 | embeddings_index = {} 90 | f = codecs.open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf-8') 91 | for line in f: 92 | values = line.split(' ') 93 | word = values[0] 94 | coefs = np.asarray(values[1:], dtype='float32') 95 | embeddings_index[word] = coefs 96 | f.close() 97 | print('Found %s word vectors.' % len(embeddings_index)) 98 | 99 | #load dataset 100 | train_df = pd.read_csv(TRAIN_DATA_FILE) 101 | test_df = pd.read_csv(TEST_DATA_FILE) 102 | 103 | q1df = train_df['question1'].reset_index() 104 | q2df = train_df['question2'].reset_index() 105 | q1df.columns = ['index', 'question'] 106 | q2df.columns = ['index', 'question'] 107 | texts_1 = q1df.apply(text_to_wordlist, axis=1, raw=False).tolist() 108 | texts_2 = q2df.apply(text_to_wordlist, axis=1, raw=False).tolist() 109 | labels = train_df['is_duplicate'].astype(int).tolist() 110 | print('Found %s texts.' % len(texts_1)) 111 | del q1df 112 | del q2df 113 | 114 | q1df = test_df['question1'].reset_index() 115 | q2df = test_df['question2'].reset_index() 116 | q1df.columns = ['index', 'question'] 117 | q2df.columns = ['index', 'question'] 118 | test_texts_1 = q1df.apply(text_to_wordlist, axis=1, raw=False).tolist() 119 | test_texts_2 = q2df.apply(text_to_wordlist, axis=1, raw=False).tolist() 120 | test_labels = np.arange(0, test_df.shape[0]) 121 | print('Found %s texts.' % len(test_texts_1)) 122 | del q1df 123 | del q2df 124 | 125 | #tokenize, convert to sequences and pad 126 | tokenizer = Tokenizer(nb_words=MAX_NB_WORDS) 127 | tokenizer.fit_on_texts(texts_1 + texts_2 + test_texts_1 + test_texts_2) 128 | sequences_1 = tokenizer.texts_to_sequences(texts_1) 129 | sequences_2 = tokenizer.texts_to_sequences(texts_2) 130 | word_index = tokenizer.word_index 131 | print('Found %s unique tokens.' 
% len(word_index)) 132 | 133 | test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1) 134 | test_sequences_2 = tokenizer.texts_to_sequences(test_texts_2) 135 | 136 | data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH) 137 | data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH) 138 | labels = np.array(labels) 139 | print('Shape of data tensor:', data_1.shape) 140 | print('Shape of label tensor:', labels.shape) 141 | 142 | test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH) 143 | test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH) 144 | test_labels = np.array(test_labels) 145 | del test_sequences_1 146 | del test_sequences_2 147 | del sequences_1 148 | del sequences_2 149 | 150 | #embedding matrix 151 | print('Preparing embedding matrix...') 152 | nb_words = min(MAX_NB_WORDS, len(word_index)) 153 | 154 | embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM)) 155 | for word, i in word_index.items(): 156 | if i >= nb_words: 157 | continue 158 | embedding_vector = embeddings_index.get(word) 159 | if embedding_vector is not None: 160 | # words not found in embedding index will be all-zeros. 161 | embedding_matrix[i] = embedding_vector 162 | print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0)) 163 | 164 | #Multi-Input Architecture 165 | embedding_layer = Embedding(nb_words, 166 | EMBEDDING_DIM, 167 | weights=[embedding_matrix], 168 | input_length=MAX_SEQUENCE_LENGTH, 169 | trainable=False) 170 | 171 | sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 172 | embedded_sequences_1 = embedding_layer(sequence_1_input) 173 | x1 = Conv1D(128, 3, activation='relu')(embedded_sequences_1) 174 | x1 = MaxPooling1D(10)(x1) 175 | x1 = Flatten()(x1) 176 | x1 = Dense(64, activation='relu')(x1) 177 | x1 = Dropout(0.2)(x1) 178 | 179 | sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 180 | embedded_sequences_2 = embedding_layer(sequence_2_input) 181 | y1 = Conv1D(128, 3, activation='relu')(embedded_sequences_2) 182 | y1 = MaxPooling1D(10)(y1) 183 | y1 = Flatten()(y1) 184 | y1 = Dense(64, activation='relu')(y1) 185 | y1 = Dropout(0.2)(y1) 186 | 187 | merged = Concatenate()([x1, y1]) 188 | merged = BatchNormalization()(merged) 189 | merged = Dense(64, activation='relu')(merged) 190 | merged = Dropout(0.2)(merged) 191 | merged = BatchNormalization()(merged) 192 | preds = Dense(1, activation='sigmoid')(merged) 193 | 194 | model = Model(inputs=[sequence_1_input,sequence_2_input], outputs=preds) 195 | 196 | model.compile( 197 | loss=keras.losses.BinaryCrossentropy(), 198 | optimizer=tf.keras.optimizers.Adam(), 199 | metrics=["accuracy"] 200 | ) 201 | 202 | model.summary() 203 | 204 | #define callbacks 205 | file_name = SAVE_PATH + 'multi-input-weights-checkpoint.h5' 206 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 207 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 208 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 209 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 210 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 211 | 212 | hist = model.fit([data_1, data_2], labels, batch_size=1024, epochs=10, callbacks=callbacks_list, validation_split=VALIDATION_SPLIT) 213 | 214 | num_test = 100000 215 | preds = model.predict([test_data_1[:num_test,:], test_data_2[:num_test,:]]) 216 | 217 | quora_submission = 
pd.DataFrame({"test_id":test_labels[:num_test], "is_duplicate":preds.ravel()}) 218 | quora_submission.to_csv(SAVE_PATH + "quora_submission.csv", index=False) 219 | 220 | plt.figure() 221 | plt.plot(hist.history['loss'], c='b', lw=2.0, label='train') 222 | plt.plot(hist.history['val_loss'], c='r', lw=2.0, label='val') 223 | plt.title('Multi-Input model') 224 | plt.xlabel('Epochs') 225 | plt.ylabel('Cross-Entropy Loss') 226 | plt.legend(loc='upper right') 227 | plt.show() 228 | #plt.savefig('./figures/lstm_loss.png') 229 | 230 | plt.figure() 231 | plt.plot(hist.history['accuracy'], c='b', lw=2.0, label='train') 232 | plt.plot(hist.history['val_accuracy'], c='r', lw=2.0, label='val') 233 | plt.title('Multi-Input model') 234 | plt.xlabel('Epochs') 235 | plt.ylabel('Accuracy') 236 | plt.legend(loc='upper left') 237 | plt.show() 238 | #plt.savefig('./figures/lstm_acc.png') 239 | 240 | plt.figure() 241 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 242 | plt.title('Multi-Input model') 243 | plt.xlabel('Epochs') 244 | plt.ylabel('Learning Rate') 245 | plt.legend() 246 | plt.show() 247 | #plt.savefig('./figures/lstm_learning_rate.png') -------------------------------------------------------------------------------- /chp11/keras_mdn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | from keras.models import Model 8 | from keras.layers import concatenate, Input 9 | from keras.layers import Dense, Activation, Dropout, Flatten 10 | from keras.layers import BatchNormalization 11 | 12 | from keras import regularizers 13 | from keras import backend as K 14 | from keras.utils import np_utils 15 | 16 | from keras.callbacks import ModelCheckpoint 17 | from keras.callbacks import TensorBoard 18 | from keras.callbacks import LearningRateScheduler 19 | from keras.callbacks import EarlyStopping 20 | 21 | from sklearn.datasets import make_blobs 22 | from sklearn.metrics import adjusted_rand_score 23 | from sklearn.metrics import normalized_mutual_info_score 24 | from sklearn.model_selection import train_test_split 25 | 26 | import math 27 | import matplotlib.pyplot as plt 28 | import matplotlib.cm as cm 29 | 30 | tf.keras.utils.set_random_seed(42) 31 | 32 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 33 | 34 | def scheduler(epoch, lr): 35 | if epoch < 4: 36 | return lr 37 | else: 38 | return lr * tf.math.exp(-0.1) 39 | 40 | def generate_data(N): 41 | pi = np.array([0.2, 0.4, 0.3, 0.1]) 42 | mu = [[2,2], [-2,2], [-2,-2], [2,-2]] 43 | std = [[0.5,0.5], [1.0,1.0], [0.5,0.5], [1.0,1.0]] 44 | x = np.zeros((N,2), dtype=np.float32) 45 | y = np.zeros((N,2), dtype=np.float32) 46 | z = np.zeros((N,1), dtype=np.int32) 47 | for n in range(N): 48 | k = np.argmax(np.random.multinomial(1, pi)) 49 | x[n,:] = np.random.multivariate_normal(mu[k], np.diag(std[k])) 50 | y[n,:] = mu[k] 51 | z[n,:] = k 52 | #end for 53 | z = z.flatten() 54 | return x, y, z, pi, mu, std 55 | 56 | def tf_normal(y, mu, sigma): 57 | y_tile = K.stack([y]*num_clusters, axis=1) #[batch_size, K, D] 58 | result = y_tile - mu 59 | sigma_tile = K.stack([sigma]*data_dim, axis=-1) #[batch_size, K, D] 60 | result = result * 1.0/(sigma_tile+1e-8) 61 | result = -K.square(result)/2.0 62 | oneDivSqrtTwoPI = 1.0/math.sqrt(2*math.pi) 63 | result = K.exp(result) * (1.0/(sigma_tile + 1e-8))*oneDivSqrtTwoPI 64 | result = K.prod(result, axis=-1) #[batch_size, K] iid Gaussians 65 | return result 
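#
# Note (added): tf_normal above evaluates, for each mixture component k, a diagonal
# Gaussian density N(y; mu_k, sigma_k^2 I) = prod_d exp(-((y_d - mu_kd)/sigma_k)^2 / 2) / (sigma_k * sqrt(2*pi)).
# A minimal NumPy sketch of the same computation for a single sample and a single
# component (np_normal is illustrative only, not part of the original script) can be
# used to sanity-check NLLLoss below on small arrays:
#
#   def np_normal(y, mu, sigma):
#       d = (y - mu) / (sigma + 1e-8)          # y, mu: shape [D]; sigma: scalar spread
#       return np.prod(np.exp(-0.5 * d**2) / ((sigma + 1e-8) * np.sqrt(2.0 * np.pi)))
#
# NLLLoss then computes -log( sum_k pi_k * tf_normal(y, mu_k, sigma_k) ), i.e. the
# negative log-likelihood of the Gaussian mixture.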
66 | 67 | def NLLLoss(y_true, y_pred): 68 | out_mu = y_pred[:,:num_clusters*data_dim] 69 | out_sigma = y_pred[:,num_clusters*data_dim : num_clusters*(data_dim+1)] 70 | out_pi = y_pred[:,num_clusters*(data_dim+1):] 71 | 72 | out_mu = K.reshape(out_mu, [-1, num_clusters, data_dim]) 73 | 74 | result = tf_normal(y_true, out_mu, out_sigma) 75 | result = result * out_pi 76 | result = K.sum(result, axis=1, keepdims=True) 77 | result = -K.log(result + 1e-8) 78 | result = K.mean(result) 79 | return tf.maximum(result, 0) 80 | 81 | #generate data 82 | X_data, y_data, z_data, pi_true, mu_true, sigma_true = generate_data(4096) 83 | 84 | data_dim = X_data.shape[1] 85 | num_clusters = len(mu_true) 86 | 87 | num_train = 3500 88 | X_train, X_test, y_train, y_test = X_data[:num_train,:], X_data[num_train:,:], y_data[:num_train,:], y_data[num_train:,:] 89 | z_train, z_test = z_data[:num_train], z_data[num_train:] 90 | 91 | #visualize data 92 | plt.figure() 93 | plt.scatter(X_train[:,0], X_train[:,1], c=z_train, cmap=cm.bwr) 94 | plt.title('training data') 95 | plt.show() 96 | #plt.savefig(SAVE_PATH + '/mdn_training_data.png') 97 | 98 | #training params 99 | batch_size = 128 100 | num_epochs = 128 101 | 102 | #model parameters 103 | hidden_size = 32 104 | weight_decay = 1e-4 105 | 106 | #MDN architecture 107 | input_data = Input(shape=(data_dim,)) 108 | x = Dense(32, activation='relu')(input_data) 109 | x = Dropout(0.2)(x) 110 | x = BatchNormalization()(x) 111 | x = Dense(32, activation='relu')(x) 112 | x = Dropout(0.2)(x) 113 | x = BatchNormalization()(x) 114 | 115 | mu = Dense(num_clusters * data_dim, activation='linear')(x) #cluster means 116 | sigma = Dense(num_clusters, activation=K.exp)(x) #diagonal cov 117 | pi = Dense(num_clusters, activation='softmax')(x) #mixture proportions 118 | out = concatenate([mu, sigma, pi], axis=-1) 119 | 120 | model = Model(input_data, out) 121 | 122 | model.compile( 123 | loss=NLLLoss, 124 | optimizer=tf.keras.optimizers.Adam(), 125 | metrics=["accuracy"] 126 | ) 127 | 128 | model.summary() 129 | 130 | #define callbacks 131 | file_name = SAVE_PATH + 'mdn-weights-checkpoint.h5' 132 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 133 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 134 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 135 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 136 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 137 | 138 | hist = model.fit(X_train, y_train, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2, shuffle=True, verbose=2) 139 | 140 | y_pred = model.predict(X_test) 141 | 142 | mu_pred = y_pred[:,:num_clusters*data_dim] 143 | mu_pred = np.reshape(mu_pred, [-1, num_clusters, data_dim]) 144 | sigma_pred = y_pred[:,num_clusters*data_dim : num_clusters*(data_dim+1)] 145 | pi_pred = y_pred[:,num_clusters*(data_dim+1):] 146 | z_pred = np.argmax(pi_pred, axis=-1) 147 | 148 | rand_score = adjusted_rand_score(z_test, z_pred) 149 | print("adjusted rand score: ", rand_score) 150 | 151 | nmi_score = normalized_mutual_info_score(z_test, z_pred) 152 | print("normalized MI score: ", nmi_score) 153 | 154 | mu_pred_list = [] 155 | sigma_pred_list = [] 156 | for label in np.unique(z_pred): 157 | z_idx = np.where(z_pred == label)[0] 158 | mu_pred_lbl = np.mean(mu_pred[z_idx,label,:], axis=0) 159 | mu_pred_list.append(mu_pred_lbl) 160 | 161 | sigma_pred_lbl = 
np.mean(sigma_pred[z_idx,label], axis=0) 162 | sigma_pred_list.append(sigma_pred_lbl) 163 | #end for 164 | 165 | print("true means:") 166 | print(np.array(mu_true)) 167 | 168 | print("predicted means:") 169 | print(np.array(mu_pred_list)) 170 | 171 | print("true sigmas:") 172 | print(np.array(sigma_true)) 173 | 174 | print("predicted sigmas:") 175 | print(np.array(sigma_pred_list)) 176 | 177 | #generate plots 178 | plt.figure() 179 | plt.scatter(X_test[:,0], X_test[:,1], c=z_pred, cmap=cm.bwr) 180 | plt.scatter(np.array(mu_pred_list)[:,0], np.array(mu_pred_list)[:,1], s=100, marker='x', lw=4.0, color='k') 181 | plt.title('test data') 182 | #plt.savefig('./figures/mdn_test_data.png') 183 | 184 | plt.figure() 185 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 186 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 187 | plt.title('Mixture Density Network') 188 | plt.xlabel('Epochs') 189 | plt.ylabel('Negative Log Likelihood Loss') 190 | plt.legend(loc='upper left') 191 | #plt.savefig('./figures/mdn_loss.png') 192 | 193 | 194 | -------------------------------------------------------------------------------- /chp11/lstm_vae.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | import tensorflow_probability as tfp 7 | 8 | from keras.layers import Input, Dense, Lambda, Layer 9 | from keras.layers import LSTM, RepeatVector 10 | from keras.models import Model 11 | from keras import backend as K 12 | from keras import metrics 13 | from keras import optimizers 14 | 15 | import math 16 | import json 17 | from scipy.stats import norm 18 | from sklearn.model_selection import train_test_split 19 | from sklearn import preprocessing 20 | from sklearn.metrics import confusion_matrix 21 | from sklearn.preprocessing import StandardScaler 22 | 23 | from keras.callbacks import ModelCheckpoint 24 | from keras.callbacks import TensorBoard 25 | from keras.callbacks import LearningRateScheduler 26 | from keras.callbacks import EarlyStopping 27 | 28 | import matplotlib.pyplot as plt 29 | 30 | tf.keras.utils.set_random_seed(42) 31 | 32 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 33 | DATA_PATH = "/content/drive/MyDrive/data/" 34 | 35 | def scheduler(epoch, lr): 36 | if epoch < 4: 37 | return lr 38 | else: 39 | return lr * tf.math.exp(-0.1) 40 | 41 | nab_path = DATA_PATH + 'NAB/' 42 | nab_data_path = nab_path 43 | 44 | labels_filename = '/labels/combined_labels.json' 45 | train_file_name = 'artificialNoAnomaly/art_daily_no_noise.csv' 46 | test_file_name = 'artificialWithAnomaly/art_daily_jumpsup.csv' 47 | 48 | #train_file_name = 'realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv' 49 | #test_file_name = 'realAWSCloudwatch/rds_cpu_utilization_e47b3b.csv' 50 | 51 | labels_file = open(nab_path + labels_filename, 'r') 52 | labels = json.loads(labels_file.read()) 53 | labels_file.close() 54 | 55 | def load_data_frame_with_labels(file_name): 56 | data_frame = pd.read_csv(nab_data_path + file_name) 57 | data_frame['anomaly_label'] = data_frame['timestamp'].isin( 58 | labels[file_name]).astype(int) 59 | return data_frame 60 | 61 | train_data_frame = load_data_frame_with_labels(train_file_name) 62 | test_data_frame = load_data_frame_with_labels(test_file_name) 63 | 64 | plt.plot(train_data_frame.loc[0:3000,'value']) 65 | plt.plot(test_data_frame['value']) 66 | 67 | train_data_frame_final = train_data_frame.loc[0:3000,:] 68 | 
test_data_frame_final = test_data_frame 69 | 70 | data_scaler = StandardScaler() 71 | data_scaler.fit(train_data_frame_final[['value']].values) 72 | train_data = data_scaler.transform(train_data_frame_final[['value']].values) 73 | test_data = data_scaler.transform(test_data_frame_final[['value']].values) 74 | 75 | def create_dataset(dataset, look_back=64): 76 | dataX, dataY = [], [] 77 | for i in range(len(dataset)-look_back-1): 78 | dataX.append(dataset[i:(i+look_back),:]) 79 | dataY.append(dataset[i+look_back,:]) 80 | 81 | return np.array(dataX), np.array(dataY) 82 | 83 | X_data, y_data = create_dataset(train_data, look_back=64) #look_back = window_size 84 | X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.1, random_state=42) 85 | X_test, y_test = create_dataset(test_data, look_back=64) #look_back = window_size 86 | 87 | #training params 88 | batch_size = 256 89 | num_epochs = 32 90 | 91 | #model params 92 | timesteps = X_train.shape[1] 93 | input_dim = X_train.shape[-1] 94 | intermediate_dim = 16 95 | latent_dim = 2 96 | epsilon_std = 1.0 97 | 98 | #sampling layer 99 | class Sampling(Layer): 100 | def call(self, inputs): 101 | z_mean, z_log_var = inputs 102 | batch = tf.shape(z_mean)[0] 103 | dim = tf.shape(z_mean)[1] 104 | epsilon = tf.keras.backend.random_normal(shape=(batch, dim)) 105 | return z_mean + tf.exp(0.5 * z_log_var) * epsilon 106 | 107 | #likelihood layer 108 | class Likelihood(Layer): 109 | def call(self, inputs): 110 | x, x_decoded_mean, x_decoded_scale = inputs 111 | dist = tfp.distributions.MultivariateNormalDiag(x_decoded_mean, x_decoded_scale) 112 | likelihood = dist.log_prob(x) 113 | return likelihood 114 | 115 | #VAE architecture 116 | 117 | #encoder 118 | x = Input(shape=(timesteps, input_dim,)) 119 | h = LSTM(intermediate_dim)(x) 120 | 121 | z_mean = Dense(latent_dim)(h) 122 | z_log_sigma = Dense(latent_dim, activation='softplus')(h) 123 | 124 | #sampling 125 | z = Sampling()((z_mean, z_log_sigma)) 126 | 127 | #decoder 128 | decoder_h = LSTM(intermediate_dim, return_sequences=True) 129 | decoder_loc = LSTM(input_dim, return_sequences=True) 130 | decoder_scale = LSTM(input_dim, activation='softplus', return_sequences=True) 131 | 132 | h_decoded = RepeatVector(timesteps)(z) 133 | h_decoded = decoder_h(h_decoded) 134 | 135 | x_decoded_mean = decoder_loc(h_decoded) 136 | x_decoded_scale = decoder_scale(h_decoded) 137 | 138 | #log-likelihood 139 | llh = Likelihood()([x, x_decoded_mean, x_decoded_scale]) 140 | 141 | #define VAE model 142 | vae = Model(inputs=x, outputs=llh) 143 | 144 | # Add KL divergence regularization loss and likelihood loss 145 | kl_loss = - 0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma)) 146 | tot_loss = -K.mean(llh - kl_loss) 147 | vae.add_loss(tot_loss) 148 | 149 | # Loss and optimizer. 150 | loss_fn = tf.keras.losses.MeanSquaredError() 151 | optimizer = tf.keras.optimizers.Adam() 152 | 153 | @tf.function 154 | def training_step(x): 155 | with tf.GradientTape() as tape: 156 | reconstructed = vae(x) # Compute input reconstruction. 157 | # Compute loss. 158 | loss = 0 #loss_fn(x, reconstructed) 159 | loss += sum(vae.losses) 160 | # Update the weights of the VAE. 161 | grads = tape.gradient(loss, vae.trainable_weights) 162 | optimizer.apply_gradients(zip(grads, vae.trainable_weights)) 163 | return loss 164 | 165 | losses = [] # Keep track of the losses over time. 
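#
# Note (added): the custom loop below minimizes the loss attached via vae.add_loss(),
# i.e. the negative ELBO
#   -ELBO = KL( q(z|x) || N(0, I) ) - E_q[ log p(x|z) ],
# where the log-likelihood term comes from the Likelihood layer and the KL term from
# kl_loss above. Because the forward pass populates vae.losses, the per-batch objective
# can equivalently be read off as (illustrative sketch, assuming a batch tensor x_batch):
#
#   _ = vae(x_batch)                 # forward pass populates vae.losses
#   neg_elbo = tf.add_n(vae.losses)  # KL term minus mean log-likelihood
#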
166 | dataset = tf.data.Dataset.from_tensor_slices(X_train).batch(batch_size) 167 | for epoch in range(num_epochs): 168 | for step, x in enumerate(dataset): 169 | loss = training_step(x) 170 | losses.append(float(loss)) 171 | print("Epoch:", epoch, "Loss:", sum(losses) / len(losses)) 172 | 173 | plt.figure() 174 | plt.plot(losses, c='b', lw=2.0, label='train') 175 | plt.title('LSTM-VAE model') 176 | plt.xlabel('Epochs') 177 | plt.ylabel('Total Loss') 178 | plt.legend(loc='upper right') 179 | plt.show() 180 | #plt.savefig('./figures/lstm_loss.png') 181 | 182 | pred_test = vae.predict(X_test) 183 | 184 | plt.plot(pred_test[:,0]) 185 | 186 | is_anomaly = pred_test[:,0] < -1e1 187 | plt.figure() 188 | plt.plot(test_data, color='b') 189 | plt.figure() 190 | plt.plot(is_anomaly, color='r') 191 | -------------------------------------------------------------------------------- /chp11/spektral_gnn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | import networkx as nx 8 | from tensorflow.keras.utils import to_categorical 9 | from sklearn.preprocessing import LabelEncoder 10 | from sklearn.utils import shuffle 11 | from sklearn.metrics import classification_report 12 | from sklearn.model_selection import train_test_split 13 | 14 | from spektral.layers import GCNConv 15 | 16 | from tensorflow.keras.models import Model 17 | from tensorflow.keras.layers import Input, Dropout, Dense 18 | from tensorflow.keras import Sequential 19 | from tensorflow.keras.optimizers import Adam 20 | from tensorflow.keras.callbacks import TensorBoard, EarlyStopping 21 | from tensorflow.keras.regularizers import l2 22 | 23 | import os 24 | from collections import Counter 25 | from sklearn.manifold import TSNE 26 | import matplotlib.pyplot as plt 27 | 28 | tf.keras.utils.set_random_seed(42) 29 | 30 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 31 | DATA_PATH = "/content/drive/MyDrive/data/cora/" 32 | 33 | column_names = ["paper_id"] + [f"term_{idx}" for idx in range(1433)] + ["subject"] 34 | node_df = pd.read_csv(DATA_PATH + "cora.content", sep="\t", header=None, names=column_names) 35 | print("Node df shape:", node_df.shape) 36 | 37 | edge_df = pd.read_csv(DATA_PATH + "cora.cites", sep="\t", header=None, names=["target", "source"]) 38 | print("Edge df shape:", edge_df.shape) 39 | 40 | #parse node data 41 | nodes = node_df.iloc[:,0].tolist() 42 | labels = node_df.iloc[:,-1].tolist() 43 | X = node_df.iloc[:,1:-1].values 44 | 45 | X = np.array(X,dtype=int) 46 | N = X.shape[0] #the number of nodes 47 | F = X.shape[1] #the size of node features 48 | 49 | #parse edge data 50 | edge_list = [(x, y) for x, y in zip(edge_df['target'], edge_df['source'])] 51 | 52 | num_classes = len(set(labels)) 53 | 54 | print('Number of nodes:', N) 55 | print('Number of features of each node:', F) 56 | print('Labels:', set(labels)) 57 | print('Number of classes:', num_classes) 58 | 59 | def sample_data(labels, limit=20, val_num=500, test_num=1000): 60 | label_counter = dict((l, 0) for l in labels) 61 | train_idx = [] 62 | 63 | for i in range(len(labels)): 64 | label = labels[i] 65 | if label_counter[label]