├── .gitattributes ├── README.md ├── chp02 ├── binomial_tree.py ├── gibbs_gauss.py ├── imp_samp.py ├── mh_gauss2d.py ├── monte_carlo_pi.py └── random_walk.py ├── chp03 └── mean_field_mrf.py ├── chp04 ├── binary_search.py ├── binomial_coeffs.py ├── knapsack_greedy.py └── subset_gen.py ├── chp05 ├── cart.py ├── naive_bayes.py ├── perceptron.py ├── sgd_lr.py └── svm.py ├── chp06 ├── gp_reg.py ├── hierarchical_regression.py ├── knn_reg.py └── ridge_reg.py ├── chp07 ├── active_learning.py ├── adaboost_clf.py ├── bagging_clf.py ├── bayes_opt_sklearn.py ├── demo_logreg.py ├── hmm.py ├── page_rank.py ├── plot_smote_regular.py ├── plot_tomek_links.py └── stacked_clf.py ├── chp08 ├── dpmeans.py ├── gmm.py ├── manifold_learning.py └── pca.py ├── chp09 ├── ga.py ├── inv_cov.py ├── kde.py ├── lda.py ├── portfolio_opt.py └── sim_annealing.py ├── chp10 ├── image_search.py ├── keras_optimizers.py ├── lenet.py ├── lstm_sentiment.py ├── mlp.py └── multi_input_nn.py ├── chp11 ├── keras_mdn.py ├── lstm_vae.py ├── spektral_gnn.py └── transformer.py ├── data ├── NAB │ ├── artificialNoAnomaly │ │ └── art_daily_no_noise.csv │ ├── artificialWithAnomaly │ │ └── art_daily_jumpsup.csv │ └── labels │ │ └── combined_labels.json ├── cora │ ├── cora.cites │ └── cora.content └── radon.txt ├── figures ├── bayes.bmp └── meap.png └── requirements.txt /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Algorithms in Depth 2 | ML Algorithms in Depth: Bayesian Inference and Deep Learning 3 | 4 | **Chp02: Markov Chain Monte Carlo (MCMC)** 5 | - [Estimate Pi](./chp02/monte_carlo_pi.py): Monte Carlo estimate of Pi 6 | - [Binomial Tree Model](./chp02/binomial_tree.py): Monte Carlo simulation of binomial stock price 7 | - [Random Walk](./chp02/random_walk.py): self-avoiding random walk 8 | - [Gibbs Sampling](./chp02/gibbs_gauss.py): Gibbs sampling of multivariate Gaussian distribution 9 | - [Metropolis-Hastings Sampling](./chp02/mh_gauss2d.py): Metropolis-Hastings sampling of multivariate Gaussian mixture 10 | - [Importance Sampling](./chp02/imp_samp.py): importance sampling for finding expected value of a function 11 | 12 | **Chp03: Variational Inference (VI)** 13 | - [Mean Field VI](./chp03/mean_field_mrf.py): image denoising in Ising model 14 | 15 | **Chp04: Software Implementation** 16 | - [Subset Generation](./chp04/subset_gen.py): a complete search algorithm 17 | - [Fractional Knapsack](./chp04/knapsack_greedy.py): a greedy algorithm 18 | - [Binary Search](./chp04/binary_search.py): a divide and conquer algorithm 19 | - [Binomial Coefficients](./chp04/binomial_coeffs.py): a dynamic programming algorithm 20 | 21 | **Chp05: Classification Algorithms** 22 | - [Perceptron](./chp05/perceptron.py): perceptron algorithm 23 | - [SVM](./chp05/svm.py): support vector machine 24 | - [SGD-LR](./chp05/sgd_lr.py): stochastic gradient descent logistic regression 25 | - [Naive Bayes](./chp05/naive_bayes.py): Bernoulli Naive Bayes algorithm 26 | - [CART](./chp05/cart.py): decision tree classification algorithm 27 | 28 | **Chp06: Regression Algorithms** 29 | - [KNN](./chp06/knn_reg.py): K-Nearest Neighbors regression 30 | - [BLR](./chp06/ridge_reg.py): Bayesian linear regression 31 | - [HBR](./chp06/hierarchical_regression.py): 
Hierarchical Bayesian regression 32 | - [GPR](./chp06/gp_reg.py): Gaussian Process regression 33 | 34 | **Chp07: Selected Supervised Learning Algorithms** 35 | - [Page Rank](./chp07/page_rank.py): Google PageRank algorithm 36 | - [HMM](./chp07/hmm.py): EM algorithm for Hidden Markov Models 37 | - Imbalanced Learning: [Tomek Links](./chp07/plot_tomek_links.py), [SMOTE](./chp07/plot_smote_regular.py) 38 | - Active Learning: [LR](./chp07/demo_logreg.py) 39 | - Bayesian Optimization: [BO](./chp07/bayes_opt_sklearn.py) 40 | - Ensemble Learning: [Bagging](./chp07/bagging_clf.py), [Boosting](./chp07/adaboost_clf.py), [Stacking](./chp07/stacked_clf.py) 41 | 42 | **Chp08: Unsupervised Learning Algorithms** 43 | - [DP-Means](./chp08/dpmeans.py): Dirichlet Process (DP) K-Means 44 | - [EM-GMM](./chp08/gmm.py): EM algorithm for Gaussian Mixture Models 45 | - [PCA](./chp08/pca.py): Principal Component Analysis 46 | - [t-SNE](./chp08/manifold_learning.py): t-SNE manifold learning 47 | 48 | **Chp09: Selected Unsupervised Learning Algorithms** 49 | - [LDA](./chp09/lda.py): Variational Inference for Latent Dirichlet Allocation 50 | - [KDE](./chp09/kde.py): Kernel Density Estimator 51 | - [TPO](./chp09/portfolio_opt.py): Tangent Portfolio Optimization 52 | - [ICE](./chp09/inv_cov.py): Inverse Covariance Estimation 53 | - [SA](./chp09/sim_annealing.py): Simulated Annealing 54 | - [GA](./chp09/ga.py): Genetic Algorithm 55 | 56 | **Chp10: Fundamental Deep Learning Algorithms** 57 | - [MLP](./chp10/mlp.py): Multi-Layer Perceptron 58 | - [LeNet](./chp10/lenet.py): LeNet for MNIST digit classification 59 | - [ResNet](./chp10/image_search.py): ResNet50 image search on CalTech101 dataset 60 | - [LSTM](./chp10/lstm_sentiment.py): LSTM sentiment classification of IMDB movie dataset 61 | - [MINN](./chp10/multi_input_nn.py): Multi-Input Neural Net model for sequence similarity on the Quora question pairs dataset 62 | - [OPT](./chp10/keras_optimizers.py): Neural Net Optimizers 63 | 64 | **Chp11: Advanced Deep Learning Algorithms** 65 | - [LSTM-VAE](./chp11/lstm_vae.py): LSTM Variational Autoencoder for time-series anomaly detection 66 | - [MDN](./chp11/keras_mdn.py): Mixture Density Network 67 | - [Transformer](./chp11/transformer.py): Transformer for text classification 68 | - [GNN](./chp11/spektral_gnn.py): Graph Neural Network 69 | 70 | **Environment** 71 | 72 | To install the required libraries, run the following commands: 73 | 74 | ``` 75 | python3 -m venv ml-algo 76 | 77 | source ml-algo/bin/activate # in Linux 78 | .\ml-algo\Scripts\activate.bat # in Windows CMD 79 | .\ml-algo\Scripts\Activate.ps1 # in Windows PowerShell 80 | 81 | pip install -r requirements.txt 82 | ``` 83 | 84 | **Manning Early Access Program (MEAP)** 85 | 86 | This book is now available in the Manning Early Access Program (MEAP). 87 | Link to book: https://www.manning.com/books/machine-learning-algorithms-in-depth 88 | 89 |
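**Quick Start**

Once the environment above is activated and the requirements are installed, most chapter scripts are self-contained examples with their own `__main__` entry point and can be run directly from the repository root. A few illustrative invocations (most scripts display matplotlib figures when run):

```
python3 chp02/monte_carlo_pi.py   # Monte Carlo estimate of Pi
python3 chp05/perceptron.py       # perceptron classifier on the Iris dataset
python3 chp06/gp_reg.py           # Gaussian Process regression with a squared-exponential kernel
```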
90 | 91 |
92 | 93 | It will help you develop mathematical intuition for classic and modern ML algorithms, learn the fundamentals of Bayesian inference and deep learning, as well as data structures and algorithmic paradigms in ML! 94 | 95 | **Citation** 96 | 97 | You are welcome to cite the book as follows: 98 | 99 | ``` 100 | @book{MLAlgoInDepth, 101 | author = {Vadim Smolyakov}, 102 | title = {Machine Learning Algorithms in Depth}, 103 | year = {2023}, 104 | isbn = {9781633439214}, 105 | publisher = {Manning Publications} 106 | } 107 | ``` 108 | -------------------------------------------------------------------------------- /chp02/binomial_tree.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import seaborn as sns 4 | import matplotlib.pyplot as plt 5 | 6 | np.random.seed(42) 7 | 8 | def binomial_tree(mu, sigma, S0, N, T, step): 9 | 10 | #compute state price and probability 11 | u = np.exp(sigma * np.sqrt(step)) #up state price 12 | d = 1.0/u #down state price 13 | p = 0.5+0.5*(mu/sigma)*np.sqrt(step) #prob of up state 14 | 15 | #binomial tree simulation 16 | up_times = np.zeros((N, len(T))) 17 | down_times = np.zeros((N, len(T))) 18 | for idx in range(len(T)): 19 | up_times[:,idx] = np.random.binomial(T[idx]/step, p, N) 20 | down_times[:,idx] = T[idx]/step - up_times[:,idx] 21 | 22 | #compute terminal price 23 | ST = S0 * u**up_times * d**down_times 24 | 25 | #generate plots 26 | plt.figure() 27 | plt.plot(ST[:,0], color='b', alpha=0.5, label='1 month horizon') 28 | plt.plot(ST[:,1], color='r', alpha=0.5, label='1 year horizon') 29 | plt.xlabel('time step, day') 30 | plt.ylabel('price') 31 | plt.title('Binomial-Tree Stock Simulation') 32 | plt.legend() 33 | plt.show() 34 | 35 | plt.figure() 36 | plt.hist(ST[:,0], color='b', alpha=0.5, label='1 month horizon') 37 | plt.hist(ST[:,1], color='r', alpha=0.5, label='1 year horizon') 38 | plt.xlabel('price') 39 | plt.ylabel('count') 40 | plt.title('Binomial-Tree Stock Simulation') 41 | plt.legend() 42 | plt.show() 43 | 44 | 45 | if __name__ == "__main__": 46 | 47 | #model parameters 48 | mu = 0.1 #mean 49 | sigma = 0.15 #volatility 50 | S0 = 1 #starting price 51 | 52 | N = 10000 #number of simulations 53 | T = [21.0/252, 1.0] #time horizon in years 54 | step = 1.0/252 #time step in years 55 | 56 | binomial_tree(mu, sigma, S0, N, T, step) 57 | 58 | -------------------------------------------------------------------------------- /chp02/gibbs_gauss.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | import itertools 5 | from numpy.linalg import inv 6 | from scipy.stats import multivariate_normal 7 | 8 | np.random.seed(42) 9 | 10 | class gibbs_gauss: 11 | 12 | def gauss_conditional(self, mu, Sigma, setA, x): 13 | #computes P(X_A | X_B = x) = N(mu_{A|B}, Sigma_{A|B}) 14 | dim = len(mu) 15 | setU = set(range(dim)) 16 | setB = setU.difference(setA) 17 | muA = np.array([mu[item] for item in setA]).reshape(-1,1) 18 | muB = np.array([mu[item] for item in setB]).reshape(-1,1) 19 | xB = np.array([x[item] for item in setB]).reshape(-1,1) 20 | 21 | Sigma_AA = [] 22 | for (idx1, idx2) in itertools.product(setA, setA): 23 | Sigma_AA.append(Sigma[idx1][idx2]) 24 | Sigma_AA = np.array(Sigma_AA).reshape(len(setA),len(setA)) 25 | 26 | Sigma_AB = [] 27 | for (idx1, idx2) in itertools.product(setA, setB): 28 | Sigma_AB.append(Sigma[idx1][idx2]) 29 | Sigma_AB = np.array(Sigma_AB).reshape(len(setA),len(setB)) 30 
| 31 | Sigma_BB = [] 32 | for (idx1, idx2) in itertools.product(setB, setB): 33 | Sigma_BB.append(Sigma[idx1][idx2]) 34 | Sigma_BB = np.array(Sigma_BB).reshape(len(setB),len(setB)) 35 | 36 | Sigma_BB_inv = inv(Sigma_BB) 37 | mu_AgivenB = muA + np.matmul(np.matmul(Sigma_AB, Sigma_BB_inv), xB - muB) 38 | Sigma_AgivenB = Sigma_AA - np.matmul(np.matmul(Sigma_AB, Sigma_BB_inv), np.transpose(Sigma_AB)) 39 | 40 | return mu_AgivenB, Sigma_AgivenB 41 | 42 | def sample(self, mu, Sigma, xinit, num_samples): 43 | dim = len(mu) 44 | samples = np.zeros((num_samples, dim)) 45 | x = xinit 46 | for s in range(num_samples): 47 | for d in range(dim): 48 | mu_AgivenB, Sigma_AgivenB = self.gauss_conditional(mu, Sigma, set([d]), x) 49 | x[d] = np.random.normal(mu_AgivenB, np.sqrt(Sigma_AgivenB)) 50 | #end for 51 | samples[s,:] = np.transpose(x) 52 | #end for 53 | return samples 54 | 55 | if __name__ == "__main__": 56 | 57 | num_samples = 2000 58 | mu = [1, 1] 59 | Sigma = [[2,1], [1,1]] 60 | xinit = np.random.rand(len(mu),1) 61 | num_burnin = 1000 62 | 63 | gg = gibbs_gauss() 64 | gibbs_samples = gg.sample(mu, Sigma, xinit, num_samples) 65 | 66 | scipy_samples = multivariate_normal.rvs(mean=mu, cov=Sigma, size=num_samples, random_state=42) 67 | 68 | plt.figure() 69 | plt.scatter(gibbs_samples[num_burnin:,0], gibbs_samples[num_burnin:,1], c = 'blue', marker='s', alpha=0.8, label='Gibbs Samples') 70 | plt.scatter(scipy_samples[num_burnin:,0], scipy_samples[num_burnin:,1], c = 'red', alpha=0.8, label='Ground Truth Samples') 71 | plt.grid(True); plt.legend(); plt.xlim([-4,5]) 72 | plt.title("Gibbs Sampling of Multivariate Gaussian"); plt.xlabel("X1"); plt.ylabel("X2") 73 | #plt.savefig("./figures/gibbs_gauss.png") 74 | plt.show() 75 | 76 | 77 | -------------------------------------------------------------------------------- /chp02/imp_samp.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from scipy.integrate import quad 5 | from scipy.stats import multivariate_normal 6 | 7 | np.random.seed(42) 8 | 9 | class importance_sampler: 10 | # E[f(x)] = int_x f(x)p(x)dx = int_x f(x)[p(x)/q(x)]q(x) dx 11 | # = sum_i f(x_i)w(x_i), where x_i ~ q(x) 12 | # e.g. for f(x) = 1(x \in A), E[f(x)] = P(A) 13 | 14 | def __init__(self, k=1.5, mu=0.8, sigma=np.sqrt(1.5), c=3): 15 | #target params p(x) 16 | self.k = k 17 | 18 | #proposal params q(x) 19 | self.mu = mu 20 | self.sigma = sigma 21 | self.c = c #fix c, s.t. 
p(x) < c q(x) 22 | 23 | def target_pdf(self, x): 24 | #p(x) ~ Chi(k=1.5) 25 | return (x**(self.k-1)) * np.exp(-x**2/2.0) 26 | 27 | def proposal_pdf(self, x): 28 | #q(x) ~ N(mu,sigma) 29 | return self.c * 1.0/np.sqrt(2*np.pi*1.5) * np.exp(-(x-self.mu)**2/(2*self.sigma**2)) 30 | 31 | def fx(self, x): 32 | #function of interest f(x), x >= 0 33 | return 2*np.sin((np.pi/1.5)*x) 34 | 35 | def sample(self, num_samples): 36 | #sample from the proposal 37 | x = multivariate_normal.rvs(self.mu, self.sigma, num_samples) 38 | 39 | #discard netgative samples (since f(x) is defined for x >= 0) 40 | idx = np.where(x >= 0) 41 | x_pos = x[idx] 42 | 43 | #compute importance weights 44 | isw = self.target_pdf(x_pos) / self.proposal_pdf(x_pos) 45 | 46 | #compute E[f(x)] = sum_i f(x_i)w(x_i), where x_i ~ q(x) 47 | fw = (isw/np.sum(isw))*self.fx(x_pos) 48 | f_est = np.sum(fw) 49 | 50 | return isw, f_est 51 | 52 | 53 | if __name__ == "__main__": 54 | 55 | num_samples = [10, 100, 1000, 10000, 100000, 1000000] 56 | 57 | F_est_iter, IS_weights_var_iter = [], [] 58 | for k in num_samples: 59 | IS = importance_sampler() 60 | IS_weights, F_est = IS.sample(k) 61 | IS_weights_var = np.var(IS_weights/np.sum(IS_weights)) 62 | F_est_iter.append(F_est) 63 | IS_weights_var_iter.append(IS_weights_var) 64 | 65 | #ground truth (numerical integration) 66 | k = 1.5 67 | I_gt, _ = quad(lambda x: 2.0*np.sin((np.pi/1.5)*x)*(x**(k-1))*np.exp(-x**2/2.0), 0, 5) 68 | 69 | #generate plots 70 | plt.figure() 71 | xx = np.linspace(0,8,100) 72 | plt.plot(xx, IS.target_pdf(xx), '-r', label='target pdf p(x)') 73 | plt.plot(xx, IS.proposal_pdf(xx), '--b', label='proposal pdf q(x)') 74 | plt.plot(xx, IS.fx(xx) * IS.target_pdf(xx), ':k', label='p(x)f(x) integrand') 75 | plt.grid(True); plt.legend(); plt.xlabel("X1"); plt.ylabel("X2") 76 | plt.title("Importance Sampling Components") 77 | #plt.savefig('./figures/importance_sampling.png') 78 | plt.show() 79 | 80 | plt.figure() 81 | plt.hist(IS_weights, label = "IS weights") 82 | plt.grid(True); plt.legend(); 83 | plt.title("Importance Weights Histogram") 84 | #plt.savefig('./figures/importance_weights.png') 85 | plt.show() 86 | 87 | plt.figure() 88 | plt.semilogx(num_samples, F_est_iter, '-b', label = "IS Estimate of E[f(x)]") 89 | plt.semilogx(num_samples, I_gt*np.ones(len(num_samples)), '--r', label = "Ground Truth") 90 | plt.grid(True); plt.legend(); plt.xlabel('iterations'); plt.ylabel("E[f(x)] estimate") 91 | plt.title("IS Estimate of E[f(x)]") 92 | #plt.savefig('./figures/importance_estimate.png') 93 | plt.show() 94 | -------------------------------------------------------------------------------- /chp02/mh_gauss2d.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from scipy.stats import uniform 5 | from scipy.stats import multivariate_normal 6 | 7 | np.random.seed(42) 8 | 9 | class mh_gauss: 10 | def __init__(self, dim, K, num_samples, target_mu, target_sigma, target_pi, proposal_mu, proposal_sigma): 11 | #target params: p(x) = \sum_k pi(k) N(x; mu_k,Sigma_k) 12 | self.dim = dim 13 | self.K = K 14 | self.num_samples = num_samples 15 | self.target_mu = target_mu 16 | self.target_sigma = target_sigma 17 | self.target_pi = target_pi 18 | 19 | #proposal params: q(x) = N(x; mu, Sigma) 20 | self.proposal_mu = proposal_mu 21 | self.proposal_sigma = proposal_sigma 22 | 23 | #sample chain params 24 | self.n_accept = 0 25 | self.alpha = np.zeros(self.num_samples) 26 | self.mh_samples = 
np.zeros((self.num_samples, self.dim)) 27 | 28 | def target_pdf(self, x): 29 | #p(x) = \sum_k pi(k) N(x; mu_k,Sigma_k) 30 | prob = 0 31 | for k in range(self.K): 32 | prob += self.target_pi[k]*\ 33 | multivariate_normal.pdf(x,self.target_mu[:,k],self.target_sigma[:,:,k]) 34 | #end for 35 | return prob 36 | 37 | def proposal_pdf(self, x, mu): 38 | #q(x) = N(x; mu, Sigma) 39 | return multivariate_normal.pdf(x, mu, self.proposal_sigma) 40 | 41 | def sample(self): 42 | #draw init sample from proposal 43 | #import pdb; pdb.set_trace() 44 | x_init = multivariate_normal.rvs(self.proposal_mu, self.proposal_sigma, 1) 45 | self.mh_samples[0,:] = x_init 46 | 47 | for i in range(self.num_samples-1): 48 | x_curr = self.mh_samples[i,:] 49 | x_new = multivariate_normal.rvs(x_curr, self.proposal_sigma, 1) 50 | 51 | #MH ratio 52 | self.alpha[i] = self.proposal_pdf(x_curr, x_new) / self.proposal_pdf(x_new, x_curr) #q(x|x')/q(x'|x) 53 | self.alpha[i] = self.alpha[i] * (self.target_pdf(x_new)/self.target_pdf(x_curr)) #alpha x p(x')/p(x) 54 | 55 | #MH acceptance probability 56 | r = min(1, self.alpha[i]) 57 | u = uniform.rvs(loc=0, scale=1, size=1) 58 | if (u <= r): 59 | self.n_accept += 1 60 | self.mh_samples[i+1,:] = x_new #accept 61 | else: 62 | self.mh_samples[i+1,:] = x_curr #reject 63 | #end for 64 | print("MH acceptance ratio: ", self.n_accept/float(self.num_samples)) 65 | 66 | if __name__ == "__main__": 67 | 68 | dim = 2 69 | K = 2 70 | num_samples = 5000 71 | target_mu = np.zeros((dim,K)) 72 | target_mu[:,0] = [4,0] 73 | target_mu[:,1] = [-4,0] 74 | target_sigma = np.zeros((dim, dim, K)) 75 | target_sigma[:,:,0] = [[2,1],[1,1]] 76 | target_sigma[:,:,1] = [[1,0],[0,1]] 77 | target_pi = np.array([0.4, 0.6]) 78 | 79 | proposal_mu = np.zeros((dim,1)).flatten() 80 | proposal_sigma = 10*np.eye(dim) 81 | 82 | mhg = mh_gauss(dim, K, num_samples, target_mu, target_sigma, target_pi, proposal_mu, proposal_sigma) 83 | mhg.sample() 84 | 85 | plt.figure() 86 | plt.scatter(mhg.mh_samples[:,0], mhg.mh_samples[:,1], label='MH samples') 87 | plt.grid(True); plt.legend() 88 | plt.title("Metropolis-Hastings Sampling of 2D Gaussian Mixture") 89 | plt.xlabel("X1"); plt.ylabel("X2") 90 | #plt.savefig("./figures/mh_gauss2d.png") 91 | plt.show() -------------------------------------------------------------------------------- /chp02/monte_carlo_pi.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(42) 5 | 6 | def pi_est(radius=1, num_iter=int(1e4)): 7 | 8 | X = np.random.uniform(-radius,+radius,num_iter) 9 | Y = np.random.uniform(-radius,+radius,num_iter) 10 | 11 | R2 = X**2 + Y**2 12 | inside = R2 < radius**2 13 | outside = ~inside 14 | 15 | samples = (2*radius)*(2*radius)*inside 16 | 17 | I_hat = np.mean(samples) 18 | pi_hat = I_hat/radius ** 2 19 | pi_hat_se = np.std(samples)/np.sqrt(num_iter) 20 | print("pi est: {} +/- {:f}".format(pi_hat, pi_hat_se)) 21 | 22 | plt.figure() 23 | plt.scatter(X[inside],Y[inside], c='b', alpha=0.5) 24 | plt.scatter(X[outside],Y[outside], c='r', alpha=0.5) 25 | plt.show() 26 | 27 | if __name__ == "__main__": 28 | 29 | pi_est() 30 | 31 | -------------------------------------------------------------------------------- /chp02/random_walk.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | np.random.seed(42) 6 | 7 | def rand_walk(num_step, num_iter, moves): 8 
| 9 | #random walk stats 10 | square_dist = np.zeros(num_iter) 11 | weights = np.zeros(num_iter) 12 | 13 | for it in range(num_iter): 14 | 15 | trial = 0 16 | i = 1 17 | 18 | #iterate until we have a non-crossing random walk 19 | while i != num_step-1: 20 | 21 | #init 22 | X, Y = 0, 0 23 | weight = 1 24 | lattice = np.zeros((2*num_step+1, 2*num_step+1)) 25 | lattice[num_step+1,num_step+1] = 1 26 | path = np.array([0, 0]) 27 | xx = num_step + 1 + X 28 | yy = num_step + 1 + Y 29 | 30 | print("iter: %d, trial %d" %(it, trial)) 31 | 32 | for i in range(num_step): 33 | 34 | up = lattice[xx,yy+1] 35 | down = lattice[xx,yy-1] 36 | left = lattice[xx-1,yy] 37 | right = lattice[xx+1,yy] 38 | 39 | #compute available directions 40 | neighbors = np.array([1, 1, 1, 1]) - np.array([up, down, left, right]) 41 | 42 | #avoid self-loops 43 | if (np.sum(neighbors) == 0): 44 | i = 1 45 | break 46 | #end if 47 | 48 | #compute importance weights: d0 x d1 x ... x d_{n-1} 49 | weight = weight * np.sum(neighbors) 50 | 51 | #sample a move direction 52 | direction = np.where(np.random.rand() < np.cumsum(neighbors/float(sum(neighbors)))) 53 | 54 | X = X + moves[direction[0][0],0] 55 | Y = Y + moves[direction[0][0],1] 56 | 57 | #store sampled path 58 | path_new = np.array([X,Y]) 59 | path = np.vstack((path,path_new)) 60 | 61 | #update grid coordinates 62 | xx = num_step + 1 + X 63 | yy = num_step + 1 + Y 64 | lattice[xx,yy] = 1 65 | #end for 66 | 67 | trial = trial + 1 68 | #end while 69 | 70 | #compute square extension 71 | square_dist[it] = X**2 + Y**2 72 | 73 | #store importance weights 74 | weights[it] = weight 75 | #end for 76 | 77 | #compute mean square extension 78 | mean_square_dist = np.mean(weights * square_dist)/np.mean(weights) 79 | print("mean square dist: ", mean_square_dist) 80 | 81 | #generate plots 82 | plt.figure() 83 | for i in range(num_step-1): 84 | plt.plot(path[i,0], path[i,1], path[i+1,0], path[i+1,1], 'ob') 85 | plt.title('random walk with no overlaps') 86 | plt.xlabel('X') 87 | plt.ylabel('Y') 88 | plt.show() 89 | 90 | plt.figure() 91 | sns.displot(square_dist) 92 | plt.xlim(0,np.max(square_dist)) 93 | plt.title('square distance of the random walk') 94 | plt.xlabel('square distance (X^2 + Y^2)') 95 | plt.show() 96 | 97 | 98 | if __name__ == "__main__": 99 | 100 | num_step = 150 #number of steps in a random walk 101 | num_iter = 100 #number of iterations for averaging results 102 | moves = np.array([[0, 1],[0, -1],[-1, 0],[1, 0]]) #2-D moves 103 | 104 | rand_walk(num_step, num_iter, moves) -------------------------------------------------------------------------------- /chp03/mean_field_mrf.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from scipy.special import expit as sigmoid 10 | from scipy.stats import multivariate_normal 11 | 12 | np.random.seed(42) 13 | sns.set_style('whitegrid') 14 | 15 | class image_denoising: 16 | 17 | def __init__(self, img_binary, sigma=2, J=1): 18 | 19 | #mean-field parameters 20 | self.sigma = sigma #noise level 21 | self.y = img_binary + self.sigma*np.random.randn(M, N) #y_i ~ N(x_i; sigma^2); 22 | self.J = J #coupling strength (w_ij) 23 | self.rate = 0.5 #update smoothing rate 24 | self.max_iter = 15 25 | self.ELBO = np.zeros(self.max_iter) 26 | self.Hx_mean = np.zeros(self.max_iter) 27 | 28 | def mean_field(self): 29 | 30 | #Mean-Field VI 31 | 
print("running mean-field variational inference...") 32 | logodds = multivariate_normal.logpdf(self.y.flatten(), mean=+1, cov=self.sigma**2) - \ 33 | multivariate_normal.logpdf(self.y.flatten(), mean=-1, cov=self.sigma**2) 34 | logodds = np.reshape(logodds, (M, N)) 35 | 36 | #init 37 | p1 = sigmoid(logodds) 38 | mu = 2*p1-1 #mu_init 39 | 40 | a = mu + 0.5 * logodds 41 | qxp1 = sigmoid(+2*a) #q_i(x_i=+1) 42 | qxm1 = sigmoid(-2*a) #q_i(x_i=-1) 43 | 44 | logp1 = np.reshape(multivariate_normal.logpdf(self.y.flatten(), mean=+1, cov=self.sigma**2), (M, N)) 45 | logm1 = np.reshape(multivariate_normal.logpdf(self.y.flatten(), mean=-1, cov=self.sigma**2), (M, N)) 46 | 47 | for i in tqdm(range(self.max_iter)): 48 | muNew = mu 49 | for ix in range(N): 50 | for iy in range(M): 51 | pos = iy + M*ix 52 | neighborhood = pos + np.array([-1,1,-M,M]) 53 | boundary_idx = [iy!=0,iy!=M-1,ix!=0,ix!=N-1] 54 | neighborhood = neighborhood[np.where(boundary_idx)[0]] 55 | xx, yy = np.unravel_index(pos, (M,N), order='F') 56 | nx, ny = np.unravel_index(neighborhood, (M,N), order='F') 57 | 58 | Sbar = self.J*np.sum(mu[nx,ny]) 59 | muNew[xx,yy] = (1-self.rate)*muNew[xx,yy] + self.rate*np.tanh(Sbar + 0.5*logodds[xx,yy]) 60 | self.ELBO[i] = self.ELBO[i] + 0.5*(Sbar * muNew[xx,yy]) 61 | #end for 62 | #end for 63 | mu = muNew 64 | 65 | a = mu + 0.5 * logodds 66 | qxp1 = sigmoid(+2*a) #q_i(x_i=+1) 67 | qxm1 = sigmoid(-2*a) #q_i(x_i=-1) 68 | Hx = -qxm1*np.log(qxm1+1e-10) - qxp1*np.log(qxp1+1e-10) #entropy 69 | 70 | self.ELBO[i] = self.ELBO[i] + np.sum(qxp1*logp1 + qxm1*logm1) + np.sum(Hx) 71 | self.Hx_mean[i] = np.mean(Hx) 72 | #end for 73 | return mu 74 | 75 | if __name__ == "__main__": 76 | 77 | #load data 78 | print("loading data...") 79 | data = Image.open('./figures/bayes.bmp') 80 | img = np.double(data) 81 | img_mean = np.mean(img) 82 | img_binary = +1*(img>img_mean) + -1*(img x: 10 | return binary_search(arr, l, mid-1, x) 11 | else: 12 | return binary_search(arr, mid+1, r, x) 13 | #end if 14 | else: 15 | return -1 16 | 17 | if __name__ == "__main__": 18 | 19 | x = 5 20 | arr = sorted([1, 7, 8, 3, 2, 5]) 21 | 22 | print(arr) 23 | print("binary search:") 24 | result = binary_search(arr, 0, len(arr)-1, x) 25 | 26 | if result != -1: 27 | print("element {} is found at index {}.".format(x, result)) 28 | else: 29 | print("element is not found.") 30 | -------------------------------------------------------------------------------- /chp04/binomial_coeffs.py: -------------------------------------------------------------------------------- 1 | 2 | def binomial_coeffs1(n, k): 3 | #top down DP 4 | if (k == 0 or k == n): 5 | return 1 6 | if (memo[n][k] != -1): 7 | return memo[n][k] 8 | 9 | memo[n][k] = binomial_coeffs1(n-1, k-1) + binomial_coeffs1(n-1, k) 10 | return memo[n][k] 11 | 12 | def binomial_coeffs2(n, k): 13 | #bottom up DP 14 | for i in range(n+1): 15 | for j in range(min(i,k)+1): 16 | if (j == 0 or j == i): 17 | memo[i][j] = 1 18 | else: 19 | memo[i][j] = memo[i-1][j-1] + memo[i-1][j] 20 | #end if 21 | #end for 22 | #end for 23 | return memo[n][k] 24 | 25 | def print_array(memo): 26 | for i in range(len(memo)): 27 | print('\t'.join([str(x) for x in memo[i]])) 28 | 29 | 30 | if __name__ == "__main__": 31 | 32 | n = 5 33 | k = 2 34 | 35 | print("top down DP") 36 | memo = [[-1 for i in range(6)] for j in range(6)] 37 | nCk = binomial_coeffs1(n, k) 38 | print_array(memo) 39 | print("C(n={}, k={}) = {}".format(n,k,nCk)) 40 | 41 | print("bottom up DP") 42 | memo = [[-1 for i in range(6)] for j in range(6)] 43 | nCk = 
binomial_coeffs2(n, k) 44 | print_array(memo) 45 | print("C(n={}, k={}) = {}".format(n,k,nCk)) 46 | 47 | 48 | -------------------------------------------------------------------------------- /chp04/knapsack_greedy.py: -------------------------------------------------------------------------------- 1 | class Item: 2 | def __init__(self, wt, val, ind): 3 | self.wt = wt 4 | self.val = val 5 | self.ind = ind 6 | self.cost = val // wt 7 | 8 | def __lt__(self, other): 9 | return self.cost < other.cost 10 | 11 | class FractionalKnapSack: 12 | def get_max_value(self, wt, val, capacity): 13 | 14 | item_list = [] 15 | for i in range(len(wt)): 16 | item_list.append(Item(wt[i], val[i], i)) 17 | 18 | # sorting items by cost heuristic 19 | item_list.sort(reverse = True) #O(nlogn) 20 | 21 | total_value = 0 22 | for i in item_list: 23 | cur_wt = int(i.wt) 24 | cur_val = int(i.val) 25 | if capacity - cur_wt >= 0: 26 | capacity -= cur_wt 27 | total_value += cur_val 28 | else: 29 | fraction = capacity / cur_wt 30 | total_value += cur_val * fraction 31 | capacity = int(capacity - (cur_wt * fraction)) 32 | break 33 | return total_value 34 | 35 | if __name__ == "__main__": 36 | wt = [10, 20, 30] 37 | val = [60, 100, 120] 38 | capacity = 50 39 | 40 | fk = FractionalKnapSack() 41 | max_value = fk.get_max_value(wt, val, capacity) 42 | print("greedy fractional knapsack") 43 | print("maximum value: ", max_value) 44 | 45 | -------------------------------------------------------------------------------- /chp04/subset_gen.py: -------------------------------------------------------------------------------- 1 | def search(k, n): 2 | if (k == n): 3 | #process subset 4 | print(subset) 5 | else: 6 | search(k+1, n) 7 | subset.append(k) 8 | search(k+1, n) 9 | subset.pop() 10 | #end if 11 | 12 | def bitseq(n): 13 | for b in range(1 << n): 14 | subset = [] 15 | for i in range(n): 16 | if (b & 1 << i): 17 | subset.append(i) 18 | #end for 19 | print(subset) 20 | #end for 21 | 22 | if __name__ == "__main__": 23 | n = 4 24 | subset = [] 25 | search(0, n) #recursive 26 | 27 | #subset = [] 28 | #bitseq(n) #iterative 29 | -------------------------------------------------------------------------------- /chp05/cart.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.datasets import load_iris 5 | from sklearn.metrics import accuracy_score 6 | from sklearn.model_selection import train_test_split 7 | 8 | class TreeNode(): 9 | def __init__(self, gini, num_samples, num_samples_class, class_label): 10 | self.gini = gini #gini cost 11 | self.num_samples = num_samples #size of node 12 | self.num_samples_class = num_samples_class #number of node pts with label k 13 | self.class_label = class_label #predicted class label 14 | self.feature_idx = 0 #idx of feature to split on 15 | self.treshold = 0 #best threshold to split on 16 | self.left = None #left subtree pointer 17 | self.right = None #right subtree pointer 18 | 19 | class DecisionTreeClassifier(): 20 | def __init__(self, max_depth = None): 21 | self.max_depth = max_depth 22 | 23 | def best_split(self, X_train, y_train): 24 | m = y_train.size 25 | if (m <= 1): 26 | return None, None 27 | 28 | #number of points of class k 29 | mk = [np.sum(y_train == k) for k in range(self.num_classes)] 30 | 31 | #gini of current node 32 | best_gini = 1.0 - sum((n / m) ** 2 for n in mk) 33 | best_idx, best_thr = None, None 34 | 35 | #iterate over all features 36 | for idx in 
range(self.num_features): 37 | # sort data along selected feature 38 | thresholds, classes = zip(*sorted(zip(X[:, idx], y))) 39 | 40 | num_left = [0]*self.num_classes 41 | num_right = mk.copy() 42 | 43 | #iterate overall possible split positions 44 | for i in range(1, m): 45 | 46 | k = classes[i-1] 47 | 48 | num_left[k] += 1 49 | num_right[k] -= 1 50 | 51 | gini_left = 1.0 - sum( 52 | (num_left[x] / i) ** 2 for x in range(self.num_classes) 53 | ) 54 | 55 | gini_right = 1.0 - sum( 56 | (num_right[x] / (m - i)) ** 2 for x in range(self.num_classes) 57 | ) 58 | 59 | gini = (i * gini_left + (m - i) * gini_right) / m 60 | 61 | # check that we don't try to split two pts with identical values 62 | if thresholds[i] == thresholds[i - 1]: 63 | continue 64 | 65 | if (gini < best_gini): 66 | best_gini = gini 67 | best_idx = idx 68 | best_thr = (thresholds[i] + thresholds[i - 1]) / 2 # midpoint 69 | #end if 70 | #end for 71 | #end for 72 | return best_idx, best_thr 73 | 74 | def gini(self, y_train): 75 | m = y_train.size 76 | return 1.0 - sum((np.sum(y_train == k) / m) ** 2 for k in range(self.num_classes)) 77 | 78 | def fit(self, X_train, y_train): 79 | self.num_classes = len(set(y_train)) 80 | self.num_features = X_train.shape[1] 81 | self.tree = self.grow_tree(X_train, y_train) 82 | 83 | def grow_tree(self, X_train, y_train, depth=0): 84 | 85 | num_samples_class = [np.sum(y_train == k) for k in range(self.num_classes)] 86 | class_label = np.argmax(num_samples_class) 87 | 88 | node = TreeNode( 89 | gini=self.gini(y_train), 90 | num_samples=y_train.size, 91 | num_samples_class=num_samples_class, 92 | class_label=class_label, 93 | ) 94 | 95 | # split recursively until maximum depth is reached 96 | if depth < self.max_depth: 97 | idx, thr = self.best_split(X_train, y_train) 98 | if idx is not None: 99 | indices_left = X_train[:, idx] < thr 100 | X_left, y_left = X_train[indices_left], y_train[indices_left] 101 | X_right, y_right = X_train[~indices_left], y_train[~indices_left] 102 | node.feature_index = idx 103 | node.threshold = thr 104 | node.left = self.grow_tree(X_left, y_left, depth + 1) 105 | node.right = self.grow_tree(X_right, y_right, depth + 1) 106 | 107 | return node 108 | 109 | def predict(self, X_test): 110 | return [self.predict_helper(x_test) for x_test in X_test] 111 | 112 | def predict_helper(self, x_test): 113 | node = self.tree 114 | while node.left: 115 | if x_test[node.feature_index] < node.threshold: 116 | node = node.left 117 | else: 118 | node = node.right 119 | return node.class_label 120 | 121 | 122 | if __name__ == "__main__": 123 | 124 | #load data 125 | iris = load_iris() 126 | X = iris.data[:, [2,3]] 127 | y = iris.target 128 | 129 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 130 | 131 | print("decision tree classifier...") 132 | tree_clf = DecisionTreeClassifier(max_depth = 3) 133 | tree_clf.fit(X_train, y_train) 134 | 135 | print("prediction...") 136 | y_pred = tree_clf.predict(X_test) 137 | 138 | tree_clf_acc = accuracy_score(y_test, y_pred) 139 | print("test set accuracy: ", tree_clf_acc) 140 | -------------------------------------------------------------------------------- /chp05/naive_bayes.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from time import time 6 | from nltk.corpus import stopwords 7 | from nltk.tokenize import RegexpTokenizer 8 | 9 | from sklearn.metrics import accuracy_score 10 
| from sklearn.datasets import fetch_20newsgroups 11 | from sklearn.model_selection import train_test_split 12 | from sklearn.feature_extraction.text import CountVectorizer 13 | 14 | sns.set_style("whitegrid") 15 | tokenizer = RegexpTokenizer(r'\w+') 16 | stop_words = set(stopwords.words('english')) 17 | stop_words.update(['s','t','m','1','2']) 18 | 19 | class naive_bayes: 20 | def __init__(self, K, D): 21 | self.K = K #number of classes 22 | self.D = D #dictionary size 23 | 24 | self.pi = np.ones(K) #class priors 25 | self.theta = np.ones((self.D, self.K)) #bernoulli parameters 26 | 27 | def fit(self, X_train, y_train): 28 | 29 | num_docs = X_train.shape[0] 30 | for doc in range(num_docs): 31 | 32 | label = y_train[doc] 33 | self.pi[label] += 1 34 | 35 | for word in range(self.D): 36 | if (X_train[doc][word] > 0): 37 | self.theta[word][label] += 1 38 | #end if 39 | #end for 40 | #end for 41 | 42 | #normalize pi and theta 43 | self.pi = self.pi/np.sum(self.pi) 44 | self.theta = self.theta/np.sum(self.theta, axis=0) 45 | 46 | def predict(self, X_test): 47 | 48 | num_docs = X_test.shape[0] 49 | logp = np.zeros((num_docs,self.K)) 50 | for doc in range(num_docs): 51 | for kk in range(self.K): 52 | logp[doc][kk] = np.log(self.pi[kk]) 53 | for word in range(self.D): 54 | if (X_test[doc][word] > 0): 55 | logp[doc][kk] += np.log(self.theta[word][kk]) 56 | else: 57 | logp[doc][kk] += np.log(1-self.theta[word][kk]) 58 | #end if 59 | #end for 60 | #end for 61 | #end for 62 | return np.argmax(logp, axis=1) 63 | 64 | if __name__ == "__main__": 65 | 66 | import nltk 67 | nltk.download('stopwords') 68 | 69 | #load data 70 | print("loading 20 newsgroups dataset...") 71 | tic = time() 72 | classes = ['sci.space', 'comp.graphics', 'rec.autos', 'rec.sport.hockey'] 73 | dataset = fetch_20newsgroups(shuffle=True, random_state=0, remove=('headers','footers','quotes'), categories=classes) 74 | X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=0) 75 | toc = time() 76 | print("elapsed time: %.4f sec" %(toc - tic)) 77 | print("number of training docs: ", len(X_train)) 78 | print("number of test docs: ", len(X_test)) 79 | 80 | print("vectorizing input data...") 81 | cnt_vec = CountVectorizer(tokenizer=tokenizer.tokenize, analyzer='word', ngram_range=(1,1), max_df=0.8, min_df=2, max_features=1000, stop_words=stop_words) 82 | cnt_vec.fit(X_train) 83 | toc = time() 84 | print("elapsed time: %.2f sec" %(toc - tic)) 85 | vocab = cnt_vec.vocabulary_ 86 | idx2word = {val: key for (key, val) in vocab.items()} 87 | print("vocab size: ", len(vocab)) 88 | 89 | X_train_vec = cnt_vec.transform(X_train).toarray() 90 | X_test_vec = cnt_vec.transform(X_test).toarray() 91 | 92 | print("naive bayes model MLE inference...") 93 | K = len(set(y_train)) #number of classes 94 | D = len(vocab) #dictionary size 95 | nb_clf = naive_bayes(K, D) 96 | nb_clf.fit(X_train_vec, y_train) 97 | 98 | print("naive bayes prediction...") 99 | y_pred = nb_clf.predict(X_test_vec) 100 | nb_clf_acc = accuracy_score(y_test, y_pred) 101 | print("test set accuracy: ", nb_clf_acc) -------------------------------------------------------------------------------- /chp05/perceptron.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from scipy.stats import randint 6 | from sklearn.datasets import load_iris 7 | from sklearn.metrics import confusion_matrix 8 | from 
sklearn.model_selection import train_test_split 9 | 10 | class perceptron: 11 | def __init__(self, num_epochs, dim): 12 | self.num_epochs = num_epochs 13 | self.theta0 = 0 14 | self.theta = np.zeros(dim) 15 | 16 | def fit(self, X_train, y_train): 17 | n = X_train.shape[0] 18 | dim = X_train.shape[1] 19 | 20 | k = 1 21 | for epoch in range(self.num_epochs): 22 | for i in range(n): 23 | #sample random point 24 | idx = randint.rvs(0, n-1, size=1)[0] 25 | #hinge loss 26 | if (y_train[idx] * (np.dot(self.theta, X_train[idx,:]) + self.theta0) <= 0): 27 | #update learning rate 28 | eta = pow(k+1, -1) 29 | k += 1 30 | #print("eta: ", eta) 31 | 32 | #update theta 33 | self.theta = self.theta + eta * y_train[idx] * X_train[idx, :] 34 | self.theta0 = self.theta0 + eta * y_train[idx] 35 | #end if 36 | print("epoch: ", epoch) 37 | print("theta: ", self.theta) 38 | print("theta0: ", self.theta0) 39 | #end for 40 | #end for 41 | 42 | def predict(self, X_test): 43 | n = X_test.shape[0] 44 | dim = X_test.shape[1] 45 | 46 | y_pred = np.zeros(n) 47 | for idx in range(n): 48 | y_pred[idx] = np.sign(np.dot(self.theta, X_test[idx,:]) + self.theta0) 49 | #end for 50 | return y_pred 51 | 52 | if __name__ == "__main__": 53 | 54 | #load dataset 55 | iris = load_iris() 56 | X = iris.data[:100,:] 57 | y = 2*iris.target[:100] - 1 #map to {+1,-1} labels 58 | 59 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 60 | 61 | #perceptron (binary) classifier 62 | clf = perceptron(num_epochs=5, dim=X.shape[1]) 63 | clf.fit(X_train, y_train) 64 | y_pred = clf.predict(X_test) 65 | 66 | cmt = confusion_matrix(y_test, y_pred) 67 | acc = np.trace(cmt)/np.sum(np.sum(cmt)) 68 | print("percepton accuracy: ", acc) 69 | 70 | #generate plots 71 | plt.figure() 72 | sns.heatmap(cmt, annot=True, fmt="d") 73 | plt.title("Confusion Matrix"); plt.xlabel("predicted"); plt.ylabel("actual") 74 | #plt.savefig("./figures/perceptron_acc.png") 75 | plt.show() 76 | 77 | 78 | 79 | -------------------------------------------------------------------------------- /chp05/sgd_lr.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | def generate_data(): 5 | 6 | n = 1000 7 | mu1 = np.array([1,1]) 8 | mu2 = np.array([-1,-1]) 9 | pik = np.array([0.4,0.6]) 10 | 11 | X = np.zeros((n,2)) 12 | y = np.zeros((n,1)) 13 | 14 | for i in range(1,n): 15 | u = np.random.rand() 16 | idx = np.where(u < np.cumsum(pik))[0] 17 | 18 | if (len(idx)==1): 19 | X[i,:] = np.random.randn(1,2) + mu1 20 | y[i] = 1 21 | else: 22 | X[i,:] = np.random.randn(1,2) + mu2 23 | y[i] = -1 24 | 25 | return X, y 26 | 27 | 28 | class sgdlr: 29 | 30 | def __init__(self): 31 | 32 | self.num_iter = 100 33 | self.lmbda = 1e-9 34 | 35 | self.tau0 = 10 36 | self.kappa = 1 37 | self.eta = np.zeros(self.num_iter) 38 | 39 | self.batch_size = 200 40 | 41 | def fit(self, X, y): 42 | 43 | #random init 44 | theta = np.random.randn(X.shape[1],1) 45 | 46 | #learning rate schedule 47 | for i in range(self.num_iter): 48 | self.eta[i] = (self.tau0+i)**(-self.kappa) 49 | 50 | #divide data in batches 51 | batch_data, batch_labels = self.make_batches(X,y,self.batch_size) 52 | num_batches = batch_data.shape[0] 53 | num_updates = 0 54 | 55 | J_hist = np.zeros((self.num_iter * num_batches,1)) 56 | t_hist = np.zeros((self.num_iter * num_batches,1)) 57 | 58 | for itr in range(self.num_iter): 59 | for b in range(num_batches): 60 | Xb = batch_data[b] 61 | yb = batch_labels[b] 62 | 63 | 
J_cost, J_grad = self.lr_objective(theta, Xb, yb, self.lmbda) 64 | theta = theta - self.eta[itr]*(num_batches*J_grad) 65 | 66 | J_hist[num_updates] = J_cost 67 | t_hist[num_updates] = np.linalg.norm(theta,2) 68 | num_updates = num_updates + 1 69 | print("iteration %d, cost: %f" %(itr, J_cost)) 70 | 71 | y_pred = 2*(self.sigmoid(X.dot(theta)) > 0.5) - 1 72 | y_err = np.size(np.where(y_pred - y)[0])/float(y.shape[0]) 73 | print("classification error:", y_err) 74 | 75 | self.generate_plots(X, J_hist, t_hist, theta) 76 | return theta 77 | 78 | def make_batches(self, X, y, batch_size): 79 | n = X.shape[0] 80 | d = X.shape[1] 81 | num_batches = int(np.ceil(n/batch_size)) 82 | 83 | groups = np.tile(range(num_batches),batch_size) 84 | batch_data=np.zeros((num_batches,batch_size,d)) 85 | batch_labels=np.zeros((num_batches,batch_size,1)) 86 | 87 | for i in range(num_batches): 88 | batch_data[i,:,:] = X[groups==i,:] 89 | batch_labels[i,:] = y[groups==i] 90 | 91 | return batch_data, batch_labels 92 | 93 | def lr_objective(self, theta, X, y, lmbda): 94 | 95 | n = y.shape[0] 96 | y01 = (y+1)/2.0 97 | 98 | #compute the objective 99 | mu = self.sigmoid(X.dot(theta)) 100 | 101 | #bound away from 0 and 1 102 | eps = np.finfo(float).eps 103 | mu = np.maximum(mu,eps) 104 | mu = np.minimum(mu,1-eps) 105 | 106 | #compute cost 107 | cost = -(1/n)*np.sum(y01*np.log(mu)+(1-y01)*np.log(1-mu))+np.sum(lmbda*theta*theta) 108 | 109 | #compute the gradient of the lr objective 110 | grad = X.T.dot(mu-y01) + 2*lmbda*theta 111 | 112 | #compute the Hessian of the lr objective 113 | #H = X.T.dot(np.diag(np.diag( mu*(1-mu) ))).dot(X) + 2*lmbda*np.eye(np.size(theta)) 114 | 115 | return cost, grad 116 | 117 | def sigmoid(self, a): 118 | return 1/(1+np.exp(-a)) 119 | 120 | def generate_plots(self, X, J_hist, t_hist, theta): 121 | 122 | plt.figure() 123 | plt.plot(J_hist) 124 | plt.title("logistic regression") 125 | plt.xlabel('iterations') 126 | plt.ylabel('cost') 127 | #plt.savefig('./figures/lrsgd_loss.png') 128 | plt.show() 129 | 130 | plt.figure() 131 | plt.plot(t_hist) 132 | plt.title("logistic regression") 133 | plt.xlabel('iterations') 134 | plt.ylabel('theta l2 norm') 135 | #plt.savefig('./figures/lrsgd_theta_norm.png') 136 | plt.show() 137 | 138 | plt.figure() 139 | plt.plot(self.eta) 140 | plt.title("logistic regression") 141 | plt.xlabel('iterations') 142 | plt.ylabel('learning rate') 143 | #plt.savefig('./figures/lrsgd_learning_rate.png') 144 | plt.show() 145 | 146 | plt.figure() 147 | x1 = np.linspace(np.min(X[:,0])-1,np.max(X[:,0])+1,10) 148 | plt.scatter(X[:,0], X[:,1]) 149 | plt.plot(x1, -(theta[0]/theta[1])*x1) 150 | plt.title('logistic regression') 151 | plt.grid(True) 152 | plt.xlabel('X1') 153 | plt.ylabel('X2') 154 | #plt.savefig('./figures/lrsgd_clf.png') 155 | plt.show() 156 | 157 | if __name__ == "__main__": 158 | 159 | X, y = generate_data() 160 | sgd = sgdlr() 161 | theta = sgd.fit(X,y) 162 | 163 | -------------------------------------------------------------------------------- /chp05/svm.py: -------------------------------------------------------------------------------- 1 | import cvxopt 2 | import numpy as np 3 | 4 | from sklearn.svm import SVC #for comparison only 5 | from sklearn.datasets import load_iris 6 | from sklearn.metrics import accuracy_score 7 | from sklearn.model_selection import train_test_split 8 | 9 | def rbf_kernel(gamma, **kwargs): 10 | def f(x1, x2): 11 | distance = np.linalg.norm(x1 - x2) ** 2 12 | return np.exp(-gamma * distance) 13 | return f 14 | 15 | class 
SupportVectorMachine(object): 16 | def __init__(self, C=1, kernel=rbf_kernel, power=4, gamma=None, coef=4): 17 | self.C = C 18 | self.kernel = kernel 19 | self.power = power 20 | self.gamma = gamma 21 | self.coef = coef 22 | self.lagr_multipliers = None 23 | self.support_vectors = None 24 | self.support_vector_labels = None 25 | self.intercept = None 26 | 27 | def fit(self, X, y): 28 | 29 | n_samples, n_features = np.shape(X) 30 | 31 | # Set gamma to 1/n_features by default 32 | if not self.gamma: 33 | self.gamma = 1 / n_features 34 | 35 | # Initialize kernel method with parameters 36 | self.kernel = self.kernel( 37 | power=self.power, 38 | gamma=self.gamma, 39 | coef=self.coef) 40 | 41 | # Calculate kernel matrix 42 | kernel_matrix = np.zeros((n_samples, n_samples)) 43 | for i in range(n_samples): 44 | for j in range(n_samples): 45 | kernel_matrix[i, j] = self.kernel(X[i], X[j]) 46 | 47 | # Define the quadratic optimization problem 48 | P = cvxopt.matrix(np.outer(y, y) * kernel_matrix, tc='d') 49 | q = cvxopt.matrix(np.ones(n_samples) * -1) 50 | A = cvxopt.matrix(y, (1, n_samples), tc='d') 51 | b = cvxopt.matrix(0, tc='d') 52 | 53 | if not self.C: #if its empty 54 | G = cvxopt.matrix(np.identity(n_samples) * -1) 55 | h = cvxopt.matrix(np.zeros(n_samples)) 56 | else: 57 | G_max = np.identity(n_samples) * -1 58 | G_min = np.identity(n_samples) 59 | G = cvxopt.matrix(np.vstack((G_max, G_min))) 60 | h_max = cvxopt.matrix(np.zeros(n_samples)) 61 | h_min = cvxopt.matrix(np.ones(n_samples) * self.C) 62 | h = cvxopt.matrix(np.vstack((h_max, h_min))) 63 | 64 | # Solve the quadratic optimization problem using cvxopt 65 | minimization = cvxopt.solvers.qp(P, q, G, h, A, b) 66 | 67 | # Lagrange multipliers 68 | lagr_mult = np.ravel(minimization['x']) 69 | 70 | # Extract support vectors 71 | # Get indexes of non-zero lagr. multipiers 72 | idx = lagr_mult > 1e-11 73 | # Get the corresponding lagr. 
multipliers 74 | self.lagr_multipliers = lagr_mult[idx] 75 | # Get the samples that will act as support vectors 76 | self.support_vectors = X[idx] 77 | # Get the corresponding labels 78 | self.support_vector_labels = y[idx] 79 | 80 | # Calculate intercept with first support vector 81 | self.intercept = self.support_vector_labels[0] 82 | for i in range(len(self.lagr_multipliers)): 83 | self.intercept -= self.lagr_multipliers[i] * self.support_vector_labels[ 84 | i] * self.kernel(self.support_vectors[i], self.support_vectors[0]) 85 | 86 | 87 | def predict(self, X): 88 | y_pred = [] 89 | # Iterate through list of samples and make predictions 90 | for sample in X: 91 | prediction = 0 92 | # Determine the label of the sample by the support vectors 93 | for i in range(len(self.lagr_multipliers)): 94 | prediction += self.lagr_multipliers[i] * self.support_vector_labels[ 95 | i] * self.kernel(self.support_vectors[i], sample) 96 | prediction += self.intercept 97 | y_pred.append(np.sign(prediction)) 98 | return np.array(y_pred) 99 | 100 | 101 | def main(): 102 | 103 | #load dataset 104 | iris = load_iris() 105 | X = iris.data[:100,:] 106 | y = 2*iris.target[:100] - 1 #map to {+1,-1} labels 107 | 108 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4) 109 | clf = SupportVectorMachine(kernel=rbf_kernel, gamma = 1) 110 | clf.fit(X_train, y_train) 111 | y_pred = clf.predict(X_test) 112 | accuracy = accuracy_score(y_test, y_pred) 113 | print ("Accuracy (scratch):", accuracy) 114 | 115 | clf_sklearn = SVC(gamma = 'auto') 116 | clf_sklearn.fit(X_train, y_train) 117 | y_pred2 = clf_sklearn.predict(X_test) 118 | accuracy = accuracy_score(y_test, y_pred2) 119 | print ("Accuracy :", accuracy) 120 | 121 | if __name__ == "__main__": 122 | main() -------------------------------------------------------------------------------- /chp06/gp_reg.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from scipy.spatial.distance import cdist 4 | 5 | np.random.seed(42) 6 | 7 | class GPreg: 8 | 9 | def __init__(self, X_train, y_train, X_test): 10 | 11 | self.L = 1.0 12 | self.keps = 1e-8 13 | 14 | self.muFn = self.mean_func(X_test) 15 | self.Kfn = self.kernel_func(X_test, X_test) + 1e-15*np.eye(np.size(X_test)) 16 | 17 | self.X_train = X_train 18 | self.y_train = y_train 19 | self.X_test = X_test 20 | 21 | def mean_func(self, x): 22 | muFn = np.zeros(len(x)).reshape(-1,1) 23 | return muFn 24 | 25 | def kernel_func(self, x, z): 26 | sq_dist = cdist(x/self.L, z/self.L, 'euclidean')**2 27 | Kfn = 1.0 * np.exp(-sq_dist/2) 28 | return Kfn 29 | 30 | def compute_posterior(self): 31 | K = self.kernel_func(self.X_train, self.X_train) #K 32 | Ks = self.kernel_func(self.X_train, self.X_test) #K_* 33 | Kss = self.kernel_func(self.X_test, self.X_test) + self.keps*np.eye(np.size(self.X_test)) #K_** 34 | Ki = np.linalg.inv(K) #O(Ntrain^3) 35 | 36 | postMu = self.mean_func(self.X_test) + np.dot(np.transpose(Ks), np.dot(Ki, (self.y_train - self.mean_func(self.X_train)))) 37 | postCov = Kss - np.dot(np.transpose(Ks), np.dot(Ki, Ks)) 38 | 39 | self.muFn = postMu 40 | self.Kfn = postCov 41 | 42 | return None 43 | 44 | def generate_plots(self, X, num_samples=3): 45 | plt.figure() 46 | for i in range(num_samples): 47 | fs = self.gauss_sample(1) 48 | plt.plot(X, fs, '-k') 49 | #plt.plot(self.X_train, self.y_train, 'xk') 50 | 51 | mu = self.muFn.ravel() 52 | S2 = np.diag(self.Kfn) 53 | plt.fill(np.concatenate([X, X[::-1]]), 
np.concatenate([mu - 2*np.sqrt(S2), (mu + 2*np.sqrt(S2))[::-1]]), alpha=0.2, fc='b') 54 | plt.show() 55 | 56 | def gauss_sample(self, n): 57 | # returns n samples from a multivariate Gaussian distribution 58 | # S = AZ + mu 59 | A = np.linalg.cholesky(self.Kfn) 60 | Z = np.random.normal(loc=0, scale=1, size=(len(self.muFn),n)) 61 | S = np.dot(A,Z) + self.muFn 62 | return S 63 | 64 | def main(): 65 | 66 | # generate noise-less training data 67 | X_train = np.array([-4, -3, -2, -1, 1]) 68 | X_train = X_train.reshape(-1,1) 69 | y_train = np.sin(X_train) 70 | 71 | # generate test data 72 | X_test = np.linspace(-5, 5, 50) 73 | X_test = X_test.reshape(-1,1) 74 | 75 | gp = GPreg(X_train, y_train, X_test) 76 | gp.generate_plots(X_test,3) #samples from GP prior 77 | gp.compute_posterior() 78 | gp.generate_plots(X_test,3) #samples from GP posterior 79 | 80 | 81 | if __name__ == "__main__": 82 | main() 83 | 84 | -------------------------------------------------------------------------------- /chp06/hierarchical_regression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | import pymc3 as pm 8 | 9 | def main(): 10 | 11 | #load data 12 | data = pd.read_csv('./data/radon.txt') 13 | 14 | county_names = data.county.unique() 15 | county_idx = data['county_code'].values 16 | 17 | with pm.Model() as hierarchical_model: 18 | 19 | # Hyperpriors 20 | mu_a = pm.Normal('mu_alpha', mu=0., sd=100**2) 21 | sigma_a = pm.Uniform('sigma_alpha', lower=0, upper=100) 22 | mu_b = pm.Normal('mu_beta', mu=0., sd=100**2) 23 | sigma_b = pm.Uniform('sigma_beta', lower=0, upper=100) 24 | 25 | # Intercept for each county, distributed around group mean mu_a 26 | a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=len(data.county.unique())) 27 | # Slope for each county, distributed around group mean mu_b 28 | b = pm.Normal('beta', mu=mu_b, sd=sigma_b, shape=len(data.county.unique())) 29 | 30 | # Model error 31 | eps = pm.Uniform('eps', lower=0, upper=100) 32 | 33 | # Expected value 34 | radon_est = a[county_idx] + b[county_idx] * data.floor.values 35 | 36 | # Data likelihood 37 | y_like = pm.Normal('y_like', mu=radon_est, sd=eps, observed=data.log_radon) 38 | 39 | 40 | with hierarchical_model: 41 | # Use ADVI for initialization 42 | mu, sds, elbo = pm.variational.advi(n=100000) 43 | step = pm.NUTS(scaling=hierarchical_model.dict_to_array(sds)**2, is_cov=True) 44 | hierarchical_trace = pm.sample(5000, step, start=mu) 45 | 46 | 47 | pm.traceplot(hierarchical_trace[500:]) 48 | plt.show() 49 | 50 | if __name__ == "__main__": 51 | main() 52 | -------------------------------------------------------------------------------- /chp06/knn_reg.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn import datasets 5 | from sklearn.model_selection import train_test_split 6 | 7 | np.random.seed(42) 8 | 9 | class KNN(): 10 | 11 | def __init__(self, K): 12 | self.K = K 13 | 14 | def euclidean_distance(self, x1, x2): 15 | dist = 0 16 | for i in range(len(x1)): 17 | dist += np.power((x1[i] - x2[i]), 2) 18 | return np.sqrt(dist) 19 | 20 | def knn_search(self, X_train, y_train, Q): 21 | y_pred = np.empty(Q.shape[0]) 22 | 23 | for i, query in enumerate(Q): 24 | #get K nearest neighbors to query point 25 | idx = np.argsort([self.euclidean_distance(query, x) for x in X_train])[:self.K] 26 | 
#extract the labels of KNN training labels 27 | knn_labels = np.array([y_train[i] for i in idx]) 28 | #label query sample as the average of knn_labels 29 | y_pred[i] = np.mean(knn_labels) 30 | 31 | return y_pred 32 | 33 | 34 | if __name__ == "__main__": 35 | 36 | plt.close('all') 37 | 38 | #iris dataset 39 | iris = datasets.load_iris() 40 | X = iris.data[:,:2] 41 | y = iris.target 42 | 43 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 44 | 45 | K = 4 46 | knn = KNN(K) 47 | y_pred = knn.knn_search(X_train, y_train, X_test) 48 | 49 | plt.figure(1) 50 | plt.scatter(X_train[:,0], X_train[:,1], s = 100, marker = 'x', color = 'r', label = 'data') 51 | plt.scatter(X_test[:,0], X_test[:,1], s = 100, marker = 'o', color = 'b', label = 'query') 52 | plt.title('K Nearest Neighbors (K=%d)'% K) 53 | plt.legend() 54 | plt.xlabel('X1') 55 | plt.ylabel('X2') 56 | plt.grid(True) 57 | plt.show() 58 | 59 | 60 | 61 | -------------------------------------------------------------------------------- /chp06/ridge_reg.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import pandas as pd 4 | 5 | import matplotlib.pyplot as plt 6 | from sklearn.datasets import fetch_california_housing 7 | 8 | class ridge_reg(): 9 | 10 | def __init__(self, n_iter=20, learning_rate=1e-3, lmbda=0.1): 11 | self.n_iter = n_iter 12 | self.learning_rate = learning_rate 13 | self.lmbda = lmbda 14 | 15 | def fit(self, X, y): 16 | #insert const 1 for bias term 17 | X = np.insert(X, 0, 1, axis=1) 18 | 19 | self.loss = [] 20 | self.w = np.random.rand(X.shape[1]) 21 | 22 | for i in range(self.n_iter): 23 | y_pred = X.dot(self.w) 24 | mse = np.mean(0.5*(y - y_pred)**2 + 0.5*self.lmbda*self.w.T.dot(self.w)) 25 | self.loss.append(mse) 26 | print(" %d iter, mse: %.4f" %(i, mse)) 27 | #compute gradient of NLL(w) wrt w 28 | grad_w = - (y - y_pred).dot(X) + self.lmbda*self.w 29 | #update the weights 30 | self.w -= self.learning_rate * grad_w 31 | 32 | def predict(self, X): 33 | #insert const 1 for bias term 34 | X = np.insert(X, 0, 1, axis=1) 35 | y_pred = X.dot(self.w) 36 | return y_pred 37 | 38 | if __name__ == "__main__": 39 | 40 | X, y = fetch_california_housing(return_X_y=True) 41 | X_reg = X[:,2].reshape(-1,1) #avg number of rooms 42 | X_std = (X_reg - X_reg.mean())/X.std() #standard scaling 43 | y_std = (y - y.mean())/y.std() #standard scaling 44 | 45 | X_std = X_std[:200,:] 46 | y_std = y_std[:200] 47 | 48 | rr = ridge_reg() 49 | rr.fit(X_std, y_std) 50 | y_pred = rr.predict(X_std) 51 | 52 | print(rr.w) 53 | 54 | plt.figure() 55 | plt.plot(rr.loss) 56 | plt.xlabel('Epoch') 57 | plt.ylabel('Loss') 58 | plt.tight_layout() 59 | plt.show() 60 | 61 | plt.figure() 62 | plt.scatter(X_std, y_std) 63 | plt.plot(np.linspace(-1,1), rr.w[1]*np.linspace(-1,1)+rr.w[0], c='red') 64 | plt.xlim([-0.01,0.01]) 65 | plt.xlabel("scaled avg num of rooms") 66 | plt.ylabel("scaled house price") 67 | plt.show() -------------------------------------------------------------------------------- /chp07/active_learning.py: -------------------------------------------------------------------------------- 1 | from __future__ import unicode_literals, division 2 | from scipy.sparse import csc_matrix, vstack 3 | from scipy.stats import entropy 4 | from collections import Counter 5 | import numpy as np 6 | 7 | 8 | class ActiveLearner(object): 9 | 10 | uncertainty_sampling_frameworks = [ 11 | 'entropy', 12 | 'max_margin', 13 | 'least_confident', 14 | ] 15 | 16 | 
query_by_committee_frameworks = [ 17 | 'vote_entropy', 18 | 'average_kl_divergence', 19 | ] 20 | 21 | def __init__(self, strategy='least_confident'): 22 | self.strategy = strategy 23 | 24 | def rank(self, clf, X_unlabeled, num_queries=None): 25 | 26 | if num_queries == None: 27 | num_queries = X_unlabeled.shape[0] 28 | 29 | elif type(num_queries) == float: 30 | num_queries = int(num_queries * X_unlabeled.shape[0]) 31 | 32 | if self.strategy in self.uncertainty_sampling_frameworks: 33 | scores = self.uncertainty_sampling(clf, X_unlabeled) 34 | 35 | elif self.strategy in self.query_by_committee_frameworks: 36 | scores = self.query_by_committee(clf, X_unlabeled) 37 | 38 | else: 39 | raise ValueError("this strategy is not implemented.") 40 | 41 | rankings = np.argsort(-scores)[:num_queries] 42 | return rankings 43 | 44 | def uncertainty_sampling(self, clf, X_unlabeled): 45 | probs = clf.predict_proba(X_unlabeled) 46 | 47 | if self.strategy == 'least_confident': 48 | return 1 - np.amax(probs, axis=1) 49 | 50 | elif self.strategy == 'max_margin': 51 | margin = np.partition(-probs, 1, axis=1) 52 | return -np.abs(margin[:,0] - margin[:, 1]) 53 | 54 | elif self.strategy == 'entropy': 55 | return np.apply_along_axis(entropy, 1, probs) 56 | 57 | def query_by_committee(self, clf, X_unlabeled): 58 | num_classes = len(clf[0].classes_) 59 | C = len(clf) 60 | preds = [] 61 | 62 | if self.strategy == 'vote_entropy': 63 | for model in clf: 64 | y_out = map(int, model.predict(X_unlabeled)) 65 | preds.append(np.eye(num_classes)[y_out]) 66 | 67 | votes = np.apply_along_axis(np.sum, 0, np.stack(preds)) / C 68 | return np.apply_along_axis(entropy, 1, votes) 69 | 70 | elif self.strategy == 'average_kl_divergence': 71 | for model in clf: 72 | preds.append(model.predict_proba(X_unlabeled)) 73 | 74 | consensus = np.mean(np.stack(preds), axis=0) 75 | divergence = [] 76 | for y_out in preds: 77 | divergence.append(entropy(consensus.T, y_out.T)) 78 | 79 | return np.apply_along_axis(np.mean, 0, np.stack(divergence)) 80 | -------------------------------------------------------------------------------- /chp07/adaboost_clf.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import numpy as np 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | import matplotlib.gridspec as gridspec 7 | 8 | from sklearn import datasets 9 | 10 | from sklearn.tree import DecisionTreeClassifier 11 | from sklearn.neighbors import KNeighborsClassifier 12 | from sklearn.linear_model import LogisticRegression 13 | 14 | from sklearn.ensemble import AdaBoostClassifier 15 | from sklearn.model_selection import cross_val_score, train_test_split 16 | 17 | from mlxtend.plotting import plot_learning_curves 18 | from mlxtend.plotting import plot_decision_regions 19 | 20 | def main(): 21 | 22 | iris = datasets.load_iris() 23 | X, y = iris.data[:, 0:2], iris.target 24 | 25 | #XOR dataset 26 | #X = np.random.randn(200, 2) 27 | #y = np.array(map(int,np.logical_xor(X[:, 0] > 0, X[:, 1] > 0))) 28 | 29 | clf = DecisionTreeClassifier(criterion='entropy', max_depth=1) 30 | 31 | num_est = [1, 2, 3, 10] 32 | label = ['AdaBoost (n_est=1)', 'AdaBoost (n_est=2)', 'AdaBoost (n_est=3)', 'AdaBoost (n_est=10)'] 33 | 34 | fig = plt.figure(figsize=(10, 8)) 35 | gs = gridspec.GridSpec(2, 2) 36 | grid = itertools.product([0,1],repeat=2) 37 | 38 | for n_est, label, grd in zip(num_est, label, grid): 39 | boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est) 40 | boosting.fit(X, y) 41 | ax 
= plt.subplot(gs[grd[0], grd[1]]) 42 | fig = plot_decision_regions(X=X, y=y, clf=boosting, legend=2) 43 | plt.title(label) 44 | 45 | plt.show() 46 | #plt.savefig('./figures/boosting_ensemble.png') 47 | 48 | #plot learning curves 49 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) 50 | 51 | boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=10) 52 | 53 | plt.figure() 54 | plot_learning_curves(X_train, y_train, X_test, y_test, boosting, print_model=False, style='ggplot') 55 | plt.show() 56 | #plt.savefig('./figures/boosting_ensemble_learning_curve.png') 57 | 58 | #Ensemble Size 59 | num_est = list(map(int, np.linspace(1,100,20))) 60 | bg_clf_cv_mean = [] 61 | bg_clf_cv_std = [] 62 | for n_est in num_est: 63 | print("num_est: ", n_est) 64 | ada_clf = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est) 65 | scores = cross_val_score(ada_clf, X, y, cv=3, scoring='accuracy') 66 | bg_clf_cv_mean.append(scores.mean()) 67 | bg_clf_cv_std.append(scores.std()) 68 | 69 | plt.figure() 70 | (_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5) 71 | for cap in caps: 72 | cap.set_markeredgewidth(1) 73 | plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('AdaBoost Ensemble'); 74 | plt.show() 75 | #plt.savefig('./figures/boosting_ensemble_size.png') 76 | 77 | if __name__ == "__main__": 78 | main() -------------------------------------------------------------------------------- /chp07/bagging_clf.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import numpy as np 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | import matplotlib.gridspec as gridspec 7 | 8 | from sklearn import datasets 9 | 10 | from sklearn.tree import DecisionTreeClassifier 11 | from sklearn.neighbors import KNeighborsClassifier 12 | from sklearn.linear_model import LogisticRegression 13 | from sklearn.ensemble import RandomForestClassifier 14 | 15 | from sklearn.ensemble import BaggingClassifier 16 | from sklearn.model_selection import cross_val_score, train_test_split 17 | 18 | from mlxtend.plotting import plot_learning_curves 19 | from mlxtend.plotting import plot_decision_regions 20 | 21 | def main(): 22 | 23 | iris = datasets.load_iris() 24 | X, y = iris.data[:, 0:2], iris.target 25 | 26 | clf1 = DecisionTreeClassifier(criterion='entropy', max_depth=None) 27 | clf2 = KNeighborsClassifier(n_neighbors=1) 28 | 29 | bagging1 = BaggingClassifier(base_estimator=clf1, n_estimators=10, max_samples=0.8, max_features=0.8) 30 | bagging2 = BaggingClassifier(base_estimator=clf2, n_estimators=10, max_samples=0.8, max_features=0.8) 31 | 32 | label = ['Decision Tree', 'K-NN', 'Bagging Tree', 'Bagging K-NN'] 33 | clf_list = [clf1, clf2, bagging1, bagging2] 34 | 35 | fig = plt.figure(figsize=(10, 8)) 36 | gs = gridspec.GridSpec(2, 2) 37 | grid = itertools.product([0,1],repeat=2) 38 | 39 | for clf, label, grd in zip(clf_list, label, grid): 40 | scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy') 41 | print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label)) 42 | 43 | clf.fit(X, y) 44 | ax = plt.subplot(gs[grd[0], grd[1]]) 45 | fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2) 46 | plt.title(label) 47 | 48 | plt.show() 49 | #plt.savefig('./figures/bagging_ensemble.png') 50 | 51 | #plot learning curves 52 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) 53 | 54 | plt.figure() 55 
| plot_learning_curves(X_train, y_train, X_test, y_test, bagging1, print_model=False, style='ggplot') 56 | plt.show() 57 | #plt.savefig('./figures/bagging_ensemble_learning_curve.png') 58 | 59 | #Ensemble Size 60 | num_est = list(map(int, np.linspace(1,100,20))) 61 | bg_clf_cv_mean = [] 62 | bg_clf_cv_std = [] 63 | for n_est in num_est: 64 | print("num_est: ", n_est) 65 | bg_clf = BaggingClassifier(base_estimator=clf1, n_estimators=n_est, max_samples=0.8, max_features=0.8) 66 | scores = cross_val_score(bg_clf, X, y, cv=3, scoring='accuracy') 67 | bg_clf_cv_mean.append(scores.mean()) 68 | bg_clf_cv_std.append(scores.std()) 69 | 70 | plt.figure() 71 | (_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5) 72 | for cap in caps: 73 | cap.set_markeredgewidth(1) 74 | plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('Bagging Tree Ensemble'); 75 | plt.show() 76 | #plt.savefig('./figures/bagging_ensemble_size.png') 77 | 78 | if __name__ == "__main__": 79 | main() -------------------------------------------------------------------------------- /chp07/bayes_opt_sklearn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import seaborn as sns 5 | import matplotlib.pyplot as plt 6 | 7 | from sklearn.datasets import make_classification 8 | from sklearn.model_selection import cross_val_score 9 | from sklearn.ensemble import RandomForestClassifier as RFC 10 | from sklearn.svm import SVC 11 | 12 | from bayes_opt import BayesianOptimization 13 | 14 | np.random.seed(42) 15 | 16 | # Load data set and target values 17 | data, target = make_classification( 18 | n_samples=1000, 19 | n_features=45, 20 | n_informative=12, 21 | n_redundant=7 22 | ) 23 | target = target.ravel() 24 | 25 | def svccv(gamma): 26 | val = cross_val_score( 27 | SVC(gamma=gamma, random_state=0), 28 | data, target, scoring='f1', cv=2 29 | ).mean() 30 | 31 | return val 32 | 33 | def rfccv(n_estimators, max_depth): 34 | val = cross_val_score( 35 | RFC(n_estimators=int(n_estimators), 36 | max_depth=int(max_depth), 37 | random_state=0 38 | ), 39 | data, target, scoring='f1', cv=2 40 | ).mean() 41 | return val 42 | 43 | if __name__ == "__main__": 44 | 45 | gp_params = {"alpha": 1e-5} 46 | 47 | #SVM 48 | svcBO = BayesianOptimization(svccv, 49 | {'gamma': (0.00001, 0.1)}) 50 | 51 | svcBO.maximize(init_points=3, n_iter=4, **gp_params) 52 | 53 | #Random Forest 54 | rfcBO = BayesianOptimization( 55 | rfccv, 56 | {'n_estimators': (10, 300), 57 | 'max_depth': (2, 10) 58 | } 59 | ) 60 | rfcBO.maximize(init_points=4, n_iter=4, **gp_params) 61 | 62 | print('Final Results') 63 | print(svcBO.max) 64 | print(rfcBO.max) -------------------------------------------------------------------------------- /chp07/demo_logreg.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from active_learning import ActiveLearner 6 | from sklearn.metrics import accuracy_score 7 | from sklearn.datasets import make_classification 8 | from sklearn.linear_model import LogisticRegression 9 | from sklearn.model_selection import train_test_split 10 | 11 | np.random.seed(42) 12 | 13 | def main(): 14 | 15 | #number of labeled points 16 | num_queries = 30 17 | 18 | #generate data 19 | data, target = make_classification(n_samples=200, n_features=2, n_informative=2,\ 20 | n_redundant=0, n_classes=2, weights = [0.5, 
0.5], random_state=0) 21 | 22 | #split into labeled and unlabeled pools 23 | X_train, X_unlabeled, y_train, y_oracle = train_test_split(data, target, test_size=0.2, random_state=0) 24 | 25 | #random sub-sampling 26 | rnd_idx = np.random.randint(0, X_train.shape[0], num_queries) 27 | X1 = X_train[rnd_idx,:] 28 | y1 = y_train[rnd_idx] 29 | 30 | clf1 = LogisticRegression() 31 | clf1.fit(X1, y1) 32 | 33 | y1_preds = clf1.predict(X_unlabeled) 34 | score1 = accuracy_score(y_oracle, y1_preds) 35 | print("random subsampling accuracy: ", score1) 36 | 37 | #plot 2D decision boundary: w2x2 + w1x1 + w0 = 0 38 | w0 = clf1.intercept_ 39 | w1, w2 = clf1.coef_[0] 40 | xx = np.linspace(-1, 1, 100) 41 | decision_boundary = -w0/float(w2) - (w1/float(w2))*xx 42 | 43 | plt.figure() 44 | plt.scatter(data[rnd_idx,0], data[rnd_idx,1], c='black', marker='s', s=64, label='labeled') 45 | plt.scatter(data[target==0,0], data[target==0,1], c='blue', marker='o', alpha=0.5, label='class 0') 46 | plt.scatter(data[target==1,0], data[target==1,1], c='red', marker='o', alpha=0.5, label='class 1') 47 | plt.plot(xx, decision_boundary, linewidth = 2.0, c='black', linestyle = '--', label='log reg boundary') 48 | plt.title("Random Subsampling") 49 | plt.legend() 50 | plt.show() 51 | 52 | #active learning 53 | AL = ActiveLearner(strategy='entropy') 54 | al_idx = AL.rank(clf1, X_unlabeled, num_queries=num_queries) 55 | 56 | X2 = X_train[al_idx,:] 57 | y2 = y_train[al_idx] 58 | 59 | clf2 = LogisticRegression() 60 | clf2.fit(X2, y2) 61 | 62 | y2_preds = clf2.predict(X_unlabeled) 63 | score2 = accuracy_score(y_oracle, y2_preds) 64 | print("active learning accuracy: ", score2) 65 | 66 | #plot 2D decision boundary: w2x2 + w1x1 + w0 = 0 67 | w0 = clf2.intercept_ 68 | w1, w2 = clf2.coef_[0] 69 | xx = np.linspace(-1, 1, 100) 70 | decision_boundary = -w0/float(w2) - (w1/float(w2))*xx 71 | 72 | plt.figure() 73 | plt.scatter(data[al_idx,0], data[al_idx,1], c='black', marker='s', s=64, label='labeled') 74 | plt.scatter(data[target==0,0], data[target==0,1], c='blue', marker='o', alpha=0.5, label='class 0') 75 | plt.scatter(data[target==1,0], data[target==1,1], c='red', marker='o', alpha=0.5, label='class 1') 76 | plt.plot(xx, decision_boundary, linewidth = 2.0, c='black', linestyle = '--', label='log reg boundary') 77 | plt.title("Uncertainty Sampling") 78 | plt.legend() 79 | plt.show() 80 | 81 | if __name__ == "__main__": 82 | 83 | main() -------------------------------------------------------------------------------- /chp07/hmm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.sparse import coo_matrix 3 | import matplotlib.pyplot as plt 4 | 5 | np.random.seed(42) 6 | 7 | class HMM(): 8 | def __init__(self, d=3, k=2, n=10000): 9 | self.d = d #dimension of data 10 | self.k = k #dimension of latent state 11 | self.n = n #number of data points 12 | 13 | self.A = np.zeros((k,k)) #transition matrix 14 | self.E = np.zeros((k,d)) #emission matrix 15 | self.s = np.zeros(k) #initial state vector 16 | 17 | self.x = np.zeros(self.n) #emitted observations 18 | 19 | def normalize_mat(self, X, dim=1): 20 | z = np.sum(X, axis=dim) 21 | Xnorm = X/z.reshape(-1,1) 22 | return Xnorm 23 | 24 | def normalize_vec(self, v): 25 | z = sum(v) 26 | u = v / z 27 | return u, z 28 | 29 | def init_hmm(self): 30 | 31 | #initialize matrices at random 32 | self.A = self.normalize_mat(np.random.rand(self.k,self.k)) 33 | self.E = self.normalize_mat(np.random.rand(self.k,self.d)) 34 | self.s, _ = 
self.normalize_vec(np.random.rand(self.k)) 35 | 36 | #generate markov observations 37 | z = np.random.choice(self.k, size=1, p=self.s) 38 | self.x[0] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel()) 39 | for i in range(1, self.n): 40 | z = np.random.choice(self.k, size=1, p=self.A[z,:].ravel()) 41 | self.x[i] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel()) 42 | #end for 43 | 44 | def forward_backward(self): 45 | 46 | #construct sparse matrix X of emission indicators 47 | data = np.ones(self.n) 48 | row = self.x.astype(int) #sparse indices must be integers 49 | col = np.arange(self.n) 50 | X = coo_matrix((data, (row, col)), shape=(self.d, self.n)) 51 | 52 | M = self.E * X 53 | At = np.transpose(self.A) 54 | c = np.zeros(self.n) #normalization constants 55 | alpha = np.zeros((self.k, self.n)) #alpha = p(z_t = j | x_{1:t}) 56 | alpha[:,0], c[0] = self.normalize_vec(self.s * M[:,0]) 57 | for t in range(1, self.n): 58 | alpha[:,t], c[t] = self.normalize_vec(np.dot(At, alpha[:,t-1]) * M[:,t]) 59 | #end for 60 | 61 | beta = np.ones((self.k, self.n)) 62 | for t in range(self.n-2, -1, -1): #include t=0 so that beta[:,0] is computed 63 | beta[:,t] = np.dot(self.A, beta[:,t+1] * M[:,t+1])/c[t+1] 64 | #end for 65 | gamma = alpha * beta 66 | 67 | return gamma, alpha, beta, c 68 | 69 | def viterbi(self): 70 | 71 | #construct sparse matrix X of emission indicators 72 | data = np.ones(self.n) 73 | row = self.x.astype(int) 74 | col = np.arange(self.n) 75 | X = coo_matrix((data, (row, col)), shape=(self.d, self.n)) 76 | 77 | #log scale for numerical stability 78 | s = np.log(self.s) 79 | A = np.log(self.A) 80 | M = np.log(self.E * X) 81 | 82 | Z = np.zeros((self.k, self.n)) 83 | Z[:,0] = np.arange(self.k) 84 | v = s + M[:,0] 85 | for t in range(1, self.n): 86 | Av = A + v.reshape(-1,1) 87 | v = np.max(Av, axis=0) 88 | idx = np.argmax(Av, axis=0) 89 | v = v.reshape(-1,1) + M[:,t].reshape(-1,1) 90 | Z = Z[idx,:] 91 | Z[:,t] = np.arange(self.k) 92 | #end for 93 | llh = np.max(v) 94 | idx = np.argmax(v) 95 | z = Z[idx,:] 96 | 97 | return z, llh 98 | 99 | 100 | if __name__ == "__main__": 101 | 102 | hmm = HMM() 103 | hmm.init_hmm() 104 | 105 | gamma, alpha, beta, c = hmm.forward_backward() 106 | z, llh = hmm.viterbi() 107 | import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /chp07/page_rank.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from numpy.linalg import norm 3 | 4 | np.random.seed(42) 5 | 6 | class page_rank(): 7 | 8 | def __init__(self): 9 | self.max_iter = 100 10 | self.tolerance = 1e-5 11 | 12 | def power_iteration(self, A): 13 | n = np.shape(A)[0] 14 | v = np.random.rand(n) 15 | converged = False 16 | iter = 0 17 | 18 | while (not converged) and (iter < self.max_iter): 19 | old_v = v 20 | v = np.dot(A, v) 21 | v = v / norm(v) 22 | lambd = np.dot(v, np.dot(A, v)) 23 | converged = norm(v - old_v) < self.tolerance 24 | iter += 1 25 | #end while 26 | 27 | return lambd, v 28 | 29 | if __name__ == "__main__": 30 | 31 | #construct a symmetric real matrix 32 | X = np.random.rand(10,5) 33 | A = np.dot(X.T, X) 34 | 35 | pr = page_rank() 36 | lambd, v = pr.power_iteration(A) 37 | 38 | print(lambd) 39 | print(v) 40 | 41 | #compare against np.linalg implementation 42 | eigval, eigvec = np.linalg.eig(A) 43 | idx = np.argsort(np.abs(eigval))[::-1] 44 | top_lambd = eigval[idx][0] 45 | top_v = eigvec[:,idx][:,0] #dominant eigenvector is the first column after sorting 46 | 47 | assert np.allclose(lambd, top_lambd, 1e-3) 48 | assert np.allclose(np.abs(v), np.abs(top_v), 1e-3) #compare up to sign 49 | 50 | 51 | 52 | 53 | 
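Note: `page_rank.py` above exercises the power-iteration routine on a random symmetric matrix rather than on an actual link graph. The following is a minimal sketch (not part of the repository) of how the same power method produces PageRank scores for a link graph; the four-page link structure and the damping factor of 0.85 are illustrative assumptions.

```python
import numpy as np

#hypothetical 4-page link graph: links[i] lists the pages that page i points to
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, damping = 4, 0.85

#column-stochastic link matrix: M[j, i] = 1/outdegree(i) if page i links to page j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

#Google matrix: follow a link with probability `damping`, otherwise teleport uniformly
G = damping * M + (1.0 - damping) / n * np.ones((n, n))

#power iteration: the dominant eigenvector (eigenvalue 1) is the PageRank vector
r = np.ones(n) / n
for _ in range(100):
    r_next = G.dot(r)
    r_next /= r_next.sum()  #keep r a probability distribution
    if np.linalg.norm(r_next - r, 1) < 1e-10:
        break
    r = r_next

print("PageRank scores:", r_next)
```

Because the Google matrix G is column-stochastic, its dominant eigenvalue is 1 and the normalized fixed point of the iteration is the PageRank distribution.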
-------------------------------------------------------------------------------- /chp07/plot_smote_regular.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.datasets import make_classification 5 | from sklearn.decomposition import PCA 6 | 7 | from imblearn.over_sampling import SMOTE 8 | 9 | def plot_resampling(ax, X, y, title): 10 | c0 = ax.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", marker="o", alpha=0.5) 11 | c1 = ax.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", marker="s", alpha=0.5) 12 | ax.set_title(title) 13 | ax.spines['top'].set_visible(False) 14 | ax.spines['right'].set_visible(False) 15 | ax.get_xaxis().tick_bottom() 16 | ax.get_yaxis().tick_left() 17 | ax.spines['left'].set_position(('outward', 10)) 18 | ax.spines['bottom'].set_position(('outward', 10)) 19 | ax.set_xlim([-6, 8]) 20 | ax.set_ylim([-6, 6]) 21 | 22 | return c0, c1 23 | 24 | def main(): 25 | # generate the dataset 26 | X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7], 27 | n_informative=3, n_redundant=1, flip_y=0, 28 | n_features=20, n_clusters_per_class=1, 29 | n_samples=80, random_state=10) 30 | 31 | # fit PCA for visualization 32 | pca = PCA(n_components=2) 33 | X_vis = pca.fit_transform(X) 34 | 35 | # apply regular SMOTE 36 | method = SMOTE() 37 | X_res, y_res = method.fit_resample(X, y) 38 | X_res_vis = pca.transform(X_res) 39 | 40 | # generate plots 41 | f, (ax1, ax2) = plt.subplots(1, 2) 42 | c0, c1 = plot_resampling(ax1, X_vis, y, 'Original') 43 | plot_resampling(ax2, X_res_vis, y_res, 'SMOTE') 44 | ax1.legend((c0, c1), ('Class #0', 'Class #1')) 45 | plt.tight_layout() 46 | plt.show() 47 | 48 | if __name__ == "__main__": 49 | main() 50 | -------------------------------------------------------------------------------- /chp07/plot_tomek_links.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import seaborn as sns 3 | import matplotlib.pyplot as plt 4 | 5 | from sklearn.model_selection import train_test_split 6 | from sklearn.utils import shuffle 7 | from imblearn.under_sampling import TomekLinks 8 | 9 | rng = np.random.RandomState(42) 10 | 11 | def main(): 12 | 13 | #generate data 14 | n_samples_1 = 500 15 | n_samples_2 = 50 16 | X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2), 0.5 * rng.randn(n_samples_2, 2) + [2, 2]] 17 | y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2)) 18 | X_syn, y_syn = shuffle(X_syn, y_syn) 19 | X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn) 20 | 21 | # remove Tomek links 22 | tl = TomekLinks(sampling_strategy='auto') 23 | X_resampled, y_resampled = tl.fit_resample(X_syn, y_syn) 24 | idx_resampled = tl.sample_indices_ 25 | idx_samples_removed = np.setdiff1d(np.arange(X_syn.shape[0]),idx_resampled) 26 | 27 | #generate plots 28 | fig = plt.figure() 29 | ax = fig.add_subplot(1, 1, 1) 30 | 31 | idx_class_0 = y_resampled == 0 32 | plt.scatter(X_resampled[idx_class_0, 0], X_resampled[idx_class_0, 1], alpha=.8, marker = "o", label='Class #0') 33 | plt.scatter(X_resampled[~idx_class_0, 0], X_resampled[~idx_class_0, 1], alpha=.8, marker = "s", label='Class #1') 34 | plt.scatter(X_syn[idx_samples_removed, 0], X_syn[idx_samples_removed, 1], alpha=.8, marker = "v", label='Removed samples') 35 | plt.title('Undersampling: Tomek links') 36 | plt.legend() 37 | plt.show() 38 | 39 | if __name__ == "__main__": 40 | main() 
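Note: the two imbalanced-learning scripts above (`plot_smote_regular.py` and `plot_tomek_links.py`) delegate the resampling itself to imbalanced-learn. For intuition, here is a minimal sketch (not part of the repository, and not the imblearn implementation) of SMOTE's core interpolation step: each synthetic minority sample lies at a random point on the segment between a minority sample and one of its k nearest minority neighbors. The toy cluster, k=5, and one synthetic sample per point are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

#toy minority-class cluster in 2D (illustrative)
X_min = rng.randn(20, 2) + np.array([2.0, 2.0])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  #+1: each point is its own nearest neighbor
_, neigh_idx = nn.kneighbors(X_min)

#one synthetic sample per minority point, interpolated toward a random neighbor
X_syn = []
for i, x in enumerate(X_min):
    j = rng.choice(neigh_idx[i, 1:])   #pick one of the k nearest minority neighbors
    gap = rng.uniform(0.0, 1.0)        #random position along the segment
    X_syn.append(x + gap * (X_min[j] - x))
X_syn = np.array(X_syn)

print("original minority:", X_min.shape, "synthetic:", X_syn.shape)
```

Tomek-link removal works in the opposite direction: instead of adding minority points, it removes majority samples from pairs of opposite-class mutual nearest neighbors, which cleans up the class boundary.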
-------------------------------------------------------------------------------- /chp07/stacked_clf.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import numpy as np 3 | import seaborn as sns 4 | import matplotlib.pyplot as plt 5 | import matplotlib.gridspec as gridspec 6 | 7 | from sklearn import datasets 8 | 9 | from sklearn.linear_model import LogisticRegression 10 | from sklearn.neighbors import KNeighborsClassifier 11 | from sklearn.naive_bayes import GaussianNB 12 | from sklearn.ensemble import RandomForestClassifier 13 | from mlxtend.classifier import StackingClassifier 14 | 15 | from sklearn.model_selection import cross_val_score, train_test_split 16 | 17 | from mlxtend.plotting import plot_learning_curves 18 | from mlxtend.plotting import plot_decision_regions 19 | 20 | def main(): 21 | 22 | iris = datasets.load_iris() 23 | X, y = iris.data[:, 1:3], iris.target 24 | 25 | clf1 = KNeighborsClassifier(n_neighbors=1) 26 | clf2 = RandomForestClassifier(random_state=1) 27 | clf3 = GaussianNB() 28 | lr = LogisticRegression() 29 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 30 | meta_classifier=lr) 31 | 32 | label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier'] 33 | clf_list = [clf1, clf2, clf3, sclf] 34 | 35 | fig = plt.figure(figsize=(10,8)) 36 | gs = gridspec.GridSpec(2, 2) 37 | grid = itertools.product([0,1],repeat=2) 38 | 39 | clf_cv_mean = [] 40 | clf_cv_std = [] 41 | for clf, label, grd in zip(clf_list, label, grid): 42 | 43 | scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy') 44 | print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label)) 45 | clf_cv_mean.append(scores.mean()) 46 | clf_cv_std.append(scores.std()) 47 | 48 | clf.fit(X, y) 49 | ax = plt.subplot(gs[grd[0], grd[1]]) 50 | fig = plot_decision_regions(X=X, y=y, clf=clf) 51 | plt.title(label) 52 | 53 | plt.show() 54 | #plt.savefig("./figures/ensemble_stacking.png") 55 | 56 | #plot classifier accuracy 57 | plt.figure() 58 | (_, caps, _) = plt.errorbar(range(4), clf_cv_mean, yerr=clf_cv_std, c='blue', fmt='-o', capsize=5) 59 | for cap in caps: 60 | cap.set_markeredgewidth(1) 61 | plt.xticks(range(4), ['KNN', 'RF', 'NB', 'Stacking'], rotation='vertical') 62 | plt.ylabel('Accuracy'); plt.xlabel('Classifier'); plt.title('Stacking Ensemble'); 63 | plt.show() 64 | #plt.savefig('./figures/stacking_ensemble_size.png') 65 | 66 | #plot learning curves 67 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) 68 | 69 | plt.figure() 70 | plot_learning_curves(X_train, y_train, X_test, y_test, sclf, print_model=False, style='ggplot') 71 | plt.show() 72 | #plt.savefig('./figures/stacking_ensemble_learning_curve.png') 73 | 74 | 75 | if __name__ == "__main__": 76 | main() 77 | -------------------------------------------------------------------------------- /chp08/dpmeans.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | import time 5 | from sklearn import metrics 6 | from sklearn.datasets import load_iris 7 | 8 | np.random.seed(42) 9 | 10 | class dpmeans: 11 | 12 | def __init__(self,X): 13 | # Initialize parameters for DP means 14 | self.K = 1 15 | self.K_init = 4 16 | self.d = X.shape[1] 17 | self.z = np.mod(np.random.permutation(X.shape[0]),self.K)+1 18 | self.mu = np.random.standard_normal((self.K, self.d)) 19 | self.sigma = 1 20 | self.nk = np.zeros(self.K) 21 | self.pik = 
np.ones(self.K)/self.K 22 | 23 | #init mu 24 | self.mu = np.array([np.mean(X,0)]) 25 | 26 | #init lambda 27 | self.Lambda = self.kpp_init(X,self.K_init) 28 | 29 | self.max_iter = 100 30 | self.obj = np.zeros(self.max_iter) 31 | self.em_time = np.zeros(self.max_iter) 32 | 33 | def kpp_init(self,X,k): 34 | #k++ init 35 | #lambda is max distance to k++ means 36 | 37 | [n,d] = np.shape(X) 38 | mu = np.zeros((k,d)) 39 | dist = np.inf*np.ones(n) 40 | 41 | mu[0,:] = X[int(np.random.rand()*n-1),:] 42 | for i in range(1,k): 43 | D = X-np.tile(mu[i-1,:],(n,1)) 44 | dist = np.minimum(dist, np.sum(D*D,1)) 45 | idx = np.where(np.random.rand() < np.cumsum(dist/float(sum(dist)))) 46 | mu[i,:] = X[idx[0][0],:] 47 | Lambda = np.max(dist) 48 | 49 | print("Lambda: ", Lambda) 50 | 51 | return Lambda 52 | 53 | def fit(self,X): 54 | 55 | obj_tol = 1e-3 56 | max_iter = self.max_iter 57 | [n,d] = np.shape(X) 58 | 59 | obj = np.zeros(max_iter) 60 | em_time = np.zeros(max_iter) 61 | print('running dpmeans...') 62 | 63 | for iter in range(max_iter): 64 | tic = time.time() 65 | dist = np.zeros((n,self.K)) 66 | 67 | #assignment step 68 | for kk in range(self.K): 69 | Xm = X - np.tile(self.mu[kk,:],(n,1)) 70 | dist[:,kk] = np.sum(Xm*Xm,1) 71 | 72 | #update labels 73 | dmin = np.min(dist,1) 74 | self.z = np.argmin(dist,1) 75 | idx = np.where(dmin > self.Lambda) 76 | 77 | if (np.size(idx) > 0): 78 | self.K = self.K + 1 79 | self.z[idx[0]] = self.K-1 #cluster labels in [0,...,K-1] 80 | self.mu = np.vstack([self.mu,np.mean(X[idx[0],:],0)]) 81 | Xm = X - np.tile(self.mu[self.K-1,:],(n,1)) 82 | dist = np.hstack([dist, np.array([np.sum(Xm*Xm,1)]).T]) 83 | 84 | #update step 85 | self.nk = np.zeros(self.K) 86 | for kk in range(self.K): 87 | self.nk[kk] = self.z.tolist().count(kk) 88 | idx = np.where(self.z == kk) 89 | self.mu[kk,:] = np.mean(X[idx[0],:],0) 90 | 91 | self.pik = self.nk/float(np.sum(self.nk)) 92 | 93 | #compute objective 94 | for kk in range(self.K): 95 | idx = np.where(self.z == kk) 96 | obj[iter] = obj[iter] + np.sum(dist[idx[0],kk],0) 97 | obj[iter] = obj[iter] + self.Lambda * self.K 98 | 99 | #check convergence 100 | if (iter > 0 and np.abs(obj[iter]-obj[iter-1]) < obj_tol*obj[iter]): 101 | print('converged in %d iterations\n'% iter) 102 | break 103 | em_time[iter] = time.time()-tic 104 | #end for 105 | self.obj = obj 106 | self.em_time = em_time 107 | return self.z, obj, em_time 108 | 109 | def compute_nmi(self, z1, z2): 110 | # compute normalized mutual information 111 | 112 | n = np.size(z1) 113 | k1 = np.size(np.unique(z1)) 114 | k2 = np.size(np.unique(z2)) 115 | 116 | nk1 = np.zeros((k1,1)) 117 | nk2 = np.zeros((k2,1)) 118 | 119 | for kk in range(k1): 120 | nk1[kk] = np.sum(z1==kk) 121 | for kk in range(k2): 122 | nk2[kk] = np.sum(z2==kk) 123 | 124 | pk1 = nk1/float(np.sum(nk1)) 125 | pk2 = nk2/float(np.sum(nk2)) 126 | 127 | nk12 = np.zeros((k1,k2)) 128 | for ii in range(k1): 129 | for jj in range(k2): 130 | nk12[ii,jj] = np.sum((z1==ii)*(z2==jj)) 131 | pk12 = nk12/float(n) 132 | 133 | Hx = -np.sum(pk1 * np.log(pk1 + np.finfo(float).eps)) 134 | Hy = -np.sum(pk2 * np.log(pk2 + np.finfo(float).eps)) 135 | 136 | Hxy = -np.sum(pk12 * np.log(pk12 + np.finfo(float).eps)) 137 | 138 | MI = Hx + Hy - Hxy; 139 | nmi = MI/float(0.5*(Hx+Hy)) 140 | 141 | return nmi 142 | 143 | def generate_plots(self,X): 144 | 145 | plt.close('all') 146 | plt.figure(0) 147 | for kk in range(self.K): 148 | #idx = np.where(self.z == kk) 149 | plt.scatter(X[self.z == kk,0], X[self.z == kk,1], \ 150 | s = 100, marker = 'o', c = 
np.random.rand(3,), label = str(kk)) 151 | #end for 152 | plt.xlabel('X1') 153 | plt.ylabel('X2') 154 | plt.legend() 155 | plt.title('DP-means clusters') 156 | plt.grid(True) 157 | plt.show() 158 | 159 | plt.figure(1) 160 | plt.plot(self.obj) 161 | plt.title('DP-means objective function') 162 | plt.xlabel('iterations') 163 | plt.ylabel('penalized l2 squared distance') 164 | plt.grid(True) 165 | plt.show() 166 | 167 | if __name__ == "__main__": 168 | 169 | iris = load_iris() 170 | X = iris.data 171 | y = iris.target 172 | 173 | dp = dpmeans(X) 174 | labels, obj, em_time = dp.fit(X) 175 | dp.generate_plots(X) 176 | 177 | nmi = dp.compute_nmi(y,labels) 178 | ari = metrics.adjusted_rand_score(y,labels) 179 | 180 | print("NMI: %.4f" % nmi) 181 | print("ARI: %.4f" % ari) 182 | -------------------------------------------------------------------------------- /chp08/gmm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import matplotlib as mpl 4 | from sklearn.cluster import KMeans 5 | from scipy.stats import multivariate_normal 6 | from scipy.special import logsumexp 7 | from scipy import linalg 8 | 9 | np.random.seed(3) 10 | 11 | class GMM: 12 | 13 | def __init__(self, n=1e3, d=2, K=4): 14 | self.n = int(n) #number of data points 15 | self.d = d #data dimension 16 | self.K = K #number of clusters 17 | 18 | self.X = np.zeros((self.n, self.d)) 19 | 20 | self.mu = np.zeros((self.d, self.K)) 21 | self.sigma = np.zeros((self.d, self.d, self.K)) 22 | self.pik = np.ones(self.K)/K 23 | 24 | def generate_data(self): 25 | #GMM generative model 26 | alpha0 = np.ones(self.K) 27 | pi = np.random.dirichlet(alpha0) 28 | 29 | #ground truth mu and sigma 30 | mu0 = np.random.randint(0, 10, size=(self.d, self.K)) - 5*np.ones((self.d, self.K)) 31 | V0 = np.zeros((self.d, self.d, self.K)) 32 | for k in range(self.K): 33 | eigen_mean = 0 34 | Q = np.random.normal(loc=0, scale=1, size=(self.d, self.d)) 35 | D = np.diag(abs(eigen_mean + np.random.normal(loc=0, scale=1, size=self.d))) 36 | V0[:,:,k] = abs(np.transpose(Q)*D*Q) 37 | 38 | #sample data 39 | for i in range(self.n): 40 | z = np.random.multinomial(1,pi) 41 | k = np.nonzero(z)[0][0] 42 | self.X[i,:] = np.random.multivariate_normal(mean=mu0[:,k], cov=V0[:,:,k], size=1) 43 | 44 | plt.figure() 45 | plt.scatter(self.X[:,0], self.X[:,1], color='b', alpha=0.5) 46 | plt.title("Ground Truth Data"); plt.xlabel("X1"); plt.ylabel("X2") 47 | plt.show() 48 | 49 | return mu0, V0 50 | 51 | def gmm_em(self): 52 | 53 | #init mu with k-means 54 | kmeans = KMeans(n_clusters=self.K, random_state=42).fit(self.X) 55 | self.mu = np.transpose(kmeans.cluster_centers_) 56 | 57 | #init sigma 58 | for k in range(self.K): 59 | self.sigma[:,:,k] = np.eye(self.d) 60 | 61 | #EM algorithm 62 | max_iter = 10 63 | tol = 1e-5 64 | obj = np.zeros(max_iter) 65 | for iter in range(max_iter): 66 | print("EM iter ", iter) 67 | #E-step 68 | resp, llh = self.estep() 69 | #M-step 70 | self.mstep(resp) 71 | #check convergence 72 | obj[iter] = llh 73 | if (iter > 1 and obj[iter] - obj[iter-1] < tol*abs(obj[iter])): 74 | break 75 | #end if 76 | #end for 77 | plt.figure() 78 | plt.plot(obj) 79 | plt.title('EM-GMM objective'); plt.xlabel("iter"); plt.ylabel("log-likelihood") 80 | plt.show() 81 | 82 | def estep(self): 83 | 84 | log_r = np.zeros((self.n, self.K)) 85 | for k in range(self.K): 86 | log_r[:,k] = multivariate_normal.logpdf(self.X, mean=self.mu[:,k], cov=self.sigma[:,:,k]) 87 | #end for 88 | log_r = 
log_r + np.log(self.pik) 89 | L = logsumexp(log_r, axis=1) 90 | llh = np.sum(L)/self.n #log likelihood 91 | log_r = log_r - L.reshape(-1,1) #normalize 92 | resp = np.exp(log_r) 93 | return resp, llh 94 | 95 | def mstep(self, resp): 96 | 97 | nk = np.sum(resp, axis=0) 98 | self.pik = nk/self.n 99 | sqrt_resp = np.sqrt(resp) 100 | for k in range(self.K): 101 | #update mu 102 | rx = np.multiply(resp[:,k].reshape(-1,1), self.X) 103 | self.mu[:,k] = np.sum(rx, axis=0) / nk[k] 104 | 105 | #update sigma 106 | Xm = self.X - self.mu[:,k] 107 | Xm = np.multiply(sqrt_resp[:,k].reshape(-1,1), Xm) 108 | self.sigma[:,:,k] = np.maximum(0, np.dot(np.transpose(Xm), Xm) / nk[k] + 1e-5 * np.eye(self.d)) 109 | #end for 110 | 111 | if __name__ == '__main__': 112 | 113 | gmm = GMM() 114 | mu0, V0 = gmm.generate_data() 115 | gmm.gmm_em() 116 | 117 | for k in range(mu0.shape[1]): 118 | print("cluster ", k) 119 | print("-----------") 120 | print("ground truth means:") 121 | print(mu0[:,k]) 122 | print("ground truth covariance:") 123 | print(V0[:,:,k]) 124 | #end for 125 | 126 | for k in range(mu0.shape[1]): 127 | print("cluster ", k) 128 | print("-----------") 129 | print("GMM-EM means:") 130 | print(gmm.mu[:,k]) 131 | print("GMM-EM covariance:") 132 | print(gmm.sigma[:,:,k]) 133 | 134 | plt.figure() 135 | ax = plt.axes() 136 | plt.scatter(gmm.X[:,0], gmm.X[:,1], color='b', alpha=0.5) 137 | 138 | for k in range(mu0.shape[1]): 139 | 140 | v, w = linalg.eigh(gmm.sigma[:,:,k]) 141 | v = 2.0 * np.sqrt(2.0) * np.sqrt(v) 142 | u = w[0] / linalg.norm(w[0]) 143 | 144 | # plot an ellipse to show the Gaussian component 145 | angle = np.arctan(u[1] / u[0]) 146 | angle = 180.0 * angle / np.pi # convert to degrees 147 | ell = mpl.patches.Ellipse(gmm.mu[:,k], v[0], v[1], 180.0 + angle, color='r', alpha=0.5) 148 | ax.add_patch(ell) 149 | 150 | # plot cluster centroids 151 | plt.scatter(gmm.mu[0,k], gmm.mu[1,k], s=80, marker='x', color='k', alpha=1) 152 | plt.title("Gaussian Mixture Model"); plt.xlabel("X1"); plt.ylabel("X2") 153 | plt.show() 154 | -------------------------------------------------------------------------------- /chp08/manifold_learning.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from time import time 5 | from sklearn import manifold 6 | 7 | from sklearn.datasets import load_digits 8 | from sklearn.neighbors import KDTree 9 | 10 | def plot_digits(X): 11 | 12 | n_img_per_row = np.amin((20, np.int(np.sqrt(X.shape[0])))) 13 | img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row)) 14 | for i in range(n_img_per_row): 15 | ix = 10 * i + 1 16 | for j in range(n_img_per_row): 17 | iy = 10 * j + 1 18 | img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8)) 19 | 20 | plt.figure() 21 | plt.imshow(img, cmap=plt.cm.binary) 22 | plt.xticks([]) 23 | plt.yticks([]) 24 | plt.title('A selection from the 64-dimensional digits dataset') 25 | 26 | def mnist_manifold(): 27 | 28 | digits = load_digits() 29 | 30 | X = digits.data 31 | y = digits.target 32 | 33 | num_classes = np.unique(y).shape[0] 34 | 35 | plot_digits(X) 36 | 37 | #TSNE 38 | #Barnes-Hut: O(d NlogN) where d is dim and N is the number of samples 39 | #Exact: O(d N^2) 40 | t0 = time() 41 | tsne = manifold.TSNE(n_components = 2, init = 'pca', method = 'barnes_hut', verbose = 1) 42 | X_tsne = tsne.fit_transform(X) 43 | t1 = time() 44 | print('t-SNE: %.2f sec' %(t1-t0)) 45 | tsne.get_params() 46 | 47 | plt.figure() 48 | for k in range(num_classes): 49 | 
plt.plot(X_tsne[y==k,0], X_tsne[y==k,1],'o') 50 | plt.title('t-SNE embedding of digits dataset') 51 | plt.xlabel('X1') 52 | plt.ylabel('X2') 53 | axes = plt.gca() 54 | axes.set_xlim([X_tsne[:,0].min()-1,X_tsne[:,0].max()+1]) 55 | axes.set_ylim([X_tsne[:,1].min()-1,X_tsne[:,1].max()+1]) 56 | plt.show() 57 | 58 | #ISOMAP 59 | #1. Nearest neighbors search: O(d log k N log N) 60 | #2. Shortest path graph search: O(N^2(k+log(N)) 61 | #3. Partial eigenvalue decomposition: O(dN^2) 62 | 63 | t0 = time() 64 | isomap = manifold.Isomap(n_neighbors = 5, n_components = 2) 65 | X_isomap = isomap.fit_transform(X) 66 | t1 = time() 67 | print('Isomap: %.2f sec' %(t1-t0)) 68 | isomap.get_params() 69 | 70 | plt.figure() 71 | for k in range(num_classes): 72 | plt.plot(X_isomap[y==k,0], X_isomap[y==k,1], 'o', label=str(k), linewidth = 2) 73 | plt.title('Isomap embedding of the digits dataset') 74 | plt.xlabel('X1') 75 | plt.ylabel('X2') 76 | plt.show() 77 | 78 | #Use KD-tree to find k-nearest neighbors to a query image 79 | kdt = KDTree(X_isomap) 80 | Q = np.array([[-160, -30],[-102, 14]]) 81 | kdt_dist, kdt_idx = kdt.query(Q,k=20) 82 | plot_digits(X[kdt_idx.ravel(),:]) 83 | 84 | if __name__ == "__main__": 85 | mnist_manifold() 86 | 87 | -------------------------------------------------------------------------------- /chp08/pca.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(42) 5 | 6 | class PCA(): 7 | def __init__(self, n_components = 2): 8 | self.n_components = n_components 9 | 10 | def covariance_matrix(self, X, Y=None): 11 | if Y is None: 12 | Y = X 13 | n_samples = np.shape(X)[0] 14 | covariance_matrix = (1 / (n_samples-1)) * (X - X.mean(axis=0)).T.dot(Y - Y.mean(axis=0)) 15 | return covariance_matrix 16 | 17 | def transform(self, X): 18 | Sigma = self.covariance_matrix(X) 19 | eig_vals, eig_vecs = np.linalg.eig(Sigma) 20 | 21 | #sort from largest to smallest and select the first n_components 22 | idx = eig_vals.argsort()[::-1] 23 | eig_vals = eig_vals[idx][:self.n_components] 24 | eig_vecs = np.atleast_1d(eig_vecs[:,idx])[:, :self.n_components] 25 | 26 | #project the data onto principal components 27 | X_transformed = X.dot(eig_vecs) 28 | 29 | return X_transformed 30 | 31 | if __name__ == "__main__": 32 | 33 | n = 20 34 | d = 5 35 | X = np.random.rand(n,d) 36 | 37 | pca = PCA(n_components = 2) 38 | X_pca = pca.transform(X) 39 | 40 | print(X_pca) 41 | 42 | plt.figure() 43 | plt.scatter(X_pca[:,0], X_pca[:,1], color='b', alpha=0.5) 44 | plt.title("Principal Component Analysis"); plt.xlabel("X1"); plt.ylabel("X2") 45 | plt.show() -------------------------------------------------------------------------------- /chp09/ga.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import string 3 | 4 | class GeneticAlgorithm(): 5 | 6 | def __init__(self, target_string, population_size, mutation_rate): 7 | self.target = target_string 8 | self.population_size = population_size 9 | self.mutation_rate = mutation_rate 10 | self.letters = [" "] + list(string.ascii_letters) 11 | 12 | def initialize(self): 13 | # init population with random strings 14 | self.population = [] 15 | for _ in range(self.population_size): 16 | individual = "".join(np.random.choice(self.letters, size=len(self.target))) 17 | self.population.append(individual) 18 | 19 | def calculate_fitness(self): 20 | #calculate fitness of each individual in a population 21 | 
population_fitness = [] 22 | for individual in self.population: 23 | # calculate loss as the distance between characters 24 | loss = 0 25 | for i in range(len(individual)): 26 | letter_i1 = self.letters.index(individual[i]) 27 | letter_i2 = self.letters.index(self.target[i]) 28 | loss += abs(letter_i1 - letter_i2) 29 | fitness = 1 / (loss + 1e-6) 30 | population_fitness.append(fitness) 31 | return population_fitness 32 | 33 | def mutate(self, individual): 34 | #randomly change the characters with probability equal to mutation_rate 35 | individual = list(individual) 36 | for j in range(len(individual)): 37 | if np.random.random() < self.mutation_rate: 38 | individual[j] = np.random.choice(self.letters) 39 | return "".join(individual) 40 | 41 | def crossover(self, parent1, parent2): 42 | #create children from parents by crossover 43 | cross_i = np.random.randint(0, len(parent1)) 44 | child1 = parent1[:cross_i] + parent2[cross_i:] 45 | child2 = parent2[:cross_i] + parent1[cross_i:] 46 | return child1, child2 47 | 48 | def run(self, iterations): 49 | self.initialize() 50 | 51 | for epoch in range(iterations): 52 | population_fitness = self.calculate_fitness() 53 | 54 | fittest_individual = self.population[np.argmax(population_fitness)] 55 | highest_fitness = max(population_fitness) 56 | 57 | if fittest_individual == self.target: 58 | break 59 | 60 | #select individual as a parent proportional to individual's fitness 61 | parent_probabilities = [fitness / sum(population_fitness) for fitness in population_fitness] 62 | 63 | #next generation 64 | new_population = [] 65 | for i in np.arange(0, self.population_size, 2): 66 | #select two parents 67 | parent1, parent2 = np.random.choice(self.population, size=2, p=parent_probabilities, replace=False) 68 | #crossover to produce offspring 69 | child1, child2 = self.crossover(parent1, parent2) 70 | #save mutated offspring for next generation 71 | new_population += [self.mutate(child1), self.mutate(child2)] 72 | 73 | print("iter %d, closest candidate: %s, fitness: %.4f" %(epoch, fittest_individual, highest_fitness)) 74 | self.population = new_population 75 | 76 | print("iter %d, final candidate: %s" %(epoch, fittest_individual)) 77 | 78 | if __name__ == "__main__": 79 | 80 | target_string = "Genome" 81 | population_size = 50 82 | mutation_rate = 0.1 83 | 84 | ga = GeneticAlgorithm(target_string, population_size, mutation_rate) 85 | ga.run(iterations = 1000) 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | -------------------------------------------------------------------------------- /chp09/inv_cov.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from scipy import linalg 4 | 5 | from datetime import datetime 6 | import pytz 7 | 8 | from sklearn.datasets import make_sparse_spd_matrix 9 | from sklearn.covariance import GraphicalLassoCV, ledoit_wolf 10 | from sklearn.preprocessing import StandardScaler 11 | from sklearn import cluster, manifold 12 | 13 | import seaborn as sns 14 | import matplotlib.pyplot as plt 15 | from matplotlib.collections import LineCollection 16 | 17 | import pandas_datareader.data as web 18 | 19 | np.random.seed(42) 20 | 21 | def main(): 22 | 23 | #generate data (synthetic) 24 | #num_samples = 60 25 | #num_features = 20 26 | #prec = make_sparse_spd_matrix(num_features, alpha=0.95, smallest_coef=0.4, largest_coef=0.7) 27 | #cov = linalg.inv(prec) 28 | #X = np.random.multivariate_normal(np.zeros(num_features), cov, size=num_samples) 29 | #X = 
StandardScaler().fit_transform(X) 30 | 31 | #generate data (actual) 32 | STOCKS = { 33 | 'SPY': 'S&P500', 34 | 'LQD': 'Bond_Corp', 35 | 'TIP': 'Bond_Treas', 36 | 'GLD': 'Gold', 37 | 'MSFT': 'Microsoft', 38 | 'XOM': 'Exxon', 39 | 'AMZN': 'Amazon', 40 | 'BAC': 'BofA', 41 | 'NVS': 'Novartis'} 42 | 43 | symbols, names = np.array(list(STOCKS.items())).T 44 | 45 | #load data 46 | #year, month, day, hour, minute, second, microsecond 47 | start = datetime(2015, 1, 1, 0, 0, 0, 0, pytz.utc) 48 | end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc) 49 | 50 | qopen, qclose = [], [] 51 | data_close, data_open = pd.DataFrame(), pd.DataFrame() 52 | for ticker in symbols: 53 | price = web.DataReader(ticker, 'stooq', start, end) 54 | qopen.append(price['Open']) 55 | qclose.append(price['Close']) 56 | 57 | data_open = pd.concat(qopen, axis=1) 58 | data_open.columns = symbols 59 | data_close = pd.concat(qclose, axis=1) 60 | data_close.columns = symbols 61 | 62 | #per day variation in price for each symbol 63 | variation = data_close - data_open 64 | variation = variation.dropna() 65 | 66 | X = variation.values 67 | X /= X.std(axis=0) #standardize to use correlations rather than covariance 68 | 69 | #estimate inverse covariance 70 | graph = GraphicalLassoCV() 71 | graph.fit(X) 72 | 73 | gl_cov = graph.covariance_ 74 | gl_prec = graph.precision_ 75 | gl_alphas = graph.cv_alphas_ 76 | gl_scores = graph.cv_results_['mean_test_score'] 77 | 78 | plt.figure() 79 | sns.heatmap(gl_prec, xticklabels=names, yticklabels=names) 80 | plt.xticks(rotation=45) 81 | plt.yticks(rotation=45) 82 | plt.tight_layout() 83 | plt.show() 84 | 85 | plt.figure() 86 | plt.plot(gl_alphas, gl_scores, marker='o', color='b', lw=2.0, label='GraphLassoCV') 87 | plt.title("Graph Lasso Alpha Selection") 88 | plt.xlabel("alpha") 89 | plt.ylabel("score") 90 | plt.legend() 91 | plt.show() 92 | 93 | #cluster using affinity propagation 94 | _, labels = cluster.affinity_propagation(gl_cov) 95 | num_labels = np.max(labels) 96 | 97 | for i in range(num_labels+1): 98 | print("Cluster %i: %s" %((i+1), ', '.join(names[labels==i]))) 99 | 100 | #find a low dim embedding for visualization 101 | node_model = manifold.LocallyLinearEmbedding(n_components=2, n_neighbors=6, eigen_solver='dense') 102 | embedding = node_model.fit_transform(X.T).T 103 | 104 | #generate plots 105 | plt.figure() 106 | plt.clf() 107 | ax = plt.axes([0.,0.,1.,1.]) 108 | plt.axis('off') 109 | 110 | partial_corr = gl_prec 111 | d = 1 / np.sqrt(np.diag(partial_corr)) 112 | non_zero = (np.abs(np.triu(partial_corr, k=1)) > 0.02) #connectivity matrix 113 | 114 | #plot the nodes 115 | plt.scatter(embedding[0], embedding[1], s = 100*d**2, c = labels, cmap = plt.cm.Spectral) 116 | 117 | #plot the edges 118 | start_idx, end_idx = np.where(non_zero) 119 | segments = [[embedding[:,start], embedding[:,stop]] for start, stop in zip(start_idx, end_idx)] 120 | values = np.abs(partial_corr[non_zero]) 121 | lc = LineCollection(segments, zorder=0, cmap=plt.cm.hot_r, norm=plt.Normalize(0,0.7*values.max())) 122 | lc.set_array(values) 123 | lc.set_linewidths(2*values) 124 | ax.add_collection(lc) 125 | 126 | #plot the labels 127 | for index, (name, label, (x,y)) in enumerate(zip(names, labels, embedding.T)): 128 | plt.text(x,y,name,size=12) 129 | 130 | plt.show() 131 | 132 | if __name__ == "__main__": 133 | main() 134 | -------------------------------------------------------------------------------- /chp09/kde.py: -------------------------------------------------------------------------------- 1 | import numpy as 
np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(14) 5 | 6 | class KDE(): 7 | 8 | def __init__(self): 9 | #Histogram and Gaussian Kernel Estimator used to 10 | #analyze RNA-seq data for flux estimation of a T7 promoter 11 | self.G = 1e9 #length of genome in base pairs (bp) 12 | self.C = 1e3 #number of unique molecules 13 | self.L = 100 #length of a read, bp 14 | self.N = 1e6 #number of reads, L bp long 15 | self.M = 1e4 #number of unique read sequences, bp 16 | self.LN = 1000 #total length of assembled / mapped RNA-seq reads 17 | self.FDR = 0.05 #false discovery rate 18 | 19 | #uniform sampling (poisson model) 20 | self.lmbda = (self.N * self.L) / self.G #expected number of bases covered 21 | self.C_est = self.M/(1-np.exp(-self.lmbda)) #library size estimate 22 | self.C_cvrg = self.G - self.G * np.exp(-self.lmbda) #base coverage 23 | self.N_gaps = self.N * np.exp(-self.lmbda) #number of gaps (uncovered bases) 24 | 25 | #gamma prior sampling (negative binomial model) 26 | #X = "number of failures before rth success" 27 | self.k = 0.5 # dispersion parameter (fit to data) 28 | self.p = self.lmbda/(self.lmbda + 1/self.k) # success probability 29 | self.r = 1/self.k # number of successes 30 | 31 | #RNAP binding data (RNA-seq) 32 | self.data = np.random.negative_binomial(self.r, self.p, size=self.LN) 33 | 34 | def histogram(self): 35 | self.bin_delta = 1 #smoothing parameter 36 | self.bin_range = np.arange(1, np.max(self.data), self.bin_delta) 37 | self.bin_counts, _ = np.histogram(self.data, bins=self.bin_range) 38 | 39 | #histogram density estimation 40 | #P = integral_R p(x) dx, where X is in R^3 41 | #p(x) = K/(NxV), where K=number of points in region R 42 | #N=total number of points, V=volume of region R 43 | 44 | rnap_density_est = self.bin_counts/(sum(self.bin_counts) * self.bin_delta) 45 | return rnap_density_est 46 | 47 | def kernel(self): 48 | #Gaussian kernel density estimator with smoothing parameter h 49 | #sum N Guassians centered at each data point, parameterized by common std dev h 50 | 51 | x_dim = 1 #dimension of x 52 | h = 10 #standard deviation 53 | 54 | rnap_density_support = np.arange(np.max(self.data)) 55 | rnap_density_est = 0 56 | for i in range(np.sum(self.bin_counts)): 57 | rnap_density_est += (1/(2*np.pi*h**2)**(x_dim/2.0))*np.exp(-(rnap_density_support - self.data[i])**2 / (2.0*h**2)) 58 | #end for 59 | 60 | rnap_density_est = rnap_density_est / np.sum(rnap_density_est) 61 | return rnap_density_est 62 | 63 | if __name__ == "__main__": 64 | 65 | kde = KDE() 66 | est1 = kde.histogram() 67 | est2 = kde.kernel() 68 | 69 | plt.figure() 70 | plt.plot(est1, '-b', label='histogram') 71 | plt.plot(est2, '--r', label='gaussian kernel') 72 | plt.title("RNA-seq density estimate based on negative binomial model") 73 | plt.xlabel("read length, [base pairs]"); plt.ylabel("density"); plt.legend() 74 | plt.show() -------------------------------------------------------------------------------- /chp09/lda.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.datasets import fetch_20newsgroups 4 | from sklearn.feature_extraction.text import TfidfVectorizer 5 | from wordcloud import WordCloud 6 | from scipy.special import digamma, gammaln 7 | 8 | np.random.seed(12) 9 | 10 | class LDA: 11 | def __init__(self, A, K): 12 | self.N = A.shape[0] # word (dictionary size) 13 | self.D = A.shape[1] # number of documents 14 | self.K = num_topics # number of topics 15 | 16 | self.A 
= A #term-document matrix 17 | 18 | #init word distribution beta 19 | self.eta = np.ones(self.N) #uniform dirichlet prior on words 20 | self.beta = np.zeros((self.N, self.K)) #NxK topic matrix 21 | for k in range(self.K): 22 | self.beta[:,k] = np.random.dirichlet(self.eta) 23 | self.beta[:,k] = self.beta[:,k] + 1e-6 #to avoid zero entries 24 | self.beta[:,k] = self.beta[:,k]/np.sum(self.beta[:,k]) 25 | #end for 26 | 27 | #init topic proportions theta and cluster assignments z 28 | self.alpha = np.ones(self.K) #uniform dirichlet prior on topics 29 | self.z = np.zeros((self.N, self.D)) #cluster assignments z_{id} 30 | for d in range(self.D): 31 | theta = np.random.dirichlet(self.alpha) 32 | wdn_idx = np.nonzero(self.A[:,d])[0] 33 | for i in range(len(wdn_idx)): 34 | z_idx = np.argmax(np.random.multinomial(1, theta)) 35 | self.z[wdn_idx[i],d] = z_idx #topic id 36 | #end for 37 | #end for 38 | 39 | #init variational parameters 40 | self.gamma = np.ones((self.D, self.K)) #topic proportions 41 | for d in range(self.D): 42 | theta = np.random.dirichlet(self.alpha) 43 | self.gamma[d,:] = theta 44 | #end for 45 | 46 | self.lmbda = np.transpose(self.beta) #np.ones((self.K, self.N))/self.N #word frequencies 47 | 48 | self.phi = np.zeros((self.D, self.N, self.K)) #assignments 49 | for d in range(self.D): 50 | for w in range(self.N): 51 | theta = np.random.dirichlet(self.alpha) 52 | self.phi[d,w,:] = np.random.multinomial(1, theta) 53 | #end for 54 | #end for 55 | 56 | def variational_inference(self): 57 | 58 | var_iter = 10 59 | llh = np.zeros(var_iter) 60 | llh_delta = np.zeros(var_iter) 61 | 62 | for iter in range(var_iter): 63 | print("VI iter: ", iter) 64 | J_old = self.elbo_objective() 65 | self.mean_field_update() 66 | J_new = self.elbo_objective() 67 | 68 | llh[iter] = J_old 69 | llh_delta[iter] = J_new - J_old 70 | #end for 71 | 72 | #update alpha and beta 73 | for k in range(self.K): 74 | self.alpha[k] = np.sum(self.gamma[:,k]) 75 | self.beta[:,k] = self.lmbda[k,:] / np.sum(self.lmbda[k,:]) 76 | #end for 77 | 78 | #update topic assignments 79 | for d in range(self.D): 80 | wdn_idx = np.nonzero(self.A[:,d])[0] 81 | for i in range(len(wdn_idx)): 82 | z_idx = np.argmax(self.phi[d,wdn_idx[i],:]) 83 | self.z[wdn_idx[i],d] = z_idx #topic id 84 | #end for 85 | #end for 86 | 87 | plt.figure() 88 | plt.plot(llh); plt.title('LDA VI'); 89 | plt.xlabel('mean field iterations'); plt.ylabel("ELBO") 90 | plt.show() 91 | 92 | return llh 93 | 94 | def mean_field_update(self): 95 | 96 | ndw = np.zeros((self.D, self.N)) #word counts for each document 97 | for d in range(self.D): 98 | doc = self.A[:,d] 99 | wdn_idx = np.nonzero(doc)[0] 100 | 101 | for i in range(len(wdn_idx)): 102 | ndw[d,wdn_idx[i]] += 1 103 | #end for 104 | 105 | #update gamma 106 | for k in range(self.K): 107 | self.gamma[d,k] = self.alpha[k] + np.dot(ndw[d,:], self.phi[d,:,k]) 108 | #end for 109 | 110 | #update phi 111 | for w in range(len(wdn_idx)): 112 | self.phi[d,wdn_idx[w],:] = np.exp(digamma(self.gamma[d,:]) - digamma(np.sum(self.gamma[d,:])) + digamma(self.lmbda[:,wdn_idx[w]]) - digamma(np.sum(self.lmbda, axis=1))) 113 | if (np.sum(self.phi[d,wdn_idx[w],:]) > 0): #to avoid 0/0 114 | self.phi[d,wdn_idx[w],:] = self.phi[d,wdn_idx[w],:] / np.sum(self.phi[d,wdn_idx[w],:]) #normalize phi 115 | #end if 116 | #end for 117 | 118 | #end for 119 | 120 | #update lambda given ndw for all docs 121 | for k in range(self.K): 122 | self.lmbda[k,:] = self.eta 123 | for d in range(self.D): 124 | self.lmbda[k,:] += np.multiply(ndw[d,:], self.phi[d,:,k]) 
125 | #end for 126 | #end for 127 | 128 | def elbo_objective(self): 129 | #see Blei 2003 130 | 131 | T1_A = gammaln(np.sum(self.alpha)) - np.sum(gammaln(self.alpha)) 132 | T1_B = 0 133 | for k in range(self.K): 134 | T1_B += np.dot(self.alpha[k]-1, digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1))) 135 | T1 = T1_A + T1_B 136 | 137 | T2 = 0 138 | for n in range(self.N): 139 | for k in range(self.K): 140 | T2 += self.phi[:,n,k] * (digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1))) 141 | 142 | T3 = 0 143 | for n in range(self.N): 144 | for k in range(self.K): 145 | T3 += self.phi[:,n,k] * np.log(self.beta[n,k]) 146 | 147 | T4 = 0 148 | T4_A = -gammaln(np.sum(self.gamma, axis=1)) + np.sum(gammaln(self.gamma), axis=1) 149 | T4_B = 0 150 | for k in range(self.K): 151 | T4_B = -(self.gamma[:,k]-1) * (digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1))) 152 | T4 = T4_A + T4_B 153 | 154 | T5 = 0 155 | for n in range(self.N): 156 | for k in range(self.K): 157 | T5 += -np.multiply(self.phi[:,n,k], np.log(self.phi[:,n,k] + 1e-6)) 158 | 159 | T15 = T1 + T2 + T3 + T4 + T5 160 | J = sum(T15)/self.D #averaged over documents 161 | return J 162 | 163 | if __name__ == "__main__": 164 | 165 | #LDA parameters 166 | num_features = 1000 #vocabulary size 167 | num_topics = 4 #fixed for LD 168 | 169 | #20 newsgroups dataset 170 | categories = ['sci.crypt', 'comp.graphics', 'sci.space', 'talk.religion.misc'] 171 | 172 | newsgroups = fetch_20newsgroups(shuffle=True, random_state=42, subset='train', 173 | remove=('headers', 'footers', 'quotes'), categories=categories) 174 | 175 | vectorizer = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=2, stop_words = 'english') 176 | dataset = vectorizer.fit_transform(newsgroups.data) 177 | A = np.transpose(dataset.toarray()) #term-document matrix 178 | 179 | lda = LDA(A=A, K=num_topics) 180 | llh = lda.variational_inference() 181 | id2word = {v:k for k,v in vectorizer.vocabulary_.items()} 182 | 183 | #display topics 184 | for k in range(num_topics): 185 | print("topic: ", k) 186 | print("----------") 187 | topic_words = "" 188 | top_words = np.argsort(lda.lmbda[k,:])[-10:] 189 | for i in range(len(top_words)): 190 | topic_words += id2word[top_words[i]] + " " 191 | print(id2word[top_words[i]]) 192 | 193 | wordcloud = WordCloud(width = 800, height = 800, 194 | background_color ='white', 195 | min_font_size = 10).generate(topic_words) 196 | 197 | plt.figure() 198 | plt.imshow(wordcloud) 199 | plt.axis("off") 200 | plt.tight_layout(pad = 0) 201 | plt.show() 202 | -------------------------------------------------------------------------------- /chp09/portfolio_opt.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import pandas as pd 4 | import matplotlib.pyplot as plt 5 | 6 | from sklearn.neighbors import KDTree 7 | from pandas.plotting import scatter_matrix 8 | from scipy.spatial import ConvexHull 9 | 10 | import pandas_datareader.data as web 11 | from datetime import datetime 12 | import pytz 13 | 14 | STOCKS = ['SPY','LQD','TIP','GLD','MSFT'] 15 | 16 | np.random.seed(42) 17 | 18 | if __name__ == "__main__": 19 | 20 | plt.close("all") 21 | 22 | #load data 23 | #year, month, day, hour, minute, second, microsecond 24 | start = datetime(2012, 1, 1, 0, 0, 0, 0, pytz.utc) 25 | end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc) 26 | 27 | data = pd.DataFrame() 28 | series = [] 29 | for ticker in STOCKS: 30 | price = web.DataReader(ticker, 'stooq', start, end) 31 | 
series.append(price['Close']) 32 | 33 | data = pd.concat(series, axis=1) 34 | data.columns = STOCKS 35 | data = data.dropna() 36 | 37 | #plot data correlations 38 | scatter_matrix(data, alpha=0.2, diagonal='kde') 39 | plt.show() 40 | 41 | #get current portfolio 42 | cash = 10000 43 | num_assets = np.size(STOCKS) 44 | cur_value = (1e4-5e3)*np.random.rand(num_assets,1) + 5e3 45 | tot_value = np.sum(cur_value) 46 | weights = cur_value.ravel()/float(tot_value) 47 | 48 | #compute portfolio risk 49 | Sigma = data.cov().values 50 | Corr = data.corr().values 51 | volatility = np.sqrt(np.dot(weights.T, np.dot(Sigma, weights))) 52 | 53 | plt.figure() 54 | plt.title('Correlation Matrix') 55 | plt.imshow(Corr, cmap='gray') 56 | plt.xticks(range(len(STOCKS)),data.columns) 57 | plt.yticks(range(len(STOCKS)),data.columns) 58 | plt.colorbar() 59 | plt.show() 60 | 61 | #generate random portfolio weights 62 | num_trials = 1000 63 | W = np.random.rand(num_trials, np.size(weights)) 64 | W = W/np.sum(W,axis=1).reshape(num_trials,1) #normalize 65 | 66 | pv = np.zeros(num_trials) #portoflio value w'v 67 | ps = np.zeros(num_trials) #portfolio sigma: sqrt(w'Sw) 68 | 69 | avg_price = data.mean().values 70 | adj_price = avg_price 71 | 72 | for i in range(num_trials): 73 | pv[i] = np.sum(adj_price * W[i,:]) 74 | ps[i] = np.sqrt(np.dot(W[i,:].T, np.dot(Sigma, W[i,:]))) 75 | 76 | points = np.vstack((ps,pv)).T 77 | hull = ConvexHull(points) 78 | 79 | plt.figure() 80 | plt.scatter(ps, pv, marker='o', color='b', linewidth = 3.0, label = 'tangent portfolio') 81 | plt.scatter(volatility, np.sum(adj_price * weights), marker = 's', color = 'r', linewidth = 3.0, label = 'current') 82 | plt.plot(points[hull.vertices,0], points[hull.vertices,1], linewidth = 2.0) 83 | plt.title('expected return vs volatility') 84 | plt.ylabel('expected price') 85 | plt.xlabel('portfolio std dev') 86 | plt.legend() 87 | plt.grid(True) 88 | plt.show() 89 | 90 | #query for nearest neighbor portfolio 91 | knn = 5 92 | kdt = KDTree(points) 93 | query_point = np.array([2, 115]).reshape(1,-1) 94 | kdt_dist, kdt_idx = kdt.query(query_point,k=knn) 95 | print("top-%d closest to query portfolios:" %knn) 96 | print("values: ", pv[kdt_idx.ravel()]) 97 | print("sigmas: ", ps[kdt_idx.ravel()]) 98 | 99 | -------------------------------------------------------------------------------- /chp09/sim_annealing.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | np.random.seed(42) 5 | 6 | class simulated_annealing(): 7 | def __init__(self): 8 | self.max_iter = 1000 9 | self.conv_thresh = 1e-4 10 | self.conv_window = 10 11 | 12 | self.samples = np.zeros((self.max_iter, 2)) 13 | self.energies = np.zeros(self.max_iter) 14 | self.temperatures = np.zeros(self.max_iter) 15 | 16 | def target(self, x, y): 17 | z = 3*(1-x)**2 * np.exp(-x**2 - (y+1)**2) \ 18 | - 10*(x/5 -x**3 - y**5) * np.exp(-x**2 - y**2) \ 19 | - (1/3)*np.exp(-(x+1)**2 - y**2) 20 | return z 21 | 22 | def proposal(self, x, y): 23 | mean = np.array([x, y]) 24 | cov = 1.1 * np.eye(2) 25 | x_new, y_new = np.random.multivariate_normal(mean, cov) 26 | return x_new, y_new 27 | 28 | def temperature_schedule(self, T, iter): 29 | return 0.9 * T 30 | 31 | def run(self, x_init, y_init): 32 | 33 | converged = False 34 | T = 1 35 | self.temperatures[0] = T 36 | num_accepted = 0 37 | x_old, y_old = x_init, y_init 38 | energy_old = self.target(x_init, y_init) 39 | 40 | iter = 1 41 | while not converged: 42 | print("iter: {:4d}, 
temp: {:.4f}, energy = {:.6f}".format(iter, T, energy_old)) 43 | x_new, y_new = self.proposal(x_old, y_old) 44 | energy_new = self.target(x_new, y_new) 45 | 46 | #check convergence 47 | if iter > 2*self.conv_window: 48 | vals = self.energies[iter-self.conv_window : iter-1] 49 | if (np.std(vals) < self.conv_thresh): 50 | converged = True 51 | #end if 52 | #end if 53 | 54 | alpha = np.exp((energy_old - energy_new)/T) 55 | r = np.minimum(1, alpha) 56 | u = np.random.uniform(0, 1) 57 | if u < r: 58 | x_old, y_old = x_new, y_new 59 | num_accepted += 1 60 | energy_old = energy_new 61 | #end if 62 | self.samples[iter, :] = np.array([x_old, y_old]) 63 | self.energies[iter] = energy_old 64 | 65 | T = self.temperature_schedule(T, iter) 66 | self.temperatures[iter] = T 67 | 68 | iter = iter + 1 69 | 70 | if (iter >= self.max_iter): converged = True #stop before indexing past the preallocated arrays 71 | #end while 72 | 73 | niter = iter - 1 74 | acceptance_rate = num_accepted / niter 75 | print("acceptance rate: ", acceptance_rate) 76 | 77 | x_opt, y_opt = x_old, y_old 78 | 79 | return x_opt, y_opt, self.samples[:niter,:], self.energies[:niter], self.temperatures[:niter] 80 | 81 | if __name__ == "__main__": 82 | 83 | SA = simulated_annealing() 84 | 85 | nx, ny = (1000, 1000) 86 | x = np.linspace(-2, 2, nx) 87 | y = np.linspace(-2, 2, ny) 88 | xv, yv = np.meshgrid(x, y) 89 | 90 | z = SA.target(xv, yv) 91 | plt.figure() 92 | plt.contourf(x, y, z) 93 | plt.title("energy landscape") 94 | plt.show() 95 | 96 | #find global minimum by exhaustive search 97 | min_search = np.min(z) 98 | argmin_search = np.argwhere(z == min_search) 99 | ymin, xmin = argmin_search[0][0], argmin_search[0][1] #rows of z index y, columns index x 100 | print("global minimum (exhaustive search): ", min_search) 101 | print("located at (x, y): ", x[xmin], y[ymin]) 102 | 103 | #find global minimum by simulated annealing 104 | x_init, y_init = 0, 0 105 | x_opt, y_opt, samples, energies, temperatures = SA.run(x_init, y_init) 106 | print("global minimum (simulated annealing): ", energies[-1]) 107 | print("located at (x, y): ", x_opt, y_opt) 108 | 109 | plt.figure() 110 | plt.plot(energies) 111 | plt.title("SA sampled energies") 112 | plt.show() 113 | 114 | plt.figure() 115 | plt.plot(temperatures) 116 | plt.title("Temperature Schedule") 117 | plt.show() 118 | -------------------------------------------------------------------------------- /chp10/image_search.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import tensorflow as tf 4 | from tensorflow import keras 5 | 6 | from keras import Model 7 | from keras.applications.resnet50 import ResNet50 8 | from keras.preprocessing import image 9 | from keras.applications.resnet50 import preprocess_input 10 | 11 | from keras.callbacks import ModelCheckpoint 12 | from keras.callbacks import TensorBoard 13 | from keras.callbacks import LearningRateScheduler 14 | from keras.callbacks import EarlyStopping 15 | 16 | import os 17 | import random 18 | from PIL import Image 19 | from scipy.spatial import distance 20 | from sklearn.decomposition import PCA 21 | 22 | import matplotlib.pyplot as plt 23 | 24 | tf.keras.utils.set_random_seed(42) 25 | 26 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 27 | DATA_PATH = "/content/drive/MyDrive/data/101_ObjectCategories/" 28 | 29 | def get_closest_images(acts, query_image_idx, num_results=5): 30 | 31 | num_images, dim = acts.shape 32 | distances = [] 33 | for image_idx in range(num_images): 34 |
distances.append(distance.euclidean(acts[query_image_idx, :], acts[image_idx, :])) 35 | #end for 36 | idx_closest = sorted(range(len(distances)), key=lambda k: distances[k])[1:num_results+1] 37 | 38 | return idx_closest 39 | 40 | def get_concatenated_images(images, indexes, thumb_height): 41 | 42 | thumbs = [] 43 | for idx in indexes: 44 | img = Image.open(images[idx]) 45 | img = img.resize((int(img.width * thumb_height / img.height), int(thumb_height)), Image.ANTIALIAS) 46 | if img.mode != "RGB": 47 | img = img.convert("RGB") 48 | thumbs.append(img) 49 | concat_image = np.concatenate([np.asarray(t) for t in thumbs], axis=1) 50 | 51 | return concat_image 52 | 53 | if __name__ == "__main__": 54 | 55 | num_images = 5000 56 | images = [os.path.join(dp,f) for dp, dn, filenames in os.walk(DATA_PATH) for f in filenames \ 57 | if os.path.splitext(f)[1].lower() in ['.jpg','.png','.jpeg']] 58 | images = [images[i] for i in sorted(random.sample(range(len(images)), num_images))] 59 | 60 | #CNN encodings 61 | base_model = ResNet50(weights='imagenet') 62 | model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output) 63 | 64 | activations = [] 65 | for idx, image_path in enumerate(images): 66 | if idx % 100 == 0: 67 | print('getting activations for %d/%d image...' %(idx,len(images))) 68 | img = image.load_img(image_path, target_size=(224, 224)) 69 | x = image.img_to_array(img) 70 | x = np.expand_dims(x, axis=0) 71 | x = preprocess_input(x) 72 | features = model.predict(x) 73 | activations.append(features.flatten().reshape(1,-1)) 74 | 75 | # reduce activation dimension 76 | print('computing PCA...') 77 | acts = np.concatenate(activations, axis=0) 78 | pca = PCA(n_components=300) 79 | pca.fit(acts) 80 | acts = pca.transform(acts) 81 | 82 | print('image search...') 83 | query_image_idx = int(num_images*random.random()) 84 | idx_closest = get_closest_images(acts, query_image_idx) 85 | query_image = get_concatenated_images(images, [query_image_idx], 300) 86 | results_image = get_concatenated_images(images, idx_closest, 300) 87 | 88 | plt.figure() 89 | plt.imshow(query_image) 90 | plt.title("query image (%d)" %query_image_idx) 91 | plt.show() 92 | #plt.savefig('./figures/query_image.png') 93 | 94 | plt.figure() 95 | plt.imshow(results_image) 96 | plt.title("result images") 97 | plt.show() 98 | #plt.savefig('./figures/result_images.png') -------------------------------------------------------------------------------- /chp10/keras_optimizers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import tensorflow as tf 4 | from tensorflow import keras 5 | 6 | from keras import backend as K 7 | from keras.models import Sequential 8 | from keras.layers import Dense, Dropout, Flatten 9 | from keras.layers import Conv2D, MaxPooling2D, Activation 10 | 11 | from keras.callbacks import ModelCheckpoint 12 | from keras.callbacks import TensorBoard 13 | from keras.callbacks import LearningRateScheduler 14 | from keras.callbacks import EarlyStopping 15 | 16 | import math 17 | import matplotlib.pyplot as plt 18 | 19 | tf.keras.utils.set_random_seed(42) 20 | 21 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 22 | 23 | def scheduler(epoch, lr): 24 | if epoch < 4: 25 | return lr 26 | else: 27 | return lr * tf.math.exp(-0.1) 28 | 29 | if __name__ == "__main__": 30 | 31 | img_rows, img_cols = 32, 32 32 | (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data() 33 | x_train = 
x_train.reshape(x_train.shape[0], img_rows, img_cols, 3).astype("float32") / 255 34 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 3).astype("float32") / 255 35 | 36 | y_train_label = keras.utils.to_categorical(y_train) 37 | y_test_label = keras.utils.to_categorical(y_test) 38 | num_classes = y_train_label.shape[1] 39 | 40 | #training parameters 41 | batch_size = 256 42 | num_epochs = 32 43 | 44 | #model parameters 45 | num_filters_l1 = 64 46 | num_filters_l2 = 128 47 | 48 | #CNN architecture 49 | cnn = Sequential() 50 | cnn.add(Conv2D(num_filters_l1, kernel_size = (5, 5), input_shape=(img_rows, img_cols, 3), padding='same')) 51 | cnn.add(Activation('relu')) 52 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 53 | 54 | cnn.add(Conv2D(num_filters_l2, kernel_size = (5, 5), padding='same')) 55 | cnn.add(Activation('relu')) 56 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 57 | 58 | cnn.add(Flatten()) 59 | cnn.add(Dense(128)) 60 | cnn.add(Activation('relu')) 61 | 62 | cnn.add(Dense(num_classes)) 63 | cnn.add(Activation('softmax')) 64 | 65 | #optimizers 66 | opt1 = tf.keras.optimizers.SGD() 67 | opt2 = tf.keras.optimizers.SGD(momentum=0.9, nesterov=True) 68 | opt3 = tf.keras.optimizers.RMSprop() 69 | opt4 = tf.keras.optimizers.Adam() 70 | 71 | optimizer_list = [opt1, opt2, opt3, opt4] 72 | 73 | history_list = [] 74 | 75 | for idx in range(len(optimizer_list)): 76 | 77 | K.clear_session() 78 | 79 | opt = optimizer_list[idx] 80 | 81 | cnn.compile( 82 | loss=keras.losses.CategoricalCrossentropy(), 83 | optimizer=opt, 84 | metrics=["accuracy"] 85 | ) 86 | 87 | #define callbacks 88 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 89 | callbacks_list = [reduce_lr] 90 | 91 | #training loop 92 | hist = cnn.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2) 93 | history_list.append(hist) 94 | 95 | #end for 96 | 97 | plt.figure() 98 | plt.plot(history_list[0].history['loss'], 'b', lw=2.0, label='SGD') 99 | plt.plot(history_list[1].history['loss'], '--r', lw=2.0, label='SGD Nesterov') 100 | plt.plot(history_list[2].history['loss'], ':g', lw=2.0, label='RMSProp') 101 | plt.plot(history_list[3].history['loss'], '-.k', lw=2.0, label='ADAM') 102 | plt.title('LeNet, CIFAR-100, Optimizers') 103 | plt.xlabel('Epochs') 104 | plt.ylabel('Cross-Entropy Training Loss') 105 | plt.legend(loc='upper right') 106 | plt.show() 107 | #plt.savefig('./figures/lenet_loss.png') 108 | 109 | plt.figure() 110 | plt.plot(history_list[0].history['val_accuracy'], 'b', lw=2.0, label='SGD') 111 | plt.plot(history_list[1].history['val_accuracy'], '--r', lw=2.0, label='SGD Nesterov') 112 | plt.plot(history_list[2].history['val_accuracy'], ':g', lw=2.0, label='RMSProp') 113 | plt.plot(history_list[3].history['val_accuracy'], '-.k', lw=2.0, label='ADAM') 114 | plt.title('LeNet, CIFAR-100, Optimizers') 115 | plt.xlabel('Epochs') 116 | plt.ylabel('Validation Accuracy') 117 | plt.legend(loc='upper right') 118 | plt.show() 119 | #plt.savefig('./figures/lenet_loss.png') 120 | 121 | plt.figure() 122 | plt.plot(history_list[0].history['lr'], 'b', lw=2.0, label='SGD') 123 | plt.plot(history_list[1].history['lr'], '--r', lw=2.0, label='SGD Nesterov') 124 | plt.plot(history_list[2].history['lr'], ':g', lw=2.0, label='RMSProp') 125 | plt.plot(history_list[3].history['lr'], '-.k', lw=2.0, label='ADAM') 126 | plt.title('LeNet, CIFAR-100, Optimizers') 127 | plt.xlabel('Epochs') 128 | plt.ylabel('Learning Rate Schedule') 129 | 
plt.legend(loc='upper right') 130 | plt.show() 131 | #plt.savefig('./figures/lenet_loss.png') 132 | 133 | -------------------------------------------------------------------------------- /chp10/lenet.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import tensorflow as tf 4 | from tensorflow import keras 5 | 6 | from keras.models import Sequential 7 | from keras.layers import Dense, Dropout, Flatten 8 | from keras.layers import Conv2D, MaxPooling2D, Activation 9 | 10 | from keras.callbacks import ModelCheckpoint 11 | from keras.callbacks import TensorBoard 12 | from keras.callbacks import LearningRateScheduler 13 | from keras.callbacks import EarlyStopping 14 | 15 | import math 16 | import matplotlib.pyplot as plt 17 | 18 | tf.keras.utils.set_random_seed(42) 19 | 20 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 21 | 22 | def scheduler(epoch, lr): 23 | if epoch < 4: 24 | return lr 25 | else: 26 | return lr * tf.math.exp(-0.1) 27 | 28 | 29 | if __name__ == "__main__": 30 | 31 | img_rows, img_cols = 28, 28 32 | (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data() 33 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1).astype("float32") / 255 34 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1).astype("float32") / 255 35 | 36 | y_train_label = keras.utils.to_categorical(y_train) 37 | y_test_label = keras.utils.to_categorical(y_test) 38 | num_classes = y_train_label.shape[1] 39 | 40 | #training parameters 41 | batch_size = 128 42 | num_epochs = 8 43 | 44 | #model parameters 45 | num_filters_l1 = 32 46 | num_filters_l2 = 64 47 | 48 | #CNN architecture 49 | cnn = Sequential() 50 | #CONV -> RELU -> MAXPOOL 51 | cnn.add(Conv2D(num_filters_l1, kernel_size = (5, 5), input_shape=(img_rows, img_cols, 1), padding='same')) 52 | cnn.add(Activation('relu')) 53 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 54 | 55 | #CONV -> RELU -> MAXPOOL 56 | cnn.add(Conv2D(num_filters_l2, kernel_size = (5, 5), padding='same')) 57 | cnn.add(Activation('relu')) 58 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 59 | 60 | #FC -> RELU 61 | cnn.add(Flatten()) 62 | cnn.add(Dense(128)) 63 | cnn.add(Activation('relu')) 64 | 65 | #Softmax Classifier 66 | cnn.add(Dense(num_classes)) 67 | cnn.add(Activation('softmax')) 68 | 69 | cnn.compile( 70 | loss=keras.losses.CategoricalCrossentropy(), 71 | optimizer=tf.keras.optimizers.Adam(), 72 | metrics=["accuracy"] 73 | ) 74 | 75 | cnn.summary() 76 | 77 | #define callbacks 78 | file_name = SAVE_PATH + 'lenet-weights-checkpoint.h5' 79 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 80 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 81 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 82 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 83 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 84 | 85 | hist = cnn.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2) 86 | 87 | test_scores = cnn.evaluate(x_test, y_test_label, verbose=2) 88 | 89 | print("Test loss:", test_scores[0]) 90 | print("Test accuracy:", test_scores[1]) 91 | 92 | y_prob = cnn.predict(x_test) 93 | y_pred = y_prob.argmax(axis=-1) 94 | 95 | #create submission 96 | submission = pd.DataFrame(index=pd.RangeIndex(start=1, stop=10001, step=1), columns=['Label']) 97 | 
submission['Label'] = y_pred.reshape(-1,1) 98 | submission.index.name = "ImageId" 99 | submission.to_csv(SAVE_PATH + '/lenet_pred.csv', index=True, header=True) 100 | 101 | plt.figure() 102 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 103 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 104 | plt.title('LeNet model') 105 | plt.xlabel('Epochs') 106 | plt.ylabel('Cross-Entropy Loss') 107 | plt.legend(loc='upper right') 108 | plt.show() 109 | #plt.savefig('./figures/lenet_loss.png') 110 | 111 | plt.figure() 112 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train') 113 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val') 114 | plt.title('LeNet model') 115 | plt.xlabel('Epochs') 116 | plt.ylabel('Accuracy') 117 | plt.legend(loc='upper left') 118 | plt.show() 119 | #plt.savefig('./figures/lenet_acc.png') 120 | 121 | plt.figure() 122 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 123 | plt.title('LeNet model') 124 | plt.xlabel('Epochs') 125 | plt.ylabel('Learning Rate') 126 | plt.legend() 127 | plt.show() 128 | #plt.savefig('./figures/lenet_learning_rate.png') 129 | -------------------------------------------------------------------------------- /chp10/lstm_sentiment.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | from keras.models import Sequential 8 | from keras.layers import LSTM, Bidirectional 9 | from keras.layers import Dense, Dropout, Activation, Embedding 10 | 11 | from keras import regularizers 12 | from keras.preprocessing import sequence 13 | from keras.utils import np_utils 14 | 15 | from keras.callbacks import ModelCheckpoint 16 | from keras.callbacks import TensorBoard 17 | from keras.callbacks import LearningRateScheduler 18 | from keras.callbacks import EarlyStopping 19 | 20 | import matplotlib.pyplot as plt 21 | 22 | tf.keras.utils.set_random_seed(42) 23 | 24 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 25 | 26 | def scheduler(epoch, lr): 27 | if epoch < 4: 28 | return lr 29 | else: 30 | return lr * tf.math.exp(-0.1) 31 | 32 | if __name__ == "__main__": 33 | 34 | #load dataset 35 | max_words = 20000 # top 20K most frequent words 36 | seq_len = 200 # first 200 words of each movie review 37 | (x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=max_words) 38 | 39 | x_train = keras.utils.pad_sequences(x_train, maxlen=seq_len) 40 | x_val = keras.utils.pad_sequences(x_val, maxlen=seq_len) 41 | 42 | #training params 43 | batch_size = 256 44 | num_epochs = 8 45 | 46 | #model parameters 47 | hidden_size = 64 48 | embed_dim = 128 49 | lstm_dropout = 0.2 50 | dense_dropout = 0.5 51 | weight_decay = 1e-3 52 | 53 | #LSTM architecture 54 | model = Sequential() 55 | model.add(Embedding(max_words, embed_dim, input_length=seq_len)) 56 | model.add(Bidirectional(LSTM(hidden_size, dropout=lstm_dropout, recurrent_dropout=lstm_dropout))) 57 | model.add(Dense(hidden_size, kernel_regularizer=regularizers.l2(weight_decay), activation='relu')) 58 | model.add(Dropout(dense_dropout)) 59 | model.add(Dense(hidden_size/4, kernel_regularizer=regularizers.l2(weight_decay), activation='relu')) 60 | model.add(Dense(1, activation='sigmoid')) 61 | 62 | model.compile( 63 | loss=keras.losses.BinaryCrossentropy(), 64 | optimizer=tf.keras.optimizers.Adam(), 65 | metrics=["accuracy"] 66 | ) 67 | 68 | model.summary() 69 | 70 | #define callbacks 71 | 
file_name = SAVE_PATH + 'lstm-weights-checkpoint.h5' 72 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 73 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 74 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 75 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 76 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 77 | 78 | hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_data=(x_val, y_val)) 79 | 80 | test_scores = model.evaluate(x_val, y_val, verbose=2) 81 | 82 | print("Test loss:", test_scores[0]) 83 | print("Test accuracy:", test_scores[1]) 84 | 85 | plt.figure() 86 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 87 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 88 | plt.title('LSTM model') 89 | plt.xlabel('Epochs') 90 | plt.ylabel('Cross-Entropy Loss') 91 | plt.legend(loc='upper right') 92 | plt.show() 93 | #plt.savefig('./figures/lstm_loss.png') 94 | 95 | plt.figure() 96 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train') 97 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val') 98 | plt.title('LSTM model') 99 | plt.xlabel('Epochs') 100 | plt.ylabel('Accuracy') 101 | plt.legend(loc='upper left') 102 | plt.show() 103 | #plt.savefig('./figures/lstm_acc.png') 104 | 105 | plt.figure() 106 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 107 | plt.title('LSTM model') 108 | plt.xlabel('Epochs') 109 | plt.ylabel('Learning Rate') 110 | plt.legend() 111 | plt.show() 112 | #plt.savefig('./figures/lstm_learning_rate.png') 113 | 114 | -------------------------------------------------------------------------------- /chp10/mlp.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | from tensorflow import keras 4 | 5 | from keras.models import Sequential 6 | from keras.layers import Dense, Dropout 7 | 8 | from keras.callbacks import ModelCheckpoint 9 | from keras.callbacks import TensorBoard 10 | from keras.callbacks import LearningRateScheduler 11 | from keras.callbacks import EarlyStopping 12 | 13 | import math 14 | import matplotlib.pyplot as plt 15 | 16 | tf.keras.utils.set_random_seed(42) 17 | 18 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 19 | 20 | def scheduler(epoch, lr): 21 | if epoch < 4: 22 | return lr 23 | else: 24 | return lr * tf.math.exp(-0.1) 25 | 26 | if __name__ == "__main__": 27 | 28 | (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data() 29 | x_train = x_train.reshape(60000, 784).astype("float32") / 255 30 | x_test = x_test.reshape(10000, 784).astype("float32") / 255 31 | 32 | y_train_label = keras.utils.to_categorical(y_train) 33 | y_test_label = keras.utils.to_categorical(y_test) 34 | num_classes = y_train_label.shape[1] 35 | 36 | #training params 37 | batch_size = 64 38 | num_epochs = 16 39 | 40 | model = Sequential() 41 | model.add(Dense(128, input_shape=(784, ), activation='relu')) 42 | model.add(Dense(64, activation='relu')) 43 | model.add(Dropout(0.5)) 44 | model.add(Dense(10, activation='softmax')) 45 | 46 | model.compile( 47 | loss=keras.losses.CategoricalCrossentropy(), 48 | optimizer=tf.keras.optimizers.RMSprop(), 49 | metrics=["accuracy"] 50 | ) 51 | 52 | model.summary() 53 | 54 | #define callbacks 55 | file_name = SAVE_PATH + 'mlp-weights-checkpoint.h5' 56 | checkpoint = 
ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 57 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 58 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 59 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 60 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 61 | 62 | hist = model.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2) 63 | 64 | test_scores = model.evaluate(x_test, y_test_label, verbose=2) 65 | 66 | print("Test loss:", test_scores[0]) 67 | print("Test accuracy:", test_scores[1]) 68 | 69 | plt.figure() 70 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 71 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 72 | plt.title('MLP model') 73 | plt.xlabel('Epochs') 74 | plt.ylabel('Cross-Entropy Loss') 75 | plt.legend(loc='upper right') 76 | plt.show() 77 | #plt.savefig('./figures/mlp_loss.png') 78 | 79 | plt.figure() 80 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train') 81 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val') 82 | plt.title('MLP model') 83 | plt.xlabel('Epochs') 84 | plt.ylabel('Accuracy') 85 | plt.legend(loc='upper left') 86 | plt.show() 87 | #plt.savefig('./figures/mlp_acc.png') 88 | 89 | plt.figure() 90 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 91 | plt.title('MLP model') 92 | plt.xlabel('Epochs') 93 | plt.ylabel('Learning Rate') 94 | plt.legend() 95 | plt.show() 96 | #plt.savefig('./figures/mlp_learning_rate.png') -------------------------------------------------------------------------------- /chp10/multi_input_nn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | import os 8 | import re 9 | import csv 10 | import codecs 11 | 12 | from keras.models import Model 13 | from keras.layers import Input, Flatten, Concatenate, LSTM, Lambda, Dropout 14 | from keras.layers import Dense, Dropout, Activation, Embedding 15 | from keras.layers import Conv1D, MaxPooling1D 16 | from keras.layers import TimeDistributed, Bidirectional, BatchNormalization 17 | 18 | from keras import backend as K 19 | from keras.preprocessing.text import Tokenizer 20 | from keras.utils import pad_sequences 21 | 22 | from nltk.corpus import stopwords 23 | from nltk.stem import SnowballStemmer 24 | 25 | from keras import regularizers 26 | from keras.preprocessing import sequence 27 | from keras.utils import np_utils 28 | 29 | from keras.callbacks import ModelCheckpoint 30 | from keras.callbacks import TensorBoard 31 | from keras.callbacks import LearningRateScheduler 32 | from keras.callbacks import EarlyStopping 33 | 34 | import matplotlib.pyplot as plt 35 | 36 | tf.keras.utils.set_random_seed(42) 37 | 38 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 39 | DATA_PATH = "/content/drive/MyDrive/data/" 40 | 41 | GLOVE_DIR = DATA_PATH 42 | TRAIN_DATA_FILE = DATA_PATH + 'quora_train.csv' 43 | TEST_DATA_FILE = DATA_PATH + 'quora_test.csv' 44 | MAX_SEQUENCE_LENGTH = 30 45 | MAX_NB_WORDS = 200000 46 | EMBEDDING_DIM = 300 47 | VALIDATION_SPLIT = 0.01 48 | 49 | def scheduler(epoch, lr): 50 | if epoch < 4: 51 | return lr 52 | else: 53 | return lr * tf.math.exp(-0.1) 54 | 55 | def text_to_wordlist(row, remove_stopwords=False, stem_words=False): 56 | # Clean the text, with the option to 
remove stopwords and to stem words. 57 | 58 | text = row['question'] 59 | # Convert words to lower case and split them 60 | if type(text) is str: 61 | text = text.lower().split() 62 | else: 63 | return " " 64 | 65 | # Optionally, remove stop words 66 | if remove_stopwords: 67 | stops = set(stopwords.words("english")) 68 | text = [w for w in text if not w in stops] 69 | 70 | text = " ".join(text) 71 | 72 | # Clean the text 73 | text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text) 74 | 75 | # Optionally, shorten words to their stems 76 | if stem_words: 77 | text = text.split() 78 | stemmer = SnowballStemmer('english') 79 | stemmed_words = [stemmer.stem(word) for word in text] 80 | text = " ".join(stemmed_words) 81 | 82 | # Return a list of words 83 | return(text) 84 | 85 | if __name__ == "__main__": 86 | 87 | #load embeddings 88 | print('Indexing word vectors...') 89 | embeddings_index = {} 90 | f = codecs.open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf-8') 91 | for line in f: 92 | values = line.split(' ') 93 | word = values[0] 94 | coefs = np.asarray(values[1:], dtype='float32') 95 | embeddings_index[word] = coefs 96 | f.close() 97 | print('Found %s word vectors.' % len(embeddings_index)) 98 | 99 | #load dataset 100 | train_df = pd.read_csv(TRAIN_DATA_FILE) 101 | test_df = pd.read_csv(TEST_DATA_FILE) 102 | 103 | q1df = train_df['question1'].reset_index() 104 | q2df = train_df['question2'].reset_index() 105 | q1df.columns = ['index', 'question'] 106 | q2df.columns = ['index', 'question'] 107 | texts_1 = q1df.apply(text_to_wordlist, axis=1, raw=False).tolist() 108 | texts_2 = q2df.apply(text_to_wordlist, axis=1, raw=False).tolist() 109 | labels = train_df['is_duplicate'].astype(int).tolist() 110 | print('Found %s texts.' % len(texts_1)) 111 | del q1df 112 | del q2df 113 | 114 | q1df = test_df['question1'].reset_index() 115 | q2df = test_df['question2'].reset_index() 116 | q1df.columns = ['index', 'question'] 117 | q2df.columns = ['index', 'question'] 118 | test_texts_1 = q1df.apply(text_to_wordlist, axis=1, raw=False).tolist() 119 | test_texts_2 = q2df.apply(text_to_wordlist, axis=1, raw=False).tolist() 120 | test_labels = np.arange(0, test_df.shape[0]) 121 | print('Found %s texts.' % len(test_texts_1)) 122 | del q1df 123 | del q2df 124 | 125 | #tokenize, convert to sequences and pad 126 | tokenizer = Tokenizer(nb_words=MAX_NB_WORDS) 127 | tokenizer.fit_on_texts(texts_1 + texts_2 + test_texts_1 + test_texts_2) 128 | sequences_1 = tokenizer.texts_to_sequences(texts_1) 129 | sequences_2 = tokenizer.texts_to_sequences(texts_2) 130 | word_index = tokenizer.word_index 131 | print('Found %s unique tokens.' 
% len(word_index)) 132 | 133 | test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1) 134 | test_sequences_2 = tokenizer.texts_to_sequences(test_texts_2) 135 | 136 | data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH) 137 | data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH) 138 | labels = np.array(labels) 139 | print('Shape of data tensor:', data_1.shape) 140 | print('Shape of label tensor:', labels.shape) 141 | 142 | test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH) 143 | test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH) 144 | test_labels = np.array(test_labels) 145 | del test_sequences_1 146 | del test_sequences_2 147 | del sequences_1 148 | del sequences_2 149 | 150 | #embedding matrix 151 | print('Preparing embedding matrix...') 152 | nb_words = min(MAX_NB_WORDS, len(word_index)) 153 | 154 | embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM)) 155 | for word, i in word_index.items(): 156 | if i >= nb_words: 157 | continue 158 | embedding_vector = embeddings_index.get(word) 159 | if embedding_vector is not None: 160 | # words not found in embedding index will be all-zeros. 161 | embedding_matrix[i] = embedding_vector 162 | print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0)) 163 | 164 | #Multi-Input Architecture 165 | embedding_layer = Embedding(nb_words, 166 | EMBEDDING_DIM, 167 | weights=[embedding_matrix], 168 | input_length=MAX_SEQUENCE_LENGTH, 169 | trainable=False) 170 | 171 | sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 172 | embedded_sequences_1 = embedding_layer(sequence_1_input) 173 | x1 = Conv1D(128, 3, activation='relu')(embedded_sequences_1) 174 | x1 = MaxPooling1D(10)(x1) 175 | x1 = Flatten()(x1) 176 | x1 = Dense(64, activation='relu')(x1) 177 | x1 = Dropout(0.2)(x1) 178 | 179 | sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 180 | embedded_sequences_2 = embedding_layer(sequence_2_input) 181 | y1 = Conv1D(128, 3, activation='relu')(embedded_sequences_2) 182 | y1 = MaxPooling1D(10)(y1) 183 | y1 = Flatten()(y1) 184 | y1 = Dense(64, activation='relu')(y1) 185 | y1 = Dropout(0.2)(y1) 186 | 187 | merged = Concatenate()([x1, y1]) 188 | merged = BatchNormalization()(merged) 189 | merged = Dense(64, activation='relu')(merged) 190 | merged = Dropout(0.2)(merged) 191 | merged = BatchNormalization()(merged) 192 | preds = Dense(1, activation='sigmoid')(merged) 193 | 194 | model = Model(inputs=[sequence_1_input,sequence_2_input], outputs=preds) 195 | 196 | model.compile( 197 | loss=keras.losses.BinaryCrossentropy(), 198 | optimizer=tf.keras.optimizers.Adam(), 199 | metrics=["accuracy"] 200 | ) 201 | 202 | model.summary() 203 | 204 | #define callbacks 205 | file_name = SAVE_PATH + 'multi-input-weights-checkpoint.h5' 206 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 207 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 208 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 209 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 210 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 211 | 212 | hist = model.fit([data_1, data_2], labels, batch_size=1024, epochs=10, callbacks=callbacks_list, validation_split=VALIDATION_SPLIT) 213 | 214 | num_test = 100000 215 | preds = model.predict([test_data_1[:num_test,:], test_data_2[:num_test,:]]) 216 | 217 | quora_submission = 
pd.DataFrame({"test_id":test_labels[:num_test], "is_duplicate":preds.ravel()}) 218 | quora_submission.to_csv(SAVE_PATH + "quora_submission.csv", index=False) 219 | 220 | plt.figure() 221 | plt.plot(hist.history['loss'], c='b', lw=2.0, label='train') 222 | plt.plot(hist.history['val_loss'], c='r', lw=2.0, label='val') 223 | plt.title('Multi-Input model') 224 | plt.xlabel('Epochs') 225 | plt.ylabel('Cross-Entropy Loss') 226 | plt.legend(loc='upper right') 227 | plt.show() 228 | #plt.savefig('./figures/lstm_loss.png') 229 | 230 | plt.figure() 231 | plt.plot(hist.history['accuracy'], c='b', lw=2.0, label='train') 232 | plt.plot(hist.history['val_accuracy'], c='r', lw=2.0, label='val') 233 | plt.title('Multi-Input model') 234 | plt.xlabel('Epochs') 235 | plt.ylabel('Accuracy') 236 | plt.legend(loc='upper left') 237 | plt.show() 238 | #plt.savefig('./figures/lstm_acc.png') 239 | 240 | plt.figure() 241 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate') 242 | plt.title('Multi-Input model') 243 | plt.xlabel('Epochs') 244 | plt.ylabel('Learning Rate') 245 | plt.legend() 246 | plt.show() 247 | #plt.savefig('./figures/lstm_learning_rate.png') -------------------------------------------------------------------------------- /chp11/keras_mdn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | from keras.models import Model 8 | from keras.layers import concatenate, Input 9 | from keras.layers import Dense, Activation, Dropout, Flatten 10 | from keras.layers import BatchNormalization 11 | 12 | from keras import regularizers 13 | from keras import backend as K 14 | from keras.utils import np_utils 15 | 16 | from keras.callbacks import ModelCheckpoint 17 | from keras.callbacks import TensorBoard 18 | from keras.callbacks import LearningRateScheduler 19 | from keras.callbacks import EarlyStopping 20 | 21 | from sklearn.datasets import make_blobs 22 | from sklearn.metrics import adjusted_rand_score 23 | from sklearn.metrics import normalized_mutual_info_score 24 | from sklearn.model_selection import train_test_split 25 | 26 | import math 27 | import matplotlib.pyplot as plt 28 | import matplotlib.cm as cm 29 | 30 | tf.keras.utils.set_random_seed(42) 31 | 32 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 33 | 34 | def scheduler(epoch, lr): 35 | if epoch < 4: 36 | return lr 37 | else: 38 | return lr * tf.math.exp(-0.1) 39 | 40 | def generate_data(N): 41 | pi = np.array([0.2, 0.4, 0.3, 0.1]) 42 | mu = [[2,2], [-2,2], [-2,-2], [2,-2]] 43 | std = [[0.5,0.5], [1.0,1.0], [0.5,0.5], [1.0,1.0]] 44 | x = np.zeros((N,2), dtype=np.float32) 45 | y = np.zeros((N,2), dtype=np.float32) 46 | z = np.zeros((N,1), dtype=np.int32) 47 | for n in range(N): 48 | k = np.argmax(np.random.multinomial(1, pi)) 49 | x[n,:] = np.random.multivariate_normal(mu[k], np.diag(std[k])) 50 | y[n,:] = mu[k] 51 | z[n,:] = k 52 | #end for 53 | z = z.flatten() 54 | return x, y, z, pi, mu, std 55 | 56 | def tf_normal(y, mu, sigma): 57 | y_tile = K.stack([y]*num_clusters, axis=1) #[batch_size, K, D] 58 | result = y_tile - mu 59 | sigma_tile = K.stack([sigma]*data_dim, axis=-1) #[batch_size, K, D] 60 | result = result * 1.0/(sigma_tile+1e-8) 61 | result = -K.square(result)/2.0 62 | oneDivSqrtTwoPI = 1.0/math.sqrt(2*math.pi) 63 | result = K.exp(result) * (1.0/(sigma_tile + 1e-8))*oneDivSqrtTwoPI 64 | result = K.prod(result, axis=-1) #[batch_size, K] iid Gaussians 65 | return result 
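#
# Note (added): tf_normal above evaluates, for each mixture component k, a diagonal
# Gaussian density N(y; mu_k, sigma_k^2 I) = prod_d exp(-((y_d - mu_kd)/sigma_k)^2 / 2) / (sigma_k * sqrt(2*pi)).
# A minimal NumPy sketch of the same computation for a single sample and a single
# component (np_normal is illustrative only, not part of the original script) can be
# used to sanity-check NLLLoss below on small arrays:
#
#   def np_normal(y, mu, sigma):
#       d = (y - mu) / (sigma + 1e-8)          # y, mu: shape [D]; sigma: scalar spread
#       return np.prod(np.exp(-0.5 * d**2) / ((sigma + 1e-8) * np.sqrt(2.0 * np.pi)))
#
# NLLLoss then computes -log( sum_k pi_k * tf_normal(y, mu_k, sigma_k) ), i.e. the
# negative log-likelihood of the Gaussian mixture.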
66 | 67 | def NLLLoss(y_true, y_pred): 68 | out_mu = y_pred[:,:num_clusters*data_dim] 69 | out_sigma = y_pred[:,num_clusters*data_dim : num_clusters*(data_dim+1)] 70 | out_pi = y_pred[:,num_clusters*(data_dim+1):] 71 | 72 | out_mu = K.reshape(out_mu, [-1, num_clusters, data_dim]) 73 | 74 | result = tf_normal(y_true, out_mu, out_sigma) 75 | result = result * out_pi 76 | result = K.sum(result, axis=1, keepdims=True) 77 | result = -K.log(result + 1e-8) 78 | result = K.mean(result) 79 | return tf.maximum(result, 0) 80 | 81 | #generate data 82 | X_data, y_data, z_data, pi_true, mu_true, sigma_true = generate_data(4096) 83 | 84 | data_dim = X_data.shape[1] 85 | num_clusters = len(mu_true) 86 | 87 | num_train = 3500 88 | X_train, X_test, y_train, y_test = X_data[:num_train,:], X_data[num_train:,:], y_data[:num_train,:], y_data[num_train:,:] 89 | z_train, z_test = z_data[:num_train], z_data[num_train:] 90 | 91 | #visualize data 92 | plt.figure() 93 | plt.scatter(X_train[:,0], X_train[:,1], c=z_train, cmap=cm.bwr) 94 | plt.title('training data') 95 | plt.show() 96 | #plt.savefig(SAVE_PATH + '/mdn_training_data.png') 97 | 98 | #training params 99 | batch_size = 128 100 | num_epochs = 128 101 | 102 | #model parameters 103 | hidden_size = 32 104 | weight_decay = 1e-4 105 | 106 | #MDN architecture 107 | input_data = Input(shape=(data_dim,)) 108 | x = Dense(32, activation='relu')(input_data) 109 | x = Dropout(0.2)(x) 110 | x = BatchNormalization()(x) 111 | x = Dense(32, activation='relu')(x) 112 | x = Dropout(0.2)(x) 113 | x = BatchNormalization()(x) 114 | 115 | mu = Dense(num_clusters * data_dim, activation='linear')(x) #cluster means 116 | sigma = Dense(num_clusters, activation=K.exp)(x) #diagonal cov 117 | pi = Dense(num_clusters, activation='softmax')(x) #mixture proportions 118 | out = concatenate([mu, sigma, pi], axis=-1) 119 | 120 | model = Model(input_data, out) 121 | 122 | model.compile( 123 | loss=NLLLoss, 124 | optimizer=tf.keras.optimizers.Adam(), 125 | metrics=["accuracy"] 126 | ) 127 | 128 | model.summary() 129 | 130 | #define callbacks 131 | file_name = SAVE_PATH + 'mdn-weights-checkpoint.h5' 132 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min') 133 | reduce_lr = LearningRateScheduler(scheduler, verbose=1) 134 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1) 135 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True) 136 | callbacks_list = [checkpoint, reduce_lr, early_stopping] 137 | 138 | hist = model.fit(X_train, y_train, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2, shuffle=True, verbose=2) 139 | 140 | y_pred = model.predict(X_test) 141 | 142 | mu_pred = y_pred[:,:num_clusters*data_dim] 143 | mu_pred = np.reshape(mu_pred, [-1, num_clusters, data_dim]) 144 | sigma_pred = y_pred[:,num_clusters*data_dim : num_clusters*(data_dim+1)] 145 | pi_pred = y_pred[:,num_clusters*(data_dim+1):] 146 | z_pred = np.argmax(pi_pred, axis=-1) 147 | 148 | rand_score = adjusted_rand_score(z_test, z_pred) 149 | print("adjusted rand score: ", rand_score) 150 | 151 | nmi_score = normalized_mutual_info_score(z_test, z_pred) 152 | print("normalized MI score: ", nmi_score) 153 | 154 | mu_pred_list = [] 155 | sigma_pred_list = [] 156 | for label in np.unique(z_pred): 157 | z_idx = np.where(z_pred == label)[0] 158 | mu_pred_lbl = np.mean(mu_pred[z_idx,label,:], axis=0) 159 | mu_pred_list.append(mu_pred_lbl) 160 | 161 | sigma_pred_lbl = 
np.mean(sigma_pred[z_idx,label], axis=0) 162 | sigma_pred_list.append(sigma_pred_lbl) 163 | #end for 164 | 165 | print("true means:") 166 | print(np.array(mu_true)) 167 | 168 | print("predicted means:") 169 | print(np.array(mu_pred_list)) 170 | 171 | print("true sigmas:") 172 | print(np.array(sigma_true)) 173 | 174 | print("predicted sigmas:") 175 | print(np.array(sigma_pred_list)) 176 | 177 | #generate plots 178 | plt.figure() 179 | plt.scatter(X_test[:,0], X_test[:,1], c=z_pred, cmap=cm.bwr) 180 | plt.scatter(np.array(mu_pred_list)[:,0], np.array(mu_pred_list)[:,1], s=100, marker='x', lw=4.0, color='k') 181 | plt.title('test data') 182 | #plt.savefig('./figures/mdn_test_data.png') 183 | 184 | plt.figure() 185 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train') 186 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val') 187 | plt.title('Mixture Density Network') 188 | plt.xlabel('Epochs') 189 | plt.ylabel('Negative Log Likelihood Loss') 190 | plt.legend(loc='upper left') 191 | #plt.savefig('./figures/mdn_loss.png') 192 | 193 | 194 | -------------------------------------------------------------------------------- /chp11/lstm_vae.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | import tensorflow_probability as tfp 7 | 8 | from keras.layers import Input, Dense, Lambda, Layer 9 | from keras.layers import LSTM, RepeatVector 10 | from keras.models import Model 11 | from keras import backend as K 12 | from keras import metrics 13 | from keras import optimizers 14 | 15 | import math 16 | import json 17 | from scipy.stats import norm 18 | from sklearn.model_selection import train_test_split 19 | from sklearn import preprocessing 20 | from sklearn.metrics import confusion_matrix 21 | from sklearn.preprocessing import StandardScaler 22 | 23 | from keras.callbacks import ModelCheckpoint 24 | from keras.callbacks import TensorBoard 25 | from keras.callbacks import LearningRateScheduler 26 | from keras.callbacks import EarlyStopping 27 | 28 | import matplotlib.pyplot as plt 29 | 30 | tf.keras.utils.set_random_seed(42) 31 | 32 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 33 | DATA_PATH = "/content/drive/MyDrive/data/" 34 | 35 | def scheduler(epoch, lr): 36 | if epoch < 4: 37 | return lr 38 | else: 39 | return lr * tf.math.exp(-0.1) 40 | 41 | nab_path = DATA_PATH + 'NAB/' 42 | nab_data_path = nab_path 43 | 44 | labels_filename = '/labels/combined_labels.json' 45 | train_file_name = 'artificialNoAnomaly/art_daily_no_noise.csv' 46 | test_file_name = 'artificialWithAnomaly/art_daily_jumpsup.csv' 47 | 48 | #train_file_name = 'realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv' 49 | #test_file_name = 'realAWSCloudwatch/rds_cpu_utilization_e47b3b.csv' 50 | 51 | labels_file = open(nab_path + labels_filename, 'r') 52 | labels = json.loads(labels_file.read()) 53 | labels_file.close() 54 | 55 | def load_data_frame_with_labels(file_name): 56 | data_frame = pd.read_csv(nab_data_path + file_name) 57 | data_frame['anomaly_label'] = data_frame['timestamp'].isin( 58 | labels[file_name]).astype(int) 59 | return data_frame 60 | 61 | train_data_frame = load_data_frame_with_labels(train_file_name) 62 | test_data_frame = load_data_frame_with_labels(test_file_name) 63 | 64 | plt.plot(train_data_frame.loc[0:3000,'value']) 65 | plt.plot(test_data_frame['value']) 66 | 67 | train_data_frame_final = train_data_frame.loc[0:3000,:] 68 | 
test_data_frame_final = test_data_frame 69 | 70 | data_scaler = StandardScaler() 71 | data_scaler.fit(train_data_frame_final[['value']].values) 72 | train_data = data_scaler.transform(train_data_frame_final[['value']].values) 73 | test_data = data_scaler.transform(test_data_frame_final[['value']].values) 74 | 75 | def create_dataset(dataset, look_back=64): 76 | dataX, dataY = [], [] 77 | for i in range(len(dataset)-look_back-1): 78 | dataX.append(dataset[i:(i+look_back),:]) 79 | dataY.append(dataset[i+look_back,:]) 80 | 81 | return np.array(dataX), np.array(dataY) 82 | 83 | X_data, y_data = create_dataset(train_data, look_back=64) #look_back = window_size 84 | X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.1, random_state=42) 85 | X_test, y_test = create_dataset(test_data, look_back=64) #look_back = window_size 86 | 87 | #training params 88 | batch_size = 256 89 | num_epochs = 32 90 | 91 | #model params 92 | timesteps = X_train.shape[1] 93 | input_dim = X_train.shape[-1] 94 | intermediate_dim = 16 95 | latent_dim = 2 96 | epsilon_std = 1.0 97 | 98 | #sampling layer 99 | class Sampling(Layer): 100 | def call(self, inputs): 101 | z_mean, z_log_var = inputs 102 | batch = tf.shape(z_mean)[0] 103 | dim = tf.shape(z_mean)[1] 104 | epsilon = tf.keras.backend.random_normal(shape=(batch, dim)) 105 | return z_mean + tf.exp(0.5 * z_log_var) * epsilon 106 | 107 | #likelihood layer 108 | class Likelihood(Layer): 109 | def call(self, inputs): 110 | x, x_decoded_mean, x_decoded_scale = inputs 111 | dist = tfp.distributions.MultivariateNormalDiag(x_decoded_mean, x_decoded_scale) 112 | likelihood = dist.log_prob(x) 113 | return likelihood 114 | 115 | #VAE architecture 116 | 117 | #encoder 118 | x = Input(shape=(timesteps, input_dim,)) 119 | h = LSTM(intermediate_dim)(x) 120 | 121 | z_mean = Dense(latent_dim)(h) 122 | z_log_sigma = Dense(latent_dim, activation='softplus')(h) 123 | 124 | #sampling 125 | z = Sampling()((z_mean, z_log_sigma)) 126 | 127 | #decoder 128 | decoder_h = LSTM(intermediate_dim, return_sequences=True) 129 | decoder_loc = LSTM(input_dim, return_sequences=True) 130 | decoder_scale = LSTM(input_dim, activation='softplus', return_sequences=True) 131 | 132 | h_decoded = RepeatVector(timesteps)(z) 133 | h_decoded = decoder_h(h_decoded) 134 | 135 | x_decoded_mean = decoder_loc(h_decoded) 136 | x_decoded_scale = decoder_scale(h_decoded) 137 | 138 | #log-likelihood 139 | llh = Likelihood()([x, x_decoded_mean, x_decoded_scale]) 140 | 141 | #define VAE model 142 | vae = Model(inputs=x, outputs=llh) 143 | 144 | # Add KL divergence regularization loss and likelihood loss 145 | kl_loss = - 0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma)) 146 | tot_loss = -K.mean(llh - kl_loss) 147 | vae.add_loss(tot_loss) 148 | 149 | # Loss and optimizer. 150 | loss_fn = tf.keras.losses.MeanSquaredError() 151 | optimizer = tf.keras.optimizers.Adam() 152 | 153 | @tf.function 154 | def training_step(x): 155 | with tf.GradientTape() as tape: 156 | reconstructed = vae(x) # Compute input reconstruction. 157 | # Compute loss. 158 | loss = 0 #loss_fn(x, reconstructed) 159 | loss += sum(vae.losses) 160 | # Update the weights of the VAE. 161 | grads = tape.gradient(loss, vae.trainable_weights) 162 | optimizer.apply_gradients(zip(grads, vae.trainable_weights)) 163 | return loss 164 | 165 | losses = [] # Keep track of the losses over time. 
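#
# Note (added): the custom loop below minimizes the loss attached via vae.add_loss(),
# i.e. the negative ELBO
#   -ELBO = KL( q(z|x) || N(0, I) ) - E_q[ log p(x|z) ],
# where the log-likelihood term comes from the Likelihood layer and the KL term from
# kl_loss above. Because the forward pass populates vae.losses, the per-batch objective
# can equivalently be read off as (illustrative sketch, assuming a batch tensor x_batch):
#
#   _ = vae(x_batch)                 # forward pass populates vae.losses
#   neg_elbo = tf.add_n(vae.losses)  # KL term minus mean log-likelihood
#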
166 | dataset = tf.data.Dataset.from_tensor_slices(X_train).batch(batch_size) 167 | for epoch in range(num_epochs): 168 | for step, x in enumerate(dataset): 169 | loss = training_step(x) 170 | losses.append(float(loss)) 171 | print("Epoch:", epoch, "Loss:", sum(losses) / len(losses)) 172 | 173 | plt.figure() 174 | plt.plot(losses, c='b', lw=2.0, label='train') 175 | plt.title('LSTM-VAE model') 176 | plt.xlabel('Epochs') 177 | plt.ylabel('Total Loss') 178 | plt.legend(loc='upper right') 179 | plt.show() 180 | #plt.savefig('./figures/lstm_loss.png') 181 | 182 | pred_test = vae.predict(X_test) 183 | 184 | plt.plot(pred_test[:,0]) 185 | 186 | is_anomaly = pred_test[:,0] < -1e1 187 | plt.figure() 188 | plt.plot(test_data, color='b') 189 | plt.figure() 190 | plt.plot(is_anomaly, color='r') 191 | -------------------------------------------------------------------------------- /chp11/spektral_gnn.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | import tensorflow as tf 5 | from tensorflow import keras 6 | 7 | import networkx as nx 8 | from tensorflow.keras.utils import to_categorical 9 | from sklearn.preprocessing import LabelEncoder 10 | from sklearn.utils import shuffle 11 | from sklearn.metrics import classification_report 12 | from sklearn.model_selection import train_test_split 13 | 14 | from spektral.layers import GCNConv 15 | 16 | from tensorflow.keras.models import Model 17 | from tensorflow.keras.layers import Input, Dropout, Dense 18 | from tensorflow.keras import Sequential 19 | from tensorflow.keras.optimizers import Adam 20 | from tensorflow.keras.callbacks import TensorBoard, EarlyStopping 21 | from tensorflow.keras.regularizers import l2 22 | 23 | import os 24 | from collections import Counter 25 | from sklearn.manifold import TSNE 26 | import matplotlib.pyplot as plt 27 | 28 | tf.keras.utils.set_random_seed(42) 29 | 30 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/" 31 | DATA_PATH = "/content/drive/MyDrive/data/cora/" 32 | 33 | column_names = ["paper_id"] + [f"term_{idx}" for idx in range(1433)] + ["subject"] 34 | node_df = pd.read_csv(DATA_PATH + "cora.content", sep="\t", header=None, names=column_names) 35 | print("Node df shape:", node_df.shape) 36 | 37 | edge_df = pd.read_csv(DATA_PATH + "cora.cites", sep="\t", header=None, names=["target", "source"]) 38 | print("Edge df shape:", edge_df.shape) 39 | 40 | #parse node data 41 | nodes = node_df.iloc[:,0].tolist() 42 | labels = node_df.iloc[:,-1].tolist() 43 | X = node_df.iloc[:,1:-1].values 44 | 45 | X = np.array(X,dtype=int) 46 | N = X.shape[0] #the number of nodes 47 | F = X.shape[1] #the size of node features 48 | 49 | #parse edge data 50 | edge_list = [(x, y) for x, y in zip(edge_df['target'], edge_df['source'])] 51 | 52 | num_classes = len(set(labels)) 53 | 54 | print('Number of nodes:', N) 55 | print('Number of features of each node:', F) 56 | print('Labels:', set(labels)) 57 | print('Number of classes:', num_classes) 58 | 59 | def sample_data(labels, limit=20, val_num=500, test_num=1000): 60 | label_counter = dict((l, 0) for l in labels) 61 | train_idx = [] 62 | 63 | for i in range(len(labels)): 64 | label = labels[i] 65 | if label_counter[label]