├── .gitattributes
├── README.md
├── chp02
│   ├── binomial_tree.py
│   ├── gibbs_gauss.py
│   ├── imp_samp.py
│   ├── mh_gauss2d.py
│   ├── monte_carlo_pi.py
│   └── random_walk.py
├── chp03
│   └── mean_field_mrf.py
├── chp04
│   ├── binary_search.py
│   ├── binomial_coeffs.py
│   ├── knapsack_greedy.py
│   └── subset_gen.py
├── chp05
│   ├── cart.py
│   ├── naive_bayes.py
│   ├── perceptron.py
│   ├── sgd_lr.py
│   └── svm.py
├── chp06
│   ├── gp_reg.py
│   ├── hierarchical_regression.py
│   ├── knn_reg.py
│   └── ridge_reg.py
├── chp07
│   ├── active_learning.py
│   ├── adaboost_clf.py
│   ├── bagging_clf.py
│   ├── bayes_opt_sklearn.py
│   ├── demo_logreg.py
│   ├── hmm.py
│   ├── page_rank.py
│   ├── plot_smote_regular.py
│   ├── plot_tomek_links.py
│   └── stacked_clf.py
├── chp08
│   ├── dpmeans.py
│   ├── gmm.py
│   ├── manifold_learning.py
│   └── pca.py
├── chp09
│   ├── ga.py
│   ├── inv_cov.py
│   ├── kde.py
│   ├── lda.py
│   ├── portfolio_opt.py
│   └── sim_annealing.py
├── chp10
│   ├── image_search.py
│   ├── keras_optimizers.py
│   ├── lenet.py
│   ├── lstm_sentiment.py
│   ├── mlp.py
│   └── multi_input_nn.py
├── chp11
│   ├── keras_mdn.py
│   ├── lstm_vae.py
│   ├── spektral_gnn.py
│   └── transformer.py
├── data
│   ├── NAB
│   │   ├── artificialNoAnomaly
│   │   │   └── art_daily_no_noise.csv
│   │   ├── artificialWithAnomaly
│   │   │   └── art_daily_jumpsup.csv
│   │   └── labels
│   │       └── combined_labels.json
│   ├── cora
│   │   ├── cora.cites
│   │   └── cora.content
│   └── radon.txt
├── figures
│   ├── bayes.bmp
│   └── meap.png
└── requirements.txt
/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Algorithms in Depth
2 | ML Algorithms in Depth: Bayesian Inference and Deep Learning
3 |
4 | **Chp02: Markov Chain Monte Carlo (MCMC)**
5 | - [Estimate Pi](./chp02/monte_carlo_pi.py): Monte Carlo estimate of Pi
6 | - [Binomial Tree Model](./chp02/binomial_tree.py): Monte Carlo simulation of binomial stock price
7 | - [Random Walk](./chp02/random_walk.py): self-avoiding random walk
8 | - [Gibbs Sampling](./chp02/gibbs_gauss.py): Gibbs sampling of multivariate Gaussian distribution
9 | - [Metropolis-Hastings Sampling](./chp02/mh_gauss2d.py): Metropolis-Hastings sampling of multivariate Gaussian mixture
10 | - [Importance Sampling](./chp02/imp_samp.py): importance sampling for finding expected value of a function
11 |
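To get a quick feel for the Monte Carlo approach, the core of the Pi estimate in [monte_carlo_pi.py](./chp02/monte_carlo_pi.py) boils down to a few lines (a condensed sketch, not the full script):

```
import numpy as np

np.random.seed(42)
n = 100_000
x = np.random.uniform(-1, 1, n)
y = np.random.uniform(-1, 1, n)
inside = x**2 + y**2 < 1          # points that fall inside the unit circle
pi_hat = 4.0 * np.mean(inside)    # square area (4) times the inside fraction
print(pi_hat)                     # approx 3.14
```
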
12 | **Chp03: Variational Inference (VI)**
13 | - [Mean Field VI](./chp03/mean_field_mrf.py): image denoising in Ising model
14 |
15 | **Chp04: Software Implementation**
16 | - [Subset Generation](./chp04/subset_gen.py): a complete search algorithm
17 | - [Fractional Knapsack](./chp04/knapsack_greedy.py): a greedy algorithm
18 | - [Binary Search](./chp04/binary_search.py): a divide and conquer algorithm
19 | - [Binomial Coefficients](./chp04/binomial_coeffs.py): a dynamic programming algorithm
20 |
21 | **Chp05: Classification Algorithms**
22 | - [Perceptron](./chp05/perceptron.py): perceptron algorithm
23 | - [SVM](./chp05/svm.py): support vector machine
24 | - [SGD-LR](./chp05/sgd_lr.py): stochastic gradient descent logistic regression
25 | - [Naive Bayes](./chp05/naive_bayes.py): Bernoulli Naive Bayes algorithm
26 | - [CART](./chp05/cart.py): decision tree classification algorithm
27 |
28 | **Chp06: Regression Algorithms**
29 | - [KNN](./chp06/knn_reg.py): K-Nearest Neighbors regression
30 | - [BLR](./chp06/ridge_reg.py): Bayesian linear regression
31 | - [HBR](./chp06/hierarchical_regression.py): Hierarchical Bayesian regression
32 | - [GPR](./chp06/gp_reg.py): Gaussian Process regression
33 |
34 | **Chp07: Selected Supervised Learning Algorithms**
35 | - [Page Rank](./chp07/page_rank.py): Google page rank algorithm
36 | - [HMM](./chp07/hmm.py): EM algorithm for Hidden Markov Models
37 | - Imbalanced Learning: [Tomek Links](./chp07/plot_tomek_links.py), [SMOTE](./chp07/plot_smote_regular.py)
38 | - Active Learning: [LR](./chp07/demo_logreg.py)
39 | - Bayesian optimization: [BO](./chp07/bayes_opt_sklearn.py)
40 | - Ensemble Learning: [Bagging](./chp07/bagging_clf.py), [Boosting](./chp07/adaboost_clf.py), [Stacking](./chp07/stacked_clf.py)
41 |
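The idea behind PageRank can be illustrated with a few lines of power iteration (a minimal sketch; the toy link matrix and damping factor below are illustrative assumptions, not taken from [page_rank.py](./chp07/page_rank.py)):

```
import numpy as np

# column-stochastic link matrix for a hypothetical 3-page web
A = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
d = 0.85                          # damping factor
n = A.shape[0]
r = np.ones(n) / n                # start from the uniform distribution
for _ in range(100):
    r = (1 - d) / n + d * A @ r   # power iteration on the Google matrix
print(r)                          # stationary page ranks (sum to ~1)
```
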
42 | **Chp08: Unsupervised Learning Algorithms**
43 | - [DP-Means](./chp08/dpmeans.py): Dirichlet Process (DP) K-Means
44 | - [EM-GMM](./chp08/gmm.py): EM algorithm for Gaussian Mixture Models
45 | - [PCA](./chp08/pca.py): Principal Component Analysis
46 | - [t-SNE](./chp08/manifold_learning.py): t-SNE manifold learning
47 |
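For intuition, PCA can be sketched in a few lines via an SVD of the centered data (an illustrative sketch, not necessarily how [pca.py](./chp08/pca.py) is implemented):

```
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)               # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                     # project onto the top-2 principal directions
explained = S**2 / np.sum(S**2)       # fraction of variance per component
print(Z.shape, explained[:2])
```
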
48 | **Chp09: Selected Unsupervised Learning Algorithms**
49 | - [LDA](./chp09/lda.py): Variational Inference for Latent Dirichlet Allocation
50 | - [KDE](./chp09/kde.py): Kernel Density Estimator
51 | - [TPO](./chp09/portfolio_opt.py): Tangent Portfolio Optimization
52 | - [ICE](./chp09/inv_cov.py): Inverse Covariance Estimation
53 | - [SA](./chp09/sim_annealing.py): Simulated Annealing
54 | - [GA](./chp09/ga.py): Genetic Algorithm
55 |
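As a rough illustration of simulated annealing, the accept/cool loop looks roughly as follows (the toy 1-D objective and cooling schedule are assumptions chosen for illustration, not the [sim_annealing.py](./chp09/sim_annealing.py) setup):

```
import numpy as np

np.random.seed(42)
f = lambda x: x**2 + 10*np.sin(x)       # toy objective with several local minima
x, T = 5.0, 1.0                         # initial state and temperature
for it in range(2000):
    x_new = x + 0.5*np.random.randn()   # random proposal
    delta = f(x_new) - f(x)
    if delta < 0 or np.random.rand() < np.exp(-delta/T):
        x = x_new                       # accept downhill moves, and uphill moves with prob exp(-delta/T)
    T *= 0.995                          # geometric cooling schedule
print(x, f(x))                          # close to the global minimum near x = -1.3
```
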
56 | **Chp10: Fundamental Deep Learning Algorithms**
57 | - [MLP](./chp10/mlp.py): Multi-Layer Perceptron
58 | - [LeNet](./chp10/lenet.py): LeNet for MNIST digit classification
59 | - [ResNet](./chp10/image_search.py): ResNet50 image search on CalTech101 dataset
60 | - [LSTM](./chp10/lstm_sentiment.py): LSTM sentiment classification of IMDB movie dataset
61 | - [MINN](./chp10/multi_input_nn.py): Multi-Input Neural Net model for sequence similarity on the Quora question pairs dataset
62 | - [OPT](./chp10/keras_optimizers.py): Neural Net Optimizers
63 |
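A minimal Keras MLP for MNIST digit classification looks roughly as follows (the layer sizes and training settings here are illustrative assumptions, not the [mlp.py](./chp10/mlp.py) configuration):

```
from tensorflow import keras

(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 784) / 255.0   # flatten and scale pixels to [0,1]
X_test = X_test.reshape(-1, 784) / 255.0

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=2, batch_size=128, validation_split=0.1)
print(model.evaluate(X_test, y_test, verbose=0))   # [test loss, test accuracy]
```
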
64 | **Chp11: Advanced Deep Learning Algorithms**
65 | - [LSTM-VAE](./chp11/lstm_vae.py): time-series anomaly detector
66 | - [MDN](./chp11/keras_mdn.py): mixture density network
67 | - [Transformer](./chp11/transformer.py): for text classification
68 | - [GNN](./chp11/spektral_gnn.py): graph neural network
69 |
70 | **Environment**
71 |
72 | To install the required libraries, run the following commands:
73 |
74 | ```
75 | python3 -m venv ml-algo
76 |
77 | source ml-algo/bin/activate        # Linux / macOS
78 | .\ml-algo\Scripts\activate.bat     # Windows CMD
79 | .\ml-algo\Scripts\Activate.ps1     # Windows PowerShell
80 |
81 | pip install -r requirements.txt
82 | ```
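
With the environment activated, each chapter script can be run directly from the repository root, for example:

```
python chp02/monte_carlo_pi.py
```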
83 |
84 | **Manning Early Access Preview (MEAP)**
85 |
86 | This book is now available in Manning Early Access Preview.
87 | Link to book: https://www.manning.com/books/machine-learning-algorithms-in-depth
88 |
89 |
90 |
91 |
92 |
93 | It will help you develop mathematical intuition for classic and modern ML algorithms, learn the fundamentals of Bayesian inference and deep learning, and become comfortable with the data structures and algorithmic paradigms used in ML!
94 |
95 | **Citation**
96 |
97 | You are welcome to cite the book as follows:
98 |
99 | ```
100 | @book{MLAlgoInDepth,
101 | author = {Vadim Smolyakov},
102 | title = {Machine Learning Algorithms in Depth},
103 | year = {2023},
104 | isbn = {9781633439214},
105 | publisher = {Manning Publications}
106 | }
107 | ```
108 |
--------------------------------------------------------------------------------
/chp02/binomial_tree.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | import seaborn as sns
4 | import matplotlib.pyplot as plt
5 |
6 | np.random.seed(42)
7 |
8 | def binomial_tree(mu, sigma, S0, N, T, step):
9 |
10 | #compute state price and probability
11 | u = np.exp(sigma * np.sqrt(step)) #up state price
12 | d = 1.0/u #down state price
13 | p = 0.5+0.5*(mu/sigma)*np.sqrt(step) #prob of up state
14 |
15 | #binomial tree simulation
16 | up_times = np.zeros((N, len(T)))
17 | down_times = np.zeros((N, len(T)))
18 | for idx in range(len(T)):
19 | up_times[:,idx] = np.random.binomial(T[idx]/step, p, N)
20 | down_times[:,idx] = T[idx]/step - up_times[:,idx]
21 |
22 | #compute terminal price
23 | ST = S0 * u**up_times * d**down_times
24 |
25 | #generate plots
26 | plt.figure()
27 | plt.plot(ST[:,0], color='b', alpha=0.5, label='1 month horizon')
28 | plt.plot(ST[:,1], color='r', alpha=0.5, label='1 year horizon')
29 | plt.xlabel('time step, day')
30 | plt.ylabel('price')
31 | plt.title('Binomial-Tree Stock Simulation')
32 | plt.legend()
33 | plt.show()
34 |
35 | plt.figure()
36 | plt.hist(ST[:,0], color='b', alpha=0.5, label='1 month horizon')
37 | plt.hist(ST[:,1], color='r', alpha=0.5, label='1 year horizon')
38 | plt.xlabel('price')
39 | plt.ylabel('count')
40 | plt.title('Binomial-Tree Stock Simulation')
41 | plt.legend()
42 | plt.show()
43 |
44 |
45 | if __name__ == "__main__":
46 |
47 | #model parameters
48 | mu = 0.1 #mean
49 | sigma = 0.15 #volatility
50 | S0 = 1 #starting price
51 |
52 | N = 10000 #number of simulations
53 | T = [21.0/252, 1.0] #time horizon in years
54 | step = 1.0/252 #time step in years
55 |
56 | binomial_tree(mu, sigma, S0, N, T, step)
57 |
58 |
--------------------------------------------------------------------------------
/chp02/gibbs_gauss.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | import itertools
5 | from numpy.linalg import inv
6 | from scipy.stats import multivariate_normal
7 |
8 | np.random.seed(42)
9 |
10 | class gibbs_gauss:
11 |
12 | def gauss_conditional(self, mu, Sigma, setA, x):
13 | #computes P(X_A | X_B = x) = N(mu_{A|B}, Sigma_{A|B})
14 | dim = len(mu)
15 | setU = set(range(dim))
16 | setB = setU.difference(setA)
17 | muA = np.array([mu[item] for item in setA]).reshape(-1,1)
18 | muB = np.array([mu[item] for item in setB]).reshape(-1,1)
19 | xB = np.array([x[item] for item in setB]).reshape(-1,1)
20 |
21 | Sigma_AA = []
22 | for (idx1, idx2) in itertools.product(setA, setA):
23 | Sigma_AA.append(Sigma[idx1][idx2])
24 | Sigma_AA = np.array(Sigma_AA).reshape(len(setA),len(setA))
25 |
26 | Sigma_AB = []
27 | for (idx1, idx2) in itertools.product(setA, setB):
28 | Sigma_AB.append(Sigma[idx1][idx2])
29 | Sigma_AB = np.array(Sigma_AB).reshape(len(setA),len(setB))
30 |
31 | Sigma_BB = []
32 | for (idx1, idx2) in itertools.product(setB, setB):
33 | Sigma_BB.append(Sigma[idx1][idx2])
34 | Sigma_BB = np.array(Sigma_BB).reshape(len(setB),len(setB))
35 |
36 | Sigma_BB_inv = inv(Sigma_BB)
37 | mu_AgivenB = muA + np.matmul(np.matmul(Sigma_AB, Sigma_BB_inv), xB - muB)
38 | Sigma_AgivenB = Sigma_AA - np.matmul(np.matmul(Sigma_AB, Sigma_BB_inv), np.transpose(Sigma_AB))
39 |
40 | return mu_AgivenB, Sigma_AgivenB
41 |
42 | def sample(self, mu, Sigma, xinit, num_samples):
43 | dim = len(mu)
44 | samples = np.zeros((num_samples, dim))
45 | x = xinit
46 | for s in range(num_samples):
47 | for d in range(dim):
48 | mu_AgivenB, Sigma_AgivenB = self.gauss_conditional(mu, Sigma, set([d]), x)
49 | x[d] = np.random.normal(mu_AgivenB, np.sqrt(Sigma_AgivenB))
50 | #end for
51 | samples[s,:] = np.transpose(x)
52 | #end for
53 | return samples
54 |
55 | if __name__ == "__main__":
56 |
57 | num_samples = 2000
58 | mu = [1, 1]
59 | Sigma = [[2,1], [1,1]]
60 | xinit = np.random.rand(len(mu),1)
61 | num_burnin = 1000
62 |
63 | gg = gibbs_gauss()
64 | gibbs_samples = gg.sample(mu, Sigma, xinit, num_samples)
65 |
66 | scipy_samples = multivariate_normal.rvs(mean=mu, cov=Sigma, size=num_samples, random_state=42)
67 |
68 | plt.figure()
69 | plt.scatter(gibbs_samples[num_burnin:,0], gibbs_samples[num_burnin:,1], c = 'blue', marker='s', alpha=0.8, label='Gibbs Samples')
70 | plt.scatter(scipy_samples[num_burnin:,0], scipy_samples[num_burnin:,1], c = 'red', alpha=0.8, label='Ground Truth Samples')
71 | plt.grid(True); plt.legend(); plt.xlim([-4,5])
72 | plt.title("Gibbs Sampling of Multivariate Gaussian"); plt.xlabel("X1"); plt.ylabel("X2")
73 | #plt.savefig("./figures/gibbs_gauss.png")
74 | plt.show()
75 |
76 |
77 |
--------------------------------------------------------------------------------
/chp02/imp_samp.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | from scipy.integrate import quad
5 | from scipy.stats import multivariate_normal
6 |
7 | np.random.seed(42)
8 |
9 | class importance_sampler:
10 | # E[f(x)] = int_x f(x)p(x)dx = int_x f(x)[p(x)/q(x)]q(x) dx
11 | # = sum_i f(x_i)w(x_i), where x_i ~ q(x)
12 | # e.g. for f(x) = 1(x \in A), E[f(x)] = P(A)
13 |
14 | def __init__(self, k=1.5, mu=0.8, sigma=np.sqrt(1.5), c=3):
15 | #target params p(x)
16 | self.k = k
17 |
18 | #proposal params q(x)
19 | self.mu = mu
20 | self.sigma = sigma
21 | self.c = c #fix c, s.t. p(x) < c q(x)
22 |
23 | def target_pdf(self, x):
24 | #p(x) ~ Chi(k=1.5)
25 | return (x**(self.k-1)) * np.exp(-x**2/2.0)
26 |
27 | def proposal_pdf(self, x):
28 | #q(x) ~ N(mu,sigma)
29 |         return self.c * 1.0/np.sqrt(2*np.pi*self.sigma**2) * np.exp(-(x-self.mu)**2/(2*self.sigma**2))
30 |
31 | def fx(self, x):
32 | #function of interest f(x), x >= 0
33 | return 2*np.sin((np.pi/1.5)*x)
34 |
35 | def sample(self, num_samples):
36 | #sample from the proposal
37 | x = multivariate_normal.rvs(self.mu, self.sigma, num_samples)
38 |
39 |         #discard negative samples (since f(x) is defined for x >= 0)
40 | idx = np.where(x >= 0)
41 | x_pos = x[idx]
42 |
43 | #compute importance weights
44 | isw = self.target_pdf(x_pos) / self.proposal_pdf(x_pos)
45 |
46 | #compute E[f(x)] = sum_i f(x_i)w(x_i), where x_i ~ q(x)
47 | fw = (isw/np.sum(isw))*self.fx(x_pos)
48 | f_est = np.sum(fw)
49 |
50 | return isw, f_est
51 |
52 |
53 | if __name__ == "__main__":
54 |
55 | num_samples = [10, 100, 1000, 10000, 100000, 1000000]
56 |
57 | F_est_iter, IS_weights_var_iter = [], []
58 | for k in num_samples:
59 | IS = importance_sampler()
60 | IS_weights, F_est = IS.sample(k)
61 | IS_weights_var = np.var(IS_weights/np.sum(IS_weights))
62 | F_est_iter.append(F_est)
63 | IS_weights_var_iter.append(IS_weights_var)
64 |
65 | #ground truth (numerical integration)
66 | k = 1.5
67 | I_gt, _ = quad(lambda x: 2.0*np.sin((np.pi/1.5)*x)*(x**(k-1))*np.exp(-x**2/2.0), 0, 5)
68 |
69 | #generate plots
70 | plt.figure()
71 | xx = np.linspace(0,8,100)
72 | plt.plot(xx, IS.target_pdf(xx), '-r', label='target pdf p(x)')
73 | plt.plot(xx, IS.proposal_pdf(xx), '--b', label='proposal pdf q(x)')
74 | plt.plot(xx, IS.fx(xx) * IS.target_pdf(xx), ':k', label='p(x)f(x) integrand')
75 |     plt.grid(True); plt.legend(); plt.xlabel("x"); plt.ylabel("density")
76 | plt.title("Importance Sampling Components")
77 | #plt.savefig('./figures/importance_sampling.png')
78 | plt.show()
79 |
80 | plt.figure()
81 | plt.hist(IS_weights, label = "IS weights")
82 | plt.grid(True); plt.legend();
83 | plt.title("Importance Weights Histogram")
84 | #plt.savefig('./figures/importance_weights.png')
85 | plt.show()
86 |
87 | plt.figure()
88 | plt.semilogx(num_samples, F_est_iter, '-b', label = "IS Estimate of E[f(x)]")
89 | plt.semilogx(num_samples, I_gt*np.ones(len(num_samples)), '--r', label = "Ground Truth")
90 | plt.grid(True); plt.legend(); plt.xlabel('iterations'); plt.ylabel("E[f(x)] estimate")
91 | plt.title("IS Estimate of E[f(x)]")
92 | #plt.savefig('./figures/importance_estimate.png')
93 | plt.show()
94 |
--------------------------------------------------------------------------------
/chp02/mh_gauss2d.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | from scipy.stats import uniform
5 | from scipy.stats import multivariate_normal
6 |
7 | np.random.seed(42)
8 |
9 | class mh_gauss:
10 | def __init__(self, dim, K, num_samples, target_mu, target_sigma, target_pi, proposal_mu, proposal_sigma):
11 | #target params: p(x) = \sum_k pi(k) N(x; mu_k,Sigma_k)
12 | self.dim = dim
13 | self.K = K
14 | self.num_samples = num_samples
15 | self.target_mu = target_mu
16 | self.target_sigma = target_sigma
17 | self.target_pi = target_pi
18 |
19 | #proposal params: q(x) = N(x; mu, Sigma)
20 | self.proposal_mu = proposal_mu
21 | self.proposal_sigma = proposal_sigma
22 |
23 | #sample chain params
24 | self.n_accept = 0
25 | self.alpha = np.zeros(self.num_samples)
26 | self.mh_samples = np.zeros((self.num_samples, self.dim))
27 |
28 | def target_pdf(self, x):
29 | #p(x) = \sum_k pi(k) N(x; mu_k,Sigma_k)
30 | prob = 0
31 | for k in range(self.K):
32 | prob += self.target_pi[k]*\
33 | multivariate_normal.pdf(x,self.target_mu[:,k],self.target_sigma[:,:,k])
34 | #end for
35 | return prob
36 |
37 | def proposal_pdf(self, x, mu):
38 | #q(x) = N(x; mu, Sigma)
39 | return multivariate_normal.pdf(x, mu, self.proposal_sigma)
40 |
41 | def sample(self):
42 | #draw init sample from proposal
43 | #import pdb; pdb.set_trace()
44 | x_init = multivariate_normal.rvs(self.proposal_mu, self.proposal_sigma, 1)
45 | self.mh_samples[0,:] = x_init
46 |
47 | for i in range(self.num_samples-1):
48 | x_curr = self.mh_samples[i,:]
49 | x_new = multivariate_normal.rvs(x_curr, self.proposal_sigma, 1)
50 |
51 | #MH ratio
52 | self.alpha[i] = self.proposal_pdf(x_curr, x_new) / self.proposal_pdf(x_new, x_curr) #q(x|x')/q(x'|x)
53 | self.alpha[i] = self.alpha[i] * (self.target_pdf(x_new)/self.target_pdf(x_curr)) #alpha x p(x')/p(x)
54 |
55 | #MH acceptance probability
56 | r = min(1, self.alpha[i])
57 | u = uniform.rvs(loc=0, scale=1, size=1)
58 | if (u <= r):
59 | self.n_accept += 1
60 | self.mh_samples[i+1,:] = x_new #accept
61 | else:
62 | self.mh_samples[i+1,:] = x_curr #reject
63 | #end for
64 | print("MH acceptance ratio: ", self.n_accept/float(self.num_samples))
65 |
66 | if __name__ == "__main__":
67 |
68 | dim = 2
69 | K = 2
70 | num_samples = 5000
71 | target_mu = np.zeros((dim,K))
72 | target_mu[:,0] = [4,0]
73 | target_mu[:,1] = [-4,0]
74 | target_sigma = np.zeros((dim, dim, K))
75 | target_sigma[:,:,0] = [[2,1],[1,1]]
76 | target_sigma[:,:,1] = [[1,0],[0,1]]
77 | target_pi = np.array([0.4, 0.6])
78 |
79 | proposal_mu = np.zeros((dim,1)).flatten()
80 | proposal_sigma = 10*np.eye(dim)
81 |
82 | mhg = mh_gauss(dim, K, num_samples, target_mu, target_sigma, target_pi, proposal_mu, proposal_sigma)
83 | mhg.sample()
84 |
85 | plt.figure()
86 | plt.scatter(mhg.mh_samples[:,0], mhg.mh_samples[:,1], label='MH samples')
87 | plt.grid(True); plt.legend()
88 | plt.title("Metropolis-Hastings Sampling of 2D Gaussian Mixture")
89 | plt.xlabel("X1"); plt.ylabel("X2")
90 | #plt.savefig("./figures/mh_gauss2d.png")
91 | plt.show()
--------------------------------------------------------------------------------
/chp02/monte_carlo_pi.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | np.random.seed(42)
5 |
6 | def pi_est(radius=1, num_iter=int(1e4)):
7 |
8 | X = np.random.uniform(-radius,+radius,num_iter)
9 | Y = np.random.uniform(-radius,+radius,num_iter)
10 |
11 | R2 = X**2 + Y**2
12 | inside = R2 < radius**2
13 | outside = ~inside
14 |
15 | samples = (2*radius)*(2*radius)*inside
16 |
17 | I_hat = np.mean(samples)
18 | pi_hat = I_hat/radius ** 2
19 | pi_hat_se = np.std(samples)/np.sqrt(num_iter)
20 | print("pi est: {} +/- {:f}".format(pi_hat, pi_hat_se))
21 |
22 | plt.figure()
23 | plt.scatter(X[inside],Y[inside], c='b', alpha=0.5)
24 | plt.scatter(X[outside],Y[outside], c='r', alpha=0.5)
25 | plt.show()
26 |
27 | if __name__ == "__main__":
28 |
29 | pi_est()
30 |
31 |
--------------------------------------------------------------------------------
/chp02/random_walk.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import seaborn as sns
3 | import matplotlib.pyplot as plt
4 |
5 | np.random.seed(42)
6 |
7 | def rand_walk(num_step, num_iter, moves):
8 |
9 | #random walk stats
10 | square_dist = np.zeros(num_iter)
11 | weights = np.zeros(num_iter)
12 |
13 | for it in range(num_iter):
14 |
15 | trial = 0
16 | i = 1
17 |
18 | #iterate until we have a non-crossing random walk
19 | while i != num_step-1:
20 |
21 | #init
22 | X, Y = 0, 0
23 | weight = 1
24 | lattice = np.zeros((2*num_step+1, 2*num_step+1))
25 | lattice[num_step+1,num_step+1] = 1
26 | path = np.array([0, 0])
27 | xx = num_step + 1 + X
28 | yy = num_step + 1 + Y
29 |
30 | print("iter: %d, trial %d" %(it, trial))
31 |
32 | for i in range(num_step):
33 |
34 | up = lattice[xx,yy+1]
35 | down = lattice[xx,yy-1]
36 | left = lattice[xx-1,yy]
37 | right = lattice[xx+1,yy]
38 |
39 | #compute available directions
40 | neighbors = np.array([1, 1, 1, 1]) - np.array([up, down, left, right])
41 |
42 | #avoid self-loops
43 | if (np.sum(neighbors) == 0):
44 | i = 1
45 | break
46 | #end if
47 |
48 | #compute importance weights: d0 x d1 x ... x d_{n-1}
49 | weight = weight * np.sum(neighbors)
50 |
51 | #sample a move direction
52 | direction = np.where(np.random.rand() < np.cumsum(neighbors/float(sum(neighbors))))
53 |
54 | X = X + moves[direction[0][0],0]
55 | Y = Y + moves[direction[0][0],1]
56 |
57 | #store sampled path
58 | path_new = np.array([X,Y])
59 | path = np.vstack((path,path_new))
60 |
61 | #update grid coordinates
62 | xx = num_step + 1 + X
63 | yy = num_step + 1 + Y
64 | lattice[xx,yy] = 1
65 | #end for
66 |
67 | trial = trial + 1
68 | #end while
69 |
70 | #compute square extension
71 | square_dist[it] = X**2 + Y**2
72 |
73 | #store importance weights
74 | weights[it] = weight
75 | #end for
76 |
77 | #compute mean square extension
78 | mean_square_dist = np.mean(weights * square_dist)/np.mean(weights)
79 | print("mean square dist: ", mean_square_dist)
80 |
81 | #generate plots
82 | plt.figure()
83 |     for i in range(num_step-1):
84 |         plt.plot([path[i,0], path[i+1,0]], [path[i,1], path[i+1,1]], '-ob') #draw each step of the last walk as a segment
85 | plt.title('random walk with no overlaps')
86 | plt.xlabel('X')
87 | plt.ylabel('Y')
88 | plt.show()
89 |
90 | plt.figure()
91 |     sns.histplot(square_dist)
92 | plt.xlim(0,np.max(square_dist))
93 | plt.title('square distance of the random walk')
94 | plt.xlabel('square distance (X^2 + Y^2)')
95 | plt.show()
96 |
97 |
98 | if __name__ == "__main__":
99 |
100 | num_step = 150 #number of steps in a random walk
101 | num_iter = 100 #number of iterations for averaging results
102 | moves = np.array([[0, 1],[0, -1],[-1, 0],[1, 0]]) #2-D moves
103 |
104 | rand_walk(num_step, num_iter, moves)
--------------------------------------------------------------------------------
/chp03/mean_field_mrf.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import seaborn as sns
5 | import matplotlib.pyplot as plt
6 |
7 | from PIL import Image
8 | from tqdm import tqdm
9 | from scipy.special import expit as sigmoid
10 | from scipy.stats import multivariate_normal
11 |
12 | np.random.seed(42)
13 | sns.set_style('whitegrid')
14 |
15 | class image_denoising:
16 |
17 | def __init__(self, img_binary, sigma=2, J=1):
18 |
19 | #mean-field parameters
20 | self.sigma = sigma #noise level
21 | self.y = img_binary + self.sigma*np.random.randn(M, N) #y_i ~ N(x_i; sigma^2);
22 | self.J = J #coupling strength (w_ij)
23 | self.rate = 0.5 #update smoothing rate
24 | self.max_iter = 15
25 | self.ELBO = np.zeros(self.max_iter)
26 | self.Hx_mean = np.zeros(self.max_iter)
27 |
28 | def mean_field(self):
29 |
30 | #Mean-Field VI
31 | print("running mean-field variational inference...")
32 | logodds = multivariate_normal.logpdf(self.y.flatten(), mean=+1, cov=self.sigma**2) - \
33 | multivariate_normal.logpdf(self.y.flatten(), mean=-1, cov=self.sigma**2)
34 | logodds = np.reshape(logodds, (M, N))
35 |
36 | #init
37 | p1 = sigmoid(logodds)
38 | mu = 2*p1-1 #mu_init
39 |
40 | a = mu + 0.5 * logodds
41 | qxp1 = sigmoid(+2*a) #q_i(x_i=+1)
42 | qxm1 = sigmoid(-2*a) #q_i(x_i=-1)
43 |
44 | logp1 = np.reshape(multivariate_normal.logpdf(self.y.flatten(), mean=+1, cov=self.sigma**2), (M, N))
45 | logm1 = np.reshape(multivariate_normal.logpdf(self.y.flatten(), mean=-1, cov=self.sigma**2), (M, N))
46 |
47 | for i in tqdm(range(self.max_iter)):
48 | muNew = mu
49 | for ix in range(N):
50 | for iy in range(M):
51 | pos = iy + M*ix
52 | neighborhood = pos + np.array([-1,1,-M,M])
53 | boundary_idx = [iy!=0,iy!=M-1,ix!=0,ix!=N-1]
54 | neighborhood = neighborhood[np.where(boundary_idx)[0]]
55 | xx, yy = np.unravel_index(pos, (M,N), order='F')
56 | nx, ny = np.unravel_index(neighborhood, (M,N), order='F')
57 |
58 | Sbar = self.J*np.sum(mu[nx,ny])
59 | muNew[xx,yy] = (1-self.rate)*muNew[xx,yy] + self.rate*np.tanh(Sbar + 0.5*logodds[xx,yy])
60 | self.ELBO[i] = self.ELBO[i] + 0.5*(Sbar * muNew[xx,yy])
61 | #end for
62 | #end for
63 | mu = muNew
64 |
65 | a = mu + 0.5 * logodds
66 | qxp1 = sigmoid(+2*a) #q_i(x_i=+1)
67 | qxm1 = sigmoid(-2*a) #q_i(x_i=-1)
68 | Hx = -qxm1*np.log(qxm1+1e-10) - qxp1*np.log(qxp1+1e-10) #entropy
69 |
70 | self.ELBO[i] = self.ELBO[i] + np.sum(qxp1*logp1 + qxm1*logm1) + np.sum(Hx)
71 | self.Hx_mean[i] = np.mean(Hx)
72 | #end for
73 | return mu
74 |
75 | if __name__ == "__main__":
76 |
77 | #load data
78 | print("loading data...")
79 | data = Image.open('./figures/bayes.bmp')
80 | img = np.double(data)
81 | img_mean = np.mean(img)
82 |     img_binary = +1*(img>img_mean) + -1*(img<=img_mean) #binarize image to {+1,-1}
83 |     M, N = img_binary.shape #global image dimensions used by image_denoising
84 | 
85 |     #run mean-field variational inference
86 |     mrf = image_denoising(img_binary, sigma=2, J=1)
87 |     mu = mrf.mean_field()
88 | 
89 |     #show the denoised mean field
90 |     plt.figure()
91 |     plt.imshow(mu, cmap='gray')
92 |     plt.title("mean-field denoised image")
93 |     plt.show()
--------------------------------------------------------------------------------
/chp04/binary_search.py:
--------------------------------------------------------------------------------
1 | def binary_search(arr, l, r, x):
2 |     #divide and conquer search for x in the sorted array arr[l..r]
3 |     if r >= l:
4 |         mid = l + (r - l)//2
5 | 
6 |         #found x at the midpoint
7 |         if arr[mid] == x:
8 |             return mid
9 |         elif arr[mid] > x:
10 | return binary_search(arr, l, mid-1, x)
11 | else:
12 | return binary_search(arr, mid+1, r, x)
13 | #end if
14 | else:
15 | return -1
16 |
17 | if __name__ == "__main__":
18 |
19 | x = 5
20 | arr = sorted([1, 7, 8, 3, 2, 5])
21 |
22 | print(arr)
23 | print("binary search:")
24 | result = binary_search(arr, 0, len(arr)-1, x)
25 |
26 | if result != -1:
27 | print("element {} is found at index {}.".format(x, result))
28 | else:
29 | print("element is not found.")
30 |
--------------------------------------------------------------------------------
/chp04/binomial_coeffs.py:
--------------------------------------------------------------------------------
1 |
2 | def binomial_coeffs1(n, k):
3 | #top down DP
4 | if (k == 0 or k == n):
5 | return 1
6 | if (memo[n][k] != -1):
7 | return memo[n][k]
8 |
9 | memo[n][k] = binomial_coeffs1(n-1, k-1) + binomial_coeffs1(n-1, k)
10 | return memo[n][k]
11 |
12 | def binomial_coeffs2(n, k):
13 | #bottom up DP
14 | for i in range(n+1):
15 | for j in range(min(i,k)+1):
16 | if (j == 0 or j == i):
17 | memo[i][j] = 1
18 | else:
19 | memo[i][j] = memo[i-1][j-1] + memo[i-1][j]
20 | #end if
21 | #end for
22 | #end for
23 | return memo[n][k]
24 |
25 | def print_array(memo):
26 | for i in range(len(memo)):
27 | print('\t'.join([str(x) for x in memo[i]]))
28 |
29 |
30 | if __name__ == "__main__":
31 |
32 | n = 5
33 | k = 2
34 |
35 | print("top down DP")
36 | memo = [[-1 for i in range(6)] for j in range(6)]
37 | nCk = binomial_coeffs1(n, k)
38 | print_array(memo)
39 | print("C(n={}, k={}) = {}".format(n,k,nCk))
40 |
41 | print("bottom up DP")
42 | memo = [[-1 for i in range(6)] for j in range(6)]
43 | nCk = binomial_coeffs2(n, k)
44 | print_array(memo)
45 | print("C(n={}, k={}) = {}".format(n,k,nCk))
46 |
47 |
48 |
--------------------------------------------------------------------------------
/chp04/knapsack_greedy.py:
--------------------------------------------------------------------------------
1 | class Item:
2 | def __init__(self, wt, val, ind):
3 | self.wt = wt
4 | self.val = val
5 | self.ind = ind
6 |         self.cost = val / wt #value-to-weight ratio (greedy heuristic)
7 |
8 | def __lt__(self, other):
9 | return self.cost < other.cost
10 |
11 | class FractionalKnapSack:
12 | def get_max_value(self, wt, val, capacity):
13 |
14 | item_list = []
15 | for i in range(len(wt)):
16 | item_list.append(Item(wt[i], val[i], i))
17 |
18 | # sorting items by cost heuristic
19 | item_list.sort(reverse = True) #O(nlogn)
20 |
21 | total_value = 0
22 | for i in item_list:
23 | cur_wt = int(i.wt)
24 | cur_val = int(i.val)
25 | if capacity - cur_wt >= 0:
26 | capacity -= cur_wt
27 | total_value += cur_val
28 | else:
29 | fraction = capacity / cur_wt
30 | total_value += cur_val * fraction
31 | capacity = int(capacity - (cur_wt * fraction))
32 | break
33 | return total_value
34 |
35 | if __name__ == "__main__":
36 | wt = [10, 20, 30]
37 | val = [60, 100, 120]
38 | capacity = 50
39 |
40 | fk = FractionalKnapSack()
41 | max_value = fk.get_max_value(wt, val, capacity)
42 | print("greedy fractional knapsack")
43 | print("maximum value: ", max_value)
44 |
45 |
--------------------------------------------------------------------------------
/chp04/subset_gen.py:
--------------------------------------------------------------------------------
1 | def search(k, n):
2 | if (k == n):
3 | #process subset
4 | print(subset)
5 | else:
6 | search(k+1, n)
7 | subset.append(k)
8 | search(k+1, n)
9 | subset.pop()
10 | #end if
11 |
12 | def bitseq(n):
13 | for b in range(1 << n):
14 | subset = []
15 | for i in range(n):
16 | if (b & 1 << i):
17 | subset.append(i)
18 | #end for
19 | print(subset)
20 | #end for
21 |
22 | if __name__ == "__main__":
23 | n = 4
24 | subset = []
25 | search(0, n) #recursive
26 |
27 | #subset = []
28 | #bitseq(n) #iterative
29 |
--------------------------------------------------------------------------------
/chp05/cart.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | from sklearn.datasets import load_iris
5 | from sklearn.metrics import accuracy_score
6 | from sklearn.model_selection import train_test_split
7 |
8 | class TreeNode():
9 | def __init__(self, gini, num_samples, num_samples_class, class_label):
10 | self.gini = gini #gini cost
11 | self.num_samples = num_samples #size of node
12 | self.num_samples_class = num_samples_class #number of node pts with label k
13 | self.class_label = class_label #predicted class label
14 |         self.feature_index = 0 #index of feature to split on
15 |         self.threshold = 0 #best threshold to split on
16 | self.left = None #left subtree pointer
17 | self.right = None #right subtree pointer
18 |
19 | class DecisionTreeClassifier():
20 | def __init__(self, max_depth = None):
21 | self.max_depth = max_depth
22 |
23 | def best_split(self, X_train, y_train):
24 | m = y_train.size
25 | if (m <= 1):
26 | return None, None
27 |
28 | #number of points of class k
29 | mk = [np.sum(y_train == k) for k in range(self.num_classes)]
30 |
31 | #gini of current node
32 | best_gini = 1.0 - sum((n / m) ** 2 for n in mk)
33 | best_idx, best_thr = None, None
34 |
35 | #iterate over all features
36 | for idx in range(self.num_features):
37 | # sort data along selected feature
38 |             thresholds, classes = zip(*sorted(zip(X_train[:, idx], y_train)))
39 |
40 | num_left = [0]*self.num_classes
41 | num_right = mk.copy()
42 |
43 | #iterate overall possible split positions
44 | for i in range(1, m):
45 |
46 | k = classes[i-1]
47 |
48 | num_left[k] += 1
49 | num_right[k] -= 1
50 |
51 | gini_left = 1.0 - sum(
52 | (num_left[x] / i) ** 2 for x in range(self.num_classes)
53 | )
54 |
55 | gini_right = 1.0 - sum(
56 | (num_right[x] / (m - i)) ** 2 for x in range(self.num_classes)
57 | )
58 |
59 | gini = (i * gini_left + (m - i) * gini_right) / m
60 |
61 | # check that we don't try to split two pts with identical values
62 | if thresholds[i] == thresholds[i - 1]:
63 | continue
64 |
65 | if (gini < best_gini):
66 | best_gini = gini
67 | best_idx = idx
68 | best_thr = (thresholds[i] + thresholds[i - 1]) / 2 # midpoint
69 | #end if
70 | #end for
71 | #end for
72 | return best_idx, best_thr
73 |
74 | def gini(self, y_train):
75 | m = y_train.size
76 | return 1.0 - sum((np.sum(y_train == k) / m) ** 2 for k in range(self.num_classes))
77 |
78 | def fit(self, X_train, y_train):
79 | self.num_classes = len(set(y_train))
80 | self.num_features = X_train.shape[1]
81 | self.tree = self.grow_tree(X_train, y_train)
82 |
83 | def grow_tree(self, X_train, y_train, depth=0):
84 |
85 | num_samples_class = [np.sum(y_train == k) for k in range(self.num_classes)]
86 | class_label = np.argmax(num_samples_class)
87 |
88 | node = TreeNode(
89 | gini=self.gini(y_train),
90 | num_samples=y_train.size,
91 | num_samples_class=num_samples_class,
92 | class_label=class_label,
93 | )
94 |
95 | # split recursively until maximum depth is reached
96 | if depth < self.max_depth:
97 | idx, thr = self.best_split(X_train, y_train)
98 | if idx is not None:
99 | indices_left = X_train[:, idx] < thr
100 | X_left, y_left = X_train[indices_left], y_train[indices_left]
101 | X_right, y_right = X_train[~indices_left], y_train[~indices_left]
102 | node.feature_index = idx
103 | node.threshold = thr
104 | node.left = self.grow_tree(X_left, y_left, depth + 1)
105 | node.right = self.grow_tree(X_right, y_right, depth + 1)
106 |
107 | return node
108 |
109 | def predict(self, X_test):
110 | return [self.predict_helper(x_test) for x_test in X_test]
111 |
112 | def predict_helper(self, x_test):
113 | node = self.tree
114 | while node.left:
115 | if x_test[node.feature_index] < node.threshold:
116 | node = node.left
117 | else:
118 | node = node.right
119 | return node.class_label
120 |
121 |
122 | if __name__ == "__main__":
123 |
124 | #load data
125 | iris = load_iris()
126 | X = iris.data[:, [2,3]]
127 | y = iris.target
128 |
129 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
130 |
131 | print("decision tree classifier...")
132 | tree_clf = DecisionTreeClassifier(max_depth = 3)
133 | tree_clf.fit(X_train, y_train)
134 |
135 | print("prediction...")
136 | y_pred = tree_clf.predict(X_test)
137 |
138 | tree_clf_acc = accuracy_score(y_test, y_pred)
139 | print("test set accuracy: ", tree_clf_acc)
140 |
--------------------------------------------------------------------------------
/chp05/naive_bayes.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import seaborn as sns
3 | import matplotlib.pyplot as plt
4 |
5 | from time import time
6 | from nltk.corpus import stopwords
7 | from nltk.tokenize import RegexpTokenizer
8 |
9 | from sklearn.metrics import accuracy_score
10 | from sklearn.datasets import fetch_20newsgroups
11 | from sklearn.model_selection import train_test_split
12 | from sklearn.feature_extraction.text import CountVectorizer
13 |
14 | sns.set_style("whitegrid")
15 | tokenizer = RegexpTokenizer(r'\w+')
16 | stop_words = set(stopwords.words('english'))
17 | stop_words.update(['s','t','m','1','2'])
18 |
19 | class naive_bayes:
20 | def __init__(self, K, D):
21 | self.K = K #number of classes
22 | self.D = D #dictionary size
23 |
24 | self.pi = np.ones(K) #class priors
25 | self.theta = np.ones((self.D, self.K)) #bernoulli parameters
26 |
27 | def fit(self, X_train, y_train):
28 |
29 | num_docs = X_train.shape[0]
30 | for doc in range(num_docs):
31 |
32 | label = y_train[doc]
33 | self.pi[label] += 1
34 |
35 | for word in range(self.D):
36 | if (X_train[doc][word] > 0):
37 | self.theta[word][label] += 1
38 | #end if
39 | #end for
40 | #end for
41 |
42 | #normalize pi and theta
43 | self.pi = self.pi/np.sum(self.pi)
44 | self.theta = self.theta/np.sum(self.theta, axis=0)
45 |
46 | def predict(self, X_test):
47 |
48 | num_docs = X_test.shape[0]
49 | logp = np.zeros((num_docs,self.K))
50 | for doc in range(num_docs):
51 | for kk in range(self.K):
52 | logp[doc][kk] = np.log(self.pi[kk])
53 | for word in range(self.D):
54 | if (X_test[doc][word] > 0):
55 | logp[doc][kk] += np.log(self.theta[word][kk])
56 | else:
57 | logp[doc][kk] += np.log(1-self.theta[word][kk])
58 | #end if
59 | #end for
60 | #end for
61 | #end for
62 | return np.argmax(logp, axis=1)
63 |
64 | if __name__ == "__main__":
65 |
66 | import nltk
67 | nltk.download('stopwords')
68 |
69 | #load data
70 | print("loading 20 newsgroups dataset...")
71 | tic = time()
72 | classes = ['sci.space', 'comp.graphics', 'rec.autos', 'rec.sport.hockey']
73 | dataset = fetch_20newsgroups(shuffle=True, random_state=0, remove=('headers','footers','quotes'), categories=classes)
74 | X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=0)
75 | toc = time()
76 | print("elapsed time: %.4f sec" %(toc - tic))
77 | print("number of training docs: ", len(X_train))
78 | print("number of test docs: ", len(X_test))
79 |
80 | print("vectorizing input data...")
81 | cnt_vec = CountVectorizer(tokenizer=tokenizer.tokenize, analyzer='word', ngram_range=(1,1), max_df=0.8, min_df=2, max_features=1000, stop_words=stop_words)
82 | cnt_vec.fit(X_train)
83 | toc = time()
84 | print("elapsed time: %.2f sec" %(toc - tic))
85 | vocab = cnt_vec.vocabulary_
86 | idx2word = {val: key for (key, val) in vocab.items()}
87 | print("vocab size: ", len(vocab))
88 |
89 | X_train_vec = cnt_vec.transform(X_train).toarray()
90 | X_test_vec = cnt_vec.transform(X_test).toarray()
91 |
92 | print("naive bayes model MLE inference...")
93 | K = len(set(y_train)) #number of classes
94 | D = len(vocab) #dictionary size
95 | nb_clf = naive_bayes(K, D)
96 | nb_clf.fit(X_train_vec, y_train)
97 |
98 | print("naive bayes prediction...")
99 | y_pred = nb_clf.predict(X_test_vec)
100 | nb_clf_acc = accuracy_score(y_test, y_pred)
101 | print("test set accuracy: ", nb_clf_acc)
--------------------------------------------------------------------------------
/chp05/perceptron.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import seaborn as sns
3 | import matplotlib.pyplot as plt
4 |
5 | from scipy.stats import randint
6 | from sklearn.datasets import load_iris
7 | from sklearn.metrics import confusion_matrix
8 | from sklearn.model_selection import train_test_split
9 |
10 | class perceptron:
11 | def __init__(self, num_epochs, dim):
12 | self.num_epochs = num_epochs
13 | self.theta0 = 0
14 | self.theta = np.zeros(dim)
15 |
16 | def fit(self, X_train, y_train):
17 | n = X_train.shape[0]
18 | dim = X_train.shape[1]
19 |
20 | k = 1
21 | for epoch in range(self.num_epochs):
22 | for i in range(n):
23 | #sample random point
24 |                 idx = randint.rvs(0, n, size=1)[0] #randint upper bound is exclusive
25 | #hinge loss
26 | if (y_train[idx] * (np.dot(self.theta, X_train[idx,:]) + self.theta0) <= 0):
27 | #update learning rate
28 | eta = pow(k+1, -1)
29 | k += 1
30 | #print("eta: ", eta)
31 |
32 | #update theta
33 | self.theta = self.theta + eta * y_train[idx] * X_train[idx, :]
34 | self.theta0 = self.theta0 + eta * y_train[idx]
35 | #end if
36 | print("epoch: ", epoch)
37 | print("theta: ", self.theta)
38 | print("theta0: ", self.theta0)
39 | #end for
40 | #end for
41 |
42 | def predict(self, X_test):
43 | n = X_test.shape[0]
44 | dim = X_test.shape[1]
45 |
46 | y_pred = np.zeros(n)
47 | for idx in range(n):
48 | y_pred[idx] = np.sign(np.dot(self.theta, X_test[idx,:]) + self.theta0)
49 | #end for
50 | return y_pred
51 |
52 | if __name__ == "__main__":
53 |
54 | #load dataset
55 | iris = load_iris()
56 | X = iris.data[:100,:]
57 | y = 2*iris.target[:100] - 1 #map to {+1,-1} labels
58 |
59 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
60 |
61 | #perceptron (binary) classifier
62 | clf = perceptron(num_epochs=5, dim=X.shape[1])
63 | clf.fit(X_train, y_train)
64 | y_pred = clf.predict(X_test)
65 |
66 | cmt = confusion_matrix(y_test, y_pred)
67 | acc = np.trace(cmt)/np.sum(np.sum(cmt))
68 |     print("perceptron accuracy: ", acc)
69 |
70 | #generate plots
71 | plt.figure()
72 | sns.heatmap(cmt, annot=True, fmt="d")
73 | plt.title("Confusion Matrix"); plt.xlabel("predicted"); plt.ylabel("actual")
74 | #plt.savefig("./figures/perceptron_acc.png")
75 | plt.show()
76 |
77 |
78 |
79 |
--------------------------------------------------------------------------------
/chp05/sgd_lr.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | def generate_data():
5 |
6 | n = 1000
7 | mu1 = np.array([1,1])
8 | mu2 = np.array([-1,-1])
9 | pik = np.array([0.4,0.6])
10 |
11 | X = np.zeros((n,2))
12 | y = np.zeros((n,1))
13 |
14 |     for i in range(n): #assign each of the n points to one of the two components
15 | u = np.random.rand()
16 | idx = np.where(u < np.cumsum(pik))[0]
17 |
18 | if (len(idx)==1):
19 | X[i,:] = np.random.randn(1,2) + mu1
20 | y[i] = 1
21 | else:
22 | X[i,:] = np.random.randn(1,2) + mu2
23 | y[i] = -1
24 |
25 | return X, y
26 |
27 |
28 | class sgdlr:
29 |
30 | def __init__(self):
31 |
32 | self.num_iter = 100
33 | self.lmbda = 1e-9
34 |
35 | self.tau0 = 10
36 | self.kappa = 1
37 | self.eta = np.zeros(self.num_iter)
38 |
39 | self.batch_size = 200
40 |
41 | def fit(self, X, y):
42 |
43 | #random init
44 | theta = np.random.randn(X.shape[1],1)
45 |
46 | #learning rate schedule
47 | for i in range(self.num_iter):
48 | self.eta[i] = (self.tau0+i)**(-self.kappa)
49 |
50 | #divide data in batches
51 | batch_data, batch_labels = self.make_batches(X,y,self.batch_size)
52 | num_batches = batch_data.shape[0]
53 | num_updates = 0
54 |
55 | J_hist = np.zeros((self.num_iter * num_batches,1))
56 | t_hist = np.zeros((self.num_iter * num_batches,1))
57 |
58 | for itr in range(self.num_iter):
59 | for b in range(num_batches):
60 | Xb = batch_data[b]
61 | yb = batch_labels[b]
62 |
63 | J_cost, J_grad = self.lr_objective(theta, Xb, yb, self.lmbda)
64 | theta = theta - self.eta[itr]*(num_batches*J_grad)
65 |
66 | J_hist[num_updates] = J_cost
67 | t_hist[num_updates] = np.linalg.norm(theta,2)
68 | num_updates = num_updates + 1
69 | print("iteration %d, cost: %f" %(itr, J_cost))
70 |
71 | y_pred = 2*(self.sigmoid(X.dot(theta)) > 0.5) - 1
72 | y_err = np.size(np.where(y_pred - y)[0])/float(y.shape[0])
73 | print("classification error:", y_err)
74 |
75 | self.generate_plots(X, J_hist, t_hist, theta)
76 | return theta
77 |
78 | def make_batches(self, X, y, batch_size):
79 | n = X.shape[0]
80 | d = X.shape[1]
81 | num_batches = int(np.ceil(n/batch_size))
82 |
83 | groups = np.tile(range(num_batches),batch_size)
84 | batch_data=np.zeros((num_batches,batch_size,d))
85 | batch_labels=np.zeros((num_batches,batch_size,1))
86 |
87 | for i in range(num_batches):
88 | batch_data[i,:,:] = X[groups==i,:]
89 | batch_labels[i,:] = y[groups==i]
90 |
91 | return batch_data, batch_labels
92 |
93 | def lr_objective(self, theta, X, y, lmbda):
94 |
95 | n = y.shape[0]
96 | y01 = (y+1)/2.0
97 |
98 | #compute the objective
99 | mu = self.sigmoid(X.dot(theta))
100 |
101 | #bound away from 0 and 1
102 | eps = np.finfo(float).eps
103 | mu = np.maximum(mu,eps)
104 | mu = np.minimum(mu,1-eps)
105 |
106 | #compute cost
107 | cost = -(1/n)*np.sum(y01*np.log(mu)+(1-y01)*np.log(1-mu))+np.sum(lmbda*theta*theta)
108 |
109 | #compute the gradient of the lr objective
110 | grad = X.T.dot(mu-y01) + 2*lmbda*theta
111 |
112 | #compute the Hessian of the lr objective
113 | #H = X.T.dot(np.diag(np.diag( mu*(1-mu) ))).dot(X) + 2*lmbda*np.eye(np.size(theta))
114 |
115 | return cost, grad
116 |
117 | def sigmoid(self, a):
118 | return 1/(1+np.exp(-a))
119 |
120 | def generate_plots(self, X, J_hist, t_hist, theta):
121 |
122 | plt.figure()
123 | plt.plot(J_hist)
124 | plt.title("logistic regression")
125 | plt.xlabel('iterations')
126 | plt.ylabel('cost')
127 | #plt.savefig('./figures/lrsgd_loss.png')
128 | plt.show()
129 |
130 | plt.figure()
131 | plt.plot(t_hist)
132 | plt.title("logistic regression")
133 | plt.xlabel('iterations')
134 | plt.ylabel('theta l2 norm')
135 | #plt.savefig('./figures/lrsgd_theta_norm.png')
136 | plt.show()
137 |
138 | plt.figure()
139 | plt.plot(self.eta)
140 | plt.title("logistic regression")
141 | plt.xlabel('iterations')
142 | plt.ylabel('learning rate')
143 | #plt.savefig('./figures/lrsgd_learning_rate.png')
144 | plt.show()
145 |
146 | plt.figure()
147 | x1 = np.linspace(np.min(X[:,0])-1,np.max(X[:,0])+1,10)
148 | plt.scatter(X[:,0], X[:,1])
149 | plt.plot(x1, -(theta[0]/theta[1])*x1)
150 | plt.title('logistic regression')
151 | plt.grid(True)
152 | plt.xlabel('X1')
153 | plt.ylabel('X2')
154 | #plt.savefig('./figures/lrsgd_clf.png')
155 | plt.show()
156 |
157 | if __name__ == "__main__":
158 |
159 | X, y = generate_data()
160 | sgd = sgdlr()
161 | theta = sgd.fit(X,y)
162 |
163 |
--------------------------------------------------------------------------------
/chp05/svm.py:
--------------------------------------------------------------------------------
1 | import cvxopt
2 | import numpy as np
3 |
4 | from sklearn.svm import SVC #for comparison only
5 | from sklearn.datasets import load_iris
6 | from sklearn.metrics import accuracy_score
7 | from sklearn.model_selection import train_test_split
8 |
9 | def rbf_kernel(gamma, **kwargs):
10 | def f(x1, x2):
11 | distance = np.linalg.norm(x1 - x2) ** 2
12 | return np.exp(-gamma * distance)
13 | return f
14 |
15 | class SupportVectorMachine(object):
16 | def __init__(self, C=1, kernel=rbf_kernel, power=4, gamma=None, coef=4):
17 | self.C = C
18 | self.kernel = kernel
19 | self.power = power
20 | self.gamma = gamma
21 | self.coef = coef
22 | self.lagr_multipliers = None
23 | self.support_vectors = None
24 | self.support_vector_labels = None
25 | self.intercept = None
26 |
27 | def fit(self, X, y):
28 |
29 | n_samples, n_features = np.shape(X)
30 |
31 | # Set gamma to 1/n_features by default
32 | if not self.gamma:
33 | self.gamma = 1 / n_features
34 |
35 | # Initialize kernel method with parameters
36 | self.kernel = self.kernel(
37 | power=self.power,
38 | gamma=self.gamma,
39 | coef=self.coef)
40 |
41 | # Calculate kernel matrix
42 | kernel_matrix = np.zeros((n_samples, n_samples))
43 | for i in range(n_samples):
44 | for j in range(n_samples):
45 | kernel_matrix[i, j] = self.kernel(X[i], X[j])
46 |
47 | # Define the quadratic optimization problem
48 | P = cvxopt.matrix(np.outer(y, y) * kernel_matrix, tc='d')
49 | q = cvxopt.matrix(np.ones(n_samples) * -1)
50 | A = cvxopt.matrix(y, (1, n_samples), tc='d')
51 | b = cvxopt.matrix(0, tc='d')
52 |
53 | if not self.C: #if its empty
54 | G = cvxopt.matrix(np.identity(n_samples) * -1)
55 | h = cvxopt.matrix(np.zeros(n_samples))
56 | else:
57 | G_max = np.identity(n_samples) * -1
58 | G_min = np.identity(n_samples)
59 | G = cvxopt.matrix(np.vstack((G_max, G_min)))
60 | h_max = cvxopt.matrix(np.zeros(n_samples))
61 | h_min = cvxopt.matrix(np.ones(n_samples) * self.C)
62 | h = cvxopt.matrix(np.vstack((h_max, h_min)))
63 |
64 | # Solve the quadratic optimization problem using cvxopt
65 | minimization = cvxopt.solvers.qp(P, q, G, h, A, b)
66 |
67 | # Lagrange multipliers
68 | lagr_mult = np.ravel(minimization['x'])
69 |
70 | # Extract support vectors
71 | # Get indexes of non-zero lagr. multipiers
72 | idx = lagr_mult > 1e-11
73 | # Get the corresponding lagr. multipliers
74 | self.lagr_multipliers = lagr_mult[idx]
75 | # Get the samples that will act as support vectors
76 | self.support_vectors = X[idx]
77 | # Get the corresponding labels
78 | self.support_vector_labels = y[idx]
79 |
80 | # Calculate intercept with first support vector
81 | self.intercept = self.support_vector_labels[0]
82 | for i in range(len(self.lagr_multipliers)):
83 | self.intercept -= self.lagr_multipliers[i] * self.support_vector_labels[
84 | i] * self.kernel(self.support_vectors[i], self.support_vectors[0])
85 |
86 |
87 | def predict(self, X):
88 | y_pred = []
89 | # Iterate through list of samples and make predictions
90 | for sample in X:
91 | prediction = 0
92 | # Determine the label of the sample by the support vectors
93 | for i in range(len(self.lagr_multipliers)):
94 | prediction += self.lagr_multipliers[i] * self.support_vector_labels[
95 | i] * self.kernel(self.support_vectors[i], sample)
96 | prediction += self.intercept
97 | y_pred.append(np.sign(prediction))
98 | return np.array(y_pred)
99 |
100 |
101 | def main():
102 |
103 | #load dataset
104 | iris = load_iris()
105 | X = iris.data[:100,:]
106 | y = 2*iris.target[:100] - 1 #map to {+1,-1} labels
107 |
108 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
109 | clf = SupportVectorMachine(kernel=rbf_kernel, gamma = 1)
110 | clf.fit(X_train, y_train)
111 | y_pred = clf.predict(X_test)
112 | accuracy = accuracy_score(y_test, y_pred)
113 | print ("Accuracy (scratch):", accuracy)
114 |
115 | clf_sklearn = SVC(gamma = 'auto')
116 | clf_sklearn.fit(X_train, y_train)
117 | y_pred2 = clf_sklearn.predict(X_test)
118 | accuracy = accuracy_score(y_test, y_pred2)
119 | print ("Accuracy :", accuracy)
120 |
121 | if __name__ == "__main__":
122 | main()
--------------------------------------------------------------------------------
/chp06/gp_reg.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | from scipy.spatial.distance import cdist
4 |
5 | np.random.seed(42)
6 |
7 | class GPreg:
8 |
9 | def __init__(self, X_train, y_train, X_test):
10 |
11 | self.L = 1.0
12 | self.keps = 1e-8
13 |
14 | self.muFn = self.mean_func(X_test)
15 | self.Kfn = self.kernel_func(X_test, X_test) + 1e-15*np.eye(np.size(X_test))
16 |
17 | self.X_train = X_train
18 | self.y_train = y_train
19 | self.X_test = X_test
20 |
21 | def mean_func(self, x):
22 | muFn = np.zeros(len(x)).reshape(-1,1)
23 | return muFn
24 |
25 | def kernel_func(self, x, z):
26 | sq_dist = cdist(x/self.L, z/self.L, 'euclidean')**2
27 | Kfn = 1.0 * np.exp(-sq_dist/2)
28 | return Kfn
29 |
30 | def compute_posterior(self):
31 | K = self.kernel_func(self.X_train, self.X_train) #K
32 | Ks = self.kernel_func(self.X_train, self.X_test) #K_*
33 | Kss = self.kernel_func(self.X_test, self.X_test) + self.keps*np.eye(np.size(self.X_test)) #K_**
34 | Ki = np.linalg.inv(K) #O(Ntrain^3)
35 |
36 | postMu = self.mean_func(self.X_test) + np.dot(np.transpose(Ks), np.dot(Ki, (self.y_train - self.mean_func(self.X_train))))
37 | postCov = Kss - np.dot(np.transpose(Ks), np.dot(Ki, Ks))
38 |
39 | self.muFn = postMu
40 | self.Kfn = postCov
41 |
42 | return None
43 |
44 | def generate_plots(self, X, num_samples=3):
45 | plt.figure()
46 | for i in range(num_samples):
47 | fs = self.gauss_sample(1)
48 | plt.plot(X, fs, '-k')
49 | #plt.plot(self.X_train, self.y_train, 'xk')
50 |
51 | mu = self.muFn.ravel()
52 | S2 = np.diag(self.Kfn)
53 | plt.fill(np.concatenate([X, X[::-1]]), np.concatenate([mu - 2*np.sqrt(S2), (mu + 2*np.sqrt(S2))[::-1]]), alpha=0.2, fc='b')
54 | plt.show()
55 |
56 | def gauss_sample(self, n):
57 | # returns n samples from a multivariate Gaussian distribution
58 | # S = AZ + mu
59 | A = np.linalg.cholesky(self.Kfn)
60 | Z = np.random.normal(loc=0, scale=1, size=(len(self.muFn),n))
61 | S = np.dot(A,Z) + self.muFn
62 | return S
63 |
64 | def main():
65 |
66 | # generate noise-less training data
67 | X_train = np.array([-4, -3, -2, -1, 1])
68 | X_train = X_train.reshape(-1,1)
69 | y_train = np.sin(X_train)
70 |
71 | # generate test data
72 | X_test = np.linspace(-5, 5, 50)
73 | X_test = X_test.reshape(-1,1)
74 |
75 | gp = GPreg(X_train, y_train, X_test)
76 | gp.generate_plots(X_test,3) #samples from GP prior
77 | gp.compute_posterior()
78 | gp.generate_plots(X_test,3) #samples from GP posterior
79 |
80 |
81 | if __name__ == "__main__":
82 | main()
83 |
84 |
--------------------------------------------------------------------------------
/chp06/hierarchical_regression.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import seaborn as sns
5 | import matplotlib.pyplot as plt
6 |
7 | import pymc3 as pm
8 |
9 | def main():
10 |
11 | #load data
12 | data = pd.read_csv('./data/radon.txt')
13 |
14 | county_names = data.county.unique()
15 | county_idx = data['county_code'].values
16 |
17 | with pm.Model() as hierarchical_model:
18 |
19 | # Hyperpriors
20 | mu_a = pm.Normal('mu_alpha', mu=0., sd=100**2)
21 | sigma_a = pm.Uniform('sigma_alpha', lower=0, upper=100)
22 | mu_b = pm.Normal('mu_beta', mu=0., sd=100**2)
23 | sigma_b = pm.Uniform('sigma_beta', lower=0, upper=100)
24 |
25 | # Intercept for each county, distributed around group mean mu_a
26 | a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=len(data.county.unique()))
27 | # Slope for each county, distributed around group mean mu_b
28 | b = pm.Normal('beta', mu=mu_b, sd=sigma_b, shape=len(data.county.unique()))
29 |
30 | # Model error
31 | eps = pm.Uniform('eps', lower=0, upper=100)
32 |
33 | # Expected value
34 | radon_est = a[county_idx] + b[county_idx] * data.floor.values
35 |
36 | # Data likelihood
37 | y_like = pm.Normal('y_like', mu=radon_est, sd=eps, observed=data.log_radon)
38 |
39 |
40 | with hierarchical_model:
41 | # Use ADVI for initialization
42 | mu, sds, elbo = pm.variational.advi(n=100000)
43 | step = pm.NUTS(scaling=hierarchical_model.dict_to_array(sds)**2, is_cov=True)
44 | hierarchical_trace = pm.sample(5000, step, start=mu)
45 |
46 |
47 | pm.traceplot(hierarchical_trace[500:])
48 | plt.show()
49 |
50 | if __name__ == "__main__":
51 | main()
52 |
--------------------------------------------------------------------------------
/chp06/knn_reg.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | from sklearn import datasets
5 | from sklearn.model_selection import train_test_split
6 |
7 | np.random.seed(42)
8 |
9 | class KNN():
10 |
11 | def __init__(self, K):
12 | self.K = K
13 |
14 | def euclidean_distance(self, x1, x2):
15 | dist = 0
16 | for i in range(len(x1)):
17 | dist += np.power((x1[i] - x2[i]), 2)
18 | return np.sqrt(dist)
19 |
20 | def knn_search(self, X_train, y_train, Q):
21 | y_pred = np.empty(Q.shape[0])
22 |
23 | for i, query in enumerate(Q):
24 | #get K nearest neighbors to query point
25 | idx = np.argsort([self.euclidean_distance(query, x) for x in X_train])[:self.K]
26 | #extract the labels of KNN training labels
27 | knn_labels = np.array([y_train[i] for i in idx])
28 | #label query sample as the average of knn_labels
29 | y_pred[i] = np.mean(knn_labels)
30 |
31 | return y_pred
32 |
33 |
34 | if __name__ == "__main__":
35 |
36 | plt.close('all')
37 |
38 | #iris dataset
39 | iris = datasets.load_iris()
40 | X = iris.data[:,:2]
41 | y = iris.target
42 |
43 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
44 |
45 | K = 4
46 | knn = KNN(K)
47 | y_pred = knn.knn_search(X_train, y_train, X_test)
48 |
49 | plt.figure(1)
50 | plt.scatter(X_train[:,0], X_train[:,1], s = 100, marker = 'x', color = 'r', label = 'data')
51 | plt.scatter(X_test[:,0], X_test[:,1], s = 100, marker = 'o', color = 'b', label = 'query')
52 | plt.title('K Nearest Neighbors (K=%d)'% K)
53 | plt.legend()
54 | plt.xlabel('X1')
55 | plt.ylabel('X2')
56 | plt.grid(True)
57 | plt.show()
58 |
59 |
60 |
61 |
--------------------------------------------------------------------------------
/chp06/ridge_reg.py:
--------------------------------------------------------------------------------
1 | import math
2 | import numpy as np
3 | import pandas as pd
4 |
5 | import matplotlib.pyplot as plt
6 | from sklearn.datasets import fetch_california_housing
7 |
8 | class ridge_reg():
9 |
10 | def __init__(self, n_iter=20, learning_rate=1e-3, lmbda=0.1):
11 | self.n_iter = n_iter
12 | self.learning_rate = learning_rate
13 | self.lmbda = lmbda
14 |
15 | def fit(self, X, y):
16 | #insert const 1 for bias term
17 | X = np.insert(X, 0, 1, axis=1)
18 |
19 | self.loss = []
20 | self.w = np.random.rand(X.shape[1])
21 |
22 | for i in range(self.n_iter):
23 | y_pred = X.dot(self.w)
24 | mse = np.mean(0.5*(y - y_pred)**2 + 0.5*self.lmbda*self.w.T.dot(self.w))
25 | self.loss.append(mse)
26 | print(" %d iter, mse: %.4f" %(i, mse))
27 | #compute gradient of NLL(w) wrt w
28 | grad_w = - (y - y_pred).dot(X) + self.lmbda*self.w
29 | #update the weights
30 | self.w -= self.learning_rate * grad_w
31 |
32 | def predict(self, X):
33 | #insert const 1 for bias term
34 | X = np.insert(X, 0, 1, axis=1)
35 | y_pred = X.dot(self.w)
36 | return y_pred
37 |
38 | if __name__ == "__main__":
39 |
40 | X, y = fetch_california_housing(return_X_y=True)
41 | X_reg = X[:,2].reshape(-1,1) #avg number of rooms
42 |     X_std = (X_reg - X_reg.mean())/X_reg.std() #standard scaling
43 | y_std = (y - y.mean())/y.std() #standard scaling
44 |
45 | X_std = X_std[:200,:]
46 | y_std = y_std[:200]
47 |
48 | rr = ridge_reg()
49 | rr.fit(X_std, y_std)
50 | y_pred = rr.predict(X_std)
51 |
52 | print(rr.w)
53 |
54 | plt.figure()
55 | plt.plot(rr.loss)
56 | plt.xlabel('Epoch')
57 | plt.ylabel('Loss')
58 | plt.tight_layout()
59 | plt.show()
60 |
61 | plt.figure()
62 | plt.scatter(X_std, y_std)
63 | plt.plot(np.linspace(-1,1), rr.w[1]*np.linspace(-1,1)+rr.w[0], c='red')
64 |     plt.xlim([X_std.min()-0.5, X_std.max()+0.5])
65 | plt.xlabel("scaled avg num of rooms")
66 | plt.ylabel("scaled house price")
67 | plt.show()
--------------------------------------------------------------------------------
/chp07/active_learning.py:
--------------------------------------------------------------------------------
1 | from __future__ import unicode_literals, division
2 | from scipy.sparse import csc_matrix, vstack
3 | from scipy.stats import entropy
4 | from collections import Counter
5 | import numpy as np
6 |
7 |
8 | class ActiveLearner(object):
9 |
10 | uncertainty_sampling_frameworks = [
11 | 'entropy',
12 | 'max_margin',
13 | 'least_confident',
14 | ]
15 |
16 | query_by_committee_frameworks = [
17 | 'vote_entropy',
18 | 'average_kl_divergence',
19 | ]
20 |
21 | def __init__(self, strategy='least_confident'):
22 | self.strategy = strategy
23 |
24 | def rank(self, clf, X_unlabeled, num_queries=None):
25 |
26 |         if num_queries is None:
27 |             num_queries = X_unlabeled.shape[0]
28 | 
29 |         elif isinstance(num_queries, float):
30 | num_queries = int(num_queries * X_unlabeled.shape[0])
31 |
32 | if self.strategy in self.uncertainty_sampling_frameworks:
33 | scores = self.uncertainty_sampling(clf, X_unlabeled)
34 |
35 | elif self.strategy in self.query_by_committee_frameworks:
36 | scores = self.query_by_committee(clf, X_unlabeled)
37 |
38 | else:
39 | raise ValueError("this strategy is not implemented.")
40 |
41 | rankings = np.argsort(-scores)[:num_queries]
42 | return rankings
43 |
44 | def uncertainty_sampling(self, clf, X_unlabeled):
45 | probs = clf.predict_proba(X_unlabeled)
46 |
47 | if self.strategy == 'least_confident':
48 | return 1 - np.amax(probs, axis=1)
49 |
50 | elif self.strategy == 'max_margin':
51 | margin = np.partition(-probs, 1, axis=1)
52 | return -np.abs(margin[:,0] - margin[:, 1])
53 |
54 | elif self.strategy == 'entropy':
55 | return np.apply_along_axis(entropy, 1, probs)
56 |
57 | def query_by_committee(self, clf, X_unlabeled):
58 | num_classes = len(clf[0].classes_)
59 | C = len(clf)
60 | preds = []
61 |
62 | if self.strategy == 'vote_entropy':
63 | for model in clf:
64 |                 y_out = list(map(int, model.predict(X_unlabeled))) #list() needed for fancy indexing in Python 3
65 | preds.append(np.eye(num_classes)[y_out])
66 |
67 | votes = np.apply_along_axis(np.sum, 0, np.stack(preds)) / C
68 | return np.apply_along_axis(entropy, 1, votes)
69 |
70 | elif self.strategy == 'average_kl_divergence':
71 | for model in clf:
72 | preds.append(model.predict_proba(X_unlabeled))
73 |
74 | consensus = np.mean(np.stack(preds), axis=0)
75 | divergence = []
76 | for y_out in preds:
77 | divergence.append(entropy(consensus.T, y_out.T))
78 |
79 | return np.apply_along_axis(np.mean, 0, np.stack(divergence))
80 |
--------------------------------------------------------------------------------
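The uncertainty-sampling strategies take a single fitted probabilistic classifier, while the query-by-committee strategies expect clf to be a list of fitted models. A hypothetical usage sketch of the committee path; the data and committee members below are made up for illustration and are not part of the chapter code:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_labeled, y_labeled = make_classification(n_samples=100, n_features=5, random_state=0)
X_pool, _ = make_classification(n_samples=50, n_features=5, random_state=1)

committee = [m.fit(X_labeled, y_labeled) for m in
             (LogisticRegression(), DecisionTreeClassifier(), GaussianNB())]

learner = ActiveLearner(strategy='vote_entropy')
query_idx = learner.rank(committee, X_pool, num_queries=10)  #most disagreed-upon pool points
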
/chp07/adaboost_clf.py:
--------------------------------------------------------------------------------
1 | import itertools
2 | import numpy as np
3 |
4 | import seaborn as sns
5 | import matplotlib.pyplot as plt
6 | import matplotlib.gridspec as gridspec
7 |
8 | from sklearn import datasets
9 |
10 | from sklearn.tree import DecisionTreeClassifier
11 | from sklearn.neighbors import KNeighborsClassifier
12 | from sklearn.linear_model import LogisticRegression
13 |
14 | from sklearn.ensemble import AdaBoostClassifier
15 | from sklearn.model_selection import cross_val_score, train_test_split
16 |
17 | from mlxtend.plotting import plot_learning_curves
18 | from mlxtend.plotting import plot_decision_regions
19 |
20 | def main():
21 |
22 | iris = datasets.load_iris()
23 | X, y = iris.data[:, 0:2], iris.target
24 |
25 | #XOR dataset
26 | #X = np.random.randn(200, 2)
27 |     #y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)
28 |
29 | clf = DecisionTreeClassifier(criterion='entropy', max_depth=1)
30 |
31 | num_est = [1, 2, 3, 10]
32 | label = ['AdaBoost (n_est=1)', 'AdaBoost (n_est=2)', 'AdaBoost (n_est=3)', 'AdaBoost (n_est=10)']
33 |
34 | fig = plt.figure(figsize=(10, 8))
35 | gs = gridspec.GridSpec(2, 2)
36 | grid = itertools.product([0,1],repeat=2)
37 |
38 | for n_est, label, grd in zip(num_est, label, grid):
39 | boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est)
40 | boosting.fit(X, y)
41 | ax = plt.subplot(gs[grd[0], grd[1]])
42 | fig = plot_decision_regions(X=X, y=y, clf=boosting, legend=2)
43 | plt.title(label)
44 |
45 | plt.show()
46 | #plt.savefig('./figures/boosting_ensemble.png')
47 |
48 | #plot learning curves
49 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
50 |
51 | boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=10)
52 |
53 | plt.figure()
54 | plot_learning_curves(X_train, y_train, X_test, y_test, boosting, print_model=False, style='ggplot')
55 | plt.show()
56 | #plt.savefig('./figures/boosting_ensemble_learning_curve.png')
57 |
58 | #Ensemble Size
59 | num_est = list(map(int, np.linspace(1,100,20)))
60 | bg_clf_cv_mean = []
61 | bg_clf_cv_std = []
62 | for n_est in num_est:
63 | print("num_est: ", n_est)
64 | ada_clf = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est)
65 | scores = cross_val_score(ada_clf, X, y, cv=3, scoring='accuracy')
66 | bg_clf_cv_mean.append(scores.mean())
67 | bg_clf_cv_std.append(scores.std())
68 |
69 | plt.figure()
70 | (_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5)
71 | for cap in caps:
72 | cap.set_markeredgewidth(1)
73 | plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('AdaBoost Ensemble');
74 | plt.show()
75 | #plt.savefig('./figures/boosting_ensemble_size.png')
76 |
77 | if __name__ == "__main__":
78 | main()
--------------------------------------------------------------------------------
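AdaBoostClassifier also exposes staged_predict, which replays the ensemble one boosting round at a time; plotting staged test accuracy is another way to see the effect of ensemble size. A sketch, assuming the train/test split from main() is in scope (note that recent scikit-learn releases rename base_estimator to estimator):

from sklearn.metrics import accuracy_score

boosting = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50)
boosting.fit(X_train, y_train)
staged_acc = [accuracy_score(y_test, y_hat) for y_hat in boosting.staged_predict(X_test)]

plt.figure()
plt.plot(range(1, len(staged_acc)+1), staged_acc)
plt.xlabel('boosting round'); plt.ylabel('test accuracy')
plt.show()
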
/chp07/bagging_clf.py:
--------------------------------------------------------------------------------
1 | import itertools
2 | import numpy as np
3 |
4 | import seaborn as sns
5 | import matplotlib.pyplot as plt
6 | import matplotlib.gridspec as gridspec
7 |
8 | from sklearn import datasets
9 |
10 | from sklearn.tree import DecisionTreeClassifier
11 | from sklearn.neighbors import KNeighborsClassifier
12 | from sklearn.linear_model import LogisticRegression
13 | from sklearn.ensemble import RandomForestClassifier
14 |
15 | from sklearn.ensemble import BaggingClassifier
16 | from sklearn.model_selection import cross_val_score, train_test_split
17 |
18 | from mlxtend.plotting import plot_learning_curves
19 | from mlxtend.plotting import plot_decision_regions
20 |
21 | def main():
22 |
23 | iris = datasets.load_iris()
24 | X, y = iris.data[:, 0:2], iris.target
25 |
26 | clf1 = DecisionTreeClassifier(criterion='entropy', max_depth=None)
27 | clf2 = KNeighborsClassifier(n_neighbors=1)
28 |
29 | bagging1 = BaggingClassifier(base_estimator=clf1, n_estimators=10, max_samples=0.8, max_features=0.8)
30 | bagging2 = BaggingClassifier(base_estimator=clf2, n_estimators=10, max_samples=0.8, max_features=0.8)
31 |
32 | label = ['Decision Tree', 'K-NN', 'Bagging Tree', 'Bagging K-NN']
33 | clf_list = [clf1, clf2, bagging1, bagging2]
34 |
35 | fig = plt.figure(figsize=(10, 8))
36 | gs = gridspec.GridSpec(2, 2)
37 | grid = itertools.product([0,1],repeat=2)
38 |
39 | for clf, label, grd in zip(clf_list, label, grid):
40 | scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
41 | print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label))
42 |
43 | clf.fit(X, y)
44 | ax = plt.subplot(gs[grd[0], grd[1]])
45 | fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
46 | plt.title(label)
47 |
48 | plt.show()
49 | #plt.savefig('./figures/bagging_ensemble.png')
50 |
51 | #plot learning curves
52 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
53 |
54 | plt.figure()
55 | plot_learning_curves(X_train, y_train, X_test, y_test, bagging1, print_model=False, style='ggplot')
56 | plt.show()
57 | #plt.savefig('./figures/bagging_ensemble_learning_curve.png')
58 |
59 | #Ensemble Size
60 | num_est = list(map(int, np.linspace(1,100,20)))
61 | bg_clf_cv_mean = []
62 | bg_clf_cv_std = []
63 | for n_est in num_est:
64 | print("num_est: ", n_est)
65 | bg_clf = BaggingClassifier(base_estimator=clf1, n_estimators=n_est, max_samples=0.8, max_features=0.8)
66 | scores = cross_val_score(bg_clf, X, y, cv=3, scoring='accuracy')
67 | bg_clf_cv_mean.append(scores.mean())
68 | bg_clf_cv_std.append(scores.std())
69 |
70 | plt.figure()
71 | (_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5)
72 | for cap in caps:
73 | cap.set_markeredgewidth(1)
74 | plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('Bagging Tree Ensemble');
75 | plt.show()
76 | #plt.savefig('./figures/bagging_ensemble_size.png')
77 |
78 | if __name__ == "__main__":
79 | main()
--------------------------------------------------------------------------------
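Because each bootstrap sample leaves out roughly a third of the training points, bagging comes with a built-in out-of-bag accuracy estimate that avoids a separate cross-validation loop. A sketch, assuming X and y (the two iris features) from main() are in scope; recent scikit-learn releases rename base_estimator to estimator:

bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, oob_score=True, random_state=0)
bag.fit(X, y)
print("out-of-bag accuracy: %.3f" % bag.oob_score_)
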
/chp07/bayes_opt_sklearn.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import seaborn as sns
5 | import matplotlib.pyplot as plt
6 |
7 | from sklearn.datasets import make_classification
8 | from sklearn.model_selection import cross_val_score
9 | from sklearn.ensemble import RandomForestClassifier as RFC
10 | from sklearn.svm import SVC
11 |
12 | from bayes_opt import BayesianOptimization
13 |
14 | np.random.seed(42)
15 |
16 | # Load data set and target values
17 | data, target = make_classification(
18 | n_samples=1000,
19 | n_features=45,
20 | n_informative=12,
21 | n_redundant=7
22 | )
23 | target = target.ravel()
24 |
25 | def svccv(gamma):
26 | val = cross_val_score(
27 | SVC(gamma=gamma, random_state=0),
28 | data, target, scoring='f1', cv=2
29 | ).mean()
30 |
31 | return val
32 |
33 | def rfccv(n_estimators, max_depth):
34 | val = cross_val_score(
35 | RFC(n_estimators=int(n_estimators),
36 | max_depth=int(max_depth),
37 | random_state=0
38 | ),
39 | data, target, scoring='f1', cv=2
40 | ).mean()
41 | return val
42 |
43 | if __name__ == "__main__":
44 |
45 | gp_params = {"alpha": 1e-5}
46 |
47 | #SVM
48 | svcBO = BayesianOptimization(svccv,
49 | {'gamma': (0.00001, 0.1)})
50 |
51 | svcBO.maximize(init_points=3, n_iter=4, **gp_params)
52 |
53 | #Random Forest
54 | rfcBO = BayesianOptimization(
55 | rfccv,
56 | {'n_estimators': (10, 300),
57 | 'max_depth': (2, 10)
58 | }
59 | )
60 | rfcBO.maximize(init_points=4, n_iter=4, **gp_params)
61 |
62 | print('Final Results')
63 | print(svcBO.max)
64 | print(rfcBO.max)
--------------------------------------------------------------------------------
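The bayes_opt API is easiest to see on a toy one-dimensional objective with a known maximum at x = 2. A minimal sketch; note that depending on the installed bayes_opt version, Gaussian-process parameters may need to be set on the optimizer object rather than passed to maximize as keyword arguments, as done in the script above:

from bayes_opt import BayesianOptimization

def black_box(x):
    return -(x - 2.0)**2 + 1.0  #maximum value 1.0 at x = 2

optimizer = BayesianOptimization(f=black_box, pbounds={'x': (-4.0, 4.0)}, random_state=1)
optimizer.maximize(init_points=5, n_iter=10)
print(optimizer.max)  #best target value and parameters found
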
/chp07/demo_logreg.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import seaborn as sns
3 | import matplotlib.pyplot as plt
4 |
5 | from active_learning import ActiveLearner
6 | from sklearn.metrics import accuracy_score
7 | from sklearn.datasets import make_classification
8 | from sklearn.linear_model import LogisticRegression
9 | from sklearn.model_selection import train_test_split
10 |
11 | np.random.seed(42)
12 |
13 | def main():
14 |
15 | #number of labeled points
16 | num_queries = 30
17 |
18 | #generate data
19 | data, target = make_classification(n_samples=200, n_features=2, n_informative=2,\
20 | n_redundant=0, n_classes=2, weights = [0.5, 0.5], random_state=0)
21 |
22 | #split into labeled and unlabeled pools
23 | X_train, X_unlabeled, y_train, y_oracle = train_test_split(data, target, test_size=0.2, random_state=0)
24 |
25 | #random sub-sampling
26 | rnd_idx = np.random.randint(0, X_train.shape[0], num_queries)
27 | X1 = X_train[rnd_idx,:]
28 | y1 = y_train[rnd_idx]
29 |
30 | clf1 = LogisticRegression()
31 | clf1.fit(X1, y1)
32 |
33 | y1_preds = clf1.predict(X_unlabeled)
34 | score1 = accuracy_score(y_oracle, y1_preds)
35 | print("random subsampling accuracy: ", score1)
36 |
37 | #plot 2D decision boundary: w2x2 + w1x1 + w0 = 0
38 | w0 = clf1.intercept_
39 | w1, w2 = clf1.coef_[0]
40 | xx = np.linspace(-1, 1, 100)
41 | decision_boundary = -w0/float(w2) - (w1/float(w2))*xx
42 |
43 | plt.figure()
44 | plt.scatter(data[rnd_idx,0], data[rnd_idx,1], c='black', marker='s', s=64, label='labeled')
45 | plt.scatter(data[target==0,0], data[target==0,1], c='blue', marker='o', alpha=0.5, label='class 0')
46 | plt.scatter(data[target==1,0], data[target==1,1], c='red', marker='o', alpha=0.5, label='class 1')
47 | plt.plot(xx, decision_boundary, linewidth = 2.0, c='black', linestyle = '--', label='log reg boundary')
48 | plt.title("Random Subsampling")
49 | plt.legend()
50 | plt.show()
51 |
52 | #active learning
53 | AL = ActiveLearner(strategy='entropy')
54 | al_idx = AL.rank(clf1, X_unlabeled, num_queries=num_queries)
55 |
56 | X2 = X_train[al_idx,:]
57 | y2 = y_train[al_idx]
58 |
59 | clf2 = LogisticRegression()
60 | clf2.fit(X2, y2)
61 |
62 | y2_preds = clf2.predict(X_unlabeled)
63 | score2 = accuracy_score(y_oracle, y2_preds)
64 | print("active learning accuracy: ", score2)
65 |
66 | #plot 2D decision boundary: w2x2 + w1x1 + w0 = 0
67 | w0 = clf2.intercept_
68 | w1, w2 = clf2.coef_[0]
69 | xx = np.linspace(-1, 1, 100)
70 | decision_boundary = -w0/float(w2) - (w1/float(w2))*xx
71 |
72 | plt.figure()
73 | plt.scatter(data[al_idx,0], data[al_idx,1], c='black', marker='s', s=64, label='labeled')
74 | plt.scatter(data[target==0,0], data[target==0,1], c='blue', marker='o', alpha=0.5, label='class 0')
75 | plt.scatter(data[target==1,0], data[target==1,1], c='red', marker='o', alpha=0.5, label='class 1')
76 | plt.plot(xx, decision_boundary, linewidth = 2.0, c='black', linestyle = '--', label='log reg boundary')
77 | plt.title("Uncertainty Sampling")
78 | plt.legend()
79 | plt.show()
80 |
81 | if __name__ == "__main__":
82 |
83 | main()
--------------------------------------------------------------------------------
/chp07/hmm.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from scipy.sparse import coo_matrix
3 | import matplotlib.pyplot as plt
4 |
5 | np.random.seed(42)
6 |
7 | class HMM():
8 | def __init__(self, d=3, k=2, n=10000):
9 | self.d = d #dimension of data
10 | self.k = k #dimension of latent state
11 | self.n = n #number of data points
12 |
13 | self.A = np.zeros((k,k)) #transition matrix
14 | self.E = np.zeros((k,d)) #emission matrix
15 | self.s = np.zeros(k) #initial state vector
16 |
17 |         self.x = np.zeros(self.n, dtype=int) #emitted observations (integer symbol indices)
18 |
19 | def normalize_mat(self, X, dim=1):
20 | z = np.sum(X, axis=dim)
21 | Xnorm = X/z.reshape(-1,1)
22 | return Xnorm
23 |
24 | def normalize_vec(self, v):
25 | z = sum(v)
26 | u = v / z
27 | return u, z
28 |
29 | def init_hmm(self):
30 |
31 | #initialize matrices at random
32 | self.A = self.normalize_mat(np.random.rand(self.k,self.k))
33 | self.E = self.normalize_mat(np.random.rand(self.k,self.d))
34 | self.s, _ = self.normalize_vec(np.random.rand(self.k))
35 |
36 | #generate markov observations
37 | z = np.random.choice(self.k, size=1, p=self.s)
38 | self.x[0] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel())
39 | for i in range(1, self.n):
40 | z = np.random.choice(self.k, size=1, p=self.A[z,:].ravel())
41 | self.x[i] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel())
42 | #end for
43 |
44 | def forward_backward(self):
45 |
46 | #construct sparse matrix X of emission indicators
47 | data = np.ones(self.n)
48 | row = self.x
49 | col = np.arange(self.n)
50 | X = coo_matrix((data, (row, col)), shape=(self.d, self.n))
51 |
52 | M = self.E * X
53 | At = np.transpose(self.A)
54 | c = np.zeros(self.n) #normalization constants
55 | alpha = np.zeros((self.k, self.n)) #alpha = p(z_t = j | x_{1:T})
56 | alpha[:,0], c[0] = self.normalize_vec(self.s * M[:,0])
57 | for t in range(1, self.n):
58 | alpha[:,t], c[t] = self.normalize_vec(np.dot(At, alpha[:,t-1]) * M[:,t])
59 | #end for
60 |
61 | beta = np.ones((self.k, self.n))
62 |         for t in range(self.n-2, -1, -1):
63 | beta[:,t] = np.dot(self.A, beta[:,t+1] * M[:,t+1])/c[t+1]
64 | #end for
65 | gamma = alpha * beta
66 |
67 | return gamma, alpha, beta, c
68 |
69 | def viterbi(self):
70 |
71 | #construct sparse matrix X of emission indicators
72 | data = np.ones(self.n)
73 | row = self.x
74 | col = np.arange(self.n)
75 | X = coo_matrix((data, (row, col)), shape=(self.d, self.n))
76 |
77 | #log scale for numerical stability
78 | s = np.log(self.s)
79 | A = np.log(self.A)
80 | M = np.log(self.E * X)
81 |
82 | Z = np.zeros((self.k, self.n))
83 | Z[:,0] = np.arange(self.k)
84 | v = s + M[:,0]
85 | for t in range(1, self.n):
86 | Av = A + v.reshape(-1,1)
87 | v = np.max(Av, axis=0)
88 | idx = np.argmax(Av, axis=0)
89 | v = v.reshape(-1,1) + M[:,t].reshape(-1,1)
90 | Z = Z[idx,:]
91 | Z[:,t] = np.arange(self.k)
92 | #end for
93 | llh = np.max(v)
94 | idx = np.argmax(v)
95 | z = Z[idx,:]
96 |
97 | return z, llh
98 |
99 |
100 | if __name__ == "__main__":
101 |
102 | hmm = HMM()
103 | hmm.init_hmm()
104 |
105 | gamma, alpha, beta, c = hmm.forward_backward()
106 | z, llh = hmm.viterbi()
107 |     print("sequence log-likelihood (forward pass): %.4f" % np.sum(np.log(c)))
108 |     print("Viterbi path log-likelihood: %.4f" % llh)
--------------------------------------------------------------------------------
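With the scaling used in forward_backward, each column of gamma is a normalized state posterior and the sequence log-likelihood is the sum of the log normalizers c_t. A short sanity-check sketch on a smaller chain (illustrative, not part of the chapter code):

hmm_small = HMM(d=3, k=2, n=200)
hmm_small.init_hmm()
gamma, alpha, beta, c = hmm_small.forward_backward()

print(np.allclose(gamma.sum(axis=0), 1.0))   #posteriors should sum to one at every time step
print("log p(x_1:T) =", np.sum(np.log(c)))   #forward-pass log-likelihood
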
/chp07/page_rank.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from numpy.linalg import norm
3 |
4 | np.random.seed(42)
5 |
6 | class page_rank():
7 |
8 | def __init__(self):
9 | self.max_iter = 100
10 | self.tolerance = 1e-5
11 |
12 | def power_iteration(self, A):
13 | n = np.shape(A)[0]
14 | v = np.random.rand(n)
15 | converged = False
16 | iter = 0
17 |
18 | while (not converged) and (iter < self.max_iter):
19 | old_v = v
20 | v = np.dot(A, v)
21 | v = v / norm(v)
22 | lambd = np.dot(v, np.dot(A, v))
23 | converged = norm(v - old_v) < self.tolerance
24 | iter += 1
25 | #end while
26 |
27 | return lambd, v
28 |
29 | if __name__ == "__main__":
30 |
31 | #construct a symmetric real matrix
32 | X = np.random.rand(10,5)
33 | A = np.dot(X.T, X)
34 |
35 | pr = page_rank()
36 | lambd, v = pr.power_iteration(A)
37 |
38 | print(lambd)
39 | print(v)
40 |
41 | #compare against np.linalg implementation
42 | eigval, eigvec = np.linalg.eig(A)
43 | idx = np.argsort(np.abs(eigval))[::-1]
44 | top_lambd = eigval[idx][0]
45 |     top_v = eigvec[:,idx][:,0] #leading eigenvector is a column, not a row
46 |
47 | assert np.allclose(lambd, top_lambd, 1e-3)
48 |     assert np.allclose(np.abs(v), np.abs(top_v), 1e-3) #eigenvectors match up to sign
49 |
50 |
51 |
52 |
53 |
--------------------------------------------------------------------------------
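The demo above applies power iteration to a generic symmetric matrix; for actual PageRank the input is the damped, column-stochastic Google matrix, whose dominant eigenvalue is 1 and whose dominant eigenvector (normalized to sum to one) gives the page ranks. A sketch on a made-up 4-page link graph:

links = np.array([[0, 0, 1, 0],   #column j lists the pages that page j links to
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 1, 0, 0]], dtype=float)
S = links / links.sum(axis=0)     #column-stochastic transition matrix (no dangling pages here)
n_pages = S.shape[0]
damping = 0.85
G = damping * S + (1 - damping) * np.ones((n_pages, n_pages)) / n_pages  #Google matrix

pr = page_rank()
lambd, v = pr.power_iteration(G)
ranks = np.abs(v) / np.abs(v).sum()
print("dominant eigenvalue: %.4f" % lambd)   #should be close to 1
print("page ranks: ", ranks)
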
/chp07/plot_smote_regular.py:
--------------------------------------------------------------------------------
1 | import seaborn as sns
2 | import matplotlib.pyplot as plt
3 |
4 | from sklearn.datasets import make_classification
5 | from sklearn.decomposition import PCA
6 |
7 | from imblearn.over_sampling import SMOTE
8 |
9 | def plot_resampling(ax, X, y, title):
10 | c0 = ax.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", marker="o", alpha=0.5)
11 | c1 = ax.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", marker="s", alpha=0.5)
12 | ax.set_title(title)
13 | ax.spines['top'].set_visible(False)
14 | ax.spines['right'].set_visible(False)
15 | ax.get_xaxis().tick_bottom()
16 | ax.get_yaxis().tick_left()
17 | ax.spines['left'].set_position(('outward', 10))
18 | ax.spines['bottom'].set_position(('outward', 10))
19 | ax.set_xlim([-6, 8])
20 | ax.set_ylim([-6, 6])
21 |
22 | return c0, c1
23 |
24 | def main():
25 | # generate the dataset
26 | X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7],
27 | n_informative=3, n_redundant=1, flip_y=0,
28 | n_features=20, n_clusters_per_class=1,
29 | n_samples=80, random_state=10)
30 |
31 | # fit PCA for visualization
32 | pca = PCA(n_components=2)
33 | X_vis = pca.fit_transform(X)
34 |
35 | # apply regular SMOTE
36 | method = SMOTE()
37 | X_res, y_res = method.fit_resample(X, y)
38 | X_res_vis = pca.transform(X_res)
39 |
40 | # generate plots
41 | f, (ax1, ax2) = plt.subplots(1, 2)
42 | c0, c1 = plot_resampling(ax1, X_vis, y, 'Original')
43 | plot_resampling(ax2, X_res_vis, y_res, 'SMOTE')
44 | ax1.legend((c0, c1), ('Class #0', 'Class #1'))
45 | plt.tight_layout()
46 | plt.show()
47 |
48 | if __name__ == "__main__":
49 | main()
50 |
--------------------------------------------------------------------------------
/chp07/plot_tomek_links.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import seaborn as sns
3 | import matplotlib.pyplot as plt
4 |
5 | from sklearn.model_selection import train_test_split
6 | from sklearn.utils import shuffle
7 | from imblearn.under_sampling import TomekLinks
8 |
9 | rng = np.random.RandomState(42)
10 |
11 | def main():
12 |
13 | #generate data
14 | n_samples_1 = 500
15 | n_samples_2 = 50
16 | X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2), 0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
17 | y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2))
18 | X_syn, y_syn = shuffle(X_syn, y_syn)
19 | X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)
20 |
21 | # remove Tomek links
22 | tl = TomekLinks(sampling_strategy='auto')
23 | X_resampled, y_resampled = tl.fit_resample(X_syn, y_syn)
24 | idx_resampled = tl.sample_indices_
25 | idx_samples_removed = np.setdiff1d(np.arange(X_syn.shape[0]),idx_resampled)
26 |
27 | #generate plots
28 | fig = plt.figure()
29 | ax = fig.add_subplot(1, 1, 1)
30 |
31 | idx_class_0 = y_resampled == 0
32 | plt.scatter(X_resampled[idx_class_0, 0], X_resampled[idx_class_0, 1], alpha=.8, marker = "o", label='Class #0')
33 | plt.scatter(X_resampled[~idx_class_0, 0], X_resampled[~idx_class_0, 1], alpha=.8, marker = "s", label='Class #1')
34 | plt.scatter(X_syn[idx_samples_removed, 0], X_syn[idx_samples_removed, 1], alpha=.8, marker = "v", label='Removed samples')
35 | plt.title('Undersampling: Tomek links')
36 | plt.legend()
37 | plt.show()
38 |
39 | if __name__ == "__main__":
40 | main()
--------------------------------------------------------------------------------
/chp07/stacked_clf.py:
--------------------------------------------------------------------------------
1 | import itertools
2 | import numpy as np
3 | import seaborn as sns
4 | import matplotlib.pyplot as plt
5 | import matplotlib.gridspec as gridspec
6 |
7 | from sklearn import datasets
8 |
9 | from sklearn.linear_model import LogisticRegression
10 | from sklearn.neighbors import KNeighborsClassifier
11 | from sklearn.naive_bayes import GaussianNB
12 | from sklearn.ensemble import RandomForestClassifier
13 | from mlxtend.classifier import StackingClassifier
14 |
15 | from sklearn.model_selection import cross_val_score, train_test_split
16 |
17 | from mlxtend.plotting import plot_learning_curves
18 | from mlxtend.plotting import plot_decision_regions
19 |
20 | def main():
21 |
22 | iris = datasets.load_iris()
23 | X, y = iris.data[:, 1:3], iris.target
24 |
25 | clf1 = KNeighborsClassifier(n_neighbors=1)
26 | clf2 = RandomForestClassifier(random_state=1)
27 | clf3 = GaussianNB()
28 | lr = LogisticRegression()
29 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
30 | meta_classifier=lr)
31 |
32 | label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
33 | clf_list = [clf1, clf2, clf3, sclf]
34 |
35 | fig = plt.figure(figsize=(10,8))
36 | gs = gridspec.GridSpec(2, 2)
37 | grid = itertools.product([0,1],repeat=2)
38 |
39 | clf_cv_mean = []
40 | clf_cv_std = []
41 | for clf, label, grd in zip(clf_list, label, grid):
42 |
43 | scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
44 | print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label))
45 | clf_cv_mean.append(scores.mean())
46 | clf_cv_std.append(scores.std())
47 |
48 | clf.fit(X, y)
49 | ax = plt.subplot(gs[grd[0], grd[1]])
50 | fig = plot_decision_regions(X=X, y=y, clf=clf)
51 | plt.title(label)
52 |
53 | plt.show()
54 | #plt.savefig("./figures/ensemble_stacking.png")
55 |
56 | #plot classifier accuracy
57 | plt.figure()
58 | (_, caps, _) = plt.errorbar(range(4), clf_cv_mean, yerr=clf_cv_std, c='blue', fmt='-o', capsize=5)
59 | for cap in caps:
60 | cap.set_markeredgewidth(1)
61 | plt.xticks(range(4), ['KNN', 'RF', 'NB', 'Stacking'], rotation='vertical')
62 | plt.ylabel('Accuracy'); plt.xlabel('Classifier'); plt.title('Stacking Ensemble');
63 | plt.show()
64 | #plt.savefig('./figures/stacking_ensemble_size.png')
65 |
66 | #plot learning curves
67 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
68 |
69 | plt.figure()
70 | plot_learning_curves(X_train, y_train, X_test, y_test, sclf, print_model=False, style='ggplot')
71 | plt.show()
72 | #plt.savefig('./figures/stacking_ensemble_learning_curve.png')
73 |
74 |
75 | if __name__ == "__main__":
76 | main()
77 |
--------------------------------------------------------------------------------
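scikit-learn ships its own StackingClassifier, which fits the meta-learner on out-of-fold predictions rather than on in-sample ones. A sketch of the equivalent ensemble, assuming X and y from main() are in scope; its cross-validated accuracy should be comparable to the mlxtend version:

from sklearn.ensemble import StackingClassifier as SkStackingClassifier

sk_stack = SkStackingClassifier(
    estimators=[('knn', KNeighborsClassifier(n_neighbors=1)),
                ('rf', RandomForestClassifier(random_state=1)),
                ('nb', GaussianNB())],
    final_estimator=LogisticRegression(), cv=3)

scores = cross_val_score(sk_stack, X, y, cv=3, scoring='accuracy')
print("Accuracy: %.2f (+/- %.2f) [sklearn stacking]" % (scores.mean(), scores.std()))
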
/chp08/dpmeans.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | import time
5 | from sklearn import metrics
6 | from sklearn.datasets import load_iris
7 |
8 | np.random.seed(42)
9 |
10 | class dpmeans:
11 |
12 | def __init__(self,X):
13 | # Initialize parameters for DP means
14 | self.K = 1
15 | self.K_init = 4
16 | self.d = X.shape[1]
17 | self.z = np.mod(np.random.permutation(X.shape[0]),self.K)+1
18 | self.mu = np.random.standard_normal((self.K, self.d))
19 | self.sigma = 1
20 | self.nk = np.zeros(self.K)
21 | self.pik = np.ones(self.K)/self.K
22 |
23 | #init mu
24 | self.mu = np.array([np.mean(X,0)])
25 |
26 | #init lambda
27 | self.Lambda = self.kpp_init(X,self.K_init)
28 |
29 | self.max_iter = 100
30 | self.obj = np.zeros(self.max_iter)
31 | self.em_time = np.zeros(self.max_iter)
32 |
33 | def kpp_init(self,X,k):
34 | #k++ init
35 | #lambda is max distance to k++ means
36 |
37 | [n,d] = np.shape(X)
38 | mu = np.zeros((k,d))
39 | dist = np.inf*np.ones(n)
40 |
41 |         mu[0,:] = X[np.random.randint(n),:]
42 | for i in range(1,k):
43 | D = X-np.tile(mu[i-1,:],(n,1))
44 | dist = np.minimum(dist, np.sum(D*D,1))
45 | idx = np.where(np.random.rand() < np.cumsum(dist/float(sum(dist))))
46 | mu[i,:] = X[idx[0][0],:]
47 | Lambda = np.max(dist)
48 |
49 | print("Lambda: ", Lambda)
50 |
51 | return Lambda
52 |
53 | def fit(self,X):
54 |
55 | obj_tol = 1e-3
56 | max_iter = self.max_iter
57 | [n,d] = np.shape(X)
58 |
59 | obj = np.zeros(max_iter)
60 | em_time = np.zeros(max_iter)
61 | print('running dpmeans...')
62 |
63 | for iter in range(max_iter):
64 | tic = time.time()
65 | dist = np.zeros((n,self.K))
66 |
67 | #assignment step
68 | for kk in range(self.K):
69 | Xm = X - np.tile(self.mu[kk,:],(n,1))
70 | dist[:,kk] = np.sum(Xm*Xm,1)
71 |
72 | #update labels
73 | dmin = np.min(dist,1)
74 | self.z = np.argmin(dist,1)
75 | idx = np.where(dmin > self.Lambda)
76 |
77 | if (np.size(idx) > 0):
78 | self.K = self.K + 1
79 | self.z[idx[0]] = self.K-1 #cluster labels in [0,...,K-1]
80 | self.mu = np.vstack([self.mu,np.mean(X[idx[0],:],0)])
81 | Xm = X - np.tile(self.mu[self.K-1,:],(n,1))
82 | dist = np.hstack([dist, np.array([np.sum(Xm*Xm,1)]).T])
83 |
84 | #update step
85 | self.nk = np.zeros(self.K)
86 | for kk in range(self.K):
87 | self.nk[kk] = self.z.tolist().count(kk)
88 | idx = np.where(self.z == kk)
89 | self.mu[kk,:] = np.mean(X[idx[0],:],0)
90 |
91 | self.pik = self.nk/float(np.sum(self.nk))
92 |
93 | #compute objective
94 | for kk in range(self.K):
95 | idx = np.where(self.z == kk)
96 | obj[iter] = obj[iter] + np.sum(dist[idx[0],kk],0)
97 | obj[iter] = obj[iter] + self.Lambda * self.K
98 |
99 | #check convergence
100 | if (iter > 0 and np.abs(obj[iter]-obj[iter-1]) < obj_tol*obj[iter]):
101 | print('converged in %d iterations\n'% iter)
102 | break
103 | em_time[iter] = time.time()-tic
104 | #end for
105 | self.obj = obj
106 | self.em_time = em_time
107 | return self.z, obj, em_time
108 |
109 | def compute_nmi(self, z1, z2):
110 | # compute normalized mutual information
111 |
112 | n = np.size(z1)
113 | k1 = np.size(np.unique(z1))
114 | k2 = np.size(np.unique(z2))
115 |
116 | nk1 = np.zeros((k1,1))
117 | nk2 = np.zeros((k2,1))
118 |
119 | for kk in range(k1):
120 | nk1[kk] = np.sum(z1==kk)
121 | for kk in range(k2):
122 | nk2[kk] = np.sum(z2==kk)
123 |
124 | pk1 = nk1/float(np.sum(nk1))
125 | pk2 = nk2/float(np.sum(nk2))
126 |
127 | nk12 = np.zeros((k1,k2))
128 | for ii in range(k1):
129 | for jj in range(k2):
130 | nk12[ii,jj] = np.sum((z1==ii)*(z2==jj))
131 | pk12 = nk12/float(n)
132 |
133 | Hx = -np.sum(pk1 * np.log(pk1 + np.finfo(float).eps))
134 | Hy = -np.sum(pk2 * np.log(pk2 + np.finfo(float).eps))
135 |
136 | Hxy = -np.sum(pk12 * np.log(pk12 + np.finfo(float).eps))
137 |
138 |         MI = Hx + Hy - Hxy
139 | nmi = MI/float(0.5*(Hx+Hy))
140 |
141 | return nmi
142 |
143 | def generate_plots(self,X):
144 |
145 | plt.close('all')
146 | plt.figure(0)
147 | for kk in range(self.K):
148 | #idx = np.where(self.z == kk)
149 | plt.scatter(X[self.z == kk,0], X[self.z == kk,1], \
150 | s = 100, marker = 'o', c = np.random.rand(3,), label = str(kk))
151 | #end for
152 | plt.xlabel('X1')
153 | plt.ylabel('X2')
154 | plt.legend()
155 | plt.title('DP-means clusters')
156 | plt.grid(True)
157 | plt.show()
158 |
159 | plt.figure(1)
160 | plt.plot(self.obj)
161 | plt.title('DP-means objective function')
162 | plt.xlabel('iterations')
163 | plt.ylabel('penalized l2 squared distance')
164 | plt.grid(True)
165 | plt.show()
166 |
167 | if __name__ == "__main__":
168 |
169 | iris = load_iris()
170 | X = iris.data
171 | y = iris.target
172 |
173 | dp = dpmeans(X)
174 | labels, obj, em_time = dp.fit(X)
175 | dp.generate_plots(X)
176 |
177 | nmi = dp.compute_nmi(y,labels)
178 | ari = metrics.adjusted_rand_score(y,labels)
179 |
180 | print("NMI: %.4f" % nmi)
181 | print("ARI: %.4f" % ari)
182 |
--------------------------------------------------------------------------------
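The penalty Lambda is what controls how many clusters DP-means creates: a point spawns a new cluster only when its squared distance to every existing mean exceeds Lambda. A sketch of that effect, assuming X (the iris features) from the __main__ block is in scope; the penalty values below are arbitrary and only meant to show the trend (smaller Lambda, more clusters):

for lam in [0.5, 2.0, 8.0]:          #hypothetical penalty values
    dp_tmp = dpmeans(X)
    dp_tmp.Lambda = lam              #override the kpp_init estimate
    dp_tmp.fit(X)
    print("Lambda = %.1f -> K = %d clusters" % (lam, dp_tmp.K))
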
/chp08/gmm.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | import matplotlib as mpl
4 | from sklearn.cluster import KMeans
5 | from scipy.stats import multivariate_normal
6 | from scipy.special import logsumexp
7 | from scipy import linalg
8 |
9 | np.random.seed(3)
10 |
11 | class GMM:
12 |
13 | def __init__(self, n=1e3, d=2, K=4):
14 | self.n = int(n) #number of data points
15 | self.d = d #data dimension
16 | self.K = K #number of clusters
17 |
18 | self.X = np.zeros((self.n, self.d))
19 |
20 | self.mu = np.zeros((self.d, self.K))
21 | self.sigma = np.zeros((self.d, self.d, self.K))
22 | self.pik = np.ones(self.K)/K
23 |
24 | def generate_data(self):
25 | #GMM generative model
26 | alpha0 = np.ones(self.K)
27 | pi = np.random.dirichlet(alpha0)
28 |
29 | #ground truth mu and sigma
30 | mu0 = np.random.randint(0, 10, size=(self.d, self.K)) - 5*np.ones((self.d, self.K))
31 | V0 = np.zeros((self.d, self.d, self.K))
32 | for k in range(self.K):
33 | eigen_mean = 0
34 | Q = np.random.normal(loc=0, scale=1, size=(self.d, self.d))
35 | D = np.diag(abs(eigen_mean + np.random.normal(loc=0, scale=1, size=self.d)))
36 |             V0[:,:,k] = np.dot(np.transpose(Q), np.dot(D, Q)) #PSD covariance Q^T D Q (matrix product, not element-wise)
37 |
38 | #sample data
39 | for i in range(self.n):
40 | z = np.random.multinomial(1,pi)
41 | k = np.nonzero(z)[0][0]
42 | self.X[i,:] = np.random.multivariate_normal(mean=mu0[:,k], cov=V0[:,:,k], size=1)
43 |
44 | plt.figure()
45 | plt.scatter(self.X[:,0], self.X[:,1], color='b', alpha=0.5)
46 | plt.title("Ground Truth Data"); plt.xlabel("X1"); plt.ylabel("X2")
47 | plt.show()
48 |
49 | return mu0, V0
50 |
51 | def gmm_em(self):
52 |
53 | #init mu with k-means
54 | kmeans = KMeans(n_clusters=self.K, random_state=42).fit(self.X)
55 | self.mu = np.transpose(kmeans.cluster_centers_)
56 |
57 | #init sigma
58 | for k in range(self.K):
59 | self.sigma[:,:,k] = np.eye(self.d)
60 |
61 | #EM algorithm
62 | max_iter = 10
63 | tol = 1e-5
64 | obj = np.zeros(max_iter)
65 | for iter in range(max_iter):
66 | print("EM iter ", iter)
67 | #E-step
68 | resp, llh = self.estep()
69 | #M-step
70 | self.mstep(resp)
71 | #check convergence
72 | obj[iter] = llh
73 | if (iter > 1 and obj[iter] - obj[iter-1] < tol*abs(obj[iter])):
74 | break
75 | #end if
76 | #end for
77 | plt.figure()
78 | plt.plot(obj)
79 | plt.title('EM-GMM objective'); plt.xlabel("iter"); plt.ylabel("log-likelihood")
80 | plt.show()
81 |
82 | def estep(self):
83 |
84 | log_r = np.zeros((self.n, self.K))
85 | for k in range(self.K):
86 | log_r[:,k] = multivariate_normal.logpdf(self.X, mean=self.mu[:,k], cov=self.sigma[:,:,k])
87 | #end for
88 | log_r = log_r + np.log(self.pik)
89 | L = logsumexp(log_r, axis=1)
90 | llh = np.sum(L)/self.n #log likelihood
91 | log_r = log_r - L.reshape(-1,1) #normalize
92 | resp = np.exp(log_r)
93 | return resp, llh
94 |
95 | def mstep(self, resp):
96 |
97 | nk = np.sum(resp, axis=0)
98 | self.pik = nk/self.n
99 | sqrt_resp = np.sqrt(resp)
100 | for k in range(self.K):
101 | #update mu
102 | rx = np.multiply(resp[:,k].reshape(-1,1), self.X)
103 | self.mu[:,k] = np.sum(rx, axis=0) / nk[k]
104 |
105 | #update sigma
106 | Xm = self.X - self.mu[:,k]
107 | Xm = np.multiply(sqrt_resp[:,k].reshape(-1,1), Xm)
108 | self.sigma[:,:,k] = np.maximum(0, np.dot(np.transpose(Xm), Xm) / nk[k] + 1e-5 * np.eye(self.d))
109 | #end for
110 |
111 | if __name__ == '__main__':
112 |
113 | gmm = GMM()
114 | mu0, V0 = gmm.generate_data()
115 | gmm.gmm_em()
116 |
117 | for k in range(mu0.shape[1]):
118 | print("cluster ", k)
119 | print("-----------")
120 | print("ground truth means:")
121 | print(mu0[:,k])
122 | print("ground truth covariance:")
123 | print(V0[:,:,k])
124 | #end for
125 |
126 | for k in range(mu0.shape[1]):
127 | print("cluster ", k)
128 | print("-----------")
129 | print("GMM-EM means:")
130 | print(gmm.mu[:,k])
131 | print("GMM-EM covariance:")
132 | print(gmm.sigma[:,:,k])
133 |
134 | plt.figure()
135 | ax = plt.axes()
136 | plt.scatter(gmm.X[:,0], gmm.X[:,1], color='b', alpha=0.5)
137 |
138 | for k in range(mu0.shape[1]):
139 |
140 | v, w = linalg.eigh(gmm.sigma[:,:,k])
141 | v = 2.0 * np.sqrt(2.0) * np.sqrt(v)
142 | u = w[0] / linalg.norm(w[0])
143 |
144 | # plot an ellipse to show the Gaussian component
145 | angle = np.arctan(u[1] / u[0])
146 | angle = 180.0 * angle / np.pi # convert to degrees
147 |         ell = mpl.patches.Ellipse(gmm.mu[:,k], v[0], v[1], angle=180.0 + angle, color='r', alpha=0.5)
148 | ax.add_patch(ell)
149 |
150 | # plot cluster centroids
151 | plt.scatter(gmm.mu[0,k], gmm.mu[1,k], s=80, marker='x', color='k', alpha=1)
152 | plt.title("Gaussian Mixture Model"); plt.xlabel("X1"); plt.ylabel("X2")
153 | plt.show()
154 |
--------------------------------------------------------------------------------
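A quick cross-check sketch against scikit-learn's EM implementation on the same sampled data; cluster labels may come back in a different order, so compare means and weights up to a permutation:

from sklearn.mixture import GaussianMixture

sk_gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
sk_gmm.fit(gmm.X)
print("sklearn mixture weights:", sk_gmm.weights_)   #compare with gmm.pik
print("sklearn means:\n", sk_gmm.means_)             #compare with gmm.mu.T
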
/chp08/manifold_learning.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | from time import time
5 | from sklearn import manifold
6 |
7 | from sklearn.datasets import load_digits
8 | from sklearn.neighbors import KDTree
9 |
10 | def plot_digits(X):
11 |
12 |     n_img_per_row = np.amin((20, int(np.sqrt(X.shape[0])))) #np.int was removed in NumPy 1.24
13 | img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
14 | for i in range(n_img_per_row):
15 | ix = 10 * i + 1
16 | for j in range(n_img_per_row):
17 | iy = 10 * j + 1
18 | img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
19 |
20 | plt.figure()
21 | plt.imshow(img, cmap=plt.cm.binary)
22 | plt.xticks([])
23 | plt.yticks([])
24 | plt.title('A selection from the 64-dimensional digits dataset')
25 |
26 | def mnist_manifold():
27 |
28 | digits = load_digits()
29 |
30 | X = digits.data
31 | y = digits.target
32 |
33 | num_classes = np.unique(y).shape[0]
34 |
35 | plot_digits(X)
36 |
37 | #TSNE
38 | #Barnes-Hut: O(d NlogN) where d is dim and N is the number of samples
39 | #Exact: O(d N^2)
40 | t0 = time()
41 | tsne = manifold.TSNE(n_components = 2, init = 'pca', method = 'barnes_hut', verbose = 1)
42 | X_tsne = tsne.fit_transform(X)
43 | t1 = time()
44 | print('t-SNE: %.2f sec' %(t1-t0))
45 | tsne.get_params()
46 |
47 | plt.figure()
48 | for k in range(num_classes):
49 | plt.plot(X_tsne[y==k,0], X_tsne[y==k,1],'o')
50 | plt.title('t-SNE embedding of digits dataset')
51 | plt.xlabel('X1')
52 | plt.ylabel('X2')
53 | axes = plt.gca()
54 | axes.set_xlim([X_tsne[:,0].min()-1,X_tsne[:,0].max()+1])
55 | axes.set_ylim([X_tsne[:,1].min()-1,X_tsne[:,1].max()+1])
56 | plt.show()
57 |
58 | #ISOMAP
59 | #1. Nearest neighbors search: O(d log k N log N)
60 | #2. Shortest path graph search: O(N^2(k+log(N))
61 | #3. Partial eigenvalue decomposition: O(dN^2)
62 |
63 | t0 = time()
64 | isomap = manifold.Isomap(n_neighbors = 5, n_components = 2)
65 | X_isomap = isomap.fit_transform(X)
66 | t1 = time()
67 | print('Isomap: %.2f sec' %(t1-t0))
68 | isomap.get_params()
69 |
70 | plt.figure()
71 | for k in range(num_classes):
72 | plt.plot(X_isomap[y==k,0], X_isomap[y==k,1], 'o', label=str(k), linewidth = 2)
73 | plt.title('Isomap embedding of the digits dataset')
74 | plt.xlabel('X1')
75 | plt.ylabel('X2')
76 | plt.show()
77 |
78 | #Use KD-tree to find k-nearest neighbors to a query image
79 | kdt = KDTree(X_isomap)
80 | Q = np.array([[-160, -30],[-102, 14]])
81 | kdt_dist, kdt_idx = kdt.query(Q,k=20)
82 | plot_digits(X[kdt_idx.ravel(),:])
83 |
84 | if __name__ == "__main__":
85 | mnist_manifold()
86 |
87 |
--------------------------------------------------------------------------------
/chp08/pca.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | np.random.seed(42)
5 |
6 | class PCA():
7 | def __init__(self, n_components = 2):
8 | self.n_components = n_components
9 |
10 | def covariance_matrix(self, X, Y=None):
11 | if Y is None:
12 | Y = X
13 | n_samples = np.shape(X)[0]
14 | covariance_matrix = (1 / (n_samples-1)) * (X - X.mean(axis=0)).T.dot(Y - Y.mean(axis=0))
15 | return covariance_matrix
16 |
17 | def transform(self, X):
18 | Sigma = self.covariance_matrix(X)
19 | eig_vals, eig_vecs = np.linalg.eig(Sigma)
20 |
21 | #sort from largest to smallest and select the first n_components
22 | idx = eig_vals.argsort()[::-1]
23 | eig_vals = eig_vals[idx][:self.n_components]
24 | eig_vecs = np.atleast_1d(eig_vecs[:,idx])[:, :self.n_components]
25 |
26 | #project the data onto principal components
27 | X_transformed = X.dot(eig_vecs)
28 |
29 | return X_transformed
30 |
31 | if __name__ == "__main__":
32 |
33 | n = 20
34 | d = 5
35 | X = np.random.rand(n,d)
36 |
37 | pca = PCA(n_components = 2)
38 | X_pca = pca.transform(X)
39 |
40 | print(X_pca)
41 |
42 | plt.figure()
43 | plt.scatter(X_pca[:,0], X_pca[:,1], color='b', alpha=0.5)
44 | plt.title("Principal Component Analysis"); plt.xlabel("X1"); plt.ylabel("X2")
45 | plt.show()
--------------------------------------------------------------------------------
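A cross-check sketch against scikit-learn's PCA: sklearn centers the data before projecting, while the class above projects the raw X, so the two score matrices agree only after removing the mean of X_pca, and each column is determined only up to a sign flip. Assumes X and X_pca from the __main__ block are in scope:

from sklearn.decomposition import PCA as SkPCA

X_sklearn = SkPCA(n_components=2).fit_transform(X)
X_pca_centered = X_pca - X_pca.mean(axis=0)   #remove the offset from projecting uncentered X
#should print True (up to a per-component sign flip and numerical tolerance)
print(np.allclose(np.abs(X_pca_centered), np.abs(X_sklearn), atol=1e-6))
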
/chp09/ga.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import string
3 |
4 | class GeneticAlgorithm():
5 |
6 | def __init__(self, target_string, population_size, mutation_rate):
7 | self.target = target_string
8 | self.population_size = population_size
9 | self.mutation_rate = mutation_rate
10 | self.letters = [" "] + list(string.ascii_letters)
11 |
12 | def initialize(self):
13 | # init population with random strings
14 | self.population = []
15 | for _ in range(self.population_size):
16 | individual = "".join(np.random.choice(self.letters, size=len(self.target)))
17 | self.population.append(individual)
18 |
19 | def calculate_fitness(self):
20 | #calculate fitness of each individual in a population
21 | population_fitness = []
22 | for individual in self.population:
23 | # calculate loss as the distance between characters
24 | loss = 0
25 | for i in range(len(individual)):
26 | letter_i1 = self.letters.index(individual[i])
27 | letter_i2 = self.letters.index(self.target[i])
28 | loss += abs(letter_i1 - letter_i2)
29 | fitness = 1 / (loss + 1e-6)
30 | population_fitness.append(fitness)
31 | return population_fitness
32 |
33 | def mutate(self, individual):
34 | #randomly change the characters with probability equal to mutation_rate
35 | individual = list(individual)
36 | for j in range(len(individual)):
37 | if np.random.random() < self.mutation_rate:
38 | individual[j] = np.random.choice(self.letters)
39 | return "".join(individual)
40 |
41 | def crossover(self, parent1, parent2):
42 | #create children from parents by crossover
43 | cross_i = np.random.randint(0, len(parent1))
44 | child1 = parent1[:cross_i] + parent2[cross_i:]
45 | child2 = parent2[:cross_i] + parent1[cross_i:]
46 | return child1, child2
47 |
48 | def run(self, iterations):
49 | self.initialize()
50 |
51 | for epoch in range(iterations):
52 | population_fitness = self.calculate_fitness()
53 |
54 | fittest_individual = self.population[np.argmax(population_fitness)]
55 | highest_fitness = max(population_fitness)
56 |
57 | if fittest_individual == self.target:
58 | break
59 |
60 | #select individual as a parent proportional to individual's fitness
61 | parent_probabilities = [fitness / sum(population_fitness) for fitness in population_fitness]
62 |
63 | #next generation
64 | new_population = []
65 | for i in np.arange(0, self.population_size, 2):
66 | #select two parents
67 | parent1, parent2 = np.random.choice(self.population, size=2, p=parent_probabilities, replace=False)
68 | #crossover to produce offspring
69 | child1, child2 = self.crossover(parent1, parent2)
70 | #save mutated offspring for next generation
71 | new_population += [self.mutate(child1), self.mutate(child2)]
72 |
73 | print("iter %d, closest candidate: %s, fitness: %.4f" %(epoch, fittest_individual, highest_fitness))
74 | self.population = new_population
75 |
76 | print("iter %d, final candidate: %s" %(epoch, fittest_individual))
77 |
78 | if __name__ == "__main__":
79 |
80 | target_string = "Genome"
81 | population_size = 50
82 | mutation_rate = 0.1
83 |
84 | ga = GeneticAlgorithm(target_string, population_size, mutation_rate)
85 | ga.run(iterations = 1000)
86 |
87 |
88 |
89 |
90 |
91 |
92 |
93 |
94 |
95 |
96 |
--------------------------------------------------------------------------------
/chp09/inv_cov.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from scipy import linalg
4 |
5 | from datetime import datetime
6 | import pytz
7 |
8 | from sklearn.datasets import make_sparse_spd_matrix
9 | from sklearn.covariance import GraphicalLassoCV, ledoit_wolf
10 | from sklearn.preprocessing import StandardScaler
11 | from sklearn import cluster, manifold
12 |
13 | import seaborn as sns
14 | import matplotlib.pyplot as plt
15 | from matplotlib.collections import LineCollection
16 |
17 | import pandas_datareader.data as web
18 |
19 | np.random.seed(42)
20 |
21 | def main():
22 |
23 | #generate data (synthetic)
24 | #num_samples = 60
25 | #num_features = 20
26 | #prec = make_sparse_spd_matrix(num_features, alpha=0.95, smallest_coef=0.4, largest_coef=0.7)
27 | #cov = linalg.inv(prec)
28 | #X = np.random.multivariate_normal(np.zeros(num_features), cov, size=num_samples)
29 | #X = StandardScaler().fit_transform(X)
30 |
31 | #generate data (actual)
32 | STOCKS = {
33 | 'SPY': 'S&P500',
34 | 'LQD': 'Bond_Corp',
35 | 'TIP': 'Bond_Treas',
36 | 'GLD': 'Gold',
37 | 'MSFT': 'Microsoft',
38 | 'XOM': 'Exxon',
39 | 'AMZN': 'Amazon',
40 | 'BAC': 'BofA',
41 | 'NVS': 'Novartis'}
42 |
43 | symbols, names = np.array(list(STOCKS.items())).T
44 |
45 | #load data
46 | #year, month, day, hour, minute, second, microsecond
47 | start = datetime(2015, 1, 1, 0, 0, 0, 0, pytz.utc)
48 | end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc)
49 |
50 | qopen, qclose = [], []
51 | data_close, data_open = pd.DataFrame(), pd.DataFrame()
52 | for ticker in symbols:
53 | price = web.DataReader(ticker, 'stooq', start, end)
54 | qopen.append(price['Open'])
55 | qclose.append(price['Close'])
56 |
57 | data_open = pd.concat(qopen, axis=1)
58 | data_open.columns = symbols
59 | data_close = pd.concat(qclose, axis=1)
60 | data_close.columns = symbols
61 |
62 | #per day variation in price for each symbol
63 | variation = data_close - data_open
64 | variation = variation.dropna()
65 |
66 | X = variation.values
67 | X /= X.std(axis=0) #standardize to use correlations rather than covariance
68 |
69 | #estimate inverse covariance
70 | graph = GraphicalLassoCV()
71 | graph.fit(X)
72 |
73 | gl_cov = graph.covariance_
74 | gl_prec = graph.precision_
75 | gl_alphas = graph.cv_alphas_
76 | gl_scores = graph.cv_results_['mean_test_score']
77 |
78 | plt.figure()
79 | sns.heatmap(gl_prec, xticklabels=names, yticklabels=names)
80 | plt.xticks(rotation=45)
81 | plt.yticks(rotation=45)
82 | plt.tight_layout()
83 | plt.show()
84 |
85 | plt.figure()
86 | plt.plot(gl_alphas, gl_scores, marker='o', color='b', lw=2.0, label='GraphLassoCV')
87 | plt.title("Graph Lasso Alpha Selection")
88 | plt.xlabel("alpha")
89 | plt.ylabel("score")
90 | plt.legend()
91 | plt.show()
92 |
93 | #cluster using affinity propagation
94 | _, labels = cluster.affinity_propagation(gl_cov)
95 | num_labels = np.max(labels)
96 |
97 | for i in range(num_labels+1):
98 | print("Cluster %i: %s" %((i+1), ', '.join(names[labels==i])))
99 |
100 | #find a low dim embedding for visualization
101 | node_model = manifold.LocallyLinearEmbedding(n_components=2, n_neighbors=6, eigen_solver='dense')
102 | embedding = node_model.fit_transform(X.T).T
103 |
104 | #generate plots
105 | plt.figure()
106 | plt.clf()
107 | ax = plt.axes([0.,0.,1.,1.])
108 | plt.axis('off')
109 |
110 | partial_corr = gl_prec
111 | d = 1 / np.sqrt(np.diag(partial_corr))
112 | non_zero = (np.abs(np.triu(partial_corr, k=1)) > 0.02) #connectivity matrix
113 |
114 | #plot the nodes
115 | plt.scatter(embedding[0], embedding[1], s = 100*d**2, c = labels, cmap = plt.cm.Spectral)
116 |
117 | #plot the edges
118 | start_idx, end_idx = np.where(non_zero)
119 | segments = [[embedding[:,start], embedding[:,stop]] for start, stop in zip(start_idx, end_idx)]
120 | values = np.abs(partial_corr[non_zero])
121 | lc = LineCollection(segments, zorder=0, cmap=plt.cm.hot_r, norm=plt.Normalize(0,0.7*values.max()))
122 | lc.set_array(values)
123 | lc.set_linewidths(2*values)
124 | ax.add_collection(lc)
125 |
126 | #plot the labels
127 | for index, (name, label, (x,y)) in enumerate(zip(names, labels, embedding.T)):
128 | plt.text(x,y,name,size=12)
129 |
130 | plt.show()
131 |
132 | if __name__ == "__main__":
133 | main()
134 |
--------------------------------------------------------------------------------
/chp09/kde.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | np.random.seed(14)
5 |
6 | class KDE():
7 |
8 | def __init__(self):
9 | #Histogram and Gaussian Kernel Estimator used to
10 | #analyze RNA-seq data for flux estimation of a T7 promoter
11 | self.G = 1e9 #length of genome in base pairs (bp)
12 | self.C = 1e3 #number of unique molecules
13 | self.L = 100 #length of a read, bp
14 | self.N = 1e6 #number of reads, L bp long
15 | self.M = 1e4 #number of unique read sequences, bp
16 | self.LN = 1000 #total length of assembled / mapped RNA-seq reads
17 | self.FDR = 0.05 #false discovery rate
18 |
19 | #uniform sampling (poisson model)
20 | self.lmbda = (self.N * self.L) / self.G #expected number of bases covered
21 | self.C_est = self.M/(1-np.exp(-self.lmbda)) #library size estimate
22 | self.C_cvrg = self.G - self.G * np.exp(-self.lmbda) #base coverage
23 | self.N_gaps = self.N * np.exp(-self.lmbda) #number of gaps (uncovered bases)
24 |
25 | #gamma prior sampling (negative binomial model)
26 | #X = "number of failures before rth success"
27 | self.k = 0.5 # dispersion parameter (fit to data)
28 | self.p = self.lmbda/(self.lmbda + 1/self.k) # success probability
29 | self.r = 1/self.k # number of successes
30 |
31 | #RNAP binding data (RNA-seq)
32 | self.data = np.random.negative_binomial(self.r, self.p, size=self.LN)
33 |
34 | def histogram(self):
35 | self.bin_delta = 1 #smoothing parameter
36 | self.bin_range = np.arange(1, np.max(self.data), self.bin_delta)
37 | self.bin_counts, _ = np.histogram(self.data, bins=self.bin_range)
38 |
39 | #histogram density estimation
40 | #P = integral_R p(x) dx, where X is in R^3
41 | #p(x) = K/(NxV), where K=number of points in region R
42 | #N=total number of points, V=volume of region R
43 |
44 | rnap_density_est = self.bin_counts/(sum(self.bin_counts) * self.bin_delta)
45 | return rnap_density_est
46 |
47 | def kernel(self):
48 | #Gaussian kernel density estimator with smoothing parameter h
49 |         #sum N Gaussians centered at each data point, parameterized by a common std dev h
50 |
51 | x_dim = 1 #dimension of x
52 | h = 10 #standard deviation
53 |
54 | rnap_density_support = np.arange(np.max(self.data))
55 | rnap_density_est = 0
56 |         for i in range(len(self.data)): #one Gaussian kernel per data point
57 | rnap_density_est += (1/(2*np.pi*h**2)**(x_dim/2.0))*np.exp(-(rnap_density_support - self.data[i])**2 / (2.0*h**2))
58 | #end for
59 |
60 | rnap_density_est = rnap_density_est / np.sum(rnap_density_est)
61 | return rnap_density_est
62 |
63 | if __name__ == "__main__":
64 |
65 | kde = KDE()
66 | est1 = kde.histogram()
67 | est2 = kde.kernel()
68 |
69 | plt.figure()
70 | plt.plot(est1, '-b', label='histogram')
71 | plt.plot(est2, '--r', label='gaussian kernel')
72 | plt.title("RNA-seq density estimate based on negative binomial model")
73 | plt.xlabel("read length, [base pairs]"); plt.ylabel("density"); plt.legend()
74 | plt.show()
--------------------------------------------------------------------------------
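For comparison, scipy's gaussian_kde performs the same Gaussian kernel smoothing but chooses the bandwidth automatically (Scott's rule) instead of the fixed h = 10 used above. A sketch, assuming the kde object and est2 from the __main__ block are in scope:

from scipy.stats import gaussian_kde

support = np.arange(np.max(kde.data))
scipy_kde = gaussian_kde(kde.data)                       #bandwidth chosen by Scott's rule
scipy_density = scipy_kde(support)
scipy_density = scipy_density / np.sum(scipy_density)    #normalize like est2 for plotting

plt.figure()
plt.plot(est2, '--r', label='fixed h=10 kernel')
plt.plot(scipy_density, '-k', label='scipy gaussian_kde')
plt.xlabel("read length, [base pairs]"); plt.ylabel("density"); plt.legend()
plt.show()
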
/chp09/lda.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | from sklearn.datasets import fetch_20newsgroups
4 | from sklearn.feature_extraction.text import TfidfVectorizer
5 | from wordcloud import WordCloud
6 | from scipy.special import digamma, gammaln
7 |
8 | np.random.seed(12)
9 |
10 | class LDA:
11 | def __init__(self, A, K):
12 |         self.N = A.shape[0] # number of words (vocabulary size)
13 |         self.D = A.shape[1] # number of documents
14 |         self.K = K          # number of topics
15 |
16 | self.A = A #term-document matrix
17 |
18 | #init word distribution beta
19 | self.eta = np.ones(self.N) #uniform dirichlet prior on words
20 | self.beta = np.zeros((self.N, self.K)) #NxK topic matrix
21 | for k in range(self.K):
22 | self.beta[:,k] = np.random.dirichlet(self.eta)
23 | self.beta[:,k] = self.beta[:,k] + 1e-6 #to avoid zero entries
24 | self.beta[:,k] = self.beta[:,k]/np.sum(self.beta[:,k])
25 | #end for
26 |
27 | #init topic proportions theta and cluster assignments z
28 | self.alpha = np.ones(self.K) #uniform dirichlet prior on topics
29 | self.z = np.zeros((self.N, self.D)) #cluster assignments z_{id}
30 | for d in range(self.D):
31 | theta = np.random.dirichlet(self.alpha)
32 | wdn_idx = np.nonzero(self.A[:,d])[0]
33 | for i in range(len(wdn_idx)):
34 | z_idx = np.argmax(np.random.multinomial(1, theta))
35 | self.z[wdn_idx[i],d] = z_idx #topic id
36 | #end for
37 | #end for
38 |
39 | #init variational parameters
40 | self.gamma = np.ones((self.D, self.K)) #topic proportions
41 | for d in range(self.D):
42 | theta = np.random.dirichlet(self.alpha)
43 | self.gamma[d,:] = theta
44 | #end for
45 |
46 | self.lmbda = np.transpose(self.beta) #np.ones((self.K, self.N))/self.N #word frequencies
47 |
48 | self.phi = np.zeros((self.D, self.N, self.K)) #assignments
49 | for d in range(self.D):
50 | for w in range(self.N):
51 | theta = np.random.dirichlet(self.alpha)
52 | self.phi[d,w,:] = np.random.multinomial(1, theta)
53 | #end for
54 | #end for
55 |
56 | def variational_inference(self):
57 |
58 | var_iter = 10
59 | llh = np.zeros(var_iter)
60 | llh_delta = np.zeros(var_iter)
61 |
62 | for iter in range(var_iter):
63 | print("VI iter: ", iter)
64 | J_old = self.elbo_objective()
65 | self.mean_field_update()
66 | J_new = self.elbo_objective()
67 |
68 | llh[iter] = J_old
69 | llh_delta[iter] = J_new - J_old
70 | #end for
71 |
72 | #update alpha and beta
73 | for k in range(self.K):
74 | self.alpha[k] = np.sum(self.gamma[:,k])
75 | self.beta[:,k] = self.lmbda[k,:] / np.sum(self.lmbda[k,:])
76 | #end for
77 |
78 | #update topic assignments
79 | for d in range(self.D):
80 | wdn_idx = np.nonzero(self.A[:,d])[0]
81 | for i in range(len(wdn_idx)):
82 | z_idx = np.argmax(self.phi[d,wdn_idx[i],:])
83 | self.z[wdn_idx[i],d] = z_idx #topic id
84 | #end for
85 | #end for
86 |
87 | plt.figure()
88 | plt.plot(llh); plt.title('LDA VI');
89 | plt.xlabel('mean field iterations'); plt.ylabel("ELBO")
90 | plt.show()
91 |
92 | return llh
93 |
94 | def mean_field_update(self):
95 |
96 | ndw = np.zeros((self.D, self.N)) #word counts for each document
97 | for d in range(self.D):
98 | doc = self.A[:,d]
99 | wdn_idx = np.nonzero(doc)[0]
100 |
101 | for i in range(len(wdn_idx)):
102 | ndw[d,wdn_idx[i]] += 1
103 | #end for
104 |
105 | #update gamma
106 | for k in range(self.K):
107 | self.gamma[d,k] = self.alpha[k] + np.dot(ndw[d,:], self.phi[d,:,k])
108 | #end for
109 |
110 | #update phi
111 | for w in range(len(wdn_idx)):
112 | self.phi[d,wdn_idx[w],:] = np.exp(digamma(self.gamma[d,:]) - digamma(np.sum(self.gamma[d,:])) + digamma(self.lmbda[:,wdn_idx[w]]) - digamma(np.sum(self.lmbda, axis=1)))
113 | if (np.sum(self.phi[d,wdn_idx[w],:]) > 0): #to avoid 0/0
114 | self.phi[d,wdn_idx[w],:] = self.phi[d,wdn_idx[w],:] / np.sum(self.phi[d,wdn_idx[w],:]) #normalize phi
115 | #end if
116 | #end for
117 |
118 | #end for
119 |
120 | #update lambda given ndw for all docs
121 | for k in range(self.K):
122 | self.lmbda[k,:] = self.eta
123 | for d in range(self.D):
124 | self.lmbda[k,:] += np.multiply(ndw[d,:], self.phi[d,:,k])
125 | #end for
126 | #end for
127 |
128 | def elbo_objective(self):
129 | #see Blei 2003
130 |
131 | T1_A = gammaln(np.sum(self.alpha)) - np.sum(gammaln(self.alpha))
132 | T1_B = 0
133 | for k in range(self.K):
134 | T1_B += np.dot(self.alpha[k]-1, digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1)))
135 | T1 = T1_A + T1_B
136 |
137 | T2 = 0
138 | for n in range(self.N):
139 | for k in range(self.K):
140 | T2 += self.phi[:,n,k] * (digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1)))
141 |
142 | T3 = 0
143 | for n in range(self.N):
144 | for k in range(self.K):
145 | T3 += self.phi[:,n,k] * np.log(self.beta[n,k])
146 |
147 | T4 = 0
148 | T4_A = -gammaln(np.sum(self.gamma, axis=1)) + np.sum(gammaln(self.gamma), axis=1)
149 | T4_B = 0
150 | for k in range(self.K):
151 |             T4_B += -(self.gamma[:,k]-1) * (digamma(self.gamma[:,k]) - digamma(np.sum(self.gamma, axis=1)))
152 | T4 = T4_A + T4_B
153 |
154 | T5 = 0
155 | for n in range(self.N):
156 | for k in range(self.K):
157 | T5 += -np.multiply(self.phi[:,n,k], np.log(self.phi[:,n,k] + 1e-6))
158 |
159 | T15 = T1 + T2 + T3 + T4 + T5
160 | J = sum(T15)/self.D #averaged over documents
161 | return J
162 |
163 | if __name__ == "__main__":
164 |
165 | #LDA parameters
166 | num_features = 1000 #vocabulary size
167 |     num_topics = 4 #fixed number of topics for LDA
168 |
169 | #20 newsgroups dataset
170 | categories = ['sci.crypt', 'comp.graphics', 'sci.space', 'talk.religion.misc']
171 |
172 | newsgroups = fetch_20newsgroups(shuffle=True, random_state=42, subset='train',
173 | remove=('headers', 'footers', 'quotes'), categories=categories)
174 |
175 | vectorizer = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=2, stop_words = 'english')
176 | dataset = vectorizer.fit_transform(newsgroups.data)
177 | A = np.transpose(dataset.toarray()) #term-document matrix
178 |
179 | lda = LDA(A=A, K=num_topics)
180 | llh = lda.variational_inference()
181 | id2word = {v:k for k,v in vectorizer.vocabulary_.items()}
182 |
183 | #display topics
184 | for k in range(num_topics):
185 | print("topic: ", k)
186 | print("----------")
187 | topic_words = ""
188 | top_words = np.argsort(lda.lmbda[k,:])[-10:]
189 | for i in range(len(top_words)):
190 | topic_words += id2word[top_words[i]] + " "
191 | print(id2word[top_words[i]])
192 |
193 | wordcloud = WordCloud(width = 800, height = 800,
194 | background_color ='white',
195 | min_font_size = 10).generate(topic_words)
196 |
197 | plt.figure()
198 | plt.imshow(wordcloud)
199 | plt.axis("off")
200 | plt.tight_layout(pad = 0)
201 | plt.show()
202 |
--------------------------------------------------------------------------------
/chp09/portfolio_opt.py:
--------------------------------------------------------------------------------
1 |
2 | import numpy as np
3 | import pandas as pd
4 | import matplotlib.pyplot as plt
5 |
6 | from sklearn.neighbors import KDTree
7 | from pandas.plotting import scatter_matrix
8 | from scipy.spatial import ConvexHull
9 |
10 | import pandas_datareader.data as web
11 | from datetime import datetime
12 | import pytz
13 |
14 | STOCKS = ['SPY','LQD','TIP','GLD','MSFT']
15 |
16 | np.random.seed(42)
17 |
18 | if __name__ == "__main__":
19 |
20 | plt.close("all")
21 |
22 | #load data
23 | #year, month, day, hour, minute, second, microsecond
24 | start = datetime(2012, 1, 1, 0, 0, 0, 0, pytz.utc)
25 | end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc)
26 |
27 | data = pd.DataFrame()
28 | series = []
29 | for ticker in STOCKS:
30 | price = web.DataReader(ticker, 'stooq', start, end)
31 | series.append(price['Close'])
32 |
33 | data = pd.concat(series, axis=1)
34 | data.columns = STOCKS
35 | data = data.dropna()
36 |
37 | #plot data correlations
38 | scatter_matrix(data, alpha=0.2, diagonal='kde')
39 | plt.show()
40 |
41 | #get current portfolio
42 | cash = 10000
43 | num_assets = np.size(STOCKS)
44 | cur_value = (1e4-5e3)*np.random.rand(num_assets,1) + 5e3
45 | tot_value = np.sum(cur_value)
46 | weights = cur_value.ravel()/float(tot_value)
47 |
48 | #compute portfolio risk
49 | Sigma = data.cov().values
50 | Corr = data.corr().values
51 | volatility = np.sqrt(np.dot(weights.T, np.dot(Sigma, weights)))
52 |
53 | plt.figure()
54 | plt.title('Correlation Matrix')
55 | plt.imshow(Corr, cmap='gray')
56 | plt.xticks(range(len(STOCKS)),data.columns)
57 | plt.yticks(range(len(STOCKS)),data.columns)
58 | plt.colorbar()
59 | plt.show()
60 |
61 | #generate random portfolio weights
62 | num_trials = 1000
63 | W = np.random.rand(num_trials, np.size(weights))
64 | W = W/np.sum(W,axis=1).reshape(num_trials,1) #normalize
65 |
66 | pv = np.zeros(num_trials) #portfolio value w'v
67 | ps = np.zeros(num_trials) #portfolio sigma: sqrt(w'Sw)
68 |
69 | avg_price = data.mean().values
70 | adj_price = avg_price
71 |
72 | for i in range(num_trials):
73 | pv[i] = np.sum(adj_price * W[i,:])
74 | ps[i] = np.sqrt(np.dot(W[i,:].T, np.dot(Sigma, W[i,:])))
75 |
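#convex hull of the sampled (sigma, value) points; its upper edge traces the efficient frontier of the random portfolios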
76 | points = np.vstack((ps,pv)).T
77 | hull = ConvexHull(points)
78 |
79 | plt.figure()
80 | plt.scatter(ps, pv, marker='o', color='b', linewidth = 3.0, label = 'tangent portfolio')
81 | plt.scatter(volatility, np.sum(adj_price * weights), marker = 's', color = 'r', linewidth = 3.0, label = 'current')
82 | plt.plot(points[hull.vertices,0], points[hull.vertices,1], linewidth = 2.0)
83 | plt.title('expected return vs volatility')
84 | plt.ylabel('expected price')
85 | plt.xlabel('portfolio std dev')
86 | plt.legend()
87 | plt.grid(True)
88 | plt.show()
89 |
90 | #query for nearest neighbor portfolio
91 | knn = 5
92 | kdt = KDTree(points)
93 | query_point = np.array([2, 115]).reshape(1,-1)
94 | kdt_dist, kdt_idx = kdt.query(query_point,k=knn)
95 | print("top-%d closest to query portfolios:" %knn)
96 | print("values: ", pv[kdt_idx.ravel()])
97 | print("sigmas: ", ps[kdt_idx.ravel()])
98 |
99 |
--------------------------------------------------------------------------------
/chp09/sim_annealing.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 | np.random.seed(42)
5 |
6 | class simulated_annealing():
7 | def __init__(self):
8 | self.max_iter = 1000
9 | self.conv_thresh = 1e-4
10 | self.conv_window = 10
11 |
12 | self.samples = np.zeros((self.max_iter, 2))
13 | self.energies = np.zeros(self.max_iter)
14 | self.temperatures = np.zeros(self.max_iter)
15 |
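#energy landscape: a 2-D multimodal test surface (the 'peaks' function) with several local optima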
16 | def target(self, x, y):
17 | z = 3*(1-x)**2 * np.exp(-x**2 - (y+1)**2) \
18 | - 10*(x/5 -x**3 - y**5) * np.exp(-x**2 - y**2) \
19 | - (1/3)*np.exp(-(x+1)**2 - y**2)
20 | return z
21 |
22 | def proposal(self, x, y):
23 | mean = np.array([x, y])
24 | cov = 1.1 * np.eye(2)
25 | x_new, y_new = np.random.multivariate_normal(mean, cov)
26 | return x_new, y_new
27 |
28 | def temperature_schedule(self, T, iter):
29 | return 0.9 * T
30 |
31 | def run(self, x_init, y_init):
32 |
33 | converged = False
34 | T = 1
35 | self.temperatures[0] = T
36 | num_accepted = 0
37 | x_old, y_old = x_init, y_init
38 | energy_old = self.target(x_init, y_init)
39 |
40 | iter = 1
41 | while not converged:
42 | print("iter: {:4d}, temp: {:.4f}, energy = {:.6f}".format(iter, T, energy_old))
43 | x_new, y_new = self.proposal(x_old, y_old)
44 | energy_new = self.target(x_new, y_new)
45 |
46 | #check convergence
47 | if iter > 2*self.conv_window:
48 | vals = self.energies[iter-self.conv_window : iter-1]
49 | if (np.std(vals) < self.conv_thresh):
50 | converged = True
51 | #end if
52 | #end if
53 |
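#Metropolis acceptance: moves to lower energy are always accepted;
#uphill moves are accepted with probability exp(-(energy_new - energy_old)/T)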
54 | alpha = np.exp((energy_old - energy_new)/T)
55 | r = np.minimum(1, alpha)
56 | u = np.random.uniform(0, 1)
57 | if u < r:
58 | x_old, y_old = x_new, y_new
59 | num_accepted += 1
60 | energy_old = energy_new
61 | #end if
62 | self.samples[iter, :] = np.array([x_old, y_old])
63 | self.energies[iter] = energy_old
64 |
65 | T = self.temperature_schedule(T, iter)
66 | self.temperatures[iter] = T
67 |
68 | iter = iter + 1
69 |
70 | if (iter >= self.max_iter): converged = True #stop once the preallocated sample arrays are full
71 | #end while
72 |
73 | niter = iter - 1
74 | acceptance_rate = num_accepted / niter
75 | print("acceptance rate: ", acceptance_rate)
76 |
77 | x_opt, y_opt = x_old, y_old
78 |
79 | return x_opt, y_opt, self.samples[:niter,:], self.energies[:niter], self.temperatures[:niter]
80 |
81 | if __name__ == "__main__":
82 |
83 | SA = simulated_annealing()
84 |
85 | nx, ny = (1000, 1000)
86 | x = np.linspace(-2, 2, nx)
87 | y = np.linspace(-2, 2, ny)
88 | xv, yv = np.meshgrid(x, y)
89 |
90 | z = SA.target(xv, yv)
91 | plt.figure()
92 | plt.contourf(x, y, z)
93 | plt.title("energy landscape")
94 | plt.show()
95 |
96 | #find global minimum by exhaustive search
97 | min_search = np.min(z)
98 | argmin_search = np.argwhere(z == min_search)
99 | ymin_idx, xmin_idx = argmin_search[0][0], argmin_search[0][1] #meshgrid rows index y, columns index x
100 | print("global minimum (exhaustive search): ", min_search)
101 | print("located at (x, y): ", x[xmin_idx], y[ymin_idx])
102 |
103 | #find global minimum by simulated annealing
104 | x_init, y_init = 0, 0
105 | x_opt, y_opt, samples, energies, temperatures = SA.run(x_init, y_init)
106 | print("global minimum (simulated annealing): ", energies[-1])
107 | print("located at (x, y): ", x_opt, y_opt)
108 |
109 | plt.figure()
110 | plt.plot(energies)
111 | plt.title("SA sampled energies")
112 | plt.show()
113 |
114 | plt.figure()
115 | plt.plot(temperatures)
116 | plt.title("Temperature Schedule")
117 | plt.show()
118 |
--------------------------------------------------------------------------------
/chp10/image_search.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import tensorflow as tf
4 | from tensorflow import keras
5 |
6 | from keras import Model
7 | from keras.applications.resnet50 import ResNet50
8 | from keras.preprocessing import image
9 | from keras.applications.resnet50 import preprocess_input
10 |
11 | from keras.callbacks import ModelCheckpoint
12 | from keras.callbacks import TensorBoard
13 | from keras.callbacks import LearningRateScheduler
14 | from keras.callbacks import EarlyStopping
15 |
16 | import os
17 | import random
18 | from PIL import Image
19 | from scipy.spatial import distance
20 | from sklearn.decomposition import PCA
21 |
22 | import matplotlib.pyplot as plt
23 |
24 | tf.keras.utils.set_random_seed(42)
25 |
26 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
27 | DATA_PATH = "/content/drive/MyDrive/data/101_ObjectCategories/"
28 |
29 | def get_closest_images(acts, query_image_idx, num_results=5):
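#brute-force nearest-neighbor search: rank all images by Euclidean distance to the query embedding
#and return the top matches, skipping the first hit (the query itself)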
30 |
31 | num_images, dim = acts.shape
32 | distances = []
33 | for image_idx in range(num_images):
34 | distances.append(distance.euclidean(acts[query_image_idx, :], acts[image_idx, :]))
35 | #end for
36 | idx_closest = sorted(range(len(distances)), key=lambda k: distances[k])[1:num_results+1]
37 |
38 | return idx_closest
39 |
40 | def get_concatenated_images(images, indexes, thumb_height):
41 |
42 | thumbs = []
43 | for idx in indexes:
44 | img = Image.open(images[idx])
45 | img = img.resize((int(img.width * thumb_height / img.height), int(thumb_height)), Image.LANCZOS)
46 | if img.mode != "RGB":
47 | img = img.convert("RGB")
48 | thumbs.append(img)
49 | concat_image = np.concatenate([np.asarray(t) for t in thumbs], axis=1)
50 |
51 | return concat_image
52 |
53 | if __name__ == "__main__":
54 |
55 | num_images = 5000
56 | images = [os.path.join(dp,f) for dp, dn, filenames in os.walk(DATA_PATH) for f in filenames \
57 | if os.path.splitext(f)[1].lower() in ['.jpg','.png','.jpeg']]
58 | images = [images[i] for i in sorted(random.sample(range(len(images)), num_images))]
59 |
60 | #CNN encodings
61 | base_model = ResNet50(weights='imagenet')
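#drop the classification head: the 2048-dim 'avg_pool' activations serve as image embeddings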
62 | model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
63 |
64 | activations = []
65 | for idx, image_path in enumerate(images):
66 | if idx % 100 == 0:
67 | print('getting activations for %d/%d image...' %(idx,len(images)))
68 | img = image.load_img(image_path, target_size=(224, 224))
69 | x = image.img_to_array(img)
70 | x = np.expand_dims(x, axis=0)
71 | x = preprocess_input(x)
72 | features = model.predict(x)
73 | activations.append(features.flatten().reshape(1,-1))
74 |
75 | # reduce activation dimension
76 | print('computing PCA...')
77 | acts = np.concatenate(activations, axis=0)
78 | pca = PCA(n_components=300)
79 | pca.fit(acts)
80 | acts = pca.transform(acts)
81 |
82 | print('image search...')
83 | query_image_idx = int(num_images*random.random())
84 | idx_closest = get_closest_images(acts, query_image_idx)
85 | query_image = get_concatenated_images(images, [query_image_idx], 300)
86 | results_image = get_concatenated_images(images, idx_closest, 300)
87 |
88 | plt.figure()
89 | plt.imshow(query_image)
90 | plt.title("query image (%d)" %query_image_idx)
91 | plt.show()
92 | #plt.savefig('./figures/query_image.png')
93 |
94 | plt.figure()
95 | plt.imshow(results_image)
96 | plt.title("result images")
97 | plt.show()
98 | #plt.savefig('./figures/result_images.png')
--------------------------------------------------------------------------------
/chp10/keras_optimizers.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import tensorflow as tf
4 | from tensorflow import keras
5 |
6 | from keras import backend as K
7 | from keras.models import Sequential
8 | from keras.layers import Dense, Dropout, Flatten
9 | from keras.layers import Conv2D, MaxPooling2D, Activation
10 |
11 | from keras.callbacks import ModelCheckpoint
12 | from keras.callbacks import TensorBoard
13 | from keras.callbacks import LearningRateScheduler
14 | from keras.callbacks import EarlyStopping
15 |
16 | import math
17 | import matplotlib.pyplot as plt
18 |
19 | tf.keras.utils.set_random_seed(42)
20 |
21 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
22 |
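#hold the initial learning rate for the first 4 epochs, then decay it by a factor of exp(-0.1) per epoch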
23 | def scheduler(epoch, lr):
24 | if epoch < 4:
25 | return lr
26 | else:
27 | return lr * tf.math.exp(-0.1)
28 |
29 | if __name__ == "__main__":
30 |
31 | img_rows, img_cols = 32, 32
32 | (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()
33 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 3).astype("float32") / 255
34 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 3).astype("float32") / 255
35 |
36 | y_train_label = keras.utils.to_categorical(y_train)
37 | y_test_label = keras.utils.to_categorical(y_test)
38 | num_classes = y_train_label.shape[1]
39 |
40 | #training parameters
41 | batch_size = 256
42 | num_epochs = 32
43 |
44 | #model parameters
45 | num_filters_l1 = 64
46 | num_filters_l2 = 128
47 |
48 | #CNN architecture, wrapped in a builder so each optimizer run starts from freshly initialized weights
49 | def build_cnn():
50 | cnn = Sequential()
51 | cnn.add(Conv2D(num_filters_l1, kernel_size = (5, 5), input_shape=(img_rows, img_cols, 3), padding='same'))
52 | cnn.add(Activation('relu'))
53 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
54 | cnn.add(Conv2D(num_filters_l2, kernel_size = (5, 5), padding='same'))
55 | cnn.add(Activation('relu'))
56 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
57 | cnn.add(Flatten())
58 | cnn.add(Dense(128))
59 | cnn.add(Activation('relu'))
60 | cnn.add(Dense(num_classes))
61 | cnn.add(Activation('softmax'))
62 | return cnn
63 | 
64 |
65 | #optimizers
66 | opt1 = tf.keras.optimizers.SGD()
67 | opt2 = tf.keras.optimizers.SGD(momentum=0.9, nesterov=True)
68 | opt3 = tf.keras.optimizers.RMSprop()
69 | opt4 = tf.keras.optimizers.Adam()
70 |
71 | optimizer_list = [opt1, opt2, opt3, opt4]
72 |
73 | history_list = []
74 |
75 | for idx in range(len(optimizer_list)):
76 |
77 | K.clear_session()
78 | cnn = build_cnn() #rebuild the model so each optimizer trains from freshly initialized weights
79 | opt = optimizer_list[idx]
80 |
81 | cnn.compile(
82 | loss=keras.losses.CategoricalCrossentropy(),
83 | optimizer=opt,
84 | metrics=["accuracy"]
85 | )
86 |
87 | #define callbacks
88 | reduce_lr = LearningRateScheduler(scheduler, verbose=1)
89 | callbacks_list = [reduce_lr]
90 |
91 | #training loop
92 | hist = cnn.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2)
93 | history_list.append(hist)
94 |
95 | #end for
96 |
97 | plt.figure()
98 | plt.plot(history_list[0].history['loss'], 'b', lw=2.0, label='SGD')
99 | plt.plot(history_list[1].history['loss'], '--r', lw=2.0, label='SGD Nesterov')
100 | plt.plot(history_list[2].history['loss'], ':g', lw=2.0, label='RMSProp')
101 | plt.plot(history_list[3].history['loss'], '-.k', lw=2.0, label='ADAM')
102 | plt.title('LeNet, CIFAR-100, Optimizers')
103 | plt.xlabel('Epochs')
104 | plt.ylabel('Cross-Entropy Training Loss')
105 | plt.legend(loc='upper right')
106 | plt.show()
107 | #plt.savefig('./figures/lenet_loss.png')
108 |
109 | plt.figure()
110 | plt.plot(history_list[0].history['val_accuracy'], 'b', lw=2.0, label='SGD')
111 | plt.plot(history_list[1].history['val_accuracy'], '--r', lw=2.0, label='SGD Nesterov')
112 | plt.plot(history_list[2].history['val_accuracy'], ':g', lw=2.0, label='RMSProp')
113 | plt.plot(history_list[3].history['val_accuracy'], '-.k', lw=2.0, label='ADAM')
114 | plt.title('LeNet, CIFAR-100, Optimizers')
115 | plt.xlabel('Epochs')
116 | plt.ylabel('Validation Accuracy')
117 | plt.legend(loc='upper right')
118 | plt.show()
119 | #plt.savefig('./figures/lenet_loss.png')
120 |
121 | plt.figure()
122 | plt.plot(history_list[0].history['lr'], 'b', lw=2.0, label='SGD')
123 | plt.plot(history_list[1].history['lr'], '--r', lw=2.0, label='SGD Nesterov')
124 | plt.plot(history_list[2].history['lr'], ':g', lw=2.0, label='RMSProp')
125 | plt.plot(history_list[3].history['lr'], '-.k', lw=2.0, label='ADAM')
126 | plt.title('LeNet, CIFAR-100, Optimizers')
127 | plt.xlabel('Epochs')
128 | plt.ylabel('Learning Rate Schedule')
129 | plt.legend(loc='upper right')
130 | plt.show()
131 | #plt.savefig('./figures/lenet_loss.png')
132 |
133 |
--------------------------------------------------------------------------------
/chp10/lenet.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import tensorflow as tf
4 | from tensorflow import keras
5 |
6 | from keras.models import Sequential
7 | from keras.layers import Dense, Dropout, Flatten
8 | from keras.layers import Conv2D, MaxPooling2D, Activation
9 |
10 | from keras.callbacks import ModelCheckpoint
11 | from keras.callbacks import TensorBoard
12 | from keras.callbacks import LearningRateScheduler
13 | from keras.callbacks import EarlyStopping
14 |
15 | import math
16 | import matplotlib.pyplot as plt
17 |
18 | tf.keras.utils.set_random_seed(42)
19 |
20 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
21 |
22 | def scheduler(epoch, lr):
23 | if epoch < 4:
24 | return lr
25 | else:
26 | return lr * tf.math.exp(-0.1)
27 |
28 |
29 | if __name__ == "__main__":
30 |
31 | img_rows, img_cols = 28, 28
32 | (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
33 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1).astype("float32") / 255
34 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1).astype("float32") / 255
35 |
36 | y_train_label = keras.utils.to_categorical(y_train)
37 | y_test_label = keras.utils.to_categorical(y_test)
38 | num_classes = y_train_label.shape[1]
39 |
40 | #training parameters
41 | batch_size = 128
42 | num_epochs = 8
43 |
44 | #model parameters
45 | num_filters_l1 = 32
46 | num_filters_l2 = 64
47 |
48 | #CNN architecture
49 | cnn = Sequential()
50 | #CONV -> RELU -> MAXPOOL
51 | cnn.add(Conv2D(num_filters_l1, kernel_size = (5, 5), input_shape=(img_rows, img_cols, 1), padding='same'))
52 | cnn.add(Activation('relu'))
53 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
54 |
55 | #CONV -> RELU -> MAXPOOL
56 | cnn.add(Conv2D(num_filters_l2, kernel_size = (5, 5), padding='same'))
57 | cnn.add(Activation('relu'))
58 | cnn.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
59 |
60 | #FC -> RELU
61 | cnn.add(Flatten())
62 | cnn.add(Dense(128))
63 | cnn.add(Activation('relu'))
64 |
65 | #Softmax Classifier
66 | cnn.add(Dense(num_classes))
67 | cnn.add(Activation('softmax'))
68 |
69 | cnn.compile(
70 | loss=keras.losses.CategoricalCrossentropy(),
71 | optimizer=tf.keras.optimizers.Adam(),
72 | metrics=["accuracy"]
73 | )
74 |
75 | cnn.summary()
76 |
77 | #define callbacks
78 | file_name = SAVE_PATH + 'lenet-weights-checkpoint.h5'
79 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
80 | reduce_lr = LearningRateScheduler(scheduler, verbose=1)
81 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1)
82 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True)
83 | callbacks_list = [checkpoint, reduce_lr, early_stopping]
84 |
85 | hist = cnn.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2)
86 |
87 | test_scores = cnn.evaluate(x_test, y_test_label, verbose=2)
88 |
89 | print("Test loss:", test_scores[0])
90 | print("Test accuracy:", test_scores[1])
91 |
92 | y_prob = cnn.predict(x_test)
93 | y_pred = y_prob.argmax(axis=-1)
94 |
95 | #create submission
96 | submission = pd.DataFrame(index=pd.RangeIndex(start=1, stop=10001, step=1), columns=['Label'])
97 | submission['Label'] = y_pred.reshape(-1,1)
98 | submission.index.name = "ImageId"
99 | submission.to_csv(SAVE_PATH + '/lenet_pred.csv', index=True, header=True)
100 |
101 | plt.figure()
102 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train')
103 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val')
104 | plt.title('LeNet model')
105 | plt.xlabel('Epochs')
106 | plt.ylabel('Cross-Entropy Loss')
107 | plt.legend(loc='upper right')
108 | plt.show()
109 | #plt.savefig('./figures/lenet_loss.png')
110 |
111 | plt.figure()
112 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train')
113 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val')
114 | plt.title('LeNet model')
115 | plt.xlabel('Epochs')
116 | plt.ylabel('Accuracy')
117 | plt.legend(loc='upper left')
118 | plt.show()
119 | #plt.savefig('./figures/lenet_acc.png')
120 |
121 | plt.figure()
122 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate')
123 | plt.title('LeNet model')
124 | plt.xlabel('Epochs')
125 | plt.ylabel('Learning Rate')
126 | plt.legend()
127 | plt.show()
128 | #plt.savefig('./figures/lenet_learning_rate.png')
129 |
--------------------------------------------------------------------------------
/chp10/lstm_sentiment.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import tensorflow as tf
5 | from tensorflow import keras
6 |
7 | from keras.models import Sequential
8 | from keras.layers import LSTM, Bidirectional
9 | from keras.layers import Dense, Dropout, Activation, Embedding
10 |
11 | from keras import regularizers
12 | from keras.preprocessing import sequence
13 | from keras.utils import np_utils
14 |
15 | from keras.callbacks import ModelCheckpoint
16 | from keras.callbacks import TensorBoard
17 | from keras.callbacks import LearningRateScheduler
18 | from keras.callbacks import EarlyStopping
19 |
20 | import matplotlib.pyplot as plt
21 |
22 | tf.keras.utils.set_random_seed(42)
23 |
24 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
25 |
26 | def scheduler(epoch, lr):
27 | if epoch < 4:
28 | return lr
29 | else:
30 | return lr * tf.math.exp(-0.1)
31 |
32 | if __name__ == "__main__":
33 |
34 | #load dataset
35 | max_words = 20000 # top 20K most frequent words
36 | seq_len = 200 # first 200 words of each movie review
37 | (x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=max_words)
38 |
39 | x_train = keras.utils.pad_sequences(x_train, maxlen=seq_len)
40 | x_val = keras.utils.pad_sequences(x_val, maxlen=seq_len)
41 |
42 | #training params
43 | batch_size = 256
44 | num_epochs = 8
45 |
46 | #model parameters
47 | hidden_size = 64
48 | embed_dim = 128
49 | lstm_dropout = 0.2
50 | dense_dropout = 0.5
51 | weight_decay = 1e-3
52 |
53 | #LSTM architecture
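#embedding -> bidirectional LSTM encoder -> dense layers with L2 weight decay and dropout -> sigmoid output for binary sentiment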
54 | model = Sequential()
55 | model.add(Embedding(max_words, embed_dim, input_length=seq_len))
56 | model.add(Bidirectional(LSTM(hidden_size, dropout=lstm_dropout, recurrent_dropout=lstm_dropout)))
57 | model.add(Dense(hidden_size, kernel_regularizer=regularizers.l2(weight_decay), activation='relu'))
58 | model.add(Dropout(dense_dropout))
59 | model.add(Dense(hidden_size//4, kernel_regularizer=regularizers.l2(weight_decay), activation='relu'))
60 | model.add(Dense(1, activation='sigmoid'))
61 |
62 | model.compile(
63 | loss=keras.losses.BinaryCrossentropy(),
64 | optimizer=tf.keras.optimizers.Adam(),
65 | metrics=["accuracy"]
66 | )
67 |
68 | model.summary()
69 |
70 | #define callbacks
71 | file_name = SAVE_PATH + 'lstm-weights-checkpoint.h5'
72 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
73 | reduce_lr = LearningRateScheduler(scheduler, verbose=1)
74 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1)
75 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True)
76 | callbacks_list = [checkpoint, reduce_lr, early_stopping]
77 |
78 | hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_data=(x_val, y_val))
79 |
80 | test_scores = model.evaluate(x_val, y_val, verbose=2)
81 |
82 | print("Test loss:", test_scores[0])
83 | print("Test accuracy:", test_scores[1])
84 |
85 | plt.figure()
86 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train')
87 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val')
88 | plt.title('LSTM model')
89 | plt.xlabel('Epochs')
90 | plt.ylabel('Cross-Entropy Loss')
91 | plt.legend(loc='upper right')
92 | plt.show()
93 | #plt.savefig('./figures/lstm_loss.png')
94 |
95 | plt.figure()
96 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train')
97 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val')
98 | plt.title('LSTM model')
99 | plt.xlabel('Epochs')
100 | plt.ylabel('Accuracy')
101 | plt.legend(loc='upper left')
102 | plt.show()
103 | #plt.savefig('./figures/lstm_acc.png')
104 |
105 | plt.figure()
106 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate')
107 | plt.title('LSTM model')
108 | plt.xlabel('Epochs')
109 | plt.ylabel('Learning Rate')
110 | plt.legend()
111 | plt.show()
112 | #plt.savefig('./figures/lstm_learning_rate.png')
113 |
114 |
--------------------------------------------------------------------------------
/chp10/mlp.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import tensorflow as tf
3 | from tensorflow import keras
4 |
5 | from keras.models import Sequential
6 | from keras.layers import Dense, Dropout
7 |
8 | from keras.callbacks import ModelCheckpoint
9 | from keras.callbacks import TensorBoard
10 | from keras.callbacks import LearningRateScheduler
11 | from keras.callbacks import EarlyStopping
12 |
13 | import math
14 | import matplotlib.pyplot as plt
15 |
16 | tf.keras.utils.set_random_seed(42)
17 |
18 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
19 |
20 | def scheduler(epoch, lr):
21 | if epoch < 4:
22 | return lr
23 | else:
24 | return lr * tf.math.exp(-0.1)
25 |
26 | if __name__ == "__main__":
27 |
28 | (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
29 | x_train = x_train.reshape(60000, 784).astype("float32") / 255
30 | x_test = x_test.reshape(10000, 784).astype("float32") / 255
31 |
32 | y_train_label = keras.utils.to_categorical(y_train)
33 | y_test_label = keras.utils.to_categorical(y_test)
34 | num_classes = y_train_label.shape[1]
35 |
36 | #training params
37 | batch_size = 64
38 | num_epochs = 16
39 |
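#two-hidden-layer MLP: 784 -> 128 -> 64 (ReLU) with dropout, followed by a 10-way softmax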
40 | model = Sequential()
41 | model.add(Dense(128, input_shape=(784, ), activation='relu'))
42 | model.add(Dense(64, activation='relu'))
43 | model.add(Dropout(0.5))
44 | model.add(Dense(10, activation='softmax'))
45 |
46 | model.compile(
47 | loss=keras.losses.CategoricalCrossentropy(),
48 | optimizer=tf.keras.optimizers.RMSprop(),
49 | metrics=["accuracy"]
50 | )
51 |
52 | model.summary()
53 |
54 | #define callbacks
55 | file_name = SAVE_PATH + 'mlp-weights-checkpoint.h5'
56 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
57 | reduce_lr = LearningRateScheduler(scheduler, verbose=1)
58 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1)
59 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True)
60 | callbacks_list = [checkpoint, reduce_lr, early_stopping]
61 |
62 | hist = model.fit(x_train, y_train_label, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2)
63 |
64 | test_scores = model.evaluate(x_test, y_test_label, verbose=2)
65 |
66 | print("Test loss:", test_scores[0])
67 | print("Test accuracy:", test_scores[1])
68 |
69 | plt.figure()
70 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train')
71 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val')
72 | plt.title('MLP model')
73 | plt.xlabel('Epochs')
74 | plt.ylabel('Cross-Entropy Loss')
75 | plt.legend(loc='upper right')
76 | plt.show()
77 | #plt.savefig('./figures/mlp_loss.png')
78 |
79 | plt.figure()
80 | plt.plot(hist.history['accuracy'], 'b', lw=2.0, label='train')
81 | plt.plot(hist.history['val_accuracy'], '--r', lw=2.0, label='val')
82 | plt.title('MLP model')
83 | plt.xlabel('Epochs')
84 | plt.ylabel('Accuracy')
85 | plt.legend(loc='upper left')
86 | plt.show()
87 | #plt.savefig('./figures/mlp_acc.png')
88 |
89 | plt.figure()
90 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate')
91 | plt.title('MLP model')
92 | plt.xlabel('Epochs')
93 | plt.ylabel('Learning Rate')
94 | plt.legend()
95 | plt.show()
96 | #plt.savefig('./figures/mlp_learning_rate.png')
--------------------------------------------------------------------------------
/chp10/multi_input_nn.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import tensorflow as tf
5 | from tensorflow import keras
6 |
7 | import os
8 | import re
9 | import csv
10 | import codecs
11 |
12 | from keras.models import Model
13 | from keras.layers import Input, Flatten, Concatenate, LSTM, Lambda, Dropout
14 | from keras.layers import Dense, Dropout, Activation, Embedding
15 | from keras.layers import Conv1D, MaxPooling1D
16 | from keras.layers import TimeDistributed, Bidirectional, BatchNormalization
17 |
18 | from keras import backend as K
19 | from keras.preprocessing.text import Tokenizer
20 | from keras.utils import pad_sequences
21 |
22 | from nltk.corpus import stopwords
23 | from nltk.stem import SnowballStemmer
24 |
25 | from keras import regularizers
26 | from keras.preprocessing import sequence
27 | from keras.utils import np_utils
28 |
29 | from keras.callbacks import ModelCheckpoint
30 | from keras.callbacks import TensorBoard
31 | from keras.callbacks import LearningRateScheduler
32 | from keras.callbacks import EarlyStopping
33 |
34 | import matplotlib.pyplot as plt
35 |
36 | tf.keras.utils.set_random_seed(42)
37 |
38 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
39 | DATA_PATH = "/content/drive/MyDrive/data/"
40 |
41 | GLOVE_DIR = DATA_PATH
42 | TRAIN_DATA_FILE = DATA_PATH + 'quora_train.csv'
43 | TEST_DATA_FILE = DATA_PATH + 'quora_test.csv'
44 | MAX_SEQUENCE_LENGTH = 30
45 | MAX_NB_WORDS = 200000
46 | EMBEDDING_DIM = 300
47 | VALIDATION_SPLIT = 0.01
48 |
49 | def scheduler(epoch, lr):
50 | if epoch < 4:
51 | return lr
52 | else:
53 | return lr * tf.math.exp(-0.1)
54 |
55 | def text_to_wordlist(row, remove_stopwords=False, stem_words=False):
56 | # Clean the text, with the option to remove stopwords and to stem words.
57 |
58 | text = row['question']
59 | # Convert words to lower case and split them
60 | if type(text) is str:
61 | text = text.lower().split()
62 | else:
63 | return " "
64 |
65 | # Optionally, remove stop words
66 | if remove_stopwords:
67 | stops = set(stopwords.words("english"))
68 | text = [w for w in text if not w in stops]
69 |
70 | text = " ".join(text)
71 |
72 | # Clean the text
73 | text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
74 |
75 | # Optionally, shorten words to their stems
76 | if stem_words:
77 | text = text.split()
78 | stemmer = SnowballStemmer('english')
79 | stemmed_words = [stemmer.stem(word) for word in text]
80 | text = " ".join(stemmed_words)
81 |
82 | # Return a list of words
83 | return(text)
84 |
85 | if __name__ == "__main__":
86 |
87 | #load embeddings
88 | print('Indexing word vectors...')
89 | embeddings_index = {}
90 | f = codecs.open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf-8')
91 | for line in f:
92 | values = line.split(' ')
93 | word = values[0]
94 | coefs = np.asarray(values[1:], dtype='float32')
95 | embeddings_index[word] = coefs
96 | f.close()
97 | print('Found %s word vectors.' % len(embeddings_index))
98 |
99 | #load dataset
100 | train_df = pd.read_csv(TRAIN_DATA_FILE)
101 | test_df = pd.read_csv(TEST_DATA_FILE)
102 |
103 | q1df = train_df['question1'].reset_index()
104 | q2df = train_df['question2'].reset_index()
105 | q1df.columns = ['index', 'question']
106 | q2df.columns = ['index', 'question']
107 | texts_1 = q1df.apply(text_to_wordlist, axis=1, raw=False).tolist()
108 | texts_2 = q2df.apply(text_to_wordlist, axis=1, raw=False).tolist()
109 | labels = train_df['is_duplicate'].astype(int).tolist()
110 | print('Found %s texts.' % len(texts_1))
111 | del q1df
112 | del q2df
113 |
114 | q1df = test_df['question1'].reset_index()
115 | q2df = test_df['question2'].reset_index()
116 | q1df.columns = ['index', 'question']
117 | q2df.columns = ['index', 'question']
118 | test_texts_1 = q1df.apply(text_to_wordlist, axis=1, raw=False).tolist()
119 | test_texts_2 = q2df.apply(text_to_wordlist, axis=1, raw=False).tolist()
120 | test_labels = np.arange(0, test_df.shape[0])
121 | print('Found %s texts.' % len(test_texts_1))
122 | del q1df
123 | del q2df
124 |
125 | #tokenize, convert to sequences and pad
126 | tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
127 | tokenizer.fit_on_texts(texts_1 + texts_2 + test_texts_1 + test_texts_2)
128 | sequences_1 = tokenizer.texts_to_sequences(texts_1)
129 | sequences_2 = tokenizer.texts_to_sequences(texts_2)
130 | word_index = tokenizer.word_index
131 | print('Found %s unique tokens.' % len(word_index))
132 |
133 | test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1)
134 | test_sequences_2 = tokenizer.texts_to_sequences(test_texts_2)
135 |
136 | data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
137 | data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
138 | labels = np.array(labels)
139 | print('Shape of data tensor:', data_1.shape)
140 | print('Shape of label tensor:', labels.shape)
141 |
142 | test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
143 | test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
144 | test_labels = np.array(test_labels)
145 | del test_sequences_1
146 | del test_sequences_2
147 | del sequences_1
148 | del sequences_2
149 |
150 | #embedding matrix
151 | print('Preparing embedding matrix...')
152 | nb_words = min(MAX_NB_WORDS, len(word_index))
153 |
154 | embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
155 | for word, i in word_index.items():
156 | if i >= nb_words:
157 | continue
158 | embedding_vector = embeddings_index.get(word)
159 | if embedding_vector is not None:
160 | # words not found in embedding index will be all-zeros.
161 | embedding_matrix[i] = embedding_vector
162 | print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
163 |
164 | #Multi-Input Architecture
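#two convolutional branches (one per question) share the frozen GloVe embedding layer;
#their outputs are concatenated and passed to a dense head that predicts whether the questions are duplicates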
165 | embedding_layer = Embedding(nb_words,
166 | EMBEDDING_DIM,
167 | weights=[embedding_matrix],
168 | input_length=MAX_SEQUENCE_LENGTH,
169 | trainable=False)
170 |
171 | sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
172 | embedded_sequences_1 = embedding_layer(sequence_1_input)
173 | x1 = Conv1D(128, 3, activation='relu')(embedded_sequences_1)
174 | x1 = MaxPooling1D(10)(x1)
175 | x1 = Flatten()(x1)
176 | x1 = Dense(64, activation='relu')(x1)
177 | x1 = Dropout(0.2)(x1)
178 |
179 | sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
180 | embedded_sequences_2 = embedding_layer(sequence_2_input)
181 | y1 = Conv1D(128, 3, activation='relu')(embedded_sequences_2)
182 | y1 = MaxPooling1D(10)(y1)
183 | y1 = Flatten()(y1)
184 | y1 = Dense(64, activation='relu')(y1)
185 | y1 = Dropout(0.2)(y1)
186 |
187 | merged = Concatenate()([x1, y1])
188 | merged = BatchNormalization()(merged)
189 | merged = Dense(64, activation='relu')(merged)
190 | merged = Dropout(0.2)(merged)
191 | merged = BatchNormalization()(merged)
192 | preds = Dense(1, activation='sigmoid')(merged)
193 |
194 | model = Model(inputs=[sequence_1_input,sequence_2_input], outputs=preds)
195 |
196 | model.compile(
197 | loss=keras.losses.BinaryCrossentropy(),
198 | optimizer=tf.keras.optimizers.Adam(),
199 | metrics=["accuracy"]
200 | )
201 |
202 | model.summary()
203 |
204 | #define callbacks
205 | file_name = SAVE_PATH + 'multi-input-weights-checkpoint.h5'
206 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
207 | reduce_lr = LearningRateScheduler(scheduler, verbose=1)
208 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1)
209 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True)
210 | callbacks_list = [checkpoint, reduce_lr, early_stopping]
211 |
212 | hist = model.fit([data_1, data_2], labels, batch_size=1024, epochs=10, callbacks=callbacks_list, validation_split=VALIDATION_SPLIT)
213 |
214 | num_test = 100000
215 | preds = model.predict([test_data_1[:num_test,:], test_data_2[:num_test,:]])
216 |
217 | quora_submission = pd.DataFrame({"test_id":test_labels[:num_test], "is_duplicate":preds.ravel()})
218 | quora_submission.to_csv(SAVE_PATH + "quora_submission.csv", index=False)
219 |
220 | plt.figure()
221 | plt.plot(hist.history['loss'], c='b', lw=2.0, label='train')
222 | plt.plot(hist.history['val_loss'], c='r', lw=2.0, label='val')
223 | plt.title('Multi-Input model')
224 | plt.xlabel('Epochs')
225 | plt.ylabel('Cross-Entropy Loss')
226 | plt.legend(loc='upper right')
227 | plt.show()
228 | #plt.savefig('./figures/lstm_loss.png')
229 |
230 | plt.figure()
231 | plt.plot(hist.history['accuracy'], c='b', lw=2.0, label='train')
232 | plt.plot(hist.history['val_accuracy'], c='r', lw=2.0, label='val')
233 | plt.title('Multi-Input model')
234 | plt.xlabel('Epochs')
235 | plt.ylabel('Accuracy')
236 | plt.legend(loc='upper left')
237 | plt.show()
238 | #plt.savefig('./figures/lstm_acc.png')
239 |
240 | plt.figure()
241 | plt.plot(hist.history['lr'], lw=2.0, label='learning rate')
242 | plt.title('Multi-Input model')
243 | plt.xlabel('Epochs')
244 | plt.ylabel('Learning Rate')
245 | plt.legend()
246 | plt.show()
247 | #plt.savefig('./figures/lstm_learning_rate.png')
--------------------------------------------------------------------------------
/chp11/keras_mdn.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import tensorflow as tf
5 | from tensorflow import keras
6 |
7 | from keras.models import Model
8 | from keras.layers import concatenate, Input
9 | from keras.layers import Dense, Activation, Dropout, Flatten
10 | from keras.layers import BatchNormalization
11 |
12 | from keras import regularizers
13 | from keras import backend as K
14 | from keras.utils import np_utils
15 |
16 | from keras.callbacks import ModelCheckpoint
17 | from keras.callbacks import TensorBoard
18 | from keras.callbacks import LearningRateScheduler
19 | from keras.callbacks import EarlyStopping
20 |
21 | from sklearn.datasets import make_blobs
22 | from sklearn.metrics import adjusted_rand_score
23 | from sklearn.metrics import normalized_mutual_info_score
24 | from sklearn.model_selection import train_test_split
25 |
26 | import math
27 | import matplotlib.pyplot as plt
28 | import matplotlib.cm as cm
29 |
30 | tf.keras.utils.set_random_seed(42)
31 |
32 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
33 |
34 | def scheduler(epoch, lr):
35 | if epoch < 4:
36 | return lr
37 | else:
38 | return lr * tf.math.exp(-0.1)
39 |
40 | def generate_data(N):
41 | pi = np.array([0.2, 0.4, 0.3, 0.1])
42 | mu = [[2,2], [-2,2], [-2,-2], [2,-2]]
43 | std = [[0.5,0.5], [1.0,1.0], [0.5,0.5], [1.0,1.0]]
44 | x = np.zeros((N,2), dtype=np.float32)
45 | y = np.zeros((N,2), dtype=np.float32)
46 | z = np.zeros((N,1), dtype=np.int32)
47 | for n in range(N):
48 | k = np.argmax(np.random.multinomial(1, pi))
49 | x[n,:] = np.random.multivariate_normal(mu[k], np.diag(std[k]))
50 | y[n,:] = mu[k]
51 | z[n,:] = k
52 | #end for
53 | z = z.flatten()
54 | return x, y, z, pi, mu, std
55 |
56 | def tf_normal(y, mu, sigma):
57 | y_tile = K.stack([y]*num_clusters, axis=1) #[batch_size, K, D]
58 | result = y_tile - mu
59 | sigma_tile = K.stack([sigma]*data_dim, axis=-1) #[batch_size, K, D]
60 | result = result * 1.0/(sigma_tile+1e-8)
61 | result = -K.square(result)/2.0
62 | oneDivSqrtTwoPI = 1.0/math.sqrt(2*math.pi)
63 | result = K.exp(result) * (1.0/(sigma_tile + 1e-8))*oneDivSqrtTwoPI
64 | result = K.prod(result, axis=-1) #[batch_size, K] iid Gaussians
65 | return result
66 |
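#mixture negative log-likelihood: y_pred concatenates [mu (num_clusters*data_dim), sigma (num_clusters), pi (num_clusters)]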
67 | def NLLLoss(y_true, y_pred):
68 | out_mu = y_pred[:,:num_clusters*data_dim]
69 | out_sigma = y_pred[:,num_clusters*data_dim : num_clusters*(data_dim+1)]
70 | out_pi = y_pred[:,num_clusters*(data_dim+1):]
71 |
72 | out_mu = K.reshape(out_mu, [-1, num_clusters, data_dim])
73 |
74 | result = tf_normal(y_true, out_mu, out_sigma)
75 | result = result * out_pi
76 | result = K.sum(result, axis=1, keepdims=True)
77 | result = -K.log(result + 1e-8)
78 | result = K.mean(result)
79 | return tf.maximum(result, 0)
80 |
81 | #generate data
82 | X_data, y_data, z_data, pi_true, mu_true, sigma_true = generate_data(4096)
83 |
84 | data_dim = X_data.shape[1]
85 | num_clusters = len(mu_true)
86 |
87 | num_train = 3500
88 | X_train, X_test, y_train, y_test = X_data[:num_train,:], X_data[num_train:,:], y_data[:num_train,:], y_data[num_train:,:]
89 | z_train, z_test = z_data[:num_train], z_data[num_train:]
90 |
91 | #visualize data
92 | plt.figure()
93 | plt.scatter(X_train[:,0], X_train[:,1], c=z_train, cmap=cm.bwr)
94 | plt.title('training data')
95 | plt.show()
96 | #plt.savefig(SAVE_PATH + '/mdn_training_data.png')
97 |
98 | #training params
99 | batch_size = 128
100 | num_epochs = 128
101 |
102 | #model parameters
103 | hidden_size = 32
104 | weight_decay = 1e-4
105 |
106 | #MDN architecture
107 | input_data = Input(shape=(data_dim,))
108 | x = Dense(32, activation='relu')(input_data)
109 | x = Dropout(0.2)(x)
110 | x = BatchNormalization()(x)
111 | x = Dense(32, activation='relu')(x)
112 | x = Dropout(0.2)(x)
113 | x = BatchNormalization()(x)
114 |
115 | mu = Dense(num_clusters * data_dim, activation='linear')(x) #cluster means
116 | sigma = Dense(num_clusters, activation=K.exp)(x) #per-cluster isotropic std dev (shared across dimensions)
117 | pi = Dense(num_clusters, activation='softmax')(x) #mixture proportions
118 | out = concatenate([mu, sigma, pi], axis=-1)
119 |
120 | model = Model(input_data, out)
121 |
122 | model.compile(
123 | loss=NLLLoss,
124 | optimizer=tf.keras.optimizers.Adam(),
125 | metrics=["accuracy"]
126 | )
127 |
128 | model.summary()
129 |
130 | #define callbacks
131 | file_name = SAVE_PATH + 'mdn-weights-checkpoint.h5'
132 | checkpoint = ModelCheckpoint(file_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
133 | reduce_lr = LearningRateScheduler(scheduler, verbose=1)
134 | early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=16, verbose=1)
135 | #tensor_board = TensorBoard(log_dir='./logs', write_graph=True)
136 | callbacks_list = [checkpoint, reduce_lr, early_stopping]
137 |
138 | hist = model.fit(X_train, y_train, batch_size=batch_size, epochs=num_epochs, callbacks=callbacks_list, validation_split=0.2, shuffle=True, verbose=2)
139 |
140 | y_pred = model.predict(X_test)
141 |
142 | mu_pred = y_pred[:,:num_clusters*data_dim]
143 | mu_pred = np.reshape(mu_pred, [-1, num_clusters, data_dim])
144 | sigma_pred = y_pred[:,num_clusters*data_dim : num_clusters*(data_dim+1)]
145 | pi_pred = y_pred[:,num_clusters*(data_dim+1):]
146 | z_pred = np.argmax(pi_pred, axis=-1)
147 |
148 | rand_score = adjusted_rand_score(z_test, z_pred)
149 | print("adjusted rand score: ", rand_score)
150 |
151 | nmi_score = normalized_mutual_info_score(z_test, z_pred)
152 | print("normalized MI score: ", nmi_score)
153 |
154 | mu_pred_list = []
155 | sigma_pred_list = []
156 | for label in np.unique(z_pred):
157 | z_idx = np.where(z_pred == label)[0]
158 | mu_pred_lbl = np.mean(mu_pred[z_idx,label,:], axis=0)
159 | mu_pred_list.append(mu_pred_lbl)
160 |
161 | sigma_pred_lbl = np.mean(sigma_pred[z_idx,label], axis=0)
162 | sigma_pred_list.append(sigma_pred_lbl)
163 | #end for
164 |
165 | print("true means:")
166 | print(np.array(mu_true))
167 |
168 | print("predicted means:")
169 | print(np.array(mu_pred_list))
170 |
171 | print("true sigmas:")
172 | print(np.array(sigma_true))
173 |
174 | print("predicted sigmas:")
175 | print(np.array(sigma_pred_list))
176 |
177 | #generate plots
178 | plt.figure()
179 | plt.scatter(X_test[:,0], X_test[:,1], c=z_pred, cmap=cm.bwr)
180 | plt.scatter(np.array(mu_pred_list)[:,0], np.array(mu_pred_list)[:,1], s=100, marker='x', lw=4.0, color='k')
181 | plt.title('test data')
182 | #plt.savefig('./figures/mdn_test_data.png')
183 |
184 | plt.figure()
185 | plt.plot(hist.history['loss'], 'b', lw=2.0, label='train')
186 | plt.plot(hist.history['val_loss'], '--r', lw=2.0, label='val')
187 | plt.title('Mixture Density Network')
188 | plt.xlabel('Epochs')
189 | plt.ylabel('Negative Log Likelihood Loss')
190 | plt.legend(loc='upper left')
191 | #plt.savefig('./figures/mdn_loss.png')
192 |
193 |
194 |
--------------------------------------------------------------------------------
/chp11/lstm_vae.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import tensorflow as tf
5 | from tensorflow import keras
6 | import tensorflow_probability as tfp
7 |
8 | from keras.layers import Input, Dense, Lambda, Layer
9 | from keras.layers import LSTM, RepeatVector
10 | from keras.models import Model
11 | from keras import backend as K
12 | from keras import metrics
13 | from keras import optimizers
14 |
15 | import math
16 | import json
17 | from scipy.stats import norm
18 | from sklearn.model_selection import train_test_split
19 | from sklearn import preprocessing
20 | from sklearn.metrics import confusion_matrix
21 | from sklearn.preprocessing import StandardScaler
22 |
23 | from keras.callbacks import ModelCheckpoint
24 | from keras.callbacks import TensorBoard
25 | from keras.callbacks import LearningRateScheduler
26 | from keras.callbacks import EarlyStopping
27 |
28 | import matplotlib.pyplot as plt
29 |
30 | tf.keras.utils.set_random_seed(42)
31 |
32 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
33 | DATA_PATH = "/content/drive/MyDrive/data/"
34 |
35 | def scheduler(epoch, lr):
36 | if epoch < 4:
37 | return lr
38 | else:
39 | return lr * tf.math.exp(-0.1)
40 |
41 | nab_path = DATA_PATH + 'NAB/'
42 | nab_data_path = nab_path
43 |
44 | labels_filename = '/labels/combined_labels.json'
45 | train_file_name = 'artificialNoAnomaly/art_daily_no_noise.csv'
46 | test_file_name = 'artificialWithAnomaly/art_daily_jumpsup.csv'
47 |
48 | #train_file_name = 'realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv'
49 | #test_file_name = 'realAWSCloudwatch/rds_cpu_utilization_e47b3b.csv'
50 |
51 | labels_file = open(nab_path + labels_filename, 'r')
52 | labels = json.loads(labels_file.read())
53 | labels_file.close()
54 |
55 | def load_data_frame_with_labels(file_name):
56 | data_frame = pd.read_csv(nab_data_path + file_name)
57 | data_frame['anomaly_label'] = data_frame['timestamp'].isin(
58 | labels[file_name]).astype(int)
59 | return data_frame
60 |
61 | train_data_frame = load_data_frame_with_labels(train_file_name)
62 | test_data_frame = load_data_frame_with_labels(test_file_name)
63 |
64 | plt.plot(train_data_frame.loc[0:3000,'value'])
65 | plt.plot(test_data_frame['value'])
66 |
67 | train_data_frame_final = train_data_frame.loc[0:3000,:]
68 | test_data_frame_final = test_data_frame
69 |
70 | data_scaler = StandardScaler()
71 | data_scaler.fit(train_data_frame_final[['value']].values)
72 | train_data = data_scaler.transform(train_data_frame_final[['value']].values)
73 | test_data = data_scaler.transform(test_data_frame_final[['value']].values)
74 |
75 | def create_dataset(dataset, look_back=64):
76 | dataX, dataY = [], []
77 | for i in range(len(dataset)-look_back-1):
78 | dataX.append(dataset[i:(i+look_back),:])
79 | dataY.append(dataset[i+look_back,:])
80 |
81 | return np.array(dataX), np.array(dataY)
82 |
83 | X_data, y_data = create_dataset(train_data, look_back=64) #look_back = window_size
84 | X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.1, random_state=42)
85 | X_test, y_test = create_dataset(test_data, look_back=64) #look_back = window_size
86 |
87 | #training params
88 | batch_size = 256
89 | num_epochs = 32
90 |
91 | #model params
92 | timesteps = X_train.shape[1]
93 | input_dim = X_train.shape[-1]
94 | intermediate_dim = 16
95 | latent_dim = 2
96 | epsilon_std = 1.0
97 |
98 | #sampling layer
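#reparameterization trick: z = z_mean + exp(0.5*z_log_var)*eps with eps ~ N(0, I), so sampling stays differentiable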
99 | class Sampling(Layer):
100 | def call(self, inputs):
101 | z_mean, z_log_var = inputs
102 | batch = tf.shape(z_mean)[0]
103 | dim = tf.shape(z_mean)[1]
104 | epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
105 | return z_mean + tf.exp(0.5 * z_log_var) * epsilon
106 |
107 | #likelihood layer
108 | class Likelihood(Layer):
109 | def call(self, inputs):
110 | x, x_decoded_mean, x_decoded_scale = inputs
111 | dist = tfp.distributions.MultivariateNormalDiag(x_decoded_mean, x_decoded_scale)
112 | likelihood = dist.log_prob(x)
113 | return likelihood
114 |
115 | #VAE architecture
116 |
117 | #encoder
118 | x = Input(shape=(timesteps, input_dim,))
119 | h = LSTM(intermediate_dim)(x)
120 |
121 | z_mean = Dense(latent_dim)(h)
122 | z_log_sigma = Dense(latent_dim, activation='softplus')(h)
123 |
124 | #sampling
125 | z = Sampling()((z_mean, z_log_sigma))
126 |
127 | #decoder
128 | decoder_h = LSTM(intermediate_dim, return_sequences=True)
129 | decoder_loc = LSTM(input_dim, return_sequences=True)
130 | decoder_scale = LSTM(input_dim, activation='softplus', return_sequences=True)
131 |
132 | h_decoded = RepeatVector(timesteps)(z)
133 | h_decoded = decoder_h(h_decoded)
134 |
135 | x_decoded_mean = decoder_loc(h_decoded)
136 | x_decoded_scale = decoder_scale(h_decoded)
137 |
138 | #log-likelihood
139 | llh = Likelihood()([x, x_decoded_mean, x_decoded_scale])
140 |
141 | #define VAE model
142 | vae = Model(inputs=x, outputs=llh)
143 |
144 | # Add KL divergence regularization loss and likelihood loss
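# Total loss is the negative ELBO: KL(q(z|x) || N(0, I)) minus the expected reconstruction log-likelihood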
145 | kl_loss = - 0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma))
146 | tot_loss = -K.mean(llh - kl_loss)
147 | vae.add_loss(tot_loss)
148 |
149 | # Loss and optimizer.
150 | loss_fn = tf.keras.losses.MeanSquaredError()
151 | optimizer = tf.keras.optimizers.Adam()
152 |
153 | @tf.function
154 | def training_step(x):
155 | with tf.GradientTape() as tape:
156 | reconstructed = vae(x) # Compute input reconstruction.
157 | # Compute loss.
158 | loss = 0 #loss_fn(x, reconstructed)
159 | loss += sum(vae.losses)
160 | # Update the weights of the VAE.
161 | grads = tape.gradient(loss, vae.trainable_weights)
162 | optimizer.apply_gradients(zip(grads, vae.trainable_weights))
163 | return loss
164 |
165 | losses = [] # Keep track of the losses over time.
166 | dataset = tf.data.Dataset.from_tensor_slices(X_train).batch(batch_size)
167 | for epoch in range(num_epochs):
168 | for step, x in enumerate(dataset):
169 | loss = training_step(x)
170 | losses.append(float(loss))
171 | print("Epoch:", epoch, "Loss:", sum(losses) / len(losses))
172 |
173 | plt.figure()
174 | plt.plot(losses, c='b', lw=2.0, label='train')
175 | plt.title('LSTM-VAE model')
176 | plt.xlabel('Epochs')
177 | plt.ylabel('Total Loss')
178 | plt.legend(loc='upper right')
179 | plt.show()
180 | #plt.savefig('./figures/lstm_loss.png')
181 |
182 | pred_test = vae.predict(X_test)
183 |
184 | plt.plot(pred_test[:,0])
185 |
186 | is_anomaly = pred_test[:,0] < -1e1 #flag points whose reconstruction log-likelihood falls below a fixed threshold
187 | plt.figure()
188 | plt.plot(test_data, color='b')
189 | plt.figure()
190 | plt.plot(is_anomaly, color='r')
191 |
--------------------------------------------------------------------------------
/chp11/spektral_gnn.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import tensorflow as tf
5 | from tensorflow import keras
6 |
7 | import networkx as nx
8 | from tensorflow.keras.utils import to_categorical
9 | from sklearn.preprocessing import LabelEncoder
10 | from sklearn.utils import shuffle
11 | from sklearn.metrics import classification_report
12 | from sklearn.model_selection import train_test_split
13 |
14 | from spektral.layers import GCNConv
15 |
16 | from tensorflow.keras.models import Model
17 | from tensorflow.keras.layers import Input, Dropout, Dense
18 | from tensorflow.keras import Sequential
19 | from tensorflow.keras.optimizers import Adam
20 | from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
21 | from tensorflow.keras.regularizers import l2
22 |
23 | import os
24 | from collections import Counter
25 | from sklearn.manifold import TSNE
26 | import matplotlib.pyplot as plt
27 |
28 | tf.keras.utils.set_random_seed(42)
29 |
30 | SAVE_PATH = "/content/drive/MyDrive/Colab Notebooks/data/"
31 | DATA_PATH = "/content/drive/MyDrive/data/cora/"
32 |
33 | column_names = ["paper_id"] + [f"term_{idx}" for idx in range(1433)] + ["subject"]
34 | node_df = pd.read_csv(DATA_PATH + "cora.content", sep="\t", header=None, names=column_names)
35 | print("Node df shape:", node_df.shape)
36 |
37 | edge_df = pd.read_csv(DATA_PATH + "cora.cites", sep="\t", header=None, names=["target", "source"])
38 | print("Edge df shape:", edge_df.shape)
39 |
40 | #parse node data
41 | nodes = node_df.iloc[:,0].tolist()
42 | labels = node_df.iloc[:,-1].tolist()
43 | X = node_df.iloc[:,1:-1].values
44 |
45 | X = np.array(X,dtype=int)
46 | N = X.shape[0] #the number of nodes
47 | F = X.shape[1] #the size of node features
48 |
49 | #parse edge data
50 | edge_list = [(x, y) for x, y in zip(edge_df['target'], edge_df['source'])]
51 |
52 | num_classes = len(set(labels))
53 |
54 | print('Number of nodes:', N)
55 | print('Number of features of each node:', F)
56 | print('Labels:', set(labels))
57 | print('Number of classes:', num_classes)
58 |
59 | def sample_data(labels, limit=20, val_num=500, test_num=1000):
60 | label_counter = dict((l, 0) for l in labels)
61 | train_idx = []
62 |
63 | for i in range(len(labels)):
64 | label = labels[i]
65 | if label_counter[label]