├── LICENSE
├── POT
│   └── AL_POT.tar.gz
├── README.md
├── data
│   ├── .gitkeep
│   └── traj.tar.gz
├── environment.yml
├── images
│   ├── .gitkeep
│   └── Active_Learning.png
└── workflow
    ├── .gitkeep
    ├── BayesOpt_SOAP.py
    └── activesample.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019-Present [Ganesh Sivaraman]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/POT/AL_POT.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/POT/AL_POT.tar.gz
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Active learning workflow for Gaussian Approximation Potential (GAP)

Documentation for the active learning workflow developed as part of the article "Machine Learning Inter-Atomic Potentials Generation Driven by Active Learning:
A Case Study for Amorphous and Liquid Hafnium dioxide".
__For more details, please refer to the [paper](https://www.nature.com/articles/s41524-020-00367-7).__

If you use this active learning workflow in your research, please cite us as
```
@article{sivaraman2020machine,
title={Machine-learned interatomic potentials by active learning: amorphous and liquid hafnium dioxide},
author={Sivaraman, Ganesh and Krishnamoorthy, Anand Narayanan and Baur, Matthias and Holm, Christian and Stan, Marius and Cs{\'a}nyi, G{\'a}bor and Benmore, Chris and V{\'a}zquez-Mayagoitia, {\'A}lvaro},
journal={npj Computational Materials},
volume={6},
number={1},
pages={1--8},
year={2020},
publisher={Nature Publishing Group}
}
```

![pipeline](images/Active_Learning.png)


First, clone the GitHub repository and then create a new Anaconda environment from the **environment.yml** file:

```
git clone https://github.com/argonne-lcf/active-learning-md
cd active-learning-md
conda env create -f environment.yml
```
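The environment is named `gap_al` in **environment.yml**; assuming it resolves cleanly, activate it before running any of the scripts below:

```
conda activate gap_al
```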
As the title suggests, active learning is implemented to automate the training of the GAP model. Please refer to the
[QUIP software library online documentation](http://libatoms.github.io/QUIP/) for more details. Once the QUIP libraries have been compiled, please set the
environment variable below in your HPC submission script; it should point to the full path of the directory containing the `teach_sparse` and `quip` binaries. The workflow script described in
the next section will automatically invoke the required binaries if this variable is set correctly.

```
export QUIP_PATH=/path/libatoms/QUIP-git/build/linux_x86_64_ifort_icc_openmp/
```

**Note: The version of QUIP used in this study is compiled with OpenMP support and can fully exploit node-level parallelization!**


## The workflow script

Let us go through the workflow script, which is implemented in `BayesOpt_SOAP.py`. Its command-line options are shown below.

```
$ python BayesOpt_SOAP.py --help

usage: BayesOpt_SOAP.py [-h] -xyz XYZFILENAME [-Ns] [-Nm] -c CUTOFF
                        [CUTOFF ...] -s NSPARSE [NSPARSE ...] -nl NLMAX
                        [NLMAX ...] -Nbo NOPT [NOPT ...]

A python workflow to run active learning for GAP

optional arguments:
  -h, --help            show this help message and exit
  -xyz XYZFILENAME, --xyzfilename XYZFILENAME
                        Trajectory filename (xyz formatted)
  -Ns , --nsample       Minimum sample size
  -Nm , --nminclust     Minimum cluster size. Tweak to reduce noise points
  -c CUTOFF [CUTOFF ...], --cutoff CUTOFF [CUTOFF ...]
                        Cutoff range for SOAP. Usage: '-c min max'.
                        Additional dependency: common sense!
  -s NSPARSE [NSPARSE ...], --nsparse NSPARSE [NSPARSE ...]
                        Nsparse for SOAP. Usage: '-s min max'. Additional
                        dependency: common sense
  -nl NLMAX [NLMAX ...], --nlmax NLMAX [NLMAX ...]
                        Nmax, Lmax range. Usage: '-nl min max'. Additional
                        dependency: common sense!
  -Nbo NOPT [NOPT ...], --Nopt NOPT [NOPT ...]
                        Number of exploration and optimization steps for BO.
                        e.g. '-Nbo 25 50'

```


**Warning: Do not run this on your desktop. Try to get an HPC allocation and run it there!**

## How to run the workflow?

Extract the trajectory from the `data` folder.
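For example (assuming the archive unpacks the `a-Hfo2-300K-NVT.extxyz` trajectory used below into the working directory):

```
tar -xzf data/traj.tar.gz
```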
Then launch the workflow:

```
python -u BayesOpt_SOAP.py -xyz a-Hfo2-300K-NVT.extxyz -Ns 10 -Nm 10 -c 4 6 -s 100 1200 -nl 3 8 -Nbo 10 20 > BO-SOAP.out

```

In our benchmarks, the workflow converged at Niter = 2 with a total run time of `1080 s` on a single node (Intel Broadwell). The optimal training and validation configurations
are written to `opt_train.extxyz` and `opt_test.extxyz` respectively. The detailed output is written to `activelearned_quipconfigs.json`.
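The JSON file mirrors the bookkeeping done in `BayesOpt_SOAP.py`; a sketch of its layout is shown below (all field values are illustrative, not real output):

```
{
    "Nclusters": 12,
    "Nnoise": 40,
    "clusters": [ {"0": [3, 7, ...], "1": [...], ...} ],
    "partition_trials": [
        {
            "train": [3, 28, ...],
            "test": [5, 31, ...],
            "hyperparam": {"cutoff": 5.2, "delta": 0.1, "n_sparse": 700.0, "nlmax": 6.0, "MAE": 0.4}
        }
    ]
}
```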

## Acknowledgements
This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory,
provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This research used resources of the
Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Argonne National Laboratory's work was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Bebop,
a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory. This research used resources of
the Advanced Photon Source, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory
under Contract No. DE-AC02-06CH11357. Use of the Center for Nanoscale Materials, an Office of Science user facility, was supported by the U.S. Department of Energy,
Office of Science, Office of Basic Energy Sciences, under Contract No. DE-AC02-06CH11357.
--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/data/.gitkeep
--------------------------------------------------------------------------------
/data/traj.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/data/traj.tar.gz
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
name: gap_al
channels:
  - defaults
  - conda-forge
dependencies:
  - ase
  - mdtraj
  - hdbscan
  - numpy
  - ipython
  - jupyter
  - matplotlib
  - pandas
  - scipy
  - seaborn
  - scikit-learn
  - pip:
      - gpyopt
--------------------------------------------------------------------------------
/images/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/images/.gitkeep
--------------------------------------------------------------------------------
/images/Active_Learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/images/Active_Learning.png
--------------------------------------------------------------------------------
/workflow/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/workflow/.gitkeep
--------------------------------------------------------------------------------
/workflow/BayesOpt_SOAP.py:
--------------------------------------------------------------------------------
"""
@author: gsivaraman@anl.gov
"""
from __future__ import print_function
import GPyOpt
import argparse
import json, os, subprocess
import numpy as np
from ase.io import read, write
# The deprecated 'commands' module was dropped in favour of subprocess.
from sklearn.metrics import mean_absolute_error, r2_score
from pprint import pprint
from time import time
from copy import deepcopy
from activesample import activesample

# os.getenv() never raises; it returns None when the variable is unset, so
# check the value explicitly instead of wrapping the call in try/except.
path = os.getenv('QUIP_PATH')
if path is None:
    print("'QUIP_PATH' not set!")
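# A hypothetical example of the expected setting (the value must be the
# directory that contains the 'teach_sparse' and 'quip' binaries, with a
# trailing slash, since it is concatenated directly with the binary names
# below):
#
#   export QUIP_PATH=/path/libatoms/QUIP-git/build/linux_x86_64_ifort_icc_openmp/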

def get_argparse():
    """
    A function to parse all the input arguments
    """
    parser = argparse.ArgumentParser(description='A python workflow to run active learning for GAP')
    parser.add_argument('-xyz','--xyzfilename', type=str,required=True,help='Trajectory filename (xyz formatted)')
    parser.add_argument('-Ns','--nsample',type=int,metavar='',\
        help="Minimum sample size",default=10)
    parser.add_argument('-Nm','--nminclust',type=int,metavar='',\
        help="Minimum cluster size. Tweak to reduce noise points",default=10)
    parser.add_argument('-c','--cutoff',nargs='+',type=float,required=True,\
        help="Cutoff range for SOAP. Usage: '-c min max'. Additional dependency: common sense!")
    parser.add_argument('-s','--nsparse',nargs='+',type=int,required=True,\
        help="Nsparse for SOAP. Usage: '-s min max'. Additional dependency: common sense ")
    parser.add_argument('-nl','--nlmax',nargs='+',type=int,required=True,\
        help="Nmax, Lmax range. Usage: '-nl min max'. Additional dependency: common sense! ")
    parser.add_argument('-Nbo','--Nopt',nargs='+',type=int,required=True,\
        help="Number of exploration and optimization steps for BO. e.g. '-Nbo 25 50' ")

    return parser.parse_args()


def run_quip(cutoff=5.0,delta=1.0,n_sparse=100,nlmax=4):
    """
    A method to launch quip and postprocess.
    """
    if os.path.isfile('quip_test.xyz'):
        os.remove('quip_test.xyz')
    if os.path.isfile('gap.xml'):
        os.remove('gap.xml')

    start = time()

    # n_sparse and nlmax arrive as floats from the optimizer; cast them to int
    # so that well-formed integer arguments are passed to teach_sparse.
    runtraining = path+"teach_sparse at_file=./train.extxyz gap={soap cutoff="+str(cutoff)+" n_sparse="+str(int(n_sparse))+" covariance_type=dot_product sparse_method=cur_points delta="+str(delta)+" zeta=4 l_max="+str(int(nlmax))+" n_max="+str(int(nlmax))+" atom_sigma=0.5 cutoff_transition_width=0.5 add_species} e0={Hf:-2.70516846:O:-0.01277342} gp_file=gap.xml default_sigma={0.0001 0.0001 0.01 .01} sparse_jitter=1.0e-8 energy_parameter_name=energy"
    output = subprocess.check_output(runtraining,stderr=subprocess.STDOUT, shell=True)

    end = time()

    print("\nTraining completed in {} sec".format(end - start))

    evaluate = path+"quip E=T atoms_filename=./test.extxyz param_filename=gap.xml | grep AT | sed 's/AT//' >> quip_test.xyz"
    pout = subprocess.check_output(evaluate,stderr=subprocess.STDOUT, shell=True)
    inp = read('test.extxyz',':')
    inenergy = [ei.get_potential_energy() for ei in inp ]
    output = read('quip_test.xyz',':')
    outenergy = [eo.get_potential_energy() for eo in output ]
    if len(inenergy) == len(outenergy) :
        mae = mean_absolute_error(np.asarray(inenergy), np.asarray(outenergy))
        rval = r2_score(np.asarray(inenergy), np.asarray(outenergy))
        return [mae, rval ]
    else:
        print("Memory blow up for cutoff={}, n_sparse={}".format(cutoff,n_sparse))
        return [float(20000), float(-2000)]


def gen_trials(minclustlen,maxclustlen):
    '''
    Generate the trials in the sampling width (Kmax, Kmin)
    :param minclustlen: Kmin (int)
    :param maxclustlen: Kmax (int)
    Return type: trials (list)
    '''
    if maxclustlen <= 50 :
        trialscale = int( np.floor(minclustlen/2) )
        maxtrial = int( np.floor(maxclustlen/trialscale) )
        trials = [trialscale*t for t in range(maxtrial,0,-1)]
    elif maxclustlen > 50 and maxclustlen <= 200 :
        trialscale = int( np.floor( (minclustlen/2 ) * (minclustlen/2 ) ) )
        maxtrial = int( np.floor(maxclustlen/trialscale) )
        trials = [trialscale*t for t in range(maxtrial,0,-1)] + [u*int(np.floor(minclustlen/2)) for u in range( int(np.floor(minclustlen/2)) -1,0,-1)]
    elif maxclustlen > 200 :
        trialscale = int( minclustlen**2 )
        maxtrial = int( np.floor(maxclustlen/trialscale) )
        trials = [trialscale*t for t in range(maxtrial,0,-1)] + [u*int(np.floor(minclustlen/2)) for u in range( int(minclustlen) ,0,-1)]
    return trials
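# A worked example of the trial schedule (a sketch; actual values depend on
# the clustering): for minclustlen=10 and maxclustlen=120 the middle branch
# applies, trialscale = 25, and gen_trials(10, 120) returns
# [100, 75, 50, 25, 20, 15, 10, 5].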

def f(x):
    """
    Surrogate function over the error metric to be optimized
    """
    evaluation = run_quip(cutoff = float(x[:,0]), delta = float(x[:,1]), n_sparse = int(x[:,2]), nlmax = int(x[:,3]))
    print("\nParam: {}, {}, {}, {} | MAE : {}, R2: {}".format(float(x[:,0]),float(x[:,1]),int(x[:,2]),int(x[:,3]),evaluation[0],evaluation[1]))

    return evaluation[0]


def plot_metric(track_metric):
    '''
    Plot the metric evolution over the trials using this function
    '''
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    mpl.style.use('seaborn')
    cmap = plt.cm.get_cmap('brg')
    plt.rcParams["font.family"] = "Arial"
    x = range(1, len(track_metric)+1 )
    plt.plot(x, track_metric, c=cmap(0.40),linewidth=3.0)
    plt.title('Active learning',fontsize=32)
    plt.xlabel('Trial',fontsize=24)
    plt.ylabel('MAE (eV)',fontsize=24)
    plt.grid(True)
    plt.tight_layout()
    plt.draw()
    plt.savefig('AL.png',dpi=300)
    plt.savefig('AL.svg',dpi=300,format='svg')
    return 'Done'


def main():
    '''
    Gather our code here
    '''
    args = get_argparse()

    cutoff = tuple(args.cutoff)
    sparse = tuple(args.nsparse)
    nlmax = tuple(args.nlmax)
    Nopt = tuple(args.Nopt)
    trackerjson = {}
    track_metric = []  # For plotting the error metric
    trackerjson['clusters'] = []

    data = activesample(args.xyzfilename,nsample=args.nsample,nminclust=args.nminclust)
    Nclust, Nnoise, clust = data.gen_cluster()
    clustlen = []

    for Ni, Ntraj in clust.items():
        clustlen.append(len(Ntraj))

    maxclustlen = max(clustlen)
    minclustlen = min(clustlen)

    print("\nNumber of elements in the smallest, largest cluster is {}, {}\n".format(minclustlen,maxclustlen))
    print("\n Nnoise : {}, Nclusters : {}\n".format(Nnoise,Nclust))

    trackerjson['Nclusters'] = Nclust
    trackerjson['Nnoise'] = Nnoise
    trackerjson['clusters'].append(clust)
    trackerjson['partition_trials'] = []
    mae_opt = 10000.  # best (lowest) MAE seen so far

    trials = gen_trials(minclustlen,maxclustlen)

    Natom = data.exyztrj[0].get_number_of_atoms()
    count = 1
    trainlen_last = 0
    print("\n The trials will run in the sampling width interval : ({},{}) \n".format(max(trials),min(trials)))

    for Ntrial in trials :

        cmd = "rm train.extxyz test.extxyz gap.xml* quip_test.xyz "
        rmout = subprocess.call(cmd, shell=True)
        data.truncate = deepcopy(Ntrial)
        print("\n\nBeginning trial number : {} with a sampling width of {}".format(count,data.truncate))
        trainlist, testlist = data.clusterpartition()
        train_lennew = len(trainlist)

        print("\n\nNumber of training and test configs: {} , {}".format(len(trainlist), len(testlist)))

        if train_lennew > trainlen_last :
            print("\n {} new learning configs added by the active sampler".format(train_lennew - trainlen_last))
            data.writeconfigs()

            bounds = [{'name': 'cutoff', 'type': 'continuous', 'domain': (cutoff[0], cutoff[1])},\
                      {'name': 'delta', 'type': 'discrete', 'domain': (0.01, 0.1, 1.0)},\
                      {'name': 'n_sparse', 'type': 'discrete', 'domain': np.arange(sparse[0],sparse[1]+1,100)},\
                      {'name': 'nlmax', 'type': 'discrete', 'domain': np.arange(nlmax[0],nlmax[1]+1)},]

            # optimizer
            opt_quip = GPyOpt.methods.BayesianOptimization(f=f, domain=bounds,initial_design_numdata = int(Nopt[0]),
                                                           model_type="GP_MCMC",
                                                           acquisition_type='EI_MCMC',  # Expected Improvement (MCMC)
                                                           evaluator_type="predictive",
                                                           exact_feval = False,
                                                           maximize=False)  ##--> True only for r2

            # optimize the GAP model hyperparameters
            print("\nBegin Optimization run \t")
            opt_quip.run_optimization(max_iter=int(Nopt[1]))
            hyperdict = {}
            for num in range(len(bounds)):
                hyperdict[bounds[num]["name"]] = opt_quip.x_opt[num]
            hyperdict["MAE"] = opt_quip.fx_opt
            trackerjson['partition_trials'].append({'train': trainlist, 'test': testlist, 'hyperparam': hyperdict})
            trainlen_last = deepcopy(train_lennew)  # ---> Update only if configs increased over iterations!
            if opt_quip.fx_opt < mae_opt :
                mae_opt = float(opt_quip.fx_opt)
                best_train = deepcopy(trainlist)
                best_test = deepcopy(testlist)
                print("\n MAE lowered in this trial: {} eV/Atom".format(mae_opt/Natom))
                track_metric.append(mae_opt/Natom)
                best_hyperparam = deepcopy(hyperdict)
            if count != 1 and np.round(mae_opt/Natom,4) <= .001 :
                print("\n Optimal configs found on trial {} with hyperparameters : {}\n".format(count, best_hyperparam))
                with open('activelearned_quipconfigs.json', 'w') as outfile:
                    json.dump(trackerjson, outfile,indent=4)
                print("\nActive learning history written to 'activelearned_quipconfigs.json' ")

                train_xyz = []
                test_xyz = []
                for i in best_train:
                    train_xyz.append(data.exyztrj[i])
                for j in best_test:
                    test_xyz.append(data.exyztrj[j])

                write("opt_train.extxyz", train_xyz)
                write("opt_test.extxyz", test_xyz)
                print("\nActive-learned configurations written to 'opt_train.extxyz','opt_test.extxyz' ")
                break
        else:
            print("No new configs found in trial {}. Skipping!".format(count))
        count += 1

    plot_metric(track_metric)

# Standard boilerplate to call the main() function to begin
# the program.
if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/workflow/activesample.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon May 27 20:46:10 2019

@author: gsivaraman@anl.gov
"""

import mdtraj as md
import numpy as np
from hdbscan import HDBSCAN
import os
import random
import matplotlib.pyplot as plt
from copy import deepcopy
from ase.io import read, write
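# Example standalone usage of the class below (a sketch; 'traj.extxyz' is a
# hypothetical ase-readable trajectory in the working directory):
#
#   from activesample import activesample
#   data = activesample('traj.extxyz', nsample=10, nminclust=10)
#   nclust, nnoise, clusters = data.gen_cluster()
#   data.truncate = 25                     # sampling width within clusters
#   train, test = data.clusterpartition()
#   data.writeconfigs()                    # writes train.extxyz / test.extxyz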
Skipping!".format(count)) 244 | count += 1 245 | 246 | plot_metric(track_metric) 247 | 248 | # Standard boilerplate to call the main() function to begin 249 | # the program. 250 | if __name__ == '__main__': 251 | main() 252 | -------------------------------------------------------------------------------- /workflow/activesample.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Mon May 27 20:46:10 2019 5 | 6 | @author: gsivaraman@anl.gov 7 | """ 8 | 9 | import mdtraj as md 10 | import numpy as np 11 | from hdbscan import HDBSCAN 12 | import os 13 | import random 14 | import matplotlib.pyplot as plt 15 | from copy import deepcopy 16 | from ase.io import read, write 17 | 18 | 19 | class activesample: 20 | """ 21 | A python class to perform active learning over a given AIMD trajectory to automate sampling of the training configurations for the GAP IP model 22 | """ 23 | 24 | 25 | def __init__(self,exyzfile,nsample=10,nminclust=10,truncate=1): 26 | """ 27 | Input Args: 28 | exyztrj (str): File name of AIMD trajectory (ase exyz format) 29 | nsample (int): Number of samples for unsupervised learning 30 | nminclust (int): Minimum number of clusters 31 | truncate (int): Truncation level for sampling clusters. '1' for default random sampling 32 | """ 33 | 34 | self.exyzfile = exyzfile 35 | self.nsample = nsample 36 | self.nminclust = nminclust 37 | self.truncate = truncate 38 | self.exyztrj = read(exyzfile,':') 39 | if not os.path.isfile(exyzfile.strip('.extxyz') + '.pdb' ): 40 | write(exyzfile.strip('.extxyz') +'.pdb', self.exyztrj,format='proteindatabank') 41 | self.pdbtrj = md.load_pdb(exyzfile.strip('.extxyz') +'.pdb') 42 | self.distances = self.compute_dist() 43 | self.clustdict = {} 44 | self.trainlist = [] 45 | self.testlist = [] 46 | 47 | 48 | def compute_dist(self): 49 | """ 50 | Compute the rmsd distance matrix from the input trajectory 51 | return: 52 | distmat : numpy distance matrix 53 | """ 54 | distmat = np.empty((self.pdbtrj.n_frames, self.pdbtrj.n_frames)) 55 | for i in range(self.pdbtrj.n_frames): 56 | distmat[i] = md.rmsd(self.pdbtrj, self.pdbtrj, i) 57 | return distmat 58 | 59 | 60 | def plot_elbow(self): 61 | """ 62 | A method to perform elbow plot of the distance matrix. The epsilon parameter 63 | can be chosen by visually inspecting the elbow. 64 | """ 65 | plotthisarray = self.distances[:,-1] 66 | fig = plt.figure() 67 | plt.plot(-np.sort(-plotthisarray)) 68 | #plt.show() 69 | return fig 70 | 71 | 72 | def gen_cluster(self): 73 | """ 74 | Perform HDBSCAN clustering on a given distance matrix 75 | return: 76 | n_clusters_ (int) : Total number of clusters excluding noise 77 | n_noise_ (int) : Total number of noise points 78 | clustdict (dict) : Cluster number keys. 

    def writeconfigs(self):
        """
        Write training and test configurations to extxyz files!
        """
        train_xyz = []
        test_xyz = []
        for i in self.trainlist:
            train_xyz.append(self.exyztrj[i])
        for j in self.testlist:
            test_xyz.append(self.exyztrj[j])

        write("train.extxyz", train_xyz)
        write("test.extxyz", test_xyz)
--------------------------------------------------------------------------------