├── LICENSE
├── POT
│   └── AL_POT.tar.gz
├── README.md
├── data
│   ├── .gitkeep
│   └── traj.tar.gz
├── environment.yml
├── images
│   ├── .gitkeep
│   └── Active_Learning.png
└── workflow
    ├── .gitkeep
    ├── BayesOpt_SOAP.py
    └── activesample.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019-Present [Ganesh Sivaraman]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/POT/AL_POT.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/POT/AL_POT.tar.gz
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Active learning workflow for Gaussian Approximation Potential (GAP)

Documentation for the active learning workflow developed as part of the article "Machine Learning Inter-Atomic Potentials Generation Driven by Active Learning:
A Case Study for Amorphous and Liquid Hafnium dioxide".
__For more details, please refer to the [paper](https://www.nature.com/articles/s41524-020-00367-7).__

If you use this active learning workflow in your research, please cite us as
```
@article{sivaraman2020machine,
title={Machine-learned interatomic potentials by active learning: amorphous and liquid hafnium dioxide},
author={Sivaraman, Ganesh and Krishnamoorthy, Anand Narayanan and Baur, Matthias and Holm, Christian and Stan, Marius and Cs{\'a}nyi, G{\'a}bor and Benmore, Chris and V{\'a}zquez-Mayagoitia, {\'A}lvaro},
journal={npj Computational Materials},
volume={6},
number={1},
pages={1--8},
year={2020},
publisher={Nature Publishing Group}
}
```

![pipeline](images/Active_Learning.png)


First, clone the GitHub repository and then create a new Anaconda environment from the **environment.yml** file:

```
git clone https://github.com/argonne-lcf/active-learning-md
cd active-learning-md
conda env create -f environment.yml
```
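The environment is named `gap_al` in **environment.yml**; assuming it resolves cleanly, activate it before running any of the scripts below:

```
conda activate gap_al
```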
As the title suggests, active learning is implemented to automate the training of the GAP model. Please refer to the
[QUIP software library online documentation](http://libatoms.github.io/QUIP/) for more details. Once the QUIP libraries have been compiled, please set the
environment variable below in your HPC submission script; it should point to the full path of the directory containing the `teach_sparse` and `quip` binaries. The workflow script described in
the next section will automatically invoke the required binaries if this variable is set correctly.

```
export QUIP_PATH=/path/libatoms/QUIP-git/build/linux_x86_64_ifort_icc_openmp/
```

**Note: The version of QUIP used in this study is compiled with OpenMP support and can fully exploit node-level parallelization!**


## The workflow script

Let us go through the workflow script, which is implemented in `BayesOpt_SOAP.py`. Its command-line options are shown below.

```
$ python BayesOpt_SOAP.py --help

usage: BayesOpt_SOAP.py [-h] -xyz XYZFILENAME [-Ns] [-Nm] -c CUTOFF
                        [CUTOFF ...] -s NSPARSE [NSPARSE ...] -nl NLMAX
                        [NLMAX ...] -Nbo NOPT [NOPT ...]

A python workflow to run active learning for GAP

optional arguments:
  -h, --help            show this help message and exit
  -xyz XYZFILENAME, --xyzfilename XYZFILENAME
                        Trajectory filename (xyz formatted)
  -Ns , --nsample       Minimum sample size
  -Nm , --nminclust     Minimum cluster size. Tweak to reduce noise points
  -c CUTOFF [CUTOFF ...], --cutoff CUTOFF [CUTOFF ...]
                        Cutoff range for SOAP. Usage: '-c min max'.
                        Additional dependency: common sense!
  -s NSPARSE [NSPARSE ...], --nsparse NSPARSE [NSPARSE ...]
                        Nsparse for SOAP. Usage: '-s min max'. Additional
                        dependency: common sense
  -nl NLMAX [NLMAX ...], --nlmax NLMAX [NLMAX ...]
                        Nmax, Lmax range. Usage: '-nl min max'. Additional
                        dependency: common sense!
  -Nbo NOPT [NOPT ...], --Nopt NOPT [NOPT ...]
                        Number of exploration and optimization steps for BO.
                        e.g. '-Nbo 25 50'

```


**Warning: Do not run this on your desktop. Try to get an HPC allocation and run it there!**

## How to run the workflow?

Extract the trajectory from the `data` folder.
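For example (assuming the archive unpacks the `a-Hfo2-300K-NVT.extxyz` trajectory used below into the working directory):

```
tar -xzf data/traj.tar.gz
```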
Then launch the workflow:

```
python -u BayesOpt_SOAP.py -xyz a-Hfo2-300K-NVT.extxyz -Ns 10 -Nm 10 -c 4 6 -s 100 1200 -nl 3 8 -Nbo 10 20 > BO-SOAP.out

```

In our benchmarks, the workflow converged at Niter = 2 with a total run time of `1080 s` on a single node (Intel Broadwell). The optimal training and validation configurations
are written to `opt_train.extxyz` and `opt_test.extxyz` respectively. The detailed output is written to `activelearned_quipconfigs.json`.
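The JSON file mirrors the bookkeeping done in `BayesOpt_SOAP.py`; a sketch of its layout is shown below (all field values are illustrative, not real output):

```
{
    "Nclusters": 12,
    "Nnoise": 40,
    "clusters": [ {"0": [3, 7, ...], "1": [...], ...} ],
    "partition_trials": [
        {
            "train": [3, 28, ...],
            "test": [5, 31, ...],
            "hyperparam": {"cutoff": 5.2, "delta": 0.1, "n_sparse": 700.0, "nlmax": 6.0, "MAE": 0.4}
        }
    ]
}
```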

## Acknowledgements
This material is based upon work supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory,
provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This research used resources of the
Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Argonne National Laboratory's work was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Bebop,
a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory. This research used resources of
the Advanced Photon Source, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory
under Contract No. DE-AC02-06CH11357. Use of the Center for Nanoscale Materials, an Office of Science user facility, was supported by the U.S. Department of Energy,
Office of Science, Office of Basic Energy Sciences, under Contract No. DE-AC02-06CH11357.
--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/data/.gitkeep
--------------------------------------------------------------------------------
/data/traj.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/data/traj.tar.gz
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
name: gap_al
channels:
  - defaults
  - conda-forge
dependencies:
  - ase
  - mdtraj
  - hdbscan
  - numpy
  - ipython
  - jupyter
  - matplotlib
  - pandas
  - scipy
  - seaborn
  - scikit-learn
  - pip:
      - gpyopt
--------------------------------------------------------------------------------
/images/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/images/.gitkeep
--------------------------------------------------------------------------------
/images/Active_Learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/images/Active_Learning.png
--------------------------------------------------------------------------------
/workflow/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/argonne-lcf/active-learning-md/624d61fdb9e7ef2b07bf2d6cfeb654a99f324609/workflow/.gitkeep
--------------------------------------------------------------------------------
/workflow/BayesOpt_SOAP.py:
--------------------------------------------------------------------------------
"""
@author: gsivaraman@anl.gov
"""
from __future__ import print_function
import GPyOpt
import argparse
import json, os, subprocess
import numpy as np
from ase.io import read, write
# The deprecated 'commands' module was dropped in favour of subprocess.
from sklearn.metrics import mean_absolute_error, r2_score
from pprint import pprint
from time import time
from copy import deepcopy
from activesample import activesample

# os.getenv() never raises; it returns None when the variable is unset, so
# check the value explicitly instead of wrapping the call in try/except.
path = os.getenv('QUIP_PATH')
if path is None:
    print("'QUIP_PATH' not set!")
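# A hypothetical example of the expected setting (the value must be the
# directory that contains the 'teach_sparse' and 'quip' binaries, with a
# trailing slash, since it is concatenated directly with the binary names
# below):
#
#   export QUIP_PATH=/path/libatoms/QUIP-git/build/linux_x86_64_ifort_icc_openmp/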

def get_argparse():
    """
    A function to parse all the input arguments
    """
    parser = argparse.ArgumentParser(description='A python workflow to run active learning for GAP')
    parser.add_argument('-xyz','--xyzfilename', type=str,required=True,help='Trajectory filename (xyz formatted)')
    parser.add_argument('-Ns','--nsample',type=int,metavar='',\
        help="Minimum sample size",default=10)
    parser.add_argument('-Nm','--nminclust',type=int,metavar='',\
        help="Minimum cluster size. Tweak to reduce noise points",default=10)
    parser.add_argument('-c','--cutoff',nargs='+',type=float,required=True,\
        help="Cutoff range for SOAP. Usage: '-c min max'. Additional dependency: common sense!")
    parser.add_argument('-s','--nsparse',nargs='+',type=int,required=True,\
        help="Nsparse for SOAP. Usage: '-s min max'. Additional dependency: common sense ")
    parser.add_argument('-nl','--nlmax',nargs='+',type=int,required=True,\
        help="Nmax, Lmax range. Usage: '-nl min max'. Additional dependency: common sense! ")
    parser.add_argument('-Nbo','--Nopt',nargs='+',type=int,required=True,\
        help="Number of exploration and optimization steps for BO. e.g. '-Nbo 25 50' ")

    return parser.parse_args()


def run_quip(cutoff=5.0,delta=1.0,n_sparse=100,nlmax=4):
    """
    A method to launch quip and postprocess.
    """
    if os.path.isfile('quip_test.xyz'):
        os.remove('quip_test.xyz')
    if os.path.isfile('gap.xml'):
        os.remove('gap.xml')

    start = time()

    # n_sparse and nlmax arrive as floats from the optimizer; cast them to int
    # so that well-formed integer arguments are passed to teach_sparse.
    runtraining = path+"teach_sparse at_file=./train.extxyz gap={soap cutoff="+str(cutoff)+" n_sparse="+str(int(n_sparse))+" covariance_type=dot_product sparse_method=cur_points delta="+str(delta)+" zeta=4 l_max="+str(int(nlmax))+" n_max="+str(int(nlmax))+" atom_sigma=0.5 cutoff_transition_width=0.5 add_species} e0={Hf:-2.70516846:O:-0.01277342} gp_file=gap.xml default_sigma={0.0001 0.0001 0.01 .01} sparse_jitter=1.0e-8 energy_parameter_name=energy"
    output = subprocess.check_output(runtraining,stderr=subprocess.STDOUT, shell=True)

    end = time()

    print("\nTraining completed in {} sec".format(end - start))

    evaluate = path+"quip E=T atoms_filename=./test.extxyz param_filename=gap.xml | grep AT | sed 's/AT//' >> quip_test.xyz"
    pout = subprocess.check_output(evaluate,stderr=subprocess.STDOUT, shell=True)
    inp = read('test.extxyz',':')
    inenergy = [ei.get_potential_energy() for ei in inp ]
    output = read('quip_test.xyz',':')
    outenergy = [eo.get_potential_energy() for eo in output ]
    if len(inenergy) == len(outenergy) :
        mae = mean_absolute_error(np.asarray(inenergy), np.asarray(outenergy))
        rval = r2_score(np.asarray(inenergy), np.asarray(outenergy))
        return [mae, rval ]
    else:
        print("Memory blow up for cutoff={}, n_sparse={}".format(cutoff,n_sparse))
        return [float(20000), float(-2000)]


def gen_trials(minclustlen,maxclustlen):
    '''
    Generate the trials in the sampling width (Kmax, Kmin)
    :param minclustlen: Kmin (int)
    :param maxclustlen: Kmax (int)
    Return type: trials (list)
    '''
    if maxclustlen <= 50 :
        trialscale = int( np.floor(minclustlen/2) )
        maxtrial = int( np.floor(maxclustlen/trialscale) )
        trials = [trialscale*t for t in range(maxtrial,0,-1)]
    elif maxclustlen > 50 and maxclustlen <= 200 :
        trialscale = int( np.floor( (minclustlen/2 ) * (minclustlen/2 ) ) )
        maxtrial = int( np.floor(maxclustlen/trialscale) )
        trials = [trialscale*t for t in range(maxtrial,0,-1)] + [u*int(np.floor(minclustlen/2)) for u in range( int(np.floor(minclustlen/2)) -1,0,-1)]
    elif maxclustlen > 200 :
        trialscale = int( minclustlen**2 )
        maxtrial = int( np.floor(maxclustlen/trialscale) )
        trials = [trialscale*t for t in range(maxtrial,0,-1)] + [u*int(np.floor(minclustlen/2)) for u in range( int(minclustlen) ,0,-1)]
    return trials
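# A worked example of the trial schedule (a sketch; actual values depend on
# the clustering): for minclustlen=10 and maxclustlen=120 the middle branch
# applies, trialscale = 25, and gen_trials(10, 120) returns
# [100, 75, 50, 25, 20, 15, 10, 5].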

def f(x):
    """
    Surrogate function over the error metric to be optimized
    """
    evaluation = run_quip(cutoff = float(x[:,0]), delta = float(x[:,1]), n_sparse = int(x[:,2]), nlmax = int(x[:,3]))
    print("\nParam: {}, {}, {}, {} | MAE : {}, R2: {}".format(float(x[:,0]),float(x[:,1]),int(x[:,2]),int(x[:,3]),evaluation[0],evaluation[1]))

    return evaluation[0]


def plot_metric(track_metric):
    '''
    Plot the metric evolution over the trials using this function
    '''
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    mpl.style.use('seaborn')
    cmap = plt.cm.get_cmap('brg')
    plt.rcParams["font.family"] = "Arial"
    x = range(1, len(track_metric)+1 )
    plt.plot(x, track_metric, c=cmap(0.40),linewidth=3.0)
    plt.title('Active learning',fontsize=32)
    plt.xlabel('Trial',fontsize=24)
    plt.ylabel('MAE (eV)',fontsize=24)
    plt.grid(True)
    plt.tight_layout()
    plt.draw()
    plt.savefig('AL.png',dpi=300)
    plt.savefig('AL.svg',dpi=300,format='svg')
    return 'Done'


def main():
    '''
    Gather our code here
    '''
    args = get_argparse()

    cutoff = tuple(args.cutoff)
    sparse = tuple(args.nsparse)
    nlmax = tuple(args.nlmax)
    Nopt = tuple(args.Nopt)
    trackerjson = {}
    track_metric = []  # For plotting the error metric
    trackerjson['clusters'] = []

    data = activesample(args.xyzfilename,nsample=args.nsample,nminclust=args.nminclust)
    Nclust, Nnoise, clust = data.gen_cluster()
    clustlen = []

    for Ni, Ntraj in clust.items():
        clustlen.append(len(Ntraj))

    maxclustlen = max(clustlen)
    minclustlen = min(clustlen)

    print("\nNumber of elements in the smallest, largest cluster is {}, {}\n".format(minclustlen,maxclustlen))
    print("\n Nnoise : {}, Nclusters : {}\n".format(Nnoise,Nclust))

    trackerjson['Nclusters'] = Nclust
    trackerjson['Nnoise'] = Nnoise
    trackerjson['clusters'].append(clust)
    trackerjson['partition_trials'] = []
    mae_opt = 10000.  # best (lowest) MAE seen so far

    trials = gen_trials(minclustlen,maxclustlen)

    Natom = data.exyztrj[0].get_number_of_atoms()
    count = 1
    trainlen_last = 0
    print("\n The trials will run in the sampling width interval : ({},{}) \n".format(max(trials),min(trials)))

    for Ntrial in trials :

        cmd = "rm train.extxyz test.extxyz gap.xml* quip_test.xyz "
        rmout = subprocess.call(cmd, shell=True)
        data.truncate = deepcopy(Ntrial)
        print("\n\nBeginning trial number : {} with a sampling width of {}".format(count,data.truncate))
        trainlist, testlist = data.clusterpartition()
        train_lennew = len(trainlist)

        print("\n\nNumber of training and test configs: {} , {}".format(len(trainlist), len(testlist)))

        if train_lennew > trainlen_last :
            print("\n {} new learning configs added by the active sampler".format(train_lennew - trainlen_last))
            data.writeconfigs()

            bounds = [{'name': 'cutoff', 'type': 'continuous', 'domain': (cutoff[0], cutoff[1])},\
                      {'name': 'delta', 'type': 'discrete', 'domain': (0.01, 0.1, 1.0)},\
                      {'name': 'n_sparse', 'type': 'discrete', 'domain': np.arange(sparse[0],sparse[1]+1,100)},\
                      {'name': 'nlmax', 'type': 'discrete', 'domain': np.arange(nlmax[0],nlmax[1]+1)},]

            # optimizer
            opt_quip = GPyOpt.methods.BayesianOptimization(f=f, domain=bounds,initial_design_numdata = int(Nopt[0]),
                                                           model_type="GP_MCMC",
                                                           acquisition_type='EI_MCMC',  # Expected Improvement (MCMC)
                                                           evaluator_type="predictive",
                                                           exact_feval = False,
                                                           maximize=False)  ##--> True only for r2

            # optimize the GAP model hyperparameters
            print("\nBegin Optimization run \t")
            opt_quip.run_optimization(max_iter=int(Nopt[1]))
            hyperdict = {}
            for num in range(len(bounds)):
                hyperdict[bounds[num]["name"]] = opt_quip.x_opt[num]
            hyperdict["MAE"] = opt_quip.fx_opt
            trackerjson['partition_trials'].append({'train': trainlist, 'test': testlist, 'hyperparam': hyperdict})
            trainlen_last = deepcopy(train_lennew)  # ---> Update only if configs increased over iterations!
            if opt_quip.fx_opt < mae_opt :
                mae_opt = float(opt_quip.fx_opt)
                best_train = deepcopy(trainlist)
                best_test = deepcopy(testlist)
                print("\n MAE lowered in this trial: {} eV/Atom".format(mae_opt/Natom))
                track_metric.append(mae_opt/Natom)
                best_hyperparam = deepcopy(hyperdict)
            if count != 1 and np.round(mae_opt/Natom,4) <= .001 :
                print("\n Optimal configs found on trial {} with hyperparameters : {}\n".format(count, best_hyperparam))
                with open('activelearned_quipconfigs.json', 'w') as outfile:
                    json.dump(trackerjson, outfile,indent=4)
                print("\nActive learning history written to 'activelearned_quipconfigs.json' ")

                train_xyz = []
                test_xyz = []
                for i in best_train:
                    train_xyz.append(data.exyztrj[i])
                for j in best_test:
                    test_xyz.append(data.exyztrj[j])

                write("opt_train.extxyz", train_xyz)
                write("opt_test.extxyz", test_xyz)
                print("\nActive-learned configurations written to 'opt_train.extxyz','opt_test.extxyz' ")
                break
        else:
            print("No new configs found in trial {}. Skipping!".format(count))
        count += 1

    plot_metric(track_metric)

# Standard boilerplate to call the main() function to begin
# the program.
if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/workflow/activesample.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon May 27 20:46:10 2019

@author: gsivaraman@anl.gov
"""

import mdtraj as md
import numpy as np
from hdbscan import HDBSCAN
import os
import random
import matplotlib.pyplot as plt
from copy import deepcopy
from ase.io import read, write
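# Example standalone usage of the class below (a sketch; 'traj.extxyz' is a
# hypothetical ase-readable trajectory in the working directory):
#
#   from activesample import activesample
#   data = activesample('traj.extxyz', nsample=10, nminclust=10)
#   nclust, nnoise, clusters = data.gen_cluster()
#   data.truncate = 25                     # sampling width within clusters
#   train, test = data.clusterpartition()
#   data.writeconfigs()                    # writes train.extxyz / test.extxyz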
Skipping!".format(count)) 244 | count += 1 245 | 246 | plot_metric(track_metric) 247 | 248 | # Standard boilerplate to call the main() function to begin 249 | # the program. 250 | if __name__ == '__main__': 251 | main() 252 | -------------------------------------------------------------------------------- /workflow/activesample.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Mon May 27 20:46:10 2019 5 | 6 | @author: gsivaraman@anl.gov 7 | """ 8 | 9 | import mdtraj as md 10 | import numpy as np 11 | from hdbscan import HDBSCAN 12 | import os 13 | import random 14 | import matplotlib.pyplot as plt 15 | from copy import deepcopy 16 | from ase.io import read, write 17 | 18 | 19 | class activesample: 20 | """ 21 | A python class to perform active learning over a given AIMD trajectory to automate sampling of the training configurations for the GAP IP model 22 | """ 23 | 24 | 25 | def __init__(self,exyzfile,nsample=10,nminclust=10,truncate=1): 26 | """ 27 | Input Args: 28 | exyztrj (str): File name of AIMD trajectory (ase exyz format) 29 | nsample (int): Number of samples for unsupervised learning 30 | nminclust (int): Minimum number of clusters 31 | truncate (int): Truncation level for sampling clusters. '1' for default random sampling 32 | """ 33 | 34 | self.exyzfile = exyzfile 35 | self.nsample = nsample 36 | self.nminclust = nminclust 37 | self.truncate = truncate 38 | self.exyztrj = read(exyzfile,':') 39 | if not os.path.isfile(exyzfile.strip('.extxyz') + '.pdb' ): 40 | write(exyzfile.strip('.extxyz') +'.pdb', self.exyztrj,format='proteindatabank') 41 | self.pdbtrj = md.load_pdb(exyzfile.strip('.extxyz') +'.pdb') 42 | self.distances = self.compute_dist() 43 | self.clustdict = {} 44 | self.trainlist = [] 45 | self.testlist = [] 46 | 47 | 48 | def compute_dist(self): 49 | """ 50 | Compute the rmsd distance matrix from the input trajectory 51 | return: 52 | distmat : numpy distance matrix 53 | """ 54 | distmat = np.empty((self.pdbtrj.n_frames, self.pdbtrj.n_frames)) 55 | for i in range(self.pdbtrj.n_frames): 56 | distmat[i] = md.rmsd(self.pdbtrj, self.pdbtrj, i) 57 | return distmat 58 | 59 | 60 | def plot_elbow(self): 61 | """ 62 | A method to perform elbow plot of the distance matrix. The epsilon parameter 63 | can be chosen by visually inspecting the elbow. 64 | """ 65 | plotthisarray = self.distances[:,-1] 66 | fig = plt.figure() 67 | plt.plot(-np.sort(-plotthisarray)) 68 | #plt.show() 69 | return fig 70 | 71 | 72 | def gen_cluster(self): 73 | """ 74 | Perform HDBSCAN clustering on a given distance matrix 75 | return: 76 | n_clusters_ (int) : Total number of clusters excluding noise 77 | n_noise_ (int) : Total number of noise points 78 | clustdict (dict) : Cluster number keys. 

    def writeconfigs(self):
        """
        Write training and test configurations to extxyz files!
        """
        train_xyz = []
        test_xyz = []
        for i in self.trainlist:
            train_xyz.append(self.exyztrj[i])
        for j in self.testlist:
            test_xyz.append(self.exyztrj[j])

        write("train.extxyz", train_xyz)
        write("test.extxyz", test_xyz)
--------------------------------------------------------------------------------