├── README.md ├── codes ├── demo_ML_training.py ├── explore_data.py ├── func_tmp.py ├── paper_SI_hypertune.py ├── paper_train_model.py └── readme.md └── figs ├── counts_vs_elements.png ├── fig2.png ├── fig3.png ├── fig4.png ├── fig5.png └── table2.png /README.md: -------------------------------------------------------------------------------- 1 | # Efficient first principles based modeling via machine learning: from simple representations to high entropy materials 2 | 3 | ## Paper 4 | This is the repo associated with our paper *Efficient first principles based modeling via machine learning: from simple representations to high entropy materials* ([publisher version](https://doi.org/10.1039/D4TA00982G), [arXiv version](https://arxiv.org/html/2403.15579v1)), in which we create a large DFT dataset for HEMs and evaluate the in-distribution and out-of-distribution performance of machine learning models. 5 | 6 | 7 | ## DFT dataset for high entropy alloys [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10854500.svg)](https://doi.org/10.5281/zenodo.10854500) 8 | 9 | Our DFT dataset encompasses bcc and fcc structures composed of eight elements and covers all possible 2- to 7-component alloy systems formed by them. 10 | The dataset used in the paper is publicly available on [Zenodo](https://doi.org/10.5281/zenodo.10854500), which includes initial and final structures, formation energies, atomic magnetic moments, and charges, among other attributes. 11 | 12 | *Note: The trajectory data (energies and forces for structures during the DFT relaxations) is not published with this paper; it will be released later with a work on machine learning force fields for HEMs.* 13 | 14 | 15 | ### Table: Numbers of alloy systems and structures. 16 | | No. components | 2 | 3 | 4 | 5 | 6 | 7 | Total | 17 | |----------------------------|------|-------|-------|-------|------|------|-------| 18 | | Alloy systems | 28 | 56 | 70 | 56 | 28 | 8 | 246 | 19 | | Ordered (2-8 atoms) | 4975 | 22098 | 29494 | 6157 | 3132 | 3719 | 69575 | 20 | | SQS (27, 64, or 125 atoms) | 715 | 3302 | 3542 | 4718 | 1183 | 762 | 14222 | 21 | | Ordered+SQS | 5690 | 25400 | 33036 | 10875 | 4315 | 4481 | 83797 | 22 | 23 | 24 | ### Number of structures as a function of a given constituent element. 25 | The legend indicates the number of components. 26 |

27 | image 28 |
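For quick orientation, here is a minimal sketch of how the featurized data can be loaded with pandas, following the column conventions used by `codes/demo_ML_training.py` (the file name and `data/` path below are assumptions; adjust them to wherever you unpack the Zenodo archive):

```python
import pandas as pd

# Featurized dataset as expected by codes/demo_ML_training.py
# (path/file name assumed; adjust to your local copy of the Zenodo data).
df = pd.read_csv('data/structure_featurized.dat_all.csv', index_col=0)

# Target and a few metadata columns used throughout the scripts
y = df['Ef_per_atom']                                   # formation energy (eV/atom)
meta = df[['formula', 'lattice', 'NIONS', 'nelements']]

# The last 273 columns are the Matminer features
X = df[df.columns[-273:]]
print(X.shape, y.shape)
```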

29 | 30 | ## Generalization performance of machine learning models 31 | The [Zenodo](https://doi.org/10.5281/zenodo.10854500) record also provides the Matminer features of the initial and final structures, together with a demo script for training tree-based models. The results in the paper can be readily reproduced by adapting the demo script for different train-test splits. The `codes` folder provides the scripts that we used in the paper. 32 | 33 | ### Generalization performance from small to large structures. 34 |

35 | image 36 |
37 | (a) Normalized error obtained by training on structures with ≤ N atoms and evaluating on structures with > N atoms. (b) ALIGNN prediction on SQSs with > 27 atoms, obtained by training on structures with ≤ 4 atoms. (c) Parity plot of the ALIGNN prediction on SQSs with > 27 atoms, obtained by training on structures with ≤ 8 atoms. 38 |
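As a concrete example, a small-to-large split like the one in (a) can be reproduced by adapting the demo script along these lines (a sketch using the XGBoost settings from `codes/demo_ML_training.py` and an assumed CSV path; the full workflow is in `codes/paper_train_model.py`):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('data/structure_featurized.dat_all.csv', index_col=0)  # assumed path
X, y = df[df.columns[-273:]], df['Ef_per_atom']

# Train on small cells (<= 8 atoms), evaluate on the larger SQS cells (> 8 atoms)
train, test = df['NIONS'] <= 8, df['NIONS'] > 8

model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.4,
                         reg_lambda=0.01, reg_alpha=0.1,
                         colsample_bytree=0.5, colsample_bylevel=0.7,
                         num_parallel_tree=6, tree_method='hist')
model.fit(X[train], y[train])
y_pred = model.predict(X[test])

mae = mean_absolute_error(y[test], y_pred)
mad = np.mean(np.abs(y[test] - y[test].mean()))  # MAD of the test set
print(f'MAE = {mae:.3f} eV/atom, normalized error MAE/MAD = {mae/mad:.2f}')
```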

39 |    40 | 41 | ### Generalization performance from low-order to high-order systems. 42 |

43 | image 44 |
45 | (a) Normalized error obtained by training on structures with ≤ N elements and evaluating on structures with > N elements. (b) Parity plot of the ALIGNN prediction on structures with ≥ 3 elements, obtained by training on binary structures. (c) Parity plot of the ALIGNN prediction on structures with ≥ 4 elements, obtained by training on binary and ternary structures. 46 |

47 |    48 | 49 | ### Generalization performance from (near-)equimolar to non-equimolar structures. 50 |

51 | image 52 |
53 | (a) Normalized error obtained by training on structures with maxΔc below a given threshold and evaluating on the rest. (b) Predictions on non-equimolar structures (maxΔc > 0) by the ALIGNN model trained on equimolar structures (maxΔc = 0). (c) Predictions on structures with relatively strong deviation from equimolar composition (maxΔc > 0.2) by the ALIGNN model trained on structures with relatively weak deviation from equimolar composition (maxΔc ≤ 0.2). maxΔc is defined as the maximum concentration difference between any two elements in a structure. 54 |
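A sketch of how maxΔc can be computed from a formula, following the `diff_c` calculation in `codes/paper_train_model.py`:

```python
import numpy as np
from pymatgen.core.composition import Composition

def max_delta_c(formula: str) -> float:
    """Maximum concentration difference between any two elements in a structure."""
    amounts = np.array(list(Composition(formula).get_el_amt_dict().values()))
    fractions = amounts / amounts.sum()
    return fractions.max() - fractions.min()

print(max_delta_c('CrFeCoNi'))   # equimolar: 0.0
print(max_delta_c('Cr3FeCoNi'))  # non-equimolar: 1/2 - 1/6 ≈ 0.33
```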

55 |    56 | 57 | ### Effects of dataset size and use of unrelaxed vs. relaxed structures 58 |

59 | image 60 |
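This comparison can be sketched by fitting the same model to the features of the unrelaxed (initial) and relaxed (final) structures, which are provided as separate files (the file names here are assumptions based on the paths hard-coded in `codes/explore_data.py` and `codes/demo_ML_training.py`):

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import cross_val_score

for label, fname in [('relaxed',   'data/structure_featurized.dat_all.csv'),
                     ('unrelaxed', 'data/structure_ini_featurized.dat_all.csv')]:
    df = pd.read_csv(fname, index_col=0).dropna()
    X, y = df[df.columns[-273:]], df['Ef_per_atom']
    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.4, tree_method='hist')
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring='neg_mean_absolute_error').mean()
    print(f'{label}: 5-fold CV MAE = {mae:.3f} eV/atom')
```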

61 |    62 | 63 | ### Overview of model performance on different generalization tasks 64 |

65 | image 66 |

67 | 68 | 69 | -------------------------------------------------------------------------------- /codes/demo_ML_training.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | @author: kangming 5 | 6 | A demo to show the XGBoost training on the formation energy dataset 7 | """ 8 | #%% 9 | import numpy as np 10 | import pandas as pd 11 | from sklearn.pipeline import Pipeline 12 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score 13 | from sklearn.model_selection import cross_val_predict 14 | from matplotlib import pyplot as plt 15 | from sklearn.preprocessing import StandardScaler 16 | import xgboost as xgb 17 | 18 | 19 | model= Pipeline([ 20 | ('scaler', StandardScaler()), # scaling does not actually matter for tree methods 21 | ('model', xgb.XGBRegressor( 22 | n_estimators=500, 23 | learning_rate=0.4, 24 | reg_lambda=0.01,reg_alpha=0.1, 25 | colsample_bytree=0.5,colsample_bylevel=0.7, 26 | num_parallel_tree=6, 27 | # tree_method='gpu_hist', gpu_id=0 28 | tree_method = "hist", device = "cuda", 29 | ) 30 | ) 31 | ]) 32 | 33 | #%% 34 | struct = 'structure' 35 | 36 | dat_type = 'featurized' 37 | # df = pd.read_pickle(f'data/{struct}_{dat_type}.dat_reduced.pkl') 38 | df = pd.read_csv(f'data/{struct}_{dat_type}.dat_all.csv',index_col=0) 39 | 40 | #%% 41 | if dat_type == 'featurized': 42 | nfeatures = 273 43 | cols_feat = df.columns[-nfeatures:] 44 | 45 | X_all = df[cols_feat] 46 | # drop features whose variance is zero 47 | X_all = X_all.loc[:,X_all.var()!=0] 48 | else: 49 | X_all = df[f'graphs_{struct}'] 50 | 51 | y_all = df['Ef_per_atom'] 52 | 53 | #%% 54 | # Get the 5-fold cross-validation estimates for the whole dataset 55 | cv = 5 56 | y_pred = cross_val_predict(model, X_all, y_all, cv=cv) 57 | 58 | #%% 59 | df_y = pd.DataFrame({'Ef_true':y_all, 'Ef_pred':y_pred}, index=y_all.index) 60 | cols2add = ['formula','lattice','NIONS'] 61 | df_y = pd.concat([df_y,df[cols2add]], axis=1) 62 | 63 | #%% 64 | # get the mae of Ef_true and Ef_pred 65 | mad = np.mean(np.abs(df_y['Ef_true'] - df_y['Ef_true'].mean())) 66 | mae = mean_absolute_error(df_y['Ef_true'], df_y['Ef_pred']) 67 | print(f'MAD: {mad:.3f}') 68 | print(f'MAE: {mae:.3f}') 69 | # get the r2 score 70 | r2 = r2_score(df_y['Ef_true'], df_y['Ef_pred']) 71 | print(f'R2: {r2:.3f}') 72 | 73 | #%% parity plot 74 | fig, ax = plt.subplots(figsize=(5,5)) 75 | ax.scatter(df_y['Ef_true'], df_y['Ef_pred'], s=5) 76 | lims = [-0.6, 0.6] 77 | #diag line 78 | ax.plot(lims,lims, 'k--', lw=1) 79 | # set limits 80 | ax.set_xlim(lims) 81 | ax.set_ylim(lims) 82 | ax.set_xlabel('DFT formation energy (eV/atom)') 83 | ax.set_ylabel('Predicted formation energy (eV/atom)') 84 | # add scores to fig 85 | ax.text(0.05, 0.9, f'MAE: {mae:.3f} eV/atom', transform=ax.transAxes) 86 | ax.text(0.05, 0.85, f'R2: {r2:.3f}', transform=ax.transAxes) 87 | 88 | # %% 89 | -------------------------------------------------------------------------------- /codes/explore_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | #%% 5 | import pandas as pd 6 | import numpy as np 7 | import matplotlib 8 | import matplotlib.pyplot as plt 9 | from sklearn.decomposition import PCA 10 | import seaborn as sns 11 | 12 | # use color blind friendly colors 13 | sns.set_palette("colorblind") 14 | 15 | figsize=(4,4) 16 | 17 | #%% 18 | ''' 19 | Import data and the related SROs 20 | ''' 
21 | dat = pd.read_csv('data/structure_ini_featurized.dat_all.csv',index_col=0) 22 | feature_names = dat.columns[-273:] 23 | compo_names = feature_names[-145:] 24 | struc_names = feature_names[:-145] 25 | # this is intended to be used for removing the data from MP 26 | dat = dat.dropna() 27 | # dat_from_MP = dat[dat['lattice'].isna()] 28 | # print('dat_from_MP: ',dat_from_MP.shape, ' dat: ', dat.shape) 29 | 30 | # get SROs 31 | df_sro = pd.read_csv('data/SROs_structure_ini.csv',index_col=0) 32 | # concat df_sro to dat 33 | dat = pd.concat([dat,df_sro],axis=1) 34 | 35 | #%% 36 | # elemental distribution 37 | chemical_system = dat['chemical_system'] 38 | elements = ['Al','Si','Cr','Mn','Fe','Co','Ni','Cu'] 39 | 40 | # count the number of chemical systems containing each element 41 | nstructures = {} 42 | nstructures_nele = {} 43 | for element in elements: 44 | nstructures[element] = chemical_system.str.contains(element).sum() 45 | for nele in range(2,8): 46 | nstructures_nele[element,nele] = dat[dat['nelements']==nele]['chemical_system'].str.contains(element).sum() 47 | 48 | # bar plot 49 | fig, ax = plt.subplots(figsize=(4,3.5)) 50 | # ax.bar(nstructures.keys(),nstructures.values()) 51 | # make a stacked bar plot using nstructures_nele 52 | bottom = np.zeros(len(elements)) 53 | for nele in range(2,8): 54 | ax.bar(elements,[nstructures_nele[element,nele] for element in elements],bottom=bottom,label=str(nele)) 55 | bottom += np.array([nstructures_nele[element,nele] for element in elements]) 56 | ax.legend() 57 | 58 | ax.set_ylim(0,5.5e4) 59 | # add minor yticks (5000) 60 | ax.yaxis.set_minor_locator(matplotlib.ticker.MultipleLocator(5000)) 61 | ax.set_ylabel('Number of structures') 62 | 63 | # fig.savefig('figs/describe_data/counts_vs_elements.pdf') 64 | 65 | 66 | #%% 67 | ''' 68 | Define: 69 | 70 | structure wise dataframe or series: 71 | X: original features 72 | X_std: standardized features 73 | y: formation enerngy per atom 74 | sg: space group number 75 | lattice: lattice 76 | csro: chemical SRO (mean abs SRO1, mean abs SRO2, mean abs SRO3, mean abs SRO4) 77 | 78 | Dict of index. 
Keys of dict are used to classified the data according to the number of atoms or number of elements 79 | neles: keys are the number of elements, values are the index of the data with the corresponding number of elements 80 | natoms: keys are the number of atoms, values are the index of the data with the corresponding number of atoms 81 | ''' 82 | 83 | # features 84 | X = dat[feature_names] 85 | # drop columns whose variance is 0 86 | X = X.loc[:,X.std()!=0] 87 | # standardize the features and turn it into a df by keeping the index and column names 88 | X_std = pd.DataFrame((X-X.mean())/X.std(), index=X.index, columns=X.columns) 89 | # target 90 | y = dat['Ef_per_atom'] 91 | 92 | # space group 93 | sg = dat['space_group_number'] 94 | # lattice 95 | lattice = dat['lattice'] 96 | # chemical SRO 97 | csro = dat[['mean abs SRO'+str(i) for i in range(1,5)]] 98 | 99 | natoms={} 100 | for i in dat['NIONS'].unique(): 101 | natoms[i] = dat[dat['NIONS']==i].index 102 | 103 | smaller = dat[dat['NIONS']<=6].index 104 | small = dat[dat['NIONS']<=8].index 105 | large = dat[dat['NIONS']>8].index 106 | low = dat[dat['nelements']<=2].index 107 | high = dat[dat['nelements']>2].index 108 | 109 | 110 | neles={} 111 | for i in range(2,8): 112 | neles[i] = dat[dat['nelements']==i].index 113 | nsamll = dat[(dat['NIONS']<=8) & (dat['nelements']==i)].shape[0] 114 | nlarge = dat[(dat['NIONS']>8) & (dat['nelements']==i)].shape[0] 115 | n_chemsys = len(dat.loc[neles[i],'chemical_system'].unique()) 116 | print(f'nelements={i}: {n_chemsys} chemical systems, {len(neles[i])} structures, {nsamll} small, {nlarge} large') 117 | 118 | 119 | 120 | 121 | #%% 122 | ''' 123 | distribution of the system size: counts vs. number of atoms. 124 | ''' 125 | 126 | fig, ax = plt.subplots(figsize=figsize) 127 | # get the number of atoms, [<=6, 8, 27,64,125] 128 | counts = [len(small)] + [len(natoms[n]) for n in [27,64,125]] 129 | 130 | # plot bar plot 131 | ax.bar(np.arange(len(counts)),counts) 132 | # # use scientific notation for y 133 | # ax.ticklabel_format(axis='y', style='sci', scilimits=(0,0)) 134 | # use log scale for y 135 | ax.set_yscale('log') 136 | # set ylim 137 | ax.set_ylim(100,1e5) 138 | # set xticks 139 | ax.set_xticks(np.arange(len(counts))) 140 | # set xticklabels 141 | ax.set_xticklabels([r'$\leq$8', '27','64','125']) 142 | # set xlabel 143 | ax.set_xlabel('Number of atoms') 144 | # set ylabel 145 | ax.set_ylabel('Counts') 146 | 147 | ''' 148 | distribution of the system size: counts vs. number of elements. 149 | ''' 150 | fig, ax = plt.subplots(figsize=figsize) 151 | ax.bar(range(2,8),[len(neles[i]) for i in range(2,8)]) 152 | ax.set_xlabel('Number of elements') 153 | ax.set_ylabel('Counts') 154 | ax.set_ylim(0,3.5e4) 155 | 156 | 157 | #%% 158 | ''' 159 | Plot distribution of the system size: counts vs. number of atoms. 
160 | But this time, use a stacked bar plot: 161 | each bar is the count for a specific system size, and the bar is divided into 6 parts 162 | according to nelements (2,3,4,5,6,7) 163 | 164 | ''' 165 | # get the number of atoms, [<=8, 27,64,125], aggregate the counts for # atoms <=8 166 | counts = {r'$\leq$8':[], '27':[], '64':[], '125':[]} 167 | 168 | # loop over the number of elements 169 | for nele in range(2,8): 170 | # loop over the number of atoms 171 | for n in [r'$\leq$8',27,64,125]: 172 | if n == r'$\leq$8': 173 | counts[str(n)].append(y[(dat['NIONS']<=8) & (dat['nelements']==nele)].shape[0]) 174 | else: 175 | # get the counts for a specific number of atoms and number of elements 176 | counts[str(n)].append(y[(dat['NIONS']==n) & (dat['nelements']==nele)].shape[0]) 177 | 178 | fig, ax = plt.subplots(figsize=(4,3.7)) 179 | 180 | 181 | 182 | # plot stacked bar plot: 183 | # the x axis is the number of atoms, namely the keys of the counts dictionary 184 | # the y axis is the stacked counts, consisting of 6 parts 185 | bottom = np.zeros(len(counts.keys())) 186 | for i,nele in enumerate(range(2,8)): 187 | ax.bar(counts.keys(), 188 | [counts[natom][i] for natom in counts.keys()], 189 | label=str(nele), 190 | width=0.55, 191 | bottom=bottom) 192 | bottom += np.array([counts[natom][i] for natom in counts.keys()]) 193 | 194 | # use log scale for y 195 | # ax.set_yscale('log') 196 | # set ylim 197 | ax.set_ylim(0,7e4) 198 | # set legend to 3 columns 199 | ax.legend(loc=(0.5,0.075),ncol=2) 200 | # use scientific notation for y 201 | # ax.ticklabel_format(axis='y', style='sci', scilimits=(0,0)) 202 | ax.set_xlabel('Number of atoms') 203 | ax.set_ylabel('Number of structures') 204 | 205 | # add an inset 206 | axins = ax.inset_axes([0.4, 0.47, 0.475, 0.475]) 207 | bottom = np.zeros(len(counts.keys())) 208 | 209 | for i,nele in enumerate(range(2,8)): 210 | axins.bar(['27','64','125'], 211 | [counts[natom][i] for natom in ['27','64','125']], 212 | label=str(nele), 213 | bottom=bottom[[1,2,3]]) 214 | bottom += np.array([counts[natom][i] for natom in counts.keys()]) 215 | axins.set_ylim(0,1e4) 216 | # set xticks and xticklabels 217 | axins.set_xticks(np.arange(1,4)) 218 | axins.set_xticklabels(['27','64','125']) 219 | axins.ticklabel_format(axis='y', style='sci', scilimits=(3,3)) 220 | 221 | fig.tight_layout() 222 | # fig.savefig('figs/describe_data/counts_vs_natoms.pdf') 223 | 224 | 225 | 226 | #%% 227 | ''' 228 | Create a df where the index is the number of atoms, and the columns are the number of elements 229 | ''' 230 | df_counts = pd.DataFrame(columns=range(2,8), index=sorted(dat['NIONS'].unique())) 231 | 232 | for index in df_counts.index: 233 | for col in df_counts.columns: 234 | df_counts.loc[index,col] = dat[(dat['NIONS']==index) & (dat['nelements']==col)].shape[0] 235 | 236 | 237 | #%% 238 | ''' violin plot for the distribution of y vs. 
NIONS ''' 239 | fig, ax = plt.subplots(figsize=(4,3.5)) 240 | # plot violin plot 241 | violinplot = ax.violinplot([ 242 | y[small], 243 | y[natoms[27]], 244 | y[natoms[64]], 245 | y[natoms[125]], 246 | ], 247 | showextrema=False, 248 | quantiles=[[0.50,0.90] for i in range(4)], 249 | ) 250 | 251 | # change the color of the violin plot 252 | for pc in violinplot['bodies']: 253 | pc.set_alpha(1) 254 | violinplot['cquantiles'].set_color('r') 255 | 256 | # set xticks 257 | ax.set_xticks([1,2,3,4]) 258 | # set xticklabels 259 | ax.set_xticklabels([r'$\leq$8','27','64','125']) 260 | # set ylabel 261 | ax.set_ylabel('Formation energy per atom (eV/atom)') 262 | ax.set_ylim(-0.6,0.5) 263 | # ax.grid(linewidth=0.1) 264 | ax.set_xlabel('Number of atoms') 265 | 266 | # fig.savefig('figs/describe_data/violinplot_Ef_vs_natoms.pdf') 267 | 268 | 269 | 270 | #%% 271 | ''' 272 | 273 | violin plot for the distribution of SRO vs. NIONS 274 | 275 | ''' 276 | fig, axs = plt.subplots(figsize=(4,3.7),ncols=2,sharey=True,gridspec_kw={'wspace':0.0}) 277 | 278 | for i in range(2): 279 | ax = axs[i] 280 | # plot violin plot 281 | sro2plot = f'mean abs SRO{i+1}' 282 | violinplot = ax.violinplot([ 283 | csro.loc[small,sro2plot], 284 | csro.loc[natoms[27],sro2plot], 285 | csro.loc[natoms[64],sro2plot], 286 | csro.loc[natoms[125],sro2plot], 287 | ], 288 | showextrema=False, 289 | quantiles=[[0.50,0.90] for i in range(4)], 290 | ) 291 | 292 | # change the color of the violin plot 293 | for pc in violinplot['bodies']: 294 | pc.set_alpha(1) 295 | violinplot['cquantiles'].set_color('r') 296 | 297 | # set xticks 298 | ax.set_xticks([1,2,3,4]) 299 | # set xticklabels 300 | ax.set_xticklabels([r'$\leq$8','27','64','125']) 301 | 302 | # set ylabel 303 | if i == 0: 304 | ax.set_ylabel('Mean abs SRO') 305 | ax.text(0.3,0.9,'(a) SRO1',transform=ax.transAxes) 306 | else: 307 | ax.text(0.3,0.9,'(b) SRO2',transform=ax.transAxes) 308 | ax.set_ylim(0,1.4) 309 | ax.grid(linewidth=0.1) 310 | # set a common xlabel 311 | fig.text(0.425,0.0,'Number of atoms') 312 | 313 | 314 | # fig.savefig('figs/describe_data/violinplot_SRO_vs_natoms.pdf') 315 | 316 | #%% 317 | ''' 318 | Distribution of SRO1 vs. SRO2 for ordered and disordered structures 319 | ''' 320 | 321 | fig, axs = plt.subplots(figsize=(4.25,2.25),ncols=2, 322 | # set the horizontal space between subplots 323 | gridspec_kw={'wspace':0.275} 324 | ) 325 | 326 | norm = matplotlib.colors.LogNorm(vmin=1,vmax=1000) 327 | 328 | ax = axs[0] 329 | # hexbin plot of SRO1 vs. SRO2 for small systems 330 | # use a cmap that is white for small counts and black for large counts 331 | ax.hexbin(csro.loc[small,'mean abs SRO1'],csro.loc[small,'mean abs SRO2'], 332 | gridsize=(60,60),cmap='Greys',norm = norm) 333 | ax.set_xlabel('Mean abs SRO1') 334 | ax.set_ylabel('Mean abs SRO2') 335 | ax.set_xlim(-0.02,1.6) 336 | ax.set_xticks([0,0.4,0.8,1.2,1.6]) 337 | ax.set_ylim(-0.02,2.5) 338 | ax.text(0.4,0.9,r'$\leq$8 atoms',transform=ax.transAxes, 339 | # white background for the text 340 | bbox=dict(facecolor='white', edgecolor='white', pad=0.0) 341 | ) 342 | ax.text(-0.4,0.965, '(a)' ,transform=ax.transAxes,fontsize=11, 343 | weight='bold' 344 | ) 345 | 346 | ax=axs[1] 347 | # hexbin plot of SRO1 vs. 
SRO2 for small systems 348 | # use a cmap that is white for small counts and black for large counts 349 | ax.hexbin(csro.loc[large,'mean abs SRO1'],csro.loc[large,'mean abs SRO2'], 350 | gridsize=(30,25),cmap='Greys',norm=norm) 351 | ax.set_xlabel('Mean abs SRO1') 352 | # ax.set_ylabel('Mean abs SRO2') 353 | ax.set_xlim(-0.01,0.4) 354 | ax.set_xticks([0,0.1,0.2,0.3,0.4]) 355 | ax.set_ylim(-0.01,0.4) 356 | ax.set_yticks([0,0.1,0.2,0.3,0.4]) 357 | ax.text(0.4,0.9,r'$\geq$27 atoms',transform=ax.transAxes) 358 | 359 | # Add a colorbar shared between the two plots 360 | cbar_ax = fig.add_axes([0.915, 0.125, 0.015, 0.725]) 361 | sm = plt.cm.ScalarMappable(cmap='Greys', norm=norm,) 362 | cbar = fig.colorbar(sm, cax=cbar_ax,orientation='vertical') 363 | cbar.set_label('Counts') 364 | # cbar.ax.xaxis.set_ticks_position('top') # new line: move ticks to top 365 | # cbar.ax.xaxis.set_label_position('top') # new line: move label to top 366 | # fig.savefig('figs/describe_data/hexbin_SRO1_vs_SRO2.png',bbox_inches='tight',dpi=300) 367 | 368 | 369 | 370 | 371 | 372 | 373 | #%% 374 | ''' violin plot for the distribution of SRO vs. number of elements ''' 375 | 376 | def plot_ax(ax,df,ylim=(-0.01,1.8),sro2plot = 'mean abs SRO1'): 377 | 378 | violinplot = ax.violinplot([ 379 | csro.loc[df[df['nelements']==n].index,sro2plot] for n in range(2,8) 380 | ], 381 | showextrema=False, 382 | showmedians=True, 383 | quantiles=[[0.5,0.90] for i in range(6)], 384 | ) 385 | 386 | # change the color of the violin plot 387 | # for pc in violinplot['bodies']: 388 | # pc.set_alpha(0.5) 389 | violinplot['cquantiles'].set_color('r') 390 | # set the linewidth of cquantiles 391 | violinplot['cquantiles'].set_linewidth(2) 392 | 393 | # set xticks 394 | ax.set_xticks([1,2,3,4,5,6]) 395 | # set xticklabels 396 | ax.set_xticklabels(['2','3','4','5','6','7']) 397 | # set xlabel 398 | ax.set_xlabel('Number of elements') 399 | # set ylabel 400 | ax.set_ylabel(sro2plot) 401 | ax.set_ylim(ylim) 402 | # ax.grid(linewidth=0.05) 403 | return ax 404 | 405 | ''' 406 | violin plot for the distribution of SRO1 vs. 
number of elements 407 | The plot contains 3 subplots, for the number of atoms <=8, 27, 64 408 | ''' 409 | fig, axs = plt.subplots(figsize=(4.25,2.25),ncols=2,sharey=True, 410 | # set the space between subplots 411 | gridspec_kw={'wspace':0.0} 412 | ) 413 | 414 | sro2plot = 'SRO1' 415 | 416 | # ax = axs[0] 417 | # plot_ax(ax,dat[dat['NIONS']<=8],ylim=(0,1.4),sro2plot='mean abs '+sro2plot) 418 | # ax.text(0.04,0.92,r'$\leq$8 atoms',transform=ax.transAxes) 419 | 420 | ax = axs[0] 421 | plot_ax(ax,dat[dat['NIONS']==27],ylim=(0,1.4),sro2plot='mean abs '+sro2plot) 422 | # ax.text(0.05,0.92,'(a) 27 atoms',transform=ax.transAxes) 423 | ax.text(-0.35,0.95, '(b)' ,transform=ax.transAxes,fontsize=11, 424 | weight='bold' 425 | ) 426 | 427 | 428 | ax = axs[1] 429 | plot_ax(ax,dat[dat['NIONS']==64],ylim=(0,1.4),sro2plot='mean abs '+sro2plot) 430 | # ax.text(0.05,0.92,'(b) 64 atoms',transform=ax.transAxes) 431 | 432 | # hide the ylabels for the 2nd and 3rd subplots 433 | axs[1].set_ylabel('') 434 | # axs[2].set_ylabel('') 435 | 436 | # set a common xlabel 437 | for ax in axs: 438 | ax.set_xlabel('') 439 | ax.set_ylim(-0.005,0.33) 440 | # add horizontal line at 0 441 | ax.axhline(0, color='k',linestyle="-",linewidth=0.1) 442 | fig.text(0.4,0.0,'Number of elements') 443 | 444 | #%% 445 | def plot_SRO_vs_nelements(df,figname,ylim=(-0.01,1.8)): 446 | fig, axs = plt.subplots(figsize=(figsize[0]*1.2,figsize[1]),ncols=2,sharey=True) 447 | for i in range(2): 448 | ax = axs[i] 449 | plot_ax(ax,df,ylim=ylim,sro2plot=f'mean abs SRO{i+1}') 450 | fig.savefig(figname) 451 | 452 | # plot_SRO_vs_nelements(dat,'figs/describe_data/violinplot_SRO_vs_nelements.pdf') 453 | # plot_SRO_vs_nelements(dat[dat['NIONS']<=8],'figs/describe_data/violinplot_SRO_vs_nelements_small.pdf') 454 | # plot_SRO_vs_nelements(dat[dat['NIONS']==27],'figs/describe_data/violinplot_SRO_vs_nelements_27.pdf') 455 | # plot_SRO_vs_nelements(dat[dat['NIONS']==64],'figs/describe_data/violinplot_SRO_vs_nelements_64.pdf') 456 | 457 | 458 | 459 | 460 | 461 | #%% 462 | ''' 463 | Plot distribution of the formation energy per atom (y): histogram 464 | ''' 465 | 466 | fig, ax = plt.subplots(figsize=(4.,3.7)) 467 | # set bins (min, max, step) 468 | bins=np.arange(-0.7,0.64,0.04) 469 | # plot histogram for y_large and y_small 470 | y_small = dat[dat['NIONS']<=8]['Ef_per_atom'] 471 | # mean absolute deviation 472 | mad = np.mean(np.abs(y_small-np.mean(y_small))) 473 | ax.hist(y_small, bins=bins, histtype='step', 474 | label=r'$\leq8$ atoms' 475 | ) 476 | 477 | 478 | for n in [27,64,125]: 479 | mad = np.mean(np.abs(y[natoms[n]]-np.mean(y[natoms[n]]))) 480 | ax.hist(y[natoms[n]], bins=bins, histtype='step',label=str(n)+' atoms') 481 | 482 | # y_large = dat[dat['NIONS']>8]['Ef_per_atom'] 483 | # mad = np.mean(np.abs(y_large-np.mean(y_large))) 484 | # ax.hist(y_large, bins=bins, histtype='step', linestyle='--', 485 | # label=r'$\geq27$ atoms' +f' (MAD={mad:.3f})' 486 | # ) 487 | 488 | # set ylim 489 | ax.set_ylim(1,2e4) 490 | # set xlim 491 | ax.set_xlim(bins.min(),0.6) 492 | # use log scale for y 493 | ax.set_yscale('log') 494 | # set vertical line at 0 495 | # ax.axvline(0, color='k',linewidth=0.1) 496 | ax.legend( 497 | loc=(0.005,0.7), 498 | # adjust the size of the legend 499 | handlelength=1., 500 | handletextpad=0.5, 501 | 502 | ) 503 | # set xlabel 504 | ax.set_xlabel('Formation energy per atom (eV/atom)') 505 | # set ylabel 506 | ax.set_ylabel('Counts') 507 | 508 | # fig.savefig('figs/describe_data/hist_Ef_per_atom.pdf') 509 | 510 | 511 | 512 | 513 | #%% 514 
| ''' violin plot ''' 515 | fig, ax = plt.subplots(figsize=figsize) 516 | # plot violin plot 517 | ax.violinplot([y_small,y[natoms[27]],y[natoms[64]],y[natoms[125]]]) 518 | # # set ylim 519 | ax.set_ylim(-0.5,0.5) 520 | # set xticks 521 | ax.set_xticks(np.arange(1,5)) 522 | # set xticklabels 523 | ax.set_xticklabels(['$\leq$8','27','64','125']) 524 | # set xlabel 525 | ax.set_xlabel('Number of atoms') 526 | # set ylabel 527 | ax.set_ylabel('Formation energy per atom (eV/atom)') 528 | # set horizontal line at 0 529 | ax.axhline(0, color='k', linestyle='--') 530 | 531 | 532 | #%% 533 | 534 | ''' Visualize the feature space using PCA''' 535 | 536 | features = [i for i in X_std.columns if i in compo_names] 537 | # features = [i for i in X_std.columns if i in struc_names] 538 | X_pca_in = X_std[features] 539 | 540 | 541 | # PCA 542 | pca = PCA() 543 | X_pca = pca.fit_transform(X_pca_in) 544 | X_pca = pd.DataFrame(X_pca,index=X.index, 545 | columns = [*range(X_pca.shape[1])] 546 | ) 547 | 548 | #%% 549 | cluster0 = dat[(X_pca[0]>=-0.05)].index # Also related to composition 550 | c = X_pca[0]<0 551 | fig, ax = plt.subplots(figsize=figsize) 552 | ax.scatter(X_pca[0],X_pca[1],c=c,alpha=0.5,s=5) 553 | 554 | ''' 555 | There are two clusters in the PCA plot. 556 | The following code is used to identify why there are two clusters, 557 | but I could not find a clear explanation and just dropped this part. 558 | Would appreciate if someone could provide some insights. 559 | 560 | ''' 561 | 562 | 563 | 564 | #%% 565 | cluster1 = dat[(X_pca[0]<-0.05)].index # Also related to composition 566 | c = X_pca[0]<0 567 | fig, ax = plt.subplots(figsize=figsize) 568 | ax.scatter(X_pca[0],X_pca[1],c=c,alpha=0.5,s=5) 569 | 570 | #%% 571 | cluster2 = dat[(X_pca[1]>21)].index # Al-Si 572 | c = X_pca[1]>21 573 | fig, ax = plt.subplots(figsize=figsize) 574 | ax.scatter(X_pca[0],X_pca[1],c=c,alpha=0.5,s=5) 575 | 576 | #%% 577 | chemical_system1 = dat.loc[cluster1, 'chemical_system'].unique() 578 | chemical_system_non1 = dat.loc[cluster0, 'chemical_system'].unique() 579 | 580 | 581 | #%% 582 | # check if chemical_system1 is a subset of chemical_system_non1 583 | chemical_system1_set = set(chemical_system1) 584 | chemical_system_non1_set = set(chemical_system_non1) 585 | chemical_system1_set.issubset(chemical_system_non1_set) 586 | 587 | #%% 588 | 589 | # get the coefficients for the first principal component 590 | coeff = pd.DataFrame(pca.components_[0],index=X_pca_in.columns,columns=['coeff']) 591 | # sort the coefficients 592 | coeff = coeff.sort_values(by='coeff',ascending=False) 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | #%% 602 | # plot 603 | fig, ax = plt.subplots(figsize=figsize) 604 | ax.scatter(X_pca[0],X_pca[1],c=y,cmap='rainbow',alpha=0.5,s=5) 605 | ax.set_xlabel('PC1') 606 | ax.set_ylabel('PC2') 607 | # set colorbar 608 | sm = plt.cm.ScalarMappable(cmap='rainbow', 609 | # norm=plt.Normalize(vmin=y.min(), vmax=y.max()) 610 | ) 611 | # add the colorbar to the figure 612 | cbar = fig.colorbar(sm) 613 | cbar.set_label('Formation energy per atom (eV/atom)') 614 | # set title 615 | ax.set_title('PCA of the feature space') 616 | 617 | # same plot but color by the number of atoms 618 | fig, ax = plt.subplots(figsize=figsize) 619 | ax.scatter(X_pca[0],X_pca[1],c=dat['NIONS'],cmap='rainbow',alpha=0.5,s=5) 620 | ax.set_xlabel('PC1') 621 | ax.set_ylabel('PC2') 622 | # set colorbar 623 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=dat['NIONS'].min(), vmax=dat['NIONS'].max())) 624 | # add the 
colorbar to the figure 625 | cbar = fig.colorbar(sm) 626 | cbar.set_label('Number of atoms') 627 | # set title 628 | ax.set_title('PCA of the feature space') 629 | 630 | # same plot but color by the number of elements 631 | fig, ax = plt.subplots(figsize=figsize) 632 | ax.scatter(X_pca[0],X_pca[1],c=dat['nelements'],cmap='rainbow',alpha=0.5,s=5) 633 | ax.set_xlabel('PC1') 634 | ax.set_ylabel('PC2') 635 | # set colorbar 636 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=dat['nelements'].min(), vmax=dat['nelements'].max())) 637 | # add the colorbar to the figure 638 | cbar = fig.colorbar(sm) 639 | cbar.set_label('Number of elements') 640 | # set title 641 | ax.set_title('PCA of the feature space') 642 | 643 | # same plot but color by the space group number 644 | fig, ax = plt.subplots(figsize=figsize) 645 | ax.scatter(X_pca.loc[:,0],X_pca.loc[:,1],c=sg.loc[:],cmap='rainbow',alpha=0.5,s=5) 646 | ax.set_xlabel('PC1') 647 | ax.set_ylabel('PC2') 648 | # set colorbar 649 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=sg.min(), vmax=sg.max())) 650 | # add the colorbar to the figure 651 | cbar = fig.colorbar(sm) 652 | cbar.set_label('Space group number') 653 | # set title 654 | ax.set_title('PCA of the feature space') 655 | 656 | 657 | 658 | 659 | 660 | 661 | #%% 662 | # same plot but colored by mean abs SRO1 663 | fig, ax = plt.subplots(figsize=figsize) 664 | ax.scatter(X_pca.loc[:,0],X_pca.loc[:,1],c=dat['mean abs SRO1'],cmap='rainbow',alpha=0.5,s=5) 665 | ax.set_xlabel('PC1') 666 | ax.set_ylabel('PC2') 667 | # set colorbar 668 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=dat['mean abs SRO1'].min(), vmax=dat['mean abs SRO1'].max())) 669 | # add the colorbar to the figure 670 | cbar = fig.colorbar(sm) 671 | cbar.set_label('Mean absolute SRO1') 672 | # set title 673 | ax.set_title('PCA of the feature space') 674 | 675 | #%% -------------------------------------------------------------------------------- /codes/func_tmp.py: -------------------------------------------------------------------------------- 1 | #%% 2 | import pandas as pd 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | import os 6 | from sklearn.model_selection import train_test_split 7 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score 8 | import time 9 | 10 | random_state = 1 11 | 12 | #%% Define functions 13 | 14 | 15 | def train_predict(model, X_train, y_train, X_test, y_test,print_metrics=True): 16 | # record the time 17 | time_init = time.time() 18 | 19 | # fit the model on the training set 20 | model.fit(X_train, y_train) 21 | # predict the test set 22 | y_pred = pd.DataFrame(model.predict(X_test), index=y_test.index) 23 | 24 | # record the time 25 | time_elapsed = time.time() - time_init 26 | 27 | # Metrics 28 | rmse = mean_squared_error(y_test, y_pred, squared=False) 29 | mae = mean_absolute_error(y_test, y_pred) 30 | r2 = r2_score(y_test, y_pred) 31 | if print_metrics == True: 32 | print(f'rmse={rmse:.3f}, mae={mae:.3f}, r2={r2:.3f}, time={time_elapsed:.1f} s') 33 | metrics = {} 34 | metrics['rmse'] = rmse 35 | metrics['mae'] = mae 36 | metrics['r2'] = r2 37 | 38 | return model, y_pred, metrics 39 | 40 | def parity_plot(y_test, y_pred, title=None, ax=None, metrics=None): 41 | # Figure 42 | fig, ax = plt.subplots(figsize=(4,4)) 43 | ax.plot(y_test, y_pred,'r.') 44 | ax.plot(np.linspace(-10,10),np.linspace(-10,10),'k') 45 | ax.set_xlim([np.min(y_test)-0.1,np.max(y_test)+0.1]) 46 | 
ax.set_ylim([np.min(y_test)-0.1,np.max(y_test)+0.1]) 47 | ax.set_xlabel('DFT (eV/atom)') 48 | ax.set_ylabel('ML (eV/atom)') 49 | 50 | # add metrics 51 | if metrics is not None: 52 | rmse = metrics['rmse'] 53 | mae = metrics['mae'] 54 | r2 = metrics['r2'] 55 | text = f'rmse={rmse:.3f}, mae={mae:.3f}, r2={r2:.3f}' 56 | ax.text(np.min(y_test), np.max(y_test), text, ha="left", va="top", color="b") 57 | plt.tight_layout() 58 | if title is not None: 59 | plt.title(title) 60 | return ax 61 | 62 | 63 | 64 | 65 | def perf_vs_size(model, X_pool, y_pool, X_test, y_test, csv_out, 66 | overwrite=False,frac_list=None,n_frac = 15, n_run_factor=1): 67 | ''' 68 | model: a sklearn model 69 | X_pool, y_pool: the training pool 70 | X_test, y_test: the test set 71 | frac_list: the list of training set size as a fraction of the total training set 72 | overwrite: if True, overwrite the csv file 73 | n_frac: the number of training set size to consider 74 | ''' 75 | 76 | # if csv_out exists, read it 77 | if os.path.exists(csv_out) and not overwrite: 78 | df = pd.read_csv(csv_out,index_col=0) 79 | # if csv_out does not exist, create it 80 | else: 81 | df = pd.DataFrame(columns=['rmse','mae','r2','rmse_std','mae_std','r2_std']) 82 | 83 | if frac_list is None: 84 | # the list of training set size as a fraction of the total training set 85 | # set frac_list to be a list of fractions, equally spaced in log space, from 0.005 to 1 86 | frac_min = np.log10(100/X_pool.shape[0]) 87 | frac_list = np.logspace(frac_min,0,n_frac) 88 | 89 | 90 | for frac in frac_list: 91 | skip = False 92 | # skip if frac is close to an existing frac 93 | for frac_ in df.index: 94 | if abs(frac - frac_)/frac_ < 0.25: 95 | skip = True 96 | if skip: 97 | continue 98 | 99 | if frac * X_pool.shape[0] < 80: 100 | continue 101 | 102 | # determine the number of runs based on frac 103 | if frac < 0.01: 104 | n_run = 20 105 | elif frac < 0.05: 106 | n_run = 10 # 20 107 | elif frac >= 0.05 and frac < 0.5: 108 | n_run = 6 #10 109 | elif frac >= 0.5 and frac < 1: 110 | n_run = 4 #5 111 | else: 112 | n_run = 1 113 | 114 | n_run = max(1, int(n_run * n_run_factor)) 115 | 116 | print(f'frac={frac:.3f}, n_run={n_run}') 117 | 118 | metrics_ = {} 119 | for random_state_ in range(n_run): 120 | if frac == 1: 121 | X_train, y_train = X_pool, y_pool 122 | else: 123 | X_train, _, y_train, _ = train_test_split(X_pool, y_pool, train_size=frac, 124 | random_state=random_state_ * (random_state + 5) ) 125 | # train and predict 126 | _, _, metrics_[random_state_] = train_predict(model, X_train, y_train, X_test, y_test) 127 | 128 | metrics_ = pd.DataFrame(metrics_).transpose()[['rmse','mae','r2']] 129 | means = metrics_.mean(axis=0) 130 | std = metrics_.std(axis=0) 131 | std.index = [f'{col}_std' for col in std.index] 132 | 133 | # add metrics_.mean(axis=1) and metrics_.std(axis=1) to metrics[model_name] 134 | df.loc[frac] = pd.concat([means,std]) 135 | print(df.loc[frac]) 136 | # save the metrics 137 | df.sort_index().to_csv(csv_out, index_label='frac') 138 | 139 | return df 140 | 141 | 142 | 143 | 144 | 145 | def plot_metrics_vs_size(metrics, metrics_name,id_train, 146 | xlims=None, ylims=None, 147 | figsize=(4,4), 148 | ax_in=None, 149 | fig_ax = None, 150 | markers=None 151 | ): 152 | if fig_ax is not None: 153 | fig, ax = fig_ax 154 | elif ax_in is None: 155 | fig, ax = plt.subplots(figsize=figsize) 156 | else: 157 | fig = ax_in.get_figure() 158 | # get second y axis 159 | ax = ax_in.twinx() 160 | 161 | 162 | if markers is None: 163 | markers = {'RF':'o', 164 | 
'XGB':'s', 165 | 'alignn50':'^',} 166 | 167 | 168 | for model_name in metrics.keys(): 169 | ax.errorbar(metrics[model_name].index, metrics[model_name][metrics_name], 170 | yerr=metrics[model_name][f'{metrics_name}_std'], 171 | fmt=f'-{markers[model_name]}',markersize=5,label=model_name, capsize=3) 172 | 173 | if xlims is None: 174 | xlims = [100/len(id_train)*0.9, 1] 175 | ax.set_xlim(xlims) 176 | 177 | if ylims is None: 178 | ylims = [0.4, 1] 179 | ax.set_ylim(ylims) 180 | 181 | 182 | 183 | ax.set_xscale('log') 184 | ax.set_xlabel('Fraction of the full training set') 185 | 186 | # add the upper x axis for the number of training data 187 | ax2 = ax.twiny() 188 | xlims = ax.get_xlim() 189 | ax2.set_xlim([xlims[0]*len(id_train), xlims[1]*len(id_train)]) 190 | ax2.set_xscale('log') 191 | ax2.set_xlabel('Training set size') 192 | ax2.tick_params(axis='x', which='major', pad=0) 193 | 194 | if metrics_name == 'r2': 195 | ax.set_ylabel('$R^2$') 196 | else: 197 | ax.set_ylabel(f'{metrics_name.upper()} (eV/atom)') 198 | 199 | if ax_in is None: 200 | ax.legend(loc='upper center') 201 | ax.grid(linewidth=0.1) 202 | return fig, ax 203 | 204 | 205 | 206 | def eval_ood(model, X_train, y_train, X_test, y_test,title=None, id_small=None, id_large=None): 207 | model, y_pred, metrics = train_predict(model, X_train, y_train, X_test, y_test) 208 | # parity_plot(y_test, y_pred, title=None, metrics=metrics) 209 | # Figure 210 | fig, ax = plt.subplots(figsize=(4,4)) 211 | 212 | index_small = list( set(y_test.index) & set(id_small) ) 213 | index_large = list( set(y_test.index) & set(id_large) ) 214 | if len(index_small) > 0: 215 | ax.plot(y_test.loc[index_small], y_pred.loc[index_small],'.', label='small') 216 | if len(index_large) > 0: 217 | ax.plot(y_test.loc[index_large], y_pred.loc[index_large],'.', label='large') 218 | ax.plot(np.linspace(-10,10),np.linspace(-10,10),'k') 219 | 220 | # xlim = [np.min(y_test)-0.1,np.max(y_test)+0.1] 221 | xlim = [-0.75,0.75] 222 | ylim = xlim 223 | 224 | ax.set_xlim(xlim) 225 | ax.set_ylim(ylim) 226 | ax.set_xlabel('DFT (eV/atom)') 227 | ax.set_ylabel('ML prediction (eV/atom)') 228 | ax.legend() 229 | rmse, mae, r2 = metrics['rmse'], metrics['mae'], metrics['r2'] 230 | text = f'RMSE={rmse:.3f}, MAE={mae:.3f}, $R^2$={r2:.3f}' 231 | ax.text(min(xlim)+0.1,min(xlim), text, ha="left", va="bottom", color="k") 232 | if title is not None: 233 | plt.title(title) 234 | plt.tight_layout() 235 | 236 | 237 | def get_mad_std(s): 238 | # calculate the mean absolute deviation and STD of the series s 239 | # mean absolute deviation 240 | mad = (s - s.mean()).abs().mean() 241 | print(f'mean absolute deviation: {mad:.4f}') 242 | # standard deviation 243 | print(f'standard deviation: {s.std():.4f}') 244 | 245 | 246 | 247 | # %% 248 | -------------------------------------------------------------------------------- /codes/paper_SI_hypertune.py: -------------------------------------------------------------------------------- 1 | #%% 2 | import os 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.pipeline import Pipeline 6 | from sklearn.impute import SimpleImputer 7 | from sklearn.ensemble import RandomForestRegressor 8 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score 9 | from sklearn.linear_model import LinearRegression,Lasso,Ridge 10 | from sklearn.model_selection import GridSearchCV, cross_val_predict 11 | from matplotlib import pyplot as plt 12 | from sklearn.preprocessing import StandardScaler 13 | import xgboost as xgb 14 | from 
sklearn.model_selection import train_test_split 15 | import random 16 | from func_tmp import train_predict, parity_plot, plot_metrics_vs_size, perf_vs_size,eval_ood,get_mad_std 17 | from distill import return_model 18 | 19 | random_state = 1 20 | overwrite = True # whether to overwrite the existing results 21 | 22 | #%% Import data 23 | struct='structure' 24 | # dat_type = 'graphs' 25 | dat_type = 'featurized' 26 | df = pd.read_pickle(f'data/{struct}_{dat_type}.dat_reduced.pkl') 27 | df = df.dropna() 28 | 29 | csv_dir = f'csv/{struct}' 30 | if not os.path.exists(csv_dir): 31 | os.makedirs(csv_dir) 32 | 33 | #%% Define X and y 34 | if dat_type == 'featurized': 35 | nfeatures = 273 36 | cols_feat = df.columns[-nfeatures:] 37 | # col2keep, col2drop = get_col2drop(df[cols_feat], cutoff=0.75,method='spearman') 38 | # cols_feat = col2keep 39 | 40 | 41 | X_all = df[cols_feat] 42 | # drop features whose variance is zero 43 | X_all = X_all.loc[:,X_all.var()!=0] 44 | # # standardize the features and turn it into a df by keeping the index and column names 45 | # X_std = pd.DataFrame((X_all-X_all.mean())/X_all.std(), index=X_all.index, columns=X_all.columns) 46 | else: 47 | X_all = df[f'graphs_{struct}'] 48 | 49 | y_all = df['Ef_per_atom'] 50 | 51 | 52 | #%% Get training test sets 53 | X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=random_state) 54 | 55 | pipe={} 56 | 57 | #%% Hyperparameter tuning 58 | 59 | csv_out = csv_dir + '/hypersearch.cv_results_RF.csv' 60 | if not os.path.exists(csv_out) or overwrite: 61 | 62 | pipe['RF'] = Pipeline([ 63 | ('scaler', StandardScaler()), 64 | ('model', RandomForestRegressor(n_jobs=-1, random_state=1)) 65 | ]) 66 | 67 | # Make a function that does the hyperparameter tuning for the model pipe['RF'] 68 | # and returns the best hyperparameters 69 | hyperparams = {'model__bootstrap': [True, False], 70 | 'model__n_estimators': [50, 100, 150, 200], 71 | 'model__max_features': [0.45, 0.3, 0.2, 0.1], 72 | 'model__max_depth': [5, 10, 15, 20, None], 73 | } 74 | 75 | # Use GridSearchCV to find the best hyperparameters based on MAE and the corresponding scores 76 | hypersearch = GridSearchCV(pipe['RF'], 77 | hyperparams, 78 | cv=5, 79 | scoring='neg_mean_absolute_error', 80 | verbose=3).fit(X_all, y_all) 81 | best_params, best_scores = hypersearch.best_params_, hypersearch.best_score_ 82 | 83 | # Save all the tested hyperparameters, scores, and the associated time to a csv file 84 | results = pd.DataFrame(hypersearch.cv_results_) 85 | results.to_csv(csv_out) 86 | 87 | # Save the best hyperparameters and the corresponding score to a csv file 88 | pd.DataFrame({'best_params': [best_params], 'best_scores': [best_scores]}).to_csv('csv/best_params_RF.csv') 89 | 90 | #%% 91 | csv_out = csv_dir + '/hypersearch.cv_results_XGB.csv' 92 | if not os.path.exists(csv_out) or overwrite: 93 | pipe['XGB'] = Pipeline([ 94 | ('scaler', StandardScaler()), 95 | ('model', xgb.XGBRegressor( 96 | n_estimators=2000, 97 | learning_rate=0.1, 98 | reg_lambda=0, # L2 regularization 99 | reg_alpha=0.1,# L1 regularization 100 | num_parallel_tree=1, # set >1 for boosted random forest 101 | tree_method='gpu_hist', gpu_id=0)) 102 | ]) 103 | 104 | # Make a function that does the hyperparameter tuning for the model pipe['XGB'] 105 | # and returns the best hyperparameters 106 | hyperparams = {'model__n_estimators': [500, 1000, 2000, 3000], 107 | 'model__learning_rate': [0.1, 0.2, 0.3, 0.4], 108 | 'model__colsample_bytree': [0.3, 0.5, 0.7, 0.9], 109 | 
'model__colsample_bylevel': [0.3, 0.5, 0.7, 0.9], 110 | 'model__num_parallel_tree': [4, 6, 8, 10], 111 | } 112 | # Use GridSearchCV to find the best hyperparameters based on MAE and the corresponding scores 113 | hypersearch = GridSearchCV(pipe['XGB'], hyperparams, cv=5, scoring='neg_mean_absolute_error',verbose=3).fit(X_all, y_all) 114 | best_params, best_scores = hypersearch.best_params_, hypersearch.best_score_ 115 | 116 | # Save all the tested hyperparameters, scores, and the associated time to a csv file 117 | results = pd.DataFrame(hypersearch.cv_results_) 118 | results.to_csv('csv/hypersearch.cv_results_XGB.csv') 119 | 120 | # Save the best hyperparameters and the corresponding score to a csv file 121 | pd.DataFrame({'best_params': [best_params], 'best_scores': [best_scores]}).to_csv('csv/best_params_XGB.csv') 122 | 123 | 124 | -------------------------------------------------------------------------------- /codes/paper_train_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | 5 | 6 | 7 | """ 8 | 9 | #%% 10 | import os 11 | import numpy as np 12 | import pandas as pd 13 | from sklearn.pipeline import Pipeline 14 | from sklearn.ensemble import RandomForestRegressor 15 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score 16 | from matplotlib import pyplot as plt 17 | from sklearn.preprocessing import StandardScaler 18 | import xgboost as xgb 19 | from sklearn.model_selection import train_test_split 20 | from func_tmp import plot_metrics_vs_size, perf_vs_size,eval_ood,get_mad_std 21 | 22 | 23 | random_state = 1 24 | overwrite = False # whether to overwrite the existing results 25 | 26 | 27 | #Figure setting 28 | figsize = (3.6,3.6) 29 | 30 | #%% Define the models 31 | 32 | pipe={} 33 | 34 | # more expensive models 35 | pipe['RF'] = Pipeline([ 36 | ('scaler', StandardScaler()), 37 | ('model', RandomForestRegressor(n_estimators=100, 38 | bootstrap=False, 39 | max_features = 1/3, 40 | n_jobs=-1, random_state=random_state)) 41 | ]) 42 | 43 | pipe['XGB'] = Pipeline([ 44 | ('scaler', StandardScaler()), 45 | ('model', xgb.XGBRegressor( 46 | n_estimators=500, 47 | learning_rate=0.4, 48 | reg_lambda=0.01,reg_alpha=0.1, 49 | colsample_bytree=0.5,colsample_bylevel=0.7, 50 | num_parallel_tree=6, 51 | tree_method='gpu_hist', gpu_id=0) 52 | ) 53 | ]) 54 | 55 | epochs=50 56 | modelname = f'alignn{epochs}' 57 | pipe[modelname] = None #return_model(modelname,random_state,alignn_epoch=epochs) 58 | 59 | 60 | #%% Import data 61 | for struct in ['structure_ini']: 62 | # dat_type = 'graphs' 63 | dat_type = 'featurized' 64 | df = pd.read_pickle(f'data/{struct}_{dat_type}.dat_reduced.pkl') 65 | df = df.dropna() 66 | 67 | csv_dir = f'csv/{struct}' 68 | if not os.path.exists(csv_dir): 69 | os.makedirs(csv_dir) 70 | 71 | 72 | 73 | #%% Define X and y 74 | 75 | if dat_type == 'featurized': 76 | # drop features with high correlation 77 | # from myfunc import get_col2drop 78 | nfeatures = 273 79 | cols_feat = df.columns[-nfeatures:] 80 | # col2keep, col2drop = get_col2drop(df[cols_feat], cutoff=0.75,method='spearman') 81 | # cols_feat = col2keep 82 | 83 | 84 | X_all = df[cols_feat] 85 | # drop features whose variance is zero 86 | X_all = X_all.loc[:,X_all.var()!=0] 87 | # # standardize the features and turn it into a df by keeping the index and column names 88 | # X_std = pd.DataFrame((X_all-X_all.mean())/X_all.std(), index=X_all.index, columns=X_all.columns) 89 | else: 90 | X_all = 
df[f'graphs_{struct}'] 91 | 92 | y_all = df['Ef_per_atom'] 93 | 94 | get_mad_std(y_all) 95 | 96 | 97 | #%% define index 98 | 99 | # according to lattice 100 | id_bcc = df[df['lattice']=='bcc'].index.tolist() 101 | id_fcc = df[df['lattice']=='fcc'].index.tolist() 102 | 103 | print('bcc:',len(id_bcc)) 104 | get_mad_std(y_all[id_bcc]) 105 | print('fcc:',len(id_fcc)) 106 | get_mad_std(y_all[id_fcc]) 107 | 108 | # according to nelements 109 | id_nele = {} 110 | for nele in [2,3,4,5,6,7]: 111 | id_nele[nele] = df[df['nelements']==nele].index.tolist() 112 | 113 | id_loworder = df[df['nelements']<=3].index.tolist() 114 | id_highorder = df[df['nelements']>3].index.tolist() 115 | 116 | 117 | # according to NIONS 118 | id_all = df.index.tolist() 119 | id_small = df[df['NIONS']<=8].index.tolist() 120 | id_large = df[df['NIONS']>8].index.tolist() 121 | print('small:',len(id_small)) 122 | get_mad_std(y_all[id_small]) 123 | print('large:',len(id_large)) 124 | get_mad_std(y_all[id_large]) 125 | 126 | #%% 127 | # calculate the composition based on reduced_formula 128 | from pymatgen.core.composition import Composition 129 | composition = df['reduced_formula'].apply(lambda x: Composition(x)) 130 | 131 | def get_el_frac(x): 132 | el_frac_list = list(x.get_el_amt_dict().values()) 133 | tot = sum(el_frac_list) 134 | el_frac = np.array([i/tot for i in el_frac_list]) 135 | return el_frac 136 | 137 | el_frac = composition.apply(get_el_frac) 138 | 139 | # For each composition, calculate the max fractional concentration and min fractional concentration 140 | df['max_c'] = el_frac.apply(max) 141 | df['min_c'] = el_frac.apply(min) 142 | df['diff_c'] = df['max_c'] - df['min_c'] 143 | df['std_c'] = el_frac.apply(np.std) 144 | 145 | 146 | #%% 147 | 148 | ''' 149 | First, evaluate the interpolation performance 150 | ''' 151 | 152 | frac_list_alignn = [1,0.25,0.1,0.05,0.01] 153 | 154 | model_list = [f'alignn{epochs}','XGB','RF'] 155 | # csv_dir_ = f'csv/{struct}' 156 | csv_dir_ = f'csv/structure_ini' 157 | 158 | 159 | for scope in ['all']: # ,'large','small' 160 | if scope == 'all': 161 | index = id_all 162 | elif scope == 'small': 163 | index = id_small 164 | elif scope == 'large': 165 | index = id_large 166 | 167 | test_size = 0.2 168 | 169 | X_pool, X_test, y_pool, y_test = train_test_split( 170 | X_all.loc[index], y_all.loc[index], 171 | test_size=test_size, 172 | random_state=random_state, 173 | ) 174 | 175 | # get the performance vs. 
training set size 176 | metrics = {} 177 | for model_name in pipe.keys(): 178 | csv_out = f'{csv_dir_}/size_effect_rand_split_{scope}_{model_name}.csv' 179 | # skip if the csv file already exists 180 | if os.path.exists(csv_out) and not overwrite: 181 | # read the csv file 182 | metrics[model_name] = pd.read_csv(csv_out, index_col=0) 183 | # MAD of the test set 184 | mad = y_test.mad() 185 | metrics[model_name]['mae/mad'] = metrics[model_name]['mae']/mad 186 | metrics[model_name]['mae/mad_std'] = metrics[model_name]['mae_std']/mad 187 | continue 188 | 189 | if 'alignn' in model_name: 190 | frac_list = frac_list_alignn 191 | n_run_factor = 0.5 192 | else: 193 | frac_list = None 194 | n_run_factor = 1 195 | 196 | metrics[model_name] = perf_vs_size( 197 | pipe[model_name], 198 | X_pool, y_pool, 199 | X_test, y_test, 200 | csv_out, 201 | overwrite=overwrite, 202 | frac_list=frac_list, 203 | n_run_factor=n_run_factor, 204 | ) 205 | 206 | # print the performance of full model 207 | mae = metrics[model_name].iloc[-1]['mae'] 208 | r2 = metrics[model_name].iloc[-1]['r2'] 209 | print(f'{scope} {model_name} {mae} {r2} ') 210 | 211 | 212 | fig, axs = plt.subplots(figsize=(3.25*2,2.5), ncols=2, 213 | gridspec_kw={'wspace':0.325} 214 | ) 215 | ax = axs[0] 216 | fig, ax = plot_metrics_vs_size(metrics, 'mae/mad', X_pool.index, 217 | figsize=(3.5,3), 218 | ylims=[0.05,0.45], 219 | xlims=[5e-3,1], 220 | fig_ax=(fig,ax), 221 | ) 222 | ax.set_xlabel('Fraction of training pool') 223 | ax.set_ylabel(f'MAE/MAD') 224 | ax.legend( 225 | # set legend label 226 | ['RF','XGB','ALIGNN'] 227 | ) 228 | ax.text(-0.15,1.05,'(a)',transform=ax.transAxes,fontsize=12) 229 | # add grid 230 | # ax.grid(which='both', axis='both') 231 | 232 | 233 | 234 | # ax.set_yticks(np.arange(0.05,0.5,0.05)) 235 | 236 | ax = axs[1] 237 | fig, ax = plot_metrics_vs_size(metrics, 'r2', X_pool.index, 238 | ylims=[0.75,1], 239 | xlims=[5e-3,1], 240 | # ax_in=ax, 241 | fig_ax=(fig,ax), 242 | ) 243 | ax.legend( 244 | # set legend label 245 | ['RF','XGB','ALIGNN'] 246 | ) 247 | 248 | ax.set_xlabel('Fraction of training pool') 249 | ax.set_ylabel(r'$R^2$') 250 | ax.text(-0.15,1.05,'(b)',transform=ax.transAxes,fontsize=12) 251 | 252 | fig.savefig(f'figs/{struct}_size_effect_rand_split_{scope}.pdf',bbox_inches='tight') 253 | 254 | 255 | #%% 256 | 257 | 258 | frac_list_alignn = [1,0.25,0.1,0.05,0.01] 259 | 260 | fig, axs = plt.subplots(figsize=(2.75*2,2.5), ncols=2, 261 | gridspec_kw={'wspace':0.035} 262 | ) 263 | ax = axs[0] 264 | 265 | # csv_dir_ = f'csv/{struct}' 266 | scope = 'all' 267 | for i, csv_dir_ in enumerate(['csv/structure','csv/structure_ini']): 268 | metrics = {} 269 | for model_name in pipe.keys(): 270 | if scope == 'all': 271 | csv_out = f'{csv_dir_}/size_effect_rand_split_{scope}_{model_name}.csv' 272 | else: 273 | csv_out = f'{csv_dir_}/size_effect_{scope}_{model_name}.csv' 274 | # skip if the csv file already exists 275 | if os.path.exists(csv_out) and not overwrite: 276 | # read the csv file 277 | metrics[model_name] = pd.read_csv(csv_out, index_col=0) 278 | # MAD of the test set 279 | mad = y_test.mad() 280 | metrics[model_name]['mae/mad'] = metrics[model_name]['mae']/mad 281 | metrics[model_name]['mae/mad_std'] = metrics[model_name]['mae_std']/mad 282 | 283 | ax = axs[i] 284 | fig, ax = plot_metrics_vs_size(metrics, 'mae/mad', X_pool.index, 285 | figsize=(3.5,3), 286 | ylims=[0.05,0.425], 287 | xlims=[5e-3,1], 288 | fig_ax=(fig,ax), 289 | ) 290 | ax.set_xlabel('Fraction of training pool') 291 | ax.legend( 292 | # set legend label 
293 | ['RF','XGB','ALIGNN'],loc='upper right' 294 | ) 295 | # ax.text(-0.15,1.05,f'({chr(ord("a")+i)})',transform=ax.transAxes,fontsize=12) 296 | if i == 0: 297 | label = 'Relaxed' 298 | ax.set_ylabel(f'MAE/MAD') 299 | 300 | else: 301 | label = 'Unrelaxed' 302 | # disable y label and y ticklabels 303 | ax.set_ylabel('') 304 | ax.set_yticklabels([]) 305 | 306 | ax.text(0.05,0.05,f'({chr(ord("a")+i)}) {label}',transform=ax.transAxes,fontsize=12) 307 | 308 | ax.grid() 309 | fig.savefig('figs/size_effect_rand_split_all.pdf',bbox_inches='tight') 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | #%% 318 | ''' 319 | Next, evaluate the extrapolation performance 320 | 321 | ''' 322 | for scope in [ 323 | 'small2large', 324 | 'low2high', 325 | 'large2small', 326 | 'high2low', # XGB is a bit strange 327 | 328 | # 'bcc2fcc', 329 | # 'fcc2bcc', # This is strange; need to figure out later 330 | ]: 331 | if scope == 'small2large': 332 | id_train = id_small 333 | id_test = id_large 334 | elif scope == 'large2small': 335 | id_train = id_large 336 | id_test = id_small 337 | elif scope == 'low2high': 338 | id_train = id_loworder 339 | id_test = id_highorder 340 | elif scope == 'high2low': 341 | id_train = id_highorder 342 | id_test = id_loworder 343 | elif scope == 'bcc2fcc': 344 | id_train = id_bcc 345 | id_test = id_fcc 346 | elif scope == 'fcc2bcc': 347 | id_train = id_fcc 348 | id_test = id_bcc 349 | 350 | 351 | metrics = {} 352 | 353 | # define the training and test sets 354 | X_pool, y_pool = X_all.loc[id_train], y_all.loc[id_train] 355 | X_test, y_test = X_all.loc[id_test], y_all.loc[id_test] 356 | 357 | mad = y_test.mad() 358 | print(f'{scope} MAD of the test set: {mad}') 359 | 360 | # get the performance vs. training set size 361 | for model_name in model_list: 362 | csv_out = f'{csv_dir}/size_effect_{scope}_{model_name}.csv' 363 | # skip if the csv file already exists 364 | # if os.path.exists(csv_out) and not overwrite: 365 | # # read the csv file 366 | # metrics[model_name] = pd.read_csv(csv_out, index_col=0) 367 | # # MAD of the test set 368 | # mad = y_test.mad() 369 | # metrics[model_name]['mae/mad'] = metrics[model_name]['mae']/mad 370 | # metrics[model_name]['mae/mad_std'] = metrics[model_name]['mae_std']/mad 371 | # continue 372 | 373 | if 'alignn' in model_name: 374 | frac_list = frac_list_alignn 375 | n_run_factor = 0.5 376 | else: 377 | frac_list = None 378 | n_run_factor = 1 379 | 380 | metrics[model_name] = perf_vs_size( 381 | pipe[model_name], 382 | X_pool, y_pool, 383 | X_test, y_test, 384 | csv_out, 385 | overwrite=overwrite, 386 | frac_list=frac_list, 387 | # frac_list = [0.5], 388 | n_run_factor=n_run_factor, 389 | ) 390 | # print the performance of full model 391 | mae = metrics[model_name].iloc[-1]['mae'] 392 | r2 = metrics[model_name].iloc[-1]['r2'] 393 | print(f'{scope} {model_name} {mae} {r2} ') 394 | 395 | # plot performance vs. 
training set size 396 | # fig, ax = plot_metrics_vs_size(metrics, 'mae', X_pool.index, 397 | # ylims=[0.0,0.07], 398 | # ) 399 | 400 | # fig, ax = plot_metrics_vs_size(metrics, 'r2', X_pool.index, 401 | # ylims=[0.5,1], 402 | # ) 403 | # # add legend title 404 | # ax.legend(title=scope, loc='lower right') 405 | # ax.grid(which='both', axis='both', ls=':') 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | #%% 417 | 418 | # show the accumulated count based on diff_c 419 | # df['diff_c'].hist(cumulative=True, density=1, bins=1000) 420 | 421 | for model_name in pipe.keys(): 422 | model = pipe[model_name] 423 | 424 | for max_diff_c in [0.]:# ,0.15,0.2,0.3,0.4,0.5, 0.6 425 | id_train = df[df['diff_c']<=max_diff_c].index.tolist() 426 | id_test = df[df['diff_c']>max_diff_c].index.tolist() 427 | X_train, y_train = X_all.loc[id_train], y_all.loc[id_train] 428 | X_test, y_test = X_all.loc[id_test], y_all.loc[id_test] 429 | # # MAD 430 | # mad = y_test.mad() 431 | # print(f'{model_name} {max_diff_c} {mad}') 432 | model.fit(X_train, y_train) 433 | y_pred = pd.Series(model.predict(X_test), index=y_test.index) 434 | 435 | y_err = (y_pred-y_test).abs() 436 | 437 | id_test_large = df[(df['diff_c']>max_diff_c) & (df['NIONS']>8)].index.tolist() 438 | id_test_small = df[(df['diff_c']>max_diff_c) & (df['NIONS']<=8)].index.tolist() 439 | 440 | # get mean absolute error 441 | mae = y_err.mean() 442 | # mae_large = y_err.loc[id_test_large].mean() 443 | # mae_small = y_err.loc[id_test_small].mean() 444 | # get r2 445 | r2 = r2_score(y_test, y_pred) 446 | # r2_large = r2_score(y_test.loc[id_test_large], y_pred.loc[id_test_large]) 447 | # r2_small = r2_score(y_test.loc[id_test_small], y_pred.loc[id_test_small]) 448 | 449 | print(f'{model_name} {max_diff_c} {mae}') 450 | # print(f'{model_name} {max_diff_c} {r2} {r2_large} {r2_small}') 451 | 452 | 453 | 454 | -------------------------------------------------------------------------------- /codes/readme.md: -------------------------------------------------------------------------------- 1 | The raw data can be downloaded from [Zenodo](https://doi.org/10.5281/zenodo.10854500), which also provides the Matminer features of initial and final structures and a demo script to train tree-based models. The results in the paper can be readily reproduced by adapting the demo script for different train-test splits (which is basically what `paper_train_model.py` does). To directly using `paper_train_model.py`, you may want to first put data in the appropriate data folder and make a csv to pkl format conversion. 
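A minimal sketch of that conversion (the `data/` layout and the `.dat_reduced.pkl` names are assumptions taken from the paths hard-coded in the scripts; here the full CSVs are simply pickled under those names):

```python
import pandas as pd

# Convert the featurized CSVs into the pickle files that
# paper_train_model.py and paper_SI_hypertune.py expect to read.
for struct in ['structure', 'structure_ini']:
    df = pd.read_csv(f'data/{struct}_featurized.dat_all.csv', index_col=0)
    df.to_pickle(f'data/{struct}_featurized.dat_reduced.pkl')
```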
2 | -------------------------------------------------------------------------------- /figs/counts_vs_elements.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/counts_vs_elements.png -------------------------------------------------------------------------------- /figs/fig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig2.png -------------------------------------------------------------------------------- /figs/fig3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig3.png -------------------------------------------------------------------------------- /figs/fig4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig4.png -------------------------------------------------------------------------------- /figs/fig5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig5.png -------------------------------------------------------------------------------- /figs/table2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/table2.png --------------------------------------------------------------------------------