├── README.md
├── codes
│   ├── demo_ML_training.py
│   ├── explore_data.py
│   ├── func_tmp.py
│   ├── paper_SI_hypertune.py
│   ├── paper_train_model.py
│   └── readme.md
└── figs
    ├── counts_vs_elements.png
    ├── fig2.png
    ├── fig3.png
    ├── fig4.png
    ├── fig5.png
    └── table2.png
/README.md:
--------------------------------------------------------------------------------
1 | # Efficient first principles based modeling via machine learning: from simple representations to high entropy materials
2 |
3 | ## Paper
   4 | This is the repository associated with our paper *Efficient first principles based modeling via machine learning: from simple representations to high entropy materials* ([publisher version](https://doi.org/10.1039/D4TA00982G), [arXiv version](https://arxiv.org/html/2403.15579v1)), in which we create a large DFT dataset for HEMs and evaluate the in-distribution and out-of-distribution performance of machine learning models.
5 |
6 |
   7 | ## DFT dataset for high entropy alloys [DOI](https://doi.org/10.5281/zenodo.10854500)
8 |
   9 | Our DFT dataset encompasses bcc and fcc structures composed of eight elements (Al, Si, Cr, Mn, Fe, Co, Ni, Cu) and covers all possible 2- to 7-component alloy systems formed by them.
  10 | The dataset used in the paper is publicly available on [Zenodo](https://doi.org/10.5281/zenodo.10854500); it includes initial and final structures, formation energies, atomic magnetic moments, and charges, among other attributes.
11 |
12 | *Note: The trajectory data (energies and forces for structures during the DFT relaxations) is not published with this paper; it will be released later with a work on machine learning force fields for HEMs.*
13 |
14 |
15 | ### Table: Numbers of alloy systems and structures.
16 | | No. components | 2 | 3 | 4 | 5 | 6 | 7 | Total |
17 | |----------------------------|------|-------|-------|-------|------|------|-------|
18 | | Alloy systems | 28 | 56 | 70 | 56 | 28 | 8 | 246 |
19 | | Ordered (2-8 atoms) | 4975 | 22098 | 29494 | 6157 | 3132 | 3719 | 69575 |
20 | | SQS (27, 64, or 128 atoms) | 715 | 3302 | 3542 | 4718 | 1183 | 762 | 14222 |
21 | | Ordered+SQS | 5690 | 25400 | 33036 | 10875 | 4315 | 4481 | 83797 |
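
The alloy-system counts in the first row are simply the number of ways of choosing the components from the eight elements, i.e. the binomial coefficients C(8, k); a quick check in Python:

```python
from math import comb

# Number of k-component systems that can be formed from 8 elements
counts = {k: comb(8, k) for k in range(2, 8)}
print(counts)                # {2: 28, 3: 56, 4: 70, 5: 56, 6: 28, 7: 8}
print(sum(counts.values()))  # 246 alloy systems in total
```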
22 |
23 |
  24 | ### Number of structures containing a given constituent element.
25 | The legend indicates the number of components.
26 |
27 |
28 |
29 |
30 | ## Generalization performance of machine learning models
  31 | The [Zenodo](https://doi.org/10.5281/zenodo.10854500) record also provides the Matminer features of the initial and final structures and a demo script to train tree-based models. The results in the paper can be readily reproduced by adapting the demo script to different train-test splits; a minimal example is sketched in the next subsection. The `codes` folder contains the scripts that we used in the paper.
32 |
33 | ### Generalization performance from small to large structures.
34 |
35 |
36 |
37 | (a) Normalized error obtained by training on structures with ≤ N atoms and evaluating on structures with > N atoms. (b) ALIGNN prediction on SQSs with > 27 atoms, obtained by training on structures with ≤ 4 atoms. (c) Parity plot of the ALIGNN prediction on SQSs with > 27 atoms, obtained by training on structures with ≤ 8 atoms.
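
A minimal sketch of how this train-test split can be reproduced with the ingredients of the demo script. It assumes the featurized csv from Zenodo has been saved as `data/structure_featurized.dat_all.csv` (the path used in `codes/demo_ML_training.py`); the XGBoost hyperparameters follow the demo script (CPU `hist` here):

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

# Featurized data; file name as in codes/demo_ML_training.py
df = pd.read_csv('data/structure_featurized.dat_all.csv', index_col=0)
X = df[df.columns[-273:]]    # the Matminer features are the last 273 columns
X = X.loc[:, X.var() != 0]   # drop zero-variance features
y = df['Ef_per_atom']

# Train on small cells, evaluate on large cells (small-to-large generalization)
train = df['NIONS'] <= 8
test = ~train

model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.4,
                         reg_lambda=0.01, reg_alpha=0.1,
                         colsample_bytree=0.5, colsample_bylevel=0.7,
                         num_parallel_tree=6, tree_method='hist')
model.fit(X[train], y[train])

mae = mean_absolute_error(y[test], model.predict(X[test]))
mad = (y[test] - y[test].mean()).abs().mean()  # MAD of the test targets
print(f'MAE = {mae:.3f} eV/atom, normalized error MAE/MAD = {mae/mad:.3f}')
```

The other generalization tasks below differ only in how `train` and `test` are defined (by number of elements, or by deviation from equimolar composition).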
38 |
39 |
40 |
41 | ### Generalization performance from low-order to high-order systems.
42 |
43 |
44 |
  45 | (a) Normalized error obtained by training on structures with ≤ N elements and evaluating on structures with > N elements. (b) Parity plot of the ALIGNN prediction on structures with ≥ 3 elements, obtained by training on binary structures. (c) Parity plot of the ALIGNN prediction on structures with ≥ 4 elements, obtained by training on binary and ternary structures.
46 |
47 |
48 |
49 | ### Generalization performance from (near-)equimolar to non-equimolar structures.
50 |
51 |
52 |
  53 | (a) Normalized error obtained by training on structures with maxΔc below a given threshold and evaluating on the rest. (b) Predictions on non-equimolar structures (maxΔc > 0) by the ALIGNN model trained on equimolar structures (maxΔc = 0). (c) Predictions on structures with relatively strong deviation from equimolar composition (maxΔc > 0.2) by the ALIGNN model trained on structures with relatively weak deviation from equimolar composition (maxΔc ≤ 0.2). maxΔc is defined as the maximum concentration difference between any two elements in a structure.
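
maxΔc can be computed directly from a structure's composition, e.g. with pymatgen, mirroring the `diff_c` column constructed in `codes/paper_train_model.py`:

```python
import numpy as np
from pymatgen.core.composition import Composition

def max_delta_c(formula: str) -> float:
    """Maximum concentration difference between any two elements in a structure."""
    amounts = np.array(list(Composition(formula).get_el_amt_dict().values()))
    c = amounts / amounts.sum()  # fractional concentrations
    return float(c.max() - c.min())

print(max_delta_c('CoCrFeNi'))   # 0.0 (equimolar)
print(max_delta_c('Co2CrFeNi'))  # ~0.2 (= 2/5 - 1/5)
```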
54 |
55 |
56 |
57 | ### Effects of dataset size and use of unrelaxed vs. relaxed structures
58 |
59 |
60 |
61 |
62 |
63 | ### Overview of model performance on different generalization tasks
64 |
65 |
66 |
67 |
68 |
69 |
--------------------------------------------------------------------------------
/codes/demo_ML_training.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 | @author: kangming
5 |
6 | A demo to show the XGBoost training on the formation energy dataset
7 | """
8 | #%%
9 | import numpy as np
10 | import pandas as pd
11 | from sklearn.pipeline import Pipeline
12 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
13 | from sklearn.model_selection import cross_val_predict
14 | from matplotlib import pyplot as plt
15 | from sklearn.preprocessing import StandardScaler
16 | import xgboost as xgb
17 |
18 |
19 | model= Pipeline([
20 | ('scaler', StandardScaler()), # scaling does not actually matter for tree methods
21 | ('model', xgb.XGBRegressor(
22 | n_estimators=500,
23 | learning_rate=0.4,
24 | reg_lambda=0.01,reg_alpha=0.1,
25 | colsample_bytree=0.5,colsample_bylevel=0.7,
26 | num_parallel_tree=6,
27 | # tree_method='gpu_hist', gpu_id=0
28 | tree_method = "hist", device = "cuda",
29 | )
30 | )
31 | ])
32 |
33 | #%%
34 | struct = 'structure'
35 |
36 | dat_type = 'featurized'
37 | # df = pd.read_pickle(f'data/{struct}_{dat_type}.dat_reduced.pkl')
38 | df = pd.read_csv(f'data/{struct}_{dat_type}.dat_all.csv',index_col=0)
39 |
40 | #%%
41 | if dat_type == 'featurized':
42 | nfeatures = 273
43 | cols_feat = df.columns[-nfeatures:]
44 |
45 | X_all = df[cols_feat]
46 | # drop features whose variance is zero
47 | X_all = X_all.loc[:,X_all.var()!=0]
48 | else:
49 | X_all = df[f'graphs_{struct}']
50 |
51 | y_all = df['Ef_per_atom']
52 |
53 | #%%
54 | # Get the 5-fold cross-validation estimates for the whole dataset
55 | cv = 5
56 | y_pred = cross_val_predict(model, X_all, y_all, cv=cv)
57 |
58 | #%%
59 | df_y = pd.DataFrame({'Ef_true':y_all, 'Ef_pred':y_pred}, index=y_all.index)
60 | cols2add = ['formula','lattice','NIONS']
61 | df_y = pd.concat([df_y,df[cols2add]], axis=1)
62 |
63 | #%%
    64 | # compute the MAD of Ef_true and the MAE between Ef_true and Ef_pred
65 | mad = np.mean(np.abs(df_y['Ef_true'] - df_y['Ef_true'].mean()))
66 | mae = mean_absolute_error(df_y['Ef_true'], df_y['Ef_pred'])
67 | print(f'MAD: {mad:.3f}')
68 | print(f'MAE: {mae:.3f}')
69 | # get the r2 score
70 | r2 = r2_score(df_y['Ef_true'], df_y['Ef_pred'])
71 | print(f'R2: {r2:.3f}')
72 |
73 | #%% parity plot
74 | fig, ax = plt.subplots(figsize=(5,5))
75 | ax.scatter(df_y['Ef_true'], df_y['Ef_pred'], s=5)
76 | lims = [-0.6, 0.6]
77 | #diag line
78 | ax.plot(lims,lims, 'k--', lw=1)
79 | # set limits
80 | ax.set_xlim(lims)
81 | ax.set_ylim(lims)
82 | ax.set_xlabel('DFT formation energy (eV/atom)')
83 | ax.set_ylabel('Predicted formation energy (eV/atom)')
84 | # add scores to fig
85 | ax.text(0.05, 0.9, f'MAE: {mae:.3f} eV/atom', transform=ax.transAxes)
86 | ax.text(0.05, 0.85, f'R2: {r2:.3f}', transform=ax.transAxes)
87 |
88 | # %%
89 |
--------------------------------------------------------------------------------
/codes/explore_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 |
4 | #%%
5 | import pandas as pd
6 | import numpy as np
7 | import matplotlib
8 | import matplotlib.pyplot as plt
9 | from sklearn.decomposition import PCA
10 | import seaborn as sns
11 |
12 | # use color blind friendly colors
13 | sns.set_palette("colorblind")
14 |
15 | figsize=(4,4)
16 |
17 | #%%
18 | '''
19 | Import data and the related SROs
20 | '''
21 | dat = pd.read_csv('data/structure_ini_featurized.dat_all.csv',index_col=0)
22 | feature_names = dat.columns[-273:]
23 | compo_names = feature_names[-145:]
24 | struc_names = feature_names[:-145]
25 | # this is intended to be used for removing the data from MP
26 | dat = dat.dropna()
27 | # dat_from_MP = dat[dat['lattice'].isna()]
28 | # print('dat_from_MP: ',dat_from_MP.shape, ' dat: ', dat.shape)
29 |
30 | # get SROs
31 | df_sro = pd.read_csv('data/SROs_structure_ini.csv',index_col=0)
32 | # concat df_sro to dat
33 | dat = pd.concat([dat,df_sro],axis=1)
34 |
35 | #%%
36 | # elemental distribution
37 | chemical_system = dat['chemical_system']
38 | elements = ['Al','Si','Cr','Mn','Fe','Co','Ni','Cu']
39 |
40 | # count the number of chemical systems containing each element
41 | nstructures = {}
42 | nstructures_nele = {}
43 | for element in elements:
44 | nstructures[element] = chemical_system.str.contains(element).sum()
45 | for nele in range(2,8):
46 | nstructures_nele[element,nele] = dat[dat['nelements']==nele]['chemical_system'].str.contains(element).sum()
47 |
48 | # bar plot
49 | fig, ax = plt.subplots(figsize=(4,3.5))
50 | # ax.bar(nstructures.keys(),nstructures.values())
51 | # make a stacked bar plot using nstructures_nele
52 | bottom = np.zeros(len(elements))
53 | for nele in range(2,8):
54 | ax.bar(elements,[nstructures_nele[element,nele] for element in elements],bottom=bottom,label=str(nele))
55 | bottom += np.array([nstructures_nele[element,nele] for element in elements])
56 | ax.legend()
57 |
58 | ax.set_ylim(0,5.5e4)
59 | # add minor yticks (5000)
60 | ax.yaxis.set_minor_locator(matplotlib.ticker.MultipleLocator(5000))
61 | ax.set_ylabel('Number of structures')
62 |
63 | # fig.savefig('figs/describe_data/counts_vs_elements.pdf')
64 |
65 |
66 | #%%
67 | '''
68 | Define:
69 |
70 | structure wise dataframe or series:
71 | X: original features
72 | X_std: standardized features
    73 |         y: formation energy per atom
74 | sg: space group number
75 | lattice: lattice
76 | csro: chemical SRO (mean abs SRO1, mean abs SRO2, mean abs SRO3, mean abs SRO4)
77 |
    78 |     Dicts of indices. Keys are used to classify the data according to the number of atoms or the number of elements:
79 | neles: keys are the number of elements, values are the index of the data with the corresponding number of elements
80 | natoms: keys are the number of atoms, values are the index of the data with the corresponding number of atoms
81 | '''
82 |
83 | # features
84 | X = dat[feature_names]
85 | # drop columns whose variance is 0
86 | X = X.loc[:,X.std()!=0]
87 | # standardize the features and turn it into a df by keeping the index and column names
88 | X_std = pd.DataFrame((X-X.mean())/X.std(), index=X.index, columns=X.columns)
89 | # target
90 | y = dat['Ef_per_atom']
91 |
92 | # space group
93 | sg = dat['space_group_number']
94 | # lattice
95 | lattice = dat['lattice']
96 | # chemical SRO
97 | csro = dat[['mean abs SRO'+str(i) for i in range(1,5)]]
98 |
99 | natoms={}
100 | for i in dat['NIONS'].unique():
101 | natoms[i] = dat[dat['NIONS']==i].index
102 |
103 | smaller = dat[dat['NIONS']<=6].index
104 | small = dat[dat['NIONS']<=8].index
105 | large = dat[dat['NIONS']>8].index
106 | low = dat[dat['nelements']<=2].index
107 | high = dat[dat['nelements']>2].index
108 |
109 |
110 | neles={}
111 | for i in range(2,8):
112 | neles[i] = dat[dat['nelements']==i].index
   113 |     nsmall = dat[(dat['NIONS']<=8) & (dat['nelements']==i)].shape[0]
114 | nlarge = dat[(dat['NIONS']>8) & (dat['nelements']==i)].shape[0]
115 | n_chemsys = len(dat.loc[neles[i],'chemical_system'].unique())
   116 |     print(f'nelements={i}: {n_chemsys} chemical systems, {len(neles[i])} structures, {nsmall} small, {nlarge} large')
117 |
118 |
119 |
120 |
121 | #%%
122 | '''
123 | distribution of the system size: counts vs. number of atoms.
124 | '''
125 |
126 | fig, ax = plt.subplots(figsize=figsize)
127 | # get the number of atoms, [<=6, 8, 27,64,125]
128 | counts = [len(small)] + [len(natoms[n]) for n in [27,64,125]]
129 |
130 | # plot bar plot
131 | ax.bar(np.arange(len(counts)),counts)
132 | # # use scientific notation for y
133 | # ax.ticklabel_format(axis='y', style='sci', scilimits=(0,0))
134 | # use log scale for y
135 | ax.set_yscale('log')
136 | # set ylim
137 | ax.set_ylim(100,1e5)
138 | # set xticks
139 | ax.set_xticks(np.arange(len(counts)))
140 | # set xticklabels
141 | ax.set_xticklabels([r'$\leq$8', '27','64','125'])
142 | # set xlabel
143 | ax.set_xlabel('Number of atoms')
144 | # set ylabel
145 | ax.set_ylabel('Counts')
146 |
147 | '''
148 | distribution of the system size: counts vs. number of elements.
149 | '''
150 | fig, ax = plt.subplots(figsize=figsize)
151 | ax.bar(range(2,8),[len(neles[i]) for i in range(2,8)])
152 | ax.set_xlabel('Number of elements')
153 | ax.set_ylabel('Counts')
154 | ax.set_ylim(0,3.5e4)
155 |
156 |
157 | #%%
158 | '''
159 | Plot distribution of the system size: counts vs. number of atoms.
160 | But this time, use a stacked bar plot:
161 | each bar is the count for a specific system size, and the bar is divided into 6 parts
162 | according to nelements (2,3,4,5,6,7)
163 |
164 | '''
165 | # get the number of atoms, [<=8, 27,64,125], aggregate the counts for # atoms <=8
166 | counts = {r'$\leq$8':[], '27':[], '64':[], '125':[]}
167 |
168 | # loop over the number of elements
169 | for nele in range(2,8):
170 | # loop over the number of atoms
171 | for n in [r'$\leq$8',27,64,125]:
172 | if n == r'$\leq$8':
173 | counts[str(n)].append(y[(dat['NIONS']<=8) & (dat['nelements']==nele)].shape[0])
174 | else:
175 | # get the counts for a specific number of atoms and number of elements
176 | counts[str(n)].append(y[(dat['NIONS']==n) & (dat['nelements']==nele)].shape[0])
177 |
178 | fig, ax = plt.subplots(figsize=(4,3.7))
179 |
180 |
181 |
182 | # plot stacked bar plot:
183 | # the x axis is the number of atoms, namely the keys of the counts dictionary
184 | # the y axis is the stacked counts, consisting of 6 parts
185 | bottom = np.zeros(len(counts.keys()))
186 | for i,nele in enumerate(range(2,8)):
187 | ax.bar(counts.keys(),
188 | [counts[natom][i] for natom in counts.keys()],
189 | label=str(nele),
190 | width=0.55,
191 | bottom=bottom)
192 | bottom += np.array([counts[natom][i] for natom in counts.keys()])
193 |
194 | # use log scale for y
195 | # ax.set_yscale('log')
196 | # set ylim
197 | ax.set_ylim(0,7e4)
198 | # set legend to 3 columns
199 | ax.legend(loc=(0.5,0.075),ncol=2)
200 | # use scientific notation for y
201 | # ax.ticklabel_format(axis='y', style='sci', scilimits=(0,0))
202 | ax.set_xlabel('Number of atoms')
203 | ax.set_ylabel('Number of structures')
204 |
205 | # add an inset
206 | axins = ax.inset_axes([0.4, 0.47, 0.475, 0.475])
207 | bottom = np.zeros(len(counts.keys()))
208 |
209 | for i,nele in enumerate(range(2,8)):
210 | axins.bar(['27','64','125'],
211 | [counts[natom][i] for natom in ['27','64','125']],
212 | label=str(nele),
213 | bottom=bottom[[1,2,3]])
214 | bottom += np.array([counts[natom][i] for natom in counts.keys()])
215 | axins.set_ylim(0,1e4)
216 | # set xticks and xticklabels
   217 | axins.set_xticks(np.arange(3))
218 | axins.set_xticklabels(['27','64','125'])
219 | axins.ticklabel_format(axis='y', style='sci', scilimits=(3,3))
220 |
221 | fig.tight_layout()
222 | # fig.savefig('figs/describe_data/counts_vs_natoms.pdf')
223 |
224 |
225 |
226 | #%%
227 | '''
228 | Create a df where the index is the number of atoms, and the columns are the number of elements
229 | '''
230 | df_counts = pd.DataFrame(columns=range(2,8), index=sorted(dat['NIONS'].unique()))
231 |
232 | for index in df_counts.index:
233 | for col in df_counts.columns:
234 | df_counts.loc[index,col] = dat[(dat['NIONS']==index) & (dat['nelements']==col)].shape[0]
235 |
236 |
237 | #%%
238 | ''' violin plot for the distribution of y vs. NIONS '''
239 | fig, ax = plt.subplots(figsize=(4,3.5))
240 | # plot violin plot
241 | violinplot = ax.violinplot([
242 | y[small],
243 | y[natoms[27]],
244 | y[natoms[64]],
245 | y[natoms[125]],
246 | ],
247 | showextrema=False,
248 | quantiles=[[0.50,0.90] for i in range(4)],
249 | )
250 |
251 | # change the color of the violin plot
252 | for pc in violinplot['bodies']:
253 | pc.set_alpha(1)
254 | violinplot['cquantiles'].set_color('r')
255 |
256 | # set xticks
257 | ax.set_xticks([1,2,3,4])
258 | # set xticklabels
259 | ax.set_xticklabels([r'$\leq$8','27','64','125'])
260 | # set ylabel
261 | ax.set_ylabel('Formation energy per atom (eV/atom)')
262 | ax.set_ylim(-0.6,0.5)
263 | # ax.grid(linewidth=0.1)
264 | ax.set_xlabel('Number of atoms')
265 |
266 | # fig.savefig('figs/describe_data/violinplot_Ef_vs_natoms.pdf')
267 |
268 |
269 |
270 | #%%
271 | '''
272 |
273 | violin plot for the distribution of SRO vs. NIONS
274 |
275 | '''
276 | fig, axs = plt.subplots(figsize=(4,3.7),ncols=2,sharey=True,gridspec_kw={'wspace':0.0})
277 |
278 | for i in range(2):
279 | ax = axs[i]
280 | # plot violin plot
281 | sro2plot = f'mean abs SRO{i+1}'
282 | violinplot = ax.violinplot([
283 | csro.loc[small,sro2plot],
284 | csro.loc[natoms[27],sro2plot],
285 | csro.loc[natoms[64],sro2plot],
286 | csro.loc[natoms[125],sro2plot],
287 | ],
288 | showextrema=False,
289 | quantiles=[[0.50,0.90] for i in range(4)],
290 | )
291 |
292 | # change the color of the violin plot
293 | for pc in violinplot['bodies']:
294 | pc.set_alpha(1)
295 | violinplot['cquantiles'].set_color('r')
296 |
297 | # set xticks
298 | ax.set_xticks([1,2,3,4])
299 | # set xticklabels
300 | ax.set_xticklabels([r'$\leq$8','27','64','125'])
301 |
302 | # set ylabel
303 | if i == 0:
304 | ax.set_ylabel('Mean abs SRO')
305 | ax.text(0.3,0.9,'(a) SRO1',transform=ax.transAxes)
306 | else:
307 | ax.text(0.3,0.9,'(b) SRO2',transform=ax.transAxes)
308 | ax.set_ylim(0,1.4)
309 | ax.grid(linewidth=0.1)
310 | # set a common xlabel
311 | fig.text(0.425,0.0,'Number of atoms')
312 |
313 |
314 | # fig.savefig('figs/describe_data/violinplot_SRO_vs_natoms.pdf')
315 |
316 | #%%
317 | '''
318 | Distribution of SRO1 vs. SRO2 for ordered and disordered structures
319 | '''
320 |
321 | fig, axs = plt.subplots(figsize=(4.25,2.25),ncols=2,
322 | # set the horizontal space between subplots
323 | gridspec_kw={'wspace':0.275}
324 | )
325 |
326 | norm = matplotlib.colors.LogNorm(vmin=1,vmax=1000)
327 |
328 | ax = axs[0]
329 | # hexbin plot of SRO1 vs. SRO2 for small systems
330 | # use a cmap that is white for small counts and black for large counts
331 | ax.hexbin(csro.loc[small,'mean abs SRO1'],csro.loc[small,'mean abs SRO2'],
332 | gridsize=(60,60),cmap='Greys',norm = norm)
333 | ax.set_xlabel('Mean abs SRO1')
334 | ax.set_ylabel('Mean abs SRO2')
335 | ax.set_xlim(-0.02,1.6)
336 | ax.set_xticks([0,0.4,0.8,1.2,1.6])
337 | ax.set_ylim(-0.02,2.5)
338 | ax.text(0.4,0.9,r'$\leq$8 atoms',transform=ax.transAxes,
339 | # white background for the text
340 | bbox=dict(facecolor='white', edgecolor='white', pad=0.0)
341 | )
342 | ax.text(-0.4,0.965, '(a)' ,transform=ax.transAxes,fontsize=11,
343 | weight='bold'
344 | )
345 |
346 | ax=axs[1]
347 | # hexbin plot of SRO1 vs. SRO2 for small systems
348 | # use a cmap that is white for small counts and black for large counts
349 | ax.hexbin(csro.loc[large,'mean abs SRO1'],csro.loc[large,'mean abs SRO2'],
350 | gridsize=(30,25),cmap='Greys',norm=norm)
351 | ax.set_xlabel('Mean abs SRO1')
352 | # ax.set_ylabel('Mean abs SRO2')
353 | ax.set_xlim(-0.01,0.4)
354 | ax.set_xticks([0,0.1,0.2,0.3,0.4])
355 | ax.set_ylim(-0.01,0.4)
356 | ax.set_yticks([0,0.1,0.2,0.3,0.4])
357 | ax.text(0.4,0.9,r'$\geq$27 atoms',transform=ax.transAxes)
358 |
359 | # Add a colorbar shared between the two plots
360 | cbar_ax = fig.add_axes([0.915, 0.125, 0.015, 0.725])
361 | sm = plt.cm.ScalarMappable(cmap='Greys', norm=norm,)
362 | cbar = fig.colorbar(sm, cax=cbar_ax,orientation='vertical')
363 | cbar.set_label('Counts')
364 | # cbar.ax.xaxis.set_ticks_position('top') # new line: move ticks to top
365 | # cbar.ax.xaxis.set_label_position('top') # new line: move label to top
366 | # fig.savefig('figs/describe_data/hexbin_SRO1_vs_SRO2.png',bbox_inches='tight',dpi=300)
367 |
368 |
369 |
370 |
371 |
372 |
373 | #%%
374 | ''' violin plot for the distribution of SRO vs. number of elements '''
375 |
376 | def plot_ax(ax,df,ylim=(-0.01,1.8),sro2plot = 'mean abs SRO1'):
377 |
378 | violinplot = ax.violinplot([
379 | csro.loc[df[df['nelements']==n].index,sro2plot] for n in range(2,8)
380 | ],
381 | showextrema=False,
382 | showmedians=True,
383 | quantiles=[[0.5,0.90] for i in range(6)],
384 | )
385 |
386 | # change the color of the violin plot
387 | # for pc in violinplot['bodies']:
388 | # pc.set_alpha(0.5)
389 | violinplot['cquantiles'].set_color('r')
390 | # set the linewidth of cquantiles
391 | violinplot['cquantiles'].set_linewidth(2)
392 |
393 | # set xticks
394 | ax.set_xticks([1,2,3,4,5,6])
395 | # set xticklabels
396 | ax.set_xticklabels(['2','3','4','5','6','7'])
397 | # set xlabel
398 | ax.set_xlabel('Number of elements')
399 | # set ylabel
400 | ax.set_ylabel(sro2plot)
401 | ax.set_ylim(ylim)
402 | # ax.grid(linewidth=0.05)
403 | return ax
404 |
405 | '''
406 | violin plot for the distribution of SRO1 vs. number of elements
407 | The plot contains 3 subplots, for the number of atoms <=8, 27, 64
408 | '''
409 | fig, axs = plt.subplots(figsize=(4.25,2.25),ncols=2,sharey=True,
410 | # set the space between subplots
411 | gridspec_kw={'wspace':0.0}
412 | )
413 |
414 | sro2plot = 'SRO1'
415 |
416 | # ax = axs[0]
417 | # plot_ax(ax,dat[dat['NIONS']<=8],ylim=(0,1.4),sro2plot='mean abs '+sro2plot)
418 | # ax.text(0.04,0.92,r'$\leq$8 atoms',transform=ax.transAxes)
419 |
420 | ax = axs[0]
421 | plot_ax(ax,dat[dat['NIONS']==27],ylim=(0,1.4),sro2plot='mean abs '+sro2plot)
422 | # ax.text(0.05,0.92,'(a) 27 atoms',transform=ax.transAxes)
423 | ax.text(-0.35,0.95, '(b)' ,transform=ax.transAxes,fontsize=11,
424 | weight='bold'
425 | )
426 |
427 |
428 | ax = axs[1]
429 | plot_ax(ax,dat[dat['NIONS']==64],ylim=(0,1.4),sro2plot='mean abs '+sro2plot)
430 | # ax.text(0.05,0.92,'(b) 64 atoms',transform=ax.transAxes)
431 |
432 | # hide the ylabels for the 2nd and 3rd subplots
433 | axs[1].set_ylabel('')
434 | # axs[2].set_ylabel('')
435 |
436 | # set a common xlabel
437 | for ax in axs:
438 | ax.set_xlabel('')
439 | ax.set_ylim(-0.005,0.33)
440 | # add horizontal line at 0
441 | ax.axhline(0, color='k',linestyle="-",linewidth=0.1)
442 | fig.text(0.4,0.0,'Number of elements')
443 |
444 | #%%
445 | def plot_SRO_vs_nelements(df,figname,ylim=(-0.01,1.8)):
446 | fig, axs = plt.subplots(figsize=(figsize[0]*1.2,figsize[1]),ncols=2,sharey=True)
447 | for i in range(2):
448 | ax = axs[i]
449 | plot_ax(ax,df,ylim=ylim,sro2plot=f'mean abs SRO{i+1}')
450 | fig.savefig(figname)
451 |
452 | # plot_SRO_vs_nelements(dat,'figs/describe_data/violinplot_SRO_vs_nelements.pdf')
453 | # plot_SRO_vs_nelements(dat[dat['NIONS']<=8],'figs/describe_data/violinplot_SRO_vs_nelements_small.pdf')
454 | # plot_SRO_vs_nelements(dat[dat['NIONS']==27],'figs/describe_data/violinplot_SRO_vs_nelements_27.pdf')
455 | # plot_SRO_vs_nelements(dat[dat['NIONS']==64],'figs/describe_data/violinplot_SRO_vs_nelements_64.pdf')
456 |
457 |
458 |
459 |
460 |
461 | #%%
462 | '''
463 | Plot distribution of the formation energy per atom (y): histogram
464 | '''
465 |
466 | fig, ax = plt.subplots(figsize=(4.,3.7))
467 | # set bins (min, max, step)
468 | bins=np.arange(-0.7,0.64,0.04)
469 | # plot histogram for y_large and y_small
470 | y_small = dat[dat['NIONS']<=8]['Ef_per_atom']
471 | # mean absolute deviation
472 | mad = np.mean(np.abs(y_small-np.mean(y_small)))
473 | ax.hist(y_small, bins=bins, histtype='step',
474 | label=r'$\leq8$ atoms'
475 | )
476 |
477 |
478 | for n in [27,64,125]:
479 | mad = np.mean(np.abs(y[natoms[n]]-np.mean(y[natoms[n]])))
480 | ax.hist(y[natoms[n]], bins=bins, histtype='step',label=str(n)+' atoms')
481 |
482 | # y_large = dat[dat['NIONS']>8]['Ef_per_atom']
483 | # mad = np.mean(np.abs(y_large-np.mean(y_large)))
484 | # ax.hist(y_large, bins=bins, histtype='step', linestyle='--',
485 | # label=r'$\geq27$ atoms' +f' (MAD={mad:.3f})'
486 | # )
487 |
488 | # set ylim
489 | ax.set_ylim(1,2e4)
490 | # set xlim
491 | ax.set_xlim(bins.min(),0.6)
492 | # use log scale for y
493 | ax.set_yscale('log')
494 | # set vertical line at 0
495 | # ax.axvline(0, color='k',linewidth=0.1)
496 | ax.legend(
497 | loc=(0.005,0.7),
498 | # adjust the size of the legend
499 | handlelength=1.,
500 | handletextpad=0.5,
501 |
502 | )
503 | # set xlabel
504 | ax.set_xlabel('Formation energy per atom (eV/atom)')
505 | # set ylabel
506 | ax.set_ylabel('Counts')
507 |
508 | # fig.savefig('figs/describe_data/hist_Ef_per_atom.pdf')
509 |
510 |
511 |
512 |
513 | #%%
514 | ''' violin plot '''
515 | fig, ax = plt.subplots(figsize=figsize)
516 | # plot violin plot
517 | ax.violinplot([y_small,y[natoms[27]],y[natoms[64]],y[natoms[125]]])
518 | # # set ylim
519 | ax.set_ylim(-0.5,0.5)
520 | # set xticks
521 | ax.set_xticks(np.arange(1,5))
522 | # set xticklabels
   523 | ax.set_xticklabels([r'$\leq$8','27','64','125'])
524 | # set xlabel
525 | ax.set_xlabel('Number of atoms')
526 | # set ylabel
527 | ax.set_ylabel('Formation energy per atom (eV/atom)')
528 | # set horizontal line at 0
529 | ax.axhline(0, color='k', linestyle='--')
530 |
531 |
532 | #%%
533 |
534 | ''' Visualize the feature space using PCA'''
535 |
536 | features = [i for i in X_std.columns if i in compo_names]
537 | # features = [i for i in X_std.columns if i in struc_names]
538 | X_pca_in = X_std[features]
539 |
540 |
541 | # PCA
542 | pca = PCA()
543 | X_pca = pca.fit_transform(X_pca_in)
544 | X_pca = pd.DataFrame(X_pca,index=X.index,
545 | columns = [*range(X_pca.shape[1])]
546 | )
547 |
548 | #%%
549 | cluster0 = dat[(X_pca[0]>=-0.05)].index # Also related to composition
550 | c = X_pca[0]<0
551 | fig, ax = plt.subplots(figsize=figsize)
552 | ax.scatter(X_pca[0],X_pca[1],c=c,alpha=0.5,s=5)
553 |
554 | '''
555 | There are two clusters in the PCA plot.
   556 | The code below was meant to identify why there are two clusters,
   557 | but I could not find a clear explanation and dropped this part.
   558 | I would appreciate insights from anyone who figures it out.
559 |
560 | '''
561 |
562 |
563 |
564 | #%%
565 | cluster1 = dat[(X_pca[0]<-0.05)].index # Also related to composition
566 | c = X_pca[0]<0
567 | fig, ax = plt.subplots(figsize=figsize)
568 | ax.scatter(X_pca[0],X_pca[1],c=c,alpha=0.5,s=5)
569 |
570 | #%%
571 | cluster2 = dat[(X_pca[1]>21)].index # Al-Si
572 | c = X_pca[1]>21
573 | fig, ax = plt.subplots(figsize=figsize)
574 | ax.scatter(X_pca[0],X_pca[1],c=c,alpha=0.5,s=5)
575 |
576 | #%%
577 | chemical_system1 = dat.loc[cluster1, 'chemical_system'].unique()
578 | chemical_system_non1 = dat.loc[cluster0, 'chemical_system'].unique()
579 |
580 |
581 | #%%
582 | # check if chemical_system1 is a subset of chemical_system_non1
583 | chemical_system1_set = set(chemical_system1)
584 | chemical_system_non1_set = set(chemical_system_non1)
585 | chemical_system1_set.issubset(chemical_system_non1_set)
586 |
587 | #%%
588 |
589 | # get the coefficients for the first principal component
590 | coeff = pd.DataFrame(pca.components_[0],index=X_pca_in.columns,columns=['coeff'])
591 | # sort the coefficients
592 | coeff = coeff.sort_values(by='coeff',ascending=False)
593 |
594 |
595 |
596 |
597 |
598 |
599 |
600 |
601 | #%%
602 | # plot
603 | fig, ax = plt.subplots(figsize=figsize)
604 | ax.scatter(X_pca[0],X_pca[1],c=y,cmap='rainbow',alpha=0.5,s=5)
605 | ax.set_xlabel('PC1')
606 | ax.set_ylabel('PC2')
607 | # set colorbar
   608 | sm = plt.cm.ScalarMappable(cmap='rainbow',
   609 |                            norm=plt.Normalize(vmin=y.min(), vmax=y.max())
   610 |                            )
611 | # add the colorbar to the figure
612 | cbar = fig.colorbar(sm)
613 | cbar.set_label('Formation energy per atom (eV/atom)')
614 | # set title
615 | ax.set_title('PCA of the feature space')
616 |
617 | # same plot but color by the number of atoms
618 | fig, ax = plt.subplots(figsize=figsize)
619 | ax.scatter(X_pca[0],X_pca[1],c=dat['NIONS'],cmap='rainbow',alpha=0.5,s=5)
620 | ax.set_xlabel('PC1')
621 | ax.set_ylabel('PC2')
622 | # set colorbar
623 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=dat['NIONS'].min(), vmax=dat['NIONS'].max()))
624 | # add the colorbar to the figure
625 | cbar = fig.colorbar(sm)
626 | cbar.set_label('Number of atoms')
627 | # set title
628 | ax.set_title('PCA of the feature space')
629 |
630 | # same plot but color by the number of elements
631 | fig, ax = plt.subplots(figsize=figsize)
632 | ax.scatter(X_pca[0],X_pca[1],c=dat['nelements'],cmap='rainbow',alpha=0.5,s=5)
633 | ax.set_xlabel('PC1')
634 | ax.set_ylabel('PC2')
635 | # set colorbar
636 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=dat['nelements'].min(), vmax=dat['nelements'].max()))
637 | # add the colorbar to the figure
638 | cbar = fig.colorbar(sm)
639 | cbar.set_label('Number of elements')
640 | # set title
641 | ax.set_title('PCA of the feature space')
642 |
643 | # same plot but color by the space group number
644 | fig, ax = plt.subplots(figsize=figsize)
645 | ax.scatter(X_pca.loc[:,0],X_pca.loc[:,1],c=sg.loc[:],cmap='rainbow',alpha=0.5,s=5)
646 | ax.set_xlabel('PC1')
647 | ax.set_ylabel('PC2')
648 | # set colorbar
649 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=sg.min(), vmax=sg.max()))
650 | # add the colorbar to the figure
651 | cbar = fig.colorbar(sm)
652 | cbar.set_label('Space group number')
653 | # set title
654 | ax.set_title('PCA of the feature space')
655 |
656 |
657 |
658 |
659 |
660 |
661 | #%%
662 | # same plot but colored by mean abs SRO1
663 | fig, ax = plt.subplots(figsize=figsize)
664 | ax.scatter(X_pca.loc[:,0],X_pca.loc[:,1],c=dat['mean abs SRO1'],cmap='rainbow',alpha=0.5,s=5)
665 | ax.set_xlabel('PC1')
666 | ax.set_ylabel('PC2')
667 | # set colorbar
668 | sm = plt.cm.ScalarMappable(cmap='rainbow', norm=plt.Normalize(vmin=dat['mean abs SRO1'].min(), vmax=dat['mean abs SRO1'].max()))
669 | # add the colorbar to the figure
670 | cbar = fig.colorbar(sm)
671 | cbar.set_label('Mean absolute SRO1')
672 | # set title
673 | ax.set_title('PCA of the feature space')
674 |
675 | #%%
--------------------------------------------------------------------------------
/codes/func_tmp.py:
--------------------------------------------------------------------------------
1 | #%%
2 | import pandas as pd
3 | import numpy as np
4 | import matplotlib.pyplot as plt
5 | import os
6 | from sklearn.model_selection import train_test_split
7 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
8 | import time
9 |
10 | random_state = 1
11 |
12 | #%% Define functions
13 |
14 |
15 | def train_predict(model, X_train, y_train, X_test, y_test,print_metrics=True):
16 | # record the time
17 | time_init = time.time()
18 |
19 | # fit the model on the training set
20 | model.fit(X_train, y_train)
21 | # predict the test set
22 | y_pred = pd.DataFrame(model.predict(X_test), index=y_test.index)
23 |
24 | # record the time
25 | time_elapsed = time.time() - time_init
26 |
27 | # Metrics
28 | rmse = mean_squared_error(y_test, y_pred, squared=False)
29 | mae = mean_absolute_error(y_test, y_pred)
30 | r2 = r2_score(y_test, y_pred)
31 | if print_metrics == True:
32 | print(f'rmse={rmse:.3f}, mae={mae:.3f}, r2={r2:.3f}, time={time_elapsed:.1f} s')
33 | metrics = {}
34 | metrics['rmse'] = rmse
35 | metrics['mae'] = mae
36 | metrics['r2'] = r2
37 |
38 | return model, y_pred, metrics
39 |
40 | def parity_plot(y_test, y_pred, title=None, ax=None, metrics=None):
    41 |     # Figure: create a new axis only if none was passed in
    42 |     if ax is None: fig, ax = plt.subplots(figsize=(4,4))
43 | ax.plot(y_test, y_pred,'r.')
44 | ax.plot(np.linspace(-10,10),np.linspace(-10,10),'k')
45 | ax.set_xlim([np.min(y_test)-0.1,np.max(y_test)+0.1])
46 | ax.set_ylim([np.min(y_test)-0.1,np.max(y_test)+0.1])
47 | ax.set_xlabel('DFT (eV/atom)')
48 | ax.set_ylabel('ML (eV/atom)')
49 |
50 | # add metrics
51 | if metrics is not None:
52 | rmse = metrics['rmse']
53 | mae = metrics['mae']
54 | r2 = metrics['r2']
55 | text = f'rmse={rmse:.3f}, mae={mae:.3f}, r2={r2:.3f}'
56 | ax.text(np.min(y_test), np.max(y_test), text, ha="left", va="top", color="b")
57 | plt.tight_layout()
58 | if title is not None:
59 | plt.title(title)
60 | return ax
61 |
62 |
63 |
64 |
65 | def perf_vs_size(model, X_pool, y_pool, X_test, y_test, csv_out,
66 | overwrite=False,frac_list=None,n_frac = 15, n_run_factor=1):
67 | '''
68 | model: a sklearn model
69 | X_pool, y_pool: the training pool
70 | X_test, y_test: the test set
    71 |     frac_list: the list of training set sizes as fractions of the training pool
    72 |     overwrite: if True, overwrite the existing csv file csv_out
    73 |     n_frac: the number of training set sizes to consider (n_run_factor scales the number of repeated runs per size)
74 | '''
75 |
76 | # if csv_out exists, read it
77 | if os.path.exists(csv_out) and not overwrite:
78 | df = pd.read_csv(csv_out,index_col=0)
79 | # if csv_out does not exist, create it
80 | else:
81 | df = pd.DataFrame(columns=['rmse','mae','r2','rmse_std','mae_std','r2_std'])
82 |
83 | if frac_list is None:
    84 |         # the list of training set sizes as fractions of the training pool,
    85 |         # equally spaced in log space from (100 / pool size) to 1
86 | frac_min = np.log10(100/X_pool.shape[0])
87 | frac_list = np.logspace(frac_min,0,n_frac)
88 |
89 |
90 | for frac in frac_list:
91 | skip = False
92 | # skip if frac is close to an existing frac
93 | for frac_ in df.index:
94 | if abs(frac - frac_)/frac_ < 0.25:
95 | skip = True
96 | if skip:
97 | continue
98 |
99 | if frac * X_pool.shape[0] < 80:
100 | continue
101 |
102 | # determine the number of runs based on frac
103 | if frac < 0.01:
104 | n_run = 20
105 | elif frac < 0.05:
106 | n_run = 10 # 20
107 | elif frac >= 0.05 and frac < 0.5:
108 | n_run = 6 #10
109 | elif frac >= 0.5 and frac < 1:
110 | n_run = 4 #5
111 | else:
112 | n_run = 1
113 |
114 | n_run = max(1, int(n_run * n_run_factor))
115 |
116 | print(f'frac={frac:.3f}, n_run={n_run}')
117 |
118 | metrics_ = {}
119 | for random_state_ in range(n_run):
120 | if frac == 1:
121 | X_train, y_train = X_pool, y_pool
122 | else:
123 | X_train, _, y_train, _ = train_test_split(X_pool, y_pool, train_size=frac,
124 | random_state=random_state_ * (random_state + 5) )
125 | # train and predict
126 | _, _, metrics_[random_state_] = train_predict(model, X_train, y_train, X_test, y_test)
127 |
128 | metrics_ = pd.DataFrame(metrics_).transpose()[['rmse','mae','r2']]
129 | means = metrics_.mean(axis=0)
130 | std = metrics_.std(axis=0)
131 | std.index = [f'{col}_std' for col in std.index]
132 |
133 | # add metrics_.mean(axis=1) and metrics_.std(axis=1) to metrics[model_name]
134 | df.loc[frac] = pd.concat([means,std])
135 | print(df.loc[frac])
136 | # save the metrics
137 | df.sort_index().to_csv(csv_out, index_label='frac')
138 |
139 | return df
140 |
141 |
142 |
143 |
144 |
145 | def plot_metrics_vs_size(metrics, metrics_name,id_train,
146 | xlims=None, ylims=None,
147 | figsize=(4,4),
148 | ax_in=None,
149 | fig_ax = None,
150 | markers=None
151 | ):
152 | if fig_ax is not None:
153 | fig, ax = fig_ax
154 | elif ax_in is None:
155 | fig, ax = plt.subplots(figsize=figsize)
156 | else:
157 | fig = ax_in.get_figure()
158 | # get second y axis
159 | ax = ax_in.twinx()
160 |
161 |
162 | if markers is None:
163 | markers = {'RF':'o',
164 | 'XGB':'s',
165 | 'alignn50':'^',}
166 |
167 |
168 | for model_name in metrics.keys():
169 | ax.errorbar(metrics[model_name].index, metrics[model_name][metrics_name],
170 | yerr=metrics[model_name][f'{metrics_name}_std'],
171 | fmt=f'-{markers[model_name]}',markersize=5,label=model_name, capsize=3)
172 |
173 | if xlims is None:
174 | xlims = [100/len(id_train)*0.9, 1]
175 | ax.set_xlim(xlims)
176 |
177 | if ylims is None:
178 | ylims = [0.4, 1]
179 | ax.set_ylim(ylims)
180 |
181 |
182 |
183 | ax.set_xscale('log')
184 | ax.set_xlabel('Fraction of the full training set')
185 |
186 | # add the upper x axis for the number of training data
187 | ax2 = ax.twiny()
188 | xlims = ax.get_xlim()
189 | ax2.set_xlim([xlims[0]*len(id_train), xlims[1]*len(id_train)])
190 | ax2.set_xscale('log')
191 | ax2.set_xlabel('Training set size')
192 | ax2.tick_params(axis='x', which='major', pad=0)
193 |
194 | if metrics_name == 'r2':
195 | ax.set_ylabel('$R^2$')
196 | else:
197 | ax.set_ylabel(f'{metrics_name.upper()} (eV/atom)')
198 |
199 | if ax_in is None:
200 | ax.legend(loc='upper center')
201 | ax.grid(linewidth=0.1)
202 | return fig, ax
203 |
204 |
205 |
206 | def eval_ood(model, X_train, y_train, X_test, y_test,title=None, id_small=None, id_large=None):
207 | model, y_pred, metrics = train_predict(model, X_train, y_train, X_test, y_test)
208 | # parity_plot(y_test, y_pred, title=None, metrics=metrics)
209 | # Figure
210 | fig, ax = plt.subplots(figsize=(4,4))
211 |
212 | index_small = list( set(y_test.index) & set(id_small) )
213 | index_large = list( set(y_test.index) & set(id_large) )
214 | if len(index_small) > 0:
215 | ax.plot(y_test.loc[index_small], y_pred.loc[index_small],'.', label='small')
216 | if len(index_large) > 0:
217 | ax.plot(y_test.loc[index_large], y_pred.loc[index_large],'.', label='large')
218 | ax.plot(np.linspace(-10,10),np.linspace(-10,10),'k')
219 |
220 | # xlim = [np.min(y_test)-0.1,np.max(y_test)+0.1]
221 | xlim = [-0.75,0.75]
222 | ylim = xlim
223 |
224 | ax.set_xlim(xlim)
225 | ax.set_ylim(ylim)
226 | ax.set_xlabel('DFT (eV/atom)')
227 | ax.set_ylabel('ML prediction (eV/atom)')
228 | ax.legend()
229 | rmse, mae, r2 = metrics['rmse'], metrics['mae'], metrics['r2']
230 | text = f'RMSE={rmse:.3f}, MAE={mae:.3f}, $R^2$={r2:.3f}'
231 | ax.text(min(xlim)+0.1,min(xlim), text, ha="left", va="bottom", color="k")
232 | if title is not None:
233 | plt.title(title)
234 | plt.tight_layout()
235 |
236 |
237 | def get_mad_std(s):
238 | # calculate the mean absolute deviation and STD of the series s
239 | # mean absolute deviation
240 | mad = (s - s.mean()).abs().mean()
241 | print(f'mean absolute deviation: {mad:.4f}')
242 | # standard deviation
243 | print(f'standard deviation: {s.std():.4f}')
244 |
245 |
246 |
247 | # %%
248 |
--------------------------------------------------------------------------------
/codes/paper_SI_hypertune.py:
--------------------------------------------------------------------------------
1 | #%%
2 | import os
3 | import numpy as np
4 | import pandas as pd
5 | from sklearn.pipeline import Pipeline
6 | from sklearn.impute import SimpleImputer
7 | from sklearn.ensemble import RandomForestRegressor
8 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
9 | from sklearn.linear_model import LinearRegression,Lasso,Ridge
10 | from sklearn.model_selection import GridSearchCV, cross_val_predict
11 | from matplotlib import pyplot as plt
12 | from sklearn.preprocessing import StandardScaler
13 | import xgboost as xgb
14 | from sklearn.model_selection import train_test_split
15 | import random
16 | from func_tmp import train_predict, parity_plot, plot_metrics_vs_size, perf_vs_size,eval_ood,get_mad_std
   17 | # from distill import return_model  # not used in this script; distill.py is not included in the repo
18 |
19 | random_state = 1
20 | overwrite = True # whether to overwrite the existing results
21 |
22 | #%% Import data
23 | struct='structure'
24 | # dat_type = 'graphs'
25 | dat_type = 'featurized'
26 | df = pd.read_pickle(f'data/{struct}_{dat_type}.dat_reduced.pkl')
27 | df = df.dropna()
28 |
29 | csv_dir = f'csv/{struct}'
30 | if not os.path.exists(csv_dir):
31 | os.makedirs(csv_dir)
32 |
33 | #%% Define X and y
34 | if dat_type == 'featurized':
35 | nfeatures = 273
36 | cols_feat = df.columns[-nfeatures:]
37 | # col2keep, col2drop = get_col2drop(df[cols_feat], cutoff=0.75,method='spearman')
38 | # cols_feat = col2keep
39 |
40 |
41 | X_all = df[cols_feat]
42 | # drop features whose variance is zero
43 | X_all = X_all.loc[:,X_all.var()!=0]
44 | # # standardize the features and turn it into a df by keeping the index and column names
45 | # X_std = pd.DataFrame((X_all-X_all.mean())/X_all.std(), index=X_all.index, columns=X_all.columns)
46 | else:
47 | X_all = df[f'graphs_{struct}']
48 |
49 | y_all = df['Ef_per_atom']
50 |
51 |
52 | #%% Get training test sets
53 | X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=random_state)
54 |
55 | pipe={}
56 |
57 | #%% Hyperparameter tuning
58 |
59 | csv_out = csv_dir + '/hypersearch.cv_results_RF.csv'
60 | if not os.path.exists(csv_out) or overwrite:
61 |
62 | pipe['RF'] = Pipeline([
63 | ('scaler', StandardScaler()),
64 | ('model', RandomForestRegressor(n_jobs=-1, random_state=1))
65 | ])
66 |
67 | # Make a function that does the hyperparameter tuning for the model pipe['RF']
68 | # and returns the best hyperparameters
69 | hyperparams = {'model__bootstrap': [True, False],
70 | 'model__n_estimators': [50, 100, 150, 200],
71 | 'model__max_features': [0.45, 0.3, 0.2, 0.1],
72 | 'model__max_depth': [5, 10, 15, 20, None],
73 | }
74 |
75 | # Use GridSearchCV to find the best hyperparameters based on MAE and the corresponding scores
76 | hypersearch = GridSearchCV(pipe['RF'],
77 | hyperparams,
78 | cv=5,
79 | scoring='neg_mean_absolute_error',
80 | verbose=3).fit(X_all, y_all)
81 | best_params, best_scores = hypersearch.best_params_, hypersearch.best_score_
82 |
83 | # Save all the tested hyperparameters, scores, and the associated time to a csv file
84 | results = pd.DataFrame(hypersearch.cv_results_)
85 | results.to_csv(csv_out)
86 |
87 | # Save the best hyperparameters and the corresponding score to a csv file
88 | pd.DataFrame({'best_params': [best_params], 'best_scores': [best_scores]}).to_csv('csv/best_params_RF.csv')
89 |
90 | #%%
91 | csv_out = csv_dir + '/hypersearch.cv_results_XGB.csv'
92 | if not os.path.exists(csv_out) or overwrite:
93 | pipe['XGB'] = Pipeline([
94 | ('scaler', StandardScaler()),
95 | ('model', xgb.XGBRegressor(
96 | n_estimators=2000,
97 | learning_rate=0.1,
98 | reg_lambda=0, # L2 regularization
99 | reg_alpha=0.1,# L1 regularization
100 | num_parallel_tree=1, # set >1 for boosted random forest
101 | tree_method='gpu_hist', gpu_id=0))
102 | ])
103 |
104 | # Make a function that does the hyperparameter tuning for the model pipe['XGB']
105 | # and returns the best hyperparameters
106 | hyperparams = {'model__n_estimators': [500, 1000, 2000, 3000],
107 | 'model__learning_rate': [0.1, 0.2, 0.3, 0.4],
108 | 'model__colsample_bytree': [0.3, 0.5, 0.7, 0.9],
109 | 'model__colsample_bylevel': [0.3, 0.5, 0.7, 0.9],
110 | 'model__num_parallel_tree': [4, 6, 8, 10],
111 | }
112 | # Use GridSearchCV to find the best hyperparameters based on MAE and the corresponding scores
113 | hypersearch = GridSearchCV(pipe['XGB'], hyperparams, cv=5, scoring='neg_mean_absolute_error',verbose=3).fit(X_all, y_all)
114 | best_params, best_scores = hypersearch.best_params_, hypersearch.best_score_
115 |
116 | # Save all the tested hyperparameters, scores, and the associated time to a csv file
117 | results = pd.DataFrame(hypersearch.cv_results_)
  118 |     results.to_csv(csv_out)
119 |
120 | # Save the best hyperparameters and the corresponding score to a csv file
121 | pd.DataFrame({'best_params': [best_params], 'best_scores': [best_scores]}).to_csv('csv/best_params_XGB.csv')
122 |
123 |
124 |
--------------------------------------------------------------------------------
/codes/paper_train_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 |
5 |
6 |
7 | """
8 |
9 | #%%
10 | import os
11 | import numpy as np
12 | import pandas as pd
13 | from sklearn.pipeline import Pipeline
14 | from sklearn.ensemble import RandomForestRegressor
15 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
16 | from matplotlib import pyplot as plt
17 | from sklearn.preprocessing import StandardScaler
18 | import xgboost as xgb
19 | from sklearn.model_selection import train_test_split
20 | from func_tmp import plot_metrics_vs_size, perf_vs_size,eval_ood,get_mad_std
21 |
22 |
23 | random_state = 1
24 | overwrite = False # whether to overwrite the existing results
25 |
26 |
27 | #Figure setting
28 | figsize = (3.6,3.6)
29 |
30 | #%% Define the models
31 |
32 | pipe={}
33 |
34 | # more expensive models
35 | pipe['RF'] = Pipeline([
36 | ('scaler', StandardScaler()),
37 | ('model', RandomForestRegressor(n_estimators=100,
38 | bootstrap=False,
39 | max_features = 1/3,
40 | n_jobs=-1, random_state=random_state))
41 | ])
42 |
43 | pipe['XGB'] = Pipeline([
44 | ('scaler', StandardScaler()),
45 | ('model', xgb.XGBRegressor(
46 | n_estimators=500,
47 | learning_rate=0.4,
48 | reg_lambda=0.01,reg_alpha=0.1,
49 | colsample_bytree=0.5,colsample_bylevel=0.7,
50 | num_parallel_tree=6,
51 | tree_method='gpu_hist', gpu_id=0)
52 | )
53 | ])
54 |
55 | epochs=50
56 | modelname = f'alignn{epochs}'
  57 | pipe[modelname] = None  # return_model(modelname,random_state,alignn_epoch=epochs) requires the distill module, not included in this repo
58 |
59 |
60 | #%% Import data
61 | for struct in ['structure_ini']:
62 | # dat_type = 'graphs'
63 | dat_type = 'featurized'
64 | df = pd.read_pickle(f'data/{struct}_{dat_type}.dat_reduced.pkl')
65 | df = df.dropna()
66 |
67 | csv_dir = f'csv/{struct}'
68 | if not os.path.exists(csv_dir):
69 | os.makedirs(csv_dir)
70 |
71 |
72 |
73 | #%% Define X and y
74 |
75 | if dat_type == 'featurized':
76 | # drop features with high correlation
77 | # from myfunc import get_col2drop
78 | nfeatures = 273
79 | cols_feat = df.columns[-nfeatures:]
80 | # col2keep, col2drop = get_col2drop(df[cols_feat], cutoff=0.75,method='spearman')
81 | # cols_feat = col2keep
82 |
83 |
84 | X_all = df[cols_feat]
85 | # drop features whose variance is zero
86 | X_all = X_all.loc[:,X_all.var()!=0]
87 | # # standardize the features and turn it into a df by keeping the index and column names
88 | # X_std = pd.DataFrame((X_all-X_all.mean())/X_all.std(), index=X_all.index, columns=X_all.columns)
89 | else:
90 | X_all = df[f'graphs_{struct}']
91 |
92 | y_all = df['Ef_per_atom']
93 |
94 | get_mad_std(y_all)
95 |
96 |
97 | #%% define index
98 |
99 | # according to lattice
100 | id_bcc = df[df['lattice']=='bcc'].index.tolist()
101 | id_fcc = df[df['lattice']=='fcc'].index.tolist()
102 |
103 | print('bcc:',len(id_bcc))
104 | get_mad_std(y_all[id_bcc])
105 | print('fcc:',len(id_fcc))
106 | get_mad_std(y_all[id_fcc])
107 |
108 | # according to nelements
109 | id_nele = {}
110 | for nele in [2,3,4,5,6,7]:
111 | id_nele[nele] = df[df['nelements']==nele].index.tolist()
112 |
113 | id_loworder = df[df['nelements']<=3].index.tolist()
114 | id_highorder = df[df['nelements']>3].index.tolist()
115 |
116 |
117 | # according to NIONS
118 | id_all = df.index.tolist()
119 | id_small = df[df['NIONS']<=8].index.tolist()
120 | id_large = df[df['NIONS']>8].index.tolist()
121 | print('small:',len(id_small))
122 | get_mad_std(y_all[id_small])
123 | print('large:',len(id_large))
124 | get_mad_std(y_all[id_large])
125 |
126 | #%%
127 | # calculate the composition based on reduced_formula
128 | from pymatgen.core.composition import Composition
129 | composition = df['reduced_formula'].apply(lambda x: Composition(x))
130 |
131 | def get_el_frac(x):
132 | el_frac_list = list(x.get_el_amt_dict().values())
133 | tot = sum(el_frac_list)
134 | el_frac = np.array([i/tot for i in el_frac_list])
135 | return el_frac
136 |
137 | el_frac = composition.apply(get_el_frac)
138 |
139 | # For each composition, calculate the max fractional concentration and min fractional concentration
140 | df['max_c'] = el_frac.apply(max)
141 | df['min_c'] = el_frac.apply(min)
142 | df['diff_c'] = df['max_c'] - df['min_c']
143 | df['std_c'] = el_frac.apply(np.std)
144 |
145 |
146 | #%%
147 |
148 | '''
149 | First, evaluate the interpolation performance
150 | '''
151 |
152 | frac_list_alignn = [1,0.25,0.1,0.05,0.01]
153 |
154 | model_list = [f'alignn{epochs}','XGB','RF']
155 | # csv_dir_ = f'csv/{struct}'
156 | csv_dir_ = f'csv/structure_ini'
157 |
158 |
159 | for scope in ['all']: # ,'large','small'
160 | if scope == 'all':
161 | index = id_all
162 | elif scope == 'small':
163 | index = id_small
164 | elif scope == 'large':
165 | index = id_large
166 |
167 | test_size = 0.2
168 |
169 | X_pool, X_test, y_pool, y_test = train_test_split(
170 | X_all.loc[index], y_all.loc[index],
171 | test_size=test_size,
172 | random_state=random_state,
173 | )
174 |
175 | # get the performance vs. training set size
176 | metrics = {}
177 | for model_name in pipe.keys():
178 | csv_out = f'{csv_dir_}/size_effect_rand_split_{scope}_{model_name}.csv'
179 | # skip if the csv file already exists
180 | if os.path.exists(csv_out) and not overwrite:
181 | # read the csv file
182 | metrics[model_name] = pd.read_csv(csv_out, index_col=0)
183 | # MAD of the test set
 184 |             mad = (y_test - y_test.mean()).abs().mean()
185 | metrics[model_name]['mae/mad'] = metrics[model_name]['mae']/mad
186 | metrics[model_name]['mae/mad_std'] = metrics[model_name]['mae_std']/mad
187 | continue
188 |
189 | if 'alignn' in model_name:
190 | frac_list = frac_list_alignn
191 | n_run_factor = 0.5
192 | else:
193 | frac_list = None
194 | n_run_factor = 1
195 |
196 | metrics[model_name] = perf_vs_size(
197 | pipe[model_name],
198 | X_pool, y_pool,
199 | X_test, y_test,
200 | csv_out,
201 | overwrite=overwrite,
202 | frac_list=frac_list,
203 | n_run_factor=n_run_factor,
204 | )
205 |
206 | # print the performance of full model
207 | mae = metrics[model_name].iloc[-1]['mae']
208 | r2 = metrics[model_name].iloc[-1]['r2']
209 | print(f'{scope} {model_name} {mae} {r2} ')
210 |
211 |
212 | fig, axs = plt.subplots(figsize=(3.25*2,2.5), ncols=2,
213 | gridspec_kw={'wspace':0.325}
214 | )
215 | ax = axs[0]
216 | fig, ax = plot_metrics_vs_size(metrics, 'mae/mad', X_pool.index,
217 | figsize=(3.5,3),
218 | ylims=[0.05,0.45],
219 | xlims=[5e-3,1],
220 | fig_ax=(fig,ax),
221 | )
222 | ax.set_xlabel('Fraction of training pool')
223 | ax.set_ylabel(f'MAE/MAD')
224 | ax.legend(
225 | # set legend label
226 | ['RF','XGB','ALIGNN']
227 | )
228 | ax.text(-0.15,1.05,'(a)',transform=ax.transAxes,fontsize=12)
229 | # add grid
230 | # ax.grid(which='both', axis='both')
231 |
232 |
233 |
234 | # ax.set_yticks(np.arange(0.05,0.5,0.05))
235 |
236 | ax = axs[1]
237 | fig, ax = plot_metrics_vs_size(metrics, 'r2', X_pool.index,
238 | ylims=[0.75,1],
239 | xlims=[5e-3,1],
240 | # ax_in=ax,
241 | fig_ax=(fig,ax),
242 | )
243 | ax.legend(
244 | # set legend label
245 | ['RF','XGB','ALIGNN']
246 | )
247 |
248 | ax.set_xlabel('Fraction of training pool')
249 | ax.set_ylabel(r'$R^2$')
250 | ax.text(-0.15,1.05,'(b)',transform=ax.transAxes,fontsize=12)
251 |
252 | fig.savefig(f'figs/{struct}_size_effect_rand_split_{scope}.pdf',bbox_inches='tight')
253 |
254 |
255 | #%%
256 |
257 |
258 | frac_list_alignn = [1,0.25,0.1,0.05,0.01]
259 |
260 | fig, axs = plt.subplots(figsize=(2.75*2,2.5), ncols=2,
261 | gridspec_kw={'wspace':0.035}
262 | )
263 | ax = axs[0]
264 |
265 | # csv_dir_ = f'csv/{struct}'
266 | scope = 'all'
267 | for i, csv_dir_ in enumerate(['csv/structure','csv/structure_ini']):
268 | metrics = {}
269 | for model_name in pipe.keys():
270 | if scope == 'all':
271 | csv_out = f'{csv_dir_}/size_effect_rand_split_{scope}_{model_name}.csv'
272 | else:
273 | csv_out = f'{csv_dir_}/size_effect_{scope}_{model_name}.csv'
274 | # skip if the csv file already exists
275 | if os.path.exists(csv_out) and not overwrite:
276 | # read the csv file
277 | metrics[model_name] = pd.read_csv(csv_out, index_col=0)
278 | # MAD of the test set
 279 |             mad = (y_test - y_test.mean()).abs().mean()
280 | metrics[model_name]['mae/mad'] = metrics[model_name]['mae']/mad
281 | metrics[model_name]['mae/mad_std'] = metrics[model_name]['mae_std']/mad
282 |
283 | ax = axs[i]
284 | fig, ax = plot_metrics_vs_size(metrics, 'mae/mad', X_pool.index,
285 | figsize=(3.5,3),
286 | ylims=[0.05,0.425],
287 | xlims=[5e-3,1],
288 | fig_ax=(fig,ax),
289 | )
290 | ax.set_xlabel('Fraction of training pool')
291 | ax.legend(
292 | # set legend label
293 | ['RF','XGB','ALIGNN'],loc='upper right'
294 | )
295 | # ax.text(-0.15,1.05,f'({chr(ord("a")+i)})',transform=ax.transAxes,fontsize=12)
296 | if i == 0:
297 | label = 'Relaxed'
298 | ax.set_ylabel(f'MAE/MAD')
299 |
300 | else:
301 | label = 'Unrelaxed'
302 | # disable y label and y ticklabels
303 | ax.set_ylabel('')
304 | ax.set_yticklabels([])
305 |
306 | ax.text(0.05,0.05,f'({chr(ord("a")+i)}) {label}',transform=ax.transAxes,fontsize=12)
307 |
308 | ax.grid()
309 | fig.savefig('figs/size_effect_rand_split_all.pdf',bbox_inches='tight')
310 |
311 |
312 |
313 |
314 |
315 |
316 |
317 | #%%
318 | '''
319 | Next, evaluate the extrapolation performance
320 |
321 | '''
322 | for scope in [
323 | 'small2large',
324 | 'low2high',
325 | 'large2small',
326 | 'high2low', # XGB is a bit strange
327 |
328 | # 'bcc2fcc',
329 | # 'fcc2bcc', # This is strange; need to figure out later
330 | ]:
331 | if scope == 'small2large':
332 | id_train = id_small
333 | id_test = id_large
334 | elif scope == 'large2small':
335 | id_train = id_large
336 | id_test = id_small
337 | elif scope == 'low2high':
338 | id_train = id_loworder
339 | id_test = id_highorder
340 | elif scope == 'high2low':
341 | id_train = id_highorder
342 | id_test = id_loworder
343 | elif scope == 'bcc2fcc':
344 | id_train = id_bcc
345 | id_test = id_fcc
346 | elif scope == 'fcc2bcc':
347 | id_train = id_fcc
348 | id_test = id_bcc
349 |
350 |
351 | metrics = {}
352 |
353 | # define the training and test sets
354 | X_pool, y_pool = X_all.loc[id_train], y_all.loc[id_train]
355 | X_test, y_test = X_all.loc[id_test], y_all.loc[id_test]
356 |
 357 |     mad = (y_test - y_test.mean()).abs().mean()
358 | print(f'{scope} MAD of the test set: {mad}')
359 |
360 | # get the performance vs. training set size
361 | for model_name in model_list:
362 | csv_out = f'{csv_dir}/size_effect_{scope}_{model_name}.csv'
363 | # skip if the csv file already exists
364 | # if os.path.exists(csv_out) and not overwrite:
365 | # # read the csv file
366 | # metrics[model_name] = pd.read_csv(csv_out, index_col=0)
367 | # # MAD of the test set
368 | # mad = y_test.mad()
369 | # metrics[model_name]['mae/mad'] = metrics[model_name]['mae']/mad
370 | # metrics[model_name]['mae/mad_std'] = metrics[model_name]['mae_std']/mad
371 | # continue
372 |
373 | if 'alignn' in model_name:
374 | frac_list = frac_list_alignn
375 | n_run_factor = 0.5
376 | else:
377 | frac_list = None
378 | n_run_factor = 1
379 |
380 | metrics[model_name] = perf_vs_size(
381 | pipe[model_name],
382 | X_pool, y_pool,
383 | X_test, y_test,
384 | csv_out,
385 | overwrite=overwrite,
386 | frac_list=frac_list,
387 | # frac_list = [0.5],
388 | n_run_factor=n_run_factor,
389 | )
390 | # print the performance of full model
391 | mae = metrics[model_name].iloc[-1]['mae']
392 | r2 = metrics[model_name].iloc[-1]['r2']
393 | print(f'{scope} {model_name} {mae} {r2} ')
394 |
395 | # plot performance vs. training set size
396 | # fig, ax = plot_metrics_vs_size(metrics, 'mae', X_pool.index,
397 | # ylims=[0.0,0.07],
398 | # )
399 |
400 | # fig, ax = plot_metrics_vs_size(metrics, 'r2', X_pool.index,
401 | # ylims=[0.5,1],
402 | # )
403 | # # add legend title
404 | # ax.legend(title=scope, loc='lower right')
405 | # ax.grid(which='both', axis='both', ls=':')
406 |
407 |
408 |
409 |
410 |
411 |
412 |
413 |
414 |
415 |
416 | #%%
417 |
418 | # show the accumulated count based on diff_c
419 | # df['diff_c'].hist(cumulative=True, density=1, bins=1000)
420 |
421 | for model_name in pipe.keys():
422 | model = pipe[model_name]
423 |
424 | for max_diff_c in [0.]:# ,0.15,0.2,0.3,0.4,0.5, 0.6
425 | id_train = df[df['diff_c']<=max_diff_c].index.tolist()
426 | id_test = df[df['diff_c']>max_diff_c].index.tolist()
427 | X_train, y_train = X_all.loc[id_train], y_all.loc[id_train]
428 | X_test, y_test = X_all.loc[id_test], y_all.loc[id_test]
429 | # # MAD
430 | # mad = y_test.mad()
431 | # print(f'{model_name} {max_diff_c} {mad}')
432 | model.fit(X_train, y_train)
433 | y_pred = pd.Series(model.predict(X_test), index=y_test.index)
434 |
435 | y_err = (y_pred-y_test).abs()
436 |
437 | id_test_large = df[(df['diff_c']>max_diff_c) & (df['NIONS']>8)].index.tolist()
438 | id_test_small = df[(df['diff_c']>max_diff_c) & (df['NIONS']<=8)].index.tolist()
439 |
440 | # get mean absolute error
441 | mae = y_err.mean()
442 | # mae_large = y_err.loc[id_test_large].mean()
443 | # mae_small = y_err.loc[id_test_small].mean()
444 | # get r2
445 | r2 = r2_score(y_test, y_pred)
446 | # r2_large = r2_score(y_test.loc[id_test_large], y_pred.loc[id_test_large])
447 | # r2_small = r2_score(y_test.loc[id_test_small], y_pred.loc[id_test_small])
448 |
449 | print(f'{model_name} {max_diff_c} {mae}')
450 | # print(f'{model_name} {max_diff_c} {r2} {r2_large} {r2_small}')
451 |
452 |
453 |
454 |
--------------------------------------------------------------------------------
/codes/readme.md:
--------------------------------------------------------------------------------
   1 | The raw data can be downloaded from [Zenodo](https://doi.org/10.5281/zenodo.10854500), which also provides the Matminer features of the initial and final structures and a demo script to train tree-based models. The results in the paper can be readily reproduced by adapting the demo script to different train-test splits (which is essentially what `paper_train_model.py` does). To use `paper_train_model.py` directly, you may want to first place the data in the appropriate `data` folder and convert the csv files to pkl format, as sketched below.
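
A minimal conversion sketch, assuming the csv files downloaded from Zenodo sit under `data/`. The pickle names follow the pattern read by `paper_train_model.py` and `paper_SI_hypertune.py`; note that the `*_reduced.pkl` files used for the paper may contain a reduced column set, whereas this simply re-saves the full csv under that name:

```python
import pandas as pd

# Convert the downloaded csv files into the pickle files read by the training scripts
for struct in ['structure', 'structure_ini']:
    df = pd.read_csv(f'data/{struct}_featurized.dat_all.csv', index_col=0)
    df.to_pickle(f'data/{struct}_featurized.dat_reduced.pkl')
```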
2 |
--------------------------------------------------------------------------------
/figs/counts_vs_elements.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/counts_vs_elements.png
--------------------------------------------------------------------------------
/figs/fig2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig2.png
--------------------------------------------------------------------------------
/figs/fig3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig3.png
--------------------------------------------------------------------------------
/figs/fig4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig4.png
--------------------------------------------------------------------------------
/figs/fig5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/fig5.png
--------------------------------------------------------------------------------
/figs/table2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mathsphy/high-entropy-alloys-dataset-ML/dc62220ff35a31f75742201c3edac9a29b9c2c97/figs/table2.png
--------------------------------------------------------------------------------