├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── bgnn
│   ├── models
│   │   ├── BGNN.py
│   │   ├── Base.py
│   │   ├── GBDT.py
│   │   ├── GNN.py
│   │   ├── MLP.py
│   │   └── __init__.py
│   └── scripts
│       ├── run.py
│       └── utils.py
├── configs
│   └── model
│       ├── bgnn.yaml
│       ├── catboost.yaml
│       ├── gnn.yaml
│       ├── lightgbm.yaml
│       ├── mlp.yaml
│       └── resgnn.yaml
├── datasets.zip
├── models
│   ├── BGNN.py
│   ├── Base.py
│   ├── GBDT.py
│   ├── GNN.py
│   ├── MLP.py
│   └── __init__.py
├── requirements.txt
├── scripts
│   ├── run.py
│   └── utils.py
└── setup.py

/.gitignore:
--------------------------------------------------------------------------------
1 | datasets/
2 | *pycache*
3 | results/
4 | *egg*
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 russellsparadox
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | install: ## [Local development, CPU] Upgrade pip, install requirements, install package.
2 | 	python -m pip install -U pip setuptools wheel
3 | 	python -m pip install -r requirements.txt
4 | 	python -m pip install -e .
5 | 
6 | .PHONY: help
7 | 
8 | help: ## Run `make help` to get help on the make commands
9 | 	@grep -E '^[0-9a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### Boosted Graph Neural Networks
2 | The code and data for the ICLR 2021 paper: [Boost then Convolve: Gradient Boosting Meets Graph Neural Networks](https://openreview.net/pdf?id=ebS5NUfoMKL)
3 | 
4 | This code contains implementations of the following models for graphs:
5 | * **CatBoost**
6 | * **LightGBM**
7 | * **Fully-Connected Neural Network** (FCNN)
8 | * **GNN** (GAT, GCN, AGNN, APPNP)
9 | * **FCNN-GNN** (GAT, GCN, AGNN, APPNP)
10 | * **ResGNN** (CatBoost + {GAT, GCN, AGNN, APPNP})
11 | * **BGNN** (end-to-end {CatBoost + {GAT, GCN, AGNN, APPNP}})
12 | 
13 | ## Installation
14 | To run the models, you have to download the repo, install the requirements, and extract the datasets.
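For reference, the whole CPU-only setup condenses to the commands below (a sketch that simply chains the steps explained in the following subsections; if you have a GPU, swap the `+cpu` wheels and the `dgl` package for the CUDA builds listed later):
```bash
mkdir envs && cd envs
python -m venv bgnn_env
source bgnn_env/bin/activate
cd ..
git clone https://github.com/nd7141/bgnn.git
cd bgnn
unzip datasets.zip
make install
pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install dgl
```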
15 | 
16 | First, let's create a Python environment:
17 | ```bash
18 | mkdir envs
19 | cd envs
20 | python -m venv bgnn_env
21 | source bgnn_env/bin/activate
22 | cd ..
23 | ```
24 | ---
25 | Second, let's download the code and install the requirements:
26 | ```bash
27 | git clone https://github.com/nd7141/bgnn.git
28 | cd bgnn
29 | unzip datasets.zip
30 | make install
31 | ```
32 | ---
33 | Next, we need to install proper versions of [PyTorch](https://pytorch.org/) and [DGL](https://www.dgl.ai/), depending on the CUDA version of your machine.
34 | We strongly encourage you to use GPU-supported versions of DGL (the speed-up in training can be 100x).
35 | 
36 | First, determine your CUDA version with `nvcc --version`.
37 | Then, check the installation instructions for [PyTorch](https://pytorch.org/get-started/locally/).
38 | For example, for CUDA version 9.2, install it as follows:
39 | ```bash
40 | pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
41 | ```
42 | 
43 | If you don't have a GPU, use the following:
44 | ```bash
45 | pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
46 | ```
47 | ---
48 | Similarly, you need to install the [DGL library](https://docs.dgl.ai/en/0.4.x/install/).
49 | For example, for cuda==9.2:
50 | 
51 | ```bash
52 | pip install dgl-cu92
53 | ```
54 | 
55 | For the CPU version of DGL:
56 | ```bash
57 | pip install dgl
58 | ```
59 | 
60 | Tested versions of `torch` and `dgl` are:
61 | * torch==1.7.1+cu92
62 | * dgl_cu92==0.5.3
63 | 
64 | ## Running
65 | The starting point is the file `scripts/run.py`:
66 | ```bash
67 | python scripts/run.py dataset models
68 | (optional)
69 | --save_folder: str = None
70 | --task: str = 'regression',
71 | --repeat_exp: int = 1,
72 | --max_seeds: int = 5,
73 | --dataset_dir: str = None,
74 | --config_dir: str = None
75 | ```
76 | Available options for dataset:
77 | * house (regression)
78 | * county (regression)
79 | * vk (regression)
80 | * wiki (regression)
81 | * avazu (regression)
82 | * vk_class (classification)
83 | * house_class (classification)
84 | * dblp (classification)
85 | * slap (classification)
86 | * path/to/your/dataset
87 | 
88 | Available options for models are `catboost`, `lightgbm`, `gnn`, `resgnn`, `bgnn`, and `all`.
89 | 
90 | Each model is specified by its config. Check the [`configs/`](https://github.com/nd7141/bgnn/tree/master/configs/model) folder to set the parameters of the model and the run.
91 | 
92 | Upon completion, the results will be saved in the specified folder (default: `results/{dataset}/day_month/`).
93 | This folder will contain `aggregated_results.json` with the aggregated results for each model.
94 | Each model will have 4 numbers in this order: `mean metric` (RMSE or accuracy), `std metric`, `mean runtime`, `std runtime`.
95 | The file `seed_results.json` will have the results for each experiment and each seed.
96 | Additional folders will contain the loss values during training.
97 | 
98 | ---
99 | 
100 | ### Examples
101 | 
102 | The following script will launch all models on the `House` dataset.
103 | ```bash
104 | python scripts/run.py house all
105 | ```
106 | 
107 | The following script will launch the CatBoost and GNN models on the `SLAP` classification dataset.
108 | ```bash
109 | python scripts/run.py slap catboost gnn --task classification
110 | ```
111 | 
112 | The following script will launch the LightGBM model on 5 splits of the data, repeating each experiment 3 times.
113 | ```bash
114 | python scripts/run.py vk lightgbm --repeat_exp 3 --max_seeds 5
115 | ```
116 | 
117 | The following script will launch the resgnn and bgnn models, saving results to a custom folder.
118 | ```bash
119 | python scripts/run.py county resgnn bgnn --save_folder ./county_resgnn_bgnn
120 | ```
121 | 
122 | ### Running on your dataset
123 | To run the code on your dataset, you need to prepare the files in the right format.
124 | 
125 | You can check examples in the `datasets/` folder.
126 | 
127 | There should be at least `X.csv` (node features), `y.csv` (target labels), and `graph.graphml` (the graph in GraphML format).
128 | 
129 | Make sure to keep _these_ filenames for your dataset.
130 | 
131 | You can also have `cat_features.txt` specifying the names of categorical columns.
132 | 
133 | You can also have `masks.json` specifying train/val/test splits.
134 | 
135 | After that, run the script as usual:
136 | ```bash
137 | python scripts/run.py path/to/your/dataset gnn catboost
138 | ```
139 | 
140 | ## Citation
141 | ```
142 | @inproceedings{
143 | ivanov2021boost,
144 | title={Boost then Convolve: Gradient Boosting Meets Graph Neural Networks},
145 | author={Sergei Ivanov and Liudmila Prokhorenkova},
146 | booktitle={International Conference on Learning Representations (ICLR)},
147 | year={2021},
148 | url={https://openreview.net/forum?id=ebS5NUfoMKL}
149 | }
150 | ```
151 | 
--------------------------------------------------------------------------------
/bgnn/models/BGNN.py:
--------------------------------------------------------------------------------
1 | import itertools
2 | import time
3 | import numpy as np
4 | import torch
5 | 
6 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor, sum_models
7 | from .GNN import GNNModelDGL, GATDGL
8 | from .Base import BaseModel
9 | from tqdm import tqdm
10 | from collections import defaultdict as ddict
11 | 
12 | class BGNN(BaseModel):
13 |     def __init__(self,
14 |                  task='regression', iter_per_epoch=10, lr=0.01, hidden_dim=64, dropout=0.,
15 |                  only_gbdt=False, train_non_gbdt=False,
16 |                  name='gat', use_leaderboard=False, depth=6, gbdt_lr=0.1):
17 |         super(BaseModel, self).__init__()
18 |         self.learning_rate = lr
19 |         self.hidden_dim = hidden_dim
20 |         self.task = task
21 |         self.dropout = dropout
22 |         self.only_gbdt = only_gbdt
23 |         self.train_residual = train_non_gbdt
24 |         self.name = name
25 |         self.use_leaderboard = use_leaderboard
26 |         self.iter_per_epoch = iter_per_epoch
27 |         self.depth = depth
28 |         self.lang = 'dgl'
29 |         self.gbdt_lr = gbdt_lr
30 | 
31 |         self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
32 | 
33 |     def __name__(self):
34 |         return 'BGNN'
35 | 
36 |     def init_gbdt_model(self, num_epochs, epoch):
37 |         if self.task == 'regression':
38 |             catboost_model_obj = CatBoostRegressor
39 |             catboost_loss_fn = 'RMSE'  # 'RMSEWithUncertainty'
40 |         else:
41 |             if epoch == 0:
42 |                 catboost_model_obj = CatBoostClassifier
43 |                 catboost_loss_fn = 'MultiClass'
44 |             else:
45 |                 catboost_model_obj = CatBoostRegressor
46 |                 catboost_loss_fn = 'MultiRMSE'
47 | 
48 |         return catboost_model_obj(iterations=num_epochs,
49 |                                   depth=self.depth,
50 |                                   learning_rate=self.gbdt_lr,
51 |                                   loss_function=catboost_loss_fn,
52 |                                   random_seed=0,
53 |                                   nan_mode='Min')
54 | 
55 |     def fit_gbdt(self, pool, trees_per_epoch, epoch):
56 |         gbdt_model = self.init_gbdt_model(trees_per_epoch, epoch)
57 |         gbdt_model.fit(pool, verbose=False)
58 |         return gbdt_model
59 | 
60 |     def init_gnn_model(self):
61 |         if self.use_leaderboard:
62 |             self.model = 
GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 63 | else: 64 | self.model = GNNModelDGL(in_dim=self.in_dim, 65 | hidden_dim=self.hidden_dim, 66 | out_dim=self.out_dim, 67 | name=self.name, 68 | dropout=self.dropout).to(self.device) 69 | 70 | def append_gbdt_model(self, new_gbdt_model, weights): 71 | if self.gbdt_model is None: 72 | return new_gbdt_model 73 | return sum_models([self.gbdt_model, new_gbdt_model], weights=weights) 74 | 75 | def train_gbdt(self, gbdt_X_train, gbdt_y_train, cat_features, epoch, 76 | gbdt_trees_per_epoch, gbdt_alpha): 77 | 78 | pool = Pool(gbdt_X_train, gbdt_y_train, cat_features=cat_features) 79 | epoch_gbdt_model = self.fit_gbdt(pool, gbdt_trees_per_epoch, epoch) 80 | if epoch == 0 and self.task=='classification': 81 | self.base_gbdt = epoch_gbdt_model 82 | else: 83 | self.gbdt_model = self.append_gbdt_model(epoch_gbdt_model, weights=[1, gbdt_alpha]) 84 | 85 | def update_node_features(self, node_features, X, encoded_X): 86 | if self.task == 'regression': 87 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 88 | # predictions = self.gbdt_model.virtual_ensembles_predict(X, 89 | # virtual_ensembles_count=5, 90 | # prediction_type='TotalUncertainty') 91 | else: 92 | predictions = self.base_gbdt.predict_proba(X) 93 | # predictions = self.base_gbdt.predict(X, prediction_type='RawFormulaVal') 94 | if self.gbdt_model is not None: 95 | predictions_after_one = self.gbdt_model.predict(X) 96 | predictions += predictions_after_one 97 | 98 | if not self.only_gbdt: 99 | if self.train_residual: 100 | predictions = np.append(node_features.detach().cpu().data[:, :-self.out_dim], predictions, 101 | axis=1) # append updated X to prediction 102 | else: 103 | predictions = np.append(encoded_X, predictions, axis=1) # append X to prediction 104 | 105 | predictions = torch.from_numpy(predictions).to(self.device) 106 | 107 | node_features.data = predictions.float().data 108 | 109 | def update_gbdt_targets(self, node_features, node_features_before, train_mask): 110 | return (node_features - node_features_before).detach().cpu().numpy()[train_mask, -self.out_dim:] 111 | 112 | def init_node_features(self, X): 113 | node_features = torch.empty(X.shape[0], self.in_dim, requires_grad=True, device=self.device) 114 | if not self.only_gbdt: 115 | node_features.data[:, :-self.out_dim] = torch.from_numpy(X.to_numpy(copy=True)) 116 | return node_features 117 | 118 | def init_node_parameters(self, num_nodes): 119 | return torch.empty(num_nodes, self.out_dim, requires_grad=True, device=self.device) 120 | 121 | def init_optimizer2(self, node_parameters, learning_rate): 122 | params = [self.model.parameters(), [node_parameters]] 123 | return torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 124 | 125 | def update_node_features2(self, node_parameters, X): 126 | if self.task == 'regression': 127 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 128 | else: 129 | predictions = self.base_gbdt.predict_proba(X) 130 | if self.gbdt_model is not None: 131 | predictions += self.gbdt_model.predict(X) 132 | 133 | predictions = torch.from_numpy(predictions).to(self.device) 134 | node_parameters.data = predictions.float().data 135 | 136 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, cat_features, 137 | num_epochs, patience, logging_epochs=1, loss_fn=None, metric_name='loss', 138 | normalize_features=True, replace_na=True, 139 | ): 140 | 141 | # initialize for early stopping and metrics 142 | if metric_name in ['r2', 
'accuracy']: 143 | best_metric = [np.float('-inf')] * 3 # for train/val/test 144 | else: 145 | best_metric = [np.float('inf')] * 3 # for train/val/test 146 | best_val_epoch = 0 147 | epochs_since_last_best_metric = 0 148 | metrics = ddict(list) 149 | if cat_features is None: 150 | cat_features = [] 151 | 152 | if self.task == 'regression': 153 | self.out_dim = y.shape[1] 154 | elif self.task == 'classification': 155 | self.out_dim = len(set(y.iloc[test_mask, 0])) 156 | # self.in_dim = X.shape[1] if not self.only_gbdt else 0 157 | # self.in_dim += 3 if uncertainty else 1 158 | self.in_dim = self.out_dim + X.shape[1] if not self.only_gbdt else self.out_dim 159 | 160 | self.init_gnn_model() 161 | 162 | gbdt_X_train = X.iloc[train_mask] 163 | gbdt_y_train = y.iloc[train_mask] 164 | gbdt_alpha = 1 165 | self.gbdt_model = None 166 | 167 | encoded_X = X.copy() 168 | if not self.only_gbdt: 169 | if len(cat_features): 170 | encoded_X = self.encode_cat_features(encoded_X, y, cat_features, train_mask, val_mask, test_mask) 171 | if normalize_features: 172 | encoded_X = self.normalize_features(encoded_X, train_mask, val_mask, test_mask) 173 | if replace_na: 174 | encoded_X = self.replace_na(encoded_X, train_mask) 175 | 176 | node_features = self.init_node_features(encoded_X) 177 | optimizer = self.init_optimizer(node_features, optimize_node_features=True, learning_rate=self.learning_rate) 178 | 179 | y, = self.pandas_to_torch(y) 180 | self.y = y 181 | if self.lang == 'dgl': 182 | graph = self.networkx_to_torch(networkx_graph) 183 | elif self.lang == 'pyg': 184 | graph = self.networkx_to_torch2(networkx_graph) 185 | 186 | self.graph = graph 187 | 188 | pbar = tqdm(range(num_epochs)) 189 | for epoch in pbar: 190 | start2epoch = time.time() 191 | 192 | # gbdt part 193 | self.train_gbdt(gbdt_X_train, gbdt_y_train, cat_features, epoch, 194 | self.iter_per_epoch, gbdt_alpha) 195 | 196 | self.update_node_features(node_features, X, encoded_X) 197 | node_features_before = node_features.clone() 198 | model_in=(graph, node_features) 199 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, 200 | optimizer, metrics, self.iter_per_epoch) 201 | gbdt_y_train = self.update_gbdt_targets(node_features, node_features_before, train_mask) 202 | 203 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 204 | metric_name=metric_name) 205 | # check early stopping 206 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 207 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 208 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 209 | if patience and epochs_since_last_best_metric > patience: 210 | break 211 | if np.isclose(gbdt_y_train.sum(), 0.): 212 | print('Nodes do not change anymore. 
Stopping...') 213 | break 214 | 215 | if loss_fn: 216 | self.save_metrics(metrics, loss_fn) 217 | 218 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 219 | return metrics 220 | 221 | def predict(self, graph, X, y, test_mask): 222 | node_features = torch.empty(X.shape[0], self.in_dim).to(self.device) 223 | self.update_node_features(node_features, X, X) 224 | return self.evaluate_model((graph, node_features), y, test_mask) -------------------------------------------------------------------------------- /bgnn/models/Base.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import torch 3 | from sklearn import preprocessing 4 | import pandas as pd 5 | import torch.nn.functional as F 6 | import numpy as np 7 | from sklearn.metrics import r2_score, accuracy_score 8 | 9 | class BaseModel(torch.nn.Module): 10 | def __init__(self): 11 | super(BaseModel, self).__init__() 12 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 13 | 14 | def pandas_to_torch(self, *args): 15 | return [torch.from_numpy(arg.to_numpy(copy=True)).float().squeeze().to(self.device) for arg in args] 16 | 17 | def networkx_to_torch(self, networkx_graph): 18 | import dgl 19 | # graph = dgl.DGLGraph() 20 | graph = dgl.from_networkx(networkx_graph) 21 | graph = dgl.remove_self_loop(graph) 22 | graph = dgl.add_self_loop(graph) 23 | graph = graph.to(self.device) 24 | return graph 25 | 26 | def networkx_to_torch2(self, networkx_graph): 27 | from torch_geometric.utils import convert 28 | import torch_geometric.transforms as T 29 | graph = convert.from_networkx(networkx_graph) 30 | transform = T.Compose([T.TargetIndegree()]) 31 | graph = transform(graph) 32 | return graph.to(self.device) 33 | 34 | def move_to_device(self, *args): 35 | return [arg.to(self.device) for arg in args] 36 | 37 | def init_optimizer(self, node_features, optimize_node_features, learning_rate): 38 | 39 | params = [self.model.parameters()] 40 | if optimize_node_features: 41 | params.append([node_features]) 42 | optimizer = torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 43 | return optimizer 44 | 45 | def log_epoch(self, pbar, metrics, epoch, loss, epoch_time, logging_epochs, metric_name='loss'): 46 | train_rmse, val_rmse, test_rmse = metrics[metric_name][-1] 47 | if epoch and epoch % logging_epochs == 0: 48 | pbar.set_description( 49 | "Epoch {:05d} | Loss {:.3f} | Loss {:.3f}/{:.3f}/{:.3f} | Time {:.4f}".format(epoch, loss, 50 | train_rmse, 51 | val_rmse, test_rmse, 52 | epoch_time)) 53 | 54 | def normalize_features(self, X, train_mask, val_mask, test_mask): 55 | min_max_scaler = preprocessing.MinMaxScaler() 56 | A = X.to_numpy(copy=True) 57 | A[train_mask] = min_max_scaler.fit_transform(A[train_mask]) 58 | A[val_mask + test_mask] = min_max_scaler.transform(A[val_mask + test_mask]) 59 | return pd.DataFrame(A, columns=X.columns).astype(float) 60 | 61 | def replace_na(self, X, train_mask): 62 | if X.isna().any().any(): 63 | return X.fillna(X.iloc[train_mask].min() - 1) 64 | return X 65 | 66 | def encode_cat_features(self, X, y, cat_features, train_mask, val_mask, test_mask): 67 | from category_encoders import CatBoostEncoder 68 | enc = CatBoostEncoder() 69 | A = X.to_numpy(copy=True) 70 | b = y.to_numpy(copy=True) 71 | A[np.ix_(train_mask, cat_features)] = enc.fit_transform(A[np.ix_(train_mask, cat_features)], b[train_mask]) 72 | A[np.ix_(val_mask + test_mask, cat_features)] = enc.transform(A[np.ix_(val_mask + 
test_mask, cat_features)]) 73 | A = A.astype(float) 74 | return pd.DataFrame(A, columns=X.columns) 75 | 76 | def train_model(self, model_in, target_labels, train_mask, optimizer): 77 | y = target_labels[train_mask] 78 | 79 | self.model.train() 80 | logits = self.model(*model_in).squeeze() 81 | pred = logits[train_mask] 82 | 83 | if self.task == 'regression': 84 | loss = torch.sqrt(F.mse_loss(pred, y)) 85 | elif self.task == 'classification': 86 | loss = F.cross_entropy(pred, y.long()) 87 | else: 88 | raise NotImplemented("Unknown task. Supported tasks: classification, regression.") 89 | 90 | optimizer.zero_grad() 91 | loss.backward() 92 | optimizer.step() 93 | return loss 94 | 95 | def evaluate_model(self, logits, target_labels, mask): 96 | metrics = {} 97 | y = target_labels[mask] 98 | with torch.no_grad(): 99 | pred = logits[mask] 100 | if self.task == 'regression': 101 | metrics['loss'] = torch.sqrt(F.mse_loss(pred, y).squeeze() + 1e-8) 102 | metrics['rmsle'] = torch.sqrt(F.mse_loss(torch.log(pred + 1), torch.log(y + 1)).squeeze() + 1e-8) 103 | metrics['mae'] = F.l1_loss(pred, y) 104 | metrics['r2'] = torch.Tensor([r2_score(y.cpu().numpy(), pred.cpu().numpy())]) 105 | elif self.task == 'classification': 106 | metrics['loss'] = F.cross_entropy(pred, y.long()) 107 | metrics['accuracy'] = torch.Tensor([(y == pred.max(1)[1]).sum().item()/y.shape[0]]) 108 | 109 | return metrics 110 | 111 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 112 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 113 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 114 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 115 | return X_train, y_train, X_val, y_val, X_test, y_test 116 | 117 | def train_and_evaluate(self, model_in, target_labels, train_mask, val_mask, test_mask, 118 | optimizer, metrics, gnn_passes_per_epoch): 119 | loss = None 120 | 121 | for _ in range(gnn_passes_per_epoch): 122 | loss = self.train_model(model_in, target_labels, train_mask, optimizer) 123 | 124 | self.model.eval() 125 | logits = self.model(*model_in).squeeze() 126 | train_results = self.evaluate_model(logits, target_labels, train_mask) 127 | val_results = self.evaluate_model(logits, target_labels, val_mask) 128 | test_results = self.evaluate_model(logits, target_labels, test_mask) 129 | for metric_name in train_results: 130 | metrics[metric_name].append((train_results[metric_name].detach().item(), 131 | val_results[metric_name].detach().item(), 132 | test_results[metric_name].detach().item() 133 | )) 134 | return loss 135 | 136 | def update_early_stopping(self, metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, metric_name, 137 | lower_better=False): 138 | train_metric, val_metric, test_metric = metrics[metric_name][-1] 139 | if (lower_better and val_metric < best_metric[1]) or (not lower_better and val_metric > best_metric[1]): 140 | best_metric = metrics[metric_name][-1] 141 | best_val_epoch = epoch 142 | epochs_since_last_best_metric = 0 143 | else: 144 | epochs_since_last_best_metric += 1 145 | return best_metric, best_val_epoch, epochs_since_last_best_metric 146 | 147 | def save_metrics(self, metrics, fn): 148 | with open(fn, "w+") as f: 149 | for key, value in metrics.items(): 150 | print(key, value, file=f) 151 | 152 | def plot(self, metrics, legend, title, output_fn=None, logx=False, logy=False, metric_name='loss'): 153 | import matplotlib.pyplot as plt 154 | metric_results = metrics[metric_name] 155 | xs = [range(len(metric_results))] * len(metric_results[0]) 
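        # Note: metrics[metric_name] is a list of per-epoch (train, val, test) tuples
        # (see train_and_evaluate above), so zip(*...) below transposes it into one
        # y-series per split, each plotted against the epoch index.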
156 | ys = list(zip(*metric_results)) 157 | 158 | plt.rcParams.update({'font.size': 40}) 159 | plt.rcParams["figure.figsize"] = (20, 10) 160 | lss = ['-', '--', '-.', ':'] 161 | colors = ['#4053d3', '#ddb310', '#b51d14', '#00beff', '#fb49b0', '#00b25d', '#cacaca'] 162 | colors = [(235, 172, 35), (184, 0, 88), (0, 140, 249), (0, 110, 0), (0, 187, 173), (209, 99, 230), (178, 69, 2), 163 | (255, 146, 135), (89, 84, 214), (0, 198, 248), (135, 133, 0), (0, 167, 108), (189, 189, 189)] 164 | colors = [[p / 255 for p in c] for c in colors] 165 | for i in range(len(ys)): 166 | plt.plot(xs[i], ys[i], lw=4, color=colors[i]) 167 | plt.legend(legend, loc=1, fontsize=30) 168 | plt.title(title) 169 | 170 | plt.xscale('log') if logx else None 171 | plt.yscale('log') if logy else None 172 | plt.xlabel('Iteration') 173 | plt.ylabel('RMSE') 174 | plt.grid() 175 | plt.tight_layout() 176 | 177 | plt.savefig(output_fn, bbox_inches='tight') if output_fn else None 178 | plt.show() 179 | 180 | def plot_interactive(self, metrics, legend, title, logx=False, logy=False, metric_name='loss', start_from=0): 181 | import plotly.graph_objects as go 182 | metric_results = metrics[metric_name] 183 | xs = [list(range(len(metric_results)))] * len(metric_results[0]) 184 | ys = list(zip(*metric_results)) 185 | 186 | fig = go.Figure() 187 | for i in range(len(ys)): 188 | fig.add_trace(go.Scatter(x=xs[i][start_from:], y=ys[i][start_from:], 189 | mode='lines+markers', 190 | name=legend[i])) 191 | 192 | fig.update_layout( 193 | title=title, 194 | title_x=0.5, 195 | xaxis_title='Epoch', 196 | yaxis_title='RMSE', 197 | font=dict( 198 | size=40, 199 | ), 200 | height=600, 201 | ) 202 | 203 | if logx: 204 | fig.update_layout(xaxis_type="log") 205 | if logy: 206 | fig.update_layout(yaxis_type="log") 207 | 208 | fig.show() 209 | -------------------------------------------------------------------------------- /bgnn/models/GBDT.py: -------------------------------------------------------------------------------- 1 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor 2 | import time 3 | from sklearn.metrics import mean_squared_error, accuracy_score, r2_score 4 | import numpy as np 5 | from collections import defaultdict as ddict 6 | import lightgbm 7 | from lightgbm import LGBMClassifier, LGBMRegressor 8 | 9 | class GBDTCatBoost: 10 | def __init__(self, task='regression', depth=6, lr=0.1, l2_leaf_reg=None, max_bin=None): 11 | self.task = task 12 | self.depth = depth 13 | self.learning_rate = lr 14 | self.l2_leaf_reg = l2_leaf_reg 15 | self.max_bin = max_bin 16 | 17 | 18 | def init_model(self, num_epochs, patience): 19 | catboost_model_obj = CatBoostRegressor if self.task == 'regression' else CatBoostClassifier 20 | self.catboost_loss_function = 'RMSE' if self.task == 'regression' else 'MultiClass' 21 | self.custom_metrics = ['R2'] if self.task == 'regression' else ['Accuracy'] 22 | # ['Accuracy', 'AUC', 'Precision', 'Recall', 'F1', 'MCC', 'R2'], 23 | 24 | self.model = catboost_model_obj(iterations=num_epochs, 25 | depth=self.depth, 26 | learning_rate=self.learning_rate, 27 | loss_function=self.catboost_loss_function, 28 | custom_metric=self.custom_metrics, 29 | random_seed=0, 30 | early_stopping_rounds=patience, 31 | l2_leaf_reg=self.l2_leaf_reg, 32 | max_bin=self.max_bin, 33 | nan_mode='Min') 34 | 35 | def get_metrics(self): 36 | d = self.model.evals_result_ 37 | metrics = ddict(list) 38 | keys = ['learn', 'validation_0', 'validation_1'] \ 39 | if 'validation_0' in self.model.evals_result_ \ 40 | else ['learn', 'validation'] 
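        # CatBoost names the eval sets 'validation_0'/'validation_1' when two eval sets
        # are passed to fit() (val and test here) and plain 'validation' when there is
        # only one; 'learn' always holds the training-set curves.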
41 | for metric_name in d[keys[0]]: 42 | perf = [d[key][metric_name] for key in keys] 43 | if metric_name == self.catboost_loss_function: 44 | metrics['loss'] = list(zip(*perf)) 45 | else: 46 | metrics[metric_name.lower()] = list(zip(*perf)) 47 | 48 | return metrics 49 | 50 | def get_test_metric(self, metrics, metric_name): 51 | if metric_name == 'loss': 52 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 53 | else: 54 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 55 | min_metric = metrics[metric_name][val_epoch] 56 | return min_metric, val_epoch 57 | 58 | def save_metrics(self, metrics, fn): 59 | with open(fn, "w+") as f: 60 | for key, value in metrics.items(): 61 | print(key, value, file=f) 62 | 63 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 64 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 65 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 66 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 67 | return X_train, y_train, X_val, y_val, X_test, y_test 68 | 69 | def fit(self, 70 | X, y, train_mask, val_mask, test_mask, 71 | cat_features=None, num_epochs=1000, patience=200, 72 | plot=False, verbose=False, 73 | loss_fn="", metric_name='loss'): 74 | 75 | X_train, y_train, X_val, y_val, X_test, y_test = \ 76 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 77 | self.init_model(num_epochs, patience) 78 | 79 | start = time.time() 80 | pool = Pool(X_train, y_train, cat_features=cat_features) 81 | eval_set = [(X_val, y_val), (X_test, y_test)] 82 | self.model.fit(pool, eval_set=eval_set, plot=plot, verbose=verbose) 83 | finish = time.time() 84 | 85 | num_trees = self.model.tree_count_ 86 | print('Finished training. Total time: {:.2f} | Number of trees: {:d} | Time per tree: {:.2f}'.format(finish - start, num_trees, (time.time() - start )/num_trees)) 87 | 88 | metrics = self.get_metrics() 89 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 90 | if loss_fn: 91 | self.save_metrics(metrics, loss_fn) 92 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 93 | return metrics 94 | 95 | def predict(self, X_test, y_test): 96 | pred = self.model.predict(X_test) 97 | 98 | metrics = {} 99 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 100 | 101 | return metrics 102 | 103 | 104 | class GBDTLGBM: 105 | def __init__(self, task='regression', lr=0.1, num_leaves=31, max_bin=255, 106 | lambda_l1=0., lambda_l2=0., boosting='gbdt'): 107 | self.task = task 108 | self.boosting = boosting 109 | self.learning_rate = lr 110 | self.num_leaves = num_leaves 111 | self.max_bin = max_bin 112 | self.lambda_l1 = lambda_l1 113 | self.lambda_l2 = lambda_l2 114 | 115 | def accuracy(self, preds, train_data): 116 | labels = train_data.get_label() 117 | preds_classes = preds.reshape((preds.shape[0]//labels.shape[0], labels.shape[0])).argmax(0) 118 | return 'accuracy', accuracy_score(labels, preds_classes), True 119 | 120 | def r2(self, preds, train_data): 121 | labels = train_data.get_label() 122 | return 'r2', r2_score(labels, preds), True 123 | 124 | def init_model(self): 125 | 126 | self.parameters = { 127 | 'objective': 'regression' if self.task == 'regression' else 'multiclass', 128 | 'metric': {'rmse'} if self.task == 'regression' else {'multiclass'}, 129 | 'num_classes': self.num_classes, 130 | 'boosting': self.boosting, 131 | 'num_leaves': self.num_leaves, 132 | 'max_bin': self.max_bin, 133 | 'learning_rate': self.learning_rate, 134 | 
'lambda_l1': self.lambda_l1, 135 | 'lambda_l2': self.lambda_l2, 136 | # 'num_threads': 1, 137 | # 'feature_fraction': 0.9, 138 | # 'bagging_fraction': 0.8, 139 | # 'bagging_freq': 5, 140 | 'verbose': 1, 141 | # 'device_type': 'gpu' 142 | } 143 | self.evals_result = dict() 144 | 145 | def get_metrics(self): 146 | d = self.evals_result 147 | metrics = ddict(list) 148 | keys = ['training', 'valid_1', 'valid_2'] \ 149 | if 'training' in d \ 150 | else ['valid_0', 'valid_1'] 151 | for metric_name in d[keys[0]]: 152 | perf = [d[key][metric_name] for key in keys] 153 | if metric_name in ['regression', 'multiclass', 'rmse', 'l2', 'multi_logloss', 'binary_logloss']: 154 | metrics['loss'] = list(zip(*perf)) 155 | else: 156 | metrics[metric_name] = list(zip(*perf)) 157 | return metrics 158 | 159 | def get_test_metric(self, metrics, metric_name): 160 | if metric_name == 'loss': 161 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 162 | else: 163 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 164 | min_metric = metrics[metric_name][val_epoch] 165 | return min_metric, val_epoch 166 | 167 | def save_metrics(self, metrics, fn): 168 | with open(fn, "w+") as f: 169 | for key, value in metrics.items(): 170 | print(key, value, file=f) 171 | 172 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 173 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 174 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 175 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 176 | return X_train, y_train, X_val, y_val, X_test, y_test 177 | 178 | def fit(self, 179 | X, y, train_mask, val_mask, test_mask, 180 | cat_features=None, num_epochs=1000, patience=200, 181 | loss_fn="", metric_name='loss'): 182 | 183 | if cat_features is not None: 184 | X = X.copy() 185 | for col in list(X.columns[cat_features]): 186 | X[col] = X[col].astype('category') 187 | 188 | X_train, y_train, X_val, y_val, X_test, y_test = \ 189 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 190 | self.num_classes = None if self.task == 'regression' else len(set(y.iloc[:, 0])) 191 | self.init_model() 192 | 193 | start = time.time() 194 | train_data = lightgbm.Dataset(X_train, label=y_train) 195 | val_data = lightgbm.Dataset(X_val, label=y_val) 196 | test_data = lightgbm.Dataset(X_test, label=y_test) 197 | 198 | self.model = lightgbm.train(self.parameters, 199 | train_data, 200 | valid_sets=[train_data, val_data, test_data], 201 | num_boost_round=num_epochs, 202 | early_stopping_rounds=patience, 203 | evals_result=self.evals_result, 204 | feval=self.r2 if self.task == 'regression' else self.accuracy, 205 | verbose_eval=1) 206 | finish = time.time() 207 | 208 | print('Finished training. 
Total time: {:.2f}'.format(finish - start)) 209 | 210 | metrics = self.get_metrics() 211 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 212 | if loss_fn: 213 | self.save_metrics(metrics, loss_fn) 214 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 215 | return metrics 216 | 217 | def predict(self, X_test, y_test): 218 | pred = self.model.predict(X_test) 219 | 220 | metrics = {} 221 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 222 | 223 | return metrics -------------------------------------------------------------------------------- /bgnn/models/GNN.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch 4 | from torch.nn import Dropout, ELU 5 | import torch.nn.functional as F 6 | from torch import nn 7 | from dgl.nn.pytorch import GATConv as GATConvDGL, GraphConv, ChebConv as ChebConvDGL, \ 8 | AGNNConv as AGNNConvDGL, APPNPConv 9 | from torch.nn import Sequential, Linear, ReLU, Identity 10 | from tqdm import tqdm 11 | from .Base import BaseModel 12 | from torch.autograd import Variable 13 | from collections import defaultdict as ddict 14 | from .MLP import MLPRegressor 15 | 16 | 17 | class ElementWiseLinear(nn.Module): 18 | def __init__(self, size, weight=True, bias=True, inplace=False): 19 | super().__init__() 20 | if weight: 21 | self.weight = nn.Parameter(torch.Tensor(size)) 22 | else: 23 | self.weight = None 24 | if bias: 25 | self.bias = nn.Parameter(torch.Tensor(size)) 26 | else: 27 | self.bias = None 28 | self.inplace = inplace 29 | 30 | self.reset_parameters() 31 | 32 | def reset_parameters(self): 33 | if self.weight is not None: 34 | nn.init.ones_(self.weight) 35 | if self.bias is not None: 36 | nn.init.zeros_(self.bias) 37 | 38 | def forward(self, x): 39 | if self.inplace: 40 | if self.weight is not None: 41 | x.mul_(self.weight) 42 | if self.bias is not None: 43 | x.add_(self.bias) 44 | else: 45 | if self.weight is not None: 46 | x = x * self.weight 47 | if self.bias is not None: 48 | x = x + self.bias 49 | return x 50 | 51 | class GATDGL(torch.nn.Module): 52 | ''' 53 | Implementation of leaderboard GAT network for OGB datasets. 
54 | https://github.com/Espylapiza/dgl/blob/master/examples/pytorch/ogb/ogbn-arxiv/models.py 55 | ''' 56 | def __init__( 57 | self, 58 | in_feats, 59 | n_classes, 60 | n_layers=3, 61 | n_heads=3, 62 | activation=F.relu, 63 | n_hidden=250, 64 | dropout=0.75, 65 | input_drop=0.1, 66 | attn_drop=0.0, 67 | ): 68 | super().__init__() 69 | self.in_feats = in_feats 70 | self.n_hidden = n_hidden 71 | self.n_classes = n_classes 72 | self.n_layers = n_layers 73 | self.num_heads = n_heads 74 | 75 | self.convs = torch.nn.ModuleList() 76 | self.norms = torch.nn.ModuleList() 77 | 78 | for i in range(n_layers): 79 | in_hidden = n_heads * n_hidden if i > 0 else in_feats 80 | out_hidden = n_hidden if i < n_layers - 1 else n_classes 81 | num_heads = n_heads if i < n_layers - 1 else 1 82 | out_channels = n_heads 83 | 84 | self.convs.append( 85 | GATConvDGL( 86 | in_hidden, 87 | out_hidden, 88 | num_heads=num_heads, 89 | attn_drop=attn_drop, 90 | residual=True, 91 | ) 92 | ) 93 | 94 | if i < n_layers - 1: 95 | self.norms.append(torch.nn.BatchNorm1d(out_channels * out_hidden)) 96 | 97 | self.bias_last = ElementWiseLinear(n_classes, weight=False, bias=True, inplace=True) 98 | 99 | self.input_drop = nn.Dropout(input_drop) 100 | self.dropout = nn.Dropout(dropout) 101 | self.activation = activation 102 | 103 | def forward(self, graph, feat): 104 | h = feat 105 | h = self.input_drop(h) 106 | 107 | for i in range(self.n_layers): 108 | conv = self.convs[i](graph, h) 109 | 110 | h = conv 111 | 112 | if i < self.n_layers - 1: 113 | h = h.flatten(1) 114 | h = self.norms[i](h) 115 | h = self.activation(h, inplace=True) 116 | h = self.dropout(h) 117 | 118 | h = h.mean(1) 119 | h = self.bias_last(h) 120 | 121 | return h 122 | 123 | 124 | 125 | class GNNModelDGL(torch.nn.Module): 126 | def __init__(self, in_dim, hidden_dim, out_dim, 127 | dropout=0., name='gat', residual=True, use_mlp=False, join_with_mlp=False): 128 | super(GNNModelDGL, self).__init__() 129 | self.name = name 130 | self.use_mlp = use_mlp 131 | self.join_with_mlp = join_with_mlp 132 | self.normalize_input_columns = True 133 | if use_mlp: 134 | self.mlp = MLPRegressor(in_dim, hidden_dim, out_dim) 135 | if join_with_mlp: 136 | in_dim += out_dim 137 | else: 138 | in_dim = out_dim 139 | if name == 'gat': 140 | self.l1 = GATConvDGL(in_dim, hidden_dim//8, 8, feat_drop=dropout, attn_drop=dropout, residual=False, 141 | activation=F.elu) 142 | self.l2 = GATConvDGL(hidden_dim, out_dim, 1, feat_drop=dropout, attn_drop=dropout, residual=residual, activation=None) 143 | elif name == 'gcn': 144 | self.l1 = GraphConv(in_dim, hidden_dim, activation=F.elu) 145 | self.l2 = GraphConv(hidden_dim, out_dim, activation=F.elu) 146 | self.drop = Dropout(p=dropout) 147 | elif name == 'cheb': 148 | self.l1 = ChebConvDGL(in_dim, hidden_dim, k = 3) 149 | self.l2 = ChebConvDGL(hidden_dim, out_dim, k = 3) 150 | self.drop = Dropout(p=dropout) 151 | elif name == 'agnn': 152 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), ELU()) 153 | self.l1 = AGNNConvDGL(learn_beta=False) 154 | self.l2 = AGNNConvDGL(learn_beta=True) 155 | self.lin2 = Sequential(Dropout(p=dropout), Linear(hidden_dim, out_dim), ELU()) 156 | elif name == 'appnp': 157 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), 158 | ReLU(), Dropout(p=dropout), Linear(hidden_dim, out_dim)) 159 | self.l1 = APPNPConv(k=10, alpha=0.1, edge_drop=0.) 
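            # APPNP decouples feature transformation from propagation: lin1 predicts
            # per-node logits with an MLP, and APPNPConv then smooths them over the
            # graph with k=10 steps of personalized PageRank (teleport alpha=0.1);
            # see forward() below.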
160 | 161 | 162 | def forward(self, graph, features): 163 | h = features 164 | if self.use_mlp: 165 | if self.join_with_mlp: 166 | h = torch.cat((h, self.mlp(features)), 1) 167 | else: 168 | h = self.mlp(features) 169 | if self.name == 'gat': 170 | h = self.l1(graph, h).flatten(1) 171 | logits = self.l2(graph, h).mean(1) 172 | elif self.name in ['appnp']: 173 | h = self.lin1(h) 174 | logits = self.l1(graph, h) 175 | elif self.name == 'agnn': 176 | h = self.lin1(h) 177 | h = self.l1(graph, h) 178 | h = self.l2(graph, h) 179 | logits = self.lin2(h) 180 | elif self.name in ['gcn', 'cheb']: 181 | h = self.drop(h) 182 | h = self.l1(graph, h) 183 | logits = self.l2(graph, h) 184 | 185 | 186 | return logits 187 | 188 | class GNN(BaseModel): 189 | def __init__(self, task='regression', lr=0.01, hidden_dim=64, dropout=0., 190 | name='gat', residual=True, lang='dgl', 191 | gbdt_predictions=None, mlp=False, use_leaderboard=False, only_gbdt=False): 192 | super(GNN, self).__init__() 193 | 194 | self.dropout = dropout 195 | self.learning_rate = lr 196 | self.hidden_dim = hidden_dim 197 | self.task = task 198 | self.model_name = name 199 | self.use_residual = residual 200 | self.lang = lang 201 | self.use_mlp = mlp 202 | self.use_leaderboard = use_leaderboard 203 | self.gbdt_predictions = gbdt_predictions 204 | self.only_gbdt = only_gbdt 205 | 206 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 207 | 208 | def __name__(self): 209 | if self.gbdt_predictions is None: 210 | return 'GNN' 211 | else: 212 | return 'ResGNN' 213 | 214 | def init_model(self): 215 | if self.lang == 'pyg': 216 | self.model = GNNModelPYG(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 217 | heads=self.heads, dropout=self.dropout, name=self.model_name, 218 | residual=self.use_residual).to(self.device) 219 | elif self.lang == 'dgl': 220 | if self.use_leaderboard: 221 | self.model = GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 222 | else: 223 | self.model = GNNModelDGL(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 224 | dropout=self.dropout, name=self.model_name, 225 | residual=self.use_residual, use_mlp=self.use_mlp, 226 | join_with_mlp=self.use_mlp).to(self.device) 227 | 228 | def init_node_features(self, X, optimize_node_features): 229 | node_features = Variable(X, requires_grad=optimize_node_features) 230 | return node_features 231 | 232 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, num_epochs, 233 | cat_features=None, patience=200, logging_epochs=1, optimize_node_features=False, 234 | loss_fn=None, metric_name='loss', normalize_features=True, replace_na=True): 235 | 236 | # initialize for early stopping and metrics 237 | if metric_name in ['r2', 'accuracy']: 238 | best_metric = [np.float('-inf')] * 3 # for train/val/test 239 | else: 240 | best_metric = [np.float('inf')] * 3 # for train/val/test 241 | best_val_epoch = 0 242 | epochs_since_last_best_metric = 0 243 | metrics = ddict(list) # metric_name -> (train/val/test) 244 | if cat_features is None: 245 | cat_features = [] 246 | 247 | if self.gbdt_predictions is not None: 248 | X = X.copy() 249 | X['predict'] = self.gbdt_predictions 250 | if self.only_gbdt: 251 | cat_features = [] 252 | X = X[['predict']] 253 | 254 | self.in_dim = X.shape[1] 255 | self.hidden_dim = self.hidden_dim 256 | if self.task == 'regression': 257 | self.out_dim = y.shape[1] 258 | elif self.task == 'classification': 259 | self.out_dim = len(set(y.iloc[:, 0])) 260 | 261 | if 
len(cat_features): 262 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 263 | if normalize_features: 264 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 265 | if replace_na: 266 | X = self.replace_na(X, train_mask) 267 | 268 | X, y = self.pandas_to_torch(X, y) 269 | if len(X.shape) == 1: 270 | X = X.unsqueeze(1) 271 | 272 | if self.lang == 'dgl': 273 | graph = self.networkx_to_torch(networkx_graph) 274 | elif self.lang == 'pyg': 275 | graph = self.networkx_to_torch2(networkx_graph) 276 | self.init_model() 277 | node_features = self.init_node_features(X, optimize_node_features) 278 | 279 | self.node_features = node_features 280 | self.graph = graph 281 | optimizer = self.init_optimizer(node_features, optimize_node_features, self.learning_rate) 282 | 283 | pbar = tqdm(range(num_epochs)) 284 | for epoch in pbar: 285 | start2epoch = time.time() 286 | 287 | model_in = (graph, node_features) 288 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 289 | metrics, gnn_passes_per_epoch=1) 290 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 291 | metric_name=metric_name) 292 | 293 | # check early stopping 294 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 295 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 296 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 297 | if patience and epochs_since_last_best_metric > patience: 298 | break 299 | 300 | if loss_fn: 301 | self.save_metrics(metrics, loss_fn) 302 | 303 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 304 | return metrics 305 | 306 | def predict(self, graph, node_features, target_labels, test_mask): 307 | return self.evaluate_model((graph, node_features), target_labels, test_mask) -------------------------------------------------------------------------------- /bgnn/models/MLP.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import numpy as np 5 | import time 6 | from tqdm import tqdm 7 | from .Base import BaseModel 8 | from sklearn.metrics import r2_score 9 | from collections import defaultdict as ddict 10 | 11 | class MLPClassifier(torch.nn.Module): 12 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 13 | super(MLPClassifier, self).__init__() 14 | 15 | self.lins = torch.nn.ModuleList() 16 | self.lins.append(torch.nn.Linear(in_dim, hidden_dim)) 17 | self.bns = torch.nn.ModuleList() 18 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 19 | for _ in range(num_layers - 2): 20 | self.lins.append(torch.nn.Linear(hidden_dim, hidden_dim)) 21 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 22 | self.lins.append(torch.nn.Linear(hidden_dim, out_dim)) 23 | 24 | self.dropout = dropout 25 | 26 | def reset_parameters(self): 27 | for lin in self.lins: 28 | lin.reset_parameters() 29 | 30 | def forward(self, x): 31 | for i, lin in enumerate(self.lins[:-1]): 32 | x = lin(x) 33 | x = self.bns[i](x) 34 | x = F.relu(x) 35 | x = F.dropout(x, p=self.dropout, training=self.training) 36 | x = self.lins[-1](x) 37 | return x 38 | 39 | 40 | class MLPRegressor(nn.Module): 41 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 42 | super(MLPRegressor, self).__init__() 43 | 44 | self.layers = nn.Sequential( 45 | 
nn.Linear(in_dim, hidden_dim), 46 | nn.ReLU(), 47 | nn.Dropout(p=dropout), 48 | nn.Linear(hidden_dim, hidden_dim), 49 | nn.ReLU(), 50 | nn.Dropout(p=dropout), 51 | nn.Linear(hidden_dim, out_dim) 52 | ) 53 | 54 | def forward(self, x): 55 | return self.layers(x) 56 | 57 | 58 | class MLP(BaseModel): 59 | def __init__(self, task='regression', num_layers=3, dropout=0., lr=0.01, hidden_dim=128): 60 | super(MLP, self).__init__() 61 | self.task = task 62 | self.num_layers = num_layers 63 | self.dropout = dropout 64 | self.learning_rate = lr 65 | self.hidden_dim = hidden_dim 66 | 67 | 68 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 69 | 70 | def __name__(self): 71 | return 'MLP' 72 | 73 | def init_model(self): 74 | # mlp_model = MLPRegressor if self.task == 'regression' else MLPClassifier 75 | mlp_model = MLPClassifier 76 | self.model = mlp_model(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 77 | num_layers=self.num_layers, dropout=self.dropout).to( 78 | self.device) 79 | 80 | def fit(self, X, y, train_mask, val_mask, test_mask, cat_features=None, 81 | num_epochs=1000, patience=200, 82 | logging_epochs=1, loss_fn=None, 83 | metric_name='loss', normalize_features=True, replace_na=True): 84 | 85 | # initialize for early stopping and metrics 86 | if metric_name in ['r2', 'accuracy']: 87 | best_metric = [np.float('-inf')] * 3 # for train/val/test 88 | else: 89 | best_metric = [np.float('inf')] * 3 # for train/val/test 90 | best_val_epoch = 0 91 | epochs_since_last_best_metric = 0 92 | metrics = ddict(list) # metric_name -> (train/val/test) 93 | if cat_features is None: 94 | cat_features = [] 95 | 96 | self.in_dim = X.shape[1] 97 | self.hidden_dim = self.hidden_dim 98 | if self.task == 'regression': 99 | self.out_dim = y.shape[1] 100 | elif self.task == 'classification': 101 | self.out_dim = len(set(y.iloc[:, 0])) 102 | 103 | 104 | if len(cat_features): 105 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 106 | if normalize_features: 107 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 108 | if replace_na: 109 | X = self.replace_na(X, train_mask) 110 | 111 | X, y = self.pandas_to_torch(X, y) 112 | if len(X.shape) == 1: 113 | X = X.unsqueeze(dim=1) 114 | 115 | self.init_model() 116 | optimizer = self.init_optimizer(None, False, learning_rate=self.learning_rate) 117 | 118 | pbar = tqdm(range(num_epochs)) 119 | for epoch in pbar: 120 | 121 | start2epoch = time.time() 122 | 123 | model_in = (X,) 124 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 125 | metrics, gnn_passes_per_epoch=1) 126 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 127 | metric_name=metric_name) 128 | 129 | # check early stopping 130 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 131 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 132 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 133 | if patience and epochs_since_last_best_metric > patience: 134 | break 135 | 136 | if loss_fn: 137 | self.save_metrics(metrics, loss_fn) 138 | 139 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 140 | return metrics 141 | 142 | def predict(self, X, target_labels, test_mask): 143 | return self.evaluate_model((X,), target_labels, test_mask) 
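
# Usage sketch (illustrative, under assumed inputs: pandas DataFrames X and y, and
# train/val/test masks given as lists of row indices):
#   model = MLP(task='classification', num_layers=3, dropout=0.5, lr=0.01, hidden_dim=128)
#   metrics = model.fit(X, y, train_mask, val_mask, test_mask,
#                       num_epochs=1000, patience=200, metric_name='accuracy')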
-------------------------------------------------------------------------------- /bgnn/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nd7141/bgnn/11290bc8ec5427faa1cb48ec51d947d5f6624b60/bgnn/models/__init__.py -------------------------------------------------------------------------------- /bgnn/scripts/run.py: -------------------------------------------------------------------------------- 1 | # from catboost import CatboostError 2 | # import sys 3 | # sys.path.append('../') 4 | 5 | from bgnn.models.GBDT import GBDTCatBoost, GBDTLGBM 6 | from bgnn.models.MLP import MLP 7 | from bgnn.models.GNN import GNN 8 | from bgnn.models.BGNN import BGNN 9 | from bgnn.scripts.utils import get_masks, NpEncoder 10 | 11 | import os 12 | import json 13 | import time 14 | import datetime 15 | from pathlib import Path 16 | from collections import defaultdict as ddict 17 | 18 | import pandas as pd 19 | import networkx as nx 20 | import random 21 | import numpy as np 22 | import fire 23 | from omegaconf import OmegaConf 24 | from sklearn.model_selection import ParameterGrid 25 | 26 | 27 | class RunModel: 28 | def read_input(self, input_folder): 29 | self.X = pd.read_csv(f'{input_folder}/X.csv') 30 | self.y = pd.read_csv(f'{input_folder}/y.csv') 31 | 32 | networkx_graph = nx.read_graphml(f'{input_folder}/graph.graphml') 33 | networkx_graph = nx.relabel_nodes(networkx_graph, {str(i): i for i in range(len(networkx_graph))}) 34 | self.networkx_graph = networkx_graph 35 | 36 | categorical_columns = [] 37 | if os.path.exists(f'{input_folder}/cat_features.txt'): 38 | with open(f'{input_folder}/cat_features.txt') as f: 39 | for line in f: 40 | if line.strip(): 41 | categorical_columns.append(line.strip()) 42 | 43 | self.cat_features = None 44 | if categorical_columns: 45 | columns = self.X.columns 46 | self.cat_features = np.where(columns.isin(categorical_columns))[0] 47 | 48 | for col in list(columns[self.cat_features]): 49 | self.X[col] = self.X[col].astype(str) 50 | 51 | 52 | if os.path.exists(f'{input_folder}/masks.json'): 53 | with open(f'{input_folder}/masks.json') as f: 54 | self.masks = json.load(f) 55 | else: 56 | print('Creating and saving train/val/test masks') 57 | idx = list(range(self.y.shape[0])) 58 | self.masks = dict() 59 | for i in range(self.max_seeds): 60 | random.shuffle(idx) 61 | r1, r2, r3 = idx[:int(.6*len(idx))], idx[int(.6*len(idx)):int(.8*len(idx))], idx[int(.8*len(idx)):] 62 | self.masks[str(i)] = {"train": r1, "val": r2, "test": r3} 63 | 64 | with open(f'{input_folder}/masks.json', 'w+') as f: 65 | json.dump(self.masks, f, cls=NpEncoder) 66 | 67 | 68 | def get_input(self, dataset_dir, dataset: str): 69 | if dataset == 'house': 70 | input_folder = dataset_dir / 'house' 71 | elif dataset == 'county': 72 | input_folder = dataset_dir / 'county' 73 | elif dataset == 'vk': 74 | input_folder = dataset_dir / 'vk' 75 | elif dataset == 'wiki': 76 | input_folder = dataset_dir / 'wiki' 77 | elif dataset == 'avazu': 78 | input_folder = dataset_dir / 'avazu' 79 | elif dataset == 'vk_class': 80 | input_folder = dataset_dir / 'vk_class' 81 | elif dataset == 'house_class': 82 | input_folder = dataset_dir / 'house_class' 83 | elif dataset == 'dblp': 84 | input_folder = dataset_dir / 'dblp' 85 | elif dataset == 'slap': 86 | input_folder = dataset_dir / 'slap' 87 | else: 88 | input_folder = dataset 89 | 90 | if self.save_folder is None: 91 | self.save_folder = 
f'results/{dataset}/{datetime.datetime.now().strftime("%d_%m")}' 92 | 93 | self.read_input(input_folder) 94 | print('Save to folder:', self.save_folder) 95 | 96 | 97 | def run_one_model(self, config_fn, model_name): 98 | self.config = OmegaConf.load(config_fn) 99 | grid = ParameterGrid(dict(self.config.hp)) 100 | 101 | for ps in grid: 102 | param_string = ''.join([f'-{key}{ps[key]}' for key in ps]) 103 | exp_name = f'{model_name}{param_string}' 104 | print(f'\nSeed {self.seed} RUNNING:{exp_name}') 105 | 106 | runs = [] 107 | runs_custom = [] 108 | times = [] 109 | for _ in range(self.repeat_exp): 110 | start = time.time() 111 | model = self.define_model(model_name, ps) 112 | 113 | inputs = {'X': self.X, 'y': self.y, 'train_mask': self.train_mask, 114 | 'val_mask': self.val_mask, 'test_mask': self.test_mask, 'cat_features': self.cat_features} 115 | if model_name in ['gnn', 'resgnn', 'bgnn']: 116 | inputs['networkx_graph'] = self.networkx_graph 117 | 118 | metrics = model.fit(num_epochs=self.config.num_epochs, patience=self.config.patience, 119 | loss_fn=f"{self.seed_folder}/{exp_name}.txt", 120 | metric_name='loss' if self.task == 'regression' else 'accuracy', **inputs) 121 | finish = time.time() 122 | best_loss = min(metrics['loss'], key=lambda x: x[1]) 123 | best_custom = max(metrics['r2' if self.task == 'regression' else 'accuracy'], key=lambda x: x[1]) 124 | runs.append(best_loss) 125 | runs_custom.append(best_custom) 126 | times.append(finish - start) 127 | self.store_results[exp_name] = (list(map(np.mean, zip(*runs))), 128 | list(map(np.mean, zip(*runs_custom))), 129 | np.mean(times), 130 | ) 131 | 132 | def define_model(self, model_name, ps): 133 | if model_name == 'catboost': 134 | return GBDTCatBoost(self.task, **ps) 135 | elif model_name == 'lightgbm': 136 | return GBDTLGBM(self.task, **ps) 137 | elif model_name == 'mlp': 138 | return MLP(self.task, **ps) 139 | elif model_name == 'gnn': 140 | return GNN(self.task, **ps) 141 | elif model_name == 'resgnn': 142 | gbdt = GBDTCatBoost(self.task) 143 | gbdt.fit(self.X, self.y, self.train_mask, self.val_mask, self.test_mask, 144 | cat_features=self.cat_features, 145 | num_epochs=1000, patience=100, 146 | plot=False, verbose=False, loss_fn=None, 147 | metric_name='loss' if self.task == 'regression' else 'accuracy') 148 | return GNN(task=self.task, gbdt_predictions=gbdt.model.predict(self.X), **ps) 149 | elif model_name == 'bgnn': 150 | return BGNN(self.task, **ps) 151 | 152 | def create_save_folder(self, seed): 153 | self.seed_folder = f'{self.save_folder}/{seed}' 154 | os.makedirs(self.seed_folder, exist_ok=True) 155 | 156 | def split_masks(self, seed): 157 | self.train_mask, self.val_mask, self.test_mask = self.masks[seed]['train'], \ 158 | self.masks[seed]['val'], self.masks[seed]['test'] 159 | 160 | def save_results(self, seed): 161 | self.seed_results[seed] = self.store_results 162 | with open(f'{self.save_folder}/seed_results.json', 'w+') as f: 163 | json.dump(self.seed_results, f) 164 | 165 | self.aggregated = self.aggregate_results() 166 | with open(f'{self.save_folder}/aggregated_results.json', 'w+') as f: 167 | json.dump(self.aggregated, f) 168 | 169 | def get_model_name(self, exp_name: str, algos: list): 170 | # get name of the model (for gnn-like models (eg. gat)) 171 | if 'name' in exp_name: 172 | model_name = '-' + [param[4:] for param in exp_name.split('-') if param.startswith('name')][0] 173 | else: 174 | model_name = '' 175 | 176 | # get a model used a MLP (eg. 
MLP-GNN) 177 | if 'gnn' in exp_name and 'mlpTrue' in exp_name: 178 | model_name += '-MLP' 179 | 180 | # algo corresponds to type of the model (eg. gnn, resgnn, bgnn) 181 | for algo in algos: 182 | if exp_name.startswith(algo): 183 | return algo + model_name 184 | return 'unknown' 185 | 186 | def aggregate_results(self): 187 | algos = ['catboost', 'lightgbm', 'mlp', 'gnn', 'resgnn', 'bgnn'] 188 | model_best_score = ddict(list) 189 | model_best_time = ddict(list) 190 | 191 | results = self.seed_results 192 | for seed in results: 193 | model_results_for_seed = ddict(list) 194 | for name, output in results[seed].items(): 195 | model_name = self.get_model_name(name, algos=algos) 196 | if self.task == 'regression': # rmse metric 197 | val_metric, test_metric, time = output[0][1], output[0][2], output[2] 198 | else: # accuracy metric 199 | val_metric, test_metric, time = output[1][1], output[1][2], output[2] 200 | model_results_for_seed[model_name].append((val_metric, test_metric, time)) 201 | 202 | for model_name, model_results in model_results_for_seed.items(): 203 | if self.task == 'regression': 204 | best_result = min(model_results) # rmse 205 | else: 206 | best_result = max(model_results) # accuracy 207 | model_best_score[model_name].append(best_result[1]) 208 | model_best_time[model_name].append(best_result[2]) 209 | 210 | aggregated = dict() 211 | for model, scores in model_best_score.items(): 212 | aggregated[model] = (np.mean(scores), np.std(scores), 213 | np.mean(model_best_time[model]), np.std(model_best_time[model])) 214 | return aggregated 215 | 216 | def run(self, dataset: str, *args, 217 | save_folder: str = None, 218 | task: str = 'regression', 219 | repeat_exp: int = 1, 220 | max_seeds: int = 1, 221 | dataset_dir: str = None, 222 | config_dir: str = None 223 | ): 224 | start2run = time.time() 225 | self.repeat_exp = repeat_exp 226 | self.max_seeds = max_seeds 227 | print(dataset, args, task, repeat_exp, max_seeds) 228 | 229 | dataset_dir = Path(dataset_dir) if dataset_dir else Path(__file__).parent / 'datasets' 230 | config_dir = Path(config_dir) if config_dir else Path(__file__).parent / 'configs' / 'model' 231 | print(dataset_dir, config_dir) 232 | 233 | self.task = task 234 | self.save_folder = save_folder 235 | self.get_input(dataset_dir, dataset) 236 | 237 | self.seed_results = dict() 238 | for ix, seed in enumerate(self.masks): 239 | print(f'{dataset} Seed {seed}') 240 | self.seed = seed 241 | 242 | self.create_save_folder(seed) 243 | self.split_masks(seed) 244 | 245 | self.store_results = dict() 246 | for arg in args: 247 | if arg == 'all': 248 | self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost") 249 | self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm") 250 | self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp") 251 | self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn") 252 | self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn") 253 | self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn") 254 | break 255 | elif arg == 'catboost': 256 | self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost") 257 | elif arg == 'lightgbm': 258 | self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm") 259 | elif arg == 'mlp': 260 | self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp") 261 | elif arg == 'gnn': 262 | self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn") 
263 | elif arg == 'resgnn': 264 | self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn") 265 | elif arg == 'bgnn': 266 | self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn") 267 | 268 | self.save_results(seed) 269 | if ix+1 >= max_seeds: 270 | break 271 | 272 | print(f'Finished {dataset}: {time.time() - start2run} sec.') 273 | 274 | if __name__ == '__main__': 275 | fire.Fire(RunModel().run) -------------------------------------------------------------------------------- /bgnn/scripts/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from dgl.data import citation_graph as citegrh, TUDataset 4 | import torch as th 5 | from dgl import DGLGraph 6 | import numpy as np 7 | from sklearn.model_selection import KFold 8 | import itertools 9 | from sklearn.preprocessing import OneHotEncoder as OHE 10 | import random 11 | import json 12 | 13 | def load_cora_data(): 14 | data = citegrh.load_cora() 15 | features = th.FloatTensor(data.features) 16 | labels = th.LongTensor(data.labels) 17 | train_mask = th.BoolTensor(data.train_mask) 18 | test_mask = th.BoolTensor(data.test_mask) 19 | g = DGLGraph(data.graph) 20 | return g, features, labels, train_mask, test_mask 21 | 22 | def get_degree_features(graph): 23 | return graph.out_degrees().unsqueeze(-1).numpy() 24 | 25 | def get_categorical_features(features): 26 | return np.argmax(features, axis=-1).unsqueeze(dim=1).numpy() 27 | 28 | def get_random_int_features(shape, num_categories=100): 29 | return np.random.randint(0, num_categories, size=shape) 30 | 31 | def get_random_norm_features(shape): 32 | return np.random.normal(size=shape) 33 | 34 | def get_random_uniform_features(shape): 35 | return np.random.uniform(-1, 1, size=shape) 36 | 37 | def merge_features(*args): 38 | return np.hstack(args) 39 | 40 | def get_train_data(graph, features, num_random_features=10, num_random_categories=100): 41 | return merge_features( 42 | get_categorical_features(features), 43 | get_degree_features(graph), 44 | get_random_int_features(shape=(features.shape[0], num_random_features), num_categories=num_random_categories), 45 | ) 46 | 47 | 48 | def save_folds(dataset_name, n_splits=3): 49 | dataset = TUDataset(dataset_name) 50 | i = 0 51 | kfold = KFold(n_splits=n_splits, shuffle=True) 52 | dir_name = f'kfold_{dataset_name}' 53 | for trix, teix in kfold.split(range(len(dataset))): 54 | os.makedirs(f'{dir_name}/fold{i}', exist_ok=True) 55 | np.savetxt(f'{dir_name}/fold{i}/train.idx', trix, fmt='%i') 56 | np.savetxt(f'{dir_name}/fold{i}/test.idx', teix, fmt='%i') 57 | i += 1 58 | 59 | 60 | def graph_to_node_label(graphs, labels): 61 | targets = np.array(list(itertools.chain(*[[labels[i]] * graphs[i].number_of_nodes() for i in range(len(graphs))]))) 62 | enc = OHE(dtype=np.float32) 63 | return np.asarray(enc.fit_transform(targets.reshape(-1, 1)).todense()) 64 | 65 | 66 | def get_masks(N, train_size=0.6, val_size=0.2, random_seed=42): 67 | if not random_seed: 68 | seed = random.randint(0, 100) 69 | else: 70 | seed = random_seed 71 | 72 | # print('seed', seed) 73 | random.seed(seed) 74 | 75 | indices = list(range(N)) 76 | random.shuffle(indices) 77 | 78 | train_mask = indices[:int(train_size * len(indices))] 79 | val_mask = indices[int(train_size * len(indices)):int((train_size + val_size) * len(indices))] 80 | train_val_mask = indices[:int((train_size + val_size) * len(indices))] 81 | test_mask = indices[int((train_size + val_size) * len(indices)):] 82 | 83 | return 
train_mask, val_mask, train_val_mask, test_mask 84 | 85 | 86 | class NpEncoder(json.JSONEncoder): 87 | def default(self, obj): 88 | if isinstance(obj, np.integer): 89 | return int(obj) 90 | elif isinstance(obj, np.floating): 91 | return float(obj) 92 | elif isinstance(obj, np.ndarray): 93 | return obj.tolist() 94 | else: 95 | return super(NpEncoder, self).default(obj) -------------------------------------------------------------------------------- /configs/model/bgnn.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | iter_per_epoch: 3 | - 10 4 | - 20 5 | lr: 6 | - 0.01 7 | - 0.1 8 | hidden_dim: 9 | - 64 10 | name: 11 | - gat 12 | - gcn 13 | - agnn 14 | - appnp 15 | only_gbdt: 16 | - false 17 | - true 18 | dropout: 19 | - 0. 20 | - 0.5 21 | depth: 22 | - 6 23 | num_epochs: 200 24 | patience: 10 25 | -------------------------------------------------------------------------------- /configs/model/catboost.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | depth: 6 | - 4 7 | - 6 8 | l2_leaf_reg: 9 | - null 10 | num_epochs: 1000 11 | patience: 100 12 | verbose: false 13 | -------------------------------------------------------------------------------- /configs/model/gnn.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | name: 6 | - gat 7 | - gcn 8 | - agnn 9 | - appnp 10 | mlp: 11 | - true 12 | - false 13 | dropout: 14 | - 0.0 15 | - 0.5 16 | hidden_dim: 17 | - 64 18 | num_epochs: 2000 19 | patience: 200 20 | -------------------------------------------------------------------------------- /configs/model/lightgbm.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | num_leaves: 6 | - 15 7 | - 63 8 | lambda_l2: 9 | - 0.0 10 | boosting: 11 | - gbdt 12 | num_epochs: 1000 13 | patience: 100 14 | -------------------------------------------------------------------------------- /configs/model/mlp.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | num_layers: 6 | - 2 7 | - 3 8 | dropout: 9 | - 0.0 10 | - 0.5 11 | hidden_dim: 12 | - 64 13 | num_epochs: 5000 14 | patience: 200 15 | -------------------------------------------------------------------------------- /configs/model/resgnn.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | name: 6 | - gat 7 | - gcn 8 | - agnn 9 | - appnp 10 | dropout: 11 | - 0.0 12 | - 0.5 13 | hidden_dim: 14 | - 64 15 | only_gbdt: 16 | - false 17 | - true 18 | num_epochs: 1000 19 | patience: 100 20 | -------------------------------------------------------------------------------- /datasets.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nd7141/bgnn/11290bc8ec5427faa1cb48ec51d947d5f6624b60/datasets.zip -------------------------------------------------------------------------------- /models/BGNN.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import time 3 | import numpy as np 4 | import torch 5 | 6 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor, sum_models 7 | from .GNN import GNNModelDGL, GATDGL 8 | from .Base import BaseModel 9 | from tqdm import tqdm 10 | from collections import defaultdict as ddict 11 | 
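
The `BGNN` class that follows alternates two learners inside a single training loop: each epoch fits `iter_per_epoch` fresh CatBoost trees, writes their predictions into trainable node features, runs `iter_per_epoch` GNN gradient steps on those features, and then uses the resulting change in the features as the regression targets for the next round of boosting. Below is a minimal sketch of that control flow; the three callables are hypothetical stand-ins, not this file's API (see `BGNN.fit` further down for the real implementation).

```python
# Hedged sketch of the BGNN alternating scheme. `fit_more_trees`,
# `gbdt_predict`, and `gnn_steps` are hypothetical stand-ins for
# CatBoost fitting, CatBoost prediction, and several GNN optimizer steps.
import numpy as np

def bgnn_training_sketch(X, y, train_mask, num_epochs,
                         fit_more_trees, gbdt_predict, gnn_steps):
    targets = y[train_mask]                       # epoch 0 fits trees on the raw labels
    for epoch in range(num_epochs):
        fit_more_trees(X[train_mask], targets)    # boosting continues across epochs
        node_features = gbdt_predict(X)           # GBDT output seeds the node features
        before = node_features.copy()
        node_features = gnn_steps(node_features)  # GNN updates its weights *and* the features
        targets = (node_features - before)[train_mask]  # trees next fit the GNN's correction
        if np.isclose(targets.sum(), 0.0):        # features stopped moving: converged
            break
```
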
12 | class BGNN(BaseModel): 13 | def __init__(self, 14 | task='regression', iter_per_epoch = 10, lr=0.01, hidden_dim=64, dropout=0., 15 | only_gbdt=False, train_non_gbdt=False, 16 | name='gat', use_leaderboard=False, depth=6, gbdt_lr=0.1): 17 | super(BaseModel, self).__init__() 18 | self.learning_rate = lr 19 | self.hidden_dim = hidden_dim 20 | self.task = task 21 | self.dropout = dropout 22 | self.only_gbdt = only_gbdt 23 | self.train_residual = train_non_gbdt 24 | self.name = name 25 | self.use_leaderboard = use_leaderboard 26 | self.iter_per_epoch = iter_per_epoch 27 | self.depth = depth 28 | self.lang = 'dgl' 29 | self.gbdt_lr = gbdt_lr 30 | 31 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 32 | 33 | def __name__(self): 34 | return 'BGNN' 35 | 36 | def init_gbdt_model(self, num_epochs, epoch): 37 | if self.task == 'regression': 38 | catboost_model_obj = CatBoostRegressor 39 | catboost_loss_fn = 'RMSE' #''RMSEWithUncertainty' 40 | else: 41 | if epoch == 0: 42 | catboost_model_obj = CatBoostClassifier 43 | catboost_loss_fn = 'MultiClass' 44 | else: 45 | catboost_model_obj = CatBoostRegressor 46 | catboost_loss_fn = 'MultiRMSE' 47 | 48 | return catboost_model_obj(iterations=num_epochs, 49 | depth=self.depth, 50 | learning_rate=self.gbdt_lr, 51 | loss_function=catboost_loss_fn, 52 | random_seed=0, 53 | nan_mode='Min') 54 | 55 | def fit_gbdt(self, pool, trees_per_epoch, epoch): 56 | gbdt_model = self.init_gbdt_model(trees_per_epoch, epoch) 57 | gbdt_model.fit(pool, verbose=False) 58 | return gbdt_model 59 | 60 | def init_gnn_model(self): 61 | if self.use_leaderboard: 62 | self.model = GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 63 | else: 64 | self.model = GNNModelDGL(in_dim=self.in_dim, 65 | hidden_dim=self.hidden_dim, 66 | out_dim=self.out_dim, 67 | name=self.name, 68 | dropout=self.dropout).to(self.device) 69 | 70 | def append_gbdt_model(self, new_gbdt_model, weights): 71 | if self.gbdt_model is None: 72 | return new_gbdt_model 73 | return sum_models([self.gbdt_model, new_gbdt_model], weights=weights) 74 | 75 | def train_gbdt(self, gbdt_X_train, gbdt_y_train, cat_features, epoch, 76 | gbdt_trees_per_epoch, gbdt_alpha): 77 | 78 | pool = Pool(gbdt_X_train, gbdt_y_train, cat_features=cat_features) 79 | epoch_gbdt_model = self.fit_gbdt(pool, gbdt_trees_per_epoch, epoch) 80 | if epoch == 0 and self.task=='classification': 81 | self.base_gbdt = epoch_gbdt_model 82 | else: 83 | self.gbdt_model = self.append_gbdt_model(epoch_gbdt_model, weights=[1, gbdt_alpha]) 84 | 85 | def update_node_features(self, node_features, X, encoded_X): 86 | if self.task == 'regression': 87 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 88 | # predictions = self.gbdt_model.virtual_ensembles_predict(X, 89 | # virtual_ensembles_count=5, 90 | # prediction_type='TotalUncertainty') 91 | else: 92 | predictions = self.base_gbdt.predict_proba(X) 93 | # predictions = self.base_gbdt.predict(X, prediction_type='RawFormulaVal') 94 | if self.gbdt_model is not None: 95 | predictions_after_one = self.gbdt_model.predict(X) 96 | predictions += predictions_after_one 97 | 98 | if not self.only_gbdt: 99 | if self.train_residual: 100 | predictions = np.append(node_features.detach().cpu().data[:, :-self.out_dim], predictions, 101 | axis=1) # append updated X to prediction 102 | else: 103 | predictions = np.append(encoded_X, predictions, axis=1) # append X to prediction 104 | 105 | predictions = torch.from_numpy(predictions).to(self.device) 106 | 107 | 
node_features.data = predictions.float().data 108 | 109 | def update_gbdt_targets(self, node_features, node_features_before, train_mask): 110 | return (node_features - node_features_before).detach().cpu().numpy()[train_mask, -self.out_dim:] 111 | 112 | def init_node_features(self, X): 113 | node_features = torch.empty(X.shape[0], self.in_dim, requires_grad=True, device=self.device) 114 | if not self.only_gbdt: 115 | node_features.data[:, :-self.out_dim] = torch.from_numpy(X.to_numpy(copy=True)) 116 | return node_features 117 | 118 | def init_node_parameters(self, num_nodes): 119 | return torch.empty(num_nodes, self.out_dim, requires_grad=True, device=self.device) 120 | 121 | def init_optimizer2(self, node_parameters, learning_rate): 122 | params = [self.model.parameters(), [node_parameters]] 123 | return torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 124 | 125 | def update_node_features2(self, node_parameters, X): 126 | if self.task == 'regression': 127 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 128 | else: 129 | predictions = self.base_gbdt.predict_proba(X) 130 | if self.gbdt_model is not None: 131 | predictions += self.gbdt_model.predict(X) 132 | 133 | predictions = torch.from_numpy(predictions).to(self.device) 134 | node_parameters.data = predictions.float().data 135 | 136 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, cat_features, 137 | num_epochs, patience, logging_epochs=1, loss_fn=None, metric_name='loss', 138 | normalize_features=True, replace_na=True, 139 | ): 140 | 141 | # initialize for early stopping and metrics 142 | if metric_name in ['r2', 'accuracy']: 143 | best_metric = [np.float('-inf')] * 3 # for train/val/test 144 | else: 145 | best_metric = [np.float('inf')] * 3 # for train/val/test 146 | best_val_epoch = 0 147 | epochs_since_last_best_metric = 0 148 | metrics = ddict(list) 149 | if cat_features is None: 150 | cat_features = [] 151 | 152 | if self.task == 'regression': 153 | self.out_dim = y.shape[1] 154 | elif self.task == 'classification': 155 | self.out_dim = len(set(y.iloc[test_mask, 0])) 156 | # self.in_dim = X.shape[1] if not self.only_gbdt else 0 157 | # self.in_dim += 3 if uncertainty else 1 158 | self.in_dim = self.out_dim + X.shape[1] if not self.only_gbdt else self.out_dim 159 | 160 | self.init_gnn_model() 161 | 162 | gbdt_X_train = X.iloc[train_mask] 163 | gbdt_y_train = y.iloc[train_mask] 164 | gbdt_alpha = 1 165 | self.gbdt_model = None 166 | 167 | encoded_X = X.copy() 168 | if not self.only_gbdt: 169 | if len(cat_features): 170 | encoded_X = self.encode_cat_features(encoded_X, y, cat_features, train_mask, val_mask, test_mask) 171 | if normalize_features: 172 | encoded_X = self.normalize_features(encoded_X, train_mask, val_mask, test_mask) 173 | if replace_na: 174 | encoded_X = self.replace_na(encoded_X, train_mask) 175 | 176 | node_features = self.init_node_features(encoded_X) 177 | optimizer = self.init_optimizer(node_features, optimize_node_features=True, learning_rate=self.learning_rate) 178 | 179 | y, = self.pandas_to_torch(y) 180 | self.y = y 181 | if self.lang == 'dgl': 182 | graph = self.networkx_to_torch(networkx_graph) 183 | elif self.lang == 'pyg': 184 | graph = self.networkx_to_torch2(networkx_graph) 185 | 186 | self.graph = graph 187 | 188 | pbar = tqdm(range(num_epochs)) 189 | for epoch in pbar: 190 | start2epoch = time.time() 191 | 192 | # gbdt part 193 | self.train_gbdt(gbdt_X_train, gbdt_y_train, cat_features, epoch, 194 | self.iter_per_epoch, gbdt_alpha) 195 | 196 | 
self.update_node_features(node_features, X, encoded_X) 197 | node_features_before = node_features.clone() 198 | model_in=(graph, node_features) 199 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, 200 | optimizer, metrics, self.iter_per_epoch) 201 | gbdt_y_train = self.update_gbdt_targets(node_features, node_features_before, train_mask) 202 | 203 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 204 | metric_name=metric_name) 205 | # check early stopping 206 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 207 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 208 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 209 | if patience and epochs_since_last_best_metric > patience: 210 | break 211 | if np.isclose(gbdt_y_train.sum(), 0.): 212 | print('Nodes do not change anymore. Stopping...') 213 | break 214 | 215 | if loss_fn: 216 | self.save_metrics(metrics, loss_fn) 217 | 218 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 219 | return metrics 220 | 221 | def predict(self, graph, X, y, test_mask): 222 | node_features = torch.empty(X.shape[0], self.in_dim).to(self.device) 223 | self.update_node_features(node_features, X, X) 224 | return self.evaluate_model((graph, node_features), y, test_mask) -------------------------------------------------------------------------------- /models/Base.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import torch 3 | from sklearn import preprocessing 4 | import pandas as pd 5 | import torch.nn.functional as F 6 | import numpy as np 7 | from sklearn.metrics import r2_score, accuracy_score 8 | 9 | class BaseModel(torch.nn.Module): 10 | def __init__(self): 11 | super(BaseModel, self).__init__() 12 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 13 | 14 | def pandas_to_torch(self, *args): 15 | return [torch.from_numpy(arg.to_numpy(copy=True)).float().squeeze().to(self.device) for arg in args] 16 | 17 | def networkx_to_torch(self, networkx_graph): 18 | import dgl 19 | # graph = dgl.DGLGraph() 20 | graph = dgl.from_networkx(networkx_graph) 21 | graph = dgl.remove_self_loop(graph) 22 | graph = dgl.add_self_loop(graph) 23 | graph = graph.to(self.device) 24 | return graph 25 | 26 | def networkx_to_torch2(self, networkx_graph): 27 | from torch_geometric.utils import convert 28 | import torch_geometric.transforms as T 29 | graph = convert.from_networkx(networkx_graph) 30 | transform = T.Compose([T.TargetIndegree()]) 31 | graph = transform(graph) 32 | return graph.to(self.device) 33 | 34 | def move_to_device(self, *args): 35 | return [arg.to(self.device) for arg in args] 36 | 37 | def init_optimizer(self, node_features, optimize_node_features, learning_rate): 38 | 39 | params = [self.model.parameters()] 40 | if optimize_node_features: 41 | params.append([node_features]) 42 | optimizer = torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 43 | return optimizer 44 | 45 | def log_epoch(self, pbar, metrics, epoch, loss, epoch_time, logging_epochs, metric_name='loss'): 46 | train_rmse, val_rmse, test_rmse = metrics[metric_name][-1] 47 | if epoch and epoch % logging_epochs == 0: 48 | pbar.set_description( 49 | "Epoch {:05d} | Loss {:.3f} | Loss {:.3f}/{:.3f}/{:.3f} | Time {:.4f}".format(epoch, loss, 50 | train_rmse, 51 | val_rmse, test_rmse, 52 | epoch_time)) 53 | 54 
| def normalize_features(self, X, train_mask, val_mask, test_mask): 55 | min_max_scaler = preprocessing.MinMaxScaler() 56 | A = X.to_numpy(copy=True) 57 | A[train_mask] = min_max_scaler.fit_transform(A[train_mask]) 58 | A[val_mask + test_mask] = min_max_scaler.transform(A[val_mask + test_mask]) 59 | return pd.DataFrame(A, columns=X.columns).astype(float) 60 | 61 | def replace_na(self, X, train_mask): 62 | if X.isna().any().any(): 63 | return X.fillna(X.iloc[train_mask].min() - 1) 64 | return X 65 | 66 | def encode_cat_features(self, X, y, cat_features, train_mask, val_mask, test_mask): 67 | from category_encoders import CatBoostEncoder 68 | enc = CatBoostEncoder() 69 | A = X.to_numpy(copy=True) 70 | b = y.to_numpy(copy=True) 71 | A[np.ix_(train_mask, cat_features)] = enc.fit_transform(A[np.ix_(train_mask, cat_features)], b[train_mask]) 72 | A[np.ix_(val_mask + test_mask, cat_features)] = enc.transform(A[np.ix_(val_mask + test_mask, cat_features)]) 73 | A = A.astype(float) 74 | return pd.DataFrame(A, columns=X.columns) 75 | 76 | def train_model(self, model_in, target_labels, train_mask, optimizer): 77 | y = target_labels[train_mask] 78 | 79 | self.model.train() 80 | logits = self.model(*model_in).squeeze() 81 | pred = logits[train_mask] 82 | 83 | if self.task == 'regression': 84 | loss = torch.sqrt(F.mse_loss(pred, y)) 85 | elif self.task == 'classification': 86 | loss = F.cross_entropy(pred, y.long()) 87 | else: 88 | raise NotImplementedError("Unknown task. Supported tasks: classification, regression.") 89 | 90 | optimizer.zero_grad() 91 | loss.backward() 92 | optimizer.step() 93 | return loss 94 | 95 | def evaluate_model(self, logits, target_labels, mask): 96 | metrics = {} 97 | y = target_labels[mask] 98 | with torch.no_grad(): 99 | pred = logits[mask] 100 | if self.task == 'regression': 101 | metrics['loss'] = torch.sqrt(F.mse_loss(pred, y).squeeze() + 1e-8) 102 | metrics['rmsle'] = torch.sqrt(F.mse_loss(torch.log(pred + 1), torch.log(y + 1)).squeeze() + 1e-8) 103 | metrics['mae'] = F.l1_loss(pred, y) 104 | metrics['r2'] = torch.Tensor([r2_score(y.cpu().numpy(), pred.cpu().numpy())]) 105 | elif self.task == 'classification': 106 | metrics['loss'] = F.cross_entropy(pred, y.long()) 107 | metrics['accuracy'] = torch.Tensor([(y == pred.max(1)[1]).sum().item()/y.shape[0]]) 108 | 109 | return metrics 110 | 111 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 112 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 113 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 114 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 115 | return X_train, y_train, X_val, y_val, X_test, y_test 116 | 117 | def train_and_evaluate(self, model_in, target_labels, train_mask, val_mask, test_mask, 118 | optimizer, metrics, gnn_passes_per_epoch): 119 | loss = None 120 | 121 | for _ in range(gnn_passes_per_epoch): 122 | loss = self.train_model(model_in, target_labels, train_mask, optimizer) 123 | 124 | self.model.eval() 125 | logits = self.model(*model_in).squeeze() 126 | train_results = self.evaluate_model(logits, target_labels, train_mask) 127 | val_results = self.evaluate_model(logits, target_labels, val_mask) 128 | test_results = self.evaluate_model(logits, target_labels, test_mask) 129 | for metric_name in train_results: 130 | metrics[metric_name].append((train_results[metric_name].detach().item(), 131 | val_results[metric_name].detach().item(), 132 | test_results[metric_name].detach().item() 133 | )) 134 | return loss 135 | 136 | def update_early_stopping(self, metrics, epoch, 
best_metric, best_val_epoch, epochs_since_last_best_metric, metric_name, 137 | lower_better=False): 138 | train_metric, val_metric, test_metric = metrics[metric_name][-1] 139 | if (lower_better and val_metric < best_metric[1]) or (not lower_better and val_metric > best_metric[1]): 140 | best_metric = metrics[metric_name][-1] 141 | best_val_epoch = epoch 142 | epochs_since_last_best_metric = 0 143 | else: 144 | epochs_since_last_best_metric += 1 145 | return best_metric, best_val_epoch, epochs_since_last_best_metric 146 | 147 | def save_metrics(self, metrics, fn): 148 | with open(fn, "w+") as f: 149 | for key, value in metrics.items(): 150 | print(key, value, file=f) 151 | 152 | def plot(self, metrics, legend, title, output_fn=None, logx=False, logy=False, metric_name='loss'): 153 | import matplotlib.pyplot as plt 154 | metric_results = metrics[metric_name] 155 | xs = [range(len(metric_results))] * len(metric_results[0]) 156 | ys = list(zip(*metric_results)) 157 | 158 | plt.rcParams.update({'font.size': 40}) 159 | plt.rcParams["figure.figsize"] = (20, 10) 160 | lss = ['-', '--', '-.', ':'] 161 | colors = ['#4053d3', '#ddb310', '#b51d14', '#00beff', '#fb49b0', '#00b25d', '#cacaca'] 162 | colors = [(235, 172, 35), (184, 0, 88), (0, 140, 249), (0, 110, 0), (0, 187, 173), (209, 99, 230), (178, 69, 2), 163 | (255, 146, 135), (89, 84, 214), (0, 198, 248), (135, 133, 0), (0, 167, 108), (189, 189, 189)] 164 | colors = [[p / 255 for p in c] for c in colors] 165 | for i in range(len(ys)): 166 | plt.plot(xs[i], ys[i], lw=4, color=colors[i]) 167 | plt.legend(legend, loc=1, fontsize=30) 168 | plt.title(title) 169 | 170 | plt.xscale('log') if logx else None 171 | plt.yscale('log') if logy else None 172 | plt.xlabel('Iteration') 173 | plt.ylabel('RMSE') 174 | plt.grid() 175 | plt.tight_layout() 176 | 177 | plt.savefig(output_fn, bbox_inches='tight') if output_fn else None 178 | plt.show() 179 | 180 | def plot_interactive(self, metrics, legend, title, logx=False, logy=False, metric_name='loss', start_from=0): 181 | import plotly.graph_objects as go 182 | metric_results = metrics[metric_name] 183 | xs = [list(range(len(metric_results)))] * len(metric_results[0]) 184 | ys = list(zip(*metric_results)) 185 | 186 | fig = go.Figure() 187 | for i in range(len(ys)): 188 | fig.add_trace(go.Scatter(x=xs[i][start_from:], y=ys[i][start_from:], 189 | mode='lines+markers', 190 | name=legend[i])) 191 | 192 | fig.update_layout( 193 | title=title, 194 | title_x=0.5, 195 | xaxis_title='Epoch', 196 | yaxis_title='RMSE', 197 | font=dict( 198 | size=40, 199 | ), 200 | height=600, 201 | ) 202 | 203 | if logx: 204 | fig.update_layout(xaxis_type="log") 205 | if logy: 206 | fig.update_layout(yaxis_type="log") 207 | 208 | fig.show() 209 | -------------------------------------------------------------------------------- /models/GBDT.py: -------------------------------------------------------------------------------- 1 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor 2 | import time 3 | from sklearn.metrics import mean_squared_error, accuracy_score, r2_score 4 | import numpy as np 5 | from collections import defaultdict as ddict 6 | import lightgbm 7 | from lightgbm import LGBMClassifier, LGBMRegressor 8 | 9 | class GBDTCatBoost: 10 | def __init__(self, task='regression', depth=6, lr=0.1, l2_leaf_reg=None, max_bin=None): 11 | self.task = task 12 | self.depth = depth 13 | self.learning_rate = lr 14 | self.l2_leaf_reg = l2_leaf_reg 15 | self.max_bin = max_bin 16 | 17 | 18 | def init_model(self, num_epochs, patience): 
19 | catboost_model_obj = CatBoostRegressor if self.task == 'regression' else CatBoostClassifier 20 | self.catboost_loss_function = 'RMSE' if self.task == 'regression' else 'MultiClass' 21 | self.custom_metrics = ['R2'] if self.task == 'regression' else ['Accuracy'] 22 | # ['Accuracy', 'AUC', 'Precision', 'Recall', 'F1', 'MCC', 'R2'], 23 | 24 | self.model = catboost_model_obj(iterations=num_epochs, 25 | depth=self.depth, 26 | learning_rate=self.learning_rate, 27 | loss_function=self.catboost_loss_function, 28 | custom_metric=self.custom_metrics, 29 | random_seed=0, 30 | early_stopping_rounds=patience, 31 | l2_leaf_reg=self.l2_leaf_reg, 32 | max_bin=self.max_bin, 33 | nan_mode='Min') 34 | 35 | def get_metrics(self): 36 | d = self.model.evals_result_ 37 | metrics = ddict(list) 38 | keys = ['learn', 'validation_0', 'validation_1'] \ 39 | if 'validation_0' in self.model.evals_result_ \ 40 | else ['learn', 'validation'] 41 | for metric_name in d[keys[0]]: 42 | perf = [d[key][metric_name] for key in keys] 43 | if metric_name == self.catboost_loss_function: 44 | metrics['loss'] = list(zip(*perf)) 45 | else: 46 | metrics[metric_name.lower()] = list(zip(*perf)) 47 | 48 | return metrics 49 | 50 | def get_test_metric(self, metrics, metric_name): 51 | if metric_name == 'loss': 52 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 53 | else: 54 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 55 | min_metric = metrics[metric_name][val_epoch] 56 | return min_metric, val_epoch 57 | 58 | def save_metrics(self, metrics, fn): 59 | with open(fn, "w+") as f: 60 | for key, value in metrics.items(): 61 | print(key, value, file=f) 62 | 63 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 64 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 65 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 66 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 67 | return X_train, y_train, X_val, y_val, X_test, y_test 68 | 69 | def fit(self, 70 | X, y, train_mask, val_mask, test_mask, 71 | cat_features=None, num_epochs=1000, patience=200, 72 | plot=False, verbose=False, 73 | loss_fn="", metric_name='loss'): 74 | 75 | X_train, y_train, X_val, y_val, X_test, y_test = \ 76 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 77 | self.init_model(num_epochs, patience) 78 | 79 | start = time.time() 80 | pool = Pool(X_train, y_train, cat_features=cat_features) 81 | eval_set = [(X_val, y_val), (X_test, y_test)] 82 | self.model.fit(pool, eval_set=eval_set, plot=plot, verbose=verbose) 83 | finish = time.time() 84 | 85 | num_trees = self.model.tree_count_ 86 | print('Finished training. 
Total time: {:.2f} | Number of trees: {:d} | Time per tree: {:.2f}'.format(finish - start, num_trees, (time.time() - start )/num_trees)) 87 | 88 | metrics = self.get_metrics() 89 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 90 | if loss_fn: 91 | self.save_metrics(metrics, loss_fn) 92 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 93 | return metrics 94 | 95 | def predict(self, X_test, y_test): 96 | pred = self.model.predict(X_test) 97 | 98 | metrics = {} 99 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 100 | 101 | return metrics 102 | 103 | 104 | class GBDTLGBM: 105 | def __init__(self, task='regression', lr=0.1, num_leaves=31, max_bin=255, 106 | lambda_l1=0., lambda_l2=0., boosting='gbdt'): 107 | self.task = task 108 | self.boosting = boosting 109 | self.learning_rate = lr 110 | self.num_leaves = num_leaves 111 | self.max_bin = max_bin 112 | self.lambda_l1 = lambda_l1 113 | self.lambda_l2 = lambda_l2 114 | 115 | def accuracy(self, preds, train_data): 116 | labels = train_data.get_label() 117 | preds_classes = preds.reshape((preds.shape[0]//labels.shape[0], labels.shape[0])).argmax(0) 118 | return 'accuracy', accuracy_score(labels, preds_classes), True 119 | 120 | def r2(self, preds, train_data): 121 | labels = train_data.get_label() 122 | return 'r2', r2_score(labels, preds), True 123 | 124 | def init_model(self): 125 | 126 | self.parameters = { 127 | 'objective': 'regression' if self.task == 'regression' else 'multiclass', 128 | 'metric': {'rmse'} if self.task == 'regression' else {'multiclass'}, 129 | 'num_classes': self.num_classes, 130 | 'boosting': self.boosting, 131 | 'num_leaves': self.num_leaves, 132 | 'max_bin': self.max_bin, 133 | 'learning_rate': self.learning_rate, 134 | 'lambda_l1': self.lambda_l1, 135 | 'lambda_l2': self.lambda_l2, 136 | # 'num_threads': 1, 137 | # 'feature_fraction': 0.9, 138 | # 'bagging_fraction': 0.8, 139 | # 'bagging_freq': 5, 140 | 'verbose': 1, 141 | # 'device_type': 'gpu' 142 | } 143 | self.evals_result = dict() 144 | 145 | def get_metrics(self): 146 | d = self.evals_result 147 | metrics = ddict(list) 148 | keys = ['training', 'valid_1', 'valid_2'] \ 149 | if 'training' in d \ 150 | else ['valid_0', 'valid_1'] 151 | for metric_name in d[keys[0]]: 152 | perf = [d[key][metric_name] for key in keys] 153 | if metric_name in ['regression', 'multiclass', 'rmse', 'l2', 'multi_logloss', 'binary_logloss']: 154 | metrics['loss'] = list(zip(*perf)) 155 | else: 156 | metrics[metric_name] = list(zip(*perf)) 157 | return metrics 158 | 159 | def get_test_metric(self, metrics, metric_name): 160 | if metric_name == 'loss': 161 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 162 | else: 163 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 164 | min_metric = metrics[metric_name][val_epoch] 165 | return min_metric, val_epoch 166 | 167 | def save_metrics(self, metrics, fn): 168 | with open(fn, "w+") as f: 169 | for key, value in metrics.items(): 170 | print(key, value, file=f) 171 | 172 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 173 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 174 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 175 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 176 | return X_train, y_train, X_val, y_val, X_test, y_test 177 | 178 | def fit(self, 179 | X, y, train_mask, val_mask, test_mask, 180 | cat_features=None, num_epochs=1000, patience=200, 181 | loss_fn="", 
metric_name='loss'): 182 | 183 | if cat_features is not None: 184 | X = X.copy() 185 | for col in list(X.columns[cat_features]): 186 | X[col] = X[col].astype('category') 187 | 188 | X_train, y_train, X_val, y_val, X_test, y_test = \ 189 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 190 | self.num_classes = None if self.task == 'regression' else len(set(y.iloc[:, 0])) 191 | self.init_model() 192 | 193 | start = time.time() 194 | train_data = lightgbm.Dataset(X_train, label=y_train) 195 | val_data = lightgbm.Dataset(X_val, label=y_val) 196 | test_data = lightgbm.Dataset(X_test, label=y_test) 197 | 198 | self.model = lightgbm.train(self.parameters, 199 | train_data, 200 | valid_sets=[train_data, val_data, test_data], 201 | num_boost_round=num_epochs, 202 | early_stopping_rounds=patience, 203 | evals_result=self.evals_result, 204 | feval=self.r2 if self.task == 'regression' else self.accuracy, 205 | verbose_eval=1) 206 | finish = time.time() 207 | 208 | print('Finished training. Total time: {:.2f}'.format(finish - start)) 209 | 210 | metrics = self.get_metrics() 211 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 212 | if loss_fn: 213 | self.save_metrics(metrics, loss_fn) 214 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 215 | return metrics 216 | 217 | def predict(self, X_test, y_test): 218 | pred = self.model.predict(X_test) 219 | 220 | metrics = {} 221 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 222 | 223 | return metrics -------------------------------------------------------------------------------- /models/GNN.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch 4 | from torch.nn import Dropout, ELU 5 | import torch.nn.functional as F 6 | from torch import nn 7 | from dgl.nn.pytorch import GATConv as GATConvDGL, GraphConv, ChebConv as ChebConvDGL, \ 8 | AGNNConv as AGNNConvDGL, APPNPConv 9 | from torch.nn import Sequential, Linear, ReLU, Identity 10 | from tqdm import tqdm 11 | from .Base import BaseModel 12 | from torch.autograd import Variable 13 | from collections import defaultdict as ddict 14 | from .MLP import MLPRegressor 15 | 16 | 17 | class ElementWiseLinear(nn.Module): 18 | def __init__(self, size, weight=True, bias=True, inplace=False): 19 | super().__init__() 20 | if weight: 21 | self.weight = nn.Parameter(torch.Tensor(size)) 22 | else: 23 | self.weight = None 24 | if bias: 25 | self.bias = nn.Parameter(torch.Tensor(size)) 26 | else: 27 | self.bias = None 28 | self.inplace = inplace 29 | 30 | self.reset_parameters() 31 | 32 | def reset_parameters(self): 33 | if self.weight is not None: 34 | nn.init.ones_(self.weight) 35 | if self.bias is not None: 36 | nn.init.zeros_(self.bias) 37 | 38 | def forward(self, x): 39 | if self.inplace: 40 | if self.weight is not None: 41 | x.mul_(self.weight) 42 | if self.bias is not None: 43 | x.add_(self.bias) 44 | else: 45 | if self.weight is not None: 46 | x = x * self.weight 47 | if self.bias is not None: 48 | x = x + self.bias 49 | return x 50 | 51 | class GATDGL(torch.nn.Module): 52 | ''' 53 | Implementation of leaderboard GAT network for OGB datasets. 
54 | https://github.com/Espylapiza/dgl/blob/master/examples/pytorch/ogb/ogbn-arxiv/models.py 55 | ''' 56 | def __init__( 57 | self, 58 | in_feats, 59 | n_classes, 60 | n_layers=3, 61 | n_heads=3, 62 | activation=F.relu, 63 | n_hidden=250, 64 | dropout=0.75, 65 | input_drop=0.1, 66 | attn_drop=0.0, 67 | ): 68 | super().__init__() 69 | self.in_feats = in_feats 70 | self.n_hidden = n_hidden 71 | self.n_classes = n_classes 72 | self.n_layers = n_layers 73 | self.num_heads = n_heads 74 | 75 | self.convs = torch.nn.ModuleList() 76 | self.norms = torch.nn.ModuleList() 77 | 78 | for i in range(n_layers): 79 | in_hidden = n_heads * n_hidden if i > 0 else in_feats 80 | out_hidden = n_hidden if i < n_layers - 1 else n_classes 81 | num_heads = n_heads if i < n_layers - 1 else 1 82 | out_channels = n_heads 83 | 84 | self.convs.append( 85 | GATConvDGL( 86 | in_hidden, 87 | out_hidden, 88 | num_heads=num_heads, 89 | attn_drop=attn_drop, 90 | residual=True, 91 | ) 92 | ) 93 | 94 | if i < n_layers - 1: 95 | self.norms.append(torch.nn.BatchNorm1d(out_channels * out_hidden)) 96 | 97 | self.bias_last = ElementWiseLinear(n_classes, weight=False, bias=True, inplace=True) 98 | 99 | self.input_drop = nn.Dropout(input_drop) 100 | self.dropout = nn.Dropout(dropout) 101 | self.activation = activation 102 | 103 | def forward(self, graph, feat): 104 | h = feat 105 | h = self.input_drop(h) 106 | 107 | for i in range(self.n_layers): 108 | conv = self.convs[i](graph, h) 109 | 110 | h = conv 111 | 112 | if i < self.n_layers - 1: 113 | h = h.flatten(1) 114 | h = self.norms[i](h) 115 | h = self.activation(h, inplace=True) 116 | h = self.dropout(h) 117 | 118 | h = h.mean(1) 119 | h = self.bias_last(h) 120 | 121 | return h 122 | 123 | 124 | 125 | class GNNModelDGL(torch.nn.Module): 126 | def __init__(self, in_dim, hidden_dim, out_dim, 127 | dropout=0., name='gat', residual=True, use_mlp=False, join_with_mlp=False): 128 | super(GNNModelDGL, self).__init__() 129 | self.name = name 130 | self.use_mlp = use_mlp 131 | self.join_with_mlp = join_with_mlp 132 | self.normalize_input_columns = True 133 | if use_mlp: 134 | self.mlp = MLPRegressor(in_dim, hidden_dim, out_dim) 135 | if join_with_mlp: 136 | in_dim += out_dim 137 | else: 138 | in_dim = out_dim 139 | if name == 'gat': 140 | self.l1 = GATConvDGL(in_dim, hidden_dim//8, 8, feat_drop=dropout, attn_drop=dropout, residual=False, 141 | activation=F.elu) 142 | self.l2 = GATConvDGL(hidden_dim, out_dim, 1, feat_drop=dropout, attn_drop=dropout, residual=residual, activation=None) 143 | elif name == 'gcn': 144 | self.l1 = GraphConv(in_dim, hidden_dim, activation=F.elu) 145 | self.l2 = GraphConv(hidden_dim, out_dim, activation=F.elu) 146 | self.drop = Dropout(p=dropout) 147 | elif name == 'cheb': 148 | self.l1 = ChebConvDGL(in_dim, hidden_dim, k = 3) 149 | self.l2 = ChebConvDGL(hidden_dim, out_dim, k = 3) 150 | self.drop = Dropout(p=dropout) 151 | elif name == 'agnn': 152 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), ELU()) 153 | self.l1 = AGNNConvDGL(learn_beta=False) 154 | self.l2 = AGNNConvDGL(learn_beta=True) 155 | self.lin2 = Sequential(Dropout(p=dropout), Linear(hidden_dim, out_dim), ELU()) 156 | elif name == 'appnp': 157 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), 158 | ReLU(), Dropout(p=dropout), Linear(hidden_dim, out_dim)) 159 | self.l1 = APPNPConv(k=10, alpha=0.1, edge_drop=0.) 
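
Each `name` above selects a small two-layer stack that `forward` (next) dispatches on. A hedged smoke test on a random toy graph follows; the CPU build of DGL and the import path are assumptions for illustration, not something this file prescribes.

```python
# Hypothetical usage sketch for GNNModelDGL (toy graph, CPU DGL assumed;
# adjust the import to models.GNN or bgnn.models.GNN depending on layout).
import dgl
import torch
from models.GNN import GNNModelDGL

g = dgl.add_self_loop(dgl.rand_graph(10, 20))  # 10 nodes, 20 random edges
feats = torch.randn(10, 16)
model = GNNModelDGL(in_dim=16, hidden_dim=64, out_dim=3, name='gcn')
logits = model(g, feats)
print(logits.shape)                            # expected: torch.Size([10, 3])
```
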
160 | 161 | 162 | def forward(self, graph, features): 163 | h = features 164 | if self.use_mlp: 165 | if self.join_with_mlp: 166 | h = torch.cat((h, self.mlp(features)), 1) 167 | else: 168 | h = self.mlp(features) 169 | if self.name == 'gat': 170 | h = self.l1(graph, h).flatten(1) 171 | logits = self.l2(graph, h).mean(1) 172 | elif self.name in ['appnp']: 173 | h = self.lin1(h) 174 | logits = self.l1(graph, h) 175 | elif self.name == 'agnn': 176 | h = self.lin1(h) 177 | h = self.l1(graph, h) 178 | h = self.l2(graph, h) 179 | logits = self.lin2(h) 180 | elif self.name in ['gcn', 'cheb']: 181 | h = self.drop(h) 182 | h = self.l1(graph, h) 183 | logits = self.l2(graph, h) 184 | 185 | 186 | return logits 187 | 188 | class GNN(BaseModel): 189 | def __init__(self, task='regression', lr=0.01, hidden_dim=64, dropout=0., 190 | name='gat', residual=True, lang='dgl', 191 | gbdt_predictions=None, mlp=False, use_leaderboard=False, only_gbdt=False): 192 | super(GNN, self).__init__() 193 | 194 | self.dropout = dropout 195 | self.learning_rate = lr 196 | self.hidden_dim = hidden_dim 197 | self.task = task 198 | self.model_name = name 199 | self.use_residual = residual 200 | self.lang = lang 201 | self.use_mlp = mlp 202 | self.use_leaderboard = use_leaderboard 203 | self.gbdt_predictions = gbdt_predictions 204 | self.only_gbdt = only_gbdt 205 | 206 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 207 | 208 | def __name__(self): 209 | if self.gbdt_predictions is None: 210 | return 'GNN' 211 | else: 212 | return 'ResGNN' 213 | 214 | def init_model(self): 215 | if self.lang == 'pyg': 216 | self.model = GNNModelPYG(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 217 | heads=self.heads, dropout=self.dropout, name=self.model_name, 218 | residual=self.use_residual).to(self.device) 219 | elif self.lang == 'dgl': 220 | if self.use_leaderboard: 221 | self.model = GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 222 | else: 223 | self.model = GNNModelDGL(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 224 | dropout=self.dropout, name=self.model_name, 225 | residual=self.use_residual, use_mlp=self.use_mlp, 226 | join_with_mlp=self.use_mlp).to(self.device) 227 | 228 | def init_node_features(self, X, optimize_node_features): 229 | node_features = Variable(X, requires_grad=optimize_node_features) 230 | return node_features 231 | 232 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, num_epochs, 233 | cat_features=None, patience=200, logging_epochs=1, optimize_node_features=False, 234 | loss_fn=None, metric_name='loss', normalize_features=True, replace_na=True): 235 | 236 | # initialize for early stopping and metrics 237 | if metric_name in ['r2', 'accuracy']: 238 | best_metric = [np.float('-inf')] * 3 # for train/val/test 239 | else: 240 | best_metric = [np.float('inf')] * 3 # for train/val/test 241 | best_val_epoch = 0 242 | epochs_since_last_best_metric = 0 243 | metrics = ddict(list) # metric_name -> (train/val/test) 244 | if cat_features is None: 245 | cat_features = [] 246 | 247 | if self.gbdt_predictions is not None: 248 | X = X.copy() 249 | X['predict'] = self.gbdt_predictions 250 | if self.only_gbdt: 251 | cat_features = [] 252 | X = X[['predict']] 253 | 254 | self.in_dim = X.shape[1] 255 | self.hidden_dim = self.hidden_dim 256 | if self.task == 'regression': 257 | self.out_dim = y.shape[1] 258 | elif self.task == 'classification': 259 | self.out_dim = len(set(y.iloc[:, 0])) 260 | 261 | if 
len(cat_features): 262 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 263 | if normalize_features: 264 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 265 | if replace_na: 266 | X = self.replace_na(X, train_mask) 267 | 268 | X, y = self.pandas_to_torch(X, y) 269 | if len(X.shape) == 1: 270 | X = X.unsqueeze(1) 271 | 272 | if self.lang == 'dgl': 273 | graph = self.networkx_to_torch(networkx_graph) 274 | elif self.lang == 'pyg': 275 | graph = self.networkx_to_torch2(networkx_graph) 276 | self.init_model() 277 | node_features = self.init_node_features(X, optimize_node_features) 278 | 279 | self.node_features = node_features 280 | self.graph = graph 281 | optimizer = self.init_optimizer(node_features, optimize_node_features, self.learning_rate) 282 | 283 | pbar = tqdm(range(num_epochs)) 284 | for epoch in pbar: 285 | start2epoch = time.time() 286 | 287 | model_in = (graph, node_features) 288 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 289 | metrics, gnn_passes_per_epoch=1) 290 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 291 | metric_name=metric_name) 292 | 293 | # check early stopping 294 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 295 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 296 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 297 | if patience and epochs_since_last_best_metric > patience: 298 | break 299 | 300 | if loss_fn: 301 | self.save_metrics(metrics, loss_fn) 302 | 303 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 304 | return metrics 305 | 306 | def predict(self, graph, node_features, target_labels, test_mask): 307 | return self.evaluate_model((graph, node_features), target_labels, test_mask) -------------------------------------------------------------------------------- /models/MLP.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import numpy as np 5 | import time 6 | from tqdm import tqdm 7 | from .Base import BaseModel 8 | from sklearn.metrics import r2_score 9 | from collections import defaultdict as ddict 10 | 11 | class MLPClassifier(torch.nn.Module): 12 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 13 | super(MLPClassifier, self).__init__() 14 | 15 | self.lins = torch.nn.ModuleList() 16 | self.lins.append(torch.nn.Linear(in_dim, hidden_dim)) 17 | self.bns = torch.nn.ModuleList() 18 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 19 | for _ in range(num_layers - 2): 20 | self.lins.append(torch.nn.Linear(hidden_dim, hidden_dim)) 21 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 22 | self.lins.append(torch.nn.Linear(hidden_dim, out_dim)) 23 | 24 | self.dropout = dropout 25 | 26 | def reset_parameters(self): 27 | for lin in self.lins: 28 | lin.reset_parameters() 29 | 30 | def forward(self, x): 31 | for i, lin in enumerate(self.lins[:-1]): 32 | x = lin(x) 33 | x = self.bns[i](x) 34 | x = F.relu(x) 35 | x = F.dropout(x, p=self.dropout, training=self.training) 36 | x = self.lins[-1](x) 37 | return x 38 | 39 | 40 | class MLPRegressor(nn.Module): 41 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 42 | super(MLPRegressor, self).__init__() 43 | 44 | self.layers = nn.Sequential( 45 | 
nn.Linear(in_dim, hidden_dim), 46 | nn.ReLU(), 47 | nn.Dropout(p=dropout), 48 | nn.Linear(hidden_dim, hidden_dim), 49 | nn.ReLU(), 50 | nn.Dropout(p=dropout), 51 | nn.Linear(hidden_dim, out_dim) 52 | ) 53 | 54 | def forward(self, x): 55 | return self.layers(x) 56 | 57 | 58 | class MLP(BaseModel): 59 | def __init__(self, task='regression', num_layers=3, dropout=0., lr=0.01, hidden_dim=128): 60 | super(MLP, self).__init__() 61 | self.task = task 62 | self.num_layers = num_layers 63 | self.dropout = dropout 64 | self.learning_rate = lr 65 | self.hidden_dim = hidden_dim 66 | 67 | 68 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 69 | 70 | def __name__(self): 71 | return 'MLP' 72 | 73 | def init_model(self): 74 | # mlp_model = MLPRegressor if self.task == 'regression' else MLPClassifier 75 | mlp_model = MLPClassifier 76 | self.model = mlp_model(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 77 | num_layers=self.num_layers, dropout=self.dropout).to( 78 | self.device) 79 | 80 | def fit(self, X, y, train_mask, val_mask, test_mask, cat_features=None, 81 | num_epochs=1000, patience=200, 82 | logging_epochs=1, loss_fn=None, 83 | metric_name='loss', normalize_features=True, replace_na=True): 84 | 85 | # initialize for early stopping and metrics 86 | if metric_name in ['r2', 'accuracy']: 87 | best_metric = [np.float('-inf')] * 3 # for train/val/test 88 | else: 89 | best_metric = [np.float('inf')] * 3 # for train/val/test 90 | best_val_epoch = 0 91 | epochs_since_last_best_metric = 0 92 | metrics = ddict(list) # metric_name -> (train/val/test) 93 | if cat_features is None: 94 | cat_features = [] 95 | 96 | self.in_dim = X.shape[1] 97 | self.hidden_dim = self.hidden_dim 98 | if self.task == 'regression': 99 | self.out_dim = y.shape[1] 100 | elif self.task == 'classification': 101 | self.out_dim = len(set(y.iloc[:, 0])) 102 | 103 | 104 | if len(cat_features): 105 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 106 | if normalize_features: 107 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 108 | if replace_na: 109 | X = self.replace_na(X, train_mask) 110 | 111 | X, y = self.pandas_to_torch(X, y) 112 | if len(X.shape) == 1: 113 | X = X.unsqueeze(dim=1) 114 | 115 | self.init_model() 116 | optimizer = self.init_optimizer(None, False, learning_rate=self.learning_rate) 117 | 118 | pbar = tqdm(range(num_epochs)) 119 | for epoch in pbar: 120 | 121 | start2epoch = time.time() 122 | 123 | model_in = (X,) 124 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 125 | metrics, gnn_passes_per_epoch=1) 126 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 127 | metric_name=metric_name) 128 | 129 | # check early stopping 130 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 131 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 132 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 133 | if patience and epochs_since_last_best_metric > patience: 134 | break 135 | 136 | if loss_fn: 137 | self.save_metrics(metrics, loss_fn) 138 | 139 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 140 | return metrics 141 | 142 | def predict(self, X, target_labels, test_mask): 143 | return self.evaluate_model((X,), target_labels, test_mask) 
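
One detail worth flagging in `MLP.py` above: `init_model` instantiates `MLPClassifier` for both tasks (the commented-out line shows the `MLPRegressor` alternative). Because the network ends in a plain linear layer and `BaseModel.train_model` chooses the loss by task, the same architecture serves regression too. A hedged end-to-end sketch on synthetic data; the sizes, targets, and import path are illustrative assumptions.

```python
# Hypothetical quick-start for the MLP wrapper on synthetic data (CPU).
import numpy as np
import pandas as pd
from models.MLP import MLP  # or bgnn.models.MLP, depending on installation

X = pd.DataFrame(np.random.randn(100, 8))
y = pd.DataFrame(np.random.rand(100, 1))          # single regression target
train, val, test = list(range(60)), list(range(60, 80)), list(range(80, 100))

model = MLP(task='regression', num_layers=2, hidden_dim=64, lr=0.01)
metrics = model.fit(X, y, train, val, test, num_epochs=100, patience=20)
print(min(metrics['loss'], key=lambda t: t[1]))   # best (train, val, test) RMSE by val
```
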
-------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nd7141/bgnn/11290bc8ec5427faa1cb48ec51d947d5f6624b60/models/__init__.py -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.19.4 2 | plotly==4.14.1 3 | catboost==0.24.4 4 | lightgbm==3.0.0 5 | networkx==2.5 6 | matplotlib==3.3.3 7 | pandas==1.1.5 8 | tqdm==4.55.1 9 | fire==0.3.1 10 | omegaconf==2.0.5 11 | category_encoders==2.2.2 12 | scikit_learn==0.24.0 13 | -------------------------------------------------------------------------------- /scripts/run.py: -------------------------------------------------------------------------------- 1 | # from catboost import CatboostError 2 | # import sys 3 | # sys.path.append('../') 4 | 5 | from bgnn.models.GBDT import GBDTCatBoost, GBDTLGBM 6 | from bgnn.models.MLP import MLP 7 | from bgnn.models.GNN import GNN 8 | from bgnn.models.BGNN import BGNN 9 | from bgnn.scripts.utils import NpEncoder 10 | 11 | import os 12 | import json 13 | import time 14 | import datetime 15 | from pathlib import Path 16 | from collections import defaultdict as ddict 17 | 18 | import pandas as pd 19 | import networkx as nx 20 | import random 21 | import numpy as np 22 | import fire 23 | from omegaconf import OmegaConf 24 | from sklearn.model_selection import ParameterGrid 25 | 26 | 27 | class RunModel: 28 | def read_input(self, input_folder): 29 | self.X = pd.read_csv(f'{input_folder}/X.csv') 30 | self.y = pd.read_csv(f'{input_folder}/y.csv') 31 | 32 | networkx_graph = nx.read_graphml(f'{input_folder}/graph.graphml') 33 | networkx_graph = nx.relabel_nodes(networkx_graph, {str(i): i for i in range(len(networkx_graph))}) 34 | self.networkx_graph = networkx_graph 35 | 36 | categorical_columns = [] 37 | if os.path.exists(f'{input_folder}/cat_features.txt'): 38 | with open(f'{input_folder}/cat_features.txt') as f: 39 | for line in f: 40 | if line.strip(): 41 | categorical_columns.append(line.strip()) 42 | 43 | self.cat_features = None 44 | if categorical_columns: 45 | columns = self.X.columns 46 | self.cat_features = np.where(columns.isin(categorical_columns))[0] 47 | 48 | for col in list(columns[self.cat_features]): 49 | self.X[col] = self.X[col].astype(str) 50 | 51 | 52 | if os.path.exists(f'{input_folder}/masks.json'): 53 | with open(f'{input_folder}/masks.json') as f: 54 | self.masks = json.load(f) 55 | else: 56 | print('Creating and saving train/val/test masks') 57 | idx = list(range(self.y.shape[0])) 58 | self.masks = dict() 59 | for i in range(self.max_seeds): 60 | random.shuffle(idx) 61 | r1, r2, r3 = idx[:int(.6*len(idx))], idx[int(.6*len(idx)):int(.8*len(idx))], idx[int(.8*len(idx)):] 62 | self.masks[str(i)] = {"train": r1, "val": r2, "test": r3} 63 | 64 | with open(f'{input_folder}/masks.json', 'w+') as f: 65 | json.dump(self.masks, f, cls=NpEncoder) 66 | 67 | 68 | def get_input(self, dataset_dir, dataset: str): 69 | if dataset == 'house': 70 | input_folder = dataset_dir / 'house' 71 | elif dataset == 'county': 72 | input_folder = dataset_dir / 'county' 73 | elif dataset == 'vk': 74 | input_folder = dataset_dir / 'vk' 75 | elif dataset == 'wiki': 76 | input_folder = dataset_dir / 'wiki' 77 | elif dataset == 'avazu': 78 | input_folder = dataset_dir / 'avazu' 79 | elif dataset == 'vk_class': 80 | input_folder = 
dataset_dir / 'vk_class' 81 | elif dataset == 'house_class': 82 | input_folder = dataset_dir / 'house_class' 83 | elif dataset == 'dblp': 84 | input_folder = dataset_dir / 'dblp' 85 | elif dataset == 'slap': 86 | input_folder = dataset_dir / 'slap' 87 | else: 88 | input_folder = dataset 89 | 90 | if self.save_folder is None: 91 | self.save_folder = f'results/{dataset}/{datetime.datetime.now().strftime("%d_%m")}' 92 | 93 | self.read_input(input_folder) 94 | print('Save to folder:', self.save_folder) 95 | 96 | 97 | def run_one_model(self, config_fn, model_name): 98 | self.config = OmegaConf.load(config_fn) 99 | grid = ParameterGrid(dict(self.config.hp)) 100 | 101 | for ps in grid: 102 | param_string = ''.join([f'-{key}{ps[key]}' for key in ps]) 103 | exp_name = f'{model_name}{param_string}' 104 | print(f'\nSeed {self.seed} RUNNING:{exp_name}') 105 | 106 | runs = [] 107 | runs_custom = [] 108 | times = [] 109 | for _ in range(self.repeat_exp): 110 | start = time.time() 111 | model = self.define_model(model_name, ps) 112 | 113 | inputs = {'X': self.X, 'y': self.y, 'train_mask': self.train_mask, 114 | 'val_mask': self.val_mask, 'test_mask': self.test_mask, 'cat_features': self.cat_features} 115 | if model_name in ['gnn', 'resgnn', 'bgnn']: 116 | inputs['networkx_graph'] = self.networkx_graph 117 | 118 | metrics = model.fit(num_epochs=self.config.num_epochs, patience=self.config.patience, 119 | loss_fn=f"{self.seed_folder}/{exp_name}.txt", 120 | metric_name='loss' if self.task == 'regression' else 'accuracy', **inputs) 121 | finish = time.time() 122 | best_loss = min(metrics['loss'], key=lambda x: x[1]) 123 | best_custom = max(metrics['r2' if self.task == 'regression' else 'accuracy'], key=lambda x: x[1]) 124 | runs.append(best_loss) 125 | runs_custom.append(best_custom) 126 | times.append(finish - start) 127 | self.store_results[exp_name] = (list(map(np.mean, zip(*runs))), 128 | list(map(np.mean, zip(*runs_custom))), 129 | np.mean(times), 130 | ) 131 | 132 | def define_model(self, model_name, ps): 133 | if model_name == 'catboost': 134 | return GBDTCatBoost(self.task, **ps) 135 | elif model_name == 'lightgbm': 136 | return GBDTLGBM(self.task, **ps) 137 | elif model_name == 'mlp': 138 | return MLP(self.task, **ps) 139 | elif model_name == 'gnn': 140 | return GNN(self.task, **ps) 141 | elif model_name == 'resgnn': 142 | gbdt = GBDTCatBoost(self.task) 143 | gbdt.fit(self.X, self.y, self.train_mask, self.val_mask, self.test_mask, 144 | cat_features=self.cat_features, 145 | num_epochs=1000, patience=100, 146 | plot=False, verbose=False, loss_fn=None, 147 | metric_name='loss' if self.task == 'regression' else 'accuracy') 148 | return GNN(task=self.task, gbdt_predictions=gbdt.model.predict(self.X), **ps) 149 | elif model_name == 'bgnn': 150 | return BGNN(self.task, **ps) 151 | 152 | def create_save_folder(self, seed): 153 | self.seed_folder = f'{self.save_folder}/{seed}' 154 | os.makedirs(self.seed_folder, exist_ok=True) 155 | 156 | def split_masks(self, seed): 157 | self.train_mask, self.val_mask, self.test_mask = self.masks[seed]['train'], \ 158 | self.masks[seed]['val'], self.masks[seed]['test'] 159 | 160 | def save_results(self, seed): 161 | self.seed_results[seed] = self.store_results 162 | with open(f'{self.save_folder}/seed_results.json', 'w+') as f: 163 | json.dump(self.seed_results, f) 164 | 165 | self.aggregated = self.aggregate_results() 166 | with open(f'{self.save_folder}/aggregated_results.json', 'w+') as f: 167 | json.dump(self.aggregated, f) 168 | 169 | def get_model_name(self, 
        # get the name of the model (for gnn-like models, e.g. gat)
        if 'name' in exp_name:
            model_name = '-' + [param[4:] for param in exp_name.split('-') if param.startswith('name')][0]
        else:
            model_name = ''

        # check whether the model used an MLP (e.g. MLP-GNN)
        if 'gnn' in exp_name and 'mlpTrue' in exp_name:
            model_name += '-MLP'

        # algo corresponds to the type of the model (e.g. gnn, resgnn, bgnn)
        for algo in algos:
            if exp_name.startswith(algo):
                return algo + model_name
        return 'unknown'

    def aggregate_results(self):
        algos = ['catboost', 'lightgbm', 'mlp', 'gnn', 'resgnn', 'bgnn']
        model_best_score = ddict(list)
        model_best_time = ddict(list)

        results = self.seed_results
        for seed in results:
            model_results_for_seed = ddict(list)
            for name, output in results[seed].items():
                model_name = self.get_model_name(name, algos=algos)
                if self.task == 'regression':  # rmse metric
                    val_metric, test_metric, runtime = output[0][1], output[0][2], output[2]
                else:  # accuracy metric
                    val_metric, test_metric, runtime = output[1][1], output[1][2], output[2]
                model_results_for_seed[model_name].append((val_metric, test_metric, runtime))

            # pick the best hyperparameter configuration by validation metric
            for model_name, model_results in model_results_for_seed.items():
                if self.task == 'regression':
                    best_result = min(model_results)  # rmse
                else:
                    best_result = max(model_results)  # accuracy
                model_best_score[model_name].append(best_result[1])
                model_best_time[model_name].append(best_result[2])

        aggregated = dict()
        for model, scores in model_best_score.items():
            aggregated[model] = (np.mean(scores), np.std(scores),
                                 np.mean(model_best_time[model]), np.std(model_best_time[model]))
        return aggregated

    def run(self, dataset: str, *args,
            save_folder: str = None,
            task: str = 'regression',
            repeat_exp: int = 1,
            max_seeds: int = 5,
            dataset_dir: str = None,
            config_dir: str = None
            ):
        start2run = time.time()
        self.repeat_exp = repeat_exp
        self.max_seeds = max_seeds
        print(dataset, args, task, repeat_exp, max_seeds, dataset_dir, config_dir)

        dataset_dir = Path(dataset_dir) if dataset_dir else Path(__file__).parent.parent / 'datasets'
        config_dir = Path(config_dir) if config_dir else Path(__file__).parent.parent / 'configs' / 'model'
        print(dataset_dir, config_dir)

        self.task = task
        self.save_folder = save_folder
        self.get_input(dataset_dir, dataset)

        self.seed_results = dict()
        for ix, seed in enumerate(self.masks):
            print(f'{dataset} Seed {seed}')
            self.seed = seed

            self.create_save_folder(seed)
            self.split_masks(seed)

            self.store_results = dict()
            for arg in args:
                if arg == 'all':
                    self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost")
                    self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm")
                    self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp")
                    self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn")
                    self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn")
                    self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn")
                    break
                elif arg == 'catboost':
                    self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost")
                elif arg == 'lightgbm':
                    self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm")
                elif arg == 'mlp':
                    self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp")
                elif arg == 'gnn':
                    self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn")
                elif arg == 'resgnn':
                    self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn")
                elif arg == 'bgnn':
                    self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn")

            self.save_results(seed)
            if ix + 1 >= max_seeds:
                break

        print(f'Finished {dataset}: {time.time() - start2run} sec.')


if __name__ == '__main__':
    fire.Fire(RunModel().run)
--------------------------------------------------------------------------------
/scripts/utils.py:
--------------------------------------------------------------------------------
import os

from dgl.data import citation_graph as citegrh, TUDataset
import torch as th
from dgl import DGLGraph
import numpy as np
from sklearn.model_selection import KFold
import itertools
from sklearn.preprocessing import OneHotEncoder as OHE
import random
import json


def load_cora_data():
    data = citegrh.load_cora()
    features = th.FloatTensor(data.features)
    labels = th.LongTensor(data.labels)
    train_mask = th.BoolTensor(data.train_mask)
    test_mask = th.BoolTensor(data.test_mask)
    g = DGLGraph(data.graph)
    return g, features, labels, train_mask, test_mask


def get_degree_features(graph):
    return graph.out_degrees().unsqueeze(-1).numpy()


def get_categorical_features(features):
    # class index per node from one-hot features; torch argmax keeps the
    # result a tensor, so .unsqueeze() is available before converting to numpy
    return th.argmax(features, dim=-1).unsqueeze(dim=1).numpy()


def get_random_int_features(shape, num_categories=100):
    return np.random.randint(0, num_categories, size=shape)


def get_random_norm_features(shape):
    return np.random.normal(size=shape)


def get_random_uniform_features(shape):
    return np.random.uniform(-1, 1, size=shape)


def merge_features(*args):
    return np.hstack(args)


def get_train_data(graph, features, num_random_features=10, num_random_categories=100):
    return merge_features(
        get_categorical_features(features),
        get_degree_features(graph),
        get_random_int_features(shape=(features.shape[0], num_random_features), num_categories=num_random_categories),
    )


def save_folds(dataset_name, n_splits=3):
    dataset = TUDataset(dataset_name)
    i = 0
    kfold = KFold(n_splits=n_splits, shuffle=True)
    dir_name = f'kfold_{dataset_name}'
    for trix, teix in kfold.split(range(len(dataset))):
        os.makedirs(f'{dir_name}/fold{i}', exist_ok=True)
        np.savetxt(f'{dir_name}/fold{i}/train.idx', trix, fmt='%i')
        np.savetxt(f'{dir_name}/fold{i}/test.idx', teix, fmt='%i')
        i += 1


def graph_to_node_label(graphs, labels):
    targets = np.array(list(itertools.chain(*[[labels[i]] * graphs[i].number_of_nodes() for i in range(len(graphs))])))
    enc = OHE(dtype=np.float32)
    return np.asarray(enc.fit_transform(targets.reshape(-1, 1)).todense())


def get_masks(N, train_size=0.6, val_size=0.2, random_seed=42):
    # draw a random seed only when none is given, so that seed 0 is honoured
    if random_seed is None:
        seed = random.randint(0, 100)
    else:
        seed = random_seed

    random.seed(seed)

    indices = list(range(N))
    random.shuffle(indices)
    train_mask = indices[:int(train_size * len(indices))]
    val_mask = indices[int(train_size * len(indices)):int((train_size + val_size) * len(indices))]
    train_val_mask = indices[:int((train_size + val_size) * len(indices))]
    test_mask = indices[int((train_size + val_size) * len(indices)):]

    return train_mask, val_mask, train_val_mask, test_mask


class NpEncoder(json.JSONEncoder):
    # JSON encoder that converts numpy scalars and arrays to native Python
    # types, so masks and results can be written with json.dump
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(NpEncoder, self).default(obj)
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
"""Setup script."""
import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

if __name__ == "__main__":

    # Run setup
    setuptools.setup(
        name="bgnn",
        version="0.0.1",
        author="Sergey Ivanov",
        author_email="sergei.ivanov@skolkovotech.ru",
        description="Boosted Graph Neural Networks",
        long_description=long_description,
        long_description_content_type="text/markdown",
        url="https://github.com/nd7141/bgnn",
        packages=setuptools.find_packages(),
        classifiers=[
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3",
            "Programming Language :: Python :: 3.6",
            "Intended Audience :: Developers",
            "Intended Audience :: Science/Research",
            "Topic :: Scientific/Engineering :: Artificial Intelligence",
        ],
        python_requires='>=3.6',
    )
--------------------------------------------------------------------------------
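A minimal usage sketch (illustrative only, not a file in the repository): the helpers in `scripts/utils.py` can produce and serialize index masks the same way `scripts/run.py` does. The node count `N = 100` below is a hypothetical value.

```python
import json
import numpy as np
from bgnn.scripts.utils import get_masks, NpEncoder

N = 100  # hypothetical number of nodes in a dataset
train, val, train_val, test = get_masks(N, train_size=0.6, val_size=0.2, random_seed=42)

# NpEncoder converts numpy arrays to plain lists, so the masks round-trip through JSON
masks = {"0": {"train": np.array(train), "val": np.array(val), "test": np.array(test)}}
print(json.dumps(masks, cls=NpEncoder)[:80])
```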