├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── bgnn
│   ├── models
│   │   ├── BGNN.py
│   │   ├── Base.py
│   │   ├── GBDT.py
│   │   ├── GNN.py
│   │   ├── MLP.py
│   │   └── __init__.py
│   └── scripts
│       ├── run.py
│       └── utils.py
├── configs
│   └── model
│       ├── bgnn.yaml
│       ├── catboost.yaml
│       ├── gnn.yaml
│       ├── lightgbm.yaml
│       ├── mlp.yaml
│       └── resgnn.yaml
├── datasets.zip
├── models
│   ├── BGNN.py
│   ├── Base.py
│   ├── GBDT.py
│   ├── GNN.py
│   ├── MLP.py
│   └── __init__.py
├── requirements.txt
├── scripts
│   ├── run.py
│   └── utils.py
└── setup.py

/.gitignore:
--------------------------------------------------------------------------------
1 | datasets/
2 | *pycache*
3 | results/
4 | *egg*
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 russellsparadox
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | install: ## [Local development, CPU] Upgrade pip, install requirements, install package.
2 | 	python -m pip install -U pip setuptools wheel
3 | 	python -m pip install -r requirements.txt
4 | 	python -m pip install -e .
5 | 
6 | .PHONY: help
7 | 
8 | help: ## Run `make help` to get help on the make commands
9 | 	@grep -E '^[0-9a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### Boosted Graph Neural Networks
2 | The code and data for the ICLR 2021 paper: [Boost then Convolve: Gradient Boosting Meets Graph Neural Networks](https://openreview.net/pdf?id=ebS5NUfoMKL)
3 | 
4 | This code contains implementations of the following models for graphs:
5 | * **CatBoost**
6 | * **LightGBM**
7 | * **Fully-Connected Neural Network** (FCNN)
8 | * **GNN** (GAT, GCN, AGNN, APPNP)
9 | * **FCNN-GNN** (GAT, GCN, AGNN, APPNP)
10 | * **ResGNN** (CatBoost + {GAT, GCN, AGNN, APPNP})
11 | * **BGNN** (end-to-end {CatBoost + {GAT, GCN, AGNN, APPNP}})
12 | 
13 | ## Installation
14 | To run the models, you have to download the repo, install the requirements, and extract the datasets.
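For reference, the whole CPU-only setup condenses to the commands below (a sketch that simply chains the steps explained in the following subsections; if you have a GPU, swap the `+cpu` wheels and the `dgl` package for the CUDA builds listed later):
```bash
mkdir envs && cd envs
python -m venv bgnn_env
source bgnn_env/bin/activate
cd ..
git clone https://github.com/nd7141/bgnn.git
cd bgnn
unzip datasets.zip
make install
pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install dgl
```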
15 | 
16 | First, let's create a Python environment:
17 | ```bash
18 | mkdir envs
19 | cd envs
20 | python -m venv bgnn_env
21 | source bgnn_env/bin/activate
22 | cd ..
23 | ```
24 | ---
25 | Second, let's download the code and install the requirements:
26 | ```bash
27 | git clone https://github.com/nd7141/bgnn.git
28 | cd bgnn
29 | unzip datasets.zip
30 | make install
31 | ```
32 | ---
33 | Next, we need to install proper versions of [PyTorch](https://pytorch.org/) and [DGL](https://www.dgl.ai/), depending on the CUDA version of your machine.
34 | We strongly encourage you to use GPU-supported versions of DGL (the speed-up in training can be 100x).
35 | 
36 | First, determine your CUDA version with `nvcc --version`.
37 | Then, check the installation instructions for [PyTorch](https://pytorch.org/get-started/locally/).
38 | For example, for CUDA version 9.2, install it as follows:
39 | ```bash
40 | pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
41 | ```
42 | 
43 | If you don't have a GPU, use the following:
44 | ```bash
45 | pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
46 | ```
47 | ---
48 | Similarly, you need to install the [DGL library](https://docs.dgl.ai/en/0.4.x/install/).
49 | For example, for cuda==9.2:
50 | 
51 | ```bash
52 | pip install dgl-cu92
53 | ```
54 | 
55 | For the CPU version of DGL:
56 | ```bash
57 | pip install dgl
58 | ```
59 | 
60 | Tested versions of `torch` and `dgl` are:
61 | * torch==1.7.1+cu92
62 | * dgl_cu92==0.5.3
63 | 
64 | ## Running
65 | The starting point is the file `scripts/run.py`:
66 | ```bash
67 | python scripts/run.py dataset models
68 | (optional)
69 | --save_folder: str = None
70 | --task: str = 'regression',
71 | --repeat_exp: int = 1,
72 | --max_seeds: int = 5,
73 | --dataset_dir: str = None,
74 | --config_dir: str = None
75 | ```
76 | Available options for dataset:
77 | * house (regression)
78 | * county (regression)
79 | * vk (regression)
80 | * wiki (regression)
81 | * avazu (regression)
82 | * vk_class (classification)
83 | * house_class (classification)
84 | * dblp (classification)
85 | * slap (classification)
86 | * path/to/your/dataset
87 | 
88 | Available options for models are `catboost`, `lightgbm`, `gnn`, `resgnn`, `bgnn`, and `all`.
89 | 
90 | Each model is specified by its config. Check the [`configs/`](https://github.com/nd7141/bgnn/tree/master/configs/model) folder to set the parameters of the model and the run.
91 | 
92 | Upon completion, the results will be saved in the specified folder (default: `results/{dataset}/day_month/`).
93 | This folder will contain `aggregated_results.json` with the aggregated results for each model.
94 | Each model will have 4 numbers in this order: `mean metric` (RMSE or accuracy), `std metric`, `mean runtime`, `std runtime`.
95 | The file `seed_results.json` will have the results for each experiment and each seed.
96 | Additional folders will contain the loss values during training.
97 | 
98 | ---
99 | 
100 | ### Examples
101 | 
102 | The following script will launch all models on the `House` dataset.
103 | ```bash
104 | python scripts/run.py house all
105 | ```
106 | 
107 | The following script will launch the CatBoost and GNN models on the `SLAP` classification dataset.
108 | ```bash
109 | python scripts/run.py slap catboost gnn --task classification
110 | ```
111 | 
112 | The following script will launch the LightGBM model on 5 splits of the data, repeating each experiment 3 times.
113 | ```bash
114 | python scripts/run.py vk lightgbm --repeat_exp 3 --max_seeds 5
115 | ```
116 | 
117 | The following script will launch the resgnn and bgnn models, saving results to a custom folder.
118 | ```bash
119 | python scripts/run.py county resgnn bgnn --save_folder ./county_resgnn_bgnn
120 | ```
121 | 
122 | ### Running on your dataset
123 | To run the code on your dataset, you need to prepare the files in the right format.
124 | 
125 | You can check examples in the `datasets/` folder.
126 | 
127 | There should be at least `X.csv` (node features), `y.csv` (target labels), and `graph.graphml` (the graph in GraphML format).
128 | 
129 | Make sure to keep _these_ filenames for your dataset.
130 | 
131 | You can also have `cat_features.txt` specifying the names of categorical columns.
132 | 
133 | You can also have `masks.json` specifying train/val/test splits.
134 | 
135 | After that, run the script as usual:
136 | ```bash
137 | python scripts/run.py path/to/your/dataset gnn catboost
138 | ```
139 | 
140 | ## Citation
141 | ```
142 | @inproceedings{
143 | ivanov2021boost,
144 | title={Boost then Convolve: Gradient Boosting Meets Graph Neural Networks},
145 | author={Sergei Ivanov and Liudmila Prokhorenkova},
146 | booktitle={International Conference on Learning Representations (ICLR)},
147 | year={2021},
148 | url={https://openreview.net/forum?id=ebS5NUfoMKL}
149 | }
150 | ```
151 | 
--------------------------------------------------------------------------------
/bgnn/models/BGNN.py:
--------------------------------------------------------------------------------
1 | import itertools
2 | import time
3 | import numpy as np
4 | import torch
5 | 
6 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor, sum_models
7 | from .GNN import GNNModelDGL, GATDGL
8 | from .Base import BaseModel
9 | from tqdm import tqdm
10 | from collections import defaultdict as ddict
11 | 
12 | class BGNN(BaseModel):
13 |     def __init__(self,
14 |                  task='regression', iter_per_epoch=10, lr=0.01, hidden_dim=64, dropout=0.,
15 |                  only_gbdt=False, train_non_gbdt=False,
16 |                  name='gat', use_leaderboard=False, depth=6, gbdt_lr=0.1):
17 |         super(BaseModel, self).__init__()
18 |         self.learning_rate = lr
19 |         self.hidden_dim = hidden_dim
20 |         self.task = task
21 |         self.dropout = dropout
22 |         self.only_gbdt = only_gbdt
23 |         self.train_residual = train_non_gbdt
24 |         self.name = name
25 |         self.use_leaderboard = use_leaderboard
26 |         self.iter_per_epoch = iter_per_epoch
27 |         self.depth = depth
28 |         self.lang = 'dgl'
29 |         self.gbdt_lr = gbdt_lr
30 | 
31 |         self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
32 | 
33 |     def __name__(self):
34 |         return 'BGNN'
35 | 
36 |     def init_gbdt_model(self, num_epochs, epoch):
37 |         if self.task == 'regression':
38 |             catboost_model_obj = CatBoostRegressor
39 |             catboost_loss_fn = 'RMSE'  # 'RMSEWithUncertainty'
40 |         else:
41 |             if epoch == 0:
42 |                 catboost_model_obj = CatBoostClassifier
43 |                 catboost_loss_fn = 'MultiClass'
44 |             else:
45 |                 catboost_model_obj = CatBoostRegressor
46 |                 catboost_loss_fn = 'MultiRMSE'
47 | 
48 |         return catboost_model_obj(iterations=num_epochs,
49 |                                   depth=self.depth,
50 |                                   learning_rate=self.gbdt_lr,
51 |                                   loss_function=catboost_loss_fn,
52 |                                   random_seed=0,
53 |                                   nan_mode='Min')
54 | 
55 |     def fit_gbdt(self, pool, trees_per_epoch, epoch):
56 |         gbdt_model = self.init_gbdt_model(trees_per_epoch, epoch)
57 |         gbdt_model.fit(pool, verbose=False)
58 |         return gbdt_model
59 | 
60 |     def init_gnn_model(self):
61 |         if self.use_leaderboard:
62 |             self.model = 
GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 63 | else: 64 | self.model = GNNModelDGL(in_dim=self.in_dim, 65 | hidden_dim=self.hidden_dim, 66 | out_dim=self.out_dim, 67 | name=self.name, 68 | dropout=self.dropout).to(self.device) 69 | 70 | def append_gbdt_model(self, new_gbdt_model, weights): 71 | if self.gbdt_model is None: 72 | return new_gbdt_model 73 | return sum_models([self.gbdt_model, new_gbdt_model], weights=weights) 74 | 75 | def train_gbdt(self, gbdt_X_train, gbdt_y_train, cat_features, epoch, 76 | gbdt_trees_per_epoch, gbdt_alpha): 77 | 78 | pool = Pool(gbdt_X_train, gbdt_y_train, cat_features=cat_features) 79 | epoch_gbdt_model = self.fit_gbdt(pool, gbdt_trees_per_epoch, epoch) 80 | if epoch == 0 and self.task=='classification': 81 | self.base_gbdt = epoch_gbdt_model 82 | else: 83 | self.gbdt_model = self.append_gbdt_model(epoch_gbdt_model, weights=[1, gbdt_alpha]) 84 | 85 | def update_node_features(self, node_features, X, encoded_X): 86 | if self.task == 'regression': 87 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 88 | # predictions = self.gbdt_model.virtual_ensembles_predict(X, 89 | # virtual_ensembles_count=5, 90 | # prediction_type='TotalUncertainty') 91 | else: 92 | predictions = self.base_gbdt.predict_proba(X) 93 | # predictions = self.base_gbdt.predict(X, prediction_type='RawFormulaVal') 94 | if self.gbdt_model is not None: 95 | predictions_after_one = self.gbdt_model.predict(X) 96 | predictions += predictions_after_one 97 | 98 | if not self.only_gbdt: 99 | if self.train_residual: 100 | predictions = np.append(node_features.detach().cpu().data[:, :-self.out_dim], predictions, 101 | axis=1) # append updated X to prediction 102 | else: 103 | predictions = np.append(encoded_X, predictions, axis=1) # append X to prediction 104 | 105 | predictions = torch.from_numpy(predictions).to(self.device) 106 | 107 | node_features.data = predictions.float().data 108 | 109 | def update_gbdt_targets(self, node_features, node_features_before, train_mask): 110 | return (node_features - node_features_before).detach().cpu().numpy()[train_mask, -self.out_dim:] 111 | 112 | def init_node_features(self, X): 113 | node_features = torch.empty(X.shape[0], self.in_dim, requires_grad=True, device=self.device) 114 | if not self.only_gbdt: 115 | node_features.data[:, :-self.out_dim] = torch.from_numpy(X.to_numpy(copy=True)) 116 | return node_features 117 | 118 | def init_node_parameters(self, num_nodes): 119 | return torch.empty(num_nodes, self.out_dim, requires_grad=True, device=self.device) 120 | 121 | def init_optimizer2(self, node_parameters, learning_rate): 122 | params = [self.model.parameters(), [node_parameters]] 123 | return torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 124 | 125 | def update_node_features2(self, node_parameters, X): 126 | if self.task == 'regression': 127 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 128 | else: 129 | predictions = self.base_gbdt.predict_proba(X) 130 | if self.gbdt_model is not None: 131 | predictions += self.gbdt_model.predict(X) 132 | 133 | predictions = torch.from_numpy(predictions).to(self.device) 134 | node_parameters.data = predictions.float().data 135 | 136 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, cat_features, 137 | num_epochs, patience, logging_epochs=1, loss_fn=None, metric_name='loss', 138 | normalize_features=True, replace_na=True, 139 | ): 140 | 141 | # initialize for early stopping and metrics 142 | if metric_name in ['r2', 
'accuracy']: 143 | best_metric = [np.float('-inf')] * 3 # for train/val/test 144 | else: 145 | best_metric = [np.float('inf')] * 3 # for train/val/test 146 | best_val_epoch = 0 147 | epochs_since_last_best_metric = 0 148 | metrics = ddict(list) 149 | if cat_features is None: 150 | cat_features = [] 151 | 152 | if self.task == 'regression': 153 | self.out_dim = y.shape[1] 154 | elif self.task == 'classification': 155 | self.out_dim = len(set(y.iloc[test_mask, 0])) 156 | # self.in_dim = X.shape[1] if not self.only_gbdt else 0 157 | # self.in_dim += 3 if uncertainty else 1 158 | self.in_dim = self.out_dim + X.shape[1] if not self.only_gbdt else self.out_dim 159 | 160 | self.init_gnn_model() 161 | 162 | gbdt_X_train = X.iloc[train_mask] 163 | gbdt_y_train = y.iloc[train_mask] 164 | gbdt_alpha = 1 165 | self.gbdt_model = None 166 | 167 | encoded_X = X.copy() 168 | if not self.only_gbdt: 169 | if len(cat_features): 170 | encoded_X = self.encode_cat_features(encoded_X, y, cat_features, train_mask, val_mask, test_mask) 171 | if normalize_features: 172 | encoded_X = self.normalize_features(encoded_X, train_mask, val_mask, test_mask) 173 | if replace_na: 174 | encoded_X = self.replace_na(encoded_X, train_mask) 175 | 176 | node_features = self.init_node_features(encoded_X) 177 | optimizer = self.init_optimizer(node_features, optimize_node_features=True, learning_rate=self.learning_rate) 178 | 179 | y, = self.pandas_to_torch(y) 180 | self.y = y 181 | if self.lang == 'dgl': 182 | graph = self.networkx_to_torch(networkx_graph) 183 | elif self.lang == 'pyg': 184 | graph = self.networkx_to_torch2(networkx_graph) 185 | 186 | self.graph = graph 187 | 188 | pbar = tqdm(range(num_epochs)) 189 | for epoch in pbar: 190 | start2epoch = time.time() 191 | 192 | # gbdt part 193 | self.train_gbdt(gbdt_X_train, gbdt_y_train, cat_features, epoch, 194 | self.iter_per_epoch, gbdt_alpha) 195 | 196 | self.update_node_features(node_features, X, encoded_X) 197 | node_features_before = node_features.clone() 198 | model_in=(graph, node_features) 199 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, 200 | optimizer, metrics, self.iter_per_epoch) 201 | gbdt_y_train = self.update_gbdt_targets(node_features, node_features_before, train_mask) 202 | 203 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 204 | metric_name=metric_name) 205 | # check early stopping 206 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 207 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 208 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 209 | if patience and epochs_since_last_best_metric > patience: 210 | break 211 | if np.isclose(gbdt_y_train.sum(), 0.): 212 | print('Nodes do not change anymore. 
Stopping...') 213 | break 214 | 215 | if loss_fn: 216 | self.save_metrics(metrics, loss_fn) 217 | 218 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 219 | return metrics 220 | 221 | def predict(self, graph, X, y, test_mask): 222 | node_features = torch.empty(X.shape[0], self.in_dim).to(self.device) 223 | self.update_node_features(node_features, X, X) 224 | return self.evaluate_model((graph, node_features), y, test_mask) -------------------------------------------------------------------------------- /bgnn/models/Base.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import torch 3 | from sklearn import preprocessing 4 | import pandas as pd 5 | import torch.nn.functional as F 6 | import numpy as np 7 | from sklearn.metrics import r2_score, accuracy_score 8 | 9 | class BaseModel(torch.nn.Module): 10 | def __init__(self): 11 | super(BaseModel, self).__init__() 12 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 13 | 14 | def pandas_to_torch(self, *args): 15 | return [torch.from_numpy(arg.to_numpy(copy=True)).float().squeeze().to(self.device) for arg in args] 16 | 17 | def networkx_to_torch(self, networkx_graph): 18 | import dgl 19 | # graph = dgl.DGLGraph() 20 | graph = dgl.from_networkx(networkx_graph) 21 | graph = dgl.remove_self_loop(graph) 22 | graph = dgl.add_self_loop(graph) 23 | graph = graph.to(self.device) 24 | return graph 25 | 26 | def networkx_to_torch2(self, networkx_graph): 27 | from torch_geometric.utils import convert 28 | import torch_geometric.transforms as T 29 | graph = convert.from_networkx(networkx_graph) 30 | transform = T.Compose([T.TargetIndegree()]) 31 | graph = transform(graph) 32 | return graph.to(self.device) 33 | 34 | def move_to_device(self, *args): 35 | return [arg.to(self.device) for arg in args] 36 | 37 | def init_optimizer(self, node_features, optimize_node_features, learning_rate): 38 | 39 | params = [self.model.parameters()] 40 | if optimize_node_features: 41 | params.append([node_features]) 42 | optimizer = torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 43 | return optimizer 44 | 45 | def log_epoch(self, pbar, metrics, epoch, loss, epoch_time, logging_epochs, metric_name='loss'): 46 | train_rmse, val_rmse, test_rmse = metrics[metric_name][-1] 47 | if epoch and epoch % logging_epochs == 0: 48 | pbar.set_description( 49 | "Epoch {:05d} | Loss {:.3f} | Loss {:.3f}/{:.3f}/{:.3f} | Time {:.4f}".format(epoch, loss, 50 | train_rmse, 51 | val_rmse, test_rmse, 52 | epoch_time)) 53 | 54 | def normalize_features(self, X, train_mask, val_mask, test_mask): 55 | min_max_scaler = preprocessing.MinMaxScaler() 56 | A = X.to_numpy(copy=True) 57 | A[train_mask] = min_max_scaler.fit_transform(A[train_mask]) 58 | A[val_mask + test_mask] = min_max_scaler.transform(A[val_mask + test_mask]) 59 | return pd.DataFrame(A, columns=X.columns).astype(float) 60 | 61 | def replace_na(self, X, train_mask): 62 | if X.isna().any().any(): 63 | return X.fillna(X.iloc[train_mask].min() - 1) 64 | return X 65 | 66 | def encode_cat_features(self, X, y, cat_features, train_mask, val_mask, test_mask): 67 | from category_encoders import CatBoostEncoder 68 | enc = CatBoostEncoder() 69 | A = X.to_numpy(copy=True) 70 | b = y.to_numpy(copy=True) 71 | A[np.ix_(train_mask, cat_features)] = enc.fit_transform(A[np.ix_(train_mask, cat_features)], b[train_mask]) 72 | A[np.ix_(val_mask + test_mask, cat_features)] = enc.transform(A[np.ix_(val_mask + 
test_mask, cat_features)]) 73 | A = A.astype(float) 74 | return pd.DataFrame(A, columns=X.columns) 75 | 76 | def train_model(self, model_in, target_labels, train_mask, optimizer): 77 | y = target_labels[train_mask] 78 | 79 | self.model.train() 80 | logits = self.model(*model_in).squeeze() 81 | pred = logits[train_mask] 82 | 83 | if self.task == 'regression': 84 | loss = torch.sqrt(F.mse_loss(pred, y)) 85 | elif self.task == 'classification': 86 | loss = F.cross_entropy(pred, y.long()) 87 | else: 88 | raise NotImplemented("Unknown task. Supported tasks: classification, regression.") 89 | 90 | optimizer.zero_grad() 91 | loss.backward() 92 | optimizer.step() 93 | return loss 94 | 95 | def evaluate_model(self, logits, target_labels, mask): 96 | metrics = {} 97 | y = target_labels[mask] 98 | with torch.no_grad(): 99 | pred = logits[mask] 100 | if self.task == 'regression': 101 | metrics['loss'] = torch.sqrt(F.mse_loss(pred, y).squeeze() + 1e-8) 102 | metrics['rmsle'] = torch.sqrt(F.mse_loss(torch.log(pred + 1), torch.log(y + 1)).squeeze() + 1e-8) 103 | metrics['mae'] = F.l1_loss(pred, y) 104 | metrics['r2'] = torch.Tensor([r2_score(y.cpu().numpy(), pred.cpu().numpy())]) 105 | elif self.task == 'classification': 106 | metrics['loss'] = F.cross_entropy(pred, y.long()) 107 | metrics['accuracy'] = torch.Tensor([(y == pred.max(1)[1]).sum().item()/y.shape[0]]) 108 | 109 | return metrics 110 | 111 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 112 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 113 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 114 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 115 | return X_train, y_train, X_val, y_val, X_test, y_test 116 | 117 | def train_and_evaluate(self, model_in, target_labels, train_mask, val_mask, test_mask, 118 | optimizer, metrics, gnn_passes_per_epoch): 119 | loss = None 120 | 121 | for _ in range(gnn_passes_per_epoch): 122 | loss = self.train_model(model_in, target_labels, train_mask, optimizer) 123 | 124 | self.model.eval() 125 | logits = self.model(*model_in).squeeze() 126 | train_results = self.evaluate_model(logits, target_labels, train_mask) 127 | val_results = self.evaluate_model(logits, target_labels, val_mask) 128 | test_results = self.evaluate_model(logits, target_labels, test_mask) 129 | for metric_name in train_results: 130 | metrics[metric_name].append((train_results[metric_name].detach().item(), 131 | val_results[metric_name].detach().item(), 132 | test_results[metric_name].detach().item() 133 | )) 134 | return loss 135 | 136 | def update_early_stopping(self, metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, metric_name, 137 | lower_better=False): 138 | train_metric, val_metric, test_metric = metrics[metric_name][-1] 139 | if (lower_better and val_metric < best_metric[1]) or (not lower_better and val_metric > best_metric[1]): 140 | best_metric = metrics[metric_name][-1] 141 | best_val_epoch = epoch 142 | epochs_since_last_best_metric = 0 143 | else: 144 | epochs_since_last_best_metric += 1 145 | return best_metric, best_val_epoch, epochs_since_last_best_metric 146 | 147 | def save_metrics(self, metrics, fn): 148 | with open(fn, "w+") as f: 149 | for key, value in metrics.items(): 150 | print(key, value, file=f) 151 | 152 | def plot(self, metrics, legend, title, output_fn=None, logx=False, logy=False, metric_name='loss'): 153 | import matplotlib.pyplot as plt 154 | metric_results = metrics[metric_name] 155 | xs = [range(len(metric_results))] * len(metric_results[0]) 
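        # Note: metrics[metric_name] is a list of per-epoch (train, val, test) tuples
        # (see train_and_evaluate above), so zip(*...) below transposes it into one
        # y-series per split, each plotted against the epoch index.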
156 | ys = list(zip(*metric_results)) 157 | 158 | plt.rcParams.update({'font.size': 40}) 159 | plt.rcParams["figure.figsize"] = (20, 10) 160 | lss = ['-', '--', '-.', ':'] 161 | colors = ['#4053d3', '#ddb310', '#b51d14', '#00beff', '#fb49b0', '#00b25d', '#cacaca'] 162 | colors = [(235, 172, 35), (184, 0, 88), (0, 140, 249), (0, 110, 0), (0, 187, 173), (209, 99, 230), (178, 69, 2), 163 | (255, 146, 135), (89, 84, 214), (0, 198, 248), (135, 133, 0), (0, 167, 108), (189, 189, 189)] 164 | colors = [[p / 255 for p in c] for c in colors] 165 | for i in range(len(ys)): 166 | plt.plot(xs[i], ys[i], lw=4, color=colors[i]) 167 | plt.legend(legend, loc=1, fontsize=30) 168 | plt.title(title) 169 | 170 | plt.xscale('log') if logx else None 171 | plt.yscale('log') if logy else None 172 | plt.xlabel('Iteration') 173 | plt.ylabel('RMSE') 174 | plt.grid() 175 | plt.tight_layout() 176 | 177 | plt.savefig(output_fn, bbox_inches='tight') if output_fn else None 178 | plt.show() 179 | 180 | def plot_interactive(self, metrics, legend, title, logx=False, logy=False, metric_name='loss', start_from=0): 181 | import plotly.graph_objects as go 182 | metric_results = metrics[metric_name] 183 | xs = [list(range(len(metric_results)))] * len(metric_results[0]) 184 | ys = list(zip(*metric_results)) 185 | 186 | fig = go.Figure() 187 | for i in range(len(ys)): 188 | fig.add_trace(go.Scatter(x=xs[i][start_from:], y=ys[i][start_from:], 189 | mode='lines+markers', 190 | name=legend[i])) 191 | 192 | fig.update_layout( 193 | title=title, 194 | title_x=0.5, 195 | xaxis_title='Epoch', 196 | yaxis_title='RMSE', 197 | font=dict( 198 | size=40, 199 | ), 200 | height=600, 201 | ) 202 | 203 | if logx: 204 | fig.update_layout(xaxis_type="log") 205 | if logy: 206 | fig.update_layout(yaxis_type="log") 207 | 208 | fig.show() 209 | -------------------------------------------------------------------------------- /bgnn/models/GBDT.py: -------------------------------------------------------------------------------- 1 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor 2 | import time 3 | from sklearn.metrics import mean_squared_error, accuracy_score, r2_score 4 | import numpy as np 5 | from collections import defaultdict as ddict 6 | import lightgbm 7 | from lightgbm import LGBMClassifier, LGBMRegressor 8 | 9 | class GBDTCatBoost: 10 | def __init__(self, task='regression', depth=6, lr=0.1, l2_leaf_reg=None, max_bin=None): 11 | self.task = task 12 | self.depth = depth 13 | self.learning_rate = lr 14 | self.l2_leaf_reg = l2_leaf_reg 15 | self.max_bin = max_bin 16 | 17 | 18 | def init_model(self, num_epochs, patience): 19 | catboost_model_obj = CatBoostRegressor if self.task == 'regression' else CatBoostClassifier 20 | self.catboost_loss_function = 'RMSE' if self.task == 'regression' else 'MultiClass' 21 | self.custom_metrics = ['R2'] if self.task == 'regression' else ['Accuracy'] 22 | # ['Accuracy', 'AUC', 'Precision', 'Recall', 'F1', 'MCC', 'R2'], 23 | 24 | self.model = catboost_model_obj(iterations=num_epochs, 25 | depth=self.depth, 26 | learning_rate=self.learning_rate, 27 | loss_function=self.catboost_loss_function, 28 | custom_metric=self.custom_metrics, 29 | random_seed=0, 30 | early_stopping_rounds=patience, 31 | l2_leaf_reg=self.l2_leaf_reg, 32 | max_bin=self.max_bin, 33 | nan_mode='Min') 34 | 35 | def get_metrics(self): 36 | d = self.model.evals_result_ 37 | metrics = ddict(list) 38 | keys = ['learn', 'validation_0', 'validation_1'] \ 39 | if 'validation_0' in self.model.evals_result_ \ 40 | else ['learn', 'validation'] 
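        # CatBoost names the eval sets 'validation_0'/'validation_1' when two eval sets
        # are passed to fit() (val and test here) and plain 'validation' when there is
        # only one; 'learn' always holds the training-set curves.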
41 | for metric_name in d[keys[0]]: 42 | perf = [d[key][metric_name] for key in keys] 43 | if metric_name == self.catboost_loss_function: 44 | metrics['loss'] = list(zip(*perf)) 45 | else: 46 | metrics[metric_name.lower()] = list(zip(*perf)) 47 | 48 | return metrics 49 | 50 | def get_test_metric(self, metrics, metric_name): 51 | if metric_name == 'loss': 52 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 53 | else: 54 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 55 | min_metric = metrics[metric_name][val_epoch] 56 | return min_metric, val_epoch 57 | 58 | def save_metrics(self, metrics, fn): 59 | with open(fn, "w+") as f: 60 | for key, value in metrics.items(): 61 | print(key, value, file=f) 62 | 63 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 64 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 65 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 66 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 67 | return X_train, y_train, X_val, y_val, X_test, y_test 68 | 69 | def fit(self, 70 | X, y, train_mask, val_mask, test_mask, 71 | cat_features=None, num_epochs=1000, patience=200, 72 | plot=False, verbose=False, 73 | loss_fn="", metric_name='loss'): 74 | 75 | X_train, y_train, X_val, y_val, X_test, y_test = \ 76 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 77 | self.init_model(num_epochs, patience) 78 | 79 | start = time.time() 80 | pool = Pool(X_train, y_train, cat_features=cat_features) 81 | eval_set = [(X_val, y_val), (X_test, y_test)] 82 | self.model.fit(pool, eval_set=eval_set, plot=plot, verbose=verbose) 83 | finish = time.time() 84 | 85 | num_trees = self.model.tree_count_ 86 | print('Finished training. Total time: {:.2f} | Number of trees: {:d} | Time per tree: {:.2f}'.format(finish - start, num_trees, (time.time() - start )/num_trees)) 87 | 88 | metrics = self.get_metrics() 89 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 90 | if loss_fn: 91 | self.save_metrics(metrics, loss_fn) 92 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 93 | return metrics 94 | 95 | def predict(self, X_test, y_test): 96 | pred = self.model.predict(X_test) 97 | 98 | metrics = {} 99 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 100 | 101 | return metrics 102 | 103 | 104 | class GBDTLGBM: 105 | def __init__(self, task='regression', lr=0.1, num_leaves=31, max_bin=255, 106 | lambda_l1=0., lambda_l2=0., boosting='gbdt'): 107 | self.task = task 108 | self.boosting = boosting 109 | self.learning_rate = lr 110 | self.num_leaves = num_leaves 111 | self.max_bin = max_bin 112 | self.lambda_l1 = lambda_l1 113 | self.lambda_l2 = lambda_l2 114 | 115 | def accuracy(self, preds, train_data): 116 | labels = train_data.get_label() 117 | preds_classes = preds.reshape((preds.shape[0]//labels.shape[0], labels.shape[0])).argmax(0) 118 | return 'accuracy', accuracy_score(labels, preds_classes), True 119 | 120 | def r2(self, preds, train_data): 121 | labels = train_data.get_label() 122 | return 'r2', r2_score(labels, preds), True 123 | 124 | def init_model(self): 125 | 126 | self.parameters = { 127 | 'objective': 'regression' if self.task == 'regression' else 'multiclass', 128 | 'metric': {'rmse'} if self.task == 'regression' else {'multiclass'}, 129 | 'num_classes': self.num_classes, 130 | 'boosting': self.boosting, 131 | 'num_leaves': self.num_leaves, 132 | 'max_bin': self.max_bin, 133 | 'learning_rate': self.learning_rate, 134 | 
'lambda_l1': self.lambda_l1, 135 | 'lambda_l2': self.lambda_l2, 136 | # 'num_threads': 1, 137 | # 'feature_fraction': 0.9, 138 | # 'bagging_fraction': 0.8, 139 | # 'bagging_freq': 5, 140 | 'verbose': 1, 141 | # 'device_type': 'gpu' 142 | } 143 | self.evals_result = dict() 144 | 145 | def get_metrics(self): 146 | d = self.evals_result 147 | metrics = ddict(list) 148 | keys = ['training', 'valid_1', 'valid_2'] \ 149 | if 'training' in d \ 150 | else ['valid_0', 'valid_1'] 151 | for metric_name in d[keys[0]]: 152 | perf = [d[key][metric_name] for key in keys] 153 | if metric_name in ['regression', 'multiclass', 'rmse', 'l2', 'multi_logloss', 'binary_logloss']: 154 | metrics['loss'] = list(zip(*perf)) 155 | else: 156 | metrics[metric_name] = list(zip(*perf)) 157 | return metrics 158 | 159 | def get_test_metric(self, metrics, metric_name): 160 | if metric_name == 'loss': 161 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 162 | else: 163 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 164 | min_metric = metrics[metric_name][val_epoch] 165 | return min_metric, val_epoch 166 | 167 | def save_metrics(self, metrics, fn): 168 | with open(fn, "w+") as f: 169 | for key, value in metrics.items(): 170 | print(key, value, file=f) 171 | 172 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 173 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 174 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 175 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 176 | return X_train, y_train, X_val, y_val, X_test, y_test 177 | 178 | def fit(self, 179 | X, y, train_mask, val_mask, test_mask, 180 | cat_features=None, num_epochs=1000, patience=200, 181 | loss_fn="", metric_name='loss'): 182 | 183 | if cat_features is not None: 184 | X = X.copy() 185 | for col in list(X.columns[cat_features]): 186 | X[col] = X[col].astype('category') 187 | 188 | X_train, y_train, X_val, y_val, X_test, y_test = \ 189 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 190 | self.num_classes = None if self.task == 'regression' else len(set(y.iloc[:, 0])) 191 | self.init_model() 192 | 193 | start = time.time() 194 | train_data = lightgbm.Dataset(X_train, label=y_train) 195 | val_data = lightgbm.Dataset(X_val, label=y_val) 196 | test_data = lightgbm.Dataset(X_test, label=y_test) 197 | 198 | self.model = lightgbm.train(self.parameters, 199 | train_data, 200 | valid_sets=[train_data, val_data, test_data], 201 | num_boost_round=num_epochs, 202 | early_stopping_rounds=patience, 203 | evals_result=self.evals_result, 204 | feval=self.r2 if self.task == 'regression' else self.accuracy, 205 | verbose_eval=1) 206 | finish = time.time() 207 | 208 | print('Finished training. 
Total time: {:.2f}'.format(finish - start)) 209 | 210 | metrics = self.get_metrics() 211 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 212 | if loss_fn: 213 | self.save_metrics(metrics, loss_fn) 214 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 215 | return metrics 216 | 217 | def predict(self, X_test, y_test): 218 | pred = self.model.predict(X_test) 219 | 220 | metrics = {} 221 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 222 | 223 | return metrics -------------------------------------------------------------------------------- /bgnn/models/GNN.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch 4 | from torch.nn import Dropout, ELU 5 | import torch.nn.functional as F 6 | from torch import nn 7 | from dgl.nn.pytorch import GATConv as GATConvDGL, GraphConv, ChebConv as ChebConvDGL, \ 8 | AGNNConv as AGNNConvDGL, APPNPConv 9 | from torch.nn import Sequential, Linear, ReLU, Identity 10 | from tqdm import tqdm 11 | from .Base import BaseModel 12 | from torch.autograd import Variable 13 | from collections import defaultdict as ddict 14 | from .MLP import MLPRegressor 15 | 16 | 17 | class ElementWiseLinear(nn.Module): 18 | def __init__(self, size, weight=True, bias=True, inplace=False): 19 | super().__init__() 20 | if weight: 21 | self.weight = nn.Parameter(torch.Tensor(size)) 22 | else: 23 | self.weight = None 24 | if bias: 25 | self.bias = nn.Parameter(torch.Tensor(size)) 26 | else: 27 | self.bias = None 28 | self.inplace = inplace 29 | 30 | self.reset_parameters() 31 | 32 | def reset_parameters(self): 33 | if self.weight is not None: 34 | nn.init.ones_(self.weight) 35 | if self.bias is not None: 36 | nn.init.zeros_(self.bias) 37 | 38 | def forward(self, x): 39 | if self.inplace: 40 | if self.weight is not None: 41 | x.mul_(self.weight) 42 | if self.bias is not None: 43 | x.add_(self.bias) 44 | else: 45 | if self.weight is not None: 46 | x = x * self.weight 47 | if self.bias is not None: 48 | x = x + self.bias 49 | return x 50 | 51 | class GATDGL(torch.nn.Module): 52 | ''' 53 | Implementation of leaderboard GAT network for OGB datasets. 
54 | https://github.com/Espylapiza/dgl/blob/master/examples/pytorch/ogb/ogbn-arxiv/models.py 55 | ''' 56 | def __init__( 57 | self, 58 | in_feats, 59 | n_classes, 60 | n_layers=3, 61 | n_heads=3, 62 | activation=F.relu, 63 | n_hidden=250, 64 | dropout=0.75, 65 | input_drop=0.1, 66 | attn_drop=0.0, 67 | ): 68 | super().__init__() 69 | self.in_feats = in_feats 70 | self.n_hidden = n_hidden 71 | self.n_classes = n_classes 72 | self.n_layers = n_layers 73 | self.num_heads = n_heads 74 | 75 | self.convs = torch.nn.ModuleList() 76 | self.norms = torch.nn.ModuleList() 77 | 78 | for i in range(n_layers): 79 | in_hidden = n_heads * n_hidden if i > 0 else in_feats 80 | out_hidden = n_hidden if i < n_layers - 1 else n_classes 81 | num_heads = n_heads if i < n_layers - 1 else 1 82 | out_channels = n_heads 83 | 84 | self.convs.append( 85 | GATConvDGL( 86 | in_hidden, 87 | out_hidden, 88 | num_heads=num_heads, 89 | attn_drop=attn_drop, 90 | residual=True, 91 | ) 92 | ) 93 | 94 | if i < n_layers - 1: 95 | self.norms.append(torch.nn.BatchNorm1d(out_channels * out_hidden)) 96 | 97 | self.bias_last = ElementWiseLinear(n_classes, weight=False, bias=True, inplace=True) 98 | 99 | self.input_drop = nn.Dropout(input_drop) 100 | self.dropout = nn.Dropout(dropout) 101 | self.activation = activation 102 | 103 | def forward(self, graph, feat): 104 | h = feat 105 | h = self.input_drop(h) 106 | 107 | for i in range(self.n_layers): 108 | conv = self.convs[i](graph, h) 109 | 110 | h = conv 111 | 112 | if i < self.n_layers - 1: 113 | h = h.flatten(1) 114 | h = self.norms[i](h) 115 | h = self.activation(h, inplace=True) 116 | h = self.dropout(h) 117 | 118 | h = h.mean(1) 119 | h = self.bias_last(h) 120 | 121 | return h 122 | 123 | 124 | 125 | class GNNModelDGL(torch.nn.Module): 126 | def __init__(self, in_dim, hidden_dim, out_dim, 127 | dropout=0., name='gat', residual=True, use_mlp=False, join_with_mlp=False): 128 | super(GNNModelDGL, self).__init__() 129 | self.name = name 130 | self.use_mlp = use_mlp 131 | self.join_with_mlp = join_with_mlp 132 | self.normalize_input_columns = True 133 | if use_mlp: 134 | self.mlp = MLPRegressor(in_dim, hidden_dim, out_dim) 135 | if join_with_mlp: 136 | in_dim += out_dim 137 | else: 138 | in_dim = out_dim 139 | if name == 'gat': 140 | self.l1 = GATConvDGL(in_dim, hidden_dim//8, 8, feat_drop=dropout, attn_drop=dropout, residual=False, 141 | activation=F.elu) 142 | self.l2 = GATConvDGL(hidden_dim, out_dim, 1, feat_drop=dropout, attn_drop=dropout, residual=residual, activation=None) 143 | elif name == 'gcn': 144 | self.l1 = GraphConv(in_dim, hidden_dim, activation=F.elu) 145 | self.l2 = GraphConv(hidden_dim, out_dim, activation=F.elu) 146 | self.drop = Dropout(p=dropout) 147 | elif name == 'cheb': 148 | self.l1 = ChebConvDGL(in_dim, hidden_dim, k = 3) 149 | self.l2 = ChebConvDGL(hidden_dim, out_dim, k = 3) 150 | self.drop = Dropout(p=dropout) 151 | elif name == 'agnn': 152 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), ELU()) 153 | self.l1 = AGNNConvDGL(learn_beta=False) 154 | self.l2 = AGNNConvDGL(learn_beta=True) 155 | self.lin2 = Sequential(Dropout(p=dropout), Linear(hidden_dim, out_dim), ELU()) 156 | elif name == 'appnp': 157 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), 158 | ReLU(), Dropout(p=dropout), Linear(hidden_dim, out_dim)) 159 | self.l1 = APPNPConv(k=10, alpha=0.1, edge_drop=0.) 
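            # APPNP decouples feature transformation from propagation: lin1 predicts
            # per-node logits with an MLP, and APPNPConv then smooths them over the
            # graph with k=10 steps of personalized PageRank (teleport alpha=0.1);
            # see forward() below.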
160 | 161 | 162 | def forward(self, graph, features): 163 | h = features 164 | if self.use_mlp: 165 | if self.join_with_mlp: 166 | h = torch.cat((h, self.mlp(features)), 1) 167 | else: 168 | h = self.mlp(features) 169 | if self.name == 'gat': 170 | h = self.l1(graph, h).flatten(1) 171 | logits = self.l2(graph, h).mean(1) 172 | elif self.name in ['appnp']: 173 | h = self.lin1(h) 174 | logits = self.l1(graph, h) 175 | elif self.name == 'agnn': 176 | h = self.lin1(h) 177 | h = self.l1(graph, h) 178 | h = self.l2(graph, h) 179 | logits = self.lin2(h) 180 | elif self.name in ['gcn', 'cheb']: 181 | h = self.drop(h) 182 | h = self.l1(graph, h) 183 | logits = self.l2(graph, h) 184 | 185 | 186 | return logits 187 | 188 | class GNN(BaseModel): 189 | def __init__(self, task='regression', lr=0.01, hidden_dim=64, dropout=0., 190 | name='gat', residual=True, lang='dgl', 191 | gbdt_predictions=None, mlp=False, use_leaderboard=False, only_gbdt=False): 192 | super(GNN, self).__init__() 193 | 194 | self.dropout = dropout 195 | self.learning_rate = lr 196 | self.hidden_dim = hidden_dim 197 | self.task = task 198 | self.model_name = name 199 | self.use_residual = residual 200 | self.lang = lang 201 | self.use_mlp = mlp 202 | self.use_leaderboard = use_leaderboard 203 | self.gbdt_predictions = gbdt_predictions 204 | self.only_gbdt = only_gbdt 205 | 206 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 207 | 208 | def __name__(self): 209 | if self.gbdt_predictions is None: 210 | return 'GNN' 211 | else: 212 | return 'ResGNN' 213 | 214 | def init_model(self): 215 | if self.lang == 'pyg': 216 | self.model = GNNModelPYG(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 217 | heads=self.heads, dropout=self.dropout, name=self.model_name, 218 | residual=self.use_residual).to(self.device) 219 | elif self.lang == 'dgl': 220 | if self.use_leaderboard: 221 | self.model = GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 222 | else: 223 | self.model = GNNModelDGL(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 224 | dropout=self.dropout, name=self.model_name, 225 | residual=self.use_residual, use_mlp=self.use_mlp, 226 | join_with_mlp=self.use_mlp).to(self.device) 227 | 228 | def init_node_features(self, X, optimize_node_features): 229 | node_features = Variable(X, requires_grad=optimize_node_features) 230 | return node_features 231 | 232 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, num_epochs, 233 | cat_features=None, patience=200, logging_epochs=1, optimize_node_features=False, 234 | loss_fn=None, metric_name='loss', normalize_features=True, replace_na=True): 235 | 236 | # initialize for early stopping and metrics 237 | if metric_name in ['r2', 'accuracy']: 238 | best_metric = [np.float('-inf')] * 3 # for train/val/test 239 | else: 240 | best_metric = [np.float('inf')] * 3 # for train/val/test 241 | best_val_epoch = 0 242 | epochs_since_last_best_metric = 0 243 | metrics = ddict(list) # metric_name -> (train/val/test) 244 | if cat_features is None: 245 | cat_features = [] 246 | 247 | if self.gbdt_predictions is not None: 248 | X = X.copy() 249 | X['predict'] = self.gbdt_predictions 250 | if self.only_gbdt: 251 | cat_features = [] 252 | X = X[['predict']] 253 | 254 | self.in_dim = X.shape[1] 255 | self.hidden_dim = self.hidden_dim 256 | if self.task == 'regression': 257 | self.out_dim = y.shape[1] 258 | elif self.task == 'classification': 259 | self.out_dim = len(set(y.iloc[:, 0])) 260 | 261 | if 
len(cat_features): 262 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 263 | if normalize_features: 264 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 265 | if replace_na: 266 | X = self.replace_na(X, train_mask) 267 | 268 | X, y = self.pandas_to_torch(X, y) 269 | if len(X.shape) == 1: 270 | X = X.unsqueeze(1) 271 | 272 | if self.lang == 'dgl': 273 | graph = self.networkx_to_torch(networkx_graph) 274 | elif self.lang == 'pyg': 275 | graph = self.networkx_to_torch2(networkx_graph) 276 | self.init_model() 277 | node_features = self.init_node_features(X, optimize_node_features) 278 | 279 | self.node_features = node_features 280 | self.graph = graph 281 | optimizer = self.init_optimizer(node_features, optimize_node_features, self.learning_rate) 282 | 283 | pbar = tqdm(range(num_epochs)) 284 | for epoch in pbar: 285 | start2epoch = time.time() 286 | 287 | model_in = (graph, node_features) 288 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 289 | metrics, gnn_passes_per_epoch=1) 290 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 291 | metric_name=metric_name) 292 | 293 | # check early stopping 294 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 295 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 296 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 297 | if patience and epochs_since_last_best_metric > patience: 298 | break 299 | 300 | if loss_fn: 301 | self.save_metrics(metrics, loss_fn) 302 | 303 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 304 | return metrics 305 | 306 | def predict(self, graph, node_features, target_labels, test_mask): 307 | return self.evaluate_model((graph, node_features), target_labels, test_mask) -------------------------------------------------------------------------------- /bgnn/models/MLP.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import numpy as np 5 | import time 6 | from tqdm import tqdm 7 | from .Base import BaseModel 8 | from sklearn.metrics import r2_score 9 | from collections import defaultdict as ddict 10 | 11 | class MLPClassifier(torch.nn.Module): 12 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 13 | super(MLPClassifier, self).__init__() 14 | 15 | self.lins = torch.nn.ModuleList() 16 | self.lins.append(torch.nn.Linear(in_dim, hidden_dim)) 17 | self.bns = torch.nn.ModuleList() 18 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 19 | for _ in range(num_layers - 2): 20 | self.lins.append(torch.nn.Linear(hidden_dim, hidden_dim)) 21 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 22 | self.lins.append(torch.nn.Linear(hidden_dim, out_dim)) 23 | 24 | self.dropout = dropout 25 | 26 | def reset_parameters(self): 27 | for lin in self.lins: 28 | lin.reset_parameters() 29 | 30 | def forward(self, x): 31 | for i, lin in enumerate(self.lins[:-1]): 32 | x = lin(x) 33 | x = self.bns[i](x) 34 | x = F.relu(x) 35 | x = F.dropout(x, p=self.dropout, training=self.training) 36 | x = self.lins[-1](x) 37 | return x 38 | 39 | 40 | class MLPRegressor(nn.Module): 41 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 42 | super(MLPRegressor, self).__init__() 43 | 44 | self.layers = nn.Sequential( 45 | 
nn.Linear(in_dim, hidden_dim), 46 | nn.ReLU(), 47 | nn.Dropout(p=dropout), 48 | nn.Linear(hidden_dim, hidden_dim), 49 | nn.ReLU(), 50 | nn.Dropout(p=dropout), 51 | nn.Linear(hidden_dim, out_dim) 52 | ) 53 | 54 | def forward(self, x): 55 | return self.layers(x) 56 | 57 | 58 | class MLP(BaseModel): 59 | def __init__(self, task='regression', num_layers=3, dropout=0., lr=0.01, hidden_dim=128): 60 | super(MLP, self).__init__() 61 | self.task = task 62 | self.num_layers = num_layers 63 | self.dropout = dropout 64 | self.learning_rate = lr 65 | self.hidden_dim = hidden_dim 66 | 67 | 68 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 69 | 70 | def __name__(self): 71 | return 'MLP' 72 | 73 | def init_model(self): 74 | # mlp_model = MLPRegressor if self.task == 'regression' else MLPClassifier 75 | mlp_model = MLPClassifier 76 | self.model = mlp_model(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 77 | num_layers=self.num_layers, dropout=self.dropout).to( 78 | self.device) 79 | 80 | def fit(self, X, y, train_mask, val_mask, test_mask, cat_features=None, 81 | num_epochs=1000, patience=200, 82 | logging_epochs=1, loss_fn=None, 83 | metric_name='loss', normalize_features=True, replace_na=True): 84 | 85 | # initialize for early stopping and metrics 86 | if metric_name in ['r2', 'accuracy']: 87 | best_metric = [np.float('-inf')] * 3 # for train/val/test 88 | else: 89 | best_metric = [np.float('inf')] * 3 # for train/val/test 90 | best_val_epoch = 0 91 | epochs_since_last_best_metric = 0 92 | metrics = ddict(list) # metric_name -> (train/val/test) 93 | if cat_features is None: 94 | cat_features = [] 95 | 96 | self.in_dim = X.shape[1] 97 | self.hidden_dim = self.hidden_dim 98 | if self.task == 'regression': 99 | self.out_dim = y.shape[1] 100 | elif self.task == 'classification': 101 | self.out_dim = len(set(y.iloc[:, 0])) 102 | 103 | 104 | if len(cat_features): 105 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 106 | if normalize_features: 107 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 108 | if replace_na: 109 | X = self.replace_na(X, train_mask) 110 | 111 | X, y = self.pandas_to_torch(X, y) 112 | if len(X.shape) == 1: 113 | X = X.unsqueeze(dim=1) 114 | 115 | self.init_model() 116 | optimizer = self.init_optimizer(None, False, learning_rate=self.learning_rate) 117 | 118 | pbar = tqdm(range(num_epochs)) 119 | for epoch in pbar: 120 | 121 | start2epoch = time.time() 122 | 123 | model_in = (X,) 124 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 125 | metrics, gnn_passes_per_epoch=1) 126 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 127 | metric_name=metric_name) 128 | 129 | # check early stopping 130 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 131 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 132 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 133 | if patience and epochs_since_last_best_metric > patience: 134 | break 135 | 136 | if loss_fn: 137 | self.save_metrics(metrics, loss_fn) 138 | 139 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 140 | return metrics 141 | 142 | def predict(self, X, target_labels, test_mask): 143 | return self.evaluate_model((X,), target_labels, test_mask) 
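
# Usage sketch (illustrative, under assumed inputs: pandas DataFrames X and y, and
# train/val/test masks given as lists of row indices):
#   model = MLP(task='classification', num_layers=3, dropout=0.5, lr=0.01, hidden_dim=128)
#   metrics = model.fit(X, y, train_mask, val_mask, test_mask,
#                       num_epochs=1000, patience=200, metric_name='accuracy')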
-------------------------------------------------------------------------------- /bgnn/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nd7141/bgnn/11290bc8ec5427faa1cb48ec51d947d5f6624b60/bgnn/models/__init__.py -------------------------------------------------------------------------------- /bgnn/scripts/run.py: -------------------------------------------------------------------------------- 1 | # from catboost import CatboostError 2 | # import sys 3 | # sys.path.append('../') 4 | 5 | from bgnn.models.GBDT import GBDTCatBoost, GBDTLGBM 6 | from bgnn.models.MLP import MLP 7 | from bgnn.models.GNN import GNN 8 | from bgnn.models.BGNN import BGNN 9 | from bgnn.scripts.utils import get_masks, NpEncoder 10 | 11 | import os 12 | import json 13 | import time 14 | import datetime 15 | from pathlib import Path 16 | from collections import defaultdict as ddict 17 | 18 | import pandas as pd 19 | import networkx as nx 20 | import random 21 | import numpy as np 22 | import fire 23 | from omegaconf import OmegaConf 24 | from sklearn.model_selection import ParameterGrid 25 | 26 | 27 | class RunModel: 28 | def read_input(self, input_folder): 29 | self.X = pd.read_csv(f'{input_folder}/X.csv') 30 | self.y = pd.read_csv(f'{input_folder}/y.csv') 31 | 32 | networkx_graph = nx.read_graphml(f'{input_folder}/graph.graphml') 33 | networkx_graph = nx.relabel_nodes(networkx_graph, {str(i): i for i in range(len(networkx_graph))}) 34 | self.networkx_graph = networkx_graph 35 | 36 | categorical_columns = [] 37 | if os.path.exists(f'{input_folder}/cat_features.txt'): 38 | with open(f'{input_folder}/cat_features.txt') as f: 39 | for line in f: 40 | if line.strip(): 41 | categorical_columns.append(line.strip()) 42 | 43 | self.cat_features = None 44 | if categorical_columns: 45 | columns = self.X.columns 46 | self.cat_features = np.where(columns.isin(categorical_columns))[0] 47 | 48 | for col in list(columns[self.cat_features]): 49 | self.X[col] = self.X[col].astype(str) 50 | 51 | 52 | if os.path.exists(f'{input_folder}/masks.json'): 53 | with open(f'{input_folder}/masks.json') as f: 54 | self.masks = json.load(f) 55 | else: 56 | print('Creating and saving train/val/test masks') 57 | idx = list(range(self.y.shape[0])) 58 | self.masks = dict() 59 | for i in range(self.max_seeds): 60 | random.shuffle(idx) 61 | r1, r2, r3 = idx[:int(.6*len(idx))], idx[int(.6*len(idx)):int(.8*len(idx))], idx[int(.8*len(idx)):] 62 | self.masks[str(i)] = {"train": r1, "val": r2, "test": r3} 63 | 64 | with open(f'{input_folder}/masks.json', 'w+') as f: 65 | json.dump(self.masks, f, cls=NpEncoder) 66 | 67 | 68 | def get_input(self, dataset_dir, dataset: str): 69 | if dataset == 'house': 70 | input_folder = dataset_dir / 'house' 71 | elif dataset == 'county': 72 | input_folder = dataset_dir / 'county' 73 | elif dataset == 'vk': 74 | input_folder = dataset_dir / 'vk' 75 | elif dataset == 'wiki': 76 | input_folder = dataset_dir / 'wiki' 77 | elif dataset == 'avazu': 78 | input_folder = dataset_dir / 'avazu' 79 | elif dataset == 'vk_class': 80 | input_folder = dataset_dir / 'vk_class' 81 | elif dataset == 'house_class': 82 | input_folder = dataset_dir / 'house_class' 83 | elif dataset == 'dblp': 84 | input_folder = dataset_dir / 'dblp' 85 | elif dataset == 'slap': 86 | input_folder = dataset_dir / 'slap' 87 | else: 88 | input_folder = dataset 89 | 90 | if self.save_folder is None: 91 | self.save_folder = 
f'results/{dataset}/{datetime.datetime.now().strftime("%d_%m")}' 92 | 93 | self.read_input(input_folder) 94 | print('Save to folder:', self.save_folder) 95 | 96 | 97 | def run_one_model(self, config_fn, model_name): 98 | self.config = OmegaConf.load(config_fn) 99 | grid = ParameterGrid(dict(self.config.hp)) 100 | 101 | for ps in grid: 102 | param_string = ''.join([f'-{key}{ps[key]}' for key in ps]) 103 | exp_name = f'{model_name}{param_string}' 104 | print(f'\nSeed {self.seed} RUNNING:{exp_name}') 105 | 106 | runs = [] 107 | runs_custom = [] 108 | times = [] 109 | for _ in range(self.repeat_exp): 110 | start = time.time() 111 | model = self.define_model(model_name, ps) 112 | 113 | inputs = {'X': self.X, 'y': self.y, 'train_mask': self.train_mask, 114 | 'val_mask': self.val_mask, 'test_mask': self.test_mask, 'cat_features': self.cat_features} 115 | if model_name in ['gnn', 'resgnn', 'bgnn']: 116 | inputs['networkx_graph'] = self.networkx_graph 117 | 118 | metrics = model.fit(num_epochs=self.config.num_epochs, patience=self.config.patience, 119 | loss_fn=f"{self.seed_folder}/{exp_name}.txt", 120 | metric_name='loss' if self.task == 'regression' else 'accuracy', **inputs) 121 | finish = time.time() 122 | best_loss = min(metrics['loss'], key=lambda x: x[1]) 123 | best_custom = max(metrics['r2' if self.task == 'regression' else 'accuracy'], key=lambda x: x[1]) 124 | runs.append(best_loss) 125 | runs_custom.append(best_custom) 126 | times.append(finish - start) 127 | self.store_results[exp_name] = (list(map(np.mean, zip(*runs))), 128 | list(map(np.mean, zip(*runs_custom))), 129 | np.mean(times), 130 | ) 131 | 132 | def define_model(self, model_name, ps): 133 | if model_name == 'catboost': 134 | return GBDTCatBoost(self.task, **ps) 135 | elif model_name == 'lightgbm': 136 | return GBDTLGBM(self.task, **ps) 137 | elif model_name == 'mlp': 138 | return MLP(self.task, **ps) 139 | elif model_name == 'gnn': 140 | return GNN(self.task, **ps) 141 | elif model_name == 'resgnn': 142 | gbdt = GBDTCatBoost(self.task) 143 | gbdt.fit(self.X, self.y, self.train_mask, self.val_mask, self.test_mask, 144 | cat_features=self.cat_features, 145 | num_epochs=1000, patience=100, 146 | plot=False, verbose=False, loss_fn=None, 147 | metric_name='loss' if self.task == 'regression' else 'accuracy') 148 | return GNN(task=self.task, gbdt_predictions=gbdt.model.predict(self.X), **ps) 149 | elif model_name == 'bgnn': 150 | return BGNN(self.task, **ps) 151 | 152 | def create_save_folder(self, seed): 153 | self.seed_folder = f'{self.save_folder}/{seed}' 154 | os.makedirs(self.seed_folder, exist_ok=True) 155 | 156 | def split_masks(self, seed): 157 | self.train_mask, self.val_mask, self.test_mask = self.masks[seed]['train'], \ 158 | self.masks[seed]['val'], self.masks[seed]['test'] 159 | 160 | def save_results(self, seed): 161 | self.seed_results[seed] = self.store_results 162 | with open(f'{self.save_folder}/seed_results.json', 'w+') as f: 163 | json.dump(self.seed_results, f) 164 | 165 | self.aggregated = self.aggregate_results() 166 | with open(f'{self.save_folder}/aggregated_results.json', 'w+') as f: 167 | json.dump(self.aggregated, f) 168 | 169 | def get_model_name(self, exp_name: str, algos: list): 170 | # get name of the model (for gnn-like models (eg. gat)) 171 | if 'name' in exp_name: 172 | model_name = '-' + [param[4:] for param in exp_name.split('-') if param.startswith('name')][0] 173 | else: 174 | model_name = '' 175 | 176 | # get a model used a MLP (eg. 
MLP-GNN) 177 | if 'gnn' in exp_name and 'mlpTrue' in exp_name: 178 | model_name += '-MLP' 179 | 180 | # algo corresponds to type of the model (eg. gnn, resgnn, bgnn) 181 | for algo in algos: 182 | if exp_name.startswith(algo): 183 | return algo + model_name 184 | return 'unknown' 185 | 186 | def aggregate_results(self): 187 | algos = ['catboost', 'lightgbm', 'mlp', 'gnn', 'resgnn', 'bgnn'] 188 | model_best_score = ddict(list) 189 | model_best_time = ddict(list) 190 | 191 | results = self.seed_results 192 | for seed in results: 193 | model_results_for_seed = ddict(list) 194 | for name, output in results[seed].items(): 195 | model_name = self.get_model_name(name, algos=algos) 196 | if self.task == 'regression': # rmse metric 197 | val_metric, test_metric, time = output[0][1], output[0][2], output[2] 198 | else: # accuracy metric 199 | val_metric, test_metric, time = output[1][1], output[1][2], output[2] 200 | model_results_for_seed[model_name].append((val_metric, test_metric, time)) 201 | 202 | for model_name, model_results in model_results_for_seed.items(): 203 | if self.task == 'regression': 204 | best_result = min(model_results) # rmse 205 | else: 206 | best_result = max(model_results) # accuracy 207 | model_best_score[model_name].append(best_result[1]) 208 | model_best_time[model_name].append(best_result[2]) 209 | 210 | aggregated = dict() 211 | for model, scores in model_best_score.items(): 212 | aggregated[model] = (np.mean(scores), np.std(scores), 213 | np.mean(model_best_time[model]), np.std(model_best_time[model])) 214 | return aggregated 215 | 216 | def run(self, dataset: str, *args, 217 | save_folder: str = None, 218 | task: str = 'regression', 219 | repeat_exp: int = 1, 220 | max_seeds: int = 1, 221 | dataset_dir: str = None, 222 | config_dir: str = None 223 | ): 224 | start2run = time.time() 225 | self.repeat_exp = repeat_exp 226 | self.max_seeds = max_seeds 227 | print(dataset, args, task, repeat_exp, max_seeds) 228 | 229 | dataset_dir = Path(dataset_dir) if dataset_dir else Path(__file__).parent / 'datasets' 230 | config_dir = Path(config_dir) if config_dir else Path(__file__).parent / 'configs' / 'model' 231 | print(dataset_dir, config_dir) 232 | 233 | self.task = task 234 | self.save_folder = save_folder 235 | self.get_input(dataset_dir, dataset) 236 | 237 | self.seed_results = dict() 238 | for ix, seed in enumerate(self.masks): 239 | print(f'{dataset} Seed {seed}') 240 | self.seed = seed 241 | 242 | self.create_save_folder(seed) 243 | self.split_masks(seed) 244 | 245 | self.store_results = dict() 246 | for arg in args: 247 | if arg == 'all': 248 | self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost") 249 | self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm") 250 | self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp") 251 | self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn") 252 | self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn") 253 | self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn") 254 | break 255 | elif arg == 'catboost': 256 | self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost") 257 | elif arg == 'lightgbm': 258 | self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm") 259 | elif arg == 'mlp': 260 | self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp") 261 | elif arg == 'gnn': 262 | self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn") 
263 | elif arg == 'resgnn': 264 | self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn") 265 | elif arg == 'bgnn': 266 | self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn") 267 | 268 | self.save_results(seed) 269 | if ix+1 >= max_seeds: 270 | break 271 | 272 | print(f'Finished {dataset}: {time.time() - start2run} sec.') 273 | 274 | if __name__ == '__main__': 275 | fire.Fire(RunModel().run) -------------------------------------------------------------------------------- /bgnn/scripts/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from dgl.data import citation_graph as citegrh, TUDataset 4 | import torch as th 5 | from dgl import DGLGraph 6 | import numpy as np 7 | from sklearn.model_selection import KFold 8 | import itertools 9 | from sklearn.preprocessing import OneHotEncoder as OHE 10 | import random 11 | import json 12 | 13 | def load_cora_data(): 14 | data = citegrh.load_cora() 15 | features = th.FloatTensor(data.features) 16 | labels = th.LongTensor(data.labels) 17 | train_mask = th.BoolTensor(data.train_mask) 18 | test_mask = th.BoolTensor(data.test_mask) 19 | g = DGLGraph(data.graph) 20 | return g, features, labels, train_mask, test_mask 21 | 22 | def get_degree_features(graph): 23 | return graph.out_degrees().unsqueeze(-1).numpy() 24 | 25 | def get_categorical_features(features): 26 | return np.argmax(features, axis=-1).unsqueeze(dim=1).numpy() 27 | 28 | def get_random_int_features(shape, num_categories=100): 29 | return np.random.randint(0, num_categories, size=shape) 30 | 31 | def get_random_norm_features(shape): 32 | return np.random.normal(size=shape) 33 | 34 | def get_random_uniform_features(shape): 35 | return np.random.uniform(-1, 1, size=shape) 36 | 37 | def merge_features(*args): 38 | return np.hstack(args) 39 | 40 | def get_train_data(graph, features, num_random_features=10, num_random_categories=100): 41 | return merge_features( 42 | get_categorical_features(features), 43 | get_degree_features(graph), 44 | get_random_int_features(shape=(features.shape[0], num_random_features), num_categories=num_random_categories), 45 | ) 46 | 47 | 48 | def save_folds(dataset_name, n_splits=3): 49 | dataset = TUDataset(dataset_name) 50 | i = 0 51 | kfold = KFold(n_splits=n_splits, shuffle=True) 52 | dir_name = f'kfold_{dataset_name}' 53 | for trix, teix in kfold.split(range(len(dataset))): 54 | os.makedirs(f'{dir_name}/fold{i}', exist_ok=True) 55 | np.savetxt(f'{dir_name}/fold{i}/train.idx', trix, fmt='%i') 56 | np.savetxt(f'{dir_name}/fold{i}/test.idx', teix, fmt='%i') 57 | i += 1 58 | 59 | 60 | def graph_to_node_label(graphs, labels): 61 | targets = np.array(list(itertools.chain(*[[labels[i]] * graphs[i].number_of_nodes() for i in range(len(graphs))]))) 62 | enc = OHE(dtype=np.float32) 63 | return np.asarray(enc.fit_transform(targets.reshape(-1, 1)).todense()) 64 | 65 | 66 | def get_masks(N, train_size=0.6, val_size=0.2, random_seed=42): 67 | if not random_seed: 68 | seed = random.randint(0, 100) 69 | else: 70 | seed = random_seed 71 | 72 | # print('seed', seed) 73 | random.seed(seed) 74 | 75 | indices = list(range(N)) 76 | random.shuffle(indices) 77 | 78 | train_mask = indices[:int(train_size * len(indices))] 79 | val_mask = indices[int(train_size * len(indices)):int((train_size + val_size) * len(indices))] 80 | train_val_mask = indices[:int((train_size + val_size) * len(indices))] 81 | test_mask = indices[int((train_size + val_size) * len(indices)):] 82 | 83 | return 
train_mask, val_mask, train_val_mask, test_mask 84 | 85 | 86 | class NpEncoder(json.JSONEncoder): 87 | def default(self, obj): 88 | if isinstance(obj, np.integer): 89 | return int(obj) 90 | elif isinstance(obj, np.floating): 91 | return float(obj) 92 | elif isinstance(obj, np.ndarray): 93 | return obj.tolist() 94 | else: 95 | return super(NpEncoder, self).default(obj) -------------------------------------------------------------------------------- /configs/model/bgnn.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | iter_per_epoch: 3 | - 10 4 | - 20 5 | lr: 6 | - 0.01 7 | - 0.1 8 | hidden_dim: 9 | - 64 10 | name: 11 | - gat 12 | - gcn 13 | - agnn 14 | - appnp 15 | only_gbdt: 16 | - false 17 | - true 18 | dropout: 19 | - 0. 20 | - 0.5 21 | depth: 22 | - 6 23 | num_epochs: 200 24 | patience: 10 25 | -------------------------------------------------------------------------------- /configs/model/catboost.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | depth: 6 | - 4 7 | - 6 8 | l2_leaf_reg: 9 | - null 10 | num_epochs: 1000 11 | patience: 100 12 | verbose: false 13 | -------------------------------------------------------------------------------- /configs/model/gnn.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | name: 6 | - gat 7 | - gcn 8 | - agnn 9 | - appnp 10 | mlp: 11 | - true 12 | - false 13 | dropout: 14 | - 0.0 15 | - 0.5 16 | hidden_dim: 17 | - 64 18 | num_epochs: 2000 19 | patience: 200 20 | -------------------------------------------------------------------------------- /configs/model/lightgbm.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | num_leaves: 6 | - 15 7 | - 63 8 | lambda_l2: 9 | - 0.0 10 | boosting: 11 | - gbdt 12 | num_epochs: 1000 13 | patience: 100 14 | -------------------------------------------------------------------------------- /configs/model/mlp.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | num_layers: 6 | - 2 7 | - 3 8 | dropout: 9 | - 0.0 10 | - 0.5 11 | hidden_dim: 12 | - 64 13 | num_epochs: 5000 14 | patience: 200 15 | -------------------------------------------------------------------------------- /configs/model/resgnn.yaml: -------------------------------------------------------------------------------- 1 | hp: 2 | lr: 3 | - 0.01 4 | - 0.1 5 | name: 6 | - gat 7 | - gcn 8 | - agnn 9 | - appnp 10 | dropout: 11 | - 0.0 12 | - 0.5 13 | hidden_dim: 14 | - 64 15 | only_gbdt: 16 | - false 17 | - true 18 | num_epochs: 1000 19 | patience: 100 20 | -------------------------------------------------------------------------------- /datasets.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nd7141/bgnn/11290bc8ec5427faa1cb48ec51d947d5f6624b60/datasets.zip -------------------------------------------------------------------------------- /models/BGNN.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import time 3 | import numpy as np 4 | import torch 5 | 6 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor, sum_models 7 | from .GNN import GNNModelDGL, GATDGL 8 | from .Base import BaseModel 9 | from tqdm import tqdm 10 | from collections import defaultdict as ddict 11 | 
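
The `BGNN` class that follows alternates two learners inside a single training loop: each epoch fits `iter_per_epoch` fresh CatBoost trees, writes their predictions into trainable node features, runs `iter_per_epoch` GNN gradient steps on those features, and then uses the resulting change in the features as the regression targets for the next round of boosting. Below is a minimal sketch of that control flow; the three callables are hypothetical stand-ins, not this file's API (see `BGNN.fit` further down for the real implementation).

```python
# Hedged sketch of the BGNN alternating scheme. `fit_more_trees`,
# `gbdt_predict`, and `gnn_steps` are hypothetical stand-ins for
# CatBoost fitting, CatBoost prediction, and several GNN optimizer steps.
import numpy as np

def bgnn_training_sketch(X, y, train_mask, num_epochs,
                         fit_more_trees, gbdt_predict, gnn_steps):
    targets = y[train_mask]                       # epoch 0 fits trees on the raw labels
    for epoch in range(num_epochs):
        fit_more_trees(X[train_mask], targets)    # boosting continues across epochs
        node_features = gbdt_predict(X)           # GBDT output seeds the node features
        before = node_features.copy()
        node_features = gnn_steps(node_features)  # GNN updates its weights *and* the features
        targets = (node_features - before)[train_mask]  # trees next fit the GNN's correction
        if np.isclose(targets.sum(), 0.0):        # features stopped moving: converged
            break
```
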
12 | class BGNN(BaseModel): 13 | def __init__(self, 14 | task='regression', iter_per_epoch = 10, lr=0.01, hidden_dim=64, dropout=0., 15 | only_gbdt=False, train_non_gbdt=False, 16 | name='gat', use_leaderboard=False, depth=6, gbdt_lr=0.1): 17 | super(BaseModel, self).__init__() 18 | self.learning_rate = lr 19 | self.hidden_dim = hidden_dim 20 | self.task = task 21 | self.dropout = dropout 22 | self.only_gbdt = only_gbdt 23 | self.train_residual = train_non_gbdt 24 | self.name = name 25 | self.use_leaderboard = use_leaderboard 26 | self.iter_per_epoch = iter_per_epoch 27 | self.depth = depth 28 | self.lang = 'dgl' 29 | self.gbdt_lr = gbdt_lr 30 | 31 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 32 | 33 | def __name__(self): 34 | return 'BGNN' 35 | 36 | def init_gbdt_model(self, num_epochs, epoch): 37 | if self.task == 'regression': 38 | catboost_model_obj = CatBoostRegressor 39 | catboost_loss_fn = 'RMSE' #''RMSEWithUncertainty' 40 | else: 41 | if epoch == 0: 42 | catboost_model_obj = CatBoostClassifier 43 | catboost_loss_fn = 'MultiClass' 44 | else: 45 | catboost_model_obj = CatBoostRegressor 46 | catboost_loss_fn = 'MultiRMSE' 47 | 48 | return catboost_model_obj(iterations=num_epochs, 49 | depth=self.depth, 50 | learning_rate=self.gbdt_lr, 51 | loss_function=catboost_loss_fn, 52 | random_seed=0, 53 | nan_mode='Min') 54 | 55 | def fit_gbdt(self, pool, trees_per_epoch, epoch): 56 | gbdt_model = self.init_gbdt_model(trees_per_epoch, epoch) 57 | gbdt_model.fit(pool, verbose=False) 58 | return gbdt_model 59 | 60 | def init_gnn_model(self): 61 | if self.use_leaderboard: 62 | self.model = GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 63 | else: 64 | self.model = GNNModelDGL(in_dim=self.in_dim, 65 | hidden_dim=self.hidden_dim, 66 | out_dim=self.out_dim, 67 | name=self.name, 68 | dropout=self.dropout).to(self.device) 69 | 70 | def append_gbdt_model(self, new_gbdt_model, weights): 71 | if self.gbdt_model is None: 72 | return new_gbdt_model 73 | return sum_models([self.gbdt_model, new_gbdt_model], weights=weights) 74 | 75 | def train_gbdt(self, gbdt_X_train, gbdt_y_train, cat_features, epoch, 76 | gbdt_trees_per_epoch, gbdt_alpha): 77 | 78 | pool = Pool(gbdt_X_train, gbdt_y_train, cat_features=cat_features) 79 | epoch_gbdt_model = self.fit_gbdt(pool, gbdt_trees_per_epoch, epoch) 80 | if epoch == 0 and self.task=='classification': 81 | self.base_gbdt = epoch_gbdt_model 82 | else: 83 | self.gbdt_model = self.append_gbdt_model(epoch_gbdt_model, weights=[1, gbdt_alpha]) 84 | 85 | def update_node_features(self, node_features, X, encoded_X): 86 | if self.task == 'regression': 87 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 88 | # predictions = self.gbdt_model.virtual_ensembles_predict(X, 89 | # virtual_ensembles_count=5, 90 | # prediction_type='TotalUncertainty') 91 | else: 92 | predictions = self.base_gbdt.predict_proba(X) 93 | # predictions = self.base_gbdt.predict(X, prediction_type='RawFormulaVal') 94 | if self.gbdt_model is not None: 95 | predictions_after_one = self.gbdt_model.predict(X) 96 | predictions += predictions_after_one 97 | 98 | if not self.only_gbdt: 99 | if self.train_residual: 100 | predictions = np.append(node_features.detach().cpu().data[:, :-self.out_dim], predictions, 101 | axis=1) # append updated X to prediction 102 | else: 103 | predictions = np.append(encoded_X, predictions, axis=1) # append X to prediction 104 | 105 | predictions = torch.from_numpy(predictions).to(self.device) 106 | 107 | 
node_features.data = predictions.float().data 108 | 109 | def update_gbdt_targets(self, node_features, node_features_before, train_mask): 110 | return (node_features - node_features_before).detach().cpu().numpy()[train_mask, -self.out_dim:] 111 | 112 | def init_node_features(self, X): 113 | node_features = torch.empty(X.shape[0], self.in_dim, requires_grad=True, device=self.device) 114 | if not self.only_gbdt: 115 | node_features.data[:, :-self.out_dim] = torch.from_numpy(X.to_numpy(copy=True)) 116 | return node_features 117 | 118 | def init_node_parameters(self, num_nodes): 119 | return torch.empty(num_nodes, self.out_dim, requires_grad=True, device=self.device) 120 | 121 | def init_optimizer2(self, node_parameters, learning_rate): 122 | params = [self.model.parameters(), [node_parameters]] 123 | return torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 124 | 125 | def update_node_features2(self, node_parameters, X): 126 | if self.task == 'regression': 127 | predictions = np.expand_dims(self.gbdt_model.predict(X), axis=1) 128 | else: 129 | predictions = self.base_gbdt.predict_proba(X) 130 | if self.gbdt_model is not None: 131 | predictions += self.gbdt_model.predict(X) 132 | 133 | predictions = torch.from_numpy(predictions).to(self.device) 134 | node_parameters.data = predictions.float().data 135 | 136 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, cat_features, 137 | num_epochs, patience, logging_epochs=1, loss_fn=None, metric_name='loss', 138 | normalize_features=True, replace_na=True, 139 | ): 140 | 141 | # initialize for early stopping and metrics 142 | if metric_name in ['r2', 'accuracy']: 143 | best_metric = [np.float('-inf')] * 3 # for train/val/test 144 | else: 145 | best_metric = [np.float('inf')] * 3 # for train/val/test 146 | best_val_epoch = 0 147 | epochs_since_last_best_metric = 0 148 | metrics = ddict(list) 149 | if cat_features is None: 150 | cat_features = [] 151 | 152 | if self.task == 'regression': 153 | self.out_dim = y.shape[1] 154 | elif self.task == 'classification': 155 | self.out_dim = len(set(y.iloc[test_mask, 0])) 156 | # self.in_dim = X.shape[1] if not self.only_gbdt else 0 157 | # self.in_dim += 3 if uncertainty else 1 158 | self.in_dim = self.out_dim + X.shape[1] if not self.only_gbdt else self.out_dim 159 | 160 | self.init_gnn_model() 161 | 162 | gbdt_X_train = X.iloc[train_mask] 163 | gbdt_y_train = y.iloc[train_mask] 164 | gbdt_alpha = 1 165 | self.gbdt_model = None 166 | 167 | encoded_X = X.copy() 168 | if not self.only_gbdt: 169 | if len(cat_features): 170 | encoded_X = self.encode_cat_features(encoded_X, y, cat_features, train_mask, val_mask, test_mask) 171 | if normalize_features: 172 | encoded_X = self.normalize_features(encoded_X, train_mask, val_mask, test_mask) 173 | if replace_na: 174 | encoded_X = self.replace_na(encoded_X, train_mask) 175 | 176 | node_features = self.init_node_features(encoded_X) 177 | optimizer = self.init_optimizer(node_features, optimize_node_features=True, learning_rate=self.learning_rate) 178 | 179 | y, = self.pandas_to_torch(y) 180 | self.y = y 181 | if self.lang == 'dgl': 182 | graph = self.networkx_to_torch(networkx_graph) 183 | elif self.lang == 'pyg': 184 | graph = self.networkx_to_torch2(networkx_graph) 185 | 186 | self.graph = graph 187 | 188 | pbar = tqdm(range(num_epochs)) 189 | for epoch in pbar: 190 | start2epoch = time.time() 191 | 192 | # gbdt part 193 | self.train_gbdt(gbdt_X_train, gbdt_y_train, cat_features, epoch, 194 | self.iter_per_epoch, gbdt_alpha) 195 | 196 | 
self.update_node_features(node_features, X, encoded_X) 197 | node_features_before = node_features.clone() 198 | model_in=(graph, node_features) 199 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, 200 | optimizer, metrics, self.iter_per_epoch) 201 | gbdt_y_train = self.update_gbdt_targets(node_features, node_features_before, train_mask) 202 | 203 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 204 | metric_name=metric_name) 205 | # check early stopping 206 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 207 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 208 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 209 | if patience and epochs_since_last_best_metric > patience: 210 | break 211 | if np.isclose(gbdt_y_train.sum(), 0.): 212 | print('Nodes do not change anymore. Stopping...') 213 | break 214 | 215 | if loss_fn: 216 | self.save_metrics(metrics, loss_fn) 217 | 218 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 219 | return metrics 220 | 221 | def predict(self, graph, X, y, test_mask): 222 | node_features = torch.empty(X.shape[0], self.in_dim).to(self.device) 223 | self.update_node_features(node_features, X, X) 224 | return self.evaluate_model((graph, node_features), y, test_mask) -------------------------------------------------------------------------------- /models/Base.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | import torch 3 | from sklearn import preprocessing 4 | import pandas as pd 5 | import torch.nn.functional as F 6 | import numpy as np 7 | from sklearn.metrics import r2_score, accuracy_score 8 | 9 | class BaseModel(torch.nn.Module): 10 | def __init__(self): 11 | super(BaseModel, self).__init__() 12 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 13 | 14 | def pandas_to_torch(self, *args): 15 | return [torch.from_numpy(arg.to_numpy(copy=True)).float().squeeze().to(self.device) for arg in args] 16 | 17 | def networkx_to_torch(self, networkx_graph): 18 | import dgl 19 | # graph = dgl.DGLGraph() 20 | graph = dgl.from_networkx(networkx_graph) 21 | graph = dgl.remove_self_loop(graph) 22 | graph = dgl.add_self_loop(graph) 23 | graph = graph.to(self.device) 24 | return graph 25 | 26 | def networkx_to_torch2(self, networkx_graph): 27 | from torch_geometric.utils import convert 28 | import torch_geometric.transforms as T 29 | graph = convert.from_networkx(networkx_graph) 30 | transform = T.Compose([T.TargetIndegree()]) 31 | graph = transform(graph) 32 | return graph.to(self.device) 33 | 34 | def move_to_device(self, *args): 35 | return [arg.to(self.device) for arg in args] 36 | 37 | def init_optimizer(self, node_features, optimize_node_features, learning_rate): 38 | 39 | params = [self.model.parameters()] 40 | if optimize_node_features: 41 | params.append([node_features]) 42 | optimizer = torch.optim.Adam(itertools.chain(*params), lr=learning_rate) 43 | return optimizer 44 | 45 | def log_epoch(self, pbar, metrics, epoch, loss, epoch_time, logging_epochs, metric_name='loss'): 46 | train_rmse, val_rmse, test_rmse = metrics[metric_name][-1] 47 | if epoch and epoch % logging_epochs == 0: 48 | pbar.set_description( 49 | "Epoch {:05d} | Loss {:.3f} | Loss {:.3f}/{:.3f}/{:.3f} | Time {:.4f}".format(epoch, loss, 50 | train_rmse, 51 | val_rmse, test_rmse, 52 | epoch_time)) 53 | 54 
| def normalize_features(self, X, train_mask, val_mask, test_mask): 55 | min_max_scaler = preprocessing.MinMaxScaler() 56 | A = X.to_numpy(copy=True) 57 | A[train_mask] = min_max_scaler.fit_transform(A[train_mask]) 58 | A[val_mask + test_mask] = min_max_scaler.transform(A[val_mask + test_mask]) 59 | return pd.DataFrame(A, columns=X.columns).astype(float) 60 | 61 | def replace_na(self, X, train_mask): 62 | if X.isna().any().any(): 63 | return X.fillna(X.iloc[train_mask].min() - 1) 64 | return X 65 | 66 | def encode_cat_features(self, X, y, cat_features, train_mask, val_mask, test_mask): 67 | from category_encoders import CatBoostEncoder 68 | enc = CatBoostEncoder() 69 | A = X.to_numpy(copy=True) 70 | b = y.to_numpy(copy=True) 71 | A[np.ix_(train_mask, cat_features)] = enc.fit_transform(A[np.ix_(train_mask, cat_features)], b[train_mask]) 72 | A[np.ix_(val_mask + test_mask, cat_features)] = enc.transform(A[np.ix_(val_mask + test_mask, cat_features)]) 73 | A = A.astype(float) 74 | return pd.DataFrame(A, columns=X.columns) 75 | 76 | def train_model(self, model_in, target_labels, train_mask, optimizer): 77 | y = target_labels[train_mask] 78 | 79 | self.model.train() 80 | logits = self.model(*model_in).squeeze() 81 | pred = logits[train_mask] 82 | 83 | if self.task == 'regression': 84 | loss = torch.sqrt(F.mse_loss(pred, y)) 85 | elif self.task == 'classification': 86 | loss = F.cross_entropy(pred, y.long()) 87 | else: 88 | raise NotImplementedError("Unknown task. Supported tasks: classification, regression.") 89 | 90 | optimizer.zero_grad() 91 | loss.backward() 92 | optimizer.step() 93 | return loss 94 | 95 | def evaluate_model(self, logits, target_labels, mask): 96 | metrics = {} 97 | y = target_labels[mask] 98 | with torch.no_grad(): 99 | pred = logits[mask] 100 | if self.task == 'regression': 101 | metrics['loss'] = torch.sqrt(F.mse_loss(pred, y).squeeze() + 1e-8) 102 | metrics['rmsle'] = torch.sqrt(F.mse_loss(torch.log(pred + 1), torch.log(y + 1)).squeeze() + 1e-8) 103 | metrics['mae'] = F.l1_loss(pred, y) 104 | metrics['r2'] = torch.Tensor([r2_score(y.cpu().numpy(), pred.cpu().numpy())]) 105 | elif self.task == 'classification': 106 | metrics['loss'] = F.cross_entropy(pred, y.long()) 107 | metrics['accuracy'] = torch.Tensor([(y == pred.max(1)[1]).sum().item()/y.shape[0]]) 108 | 109 | return metrics 110 | 111 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 112 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 113 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 114 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 115 | return X_train, y_train, X_val, y_val, X_test, y_test 116 | 117 | def train_and_evaluate(self, model_in, target_labels, train_mask, val_mask, test_mask, 118 | optimizer, metrics, gnn_passes_per_epoch): 119 | loss = None 120 | 121 | for _ in range(gnn_passes_per_epoch): 122 | loss = self.train_model(model_in, target_labels, train_mask, optimizer) 123 | 124 | self.model.eval() 125 | logits = self.model(*model_in).squeeze() 126 | train_results = self.evaluate_model(logits, target_labels, train_mask) 127 | val_results = self.evaluate_model(logits, target_labels, val_mask) 128 | test_results = self.evaluate_model(logits, target_labels, test_mask) 129 | for metric_name in train_results: 130 | metrics[metric_name].append((train_results[metric_name].detach().item(), 131 | val_results[metric_name].detach().item(), 132 | test_results[metric_name].detach().item() 133 | )) 134 | return loss 135 | 136 | def update_early_stopping(self, metrics, epoch, 
best_metric, best_val_epoch, epochs_since_last_best_metric, metric_name, 137 | lower_better=False): 138 | train_metric, val_metric, test_metric = metrics[metric_name][-1] 139 | if (lower_better and val_metric < best_metric[1]) or (not lower_better and val_metric > best_metric[1]): 140 | best_metric = metrics[metric_name][-1] 141 | best_val_epoch = epoch 142 | epochs_since_last_best_metric = 0 143 | else: 144 | epochs_since_last_best_metric += 1 145 | return best_metric, best_val_epoch, epochs_since_last_best_metric 146 | 147 | def save_metrics(self, metrics, fn): 148 | with open(fn, "w+") as f: 149 | for key, value in metrics.items(): 150 | print(key, value, file=f) 151 | 152 | def plot(self, metrics, legend, title, output_fn=None, logx=False, logy=False, metric_name='loss'): 153 | import matplotlib.pyplot as plt 154 | metric_results = metrics[metric_name] 155 | xs = [range(len(metric_results))] * len(metric_results[0]) 156 | ys = list(zip(*metric_results)) 157 | 158 | plt.rcParams.update({'font.size': 40}) 159 | plt.rcParams["figure.figsize"] = (20, 10) 160 | lss = ['-', '--', '-.', ':'] 161 | colors = ['#4053d3', '#ddb310', '#b51d14', '#00beff', '#fb49b0', '#00b25d', '#cacaca'] 162 | colors = [(235, 172, 35), (184, 0, 88), (0, 140, 249), (0, 110, 0), (0, 187, 173), (209, 99, 230), (178, 69, 2), 163 | (255, 146, 135), (89, 84, 214), (0, 198, 248), (135, 133, 0), (0, 167, 108), (189, 189, 189)] 164 | colors = [[p / 255 for p in c] for c in colors] 165 | for i in range(len(ys)): 166 | plt.plot(xs[i], ys[i], lw=4, color=colors[i]) 167 | plt.legend(legend, loc=1, fontsize=30) 168 | plt.title(title) 169 | 170 | plt.xscale('log') if logx else None 171 | plt.yscale('log') if logy else None 172 | plt.xlabel('Iteration') 173 | plt.ylabel('RMSE') 174 | plt.grid() 175 | plt.tight_layout() 176 | 177 | plt.savefig(output_fn, bbox_inches='tight') if output_fn else None 178 | plt.show() 179 | 180 | def plot_interactive(self, metrics, legend, title, logx=False, logy=False, metric_name='loss', start_from=0): 181 | import plotly.graph_objects as go 182 | metric_results = metrics[metric_name] 183 | xs = [list(range(len(metric_results)))] * len(metric_results[0]) 184 | ys = list(zip(*metric_results)) 185 | 186 | fig = go.Figure() 187 | for i in range(len(ys)): 188 | fig.add_trace(go.Scatter(x=xs[i][start_from:], y=ys[i][start_from:], 189 | mode='lines+markers', 190 | name=legend[i])) 191 | 192 | fig.update_layout( 193 | title=title, 194 | title_x=0.5, 195 | xaxis_title='Epoch', 196 | yaxis_title='RMSE', 197 | font=dict( 198 | size=40, 199 | ), 200 | height=600, 201 | ) 202 | 203 | if logx: 204 | fig.update_layout(xaxis_type="log") 205 | if logy: 206 | fig.update_layout(yaxis_type="log") 207 | 208 | fig.show() 209 | -------------------------------------------------------------------------------- /models/GBDT.py: -------------------------------------------------------------------------------- 1 | from catboost import Pool, CatBoostClassifier, CatBoostRegressor 2 | import time 3 | from sklearn.metrics import mean_squared_error, accuracy_score, r2_score 4 | import numpy as np 5 | from collections import defaultdict as ddict 6 | import lightgbm 7 | from lightgbm import LGBMClassifier, LGBMRegressor 8 | 9 | class GBDTCatBoost: 10 | def __init__(self, task='regression', depth=6, lr=0.1, l2_leaf_reg=None, max_bin=None): 11 | self.task = task 12 | self.depth = depth 13 | self.learning_rate = lr 14 | self.l2_leaf_reg = l2_leaf_reg 15 | self.max_bin = max_bin 16 | 17 | 18 | def init_model(self, num_epochs, patience): 
19 | catboost_model_obj = CatBoostRegressor if self.task == 'regression' else CatBoostClassifier 20 | self.catboost_loss_function = 'RMSE' if self.task == 'regression' else 'MultiClass' 21 | self.custom_metrics = ['R2'] if self.task == 'regression' else ['Accuracy'] 22 | # ['Accuracy', 'AUC', 'Precision', 'Recall', 'F1', 'MCC', 'R2'], 23 | 24 | self.model = catboost_model_obj(iterations=num_epochs, 25 | depth=self.depth, 26 | learning_rate=self.learning_rate, 27 | loss_function=self.catboost_loss_function, 28 | custom_metric=self.custom_metrics, 29 | random_seed=0, 30 | early_stopping_rounds=patience, 31 | l2_leaf_reg=self.l2_leaf_reg, 32 | max_bin=self.max_bin, 33 | nan_mode='Min') 34 | 35 | def get_metrics(self): 36 | d = self.model.evals_result_ 37 | metrics = ddict(list) 38 | keys = ['learn', 'validation_0', 'validation_1'] \ 39 | if 'validation_0' in self.model.evals_result_ \ 40 | else ['learn', 'validation'] 41 | for metric_name in d[keys[0]]: 42 | perf = [d[key][metric_name] for key in keys] 43 | if metric_name == self.catboost_loss_function: 44 | metrics['loss'] = list(zip(*perf)) 45 | else: 46 | metrics[metric_name.lower()] = list(zip(*perf)) 47 | 48 | return metrics 49 | 50 | def get_test_metric(self, metrics, metric_name): 51 | if metric_name == 'loss': 52 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 53 | else: 54 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 55 | min_metric = metrics[metric_name][val_epoch] 56 | return min_metric, val_epoch 57 | 58 | def save_metrics(self, metrics, fn): 59 | with open(fn, "w+") as f: 60 | for key, value in metrics.items(): 61 | print(key, value, file=f) 62 | 63 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 64 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 65 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 66 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 67 | return X_train, y_train, X_val, y_val, X_test, y_test 68 | 69 | def fit(self, 70 | X, y, train_mask, val_mask, test_mask, 71 | cat_features=None, num_epochs=1000, patience=200, 72 | plot=False, verbose=False, 73 | loss_fn="", metric_name='loss'): 74 | 75 | X_train, y_train, X_val, y_val, X_test, y_test = \ 76 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 77 | self.init_model(num_epochs, patience) 78 | 79 | start = time.time() 80 | pool = Pool(X_train, y_train, cat_features=cat_features) 81 | eval_set = [(X_val, y_val), (X_test, y_test)] 82 | self.model.fit(pool, eval_set=eval_set, plot=plot, verbose=verbose) 83 | finish = time.time() 84 | 85 | num_trees = self.model.tree_count_ 86 | print('Finished training. 
Total time: {:.2f} | Number of trees: {:d} | Time per tree: {:.2f}'.format(finish - start, num_trees, (time.time() - start )/num_trees)) 87 | 88 | metrics = self.get_metrics() 89 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 90 | if loss_fn: 91 | self.save_metrics(metrics, loss_fn) 92 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 93 | return metrics 94 | 95 | def predict(self, X_test, y_test): 96 | pred = self.model.predict(X_test) 97 | 98 | metrics = {} 99 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 100 | 101 | return metrics 102 | 103 | 104 | class GBDTLGBM: 105 | def __init__(self, task='regression', lr=0.1, num_leaves=31, max_bin=255, 106 | lambda_l1=0., lambda_l2=0., boosting='gbdt'): 107 | self.task = task 108 | self.boosting = boosting 109 | self.learning_rate = lr 110 | self.num_leaves = num_leaves 111 | self.max_bin = max_bin 112 | self.lambda_l1 = lambda_l1 113 | self.lambda_l2 = lambda_l2 114 | 115 | def accuracy(self, preds, train_data): 116 | labels = train_data.get_label() 117 | preds_classes = preds.reshape((preds.shape[0]//labels.shape[0], labels.shape[0])).argmax(0) 118 | return 'accuracy', accuracy_score(labels, preds_classes), True 119 | 120 | def r2(self, preds, train_data): 121 | labels = train_data.get_label() 122 | return 'r2', r2_score(labels, preds), True 123 | 124 | def init_model(self): 125 | 126 | self.parameters = { 127 | 'objective': 'regression' if self.task == 'regression' else 'multiclass', 128 | 'metric': {'rmse'} if self.task == 'regression' else {'multiclass'}, 129 | 'num_classes': self.num_classes, 130 | 'boosting': self.boosting, 131 | 'num_leaves': self.num_leaves, 132 | 'max_bin': self.max_bin, 133 | 'learning_rate': self.learning_rate, 134 | 'lambda_l1': self.lambda_l1, 135 | 'lambda_l2': self.lambda_l2, 136 | # 'num_threads': 1, 137 | # 'feature_fraction': 0.9, 138 | # 'bagging_fraction': 0.8, 139 | # 'bagging_freq': 5, 140 | 'verbose': 1, 141 | # 'device_type': 'gpu' 142 | } 143 | self.evals_result = dict() 144 | 145 | def get_metrics(self): 146 | d = self.evals_result 147 | metrics = ddict(list) 148 | keys = ['training', 'valid_1', 'valid_2'] \ 149 | if 'training' in d \ 150 | else ['valid_0', 'valid_1'] 151 | for metric_name in d[keys[0]]: 152 | perf = [d[key][metric_name] for key in keys] 153 | if metric_name in ['regression', 'multiclass', 'rmse', 'l2', 'multi_logloss', 'binary_logloss']: 154 | metrics['loss'] = list(zip(*perf)) 155 | else: 156 | metrics[metric_name] = list(zip(*perf)) 157 | return metrics 158 | 159 | def get_test_metric(self, metrics, metric_name): 160 | if metric_name == 'loss': 161 | val_epoch = np.argmin([acc[1] for acc in metrics[metric_name]]) 162 | else: 163 | val_epoch = np.argmax([acc[1] for acc in metrics[metric_name]]) 164 | min_metric = metrics[metric_name][val_epoch] 165 | return min_metric, val_epoch 166 | 167 | def save_metrics(self, metrics, fn): 168 | with open(fn, "w+") as f: 169 | for key, value in metrics.items(): 170 | print(key, value, file=f) 171 | 172 | def train_val_test_split(self, X, y, train_mask, val_mask, test_mask): 173 | X_train, y_train = X.iloc[train_mask], y.iloc[train_mask] 174 | X_val, y_val = X.iloc[val_mask], y.iloc[val_mask] 175 | X_test, y_test = X.iloc[test_mask], y.iloc[test_mask] 176 | return X_train, y_train, X_val, y_val, X_test, y_test 177 | 178 | def fit(self, 179 | X, y, train_mask, val_mask, test_mask, 180 | cat_features=None, num_epochs=1000, patience=200, 181 | loss_fn="", 
metric_name='loss'): 182 | 183 | if cat_features is not None: 184 | X = X.copy() 185 | for col in list(X.columns[cat_features]): 186 | X[col] = X[col].astype('category') 187 | 188 | X_train, y_train, X_val, y_val, X_test, y_test = \ 189 | self.train_val_test_split(X, y, train_mask, val_mask, test_mask) 190 | self.num_classes = None if self.task == 'regression' else len(set(y.iloc[:, 0])) 191 | self.init_model() 192 | 193 | start = time.time() 194 | train_data = lightgbm.Dataset(X_train, label=y_train) 195 | val_data = lightgbm.Dataset(X_val, label=y_val) 196 | test_data = lightgbm.Dataset(X_test, label=y_test) 197 | 198 | self.model = lightgbm.train(self.parameters, 199 | train_data, 200 | valid_sets=[train_data, val_data, test_data], 201 | num_boost_round=num_epochs, 202 | early_stopping_rounds=patience, 203 | evals_result=self.evals_result, 204 | feval=self.r2 if self.task == 'regression' else self.accuracy, 205 | verbose_eval=1) 206 | finish = time.time() 207 | 208 | print('Finished training. Total time: {:.2f}'.format(finish - start)) 209 | 210 | metrics = self.get_metrics() 211 | min_metric, min_val_epoch = self.get_test_metric(metrics, metric_name) 212 | if loss_fn: 213 | self.save_metrics(metrics, loss_fn) 214 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, min_val_epoch, *min_metric)) 215 | return metrics 216 | 217 | def predict(self, X_test, y_test): 218 | pred = self.model.predict(X_test) 219 | 220 | metrics = {} 221 | metrics['rmse'] = mean_squared_error(pred, y_test) ** .5 222 | 223 | return metrics -------------------------------------------------------------------------------- /models/GNN.py: -------------------------------------------------------------------------------- 1 | import time 2 | import numpy as np 3 | import torch 4 | from torch.nn import Dropout, ELU 5 | import torch.nn.functional as F 6 | from torch import nn 7 | from dgl.nn.pytorch import GATConv as GATConvDGL, GraphConv, ChebConv as ChebConvDGL, \ 8 | AGNNConv as AGNNConvDGL, APPNPConv 9 | from torch.nn import Sequential, Linear, ReLU, Identity 10 | from tqdm import tqdm 11 | from .Base import BaseModel 12 | from torch.autograd import Variable 13 | from collections import defaultdict as ddict 14 | from .MLP import MLPRegressor 15 | 16 | 17 | class ElementWiseLinear(nn.Module): 18 | def __init__(self, size, weight=True, bias=True, inplace=False): 19 | super().__init__() 20 | if weight: 21 | self.weight = nn.Parameter(torch.Tensor(size)) 22 | else: 23 | self.weight = None 24 | if bias: 25 | self.bias = nn.Parameter(torch.Tensor(size)) 26 | else: 27 | self.bias = None 28 | self.inplace = inplace 29 | 30 | self.reset_parameters() 31 | 32 | def reset_parameters(self): 33 | if self.weight is not None: 34 | nn.init.ones_(self.weight) 35 | if self.bias is not None: 36 | nn.init.zeros_(self.bias) 37 | 38 | def forward(self, x): 39 | if self.inplace: 40 | if self.weight is not None: 41 | x.mul_(self.weight) 42 | if self.bias is not None: 43 | x.add_(self.bias) 44 | else: 45 | if self.weight is not None: 46 | x = x * self.weight 47 | if self.bias is not None: 48 | x = x + self.bias 49 | return x 50 | 51 | class GATDGL(torch.nn.Module): 52 | ''' 53 | Implementation of leaderboard GAT network for OGB datasets. 
54 | https://github.com/Espylapiza/dgl/blob/master/examples/pytorch/ogb/ogbn-arxiv/models.py 55 | ''' 56 | def __init__( 57 | self, 58 | in_feats, 59 | n_classes, 60 | n_layers=3, 61 | n_heads=3, 62 | activation=F.relu, 63 | n_hidden=250, 64 | dropout=0.75, 65 | input_drop=0.1, 66 | attn_drop=0.0, 67 | ): 68 | super().__init__() 69 | self.in_feats = in_feats 70 | self.n_hidden = n_hidden 71 | self.n_classes = n_classes 72 | self.n_layers = n_layers 73 | self.num_heads = n_heads 74 | 75 | self.convs = torch.nn.ModuleList() 76 | self.norms = torch.nn.ModuleList() 77 | 78 | for i in range(n_layers): 79 | in_hidden = n_heads * n_hidden if i > 0 else in_feats 80 | out_hidden = n_hidden if i < n_layers - 1 else n_classes 81 | num_heads = n_heads if i < n_layers - 1 else 1 82 | out_channels = n_heads 83 | 84 | self.convs.append( 85 | GATConvDGL( 86 | in_hidden, 87 | out_hidden, 88 | num_heads=num_heads, 89 | attn_drop=attn_drop, 90 | residual=True, 91 | ) 92 | ) 93 | 94 | if i < n_layers - 1: 95 | self.norms.append(torch.nn.BatchNorm1d(out_channels * out_hidden)) 96 | 97 | self.bias_last = ElementWiseLinear(n_classes, weight=False, bias=True, inplace=True) 98 | 99 | self.input_drop = nn.Dropout(input_drop) 100 | self.dropout = nn.Dropout(dropout) 101 | self.activation = activation 102 | 103 | def forward(self, graph, feat): 104 | h = feat 105 | h = self.input_drop(h) 106 | 107 | for i in range(self.n_layers): 108 | conv = self.convs[i](graph, h) 109 | 110 | h = conv 111 | 112 | if i < self.n_layers - 1: 113 | h = h.flatten(1) 114 | h = self.norms[i](h) 115 | h = self.activation(h, inplace=True) 116 | h = self.dropout(h) 117 | 118 | h = h.mean(1) 119 | h = self.bias_last(h) 120 | 121 | return h 122 | 123 | 124 | 125 | class GNNModelDGL(torch.nn.Module): 126 | def __init__(self, in_dim, hidden_dim, out_dim, 127 | dropout=0., name='gat', residual=True, use_mlp=False, join_with_mlp=False): 128 | super(GNNModelDGL, self).__init__() 129 | self.name = name 130 | self.use_mlp = use_mlp 131 | self.join_with_mlp = join_with_mlp 132 | self.normalize_input_columns = True 133 | if use_mlp: 134 | self.mlp = MLPRegressor(in_dim, hidden_dim, out_dim) 135 | if join_with_mlp: 136 | in_dim += out_dim 137 | else: 138 | in_dim = out_dim 139 | if name == 'gat': 140 | self.l1 = GATConvDGL(in_dim, hidden_dim//8, 8, feat_drop=dropout, attn_drop=dropout, residual=False, 141 | activation=F.elu) 142 | self.l2 = GATConvDGL(hidden_dim, out_dim, 1, feat_drop=dropout, attn_drop=dropout, residual=residual, activation=None) 143 | elif name == 'gcn': 144 | self.l1 = GraphConv(in_dim, hidden_dim, activation=F.elu) 145 | self.l2 = GraphConv(hidden_dim, out_dim, activation=F.elu) 146 | self.drop = Dropout(p=dropout) 147 | elif name == 'cheb': 148 | self.l1 = ChebConvDGL(in_dim, hidden_dim, k = 3) 149 | self.l2 = ChebConvDGL(hidden_dim, out_dim, k = 3) 150 | self.drop = Dropout(p=dropout) 151 | elif name == 'agnn': 152 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), ELU()) 153 | self.l1 = AGNNConvDGL(learn_beta=False) 154 | self.l2 = AGNNConvDGL(learn_beta=True) 155 | self.lin2 = Sequential(Dropout(p=dropout), Linear(hidden_dim, out_dim), ELU()) 156 | elif name == 'appnp': 157 | self.lin1 = Sequential(Dropout(p=dropout), Linear(in_dim, hidden_dim), 158 | ReLU(), Dropout(p=dropout), Linear(hidden_dim, out_dim)) 159 | self.l1 = APPNPConv(k=10, alpha=0.1, edge_drop=0.) 
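
Each `name` above selects a small two-layer stack that `forward` (next) dispatches on. A hedged smoke test on a random toy graph follows; the CPU build of DGL and the import path are assumptions for illustration, not something this file prescribes.

```python
# Hypothetical usage sketch for GNNModelDGL (toy graph, CPU DGL assumed;
# adjust the import to models.GNN or bgnn.models.GNN depending on layout).
import dgl
import torch
from models.GNN import GNNModelDGL

g = dgl.add_self_loop(dgl.rand_graph(10, 20))  # 10 nodes, 20 random edges
feats = torch.randn(10, 16)
model = GNNModelDGL(in_dim=16, hidden_dim=64, out_dim=3, name='gcn')
logits = model(g, feats)
print(logits.shape)                            # expected: torch.Size([10, 3])
```
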
160 | 161 | 162 | def forward(self, graph, features): 163 | h = features 164 | if self.use_mlp: 165 | if self.join_with_mlp: 166 | h = torch.cat((h, self.mlp(features)), 1) 167 | else: 168 | h = self.mlp(features) 169 | if self.name == 'gat': 170 | h = self.l1(graph, h).flatten(1) 171 | logits = self.l2(graph, h).mean(1) 172 | elif self.name in ['appnp']: 173 | h = self.lin1(h) 174 | logits = self.l1(graph, h) 175 | elif self.name == 'agnn': 176 | h = self.lin1(h) 177 | h = self.l1(graph, h) 178 | h = self.l2(graph, h) 179 | logits = self.lin2(h) 180 | elif self.name in ['gcn', 'cheb']: 181 | h = self.drop(h) 182 | h = self.l1(graph, h) 183 | logits = self.l2(graph, h) 184 | 185 | 186 | return logits 187 | 188 | class GNN(BaseModel): 189 | def __init__(self, task='regression', lr=0.01, hidden_dim=64, dropout=0., 190 | name='gat', residual=True, lang='dgl', 191 | gbdt_predictions=None, mlp=False, use_leaderboard=False, only_gbdt=False): 192 | super(GNN, self).__init__() 193 | 194 | self.dropout = dropout 195 | self.learning_rate = lr 196 | self.hidden_dim = hidden_dim 197 | self.task = task 198 | self.model_name = name 199 | self.use_residual = residual 200 | self.lang = lang 201 | self.use_mlp = mlp 202 | self.use_leaderboard = use_leaderboard 203 | self.gbdt_predictions = gbdt_predictions 204 | self.only_gbdt = only_gbdt 205 | 206 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 207 | 208 | def __name__(self): 209 | if self.gbdt_predictions is None: 210 | return 'GNN' 211 | else: 212 | return 'ResGNN' 213 | 214 | def init_model(self): 215 | if self.lang == 'pyg': 216 | self.model = GNNModelPYG(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 217 | heads=self.heads, dropout=self.dropout, name=self.model_name, 218 | residual=self.use_residual).to(self.device) 219 | elif self.lang == 'dgl': 220 | if self.use_leaderboard: 221 | self.model = GATDGL(in_feats=self.in_dim, n_classes=self.out_dim).to(self.device) 222 | else: 223 | self.model = GNNModelDGL(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 224 | dropout=self.dropout, name=self.model_name, 225 | residual=self.use_residual, use_mlp=self.use_mlp, 226 | join_with_mlp=self.use_mlp).to(self.device) 227 | 228 | def init_node_features(self, X, optimize_node_features): 229 | node_features = Variable(X, requires_grad=optimize_node_features) 230 | return node_features 231 | 232 | def fit(self, networkx_graph, X, y, train_mask, val_mask, test_mask, num_epochs, 233 | cat_features=None, patience=200, logging_epochs=1, optimize_node_features=False, 234 | loss_fn=None, metric_name='loss', normalize_features=True, replace_na=True): 235 | 236 | # initialize for early stopping and metrics 237 | if metric_name in ['r2', 'accuracy']: 238 | best_metric = [np.float('-inf')] * 3 # for train/val/test 239 | else: 240 | best_metric = [np.float('inf')] * 3 # for train/val/test 241 | best_val_epoch = 0 242 | epochs_since_last_best_metric = 0 243 | metrics = ddict(list) # metric_name -> (train/val/test) 244 | if cat_features is None: 245 | cat_features = [] 246 | 247 | if self.gbdt_predictions is not None: 248 | X = X.copy() 249 | X['predict'] = self.gbdt_predictions 250 | if self.only_gbdt: 251 | cat_features = [] 252 | X = X[['predict']] 253 | 254 | self.in_dim = X.shape[1] 255 | self.hidden_dim = self.hidden_dim 256 | if self.task == 'regression': 257 | self.out_dim = y.shape[1] 258 | elif self.task == 'classification': 259 | self.out_dim = len(set(y.iloc[:, 0])) 260 | 261 | if 
len(cat_features): 262 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 263 | if normalize_features: 264 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 265 | if replace_na: 266 | X = self.replace_na(X, train_mask) 267 | 268 | X, y = self.pandas_to_torch(X, y) 269 | if len(X.shape) == 1: 270 | X = X.unsqueeze(1) 271 | 272 | if self.lang == 'dgl': 273 | graph = self.networkx_to_torch(networkx_graph) 274 | elif self.lang == 'pyg': 275 | graph = self.networkx_to_torch2(networkx_graph) 276 | self.init_model() 277 | node_features = self.init_node_features(X, optimize_node_features) 278 | 279 | self.node_features = node_features 280 | self.graph = graph 281 | optimizer = self.init_optimizer(node_features, optimize_node_features, self.learning_rate) 282 | 283 | pbar = tqdm(range(num_epochs)) 284 | for epoch in pbar: 285 | start2epoch = time.time() 286 | 287 | model_in = (graph, node_features) 288 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 289 | metrics, gnn_passes_per_epoch=1) 290 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 291 | metric_name=metric_name) 292 | 293 | # check early stopping 294 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 295 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 296 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 297 | if patience and epochs_since_last_best_metric > patience: 298 | break 299 | 300 | if loss_fn: 301 | self.save_metrics(metrics, loss_fn) 302 | 303 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 304 | return metrics 305 | 306 | def predict(self, graph, node_features, target_labels, test_mask): 307 | return self.evaluate_model((graph, node_features), target_labels, test_mask) -------------------------------------------------------------------------------- /models/MLP.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import numpy as np 5 | import time 6 | from tqdm import tqdm 7 | from .Base import BaseModel 8 | from sklearn.metrics import r2_score 9 | from collections import defaultdict as ddict 10 | 11 | class MLPClassifier(torch.nn.Module): 12 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 13 | super(MLPClassifier, self).__init__() 14 | 15 | self.lins = torch.nn.ModuleList() 16 | self.lins.append(torch.nn.Linear(in_dim, hidden_dim)) 17 | self.bns = torch.nn.ModuleList() 18 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 19 | for _ in range(num_layers - 2): 20 | self.lins.append(torch.nn.Linear(hidden_dim, hidden_dim)) 21 | self.bns.append(torch.nn.BatchNorm1d(hidden_dim)) 22 | self.lins.append(torch.nn.Linear(hidden_dim, out_dim)) 23 | 24 | self.dropout = dropout 25 | 26 | def reset_parameters(self): 27 | for lin in self.lins: 28 | lin.reset_parameters() 29 | 30 | def forward(self, x): 31 | for i, lin in enumerate(self.lins[:-1]): 32 | x = lin(x) 33 | x = self.bns[i](x) 34 | x = F.relu(x) 35 | x = F.dropout(x, p=self.dropout, training=self.training) 36 | x = self.lins[-1](x) 37 | return x 38 | 39 | 40 | class MLPRegressor(nn.Module): 41 | def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3, dropout=0.5): 42 | super(MLPRegressor, self).__init__() 43 | 44 | self.layers = nn.Sequential( 45 | 
nn.Linear(in_dim, hidden_dim), 46 | nn.ReLU(), 47 | nn.Dropout(p=dropout), 48 | nn.Linear(hidden_dim, hidden_dim), 49 | nn.ReLU(), 50 | nn.Dropout(p=dropout), 51 | nn.Linear(hidden_dim, out_dim) 52 | ) 53 | 54 | def forward(self, x): 55 | return self.layers(x) 56 | 57 | 58 | class MLP(BaseModel): 59 | def __init__(self, task='regression', num_layers=3, dropout=0., lr=0.01, hidden_dim=128): 60 | super(MLP, self).__init__() 61 | self.task = task 62 | self.num_layers = num_layers 63 | self.dropout = dropout 64 | self.learning_rate = lr 65 | self.hidden_dim = hidden_dim 66 | 67 | 68 | self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 69 | 70 | def __name__(self): 71 | return 'MLP' 72 | 73 | def init_model(self): 74 | # mlp_model = MLPRegressor if self.task == 'regression' else MLPClassifier 75 | mlp_model = MLPClassifier 76 | self.model = mlp_model(in_dim=self.in_dim, hidden_dim=self.hidden_dim, out_dim=self.out_dim, 77 | num_layers=self.num_layers, dropout=self.dropout).to( 78 | self.device) 79 | 80 | def fit(self, X, y, train_mask, val_mask, test_mask, cat_features=None, 81 | num_epochs=1000, patience=200, 82 | logging_epochs=1, loss_fn=None, 83 | metric_name='loss', normalize_features=True, replace_na=True): 84 | 85 | # initialize for early stopping and metrics 86 | if metric_name in ['r2', 'accuracy']: 87 | best_metric = [np.float('-inf')] * 3 # for train/val/test 88 | else: 89 | best_metric = [np.float('inf')] * 3 # for train/val/test 90 | best_val_epoch = 0 91 | epochs_since_last_best_metric = 0 92 | metrics = ddict(list) # metric_name -> (train/val/test) 93 | if cat_features is None: 94 | cat_features = [] 95 | 96 | self.in_dim = X.shape[1] 97 | self.hidden_dim = self.hidden_dim 98 | if self.task == 'regression': 99 | self.out_dim = y.shape[1] 100 | elif self.task == 'classification': 101 | self.out_dim = len(set(y.iloc[:, 0])) 102 | 103 | 104 | if len(cat_features): 105 | X = self.encode_cat_features(X, y, cat_features, train_mask, val_mask, test_mask) 106 | if normalize_features: 107 | X = self.normalize_features(X, train_mask, val_mask, test_mask) 108 | if replace_na: 109 | X = self.replace_na(X, train_mask) 110 | 111 | X, y = self.pandas_to_torch(X, y) 112 | if len(X.shape) == 1: 113 | X = X.unsqueeze(dim=1) 114 | 115 | self.init_model() 116 | optimizer = self.init_optimizer(None, False, learning_rate=self.learning_rate) 117 | 118 | pbar = tqdm(range(num_epochs)) 119 | for epoch in pbar: 120 | 121 | start2epoch = time.time() 122 | 123 | model_in = (X,) 124 | loss = self.train_and_evaluate(model_in, y, train_mask, val_mask, test_mask, optimizer, 125 | metrics, gnn_passes_per_epoch=1) 126 | self.log_epoch(pbar, metrics, epoch, loss, time.time() - start2epoch, logging_epochs, 127 | metric_name=metric_name) 128 | 129 | # check early stopping 130 | best_metric, best_val_epoch, epochs_since_last_best_metric = \ 131 | self.update_early_stopping(metrics, epoch, best_metric, best_val_epoch, epochs_since_last_best_metric, 132 | metric_name, lower_better=(metric_name not in ['r2', 'accuracy'])) 133 | if patience and epochs_since_last_best_metric > patience: 134 | break 135 | 136 | if loss_fn: 137 | self.save_metrics(metrics, loss_fn) 138 | 139 | print('Best {} at iteration {}: {:.3f}/{:.3f}/{:.3f}'.format(metric_name, best_val_epoch, *best_metric)) 140 | return metrics 141 | 142 | def predict(self, X, target_labels, test_mask): 143 | return self.evaluate_model((X,), target_labels, test_mask) 
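
One detail worth flagging in `MLP.py` above: `init_model` instantiates `MLPClassifier` for both tasks (the commented-out line shows the `MLPRegressor` alternative). Because the network ends in a plain linear layer and `BaseModel.train_model` chooses the loss by task, the same architecture serves regression too. A hedged end-to-end sketch on synthetic data; the sizes, targets, and import path are illustrative assumptions.

```python
# Hypothetical quick-start for the MLP wrapper on synthetic data (CPU).
import numpy as np
import pandas as pd
from models.MLP import MLP  # or bgnn.models.MLP, depending on installation

X = pd.DataFrame(np.random.randn(100, 8))
y = pd.DataFrame(np.random.rand(100, 1))          # single regression target
train, val, test = list(range(60)), list(range(60, 80)), list(range(80, 100))

model = MLP(task='regression', num_layers=2, hidden_dim=64, lr=0.01)
metrics = model.fit(X, y, train, val, test, num_epochs=100, patience=20)
print(min(metrics['loss'], key=lambda t: t[1]))   # best (train, val, test) RMSE by val
```
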
-------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nd7141/bgnn/11290bc8ec5427faa1cb48ec51d947d5f6624b60/models/__init__.py -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.19.4 2 | plotly==4.14.1 3 | catboost==0.24.4 4 | lightgbm==3.0.0 5 | networkx==2.5 6 | matplotlib==3.3.3 7 | pandas==1.1.5 8 | tqdm==4.55.1 9 | fire==0.3.1 10 | omegaconf==2.0.5 11 | category_encoders==2.2.2 12 | scikit_learn==0.24.0 13 | -------------------------------------------------------------------------------- /scripts/run.py: -------------------------------------------------------------------------------- 1 | # from catboost import CatboostError 2 | # import sys 3 | # sys.path.append('../') 4 | 5 | from bgnn.models.GBDT import GBDTCatBoost, GBDTLGBM 6 | from bgnn.models.MLP import MLP 7 | from bgnn.models.GNN import GNN 8 | from bgnn.models.BGNN import BGNN 9 | from bgnn.scripts.utils import NpEncoder 10 | 11 | import os 12 | import json 13 | import time 14 | import datetime 15 | from pathlib import Path 16 | from collections import defaultdict as ddict 17 | 18 | import pandas as pd 19 | import networkx as nx 20 | import random 21 | import numpy as np 22 | import fire 23 | from omegaconf import OmegaConf 24 | from sklearn.model_selection import ParameterGrid 25 | 26 | 27 | class RunModel: 28 | def read_input(self, input_folder): 29 | self.X = pd.read_csv(f'{input_folder}/X.csv') 30 | self.y = pd.read_csv(f'{input_folder}/y.csv') 31 | 32 | networkx_graph = nx.read_graphml(f'{input_folder}/graph.graphml') 33 | networkx_graph = nx.relabel_nodes(networkx_graph, {str(i): i for i in range(len(networkx_graph))}) 34 | self.networkx_graph = networkx_graph 35 | 36 | categorical_columns = [] 37 | if os.path.exists(f'{input_folder}/cat_features.txt'): 38 | with open(f'{input_folder}/cat_features.txt') as f: 39 | for line in f: 40 | if line.strip(): 41 | categorical_columns.append(line.strip()) 42 | 43 | self.cat_features = None 44 | if categorical_columns: 45 | columns = self.X.columns 46 | self.cat_features = np.where(columns.isin(categorical_columns))[0] 47 | 48 | for col in list(columns[self.cat_features]): 49 | self.X[col] = self.X[col].astype(str) 50 | 51 | 52 | if os.path.exists(f'{input_folder}/masks.json'): 53 | with open(f'{input_folder}/masks.json') as f: 54 | self.masks = json.load(f) 55 | else: 56 | print('Creating and saving train/val/test masks') 57 | idx = list(range(self.y.shape[0])) 58 | self.masks = dict() 59 | for i in range(self.max_seeds): 60 | random.shuffle(idx) 61 | r1, r2, r3 = idx[:int(.6*len(idx))], idx[int(.6*len(idx)):int(.8*len(idx))], idx[int(.8*len(idx)):] 62 | self.masks[str(i)] = {"train": r1, "val": r2, "test": r3} 63 | 64 | with open(f'{input_folder}/masks.json', 'w+') as f: 65 | json.dump(self.masks, f, cls=NpEncoder) 66 | 67 | 68 | def get_input(self, dataset_dir, dataset: str): 69 | if dataset == 'house': 70 | input_folder = dataset_dir / 'house' 71 | elif dataset == 'county': 72 | input_folder = dataset_dir / 'county' 73 | elif dataset == 'vk': 74 | input_folder = dataset_dir / 'vk' 75 | elif dataset == 'wiki': 76 | input_folder = dataset_dir / 'wiki' 77 | elif dataset == 'avazu': 78 | input_folder = dataset_dir / 'avazu' 79 | elif dataset == 'vk_class': 80 | input_folder = 
dataset_dir / 'vk_class' 81 | elif dataset == 'house_class': 82 | input_folder = dataset_dir / 'house_class' 83 | elif dataset == 'dblp': 84 | input_folder = dataset_dir / 'dblp' 85 | elif dataset == 'slap': 86 | input_folder = dataset_dir / 'slap' 87 | else: 88 | input_folder = dataset 89 | 90 | if self.save_folder is None: 91 | self.save_folder = f'results/{dataset}/{datetime.datetime.now().strftime("%d_%m")}' 92 | 93 | self.read_input(input_folder) 94 | print('Save to folder:', self.save_folder) 95 | 96 | 97 | def run_one_model(self, config_fn, model_name): 98 | self.config = OmegaConf.load(config_fn) 99 | grid = ParameterGrid(dict(self.config.hp)) 100 | 101 | for ps in grid: 102 | param_string = ''.join([f'-{key}{ps[key]}' for key in ps]) 103 | exp_name = f'{model_name}{param_string}' 104 | print(f'\nSeed {self.seed} RUNNING:{exp_name}') 105 | 106 | runs = [] 107 | runs_custom = [] 108 | times = [] 109 | for _ in range(self.repeat_exp): 110 | start = time.time() 111 | model = self.define_model(model_name, ps) 112 | 113 | inputs = {'X': self.X, 'y': self.y, 'train_mask': self.train_mask, 114 | 'val_mask': self.val_mask, 'test_mask': self.test_mask, 'cat_features': self.cat_features} 115 | if model_name in ['gnn', 'resgnn', 'bgnn']: 116 | inputs['networkx_graph'] = self.networkx_graph 117 | 118 | metrics = model.fit(num_epochs=self.config.num_epochs, patience=self.config.patience, 119 | loss_fn=f"{self.seed_folder}/{exp_name}.txt", 120 | metric_name='loss' if self.task == 'regression' else 'accuracy', **inputs) 121 | finish = time.time() 122 | best_loss = min(metrics['loss'], key=lambda x: x[1]) 123 | best_custom = max(metrics['r2' if self.task == 'regression' else 'accuracy'], key=lambda x: x[1]) 124 | runs.append(best_loss) 125 | runs_custom.append(best_custom) 126 | times.append(finish - start) 127 | self.store_results[exp_name] = (list(map(np.mean, zip(*runs))), 128 | list(map(np.mean, zip(*runs_custom))), 129 | np.mean(times), 130 | ) 131 | 132 | def define_model(self, model_name, ps): 133 | if model_name == 'catboost': 134 | return GBDTCatBoost(self.task, **ps) 135 | elif model_name == 'lightgbm': 136 | return GBDTLGBM(self.task, **ps) 137 | elif model_name == 'mlp': 138 | return MLP(self.task, **ps) 139 | elif model_name == 'gnn': 140 | return GNN(self.task, **ps) 141 | elif model_name == 'resgnn': 142 | gbdt = GBDTCatBoost(self.task) 143 | gbdt.fit(self.X, self.y, self.train_mask, self.val_mask, self.test_mask, 144 | cat_features=self.cat_features, 145 | num_epochs=1000, patience=100, 146 | plot=False, verbose=False, loss_fn=None, 147 | metric_name='loss' if self.task == 'regression' else 'accuracy') 148 | return GNN(task=self.task, gbdt_predictions=gbdt.model.predict(self.X), **ps) 149 | elif model_name == 'bgnn': 150 | return BGNN(self.task, **ps) 151 | 152 | def create_save_folder(self, seed): 153 | self.seed_folder = f'{self.save_folder}/{seed}' 154 | os.makedirs(self.seed_folder, exist_ok=True) 155 | 156 | def split_masks(self, seed): 157 | self.train_mask, self.val_mask, self.test_mask = self.masks[seed]['train'], \ 158 | self.masks[seed]['val'], self.masks[seed]['test'] 159 | 160 | def save_results(self, seed): 161 | self.seed_results[seed] = self.store_results 162 | with open(f'{self.save_folder}/seed_results.json', 'w+') as f: 163 | json.dump(self.seed_results, f) 164 | 165 | self.aggregated = self.aggregate_results() 166 | with open(f'{self.save_folder}/aggregated_results.json', 'w+') as f: 167 | json.dump(self.aggregated, f) 168 | 169 | def get_model_name(self, 
        # get the name of the model (for gnn-like models, e.g. gat)
        if 'name' in exp_name:
            model_name = '-' + [param[4:] for param in exp_name.split('-') if param.startswith('name')][0]
        else:
            model_name = ''

        # check whether the model used an MLP (e.g. MLP-GNN)
        if 'gnn' in exp_name and 'mlpTrue' in exp_name:
            model_name += '-MLP'

        # algo corresponds to the type of the model (e.g. gnn, resgnn, bgnn)
        for algo in algos:
            if exp_name.startswith(algo):
                return algo + model_name
        return 'unknown'

    def aggregate_results(self):
        algos = ['catboost', 'lightgbm', 'mlp', 'gnn', 'resgnn', 'bgnn']
        model_best_score = ddict(list)
        model_best_time = ddict(list)

        results = self.seed_results
        for seed in results:
            model_results_for_seed = ddict(list)
            for name, output in results[seed].items():
                model_name = self.get_model_name(name, algos=algos)
                if self.task == 'regression':  # rmse metric
                    val_metric, test_metric, runtime = output[0][1], output[0][2], output[2]
                else:  # accuracy metric
                    val_metric, test_metric, runtime = output[1][1], output[1][2], output[2]
                model_results_for_seed[model_name].append((val_metric, test_metric, runtime))

            # pick the best hyperparameter configuration by validation metric
            for model_name, model_results in model_results_for_seed.items():
                if self.task == 'regression':
                    best_result = min(model_results)  # rmse
                else:
                    best_result = max(model_results)  # accuracy
                model_best_score[model_name].append(best_result[1])
                model_best_time[model_name].append(best_result[2])

        aggregated = dict()
        for model, scores in model_best_score.items():
            aggregated[model] = (np.mean(scores), np.std(scores),
                                 np.mean(model_best_time[model]), np.std(model_best_time[model]))
        return aggregated

    def run(self, dataset: str, *args,
            save_folder: str = None,
            task: str = 'regression',
            repeat_exp: int = 1,
            max_seeds: int = 5,
            dataset_dir: str = None,
            config_dir: str = None
            ):
        start2run = time.time()
        self.repeat_exp = repeat_exp
        self.max_seeds = max_seeds
        print(dataset, args, task, repeat_exp, max_seeds, dataset_dir, config_dir)

        dataset_dir = Path(dataset_dir) if dataset_dir else Path(__file__).parent.parent / 'datasets'
        config_dir = Path(config_dir) if config_dir else Path(__file__).parent.parent / 'configs' / 'model'
        print(dataset_dir, config_dir)

        self.task = task
        self.save_folder = save_folder
        self.get_input(dataset_dir, dataset)

        self.seed_results = dict()
        for ix, seed in enumerate(self.masks):
            print(f'{dataset} Seed {seed}')
            self.seed = seed

            self.create_save_folder(seed)
            self.split_masks(seed)

            self.store_results = dict()
            for arg in args:
                if arg == 'all':
                    self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost")
                    self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm")
                    self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp")
                    self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn")
                    self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn")
                    self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn")
                    break
                elif arg == 'catboost':
                    self.run_one_model(config_fn=config_dir / 'catboost.yaml', model_name="catboost")
                elif arg == 'lightgbm':
                    self.run_one_model(config_fn=config_dir / 'lightgbm.yaml', model_name="lightgbm")
                elif arg == 'mlp':
                    self.run_one_model(config_fn=config_dir / 'mlp.yaml', model_name="mlp")
                elif arg == 'gnn':
                    self.run_one_model(config_fn=config_dir / 'gnn.yaml', model_name="gnn")
                elif arg == 'resgnn':
                    self.run_one_model(config_fn=config_dir / 'resgnn.yaml', model_name="resgnn")
                elif arg == 'bgnn':
                    self.run_one_model(config_fn=config_dir / 'bgnn.yaml', model_name="bgnn")

            self.save_results(seed)
            if ix + 1 >= max_seeds:
                break

        print(f'Finished {dataset}: {time.time() - start2run} sec.')


if __name__ == '__main__':
    fire.Fire(RunModel().run)
--------------------------------------------------------------------------------
/scripts/utils.py:
--------------------------------------------------------------------------------
import os

from dgl.data import citation_graph as citegrh, TUDataset
import torch as th
from dgl import DGLGraph
import numpy as np
from sklearn.model_selection import KFold
import itertools
from sklearn.preprocessing import OneHotEncoder as OHE
import random
import json


def load_cora_data():
    data = citegrh.load_cora()
    features = th.FloatTensor(data.features)
    labels = th.LongTensor(data.labels)
    train_mask = th.BoolTensor(data.train_mask)
    test_mask = th.BoolTensor(data.test_mask)
    g = DGLGraph(data.graph)
    return g, features, labels, train_mask, test_mask


def get_degree_features(graph):
    return graph.out_degrees().unsqueeze(-1).numpy()


def get_categorical_features(features):
    # class index per node from one-hot features; torch argmax keeps the
    # result a tensor, so .unsqueeze() is available before converting to numpy
    return th.argmax(features, dim=-1).unsqueeze(dim=1).numpy()


def get_random_int_features(shape, num_categories=100):
    return np.random.randint(0, num_categories, size=shape)


def get_random_norm_features(shape):
    return np.random.normal(size=shape)


def get_random_uniform_features(shape):
    return np.random.uniform(-1, 1, size=shape)


def merge_features(*args):
    return np.hstack(args)


def get_train_data(graph, features, num_random_features=10, num_random_categories=100):
    return merge_features(
        get_categorical_features(features),
        get_degree_features(graph),
        get_random_int_features(shape=(features.shape[0], num_random_features), num_categories=num_random_categories),
    )


def save_folds(dataset_name, n_splits=3):
    dataset = TUDataset(dataset_name)
    i = 0
    kfold = KFold(n_splits=n_splits, shuffle=True)
    dir_name = f'kfold_{dataset_name}'
    for trix, teix in kfold.split(range(len(dataset))):
        os.makedirs(f'{dir_name}/fold{i}', exist_ok=True)
        np.savetxt(f'{dir_name}/fold{i}/train.idx', trix, fmt='%i')
        np.savetxt(f'{dir_name}/fold{i}/test.idx', teix, fmt='%i')
        i += 1


def graph_to_node_label(graphs, labels):
    targets = np.array(list(itertools.chain(*[[labels[i]] * graphs[i].number_of_nodes() for i in range(len(graphs))])))
    enc = OHE(dtype=np.float32)
    return np.asarray(enc.fit_transform(targets.reshape(-1, 1)).todense())


def get_masks(N, train_size=0.6, val_size=0.2, random_seed=42):
    # draw a random seed only when none is given, so that seed 0 is honoured
    if random_seed is None:
        seed = random.randint(0, 100)
    else:
        seed = random_seed

    random.seed(seed)

    indices = list(range(N))
    random.shuffle(indices)
    train_mask = indices[:int(train_size * len(indices))]
    val_mask = indices[int(train_size * len(indices)):int((train_size + val_size) * len(indices))]
    train_val_mask = indices[:int((train_size + val_size) * len(indices))]
    test_mask = indices[int((train_size + val_size) * len(indices)):]

    return train_mask, val_mask, train_val_mask, test_mask


class NpEncoder(json.JSONEncoder):
    # JSON encoder that converts numpy scalars and arrays to native Python
    # types, so masks and results can be written with json.dump
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(NpEncoder, self).default(obj)
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
"""Setup script."""
import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

if __name__ == "__main__":

    # Run setup
    setuptools.setup(
        name="bgnn",
        version="0.0.1",
        author="Sergey Ivanov",
        author_email="sergei.ivanov@skolkovotech.ru",
        description="Boosted Graph Neural Networks",
        long_description=long_description,
        long_description_content_type="text/markdown",
        url="https://github.com/nd7141/bgnn",
        packages=setuptools.find_packages(),
        classifiers=[
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3",
            "Programming Language :: Python :: 3.6",
            "Intended Audience :: Developers",
            "Intended Audience :: Science/Research",
            "Topic :: Scientific/Engineering :: Artificial Intelligence",
        ],
        python_requires='>=3.6',
    )
--------------------------------------------------------------------------------
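A minimal usage sketch (illustrative only, not a file in the repository): the helpers in `scripts/utils.py` can produce and serialize index masks the same way `scripts/run.py` does. The node count `N = 100` below is a hypothetical value.

```python
import json
import numpy as np
from bgnn.scripts.utils import get_masks, NpEncoder

N = 100  # hypothetical number of nodes in a dataset
train, val, train_val, test = get_masks(N, train_size=0.6, val_size=0.2, random_seed=42)

# NpEncoder converts numpy arrays to plain lists, so the masks round-trip through JSON
masks = {"0": {"train": np.array(train), "val": np.array(val), "test": np.array(test)}}
print(json.dumps(masks, cls=NpEncoder)[:80])
```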