├── .gitignore
├── LICENSE
├── README.md
├── adone_experiment.py
├── anomaly_insert.py
├── anomalydae_experiment.py
├── data
│   └── .gitignore
├── data_finefoods.py
├── data_finefoods_small.py
├── data_movies.py
├── data_movies_small.py
├── data_reddit.py
├── data_wikipedia.py
├── dominant_experiment.py
├── extract_movies.py
├── isoforest_experiment.py
├── models
│   ├── conv.py
│   ├── conv_sample.py
│   ├── data.py
│   ├── loss.py
│   ├── net.py
│   ├── net_sample.py
│   ├── sampler.py
│   └── score.py
├── requirements.txt
├── results
│   └── .gitignore
├── storage
│   └── .gitignore
├── train_full_experiment.py
├── train_sample_experiment.py
└── utils
    ├── seed.py
    ├── sparse_combine.py
    ├── sprand.py
    └── sum_dict.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # OS generated files #
2 | ######################
3 | .DS_Store
4 | .DS_Store?
5 |
6 | # python generated files #
7 | ##########################
8 | *.pyc
9 | */venv/
10 |
11 | # ide generated directories
12 | .cache
13 | .vscode
14 |
15 | # pytest generated files #
16 | ##########################
17 | .pytest_cache
18 |
19 | # jupyter generated files #
20 | ##########################
21 | .ipynb_checkpoints/
22 |
23 | # emacs autosave files
24 | \#*
25 | *~
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Interaction-Focused Anomaly Detection on Bipartite Node-and-Edge-Attributed Graphs
2 |
3 | This repository contains the experimental source code of the [*Interaction-Focused Anomaly Detection on Bipartite Node-and-Edge-Attributed Graphs*](https://engineering.grab.com/graph-anomaly-model) paper presented at the [International Joint Conference on Neural Networks (IJCNN) 2023](https://2023.ijcnn.org/).
4 |
5 | Authors: [Rizal Fathony](mailto:rizal.fathony@grab.com), [Jenn Ng](mailto:jenn.ng@grab.com), and [Jia Chen](mailto:jia.chen@grab.com).
6 |
7 | ## Abstract
8 |
9 | Many anomaly detection applications naturally produce datasets that can be represented as bipartite graphs (user–interaction–item graphs). These graph datasets are usually supplied with rich information on both the entities (nodes) and the interactions (edges). Unfortunately, previous graph neural network anomaly models are unable to fully capture the rich information and produce high-performing detections on these graphs, as they mostly focus on homogeneous graphs and node attributes only. To overcome the problem, we propose a new graph anomaly detection model that focuses on the rich interactions in bipartite graphs. Specifically, our model takes a bipartite node-and-edge-attributed graph and produces anomaly scores for each of its edges and then for each of its bipartite nodes. We design our model as an autoencoder-type model with a customized encoder and decoder to facilitate the compression of node features, edge features, and graph structure into node-level latent representations. The reconstruction errors of each edge and node are then leveraged to spot the anomalies. Our network architecture is scalable, enabling large real-world applications. Finally, we demonstrate that our method significantly outperforms previous anomaly detection methods in the experiments.
10 |
11 | ## Setup
12 |
13 | 1. Install the required packages using:
14 | ```
15 | pip install -r requirements.txt
16 | ```
17 | 2. Download the datasets.
18 |
19 | - `wikipedia` and `reddit`:
20 | ```
21 | wget -P data/ http://snap.stanford.edu/jodie/wikipedia.csv
22 | wget -P data/ http://snap.stanford.edu/jodie/reddit.csv
23 | ```
24 |
25 | - `finefoods`: Download from [here](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?select=Reviews.csv) to the `data` folder. Rename the file to `finefoods.csv`.
26 |
27 | - `movies`: Download from [here](https://snap.stanford.edu/data/web-Movies.html). Extract and rename the file to `movies.txt`. Generate the `.csv` file by running `python extract_movies.py`.
28 |
29 |
30 | ## Construct Graph Datasets
31 |
32 | We construct the graph datasets by loading the csv files and building PyG graph data, and then inject anomalies into each dataset. For each dataset, please run:
33 | - `wikipedia` dataset: `python data_wikipedia.py`
34 | - `reddit` dataset: `python data_reddit.py`
35 |
36 | - `finefoods-large` dataset: `python data_finefoods.py`
37 | - `finefoods-small` dataset: `python data_finefoods_small.py`
38 | - `movies-large` dataset: `python data_movies.py`
39 | - `movies-small` dataset: `python data_movies_small.py`
40 |
41 | `Note`: for `finefoods` and `movies`, we use `sentence-transformer` to generate features from the review text (see the sketch below). Running the graph construction on a machine with GPU support is recommended. The `finefoods` and `movies` datasets are also quite large, so a machine with a large amount of memory is required (60GB or 120GB).
42 |
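The feature-generation step inside `data_finefoods.py` and `data_movies.py` essentially boils down to the following (a minimal sketch; `review_texts` stands in for the review text column of the csv):

```
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces one 384-dimensional embedding per review text
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(review_texts)
```
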
43 | The script will convert the csv files into PyG graph format and construct multiple copies of the graph (controlled by the `--n-graph` argument of each script) by injecting random anomalies via `anomaly_insert.py`. Each graph instance will have a different set of anomalies.
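
After construction, each stored graph instance can be inspected with the `load_graph` helper used by the experiment scripts (a minimal sketch, assuming the dataset has already been built into the default `storage/` folder):

```
from data_finefoods import load_graph

# instance 0 of the anomaly-injected wikipedia dataset
data = load_graph("wikipedia_anomaly", "graph_anomaly_list", 0)
print(data.xu.shape, data.xv.shape, data.xe.shape)  # node and edge features
print(data.yu.sum(), data.yv.sum(), data.ye.sum())  # number of injected anomalies
```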
44 |
45 | ## Run Experiment
46 |
47 | To run the experiments, please execute the corresponding file for each model.
48 |
49 | 1. `GraphBEAN`:
50 | ```
51 | python train_full_experiment.py --name wikipedia_anomaly --id 0
52 | ```
53 |
54 | 1. `GraphBEAN` with neighborhood sampling:
55 | ```
56 | python train_sample_experiment.py --name wikipedia_anomaly --id 0 --batch-size 128
57 | ```
58 |
59 | 1. `IsolationForest`:
60 | ```
61 | python isoforest_experiment.py --name wikipedia_anomaly --id 0
62 | ```
63 |
64 | 1. `DOMINANT`:
65 | ```
66 | python dominant_experiment.py --name wikipedia_anomaly --id 0
67 | ```
68 |
69 | 1. `AnomalyDAE`:
70 | ```
71 | python anomalydae_experiment.py --name wikipedia_anomaly --id 0
72 | ```
73 |
74 | 1. `AdONE`:
75 | ```
76 | python adone_experiment.py --name wikipedia_anomaly --id 0
77 | ```
78 |
79 | The argument `--name` indicates which dataset we want the model to run on, in the format `{dataset_name}_anomaly`. Additional arguments are also available depending on the model; see the combined example after the list below.
80 |
81 | - Arguments for **all** models.
82 | ```
83 | --name : dataset name
84 | --id : which instance of anomaly injected graph [0-9]
85 | ```
86 | - Arguments for `DOMINANT`, `AnomalyDAE`, `AdONE`, and `GraphBEAN`.
87 | ```
88 | --n-epoch : number of epochs in the training [default: 50]
89 | --lr : learning rate [default: 1e-2]
90 | ```
91 | - Arguments for `DOMINANT` and `AnomalyDAE`.
92 | ```
93 | --alpha : balance parameter [default: 0.8]
94 | ```
95 | - Arguments for `GraphBEAN` (full and sample training).
96 | ```
97 | --eta : structure decoder loss weight [default: 0.2]
98 | --score-agg : aggregation method for node anomaly score
99 | (max or mean) [default: max]
100 | --scheduler-milestones : milestones for learning scheduler [default: []]
101 | ```
102 | - Arguments for `GraphBEAN` (sample training).
103 | ```
104 | --batch-size : number of target nodes in one batch [default: 2048]
105 | --num-neighbors-u : number of neighbors sampled for node u [default: 10]
106 | --num-neighbors-v : number of neighbors sampled for node v [default: 10]
107 | --num-workers : number of workers in dataloader [default: 0]
108 | suggestion: set it as the number of available cores
109 | ```
110 |
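For example, a sampled-training run that combines several of the arguments above might look like this (the values are illustrative, not tuned):

```
python train_sample_experiment.py --name reddit_anomaly --id 3 \
    --n-epoch 100 --lr 1e-3 --eta 0.2 --score-agg mean \
    --batch-size 4096 --num-neighbors-u 10 --num-neighbors-v 10 --num-workers 8
```
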
111 | Running the experiments on a machine with GPU support is recommended for all models except IsolationForest.
112 |
113 | ## License
114 |
115 | This repository is licensed under the [MIT License](LICENSE).
116 |
117 | ## Citation
118 |
119 | If you use this repository for academic purposes, please cite the following paper:
120 |
121 |
122 | > R. Fathony, J. Ng and J. Chen, "Interaction-Focused Anomaly Detection on Bipartite Node-and-Edge-Attributed Graphs," 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 2023, pp. 1-10, doi: 10.1109/IJCNN54540.2023.10191331.
123 |
--------------------------------------------------------------------------------
/adone_experiment.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import sys
5 | from sklearn.metrics import roc_curve, precision_recall_curve, auc
6 | from data_finefoods import load_graph
7 |
8 | import argparse
9 | import os
10 |
11 | import torch
12 | from torch_geometric.data import Data
13 | from torch_scatter import scatter
14 |
15 | from utils.seed import seed_all
16 |
17 | # train a detector
18 | from pygod.models import AdONE
19 |
20 | # %% args
21 |
22 | parser = argparse.ArgumentParser(description="AdONE")
23 | parser.add_argument("--name", type=str, default="wikipedia_anomaly", help="name")
24 | parser.add_argument(
25 | "--key", type=str, default="graph_anomaly_list", help="key to the data"
26 | )
27 | parser.add_argument("--id", type=int, default=0, help="id to the data")
28 | parser.add_argument("--n-epoch", type=int, default=200, help="number of epoch")
29 | parser.add_argument(
30 | "--num-neighbors", type=int, default=-1, help="number of neighbors for node"
31 | )
32 | parser.add_argument("--batch-size", type=int, default=0, help="batch size")
33 | parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
34 | parser.add_argument("--gpu", type=int, default=0, help="gpu number")
35 |
36 | args1 = vars(parser.parse_args())
37 |
38 | args2 = {
39 | "seed": 0,
40 | "hidden_channels": 32,
41 | "dropout_prob": 0.0,
42 | }
43 |
44 | args = {**args1, **args2}
45 |
46 | seed_all(args["seed"])
47 |
48 | result_dir = "results/"
49 |
50 | # %% data
51 | data = load_graph(args["name"], args["key"], args["id"])
52 |
53 | u_ch = data.xu.shape[1]
54 | v_ch = data.xv.shape[1]
55 | e_ch = data.xe.shape[1]
56 |
57 | print(
58 | f"Data dimension: U node = {data.xu.shape}; V node = {data.xv.shape}; E edge = {data.xe.shape}; \n"
59 | )
60 |
61 | # %% model
62 |
63 | xu, xv = data.xu, data.xv
64 | xe, adj = data.xe, data.adj
65 | yu, yv, ye = data.yu, data.yv, data.ye
66 |
67 |
68 | # %% to homogen
69 | nu = xu.shape[0]
70 | nv = xv.shape[0]
71 | nn = nu + nv
72 |
73 | # convert to a homogeneous graph: V node indices are offset by nu, each edge is added in both directions, and node features are max/mean aggregates of incident edge features
74 | row_h = torch.cat([adj.storage.row(), adj.storage.col() + nu])
75 | col_h = torch.cat([adj.storage.col() + nu, adj.storage.row()])
76 | edge_index_h = torch.stack([row_h, col_h])
77 | xuh = torch.cat(
78 | [
79 | scatter(xe, adj.storage.row(), dim=0, reduce="max"),
80 | scatter(xe, adj.storage.row(), dim=0, reduce="mean"),
81 | ],
82 | dim=1,
83 | )
84 | xvh = torch.cat(
85 | [
86 | scatter(xe, adj.storage.col(), dim=0, reduce="max"),
87 | scatter(xe, adj.storage.col(), dim=0, reduce="mean"),
88 | ],
89 | dim=1,
90 | )
91 | xh = torch.cat([xuh, xvh], dim=0)
92 | yh = torch.cat([yu, yv], dim=0)
93 | data_h = Data(x=xh, edge_index=edge_index_h, y=yh)
94 |
95 | # %% model
96 |
97 | device = torch.device(f'cuda:{args["gpu"]}' if torch.cuda.is_available() else "cpu")
98 |
99 | model = AdONE(
100 | hid_dim=args["hidden_channels"],
101 | dropout=args["dropout_prob"],
102 | epoch=args["n_epoch"],
103 | lr=args["lr"],
104 | verbose=True,
105 | gpu=args["gpu"],
106 | batch_size=args["batch_size"],
107 | num_neigh=args["num_neighbors"],
108 | )
109 |
110 | print(args)
111 | print()
112 |
113 |
114 | def auc_eval(pred, y):
115 |
116 | rc_curve = roc_curve(y, pred)
117 | pr_curve = precision_recall_curve(y, pred)
118 | roc_auc = auc(rc_curve[0], rc_curve[1])
119 | pr_auc = auc(pr_curve[1], pr_curve[0])
120 |
121 | return roc_auc, pr_auc, rc_curve, pr_curve
122 |
123 |
124 | # %% run training
125 |
126 | model.fit(data_h, yh)
127 | score = model.decision_scores_
128 |
129 | score_u = score[:nu]
130 | score_v = score[nu:]
131 | score_e_u = score_u[adj.storage.row().numpy()]
132 | score_e_v = score_v[adj.storage.col().numpy()]
133 | score_e = (score_e_u + score_e_v) / 2
134 |
135 | u_roc_auc, u_pr_auc, u_rc_curve, u_pr_curve = auc_eval(score_u, yu.numpy())
136 | v_roc_auc, v_pr_auc, v_rc_curve, v_pr_curve = auc_eval(score_v, yv.numpy())
137 | e_roc_auc, e_pr_auc, e_rc_curve, e_pr_curve = auc_eval(score_e, ye.numpy())
138 |
139 | print(
140 | f"Eval | "
141 | + f"u auc-roc: {u_roc_auc:.4f}, v auc-roc: {v_roc_auc:.4f}, e auc-roc: {e_roc_auc:.4f} | "
142 | + f"u auc-pr {u_pr_auc:.4f}, v auc-pr {v_pr_auc:.4f}, e auc-pr {e_pr_auc:.4f}"
143 | )
144 |
145 | auc_metrics = {
146 | "u_roc_auc": u_roc_auc,
147 | "u_pr_auc": u_pr_auc,
148 | "v_roc_auc": v_roc_auc,
149 | "v_pr_auc": v_pr_auc,
150 | "e_roc_auc": e_roc_auc,
151 | "e_pr_auc": e_pr_auc,
152 | "u_roc_curve": u_rc_curve,
153 | "u_pr_curve": u_pr_curve,
154 | "v_roc_curve": v_rc_curve,
155 | "v_pr_curve": v_pr_curve,
156 | "e_roc_curve": e_rc_curve,
157 | "e_pr_curve": e_pr_curve,
158 | }
159 | anomaly_score = {"score_u": score_u, "score_v": score_v, "score_e": score_e}
160 |
161 | model_stored = {
162 | "args": args,
163 | "auc_metrics": auc_metrics,
164 | "state_dict": model.model.state_dict(),
165 | }
166 | output_stored = {"args": args, "anomaly_score": anomaly_score}
167 |
168 | print("Saving current results...")
169 | torch.save(
170 | model_stored,
171 | os.path.join(result_dir, f"adone-{args['name']}-{args['id']}-model.th"),
172 | )
173 | torch.save(
174 | output_stored,
175 | os.path.join(result_dir, f"adone-{args['name']}-{args['id']}-output.th"),
176 | )
177 |
178 |
179 | print()
180 | print(args)
181 |
--------------------------------------------------------------------------------
/anomaly_insert.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 |
6 | import numpy as np
7 | from scipy.stats import truncnorm
8 | from torch_sparse import SparseTensor
9 |
10 | from models.data import BipartiteData
11 |
12 | from typing import Tuple, Union
13 |
14 | # %% features outliers
15 |
16 | # features outside confidence interval
17 | def outside_cofidence_interval(
18 | x: torch.Tensor, prop_sample=0.1, prop_feat=0.3, std_cutoff=3.0, mu=None, sigm=None
19 | ):
20 | n, m = x.shape
21 | ns = int(np.ceil(prop_sample * n))
22 | ms = int(np.ceil(prop_feat * m))
23 |
24 | # random outlier from truncated normal
25 | left_side = truncnorm.rvs(-np.inf, -std_cutoff, size=ns * ms)
26 | right_side = truncnorm.rvs(std_cutoff, np.inf, size=ns * ms)
27 | lr_flag = np.random.randint(2, size=ns * ms)
28 | random_outliers = lr_flag * left_side + (1 - lr_flag) * right_side
29 |
30 |     # determine which samples & features are randomized
31 | feat_idx = np.random.rand(ns, m).argsort(axis=1)[:, :ms]
32 | sample_idx = np.random.choice(n, ns, replace=False)
33 | row_idx = np.tile(sample_idx[:, None], (1, ms)).flatten()
34 | col_idx = feat_idx.flatten()
35 |
36 |     # calculate mean and standard deviation
37 | xr = x.cpu().numpy()
38 | if mu is None:
39 | mu = xr.mean(axis=0)
40 | if sigm is None:
41 | sigm = xr.std(axis=0)
42 |
43 | # replace the value with outliers
44 | random_outliers = random_outliers * sigm[col_idx] + mu[col_idx]
45 | xr[(row_idx, col_idx)] = random_outliers
46 |
47 | # anomaly
48 | anomaly_label = torch.zeros(n).long()
49 | anomaly_label[sample_idx] = 1
50 |
51 | return torch.Tensor(xr), anomaly_label, row_idx, col_idx
52 |
53 |
54 | # add scaled gaussian noise
55 | def scaled_gaussian_noise(
56 | x: torch.Tensor, scale=3.0, min_dist_rel=3.0, filter=True, mu=None, sigm=None
57 | ):
58 |
59 |     # calculate mean and standard deviation
60 | if mu is None:
61 | mu = x.mean(dim=0)
62 | if sigm is None:
63 | sigm = x.std(dim=0)
64 |
65 | # noise
66 | noise = torch.randn(x.shape) * sigm * scale
67 | outlier = x + noise
68 | closest_dist = torch.cdist(outlier, x, p=1).min(dim=1)[0]
69 | if filter:
70 | anomaly_label = (closest_dist / x.shape[1] > min_dist_rel).long()
71 | # replace the value with outliers
72 | xr = anomaly_label[:, None] * outlier + (1 - anomaly_label[:, None]) * x
73 | else:
74 | anomaly_label = torch.ones(x.shape[0]).long()
75 | xr = outlier
76 |
77 | return xr, anomaly_label
78 |
79 |
80 | # %% structure outliers
81 | def dense_block(
82 | adj: SparseTensor,
83 | xe: torch.Tensor,
84 | ye=None,
85 | num_nodes: Union[int, Tuple[int, int]] = 5,
86 | num_group: int = 2,
87 | connected_prop=1.0,
88 | feature_anomaly=False,
89 | feature_anomaly_type="outside_ci",
90 | **kwargs,
91 | ):
92 |
93 | n, m = adj.sparse_sizes()
94 | ne = xe.shape[0]
95 |
96 | if isinstance(num_nodes, int):
97 | num_nodes = (num_nodes, num_nodes)
98 |
99 | row = adj.storage.row()
100 | col = adj.storage.col()
101 | ids = torch.stack([row, col])
102 |
103 | outlier_row = torch.zeros(0).long()
104 | outlier_col = torch.zeros(0).long()
105 |
106 | for i in range(num_group):
107 | rid = np.random.choice(n, num_nodes[0], replace=False)
108 | cid = np.random.choice(m, num_nodes[1], replace=False)
109 |
110 | # all nodes are connected
111 | rows_id = torch.tensor(np.tile(rid[:, None], (1, num_nodes[1])).flatten())
112 | cols_id = torch.tensor(np.tile(cid, num_nodes[0]))
113 |
114 | # partially dense connection
115 | if connected_prop < 1.0:
116 | n_connected = rows_id.shape[0]
117 | n_taken = int(np.ceil(connected_prop * n_connected))
118 | taken_id = np.random.choice(n_connected, n_taken, replace=False)
119 |
120 | rows_id = rows_id[taken_id]
121 | cols_id = cols_id[taken_id]
122 |
123 | # add to the graph
124 | outlier_row = torch.cat([outlier_row, rows_id])
125 | outlier_col = torch.cat([outlier_col, cols_id])
126 |
127 | # only unique ids
128 | outlier_ids = torch.stack([outlier_row, outlier_col]).unique(dim=1)
129 |
130 |     # find additional ids that are not in the current adj
131 | ids_all, inv, count = torch.cat([ids, outlier_ids], dim=1).unique(
132 | dim=1, return_counts=True, return_inverse=True
133 | )
134 | ids_duplicate = ids_all[:, count > 1]
135 | ids_2, count_2 = torch.cat([outlier_ids, ids_duplicate], dim=1).unique(
136 | dim=1, return_counts=True
137 | )
138 | ids_additional = ids_2[:, count_2 == 1]
139 |
140 | # anomalous label for the original
141 | label_orig = (count[inv][:ne] > 1).long()
142 |
143 | ## features
144 | n_add = ids_additional.shape[1]
145 | # random features for the new edges
146 | add_ids = np.random.choice(ne, n_add, replace=False)
147 | xe_add = xe[add_ids, :]
148 |
149 | # inject feature anomaly
150 | xe2 = xe.clone()
151 | if feature_anomaly:
152 | mu = xe.mean(dim=0).numpy()
153 | sigm = xe.std(dim=0).numpy()
154 | kwargs["mu"] = mu
155 | kwargs["sigm"] = sigm
156 |
157 | if feature_anomaly_type == "outside_ci":
158 | kwargs["prop_sample"] = 1.0
159 | xe_add = outside_cofidence_interval(xe_add, **kwargs)[0]
160 | if label_orig.sum() > 0:
161 | xe2[label_orig == 1, :] = outside_cofidence_interval(
162 | xe[label_orig == 1, :], **kwargs
163 | )[0]
164 | else:
165 | xe2 = xe
166 | elif feature_anomaly_type == "scaled_gaussian":
167 | kwargs["filter"] = False
168 | xe_add = scaled_gaussian_noise(xe_add, **kwargs)[0]
169 | if label_orig.sum() > 0:
170 | xe2[label_orig == 1, :] = scaled_gaussian_noise(
171 | xe[label_orig == 1, :], **kwargs
172 | )[0]
173 | else:
174 | xe2 = xe
175 |
176 | # combine with the previous label if given
177 | ye2 = label_orig if ye is None else torch.logical_or(ye, label_orig).long()
178 |
179 | # attach xe and label to value
180 | ids_cmb = torch.cat([ids, ids_additional], dim=1)
181 | xe_cmb = torch.cat([xe2, xe_add], dim=0)
182 | ye_cmb = torch.cat([ye2, torch.ones(n_add).long()])
183 | label_cmb = torch.cat([label_orig, torch.ones(n_add).long()])
184 | value_cmb = torch.cat([xe_cmb, ye_cmb[:, None], label_cmb[:, None]], dim=1)
185 |
186 | # get result
187 | adj_new = SparseTensor(row=ids_cmb[0], col=ids_cmb[1], value=value_cmb).coalesce()
188 | value_new = adj_new.storage.value()
189 | xe_new = value_new[:, :-2]
190 | ye_new = value_new[:, -2].long()
191 | label = value_new[:, -1].long()
192 | adj_new.storage._value = None
193 |
194 | return adj_new, xe_new, ye_new, label
195 |
196 |
197 | # %% graph, insert anomaly
198 |
199 |
200 | def inject_feature_anomaly(
201 | data: BipartiteData,
202 | node_anomaly=True,
203 | edge_anomaly=True,
204 | feature_anomaly_type="outside_ci",
205 | **kwargs,
206 | ):
207 |
208 | if node_anomaly:
209 | if feature_anomaly_type == "outside_ci":
210 | xu, yu2, _, _ = outside_cofidence_interval(data.xu, **kwargs)
211 | xv, yv2, _, _ = outside_cofidence_interval(data.xv, **kwargs)
212 | elif feature_anomaly_type == "scaled_gaussian":
213 | xu, yu2 = scaled_gaussian_noise(data.xu, **kwargs)
214 | xv, yv2 = scaled_gaussian_noise(data.xv, **kwargs)
215 | yu = torch.logical_or(data.yu, yu2).long() if hasattr(data, "yu") else yu2
216 | yv = torch.logical_or(data.yv, yv2).long() if hasattr(data, "yv") else yv2
217 |
218 | else:
219 | xu = data.xu
220 | xv = data.xv
221 | yu = data.yu if hasattr(data, "yu") else None
222 | yv = data.yv if hasattr(data, "yv") else None
223 |
224 | if edge_anomaly:
225 | if feature_anomaly_type == "outside_ci":
226 | xe, ye2, _, _ = outside_cofidence_interval(data.xe, **kwargs)
227 | elif feature_anomaly_type == "scaled_gaussian":
228 | xe, ye2 = scaled_gaussian_noise(data.xe, **kwargs)
229 | ye = torch.logical_or(data.ye, ye2).long() if hasattr(data, "ye") else ye2
230 | else:
231 | xe = data.xe
232 | ye = data.ye if hasattr(data, "ye") else None
233 |
234 | data_new = BipartiteData(data.adj, xu=xu, xv=xv, xe=xe, yu=yu, yv=yv, ye=ye)
235 |
236 | return data_new
237 |
238 |
239 | def inject_dense_block_anomaly(data: BipartiteData, **kwargs):
240 | kwargs["feature_anomaly"] = False
241 | ye = data.ye if hasattr(data, "ye") else None
242 | adj_new, xe_new, ye_new, label = dense_block(data.adj, data.xe, ye=ye, **kwargs)
243 |
244 | yu = torch.zeros(data.xu.shape[0]).long()
245 | yu[adj_new.storage.row()[label == 1].unique()] = 1
246 |
247 | yv = torch.zeros(data.xv.shape[0]).long()
248 | yv[adj_new.storage.col()[label == 1].unique()] = 1
249 |
250 | data_new = BipartiteData(adj_new, xu=data.xu, xv=data.xv, xe=xe_new)
251 | data_new.ye = ye_new
252 | data_new.yu = torch.logical_or(data.yu, yu).long() if hasattr(data, "yu") else yu
253 | data_new.yv = torch.logical_or(data.yv, yv).long() if hasattr(data, "yv") else yv
254 |
255 | return data_new
256 |
257 |
258 | def inject_dense_block_and_feature_anomaly(
259 | data: BipartiteData, node_feature_anomaly=False, edge_feature_anomaly=True, **kwargs
260 | ):
261 |
262 | kwargs["feature_anomaly"] = edge_feature_anomaly
263 | if "feature_anomaly_type" not in kwargs:
264 | kwargs["feature_anomaly_type"] = "outside_ci"
265 |
266 | ye = data.ye if hasattr(data, "ye") else None
267 | adj_new, xe_new, ye_new, label = dense_block(data.adj, data.xe, ye=ye, **kwargs)
268 |
269 | yu = torch.zeros(data.xu.shape[0]).long()
270 | yu[adj_new.storage.row()[label == 1].unique()] = 1
271 |
272 | yv = torch.zeros(data.xv.shape[0]).long()
273 | yv[adj_new.storage.col()[label == 1].unique()] = 1
274 |
275 | # also node feature anomaly
276 | if node_feature_anomaly:
277 |
278 | # args
279 | kw2 = {}
280 |
281 | # xu
282 | xu = data.xu
283 | mu = xu.mean(dim=0).numpy()
284 | sigm = xu.std(dim=0).numpy()
285 | kw2["mu"] = mu
286 | kw2["sigm"] = sigm
287 |
288 | if kwargs["feature_anomaly_type"] == "outside_ci":
289 | kw2["prop_sample"] = 1.0
290 | if "prop_feat" in kwargs:
291 | kw2["prop_feat"] = kwargs["prop_feat"]
292 | if "std_cutoff" in kwargs:
293 | kw2["std_cutoff"] = kwargs["std_cutoff"]
294 | xu_new = xu.clone()
295 | xu_new[yu == 1, :] = outside_cofidence_interval(xu[yu == 1, :], **kw2)[0]
296 | elif kwargs["feature_anomaly_type"] == "scaled_gaussian":
297 | kw2["filter"] = False
298 | if "scale" in kwargs:
299 | kw2["scale"] = kwargs["scale"]
300 | if "min_dist_rel" in kwargs:
301 | kw2["min_dist_rel"] = kwargs["min_dist_rel"]
302 | xu_new = xu.clone()
303 | xu_new[yu == 1, :] = scaled_gaussian_noise(xu[yu == 1, :], **kw2)[0]
304 |
305 | # xv
306 | xv = data.xv
307 | mu = xv.mean(dim=0).numpy()
308 | sigm = xv.std(dim=0).numpy()
309 | kw2["mu"] = mu
310 | kw2["sigm"] = sigm
311 |
312 | if kwargs["feature_anomaly_type"] == "outside_ci":
313 | kw2["prop_sample"] = 1.0
314 | if "prop_feat" in kwargs:
315 | kw2["prop_feat"] = kwargs["prop_feat"]
316 | if "std_cutoff" in kwargs:
317 | kw2["std_cutoff"] = kwargs["std_cutoff"]
318 | xv_new = xv.clone()
319 | xv_new[yv == 1, :] = outside_cofidence_interval(xv[yv == 1, :], **kw2)[0]
320 | elif kwargs["feature_anomaly_type"] == "scaled_gaussian":
321 | kw2["filter"] = False
322 | if "scale" in kwargs:
323 | kw2["scale"] = kwargs["scale"]
324 | if "min_dist_rel" in kwargs:
325 | kw2["min_dist_rel"] = kwargs["min_dist_rel"]
326 | xv_new = xv.clone()
327 | xv_new[yv == 1, :] = scaled_gaussian_noise(xv[yv == 1, :], **kw2)[0]
328 |
329 | # data
330 | data_new = BipartiteData(adj_new, xu=xu_new, xv=xv_new, xe=xe_new)
331 | data_new.ye = ye_new
332 | data_new.yu = (
333 | torch.logical_or(data.yu, yu).long() if hasattr(data, "yu") else yu
334 | )
335 | data_new.yv = (
336 | torch.logical_or(data.yv, yv).long() if hasattr(data, "yv") else yv
337 | )
338 |
339 | else:
340 | data_new = BipartiteData(adj_new, xu=data.xu, xv=data.xv, xe=xe_new)
341 | data_new.ye = ye_new
342 | data_new.yu = (
343 | torch.logical_or(data.yu, yu).long() if hasattr(data, "yu") else yu
344 | )
345 | data_new.yv = (
346 | torch.logical_or(data.yv, yv).long() if hasattr(data, "yv") else yv
347 | )
348 |
349 | return data_new
350 |
351 |
352 | # %% random anomaly
353 |
354 |
355 | def choose(r, choices, thresholds):
356 | i = 0
357 | cm = thresholds[i]
358 | while i < len(choices):
359 | if r <= cm + 1e-9:
360 | selected = i
361 | break
362 | else:
363 | i += 1
364 | if i < len(choices):
365 | cm += thresholds[i]
366 | else:
367 | selected = len(choices) - 1
368 | break
369 |
370 | return choices[selected]
371 |
372 |
373 | def inject_random_block_anomaly(
374 | data: BipartiteData,
375 | num_group=40,
376 | num_nodes_range=(1, 12),
377 | num_nodes_range2=None,
378 | **kwargs,
379 | ):
380 |
381 | block_anomalies = ["full_dense_block", "partial_full_dense_block"] # , 'none']
382 | feature_anomalies = ["outside_ci", "scaled_gaussian", "none"]
383 | node_edge_feat_anomalies = ["node_only", "edge_only", "node_edge"]
384 |
385 | block_anomalies_weight = [0.2, 0.8] # , 0.1]
386 | feature_anomalies_weight = [0.5, 0.4, 0.1]
387 | node_edge_feat_anomalies_weight = [0.1, 0.3, 0.6]
388 |
389 | data_new = BipartiteData(data.adj, xu=data.xu, xv=data.xv, xe=data.xe)
390 |
391 | # random anomaly
392 | for itg in range(num_group):
393 |
394 | print(f"it {itg}: ", end="")
395 |
396 | rnd = torch.rand(3)
397 | block_an = choose(rnd[0], block_anomalies, block_anomalies_weight)
398 | feature_an = choose(rnd[1], feature_anomalies, feature_anomalies_weight)
399 | node_edge_an = choose(
400 | rnd[2], node_edge_feat_anomalies, node_edge_feat_anomalies_weight
401 | )
402 | lr, rr, mr = (
403 | num_nodes_range[0],
404 | num_nodes_range[1],
405 | num_nodes_range[0] + num_nodes_range[1] / 2,
406 | )
407 | if num_nodes_range2 is not None:
408 | nn1 = int(
409 | np.minimum(
410 | np.maximum(lr, (torch.randn(1).item() * np.sqrt(mr)) + mr), rr + 1
411 | )
412 | )
413 | lr2, rr2, mr2 = (
414 | num_nodes_range2[0],
415 | num_nodes_range2[1],
416 | num_nodes_range2[0] + num_nodes_range2[1] / 2,
417 | )
418 | nn2 = int(
419 | np.minimum(
420 | np.maximum(lr2, (torch.randn(1).item() * np.sqrt(mr2)) + mr2),
421 | rr2 + 1,
422 | )
423 | )
424 | num_nodes = (nn1, nn2)
425 | else:
426 | num_nodes = int(
427 | np.minimum(
428 | np.maximum(lr, (torch.randn(1).item() * np.sqrt(mr)) + mr), rr + 1
429 | )
430 | )
431 |
432 | ## setup kwargs
433 | connected_prop = 1.0
434 | if block_an == "partial_full_dense_block":
435 | connected_prop = np.minimum(
436 | np.maximum(0.2, (torch.randn(1).item() / 4) + 0.5), 1.0
437 | )
438 |
439 | prop_feat = np.minimum(np.maximum(0.1, (torch.randn(1).item() / 8) + 0.3), 0.9)
440 | std_cutoff = np.maximum(2.0, torch.randn(1).item() + 3.0)
441 | scale = np.maximum(2.0, torch.randn(1).item() + 3.0)
442 |
443 | ## inject anomaly
444 | node_feature_anomaly = None
445 | if block_an != "none" and feature_an != "none":
446 | node_feature_anomaly = False if node_edge_an == "edge_only" else True
447 | edge_feature_anomaly = False if node_edge_an == "node_only" else True
448 |
449 | if feature_an == "outside_ci":
450 | data_new = inject_dense_block_and_feature_anomaly(
451 | data_new,
452 | node_feature_anomaly,
453 | edge_feature_anomaly,
454 | num_group=1,
455 | num_nodes=num_nodes,
456 | connected_prop=connected_prop,
457 | feature_anomaly_type="outside_ci",
458 | prop_feat=prop_feat,
459 | std_cutoff=std_cutoff,
460 | )
461 |
462 | elif feature_an == "scaled_gaussian":
463 | data_new = inject_dense_block_and_feature_anomaly(
464 | data_new,
465 | node_feature_anomaly,
466 | edge_feature_anomaly,
467 | num_group=1,
468 | num_nodes=num_nodes,
469 | connected_prop=connected_prop,
470 | feature_anomaly_type="scaled_gaussian",
471 | scale=scale,
472 | )
473 |
474 | elif block_an != "none" and feature_an == "none":
475 | data_new = inject_dense_block_anomaly(
476 | data_new,
477 | num_group=1,
478 | num_nodes=num_nodes,
479 | connected_prop=connected_prop,
480 | )
481 |
482 | elif block_an == "none" and feature_an != "none":
483 | node_anomaly = False if node_edge_an == "edge_only" else True
484 | edge_anomaly = False if node_edge_an == "node_only" else True
485 |
486 | if feature_an == "outside_ci":
487 | data_new = inject_feature_anomaly(
488 | data_new,
489 | node_anomaly,
490 | edge_anomaly,
491 | feature_anomaly_type="outside_ci",
492 | prop_feat=prop_feat,
493 | std_cutoff=std_cutoff,
494 | )
495 |
496 | elif feature_an == "scaled_gaussian":
497 | data_new = inject_feature_anomaly(
498 | data_new,
499 | node_anomaly,
500 | edge_anomaly,
501 | feature_anomaly_type="scaled_gaussian",
502 | scale=scale,
503 | )
504 |
505 | print(
506 | f"affected: yu = {data_new.yu.sum()}, yv = {data_new.yv.sum()}, ye = {data_new.ye.sum()} ",
507 | end="",
508 | )
509 | print(
510 | f"[{block_an}:{connected_prop:.2f},{feature_an},{num_nodes},{node_feature_anomaly}]"
511 | )
512 |
513 | return data_new
514 |
--------------------------------------------------------------------------------
/anomalydae_experiment.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import sys
5 |
6 | from sklearn.metrics import roc_curve, precision_recall_curve, auc
7 |
8 | from data_finefoods import load_graph
9 |
10 | import argparse
11 | import os
12 |
13 | import torch
14 | from torch_geometric.data import Data
15 | from torch_scatter import scatter
16 | from utils.seed import seed_all
17 |
18 | # train a detector
19 | from pygod.models import AnomalyDAE
20 |
21 | # %% args
22 |
23 | parser = argparse.ArgumentParser(description="AnomalyDAE")
24 | parser.add_argument("--name", type=str, default="wikipedia_anomaly", help="name")
25 | parser.add_argument(
26 | "--key", type=str, default="graph_anomaly_list", help="key to the data"
27 | )
28 | parser.add_argument("--id", type=int, default=0, help="id to the data")
29 | parser.add_argument("--n-epoch", type=int, default=200, help="number of epoch")
30 | parser.add_argument(
31 | "--num-neighbors", type=int, default=-1, help="number of neighbors for node"
32 | )
33 | parser.add_argument("--batch-size", type=int, default=0, help="batch size")
34 | parser.add_argument("--alpha", type=float, default=0.8, help="balance parameter")
35 | parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
36 | parser.add_argument("--gpu", type=int, default=0, help="gpu number")
37 |
38 | args1 = vars(parser.parse_args())
39 |
40 | args2 = {
41 | "seed": 0,
42 | "hidden_channels": 32,
43 | "dropout_prob": 0.0,
44 | }
45 |
46 | args = {**args1, **args2}
47 |
48 | seed_all(args["seed"])
49 |
50 | result_dir = "results/"
51 |
52 | # %% data
53 | data = load_graph(args["name"], args["key"], args["id"])
54 |
55 | u_ch = data.xu.shape[1]
56 | v_ch = data.xv.shape[1]
57 | e_ch = data.xe.shape[1]
58 |
59 | print(
60 | f"Data dimension: U node = {data.xu.shape}; V node = {data.xv.shape}; E edge = {data.xe.shape}; \n"
61 | )
62 |
63 | # %% model
64 |
65 | xu, xv = data.xu, data.xv
66 | xe, adj = data.xe, data.adj
67 | yu, yv, ye = data.yu, data.yv, data.ye
68 |
69 |
70 | # %% to homogen
71 | nu = xu.shape[0]
72 | nv = xv.shape[0]
73 | nn = nu + nv
74 |
75 | # convert to a homogeneous graph: V node indices are offset by nu, each edge is added in both directions, and node features are max/mean aggregates of incident edge features
76 | row_h = torch.cat([adj.storage.row(), adj.storage.col() + nu])
77 | col_h = torch.cat([adj.storage.col() + nu, adj.storage.row()])
78 | edge_index_h = torch.stack([row_h, col_h])
79 | xuh = torch.cat(
80 | [
81 | scatter(xe, adj.storage.row(), dim=0, reduce="max"),
82 | scatter(xe, adj.storage.row(), dim=0, reduce="mean"),
83 | ],
84 | dim=1,
85 | )
86 | xvh = torch.cat(
87 | [
88 | scatter(xe, adj.storage.col(), dim=0, reduce="max"),
89 | scatter(xe, adj.storage.col(), dim=0, reduce="mean"),
90 | ],
91 | dim=1,
92 | )
93 | xh = torch.cat([xuh, xvh], dim=0)
94 | yh = torch.cat([yu, yv], dim=0)
95 | data_h = Data(x=xh, edge_index=edge_index_h, y=yh)
96 |
97 | # %% model
98 |
99 | device = torch.device(f'cuda:{args["gpu"]}' if torch.cuda.is_available() else "cpu")
100 | model = AnomalyDAE(
101 | embed_dim=args["hidden_channels"],
102 | out_dim=args["hidden_channels"],
103 | dropout=args["dropout_prob"],
104 | alpha=args["alpha"],
105 | epoch=args["n_epoch"],
106 | lr=args["lr"],
107 | verbose=True,
108 | gpu=args["gpu"],
109 | batch_size=args["batch_size"],
110 | num_neigh=args["num_neighbors"],
111 | )
112 |
113 | print(args)
114 | print()
115 |
116 |
117 | def auc_eval(pred, y):
118 |
119 | rc_curve = roc_curve(y, pred)
120 | pr_curve = precision_recall_curve(y, pred)
121 | roc_auc = auc(rc_curve[0], rc_curve[1])
122 | pr_auc = auc(pr_curve[1], pr_curve[0])
123 |
124 | return roc_auc, pr_auc, rc_curve, pr_curve
125 |
126 |
127 | # %% run training
128 |
129 | model.fit(data_h, yh)
130 | score = model.decision_scores_
131 |
132 | score_u = score[:nu]
133 | score_v = score[nu:]
134 | score_e_u = score_u[adj.storage.row().numpy()]
135 | score_e_v = score_v[adj.storage.col().numpy()]
136 | score_e = (score_e_u + score_e_v) / 2
137 |
138 | u_roc_auc, u_pr_auc, u_rc_curve, u_pr_curve = auc_eval(score_u, yu.numpy())
139 | v_roc_auc, v_pr_auc, v_rc_curve, v_pr_curve = auc_eval(score_v, yv.numpy())
140 | e_roc_auc, e_pr_auc, e_rc_curve, e_pr_curve = auc_eval(score_e, ye.numpy())
141 |
142 | print(
143 | f"Eval | "
144 | + f"u auc-roc: {u_roc_auc:.4f}, v auc-roc: {v_roc_auc:.4f}, e auc-roc: {e_roc_auc:.4f} | "
145 | + f"u auc-pr {u_pr_auc:.4f}, v auc-pr {v_pr_auc:.4f}, e auc-pr {e_pr_auc:.4f}"
146 | )
147 |
148 | auc_metrics = {
149 | "u_roc_auc": u_roc_auc,
150 | "u_pr_auc": u_pr_auc,
151 | "v_roc_auc": v_roc_auc,
152 | "v_pr_auc": v_pr_auc,
153 | "e_roc_auc": e_roc_auc,
154 | "e_pr_auc": e_pr_auc,
155 | "u_roc_curve": u_rc_curve,
156 | "u_pr_curve": u_pr_curve,
157 | "v_roc_curve": v_rc_curve,
158 | "v_pr_curve": v_pr_curve,
159 | "e_roc_curve": e_rc_curve,
160 | "e_pr_curve": e_pr_curve,
161 | }
162 | anomaly_score = {"score_u": score_u, "score_v": score_v, "score_e": score_e}
163 |
164 | model_stored = {
165 | "args": args,
166 | "auc_metrics": auc_metrics,
167 | "state_dict": model.model.state_dict(),
168 | }
169 | output_stored = {"args": args, "anomaly_score": anomaly_score}
170 |
171 | print("Saving current results...")
172 | torch.save(
173 | model_stored,
174 | os.path.join(
175 | result_dir,
176 | f"anomalydae-{args['name']}-{args['id']}-alpha-{args['alpha']}-model.th",
177 | ),
178 | )
179 | torch.save(
180 | output_stored,
181 | os.path.join(
182 | result_dir,
183 | f"anomalydae-{args['name']}-{args['id']}-alpha-{args['alpha']}-output.th",
184 | ),
185 | )
186 |
187 |
188 | print()
189 | print(args)
190 |
--------------------------------------------------------------------------------
/data/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
--------------------------------------------------------------------------------
/data_finefoods.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_sparse.tensor import SparseTensor
6 |
7 | import numpy as np
8 | from anomaly_insert import inject_random_block_anomaly
9 |
10 | from models.data import BipartiteData
11 |
12 | import torch
13 | from sklearn import preprocessing
14 |
15 | import pandas as pd
16 |
17 | from sentence_transformers import SentenceTransformer
18 |
19 | # %%
20 |
21 |
22 | def standardize(features: np.ndarray) -> np.ndarray:
23 | scaler = preprocessing.StandardScaler()
24 | z = scaler.fit_transform(features)
25 | return z
26 |
27 |
28 | def prepare_data():
29 | model = SentenceTransformer("all-MiniLM-L6-v2")
30 | df = pd.read_csv(f"data/finefoods.csv")
31 |
32 | df["SummaryCharLen"] = df["Summary"].astype("str").apply(len)
33 | df["TextCharLen"] = df["Text"].astype("str").apply(len)
34 | df["Helpfulness"] = (
35 | df["HelpfulnessNumerator"] / df["HelpfulnessDenominator"]
36 | ).fillna(0)
37 |
38 | df = df.iloc[:, 1:].sort_values(["ProductId", "UserId", "Time"])
39 | dfu = df.groupby(["ProductId", "UserId"], as_index=False).last()
40 |
41 | df_product = dfu.groupby("ProductId", as_index=False).agg(
42 | user_count=("UserId", "count"),
43 | helpful_num_mean=("HelpfulnessNumerator", "mean"),
44 | helpful_num_sum=("HelpfulnessNumerator", "sum"),
45 | helpful_mean=("Helpfulness", "mean"),
46 | helpful_sum=("Helpfulness", "sum"),
47 | score_mean=("Score", "mean"),
48 | score_sum=("Score", "sum"),
49 | summary_len_mean=("SummaryCharLen", "mean"),
50 | summary_len_sum=("SummaryCharLen", "sum"),
51 | text_len_mean=("TextCharLen", "mean"),
52 | text_len_sum=("TextCharLen", "sum"),
53 | )
54 |
55 | df_user = dfu.groupby("UserId", as_index=False).agg(
56 | product_count=("ProductId", "count"),
57 | helpful_num_mean=("HelpfulnessNumerator", "mean"),
58 | helpful_num_sum=("HelpfulnessNumerator", "sum"),
59 | helpful_mean=("Helpfulness", "mean"),
60 | helpful_sum=("Helpfulness", "sum"),
61 | score_mean=("Score", "mean"),
62 | score_sum=("Score", "sum"),
63 | summary_len_mean=("SummaryCharLen", "mean"),
64 | summary_len_sum=("SummaryCharLen", "sum"),
65 | text_len_mean=("TextCharLen", "mean"),
66 | text_len_sum=("TextCharLen", "sum"),
67 | )
68 |
69 | df_user.to_csv(f"data/finefoods-user.csv")
70 | df_product.to_csv(f"data/finefoods-product.csv")
71 |
72 | sentences = dfu["Text"].astype("str").to_numpy()
73 | embeddings = model.encode(sentences)
74 | cols = [f"v{i}" for i in range(embeddings.shape[1])]
75 | df_review = pd.concat(
76 | [dfu[["ProductId", "UserId"]], pd.DataFrame(embeddings, columns=cols)], axis=1
77 | )
78 |
79 | df_review.to_csv(f"data/finefoods-review.csv")
80 |
81 |
82 | def create_graph():
83 |
84 | df_user = pd.read_csv("data/finefoods-user.csv")
85 | df_product = pd.read_csv("data/finefoods-product.csv")
86 | df_review = pd.read_csv("data/finefoods-review.csv")
87 |
88 | df_user["uid"] = df_user.index
89 | df_product["pid"] = df_product.index
90 |
91 | df_user_id = df_user[["UserId", "uid"]]
92 | df_product_id = df_product[["ProductId", "pid"]]
93 |
94 | df_review_2 = df_review.merge(
95 | df_user_id,
96 | on="UserId",
97 | ).merge(df_product_id, on="ProductId")
98 | df_review_2 = df_review_2.sort_values(["uid", "pid"])
99 |
100 | uid = torch.tensor(df_review_2["uid"].to_numpy())
101 | pid = torch.tensor(df_review_2["pid"].to_numpy())
102 |
103 | adj = SparseTensor(row=uid, col=pid)
104 | edge_attr = torch.tensor(standardize(df_review_2.iloc[:, 3:-2].to_numpy())).float()
105 |
106 | user_attr = torch.tensor(standardize(df_user.iloc[:, 2:-1].to_numpy())).float()
107 | product_attr = torch.tensor(
108 | standardize(df_product.iloc[:, 2:-1].to_numpy())
109 | ).float()
110 |
111 | data = BipartiteData(adj, xu=user_attr, xv=product_attr, xe=edge_attr)
112 |
113 | return data
114 |
115 |
116 | def store_graph(name: str, dataset):
117 | torch.save(dataset, f"storage/{name}.pt")
118 |
119 |
120 | def load_graph(name: str, key: str, id=None):
121 |     if id is None:
122 | data = torch.load(f"storage/{name}.pt")
123 | return data[key]
124 | else:
125 | data = torch.load(f"storage/{name}.pt")
126 | return data[key][id]
127 |
128 |
129 | def synth_random():
130 |     # generate and store data
131 | import argparse
132 |
133 | parser = argparse.ArgumentParser(description="GraphBEAN")
134 | parser.add_argument("--name", type=str, default="finefoods_anomaly", help="name")
135 | parser.add_argument("--n-graph", type=int, default=5, help="n graph")
136 |
137 | args = vars(parser.parse_args())
138 |
139 | prepare_data()
140 | graph = create_graph()
141 | store_graph("finefoods-graph", graph)
142 | # graph = torch.load(f'storage/finefoods-graph.pt')
143 |
144 | graph_anomaly_list = []
145 | for i in range(args["n_graph"]):
146 | print(f"GRAPH ANOMALY {i} >>>>>>>>>>>>>>")
147 | graph_multi_dense = inject_random_block_anomaly(
148 | graph, num_group=100, num_nodes_range=(1, 20)
149 | )
150 | graph_anomaly_list.append(graph_multi_dense)
151 | print()
152 |
153 | dataset = {"args": args, "graph": graph, "graph_anomaly_list": graph_anomaly_list}
154 |
155 | store_graph(args["name"], dataset)
156 |
157 |
158 | if __name__ == "__main__":
159 | synth_random()
160 |
--------------------------------------------------------------------------------
/data_finefoods_small.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_sparse.tensor import SparseTensor
6 |
7 | import numpy as np
8 | from anomaly_insert import inject_random_block_anomaly
9 |
10 | from models.data import BipartiteData
11 |
12 | import torch
13 | from sklearn import preprocessing
14 |
15 | import pandas as pd
16 |
17 | # %%
18 |
19 |
20 | def standardize(features: np.ndarray) -> np.ndarray:
21 | scaler = preprocessing.StandardScaler()
22 | z = scaler.fit_transform(features)
23 | return z
24 |
25 |
26 | def sample_data():
27 | df_user = pd.read_csv(f"data/finefoods-user.csv")
28 | df_product = pd.read_csv(f"data/finefoods-product.csv")
29 | df_review = pd.read_csv(f"data/finefoods-review.csv")
30 |
31 | pc = np.log10(df_user["product_count"].to_numpy()) + 1
32 | user_weight = pc / pc.sum()
33 |
34 | uc = np.log10(df_product["user_count"].to_numpy()) + 1
35 | product_weight = uc / uc.sum()
36 |
37 | user_nums = np.random.choice(df_user.shape[0], 24000, replace=False, p=user_weight)
38 | user_ids = df_user["UserId"][user_nums]
39 |
40 | product_nums = np.random.choice(
41 | df_product.shape[0], 12000, replace=False, p=product_weight
42 | )
43 | product_ids = df_product["ProductId"][product_nums]
44 |
45 | df_review_chosen = df_review[
46 | df_review["ProductId"].isin(product_ids) & df_review["UserId"].isin(user_ids)
47 | ].iloc[:, 1:]
48 | df_user_chosen = df_user[
49 | df_user["UserId"].isin(df_review_chosen["UserId"].unique())
50 | ].iloc[:, 1:]
51 | df_product_chosen = df_product[
52 | df_product["ProductId"].isin(df_review_chosen["ProductId"].unique())
53 | ].iloc[:, 1:]
54 |
55 | df_user_chosen.to_csv(f"data/finefoods-small-user.csv")
56 | df_product_chosen.to_csv(f"data/finefoods-small-product.csv")
57 | df_review_chosen.to_csv(f"data/finefoods-small-review.csv")
58 |
59 |
60 | def create_graph():
61 |
62 | df_user = pd.read_csv("data/finefoods-small-user.csv")
63 | df_product = pd.read_csv("data/finefoods-small-product.csv")
64 | df_review = pd.read_csv("data/finefoods-small-review.csv")
65 |
66 | df_user["uid"] = df_user.index
67 | df_product["pid"] = df_product.index
68 |
69 | df_user_id = df_user[["UserId", "uid"]]
70 | df_product_id = df_product[["ProductId", "pid"]]
71 |
72 | df_review_2 = df_review.merge(
73 | df_user_id,
74 | on="UserId",
75 | ).merge(df_product_id, on="ProductId")
76 | df_review_2 = df_review_2.sort_values(["uid", "pid"])
77 |
78 | uid = torch.tensor(df_review_2["uid"].to_numpy())
79 | pid = torch.tensor(df_review_2["pid"].to_numpy())
80 |
81 | adj = SparseTensor(row=uid, col=pid)
82 | edge_attr = torch.tensor(standardize(df_review_2.iloc[:, 3:-2].to_numpy())).float()
83 |
84 | user_attr = torch.tensor(standardize(df_user.iloc[:, 2:-1].to_numpy())).float()
85 | product_attr = torch.tensor(
86 | standardize(df_product.iloc[:, 2:-1].to_numpy())
87 | ).float()
88 |
89 | data = BipartiteData(adj, xu=user_attr, xv=product_attr, xe=edge_attr)
90 |
91 | return data
92 |
93 |
94 | def store_graph(name: str, dataset):
95 | torch.save(dataset, f"storage/{name}.pt")
96 |
97 |
98 | def load_graph(name: str, key: str, id=None):
99 |     if id is None:
100 | data = torch.load(f"storage/{name}.pt")
101 | return data[key]
102 | else:
103 | data = torch.load(f"storage/{name}.pt")
104 | return data[key][id]
105 |
106 |
107 | def synth_random():
108 |     # generate and store data
109 | import argparse
110 |
111 | parser = argparse.ArgumentParser(description="GraphBEAN")
112 | parser.add_argument(
113 | "--name", type=str, default="finefoods-small_anomaly", help="name"
114 | )
115 | parser.add_argument("--n-graph", type=int, default=10, help="n graph")
116 |
117 | args = vars(parser.parse_args())
118 |
119 | sample_data()
120 | graph = create_graph()
121 | store_graph("finefoods-small-graph", graph)
122 | # graph = torch.load(f'storage/finefoods-small-graph.pt')
123 |
124 | graph_anomaly_list = []
125 | for i in range(args["n_graph"]):
126 | print(f"GRAPH ANOMALY {i} >>>>>>>>>>>>>>")
127 | graph_multi_dense = inject_random_block_anomaly(
128 | graph, num_group=20, num_nodes_range=(1, 12)
129 | )
130 | graph_anomaly_list.append(graph_multi_dense)
131 | print()
132 |
133 | dataset = {"args": args, "graph": graph, "graph_anomaly_list": graph_anomaly_list}
134 |
135 | store_graph(args["name"], dataset)
136 |
137 |
138 | if __name__ == "__main__":
139 | synth_random()
140 |
--------------------------------------------------------------------------------
/data_movies.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_sparse.tensor import SparseTensor
6 |
7 | import numpy as np
8 | from anomaly_insert import inject_random_block_anomaly
9 |
10 | from models.data import BipartiteData
11 |
12 | import torch
13 | from sklearn import preprocessing
14 |
15 | import pandas as pd
16 |
17 | from sentence_transformers import SentenceTransformer
18 |
19 |
20 | # %%
21 |
22 |
23 | def standardize(features: np.ndarray) -> np.ndarray:
24 | scaler = preprocessing.StandardScaler()
25 | z = scaler.fit_transform(features)
26 | return z
27 |
28 |
29 | def prepare_data():
30 | model = SentenceTransformer("all-MiniLM-L6-v2")
31 | df = pd.read_csv(f"data/movies.csv")
32 |
33 | df["summary_char_len"] = df["summary"].astype("str").apply(len)
34 | df["text_char_len"] = df["text"].astype("str").apply(len)
35 | df["helpfulness"] = (
36 | df["helpfulness_numerator"] / df["helpfulness_denominator"]
37 | ).fillna(0)
38 |
39 | df = df.sort_values(["product_id", "user_id", "time"])
40 | dfu = df.groupby(["product_id", "user_id"], as_index=False).last()
41 |
42 | df_product = dfu.groupby("product_id", as_index=False).agg(
43 | user_count=("user_id", "count"),
44 | helpful_num_mean=("helpfulness_numerator", "mean"),
45 | helpful_num_sum=("helpfulness_numerator", "sum"),
46 | helpful_mean=("helpfulness", "mean"),
47 | helpful_sum=("helpfulness", "sum"),
48 | score_mean=("score", "mean"),
49 | score_sum=("score", "sum"),
50 | summary_len_mean=("summary_char_len", "mean"),
51 | summary_len_sum=("summary_char_len", "sum"),
52 | text_len_mean=("text_char_len", "mean"),
53 | text_len_sum=("text_char_len", "sum"),
54 | )
55 |
56 | df_user = dfu.groupby("user_id", as_index=False).agg(
57 | product_count=("product_id", "count"),
58 | helpful_num_mean=("helpfulness_numerator", "mean"),
59 | helpful_num_sum=("helpfulness_numerator", "sum"),
60 | helpful_mean=("helpfulness", "mean"),
61 | helpful_sum=("helpfulness", "sum"),
62 | score_mean=("score", "mean"),
63 | score_sum=("score", "sum"),
64 | summary_len_mean=("summary_char_len", "mean"),
65 | summary_len_sum=("summary_char_len", "sum"),
66 | text_len_mean=("text_char_len", "mean"),
67 | text_len_sum=("text_char_len", "sum"),
68 | )
69 |
70 | df_user.to_csv(f"data/movies-user.csv")
71 | df_product.to_csv(f"data/movies-product.csv")
72 |
73 | sentences = dfu["text"].astype("str").to_numpy()
74 | embeddings = model.encode(sentences)
75 |
76 | np.save(f"data/movies-embeddings.npy", embeddings)
77 | dfu[["product_id", "user_id"]].to_csv(f"data/movies-ids.csv")
78 |
79 |
80 | def create_graph():
81 |
82 | df_user = pd.read_csv("data/movies-user.csv")
83 | df_product = pd.read_csv("data/movies-product.csv")
84 | df_review_id = pd.read_csv("data/movies-ids.csv")
85 | embeddings = np.load("data/movies-embeddings.npy")
86 |
87 | df_user["uid"] = df_user.index
88 | df_product["pid"] = df_product.index
89 |
90 | df_user_id = df_user[["user_id", "uid"]]
91 | df_product_id = df_product[["product_id", "pid"]]
92 |
93 | cols = [f"v{i}" for i in range(embeddings.shape[1])]
94 | df_review = pd.concat(
95 | [df_review_id, pd.DataFrame(embeddings, columns=cols)], axis=1
96 | )
97 |
98 | df_review_2 = df_review.merge(
99 | df_user_id,
100 | on="user_id",
101 | ).merge(df_product_id, on="product_id")
102 | df_review_2 = df_review_2.sort_values(["uid", "pid"])
103 |
104 | uid = torch.tensor(df_review_2["uid"].to_numpy())
105 | pid = torch.tensor(df_review_2["pid"].to_numpy())
106 |
107 | adj = SparseTensor(row=uid, col=pid)
108 | edge_attr = torch.tensor(standardize(df_review_2.iloc[:, 3:-2].to_numpy())).float()
109 |
110 | user_attr = torch.tensor(standardize(df_user.iloc[:, 2:-1].to_numpy())).float()
111 | product_attr = torch.tensor(
112 | standardize(df_product.iloc[:, 2:-1].to_numpy())
113 | ).float()
114 |
115 | data = BipartiteData(adj, xu=user_attr, xv=product_attr, xe=edge_attr)
116 |
117 | return data
118 |
119 |
120 | def store_graph(name: str, dataset):
121 | torch.save(dataset, f"storage/{name}.pt")
122 |
123 |
124 | def load_graph(name: str, key: str, id=None):
125 |     if id is None:
126 | data = torch.load(f"storage/{name}.pt")
127 | return data[key]
128 | else:
129 | data = torch.load(f"storage/{name}.pt")
130 | return data[key][id]
131 |
132 |
133 | def synth_random():
134 |     # generate and store data
135 | import argparse
136 |
137 | parser = argparse.ArgumentParser(description="GraphBEAN")
138 | parser.add_argument("--name", type=str, default="movies_anomaly", help="name")
139 | parser.add_argument("--n-graph", type=int, default=2, help="n graph")
140 |
141 | args = vars(parser.parse_args())
142 |
143 | prepare_data()
144 | graph = create_graph()
145 | store_graph("movies-graph", graph)
146 | # graph = torch.load(f'storage/movies-graph.pt')
147 | print(graph)
148 |
149 | graph_anomaly_list = []
150 | for i in range(args["n_graph"]):
151 | print(f"GRAPH ANOMALY {i} >>>>>>>>>>>>>>")
152 | graph_multi_dense = inject_random_block_anomaly(
153 | graph, num_group=100, num_nodes_range=(1, 20)
154 | )
155 | graph_anomaly_list.append(graph_multi_dense)
156 | print()
157 |
158 | dataset = {"args": args, "graph": graph, "graph_anomaly_list": graph_anomaly_list}
159 |
160 | store_graph(args["name"], dataset)
161 |
162 |
163 | if __name__ == "__main__":
164 | synth_random()
165 |
--------------------------------------------------------------------------------
/data_movies_small.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_sparse.tensor import SparseTensor
6 |
7 | import numpy as np
8 | from anomaly_insert import inject_random_block_anomaly
9 |
10 | from models.data import BipartiteData
11 |
12 | import torch
13 | from sklearn import preprocessing
14 |
15 | import pandas as pd
16 |
17 | # %%
18 |
19 |
20 | def standardize(features: np.ndarray) -> np.ndarray:
21 | scaler = preprocessing.StandardScaler()
22 | z = scaler.fit_transform(features)
23 | return z
24 |
25 |
26 | def sample_data():
27 | df_user = pd.read_csv(f"data/movies-user.csv")
28 | df_product = pd.read_csv(f"data/movies-product.csv")
29 | df_review = pd.read_csv(f"data/movies-review.csv")
30 |
31 | pc = np.log10(df_user["product_count"].to_numpy()) + 1
32 | user_weight = pc / pc.sum()
33 |
34 | uc = np.log10(df_product["user_count"].to_numpy()) + 1
35 | product_weight = uc / uc.sum()
36 |
37 | user_nums = np.random.choice(df_user.shape[0], 28000, replace=False, p=user_weight)
38 | user_ids = df_user["user_id"][user_nums]
39 |
40 | product_nums = np.random.choice(
41 | df_product.shape[0], 14000, replace=False, p=product_weight
42 | )
43 | product_ids = df_product["product_id"][product_nums]
44 |
45 | df_review_chosen = df_review[
46 | df_review["product_id"].isin(product_ids) & df_review["user_id"].isin(user_ids)
47 | ].iloc[:, 1:]
48 | df_user_chosen = df_user[
49 | df_user["user_id"].isin(df_review_chosen["user_id"].unique())
50 | ].iloc[:, 1:]
51 | df_product_chosen = df_product[
52 | df_product["product_id"].isin(df_review_chosen["product_id"].unique())
53 | ].iloc[:, 1:]
54 |
55 | df_user_chosen.to_csv(f"data/movies-small-user.csv")
56 | df_product_chosen.to_csv(f"data/movies-small-product.csv")
57 | df_review_chosen.to_csv(f"data/movies-small-review.csv")
58 |
59 |
60 | def create_graph():
61 |
62 | df_user = pd.read_csv("data/movies-small-user.csv")
63 | df_product = pd.read_csv("data/movies-small-product.csv")
64 | df_review = pd.read_csv("data/movies-small-review.csv")
65 |
66 | df_user["uid"] = df_user.index
67 | df_product["pid"] = df_product.index
68 |
69 | df_user_id = df_user[["user_id", "uid"]]
70 | df_product_id = df_product[["product_id", "pid"]]
71 |
72 | df_review_2 = df_review.merge(
73 | df_user_id,
74 | on="user_id",
75 | ).merge(df_product_id, on="product_id")
76 | df_review_2 = df_review_2.sort_values(["uid", "pid"])
77 |
78 | uid = torch.tensor(df_review_2["uid"].to_numpy())
79 | pid = torch.tensor(df_review_2["pid"].to_numpy())
80 |
81 | adj = SparseTensor(row=uid, col=pid)
82 | edge_attr = torch.tensor(standardize(df_review_2.iloc[:, 3:-2].to_numpy())).float()
83 |
84 | user_attr = torch.tensor(standardize(df_user.iloc[:, 2:-1].to_numpy())).float()
85 | product_attr = torch.tensor(
86 | standardize(df_product.iloc[:, 2:-1].to_numpy())
87 | ).float()
88 |
89 | data = BipartiteData(adj, xu=user_attr, xv=product_attr, xe=edge_attr)
90 |
91 | return data
92 |
93 |
94 | def store_graph(name: str, dataset):
95 | torch.save(dataset, f"storage/{name}.pt")
96 |
97 |
98 | def load_graph(name: str, key: str, id=None):
99 |     if id is None:
100 | data = torch.load(f"storage/{name}.pt")
101 | return data[key]
102 | else:
103 | data = torch.load(f"storage/{name}.pt")
104 | return data[key][id]
105 |
106 |
107 | def synth_random():
108 |     # generate and store data
109 | import argparse
110 |
111 | parser = argparse.ArgumentParser(description="GraphBEAN")
112 | parser.add_argument("--name", type=str, default="movies-small_anomaly", help="name")
113 | parser.add_argument("--n-graph", type=int, default=10, help="n graph")
114 |
115 | args = vars(parser.parse_args())
116 |
117 | sample_data()
118 | graph = create_graph()
119 | store_graph("movies-small-graph", graph)
120 | # graph = torch.load(f'storage/movies-small-graph.pt')
121 |
122 | graph_anomaly_list = []
123 | for i in range(args["n_graph"]):
124 | print(f"GRAPH ANOMALY {i} >>>>>>>>>>>>>>")
125 | graph_multi_dense = inject_random_block_anomaly(
126 | graph, num_group=20, num_nodes_range=(1, 12)
127 | )
128 | graph_anomaly_list.append(graph_multi_dense)
129 | print()
130 |
131 | dataset = {"args": args, "graph": graph, "graph_anomaly_list": graph_anomaly_list}
132 |
133 | store_graph(args["name"], dataset)
134 |
135 |
136 | if __name__ == "__main__":
137 | synth_random()
138 |
--------------------------------------------------------------------------------
/data_reddit.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_sparse.tensor import SparseTensor
6 |
7 | import numpy as np
8 | from anomaly_insert import inject_random_block_anomaly
9 |
10 | from models.data import BipartiteData
11 |
12 | import torch
13 | from sklearn import preprocessing
14 |
15 | import pandas as pd
16 |
17 | # %%
18 |
19 |
20 | def standardize(features: np.ndarray) -> np.ndarray:
21 | scaler = preprocessing.StandardScaler()
22 | z = scaler.fit_transform(features)
23 | return z
24 |
25 |
26 | def prepare_data():
27 |
28 | cols = ["user_id", "item_id", "timestamp", "state_label"] + [
29 | f"v{i+1}" for i in range(172)
30 | ]
31 | df = pd.read_csv(f"data/wikipedia.csv", skiprows=1, names=cols)
32 |
33 | # edge
34 | cols_d = {"item_id": [("n_action", "count")]}
35 | for i in range(172):
36 | cols_d[f"v{i+1}"] = [(f"v{i+1}_mean", "mean"), (f"v{i+1}_max", "max")]
37 |
38 | df_edge = df.groupby(["user_id", "item_id"]).agg(cols_d)
39 | df_edge = df_edge.droplevel(axis=1, level=0).reset_index()
40 | df_edge.to_csv(f"data/reddit-edge.csv")
41 |
42 | # user
43 | cols_d = {"item_id": [("n_item", "nunique"), ("n_action", "count")]}
44 | for i in range(172):
45 | cols_d[f"v{i+1}"] = [(f"v{i+1}_mean", "mean")]
46 |
47 | df_user = df.groupby(["user_id"]).agg(cols_d)
48 |
49 | df_user = df_user.droplevel(axis=1, level=0).reset_index()
50 | df_user.to_csv(f"data/reddit-user.csv")
51 |
52 | # item
53 | cols_d = {"user_id": [("n_user", "nunique"), ("n_action", "count")]}
54 | for i in range(172):
55 | cols_d[f"v{i+1}"] = [(f"v{i+1}_mean", "mean")]
56 |
57 | df_item = df.groupby(["item_id"]).agg(cols_d)
58 | df_item = df_item.droplevel(axis=1, level=0).reset_index()
59 | df_item.to_csv(f"data/reddit-item.csv")
60 |
61 |
62 | def create_graph():
63 |
64 | df_user = pd.read_csv("data/reddit-user.csv")
65 | df_item = pd.read_csv("data/reddit-item.csv")
66 | df_edge = pd.read_csv("data/reddit-edge.csv")
67 |
68 | df_user["uid"] = df_user.index
69 | df_item["iid"] = df_item.index
70 |
71 | df_user_id = df_user[["user_id", "uid"]]
72 | df_item_id = df_item[["item_id", "iid"]]
73 |
74 | df_edge_2 = df_edge.merge(
75 | df_user_id,
76 | on="user_id",
77 | ).merge(df_item_id, on="item_id")
78 | df_edge_2 = df_edge_2.sort_values(["uid", "iid"])
79 |
80 | uid = torch.tensor(df_edge_2["uid"].to_numpy())
81 | iid = torch.tensor(df_edge_2["iid"].to_numpy())
82 |
83 | adj = SparseTensor(row=uid, col=iid)
84 | edge_attr = torch.tensor(standardize(df_edge_2.iloc[:, 3:-2].to_numpy())).float()
85 |
86 | user_attr = torch.tensor(standardize(df_user.iloc[:, 2:-1].to_numpy())).float()
87 | product_attr = torch.tensor(standardize(df_item.iloc[:, 2:-1].to_numpy())).float()
88 |
89 | data = BipartiteData(adj, xu=user_attr, xv=product_attr, xe=edge_attr)
90 |
91 | return data
92 |
93 |
94 | def store_graph(name: str, dataset):
95 | torch.save(dataset, f"storage/{name}.pt")
96 |
97 |
98 | def load_graph(name: str, key: str, id=None):
99 |     if id is None:
100 | data = torch.load(f"storage/{name}.pt")
101 | return data[key]
102 | else:
103 | data = torch.load(f"storage/{name}.pt")
104 | return data[key][id]
105 |
106 |
107 | def synth_random():
108 |     # generate and store data
109 | import argparse
110 |
111 | parser = argparse.ArgumentParser(description="GraphBEAN")
112 | parser.add_argument("--name", type=str, default="reddit_anomaly", help="name")
113 | parser.add_argument("--n-graph", type=int, default=10, help="n graph")
114 |
115 | args = vars(parser.parse_args())
116 |
117 | prepare_data()
118 | graph = create_graph()
119 | store_graph("reddit-graph", graph)
120 | # graph = torch.load(f'storage/reddit-graph.pt')
121 |
122 | graph_anomaly_list = []
123 | for i in range(args["n_graph"]):
124 | print(f"GRAPH ANOMALY {i} >>>>>>>>>>>>>>")
125 | print(graph)
126 | graph_multi_dense = inject_random_block_anomaly(
127 | graph, num_group=30, num_nodes_range=(1, 30), num_nodes_range2=(1, 6)
128 | )
129 | graph_anomaly_list.append(graph_multi_dense)
130 | print()
131 |
132 | dataset = {"args": args, "graph": graph, "graph_anomaly_list": graph_anomaly_list}
133 |
134 | store_graph(args["name"], dataset)
135 |
136 |
137 | if __name__ == "__main__":
138 | synth_random()
139 |
--------------------------------------------------------------------------------
/data_wikipedia.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_sparse.tensor import SparseTensor
6 |
7 | import numpy as np
8 | from anomaly_insert import inject_random_block_anomaly
9 |
10 | from models.data import BipartiteData
11 |
12 | import torch
13 | from sklearn import preprocessing
14 |
15 | import pandas as pd
16 |
17 | # %%
18 |
19 |
20 | def standardize(features: np.ndarray) -> np.ndarray:
21 | scaler = preprocessing.StandardScaler()
22 | z = scaler.fit_transform(features)
23 | return z
24 |
25 |
26 | def prepare_data():
27 |
28 | cols = ["user_id", "item_id", "timestamp", "state_label"] + [
29 | f"v{i+1}" for i in range(172)
30 | ]
31 | df = pd.read_csv(f"data/wikipedia.csv", skiprows=1, names=cols)
32 |
33 | # edge
34 | cols_d = {"item_id": [("n_action", "count")]}
35 | for i in range(172):
36 | cols_d[f"v{i+1}"] = [(f"v{i+1}_mean", "mean"), (f"v{i+1}_max", "max")]
37 |
38 | df_edge = df.groupby(["user_id", "item_id"]).agg(cols_d)
39 | df_edge = df_edge.droplevel(axis=1, level=0).reset_index()
40 | df_edge.to_csv(f"data/wikipedia-edge.csv")
41 |
42 | # user
43 | cols_d = {"item_id": [("n_item", "nunique"), ("n_action", "count")]}
44 | for i in range(172):
45 | cols_d[f"v{i+1}"] = [(f"v{i+1}_mean", "mean")]
46 |
47 | df_user = df.groupby(["user_id"]).agg(cols_d)
48 |
49 | df_user = df_user.droplevel(axis=1, level=0).reset_index()
50 | df_user.to_csv(f"data/wikipedia-user.csv")
51 |
52 | # item
53 | cols_d = {"user_id": [("n_user", "nunique"), ("n_action", "count")]}
54 | for i in range(172):
55 | cols_d[f"v{i+1}"] = [(f"v{i+1}_mean", "mean")]
56 |
57 | df_item = df.groupby(["item_id"]).agg(cols_d)
58 | df_item = df_item.droplevel(axis=1, level=0).reset_index()
59 | df_item.to_csv(f"data/wikipedia-item.csv")
60 |
61 |
62 | def create_graph():
63 |
64 | df_user = pd.read_csv("data/wikipedia-user.csv")
65 | df_item = pd.read_csv("data/wikipedia-item.csv")
66 | df_edge = pd.read_csv("data/wikipedia-edge.csv")
67 |
68 | df_user["uid"] = df_user.index
69 | df_item["iid"] = df_item.index
70 |
71 | df_user_id = df_user[["user_id", "uid"]]
72 | df_item_id = df_item[["item_id", "iid"]]
73 |
74 | df_edge_2 = df_edge.merge(
75 | df_user_id,
76 | on="user_id",
77 | ).merge(df_item_id, on="item_id")
78 | df_edge_2 = df_edge_2.sort_values(["uid", "iid"])
79 |
80 | uid = torch.tensor(df_edge_2["uid"].to_numpy())
81 | iid = torch.tensor(df_edge_2["iid"].to_numpy())
82 |
83 | adj = SparseTensor(row=uid, col=iid)
84 | edge_attr = torch.tensor(standardize(df_edge_2.iloc[:, 3:-2].to_numpy())).float()
85 |
86 | user_attr = torch.tensor(standardize(df_user.iloc[:, 2:-1].to_numpy())).float()
87 | product_attr = torch.tensor(standardize(df_item.iloc[:, 2:-1].to_numpy())).float()
88 |
89 | data = BipartiteData(adj, xu=user_attr, xv=product_attr, xe=edge_attr)
90 |
91 | return data
92 |
93 |
94 | def store_graph(name: str, dataset):
95 | torch.save(dataset, f"storage/{name}.pt")
96 |
97 |
98 | def load_graph(name: str, key: str, id=None):
99 |     if id is None:
100 | data = torch.load(f"storage/{name}.pt")
101 | return data[key]
102 | else:
103 | data = torch.load(f"storage/{name}.pt")
104 | return data[key][id]
105 |
106 |
107 | def synth_random():
108 |     # generate and store data
109 | import argparse
110 |
111 | parser = argparse.ArgumentParser(description="GraphBEAN")
112 | parser.add_argument("--name", type=str, default="wikipedia_anomaly", help="name")
113 | parser.add_argument("--n-graph", type=int, default=10, help="n graph")
114 |
115 | args = vars(parser.parse_args())
116 |
117 | prepare_data()
118 | graph = create_graph()
119 | store_graph("wikipedia-graph", graph)
120 | # graph = torch.load(f'storage/wikipedia-graph.pt')
121 |
122 | graph_anomaly_list = []
123 | for i in range(args["n_graph"]):
124 | print(f"GRAPH ANOMALY {i} >>>>>>>>>>>>>>")
125 | print(graph)
126 | graph_multi_dense = inject_random_block_anomaly(
127 | graph, num_group=20, num_nodes_range=(1, 20), num_nodes_range2=(1, 6)
128 | )
129 | graph_anomaly_list.append(graph_multi_dense)
130 | print()
131 |
132 | dataset = {"args": args, "graph": graph, "graph_anomaly_list": graph_anomaly_list}
133 |
134 | store_graph(args["name"], dataset)
135 |
136 |
137 | if __name__ == "__main__":
138 | synth_random()
139 |
--------------------------------------------------------------------------------
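Each data script stores its output as a plain dictionary via `torch.save`, and the experiment scripts read individual entries back through `load_graph`. A minimal usage sketch, assuming `data_wikipedia.py` has already been run so that `storage/wikipedia_anomaly.pt` exists (the keys match the dictionary built in `synth_random`):

    from data_wikipedia import load_graph

    # the clean graph stored under the "graph" key
    graph = load_graph("wikipedia_anomaly", "graph")

    # the first anomaly-injected copy from the "graph_anomaly_list" key
    graph_anomaly = load_graph("wikipedia_anomaly", "graph_anomaly_list", id=0)

    print(graph.xu.shape, graph.xv.shape, graph.xe.shape)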
/dominant_experiment.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import sys
5 |
6 | from sklearn.metrics import roc_curve, precision_recall_curve, auc
7 |
8 | from data_finefoods import load_graph
9 |
10 | import argparse
11 | import os
12 |
13 | import torch
14 | from torch_geometric.data import Data
15 | from torch_scatter import scatter
16 |
17 | from utils.seed import seed_all
18 |
19 | # train a dominant detector
20 | from pygod.models import DOMINANT
21 |
22 | # %% args
23 |
24 | parser = argparse.ArgumentParser(description="DOMINANT")
25 | parser.add_argument("--name", type=str, default="wikipedia_anomaly", help="name")
26 | parser.add_argument(
27 | "--key", type=str, default="graph_anomaly_list", help="key to the data"
28 | )
29 | parser.add_argument("--id", type=int, default=0, help="id to the data")
30 | parser.add_argument("--n-epoch", type=int, default=200, help="number of epoch")
31 | parser.add_argument(
32 | "--num-neighbors", type=int, default=-1, help="number of neighbors for node"
33 | )
34 | parser.add_argument("--batch-size", type=int, default=0, help="batch size")
35 | parser.add_argument("--alpha", type=float, default=0.8, help="balance parameter")
36 | parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
37 | parser.add_argument("--gpu", type=int, default=0, help="gpu number")
38 |
39 |
40 | args1 = vars(parser.parse_args())
41 |
42 | args2 = {
43 | "seed": 0,
44 | "hidden_channels": 32,
45 | "dropout_prob": 0.0,
46 | }
47 |
48 | args = {**args1, **args2}
49 |
50 | seed_all(args["seed"])
51 |
52 | result_dir = "results/"
53 |
54 | # %% data
55 | data = load_graph(args["name"], args["key"], args["id"])
56 |
57 | u_ch = data.xu.shape[1]
58 | v_ch = data.xv.shape[1]
59 | e_ch = data.xe.shape[1]
60 |
61 | print(
62 | f"Data dimension: U node = {data.xu.shape}; V node = {data.xv.shape}; E edge = {data.xe.shape}; \n"
63 | )
64 |
65 | # %% model
66 |
67 | xu, xv = data.xu, data.xv
68 | xe, adj = data.xe, data.adj
69 | yu, yv, ye = data.yu, data.yv, data.ye
70 |
71 |
72 | # %% convert to a homogeneous graph
73 | nu = xu.shape[0]
74 | nv = xv.shape[0]
75 | nn = nu + nv
76 |
77 | # build the homogeneous edge index (V node ids are offset by nu)
78 | row_h = torch.cat([adj.storage.row(), adj.storage.col() + nu])
79 | col_h = torch.cat([adj.storage.col() + nu, adj.storage.row()])
80 | edge_index_h = torch.stack([row_h, col_h])
81 | xuh = torch.cat(
82 | [
83 | scatter(xe, adj.storage.row(), dim=0, reduce="max"),
84 | scatter(xe, adj.storage.row(), dim=0, reduce="mean"),
85 | ],
86 | dim=1,
87 | )
88 | xvh = torch.cat(
89 | [
90 | scatter(xe, adj.storage.col(), dim=0, reduce="max"),
91 | scatter(xe, adj.storage.col(), dim=0, reduce="mean"),
92 | ],
93 | dim=1,
94 | )
95 | xh = torch.cat([xuh, xvh], dim=0)
96 | yh = torch.cat([yu, yv], dim=0)
97 | data_h = Data(x=xh, edge_index=edge_index_h, y=yh)
98 |
99 | # %% model
100 |
101 | device = torch.device(f'cuda:{args["gpu"]}' if torch.cuda.is_available() else "cpu")
102 | model = DOMINANT(
103 | hid_dim=args["hidden_channels"],
104 | num_layers=4,
105 | dropout=args["dropout_prob"],
106 | alpha=args["alpha"],
107 | epoch=args["n_epoch"],
108 | lr=args["lr"],
109 | verbose=True,
110 | gpu=args["gpu"],
111 | batch_size=args["batch_size"],
112 | num_neigh=args["num_neighbors"],
113 | )
114 |
115 | print(args)
116 | print()
117 |
118 |
119 | def auc_eval(pred, y):
120 |
121 | rc_curve = roc_curve(y, pred)
122 | pr_curve = precision_recall_curve(y, pred)
123 | roc_auc = auc(rc_curve[0], rc_curve[1])
124 | pr_auc = auc(pr_curve[1], pr_curve[0])
125 |
126 | return roc_auc, pr_auc, rc_curve, pr_curve
127 |
128 |
129 | # %% run training
130 |
131 | print("ready to run")
132 |
133 | model.fit(data_h, yh)
134 | score = model.decision_scores_
135 |
136 | score_u = score[:nu]
137 | score_v = score[nu:]
138 | score_e_u = score_u[adj.storage.row().numpy()]
139 | score_e_v = score_v[adj.storage.col().numpy()]
140 | score_e = (score_e_u + score_e_v) / 2
141 |
142 | u_roc_auc, u_pr_auc, u_rc_curve, u_pr_curve = auc_eval(score_u, yu.numpy())
143 | v_roc_auc, v_pr_auc, v_rc_curve, v_pr_curve = auc_eval(score_v, yv.numpy())
144 | e_roc_auc, e_pr_auc, e_rc_curve, e_pr_curve = auc_eval(score_e, ye.numpy())
145 |
146 | print(
147 | f"Eval | "
148 | + f"u auc-roc: {u_roc_auc:.4f}, v auc-roc: {v_roc_auc:.4f}, e auc-roc: {e_roc_auc:.4f} | "
149 | + f"u auc-pr {u_pr_auc:.4f}, v auc-pr {v_pr_auc:.4f}, e auc-pr {e_pr_auc:.4f}"
150 | )
151 |
152 | auc_metrics = {
153 | "u_roc_auc": u_roc_auc,
154 | "u_pr_auc": u_pr_auc,
155 | "v_roc_auc": v_roc_auc,
156 | "v_pr_auc": v_pr_auc,
157 | "e_roc_auc": e_roc_auc,
158 | "e_pr_auc": e_pr_auc,
159 | "u_roc_curve": u_rc_curve,
160 | "u_pr_curve": u_pr_curve,
161 | "v_roc_curve": v_rc_curve,
162 | "v_pr_curve": v_pr_curve,
163 | "e_roc_curve": e_rc_curve,
164 | "e_pr_curve": e_pr_curve,
165 | }
166 | anomaly_score = {"score_u": score_u, "score_v": score_v, "score_e": score_e}
167 |
168 | model_stored = {
169 | "args": args,
170 | "auc_metrics": auc_metrics,
171 | "state_dict": model.model.state_dict(),
172 | }
173 | output_stored = {"args": args, "anomaly_score": anomaly_score}
174 |
175 | print("Saving current results...")
176 | torch.save(
177 | model_stored,
178 | os.path.join(
179 | result_dir, f"dominant-{args['name']}-{args['id']}-alpha-{args['alpha']}-model.th"
180 | ),
181 | )
182 | torch.save(
183 | output_stored,
184 | os.path.join(
185 | result_dir,
186 | f"dominant-{args['name']}-{args['id']}-alpha-{args['alpha']}-output.th",
187 | ),
188 | )
189 |
190 |
191 | print()
192 | print(args)
193 |
--------------------------------------------------------------------------------
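The conversion above maps the bipartite graph to a homogeneous one by keeping U-node ids 0..nu-1, offsetting V-node ids by nu, and duplicating every edge in both directions so DOMINANT sees an undirected graph. A toy sketch of that index mapping (sizes are illustrative only):

    import torch
    from torch_sparse import SparseTensor

    nu, nv = 3, 2
    adj = SparseTensor(
        row=torch.tensor([0, 1, 2]), col=torch.tensor([0, 0, 1]), sparse_sizes=(nu, nv)
    )

    # each bipartite edge (u, v) appears as (u, v + nu) and (v + nu, u)
    row_h = torch.cat([adj.storage.row(), adj.storage.col() + nu])
    col_h = torch.cat([adj.storage.col() + nu, adj.storage.row()])
    print(torch.stack([row_h, col_h]))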
/extract_movies.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | with open(f"data/movies.txt", "r", errors="ignore") as infile:
5 | movies = infile.readlines()
6 |
7 | infile.close()
8 | with open(f"data/movies.csv", "w") as out:
9 | out.write(
10 | "product_id,user_id,profile_name,helpfulness_numerator,helpfulness_denominator,score,time,summary,text\n"
11 | )
12 |
13 | count = 0
14 | out_dict = {
15 | "product/productId": "",
16 | "review/userId": "",
17 | "review/profileName": "",
18 | "review/helpfulness": "",
19 | "review/score": "",
20 | "review/time": "",
21 | "review/summary": "",
22 | "review/text": "",
23 | "hnum": 0,
24 | "hden": 0,
25 | }
26 |
27 | for row in movies:
28 | if row.rstrip() != "":
29 | cells = row.split(":")
30 | if cells[0] == "product/productId":
31 | if len(cells) > 1:
32 | out_dict[cells[0]] = (
33 | cells[1]
34 | .replace(",", "")
35 | .replace("\n", "")
36 |                     .replace("\r", "")
37 | .replace("\\", "")
38 | .strip()
39 | )
40 | if count > 0:
41 | output = (
42 | f"{out_dict['product/productId']},{out_dict['review/userId']},{out_dict['review/profileName']},{out_dict['hnum']},{out_dict['hden']},"
43 | + f"{out_dict['review/score']},{out_dict['review/time']},{out_dict['review/summary']},{out_dict['review/text']}\n"
44 | )
45 | out.write(output)
46 | count += 1
47 | if count % 1000 == 0:
48 | out.flush()
49 | print(count)
50 | elif cells[0] == "review/helpfulness":
51 | if len(cells) > 1:
52 | if "/" in cells[1]:
53 | hs = cells[1].split("/")
54 | out_dict["hnum"] = int(hs[0])
55 | out_dict["hden"] = int(hs[1])
56 | else:
57 | out_dict["hnum"] = int(cells[1])
58 | out_dict["hden"] = int(cells[1])
59 | elif cells[0] == "review/text":
60 | if len(cells) > 1:
61 | out_dict[cells[0]] = (
62 | '"'
63 | + ":".join(cells[1:])
64 | .replace(",", "")
65 | .replace("\n", "")
66 |                     .replace("\r", "")
67 | .replace("\\", "")
68 | .replace('"', "")
69 | .replace("'", "")
70 | .strip()
71 | + '"'
72 | )
73 | else:
74 | if len(cells) > 1:
75 | out_dict[cells[0]] = (
76 | cells[1]
77 | .replace(",", "")
78 | .replace("\n", "")
79 |                     .replace("\r", "")
80 | .replace("\\", "")
81 | .strip()
82 | )
83 |
84 | output = (
85 | f"{out_dict['product/productId']},{out_dict['review/userId']},{out_dict['review/profileName']},{out_dict['hnum']},{out_dict['hden']},"
86 | + f"{out_dict['review/score']},{out_dict['review/time']},{out_dict['review/summary']},{out_dict['review/text']}\n"
87 | )
88 | out.write(output)
89 | print("======= ALL FINISHED ========")
90 |
--------------------------------------------------------------------------------
/isoforest_experiment.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import sys
5 |
6 | from sklearn.metrics import roc_curve, precision_recall_curve, auc
7 |
8 | from data_finefoods import load_graph
9 |
10 | import argparse
11 | import os
12 |
13 | import torch
14 | from utils.seed import seed_all
15 |
16 | from sklearn.ensemble import IsolationForest
17 |
18 | # %% args
19 |
20 | parser = argparse.ArgumentParser(description="IsolationForest")
21 | parser.add_argument("--name", type=str, default="wikipedia_anomaly", help="name")
22 | parser.add_argument(
23 | "--key", type=str, default="graph_anomaly_list", help="key to the data"
24 | )
25 | parser.add_argument("--id", type=int, default=0, help="id to the data")
26 |
27 | args1 = vars(parser.parse_args())
28 |
29 | args2 = {
30 | "seed": 0,
31 | }
32 |
33 | args = {**args1, **args2}
34 |
35 | seed_all(args["seed"])
36 |
37 | result_dir = "results/"
38 |
39 |
40 | # %% data
41 | data = load_graph(args["name"], args["key"], args["id"])
42 |
43 | u_ch = data.xu.shape[1]
44 | v_ch = data.xv.shape[1]
45 | e_ch = data.xe.shape[1]
46 |
47 | print(
48 | f"Data dimension: U node = {data.xu.shape}; V node = {data.xv.shape}; E edge = {data.xe.shape}; \n"
49 | )
50 |
51 | # %% model
52 |
53 | xu, xv = data.xu, data.xv
54 | xe, adj = data.xe, data.adj
55 | yu, yv, ye = data.yu, data.yv, data.ye
56 |
57 |
58 | def train_eval(x, y):
59 | clf = IsolationForest()
60 | clf.fit(x)
61 | score = -clf.score_samples(x)
62 |
63 | rc_curve = roc_curve(y, score)
64 | pr_curve = precision_recall_curve(y, score)
65 | roc_auc = auc(rc_curve[0], rc_curve[1])
66 | pr_auc = auc(pr_curve[1], pr_curve[0])
67 |
68 | return roc_auc, pr_auc, rc_curve, pr_curve
69 |
70 |
71 | # %% isolation forest
72 |
73 | u_roc_auc, u_pr_auc, u_rc_curve, u_pr_curve = train_eval(xu.numpy(), yu.numpy())
74 | v_roc_auc, v_pr_auc, v_rc_curve, v_pr_curve = train_eval(xv.numpy(), yv.numpy())
75 | e_roc_auc, e_pr_auc, e_rc_curve, e_pr_curve = train_eval(xe.numpy(), ye.numpy())
76 |
77 | print(args)
78 |
79 | print(
80 | f"Eval, "
81 | + f"u auc-roc: {u_roc_auc:.4f}, v auc-roc: {v_roc_auc:.4f}, e auc-roc: {e_roc_auc:.4f} | "
82 | + f"u auc-pr {u_pr_auc:.4f}, v auc-pr {v_pr_auc:.4f}, e auc-pr {e_pr_auc:.4f}"
83 | )
84 |
85 |
86 | auc_metrics = {
87 | "u_roc_auc": u_roc_auc,
88 | "u_pr_auc": u_pr_auc,
89 | "v_roc_auc": v_roc_auc,
90 | "v_pr_auc": v_pr_auc,
91 | "e_roc_auc": e_roc_auc,
92 | "e_pr_auc": e_pr_auc,
93 | "u_roc_curve": u_rc_curve,
94 | "u_pr_curve": u_pr_curve,
95 | "v_roc_curve": v_rc_curve,
96 | "v_pr_curve": v_pr_curve,
97 | "e_roc_curve": e_rc_curve,
98 | "e_pr_curve": e_pr_curve,
99 | }
100 |
101 | output_stored = {
102 | "args": args,
103 | "auc_metrics": auc_metrics,
104 | }
105 |
106 | print("Saving current results...")
107 | torch.save(
108 | output_stored,
109 | os.path.join(result_dir, f"isoforest-{args['name']}-{args['id']}-output.th"),
110 | )
111 |
--------------------------------------------------------------------------------
/models/conv.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | from typing import Optional, Tuple
5 | import torch
6 |
7 | from torch import Tensor
8 | import torch.nn as nn
9 | from torch_sparse import SparseTensor, matmul
10 | from torch_scatter import scatter
11 |
12 | from torch_geometric.nn.conv import MessagePassing
13 | from torch_geometric.nn.dense.linear import Linear
14 |
15 | from torch_geometric.typing import PairTensor, OptTensor
16 |
17 |
18 | class BEANConv(MessagePassing):
19 | def __init__(
20 | self,
21 | in_channels: Tuple[int, int, Optional[int]],
22 | out_channels: Tuple[int, int, Optional[int]],
23 | node_self_loop: bool = True,
24 | normalize: bool = True,
25 | bias: bool = True,
26 | **kwargs
27 | ):
28 |
29 | super(BEANConv, self).__init__(**kwargs)
30 |
31 | self.in_channels = in_channels
32 | self.out_channels = out_channels
33 | self.node_self_loop = node_self_loop
34 | self.normalize = normalize
35 |
36 | self.input_has_edge_channel = len(in_channels) == 3
37 | self.output_has_edge_channel = len(out_channels) == 3
38 |
39 | if self.input_has_edge_channel:
40 | if self.node_self_loop:
41 | self.in_channels_u = (
42 | in_channels[0] + 2 * in_channels[1] + 2 * in_channels[2]
43 | )
44 | self.in_channels_v = (
45 | 2 * in_channels[0] + in_channels[1] + 2 * in_channels[2]
46 | )
47 | else:
48 | self.in_channels_u = 2 * in_channels[1] + 2 * in_channels[2]
49 | self.in_channels_v = 2 * in_channels[0] + 2 * in_channels[2]
50 | self.in_channels_e = in_channels[0] + in_channels[1] + in_channels[2]
51 | else:
52 | if self.node_self_loop:
53 | self.in_channels_u = in_channels[0] + 2 * in_channels[1]
54 | self.in_channels_v = 2 * in_channels[0] + in_channels[1]
55 | else:
56 | self.in_channels_u = 2 * in_channels[1]
57 | self.in_channels_v = 2 * in_channels[0]
58 | self.in_channels_e = in_channels[0] + in_channels[1]
59 |
60 | self.lin_u = Linear(self.in_channels_u, out_channels[0], bias=bias)
61 | self.lin_v = Linear(self.in_channels_v, out_channels[1], bias=bias)
62 | if self.output_has_edge_channel:
63 | self.lin_e = Linear(self.in_channels_e, out_channels[2], bias=bias)
64 |
65 | if normalize:
66 | self.bn_u = nn.BatchNorm1d(out_channels[0])
67 | self.bn_v = nn.BatchNorm1d(out_channels[1])
68 | if self.output_has_edge_channel:
69 | self.bn_e = nn.BatchNorm1d(out_channels[2])
70 |
71 | self.reset_parameters()
72 |
73 | def reset_parameters(self):
74 | self.lin_u.reset_parameters()
75 | self.lin_v.reset_parameters()
76 | if self.output_has_edge_channel:
77 | self.lin_e.reset_parameters()
78 |
79 | def forward(
80 | self, x: PairTensor, adj: SparseTensor, xe: OptTensor = None
81 | ) -> Tuple[PairTensor, Tensor]:
82 | """"""
83 |
84 | assert self.input_has_edge_channel == (xe is not None)
85 |
86 | # propagate_type: (x: PairTensor)
87 | (out_u, out_v), out_e = self.propagate(adj, x=x, xe=xe)
88 |
89 | # lin layer
90 | out_u = self.lin_u(out_u)
91 | out_v = self.lin_v(out_v)
92 | if self.output_has_edge_channel:
93 | out_e = self.lin_e(out_e)
94 |
95 | if self.normalize:
96 | out_u = self.bn_u(out_u)
97 | out_v = self.bn_v(out_v)
98 | if self.output_has_edge_channel:
99 | out_e = self.bn_e(out_e)
100 |
101 | return (out_u, out_v), out_e
102 |
103 | def message_and_aggregate(
104 | self, adj: SparseTensor, x: PairTensor, xe: OptTensor
105 | ) -> Tuple[PairTensor, Tensor]:
106 |
107 | xu, xv = x
108 | adj = adj.set_value(None, layout=None)
109 |
110 | # messages node to node
111 | msg_v2u_mean = matmul(adj, xv, reduce="mean")
112 | msg_v2u_sum = matmul(adj, xv, reduce="max")
113 |
114 | msg_u2v_mean = matmul(adj.t(), xu, reduce="mean")
115 | msg_u2v_sum = matmul(adj.t(), xu, reduce="max")
116 |
117 | # messages edge to node
118 | if xe is not None:
119 | msg_e2u_mean = scatter(xe, adj.storage.row(), dim=0, reduce="mean")
120 | msg_e2u_sum = scatter(xe, adj.storage.row(), dim=0, reduce="max")
121 |
122 | msg_e2v_mean = scatter(xe, adj.storage.col(), dim=0, reduce="mean")
123 | msg_e2v_sum = scatter(xe, adj.storage.col(), dim=0, reduce="max")
124 |
125 | # collect all msg (including self loop)
126 | msg_2e = None
127 | if xe is not None:
128 | if self.node_self_loop:
129 | msg_2u = torch.cat(
130 | (xu, msg_v2u_mean, msg_v2u_sum, msg_e2u_mean, msg_e2u_sum), dim=1
131 | )
132 | msg_2v = torch.cat(
133 | (xv, msg_u2v_mean, msg_u2v_sum, msg_e2v_mean, msg_e2v_sum), dim=1
134 | )
135 | else:
136 | msg_2u = torch.cat(
137 | (msg_v2u_mean, msg_v2u_sum, msg_e2u_mean, msg_e2u_sum), dim=1
138 | )
139 | msg_2v = torch.cat(
140 | (msg_u2v_mean, msg_u2v_sum, msg_e2v_mean, msg_e2v_sum), dim=1
141 | )
142 |
143 | if self.output_has_edge_channel:
144 | msg_2e = torch.cat(
145 | (xe, xu[adj.storage.row()], xv[adj.storage.col()]), dim=1
146 | )
147 | else:
148 | if self.node_self_loop:
149 | msg_2u = torch.cat((xu, msg_v2u_mean, msg_v2u_sum), dim=1)
150 | msg_2v = torch.cat((xv, msg_u2v_mean, msg_u2v_sum), dim=1)
151 | else:
152 | msg_2u = torch.cat((msg_v2u_mean, msg_v2u_sum), dim=1)
153 | msg_2v = torch.cat((msg_u2v_mean, msg_u2v_sum), dim=1)
154 |
155 | if self.output_has_edge_channel:
156 | msg_2e = torch.cat(
157 | (xu[adj.storage.row()], xv[adj.storage.col()]), dim=1
158 | )
159 |
160 | return (msg_2u, msg_2v), msg_2e
161 |
--------------------------------------------------------------------------------
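A minimal sketch of running a single BEANConv layer on toy inputs; the channel sizes and the tiny graph are illustrative only, and the edge-feature rows are assumed to follow the COO ordering of `adj`:

    import torch
    from torch_sparse import SparseTensor
    from models.conv import BEANConv

    xu, xv = torch.randn(4, 8), torch.randn(3, 5)   # U-node and V-node features
    row = torch.tensor([0, 0, 1, 2, 3, 3])          # U index of each edge
    col = torch.tensor([0, 1, 2, 0, 1, 2])          # V index of each edge
    adj = SparseTensor(row=row, col=col, sparse_sizes=(4, 3))
    xe = torch.randn(6, 6)                          # one feature row per edge

    conv = BEANConv(in_channels=(8, 5, 6), out_channels=(16, 16, 16))
    (hu, hv), he = conv((xu, xv), adj, xe=xe)
    print(hu.shape, hv.shape, he.shape)             # (4, 16), (3, 16), (6, 16)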
/models/conv_sample.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | from typing import List, Optional, Tuple
5 | import torch
6 |
7 | from torch import Tensor
8 | import torch.nn as nn
9 | from torch_sparse import SparseTensor, matmul
10 | from torch_scatter import scatter
11 |
12 | from torch_geometric.nn.conv import MessagePassing
13 | from torch_geometric.nn.dense.linear import Linear
14 |
15 | from torch_geometric.typing import PairTensor, OptTensor
16 |
17 | from models.sampler import BEANAdjacency
18 |
19 |
20 | class BEANConvSample(torch.nn.Module):
21 | def __init__(
22 | self,
23 | in_channels: Tuple[int, int, Optional[int]],
24 | out_channels: Tuple[int, int, Optional[int]],
25 | node_self_loop: bool = True,
26 | normalize: bool = True,
27 | bias: bool = True,
28 | **kwargs,
29 | ):
30 |
31 | super().__init__(**kwargs)
32 |
33 | self.in_channels = in_channels
34 | self.out_channels = out_channels
35 | self.node_self_loop = node_self_loop
36 | self.normalize = normalize
37 |
38 | self.input_has_edge_channel = len(in_channels) == 3
39 | self.output_has_edge_channel = len(out_channels) == 3
40 |
41 | self.v2u_conv = BEANConvNode(
42 | in_channels,
43 | out_channels[0],
44 | flow="v->u",
45 | node_self_loop=node_self_loop,
46 | normalize=normalize,
47 | bias=bias,
48 | **kwargs,
49 | )
50 |
51 | self.u2v_conv = BEANConvNode(
52 | in_channels,
53 | out_channels[1],
54 | flow="u->v",
55 | node_self_loop=node_self_loop,
56 | normalize=normalize,
57 | bias=bias,
58 | **kwargs,
59 | )
60 |
61 | if self.output_has_edge_channel:
62 | self.e_conv = BEANConvEdge(
63 | in_channels,
64 | out_channels[2],
65 | node_self_loop=node_self_loop,
66 | normalize=normalize,
67 | bias=bias,
68 | **kwargs,
69 | )
70 |
71 | def forward(
72 | self,
73 | xu: PairTensor,
74 | xv: PairTensor,
75 | adj: BEANAdjacency,
76 | xe: Optional[Tuple[Tensor, Tensor, Tensor]] = None,
77 | ) -> Tuple[Tensor, Tensor, Tensor]:
78 |
79 | # source and target
80 | xus, xut = xu
81 | xvs, xvt = xv
82 |
83 |         # xe: (edge-conv, v->u, u->v) parts; a triple of Nones when absent
84 |         if xe is None:
85 |             xe = (None, None, None)
86 |         xe_e, xe_v2u, xe_u2v = xe
87 | out_u = self.v2u_conv((xut, xvs), adj.adj_v2u.adj, xe_v2u)
88 | out_v = self.u2v_conv((xus, xvt), adj.adj_u2v.adj, xe_u2v)
89 |
90 | out_e = None
91 | if self.output_has_edge_channel:
92 | out_e = self.e_conv((xut, xvt), adj.adj_e.adj, xe_e)
93 |
94 | return out_u, out_v, out_e
95 |
96 |
97 | class BEANConvNode(MessagePassing):
98 | def __init__(
99 | self,
100 | in_channels: Tuple[int, int, Optional[int]],
101 | out_channels: int,
102 | flow: str = "v->u",
103 | node_self_loop: bool = True,
104 | normalize: bool = True,
105 | bias: bool = True,
106 | agg: List[str] = ["mean", "max"],
107 | **kwargs,
108 | ):
109 |
110 | super().__init__(**kwargs)
111 |
112 | self.in_channels = in_channels
113 | self.out_channels = out_channels
114 | self.flow = flow
115 | self.node_self_loop = node_self_loop
116 | self.normalize = normalize
117 | self.agg = agg
118 |
119 | self.input_has_edge_channel = len(in_channels) == 3
120 |
121 | n_agg = len(agg)
122 | # calculate in channels
123 | if self.input_has_edge_channel:
124 | if self.node_self_loop:
125 | if flow == "v->u":
126 | self.in_channels_all = (
127 | in_channels[0] + n_agg * in_channels[1] + n_agg * in_channels[2]
128 | )
129 | else:
130 | self.in_channels_all = (
131 | n_agg * in_channels[0] + in_channels[1] + n_agg * in_channels[2]
132 | )
133 | else:
134 | if flow == "v->u":
135 | self.in_channels_all = (
136 | n_agg * in_channels[1] + n_agg * in_channels[2]
137 | )
138 | else:
139 | self.in_channels_all = (
140 | n_agg * in_channels[0] + n_agg * in_channels[2]
141 | )
142 | else:
143 | if self.node_self_loop:
144 | if flow == "v->u":
145 | self.in_channels_all = in_channels[0] + n_agg * in_channels[1]
146 | else:
147 | self.in_channels_all = n_agg * in_channels[0] + in_channels[1]
148 | else:
149 | if flow == "v->u":
150 | self.in_channels_all = n_agg * in_channels[1]
151 | else:
152 | self.in_channels_all = n_agg * in_channels[0]
153 |
154 | self.lin = Linear(self.in_channels_all, out_channels, bias=bias)
155 |
156 | if normalize:
157 | self.bn = nn.BatchNorm1d(out_channels)
158 |
159 | self.reset_parameters()
160 |
161 | def reset_parameters(self):
162 | self.lin.reset_parameters()
163 |
164 | def forward(self, x: PairTensor, adj: SparseTensor, xe: OptTensor = None) -> Tensor:
165 | """"""
166 |
167 | assert self.input_has_edge_channel == (xe is not None)
168 |
169 | # propagate_type: (x: PairTensor)
170 | out = self.propagate(adj, x=x, xe=xe)
171 |
172 | # lin layer
173 | out = self.lin(out)
174 | if self.normalize:
175 | out = self.bn(out)
176 |
177 | return out
178 |
179 | def message_and_aggregate(
180 | self, adj: SparseTensor, x: PairTensor, xe: OptTensor
181 | ) -> Tensor:
182 |
183 | xu, xv = x
184 | adj = adj.set_value(None, layout=None)
185 |
186 | ## Node V to node U
187 | if self.flow == "v->u":
188 | # messages node to node
189 | msg_v2u_list = [matmul(adj, xv, reduce=ag) for ag in self.agg]
190 |
191 | # messages edge to node
192 | if xe is not None:
193 | msg_e2u_list = [
194 | scatter(xe, adj.storage.row(), dim=0, reduce=ag) for ag in self.agg
195 | ]
196 |
197 | # collect all msg
198 | if xe is not None:
199 | if self.node_self_loop:
200 | if xu.shape[0] != msg_e2u_list[0].shape[0]:
201 | print(
202 | f"xu: {xu.shape} | msg_v2u : {msg_v2u_list[0].shape} | msg_e2u_sum : {msg_e2u_list[0].shape}"
203 | )
204 | msg_2u = torch.cat((xu, *msg_v2u_list, *msg_e2u_list), dim=1)
205 | else:
206 | msg_2u = torch.cat((*msg_v2u_list, *msg_e2u_list), dim=1)
207 | else:
208 | if self.node_self_loop:
209 | msg_2u = torch.cat((xu, *msg_v2u_list), dim=1)
210 | else:
211 | msg_2u = torch.cat((*msg_v2u_list,), dim=1)
212 |
213 | return msg_2u
214 |
215 | ## Node U to node V
216 | else:
217 | msg_u2v_list = [matmul(adj.t(), xu, reduce=ag) for ag in self.agg]
218 |
219 | # messages edge to node
220 | if xe is not None:
221 | msg_e2v_list = [
222 | scatter(xe, adj.storage.col(), dim=0, reduce=ag) for ag in self.agg
223 | ]
224 |
225 | # collect all msg (including self loop)
226 | if xe is not None:
227 | if self.node_self_loop:
228 | msg_2v = torch.cat((xv, *msg_u2v_list, *msg_e2v_list), dim=1)
229 | else:
230 | msg_2v = torch.cat((*msg_u2v_list, *msg_e2v_list), dim=1)
231 | else:
232 | if self.node_self_loop:
233 | msg_2v = torch.cat((xv, *msg_u2v_list), dim=1)
234 | else:
235 | msg_2v = torch.cat((*msg_u2v_list,), dim=1)
236 |
237 | return msg_2v
238 |
239 |
240 | class BEANConvEdge(MessagePassing):
241 | def __init__(
242 | self,
243 | in_channels: Tuple[int, int, Optional[int]],
244 | out_channels: int,
245 | node_self_loop: bool = True,
246 | normalize: bool = True,
247 | bias: bool = True,
248 | **kwargs,
249 | ):
250 |
251 | super().__init__(**kwargs)
252 |
253 | self.in_channels = in_channels
254 | self.out_channels = out_channels
255 | self.node_self_loop = node_self_loop
256 | self.normalize = normalize
257 |
258 | self.input_has_edge_channel = len(in_channels) == 3
259 |
260 | if self.input_has_edge_channel:
261 | self.in_channels_e = in_channels[0] + in_channels[1] + in_channels[2]
262 | else:
263 | self.in_channels_e = in_channels[0] + in_channels[1]
264 |
265 | self.lin_e = Linear(self.in_channels_e, out_channels, bias=bias)
266 |
267 | if normalize:
268 | self.bn_e = nn.BatchNorm1d(out_channels)
269 |
270 | self.reset_parameters()
271 |
272 | def reset_parameters(self):
273 | self.lin_e.reset_parameters()
274 |
275 | def forward(self, x: PairTensor, adj: SparseTensor, xe: Tensor) -> Tensor:
276 | """"""
277 |
278 | # propagate_type: (x: PairTensor)
279 | out_e = self.propagate(adj, x=x, xe=xe)
280 |
281 | # lin layer
282 | out_e = self.lin_e(out_e)
283 |
284 | if self.normalize:
285 | out_e = self.bn_e(out_e)
286 |
287 | return out_e
288 |
289 | def message_and_aggregate(
290 | self, adj: SparseTensor, x: PairTensor, xe: OptTensor
291 | ) -> Tensor:
292 |
293 | xu, xv = x
294 | adj = adj.set_value(None, layout=None)
295 |
296 | # collect all msg (including self loop)
297 | if xe is not None:
298 | msg_2e = torch.cat(
299 | (xe, xu[adj.storage.row()], xv[adj.storage.col()]), dim=1
300 | )
301 | else:
302 | msg_2e = torch.cat((xu[adj.storage.row()], xv[adj.storage.col()]), dim=1)
303 |
304 | return msg_2e
305 |
--------------------------------------------------------------------------------
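The per-direction convolution can also be exercised on its own. A small sketch using the `v->u` flow, so the output lives on the U nodes (toy sizes only, same call pattern as in BEANConvSample above):

    import torch
    from torch_sparse import SparseTensor
    from models.conv_sample import BEANConvNode

    xu, xv = torch.randn(4, 8), torch.randn(3, 5)
    adj = SparseTensor(
        row=torch.tensor([0, 0, 1, 2, 3, 3]),
        col=torch.tensor([0, 1, 2, 0, 1, 2]),
        sparse_sizes=(4, 3),
    )
    xe = torch.randn(6, 6)

    conv = BEANConvNode(in_channels=(8, 5, 6), out_channels=16, flow="v->u")
    hu = conv((xu, xv), adj, xe=xe)
    print(hu.shape)   # (4, 16): aggregated from V nodes and incident edge features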
/models/data.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_geometric.data import Data
6 | from torch_geometric.typing import OptTensor
7 | from torch_sparse.tensor import SparseTensor
8 |
9 |
10 | class BipartiteData(Data):
11 | def __init__(
12 | self,
13 | adj: SparseTensor,
14 | xu: OptTensor = None,
15 | xv: OptTensor = None,
16 | xe: OptTensor = None,
17 | **kwargs
18 | ):
19 | super().__init__()
20 | self.adj = adj
21 | self.xu = xu
22 | self.xv = xv
23 | self.xe = xe
24 |
25 | for key, value in kwargs.items():
26 | setattr(self, key, value)
27 |
28 | def __inc__(self, key, value, *args, **kwargs):
29 | if key == "adj":
30 | return torch.tensor([[self.xu.size(0)], [self.xv.size(0)]])
31 | else:
32 | return super().__inc__(key, value, *args, **kwargs)
33 |
--------------------------------------------------------------------------------
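A minimal sketch of constructing a BipartiteData object by hand (toy sizes only); the adjacency is a torch_sparse SparseTensor of shape (num U nodes, num V nodes), and `xe` holds one feature row per edge in the adjacency's COO order:

    import torch
    from torch_sparse import SparseTensor
    from models.data import BipartiteData

    row = torch.tensor([0, 0, 1, 2])   # U index of each edge
    col = torch.tensor([1, 2, 0, 2])   # V index of each edge
    adj = SparseTensor(row=row, col=col, sparse_sizes=(3, 3))

    xu = torch.randn(3, 8)             # U-node features
    xv = torch.randn(3, 5)             # V-node features
    xe = torch.randn(4, 6)             # edge features

    data = BipartiteData(adj, xu=xu, xv=xv, xe=xe)
    print(data.adj.sparse_sizes(), data.xe.shape)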
/models/loss.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | from typing import Dict, Tuple
5 | import torch.nn.functional as F
6 | from torch import Tensor
7 |
8 | from torch_sparse import SparseTensor
9 |
10 |
11 | def reconstruction_loss(
12 | xu: Tensor,
13 | xv: Tensor,
14 | xe: Tensor,
15 | adj: SparseTensor,
16 | edge_pred_samples: SparseTensor,
17 | out: Dict[str, Tensor],
18 | xe_loss_weight: float = 1.0,
19 | structure_loss_weight: float = 1.0,
20 | ) -> Tuple[Tensor, Dict[str, Tensor]]:
21 | # feature mse
22 | xu_loss = F.mse_loss(xu, out["xu"])
23 | xv_loss = F.mse_loss(xv, out["xv"])
24 | xe_loss = F.mse_loss(xe, out["xe"])
25 | feature_loss = xu_loss + xv_loss + xe_loss_weight * xe_loss
26 |
27 | # structure loss
28 | edge_gt = (edge_pred_samples.storage.value() > 0).float()
29 | structure_loss = F.binary_cross_entropy(out["eprob"], edge_gt)
30 |
31 | loss = feature_loss + structure_loss_weight * structure_loss
32 |
33 | loss_component = {
34 | "xu": xu_loss,
35 | "xv": xv_loss,
36 | "xe": xe_loss,
37 | "e": structure_loss,
38 | "total": loss,
39 | }
40 |
41 | return loss, loss_component
42 |
--------------------------------------------------------------------------------
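The structure term relies on how EdgePredictionSampler (models/sampler.py) labels its candidates: existing edges carry a positive value and randomly drawn non-edges carry -1, so the ground truth is simply `value > 0`. A hand-built toy sketch of that encoding (values chosen for illustration only):

    import torch
    from torch_sparse import SparseTensor

    row = torch.tensor([0, 0, 1])
    col = torch.tensor([0, 1, 1])
    val = torch.tensor([1.0, -1.0, 1.0])   # 1 -> real edge, -1 -> sampled non-edge
    edge_pred_samples = SparseTensor(row=row, col=col, value=val, sparse_sizes=(2, 2))

    edge_gt = (edge_pred_samples.storage.value() > 0).float()
    print(edge_gt)   # tensor([1., 0., 1.])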
/models/net.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | import torch.nn as nn
6 | import torch.nn.functional as F
7 | from torch import Tensor
8 |
9 | from torch_sparse import SparseTensor
10 | from torch_geometric.nn.dense.linear import Linear
11 |
12 | from typing import Tuple, Union, Dict
13 |
14 | from models.conv import BEANConv
15 |
16 |
17 | def make_tuple(x: Union[int, Tuple[int, int, int], Tuple[int, int]], repeat: int = 3):
18 | if isinstance(x, int):
19 | if repeat == 2:
20 | return (x, x)
21 | else:
22 | return (x, x, x)
23 | else:
24 | return x
25 |
26 |
27 | def apply_relu_dropout(x: Tensor, dropout_prob: float, training: bool) -> Tensor:
28 | x = F.relu(x)
29 | if dropout_prob > 0.0:
30 | x = F.dropout(x, p=dropout_prob, training=training)
31 | return x
32 |
33 |
34 | class GraphBEAN(nn.Module):
35 | def __init__(
36 | self,
37 | in_channels: Union[int, Tuple[int, int, int]],
38 | hidden_channels: Union[int, Tuple[int, int, int]] = 32,
39 | latent_channels: Union[int, Tuple[int, int]] = 64,
40 | edge_pred_latent: int = 64,
41 | n_layers_encoder: int = 4,
42 | n_layers_decoder: int = 4,
43 | n_layers_mlp: int = 4,
44 | dropout_prob: float = 0.0,
45 | ):
46 |
47 | super().__init__()
48 |
49 | self.in_channels = make_tuple(in_channels)
50 | self.hidden_channels = make_tuple(hidden_channels)
51 | self.latent_channels = make_tuple(latent_channels, 2)
52 | self.edge_pred_latent = edge_pred_latent
53 | self.n_layers_encoder = n_layers_encoder
54 | self.n_layers_decoder = n_layers_decoder
55 | self.n_layers_mlp = n_layers_mlp
56 | self.dropout_prob = dropout_prob
57 |
58 | self.create_encoder()
59 | self.create_feature_decoder()
60 | self.create_structure_decoder()
61 |
62 | def create_encoder(self):
63 | self.encoder_convs = nn.ModuleList()
64 | for i in range(self.n_layers_encoder):
65 | if i == 0:
66 | in_channels = self.in_channels
67 | out_channels = self.hidden_channels
68 | elif i == self.n_layers_encoder - 1:
69 | in_channels = self.hidden_channels
70 | out_channels = self.latent_channels
71 | else:
72 | in_channels = self.hidden_channels
73 | out_channels = self.hidden_channels
74 |
75 | if i == self.n_layers_encoder - 1:
76 | self.encoder_convs.append(
77 | BEANConv(in_channels, out_channels, node_self_loop=False)
78 | )
79 | else:
80 | self.encoder_convs.append(
81 | BEANConv(in_channels, out_channels, node_self_loop=True)
82 | )
83 |
84 | def create_feature_decoder(self):
85 | self.decoder_convs = nn.ModuleList()
86 | for i in range(self.n_layers_decoder):
87 | if i == 0:
88 | in_channels = self.latent_channels
89 | out_channels = self.hidden_channels
90 | elif i == self.n_layers_decoder - 1:
91 | in_channels = self.hidden_channels
92 | out_channels = self.in_channels
93 | else:
94 | in_channels = self.hidden_channels
95 | out_channels = self.hidden_channels
96 |
97 | self.decoder_convs.append(BEANConv(in_channels, out_channels))
98 |
99 | def create_structure_decoder(self):
100 | self.u_mlp_layers = nn.ModuleList()
101 | self.v_mlp_layers = nn.ModuleList()
102 |
103 | for i in range(self.n_layers_mlp):
104 | if i == 0:
105 | in_channels = self.latent_channels
106 | else:
107 | in_channels = (self.edge_pred_latent, self.edge_pred_latent)
108 | out_channels = self.edge_pred_latent
109 |
110 | self.u_mlp_layers.append(Linear(in_channels[0], out_channels))
111 |
112 | self.v_mlp_layers.append(Linear(in_channels[1], out_channels))
113 |
114 | def forward(
115 | self,
116 | xu: Tensor,
117 | xv: Tensor,
118 | xe: Tensor,
119 | adj: SparseTensor,
120 | edge_pred_samples: SparseTensor,
121 | ) -> Dict[str, Tensor]:
122 |
123 | # encoder
124 | for i, conv in enumerate(self.encoder_convs):
125 | (xu, xv), xe = conv((xu, xv), adj, xe=xe)
126 | if i != self.n_layers_encoder - 1:
127 | xu = apply_relu_dropout(xu, self.dropout_prob, self.training)
128 | xv = apply_relu_dropout(xv, self.dropout_prob, self.training)
129 | xe = apply_relu_dropout(xe, self.dropout_prob, self.training)
130 |
131 | # get latent vars
132 | zu, zv = xu, xv
133 |
134 | # feature decoder
135 | for i, conv in enumerate(self.decoder_convs):
136 | (xu, xv), xe = conv((xu, xv), adj, xe=xe)
137 | if i != self.n_layers_decoder - 1:
138 | xu = apply_relu_dropout(xu, self.dropout_prob, self.training)
139 | xv = apply_relu_dropout(xv, self.dropout_prob, self.training)
140 | xe = apply_relu_dropout(xe, self.dropout_prob, self.training)
141 |
142 | # structure decoder
143 | zu2, zv2 = zu, zv
144 | for i, layer in enumerate(self.u_mlp_layers):
145 | zu2 = layer(zu2)
146 | if i != self.n_layers_mlp - 1:
147 | zu2 = apply_relu_dropout(zu2, self.dropout_prob, self.training)
148 |
149 | for i, layer in enumerate(self.v_mlp_layers):
150 | zv2 = layer(zv2)
151 | if i != self.n_layers_mlp - 1:
152 | zv2 = apply_relu_dropout(zv2, self.dropout_prob, self.training)
153 |
154 | zu2_edge = zu2[edge_pred_samples.storage.row()]
155 | zv2_edge = zv2[edge_pred_samples.storage.col()]
156 |
157 | eprob = torch.sigmoid(torch.sum(zu2_edge * zv2_edge, dim=1))
158 |
159 | # collect results
160 | result = {"xu": xu, "xv": xv, "xe": xe, "zu": zu, "zv": zv, "eprob": eprob}
161 |
162 | return result
163 |
--------------------------------------------------------------------------------
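An end-to-end sketch of one forward pass and loss evaluation with GraphBEAN, combining the pieces defined in this repository (EdgePredictionSampler and reconstruction_loss); the toy graph, feature dimensions, and hyper-parameters are illustrative only:

    import torch
    from torch_sparse import SparseTensor
    from models.net import GraphBEAN
    from models.sampler import EdgePredictionSampler
    from models.loss import reconstruction_loss

    nu, nv = 40, 30
    xu, xv = torch.randn(nu, 8), torch.randn(nv, 5)

    # deterministic toy adjacency: every U node gets 3 edges, all V nodes are covered
    row = torch.arange(nu).repeat_interleave(3)
    col = (7 * torch.arange(3 * nu)) % nv
    adj = SparseTensor(row=row, col=col, sparse_sizes=(nu, nv))
    xe = torch.randn(adj.nnz(), 6)

    model = GraphBEAN(in_channels=(8, 5, 6), hidden_channels=32, latent_channels=64)
    edge_pred_samples = EdgePredictionSampler(adj).sample()

    out = model(xu, xv, xe, adj, edge_pred_samples)
    loss, components = reconstruction_loss(xu, xv, xe, adj, edge_pred_samples, out)
    loss.backward()
    print(loss.item(), {k: v.item() for k, v in components.items()})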
/models/net_sample.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | import torch.nn as nn
6 | import torch.nn.functional as F
7 | from torch import Tensor
8 |
9 | from torch_sparse import SparseTensor
10 | from torch_geometric.nn.dense.linear import Linear
11 |
12 | from typing import List, Tuple, Union, Dict
13 |
14 | from models.conv_sample import BEANConvSample
15 | from models.sampler import BEANAdjacency, BipartiteNeighborSampler, EdgeLoader
16 | from utils.sparse_combine import xe_split3
17 |
18 | from tqdm import tqdm
19 |
20 |
21 | def make_tuple(x: Union[int, Tuple[int, int, int], Tuple[int, int]], repeat: int = 3):
22 | if isinstance(x, int):
23 | if repeat == 2:
24 | return (x, x)
25 | else:
26 | return (x, x, x)
27 | else:
28 | return x
29 |
30 |
31 | def apply_relu_dropout(x: Tensor, dropout_prob: float, training: bool) -> Tensor:
32 | x = F.relu(x)
33 | if dropout_prob > 0.0:
34 | x = F.dropout(x, p=dropout_prob, training=training)
35 | return x
36 |
37 |
38 | class GraphBEANSampled(nn.Module):
39 | def __init__(
40 | self,
41 | in_channels: Union[int, Tuple[int, int, int]],
42 | hidden_channels: Union[int, Tuple[int, int, int]] = 32,
43 | latent_channels: Union[int, Tuple[int, int]] = 64,
44 | edge_pred_latent: int = 64,
45 | n_layers_encoder: int = 4,
46 | n_layers_decoder: int = 4,
47 | n_layers_mlp: int = 4,
48 | dropout_prob: float = 0.0,
49 | ):
50 |
51 | super().__init__()
52 |
53 | self.in_channels = make_tuple(in_channels)
54 | self.hidden_channels = make_tuple(hidden_channels)
55 | self.latent_channels = make_tuple(latent_channels, 2)
56 | self.edge_pred_latent = edge_pred_latent
57 | self.n_layers_encoder = n_layers_encoder
58 | self.n_layers_decoder = n_layers_decoder
59 | self.n_layers_mlp = n_layers_mlp
60 | self.dropout_prob = dropout_prob
61 |
62 | self.create_encoder()
63 | self.create_feature_decoder()
64 | self.create_structure_decoder()
65 |
66 | def create_encoder(self):
67 | self.encoder_convs = nn.ModuleList()
68 | for i in range(self.n_layers_encoder):
69 | if i == 0:
70 | in_channels = self.in_channels
71 | out_channels = self.hidden_channels
72 | elif i == self.n_layers_encoder - 1:
73 | in_channels = self.hidden_channels
74 | out_channels = self.latent_channels
75 | else:
76 | in_channels = self.hidden_channels
77 | out_channels = self.hidden_channels
78 |
79 | if i == self.n_layers_encoder - 1:
80 | self.encoder_convs.append(
81 | BEANConvSample(in_channels, out_channels, node_self_loop=False)
82 | )
83 | else:
84 | self.encoder_convs.append(
85 | BEANConvSample(in_channels, out_channels, node_self_loop=True)
86 | )
87 |
88 | def create_feature_decoder(self):
89 | self.decoder_convs = nn.ModuleList()
90 | for i in range(self.n_layers_decoder):
91 | if i == 0:
92 | in_channels = self.latent_channels
93 | out_channels = self.hidden_channels
94 | elif i == self.n_layers_decoder - 1:
95 | in_channels = self.hidden_channels
96 | out_channels = self.in_channels
97 | else:
98 | in_channels = self.hidden_channels
99 | out_channels = self.hidden_channels
100 |
101 | self.decoder_convs.append(BEANConvSample(in_channels, out_channels))
102 |
103 | def create_structure_decoder(self):
104 | self.u_mlp_layers = nn.ModuleList()
105 | self.v_mlp_layers = nn.ModuleList()
106 |
107 | for i in range(self.n_layers_mlp):
108 | if i == 0:
109 | in_channels = self.latent_channels
110 | else:
111 | in_channels = (self.edge_pred_latent, self.edge_pred_latent)
112 | out_channels = self.edge_pred_latent
113 |
114 | self.u_mlp_layers.append(Linear(in_channels[0], out_channels))
115 |
116 | self.v_mlp_layers.append(Linear(in_channels[1], out_channels))
117 |
118 | def forward(
119 | self,
120 | xu: Tensor,
121 | xv: Tensor,
122 | xe: Tensor,
123 | bean_adjs: List[BEANAdjacency],
124 | e_flags: List[SparseTensor],
125 | edge_pred_samples: SparseTensor,
126 | ) -> Dict[str, Tensor]:
127 |
128 | assert self.n_layers_encoder + self.n_layers_decoder == len(bean_adjs)
129 |
130 | # encoder
131 | for i, conv in enumerate(self.encoder_convs):
132 | badj = bean_adjs[i]
133 | e_flag = e_flags[i]
134 |
135 | # target size
136 | n_ut = badj.adj_v2u.size[0]
137 | n_vt = badj.adj_u2v.size[1]
138 |
139 | # get xut and xvt
140 | xus, xut = xu, xu[:n_ut] # target nodes are always placed first
141 | xvs, xvt = xv, xv[:n_vt]
142 |
143 | # get xe
144 | xe_e, xe_v2u, xe_u2v = xe_split3(xe, e_flag.storage.value())
145 |
146 | # do convolution
147 | xu, xv, xe = conv(
148 | xu=(xus, xut), xv=(xvs, xvt), adj=badj, xe=(xe_e, xe_v2u, xe_u2v)
149 | )
150 |
151 | if i != self.n_layers_encoder - 1:
152 | xu = apply_relu_dropout(xu, self.dropout_prob, self.training)
153 | xv = apply_relu_dropout(xv, self.dropout_prob, self.training)
154 | xe = apply_relu_dropout(xe, self.dropout_prob, self.training)
155 |
156 | # extract latent vars (only target nodes)
157 | last_badj = bean_adjs[-1]
158 | n_u_target = last_badj.adj_v2u.size[0]
159 | n_v_target = last_badj.adj_u2v.size[1]
160 | # get latent vars
161 | zu, zv = xu[:n_u_target], xv[:n_v_target]
162 |
163 | # feature decoder
164 | for i, conv in enumerate(self.decoder_convs):
165 |
166 | badj = bean_adjs[self.n_layers_encoder + i]
167 | e_flag = e_flags[self.n_layers_encoder + i]
168 |
169 | # target size
170 | n_ut = badj.adj_v2u.size[0]
171 | n_vt = badj.adj_u2v.size[1]
172 |
173 | # get xut and xvt
174 | xus, xut = xu, xu[:n_ut] # target nodes are always placed first
175 | xvs, xvt = xv, xv[:n_vt]
176 |
177 | # get xe
178 | if xe is not None:
179 | xe_e, xe_v2u, xe_u2v = xe_split3(xe, e_flag.storage.value())
180 | else:
181 | xe_e, xe_v2u, xe_u2v = None, None, None
182 |
183 | # do convolution
184 | xu, xv, xe = conv(
185 | xu=(xus, xut), xv=(xvs, xvt), adj=badj, xe=(xe_e, xe_v2u, xe_u2v)
186 | )
187 |
188 | if i != self.n_layers_decoder - 1:
189 | xu = apply_relu_dropout(xu, self.dropout_prob, self.training)
190 | xv = apply_relu_dropout(xv, self.dropout_prob, self.training)
191 | xe = apply_relu_dropout(xe, self.dropout_prob, self.training)
192 |
193 | # structure decoder
194 | zu2, zv2 = zu, zv
195 | for i, layer in enumerate(self.u_mlp_layers):
196 | zu2 = layer(zu2)
197 | if i != self.n_layers_mlp - 1:
198 | zu2 = apply_relu_dropout(zu2, self.dropout_prob, self.training)
199 |
200 | for i, layer in enumerate(self.v_mlp_layers):
201 | zv2 = layer(zv2)
202 | if i != self.n_layers_mlp - 1:
203 | zv2 = apply_relu_dropout(zv2, self.dropout_prob, self.training)
204 |
205 | zu2_edge = zu2[edge_pred_samples.storage.row()]
206 | zv2_edge = zv2[edge_pred_samples.storage.col()]
207 |
208 | eprob = torch.sigmoid(torch.sum(zu2_edge * zv2_edge, dim=1))
209 |
210 | # collect results
211 | result = {"xu": xu, "xv": xv, "xe": xe, "zu": zu, "zv": zv, "eprob": eprob}
212 |
213 | return result
214 |
215 | def apply_conv(self, conv, dir_adj, xu_all, xv_all, xe_all, device):
216 | xu = xu_all[dir_adj.u_id].to(device)
217 | xv = xv_all[dir_adj.v_id].to(device)
218 | xe = xe_all[dir_adj.e_id].to(device) if xe_all is not None else None
219 | adj = dir_adj.adj.to(device)
220 |
221 | out = conv((xu, xv), adj, xe)
222 |
223 | return out
224 |
225 | def inference(
226 | self,
227 | xu_all: Tensor,
228 | xv_all: Tensor,
229 | xe_all: Tensor,
230 | adj_all: SparseTensor,
231 | edge_pred_samples: SparseTensor,
232 | batch_sizes: Tuple[int, int, int],
233 | device,
234 | progress_bar: bool = True,
235 | **kwargs,
236 | ) -> Dict[str, Tensor]:
237 |
238 | kwargs["shuffle"] = False
239 | u_loader = BipartiteNeighborSampler(
240 | adj_all,
241 | n_layers=1,
242 | base="u",
243 | batch_size=batch_sizes[0],
244 | n_other_node=1,
245 | num_neighbors_u=-1,
246 | num_neighbors_v=1,
247 | **kwargs,
248 | )
249 | v_loader = BipartiteNeighborSampler(
250 | adj_all,
251 | n_layers=1,
252 | base="v",
253 | batch_size=batch_sizes[1],
254 | n_other_node=1,
255 | num_neighbors_u=1,
256 | num_neighbors_v=-1,
257 | **kwargs,
258 | )
259 | e_loader = EdgeLoader(adj_all, batch_size=batch_sizes[2], **kwargs)
260 |
261 | u_mlp_loader = torch.utils.data.DataLoader(
262 | torch.arange(xu_all.shape[0]), batch_size=batch_sizes[0], **kwargs
263 | )
264 | v_mlp_loader = torch.utils.data.DataLoader(
265 | torch.arange(xv_all.shape[0]), batch_size=batch_sizes[1], **kwargs
266 | )
267 |
268 | epred_loader = torch.utils.data.DataLoader(
269 | torch.arange(edge_pred_samples.nnz()), batch_size=batch_sizes[2], **kwargs
270 | )
271 |
272 | total_iter = (
273 | (len(u_loader) + len(v_loader))
274 | * (self.n_layers_encoder + self.n_layers_decoder)
275 | + len(e_loader) * (self.n_layers_encoder + self.n_layers_decoder - 1)
276 | + (len(u_mlp_loader) + len(v_mlp_loader)) * self.n_layers_mlp
277 | + len(epred_loader)
278 | )
279 | if progress_bar:
280 | pbar = tqdm(total=total_iter, leave=False)
281 | pbar.set_description(f"Evaluation")
282 |
283 | # encoder
284 | for i, conv in enumerate(self.encoder_convs):
285 |
286 | ## next u nodes
287 | xu_list = []
288 | for _, _, adjacency, _ in u_loader:
289 | out = self.apply_conv(
290 | conv.v2u_conv, adjacency.adj_v2u, xu_all, xv_all, xe_all, device
291 | )
292 | if i != self.n_layers_encoder - 1:
293 | out = F.relu(out)
294 | xu_list.append(out.cpu())
295 | if progress_bar:
296 | pbar.update(1)
297 | xu_all_next = torch.cat(xu_list, dim=0)
298 |
299 | ## next v nodes
300 | xv_list = []
301 | for _, _, adjacency, _ in v_loader:
302 | out = self.apply_conv(
303 | conv.u2v_conv, adjacency.adj_u2v, xu_all, xv_all, xe_all, device
304 | )
305 | if i != self.n_layers_encoder - 1:
306 | out = F.relu(out)
307 | xv_list.append(out.cpu())
308 | if progress_bar:
309 | pbar.update(1)
310 | xv_all_next = torch.cat(xv_list, dim=0)
311 |
312 | ## next edge
313 | if i != self.n_layers_encoder - 1:
314 | xe_list = []
315 | for adj_e in e_loader:
316 | out = self.apply_conv(
317 | conv.e_conv, adj_e, xu_all, xv_all, xe_all, device
318 | )
319 | out = F.relu(out)
320 | xe_list.append(out.cpu())
321 | if progress_bar:
322 | pbar.update(1)
323 | xe_all_next = torch.cat(xe_list, dim=0)
324 | else:
325 | xe_all_next = None
326 |
327 | xu_all = xu_all_next
328 | xv_all = xv_all_next
329 | xe_all = xe_all_next
330 |
331 | # get latent vars
332 | zu_all, zv_all = xu_all, xv_all
333 |
334 | # feature decoder
335 | for i, conv in enumerate(self.decoder_convs):
336 |
337 | ## next u nodes
338 | xu_list = []
339 | for _, _, adjacency, _ in u_loader:
340 | out = self.apply_conv(
341 | conv.v2u_conv, adjacency.adj_v2u, xu_all, xv_all, xe_all, device
342 | )
343 | if i != self.n_layers_decoder - 1:
344 | out = F.relu(out)
345 | xu_list.append(out.cpu())
346 | if progress_bar:
347 | pbar.update(1)
348 | xu_all_next = torch.cat(xu_list, dim=0)
349 |
350 | ## next v nodes
351 | xv_list = []
352 | for _, _, adjacency, _ in v_loader:
353 | out = self.apply_conv(
354 | conv.u2v_conv, adjacency.adj_u2v, xu_all, xv_all, xe_all, device
355 | )
356 | if i != self.n_layers_decoder - 1:
357 | out = F.relu(out)
358 | xv_list.append(out.cpu())
359 | if progress_bar:
360 | pbar.update(1)
361 | xv_all_next = torch.cat(xv_list, dim=0)
362 |
363 | ## next edge
364 | xe_list = []
365 | for adj_e in e_loader:
366 | out = self.apply_conv(
367 | conv.e_conv, adj_e, xu_all, xv_all, xe_all, device
368 | )
369 | if i != self.n_layers_decoder - 1:
370 | out = F.relu(out)
371 | xe_list.append(out.cpu())
372 | if progress_bar:
373 | pbar.update(1)
374 | xe_all_next = torch.cat(xe_list, dim=0)
375 |
376 | xu_all = xu_all_next
377 | xv_all = xv_all_next
378 | xe_all = xe_all_next
379 |
380 | # structure decoder
381 | zu2_all, zv2_all = zu_all, zv_all
382 | for i, layer in enumerate(self.u_mlp_layers):
383 | zu2_list = []
384 | for batch in u_mlp_loader:
385 | out = layer(zu2_all[batch].to(device))
386 | if i != self.n_layers_mlp - 1:
387 | out = F.relu(out)
388 | zu2_list.append(out.cpu())
389 | if progress_bar:
390 | pbar.update(1)
391 | zu2_all = torch.cat(zu2_list, dim=0)
392 |
393 | for i, layer in enumerate(self.v_mlp_layers):
394 | zv2_list = []
395 | for batch in v_mlp_loader:
396 | out = layer(zv2_all[batch].to(device))
397 | if i != self.n_layers_mlp - 1:
398 | out = F.relu(out)
399 | zv2_list.append(out.cpu())
400 | if progress_bar:
401 | pbar.update(1)
402 | zv2_all = torch.cat(zv2_list, dim=0)
403 |
404 | eprob_list = []
405 | for batch in epred_loader:
406 | zu2_edge = zu2_all[edge_pred_samples.storage.row()[batch]].to(device)
407 | zv2_edge = zv2_all[edge_pred_samples.storage.col()[batch]].to(device)
408 | out = torch.sigmoid(torch.sum(zu2_edge * zv2_edge, dim=1))
409 | eprob_list.append(out.cpu())
410 | if progress_bar:
411 | pbar.update(1)
412 | eprob_all = torch.cat(eprob_list, dim=0)
413 |
414 | # collect results
415 | result = {
416 | "xu": xu_all,
417 | "xv": xv_all,
418 | "xe": xe_all,
419 | "zu": zu_all,
420 | "zv": zv_all,
421 | "eprob": eprob_all,
422 | }
423 |
424 | if progress_bar:
425 | pbar.close()
426 |
427 | return result
428 |
--------------------------------------------------------------------------------
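The inference routine in net_sample.py above evaluates the network layer by layer over the full graph: each layer is computed in mini-batches, the outputs are collected on the CPU, and the concatenated result feeds the next layer. A minimal sketch of that pattern, reduced to a single hypothetical linear layer (the layer, sizes, and loader here are illustrative stand-ins, not part of the repository):

    import torch
    import torch.nn.functional as F

    layer = torch.nn.Linear(16, 8)                  # stand-in for one conv/MLP layer
    x_all = torch.randn(10_000, 16)                 # features for all nodes, kept on CPU
    loader = torch.utils.data.DataLoader(
        torch.arange(x_all.size(0)), batch_size=2048
    )

    outs = []
    for batch in loader:
        out = F.relu(layer(x_all[batch]))           # compute one mini-batch
        outs.append(out.cpu())                      # collect on CPU to bound GPU memory
    x_all = torch.cat(outs, dim=0)                  # becomes the next layer's input

Keeping only one layer's activations on the device at a time is what lets the full-graph evaluation scale to graphs that do not fit in GPU memory.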
/models/sampler.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch import Tensor
6 |
7 | import math
8 |
9 | from torch_sparse import SparseTensor
10 | from torch_sparse.storage import SparseStorage
11 |
12 | from typing import List, NamedTuple, Optional, Tuple, Union
13 |
14 | from utils.sparse_combine import spadd
15 | from utils.sprand import sprand
16 |
17 |
18 | class EdgePredictionSampler:
19 | def __init__(
20 | self,
21 | adj: SparseTensor,
22 | n_random: Optional[int] = None,
23 | mult: Optional[float] = 2.0,
24 | ):
25 | self.adj = adj
26 |
27 | if n_random is None:
28 | n_pos = adj.nnz()
29 | n_random = int(mult * n_pos)
30 |
31 | self.adj = adj
32 | self.n_random = n_random
33 |
34 | def sample(self):
35 | rnd_samples = sprand(self.adj.sparse_sizes(), self.n_random)
36 | rnd_samples.fill_value_(-1)
37 | rnd_samples = rnd_samples.to(self.adj.device())
38 |
39 | pos_samples = self.adj.fill_value(2)
40 |
41 | samples = spadd(rnd_samples, pos_samples)
42 | samples.set_value_(
43 | torch.minimum(
44 | samples.storage.value(), torch.ones_like(samples.storage.value())
45 | ),
46 | layout="coo",
47 | )
48 |
49 | return samples
50 |
51 |
52 | ### REGION Neighbor Sampling
53 | def sample_v_given_u(
54 | adj: SparseTensor,
55 | u_indices: Tensor,
56 | prev_v: Tensor,
57 | num_neighbors: int,
58 | replace=False,
59 | ) -> Tuple[SparseTensor, Tensor]:
60 |
61 | # to homogeneous adjacency
62 | nu, nv = adj.sparse_sizes()
63 | adj_h = SparseTensor(
64 | row=adj.storage.row(),
65 | col=adj.storage.col() + nu,
66 | value=adj.storage.value(),
67 | sparse_sizes=(nu + nv, nu + nv),
68 | )
69 |
70 | res_adj_h, res_id = adj_h.sample_adj(
71 | torch.cat([u_indices, prev_v + nu]),
72 | num_neighbors=num_neighbors,
73 | replace=replace,
74 | )
75 |
76 | ni = len(u_indices)
77 | v_indices = res_id[ni:] - nu
78 | res_adj = res_adj_h[:ni, ni:]
79 |
80 | return res_adj, v_indices
81 |
82 |
83 | def sample_u_given_v(
84 | adj: SparseTensor,
85 | v_indices: Tensor,
86 | prev_u: Tensor,
87 | num_neighbors: int,
88 | replace=False,
89 | ) -> Tuple[SparseTensor, Tensor]:
90 |
91 | # reuse sample_v_given_u on the transposed adjacency
92 | res_adj_t, u_indices = sample_v_given_u(
93 | adj.t(), v_indices, prev_u, num_neighbors=num_neighbors, replace=replace
94 | )
95 |
96 | return res_adj_t.t(), u_indices
97 |
98 |
99 | class DirectedAdj(NamedTuple):
100 | adj: SparseTensor
101 | u_id: Tensor
102 | v_id: Tensor
103 | e_id: Optional[Tensor]
104 | size: Tuple[int, int]
105 | flow: str
106 |
107 | def to(self, *args, **kwargs):
108 | adj = self.adj.to(*args, **kwargs)
109 | u_id = self.u_id.to(*args, **kwargs)
110 | v_id = self.v_id.to(*args, **kwargs)
111 | e_id = self.e_id.to(*args, **kwargs) if self.e_id is not None else None
112 | return DirectedAdj(adj, u_id, v_id, e_id, self.size, self.flow)
113 |
114 |
115 | class BEANAdjacency(NamedTuple):
116 | adj_v2u: DirectedAdj
117 | adj_u2v: DirectedAdj
118 | adj_e: Optional[DirectedAdj]
119 |
120 | def to(self, *args, **kwargs):
121 | adj_v2u = self.adj_v2u.to(*args, **kwargs)
122 | adj_u2v = self.adj_u2v.to(*args, **kwargs)
123 | adj_e = None
124 | if self.adj_e is not None:
125 | adj_e = self.adj_e.to(*args, **kwargs)
126 | return BEANAdjacency(adj_v2u, adj_u2v, adj_e)
127 |
128 |
129 | class BipartiteNeighborSampler(torch.utils.data.DataLoader):
130 | def __init__(
131 | self,
132 | adj: SparseTensor,
133 | n_layers: int,
134 | num_neighbors_u: Union[int, List[int]],
135 | num_neighbors_v: Union[int, List[int]],
136 | base: str = "u",
137 | n_other_node: int = -1,
138 | **kwargs
139 | ):
140 |
141 | adj = adj.to("cpu")
142 |
143 | if "collate_fn" in kwargs:
144 | del kwargs["collate_fn"]
145 |
146 | self.adj = adj
147 | self.n_layers = n_layers
148 | self.base = base
149 | self.n_other_node = n_other_node
150 |
151 | if isinstance(num_neighbors_u, int):
152 | num_neighbors_u = [num_neighbors_u for _ in range(n_layers)]
153 | if isinstance(num_neighbors_v, int):
154 | num_neighbors_v = [num_neighbors_v for _ in range(n_layers)]
155 | self.num_neighbors_u = num_neighbors_u
156 | self.num_neighbors_v = num_neighbors_v
157 |
158 | if base == "u": # start from u
159 | item_idx = torch.arange(adj.sparse_size(0))
160 | elif base == "v": # start from v instead
161 | item_idx = torch.arange(adj.sparse_size(1))
162 | elif base == "e": # start from e instead
163 | item_idx = torch.arange(adj.nnz())
164 | else: # start from u default
165 | item_idx = torch.arange(adj.sparse_size(0))
166 |
167 | value = torch.arange(adj.nnz())
168 | adj = adj.set_value(value, layout="coo")
169 | self.__val__ = adj.storage.value()
170 |
171 | # transpose of adjacency
172 | self.adj = adj
173 | self.adj_t = adj.t()
174 |
175 | # homogeneous graph adjacency matrix
176 | self.nu, self.nv = self.adj.sparse_sizes()
177 | self.adj_homogen = SparseTensor(
178 | row=self.adj.storage.row(),
179 | col=self.adj.storage.col() + self.nu,
180 | value=self.adj.storage.value(),
181 | sparse_sizes=(self.nu + self.nv, self.nu + self.nv),
182 | )
183 | self.adj_t_homogen = SparseTensor(
184 | row=self.adj_t.storage.row(),
185 | col=self.adj_t.storage.col() + self.nv,
186 | value=self.adj_t.storage.value(),
187 | sparse_sizes=(self.nu + self.nv, self.nu + self.nv),
188 | )
189 |
190 | super(BipartiteNeighborSampler, self).__init__(
191 | item_idx.view(-1).tolist(), collate_fn=self.sample, **kwargs
192 | )
193 |
194 | def sample_v_given_u(
195 | self, u_indices: Tensor, prev_v: Tensor, num_neighbors: int
196 | ) -> Tuple[SparseTensor, Tensor]:
197 |
198 | res_adj_h, res_id = self.adj_homogen.sample_adj(
199 | torch.cat([u_indices, prev_v + self.nu]),
200 | num_neighbors=num_neighbors,
201 | replace=False,
202 | )
203 |
204 | ni = len(u_indices)
205 | v_indices = res_id[ni:] - self.nu
206 | res_adj = res_adj_h[:ni, ni:]
207 |
208 | return res_adj, v_indices
209 |
210 | def sample_u_given_v(
211 | self, v_indices: Tensor, prev_u: Tensor, num_neighbors: int
212 | ) -> Tuple[SparseTensor, Tensor]:
213 |
214 | # start = time.time()
215 | res_adj_h, res_id = self.adj_t_homogen.sample_adj(
216 | torch.cat([v_indices, prev_u + self.nv]),
217 | num_neighbors=num_neighbors,
218 | replace=False,
219 | )
220 | # print(f"adjoint sampling : {time.time() - start} s")
221 |
222 | ni = len(v_indices)
223 | u_indices = res_id[ni:] - self.nv
224 | res_adj = res_adj_h[:ni, ni:]
225 |
226 | return res_adj.t(), u_indices
227 |
228 | def adjacency_from_samples(
229 | self, adj: SparseTensor, u_id: Tensor, v_id: Tensor, flow: str
230 | ) -> DirectedAdj:
231 |
232 | e_id = adj.storage.value()
233 | size = adj.sparse_sizes()
234 | if self.__val__ is not None:
235 | adj.set_value_(self.__val__[e_id], layout="coo")
236 |
237 | return DirectedAdj(adj, u_id, v_id, e_id, size, flow)
238 |
239 | def combine_adjacency(
240 | self, v2u_adj: SparseTensor, u2v_adj: SparseTensor, e_adj: SparseTensor
241 | ) -> SparseTensor:
242 |
243 | # start = time.time()
244 | nu = u2v_adj.sparse_size(0)
245 | nv = v2u_adj.sparse_size(1)
246 |
247 | row = torch.cat(
248 | [e_adj.storage.row(), v2u_adj.storage.row(), u2v_adj.storage.row()], dim=-1
249 | )
250 | col = torch.cat(
251 | [e_adj.storage.col(), v2u_adj.storage.col(), u2v_adj.storage.col()], dim=-1
252 | )
253 | value = torch.cat(
254 | [e_adj.storage.value(), v2u_adj.storage.value(), u2v_adj.storage.value()],
255 | dim=0,
256 | )
257 | fl = torch.cat(
258 | [
259 | torch.ones(e_adj.nnz()),
260 | 2 * torch.ones(v2u_adj.nnz()),
261 | 4 * torch.ones(u2v_adj.nnz()),
262 | ]
263 | )
264 |
265 | storage = SparseStorage(
266 | row=row, col=col, value=value, sparse_sizes=(nu, nv), is_sorted=False
267 | )
268 | storage = storage.coalesce(reduce="mean")
269 |
270 | fl_storage = SparseStorage(
271 | row=row, col=col, value=fl, sparse_sizes=(nu, nv), is_sorted=False
272 | )
273 | fl_storage = fl_storage.coalesce(reduce="sum")
274 |
275 | res = SparseTensor.from_storage(storage)
276 | flag = SparseTensor.from_storage(fl_storage)
277 |
278 | # print(f"combine adj : {time.time() - start} s")
279 |
280 | return res, flag
281 |
282 | def sample(self, batch):
283 |
284 | # start = time.time()
285 |
286 | if not isinstance(batch, Tensor):
287 | batch = torch.tensor(batch)
288 |
289 | batch_size: int = len(batch)
290 |
291 | # calculate batch_size for another node
292 | if self.n_other_node == -1 and self.base in ["u", "v"]:
293 | # do proportional
294 | nu, nv = self.adj.sparse_sizes()
295 | if self.base == "u":
296 | self.n_other_node = int(math.ceil((nv / nu) * batch_size))
297 | elif self.base == "v":
298 | self.n_other_node = int(math.ceil((nu / nv) * batch_size))
299 |
300 | ## get the other indices
301 | empty_list = torch.tensor([], dtype=torch.long)
302 | if self.base == "u":
303 | # get the base node for v
304 | u_indices = batch
305 | res_adj, res_id = self.sample_v_given_u(
306 | u_indices, empty_list, num_neighbors=self.num_neighbors_u[0]
307 | )
308 | rand_id = torch.randperm(len(res_id))[: self.n_other_node]
309 | v_indices = res_id[rand_id]
310 | e_adj = res_adj[:, rand_id]
311 | elif self.base == "v":
312 | # get the base node for u
313 | v_indices = batch
314 | res_adj, res_id = self.sample_u_given_v(
315 | v_indices, empty_list, num_neighbors=self.num_neighbors_v[0]
316 | )
317 | rand_id = torch.randperm(len(res_id))[: self.n_other_node]
318 | u_indices = res_id[rand_id]
319 | e_adj = res_adj[rand_id, :]
320 | elif self.base == "e":
321 | # get the base node for u and v
322 | row = self.adj.storage.row()[batch]
323 | col = self.adj.storage.col()[batch]
324 | unique_row, invidx_row = torch.unique(row, return_inverse=True)
325 | unique_col, invidx_col = torch.unique(col, return_inverse=True)
326 |
327 | reindex_row_id = torch.arange(len(unique_row))
328 | reindex_col_id = torch.arange(len(unique_col))
329 | reindex_row = reindex_row_id[invidx_row]
330 | reindex_col = reindex_col_id[invidx_col]
331 |
332 | e_adj = SparseTensor(row=reindex_row, col=reindex_col, value=batch)
333 | e_indices = batch
334 | u_indices = unique_row
335 | v_indices = unique_col
336 |
337 | # init results
338 | adjacencies = []
339 | e_flags = []
340 |
341 | ## for subsequent layers
342 | for i in range(self.n_layers):
343 |
344 | # v -> u
345 | u_adj, next_v_indices = self.sample_v_given_u(
346 | u_indices, prev_v=v_indices, num_neighbors=self.num_neighbors_u[i]
347 | )
348 | dir_adj_v2u = self.adjacency_from_samples(
349 | u_adj, u_indices, next_v_indices, "v->u"
350 | )
351 |
352 | # u -> v
353 | v_adj, next_u_indices = self.sample_u_given_v(
354 | v_indices, prev_u=u_indices, num_neighbors=self.num_neighbors_v[i]
355 | )
356 | dir_adj_u2v = self.adjacency_from_samples(
357 | v_adj, next_u_indices, v_indices, "u->v"
358 | )
359 |
360 | # u -> e <- v
361 | dir_adj_e = self.adjacency_from_samples(
362 | e_adj, u_indices, v_indices, "u->e<-v"
363 | )
364 |
365 | # add them to the list
366 | adjacencies.append(BEANAdjacency(dir_adj_v2u, dir_adj_u2v, dir_adj_e))
367 |
368 | # for next iter
369 | e_adj, e_flag = self.combine_adjacency(
370 | v2u_adj=u_adj, u2v_adj=v_adj, e_adj=e_adj
371 | )
372 | u_indices = next_u_indices
373 | v_indices = next_v_indices
374 | e_flags.append(e_flag)
375 |
376 | # flip the order
377 | adjacencies = adjacencies[0] if len(adjacencies) == 1 else adjacencies[::-1]
378 | e_flags = e_flags[0] if len(e_flags) == 1 else e_flags[::-1]
379 |
380 | # get e_indices
381 | e_indices = e_adj.storage.value()
382 |
383 | # print(f"sampling : {time.time() - start} s")
384 |
385 | return batch_size, (u_indices, v_indices, e_indices), adjacencies, e_flags
386 |
387 |
388 | class EdgeLoader(torch.utils.data.DataLoader):
389 | def __init__(self, adj: SparseTensor, **kwargs):
390 |
391 | edge_idx = torch.arange(adj.nnz())
392 | self.adj = adj
393 |
394 | super().__init__(edge_idx.view(-1).tolist(), collate_fn=self.sample, **kwargs)
395 |
396 | def sample(self, batch):
397 |
398 | if not isinstance(batch, Tensor):
399 | batch = torch.tensor(batch)
400 |
401 | row = self.adj.storage.row()[batch]
402 | col = self.adj.storage.col()[batch]
403 | if self.adj.storage.has_value():
404 | val = self.adj.storage.value()[batch]
405 | else:
406 | val = batch
407 |
408 | # get unique row, col & idx
409 | unique_row, invidx_row = torch.unique(row, return_inverse=True)
410 | unique_col, invidx_col = torch.unique(col, return_inverse=True)
411 |
412 | reindex_row_id = torch.arange(len(unique_row))
413 | reindex_col_id = torch.arange(len(unique_col))
414 |
415 | reindex_row = reindex_row_id[invidx_row]
416 | reindex_col = reindex_col_id[invidx_col]
417 |
418 | adj = SparseTensor(row=reindex_row, col=reindex_col, value=val)
419 | e_id = batch
420 | u_id = unique_row
421 | v_id = unique_col
422 |
423 | adj_e = DirectedAdj(adj, u_id, v_id, e_id, adj.sparse_sizes(), "u->e<-v")
424 |
425 | return adj_e
426 |
--------------------------------------------------------------------------------
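EdgePredictionSampler in sampler.py mixes the observed (positive) edges with randomly proposed node pairs for the structure decoder; after sampling, entries whose value is greater than zero are observed edges and the remaining entries (value -1) are negatives. A small usage sketch on a hypothetical 3x3 adjacency (not one of the repository's datasets):

    import torch
    from torch_sparse import SparseTensor
    from models.sampler import EdgePredictionSampler

    adj = SparseTensor(
        row=torch.tensor([0, 1, 2]),
        col=torch.tensor([1, 0, 2]),
        value=torch.ones(3),
        sparse_sizes=(3, 3),
    )

    sampler = EdgePredictionSampler(adj, mult=2)     # roughly 2 negatives per positive
    edge_pred_samples = sampler.sample()

    # positive edges carry value 1, randomly proposed negatives carry value -1
    labels = edge_pred_samples.storage.value() > 0

This is the same labeling convention consumed by edge_prediction_metric and compute_anomaly_score in score.py below.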
/models/score.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | from typing import Dict
5 | import torch
6 | from torch import Tensor
7 |
8 | from torch_scatter import scatter
9 | from torch_sparse import SparseTensor
10 |
11 | from sklearn.metrics import (
12 | accuracy_score,
13 | f1_score,
14 | precision_score,
15 | recall_score,
16 | roc_curve,
17 | precision_recall_curve,
18 | auc,
19 | )
20 | import pandas as pd
21 |
22 |
23 | def compute_anomaly_score(
24 | xu: Tensor,
25 | xv: Tensor,
26 | xe: Tensor,
27 | adj: SparseTensor,
28 | edge_pred_samples: SparseTensor,
29 | out: Dict[str, Tensor],
30 | xe_loss_weight: float = 1.0,
31 | structure_loss_weight: float = 1.0,
32 | ) -> Dict[str, Tensor]:
33 |
34 | # node error, use RMSE instead of MSE
35 | xu_error = torch.sqrt(torch.mean((xu - out["xu"]) ** 2, dim=1))
36 | xv_error = torch.sqrt(torch.mean((xv - out["xv"]) ** 2, dim=1))
37 |
38 | # edge error, use RMSE instead of MSE
39 | xe_error = torch.sqrt(torch.mean((xe - out["xe"]) ** 2, dim=1))
40 |
41 | # edge prediction cross entropy
42 | edge_ce = -torch.log(out["eprob"][edge_pred_samples.storage.value() > 0] + 1e-12)
43 |
44 | # edge score
45 | e_score = xe_loss_weight * xe_error + structure_loss_weight * edge_ce
46 |
47 | # node scores: node reconstruction error + aggregated scores of incident edges
48 | u_score_edge_max = xu_error + scatter(
49 | e_score, adj.storage.row(), dim=0, reduce="max"
50 | )
51 | v_score_edge_max = xv_error + scatter(
52 | e_score, adj.storage.col(), dim=0, reduce="max"
53 | )
54 | u_score_edge_mean = xu_error + scatter(
55 | e_score, adj.storage.row(), dim=0, reduce="mean"
56 | )
57 | v_score_edge_mean = xv_error + scatter(
58 | e_score, adj.storage.col(), dim=0, reduce="mean"
59 | )
60 | u_score_edge_sum = xu_error + scatter(
61 | e_score, adj.storage.row(), dim=0, reduce="sum"
62 | )
63 | v_score_edge_sum = xv_error + scatter(
64 | e_score, adj.storage.col(), dim=0, reduce="sum"
65 | )
66 |
67 | anomaly_score = {
68 | "xu_error": xu_error,
69 | "xv_error": xv_error,
70 | "xe_error": xe_error,
71 | "edge_ce": edge_ce,
72 | "e_score": e_score,
73 | "u_score_edge_max": u_score_edge_max,
74 | "u_score_edge_mean": u_score_edge_mean,
75 | "u_score_edge_sum": u_score_edge_sum,
76 | "v_score_edge_max": v_score_edge_max,
77 | "v_score_edge_mean": v_score_edge_mean,
78 | "v_score_edge_sum": v_score_edge_sum,
79 | }
80 |
81 | return anomaly_score
82 |
83 |
84 | def edge_prediction_metric(
85 | edge_pred_samples: SparseTensor, edge_prob: Tensor
86 | ) -> Dict[str, float]:
87 |
88 | edge_pred = (edge_prob >= 0.5).int().cpu().numpy()
89 | edge_gt = (edge_pred_samples.storage.value() > 0).int().cpu().numpy()
90 |
91 | acc = accuracy_score(edge_gt, edge_pred)
92 | prec = precision_score(edge_gt, edge_pred)
93 | rec = recall_score(edge_gt, edge_pred)
94 | f1 = f1_score(edge_gt, edge_pred)
95 |
96 | result = {"acc": acc, "prec": prec, "rec": rec, "f1": f1}
97 | return result
98 |
99 |
100 | def compute_evaluation_metrics(
101 | anomaly_score: Dict[str, Tensor], yu: Tensor, yv: Tensor, ye: Tensor, agg="max"
102 | ):
103 |
104 | # node u
105 | u_roc_curve = roc_curve(
106 | yu.cpu().numpy(), anomaly_score[f"u_score_edge_{agg}"].cpu().numpy()
107 | )
108 | u_pr_curve = precision_recall_curve(
109 | yu.cpu().numpy(), anomaly_score[f"u_score_edge_{agg}"].cpu().numpy()
110 | )
111 | u_roc_auc = auc(u_roc_curve[0], u_roc_curve[1])
112 | u_pr_auc = auc(u_pr_curve[1], u_pr_curve[0])
113 |
114 | # node v
115 | v_roc_curve = roc_curve(
116 | yv.cpu().numpy(), anomaly_score[f"v_score_edge_{agg}"].cpu().numpy()
117 | )
118 | v_pr_curve = precision_recall_curve(
119 | yv.cpu().numpy(), anomaly_score[f"v_score_edge_{agg}"].cpu().numpy()
120 | )
121 | v_roc_auc = auc(v_roc_curve[0], v_roc_curve[1])
122 | v_pr_auc = auc(v_pr_curve[1], v_pr_curve[0])
123 |
124 | # edge
125 | e_roc_curve = roc_curve(ye.cpu().numpy(), anomaly_score["xe_error"].cpu().numpy())
126 | e_pr_curve = precision_recall_curve(
127 | ye.cpu().numpy(), anomaly_score["xe_error"].cpu().numpy()
128 | )
129 | e_roc_auc = auc(e_roc_curve[0], e_roc_curve[1])
130 | e_pr_auc = auc(e_pr_curve[1], e_pr_curve[0])
131 |
132 | metrics = {
133 | "u_roc_curve": u_roc_curve,
134 | "u_pr_curve": u_pr_curve,
135 | "u_roc_auc": u_roc_auc,
136 | "u_pr_auc": u_pr_auc,
137 | "v_roc_curve": v_roc_curve,
138 | "v_pr_curve": v_pr_curve,
139 | "v_roc_auc": v_roc_auc,
140 | "v_pr_auc": v_pr_auc,
141 | "e_roc_curve": e_roc_curve,
142 | "e_pr_curve": e_pr_curve,
143 | "e_roc_auc": e_roc_auc,
144 | "e_pr_auc": e_pr_auc,
145 | }
146 |
147 | return metrics
148 |
149 |
150 | def attach_anomaly_score(
151 | anomaly_score: Dict[str, Tensor],
152 | dfu_id: pd.DataFrame,
153 | dfv_id: pd.DataFrame,
154 | dfe_id: pd.DataFrame,
155 | ):
156 |
157 | dfu_id = dfu_id.assign(
158 | xu_error=anomaly_score["xu_error"].cpu().numpy(),
159 | u_score_edge_max=anomaly_score["u_score_edge_max"].cpu().numpy(),
160 | u_score_edge_mean=anomaly_score["u_score_edge_mean"].cpu().numpy(),
161 | )
162 |
163 | dfv_id = dfv_id.assign(
164 | xv_error=anomaly_score["xv_error"].cpu().numpy(),
165 | v_score_edge_max=anomaly_score["v_score_edge_max"].cpu().numpy(),
166 | v_score_edge_mean=anomaly_score["v_score_edge_mean"].cpu().numpy(),
167 | )
168 |
169 | dfe_id = dfe_id.assign(
170 | xe_error=anomaly_score["xe_error"].cpu().numpy(),
171 | edge_ce=anomaly_score["edge_ce"].cpu().numpy(),
172 | e_score=anomaly_score["e_score"].cpu().numpy(),
173 | )
174 |
175 | return dfu_id, dfv_id, dfe_id
176 |
--------------------------------------------------------------------------------
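A small, self-contained sketch of how the scoring helpers in score.py are meant to be consumed; the tensors and the out dictionary below are synthetic stand-ins (in the training scripts, out comes from the GraphBEAN forward pass and edge_pred_samples from EdgePredictionSampler):

    import torch
    from torch_sparse import SparseTensor
    from models.score import compute_anomaly_score

    nu, nv, d = 4, 3, 8
    xu, xv = torch.randn(nu, d), torch.randn(nv, d)
    adj = SparseTensor(
        row=torch.tensor([0, 1, 2, 3]),
        col=torch.tensor([0, 1, 2, 0]),
        value=torch.ones(4),
        sparse_sizes=(nu, nv),
    )
    xe = torch.randn(adj.nnz(), d)

    # fake reconstructions and edge probabilities; passing adj itself as
    # edge_pred_samples means "positive edges only" in this toy example
    out = {"xu": xu + 0.1, "xv": xv + 0.1, "xe": xe + 0.1,
           "eprob": torch.rand(adj.nnz())}
    scores = compute_anomaly_score(xu, xv, xe, adj, edge_pred_samples=adj, out=out)

    top_edges = torch.topk(scores["e_score"], k=2).indices   # most anomalous edges

The node-level scores (u_score_edge_max and friends) can be ranked the same way to surface suspicious users or items.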
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.21.6
2 | pandas==1.1.5
3 | pygod==0.3.0
4 | scikit_learn==1.0.2
5 | scipy==1.7.2
6 | sentence_transformers==2.2.2
7 | torch==1.9.0
8 | torch_geometric==2.0.3
9 | torch_scatter==2.0.8
10 | torch_sparse==0.6.12
11 | tqdm==4.62.3
12 | tensorboard==2.9.1
--------------------------------------------------------------------------------
/results/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
--------------------------------------------------------------------------------
/storage/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
--------------------------------------------------------------------------------
/train_full_experiment.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import sys
5 |
6 | from data_finefoods import load_graph
7 | from models.score import compute_evaluation_metrics
8 |
9 | import time
10 | from tqdm import tqdm
11 | import argparse
12 | import os
13 |
14 | from torch.utils.tensorboard import SummaryWriter
15 | import datetime
16 |
17 | import torch
18 |
19 | from models.data import BipartiteData
20 | from models.net import GraphBEAN
21 | from models.sampler import EdgePredictionSampler
22 | from models.loss import reconstruction_loss
23 | from models.score import compute_anomaly_score, edge_prediction_metric
24 |
25 | from utils.seed import seed_all
26 |
27 | # %% args
28 |
29 | parser = argparse.ArgumentParser(description="GraphBEAN")
30 | parser.add_argument("--name", type=str, default="wikipedia_anomaly", help="name")
31 | parser.add_argument(
32 | "--key", type=str, default="graph_anomaly_list", help="key to the data"
33 | )
34 | parser.add_argument("--id", type=int, default=0, help="id to the data")
35 | parser.add_argument("--n-epoch", type=int, default=200, help="number of epoch")
36 | parser.add_argument(
37 | "--scheduler-milestones",
38 | nargs="+",
39 | type=int,
40 | default=[],
41 | help="scheduler milestone",
42 | )
43 | parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
44 | parser.add_argument(
45 | "--score-agg", type=str, default="max", help="aggregation for node anomaly score"
46 | )
47 | parser.add_argument("--eta", type=float, default=0.2, help="structure loss weight")
48 |
49 | args1 = vars(parser.parse_args())
50 |
51 | args2 = {
52 | "hidden_channels": 32,
53 | "latent_channels_u": 32,
54 | "latent_channels_v": 32,
55 | "edge_pred_latent": 32,
56 | "n_layers_encoder": 2,
57 | "n_layers_decoder": 2,
58 | "n_layers_mlp": 2,
59 | "dropout_prob": 0.0,
60 | "gamma": 0.2,
61 | "xe_loss_weight": 1.0,
62 | "structure_loss_weight": args1["eta"],
63 | "structure_loss_weight_anomaly_score": args1["eta"],
64 | "iter_check": 10,
65 | "seed": 0,
66 | "neg_sampler_mult": 5,
67 | "k_check": 15,
68 | "tensorboard": False,
69 | "progress_bar": True,
70 | }
71 |
72 | args = {**args1, **args2}
73 |
74 | seed_all(args["seed"])
75 |
76 | result_dir = "results/"
77 |
78 |
79 | # %% data
80 | data = load_graph(args["name"], args["key"], args["id"])
81 |
82 | u_ch = data.xu.shape[1]
83 | v_ch = data.xv.shape[1]
84 | e_ch = data.xe.shape[1]
85 |
86 | print(
87 | f"Data dimension: U node = {data.xu.shape}; V node = {data.xv.shape}; E edge = {data.xe.shape}; \n"
88 | )
89 |
90 | # %% model
91 |
92 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
93 | model = GraphBEAN(
94 | in_channels=(u_ch, v_ch, e_ch),
95 | hidden_channels=args["hidden_channels"],
96 | latent_channels=(args["latent_channels_u"], args["latent_channels_v"]),
97 | edge_pred_latent=args["edge_pred_latent"],
98 | n_layers_encoder=args["n_layers_encoder"],
99 | n_layers_decoder=args["n_layers_decoder"],
100 | n_layers_mlp=args["n_layers_mlp"],
101 | dropout_prob=args["dropout_prob"],
102 | )
103 |
104 | model = model.to(device)
105 | optimizer = torch.optim.Adam(model.parameters(), lr=args["lr"])
106 | scheduler = torch.optim.lr_scheduler.MultiStepLR(
107 | optimizer, milestones=args["scheduler_milestones"], gamma=args["gamma"]
108 | )
109 |
110 | xu, xv = data.xu.to(device), data.xv.to(device)
111 | xe, adj = data.xe.to(device), data.adj.to(device)
112 | yu, yv, ye = data.yu.to(device), data.yv.to(device), data.ye.to(device)
113 |
114 |
115 | # sampler
116 | sampler = EdgePredictionSampler(adj, mult=args["neg_sampler_mult"])
117 |
118 | print(args)
119 | print()
120 |
121 | # %% train
122 | def train(epoch):
123 |
124 | model.train()
125 |
126 | edge_pred_samples = sampler.sample()
127 |
128 | optimizer.zero_grad()
129 | out = model(xu, xv, xe, adj, edge_pred_samples)
130 |
131 | loss, loss_component = reconstruction_loss(
132 | xu,
133 | xv,
134 | xe,
135 | adj,
136 | edge_pred_samples,
137 | out,
138 | xe_loss_weight=args["xe_loss_weight"],
139 | structure_loss_weight=args["structure_loss_weight"],
140 | )
141 |
142 | loss.backward()
143 | optimizer.step()
144 | scheduler.step()
145 |
146 | epred_metric = edge_prediction_metric(edge_pred_samples, out["eprob"])
147 |
148 | return loss, loss_component, epred_metric
149 |
150 |
151 | # %% evaluate and store
152 | def eval(epoch):
153 |
154 | # model.eval()
155 |
156 | start = time.time()
157 |
158 | # negative sampling
159 | edge_pred_samples = sampler.sample()
160 |
161 | with torch.no_grad():
162 |
163 | out = model(xu, xv, xe, adj, edge_pred_samples)
164 |
165 | loss, loss_component = reconstruction_loss(
166 | xu,
167 | xv,
168 | xe,
169 | adj,
170 | edge_pred_samples,
171 | out,
172 | xe_loss_weight=args["xe_loss_weight"],
173 | structure_loss_weight=args["structure_loss_weight"],
174 | )
175 |
176 | epred_metric = edge_prediction_metric(edge_pred_samples, out["eprob"])
177 |
178 | anomaly_score = compute_anomaly_score(
179 | xu,
180 | xv,
181 | xe,
182 | adj,
183 | edge_pred_samples,
184 | out,
185 | xe_loss_weight=args["xe_loss_weight"],
186 | structure_loss_weight=args["structure_loss_weight_anomaly_score"],
187 | )
188 |
189 | eval_metrics = compute_evaluation_metrics(
190 | anomaly_score, yu, yv, ye, agg=args["score_agg"]
191 | )
192 |
193 | elapsed = time.time() - start
194 |
195 | print(
196 | f"Eval, loss: {loss:.4f}, "
197 | + f"u auc-roc: {eval_metrics['u_roc_auc']:.4f}, v auc-roc: {eval_metrics['v_roc_auc']:.4f}, e auc-roc: {eval_metrics['e_roc_auc']:.4f}, "
198 | + f"u auc-pr {eval_metrics['u_pr_auc']:.4f}, v auc-pr {eval_metrics['v_pr_auc']:.4f}, e auc-pr {eval_metrics['e_pr_auc']:.4f} "
199 | + f"> {elapsed:.2f}s"
200 | )
201 |
202 | if args["tensorboard"]:
203 | tb.add_scalar("loss", loss, epoch)
204 | tb.add_scalar("u_roc_auc", eval_metrics["u_roc_auc"], epoch)
205 | tb.add_scalar("u_pr_auc", eval_metrics["u_pr_auc"], epoch)
206 | tb.add_scalar("v_roc_auc", eval_metrics["v_roc_auc"], epoch)
207 | tb.add_scalar("v_pr_auc", eval_metrics["v_pr_auc"], epoch)
208 | tb.add_scalar("e_roc_auc", eval_metrics["e_roc_auc"], epoch)
209 | tb.add_scalar("e_pr_auc", eval_metrics["e_pr_auc"], epoch)
210 |
211 | model_stored = {
212 | "args": args,
213 | "loss": loss,
214 | "loss_component": loss_component,
215 | "epred_metric": epred_metric,
216 | "eval_metrics": eval_metrics,
217 | "loss_hist": loss_hist,
218 | "loss_component_hist": loss_component_hist,
219 | "epred_metric_hist": epred_metric_hist,
220 | "state_dict": model.state_dict(),
221 | "optimizer_state_dict": optimizer.state_dict(),
222 | }
223 | output_stored = {"args": args, "out": out, "anomaly_score": anomaly_score}
224 |
225 | print("Saving current results...")
226 | torch.save(
227 | model_stored,
228 | os.path.join(
229 | result_dir,
230 | f"graphbean-{args['name']}-{args['id']}-eta-{args['eta']}-structure-model.th",
231 | ),
232 | )
233 | torch.save(
234 | output_stored,
235 | os.path.join(
236 | result_dir,
237 | f"graphbean-{args['name']}-{args['id']}-eta-{args['eta']}-structure-output.th",
238 | ),
239 | )
240 |
241 | return loss, loss_component, epred_metric
242 |
243 |
244 | # %% run training
245 | loss_hist = []
246 | loss_component_hist = []
247 | epred_metric_hist = []
248 |
249 | # tensor board
250 | if args["tensorboard"]:
251 | log_dir = (
252 | "/logs/tensorboard/"
253 | + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
254 | + "-"
255 | + args["name"]
256 | )
257 | tb = SummaryWriter(log_dir=log_dir, comment=args["name"])
258 | check_counter = 0
259 |
260 | eval(0)
261 |
262 | for epoch in range(args["n_epoch"]):
263 |
264 | start = time.time()
265 | loss, loss_component, epred_metric = train(epoch)
266 | elapsed = time.time() - start
267 |
268 | loss_hist.append(loss)
269 | loss_component_hist.append(loss_component)
270 | epred_metric_hist.append(epred_metric)
271 |
272 | print(
273 | f"#{epoch:3d}, "
274 | + f"Loss: {loss:.4f} => xu: {loss_component['xu']:.4f}, xv: {loss_component['xv']:.4f}, "
275 | + f"xe: {loss_component['xe']:.4f}, "
276 | + f"e: {loss_component['e']:.4f} -> "
277 | + f"[acc: {epred_metric['acc']:.3f}, f1: {epred_metric['f1']:.3f} -> "
278 | + f"prec: {epred_metric['prec']:.3f}, rec: {epred_metric['rec']:.3f}] "
279 | + f"> {elapsed:.2f}s"
280 | )
281 |
282 | if epoch % args["iter_check"] == 0: # and epoch != 0:
283 | # tb eval
284 | eval(epoch)
285 |
286 |
287 | # %% after training
288 | res = eval(args["n_epoch"])
289 | ev_loss, ev_loss_component, ev_epred_metric = res
290 |
291 | if args["tensorboard"]:
292 | tb.add_hparams(
293 | args,
294 | {
295 | "loss": ev_loss,
296 | "xu": ev_loss_component["xu"],
297 | "xv": ev_loss_component["xv"],
298 | "xe": ev_loss_component["xe"],
299 | "e": ev_loss_component["e"],
300 | "acc": ev_epred_metric["acc"],
301 | "f1": ev_epred_metric["f1"],
302 | "prec": ev_epred_metric["prec"],
303 | "rec": ev_epred_metric["rec"],
304 | },
305 | )
306 |
307 | print()
308 | print(args)
309 |
--------------------------------------------------------------------------------
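The eval() routine in train_full_experiment.py checkpoints a dictionary of losses, metrics, and state dicts into results/. A short sketch of reading it back; the file name below is assembled from this script's default arguments (name=wikipedia_anomaly, id=0, eta=0.2), so adjust it to your own run:

    import torch

    stored = torch.load(
        "results/graphbean-wikipedia_anomaly-0-eta-0.2-structure-model.th",
        map_location="cpu",
    )
    print(stored["eval_metrics"]["u_roc_auc"], stored["eval_metrics"]["u_pr_auc"])
    model_state = stored["state_dict"]   # can be loaded into a freshly built GraphBEAN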
/train_sample_experiment.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import sys
5 |
6 | from data_finefoods import load_graph
7 | from models.score import compute_evaluation_metrics
8 |
9 | import time
10 | from tqdm import tqdm
11 | import argparse
12 | import os
13 |
14 | from torch.utils.tensorboard import SummaryWriter
15 | import datetime
16 |
17 | import torch
18 |
19 | from models.data import BipartiteData
20 | from models.net_sample import GraphBEANSampled
21 | from models.sampler import BipartiteNeighborSampler
22 | from models.sampler import EdgePredictionSampler
23 | from models.loss import reconstruction_loss
24 | from models.score import compute_anomaly_score, edge_prediction_metric
25 |
26 | from utils.sum_dict import dict_addto, dict_div
27 | from utils.seed import seed_all
28 |
29 | # %% args
30 |
31 | parser = argparse.ArgumentParser(description="GraphBEAN")
32 | parser.add_argument("--name", type=str, default="finefoods_anomaly", help="name")
33 | parser.add_argument(
34 | "--key", type=str, default="graph_anomaly_list", help="key to the data"
35 | )
36 | parser.add_argument("--id", type=int, default=0, help="id to the data")
37 | parser.add_argument("--batch-size", type=int, default=2048, help="batch size")
38 | parser.add_argument(
39 | "--num-neighbors-u",
40 | type=int,
41 | default=10,
42 | help="number of neighbors for node u in sampling",
43 | )
44 | parser.add_argument(
45 | "--num-neighbors-v",
46 | type=int,
47 | default=10,
48 | help="number of neighbors for node v in sampling",
49 | )
50 | parser.add_argument("--n-epoch", type=int, default=50, help="number of epoch")
51 | parser.add_argument(
52 | "--scheduler-milestones",
53 | nargs="+",
54 | type=int,
55 | default=[20, 35],
56 | help="scheduler milestone",
57 | )
58 | parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
59 | parser.add_argument(
60 | "--score-agg", type=str, default="max", help="aggregation for node anomaly score"
61 | )
62 | parser.add_argument(
63 | "--num-workers",
64 | type=int,
65 | default=0,
66 | help="number of workers in neighborhood sampling loader",
67 | )
68 |
69 | args1 = vars(parser.parse_args())
70 |
71 | args2 = {
72 | "hidden_channels": 32,
73 | "latent_channels_u": 32,
74 | "latent_channels_v": 32,
75 | "edge_pred_latent": 32,
76 | "n_layers_encoder": 2,
77 | "n_layers_decoder": 2,
78 | "n_layers_mlp": 2,
79 | "dropout_prob": 0.0,
80 | "gamma": 0.2,
81 | "xe_loss_weight": 1.0,
82 | "structure_loss_weight": 0.2,
83 | "structure_loss_weight_anomaly_score": 0.2,
84 | "iter_check": 10,
85 | "seed": 0,
86 | "neg_sampler_mult": 3,
87 | "k_check": 15,
88 | "tensorboard": False,
89 | "progress_bar": False,
90 | }
91 |
92 | args = {**args1, **args2}
93 |
94 | seed_all(args["seed"])
95 |
96 | result_dir = "results/"
97 |
98 |
99 | # %% params
100 | batch_size = args["batch_size"]
101 |
102 | # %% data
103 | data = load_graph(args["name"], args["key"], args["id"])
104 | print(data)
105 |
106 | u_ch = data.xu.shape[1]
107 | v_ch = data.xv.shape[1]
108 | e_ch = data.xe.shape[1]
109 |
110 | print(
111 | f"Data dimension: U node = {data.xu.shape}; V node = {data.xv.shape}; E edge = {data.xe.shape}; \n"
112 | )
113 |
114 | # %% model
115 |
116 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
117 | model = GraphBEANSampled(
118 | in_channels=(u_ch, v_ch, e_ch),
119 | hidden_channels=args["hidden_channels"],
120 | latent_channels=(args["latent_channels_u"], args["latent_channels_v"]),
121 | edge_pred_latent=args["edge_pred_latent"],
122 | n_layers_encoder=args["n_layers_encoder"],
123 | n_layers_decoder=args["n_layers_decoder"],
124 | n_layers_mlp=args["n_layers_mlp"],
125 | dropout_prob=args["dropout_prob"],
126 | )
127 |
128 | model = model.to(device)
129 | optimizer = torch.optim.Adam(model.parameters(), lr=args["lr"])
130 | scheduler = torch.optim.lr_scheduler.MultiStepLR(
131 | optimizer, milestones=args["scheduler_milestones"], gamma=args["gamma"]
132 | )
133 |
134 | xu, xv = data.xu, data.xv
135 | xe, adj = data.xe, data.adj
136 | yu, yv, ye = data.yu, data.yv, data.ye
137 |
138 | # sampler
139 | train_loader = BipartiteNeighborSampler(
140 | adj,
141 | n_layers=4,
142 | base="v",
143 | batch_size=batch_size,
144 | drop_last=True,
145 | n_other_node=-1,
146 | num_neighbors_u=args["num_neighbors_u"],
147 | num_neighbors_v=args["num_neighbors_v"],
148 | num_workers=args["num_workers"],
149 | shuffle=True,
150 | )
151 |
152 | print(args)
153 | print()
154 |
155 | # %% train
156 | def train(epoch, check_counter):
157 |
158 | model.train()
159 |
160 | n_batch = len(train_loader)
161 | if args["progress_bar"]:
162 | pbar = tqdm(total=n_batch, leave=False)
163 | pbar.set_description(f"#{epoch:3d}")
164 |
165 | total_loss = 0
166 | total_epred_metric = {"acc": 0.0, "prec": 0.0, "rec": 0.0, "f1": 0.0}
167 | total_loss_component = {"xu": 0.0, "xv": 0.0, "xe": 0.0, "e": 0.0, "total": 0.0}
168 | num_update = 0
169 |
170 | for batch_size, indices, adjacencies, e_flags in train_loader:
171 |
172 | # print(f"# u nodes: {len(indices[0])} | # v nodes: {len(indices[1])} | # edges: {len(indices[2])}")
173 |
174 | adjacencies = [adj.to(device) for adj in adjacencies]
175 | e_flags = [fl.to(device) for fl in e_flags]
176 | u_id, v_id, e_id = indices
177 |
178 | # sample
179 | xu_sample = xu[u_id].to(device)
180 | xv_sample = xv[v_id].to(device)
181 | xe_sample = xe[e_id].to(device)
182 |
183 | # edge pred samples
184 | target_adj = adjacencies[-1].adj_e.adj
185 | edge_pred_sampler = EdgePredictionSampler(
186 | target_adj, mult=args["neg_sampler_mult"]
187 | )
188 | edge_pred_samples = edge_pred_sampler.sample().to(device)
189 |
190 | optimizer.zero_grad()
191 |
192 | # start = time.time()
193 | out = model(
194 | xu=xu_sample,
195 | xv=xv_sample,
196 | xe=xe_sample,
197 | bean_adjs=adjacencies,
198 | e_flags=e_flags,
199 | edge_pred_samples=edge_pred_samples,
200 | )
201 | # print(f"training : {time.time() - start} s")
202 |
203 | last_adj_e = adjacencies[-1].adj_e
204 | xu_target = xu[last_adj_e.u_id].to(device)
205 | xv_target = xv[last_adj_e.v_id].to(device)
206 | xe_target = xe[last_adj_e.e_id].to(device)
207 |
208 | loss, loss_component = reconstruction_loss(
209 | xu=xu_target,
210 | xv=xv_target,
211 | xe=xe_target,
212 | adj=last_adj_e.adj,
213 | edge_pred_samples=edge_pred_samples,
214 | out=out,
215 | xe_loss_weight=args["xe_loss_weight"],
216 | structure_loss_weight=args["structure_loss_weight"],
217 | )
218 |
219 | loss.backward()
220 | optimizer.step()
221 |
222 | epred_metric = edge_prediction_metric(edge_pred_samples, out["eprob"])
223 |
224 | total_loss += float(loss)
225 | total_epred_metric = dict_addto(total_epred_metric, epred_metric)
226 | total_loss_component = dict_addto(total_loss_component, loss_component)
227 | num_update += 1
228 |
229 | if args["progress_bar"]:
230 | pbar.update(1)
231 | pbar.set_postfix(
232 | {
233 | "loss": float(loss),
234 | "ep acc": epred_metric["acc"],
235 | "ep f1": epred_metric["f1"],
236 | }
237 | )
238 |
239 | if num_update == args["k_check"]:
240 | loss = total_loss / num_update
241 | loss_component = dict_div(total_loss_component, num_update)
242 | epred_metric = dict_div(total_epred_metric, num_update)
243 |
244 | # tensorboard
245 | if args["tensorboard"]:
246 | tb.add_scalar("loss", loss, check_counter)
247 | tb.add_scalar("loss_xu", loss_component["xu"], check_counter)
248 | tb.add_scalar("loss_xv", loss_component["xv"], check_counter)
249 | tb.add_scalar("loss_xe", loss_component["xe"], check_counter)
250 | tb.add_scalar("loss_e", loss_component["e"], check_counter)
251 |
252 | tb.add_scalar("epred_acc", epred_metric["acc"], check_counter)
253 | tb.add_scalar("epred_f1", epred_metric["f1"], check_counter)
254 | tb.add_scalar("epred_prec", epred_metric["prec"], check_counter)
255 | tb.add_scalar("epred_rec", epred_metric["rec"], check_counter)
256 |
257 | check_counter += 1
258 |
259 | total_loss = 0
260 | total_epred_metric = {"acc": 0.0, "prec": 0.0, "rec": 0.0, "f1": 0.0}
261 | total_loss_component = {
262 | "xu": 0.0,
263 | "xv": 0.0,
264 | "xe": 0.0,
265 | "e": 0.0,
266 | "total": 0.0,
267 | }
268 | num_update = 0
269 |
270 | if args["progress_bar"]:
271 | pbar.close()
272 | scheduler.step()
273 |
274 | return loss, loss_component, epred_metric, check_counter
275 |
276 |
277 | # %% evaluate and store
278 | def eval(epoch):
279 |
280 | model.eval()
281 |
282 | start = time.time()
283 |
284 | # negative sampling
285 | edge_pred_sampler = EdgePredictionSampler(adj, mult=args["neg_sampler_mult"])
286 | edge_pred_samples = edge_pred_sampler.sample()
287 |
288 | with torch.no_grad():
289 |
290 | out = model.inference(
291 | xu,
292 | xv,
293 | xe,
294 | adj,
295 | edge_pred_samples,
296 | batch_sizes=(2**13, 2**13, 2**13),
297 | device=device,
298 | progress_bar=args["progress_bar"],
299 | )
300 |
301 | loss, loss_component = reconstruction_loss(
302 | xu,
303 | xv,
304 | xe,
305 | adj,
306 | edge_pred_samples,
307 | out,
308 | xe_loss_weight=args["xe_loss_weight"],
309 | structure_loss_weight=args["structure_loss_weight"],
310 | )
311 |
312 | epred_metric = edge_prediction_metric(edge_pred_samples, out["eprob"])
313 |
314 | anomaly_score = compute_anomaly_score(
315 | xu,
316 | xv,
317 | xe,
318 | adj,
319 | edge_pred_samples,
320 | out,
321 | xe_loss_weight=args["xe_loss_weight"],
322 | structure_loss_weight=args["structure_loss_weight_anomaly_score"],
323 | )
324 |
325 | eval_metrics = compute_evaluation_metrics(
326 | anomaly_score, yu, yv, ye, agg=args["score_agg"]
327 | )
328 |
329 | elapsed = time.time() - start
330 |
331 | print(
332 | f"Eval, loss: {loss:.4f}, "
333 | + f"u auc-roc: {eval_metrics['u_roc_auc']:.4f}, v auc-roc: {eval_metrics['v_roc_auc']:.4f}, e auc-roc: {eval_metrics['e_roc_auc']:.4f}, "
334 | + f"u auc-pr {eval_metrics['u_pr_auc']:.4f}, v auc-pr {eval_metrics['v_pr_auc']:.4f}, e auc-pr {eval_metrics['e_pr_auc']:.4f} "
335 | + f"> {elapsed:.2f}s"
336 | )
337 |
338 | if args["tensorboard"]:
339 | tb.add_scalar("loss", loss, epoch)
340 | tb.add_scalar("u_roc_auc", eval_metrics["u_roc_auc"], epoch)
341 | tb.add_scalar("u_pr_auc", eval_metrics["u_pr_auc"], epoch)
342 | tb.add_scalar("v_roc_auc", eval_metrics["v_roc_auc"], epoch)
343 | tb.add_scalar("v_pr_auc", eval_metrics["v_pr_auc"], epoch)
344 | tb.add_scalar("e_roc_auc", eval_metrics["e_roc_auc"], epoch)
345 | tb.add_scalar("e_pr_auc", eval_metrics["e_pr_auc"], epoch)
346 |
347 | model_stored = {
348 | "args": args,
349 | "loss": loss,
350 | "loss_component": loss_component,
351 | "epred_metric": epred_metric,
352 | "eval_metrics": eval_metrics,
353 | "loss_hist": loss_hist,
354 | "loss_component_hist": loss_component_hist,
355 | "epred_metric_hist": epred_metric_hist,
356 | "state_dict": model.state_dict(),
357 | "optimizer_state_dict": optimizer.state_dict(),
358 | }
359 | output_stored = {"args": args, "out": out, "anomaly_score": anomaly_score}
360 |
361 | print("Saving current results...")
362 | torch.save(
363 | model_stored,
364 | os.path.join(result_dir, f"graphbean-{args['name']}-{args['id']}-model.th"),
365 | )
366 | torch.save(
367 | output_stored,
368 | os.path.join(result_dir, f"graphbean-{args['name']}-{args['id']}-output.th"),
369 | )
370 |
371 | return loss, loss_component, epred_metric
372 |
373 |
374 | # %% run training
375 | loss_hist = []
376 | loss_component_hist = []
377 | epred_metric_hist = []
378 |
379 | # tensor board
380 | if args["tensorboard"]:
381 | log_dir = (
382 | "/logs/tensorboard/"
383 | + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
384 | + "-"
385 | + args["name"]
386 | )
387 | tb = SummaryWriter(log_dir=log_dir, comment=args["name"])
388 | check_counter = 0
389 |
390 | # eval(0)
391 |
392 | for epoch in range(args["n_epoch"]):
393 |
394 | start = time.time()
395 | loss, loss_component, epred_metric, check_counter = train(epoch, check_counter)
396 | elapsed = time.time() - start
397 |
398 | loss_hist.append(loss)
399 | loss_component_hist.append(loss_component)
400 | epred_metric_hist.append(epred_metric)
401 |
402 | print(
403 | f"#{epoch:3d}, "
404 | + f"Loss: {loss:.4f} => xu: {loss_component['xu']:.4f}, xv: {loss_component['xv']:.4f}, "
405 | + f"xe: {loss_component['xe']:.4f}, "
406 | + f"e: {loss_component['e']:.4f} -> "
407 | + f"[acc: {epred_metric['acc']:.3f}, f1: {epred_metric['f1']:.3f} -> "
408 | + f"prec: {epred_metric['prec']:.3f}, rec: {epred_metric['rec']:.3f}] "
409 | + f"> {elapsed:.2f}s"
410 | )
411 |
412 | if epoch % args["iter_check"] == 0: # and epoch != 0:
413 | # tb eval
414 | eval(epoch)
415 |
416 |
417 | # %% after training
418 | res = eval(args["n_epoch"])
419 | ev_loss, ev_loss_component, ev_epred_metric = res
420 |
421 | if args["tensorboard"]:
422 | tb.add_hparams(
423 | args,
424 | {
425 | "loss": ev_loss,
426 | "xu": ev_loss_component["xu"],
427 | "xv": ev_loss_component["xv"],
428 | "xe": ev_loss_component["xe"],
429 | "e": ev_loss_component["e"],
430 | "acc": ev_epred_metric["acc"],
431 | "f1": ev_epred_metric["f1"],
432 | "prec": ev_epred_metric["prec"],
433 | "rec": ev_epred_metric["rec"],
434 | },
435 | )
436 |
437 | print()
438 | print(args)
439 |
--------------------------------------------------------------------------------
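The training loop in train_sample_experiment.py drives GraphBEANSampled with mini-batches drawn by BipartiteNeighborSampler. A toy sketch of iterating the sampler on a hypothetical 3x3 bipartite adjacency (real runs use the graphs returned by load_graph, with base="v" and a much larger batch size):

    import torch
    from torch_sparse import SparseTensor
    from models.sampler import BipartiteNeighborSampler

    adj = SparseTensor(
        row=torch.tensor([0, 0, 1, 2]),
        col=torch.tensor([0, 1, 1, 2]),
        sparse_sizes=(3, 3),
    )

    loader = BipartiteNeighborSampler(
        adj, n_layers=2, num_neighbors_u=5, num_neighbors_v=5,
        base="u", batch_size=2, shuffle=True,
    )

    for batch_size, (u_id, v_id, e_id), adjacencies, e_flags in loader:
        # each entry of adjacencies is a BEANAdjacency with v->u, u->v, and edge views
        print(batch_size, len(u_id), len(v_id), len(e_id))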
/utils/seed.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import numpy as np
5 | import random
6 | import torch
7 |
8 |
9 | def seed_all(seed_num):
10 | np.random.seed(seed_num)
11 | random.seed(seed_num)
12 | torch.manual_seed(seed_num)
13 | torch.cuda.manual_seed_all(seed_num)
14 |
--------------------------------------------------------------------------------
/utils/sparse_combine.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch import Tensor
6 |
7 | from torch_sparse import SparseTensor
8 |
9 | from typing import Optional, Tuple
10 |
11 | from torch_sparse.storage import SparseStorage
12 |
13 |
14 |
15 | def spadd(A: SparseTensor, B: SparseTensor, op: str = "add") -> SparseTensor:
16 | assert A.sparse_sizes() == B.sparse_sizes()
17 |
18 | m, n = A.sparse_sizes()
19 |
20 | row = torch.cat([A.storage.row(), B.storage.row()], dim=-1)
21 | col = torch.cat([A.storage.col(), B.storage.col()], dim=-1)
22 | value = torch.cat([A.storage.value(), B.storage.value()], dim=0)
23 |
24 | storage = SparseStorage(
25 | row=row, col=col, value=value, sparse_sizes=(m, n), is_sorted=False
26 | )
27 | storage = storage.coalesce(reduce=op)
28 |
29 | return SparseTensor.from_storage(storage)
30 |
31 |
32 | ## sparse combine
33 | def sparse_combine(
34 | a: SparseTensor, b: SparseTensor, flag_mult: Optional[Tuple[int, int]] = (1, 2)
35 | ) -> Tuple[SparseTensor, SparseTensor]:
36 |
37 | res = spadd(a, b, op="mean")
38 |
39 | # flag marking which input each entry came from
40 | flag = spadd(a.fill_value(flag_mult[0]), b.fill_value(flag_mult[1]))
41 |
42 | return res, flag
43 |
44 |
45 | ## sparse combine
46 | def sparse_combine3(
47 | a: SparseTensor, b: SparseTensor, c: SparseTensor
48 | ) -> Tuple[SparseTensor, SparseTensor]:
49 |
50 | flag_mult = (1, 2, 4)
51 | res = spadd(spadd(a, b, op="mean"), c, op="mean")
52 |
53 | # flag marking which input each entry came from
54 | flag = spadd(
55 | spadd(a.fill_value(flag_mult[0]), b.fill_value(flag_mult[1])),
56 | c.fill_value(flag_mult[2]),
57 | )
58 |
59 | return res, flag
60 |
61 |
62 | def sparse_combine3a(
63 | a: SparseTensor, b: SparseTensor, c: SparseTensor
64 | ) -> Tuple[SparseTensor, SparseTensor]:
65 |
66 | flag_mult = (1, 2, 4)
67 |
68 | # add values
69 | d = SparseTensor.from_torch_sparse_coo_tensor(
70 | a.to_torch_sparse_coo_tensor()
71 | + b.to_torch_sparse_coo_tensor()
72 | + c.to_torch_sparse_coo_tensor()
73 | )
74 | # add non zeros
75 | e = SparseTensor.from_torch_sparse_coo_tensor(
76 | a.fill_value(1).to_torch_sparse_coo_tensor()
77 | + b.fill_value(1).to_torch_sparse_coo_tensor()
78 | + c.fill_value(1).to_torch_sparse_coo_tensor()
79 | )
80 |
81 | # divide by the overlap count to undo double counting of shared entries
82 | val = (d.storage.value() / e.storage.value()).long()
83 | res = d.set_value(val, layout="coo")
84 |
85 | # flag marking which input each entry came from
86 | flag = SparseTensor.from_torch_sparse_coo_tensor(
87 | a.fill_value(flag_mult[0]).to_torch_sparse_coo_tensor()
88 | + b.fill_value(flag_mult[1]).to_torch_sparse_coo_tensor()
89 | + c.fill_value(flag_mult[2]).to_torch_sparse_coo_tensor()
90 | )
91 |
92 | return res, flag
93 |
94 |
95 | def xe_split3(
96 | xe: Tensor,
97 | flag: Tensor,
98 | ) -> Tuple[Tensor, Tensor, Tensor]:
99 |
100 | # flag_mult = (1,2,4)
101 |
102 | a_idx = (flag == 1) | (flag == 3) | (flag == 5) | (flag == 7)
103 | b_idx = (flag == 2) | (flag == 3) | (flag == 6) | (flag == 7)
104 | c_idx = (flag == 4) | (flag == 5) | (flag == 6) | (flag == 7)
105 |
106 | xe_a = xe[a_idx]
107 | xe_b = xe[b_idx]
108 | xe_c = xe[c_idx]
109 |
110 | return xe_a, xe_b, xe_c
111 |
--------------------------------------------------------------------------------
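The (1, 2, 4) flag values used in sparse_combine.py act as bit markers recording which of the three inputs contributed each entry, and xe_split3 routes the combined edge features back to their sources. A tiny worked example with hand-picked flags (values are illustrative only):

    import torch
    from utils.sparse_combine import xe_split3

    xe = torch.arange(5, dtype=torch.float).unsqueeze(1)   # 5 combined edge-feature rows
    flag = torch.tensor([1, 2, 4, 3, 7])                   # 3 = a+b overlap, 7 = a+b+c

    xe_a, xe_b, xe_c = xe_split3(xe, flag)
    # xe_a -> rows 0, 3, 4;  xe_b -> rows 1, 3, 4;  xe_c -> rows 2, 4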
/utils/sprand.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 | import torch
5 | from torch_sparse import SparseTensor, SparseStorage
6 | from typing import Tuple
7 |
8 |
9 | def sprand(dim: Tuple[int, int], nnz: int) -> SparseTensor:
10 | nu, nv = dim
11 | row = torch.randint(nu, (nnz,))
12 | col = torch.randint(nv, (nnz,))
13 |
14 | storage = SparseStorage(row=row, col=col, sparse_sizes=(nu, nv), is_sorted=False)
15 | storage = storage.coalesce(reduce="max")
16 |
17 | return SparseTensor.from_storage(storage)
18 |
--------------------------------------------------------------------------------
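sprand draws a random sparse pattern with up to nnz entries (duplicate coordinates collapse during coalescing); EdgePredictionSampler uses it to propose candidate negative edges. A quick sketch:

    import torch
    from utils.sprand import sprand

    torch.manual_seed(0)
    neg = sprand((100, 50), nnz=200)
    print(neg.sparse_sizes(), neg.nnz())   # (100, 50) and at most 200 entries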
/utils/sum_dict.py:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Grabtaxi Holdings Pte Ltd (GRAB), All rights reserved.
2 | # Use of this source code is governed by an MIT-style license that can be found in the LICENSE file
3 |
4 |
5 | def dict_addto(res: dict, a: dict) -> dict:
6 | for k in res.keys():
7 | res[k] += float(a[k])
8 | return res
9 |
10 |
11 | def dict_div(res: dict, div) -> dict:
12 | for k in res.keys():
13 | res[k] /= div
14 | return res
15 |
--------------------------------------------------------------------------------