├── .github └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── .travis.yml ├── DGFraud_logo.png ├── LICENSE ├── README.md ├── algorithms ├── FdGars │ ├── FdGars.py │ ├── FdGars_main.py │ └── README.md ├── GAS │ ├── GAS.py │ ├── GAS_main.py │ └── README.md ├── GEM │ ├── GEM.py │ ├── GEM_main.py │ └── README.md ├── GeniePath │ ├── GeniePath.py │ ├── GeniePath_main.py │ └── README.md ├── GraphConsis │ ├── README.md │ ├── __init__.py │ ├── aggregators.py │ ├── inits.py │ ├── layers.py │ ├── metrics.py │ ├── minibatch.py │ ├── models.py │ ├── neigh_samplers.py │ ├── prediction.py │ ├── supervised_models.py │ ├── supervised_train.py │ └── utils.py ├── GraphSage │ ├── README.md │ ├── __init__.py │ ├── aggregators.py │ ├── inits.py │ ├── layers.py │ ├── metrics.py │ ├── minibatch.py │ ├── models.py │ ├── neigh_samplers.py │ ├── prediction.py │ ├── supervised_models.py │ ├── supervised_train.py │ └── utils.py ├── HACUD │ ├── README.md │ ├── __init__.py │ ├── data_loader.py │ ├── dblp │ │ ├── s_adj_0_mat.npz │ │ ├── s_adj_1_mat.npz │ │ ├── s_adj_2_mat.npz │ │ ├── s_mean_adj_0_mat.npz │ │ ├── s_mean_adj_1_mat.npz │ │ ├── s_mean_adj_2_mat.npz │ │ ├── s_norm_adj_0_mat.npz │ │ ├── s_norm_adj_1_mat.npz │ │ └── s_norm_adj_2_mat.npz │ ├── get_data.py │ ├── main.py │ ├── model.py │ ├── parse.py │ └── utils.py ├── Player2Vec │ ├── Player2Vec.py │ ├── Player2Vec_main.py │ └── README.md ├── SemiGNN │ ├── README.md │ ├── SemiGNN.py │ └── SemiGNN_main.py └── base_algorithm.py ├── base_models ├── inits.py ├── layers.py └── models.py ├── dataset ├── DBLP4057_GAT_with_idx_tra200_val_800.zip └── YelpChi.zip ├── main.py ├── reference ├── fdgars.txt ├── gas.txt ├── gem.txt ├── geniepath.txt ├── graphconsis.txt ├── graphsage.txt ├── hacud.txt ├── player2vec.txt └── semignn.txt ├── requirements.txt ├── setup.py └── utils ├── data_loader.py └── utils.py /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 
12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | *.iml 3 | *.xml 4 | /idea/ 5 | /data/ 6 | .idea/dictionariesgit 7 | .idea/$CACHE_FILE$ 8 | .idea/dictionaries 9 | __pycache__/ 10 | 11 | dataset/YelpChi.mat 12 | dataset/YelpChi_hand.zip 13 | dataset/YelpChi_hand.mat 14 | .Ds_Store 15 | */.Ds_Store 16 | */__pycache__ 17 | dataset/DBLP4057_GAT_with_idx_tra200_val_800.mat 18 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.6" 4 | - "3.7" 5 | # command to install dependencies 6 | install: 7 | - pip install -U importlib_metadata && pip install -r requirements.txt 8 | script: 9 | - python main.py 10 | -------------------------------------------------------------------------------- /DGFraud_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/DGFraud_logo.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 | <!-- README header: DGFraud logo image and badges (PRs Welcome, GitHub, GitHub release, PRs) -->
24 | A Deep Graph-based Toolbox for Fraud Detection
25 |

26 |
27 | **Introduction**
28 |
29 | **May 2021 Update:** DGFraud has been upgraded to TensorFlow 2.0! Please check out [DGFraud-TF2](https://github.com/safe-graph/DGFraud-TF2).
30 |
31 | **DGFraud** is a Graph Neural Network (GNN) based toolbox for fraud detection. It integrates the implementation and comparison of state-of-the-art GNN-based fraud detection models. An introduction to the implemented models can be found [here](#implemented-models).
32 |
33 | We welcome contributions on adding new fraud detectors and extending the features of the toolbox. Some of the planned features are listed in the [TODO list](#todo-list).
34 |
35 | If you use the toolbox in your project, please cite one of the two papers below and the [algorithms](#implemented-models) you used:
36 |
37 | CIKM'20 ([PDF](https://arxiv.org/pdf/2008.08692.pdf))
38 | ```bibtex
39 | @inproceedings{dou2020enhancing,
40 |   title={Enhancing Graph Neural Network-based Fraud Detectors against Camouflaged Fraudsters},
41 |   author={Dou, Yingtong and Liu, Zhiwei and Sun, Li and Deng, Yutong and Peng, Hao and Yu, Philip S},
42 |   booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM'20)},
43 |   year={2020}
44 | }
45 | ```
46 | SIGIR'20 ([PDF](https://arxiv.org/pdf/2005.00625.pdf))
47 | ```bibtex
48 | @inproceedings{liu2020alleviating,
49 |   title={Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection},
50 |   author={Liu, Zhiwei and Dou, Yingtong and Yu, Philip S. and Deng, Yutong and Peng, Hao},
51 |   booktitle={Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
52 |   year={2020}
53 | }
54 | ```
55 |
56 | **Useful Resources**
57 | - [PyGOD: A Python Library for Graph Outlier Detection (Anomaly Detection)](https://github.com/pygod-team/pygod)
58 | - [UGFraud: An Unsupervised Graph-based Toolbox for Fraud Detection](https://github.com/safe-graph/UGFraud)
59 | - [Graph-based Fraud Detection Paper List](https://github.com/safe-graph/graph-fraud-detection-papers)
60 | - [Awesome Fraud Detection Papers](https://github.com/benedekrozemberczki/awesome-fraud-detection-papers)
61 | - [Attack and Defense Papers on Graph Data](https://github.com/safe-graph/graph-adversarial-learning-literature)
62 | - [PyOD: A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)](https://github.com/yzhao062/pyod)
63 | - [PyODD: An End-to-end Outlier Detection System](https://github.com/datamllab/pyodds)
64 | - [DGL: Deep Graph Library](https://github.com/dmlc/dgl)
65 | - [Outlier Detection DataSets (ODDS)](http://odds.cs.stonybrook.edu/)
66 |
67 | **Table of Contents**
68 | - [Installation](#installation)
69 | - [Datasets](#datasets)
70 | - [User Guide](#user-guide)
71 | - [Implemented Models](#implemented-models)
72 | - [Model Comparison](#model-comparison)
73 | - [TODO List](#todo-list)
74 | - [How to Contribute](#how-to-contribute)
75 |
76 |
77 | ## Installation
78 | ```bash
79 | git clone https://github.com/safe-graph/DGFraud.git
80 | cd DGFraud
81 | python setup.py install
82 | ```
83 | ### Requirements
84 | ```bash
85 | * python 3.6, 3.7
86 | * tensorflow>=1.14.0,<2.0
87 | * numpy>=1.16.4
88 | * scipy>=1.2.0
89 | * networkx<=1.11
90 | ```
91 | ## Datasets
92 |
93 | ### DBLP
94 | We use the pre-processed DBLP dataset from [Jhy1993/HAN](https://github.com/Jhy1993/HAN).
95 | You can run FdGars, Player2Vec, GeniePath, and GEM on the DBLP dataset.
96 | Unzip the archive before using the dataset:
97 | ```bash
98 | cd dataset
99 | unzip DBLP4057_GAT_with_idx_tra200_val_800.zip
100 | ```
101 |
102 | ### Example dataset
103 | We provide example graphs for SemiGNN, GAS, and GEM in `data_loader.py`, because these models require unique graph structures or node types that cannot be found in open-source datasets.
104 |
105 |
106 | ### Yelp dataset
107 | For [GraphConsis](https://arxiv.org/abs/2005.00625), we preprocessed the [Yelp Spam Review Dataset](http://odds.cs.stonybrook.edu/yelpchi-dataset/) with reviews as nodes and three relations as edges.
108 |
109 | The dataset in `.mat` format is located at `/dataset/YelpChi.zip`. The `.mat` file includes:
110 | - `net_rur, net_rtr, net_rsr`: three sparse matrices representing the three homo-graphs defined in the [GraphConsis](https://arxiv.org/abs/2005.00625) paper;
111 | - `features`: a sparse matrix of 32-dimensional handcrafted features;
112 | - `label`: a numpy array with the ground-truth labels of nodes. `1` represents spam and `0` represents benign.
113 |
114 | The YelpChi data preprocessing details can be found in our [CIKM'20](https://arxiv.org/pdf/2008.08692.pdf) paper.
115 | To get the complete metadata of the Yelp dataset, please email [ytongdou@gmail.com](mailto:ytongdou@gmail.com).
116 |
117 |
118 | ## User Guide
119 |
120 | ### Running the example code
121 | You can find the implemented models in the `algorithms` directory. For example, you can run Player2Vec using:
122 | ```bash
123 | python Player2Vec_main.py
124 | ```
125 | You can specify parameters for the models when running the code.
126 |
127 | ### Running on your datasets
128 | Have a look at the `load_data_dblp()` function in `utils/utils.py` for an example.
129 |
130 | In order to use your own data, you have to provide:
131 | * adjacency matrices or adjacency lists (for GAS);
132 | * a feature matrix;
133 | * a label matrix;
134 | and then split the feature matrix and label matrix into training and testing sets. A minimal loader sketch illustrating this format is given at the end of this guide.
135 |
136 | You can specify a dataset as follows:
137 | ```bash
138 | python xx_main.py --dataset your_dataset
139 | ```
140 | or by editing `xx_main.py`.
141 |
142 | ### The structure of code
143 | The repository is organized as follows:
144 | - `algorithms/` contains the implemented models and the corresponding example code;
145 | - `base_models/` contains the basic models (GCN);
146 | - `dataset/` contains the necessary dataset files;
147 | - `utils/` contains:
148 |   * data loading and splitting (`data_loader.py`);
149 |   * various utilities (`utils.py`).
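To make the expected format concrete, here is a minimal loader sketch. The function name `load_data_custom` and the random toy graph are illustrative assumptions; only the shape of the returned tuple (adjacency list, feature matrix, train/test indices, and one-hot labels) mirrors what `load_data_dblp()` in `utils/utils.py` provides.

```python
# A hypothetical custom loader; only the returned format follows the toolbox.
import numpy as np

def load_data_custom(num_nodes=100, feat_dim=16, num_classes=2, train_ratio=0.8):
    # One dense adjacency matrix per relation / meta-path.
    adj = (np.random.rand(num_nodes, num_nodes) < 0.05).astype(np.float32)
    adj_list = [adj]
    # Node feature matrix and one-hot label matrix.
    features = np.random.rand(num_nodes, feat_dim).astype(np.float32)
    labels = np.eye(num_classes)[np.random.randint(0, num_classes, num_nodes)]
    # Split node indices into training and testing sets.
    idx = np.random.permutation(num_nodes)
    split = int(train_ratio * num_nodes)
    train_idx, test_idx = idx[:split], idx[split:]
    return adj_list, features, train_idx, labels[train_idx], test_idx, labels[test_idx]
```

In principle, an `xx_main.py` script can consume this tuple the same way it consumes the output of `load_data_dblp()`.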
150 | 151 | 152 | ## Implemented Models 153 | 154 | | Model | Paper | Venue | Reference | 155 | |-------|--------|--------|--------| 156 | | **SemiGNN** | [A Semi-supervised Graph Attentive Network for Financial Fraud Detection](https://arxiv.org/pdf/2003.01171) | ICDM 2019 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/semignn.txt) | 157 | | **Player2Vec** | [Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework](http://mason.gmu.edu/~lzhao9/materials/papers/lp0110-zhangA.pdf) | CIKM 2019 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/player2vec.txt)| 158 | | **GAS** | [Spam Review Detection with Graph Convolutional Networks](https://arxiv.org/abs/1908.10679) | CIKM 2019 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/gas.txt) | 159 | | **FdGars** | [FdGars: Fraudster Detection via Graph Convolutional Networks in Online App Review System](https://dl.acm.org/citation.cfm?id=3316586) | WWW 2019 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/fdgars.txt) | 160 | | **GeniePath** | [GeniePath: Graph Neural Networks with Adaptive Receptive Paths](https://arxiv.org/abs/1802.00910) | AAAI 2019 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/geniepath.txt) | 161 | | **GEM** | [Heterogeneous Graph Neural Networks for Malicious Account Detection](https://arxiv.org/pdf/2002.12307.pdf) | CIKM 2018 |[BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/gem.txt) | 162 | | **GraphSAGE** | [Inductive Representation Learning on Large Graphs](https://arxiv.org/pdf/1706.02216.pdf) | NIPS 2017 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/graphsage.txt) | 163 | | **GraphConsis** | [Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection](https://arxiv.org/pdf/2005.00625.pdf) | SIGIR 2020 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/graphconsis.txt) | 164 | | **HACUD** | [Cash-Out User Detection Based on Attributed Heterogeneous Information Network with a Hierarchical Attention Mechanism](https://aaai.org/ojs/index.php/AAAI/article/view/3884) | AAAI 2019 | [BibTex](https://github.com/safe-graph/DGFraud/blob/master/reference/hacud.txt) | 165 | 166 | 167 | ## Model Comparison 168 | | Model | Application | Graph Type | Base Model | 169 | |-------|--------|--------|--------| 170 | | **SemiGNN** | Financial Fraud | Heterogeneous | GAT, LINE, DeepWalk | 171 | | **Player2Vec** | Cyber Criminal | Heterogeneous | GAT, GCN| 172 | | **GAS** | Opinion Fraud | Heterogeneous | GCN, GAT | 173 | | **FdGars** | Opinion Fraud | Homogeneous | GCN | 174 | | **GeniePath** | Financial Fraud | Homogeneous | GAT | 175 | | **GEM** | Financial Fraud | Heterogeneous |GCN | 176 | | **GraphSAGE** | Opinion Fraud | Homogeneous | GraphSAGE | 177 | | **GraphConsis** | Opinion Fraud | Heterogeneous | GraphSAGE | 178 | | **HACUD** | Financial Fraud | Heterogeneous | GAT | 179 | 180 | 181 | ## TODO List 182 | - Implementing mini-batch training 183 | - The log loss for GEM model 184 | - Time-based sampling for GEM 185 | - Add sampling methods 186 | - Benchmarking SOTA models 187 | - Scalable implementation 188 | - Pytorch implementation 189 | 190 | ## How to Contribute 191 | You are welcomed to contribute to this open-source toolbox. The detailed instructions will be released soon. 
Currently, you can create issues or email to [bdscsafegraph@gmail.com](mailto:bdscsafegraph@gmail.com) for inquiry. 192 | 193 | -------------------------------------------------------------------------------- /algorithms/FdGars/FdGars.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | 6 | FdGars ('FdGars: Fraudster Detection via Graph Convolutional Networks in Online App Review System') 7 | 8 | Parameters: 9 | nodes: total nodes number 10 | gcn_output1: the first gcn layer unit number 11 | gcn_output2: the second gcn layer unit number 12 | embedding: node feature dim 13 | encoding: nodes representation dim (predict class dim) 14 | ''' 15 | import os 16 | import sys 17 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 18 | import tensorflow as tf 19 | from base_models.models import GCN 20 | from algorithms.base_algorithm import Algorithm 21 | from utils import utils 22 | 23 | 24 | class FdGars(Algorithm): 25 | 26 | def __init__(self, 27 | session, 28 | nodes, 29 | class_size, 30 | gcn_output1, 31 | gcn_output2, 32 | meta, 33 | embedding, 34 | encoding): 35 | self.nodes = nodes 36 | self.meta = meta 37 | self.class_size = class_size 38 | self.gcn_output1 = gcn_output1 39 | self.embedding = embedding 40 | self.encoding = encoding 41 | 42 | self.placeholders = {'a': tf.placeholder(tf.float32, [self.meta, self.nodes, self.nodes], 'adj'), 43 | 'x': tf.placeholder(tf.float32, [self.nodes, self.embedding], 'nxf'), 44 | 'batch_index': tf.placeholder(tf.int32, [None], 'index'), 45 | 't': tf.placeholder(tf.float32, [None, self.class_size], 'labels'), 46 | 'lr': tf.placeholder(tf.float32, [], 'learning_rate'), 47 | 'mom': tf.placeholder(tf.float32, [], 'momentum'), 48 | 'num_features_nonzero': tf.placeholder(tf.int32)} 49 | 50 | loss, probabilities = self.forward_propagation() 51 | self.loss, self.probabilities = loss, probabilities 52 | self.l2 = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(0.01), 53 | tf.trainable_variables()) 54 | 55 | self.pred = tf.one_hot(tf.argmax(self.probabilities, 1), class_size) 56 | print(self.pred.shape) 57 | self.correct_prediction = tf.equal(tf.argmax(self.probabilities, 1), tf.argmax(self.placeholders['t'], 1)) 58 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, "float")) 59 | print('Forward propagation finished.') 60 | 61 | self.sess = session 62 | self.optimizer = tf.train.AdamOptimizer(self.placeholders['lr']) 63 | gradients = self.optimizer.compute_gradients(self.loss + self.l2) 64 | capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None] 65 | self.train_op = self.optimizer.apply_gradients(capped_gradients) 66 | self.init = tf.global_variables_initializer() 67 | print('Backward propagation finished.') 68 | 69 | def forward_propagation(self): 70 | with tf.variable_scope('gcn'): 71 | gcn_emb = [] 72 | for i in range(self.meta): 73 | gcn_out = tf.reshape(GCN(self.placeholders, self.gcn_output1, self.embedding, 74 | self.encoding, index=i).embedding(), [1, self.nodes * self.encoding]) 75 | gcn_emb.append(gcn_out) 76 | gcn_emb = tf.concat(gcn_emb, 0) 77 | gcn_emb = tf.reshape(gcn_emb, [self.nodes, self.encoding]) 78 | print('GCN embedding over!') 79 | 80 | with tf.variable_scope('classification'): 81 | batch_data = 
tf.matmul(tf.one_hot(self.placeholders['batch_index'], self.nodes), gcn_emb) 82 | logits = tf.nn.softmax(batch_data) 83 | loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.placeholders['t'], logits=logits) 84 | 85 | return loss, tf.nn.sigmoid(logits) 86 | 87 | def train(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 88 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 89 | outs = self.sess.run( 90 | [self.train_op, self.loss, self.accuracy, self.pred, self.probabilities], 91 | feed_dict=feed_dict) 92 | loss = outs[1] 93 | acc = outs[2] 94 | pred = outs[3] 95 | prob = outs[4] 96 | return loss, acc, pred, prob 97 | 98 | def test(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 99 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 100 | acc, pred, probabilities, tags = self.sess.run( 101 | [self.accuracy, self.pred, self.probabilities, self.correct_prediction], 102 | feed_dict=feed_dict) 103 | return acc, pred, probabilities, tags 104 | -------------------------------------------------------------------------------- /algorithms/FdGars/FdGars_main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | 7 | import tensorflow as tf 8 | import argparse 9 | import os 10 | import sys 11 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 12 | from algorithms.FdGars.FdGars import FdGars 13 | import time 14 | from utils.data_loader import * 15 | from utils.utils import * 16 | 17 | 18 | # os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' 19 | 20 | # init the common args, expect the model specific args 21 | def arg_parser(): 22 | parser = argparse.ArgumentParser() 23 | parser.add_argument('--seed', type=int, default=123, help='Random seed.') 24 | parser.add_argument('--dataset_str', type=str, default='dblp', help="['dblp','example']") 25 | parser.add_argument('--epoch_num', type=int, default=30, help='Number of epochs to train.') 26 | parser.add_argument('--batch_size', type=int, default=1000) 27 | parser.add_argument('--momentum', type=int, default=0.9) 28 | parser.add_argument('--learning_rate', default=0.001, help='the ratio of training set in whole dataset.') 29 | 30 | # GCN args 31 | parser.add_argument('--hidden1', default=16, help='Number of units in GCN hidden layer 1.') 32 | parser.add_argument('--hidden2', default=16, help='Number of units in GCN hidden layer 2.') 33 | parser.add_argument('--gcn_output', default=4, help='gcn output size.') 34 | 35 | args = parser.parse_args() 36 | return args 37 | 38 | 39 | def set_env(args): 40 | tf.reset_default_graph() 41 | np.random.seed(args.seed) 42 | tf.set_random_seed(args.seed) 43 | 44 | 45 | # get batch data 46 | def get_data(ix, int_batch, train_size): 47 | if ix + int_batch >= train_size: 48 | ix = train_size - int_batch 49 | end = train_size 50 | else: 51 | end = ix + int_batch 52 | return train_data[ix:end], train_label[ix:end] 53 | 54 | 55 | def load_data(args): 56 | if args.dataset_str == 'dblp': 57 | adj_list, features, train_data, train_label, test_data, test_label = load_data_dblp() 58 | node_size = features.shape[0] 59 | node_embedding = features.shape[1] 60 | class_size = train_label.shape[1] 61 | train_size = len(train_data) 62 | paras = [node_size, node_embedding, 
class_size, train_size] 63 | 64 | return adj_list, features, train_data, train_label, test_data, test_label, paras 65 | 66 | 67 | def train(args, adj_list, features, train_data, train_label, test_data, test_label, paras): 68 | with tf.Session() as sess: 69 | adj_data = [normalize_adj(adj) for adj in adj_list] 70 | meta_size = len(adj_list) # meta=1 in FdGars 71 | net = FdGars(session=sess, class_size=paras[2], gcn_output1=args.hidden1, gcn_output2=args.hidden2, 72 | meta=meta_size, nodes=paras[0], embedding=paras[1], encoding=args.gcn_output) 73 | 74 | sess.run(tf.global_variables_initializer()) 75 | # net.load(sess) 76 | 77 | t_start = time.clock() 78 | for epoch in range(args.epoch_num): 79 | train_loss = 0 80 | train_acc = 0 81 | count = 0 82 | for index in range(0, paras[3], args.batch_size): 83 | batch_data, batch_label = get_data(index, args.batch_size, paras[3]) 84 | loss, acc, pred, prob = net.train(features, adj_data, batch_label, 85 | batch_data, args.learning_rate, 86 | args.momentum) 87 | 88 | print("batch loss: {:.4f}, batch acc: {:.4f}".format(loss, acc)) 89 | # print(prob, pred) 90 | 91 | train_loss += loss 92 | train_acc += acc 93 | count += 1 94 | train_loss = train_loss / count 95 | train_acc = train_acc / count 96 | print("epoch{:d} : train_loss: {:.4f}, train_acc: {:.4f}".format(epoch, train_loss, train_acc)) 97 | 98 | # net.save(sess) 99 | 100 | t_end = time.clock() 101 | print("train time=", "{:.5f}".format(t_end - t_start)) 102 | print("Train end!") 103 | 104 | test_acc, test_pred, test_probabilities, test_tags = net.test(features, adj_data, test_label, 105 | test_data) 106 | 107 | print("test acc:", test_acc) 108 | 109 | 110 | if __name__ == "__main__": 111 | args = arg_parser() 112 | set_env(args) 113 | adj_list, features, train_data, train_label, test_data, test_label, paras = load_data(args) 114 | train(args, adj_list, features, train_data, train_label, test_data, test_label, paras) 115 | -------------------------------------------------------------------------------- /algorithms/FdGars/README.md: -------------------------------------------------------------------------------- 1 | 2 | # FdGars 3 | 4 | ## Paper 5 | The FdGars model is proposed by the [paper](https://dl.acm.org/citation.cfm?id=3316586) below: 6 | ```bibtex 7 | @inproceedings{wang2019fdgars, 8 | title={Fdgars: Fraudster detection via graph convolutional networks in online app review system}, 9 | author={Wang, Jianyu and Wen, Rui and Wu, Chunming and Huang, Yu and Xion, Jian}, 10 | booktitle={Companion Proceedings of The 2019 World Wide Web Conference}, 11 | pages={310--316}, 12 | year={2019} 13 | } 14 | ``` 15 | 16 | 17 | ## Brief Introduction 18 | 19 | It applies [vanilla GCN](https://arxiv.org/abs/1609.02907) to spam detection problems with handcrafted features. 20 | 21 | ## Input Format 22 | 23 | FdGars has the same input as vanilla GCN. In our toolbox, we take one homo-graph of DBLP dataset as its input. 
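As a concrete illustration, the adjacency preprocessing that a vanilla-GCN input typically goes through can be sketched as below. This is a simplified dense NumPy version for illustration only; it is not the toolbox's own `normalize_adj` in `utils/utils.py`, and the self-loop handling here is an assumption.

```python
# Illustrative symmetric normalization D^{-1/2}(A + I)D^{-1/2}; the toolbox's
# normalize_adj in utils/utils.py may differ in details (e.g., self-loops).
import numpy as np

def normalize_adj_dense(adj):
    adj_hat = adj + np.eye(adj.shape[0])        # add self-loops (assumption)
    deg = adj_hat.sum(axis=1)                   # node degrees
    d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0      # guard against isolated nodes
    d_mat = np.diag(d_inv_sqrt)
    return d_mat @ adj_hat @ d_mat              # normalized adjacency
```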
24 | 25 | ## TODO List 26 | 27 | -------------------------------------------------------------------------------- /algorithms/GAS/GAS.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | 6 | GAS ('Spam Review Detection with Graph Convolutional Networks') 7 | Parameters: 8 | nodes: total nodes number 9 | class_size: class number 10 | embedding_i: item embedding size 11 | embedding_u: user embedding size 12 | embedding_r: review embedding size 13 | gcn_dim: the gcn layer unit number 14 | ''' 15 | import os 16 | import sys 17 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 18 | import tensorflow as tf 19 | from base_models.models import GCN 20 | from base_models.layers import AttentionLayer, ConcatenationAggregator, AttentionAggregator, GASConcatenation 21 | from algorithms.base_algorithm import Algorithm 22 | 23 | 24 | class GAS(Algorithm): 25 | def __init__(self, session, nodes, class_size, embedding_i, embedding_u, embedding_r, h_u_size, h_i_size, 26 | encoding1, encoding2, encoding3, encoding4, gcn_dim, meta=1, concat=True, **kwargs): 27 | super().__init__(**kwargs) 28 | self.meta = meta 29 | self.nodes = nodes 30 | self.class_size = class_size 31 | self.embedding_i = embedding_i 32 | self.embedding_u = embedding_u 33 | self.embedding_r = embedding_r 34 | self.encoding1 = encoding1 35 | self.encoding2 = encoding2 36 | self.encoding3 = encoding3 37 | self.encoding4 = encoding4 38 | self.gcn_dim = gcn_dim 39 | self.h_i_size = h_i_size 40 | self.h_u_size = h_u_size 41 | self.concat = concat 42 | self.build_placeholders() 43 | 44 | loss, probabilities = self.forward_propagation() 45 | self.loss, self.probabilities = loss, probabilities 46 | self.l2 = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(0.01), 47 | tf.trainable_variables()) 48 | 49 | self.pred = tf.one_hot(tf.argmax(self.probabilities, 1), class_size) 50 | print(self.pred.shape) 51 | self.correct_prediction = tf.equal(tf.argmax(self.probabilities, 1), tf.argmax(self.t, 1)) 52 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, "float")) 53 | print('Forward propagation finished.') 54 | 55 | self.sess = session 56 | self.optimizer = tf.train.AdamOptimizer(self.lr) 57 | gradients = self.optimizer.compute_gradients(self.loss + self.l2) 58 | capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None] 59 | self.train_op = self.optimizer.apply_gradients(capped_gradients) 60 | self.init = tf.global_variables_initializer() 61 | print('Backward propagation finished.') 62 | 63 | def build_placeholders(self): 64 | self.user_review_adj = tf.placeholder(tf.float32, [None, None], 'adjlist1') 65 | self.user_item_adj = tf.placeholder(tf.float32, [None, None], 'adjlist2') 66 | self.item_review_adj = tf.placeholder(tf.float32, [None, None], 'adjlist3') 67 | self.item_user_adj = tf.placeholder(tf.float32, [None, None], 'adjlist4') 68 | self.review_user_adj = tf.placeholder(tf.float32, [None], 'adjlist5') 69 | self.review_item_adj = tf.placeholder(tf.float32, [None], 'adjlist6') 70 | self.homo_adj = tf.placeholder(tf.float32, [self.nodes, self.nodes], 'comment_adj') 71 | self.review_vecs = tf.placeholder(tf.float32, [None, None], 'init_embedding1') 72 | self.user_vecs = tf.placeholder(tf.float32, [None, 
None], 'init_embedding2') 73 | self.item_vecs = tf.placeholder(tf.float32, [None, None], 'init_embedding3') 74 | self.batch_index = tf.placeholder(tf.int32, [None], 'index') 75 | self.t = tf.placeholder(tf.float32, [None, self.class_size], 'labels') 76 | self.lr = tf.placeholder(tf.float32, [], 'learning_rate') 77 | self.mom = tf.placeholder(tf.float32, [], 'momentum') 78 | 79 | def forward_propagation(self): 80 | with tf.variable_scope('hete_gcn'): 81 | r_aggregator = ConcatenationAggregator(input_dim=self.embedding_r + self.embedding_u + self.embedding_i, 82 | output_dim=self.encoding1, 83 | review_item_adj=self.review_item_adj, 84 | review_user_adj=self.review_user_adj, 85 | review_vecs=self.review_vecs, user_vecs=self.user_vecs, 86 | item_vecs=self.item_vecs) 87 | h_r = r_aggregator(inputs=None) 88 | 89 | iu_aggregator = AttentionAggregator(input_dim1=self.h_u_size, input_dim2=self.h_i_size, 90 | output_dim=self.encoding3, hid_dim=self.encoding2, user_review_adj=self.user_review_adj, 91 | user_item_adj=self.user_item_adj, 92 | item_review_adj=self.item_review_adj, item_user_adj=self.item_user_adj, 93 | review_vecs=self.review_vecs, user_vecs=self.user_vecs, 94 | item_vecs=self.item_vecs, concat=True) 95 | h_u, h_i = iu_aggregator(inputs=None) 96 | print('Nodes embedding over!') 97 | 98 | with tf.variable_scope('homo_gcn'): 99 | x = self.review_vecs 100 | # gcn_out = GCN(x, self.homo_adj, self.gcn_dim, self.embedding_r, 101 | # self.encoding4).embedding() 102 | print('Comment graph embedding over!') 103 | 104 | with tf.variable_scope('classification'): 105 | concatenator = GASConcatenation(review_user_adj=self.review_user_adj, review_item_adj=self.review_item_adj, 106 | review_vecs=h_r, homo_vecs=self.homo_adj, 107 | user_vecs=h_u, item_vecs=h_i) 108 | concated_hr = concatenator(inputs=None) 109 | 110 | batch_data = tf.matmul(tf.one_hot(self.batch_index, self.nodes), concated_hr) 111 | W = tf.get_variable(name='weights', 112 | shape=[self.encoding1 + 2 * self.encoding2 + 2 * self.nodes + self.nodes, 113 | self.class_size], 114 | initializer=tf.contrib.layers.xavier_initializer()) 115 | b = tf.get_variable(name='bias', shape=[1, self.class_size], initializer=tf.zeros_initializer()) 116 | tf.transpose(batch_data, perm=[0, 1]) 117 | logits = tf.matmul(batch_data, W) + b 118 | loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.t, logits=logits) 119 | 120 | return loss, tf.nn.sigmoid(logits) 121 | 122 | def train(self, h, adj_info, t, b, learning_rate=1e-2, momentum=0.9): 123 | feed_dict = { 124 | self.user_review_adj: adj_info[0], 125 | self.user_item_adj: adj_info[1], 126 | self.item_review_adj: adj_info[2], 127 | self.item_user_adj: adj_info[3], 128 | self.review_user_adj: adj_info[4], 129 | self.review_item_adj: adj_info[5], 130 | self.homo_adj: adj_info[6], 131 | self.review_vecs: h[0], 132 | self.user_vecs: h[1], 133 | self.item_vecs: h[2], 134 | self.t: t, 135 | self.batch_index: b, 136 | self.lr: learning_rate, 137 | self.mom: momentum 138 | } 139 | outs = self.sess.run( 140 | [self.train_op, self.loss, self.accuracy, self.pred, self.probabilities], 141 | feed_dict=feed_dict) 142 | loss = outs[1] 143 | acc = outs[2] 144 | pred = outs[3] 145 | prob = outs[4] 146 | return loss, acc, pred, prob 147 | 148 | def test(self, h, adj_info, t, b): 149 | feed_dict = { 150 | self.user_review_adj: adj_info[0], 151 | self.user_item_adj: adj_info[1], 152 | self.item_review_adj: adj_info[2], 153 | self.item_user_adj: adj_info[3], 154 | self.review_user_adj: adj_info[4], 155 | 
self.review_item_adj: adj_info[5], 156 | self.homo_adj: adj_info[6], 157 | self.review_vecs: h[0], 158 | self.user_vecs: h[1], 159 | self.item_vecs: h[2], 160 | self.t: t, 161 | self.batch_index: b 162 | } 163 | acc, pred, probabilities, tags = self.sess.run( 164 | [self.accuracy, self.pred, self.probabilities, self.correct_prediction], 165 | feed_dict=feed_dict) 166 | return acc, pred, probabilities, tags -------------------------------------------------------------------------------- /algorithms/GAS/GAS_main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import tensorflow as tf 7 | import argparse 8 | import os 9 | import sys 10 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 11 | from algorithms.GAS.GAS import GAS 12 | import time 13 | from utils.data_loader import * 14 | from utils.utils import * 15 | 16 | 17 | # os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' 18 | 19 | # init the common args, expect the model specific args 20 | def arg_parser(): 21 | parser = argparse.ArgumentParser() 22 | parser.add_argument('--seed', type=int, default=123, help='Random seed.') 23 | parser.add_argument('--dataset_str', type=str, default='example', help="['dblp','example']") 24 | parser.add_argument('--epoch_num', type=int, default=30, help='Number of epochs to train.') 25 | parser.add_argument('--batch_size', type=int, default=1000) 26 | parser.add_argument('--momentum', type=int, default=0.9) 27 | parser.add_argument('--learning_rate', default=0.001, help='the ratio of training set in whole dataset.') 28 | 29 | # GAS 30 | parser.add_argument('--review_num sample', default=7, help='review number.') 31 | parser.add_argument('--gcn_dim', type=int, default=5, help='gcn layer size.') 32 | parser.add_argument('--encoding1', type=int, default=64) 33 | parser.add_argument('--encoding2', type=int, default=64) 34 | parser.add_argument('--encoding3', type=int, default=64) 35 | parser.add_argument('--encoding4', type=int, default=64) 36 | 37 | args = parser.parse_args() 38 | return args 39 | 40 | 41 | def set_env(args): 42 | tf.reset_default_graph() 43 | np.random.seed(args.seed) 44 | tf.set_random_seed(args.seed) 45 | 46 | 47 | # get batch data 48 | def get_data(ix, int_batch, train_size): 49 | if ix + int_batch >= train_size: 50 | ix = train_size - int_batch 51 | end = train_size 52 | else: 53 | end = ix + int_batch 54 | return train_data[ix:end], train_label[ix:end] 55 | 56 | 57 | def load_data(args): 58 | if args.dataset_str == 'example': 59 | adj_list, features, train_data, train_label, test_data, test_label = load_data_gas() 60 | node_embedding_r = features[0].shape[1] 61 | node_embedding_u = features[1].shape[1] 62 | node_embedding_i = features[2].shape[1] 63 | node_size = features[0].shape[0] 64 | 65 | # node_embedding_i = node_embedding_r = node_size 66 | h_u_size = adj_list[0].shape[1] * (node_embedding_r + node_embedding_u) 67 | h_i_size = adj_list[2].shape[1] * (node_embedding_r + node_embedding_i) 68 | 69 | class_size = train_label.shape[1] 70 | train_size = len(train_data) 71 | 72 | paras = [node_size, node_embedding_r, node_embedding_u, node_embedding_i, class_size, train_size, h_u_size, 73 | h_i_size] 74 | 75 | return adj_list, features, train_data, train_label, test_data, test_label, paras 76 | 77 | 78 | def train(args, 
adj_list, features, train_data, train_label, test_data, test_label, paras): 79 | with tf.Session() as sess: 80 | adj_data = adj_list 81 | net = GAS(session=sess, nodes=paras[0], class_size=paras[4], embedding_r=paras[1], embedding_u=paras[2], 82 | embedding_i=paras[3], h_u_size=paras[6], h_i_size=paras[7], 83 | encoding1=args.encoding1, encoding2=args.encoding2, encoding3=args.encoding3, 84 | encoding4=args.encoding4, gcn_dim=args.gcn_dim) 85 | 86 | sess.run(tf.global_variables_initializer()) 87 | # net.load(sess) 88 | 89 | t_start = time.clock() 90 | for epoch in range(args.epoch_num): 91 | train_loss = 0 92 | train_acc = 0 93 | count = 0 94 | for index in range(0, paras[3], args.batch_size): 95 | batch_data, batch_label = get_data(index, args.batch_size, paras[3]) 96 | loss, acc, pred, prob = net.train(features, adj_data, batch_label, 97 | batch_data, args.learning_rate, 98 | args.momentum) 99 | 100 | print("batch loss: {:.4f}, batch acc: {:.4f}".format(loss, acc)) 101 | # print(prob, pred) 102 | 103 | train_loss += loss 104 | train_acc += acc 105 | count += 1 106 | train_loss = train_loss / count 107 | train_acc = train_acc / count 108 | print("epoch{:d} : train_loss: {:.4f}, train_acc: {:.4f}".format(epoch, train_loss, train_acc)) 109 | # net.save(sess) 110 | 111 | t_end = time.clock() 112 | print("train time=", "{:.5f}".format(t_end - t_start)) 113 | print("Train end!") 114 | 115 | test_acc, test_pred, test_probabilities, test_tags = net.test(features, adj_data, test_label, 116 | test_data) 117 | 118 | print("test acc:", test_acc) 119 | 120 | 121 | if __name__ == "__main__": 122 | args = arg_parser() 123 | set_env(args) 124 | adj_list, features, train_data, train_label, test_data, test_label, paras = load_data(args) 125 | train(args, adj_list, features, train_data, train_label, test_data, test_label, paras) 126 | -------------------------------------------------------------------------------- /algorithms/GAS/README.md: -------------------------------------------------------------------------------- 1 | 2 | # GAS 3 | 4 | ## Paper 5 | The GAS model is proposed by the [paper](https://arxiv.org/abs/1908.10679) below: 6 | ```bibtex 7 | @inproceedings{li2019spam, 8 | title={Spam Review Detection with Graph Convolutional Networks}, 9 | author={Li, Ao and Qin, Zhou and Liu, Runshi and Yang, Yiqun and Li, Dong}, 10 | booktitle={Proceedings of the 28th ACM International Conference on Information and Knowledge Management}, 11 | pages={2703--2711}, 12 | year={2019} 13 | } 14 | ``` 15 | 16 | 17 | ## Brief Introduction 18 | 19 | GAS directly aggregates neighbors with different node types. 20 | 21 | ## Input Format 22 | 23 | The input is a heterogeneous graph. We use a small example graph in our toolbox. You can find the example graph structure in **load_example_gas** function in `\utils\dataloader.py`. If you want to use your own graph as the input, just follow the same format like the example graph. 
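To make that format more concrete, the sketch below builds toy versions of the three feature matrices and the seven adjacency structures that `GAS.train()` consumes (see `build_placeholders` in `GAS.py`). All sizes and the zero-filled placeholders are illustrative assumptions; `load_data_gas()` in `utils/data_loader.py` remains the authoritative reference for the padding conventions.

```python
# Toy GAS inputs; shapes follow the placeholders in GAS.py, values are dummies.
import numpy as np

num_reviews, num_users, num_items = 7, 3, 2
features = [
    np.random.rand(num_reviews, 8).astype(np.float32),   # review_vecs
    np.random.rand(num_users, 4).astype(np.float32),     # user_vecs
    np.random.rand(num_items, 4).astype(np.float32),     # item_vecs
]
adj_info = [
    np.zeros((num_users, 4), dtype=np.float32),    # user_review_adj (padded review ids per user)
    np.zeros((num_users, 4), dtype=np.float32),    # user_item_adj (padded item ids per user)
    np.zeros((num_items, 5), dtype=np.float32),    # item_review_adj (padded review ids per item)
    np.zeros((num_items, 5), dtype=np.float32),    # item_user_adj (padded user ids per item)
    np.zeros(num_reviews, dtype=np.float32),       # review_user_adj (user id of each review)
    np.zeros(num_reviews, dtype=np.float32),       # review_item_adj (item id of each review)
    np.eye(num_reviews, dtype=np.float32),         # homo_adj (review-review comment graph)
]
```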
24 | 25 | ## TODO List 26 | 27 | -------------------------------------------------------------------------------- /algorithms/GEM/GEM.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | 6 | GEM ('Heterogeneous Graph Neural Networks for Malicious Account Detection') 7 | 8 | Parameters: 9 | nodes: total nodes number 10 | meta: device number 11 | hop: the number of hops a vertex needs to look at, or the number of hidden layers 12 | embedding: node feature dim 13 | encoding: nodes representation dim 14 | ''' 15 | import os 16 | import sys 17 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 18 | import tensorflow as tf 19 | from base_models.models import GEMLayer 20 | from algorithms.base_algorithm import Algorithm 21 | from utils import utils 22 | 23 | 24 | class GEM(Algorithm): 25 | 26 | def __init__(self, 27 | session, 28 | nodes, 29 | class_size, 30 | meta, 31 | embedding, 32 | encoding, 33 | hop): 34 | self.nodes = nodes 35 | self.meta = meta 36 | self.class_size = class_size 37 | self.embedding = embedding 38 | self.encoding = encoding 39 | self.hop = hop 40 | 41 | self.placeholders = {'a': tf.placeholder(tf.float32, [self.meta, self.nodes, self.nodes], 'adj'), 42 | 'x': tf.placeholder(tf.float32, [self.nodes, self.embedding], 'nxf'), 43 | 'batch_index': tf.placeholder(tf.int32, [None], 'index'), 44 | 't': tf.placeholder(tf.float32, [None, self.class_size], 'labels'), 45 | 'lr': tf.placeholder(tf.float32, [], 'learning_rate'), 46 | 'mom': tf.placeholder(tf.float32, [], 'momentum'), 47 | 'num_features_nonzero': tf.placeholder(tf.int32)} 48 | 49 | loss, probabilities = self.forward_propagation() 50 | self.loss, self.probabilities = loss, probabilities 51 | self.l2 = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(0.01), 52 | tf.trainable_variables()) 53 | 54 | x = tf.ones_like(self.probabilities) 55 | y = tf.zeros_like(self.probabilities) 56 | self.pred = tf.where(self.probabilities > 0.5, x=x, y=y) 57 | 58 | print(self.pred.shape) 59 | self.correct_prediction = tf.equal(self.pred, self.placeholders['t']) 60 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, "float")) 61 | print('Forward propagation finished.') 62 | 63 | self.sess = session 64 | self.optimizer = tf.train.AdamOptimizer(self.placeholders['lr']) 65 | gradients = self.optimizer.compute_gradients(self.loss + self.l2) 66 | capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None] 67 | self.train_op = self.optimizer.apply_gradients(capped_gradients) 68 | self.init = tf.global_variables_initializer() 69 | print('Backward propagation finished.') 70 | 71 | def forward_propagation(self): 72 | with tf.variable_scope('gem_embedding'): 73 | h = tf.get_variable(name='init_embedding', shape=[self.nodes, self.encoding], 74 | initializer=tf.contrib.layers.xavier_initializer()) 75 | for i in range(0, self.hop): 76 | f = GEMLayer(self.placeholders, self.nodes, self.meta, self.embedding, self.encoding) 77 | gem_out = f(inputs=h) 78 | h = tf.reshape(gem_out, [self.nodes, self.encoding]) 79 | print('GEM embedding over!') 80 | 81 | with tf.variable_scope('classification'): 82 | batch_data = tf.matmul(tf.one_hot(self.placeholders['batch_index'], self.nodes), h) 83 | W = tf.get_variable(name='weights', 
84 | shape=[self.encoding, self.class_size], 85 | initializer=tf.contrib.layers.xavier_initializer()) 86 | b = tf.get_variable(name='bias', shape=[1, self.class_size], initializer=tf.zeros_initializer()) 87 | tf.transpose(batch_data, perm=[0, 1]) 88 | logits = tf.matmul(batch_data, W) + b 89 | 90 | u = tf.get_variable(name='u', 91 | shape=[1, self.encoding], 92 | initializer=tf.contrib.layers.xavier_initializer()) 93 | 94 | loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.placeholders['t'], logits=logits) 95 | 96 | # TODO 97 | # loss = -tf.reduce_sum( 98 | # tf.log_sigmoid(self.placeholders['t'] * tf.matmul(u, tf.transpose(batch_data, perm=[1, 0])))) 99 | 100 | # return loss, logits 101 | return loss, tf.nn.sigmoid(logits) 102 | 103 | def train(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 104 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 105 | outs = self.sess.run( 106 | [self.train_op, self.loss, self.accuracy, self.pred, self.probabilities], 107 | feed_dict=feed_dict) 108 | loss = outs[1] 109 | acc = outs[2] 110 | pred = outs[3] 111 | prob = outs[4] 112 | return loss, acc, pred, prob 113 | 114 | def test(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 115 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 116 | acc, pred, probabilities, tags = self.sess.run( 117 | [self.accuracy, self.pred, self.probabilities, self.correct_prediction], 118 | feed_dict=feed_dict) 119 | return acc, pred, probabilities, tags 120 | -------------------------------------------------------------------------------- /algorithms/GEM/GEM_main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import tensorflow as tf 7 | import argparse 8 | import os 9 | import sys 10 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 11 | from algorithms.GEM.GEM import GEM 12 | import time 13 | from utils.data_loader import * 14 | from utils.utils import * 15 | 16 | 17 | # os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' 18 | 19 | # init the common args, expect the model specific args 20 | def arg_parser(): 21 | parser = argparse.ArgumentParser() 22 | parser.add_argument('--seed', type=int, default=123, help='Random seed.') 23 | parser.add_argument('--dataset_str', type=str, default='dblp', help="['dblp','example']") 24 | parser.add_argument('--epoch_num', type=int, default=30, help='Number of epochs to train.') 25 | parser.add_argument('--batch_size', type=int, default=1000) 26 | parser.add_argument('--momentum', type=int, default=0.9) 27 | parser.add_argument('--learning_rate', default=0.001, help='the ratio of training set in whole dataset.') 28 | 29 | # GEM 30 | parser.add_argument('--hop', default=1, help='hop number') 31 | parser.add_argument('--k', default=16, help='gem layer unit') 32 | 33 | args = parser.parse_args() 34 | return args 35 | 36 | 37 | def set_env(args): 38 | tf.reset_default_graph() 39 | np.random.seed(args.seed) 40 | tf.set_random_seed(args.seed) 41 | 42 | 43 | # get batch data 44 | def get_data(ix, int_batch, train_size): 45 | if ix + int_batch >= train_size: 46 | ix = train_size - int_batch 47 | end = train_size 48 | else: 49 | end = ix + int_batch 50 | return train_data[ix:end], train_label[ix:end] 51 | 52 | 53 | def 
load_data(args): 54 | if args.dataset_str == 'dblp': 55 | adj_list, features, train_data, train_label, test_data, test_label = load_data_dblp() 56 | if args.dataset_str == 'example': 57 | adj_list, features, train_data, train_label, test_data, test_label = load_example_gem() 58 | node_size = features.shape[0] 59 | node_embedding = features.shape[1] 60 | class_size = train_label.shape[1] 61 | train_size = len(train_data) 62 | paras = [node_size, node_embedding, class_size, train_size] 63 | 64 | return adj_list, features, train_data, train_label, test_data, test_label, paras 65 | 66 | 67 | def train(args, adj_list, features, train_data, train_label, test_data, test_label, paras): 68 | with tf.Session() as sess: 69 | 70 | adj_data = adj_list 71 | meta_size = len(adj_list) # device num 72 | net = GEM(session=sess, class_size=paras[2], encoding=args.k, 73 | meta=meta_size, nodes=paras[0], embedding=paras[1], hop=args.hop) 74 | 75 | sess.run(tf.global_variables_initializer()) 76 | # net.load(sess) 77 | 78 | t_start = time.clock() 79 | for epoch in range(args.epoch_num): 80 | train_loss = 0 81 | train_acc = 0 82 | count = 0 83 | for index in range(0, paras[3], args.batch_size): 84 | batch_data, batch_label = get_data(index, args.batch_size, paras[3]) 85 | loss, acc, pred, prob = net.train(features, adj_data, batch_label, 86 | batch_data, args.learning_rate, 87 | args.momentum) 88 | 89 | print("batch loss: {:.4f}, batch acc: {:.4f}".format(loss, acc)) 90 | # print(prob, pred) 91 | 92 | train_loss += loss 93 | train_acc += acc 94 | count += 1 95 | train_loss = train_loss / count 96 | train_acc = train_acc / count 97 | print("epoch{:d} : train_loss: {:.4f}, train_acc: {:.4f}".format(epoch, train_loss, train_acc)) 98 | # net.save(sess) 99 | 100 | t_end = time.clock() 101 | print("train time=", "{:.5f}".format(t_end - t_start)) 102 | print("Train end!") 103 | 104 | test_acc, test_pred, test_probabilities, test_tags = net.test(features, adj_data, test_label, 105 | test_data) 106 | 107 | print("test acc:", test_acc) 108 | 109 | 110 | if __name__ == "__main__": 111 | args = arg_parser() 112 | set_env(args) 113 | adj_list, features, train_data, train_label, test_data, test_label, paras = load_data(args) 114 | train(args, adj_list, features, train_data, train_label, test_data, test_label, paras) 115 | -------------------------------------------------------------------------------- /algorithms/GEM/README.md: -------------------------------------------------------------------------------- 1 | 2 | # GEM 3 | 4 | ## Paper 5 | The GEM model is proposed by the [paper](https://arxiv.org/pdf/2002.12307.pdf) below: 6 | ```bibtex 7 | @inproceedings{liu2018heterogeneous, 8 | title={Heterogeneous graph neural networks for malicious account detection}, 9 | author={Liu, Ziqi and Chen, Chaochao and Yang, Xinxing and Zhou, Jun and Li, Xiaolong and Song, Le}, 10 | booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management}, 11 | pages={2077--2085}, 12 | year={2018} 13 | } 14 | ``` 15 | 16 | 17 | ## Brief Introduction 18 | 19 | A heterogeneous graph neural network approach for detecting malicious accounts. 20 | 21 | ## Input Format 22 | 23 | This model uses a device graph as input. We use a small example graph in our toolbox. You can find the example graph structure in **load_example_gem** function in `\utils\dataloader.py`. If you want to use your own graph as the input, just follow the same format like the example graph. 
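As a concrete illustration of that device-graph input, the snippet below builds a toy version of what GEM's placeholders expect (see `GEM.py`): one adjacency matrix per device type over the same set of account nodes, plus a node feature matrix and one-hot labels. The sizes and random graphs are illustrative assumptions, not the example graph shipped with the toolbox.

```python
# Toy GEM input: a list of per-device adjacency matrices plus node features.
import numpy as np

num_nodes, feat_dim, num_devices = 10, 8, 3
adj_list = [(np.random.rand(num_nodes, num_nodes) < 0.2).astype(np.float32)
            for _ in range(num_devices)]                  # one graph per device type
features = np.random.rand(num_nodes, feat_dim).astype(np.float32)
labels = np.eye(2)[np.random.randint(0, 2, num_nodes)]    # one-hot account labels
```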
24 | 25 | ## TODO List 26 | 27 | - The log loss fuction (Eq. (7) in the paper) is not implemented. Currently we use cross-entropy loss to replace it. 28 | 29 | -------------------------------------------------------------------------------- /algorithms/GeniePath/GeniePath.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | 6 | GeniePath ('GeniePath: Graph Neural Networks with Adaptive Receptive Paths') 7 | 8 | Parameters: 9 | nodes: total nodes number 10 | in_dim: input feature dim 11 | out_dim: output representation dim 12 | dim: breadth forward layer unit 13 | lstm_hidden: depth forward layer unit 14 | layer_num: GeniePath layer num 15 | ''' 16 | import os 17 | import sys 18 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 19 | import tensorflow as tf 20 | from base_models.layers import GeniePathLayer 21 | from algorithms.base_algorithm import Algorithm 22 | from utils import utils 23 | 24 | 25 | class GeniePath(Algorithm): 26 | 27 | def __init__(self, 28 | session, 29 | nodes, 30 | in_dim, 31 | out_dim, 32 | dim, 33 | lstm_hidden, 34 | heads, 35 | layer_num, 36 | class_size): 37 | self.nodes = nodes 38 | self.in_dim = in_dim 39 | self.out_dim = out_dim 40 | self.dim = dim 41 | self.lstm_hidden = lstm_hidden 42 | self.heads = heads 43 | self.layer_num = layer_num 44 | self.class_size = class_size 45 | 46 | self.placeholders = {'a': tf.placeholder(tf.float32, [1, self.nodes, self.nodes], 'adj'), 47 | 'x': tf.placeholder(tf.float32, [self.nodes, self.in_dim], 'nxf'), 48 | 'batch_index': tf.placeholder(tf.int32, [None], 'index'), 49 | 't': tf.placeholder(tf.float32, [None, self.out_dim], 'labels'), 50 | 'lr': tf.placeholder(tf.float32, [], 'learning_rate'), 51 | 'mom': tf.placeholder(tf.float32, [], 'momentum'), 52 | 'num_features_nonzero': tf.placeholder(tf.int32)} 53 | 54 | loss, probabilities = self.forward_propagation() 55 | self.loss, self.probabilities = loss, probabilities 56 | self.l2 = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(0.01), 57 | tf.trainable_variables()) 58 | 59 | self.pred = tf.one_hot(tf.argmax(self.probabilities, 1), self.out_dim) 60 | print(self.pred.shape) 61 | self.correct_prediction = tf.equal(tf.argmax(self.probabilities, 1), tf.argmax(self.placeholders['t'], 1)) 62 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, "float")) 63 | print('Forward propagation finished.') 64 | 65 | self.sess = session 66 | self.optimizer = tf.train.AdamOptimizer(self.placeholders['lr']) 67 | gradients = self.optimizer.compute_gradients(self.loss + self.l2) 68 | capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None] 69 | self.train_op = self.optimizer.apply_gradients(capped_gradients) 70 | self.init = tf.global_variables_initializer() 71 | print('Backward propagation finished.') 72 | 73 | def forward_propagation(self): 74 | with tf.variable_scope('genie_path_forward'): 75 | x = self.placeholders['x'] 76 | x = x[None, :] 77 | x = tf.contrib.layers.fully_connected(x, self.dim, activation_fn=lambda x: x) 78 | 79 | gplayers = [GeniePathLayer(self.placeholders, self.nodes, self.in_dim, self.dim) 80 | for i in range(self.layer_num)] 81 | for i, l in enumerate(gplayers): 82 | x, (h, c) = gplayers[i].forward(x, self.placeholders['a'], 
self.lstm_hidden, self.lstm_hidden) 83 | x = x[None, :] 84 | self.check = x 85 | x = tf.contrib.layers.fully_connected(x, self.out_dim, activation_fn=lambda x: x) 86 | x = tf.squeeze(x, 0) 87 | print('geniePath embedding over!') 88 | 89 | with tf.variable_scope('classification'): 90 | batch_data = tf.matmul(tf.one_hot(self.placeholders['batch_index'], self.nodes), x) 91 | # W = tf.get_variable(name='weights', 92 | # shape=[self.out_dim, self.class_size], 93 | # initializer=tf.contrib.layers.xavier_initializer()) 94 | # b = tf.get_variable(name='bias', shape=[1, self.class_size], initializer=tf.zeros_initializer()) 95 | # logits = tf.matmul(batch_data, W) + b 96 | logits = batch_data 97 | loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.placeholders['t'], logits=logits) 98 | 99 | return loss, tf.nn.softmax(logits) 100 | 101 | def train(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 102 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 103 | outs = self.sess.run( 104 | [self.train_op, self.loss, self.accuracy, self.pred, self.probabilities], 105 | feed_dict=feed_dict) 106 | loss = outs[1] 107 | acc = outs[2] 108 | pred = outs[3] 109 | prob = outs[4] 110 | return loss, acc, pred, prob 111 | 112 | def test(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 113 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 114 | acc, pred, probabilities, tags = self.sess.run( 115 | [self.accuracy, self.pred, self.probabilities, self.correct_prediction], 116 | feed_dict=feed_dict) 117 | return acc, pred, probabilities, tags 118 | -------------------------------------------------------------------------------- /algorithms/GeniePath/GeniePath_main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import tensorflow as tf 7 | import argparse 8 | import os 9 | import sys 10 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 11 | from algorithms.GeniePath.GeniePath import GeniePath 12 | import time 13 | from utils.data_loader import * 14 | from utils.utils import * 15 | 16 | 17 | # os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' 18 | 19 | # init the common args, expect the model specific args 20 | def arg_parser(): 21 | parser = argparse.ArgumentParser() 22 | parser.add_argument('--seed', type=int, default=123, help='Random seed.') 23 | parser.add_argument('--dataset_str', type=str, default='dblp', help="['dblp','example']") 24 | parser.add_argument('--epoch_num', type=int, default=30, help='Number of epochs to train.') 25 | parser.add_argument('--batch_size', type=int, default=1000) 26 | parser.add_argument('--momentum', type=int, default=0.9) 27 | parser.add_argument('--learning_rate', default=0.001, help='the ratio of training set in whole dataset.') 28 | 29 | # GeniePath 30 | parser.add_argument('--dim', default=128) 31 | parser.add_argument('--lstm_hidden', default=128, help='lstm_hidden unit') 32 | parser.add_argument('--heads', default=1, help='gat heads') 33 | parser.add_argument('--layer_num', default=4, help='geniePath layer num') 34 | 35 | args = parser.parse_args() 36 | return args 37 | 38 | 39 | def set_env(args): 40 | tf.reset_default_graph() 41 | np.random.seed(args.seed) 42 | tf.set_random_seed(args.seed) 43 | 44 | 45 | # 
get batch data 46 | def get_data(ix, int_batch, train_size): 47 | if ix + int_batch >= train_size: 48 | ix = train_size - int_batch 49 | end = train_size 50 | else: 51 | end = ix + int_batch 52 | return train_data[ix:end], train_label[ix:end] 53 | 54 | 55 | def load_data(args): 56 | if args.dataset_str == 'dblp': 57 | adj_list, features, train_data, train_label, test_data, test_label = load_data_dblp() 58 | node_size = features.shape[0] 59 | node_embedding = features.shape[1] 60 | class_size = train_label.shape[1] 61 | train_size = len(train_data) 62 | paras = [node_size, node_embedding, class_size, train_size] 63 | 64 | return adj_list, features, train_data, train_label, test_data, test_label, paras 65 | 66 | 67 | def train(args, adj_list, features, train_data, train_label, test_data, test_label, paras): 68 | with tf.Session() as sess: 69 | adj_data = adj_list 70 | net = GeniePath(session=sess, out_dim=paras[2], dim=args.dim, lstm_hidden=args.lstm_hidden, 71 | nodes=paras[0], in_dim=paras[1], heads=args.heads, layer_num=args.layer_num, 72 | class_size=paras[2]) 73 | 74 | sess.run(tf.global_variables_initializer()) 75 | # net.load(sess) 76 | 77 | t_start = time.clock() 78 | for epoch in range(args.epoch_num): 79 | train_loss = 0 80 | train_acc = 0 81 | count = 0 82 | for index in range(0, paras[3], args.batch_size): 83 | batch_data, batch_label = get_data(index, args.batch_size, paras[3]) 84 | loss, acc, pred, prob = net.train(features, adj_data, batch_label, 85 | batch_data, args.learning_rate, 86 | args.momentum) 87 | 88 | print("batch loss: {:.4f}, batch acc: {:.4f}".format(loss, acc)) 89 | # print(prob, pred) 90 | 91 | train_loss += loss 92 | train_acc += acc 93 | count += 1 94 | train_loss = train_loss / count 95 | train_acc = train_acc / count 96 | print("epoch{:d} : train_loss: {:.4f}, train_acc: {:.4f}".format(epoch, train_loss, train_acc)) 97 | # net.save(sess) 98 | 99 | t_end = time.clock() 100 | print("train time=", "{:.5f}".format(t_end - t_start)) 101 | print("Train end!") 102 | 103 | test_acc, test_pred, test_probabilities, test_tags = net.test(features, adj_data, test_label, 104 | test_data) 105 | 106 | print("test acc:", test_acc) 107 | 108 | 109 | if __name__ == "__main__": 110 | args = arg_parser() 111 | set_env(args) 112 | adj_list, features, train_data, train_label, test_data, test_label, paras = load_data(args) 113 | train(args, adj_list, features, train_data, train_label, test_data, test_label, paras) 114 | -------------------------------------------------------------------------------- /algorithms/GeniePath/README.md: -------------------------------------------------------------------------------- 1 | 2 | # GeniePath 3 | 4 | ## Paper 5 | The GeniePath model is proposed by the [paper](https://arxiv.org/abs/1802.00910) below: 6 | ```bibtex 7 | @inproceedings{liu2019geniepath, 8 | title={Geniepath: Graph neural networks with adaptive receptive paths}, 9 | author={Liu, Ziqi and Chen, Chaochao and Li, Longfei and Zhou, Jun and Li, Xiaolong and Song, Le and Qi, Yuan}, 10 | booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, 11 | volume={33}, 12 | pages={4424--4431}, 13 | year={2019} 14 | } 15 | ``` 16 | 17 | 18 | ## Brief Introduction 19 | 20 | GeniePath employs LSTM to learn the layers of GCN and attention mechanism to learn the neighbor weights. 21 | 22 | ## Input Format 23 | 24 | The input graph is homogeneous. In our toolbox, it takes a homo-graph from DBLP dataset as the input. 
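To give a feel for the adaptive breadth step, here is a simplified NumPy sketch of a single-head, GAT-style attention over neighbors followed by a tanh, in the spirit of the paper's breadth function. It is an illustration only, not the `GeniePathLayer` in `base_models/layers.py`, and the exact attention parameterization is an assumption.

```python
# Simplified breadth step: attention-weighted neighbor aggregation + tanh.
import numpy as np

def breadth_step(x, adj, w_src, w_dst, w_val):
    # x: [N, D] node features; adj: [N, N] adjacency with self-loops;
    # w_src, w_dst: [D, 1] attention weights; w_val: [D, D_out] value projection.
    scores = x @ w_src + (x @ w_dst).T                    # pairwise attention logits
    scores = np.where(adj > 0, scores, -1e9)              # mask non-neighbors
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)      # softmax over neighbors
    return np.tanh(alpha @ (x @ w_val))                   # aggregated, squashed output
```

With `x` of shape `[N, 16]` and `w_val` of shape `[16, 8]`, this returns an `[N, 8]` embedding that the depth (LSTM) step would then gate.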
25 | 26 | ## TODO List 27 | 28 | - The performance of GeniePath on DBLP needs to be tuned. 29 | - The implementation of GeniePath-lazy 30 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/README.md: -------------------------------------------------------------------------------- 1 | 2 | # GraphConsis 3 | 4 | ## Paper 5 | The GraphConsis model is proposed by the [paper](https://arxiv.org/abs/2005.00625) below: 6 | ```bibtex 7 | @inproceedings{liu2020alleviating, 8 | title={Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection}, 9 | author={Liu, Zhiwei and Dou, Yingtong and Yu, Philip S. and Deng, Yutong and Peng, Hao}, 10 | booktitle={Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval}, 11 | year={2020} 12 | } 13 | ``` 14 | 15 | 16 | ## Brief Introduction 17 | 18 | It is revised based on the [GraphSage](https://github.com/williamleif/GraphSAGE/tree/master/graphsage) model. We support multiple relations and distance sampling as mentioned in [our paper](https://arxiv.org/pdf/2005.00625.pdf). 19 | 20 | 21 | ## Run the code 22 | Go to `algorithms/GraphConsis/`, and run the following command in the terminal: 23 | 24 | `python -m supervised_train --train_prefix ../../dataset/ --file_name YelpChi.mat --model graphsage_mean --sigmoid True --epochs 3 --samples_1 10 -samples_2 5 --context_dim 128 --train_perc 1. --gpu 1` 25 | 26 | or run `supervised_train.py` in your IDE. 27 | 28 | 29 | ## Meaning of the arguments 30 | ``` 31 | --samples_1 -samples_2: the number of samples used at different layers 32 | --context_dim: the dimension of the context embeddings 33 | --train_perc: the percentage of training data used to train the model; 1. means using 80% for training and 20% for testing, 0.5 uses 40% for training and the same 20% for testing 34 | ``` 35 | For more information about the arguments, please refer to `supervised_train.py`. 36 | 37 | ## Note 38 | - The major differences between GraphSage and GraphConsis are in `neigh_samplers.py` and `supervised_models.py`. For the neighbor sampler, we use a distance sampler that computes the consistency score and the sampling probability (see the sketch below). For the supervised model, GraphConsis considers all the relations and learns relation vectors and attention weights for each sample. 39 | 40 | - Before running the code, please remember to unzip the given dataset. 
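As a quick illustration of the consistency-aware sampling mentioned above, here is a minimal NumPy sketch; the actual implementation is the TensorFlow `DistanceNeighborSampler` in `neigh_samplers.py`, and the function and variable names below are illustrative only:

```python
import numpy as np

def consistency_sampling_prob(node_feat, neigh_feats, eps=1e-3):
    # Euclidean distance between the center node and each candidate neighbor
    distance = np.sqrt(((neigh_feats - node_feat) ** 2).sum(axis=1))
    prob = np.exp(-distance)               # consistency score: closer neighbors score higher
    prob = prob / prob.sum()               # normalize scores into probabilities
    prob = np.where(prob > eps, prob, 0.)  # drop highly inconsistent neighbors
    return prob

# Neighbors are then drawn according to `prob`, e.g.:
# sampled = np.random.choice(len(prob), size=num_samples, p=prob / prob.sum())
```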
41 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from __future__ import division 3 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/inits.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | # DISCLAIMER: 5 | # Parts of this code file are derived from 6 | # https://github.com/tkipf/gcn 7 | # which is under an identical MIT license as GraphSAGE 8 | 9 | def uniform(shape, scale=0.05, name=None): 10 | """Uniform init.""" 11 | initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32) 12 | return tf.Variable(initial, name=name) 13 | 14 | 15 | def glorot(shape, name=None): 16 | """Glorot & Bengio (AISTATS 2010) init.""" 17 | init_range = np.sqrt(6.0/(shape[0]+shape[1])) 18 | initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32) 19 | return tf.Variable(initial, name=name) 20 | 21 | 22 | def zeros(shape, name=None): 23 | """All zeros.""" 24 | initial = tf.zeros(shape, dtype=tf.float32) 25 | return tf.Variable(initial, name=name) 26 | 27 | def ones(shape, name=None): 28 | """All ones.""" 29 | initial = tf.ones(shape, dtype=tf.float32) 30 | return tf.Variable(initial, name=name) 31 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/layers.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | import tensorflow as tf 5 | 6 | from inits import zeros 7 | 8 | flags = tf.app.flags 9 | FLAGS = flags.FLAGS 10 | 11 | # DISCLAIMER: 12 | # Boilerplate parts of this code file were originally forked from 13 | # https://github.com/tkipf/gcn 14 | # which itself was very inspired by the keras package 15 | 16 | # global unique layer ID dictionary for layer name assignment 17 | _LAYER_UIDS = {} 18 | 19 | def get_layer_uid(layer_name=''): 20 | """Helper function, assigns unique layer IDs.""" 21 | if layer_name not in _LAYER_UIDS: 22 | _LAYER_UIDS[layer_name] = 1 23 | return 1 24 | else: 25 | _LAYER_UIDS[layer_name] += 1 26 | return _LAYER_UIDS[layer_name] 27 | 28 | class Layer(object): 29 | """Base layer class. Defines basic API for all layer objects. 30 | Implementation inspired by keras (http://keras.io). 31 | # Properties 32 | name: String, defines the variable scope of the layer. 33 | logging: Boolean, switches Tensorflow histogram logging on/off 34 | 35 | # Methods 36 | _call(inputs): Defines computation graph of layer 37 | (i.e. 
takes input, returns output) 38 | __call__(inputs): Wrapper for _call() 39 | _log_vars(): Log all variables 40 | """ 41 | 42 | def __init__(self, **kwargs): 43 | allowed_kwargs = {'name', 'logging', 'model_size'} 44 | for kwarg in kwargs.keys(): 45 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 46 | name = kwargs.get('name') 47 | if not name: 48 | layer = self.__class__.__name__.lower() 49 | name = layer + '_' + str(get_layer_uid(layer)) 50 | self.name = name 51 | self.vars = {} 52 | logging = kwargs.get('logging', False) 53 | self.logging = logging 54 | self.sparse_inputs = False 55 | 56 | def _call(self, inputs): 57 | return inputs 58 | 59 | def __call__(self, inputs): 60 | with tf.name_scope(self.name): 61 | if self.logging and not self.sparse_inputs: 62 | tf.summary.histogram(self.name + '/inputs', inputs) 63 | outputs = self._call(inputs) 64 | if self.logging: 65 | tf.summary.histogram(self.name + '/outputs', outputs) 66 | return outputs 67 | 68 | def _log_vars(self): 69 | for var in self.vars: 70 | tf.summary.histogram(self.name + '/vars/' + var, self.vars[var]) 71 | 72 | 73 | class Dense(Layer): 74 | """Dense layer.""" 75 | def __init__(self, input_dim, output_dim, dropout=0., 76 | act=tf.nn.relu, placeholders=None, bias=True, featureless=False, 77 | sparse_inputs=False, **kwargs): 78 | super(Dense, self).__init__(**kwargs) 79 | 80 | self.dropout = dropout 81 | 82 | self.act = act 83 | self.featureless = featureless 84 | self.bias = bias 85 | self.input_dim = input_dim 86 | self.output_dim = output_dim 87 | 88 | # helper variable for sparse dropout 89 | self.sparse_inputs = sparse_inputs 90 | if sparse_inputs: 91 | self.num_features_nonzero = placeholders['num_features_nonzero'] 92 | 93 | with tf.variable_scope(self.name + '_vars'): 94 | self.vars['weights'] = tf.get_variable('weights', shape=(input_dim, output_dim), 95 | dtype=tf.float32, 96 | initializer=tf.contrib.layers.xavier_initializer(), 97 | regularizer=tf.contrib.layers.l2_regularizer(FLAGS.weight_decay)) 98 | if self.bias: 99 | self.vars['bias'] = zeros([output_dim], name='bias') 100 | 101 | if self.logging: 102 | self._log_vars() 103 | 104 | def _call(self, inputs): 105 | x = inputs 106 | 107 | x = tf.nn.dropout(x, 1-self.dropout) 108 | 109 | # transform 110 | output = tf.matmul(x, self.vars['weights']) 111 | 112 | # bias 113 | if self.bias: 114 | output += self.vars['bias'] 115 | 116 | return self.act(output) 117 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/metrics.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | # DISCLAIMER: 4 | # Parts of this code file were originally forked from 5 | # https://github.com/tkipf/gcn 6 | # which itself was very inspired by the keras package 7 | def masked_logit_cross_entropy(preds, labels, mask): 8 | """Logit cross-entropy loss with masking.""" 9 | loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=preds, labels=labels) 10 | loss = tf.reduce_sum(loss, axis=1) 11 | mask = tf.cast(mask, dtype=tf.float32) 12 | mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.])) 13 | loss *= mask 14 | return tf.reduce_mean(loss) 15 | 16 | def masked_softmax_cross_entropy(preds, labels, mask): 17 | """Softmax cross-entropy loss with masking.""" 18 | loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels) 19 | mask = tf.cast(mask, dtype=tf.float32) 20 | mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.])) 21 | loss *= 
mask 22 | return tf.reduce_mean(loss) 23 | 24 | 25 | def masked_l2(preds, actuals, mask): 26 | """L2 loss with masking.""" 27 | loss = tf.nn.l2(preds, actuals) 28 | mask = tf.cast(mask, dtype=tf.float32) 29 | mask /= tf.reduce_mean(mask) 30 | loss *= mask 31 | return tf.reduce_mean(loss) 32 | 33 | def masked_accuracy(preds, labels, mask): 34 | """Accuracy with masking.""" 35 | correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1)) 36 | accuracy_all = tf.cast(correct_prediction, tf.float32) 37 | mask = tf.cast(mask, dtype=tf.float32) 38 | mask /= tf.reduce_mean(mask) 39 | accuracy_all *= mask 40 | return tf.reduce_mean(accuracy_all) 41 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/neigh_samplers.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Zhiwei Liu (@JimLiu96) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | from __future__ import division 7 | from __future__ import print_function 8 | 9 | from layers import Layer 10 | 11 | import tensorflow as tf 12 | flags = tf.app.flags 13 | FLAGS = flags.FLAGS 14 | 15 | 16 | """ 17 | Classes that are used to sample node neighborhoods 18 | """ 19 | 20 | class UniformNeighborSampler(Layer): 21 | """ 22 | Uniformly samples neighbors. 23 | Assumes that adj lists are padded with random re-sampling 24 | """ 25 | def __init__(self, adj_info, **kwargs): 26 | super(UniformNeighborSampler, self).__init__(**kwargs) 27 | self.adj_info = adj_info 28 | 29 | def _call(self, inputs): 30 | ids, num_samples = inputs 31 | adj_lists = tf.nn.embedding_lookup(self.adj_info, ids) 32 | adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists))) 33 | adj_lists = tf.slice(adj_lists, [0,0], [-1, num_samples]) 34 | return adj_lists 35 | 36 | class DistanceNeighborSampler(Layer): 37 | """ 38 | Sampling neighbors based on the feature consistency. 
39 | """ 40 | def __init__(self, adj_info, **kwargs): 41 | super(DistanceNeighborSampler, self).__init__(**kwargs) 42 | self.adj_info = adj_info 43 | self.num_neighs = adj_info.shape[-1] 44 | 45 | def _call(self, inputs): 46 | eps = 0.001 47 | ids, num_samples, features, batch_size = inputs 48 | adj_lists = tf.gather(self.adj_info, ids) 49 | node_features = tf.gather(features, ids) 50 | feature_size = tf.shape(features)[-1] 51 | node_feature_repeat = tf.tile(node_features, [1,self.num_neighs]) 52 | node_feature_repeat = tf.reshape(node_feature_repeat, [batch_size, self.num_neighs, feature_size]) 53 | neighbor_feature = tf.gather(features, adj_lists) 54 | distance = tf.sqrt(tf.reduce_sum(tf.square(node_feature_repeat - neighbor_feature), -1)) 55 | prob = tf.exp(-distance) 56 | prob_sum = tf.reduce_sum(prob, -1, keepdims=True) 57 | prob_sum = tf.tile(prob_sum, [1,self.num_neighs]) 58 | prob = tf.divide(prob, prob_sum) 59 | prob = tf.where(prob>eps, prob, 0*prob) # uncommenting this line to use eps to filter small probabilities 60 | samples_idx = tf.random.categorical(tf.math.log(prob), num_samples) 61 | selected = tf.batch_gather(adj_lists, samples_idx) 62 | return selected 63 | 64 | 65 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/prediction.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | from inits import zeros 5 | from layers import Layer 6 | import tensorflow as tf 7 | 8 | flags = tf.app.flags 9 | FLAGS = flags.FLAGS 10 | 11 | 12 | class BipartiteEdgePredLayer(Layer): 13 | def __init__(self, input_dim1, input_dim2, placeholders, dropout=False, act=tf.nn.sigmoid, 14 | loss_fn='xent', neg_sample_weights=1.0, 15 | bias=False, bilinear_weights=False, **kwargs): 16 | """ 17 | Basic class that applies skip-gram-like loss 18 | (i.e., dot product of node+target and node and negative samples) 19 | Args: 20 | bilinear_weights: use a bilinear weight for affinity calculation: u^T A v. If set to 21 | false, it is assumed that input dimensions are the same and the affinity will be 22 | based on dot product. 23 | """ 24 | super(BipartiteEdgePredLayer, self).__init__(**kwargs) 25 | self.input_dim1 = input_dim1 26 | self.input_dim2 = input_dim2 27 | self.act = act 28 | self.bias = bias 29 | self.eps = 1e-7 30 | 31 | # Margin for hinge loss 32 | self.margin = 0.1 33 | self.neg_sample_weights = neg_sample_weights 34 | 35 | self.bilinear_weights = bilinear_weights 36 | 37 | if dropout: 38 | self.dropout = placeholders['dropout'] 39 | else: 40 | self.dropout = 0. 
41 | 42 | # output a likelihood term 43 | self.output_dim = 1 44 | with tf.variable_scope(self.name + '_vars'): 45 | # bilinear form 46 | if bilinear_weights: 47 | #self.vars['weights'] = glorot([input_dim1, input_dim2], 48 | # name='pred_weights') 49 | self.vars['weights'] = tf.get_variable( 50 | 'pred_weights', 51 | shape=(input_dim1, input_dim2), 52 | dtype=tf.float32, 53 | initializer=tf.contrib.layers.xavier_initializer()) 54 | 55 | if self.bias: 56 | self.vars['bias'] = zeros([self.output_dim], name='bias') 57 | 58 | if loss_fn == 'xent': 59 | self.loss_fn = self._xent_loss 60 | elif loss_fn == 'skipgram': 61 | self.loss_fn = self._skipgram_loss 62 | elif loss_fn == 'hinge': 63 | self.loss_fn = self._hinge_loss 64 | 65 | if self.logging: 66 | self._log_vars() 67 | 68 | def affinity(self, inputs1, inputs2): 69 | """ Affinity score between batch of inputs1 and inputs2. 70 | Args: 71 | inputs1: tensor of shape [batch_size x feature_size]. 72 | """ 73 | # shape: [batch_size, input_dim1] 74 | if self.bilinear_weights: 75 | prod = tf.matmul(inputs2, tf.transpose(self.vars['weights'])) 76 | self.prod = prod 77 | result = tf.reduce_sum(inputs1 * prod, axis=1) 78 | else: 79 | result = tf.reduce_sum(inputs1 * inputs2, axis=1) 80 | return result 81 | 82 | def neg_cost(self, inputs1, neg_samples, hard_neg_samples=None): 83 | """ For each input in batch, compute the sum of its affinity to negative samples. 84 | 85 | Returns: 86 | Tensor of shape [batch_size x num_neg_samples]. For each node, a list of affinities to 87 | negative samples is computed. 88 | """ 89 | if self.bilinear_weights: 90 | inputs1 = tf.matmul(inputs1, self.vars['weights']) 91 | neg_aff = tf.matmul(inputs1, tf.transpose(neg_samples)) 92 | return neg_aff 93 | 94 | def loss(self, inputs1, inputs2, neg_samples): 95 | """ negative sampling loss. 96 | Args: 97 | neg_samples: tensor of shape [num_neg_samples x input_dim2]. Negative samples for all 98 | inputs in batch inputs1. 
99 | """ 100 | return self.loss_fn(inputs1, inputs2, neg_samples) 101 | 102 | def _xent_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None): 103 | aff = self.affinity(inputs1, inputs2) 104 | neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples) 105 | true_xent = tf.nn.sigmoid_cross_entropy_with_logits( 106 | labels=tf.ones_like(aff), logits=aff) 107 | negative_xent = tf.nn.sigmoid_cross_entropy_with_logits( 108 | labels=tf.zeros_like(neg_aff), logits=neg_aff) 109 | loss = tf.reduce_sum(true_xent) + self.neg_sample_weights * tf.reduce_sum(negative_xent) 110 | return loss 111 | 112 | def _skipgram_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None): 113 | aff = self.affinity(inputs1, inputs2) 114 | neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples) 115 | neg_cost = tf.log(tf.reduce_sum(tf.exp(neg_aff), axis=1)) 116 | loss = tf.reduce_sum(aff - neg_cost) 117 | return loss 118 | 119 | def _hinge_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None): 120 | aff = self.affinity(inputs1, inputs2) 121 | neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples) 122 | diff = tf.nn.relu(tf.subtract(neg_aff, tf.expand_dims(aff, 1) - self.margin), name='diff') 123 | loss = tf.reduce_sum(diff) 124 | self.neg_shape = tf.shape(neg_aff) 125 | return loss 126 | 127 | def weights_norm(self): 128 | return tf.nn.l2_norm(self.vars['weights']) 129 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/supervised_models.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Zhiwei Liu (@JimLiu96) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import tensorflow as tf 7 | import models as models 8 | import layers as layers 9 | from aggregators import MeanAggregator, MaxPoolingAggregator, MeanPoolingAggregator, SeqAggregator, GCNAggregator 10 | from inits import glorot, zeros 11 | 12 | flags = tf.app.flags 13 | FLAGS = flags.FLAGS 14 | 15 | 16 | class SupervisedGraphconsis(models.SampleAndAggregate): 17 | """Implementation of supervised GraphConsis.""" 18 | 19 | def __init__(self, num_classes, 20 | placeholders, features, adj, degrees, 21 | layer_infos, concat=True, aggregator_type="mean", 22 | model_size="small", sigmoid_loss=False, identity_dim=0, num_re=3, 23 | **kwargs): 24 | ''' 25 | Args: 26 | - placeholders: Stanford TensorFlow placeholder object. 27 | - features: Numpy array with node features. 28 | - adj: Numpy array with adjacency lists (padded with random re-samples) 29 | - degrees: Numpy array with node degrees. 30 | - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all 31 | the recursive layers. See SAGEInfo definition above. 
It contains *numer_re* lists of layer_info 32 | - concat: whether to concatenate during recursive iterations 33 | - aggregator_type: how to aggregate neighbor information 34 | - model_size: one of "small" and "big" 35 | - sigmoid_loss: Set to true if nodes can belong to multiple classes 36 | - identity_dim: context embedding 37 | ''' 38 | 39 | models.GeneralizedModel.__init__(self, **kwargs) 40 | 41 | if aggregator_type == "mean": 42 | self.aggregator_cls = MeanAggregator 43 | elif aggregator_type == "seq": 44 | self.aggregator_cls = SeqAggregator 45 | elif aggregator_type == "meanpool": 46 | self.aggregator_cls = MeanPoolingAggregator 47 | elif aggregator_type == "maxpool": 48 | self.aggregator_cls = MaxPoolingAggregator 49 | elif aggregator_type == "gcn": 50 | self.aggregator_cls = GCNAggregator 51 | else: 52 | raise Exception("Unknown aggregator: ", self.aggregator_cls) 53 | 54 | # get info from placeholders... 55 | self.inputs1 = placeholders["batch"] 56 | self.model_size = model_size 57 | self.adj_info = adj 58 | if identity_dim > 0: 59 | self.embeds_context = tf.get_variable("node_embeddings", [features.shape[0], identity_dim]) 60 | else: 61 | self.embeds_context = None 62 | if features is None: 63 | if identity_dim == 0: 64 | raise Exception("Must have a positive value for identity feature dimension if no input features given.") 65 | self.features = self.embeds_context 66 | else: 67 | self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False) 68 | if not self.embeds_context is None: 69 | self.features = tf.concat([self.embeds_context, self.features], axis=1) 70 | self.degrees = degrees 71 | self.concat = concat 72 | self.num_classes = num_classes 73 | self.sigmoid_loss = sigmoid_loss 74 | self.dims = [(0 if features is None else features.shape[1]) + identity_dim] 75 | self.dims.extend([layer_infos[0][i].output_dim for i in range(len(layer_infos[0]))]) 76 | self.batch_size = placeholders["batch_size"] 77 | self.placeholders = placeholders 78 | self.layer_infos = layer_infos 79 | self.num_relations = num_re 80 | dim_mult = 2 if self.concat else 1 81 | self.relation_vectors = tf.Variable(glorot([num_re, self.dims[-1] * dim_mult]), trainable=True, name='relation_vectors') 82 | self.attention_vec = tf.Variable(glorot([self.dims[-1] * dim_mult * 2, 1])) 83 | 84 | self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate) 85 | self.build() 86 | 87 | 88 | def build(self): 89 | samples1_list, support_sizes1_list = [], [] 90 | for r_idx in range(self.num_relations): 91 | samples1, support_sizes1 = self.sample(self.inputs1, self.layer_infos[r_idx]) 92 | samples1_list.append(samples1) 93 | support_sizes1_list.append(support_sizes1) 94 | num_samples = [layer_info.num_samples for layer_info in self.layer_infos[0]] 95 | self.outputs1_list = [] 96 | dim_mult = 2 if self.concat else 1 # multiplication to get the correct output dimension 97 | dim_mult = dim_mult * 2 98 | for r_idx in range(self.num_relations): 99 | outputs1, self.aggregators = self.aggregate(samples1_list[r_idx], [self.features], self.dims, num_samples, 100 | support_sizes1, concat=self.concat, model_size=self.model_size) 101 | self.relation_batch = tf.tile([tf.nn.embedding_lookup(self.relation_vectors, r_idx)], [self.batch_size, 1]) 102 | outputs1 = tf.concat([outputs1, self.relation_batch], 1) 103 | self.attention_weights = tf.matmul(outputs1, self.attention_vec) 104 | self.attention_weights = tf.tile(self.attention_weights, [1, dim_mult*self.dims[-1]]) 105 | outputs1 = 
tf.multiply(self.attention_weights, outputs1) 106 | self.outputs1_list += [outputs1] 107 | # self.outputs1 = tf.reduce_mean(self.outputs1_list, 0) 108 | self.outputs1 = tf.stack(self.outputs1_list, 1) 109 | self.outputs1 = tf.reduce_sum(self.outputs1, axis=1, keepdims=False) 110 | self.outputs1 = tf.nn.l2_normalize(self.outputs1, 1) 111 | self.node_pred = layers.Dense(dim_mult*self.dims[-1], self.num_classes, 112 | dropout=self.placeholders['dropout'], 113 | act=lambda x : x) 114 | # TF graph management 115 | self.node_preds = self.node_pred(self.outputs1) 116 | 117 | self._loss() 118 | grads_and_vars = self.optimizer.compute_gradients(self.loss) 119 | clipped_grads_and_vars = [(tf.clip_by_value(grad, -5.0, 5.0) if grad is not None else None, var) 120 | for grad, var in grads_and_vars] 121 | self.grad, _ = clipped_grads_and_vars[0] 122 | self.opt_op = self.optimizer.apply_gradients(clipped_grads_and_vars) 123 | self.preds = self.predict() 124 | 125 | def _loss(self): 126 | # Weight decay loss 127 | for aggregator in self.aggregators: 128 | for var in aggregator.vars.values(): 129 | self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var) 130 | for var in self.node_pred.vars.values(): 131 | self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var) 132 | 133 | # classification loss 134 | if self.sigmoid_loss: 135 | self.loss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits( 136 | logits=self.node_preds, 137 | labels=self.placeholders['labels'])) 138 | else: 139 | self.loss += tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( 140 | logits=self.node_preds, 141 | labels=self.placeholders['labels'])) 142 | 143 | tf.summary.scalar('loss', self.loss) 144 | 145 | def predict(self): 146 | if self.sigmoid_loss: 147 | return tf.nn.sigmoid(self.node_preds) 148 | else: 149 | return tf.nn.softmax(self.node_preds) 150 | 151 | -------------------------------------------------------------------------------- /algorithms/GraphConsis/utils.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import numpy as np 4 | import random 5 | import json 6 | import sys 7 | import os 8 | import scipy.io as sio 9 | 10 | import networkx as nx 11 | from networkx.readwrite import json_graph 12 | version_info = list(map(int, nx.__version__.split('.'))) 13 | major = version_info[0] 14 | minor = version_info[1] 15 | assert (major <= 1) and (minor <= 11), "networkx major version > 1.11" 16 | 17 | 18 | WALK_LEN=5 19 | N_WALKS=50 20 | 21 | def load_mat_full(prefix='./example_data/', file_name = 'YelpChi.mat', relations=['net_rur'], train_size=0.8): 22 | data = sio.loadmat(prefix + file_name) 23 | truelabels, features = data['label'], data['features'].astype(float) 24 | truelabels = truelabels.tolist()[0] 25 | features = features.todense() 26 | N = features.shape[0] 27 | adj_mat = [data[relation] for relation in relations] 28 | index = range(len(truelabels)) 29 | train_num = int(len(truelabels) * 0.8) 30 | train_idx = set(np.random.choice(index, train_num, replace=False)) 31 | test_idx = set(index).difference(train_idx) 32 | train_num = int(len(truelabels) * train_size) 33 | train_idx = set(list(train_idx)[:train_num]) 34 | return adj_mat, features, truelabels, train_idx, test_idx 35 | 36 | def graph_process(graph, features, truelabels, test_idx): 37 | print('-------processing graph-------------') 38 | for node in graph.nodes(): 39 | graph.node[node]['feature'] = features[node,:].tolist()[0] 40 | graph.node[node]['label'] = [truelabels[node]] 41 
| if node in test_idx: 42 | graph.node[node]['test'] = True 43 | graph.node[node]['val'] = True 44 | else: 45 | graph.node[node]['test'] = False 46 | graph.node[node]['val'] = False 47 | broken_count = 0 48 | for edge in graph.edges(): 49 | graph[edge[0]][edge[1]]['train_removed'] = False 50 | return graph 51 | 52 | def load_data(prefix='./example_data/', file_name = 'YelpChi.mat', relations=['net_rur'], normalize=True, load_walks=False, train_size=0.8): 53 | adjs, feats, truelabels, train_idx, test_idx = load_mat_full(prefix, file_name, relations, train_size) 54 | gs = [nx.to_networkx_graph(adj) for adj in adjs] 55 | id_map = {int(i):i for i in range(len(truelabels))} 56 | class_map = {int(i):truelabels[i] for i in range(len(truelabels))} 57 | walks = [] 58 | adj_main = np.sum(adjs) # change the index to specify which adj matrix to use for aggregation 59 | G = nx.to_networkx_graph(adj_main) 60 | gs = [graph_process(g, feats, truelabels, test_idx) for g in gs] 61 | G = graph_process(G, feats, truelabels, test_idx) 62 | if normalize and not feats is None: 63 | from sklearn.preprocessing import StandardScaler 64 | train_ids = np.array([id_map[n] for n in G.nodes()]) 65 | train_feats = feats[train_ids] 66 | scaler = StandardScaler() 67 | scaler.fit(train_feats) 68 | feats = scaler.transform(feats) 69 | if load_walks: 70 | with open(prefix + "-walks.txt") as fp: 71 | for line in fp: 72 | walks.append(map(conversion, line.split())) 73 | return G, feats, id_map, walks, class_map, gs 74 | 75 | def run_random_walks(G, nodes, num_walks=N_WALKS): 76 | pairs = [] 77 | for count, node in enumerate(nodes): 78 | if G.degree(node) == 0: 79 | continue 80 | for i in range(num_walks): 81 | curr_node = node 82 | for j in range(WALK_LEN): 83 | next_node = random.choice(G.neighbors(curr_node)) 84 | # self co-occurrences are useless 85 | if curr_node != node: 86 | pairs.append((node,curr_node)) 87 | curr_node = next_node 88 | if count % 1000 == 0: 89 | print("Done walks for", count, "nodes") 90 | return pairs 91 | 92 | if __name__ == "__main__": 93 | """ Run random walks """ 94 | graph_file = sys.argv[1] 95 | out_file = sys.argv[2] 96 | G_data = json.load(open(graph_file)) 97 | G = json_graph.node_link_graph(G_data) 98 | nodes = [n for n in G.nodes() if not G.node[n]["val"] and not G.node[n]["test"]] 99 | G = G.subgraph(nodes) 100 | pairs = run_random_walks(G, nodes) 101 | with open(out_file, "w") as fp: 102 | fp.write("\n".join([str(p[0]) + "\t" + str(p[1]) for p in pairs])) 103 | -------------------------------------------------------------------------------- /algorithms/GraphSage/README.md: -------------------------------------------------------------------------------- 1 | # GraphSAGE 2 | 3 | ## Paper 4 | The GraphSAGE model is proposed by the [paper](http://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs.pdf) below: 5 | ```bibtex 6 | @inproceedings{hamilton2017inductive, 7 | title={Inductive representation learning on large graphs}, 8 | author={Hamilton, Will and Ying, Zhitao and Leskovec, Jure}, 9 | booktitle={Advances in neural information processing systems}, 10 | pages={1024--1034}, 11 | year={2017} 12 | } 13 | ``` 14 | 15 | # Brief Introduction 16 | We revise the original code of [graphsage](https://github.com/williamleif/GraphSAGE/tree/master/graphsage) so that it can load our data format and train the model. 
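For reference, here is a minimal sketch of the data format this fork expects; the key names are taken from the loader in `utils.py`, while the shape comments are assumptions:

```python
# Sketch of the expected .mat input (see load_data_dblp in utils.py)
import scipy.io as sio

data = sio.loadmat('../../dataset/YelpChi.mat')
labels   = data['label']       # 1 x N array of node labels
features = data['features']    # N x d (sparse) node feature matrix
adj_rur  = data['net_rur']     # N x N adjacency matrix of the major (rur) relation
```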
17 | 18 | # Run the code 19 | `python -m graphsage.supervised_train --train_prefix ../../dataset --model graphsage_mean --sigmoid` 20 | 21 | # Note 22 | - Since graphsage only supports one type of relation, hence we only use the one major relation as the adjacency matrix. In the code, it only reads `rur` relation. You may change it by revise 23 | line 28 in `utils.py` file 24 | ```python 25 | rownetworks = [data['net_rur']] 26 | ``` 27 | - Before running the code, please remember unzip the given YelpChi dataset. 28 | -------------------------------------------------------------------------------- /algorithms/GraphSage/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from __future__ import division 3 | -------------------------------------------------------------------------------- /algorithms/GraphSage/inits.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | # DISCLAIMER: 5 | # Parts of this code file are derived from 6 | # https://github.com/tkipf/gcn 7 | # which is under an identical MIT license as GraphSAGE 8 | 9 | def uniform(shape, scale=0.05, name=None): 10 | """Uniform init.""" 11 | initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32) 12 | return tf.Variable(initial, name=name) 13 | 14 | 15 | def glorot(shape, name=None): 16 | """Glorot & Bengio (AISTATS 2010) init.""" 17 | init_range = np.sqrt(6.0/(shape[0]+shape[1])) 18 | initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32) 19 | return tf.Variable(initial, name=name) 20 | 21 | 22 | def zeros(shape, name=None): 23 | """All zeros.""" 24 | initial = tf.zeros(shape, dtype=tf.float32) 25 | return tf.Variable(initial, name=name) 26 | 27 | def ones(shape, name=None): 28 | """All ones.""" 29 | initial = tf.ones(shape, dtype=tf.float32) 30 | return tf.Variable(initial, name=name) 31 | -------------------------------------------------------------------------------- /algorithms/GraphSage/layers.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | import tensorflow as tf 5 | 6 | from graphsage.inits import zeros 7 | 8 | flags = tf.app.flags 9 | FLAGS = flags.FLAGS 10 | 11 | # DISCLAIMER: 12 | # Boilerplate parts of this code file were originally forked from 13 | # https://github.com/tkipf/gcn 14 | # which itself was very inspired by the keras package 15 | 16 | # global unique layer ID dictionary for layer name assignment 17 | _LAYER_UIDS = {} 18 | 19 | def get_layer_uid(layer_name=''): 20 | """Helper function, assigns unique layer IDs.""" 21 | if layer_name not in _LAYER_UIDS: 22 | _LAYER_UIDS[layer_name] = 1 23 | return 1 24 | else: 25 | _LAYER_UIDS[layer_name] += 1 26 | return _LAYER_UIDS[layer_name] 27 | 28 | class Layer(object): 29 | """Base layer class. Defines basic API for all layer objects. 30 | Implementation inspired by keras (http://keras.io). 31 | # Properties 32 | name: String, defines the variable scope of the layer. 33 | logging: Boolean, switches Tensorflow histogram logging on/off 34 | 35 | # Methods 36 | _call(inputs): Defines computation graph of layer 37 | (i.e. 
takes input, returns output) 38 | __call__(inputs): Wrapper for _call() 39 | _log_vars(): Log all variables 40 | """ 41 | 42 | def __init__(self, **kwargs): 43 | allowed_kwargs = {'name', 'logging', 'model_size'} 44 | for kwarg in kwargs.keys(): 45 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 46 | name = kwargs.get('name') 47 | if not name: 48 | layer = self.__class__.__name__.lower() 49 | name = layer + '_' + str(get_layer_uid(layer)) 50 | self.name = name 51 | self.vars = {} 52 | logging = kwargs.get('logging', False) 53 | self.logging = logging 54 | self.sparse_inputs = False 55 | 56 | def _call(self, inputs): 57 | return inputs 58 | 59 | def __call__(self, inputs): 60 | with tf.name_scope(self.name): 61 | if self.logging and not self.sparse_inputs: 62 | tf.summary.histogram(self.name + '/inputs', inputs) 63 | outputs = self._call(inputs) 64 | if self.logging: 65 | tf.summary.histogram(self.name + '/outputs', outputs) 66 | return outputs 67 | 68 | def _log_vars(self): 69 | for var in self.vars: 70 | tf.summary.histogram(self.name + '/vars/' + var, self.vars[var]) 71 | 72 | 73 | class Dense(Layer): 74 | """Dense layer.""" 75 | def __init__(self, input_dim, output_dim, dropout=0., 76 | act=tf.nn.relu, placeholders=None, bias=True, featureless=False, 77 | sparse_inputs=False, **kwargs): 78 | super(Dense, self).__init__(**kwargs) 79 | 80 | self.dropout = dropout 81 | 82 | self.act = act 83 | self.featureless = featureless 84 | self.bias = bias 85 | self.input_dim = input_dim 86 | self.output_dim = output_dim 87 | 88 | # helper variable for sparse dropout 89 | self.sparse_inputs = sparse_inputs 90 | if sparse_inputs: 91 | self.num_features_nonzero = placeholders['num_features_nonzero'] 92 | 93 | with tf.variable_scope(self.name + '_vars'): 94 | self.vars['weights'] = tf.get_variable('weights', shape=(input_dim, output_dim), 95 | dtype=tf.float32, 96 | initializer=tf.contrib.layers.xavier_initializer(), 97 | regularizer=tf.contrib.layers.l2_regularizer(FLAGS.weight_decay)) 98 | if self.bias: 99 | self.vars['bias'] = zeros([output_dim], name='bias') 100 | 101 | if self.logging: 102 | self._log_vars() 103 | 104 | def _call(self, inputs): 105 | x = inputs 106 | 107 | x = tf.nn.dropout(x, 1-self.dropout) 108 | 109 | # transform 110 | output = tf.matmul(x, self.vars['weights']) 111 | 112 | # bias 113 | if self.bias: 114 | output += self.vars['bias'] 115 | 116 | return self.act(output) 117 | -------------------------------------------------------------------------------- /algorithms/GraphSage/metrics.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | # DISCLAIMER: 4 | # Parts of this code file were originally forked from 5 | # https://github.com/tkipf/gcn 6 | # which itself was very inspired by the keras package 7 | def masked_logit_cross_entropy(preds, labels, mask): 8 | """Logit cross-entropy loss with masking.""" 9 | loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=preds, labels=labels) 10 | loss = tf.reduce_sum(loss, axis=1) 11 | mask = tf.cast(mask, dtype=tf.float32) 12 | mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.])) 13 | loss *= mask 14 | return tf.reduce_mean(loss) 15 | 16 | def masked_softmax_cross_entropy(preds, labels, mask): 17 | """Softmax cross-entropy loss with masking.""" 18 | loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels) 19 | mask = tf.cast(mask, dtype=tf.float32) 20 | mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.])) 21 | loss *= 
mask 22 | return tf.reduce_mean(loss) 23 | 24 | 25 | def masked_l2(preds, actuals, mask): 26 | """L2 loss with masking.""" 27 | loss = tf.nn.l2(preds, actuals) 28 | mask = tf.cast(mask, dtype=tf.float32) 29 | mask /= tf.reduce_mean(mask) 30 | loss *= mask 31 | return tf.reduce_mean(loss) 32 | 33 | def masked_accuracy(preds, labels, mask): 34 | """Accuracy with masking.""" 35 | correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1)) 36 | accuracy_all = tf.cast(correct_prediction, tf.float32) 37 | mask = tf.cast(mask, dtype=tf.float32) 38 | mask /= tf.reduce_mean(mask) 39 | accuracy_all *= mask 40 | return tf.reduce_mean(accuracy_all) 41 | -------------------------------------------------------------------------------- /algorithms/GraphSage/neigh_samplers.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | from graphsage.layers import Layer 5 | 6 | import tensorflow as tf 7 | flags = tf.app.flags 8 | FLAGS = flags.FLAGS 9 | 10 | 11 | """ 12 | Classes that are used to sample node neighborhoods 13 | """ 14 | 15 | class UniformNeighborSampler(Layer): 16 | """ 17 | Uniformly samples neighbors. 18 | Assumes that adj lists are padded with random re-sampling 19 | """ 20 | def __init__(self, adj_info, **kwargs): 21 | super(UniformNeighborSampler, self).__init__(**kwargs) 22 | self.adj_info = adj_info 23 | 24 | def _call(self, inputs): 25 | ids, num_samples = inputs 26 | adj_lists = tf.nn.embedding_lookup(self.adj_info, ids) 27 | adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists))) 28 | adj_lists = tf.slice(adj_lists, [0,0], [-1, num_samples]) 29 | return adj_lists 30 | -------------------------------------------------------------------------------- /algorithms/GraphSage/prediction.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | from graphsage.inits import zeros 5 | from graphsage.layers import Layer 6 | import tensorflow as tf 7 | 8 | flags = tf.app.flags 9 | FLAGS = flags.FLAGS 10 | 11 | 12 | class BipartiteEdgePredLayer(Layer): 13 | def __init__(self, input_dim1, input_dim2, placeholders, dropout=False, act=tf.nn.sigmoid, 14 | loss_fn='xent', neg_sample_weights=1.0, 15 | bias=False, bilinear_weights=False, **kwargs): 16 | """ 17 | Basic class that applies skip-gram-like loss 18 | (i.e., dot product of node+target and node and negative samples) 19 | Args: 20 | bilinear_weights: use a bilinear weight for affinity calculation: u^T A v. If set to 21 | false, it is assumed that input dimensions are the same and the affinity will be 22 | based on dot product. 23 | """ 24 | super(BipartiteEdgePredLayer, self).__init__(**kwargs) 25 | self.input_dim1 = input_dim1 26 | self.input_dim2 = input_dim2 27 | self.act = act 28 | self.bias = bias 29 | self.eps = 1e-7 30 | 31 | # Margin for hinge loss 32 | self.margin = 0.1 33 | self.neg_sample_weights = neg_sample_weights 34 | 35 | self.bilinear_weights = bilinear_weights 36 | 37 | if dropout: 38 | self.dropout = placeholders['dropout'] 39 | else: 40 | self.dropout = 0. 
41 | 42 | # output a likelihood term 43 | self.output_dim = 1 44 | with tf.variable_scope(self.name + '_vars'): 45 | # bilinear form 46 | if bilinear_weights: 47 | #self.vars['weights'] = glorot([input_dim1, input_dim2], 48 | # name='pred_weights') 49 | self.vars['weights'] = tf.get_variable( 50 | 'pred_weights', 51 | shape=(input_dim1, input_dim2), 52 | dtype=tf.float32, 53 | initializer=tf.contrib.layers.xavier_initializer()) 54 | 55 | if self.bias: 56 | self.vars['bias'] = zeros([self.output_dim], name='bias') 57 | 58 | if loss_fn == 'xent': 59 | self.loss_fn = self._xent_loss 60 | elif loss_fn == 'skipgram': 61 | self.loss_fn = self._skipgram_loss 62 | elif loss_fn == 'hinge': 63 | self.loss_fn = self._hinge_loss 64 | 65 | if self.logging: 66 | self._log_vars() 67 | 68 | def affinity(self, inputs1, inputs2): 69 | """ Affinity score between batch of inputs1 and inputs2. 70 | Args: 71 | inputs1: tensor of shape [batch_size x feature_size]. 72 | """ 73 | # shape: [batch_size, input_dim1] 74 | if self.bilinear_weights: 75 | prod = tf.matmul(inputs2, tf.transpose(self.vars['weights'])) 76 | self.prod = prod 77 | result = tf.reduce_sum(inputs1 * prod, axis=1) 78 | else: 79 | result = tf.reduce_sum(inputs1 * inputs2, axis=1) 80 | return result 81 | 82 | def neg_cost(self, inputs1, neg_samples, hard_neg_samples=None): 83 | """ For each input in batch, compute the sum of its affinity to negative samples. 84 | 85 | Returns: 86 | Tensor of shape [batch_size x num_neg_samples]. For each node, a list of affinities to 87 | negative samples is computed. 88 | """ 89 | if self.bilinear_weights: 90 | inputs1 = tf.matmul(inputs1, self.vars['weights']) 91 | neg_aff = tf.matmul(inputs1, tf.transpose(neg_samples)) 92 | return neg_aff 93 | 94 | def loss(self, inputs1, inputs2, neg_samples): 95 | """ negative sampling loss. 96 | Args: 97 | neg_samples: tensor of shape [num_neg_samples x input_dim2]. Negative samples for all 98 | inputs in batch inputs1. 
99 | """ 100 | return self.loss_fn(inputs1, inputs2, neg_samples) 101 | 102 | def _xent_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None): 103 | aff = self.affinity(inputs1, inputs2) 104 | neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples) 105 | true_xent = tf.nn.sigmoid_cross_entropy_with_logits( 106 | labels=tf.ones_like(aff), logits=aff) 107 | negative_xent = tf.nn.sigmoid_cross_entropy_with_logits( 108 | labels=tf.zeros_like(neg_aff), logits=neg_aff) 109 | loss = tf.reduce_sum(true_xent) + self.neg_sample_weights * tf.reduce_sum(negative_xent) 110 | return loss 111 | 112 | def _skipgram_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None): 113 | aff = self.affinity(inputs1, inputs2) 114 | neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples) 115 | neg_cost = tf.log(tf.reduce_sum(tf.exp(neg_aff), axis=1)) 116 | loss = tf.reduce_sum(aff - neg_cost) 117 | return loss 118 | 119 | def _hinge_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None): 120 | aff = self.affinity(inputs1, inputs2) 121 | neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples) 122 | diff = tf.nn.relu(tf.subtract(neg_aff, tf.expand_dims(aff, 1) - self.margin), name='diff') 123 | loss = tf.reduce_sum(diff) 124 | self.neg_shape = tf.shape(neg_aff) 125 | return loss 126 | 127 | def weights_norm(self): 128 | return tf.nn.l2_norm(self.vars['weights']) 129 | -------------------------------------------------------------------------------- /algorithms/GraphSage/supervised_models.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | import graphsage.models as models 4 | import graphsage.layers as layers 5 | from graphsage.aggregators import MeanAggregator, MaxPoolingAggregator, MeanPoolingAggregator, SeqAggregator, GCNAggregator 6 | 7 | flags = tf.app.flags 8 | FLAGS = flags.FLAGS 9 | 10 | class SupervisedGraphsage(models.SampleAndAggregate): 11 | """Implementation of supervised GraphSAGE.""" 12 | 13 | def __init__(self, num_classes, 14 | placeholders, features, adj, degrees, 15 | layer_infos, concat=True, aggregator_type="mean", 16 | model_size="small", sigmoid_loss=False, identity_dim=0, 17 | **kwargs): 18 | ''' 19 | Args: 20 | - placeholders: Stanford TensorFlow placeholder object. 21 | - features: Numpy array with node features. 22 | - adj: Numpy array with adjacency lists (padded with random re-samples) 23 | - degrees: Numpy array with node degrees. 24 | - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all 25 | the recursive layers. See SAGEInfo definition above. 26 | - concat: whether to concatenate during recursive iterations 27 | - aggregator_type: how to aggregate neighbor information 28 | - model_size: one of "small" and "big" 29 | - sigmoid_loss: Set to true if nodes can belong to multiple classes 30 | ''' 31 | 32 | models.GeneralizedModel.__init__(self, **kwargs) 33 | 34 | if aggregator_type == "mean": 35 | self.aggregator_cls = MeanAggregator 36 | elif aggregator_type == "seq": 37 | self.aggregator_cls = SeqAggregator 38 | elif aggregator_type == "meanpool": 39 | self.aggregator_cls = MeanPoolingAggregator 40 | elif aggregator_type == "maxpool": 41 | self.aggregator_cls = MaxPoolingAggregator 42 | elif aggregator_type == "gcn": 43 | self.aggregator_cls = GCNAggregator 44 | else: 45 | raise Exception("Unknown aggregator: ", self.aggregator_cls) 46 | 47 | # get info from placeholders... 
48 | self.inputs1 = placeholders["batch"] 49 | self.model_size = model_size 50 | self.adj_info = adj 51 | if identity_dim > 0: 52 | self.embeds = tf.get_variable("node_embeddings", [adj.get_shape().as_list()[0], identity_dim]) 53 | else: 54 | self.embeds = None 55 | if features is None: 56 | if identity_dim == 0: 57 | raise Exception("Must have a positive value for identity feature dimension if no input features given.") 58 | self.features = self.embeds 59 | else: 60 | self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False) 61 | if not self.embeds is None: 62 | self.features = tf.concat([self.embeds, self.features], axis=1) 63 | self.degrees = degrees 64 | self.concat = concat 65 | self.num_classes = num_classes 66 | self.sigmoid_loss = sigmoid_loss 67 | self.dims = [(0 if features is None else features.shape[1]) + identity_dim] 68 | self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))]) 69 | self.batch_size = placeholders["batch_size"] 70 | self.placeholders = placeholders 71 | self.layer_infos = layer_infos 72 | 73 | self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate) 74 | 75 | self.build() 76 | 77 | 78 | def build(self): 79 | samples1, support_sizes1 = self.sample(self.inputs1, self.layer_infos) 80 | num_samples = [layer_info.num_samples for layer_info in self.layer_infos] 81 | self.outputs1, self.aggregators = self.aggregate(samples1, [self.features], self.dims, num_samples, 82 | support_sizes1, concat=self.concat, model_size=self.model_size) 83 | dim_mult = 2 if self.concat else 1 84 | 85 | self.outputs1 = tf.nn.l2_normalize(self.outputs1, 1) 86 | 87 | dim_mult = 2 if self.concat else 1 88 | self.node_pred = layers.Dense(dim_mult*self.dims[-1], self.num_classes, 89 | dropout=self.placeholders['dropout'], 90 | act=lambda x : x) 91 | # TF graph management 92 | self.node_preds = self.node_pred(self.outputs1) 93 | 94 | self._loss() 95 | grads_and_vars = self.optimizer.compute_gradients(self.loss) 96 | clipped_grads_and_vars = [(tf.clip_by_value(grad, -5.0, 5.0) if grad is not None else None, var) 97 | for grad, var in grads_and_vars] 98 | self.grad, _ = clipped_grads_and_vars[0] 99 | self.opt_op = self.optimizer.apply_gradients(clipped_grads_and_vars) 100 | self.preds = self.predict() 101 | 102 | def _loss(self): 103 | # Weight decay loss 104 | for aggregator in self.aggregators: 105 | for var in aggregator.vars.values(): 106 | self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var) 107 | for var in self.node_pred.vars.values(): 108 | self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var) 109 | 110 | # classification loss 111 | if self.sigmoid_loss: 112 | self.loss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits( 113 | logits=self.node_preds, 114 | labels=self.placeholders['labels'])) 115 | else: 116 | self.loss += tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( 117 | logits=self.node_preds, 118 | labels=self.placeholders['labels'])) 119 | 120 | tf.summary.scalar('loss', self.loss) 121 | 122 | def predict(self): 123 | if self.sigmoid_loss: 124 | return tf.nn.sigmoid(self.node_preds) 125 | else: 126 | return tf.nn.softmax(self.node_preds) 127 | -------------------------------------------------------------------------------- /algorithms/GraphSage/utils.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import numpy as np 4 | import random 5 | import json 6 | import sys 7 | import os 8 | import scipy.io as sio 9 | 10 | 
import networkx as nx 11 | from networkx.readwrite import json_graph 12 | version_info = list(map(int, nx.__version__.split('.'))) 13 | major = version_info[0] 14 | minor = version_info[1] 15 | assert (major <= 1) and (minor <= 11), "networkx major version > 1.11" 16 | 17 | 18 | WALK_LEN=5 19 | N_WALKS=50 20 | 21 | def load_data_dblp(prefix='./example_data/', file_name = 'YelpChi.mat'): 22 | training_size = 0.8 23 | data = sio.loadmat(prefix + file_name) 24 | truelabels, features = data['label'], data['features'].astype(float) 25 | truelabels = truelabels.tolist()[0] 26 | features = features.todense() 27 | N = features.shape[0] 28 | rownetworks = [data['net_rur']] 29 | # rownetworks = [data['net_APA'] - np.eye(N), data['net_APCPA'] - np.eye(N), data['net_APTPA'] - np.eye(N)] 30 | index = range(len(truelabels)) 31 | train_num = int(len(truelabels) * training_size) 32 | train_idx = set(np.random.choice(index, train_num, replace=False)) 33 | test_idx = set(index).difference(train_idx) 34 | # X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.4, random_state=48, 35 | # shuffle=True) 36 | return rownetworks, features, truelabels, train_idx, test_idx 37 | 38 | def load_data(prefix='./example_data/', file_name = 'YelpChi.mat', normalize=True, load_walks=False): 39 | adjs, feats, truelabels, train_idx, test_idx = load_data_dblp(prefix, file_name) 40 | gs = [nx.to_networkx_graph(adj) for adj in adjs] 41 | id_map = {int(i):i for i in range(len(truelabels))} 42 | class_map = {int(i):truelabels[i] for i in range(len(truelabels))} 43 | walks = [] 44 | G = gs[0] # change the index to specify which adj matrix to use for aggregation 45 | for node in G.nodes(): 46 | G.node[node]['feature'] = feats[node,:].tolist()[0] 47 | G.node[node]['label'] = [truelabels[node]] 48 | if node in train_idx: 49 | G.node[node]['test'] = False 50 | G.node[node]['val'] = False 51 | else: 52 | G.node[node]['test'] = True 53 | G.node[node]['val'] = True 54 | broken_count = 0 55 | for node in G.nodes(): 56 | if not 'val' in G.node[node] or not 'test' in G.node[node]: 57 | G.remove_node(node) 58 | broken_count += 1 59 | print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count)) 60 | print("Loaded data.. now preprocessing..") 61 | for edge in G.edges(): 62 | G[edge[0]][edge[1]]['train_removed'] = False 63 | 64 | if normalize and not feats is None: 65 | from sklearn.preprocessing import StandardScaler 66 | train_ids = np.array([id_map[n] for n in G.nodes()]) 67 | train_feats = feats[train_ids] 68 | scaler = StandardScaler() 69 | scaler.fit(train_feats) 70 | feats = scaler.transform(feats) 71 | 72 | if load_walks: 73 | with open(prefix + "-walks.txt") as fp: 74 | for line in fp: 75 | walks.append(map(conversion, line.split())) 76 | 77 | return G, feats, id_map, walks, class_map 78 | 79 | 80 | 81 | def load_data_ori(prefix, normalize=True, load_walks=False): 82 | G_data = json.load(open(prefix + "-G.json")) 83 | G = json_graph.node_link_graph(G_data) 84 | if isinstance(G.nodes()[0], int): 85 | conversion = lambda n : int(n) 86 | else: 87 | conversion = lambda n : n 88 | 89 | if os.path.exists(prefix + "-feats.npy"): 90 | feats = np.load(prefix + "-feats.npy") 91 | else: 92 | print("No features present.. 
Only identity features will be used.") 93 | feats = None 94 | id_map = json.load(open(prefix + "-id_map.json")) 95 | id_map = {conversion(k):int(v) for k,v in id_map.items()} 96 | walks = [] 97 | class_map = json.load(open(prefix + "-class_map.json")) 98 | if isinstance(list(class_map.values())[0], list): 99 | lab_conversion = lambda n : n 100 | else: 101 | lab_conversion = lambda n : int(n) 102 | 103 | class_map = {conversion(k):lab_conversion(v) for k,v in class_map.items()} 104 | 105 | ## Remove all nodes that do not have val/test annotations 106 | ## (necessary because of networkx weirdness with the Reddit data) 107 | broken_count = 0 108 | for node in G.nodes(): 109 | if not 'val' in G.node[node] or not 'test' in G.node[node]: 110 | G.remove_node(node) 111 | broken_count += 1 112 | print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count)) 113 | 114 | ## Make sure the graph has edge train_removed annotations 115 | ## (some datasets might already have this..) 116 | print("Loaded data.. now preprocessing..") 117 | for edge in G.edges(): 118 | if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or 119 | G.node[edge[0]]['test'] or G.node[edge[1]]['test']): 120 | G[edge[0]][edge[1]]['train_removed'] = True 121 | else: 122 | G[edge[0]][edge[1]]['train_removed'] = False 123 | 124 | if normalize and not feats is None: 125 | from sklearn.preprocessing import StandardScaler 126 | train_ids = np.array([id_map[n] for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']]) 127 | train_feats = feats[train_ids] 128 | scaler = StandardScaler() 129 | scaler.fit(train_feats) 130 | feats = scaler.transform(feats) 131 | 132 | if load_walks: 133 | with open(prefix + "-walks.txt") as fp: 134 | for line in fp: 135 | walks.append(map(conversion, line.split())) 136 | 137 | return G, feats, id_map, walks, class_map 138 | 139 | def run_random_walks(G, nodes, num_walks=N_WALKS): 140 | pairs = [] 141 | for count, node in enumerate(nodes): 142 | if G.degree(node) == 0: 143 | continue 144 | for i in range(num_walks): 145 | curr_node = node 146 | for j in range(WALK_LEN): 147 | next_node = random.choice(G.neighbors(curr_node)) 148 | # self co-occurrences are useless 149 | if curr_node != node: 150 | pairs.append((node,curr_node)) 151 | curr_node = next_node 152 | if count % 1000 == 0: 153 | print("Done walks for", count, "nodes") 154 | return pairs 155 | 156 | if __name__ == "__main__": 157 | """ Run random walks """ 158 | graph_file = sys.argv[1] 159 | out_file = sys.argv[2] 160 | G_data = json.load(open(graph_file)) 161 | G = json_graph.node_link_graph(G_data) 162 | nodes = [n for n in G.nodes() if not G.node[n]["val"] and not G.node[n]["test"]] 163 | G = G.subgraph(nodes) 164 | pairs = run_random_walks(G, nodes) 165 | with open(out_file, "w") as fp: 166 | fp.write("\n".join([str(p[0]) + "\t" + str(p[1]) for p in pairs])) 167 | -------------------------------------------------------------------------------- /algorithms/HACUD/README.md: -------------------------------------------------------------------------------- 1 | # HACUD 2 | 3 | ## Paper 4 | 5 | The HACUD model is proposed by the [paper](https://aaai.org/ojs/index.php/AAAI/article/view/3884) below: 6 | 7 | ```bibtex 8 | @inproceedings{DBLP:conf/aaai/HuZSZLQ19, 9 | author = {Binbin Hu and 10 | Zhiqiang Zhang and 11 | Chuan Shi and 12 | Jun Zhou and 13 | Xiaolong Li and 14 | Yuan Qi}, 15 | title = {Cash-Out User Detection Based on Attributed Heterogeneous Information 16 | Network with a 
Hierarchical Attention Mechanism}, 17 | booktitle = {The Thirty-Third AAAI Conference on Artificial Intelligence}, 18 | year = {2019} 19 | } 20 | ``` 21 | 22 | ## Run the code 23 | 24 | Go to `algorithms/HACUD/`,and run the following command in the terminal: 25 | 26 | `python main.py --dataset dblp --gpu 0 --epoch 100 --embed_size 16 --batch_size 64 --lr 1e-4 ` 27 | 28 | ## Meaning of the arguments 29 | 30 | ``` 31 | --lr: learning rate 32 | --gpu: gpu id 33 | --epoch: number of training epoches 34 | --embed_size: size of the hidden representations of nodes 35 | --batch_size: training batch size 36 | ``` 37 | 38 | There are also several optional arguments for this model, read parse.py for details. 39 | 40 | ## Note 41 | 42 | - Before running the code, please remember unzip the given dataset -------------------------------------------------------------------------------- /algorithms/HACUD/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | -------------------------------------------------------------------------------- /algorithms/HACUD/data_loader.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Hengrui Zhang (@hengruizhang98) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import numpy as np 7 | from sklearn.model_selection import train_test_split 8 | import scipy.io as sio 9 | 10 | import zipfile 11 | 12 | 13 | # zip_src = '../dataset/DBLP4057_GAT_with_idx_tra200_val_800.zip' 14 | # dst_dir = '../dataset' 15 | def unzip_file(zip_src, dst_dir): 16 | iz = zipfile.is_zipfile(zip_src) 17 | if iz: 18 | zf = zipfile.ZipFile(zip_src, 'r') 19 | for file in zf.namelist(): 20 | zf.extract(file, dst_dir) 21 | else: 22 | print('Zip Error.') 23 | 24 | 25 | def load_data_dblp(path='../../dataset/DBLP4057_GAT_with_idx_tra200_val_800.mat'): 26 | data = sio.loadmat(path) 27 | truelabels, features = data['label'], data['features'].astype(float) 28 | N = features.shape[0] 29 | rownetworks = [] 30 | 31 | rownetworks.append(data['net_APA'] - np.eye(N)) 32 | rownetworks.append(data['net_APCPA'] - np.eye(N)) 33 | rownetworks.append(data['net_APTPA'] - np.eye(N)) 34 | 35 | # rownetworks = [data['net_APA'] - np.eye(N), data['net_APCPA'] - np.eye(N), data['net_APTPA'] - np.eye(N)] 36 | y = truelabels 37 | index = range(len(y)) 38 | X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.4, random_state=48, 39 | shuffle=True) 40 | 41 | return rownetworks, features, X_train, y_train, X_test, y_test 42 | 43 | 44 | def load_example_semi(): 45 | # example data for SemiGNN 46 | features = np.array([[1, 1, 0, 0, 0, 0, 0], 47 | [0, 0, 1, 0, 0, 0, 0], 48 | [0, 0, 0, 1, 0, 0, 0], 49 | [0, 0, 0, 0, 0, 1, 0], 50 | [0, 0, 0, 0, 1, 0, 1], 51 | [1, 0, 1, 1, 0, 0, 0], 52 | [0, 1, 0, 0, 1, 0, 0], 53 | [0, 0, 0, 0, 0, 1, 1] 54 | ]) 55 | N = features.shape[0] 56 | rownetworks = [np.array([[1, 0, 0, 1, 0, 1, 1, 1], 57 | [1, 0, 0, 1, 1, 1, 0, 1], 58 | [1, 0, 0, 0, 0, 0, 0, 1], 59 | [0, 1, 0, 0, 1, 1, 1, 0], 60 | [0, 1, 1, 1, 0, 1, 0, 0], 61 | [1, 0, 0, 1, 1, 1, 0, 1], 62 | [1, 0, 0, 0, 0, 0, 0, 1], 63 | [0, 1, 0, 0, 1, 1, 1, 0]]), 64 | np.array([[1, 0, 0, 0, 0, 1, 1, 1], 65 | [0, 1, 0, 0, 1, 1, 0, 0], 66 | [0, 1, 1, 1, 0, 0, 0, 0], 67 | [0, 0, 1, 1, 1, 0, 0, 1], 68 | [1, 1, 0, 1, 1, 0, 0, 0], 69 | [1, 0, 0, 1, 0, 1, 1, 1], 70 | [1, 0, 0, 1, 1, 1, 0, 1], 71 | [1, 0, 0, 0, 0, 0, 0, 1]])] 72 | y = 
np.array([[0, 1], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]) 73 | index = range(len(y)) 74 | X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.2, random_state=48, 75 | shuffle=True) # test_size=0.25 batch——size=2 76 | 77 | return rownetworks, features, X_train, y_train, X_test, y_test 78 | 79 | 80 | def load_example_gem(): 81 | # example data for GEM 82 | # node=8 p=7 D=2 83 | features = np.array([[5, 3, 0, 1, 0, 0, 0, 1, 0], 84 | [2, 3, 1, 2, 0, 0, 0, 1, 0], 85 | [3, 1, 6, 4, 0, 0, 1, 1, 0], 86 | [0, 0, 2, 4, 4, 1, 0, 1, 1], 87 | [0, 0, 3, 3, 1, 0, 1, 0, 1], 88 | [1, 2, 5, 1, 4, 1, 0, 0, 1], 89 | [0, 1, 3, 5, 1, 0, 0, 0, 1], 90 | [0, 3, 4, 5, 2, 1, 1, 0, 1] 91 | ]) 92 | N = features.shape[0] 93 | rownetworks = [np.array([[1, 1, 1, 1, 0, 0, 0, 0], 94 | [1, 1, 1, 1, 0, 0, 0, 0], 95 | [1, 1, 1, 1, 0, 0, 0, 0], 96 | [1, 1, 1, 1, 0, 0, 0, 0], 97 | [0, 0, 0, 0, 0, 0, 0, 0], 98 | [0, 0, 0, 0, 0, 0, 0, 0], 99 | [0, 0, 0, 0, 0, 0, 0, 0], 100 | [0, 0, 0, 0, 0, 0, 0, 0]]), 101 | np.array([[0, 0, 0, 0, 0, 0, 0, 0], 102 | [0, 0, 0, 0, 0, 0, 0, 0], 103 | [0, 0, 0, 0, 0, 0, 0, 0], 104 | [0, 0, 0, 1, 1, 1, 1, 1], 105 | [0, 0, 0, 1, 1, 1, 1, 1], 106 | [0, 0, 0, 1, 1, 1, 1, 1], 107 | [0, 0, 0, 1, 1, 1, 1, 1], 108 | [0, 0, 0, 1, 1, 1, 1, 1]])] 109 | # y = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) 110 | y = np.array([0, 0, 0, 0, 1, 1, 1, 1]) 111 | y = y[:, np.newaxis] 112 | index = range(len(y)) 113 | X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.2, random_state=8, 114 | shuffle=True) 115 | 116 | return rownetworks, features, X_train, y_train, X_test, y_test 117 | 118 | 119 | def load_data_gas(): 120 | # example data for GAS 121 | # construct U-E-I network 122 | user_review_adj = [[0, 1], [2], [3], [5], [4, 6]] 123 | user_review_adj = utils.pad_adjlist(user_review_adj) 124 | user_item_adj = [[0, 1], [0], [0], [2], [1, 2]] 125 | user_item_adj = utils.pad_adjlist(user_item_adj) 126 | item_review_adj = [[0, 2, 3], [1, 4], [5, 6]] 127 | item_review_adj = utils.pad_adjlist(item_review_adj) 128 | item_user_adj = [[0, 1, 2], [0, 4], [3, 4]] 129 | item_user_adj = utils.pad_adjlist(item_user_adj) 130 | review_item_adj = [0, 1, 0, 0, 1, 2, 2] 131 | review_user_adj = [0, 0, 1, 2, 4, 3, 4] 132 | 133 | # initialize review_vecs 134 | review_vecs = np.array([[1, 0, 0, 1, 0], 135 | [1, 0, 0, 1, 1], 136 | [1, 0, 0, 0, 0], 137 | [0, 1, 0, 0, 1], 138 | [0, 1, 1, 1, 0], 139 | [0, 0, 1, 1, 1], 140 | [1, 1, 0, 1, 1]]) 141 | 142 | # initialize user_vecs and item_vecs with user_review_adj and item_review_adj 143 | # for example, u0 has r1 and r0, then we get the first line of user_vecs: [1, 1, 0, 0, 0, 0, 0] 144 | user_vecs = np.array([[1, 1, 0, 0, 0, 0, 0], 145 | [0, 0, 1, 0, 0, 0, 0], 146 | [0, 0, 0, 1, 0, 0, 0], 147 | [0, 0, 0, 0, 0, 1, 0], 148 | [0, 0, 0, 0, 1, 0, 1]]) 149 | item_vecs = np.array([[1, 0, 1, 1, 0, 0, 0], 150 | [0, 1, 0, 0, 1, 0, 0], 151 | [0, 0, 0, 0, 0, 1, 1]]) 152 | features = [review_vecs, user_vecs, item_vecs] 153 | 154 | # initialize the Comment Graph 155 | homo_adj = [[1, 0, 0, 0, 1, 1, 1], 156 | [1, 0, 0, 0, 1, 1, 0], 157 | [0, 0, 0, 1, 1, 1, 0], 158 | [1, 0, 1, 0, 0, 1, 0], 159 | [0, 1, 1, 1, 1, 0, 0], 160 | [0, 1, 1, 0, 1, 0, 0], 161 | [0, 1, 0, 0, 1, 0, 0]] 162 | 163 | adjs = [user_review_adj, user_item_adj, item_review_adj, item_user_adj, review_user_adj, review_item_adj, homo_adj] 164 | 165 | y = np.array([[0, 1], [1, 0], [1, 0], [0, 1], [1, 0], [1, 0], [0, 1], [1, 0]]) 166 | index = range(len(y)) 167 | X_train, 
X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.4, random_state=48, 168 | shuffle=True) 169 | 170 | return adjs, features, X_train, y_train, X_test, y_test 171 | -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_adj_0_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_adj_0_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_adj_1_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_adj_1_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_adj_2_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_adj_2_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_mean_adj_0_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_mean_adj_0_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_mean_adj_1_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_mean_adj_1_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_mean_adj_2_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_mean_adj_2_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_norm_adj_0_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_norm_adj_0_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_norm_adj_1_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_norm_adj_1_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/dblp/s_norm_adj_2_mat.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/algorithms/HACUD/dblp/s_norm_adj_2_mat.npz -------------------------------------------------------------------------------- /algorithms/HACUD/get_data.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Hengrui Zhang (@hengruizhang98) and UIC BDSC Lab 3 | DGFraud (A Deep 
Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import numpy as np 7 | import random as rd 8 | import scipy.sparse as sp 9 | from time import time 10 | from data_loader import load_data_dblp 11 | import os 12 | 13 | class Data(object): 14 | def __init__(self, path, save_path): 15 | self.path = path 16 | self.save_path = save_path 17 | 18 | self.rownetworks, self.features, self.X_train, self.y_train, self.X_test, self.y_test = load_data_dblp(path) 19 | self.n_nodes = 0 20 | self.n_train, self.n_test = 0, 0 21 | 22 | self.n_nodes = len(self.features) 23 | self.n_train = len(self.X_train) 24 | self.n_test = len(self.X_test) 25 | 26 | self.n_metapath = len(self.rownetworks) 27 | adj = [] 28 | u_index = [] 29 | v_index = [] 30 | self.n_int = [] 31 | 32 | for i in range(self.n_metapath): 33 | z = self.rownetworks[i] 34 | adj.append(z) 35 | u_index.append(np.where(z)[0]) 36 | v_index.append(np.where(z)[1]) 37 | self.n_int.append(len(np.where(z)[0])) 38 | 39 | self.print_statistics() 40 | 41 | self.R = [] 42 | for i in range(self.n_metapath): 43 | R = sp.dok_matrix((self.n_nodes, self.n_nodes), dtype=np.float32) 44 | R[u_index[i], v_index[i]] = 1 45 | self.R.append(R) 46 | 47 | 48 | def get_adj_mat(self): 49 | 50 | try: 51 | t1 = time() 52 | adj_mat = [] 53 | norm_adj_mat = [] 54 | mean_adj_mat = [] 55 | 56 | for i in range(self.n_metapath): 57 | adj = sp.load_npz(self.save_path + '/s_adj_%d_mat.npz' %i) 58 | norm = sp.load_npz(self.save_path + '/s_norm_adj_%d_mat.npz' %i) 59 | mean = sp.load_npz(self.save_path + '/s_mean_adj_%d_mat.npz' %i) 60 | 61 | adj_mat.append(adj) 62 | norm_adj_mat.append(norm) 63 | mean_adj_mat.append(mean) 64 | 65 | print('already load adj matrix', adj_mat[0].shape, time() - t1) 66 | 67 | except Exception: 68 | adj_mat, norm_adj_mat, mean_adj_mat = self.create_adj_mat() 69 | for i in range(self.n_metapath): 70 | sp.save_npz(self.save_path + '/s_adj_%d_mat.npz' %i, adj_mat[i]) 71 | sp.save_npz(self.save_path + '/s_norm_adj_%d_mat.npz' %i, norm_adj_mat[i]) 72 | sp.save_npz(self.save_path + '/s_mean_adj_%d_mat.npz' %i, mean_adj_mat[i]) 73 | 74 | return adj_mat, norm_adj_mat, mean_adj_mat 75 | 76 | def create_adj_mat(self): 77 | 78 | def normalized_adj_single(adj): 79 | rowsum = np.array(adj.sum(1)) 80 | 81 | d_inv = np.power(rowsum, -1).flatten() 82 | d_inv[np.isinf(d_inv)] = 0. 
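            # Descriptive note (added): rowsum holds each node's degree; zero-degree rows
            # would yield an infinite 1/degree, so they are zeroed above and isolated
            # nodes simply stay all-zero in the resulting D^{-1} A matrix.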
83 | d_mat_inv = sp.diags(d_inv) 84 | 85 | norm_adj = d_mat_inv.dot(adj) 86 | # norm_adj = adj.dot(d_mat_inv) 87 | 88 | return norm_adj.tocoo() 89 | 90 | def check_adj_if_equal(adj): 91 | dense_A = np.array(adj.todense()) 92 | degree = np.sum(dense_A, axis=1, keepdims=False) 93 | 94 | temp = np.dot(np.diag(np.power(degree, -1)), dense_A) 95 | print('check normalized adjacency matrix whether equal to this laplacian matrix.') 96 | return temp 97 | 98 | _adj = [] 99 | norm_adj = [] 100 | mean_adj = [] 101 | for i in range(self.n_metapath): 102 | print('metapath', i) 103 | t1 = time() 104 | adj_mat = sp.dok_matrix((self.n_nodes, self.n_nodes), dtype=np.float32) 105 | adj_mat = adj_mat.tolil() 106 | R = self.R[i].tolil() 107 | 108 | adj_mat[:self.n_nodes,:self.n_nodes] = R 109 | adj_mat = adj_mat.todok() 110 | print('already create adjacency matrix', adj_mat.shape, time() - t1) 111 | 112 | t2 = time() 113 | 114 | norm_adj_mat = normalized_adj_single(adj_mat + sp.eye(adj_mat.shape[0])) 115 | mean_adj_mat = normalized_adj_single(adj_mat) 116 | 117 | print('already normalize adjacency matrix', time() - t2) 118 | 119 | _adj.append(adj_mat.tocsr()) 120 | norm_adj.append(norm_adj_mat.tocsr()) 121 | mean_adj.append(mean_adj_mat.tocsr()) 122 | return _adj, norm_adj, mean_adj 123 | 124 | 125 | 126 | 127 | 128 | def print_statistics(self): 129 | print('n_metapaths=%d' % (self.n_metapath)) 130 | print('n_metapahts=%d' % (self.n_metapath)) 131 | print('n_nodes=%d' % (self.n_nodes)) 132 | print('n_interactions=%s' % (self.n_int)) 133 | print('n_train=%d, n_test=%d, sparsity=%s' % (self.n_train, self.n_test, (np.array(self.n_int)/(self.n_nodes * self.n_nodes)))) 134 | 135 | 136 | def get_sparsity_split(self): 137 | try: 138 | split_uids, split_state = [], [] 139 | lines = open(self.save_path + '/sparsity.split', 'r').readlines() 140 | 141 | for idx, line in enumerate(lines): 142 | if idx % 2 == 0: 143 | split_state.append(line.strip()) 144 | print(line.strip()) 145 | else: 146 | split_uids.append([int(uid) for uid in line.strip().split(' ')]) 147 | print('get sparsity split.') 148 | 149 | except Exception: 150 | split_uids, split_state = self.create_sparsity_split() 151 | f = open(self.save_path + '/sparsity.split', 'w') 152 | for idx in range(len(split_state)): 153 | f.write(split_state[idx] + '\n') 154 | f.write(' '.join([str(uid) for uid in split_uids[idx]]) + '\n') 155 | print('create sparsity split.') 156 | 157 | return split_uids, split_state 158 | 159 | 160 | 161 | def create_sparsity_split(self): 162 | all_users_to_test = list(self.test_set.keys()) 163 | user_n_iid = dict() 164 | 165 | # generate a dictionary to store (key=n_iids, value=a list of uid). 166 | for uid in all_users_to_test: 167 | train_iids = self.train_items[uid] 168 | test_iids = self.test_set[uid] 169 | 170 | n_iids = len(train_iids) + len(test_iids) 171 | 172 | if n_iids not in user_n_iid.keys(): 173 | user_n_iid[n_iids] = [uid] 174 | else: 175 | user_n_iid[n_iids].append(uid) 176 | split_uids = list() 177 | 178 | # split the whole user set into four subset. 
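        # Note (added): get_sparsity_split/create_sparsity_split assume recommendation-style
        # attributes self.train_items and self.test_set, which are not populated in the
        # __init__ of this Data class and are not called from main.py; they only apply if
        # those attributes are provided externally.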
179 | temp = [] 180 | count = 1 181 | fold = 4 182 | n_count = (self.n_train + self.n_test) 183 | n_rates = 0 184 | 185 | split_state = [] 186 | for idx, n_iids in enumerate(sorted(user_n_iid)): 187 | temp += user_n_iid[n_iids] 188 | n_rates += n_iids * len(user_n_iid[n_iids]) 189 | n_count -= n_iids * len(user_n_iid[n_iids]) 190 | 191 | if n_rates >= count * 0.25 * (self.n_train + self.n_test): 192 | split_uids.append(temp) 193 | 194 | state = '#inter per user<=[%d], #users=[%d], #all rates=[%d]' %(n_iids, len(temp), n_rates) 195 | split_state.append(state) 196 | print(state) 197 | 198 | temp = [] 199 | n_rates = 0 200 | fold -= 1 201 | 202 | if idx == len(user_n_iid.keys()) - 1 or n_count == 0: 203 | split_uids.append(temp) 204 | 205 | state = '#inter per user<=[%d], #users=[%d], #all rates=[%d]' % (n_iids, len(temp), n_rates) 206 | split_state.append(state) 207 | print(state) 208 | 209 | return split_uids, split_state 210 | 211 | # if __name__ == '__main__': 212 | # path = "../../dataset/DBLP4057_GAT_with_idx_tra200_val_800.mat" 213 | # data_generator = Data(path=path, save_path = save_path) 214 | -------------------------------------------------------------------------------- /algorithms/HACUD/main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Hengrui Zhang (@hengruizhang98) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import numpy as np 7 | import pandas as pd 8 | import os 9 | from time import time 10 | import random 11 | import tensorflow as tf 12 | import scipy.sparse as sp 13 | from sklearn import metrics 14 | from parse import parse_args 15 | from get_data import Data 16 | from model import Model 17 | 18 | 19 | def calc_f1(y_true, y_pred): 20 | 21 | y_true = np.argmax(y_true, axis=1) 22 | y_pred = np.argmax(y_pred, axis=1) 23 | 24 | return metrics.f1_score(y_true, y_pred, average="micro"), metrics.f1_score(y_true, y_pred, average="macro") 25 | 26 | def cal_acc(y_true, y_pred): 27 | y_true = np.argmax(y_true, axis=1) 28 | y_pred = np.argmax(y_pred, axis=1) 29 | 30 | return metrics.accuracy_score(y_true, y_pred) 31 | 32 | # a = 0 33 | # b = 0 34 | # for i in range(len(y_true)): 35 | # if y_true[i] == y_pred[i]: 36 | # a+=1 37 | # b+=1 38 | # return a/b 39 | 40 | 41 | # def calc_auc(y_true, y_pred): 42 | # return metrics.roc_auc_score(y_true, y_pred) 43 | 44 | 45 | if __name__ == '__main__': 46 | 47 | args = parse_args() 48 | 49 | if args.dataset == 'dblp': 50 | path = "../../dataset/DBLP4057_GAT_with_idx_tra200_val_800.mat" 51 | save_path = "../HACUD/dblp" 52 | 53 | data_generator = Data(path=path, save_path = save_path) 54 | 55 | X_train = data_generator.X_train 56 | X_test = data_generator.X_test 57 | 58 | y_train = data_generator.y_train 59 | y_test = data_generator.y_test 60 | 61 | os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu) 62 | 63 | config = dict() 64 | config['n_nodes'] = data_generator.n_nodes 65 | config['n_metapath'] = data_generator.n_metapath 66 | config['n_class'] = y_train.shape[1] 67 | 68 | plain_adj, norm_adj, mean_adj = data_generator.get_adj_mat() 69 | 70 | features = data_generator.features 71 | 72 | config['features'] = features 73 | 74 | if args.adj_type == 'plain': 75 | config['norm_adj'] = plain_adj 76 | print('use the plain adjacency matrix') 77 | 78 | elif args.adj_type == 'norm': 79 | config['norm_adj'] = norm_adj 80 | print('use the normalized adjacency matrix') 81 | 82 | else: 83 | 
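        # any other --adj_type (e.g. 'mean'): row-normalised adjacency without self-loops,
        # with the identity matrix added back here. Note (added): this branch reads
        # args.n_metapath, which does not appear among the arguments defined in parse.py;
        # data_generator.n_metapath holds the equivalent value.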
config['norm_adj'] = [] 84 | for i in range(args.n_metapath): 85 | config['norm_adj'].append(mean_adj[i] + sp.eye(mean_adj[i].shape[0])) 86 | print('use the mean adjacency matrix') 87 | 88 | t0 = time() 89 | 90 | pretrain_data = None 91 | 92 | model = Model(data_config=config, pretrain_data=pretrain_data, args = args) 93 | 94 | 95 | config = tf.ConfigProto() 96 | config.gpu_options.allow_growth = True 97 | 98 | 99 | with tf.Session(config=config) as sess: 100 | 101 | sess.run(tf.global_variables_initializer()) 102 | cur_best_pre_0 = 0. 103 | print('without pretraining.') 104 | 105 | ''' Train ''' 106 | loss_loger, pre_loger, rec_loger, ndcg_loger, hit_loger, auc_loger = [], [], [], [], [], [] 107 | stopping_step = 0 108 | should_stop = False 109 | 110 | for epoch in range(args.epoch): 111 | t1 = time() 112 | loss, ce_loss = 0., 0. 113 | n_batch = (data_generator.n_train-1) // args.batch_size + 1 114 | 115 | for idx in range(n_batch): 116 | if idx == n_batch - 1 : 117 | nodes = X_train[idx*args.batch_size:] 118 | labels = y_train[idx*args.batch_size:] 119 | else: 120 | nodes = X_train[idx*int(args.batch_size):(idx+1)*int(args.batch_size)] 121 | labels= y_train[idx*int(args.batch_size):(idx+1)*int(args.batch_size)] 122 | 123 | batch_loss, batch_ce_loss, reg_loss = model.train(sess, nodes, labels) 124 | 125 | loss += batch_loss 126 | ce_loss += batch_ce_loss 127 | 128 | test_nodes = X_test 129 | test_label = y_test 130 | 131 | test_loss, test_ce_loss, test_reg_loss, pred_label = model.eval(sess, test_nodes, test_label) 132 | 133 | f1_scores = calc_f1(test_label, pred_label) 134 | acc = cal_acc(test_label, pred_label) 135 | 136 | # auc_score = calc_auc(pred_label, test_label) 137 | 138 | val_f1_mic, val_f1_mac = f1_scores[0], f1_scores[1] 139 | 140 | if np.isnan(loss) == True: 141 | 142 | print('ERROR: loss is nan.') 143 | print('ce_loss =%s' % ce_loss) 144 | sys.exit() 145 | 146 | log1 = 'Epoch {} Train: {:.4f} CE: {:.4f} Reg: {:.4f} Test: {:.4f} F1_mic: {:.4f} F1_mac: {:.4f} Accuracy: {:.4f}'.\ 147 | format(epoch, loss, ce_loss, reg_loss, test_loss, val_f1_mic, val_f1_mac, acc) 148 | 149 | print(log1) 150 | -------------------------------------------------------------------------------- /algorithms/HACUD/model.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Hengrui Zhang (@hengruizhang98) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import tensorflow as tf 7 | import os 8 | import sys 9 | import numpy as np 10 | os.environ['TF_CPP_MIN_LOG_LEVEL']='2' 11 | 12 | 13 | class Model(object): 14 | def __init__(self, data_config, pretrain_data, args): 15 | self.model_type = 'hacud' 16 | self.adj_type = args.adj_type 17 | self.early_stop = args.early_stop 18 | self.pretrain_data = pretrain_data 19 | os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu 20 | self.n_nodes = data_config['n_nodes'] 21 | self.n_metapath = data_config['n_metapath'] 22 | self.n_class = data_config['n_class'] 23 | 24 | self.n_fold = args.n_fold 25 | self.n_fc = args.n_fc 26 | self.fc = eval(args.fc) 27 | self.reg = args.reg 28 | 29 | 30 | self.norm_adj = data_config['norm_adj'] 31 | 32 | self.features = data_config['features'] 33 | self.f_dim = self.features.shape[1] 34 | self.lr = args.lr 35 | 36 | self.emb_dim = args.embed_size 37 | self.batch_size = args.batch_size 38 | 39 | self.verbose = args.verbose 40 | 41 | ''' 42 | Create Placeholder for Input Data & Dropout. 
43 | ''' 44 | # placeholder definition 45 | self.nodes = tf.placeholder(tf.int32, shape=(None,)) 46 | 47 | ''' 48 | Create Model Parameters (i.e., Initialize Weights). 49 | ''' 50 | # initialization of model parameters 51 | self.weights = self._init_weights() 52 | 53 | ''' 54 | Compute Graph-based Representations of all nodes 55 | ''' 56 | self.n_embeddings = self._create_embedding() 57 | 58 | ''' 59 | Establish the representations of nodes in a batch. 60 | ''' 61 | self.batch_embeddings = tf.nn.embedding_lookup(self.n_embeddings, self.nodes) 62 | 63 | self.label = tf.placeholder(tf.float32, shape=(None, self.n_class)) 64 | 65 | self.pred_label = self.pred(self.batch_embeddings) 66 | 67 | self.loss = self.create_loss(self.pred_label, self.label) 68 | 69 | self.opt = tf.train.AdamOptimizer(learning_rate=self.lr).minimize(self.loss) 70 | 71 | def _init_weights(self): 72 | all_weights = dict() 73 | 74 | initializer = tf.contrib.layers.xavier_initializer() 75 | 76 | print('using xavier initialization') 77 | 78 | self.fc = [self.emb_dim] + self.fc 79 | 80 | all_weights['W'] = tf.Variable( 81 | initializer([self.f_dim, self.emb_dim]), name='W') 82 | all_weights['b'] = tf.Variable( 83 | initializer([1, self.emb_dim]), name='b') 84 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['W']) 85 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['b']) 86 | 87 | 88 | for n in range(self.n_fc): 89 | all_weights['W_%d' % n] = tf.Variable( 90 | initializer([self.fc[n], self.fc[n+1]]), name='W_%d' % n) 91 | all_weights['b_%d' % n] = tf.Variable( 92 | initializer([1, self.fc[n+1]]), name='b_%d' % n) 93 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['W_%d' %n]) 94 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['b_%d' %n]) 95 | 96 | for n in range(self.n_metapath): 97 | all_weights['W_rho_%d' % n] = tf.Variable( 98 | initializer([self.f_dim, self.emb_dim]), name='W_rho_%d' % n) 99 | all_weights['b_rho_%d' % n] = tf.Variable( 100 | initializer([1, self.emb_dim]), name='b_rho_%d' % n) 101 | all_weights['W_f_%d' % n] = tf.Variable( 102 | initializer([2*self.emb_dim, self.emb_dim]), name='W_f_%d' % n) 103 | all_weights['b_f_%d' % n] = tf.Variable( 104 | initializer([1, self.emb_dim]), name='b_f_%d' % n) 105 | all_weights['z_%d' % n] = tf.Variable( 106 | initializer([1, self.emb_dim*self.n_metapath]), name='z_%d' % n) 107 | 108 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['W_rho_%d' %n]) 109 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['b_rho_%d' %n]) 110 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['W_f_%d' %n]) 111 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['b_f_%d' %n]) 112 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['z_%d' %n]) 113 | 114 | all_weights['W_f1'] = tf.Variable( 115 | initializer([2*self.emb_dim, self.emb_dim]), name='W_f1') 116 | all_weights['b_f1'] = tf.Variable( 117 | initializer([1, self.emb_dim]), name='b_f1') 118 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['W_f1']) 119 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['b_f1']) 120 | 121 | all_weights['W_f2'] = tf.Variable( 122 | initializer([self.emb_dim, self.emb_dim]), name='W_f2') 123 | all_weights['b_f2'] = tf.Variable( 124 | initializer([1, self.emb_dim]), name='b_f2') 125 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['W_f2']) 126 | tf.add_to_collection(tf.GraphKeys.WEIGHTS, all_weights['b_f2']) 127 | 128 | return all_weights 129 | 130 | def _split_A_hat(self, X): 131 | A_fold_hat = [] 132 | 133 | fold_len = 
(self.n_nodes) // self.n_fold 134 | for i_fold in range(self.n_fold): 135 | start = i_fold * fold_len 136 | if i_fold == self.n_fold -1: 137 | end = self.n_nodes 138 | else: 139 | end = (i_fold + 1) * fold_len 140 | A_fold_hat.append(self._convert_sp_mat_to_sp_tensor(X[start:end])) 141 | 142 | return A_fold_hat 143 | 144 | def _create_embedding(self): 145 | 146 | A_fold_hat = {} 147 | for n in range(self.n_metapath): 148 | A_fold_hat['%d' %n] = self._split_A_hat(self.norm_adj[n]) 149 | 150 | embeddings = self.features 151 | embeddings = embeddings.astype(np.float32) 152 | 153 | h = tf.matmul(embeddings, self.weights['W']) + self.weights['b'] 154 | 155 | embed_u = {} 156 | h_u = {} 157 | f_u = {} 158 | v_u = {} 159 | alp_u = {} 160 | alp_hat = {} 161 | f_tilde = {} 162 | 163 | for n in range(self.n_metapath): 164 | 165 | ''' Graph Convolution ''' 166 | embed_u['%d' %n] = [] 167 | for f in range(self.n_fold): 168 | embed_u['%d' %n].append(tf.sparse_tensor_dense_matmul(A_fold_hat['%d' %n][f], embeddings)) 169 | 170 | embed_u['%d' %n] = tf.concat(embed_u['%d' %n], 0) 171 | 172 | ''' Feature Fusion ''' 173 | h_u['%d' %n] = tf.matmul(embed_u['%d' %n], self.weights['W_rho_%d' %n]) + self.weights['b_rho_%d' %n] 174 | f_u['%d' %n] = tf.nn.relu(tf.matmul(tf.concat([h,h_u['%d' %n]],1), self.weights['W_f_%d' %n]) 175 | + self.weights['b_f_%d' %n]) 176 | ''' Feature Attention ''' 177 | v_u['%d' %n] = tf.nn.relu(tf.matmul(tf.concat([h,f_u['%d' %n]],1), self.weights['W_f1']) 178 | + self.weights['b_f1']) 179 | alp_u['%d' %n] = tf.nn.relu(tf.matmul(v_u['%d' %n], self.weights['W_f2']) 180 | + self.weights['b_f2']) 181 | 182 | alp_hat['%d' %n] = tf.nn.softmax(alp_u['%d' %n], axis = 1) 183 | 184 | f_tilde['%d' %n] = tf.multiply(alp_hat['%d' %n], f_u['%d' %n]) 185 | 186 | ''' Path Attention ''' 187 | 188 | f_c = [] 189 | for n in range(self.n_metapath): 190 | f_c.append(f_tilde['%d' %n]) 191 | f_c = tf.concat(f_c,1) 192 | 193 | 194 | for n in range(self.n_metapath): 195 | if n == 0: 196 | beta = tf.matmul(f_c, tf.transpose(self.weights['z_%d' % n])) 197 | f = f_tilde['%d' %n] 198 | f = tf.expand_dims(f, -1) 199 | 200 | else: 201 | beta = tf.concat([beta, tf.matmul(f_c, tf.transpose(self.weights['z_%d' % n]))], axis = 1) 202 | f = tf.concat([f,tf.expand_dims(f_tilde['%d' %n],-1)], axis = 2) 203 | 204 | beta_u = tf.nn.softmax(beta, axis = 1) 205 | beta_u = tf.transpose(tf.expand_dims(beta_u, 0),(1,0,2)) 206 | 207 | e_u = tf.multiply(beta_u, f) 208 | e_u = tf.reduce_sum(e_u, axis = 2) 209 | 210 | return e_u 211 | 212 | def pred(self, x): 213 | for n in range(self.n_fc): 214 | if n == self.n_fc - 1: 215 | x = tf.matmul(x, self.weights['W_%d' %n])+ self.weights['b_%d' %n] 216 | else: 217 | x = tf.nn.relu(tf.matmul(x, self.weights['W_%d' %n])+ 218 | self.weights['b_%d' %n]) 219 | return x 220 | 221 | def create_ce_loss(self, x, y): 222 | 223 | ce_loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=x, labels=y),0) 224 | 225 | return ce_loss 226 | 227 | def create_reg_loss(self): 228 | 229 | # for key in self.weights.keys(): 230 | # reg_loss += tf.contrib.layers.l2_regularizer(0.5)(self.weights[key]) 231 | # regularizer = tf.contrib.layers.l2_regularizer(0.5) 232 | # reg_loss += tf.contrib.layers.apply_regularization(regularizer) 233 | reg_loss = tf.add_n([tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()]) 234 | 235 | return reg_loss 236 | 237 | def create_loss(self, x, y): 238 | self.ce_loss = self.create_ce_loss(x,y) 239 | self.reg_loss = self.create_reg_loss() 240 | 241 | 
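        # total objective: softmax cross-entropy plus an L2 penalty over all trainable
        # variables, weighted by the --reg argument (default 1e-3 in parse.py)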
loss = self.ce_loss + self.reg * self.reg_loss 242 | 243 | return loss 244 | 245 | def _convert_sp_mat_to_sp_tensor(self, X): 246 | coo = X.tocoo().astype(np.float32) 247 | indices = np.mat([coo.row, coo.col]).transpose() 248 | return tf.SparseTensor(indices, coo.data, coo.shape) 249 | 250 | def train(self, sess, nodes, labels): 251 | _, batch_loss, batch_ce_loss, reg_loss = sess.run([self.opt, self.loss, self.ce_loss, self.reg_loss], 252 | feed_dict={self.nodes: nodes, self.label: labels}) 253 | return batch_loss, batch_ce_loss, reg_loss 254 | 255 | def eval(self, sess, nodes, labels): 256 | loss, ce_loss, reg_loss, pred_label = sess.run([self.loss, self.ce_loss, self.reg_loss, self.pred_label], 257 | feed_dict={self.nodes: nodes, self.label: labels}) 258 | return loss, ce_loss, reg_loss, pred_label 259 | -------------------------------------------------------------------------------- /algorithms/HACUD/parse.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Hengrui Zhang (@hengruizhang98) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import argparse 7 | 8 | def parse_args(): 9 | parser = argparse.ArgumentParser(description="Run HACUD.") 10 | parser.add_argument('--weights_path', nargs='?', default='', 11 | help='Store model path.') 12 | parser.add_argument('--data_path', nargs='?', default='../Data/', 13 | help='Input data path.') 14 | parser.add_argument('--proj_path', nargs='?', default='', 15 | help='Project path.') 16 | parser.add_argument('--dataset', nargs='?', default='dblp', 17 | help='Choose a dataset from {dblp, yelp}') 18 | parser.add_argument('--pretrain', type=int, default=0, 19 | help='0: No pretrain, -1: Pretrain with the learned embeddings, 1:Pretrain with stored models.') 20 | parser.add_argument('--verbose', type=int, default=1, 21 | help='Interval of evaluation.') 22 | parser.add_argument('--epoch', type=int, default=500, 23 | help='Number of epoch.') 24 | 25 | parser.add_argument('--embed_size', type=int, default=64, 26 | help='Embedding size.') 27 | parser.add_argument('--batch_size', type=int, default=64, 28 | help='Batch size.') 29 | parser.add_argument('--n_fold', type=int, default=20, 30 | help='number of fold.') 31 | parser.add_argument('--n_fc', type=int, default=4, 32 | help='number of fully-connected layers.') 33 | parser.add_argument('--fc', nargs='?', default='[32,16,8,4]', 34 | help='Output sizes of every layer') 35 | 36 | 37 | parser.add_argument('--lr', type=float, default=0.001, 38 | help='Learning rate.') 39 | parser.add_argument('--reg', type=float, default=1e-3, 40 | help='Regularization ratio.') 41 | 42 | parser.add_argument('--model_type', nargs='?', default='ngcf', 43 | help='Specify the name of model (ngcf).') 44 | parser.add_argument('--adj_type', nargs='?', default='norm', 45 | help='Specify the type of the adjacency (laplacian) matrix from {plain, norm, mean}.') 46 | parser.add_argument('--alg_type', nargs='?', default='ngcf', 47 | help='Specify the type of the graph convolutional layer from {ngcf, gcn, gcmc}.') 48 | 49 | parser.add_argument('--save_flag', type=int, default=0, 50 | help='0: Disable model saver, 1: Activate model saver') 51 | 52 | parser.add_argument('--test_flag', nargs='?', default='part', 53 | help='Specify the test type from {part, full}, indicating whether the reference is done in mini-batch') 54 | 55 | parser.add_argument('--gpu', nargs='?', default='0') 56 | 57 | 
parser.add_argument('--early_stop', type = int, default=10) 58 | 59 | parser.add_argument('--report', type=int, default=0, 60 | help='0: Disable performance report w.r.t. sparsity levels, 1: Show performance report w.r.t. sparsity levels') 61 | return parser.parse_args() 62 | 63 | -------------------------------------------------------------------------------- /algorithms/HACUD/utils.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Hengrui Zhang (@hengruizhang98) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import random 7 | import scipy.io as sio 8 | import scipy.sparse as sp 9 | import numpy as np 10 | 11 | 12 | # symmetrically normalize adjacency matrix 13 | def normalize_adj(adj): 14 | adj = adj + sp.eye(adj.shape[0]) 15 | adj = sp.coo_matrix(adj) 16 | rowsum = np.array(adj.sum(1)) 17 | d_inv_sqrt = np.power(rowsum, -0.5).flatten() 18 | d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0. 19 | d_mat_inv_sqrt = sp.diags(d_inv_sqrt) 20 | return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).A 21 | 22 | 23 | # Construct feed dictionary 24 | def construct_feed_dict(x, a, t, b, learning_rate, momentum, placeholders): 25 | feed_dict = dict() 26 | feed_dict.update({placeholders['x']: x}) 27 | feed_dict.update({placeholders['a']: a}) 28 | feed_dict.update({placeholders['t']: t}) 29 | feed_dict.update({placeholders['batch_index']: b}) 30 | feed_dict.update({placeholders['lr']: learning_rate}) 31 | feed_dict.update({placeholders['mom']: momentum}) 32 | feed_dict.update({placeholders['num_features_nonzero']: x[1].shape}) 33 | return feed_dict 34 | 35 | 36 | # Construct feed dictionary for SemiGNN 37 | def construct_feed_dict_semi(a, u_i, u_j, batch_graph_label, batch_data, batch_sup_label, learning_rate, momentum, 38 | placeholders): 39 | feed_dict = dict() 40 | feed_dict.update({placeholders['a']: a}) 41 | feed_dict.update({placeholders['u_i']: u_i}) 42 | feed_dict.update({placeholders['u_j']: u_j}) 43 | feed_dict.update({placeholders['graph_t']: batch_graph_label}) 44 | feed_dict.update({placeholders['batch_index']: batch_data}) 45 | feed_dict.update({placeholders['sup_t']: batch_sup_label}) 46 | feed_dict.update({placeholders['lr']: learning_rate}) 47 | feed_dict.update({placeholders['mom']: momentum}) 48 | return feed_dict 49 | 50 | 51 | # Construct feed dictionary for SemiGNN 52 | def construct_feed_dict_spam(h, adj_info, t, b, learning_rate, momentum, placeholders): 53 | feed_dict = dict() 54 | feed_dict.update({placeholders['user_review_adj']: adj_info[0]}) 55 | feed_dict.update({placeholders['user_item_adj']: adj_info[1]}) 56 | feed_dict.update({placeholders['item_review_adj']: adj_info[2]}) 57 | feed_dict.update({placeholders['item_user_adj']: adj_info[3]}) 58 | feed_dict.update({placeholders['review_user_adj']: adj_info[4]}) 59 | feed_dict.update({placeholders['review_item_adj']: adj_info[5]}) 60 | feed_dict.update({placeholders['homo_adj']: adj_info[6]}) 61 | feed_dict.update({placeholders['review_vecs']: h[0]}) 62 | feed_dict.update({placeholders['user_vecs']: h[1]}) 63 | feed_dict.update({placeholders['item_vecs']: h[2]}) 64 | feed_dict.update({placeholders['t']: t}) 65 | feed_dict.update({placeholders['batch_index']: b}) 66 | feed_dict.update({placeholders['lr']: learning_rate}) 67 | feed_dict.update({placeholders['mom']: momentum}) 68 | feed_dict.update({placeholders['num_features_nonzero']: h[0][1].shape}) 69 | return feed_dict 70 | 
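# Illustrative note (added): the helpers below pad a ragged adjacency list into a
# rectangular array. Positions inside each node's real neighbour list keep their
# original values; the remaining slots are filled by sampling (with replacement)
# from that node's own neighbours, so padded values vary across runs, e.g.
#   pad_adjlist([[0, 1], [2], [3, 4, 5]])
#   -> array([[0., 1., 0.],
#             [2., 2., 2.],
#             [3., 4., 5.]])   # the last entry of row 0 is a random choice from [0, 1]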
71 | 72 | def pad_adjlist(x_data): 73 | # Get lengths of each row of data 74 | lens = np.array([len(x_data[i]) for i in range(len(x_data))]) 75 | 76 | # Mask of valid places in each row 77 | mask = np.arange(lens.max()) < lens[:, None] 78 | 79 | # Setup output array and put elements from data into masked positions 80 | padded = np.zeros(mask.shape) 81 | for i in range(mask.shape[0]): 82 | padded[i] = np.random.choice(x_data[i], mask.shape[1]) 83 | padded[mask] = np.hstack((x_data[:])) 84 | return padded 85 | 86 | 87 | def matrix_to_adjlist(M, pad=True): 88 | adjlist = [] 89 | for i in range(len(M)): 90 | adjline = [i] 91 | for j in range(len(M[i])): 92 | if M[i][j] == 1: 93 | adjline.append(j) 94 | adjlist.append(adjline) 95 | if pad: 96 | adjlist = pad_adjlist(adjlist) 97 | return adjlist 98 | 99 | 100 | def adjlist_to_matrix(adjlist): 101 | nodes = len(adjlist) 102 | M = np.zeros((nodes, nodes)) 103 | for i in range(nodes): 104 | for j in adjlist[i]: 105 | M[i][j] = 1 106 | return M 107 | 108 | 109 | def pairs_to_matrix(pairs, nodes): 110 | M = np.zeros((nodes, nodes)) 111 | for i, j in pairs: 112 | M[i][j] = 1 113 | return M 114 | 115 | 116 | # Random walk on graph 117 | def generate_random_walk(adjlist, start, walklength): 118 | t = 1 119 | walk_path = np.array([start]) 120 | while t <= walklength: 121 | neighbors = adjlist[start] 122 | current = np.random.choice(neighbors) 123 | walk_path = np.append(walk_path, current) 124 | start = current 125 | t += 1 126 | return walk_path 127 | 128 | 129 | # sample multiple times for each node 130 | def random_walks(adjlist, numerate, walklength): 131 | nodes = range(0, len(adjlist)) # node index starts from zero 132 | walks = [] 133 | for n in range(numerate): 134 | for node in nodes: 135 | walks.append(generate_random_walk(adjlist, node, walklength)) 136 | pairs = [] 137 | for i in range(len(walks)): 138 | for j in range(1, len(walks[i])): 139 | pair = [walks[i][0], walks[i][j]] 140 | pairs.append(pair) 141 | return pairs 142 | 143 | 144 | def negative_sampling(adj_nodelist): 145 | degree = [len(neighbors) for neighbors in adj_nodelist] 146 | node_negative_distribution = np.power(np.array(degree, dtype=np.float32), 0.75) 147 | node_negative_distribution /= np.sum(node_negative_distribution) 148 | node_sampling = AliasSampling(prob=node_negative_distribution) 149 | return node_negative_distribution, node_sampling 150 | 151 | 152 | def get_negative_sampling(pairs, adj_nodelist, Q=3, node_sampling='atlas'): 153 | num_of_nodes = len(adj_nodelist) 154 | u_i = [] 155 | u_j = [] 156 | graph_label = [] 157 | node_negative_distribution, nodesampling = negative_sampling(adj_nodelist) 158 | for index in range(0, num_of_nodes): 159 | u_i.append(pairs[index][0]) 160 | u_j.append(pairs[index][1]) 161 | graph_label.append(1) 162 | for i in range(Q): 163 | while True: 164 | if node_sampling == 'numpy': 165 | negative_node = np.random.choice(num_of_nodes, node_negative_distribution) 166 | if negative_node not in adj_nodelist[pairs[index][0]]: 167 | break 168 | elif node_sampling == 'atlas': 169 | negative_node = nodesampling.sampling() 170 | if negative_node not in adj_nodelist[pairs[index][0]]: 171 | break 172 | elif node_sampling == 'uniform': 173 | negative_node = np.random.randint(0, num_of_nodes) 174 | if negative_node not in adj_nodelist[pairs[index][0]]: 175 | break 176 | u_i.append(pairs[index][0]) 177 | u_j.append(negative_node) 178 | graph_label.append(-1) 179 | graph_label = np.array(graph_label) 180 | graph_label = 
graph_label.reshape(graph_label.shape[0], 1) 181 | return u_i, u_j, graph_label 182 | 183 | 184 | # Reference: https://en.wikipedia.org/wiki/Alias_method 185 | class AliasSampling: 186 | 187 | def __init__(self, prob): 188 | self.n = len(prob) 189 | self.U = np.array(prob) * self.n 190 | self.K = [i for i in range(len(prob))] 191 | overfull, underfull = [], [] 192 | for i, U_i in enumerate(self.U): 193 | if U_i > 1: 194 | overfull.append(i) 195 | elif U_i < 1: 196 | underfull.append(i) 197 | while len(overfull) and len(underfull): 198 | i, j = overfull.pop(), underfull.pop() 199 | self.K[j] = i 200 | self.U[i] = self.U[i] - (1 - self.U[j]) 201 | if self.U[i] > 1: 202 | overfull.append(i) 203 | elif self.U[i] < 1: 204 | underfull.append(i) 205 | 206 | def sampling(self, n=1): 207 | x = np.random.rand(n) 208 | i = np.floor(self.n * x) 209 | y = self.n * x - i 210 | i = i.astype(np.int32) 211 | res = [i[k] if y[k] < self.U[i[k]] else self.K[i[k]] for k in range(n)] 212 | if n == 1: 213 | return res[0] 214 | else: 215 | return res 216 | -------------------------------------------------------------------------------- /algorithms/Player2Vec/Player2Vec.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | 6 | Player2Vec ('Key Player Identification in Underground Forums 7 | over Attributed Heterogeneous Information Network Embedding Framework') 8 | 9 | Parameters: 10 | meta: meta-path number 11 | nodes: total nodes number 12 | gcn_output1: the first gcn layer unit number 13 | gcn_output2: the second gcn layer unit number 14 | embedding: node feature dim 15 | encoding: nodes representation dim 16 | ''' 17 | import os 18 | import sys 19 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 20 | import tensorflow as tf 21 | from base_models.models import GCN 22 | from base_models.layers import AttentionLayer 23 | from algorithms.base_algorithm import Algorithm 24 | from utils import utils 25 | 26 | 27 | class Player2Vec(Algorithm): 28 | 29 | def __init__(self, 30 | session, 31 | meta, 32 | nodes, 33 | class_size, 34 | gcn_output1, 35 | embedding, 36 | encoding): 37 | self.meta = meta 38 | self.nodes = nodes 39 | self.class_size = class_size 40 | self.gcn_output1 = gcn_output1 41 | self.embedding = embedding 42 | self.encoding = encoding 43 | self.placeholders = {'a': tf.placeholder(tf.float32, [self.meta, self.nodes, self.nodes], 'adj'), 44 | 'x': tf.placeholder(tf.float32, [self.nodes, self.embedding], 'nxf'), 45 | 'batch_index': tf.placeholder(tf.int32, [None], 'index'), 46 | 't': tf.placeholder(tf.float32, [None, self.class_size], 'labels'), 47 | 'lr': tf.placeholder(tf.float32, [], 'learning_rate'), 48 | 'mom': tf.placeholder(tf.float32, [], 'momentum'), 49 | 'num_features_nonzero': tf.placeholder(tf.int32)} 50 | 51 | loss, probabilities = self.forward_propagation() 52 | self.loss, self.probabilities = loss, probabilities 53 | self.l2 = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(0.01), 54 | tf.trainable_variables()) 55 | 56 | self.pred = tf.one_hot(tf.argmax(self.probabilities, 1), class_size) 57 | print(self.pred.shape) 58 | self.correct_prediction = tf.equal(tf.argmax(self.probabilities, 1), tf.argmax(self.placeholders['t'], 1)) 59 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, "float")) 60 | 
print('Forward propagation finished.') 61 | 62 | self.sess = session 63 | self.optimizer = tf.train.AdamOptimizer(self.placeholders['lr']) 64 | gradients = self.optimizer.compute_gradients(self.loss + self.l2) 65 | capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None] 66 | self.train_op = self.optimizer.apply_gradients(capped_gradients) 67 | self.init = tf.global_variables_initializer() 68 | print('Backward propagation finished.') 69 | 70 | def forward_propagation(self): 71 | with tf.variable_scope('gcn'): 72 | # x = self.x 73 | # A = tf.reshape(self.a, [self.meta, self.nodes, self.nodes]) 74 | gcn_emb = [] 75 | for i in range(self.meta): 76 | gcn_out = tf.reshape(GCN(self.placeholders, self.gcn_output1, self.embedding, 77 | self.encoding, index=i).embedding(), [1, self.nodes * self.encoding]) 78 | gcn_emb.append(gcn_out) 79 | gcn_emb = tf.concat(gcn_emb, 0) 80 | assert gcn_emb.shape == [self.meta, self.nodes * self.encoding] 81 | print('GCN embedding over!') 82 | 83 | with tf.variable_scope('attention'): 84 | gat_out = AttentionLayer.attention(inputs=gcn_emb, attention_size=1, v_type='tanh') 85 | gat_out = tf.reshape(gat_out, [self.nodes, self.encoding]) 86 | print('Embedding with attention over!') 87 | 88 | with tf.variable_scope('classification'): 89 | batch_data = tf.matmul(tf.one_hot(self.placeholders['batch_index'], self.nodes), gat_out) 90 | W = tf.get_variable(name='weights', shape=[self.encoding, self.class_size], 91 | initializer=tf.contrib.layers.xavier_initializer()) 92 | b = tf.get_variable(name='bias', shape=[1, self.class_size], initializer=tf.zeros_initializer()) 93 | tf.transpose(batch_data, perm=[0, 1]) 94 | logits = tf.matmul(batch_data, W) + b 95 | loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.placeholders['t'], logits=logits) 96 | 97 | return loss, tf.nn.sigmoid(logits) 98 | 99 | def train(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 100 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 101 | 102 | outs = self.sess.run( 103 | [self.train_op, self.loss, self.accuracy, self.pred, self.probabilities], 104 | feed_dict=feed_dict) 105 | loss = outs[1] 106 | acc = outs[2] 107 | pred = outs[3] 108 | prob = outs[4] 109 | return loss, acc, pred, prob 110 | 111 | def test(self, x, a, t, b, learning_rate=1e-2, momentum=0.9): 112 | feed_dict = utils.construct_feed_dict(x, a, t, b, learning_rate, momentum, self.placeholders) 113 | acc, pred, probabilities, tags = self.sess.run( 114 | [self.accuracy, self.pred, self.probabilities, self.correct_prediction], 115 | feed_dict=feed_dict) 116 | return acc, pred, probabilities, tags 117 | -------------------------------------------------------------------------------- /algorithms/Player2Vec/Player2Vec_main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import tensorflow as tf 7 | import argparse 8 | import os 9 | import sys 10 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 11 | from algorithms.Player2Vec.Player2Vec import Player2Vec 12 | import time 13 | from utils.data_loader import * 14 | from utils.utils import * 15 | 16 | 17 | # os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' 18 | 19 | # init the common args, expect the model specific args 20 | 
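# Example invocation (a sketch; only flags defined in arg_parser below are used):
#   python Player2Vec_main.py --dataset_str dblp --epoch_num 30 --batch_size 1000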
def arg_parser(): 21 | parser = argparse.ArgumentParser() 22 | parser.add_argument('--seed', type=int, default=123, help='Random seed.') 23 | parser.add_argument('--dataset_str', type=str, default='dblp', help="['dblp','example']") 24 | parser.add_argument('--epoch_num', type=int, default=30, help='Number of epochs to train.') 25 | parser.add_argument('--batch_size', type=int, default=1000) 26 | parser.add_argument('--momentum', type=int, default=0.9) 27 | parser.add_argument('--learning_rate', default=0.001, help='the ratio of training set in whole dataset.') 28 | 29 | # GCN args 30 | parser.add_argument('--hidden1', default=16, help='Number of units in GCN hidden layer 1.') 31 | parser.add_argument('--hidden2', default=16, help='Number of units in GCN hidden layer 2.') 32 | parser.add_argument('--gcn_output', default=4, help='gcn output size.') 33 | 34 | args = parser.parse_args() 35 | return args 36 | 37 | 38 | def set_env(args): 39 | tf.reset_default_graph() 40 | np.random.seed(args.seed) 41 | tf.set_random_seed(args.seed) 42 | 43 | 44 | # get batch data 45 | def get_data(ix, int_batch, train_size): 46 | if ix + int_batch >= train_size: 47 | ix = train_size - int_batch 48 | end = train_size 49 | else: 50 | end = ix + int_batch 51 | return train_data[ix:end], train_label[ix:end] 52 | 53 | 54 | def load_data(args): 55 | if args.dataset_str == 'dblp': 56 | adj_list, features, train_data, train_label, test_data, test_label = load_data_dblp() 57 | node_size = features.shape[0] 58 | node_embedding = features.shape[1] 59 | class_size = train_label.shape[1] 60 | train_size = len(train_data) 61 | paras = [node_size, node_embedding, class_size, train_size] 62 | 63 | return adj_list, features, train_data, train_label, test_data, test_label, paras 64 | 65 | 66 | def train(args, adj_list, features, train_data, train_label, test_data, test_label, paras): 67 | with tf.Session() as sess: 68 | adj_data = [normalize_adj(adj) for adj in adj_list] 69 | meta_size = len(adj_list) 70 | net = Player2Vec(session=sess, class_size=paras[2], gcn_output1=args.hidden1, 71 | meta=meta_size, nodes=paras[0], embedding=paras[1], encoding=args.gcn_output) 72 | 73 | sess.run(tf.global_variables_initializer()) 74 | # net.load(sess) 75 | 76 | t_start = time.clock() 77 | for epoch in range(args.epoch_num): 78 | train_loss = 0 79 | train_acc = 0 80 | count = 0 81 | for index in range(0, paras[3], args.batch_size): 82 | batch_data, batch_label = get_data(index, args.batch_size, paras[3]) 83 | loss, acc, pred, prob = net.train(features, adj_data, batch_label, 84 | batch_data, args.learning_rate, 85 | args.momentum) 86 | 87 | print("batch loss: {:.4f}, batch acc: {:.4f}".format(loss, acc)) 88 | # print(prob, pred) 89 | 90 | train_loss += loss 91 | train_acc += acc 92 | count += 1 93 | train_loss = train_loss / count 94 | train_acc = train_acc / count 95 | print("epoch{:d} : train_loss: {:.4f}, train_acc: {:.4f}".format(epoch, train_loss, train_acc)) 96 | # net.save(sess) 97 | 98 | t_end = time.clock() 99 | print("train time=", "{:.5f}".format(t_end - t_start)) 100 | print("Train end!") 101 | 102 | test_acc, test_pred, test_probabilities, test_tags = net.test(features, adj_data, test_label, 103 | test_data) 104 | 105 | print("test acc:", test_acc) 106 | 107 | 108 | if __name__ == "__main__": 109 | args = arg_parser() 110 | set_env(args) 111 | adj_list, features, train_data, train_label, test_data, test_label, paras = load_data(args) 112 | train(args, adj_list, features, train_data, train_label, test_data, test_label, paras) 113 
| -------------------------------------------------------------------------------- /algorithms/Player2Vec/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Player2Vec 3 | 4 | ## Paper 5 | The Player2Vec model is proposed by the [paper](http://mason.gmu.edu/~lzhao9/materials/papers/lp0110-zhangA.pdf) below: 6 | ```bibtex 7 | @inproceedings{zhang2019key, 8 | title={Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework}, 9 | author={Zhang, Yiming and Fan, Yujie and Ye, Yanfang and Zhao, Liang and Shi, Chuan}, 10 | booktitle={Proceedings of the 28th ACM International Conference on Information and Knowledge Management}, 11 | pages={549--558}, 12 | year={2019} 13 | } 14 | ``` 15 | 16 | 17 | ## Brief Introduction 18 | 19 | A fraud detection model that uses a GCN to encode the information in each relation and an attention mechanism to aggregate the per-relation embeddings. 20 | 21 | ## Input Format 22 | 23 | In our toolbox, we take the homogeneous (meta-path based) graphs of the DBLP dataset as its multi-view input. 24 | 25 | ## TODO List 26 | 27 | -------------------------------------------------------------------------------- /algorithms/SemiGNN/README.md: -------------------------------------------------------------------------------- 1 | 2 | # SemiGNN 3 | 4 | ## Paper 5 | The SemiGNN model is proposed by the [paper](https://arxiv.org/pdf/2003.01171) below: 6 | ```bibtex 7 | @inproceedings{wang2019semi, 8 | title={A Semi-supervised Graph Attentive Network for Financial Fraud Detection}, 9 | author={Wang, Daixin and Lin, Jianbin and Cui, Peng and Jia, Quanhui and Wang, Zhen and Fang, Yanming and Yu, Quan and Zhou, Jun and Yang, Shuang and Qi, Yuan}, 10 | booktitle={2019 IEEE International Conference on Data Mining (ICDM)}, 11 | pages={598--607}, 12 | year={2019}, 13 | organization={IEEE} 14 | } 15 | ``` 16 | 17 | 18 | ## Brief Introduction 19 | 20 | SemiGNN takes a multi-view heterogeneous graph as input. It employs an attention mechanism to aggregate the embeddings from each view and adds a structure-based loss with negative sampling. 21 | 22 | ## Input Format 23 | 24 | The input is a heterogeneous graph. We use a small example graph in our toolbox. You can find the example graph structure in the **load_example_semi** function in `utils/data_loader.py`. If you want to use your own graph as the input, just follow the same format as the example graph. 25 | 26 | ## TODO List 27 | 28 | - The memory-efficient implementation. 29 | - Testing large-scale graphs.
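
## Run the code

A minimal sketch based on the arguments defined in `SemiGNN_main.py`, assuming the bundled example graph (the default `--dataset_str example`). Go to `algorithms/SemiGNN/` and run:

`python SemiGNN_main.py --dataset_str example --epoch_num 30`

Other model-specific arguments (`--init_emb_size`, `--semi_encoding1`, `--semi_encoding2`, `--semi_encoding3`, `--alpha`, `--lamtha`) are listed in `SemiGNN_main.py`.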
30 | 31 | -------------------------------------------------------------------------------- /algorithms/SemiGNN/SemiGNN.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | 6 | SemiGNN ('A Semi-supervised Graph Attentive Network for 7 | Financial Fraud Detection') 8 | 9 | Parameters: 10 | nodes: total nodes number 11 | semi_encoding1: the first view attention layer unit number 12 | semi_encoding2: the second view attention layer unit number 13 | semi_encoding3: MLP layer unit number 14 | init_emb_size: the initial node embedding 15 | meta: view number 16 | ul: labeled users number 17 | ''' 18 | import os 19 | import sys 20 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 21 | import tensorflow as tf 22 | from algorithms.base_algorithm import Algorithm 23 | from base_models.layers import AttentionLayer 24 | from utils import utils 25 | 26 | 27 | class SemiGNN(Algorithm): 28 | 29 | def __init__(self, 30 | session, 31 | nodes, 32 | class_size, 33 | semi_encoding1, 34 | semi_encoding2, 35 | semi_encoding3, 36 | init_emb_size, 37 | meta, 38 | ul, 39 | alpha, 40 | lamtha): 41 | self.nodes = nodes 42 | self.meta = meta 43 | self.class_size = class_size 44 | self.semi_encoding1 = semi_encoding1 45 | self.semi_encoding2 = semi_encoding2 46 | self.semi_encoding3 = semi_encoding3 47 | self.init_emb_size = init_emb_size 48 | self.ul = ul 49 | self.alpha = alpha 50 | self.lamtha = lamtha 51 | self.placeholders = {'a': tf.placeholder(tf.float32, [self.meta, self.nodes, self.nodes], 'adj'), 52 | 'u_i': tf.placeholder(tf.float32, [None, ], 'u_i'), 53 | 'u_j': tf.placeholder(tf.float32, [None, ], 'u_j'), 54 | 'batch_index': tf.placeholder(tf.int32, [None], 'index'), 55 | 'sup_label': tf.placeholder(tf.float32, [None, self.class_size], 'sup_label'), 56 | 'graph_label': tf.placeholder(tf.float32, [None, 1], 'graph_label'), 57 | 'lr': tf.placeholder(tf.float32, [], 'learning_rate'), 58 | 'mom': tf.placeholder(tf.float32, [], 'momentum'), 59 | 'num_features_nonzero': tf.placeholder(tf.int32)} 60 | 61 | loss, probabilities, pred = self.forward_propagation() 62 | self.loss, self.probabilities, self.pred = loss, probabilities, pred 63 | self.l2 = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(0.01), 64 | tf.trainable_variables()) 65 | 66 | print(self.pred.shape) 67 | self.correct_prediction = tf.equal(tf.argmax(self.probabilities, 1), 68 | tf.argmax(self.placeholders['sup_label'], 1)) 69 | self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, "float")) 70 | print('Forward propagation finished.') 71 | 72 | self.sess = session 73 | self.optimizer = tf.train.AdamOptimizer(self.placeholders['lr']) 74 | gradients = self.optimizer.compute_gradients(self.loss + self.lamtha * self.l2) 75 | capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None] 76 | self.train_op = self.optimizer.apply_gradients(capped_gradients) 77 | self.init = tf.global_variables_initializer() 78 | print('Backward propagation finished.') 79 | 80 | def forward_propagation(self): 81 | with tf.variable_scope('node_level_attention', reuse=tf.AUTO_REUSE): 82 | h1 = [] 83 | for i in range(self.meta): 84 | emb = tf.get_variable(name='init_embedding', shape=[self.nodes, self.init_emb_size], 85 | 
initializer=tf.contrib.layers.xavier_initializer()) 86 | h = AttentionLayer.node_attention(inputs=emb, adj=self.placeholders['a'][i]) 87 | h = tf.reshape(h, [self.nodes, emb.shape[1]]) 88 | h1.append(h) 89 | h1 = tf.concat(h1, 0) 90 | h1 = tf.reshape(h1, [self.meta, self.nodes, self.init_emb_size]) 91 | print('Node_level attention over!') 92 | 93 | with tf.variable_scope('view_level_attention'): 94 | h2 = AttentionLayer.view_attention(inputs=h1, layer_size=2, 95 | meta=self.meta, encoding1=self.semi_encoding1, 96 | encoding2=self.semi_encoding2) 97 | h2 = tf.reshape(h2, [self.nodes, self.semi_encoding2 * self.meta]) 98 | print('View_level attention over!') 99 | 100 | with tf.variable_scope('MLP'): 101 | a_u = tf.layers.dense(inputs=h2, units=self.semi_encoding3, activation=None) 102 | 103 | with tf.variable_scope('loss'): 104 | # for the labeled users, use softmax to get the classification result. 105 | labeled_a_u = tf.matmul(tf.one_hot(self.placeholders['batch_index'], self.nodes), a_u) 106 | theta = tf.get_variable(name='theta', shape=[self.semi_encoding3, self.class_size], 107 | initializer=tf.contrib.layers.xavier_initializer()) 108 | 109 | logits = tf.matmul(labeled_a_u, theta) 110 | prob = tf.nn.sigmoid(logits) 111 | pred = tf.one_hot(tf.argmax(prob, 1), self.class_size) 112 | 113 | loss1 = -(1 / self.ul) * tf.reduce_sum( 114 | self.placeholders['sup_label'] * tf.log(tf.nn.softmax(logits))) 115 | 116 | u_i_embedding = tf.nn.embedding_lookup(a_u, tf.cast(self.placeholders['u_i'], dtype=tf.int32)) 117 | u_j_embedding = tf.nn.embedding_lookup(a_u, tf.cast(self.placeholders['u_j'], dtype=tf.int32)) 118 | inner_product = tf.reduce_sum(u_i_embedding * u_j_embedding, axis=1) 119 | loss2 = -tf.reduce_mean(tf.log_sigmoid(self.placeholders['graph_label'] * inner_product)) 120 | 121 | loss = self.alpha * loss1 + (1 - self.alpha) * loss2 122 | return loss, prob, pred 123 | 124 | def train(self, a, u_i, u_j, batch_graph_label, batch_data, batch_sup_label, learning_rate=1e-2, momentum=0.9): 125 | feed_dict = utils.construct_feed_dict_semi(a, u_i, u_j, batch_graph_label, batch_data, batch_sup_label, 126 | learning_rate, momentum, 127 | self.placeholders) 128 | outs = self.sess.run( 129 | [self.train_op, self.loss, self.accuracy, self.pred, self.probabilities], 130 | feed_dict=feed_dict) 131 | loss = outs[1] 132 | acc = outs[2] 133 | pred = outs[3] 134 | prob = outs[4] 135 | return loss, acc, pred, prob 136 | 137 | def test(self, a, u_i, u_j, batch_graph_label, batch_data, batch_sup_label, learning_rate=1e-2, momentum=0.9): 138 | feed_dict = utils.construct_feed_dict_semi(a, u_i, u_j, batch_graph_label, batch_data, batch_sup_label, 139 | learning_rate, momentum, 140 | self.placeholders) 141 | acc, pred, probabilities, tags = self.sess.run( 142 | [self.accuracy, self.pred, self.probabilities, self.correct_prediction], 143 | feed_dict=feed_dict) 144 | return acc, pred, probabilities, tags 145 | -------------------------------------------------------------------------------- /algorithms/SemiGNN/SemiGNN_main.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code is due to Yutong Deng (@yutongD), Yingtong Dou (@YingtongDou) and UIC BDSC Lab 3 | DGFraud (A Deep Graph-based Toolbox for Fraud Detection) 4 | https://github.com/safe-graph/DGFraud 5 | ''' 6 | import tensorflow as tf 7 | import argparse 8 | import os 9 | import sys 10 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../..'))) 11 | from algorithms.SemiGNN.SemiGNN import SemiGNN 
12 | import time
13 | from utils.data_loader import *
14 | from utils.utils import *
15 | 
16 | 
17 | # os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
18 | 
19 | # init the common args, except the model specific args
20 | def arg_parser():
21 |     parser = argparse.ArgumentParser()
22 |     parser.add_argument('--seed', type=int, default=123, help='Random seed.')
23 |     parser.add_argument('--dataset_str', type=str, default='example', help="['dblp','example']")
24 |     parser.add_argument('--epoch_num', type=int, default=30, help='Number of epochs to train.')
25 |     parser.add_argument('--batch_size', type=int, default=1000)
26 |     parser.add_argument('--momentum', type=float, default=0.9)
27 |     parser.add_argument('--learning_rate', type=float, default=0.001, help='learning rate')
28 | 
29 |     # SemiGNN
30 |     parser.add_argument('--init_emb_size', default=4, help='initial node embedding size')
31 |     parser.add_argument('--semi_encoding1', default=3, help='the first view attention layer unit number')
32 |     parser.add_argument('--semi_encoding2', default=2, help='the second view attention layer unit number')
33 |     parser.add_argument('--semi_encoding3', default=4, help='one-layer perceptron units')
34 |     parser.add_argument('--Ul', default=8, help='labeled users number')
35 |     parser.add_argument('--alpha', default=0.5, help='loss alpha')
36 |     parser.add_argument('--lamtha', default=0.5, help='loss lamtha')
37 | 
38 |     args = parser.parse_args()
39 |     return args
40 | 
41 | 
42 | def set_env(args):
43 |     tf.reset_default_graph()
44 |     np.random.seed(args.seed)
45 |     tf.set_random_seed(args.seed)
46 | 
47 | 
48 | # get batch data
49 | def get_data(ix, int_batch, train_size):
50 |     if ix + int_batch >= train_size:
51 |         ix = train_size - int_batch
52 |         end = train_size
53 |     else:
54 |         end = ix + int_batch
55 |     return train_data[ix:end], train_label[ix:end]
56 | 
57 | 
58 | def load_data(args):
59 |     if args.dataset_str == 'example':
60 |         adj_list, features, train_data, train_label, test_data, test_label = load_example_semi()
61 |         node_size = features.shape[0]
62 |         node_embedding = features.shape[1]
63 |         class_size = train_label.shape[1]
64 |         train_size = len(train_data)
65 |         paras = [node_size, node_embedding, class_size, train_size]
66 | 
67 |     return adj_list, features, train_data, train_label, test_data, test_label, paras
68 | 
69 | 
70 | def train(args, adj_list, features, train_data, train_label, test_data, test_label, paras):
71 |     with tf.Session() as sess:
72 |         adj_nodelists = [matrix_to_adjlist(adj, pad=False) for adj in adj_list]
73 |         meta_size = len(adj_list)
74 |         pairs = [random_walks(adj_nodelists[i], 2, 3) for i in range(meta_size)]
75 |         net = SemiGNN(session=sess, class_size=paras[2], semi_encoding1=args.semi_encoding1,
76 |                       semi_encoding2=args.semi_encoding2, semi_encoding3=args.semi_encoding3,
77 |                       meta=meta_size, nodes=paras[0], init_emb_size=args.init_emb_size, ul=args.batch_size,
78 |                       alpha=args.alpha, lamtha=args.lamtha)
79 |         adj_data = [pairs_to_matrix(p, paras[0]) for p in pairs]
80 |         u_i = []
81 |         u_j = []
82 |         for adj_nodelist, p in zip(adj_nodelists, pairs):
83 |             u_i_t, u_j_t, graph_label = get_negative_sampling(p, adj_nodelist)
84 |             u_i.append(u_i_t)
85 |             u_j.append(u_j_t)
86 |         u_i = np.concatenate(np.array(u_i))
87 |         u_j = np.concatenate(np.array(u_j))
88 | 
89 |         sess.run(tf.global_variables_initializer())
90 |         # net.load(sess)
91 | 
92 |         t_start = time.clock()
93 |         for epoch in range(args.epoch_num):
94 |             train_loss = 0
95 |             train_acc = 0
96 |             count = 0
97 |             for index in range(0, paras[3],
args.batch_size): 98 | batch_data, batch_sup_label = get_data(index, args.batch_size, paras[3]) 99 | loss, acc, pred, prob = net.train(adj_data, u_i, u_j, graph_label, batch_data, 100 | batch_sup_label, 101 | args.learning_rate, 102 | args.momentum) 103 | 104 | print("batch loss: {:.4f}, batch acc: {:.4f}".format(loss, acc)) 105 | # print(prob, pred) 106 | 107 | train_loss += loss 108 | train_acc += acc 109 | count += 1 110 | train_loss = train_loss / count 111 | train_acc = train_acc / count 112 | print("epoch{:d} : train_loss: {:.4f}, train_acc: {:.4f}".format(epoch, train_loss, train_acc)) 113 | # net.save(sess) 114 | 115 | t_end = time.clock() 116 | print("train time=", "{:.5f}".format(t_end - t_start)) 117 | print("Train end!") 118 | 119 | test_acc, test_pred, test_probabilities, test_tags = net.test(adj_data, u_i, u_j, 120 | graph_label, 121 | test_data, 122 | test_label, 123 | args.learning_rate, 124 | args.momentum) 125 | 126 | print("test acc:", test_acc) 127 | 128 | 129 | if __name__ == "__main__": 130 | args = arg_parser() 131 | set_env(args) 132 | adj_list, features, train_data, train_label, test_data, test_label, paras = load_data(args) 133 | train(args, adj_list, features, train_data, train_label, test_data, test_label, paras) 134 | -------------------------------------------------------------------------------- /algorithms/base_algorithm.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | class Algorithm(object): 4 | def __init__(self, **kwargs): 5 | self.nodes = None 6 | 7 | def forward_propagation(self): 8 | pass 9 | 10 | def save(self, sess=None): 11 | if not sess: 12 | raise AttributeError("TensorFlow session not provided.") 13 | saver = tf.train.Saver() 14 | save_path = saver.save(sess, "tmp/%s.ckpt" % 'temp') 15 | print("Model saved in file: %s" % save_path) 16 | 17 | def load(self, sess=None): 18 | if not sess: 19 | raise AttributeError("TensorFlow session not provided.") 20 | saver = tf.train.Saver() 21 | save_path = "tmp/%s.ckpt" % 'temp' 22 | saver.restore(sess, save_path) 23 | print("Model restored from file: %s" % save_path) -------------------------------------------------------------------------------- /base_models/inits.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | '''Adapted from tkipf/gcn''' 5 | 6 | 7 | def uniform(shape, scale=0.05, name=None): 8 | """Uniform init.""" 9 | initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32) 10 | return tf.Variable(initial, name=name) 11 | 12 | 13 | def glorot(shape, name=None): 14 | """Glorot & Bengio (AISTATS 2010) init.""" 15 | init_range = np.sqrt(6.0 / (shape[0] + shape[1])) 16 | initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32) 17 | return tf.Variable(initial, name=name) 18 | 19 | 20 | def zeros(shape, name=None): 21 | """All zeros.""" 22 | initial = tf.zeros(shape, dtype=tf.float32) 23 | return tf.Variable(initial, name=name) 24 | 25 | 26 | def ones(shape, name=None): 27 | """All ones.""" 28 | initial = tf.ones(shape, dtype=tf.float32) 29 | return tf.Variable(initial, name=name) 30 | -------------------------------------------------------------------------------- /base_models/models.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | import os 4 | import sys 5 | sys.path.insert(0, 
os.path.abspath(os.path.join(os.getcwd(), '..'))) 6 | from base_models.layers import * 7 | 8 | flags = tf.app.flags 9 | FLAGS = flags.FLAGS 10 | 11 | 12 | class Model(object): 13 | '''Adapted from tkipf/gcn.''' 14 | 15 | def __init__(self, **kwargs): 16 | allowed_kwargs = {'name', 'logging'} 17 | for kwarg in kwargs.keys(): 18 | assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg 19 | name = kwargs.get('name') 20 | if not name: 21 | name = self.__class__.__name__.lower() 22 | self.name = name 23 | 24 | logging = kwargs.get('logging', False) 25 | self.logging = logging 26 | 27 | self.vars = {} 28 | self.layers = [] 29 | self.activations = [] 30 | 31 | self.inputs = None 32 | self.outputs = None 33 | self.dim1 = None 34 | self.dim2 = None 35 | self.adj = None 36 | 37 | def _build(self): 38 | raise NotImplementedError 39 | 40 | def build(self): 41 | """ Wrapper for _build() """ 42 | with tf.variable_scope(self.name): 43 | self._build() 44 | 45 | # Build sequential layer model 46 | self.activations.append(self.inputs) 47 | for layer in self.layers: 48 | hidden = layer(self.activations[-1]) 49 | self.activations.append(hidden) 50 | self.outputs = self.activations[-1] 51 | 52 | # Store model variables for easy access 53 | variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name) 54 | self.vars = {var.name: var for var in variables} 55 | 56 | def embedding(self): 57 | pass 58 | 59 | def save(self, sess=None): 60 | if not sess: 61 | raise AttributeError("TensorFlow session not provided.") 62 | saver = tf.train.Saver(self.vars) 63 | save_path = saver.save(sess, "tmp/%s.ckpt" % self.name) 64 | print("Model saved in file: %s" % save_path) 65 | 66 | def load(self, sess=None): 67 | if not sess: 68 | raise AttributeError("TensorFlow session not provided.") 69 | saver = tf.train.Saver(self.vars) 70 | save_path = "tmp/%s.ckpt" % self.name 71 | saver.restore(sess, save_path) 72 | print("Model restored from file: %s" % save_path) 73 | 74 | 75 | class GCN(Model): 76 | def __init__(self, placeholders, dim1, input_dim, output_dim, index=0, **kwargs): 77 | super(GCN, self).__init__(**kwargs) 78 | 79 | self.inputs = placeholders['x'] 80 | self.placeholders = placeholders 81 | self.input_dim = input_dim 82 | self.output_dim = output_dim 83 | self.dim1 = dim1 84 | self.index = index # index of meta paths 85 | self.build() 86 | 87 | def _build(self): 88 | self.layers.append(GraphConvolution(input_dim=self.input_dim, 89 | output_dim=self.dim1, 90 | placeholders=self.placeholders, 91 | index=self.index, 92 | act=tf.nn.relu, 93 | dropout=0.0, 94 | sparse_inputs=False, 95 | logging=self.logging, 96 | norm=True)) 97 | 98 | self.layers.append(GraphConvolution(input_dim=self.dim1, 99 | output_dim=self.output_dim, 100 | placeholders=self.placeholders, 101 | index=self.index, 102 | act=tf.nn.relu, 103 | dropout=0., 104 | logging=self.logging, 105 | norm=False)) 106 | 107 | def embedding(self): 108 | return self.outputs 109 | 110 | -------------------------------------------------------------------------------- /dataset/DBLP4057_GAT_with_idx_tra200_val_800.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/dataset/DBLP4057_GAT_with_idx_tra200_val_800.zip -------------------------------------------------------------------------------- /dataset/YelpChi.zip: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/safe-graph/DGFraud/22b72d75f81dd057762f0c7225a4558a25095b8f/dataset/YelpChi.zip -------------------------------------------------------------------------------- /reference/fdgars.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{wang2019fdgars, 2 | title={Fdgars: Fraudster detection via graph convolutional networks in online app review system}, 3 | author={Wang, Jianyu and Wen, Rui and Wu, Chunming and Huang, Yu and Xion, Jian}, 4 | booktitle={Companion Proceedings of The 2019 World Wide Web Conference}, 5 | pages={310--316}, 6 | year={2019} 7 | } -------------------------------------------------------------------------------- /reference/gas.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{li2019spam, 2 | title={Spam Review Detection with Graph Convolutional Networks}, 3 | author={Li, Ao and Qin, Zhou and Liu, Runshi and Yang, Yiqun and Li, Dong}, 4 | booktitle={Proceedings of the 28th ACM International Conference on Information and Knowledge Management}, 5 | pages={2703--2711}, 6 | year={2019} 7 | } -------------------------------------------------------------------------------- /reference/gem.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{liu2018heterogeneous, 2 | title={Heterogeneous graph neural networks for malicious account detection}, 3 | author={Liu, Ziqi and Chen, Chaochao and Yang, Xinxing and Zhou, Jun and Li, Xiaolong and Song, Le}, 4 | booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management}, 5 | pages={2077--2085}, 6 | year={2018} 7 | } -------------------------------------------------------------------------------- /reference/geniepath.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{liu2019geniepath, 2 | title={Geniepath: Graph neural networks with adaptive receptive paths}, 3 | author={Liu, Ziqi and Chen, Chaochao and Li, Longfei and Zhou, Jun and Li, Xiaolong and Song, Le and Qi, Yuan}, 4 | booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, 5 | volume={33}, 6 | pages={4424--4431}, 7 | year={2019} 8 | } -------------------------------------------------------------------------------- /reference/graphconsis.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{liu2020alleviating, 2 | title={Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection}, 3 | author={Liu, Zhiwei and Dou, Yingtong and Yu, Philip S. and Deng, Yutong and Peng, Hao}, 4 | booktitle={Proceedings of the 43nd International ACM SIGIR Conference on Research and Development in Information Retrieval}, 5 | year={2020} 6 | } 7 | -------------------------------------------------------------------------------- /reference/graphsage.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{hamilton2017inductive, 2 | author = {Hamilton, William L. 
and Ying, Rex and Leskovec, Jure}, 3 | title = {Inductive Representation Learning on Large Graphs}, 4 | booktitle = {NIPS}, 5 | year = {2017} 6 | } 7 | -------------------------------------------------------------------------------- /reference/hacud.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{DBLP:conf/aaai/HuZSZLQ19, 2 | author = {Binbin Hu and 3 | Zhiqiang Zhang and 4 | Chuan Shi and 5 | Jun Zhou and 6 | Xiaolong Li and 7 | Yuan Qi}, 8 | title = {Cash-Out User Detection Based on Attributed Heterogeneous Information 9 | Network with a Hierarchical Attention Mechanism}, 10 | booktitle = {The Thirty-Third AAAI Conference on Artificial Intelligence}, 11 | year = {2019} 12 | } -------------------------------------------------------------------------------- /reference/player2vec.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{zhang2019key, 2 | title={Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework}, 3 | author={Zhang, Yiming and Fan, Yujie and Ye, Yanfang and Zhao, Liang and Shi, Chuan}, 4 | booktitle={Proceedings of the 28th ACM International Conference on Information and Knowledge Management}, 5 | pages={549--558}, 6 | year={2019} 7 | } -------------------------------------------------------------------------------- /reference/semignn.txt: -------------------------------------------------------------------------------- 1 | @inproceedings{wang2019semi, 2 | title={A Semi-supervised Graph Attentive Network for Financial Fraud Detection}, 3 | author={Wang, Daixin and Lin, Jianbin and Cui, Peng and Jia, Quanhui and Wang, Zhen and Fang, Yanming and Yu, Quan and Zhou, Jun and Yang, Shuang and Qi, Yuan}, 4 | booktitle={2019 IEEE International Conference on Data Mining (ICDM)}, 5 | pages={598--607}, 6 | year={2019}, 7 | organization={IEEE} 8 | } -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow>=1.14.0,<2.0 2 | numpy>=1.16.4 3 | scipy>=1.2.1 4 | scikit_learn>=0.21rc2 5 | networkx<=1.11 6 | 7 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import find_packages, setup 2 | 3 | # read the contents of README file 4 | from os import path 5 | from io import open # for Python 2 and 3 compatibility 6 | 7 | this_directory = path.abspath(path.dirname(__file__)) 8 | 9 | # read the contents of requirements.txt 10 | with open(path.join(this_directory, 'requirements.txt'), 11 | encoding='utf-8') as f: 12 | requirements = f.read().splitlines() 13 | 14 | setup(name='DGFraud', 15 | version="0.1.0", 16 | author="Yutong Deng, Yingtong Dou, Hengrui Zhang, and UIC BDSC Lab", 17 | author_email="bdscsafegraph@gmail.com", 18 | description='a GNN-based toolbox for fraud detection in Tensorflow', 19 | long_description=open("README.md", "r", encoding="utf-8").read(), 20 | long_description_content_type="text/markdown", 21 | url='https://github.com/safe-graph/DGFraud', 22 | download_url='https://github.com/safe-graph/DGFraud/archive/master.zip', 23 | keywords=['fraud detection', 'anomaly detection', 'graph neural network', 24 | 'data mining', 'security'], 25 | install_requires=['numpy>=1.16.4', 26 | 'tensorflow>=1.14.0,<2.0', 27 | 'scipy>=1.2.1', 28 | 
'scikit_learn>=0.21rc2', 29 | 'networkx<=1.11' 30 | ], 31 | packages=find_packages(exclude=['test']), 32 | include_package_data=True, 33 | setup_requires=['setuptools>=38.6.0'], 34 | classifiers=[ 35 | 'Development Status :: 4 - Beta', 36 | 'Intended Audience :: Education', 37 | 'Intended Audience :: Financial and Insurance Industry', 38 | 'Intended Audience :: Science/Research', 39 | 'Intended Audience :: Developers', 40 | 'Intended Audience :: Information Technology', 41 | 'License :: OSI Approved :: Apache Software License', 42 | 'Programming Language :: Python :: 3.6', 43 | 'Programming Language :: Python :: 3.7', 44 | ], 45 | ) 46 | -------------------------------------------------------------------------------- /utils/data_loader.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.model_selection import train_test_split 3 | import scipy.io as sio 4 | import os 5 | import sys 6 | sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..'))) 7 | from utils.utils import pad_adjlist 8 | import zipfile 9 | 10 | 11 | # zip_src = '../dataset/DBLP4057_GAT_with_idx_tra200_val_800.zip' 12 | # dst_dir = '../dataset' 13 | def unzip_file(zip_src, dst_dir): 14 | iz = zipfile.is_zipfile(zip_src) 15 | if iz: 16 | zf = zipfile.ZipFile(zip_src, 'r') 17 | for file in zf.namelist(): 18 | zf.extract(file, dst_dir) 19 | else: 20 | print('Zip Error.') 21 | 22 | 23 | def load_data_dblp(path='../../dataset/DBLP4057_GAT_with_idx_tra200_val_800.mat'): 24 | data = sio.loadmat(path) 25 | truelabels, features = data['label'], data['features'].astype(float) 26 | N = features.shape[0] 27 | rownetworks = [data['net_APA'] - np.eye(N)] 28 | # rownetworks = [data['net_APA'] - np.eye(N), data['net_APCPA'] - np.eye(N), data['net_APTPA'] - np.eye(N)] 29 | y = truelabels 30 | index = range(len(y)) 31 | X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.4, random_state=48, 32 | shuffle=True) 33 | 34 | return rownetworks, features, X_train, y_train, X_test, y_test 35 | 36 | 37 | def load_example_semi(): 38 | # example data for SemiGNN 39 | features = np.array([[1, 1, 0, 0, 0, 0, 0], 40 | [0, 0, 1, 0, 0, 0, 0], 41 | [0, 0, 0, 1, 0, 0, 0], 42 | [0, 0, 0, 0, 0, 1, 0], 43 | [0, 0, 0, 0, 1, 0, 1], 44 | [1, 0, 1, 1, 0, 0, 0], 45 | [0, 1, 0, 0, 1, 0, 0], 46 | [0, 0, 0, 0, 0, 1, 1] 47 | ]) 48 | N = features.shape[0] 49 | # Here we use binary matrix as adjacency matrix, weighted matrix is acceptable as well 50 | rownetworks = [np.array([[1, 0, 0, 1, 0, 1, 1, 1], 51 | [1, 0, 0, 1, 1, 1, 0, 1], 52 | [1, 0, 0, 0, 0, 0, 0, 1], 53 | [0, 1, 0, 0, 1, 1, 1, 0], 54 | [0, 1, 1, 1, 0, 1, 0, 0], 55 | [1, 0, 0, 1, 1, 1, 0, 1], 56 | [1, 0, 0, 0, 0, 0, 0, 1], 57 | [0, 1, 0, 0, 1, 1, 1, 0]]), 58 | np.array([[1, 0, 0, 0, 0, 1, 1, 1], 59 | [0, 1, 0, 0, 1, 1, 0, 0], 60 | [0, 1, 1, 1, 0, 0, 0, 0], 61 | [0, 0, 1, 1, 1, 0, 0, 1], 62 | [1, 1, 0, 1, 1, 0, 0, 0], 63 | [1, 0, 0, 1, 0, 1, 1, 1], 64 | [1, 0, 0, 1, 1, 1, 0, 1], 65 | [1, 0, 0, 0, 0, 0, 0, 1]])] 66 | y = np.array([[0, 1], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]) 67 | index = range(len(y)) 68 | X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.2, random_state=48, 69 | shuffle=True) # test_size=0.25 batch——size=2 70 | 71 | return rownetworks, features, X_train, y_train, X_test, y_test 72 | 73 | 74 | def load_example_gem(): 75 | # example data for GEM 76 | # node=8 p=7 D=2 77 | features = np.array([[5, 3, 0, 1, 0, 0, 0, 1, 0], 78 | [2, 
3, 1, 2, 0, 0, 0, 1, 0], 79 | [3, 1, 6, 4, 0, 0, 1, 1, 0], 80 | [0, 0, 2, 4, 4, 1, 0, 1, 1], 81 | [0, 0, 3, 3, 1, 0, 1, 0, 1], 82 | [1, 2, 5, 1, 4, 1, 0, 0, 1], 83 | [0, 1, 3, 5, 1, 0, 0, 0, 1], 84 | [0, 3, 4, 5, 2, 1, 1, 0, 1] 85 | ]) 86 | N = features.shape[0] 87 | rownetworks = [np.array([[1, 1, 1, 1, 0, 0, 0, 0], 88 | [1, 1, 1, 1, 0, 0, 0, 0], 89 | [1, 1, 1, 1, 0, 0, 0, 0], 90 | [1, 1, 1, 1, 0, 0, 0, 0], 91 | [0, 0, 0, 0, 0, 0, 0, 0], 92 | [0, 0, 0, 0, 0, 0, 0, 0], 93 | [0, 0, 0, 0, 0, 0, 0, 0], 94 | [0, 0, 0, 0, 0, 0, 0, 0]]), 95 | np.array([[0, 0, 0, 0, 0, 0, 0, 0], 96 | [0, 0, 0, 0, 0, 0, 0, 0], 97 | [0, 0, 0, 0, 0, 0, 0, 0], 98 | [0, 0, 0, 1, 1, 1, 1, 1], 99 | [0, 0, 0, 1, 1, 1, 1, 1], 100 | [0, 0, 0, 1, 1, 1, 1, 1], 101 | [0, 0, 0, 1, 1, 1, 1, 1], 102 | [0, 0, 0, 1, 1, 1, 1, 1]])] 103 | # y = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) 104 | y = np.array([0, 0, 0, 0, 1, 1, 1, 1]) 105 | y = y[:, np.newaxis] 106 | index = range(len(y)) 107 | X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.2, random_state=8, 108 | shuffle=True) 109 | 110 | return rownetworks, features, X_train, y_train, X_test, y_test 111 | 112 | 113 | def load_data_gas(): 114 | # example data for GAS 115 | # construct U-E-I network 116 | user_review_adj = [[0, 1], [2], [3], [5], [4, 6]] 117 | user_review_adj = pad_adjlist(user_review_adj) 118 | user_item_adj = [[0, 1], [0], [0], [2], [1, 2]] 119 | user_item_adj = pad_adjlist(user_item_adj) 120 | item_review_adj = [[0, 2, 3], [1, 4], [5, 6]] 121 | item_review_adj = pad_adjlist(item_review_adj) 122 | item_user_adj = [[0, 1, 2], [0, 4], [3, 4]] 123 | item_user_adj = pad_adjlist(item_user_adj) 124 | review_item_adj = [0, 1, 0, 0, 1, 2, 2] 125 | review_user_adj = [0, 0, 1, 2, 4, 3, 4] 126 | 127 | # initialize review_vecs 128 | review_vecs = np.array([[1, 0, 0, 1, 0], 129 | [1, 0, 0, 1, 1], 130 | [1, 0, 0, 0, 0], 131 | [0, 1, 0, 0, 1], 132 | [0, 1, 1, 1, 0], 133 | [0, 0, 1, 1, 1], 134 | [1, 1, 0, 1, 1]]) 135 | 136 | # initialize user_vecs and item_vecs with user_review_adj and item_review_adj 137 | # for example, u0 has r1 and r0, then we get the first line of user_vecs: [1, 1, 0, 0, 0, 0, 0] 138 | user_vecs = np.array([[1, 1, 0, 0, 0, 0, 0], 139 | [0, 0, 1, 0, 0, 0, 0], 140 | [0, 0, 0, 1, 0, 0, 0], 141 | [0, 0, 0, 0, 0, 1, 0], 142 | [0, 0, 0, 0, 1, 0, 1]]) 143 | item_vecs = np.array([[1, 0, 1, 1, 0, 0, 0], 144 | [0, 1, 0, 0, 1, 0, 0], 145 | [0, 0, 0, 0, 0, 1, 1]]) 146 | features = [review_vecs, user_vecs, item_vecs] 147 | 148 | # initialize the Comment Graph 149 | homo_adj = [[1, 0, 0, 0, 1, 1, 1], 150 | [1, 0, 0, 0, 1, 1, 0], 151 | [0, 0, 0, 1, 1, 1, 0], 152 | [1, 0, 1, 0, 0, 1, 0], 153 | [0, 1, 1, 1, 1, 0, 0], 154 | [0, 1, 1, 0, 1, 0, 0], 155 | [0, 1, 0, 0, 1, 0, 0]] 156 | 157 | adjs = [user_review_adj, user_item_adj, item_review_adj, item_user_adj, review_user_adj, review_item_adj, homo_adj] 158 | 159 | y = np.array([[0, 1], [1, 0], [1, 0], [0, 1], [1, 0], [1, 0], [0, 1], [1, 0]]) 160 | index = range(len(y)) 161 | X_train, X_test, y_train, y_test = train_test_split(index, y, stratify=y, test_size=0.4, random_state=48, 162 | shuffle=True) 163 | 164 | return adjs, features, X_train, y_train, X_test, y_test 165 | -------------------------------------------------------------------------------- /utils/utils.py: -------------------------------------------------------------------------------- 1 | import random 2 | import scipy.io as sio 3 | import scipy.sparse as sp 4 | import numpy as np 5 | 6 | 7 | # symmetrically normalize adjacency matrix 
8 | def normalize_adj(adj): 9 | adj = adj + sp.eye(adj.shape[0]) 10 | adj = sp.coo_matrix(adj) 11 | rowsum = np.array(adj.sum(1)) 12 | d_inv_sqrt = np.power(rowsum, -0.5).flatten() 13 | d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0. 14 | d_mat_inv_sqrt = sp.diags(d_inv_sqrt) 15 | return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).A 16 | 17 | 18 | # Construct feed dictionary 19 | def construct_feed_dict(x, a, t, b, learning_rate, momentum, placeholders): 20 | feed_dict = dict() 21 | feed_dict.update({placeholders['x']: x}) 22 | feed_dict.update({placeholders['a']: a}) 23 | feed_dict.update({placeholders['t']: t}) 24 | feed_dict.update({placeholders['batch_index']: b}) 25 | feed_dict.update({placeholders['lr']: learning_rate}) 26 | feed_dict.update({placeholders['mom']: momentum}) 27 | feed_dict.update({placeholders['num_features_nonzero']: x[1].shape}) 28 | return feed_dict 29 | 30 | 31 | # Construct feed dictionary for SemiGNN 32 | def construct_feed_dict_semi(a, u_i, u_j, batch_graph_label, batch_data, batch_sup_label, learning_rate, momentum, 33 | placeholders): 34 | feed_dict = dict() 35 | feed_dict.update({placeholders['a']: a}) 36 | feed_dict.update({placeholders['u_i']: u_i}) 37 | feed_dict.update({placeholders['u_j']: u_j}) 38 | feed_dict.update({placeholders['graph_label']: batch_graph_label}) 39 | feed_dict.update({placeholders['batch_index']: batch_data}) 40 | feed_dict.update({placeholders['sup_label']: batch_sup_label}) 41 | feed_dict.update({placeholders['lr']: learning_rate}) 42 | feed_dict.update({placeholders['mom']: momentum}) 43 | return feed_dict 44 | 45 | 46 | # Construct feed dictionary for SemiGNN 47 | def construct_feed_dict_spam(h, adj_info, t, b, learning_rate, momentum, placeholders): 48 | feed_dict = dict() 49 | feed_dict.update({placeholders['user_review_adj']: adj_info[0]}) 50 | feed_dict.update({placeholders['user_item_adj']: adj_info[1]}) 51 | feed_dict.update({placeholders['item_review_adj']: adj_info[2]}) 52 | feed_dict.update({placeholders['item_user_adj']: adj_info[3]}) 53 | feed_dict.update({placeholders['review_user_adj']: adj_info[4]}) 54 | feed_dict.update({placeholders['review_item_adj']: adj_info[5]}) 55 | feed_dict.update({placeholders['homo_adj']: adj_info[6]}) 56 | feed_dict.update({placeholders['review_vecs']: h[0]}) 57 | feed_dict.update({placeholders['user_vecs']: h[1]}) 58 | feed_dict.update({placeholders['item_vecs']: h[2]}) 59 | feed_dict.update({placeholders['t']: t}) 60 | feed_dict.update({placeholders['batch_index']: b}) 61 | feed_dict.update({placeholders['lr']: learning_rate}) 62 | feed_dict.update({placeholders['mom']: momentum}) 63 | feed_dict.update({placeholders['num_features_nonzero']: h[0][1].shape}) 64 | return feed_dict 65 | 66 | 67 | def pad_adjlist(x_data): 68 | # Get lengths of each row of data 69 | lens = np.array([len(x_data[i]) for i in range(len(x_data))]) 70 | 71 | # Mask of valid places in each row 72 | mask = np.arange(lens.max()) < lens[:, None] 73 | 74 | # Setup output array and put elements from data into masked positions 75 | padded = np.zeros(mask.shape) 76 | for i in range(mask.shape[0]): 77 | padded[i] = np.random.choice(x_data[i], mask.shape[1]) 78 | padded[mask] = np.hstack((x_data[:])) 79 | return padded 80 | 81 | 82 | def matrix_to_adjlist(M, pad=True): 83 | adjlist = [] 84 | for i in range(len(M)): 85 | adjline = [i] 86 | for j in range(len(M[i])): 87 | if M[i][j] == 1: 88 | adjline.append(j) 89 | adjlist.append(adjline) 90 | if pad: 91 | adjlist = pad_adjlist(adjlist) 92 | return adjlist 93 | 94 | 95 | 
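# Illustrative example for matrix_to_adjlist (pad=False, as it is called in SemiGNN_main):
# each adjacency-list row starts with the node's own index, followed by its neighbours.
#
#     >>> M = np.array([[0, 1, 0],
#     ...               [1, 0, 1],
#     ...               [0, 1, 0]])
#     >>> matrix_to_adjlist(M, pad=False)
#     [[0, 1], [1, 0, 2], [2, 1]]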
def adjlist_to_matrix(adjlist): 96 | nodes = len(adjlist) 97 | M = np.zeros((nodes, nodes)) 98 | for i in range(nodes): 99 | for j in adjlist[i]: 100 | M[i][j] = 1 101 | return M 102 | 103 | 104 | def pairs_to_matrix(pairs, nodes): 105 | M = np.zeros((nodes, nodes)) 106 | for i, j in pairs: 107 | M[i][j] = 1 108 | return M 109 | 110 | 111 | # Random walk on graph 112 | def generate_random_walk(adjlist, start, walklength): 113 | t = 1 114 | walk_path = np.array([start]) 115 | while t <= walklength: 116 | neighbors = adjlist[start] 117 | current = np.random.choice(neighbors) 118 | walk_path = np.append(walk_path, current) 119 | start = current 120 | t += 1 121 | return walk_path 122 | 123 | 124 | # sample multiple times for each node 125 | def random_walks(adjlist, numerate, walklength): 126 | nodes = range(0, len(adjlist)) # node index starts from zero 127 | walks = [] 128 | for n in range(numerate): 129 | for node in nodes: 130 | walks.append(generate_random_walk(adjlist, node, walklength)) 131 | pairs = [] 132 | for i in range(len(walks)): 133 | for j in range(1, len(walks[i])): 134 | pair = [walks[i][0], walks[i][j]] 135 | pairs.append(pair) 136 | return pairs 137 | 138 | 139 | def negative_sampling(adj_nodelist): 140 | degree = [len(neighbors) for neighbors in adj_nodelist] 141 | node_negative_distribution = np.power(np.array(degree, dtype=np.float32), 0.75) 142 | node_negative_distribution /= np.sum(node_negative_distribution) 143 | node_sampling = AliasSampling(prob=node_negative_distribution) 144 | return node_negative_distribution, node_sampling 145 | 146 | 147 | def get_negative_sampling(pairs, adj_nodelist, Q=3, node_sampling='atlas'): 148 | num_of_nodes = len(adj_nodelist) 149 | u_i = [] 150 | u_j = [] 151 | graph_label = [] 152 | node_negative_distribution, nodesampling = negative_sampling(adj_nodelist) 153 | for index in range(0, num_of_nodes): 154 | u_i.append(pairs[index][0]) 155 | u_j.append(pairs[index][1]) 156 | graph_label.append(1) 157 | for i in range(Q): 158 | while True: 159 | if node_sampling == 'numpy': 160 | negative_node = np.random.choice(num_of_nodes, node_negative_distribution) 161 | if negative_node not in adj_nodelist[pairs[index][0]]: 162 | break 163 | elif node_sampling == 'atlas': 164 | negative_node = nodesampling.sampling() 165 | if negative_node not in adj_nodelist[pairs[index][0]]: 166 | break 167 | elif node_sampling == 'uniform': 168 | negative_node = np.random.randint(0, num_of_nodes) 169 | if negative_node not in adj_nodelist[pairs[index][0]]: 170 | break 171 | u_i.append(pairs[index][0]) 172 | u_j.append(negative_node) 173 | graph_label.append(-1) 174 | graph_label = np.array(graph_label) 175 | graph_label = graph_label.reshape(graph_label.shape[0], 1) 176 | return u_i, u_j, graph_label 177 | 178 | 179 | # Reference: https://en.wikipedia.org/wiki/Alias_method 180 | class AliasSampling: 181 | 182 | def __init__(self, prob): 183 | self.n = len(prob) 184 | self.U = np.array(prob) * self.n 185 | self.K = [i for i in range(len(prob))] 186 | overfull, underfull = [], [] 187 | for i, U_i in enumerate(self.U): 188 | if U_i > 1: 189 | overfull.append(i) 190 | elif U_i < 1: 191 | underfull.append(i) 192 | while len(overfull) and len(underfull): 193 | i, j = overfull.pop(), underfull.pop() 194 | self.K[j] = i 195 | self.U[i] = self.U[i] - (1 - self.U[j]) 196 | if self.U[i] > 1: 197 | overfull.append(i) 198 | elif self.U[i] < 1: 199 | underfull.append(i) 200 | 201 | def sampling(self, n=1): 202 | x = np.random.rand(n) 203 | i = np.floor(self.n * x) 204 | y = 
self.n * x - i 205 | i = i.astype(np.int32) 206 | res = [i[k] if y[k] < self.U[i[k]] else self.K[i[k]] for k in range(n)] 207 | if n == 1: 208 | return res[0] 209 | else: 210 | return res 211 | --------------------------------------------------------------------------------
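As a usage sketch, the graph preprocessing that SemiGNN_main.train() performs with these utilities can be reproduced on the bundled toy data without building the TensorFlow model. The following minimal script is illustrative only: it assumes it is saved in and run from the repository root (so that the utils package is importable), and the variable names are the editor's, not part of the codebase.

import numpy as np

from utils.data_loader import load_example_semi
from utils.utils import matrix_to_adjlist, random_walks, get_negative_sampling, pairs_to_matrix

# Toy multi-view graph bundled with the repo: a list of 8x8 adjacency matrices
# (one per view), node features, and a stratified train/test split.
adj_list, features, train_idx, train_label, test_idx, test_label = load_example_semi()
num_nodes = features.shape[0]

# One adjacency list per view; pad=False keeps the raw neighbour lists.
adj_nodelists = [matrix_to_adjlist(adj, pad=False) for adj in adj_list]

# Two random walks of length 3 per node and view -> (centre, context) node pairs.
pairs = [random_walks(nodelist, 2, 3) for nodelist in adj_nodelists]

# Dense co-occurrence matrix per view, rebuilt from the sampled pairs.
adj_data = [pairs_to_matrix(p, num_nodes) for p in pairs]

# Negative sampling: graph_label is +1 for observed pairs, -1 for sampled negatives.
u_i, u_j, labels = [], [], []
for nodelist, p in zip(adj_nodelists, pairs):
    u_i_t, u_j_t, graph_label = get_negative_sampling(p, nodelist)
    u_i.append(u_i_t)
    u_j.append(u_j_t)
    labels.append(graph_label)
u_i = np.concatenate([np.asarray(x) for x in u_i])
u_j = np.concatenate([np.asarray(x) for x in u_j])

print('views:', len(adj_data), '| pairs per view:', len(pairs[0]), '| samples:', u_i.shape[0])

The arrays produced here (adj_data, u_i, u_j, and the pair labels) correspond to the inputs that SemiGNN_main feeds into the SemiGNN placeholders before training.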