├── Dockerfile ├── MANIFEST.in ├── README.md ├── examples ├── graphics │ ├── 150epochs_supervised_trained.png │ ├── 150epochs_unsupervised_trained.png │ ├── supervised.gif │ ├── unsupervised.gif │ ├── untrained_example_large.png │ ├── untrained_example_supervised.png │ └── untrained_example_unsupervised.png ├── karate_attributes.csv ├── karateclub.py └── reddit.py ├── fastrec ├── GraphSimRec.py ├── RecAPI.py ├── __init__.py └── torchmodels.py └── setup.py /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ubuntu:16.04 2 | 3 | RUN apt-get update && apt-get install -y \ 4 | build-essential \ 5 | binutils \ 6 | make \ 7 | bzip2 \ 8 | cmake \ 9 | curl \ 10 | git \ 11 | g++ \ 12 | libboost-all-dev \ 13 | libbz2-dev \ 14 | libfluidsynth-dev \ 15 | libfreetype6-dev \ 16 | libgme-dev \ 17 | libgtk2.0-dev \ 18 | libjpeg-dev \ 19 | libopenal-dev \ 20 | libpng-dev \ 21 | libsdl2-dev \ 22 | libwildmidi-dev \ 23 | libzmq3-dev \ 24 | nano \ 25 | nasm \ 26 | pkg-config \ 27 | rsync \ 28 | software-properties-common \ 29 | sudo \ 30 | tar \ 31 | timidity \ 32 | unzip \ 33 | wget \ 34 | locales \ 35 | zlib1g-dev \ 36 | libfltk1.3-dev \ 37 | libxft-dev \ 38 | libxinerama-dev \ 39 | libjpeg-dev \ 40 | libpng-dev \ 41 | zlib1g-dev \ 42 | xdg-utils \ 43 | net-tools 44 | 45 | ENV PATH="/root/miniconda3/bin:${PATH}" 46 | ARG PATH="/root/miniconda3/bin:${PATH}" 47 | 48 | RUN wget \ 49 | https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \ 50 | && mkdir /root/.conda \ 51 | && bash Miniconda3-latest-Linux-x86_64.sh -b \ 52 | && rm -f Miniconda3-latest-Linux-x86_64.sh 53 | 54 | RUN pip install --upgrade pip 55 | RUN conda install numpy pandas matplotlib networkx tqdm scikit-learn imageio 56 | RUN conda install pytorch torchvision cudatoolkit=10.0 faiss-gpu -c pytorch 57 | RUN conda install -c dglteam dgl-cuda10.0 58 | RUN pip install fastapi uvicorn 59 | 60 | ENV NVIDIA_VISIBLE_DEVICES all 61 | ENV NVIDIA_DRIVER_CAPABILITIES compute,utility 62 | 63 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # FastRec 2 | 3 | Graph neural networks are capable of capturing the structure and relationships between nodes in a graph as dense vectors. 4 | With these dense vectors, we can identify pairs of nodes that are similar, identify communities and clusters, or train 5 | a linear classification model with the dense vectors as inputs. 6 | 7 | This project automates the entire pipeline from node/edge graph data to generate embeddings, train and fine tune those embeddings, create and train a [Facebook AI Similarity Search Index](https://ai.facebook.com/tools/faiss/) (faiss), and deploy a recommender API to query the index over the network. FastRec handles all of the boilerplate code, handling gpu/cpu memory management, and passing data between pytorch, Deep Graph Library (DGL), faiss, and fastapi. 8 | 9 | The code is intended to be as scalable as possible, with the only limitation being the memory available to store the graph. 
The code adapts the implementation of [GraphSage](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) from the [DGL reference implementation](https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage). FastRec has been tested on graphs with up to 1 million nodes and 100 million edges and was able to generate and train embeddings, train a faiss index, and begin answering API queries in minutes. With sufficient memory, it should be able to scale to billions of nodes and edges. Distributed training is not currently implemented, but could further improve scalability.
10 | 
11 | ## Installation
12 | 
13 | The quickest way to get started is with a CPU-only installation via conda.
14 | 
15 | ```bash
16 | conda install -c ddangelo fastrec -c pytorch -c dglteam -c conda-forge
17 | ```
18 | 
19 | To install with GPU support, you will need to manually install DGL and PyTorch with GPU support first. Then, you can pip install fastrec.
20 | 
21 | ```bash
22 | conda install pytorch torchvision cudatoolkit=10.0 faiss-gpu -c pytorch
23 | conda install -c dglteam dgl-cuda10.0
24 | pip install fastrec
25 | ```
26 | 
27 | Note that conda builds of faiss are currently only available for Linux and macOS. If you are on Windows, you might be able to install from source.
28 | 
29 | ## Basic Usage: Karate Club Communities
30 | 
31 | As an example, we can generate embeddings for [Zachary's karate club](https://en.wikipedia.org/wiki/Zachary%27s_karate_club) graph. See [karateclub.py](https://github.com/devinjdangelo/FastRec/blob/master/examples/karateclub.py) for the full script to replicate the below.
32 | 
33 | First, convert the graph into a node and edge list format.
34 | 
35 | ```python
36 | import networkx as nx
import pandas as pd
37 | g = nx.karate_club_graph()
38 | nodes = list(g.nodes)
39 | e1,e2 = zip(*g.edges)
40 | attributes = pd.read_csv('./karate_attributes.csv')
41 | ```
42 | 
43 | Then we can initialize a recommender, add the data, and generate node embeddings.
44 | 
45 | ```python
46 | from fastrec import GraphRecommender
47 | #initialize our recommender to embed into 2 dimensions and
48 | #use euclidean distance as the metric for similarity.
49 | sage = GraphRecommender(2,distance='l2')
50 | sage.add_nodes(nodes)
51 | sage.add_edges(e1,e2)
52 | sage.add_edges(e2,e1)
53 | sage.update_labels(attributes.community)
54 | untrained_embeddings = sage.embeddings
55 | ```
56 | How do the embeddings look? Even with no training of the graph neural network weights, the embeddings do a reasonable job of dividing the two communities. The nodes in the Instructor community are blue and the nodes in the Administrator community are red.
57 | 
58 | *(figure: untrained karate club embeddings, see examples/graphics)*
59 | 
60 | With one command, we can improve the embeddings using supervised learning with a triplet loss.
61 | 
62 | ```python
63 | epochs, batch_size = 150, 15
64 | sage.train(epochs, batch_size)
65 | ```
66 | *(animation: embeddings over supervised training epochs, see examples/graphics)*
67 | 
68 | The trained embeddings divide the communities much more neatly. But what about the more realistic scenario where we do not know the labels of all of the nodes in advance? We can instead train the embeddings in a fully unsupervised manner.
69 | 
70 | ```python
71 | epochs, batch_size = 150, 15
72 | sage.train(epochs, batch_size, unsupervised=True)
73 | ```
74 | 
75 | *(animation: embeddings over unsupervised training epochs, see examples/graphics)*
76 | 
77 | In this case, the unsupervised training actually seems to do a slightly better job of dividing the two communities.
78 | 
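We can also put a rough number on how well the embeddings separate the two communities with the built-in evaluation helper, which checks whether each labeled node's nearest neighbors share its label. A quick sketch (the exact scores will vary with the random initialization and training run):

```python
#share of nodes with at least one same-community node among their top 5 and top 1 neighbors
print(sage.evaluate(test_levels=[5,1]))
```

79 | What if we have a very large graph which is expensive and slow to train?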
Often, the untrained performance of the embeddings will improve if we increase the size of our graph neural network (in terms of width and number of parameters).
80 | 
81 | ```python
82 | sage = GraphRecommender(2,distance='l2',feature_dim=512,hidden_dim=512)
83 | untrained_embeddings_large = sage.embeddings
84 | ```
85 | 
86 | *(figure: untrained embeddings from the larger network, see examples/graphics)*
87 | 
88 | This looks nearly as good as the trained version of the small network, but no training was required!
89 | 
90 | Once we have embeddings that we are happy with, we can query a specific node or nodes to get their nearest neighbors in a single line.
91 | 
92 | ```python
93 | #what are the 5 nearest neighbors of node 0, the Admin, and 33, the Instructor?
94 | sage.query_neighbors(['0','33'],k=5)
95 | {'0': {'neighbors': ['0', '13', '16', '6', '5'], 'distances': [0.0, 0.001904212054796517, 0.005100540816783905, 0.007833012379705906, 0.008420777507126331]}, '33': {'neighbors': ['33', '27', '31', '28', '32'], 'distances': [0.0, 0.0005751167191192508, 0.0009900123113766313, 0.001961079193279147, 0.006331112235784531]}}
96 | ```
97 | Each node's nearest neighbor is itself with a distance of 0. The Admin is closest to nodes 13, 16, 6, and 5, all of which are in fact part of the Admin community. The Instructor is closest to 27, 31, 28, and 32, all of which are part of the Instructor community.
98 | 
99 | ## Reddit Post Recommender
100 | 
101 | In under 5 minutes and with just 10 lines of code, we can create and deploy a Reddit post recommender based on a graph dataset with over 100 million edges. We will use the Reddit post dataset from the [GraphSage](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) paper. Each node represents a post, and an edge between posts represents one user who commented on both posts. Each node is labeled with one of 41 subreddits, which group the posts by theme or user interest. The original paper focused on correctly classifying the subreddit of each post. Here, we will simply say that a post recommendation is reasonable if it is in the same subreddit as the query post. See [reddit.py](https://github.com/devinjdangelo/FastRec/blob/master/examples/reddit.py) for the full script to replicate the below.
102 | 
103 | First, we download the Reddit dataset.
104 | 
105 | ```python
106 | import pandas as pd
107 | import numpy as np
108 | from dgl.data import RedditDataset
109 | data = RedditDataset(self_loop=True)
110 | e1, e2 = data.graph.all_edges()
111 | e1, e2 = e1.numpy(), e2.numpy()
112 | nodes = pd.DataFrame(data.labels,dtype=np.int32,columns=['labels'])
113 | ```
114 | 
115 | Now we can set up our recommender. For larger graphs, it will be much faster to use a GPU for both the torch and faiss computations.
116 | 
117 | ```python
118 | from fastrec import GraphRecommender
119 | sage = GraphRecommender(128, feature_dim=512, hidden_dim=256,
120 |     torch_device='cuda', faiss_gpu=True, distance='cosine')
121 | sage.add_nodes(nodes.index.to_numpy())
122 | sage.add_edges(e1,e2)
123 | sage.update_labels(nodes.labels)
124 | ```
125 | 
126 | Finally, we can evaluate our untrained embeddings and deploy our API.
127 | 
128 | ```python
129 | perf = sage.evaluate(test_levels=[10,5])
130 | print(perf)
131 | {'Top 10 neighbors': {'Share >=1 correct neighbor': 0.9517867490824802, 'Share of correct neighbors': 0.8623741763784262}, 'Top 5 neighbors': {'Share >=1 correct neighbor': 0.9417079818856909, 'Share of correct neighbors': 0.8764973279247956}}
132 | sage.start_api()
133 | ```
134 | 
135 | The performance stats indicate that on average 86% of the top 10 recommendations for a post are in the same subreddit. About 95% of all posts have at least 1 recommendation in the same subreddit among their top 10 recommendations. We could optionally train our embeddings with supervised or unsupervised learning from here, but for now this performance is good enough. We can now query our API over the network.
136 | 
137 | ## Recommender API
138 | 
139 | We can share the recommender system as an API in a single line. No args are needed to test over localhost, but we can optionally pass in any args accepted by [uvicorn](https://www.uvicorn.org/deployment/).
140 | 
141 | ```python
142 | host, port = '127.0.0.1', 8000
143 | sage.start_api(host=host,port=port)
144 | ```
145 | 
146 | This method of starting the API is convenient but has some downsides in the current implementation. Some data will be duplicated in memory, so if your graph is taking up most of your available memory this deployment may fail. You can avoid this issue by instead using the included deployment script. Simply save your GraphRecommender and point the deployment script to the saved location. Just like with the previous method, all args are passed along to uvicorn.
147 | 
148 | ```bash
149 | fastrec-deploy /example/directory --host 127.0.0.1 --port 8000
150 | ```
151 | 
152 | Now we can query the recommender from any other script on the network. For detailed API docs, see the /docs endpoint.
153 | 
154 | ```python
155 | import requests
156 | #configure url, default is localhost
157 | apiurl = 'http://127.0.0.1:8000/knn/{}?k={}'
158 | example_node = '0'
159 | k = 10
160 | r = requests.get(apiurl.format(example_node,k))
161 | r.json()
162 | {0: {'neighbors': [0, 114546, 118173, 123258, 174705, 99438, 51354, 119874, 203176, 101864], 'distances': [0.9999998807907104, 0.9962959289550781, 0.9962303042411804, 0.9961680173873901, 0.9961460828781128, 0.9961054921150208, 0.9961045980453491, 0.9960995316505432, 0.9960215091705322, 0.9960126280784607]}}
163 | ```
164 | 
165 | Because the deployed API is backed by a trained faiss index, requests are answered quickly even for large graphs. For the Reddit post recommender described above, the default API responds in about 82ms.
166 | 
167 | ```python
168 | import random
169 | %timeit r = requests.get(apiurl.format(random.randint(0,232964),k))
170 | 82.3 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 | ```
172 | 
173 | ## Save and Load
174 | 
175 | If you are creating a very large graph (millions of nodes and edges), you will want to save your created graph and model weights to disk, so that you will not have to process the raw edge data or train the embeddings again. You can save and load all of the necessary information to restore your GraphRecommender in a single line.
176 | 
177 | ```python
178 | sage.save('/example/directory')
179 | ```
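Based on the `save` implementation, the target directory holds everything needed to rebuild the recommender: the graph structure (dgl.bin), the node id mapping with labels and feature flags (node_ids.csv), the trainable input feature embedding (embed.torch), the GraphSage network weights (model_weights.torch), the current output embeddings (final_embed.npy), and the constructor args (initargs.pkl). A rough sketch of the resulting layout, with file names taken from `GraphRecommender.save`:

```python
import os
sorted(os.listdir('/example/directory'))
['dgl.bin', 'embed.torch', 'final_embed.npy', 'initargs.pkl', 'model_weights.torch', 'node_ids.csv']
```

180 | You can likewise restore your session in a single line.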
181 | 182 | ```python 183 | sage = GraphRecommender.load('/example/directory') 184 | ``` 185 | 186 | Note that the loading method is a classmethod, so you do not need to initialize a new instance of GraphRecommeder to restore from disk. The save and load functionality keeps track of the args you used to initialize the class for you. 187 | -------------------------------------------------------------------------------- /examples/graphics/150epochs_supervised_trained.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/devinjdangelo/FastRec/6728469d5ae11493236dc816022942913bd9d601/examples/graphics/150epochs_supervised_trained.png -------------------------------------------------------------------------------- /examples/graphics/150epochs_unsupervised_trained.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/devinjdangelo/FastRec/6728469d5ae11493236dc816022942913bd9d601/examples/graphics/150epochs_unsupervised_trained.png -------------------------------------------------------------------------------- /examples/graphics/supervised.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/devinjdangelo/FastRec/6728469d5ae11493236dc816022942913bd9d601/examples/graphics/supervised.gif -------------------------------------------------------------------------------- /examples/graphics/unsupervised.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/devinjdangelo/FastRec/6728469d5ae11493236dc816022942913bd9d601/examples/graphics/unsupervised.gif -------------------------------------------------------------------------------- /examples/graphics/untrained_example_large.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/devinjdangelo/FastRec/6728469d5ae11493236dc816022942913bd9d601/examples/graphics/untrained_example_large.png -------------------------------------------------------------------------------- /examples/graphics/untrained_example_supervised.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/devinjdangelo/FastRec/6728469d5ae11493236dc816022942913bd9d601/examples/graphics/untrained_example_supervised.png -------------------------------------------------------------------------------- /examples/graphics/untrained_example_unsupervised.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/devinjdangelo/FastRec/6728469d5ae11493236dc816022942913bd9d601/examples/graphics/untrained_example_unsupervised.png -------------------------------------------------------------------------------- /examples/karate_attributes.csv: -------------------------------------------------------------------------------- 1 | node,role,community 2 | 0,Administrator,Administrator 3 | 1,Member,Administrator 4 | 2,Member,Administrator 5 | 3,Member,Administrator 6 | 4,Member,Administrator 7 | 5,Member,Administrator 8 | 6,Member,Administrator 9 | 7,Member,Administrator 10 | 8,Member,Administrator 11 | 9,Member,Instructor 12 | 10,Member,Administrator 13 | 11,Member,Administrator 14 | 12,Member,Administrator 15 | 13,Member,Administrator 16 | 14,Member,Instructor 17 | 15,Member,Instructor 18 | 16,Member,Administrator 19 | 17,Member,Administrator 20 | 
18,Member,Instructor 21 | 19,Member,Administrator 22 | 20,Member,Instructor 23 | 21,Member,Administrator 24 | 22,Member,Instructor 25 | 23,Member,Instructor 26 | 24,Member,Instructor 27 | 25,Member,Instructor 28 | 26,Member,Instructor 29 | 27,Member,Instructor 30 | 28,Member,Instructor 31 | 29,Member,Instructor 32 | 30,Member,Instructor 33 | 31,Member,Instructor 34 | 32,Member,Instructor 35 | 33,Instructor,Instructor 36 | 37 | -------------------------------------------------------------------------------- /examples/karateclub.py: -------------------------------------------------------------------------------- 1 | import networkx as nx 2 | import pandas as pd 3 | import imageio 4 | import matplotlib.pyplot as plt 5 | import tqdm 6 | import pathlib 7 | 8 | from fastrec import GraphRecommender 9 | 10 | def animate(labelsnp,all_embeddings,mask): 11 | labelsnp = labelsnp[mask] 12 | 13 | for i,embedding in enumerate(tqdm.tqdm(all_embeddings)): 14 | data = embedding[mask] 15 | fig = plt.figure(dpi=150) 16 | fig.clf() 17 | ax = fig.subplots() 18 | plt.title('Epoch {}'.format(i)) 19 | 20 | colormap = ['r' if l=='Administrator' else 'b' for l in labelsnp] 21 | plt.scatter(data[:,0],data[:,1], c=colormap) 22 | 23 | ax.annotate('Administrator',(data[0,0],data[0,1])) 24 | ax.annotate('Instructor',(data[33,0],data[33,1])) 25 | 26 | plt.savefig('./ims/{n}.png'.format(n=i)) 27 | plt.close() 28 | 29 | imagep = pathlib.Path('./ims/') 30 | images = imagep.glob('*.png') 31 | images = list(images) 32 | images.sort(key=lambda x : int(str(x).split('/')[-1].split('.')[0])) 33 | with imageio.get_writer('./animation.gif', mode='I') as writer: 34 | for image in images: 35 | data = imageio.imread(image.__str__()) 36 | writer.append_data(data) 37 | 38 | if __name__=='__main__': 39 | g = nx.karate_club_graph() 40 | nodes = list(g.nodes) 41 | e1,e2 = zip(*g.edges) 42 | attributes = pd.read_csv('./karate_attributes.csv') 43 | 44 | sage = GraphRecommender(2,distance='l2') 45 | sage.add_nodes(nodes) 46 | sage.add_edges(e1,e2) 47 | sage.add_edges(e2,e1) 48 | sage.update_labels(attributes.community) 49 | 50 | epochs, batch_size = 150, 15 51 | _,_,all_embeddings = sage.train(epochs, batch_size, unsupervised = True, learning_rate=1e-2, 52 | test_every_n_epochs=10, return_intermediate_embeddings=True) 53 | 54 | animate(sage.labels,all_embeddings,sage.entity_mask) 55 | 56 | print(sage.query_neighbors([0,33],k=5)) 57 | 58 | sage.start_api() 59 | 60 | -------------------------------------------------------------------------------- /examples/reddit.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | 4 | from dgl.data import RedditDataset 5 | from fastrec import GraphRecommender 6 | 7 | 8 | data = RedditDataset(self_loop=True) 9 | e1, e2 = data.graph.all_edges() 10 | e1, e2 = e1.numpy(), e2.numpy() 11 | nodes = pd.DataFrame(data.labels,dtype=np.int32,columns=['labels']) 12 | del data #free up some memory 13 | 14 | sage = GraphRecommender(128, feature_dim=512, hidden_dim=256, 15 | torch_device='cuda', faiss_gpu=True, distance='cosine') 16 | sage.add_nodes(nodes.index.to_numpy()) 17 | sage.add_edges(e1,e2) 18 | sage.update_labels(nodes.labels) 19 | 20 | perf = sage.evaluate(test_levels=[50,25,10,5]) 21 | print(perf) 22 | 23 | #epochs, batch_size = 100, 1000 24 | #sage.train(epochs, batch_size, unsupervised = True, learning_rate=1e-2,test_every_n_epochs=10) 25 | 26 | print(sage.query_neighbors([0,1000],k=10)) 27 | 28 | sage.start_api() 29 | 30 | 31 | 
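# Once the API is up, it can be queried from another process over HTTP.
# A minimal sketch, assuming the default uvicorn host/port (127.0.0.1:8000)
# and the /knn endpoint defined in RecAPI.py:
#
#   import requests
#   print(requests.get('http://127.0.0.1:8000/knn/0?k=10').json())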
32 | -------------------------------------------------------------------------------- /fastrec/GraphSimRec.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed May 6 10:39:20 2020 4 | 5 | @author: djdev 6 | """ 7 | 8 | import pandas as pd 9 | import networkx as nx 10 | import time 11 | import numpy as np 12 | import pathlib 13 | from math import ceil 14 | import argparse 15 | import itertools as it 16 | import tqdm 17 | import os 18 | 19 | import pickle 20 | 21 | import dgl 22 | import dgl.function as fn 23 | from dgl import DGLGraph 24 | from dgl.data import citation_graph as citegrh 25 | import dgl.nn.pytorch as dglnn 26 | 27 | import torch as th 28 | import torch.nn as nn 29 | import torch.nn.functional as F 30 | from torch.utils.data import DataLoader 31 | import faiss 32 | import uvicorn 33 | 34 | from torchmodels import * 35 | 36 | #this is the maximum edges we will ad at once to keep temp memory usage down 37 | MAX_ADD_EDGES = 1e6 38 | 39 | # this is the target ratio of nodes to faiss clusters for index training 40 | # roughly matches what the faiss warning messages suggest in testing 41 | FAISS_NODES_TO_CLUSTERS = 1000 42 | 43 | #Arbitrary... not sure what this should be long term. Depends on memory usage 44 | #which I haven't tested thoroughly yet. 45 | MAXIMUM_FAISS_CLUSTERS = 10000 46 | 47 | class GraphRecommender: 48 | """Rapidly trains similarity embeddings for graphs and generates recomendations 49 | 50 | Attributes 51 | ---------- 52 | G : DGL Graph object 53 | Current DGL graph for all added data with self.add_data 54 | node_ids : pandas data frame 55 | Contains mapping from user provided nodeids to DGL and faiss compatable integer ids. 56 | Also contains various flags which identify properties and classes of the nodes. 57 | """ 58 | 59 | def __init__(self, embedding_dim, 60 | feature_dim = None, 61 | hidden_dim = None, 62 | hidden_layers = 2, 63 | dropout = 0, 64 | agg_type = 'gcn', 65 | distance = 'cosine', 66 | torch_device = 'cpu', 67 | faiss_gpu = False, 68 | inference_batch_size = 10000, 69 | p_train = 1, 70 | train_faiss_index = False): 71 | """Generates embeddings for graph data such that embeddings close by a given distance metric are 72 | 'similar'. Embeddings can be used to predict which nodes belong to the same class. The embeddings can be 73 | trained with triplet loss in a fully supervised, semi-supervised or fully unsupervised manner. GraphSage 74 | is used to allow minibatch training. Uses faiss index to allow extremely fast query times for most similar 75 | nodes to a query node even for graphs with billions of nodes. Memory is likely to be the limiting factor before 76 | query times. 77 | 78 | Args 79 | ---- 80 | embedding_dim : int 81 | the dimension of the final output embedding used for similarity search 82 | feature_dim : int 83 | the dimension of the input node features, currently only allowed to be 84 | a trainable embedding. In the future should allow external node features. 85 | defaults to 2*hidden_dim 86 | hidden_dim : int 87 | the dimension of the intermediate hidden layers, defaults to 2*embedding dim. 88 | hidden_layers : int 89 | number of hidden layers. Embeddings can collpase to a single value if this 90 | is set too high. Defaults to 2. 91 | dropout : float 92 | whether to apply a dropout layer after hidden layers of GraphSAge. Defaults to 0, 93 | which means there is no Dropout applied. 
94 | agg_type : str 95 | aggregation function to apply to GraphSage. Valid options are 'mean', 'lstm', and 'gcn' 96 | aggregation. See GraphSage paper for implementation details. Defaults to gcn which performs 97 | well for untrained networks. 98 | distance : str 99 | distance metric to use for similarity search. Valid options are l2 and cosine. Defaults to cosine. 100 | torch_device : str 101 | computation device to place pytorch tensors on. Valid options are any valid pytorch device. Defaults 102 | to cpu. 103 | faiss_gpu : bool 104 | whether to use gpu to accelerate faiss searching. Note that it will compete with pytorch for gpu memory. 105 | inference_batch_size : number of nodes to compute per batch when computing all embeddings with self.net.inference. 106 | defaults to 10000 which should comfortably fit on most gpus and be reasonably efficient on cpu. 107 | p_train : float 108 | the proportion of nodes with known class labels to use for training defaults to 1 109 | train_faiss_index : bool 110 | whether to train faiss index for faster searches. Not reccomended for training since brute force 111 | will actually be faster than retraining the index at each test iteration. Can be used for api to speed 112 | up response times. 113 | """ 114 | self.embedding_dim = embedding_dim 115 | self.device = torch_device 116 | self.inference_batch_size = inference_batch_size 117 | assert p_train<=1 and p_train>=0 118 | self.p_train = p_train 119 | self.faiss_gpu = faiss_gpu 120 | self.train_faiss = train_faiss_index 121 | 122 | self.distance_metric = distance 123 | if self.distance_metric == 'cosine': 124 | self.distance_function = lambda t1,t2 : F.cosine_embedding_loss(t1, 125 | t2, 126 | th.ones(t1.shape[0]).to(self.device),reduce=False) 127 | elif self.distance_metric == 'l2': 128 | self.distance_function = lambda t1,t2 : th.sum(F.mse_loss(t1,t2,reduce=False),dim=1) 129 | else: 130 | raise ValueError('distance {} is not implemented'.format(self.distance)) 131 | 132 | hidden_dim = embedding_dim*4 if hidden_dim is None else hidden_dim 133 | feature_dim = hidden_dim*2 if feature_dim is None else feature_dim 134 | self.feature_dim = feature_dim 135 | self.net = SAGE(feature_dim, hidden_dim, embedding_dim, hidden_layers, F.relu, dropout, agg_type) 136 | self.net.to(self.device) 137 | 138 | self._embeddings = None 139 | self._index = None 140 | self._masks_set = False 141 | 142 | self.node_ids = pd.DataFrame(columns=['id','intID','classid','feature_flag']) 143 | self.G = DGLGraph() 144 | 145 | #hold init args in memory in case needed to save to disk for restoring later 146 | self.initargs = (embedding_dim, 147 | feature_dim, 148 | hidden_dim, 149 | hidden_layers, 150 | dropout, 151 | agg_type, 152 | distance, 153 | torch_device, 154 | faiss_gpu, 155 | inference_batch_size, 156 | p_train, 157 | train_faiss_index) 158 | 159 | 160 | def add_nodes(self, nodearray, skip_duplicates=False): 161 | """Define nodes by passing an array (or array like object). Nodes 162 | can be identified by any data type (even mixed data types), but each 163 | node must be unique. An exception is raised if all nodes are not unique 164 | including if the same node is attempted to be added in two calls to this 165 | method. Each node is mapped to a unique integer id based on the order 166 | they are added. 
167 | 168 | Args 169 | ---- 170 | nodearray : numpy array (or array-like object) 171 | array containing the identifiers of each node to be added 172 | skip_duplicates : bool 173 | if true, ignore nodes which have already been added. If False, raise error. 174 | """ 175 | 176 | ninputnodes = len(nodearray) 177 | nodedf = pd.DataFrame(nodearray, columns=['id']) 178 | 179 | if len(nodedf) != len(nodedf.drop_duplicates()): 180 | raise ValueError('Provided nodeids are not unique. Please pass an array of unique identifiers.') 181 | 182 | nodes_already_exist = nodedf.merge(self.node_ids,on='id',how='inner') 183 | if len(nodes_already_exist)>0 and not skip_duplicates: 184 | raise ValueError( 185 | 'Some provided nodes have already been added to the graph. See node_ids.ids.') 186 | elif len(nodes_already_exist)>0 and skip_duplicates: 187 | #get rid of the duplicates 188 | nodes_already_exist['dropflag'] = True 189 | nodedf = nodedf.merge(nodes_already_exist,on='id',how='left') 190 | nodedf['dropflag'] = ~pd.isna(nodedf.dropflag) 191 | nodedf = nodedf.drop(nodedf[nodedf.dropflag].index) 192 | nodedf = nodedf[['id']] 193 | 194 | 195 | current_maximum_id = self.node_ids.intID.max() 196 | num_new_nodes = len(nodedf) 197 | 198 | start = (current_maximum_id+1) 199 | if np.isnan(start): 200 | start = 0 201 | end = start + num_new_nodes 202 | 203 | nodedf['intID'] = range(start,end) 204 | nodedf['classid'] = None 205 | nodedf['feature_flag'] = False 206 | 207 | self.node_ids = pd.concat([self.node_ids,nodedf]) 208 | 209 | self._masks_set = False 210 | 211 | if self.G.is_readonly: 212 | self.G = dgl.as_immutable_graph(self.G) 213 | self.G.readonly(False) 214 | self.G.add_nodes(num_new_nodes) 215 | 216 | self._masks_set = False 217 | self._embeddings = None 218 | self._index = None 219 | 220 | 221 | def add_edges(self, n1, n2): 222 | """Adds edges to the DGL graph. Nodes must be previously defined by 223 | add_nodes or an exception is raised. Edges are directed. To define 224 | a undirected graph, include both n1->n2 and n2->n1 in the graph. 225 | 226 | Args 227 | ---- 228 | n1 : numpy array (or array-like object) 229 | first node in the edge (n1->n2) 230 | n2 : numpy array (or array-like object) 231 | second node in the edge (n1->n2) 232 | """ 233 | edgedf_all = pd.DataFrame(n1,columns=['n1']) 234 | edgedf_all['n2'] = n2 235 | 236 | chunks = int(max(len(edgedf_all)//MAX_ADD_EDGES,1)) 237 | edgedf_all = np.array_split(edgedf_all, chunks) 238 | 239 | if chunks>1: 240 | pbar = tqdm.tqdm(total=chunks) 241 | 242 | for i in range(chunks): 243 | edgedf = edgedf_all.pop() 244 | edgedf = edgedf.merge(self.node_ids,left_on='n1',right_on='id',how='left') 245 | edgedf = edgedf.merge(self.node_ids,left_on='n2',right_on='id',how='left',suffixes=('','2')) 246 | edgedf = edgedf[['intID','intID2']] 247 | 248 | if len(edgedf) != len(edgedf.dropna()): 249 | raise ValueError('Some edges do not correspond to any known node. Please add with add_nodes method first.') 250 | 251 | if self.G.is_readonly: 252 | self.G = dgl.as_immutable_graph(self.G) 253 | self.G.readonly(False) 254 | 255 | self.G.add_edges(edgedf.intID,edgedf.intID2) 256 | 257 | if chunks>1: 258 | pbar.update(1) 259 | 260 | if chunks>1: 261 | pbar.close() 262 | 263 | self._masks_set = False 264 | self._embeddings = None 265 | self._index = None 266 | 267 | def _update_node_ids(self,datadf): 268 | """Overwrites existing information about nodes with new info 269 | contained in a dataframe. 
Temporarily sets id as the index to use 270 | built in pandas update method aligned on index. 271 | 272 | Args 273 | ---- 274 | datadf : data frame 275 | has the same structure as self.node_ids 276 | """ 277 | 278 | datadf.set_index('id',inplace=True,drop=True) 279 | self.node_ids.set_index('id',inplace=True,drop=True) 280 | self.node_ids.update(datadf, overwrite=True) 281 | self.node_ids.reset_index(inplace=True) 282 | 283 | def update_labels(self,labels): 284 | 285 | """Updates nodes by adding a label (or class). Existing class label 286 | is overridden if one already exists. Any node which does not have a 287 | known class has a label of None. Any data type can be a valid class 288 | label except for None which is reserved for unknown class. All nodes 289 | included in the update must be previously defined by add_nodes or 290 | an exception is raised. 291 | 292 | Args 293 | ---- 294 | labels : dictionary or pandas series 295 | maps node ids to label, i.e. classid. If pandas series the index acts as the dictionary key.""" 296 | 297 | labeldf = pd.DataFrame(labels.items(), columns=['id','classid']) 298 | labeldf = labeldf.merge(self.node_ids,on='id',how='left',suffixes=('','2')) 299 | 300 | if labeldf['intID'].isna().sum() > 0: 301 | raise ValueError('Some nodes in update do not exist in graph. Add them first with add_nodes.') 302 | 303 | labeldf = labeldf[['id','intID','classid','feature_flag']] 304 | self._update_node_ids(labeldf) 305 | 306 | self._masks_set = False 307 | self._embeddings = None 308 | self._index = None 309 | 310 | def update_feature_flag(self,flags): 311 | """Updates node by adding a feature flag. This can be True or False. 312 | If the feature flag is True, the node will not be included in any 313 | recommender index. It will still be included in the graph to enrich 314 | the embeddings of the other nodes, but it will never be returned as 315 | a recommendation as a similar node. I.e. if True this node is a feature 316 | of other nodes only and not interesting as an entity of its own right. 317 | 318 | Args 319 | ---- 320 | flags : dictionary or pandas series 321 | maps node ids to feature flag. If pandas series the index acts as the dictionary key.""" 322 | 323 | featuredf = pd.DataFrame(flags.items(), columns=['id','feature_flag']) 324 | featuredf = featuredf.merge(self.node_ids,on='id',how='left',suffixes=('','2')) 325 | 326 | if featuredf['intID'].isna().sum() > 0: 327 | raise ValueError('Some nodes in update do not exist in graph. Add them first with add_nodes.') 328 | 329 | featuredf = featuredf[['id','intID','classid','feature_flag']] 330 | self._update_node_ids(featuredf) 331 | 332 | self._masks_set = False 333 | self._embeddings = None 334 | self._index = None 335 | 336 | def set_masks(self): 337 | """Sets train, test, and relevance masks. Needs to be called once after data as been added to graph. 338 | self.train and self.evaluate automatically check if this needs to be called and will call it, but 339 | it can also be called manually. 
Can be called a second time manually to reroll the random generation 340 | of the train and test sets.""" 341 | 342 | self.node_ids = self.node_ids.sort_values('intID') 343 | self.labels = self.node_ids.classid.to_numpy() 344 | 345 | #is relevant mask indicates the nodes which we know the class of 346 | self.is_relevant_mask = np.logical_not(pd.isna(self.node_ids.classid).to_numpy()) 347 | 348 | #entity_mask indicates the nodes which we want to include in the faiss index 349 | self.entity_mask = np.logical_not(self.node_ids.feature_flag.to_numpy().astype(np.bool)) 350 | 351 | self.train_mask = np.random.choice( 352 | a=[False,True],size=(len(self.node_ids)),p=[1-self.p_train,self.p_train]) 353 | 354 | #test set is all nodes other than the train set unless train set is all 355 | #nodes and then test set is the same as train set. 356 | if self.p_train != 1: 357 | self.test_mask = np.logical_not(self.train_mask) 358 | else: 359 | self.test_mask = self.train_mask 360 | 361 | #do not include any node without a classid in either set 362 | self.train_mask = np.logical_and(self.train_mask,self.is_relevant_mask) 363 | self.train_mask = np.logical_and(self.train_mask,self.entity_mask) 364 | self.test_mask = np.logical_and(self.test_mask,self.is_relevant_mask) 365 | self.test_mask = np.logical_and(self.test_mask,self.entity_mask) 366 | 367 | if not self.G.is_readonly: 368 | self.embed = nn.Embedding(len(self.node_ids),self.feature_dim) 369 | self.G.readonly() 370 | self.G = dgl.as_heterograph(self.G) 371 | self.G.ndata['features'] = self.embed.weight 372 | 373 | self.features = self.embed.weight 374 | self.features.to(self.device) 375 | self.embed.to(self.device) 376 | 377 | self._masks_set = True 378 | 379 | @property 380 | def embeddings(self): 381 | """Updates all node embeddings if needed and returns the embeddings. 382 | Simple implementation of a cached property. 383 | 384 | Returns 385 | ------- 386 | embeddings node x embedding_dim tensor""" 387 | 388 | if self._embeddings is None: 389 | if not self._masks_set: 390 | self.set_masks() 391 | print('computing embeddings for all nodes...') 392 | with th.no_grad(): 393 | self._embeddings = self.net.inference( 394 | self.G, self.features,self.inference_batch_size,self.device).detach().cpu().numpy() 395 | return self._embeddings 396 | 397 | @property 398 | def index(self): 399 | """Creates a faiss index for similarity searches over the node embeddings. 400 | Simple implementation of a cached property. 401 | 402 | Returns 403 | ------- 404 | a faiss index with input embeddings added and optionally trained""" 405 | 406 | if self._index is None: 407 | if not self._masks_set: 408 | self.set_masks() 409 | if self.distance_metric=='cosine': 410 | self._index = faiss.IndexFlatIP(self.embedding_dim) 411 | embeddings = np.copy(self.embeddings[self.entity_mask]) 412 | #this function operates in place so np.copy any views into a new array before using. 
413 | faiss.normalize_L2(embeddings) 414 | elif self.distance_metric=='l2': 415 | self._index = faiss.IndexFlatL2(self.embedding_dim) 416 | embeddings = self.embeddings[self.entity_mask] 417 | 418 | if self.train_faiss: 419 | training_points = min( 420 | len(self.node_ids)//FAISS_NODES_TO_CLUSTERS+1, 421 | MAXIMUM_FAISS_CLUSTERS) 422 | self._index = faiss.IndexIVFFlat(self._index, self.embedding_dim, training_points) 423 | self._index.train(embeddings) 424 | 425 | self._index.add(embeddings) 426 | 427 | if self.faiss_gpu: 428 | GPU = faiss.StandardGpuResources() 429 | self._index = faiss.index_cpu_to_gpu(GPU, 0, self._index) 430 | 431 | 432 | return self._index 433 | 434 | def _search_index(self,inputs,k): 435 | """Directly searches the faiss index and 436 | returns the k nearest neighbors of inputs 437 | 438 | Args 439 | ---- 440 | inputs : numpy array np.float 441 | the vectors to search against the faiss index 442 | k : int 443 | how many neighbors to lookup 444 | 445 | Returns 446 | ------- 447 | D, I distance numpy array and neighbors array from faiss""" 448 | 449 | if self.distance_metric == 'cosine': 450 | inputs = np.copy(inputs) 451 | faiss.normalize_L2(inputs) 452 | D, I = self.index.search(inputs,k) 453 | return D,I 454 | 455 | def _get_intID(self,nodelist): 456 | """Accepts a list of nodeids and converts them to internally used 457 | sequential integer id. 458 | 459 | Args 460 | ---- 461 | nodelist : List 462 | node identifiers to convert 463 | 464 | Returns 465 | ------- 466 | list of integer identifiers""" 467 | 468 | relevant_nodes = self.node_ids.loc[self.node_ids.id.isin(nodelist)] 469 | try: 470 | intids = [relevant_nodes.loc[relevant_nodes.id == node].intID.iloc[0] 471 | for node in nodelist] 472 | except IndexError: 473 | intids = [relevant_nodes.loc[relevant_nodes.id == int(node)].intID.iloc[0] 474 | for node in nodelist] 475 | 476 | return intids 477 | 478 | def get_embeddings(self,nodelist): 479 | """Looks up the embedding for a specific list of nodes based on 480 | their nodeid. 481 | 482 | Args 483 | ---- 484 | nodelist : List 485 | list of node identifiers to get the embedding of 486 | 487 | Returns 488 | ------- 489 | numpy array of final embeddings""" 490 | 491 | intids = self._get_intID(nodelist) 492 | return self.embeddings[intids,:] 493 | 494 | def _faiss_ids_to_nodeids(self, I, return_labels): 495 | """Takes an output from faiss index and maps the faissids back to nodeids 496 | and optionally node class labels 497 | 498 | Args 499 | ---- 500 | I : numpy array 501 | array returned from a faiss index search 502 | return_labels : bool 503 | whether to lookup labels 504 | 505 | Returns 506 | ------- 507 | I : array with ids mapped to nodeids 508 | L : optionally second array with ids mapped to node class labels, 509 | if return_labels is false, is None""" 510 | 511 | faissid_to_nodeid = self.node_ids.id.to_numpy()[self.entity_mask].tolist() 512 | if return_labels: 513 | faissid_to_label = self.node_ids.classid.to_numpy()[self.entity_mask].tolist() 514 | L = [[faissid_to_label[neighbor] for neighbor in neighbors] for neighbors in I] 515 | I = [[faissid_to_nodeid[neighbor] for neighbor in neighbors] for neighbors in I] 516 | else: 517 | I = [[faissid_to_nodeid[neighbor] for neighbor in neighbors] for neighbors in I] 518 | L = None 519 | return I, L 520 | 521 | 522 | def query_neighbors(self, nodelist, k, return_labels=False): 523 | """For each query node in nodelist, return the k closest neighbors in the 524 | embedding space. 
525 | 526 | Args 527 | ---- 528 | nodelist : list 529 | list of node identifiers to query 530 | k : int 531 | number of neighbors to return 532 | return_labels : bool 533 | if true, includes the node label of all neighbors returned 534 | 535 | Returns 536 | ------- 537 | dictionary of neighbors for each querynode and corresponding distance""" 538 | 539 | if not self._masks_set: 540 | self.set_masks() 541 | 542 | inputs = self.get_embeddings(nodelist) 543 | 544 | D, I = self._search_index(inputs,k) 545 | I,L = self._faiss_ids_to_nodeids(I,return_labels) 546 | if return_labels: 547 | output = {node:{'neighbors':i,'neighbor labels':l,'distances':d.tolist()} for node, d, i, l in zip(nodelist,D,I,L)} 548 | else: 549 | output = {node:{'neighbors':i,'distances':d.tolist()} for node, d, i in zip(nodelist,D,I)} 550 | return output 551 | 552 | def evaluate(self, test_levels=[5,1], test_only=False): 553 | """Evaluates performance of current embeddings 554 | 555 | Args 556 | ---- 557 | test_only : bool 558 | whether to only test the performance on the test set. If 559 | false, all nodes with known class will be tested. 560 | test_levels : list of ints 561 | each entry is a number of nearest neighbors and we will test 562 | if at least one of the neighbors at each level contains a correct 563 | neighbor based on node labels. We also test the 564 | total share of the neighbors that have a correct label. 565 | 566 | Returns 567 | ------- 568 | dictionary containing details of the performance of the model at each level 569 | """ 570 | 571 | self.net.eval() 572 | 573 | if not self._masks_set: 574 | self.set_masks() 575 | 576 | mask = self.test_mask if test_only else self.is_relevant_mask 577 | test_labels = self.labels[mask] 578 | faiss_labels = self.labels[self.entity_mask] 579 | 580 | test_embeddings = self.embeddings[mask] 581 | 582 | #we need to return the maximum number of neighbors that we want to test 583 | #plus 1 since the top neighbor of each node will always be itself, which 584 | #we exclude. 585 | _, I = self._search_index(test_embeddings,max(test_levels)+1) 586 | 587 | performance = {level:[] for level in test_levels} 588 | performance_share = {level:[] for level in test_levels} 589 | for node, neighbors in enumerate(I): 590 | label = test_labels[node] 591 | neighbor_labels = [faiss_labels[n] for n in neighbors[1:]] 592 | for level in test_levels: 593 | correct_labels = np.sum([label==nl for nl in neighbor_labels[:level]]) 594 | #at least one label in the neighbors was correct 595 | performance[level].append(correct_labels>0) 596 | #share of labels in the neighbors that was correct 597 | performance_share[level].append(correct_labels/level) 598 | 599 | return {f'Top {level} neighbors': 600 | {'Share >=1 correct neighbor':np.mean(performance[level]), 601 | 'Share of correct neighbors':np.mean(performance_share[level])} 602 | for level in test_levels} 603 | 604 | @staticmethod 605 | def setup_pairwise_loss_tensors(labelsnp): 606 | """Accepts a list of labels and sets up indexers which can be used 607 | in a triplet loss function along with whether each pair is a positive or 608 | negative example. 
609 | 610 | Args 611 | ---- 612 | labelsnp : numpy array 613 | Class labels of each node, labelsnp[i] = class of node with intid i 614 | 615 | Returns 616 | ------- 617 | idx1 : indexer array for left side comparison 618 | idx2 : indexer array for right side comparison 619 | target : array indicating whether left and right side are the same or different""" 620 | 621 | idx1 = [] 622 | idx2 = [] 623 | target = [] 624 | for i,l in enumerate(labelsnp): 625 | ids = list(range(len(labelsnp))) 626 | for j,other in zip(ids[i+1:],labelsnp[i+1:]): 627 | if other==l: 628 | idx1.append(i) 629 | idx2.append(j) 630 | target.append(1) 631 | else: 632 | idx1.append(i) 633 | idx2.append(j) 634 | target.append(-1) 635 | 636 | return idx1, idx2, target 637 | 638 | def triplet_loss(self,embeddings,labels): 639 | """For a given tensor of embeddings and corresponding labels, 640 | returns a triplet loss maximizing distance between negative examples 641 | and minimizing distance between positive examples 642 | 643 | Args 644 | ---- 645 | embeddings : pytorch tensor torch.float32 646 | embeddings to be trained 647 | labels : numpy array 648 | Class labels of each node, labelsnp[i] = class of node with intid i""" 649 | 650 | batch_relevant_nodes = [i for i,l in enumerate(labels) if not pd.isna(l)] 651 | embeddings = embeddings[batch_relevant_nodes] 652 | labels = labels[batch_relevant_nodes] 653 | idx1,idx2,target = self.setup_pairwise_loss_tensors(labels) 654 | 655 | 656 | losstarget = th.tensor(target).to(self.device) 657 | 658 | if self.distance_metric=='cosine': 659 | input1 = embeddings[idx1] 660 | input2 = embeddings[idx2] 661 | loss = F.cosine_embedding_loss(input1, 662 | input2, 663 | losstarget, 664 | margin=0.5) 665 | elif self.distance_metric=='l2': 666 | idx1_pos = [idx for i,idx in enumerate(idx1) if target[i]==1] 667 | idx1_neg = [idx for i,idx in enumerate(idx1) if target[i]==-1] 668 | 669 | idx2_pos = [idx for i,idx in enumerate(idx2) if target[i]==1] 670 | idx2_neg = [idx for i,idx in enumerate(idx2) if target[i]==-1] 671 | 672 | input1_pos = embeddings[idx1_pos] 673 | input2_pos = embeddings[idx2_pos] 674 | 675 | input1_neg = embeddings[idx1_neg] 676 | input2_neg = embeddings[idx2_neg] 677 | 678 | loss_pos = F.mse_loss(input1_pos,input2_pos) 679 | loss_neg = th.mean(th.max(th.zeros(input1_neg.shape[0]).to(self.device),0.25-th.sum(F.mse_loss(input1_neg,input2_neg,reduce=False),dim=1))) 680 | 681 | loss = loss_pos + loss_neg 682 | 683 | else: 684 | raise ValueError('distance {} is not implemented'.format(self.distance_metric)) 685 | 686 | return loss 687 | 688 | 689 | def train(self,epochs, 690 | batch_size, 691 | test_every_n_epochs = 1, 692 | unsupervised = False, 693 | learning_rate = 1e-2, 694 | fanouts = [10,25], 695 | neg_samples = 1, 696 | return_intermediate_embeddings = False, 697 | test_levels=[5,1]): 698 | """Trains the network weights to improve the embeddings. Can train via supervised learning with triplet loss, 699 | semisupervised learning with triplet loss, or fully unsupervised learning. 
700 | 701 | Args 702 | ---- 703 | epochs : int 704 | number of training passes over the data 705 | batch_size : int 706 | number of seed nodes for building the training graph 707 | test_every_n_epochs : int 708 | how often to do a full evaluation of the embeddings, expensive for large graphs 709 | unsupervised : bool 710 | whether to train completely unsupervised 711 | learning_rate : float 712 | learning rate to use in the adam optimizer 713 | fanouts : list of int 714 | number of neighbors to sample at each layer for GraphSage 715 | neg_samples : int 716 | number of negative samples to use in unsupervised loss 717 | test_levels : list of ints 718 | passsed to self.eval, number of neighbors to use for testing accuracy""" 719 | 720 | if not self._masks_set: 721 | self.set_masks() 722 | 723 | optimizer = th.optim.Adam(it.chain(self.net.parameters(),self.embed.parameters()), lr=learning_rate) 724 | 725 | if not unsupervised: 726 | sampler = NeighborSampler(self.G, [int(fanout) for fanout in fanouts]) 727 | sampledata = np.nonzero(self.train_mask)[0] 728 | else: 729 | sampler = UnsupervisedNeighborSampler(self.G, [int(fanout) for fanout in fanouts],neg_samples) 730 | sampledata = list(range(len(self.node_ids))) 731 | unsup_loss_fn = CrossEntropyLoss() 732 | unsup_loss_fn.to(self.device) 733 | 734 | dataloader = DataLoader( 735 | dataset=sampledata, 736 | batch_size=batch_size, 737 | collate_fn=sampler.sample_blocks, 738 | shuffle=True, 739 | drop_last=True, 740 | num_workers=0) 741 | 742 | 743 | 744 | 745 | perf = self.evaluate(test_levels=test_levels,test_only=True) 746 | 747 | testtop5, testtop1 = perf['Top 5 neighbors']['Share >=1 correct neighbor'], \ 748 | perf['Top 1 neighbors']['Share >=1 correct neighbor'] 749 | 750 | testtop5tot, testtop1tot = perf['Top 5 neighbors']['Share of correct neighbors'], \ 751 | perf['Top 1 neighbors']['Share of correct neighbors'] 752 | 753 | print(testtop5,testtop1,testtop5tot, testtop1tot) 754 | print("Test Top5 {:.4f} | Test Top1 {:.4f} | Test Top5 Total {:.4f} | Test Top1 Total {:.4f} ".format( 755 | testtop5,testtop1,testtop5tot, testtop1tot)) 756 | 757 | loss_history = [] 758 | perf_history = [perf] 759 | if return_intermediate_embeddings: 760 | all_embeddings = [] 761 | all_embeddings.append(self.embeddings) 762 | 763 | for epoch in range(1,epochs+1): 764 | 765 | for step,data in enumerate(dataloader): 766 | #sup_blocks, unsupervised_data = data 767 | #pos_graph, neg_graph, unsup_blocks = unsupervised_data 768 | 769 | 770 | self.net.train() 771 | 772 | # these names are confusing because "seeds" are the input 773 | # to neighbor generation but the output in the sense that we 774 | # output their embeddings based on their neighbors... 775 | # the neighbors are the inputs in the sense that they are what we 776 | # use to generate the embedding for the seeds. 
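                # Concretely: blocks[0].srcdata[dgl.NID] holds the full sampled
                # neighborhood (the network inputs), while blocks[-1].dstdata[dgl.NID]
                # holds the seed nodes whose embeddings self.net produces below.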
777 | if not unsupervised: 778 | sup_blocks = data 779 | sup_input_nodes = sup_blocks[0].srcdata[dgl.NID] 780 | sup_seeds = sup_blocks[-1].dstdata[dgl.NID] 781 | 782 | #sup_batch_inputs = self.G.ndata['features'][sup_input_nodes].to(self.device) 783 | sup_batch_inputs = self.features[sup_input_nodes].to(self.device) 784 | sup_batch_labels = self.labels[sup_seeds] 785 | #nodeids = [self.node_ids.loc[self.node_ids.intID==i].id.iloc[0] for i in sup_seeds] 786 | 787 | #print(sup_batch_labels,nodeids) 788 | 789 | sup_embeddings = self.net(sup_blocks, sup_batch_inputs) 790 | 791 | 792 | 793 | loss = self.triplet_loss(sup_embeddings,sup_batch_labels) 794 | else: 795 | pos_graph, neg_graph, unsup_blocks = data 796 | unsup_input_nodes = unsup_blocks[0].srcdata[dgl.NID] 797 | unsup_seeds = unsup_blocks[-1].dstdata[dgl.NID] 798 | 799 | unsup_batch_inputs = self.G.ndata['features'][unsup_input_nodes].to(self.device) 800 | 801 | unsup_embeddings =self.net(unsup_blocks,unsup_batch_inputs) 802 | loss = unsup_loss_fn(unsup_embeddings, pos_graph, neg_graph) 803 | 804 | optimizer.zero_grad() 805 | loss.backward() 806 | optimizer.step() 807 | #once the parameters change we no longer know the new embeddings for all nodes 808 | self._embeddings = None 809 | self._index = None 810 | 811 | 812 | print("Epoch {:05d} | Step {:0.1f} | Loss {:.8f}".format( 813 | epoch, step, loss.item())) 814 | if return_intermediate_embeddings: 815 | all_embeddings.append(self.embeddings) 816 | loss_history.append(loss.item()) 817 | if epoch % test_every_n_epochs == 0 or epoch==epochs: 818 | 819 | perf = self.evaluate(test_levels=test_levels,test_only=True) 820 | 821 | testtop5, testtop1 = perf['Top 5 neighbors']['Share >=1 correct neighbor'], \ 822 | perf['Top 1 neighbors']['Share >=1 correct neighbor'] 823 | 824 | testtop5tot, testtop1tot = perf['Top 5 neighbors']['Share of correct neighbors'], \ 825 | perf['Top 1 neighbors']['Share of correct neighbors'] 826 | 827 | print("Epoch {:05d} | Loss {:.8f} | Test Top5 {:.4f} | Test Top1 {:.4f} | Test Top5 Total {:.4f} | Test Top1 Total {:.4f} ".format( 828 | epoch, loss.item(),testtop5,testtop1,testtop5tot, testtop1tot)) 829 | 830 | perf_history.append(perf) 831 | 832 | if return_intermediate_embeddings: 833 | return loss_history,perf_history,all_embeddings 834 | else: 835 | return loss_history,perf_history 836 | 837 | def start_api(self,*args,**kwargs): 838 | """Launches a fastapi to query this class in its current state.""" 839 | package_path = os.path.dirname(os.path.abspath(__file__)) 840 | production_path = package_path + '/production_model' 841 | pathlib.Path(production_path).mkdir(exist_ok=True) 842 | self.save(production_path) 843 | os.environ['FASTREC_DEPLOY_PATH'] = production_path 844 | #this import cant be at the top level to prevent circular dependency 845 | from RecAPI import app 846 | uvicorn.run(app,*args,**kwargs) 847 | 848 | 849 | def save(self, filepath): 850 | """Save all information neccessary to recover current state of the current instance of 851 | this object to a folder. Initialization args, graph data, node ids, current trained embedding, 852 | and current torch paramters are all saved. 
853 | 
854 |         Args
855 |         ----
856 |         filepath : str
857 |             path on disk to save files"""
858 | 
859 | 
860 |         outg = dgl.as_immutable_graph(self.G)
861 |         dgl.data.utils.save_graphs(f'{filepath}/dgl.bin',outg)
862 | 
863 |         self.node_ids.to_csv(f'{filepath}/node_ids.csv',index=False)
864 | 
865 |         th.save(self.embed,f'{filepath}/embed.torch')
866 |         th.save(self.net.state_dict(),f'{filepath}/model_weights.torch')
867 |         embeddings = self.embeddings
868 |         np.save(f'{filepath}/final_embed.npy',embeddings,allow_pickle=False)
869 | 
870 |         with open(f'{filepath}/initargs.pkl','wb') as pklf:
871 |             pickle.dump(self.initargs,pklf)
872 | 
873 |     def load_graph_data(self,filepath):
874 |         """Restore graph data from disk, but not network parameters
875 |         or trained embeddings. Useful for changing network parameters
876 |         if you don't want to reconstruct the graph.
877 | 
878 |         Args
879 |         ----
880 |         filepath : str
881 |             path to where you previously saved the GraphRecommender
882 |         """
883 | 
884 |         self.G,_ = dgl.data.utils.load_graphs(f'{filepath}/dgl.bin')
885 |         self.G = self.G[0]
886 |         self.G.readonly()
887 |         self.G = dgl.as_heterograph(self.G)
888 | 
889 |         self.node_ids = pd.read_csv(f'{filepath}/node_ids.csv')
890 | 
891 |         self._masks_set = False
892 |         self._embeddings = None
893 |         self._index = None
894 | 
895 | 
896 |     @classmethod
897 |     def load(cls, filepath, device=None, faiss_gpu=None):
898 |         """Restore a previous instance of this class from disk.
899 | 
900 |         Args
901 |         ----
902 |         filepath : str
903 |             path on disk to load from
904 |         device : str
905 |             optionally override the pytorch device
906 |         faiss_gpu : bool
907 |             optionally override whether faiss uses gpu"""
908 | 
909 |         with open(f'{filepath}/initargs.pkl','rb') as pklf:
910 |             (embedding_dim,
911 |             feature_dim,
912 |             hidden_dim,
913 |             hidden_layers,
914 |             dropout,
915 |             agg_type,
916 |             distance,
917 |             torch_device,
918 |             faiss_gpu_loaded,
919 |             inference_batch_size,
920 |             p_train,
921 |             train_faiss_index) = pickle.load(pklf)
922 | 
923 |         if device is not None:
924 |             torch_device=device
925 | 
926 |         if faiss_gpu is not None:
927 |             faiss_gpu_loaded = faiss_gpu
928 | 
929 |         restored_self = cls(embedding_dim,
930 |             feature_dim,
931 |             hidden_dim,
932 |             hidden_layers,
933 |             dropout,
934 |             agg_type,
935 |             distance,
936 |             torch_device,
937 |             faiss_gpu_loaded,
938 |             inference_batch_size,
939 |             p_train,
940 |             train_faiss_index)
941 | 
942 |         restored_self.G,_ = dgl.data.utils.load_graphs(f'{filepath}/dgl.bin')
943 |         restored_self.G = restored_self.G[0]
944 |         restored_self.G.readonly()
945 |         restored_self.G = dgl.as_heterograph(restored_self.G)
946 | 
947 |         restored_self.node_ids = pd.read_csv(f'{filepath}/node_ids.csv')
948 | 
949 |         restored_self.embed = th.load(f'{filepath}/embed.torch',map_location=th.device(torch_device))
950 |         restored_self.net.load_state_dict(th.load(f'{filepath}/model_weights.torch',map_location=th.device(torch_device)))
951 |         embeddings = np.load(f'{filepath}/final_embed.npy',allow_pickle=False)
952 |         restored_self._embeddings = embeddings
953 | 
954 |         return restored_self
955 | 
956 | 
957 | 
958 | 
959 | 
--------------------------------------------------------------------------------
/fastrec/RecAPI.py:
--------------------------------------------------------------------------------
1 | import os
2 | 
3 | from typing import List
4 | from pydantic import BaseModel
5 | import numpy as np
6 | from fastapi import FastAPI
7 | 
8 | from GraphSimRec import GraphRecommender
9 | 
10 | class NodeIdList(BaseModel):
11 |     ids : List[str]
12 | 
13 | class 
UnseenNodes(BaseModel): 14 | nodelist : List[str] 15 | neighbors : List[List[str]] 16 | 17 | app = FastAPI() 18 | 19 | 20 | @app.on_event("startup") 21 | def startup_event(): 22 | 23 | global sage 24 | deploy_path = os.environ['FASTREC_DEPLOY_PATH'] 25 | sage = GraphRecommender.load(deploy_path,device='cpu',faiss_gpu=False) 26 | #force the index to be trained 27 | sage.train_faiss = True 28 | sage.index.nprobe = 100 29 | assert sage.index.is_trained 30 | 31 | @app.get("/") 32 | def read_root(): 33 | return {"GraphData": sage.G.__str__()} 34 | 35 | 36 | @app.get("/knn/{nodeid}") 37 | def knn(nodeid: str, k: int = 5, labels : bool=False): 38 | """Returns the nearest k nodes in the graph using faiss 39 | 40 | Args 41 | ---- 42 | nodeid : str 43 | identifier of the query node 44 | 45 | k : int 46 | number of neighbors to return 47 | 48 | labels : bool 49 | if true, return the class label for each node in the list of neighbors 50 | 51 | Returns: 52 | K nearest neighbors, distances, and labels of neighbors""" 53 | return sage.query_neighbors([nodeid], k, return_labels=labels) 54 | 55 | @app.post("/knn/") 56 | def knn_post(nodelist : NodeIdList, k: int = 5, labels : bool=False): 57 | """Returns the nearest k nodes in the graph using faiss 58 | Args 59 | ---- 60 | nodelist : NodeIdList 61 | identifier of the query nodes 62 | 63 | k : int 64 | number of neighbors to return 65 | 66 | labels : bool 67 | if true, return the class label for each node in the list of neighbors 68 | 69 | Returns: 70 | K nearest neighbors, distances, and labels of neighbors""" 71 | return sage.query_neighbors(nodelist.ids, k, return_labels=labels) 72 | 73 | 74 | @app.post('/knn_unseen/') 75 | def knn_unseen(unseen_nodes : UnseenNodes, k: int = 5, labels : bool=False): 76 | """Returns the k nearest neighbors in the graph for 77 | query nodes that do not currently exist in the graph. 78 | The unseen nodes must exclusively have neighbors that do 79 | already exist in the graph. We can then estimate their 80 | embedding by average the embedding of their neighbors. 
81 | 82 | Args 83 | ---- 84 | unseen_nodes : UnseenNodes 85 | Contains the ids of the unseen nodes and their neighbors 86 | 87 | k : int 88 | number of nearest neighbors to query 89 | 90 | labels : bool 91 | if true, return the class label for each node in the list of neighbors 92 | 93 | Returns 94 | ------- 95 | k nearest neighbors, distances, and labels of neighbors""" 96 | 97 | nodelist, neighbors = unseen_nodes.nodelist, unseen_nodes.neighbors 98 | embeddings = [np.mean(sage.get_embeddings(nlist),axis=0) for nlist in neighbors] 99 | embeddings = np.array(embeddings) 100 | D,I = sage._search_index(embeddings,k) 101 | I,L = sage._faiss_ids_to_nodeids(I,labels) 102 | if labels: 103 | output = {node:{'neighbors':i,'neighbor labels':l,'distances':d.tolist()} for node, d, i, l in zip(nodelist,D,I,L)} 104 | else: 105 | output = {node:{'neighbors':i,'distances':d.tolist()} for node, d, i in zip(nodelist,D,I)} 106 | 107 | return output 108 | 109 | -------------------------------------------------------------------------------- /fastrec/__init__.py: -------------------------------------------------------------------------------- 1 | import os, sys 2 | sys.path.append(os.path.dirname(os.path.realpath(__file__))) 3 | 4 | from .GraphSimRec import GraphRecommender 5 | from .RecAPI import app -------------------------------------------------------------------------------- /fastrec/torchmodels.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tqdm 3 | 4 | import dgl 5 | import dgl.function as fn 6 | from dgl import DGLGraph 7 | from dgl.data import citation_graph as citegrh 8 | import dgl.nn.pytorch as dglnn 9 | 10 | import torch as th 11 | import torch.nn as nn 12 | import torch.nn.functional as F 13 | 14 | class NegativeSampler(object): 15 | def __init__(self, g): 16 | self.weights = g.in_degrees().float() ** 0.75 17 | 18 | def __call__(self, num_samples): 19 | return self.weights.multinomial(num_samples, replacement=True) 20 | 21 | class UnsupervisedNeighborSampler(object): 22 | def __init__(self, g, fanouts, num_negs): 23 | self.g = g 24 | self.fanouts = fanouts 25 | self.neg_sampler = NegativeSampler(g) 26 | self.num_negs = num_negs 27 | 28 | def sample_blocks(self, seed_edges): 29 | n_edges = len(seed_edges) 30 | seed_edges = th.LongTensor(np.asarray(seed_edges)) 31 | heads, tails = self.g.find_edges(seed_edges) 32 | neg_tails = self.neg_sampler(self.num_negs * n_edges) 33 | neg_heads = heads.view(-1, 1).expand(n_edges, self.num_negs).flatten() 34 | 35 | # Maintain the correspondence between heads, tails and negative tails as two 36 | # graphs. 37 | # pos_graph contains the correspondence between each head and its positive tail. 38 | # neg_graph contains the correspondence between each head and its negative tails. 39 | # Both pos_graph and neg_graph are first constructed with the same node space as 40 | # the original graph. Then they are compacted together with dgl.compact_graphs. 41 | pos_graph = dgl.graph((heads, tails), num_nodes=self.g.number_of_nodes()) 42 | neg_graph = dgl.graph((neg_heads, neg_tails), num_nodes=self.g.number_of_nodes()) 43 | pos_graph, neg_graph = dgl.compact_graphs([pos_graph, neg_graph]) 44 | 45 | # Obtain the node IDs being used in either pos_graph or neg_graph. Since they 46 | # are compacted together, pos_graph and neg_graph share the same compacted node 47 | # space. 
48 |         seeds = pos_graph.ndata[dgl.NID]
49 |         blocks = []
50 |         for fanout in self.fanouts:
51 |             # For each seed node, sample ``fanout`` neighbors.
52 |             frontier = dgl.sampling.sample_neighbors(self.g, seeds, fanout, replace=True)
53 |             # Remove all edges between heads and tails, as well as heads and neg_tails.
54 |             _, _, edge_ids = frontier.edge_ids(
55 |                 th.cat([heads, tails, neg_heads, neg_tails]),
56 |                 th.cat([tails, heads, neg_tails, neg_heads]),
57 |                 return_uv=True)
58 |             frontier = dgl.remove_edges(frontier, edge_ids)
59 |             # Then we compact the frontier into a bipartite graph for message passing.
60 |             block = dgl.to_block(frontier, seeds)
61 |             # Obtain the seed nodes for next layer.
62 |             seeds = block.srcdata[dgl.NID]
63 | 
64 |             blocks.insert(0, block)
65 |         return pos_graph, neg_graph, blocks
66 | 
67 | 
68 | class NeighborSampler(object):
69 |     def __init__(self, g, fanouts):
70 |         self.g = g
71 |         self.fanouts = fanouts
72 | 
73 |     def sample_blocks(self, seeds):
74 |         seeds = th.LongTensor(np.asarray(seeds))
75 |         blocks = []
76 |         for fanout in self.fanouts:
77 |             # For each seed node, sample ``fanout`` neighbors.
78 |             frontier = dgl.sampling.sample_neighbors(self.g, seeds, fanout, replace=True)
79 |             # Then we compact the frontier into a bipartite graph for message passing.
80 |             block = dgl.to_block(frontier, seeds)
81 |             # Obtain the seed nodes for next layer.
82 |             seeds = block.srcdata[dgl.NID]
83 | 
84 |             blocks.insert(0, block)
85 |         return blocks
86 | 
87 | class SAGE(nn.Module):
88 |     def __init__(self,
89 |                  in_feats,
90 |                  n_hidden,
91 |                  n_classes,
92 |                  n_layers,
93 |                  activation,
94 |                  dropout,
95 |                  agg_type):
96 |         super().__init__()
97 |         self.n_layers = n_layers
98 |         self.n_hidden = n_hidden
99 |         self.n_classes = n_classes
100 |         self.layers = nn.ModuleList()
101 |         self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, agg_type))
102 |         for i in range(1, n_layers - 1):
103 |             self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, agg_type))
104 |         self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, agg_type))
105 |         self.dropout = nn.Dropout(dropout)
106 |         self.activation = activation
107 | 
108 |     def forward(self, blocks, x):
109 |         h = x
110 |         for l, (layer, block) in enumerate(zip(self.layers, blocks)):
111 |             # We need to first copy the representation of nodes on the RHS from the
112 |             # appropriate nodes on the LHS.
113 |             # Note that the shape of h is (num_nodes_LHS, D) and the shape of h_dst
114 |             # would be (num_nodes_RHS, D)
115 |             h_dst = h[:block.number_of_dst_nodes()]
116 |             # Then we compute the updated representation on the RHS.
117 |             # The shape of h now becomes (num_nodes_RHS, D)
118 |             h = layer(block, (h, h_dst))
119 |             if l != len(self.layers) - 1:
120 |                 h = self.activation(h)
121 |                 h = self.dropout(h)
122 |         return h
123 | 
124 |     def inference(self, g, x, batch_size, device):
125 |         """
126 |         Inference with the GraphSAGE model on full neighbors (i.e. without neighbor sampling).
127 |         g : the entire graph.
128 |         x : the input features of the entire node set.
129 |         The inference code is written so that it can handle any number of nodes and
130 |         layers.
131 |         """
132 |         # During inference with sampling, multi-layer blocks are very inefficient because
133 |         # lots of computations in the first few layers are repeated.
134 |         # Therefore, we compute the representation of all nodes layer by layer. The nodes
135 |         # on each layer are of course split into batches.
136 |         # TODO: can we standardize this?
137 | nodes = th.arange(g.number_of_nodes()) 138 | for l, layer in enumerate(self.layers): 139 | y = th.zeros(g.number_of_nodes(), self.n_hidden if l != len(self.layers) - 1 else self.n_classes) 140 | 141 | for start in tqdm.trange(0, len(nodes), batch_size): 142 | end = start + batch_size 143 | batch_nodes = nodes[start:end] 144 | block = dgl.to_block(dgl.in_subgraph(g, batch_nodes), batch_nodes) 145 | input_nodes = block.srcdata[dgl.NID] 146 | 147 | h = x[input_nodes].to(device) 148 | h_dst = h[:block.number_of_dst_nodes()] 149 | h = layer(block, (h, h_dst)) 150 | if l != len(self.layers) - 1: 151 | h = self.activation(h) 152 | h = self.dropout(h) 153 | 154 | y[start:end] = h.cpu() 155 | 156 | x = y 157 | return y 158 | 159 | class CrossEntropyLoss(nn.Module): 160 | def forward(self, block_outputs, pos_graph, neg_graph): 161 | with pos_graph.local_scope(): 162 | pos_graph.ndata['h'] = block_outputs 163 | pos_graph.apply_edges(fn.u_dot_v('h', 'h', 'score')) 164 | pos_score = pos_graph.edata['score'] 165 | with neg_graph.local_scope(): 166 | neg_graph.ndata['h'] = block_outputs 167 | neg_graph.apply_edges(fn.u_dot_v('h', 'h', 'score')) 168 | neg_score = neg_graph.edata['score'] 169 | 170 | score = th.cat([pos_score, neg_score]) 171 | label = th.cat([th.ones_like(pos_score), th.zeros_like(neg_score)]).long() 172 | loss = F.binary_cross_entropy_with_logits(score, label.float()) 173 | return loss -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | def readme(): 4 | with open('README.md') as f: 5 | return f.read() 6 | 7 | setup(name='fastrec', 8 | version='0.0.4', 9 | description='Rapidly deployed gnn based recommender', 10 | long_description=readme(), 11 | url='https://github.com/devinjdangelo/FastRec', 12 | author='Devin DAngelo', 13 | packages=['fastrec'], 14 | scripts=['fastrec/fastrec-deploy'], 15 | install_requires=['fastapi','uvicorn','tqdm','pandas'], 16 | dependency_links=['https://download.pytorch.org/whl/torch_stable.html'], 17 | keywords='recommender graph neural network gnn deployment deploy', 18 | include_package_data=True, 19 | long_description_content_type="text/markdown", 20 | test_suite='nose.collector', 21 | tests_require=['nose'], 22 | zip_safe=False) --------------------------------------------------------------------------------
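Once the RecAPI app above is served (for example with uvicorn, with FASTREC_DEPLOY_PATH pointing at a saved model directory), its endpoints can be exercised over plain HTTP. The snippet below is an illustrative sketch only: the host and port are assumed (uvicorn's defaults), and the node ids must exist in whatever graph was deployed.

```python
import requests

base = 'http://localhost:8000'  # assumed host/port for the deployed app

# Single-node query: GET /knn/{nodeid} with optional k and labels query params.
print(requests.get(f'{base}/knn/0', params={'k': 5, 'labels': True}).json())

# Batched query: POST /knn/ with a NodeIdList body.
print(requests.post(f'{base}/knn/', params={'k': 5}, json={'ids': ['0', '33']}).json())

# Unseen nodes: POST /knn_unseen/ with the new ids and their existing neighbors;
# each embedding is estimated as the mean of the neighbors' embeddings.
body = {'nodelist': ['new_node'], 'neighbors': [['0', '1', '2']]}
print(requests.post(f'{base}/knn_unseen/', params={'k': 5}, json=body).json())
```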