# Towards-graph-foundation-models

[BLOG](https://seemly-sugar-c3d.notion.site/Towards-foundation-model-for-graph-A-Paper-List-1f4e92ce52cb45c7a25b448bb9d2aeff)

This repo collects models designed as (or toward) foundation models for graphs.

## Path 1: Graph transformers

For a more detailed list of GT-related papers, you may check [awesome graph transformers](https://github.com/wehos/awesome-graph-transformer).

### Rethinking Graph Transformers with Spectral Attention (NIPS 2021) [Paper](https://arxiv.org/pdf/2106.03893.pdf) [Code](https://github.com/DevinKreuzer/SAN)
* SAN, which uses Laplacian eigenfunctions as positional encodings

### Do Transformers Really Perform Bad for Graph Representation? (NIPS 2021) [Paper](https://arxiv.org/abs/2106.05234) [Code](https://github.com/microsoft/Graphormer)
* Graphormer
* A purely positional-encoding-based design that uses \[VNODE\] as the readout
* One of the first works to include transfer-learning experiments

### Global Self-Attention as a Replacement for Graph Convolution (EGT) (KDD 2022) [Paper](https://arxiv.org/abs/2108.03348) [Code](https://github.com/shamim-hussain/egt)
* Gives a solution to multi-task learning without a clear performance loss -> explicitly learn edge embeddings and add different heads for node/edge/graph-level tasks

### GraphGPS: General Powerful Scalable Graph Transformers (NIPS 2022) [Paper](https://arxiv.org/abs/2205.12454) [Code](https://github.com/rampasek/GraphGPS)
* A blueprint work that revisits and summarizes recent works and proposes the following framework (I think this is not a pure transformer, since local message passing is involved): (1) positional/structural encoding; (2) local message-passing mechanism; (3) global attention mechanism
* What's the difference between PE and SE? PE captures global information, such as a node's position in the **whole graph**; SE captures local information, such as a node's role in its **local substructure**
* Positional/structural encodings: LapPE, RWSE, SignNet, EquivStableLapPE
* Local message-passing mechanisms: GatedGCN, GINE, PNA
* Global attention mechanisms: Transformer, Performer, BigBird

### Pure Transformers are Powerful Graph Learners (\*) (NIPS 2022) [Paper](https://arxiv.org/abs/2207.02505) [Code](https://github.com/jw9730/tokengt)
* Pure transformers can work well on graph classification if we augment the original features with node identifiers and edge identifiers, achieving expressiveness beyond standard GNNs
* With proper token-wise embeddings, transformers can approximate any permutation-equivariant linear function on the graph, i.e., f(π(x)) = π(f(x))
* How to design such embeddings: node token \[Xv, Pv, Pv, En\]; edge token \[Xu,v, Pu, Pv, Ee\], where Pv is the node identifier and En/Ee is the type identifier (node or edge)

### NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs (ICLR 2023) [Paper](https://openreview.net/forum?id=8KYeilT3Ow) [Code](https://github.com/JHL-HUST/NAGphormer)
* This paper targets **node classification**
* Hop2Token: for each node v, generate a sequence of shape (k, d), where k is the sequence length (number of hops) and d is the hidden dimension => (X, AX, A²X, ...) serves as the "language" (a minimal sketch follows this entry)
* An attention-based readout function (attention over the k hops)
* No experiments in the semi-supervised setting
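To make Hop2Token concrete, here is a minimal sketch (my own simplification, not the authors' implementation): stack the propagated features X, AX, A²X, ... into a per-node token sequence. The row normalization and the toy path graph below are assumptions for illustration.

```python
import torch

def hop2token(adj: torch.Tensor, x: torch.Tensor, num_hops: int) -> torch.Tensor:
    """Stack [X, AX, A^2 X, ..., A^K X] into a per-node token sequence.

    adj: (n, n) adjacency matrix (assumed already normalized here),
    x:   (n, d) node features.
    Returns an (n, K + 1, d) tensor: one length-(K + 1) "sentence" per node.
    """
    tokens, h = [x], x
    for _ in range(num_hops):
        h = adj @ h                                 # propagate one more hop
        tokens.append(h)
    return torch.stack(tokens, dim=1)

# toy usage: a 4-node path graph, 8-dim random features, 3 hops
adj = torch.tensor([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float)
adj = adj / adj.sum(dim=1, keepdim=True)            # simple row normalization
x = torch.randn(4, 8)
print(hop2token(adj, x, num_hops=3).shape)           # torch.Size([4, 4, 8])
```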
### Graph Inductive Biases in Transformers without Message Passing (ICML 2023) [Paper](https://arxiv.org/abs/2305.17589) [Code](https://github.com/liamma/grit)
* RRWP (Relative Random Walk Probabilities) + MLP can be proven to be at least as expressive as shortest-path distances, polynomials of the random-walk matrix, and the adjacency matrix with self-loops
* Similar to EGT, it explicitly learns edge embeddings

### A GENERALIZATION OF VIT/MLP-MIXER TO GRAPHS (ICML 2023) [Paper](https://arxiv.org/abs/2212.13350) [Code](https://github.com/XiaoxinHe/Graph-ViT-MLPMixer)
* The patch is the basic unit of ViT; what is the patch for a graph? -> a subgraph generated by METIS
* Each subgraph is encoded with a GNN and an MLP; there is no long-range problem since subgraphs are small
* Two interesting experiments:
    * Empirical over-squashing: TreeNeighborMatch
    * Expressiveness: CSL, EXP, SR25

### Exphormer: Sparse Transformers for Graphs (ICML 2023) [Paper](https://arxiv.org/abs/2303.06147) [Code](https://github.com/hamed1375/exphormer)
* How can graph transformers scale to large graphs like MPNNs do and achieve superior efficiency? GraphGPS addressed this problem with sparse attention solutions such as BigBird or Performer. This paper revisits the problem and proposes a graph-specific solution.
* Two mechanisms are proposed:
    * Virtual global nodes
    * Expander graphs: graphs with special theoretical properties that approximately preserve the properties of the complete graph at only O(n) cost

### ADVECTIVE DIFFUSION TRANSFORMERS FOR TOPOLOGICAL GENERALIZATION IN GRAPH LEARNING (Arxiv 2023) [Paper](https://arxiv.org/pdf/2310.06417.pdf)
* This paper studies structural generalization and proposes a SIGN-like model
* Strictly speaking, I think the model proposed in this paper cannot be viewed as a transformer. It mainly incorporates a learnable global self-attention and combines it with the original adjacency matrix.
* I like the view in this paper, which regards the underlying evolution of a graph as heat diffusion and models local message passing with advective diffusion.

### GRAPHGPT: GRAPH LEARNING WITH GENERATIVE PRE-TRAINED TRANSFORMERS (Arxiv 2023) (\*) [Paper](https://openreview.net/pdf?id=070DFUdNh7)
* Very interesting paper, a must-read!
* This paper demonstrates how to convert a graph into a sequence (an Eulerian path) and shows that pure transformers can achieve very good results on node/edge/graph-level tasks.
* One missing part is node features, which are not well considered in the current framework.

### GRAPH CONVOLUTIONS ENRICH THE SELF-ATTENTION IN TRANSFORMERS! (Arxiv 2023) [Paper](https://openreview.net/pdf?id=poFAoivHQk)
* This paper shows that a very simple trick achieves promising performance in dealing with the over-smoothing of transformers
* Given the self-attention matrix A, view it as the adjacency matrix of a graph, inject higher-order information A^k into the self-attention, and approximate it with a Taylor expansion (a rough sketch follows this entry)
* The detailed explanation needs further investigation.
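A rough sketch of the general idea as I read it (not the paper's exact formulation; the fixed mixing coefficients below are placeholders rather than learned or Taylor-derived values): treat the softmax attention matrix as a graph adjacency and mix in its higher powers before aggregating the values.

```python
import torch
import torch.nn.functional as F

def higher_order_attention(q, k, v, coeffs=(1.0, 0.5, 0.25)):
    """Single-head self-attention where the (n, n) attention matrix A is
    enriched with its powers: output = (c0*A + c1*A^2 + c2*A^3) V.

    q, k, v: (n, d) query/key/value matrices.
    coeffs:  toy mixing weights; the paper derives/learns these, omitted here.
    """
    d = q.size(-1)
    attn = F.softmax(q @ k.t() / d ** 0.5, dim=-1)   # base attention "adjacency"
    mixed = torch.zeros_like(attn)
    power = torch.eye(attn.size(0))
    for c in coeffs:
        power = power @ attn                          # A, A^2, A^3, ...
        mixed = mixed + c * power
    return mixed @ v

# toy usage
n, d = 6, 16
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
print(higher_order_attention(q, k, v).shape)          # torch.Size([6, 16])
```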
## Path 2: Pre-train graph neural networks

### STRATEGIES FOR PRE-TRAINING GRAPH NEURAL NETWORKS (ICLR 2020) [Paper](https://arxiv.org/pdf/1905.12265.pdf) [Code](http://snap.stanford.edu/gnn-pretrain)
* Pioneering work on pre-training GNNs for graph-level tasks, which proposes two pretext tasks:
    * Context prediction: a subgraph should share a similar representation with its context graph
    * Attribute masking (a minimal sketch follows the GROVER entry below)
* Some observations:
    * 1. The expressiveness of GNNs is related to the effectiveness of pre-training
    * 2. In the experiments, self-supervised pre-training appears very effective, even more effective than supervised pre-training. This may mainly be due to the scaffold split being adopted; the NIPS 2022 benchmark paper reports a similar observation under the scaffold split.

### GPT-GNN: Generative Pre-Training of Graph Neural Networks (KDD 2020) [Paper](https://arxiv.org/pdf/2006.15437.pdf) [Code](https://github.com/acbull/GPT-GNN)
* Proposes masked node-attribute generation and masked edge generation
* To improve performance, the generation is conducted in one round, which means the attributes of some nodes are masked when conducting edge generation

### Self-Supervised Graph Transformer on Large-Scale Molecular Data (NIPS 2020) [Paper](https://arxiv.org/pdf/2007.02835.pdf) [Code](https://github.com/tencent-ailab/grover)
* An MPNN-augmented transformer with a randomized number of message-passing layers
* Contextual property prediction for node-level SSL
* Graph-level motif prediction for graph-level SSL
* Pre-training a transformer is very expensive: the GROVER model needs hundreds of GPUs for several days of pre-training.
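As a small illustration of the attribute-masking pretext task used by the pre-training works above, here is a minimal sketch (my simplification: categorical node attributes, any GNN encoder, no chemistry-specific handling; `predictor` is a hypothetical linear head).

```python
import torch
import torch.nn as nn

def attribute_masking_loss(node_attr, node_emb, predictor, mask_rate=0.15):
    """Mask a fraction of (categorical) node attributes and predict them back.

    node_attr: (n,) integer attribute ids (e.g., atom types).
    node_emb:  (n, d) embeddings from any GNN encoder run on the masked graph.
    predictor: linear head mapping d -> number of attribute classes.
    """
    n = node_attr.size(0)
    mask = torch.rand(n) < mask_rate        # which nodes to mask
    if mask.sum() == 0:                     # guard for tiny graphs
        mask[0] = True
    logits = predictor(node_emb[mask])
    return nn.functional.cross_entropy(logits, node_attr[mask])

# toy usage with random "embeddings" standing in for a GNN's output
n, d, num_classes = 20, 32, 10
node_attr = torch.randint(0, num_classes, (n,))
node_emb = torch.randn(n, d)                # would come from the GNN encoder
head = nn.Linear(d, num_classes)
print(attribute_masking_loss(node_attr, node_emb, head))
```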
### Graph Meta Learning via Local Subgraphs (NIPS 2020) [Paper](https://proceedings.neurips.cc/paper_files/paper/2020/file/412604be30f701b1b1e3124c252065e6-Paper.pdf) [Code](https://github.com/mims-harvard/G-Meta)
* An extension of MAML to the graph domain, with prototypes (cluster centers) and ego subgraphs as the bridge
* Three scenarios are considered:
    * Single graph, disjoint labels: learn from one set of samples and test the model on another set of samples with disjoint labels
    * Multiple graphs, shared labels: learn from a set of graphs and test on other graphs with a shared label space
    * Multiple graphs, disjoint labels: a hybrid of the two cases above

### GPPT: Graph Pre-training and Prompt Tuning to Generalize Graph Neural Networks (KDD 2022) [Paper](https://dl.acm.org/doi/pdf/10.1145/3534678.3539249) [Code](https://github.com/MingChen-Sun/GPPT)
* The setting is quite similar to self-supervised learning, and I think the concept here is closer to aligning the downstream tasks with the pre-training tasks in graph SSL
* Instead of directly making predictions with the pre-trained embeddings, prompts \[Ttask(y), Tstr(v)\] are introduced
* The label is mapped to a continuous prompt, and node classification is reformulated as link prediction between the label prompt and the node prompt
* There is a section in the experiments called "prompt without fine-tuning". I think this prompt strategy is not very useful, in that the label information is still used; also, it is designed to speed up training, which is usually not considered the efficiency bottleneck of GNNs

### GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks (WWW 2023) [Paper](https://arxiv.org/pdf/2302.08043.pdf) [Code](https://github.com/Starlien95/GraphPrompt)
* Three phases: pre-training, prompt tuning, and prediction
* The key is to unify the form of tasks across these three stages: basically, everything is viewed as a link prediction task => pre-training is just the EdgePred task; for classification, the representation of **each class** is the cluster center of the samples from that class, and we select the class with the shortest distance
* One special point of this method is that it can work for different tasks: a task-specific prompt, viewed as a weight matrix, is inserted and tuned via prompt tuning to adapt to each task, and a prompt-conditioned readout function is designed for each task

### Does GNN Pretraining Help Molecular Representation? (NIPS 2022) [Paper](https://arxiv.org/pdf/2207.06010.pdf)
* A benchmark paper that systematically studies the pre-training of GNNs in the molecular domain, with the following conclusions:
    * 1. With rich features, self-supervised pre-training provides only very limited or even negative gains; supervised pre-training is more effective, and combining self-supervised & supervised pre-training can help
    * 2. Pre-training is helpful for data splits with distribution shift
    * 3. The gains from pre-training are more obvious with weaker features

### MOLE-BERT: RETHINKING PRE-TRAINING GRAPH NEURAL NETWORKS FOR MOLECULES (ICLR 2023) [Paper](https://openreview.net/pdf?id=jevY-DtiZTR) [Code](https://github.com/junxia97/Mole-BERT)
* The authors argue that the main bottleneck of AttrMask in the GNN pre-training paper (ICLR 2020) is that the node-level task is too easy. Inspired by VQ-VAE, Mole-BERT explicitly learns a discrete codebook of atom tokens, which proves effective (a minimal sketch of the quantization step follows this entry).
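A minimal sketch of VQ-style discrete tokenization in the spirit of Mole-BERT (not their exact tokenizer; the codebook size is arbitrary and the VQ-VAE codebook/commitment losses are omitted): snap each node embedding to its nearest codebook vector, with a straight-through estimator so gradients still flow.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                      # z: (n, dim) node embeddings
        # squared distances from each embedding to every code vector
        d = (z.pow(2).sum(1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        codes = d.argmin(dim=1)                # discrete "atom tokens"
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()           # straight-through gradient
        return z_q, codes

# toy usage
vq = VectorQuantizer(codebook_size=128, dim=64)
z = torch.randn(10, 64)                        # would come from a GNN encoder
z_q, codes = vq(z)
print(z_q.shape, codes[:5])
```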
### All in One: Multi-Task Prompting for Graph Neural Networks (KDD 2023) [Paper](https://arxiv.org/pdf/2307.01504.pdf)
* Compared with other graph prompt papers, one special point is that the prompt is injected into the subgraph (the transfer unit), considering both structure and features
* It then combines pretext tasks (as in self-supervised learning) with meta-learning and shows good performance

### When to Pre-Train Graph Neural Networks? From Data Generation Perspective! (KDD 2023) [Paper](https://arxiv.org/pdf/2303.16458.pdf) [Code](https://github.com/caoyxuan/W2PGNN)
* Uses graphons to model the distribution of the pre-training data, which can be used to
    * determine the portion of knowledge that is transferable from the pre-training corpus to the downstream tasks
    * prune the pre-training data
* One main problem: features are ignored in the theoretical model

### BETTER WITH LESS: DATA-ACTIVE PRE-TRAINING OF GRAPH NEURAL NETWORKS (NIPS 2023) [Paper](https://openreview.net/pdf?id=663Cl-KetJ)
* This paper argues that by actively selecting the pre-training data, models fine-tuned on downstream tasks achieve better performance
* There are mainly two selection strategies:
    * Predictive uncertainty
    * Graph properties (e.g., the entropy of random walks)

### TOWARDS FOUNDATIONAL MODELS FOR MOLECULAR LEARNING ON LARGE-SCALE MULTI-TASK DATASETS (Arxiv 2023) [Paper](https://arxiv.org/pdf/2310.04292.pdf) [Code](https://github.com/datamol-io/graphium)
* The largest benchmark dataset for molecular pre-training
* Interestingly, in many cases multi-dataset training presents no gain, which somewhat casts doubt on using GNNs as the building blocks of graph foundation models

## Path 3: Language model pre-training (co-training & co-finetuning)

### DeepWalk, Node2Vec, PTE
* Traditional random-walk-based pre-training
* Equivalent to the factorization of a matrix such as the normalized graph Laplacian
* Does not consider node features

### LinkBERT: Pretraining Language Models with Document Links (ACL 2022) [Paper](https://arxiv.org/pdf/2203.15827.pdf) [Code](https://github.com/michiyasunaga/LinkBERT)
* On the basis of BERT, considers the relationships among different documents using the following two tasks:
    * Masked language modeling
    * Document relation prediction: segment documents into chunks; given two chunks, classify their relationship as contiguous, random, or linked

### NODE FEATURE EXTRACTION BY SELF-SUPERVISED MULTI-SCALE NEIGHBORHOOD PREDICTION (ICLR 2022) [Paper](https://arxiv.org/pdf/2111.00064.pdf) [Code](https://github.com/amzn/pecos/tree/mainline/examples/giant-xrt)
* Views the structure-aware fine-tuning of LMs as an extreme multi-label classification (XMC) problem, which means that for the graph A-B, A-C, C-D, the label for node A is \[1, 1, 1, 0\] (a small sketch of the label construction follows this entry)
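To make the XMC view concrete, a tiny sketch of the label construction only (the node itself is counted as part of its neighbourhood here, which is the convention that reproduces the \[1, 1, 1, 0\] example above; the actual XR-Transformer training is out of scope).

```python
import numpy as np

def neighborhood_labels(edges, num_nodes):
    """Multi-hot XMC label per node: its neighbours plus the node itself."""
    y = np.eye(num_nodes, dtype=int)
    for u, v in edges:
        y[u, v] = 1
        y[v, u] = 1
    return y

# toy graph from the bullet above: A-B, A-C, C-D  (A=0, B=1, C=2, D=3)
edges = [(0, 1), (0, 2), (2, 3)]
print(neighborhood_labels(edges, num_nodes=4)[0])   # node A -> [1 1 1 0]
```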
### Learning on Large-scale Text-attributed Graphs via Variational Inference (ICLR 2023) [Paper](https://arxiv.org/abs/2210.14709) [Code](https://github.com/AndyJZhao/GLEM)
* Co-trains an LM and a GNN iteratively within an EM framework

### Graph-Aware Language Model Pre-Training on a Large Graph Corpus Can Help Multiple Graph Applications (KDD 2023) [Paper](https://arxiv.org/pdf/2306.02592.pdf)
* GaLM mainly involves the following steps:
    * Pre-training stage:
        * 1. Graph-aware LM pre-training: tune the parameters of a BERT model with an unsupervised pretext task (link prediction)
        * 2. Observation: co-training is not very useful
    * Fine-tuning stage:
        * 1. Graph-aware LM fine-tuning on various applications
        * 2. Stitching large graph corpora at the data level

### WalkLM: A Uniform Language Model Fine-tuning Framework for Attributed Graph Embedding (NIPS 2023)
* Waiting for the camera-ready version

## Path 4: Graph as Natural Language + LLMs

Here, by natural language we consider both
* directly using human language, which usually only applies to the most powerful LLMs such as ChatGPT and GPT-4
* using tokens from the LLM's dictionary, or prompt tuning to align the feature spaces of LLMs and graphs

### Can Language Models Solve Graph Problems in Natural Language? (NIPS 2023) [Paper](https://arxiv.org/abs/2305.10037) [Code](https://github.com/arthur-heng/nlgraph)
* Uses natural language to directly describe the graph structure and proposes the NLGraph benchmark for algorithmic reasoning on graphs
* Language: human language

### Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs (Arxiv 2023) [Paper](https://arxiv.org/abs/2307.03393) [Code](https://github.com/CurryTang/Graph-LLM)
* For the LLMs-as-Predictors pipeline, proposes a language based on node attributes (a toy illustration follows this entry)
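A toy illustration of the "describe the graph in text" idea behind NLGraph and the LLMs-as-Predictors pipeline (the prompt template and the citation-graph example are made up here; the papers use their own, more careful formats): flatten a node's attributes and 1-hop neighbourhood into a prompt for an LLM.

```python
def node_to_prompt(node_id, text_attrs, edges, question):
    """Describe a node and its 1-hop neighbourhood in plain English.

    text_attrs: dict node_id -> attribute text (e.g., a paper title).
    edges:      list of (u, v) pairs.
    """
    neighbors = sorted({v for u, v in edges if u == node_id} |
                       {u for u, v in edges if v == node_id})
    lines = [f"Target node: {text_attrs[node_id]}",
             "Its neighbours are:"]
    lines += [f"- {text_attrs[n]}" for n in neighbors]
    lines.append(question)
    return "\n".join(lines)

# toy citation graph
attrs = {0: "Paper on graph transformers",
         1: "Paper on molecular pre-training",
         2: "Paper on prompt tuning for GNNs"}
edges = [(0, 1), (0, 2)]
print(node_to_prompt(0, attrs, edges,
                     "Question: which research area does the target node belong to?"))
```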
### Natural Language is All a Graph Needs (Arxiv 2023) [Paper](https://arxiv.org/abs/2308.07134)
* Tunes T5 for node classification, using link prediction as an auxiliary task
* Language: human language with node features inserted (those features are added to the vocabulary of the LLM)

### Graph Neural Prompting with Large Language Models (Arxiv 2023) [Paper](https://arxiv.org/abs/2309.15427)
* Graph language here: embeddings of the encoded graph + text embeddings (from the LLM dictionary) of the entities => used as the prompt
* Auxiliary link prediction task
* Language: continuous prompts generated by GNNs

### GRAPHTEXT: GRAPH REASONING IN TEXT SPACE (Arxiv 2023) [Paper](https://openreview.net/forum?id=dbcWzalk6G)
* Generates a graph syntax tree (GST)
* Traversing the GST yields a sequence containing the required information, which can then be used as the prompt
* Proposes a novel graph learning paradigm, interactive graph reasoning => use human reminders to guide the correct reasoning steps
* Language: prompts induced by the graph syntax tree

### GRAPHLLM: BOOSTING GRAPH REASONING ABILITY OF LARGE LANGUAGE MODEL (Arxiv 2023) [Paper](https://arxiv.org/pdf/2310.05845.pdf) [Code](https://github.com/mistyreed63849/Graph-LLM)
* Adds structure-aware prefix prompts to the inputs
* How to generate those structure-aware prefix prompts: first use a transformer encoder to generate a context vector, then use a decoder to generate node features, which are then processed by a graph transformer to produce the final embedding (this design reflects the nature of the datasets in this paper, which do not contain much textual information)
* Language: prefix prompts

### TALK LIKE A GRAPH: ENCODING GRAPHS FOR LARGE LANGUAGE MODELS (Arxiv 2023) [Paper](https://arxiv.org/pdf/2310.04560.pdf)
* Very similar to the NLGraph benchmark; also uses natural language to describe graph structures
* Language: human language

### IN-CONTEXT LEARNING FOR FEW-SHOT MOLECULAR PROPERTY PREDICTION (Arxiv 2023) [Paper](https://arxiv.org/pdf/2310.08863v1.pdf)
* Demonstrates a meta-learning-inspired way to do few-shot prediction with large transformers without fine-tuning
* Encodes the query and the support points together with their labels, then concatenates them into embeddings; the embeddings are used as a prompt and fed into the transformer (a minimal sketch of the prompt construction follows this entry)
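My simplified sketch of the prompt-construction step (the label-embedding scheme, the learned "unknown" token, and the dimensions are assumptions, not the paper's exact architecture): each support molecule contributes its embedding plus a label embedding, the query contributes its embedding plus a learned placeholder, and the resulting sequence is what the transformer attends over.

```python
import torch
import torch.nn as nn

class ICLPromptBuilder(nn.Module):
    """Build a [support_1, ..., support_k, query] token sequence for a transformer.

    Each support token = molecule embedding + embedding of its (binary) label;
    the query token uses a learned "unknown label" embedding instead.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.label_emb = nn.Embedding(2, dim)      # binary labels 0 / 1
        self.unknown = nn.Parameter(torch.zeros(dim))

    def forward(self, support_emb, support_y, query_emb):
        # support_emb: (k, d), support_y: (k,), query_emb: (d,)
        support_tokens = support_emb + self.label_emb(support_y)
        query_token = (query_emb + self.unknown).unsqueeze(0)
        return torch.cat([support_tokens, query_token], dim=0)   # (k + 1, d)

# toy usage: 5 support molecules, 1 query, 64-dim embeddings
builder = ICLPromptBuilder(dim=64)
prompt = builder(torch.randn(5, 64), torch.randint(0, 2, (5,)), torch.randn(64))
print(prompt.shape)        # torch.Size([6, 64]) -> fed into a transformer
```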
### GraphGPT: Graph Instruction Tuning for Large Language Models (Arxiv 2023) [Paper](https://www.semanticscholar.org/paper/GraphGPT%3A-Graph-Instruction-Tuning-for-Large-Models-Tang-Yang/45872b94798c3125abfb185b7926689c5e767763?utm_source=direct_link) [Code](https://github.com/HKUDS/GraphGPT)

## Path 5: Unifying data in a large graph

### PRODIGY: Enabling In-context Learning Over Graphs [Paper](https://arxiv.org/abs/2305.12600) [Code](https://github.com/snap-stanford/prodigy)
* A GNN-based multi-task, multi-dataset setting
* Shows that knowledge can be transferred between MAG and WIKI
* Manually constructs a large heterogeneous graph consisting of different types of nodes and uses these nodes as guidance for transfer

### ONE FOR ALL: TOWARDS TRAINING ONE GRAPH MODEL FOR ALL CLASSIFICATION TASKS [Paper](https://arxiv.org/abs/2310.00149) [Code](https://github.com/LechengKong/OneForAll)
* How do we align the feature space? Use Sentence-BERT to encode everything; textual attributes are required (this method cannot directly handle continuous features, so a prompt is used to describe such attributes)
* How to handle multiple datasets?
    * NOI prompt nodes -> the necessary subgraphs for node/link/graph-level tasks
    * Class nodes -> provide class information so that zero-shot learning becomes possible

## Path 6: Diffusion Models & S4

### Data-Centric Learning from Unlabeled Graphs with Diffusion Model (NIPS 2023) [Paper](https://arxiv.org/pdf/2303.10108.pdf)
* Diffusion model

### S4G: BREAKING THE BOTTLENECK ON GRAPHS WITH STRUCTURED STATE SPACES (Arxiv 2023) [Paper](https://openreview.net/pdf?id=0Z6lN4GYrO)
* Adopts S4 to handle long-range information and the over-squashing problem

## Others

### Foundation models for KG

### TOWARDS FOUNDATION MODELS FOR KNOWLEDGE GRAPH REASONING [Paper](https://arxiv.org/pdf/2310.04562.pdf)