# Awesome Large Foundation Model Theory

Welcome to the Awesome Large Foundation Model Theory repository! This repository is dedicated to exploring and discussing the fascinating field of Large Foundation Model Theory.

## About

Large Foundation Model Theory examines why and how large foundation models work: what they can learn, how training shapes their behavior, and what guarantees can be established about their capabilities. This repository aims to gather resources, research papers, and presentations, and to foster discussions related to Large Foundation Model Theory and its applications. In this group, we study Large Foundation Model Theory, which includes but is not limited to:

1. [In Context Learning](#in-context-learning)
2. [Diffusion Model](#diffusion-model)
3. [Chain-of-Thought](#chain-of-thought)
4. [Hallucination](#hallucination)
5. [Reasoning](#reasoning)
6. [State Space Models](#state-space-models)

## In Context Learning

### 2022

> What Can Transformers Learn In-Context? A Case Study of Simple Function Classes, *NeurIPS 2022*, [link](https://arxiv.org/abs/2208.01066)

> Data Distributional Properties Drive Emergent In-Context Learning in Transformers, *NeurIPS 2022*, [link](https://arxiv.org/abs/2205.05055)

> In-context Learning and Induction Heads, *Transformer Circuits Thread, 2022*, [link](https://arxiv.org/abs/2209.11895)

> An Explanation of In-context Learning as Implicit Bayesian Inference, *ICLR 2022*, [link](https://arxiv.org/abs/2111.02080)

### 2023

> What learning algorithm is in-context learning? Investigations with linear models, *ICLR 2023*, [link](https://arxiv.org/pdf/2211.15661.pdf)

> Uncovering mesa-optimization algorithms in Transformers, [link](https://arxiv.org/abs/2309.05858)

> Transformers as statisticians: Provable in-context learning with in-context algorithm selection, *NeurIPS 2023*, [link](https://arxiv.org/abs/2306.04637)

> Transformers as Algorithms: Generalization and Stability in In-context Learning, *ICML 2023*, [link](https://proceedings.mlr.press/v202/li23l/li23l.pdf)

> Transformers learn in-context by gradient descent, *ICML 2023*, [link](https://arxiv.org/abs/2212.07677)

> Max-Margin Token Selection in Attention Mechanism, *NeurIPS 2023*, [link](https://arxiv.org/abs/2306.13596)

> Transformers learn to implement preconditioned gradient descent for in-context learning, *NeurIPS 2023*, [link](https://arxiv.org/abs/2306.00297)

> Trained Transformers Learn Linear Models In-Context, *JMLR*, [link](https://arxiv.org/pdf/2306.09927.pdf)

> In-context convergence of transformers, *ICML 2024*, [link](https://arxiv.org/abs/2310.05249)

> Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning, *NeurIPS 2023*, [link](https://arxiv.org/abs/2301.11916)

> In-Context Learning through the Bayesian Prism, *ICLR 2024*, [link](https://arxiv.org/abs/2306.04891)

> What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization, [link](https://arxiv.org/abs/2305.19420)

> Birth of a Transformer: A Memory Viewpoint, *NeurIPS 2023*, [link](https://arxiv.org/abs/2306.00802)

> A Theory of Emergent In-Context Learning as Implicit Structure Induction, [link](https://arxiv.org/pdf/2303.07971)

### 2024

> How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?, *ICLR 2024*, [link](https://arxiv.org/abs/2310.08391)

> How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?, *ICML 2024*, [link](https://openreview.net/forum?id=I4HTPws9P6)

> How Transformers Learn Causal Structure with Gradient Descent, *ICML 2024*, [link](https://arxiv.org/abs/2402.14735)

> How does Multi-Task Training Affect Transformer In-Context Capabilities? Investigations with Function Classes, [link](https://arxiv.org/pdf/2404.03558)

> Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?, *ICML 2024*, [link](https://openreview.net/pdf?id=o8AaRKbP9K)

> Transformers learn nonlinear features in context: nonconvex mean-field dynamics on the attention landscape, *ICML 2024*, [link](https://arxiv.org/abs/2402.01258)

> How Well Can Transformers Emulate In-context Newton's Method?, [link](https://arxiv.org/pdf/2403.03183)

> In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization, [link](https://arxiv.org/pdf/2402.14951)

> The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains, [link](https://arxiv.org/pdf/2402.11004)

> Fine-grained Analysis of In-context Linear Estimation, [link](https://openreview.net/pdf?id=1vM1a7KrC6)

> On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability, *NeurIPS 2024*, [link](https://arxiv.org/pdf/2405.16845)

> Transformers are Minimax Optimal Nonparametric In-Context Learners, *NeurIPS 2024*, [link](https://arxiv.org/pdf/2408.12186)

> Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context, [link](https://arxiv.org/abs/2410.01774)

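
To make the common setup in these papers concrete, here is a minimal NumPy sketch (our own illustration, not code from any paper above) of in-context learning over a simple function class: each prompt carries `k` labeled examples of a freshly drawn linear task plus a query input, and a least-squares fit on the demonstrations is the natural baseline that trained transformers are compared against in this literature. The function name and dimensions are hypothetical.

```python
# Minimal sketch of the "learning a function class in context" setup:
# a prompt is (x_1, f(x_1), ..., x_k, f(x_k), x_query) with f drawn fresh
# from a simple class (here, noiseless linear functions).
import numpy as np

def sample_linear_regression_prompt(d=8, k=16, rng=None):
    """Sample one in-context linear-regression task: k labeled examples plus a query."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=d)                  # task vector, resampled per prompt
    xs = rng.normal(size=(k + 1, d))        # k demonstrations + 1 query input
    ys = xs @ w                             # noiseless linear labels
    context = [(xs[i], ys[i]) for i in range(k)]
    return context, xs[k], ys[k]            # (demonstrations, x_query, target)

# Least squares on the demonstrations: the baseline predictor the theory compares against.
context, x_query, target = sample_linear_regression_prompt(rng=np.random.default_rng(0))
X = np.stack([x for x, _ in context])
y = np.array([y for _, y in context])
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(abs(x_query @ w_hat - target))        # ~0 once k >= d in the noiseless case
```
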
## Diffusion Model

> Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning, *ICML 2023*, [link](https://arxiv.org/pdf/2304.12824.pdf)

> Diffusion Models are Minimax Optimal Distribution Estimators, *ICML 2023*, [link](https://arxiv.org/pdf/2303.01861)

> Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data, *ICML 2023*, [link](https://arxiv.org/pdf/2302.07194)

> The probability flow ODE is provably fast, *NeurIPS 2023*, [link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/d84a27ff694345aacc21c72097a69ea2-Abstract-Conference.html)

> Learning Mixtures of Gaussians Using the DDPM Objective, *NeurIPS 2023*, [link](https://arxiv.org/pdf/2307.01178.pdf)

> On the Generalization Properties of Diffusion Models, *NeurIPS 2023*, [link](https://arxiv.org/pdf/2311.01797)

> Deep networks as denoising algorithms: Sample-efficient learning of diffusion models in high-dimensional graphical models, *M3L 2023*, [link](https://arxiv.org/pdf/2309.11420)

> Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo, *COLT 2024*, [link](https://arxiv.org/abs/2401.06325)

> Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization, *ICLR 2024*, [link](https://arxiv.org/abs/2401.15604)

> Critical windows: non-asymptotic theory for feature emergence in diffusion models, *ICML 2024*, [link](https://arxiv.org/pdf/2403.01633)

> Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions, *ICML 2024*, [link](https://arxiv.org/abs/2402.15602)

> Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference, *ICML 2024*, [link](https://arxiv.org/abs/2405.16387)

> Learning General Gaussian Mixtures with Efficient Score Matching, Apr 2024, [link](https://arxiv.org/abs/2404.18893)

> Learning Mixtures of Gaussians Using Diffusion Models, Apr 2024, [link](https://arxiv.org/pdf/2404.18869)

> An overview of diffusion models: Applications, guided generation, statistical rates and optimization, Apr 2024, [link](https://arxiv.org/abs/2404.07771)

> Slight Corruption in Pre-training Data Makes Better Diffusion Models, May 2024, [link](https://arxiv.org/abs/2405.20494)

> Accelerating Convergence of Score-Based Diffusion Models, Provably, May 2024, [link](https://arxiv.org/abs/2403.03852)

> Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective, May 2024, [link](https://arxiv.org/pdf/2405.16418)

> U-Nets as Belief Propagation: Efficient Classification, Denoising, and Diffusion in Generative Hierarchical Models, May 2024, [link](https://arxiv.org/pdf/2404.18444)

> Extracting Training Data from Unconditional Diffusion Models, Jun 2024, [link](https://arxiv.org/abs/2406.12752)

> On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs), Jul 2024, [link](https://arxiv.org/pdf/2407.01079)

> Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering, Sep 2024, [link](https://arxiv.org/abs/2409.02426)

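
For orientation, the sketch below (a NumPy illustration under our own toy assumptions, not code from any listed paper) writes out the DDPM forward marginal and the noise-prediction objective that much of the score-estimation and sampling theory above analyzes. The noise schedule, the placeholder `eps_model`, and the toy mixture data are all illustrative choices.

```python
# Minimal sketch of the DDPM-style training objective: regress the Gaussian
# noise added at a random noise level, which (up to scaling) estimates the score.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)                 # cumulative signal level at step t

def ddpm_loss(eps_model, x0_batch):
    """Monte Carlo estimate of E_{t, eps} || eps_model(x_t, t) - eps ||^2."""
    n, d = x0_batch.shape
    t = rng.integers(0, T, size=n)
    eps = rng.normal(size=(n, d))
    x_t = (np.sqrt(alpha_bars[t])[:, None] * x0_batch
           + np.sqrt(1.0 - alpha_bars[t])[:, None] * eps)   # forward marginal q(x_t | x_0)
    return np.mean(np.sum((eps_model(x_t, t) - eps) ** 2, axis=1))

# Toy data from a 2-D Gaussian mixture and a trivial zero predictor, just to show
# the interface; the papers above ask how well a *trained* eps_model approximates
# the true score and what that implies for sampling accuracy.
x0 = np.concatenate([rng.normal(-2.0, 0.3, (500, 2)), rng.normal(2.0, 0.3, (500, 2))])
print(ddpm_loss(lambda x_t, t: np.zeros_like(x_t), x0))
```
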
## Chain-of-Thought

> Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning, *NeurIPS 2023*, [link](https://arxiv.org/abs/2305.18869)

> Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, *NeurIPS 2023*, [link](https://arxiv.org/abs/2305.15408)

> Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, *ICLR 2024*, [link](https://arxiv.org/abs/2402.12875)

## Hallucination

> Calibrated Language Models Must Hallucinate, [link](https://arxiv.org/abs/2311.14648)

## Reasoning

> How Much Can RAG Help the Reasoning of LLM?, [link](https://arxiv.org/abs/2410.02338)

## State Space Models

> Repeat After Me: Transformers are Better than State Space Models at Copying, *ICML 2024*, [link](https://openreview.net/pdf?id=duRRoGeoQT)

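
As a reference point for the copying result above, here is a minimal sketch (our own toy illustration, not the paper's model) of a discretized linear state space layer. The hidden state has a fixed size regardless of sequence length, so information about a long input must be compressed into it; this fixed-size bottleneck is what the copying argument exploits when contrasting SSMs with attention.

```python
# Minimal linear SSM scan: h_k = A h_{k-1} + B u_k, y_k = C h_k.
import numpy as np

def ssm_scan(A, B, C, u):
    """Run the linear recurrence over an input sequence u of shape (L, d_in)."""
    h = np.zeros(A.shape[0])
    ys = []
    for u_k in u:                           # O(L) sequential scan, O(1) memory in L
        h = A @ h + B @ u_k
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d_in, d_out, L = 16, 4, 4, 32            # toy sizes; note the state size n is fixed
A = 0.9 * np.eye(n)                         # illustrative stable state matrix
B, C = rng.normal(size=(n, d_in)), rng.normal(size=(d_out, n))
print(ssm_scan(A, B, C, rng.normal(size=(L, d_in))).shape)   # (32, 4)
```
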
## How to Contribute

We welcome contributions from anyone interested in Large Foundation Model Theory. Here are some ways you can contribute:

- **Add resources:** Share relevant research papers, articles, books, or other resources by opening a pull request.
- **Start a discussion:** Create a new discussion thread in the GitHub Discussions tab to initiate conversations and share insights.
- **Suggest improvements:** If you have any suggestions or ideas to improve the repository, open an issue and let us know.
- **Spread the word:** Help us reach more people by sharing this repository with others who might be interested.

Please refer to the [CONTRIBUTING.md](CONTRIBUTING.md) file for more detailed instructions on how to contribute.

## Code of Conduct

To ensure that this repository remains a welcoming and inclusive space for everyone, we have adopted a [Code of Conduct](CODE_OF_CONDUCT.md). We kindly ask all contributors to adhere to these guidelines when participating in this community.

## License

This repository is licensed under the [MIT License](LICENSE). Please note that any contributions made to this repository will be subject to the same license.