├── 2305.13048v1.pdf ├── README.md ├── preText_edited.txt └── preText.txt /2305.13048v1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fabiomatricardi/Highlight-RAG/main/2305.13048v1.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Highlight-RAG 2 | Repo of the code from the Medium article 3 | 4 | 5 | ### Resources 6 | - https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker 7 | - https://rich.readthedocs.io/en/stable/layout.html 8 | - https://medium.com/@waihwa.cheong/retrieving-answers-from-your-document-using-llama-2-on-colab-gpu-freemium-series-7284faac8118 9 | - https://arxiv.org/pdf/2305.13048v1.pdf 10 | 11 | -------------------------------------------------------------------------------- /preText_edited.txt: -------------------------------------------------------------------------------- 1 | RWKV: Reinventing RNNs for the Transformer Era 2 | 3 | Abstract 4 | Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and 5 | computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. 6 | We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. 7 | Our experiments reveal that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. 8 | This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks. 9 | 10 | Introduction 11 | Deep learning techniques have made significant strides in artificial intelligence, playing a pivotal role in various scientific and industrial applications. These applications often involve complex sequential data processing tasks that include natural language understanding, conversational AI, time-series analysis, and even indirect modalities that can be reframed as sequences, such as images and graphs (Brown et al., 2020; Ismail Fawaz et al., 2019; Wu et al., 2020; Albalak et al., 2022). 12 | Predominant among these techniques are RNNs, convolutional neural networks (CNNs), and the Transformer models (Vaswani et al., 2017). 13 | Each of these has distinct drawbacks that restrict their efficiency in certain scenarios. RNNs suffer from the vanishing gradient problem, making them difficult to train for long sequences. Additionally, they cannot be parallelized in the time dimension during training, which restricts their scalability (Hochreiter, 1998; Le and Zuidema, 2016).
CNNs, on the other hand, are only adept at capturing local patterns, which limits their capacity to deal with long-range dependencies, crucial to many sequence processing tasks (Bai et al., 2018). 14 | Transformer models emerged as a powerful alternative due to their ability to handle both local and long-range dependencies and their capability for parallelized training (Tay et al., 2022). Recent models such as GPT-3 (Brown et al., 2020), ChatGPT (OpenAI, 2022; Kocoń et al., 2023), GPT-4 (OpenAI, 2023), LLaMA (Touvron et al., 2023), and Chinchilla (Hoffmann et al., 2022) exemplify the capability of this architecture, pushing the frontiers of what’s possible in NLP. Despite these significant advancements, the self-attention mechanism 15 | inherent to Transformers poses unique challenges, primarily due to its quadratic complexity. This complexity renders the architecture computationally expensive and memory-intensive for tasks involving long input sequences or in resource-constrained situations. These limitations have spurred a wealth of research aiming to improve the scaling properties of Transformers, often at the expense of some of 16 | the properties that make it so effective (Wang et al., 2020; Zaheer et al., 2020; Dao et al., 2022a). 17 | To tackle these challenges, we introduce the Receptance Weighted Key Value (RWKV) model, a novel architecture that effectively combines the strengths of RNNs and Transformers while circumventing key drawbacks. RWKV is carefully designed to alleviate the memory bottleneck and quadratic scaling associated with Transformers (Katharopoulos et al., 2020) with a more efficient linear scaling, while still preserving the rich, expressive properties that make the Transformer a dominant architecture in the field. 18 | One of the defining characteristics of RWKV is its ability to offer parallelized training and robust scalability, similar to Transformers. Moreover, we have reformulated the attention mechanism in RWKV to introduce a variant of linear attention, eschewing the traditional dot-product token interaction in favor of more effective channel-directed attention. This approach contrasts significantly with the traditional Transformer architecture, where specific token interactions predominantly drive attention. 19 | The implementation of linear attention in RWKV is carried out without approximation, which offers a considerable improvement in efficiency and enhances the scalability, see Table 1. 20 | The overarching motivation behind developing RWKV is to bridge the gap between computational efficiency and expressive capacity in neural network architectures. It offers a promising and viable solution for handling tasks involving large-scale models with billions of parameters, exhibiting competitive performance at a fraction of the computational cost. Our experimental results suggest that RWKV could be a valuable tool for addressing the ongoing challenges in scaling and deploying AI models across various domains, particularly those involving sequential data processing. Thus, RWKV paves the way for the next generation of more sustainable and computationally efficient AI models for sequence processing tasks. 21 | Our contributions in this paper are as follows: 22 | • We introduce the RWKV network architecture, which combines the advantages of RNNs and Transformers while mitigating their known limitations.
23 | • We propose a new attention mechanism reformulation that results in linear attention, eschewing the quadratic complexity associated with standard Transformer models. 24 | • We conduct a comprehensive series of experiments on benchmark datasets to showcase the performance, efficiency and scaling of RWKV in managing tasks involving large-scale models and long-range dependencies. 25 | • We release pretrained models ranging in size from 169 million to 14 billion parameters trained on the Pile (Gao et al., 2020). 26 | 27 | Related Work 28 | Recently, a number of techniques have been proposed to address the limitations of transformers. 29 | Optimizing Attention Mechanism 30 | Many transformer variants (“x-formers”) have been introduced to reduce the complexity of transformers (Tay et al., 2022), including sparse attention (Beltagy et al., 2020; Kitaev et al., 2020; Guo et al., 2022), approximating the full attention matrix (Wang et al., 2020; Ma et al., 2021; Choromanski et al., 2020), combining chunked attention with gating (Ma et al., 2023) and other efficient methods (Katharopoulos 31 | et al., 2020; Jaegle et al., 2021). Some recent works like FlashAttention (Dao et al., 2022a) and others (Rabe and Staats, 2022; Jang et al., 2019) share similarities with RWKV’s chunked computation scheme. Despite being memory-efficient, their time complexity remains 32 | quadratic or contains chunk size as a hidden factor. In contrast, RWKV achieves better space and time complexity during inference by formulating a linear attention as an RNN. 33 | Attention Free Models 34 | Another line of research replaces the attention mechanism with other modules to scale to long sequences. MLP-Mixer and others (Tolstikhin et al., 2021; Liu et al., 2021) proposed the replacement of attention by Multi-Layer Perceptrons (MLPs) in computer vision tasks. 35 | The Attention Free Transformer (AFT) (Zhai et al., 2021) replaces dot-product self-attention with a computationally efficient alternative which can be seen as a multi-head attention where each feature dimension corresponds to a head. Inspired by AFT, RWKV takes a similar approach but modifies the interaction weights for simplicity such that it can be transformed into an RNN. In parallel, RNN-style (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) recursive components have also been modified to increase context length, such as the Recurrent Memory Transformer (Bulatov et al., 2022, 2023) and Linear Recurrent Units (Orvieto et al., 2023). State space models (SSM) like S4 (Gu et al., 2022) and its variants (Dao et al., 2022b; Poli et al., 2023) are also proposed. 36 | Notably, Quasi-Recurrent neural network (QRNN) (Bradbury et al., 2017) uses both convolutional layers and recurrent pooling functions 37 | across timesteps and channels. While QRNN utilizes convolutional filters with fixed sizes, RWKV employs a time-mixing module as an 38 | attention mechanism with time-decaying factors. Different from the element-wise pooling in QRNN, RWKV includes a parametrized channel-mixing module (see the green blocks in Fig.1c) that is parallelizable.
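The "linear attention formulated as an RNN" idea above can be made concrete with a short sketch. The snippet below is a generic kernelized linear attention in the spirit of Katharopoulos et al. (2020), not RWKV's own WKV operator (which is defined in preText.txt further down); the feature map, sizes, and names are illustrative assumptions. It shows the property the text relies on: the same causal linear attention can be evaluated in parallel over the whole sequence for training, or token by token with a constant-size state for inference.

```python
# A minimal sketch (an assumption-laden illustration, not RWKV's own WKV
# formulation) of how a causal *linear* attention can be computed either in
# parallel over the whole sequence (training) or step by step as an RNN with a
# constant-size state (inference). Follows the generic recipe of
# Katharopoulos et al. (2020); feature map and sizes are illustrative.
import torch

torch.manual_seed(0)
T, d = 6, 8                      # sequence length, feature dimension
phi = torch.exp                  # a simple positive feature map (illustrative choice)

q, k, v = (torch.randn(T, d) for _ in range(3))

# parallel ("transformer-like") form: O(T^2) pairwise scores with a causal mask
scores = phi(q) @ phi(k).T                       # (T, T)
mask = torch.tril(torch.ones(T, T))
out_parallel = ((scores * mask) @ v) / (scores * mask).sum(dim=1, keepdim=True)

# recurrent ("RNN-like") form: constant-size state, one token at a time
S = torch.zeros(d, d)            # running sum of phi(k_i) v_i^T
z = torch.zeros(d)               # running sum of phi(k_i)
out_recurrent = []
for t in range(T):
    S = S + torch.outer(phi(k[t]), v[t])
    z = z + phi(k[t])
    out_recurrent.append((phi(q[t]) @ S) / (phi(q[t]) @ z))
out_recurrent = torch.stack(out_recurrent)

assert torch.allclose(out_parallel, out_recurrent, atol=1e-4)
```

The recurrent form only ever stores a d x d matrix and a d-vector, which is why memory stays constant in the sequence length; RWKV's WKV operator exploits the same property with a different, decay-based weighting.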
39 | 40 | -------------------------------------------------------------------------------- /preText.txt: -------------------------------------------------------------------------------- 1 | RWKV: Reinventing RNNs for the Transformer Era 2 | Bo Peng1∗ Eric Alcaide2,3,4∗ Quentin Anthony2,5∗ 3 | Alon Albalak2,6 Samuel Arcadinho2,7 Huanqi Cao8 Xin Cheng9 Michael Chung10 4 | Matteo Grella11 Kranthi Kiran GV12 Xuzheng He2 Haowen Hou13 Przemysław Kazienko14 5 | Jan Koco´n14 Jiaming Kong15 Bartłomiej Koptyra14 Hayden Lau2 Krishna Sri Ipsit Mantri16 6 | Ferdinand Mom17,18 Atsushi Saito2,19 Xiangru Tang20 Bolun Wang27 Johan S. Wind21 Stanisław Wo´zniak14 7 | Ruichong Zhang8 Zhenyuan Zhang2 Qihang Zhao22,23 Peng Zhou27 Jian Zhu24 Rui-Jie Zhu25,26 8 | 1RWKV Foundation 2EleutherAI 3University of Barcelona 4Charm Therapeutics 5Ohio State University 9 | 6University of California, Santa Barbara 7Zendesk 8Tsinghua University 9Peking University 10 | 10Storyteller .io 11Crisis24 12New York University 13National University of Singapore 11 | 14Wroclaw University of Science and Technology 15Databaker Technology Co. Ltd 16Purdue University 12 | 17Criteo AI Lab 18Epita 19Nextremer Co. Ltd. 20Yale University 21University of Oslo 13 | 22University of Science and Technology of China 23Kuaishou Technology Co. Ltd 14 | 24University of British Columbia 25University of California, Santa Cruz 15 | 26University of Electronic Science and Technology of China 27RuoxinTech 16 | Abstract 17 | Transformers have revolutionized almost all 18 | natural language processing (NLP) tasks but 19 | suffer from memory and computational com- 20 | plexity that scales quadratically with sequence 21 | length. In contrast, recurrent neural networks 22 | (RNNs) exhibit linear scaling in memory and 23 | computational requirements but struggle to 24 | match the same performance as Transform- 25 | ers due to limitations in parallelization and 26 | scalability. 27 | We propose a novel model ar- 28 | chitecture, Receptance Weighted Key Value 29 | (RWKV), that combines the efficient paral- 30 | lelizable training of Transformers with the effi- 31 | cient inference of R NNs. Our approach lever- 32 | ages a linear attention mechanism and allows 33 | us to formulate the model as either a Trans- 34 | former or an RNN, which parallelizes compu- 35 | tations during training and maintains constant 36 | computational and memory complexity during 37 | inference, leading to the first non-transformer 38 | architecture to be scaled to tens of billions 39 | of parameters. 40 | Our experiments reveal that 41 | RWKV performs on par with similarly sized 42 | Transformers, suggesting that future work can 43 | leverage this architecture to create more effi- 44 | cient models. 45 | This work presents a signifi- 46 | cant step towards reconciling the trade-offs be- 47 | tween computational efficiency and model per- 48 | formance in sequence processing tasks.1 49 | 1 50 | Introduction 51 | Deep learning techniques have made significant 52 | strides in artificial intelligence, playing a pivotal 53 | ∗ Equal first authorship. Others listed alphabetically. 54 | 1Code at: https://github.com/BlinkDL/RWKV-LM 55 | role in various scientific and industrial applica- 56 | tions. 
These applications often involve complex 57 | sequential data processing tasks that include nat- 58 | ural language understanding, conversational AI, 59 | time-series analysis, and even indirect modalities 60 | that can be reframed as sequences, such as im- 61 | ages and graphs (Brown et al., 2020; Ismail Fawaz 62 | et al., 2019; Wu et al., 2020; Albalak et al., 2022). 63 | Predominant among these techniques are RNNs, 64 | convolutional neural networks (CNNs), and the 65 | Transformer models (Vaswani et al., 2017). 66 | Each of these has distinct drawbacks that restrict 67 | their efficiency in certain scenarios. RNNs suf- 68 | fer from the vanishing gradient problem, making 69 | them difficult to train for long sequences. Addition- 70 | ally, they cannot be parallelized in the time dimen- 71 | sion during training, which restricts their scalability 72 | (Hochreiter, 1998; Le and Zuidema, 2016). CNNs, 73 | on the other hand, are only adept at capturing local 74 | patterns, which limits their capacity to deal with 75 | long-range dependencies, crucial to many sequence 76 | processing tasks (Bai et al., 2018). 77 | Transformer models emerged as a powerful alter- 78 | native due to their ability to handle both local and 79 | long-range dependencies and their capability for 80 | parallelized training (Tay et al., 2022). Recent mod- 81 | els such as GPT-3 (Brown et al., 2020), ChatGPT 82 | (OpenAI, 2022; Koco´n et al., 2023), GPT-4 (Ope- 83 | nAI, 2023), LLaMA (Touvron et al., 2023), and 84 | Ch inchilla (Hoffmann et al., 2022) exemplify the 85 | capability of this architecture, pushing the frontiers 86 | of what’s possible in NLP. Despite these signifi- 87 | cant advancements, the self-attention mechanism 88 | inherent to Transformers poses unique challenges, 89 | arXiv:2305.13048v1 [cs.CL] 22 May 2023 90 | Model 91 | Time 92 | Space 93 | Transformer 94 | O(T 2d) 95 | O(T 2 + Td) 96 | Reformer 97 | O(T log Td) 98 | O(T log T + Td) 99 | Linear Transformers 100 | O(Td2) 101 | O(Td + d2) 102 | Performer 103 | O(Td2 log d) O(Td log d + d2 log d) 104 | AFT-full 105 | O(T 2d) 106 | O(Td) 107 | MEGA 108 | O(cTd) 109 | O(cTd) 110 | RWKV (ours) 111 | O(Td) 112 | O(d) 113 | Table 1: Complexity comparison with different Trans- 114 | formers: Reformer (Kitaev et al., 2020), Linear Trans- 115 | former (Katharopoulos et al., 2020), Performer (Choro- 116 | manski et al., 2020), AFT (Zhai et al., 2021), MEGA 117 | (Ma et al., 2023). Here T denotes the sequence length, 118 | d the feature dimension, and c is MEGA’s chunk size 119 | of quadratic attention. 120 | primarily due to its quadr atic complexity. This com- 121 | plexity renders the architecture computationally ex- 122 | pensive and memory-intensive for tasks involving 123 | long input sequences or in resource-constrained sit- 124 | uations. These limitations have spurred a wealth of 125 | research aiming to improve the scaling properties 126 | of Transformers, often at the expense of some of 127 | the properties that make it so effective (Wang et al., 128 | 2020; Zaheer et al., 2020; Dao et al., 2022a). 129 | To tackle these challenges, we introduce the Re- 130 | ceptance Weighted Key Value (RWKV) model, a 131 | novel architecture that effectively combines the 132 | strengths of RNNs and Transformers while cir- 133 | cumventing key drawbacks. 
RWKV is carefully 134 | designed to alleviate the memory bottleneck and 135 | quadratic scaling associated with Transformers 136 | (Katharopoulos et al., 2020) with a more efficient 137 | linear scaling, while still preserving the rich, ex- 138 | pressive properties that make the Transformer a 139 | dominant architecture in the field. 140 | One of the defining characteristics of RWKV 141 | is its ability to offer parallelized training and ro- 142 | bust scalability, similar to Transformers. More- 143 | over, we have reformulated the attention mecha- 144 | nism in RWKV to introduce a variant of linear 145 | attention, eschewing the traditional dot-product to- 146 | ken interaction in favor of more effective channel- 147 | directed attention. This approach contrasts signifi- 148 | cantly with the traditional Transformer architecture, 149 | where specific token interactions predominantly 150 | drive attention. The implementation of linear atten- 151 | tion in RWKV is carried out without approxima- 152 | tion, which offers a considerable improvement in 153 | ef ficiency and enhances the scalability, see Table 1. 154 | The overarching motivation behind developing 155 | RWKV is to bridge the gap between computational 156 | efficiency and expressive capacity in neural net- 157 | work architectures. It offers a promising and viable 158 | solution for handling tasks involving large-scale 159 | models with billions of parameters, exhibiting com- 160 | petitive performance at a fraction of the computa- 161 | tional cost. Our experimental results suggest that 162 | RWKV could be a valuable tool for addressing the 163 | ongoing challenges in scaling and deploying AI 164 | models across various domains, particularly those 165 | involving sequential data processing. Thus, RWKV 166 | paves the way for the next generation of more sus- 167 | tainable and computationally efficient AI models 168 | for sequence processing tasks. 169 | Our contributions in this paper are as follows: 170 | • We introduce the RWKV network archi- 171 | tecture, which combines the advantages of 172 | RNNs and Transformers while mitigating 173 | their known limitations. 174 | • We propose a new attention mechanism re- 175 | formulation that results in linear attention, es- 176 | chewing the quadratic complexity associated 177 | with standard Transformer models. 178 | • We conduct a comprehensive series of experi- 179 | ments on benchmark datasets to showcase the 180 | performance, efficiency and scaling of RWKV 181 | in managing tasks involving large-scale mod- 182 | els and long-range dependencies. 183 | • We release pretrained model ranging in size 184 | from 169 million to 14 billion parameters 185 | trained on the Pile (Gao et al., 2020).2 186 | 2 187 | Related Work 188 | Recently, a number of techniques have been pro- 189 | posed to address the limitations of transformers. 190 | Optimizing Attention Mechanism 191 | Many trans- 192 | former variants (“ x-formers”) have been introduced 193 | to reduce the complexity of transformers (Tay et al., 194 | 2022), including sparse attention (Beltagy et al., 195 | 2020; Kitaev et al., 2020; Guo et al., 2022), ap- 196 | proximating the full attention matrix (Wang et al., 197 | 2020; Ma et al., 2021; Choromanski et al., 2020), 198 | combining chunked attention with gating (Ma et al., 199 | 2023) and other efficient methods (Katharopoulos 200 | et al., 2020; Jaegle et al., 2021). 
201 | Some recent works like FlashAttention (Dao 202 | et al., 2022a) and others (Rabe and Staats, 2022; 203 | Jang et al., 2019) share similarities with RWKV’s 204 | chunked computation scheme. 205 | Despite being 206 | memory-efficient, their time complexity remains 207 | quadratic or contains chunk size as a hidden fac- 208 | tor. In contrast, RWKV achieves better space and 209 | 2https://huggingface.co/RWKV 210 | time complexity during inference by formulating a 211 | linear attention as an RNN. 212 | Attention Free Models 213 | Another line of research 214 | replaces the attention mechanism with other mod- 215 | ules to scale to long sequences. MLP-Mixer and 216 | others (Tolstikhin et al., 2021; Liu et al., 2021) 217 | proposed the replacement of attention by Multi- 218 | Layer Perceptrons (MLPs) in computer vision tasks. 219 | The Attention Free Transformer (AFT) (Zhai et al., 220 | 2021) replaces dot-product self-attention with a 221 | computationally efficient alternative which can be 222 | seen as a multi-head attention where each feature 223 | dimension corresponds to a head. Inspired by AFT, 224 | RWKV takes a similar approach but modifies the 225 | interaction weights for simplicity such that it can 226 | be transformed into an RNN. In parallel, RNN- 227 | style (Hochreiter and Schmidhuber, 1997; Chung 228 | et al., 2014) recursive components have also been 229 | modified to increase context length, such as the Re- 230 | current Memory Transformer (Bulatov et al ., 2022, 231 | 2023) and Linear Recurrent Units (Orvieto et al., 232 | 2023). State space models (SSM) like S4 (Gu et al., 233 | 2022) and its variants (Dao et al., 2022b; Poli et al., 234 | 2023) are also proposed. 235 | Notably, 236 | Quasi-Recurrent 237 | neural 238 | network 239 | (QRNN) (Bradbury et al., 2017) uses both con- 240 | volutional layers and recurrent pooling functions 241 | across timesteps and channels. 242 | While QRNN 243 | utilizes conv olutional filters with fixed sizes, 244 | RWKV employs a time-mixing module as an 245 | attention mechanism with time-decaying factors. 246 | Different from the element-wise pooling in QRNN, 247 | RWKV includes a parametrized channel-mixing 248 | module (see the green blocks in Fig.1c) that is 249 | parallelizable. 250 | 3 251 | Background 252 | Here we briefly review the fundamentals of RNNs 253 | and Transformers. 254 | 3.1 255 | Recurrent Neural Networks (RNNs) 256 | Popular R NN architectures such as LSTM (Hochre- 257 | iter and Schmidhuber, 1997) and GRU (Chung 258 | et al., 2014) are characterized by the following for- 259 | mulation (shown for LSTM, others can be reasoned 260 | similarly): 261 | ft = σg(Wfxt + Ufht−1 + bf), 262 | (1) 263 | it = σg(Wixt + Uiht−1 + bi), 264 | (2) 265 | ot = σg(Woxt + Uoht−1 + bo), 266 | (3) 267 | ˜ct = σc(Wcxt + Ucht−1 + bc), 268 | (4) 269 | ct = ft ⊙ ct−1 + it ⊙ ˜ct, 270 | (5) 271 | ht = ot ⊙ σh(ct). 272 | (6) 273 | The data flow of RNNs is shown in Fig. 1a. Al- 274 | though RNNs can be factored into two linear blocks 275 | (W and U) and an RNN-specific block (1)–(6), as 276 | noted by Brad bury et al. (2017), the data depen- 277 | dency relying on previous time steps prohibits par- 278 | allelizing these typical RNNs. 279 | 3.2 280 | Transformers and AFT 281 | Introduced by Vaswani et al. (2017), Transformers 282 | are a class of neural networks that have become 283 | the dominant architecture for several NLP tasks. 
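As a reference point for the recurrence in equations (1)-(6) of Section 3.1, here is a minimal sketch of a single LSTM cell unrolled over a toy sequence, written directly from those formulas rather than taken from any library implementation; dimensions, initialization, and names are illustrative assumptions.

```python
# A minimal sketch of the LSTM recurrence in equations (1)-(6) of Section 3.1,
# written from the formulas themselves; sizes and initialization are assumptions.
import torch

d_in, d_h = 4, 3
xs = torch.randn(5, d_in)                   # a toy sequence of 5 input vectors

# one (W, U, b) triple per gate: forget, input, output, candidate cell
W = {g: torch.randn(d_h, d_in) for g in "fioc"}
U = {g: torch.randn(d_h, d_h) for g in "fioc"}
b = {g: torch.zeros(d_h) for g in "fioc"}

h = torch.zeros(d_h)
c = torch.zeros(d_h)
for x_t in xs:
    f = torch.sigmoid(W["f"] @ x_t + U["f"] @ h + b["f"])     # eq (1)
    i = torch.sigmoid(W["i"] @ x_t + U["i"] @ h + b["i"])     # eq (2)
    o = torch.sigmoid(W["o"] @ x_t + U["o"] @ h + b["o"])     # eq (3)
    c_tilde = torch.tanh(W["c"] @ x_t + U["c"] @ h + b["c"])  # eq (4)
    c = f * c + i * c_tilde                                   # eq (5)
    h = o * torch.tanh(c)                                     # eq (6)
```

Each update of (h, c) depends on their values at the previous step, which is exactly the data dependency that prevents parallelizing such RNNs over the time dimension during training.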
284 | Instead of operating on sequences step-by-step like 285 | RNNs, Transformers rely on attention mechanisms 286 | to capture relationships between all input and all 287 | output tokens: 288 | Attn(Q, K, V ) = softmax(QK� �)V, 289 | (7) 290 | where the multi-headness and scaling factor 291 | 1 292 | √dk is 293 | omitted for convenience. The core QK⊤ multipli- 294 | cation is an ensemble of pairwise attention scores 295 | between each token in a sequence, which can be 296 | decomposed as vector operations: 297 | Attn(Q, K, V )t = 298 | PT 299 | i=1 eq⊤ 300 | t kivi 301 | PT 302 | i=1 eq⊤ 303 | t ki . 304 | (8) 305 | In AFT (Zhai et al., 2021), this is alternately 306 | formulated as 307 | Attn+(W, K, V )t = 308 | Pt 309 | i=1 ewt,i+kivi 310 | Pt 311 | i=1 ewt,i+ki , 312 | (9) 313 | where {wt,i} ∈ RT×T is the learned pair-wise po- 314 | sition biases, and each wt,i is a scalar. 315 | Inspired by AFT, we let each wt,i in RWKV be 316 | a channel-wise time decay vector multiplied by the 317 | relative position, traced backwards from current 318 | time as it decays: 319 | wt,i = −(t − i)w, 320 | (10) 321 | where w ∈ (R≥0)d, with d the number of chan- 322 | nels. We require w to be non-negative to ensure 323 | that ewt,i ≤ 1 and the per-channel weights decay 324 | backwards in time. 325 | Linear 326 | RNN Cell/Linear 327 | Linear 328 | RNN Cell/Linear 329 | (a) RNN 330 | Convolution 331 | Elementwise 332 | Convolution 333 | Pooling 334 | Elementwise 335 | Pooling 336 | (b) QuasiRNN (Bradbury et al., 2017) 337 | Time-mixing 338 | Channel-mixing 339 | Time-mixing 340 | Channel-mixing 341 | (c) RWKV 342 | Figure 1: Computation structure of the RWKV in comparison to QRNN and RNN (Vanilla, LSTM, GRU, etc) 343 | architectures. Color codes : orange indicates time-mixing, convolutions or matrix multiplications, and the contin- 344 | uous block indicates that these computations can proceed simultaneously; blue signifies parameterless functions 345 | that operate concurrently along the channel or feature dimension (element-wise). Green indicates channel-mixing. 346 | 4 347 | The Receptance Weighted Key Value 348 | (RWKV) Model 349 | The RWKV architecture derives its name from 350 | the four primary model elements used in the time- 351 | mixing and channel-mixing blocks: 352 | • R: Receptance vector acting as the accep- 353 | tance of past information. 354 | • W: Weight is the positional weight decay 355 | vector. A trainable model parameter. 356 | • K: Key is a vector analogous to K in tradi- 357 | tional attention. 358 | • V : Value is a vector analogous to V in tradi- 359 | tional attention. 360 | Interactions between the main elements for every 361 | timestep are multiplicative, as illustrated in Fig. 2 362 | 4.1 363 | High-Level Summary 364 | The RWKV architecture is comprised of a series 365 | of stacked residual blocks, each formed by a time- 366 | mixing and a channel-mix ing sub-blocks with re- 367 | current structures. 368 | The recurrence is formulated both as a linear in- 369 | terpolation between the current input and the input 370 | at the previous time step (a technique we refer to 371 | as time-shift mixing or token shift, indicated by the 372 | diagonal lines in Fig. 3), which can be adjusted in- 373 | dependently for every linear projection of the input 374 | embedding (e.g., R, K, V in time-mixing, and R, 375 | K in channel-mixing), and as the time-dependent 376 | update of the WKV which is formalized in equa- 377 | tion 14. 
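To make equations (9) and (10) of Section 3.2 concrete before the WKV operator is introduced, the sketch below evaluates the AFT-style attention of equation (9) with the RWKV position-bias parameterization w_{t,i} = -(t - i)w of equation (10), applied channel-wise. It is written directly from the equations (shapes and names are assumptions), not taken from the released RWKV code.

```python
# A hedged sketch of the AFT-style attention in equation (9), with the RWKV
# position biases w_{t,i} = -(t - i) * w of equation (10) applied per channel.
import torch

T, d = 6, 4
k, v = torch.randn(T, d), torch.randn(T, d)
w = torch.rand(d)                       # non-negative channel-wise decay vector

out = torch.zeros(T, d)
for t in range(T):
    i = torch.arange(t + 1)             # positions 0..t (0-based)
    bias = -(t - i).unsqueeze(1) * w    # equation (10): older tokens decay more
    weights = torch.exp(bias + k[: t + 1])            # e^{w_{t,i} + k_i}, per channel
    out[t] = (weights * v[: t + 1]).sum(0) / weights.sum(0)   # equation (9)
```

Because w is non-negative, e^{w_{t,i}} <= 1, so a token's contribution decays per channel as it recedes into the past.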
The WKV computation is similar to AFT 378 | (Zhai et al., 2021), but W is now a channel-wise 379 | vector multiplied by relative position rather than a 380 | pairwise matrix in AFT. We also introduce a vector 381 | U for separately attending to the current token in 382 | order to compensate for potential degeneration of 383 | W (see Appendix G for more details). 384 | Figure 2: RWKV block elements (left) and RWKV 385 | residual block with a final head for language modeling 386 | (right) architectures. 387 | Token 388 | shift 389 | Token 390 | shift 391 | States 392 | States 393 | Layer Norm 394 | My 395 | Layer Norm 396 | name 397 | Layer Norm 398 | Time Mix 399 | LM Head 400 | Layer Norm 401 | is 402 | Layer Norm 403 | Bob 404 | Layer Norm 405 | Channel Mix 406 | Time Mix 407 | LM Head 408 | Layer Norm 409 | name 410 | Layer Norm 411 | is 412 | Layer Norm 413 | Channel Mix 414 | Time Mix 415 | LM Head 416 | Token 417 | shift 418 | Token 419 | shift 420 | Channel Mix 421 | Figure 3: RWKV architecture for language modelling. 422 | The time-mixing block is given by: 423 | rt = Wr · (µrxt + (1 − µr)xt−1), 424 | (11) 425 | kt = Wk · (µkxt + (1 − µk)xt−1), 426 | (12) 427 | vt = Wv · (µvxt + (1 − µv)xt−1), 428 | (13) 429 | wkvt = 430 | Pt−1 431 | i=1 e−(t−1−i)w+kivi + eu+ktvt 432 | Pt−1 433 | i =1 e−(t−1−i)w+ki + eu+kt 434 | , (14) 435 | ot = Wo · (σ(rt) ⊙ wkvt), 436 | (15) 437 | where the WKV computation, wkvt, plays the 438 | role of Attn(Q, K, V ) in Transformers without in- 439 | curring a quadratic cost as interactions are between 440 | scalars. Intuitively, as time t increases, the vector 441 | ot is dependent on a long history, represented by the 442 | summation of an increasing number of terms. For 443 | the target position t, RWKV performs a weighted 444 | summation in the positional interval of [1, t], and 445 | then multiplies with the receptance σ(r). There- 446 | fore, interactions are multiplicative inside a given 447 | timestep and summed over different timesteps. 448 | Further, the channel-mixing block is given by: 449 | rt = Wr · (µrxt + (1 − µr)xt−1), 450 | (16) 451 | kt = Wk · (µkxt + (1 − µk)xt−1), 452 | (17) 453 | ot = σ(rt) ⊙ (Wv · max(kt, 0)2), 454 | (18) 455 | where we adopt squared ReLU activation (So et al., 456 | 2021). Note that in both time-mixing and channel- 457 | mixing, by taking the sigmoid of the receptance, 458 | we’re intuitively using it as a “forget gate” to elimi- 459 | nate unnecessary historical information. 460 | 4.2 461 | Transformer-like Parallelization 462 | RWKV can be efficiently parallelized in what we 463 | call a time-parallel mode, reminiscent of Trans- 464 | formers. 465 | The time complexity of processing a 466 | batch of sequences in a single layer is O(BTd2), 467 | which mainly consists of matrix multiplications 468 | W□, □ ∈ {r, k, v, o} (assuming B sequences, T 469 | maximum tokens and d channels). Meanwhile, up- 470 | dating attention scores wkvt requires a serial scan 471 | (see Appendix B for more detail) and has complex- 472 | ity O(BTd). 473 | The matrix multiplications can be parallelized 474 | akin to W □ , □ ∈ {Q, K, V, O} in typical Trans- 475 | formers. The element-wise WKV computation 476 | is time-dependent, but can be readily parallelized 477 | along the other two dimensions (Lei et al., 2018)3. 478 | 3If the sequence is very long, more sophisticated meth- 479 | ods such as Martin and Cundy (2017) that parallelize over 480 | sequence length could be used. 
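A minimal, hedged sketch of one RWKV layer step in the recurrent (time-sequential) form follows, written directly from equations (11)-(18). Residual connections, LayerNorms, and the numerically stabilized custom kernel of the released implementation are omitted, and all names, shapes, and initializations are illustrative assumptions. The state carried between tokens is the previous input (for token shift) plus the running numerator and denominator of equation (14).

```python
# A minimal, hedged sketch of one RWKV layer step in time-sequential (RNN) mode,
# written from equations (11)-(18). Residuals, LayerNorms and the numerically
# stabilized kernel are omitted; names, shapes and initializations are assumptions.
import torch

d = 8
# time-mixing parameters (eqs 11-15)
mu_r, mu_k, mu_v = (torch.rand(d) for _ in range(3))
W_r, W_k, W_v, W_o = (torch.randn(d, d) / d**0.5 for _ in range(4))
w = torch.rand(d)          # channel-wise decay, kept non-negative
u = torch.randn(d)         # bonus for attending to the current token
# channel-mixing parameters (eqs 16-18 use their own mu and W; prefixed cm_ here)
cm_mu_r, cm_mu_k = torch.rand(d), torch.rand(d)
cm_W_r, cm_W_k, cm_W_v = (torch.randn(d, d) / d**0.5 for _ in range(3))

def time_mixing(x_t, x_prev, num, den):
    r = W_r @ (mu_r * x_t + (1 - mu_r) * x_prev)                   # eq (11)
    k = W_k @ (mu_k * x_t + (1 - mu_k) * x_prev)                   # eq (12)
    v = W_v @ (mu_v * x_t + (1 - mu_v) * x_prev)                   # eq (13)
    wkv = (num + torch.exp(u + k) * v) / (den + torch.exp(u + k))  # eq (14)
    out = W_o @ (torch.sigmoid(r) * wkv)                           # eq (15)
    # roll the state forward: decay past contributions, add the current token
    num = torch.exp(-w) * num + torch.exp(k) * v
    den = torch.exp(-w) * den + torch.exp(k)
    return out, num, den

def channel_mixing(x_t, x_prev):
    r = cm_W_r @ (cm_mu_r * x_t + (1 - cm_mu_r) * x_prev)          # eq (16)
    k = cm_W_k @ (cm_mu_k * x_t + (1 - cm_mu_k) * x_prev)          # eq (17)
    return torch.sigmoid(r) * (cm_W_v @ (torch.relu(k) ** 2))      # eq (18)

# run a toy sequence through the two sub-blocks (residuals/LayerNorms omitted)
xs = torch.randn(5, d)
x_prev_tm = x_prev_cm = torch.zeros(d)
num, den = torch.zeros(d), torch.zeros(d)
for x_t in xs:
    y, num, den = time_mixing(x_t, x_prev_tm, num, den)
    x_prev_tm = x_t
    z = channel_mixing(y, x_prev_cm)
    x_prev_cm = y
```

The per-layer state is just the previous input plus the running numerator and denominator of equation (14), all of size d, which is what later yields the constant-memory time-sequential decoding mode.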
481 | Additionally, token shift is implemented as a sim- 482 | ple offset in the temporal dimension at each block 483 | using PyTorch (Paszke et al., 2019 ) library as 484 | nn.ZeroPad2d((0,0,1,-1)). 485 | 4.3 486 | RNN-like Sequential Decoding 487 | It is common in recurrent networks to use output 488 | at state t as input at state t + 1. This is especially 489 | evident in the autoregressive decoding inference 490 | of a language model, requiring each token to be 491 | computed before fed into the next step, making it 492 | possible for RWKV to take advantage of its RNN- 493 | like structure, referred to as time-sequential mode. 494 | In such circumstances, RWKV can be conveniently 495 | formulated recursively for decoding during infer- 496 | ence, as shown in Appendix B, which leverages 497 | the advantage that each output token is dependent 498 | only on the latest state, which is of constant size, 499 | irrespective of the sequence length. 500 | It then behaves as an RNN decoder, yielding 501 | constant speed and memory footprint with respect 502 | to the sequence length, enabling the processing of 503 | longer sequences more efficiently. In contrast, self- 504 | attention typically requires a KV cache growing 505 | linearly with respect to the sequence length, result- 506 | ing in degraded efficiency and increasing memory 507 | footprint and time as the sequence grows longer. 508 | 4.4 509 | Software Implementation 510 | RWKV is originally implemented using the Py- 511 | torch Deep Learning Library (Paszke et al., 2019) 512 | and a custom CUDA kernel for the WKV com- 513 | putation explained in 4.7. Although RWKV is a 514 | general recurrent network, its current implemen- 515 | tation focuses in the task of language modeling 516 | (RWKV-LM). The model architecture is comprised 517 | of an embedding layer, for which we follow the 518 | setup described in Section 4.7 and several identical 519 | residual blocks applied sequentially as seen in Fig. 520 | 2 and 3 following the principles outlined in Section 521 | 4.6. After the last block, a simple output projec- 522 | tion head composed by a LayerNorm (Ba et al., 523 | 2016) and a linear projection is used to obtain the 524 | logits to be used in the next-token prediction task 525 | and calculate the cross entropy loss during training. 526 | Both the embeddings generated after the last resid- 527 | ual block and the logits could also be used later 528 | for downstream NLP tasks. Training is performed 529 | in time-parallel mode (Section 4.2) while autore- 530 | gressive inference and a potential chat interface4 531 | 4https://github.com/BlinkDL/ChatRWKV 532 | leverage the time-sequential mode (Section 4.3). 533 | 4.5 534 | Gradient Stability and Layer Stacking 535 | The RWKV architecture has been designed as a 536 | fusion of both Transformers and RNNs, offering 537 | the advantage of stable gradients and deeper archi- 538 | tectures of Transformers compared to traditional 539 | RNNs while being efficient in inference. 540 | Previous work has sought to tackle the prob- 541 | lem of gradient stability in RNNs with a variety of 542 | techniques including using non-saturated activation 543 | functions (Chandar et al., 2019), gating mechanism 544 | (Gu et al., 2019), gradient clipping (Pascanu et al., 545 | 2012), and adding constraints (Kanai et al., 2017; 546 | Miller and Hardt, 2018). 
While these techniques 547 | have seen little success, RWKV avoids the problem 548 | inherently by utilizing softmax in conjunction with 549 | RNN-style updates. 550 | The RWKV model features a single-step pro- 551 | cess for updating attention-like scores, which in- 552 | cludes a time-dependent softmax operation that 553 | helps numerical stability and guards against van- 554 | ishing gradients (for rigorous proof, see Appendix 555 | F). Intuitively, this operation ensures the gradient 556 | is propagated along the most relevant path. Layer 557 | normalization (Ba et al., 2016) is another key as- 558 | pect of the architecture which enhances the training 559 | dynamics of deep neural networks by stabilizing 560 | gradients, addressing both vanishing and exploding 561 | gradient issues. 562 | These design elements not only contribute to the 563 | RWKV architecture’s stability and learning capa- 564 | bilities but enable the stacking of multiple layers 565 | in a manner that surpasses the capabilities of any 566 | existing RNN. In doing so , the model is able to cap- 567 | ture more complex patterns across various levels of 568 | abstraction (see also Appendix G). 569 | 4.6 570 | Harnessing Temporal Structure for 571 | Sequential Data Processing 572 | RWKV captures and propagates sequential infor- 573 | mation through the combination of three mecha- 574 | nisms: recurrence, time decay and token shift. 575 | The recurrence in the time-mixing block of 576 | RWKV is the basis for the model’s capacity to 577 | capture intricate relationships between sequence 578 | elements and to propagate locality information 579 | through time. 580 | The time decay mechanism (e−w and eu in equa- 581 | tion 14), maintains sensitivity to the positional re- 582 | lationship between sequence elements. By gradu- 583 | ally diminishing the influence of past information 584 | over time, the model preserves a sense of temporal 585 | locality and progression, which is essential for se- 586 | quential processing. This treatment of positional 587 | information in sequential data exhibits similarities 588 | to the Attention with Linear Biases (ALiBi) model 589 | (Press et al., 2022), where the linear biases facili- 590 | tate input length extrapolation. In this context, the 591 | RWKV architecture can be perceived as a trainable 592 | version of ALiBi, seamlessly incorporating posi- 593 | tional information without the necessity for explicit 594 | encoding. It can also be seen as an extension of the 595 | gated convolution introduced in Zhai et al. (2021) 596 | to the full sequence length until a given step. 597 | The token shift or time-shift mixing, or (diag- 598 | onal arrows in Figure 3), also contributes to the 599 | model’s adaptation to sequential data. By linearly 600 | interpolating between the current input and the pre- 601 | vious time step input, the model naturally aggre- 602 | gates and gates information in the input channels. 603 | The overall structure of time-shift mixing bears 604 | resemblance to the causal convolution with no dila- 605 | tions in WaveNet (van den Oord et al., 2016), which 606 | is a classical architecture used for forecasting time 607 | series data. 608 | 4.7 609 | Additional Optimizations 610 | Custom Kernels 611 | To address inefficiencies in the 612 | WKV computation due to the sequential nature of 613 | the task when using standard deep learning frame- 614 | works, we implement a custom CUDA kernel so 615 | as to launch a single compute kernel in training ac- 616 | celerators. 
All other parts of the model are matrix 617 | multiplications and point-wise operations that can 618 | already be efficiently parallelized. 619 | FFN with R gate 620 | Prior research (Tolstikhin et al., 621 | 2021; Liu et al., 2021; Yu et al., 2022) sug- 622 | gests that self-attention may not be as essential 623 | in Transformer-based vision tasks as previously 624 | thought. Although it provided us with some in- 625 | sights, replacing self-attention entirely in natural 626 | language tasks could be too drastic. In our study, 627 | we partially dismantle the attention mechanism by 628 | replacing the fixed QKV formula with KV and in- 629 | troducing a new time-decaying factor W. This ap- 630 | proach enables us to incorporate token and channel- 631 | mixing components akin to MLP-mixer (Tolstikhin 632 | et al., 2021) and a gating unit R similar to gMLP 633 | (Liu et al., 2021), which enhance the performance 634 | of our RWKV model. 635 | Small Init Embedding 636 | During the initial stage 637 | of training a transformer model (Vaswani et al., 638 | 2017), we observe that the embedding matrix un- 639 | dergoes slow changes, which pose a challenge for 640 | the model to deviate from its initial noisy embed- 641 | ding state. To mitigate this issue, we propose an 642 | approach that involves initializing the embedding 643 | matrix with small values and subsequently apply- 644 | ing an additional LayerNorm operation. By imple- 645 | menting this technique, we accelerate and stabilize 646 | the training process, enabling the training of deep 647 | architectures with post-LN components. The effec- 648 | tiveness of this approach is demonstrated in Figure 649 | 8, where it is shown to facilitate improved conver- 650 | gence by allowing the model to quickly transition 651 | away from the initially small embedding. This is 652 | achieved through small changes following a single 653 | step, which in turn lead to substantial alterations 654 | in directions and subsequently significant changes 655 | after the LayerNorm operation. 656 | Custom Initialization 657 | Building on principles 658 | from previous works (He et al., 2016; Jumper et al., 659 | 2021), we initialize parameters to values as similar 660 | as possible to an identity mapping while break- 661 | ing symmetry so there is a clean information path. 662 | Most weights are initialized to zero. No biases are 663 | used for linear layers. Specific formulas are given 664 | in Appendix D. We find the choice of initialization 665 | to be significant in convergence speed and quality 666 | (see Appendix E). 667 | 5 668 | Evaluations 669 | In this section, we focus on evaluating to answer 670 | the following questions: 671 | • RQ1: 672 | Is 673 | RWKV 674 | competitive 675 | against 676 | quadratic transformer architectures with equal 677 | number of parameters and training tokens? 678 | • RQ2: When increasing the number of param- 679 | eters, does RW KV remain competitive against 680 | quadratic transformer architectures? 681 | • RQ3: Does increasing parameters of RWKV 682 | yield better language modeling loss, when 683 | RWKV models are trained for context lengths 684 | that most open-sourced quadratic transform- 685 | ers cannot efficiently process? 686 | Addressing RQ1 and RQ2, from Fig. 
4, we 687 | can see that RWKV is very competitive on six 688 | benchmarks (Winogrande, PIQA, ARC-C, ARC-E, 689 | LAMBADA, and SciQ) against major open source 690 | quadratic complexity transformer models: Pythia 691 | (Biderman et al., 2023), OPT (Zhang et al., 2022) 692 | and BLOOM (Scao et al., 2022). RWKV even out- 693 | performs Pythia and GPT-Neo (Black et al., 2022) 694 | in four tasks: PIQA, OBQA, ARC-E, and COPA 695 | (See details in Appendix H). For RQ3, Fig. 5 shows 696 | that increasing context length leads to lower test 697 | loss on the Pile, an indication that RW KV can make 698 | effective use of long contextual information. 699 | 6 700 | Inference Experiments 701 | We benchmark inference requirements according 702 | to size and family. Specifically, we evaluate text 703 | generation speed and memory requirements on a 704 | typical compute platforms including CPU (x86) 705 | and GPU (NVIDIA A100 80GB). For all our ex- 706 | periments we use float32 precision. We include 707 | all model parameters in parameter count, including 708 | both embedding and non-embedding layers. Per- 709 | formance under different quantization setups is left 710 | to further work. See Appendix I for more results. 711 | Figure 6: Cumulative time during text generation for 712 | different LLMs. 713 | Additionally, we carried out comparative studies 714 | on RWKV-4 and ChatGPT / GPT-4, see Appendix 715 | J. They revealed that RWKV-4 is very sensitive to 716 | prompt engineering. When the prompts were ad- 717 | justed from the ones used for GPT to more suitable 718 | for RWKV, the F1-measure performance increased 719 | even from 44.2% to 74.8%. 720 | 7 721 | Future Work 722 | There are several promising directions for future 723 | work on the RWKV architecture: 724 | • Increasing model expressivity with enhanced 725 | time-decay formulations and exploring initial 726 | model states while maintaining efficiency. 727 | (a) Winogrande 728 | (b) PIQA 729 | (c) ARC-Challenge 730 | (d) ARC-Easy 731 | (e) LAMBADA 732 | (f) SciQ 733 | Figure 4: Zero-Shot Performance: The horizontal axis is a number of parameters and the vertical axis is accuracy. 734 | 21 735 | 23 736 | 25 737 | 27 738 | 29 739 | 211 740 | Context Length 741 | 21 742 | 22 743 | Pile test loss 744 | 7B 8k 745 | 14B 8k 746 | Figure 5: Increasing context length contributes to lower 747 | test loss on the Pile (Gao et al., 2020). 748 | • Further improving RWKV computational ef- 749 | ficiency by applying parallel scan in the 750 | wkvt step to reduce the computational cost 751 | to O(B log(T)d). 752 | • Investigating the application of RWKV to 753 | encoder-decoder architectures and potential 754 | replacement of cross-attention mechanism. 755 | This could have applicability seq2seq or multi- 756 | modal settings, enhancing efficiency both in 757 | training and inference. 758 | • Leveraging RWKV’s state (or context) for in- 759 | terpretability, predictability in sequence data 760 | and safety. 761 | Manipulating the hidden state 762 | could also guide behavior and allow greater 763 | customizability through prompt tuning. 764 | • Exploring fine-tuned models in specific set- 765 | tings for enhanced interaction with humans 766 | (Ouyang et al., 2022). Particularly interest- 767 | ing would be the performance under different 768 | datasets and specific use cases. 
769 | • Adapting 770 | parameter-efficient 771 | fine-tuning 772 | methods such as LoRA (Hu et al., 2022 ) 773 | and characterizing behavior under different 774 | quantization 775 | schemes 776 | for 777 | the 778 | proposed 779 | architecture 780 | 8 781 | Conclusions 782 | We introduced RWKV, a new approach to RNN 783 | models exploiting the potential of time-based mix- 784 | ing components. RWKV introduces several key 785 | strategies which allow it to capture locality and 786 | long-range dependencies, while addressing limi- 787 | tations of current architectures by: (1) replacing 788 | the quadratic QK attention by a scalar formulation 789 | with linear cost, (2) reformulating recurrence and 790 | sequential inductive biases to unlock efficient train- 791 | ing parallelization and efficient inference, and (3) 792 | enhancing training dynamics using custom initial- 793 | izations. 794 | We benchmark the proposed architecture in a 795 | wide variety of NLP tasks and show comparable 796 | performance to SoTA with reduced cost. Further 797 | experiments on expressivity, interpretability, and 798 | scaling showcase the model capabilities and draw 799 | parallels in behavior between RWKV and other 800 | LLMs. 801 | RWKV opens a new door to scalable and ef- 802 | ficient architectures to model complex relation- 803 | ships in sequential data. While many alternatives 804 | to Transformers have been proposed with similar 805 | claims, ours is the first to back up those claims with 806 | pretrained models with tens of billions of parame- 807 | ters. 808 | 9 809 | Limitations 810 | While our proposed RWKV model has demon- 811 | strated promising results regarding training and 812 | memory efficiency during inference, some limita- 813 | tions should be acknowledged and addressed in 814 | future work. First, the linear attention of RWKV 815 | leads to significant efficiency gains but still, it may 816 | also limit the model’s performance on tasks that 817 | require recalling minutiae information over very 818 | long contexts. This is due to the funneling of in- 819 | formation through a single vector representation 820 | over many time steps, compared with the full in- 821 | formation maintained by the quadratic attention of 822 | standard Transformers. In other words, the model’s 823 | recurrent architecture inherently limits its ability to 824 | “look back” at previous tokens, as opposed to tra- 825 | ditional self-attention mechanisms. While learned 826 | time decay helps prevent the loss of information , 827 | it is mechanistically limited compared to full self- 828 | attention. 829 | Another limitation of this work is the increased 830 | importance of prompt engineering in comparison to 831 | standard Transformer models. The linear attention 832 | mechanism used in RWKV limits the information 833 | from the prompt that will be carried over to the 834 | model’s continuation. As a result, carefully de- 835 | signed prompts may be even more crucial for the 836 | model to perform well on tasks. 837 | Acknowledgements 838 | We acknowledge EleutherAI and StabilityAI for 839 | compute access and technical support in develop- 840 | ment of RWKV. We also acknowledge the mem- 841 | bers of the RWKV Discord server for their help 842 | and work on further extending the applicability of 843 | RWKV to different domains. Finally, we thank 844 | Stella Biderman for feedback on the paper. 
845 | References 846 | Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor 847 | Pryor, Luke Yoffe, Deepak Ramachandran, Lise 848 | Getoor, Jay Pujara, and William Yang Wang. 2022. 849 | FETA: A benchmark for few-sample task transfer 850 | in open-domain dialogue . 851 | In Proceedings of the 852 | 2022 Conference on Empirical Methods in Natu- 853 | ral Language Processing, pages 10936–10953, Abu 854 | Dhabi, United Arab Emirates. Association for Com- 855 | putational Linguistics. 856 | Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin- 857 | ton. 2016. Layer normalization. 858 | Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. 859 | An empirical evaluation of generic convolutional 860 | and recurrent networks for sequence modeling. 861 | Francesco Barbieri, Jose Camacho-Coll ados, Luis Es- 862 | pinosa Anke, and Leonardo Neves. 2020. TweetE- 863 | val: Unified benchmark and comparative evaluation 864 | for tweet classification. In Findings of the Associ- 865 | ation for Computational Linguistics: EMNLP 2020, 866 | pages 1644–1650, Online. Association for Computa- 867 | tional Linguistics. 868 | Iz Beltagy, Matthew E. Peters, and Arman Cohan. 869 | 2020. Longformer: The long-document transformer. 870 | arXiv:2004.05150. 871 | Stella Biderman, Hailey Schoelkopf, Quentin An- 872 | thony, Herbie Bradley, Kyle O’Brien, Eric Halla- 873 | han, Mohammad Aflah Khan, Shivanshu Purohit, 874 | USVSN Sai Prashanth, Edward Raff, et al. 2023. 875 | Pythia: A suite for analyzing large language mod- 876 | els across training and scaling. 877 | arXiv preprint 878 | arXiv:2304.01373. 879 | Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jian- 880 | feng Gao, and Yejin Choi. 2020. Piqa: Reasoning 881 | about physical commonsense in natural language. In 882 | Thirty-Fourth AAAI Conference on Artificial Intelli- 883 | gence. 884 | Sid Black, Leo Gao, Phil Wang, Connor Leahy, and 885 | Stella Biderman. 2022. Gpt-neo: Large scale autore- 886 | gressive language modeling with mesh-tensorflow, 887 | 2021. 888 | URL: 889 | https://doi. 890 | org/10.5281/zenodo, 891 | 5297715. 892 | James Bradbury, Stephen Merity, Caiming Xiong, and 893 | Richard Socher. 2017. Quasi-recurrent neural net- 894 | works. In ICLR. 895 | Tom Brown, Benjamin Mann, Nick Ryder, Melanie 896 | Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind 897 | Neelakantan, Pranav Shyam, Girish Sastry, Amanda 898 | Askell, et al. 2020. Language models are few-shot 899 | learners. Advances in neural information processing 900 | systems, 33:1877–1901. 901 | Ayd ar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. 902 | 2023. Scaling transformer to 1m tokens and beyond 903 | with rmt. 904 | Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. 905 | 2022. Recurrent memory transformer. Advances in 906 | Neural Information Processing Systems, 35:11079– 907 | 11091. 908 | A. P. Sarath Chandar, Chinnadhurai Sankar, Eugene 909 | Vorontsov, Samira Ebrahimi Kahou, and Yoshua 910 | Bengio. 2019. 911 | Towards non-saturating recurrent 912 | units for modelling long-term dependencies. 913 | In 914 | AAAI Conference on Artificial Intelligence. 915 | Krzysztof Choromanski, Valerii Likhosherstov, David 916 | Dohan, Xingyou Song, Andreea Gane, Tamas Sar- 917 | los, Peter Hawkins, Jared Davis, Af roz Mohiuddin, 918 | Lukasz Kaiser, David Belanger, Lucy Colwell, and 919 | Adrian Weller. 2020. Rethinking attention with per- 920 | formers. 921 | Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, 922 | and Yoshua Bengio. 2014. 
Empirical evaluation of 923 | gated recurrent neural networks on sequence model- 924 | ing. In NIPS 2014 Deep Learning and Representa- 925 | tion Learning Workshop. 926 | Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, 927 | Ashish Sabhar wal, Carissa Schoenick, and Oyvind 928 | Tafjord. 2018. Think you have solved question an- 929 | swering? try arc, the ai2 reasoning challenge. In 930 | arXiv:1803.05457. 931 | Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Ja- 932 | cob Hilton, Reiichiro Nakano, Christopher Hesse, 933 | and John Schulman. 2021. 934 | Training verifiers to 935 | solve math word problems. 936 | In arXiv, volume 937 | abs/2110.14168. 938 | Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and 939 | Christopher Re. 2022a. 940 | Flashattention: Fast and 941 | memory-efficient exact attention with IO-awareness. 942 | In Advances in Neural Information Processing Sys- 943 | tems. 944 | Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W 945 | Thomas, Atri Rudra, and Christopher Ré. 2022b. 946 | Hungry hungry hippos: 947 | Towards language mod- 948 | eling with state space models. 949 | arXiv preprint 950 | arXiv:22 12.14052. 951 | Dorottya Demszky, Dana Movshovitz-Attias, Jeong- 952 | woo Ko, Alan S. Cowen, Gaurav Nemade, and Su- 953 | jith Ravi. 2020. 954 | Goemotions: A dataset of fine- 955 | grained emotions. In Proceedings of the 58th An- 956 | nual Meeting of the Association for Computational 957 | Linguistics, ACL 2020, Online, July 5-10, 2020, 958 | pages 4040–4054. Association for Computational 959 | Linguistics. 960 | Leo Gao, Stella Biderman, Sid Black, Laurence Gold- 961 | ing, Travis Hoppe, Charles Foster, Jason Phang, Ho- 962 | race He, Anish Thite, Noa Nabeshima, et al. 2020. 963 | The pile: An 800gb dataset of diverse text for lan- 964 | guage modeling. arXiv preprint arXiv:2101.00027. 965 | Albert Gu, Karan Goel, and Christopher Ré. 2022. Ef- 966 | ficiently modeling long sequences with structured 967 | state spaces. 968 | In The International Conference on 969 | Learning Representations (ICLR). 970 | Albert 971 | Gu, 972 | Çaglar 973 | Gülçehre, 974 | Tom 975 | Le 976 | Paine, 977 | Matthew 978 | W. 979 | Hoffman, 980 | and 981 | Razvan 982 | Pascanu. 983 | 2019. Improving the gating mechanism of recurrent 984 | neural networks. ArXiv, abs/1910.09890. 985 | Mandy Guo, Joshua Ainslie, David C Uthus, Santi- 986 | ago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yin- 987 | fei Yang. 2022. Longt5 : Efficient text-to-text trans- 988 | former for long sequences. In Findings of the Associ- 989 | ation for Computational Linguistics: NAACL 2022, 990 | pages 724–736. 991 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian 992 | Sun. 2016. Identity mappings in deep residual net- 993 | works. 994 | Dan Hendrycks, Collin Burns, Steven Basart, Andy 995 | Zou, Mantas Mazeika, Dawn Song, and Jacob Stein- 996 | hardt. 2021. 997 | Measuring massive multitask lan- 998 | gu age understanding. In International Conference 999 | on Learning Representations. 1000 | Sepp Hochreiter. 1998. The vanishing gradient prob- 1001 | lem during learning recurrent neural nets and prob- 1002 | lem solutions. 1003 | International Journal of Uncer- 1004 | tainty, Fuzziness and Knowledge-Based Systems, 1005 | 6(02):107–116. 1006 | Sepp Hochreiter and Jürgen Schmidhuber. 1997. 1007 | Long short-term memory. 1008 | Neural Computation, 1009 | 9(8):1735–1780. 
1010 | Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, 1011 | El ena Buchatskaya, Trevor Cai, Eliza Rutherford, 1012 | Diego de Las Casas, Lisa Anne Hendricks, Johannes 1013 | Welbl, Aidan Clark, Tom Hennigan, Eric Noland, 1014 | Katie Millican, George van den Driessche, Bogdan 1015 | Damoc, Aurelia Guy, Simon Osindero, Karen Si- 1016 | monyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, 1017 | and Laurent Sifre. 2022. Training compute-optimal 1018 | large language models. 1019 | Edward J Hu, yelong shen, Phillip Wallis, Zeyuan 1020 | Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and 1021 | Weizhu Chen. 2022. LoRA: Low-rank adaptation of 1022 | large language models. In International Conference 1023 | on Learning Representations. 1024 | Hassan Ismail Fawaz, Germain Forestier, Jonathan We- 1025 | ber, Lhassane Idoumghar, and Pierre-Alain Muller. 1026 | 2019. Deep learning for time series classification: 1027 | a review. 1028 | Data mining and knowledge discovery, 1029 | 33(4):917–963. 1030 | Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol 1031 | Vinyals, Andrew Zisserman, and Joao Carreira. 1032 | 2021. Perceiver: General perception with iterative 1033 | attention. In International conference on machine 1034 | learning, pages 4651–4664. PMLR. 1035 | Hanhwi Jang, Joonsung Kim, Jae-Eon Jo, Jaewon Lee, 1036 | and Jangwoo Kim. 2019. 1037 | Mnnfast: A fast and 1038 | scalable system architecture for memory-augmented 1039 | neural networks. In Proceedings of the 46 th Interna- 1040 | tional Symposium on Computer Architecture, pages 1041 | 250–263. 1042 | Matt Gardner Johannes Welbl Nelson F. Liu. 2017. 1043 | Crowdsourcing multiple choice science questions. 1044 | In DOI:10.18653/v1/W17-4413. 1045 | Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke 1046 | Zettlemoyer. 2017. Triviaqa: A large scale distantly 1047 | supervised challenge dataset for reading comprehen- 1048 | sion. In ACL. 1049 | John Jumper, 1050 | Richard Evans, 1051 | Alexander Pritzel, 1052 | Tim Green, Michael Figurnov, Olaf Ronneberger, 1053 | Kathryn Tunyasuvunakool, Russ Bates, Augustin 1054 | Žídek, Anna Potapenko, and et al. 2021. 1055 | Highly 1056 | accurate protein structure prediction with al phafold. 1057 | Nature, 596(7873):583–589. 1058 | Sekitoshi Kanai, Yasuhiro Fujiwara, and Sotetsu Iwa- 1059 | mura. 2017. Preventing gradient explosions in gated 1060 | recurrent units. In NIPS. 1061 | Jared Kaplan, 1062 | Sam McCandlish, 1063 | Tom Henighan, 1064 | Tom B Brown, Benjamin Chess, Rewon Child, Scott 1065 | Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 1066 | 2020. 1067 | Scaling laws for neural language models. 1068 | arXiv preprint arX iv:2001.08361. 1069 | Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- 1070 | pas, and François Fleuret. 2020. Transformers are 1071 | rnns: Fast autoregressive transformers with linear 1072 | attention. In International Conference on Machine 1073 | Learning, pages 5156–5165. PMLR. 1074 | Nikita Kitaev, 1075 | L. Kaiser, 1076 | and Anselm Levskaya. 1077 | 2020. Reformer: The efficient transformer. ArXiv, 1078 | abs/2001.04451. 1079 | Jan Koco´n, Igor Cichecki, Oliwier Kaszyca, Mateusz 1080 | Kochanek, Dominika Szydło, Joanna Baran, Julita 1081 | Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil 1082 | Kanclerz, Anna Koco´n, Bartłomiej Koptyra, Wik- 1083 | toria Mieleszczenko-Kowszewicz, Piotr Miłkowski, 1084 | Marcin Oleksy, Maciej Piasecki, Łukasz Radli´nski, 1085 | K onrad Wojtasik, Stanisław Wo´zniak, and Prze- 1086 | mysław Kazienko. 2023. 
Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocoń, Bartłomiej Koptyra, Wiktoria Mieleszczenko-Kowszewicz, Piotr Miłkowski, Marcin Oleksy, Maciej Piasecki, Łukasz Radliński, Konrad Wojtasik, Stanisław Woźniak, and Przemysław Kazienko. 2023. ChatGPT: Jack of all trades, master of none.
Jan Kocoń, Piotr Miłkowski, and Monika Zaśko-Zielińska. 2019. Multi-level sentiment analysis of PolEmo 2.0: Extended corpus of multi-domain consumer reviews. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 980–991.
Phong Le and Willem Zuidema. 2016. Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive LSTMs. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 87–93.
Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, Brussels, Belgium. Association for Computational Linguistics.
Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. 2021. Pay attention to MLPs.
Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453.
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 2023. Mega: Moving average equipped gated attention. In ICLR.
Eric Martin and Chris Cundy. 2017. Parallelizing linear recurrent neural nets over sequence length. ArXiv, abs/1709.04057.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.
John Miller and Moritz Hardt. 2018. Stable recurrent models. arXiv: Learning.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849.
OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.
OpenAI. 2023. GPT-4 technical report.
Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. 2023. Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany. Association for Computational Linguistics.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library.
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. 2023. Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866.
Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
Ilan Price, Jordan Gifford-Moore, Jory Flemming, Saul Musker, Maayan Roichman, Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. 2020. Six attributes of unhealthy conversations. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 114–124, Online. Association for Computational Linguistics.
Markus N. Rabe and Charles Staats. 2022. Self-attention does not need O(n²) memory.
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2018. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Ramsha Siddiqui. 2019. SARCASMANIA: Sarcasm Exposed! http://www.kaggle.com/rmsharks4/sarcasmania-dataset. [Online; accessed 02-February-2023].
David R. So, Wojciech Manke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. 2021. Primer: Searching for efficient transformers for language modeling. CoRR, abs/2109.08668.
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.
Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. 2021. MLP-Mixer: An all-MLP architecture for vision. CoRR, abs/2105.01601.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. ArXiv, abs/1609.03499.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
David Vilares and Carlos Gómez-Rodríguez. 2019. HEAD-QA: A healthcare dataset for complex reasoning. In ACL.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity.
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24.
Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 1391–1399. ACM.
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. MetaFormer is actually what you need for vision.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In ACL.
Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. 2021. An attention free transformer.
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv:1810.12885.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

A Author Contributions

Bo Peng: Original RWKV idea, original code, performance optimizations, original experiments, and trained RWKV models from 0.1B to 14B.
Eric Alcaide: Manuscript (initial draft sections 1, 2; sections 4, 7 and 8; revision and proofreading; final version). Figures (2, 3, 4, 7). Experiments section 6. Appendices D, I. Contributions to Appendix K.
Quentin Anthony: Led writing the paper. Manuscript (initial draft sections 1, 2, 3; revision and proofreading; final version).
Zhenyuan Zhang: Manuscript (revision and proofreading). Figure 3. Experiments Appendix G. Contributions to Appendices B and K.
Kranthi Kiran GV: Manuscript (sections 2 and 5; contributions to section 3; revision and proofreading). Tables 3 and 4. Appendix C.
Xiangru Tang: Manuscript (sections 2 and 3; contributions to abstract; revision and proofreading). Contributions to Appendix K.
Matteo Grella: Manuscript (sections 4.5, 4.6, 8; contributions to sections 1, 7 and 9; proofreading and revision). Contributions to Appendix B.
Ferdinand Mom: Manuscript (contributions to sections 1, 2, 4.3, 4.6; proofreading and revision). Contributions to Appendix B.
Atsushi Saito: Manuscript (sections 3 and 5; contributions to section 2). Figures 1a, 1b, 1c. Contributions to Appendix H.
Krishna Sri Ipsit Mantri: Figure 4.
Rui-Jie Zhu: Tables 1 and 5. Experiments for Table 5.
Peng Zhou: Contributions to Table 5.
Qihang Zhao: Manuscript (proofreading and revision). Contributions to Table 5.
Xuzheng He: Manuscript (contributions to section 3; proofreading and revision). Contributions to Figures 1, 7. Appendix G. Contributions to Appendix F.
Hayden Lau: Manuscript (contributions to section 1; proofreading and revision). Contributions to Appendix K.
Michael Chung: Manuscript (contributions to section 4.6; proofreading and revision).
Haowen Hou: Figure 8. Appendix E.
Jiaming Kong: Manuscript (revision and proofreading). Appendix F.
Johan S. Wind: RWKV performance optimizations (CUDA). Contributions to Appendix C.
Jian Zhu: Manuscript (section 2; proofreading and revision). Figures 3 and 5.
Huanqi Cao: Manuscript (contributions to 4.2 and 4.3; proofreading and revision). Experiments for Appendix G.
Samuel Arcadinho: Contributions to Figures 6, 10, and 11. Contributions to Appendix I.
Xin Cheng: Manuscript (proofreading and revision). Contributions to Appendices K, H.
Alon Albalak: Manuscript (abstract and sections 1, 9; proofreading and revision).
Jan Kocon: Manuscript (section 1; proofreading and revision). Contributions to Appendix J.
Przemysław Kazienko: Manuscript (section 6; proofreading and revision). Contributions to Appendix J.
Ruichong Zhang: Manuscript (proofreading and revision). Contributions to Figure 5 and Appendix K.
Stanisław Woźniak: Appendix J.
Bartłomiej Koptyra: Contributions to Appendix J.

B Time-Mixing Block as an RNN Cell

As stated in 4.3, the RWKV time-mixing block can be formulated as an RNN, as the WKV computation can be written in the following recursive form:

a_0, b_0 = 0,                                                 (19)
wkv_t = (a_{t-1} + e^{u + k_t} v_t) / (b_{t-1} + e^{u + k_t}),  (20)
a_t = e^{-w} a_{t-1} + e^{k_t} v_t,                            (21)
b_t = e^{-w} b_{t-1} + e^{k_t}.                                (22)

The dataflow of the RNN-like time-mixing is shown in Fig. 7, where the hidden state h is the numerator-denominator tuple (a, b).

Figure 7: RWKV time-mixing block formulated as an RNN cell. Color codes: yellow (µ) denotes the token shift, red (1) denotes the denominator, blue (2) denotes the numerator, pink (3) denotes the fraction computations in 14. h denotes the numerator-denominator tuple (a, b).

To avoid overflow when calculating e^{k_t}, a numerical trick is used in the official implementation. Note that

a_1 = e^{-w} a_0 + e^{k_0} v_0 = e^{k_0} v_0,   (23)
b_1 = e^{-w} b_0 + e^{k_0} = e^{k_0},           (24)

and we set a'_1 = v_0, b'_1 = 1, p_0 = k_0, where p_{t-1} stores the shared exponent of a_t and b_t. The recursion above can then be converted into a numerically safe version; for each time step t > 1:

q := max(p_{t-1}, u + k_t),                                  (25)
a*_t = e^{p_{t-1} - q} a'_{t-1} + e^{u + k_t - q} v_t,        (26)
b*_t = e^{p_{t-1} - q} b'_{t-1} + e^{u + k_t - q},            (27)
wkv_t = a*_t / b*_t.                                          (28)

The update to a'_t, b'_t and their shared exponent is carried out in a similar fashion:

q := max(p_{t-1} - w, k_t),                                   (29)
a'_t = e^{p_{t-1} - w - q} a'_{t-1} + e^{k_t - q} v_t,        (30)
b'_t = e^{p_{t-1} - w - q} b'_{t-1} + e^{k_t - q},            (31)
p_t = q.                                                      (32)
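To make the recursion concrete, below is a small NumPy sketch of the numerically stable WKV recursion in equations (25)-(32). It is an illustrative reading of the formulas, not the official CUDA/PyTorch implementation; the array shapes and the smoke test at the end are assumptions made for the example.

    import numpy as np

    def wkv_stable(k, v, w, u):
        """Numerically stable WKV recursion (cf. equations 25-32).

        k, v: arrays of shape (T, C) holding per-token keys and values.
        w, u: arrays of shape (C,) holding the time-decay and bonus parameters.
        Returns wkv of shape (T, C).
        """
        T, C = k.shape
        # State: a', b' and their shared exponent p (cf. equations 23-24).
        a = np.zeros(C)
        b = np.zeros(C)
        p = np.full(C, -np.inf)          # e^{-inf} = 0, so the state starts empty
        out = np.zeros((T, C))
        for t in range(T):
            # Output for the current token (equations 25-28).
            q = np.maximum(p, u + k[t])
            a_star = np.exp(p - q) * a + np.exp(u + k[t] - q) * v[t]
            b_star = np.exp(p - q) * b + np.exp(u + k[t] - q)
            out[t] = a_star / b_star
            # State update with the shared exponent (equations 29-32).
            q = np.maximum(p - w, k[t])
            a = np.exp(p - w - q) * a + np.exp(k[t] - q) * v[t]
            b = np.exp(p - w - q) * b + np.exp(k[t] - q)
            p = q
        return out

    # Tiny smoke test with random inputs.
    rng = np.random.default_rng(0)
    T, C = 8, 4
    print(wkv_stable(rng.normal(size=(T, C)), rng.normal(size=(T, C)),
                     np.exp(rng.normal(size=C)), rng.normal(size=C)).shape)

Initializing the shared exponent to negative infinity reproduces the starting condition a_0 = b_0 = 0 without a special case for the first token.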
C Parameter and FLOP Count for the RWKV Models

The following section provides an overview of the different RWKV model architectures along with their respective parameter and FLOP counts in Table 2.

Name  | Layers | Model Dimension | Parameters    | FLOPs per token
169 M | 12     | 768             | 1.693 × 10^8  | 2.613 × 10^8
430 M | 24     | 1024            | 4.304 × 10^8  | 7.573 × 10^8
1.5 B | 24     | 2048            | 1.515 × 10^9  | 2.823 × 10^9
3 B   | 32     | 2560            | 2.985 × 10^9  | 5.710 × 10^9
7 B   | 32     | 4096            | 7.393 × 10^9  | 1.437 × 10^10
14 B  | 40     | 5120            | 1.415 × 10^10 | 2.778 × 10^10

Table 2: RWKV model architectures and associated FLOP counts.

The number of parameters for each model is computed using the formula #parameters = 2VD + 13D²L + D(11L + 4), where V = 50277 is the vocabulary size, D represents the model dimension and L corresponds to the number of layers.

FLOPs are for a forward pass on one token, calculated as 2(VD + 13D²L), which is twice (one add and one multiply) the number of parameters in linear layers. The backward-pass FLOPs can be approximated as twice those of the forward pass, so the total is 6(VD + 13D²L) per token for training (3x forward FLOPs). It is noteworthy that FLOPs are independent of the context length, unlike regular transformers. The FLOP approximations in this paper are in line with the methodology used by Kaplan et al. (2020).

Alternative approximations for FLOPs include doubling the parameter count, which yields similar results: within 2% for the 14B model, but with a 30% discrepancy for the 169M variant. Another approximation is based on the number of non-embedding parameters multiplied by 2. This gives 2(VD + 13D²L + D(11L + 4)), resulting in 1.6% more FLOPs for the 14B model and 8% more FLOPs for the 169M model.
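As a quick sanity check on these formulas, the short script below recomputes the parameter and FLOP counts for the configurations in Table 2. It is an illustrative sketch based only on the formulas above, not code taken from the RWKV repository.

    # Recompute Table 2 from the formulas above (illustrative sketch).
    V = 50277  # vocabulary size

    def rwkv_param_count(D, L):
        # #parameters = 2VD + 13D^2 L + D(11L + 4)
        return 2 * V * D + 13 * D * D * L + D * (11 * L + 4)

    def rwkv_forward_flops(D, L):
        # 2(VD + 13D^2 L): one multiply and one add per linear-layer weight
        return 2 * (V * D + 13 * D * D * L)

    configs = {"169M": (768, 12), "430M": (1024, 24), "1.5B": (2048, 24),
               "3B": (2560, 32), "7B": (4096, 32), "14B": (5120, 40)}

    for name, (D, L) in configs.items():
        params = rwkv_param_count(D, L)
        fwd = rwkv_forward_flops(D, L)
        print(f"{name:>5}: params={params:.3e}  fwd FLOPs/token={fwd:.3e}  "
              f"train FLOPs/token={3 * fwd:.3e}")

Running it reproduces, for example, 1.693e8 parameters and 2.613e8 forward FLOPs per token for the 169M configuration, matching the first row of Table 2.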
D Parameter initializations

We describe the specific parameter initializations below and motivate the design choices. Parameters belonging to residual blocks are often adjusted by layer depth and total number of layers. Let V denote the vocabulary size, s the embedding dimension, d the hidden size (we use d = 4s), L the number of layers, and l the layer index (from 0 to L−1). We use the following initializations:

• Embeddings are initialized to U(±1e-4), as explained in 4.7.
• For the channel-mixing blocks (11), µ_{k,i} and µ_{r,i} are initialized to (i/s)^{1 − l/L}.
• For the time-mixing blocks (16), the initializations are µ_{k,i} = (i/s)^{1 − l/L}, µ_{v,i} = (i/s)^{1 − l/L} + 0.3 l/(L−1), and µ_{r,i} = 0.5 (i/s)^{1 − l/L}.
• w_i (14), also known as "time decay", is initialized to −5 + 8 · (i/(d−1))^{0.7 + 1.3 l/(L−1)}. Intuitively, it is the discount factor applied to previous tokens over time.
• u_i (14), also known as "bonus", is set to 0.5 (((i + 1) mod 3) − 1) + log 0.3. It is the special weighting applied to the current token in equation 14. The alternating zigzag pattern initially creates subtle variations in the tensor elements, which are intended to help the model treat different dimensions of the embedding distinctively.
• W_o (15) (time-mixing) and W_v (channel-mixing) are initialized to N(0, sqrt(d/s) = 2).
• All W_r, W_k, W_v weights are initialized to 0 so the model can start learning from the beginning without noisy signals.
• All LayerNorm weights start from 1 and biases from 0.
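Since the initializations above are closed-form expressions, they are easy to reproduce. The NumPy sketch below evaluates them for one layer; it is an illustrative reading of the bullets rather than the official initialization code, and the vector lengths chosen here (s for the µ vectors, d for w and u) are assumptions made for the example.

    import numpy as np

    def mu_init(s, L, l):
        """Token-shift mixing coefficients for layer l (length s here)."""
        i = np.arange(s)
        base = (i / s) ** (1 - l / L)
        mu_k = base
        mu_v = base + 0.3 * l / (L - 1)
        mu_r = 0.5 * base
        return mu_k, mu_v, mu_r

    def time_decay_init(d, L, l):
        """'Time decay' w_i for layer l: discount applied to previous tokens."""
        i = np.arange(d)
        return -5 + 8 * (i / (d - 1)) ** (0.7 + 1.3 * l / (L - 1))

    def bonus_init(d):
        """'Bonus' u_i: zigzag pattern plus a log(0.3) offset."""
        i = np.arange(d)
        return 0.5 * (((i + 1) % 3) - 1) + np.log(0.3)

    # Example: layer 3 of a 12-layer model with embedding dimension 768.
    mu_k, mu_v, mu_r = mu_init(s=768, L=12, l=3)
    w = time_decay_init(d=4 * 768, L=12, l=3)
    u = bonus_init(d=4 * 768)
    print(mu_k.shape, w[:4], u[:6])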
E Small Init Embedding

This section presents experimental validation of the small initialization of the embedding. The experimental setup is as follows. In the baseline configuration, the parameters are initialized using a normal distribution with a mean of 0.0 and a standard deviation of 0.02, which is a commonly used initialization method in models like BERT and GPT. In the small initialization of the embedding (small init emb) experiment, the parameters are initialized using a uniform distribution with a range of 1e-4, which is slightly different from RWKV, where a normal distribution with a standard deviation of 1e-4 is used. However, this difference is negligible and does not affect our conclusions. The experiments were conducted with a batch size of 400. As depicted in Figure 8, the loss curve for the small init emb exhibits a faster rate of decrease and convergence compared to the traditional initialization using a normal distribution.

Figure 8: Effect of small initialization embedding (training loss vs. step for the Baseline and Small Init Emb runs).

F Gradient Stability in RWKV

In this section, we present a mathematical description of the gradient stability property in RWKV, focusing specifically on the time-mixing block. By gradient stability we mean that if the inputs x_t are bounded and the model parameters are fixed, then the gradients with respect to W_k and W_v are uniformly bounded for all T (thus not exploding). Consequently, we can control the amount each x_t contributes to the gradient at T in a naturally decaying fashion by the weight decay mechanism w (thus not vanishing unless desired).

First, we make the simplification that there are no token shifts; this will not affect the final conclusion. In this scenario, wkv_T can be written as

wkv_T = (Σ_{t=1}^{T} K^e_t v_t) / (Σ_{t=1}^{T} K^e_t) = E(v_t) = S(v_t) / S(1),   (33)

where

v_t = W_v x_t,   ∂(v_t)_i / ∂(W_v)_{i,j} = (x_t)_j,
K^e_t = e^{W_k x_t + w_{T,t}},   ∂(K^e_t)_i / ∂(W_k)_{i,j} = (x_t)_j (K^e_t)_i,

and S(·) and E(·) are shorthand for denoting sums and averages over the weights K^e_t.

The loss function at position T can be written as

L_T = l(f(wkv_T), y_T).   (34)

Because wkv_T relates to (W_k)_{i,j} and (W_v)_{i,j} only through the i-th channel (wkv_T)_i, we have

∂L_T / ∂(W_v)_{i,j} = (∂L_T / ∂(wkv_T)_i) · (∂(wkv_T)_i / ∂(W_v)_{i,j}).   (35)

The first part of the above equation contains trivial operations, like output layers and other layers of time-mixing, which can be proven inductively. The second part of the above equation can be bounded as

|∂(wkv_T)_i / ∂(W_v)_{i,j}| = |∂E_i[(v_t)_i] / ∂(W_v)_{i,j}| = |E_i[(x_t)_j]| ≤ max_t |(x_t)_j|,   (36)

which is irrelevant to T. Similarly,

∂(wkv_T)_i / ∂(W_k)_{i,j} = ∂(S_i[(v_t)_i] / S_i(1)) / ∂(W_k)_{i,j}
  = S_i[(x_t)_j (v_t)_i] / S_i(1) − S_i[(x_t)_j] S_i[(v_t)_i] / S_i(1)²
  = E_i[(x_t)_j (v_t)_i] − E_i[(x_t)_j] E_i[(v_t)_i]
  = cov_i((x_t)_j, (v_t)_i)   (37)

can also be bounded. Note that wkv's softmax operation contains at least two non-zero terms (u and w), so the above "covariance" will not degenerate into 0.

Figure 9: Model behavior visualizations of the RWKV model. Top: time decay (sorted along the channel axis) for layers 1-12 of RWKV-169M. Bottom: information propagation path (layer vs. token position for the prompt "The Eiffel Tower is located in the city of"), colored by the log-probability of "Paris".

G Model Behavior Visualization

In Figure 9, we present visualizations of some behavior of the RWKV model.

The top plot illustrates the time decays (e^{−w}) in each layer of the RWKV-169M model, sorted along the channel axis. Notably, several decays in the last layers are very close or equal to one, implying that certain information is preserved and propagated throughout the model's temporal context. Meanwhile, many decays in the initial layer are close to zero, which corresponds to local operations in wkv (14), likely associated with tasks such as text parsing or lexical analysis. (Note that the local operation in wkv is due to the extra parameter u when e^{−w} degenerates into 0.) These patterns of time decays are partly learned, but also come from the parameter initialization, as it speeds up training.

The bottom plot shows the information retrieval and propagation path in the RWKV-430M model. The experiment follows the causal trace method introduced by Meng et al. (2022), where we

1. Run the model once, and record all states and the activation of each layer during the computation;
2. Corrupt the input embeddings of the subject using noise ("The Eiffel Tower" in this example);
3. Restore the states and activation of a certain layer at a certain token during the computation, and record the log-probability of the model outputting the correct answer ("Paris").

Unlike transformers, RWKV relies on recursive propagation of information in the time dimension. In this case, the fact that "the Eiffel Tower is located in Paris" is retrieved in layer 4 and is then passed down to the subsequent layers. In layer 20, the information is mostly propagated through time until it reaches the position where it is needed. Finally, it is passed down to the last layer for outputting the answer.
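For readers who want to reproduce this kind of analysis, the sketch below outlines the generic causal-trace recipe of Meng et al. (2022): record a clean run, corrupt the subject embeddings, then patch a clean hidden state back into the corrupted run and watch the target log-probability recover. It is written against a generic Hugging Face causal LM and uses GPT-2 purely because it is small and widely available; the model name, noise scale, subject span, and layer/position choices are illustrative assumptions, and the paper applies the analogous procedure to RWKV's recurrent states rather than to a transformer.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # illustrative stand-in for any causal LM with hidden states
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    prompt = "The Eiffel Tower is located in the city of"
    target = " Paris"
    inputs = tok(prompt, return_tensors="pt")
    target_id = tok(target, add_special_tokens=False).input_ids[0]

    def target_logprob(logits):
        # log-probability of the target token at the last position
        return torch.log_softmax(logits[0, -1], dim=-1)[target_id].item()

    # 1. Clean run: record all hidden states.
    with torch.no_grad():
        clean = model(**inputs, output_hidden_states=True)
    clean_states = clean.hidden_states  # tuple: embeddings + one entry per block

    # 2. Corrupted run: add noise to the embeddings of the subject tokens.
    subject_positions = list(range(4))  # assumed span for "The Eiffel Tower"
    embeds = model.get_input_embeddings()(inputs.input_ids).detach().clone()
    embeds[0, subject_positions] += 0.5 * torch.randn_like(embeds[0, subject_positions])
    with torch.no_grad():
        corrupted = model(inputs_embeds=embeds, attention_mask=inputs.attention_mask)

    # 3. Patched run: restore one layer's clean state at one token position.
    layer, position = 6, subject_positions[-1]  # illustrative choice

    def patch_hook(module, module_inputs, module_output):
        hidden = module_output[0].clone()
        hidden[0, position] = clean_states[layer + 1][0, position]  # +1 skips embeddings
        return (hidden,) + module_output[1:]

    handle = model.transformer.h[layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        patched = model(inputs_embeds=embeds, attention_mask=inputs.attention_mask)
    handle.remove()

    print("clean    :", target_logprob(clean.logits))
    print("corrupted:", target_logprob(corrupted.logits))
    print("patched  :", target_logprob(patched.logits))

Sweeping the (layer, position) pair and plotting the patched log-probability is what produces a map like the bottom panel of Figure 9.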
H Evaluation Details

The results for the following tasks are shown in Tables 3 and 4.

Tasks:
• LAMBADA (Paperno et al., 2016). A benchmark dataset that evaluates the model's contextual reasoning and language comprehension abilities by presenting context-target pairs, where the objective is to predict the most probable target token.
• PIQA (Bisk et al., 2020). A benchmark for the task of physical common sense reasoning, which consists of a binary choice task that can be better understood as a set of two pairs, namely (Goal, Solution).
• HellaSwag (Zellers et al., 2019). A novel benchmark for commonsense Natural Language Inference (NLI) which is built by adversarial filtering against transformer models.
• WinoGrande (Zellers et al., 2020). A dataset designed to evaluate the acquisition of common sense reasoning by neural language models, aiming to determine whether we are accurately assessing the true capabilities of machine common sense.
• StoryCloze (Mostafazadeh et al., 2016). A benchmark presenting a novel approach to assess comprehension of narratives, narrative generation, and script acquisition, focusing on commonsense reasoning.
• ARC Challenge (Clark et al., 2018). A dataset designed for multiple-choice question answering, encompassing science exam questions ranging from third grade to ninth grade.
• ARC Easy. An easy subset of ARC.
• HeadQA (Vilares and Gómez-Rodríguez, 2019). A benchmark consisting of graduate-level questions encompassing various fields such as medicine, nursing, biology, chemistry, psychology, and pharmacology.
• OpenBookQA (Mihaylov et al., 2018). A QA dataset to evaluate human comprehension of a subject by incorporating open book facts, scientific knowledge, and perceptual common sense, drawing inspiration from open book exams.
• SciQ (Welbl et al., 2017). A multiple-choice QA dataset which was created using an innovative approach to gather well-crafted multiple-choice questions that are focused on a specific domain.
• TriviaQA (Joshi et al., 2017). A QA-IR dataset which is constituted of triples of questions, answers, supporting evidence, and independently collected evidence documents, with an average of six documents per question for reliable sources.
• ReCoRD (Zhang et al., 2018). A benchmark for evaluating commonsense reasoning in reading comprehension by generating queries from CNN/Daily Mail news articles and requiring text span answers from corresponding summarizing passages.
• COPA (Roemmele et al., 2018). A dataset to evaluate achievement in open-domain commonsense causal reasoning.
• MMLU (Hendrycks et al., 2021). A multi-task dataset covering 57 tasks, including elementary mathematics, US history, computer science, law, etc.
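Most of these benchmarks are multiple-choice and are scored zero-shot by comparing the model's log-likelihood of each candidate continuation. The sketch below shows that generic scoring recipe with a Hugging Face causal LM; the model name and the toy item are illustrative placeholders, and the numbers in Tables 3 and 4 come from the standard evaluation setup rather than from this snippet.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # illustrative; any causal LM could be swapped in
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    def continuation_logprob(context, continuation):
        """Sum of log-probabilities of the continuation tokens given the context."""
        ctx_ids = tok(context, return_tensors="pt").input_ids
        cont_ids = tok(continuation, add_special_tokens=False, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # logits at position t predict the token at position t + 1
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        cont_positions = range(ctx_ids.shape[1] - 1, input_ids.shape[1] - 1)
        return sum(logprobs[p, input_ids[0, p + 1]].item() for p in cont_positions)

    # Toy multiple-choice item in the (context, choices, gold index) format
    # shared by most of the benchmarks above.
    context = "The Eiffel Tower is located in the city of"
    choices = [" Paris", " London", " Rome"]
    gold = 0

    scores = [continuation_logprob(context, c) for c in choices]
    pred = max(range(len(choices)), key=lambda i: scores[i])
    print("scores:", scores, "predicted:", choices[pred], "correct:", pred == gold)

Accuracy ("acc") counts the argmax over raw scores; the "acc_norm" variants reported in the tables additionally normalize each score by the length of the candidate continuation.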
I Inference results

Figures 10 and 11 illustrate, respectively, the results on time (s) and memory (RAM, VRAM) requirements for LLM inference in float32 precision. We benchmark the following model families and sizes:

• RWKV: 169m, 430m, 1.4b, 3b, 7b, 14b
• Bloom (Scao et al., 2022): 560m, 1b, 3b
• OPT (Zhang et al., 2022): 125m, 350m, 1.3b, 2.7b, 6.7b, 13b
• GPT-Neo (Black et al., 2022): 125m, 1.3b, 2.7b
• Pythia (Biderman et al., 2023): 160m, 410m, 1.4b, 2.8b, 6.7b, 12b

Missing models are due to Out Of Memory (OOM) errors. A comparison at 512 tokens is shown in Figure 11, as some large transformer models produced an OOM when inferencing longer sequences. For GPU experiments, we use an NVIDIA A100 with 80GB of VRAM. For CPU experiments, we use an AMD EPYC processor with 30 CPU cores and 200 GiB RAM.

Model | Params (B) | PIQA acc | StoryCloze acc | HellaSwag acc_norm | WinoGrande acc | ARC-e acc | ARC-c acc_norm | OBQA acc_norm
RWKV-4 | 0.17 | 65.07 | 58.79 | 32.26 | 50.83 | 47.47 | 24.15 | 29.60
Pythia | 0.16 | 62.68 | 58.47 | 31.63 | 52.01 | 45.12 | 23.81 | 29.20
GPT-Neo | 0.16 | 63.06 | 58.26 | 30.42 | 50.43 | 43.73 | 23.12 | 26.20
RWKV-4 | 0.43 | 67.52 | 63.87 | 40.90 | 51.14 | 52.86 | 25.17 | 32.40
Pythia | 0.40 | 66.70 | 62.64 | 39.10 | 53.35 | 50.38 | 25.77 | 30.00
GPT-Neo | 0.40 | 65.07 | 61.04 | 37.64 | 51.14 | 48.91 | 25.34 | 30.60
RWKV-4 | 1.5 | 72.36 | 68.73 | 52.48 | 54.62 | 60.48 | 29.44 | 34.00
Pythia | 1.4 | 71.11 | 67.66 | 50.82 | 56.51 | 57.74 | 28.58 | 30.80
GPT-Neo | 1.4 | 71.16 | 67.72 | 48.94 | 54.93 | 56.19 | 25.85 | 33.60
RWKV-4 | 3.0 | 74.16 | 70.71 | 59.89 | 59.59 | 65.19 | 33.11 | 37.00
Pythia | 2.8 | 73.83 | 70.71 | 59.46 | 61.25 | 62.84 | 32.25 | 35.20
GPT-Neo | 2.8 | 72.14 | 69.54 | 55.82 | 57.62 | 61.07 | 30.20 | 33.20
RWKV-4 | 7.4 | 76.06 | 73.44 | 65.51 | 61.01 | 67.80 | 37.46 | 40.20
Pythia | 6.9 | 74.54 | 72.96 | 63.92 | 61.01 | 66.79 | 35.07 | 38.00
GPT-J | 6.1 | 75.41 | 74.02 | 66.25 | 64.09 | 66.92 | 36.60 | 38.20
RWKV-4 | 14.2 | 77.48 | 76.06 | 70.65 | 63.85 | 70.24 | 38.99 | 41.80
GPT-level∗ | 14.2 | 76.49 | 74.97 | 68.72 | 65.14 | 70.77 | 37.99 | 39.27
Pythia (c.f.) | 11.8 | 75.90 | 74.40 | 67.38 | 64.72 | 69.82 | 36.77 | 38.80
GPT-NeoX (c.f.) | 20.6 | 77.69 | 76.11 | 71.42 | 65.98 | 72.69 | 40.44 | 40.20

Table 3: Zero-Shot Performance of the model on Common Sense Reasoning Tasks. ∗ Interpolation of Pythia and GPT-Neo models.
Model | Params (B) | LAMBADA ppl | LAMBADA acc | headQA acc_norm | sciq acc | triviaQA acc | ReCoRD em | COPA acc
RWKV-4 | 0.17 | 29.33 | 32.99 | 25.78 | 77.50 | 1.26 | 62.03 | 66.00
Pythia | 0.16 | 24.38 | 38.97 | 25.82 | 76.50 | 1.31 | 66.32 | 62.00
GPT-Neo | 0.16 | 30.27 | 37.36 | 25.16 | 76.60 | 1.18 | 64.92 | 64.00
RWKV-4 | 0.43 | 13.04 | 45.16 | 27.32 | 80.30 | 2.35 | 70.48 | 65.00
Pythia | 0.40 | 11.58 | 50.44 | 25.09 | 81.50 | 2.03 | 75.05 | 67.00
GPT-Neo | 0.40 | 13.88 | 47.29 | 26.00 | 81.10 | 1.38 | 73.79 | 65.00
RWKV-4 | 1.5 | 7.04 | 56.43 | 27.64 | 85.00 | 5.65 | 76.97 | 77.00
Pythia | 1.4 | 6.58 | 60.43 | 27.02 | 85.50 | 5.52 | 81.43 | 73.00
GPT-Neo | 1.4 | 7.5 | 57.25 | 27.86 | 86.00 | 5.24 | 80.62 | 69.00
RWKV-4 | 3.0 | 5.25 | 63.96 | 28.45 | 86.50 | 11.68 | 80.87 | 82.00
Pythia | 2.8 | 4.93 | 65.36 | 28.96 | 87.70 | 9.63 | 85.10 | 77.00
GPT-Neo | 2.8 | 5.63 | 62.22 | 27.17 | 89.30 | 4.82 | 83.80 | 80.00
RWKV-4 | 7.4 | 4.38 | 67.18 | 31.22 | 88.80 | 18.30 | 83.68 | 85.00
Pythia | 6.9 | 4.3 | 67.98 | 28.59 | 90.00 | 15.42 | 86.44 | 85.00
GPT-J | 6.1 | 4.1 | 68.31 | 28.67 | 91.50 | 16.74 | 87.71 | 83.00
RWKV-4 | 14.2 | 3.86 | 70.83 | 32.64 | 90.40 | 24.58 | 85.67 | 85.00
GPT-level∗ | 14.2 | 3.81 | 70.94 | 31.03 | 92.20 | 22.37 | 87.89 | 82.66
Pythia (c.f.) | 11.8 | 3.89 | 70.44 | 30.74 | 91.80 | 20.57 | 87.58 | 82.00
GPT-NeoX (c.f.) | 20.6 | 3.64 | 71.94 | 31.62 | 93.00 | 25.99 | 88.52 | 84.00

Table 4: Zero-Shot Performance of various models on different tasks. ∗ Interpolation of Pythia and GPT-Neo models.
Method | L | d | T | Train bpc | Test bpc | Time Complexity | Space Complexity
Transformer | 12 | 512 | 1024 | 0.977 | 1.137 | O(T²d) | O(T² + Td)
Transformer | 24 | 256 | 1024 | 1.039 | 1.130 | O(T²d) | O(T² + Td)
Reformer | 12 | 512 | 1024 | 1.040 | 1.195 | O(T log T · d) | O(T log T + Td)
Synthesizer | 12 | 512 | 1024 | 0.994 | 1.298 | O(T²d) | O(T² + Td)
Linear Transformer | 12 | 512 | 1024 | 0.981 | 1.207 | O(Td²) | O(Td + d²)
Performer | 12 | 512 | 1024 | 1.002 | 1.199 | O(Td² log d) | O(Td log d + d² log d)
AFT-simple | 12 | 512 | 1024 | 0.854 | 1.180 | O(Td) | O(Td)
RWKV-RNN | 6 | 512 | 1024 | 0.720 | - | O(Td) | O(d)

Table 5: Enwik8 results, measured in bits per character (bpc): the lower the better. Baseline comparisons are made with Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020) (the best performing dense version), Linear Transformer (Katharopoulos et al., 2020), and Performer (Choromanski et al., 2020). L, d, and T denote the number of blocks (network depth), dimension of features, and sequence length, respectively. Both Linear Transformer and Performer are implemented with customized CUDA kernels (github.com/idiap/fast-transformers), and all other models are implemented in native PyTorch.

Figure 10: Text generation inference memory (CPU RAM, GPU VRAM) for LLMs. Model parameters are not accounted for.

Figure 11: Text generation inference time for LLMs.

Task Name | Measure type | ChatGPT [%] | GPT-4 [%] | RWKV-4 GPT [%] | RWKV-4 changed [%] | SOTA [%]
RTE | F1 Macro | 88.1 | 91.3 | 44.2 | 74.8 | 92.1
WNLI | Accuracy | 81.7 | 91.6 | 47.9 | 49.3 | 97.9
GoEmotions | F1 Macro | 25.6 | 23.1 | 7.9 | 7.9 | 52.8
PolEmo2 | F1 Macro | 44.1 | 41.0 | 38.2 | 40.9 | 76.4

Table 6: ChatGPT, GPT-4 and RWKV-4-Raven-14B reasoning performance comparison on the RTE (Wang et al., 2019), WNLI (Wang et al., 2018), GoEmotions (Demszky et al., 2020), and PolEmo2 (Kocoń et al., 2019) benchmarks. SOTA is provided as a supplementary reference.

J Importance of prompt construction and comparison to GPT models

Inspired by the article of Kocoń et al. (2023), we compared the zero-shot performance of RWKV-4-Raven-14B with ChatGPT (accessed in February 2023) and GPT-4 using several known NLP tasks, i.e., recognizing textual entailment (RTE), Winograd Natural Language Inference (WNLI), and recognizing emotions elicited in readers (GoEmotions and PolEmo2). Each model got the same prompts, manually chosen to elicit proper responses from the ChatGPT model.
As shown in Tab. 6, RWKV performs significantly worse than ChatGPT and GPT-4 on these specific tasks. We suspect that this disparity is largely caused by the choice of prompts used to generate the answers: the prompts are written in natural language and do not take into account that RWKV is an RNN, which cannot look back inside an instruction. When the instruction style was adapted to respect that an RNN is not capable of retrospective processing, quality on some datasets increased significantly (e.g., for RTE (Wang et al., 2019), F1 Macro increased from 44.2% to 74.8%). We hypothesize that RWKV models are more sensitive to the position of the components in the context, as RNN-based architectures cannot look back and readjust the weight of previous information. For better performance, the desired information should come after the question.

Example of the ChatGPT prompt for RTE:

Having premise judge if the following hypothesis are logically connected with the premise? Answer "entailment" if yes, or "not_entailment" if no.

RWKV prompt taking into account the characteristics of the RNN:

Can you tell me if the hypothesis is entailment or is not entailment to the premise?
premise:
hypothesis:

Task Name | Measure type | ChatGPT [%] | RWKV-4 adapted [%] | SOTA [%]
Aggression | F1 Macro | 69.10 | 56.66 | 74.45
MathQA | Accuracy | 71.40 | 80.69 | 83.20
Sarcasm | F1 Macro | 49.88 | 50.96 | 53.57
TweetSent | F1 Macro | 63.32 | 52.50 | 72.07
Unhealthy | F1 Macro | 45.21 | 43.30 | 50.96

Table 7: ChatGPT and RWKV-4-Raven-14B performance comparison on the Aggression (Wulczyn et al., 2017), Sarcasm (Siddiqui, 2019), Unhealthy (Price et al., 2020), MathQA (Cobbe et al., 2021), and TweetSent (Barbieri et al., 2020) benchmarks. SOTA is provided as a supplementary reference.

While separating the instruction from the input is relatively easy to do, other aspects of prompt engineering are harder to quantify. Testing the approach of stating the input after the question on multiple other tasks, shown in Tab. 7, suggests that better prompts might reduce the disparity between models. Raven achieves comparable results to ChatGPT on unhealthy conversation detection and even surpasses it on the sarcasm detection dataset. While this approach to prompting looks necessary, it alone is not enough to replace the capability of having free access to the whole context. Therefore, prompt engineering seems to be of significantly more importance to RNN models compared to standard transformers. It is entirely possible that good prompts for RNN models do not mean additional restrictions, but should simply be constructed using completely different guidelines. While the authors of the aforementioned paper (Kocoń et al., 2023) perform chain-of-thought prompting to improve results on the MathQA dataset, trying the same approach on the Raven model achieved a very low accuracy of 48.11%.
Despite that, using a very similar prompt without the chain-of-thought outperformed ChatGPT on this dataset. It would not be surprising if, thanks to the hidden state, RNN models do not need additional steps during inference, and adding those only makes the model forget the information it analyzed and deemed important. This further emphasizes the importance of the order of the information the model receives.

The template used to prompt the Raven model in MathQA with chain-of-thought prompting:

Write the reasoning and highlight the answer to the question at the end in format: 'Answer: '. The question is:

The template used to prompt the Raven model in MathQA without chain-of-thought prompting:

Write the answer to the math question in the format: 'Answer: '. The question is:

K Cases

In this part, we present a few instances of outputs produced by the RWKV model using a chat interface [5][6].

[5] https://github.com/BlinkDL/ChatRWKV
[6] https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

RWKV-4-Raven-14B
Tell me about ravens.
Ravens are large, black birds with a distinctive white head and neck. They are found in most parts of the world, including North America, Europe, Asia, and Australia. Ravens are known for their intelligence and problem-solving abilities. They are also considered to be symbols of death and transformation in many cultures. Ravens are often associated with the afterlife or death because they have been known to eat carrion or even other birds. In some cultures, ravens are also believed to be messengers of the gods or guardians of treasure.

RWKV-4-Raven-14B
Explain the concept of machine learning to a 10-year-old.
Machine learning is like a super-smart computer that can learn on its own. It can recognize patterns and make decisions based on what it has learned. Just like how you might remember something from the past, the computer can remember what it's learned and use it to make predictions or decisions in the future. It's like having a smart friend who always knows what to do!

RWKV-4-Raven-14B
Write a short poem describing nature's beauty in springtime.
Nature's beauty in springtime is rare,
A symphony of colors, a gentle surprise.
The flowers bloom, the birds sing and sing,
As the sun shines down, warming our skin.
The trees sway in the gentle breeze,
As we take in their beauty with ease.
The sky is blue, the clouds are white,
A sight to behold, it's truly divine.

RWKV-4-Raven-14B
Below is the numerically stable formula for updating the state variable in RWKV:

a*_t = e^{p−q} a_{t−1} + e^{u+k_t−q} v_t
b*_t = e^{p−q} b_{t−1} + e^{u+k_t−q}
wkv_t = a*_t / b*_t

Please write out the corresponding PyTorch code to do so, where both a_t and b_t are torch.Tensor([n]).

Here is the PyTorch code to update the state variable in RWKV using the given formula:

    import torch

    def update_state(a, b, u, v, p, q):
        a_t = torch.exp(p - q) * a + torch.exp(u + k - q) * v
        b_t = torch.exp(p - q) * b + torch.exp(u + k - q)
        return a_t, b_t

    a, b = torch.randn(2, 3)
    u = torch.randn(2)
    v = torch.randn(2)
    wkv = torch.randn(2)
    a_t, b_t = update_state(a, b, u, v, wkv, torch.randn(2))
    print("a_t:", a_t)
    print("b_t:", b_t)
This code defines a function update_state that takes in the state variables a and b, the control variables u and v, and the parameters wkv and torch.randn. It then uses the PyTorch torch.exp function to compute the exponential of the control variables and the log of the normalization factor. Finally, it uses these values to update the state.

RWKV-4-Raven-14B
Explain the following metaphor: Life is like cats.
The metaphor "Life is like cats" suggests that life can be unpredictable and full of unexpected twists and turns, just like a cat's behavior. It implies that we cannot fully control or predict the outcome of our lives, just as we cannot control the movements of a cat. The metaphor also implies that life is complex and multifaceted, with many different variables and factors at play. Just as a cat can have multiple personalities or moods, life can have many twists and turns that are difficult to predict or understand. Overall, the metaphor suggests that life is complex and unpredictable, much like a cat's behavior.

RWKV-4-Raven-14B
Test the convergence of Σ_{i=0}^{∞} 2^{−i}.
The given series is a geometric series with the first term a_0 = 2 and common ratio r = 1/2. We can find the sum of this series using the formula for a geometric series:

Σ_{i=0}^{∞} a_i = a_0 / (1 − r) = 1 / (1 − 1/2) = 2

So, the sum of the given series is 2.

RWKV-4-Raven-14B
I would like to cook some Chinese food at home. Do you have any suggestions on the meal I could choose?
Yes, of course! If you are not very experienced in cooking Chinese food, you could start with making noodles or rice, then make a dish using one of those. I also think making stir-fry with some meat is an excellent idea.
I want to make something more special. What about red braised pork belly?
Yes, red braised pork belly is also very popular and tasty.
Tell me the steps to cook.
You will need a few ingredients such as pork belly, star anise, sugar, ginger, scallions and wine. Put all of these in a pot, cook on medium heat and simmer until the meat is tender. After that, remove the meat from the pot and reserve the sauce. Slice the meat and add some ginger slices into it, then boil until it's almost done.
What do I need to be careful with the most during the cooking process?
To cook pork belly properly, it is very important to monitor the heat of the stove and be patient with the cooking process. Overcooking it can cause the pork to be tough, dry and difficult to chew. You can add a few ingredients like soy sauce or black beans to give the pork more flavor.
Tell me more about the proper heat and timing.
If the temperature of the water is too high, it can damage the collagen and connective tissues, resulting in tough pork belly. It is essential to control the temperature carefully, especially when braising the pork.
Additionally, keep the water simmering but not boiling to maintain a low simmer, as high heat will dry out the pork belly and result in tough and unpalatable meat. Lastly, monitor the heat during cooking and make sure the heat is medium, as this is what is most appropriate for slow cooking pork belly.
I see. Thank you for your advice.
You're welcome! It was a pleasure chatting with you. Let me know if you need any further assistance in the future.

RWKV-4-Raven-14B
If all apples are fruits and all fruits are sweet, are all apples sweet? Please explain your answer in detail.
Yes, all apples are fruits, and all fruits are sweet. Therefore, it is safe to assume that all apples are sweet.
If birds can fly. Penguins are birds. Can penguins fly? Please explain your answer in detail.
No, penguins cannot fly. Penguins are birds, but they are not capable of flight like other birds. They have wings and can glide for short distances, but they cannot take off or fly long distances like other birds.
The day before two days after the day before tomorrow is Saturday. What day is it today? Please explain your answer in detail.
Today is Friday. Two days after tomorrow is Sunday. Tomorrow is Saturday.
If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse? Please explain your answer in detail.
If five cats can catch five mice in five minutes, it means that each cat can catch one mouse in one minute. To find out how long it will take one cat to catch one mouse, we can use the concept of proportionality.
Let's assume that it takes x minutes for a cat to catch a mouse. Then, the number of mice caught by a cat in one minute is equal to the number of cats multiplied by the number of mice caught by each cat: x = 5 * 1 = 5
So, it will take 5 minutes for one cat to catch one mouse.

RWKV-4-Raven-14B
Summarize the following paragraph.

The paper presents a novel architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. The model is able to scale to tens of billions of parameters and exhibits linear computational complexity during training and inference, making it a promising alternative to Transformers for sequence processing tasks.
Please further summarize it in one sentence.
The researchers have developed a new architecture that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs, which allows for linear scaling in memory and computational requirements.
--------------------------------------------------------------------------------