├── 2305.13048v1.pdf ├── README.md ├── preText_edited.txt └── preText.txt /2305.13048v1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fabiomatricardi/Highlight-RAG/main/2305.13048v1.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Highlight-RAG 2 | Repo of the code from the Medium article 3 | 4 | 5 | ### Resources 6 | - https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker 7 | - https://rich.readthedocs.io/en/stable/layout.html 8 | - https://medium.com/@waihwa.cheong/retrieving-answers-from-your-document-using-llama-2-on-colab-gpu-freemium-series-7284faac8118 9 | - https://arxiv.org/pdf/2305.13048v1.pdf 10 | 11 | -------------------------------------------------------------------------------- /preText_edited.txt: -------------------------------------------------------------------------------- 1 | RWKV: Reinventing RNNs for the Transformer Era 2 | 3 | Abstract 4 | Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and 5 | computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. 6 | We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. 7 | Our experiments reveal that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. 8 | This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks. 9 | 10 | Introduction 11 | Deep learning techniques have made significant strides in artificial intelligence, playing a pivotal role in various scientific and industrial applications. These applications often involve complex sequential data processing tasks that include natural language understanding, conversational AI, time-series analysis, and even indirect modalities that can be reframed as sequences, such as images and graphs (Brown et al., 2020; Ismail Fawaz et al., 2019; Wu et al., 2020; Albalak et al., 2022). 12 | Predominant among these techniques are RNNs, convolutional neural networks (CNNs), and the Transformer models (Vaswani et al., 2017). 13 | Each of these has distinct drawbacks that restrict their efficiency in certain scenarios. RNNs suffer from the vanishing gradient problem, making them difficult to train for long sequences. Additionally, they cannot be parallelized in the time dimension during training, which restricts their scalability (Hochreiter, 1998; Le and Zuidema, 2016).
CNNs, on the other hand, are only adept at capturing local patterns, which limits their capacity to deal with long-range dependencies, crucial to many sequence processing tasks (Bai et al., 2018). 14 | Transformer models emerged as a powerful alternative due to their ability to handle both local and long-range dependencies and their capability for parallelized training (Tay et al., 2022). Recent models such as GPT-3 (Brown et al., 2020), ChatGPT (OpenAI, 2022; Kocoń et al., 2023), GPT-4 (OpenAI, 2023), LLaMA (Touvron et al., 2023), and Chinchilla (Hoffmann et al., 2022) exemplify the capability of this architecture, pushing the frontiers of what’s possible in NLP. Despite these significant advancements, the self-attention mechanism 15 | inherent to Transformers poses unique challenges, primarily due to its quadratic complexity. This complexity renders the architecture computationally expensive and memory-intensive for tasks involving long input sequences or in resource-constrained situations. These limitations have spurred a wealth of research aiming to improve the scaling properties of Transformers, often at the expense of some of 16 | the properties that make it so effective (Wang et al., 2020; Zaheer et al., 2020; Dao et al., 2022a). 17 | To tackle these challenges, we introduce the Receptance Weighted Key Value (RWKV) model, a novel architecture that effectively combines the strengths of RNNs and Transformers while circumventing key drawbacks. RWKV is carefully designed to alleviate the memory bottleneck and quadratic scaling associated with Transformers (Katharopoulos et al., 2020) with a more efficient linear scaling, while still preserving the rich, expressive properties that make the Transformer a dominant architecture in the field. 18 | One of the defining characteristics of RWKV is its ability to offer parallelized training and robust scalability, similar to Transformers. Moreover, we have reformulated the attention mechanism in RWKV to introduce a variant of linear attention, eschewing the traditional dot-product token interaction in favor of more effective channel-directed attention. This approach contrasts significantly with the traditional Transformer architecture, where specific token interactions predominantly drive attention. 19 | The implementation of linear attention in RWKV is carried out without approximation, which offers a considerable improvement in efficiency and enhances the scalability, see Table 1. 20 | The overarching motivation behind developing RWKV is to bridge the gap between computational efficiency and expressive capacity in neural network architectures. It offers a promising and viable solution for handling tasks involving large-scale models with billions of parameters, exhibiting competitive performance at a fraction of the computational cost. Our experimental results suggest that RWKV could be a valuable tool for addressing the ongoing challenges in scaling and deploying AI models across various domains, particularly those involving sequential data processing. Thus, RWKV paves the way for the next generation of more sustainable and computationally efficient AI models for sequence processing tasks. 21 | Our contributions in this paper are as follows: 22 | • We introduce the RWKV network architecture, which combines the advantages of RNNs and Transformers while mitigating their known limitations.
23 | • We propose a new attention mechanism reformulation that results in linear attention, eschewing the quadratic complexity associated with standard Transformer models. 24 | • We conduct a comprehensive series of experiments on benchmark datasets to showcase the performance, efficiency and scaling of RWKV in managing tasks involving large-scale models and long-range dependencies. 25 | • We release pretrained models ranging in size from 169 million to 14 billion parameters trained on the Pile (Gao et al., 2020). 26 | 27 | Related Work 28 | Recently, a number of techniques have been proposed to address the limitations of transformers. 29 | Optimizing Attention Mechanism 30 | Many transformer variants (“x-formers”) have been introduced to reduce the complexity of transformers (Tay et al., 2022), including sparse attention (Beltagy et al., 2020; Kitaev et al., 2020; Guo et al., 2022), approximating the full attention matrix (Wang et al., 2020; Ma et al., 2021; Choromanski et al., 2020), combining chunked attention with gating (Ma et al., 2023) and other efficient methods (Katharopoulos 31 | et al., 2020; Jaegle et al., 2021). Some recent works like FlashAttention (Dao et al., 2022a) and others (Rabe and Staats, 2022; Jang et al., 2019) share similarities with RWKV’s chunked computation scheme. Despite being memory-efficient, their time complexity remains 32 | quadratic or contains chunk size as a hidden factor. In contrast, RWKV achieves better space and time complexity during inference by formulating a linear attention as an RNN. 33 | Attention Free Models 34 | Another line of research replaces the attention mechanism with other modules to scale to long sequences. MLP-Mixer and others (Tolstikhin et al., 2021; Liu et al., 2021) proposed the replacement of attention by Multi-Layer Perceptrons (MLPs) in computer vision tasks. 35 | The Attention Free Transformer (AFT) (Zhai et al., 2021) replaces dot-product self-attention with a computationally efficient alternative which can be seen as a multi-head attention where each feature dimension corresponds to a head. Inspired by AFT, RWKV takes a similar approach but modifies the interaction weights for simplicity such that it can be transformed into an RNN. In parallel, RNN-style (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) recursive components have also been modified to increase context length, such as the Recurrent Memory Transformer (Bulatov et al., 2022, 2023) and Linear Recurrent Units (Orvieto et al., 2023). State space models (SSM) like S4 (Gu et al., 2022) and its variants (Dao et al., 2022b; Poli et al., 2023) are also proposed. 36 | Notably, Quasi-Recurrent neural network (QRNN) (Bradbury et al., 2017) uses both convolutional layers and recurrent pooling functions 37 | across timesteps and channels. While QRNN utilizes convolutional filters with fixed sizes, RWKV employs a time-mixing module as an 38 | attention mechanism with time-decaying factors. Different from the element-wise pooling in QRNN, RWKV includes a parametrized channel-mixing module (see the green blocks in Fig.1c) that is parallelizable.
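The "linear attention formulated as an RNN" idea above can be made concrete with a short sketch. The snippet below is a generic kernelized linear attention in the spirit of Katharopoulos et al. (2020), not RWKV's own WKV operator (which is defined in preText.txt further down); the feature map, sizes, and names are illustrative assumptions. It shows the property the text relies on: the same causal linear attention can be evaluated in parallel over the whole sequence for training, or token by token with a constant-size state for inference.

```python
# A minimal sketch (an assumption-laden illustration, not RWKV's own WKV
# formulation) of how a causal *linear* attention can be computed either in
# parallel over the whole sequence (training) or step by step as an RNN with a
# constant-size state (inference). Follows the generic recipe of
# Katharopoulos et al. (2020); feature map and sizes are illustrative.
import torch

torch.manual_seed(0)
T, d = 6, 8                      # sequence length, feature dimension
phi = torch.exp                  # a simple positive feature map (illustrative choice)

q, k, v = (torch.randn(T, d) for _ in range(3))

# parallel ("transformer-like") form: O(T^2) pairwise scores with a causal mask
scores = phi(q) @ phi(k).T                       # (T, T)
mask = torch.tril(torch.ones(T, T))
out_parallel = ((scores * mask) @ v) / (scores * mask).sum(dim=1, keepdim=True)

# recurrent ("RNN-like") form: constant-size state, one token at a time
S = torch.zeros(d, d)            # running sum of phi(k_i) v_i^T
z = torch.zeros(d)               # running sum of phi(k_i)
out_recurrent = []
for t in range(T):
    S = S + torch.outer(phi(k[t]), v[t])
    z = z + phi(k[t])
    out_recurrent.append((phi(q[t]) @ S) / (phi(q[t]) @ z))
out_recurrent = torch.stack(out_recurrent)

assert torch.allclose(out_parallel, out_recurrent, atol=1e-4)
```

The recurrent form only ever stores a d x d matrix and a d-vector, which is why memory stays constant in the sequence length; RWKV's WKV operator exploits the same property with a different, decay-based weighting.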
39 | 40 | -------------------------------------------------------------------------------- /preText.txt: -------------------------------------------------------------------------------- 1 | RWKV: Reinventing RNNs for the Transformer Era 2 | Bo Peng1∗ Eric Alcaide2,3,4∗ Quentin Anthony2,5∗ 3 | Alon Albalak2,6 Samuel Arcadinho2,7 Huanqi Cao8 Xin Cheng9 Michael Chung10 4 | Matteo Grella11 Kranthi Kiran GV12 Xuzheng He2 Haowen Hou13 Przemysław Kazienko14 5 | Jan Koco´n14 Jiaming Kong15 Bartłomiej Koptyra14 Hayden Lau2 Krishna Sri Ipsit Mantri16 6 | Ferdinand Mom17,18 Atsushi Saito2,19 Xiangru Tang20 Bolun Wang27 Johan S. Wind21 Stanisław Wo´zniak14 7 | Ruichong Zhang8 Zhenyuan Zhang2 Qihang Zhao22,23 Peng Zhou27 Jian Zhu24 Rui-Jie Zhu25,26 8 | 1RWKV Foundation 2EleutherAI 3University of Barcelona 4Charm Therapeutics 5Ohio State University 9 | 6University of California, Santa Barbara 7Zendesk 8Tsinghua University 9Peking University 10 | 10Storyteller .io 11Crisis24 12New York University 13National University of Singapore 11 | 14Wroclaw University of Science and Technology 15Databaker Technology Co. Ltd 16Purdue University 12 | 17Criteo AI Lab 18Epita 19Nextremer Co. Ltd. 20Yale University 21University of Oslo 13 | 22University of Science and Technology of China 23Kuaishou Technology Co. Ltd 14 | 24University of British Columbia 25University of California, Santa Cruz 15 | 26University of Electronic Science and Technology of China 27RuoxinTech 16 | Abstract 17 | Transformers have revolutionized almost all 18 | natural language processing (NLP) tasks but 19 | suffer from memory and computational com- 20 | plexity that scales quadratically with sequence 21 | length. In contrast, recurrent neural networks 22 | (RNNs) exhibit linear scaling in memory and 23 | computational requirements but struggle to 24 | match the same performance as Transform- 25 | ers due to limitations in parallelization and 26 | scalability. 27 | We propose a novel model ar- 28 | chitecture, Receptance Weighted Key Value 29 | (RWKV), that combines the efficient paral- 30 | lelizable training of Transformers with the effi- 31 | cient inference of R NNs. Our approach lever- 32 | ages a linear attention mechanism and allows 33 | us to formulate the model as either a Trans- 34 | former or an RNN, which parallelizes compu- 35 | tations during training and maintains constant 36 | computational and memory complexity during 37 | inference, leading to the first non-transformer 38 | architecture to be scaled to tens of billions 39 | of parameters. 40 | Our experiments reveal that 41 | RWKV performs on par with similarly sized 42 | Transformers, suggesting that future work can 43 | leverage this architecture to create more effi- 44 | cient models. 45 | This work presents a signifi- 46 | cant step towards reconciling the trade-offs be- 47 | tween computational efficiency and model per- 48 | formance in sequence processing tasks.1 49 | 1 50 | Introduction 51 | Deep learning techniques have made significant 52 | strides in artificial intelligence, playing a pivotal 53 | ∗ Equal first authorship. Others listed alphabetically. 54 | 1Code at: https://github.com/BlinkDL/RWKV-LM 55 | role in various scientific and industrial applica- 56 | tions. 
These applications often involve complex 57 | sequential data processing tasks that include nat- 58 | ural language understanding, conversational AI, 59 | time-series analysis, and even indirect modalities 60 | that can be reframed as sequences, such as im- 61 | ages and graphs (Brown et al., 2020; Ismail Fawaz 62 | et al., 2019; Wu et al., 2020; Albalak et al., 2022). 63 | Predominant among these techniques are RNNs, 64 | convolutional neural networks (CNNs), and the 65 | Transformer models (Vaswani et al., 2017). 66 | Each of these has distinct drawbacks that restrict 67 | their efficiency in certain scenarios. RNNs suf- 68 | fer from the vanishing gradient problem, making 69 | them difficult to train for long sequences. Addition- 70 | ally, they cannot be parallelized in the time dimen- 71 | sion during training, which restricts their scalability 72 | (Hochreiter, 1998; Le and Zuidema, 2016). CNNs, 73 | on the other hand, are only adept at capturing local 74 | patterns, which limits their capacity to deal with 75 | long-range dependencies, crucial to many sequence 76 | processing tasks (Bai et al., 2018). 77 | Transformer models emerged as a powerful alter- 78 | native due to their ability to handle both local and 79 | long-range dependencies and their capability for 80 | parallelized training (Tay et al., 2022). Recent mod- 81 | els such as GPT-3 (Brown et al., 2020), ChatGPT 82 | (OpenAI, 2022; Koco´n et al., 2023), GPT-4 (Ope- 83 | nAI, 2023), LLaMA (Touvron et al., 2023), and 84 | Ch inchilla (Hoffmann et al., 2022) exemplify the 85 | capability of this architecture, pushing the frontiers 86 | of what’s possible in NLP. Despite these signifi- 87 | cant advancements, the self-attention mechanism 88 | inherent to Transformers poses unique challenges, 89 | arXiv:2305.13048v1 [cs.CL] 22 May 2023 90 | Model 91 | Time 92 | Space 93 | Transformer 94 | O(T 2d) 95 | O(T 2 + Td) 96 | Reformer 97 | O(T log Td) 98 | O(T log T + Td) 99 | Linear Transformers 100 | O(Td2) 101 | O(Td + d2) 102 | Performer 103 | O(Td2 log d) O(Td log d + d2 log d) 104 | AFT-full 105 | O(T 2d) 106 | O(Td) 107 | MEGA 108 | O(cTd) 109 | O(cTd) 110 | RWKV (ours) 111 | O(Td) 112 | O(d) 113 | Table 1: Complexity comparison with different Trans- 114 | formers: Reformer (Kitaev et al., 2020), Linear Trans- 115 | former (Katharopoulos et al., 2020), Performer (Choro- 116 | manski et al., 2020), AFT (Zhai et al., 2021), MEGA 117 | (Ma et al., 2023). Here T denotes the sequence length, 118 | d the feature dimension, and c is MEGA’s chunk size 119 | of quadratic attention. 120 | primarily due to its quadr atic complexity. This com- 121 | plexity renders the architecture computationally ex- 122 | pensive and memory-intensive for tasks involving 123 | long input sequences or in resource-constrained sit- 124 | uations. These limitations have spurred a wealth of 125 | research aiming to improve the scaling properties 126 | of Transformers, often at the expense of some of 127 | the properties that make it so effective (Wang et al., 128 | 2020; Zaheer et al., 2020; Dao et al., 2022a). 129 | To tackle these challenges, we introduce the Re- 130 | ceptance Weighted Key Value (RWKV) model, a 131 | novel architecture that effectively combines the 132 | strengths of RNNs and Transformers while cir- 133 | cumventing key drawbacks. 
RWKV is carefully 134 | designed to alleviate the memory bottleneck and 135 | quadratic scaling associated with Transformers 136 | (Katharopoulos et al., 2020) with a more efficient 137 | linear scaling, while still preserving the rich, ex- 138 | pressive properties that make the Transformer a 139 | dominant architecture in the field. 140 | One of the defining characteristics of RWKV 141 | is its ability to offer parallelized training and ro- 142 | bust scalability, similar to Transformers. More- 143 | over, we have reformulated the attention mecha- 144 | nism in RWKV to introduce a variant of linear 145 | attention, eschewing the traditional dot-product to- 146 | ken interaction in favor of more effective channel- 147 | directed attention. This approach contrasts signifi- 148 | cantly with the traditional Transformer architecture, 149 | where specific token interactions predominantly 150 | drive attention. The implementation of linear atten- 151 | tion in RWKV is carried out without approxima- 152 | tion, which offers a considerable improvement in 153 | ef ficiency and enhances the scalability, see Table 1. 154 | The overarching motivation behind developing 155 | RWKV is to bridge the gap between computational 156 | efficiency and expressive capacity in neural net- 157 | work architectures. It offers a promising and viable 158 | solution for handling tasks involving large-scale 159 | models with billions of parameters, exhibiting com- 160 | petitive performance at a fraction of the computa- 161 | tional cost. Our experimental results suggest that 162 | RWKV could be a valuable tool for addressing the 163 | ongoing challenges in scaling and deploying AI 164 | models across various domains, particularly those 165 | involving sequential data processing. Thus, RWKV 166 | paves the way for the next generation of more sus- 167 | tainable and computationally efficient AI models 168 | for sequence processing tasks. 169 | Our contributions in this paper are as follows: 170 | • We introduce the RWKV network archi- 171 | tecture, which combines the advantages of 172 | RNNs and Transformers while mitigating 173 | their known limitations. 174 | • We propose a new attention mechanism re- 175 | formulation that results in linear attention, es- 176 | chewing the quadratic complexity associated 177 | with standard Transformer models. 178 | • We conduct a comprehensive series of experi- 179 | ments on benchmark datasets to showcase the 180 | performance, efficiency and scaling of RWKV 181 | in managing tasks involving large-scale mod- 182 | els and long-range dependencies. 183 | • We release pretrained model ranging in size 184 | from 169 million to 14 billion parameters 185 | trained on the Pile (Gao et al., 2020).2 186 | 2 187 | Related Work 188 | Recently, a number of techniques have been pro- 189 | posed to address the limitations of transformers. 190 | Optimizing Attention Mechanism 191 | Many trans- 192 | former variants (“ x-formers”) have been introduced 193 | to reduce the complexity of transformers (Tay et al., 194 | 2022), including sparse attention (Beltagy et al., 195 | 2020; Kitaev et al., 2020; Guo et al., 2022), ap- 196 | proximating the full attention matrix (Wang et al., 197 | 2020; Ma et al., 2021; Choromanski et al., 2020), 198 | combining chunked attention with gating (Ma et al., 199 | 2023) and other efficient methods (Katharopoulos 200 | et al., 2020; Jaegle et al., 2021). 
201 | Some recent works like FlashAttention (Dao 202 | et al., 2022a) and others (Rabe and Staats, 2022; 203 | Jang et al., 2019) share similarities with RWKV’s 204 | chunked computation scheme. 205 | Despite being 206 | memory-efficient, their time complexity remains 207 | quadratic or contains chunk size as a hidden fac- 208 | tor. In contrast, RWKV achieves better space and 209 | 2https://huggingface.co/RWKV 210 | time complexity during inference by formulating a 211 | linear attention as an RNN. 212 | Attention Free Models 213 | Another line of research 214 | replaces the attention mechanism with other mod- 215 | ules to scale to long sequences. MLP-Mixer and 216 | others (Tolstikhin et al., 2021; Liu et al., 2021) 217 | proposed the replacement of attention by Multi- 218 | Layer Perceptrons (MLPs) in computer vision tasks. 219 | The Attention Free Transformer (AFT) (Zhai et al., 220 | 2021) replaces dot-product self-attention with a 221 | computationally efficient alternative which can be 222 | seen as a multi-head attention where each feature 223 | dimension corresponds to a head. Inspired by AFT, 224 | RWKV takes a similar approach but modifies the 225 | interaction weights for simplicity such that it can 226 | be transformed into an RNN. In parallel, RNN- 227 | style (Hochreiter and Schmidhuber, 1997; Chung 228 | et al., 2014) recursive components have also been 229 | modified to increase context length, such as the Re- 230 | current Memory Transformer (Bulatov et al ., 2022, 231 | 2023) and Linear Recurrent Units (Orvieto et al., 232 | 2023). State space models (SSM) like S4 (Gu et al., 233 | 2022) and its variants (Dao et al., 2022b; Poli et al., 234 | 2023) are also proposed. 235 | Notably, 236 | Quasi-Recurrent 237 | neural 238 | network 239 | (QRNN) (Bradbury et al., 2017) uses both con- 240 | volutional layers and recurrent pooling functions 241 | across timesteps and channels. 242 | While QRNN 243 | utilizes conv olutional filters with fixed sizes, 244 | RWKV employs a time-mixing module as an 245 | attention mechanism with time-decaying factors. 246 | Different from the element-wise pooling in QRNN, 247 | RWKV includes a parametrized channel-mixing 248 | module (see the green blocks in Fig.1c) that is 249 | parallelizable. 250 | 3 251 | Background 252 | Here we briefly review the fundamentals of RNNs 253 | and Transformers. 254 | 3.1 255 | Recurrent Neural Networks (RNNs) 256 | Popular R NN architectures such as LSTM (Hochre- 257 | iter and Schmidhuber, 1997) and GRU (Chung 258 | et al., 2014) are characterized by the following for- 259 | mulation (shown for LSTM, others can be reasoned 260 | similarly): 261 | ft = σg(Wfxt + Ufht−1 + bf), 262 | (1) 263 | it = σg(Wixt + Uiht−1 + bi), 264 | (2) 265 | ot = σg(Woxt + Uoht−1 + bo), 266 | (3) 267 | ˜ct = σc(Wcxt + Ucht−1 + bc), 268 | (4) 269 | ct = ft ⊙ ct−1 + it ⊙ ˜ct, 270 | (5) 271 | ht = ot ⊙ σh(ct). 272 | (6) 273 | The data flow of RNNs is shown in Fig. 1a. Al- 274 | though RNNs can be factored into two linear blocks 275 | (W and U) and an RNN-specific block (1)–(6), as 276 | noted by Brad bury et al. (2017), the data depen- 277 | dency relying on previous time steps prohibits par- 278 | allelizing these typical RNNs. 279 | 3.2 280 | Transformers and AFT 281 | Introduced by Vaswani et al. (2017), Transformers 282 | are a class of neural networks that have become 283 | the dominant architecture for several NLP tasks. 
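As a reference point for the recurrence in equations (1)-(6) of Section 3.1, here is a minimal sketch of a single LSTM cell unrolled over a toy sequence, written directly from those formulas rather than taken from any library implementation; dimensions, initialization, and names are illustrative assumptions.

```python
# A minimal sketch of the LSTM recurrence in equations (1)-(6) of Section 3.1,
# written from the formulas themselves; sizes and initialization are assumptions.
import torch

d_in, d_h = 4, 3
xs = torch.randn(5, d_in)                   # a toy sequence of 5 input vectors

# one (W, U, b) triple per gate: forget, input, output, candidate cell
W = {g: torch.randn(d_h, d_in) for g in "fioc"}
U = {g: torch.randn(d_h, d_h) for g in "fioc"}
b = {g: torch.zeros(d_h) for g in "fioc"}

h = torch.zeros(d_h)
c = torch.zeros(d_h)
for x_t in xs:
    f = torch.sigmoid(W["f"] @ x_t + U["f"] @ h + b["f"])     # eq (1)
    i = torch.sigmoid(W["i"] @ x_t + U["i"] @ h + b["i"])     # eq (2)
    o = torch.sigmoid(W["o"] @ x_t + U["o"] @ h + b["o"])     # eq (3)
    c_tilde = torch.tanh(W["c"] @ x_t + U["c"] @ h + b["c"])  # eq (4)
    c = f * c + i * c_tilde                                   # eq (5)
    h = o * torch.tanh(c)                                     # eq (6)
```

Each update of (h, c) depends on their values at the previous step, which is exactly the data dependency that prevents parallelizing such RNNs over the time dimension during training.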
284 | Instead of operating on sequences step-by-step like 285 | RNNs, Transformers rely on attention mechanisms 286 | to capture relationships between all input and all 287 | output tokens: 288 | Attn(Q, K, V ) = softmax(QK� �)V, 289 | (7) 290 | where the multi-headness and scaling factor 291 | 1 292 | √dk is 293 | omitted for convenience. The core QK⊤ multipli- 294 | cation is an ensemble of pairwise attention scores 295 | between each token in a sequence, which can be 296 | decomposed as vector operations: 297 | Attn(Q, K, V )t = 298 | PT 299 | i=1 eq⊤ 300 | t kivi 301 | PT 302 | i=1 eq⊤ 303 | t ki . 304 | (8) 305 | In AFT (Zhai et al., 2021), this is alternately 306 | formulated as 307 | Attn+(W, K, V )t = 308 | Pt 309 | i=1 ewt,i+kivi 310 | Pt 311 | i=1 ewt,i+ki , 312 | (9) 313 | where {wt,i} ∈ RT×T is the learned pair-wise po- 314 | sition biases, and each wt,i is a scalar. 315 | Inspired by AFT, we let each wt,i in RWKV be 316 | a channel-wise time decay vector multiplied by the 317 | relative position, traced backwards from current 318 | time as it decays: 319 | wt,i = −(t − i)w, 320 | (10) 321 | where w ∈ (R≥0)d, with d the number of chan- 322 | nels. We require w to be non-negative to ensure 323 | that ewt,i ≤ 1 and the per-channel weights decay 324 | backwards in time. 325 | Linear 326 | RNN Cell/Linear 327 | Linear 328 | RNN Cell/Linear 329 | (a) RNN 330 | Convolution 331 | Elementwise 332 | Convolution 333 | Pooling 334 | Elementwise 335 | Pooling 336 | (b) QuasiRNN (Bradbury et al., 2017) 337 | Time-mixing 338 | Channel-mixing 339 | Time-mixing 340 | Channel-mixing 341 | (c) RWKV 342 | Figure 1: Computation structure of the RWKV in comparison to QRNN and RNN (Vanilla, LSTM, GRU, etc) 343 | architectures. Color codes : orange indicates time-mixing, convolutions or matrix multiplications, and the contin- 344 | uous block indicates that these computations can proceed simultaneously; blue signifies parameterless functions 345 | that operate concurrently along the channel or feature dimension (element-wise). Green indicates channel-mixing. 346 | 4 347 | The Receptance Weighted Key Value 348 | (RWKV) Model 349 | The RWKV architecture derives its name from 350 | the four primary model elements used in the time- 351 | mixing and channel-mixing blocks: 352 | • R: Receptance vector acting as the accep- 353 | tance of past information. 354 | • W: Weight is the positional weight decay 355 | vector. A trainable model parameter. 356 | • K: Key is a vector analogous to K in tradi- 357 | tional attention. 358 | • V : Value is a vector analogous to V in tradi- 359 | tional attention. 360 | Interactions between the main elements for every 361 | timestep are multiplicative, as illustrated in Fig. 2 362 | 4.1 363 | High-Level Summary 364 | The RWKV architecture is comprised of a series 365 | of stacked residual blocks, each formed by a time- 366 | mixing and a channel-mix ing sub-blocks with re- 367 | current structures. 368 | The recurrence is formulated both as a linear in- 369 | terpolation between the current input and the input 370 | at the previous time step (a technique we refer to 371 | as time-shift mixing or token shift, indicated by the 372 | diagonal lines in Fig. 3), which can be adjusted in- 373 | dependently for every linear projection of the input 374 | embedding (e.g., R, K, V in time-mixing, and R, 375 | K in channel-mixing), and as the time-dependent 376 | update of the WKV which is formalized in equa- 377 | tion 14. 
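To make equations (9) and (10) of Section 3.2 concrete before the WKV operator is introduced, the sketch below evaluates the AFT-style attention of equation (9) with the RWKV position-bias parameterization w_{t,i} = -(t - i)w of equation (10), applied channel-wise. It is written directly from the equations (shapes and names are assumptions), not taken from the released RWKV code.

```python
# A hedged sketch of the AFT-style attention in equation (9), with the RWKV
# position biases w_{t,i} = -(t - i) * w of equation (10) applied per channel.
import torch

T, d = 6, 4
k, v = torch.randn(T, d), torch.randn(T, d)
w = torch.rand(d)                       # non-negative channel-wise decay vector

out = torch.zeros(T, d)
for t in range(T):
    i = torch.arange(t + 1)             # positions 0..t (0-based)
    bias = -(t - i).unsqueeze(1) * w    # equation (10): older tokens decay more
    weights = torch.exp(bias + k[: t + 1])            # e^{w_{t,i} + k_i}, per channel
    out[t] = (weights * v[: t + 1]).sum(0) / weights.sum(0)   # equation (9)
```

Because w is non-negative, e^{w_{t,i}} <= 1, so a token's contribution decays per channel as it recedes into the past.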
The WKV computation is similar to AFT 378 | (Zhai et al., 2021), but W is now a channel-wise 379 | vector multiplied by relative position rather than a 380 | pairwise matrix in AFT. We also introduce a vector 381 | U for separately attending to the current token in 382 | order to compensate for potential degeneration of 383 | W (see Appendix G for more details). 384 | Figure 2: RWKV block elements (left) and RWKV 385 | residual block with a final head for language modeling 386 | (right) architectures. 387 | Token 388 | shift 389 | Token 390 | shift 391 | States 392 | States 393 | Layer Norm 394 | My 395 | Layer Norm 396 | name 397 | Layer Norm 398 | Time Mix 399 | LM Head 400 | Layer Norm 401 | is 402 | Layer Norm 403 | Bob 404 | Layer Norm 405 | Channel Mix 406 | Time Mix 407 | LM Head 408 | Layer Norm 409 | name 410 | Layer Norm 411 | is 412 | Layer Norm 413 | Channel Mix 414 | Time Mix 415 | LM Head 416 | Token 417 | shift 418 | Token 419 | shift 420 | Channel Mix 421 | Figure 3: RWKV architecture for language modelling. 422 | The time-mixing block is given by: 423 | rt = Wr · (µrxt + (1 − µr)xt−1), 424 | (11) 425 | kt = Wk · (µkxt + (1 − µk)xt−1), 426 | (12) 427 | vt = Wv · (µvxt + (1 − µv)xt−1), 428 | (13) 429 | wkvt = 430 | Pt−1 431 | i=1 e−(t−1−i)w+kivi + eu+ktvt 432 | Pt−1 433 | i =1 e−(t−1−i)w+ki + eu+kt 434 | , (14) 435 | ot = Wo · (σ(rt) ⊙ wkvt), 436 | (15) 437 | where the WKV computation, wkvt, plays the 438 | role of Attn(Q, K, V ) in Transformers without in- 439 | curring a quadratic cost as interactions are between 440 | scalars. Intuitively, as time t increases, the vector 441 | ot is dependent on a long history, represented by the 442 | summation of an increasing number of terms. For 443 | the target position t, RWKV performs a weighted 444 | summation in the positional interval of [1, t], and 445 | then multiplies with the receptance σ(r). There- 446 | fore, interactions are multiplicative inside a given 447 | timestep and summed over different timesteps. 448 | Further, the channel-mixing block is given by: 449 | rt = Wr · (µrxt + (1 − µr)xt−1), 450 | (16) 451 | kt = Wk · (µkxt + (1 − µk)xt−1), 452 | (17) 453 | ot = σ(rt) ⊙ (Wv · max(kt, 0)2), 454 | (18) 455 | where we adopt squared ReLU activation (So et al., 456 | 2021). Note that in both time-mixing and channel- 457 | mixing, by taking the sigmoid of the receptance, 458 | we’re intuitively using it as a “forget gate” to elimi- 459 | nate unnecessary historical information. 460 | 4.2 461 | Transformer-like Parallelization 462 | RWKV can be efficiently parallelized in what we 463 | call a time-parallel mode, reminiscent of Trans- 464 | formers. 465 | The time complexity of processing a 466 | batch of sequences in a single layer is O(BTd2), 467 | which mainly consists of matrix multiplications 468 | W□, □ ∈ {r, k, v, o} (assuming B sequences, T 469 | maximum tokens and d channels). Meanwhile, up- 470 | dating attention scores wkvt requires a serial scan 471 | (see Appendix B for more detail) and has complex- 472 | ity O(BTd). 473 | The matrix multiplications can be parallelized 474 | akin to W □ , □ ∈ {Q, K, V, O} in typical Trans- 475 | formers. The element-wise WKV computation 476 | is time-dependent, but can be readily parallelized 477 | along the other two dimensions (Lei et al., 2018)3. 478 | 3If the sequence is very long, more sophisticated meth- 479 | ods such as Martin and Cundy (2017) that parallelize over 480 | sequence length could be used. 
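A minimal, hedged sketch of one RWKV layer step in the recurrent (time-sequential) form follows, written directly from equations (11)-(18). Residual connections, LayerNorms, and the numerically stabilized custom kernel of the released implementation are omitted, and all names, shapes, and initializations are illustrative assumptions. The state carried between tokens is the previous input (for token shift) plus the running numerator and denominator of equation (14).

```python
# A minimal, hedged sketch of one RWKV layer step in time-sequential (RNN) mode,
# written from equations (11)-(18). Residuals, LayerNorms and the numerically
# stabilized kernel are omitted; names, shapes and initializations are assumptions.
import torch

d = 8
# time-mixing parameters (eqs 11-15)
mu_r, mu_k, mu_v = (torch.rand(d) for _ in range(3))
W_r, W_k, W_v, W_o = (torch.randn(d, d) / d**0.5 for _ in range(4))
w = torch.rand(d)          # channel-wise decay, kept non-negative
u = torch.randn(d)         # bonus for attending to the current token
# channel-mixing parameters (eqs 16-18 use their own mu and W; prefixed cm_ here)
cm_mu_r, cm_mu_k = torch.rand(d), torch.rand(d)
cm_W_r, cm_W_k, cm_W_v = (torch.randn(d, d) / d**0.5 for _ in range(3))

def time_mixing(x_t, x_prev, num, den):
    r = W_r @ (mu_r * x_t + (1 - mu_r) * x_prev)                   # eq (11)
    k = W_k @ (mu_k * x_t + (1 - mu_k) * x_prev)                   # eq (12)
    v = W_v @ (mu_v * x_t + (1 - mu_v) * x_prev)                   # eq (13)
    wkv = (num + torch.exp(u + k) * v) / (den + torch.exp(u + k))  # eq (14)
    out = W_o @ (torch.sigmoid(r) * wkv)                           # eq (15)
    # roll the state forward: decay past contributions, add the current token
    num = torch.exp(-w) * num + torch.exp(k) * v
    den = torch.exp(-w) * den + torch.exp(k)
    return out, num, den

def channel_mixing(x_t, x_prev):
    r = cm_W_r @ (cm_mu_r * x_t + (1 - cm_mu_r) * x_prev)          # eq (16)
    k = cm_W_k @ (cm_mu_k * x_t + (1 - cm_mu_k) * x_prev)          # eq (17)
    return torch.sigmoid(r) * (cm_W_v @ (torch.relu(k) ** 2))      # eq (18)

# run a toy sequence through the two sub-blocks (residuals/LayerNorms omitted)
xs = torch.randn(5, d)
x_prev_tm = x_prev_cm = torch.zeros(d)
num, den = torch.zeros(d), torch.zeros(d)
for x_t in xs:
    y, num, den = time_mixing(x_t, x_prev_tm, num, den)
    x_prev_tm = x_t
    z = channel_mixing(y, x_prev_cm)
    x_prev_cm = y
```

The per-layer state is just the previous input plus the running numerator and denominator of equation (14), all of size d, which is what later yields the constant-memory time-sequential decoding mode.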
481 | Additionally, token shift is implemented as a sim- 482 | ple offset in the temporal dimension at each block 483 | using PyTorch (Paszke et al., 2019 ) library as 484 | nn.ZeroPad2d((0,0,1,-1)). 485 | 4.3 486 | RNN-like Sequential Decoding 487 | It is common in recurrent networks to use output 488 | at state t as input at state t + 1. This is especially 489 | evident in the autoregressive decoding inference 490 | of a language model, requiring each token to be 491 | computed before fed into the next step, making it 492 | possible for RWKV to take advantage of its RNN- 493 | like structure, referred to as time-sequential mode. 494 | In such circumstances, RWKV can be conveniently 495 | formulated recursively for decoding during infer- 496 | ence, as shown in Appendix B, which leverages 497 | the advantage that each output token is dependent 498 | only on the latest state, which is of constant size, 499 | irrespective of the sequence length. 500 | It then behaves as an RNN decoder, yielding 501 | constant speed and memory footprint with respect 502 | to the sequence length, enabling the processing of 503 | longer sequences more efficiently. In contrast, self- 504 | attention typically requires a KV cache growing 505 | linearly with respect to the sequence length, result- 506 | ing in degraded efficiency and increasing memory 507 | footprint and time as the sequence grows longer. 508 | 4.4 509 | Software Implementation 510 | RWKV is originally implemented using the Py- 511 | torch Deep Learning Library (Paszke et al., 2019) 512 | and a custom CUDA kernel for the WKV com- 513 | putation explained in 4.7. Although RWKV is a 514 | general recurrent network, its current implemen- 515 | tation focuses in the task of language modeling 516 | (RWKV-LM). The model architecture is comprised 517 | of an embedding layer, for which we follow the 518 | setup described in Section 4.7 and several identical 519 | residual blocks applied sequentially as seen in Fig. 520 | 2 and 3 following the principles outlined in Section 521 | 4.6. After the last block, a simple output projec- 522 | tion head composed by a LayerNorm (Ba et al., 523 | 2016) and a linear projection is used to obtain the 524 | logits to be used in the next-token prediction task 525 | and calculate the cross entropy loss during training. 526 | Both the embeddings generated after the last resid- 527 | ual block and the logits could also be used later 528 | for downstream NLP tasks. Training is performed 529 | in time-parallel mode (Section 4.2) while autore- 530 | gressive inference and a potential chat interface4 531 | 4https://github.com/BlinkDL/ChatRWKV 532 | leverage the time-sequential mode (Section 4.3). 533 | 4.5 534 | Gradient Stability and Layer Stacking 535 | The RWKV architecture has been designed as a 536 | fusion of both Transformers and RNNs, offering 537 | the advantage of stable gradients and deeper archi- 538 | tectures of Transformers compared to traditional 539 | RNNs while being efficient in inference. 540 | Previous work has sought to tackle the prob- 541 | lem of gradient stability in RNNs with a variety of 542 | techniques including using non-saturated activation 543 | functions (Chandar et al., 2019), gating mechanism 544 | (Gu et al., 2019), gradient clipping (Pascanu et al., 545 | 2012), and adding constraints (Kanai et al., 2017; 546 | Miller and Hardt, 2018). 
While these techniques 547 | have seen little success, RWKV avoids the problem 548 | inherently by utilizing softmax in conjunction with 549 | RNN-style updates. 550 | The RWKV model features a single-step pro- 551 | cess for updating attention-like scores, which in- 552 | cludes a time-dependent softmax operation that 553 | helps numerical stability and guards against van- 554 | ishing gradients (for rigorous proof, see Appendix 555 | F). Intuitively, this operation ensures the gradient 556 | is propagated along the most relevant path. Layer 557 | normalization (Ba et al., 2016) is another key as- 558 | pect of the architecture which enhances the training 559 | dynamics of deep neural networks by stabilizing 560 | gradients, addressing both vanishing and exploding 561 | gradient issues. 562 | These design elements not only contribute to the 563 | RWKV architecture’s stability and learning capa- 564 | bilities but enable the stacking of multiple layers 565 | in a manner that surpasses the capabilities of any 566 | existing RNN. In doing so , the model is able to cap- 567 | ture more complex patterns across various levels of 568 | abstraction (see also Appendix G). 569 | 4.6 570 | Harnessing Temporal Structure for 571 | Sequential Data Processing 572 | RWKV captures and propagates sequential infor- 573 | mation through the combination of three mecha- 574 | nisms: recurrence, time decay and token shift. 575 | The recurrence in the time-mixing block of 576 | RWKV is the basis for the model’s capacity to 577 | capture intricate relationships between sequence 578 | elements and to propagate locality information 579 | through time. 580 | The time decay mechanism (e−w and eu in equa- 581 | tion 14), maintains sensitivity to the positional re- 582 | lationship between sequence elements. By gradu- 583 | ally diminishing the influence of past information 584 | over time, the model preserves a sense of temporal 585 | locality and progression, which is essential for se- 586 | quential processing. This treatment of positional 587 | information in sequential data exhibits similarities 588 | to the Attention with Linear Biases (ALiBi) model 589 | (Press et al., 2022), where the linear biases facili- 590 | tate input length extrapolation. In this context, the 591 | RWKV architecture can be perceived as a trainable 592 | version of ALiBi, seamlessly incorporating posi- 593 | tional information without the necessity for explicit 594 | encoding. It can also be seen as an extension of the 595 | gated convolution introduced in Zhai et al. (2021) 596 | to the full sequence length until a given step. 597 | The token shift or time-shift mixing, or (diag- 598 | onal arrows in Figure 3), also contributes to the 599 | model’s adaptation to sequential data. By linearly 600 | interpolating between the current input and the pre- 601 | vious time step input, the model naturally aggre- 602 | gates and gates information in the input channels. 603 | The overall structure of time-shift mixing bears 604 | resemblance to the causal convolution with no dila- 605 | tions in WaveNet (van den Oord et al., 2016), which 606 | is a classical architecture used for forecasting time 607 | series data. 608 | 4.7 609 | Additional Optimizations 610 | Custom Kernels 611 | To address inefficiencies in the 612 | WKV computation due to the sequential nature of 613 | the task when using standard deep learning frame- 614 | works, we implement a custom CUDA kernel so 615 | as to launch a single compute kernel in training ac- 616 | celerators. 
All other parts of the model are matrix 617 | multiplications and point-wise operations that can 618 | already be efficiently parallelized. 619 | FFN with R gate 620 | Prior research (Tolstikhin et al., 621 | 2021; Liu et al., 2021; Yu et al., 2022) sug- 622 | gests that self-attention may not be as essential 623 | in Transformer-based vision tasks as previously 624 | thought. Although it provided us with some in- 625 | sights, replacing self-attention entirely in natural 626 | language tasks could be too drastic. In our study, 627 | we partially dismantle the attention mechanism by 628 | replacing the fixed QKV formula with KV and in- 629 | troducing a new time-decaying factor W. This ap- 630 | proach enables us to incorporate token and channel- 631 | mixing components akin to MLP-mixer (Tolstikhin 632 | et al., 2021) and a gating unit R similar to gMLP 633 | (Liu et al., 2021), which enhance the performance 634 | of our RWKV model. 635 | Small Init Embedding 636 | During the initial stage 637 | of training a transformer model (Vaswani et al., 638 | 2017), we observe that the embedding matrix un- 639 | dergoes slow changes, which pose a challenge for 640 | the model to deviate from its initial noisy embed- 641 | ding state. To mitigate this issue, we propose an 642 | approach that involves initializing the embedding 643 | matrix with small values and subsequently apply- 644 | ing an additional LayerNorm operation. By imple- 645 | menting this technique, we accelerate and stabilize 646 | the training process, enabling the training of deep 647 | architectures with post-LN components. The effec- 648 | tiveness of this approach is demonstrated in Figure 649 | 8, where it is shown to facilitate improved conver- 650 | gence by allowing the model to quickly transition 651 | away from the initially small embedding. This is 652 | achieved through small changes following a single 653 | step, which in turn lead to substantial alterations 654 | in directions and subsequently significant changes 655 | after the LayerNorm operation. 656 | Custom Initialization 657 | Building on principles 658 | from previous works (He et al., 2016; Jumper et al., 659 | 2021), we initialize parameters to values as similar 660 | as possible to an identity mapping while break- 661 | ing symmetry so there is a clean information path. 662 | Most weights are initialized to zero. No biases are 663 | used for linear layers. Specific formulas are given 664 | in Appendix D. We find the choice of initialization 665 | to be significant in convergence speed and quality 666 | (see Appendix E). 667 | 5 668 | Evaluations 669 | In this section, we focus on evaluating to answer 670 | the following questions: 671 | • RQ1: 672 | Is 673 | RWKV 674 | competitive 675 | against 676 | quadratic transformer architectures with equal 677 | number of parameters and training tokens? 678 | • RQ2: When increasing the number of param- 679 | eters, does RW KV remain competitive against 680 | quadratic transformer architectures? 681 | • RQ3: Does increasing parameters of RWKV 682 | yield better language modeling loss, when 683 | RWKV models are trained for context lengths 684 | that most open-sourced quadratic transform- 685 | ers cannot efficiently process? 686 | Addressing RQ1 and RQ2, from Fig. 
4, we 687 | can see that RWKV is very competitive on six 688 | benchmarks (Winogrande, PIQA, ARC-C, ARC-E, 689 | LAMBADA, and SciQ) against major open source 690 | quadratic complexity transformer models: Pythia 691 | (Biderman et al., 2023), OPT (Zhang et al., 2022) 692 | and BLOOM (Scao et al., 2022). RWKV even out- 693 | performs Pythia and GPT-Neo (Black et al., 2022) 694 | in four tasks: PIQA, OBQA, ARC-E, and COPA 695 | (See details in Appendix H). For RQ3, Fig. 5 shows 696 | that increasing context length leads to lower test 697 | loss on the Pile, an indication that RW KV can make 698 | effective use of long contextual information. 699 | 6 700 | Inference Experiments 701 | We benchmark inference requirements according 702 | to size and family. Specifically, we evaluate text 703 | generation speed and memory requirements on a 704 | typical compute platforms including CPU (x86) 705 | and GPU (NVIDIA A100 80GB). For all our ex- 706 | periments we use float32 precision. We include 707 | all model parameters in parameter count, including 708 | both embedding and non-embedding layers. Per- 709 | formance under different quantization setups is left 710 | to further work. See Appendix I for more results. 711 | Figure 6: Cumulative time during text generation for 712 | different LLMs. 713 | Additionally, we carried out comparative studies 714 | on RWKV-4 and ChatGPT / GPT-4, see Appendix 715 | J. They revealed that RWKV-4 is very sensitive to 716 | prompt engineering. When the prompts were ad- 717 | justed from the ones used for GPT to more suitable 718 | for RWKV, the F1-measure performance increased 719 | even from 44.2% to 74.8%. 720 | 7 721 | Future Work 722 | There are several promising directions for future 723 | work on the RWKV architecture: 724 | • Increasing model expressivity with enhanced 725 | time-decay formulations and exploring initial 726 | model states while maintaining efficiency. 727 | (a) Winogrande 728 | (b) PIQA 729 | (c) ARC-Challenge 730 | (d) ARC-Easy 731 | (e) LAMBADA 732 | (f) SciQ 733 | Figure 4: Zero-Shot Performance: The horizontal axis is a number of parameters and the vertical axis is accuracy. 734 | 21 735 | 23 736 | 25 737 | 27 738 | 29 739 | 211 740 | Context Length 741 | 21 742 | 22 743 | Pile test loss 744 | 7B 8k 745 | 14B 8k 746 | Figure 5: Increasing context length contributes to lower 747 | test loss on the Pile (Gao et al., 2020). 748 | • Further improving RWKV computational ef- 749 | ficiency by applying parallel scan in the 750 | wkvt step to reduce the computational cost 751 | to O(B log(T)d). 752 | • Investigating the application of RWKV to 753 | encoder-decoder architectures and potential 754 | replacement of cross-attention mechanism. 755 | This could have applicability seq2seq or multi- 756 | modal settings, enhancing efficiency both in 757 | training and inference. 758 | • Leveraging RWKV’s state (or context) for in- 759 | terpretability, predictability in sequence data 760 | and safety. 761 | Manipulating the hidden state 762 | could also guide behavior and allow greater 763 | customizability through prompt tuning. 764 | • Exploring fine-tuned models in specific set- 765 | tings for enhanced interaction with humans 766 | (Ouyang et al., 2022). Particularly interest- 767 | ing would be the performance under different 768 | datasets and specific use cases. 
769 | • Adapting 770 | parameter-efficient 771 | fine-tuning 772 | methods such as LoRA (Hu et al., 2022 ) 773 | and characterizing behavior under different 774 | quantization 775 | schemes 776 | for 777 | the 778 | proposed 779 | architecture 780 | 8 781 | Conclusions 782 | We introduced RWKV, a new approach to RNN 783 | models exploiting the potential of time-based mix- 784 | ing components. RWKV introduces several key 785 | strategies which allow it to capture locality and 786 | long-range dependencies, while addressing limi- 787 | tations of current architectures by: (1) replacing 788 | the quadratic QK attention by a scalar formulation 789 | with linear cost, (2) reformulating recurrence and 790 | sequential inductive biases to unlock efficient train- 791 | ing parallelization and efficient inference, and (3) 792 | enhancing training dynamics using custom initial- 793 | izations. 794 | We benchmark the proposed architecture in a 795 | wide variety of NLP tasks and show comparable 796 | performance to SoTA with reduced cost. Further 797 | experiments on expressivity, interpretability, and 798 | scaling showcase the model capabilities and draw 799 | parallels in behavior between RWKV and other 800 | LLMs. 801 | RWKV opens a new door to scalable and ef- 802 | ficient architectures to model complex relation- 803 | ships in sequential data. While many alternatives 804 | to Transformers have been proposed with similar 805 | claims, ours is the first to back up those claims with 806 | pretrained models with tens of billions of parame- 807 | ters. 808 | 9 809 | Limitations 810 | While our proposed RWKV model has demon- 811 | strated promising results regarding training and 812 | memory efficiency during inference, some limita- 813 | tions should be acknowledged and addressed in 814 | future work. First, the linear attention of RWKV 815 | leads to significant efficiency gains but still, it may 816 | also limit the model’s performance on tasks that 817 | require recalling minutiae information over very 818 | long contexts. This is due to the funneling of in- 819 | formation through a single vector representation 820 | over many time steps, compared with the full in- 821 | formation maintained by the quadratic attention of 822 | standard Transformers. In other words, the model’s 823 | recurrent architecture inherently limits its ability to 824 | “look back” at previous tokens, as opposed to tra- 825 | ditional self-attention mechanisms. While learned 826 | time decay helps prevent the loss of information , 827 | it is mechanistically limited compared to full self- 828 | attention. 829 | Another limitation of this work is the increased 830 | importance of prompt engineering in comparison to 831 | standard Transformer models. The linear attention 832 | mechanism used in RWKV limits the information 833 | from the prompt that will be carried over to the 834 | model’s continuation. As a result, carefully de- 835 | signed prompts may be even more crucial for the 836 | model to perform well on tasks. 837 | Acknowledgements 838 | We acknowledge EleutherAI and StabilityAI for 839 | compute access and technical support in develop- 840 | ment of RWKV. We also acknowledge the mem- 841 | bers of the RWKV Discord server for their help 842 | and work on further extending the applicability of 843 | RWKV to different domains. Finally, we thank 844 | Stella Biderman for feedback on the paper. 
845 | References 846 | Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor 847 | Pryor, Luke Yoffe, Deepak Ramachandran, Lise 848 | Getoor, Jay Pujara, and William Yang Wang. 2022. 849 | FETA: A benchmark for few-sample task transfer 850 | in open-domain dialogue . 851 | In Proceedings of the 852 | 2022 Conference on Empirical Methods in Natu- 853 | ral Language Processing, pages 10936–10953, Abu 854 | Dhabi, United Arab Emirates. Association for Com- 855 | putational Linguistics. 856 | Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin- 857 | ton. 2016. Layer normalization. 858 | Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. 859 | An empirical evaluation of generic convolutional 860 | and recurrent networks for sequence modeling. 861 | Francesco Barbieri, Jose Camacho-Coll ados, Luis Es- 862 | pinosa Anke, and Leonardo Neves. 2020. TweetE- 863 | val: Unified benchmark and comparative evaluation 864 | for tweet classification. In Findings of the Associ- 865 | ation for Computational Linguistics: EMNLP 2020, 866 | pages 1644–1650, Online. Association for Computa- 867 | tional Linguistics. 868 | Iz Beltagy, Matthew E. Peters, and Arman Cohan. 869 | 2020. Longformer: The long-document transformer. 870 | arXiv:2004.05150. 871 | Stella Biderman, Hailey Schoelkopf, Quentin An- 872 | thony, Herbie Bradley, Kyle O’Brien, Eric Halla- 873 | han, Mohammad Aflah Khan, Shivanshu Purohit, 874 | USVSN Sai Prashanth, Edward Raff, et al. 2023. 875 | Pythia: A suite for analyzing large language mod- 876 | els across training and scaling. 877 | arXiv preprint 878 | arXiv:2304.01373. 879 | Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jian- 880 | feng Gao, and Yejin Choi. 2020. Piqa: Reasoning 881 | about physical commonsense in natural language. In 882 | Thirty-Fourth AAAI Conference on Artificial Intelli- 883 | gence. 884 | Sid Black, Leo Gao, Phil Wang, Connor Leahy, and 885 | Stella Biderman. 2022. Gpt-neo: Large scale autore- 886 | gressive language modeling with mesh-tensorflow, 887 | 2021. 888 | URL: 889 | https://doi. 890 | org/10.5281/zenodo, 891 | 5297715. 892 | James Bradbury, Stephen Merity, Caiming Xiong, and 893 | Richard Socher. 2017. Quasi-recurrent neural net- 894 | works. In ICLR. 895 | Tom Brown, Benjamin Mann, Nick Ryder, Melanie 896 | Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind 897 | Neelakantan, Pranav Shyam, Girish Sastry, Amanda 898 | Askell, et al. 2020. Language models are few-shot 899 | learners. Advances in neural information processing 900 | systems, 33:1877–1901. 901 | Ayd ar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. 902 | 2023. Scaling transformer to 1m tokens and beyond 903 | with rmt. 904 | Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. 905 | 2022. Recurrent memory transformer. Advances in 906 | Neural Information Processing Systems, 35:11079– 907 | 11091. 908 | A. P. Sarath Chandar, Chinnadhurai Sankar, Eugene 909 | Vorontsov, Samira Ebrahimi Kahou, and Yoshua 910 | Bengio. 2019. 911 | Towards non-saturating recurrent 912 | units for modelling long-term dependencies. 913 | In 914 | AAAI Conference on Artificial Intelligence. 915 | Krzysztof Choromanski, Valerii Likhosherstov, David 916 | Dohan, Xingyou Song, Andreea Gane, Tamas Sar- 917 | los, Peter Hawkins, Jared Davis, Af roz Mohiuddin, 918 | Lukasz Kaiser, David Belanger, Lucy Colwell, and 919 | Adrian Weller. 2020. Rethinking attention with per- 920 | formers. 921 | Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, 922 | and Yoshua Bengio. 2014. 
Empirical evaluation of 923 | gated recurrent neural networks on sequence model- 924 | ing. In NIPS 2014 Deep Learning and Representa- 925 | tion Learning Workshop. 926 | Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, 927 | Ashish Sabhar wal, Carissa Schoenick, and Oyvind 928 | Tafjord. 2018. Think you have solved question an- 929 | swering? try arc, the ai2 reasoning challenge. In 930 | arXiv:1803.05457. 931 | Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Ja- 932 | cob Hilton, Reiichiro Nakano, Christopher Hesse, 933 | and John Schulman. 2021. 934 | Training verifiers to 935 | solve math word problems. 936 | In arXiv, volume 937 | abs/2110.14168. 938 | Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and 939 | Christopher Re. 2022a. 940 | Flashattention: Fast and 941 | memory-efficient exact attention with IO-awareness. 942 | In Advances in Neural Information Processing Sys- 943 | tems. 944 | Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W 945 | Thomas, Atri Rudra, and Christopher Ré. 2022b. 946 | Hungry hungry hippos: 947 | Towards language mod- 948 | eling with state space models. 949 | arXiv preprint 950 | arXiv:22 12.14052. 951 | Dorottya Demszky, Dana Movshovitz-Attias, Jeong- 952 | woo Ko, Alan S. Cowen, Gaurav Nemade, and Su- 953 | jith Ravi. 2020. 954 | Goemotions: A dataset of fine- 955 | grained emotions. In Proceedings of the 58th An- 956 | nual Meeting of the Association for Computational 957 | Linguistics, ACL 2020, Online, July 5-10, 2020, 958 | pages 4040–4054. Association for Computational 959 | Linguistics. 960 | Leo Gao, Stella Biderman, Sid Black, Laurence Gold- 961 | ing, Travis Hoppe, Charles Foster, Jason Phang, Ho- 962 | race He, Anish Thite, Noa Nabeshima, et al. 2020. 963 | The pile: An 800gb dataset of diverse text for lan- 964 | guage modeling. arXiv preprint arXiv:2101.00027. 965 | Albert Gu, Karan Goel, and Christopher Ré. 2022. Ef- 966 | ficiently modeling long sequences with structured 967 | state spaces. 968 | In The International Conference on 969 | Learning Representations (ICLR). 970 | Albert 971 | Gu, 972 | Çaglar 973 | Gülçehre, 974 | Tom 975 | Le 976 | Paine, 977 | Matthew 978 | W. 979 | Hoffman, 980 | and 981 | Razvan 982 | Pascanu. 983 | 2019. Improving the gating mechanism of recurrent 984 | neural networks. ArXiv, abs/1910.09890. 985 | Mandy Guo, Joshua Ainslie, David C Uthus, Santi- 986 | ago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yin- 987 | fei Yang. 2022. Longt5 : Efficient text-to-text trans- 988 | former for long sequences. In Findings of the Associ- 989 | ation for Computational Linguistics: NAACL 2022, 990 | pages 724–736. 991 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian 992 | Sun. 2016. Identity mappings in deep residual net- 993 | works. 994 | Dan Hendrycks, Collin Burns, Steven Basart, Andy 995 | Zou, Mantas Mazeika, Dawn Song, and Jacob Stein- 996 | hardt. 2021. 997 | Measuring massive multitask lan- 998 | gu age understanding. In International Conference 999 | on Learning Representations. 1000 | Sepp Hochreiter. 1998. The vanishing gradient prob- 1001 | lem during learning recurrent neural nets and prob- 1002 | lem solutions. 1003 | International Journal of Uncer- 1004 | tainty, Fuzziness and Knowledge-Based Systems, 1005 | 6(02):107–116. 1006 | Sepp Hochreiter and Jürgen Schmidhuber. 1997. 1007 | Long short-term memory. 1008 | Neural Computation, 1009 | 9(8):1735–1780. 
1010 | Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, 1011 | El ena Buchatskaya, Trevor Cai, Eliza Rutherford, 1012 | Diego de Las Casas, Lisa Anne Hendricks, Johannes 1013 | Welbl, Aidan Clark, Tom Hennigan, Eric Noland, 1014 | Katie Millican, George van den Driessche, Bogdan 1015 | Damoc, Aurelia Guy, Simon Osindero, Karen Si- 1016 | monyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, 1017 | and Laurent Sifre. 2022. Training compute-optimal 1018 | large language models. 1019 | Edward J Hu, yelong shen, Phillip Wallis, Zeyuan 1020 | Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and 1021 | Weizhu Chen. 2022. LoRA: Low-rank adaptation of 1022 | large language models. In International Conference 1023 | on Learning Representations. 1024 | Hassan Ismail Fawaz, Germain Forestier, Jonathan We- 1025 | ber, Lhassane Idoumghar, and Pierre-Alain Muller. 1026 | 2019. Deep learning for time series classification: 1027 | a review. 1028 | Data mining and knowledge discovery, 1029 | 33(4):917–963. 1030 | Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol 1031 | Vinyals, Andrew Zisserman, and Joao Carreira. 1032 | 2021. Perceiver: General perception with iterative 1033 | attention. In International conference on machine 1034 | learning, pages 4651–4664. PMLR. 1035 | Hanhwi Jang, Joonsung Kim, Jae-Eon Jo, Jaewon Lee, 1036 | and Jangwoo Kim. 2019. 1037 | Mnnfast: A fast and 1038 | scalable system architecture for memory-augmented 1039 | neural networks. In Proceedings of the 46 th Interna- 1040 | tional Symposium on Computer Architecture, pages 1041 | 250–263. 1042 | Matt Gardner Johannes Welbl Nelson F. Liu. 2017. 1043 | Crowdsourcing multiple choice science questions. 1044 | In DOI:10.18653/v1/W17-4413. 1045 | Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke 1046 | Zettlemoyer. 2017. Triviaqa: A large scale distantly 1047 | supervised challenge dataset for reading comprehen- 1048 | sion. In ACL. 1049 | John Jumper, 1050 | Richard Evans, 1051 | Alexander Pritzel, 1052 | Tim Green, Michael Figurnov, Olaf Ronneberger, 1053 | Kathryn Tunyasuvunakool, Russ Bates, Augustin 1054 | Žídek, Anna Potapenko, and et al. 2021. 1055 | Highly 1056 | accurate protein structure prediction with al phafold. 1057 | Nature, 596(7873):583–589. 1058 | Sekitoshi Kanai, Yasuhiro Fujiwara, and Sotetsu Iwa- 1059 | mura. 2017. Preventing gradient explosions in gated 1060 | recurrent units. In NIPS. 1061 | Jared Kaplan, 1062 | Sam McCandlish, 1063 | Tom Henighan, 1064 | Tom B Brown, Benjamin Chess, Rewon Child, Scott 1065 | Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 1066 | 2020. 1067 | Scaling laws for neural language models. 1068 | arXiv preprint arX iv:2001.08361. 1069 | Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- 1070 | pas, and François Fleuret. 2020. Transformers are 1071 | rnns: Fast autoregressive transformers with linear 1072 | attention. In International Conference on Machine 1073 | Learning, pages 5156–5165. PMLR. 1074 | Nikita Kitaev, 1075 | L. Kaiser, 1076 | and Anselm Levskaya. 1077 | 2020. Reformer: The efficient transformer. ArXiv, 1078 | abs/2001.04451. 1079 | Jan Koco´n, Igor Cichecki, Oliwier Kaszyca, Mateusz 1080 | Kochanek, Dominika Szydło, Joanna Baran, Julita 1081 | Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil 1082 | Kanclerz, Anna Koco´n, Bartłomiej Koptyra, Wik- 1083 | toria Mieleszczenko-Kowszewicz, Piotr Miłkowski, 1084 | Marcin Oleksy, Maciej Piasecki, Łukasz Radli´nski, 1085 | K onrad Wojtasik, Stanisław Wo´zniak, and Prze- 1086 | mysław Kazienko. 2023. 
Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocoń, Bartłomiej Koptyra, Wiktoria Mieleszczenko-Kowszewicz, Piotr Miłkowski, Marcin Oleksy, Maciej Piasecki, Łukasz Radliński, Konrad Wojtasik, Stanisław Woźniak, and Przemysław Kazienko. 2023. ChatGPT: Jack of all trades, master of none.
Jan Kocoń, Piotr Miłkowski, and Monika Zaśko-Zielińska. 2019. Multi-level sentiment analysis of PolEmo 2.0: Extended corpus of multi-domain consumer reviews. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 980–991.
Phong Le and Willem Zuidema. 2016. Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive LSTMs. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 87–93.
Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, Brussels, Belgium. Association for Computational Linguistics.
Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. 2021. Pay attention to MLPs.
Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453.
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 2023. Mega: Moving average equipped gated attention. In ICLR.
Eric Martin and Chris Cundy. 2017. Parallelizing linear recurrent neural nets over sequence length. ArXiv, abs/1709.04057.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.
John Miller and Moritz Hardt. 2018. Stable recurrent models. arXiv: Learning.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849.
OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.
OpenAI. 2023. GPT-4 technical report.
Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. 2023. Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany. Association for Computational Linguistics.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library.
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. 2023. Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866.
Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
Ilan Price, Jordan Gifford-Moore, Jory Flemming, Saul Musker, Maayan Roichman, Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. 2020. Six attributes of unhealthy conversations. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 114–124, Online. Association for Computational Linguistics.
Markus N. Rabe and Charles Staats. 2022. Self-attention does not need O(n²) memory.
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2018. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Ramsha Siddiqui. 2019. SARCASMANIA: Sarcasm Exposed! http://www.kaggle.com/rmsharks4/sarcasmania-dataset. [Online; accessed 02-February-2023].
David R. So, Wojciech Manke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. 2021. Primer: Searching for efficient transformers for language modeling. CoRR, abs/2109.08668.
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.
Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. 2021. MLP-Mixer: An all-MLP architecture for vision. CoRR, abs/2105.01601.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. ArXiv, abs/1609.03499.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
David Vilares and Carlos Gómez-Rodríguez. 2019. HEAD-QA: A healthcare dataset for complex reasoning. In ACL.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity.
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24.
Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 1391–1399. ACM.
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. MetaFormer is actually what you need for vision.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In ACL.
Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. 2021. An attention free transformer.
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv:1810.12885.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

A Author Contributions

Bo Peng: Original RWKV idea, original code, performance optimizations, original experiments, and trained RWKV models from 0.1B to 14B.
Eric Alcaide: Manuscript (initial draft sections 1, 2; sections 4, 7 and 8; revision and proofreading; final version). Figures (2, 3, 4, 7). Experiments section 6. Appendices D, I. Contributions to Appendix K.
Quentin Anthony: Led writing the paper. Manuscript (initial draft sections 1, 2, 3; revision and proofreading; final version).
Zhenyuan Zhang: Manuscript (revision and proofreading). Figure 3. Experiments Appendix G. Contributions to Appendices B and K.
Kranthi Kiran GV: Manuscript (sections 2 and 5; contributions to section 3; revision and proofreading). Tables 3 and 4. Appendix C.
Xiangru Tang: Manuscript (sections 2 and 3; contributions to abstract; revision and proofreading). Contributions to Appendix K.
Matteo Grella: Manuscript (sections 4.5, 4.6, 8; contributions to sections 1, 7 and 9; proofreading and revision). Contributions to Appendix B.
Ferdinand Mom: Manuscript (contributions to sections 1, 2, 4.3, 4.6; proofreading and revision). Contributions to Appendix B.
Atsushi Saito: Manuscript (sections 3 and 5; contributions to section 2). Figures 1a, 1b, 1c. Contributions to Appendix H.
Krishna Sri Ipsit Mantri: Figure 4.
Rui-Jie Zhu: Tables 1 and 5. Experiments for Table 5.
Peng Zhou: Contributions to Table 5.
Qihang Zhao: Manuscript (proofreading and revision). Contributions to Table 5.
Xuzheng He: Manuscript (contributions to section 3; proofreading and revision). Contributions to Figures 1, 7. Appendix G. Contributions to Appendix F.
Hayden Lau: Manuscript (contributions to section 1; proofreading and revision). Contributions to Appendix K.
Michael Chung: Manuscript (contributions to section 4.6; proofreading and revision).
Haowen Hou: Figure 8. Appendix E.
Jiaming Kong: Manuscript (revision and proofreading). Appendix F.
Johan S. Wind: RWKV performance optimizations (CUDA). Contributions to Appendix C.
Jian Zhu: Manuscript (section 2; proofreading and revision). Figures 3 and 5.
Huanqi Cao: Manuscript (contributions to 4.2 and 4.3; proofreading and revision). Experiments for Appendix G.
Samuel Arcadinho: Contributions to Figures 6, 10, and 11. Contributions to Appendix I.
Xin Cheng: Manuscript (proofreading and revision). Contributions to Appendices K, H.
Alon Albalak: Manuscript (abstract and sections 1, 9; proofreading and revision).
Jan Kocon: Manuscript (section 1; proofreading and revision). Contributions to Appendix J.
Przemysław Kazienko: Manuscript (section 6; proofreading and revision). Contributions to Appendix J.
Ruichong Zhang: Manuscript (proofreading and revision). Contributions to Figure 5 and Appendix K.
Stanisław Woźniak: Appendix J.
Bartłomiej Koptyra: Contributions to Appendix J.

B Time-Mixing Block as an RNN Cell

As stated in 4.3, the RWKV time-mixing block can be formulated as an RNN, as the WKV computation can be written in the following recursive form:

a_0, b_0 = 0,                                                 (19)
wkv_t = (a_{t-1} + e^{u + k_t} v_t) / (b_{t-1} + e^{u + k_t}),  (20)
a_t = e^{-w} a_{t-1} + e^{k_t} v_t,                            (21)
b_t = e^{-w} b_{t-1} + e^{k_t}.                                (22)

The dataflow of the RNN-like time-mixing is shown in Fig. 7, where the hidden state h is the numerator-denominator tuple (a, b).

Figure 7: RWKV time-mixing block formulated as an RNN cell. Color codes: yellow (µ) denotes the token shift, red (1) denotes the denominator, blue (2) denotes the numerator, pink (3) denotes the fraction computations in 14. h denotes the numerator-denominator tuple (a, b).

To avoid overflow when calculating e^{k_t}, a numerical trick is used in the official implementation. Note that

a_1 = e^{-w} a_0 + e^{k_0} v_0 = e^{k_0} v_0,   (23)
b_1 = e^{-w} b_0 + e^{k_0} = e^{k_0},           (24)

and we set a'_1 = v_0, b'_1 = 1, p_0 = k_0, where p_{t-1} stores the shared exponent of a_t and b_t. The recursion above can then be converted into a numerically safe version; for each time step t > 1:

q := max(p_{t-1}, u + k_t),                                  (25)
a*_t = e^{p_{t-1} - q} a'_{t-1} + e^{u + k_t - q} v_t,        (26)
b*_t = e^{p_{t-1} - q} b'_{t-1} + e^{u + k_t - q},            (27)
wkv_t = a*_t / b*_t.                                          (28)

The update to a'_t, b'_t and their shared exponent is carried out in a similar fashion:

q := max(p_{t-1} - w, k_t),                                   (29)
a'_t = e^{p_{t-1} - w - q} a'_{t-1} + e^{k_t - q} v_t,        (30)
b'_t = e^{p_{t-1} - w - q} b'_{t-1} + e^{k_t - q},            (31)
p_t = q.                                                      (32)
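To make the recursion concrete, below is a small NumPy sketch of the numerically stable WKV recursion in equations (25)-(32). It is an illustrative reading of the formulas, not the official CUDA/PyTorch implementation; the array shapes and the smoke test at the end are assumptions made for the example.

    import numpy as np

    def wkv_stable(k, v, w, u):
        """Numerically stable WKV recursion (cf. equations 25-32).

        k, v: arrays of shape (T, C) holding per-token keys and values.
        w, u: arrays of shape (C,) holding the time-decay and bonus parameters.
        Returns wkv of shape (T, C).
        """
        T, C = k.shape
        # State: a', b' and their shared exponent p (cf. equations 23-24).
        a = np.zeros(C)
        b = np.zeros(C)
        p = np.full(C, -np.inf)          # e^{-inf} = 0, so the state starts empty
        out = np.zeros((T, C))
        for t in range(T):
            # Output for the current token (equations 25-28).
            q = np.maximum(p, u + k[t])
            a_star = np.exp(p - q) * a + np.exp(u + k[t] - q) * v[t]
            b_star = np.exp(p - q) * b + np.exp(u + k[t] - q)
            out[t] = a_star / b_star
            # State update with the shared exponent (equations 29-32).
            q = np.maximum(p - w, k[t])
            a = np.exp(p - w - q) * a + np.exp(k[t] - q) * v[t]
            b = np.exp(p - w - q) * b + np.exp(k[t] - q)
            p = q
        return out

    # Tiny smoke test with random inputs.
    rng = np.random.default_rng(0)
    T, C = 8, 4
    print(wkv_stable(rng.normal(size=(T, C)), rng.normal(size=(T, C)),
                     np.exp(rng.normal(size=C)), rng.normal(size=C)).shape)

Initializing the shared exponent to negative infinity reproduces the starting condition a_0 = b_0 = 0 without a special case for the first token.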
C Parameter and FLOP Count for the RWKV Models

The following section provides an overview of the different RWKV model architectures along with their respective parameter and FLOP counts in Table 2.

Name  | Layers | Model Dimension | Parameters    | FLOPs per token
169 M | 12     | 768             | 1.693 × 10^8  | 2.613 × 10^8
430 M | 24     | 1024            | 4.304 × 10^8  | 7.573 × 10^8
1.5 B | 24     | 2048            | 1.515 × 10^9  | 2.823 × 10^9
3 B   | 32     | 2560            | 2.985 × 10^9  | 5.710 × 10^9
7 B   | 32     | 4096            | 7.393 × 10^9  | 1.437 × 10^10
14 B  | 40     | 5120            | 1.415 × 10^10 | 2.778 × 10^10

Table 2: RWKV model architectures and associated FLOP counts.

The number of parameters for each model is computed using the formula #parameters = 2VD + 13D²L + D(11L + 4), where V = 50277 is the vocabulary size, D represents the model dimension and L corresponds to the number of layers.

FLOPs are for a forward pass on one token, calculated as 2(VD + 13D²L), which is twice (one add and one multiply) the number of parameters in linear layers. The backward-pass FLOPs can be approximated as twice those of the forward pass, so the total is 6(VD + 13D²L) per token for training (3x forward FLOPs). It is noteworthy that FLOPs are independent of the context length, unlike regular transformers. The FLOP approximations in this paper are in line with the methodology used by Kaplan et al. (2020).

Alternative approximations for FLOPs include doubling the parameter count, which yields similar results: within 2% for the 14B model, but with a 30% discrepancy for the 169M variant. Another approximation is based on the number of non-embedding parameters multiplied by 2. This gives 2(VD + 13D²L + D(11L + 4)), resulting in 1.6% more FLOPs for the 14B model and 8% more FLOPs for the 169M model.
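As a quick sanity check on these formulas, the short script below recomputes the parameter and FLOP counts for the configurations in Table 2. It is an illustrative sketch based only on the formulas above, not code taken from the RWKV repository.

    # Recompute Table 2 from the formulas above (illustrative sketch).
    V = 50277  # vocabulary size

    def rwkv_param_count(D, L):
        # #parameters = 2VD + 13D^2 L + D(11L + 4)
        return 2 * V * D + 13 * D * D * L + D * (11 * L + 4)

    def rwkv_forward_flops(D, L):
        # 2(VD + 13D^2 L): one multiply and one add per linear-layer weight
        return 2 * (V * D + 13 * D * D * L)

    configs = {"169M": (768, 12), "430M": (1024, 24), "1.5B": (2048, 24),
               "3B": (2560, 32), "7B": (4096, 32), "14B": (5120, 40)}

    for name, (D, L) in configs.items():
        params = rwkv_param_count(D, L)
        fwd = rwkv_forward_flops(D, L)
        print(f"{name:>5}: params={params:.3e}  fwd FLOPs/token={fwd:.3e}  "
              f"train FLOPs/token={3 * fwd:.3e}")

Running it reproduces, for example, 1.693e8 parameters and 2.613e8 forward FLOPs per token for the 169M configuration, matching the first row of Table 2.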
D Parameter initializations

We describe the specific parameter initializations below and motivate the design choices. Parameters belonging to residual blocks are often adjusted by layer depth and total number of layers. Let V denote the vocabulary size, s the embedding dimension, d the hidden size (we use d = 4s), L the number of layers, and l the layer index (from 0 to L−1). We use the following initializations:

• Embeddings are initialized to U(±1e-4), as explained in 4.7.
• For the channel-mixing blocks (11), µ_{k,i} and µ_{r,i} are initialized to (i/s)^{1 − l/L}.
• For the time-mixing blocks (16), the initializations are µ_{k,i} = (i/s)^{1 − l/L}, µ_{v,i} = (i/s)^{1 − l/L} + 0.3 l/(L−1), and µ_{r,i} = 0.5 (i/s)^{1 − l/L}.
• w_i (14), also known as "time decay", is initialized to −5 + 8 · (i/(d−1))^{0.7 + 1.3 l/(L−1)}. Intuitively, it is the discount factor applied to previous tokens over time.
• u_i (14), also known as "bonus", is set to 0.5 (((i + 1) mod 3) − 1) + log 0.3. It is the special weighting applied to the current token in equation 14. The alternating zigzag pattern initially creates subtle variations in the tensor elements, which are intended to help the model treat different dimensions of the embedding distinctively.
• W_o (15) (time-mixing) and W_v (channel-mixing) are initialized to N(0, sqrt(d/s) = 2).
• All W_r, W_k, W_v weights are initialized to 0 so the model can start learning from the beginning without noisy signals.
• All LayerNorm weights start from 1 and biases from 0.
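Since the initializations above are closed-form expressions, they are easy to reproduce. The NumPy sketch below evaluates them for one layer; it is an illustrative reading of the bullets rather than the official initialization code, and the vector lengths chosen here (s for the µ vectors, d for w and u) are assumptions made for the example.

    import numpy as np

    def mu_init(s, L, l):
        """Token-shift mixing coefficients for layer l (length s here)."""
        i = np.arange(s)
        base = (i / s) ** (1 - l / L)
        mu_k = base
        mu_v = base + 0.3 * l / (L - 1)
        mu_r = 0.5 * base
        return mu_k, mu_v, mu_r

    def time_decay_init(d, L, l):
        """'Time decay' w_i for layer l: discount applied to previous tokens."""
        i = np.arange(d)
        return -5 + 8 * (i / (d - 1)) ** (0.7 + 1.3 * l / (L - 1))

    def bonus_init(d):
        """'Bonus' u_i: zigzag pattern plus a log(0.3) offset."""
        i = np.arange(d)
        return 0.5 * (((i + 1) % 3) - 1) + np.log(0.3)

    # Example: layer 3 of a 12-layer model with embedding dimension 768.
    mu_k, mu_v, mu_r = mu_init(s=768, L=12, l=3)
    w = time_decay_init(d=4 * 768, L=12, l=3)
    u = bonus_init(d=4 * 768)
    print(mu_k.shape, w[:4], u[:6])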
E Small Init Embedding

This section presents experimental validation of the small initialization of the embedding. The experimental setup is as follows. In the baseline configuration, the parameters are initialized using a normal distribution with a mean of 0.0 and a standard deviation of 0.02, which is a commonly used initialization method in models like BERT and GPT. In the small initialization of the embedding (small init emb) experiment, the parameters are initialized using a uniform distribution with a range of 1e-4, which is slightly different from RWKV, where a normal distribution with a standard deviation of 1e-4 is used. However, this difference is negligible and does not affect our conclusions. The experiments were conducted with a batch size of 400. As depicted in Figure 8, the loss curve for the small init emb exhibits a faster rate of decrease and convergence compared to the traditional initialization using a normal distribution.

Figure 8: Effect of small initialization embedding (training loss vs. step for the Baseline and Small Init Emb runs).

F Gradient Stability in RWKV

In this section, we present a mathematical description of the gradient stability property in RWKV, focusing specifically on the time-mixing block. By gradient stability we mean that if the inputs x_t are bounded and the model parameters are fixed, then the gradients with respect to W_k and W_v are uniformly bounded for all T (thus not exploding). Consequently, we can control the amount each x_t contributes to the gradient at T in a naturally decaying fashion by the weight decay mechanism w (thus not vanishing unless desired).

First, we make the simplification that there are no token shifts; this will not affect the final conclusion. In this scenario, wkv_T can be written as

wkv_T = (Σ_{t=1}^{T} K^e_t v_t) / (Σ_{t=1}^{T} K^e_t) = E(v_t) = S(v_t) / S(1),   (33)

where

v_t = W_v x_t,   ∂(v_t)_i / ∂(W_v)_{i,j} = (x_t)_j,
K^e_t = e^{W_k x_t + w_{T,t}},   ∂(K^e_t)_i / ∂(W_k)_{i,j} = (x_t)_j (K^e_t)_i,

and S(·) and E(·) are shorthand for denoting sums and averages over the weights K^e_t.

The loss function at position T can be written as

L_T = l(f(wkv_T), y_T).   (34)

Because wkv_T relates to (W_k)_{i,j} and (W_v)_{i,j} only through the i-th channel (wkv_T)_i, we have

∂L_T / ∂(W_v)_{i,j} = (∂L_T / ∂(wkv_T)_i) · (∂(wkv_T)_i / ∂(W_v)_{i,j}).   (35)

The first part of the above equation contains trivial operations, like output layers and other layers of time-mixing, which can be proven inductively. The second part of the above equation can be bounded as

|∂(wkv_T)_i / ∂(W_v)_{i,j}| = |∂E_i[(v_t)_i] / ∂(W_v)_{i,j}| = |E_i[(x_t)_j]| ≤ max_t |(x_t)_j|,   (36)

which is irrelevant to T. Similarly,

∂(wkv_T)_i / ∂(W_k)_{i,j} = ∂(S_i[(v_t)_i] / S_i(1)) / ∂(W_k)_{i,j}
  = S_i[(x_t)_j (v_t)_i] / S_i(1) − S_i[(x_t)_j] S_i[(v_t)_i] / S_i(1)²
  = E_i[(x_t)_j (v_t)_i] − E_i[(x_t)_j] E_i[(v_t)_i]
  = cov_i((x_t)_j, (v_t)_i)   (37)

can also be bounded. Note that wkv's softmax operation contains at least two non-zero terms (u and w), so the above "covariance" will not degenerate into 0.

Figure 9: Model behavior visualizations of the RWKV model. Top: time decay (sorted along the channel axis) for layers 1-12 of RWKV-169M. Bottom: information propagation path (layer vs. token position for the prompt "The Eiffel Tower is located in the city of"), colored by the log-probability of "Paris".

G Model Behavior Visualization

In Figure 9, we present visualizations of some behavior of the RWKV model.

The top plot illustrates the time decays (e^{−w}) in each layer of the RWKV-169M model, sorted along the channel axis. Notably, several decays in the last layers are very close or equal to one, implying that certain information is preserved and propagated throughout the model's temporal context. Meanwhile, many decays in the initial layer are close to zero, which corresponds to local operations in wkv (14), likely associated with tasks such as text parsing or lexical analysis. (Note that the local operation in wkv is due to the extra parameter u when e^{−w} degenerates into 0.) These patterns of time decays are partly learned, but also come from the parameter initialization, as it speeds up training.

The bottom plot shows the information retrieval and propagation path in the RWKV-430M model. The experiment follows the causal trace method introduced by Meng et al. (2022), where we

1. Run the model once, and record all states and the activation of each layer during the computation;
2. Corrupt the input embeddings of the subject using noise ("The Eiffel Tower" in this example);
3. Restore the states and activation of a certain layer at a certain token during the computation, and record the log-probability of the model outputting the correct answer ("Paris").

Unlike transformers, RWKV relies on recursive propagation of information in the time dimension. In this case, the fact that "the Eiffel Tower is located in Paris" is retrieved in layer 4 and is then passed down to the subsequent layers. In layer 20, the information is mostly propagated through time until it reaches the position where it is needed. Finally, it is passed down to the last layer for outputting the answer.
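For readers who want to reproduce this kind of analysis, the sketch below outlines the generic causal-trace recipe of Meng et al. (2022): record a clean run, corrupt the subject embeddings, then patch a clean hidden state back into the corrupted run and watch the target log-probability recover. It is written against a generic Hugging Face causal LM and uses GPT-2 purely because it is small and widely available; the model name, noise scale, subject span, and layer/position choices are illustrative assumptions, and the paper applies the analogous procedure to RWKV's recurrent states rather than to a transformer.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # illustrative stand-in for any causal LM with hidden states
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    prompt = "The Eiffel Tower is located in the city of"
    target = " Paris"
    inputs = tok(prompt, return_tensors="pt")
    target_id = tok(target, add_special_tokens=False).input_ids[0]

    def target_logprob(logits):
        # log-probability of the target token at the last position
        return torch.log_softmax(logits[0, -1], dim=-1)[target_id].item()

    # 1. Clean run: record all hidden states.
    with torch.no_grad():
        clean = model(**inputs, output_hidden_states=True)
    clean_states = clean.hidden_states  # tuple: embeddings + one entry per block

    # 2. Corrupted run: add noise to the embeddings of the subject tokens.
    subject_positions = list(range(4))  # assumed span for "The Eiffel Tower"
    embeds = model.get_input_embeddings()(inputs.input_ids).detach().clone()
    embeds[0, subject_positions] += 0.5 * torch.randn_like(embeds[0, subject_positions])
    with torch.no_grad():
        corrupted = model(inputs_embeds=embeds, attention_mask=inputs.attention_mask)

    # 3. Patched run: restore one layer's clean state at one token position.
    layer, position = 6, subject_positions[-1]  # illustrative choice

    def patch_hook(module, module_inputs, module_output):
        hidden = module_output[0].clone()
        hidden[0, position] = clean_states[layer + 1][0, position]  # +1 skips embeddings
        return (hidden,) + module_output[1:]

    handle = model.transformer.h[layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        patched = model(inputs_embeds=embeds, attention_mask=inputs.attention_mask)
    handle.remove()

    print("clean    :", target_logprob(clean.logits))
    print("corrupted:", target_logprob(corrupted.logits))
    print("patched  :", target_logprob(patched.logits))

Sweeping the (layer, position) pair and plotting the patched log-probability is what produces a map like the bottom panel of Figure 9.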
H Evaluation Details

The results for the following tasks are shown in Tables 3 and 4.

Tasks:
• LAMBADA (Paperno et al., 2016). A benchmark dataset that evaluates the model's contextual reasoning and language comprehension abilities by presenting context-target pairs, where the objective is to predict the most probable target token.
• PIQA (Bisk et al., 2020). A benchmark for the task of physical common sense reasoning, which consists of a binary choice task that can be better understood as a set of two pairs, namely (Goal, Solution).
• HellaSwag (Zellers et al., 2019). A novel benchmark for commonsense Natural Language Inference (NLI) which is built by adversarial filtering against transformer models.
• WinoGrande (Zellers et al., 2020). A dataset designed to evaluate the acquisition of common sense reasoning by neural language models, aiming to determine whether we are accurately assessing the true capabilities of machine common sense.
• StoryCloze (Mostafazadeh et al., 2016). A benchmark presenting a novel approach to assess comprehension of narratives, narrative generation, and script acquisition, focusing on commonsense reasoning.
• ARC Challenge (Clark et al., 2018). A dataset designed for multiple-choice question answering, encompassing science exam questions ranging from third grade to ninth grade.
• ARC Easy. An easy subset of ARC.
• HeadQA (Vilares and Gómez-Rodríguez, 2019). A benchmark consisting of graduate-level questions encompassing various fields such as medicine, nursing, biology, chemistry, psychology, and pharmacology.
• OpenBookQA (Mihaylov et al., 2018). A QA dataset to evaluate human comprehension of a subject by incorporating open book facts, scientific knowledge, and perceptual common sense, drawing inspiration from open book exams.
• SciQ (Welbl et al., 2017). A multiple-choice QA dataset which was created using an innovative approach to gather well-crafted multiple-choice questions that are focused on a specific domain.
• TriviaQA (Joshi et al., 2017). A QA-IR dataset which is constituted of triples of questions, answers, supporting evidence, and independently collected evidence documents, with an average of six documents per question for reliable sources.
• ReCoRD (Zhang et al., 2018). A benchmark for evaluating commonsense reasoning in reading comprehension by generating queries from CNN/Daily Mail news articles and requiring text span answers from corresponding summarizing passages.
• COPA (Roemmele et al., 2018). A dataset to evaluate achievement in open-domain commonsense causal reasoning.
• MMLU (Hendrycks et al., 2021). A multi-task dataset covering 57 tasks, including elementary mathematics, US history, computer science, law, etc.
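Most of these benchmarks are multiple-choice and are scored zero-shot by comparing the model's log-likelihood of each candidate continuation. The sketch below shows that generic scoring recipe with a Hugging Face causal LM; the model name and the toy item are illustrative placeholders, and the numbers in Tables 3 and 4 come from the standard evaluation setup rather than from this snippet.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # illustrative; any causal LM could be swapped in
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    def continuation_logprob(context, continuation):
        """Sum of log-probabilities of the continuation tokens given the context."""
        ctx_ids = tok(context, return_tensors="pt").input_ids
        cont_ids = tok(continuation, add_special_tokens=False, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # logits at position t predict the token at position t + 1
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        cont_positions = range(ctx_ids.shape[1] - 1, input_ids.shape[1] - 1)
        return sum(logprobs[p, input_ids[0, p + 1]].item() for p in cont_positions)

    # Toy multiple-choice item in the (context, choices, gold index) format
    # shared by most of the benchmarks above.
    context = "The Eiffel Tower is located in the city of"
    choices = [" Paris", " London", " Rome"]
    gold = 0

    scores = [continuation_logprob(context, c) for c in choices]
    pred = max(range(len(choices)), key=lambda i: scores[i])
    print("scores:", scores, "predicted:", choices[pred], "correct:", pred == gold)

Accuracy ("acc") counts the argmax over raw scores; the "acc_norm" variants reported in the tables additionally normalize each score by the length of the candidate continuation.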
I Inference results

Figures 10 and 11 illustrate, respectively, the results on time (s) and memory (RAM, VRAM) requirements for LLM inference in float32 precision. We benchmark the following model families and sizes:

• RWKV: 169m, 430m, 1.4b, 3b, 7b, 14b
• Bloom (Scao et al., 2022): 560m, 1b, 3b
• OPT (Zhang et al., 2022): 125m, 350m, 1.3b, 2.7b, 6.7b, 13b
• GPT-Neo (Black et al., 2022): 125m, 1.3b, 2.7b
• Pythia (Biderman et al., 2023): 160m, 410m, 1.4b, 2.8b, 6.7b, 12b

Missing models are due to Out Of Memory (OOM) errors. A comparison at 512 tokens is shown in Figure 11, as some large transformer models produced an OOM when inferencing longer sequences. For GPU experiments, we use an NVIDIA A100 with 80GB of VRAM. For CPU experiments, we use an AMD EPYC processor with 30 CPU cores and 200 GiB RAM.

Model | Params (B) | PIQA acc | StoryCloze acc | HellaSwag acc_norm | WinoGrande acc | ARC-e acc | ARC-c acc_norm | OBQA acc_norm
RWKV-4 | 0.17 | 65.07 | 58.79 | 32.26 | 50.83 | 47.47 | 24.15 | 29.60
Pythia | 0.16 | 62.68 | 58.47 | 31.63 | 52.01 | 45.12 | 23.81 | 29.20
GPT-Neo | 0.16 | 63.06 | 58.26 | 30.42 | 50.43 | 43.73 | 23.12 | 26.20
RWKV-4 | 0.43 | 67.52 | 63.87 | 40.90 | 51.14 | 52.86 | 25.17 | 32.40
Pythia | 0.40 | 66.70 | 62.64 | 39.10 | 53.35 | 50.38 | 25.77 | 30.00
GPT-Neo | 0.40 | 65.07 | 61.04 | 37.64 | 51.14 | 48.91 | 25.34 | 30.60
RWKV-4 | 1.5 | 72.36 | 68.73 | 52.48 | 54.62 | 60.48 | 29.44 | 34.00
Pythia | 1.4 | 71.11 | 67.66 | 50.82 | 56.51 | 57.74 | 28.58 | 30.80
GPT-Neo | 1.4 | 71.16 | 67.72 | 48.94 | 54.93 | 56.19 | 25.85 | 33.60
RWKV-4 | 3.0 | 74.16 | 70.71 | 59.89 | 59.59 | 65.19 | 33.11 | 37.00
Pythia | 2.8 | 73.83 | 70.71 | 59.46 | 61.25 | 62.84 | 32.25 | 35.20
GPT-Neo | 2.8 | 72.14 | 69.54 | 55.82 | 57.62 | 61.07 | 30.20 | 33.20
RWKV-4 | 7.4 | 76.06 | 73.44 | 65.51 | 61.01 | 67.80 | 37.46 | 40.20
Pythia | 6.9 | 74.54 | 72.96 | 63.92 | 61.01 | 66.79 | 35.07 | 38.00
GPT-J | 6.1 | 75.41 | 74.02 | 66.25 | 64.09 | 66.92 | 36.60 | 38.20
RWKV-4 | 14.2 | 77.48 | 76.06 | 70.65 | 63.85 | 70.24 | 38.99 | 41.80
GPT-level∗ | 14.2 | 76.49 | 74.97 | 68.72 | 65.14 | 70.77 | 37.99 | 39.27
Pythia (c.f.) | 11.8 | 75.90 | 74.40 | 67.38 | 64.72 | 69.82 | 36.77 | 38.80
GPT-NeoX (c.f.) | 20.6 | 77.69 | 76.11 | 71.42 | 65.98 | 72.69 | 40.44 | 40.20

Table 3: Zero-Shot Performance of the model on Common Sense Reasoning Tasks. ∗ Interpolation of Pythia and GPT-Neo models.
Model | Params (B) | LAMBADA ppl | LAMBADA acc | headQA acc_norm | sciq acc | triviaQA acc | ReCoRD em | COPA acc
RWKV-4 | 0.17 | 29.33 | 32.99 | 25.78 | 77.50 | 1.26 | 62.03 | 66.00
Pythia | 0.16 | 24.38 | 38.97 | 25.82 | 76.50 | 1.31 | 66.32 | 62.00
GPT-Neo | 0.16 | 30.27 | 37.36 | 25.16 | 76.60 | 1.18 | 64.92 | 64.00
RWKV-4 | 0.43 | 13.04 | 45.16 | 27.32 | 80.30 | 2.35 | 70.48 | 65.00
Pythia | 0.40 | 11.58 | 50.44 | 25.09 | 81.50 | 2.03 | 75.05 | 67.00
GPT-Neo | 0.40 | 13.88 | 47.29 | 26.00 | 81.10 | 1.38 | 73.79 | 65.00
RWKV-4 | 1.5 | 7.04 | 56.43 | 27.64 | 85.00 | 5.65 | 76.97 | 77.00
Pythia | 1.4 | 6.58 | 60.43 | 27.02 | 85.50 | 5.52 | 81.43 | 73.00
GPT-Neo | 1.4 | 7.5 | 57.25 | 27.86 | 86.00 | 5.24 | 80.62 | 69.00
RWKV-4 | 3.0 | 5.25 | 63.96 | 28.45 | 86.50 | 11.68 | 80.87 | 82.00
Pythia | 2.8 | 4.93 | 65.36 | 28.96 | 87.70 | 9.63 | 85.10 | 77.00
GPT-Neo | 2.8 | 5.63 | 62.22 | 27.17 | 89.30 | 4.82 | 83.80 | 80.00
RWKV-4 | 7.4 | 4.38 | 67.18 | 31.22 | 88.80 | 18.30 | 83.68 | 85.00
Pythia | 6.9 | 4.3 | 67.98 | 28.59 | 90.00 | 15.42 | 86.44 | 85.00
GPT-J | 6.1 | 4.1 | 68.31 | 28.67 | 91.50 | 16.74 | 87.71 | 83.00
RWKV-4 | 14.2 | 3.86 | 70.83 | 32.64 | 90.40 | 24.58 | 85.67 | 85.00
GPT-level∗ | 14.2 | 3.81 | 70.94 | 31.03 | 92.20 | 22.37 | 87.89 | 82.66
Pythia (c.f.) | 11.8 | 3.89 | 70.44 | 30.74 | 91.80 | 20.57 | 87.58 | 82.00
GPT-NeoX (c.f.) | 20.6 | 3.64 | 71.94 | 31.62 | 93.00 | 25.99 | 88.52 | 84.00

Table 4: Zero-Shot Performance of various models on different tasks. ∗ Interpolation of Pythia and GPT-Neo models.
Method | L | d | T | Train bpc | Test bpc | Time Complexity | Space Complexity
Transformer | 12 | 512 | 1024 | 0.977 | 1.137 | O(T²d) | O(T² + Td)
Transformer | 24 | 256 | 1024 | 1.039 | 1.130 | O(T²d) | O(T² + Td)
Reformer | 12 | 512 | 1024 | 1.040 | 1.195 | O(T log T · d) | O(T log T + Td)
Synthesizer | 12 | 512 | 1024 | 0.994 | 1.298 | O(T²d) | O(T² + Td)
Linear Transformer | 12 | 512 | 1024 | 0.981 | 1.207 | O(Td²) | O(Td + d²)
Performer | 12 | 512 | 1024 | 1.002 | 1.199 | O(Td² log d) | O(Td log d + d² log d)
AFT-simple | 12 | 512 | 1024 | 0.854 | 1.180 | O(Td) | O(Td)
RWKV-RNN | 6 | 512 | 1024 | 0.720 | - | O(Td) | O(d)

Table 5: Enwik8 results, measured in bits per character (bpc): the lower the better. Baseline comparisons are made with Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020) (the best performing dense version), Linear Transformer (Katharopoulos et al., 2020), and Performer (Choromanski et al., 2020). L, d, and T denote the number of blocks (network depth), dimension of features, and sequence length, respectively. Both Linear Transformer and Performer are implemented with customized CUDA kernels (github.com/idiap/fast-transformers), and all other models are implemented in native PyTorch.

Figure 10: Text generation inference memory (CPU RAM, GPU VRAM) for LLMs. Model parameters are not accounted for.

Figure 11: Text generation inference time for LLMs.

Task Name | Measure type | ChatGPT [%] | GPT-4 [%] | RWKV-4 GPT [%] | RWKV-4 changed [%] | SOTA [%]
RTE | F1 Macro | 88.1 | 91.3 | 44.2 | 74.8 | 92.1
WNLI | Accuracy | 81.7 | 91.6 | 47.9 | 49.3 | 97.9
GoEmotions | F1 Macro | 25.6 | 23.1 | 7.9 | 7.9 | 52.8
PolEmo2 | F1 Macro | 44.1 | 41.0 | 38.2 | 40.9 | 76.4

Table 6: ChatGPT, GPT-4 and RWKV-4-Raven-14B reasoning performance comparison on the RTE (Wang et al., 2019), WNLI (Wang et al., 2018), GoEmotions (Demszky et al., 2020), and PolEmo2 (Kocoń et al., 2019) benchmarks. SOTA is provided as a supplementary reference.

J Importance of prompt construction and comparison to GPT models

Inspired by the article of Kocoń et al. (2023), we compared the zero-shot performance of RWKV-4-Raven-14B with ChatGPT (accessed in February 2023) and GPT-4 using several known NLP tasks, i.e., recognizing textual entailment (RTE), Winograd Natural Language Inference (WNLI), and recognizing emotions elicited in readers (GoEmotions and PolEmo2). Each model got the same prompts, manually chosen to elicit proper responses from the ChatGPT model.
As shown in Tab. 6, RWKV performs significantly worse than ChatGPT and GPT-4 on these specific tasks. We suspect that this disparity is largely caused by the choice of prompts used to generate the answers: the prompts are written in natural language and do not take into account that RWKV is an RNN, which cannot look back inside an instruction. When the instruction style was adapted to respect that an RNN is not capable of retrospective processing, quality on some datasets increased significantly (e.g., for RTE (Wang et al., 2019), F1 Macro increased from 44.2% to 74.8%). We hypothesize that RWKV models are more sensitive to the position of the components in the context, as RNN-based architectures cannot look back and readjust the weight of previous information. For better performance, the desired information should come after the question.

Example of the ChatGPT prompt for RTE:

Having premise judge if the following hypothesis are logically connected with the premise? Answer "entailment" if yes, or "not_entailment" if no.

RWKV prompt taking into account the characteristics of the RNN:

Can you tell me if the hypothesis is entailment or is not entailment to the premise?
premise:
hypothesis:

Task Name | Measure type | ChatGPT [%] | RWKV-4 adapted [%] | SOTA [%]
Aggression | F1 Macro | 69.10 | 56.66 | 74.45
MathQA | Accuracy | 71.40 | 80.69 | 83.20
Sarcasm | F1 Macro | 49.88 | 50.96 | 53.57
TweetSent | F1 Macro | 63.32 | 52.50 | 72.07
Unhealthy | F1 Macro | 45.21 | 43.30 | 50.96

Table 7: ChatGPT and RWKV-4-Raven-14B performance comparison on the Aggression (Wulczyn et al., 2017), Sarcasm (Siddiqui, 2019), Unhealthy (Price et al., 2020), MathQA (Cobbe et al., 2021), and TweetSent (Barbieri et al., 2020) benchmarks. SOTA is provided as a supplementary reference.

While separating the instruction from the input is relatively easy to do, other aspects of prompt engineering are harder to quantify. Testing the approach of stating the input after the question on multiple other tasks, shown in Tab. 7, suggests that better prompts might reduce the disparity between models. Raven achieves comparable results to ChatGPT on unhealthy conversation detection and even surpasses it on the sarcasm detection dataset. While this approach to prompting looks necessary, it alone is not enough to replace the capability of having free access to the whole context. Therefore, prompt engineering seems to be of significantly more importance to RNN models compared to standard transformers. It is entirely possible that good prompts for RNN models do not mean additional restrictions, but should simply be constructed using completely different guidelines. While the authors of the aforementioned paper (Kocoń et al., 2023) perform chain-of-thought prompting to improve results on the MathQA dataset, trying the same approach on the Raven model achieved a very low accuracy of 48.11%.
Despite that, using a very similar prompt without the chain-of-thought outperformed ChatGPT on this dataset. It would not be surprising if, thanks to the hidden state, RNN models do not need additional steps during inference, and adding those only makes the model forget the information it analyzed and deemed important. This further emphasizes the importance of the order of the information the model receives.

The template used to prompt the Raven model in MathQA with chain-of-thought prompting:

Write the reasoning and highlight the answer to the question at the end in format: 'Answer: '. The question is:

The template used to prompt the Raven model in MathQA without chain-of-thought prompting:

Write the answer to the math question in the format: 'Answer: '. The question is:

K Cases

In this part, we present a few instances of outputs produced by the RWKV model using a chat interface [5][6].

[5] https://github.com/BlinkDL/ChatRWKV
[6] https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

RWKV-4-Raven-14B
Tell me about ravens.
Ravens are large, black birds with a distinctive white head and neck. They are found in most parts of the world, including North America, Europe, Asia, and Australia. Ravens are known for their intelligence and problem-solving abilities. They are also considered to be symbols of death and transformation in many cultures. Ravens are often associated with the afterlife or death because they have been known to eat carrion or even other birds. In some cultures, ravens are also believed to be messengers of the gods or guardians of treasure.

RWKV-4-Raven-14B
Explain the concept of machine learning to a 10-year-old.
Machine learning is like a super-smart computer that can learn on its own. It can recognize patterns and make decisions based on what it has learned. Just like how you might remember something from the past, the computer can remember what it's learned and use it to make predictions or decisions in the future. It's like having a smart friend who always knows what to do!

RWKV-4-Raven-14B
Write a short poem describing nature's beauty in springtime.
Nature's beauty in springtime is rare,
A symphony of colors, a gentle surprise.
The flowers bloom, the birds sing and sing,
As the sun shines down, warming our skin.
The trees sway in the gentle breeze,
As we take in their beauty with ease.
The sky is blue, the clouds are white,
A sight to behold, it's truly divine.

RWKV-4-Raven-14B
Below is the numerically stable formula for updating the state variable in RWKV:

a*_t = e^{p−q} a_{t−1} + e^{u+k_t−q} v_t
b*_t = e^{p−q} b_{t−1} + e^{u+k_t−q}
wkv_t = a*_t / b*_t

Please write out the corresponding PyTorch code to do so, where both a_t and b_t are torch.Tensor([n]).

Here is the PyTorch code to update the state variable in RWKV using the given formula:

    import torch

    def update_state(a, b, u, v, p, q):
        a_t = torch.exp(p - q) * a + torch.exp(u + k - q) * v
        b_t = torch.exp(p - q) * b + torch.exp(u + k - q)
        return a_t, b_t

    a, b = torch.randn(2, 3)
    u = torch.randn(2)
    v = torch.randn(2)
    wkv = torch.randn(2)
    a_t, b_t = update_state(a, b, u, v, wkv, torch.randn(2))
    print("a_t:", a_t)
    print("b_t:", b_t)
This code defines a function update_state that takes in the state variables a and b, the control variables u and v, and the parameters wkv and torch.randn. It then uses the PyTorch torch.exp function to compute the exponential of the control variables and the log of the normalization factor. Finally, it uses these values to update the state.

RWKV-4-Raven-14B
Explain the following metaphor: Life is like cats.
The metaphor "Life is like cats" suggests that life can be unpredictable and full of unexpected twists and turns, just like a cat's behavior. It implies that we cannot fully control or predict the outcome of our lives, just as we cannot control the movements of a cat. The metaphor also implies that life is complex and multifaceted, with many different variables and factors at play. Just as a cat can have multiple personalities or moods, life can have many twists and turns that are difficult to predict or understand. Overall, the metaphor suggests that life is complex and unpredictable, much like a cat's behavior.

RWKV-4-Raven-14B
Test the convergence of Σ_{i=0}^{∞} 2^{−i}.
The given series is a geometric series with the first term a_0 = 2 and common ratio r = 1/2. We can find the sum of this series using the formula for a geometric series:

Σ_{i=0}^{∞} a_i = a_0 / (1 − r) = 1 / (1 − 1/2) = 2

So, the sum of the given series is 2.

RWKV-4-Raven-14B
I would like to cook some Chinese food at home. Do you have any suggestions on the meal I could choose?
Yes, of course! If you are not very experienced in cooking Chinese food, you could start with making noodles or rice, then make a dish using one of those. I also think making stir-fry with some meat is an excellent idea.
I want to make something more special. What about red braised pork belly?
Yes, red braised pork belly is also very popular and tasty.
Tell me the steps to cook.
You will need a few ingredients such as pork belly, star anise, sugar, ginger, scallions and wine. Put all of these in a pot, cook on medium heat and simmer until the meat is tender. After that, remove the meat from the pot and reserve the sauce. Slice the meat and add some ginger slices into it, then boil until it's almost done.
What do I need to be careful with the most during the cooking process?
To cook pork belly properly, it is very important to monitor the heat of the stove and be patient with the cooking process. Overcooking it can cause the pork to be tough, dry and difficult to chew. You can add a few ingredients like soy sauce or black beans to give the pork more flavor.
Tell me more about the proper heat and timing.
If the temperature of the water is too high, it can damage the collagen and connective tissues, resulting in tough pork belly. It is essential to control the temperature carefully, especially when braising the pork.
Additionally, keep the water simmering but not boiling to maintain a low simmer, as high heat will dry out the pork belly and result in tough and unpalatable meat. Lastly, monitor the heat during cooking and make sure the heat is medium, as this is what is most appropriate for slow cooking pork belly.
I see. Thank you for your advice.
You're welcome! It was a pleasure chatting with you. Let me know if you need any further assistance in the future.

RWKV-4-Raven-14B
If all apples are fruits and all fruits are sweet, are all apples sweet? Please explain your answer in detail.
Yes, all apples are fruits, and all fruits are sweet. Therefore, it is safe to assume that all apples are sweet.
If birds can fly. Penguins are birds. Can penguins fly? Please explain your answer in detail.
No, penguins cannot fly. Penguins are birds, but they are not capable of flight like other birds. They have wings and can glide for short distances, but they cannot take off or fly long distances like other birds.
The day before two days after the day before tomorrow is Saturday. What day is it today? Please explain your answer in detail.
Today is Friday. Two days after tomorrow is Sunday. Tomorrow is Saturday.
If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse? Please explain your answer in detail.
If five cats can catch five mice in five minutes, it means that each cat can catch one mouse in one minute. To find out how long it will take one cat to catch one mouse, we can use the concept of proportionality.
Let's assume that it takes x minutes for a cat to catch a mouse. Then, the number of mice caught by a cat in one minute is equal to the number of cats multiplied by the number of mice caught by each cat: x = 5 * 1 = 5
So, it will take 5 minutes for one cat to catch one mouse.

RWKV-4-Raven-14B
Summarize the following paragraph.

The paper presents a novel architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. The model is able to scale to tens of billions of parameters and exhibits linear computational complexity during training and inference, making it a promising alternative to Transformers for sequence processing tasks.
Please further summarize it in one sentence.
The researchers have developed a new architecture that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs, which allows for linear scaling in memory and computational requirements.
--------------------------------------------------------------------------------