├── Fig
│   ├── README.md
│   ├── coco-r-results.png
│   ├── coco-results.png
│   ├── img-results.png
│   ├── lam-line-x.jpg
│   ├── lam-line-x.pdf
│   ├── lra-results.png
│   ├── methods.jpg
│   ├── mt-results.png
│   ├── public.jpg
│   ├── public.pdf
│   └── sota.pdf
├── README.md
├── _config.yml
├── _includes
│   └── head-custom.html
├── _layouts
│   └── default.html
├── core
│   ├── GRC_Attention.py
│   └── pvt_grc.py
├── ct-public.gif
└── index.md
/Fig/README.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Fig/coco-r-results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/coco-r-results.png
--------------------------------------------------------------------------------
/Fig/coco-results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/coco-results.png
--------------------------------------------------------------------------------
/Fig/img-results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/img-results.png
--------------------------------------------------------------------------------
/Fig/lam-line-x.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/lam-line-x.jpg
--------------------------------------------------------------------------------
/Fig/lam-line-x.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/lam-line-x.pdf
--------------------------------------------------------------------------------
/Fig/lra-results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/lra-results.png
--------------------------------------------------------------------------------
/Fig/methods.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/methods.jpg
--------------------------------------------------------------------------------
/Fig/mt-results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/mt-results.png
--------------------------------------------------------------------------------
/Fig/public.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/public.jpg
--------------------------------------------------------------------------------
/Fig/public.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/public.pdf
--------------------------------------------------------------------------------
/Fig/sota.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/annosubmission/GRC-Cache/bdaffd4af9647f3028def2c444757b0e61d2e87f/Fig/sota.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # Cached Transformers
4 | This anonymous repo contains the introduction and code for the paper "Cached Transformers: Improving Transformers with Differentiable Memory Cache".
5 |
6 |
7 | ## Introduction
8 | In this work, we propose a novel family of Transformer models, called Cached Transformers, equipped with a Gated Recurrent Cache (GRC), a lightweight and flexible module that enables Transformers to access historical knowledge.
9 |
10 |
11 |
12 | #### Behavior
13 | We study this behavior in image classification and find that GRC separates features into two parts: attending over the cache yields instance-invariant
14 | features, while attending over the input itself yields instance-specific features (see the visualizations below).
15 |
16 |
17 |
18 |
19 | #### Results
20 | We conduct extensive experiments on more than **ten** representative Transformer networks from both vision and language tasks, including Long Range Arena, image classification, object detection, instance segmentation, and machine translation. The results demonstrate that our approach significantly improves the performance of recent Transformers.
21 |
22 |
23 |
24 | ##### ImageNet Results
25 |
26 |
27 | ##### COCO2017 Results (Mask R-CNN 1x)
28 |
29 |
30 | ##### COCO2017 Results (RetinaNet 1x)
31 |
32 |
33 | ##### LRA Results
34 |
35 |
36 | ##### Machine Translation Results
37 |
38 |
39 |
40 | ## Methods
41 |
42 | #### Cached Attention with GRC (GRC-Attention)
43 |
44 |
45 |
46 | Illustration of the proposed GRC-Attention in Cached Transformers.
47 |
48 | (a) Details of the updating process of the Gated Recurrent Cache. The updated cache $C_t$ is derived from the current tokens $X_t$ and the cache of the last step, $C_{t-1}$. The reset gate $g_r$ resets the previous cache $C_{t-1}$ to a reset cache, and the update gate $g_u$ controls the update intensity.
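
In equation form, a GRU-style reading of (a) is the following, where $\bar{X}_t$ denotes a compressed representation of the current tokens (this is our reconstruction from the caption; the exact formulation and notation are given in the paper):

$$
\bar{C}_{t-1} = g_r \odot C_{t-1}, \qquad
C_t = (1 - g_u) \odot \bar{C}_{t-1} + g_u \odot \bar{X}_t .
$$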
49 |
50 | (b) Overall pipeline of GRC-Attention. Inputs attend over the cache and over themselves respectively, and the outputs are formulated as an interpolation of the two attention results.
51 |
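For intuition, below is a minimal, single-head PyTorch sketch of this pipeline. It is not the repository implementation (see `core/GRC_Attention.py` for that); the class name `GRCAttentionSketch`, the linear gate projections, the pooling-based compression of $X_t$, and the per-layer scalar $\lambda$ are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GRCAttentionSketch(nn.Module):
    """Single-head sketch of GRC-Attention: attention over a shared gated
    recurrent cache plus ordinary self-attention, mixed by sigma(lambda)."""

    def __init__(self, dim, cache_len=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gate_r = nn.Linear(dim, cache_len)   # reset gate g_r
        self.gate_u = nn.Linear(dim, cache_len)   # update gate g_u
        self.lam = nn.Parameter(torch.zeros(1))   # lambda, mixed via sigmoid
        self.proj = nn.Linear(dim, dim)
        self.cache_len = cache_len
        # Cache C_t shared across samples (compressive memory of past batches).
        self.register_buffer("cache", torch.zeros(cache_len, dim))

    def _attend(self, x, kv):
        q, k, v = self.q(x), self.k(kv), self.v(kv)
        attn = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
        return attn.softmax(dim=-1) @ v

    def forward(self, x):
        # x: (batch, tokens, dim)
        b, n, d = x.shape

        # Gates are computed from a summary of the current tokens X_t (assumption).
        summary = x.mean(dim=(0, 1))
        g_r = torch.sigmoid(self.gate_r(summary)).unsqueeze(-1)  # (cache_len, 1)
        g_u = torch.sigmoid(self.gate_u(summary)).unsqueeze(-1)  # (cache_len, 1)

        # Compress X_t to cache_len slots (crude pooling stands in for the paper's scheme).
        x_bar = F.adaptive_avg_pool1d(x.mean(dim=0).t().unsqueeze(0), self.cache_len)
        x_bar = x_bar.squeeze(0).t()                              # (cache_len, dim)

        # GRC update: reset the previous cache, then interpolate with new content.
        reset_cache = g_r * self.cache
        new_cache = (1 - g_u) * reset_cache + g_u * x_bar
        self.cache = new_cache.detach()                           # persists across steps

        # Two branches: attention over the cache (o_mem) and over the tokens (o_self).
        o_mem = self._attend(x, new_cache.unsqueeze(0).expand(b, -1, -1))
        o_self = self._attend(x, x)

        # Output is an interpolation weighted by sigma(lambda).
        ratio = torch.sigmoid(self.lam)
        return self.proj(ratio * o_mem + (1 - ratio) * o_self)
```

A call like `GRCAttentionSketch(dim=64)(torch.randn(2, 196, 64))` returns a tensor of shape `(2, 196, 64)` while updating the shared cache as a side effect.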
52 |
53 | ## Analysis
54 |
55 |
56 | #### Significance of Cached Attention
57 |
58 |
59 | To verify that the above performance gains mainly come from attending over caches, we analyze the contribution of $o_{mem}$ by visualizing the learnable attention ratio $\sigma(\lambda^h)$.
60 | Since each head's output interpolates $o_{mem}^h$ and $o_{self}^h$ with weight $\sigma(\lambda^h)$, this ratio represents the relative significance of $o_{mem}^h$ and $o_{self}^h$.
61 | We observe that, for more than half of the layers, $\sigma(\lambda^h)$ is larger than $0.5$, indicating that the outputs of those layers depend heavily on the cached attention.
62 | We also notice that the models consistently prefer more cached attention except in the last several layers.
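
Concretely, writing $\lambda^h$ for the learnable scalar of head $h$, this interpolation (our reconstruction from the description above; the paper's exact notation may differ) reads

$$
o^h = \sigma(\lambda^h)\, o_{mem}^h + \big(1 - \sigma(\lambda^h)\big)\, o_{self}^h ,
$$

so $\sigma(\lambda^h) > 0.5$ means a head relies more on cached attention than on self-attention.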
63 |
64 | #### Roles of Cached Attention
65 |
66 |
67 | We investigate the function of GRC-Attention by visualizing its interior feature maps.
68 | We choose the middle layers of cached ViT-S, averaging the outputs from self-attention ($o_{self}$) and cached attention ($o_{mem}$) across the head and channel dimensions, and then normalizing them to $[0, 1]$.
69 | The corresponding results are denoted as $o_{self}$ and $o_{mem}$, respectively.
70 | As $o_{self}$ and $o_{mem}$ are sequences of patches, they are unflattened to a $14 \times 14$ shape for better comparison.
71 | As shown, the features derived by the two attention branches are visually complementary.
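
As a rough sketch of this post-processing (assuming each branch output has already been extracted as a tensor of shape `(batch, heads, 196, head_dim)`; the helper name `to_patch_map` is ours):

```python
import torch

def to_patch_map(branch_out):
    """Average a branch output over head and channel dims, min-max normalize
    to [0, 1], and unflatten the 196 patch tokens to a 14x14 map."""
    # branch_out: (batch, heads, 196, head_dim), e.g. o_self or o_mem of cached ViT-S
    m = branch_out.mean(dim=(1, 3))                               # (batch, 196)
    m = (m - m.amin(dim=1, keepdim=True)) / (
        m.amax(dim=1, keepdim=True) - m.amin(dim=1, keepdim=True) + 1e-6
    )
    return m.reshape(-1, 14, 14)                                  # (batch, 14, 14)
```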
72 |
73 | In GRC-Attention, $o_{mem}$ is derived by attending over the proposed cache (GRC), which contains compressive representations of historical samples, and is thus adept at recognizing **public**, frequently appearing patches of the **class**.
74 | In contrast, $o_{self}$ from the self-attention branch focuses on the more private and **characteristic** features of the input **instance**.
75 | With these postulates, we can attempt to explain the regularity of $\sigma(\lambda^h)$: employing more $o_{mem}$ (larger $\sigma(\lambda^h)$) in earlier layers helps the network distinguish the instance coarsely, while employing more $o_{self}$ (smaller $\sigma(\lambda^h)$) in later layers enables the model to make fine-grained decisions.
76 |
77 |
78 |
79 | ## Core Codes
80 | The PyTorch implementation of the GRC-Attention module is provided in the "core" directory.
81 | Full training and testing code will be released later.
82 |
83 |
84 |
85 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-cayman
2 |
3 | markdown: kramdown
4 | kramdown:
5 | math_engine: katex
6 |
--------------------------------------------------------------------------------
/_includes/head-custom.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | {% include head-custom-google-analytics.html %}
5 |
6 |
7 |
8 |
9 |
10 |
11 |
19 |
20 |
--------------------------------------------------------------------------------
/_layouts/default.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |