├── ImportedVisionTransformer.ipynb
├── README.md
├── VIsionTransformer.ipynb
└── VisualizeAttentionMap.ipynb


/README.md:
--------------------------------------------------------------------------------
  1 | # Vision_Transformer
  2 | ## Paper : An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  3 | ## PDF : https://arxiv.org/pdf/2010.11929
  4 | 
  5 | ## VisionTransformer.ipynb
  6 | Based on a model implemented earlier (https://github.com/jo1jun/Transformer/blob/main/Transformer.ipynb).
  7 | 
  8 | A from-scratch implementation of ViT, written by following the paper.
  9 | 
 10 | ## VisualizeAttentionMap.ipynb
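 11 | The custom model above was trained on MNIST, and the resulting attention maps were visualized.
 12 | 
 13 | The visualization follows the code referenced below. (As described in the paper, the attention maps of each layer are averaged over heads and then combined across layers by matrix multiplication.)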
 14 | 
 15 | reference : https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb
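The "matrix multiplication & mean" step can be sketched in NumPy as below. This is only an illustration, not the notebook's actual code: the function names `attention_rollout` and `cls_attention_mask` are hypothetical, the input is assumed to be one `(heads, tokens, tokens)` softmax attention array per layer with the class token at index 0, and adding the identity matrix for the residual connection is an assumption taken from the attention rollout technique rather than something this README states.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one (tokens, tokens) map.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    where every row is a softmax distribution (assumed input layout).
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attention in attentions:
        # Mean over heads gives a single map for this layer.
        fused = layer_attention.mean(axis=0)
        # Assumption: add the identity for the residual path, then re-normalize rows.
        fused = fused + np.eye(num_tokens)
        fused = fused / fused.sum(axis=-1, keepdims=True)
        # Matrix multiplication chains this layer with all earlier layers.
        rollout = fused @ rollout
    return rollout


def cls_attention_mask(rollout, grid_size):
    """Attention from the class token (index 0) to each patch, as a 2D grid."""
    mask = rollout[0, 1:].reshape(grid_size, grid_size)
    # Scale to [0, 1] so the mask can be overlaid on the input image.
    return mask / mask.max()
```

The returned mask would still need to be resized to the image resolution before overlaying it, which is left out here.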
 16 | 
 17 | ## ImportedVisionTransformer.ipynb
 18 | https://huggingface.co/transformers/model_doc/vit.html#vitmodel
 19 | 
 20 | Using the model imported as described in the documentation above, the custom model was verified to be correct.
 21 | 
 22 | The pretrained model was then fine-tuned on CIFAR10 and its accuracy was measured.
 23 | 
 24 | Finally, attention maps were visualized with a pretrained model that had been fine-tuned on ImageNet.
 25 | 
 26 | ## dataset
 27 | VisionTransformer.ipynb : CIFAR10
 28 | 
 29 | VisualizeAttentionMap.ipynb : MNIST
 30 | 
 31 | ## result
 32 | ### MNIST Attention Map
 33 | ![image](https://user-images.githubusercontent.com/68524289/117996078-cf4a4600-b37c-11eb-9d6d-c1e872ce0bd9.png)
 34 | ![image](https://user-images.githubusercontent.com/68524289/117996156-e0935280-b37c-11eb-9a80-982f84886e11.png)
 35 | ![image](https://user-images.githubusercontent.com/68524289/117996175-e426d980-b37c-11eb-8395-741f5a0506cc.png)
 36 | 
 37 | ### ViT-B/16-224 Attention Map
 38 | Predicted class: great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
 39 | 
 40 | reference : https://kids.nationalgeographic.com/animals/fish/facts/great-white-shark
 41 | 
 42 | Predicted class: pelican
 43 | 
 44 | reference : https://www.dw.com/en/theres-more-to-the-pelican-than-a-pouch/g-50613921
 45 | 
 46 | Predicted class: tiger, Panthera tigris
 47 | 
 48 | reference : https://www.worldwildlife.org/species/tiger
 49 | 
 50 | ![image](https://user-images.githubusercontent.com/68524289/117996596-3e279f00-b37d-11eb-8139-ba09f78a504b.png)
 51 | ![image](https://user-images.githubusercontent.com/68524289/117996613-41228f80-b37d-11eb-9391-12ee37263343.png)
 52 | ![image](https://user-images.githubusercontent.com/68524289/117996628-454ead00-b37d-11eb-8717-8135f57b7fac.png)
 53 | 
 54 | 
 55 | ## Accuracy
 56 | 
 57 | ### [Tensorboard](https://tensorboard.dev/experiment/UmmgbIlzQzefEK2i3jCvlQ/)
 58 |   
 59 | - custom model MNIST
 60 |   
 61 | train accuracy 97.944%
 62 | 
 63 | val accuracy 97.790%
 64 | 
 65 | test accuracy 97.470%
 66 | 
 67 | - custom model CIFAR10
 68 |   
 69 | train accuracy 87.943%
 70 | 
 71 | val accuracy 76.000%
 72 | 
 73 | test accuracy 74.720%
 74 | 
 75 | - Imported module CIFAR10
 76 | 
 77 | train accuracy 86.459%
 78 | 
 79 | val accuracy 77.400%
 80 | 
 81 | test accuracy 75.600%
 82 | 
 83 | - pretrained & finetuned model CIFAR10
 84 | 
 85 | train accuracy 100.000%
 86 | 
 87 | val accuracy 96.600%
 88 | 
 89 | test accuracy 96.840%
 90 | 
 91 | ## Comment
 92 | 
 93 | Similar to the paper, the learning rate uses warmup (over the first 1/5 of all epochs) followed by cosine decay.
 94 | 
 95 | reference : https://huggingface.co/transformers/v1.2.0/_modules/pytorch_transformers/optimization.html
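A minimal sketch of such a schedule in plain Python follows. The function name `warmup_cosine_factor` and its parameters are hypothetical, and the sketch assumes linear warmup from 0 and a single half-cosine decay down to 0; the notebooks may differ in these details.

```python
import math


def warmup_cosine_factor(step, total_steps, warmup_fraction=0.2):
    """Multiplier in [0, 1] applied to the base learning rate at `step`.

    warmup_fraction=0.2 mirrors the "1/5 of all epochs" warmup mentioned above.
    """
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate.
        return step / max(1, warmup_steps)
    # Cosine decay from 1 down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
```

In PyTorch, a function like this could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, for example `LambdaLR(optimizer, lambda s: warmup_cosine_factor(s, total_steps))`, where `optimizer` and `total_steps` are whatever the training loop defines.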
 96 | 
 97 | - VisionTransformer.ipynb
 98 | 
 99 | On the CIFAR10 dataset the accuracy is not very high. (It is comparable to the 74.5% achieved by a very simple CNN architecture.)
100 | 
101 | (simple CNN architecture : https://github.com/jo1jun/CS231N/blob/main/2020/assignment2/PyTorch.ipynb at the bottom)
102 | 
103 | The value of this notebook lies in showing that image classification is possible with a transformer architecture that completely excludes CNNs.
104 | 
105 | As the paper states, ViT has a weaker inductive bias than CNN-based architectures, so it performs worse when the dataset is small.
106 | 
107 | However, when pre-trained on a large dataset, it outperforms CNN-based architectures (which rely on inductive bias).
108 | 
109 | - VisualizeAttentionMap.ipynb
110 | 
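111 | To make the attention easier to see, the MNIST pixel values were inverted.
112 | 
113 | The digit pixels become close to 0, so the inside of each digit appears black, but the attention shows up along the digit's outline.
114 | 
115 | Regions far from the digit receive little attention, so their color shifts from yellow toward black.
116 | 
117 | The goal here was to understand and implement the core ideas, so the details will be revisited later.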
118 | 
119 | - ImportedVisionTransformer.ipynb
120 | 
121 | The module imported from https://huggingface.co/transformers/model_doc/vit.html#vitmodel and the custom model reach similar accuracy with the same hyperparameters, which confirms that the custom model is implemented correctly.
122 | 
123 | When fine-tuning on CIFAR10, the earlier layers were not frozen, so a smaller learning rate was used for fine-tuning.
124 | 
125 | Even without careful hyperparameter tuning, the model reaches over 96% accuracy, which shows the power of transfer learning.
126 | 
127 | Adjusting the hyperparameters as proposed in the paper should give accuracy similar to the paper's.
128 | 
129 | ## TODO
130 | 1.  Apply the model to downstream tasks
131 | 
132 | 2.  Try implementing ideas for controlling inductive bias
133 | 


--------------------------------------------------------------------------------