# [Study] 거꾸로 읽는 self-supervised-learning Season 3: Visual Language Models

- [Application Google Form (registration closed 2/18)](https://forms.gle/f7ZmUkfxNxBT1VUD8)
- 거꾸로 읽는 SSL has now moved on to the VLM field! :)
- We focus on Visual Language Model papers, a field that has been advancing rapidly since 2021, and review the ones that turned out to be significant.
- For each paper, we have a fun discussion about the characteristics of the proposed method and why it is regarded as historically important.
- [Paper list Google Sheet](https://docs.google.com/spreadsheets/d/1P-pACgU9G0xq6M9Gufad-3tLUBavSMyUL0NIdd6TVH8/edit#gid=542739927)

## Schedule (tentative)
- 2023 3/4 ~ 6/3 (14 weeks)

## Papers and Presentation Order
Week | Type | Paper title | Affiliation | arXiv date | Speaker | YouTube
-- | -- | -- | -- | -- | -- | --
Week 1 | VLM benchmarks and metrics | Introduction to VLM benchmarks and metrics | | | 강재욱 | [youtube1](https://youtu.be/NgxSbyoiQYM), [youtube2](https://youtu.be/8ofFVYPS8vA)
Week 2 | Vision transformer | ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | Google | 2020 Oct | 이인규 | [youtube](https://youtu.be/7lxKirsixuQ)
Week 3 | Dual encoder | CLIP: Learning Transferable Visual Models From Natural Language Supervision | OpenAI | 2021 Feb | 김희은 | [youtube](https://youtu.be/BoDFT85-Z8U)
Week 4 | Image-text matching | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | MS | 2020 Apr | 신성호 | [youtube](https://youtu.be/SUgK6EUlbA0)
Week 5 | Image-text contrastive learning | ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Salesforce | 2021 Jul | 이유경 | [youtube](https://youtu.be/O-tQ-hgCmQo)
Week 6 | Masked image modeling | BEiT: BERT Pre-Training of Image Transformers | MS | 2021 Jun | 박민지 |
Week 7 | Masked VLM | Masked Vision and Language Modeling for Multi-modal Representation Learning | Amazon | 2022 Aug | 김강민 | [youtube](https://youtu.be/1Sil6dTXM-8)
Week 8 | Multimodal fusion by MoE | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | MS | 2021 Nov | 백혜림 | [youtube](https://youtu.be/ZYpmDtdeSk4)
Week 9 | Multimodal fusion by merged attention | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | Google | 2021 Aug | 정윤성 | [youtube](https://youtu.be/OMDu5r-Rm-c)
Week 10 | Multimodal fusion by co-attention | CoCa: Contrastive Captioners are Image-Text Foundation Models | Google | 2022 May | 김승우 | [youtube](https://youtu.be/JSfQQtZ3Ios)
Week 11 | Few-shot learning in VLM | Flamingo: a Visual Language Model for Few-Shot Learning | DeepMind | 2022 Apr | 조성국 | [youtube](https://youtu.be/ihdHsoPin84)
Week 12 | Model scaling for VLM 1 | GIT: A Generative Image-to-text Transformer for Vision and Language | MS | 2022 May | 김기범 | [youtube](https://youtu.be/CIdlpeIGDuM)
Week 13 | Model scaling for VLM 2 | PaLI: A Jointly-Scaled Multilingual Language-Image Model | Google | 2022 Sep | 이영수 |
Week 14 | Wrap-up | Recap of the overall flow | | | 강재욱 |

## Related Links
- [거꾸로 읽는 SSL Season 1](https://youtube.com/playlist?list=PLMSTs9nojhszOnaAwOg42NEsH_Jn6405o)
- [거꾸로 읽는 SSL Season 2](https://youtube.com/playlist?list=PLMSTs9nojhszeFer8gYnEI5yA5JenWzEA)
- [거꾸로 읽는 SSL YouTube channel](https://www.youtube.com/channel/UCTwcUmKhqeBhG0rQHkPVP6Q)