# Awesome Medical Vision Language Learning

## Contents
* [Datasets](#datasets)
* [Survey](#survey)
* [Tutorial](#tutorial)
* [Vision Language Pretraining](#vision-language-pretraining)
* [Vision Language Task](#vision-language-task)

## Datasets

| Dataset | Year | Modality | Images | Text |
|---------|------|----------|--------|------|
| MIMIC-CXR [[data](https://mimic.mit.edu/docs/iv/modules/cxr/)][[paper](https://arxiv.org/pdf/1901.07042.pdf)] | 2019 | Chest X-ray | 377,110 | 227,827 |
| CheXpert [[data](https://stanfordmlgroup.github.io/competitions/chexpert)][[paper](https://arxiv.org/pdf/1901.07031.pdf)] | 2019 | Chest X-ray | 224,316 | 224,316 |
| ROCO [[data](https://github.com/razorx89/roco-dataset)][[paper](https://labels.tue-image.nl/wp-content/uploads/2018/09/AM-04.pdf)] | 2018 | CT, Ultrasound, X-ray, Fluoroscopy, PET, Mammography, MRI, Angiography, PET-CT | 81,825 | 81,825 |
| MedICaT [[data](https://github.com/allenai/medicat)][[paper](https://arxiv.org/pdf/2010.06000v1.pdf)] | 2020 | CT, Ultrasound, X-ray, Fluoroscopy, PET, Mammography, MRI, Angiography, PET-CT | 217,060 | 217,060 |

## Survey

- VLP: A Survey on Vision-Language Pre-training. arxiv 2022. [[paper](https://arxiv.org/pdf/2202.09061.pdf)]

- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends. arxiv 2022. [[paper](https://arxiv.org/pdf/2210.09263.pdf)]

- Beyond Medical Imaging: A Review of Multimodal Deep Learning in Radiology. techrxiv 2022. [[paper](https://www.researchgate.net/profile/Jan-Egger-2/publication/358581125_Beyond_Medical_Imaging_A_Review_of_Multimodal_Deep_Learning_in_Radiology/links/620a1e5a7b05f82592ea5bda/Beyond-Medical-Imaging-A-Review-of-Multimodal-Deep-Learning-in-Radiology.pdf)]

## Tutorial

- Vision-Language Pretraining: Current Trends and the Future. ACL 2022. [[link](https://vlp-tutorial-acl2022.github.io/)]

- Recent Advances in Vision-and-Language Pre-training. CVPR 2022. [[link](https://vlp-tutorial.github.io/2022/)]

## Vision Language Pretraining

### Text Encoder

| Text Encoder | Year | Corpus |
|--------------|------|------------------------|
| [BioBERT](https://github.com/dmis-lab/biobert) | 2020 | PubMed |
| [ClinicalBERT](https://arxiv.org/abs/1904.05342) | 2019 | MIMIC-III |
| [PubMedBERT](https://dl.acm.org/doi/10.1145/3458754) | 2022 | PubMed |
| [CXR-BERT](https://arxiv.org/abs/2204.09817) | 2022 | PubMed + MIMIC-III/CXR |

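All of these encoders are published on the HuggingFace hub and load through the standard `transformers` API. A minimal sketch, assuming the hub ID `dmis-lab/biobert-v1.1` from the BioBERT repo (the example report sentence is made up; CXR-BERT additionally requires `trust_remote_code=True` for its custom model class):

```python
from transformers import AutoTokenizer, AutoModel

# Hub ID taken from the BioBERT repo linked above; swap in another
# encoder from the table for a different domain-specific vocabulary.
name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("No acute cardiopulmonary abnormality.", return_tensors="pt")
outputs = model(**inputs)
report_embedding = outputs.last_hidden_state[:, 0]  # [CLS] embedding, shape (1, 768)
```
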
### How to Train

**2023**

- PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents. arxiv 2023. [[paper](https://aps.arxiv.org/pdf/2303.07240.pdf)][[code](https://github.com/WeixiongLin/PMC-CLIP)]

- [BiomedCLIP] Large-scale Domain-specific Pretraining for Biomedical Vision-Language Processing. arxiv 2023. [[paper](https://arxiv.org/pdf/2303.00915.pdf)][[model](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224)]

- Vision-Language Modelling for Radiological Imaging and Reports in the Low Data Regime. MIDL 2023. [[paper](https://arxiv.org/pdf/2303.17644.pdf)]

- [PTUnifier] Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts. arxiv 2023. [[paper](https://arxiv.org/pdf/2302.08958.pdf)][[code](https://github.com/zhjohnchan/PTUnifier)]

- [MRM] Advancing Radiograph Representation Learning with Masked Record Modeling. ICLR 2023. [[paper](https://openreview.net/forum?id=w-x7U26GM7j)][[code](https://github.com/RL4M/MRM-pytorch)]

- [BioViL-T] Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing. CVPR 2023. [[paper](https://arxiv.org/pdf/2301.04558.pdf)]

- MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training. arxiv 2023. [[paper](https://arxiv.org/pdf/2301.02228.pdf)][[code](https://chaoyi-wu.github.io/MedKLIP/)]

**2022**

- [MGCA] Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning. NeurIPS 2022. [[paper](http://arxiv.org/abs/2210.06044)][[code](https://github.com/fuying-wang/MGCA)]

- MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. EMNLP 2022. [[paper](https://arxiv.org/pdf/2210.10163.pdf)][[code](https://github.com/RyanWangZf/MedCLIP)]

- [M3AE] Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training. MICCAI 2022. [[paper](https://arxiv.org/pdf/2209.07098.pdf)][[code](https://github.com/zhjohnchan/M3AE)]

- Breaking with Fixed Set Pathology Recognition through Report-Guided Contrastive Training. MICCAI 2022. [[paper](https://arxiv.org/pdf/2205.07139.pdf)]

- [ARL] Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. MM 2022. [[paper](https://arxiv.org/pdf/2209.07118.pdf)][[code](https://github.com/zhjohnchan/ARL)]

- [MedViLL] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training. JBHI 2022. [[paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9894658)][[code](https://github.com/SuperSupermoon/MedViLL)]

- [REFERS] Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence 2022. [[paper](https://arxiv.org/abs/2111.03452)][[code](https://github.com/funnyzhou/REFERS)]

- [BioViL] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. ECCV 2022. [[paper](https://arxiv.org/pdf/2204.09817.pdf)]

- [LoVT] Joint learning of localized representations from medical images and reports. ECCV 2022. [[paper](https://link.springer.com/chapter/10.1007/978-3-031-19809-0_39)]

**2021**

- [Local-MI] Multimodal Representation Learning via Maximization of Local Mutual Information. MICCAI 2021. [[paper](https://link.springer.com/chapter/10.1007/978-3-030-87196-3_26)]

- GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition. ICCV 2021. [[paper](https://ieeexplore.ieee.org/document/9710099/)]

- Self-supervised Image-text Pre-training with Mixed Data in Chest X-rays. arxiv 2021. [[paper](https://arxiv.org/pdf/2103.16022.pdf)]

**2020**

- A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports. BIBM 2020. [[paper](https://ieeexplore.ieee.org/abstract/document/9313289)]

- [ConVIRT] Contrastive Learning of Medical Visual Representations from Paired Images and Text. arxiv 2020 / MLHC 2022. [[paper](http://arxiv.org/abs/2010.00747)][[code](https://github.com/yuhaozhang/convirt)]

**2018**

- Unsupervised Multimodal Representation Learning across Medical Images and Reports. NeurIPS workshop 2018. [[paper](https://arxiv.org/pdf/1811.08615.pdf)]

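Many of the methods in this section (ConVIRT, GLoRIA's global branch, MGCA, MedCLIP, PMC-CLIP) train the image and text encoders with a CLIP-style bidirectional InfoNCE objective over paired images and reports. A minimal PyTorch sketch of that shared objective, not any single paper's implementation:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Bidirectional InfoNCE over a batch of paired image/report embeddings.

    img_emb, txt_emb: (batch, dim) outputs of the two encoders; matched
    pairs sit on the diagonal of the similarity matrix.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> report
    loss_t2i = F.cross_entropy(logits.t(), targets)    # report -> image
    return (loss_i2t + loss_t2i) / 2
```
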
### How to Use

**2023**

- Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study. ICLR 2023. [[paper](https://arxiv.org/pdf/2209.15517.pdf)]

**2022**

- Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains. NeurIPS workshop 2022. [[paper](http://arxiv.org/abs/2210.04133)]

**2021**

- [PubMedCLIP] Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? arxiv 2021. [[paper](https://arxiv.org/pdf/2112.13906.pdf)][[code](https://github.com/sarahESL/PubMedCLIP)]

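As a concrete example of using a pretrained medical vision-language model, below is a zero-shot classification sketch with BiomedCLIP (listed under How to Train) through the `open_clip` library, adapted from its HuggingFace model card; the image path, label set, and prompt template are placeholders:

```python
import torch
from PIL import Image
import open_clip

hub_id = 'hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224'
model, preprocess = open_clip.create_model_from_pretrained(hub_id)
tokenizer = open_clip.get_tokenizer(hub_id)
model.eval()

labels = ['chest X-ray', 'brain MRI', 'histopathology']      # placeholder labels
image = preprocess(Image.open('example.png')).unsqueeze(0)   # placeholder image
texts = tokenizer([f'this is a photo of a {l}' for l in labels])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)  # one probability per label
```
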
## Vision Language Task

Refer to [Awesome-Multimodal-Applications-In-Medical-Imaging](https://github.com/Richard88888/awesome-multimodal-in-medical-imaging) for more papers.

### Segmentation

- LViT: Language meets Vision Transformer in Medical Image Segmentation. arxiv 2022. [[paper](http://arxiv.org/abs/2206.14718)][[code](https://github.com/HUANGLIZI/LViT)]

### Generation

- RoentGen: Vision-Language Foundation Model for Chest X-ray Generation. arxiv 2022. [[paper](http://arxiv.org/abs/2211.12737)]