# Awesome Medical Vision Language Learning

## Contents

* [Datasets](#datasets)
* [Survey](#survey)
* [Tutorial](#tutorial)
* [Vision Language Pretraining](#vision-language-pretraining)
* [Vision Language Task](#vision-language-task)

## Datasets

| Dataset | Year | Modality | Images | Texts |
|---------|------|----------|--------|-------|
| MIMIC-CXR [[data](https://mimic.mit.edu/docs/iv/modules/cxr/)] [[paper](https://arxiv.org/pdf/1901.07042.pdf)] | 2019 | Chest X-ray | 377,110 | 227,827 |
| CheXpert [[data](https://stanfordmlgroup.github.io/competitions/chexpert)] [[paper](https://arxiv.org/pdf/1901.07031.pdf)] | 2019 | Chest X-ray | 224,316 | 224,316 |
| ROCO [[data](https://github.com/razorx89/roco-dataset)] [[paper](https://labels.tue-image.nl/wp-content/uploads/2018/09/AM-04.pdf)] | 2018 | CT, Ultrasound, X-ray, Fluoroscopy, PET, Mammography, MRI, Angiography, PET-CT | 81,825 | 81,825 |
| MedICaT [[data](https://github.com/allenai/medicat)] [[paper](https://arxiv.org/pdf/2010.06000v1.pdf)] | 2020 | CT, Ultrasound, X-ray, Fluoroscopy, PET, Mammography, MRI, Angiography, PET-CT | 217,060 | 217,060 |
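MIMIC-CXR and CheXpert ship structured label files alongside the images. As a minimal sketch of getting started with one of them, the snippet below reads a CheXpert-style `train.csv` with pandas; the path, the column layout, and the 1/0/-1/blank label convention follow the CheXpert paper, but the exact file layout of your download may differ.

```python
import pandas as pd

# Assumed location of the unpacked CheXpert release -- adjust as needed.
CSV_PATH = "CheXpert-v1.0-small/train.csv"

df = pd.read_csv(CSV_PATH)

# CheXpert marks each of 14 observations as 1.0 (positive), 0.0 (negative),
# -1.0 (uncertain), or blank (not mentioned). The "U-Ones" baseline from the
# paper maps uncertain labels to positive before training.
label_cols = df.columns[5:]          # observations follow the 5 metadata columns
labels = df[label_cols].fillna(0.0)  # treat "not mentioned" as negative
labels = labels.replace(-1.0, 1.0)   # U-Ones: uncertain -> positive

print(df["Path"].head())             # relative paths to the X-ray images
print(labels.head())
```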
## Survey

- VLP: A Survey on Vision-Language Pre-training. arXiv 2022. [[paper](https://arxiv.org/pdf/2202.09061.pdf)]

- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends. arXiv 2022. [[paper](https://arxiv.org/pdf/2210.09263.pdf)]

- Beyond Medical Imaging: A Review of Multimodal Deep Learning in Radiology. TechRxiv 2022. [[paper](https://www.researchgate.net/profile/Jan-Egger-2/publication/358581125_Beyond_Medical_Imaging_A_Review_of_Multimodal_Deep_Learning_in_Radiology/links/620a1e5a7b05f82592ea5bda/Beyond-Medical-Imaging-A-Review-of-Multimodal-Deep-Learning-in-Radiology.pdf)]

## Tutorial

- Vision-Language Pretraining: Current Trends and the Future. ACL 2022. [[link](https://vlp-tutorial-acl2022.github.io/)]

- Recent Advances in Vision-and-Language Pre-training. CVPR 2022. [[link](https://vlp-tutorial.github.io/2022/)]

## Vision Language Pretraining

### Text Encoder

| Text Encoder | Year | Corpus |
|--------------|------|--------|
| [BioBERT](https://github.com/dmis-lab/biobert) | 2020 | PubMed |
| [ClinicalBERT](https://arxiv.org/abs/1904.05342) | 2019 | MIMIC-III |
| [PubMedBERT](https://dl.acm.org/doi/10.1145/3458754) | 2022 | PubMed |
| [CXR-BERT](https://arxiv.org/abs/2204.09817) | 2022 | PubMed + MIMIC-III/CXR |
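All four encoders have checkpoints on the Hugging Face Hub and drop into the standard `transformers` API. Below is a minimal sketch of embedding a report sentence with BioBERT; the Hub model ID is the commonly used community name and is an assumption here, so check each project's page for the official checkpoint.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dmis-lab/biobert-v1.1"  # assumed Hub ID; swap in another encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

inputs = tokenizer(
    "No focal consolidation, pleural effusion, or pneumothorax.",
    return_tensors="pt",
)
outputs = model(**inputs)

# Use the [CLS] token embedding as a sentence-level report representation.
report_embedding = outputs.last_hidden_state[:, 0]
print(report_embedding.shape)  # e.g. torch.Size([1, 768])
```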
### How to Train

**2023**

- PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents. arXiv 2023. [[paper](https://aps.arxiv.org/pdf/2303.07240.pdf)][[code](https://github.com/WeixiongLin/PMC-CLIP)]

- [BiomedCLIP] Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing. arXiv 2023. [[paper](https://arxiv.org/pdf/2303.00915.pdf)][[model](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224)]

- Vision-Language Modelling for Radiological Imaging and Reports in the Low Data Regime. MIDL 2023. [[paper](https://arxiv.org/pdf/2303.17644.pdf)]

- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts. arXiv 2023. [[paper](https://arxiv.org/pdf/2302.08958.pdf)][[code](https://github.com/zhjohnchan/PTUnifier)]

- [MRM] Advancing Radiograph Representation Learning with Masked Record Modeling. ICLR 2023. [[paper](https://openreview.net/forum?id=w-x7U26GM7j)][[code](https://github.com/RL4M/MRM-pytorch)]

- [BioViL-T] Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing. CVPR 2023. [[paper](https://arxiv.org/pdf/2301.04558.pdf)]

- MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training. arXiv 2023. [[paper](https://arxiv.org/pdf/2301.02228.pdf)][[code](https://chaoyi-wu.github.io/MedKLIP/)]

**2022**

- [MGCA] Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning. NeurIPS 2022. [[paper](http://arxiv.org/abs/2210.06044)][[code](https://github.com/fuying-wang/MGCA)]

- MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. EMNLP 2022. [[paper](https://arxiv.org/pdf/2210.10163.pdf)][[code](https://github.com/RyanWangZf/MedCLIP)]

- [M3AE] Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training. MICCAI 2022. [[paper](https://arxiv.org/pdf/2209.07098.pdf)][[code](https://github.com/zhjohnchan/M3AE)]

- Breaking with Fixed Set Pathology Recognition through Report-Guided Contrastive Training. MICCAI 2022. [[paper](https://arxiv.org/pdf/2205.07139.pdf)]

- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. MM 2022. [[paper](https://arxiv.org/pdf/2209.07118.pdf)][[code](https://github.com/zhjohnchan/ARL)]

- [MedViLL] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training. JBHI 2022. [[paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9894658)][[code](https://github.com/SuperSupermoon/MedViLL)]

- [REFERS] Generalized Radiograph Representation Learning via Cross-Supervision between Images and Free-Text Radiology Reports. Nature Machine Intelligence 2022. [[paper](https://arxiv.org/abs/2111.03452)][[code](https://github.com/funnyzhou/REFERS)]

- [BioViL] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. ECCV 2022. [[paper](https://arxiv.org/pdf/2204.09817.pdf)]

- [LoVT] Joint Learning of Localized Representations from Medical Images and Reports. ECCV 2022. [[paper](https://link.springer.com/chapter/10.1007/978-3-031-19809-0_39)]

**2021**

- [Local-MI] Multimodal Representation Learning via Maximization of Local Mutual Information. MICCAI 2021. [[paper](https://link.springer.com/chapter/10.1007/978-3-030-87196-3_26)]

- GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition. ICCV 2021. [[paper](https://ieeexplore.ieee.org/document/9710099/)]

- Self-supervised Image-Text Pre-training with Mixed Data in Chest X-rays. arXiv 2021. [[paper](https://arxiv.org/pdf/2103.16022.pdf)]

**2020**

- A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports. BIBM 2020. [[paper](https://ieeexplore.ieee.org/abstract/document/9313289)]

- [ConVIRT] Contrastive Learning of Medical Visual Representations from Paired Images and Text. MLHC 2022. [[paper](http://arxiv.org/abs/2010.00747)][[code](https://github.com/yuhaozhang/convirt)]

**2018**

- Unsupervised Multimodal Representation Learning across Medical Images and Reports. NeurIPS Workshop 2018. [[paper](https://arxiv.org/pdf/1811.08615.pdf)]
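Despite their differences, many of the methods above (ConVIRT, GLoRIA's global branch, MedCLIP, PMC-CLIP) build on the same CLIP-style symmetric image-report contrastive objective. Below is a minimal PyTorch sketch of that shared core, not any single paper's exact loss:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb: torch.Tensor,
                    txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/report embeddings.

    img_emb, txt_emb: (batch, dim) outputs of the image and text encoders.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (batch, batch) similarity
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal: contrast each image against every
    # report in the batch, and each report against every image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Dummy usage with random "encoder outputs":
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)))
```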
### How to Use

**2023**

- Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study. ICLR 2023. [[paper](https://arxiv.org/pdf/2209.15517.pdf)]

**2022**

- Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains. NeurIPS Workshop 2022. [[paper](http://arxiv.org/abs/2210.04133)]

**2021**

- [PubMedCLIP] Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? arXiv 2021. [[paper](https://arxiv.org/pdf/2112.13906.pdf)][[code](https://github.com/sarahESL/PubMedCLIP)]
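A recurring recipe in these "how to use" papers is zero-shot classification: score an image against one text prompt per class and pick the best match. Here is a hedged sketch of that idea, assuming a CLIP-like model exposing `encode_image`/`encode_text` (method names and the prompt template vary across the projects above):

```python
import torch

@torch.no_grad()
def zero_shot_predict(model, tokenizer, image_tensor, class_names):
    """Score one preprocessed image against a prompt per class.

    Assumes a CLIP-like interface (encode_image / encode_text); the prompt
    template below is illustrative, not prescribed by any one paper.
    """
    prompts = [f"a chest x-ray showing {name}" for name in class_names]
    text_tokens = tokenizer(prompts)
    img = model.encode_image(image_tensor.unsqueeze(0))
    txt = model.encode_text(text_tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (img @ txt.t()).softmax(dim=-1)  # (1, num_classes)
    return dict(zip(class_names, probs.squeeze(0).tolist()))

# e.g. zero_shot_predict(model, tokenizer, image, ["pneumonia", "no finding"])
```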
## Vision Language Task

Refer to [Awesome-Multimodal-Applications-In-Medical-Imaging](https://github.com/Richard88888/awesome-multimodal-in-medical-imaging) for more papers.

### Segmentation

- LViT: Language meets Vision Transformer in Medical Image Segmentation. arXiv 2022. [[paper](http://arxiv.org/abs/2206.14718)][[code](https://github.com/HUANGLIZI/LViT)]

### Generation

- RoentGen: Vision-Language Foundation Model for Chest X-ray Generation. arXiv 2022. [[paper](http://arxiv.org/abs/2211.12737)]
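RoentGen adapts a latent diffusion model so that radiology-report text conditions chest X-ray synthesis. Its fine-tuned weights are not bundled with this list, but the general report-to-image pattern looks like the sketch below, which uses the `diffusers` library with a placeholder checkpoint ID (an assumption for illustration, not a released RoentGen model):

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint ID -- substitute a real report-conditioned model.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/report-finetuned-stable-diffusion",
    torch_dtype=torch.float16,
).to("cuda")

# Condition generation on report-style text, as in RoentGen.
prompt = "Chest X-ray: right basilar consolidation, no pleural effusion."
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("generated_cxr.png")
```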