├── images
│   ├── 398f94238fe61990ba3dd93ec6e1357359d45541ac7c22a06f0cb804f3bc2b4e.png
│   ├── 50d131269405f43de1d95d747d9f7321d3a46bc87e3a2758286c837f8dec379a.png
│   ├── 8f3a6cf2fec0f679487196ed6c48f94e076ae29ae311f8a888fcc8ce23e73e7c.png
│   ├── b94662b3d3344af40518360bbf617a97bda2baf867d56e7463170c0b64d32101.png
│   └── ee9c18e6d50fb94df01b5ff11283fd3128d6b9f0c7e103e70c07887bf94a71d2.png
└── README.md

/README.md:
--------------------------------------------------------------------------------

# Awesome-Multimodal-Large-Language-Models-With-Grounding
> A curated list of Multimodal Large Language Models (also known as Large Vision-Language Models) with grounding ability.

## Table of Contents
- [Awesome-Multimodal-Large-Language-Models-With-Grounding](#awesome-multimodal-large-language-models-with-grounding)
  - [Table of Contents](#table-of-contents)
  - [🔥 Large Vision-Language Model](#-large-vision-language-model)
    - [Grounding](#grounding)
    - [Referring](#referring)
    - [Training Dataset](#training-dataset)
    - [Training Recipe](#training-recipe)
    - [Evaluation Dataset](#evaluation-dataset)
    - [Paper List](#paper-list)
  - [🔥 Multi-modality](#-multi-modality)

## 🔥 Large Vision-Language Model

### Grounding
| Format | Desc | Paper |
|------------|-------|--------|
| Decoder on latent | Leverage a decoder to ground from latent embeddings | [PerceptionGPT](https://arxiv.org/pdf/2311.06612), [NExT-Chat](https://arxiv.org/pdf/2311.04498), [PSALM](http://arxiv.org/abs/2403.14598), [PixelLM](http://arxiv.org/abs/2312.02228), [u-LLaVA](http://arxiv.org/abs/2311.05348), [GSVA](http://arxiv.org/abs/2312.10103), [ChatterBox](http://arxiv.org/abs/2401.13307), [GLaMM](http://arxiv.org/abs/2311.03356) |
| Output numerical coordinates | Directly output numerical coordinate tokens | [Shikra](https://arxiv.org/pdf/2306.15195), [VisionLLM](https://proceedings.neurips.cc/paper_files/paper/2023/file/c1f7b1ed763e9c75e4db74b49b76db5f-Paper-Conference.pdf), [Ferret](http://arxiv.org/abs/2310.07704), [Ferret2](http://arxiv.org/abs/2404.07973), [CogVLM](http://arxiv.org/abs/2311.03079) |
| Output token coordinates | Output new tokens added to the vocabulary to refer to positions | [Kosmos-2](https://arxiv.org/pdf/2306.14824) |
| Pixel space | Output in a discrete pixel space encoded by VQGAN | [Unified-IO](https://arxiv.org/abs/2206.08916), [Unified-IO 2](http://arxiv.org/abs/2312.17172) |
| Proposal retrieval | Retrieve from region candidates | [LLM-Seg](http://arxiv.org/abs/2404.08767), [Kosmos-2](https://arxiv.org/pdf/2306.14824), [GROUNDHOG](http://arxiv.org/abs/2402.16846) |

### Referring

| Format | Desc | Paper |
|------------|-------|--------|
| Pooling | Leverage Mask Pooling / RoI Pooling / RoI Align to obtain features from the image encoder output | [Groma](http://arxiv.org/abs/2404.13013), [GPT4RoI](https://arxiv.org/pdf/2307.03601), [Osprey](https://arxiv.org/pdf/2312.10032), [PSALM](http://arxiv.org/abs/2403.14598), [GROUNDHOG](http://arxiv.org/abs/2402.16846), [Ferret](http://arxiv.org/abs/2310.07704), [Ferret2](http://arxiv.org/abs/2404.07973), [PVIT](https://arxiv.org/pdf/2308.13437), [ChatterBox](http://arxiv.org/abs/2401.13307), [GLaMM](http://arxiv.org/abs/2311.03356) |
| Numerical coordinates | Leverage numerical coordinates for referring (bbox / sampled points in a mask) | [Shikra](https://arxiv.org/pdf/2306.15195), [PerceptionGPT](https://arxiv.org/pdf/2311.06612) (w/ encoder), [NExT-Chat](https://arxiv.org/pdf/2311.04498) (w/ encoder), [CogVLM](http://arxiv.org/abs/2311.03079) |
| Token coordinates | Add new tokens to the vocabulary to represent spatial positions | [Kosmos-2](https://arxiv.org/pdf/2306.14824) |

* w/ encoder: refers to using an encoder to encode the input coordinates.

### Training Dataset

| Dataset | Source | Data Source | Quantity | Construction Method |
|------------|--------------|--------------|--------------|--------------|
| GRIT | [Ferret](http://arxiv.org/abs/2310.07704) | COYO-700M, LAION-2B | - |
  • Templates are used to convert existing data into instruction-following format.
  • SAM is used to generate masks for free-form referring.
  • GPT-4 is used to generate dialogues with bounding boxes.
  • Use GLIPv2 to ground groundable nouns in LLaVA-158k.
  • Negative mining: generate negative yes/no questions |
| Shikra-RD | [Shikra](https://arxiv.org/pdf/2306.15195) | Flickr30K Entities | 5,922 QA pairs | GPT-4 ==> Referential Dialogue (CoT dialogues with grounding & referring) |
| CB-300K | [ChatterBox](http://arxiv.org/abs/2401.13307) | VG | 717,075 QA pairs | 4 subsets.
  • CB-MRG: Use ChatGPT to write dialogues with bbox
  • CB-LC: extend strict relations (from the scene graph) to multi-turn QA with ChatGPT
  • CB-REF: referring expression generation (REG) task
  • CB-GND: grounding task |
| GranD | [GLaMM](http://arxiv.org/abs/2311.03356) | SA-1B | 11M images with 7.5M unique concepts and 810M regions | Automated annotation pipeline with SAM for dense pixel-wise grounding. Used for pretraining. |
| GranD-f | [GLaMM](http://arxiv.org/abs/2311.03356) | GranD (refined), Flickr30K, RefCOCOg, and PSG | ~214K image-grounded text pairs | Refined subset of GranD for fine-tuning, with 1000 images held out for human-annotated evaluation |

### Training Recipe
| Model | Recipe |
|------------|--------------|
| [Ferret](http://arxiv.org/abs/2310.07704) |
  • Use LLaVA pretrained weights
  • SFT on GRIT |
| [Ferret2](http://arxiv.org/abs/2404.07973) |
  • image-caption alignment on 1.4M image-text pairs
  • high-resolution dense alignment with template referring & grounding
  • instruction tuning with GRIT, VQA and OCR (VQA and OCR are augmented with GLIPv2 bbox) |
| [ChatterBox](http://arxiv.org/abs/2401.13307) | Trainable: LoRA and location decoder
  • warm-up training with a visual-grounding-only dataset.
  • instruction tuning with CB-300K |
| [GPT4RoI](https://arxiv.org/pdf/2307.03601) |
  • Use LLaVA pretrained weights
  • pretrain region feature extractor with text-region datasets (COCO, RefCOCO, RefCOCO+)
  • train connector, region feature extractor and LLM to follow instructions |
| [GLaMM](http://arxiv.org/abs/2311.03356) |
  • Use [GPT4RoI](https://arxiv.org/pdf/2307.03601) pretrained weights
  • pretrain on 11M GranD with LoRA
  • finetune on GranD-f, LLaVA-Instruct-150K and LLaVA-Instruct-80K |

### Evaluation Dataset
| Dataset | Source | Data Source | Quantity | Construction Method |
|------------|--------------|--------------|--------------|--------------|
| Ferret Bench | [Ferret](http://arxiv.org/abs/2310.07704) | COCO validation set | 120 |
  • Referring Description: models are asked to describe a referred region based on its interaction with surrounding objects.
  • Referring Reasoning: models need to reason on top of one or more referred regions correctly.
  • Grounding in Conversation: models are required to reason correctly and accurately ground/localize the objects/regions necessary for the reasoning. |

### Paper List

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

[Paper](https://arxiv.org/pdf/2307.03601) | [Github](https://github.com/jshilong/GPT4RoI)

1. proposes referring for MLLMs by replacing a region placeholder token in the instruction with the feature obtained by mask pooling

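A minimal sketch of this placeholder-splicing style of referring, assuming a PyTorch embedding table, a reserved placeholder id, and region features already pooled from the image encoder (all names below are illustrative, not GPT4RoI's actual API):

```python
import torch

def splice_region_features(input_ids, region_feats, embed, placeholder_id):
    """Replace placeholder-token embeddings with pooled region features.

    input_ids:      (seq_len,) token ids containing one placeholder per referred region
    region_feats:   (num_regions, hidden) features pooled from the image encoder
    embed:          nn.Embedding mapping token ids to hidden-size vectors
    placeholder_id: id of the reserved region placeholder token (hypothetical)
    """
    inputs_embeds = embed(input_ids).clone()               # (seq_len, hidden)
    slots = (input_ids == placeholder_id).nonzero(as_tuple=True)[0]
    assert len(slots) == region_feats.shape[0], "one placeholder per referred region"
    inputs_embeds[slots] = region_feats.to(inputs_embeds.dtype)
    return inputs_embeds                                   # pass to the LLM as inputs_embeds

# toy usage: token id 31999 stands in for the placeholder
embed = torch.nn.Embedding(32000, 4096)
ids = torch.tensor([1, 887, 31999, 29973])
out = splice_region_features(ids, torch.randn(1, 4096), embed, placeholder_id=31999)
```
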
Osprey: Pixel Understanding with Visual Instruction Tuning

[Paper](https://arxiv.org/pdf/2312.10032) | [Github](https://github.com/CircleRadon/Osprey)

1. similar to GPT4RoI, Osprey also uses a mask representation to refer to entities in images.
2. It uses mask pooling to extract semantic features from the image encoder and combines them with a location extractor that processes the mask and outputs a spatial token.

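A rough sketch of the mask pooling step described above: average the image-encoder features that fall inside a binary region mask (the bilinear downsampling of the mask is an assumption, not necessarily Osprey's exact choice):

```python
import torch
import torch.nn.functional as F

def mask_pool(feat_map, mask):
    """Average image-encoder features inside a binary region mask.

    feat_map: (C, H, W) dense features from the image encoder
    mask:     (h, w) binary mask at image resolution (values in {0, 1})
    returns:  (C,) pooled region feature
    """
    c, hf, wf = feat_map.shape
    m = F.interpolate(mask[None, None].float(), size=(hf, wf),
                      mode="bilinear", align_corners=False)[0, 0]
    m = (m > 0.5).float()                                  # re-binarize at feature resolution
    denom = m.sum().clamp(min=1.0)
    return (feat_map * m).flatten(1).sum(dim=1) / denom

mask = torch.zeros(336, 336)
mask[100:200, 120:260] = 1.0                               # rectangular region for the toy example
pooled = mask_pool(torch.randn(256, 24, 24), mask)         # e.g. a 24x24 ViT feature map
```
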
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

[Paper](https://proceedings.neurips.cc/paper_files/paper/2023/file/c1f7b1ed763e9c75e4db74b49b76db5f-Paper-Conference.pdf) | [Github](https://github.com/OpenGVLab/VisionLLM)

1. unified interface for vision and VL tasks: points for detection, sampled points for instance segmentation ==> instruction format for training
2. extra tokens & output-format-as-query decoding (faster)

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

[Paper](https://arxiv.org/abs/2206.08916) | [Github](https://github.com/allenai/unified-io-inference) | [Project](https://unified-io.allenai.org/)

1. creates a unified IO for all sorts of vision and VL tasks (everything is converted into discrete tokens)
2. uses a T5-like encoder-decoder architecture

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

[Paper](http://arxiv.org/abs/2312.17172) | [Github](https://github.com/allenai/unified-io-2) | [Project](https://unified-io-2.allenai.org/)

1. following Unified-IO v1, creates a unified IO for all sorts of modalities including images, masks, bboxes and audio (into discrete tokens)
   1. dense masks are all binary, unlike v1 which specifies the color in the text instruction (the model struggles to follow it)
2. proposes 2D Rotary Embedding, QK Normalization and Scaled Cosine Attention to stabilize training and scaling
3. Mixture of Denoisers training objectives
4. instruction tuning on 220 tasks drawn from over 120 external datasets

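A minimal sketch of the QK-normalization trick listed above: normalize queries and keys before computing attention logits so their scale stays bounded (using LayerNorm as the normalizer is an assumption; shapes and names are illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Single-head attention with LayerNorm applied to queries and keys."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(dim)   # normalizing q and k keeps the logits bounded,
        self.k_norm = nn.LayerNorm(dim)   # which helps stabilize large-scale training

    def forward(self, x):
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
        return attn @ v

y = QKNormAttention(64)(torch.randn(2, 10, 64))   # (batch, seq, dim)
```
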
PixelLM: Pixel Reasoning with Large Multimodal Model

[Paper](http://arxiv.org/abs/2312.02228) | [Github](https://github.com/MaverickRen/PixelLM) | [Project](https://pixellm.github.io/)

1. learnable seg tokens + a lightweight decoder
2. a bunch of tricks:
   1. N x L seg tokens for L levels of multi-scale vision features; N tokens within each group for better modeling
   2. reweighted loss on regions with overlapping predictions

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

[Paper](http://arxiv.org/abs/2403.14598) | [Github](https://github.com/zamling/PSALM)

1. new paradigm: first generate mask proposals, then generate masks and classifications (following Mask2Former)
2. instruction prompt + conditional prompt + candidate mask tokens
   1. three types of conditional prompts: classes, sentences (referring segmentation) and visual cues (points, scribbles, boxes, etc.)
   2. conditional prompt => condition embed, candidate mask tokens => mask embed
   3. condition embed + mask embed + image feature => Mask2Former decoder => bipartite matching loss + query-based decoding

![图 0](images/398f94238fe61990ba3dd93ec6e1357359d45541ac7c22a06f0cb804f3bc2b4e.png)

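A small sketch of the bipartite-matching step used in Mask2Former-style decoding as described above: build a cost matrix between predicted queries and ground-truth masks, then solve it with the Hungarian algorithm (the cost terms and weights below are illustrative assumptions):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_masks, gt_labels, gt_masks, w_cls=1.0, w_mask=1.0):
    """Match N predicted queries to M ground-truth masks by minimum total cost.

    pred_logits: (N, num_classes) class logits per query
    pred_masks:  (N, H, W) predicted mask logits
    gt_labels:   (M,) ground-truth class ids
    gt_masks:    (M, H, W) binary ground-truth masks
    returns:     (pred_indices, gt_indices) arrays of matched pairs
    """
    prob = pred_logits.softmax(-1)                         # (N, C)
    cost_cls = -prob[:, gt_labels]                         # (N, M): negative prob of the gt class
    p = pred_masks.sigmoid().flatten(1)                    # (N, H*W)
    g = gt_masks.float().flatten(1)                        # (M, H*W)
    inter = p @ g.t()                                      # soft intersection per (pred, gt) pair
    cost_mask = 1 - (2 * inter + 1) / (p.sum(-1)[:, None] + g.sum(-1)[None, :] + 1)
    cost = (w_cls * cost_cls + w_mask * cost_mask).detach().cpu().numpy()
    return linear_sum_assignment(cost)

pred_i, gt_i = hungarian_match(torch.randn(100, 81), torch.randn(100, 64, 64),
                               torch.tensor([3, 17]), torch.randint(0, 2, (2, 64, 64)))
```
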
LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

[Paper](http://arxiv.org/abs/2404.08767) | [Github](https://github.com/wangjunchi/LLMSeg)

1. Uses SAM to generate mask candidates, then formulates the problem as mask selection (mask classification)
2. Proposes the LLM-Seg40K dataset, built by using LLaVA to generate captions and GPT-4 to generate question-answer pairs.

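A toy sketch of the mask-selection formulation: score every SAM-proposed candidate against a query embedding derived from the LLM output and keep the best one (cosine similarity is an assumed scoring choice, not necessarily LLM-Seg's):

```python
import torch
import torch.nn.functional as F

def select_mask(query_emb, cand_embs, cand_masks):
    """Pick the SAM-proposed mask whose embedding best matches the query.

    query_emb:  (D,) embedding derived from the LLM's reasoning output
    cand_embs:  (K, D) one embedding per candidate mask (e.g. mask-pooled features)
    cand_masks: (K, H, W) binary candidate masks from SAM
    """
    scores = F.cosine_similarity(query_emb[None, :], cand_embs, dim=-1)  # (K,)
    best = scores.argmax().item()
    return cand_masks[best], scores

mask, scores = select_mask(torch.randn(256), torch.randn(32, 256),
                           torch.randint(0, 2, (32, 128, 128)))
```
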
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

[Paper](http://arxiv.org/abs/2402.16846) | [Project](https://groundhog-mllm.github.io/)

1. disentangles grounding from referring
2. grounding as mask selection; trains a Mask2Former+ to generate mask candidates
3. referring by mask pooling on features
4. proposes the 2.5M M3G2 dataset

DetGPT: Detect What You Need via Reasoning

[Paper](http://arxiv.org/abs/2305.14167) | [Github](https://github.com/OptimalScale/DetGPT) | [Project](https://detgpt.github.io/)

1. Follows LLaVA to tune a VLM for VQA
2. Uses Grounding DINO to ground the response generated by the VLM and detect the relevant entities.

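The reason-then-detect flow can be summarized in a few lines; the two callables below are stand-ins for any instruction-tuned VLM and any open-vocabulary detector (e.g. Grounding DINO), not DetGPT's actual API:

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]

def reason_then_detect(image, instruction: str,
                       vlm_answer: Callable[[object, str], str],
                       open_vocab_detect: Callable[[object, List[str]], List[Tuple[str, Box]]]):
    """Ask the VLM which objects are relevant, then ground them with a detector."""
    prompt = (f"{instruction}\n"
              "List only the names of the objects in the image needed to fulfil this request, "
              "separated by commas.")
    names = [n.strip() for n in vlm_answer(image, prompt).split(",") if n.strip()]
    return open_vocab_detect(image, names)   # detector queried with the reasoned object names

# usage: reason_then_detect(img, "I want a cold drink", my_vlm, my_open_vocab_detector)
```
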
Ferret: Refer and Ground Anything Anywhere at Any Granularity

[Paper](http://arxiv.org/abs/2310.07704) | [Github](https://github.com/apple/ml-ferret)

1. proposes a hybrid region representation for referring: region name + coordinates + a feature pooled by a spatial-aware visual sampler
2. grounding through bbox coordinates

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

[Paper](http://arxiv.org/abs/2404.07973)

1. proposes a bunch of improvements over Ferret v1:
   1. any-resolution (patches) for larger input resolution (see the sketch after the u-LLaVA entry below)
   2. a DINOv2 encoder for local feature extraction
   3. a High-resolution Dense Alignment stage between image-caption alignment and instruction tuning

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

[Paper](http://arxiv.org/abs/2311.05348) | [Github](https://github.com/OPPOMKLab/u-LLaVA)

1. proposes to use different decoders for grounding (SAM for segmentation, Grounding DINO for detection)

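A minimal sketch of any-resolution style patching: split a high-resolution image into encoder-sized tiles plus one global low-resolution view (the tile size and grid selection are assumptions, not Ferret-v2's exact recipe):

```python
import torch
import torch.nn.functional as F

def anyres_split(image, tile=336, max_tiles=4):
    """Return a global low-res view plus local tiles of a high-res image.

    image: (3, H, W) tensor in [0, 1]
    tile:  side length expected by the vision encoder (e.g. 336 for CLIP-L/14@336)
    """
    _, h, w = image.shape
    gh = min(max_tiles, max(1, round(h / tile)))
    gw = min(max_tiles, max(1, round(w / tile)))
    resized = F.interpolate(image[None], size=(gh * tile, gw * tile),
                            mode="bilinear", align_corners=False)[0]
    tiles = [resized[:, i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
             for i in range(gh) for j in range(gw)]
    global_view = F.interpolate(image[None], size=(tile, tile),
                                mode="bilinear", align_corners=False)[0]
    return global_view, tiles   # each is (3, tile, tile); tiles are encoded separately

g, t = anyres_split(torch.rand(3, 900, 1400))   # -> global view + a 3x4 grid of tiles
```
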
GSVA: Generalized Segmentation via Multimodal Large Language Models

[Paper](http://arxiv.org/abs/2312.10103) | [Github](https://github.com/LeapLabTHU/GSVA)

1. addresses Generalized Referring Expression Segmentation (GRES) with a grounding LLM
   1. multiple objects to ground
   2. need to reject null targets
2. proposes to use multiple [SEG] tokens to ground multiple objects (each indicated by the text before its [SEG] token), and a [REJ] token to reject null targets

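A schematic of how the special tokens can be routed at inference time in this style of model: hidden states at [SEG] positions go to a mask decoder, while [REJ] marks targets with no mask (token ids and the decoder interface are illustrative):

```python
import torch

def route_seg_tokens(output_ids, hidden_states, seg_id, rej_id, mask_decoder):
    """Collect per-target masks from generated special tokens.

    output_ids:    (T,) generated token ids
    hidden_states: (T, D) last-layer hidden states aligned with output_ids
    mask_decoder:  callable (D,) -> (H, W) mask, e.g. a SAM-style decoder head
    returns a list of masks (None where the target was rejected with [REJ])
    """
    masks = []
    for t, tok in enumerate(output_ids.tolist()):
        if tok == seg_id:
            masks.append(mask_decoder(hidden_states[t]))
        elif tok == rej_id:
            masks.append(None)            # null target: nothing to segment
    return masks

# toy usage with a dummy decoder
masks = route_seg_tokens(torch.tensor([5, 9001, 7, 9002]), torch.randn(4, 4096),
                         seg_id=9001, rej_id=9002,
                         mask_decoder=lambda e: torch.zeros(64, 64))
```
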
NExT-Chat: An LMM for Chat, Detection and Segmentation

[Paper](https://arxiv.org/pdf/2311.04498) | [Github](https://github.com/NExT-ChatV/NExT-Chat) | [Project](https://next-chatv.github.io/)

1. proposes a box encoder-decoder for referring and grounding
2. for grounding, a special trigger token indicates the presence of a grounding output, and its latent embedding is fed to the box decoder (or a mask decoder, e.g. SAM) for box (mask) generation
3. for referring, boxes represent the referred region and a box encoder encodes them into features that are input to the LLM
4. proposes a cycle-consistency loss to regularize the box encoder-decoder

![图 0](images/50d131269405f43de1d95d747d9f7321d3a46bc87e3a2758286c837f8dec379a.png)

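A compact sketch of a cycle-consistency regularizer for a box encoder-decoder in the spirit described above (the tiny MLP codec and the L1 form are assumptions, not NExT-Chat's exact modules):

```python
import torch
import torch.nn as nn

class BoxCodec(nn.Module):
    """Encode a normalized box to a latent and decode it back."""

    def __init__(self, dim=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(4, dim), nn.GELU(), nn.Linear(dim, dim))
        self.dec = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, boxes):             # boxes: (B, 4) with coordinates in [0, 1]
        latent = self.enc(boxes)
        return latent, self.dec(latent)

codec = BoxCodec()
boxes = torch.rand(8, 4)
latent, recon = codec(boxes)
cycle_loss = nn.functional.l1_loss(recon, boxes)   # decode(encode(box)) should give the box back
cycle_loss.backward()
```
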
PerceptionGPT: Effectively Fusing Visual Perception into LLM

[Paper](https://arxiv.org/pdf/2311.06612)

1. similar to NExT-Chat, proposes a box encoder-decoder to encode and decode boxes, but seems to focus only on grounding without referring
2. One possibly intriguing point: a grounding-output indicator token marks the presence of a grounding output (as usual), but on the input side the token is replaced by the encoder's output feature.

![图 1](images/b94662b3d3344af40518360bbf617a97bda2baf867d56e7463170c0b64d32101.png)

Kosmos-2: Grounding Multimodal Large Language Models to the World

[Paper](https://arxiv.org/abs/2306.14824) | [Github](https://github.com/microsoft/unilm/tree/master/kosmos-2)

1. builds a web-scale grounding dataset from web-scale data (COYO-700M, LAION-2B, etc.) and a vision detector (GLIP)
2. following Pix2Seq, divides the image into PxP grids and introduces PxP new location tokens to represent positions
3. a bbox is represented by a pair of location tokens (top-left and bottom-right patches), with a delimiter token to separate multiple boxes
4. a markdown-like grammar with opening/closing tags marks the grounded text span, e.g.:

![图 2](images/ee9c18e6d50fb94df01b5ff11283fd3128d6b9f0c7e103e70c07887bf94a71d2.png)

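A small sketch of quantizing continuous box coordinates into a PxP grid of location tokens in this style (the `<loc_*>`, `<phrase>` and `<object>` spellings are placeholders, not necessarily Kosmos-2's exact vocabulary):

```python
def box_to_location_tokens(box, p=32):
    """Map a normalized box to two grid-cell tokens (top-left, bottom-right).

    box: (x1, y1, x2, y2) with coordinates in [0, 1]
    p:   grid size; the vocabulary holds p*p location tokens
    """
    def cell(x, y):
        col = min(int(x * p), p - 1)
        row = min(int(y * p), p - 1)
        return row * p + col

    x1, y1, x2, y2 = box
    tl, br = cell(x1, y1), cell(x2, y2)
    return f"<loc_{tl}><loc_{br}>"        # placeholder token spelling

# "a dog" grounded to one box, with placeholder phrase/object tags
print("<phrase>a dog</phrase><object>"
      + box_to_location_tokens((0.1, 0.2, 0.55, 0.9)) + "</object>")
```
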
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

[Paper](https://arxiv.org/pdf/2306.15195) | [Github](https://github.com/shikras/shikra)

1. proposes to use normalized boxes for unified grounding and referring
2. uses plain text to represent all normalized boxes (tokenized directly by the text tokenizer) as input/output of the LLM

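A tiny sketch of the plain-text coordinate format: write normalized boxes as numbers directly in the string and parse them back with a regex (the bracket syntax and 3-decimal precision are assumptions):

```python
import re

def box_to_text(box, ndigits=3):
    """Render a normalized (x1, y1, x2, y2) box as plain text for the tokenizer."""
    return "[" + ",".join(f"{v:.{ndigits}f}" for v in box) + "]"

def parse_boxes(text):
    """Recover all boxes the model wrote back into its answer."""
    pattern = r"\[([01]\.\d+),([01]\.\d+),([01]\.\d+),([01]\.\d+)\]"
    return [tuple(float(v) for v in m) for m in re.findall(pattern, text)]

answer = f"The mug {box_to_text((0.12, 0.40, 0.35, 0.78))} is to the left of the laptop."
print(parse_boxes(answer))   # [(0.12, 0.4, 0.35, 0.78)]
```
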
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

[Paper](http://arxiv.org/abs/2404.13013) | [Github](https://github.com/FoundationVision/Groma) | [Project](https://groma-mllm.github.io/)

1. Proposes to ground and refer with a set of proposed regions.
2. Turns a Deformable DETR detection head into a binary classifier to propose RoIs and uses RoIAlign pooling to get the region features

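For reference, RoIAlign pooling of region features from a feature map is available in torchvision; a minimal usage sketch (the feature stride, output size and boxes are illustrative):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 64, 64)             # (N, C, H, W) backbone feature map, stride 16
boxes = torch.tensor([[0, 32.0, 48.0, 320.0, 400.0],    # (batch_idx, x1, y1, x2, y2) in image coords
                      [0, 100.0, 120.0, 200.0, 260.0]])
region_feats = roi_align(feat, boxes, output_size=(7, 7),
                         spatial_scale=1 / 16, sampling_ratio=2, aligned=True)
print(region_feats.shape)                      # torch.Size([2, 256, 7, 7]) -> pool/flatten per region
```
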
    300 | 301 | LISA: Reasoning Segmentation via Large Language Model 302 | 303 | [Paper](https://arxiv.org/abs/2308.00692) | [Github](https://github.com/dvlab-research/LISA) 304 | 305 | 1. Introduce the reasoning segmentation task and establish a reasoning segmentation benchmark. 306 | 2. Propose LISA model, which represents the segmentation mask as an embedding and incorporates new segmentation capabilities. 307 |
GLaMM: Pixel Grounding Large Multimodal Model

[Paper](http://arxiv.org/abs/2311.03356) | [Github](https://github.com/mbzuai-oryx/groundingLMM) | [Project](https://mbzuai-oryx.github.io/groundingLMM)

1. Introduces the Grounded Conversation Generation (GCG) task, combining phrase grounding, referring expression segmentation, and vision-language conversations.
2. Proposes a scalable pipeline to curate GranD (Grounding-anything Dataset) with 7.5M unique concepts grounded in 810M regions with segmentation masks, plus the ~214K GranD-f fine-tuning set and a ~1000-sample evaluation set.
3. Architecture: a Global Image Encoder for holistic understanding, a Region Encoder with RoI pooling for region referring, an LLM that generates responses with grounding tokens, and a Pixel Decoder (SAM) that decodes segmentation masks from the grounding tokens' latents.

## 🔥 Multi-modality

GroundingGPT: Language Enhanced Multi-modal Grounding Model

[Paper](http://arxiv.org/abs/2401.06071) | [Github](https://github.com/lzw-lzw/GroundingGPT)

1. grounding and referring for multiple modalities, expressed in text
   1. bounding boxes as four relative coordinate values: [x1, y1, x2, y2]
   2. video timestamps as two two-digit decimals: {t1, t2}
2. curates datasets for three-stage training

![图 1](images/8f3a6cf2fec0f679487196ed6c48f94e076ae29ae311f8a888fcc8ce23e73e7c.png)

--------------------------------------------------------------------------------