├── images
│   ├── 398f94238fe61990ba3dd93ec6e1357359d45541ac7c22a06f0cb804f3bc2b4e.png
│   ├── 50d131269405f43de1d95d747d9f7321d3a46bc87e3a2758286c837f8dec379a.png
│   ├── 8f3a6cf2fec0f679487196ed6c48f94e076ae29ae311f8a888fcc8ce23e73e7c.png
│   ├── b94662b3d3344af40518360bbf617a97bda2baf867d56e7463170c0b64d32101.png
│   └── ee9c18e6d50fb94df01b5ff11283fd3128d6b9f0c7e103e70c07887bf94a71d2.png
└── README.md
/images/398f94238fe61990ba3dd93ec6e1357359d45541ac7c22a06f0cb804f3bc2b4e.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/williamium3000/awesome-mllm-grounding/HEAD/images/398f94238fe61990ba3dd93ec6e1357359d45541ac7c22a06f0cb804f3bc2b4e.png
--------------------------------------------------------------------------------
/images/50d131269405f43de1d95d747d9f7321d3a46bc87e3a2758286c837f8dec379a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/williamium3000/awesome-mllm-grounding/HEAD/images/50d131269405f43de1d95d747d9f7321d3a46bc87e3a2758286c837f8dec379a.png
--------------------------------------------------------------------------------
/images/8f3a6cf2fec0f679487196ed6c48f94e076ae29ae311f8a888fcc8ce23e73e7c.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/williamium3000/awesome-mllm-grounding/HEAD/images/8f3a6cf2fec0f679487196ed6c48f94e076ae29ae311f8a888fcc8ce23e73e7c.png
--------------------------------------------------------------------------------
/images/b94662b3d3344af40518360bbf617a97bda2baf867d56e7463170c0b64d32101.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/williamium3000/awesome-mllm-grounding/HEAD/images/b94662b3d3344af40518360bbf617a97bda2baf867d56e7463170c0b64d32101.png
--------------------------------------------------------------------------------
/images/ee9c18e6d50fb94df01b5ff11283fd3128d6b9f0c7e103e70c07887bf94a71d2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/williamium3000/awesome-mllm-grounding/HEAD/images/ee9c18e6d50fb94df01b5ff11283fd3128d6b9f0c7e103e70c07887bf94a71d2.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome-Multimodal-Large-Language-Models-With-Grounding
2 | > A curated list of Multimodal Large Language Models (also known as Large Vision-Language Models) with grounding ability.
3 |
4 |
5 |
8 |
9 |
10 | ## Table of Contents
11 | - [Awesome-Multimodal-Large-Language-Models-With-Grounding](#awesome-multimodal-large-language-models-with-grounding)
12 | - [Table of Contents](#table-of-contents)
13 | - [🔥 Large Vision-Language Model](#-large-vision-language-model)
14 | - [Grounding](#grounding)
15 | - [Referring](#referring)
16 | - [Training Dataset](#training-dataset)
17 | - [Training Recipe](#training-recipe)
18 | - [Evaluation Dataset](#evaluation-dataset)
19 | - [Paper List](#paper-list)
20 | - [🔥 Multi-modality](#-multi-modality)
21 |
22 | ## 🔥 Large Vision-Language Model
23 |
24 | ### Grounding
25 | | Format | Desc | Paper |
26 | |------------|-------|--------|
27 | | Decoder on latent | a dedicated decoder turns latent embeddings into boxes / masks | [PerceptionGPT](https://arxiv.org/pdf/2311.06612), [NExT-Chat](https://arxiv.org/pdf/2311.04498), [PSALM](http://arxiv.org/abs/2403.14598), [PixelLM](http://arxiv.org/abs/2312.02228), [u-LLaVA](http://arxiv.org/abs/2311.05348), [GSVA](http://arxiv.org/abs/2312.10103), [ChatterBox](http://arxiv.org/abs/2401.13307), [GLaMM](http://arxiv.org/abs/2311.03356)|
28 | | Output numerical coordinates | directly output coordinates as plain-text numbers (see the sketch below) | [Shikra](https://arxiv.org/pdf/2306.15195), [VisionLLM](https://proceedings.neurips.cc/paper_files/paper/2023/file/c1f7b1ed763e9c75e4db74b49b76db5f-Paper-Conference.pdf), [Ferret](http://arxiv.org/abs/2310.07704), [Ferret2](http://arxiv.org/abs/2404.07973), [CogVLM](http://arxiv.org/abs/2311.03079)|
29 | | Output token coordinates | output special location tokens added to the vocabulary | [Kosmos-2](https://arxiv.org/pdf/2306.14824) |
30 | | Pixel space | output in a discrete pixel space encoded by VQGAN | [Unified-IO](https://arxiv.org/abs/2206.08916), [Unified-IO 2](http://arxiv.org/abs/2312.17172) |
31 | | Proposal retrieval | select from a set of region candidates | [LLM-Seg](http://arxiv.org/abs/2404.08767), [Kosmos-2](https://arxiv.org/pdf/2306.14824), [GROUNDHOG](http://arxiv.org/abs/2402.16846)|
32 |
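A minimal sketch contrasting two of the output formats above: plain-text numerical coordinates (Shikra-style) versus discrete location tokens indexing a grid (Kosmos-2-style). The normalized xyxy convention, 3-decimal rounding and 32x32 grid are illustrative assumptions; each paper uses its own template.

```python
# Two ways a grounding MLLM can emit a box for "the dog".
# Assumptions (illustrative, not from any single paper): boxes are normalized
# xyxy in [0, 1]; the token format uses a 32x32 grid of <loc_i> special tokens.

def box_as_numerical_text(box, ndigits=3):
    """Emit the box directly as plain-text numbers inside the answer."""
    x1, y1, x2, y2 = box
    return f"[{x1:.{ndigits}f},{y1:.{ndigits}f},{x2:.{ndigits}f},{y2:.{ndigits}f}]"

def box_as_location_tokens(box, grid=32):
    """Emit the box as two special tokens indexing cells of a grid x grid layout."""
    x1, y1, x2, y2 = box
    def cell(x, y):
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        return f"<loc_{row * grid + col}>"
    return cell(x1, y1) + cell(x2, y2)  # top-left cell, then bottom-right cell

box = (0.12, 0.30, 0.58, 0.92)
print("numerical:", box_as_numerical_text(box))   # [0.120,0.300,0.580,0.920]
print("tokens:   ", box_as_location_tokens(box))  # <loc_291><loc_946>
```
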
33 | ### Referring
34 |
35 | | Format | Desc | Paper |
36 | |------------|-------|--------|
37 | | Pooling | Leverage Mask Pooling / RoI Pooling / RoI Align to obtain features from the image encoder output (see the sketch below) | [Groma](http://arxiv.org/abs/2404.13013), [GPT4RoI](https://arxiv.org/pdf/2307.03601), [Osprey](https://arxiv.org/pdf/2312.10032), [PSALM](http://arxiv.org/abs/2403.14598), [GROUNDHOG](http://arxiv.org/abs/2402.16846), [Ferret](http://arxiv.org/abs/2310.07704), [Ferret2](http://arxiv.org/abs/2404.07973), [PVIT](https://arxiv.org/pdf/2308.13437), [ChatterBox](http://arxiv.org/abs/2401.13307), [GLaMM](http://arxiv.org/abs/2311.03356) |
38 | | Numerical coordinates | Leverage numerical coordinates for referring (bbox / sampled points in mask) | [Shikra](https://arxiv.org/pdf/2306.15195), [PerceptionGPT](https://arxiv.org/pdf/2311.06612) (w/ encoder), [NExT-Chat](https://arxiv.org/pdf/2311.04498) (w/ encoder), [CogVLM](http://arxiv.org/abs/2311.03079)|
39 | | Token coordinates | Add new tokens to vocab to present spatial positions | [Kosmos-2](https://arxiv.org/pdf/2306.14824) |
40 |
41 | * w/ encoder: refers to using an encoder to encode the input coordinates.
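
A minimal sketch of the pooling-based referring above, assuming a ViT-style image encoder that returns per-patch features and a binary region mask at patch resolution. The shapes, the averaging scheme and the `<region>` placeholder are illustrative, not any specific paper's design.

```python
import torch

def mask_pool(patch_feats: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
    """Average the image-encoder patch features that fall inside the referred region.

    patch_feats: (N, D) features from the image encoder, one per patch.
    region_mask: (N,) binary mask marking the patches of the referred region.
    """
    weights = region_mask.float()
    weights = weights / weights.sum().clamp(min=1.0)  # normalize over selected patches
    return (patch_feats * weights.unsqueeze(-1)).sum(dim=0)  # (D,) region embedding

# Toy usage: 14x14 = 196 patches, 1024-d features, a region covering 20 patches.
patch_feats = torch.randn(196, 1024)
region_mask = torch.zeros(196)
region_mask[40:60] = 1
region_embed = mask_pool(patch_feats, region_mask)  # (1024,)

# A jointly trained projection would map this into the LLM embedding space and
# replace a placeholder token such as "<region>" in the user prompt.
proj = torch.nn.Linear(1024, 4096)
region_token_embed = proj(region_embed)  # ready to splice into the input embeddings
```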
42 |
43 | ### Training Dataset
44 |
45 | | Dataset | Source | Data Source | Quantity | Construction Method |
46 | |------------|--------------|--------------|--------------|--------------|
47 | | GRIT | [Ferret](http://arxiv.org/abs/2310.07704) | COYO-700M, LAION-2B | - | Templates convert existing data; SAM generates masks for free-form referring; GPT-4 generates dialogues with bboxes; GLIPv2 grounds groundable nouns in LLaVA-158k; negative mining generates negative yes/no questions (sketch below) |
48 | | Shikra-RD | [Shikra](https://arxiv.org/pdf/2306.15195) | Flickr30K Entities | 5,922 QA pairs | GPT-4 generates Referential Dialogue (CoT dialogues with grounding & referring) |
49 | | CB-300K | [ChatterBox](http://arxiv.org/abs/2401.13307) | VG | 717,075 QA pairs | Four subsets. CB-MRG: ChatGPT writes dialogues with bboxes. CB-LC: extends strict relations (from the scene graph) to multi-turn QA with ChatGPT. CB-REF: REG task. CB-GND: grounding task |
50 | | GranD | [GLaMM](http://arxiv.org/abs/2311.03356) | SA-1B | 11M images with 7.5M unique concepts and 810M regions. | Automated annotation pipeline with SAM for dense pixel-wise grounding. Used for pretraining. |
51 | | GranD-f | [GLaMM](http://arxiv.org/abs/2311.03356) | GranD (refined), Flickr30K, RefCOCOg, and PSG | ~214K image-grounded text pairs | Refined subset of GranD for fine-tuning, with 1000 images held out for human-annotated evaluation |
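
For the negative-mining step mentioned in the GRIT row, a tiny template sketch; the question template and vocabulary sampling are illustrative, not the paper's exact procedure.

```python
import random

def negative_yes_no_question(present_objects, vocabulary, rng=random.Random(0)):
    """Pick an object that is NOT present and ask a question whose answer is 'No'."""
    absent = [obj for obj in vocabulary if obj not in present_objects]
    obj = rng.choice(absent)
    return {"question": f"Is there a {obj} in this region?", "answer": "No"}

print(negative_yes_no_question({"dog", "sofa"}, ["cat", "dog", "sofa", "umbrella"]))
# -> a question about an absent object, e.g. "Is there a cat in this region?" / "No"
```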
52 |
53 |
54 | ### Training Recipe
55 | | Model | Recipe |
56 | |------------|--------------|
57 | | [Ferret](http://arxiv.org/abs/2310.07704) | Initialize from pretrained LLaVA; SFT on GRIT |
58 | | [Ferret2](http://arxiv.org/abs/2404.07973) | (1) image-caption alignment on 1.4M image-text pairs; (2) high-resolution dense alignment with template-based referring & grounding; (3) instruction tuning with GRIT, VQA and OCR (VQA and OCR are augmented with GLIPv2 bboxes) |
59 | | [ChatterBox](http://arxiv.org/abs/2401.13307) | Trainable parts: LoRA and the location decoder. Warm-up training on a visual-grounding-only dataset, then instruction tuning with CB-300K |
60 | | [GPT4RoI](https://arxiv.org/pdf/2307.03601) | Initialize from pretrained LLaVA; pretrain the region feature extractor on text-region datasets (COCO, RefCOCO, RefCOCO+); then train the connector, region feature extractor and LLM to follow instructions |
61 | | [GLaMM](http://arxiv.org/abs/2311.03356) | Initialize from pretrained [GPT4RoI](https://arxiv.org/pdf/2307.03601); pretrain on the 11M-image GranD with LoRA; finetune on GranD-f, LLaVA-Instruct-150K and LLaVA-Instruct-80K |
62 |
63 |
64 | ### Evaluation Dataset
65 | | Dataset | Source | Data Source | Quantity | Construction Method |
66 | |------------|--------------|--------------|--------------|--------------|
67 | | Ferret Bench | [Ferret](http://arxiv.org/abs/2310.07704) | COCO validation set | 120 | Referring Description: models are asked to describe a referred region based on its interaction with surrounding objects. Referring Reasoning: models need to reason on top of one or more referred regions correctly. Grounding in Conversation: models are required to reason correctly and accurately ground/localize the objects/regions necessary for the reasoning.|
68 |
69 |
70 |
71 |
80 | ### Paper List
81 |
82 |
83 | GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
84 |
85 | [Paper](https://arxiv.org/pdf/2307.03601) | [Github](https://github.com/jshilong/GPT4RoI)
86 |
87 | 1. proposes referring for MLLMs by replacing a region placeholder token in the prompt with a pooled region feature (rough sketch below)
88 |
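A hedged sketch of the box-pooling idea: pool a region feature from an image feature map and splice its projection in place of the placeholder token. The single-scale feature map, the `roi_align` settings and the `<region1>` name are illustrative stand-ins (the actual implementation pools multi-level features).

```python
import torch
from torchvision.ops import roi_align

# Assume a ViT feature map reshaped to (B, D, H, W) and one box in absolute pixel coords.
feat_map = torch.randn(1, 1024, 24, 24)                 # e.g. 336px image, 14px patches
boxes = torch.tensor([[0, 50.0, 80.0, 200.0, 260.0]])   # (batch_index, x1, y1, x2, y2)

# Pool a 7x7 region feature; spatial_scale maps pixel coords to feature-map coords.
region_feat = roi_align(feat_map, boxes, output_size=7, spatial_scale=24 / 336)
region_embed = region_feat.mean(dim=(2, 3)).squeeze(0)  # (1024,) pooled region embedding

# A projection of region_embed then replaces the embedding of the placeholder
# token (e.g. "<region1>") in a prompt like "What is in <region1>?".
proj = torch.nn.Linear(1024, 4096)
region_token_embed = proj(region_embed)
```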
89 |
90 |
91 |
92 |
93 | Osprey: Pixel Understanding with Visual Instruction Tuning
94 |
95 | [Paper](https://arxiv.org/pdf/2312.10032) | [Github](https://github.com/CircleRadon/Osprey)
96 |
97 | 1. similar to GPT4RoI, Osprey also uses mask representations to refer to entities in images.
98 | 2. It uses mask pooling to extract semantic features from the image encoder and combines them with a location extractor that processes the mask and outputs a spatial token.
99 |
100 |
101 |
102 |
103 |
104 | VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
105 |
106 | [Paper](https://proceedings.neurips.cc/paper_files/paper/2023/file/c1f7b1ed763e9c75e4db74b49b76db5f-Paper-Conference.pdf) | [Github](https://github.com/OpenGVLab/VisionLLM)
107 |
108 | 1. unified interface for vision and VL tasks: points for detection, sampled points for instance segmentation ==> instruction format for training
109 | 2. extra tokens & output-format-as-query decoding (faster)
110 |
111 |
112 |
113 |
114 |
115 | Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
116 |
117 | [Paper](https://arxiv.org/abs/2206.08916) | [Github](https://github.com/allenai/unified-io-inference) | [Project](https://unified-io.allenai.org/)
118 |
119 | 1. creates a unified I/O for all sorts of vision and VL tasks (into discrete tokens)
120 | 2. uses a T5-like encoder-decoder architecture
121 |
122 |
123 |
124 |
125 |
126 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
127 |
128 | [Paper](http://arxiv.org/abs/2312.17172) | [Github](https://github.com/allenai/unified-io-2) | [Project](https://unified-io-2.allenai.org/)
129 |
130 | 1. following Unified-IO v1, creates a unified I/O for all sorts of modalities including images, masks, bboxes and audio (into discrete tokens)
131 |    1. dense masks are all binary, unlike v1, which specifies the color in the text instruction (the model struggles to follow it)
132 | 2. proposes 2D Rotary Embedding, QK Normalization and Scaled Cosine Attention to stabilize training and scaling
133 | 3. Mixture of Denoisers training objectives
134 | 4. instruction tuning on 220 tasks drawn from over 120 external datasets
135 |
136 |
137 |
138 |
139 | PixelLM: Pixel Reasoning with Large Multimodal Model
140 |
141 | [Paper](http://arxiv.org/abs/2312.02228) | [Github](https://github.com/MaverickRen/PixelLM) | [Project](https://pixellm.github.io/)
142 |
143 | 1. learnable seg tokens + a light-weight decoder (rough sketch below)
144 | 2. a bunch of tricks:
145 |    1. N x L seg tokens for L levels of multi-scale vision features; N tokens within each group for better modeling
146 |    2. reweighted loss on regions with overlapping predictions
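
A rough sketch of the "learnable seg tokens + light-weight decoder" idea, reduced to a single scale and a single token: project the LLM hidden state at a seg token into the visual feature space and take dot products with patch features as mask logits. Module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TinySegDecoder(nn.Module):
    """Turn the LLM hidden state at a seg token into mask logits over image patches."""

    def __init__(self, llm_dim=4096, vis_dim=1024):
        super().__init__()
        self.to_vis = nn.Linear(llm_dim, vis_dim)  # project the seg-token state into visual space

    def forward(self, seg_hidden, patch_feats):
        # seg_hidden: (llm_dim,) hidden state of the seg token from the LLM.
        # patch_feats: (H*W, vis_dim) image features.
        query = self.to_vis(seg_hidden)   # (vis_dim,)
        logits = patch_feats @ query      # (H*W,) mask logits via dot product
        return logits                     # reshape/upsample to the image grid elsewhere

decoder = TinySegDecoder()
mask_logits = decoder(torch.randn(4096), torch.randn(576, 1024))  # 24x24 = 576 patches
```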
147 |
148 |
149 |
150 |
151 |
152 | PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
153 |
154 | [Paper](http://arxiv.org/abs/2403.14598) | [Github](https://github.com/zamling/PSALM)
155 |
156 | 1. new paradigm: first generate mask proposals, then generate masks and classification (following Mask2Former)
157 | 2. instruction prompt + conditional prompt + candidate mask tokens
158 |    1. three types of conditional prompt: classes, sentences (referring segmentation) and visual cues (points, scribbles, boxes, etc.)
159 |    2. conditional prompt => condition embeddings; candidate mask tokens => mask embeddings
160 | 3. condition embeddings + mask embeddings + image features => Mask2Former decoder => bipartite matching loss + query-based decoding (see the matching sketch below)
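
The bipartite matching step pairs each ground-truth mask with its best-matching predicted query before the losses are computed. A generic Hungarian-matching sketch (the class and dice cost terms are simplified placeholders, not PSALM's exact cost):

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_queries_to_targets(pred_mask_logits, pred_class_logits, gt_masks, gt_labels):
    """Return (query_idx, gt_idx) pairs minimizing a combined class + mask cost.

    pred_mask_logits: (Q, H, W) mask logits for Q queries
    pred_class_logits: (Q, C) class logits for Q queries
    gt_masks: (G, H, W) binary ground-truth masks
    gt_labels: (G,) ground-truth class indices
    """
    prob = pred_class_logits.softmax(-1)           # (Q, C)
    cost_class = -prob[:, gt_labels]               # (Q, G) negative prob of the target class
    pred = pred_mask_logits.sigmoid().flatten(1)   # (Q, H*W)
    gt = gt_masks.flatten(1).float()               # (G, H*W)
    inter = pred @ gt.T                            # (Q, G) soft intersection
    cost_dice = 1 - (2 * inter) / (pred.sum(-1, keepdim=True) + gt.sum(-1) + 1e-6)
    cost = cost_class + cost_dice                  # (Q, G) total matching cost
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, g_idx

q_idx, g_idx = match_queries_to_targets(
    torch.randn(100, 64, 64), torch.randn(100, 81),
    torch.randint(0, 2, (5, 64, 64)), torch.tensor([3, 7, 7, 12, 20]))
```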
162 |
163 |
164 |
165 |
166 |
167 | LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning
168 |
169 | [Paper](http://arxiv.org/abs/2404.08767) | [Github](https://github.com/wangjunchi/LLMSeg)
170 |
171 | 1. Uses SAM to generate mask candidates, then formulates the problem as mask selection (mask classification); see the sketch below
172 | 2. proposes the LLM-Seg40K dataset, built by using LLaVA to generate captions and GPT-4 to generate question-answer pairs.
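
A minimal sketch of the mask-selection idea: pool each SAM candidate mask into an embedding, score the candidates against a query embedding derived from the LLM, and pick the best one. The projection head, dimensions and scoring rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MaskSelector(nn.Module):
    """Score SAM mask candidates against an LLM-derived segmentation query."""

    def __init__(self, vis_dim=256, llm_dim=4096, hidden=256):
        super().__init__()
        self.query_proj = nn.Linear(llm_dim, hidden)
        self.cand_proj = nn.Linear(vis_dim, hidden)

    def forward(self, cand_embeds, query_hidden):
        # cand_embeds: (K, vis_dim) one pooled embedding per candidate mask
        # query_hidden: (llm_dim,) hidden state of the segmentation query token
        scores = self.cand_proj(cand_embeds) @ self.query_proj(query_hidden)  # (K,)
        return scores.argmax(), scores

selector = MaskSelector()
best_idx, scores = selector(torch.randn(32, 256), torch.randn(4096))  # 32 SAM candidates
```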
173 |
174 |
175 |
176 |
177 |
178 | GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
179 |
180 | [Paper](http://arxiv.org/abs/2402.16846) | [Project](https://groundhog-mllm.github.io/)
181 |
182 | 1. disentangles grounding from referring
183 | 2. grounding as mask selection; trains a Mask2Former+ to generate mask candidates
184 | 3. referring by mask pooling over features
185 | 4. proposes the 2.5M-sample M3G2 dataset
186 |
187 |
188 |
189 |
190 |
191 | DetGPT: Detect What You Need via Reasoning
192 |
193 | [Paper](http://arxiv.org/abs/2305.14167) | [Github](https://github.com/OptimalScale/DetGPT) | [Project](https://detgpt.github.io/)
194 |
195 | 1. Follows LLaVA to tune the VLM for VQA
196 | 2. Uses Grounding DINO to ground the response generated by the VLM and detect the relevant entities.
197 |
198 |
199 |
200 |
201 |
202 | Ferret: Refer and Ground Anything Anywhere at Any Granularity
203 |
204 | [Paper](http://arxiv.org/abs/2310.07704) | [Github](https://github.com/apple/ml-ferret)
205 |
206 | 1. proposes a hybrid region representation for referring: region name + coordinates + a feature pooled by a spatial-aware visual sampler
207 | 2. grounding through bbox
208 |
209 |
210 |
211 |
212 |
213 | Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
214 |
215 | [Paper](http://arxiv.org/abs/2404.07973)
216 |
217 | 1. propose a bunch of improvements on Ferret v1
218 | 2. including any-resolution (patches) for larger resolution
219 | 3. DINOv2 Encoder for local feature extraction
220 | 4. and a High-resolution Dense Alignment stage between SFT and instruction tuning.


221 | u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
222 |
223 | [Paper](http://arxiv.org/abs/2311.05348) | [Github](https://github.com/OPPOMKLab/u-LLaVA)
224 |
225 | 1. proposes to use different decoders for grounding (SAM for segmentation, Grounding DINO for detection)
226 |
227 |
228 |
229 |
230 |
231 | GSVA: Generalized Segmentation via Multimodal Large Language Models
232 |
233 | [Paper](http://arxiv.org/abs/2312.10103) | [Github](https://github.com/LeapLabTHU/GSVA)
234 |
235 | 1. targets Generalized Referring Expression Segmentation (GRES) with grounding LLMs:
236 |    1. multiple objects to ground
237 |    2. need to reject null targets
238 | 2. proposes to use multiple [SEG] tokens to ground multiple objects (each indicated by the text before its [SEG] token), and [REJ] tokens to reject null targets (sketch below)
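
A toy sketch, under assumed special-token ids, of how an output sequence with multiple [SEG]/[REJ] tokens could be post-processed: collect the hidden state at every [SEG] position for mask decoding and record which targets were rejected.

```python
import torch

def collect_seg_queries(output_ids, hidden_states, seg_id, rej_id):
    """output_ids: (T,) generated token ids; hidden_states: (T, D) per-token hidden states.

    Returns the hidden states at [SEG] positions (inputs to a mask decoder)
    and the positions of [REJ] tokens (targets declared absent).
    """
    seg_pos = (output_ids == seg_id).nonzero(as_tuple=True)[0]
    rej_pos = (output_ids == rej_id).nonzero(as_tuple=True)[0]
    return hidden_states[seg_pos], rej_pos

# Toy usage with made-up ids: 32001 = [SEG], 32002 = [REJ]
ids = torch.tensor([5, 9, 32001, 7, 32002, 3, 32001])
hid = torch.randn(7, 4096)
seg_queries, rejected = collect_seg_queries(ids, hid, 32001, 32002)
print(seg_queries.shape, rejected)  # torch.Size([2, 4096]) tensor([4])
```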
239 |
240 |
241 |
242 |
243 | NExT-Chat: An LMM for Chat, Detection and Segmentation
244 |
245 | [Paper](https://arxiv.org/pdf/2311.04498) | [Github](https://github.com/NExT-ChatV/NExT-Chat) | [Project](https://next-chatv.github.io/)
246 |
247 | 1. proposes a box encoder-decoder for referring and grounding
248 | 2. for grounding, a special token indicates the presence of a grounding output, and its latent embedding is fed to the box decoder (or a mask decoder, e.g. SAM) to generate the box (mask)
249 | 3. for referring, boxes represent the referred region; a box encoder encodes them into features that are fed to the LLM
250 | 4. proposes a cycle-consistency loss to regularize the box encoder-decoder (sketch below)
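
A minimal sketch of such a cycle-consistency regularizer: encode a box into an embedding, decode it back, and penalize the reconstruction error. The module sizes, activation choices and L1 penalty are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BoxCodec(nn.Module):
    """Tiny box encoder-decoder: 4 normalized coordinates <-> a D-dim embedding."""

    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(4, dim), nn.GELU(), nn.Linear(dim, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4), nn.Sigmoid())

    def cycle_loss(self, boxes):
        # boxes: (B, 4) normalized xyxy. Encode -> decode -> compare with the input.
        recon = self.decoder(self.encoder(boxes))
        return nn.functional.l1_loss(recon, boxes)

codec = BoxCodec()
loss = codec.cycle_loss(torch.rand(8, 4))  # regularizes the encoder/decoder pair
loss.backward()
```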
252 |
253 |
254 |
255 |
256 | PerceptionGPT: Effectively Fusing Visual Perception into LLM
257 |
258 | [Paper](https://arxiv.org/pdf/2311.06612)
259 |
260 | 1. similar to NExT-Chat, proposes a box encoder-decoder to encode and decode boxes, but seems to focus only on grounding without referring
261 | 2. One possibly intriguing point: a grounding-output indicator token marks the presence of a grounding output (as usual), but in the LLM input this token is replaced by the encoder's output feature.
263 |
264 |
265 |
266 |
267 | Kosmos-2: Grounding Multimodal Large Language Models to the World
268 |
269 | [Paper](https://arxiv.org/abs/2306.14824) | [Github](https://github.com/microsoft/unilm/tree/master/kosmos-2)
270 |
271 | 1. builds a web-scale grounding dataset from web data (COYO-700M, LAION-2B, etc.) and a vision detector (GLIP)
272 | 2. following Pix2Seq, divides the image into a P x P grid and introduces P x P new location tokens to represent grid cells
273 | 3. uses a pair of location tokens (top-left and bottom-right cells) to represent a bbox, with a delimiter token to separate multiple boxes (if there are multiple boxes)
274 | 4. uses markdown-like grammar to link a grounded text span with its boxes via phrase/object tags
275 | e.g. see the illustrative serialization sketched below
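An illustrative (not verbatim) serialization in the spirit of the grid location tokens and markdown-like phrase/object markup described above; the actual token names and grid size of the released model may differ.

```python
def box_to_patch_tokens(box, grid=32):
    """Map a normalized xyxy box to two grid-cell location tokens (top-left, bottom-right)."""
    def cell(x, y):
        return min(int(y * grid), grid - 1) * grid + min(int(x * grid), grid - 1)
    x1, y1, x2, y2 = box
    return f"<patch_index_{cell(x1, y1):04d}><patch_index_{cell(x2, y2):04d}>"

# Illustrative grounded caption: the phrase "a snowman" linked to one box.
caption = (
    "<grounding> An image of "
    "<phrase>a snowman</phrase>"
    "<object>" + box_to_patch_tokens((0.30, 0.10, 0.70, 0.90)) + "</object>"
    " warming himself by a fire."
)
print(caption)
```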
277 |
278 |
279 |
280 |
281 | Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
282 |
283 | [Paper](https://arxiv.org/pdf/2306.15195) | [Github](https://github.com/shikras/shikra)
284 |
285 | 1. proposes to use normalized boxes for unified grounding and referring
286 | 2. represents all normalized boxes as plain text (directly tokenized by the text tokenizer) and feeds them to the LLM
287 |
288 |
289 |
290 |
291 | Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
292 |
293 | [Paper](http://arxiv.org/abs/2404.13013) | [Github](https://github.com/FoundationVision/Groma) | [Project](https://groma-mllm.github.io/)
294 |
295 | 1. Propose to ground and refer with a set of proposed regions.
296 | 2. Changes a Deformable DETR detection head into a binary classifier to propose RoIs and uses RoIAlign pooling to get the region features
297 |
298 |
299 |
300 |
301 | LISA: Reasoning Segmentation via Large Language Model
302 |
303 | [Paper](https://arxiv.org/abs/2308.00692) | [Github](https://github.com/dvlab-research/LISA)
304 |
305 | 1. Introduce the reasoning segmentation task and establish a reasoning segmentation benchmark.
306 | 2. Propose LISA model, which represents the segmentation mask as an embedding and incorporates new segmentation capabilities.
307 |
308 |
309 |
310 |
311 | GLaMM: Pixel Grounding Large Multimodal Model
312 |
313 | [Paper](http://arxiv.org/abs/2311.03356) | [Github](https://github.com/mbzuai-oryx/groundingLMM) | [Project](https://mbzuai-oryx.github.io/groundingLMM)
314 |
315 | 1. Introduces Grounded Conversation Generation (GCG) task combining phrase grounding, referring expression segmentation, and vision-language conversations.
316 | 2. Proposes a scalable pipeline to curate GranD (Grounding-anything Dataset) with 7.5M unique concepts grounded in 810M regions with segmentation masks, plus the ~214K GranD-f subset and a ~1000-image held-out evaluation set.
317 | 3. Architecture: a Global Image Encoder for holistic understanding, a Region Encoder with RoI pooling for region referring, an LLM that generates responses with grounding tokens, and a Pixel Decoder (SAM) that decodes segmentation masks from the grounding tokens' latents
318 |
319 |
320 |
321 | ## 🔥 Multi-modality
322 |
323 |
324 |
325 | GroundingGPT: Language Enhanced Multi-modal Grounding Model
326 |
327 | [Paper](http://arxiv.org/abs/2401.06071) | [Github](https://github.com/OPPOMKLab/u-LLaVA)
328 |
329 | 1. grounding and referring across modalities, expressed directly in text:
330 |    1. bounding boxes as four relative coordinate values: [x1, y1, x2, y2]
331 |    2. video timestamps as two two-digit decimals: {t1, t2}
332 | 2. curates datasets for three-stage training (format sketch below)
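
A tiny formatting sketch of the textual representations above; the rounding precision and the interpretation of the timestamps as relative times are assumptions on top of the bracket formats given.

```python
def format_box(box, ndigits=2):
    """Bounding box as four relative coordinate values: [x1, y1, x2, y2]."""
    return "[" + ", ".join(f"{v:.{ndigits}f}" for v in box) + "]"

def format_timespan(t1, t2):
    """Video segment as two two-digit decimals: {t1, t2}."""
    return f"{{{t1:.2f}, {t2:.2f}}}"

print(format_box((0.12, 0.3, 0.58, 0.92)))  # [0.12, 0.30, 0.58, 0.92]
print(format_timespan(0.25, 0.75))          # {0.25, 0.75}
```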
334 |
335 |
336 |
--------------------------------------------------------------------------------