├── images
│   ├── 398f94238fe61990ba3dd93ec6e1357359d45541ac7c22a06f0cb804f3bc2b4e.png
│   ├── 50d131269405f43de1d95d747d9f7321d3a46bc87e3a2758286c837f8dec379a.png
│   ├── 8f3a6cf2fec0f679487196ed6c48f94e076ae29ae311f8a888fcc8ce23e73e7c.png
│   ├── b94662b3d3344af40518360bbf617a97bda2baf867d56e7463170c0b64d32101.png
│   └── ee9c18e6d50fb94df01b5ff11283fd3128d6b9f0c7e103e70c07887bf94a71d2.png
└── README.md

/README.md:
--------------------------------------------------------------------------------

# Awesome-Multimodal-Large-Language-Models-With-Grounding
> A curated list of Multimodal Large Language Models (also known as Large Vision-Language Models) with grounding ability.

## Table of Contents
- [Awesome-Multimodal-Large-Language-Models-With-Grounding](#awesome-multimodal-large-language-models-with-grounding)
  - [Table of Contents](#table-of-contents)
  - [🔥 Large Vision-Language Model](#-large-vision-language-model)
    - [Grounding](#grounding)
    - [Referring](#referring)
    - [Training Dataset](#training-dataset)
    - [Training Recipe](#training-recipe)
    - [Evaluation Dataset](#evaluation-dataset)
    - [Paper List](#paper-list)
  - [🔥 Multi-modality](#-multi-modality)

## 🔥 Large Vision-Language Model

### Grounding
| Format | Desc | Paper |
|------------|-------|--------|
| Decoder on latent | Leverage a decoder to ground from latent embeddings | [PerceptionGPT](https://arxiv.org/pdf/2311.06612), [NExT-Chat](https://arxiv.org/pdf/2311.04498), [PSALM](http://arxiv.org/abs/2403.14598), [PixelLM](http://arxiv.org/abs/2312.02228), [u-LLaVA](http://arxiv.org/abs/2311.05348), [GSVA](http://arxiv.org/abs/2312.10103), [ChatterBox](http://arxiv.org/abs/2401.13307), [GLaMM](http://arxiv.org/abs/2311.03356) |
| Output numerical coordinates | Directly output numerical coordinate tokens | [Shikra](https://arxiv.org/pdf/2306.15195), [VisionLLM](https://proceedings.neurips.cc/paper_files/paper/2023/file/c1f7b1ed763e9c75e4db74b49b76db5f-Paper-Conference.pdf), [Ferret](http://arxiv.org/abs/2310.07704), [Ferret2](http://arxiv.org/abs/2404.07973), [CogVLM](http://arxiv.org/abs/2311.03079) |
| Output token coordinates | Output new tokens added to the vocabulary to refer to positions | [Kosmos-2](https://arxiv.org/pdf/2306.14824) |
| Pixel space | Output in a discrete pixel space encoded by VQGAN | [Unified-IO](https://arxiv.org/abs/2206.08916), [Unified-IO 2](http://arxiv.org/abs/2312.17172) |
| Proposal retrieval | Retrieve from region candidates | [LLM-Seg](http://arxiv.org/abs/2404.08767), [Kosmos-2](https://arxiv.org/pdf/2306.14824), [GROUNDHOG](http://arxiv.org/abs/2402.16846) |

### Referring

| Format | Desc | Paper |
|------------|-------|--------|
| Pooling | Leverage Mask Pooling / RoI Pooling / RoI Align to obtain features from the image encoder output | [Groma](http://arxiv.org/abs/2404.13013), [GPT4RoI](https://arxiv.org/pdf/2307.03601), [Osprey](https://arxiv.org/pdf/2312.10032), [PSALM](http://arxiv.org/abs/2403.14598), [GROUNDHOG](http://arxiv.org/abs/2402.16846), [Ferret](http://arxiv.org/abs/2310.07704), [Ferret2](http://arxiv.org/abs/2404.07973), [PVIT](https://arxiv.org/pdf/2308.13437), [ChatterBox](http://arxiv.org/abs/2401.13307), [GLaMM](http://arxiv.org/abs/2311.03356) |
| Numerical coordinates | Leverage numerical coordinates for referring (bbox / sampled points in a mask) | [Shikra](https://arxiv.org/pdf/2306.15195), [PerceptionGPT](https://arxiv.org/pdf/2311.06612) (w/ encoder), [NExT-Chat](https://arxiv.org/pdf/2311.04498) (w/ encoder), [CogVLM](http://arxiv.org/abs/2311.03079) |
| Token coordinates | Add new tokens to the vocabulary to represent spatial positions | [Kosmos-2](https://arxiv.org/pdf/2306.14824) |

* w/ encoder: refers to using an encoder to encode the input coordinates.

### Training Dataset

| Dataset | Source | Data Source | Quantity | Construction Method |
|------------|--------------|--------------|--------------|--------------|
| GRIT | [Ferret](http://arxiv.org/abs/2310.07704) | COYO-700M, LAION-2B | - |
  • Templates are used to convert existing data into instruction-following format.
  • SAM is used to generate masks for free-form referring.
  • GPT-4 is used to generate dialogues with bounding boxes.
  • Use GLIPv2 to ground groundable nouns in LLaVA-158k.
  • Negative mining: generate negative yes/no questions |
| Shikra-RD | [Shikra](https://arxiv.org/pdf/2306.15195) | Flickr30K Entities | 5,922 QA pairs | GPT-4 ==> Referential Dialogue (CoT dialogues with grounding & referring) |
| CB-300K | [ChatterBox](http://arxiv.org/abs/2401.13307) | VG | 717,075 QA pairs | 4 subsets.
  • CB-MRG: Use ChatGPT to write dialogues with bbox
  • CB-LC: extend strict relations (from the scene graph) to multi-turn QA with ChatGPT
  • CB-REF: referring expression generation (REG) task
  • CB-GND: grounding task |
| GranD | [GLaMM](http://arxiv.org/abs/2311.03356) | SA-1B | 11M images with 7.5M unique concepts and 810M regions | Automated annotation pipeline with SAM for dense pixel-wise grounding. Used for pretraining. |
| GranD-f | [GLaMM](http://arxiv.org/abs/2311.03356) | GranD (refined), Flickr30K, RefCOCOg, and PSG | ~214K image-grounded text pairs | Refined subset of GranD for fine-tuning, with 1000 images held out for human-annotated evaluation |

### Training Recipe
| Model | Recipe |
|------------|--------------|
| [Ferret](http://arxiv.org/abs/2310.07704) |
  • Use LLaVA pretrained weights
  • SFT on GRIT |
| [Ferret2](http://arxiv.org/abs/2404.07973) |
  • image-caption alignment on 1.4M image-text pairs
  • high-resolution dense alignment with template referring & grounding
  • instruction tuning with GRIT, VQA and OCR (VQA and OCR are augmented with GLIPv2 bbox) |
| [ChatterBox](http://arxiv.org/abs/2401.13307) | Trainable: LoRA and location decoder
  • warm-up training with a visual-grounding-only dataset.
  • instruction tuning with CB-300K |
| [GPT4RoI](https://arxiv.org/pdf/2307.03601) |
  • Use LLaVA pretrained weights
  • pretrain region feature extractor with text-region datasets (COCO, RefCOCO, RefCOCO+)
  • train connector, region feature extractor and LLM to follow instructions |
| [GLaMM](http://arxiv.org/abs/2311.03356) |
  • Use [GPT4RoI](https://arxiv.org/pdf/2307.03601) pretrained weights
  • pretrain on 11M GranD with LoRA
  • finetune on GranD-f, LLaVA-Instruct-150K and LLaVA-Instruct-80K |

### Evaluation Dataset
| Dataset | Source | Data Source | Quantity | Construction Method |
|------------|--------------|--------------|--------------|--------------|
| Ferret Bench | [Ferret](http://arxiv.org/abs/2310.07704) | COCO validation set | 120 |
  • Referring Description: models are asked to describe a referred region based on its interaction with surrounding objects.
  • Referring Reasoning: models need to reason on top of one or more referred regions correctly.
  • Grounding in Conversation: models are required to reason correctly and accurately ground/localize the objects/regions necessary for the reasoning. |

### Paper List

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

[Paper](https://arxiv.org/pdf/2307.03601) | [Github](https://github.com/jshilong/GPT4RoI)

1. proposes referring for MLLMs by replacing a region placeholder token in the instruction with the feature obtained by mask pooling

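A minimal sketch of this placeholder-splicing style of referring, assuming a PyTorch embedding table, a reserved placeholder id, and region features already pooled from the image encoder (all names below are illustrative, not GPT4RoI's actual API):

```python
import torch

def splice_region_features(input_ids, region_feats, embed, placeholder_id):
    """Replace placeholder-token embeddings with pooled region features.

    input_ids:      (seq_len,) token ids containing one placeholder per referred region
    region_feats:   (num_regions, hidden) features pooled from the image encoder
    embed:          nn.Embedding mapping token ids to hidden-size vectors
    placeholder_id: id of the reserved region placeholder token (hypothetical)
    """
    inputs_embeds = embed(input_ids).clone()               # (seq_len, hidden)
    slots = (input_ids == placeholder_id).nonzero(as_tuple=True)[0]
    assert len(slots) == region_feats.shape[0], "one placeholder per referred region"
    inputs_embeds[slots] = region_feats.to(inputs_embeds.dtype)
    return inputs_embeds                                   # pass to the LLM as inputs_embeds

# toy usage: token id 31999 stands in for the placeholder
embed = torch.nn.Embedding(32000, 4096)
ids = torch.tensor([1, 887, 31999, 29973])
out = splice_region_features(ids, torch.randn(1, 4096), embed, placeholder_id=31999)
```
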
Osprey: Pixel Understanding with Visual Instruction Tuning

[Paper](https://arxiv.org/pdf/2312.10032) | [Github](https://github.com/CircleRadon/Osprey)

1. similar to GPT4RoI, Osprey also uses a mask representation to refer to entities in images.
2. It uses mask pooling to extract semantic features from the image encoder and combines them with a location extractor that processes the mask and outputs a spatial token.

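A rough sketch of the mask pooling step described above: average the image-encoder features that fall inside a binary region mask (the bilinear downsampling of the mask is an assumption, not necessarily Osprey's exact choice):

```python
import torch
import torch.nn.functional as F

def mask_pool(feat_map, mask):
    """Average image-encoder features inside a binary region mask.

    feat_map: (C, H, W) dense features from the image encoder
    mask:     (h, w) binary mask at image resolution (values in {0, 1})
    returns:  (C,) pooled region feature
    """
    c, hf, wf = feat_map.shape
    m = F.interpolate(mask[None, None].float(), size=(hf, wf),
                      mode="bilinear", align_corners=False)[0, 0]
    m = (m > 0.5).float()                                  # re-binarize at feature resolution
    denom = m.sum().clamp(min=1.0)
    return (feat_map * m).flatten(1).sum(dim=1) / denom

mask = torch.zeros(336, 336)
mask[100:200, 120:260] = 1.0                               # rectangular region for the toy example
pooled = mask_pool(torch.randn(256, 24, 24), mask)         # e.g. a 24x24 ViT feature map
```
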
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

[Paper](https://proceedings.neurips.cc/paper_files/paper/2023/file/c1f7b1ed763e9c75e4db74b49b76db5f-Paper-Conference.pdf) | [Github](https://github.com/OpenGVLab/VisionLLM)

1. unified interface for vision and VL tasks: points for detection, sampled points for instance segmentation ==> instruction format for training
2. extra tokens & output-format-as-query decoding (faster)

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

[Paper](https://arxiv.org/abs/2206.08916) | [Github](https://github.com/allenai/unified-io-inference) | [Project](https://unified-io.allenai.org/)

1. creates a unified IO for all sorts of vision and VL tasks (everything is converted into discrete tokens)
2. uses a T5-like encoder-decoder architecture

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

[Paper](http://arxiv.org/abs/2312.17172) | [Github](https://github.com/allenai/unified-io-2) | [Project](https://unified-io-2.allenai.org/)

1. following Unified-IO v1, creates a unified IO for all sorts of modalities including images, masks, bboxes and audio (into discrete tokens)
   1. dense masks are all binary, unlike v1 which specifies the color in the text instruction (the model struggles to follow it)
2. proposes 2D Rotary Embedding, QK Normalization and Scaled Cosine Attention to stabilize training and scaling
3. Mixture of Denoisers training objectives
4. instruction tuning on 220 tasks drawn from over 120 external datasets

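A minimal sketch of the QK-normalization trick listed above: normalize queries and keys before computing attention logits so their scale stays bounded (using LayerNorm as the normalizer is an assumption; shapes and names are illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Single-head attention with LayerNorm applied to queries and keys."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(dim)   # normalizing q and k keeps the logits bounded,
        self.k_norm = nn.LayerNorm(dim)   # which helps stabilize large-scale training

    def forward(self, x):
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
        return attn @ v

y = QKNormAttention(64)(torch.randn(2, 10, 64))   # (batch, seq, dim)
```
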
PixelLM: Pixel Reasoning with Large Multimodal Model

[Paper](http://arxiv.org/abs/2312.02228) | [Github](https://github.com/MaverickRen/PixelLM) | [Project](https://pixellm.github.io/)

1. learnable seg tokens + a lightweight decoder
2. a bunch of tricks:
   1. N x L seg tokens for L levels of multi-scale vision features; N tokens within each group for better modeling
   2. reweighted loss on regions with overlapping predictions

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

[Paper](http://arxiv.org/abs/2403.14598) | [Github](https://github.com/zamling/PSALM)

1. new paradigm: first generate mask proposals, then generate masks and classifications (following Mask2Former)
2. instruction prompt + conditional prompt + candidate mask tokens
   1. three types of conditional prompts: classes, sentences (referring segmentation) and visual cues (points, scribbles, boxes, etc.)
   2. conditional prompt => condition embed, candidate mask tokens => mask embed
   3. condition embed + mask embed + image feature => Mask2Former decoder => bipartite matching loss + query-based decoding

![图 0](images/398f94238fe61990ba3dd93ec6e1357359d45541ac7c22a06f0cb804f3bc2b4e.png)

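A small sketch of the bipartite-matching step used in Mask2Former-style decoding as described above: build a cost matrix between predicted queries and ground-truth masks, then solve it with the Hungarian algorithm (the cost terms and weights below are illustrative assumptions):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_masks, gt_labels, gt_masks, w_cls=1.0, w_mask=1.0):
    """Match N predicted queries to M ground-truth masks by minimum total cost.

    pred_logits: (N, num_classes) class logits per query
    pred_masks:  (N, H, W) predicted mask logits
    gt_labels:   (M,) ground-truth class ids
    gt_masks:    (M, H, W) binary ground-truth masks
    returns:     (pred_indices, gt_indices) arrays of matched pairs
    """
    prob = pred_logits.softmax(-1)                         # (N, C)
    cost_cls = -prob[:, gt_labels]                         # (N, M): negative prob of the gt class
    p = pred_masks.sigmoid().flatten(1)                    # (N, H*W)
    g = gt_masks.float().flatten(1)                        # (M, H*W)
    inter = p @ g.t()                                      # soft intersection per (pred, gt) pair
    cost_mask = 1 - (2 * inter + 1) / (p.sum(-1)[:, None] + g.sum(-1)[None, :] + 1)
    cost = (w_cls * cost_cls + w_mask * cost_mask).detach().cpu().numpy()
    return linear_sum_assignment(cost)

pred_i, gt_i = hungarian_match(torch.randn(100, 81), torch.randn(100, 64, 64),
                               torch.tensor([3, 17]), torch.randint(0, 2, (2, 64, 64)))
```
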
LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

[Paper](http://arxiv.org/abs/2404.08767) | [Github](https://github.com/wangjunchi/LLMSeg)

1. Uses SAM to generate mask candidates, then formulates the problem as mask selection (mask classification)
2. Proposes the LLM-Seg40K dataset, built by using LLaVA to generate captions and GPT-4 to generate question-answer pairs.

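A toy sketch of the mask-selection formulation: score every SAM-proposed candidate against a query embedding derived from the LLM output and keep the best one (cosine similarity is an assumed scoring choice, not necessarily LLM-Seg's):

```python
import torch
import torch.nn.functional as F

def select_mask(query_emb, cand_embs, cand_masks):
    """Pick the SAM-proposed mask whose embedding best matches the query.

    query_emb:  (D,) embedding derived from the LLM's reasoning output
    cand_embs:  (K, D) one embedding per candidate mask (e.g. mask-pooled features)
    cand_masks: (K, H, W) binary candidate masks from SAM
    """
    scores = F.cosine_similarity(query_emb[None, :], cand_embs, dim=-1)  # (K,)
    best = scores.argmax().item()
    return cand_masks[best], scores

mask, scores = select_mask(torch.randn(256), torch.randn(32, 256),
                           torch.randint(0, 2, (32, 128, 128)))
```
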
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

[Paper](http://arxiv.org/abs/2402.16846) | [Project](https://groundhog-mllm.github.io/)

1. disentangles grounding from referring
2. grounding as mask selection; trains a Mask2Former+ to generate mask candidates
3. referring by mask pooling on features
4. proposes the 2.5M M3G2 dataset

DetGPT: Detect What You Need via Reasoning

[Paper](http://arxiv.org/abs/2305.14167) | [Github](https://github.com/OptimalScale/DetGPT) | [Project](https://detgpt.github.io/)

1. Follows LLaVA to tune a VLM for VQA
2. Uses Grounding DINO to ground the response generated by the VLM and detect the relevant entities.

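The reason-then-detect flow can be summarized in a few lines; the two callables below are stand-ins for any instruction-tuned VLM and any open-vocabulary detector (e.g. Grounding DINO), not DetGPT's actual API:

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]

def reason_then_detect(image, instruction: str,
                       vlm_answer: Callable[[object, str], str],
                       open_vocab_detect: Callable[[object, List[str]], List[Tuple[str, Box]]]):
    """Ask the VLM which objects are relevant, then ground them with a detector."""
    prompt = (f"{instruction}\n"
              "List only the names of the objects in the image needed to fulfil this request, "
              "separated by commas.")
    names = [n.strip() for n in vlm_answer(image, prompt).split(",") if n.strip()]
    return open_vocab_detect(image, names)   # detector queried with the reasoned object names

# usage: reason_then_detect(img, "I want a cold drink", my_vlm, my_open_vocab_detector)
```
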
Ferret: Refer and Ground Anything Anywhere at Any Granularity

[Paper](http://arxiv.org/abs/2310.07704) | [Github](https://github.com/apple/ml-ferret)

1. proposes a hybrid region representation for referring: region name + coordinates + a feature pooled by a spatial-aware visual sampler
2. grounding through bbox coordinates

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

[Paper](http://arxiv.org/abs/2404.07973)

1. proposes a bunch of improvements over Ferret v1:
   1. any-resolution (patches) for larger input resolution (see the sketch after the u-LLaVA entry below)
   2. a DINOv2 encoder for local feature extraction
   3. a High-resolution Dense Alignment stage between image-caption alignment and instruction tuning

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

[Paper](http://arxiv.org/abs/2311.05348) | [Github](https://github.com/OPPOMKLab/u-LLaVA)

1. proposes to use different decoders for grounding (SAM for segmentation, Grounding DINO for detection)

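A minimal sketch of any-resolution style patching: split a high-resolution image into encoder-sized tiles plus one global low-resolution view (the tile size and grid selection are assumptions, not Ferret-v2's exact recipe):

```python
import torch
import torch.nn.functional as F

def anyres_split(image, tile=336, max_tiles=4):
    """Return a global low-res view plus local tiles of a high-res image.

    image: (3, H, W) tensor in [0, 1]
    tile:  side length expected by the vision encoder (e.g. 336 for CLIP-L/14@336)
    """
    _, h, w = image.shape
    gh = min(max_tiles, max(1, round(h / tile)))
    gw = min(max_tiles, max(1, round(w / tile)))
    resized = F.interpolate(image[None], size=(gh * tile, gw * tile),
                            mode="bilinear", align_corners=False)[0]
    tiles = [resized[:, i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
             for i in range(gh) for j in range(gw)]
    global_view = F.interpolate(image[None], size=(tile, tile),
                                mode="bilinear", align_corners=False)[0]
    return global_view, tiles   # each is (3, tile, tile); tiles are encoded separately

g, t = anyres_split(torch.rand(3, 900, 1400))   # -> global view + a 3x4 grid of tiles
```
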
GSVA: Generalized Segmentation via Multimodal Large Language Models

[Paper](http://arxiv.org/abs/2312.10103) | [Github](https://github.com/LeapLabTHU/GSVA)

1. addresses Generalized Referring Expression Segmentation (GRES) with a grounding LLM
   1. multiple objects to ground
   2. need to reject null targets
2. proposes to use multiple [SEG] tokens to ground multiple objects (each indicated by the text before its [SEG] token), and a [REJ] token to reject null targets

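A schematic of how the special tokens can be routed at inference time in this style of model: hidden states at [SEG] positions go to a mask decoder, while [REJ] marks targets with no mask (token ids and the decoder interface are illustrative):

```python
import torch

def route_seg_tokens(output_ids, hidden_states, seg_id, rej_id, mask_decoder):
    """Collect per-target masks from generated special tokens.

    output_ids:    (T,) generated token ids
    hidden_states: (T, D) last-layer hidden states aligned with output_ids
    mask_decoder:  callable (D,) -> (H, W) mask, e.g. a SAM-style decoder head
    returns a list of masks (None where the target was rejected with [REJ])
    """
    masks = []
    for t, tok in enumerate(output_ids.tolist()):
        if tok == seg_id:
            masks.append(mask_decoder(hidden_states[t]))
        elif tok == rej_id:
            masks.append(None)            # null target: nothing to segment
    return masks

# toy usage with a dummy decoder
masks = route_seg_tokens(torch.tensor([5, 9001, 7, 9002]), torch.randn(4, 4096),
                         seg_id=9001, rej_id=9002,
                         mask_decoder=lambda e: torch.zeros(64, 64))
```
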
NExT-Chat: An LMM for Chat, Detection and Segmentation

[Paper](https://arxiv.org/pdf/2311.04498) | [Github](https://github.com/NExT-ChatV/NExT-Chat) | [Project](https://next-chatv.github.io/)

1. proposes a box encoder-decoder for referring and grounding
2. for grounding, a special trigger token indicates the presence of a grounding output, and its latent embedding is fed to the box decoder (or a mask decoder, e.g. SAM) for box (mask) generation
3. for referring, boxes represent the referred region and a box encoder encodes them into features that are input to the LLM
4. proposes a cycle-consistency loss to regularize the box encoder-decoder

![图 0](images/50d131269405f43de1d95d747d9f7321d3a46bc87e3a2758286c837f8dec379a.png)

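A compact sketch of a cycle-consistency regularizer for a box encoder-decoder in the spirit described above (the tiny MLP codec and the L1 form are assumptions, not NExT-Chat's exact modules):

```python
import torch
import torch.nn as nn

class BoxCodec(nn.Module):
    """Encode a normalized box to a latent and decode it back."""

    def __init__(self, dim=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(4, dim), nn.GELU(), nn.Linear(dim, dim))
        self.dec = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, boxes):             # boxes: (B, 4) with coordinates in [0, 1]
        latent = self.enc(boxes)
        return latent, self.dec(latent)

codec = BoxCodec()
boxes = torch.rand(8, 4)
latent, recon = codec(boxes)
cycle_loss = nn.functional.l1_loss(recon, boxes)   # decode(encode(box)) should give the box back
cycle_loss.backward()
```
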
PerceptionGPT: Effectively Fusing Visual Perception into LLM

[Paper](https://arxiv.org/pdf/2311.06612)

1. similar to NExT-Chat, proposes a box encoder-decoder to encode and decode boxes, but seems to focus only on grounding without referring
2. One possibly intriguing point: a grounding-output indicator token marks the presence of a grounding output (as usual), but on the input side the token is replaced by the encoder's output feature.

![图 1](images/b94662b3d3344af40518360bbf617a97bda2baf867d56e7463170c0b64d32101.png)

Kosmos-2: Grounding Multimodal Large Language Models to the World

[Paper](https://arxiv.org/abs/2306.14824) | [Github](https://github.com/microsoft/unilm/tree/master/kosmos-2)

1. builds a web-scale grounding dataset from web-scale data (COYO-700M, LAION-2B, etc.) and a vision detector (GLIP)
2. following Pix2Seq, divides the image into PxP grids and introduces PxP new location tokens to represent positions
3. a bbox is represented by a pair of location tokens (top-left and bottom-right patches), with a delimiter token to separate multiple boxes
4. a markdown-like grammar with opening/closing tags marks the grounded text span, e.g.:

![图 2](images/ee9c18e6d50fb94df01b5ff11283fd3128d6b9f0c7e103e70c07887bf94a71d2.png)

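A small sketch of quantizing continuous box coordinates into a PxP grid of location tokens in this style (the `<loc_*>`, `<phrase>` and `<object>` spellings are placeholders, not necessarily Kosmos-2's exact vocabulary):

```python
def box_to_location_tokens(box, p=32):
    """Map a normalized box to two grid-cell tokens (top-left, bottom-right).

    box: (x1, y1, x2, y2) with coordinates in [0, 1]
    p:   grid size; the vocabulary holds p*p location tokens
    """
    def cell(x, y):
        col = min(int(x * p), p - 1)
        row = min(int(y * p), p - 1)
        return row * p + col

    x1, y1, x2, y2 = box
    tl, br = cell(x1, y1), cell(x2, y2)
    return f"<loc_{tl}><loc_{br}>"        # placeholder token spelling

# "a dog" grounded to one box, with placeholder phrase/object tags
print("<phrase>a dog</phrase><object>"
      + box_to_location_tokens((0.1, 0.2, 0.55, 0.9)) + "</object>")
```
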
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

[Paper](https://arxiv.org/pdf/2306.15195) | [Github](https://github.com/shikras/shikra)

1. proposes to use normalized boxes for unified grounding and referring
2. uses plain text to represent all normalized boxes (tokenized directly by the text tokenizer) as input/output of the LLM

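A tiny sketch of the plain-text coordinate format: write normalized boxes as numbers directly in the string and parse them back with a regex (the bracket syntax and 3-decimal precision are assumptions):

```python
import re

def box_to_text(box, ndigits=3):
    """Render a normalized (x1, y1, x2, y2) box as plain text for the tokenizer."""
    return "[" + ",".join(f"{v:.{ndigits}f}" for v in box) + "]"

def parse_boxes(text):
    """Recover all boxes the model wrote back into its answer."""
    pattern = r"\[([01]\.\d+),([01]\.\d+),([01]\.\d+),([01]\.\d+)\]"
    return [tuple(float(v) for v in m) for m in re.findall(pattern, text)]

answer = f"The mug {box_to_text((0.12, 0.40, 0.35, 0.78))} is to the left of the laptop."
print(parse_boxes(answer))   # [(0.12, 0.4, 0.35, 0.78)]
```
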
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

[Paper](http://arxiv.org/abs/2404.13013) | [Github](https://github.com/FoundationVision/Groma) | [Project](https://groma-mllm.github.io/)

1. Proposes to ground and refer with a set of proposed regions.
2. Turns a Deformable DETR detection head into a binary classifier to propose RoIs and uses RoIAlign pooling to get the region features

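For reference, RoIAlign pooling of region features from a feature map is available in torchvision; a minimal usage sketch (the feature stride, output size and boxes are illustrative):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 64, 64)             # (N, C, H, W) backbone feature map, stride 16
boxes = torch.tensor([[0, 32.0, 48.0, 320.0, 400.0],    # (batch_idx, x1, y1, x2, y2) in image coords
                      [0, 100.0, 120.0, 200.0, 260.0]])
region_feats = roi_align(feat, boxes, output_size=(7, 7),
                         spatial_scale=1 / 16, sampling_ratio=2, aligned=True)
print(region_feats.shape)                      # torch.Size([2, 256, 7, 7]) -> pool/flatten per region
```
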
    300 | 301 | LISA: Reasoning Segmentation via Large Language Model 302 | 303 | [Paper](https://arxiv.org/abs/2308.00692) | [Github](https://github.com/dvlab-research/LISA) 304 | 305 | 1. Introduce the reasoning segmentation task and establish a reasoning segmentation benchmark. 306 | 2. Propose LISA model, which represents the segmentation mask as an embedding and incorporates new segmentation capabilities. 307 |
GLaMM: Pixel Grounding Large Multimodal Model

[Paper](http://arxiv.org/abs/2311.03356) | [Github](https://github.com/mbzuai-oryx/groundingLMM) | [Project](https://mbzuai-oryx.github.io/groundingLMM)

1. Introduces the Grounded Conversation Generation (GCG) task, combining phrase grounding, referring expression segmentation, and vision-language conversations.
2. Proposes a scalable pipeline to curate GranD (Grounding-anything Dataset) with 7.5M unique concepts grounded in 810M regions with segmentation masks, plus the ~214K GranD-f fine-tuning set and a ~1000-sample evaluation set.
3. Architecture: a Global Image Encoder for holistic understanding, a Region Encoder with RoI pooling for region referring, an LLM that generates responses with grounding tokens, and a Pixel Decoder (SAM) that decodes segmentation masks from the grounding tokens' latents.

## 🔥 Multi-modality

GroundingGPT: Language Enhanced Multi-modal Grounding Model

[Paper](http://arxiv.org/abs/2401.06071) | [Github](https://github.com/lzw-lzw/GroundingGPT)

1. grounding and referring for multiple modalities, expressed in text
   1. bounding boxes as four relative coordinate values: [x1, y1, x2, y2]
   2. video timestamps as two two-digit decimals: {t1, t2}
2. curates datasets for three-stage training

![图 1](images/8f3a6cf2fec0f679487196ed6c48f94e076ae29ae311f8a888fcc8ce23e73e7c.png)

--------------------------------------------------------------------------------