├── LICENSE
├── README.md
├── Statement of Clarification.md
├── images
│   ├── architecture.png
│   └── readme.md
├── lego
│   ├── LEGO.py
│   ├── __init__.py
│   ├── constants.py
│   ├── conversation.py
│   ├── mm_utils.py
│   ├── model
│   │   ├── builder.py
│   │   └── utils.py
│   ├── serve
│   │   ├── __init__.py
│   │   ├── cli.py
│   │   ├── gradio_utils.py
│   │   └── gradio_web_server.py
│   ├── train
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-39.pyc
│   │   │   ├── llama_flash_attn_monkey_patch.cpython-39.pyc
│   │   │   ├── llava_trainer.cpython-39.pyc
│   │   │   └── train.cpython-39.pyc
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── train.py
│   │   └── train_mem.py
│   └── utils.py
├── requirements.txt
├── scripts
│   ├── finetune.sh
│   ├── pretrain.sh
│   ├── zero2.json
│   └── zero3.json
└── video_llama
    ├── __init__.py
    ├── __pycache__
    │   └── __init__.cpython-39.pyc
    ├── common
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-39.pyc
    │   │   ├── dist_utils.cpython-39.pyc
    │   │   ├── logger.cpython-39.pyc
    │   │   ├── registry.cpython-39.pyc
    │   │   └── utils.cpython-39.pyc
    │   ├── config.py
    │   ├── dist_utils.py
    │   ├── gradcam.py
    │   ├── logger.py
    │   ├── optims.py
    │   ├── registry.py
    │   └── utils.py
    ├── configs
    │   ├── datasets
    │   │   ├── cc_sbu
    │   │   │   ├── align.yaml
    │   │   │   └── defaults.yaml
    │   │   ├── instruct
    │   │   │   ├── llava_instruct.yaml
    │   │   │   └── webvid_instruct.yaml
    │   │   ├── laion
    │   │   │   └── defaults.yaml
    │   │   └── webvid
    │   │       └── defaults.yaml
    │   ├── default.yaml
    │   └── models
    │       ├── minigpt4.yaml
    │       └── video_llama.yaml
    ├── conversation
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-39.pyc
    │   │   └── conversation_video.cpython-39.pyc
    │   └── conversation_video.py
    ├── datasets
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-39.pyc
    │   │   └── data_utils.cpython-39.pyc
    │   ├── builders
    │   │   ├── __init__.py
    │   │   ├── __pycache__
    │   │   │   ├── __init__.cpython-39.pyc
    │   │   │   ├── base_dataset_builder.cpython-39.pyc
    │   │   │   ├── image_text_pair_builder.cpython-39.pyc
    │   │   │   ├── instruct_builder.cpython-39.pyc
    │   │   │   └── video_caption_builder.cpython-39.pyc
    │   │   ├── base_dataset_builder.py
    │   │   ├── image_text_pair_builder.py
    │   │   ├── instruct_builder.py
    │   │   └── video_caption_builder.py
    │   ├── data_utils.py
    │   └── datasets
    │       ├── __init__.py
    │       ├── __pycache__
    │       │   ├── __init__.cpython-39.pyc
    │       │   ├── base_dataset.cpython-39.pyc
    │       │   ├── caption_datasets.cpython-39.pyc
    │       │   ├── cc_sbu_dataset.cpython-39.pyc
    │       │   ├── laion_dataset.cpython-39.pyc
    │       │   ├── llava_instruct_dataset.cpython-39.pyc
    │       │   ├── video_instruct_dataset.cpython-39.pyc
    │       │   └── webvid_datasets.cpython-39.pyc
    │       ├── base_dataset.py
    │       ├── caption_datasets.py
    │       ├── cc_sbu_dataset.py
    │       ├── dataloader_utils.py
    │       ├── laion_dataset.py
    │       ├── llava_instruct_dataset.py
    │       ├── video_instruct_dataset.py
    │       └── webvid_datasets.py
    ├── models
    │   ├── ImageBind
    │   │   ├── .assets
    │   │   │   ├── bird_audio.wav
    │   │   │   ├── bird_image.jpg
    │   │   │   ├── car_audio.wav
    │   │   │   ├── car_image.jpg
    │   │   │   ├── dog_audio.wav
    │   │   │   └── dog_image.jpg
    │   │   ├── CODE_OF_CONDUCT.md
    │   │   ├── CONTRIBUTING.md
    │   │   ├── LICENSE
    │   │   ├── README.md
    │   │   ├── __pycache__
    │   │   │   └── data.cpython-39.pyc
    │   │   ├── bpe
    │   │   │   └── bpe_simple_vocab_16e6.txt.gz
    │   │   ├── data.py
    │   │   ├── model_card.md
    │   │   ├── models
    │   │   │   ├── __init__.py
    │   │   │   ├── __pycache__
    │   │   │   │   ├── __init__.cpython-39.pyc
    │   │   │   │   ├── helpers.cpython-39.pyc
    │   │   │   │   ├── imagebind_model.cpython-39.pyc
    │   │   │   │   ├── multimodal_preprocessors.cpython-39.pyc
    │   │   │   │   └── transformer.cpython-39.pyc
    │   │   │   ├── helpers.py
    │   │   │   ├── imagebind_model.py
    │   │   │   ├── multimodal_preprocessors.py
    │   │   │   └── transformer.py
    │   │   └── requirements.txt
    │   ├── Qformer.py
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── Qformer.cpython-39.pyc
    │   │   ├── __init__.cpython-39.pyc
    │   │   ├── base_model.cpython-39.pyc
    │   │   ├── blip2.cpython-39.pyc
    │   │   ├── eva_vit.cpython-39.pyc
    │   │   ├── modeling_llama.cpython-39.pyc
    │   │   └── video_llama.cpython-39.pyc
    │   ├── base_model.py
    │   ├── blip2.py
    │   ├── blip2_outputs.py
    │   ├── eva_vit.py
    │   ├── modeling_llama.py
    │   └── video_llama.py
    ├── processors
    │   ├── .ipynb_checkpoints
    │   │   └── video_processor-checkpoint.py
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-39.pyc
    │   │   ├── base_processor.cpython-39.pyc
    │   │   ├── blip_processors.cpython-39.pyc
    │   │   ├── functional_video.cpython-39.pyc
    │   │   ├── randaugment.cpython-39.pyc
    │   │   ├── transforms_video.cpython-39.pyc
    │   │   └── video_processor.cpython-39.pyc
    │   ├── base_processor.py
    │   ├── blip_processors.py
    │   ├── functional_video.py
    │   ├── randaugment.py
    │   ├── transforms_video.py
    │   └── video_processor.py
    ├── runners
    │   ├── __init__.py
    │   ├── runner_base.py
    │   └── test.py
    └── tasks
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-39.pyc
        │   ├── base_task.cpython-39.pyc
        │   ├── image_text_pretrain.cpython-39.pyc
        │   └── video_text_pretrain.cpython-39.pyc
        ├── base_task.py
        ├── image_text_pretrain.py
        └── video_text_pretrain.py
/README.md:
--------------------------------------------------------------------------------
1 | # GroundingGPT: Language-Enhanced Multi-modal Grounding Model
2 |
3 | [Training Dataset](https://huggingface.co/datasets/zwli/GroundingGPT) [Model](https://huggingface.co/zwli/GroundingGPT)
4 |
5 |
6 | ## Introduction
7 | GroundingGPT is an end-to-end multimodal grounding model that accurately comprehends inputs and possesses robust grounding capabilities across multiple modalities, including images, audio, and video. To address the issue of limited data, we construct a diverse and high-quality multimodal training dataset. This dataset encompasses a rich collection of multimodal data enriched with spatial and temporal information, serving as a valuable resource to foster further advancements in this field. Extensive experimental evaluations validate the effectiveness of GroundingGPT on understanding and grounding tasks across various modalities.
8 |
9 | More details are available on our [project page](https://lzw-lzw.github.io/GroundingGPT.github.io/).
10 |
11 |
12 | ![The overall structure of GroundingGPT](images/architecture.png)
13 | The overall structure of GroundingGPT. Blue boxes correspond to video input, while yellow boxes correspond to image input.
14 |
15 |
16 | ## News
17 | * **[2024.5]** Our paper is accepted to ACL 2024!
18 | * **[2024.4]** Our [model](https://huggingface.co/zwli/GroundingGPT) is available now!
19 | * **[2024.3]** Our [training dataset](https://huggingface.co/datasets/zwli/GroundingGPT) is available now!
20 | * **[2024.3]** Our code is available now!
21 |
22 | ## Dependencies and Installation
23 |     git clone https://github.com/lzw-lzw/GroundingGPT.git
24 |     cd GroundingGPT
25 |     conda create -n groundinggpt python=3.10 -y
26 |     conda activate groundinggpt
27 |     pip install -r requirements.txt
28 |     pip install flash-attn --no-build-isolation
29 |
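A quick sanity check that the key dependencies import cleanly (assuming PyTorch and Transformers are pulled in by `requirements.txt`):

    python -c "import torch, transformers, flash_attn; print(torch.__version__, transformers.__version__)"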
30 |
31 | ## Training
32 | ### Training model preparation
33 | - Put the prepared checkpoints in the directory `./ckpt`.
34 | - Prepare the ImageBind checkpoint: download [imagebind_huge.pth](https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth) from the link and put it under the directory `./ckpt/imagebind`.
35 | - Prepare the BLIP-2 checkpoint: download [blip2_pretrained_flant5xxl.pth](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth) from the link and put it under the directory `./ckpt` (a download sketch follows below).
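For reference, a minimal download sketch using the links above (`wget` assumed available; target directories match the steps above):

    mkdir -p ckpt/imagebind
    # ImageBind checkpoint
    wget -P ckpt/imagebind https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
    # BLIP-2 checkpoint
    wget -P ckpt https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth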
36 |
37 | ### Training dataset preparation
38 | - Put the prepared datasets in the directory `dataset`.
39 | - Prepare LLaVA, COCO, GQA, OCR-VQA, TextVQA, VisualGenome datasets: follow [LLaVA](https://github.com/haotian-liu/LLaVA).
40 | - Prepare Flickr30K-Entities datasets: follow [Flickr30K-Entities](https://bryanplummer.com/Flickr30kEntities/).
41 | - Prepare Valley datasets: follow [Valley](https://github.com/RupertLuo/Valley).
42 | - Prepare DiDeMo datasets: follow [DiDeMo](https://github.com/LisaAnne/TemporalLanguageRelease).
43 | - Prepare ActivityNet Captions datasets: follow [ActivityNet Captions](https://cs.stanford.edu/people/ranjaykrishna/densevid/).
44 | - Prepare Charades-STA datasets: follow [Charades-STA](https://github.com/jiyanggao/TALL).
45 | - Prepare VGGSS datasets: follow [VGGSS](https://www.robots.ox.ac.uk/~vgg/research/lvs/).
46 | - Prepare WavCaps datasets: follow [WavCaps](https://github.com/XinhaoMei/WavCaps).
47 | - Prepare Clotho datasets: follow [Clotho](https://zenodo.org/records/3490684).
48 |
49 |
50 | ### Training
51 |
52 | - Use the training scripts under `scripts/`: `pretrain.sh` for pretraining and `finetune.sh` for fine-tuning (a launch sketch follows below).
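A minimal launch sketch, assuming the data, checkpoint, and DeepSpeed config paths (e.g. `scripts/zero2.json`, `scripts/zero3.json`) referenced inside the scripts have been adjusted to your setup first:

    # pretraining stage
    bash scripts/pretrain.sh
    # fine-tuning stage
    bash scripts/finetune.sh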
53 |
54 | ## Inference
55 |
56 | - Download [GroundingGPT-7B](https://huggingface.co/zwli/GroundingGPT) and change the `model_path` in `GroundingGPT/lego/serve/cli.py` (a command-line download sketch follows below).
57 | - Use the following script to run inference:
58 |
59 |     python3 lego/serve/cli.py
60 |
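If you prefer to fetch the released checkpoint from the command line, one option is cloning the Hugging Face repository (requires `git-lfs`); the local directory below is only an example, so point `model_path` in `lego/serve/cli.py` at wherever you place the weights:

    git lfs install
    git clone https://huggingface.co/zwli/GroundingGPT ckpt/GroundingGPT-7B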
61 |
62 | ## Demo
63 | - Download [GroundingGPT-7B](https://huggingface.co/zwli/GroundingGPT) and change the `model_path` in line 141 of `GroundingGPT/lego/serve/gradio_web_server.py`.
64 | - Use the following script to launch a Gradio web demo:
65 |
66 |     python3 lego/serve/gradio_web_server.py
67 |
68 |
69 | ## Acknowledgement
70 | - [LLaVA](https://github.com/haotian-liu/LLaVA)
71 | - [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA)
72 | - [Shikra](https://github.com/shikras/shikra)
73 |
74 | ### Citation
75 | If you find GroundingGPT useful for your research and applications, please cite it using this BibTeX:
76 |
77 |     @inproceedings{li2024groundinggpt,
78 |       title={Groundinggpt: Language enhanced multi-modal grounding model},
79 |       author={Li, Zhaowei and Xu, Qi and Zhang, Dong and Song, Hang and Cai, Yiqing and Qi, Qi and Zhou, Ran and Pan, Junting and Li, Zefeng and Tu, Vu and others},
80 |       booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
81 |       pages={6657--6678},
82 |       year={2024}
83 |     }
84 |
--------------------------------------------------------------------------------
/Statement of Clarification.md:
--------------------------------------------------------------------------------
1 | We hereby clarify that the Language Enhanced Multi-modal Grounding Model (formerly referred to as the LEGO Language Model), which has since been renamed GroundingGPT, is in no way associated with or endorsed by the LEGO Group. There is no investment, collaboration, or any other form of relationship between the LEGO Group and our model, which previously used the LEGO name. We kindly request that any media or third-party entities that have published or disseminated inaccurate or misleading reports regarding this model promptly correct or remove the misinformation. Your immediate attention to this matter would be greatly appreciated. We deeply apologize for any confusion, inconvenience, or harm this matter may have caused to the LEGO Group.
2 |
--------------------------------------------------------------------------------
/images/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lzw-lzw/GroundingGPT/8afe78871d45c3b37c263f5065be8d769354951c/images/architecture.png
--------------------------------------------------------------------------------
/images/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/lego/__init__.py:
--------------------------------------------------------------------------------
1 | from .LEGO import LEGOLlamaForCausalLM
2 | from .train import *
--------------------------------------------------------------------------------
/lego/constants.py:
--------------------------------------------------------------------------------
1 | CONTROLLER_HEART_BEAT_EXPIRATION = 30
2 | WORKER_HEART_BEAT_INTERVAL = 15
3 |
4 | LOGDIR = "."
5 |
6 | # Model Constants
7 | IGNORE_INDEX = -100
8 | IMAGE_TOKEN_INDEX = -200
9 | AUDIO_TOKEN_INDEX = -200
10 | DEFAULT_IMAGE_TOKEN = "<image>"  # placeholder token values below assume LLaVA-style conventions
11 | DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
12 | DEFAULT_IMAGE_START_TOKEN = "<im_start>"
13 | DEFAULT_IMAGE_END_TOKEN = "<im_end>"
14 | DEFAULT_VIDEO_TOKEN = "