├── INSTALLATION.md
├── README.md
├── align.sh
├── assest
│   ├── imgs
│   ├── main_features_embodiedgpt.png
│   └── overall_frame_embodiedgpt.png
├── datasets
│   └── datasets_share.zip
├── demo
│   ├── .DS_Store
│   ├── inference.py
│   ├── script.py
│   └── test.py
├── pyproject.toml
├── requirements.txt
├── robohusky
│   ├── .DS_Store
│   ├── base_dataset.py
│   ├── base_dataset_uni.py
│   ├── compression.py
│   ├── configuration_husky.py
│   ├── constants.py
│   ├── conversation.py
│   ├── convert_fp16.py
│   ├── convert_husky_fp16.py
│   ├── convert_reward_fp16.py
│   ├── dist_utils.py
│   ├── llama2_flash_attn_monkey_patch.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-38.pyc
│   │   │   ├── configuration_husky.cpython-38.pyc
│   │   │   └── modeling_husky_embody2.cpython-38.pyc
│   │   ├── compression.py
│   │   ├── configuration_husky.py
│   │   ├── configuration_husky_ori.py
│   │   ├── modeling_husky.py
│   │   ├── modeling_husky_embody2.py
│   │   ├── modeling_husky_embody2_ori.py
│   │   └── processing_husky.py
│   ├── train
│   │   ├── .DS_Store
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── llama_rmsnorm_monkey_patch.py
│   │   ├── train.py
│   │   └── train_uni.py
│   ├── utils.py
│   └── video_transformers.py
├── train_files
│   └── example_train_file.json
├── zero_stage0_config.json
├── zero_stage1_config.json
├── zero_stage2_config.json
└── zero_stage3_config.json
/INSTALLATION.md:
--------------------------------------------------------------------------------
1 | ## 🛠️ Installation
2 |
3 | - Clone this repository:
4 |
5 | ```bash
6 | git clone https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
7 | ```
8 |
9 | - Create a conda virtual environment and activate it:
10 |
11 | ```bash
12 | conda create -n robohusky python=3.9.16 -y
13 | conda activate robohusky
14 | ```
15 |
16 | - Install `PyTorch>=2.0` and `torchvision>=0.15.2` with `CUDA>=11.7`:
17 |
18 | For example, to install `torch==2.0.1` with `CUDA==11.8`:
19 |
20 | ```bash
21 | conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
22 | # or
23 | pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
24 | ```
25 |
26 | - Install `flash-attn`:
27 |
28 | ```bash
29 | git clone https://github.com/Dao-AILab/flash-attention.git
30 | cd flash-attention
31 | pip install flash-attn --no-build-isolation
32 | ```
33 |
34 | - Install `transformers==4.34.1`:
35 |
36 | ```bash
37 | pip install transformers==4.34.1
38 | ```
39 |
40 | - Install `apex` (optional):
41 |
42 | ```bash
43 | git clone https://github.com/NVIDIA/apex.git
44 | cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 # https://github.com/NVIDIA/apex/issues/1735
45 | pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
46 | ```
47 |
48 | - Install other requirements:
49 |
50 | ```bash
51 | cd ..
52 | pip install -e EmbodiedGPT_Pytorch
53 | ```
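
- (Optional) Sanity-check the environment. A minimal sketch assuming the packages above were installed; the `flash_attn` import only succeeds if you built flash-attn:

```python
import torch
import torchvision
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)  # expected: 4.34.1

try:
    import flash_attn  # optional; only present if flash-attn was built
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")
```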
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Embodied Family Code Base
2 |
3 | We will update the instructions for this codebase as soon as possible.
4 |
5 | ## Installation
6 |
7 | See [INSTALLATION.md](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch/blob/main/INSTALLATION.md)
8 |
9 | ## Data Preparation
10 |
11 | 1. Download the [EgoCOT dataset](https://github.com/EmbodiedGPT/EgoCOT_Dataset).
12 | 2. Download the [COCO-2017 dataset](https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset).
13 |
14 | ## Download the Pretrained Model
15 |
16 | Download the testing
17 | model [Embodied_family_7btiny](https://huggingface.co/Liang-ZX/Embodied_family_7b/).
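
For example, you can fetch it with `huggingface_hub` (a minimal sketch; the local directory below is only an illustration):

```python
from huggingface_hub import snapshot_download

# Download the Embodied_family_7b checkpoint from the Hugging Face Hub.
# "checkpoints/Embodied_family_7b" is an example path; use any directory you like.
snapshot_download(
    repo_id="Liang-ZX/Embodied_family_7b",
    local_dir="checkpoints/Embodied_family_7b",
)
```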
18 |
19 | ## Prepare the Text Data Paired with Video and Image
20 |
21 | - Unzip `datasets_share.zip`, which contains the text part of the multi-modal dataset, to the `./datasets/` directory.
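
For example, with Python's standard `zipfile` module (a minimal sketch, equivalent to unzipping by hand; the archive path matches this repo's layout):

```python
import zipfile

# Extract the text annotations into ./datasets/, next to datasets_share.zip.
with zipfile.ZipFile("datasets/datasets_share.zip") as zf:
    zf.extractall("./datasets/")
```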
22 |
23 | ## 🏠 Overview
24 |
25 | ![Overall framework of EmbodiedGPT](assest/overall_frame_embodiedgpt.png)
26 |
27 | ## 🎁 Major Features
28 |
29 | ![Main features of EmbodiedGPT](assest/main_features_embodiedgpt.png)
30 |
31 | ## Usage
32 |
33 | This repo can be used in conjunction with PyTorch's `Dataset` and `DataLoader` for training models on heterogeneous
34 | data. Here's a brief overview of the classes and their functionalities:
35 |
36 | ### BaseDataset
37 |
38 | The `BaseDataset` class extends PyTorch's `Dataset` and is designed to handle different media types (images, videos, and
39 | text). It includes a transformation process to standardize the input data and a processor to handle the data specific to
40 | the task.
41 |
42 | #### Example
43 |
44 | ```python
45 | from robohusky.base_dataset_uni import BaseDataset
46 |
47 | # Initialize the dataset with the required parameters
48 | dataset = BaseDataset(
49 | dataset, # Your dataset here
50 | processor, # Your processor here
51 | image_path="path/to/images",
52 | input_size=224,
53 | num_segments=8,
54 | norm_type="openai",
55 | media_type="image"
56 | )
57 |
58 | # Use the dataset with a PyTorch DataLoader
59 | from torch.utils.data import DataLoader
60 |
61 | data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
62 | ```
63 |
64 | ### WeightedConcatDataset
65 |
66 | The `WeightedConcatDataset` class extends PyTorch's `ConcatDataset` and allows for the creation of a unified dataset by
67 | concatenating multiple datasets with specified weights.
68 |
69 | #### Example
70 |
71 | ```python
72 | from robohusky.base_dataset_uni import WeightedConcatDataset
73 |
74 | # Assume we have multiple datasets for different tasks
75 | dataset1 = BaseDataset(...)
76 | dataset2 = BaseDataset(...)
77 | dataset3 = BaseDataset(...)
78 |
79 | # Define the weights for each dataset
80 | weights = [0.5, 0.3, 0.2]
81 |
82 | # Create a weighted concatenated dataset
83 | weighted_dataset = WeightedConcatDataset([dataset1, dataset2, dataset3], weights=weights)
84 |
85 | # Use the weighted dataset with a PyTorch DataLoader
86 | data_loader = DataLoader(weighted_dataset, batch_size=32, shuffle=True)
87 | ```
88 |
89 | ## Customization
90 |
91 | The package is designed to be flexible and customizable. You can implement your own transformation and processing logic
92 | by subclassing `BaseDataset` and overriding the necessary methods.
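
For example, a minimal sketch of such a subclass (the post-processing shown here is purely illustrative; check `base_dataset_uni.py` for the exact hooks available in your version):

```python
from robohusky.base_dataset_uni import BaseDataset

class MyCustomDataset(BaseDataset):
    """Illustrative subclass that adds custom post-processing per sample."""

    def __getitem__(self, index):
        # Reuse BaseDataset's loading and transformation pipeline ...
        sample = super().__getitem__(index)
        # ... then apply your own logic, e.g. extra text normalization
        # or additional image/video augmentation.
        return sample
```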
93 |
94 | ## 🎫 License
95 |
96 | This project is released under the [Apache 2.0 license](LICENSE).
97 |
98 | ## 🖊️ Citation
99 |
100 | If you find this project useful in your research, please consider citing:
101 | ```bibtex
102 | @article{mu2024embodiedgpt,
103 | title={Embodiedgpt: Vision-language pre-training via embodied chain of thought},
104 | author={Mu, Yao and Zhang, Qinglong and Hu, Mengkang and Wang, Wenhai and Ding, Mingyu and Jin, Jun and Wang, Bin and Dai, Jifeng and Qiao, Yu and Luo, Ping},
105 | journal={Advances in Neural Information Processing Systems},
106 | volume={36},
107 | year={2024}
108 | }
109 | ```
110 |
--------------------------------------------------------------------------------
/align.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 |
3 | set -x
4 |
5 | PARTITION="your partition"
6 |
7 | GPUS=${GPUS:-your number}
8 | GPUS_PER_NODE=${GPUS_PER_NODE:-your number}
9 | QUOTA_TYPE="reserved"
10 |
11 | CPUS_PER_TASK=${CPUS_PER_TASK:-10}
12 | SRUN_ARGS=${SRUN_ARGS:-""}
13 |
14 | srun -p ${PARTITION} \
15 | --job-name='embodied_family' \
16 | --gres=gpu:${GPUS_PER_NODE} \
17 | --nodes="your number" \
18 | --ntasks=${GPUS} \
19 | --ntasks-per-node=${GPUS_PER_NODE} \
20 | --cpus-per-task=${CPUS_PER_TASK} \
21 | --kill-on-bad-exit=1 \
22 | --quotatype=${QUOTA_TYPE} \
23 | ${SRUN_ARGS} \
24 | python -u ./embodied_family/robohusky/train/train.py \
25 | --model_name_or_path "your path" \
26 | --cache_dir "/your path to cache" \
27 | --conv_style "husky" \
28 | --train_file "your path to train file" \
29 | --output_dir "your output dir" \
30 | --overwrite_output_dir True \
31 | --run_name "embodied_family" \
32 | --freeze_vision_model False \
33 | --freeze_vision_adapter False \
34 | --freeze_qformer False \
35 | --freeze_text_model False \
36 | --preprocessing_num_workers 1 \
37 | --pad_to_max_length True \
38 | --fp16 True \
39 | --num_train_epochs 3 \
40 | --per_device_train_batch_size 1 \
41 | --per_device_eval_batch_size 1 \
42 | --gradient_accumulation_steps 4 \
43 | --evaluation_strategy "no" \
44 | --save_strategy "steps" \
45 | --save_steps 1000 \
46 | --save_total_limit 1 \
47 | --learning_rate 2e-6 \
48 | --weight_decay 0. \
49 | --warmup_ratio 0.05 \
50 | --lr_scheduler_type "cosine" \
51 | --logging_steps 1 \
52 | --max_seq_length 2048 \
53 | --do_train True \
54 | --deepspeed "zero_stage2_config.json"
55 |
--------------------------------------------------------------------------------
/assest/imgs:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/assest/main_features_embodiedgpt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/assest/main_features_embodiedgpt.png
--------------------------------------------------------------------------------
/assest/overall_frame_embodiedgpt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/assest/overall_frame_embodiedgpt.png
--------------------------------------------------------------------------------
/datasets/datasets_share.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/datasets/datasets_share.zip
--------------------------------------------------------------------------------
/demo/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/demo/.DS_Store
--------------------------------------------------------------------------------
/demo/inference.py:
--------------------------------------------------------------------------------
1 | """
2 | srun -p INTERN2 --job-name='husky_multi_test' --gres=gpu:1 --cpus-per-task=8 --quotatype="auto" python -u demo/inference_new.py
3 | """
4 |
5 | import abc
6 | from typing import Optional
7 |
8 | import os
9 | import requests
10 | from PIL import Image
11 | from io import BytesIO
12 |
13 | import torch
14 | import torchvision.transforms as T
15 | from peft import PeftModel
16 | from torchvision.transforms.functional import InterpolationMode
17 |
18 | from transformers import (
19 |     LlamaTokenizer,
20 |     GenerationConfig,
21 |     StoppingCriteria,
22 |     StoppingCriteriaList,
23 | )
24 |
25 | from robohusky.model.modeling_husky_embody2 import HuskyForConditionalGeneration
26 |
27 | from robohusky.conversation import (
28 |     conv_templates,
29 |     get_conv_template,
30 | )
31 |
32 | from robohusky.video_transformers import (
33 |     GroupNormalize,
34 |     GroupScale,
35 |     GroupCenterCrop,
36 |     Stack,
37 |     ToTorchFormatTensor,
38 |     get_index,
39 | )
40 |
41 | from robohusky.compression import compress_module
42 | from decord import VideoReader, cpu
43 |
44 | # import deepspeed
45 |
46 | IGNORE_INDEX = -100
47 | DEFAULT_UNK_TOKEN = "<unk>"
48 | DEFAULT_IMG_START_TOKEN = "<img>"
49 | DEFAULT_IMG_END_TOKEN = "</img>"
50 |
51 | DEFAULT_VIDEO_START_TOKEN = "<video>"
52 | DEFAULT_VIDEO_END_TOKEN = "</video>"
53 |
54 | def get_gpu_memory(max_gpus=None):
55 |     gpu_memory = []
56 |     num_gpus = (
57 |         torch.cuda.device_count()
58 |         if max_gpus is None
59 |         else min(max_gpus, torch.cuda.device_count())
60 |     )
61 |
62 |     for gpu_id in range(num_gpus):
63 |         with torch.cuda.device(gpu_id):
64 |             device = torch.cuda.current_device()
65 |             gpu_properties = torch.cuda.get_device_properties(device)
66 |             total_memory = gpu_properties.total_memory / (1024 ** 3)
67 |             allocated_memory = torch.cuda.memory_allocated() / (1024 ** 3)
68 |             available_memory = total_memory - allocated_memory
69 |             gpu_memory.append(available_memory)
70 |     return gpu_memory
71 |
72 | def load_model(
73 |     model_path, device, num_gpus, max_gpu_memory=None, load_8bit=False, lora_weights=None
74 | ):
75 |     if device == "cpu":
76 |         kwargs = {}
77 |     elif device == "cuda":
78 |         kwargs = {"torch_dtype": torch.float16}
79 |         if num_gpus == "auto":
80 |             kwargs["device_map"] = "auto"
81 |         else:
82 |             num_gpus = int(num_gpus)
83 |             if num_gpus != 1:
84 |                 kwargs["device_map"] = "auto"
85 |                 if max_gpu_memory is None:
86 |                     kwargs[
87 |                         "device_map"
88 |                     ] = "sequential"  # important when the GPUs do not all have the same VRAM size
89 |                     available_gpu_memory = get_gpu_memory(num_gpus)
90 |                     kwargs["max_memory"] = {
91 |                         i: str(int(available_gpu_memory[i] * 0.85)) + "GiB"
92 |                         for i in range(num_gpus)
93 |                     }
94 |                 else:
95 |                     kwargs["max_memory"] = {i: max_gpu_memory for i in range(num_gpus)}
96 |     else:
97 |         raise ValueError(f"Invalid device: {device}")
98 |
99 |     tokenizer = LlamaTokenizer.from_pretrained(
100 |         model_path, use_fast=False)
101 |
102 |     if lora_weights is None:
103 |         model = HuskyForConditionalGeneration.from_pretrained(
104 |             model_path, low_cpu_mem_usage=True, **kwargs
105 |         )
106 |     else:
107 |         kwargs["device_map"] = "auto"
108 |         model = HuskyForConditionalGeneration.from_pretrained(
109 |             model_path, low_cpu_mem_usage=True, **kwargs
110 |         )
111 |         model.language_model = PeftModel.from_pretrained(
112 |             model.language_model,
113 |             lora_weights,
114 |             **kwargs
115 |         )
116 |
117 |     if load_8bit:
118 |         compress_module(model, device)
119 |
120 |     if (device == "cuda" and num_gpus == 1) or device == "mps":
121 |         model.to(device)
122 |
123 |     model = model.eval()
124 |     return model, tokenizer
125 |
126 | def load_image(image_file, input_size=224):
127 |     if image_file.startswith('http') or image_file.startswith('https'):
128 |         response = requests.get(image_file)
129 |         image = Image.open(BytesIO(response.content)).convert('RGB')
130 |     else:
131 |         image = Image.open(image_file).convert('RGB')
132 |
133 |     crop_pct = 224 / 256
134 |     size = int(input_size / crop_pct)
135 |     transform = T.Compose([
136 |         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
137 |         T.Resize(size, interpolation=InterpolationMode.BICUBIC),
138 |         T.CenterCrop(input_size),
139 |         T.ToTensor(),
140 |         T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
141 |     ])
142 |     image = transform(image)
143 |     return image
144 |
145 | def load_video(video_path, num_segments=8):
146 |     vr = VideoReader(video_path, ctx=cpu(0))
147 |     num_frames = len(vr)
148 |     frame_indices = get_index(num_frames, num_segments)
149 |
150 |     # transform
151 |     crop_size = 224
152 |     scale_size = 224
153 |     input_mean = [0.48145466, 0.4578275, 0.40821073]
154 |     input_std = [0.26862954, 0.26130258, 0.27577711]
155 |
156 |     transform = T.Compose([
157 |         GroupScale(int(scale_size), interpolation=InterpolationMode.BICUBIC),
158 |         GroupCenterCrop(crop_size),
159 |         Stack(),
160 |         ToTorchFormatTensor(),
161 |         GroupNormalize(input_mean, input_std)
162 |     ])
163 |
164 |     images_group = list()
165 |     for frame_index in frame_indices:
166 |         img = Image.fromarray(vr[frame_index].asnumpy())
167 |         images_group.append(img)
168 |     video = transform(images_group)
169 |     return video
170 |
171 | class StoppingCriteriaSub(StoppingCriteria):
172 |
173 |     def __init__(self, stops, encounters=1):
174 |         super().__init__()
175 |         self.stops = stops
176 |
177 |     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs):
178 |         for stop in self.stops:
179 |             if torch.all((stop == input_ids[0][-len(stop):])).item():
180 |                 return True
181 |
182 |         return False
183 |
184 | @torch.inference_mode()
185 | def generate_stream(
186 |     model, tokenizer, image_processor, params, device
187 | ):
188 |     prompt = params["prompt"]
189 |     images = params.get("images", None)
190 |     videos = params.get("videos", None)
191 |     temperature = float(params.get("temperature", 0.7))
192 |     max_new_tokens = int(params.get("max_new_tokens", 1024))
193 |
194 |     num_queries = model.config.num_query_tokens
195 |
196 |     stop_words = ["Human: ", "Assistant: ", "###", "\n\n"]
197 |     stop_words_ids = [tokenizer(stop_word, return_tensors='pt')['input_ids'].squeeze() for stop_word in stop_words]
198 |     stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
199 |
200 |     generation_config = GenerationConfig(
201 |         bos_token_id=1,
202 |         do_sample=True,
203 |         temperature=temperature,
204 |         max_new_tokens=max_new_tokens,
205 |         stopping_criteria=stopping_criteria
206 |     )
207 |
208 |     pixel_values = None
209 |     if images is not None:
210 |         pixel_values = load_image(images).to(device)  # only support one image
211 |         image_query = DEFAULT_IMG_START_TOKEN + DEFAULT_IMG_END_TOKEN
212 |         prompt = prompt.replace("<image>", image_query)
213 |
214 |     elif videos is not None:
215 |         pixel_values = load_video(videos).to(device)
216 |         video_query = DEFAULT_VIDEO_START_TOKEN + DEFAULT_VIDEO_END_TOKEN
217 | prompt = prompt.replace("