├── INSTALLATION.md
├── README.md
├── align.sh
├── assest
│   ├── imgs
│   ├── main_features_embodiedgpt.png
│   └── overall_frame_embodiedgpt.png
├── datasets
│   └── datasets_share.zip
├── demo
│   ├── .DS_Store
│   ├── inference.py
│   ├── script.py
│   └── test.py
├── pyproject.toml
├── requirements.txt
├── robohusky
│   ├── .DS_Store
│   ├── base_dataset.py
│   ├── base_dataset_uni.py
│   ├── compression.py
│   ├── configuration_husky.py
│   ├── constants.py
│   ├── conversation.py
│   ├── convert_fp16.py
│   ├── convert_husky_fp16.py
│   ├── convert_reward_fp16.py
│   ├── dist_utils.py
│   ├── llama2_flash_attn_monkey_patch.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-38.pyc
│   │   │   ├── configuration_husky.cpython-38.pyc
│   │   │   └── modeling_husky_embody2.cpython-38.pyc
│   │   ├── compression.py
│   │   ├── configuration_husky.py
│   │   ├── configuration_husky_ori.py
│   │   ├── modeling_husky.py
│   │   ├── modeling_husky_embody2.py
│   │   ├── modeling_husky_embody2_ori.py
│   │   └── processing_husky.py
│   ├── train
│   │   ├── .DS_Store
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── llama_rmsnorm_monkey_patch.py
│   │   ├── train.py
│   │   └── train_uni.py
│   ├── utils.py
│   └── video_transformers.py
├── train_files
│   └── example_train_file.json
├── zero_stage0_config.json
├── zero_stage1_config.json
├── zero_stage2_config.json
└── zero_stage3_config.json

/INSTALLATION.md:
--------------------------------------------------------------------------------
1 | ## 🛠️ Installation
2 | 
3 | - Clone this repository:
4 | 
5 | ```bash
6 | git clone https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
7 | ```
8 | 
9 | - Create a conda virtual environment and activate it:
10 | 
11 | ```bash
12 | conda create -n robohusky python=3.9.16 -y
13 | conda activate robohusky
14 | ```
15 | 
16 | - Install `PyTorch>=2.0` and `torchvision>=0.15.2` with `CUDA>=11.7`:
17 | 
18 | For example, to install `torch==2.0.1` with `CUDA==11.8`:
19 | 
20 | ```bash
21 | conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
22 | # or
23 | pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
24 | ```
25 | 
26 | - Install `flash-attn`:
27 | 
28 | ```bash
29 | git clone https://github.com/Dao-AILab/flash-attention.git
30 | cd flash-attention
31 | pip install flash-attn --no-build-isolation
32 | ```
33 | 
34 | - Install `apex` (optional):
35 | 
36 | ```bash
37 | pip install transformers==4.34.1
38 | ```
39 | 
40 | - Install `apex` (optional):
41 | 
42 | ```bash
43 | git clone https://github.com/NVIDIA/apex.git && cd apex
44 | git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 # https://github.com/NVIDIA/apex/issues/1735
45 | pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
46 | ```
47 | 
48 | - Install other requirements:
49 | 
50 | ```bash
51 | cd ..
52 | pip install -e EmbodiedGPT_Pytorch
53 | ```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Embodied Family Code Base
2 | 
3 | We will update the instructions for this codebase as soon as possible.
4 | 
5 | ## Installation
6 | 
7 | See [INSTALLATION.md](https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch/blob/main/INSTALLATION.md)
8 | 
9 | ## Data Preparation
10 | 
11 | 1. Download the [EgoCOT dataset](https://github.com/EmbodiedGPT/EgoCOT_Dataset).
12 | 2. Download the [COCO-2017 dataset](https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset).
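Neither dataset has a fixed location in this repository; wherever you store the raw data simply has to match the paths referenced in your train file. As a minimal sketch, assuming the downloads are placed under `data/egocot/` and `data/coco2017/` (illustrative names, not paths the code base mandates), you can sanity-check the layout like this:

```python
# Minimal sanity check for the downloaded data. The directory names below are
# assumptions for illustration; keep them consistent with your own train file.
from pathlib import Path

data_root = Path("data")
expected = {
    "EgoCOT": data_root / "egocot",        # videos and chain-of-thought annotations
    "COCO-2017": data_root / "coco2017",   # train2017/, val2017/, annotations/
}

for name, path in expected.items():
    status = "found" if path.is_dir() else "missing"
    print(f"{name}: {path} ({status})")
```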
13 | 
14 | ## Download the Pretrained Model
15 | 
16 | Download the test checkpoint
17 | [Embodied_family_7btiny](https://huggingface.co/Liang-ZX/Embodied_family_7b/).
18 | 
19 | ## Prepare the Text Data Paired with Video and Image
20 | 
21 | - Unzip `datasets_share.zip`, which contains the text part of the multi-modal dataset, to the `./datasets/` directory.
22 | 
23 | ## 🏠 Overview
24 | 
25 | ![Overall framework of EmbodiedGPT](assest/overall_frame_embodiedgpt.png)
26 | 
27 | ## 🎁 Major Features
28 | 
29 | ![Main features of EmbodiedGPT](assest/main_features_embodiedgpt.png)
30 | 
31 | ## Usage
32 | 
33 | This repo can be used in conjunction with PyTorch's `Dataset` and `DataLoader` for training models on heterogeneous
34 | data. Here's a brief overview of the classes and their functionalities:
35 | 
36 | ### BaseDataset
37 | 
38 | The `BaseDataset` class extends PyTorch's `Dataset` and is designed to handle different media types (images, videos, and
39 | text). It includes a transformation process to standardize the input data and a processor to handle the data specific to
40 | the task.
41 | 
42 | #### Example
43 | 
44 | ```python
45 | from robohusky.base_dataset_uni import BaseDataset
46 | 
47 | # Initialize the dataset with the required parameters
48 | dataset = BaseDataset(
49 |     dataset,  # Your dataset here
50 |     processor,  # Your processor here
51 |     image_path="path/to/images",
52 |     input_size=224,
53 |     num_segments=8,
54 |     norm_type="openai",
55 |     media_type="image"
56 | )
57 | 
58 | # Use the dataset with a PyTorch DataLoader
59 | from torch.utils.data import DataLoader
60 | 
61 | data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
62 | ```
63 | 
64 | ### WeightedConcatDataset
65 | 
66 | The `WeightedConcatDataset` class extends PyTorch's `ConcatDataset` and allows for the creation of a unified dataset by
67 | concatenating multiple datasets with specified weights.
68 | 
69 | #### Example
70 | 
71 | ```python
72 | from robohusky.base_dataset_uni import WeightedConcatDataset
73 | 
74 | # Assume we have multiple datasets for different tasks
75 | dataset1 = BaseDataset(...)
76 | dataset2 = BaseDataset(...)
77 | dataset3 = BaseDataset(...)
78 | 
79 | # Define the weights for each dataset
80 | weights = [0.5, 0.3, 0.2]
81 | 
82 | # Create a weighted concatenated dataset
83 | weighted_dataset = WeightedConcatDataset([dataset1, dataset2, dataset3], weights=weights)
84 | 
85 | # Use the weighted dataset with a PyTorch DataLoader
86 | data_loader = DataLoader(weighted_dataset, batch_size=32, shuffle=True)
87 | ```
88 | 
89 | ## Customization
90 | 
91 | The package is designed to be flexible and customizable. You can implement your own transformation and processing logic
92 | by subclassing `BaseDataset` and overriding the necessary methods.
93 | 
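#### Example

A minimal sketch of such a subclass (the `"text"` key below is an assumption about what your processor emits, not part of the repository's API; adapt it to the items your pipeline actually produces):

```python
from robohusky.base_dataset_uni import BaseDataset

class LowercaseTextDataset(BaseDataset):
    """Illustrative subclass that post-processes whatever BaseDataset yields."""

    def __getitem__(self, idx):
        example = super().__getitem__(idx)
        # Hypothetical post-processing step: normalize a text field if present.
        if isinstance(example, dict) and "text" in example:
            example["text"] = example["text"].lower()
        return example
```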
94 | ## 🎫 License
95 | 
96 | This project is released under the [Apache 2.0 license](LICENSE).
97 | 
98 | ## 🖊️ Citation
99 | 
100 | If you find this project useful in your research, please consider citing:
101 | ```bibtex
102 | @article{mu2024embodiedgpt,
103 |   title={Embodiedgpt: Vision-language pre-training via embodied chain of thought},
104 |   author={Mu, Yao and Zhang, Qinglong and Hu, Mengkang and Wang, Wenhai and Ding, Mingyu and Jin, Jun and Wang, Bin and Dai, Jifeng and Qiao, Yu and Luo, Ping},
105 |   journal={Advances in Neural Information Processing Systems},
106 |   volume={36},
107 |   year={2024}
108 | }
109 | ```
110 | 
--------------------------------------------------------------------------------
/align.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | 
3 | set -x
4 | 
5 | PARTITION="your partition"
6 | 
7 | GPUS=${GPUS:-"your number"}
8 | GPUS_PER_NODE=${GPUS_PER_NODE:-"your number"}
9 | QUOTA_TYPE="reserved"
10 | 
11 | CPUS_PER_TASK=${CPUS_PER_TASK:-10}
12 | SRUN_ARGS=${SRUN_ARGS:-""}
13 | 
14 | srun -p ${PARTITION} \
15 |     --job-name='embodied_family' \
16 |     --gres=gpu:${GPUS_PER_NODE} \
17 |     --nodes="your number" \
18 |     --ntasks=${GPUS} \
19 |     --ntasks-per-node=${GPUS_PER_NODE} \
20 |     --cpus-per-task=${CPUS_PER_TASK} \
21 |     --kill-on-bad-exit=1 \
22 |     --quotatype=${QUOTA_TYPE} \
23 |     ${SRUN_ARGS} \
24 |     python -u ./embodied_family/robohusky/train/train.py \
25 |     --model_name_or_path "your path" \
26 |     --cache_dir "/your path to cache" \
27 |     --conv_style "husky" \
28 |     --train_file "your path to train file" \
29 |     --output_dir "your output dir" \
30 |     --overwrite_output_dir True \
31 |     --run_name "embodied_family" \
32 |     --freeze_vision_model False \
33 |     --freeze_vision_adapter False \
34 |     --freeze_qformer False \
35 |     --freeze_text_model False \
36 |     --preprocessing_num_workers 1 \
37 |     --pad_to_max_length True \
38 |     --fp16 True \
39 |     --num_train_epochs 3 \
40 |     --per_device_train_batch_size 1 \
41 |     --per_device_eval_batch_size 1 \
42 |     --gradient_accumulation_steps 4 \
43 |     --evaluation_strategy "no" \
44 |     --save_strategy "steps" \
45 |     --save_steps 1000 \
46 |     --save_total_limit 1 \
47 |     --learning_rate 2e-6 \
48 |     --weight_decay 0. \
49 |     --warmup_ratio 0.05 \
50 |     --lr_scheduler_type "cosine" \
51 |     --logging_steps 1 \
52 |     --max_seq_length 2048 \
53 |     --do_train True \
54 |     --deepspeed "zero_stage2_config.json"
55 | 
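# The srun invocation above assumes a Slurm cluster. As a rough, untested sketch,
# a single-node run might instead call the DeepSpeed launcher directly
# (illustrative only: adjust the GPU count, paths, and config to your setup):
#
#   deepspeed --num_gpus=8 ./embodied_family/robohusky/train/train.py \
#       --model_name_or_path "your path" \
#       --conv_style "husky" \
#       --train_file "your path to train file" \
#       --output_dir "your output dir" \
#       --fp16 True \
#       --deepspeed "zero_stage2_config.json"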
--------------------------------------------------------------------------------
/assest/imgs:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/assest/main_features_embodiedgpt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/assest/main_features_embodiedgpt.png
--------------------------------------------------------------------------------
/assest/overall_frame_embodiedgpt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/assest/overall_frame_embodiedgpt.png
--------------------------------------------------------------------------------
/datasets/datasets_share.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/datasets/datasets_share.zip
--------------------------------------------------------------------------------
/demo/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EmbodiedGPT/EmbodiedGPT_Pytorch/cda80524bf6b7d276ba3b532887bacd4b133f234/demo/.DS_Store
--------------------------------------------------------------------------------
/demo/inference.py:
--------------------------------------------------------------------------------
1 | """
2 | srun -p INTERN2 --job-name='husky_multi_test' --gres=gpu:1 --cpus-per-task=8 --quotatype="auto" python -u demo/inference_new.py
3 | """
4 | 
5 | import abc
6 | from typing import Optional
7 | 
8 | import os
9 | import requests
10 | from PIL import Image
11 | from io import BytesIO
12 | 
13 | import torch
14 | import torchvision.transforms as T
15 | from peft import PeftModel
16 | from torchvision.transforms.functional import InterpolationMode
17 | 
18 | from transformers import (
19 |     LlamaTokenizer,
20 |     GenerationConfig,
21 |     StoppingCriteria,
22 |     StoppingCriteriaList,
23 | )
24 | 
25 | from robohusky.model.modeling_husky_embody2 import HuskyForConditionalGeneration
26 | 
27 | from robohusky.conversation import (
28 |     conv_templates,
29 |     get_conv_template,
30 | )
31 | 
32 | from robohusky.video_transformers import (
33 |     GroupNormalize,
34 |     GroupScale,
35 |     GroupCenterCrop,
36 |     Stack,
37 |     ToTorchFormatTensor,
38 |     get_index,
39 | )
40 | 
41 | from robohusky.compression import compress_module
42 | from decord import VideoReader, cpu
43 | 
44 | # import deepspeed
45 | 
46 | IGNORE_INDEX = -100
47 | DEFAULT_UNK_TOKEN = ""
48 | DEFAULT_IMG_START_TOKEN = ""
49 | DEFAULT_IMG_END_TOKEN = ""
50 | 
51 | DEFAULT_VIDEO_START_TOKEN = ""
52 | DEFAULT_VIDEO_END_TOKEN = ""
53 | 
54 | def get_gpu_memory(max_gpus=None):
55 |     gpu_memory = []
56 |     num_gpus = (
57 |         torch.cuda.device_count()
58 |         if max_gpus is None
59 |         else min(max_gpus, torch.cuda.device_count())
60 |     )
61 | 
62 |     for gpu_id in range(num_gpus):
63 |         with torch.cuda.device(gpu_id):
64 |             device = torch.cuda.current_device()
65 |             gpu_properties = torch.cuda.get_device_properties(device)
66 |             total_memory = gpu_properties.total_memory / (1024 ** 3)
67 |             allocated_memory = torch.cuda.memory_allocated() / (1024 ** 3)
68 |             available_memory = total_memory - allocated_memory
69 |             gpu_memory.append(available_memory)
70 |     return gpu_memory
71 | 
72 | def load_model(
73 |     model_path, device, num_gpus, max_gpu_memory=None, load_8bit=False, lora_weights=None
74 | ):
75 |     if device == "cpu":
76 |         kwargs = {}
77 |     elif device == "cuda":
78 |         kwargs = {"torch_dtype": torch.float16}
79 |         if num_gpus == "auto":
80 |             kwargs["device_map"] = "auto"
81 |         else:
82 |             num_gpus = int(num_gpus)
83 |             if num_gpus != 1:
84 |                 kwargs["device_map"] = "auto"
85 |                 if max_gpu_memory is None:
86 |                     kwargs[
87 |                         "device_map"
88 |                     ] = "sequential"  # This is important when the GPUs have different VRAM sizes
89 |                     available_gpu_memory = get_gpu_memory(num_gpus)
90 |                     kwargs["max_memory"] = {
91 |                         i: str(int(available_gpu_memory[i] * 0.85)) + "GiB"
92 |                         for i in range(num_gpus)
93 |                     }
94 |                 else:
95 |                     kwargs["max_memory"] = {i: max_gpu_memory for i in range(num_gpus)}
96 |     else:
97 |         raise ValueError(f"Invalid device: {device}")
98 | 
99 |     tokenizer = LlamaTokenizer.from_pretrained(
100 |         model_path, use_fast=False)
101 | 
102 |     if lora_weights is None:
103 |         model = HuskyForConditionalGeneration.from_pretrained(
104 |             model_path, low_cpu_mem_usage=True, **kwargs
105 |         )
106 |     else:
107 |         kwargs["device_map"] = "auto"
108 |         model = HuskyForConditionalGeneration.from_pretrained(
109 |             model_path, low_cpu_mem_usage=True, **kwargs
110 |         )
111 |         model.language_model = PeftModel.from_pretrained(
112 |             model.language_model,
113 |             lora_weights,
114 |             **kwargs
115 |         )
116 | 
117 |     if load_8bit:
118 |         compress_module(model, device)
119 | 
120 |     if (device == "cuda" and num_gpus == 1) or device == "mps":
121 |         model.to(device)
122 | 
123 |     model = model.eval()
124 |     return model, tokenizer
125 | 
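# A rough usage sketch for load_model() above; the checkpoint path and GPU count
# are placeholders for illustration, not values shipped with the repository:
#
#   model, tokenizer = load_model(
#       "path/to/Embodied_family_7btiny", device="cuda", num_gpus=1
#   )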
126 | def load_image(image_file, input_size=224):
127 |     if image_file.startswith('http') or image_file.startswith('https'):
128 |         response = requests.get(image_file)
129 |         image = Image.open(BytesIO(response.content)).convert('RGB')
130 |     else:
131 |         image = Image.open(image_file).convert('RGB')
132 | 
133 |     crop_pct = 224 / 256
134 |     size = int(input_size / crop_pct)
135 |     transform = T.Compose([
136 |         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
137 |         T.Resize(size, interpolation=InterpolationMode.BICUBIC),
138 |         T.CenterCrop(input_size),
139 |         T.ToTensor(),
140 |         T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
141 |     ])
142 |     image = transform(image)
143 |     return image
144 | 
145 | def load_video(video_path, num_segments=8):
146 |     vr = VideoReader(video_path, ctx=cpu(0))
147 |     num_frames = len(vr)
148 |     frame_indices = get_index(num_frames, num_segments)
149 | 
150 |     # transform
151 |     crop_size = 224
152 |     scale_size = 224
153 |     input_mean = [0.48145466, 0.4578275, 0.40821073]
154 |     input_std = [0.26862954, 0.26130258, 0.27577711]
155 | 
156 |     transform = T.Compose([
157 |         GroupScale(int(scale_size), interpolation=InterpolationMode.BICUBIC),
158 |         GroupCenterCrop(crop_size),
159 |         Stack(),
160 |         ToTorchFormatTensor(),
161 |         GroupNormalize(input_mean, input_std)
162 |     ])
163 | 
164 |     images_group = list()
165 |     for frame_index in frame_indices:
166 |         img = Image.fromarray(vr[frame_index].asnumpy())
167 |         images_group.append(img)
168 |     video = transform(images_group)
169 |     return video
170 | 
171 | class StoppingCriteriaSub(StoppingCriteria):
172 | 
173 |     def __init__(self, stops, encounters=1):
174 |         super().__init__()
175 |         self.stops = stops
176 | 
177 |     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs):
178 |         for stop in self.stops:
179 |             if torch.all((stop == input_ids[0][-len(stop):])).item():
180 |                 return True
181 | 
182 |         return False
183 | 
184 | @torch.inference_mode()
185 | def generate_stream(
186 |     model, tokenizer, image_processor, params, device
187 | ):
188 |     prompt = params["prompt"]
189 |     images = params.get("images", None)
190 |     videos = params.get("videos", None)
191 |     temperature = float(params.get("temperature", 0.7))
192 |     max_new_tokens = int(params.get("max_new_tokens", 1024))
193 | 
194 |     num_queries = model.config.num_query_tokens
195 | 
196 |     stop_words = ["Human: ", "Assistant: ", "###", "\n\n"]
197 |     stop_words_ids = [tokenizer(stop_word, return_tensors='pt')['input_ids'].squeeze() for stop_word in stop_words]
198 |     stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
199 | 
200 |     generation_config = GenerationConfig(
201 |         bos_token_id=1,
202 |         do_sample=True,
203 |         temperature=temperature,
204 |         max_new_tokens=max_new_tokens,
205 |         stopping_criteria=stopping_criteria
206 |     )
207 | 
208 |     pixel_values = None
209 |     if images is not None:
210 |         pixel_values = load_image(images).to(device)  # only support one image
211 |         image_query = DEFAULT_IMG_START_TOKEN + DEFAULT_IMG_END_TOKEN
212 |         prompt = prompt.replace("", image_query)
213 | 
214 |     elif videos is not None:
215 |         pixel_values = load_video(videos).to(device)
216 |         video_query = DEFAULT_VIDEO_START_TOKEN + DEFAULT_VIDEO_END_TOKEN
217 |         prompt = prompt.replace("