├── README.md
├── figures
│   ├── arch.png
│   ├── result-dfec.png
│   ├── result-emotion.png
│   └── result-mvbench.png
├── humanomni
│   ├── __init__.py
│   ├── constants.py
│   ├── conversation.py
│   ├── conversation_llava.py
│   ├── eval
│   │   ├── eval_mafw_dfew.py
│   │   ├── eval_video_mcqa_mvbench.py
│   │   ├── inference_dfec.py
│   │   └── inference_video_mcqa_mvbench.py
│   ├── humanomni_trainer.py
│   ├── mm_utils.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── encoder.py
│   │   ├── humanomni_arch.py
│   │   ├── humanomni_model.py
│   │   └── projector.py
│   ├── train_flash_attn.py
│   ├── train_humanomni.py
│   └── utils.py
├── inference.py
├── requirements.txt
└── scripts
    ├── eval
    │   └── eval_video_mcqa_mvbench.sh
    ├── train
    │   └── finetune_omni.sh
    └── zero3.json

/README.md:
--------------------------------------------------------------------------------
# HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding

[![ModelScope](https://img.shields.io/badge/ModelScope-HumanOmni-blue)](https://modelscope.cn/models/iic/HumanOmni-7B)
[![Hugging Face](https://img.shields.io/badge/HuggingFace-HumanOmni-yellow)](https://huggingface.co/StarJiaxing/HumanOmni-7B)
[![arXiv](https://img.shields.io/badge/arXiv-2501.15111-red)](https://arxiv.org/abs/2501.15111)
![HumanOmni architecture](figures/arch.png)
## [![News](https://img.shields.io/badge/News-Update-blue)](#news) News
1) Building upon HumanOmni, we are the **first to combine RLVR (Reinforcement Learning with Verifiable Rewards) with an Omni model**, introducing [R1-Omni](https://github.com/HumanMLLM/R1-Omni), a reasoning-based large multimodal model.


## 📖 Introduction
**HumanOmni** is the industry's first human-centric Omni-multimodal large language model for comprehensive understanding of human-centric scenes.
1) **2.4M human-centric video clips with over 14M double-checked instructions**: We constructed a dataset of over 2.4M human-centric video clips, providing rich and detailed information about individuals, together with over 14M instruction items for visual pretraining.
2) **50K video clips with more than 100K manually annotated instructions**: We manually annotated 50K video clips with more than 100K instructions covering emotion recognition, facial description, and speaker-specific speech recognition, used for visual fine-tuning and cross-modal interaction integration.
3) **Three human-specific branches**: HumanOmni uses three branches to handle face-related, body-related, and interaction-related scenes separately, and dynamically adjusts their fusion weights based on the input instruction to ensure accurate responses across scenes.
4) **Audio-visual synergy**: HumanOmni understands vision and speech simultaneously, allowing for a more comprehensive understanding of complex scenes.


## 📦 Model Download
| **Model**         | **Stage**                           | **#Params** | **HuggingFace**                                                                                        | **ModelScope**                                                                                                      |
|-------------------|-------------------------------------|-------------|--------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|
| `HumanOmni-Video` | Visual Capability Construction      | 7B          | [![HF](https://img.shields.io/badge/🤗-Download-yellow)](https://hf.co/StarJiaxing/HumanOmni-7B-Video) | [![MS](https://img.shields.io/badge/ModelScope-Download-blue)](https://modelscope.cn/models/iic/HumanOmni-7B-Video) |
| `HumanOmni-Audio` | Auditory Capability Development     | 7B          | [![HF](https://img.shields.io/badge/🤗-Download-yellow)](https://hf.co/StarJiaxing/HumanOmni-7B-Audio) | [![MS](https://img.shields.io/badge/ModelScope-Download-blue)](https://modelscope.cn/models/iic/HumanOmni-7B-Audio) |
| `HumanOmni-Omni`  | Cross-Modal Interaction Integration | 7B          | [![HF](https://img.shields.io/badge/🤗-Download-yellow)](https://hf.co/StarJiaxing/HumanOmni-7B)       | [![MS](https://img.shields.io/badge/ModelScope-Download-blue)](https://modelscope.cn/models/iic/HumanOmni-7B)       |

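To fetch the weights programmatically, a minimal sketch along the following lines should work; it assumes `huggingface_hub` (or `modelscope`) is installed, and the local checkpoint directory is only an example path.

```python
# Minimal sketch: download the HumanOmni-7B checkpoint.
# Assumes `pip install huggingface_hub`; local_dir is an example path, not something the repo requires.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="StarJiaxing/HumanOmni-7B",      # or "StarJiaxing/HumanOmni-7B-Video" / "StarJiaxing/HumanOmni-7B-Audio"
    local_dir="./checkpoints/HumanOmni-7B",  # example destination
)
print("Checkpoint downloaded to", local_path)

# ModelScope alternative (assumes `pip install modelscope`):
# from modelscope import snapshot_download
# local_path = snapshot_download("iic/HumanOmni-7B")
```
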
Our training pipeline consists of three progressive stages to establish multimodal understanding capabilities:

📹 Visual Capability Construction

- Model: HumanOmni-Video
- Objective: Learn spatio-temporal feature representations to analyze human actions and scene dynamics in videos.

🎧 Auditory Capability Development

- Model: HumanOmni-Audio
- Objective: Develop robust speech comprehension and audio interpretation through large-scale acoustic modeling.

🌐 Cross-Modal Interaction Integration

- Model: HumanOmni-Omni (also referred to as HumanOmni)
- Objective: Enable synergistic vision-audio reasoning by fine-tuning parameters from both HumanOmni-Video and HumanOmni-Audio.


## 🏆 Performance

- Emotion Understanding:

| Method | Modalities | DFEW (UAR) | DFEW (WAR) | MAFW (UAR) | MAFW (WAR) |
|----------------------------------|------------|-------------|-------------|-------------|-------------|
| **Specialized models for emotion-related tasks** | | | | | |
| Wav2Vec2.0 | A | 36.15 | 43.05 | 21.59 | 29.69 |
| HuBERT | A | 35.98 | 43.24 | 25.00 | 32.60 |
| DFER-CLIP | V | 59.61 | 71.25 | 38.89 | 52.55 |
| MAE-DFER | V | 63.41 | 74.43 | 41.62 | 54.31 |
| HiCMAE | AV | 63.76 | 75.01 | 42.65 | 56.17 |
| Emotion-LLaMA | AV | 64.21 | 77.06 | - | - |
| MMA-DFER | AV | 66.85 | 77.43 | 44.25 | 58.45 |
| **Other models** | | | | | |
| Qwen2-VL-7B | V | 43.08 | 52.83 | 31.67 | 45.89 |
| Qwen2-VL-72B | V | 39.24 | 45.12 | 42.61 | 46.07 |
| VITA | AV | 21.36 | 32.07 | 14.05 | 33.38 |
| InternLM-XComposer-2.5-OL | AV | 44.23 | 51.29 | 33.78 | 46.81 |
| GPT4-O | AV | 50.57 | 57.19 | 38.29 | 48.82 |
| **HumanOmni** | AV | **74.86** | **82.46** | **52.94** | **68.40** |

- [Dynamic Facial Expression Caption](https://modelscope.cn/datasets/iic/DFEC):

| Method | Correctness | Detail | Context | Temporal | CIDEr | Rouge-L | AutoDQ |
|----------------------------------|-------------|--------|---------|----------|--------|---------|---------|
| **Vision large language model** | | | | | | | |
| VideoLLaMA | 3.60 | 3.67 | 3.84 | 3.50 | 0.189 | 0.196 | 0.303 |
| VideoChat | 3.47 | 3.52 | 3.92 | 3.38 | 0.251 | 0.192 | 0.344 |
| VideoChat2 | 3.70 | 3.56 | 4.16 | 3.52 | 0.202 | 0.229 | 0.311 |
| Chat-UniVI | 3.64 | 3.63 | 4.21 | 3.61 | 0.189 | 0.231 | 0.396 |
| LLaVA-Next-Video | 4.19 | 4.07 | 4.39 | 4.04 | 0.250 | 0.249 | 0.395 |
| ShareGPT4Video | 4.24 | 4.13 | 4.35 | 4.09 | 0.192 | 0.205 | 0.394 |
| LLaMA-VID | 3.95 | 4.01 | 4.22 | 3.71 | 0.195 | 0.231 | 0.339 |
| VideoLLaMA2 | 4.17 | 4.02 | 4.47 | 3.93 | 0.253 | 0.266 | 0.344 |
| PLLaVA | 4.21 | 4.15 | 4.37 | 4.08 | 0.268 | 0.250 | 0.393 |
| ST-LLM | 4.00 | 3.98 | 4.31 | 3.94 | 0.213 | 0.238 | 0.321 |
| Tarsier | 3.59 | 3.50 | 4.07 | 3.41 | 0.143 | 0.185 | 0.415 |
| LLaVA-OneVision | 3.68 | 3.47 | 4.10 | 3.42 | 0.115 | 0.165 | 0.379 |
| FaceTrack-MM | 4.42 | 4.30 | 4.60 | 4.26 | 0.418 | 0.473 | 0.483 |
| Qwen2-VL-72B | 4.28 | 4.14 | 4.55 | 4.08 | 0.241 | 0.314 | 0.449 |
| Qwen2-VL-7B | 4.23 | 4.16 | 4.52 | 4.02 | 0.204 | 0.233 | 0.422 |
| Qwen2-VL-2B | 4.01 | 3.98 | 4.37 | 3.88 | 0.202 | 0.221 | 0.406 |
| Claude3.5-Sonnet | 4.13 | 4.01 | 4.49 | 4.05 | 0.243 | 0.228 | 0.442 |
| **Omni-modality large language model** | | | | | | | |
| GPT4-O | 4.22 | 3.97 | 4.48 | 3.90 | 0.264 | 0.213 | 0.432 |
| VITA | 3.98 | 3.74 | 4.11 | 3.59 | 0.191 | 0.224 | 0.366 |
| InternLM-XComposer-2.5-OL | 3.91 | 3.70 | 4.12 | 3.54 | 0.113 | 0.164 | 0.382 |
| **HumanOmni** | **4.58** | **4.41** | **4.70** | **4.41** | 0.412 | 0.468 | **0.523** |

- Action and Pose Understanding:

| Method | Action Sequence | Unexpected Action | Action Antonym | Object Interaction | Action Count | Fine-grained Action | Avg |
|----------------------------------|-----|-----|-----|-----|-----|-----|------|
| **Vision large language model** | | | | | | | |
| Otter-V | 23.0 | 29.5 | 27.5 | 28.0 | 26.0 | 27.0 | 26.8 |
| mPLUG-Owl-V | 22.0 | 29.0 | 34.0 | 27.0 | 31.5 | 29.0 | 28.8 |
| Video-LLaMA | 27.5 | 39.0 | 51.0 | 40.5 | 34.0 | 29.0 | 36.8 |
| LLaMA-Adapter | 23.0 | 33.0 | 51.0 | 32.5 | 29.0 | 30.0 | 33.1 |
| Video-ChatGPT | 23.5 | 26.5 | 62.0 | 28.0 | 30.5 | 22.5 | 32.2 |
| VideoChat | 33.5 | 40.5 | 56.0 | 40.5 | 35.0 | 33.5 | 39.8 |
| VideoChat2 | 75.5 | 60.5 | 83.5 | 74.5 | 37.0 | 50.5 | 63.6 |
| ST-LLM | 66.0 | 58.5 | 84.0 | 73.5 | 36.5 | 44.0 | 60.4 |
| PLLaVA | 58.0 | 61.0 | 55.5 | 61.0 | 39.5 | 41.0 | 52.6 |
| VideoLLaMB | 54.5 | 52.0 | 86.5 | 58.5 | 40.5 | 44.5 | 56.1 |
| Qwen2-VL-72B* | 51.5 | 82.0 | 93.5 | 81.5 | 48.5 | 49.0 | 67.7 |
| Qwen2-VL-7B* | 73.5 | 80.0 | 79.0 | 78.5 | 46.0 | 49.0 | 67.7 |
| Qwen2-VL-2B* | 77.5 | 76.5 | 76.5 | 77.5 | 50.0 | 47.5 | 67.6 |
| GPT-4V | 55.5 | 63.5 | 72.0 | 59.0 | 39.0 | 47.5 | 56.1 |
| **Omni-modality large language model** | | | | | | | |
| VITA | 58.0 | 81.5 | 73.5 | 61.5 | 45.5 | 42.0 | 60.3 |
| InternLM-XComposer-2.5-OL | 84.5 | 81.0 | 75.0 | 79.5 | 60.5 | 46.0 | 71.1 |
| **HumanOmni** | 70.0 | 78.0 | 92.5 | 80.5 | 65.5 | 49.0 | **72.6** |

## 📺 Demo
https://github.com/user-attachments/assets/ecf07e98-f448-4cb6-a124-6e2a2885e581
```
User: What's the major emotion?
HumanOmni: angry.

User: What's the major facial expression?
HumanOmni: frown.

User: Describe the woman.
HumanOmni: The woman has long brown hair and is wearing a pink top with a dark cardigan.

User: Please describe the major action in detail.
HumanOmni: The woman is frowning and shaking her head, indicating that she is angry. She is also gesturing with her hands as if she is explaining something or emphasizing a point.

User: What did the woman say?
HumanOmni: The woman said "Of course we object! I object, Steve!"
```

https://github.com/user-attachments/assets/0f0d9c5c-6e1d-473f-8c6b-819ebdc8aae8
```
User: What's the major emotion?
HumanOmni: happy.

User: What's the major facial expression?
HumanOmni: smile.

User: Describe the woman in detail.
HumanOmni: The woman in the video has blonde hair, is wearing a black top, and is smiling.

User: What did the woman say?
HumanOmni: The woman said "Dogs are sitting by the door."
```

[https://humanmllm.github.io/HumanOmni/chinese_video.mp4](https://github.com/user-attachments/assets/65b49d37-7213-452d-a492-cfddbbc8b41d)
```
User: Describe the appearance and clothing of the man in the video.
HumanOmni: The man in the video has short black hair and wears round-framed glasses. He is wearing a brown wool coat over a white top.

User: What did the man in the video say?
HumanOmni: The man talked about the color of the clothes he is wearing, the environment he is currently in, and what the weather is like today.

User: Does the man in the video show any emotion?
HumanOmni: No emotion.
```

## 🛠️ Environment Setup

To set up the recommended environment for HumanOmni, follow these instructions:

### Recommended Environment
- **Python**: >=3.10
- **CUDA**: >=12.1
- **PyTorch**: >=2.2 (with CUDA support)
- **Transformers**: >=4.45
- **Accelerate**: >=0.30.1

Or you can quickly set up the environment as follows:

```
git clone https://github.com/HumanMLLM/HumanOmni
cd HumanOmni
conda create -n humanOmni python=3.10 -y
conda activate humanOmni
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

## 🧠 Training on Custom Dataset
### Data Preparation
An example JSON file of the training data:
```
[
    {
        "video": "human/DFEW/videos/1.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "