├── Kimi-VL.pdf ├── LICENSE ├── README.md ├── figures ├── arch.png ├── demo.png ├── demo1.png ├── demo2.png ├── instruct_perf.png ├── logo.png └── thinking_perf.png └── requirements.txt /Kimi-VL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/Kimi-VL.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | Copyright © 2025 Moonshot AI 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 5 | 6 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 7 | 8 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | # Kimi-VL Technical Report 3 |
4 | 5 |
6 | Tech Report | HuggingFace | 💬 Chat Web 7 | 8 | 9 | 10 |
11 | 12 | 13 | ## 1. Introduction 14 | 15 | We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**, all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B). 16 | 17 | Kimi-VL demonstrates strong performance across challenging domains: 18 | as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models. 19 | Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding. 20 | 21 | In comparative evaluations, it competes effectively with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains. 22 | 23 | Kimi-VL also advances the Pareto frontier of multimodal models in long-context processing and fine-grained perception: equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc; its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common visual inputs and general tasks. 24 | 25 | Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models. 26 | 27 | ## 2. Architecture 28 | 29 | The model combines an MoE language model, a native-resolution vision encoder (MoonViT), and an MLP projector, as illustrated in the figure below. 30 | 31 |
32 | ![The Kimi-VL architecture: MoonViT vision encoder, MLP projector, and MoE language decoder](figures/arch.png) 33 |
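To make the wiring concrete, below is a minimal, illustrative PyTorch sketch of how these three components fit together. The class names, dimensions, and the tiny stand-in encoder/decoder are assumptions for illustration only; the actual MoonViT and MoE decoder ship with the Hugging Face checkpoints and are loaded via `trust_remote_code=True` (see Section 6).

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Maps MoonViT patch features into the language model's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.net(vision_features)


class KimiVLSketch(nn.Module):
    """High-level wiring: vision encoder -> MLP projector -> MoE language decoder."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, decoder: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # MoonViT: native-resolution ViT
        self.projector = projector            # MLP projector
        self.decoder = decoder                # MoE language model (~2.8B activated params)

    def forward(self, image_patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode image patches at native resolution, then project into the LLM embedding space.
        vision_tokens = self.projector(self.vision_encoder(image_patches))
        # The real model places vision tokens at image-placeholder positions in the
        # text sequence; for brevity this sketch simply prepends them.
        return self.decoder(torch.cat([vision_tokens, text_embeds], dim=1))


# Tiny stand-ins so the sketch runs end to end (NOT the real MoonViT / MoE decoder).
vision_dim, llm_dim = 1024, 2048
fake_moonvit = nn.Linear(14 * 14 * 3, vision_dim)  # pretend: flattened 14x14 RGB patch -> feature
fake_decoder = nn.Linear(llm_dim, llm_dim)         # pretend: a single decoder layer
model = KimiVLSketch(fake_moonvit, MLPProjector(vision_dim, llm_dim), fake_decoder)

image_patches = torch.randn(1, 256, 14 * 14 * 3)   # 256 flattened image patches
text_embeds = torch.randn(1, 32, llm_dim)          # 32 text-token embeddings
print(model(image_patches, text_embeds).shape)     # torch.Size([1, 288, 2048])
```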
34 | 35 | ## 3. News 36 | 37 | - 2025.04.15: [vLLM](https://github.com/vllm-project/vllm) now supports Kimi-VL deployment. See [#16387](https://github.com/vllm-project/vllm/pull/16387) for details. 38 | - 2025.04.14: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) now supports Kimi-VL fine-tuning. See [#7719](https://github.com/hiyouga/LLaMA-Factory/pull/7719) for details. 39 | 40 | ## 4. Model Variants 41 | 42 | 🤗 For general multimodal perception and understanding, OCR, long-video and long-document understanding, video perception, and agent use cases, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider `Kimi-VL-A3B-Thinking`. 43 | 44 |
45 | 46 | | **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** | 47 | | :------------: | :------------: | :------------: | :------------: | :------------: | 48 | | Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) | 49 | | Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) | 50 | 51 |
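If you prefer to warm the local cache before running inference, the checkpoints in the table above can be fetched ahead of time. A minimal sketch using `huggingface_hub` (installed as a dependency of `transformers`) is shown below; it simply downloads the listed repository into the default Hugging Face cache.

```python
from huggingface_hub import snapshot_download

# Pre-download the weights of either variant into the local Hugging Face cache.
snapshot_download(repo_id="moonshotai/Kimi-VL-A3B-Instruct")
# snapshot_download(repo_id="moonshotai/Kimi-VL-A3B-Thinking")
```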
52 | 53 | > [!Note] 54 | > Recommended parameter settings: 55 | > - For **Thinking models**, it is recommended to use `Temperature = 0.6`. 56 | > - For **Instruct models**, it is recommended to use `Temperature = 0.2`. 57 | 58 | 59 | ### Hugging Face Demo 60 | 61 | > 🤗 We serve our model demos in Hugging Face Spaces: 62 | > - Chat with the **Kimi-VL-A3B-Thinking** model 👀🤔🗺️ (featuring thinking, math, and puzzle solving) on Chat Web. 63 | > - Chat with the **Kimi-VL-A3B-Instruct** model 💻🎬📕 (featuring agent, video, and multi-page document tasks) on Chat Web. 64 | 65 | ## 5. Performance 66 | 67 | As an efficient model, Kimi-VL robustly handles diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single image, multi-image, video, long document, etc.). 68 | 69 | A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B): 70 | 71 |
72 | ![Kimi-VL-A3B-Instruct benchmark comparison with 10B-level dense VLMs and DeepSeek-VL2](figures/instruct_perf.png) 73 |
74 | 75 | With its effective long-thinking ability, Kimi-VL-A3B-Thinking can match the performance of 30B/70B frontier open-source VLMs on the MathVision benchmark: 76 | 77 |
78 | ![Kimi-VL-A3B-Thinking performance on the MathVision benchmark](figures/thinking_perf.png) 79 |
80 | 81 | 82 | ## 6. Example Usage 83 | 84 | ### Setup 85 | 86 | ```bash 87 | conda create -n kimi-vl python=3.10 -y 88 | conda activate kimi-vl 89 | pip install -r requirements.txt 90 | ``` 91 | 92 | > [!Note] 93 | > If you encounter out-of-memory (OOM) errors or want to speed up inference, please install **flash-attn** with `pip install flash-attn --no-build-isolation`. 94 | 95 | 96 | ### Inference with Hugging Face Transformers 97 | 98 | Below we show how to run inference with our models using the Hugging Face `transformers` library. We recommend python=3.10, torch=2.5.1, and transformers=4.51.3 as the development environment. 99 | 100 | Kimi-VL-A3B-Instruct: 101 | 102 | ```python 103 | import torch 104 | from PIL import Image 105 | from transformers import AutoModelForCausalLM, AutoProcessor 106 | 107 | model_path = "moonshotai/Kimi-VL-A3B-Instruct" 108 | model = AutoModelForCausalLM.from_pretrained( 109 | model_path, 110 | torch_dtype="auto", 111 | device_map="auto", 112 | trust_remote_code=True, 113 | ) 114 | # If flash-attn has been installed, it is recommended to set torch_dtype=torch.bfloat16 and attn_implementation="flash_attention_2" 115 | # to save memory and speed up inference 116 | # model = AutoModelForCausalLM.from_pretrained( 117 | # model_path, 118 | # torch_dtype=torch.bfloat16, 119 | # device_map="auto", 120 | # trust_remote_code=True, 121 | # attn_implementation="flash_attention_2" 122 | # ) 123 | 124 | processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True) 125 | 126 | image_path = "./figures/demo.png" 127 | image = Image.open(image_path) 128 | messages = [ 129 | {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]} 130 | ] 131 | text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt") 132 | inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device) 133 | generated_ids = model.generate(**inputs, max_new_tokens=512)  # for Instruct models, you can also pass do_sample=True, temperature=0.2 (see the note above) 134 | generated_ids_trimmed = [ 135 | out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) 136 | ] 137 | response = processor.batch_decode( 138 | generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False 139 | )[0] 140 | print(response) 141 | ``` 142 | 143 | Kimi-VL-A3B-Thinking: 144 | 145 | ```python 146 | import torch 147 | from PIL import Image 148 | from transformers import AutoModelForCausalLM, AutoProcessor 149 | 150 | model_path = "moonshotai/Kimi-VL-A3B-Thinking" 151 | model = AutoModelForCausalLM.from_pretrained( 152 | model_path, 153 | torch_dtype="auto", 154 | device_map="auto", 155 | trust_remote_code=True, 156 | ) 157 | # If flash-attn has been installed, it is recommended to set torch_dtype=torch.bfloat16 and attn_implementation="flash_attention_2" 158 | # to save memory and speed up inference 159 | # model = AutoModelForCausalLM.from_pretrained( 160 | # model_path, 161 | # torch_dtype=torch.bfloat16, 162 | # device_map="auto", 163 | # trust_remote_code=True, 164 | # attn_implementation="flash_attention_2" 165 | # ) 166 | processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True) 167 | 168 | image_paths = ["./figures/demo1.png", "./figures/demo2.png"] 169 | images = [Image.open(path) for path in image_paths] 170 | messages = [ 171 | { 172 | "role": "user", 173 | "content": [ 174 | {"type": "image", "image": image_path} for image_path in image_paths 175 | ] + [{"type": 
"text", "text": "Please infer step by step who this manuscript belongs to and what it records"}], 176 | }, 177 | ] 178 | text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt") 179 | inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device) 180 | generated_ids = model.generate(**inputs, max_new_tokens=2048) 181 | generated_ids_trimmed = [ 182 | out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) 183 | ] 184 | response = processor.batch_decode( 185 | generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False 186 | )[0] 187 | print(response) 188 | ``` 189 | 190 | ## 7. Finetuning 191 | 192 | Collaborating closely with the open-source community, Kimi-VL now offers seamless support for efficient fine-tuning through the latest version of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). 193 | 194 | The framework enables Single-GPU LoRA fine-tuning with 50GB of VRAM, as well as Multi-GPU full/lora fine-tuning using DeepSpeed ZeRO-2. For more detailed configuration instructions, check out [this PR](https://github.com/hiyouga/LLaMA-Factory/pull/7719#issue-2992644288). 195 | 196 | ## 8. Deployment 197 | 198 | ### Using vLLM 199 | 200 | The [vLLM main branch](https://github.com/vllm-project/vllm) has supported Kimi-VL deployment. You are welcome to deploy Kimi-VL using vLLM. 201 | 202 | #### Offline Inference 203 | 204 | > [!Note] 205 | > More usages about `Offline Inference` can be found at [vLLM Offline Inference](https://docs.vllm.ai/en/latest/serving/offline_inference.html). 206 | 207 | ```python 208 | from PIL import Image 209 | from transformers import AutoProcessor 210 | from vllm import LLM, SamplingParams 211 | 212 | model_path = "moonshotai/Kimi-VL-A3B-Instruct" # or "moonshotai/Kimi-VL-A3B-Thinking" 213 | llm = LLM( 214 | model_path, 215 | trust_remote_code=True, 216 | ) 217 | 218 | processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True) 219 | 220 | image_path = "./figures/demo.png" 221 | image = Image.open(image_path) 222 | messages = [ 223 | {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]} 224 | ] 225 | text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt") 226 | outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params = SamplingParams(max_tokens=512)) 227 | 228 | print("-" * 50) 229 | for o in outputs: 230 | generated_text = o.outputs[0].text 231 | print(generated_text) 232 | print("-" * 50) 233 | ``` 234 | 235 | #### OpenAI-Compatible Server 236 | 237 | > [!Note] 238 | > More usages about `OpenAI-Compatible Server` can be found at [vLLM OpenAI-Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#). 
239 | 240 | Serve Kimi-VL with `vllm serve` command: 241 | 242 | ```bash 243 | # If you need a longer context window, you can set --max-model-len and --max-num-batched-tokens to 131072 244 | vllm serve moonshotai/Kimi-VL-A3B-Instruct --served-model-name kimi-vl --trust-remote-code --tensor-parallel-size 1 --max-num-batched-tokens 32768 --max-model-len 32768 --limit-mm-per-prompt image=8 245 | ``` 246 | 247 | Call the API 248 | 249 | ```python 250 | import base64 251 | from PIL import Image 252 | from io import BytesIO 253 | from openai import OpenAI 254 | 255 | client = OpenAI( 256 | base_url="http://localhost:8000/v1", 257 | api_key="token-abc123", 258 | ) 259 | 260 | image_path = "./figures/demo.png" 261 | image = Image.open(image_path).convert("RGB") 262 | 263 | buffered = BytesIO() 264 | image.save(buffered, format="JPEG") 265 | img_b64_str = base64.b64encode(buffered.getvalue()).decode("utf-8") 266 | base64_image_url = f"data:image/jpeg;base64,{img_b64_str}" 267 | 268 | messages = [ 269 | {"role": "user", "content": [{"type": "image_url", "image_url": {"url": base64_image_url}}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]} 270 | ] 271 | 272 | completion = client.chat.completions.create( 273 | model="kimi-vl", 274 | messages=messages 275 | ) 276 | 277 | print(completion.choices[0].message) 278 | ``` 279 | 280 | ## 9. Citation 281 | 282 | ``` 283 | @misc{kimiteam2025kimivltechnicalreport, 284 | title={{Kimi-VL} Technical Report}, 285 | author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. 
Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen}, 286 | year={2025}, 287 | eprint={2504.07491}, 288 | archivePrefix={arXiv}, 289 | primaryClass={cs.CV}, 290 | url={https://arxiv.org/abs/2504.07491}, 291 | } 292 | ``` 293 | 294 | -------------------------------------------------------------------------------- /figures/arch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/figures/arch.png -------------------------------------------------------------------------------- /figures/demo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/figures/demo.png -------------------------------------------------------------------------------- /figures/demo1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/figures/demo1.png -------------------------------------------------------------------------------- /figures/demo2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/figures/demo2.png -------------------------------------------------------------------------------- /figures/instruct_perf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/figures/instruct_perf.png -------------------------------------------------------------------------------- /figures/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/figures/logo.png -------------------------------------------------------------------------------- /figures/thinking_perf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MoonshotAI/Kimi-VL/7c391c74d71c92394b6a63818aa107ae08b947fd/figures/thinking_perf.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch==2.5.1 2 | torchvision==0.20.1 3 | transformers==4.51.3 4 | pillow 5 | tiktoken 6 | accelerate 7 | blobfile 8 | openai 9 | --------------------------------------------------------------------------------