├── AI4Science └── AI4Math.md ├── Agents └── Agent.md ├── Audio └── Audio.md ├── LICENSE ├── Omni └── omni.md ├── README.md ├── Robotic └── Robotic.md ├── Vision ├── 3D.md ├── Ego.md ├── Image.md ├── Video.md └── VisionEncoder.md └── assets └── logo.png /AI4Science/AI4Math.md: -------------------------------------------------------------------------------- 1 | # AI for Mathematics 2 | 3 | Table of Contents 4 | 5 | - [Reading List](#reading-list) 6 | 7 | ## Reading List 8 | 9 | | Paper | Base Language Model | Code | Publication | Preprint | Affiliation | 10 | | ------------------------------------------------------------------------ | ------------------- | ------------------------------------------- | ----------------- | ------------------------------------------- | ----------- | 11 | | Solving olympiad geometry without human demonstrations | Transformer-style (151M) | [AlphaGeometry](https://github.com/google-deepmind/alphageometry) | Nature | [2401.blog](https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/) | DeepMind | 12 | | G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model | LLaMA (LLaVA) | [G-LLaVA](https://github.com/pipilurj/G-LLaVA) | | [2312.11370](https://arxiv.org/abs/2312.11370) | HUAWEI | 13 | | ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | GPT4, LLaMA2, etc | [ToRA](https://github.com/microsoft/ToRA) | | [2309.17452](https://arxiv.org/abs/2309.17452) | Microsoft | 14 | 15 | - Datasets and Benchmarks 16 | 17 | Datasets 18 | 19 | - [MathPile](https://github.com/GAIR-NLP/MathPile), High-quality, large-scale corpora are the cornerstone of building powerful foundation models. In this work, we introduce MathPile a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. 
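Most of the corpora and benchmarks in this section are distributed via GitHub or the Hugging Face Hub. As a minimal sketch of how the benchmarks listed below are typically consumed (assuming the public `openai/gsm8k` mirror on the Hub and its `question`/`answer` fields, which this list itself does not specify), loading GSM8K and splitting off the gold answer looks roughly like this:

```python
# Minimal sketch (not part of this repo): load the GSM8K benchmark listed below.
# Assumes the public "openai/gsm8k" Hugging Face Hub dataset with its "main"
# config and "question"/"answer" fields.
from datasets import load_dataset

gsm8k = load_dataset("openai/gsm8k", "main", split="test")

example = gsm8k[0]
print(example["question"])

# GSM8K solutions end with "#### <final number>", so the gold label can be split off:
gold = example["answer"].split("####")[-1].strip()
print(gold)
```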
20 | 21 | Benchmarks 22 | 23 | - [GSM8K](https://github.com/openai/grade-school-math) | [paper](https://arxiv.org/abs/2110.14168) | [blog](https://openai.com/blog/grade-school-math/), a dataset of 8.5K high quality linguistically diverse grade school math word problems, *by OpenAI, 2021* 24 | - [MATH](https://github.com/hendrycks/math) | [paper (NeurIPS 2021)](https://arxiv.org/abs/2103.03874), Hard mathematics problems, 12k problems within 7 categories, very hard math and natural science, *by UCB, 2021* 25 | - [TheoremQA](https://github.com/wenhuchen/TheoremQA) | [paper](https://arxiv.org/abs/2305.12524), Hard mathematics problems, 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance, *by UWaterloo, 2023* 26 | -------------------------------------------------------------------------------- /Agents/Agent.md: -------------------------------------------------------------------------------- 1 | # Agents 2 | 3 | *This part records LLMs as Agents: Brain, Perception, Tools* 4 | 5 | Table of Contents 6 | 7 | - [Reading List](#reading-list) 8 | - [Dataset & Benchmarks](#datasets--benchmarks) 9 | - [Projects](#projects) 10 | - [Applications](#applications) 11 | 12 | ## Reading List 13 | 14 | Technical Report 15 | - (2024-12) Genie 2: A large-scale foundation world model, [Genie2](https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/) 16 | 17 | Survey 18 | 19 | - (2023-09) The Rise and Potential of Large Language Model Based Agents: A Survey [paper](https://arxiv.org/abs/2309.07864) 20 | - (2023-08) Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [paper](https://arxiv.org/abs/2308.03188) 21 | - (2023-06) LLM Powered Autonomous Agents [blog](https://lilianweng.github.io/posts/2023-06-23-agent/) 22 | - (2023-04) Tool Learning with Foundation Models [paper](https://arxiv.org/abs/2304.08354), [BMTools](https://github.com/OpenBMB/BMTools) 23 | - (2023-02) Augmented Language Models: a Survey [paper](https://arxiv.org/abs/2302.07842) 24 | 25 | Reading List 26 | 27 | *`Correct` stands work that focus on correction of ALM* 28 | 29 | | Paper | LLM | Code | Publication | Preprint | Affiliation | 30 | | -------------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------------------------------------------------------------------------ | ------------ | ----------------------------------------------------------------------------------------------- | --------------- | 31 | | MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning | LLaMA | [MLLM-Tool](https://github.com/MLLM-Tool/MLLM-Tool) | | [2401.10727](https://arxiv.org/pdf/2401.10727) | ShanghaiTech | 32 | | LLM Augmented LLMs: Expanding Capabilities through Composition | PaLM2 | [CALM (non-official)](https://github.com/lucidrains/CALM-pytorch) | | [2401.02412](https://arxiv.org/abs/2401.02412) | Google | 33 | | GPT-4V(ision) is a Generalist Web Agent, if Grounded | GPT-4 | [SeeAct](https://github.com/OSU-NLP-Group/SeeAct) | | [2312.github](https://github.com/OSU-NLP-Group/SeeAct/blob/gh-pages/static/paper/SeeAct_Paper.pdf) | OSU | 34 | | AppAgent: Multimodal Agents as Smartphone Users | GPT4 | [AppAgent](https://github.com/mnotgod96/AppAgent) | | [2312.13771](https://arxiv.org/abs/2312.13771) | Tencent | 35 | | An LLM Compiler for Parallel Function Calling | GPT, LLaMA2 | [LLMCompiler](https://github.com/SqueezeAILab/LLMCompiler) | | 
[2312.04511](https://arxiv.org/abs/2312.04511) | UCB | 36 | | ProAgent: From Robotic Process Automation to Agentic Process Automation | GPT-4 | [ProAgent](https://github.com/OpenBMB/ProAgent) | | [2311.10751](https://arxiv.org/abs/2311.10751) | THU | 37 | | ControlLLM: Augment Language Models with Tools by Searching on Graphs | ChatGPT, LLaMA | [ControlLLM](https://github.com/OpenGVLab/ControlLLM) | | [2310.17796](https://arxiv.org/abs/2310.17796) | Shanghai AI Lab | 38 | | AgentTuning: Enabling Generalized Agent Abilities For LLMs | LLaMA2 | [AgentTuning](https://github.com/THUDM/AgentTuning) | | [2310.12823](https://arxiv.org/abs/2310.12823) | THU | 39 | | AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation | | [AutoGen](https://github.com/microsoft/autogen) | | [2308.08155](https://arxiv.org/abs/2308.08155) | Microsoft | 40 | | AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | ChatGPT | [AssistGPT](https://github.com/COOORN/AssistGPT) | | [2306.08640](https://arxiv.org/abs/2306.08640) | NUS | 41 | | ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases | Alpaca | [ToolAlpaca](https://github.com/tangqiaoyu/ToolAlpaca) | | [2306.05301](https://arxiv.org/abs/2306.05301) | CAS | 42 | | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Vicuna-13B | [GPT4Tools](https://github.com/StevenGrove/GPT4Tools) | | [2305.18752](https://arxiv.org/abs/2305.18752) | Tencent | 43 | | AdaPlanner: Adaptive Planning from Feedback with Language Models | GPT3/3.5 | [AdaPlanner](https://github.com/haotiansun14/AdaPlanner) | | [2305.16653](https://arxiv.org/abs/2305.16653) | Gatech | 44 | | `Correct` CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing | GPT3/3.5 | [CRITIC](https://github.com/microsoft/ProphetNet/tree/master/CRITIC) | | [2305.11738](https://arxiv.org/abs/2305.11738) | Microsoft | 45 | | ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings | LLaMA | [ToolkenGPT](https://github.com/Ber666/ToolkenGPT) | Neurips 2023 | [2305.11554](https://arxiv.org/abs/2305.11554) | UCSD | 46 | | LLM+P: Empowering Large Language Models with Optimal Planning Proficiency | GPT4 | [LLM-PDDL](https://github.com/Cranial-XIX/llm-pddl) | | [2304.11477](https://arxiv.org/abs/2304.11477) | UTEXAS | 47 | | Can GPT-4 Perform Neural Architecture Search? | GPT4 | [GENIUS](https://github.com/mingkai-zheng/GENIUS) | | [2304.10970](https://arxiv.org/abs/2304.10970) | Cambridge | 48 | | Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | GPT4 | [Chameleon](https://github.com/lupantech/chameleon-llm) | | [2304.09842](https://arxiv.org/abs/2304.09842) | Microsoft | 49 | | OpenAGI: When LLM Meets Domain Experts | ChatGPT | [OpenAGI](https://github.com/agiresearch/OpenAGI) | | [2304.04370](https://arxiv.org/abs/2304.04370) | Rutgers Univ. 
| 50 | | LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models | Opensource LLM | [LLM-Adapters](https://github.com/AGI-Edgerunners/LLM-Adapters) | | [2304.01933](https://arxiv.org/abs/2304.01933) | SMU | 51 | | HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | ChatGPT | [JARVIS](https://github.com/microsoft/JARVIS) | | [2303.17580](https://arxiv.org/abs/2303.17580) | Microsoft | 52 | | Language Models can Solve Computer Tasks | ChatGPT, GPT3, etc | [RCI Agent](https://github.com/posgnu/rci-agent) | | [2303.17491](https://arxiv.org/abs/2303.17491) | CMU | 53 | | TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs | ChatGPT | [TaskMatrix](https://github.com/microsoft/visual-chatgpt/tree/main/TaskMatrix.AI) | | [2303.16434](https://arxiv.org/abs/2303.16434) | Microsoft | 54 | | MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ChatGPT | [MM-REACT](https://github.com/microsoft/MM-REACT) | | [2303.11381](https://arxiv.org/abs/2303.11381) | Microsoft | 55 | | ART: Automatic multi-step reasoning and tool-use for large language models | GPT3, Codex | [Language-Programmes](https://github.com/bhargaviparanjape/language-programmes) | | [2303.09014](https://arxiv.org/abs/2303.09014) | Microsoft | 56 | | Foundation Models for Decision Making: Problems, Methods, and Opportunities | - | - | | [2303.04129](https://arxiv.org/abs/2303.04129) | Google | 57 | | `Correct` Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback | ChatGPT | [LLM-Augmenter](https://github.com/pengbaolin/LLM-Augmenter) | | [2302.12813](https://arxiv.org/abs/2302.12813) | Microsoft | 58 | | Toolformer: Language Models Can Teach Themselves to Use Tools | GPT-J, OPT, GPT3 | [Toolformer (Unofficial)](https://github.com/lucidrains/toolformer-pytorch) | | [2302.04761](https://arxiv.org/abs/2302.04761) | Meta | 59 | | Visual Programming: Compositional visual reasoning without training | GPT3 | [VisProg](https://github.com/allenai/visprog) | CVPR 2023 | [2211.11559](https://arxiv.org/abs/2211.11559) | AI2 | 60 | | Decomposed Prompting: A Modular Approach for Solving Complex Tasks | GPT3 | [DecomP](https://github.com/allenai/DecomP) | ICLR 2023 | [2210.02406](https://arxiv.org/abs/2210.02406) | AI2 | 61 | | TALM: Tool Augmented Language Models | T5 | | | [2205.12255](https://arxiv.org/abs/2205.12255) | Google | 62 | 63 | ## Datasets & Benchmarks 64 | 65 | Datasets 66 | 67 | - (2023.07) [ToolBench](https://github.com/OpenBMB/ToolBench), This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. 
68 | 69 | Benchmarks 70 | 71 | - (2024.01) [R-Judge](https://github.com/Lordog/R-Judge), R-Judge: Benchmarking Safety Risk Awareness for LLM Agents 72 | - (2024.01) [RoTBench](https://github.com/Junjie-Ye/RoTBench), RoTBench consists of five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model's resilience across three critical phases: tool selection, parameter identification, and content filling 73 | - (2024.01) [ToolEyes](https://github.com/Junjie-Ye/ToolEyes), The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. 74 | - (2024.01) [AgentBoard](https://github.com/hkust-nlp/AgentBoard),An Analytical Evaluation Board of Multi-turn LLM Agents 75 | - (2023.10) [PCA-EVAL](https://github.com/pkunlp-icler/PCA-EVAL), PCA-EVAL is an innovative benchmark for evaluating multi-domain embodied decision-making, specifically focusing on the performance in perception, cognition, and action 76 | - (2023.08) [AgentBench](https://github.com/THUDM/AgentBench), AgentBench is the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum of different environments. It encompasses 8 distinct environments to provide a more comprehensive evaluation of the LLMs' ability to operate as autonomous agents in various scenarios 77 | 78 | ## Projects 79 | 80 | - (2023-12) [KwaiAgents](https://github.com/KwaiKEG/KwaiAgents), A generalized information-seeking agent system with Large Language Models (LLMs). 81 | - (2023-11) [CrewAI](https://github.com/joaomdmoura/crewAI), Cutting-edge framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks. 
82 | - (2023-11) [WarAgent](https://github.com/agiresearch/WarAgent), Large Language Model-based Multi-Agent Simulation of World Wars 83 | - (2023-08) [MetaGPT](https://github.com/geekan/MetaGPT), The Multi-Agent Framework: Given one line Requirement, return PRD, Design, Tasks, Repo 84 | - (2023-08) [AI-Town](https://github.com/a16z-infra/ai-town), AI Town is a virtual town where AI characters live, chat and socialize 85 | - (2023-08) [XLang](https://github.com/xlang-ai/xlang), An open-source framework for building and evaluating language model agents via executable language grounding 86 | - [homepage](https://www.xlang.ai/) 87 | - (2023-06) [Gentopia](https://github.com/Gentopia-AI/Gentopia), Gentopia is a lightweight and extensible framework for LLM-driven Agents and [ALM](https://arxiv.org/abs/2302.07842) research 88 | - [paper](https://arxiv.org/abs/2308.04030) 89 | - (2023-05) [Tranformers Agent](https://huggingface.co/docs/transformers/en/transformers_agents), Transformers Agent is an experimental API which is subject to change at any time 90 | - (2023-05) [闻达](https://github.com/l15y/wenda), 一个LLM调用平台。目标为针对特定环境的高效内容生成,同时考虑个人和中小企业的计算资源局限性,以及知识安全和私密性问题 91 | - (2023-04) [AgentGPT](https://github.com/reworkd/AgentGPT), AgentGPT allows you to configure and deploy Autonomous AI agents 92 | - [demo](https://agentgpt.reworkd.ai/) 93 | - (2023-04) [Auto-GPT](https://github.com/Torantulino/Auto-GPT), An experimental open-source attempt to make GPT-4 fully autonomous 94 | - [demo](https://agpt.co/) 95 | - (2023-04) [BabyAGI](https://github.com/yoheinakajima/babyagi), The system uses OpenAI and vector databases such as Chroma or Weaviate to create, prioritize, and execute tasks 96 | - [blog](https://twitter.com/yoheinakajima/status/1640934493489070080?s=20) 97 | - (2022-10) [LangChain](https://github.com/langchain-ai/langchain), LangChain is a framework for developing applications powered by language models. It enables applications that: **(i) Are context-aware**: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.) **(ii) Reason**: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.) 98 | - [Doc](https://python.langchain.com/docs/get_started/introduction) 99 | 100 | ## Applications 101 | 102 | - (2023-11) [Awesome-AI-GPTs](https://github.com/EmbraceAGI/Awesome-AI-GPTs), 欢迎来到 EmbraceAGI GPTs 开源目录,本项目收录了 OpenAI GPTs 的相关资源和有趣玩法,让我们一起因 AI 而强大 103 | - (2023-04) [Awesome-ChatGPT](https://github.com/awesome-chatgpt/awesome-chatgpt), An awe-inspiring collection of resources, encompassing a wide range of tools, documents, resources, applications, and use cases related to ChatGPT. 104 | - (2023-04) [众评AI](https://www.zhongpingtechnology.com/#), 全球AI网站排行榜展示了人工智能领域最顶尖的1800+个网站,排行榜每日更新 105 | - (2023-03) [ChatPaper](https://github.com/kaixindelele/ChatPaper), Use ChatGPT to summarize the arXiv papers. 全流程加速科研,利用chatgpt进行论文总结+润色+审稿+审稿回复 106 | - [website](https://chatwithpaper.org/) 107 | - Related: [ChatReviewer](https://github.com/nishiwen1214/ChatReviewer) 108 | - (2023-03) [BibiGPT](https://github.com/JimmyLv/BibiGPT), BibiGPT · 1-Click AI Summary for Audio/Video & Chat with Learning Content: Bilibili | YouTube | Tweet丨TikTok丨Local files | Websites丨Podcasts | Meetings | Lectures, etc. 
音视频内容 AI 一键总结 & 对话:哔哩哔哩丨YouTube丨推特丨小红书丨抖音丨网页丨播客丨会议丨本地文件等 (原 BiliGPT 省流神器 & 课代表) 109 | - [website](https://bibigpt.co/) 110 | -------------------------------------------------------------------------------- /Audio/Audio.md: -------------------------------------------------------------------------------- 1 | # Audio 2 | 3 | ## Reading List 4 | 5 | - [2024-08] Language Model Can Listen While Speaking, [arxiv](https://arxiv.org/pdf/2408.02622) | #listen-while-speak 6 | - [2024-06] UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner, [arxiv](https://arxiv.org/pdf/2406.10056) | #llama2 #audio-word-mapping -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Yuxuan Wang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Omni/omni.md: -------------------------------------------------------------------------------- 1 | # Ominous 2 | 3 | ## Reading List 4 | 5 | Papers 6 | 7 | #Interaction+Generation: 8 | 9 | | Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation | 10 | | --------------------------------------------- | ---------- | --------- | ---- | ---- | ----------- | --------------------------------------------------------------------------------------------- | ----------- | 11 | | Genie 2: A large-scale foundation world model | | | | | | [genie2](https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/) | DeepMind | 12 | 13 | #Multimodal #End2end Understanding+Generation: 14 | 15 | | Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation | 16 | | ----------------------------------------------------------------------------------------------------------------------------------- | ----------------------- | ------------------------------------------------ | --------------------------------------------------------- | --------------------------------------------------------------- | ----------- | -------------------------------------------------- | --------------- | 17 | | BAGEL: Emerging Properties in Unified Multimodal Pretraining | Qwen2.5-7B + SigLip | PT + FT (MoT) | mixture (image, video, web) | [BAGEL](https://github.com/ByteDance-Seed/Bagel) | | [2505.14683](https://arxiv.org/abs/2505.14683) | ByteDance | 18 | | UniTok: A Unified Tokenizer for Visual Generation and Understanding | Llama-2-7B | TOKEN | mixture (image) | [UniTok](https://github.com/FoundationVision/UniTok) | | [2502.20321](https://arxiv.org/abs/2502.20321) | Bytedance | 19 | | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | DeepSeek-LLM + LLaMAGen | PT + FT (NLL) | mixture (image) | [Janus](https://github.com/deepseek-ai/Janus) | | [2501.17811](https://arxiv.org/abs/2501.17811) | DeepSeek | 20 | | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | LLaMA-3 + SD | PT + FT (NLL + cosine sim loss) | mixture (image) | | | [2412.14164](https://arxiv.org/abs/2412.14164v1) | Meta | 21 | | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | LLaVA-1.5 + TokenFlow | PT + FT | llava/cambrian (image) | [TokenFlow](https://github.com/ByteFlow-AI/TokenFlow) | | [2412.03069](https://arxiv.org/abs/2412.03069) | ByteDance | 22 | | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | LLaMA3 | PT + FT | mixture (image, speech) | [emova](https://emova-ollm.github.io/) | | [2409.18042](https://arxiv.org/abs/2409.18042) | Huawei | 23 | | MIO: A Foundation Model on Multimodal Tokens | Yi-6B-Base | PT + FT | mixture (image, speech, audio) | [MIO](https://github.com/mio-team/mio) | | [2409.17692](https://arxiv.org/abs/2409.17692) | HKUST & 01 | 24 | | VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | LLaMA2 + RQ-VAE | PT + FT (NLL) | mixture (image) | [vila-u](https://github.com/mit-han-lab/vila-u) | ICLR 2025 | [2409.04429](https://arxiv.org/abs/2409.04429) | MIT | 25 | | One Single Transformer to Unify Multimodal Understanding and Generation | Phi + MagViT2 | PT + FT (NLL + MAE-loss) | mixture (image) | [Show-o](https://github.com/showlab/Show-o) | | [2408.12528](https://arxiv.org/abs/2408.12528) | NUS | 26 | | Transfusion: Predict the 
Next Token and Diffuse Images with One Multi-Modal Model | - (transfusion) | PT (NLL-loss + DDPM-loss) | self-collect (image) | | | [2408.11039](https://www.arxiv.org/abs/2408.11039) | Meta | 27 | | Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation | chameleon | INST (interleaved) | mixture (image) | [anole](https://github.com/GAIR-NLP/anole) | | [2407.06135](https://arxiv.org/abs/2407.06135) | SJTU | 28 | | Explore the Limits of Omni-modal Pretraining at Scale | vicuna | PT + INST | mixture (image, video, audio, depth -> text) | [MiCo](https://github.com/invictus717/MiCo) | | [2406.09412](https://arxiv.org/pdf/2406.09412) | Shanghai AI Lab | 29 | | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | | | | | | | | 30 | | X-VILA: Cross-Modality Alignment for Large Language Model | vicuna + SD | INST + Diffusion Decoder | mixture (image, video, audio) | | | [2405.19335](https://arxiv.org/abs/2405.19335) | NVIDIA | 31 | | Chameleon: Mixed-Modal Early-Fusion Foundation Models | - (chameleon) | PT + FT (AR + image detokenizer) | mixture (image) | [chameleon](https://github.com/facebookresearch/chameleon) | | [2405.09818](https://arxiv.org/abs/2405.09818) | Meta | 32 | | SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | [SEED-X](https://github.com/AILab-CVC/SEED-X) | | [2404.14396](https://arxiv.org/abs/2404.14396) | Tencent | 33 | | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | LLaMA2 + SD | INST + NAR-decoder | mixture (image, speech, music) | [AnyGPT](https://github.com/OpenMOSS/AnyGPT) | | [2402.12226](https://arxiv.org/abs/2402.12226) | FDU | 34 | | World Model on Million-Length Video And Language With Blockwise RingAttention | LLaMA + VQGAN (LWM) | PT (long-context) | mixture (image, video) | [LWM](https://github.com/LargeWorldModel/LWM) | | [2402.08268](https://arxiv.org/abs/2402.08268) | UCB | 35 | | MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | Vicuna + SD | PT + INST | mixture (image) | [MM-Interleaved](https://github.com/OpenGVLab/MM-Interleaved) | | [2401.10208](https://arxiv.org/abs/2401.10208) | Shanghai AI Lab | 36 | | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | T5X + VQGAN | PT + INST | mixture (image, audio, video, 3d) | [unified-io-2](https://github.com/allenai/unified-io-2) | | [2312.17172](https://arxiv.org/abs/2312.17172) | AI2 | 37 | | VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA + SD | PT + INST (interleaved) | mixture (image) | [VL-GPT](https://github.com/AILab-CVC/VL-GPT) | | [2312.09251](https://arxiv.org/abs/2312.09251) | Tencent | 38 | | OneLLM: One Framework to Align All Modalities with Language | LLaMA2 | PT + INST (universal encoder + moe projector) | mixture (image, audio, point, depth, IMU, fMRI -> text) | [OneLLM](https://github.com/csuhan/OneLLM) | CVPR2024 | [2312.03700](https://arxiv.org/abs/2312.03700) | Shanghai AI lab | 39 | | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - | INST | mixture (video, infrared, depth, audio -> text) | [LanguageBind](https://github.com/PKU-YuanGroup/LanguageBind) | ICLR2024 | [2310.01852](https://arxiv.org/abs/2310.01852) | PKU | 40 | | DreamLLM: Synergistic Multimodal Comprehension and Creation | Vicuna + SD | PT + INST 
with projector (interleaved) | mixture (image) | [DreamLLM](https://github.com/RunpeiDong/DreamLLM) | ICLR2024 | [2309.11499](https://arxiv.org/abs/2309.11499) | MEGVII | 41 | | NExT-GPT: Any-to-Any Multimodal LLM | Vicuna + SD | INST with projector | mixture (text -> audio/image/video) | [NExT-GPT](https://github.com/NExT-GPT/NExT-GPT) | ICML2024 | [2309.05519](https://arxiv.org/abs/2309.05519) | NUS | 42 | | LaVIT: Empower the Large Language Model to Understand and Generate Visual Content,[video version](https://arxiv.org/abs/2402.03161) | LLaMA  + SD | PT + INST (vector quantization: CE + regression) | mixture (image) | [LaVIT](https://github.com/jy0205/LaVIT) | ICLR2024 | [2309.04669](https://arxiv.org/abs/2309.04669) | Kuaishou | 43 | | Emu: Generative Pretraining in Multimodality,[v2](https://arxiv.org/abs/2312.13286),[v3](https://github.com/baaivision/Emu3) | LLaMA + SD | PT (AR: CE + regression ) | mixture (image) | [Emu](https://github.com/baaivision/Emu) | ICLR2024 | [2307.05222](https://arxiv.org/abs/2307.05222) | BAAI | 44 | | Any-to-Any Generation via Composable Diffusion | SD-1.5 | individual diffusion -> latent attention | mixture (text -> audio/image/video; image -> audio/video) | [CoDi](https://github.com/microsoft/i-Code/tree/main/i-Code-V3) | NeurIPS2023 | [2305.11846](https://arxiv.org/abs/2305.11846) | Microsoft | 45 | | ImageBind: One Embedding Space To Bind Them All | CLIP | Contrastive + Diffusion Decoder | mixture(image, video, audio, depth) | [ImageBind](https://github.com/facebookresearch/ImageBind) | | [2305.05665](https://arxiv.org/abs/2305.05665) | Meta | 46 | 47 | #Streaming #Real-Time #Online 48 | 49 | | Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation | 50 | | -------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | ------------------------------------------------------------ | --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------- | 51 | | LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis | Qwen2.5 | speech-to-speech | self-construct(InstructS2S-200K-Multi-turn) | [LLaMA-Omni2](https://github.com/ictnlp/LLaMA-Omni2) | | [2505.02625](https://arxiv.org/abs/2505.02625) | CAS | 52 | | Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment | Qwen2.5 | VideoLLM + Streaming TTS | mixture | [Ola](https://github.com/Ola-Omni/Ola) | | [2502.04328](https://arxiv.org/abs/2502.04328) | Tencent | 53 | | MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone | MiniCPM | chunk-wise for streaming encode and decode + mask | self-construct | [MiniCPM-o](https://github.com/OpenBMB/MiniCPM-o) | | [2501.notion](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9) | OpenBMB | 54 | | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | Qwen2-1.8B | audio instruction + memory compression | self-construct | 
[InternLM-XComposer-2.5-OmniLive](https://github.com/InternLM/InternLM-XComposer/tree/main/InternLM-XComposer-2.5-OmniLive) | | [2412.09596](https://arxiv.org/abs/2412.09596) | Shanghai AI lab | 55 | | StreamChat: Chatting with Streaming Video | Qwen2.5 | kv-cache for streaming generation | self-construct | | | [2412.08646](https://arxiv.org/abs/2412.08646) | CUHK | 56 | | SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | Vicuna | Streaming Encoder + Streaming TTS | mixture | | | [2411.18138](https://arxiv.org/abs/2411.18138) | ByteDance | 57 | | VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | llava-ov | grounding head | self-construct | [MMDuet](https://github.com/yellow-binary-tree/MMDuet) | | [2411.17991](https://arxiv.org/abs/2411.17991) | PKU | 58 | | Westlake-Omni: Open-Source Chinese Emotional Speech Interaction Large Language Model with Unified Discrete Sequence Modeling | qwen2 | LLM + AR Decoder | self-construct (Chinese) | [Westlake-Omni](https://github.com/xinchen-ai/Westlake-Omni) | | | xinchen-ai | 59 | | Moshi: a speech-text foundation model for real time dialogue | Helium-7B | RQ-Tansformer | self-construct (7m hr (pt) + 2k hr (inst) + 160 hr (tts)) | [moshi](https://github.com/kyutai-labs/moshi/tree/main/moshi) | | [2409.pdf](https://kyutai.org/Moshi.pdf) | kyutai | 60 | | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | LLaMA3 | speech-to-speech | self-construct(InstructS2S-200K) | [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni) | | [2409.06666](https://arxiv.org/abs/2409.06666) | CAS | 61 | | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming,[v2](https://github.com/gpt-omni/mini-omni2) | Qwen2 | audio generation with text instruction + parallel generation | self-construct (VoiceAssistant-400K) | [mini-omni](https://github.com/gpt-omni/mini-omni) | | [2408.16725](https://arxiv.org/abs/2408.16725) | THU | 62 | | VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges | Vicuna | scene segment + recurrent | videochat2 | [VideoLLaMB](https://github.com/bigai-nlco/VideoLLaMB) | | [2409.01071](https://arxiv.org/abs/2409.01071) | BIGAI | 63 | | VITA: Towards Open-Source Interactive Omni Multimodal LLM,[v1.5](https://github.com/VITA-MLLM/VITA?tab=readme-ov-file#-whats-new-in-vita-15) | Mixtral-8x7B | special tokens (<1>: audio; <2>: EOS; <3> text) | mixture | [VITA](https://github.com/VITA-MLLM/VITA) | | [2408.05211](https://arxiv.org/abs/2408.05211) | Tencent | 64 | | VideoLLM-online: Online Large Language Model for Streaming Video | Llama2/3 | Multi-turn dialogue + streaming loss | Ego4D | [videollm-online](https://github.com/showlab/videollm-online) | | [2406.11816](https://arxiv.org/abs/2406.11816) | NUS | 65 | | RT-DETR: DETRs Beat YOLOs on Real-time Object Detection | Dino + DETR | anchor-free | COCO | [RT-DETR](https://github.com/lyuwenyu/RT-DETR) | | [2304.08069](https://arxiv.org/abs/2304.08069) | Baidu | 66 | | Streaming Dense Video Captioning | GIT/VidSeq + T5 | cluster visual token (memory) | | [streaming_dvc](https://github.com/google-research/scenic/tree/main/scenic/projects/streaming_dvc) | CVPR2024 | [2304.08069](https://arxiv.org/abs/2404.01297) | Google | 67 | | Deformable DETR: Deformable Transformers for End-to-End Object Detection | ResNet+DETR | deformable-attention | COCO | [Deformable-DETR](https://github.com/fundamentalvision/Deformable-DETR) | ICLR2021 | 
[2010.04159](https://arxiv.org/abs/2010.04159) | SenseTime | 68 | 69 | #Interactive #Duplex 70 | 71 | | Paper | Base Model | Framework | Data | Code | Publication | Preprint | Affiliation | 72 | | ------------------------------------------------------------------------------- | ---------- | ------------------------ | -------------------------- | ------------------------------------------------------ | ----------- | ---------------------------------------------- | ----------- | 73 | | Enabling Real-Time Conversations with Minimal Training Costs | MiniCPM | AR + special token | self-curation (Ultra-Chat) | | | [2409.11727](https://arxiv.org/abs/2409.11727) | HiT | 74 | | Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models | MiniCPM | AR + time-slice `` | self-curation (Ultra-Chat) | [duplex-model](https://github.com/thunlp/duplex-model) | | [2406.15718](https://arxiv.org/abs/2406.15718) | thunlp | 75 | 76 | Projects: 77 | 78 | - [2024.09] [Open-Training-Moshi](https://github.com/yangdongchao/Open-Training-Moshi), The reproduce training process for Moshi 79 | - [2024.07] [SAM2](https://github.com/facebookresearch/segment-anything-2), Introducing Meta Segment Anything Model 2 (SAM 2) 80 | - [2024.08] [segment-anything-2-real-time](https://github.com/Gy920/segment-anything-2-real-time), Run Segment Anything Model 2 on a live video stream 81 | - [2024.06] [LLaVA-Magvit2](https://github.com/lucasjinreal/LLaVA-Magvit2), LLaVA MagVit2: Combines MLLM Understanding and Generation with MagVit2 82 | - [2024.05] [GPT-4o system card](https://openai.com/index/hello-gpt-4o/), We're announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time. 83 | 84 | ## Dataset 85 | 86 | #omininou-modality 87 | 88 | - [2024.06] [ShareGPT4Omni Dataset](https://sharegpt4omni.github.io/), ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with Comprehensive Multi-modal Annotations. 89 | 90 | #streaming-data 91 | 92 | - [2024.06] VideoLLM-online: Online Large Language Model for Streaming Video 93 | - [2024.05] Streaming Long Video Understanding with Large Language Models 94 | 95 | ## Benchmark 96 | 97 | #streaming 98 | - [2025.03] [OmniMMI](https://github.com/OmniMMI/OmniMMI), A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts 99 | - [2025.01] [OVO-Bench](https://github.com/JoeLeelyf/OVO-Bench), OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? 100 | - [2024.11] [StreamingBench](https://github.com/THUNLP-MT/StreamingBench), StreamingBench evaluates Multimodal Large Language Models (MLLMs) in real-time, streaming video understanding tasks. 
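What the streaming benchmarks above have in common is timestamp-anchored evaluation: the model may only see the stream up to the moment a query is issued, never future frames. A minimal, hypothetical sketch of that protocol is shown below; the `model.append_frames` / `model.answer` interface is an assumption for illustration, not the API of any benchmark listed here.

```python
# Hypothetical sketch of timestamp-anchored streaming evaluation (illustrative only).
def evaluate_streaming(model, frames, frame_timestamps, queries):
    """
    frames:           video frames in temporal order
    frame_timestamps: arrival time (seconds) of each frame
    queries:          (query_time, question, gold_answer) tuples, sorted by query_time
    """
    results = []
    next_frame = 0
    for query_time, question, gold in queries:
        # Reveal only frames that arrived before the query -- never the future.
        while next_frame < len(frames) and frame_timestamps[next_frame] <= query_time:
            model.append_frames([frames[next_frame]])  # assumed incremental-encoding API
            next_frame += 1
        prediction = model.answer(question)            # assumed query API
        results.append((question, prediction, gold))
    return results
```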
101 | 102 | #omni 103 | - [2025.03] [ACVUBench](https://github.com/lark-png/ACVUBench), ACVUBench: Audio-Centric Video Understanding Benchmark 104 | - [2024.11] [LongVALE](https://github.com/ttgeng233/LongVALE), LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos 105 | - [2024.10] [AVHBench](https://github.com/kaist-ami/AVHBench), AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models 106 | - [2024.09] [OmniBench](https://github.com/multimodal-art-projection/OmniBench), OmniBench: Towards The Future of Universal Omni-Language Models 107 | 108 | #timestampQA 109 | - [2024.06] [VStream-QA](https://invinciblewyq.github.io/vstream-page/), Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams 110 | 111 | #multihop 112 | - [2024.08] [MultiHop-EgoQA](https://qirui-chen.github.io/MultiHop-EgoQA/), Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos 113 | 114 | #state#episodic 115 | - [2024.04] [OpenEQA](https://open-eqa.github.io/), OpenEQA: Embodied Question Answering in the Era of Foundation Models 116 | - [2021.10] [Env-QA](https://envqa.github.io/), Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments 117 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

# awesome-colorful-ai: Colorful Multimodal Research
8 | 9 | Welcome to our meticulously assembled anthology of vibrant multimodal research, encompassing an array of domains including **Vision**, **Audio**, **Agent**, **Robotics**, **Fundamental Sciences**, and **Ominous** including anything you want. Our collection primarily focuses on the advancements propelled by **large language models (LLMs)**, complemented by an assortment of related collections. 10 | 11 | ## 📚 Table of Contents 12 | 13 | - [📚 Table of Contents](#-table-of-contents) 14 | - [👀 Vision](#-vision) 15 | - [🖼 Image](#-image) 16 | - [📺 Video](#-video) 17 | - [📷 3D](#-3d) 18 | - [📰 Document](#-document) 19 | - [👁️ Vision Encoder](#️-vision-encoder) 20 | - [👂 Audio](#-audio) 21 | - [🔧 Agent](#-agent) 22 | - [🤖 Robotic](#-robotic) 23 | - [🔬 Science](#-science) 24 | - [♾️ AI for Math](#️-ai-for-math) 25 | - [🌏 Ominous](#-ominous) 26 | - [🙌 Contributing](#-contributing) 27 | 28 | ## 👀 Vision 29 | 30 | ### 🖼 Image 31 | 32 | Collection of works about Image + LLMs, Diffusion, see [Image](Vision/Image.md) for details 33 | 34 | > - Image Understanding 35 | > - Reading List 36 | > - Datasets & Benchmarks 37 | > - Image Generation 38 | > - Reading List 39 | > - Open-source Projects 40 | 41 | Related Collections (Understanding) 42 | 43 | - [VLM_survey](https://github.com/jingyi0000/VLM_survey) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/jingyi0000/VLM_survey?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/jingyi0000/VLM_survey.svg?style=social&label=Star), This is the repository of "Vision Language Models for Vision Tasks: a Survey", a systematic survey of VLM studies in various visual recognition tasks including image classification, object detection, semantic segmentation, etc. 44 | - [Awesome-Multimodal-Large-Language-Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/BradyFU/Awesome-Multimodal-Large-Language-Models?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/BradyFU/Awesome-Multimodal-Large-Language-Models.svg?style=social&label=Star), A curated list of Multimodal Large Language Models (MLLMs), including datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, llm-aided visual reasoning, foundation models, and others. This list will be updated in real time. 
45 | - [LLM-in-Vision](https://github.com/DirtyHarryLYL/LLM-in-Vision) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/DirtyHarryLYL/LLM-in-Vision?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/DirtyHarryLYL/LLM-in-Vision.svg?style=social&label=Star), Recent LLM (Large Language Models)-based CV and multi-modal works 46 | - [Awesome-Transformer-Attention](https://github.com/cmhungsteve/Awesome-Transformer-Attention) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/cmhungsteve/Awesome-Transformer-Attention?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/cmhungsteve/Awesome-Transformer-Attention.svg?style=social&label=Star), This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, codes, and related websites 47 | - [Multimodal-AND-Large-Language-Models](https://github.com/Yangyi-Chen/Multimodal-AND-Large-Language-Models) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/Yangyi-Chen/Multimodal-AND-Large-Language-Models?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/Yangyi-Chen/Multimodal-AND-Large-Language-Models.svg?style=social&label=Star), Paper list about multimodal and large language models, only used to record papers I read in the daily arxiv for personal needs. 48 | - [Efficient_Foundation_Model_Survey](https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/UbiquitousLearning/Efficient_Foundation_Model_Survey?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/UbiquitousLearning/Efficient_Foundation_Model_Survey.svg?style=social&label=Star), This repo contains the paper list and figures for A Survey of Resource-efficient LLM and Multimodal Foundation Models. 
49 | - [CVinW_Readings](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/Computer-Vision-in-the-Wild/CVinW_Readings?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/Computer-Vision-in-the-Wild/CVinW_Readings.svg?style=social&label=Star), A collection of papers on the topic of Computer Vision in the Wild (CVinW) 50 | - [Awesome-Vision-and-Language](https://github.com/sangminwoo/awesome-vision-and-language) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/sangminwoo/awesome-vision-and-language?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/sangminwoo/awesome-vision-and-language.svg?style=social&label=Star), A curated list of awesome vision and language resources 51 | - [Awesome-Multimodal-Research](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/Eurus-Holmes/Awesome-Multimodal-Research?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/Eurus-Holmes/Awesome-Multimodal-Research.svg?style=social&label=Star), This repo is reorganized from Awesome-Multimodal-ML 52 | - [Awesome-Multimodal-ML](https://github.com/pliang279/awesome-multimodal-ml) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/pliang279/awesome-multimodal-ml?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/pliang279/awesome-multimodal-ml.svg?style=social&label=Star), Reading list for research topics in multimodal machine learning 53 | - [Awesome-Referring-Image-Segmentation](https://github.com/MarkMoHR/Awesome-Referring-Image-Segmentation) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/MarkMoHR/Awesome-Referring-Image-Segmentation?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/MarkMoHR/Awesome-Referring-Image-Segmentation.svg?style=social&label=Star), A collection of referring image (video, 3D) segmentation papers and datasets. 54 | - [Awesome-Prompting-on-Vision-Language-Model](https://github.com/JindongGu/Awesome-Prompting-on-Vision-Language-Model) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/JindongGu/Awesome-Prompting-on-Vision-Language-Model?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/JindongGu/Awesome-Prompting-on-Vision-Language-Model.svg?style=social&label=Star), This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models. 55 | - [Mamba-in-CV](https://github.com/Yangzhangcst/Mamba-in-CV) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/Yangzhangcst/Mamba-in-CV?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/Yangzhangcst/Mamba-in-CV.svg?style=social&label=Star), A paper list of some recent Mamba-based CV works. If you find some ignored papers, please open issues or pull requests. 
56 | - [Efficient-Multimodal-LLMs-Survey](https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/lijiannuist/Efficient-Multimodal-LLMs-Survey?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/lijiannuist/Efficient-Multimodal-LLMs-Survey.svg?style=social&label=Star), Efficient Multimodal Large Language Models: A Survey 57 | 58 | Related Collections (Evaluation) 59 | 60 | - [Awesome-MLLM-Hallucination](https://github.com/showlab/Awesome-MLLM-Hallucination) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/showlab/Awesome-MLLM-Hallucination?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/showlab/Awesome-MLLM-Hallucination.svg?style=social&label=Star), A curated list of resources dedicated to hallucination of multimodal large language models (MLLM) 61 | - [awesome-Large-MultiModal-Hallucination](https://github.com/xieyuquanxx/awesome-Large-MultiModal-Hallucination) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/xieyuquanxx/awesome-Large-MultiModal-Hallucination?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/xieyuquanxx/awesome-Large-MultiModal-Hallucination.svg?style=social&label=Star) 62 | - [Awesome-MLLM-Reasoning-Benchmarks](https://github.com/Wild-Cooperation-Hub/Awesome-MLLM-Reasoning-Benchmarks) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/Wild-Cooperation-Hub/Awesome-MLLM-Reasoning-Benchmarks?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/Wild-Cooperation-Hub/Awesome-MLLM-Reasoning-Benchmarks.svg?style=social&label=Star) 63 | 64 | Related Collections (Generation) 65 | 66 | - [Awesome-VQVAE](https://github.com/rese1f/Awesome-VQVAE) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/rese1f/Awesome-VQVAE?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/rese1f/Awesome-VQVAE.svg?style=social&label=Star), A collection of resources and papers on Vector Quantized Variational Autoencoder (VQ-VAE) and its application 67 | - [Awesome-Diffusion-Models](https://github.com/heejkoo/Awesome-Diffusion-Models) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/heejkoo/Awesome-Diffusion-Models?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/heejkoo/Awesome-Diffusion-Models.svg?style=social&label=Star), This repository contains a collection of resources and papers on Diffusion Models 68 | - [Awesome-Controllable-Diffusion](https://github.com/atfortes/Awesome-Controllable-Diffusion) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/atfortes/Awesome-Controllable-Diffusion?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/atfortes/Awesome-Controllable-Diffusion.svg?style=social&label=Star), Collection of papers and resources on Controllable Generation using Diffusion Models, including ControlNet, DreamBooth, and others. 69 | - [Awesome-LLMs-meet-Multimodal-Generation](https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation.svg?style=social&label=Star), A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio). 
70 | 71 | Tutorials 72 | 73 | - [CVPR2024 Tutorial] [Recent Advances in Vision Foundation Models](https://vlp-tutorial.github.io/) 74 | - Large Multimodal Models: Towards Building General-Purpose Multimodal Assistant, Chunyuan Li 75 | - Methods, Analysis & Insights from Multimodal LLM Pre-training, Zhe Gan 76 | - LMMs with Fine-Grained Grounding Capabilities, Haotian Zhang 77 | - A Close Look at Vision in Large Multimodal Models, Jianwei Yang 78 | - Multimodal Agents, Linjie Li 79 | - Recent Advances in Image Generative Foundation Models, Zhengyuan Yang 80 | - Video and 3D Generation, Kevin Lin 81 | - [CVPR2023 Tutorial] [Recent Advances in Vision Foundation Models](https://vlp-tutorial.github.io/2023/index.html) 82 | - Opening Remarks & Visual and Vision-Language Pre-training, Zhe Gan 83 | - From Representation to Interface: The Evolution of Foundation for Vision Understanding, Jianwei Yang 84 | - Alignments in Text-to-Image Generation, Zhengyuan Yang 85 | - Large Multimodal Models, Chunyuan Li 86 | - Multimodal Agents: Chaining Multimodal Experts with LLMs, Linjie Li 87 | - [CVPR2022 Tutorial] [Recent Advances in Vision-and-Language Pre-training](https://vlp-tutorial.github.io/2022/index.html) 88 | - [CVPR2021 Tutorial] [From VQA to VLN: Recent Advances in Vision-and-Language Research](https://vqa2vln-tutorial.github.io/) 89 | - [CVPR2020 Tutorial] [Recent Advances in Vision-and-Language Research](https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research/) 90 | 91 | ### 📺 Video 92 | 93 | Collection of works about Video-Language Pretraining, Video + LLMs, see [Video](Vision/Video.md) for details 94 | 95 | > - Video Understanding 96 | > - Reading List 97 | > - Pretraining Tasks 98 | > - Datasets 99 | > - Pretraining Corpora 100 | > - Video Instructions 101 | > - Benchmarks 102 | > - Common Downstream Tasks 103 | > - Advanced Downstream Tasks 104 | > - Task-Specific Benchmarks 105 | > - Multifaceted Benchmarks 106 | > - Metrics 107 | > - Projects & Tools 108 | > - Video Generation 109 | > - Reading List 110 | > - Metrics 111 | > - Projects 112 | 113 | Related Collections (datasets) 114 | 115 | - [Awesome-Video-Datasets](https://github.com/xiaobai1217/Awesome-Video-Datasets#Video-and-Language) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/xiaobai1217/Awesome-Video-Datasets?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/xiaobai1217/Awesome-Video-Datasets.svg?style=social&label=Star) 116 | 117 | Related Collections (understanding) 118 | 119 | - [Awesome-LLMs-for-Video-Understanding](https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/yunlong10/Awesome-LLMs-for-Video-Understanding?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/yunlong10/Awesome-LLMs-for-Video-Understanding.svg?style=social&label=Star), Latest Papers, Codes and Datasets on Vid-LLMs. 120 | - [Awesome Long-Term Video Understanding](https://github.com/ttengwang/Awesome_Long_Form_Video_Understanding)![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/ttengwang/Awesome_Long_Form_Video_Understanding?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/ttengwang/Awesome_Long_Form_Video_Understanding.svg?style=social&label=Star), Awesome papers & datasets specifically focused on long-term videos. 
121 | - [Awesome-Token-Compress](https://github.com/daixiangzi/Awesome-Token-Compress)![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/daixiangzi/Awesome-Token-Compress?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/daixiangzi/Awesome-Token-Compress.svg?style=social&label=Star), A paper list of some recent works about Token Compress for Vit and VLM 122 | 123 | Related Collections (generation) 124 | 125 | - [i2vgen-xl](https://github.com/damo-vilab/i2vgen-xl) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/damo-vilab/i2vgen-xl?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/damo-vilab/i2vgen-xl.svg?style=social&label=Star), VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models. 126 | 127 | ### 📷 3D 128 | 129 | Collection of works about 3D+LLM, see [3D](Vision/3D.md) for details 130 | 131 | > - Reading List 132 | 133 | Related Collections 134 | 135 | - [awesome-3D-gaussian-splatting](https://github.com/MrNeRF/awesome-3D-gaussian-splatting) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/MrNeRF/awesome-3D-gaussian-splatting?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/MrNeRF/awesome-3D-gaussian-splatting.svg?style=social&label=Star), A curated list of papers and open-source resources focused on 3D Gaussian Splatting, intended to keep pace with the anticipated surge of research in the coming months 136 | - [Awesome-LLM-3D](https://github.com/ActiveVisionLab/Awesome-LLM-3D) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/ActiveVisionLab/Awesome-LLM-3D?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/ActiveVisionLab/Awesome-LLM-3D.svg?style=social&label=Star), a curated list of Multi-modal Large Language Model in 3D world Resources 137 | - [Awesome-3D-Vision-and-Language](https://github.com/jianghaojun/Awesome-3D-Vision-and-Language) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/jianghaojun/Awesome-3D-Vision-and-Language?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/jianghaojun/Awesome-3D-Vision-and-Language.svg?style=social&label=Star), A curated list of research papers in 3D visual grounding 138 | - [awesome-scene-understanding](https://github.com/bertjiazheng/awesome-scene-understanding) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/bertjiazheng/awesome-scene-understanding?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/bertjiazheng/awesome-scene-understanding.svg?style=social&label=Star), A list of awesome scene understanding papers. 139 | 140 | ### 📰 Document 141 | 142 | Related Collections 143 | 144 | - [Awesome Document Understanding](https://github.com/tstanislawek/awesome-document-understanding) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/tstanislawek/awesome-document-understanding?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/tstanislawek/awesome-document-understanding.svg?style=social&label=Star), A curated list of resources for Document Understanding (DU) topic related to Intelligent Document Processing (IDP), which is relative to Robotic Process Automation (RPA) from unstructured data, especially form Visually Rich Documents (VRDs). 
145 | 146 | ### 👁️ Vision Encoder 147 | 148 | Collection of existing popular vision encoder, see [Vision Encoder](Vision/VisionEncoder.md) for details 149 | 150 | > - Image Encoder 151 | > - Video Encoder 152 | > - Audio Encoder 153 | 154 | ## 👂 Audio 155 | 156 | Collection of works about audio+LLM, see [Audio](Audio/Audio.md) for details 157 | 158 | > - Reading List 159 | 160 | Related Collections 161 | 162 | - [awesome-large-audio-models](https://github.com/EmulationAI/awesome-large-audio-models) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/EmulationAI/awesome-large-audio-models?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/EmulationAI/awesome-large-audio-models.svg?style=social&label=Star), Collection of resources on the applications of Large Language Models (LLMs) in Audio AI. 163 | - [speech-trident](https://github.com/ga642381/speech-trident) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/ga642381/speech-trident?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/ga642381/speech-trident.svg?style=social&label=Star), Awesome speech/audio LLMs, representation learning, and codec models 164 | - [Audio-AI-Timeline](https://github.com/archinetai/audio-ai-timeline) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/archinetai/audio-ai-timeline?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/archinetai/audio-ai-timeline.svg?style=social&label=Star), Here we will keep track of the latest AI models for waveform based audio generation, starting in 2023! 165 | 166 | ## 🔧 Agent 167 | 168 | Collection of works about agent learning, see [Agent](Agents/Agent.md) for details 169 | 170 | > - Reading List 171 | > - Datasets & Benchmarks 172 | > - Projects 173 | > - Applications 174 | 175 | Related Collections 176 | 177 | - [LLM-Agent-Paper-Digest](https://github.com/XueyangFeng/LLM-Agent-Paper-Digest) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/XueyangFeng/LLM-Agent-Paper-Digest?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/XueyangFeng/LLM-Agent-Paper-Digest.svg?style=social&label=Star), For benefiting the research community and promoting LLM-powered agent direction, we organize papers related to LLM-powered agent that published on top conferences recently 178 | - [LLMAgentPapers](https://github.com/zjunlp/LLMAgentPapers) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/zjunlp/LLMAgentPapers?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/zjunlp/LLMAgentPapers.svg?style=social&label=Star), Must-read Papers on Large Language Model Agents. 179 | - [LLM-Agent-Paper-List](https://github.com/WooooDyy/LLM-Agent-Paper-List) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/WooooDyy/LLM-Agent-Paper-List?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/WooooDyy/LLM-Agent-Paper-List.svg?style=social&label=Star), In this repository, we provide a systematic and comprehensive survey on LLM-based agents, and list some must-read papers. 
180 | - [XLang Paper Reading](https://github.com/xlang-ai/xlang-paper-reading) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/xlang-ai/xlang-paper-reading?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/xlang-ai/xlang-paper-reading.svg?style=social&label=Star), Paper collection on building and evaluating language model agents via executable language grounding 181 | - [Awesome-LLMOps](https://github.com/tensorchord/Awesome-LLMOps) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/tensorchord/Awesome-LLMOps?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/tensorchord/Awesome-LLMOps.svg?style=social&label=Star), An awesome & curated list of best LLMOps tools for developers 182 | - [Awesome LLM-Powered Agent](https://github.com/hyp1231/awesome-llm-powered-agent) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/hyp1231/awesome-llm-powered-agent?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/hyp1231/awesome-llm-powered-agent.svg?style=social&label=Star), Awesome things about LLM-powered agents. Papers / Repos / Blogs / ... 183 | - [Awesome LMs with Tools](https://github.com/zorazrw/awesome-tool-llm) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/zorazrw/awesome-tool-llm?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/zorazrw/awesome-tool-llm.svg?style=social&label=Star), Language models (LMs) are powerful yet mostly for text-generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills. 184 | - [ToolLearningPapers](https://github.com/thunlp/ToolLearningPapers) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/thunlp/ToolLearningPapers?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/thunlp/ToolLearningPapers.svg?style=social&label=Star), Must-read papers on tool learning with foundation models 185 | - [Awesome-ALM](https://github.com/pbhu1024/awesome-augmented-language-model) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/pbhu1024/awesome-augmented-language-model?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/pbhu1024/awesome-augmented-language-model.svg?style=social&label=Star), This repo collects research papers on leveraging the capabilities of language models and can be a good reference for building upper-layer applications 186 | - [LLM-powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/), Lil'Log, Overview: planning, memory, tool use 187 | - [World Model Papers](https://github.com/Timothyxxx/WorldModelPapers), ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/Timothyxxx/WorldModelPapers?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/Timothyxxx/WorldModelPapers.svg?style=social&label=Star), Paper collections of the continuous effort starting from World Models 188 | 189 | ## 🤖 Robotic 190 | 191 | Collection of works about robotics+LLM, see [Robotic](Robotic/Robotic.md) for details 192 | 193 | > - Reading List 194 | 195 | Related Collections (Robotics) 196 | 197 | - [Awesome-Robotics-Foundation-Models](https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/robotics-survey/Awesome-Robotics-Foundation-Models?style=flat)![Dynamic JSON
Badge](https://img.shields.io/github/stars/robotics-survey/Awesome-Robotics-Foundation-Models.svg?style=social&label=Star), This is the partner repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope this repository can act as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods. 198 | - [Awesome-LLM-Robotics](https://github.com/GT-RIPL/Awesome-LLM-Robotics) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/GT-RIPL/Awesome-LLM-Robotics?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/GT-RIPL/Awesome-LLM-Robotics.svg?style=social&label=Star), This repo contains a curative list of papers using Large Language/Multi-Modal Models for Robotics/RL 199 | - [Simulately](https://github.com/geng-haoran/Simulately) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/geng-haoran/Simulately?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/geng-haoran/Simulately.svg?style=social&label=Star), a website where we gather useful information of physics simulator for cutting-edge robot learning research. It is still under active development, so stay tuned! 200 | - [Awesome-Temporal-Action-Detection-Temporal-Action-Proposal-Generation](https://github.com/zhenyingfang/Awesome-Temporal-Action-Detection-Temporal-Action-Proposal-Generation) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/zhenyingfang/Awesome-Temporal-Action-Detection-Temporal-Action-Proposal-Generation?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/zhenyingfang/Awesome-Temporal-Action-Detection-Temporal-Action-Proposal-Generation.svg?style=social&label=Star), Temporal Action Detection & Weakly Supervised & Semi Supervised Temporal Action Detection & Temporal Action Proposal Generation & Open-Vocabulary Temporal Action Detection. 201 | - [Awesome-TimeSeries-SpatioTemporal-LM-LLM](https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM.svg?style=social&label=Star), A professionally curated list of **Large (Language) Models and Foundation Models (LLM, LM, FM) for Temporal Data (Time Series, Spatio-temporal, and Event Data)** with awesome resources (paper, code, data, etc.), which aims to comprehensively and systematically summarize the recent advances to the best of our knowledge. 
202 | - [PromptCraft-Robotics](https://github.com/microsoft/PromptCraft-Robotics) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/microsoft/PromptCraft-Robotics?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/microsoft/PromptCraft-Robotics.svg?style=social&label=Star), The PromptCraft-Robotics repository serves as a community for people to test and share interesting prompting examples for large language models (LLMs) within the robotics domain 203 | - [Awesome-Robotics](https://github.com/ahundt/awesome-robotics) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/ahundt/awesome-robotics?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/ahundt/awesome-robotics.svg?style=social&label=Star), A curated list of awesome links and software libraries that are useful for robots 204 | 205 | Related Collections (embodied) 206 | 207 | - [Embodied_AI_Paper_List](https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/HCPLab-SYSU/Embodied_AI_Paper_List?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/HCPLab-SYSU/Embodied_AI_Paper_List.svg?style=social&label=Star), Awesome Paper list for Embodied AI and its related projects and applications 208 | - [Awesome-Embodied-AI](https://github.com/haoranD/Awesome-Embodied-AI) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/haoranD/Awesome-Embodied-AI?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/haoranD/Awesome-Embodied-AI.svg?style=social&label=Star), A curated list of awesome papers on Embodied AI and related research/industry-driven resources 209 | - [awesome-embodied-vision](https://github.com/ChanganVR/awesome-embodied-vision) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/ChanganVR/awesome-embodied-vision?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/ChanganVR/awesome-embodied-vision.svg?style=social&label=Star), Reading list for research topics in embodied vision 210 | 211 | Related Collections (autonomous driving) 212 | 213 | - [Awesome-LLM4AD](https://github.com/Thinklab-SJTU/Awesome-LLM4AD) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/Thinklab-SJTU/Awesome-LLM4AD?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/Thinklab-SJTU/Awesome-LLM4AD.svg?style=social&label=Star), A curated list of awesome LLM for Autonomous Driving resources (continually updated) 214 | 215 | ## 🔬 Science 216 | 217 | ### ♾️ AI for Math 218 | 219 | Collection of works about Mathematics + LLMs, see [AI4Math](AI4Science/AI4Math.md) for details 220 | 221 | > - Reading List 222 | 223 | Related Collections 224 | 225 | - [Awesome-Scientific-Language-Models](https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/yuzhimanhua/Awesome-Scientific-Language-Models?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/yuzhimanhua/Awesome-Scientific-Language-Models.svg?style=social&label=Star), A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, biology, medicine, materials science, and geoscience), covering different model sizes (from <100M to 70B parameters) and modalities (e.g., language, vision, molecule, protein, graph, and table) 226 | 227 | ## 🌏 Omni 228 | 229 |
Collection of works about LLM + omni modality, see [Omni](Omni/omni.md) for details 230 | 231 | > - Reading List 232 | > - Dataset 233 | > - Benchmark 234 | 235 | Related Collections 236 | 237 | - [Awesome-Unified-Multimodal-Models](https://github.com/showlab/Awesome-Unified-Multimodal-Models) ![GitHub last commit (by committer)](https://img.shields.io/github/last-commit/showlab/Awesome-Unified-Multimodal-Models?style=flat)![Dynamic JSON Badge](https://img.shields.io/github/stars/showlab/Awesome-Unified-Multimodal-Models.svg?style=social&label=Star), This is a repository for organizing papers, codes and other resources related to unified multimodal models. 238 | 239 | ## 🙌 Contributing 240 | 241 | Please feel free to create a [pull request](https://github.com/patrick-tssn/Awesome-Colorful-LLM/pulls) or drop me an email: [flagwyx@gmail.com](mailto:flagwyx@gmail.com) 242 | -------------------------------------------------------------------------------- /Robotic/Robotic.md: -------------------------------------------------------------------------------- 1 | # Robotic 2 | 3 | Table of Contents 4 | 5 | - [Reading List](#reading-list) 6 | 7 | ## Reading List 8 | 9 | | Paper | Base Language Model | Code | Publication | Preprint | Affiliation | 10 | | ---------------------------------------------------------------------------------------- | ------------------- | -------------------------------------------------------------------------------------------- | ----------- | ------------------------------------------- | ------------------- | 11 | | A Language Agent for Autonomous Driving | GPT-3.5 | [code](https://github.com/USC-GVL/Agent-Driver) | | [2311.10813](https://arxiv.org/abs/2311.10813) | USC | 12 | | RT-2: New model translates vision and language into action | PaLI-X, PaLM-E | [blog](https://www.deepmind.com/blog/rt-2-new-model-translates-vision-and-language-into-action) | | | Deepmind | 13 | | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | GPT4 | [web](https://voxposer.github.io/) | | [2307.05973](https://arxiv.org/abs/2307.05973) | Stanford | 14 | | Statler: State-Maintaining Language Models for Embodied Reasoning | GPT3 | | | [2306.17840](https://arxiv.org/abs/2306.17840) | TTIC | 15 | | Chat with the Environment: Interactive Multimodal Perception using Large Language Models | GPT3 | | | [2303.08268](https://arxiv.org/abs/2303.08268) | Universität Hamburg | 16 | 17 | 18 | ## Projects 19 | 20 | - [Mobile ALOHA](https://github.com/MarkFzp/mobile-aloha), Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation -------------------------------------------------------------------------------- /Vision/3D.md: -------------------------------------------------------------------------------- 1 | # 3D 2 | 3 | Table of Contents 4 | 5 | - [Reading List](#reading-list) 6 | 7 | ## Reading List 8 | 9 | | Paper | Base Language Model | Code | Publication | Preprint | Affiliation | 10 | | --------------------------------------------------------------------- | ------------------- | ---------------------------------------------------------------------------- | ----------- | ------------------------------------------- | ------------------------------ | 11 | | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers | GPT2-medium | [mesh-gpt](https://github.com/nihalsid/mesh-gpt) | | [2311.15475](http://arxiv.org/abs/2311.15475) | Technical University of Munich | 12 | | Uni3D: Exploring Unified 3D Representation at Scale | CLIP |
[Uni3D](https://github.com/baaivision/Uni3D) | | [2310.06773](https://arxiv.org/abs/2310.06773) | BAAI | 13 | | Point-Bind & Point-LLM: Aligning 3D with Multi-modality | LLaMA | [Point-Bind & Point-LLM](https://github.com/ZiyuGuo99/Point-Bind_Point-LLM) | | [2309.00615](https://arxiv.org/abs/2309.00615) | Shanghai AI Lab | 14 | | PointLLM: Empowering Large Language Models to Understand Point Clouds | Vicuna | [PointLLM](https://github.com/OpenRobotLab/PointLLM) | | [2308.16911](https://arxiv.org/abs/2308.16911) | Shanghai AI Lab | 15 | | 3D-LLM: Injecting the 3D World into Large Language Models | BLIP2 | [3D-LLM](https://github.com/UMass-Foundation-Model/3D-LLM) | | [2307.12981](https://arxiv.org/abs/2307.12981) | UMASS | 16 | 17 | 18 | ## Datasets & Benchmarks 19 | 20 | - [SceneVerse](https://github.com/scene-verse/SceneVerse), We propose SceneVerse, the first million-scale 3D vision-language dataset with 68K 3D indoor scenes and 2.5M vision-language pairs. 21 | - [GPTEval3D](https://github.com/3DTopia/GPTEval3D), An implementation of the paper "GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation". This contains an evaluation metric for text-to-3D generative models. 22 | 23 | -------------------------------------------------------------------------------- /Vision/Ego.md: -------------------------------------------------------------------------------- 1 | # Ego 2 | 3 | ## Reading List 4 | 5 | ## Datasets & Benchmarks 6 | 7 | - [2023.12] [EgoPlan](https://github.com/ChenYi99/EgoPlan), In this work, we introduce a benchmark with human annotations, **EgoPlan-Bench**, to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios 8 | - [2021.10] [Ego4D](https://github.com/facebookresearch/Ego4d), **Ego4D** is the world's largest egocentric (first person) video ML dataset and benchmark suite, including over 3700 hours of annotated first-person video data (a minimal annotation-loading sketch is given after this list) 9 | - [2018.04] [Charades-Ego](https://prior.allenai.org/projects/charades-ego), A dataset which guides research into unstructured video activity recognition and commonsense reasoning for daily human activities with paired videos of first person and third person perspectives.
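The egocentric benchmarks above typically distribute their annotations as JSON files. The sketch below is a minimal, hedged example of loading and walking such a file; the path and the `videos` → `clips` → `annotations` → `language_queries` nesting follow an Ego4D NLQ-style layout but are assumptions, so check the schema of the release you actually download.

```python
import json
from pathlib import Path

# Placeholder path; the nesting and field names below are assumptions modeled on
# Ego4D NLQ-style annotations and may differ from the actual release.
anno_path = Path("ego4d_data/v2/annotations/nlq_train.json")

with anno_path.open() as f:
    data = json.load(f)

# Walk the (assumed) videos -> clips -> annotations -> language_queries nesting
# and print a few natural-language queries with their source video ids.
for video in data.get("videos", [])[:3]:
    for clip in video.get("clips", []):
        for ann in clip.get("annotations", []):
            for query in ann.get("language_queries", []):
                print(video.get("video_uid"), "->", query.get("query"))
```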
10 | -------------------------------------------------------------------------------- /Vision/Image.md: -------------------------------------------------------------------------------- 1 | # Image 2 | 3 | Table of Contents 4 | 5 | - [Image Understanding](#image-understanding) 6 | - [Reading List](#reading-list) 7 | - [Datasets & Benchmarks](#datasets--benchmarks) 8 | - [Image Generation](#image-generation) 9 | - [Reading List](#reading-list-1) 10 | - [Open Source Projects](#open-source-projects) 11 | 12 | ## Image Understanding 13 | 14 | ### Reading List 15 | 16 | *NOTEs: INST=Instruction, FT=Finetune, PT=Pretraining, ICL=In Context Learning, ZS=ZeroShot, FS=FewShot, RTr=Retrieval* 17 | 18 | | Paper | Base Language Model | Framework | Data | Code | Publication | Preprint | Affiliation | 19 | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------------------------ | ---------------------------------------------------------------------------------- | ----------------- | 20 | | xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | Phi3 | PT + FT (interleaved resampler) | self-curation | | | [2408.08872](https://www.arxiv.org/abs/2408.08872) | Salesforce | 21 | | LLaVA-OneVision: Easy Visual Task Transfer | Qwen-2 | PT + FT (knowledge+v-inst) | self-curation (one-vision) | [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) | | [2408.03326](https://arxiv.org/abs/2408.03326) | ByteDance | 22 | | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | vicuna | PT + FT | self-construct (cambrian) | [cambrian](https://github.com/cambrian-mllm/cambrian) | | [2406.16860](https://arxiv.org/abs/2406.16860) | Meta | 23 | | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Vicuna/Mixtral/Yi | PT+FT | self-construct (mimi-gemini) | [MGM](https://github.com/dvlab-research/MGM) | | [2403.18814](https://arxiv.org/abs/2403.18814) | CUHK | 24 | | MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | ? 
| PT + FT | self-construct + mixture | - | | [2403.09611](https://arxiv.org/abs/2403.09611) | Apple | 25 | | An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | Llama/Qwen | Prune | - | [FastV](https://github.com/chenllliang/FastV) | | [2403.06764](https://arxiv.org/abs/2403.06764) | Alibaba | 26 | | DeepSeek-VL: Towards Real-World Vision-Language Understanding | DeepSeekLLM | PT+FT | mixture | [DeepSeek-VL](https://github.com/deepseek-ai/DeepSeek-VL) | | [2403.05525](https://arxiv.org/abs/2403.05525) | Deepseek | 27 | | Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | LLaVA | FT | mixture-HR | [LLaVA-HR](https://github.com/luogen1996/LLaVA-HR) | | [2403.03003](https://arxiv.org/abs/2403.03003) | XMU | 28 | | Efficient Multimodal Learning from Data-centric Perspective | Phi, StableLM | PT + FT | LAION-2B | [Bunny](https://github.com/BAAI-DCAI/Bunny) | | [2402.11530](https://arxiv.org/abs/2402.11530) | BAAI | 29 | | Efficient Visual Representation Learning with Bidirectional State Space Model | SSM | *efficient* | | [Vim](https://github.com/hustvl/Vim) | | [2401.09417](https://arxiv.org/abs/2401.09417) | HUST | 30 | | AIM: Autoregressive Image Models | *ViT* | Scale | | [ml-aim](https://github.com/apple/ml-aim) | | [2401.08541](https://arxiv.org/abs/2401.08541) | Apple | 31 | | LEGO:Language Enhanced Multi-modal Grounding Model | Vicuna | PT + SFT | mixture + self-construct | [LEGO](https://github.com/lzw-lzw/LEGO) | | [2401.06071](https://arxiv.org/abs/2401.06071) | ByteDance | 32 | | COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training | gated cross-attn + latent array | PT + FT | mixture | [cosmo](https://github.com/showlab/cosmo) | | [2401.00849](https://arxiv.org/abs/2401.00849) | NUS | 33 | | Tracking with Human-Intent Reasoning | LLaMA (LLaVA) | PT+FT | mixture | [TrackGPT](https://github.com/jiawen-zhu/TrackGPT) | | [2312.17448](https://arxiv.org/abs/2312.17448) | Alibaba | 34 | | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Vicuna | PT + FT | mixture | [InternVL](https://github.com/OpenGVLab/InternVL) | CVPR 2024 | [2312.14238](https://arxiv.org/abs/2312.14238) | Shanghai AI Lab | 35 | | VCoder: Versatile Vision Encoders for Multimodal Large Language Models | LLaMA (LLaVA-1.5) | FT (depth encoder + segment encoder) | COCO Segmentation Text ([COST](https://huggingface.co/datasets/shi-labs/COST)) | [VCoder](https://github.com/SHI-Labs/VCoder) | CVPR 2024 | [2312.14233](https://arxiv.org/abs/2312.14233) | Gatech | 36 | | V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Vicuna-7B | FT + obj. 
refine (search) | mixture+self-construct(object) | [vstar](https://github.com/penghao-wu/vstar) | | [2312.14135](https://arxiv.org/abs/2312.14135) | NYU | 37 | | Osprey: Pixel Understanding with Visual Instruction Tuning | Vicuna | PT+FT | mixture | [Osprey](https://github.com/CircleRadon/Osprey) | | [2312.10032](https://arxiv.org/abs/2312.10032) | ZJU | 38 | | Tokenize Anything via Prompting | SAM | PT | mixture (mainly SA-1B) | [tokenize-anything](https://github.com/baaivision/tokenize-anything) | | [2312.09128](https://arxiv.org/abs/2312.09128) | BAAI | 39 | | Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | LLaVA | INST | video-chatgpt | [Vista-LLaMA (web)](https://jinxxian.github.io/Vista-LLaMA/) | | [2312.08870](https://arxiv.org/abs/2312.08870) | ByteDance | 40 | | Gemini: A Family of Highly Capable Multimodal Models | Transformer-Decoder | FT (language decoder + image decoder) | ? | ? | - | [2312.blog](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | Google | 41 | | VILA: On Pre-training for Visual Language Models | Llama | PT + FT | self-construct + llava-1.5 | [VILA](https://github.com/Efficient-Large-Model/VILA) | | [2312.07533](https://arxiv.org/abs/2312.07533) | NVIDIA | 42 | | Honeybee: Locality-enhanced Projector for Multimodal LLM | LLaMA/Vicuna | PT+INST (projector) | mixture | [Honeybee](https://github.com/kakaobrain/honeybee) | CVPR 2024 | [2312.06742](https://arxiv.org/abs/2312.06742) | KakaoBrain | 43 | | Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | *vocabulary network* + LLM | PT | mixture(doc,chart + opendomain) | [Vary](https://github.com/Ucas-HaoranWei/Vary) | | [2312.06109](https://arxiv.org/abs/2312.06109) | MEGVII | 44 | | LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models | LLaMA (LLaVA) | FT(FT grounding model + INSTFT) | RefCOCO + Flickr30K + LLaVA | [LLaVA-Grounding](https://github.com/UX-Decoder/LLaVA-Grounding) | | [2312.02949](https://arxiv.org/abs/2312.02949) | MSR | 45 | | Making Large Multimodal Models Understand Arbitrary Visual Prompts | LLaMA (LLaVA) | PT+INSTFT | BLIP + LLaVA-1.5 | [ViP-LLaVA](https://github.com/mu-cai/ViP-LLaVA) | | [2312.00784](https://arxiv.org/abs/2312.00784) | Wisconsin-Madison | 46 | | Sequential Modeling Enables Scalable Learning for Large Vision Models | LLaMA | PT (**Visual Tokenizer**) | mixture (430B visual tokens, 50 dataset, mainly from LAION) | [LVM](https://github.com/ytongbai/LVM) | | [2312.00785](https://arxiv.org/abs/2312.00785) | UCB | 47 | | Compositional Chain-of-Thought Prompting for Large Multimodal Models | LLaVA | CoT (scene graph) | | | | [2311.17076](https://arxiv.org/abs/2311.17076) | UCB | 48 | | GLaMM: Pixel Grounding Large Multimodal Model | Vicuna-1.5 | FT | self-construct ([grounding-anything-dataset](https://github.com/mbzuai-oryx/groundingLMM#-grounding-anything-dataset-grand)) | [GLaMM](https://github.com/mbzuai-oryx/groundingLMM) | | [2311.03356](https://arxiv.org/abs/2311.03356) | MBZU | 49 | | Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | GPT-3.5 + LLMs | FT (hallucination) | mixture | [LURE](https://github.com/YiyangZhou/LURE) | ICLR 2024 | [2310.00754](https://arxiv.org/abs/2310.00754) | UNC | 50 | | CogVLM: Visual Expert For Large Language Models | Vicuna | PT + FT | self-construct + mixture | [CogVLM](https://github.com/THUDM/CogVLM) | | [2309.github](https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf) | Zhipu AI | 51 | | GPT-4V(ision) System 
Card | GPT4 | ? | ? | ? | - | [2309.blog](https://cdn.openai.com/papers/GPTV_System_Card.pdf) | OpenAI | 52 | | Demystifying CLIP Data | CLIP | PT | [curated & transparent CLIP dataset](https://github.com/facebookresearch/MetaCLIP/blob/main/metadata.json) | [MetaCLIP](https://github.com/facebookresearch/MetaCLIP) | | [2309.16671](https://arxiv.org/abs/2309.16671) | Meta | 53 | | InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | InternLM | PT + FT | mixture | [InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer) | | [2309.15112](https://arxiv.org/abs/2309.15112) | Shanghai AI Lab. | 54 | | DreamLLM: Synergistic Multimodal Comprehension and Creation | LLaMA | PT + FT | mixture | [DreamLLM](https://github.com/RunpeiDong/DreamLLM) | | [2309.11499](https://arxiv.org/abs/2309.11499) | MEGVII | 55 | | LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | LLaMA | PT + FT (**Visual Tokenizer**) | mixture | [LaVIT](https://github.com/jy0205/LaVIT) | | [2309.04669](https://arxiv.org/abs/2309.04669) | Kuaishou | 56 | | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | QWen | PT + FT | | [Qwen-VL](https://github.com/QwenLM/Qwen-VL) | | [2308.12966](https://arxiv.org/abs/2308.12966) | Alibaba | 57 | | BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Vicuna-7B/Flan-t5-xxl | FT | same as InstructBLIP | [BLIVA](https://github.com/mlpc-ucsd/BLIVA) | | [2308.09936](https://arxiv.org/abs/2308.09936) | UCSD | 58 | | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Husky-7b | PT | AS-1B | [all-seeing](https://github.com/OpenGVLab/all-seeing) | | [2308.01907](https://arxiv.org/abs/2308.01907) | Shanghai AI Lab | 59 | | LISA: Reasoning Segmentation via Large Language Model | LLaMA | PT | mixture | [LISA](https://github.com/dvlab-research/LISA) | | [2308.00692](https://arxiv.org/abs/2308.00692) | SmartMore | 60 | | Generative Pretraining in Multimodality, v2 | LLaMA,*Diffusion* | PT, Visual Decoder | mixture | [Emu](https://github.com/baaivision/Emu) | | [2307.05222](https://arxiv.org/abs/2307.05222) | BAAI | 61 | | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? 
| Vicuna-7b | PT + FT | mixture (emperical) | [lynx](https://github.com/bytedance/lynx-llm) | | [2307.02469](https://arxiv.org/abs/2307.02469) | Bytedance | 62 | | Visual Instruction Tuning with Polite Flamingo | Flamingo | FT + (rewrite instruction) | PF-1M, LLaVA-instruciton-177k | [Polite Flamingo](https://github.com/ChenDelong1999/polite_flamingo) | | [2307.01003](https://arxiv.org/abs/2307.01003) | Xiaobing | 63 | | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Vicuna-13B | FT + MM-INST | self-construct (text-rich image) | [LLaVAR](https://github.com/SALT-NLP/LLaVAR) | | [2306.17107](https://arxiv.org/abs/2306.17107) | Gatech | 64 | | Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | Vicuna-7B/13B | FT + MM-INST | self-constuct (referential dialogue) | [Shikra](https://github.com/shikras/shikra) | | [2306.15195](https://arxiv.org/abs/2306.15195) | SenseTime | 65 | | KOSMOS-2: Grounding Multimodal Large Language Models to the World | Magneto | PT + obj | [Grit](https://github.com/microsoft/unilm/tree/master/kosmos-2#grit-large-scale-training-corpus-of-grounded-image-text-pairs) (90M images) | [Kosmos-2](https://github.com/microsoft/unilm/tree/master/kosmos-2) | | [2306.14824](https://arxiv.org/abs/2306.14824) | Microsoft | 66 | | Aligning Large Multi-Modal Model with Robust Instruction Tuning | Vicuna (MiniGPT4-like) | FT + MM-INST | [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction#visual-instruction-data-lrv-instruction) (150K INST, robust), GAVIE (evaluate) | [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction) | | [2306.14565](https://arxiv.org/abs/2306.14565) | UMD | 67 | | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Vicuna-7B/13B | FT + MM-INST | [LAMM-Dataset](https://github.com/OpenLAMM/LAMM#lamm-dataset) (186K INST), [LAMM-Benchmark](https://github.com/OpenLAMM/LAMM#lamm-benchmark) | [LAMM](https://github.com/OpenLAMM/LAMM) | | [2306.06687](https://arxiv.org/abs/2306.06687) | Shanghai AI Lab | 68 | | Improving CLIP Training with Language Rewrites | CLIP + ChatGPT | FT + Data-aug | mixture | [LaCLIP](https://github.com/LijieFan/LaCLIP) | | [2305.20088](https://arxiv.org/abs/2305.20088) | Google | 69 | | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Vicuna-13B | FT + MM-INST | | [ChatBridge](https://github.com/joez17/ChatBridge) | | [2305.16103](https://arxiv.org/abs/2305.16103) | CAS | 70 | | Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | LLaMA-7B/13B | FT adapter + MM-INST | [self-construc](https://github.com/luogen1996/LaVIN#data-preparation) (INST) | [LaVIN](https://github.com/luogen1996/LaVIN) | | [2305.15023](https://arxiv.org/abs/2305.15023) | Xiamen Univ. | 71 | | IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | ChatGPT | iterative, compositional (que, ans, rea) | ZS | [IdeaGPT](https://github.com/Hxyou/IdealGPT) | | [2305.14985](https://arxiv.org/abs/2305.14985) | Columbia | 72 | | DetGPT: Detect What You Need via Reasoning | Robin, Vicuna | FT + MM-INST + detector | self-construct | [DetGPT](https://github.com/OptimalScale/DetGPT) | | [2305.14167](https://arxiv.org/abs/2305.14167) | HKUST | 73 | | VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | Alpaca | | | [VisionLLM](https://github.com/OpenGVLab/VisionLLM) | | [2305.11175](https://arxiv.org/abs/2305.11175) | Shanghai AI Lab. 
| 74 | | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Vicuna | | | [InstructBLIP](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) | | [2305.06500](https://arxiv.org/abs/2305.06500) | Salesforce | 75 | | MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Flamingo | FT + MM-INST, LoRA | mixture | [Multimodal-GPT](https://github.com/open-mmlab/Multimodal-GPT) | | [2305.04790](https://arxiv.org/abs/2305.04790) | NUS | 76 | | Otter: A Multi-Modal Model with In-Context Instruction Tuning | Flamingo | | | [Otter](https://github.com/Luodian/Otter) | | [2305.03726](https://arxiv.org/abs/2305.03726) | NTU | 77 | | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | | | [X-LLM](https://github.com/phellonchen/X-LLM) | | [2305.04160](https://arxiv.org/abs/2305.04160) | CAS | 78 | | LMEye: An Interactive Perception Network for Large Language Models | OPT,Bloomz,BLIP2 | PT, FT + MM-INST | self-construct | [LingCloud](https://github.com/YunxinLi/LingCloud) | | [2305.03701](https://arxiv.org/abs/2305.03701) | HIT | 79 | | Caption anything: Interactive image description with diverse multimodal controls | BLIP2, ChatGPT | ZS | | [Caption Anything](https://github.com/ttengwang/Caption-Anything) | | [2305.02677](https://arxiv.org/abs/2305.02677) | SUSTech | 80 | | Multimodal Procedural Planning via Dual Text-Image Prompting | OFA, BLIP, GPT3 | | | [TIP](https://github.com/YujieLu10/TIP) | | [2305.01795](https://arxiv.org/abs/2305.01795) | UCSB | 81 | | Transfer Visual Prompt Generator across LLMs | FlanT5, OPT | projecter + transfer strategy | | [VPGTrans](https://github.com/VPGTrans/VPGTrans) | | [2305.01278](https://arxiv.org/abs/2305.01278) | CUHK | 82 | | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | LLaMA | | | [LLaMA-Adapter](https://github.com/ZrrSkywalker/LLaMA-Adapter) | | [2304.15010](https://arxiv.org/abs/2304.15010) | Shanghai AI Lab. | 83 | | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality,[mPLUG](https://aclanthology.org/2022.emnlp-main.488/), [mPLUG-2](https://arxiv.org/abs/2302.00402) | LLaMA | | | [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl) | | [2304.14178](https://arxiv.org/abs/2304.14178) | DAMO Academy | 84 | | MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models | Vicunna | | | [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4) | | [2304.10592](https://arxiv.org/abs/2304.10592) | KAUST | 85 | | Visual Instruction Tuning | LLaMA | full-param. 
+ INST tuning | [LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) (150K INST by GPT4) | [LLaVA](https://github.com/haotian-liu/LLaVA) | | [2304.08485](https://arxiv.org/abs/2304.08485) | Microsoft | 86 | | Chain of Thought Prompt Tuning in Vision Language Models | - | Visual CoT | - | | | [2304.07919](https://arxiv.org/abs/2304.07919) | PKU | 87 | | MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ChatGPT | | | [MM-REACT](https://github.com/microsoft/MM-REACT) | | [2303.11381](https://arxiv.org/abs/2303.11381) | Microsoft | 88 | | ViperGPT: Visual Inference via Python Execution for Reasoning | Codex | | | [ViperGPT](https://github.com/cvlab-columbia/viper) | ICCV 2023 | [2303.08128](https://arxiv.org/abs/2303.08128) | Columbia | 89 | | Scaling Vision-Language Models with Sparse Mixture of Experts | (MOE + Scaling) | | | | | [2303.07226](https://arxiv.org/abs/2303.07226) | Microsoft | 90 | | ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | ChatGPT, Flan-T5 (BLIP2) | | | [ChatCaptioner](https://github.com/Vision-CAIR/ChatCaptioner) | | [2303.06594](https://arxiv.org/abs/2303.06594) | KAUST | 91 | | Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | ChatGPT | | | [Visual ChatGPT](https://github.com/microsoft/visual-chatgpt) | | [2303.04671](https://arxiv.org/abs/2303.04671) | Microsoft | 92 | | PaLM-E: An Embodied Multimodal Language Model | PaLM | | | | | [2303.03378](https://arxiv.org/abs/2303.03378) | Google | 93 | | Prismer: A Vision-Language Model with An Ensemble of Experts | RoBERTa, OPT, BLOOM | | | [Prismer](https://github.com/NVlabs/prismer) | | [2303.02506](https://arxiv.org/abs/2303.02506) | Nvidia | 94 | | Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | GPT3, CLIP, DINO, DALLE | | FS, evaluate: img-cls | [CaFo](https://github.com/OpenGVLab/CaFo) | CVPR 2023 | [2303.02151](https://arxiv.org/abs/2303.02151) | CAS | 95 | | Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | GPT3 | RTr+candidates, ICL | [evaluate](https://github.com/MILVLG/prophet#data-preparation): OKVQA, A-OKVQA | [Prophet](https://github.com/MILVLG/prophet) | CVPR 2023 | [2303.01903](https://arxiv.org/abs/2303.01903) | HDU | 96 | | Language Is Not All You Need: Aligning Perception with Language Models | Magneto | | | [KOSMOS-1](https://github.com/microsoft/unilm) | | [2302.14045](https://arxiv.org/abs/2302.14045) | Microsoft | 97 | | Scaling Vision Transformers to 22 Billion Parameters | (CLIP + Scaling) | | | | | [2302.05442](https://arxiv.org/abs/2302.05442) | Google | 98 | | Multimodal Chain-of-Thought Reasoning in Language Models | T5 | FT + MM-CoT | | [MM-COT](https://github.com/amazon-science/mm-cot) | | [2302.00923](https://arxiv.org/abs/2302.00923) | Amazon | 99 | | Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Caption | RETRO | | | | | [2302.04858](https://arxiv.org/abs/2302.04858) | Nvidia | 100 | | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Flan-T5 / qformer | | | [BLIP2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | ICML 2023 | [2301.12597](https://arxiv.org/abs/2301.12597) | Salesforce | 101 | | See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | OPT | | | | | [2301.05226](https://arxiv.org/abs/2301.05226) | 
MIT-IBM | 102 | | Generalized Decoding for Pixel, Image, and Language | GPT3 | | | [X-GPT](https://github.com/microsoft/X-Decoder/tree/xgpt) | | [2212.11270](https://arxiv.org/abs/2212.11270) | Microsoft | 103 | | From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models | OPT | | | [Img2LLM](https://github.com/salesforce/LAVIS/tree/main/projects/img2prompt-vqa) | CVPR 2023 | [2212.10846](https://arxiv.org/abs/2212.10846) | Salesforce | 104 | | Visual Programming: Compositional visual reasoning without training | GPT3 | Compositional/Tool-Learning | | [VisProg](https://github.com/allenai/visprog) | CVPR 2023
*best paper* | [2211.11559](https://arxiv.org/abs/2211.11559) | AI2 | 105 | | Language Models are General-Purpose Interfaces | DeepNorm | Semi-Causal | | [METALM](https://github.com/microsoft/unilm) | | [2206.06336](https://arxiv.org/abs/2206.06336) | Microsoft | 106 | | Language Models Can See: Plugging Visual Controls in Text Generation | GPT2 | | | [MAGIC](https://github.com/yxuansu/MAGIC) | | [2205.02655](https://arxiv.org/abs/2205.02655) | Tencent | 107 | | Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla / adapter | | | [Flamingo](https://github.com/lucidrains/flamingo-pytorch) | Neurips 2022 | [2204.14198](https://arxiv.org/abs/2204.14198) | DeepMind | 108 | | Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | GPT3, RoBERTa | | | [Socratic Models](https://github.com/google-research/google-research/tree/master/socraticmodels) | ICLR 2023 | [2204.00598](https://arxiv.org/abs/2204.00598) | Google | 109 | | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | GPT3 | LLM as KB, ICL | evaluate: OKVQA | [PICa](https://github.com/microsoft/PICa) | AAAI 2022 | [2109.05014](https://arxiv.org/abs/2109.05014) | Microsoft | 110 | | Multimodal few-shot learning with frozen language models | Transforemr-LM-7b (PT on C4) | ICL | [ConceptualCaptions](https://github.com/google-research-datasets/conceptual-captions) | [Frozen](https://github.com/ilkerkesen/frozen) (unofficial) | Neurips 2021 | [2106.13884](https://arxiv.org/abs/2106.13884) | Deepmind | 111 | | Perceiver: General Perception with Iterative Attention | Perceiver | latent array | | | ICML 2021 | [2103.03206](https://arxiv.org/abs/2103.03206) | DeepMind | 112 | | Learning Transferable Visual Models From Natural Language Supervision | Bert / contrastive learning | | | [CLIP](https://github.com/openai/CLIP) | ICML 2021 | [2103.00020](https://arxiv.org/abs/2103.00020) | OpenAI | 113 | 114 | ### Datasets & Benchmarks 115 | 116 | Datasets 117 | 118 | | Dataset | Source | Format | Paper | Preprint | Publication | Affiliation | 119 | | ------------------------------------------------------- | ------------------------- | --------------------------- | -------------------------------------------------------------------------------------------------- | --------------------------------------------- | ----------- | --------------- | 120 | | [combrian](https://github.com/cambrian-mllm/cambrian) | mixture INST | instruciton (10M) | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | [2406.16860](https://arxiv.org/abs/2406.16860) | | Meta | 121 | | [MINT-1T](https://github.com/mlfoundations/MINT-1T) | mixture html/pdf/arxiv | corpora (3.4B imgs) | MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens | [2406.11271](https://arxiv.org/arxiv/2406.11271) | | Salesforce | 122 | | [OmniCorpus](https://github.com/OpenGVLab/OmniCorpus) | mixture html | corpora (10B imgs) | OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | [2406.08418](https://arxiv.org/abs/2406.08418) | | Shanghai AI Lab | 123 | | [MGM](https://github.com/dvlab-research/MGM) | mixture INST | instruciton (1.2M) | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | [2403.18814](https://arxiv.org/abs/2403.18814) | | CUHK | 124 | | [ALLaVA-4V](https://github.com/FreedomIntelligence/ALLaVA) | LAION/Vision-FLAN (GPT4V) | instruciton (505k/203k) | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model 
| [2401.06209](https://arxiv.org/abs/2401.06209) | | CUHKSZ | 125 | | [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) | mixture INST | self-construct INSTs (2.4M) | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | [2306.04387](https://arxiv.org/abs/2306.04387) | | HKU | 126 | | [OBELICS](https://github.com/huggingface/OBELICS) | I-T pairs | corpora (353M imgs) | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | [2306.16527](https://arxiv.org/abs/2306.16527) | | Huggingface | 127 | | [LLaVA](https://github.com/haotian-liu/LLaVA) | mixture INST | instruciton ( 675k) | LLaVA: Large Language and Vision Assistant | [2304.08485](https://arxiv.org/abs/2304.08485) | | Microsoft | 128 | | [LAION](https://laion.ai/blog/laion-5b/) | I-T pairs | corpora (2.32b) | LAION-5B: An open large-scale dataset for training next generation image-text models | [2210.08402](https://arxiv.org/abs/2210.08402) | | UCB | 129 | 130 | Benchmarks 131 | 132 | | Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation | 133 | | --------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------- | ------------ | ---------------- | 134 | | [MMVP](https://github.com/tsb0601/MMVP) | QA (pattern error) | human-annotated (300) | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | [2401.06209](https://arxiv.org/abs/2401.06209) | | NYU | 135 | | [MMMU](https://github.com/MMMU-Benchmark/MMMU) | QA (general domain) | human collected (11.5K) | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | [2311.16502](https://arxiv.org/abs/2311.16502) | | OSU, UWaterloo | 136 | | [MLLM-Bench](https://github.com/FreedomIntelligence/MLLM-Bench) | General INST | human collected | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | [2311.13951](https://arxiv.org/abs/2311.13951) | | CUHK | 137 | | [HallusionBench](https://github.com/tianyi-lab/HallusionBench) | QA (hallucination) | human annotated (1129) | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models | [2310.14566](https://arxiv.org/abs/2310.14566) | | UMD | 138 | | [MathVista](https://mathvista.github.io/) | QA (math: IQTest, FuctionQA, PaperQA) | self-construct + mixture QA pairs (6K) | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | [2310.02255](https://arxiv.org/abs/2310.02255) | | Microsoft | 139 | | [VisIT-Bench](https://visit-bench.github.io/) | QA (general domain) | self-construct (592) | VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use | [2308.06595](https://arxiv.org/abs/2308.06595) | | LAION | 140 | | [SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) | QA (general domain) | self-construct (19K) | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | [2307.16125](https://arxiv.org/abs/2307.16125) | | Tencent | 141 | | [MMBench](https://opencompass.org.cn/leaderboard-multimodal) | QA (general domain) | mixture (2.9K) | 
MMBench: Is Your Multi-modal Model an All-around Player? | [2307.06281](https://arxiv.org/abs/2307.06281) | | Shanghai AI Lab. | 142 | | [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) | QA (general domain) | self-construct (2.1K) | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | [2306.13394](https://arxiv.org/abs/2306.13394) | | XMU | 143 | | [POPE](https://github.com/RUCAIBox/POPE) | General (object hallucination) | | POPE: Polling-based Object Probing Evaluation for Object Hallucination | [2305.10355](https://arxiv.org/abs/2305.10355) | | RUC | 144 | | [DataComp](https://github.com/mlfoundations/datacomp) | Curate I-T Pairs | 12.8M I-T pairs | DataComp: In search of the next generation of multimodal datasets | [2304.14108](https://arxiv.org/abs/2304.14108) | | DataComp.AI | 145 | | [MM-Vet](https://github.com/yuweihao/MM-Vet) | General | [mm-vet.zip](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | | | | 146 | | [INFOSEEK](https://open-vision-language.github.io/infoseek/) | VQA | [OVEN (open domain image)](https://open-vision-language.github.io/oven/) + Human Anno. | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | [2302.11713](https://arxiv.org/abs/2302.11713) | | Google | 147 | | [MultiInstruct](https://github.com/VT-NLP/MultiInstruct) | General INST (Grounded Caption, Text Localization,`
` Referring Expression Selection, Question-Image Matching) | self-construct INSTs (62 * (5+5)) | MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | [2212.10773](https://arxiv.org/abs/2212.10773) | ACL 2023 | Virginia Tech | 148 | | [ScienceQA](https://github.com/lupantech/ScienceQA) | QA (elementary and high school science curricula) | self-construct QA-pairs (21K) | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | [2209.09513](https://arxiv.org/abs/2209.09513) | NeurIPS 2022 | AI2 | 149 | 150 | Evaluation Toolkits 151 | 152 | - [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), VLMEvalKit (the python package name is vlmeval) is an open-source evaluation toolkit of large vision-language models (LVLMs). 153 | 154 | Data Collection Tools 155 | 156 | - [VisionDatasets](https://github.com/kyegomez/VisionDatasets), Scripts and logic to create high quality pre-training and finetuning datasets for multi-modal models! 157 | - [Visual-Instruction-Tuning](https://github.com/BAAI-DCAI/Visual-Instruction-Tuning), Scale up visual instruction tuning to millions by GPT-4. 158 | 159 | ### Tools 160 | 161 | - [EasyOCR](https://github.com/JaidedAI/EasyOCR), Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc. 162 | 163 | ## Image Generation 164 | 165 | ### Reading List 166 | 167 | *Include some insightful works except for LLM* 168 | 169 | | Paper | Base Language Model | Code | Publication | Preprint | Affiliation | 170 | | -------------------------------------------------------------------------------------------------------- | --------------------------------- | ----------------------------------------------------------------------------------------------- | ----------- | ------------------------------------------- | --------------- | 171 | | JPEG-LM: LLMs as Image Generators with Canonical Codec Representations | JPEG-LM (codec-based LM) | | | [2408.08459](https://arxiv.org/abs/2408.08459) | Meta | 172 | | Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | GPT-2,*VQ-GAN* | [VAR](https://github.com/FoundationVision/VAR) | | [2404.02905](https://arxiv.org/abs/2404.02905) | Bytedance | 173 | | InstantID: Zero-shot Identity-Preserving Generation in Seconds | *Unet* | [InstantID](https://github.com/InstantID/InstantID) | | [2401.07519](https://arxiv.org/abs/2401.07519) | Instant | 174 | | VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | LLaMA, IP-Adapter (*Diffusion*) | [VL-GPT](https://github.com/AILab-CVC/VL-GPT) | | | Tencent | 175 | | LLMGA: Multimodal Large Language Model based Generation Assistant | LLaVA,*Unet* | [LLMGA](https://github.com/dvlab-research/LLMGA) | | [2311.16500](https://arxiv.org/abs/2311.16500) | CUHK | 176 | | AnyText: Multilingual Visual Text Generation And Editing | *ControlNet* (OCR) | [AnyText](https://github.com/tyxsspa/AnyText) | | [2311.03054](https://arxiv.org/abs/2311.03054) | Alibaba | 177 | | MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | *Unet* | [MiniGPT-5](https://github.com/eric-ai-lab/MiniGPT-5) | | [2310.02239](https://arxiv.org/abs/2310.02239) | UCSC | 178 | | NExT-GPT: Any-to-Any Multimodal LLM | Vicuna-7B,*Diffusion* | [NExT-GPT](https://github.com/NExT-GPT/NExT-GPT) | | [2309.05519](https://arxiv.org/abs/2309.05519) | NUS | 179 | | Generative Pretraining in Multimodality | 
LLaMA,*Diffusion* | [Emu](https://github.com/baaivision/Emu) | | [2307.05222](https://arxiv.org/abs/2307.05222) | BAAI | 180 | | SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | PaLM2, GPT3.5 | | | [2306.17842](https://arxiv.org/abs/2306.17842) | Google | 181 | | LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | ChatGPT | [LayoutGPT](https://github.com/weixi-feng/LayoutGPT) | | [2305.15393](https://arxiv.org/abs/2305.15393) | UCSB | 182 | | BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing | Blip2,*u-net* | [BLIP-Diffusion](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion) | | [2305.14720](https://arxiv.org/abs/2305.14720) | Salesforce | 183 | | CoDi: Any-to-Any Generation via Composable Diffusion | *Diffusion* | [CoDi](https://github.com/microsoft/i-Code/tree/main/i-Code-V3) | | [2305.11846](https://arxiv.org/abs/2305.11846) | Microsoft & UNC | 184 | | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | ChatGPT,*VAE* | [Accountable Textual Visual Chat](https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat) | | [2303.05983](https://arxiv.org/abs/2303.05983) | CUHK | 185 | | Denoising Diffusion Probabilistic Models | *Diffusion* | [diffusion](https://github.com/hojonathanho/diffusion) | | [2006.11239](https://arxiv.org/abs/2006.11239) | UCB | 186 | 187 | ## Open-source Projects 188 | 189 | - [open-prompts](https://github.com/krea-ai/open-prompts), open-source prompts for text-to-image models. 190 | - [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory), LLaMA2-Accessory is an open-source toolkit for pretraining, finetuning and deployment of Large Language Models (LLMs) and multimodal LLMs 191 | - [Gemini-vs-GPT4V](https://github.com/Qi-Zhangyang/Gemini-vs-GPT4V), This paper presents an in-depth qualitative comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision). 192 | - [multimodal-maestro](https://github.com/roboflow/multimodal-maestro), Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA. 
193 | - *by roboflow, 2023.11* 194 | - [VisCPM](https://github.com/OpenBMB/VisCPM), VisCPM is a family of open-source large multimodal models, which support multimodal conversational capabilities (VisCPM-Chat model) and text-to-image generation capabilities (VisCPM-Paint model) in both Chinese and English 195 | - *by THU, 2023.07* 196 | - model: [VisCPM-Chat](https://github.com/OpenBMB/VisCPM#%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD), [VisCPM-Paint](https://github.com/OpenBMB/VisCPM#%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD) 197 | -------------------------------------------------------------------------------- /Vision/Video.md: -------------------------------------------------------------------------------- 1 | # Video 2 | 3 | *Table of Contents* 4 | 5 | - [Video Understanding](#video-understanding) 6 | - [Reading List](#reading-list) 7 | - [Pretraining Tasks](#pretraining-tasks) 8 | - [Datasets](#datasets) 9 | - [Pretraining Corpora](#pretraining-corpora) 10 | - [Video Instructions](#video-instructions) 11 | - [Benchmarks](#benchmarks) 12 | - [Common Downstream Tasks](#common-downstream-tasks) 13 | - [Advanced Downstream Tasks](#advanced-downstream-tasks) 14 | - [Task-Specific Benchmarks](#task-specific-benchmarks) 15 | - [Multifaceted Benchmarks](#multifaceted-benchmarks) 16 | - [Metrics](#metrics) 17 | - [Projects & Tools](#projects--tools) 18 | - [Video Generation](#video-generation) 19 | - [Reading List](#reading-list-1) 20 | - [Metrics](#metrics-1) 21 | - [Projects](#projects) 22 | 23 | ## Reading List 24 | 25 | **This reading list additionally collect video-language pretraining works before LLM** 26 | 27 | *NOTEs: FT=Finetune, VidL=Video-Language, MM=Multimodal, INST=Instruction* 28 | 29 | | Paper | Base Language Model | Framework | Data | Code | Publication | Preprint | Affiliation | 30 | | ------------------------------------------------------------------------------------------------------------------------- | ------------------------- | -------------------------------- | -------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- | ------------------- | ---------------------------------------------------------------------- | --------------- | 31 | | LongVILA: Scaling Long-Context Visual Language Models for Long Videos | LLaMA3 | INST (LLM-extend) | Mixture | [LongVILA](https://github.com/NVlabs/VILA/blob/main/LongVILA.md) | | [2408.10188](https://arxiv.org/abs/2408.10188) | NVIDIA | 32 | | Long Context Transfer from Language to Vision | Qwen2 | INST (LLM-extend) | llava-next | [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA) | | [2406.16852](https://arxiv.org/abs/2406.16852) | NTU | 33 | | video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Vicuna | INST | Mixture (llava, videochat, ego4d, how2) | [SALMONN](https://github.com/bytedance/SALMONN/) | | [2406.15704](https://arxiv.org/abs/2406.15704) | Bytedance | 34 | | VideoLLM-online: Online Large Language Model for Streaming Video | Llama2/3 | INST (+timestamp) | | [videollm-online](https://github.com/showlab/videollm-online) | | | NUS | 35 | | Streaming Long Video Understanding with Large Language Models | Phi2, Vicuna | INST | Mixture (conceptual caption, howto100m, panda-700m, movieqa, msrvtt, star) | | | | ShanghaiAI Lab | 36 | | LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | Vicuna, Yi | INST | LLaVA+ | [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) | | 
[2404-blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) | Bytedance | 37 | | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Vicuna, Yi | INST | VideoChat2 insts. | [PLLaVA](https://github.com/magic-research/PLLaVA) | | [2404.16994](https://arxiv.org/abs/2404.16994) | Bytedance | 38 | | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Vicuna | FT (memory retrieval) | - (task) | [MA-LMM](https://github.com/boheumd/MA-LMM) | | [2404.05726](https://arxiv.org/abs/2404.05726) | Meta | 39 | | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Llama, Mistral | PT+FT | mixture | [MiniGPT4-video](https://github.com/Vision-CAIR/MiniGPT4-video) | | [2404.03413](https://arxiv.org/abs/2404.03413) | KAUST | 40 | | VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | X+GPT4 | Video Agent | - | [VideoAgent](https://videoagent.github.io/) | | [2403.11481](https://arxiv.org/abs/2403.11481) | BIGAI | 41 | | VideoAgent: Long-form Video Understanding with Large Language Model as Agent | LaViLa + GPT4 | Video Agent (Caption) | - | | | [2403.10517](https://arxiv.org/abs/2403.10517) | Stanford | 42 | | Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding | - | - | | [Video Mamba Suite](https://github.com/OpenGVLab/video-mamba-suite) | | [2403.09626](https://arxiv.org/abs/2403.09626) | Shanghai AI Lab | 43 | | VideoMamba: State Space Model for Efficient Video Understanding | - | - | | [VideoMamba](https://github.com/OpenGVLab/VideoMamba) | | [2403.06977](https://arxiv.org/abs/2403.06977) | Shanghai AI Lab | 44 | | LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs | LLaMA | adapter | mixture | | | [2402.13546](https://arxiv.org/abs/2402.13546) | HIT | 45 | | Video ReCap: Recursive Captioning of Hour-Long Videos | BLIP2, LaVila | Caption + dataset | mixture | [VideoRecap](https://github.com/md-mohaiminul/VideoRecap) | CVPR2024 | [2402.13250](https://arxiv.org/abs/2402.13250) | UNC | 46 | | VideoPrism: A Foundational Visual Encoder for Video Understanding | (PaLM) | PT | mixture | | | [2402.13217](https://arxiv.org/abs/2402.13217) | Google | 47 | | LVCHAT: Facilitating Long Video Comprehension | LLaMA | FT + position interleaved | VideoChat2 | [LVChat](https://github.com/wangyu-ustc/LVChat) | | [2402.12079](https://arxiv.org/abs/2402.12079) | UCSD | 48 | | Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | llama | MMINST+temporal prompt | self-construct(Moment-10M)decode | [Momentor](https://github.com/DCDmllm/Momentor) | | [2402.11435](https://arxiv.org/abs/2402.11435) | ZJU | 49 | | World Model on Million-Length Video And Language With RingAttention | LLaMA2 | PT+FT | mixture | [LWM](https://github.com/LargeWorldModel/LWM) | | [2402.08268](https://arxiv.org/abs/2402.08268) | UCB | 50 | | Memory Consolidation Enables Long-Context Video Understanding | Bert | FT + memory(ViT) | | | | [2402.05861](https://arxiv.org/abs/2402.05861) | DeepMind | 51 | | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | LLaMA2 | PT+FT | mixture | [LaVIT](https://github.com/jy0205/LaVIT) | | [2402.03161](https://arxiv.org/abs/2402.03161) | PKU | 52 | | MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | StableLM, Qwen, Phi2 | MoE | mixture (MM-INST) | 
[MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA) | | [2401.15947](https://arxiv.org/abs/2401.15947) | PKU | 53 | | DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models | X + GPT3.5 | Video Agent | - | [DoraemonGPT](https://github.com/z-x-yang/DoraemonGPT) | | [2401.08392](https://arxiv.org/abs/2401.08392) | ZJU | 54 | | A Simple LLM Framework for Long-Range Video Question-Answering | Cap + GPT4 | Video Agent (Caption) | - | [LLoVi](https://github.com/CeeZh/LLoVi) | | [2312.17235](https://arxiv.org/abs/2312.17235) | UNC | 55 | | Text-Conditioned Resampler For Long Form Video Understanding | BLIP2 | FT Resampler (blip2 on video) | - | | | [2312.11897](https://arxiv.org/abs/2312.11897) | Google | 56 | | Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | LLaVA | FT+Recur. Qformer | VideoChatGPT | | | [2312.08870](https://arxiv.org/abs/2312.08870) | ByteDance | 57 | | A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames | PaLI, Bard | PT (vivit+adapter) | mixture | | | [2312.07395](https://arxiv.org/pdf/2312.07395) | Google | 58 | | LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos | LLaVA, GPT3.5 | Video Agent (Caption) | - | | | [2312.05269](https://arxiv.org/abs/2312.05269) | NYU | 59 | | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | LLaMA2 | MM-INST | mixture (additional Transcribed Speeck) | [TimeChat](https://github.com/RenShuhuai-Andy/TimeChat) | CVPR2024 | [2312.02051](https://arxiv.org/abs/2312.02051) | PKU | 60 | | Zero-Shot Video Question Answering with Procedural Programs | GPT+X | Video Agent | - | | | [2312.00937](https://arxiv.org/abs/2312.00937) | CMU | 61 | | VTimeLLM: Empower LLM to Grasp Video Moments | Vicuna | INST+temporal | mixture | [VTimeLLM](https://github.com/huangb23/VTimeLLM) | CVPR2024 | [2311.18445](https://arxiv.org/abs/2311.18445) | THU | 62 | | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Vicuna | MM-INST | self-construct | [LLaMA-VID](https://github.com/dvlab-research/LLaMA-VID) | | [2311.17043](https://arxiv.org/abs/2311.17043) | CUHK | 63 | | Vamos: Versatile Action Models for Video Understanding | GPT4, X | Video Agent (Caption) | - | [Vamos](https://brown-palm.github.io/Vamos/) | | [2311.13627](https://arxiv.org/abs/2311.13627) | Brown | 64 | | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Vicuna 1.5 | PT+FT | mixture | [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA) | | [2311.10122](https://arxiv.org/abs/2311.10122) | PKU | 65 | | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Vicuna | PT+FT | mixture | [Chat-UniVi](https://github.com/PKU-YuanGroup/Chat-UniVi) | | [2311.08046](https://arxiv.org/abs/2311.08046) | PKU | 66 | | UniVTG: Towards Unified Video-Language Temporal Grounding | CLIP | PT | mixture | [UniVTG](https://github.com/showlab/UniVTG) | ICCV 2023 | [2307.16715](https://arxiv.org/abs/2307.16715) | NTU | 67 | | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Vicuna | FT | | [MovieChat](https://github.com/rese1f/MovieChat) | CVPR2024 | [2307.16449](https://arxiv.org/abs/2307.16449) | Microsoft | 68 | | Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | DeBerta | FT, RTr-Augmented | | | | [2306.11732](https://arxiv.org/abs/2306.11732) | CUHK | 69 | | Macaw-LLM: Multi-Modal Language Modeling 
with Image, Video, Audio, and Text Integration | LLaMA | | | [Macaw-LLM](https://github.com/lyuchenyang/Macaw-LLM) | | [2306.09093](https://arxiv.org/abs/2306.09093) | Tencent | 70 | | Valley: Video assistant with large language model enhanced ability | Vicuna | PT, FT + MM-INST | [mixture](https://github.com/RupertLuo/Valley#train-valley-step-by-step) | [Valley](https://github.com/RupertLuo/Valley) | | [2306.07207](https://arxiv.org/abs/2306.07207) | ByteDance | 71 | | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Vicuna | | | [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) | | [2306.05424](https://arxiv.org/abs/2306.05424) | MBZUAI | 72 | | Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | LLaMA | | | [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA) | | [2306.02858](https://arxiv.org/abs/2306.02858) | Alibaba | 73 | | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Bert | PT | mixture (audio,video,image) | [VAST](https://github.com/TXH-mercury/VAST) | Neurips2023 | [2305.18500](https://arxiv.org/abs/2305.18500) | CAS | 74 | | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Vicuna-13B | FT + MM-INST | | [ChatBridge](https://github.com/joez17/ChatBridge) | | [2305.16103](https://arxiv.org/abs/2305.16103) | CAS | 75 | | Self-Chained Image-Language Model for Video Localization and Question Answering | BLIP2 | 2-stage: localizer(LM) + answer | [QVHighlights](https://github.com/jayleicn/moment_detr), FT VidL | [SeViLA](https://github.com/Yui010206/SeViLA) | Neurips2023 | [2305.06988](https://arxiv.org/abs/2305.06988) | UNC | 76 | | VideoChat: Chat-Centric Video Understanding | Blip2 | | | [VideoChat](https://github.com/OpenGVLab/Ask-Anything) | | [2305.06355](https://arxiv.org/abs/2305.06355) | Shanghai AI Lab | 77 | | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | | | [X-LLM](https://github.com/phellonchen/X-LLM) | | [2305.04160](https://arxiv.org/abs/2305.04160) | CAS | 78 | | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Bert | | | [VALOR](https://github.com/TXH-mercury/VALOR) | | [2304.08345](https://arxiv.org/abs/2304.08345) | CAS | 79 | | Verbs in Action: Improving verb understanding in video-language models | PaLM | | | | | [2304.06708](https://arxiv.org/abs/2304.06708) | Google | 80 | | Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions | ChatGPT, Flan-T5 (BLIP2) | | | [ChatCaptioner](https://github.com/Vision-CAIR/ChatCaptioner) | | [2304.04227](https://arxiv.org/abs/2304.04227) | KAUST | 81 | | Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | GPT2, GPT-Neo, GPT3 | | | | CVPR2023 workshop | [2304.03754](https://arxiv.org/abs/2304.03754) | Columbia Univ. 
| 82 | | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Bert | PT | mixture | [Unmasked Teacher](https://github.com/OpenGVLab/unmasked_teacher) | ICCV 2023 | [2303.16058](https://arxiv.org/abs/2303.16058) | Shanghai AI Lab | 83 | | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | T5 | | | [Vid2Seq](https://github.com/google-research/scenic/tree/main/scenic/projects/vid2seq) | | [2302.14115](https://arxiv.org/abs/2302.14115) | Google | 84 | | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Bert | | | | | [2212.14546](https://arxiv.org/abs/2212.14546) | Alibaba | 85 | | VindLU: A Recipe for Effective Video-and-Language Pretraining | Bert | | | [VindLU](https://github.com/klauscc/VindLU) | | [2212.05051](https://arxiv.org/abs/2212.05051) | UNC | 86 | | Learning Video Representations from Large Language Models | GPT2 | PT (data-augment) | Ego4D/HowTo100M | [LaViLa](https://github.com/facebookresearch/LaViLa) | CVPR2023 | [2212.04501](https://arxiv.org/abs/2212.04501) | Meta | 87 | | SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Bert | | | | | [2211.11446](https://arxiv.org/abs/2211.11446) | UW | 88 | | CLOP: Video-and-Language Pre-Training with Knowledge Regularizations | Roberta | | | | MM 2022 | [2211.03314](https://arxiv.org/abs/2211.03314) | Baidu | 89 | | Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Bert | | | | NIPS 2022 | [2210.06031](https://arxiv.org/abs/2210.06031) | Microsoft | 90 | | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Bert | | | | NIPS 2022 | [2209.07526](https://arxiv.org/abs/2209.07526) | Microsoft | 91 | | Clover: Towards A Unified Video-Language Alignment and Fusion Model | Bert | | | [Clover](https://github.com/LeeYN-43/Clover) | | [2207.07885](https://arxiv.org/abs/2207.07885) | Bytedance | 92 | | LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Bert-like | | | [LAVENDER](https://github.com/microsoft/LAVENDER) | CVPR 2023 | [2206.07160](https://arxiv.org/abs/2206.07160) | Microsoft | 93 | | Revealing Single Frame Bias for Video-and-Language Learning | Bert | | | [Singularity](https://github.com/jayleicn/singularity) | | [2206.03428](https://arxiv.org/abs/2206.03428) | UNC | 94 | | Label-Efficient Online Continual Object Detection in Streaming Video | - | | (continual) | [Efficient-CLS](https://github.com/showlab/Efficient-CLS) | ICCV 2023 | [2206.00309](https://arxiv.org/abs/2206.00309) | NUS | 95 | | Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | | | [Flamingo](https://github.com/lucidrains/flamingo-pytorch) | NIPS 2022 | [2204.14198](https://arxiv.org/abs/2204.14198) | DeepMind | 96 | | All in One: Exploring Unified Video-Language Pre-training | Bert-like | | | [All-In-One](https://github.com/showlab/all-in-one) | CVPR 2023 | [2203.07303](https://arxiv.org/abs/2203.07303) | NUS | 97 | | End-to-end Generative Pretraining for Multimodal Video Captioning | Bert+GPT2 | | | | CVPR 2022 | [2201.08264](https://arxiv.org/abs/2201.08264) | Google | 98 | | Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Bert-like | | | [ALPRO](https://github.com/salesforce/ALPRO) | CVPR 2022 | [2112.09583](https://arxiv.org/abs/2112.09583) | Salesforce | 99 | | VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling,[V2](https://arxiv.org/pdf/2209.01540.pdf) | Bert | | | 
[VIOLET](https://github.com/tsujuifu/pytorch_violet) | | [2111.12681](https://arxiv.org/abs/2111.12681) | Microsoft | 100 | | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | Bert | | | [VideoCLIP](https://github.com/facebookresearch/fairseq/tree/main/examples/MMPT) | EMNLP 2021 | [2109.14084](https://arxiv.org/abs/2109.14084) | Facebook | 101 | | MERLOT: Multimodal Neural Script Knowledge Models,[V2](https://arxiv.org/abs/2201.02639) | Roberta | | | [MERLOT](https://github.com/rowanz/merlot) | NIPS 2021 | [2106.02636](https://arxiv.org/abs/2106.02636) | AI2 | 102 | | VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | Bert | | | [VLP](https://github.com/facebookresearch/fairseq/blob/main/examples/MMPT/README.md) | ACL Findings 2021 | [2105.09996](https://arxiv.org/abs/2105.09996) | Facebook | 103 | | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Bert-like | | | | NIPS 2021 | [2104.11178](https://arxiv.org/abs/2104.11178) | Google | 104 | | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Bert-like | | | [CLIP4Clip](https://github.com/ArrowLuo/CLIP4Clip) | Neurocomputing 2022 | [2104.08860](https://arxiv.org/abs/2104.08860) | Microsoft | 105 | | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Bert | | | [Frozen-in-Time](https://github.com/m-bain/frozen-in-time) | ICCV 2021 | [2104.00650](https://arxiv.org/abs/2104.00650) | Oxford | 106 | | Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Bert | | | [ClipBert](https://github.com/jayleicn/ClipBERT) | CVPR 2021 | [2102.06183](https://arxiv.org/abs/2102.06183) | Microsoft | 107 | | ActBERT: Learning Global-Local Video-Text Representations | Bert | | | [ActBert](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/en/model_zoo/multimodal/actbert.md) | CVPR 2020 | [2011.07231](https://arxiv.org/abs/2011.07231) | Baidu | 108 | | Video Understanding as Machine Translation | T5 | | | | | [2006.07203](https://arxiv.org/abs/2006.07203) | Facebook | 109 | | HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Bert | | | [HERO](https://github.com/linjieli222/HERO) | EMNLP 2020 | [2005.00200](https://arxiv.org/abs/2005.00200) | Microsoft | 110 | | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Bert | | | [UniVL](https://github.com/microsoft/UniVL) | | [2002.06353](https://arxiv.org/abs/2002.06353) | Microsoft | 111 | | Learning Video Representations using Contrastive Bidirectional Transformer | Bert | | | | | [1906.05743](https://arxiv.org/abs/1906.05743) | Google | 112 | | VideoBERT: A Joint Model for Video and Language Representation Learning | Bert | | | [VideoBert (non-official)](https://github.com/ammesatyajit/VideoBERT) | ICCV 2019 | [1904.01766](https://arxiv.org/abs/1904.01766) | Google | 113 | 114 | ## Pretraining Tasks 115 | 116 | *Commonly Used Pretraining Tasks* 117 | 118 | - Masked Language Modeling (MLM) 119 | - Causal Language Modeling (LM) 120 | - Masked Vision Modeling (MVM) 121 | - Vision = Frame 122 | - Vision = Patch 123 | - Vision = Object 124 | - Video Language Matching (VLM) 125 | - Video Language Contrastive (VLC) 126 | 127 | ## Datasets 128 | 129 | ### Pretraining Corpora 130 | 131 | | Paper | Video Clips | Duration | Sentences | Domain | Download Link | 132 | | 
--------------------------------------------------------------------------------------------------------- | ------------ | -------- | --------- | --------------------- | --------------------------------------------------------------------------- | 133 | | (❗NOT AVAILABLE, 23 Feb 2024) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2.5M (2M) | 18s | 2.5M | open (web) | [WebVid-2M](https://github.com/m-bain/webvid), WebVid-10M | 134 | | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 136M (1.2M) | 4s | 136M | instruction (YouTube) | [HowTo100M](https://www.di.ens.fr/willow/research/howto100m/) | 135 | | MERLOT: Multimodal Neural Script Knowledge Models | 180M (6M) | -20m | ~720M | open (YouTube) | [YT-Temporal-180M](https://rowanzellers.com/merlot/) | 136 | | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | 100M (3.3M) | 13.4s | 100M | open (YouTube) | [HD-VILA-100M](https://github.com/microsoft/XPretrain/tree/main/hd-vila-100m) | 137 | | Learning audio-video modalities from image captions | 10.3M (6.3M) | 10s | | open (web) | [VideoCC](https://github.com/google-research-datasets/videoCC-data) | 138 | | CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos | 18M | 60s | | open (YouTube) | [YTD-18M](https://seungjuhan.me/champagne/) | 139 | | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 10M | 54.2s | 10M | open (YOUKU) | [Youku-mPLUG](https://github.com/X-PLUG/Youku-mPLUG#download) | 140 | | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 234M (7.1M) | 11.7s | 234M | open (YouTube) | [InternVid](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) | 141 | | Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | 70M | 8.5s | 70M | open | [Panda-70M](https://github.com/snap-research/Panda-70M) `from HD-VILA-100M` | 142 | 143 | ### Video Instructions | 145 | | Dataset | Statistics | Source | 146 | | ----------------------------------------------------------------------------------- | -------------------- | --------------------------------------------- | 147 | | [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/data/README.md) | 100k INST/10k videos | ActivityNet + (Human+GPT) Annotation | 148 | | [Valley](https://github.com/RupertLuo/Valley) | 65k INST/100k videos | (VATEX + JukinMedia) + (Human+GPT) Annotation | 149 | | [VideoChat](https://github.com/OpenGVLab/Ask-Anything) | 11k INST/11k videos | WebVid + GPT Annotation | 150 | | [TimeIT](https://github.com/RenShuhuai-Andy/TimeChat?tab=readme-ov-file#data) | 125k INST | Mixture + GPT Annotation | 151 | 152 | ### Others 153 | 154 | - [Neurips23 D&B] [VidChapters](https://github.com/antoyang/VidChapters), a large-scale dataset of user-chaptered videos. We study three tasks on top of this dataset and show that video chapter generation models trained on VidChapters-7M transfer well to dense video captioning. 
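
*In practice, most of the corpora and instruction sets above are consumed at training time as (sampled frames, text) pairs. Below is a minimal loading sketch of that step; it assumes a flat JSON annotation file, and the field names (`video`, `question`, `answer`) are illustrative placeholders rather than the exact schema of any particular release, so check each dataset's README before reusing it.*

```python
# Minimal sketch (not tied to any specific dataset above): pair uniformly
# sampled RGB frames with an instruction/response string for training or eval.
# The annotation schema ("video", "question", "answer") is a placeholder.
import json
from pathlib import Path

import cv2  # pip install opencv-python
import numpy as np
from torch.utils.data import Dataset


def sample_frames(video_path: str, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample RGB frames from a video as a (T, H, W, 3) uint8 array."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).round().astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames) if frames else np.zeros((0, 0, 0, 3), dtype=np.uint8)


class VideoInstructionDataset(Dataset):
    """Loads (frames, instruction, response) triples from a JSON annotation list."""

    def __init__(self, annotation_json: str, video_dir: str, num_frames: int = 8):
        self.samples = json.loads(Path(annotation_json).read_text())
        self.video_dir = Path(video_dir)
        self.num_frames = num_frames

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, i: int) -> dict:
        record = self.samples[i]
        frames = sample_frames(str(self.video_dir / record["video"]), self.num_frames)
        return {
            "frames": frames,  # (T, H, W, 3) uint8
            "instruction": record["question"],
            "response": record["answer"],
        }
```

Most of the instruction-tuned models in the reading list consume a roughly similar interface (a handful of uniformly sampled frames plus a text turn); what varies is mainly the number of frames, the visual tokenizer, and whether timestamps are injected into the prompt.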
155 | 156 | ## Benchmarks 157 | 158 | ### Common Downstream Tasks 159 | 160 | | **Task** | Paper | Download Link | Publication | 161 | | -------------- | ------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ----------- | 162 | | Retrieval | Collecting Highly Parallel Data for Paraphrase Evaluation | [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) | ACL 2011 | 163 | | Retrieval | A Dataset for Movie Description | [LSMDC](https://sites.google.com/site/describingmovies/download) | CVPR 2015 | 164 | | Retrieval | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | [MSR-VTT](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip) | CVPR 2016 | 165 | | Retrieval | Localizing Moments in Video with Natural Language | [DiDeMo](https://github.com/LisaAnne/LocalizingMoments) | ICCV 2017 | 166 | | Retrieval | Dense-Captioning Events in Videos | [ActivityNet Caption](https://cs.stanford.edu/people/ranjaykrishna/densevid/) | ICCV 2017 | 167 | | Retrieval | Towards Automatic Learning of Procedures from Web Instructional Videos | [YouCook2](http://youcook2.eecs.umich.edu/download) | AAAI 2018 | 168 | | OE QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | [TGIF-Frame](https://github.com/YunseokJANG/tgif-qa/tree/cvpr2017/dataset) | CVPR 2017 | 169 | | OE QA | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | [LSMDC-FiB](https://github.com/yj-yu/lsmdc) | CVPR 2017 | 170 | | OE QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | [MSRVTT-QA](https://github.com/xudejing/video-question-answering),[MSVD-QA](https://github.com/xudejing/video-question-answering) | MM 2017 | 171 | | OE QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) | AAAI 2019 | 172 | | MC QA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | [LSMDC-MC](https://github.com/yj-yu/lsmdc) | | 173 | | MC QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | [TGIF-Action, TGIF-Transition](https://github.com/YunseokJANG/tgif-qa/tree/cvpr2017/dataset) | CVPR 2017 | 174 | | MC QA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | [MSRVTT-MC](https://github.com/yj-yu/lsmdc) | ECCV 2018 | 175 | | Caption | Collecting Highly Parallel Data for Paraphrase Evaluation | [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) | ACL 2011 | 176 | | Caption | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | [MSR-VTT](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip) | CVPR 2016 | 177 | | Caption | VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | [VATEX](https://eric-xw.github.io/vatex-website/explore.html) | ICCV 2019 | 178 | | Dense Caption | Dense-Captioning Events in Videos | [ActivityNet Caption](https://cs.stanford.edu/people/ranjaykrishna/densevid/) | ICCV 2017 | 179 | | Dense Caption | Towards Automatic Learning of Procedures from Web Instructional Videos | [YouCook2](http://youcook2.eecs.umich.edu/download) | AAAI 2018 | 180 | | Dense Caption | Multimodal Pretraining for Dense Video Captioning | 
[ViTT](https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT) | AACL 2020 | 181 | | Action | HMDB: A large video database for human motion recognition | [HMDB](https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/)-51 | ICCV 2021 | 182 | | Action | UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild | [UCF](https://www.crcv.ucf.edu/data/UCF101.php)-101 | ICCV 2013 | 183 | | Action | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | [ActivityNet](http://activity-net.org/about.html)-200 | CVPR 2015 | 184 | | Action | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | [Charades](https://prior.allenai.org/projects/charades)-157 | ECCV 2016 | 185 | | Action | The Kinetics Human Action Video Dataset | [Kinetics](https://github.com/cvdfoundation/kinetics-dataset)-400/600/700 | | 186 | 187 | ### Advanced Downstream Tasks 188 | 189 | #### Task-Specific Benchmarks 190 | 191 | | paper | task | duration | domain | link | publication | 192 | | -------------------------------------------------------------------------------------------------------------- | ------------------------ | ----------- | --------------- | -------------------------------------------------------------------------------------------- | ----------- | 193 | | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Video QA | ~8m | movie | [MovieChat](https://github.com/rese1f/MovieChat) | CVPR 2024 | 194 | | MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Video QA | ~8m | movie | [MoVQA](https://movqa.github.io) | | 195 | | EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Video QA | ~3m | open (ego) | [EgoSchema](https://github.com/egoschema/EgoSchema) | | 196 | | ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models | Temporal Grounding | ~60s (mix.) | open | [ViLMA](https://github.com/ilkerkesen/ViLMA) | ICLR 2024 | 197 | | From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Video QA | 9s | open | [Causal-VidQA](https://github.com/bcmi/Causal-VidQA) | CVPR 2022 | 198 | | VIOLIN: A Large-Scale Dataset for Video-and-Language Inference | Video Language Inference | 35.2s | movie | [VIOLIN](https://github.com/jimmy646/violin) | CVPR 2020 | 199 | | TVQA: Localized, Compositional Video Question Answering | Video QA | 60-90s | movie | [TVQA](https://tvqa.cs.unc.edu/) | EMNLP 2018 | 200 | | AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Video QA | 30s | open | [AGQA](https://cs.stanford.edu/people/ranjaykrishna/agqa/) | CVPR 2021 | 201 | | NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Video QA | 44s | open | [NExT-QA-MC](https://github.com/doc-doc/NExT-QA), [NExT-QA-OE](https://github.com/doc-doc/NExT-OE) | CVPR 2021 | 202 | | Towards Long-Form Video Understanding | Classification | 1-3m | movie | [LVU](https://github.com/chaoyuaw/lvu) | CVPR 2021 | 203 | | STAR: A Benchmark for Situated Reasoning in Real-World Videos | Video QA | 12s | open | [Star](https://github.com/csbobby/STAR_Benchmark) | NIPS 2021 | 204 | | Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments | Video QA | 20s | virtual env. 
| [Env-QA](https://envqa.github.io/) | ICCV 2021 | 205 | | COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis | Localization/Action Seg. | 3.36m | open (instruct) | [COIN](https://coin-dataset.github.io/) | CVPR 2019 | 206 | | Cross-task weakly supervised learning from instructional videos | Localization | 4m57s | open (instruct) | [CrossTask](https://github.com/DmZhukov/CrossTask) | CVPR 2019 | 207 | | Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Video QA | 60s | open | [Social-IQ](https://www.thesocialiq.com/) | CVPR 2019 | 208 | 209 | #### Multifaceted Benchmarks 210 | 211 | | Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation | 212 | | -------------------------------------------------------- | ------------------- | ------- | ----- | -------- | ----------- | ----------- | 213 | | [Video-Bench](https://github.com/PKU-YuanGroup/Video-Bench) | MC (general domain) | mixture | | | | | 214 | 215 | ## Metrics 216 | 217 | - [Common Metrics on Video Quality](https://github.com/JunyaoHu/common_metrics_on_video_quality), You can easily calculate FVD, PSNR, SSIM, LPIPS for evaluating the quality of generated or predicted videos. 218 | 219 | ## Projects & Tools 220 | 221 | projects 222 | 223 | - `(Video Agent)` [VLog](https://github.com/showlab/VLog), Transform Video as a Document with ChatGPT, CLIP, BLIP2, GRIT, Whisper, LangChain. 224 | 225 | tools 226 | 227 | - [VideoDB](https://github.com/video-db/StreamRAG), It enables developers to: 1) Upload multiple videos to create a library or collection; 2) Search across these videos and get real-time video responses or compilations; 3) Publish your searchable collection on the ChatGPT store; 4) Receive summarized text answers (RAG); 5) Gain key insights from specific videos (e.g. "Top points from episode 31"). 228 | - [video2dataset](https://github.com/bryant1410/video2dataset), Easily create large video dataset from video urls. Can download and package 10M videos in 12h on a single 16 core machine. 229 | - [Match cutting](https://github.com/Netflix/matchcut), A match cut is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next 230 | - [Awesome-Video-Object-Segmentation](https://github.com/gaomingqi/Awesome-Video-Object-Segmentation), A curated list of video object segmentation (vos) papers, datasets, and projects. 231 | - [pytube](https://github.com/pytube/pytube), A lightweight, dependency-free Python library (and command-line utility) for downloading YouTube Videos. 232 | - [movienet-tools](https://github.com/movienet/movienet-tools/blob/master/docs/GETTING_STARTED.md), Movie toolbox provides many basic tools and functions for the researches on movie understanding, with which you can get started with your research easily. 
233 | - [PySceneDetect](https://github.com/Breakthrough/PySceneDetect), Video Scene Cut Detection and Analysis Tool 234 | 235 | # Video Generation 236 | 237 | ## Reading List 238 | 239 | Survey 240 | 241 | - (2023-10) A Survey on Video Diffusion Models [paper](https://arxiv.org/abs/2310.10647) [repo](https://github.com/ChenHsing/Awesome-Video-Diffusion-Models) 242 | 243 | Reading List 244 | 245 | | Paper | Base Structure | Data | Code | Publication | Preprint | Affiliation | 246 | | ----------------------------------------------------------------------------------- | ---------------------- | -------------- | -------------------------------------------------------------- | ----------- | --------------------------------------------------------------------------------- | --------------- | 247 | | Video generation models as world simulators | Transformer | - | - | - | [2402.blog](https://openai.com/research/video-generation-models-as-world-simulators) | OpenAI | 248 | | Vlogger: Make Your Dream A Vlog | *Diffusion* | | [Vlogger](https://github.com/zhuangshaobin/Vlogger) | | [2401.09414](https://arxiv.org/abs/2401.09414) | Shanghai AI Lab | 249 | | FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis | *ControlNet* | | [FlowVid](https://github.com/Jeff-LiangF/FlowVid) | | [2312.17681](https://arxiv.org/abs/2312.17681) | Meta | 250 | | SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction | *Diffusion* | | [SEINE](https://github.com/Vchitect/SEINE) | | [2310.20700](https://arxiv.org/abs/2310.20700) | Shanghai AI Lab | 251 | | MotionDirector: Motion Customization of Text-to-Video Diffusion Models | *Diffusion* | | [MotionDirector](https://github.com/showlab/MotionDirector) | | [2310.08465](https://arxiv.org/abs/2310.08465) | NUS | 252 | | VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | GPT4 +*UNet* | | [VideoDirectorGPT](https://github.com/HL-hanlin/VideoDirectorGPT) | | [2309.15091](https://arxiv.org/abs/2309.15091) | UNC | 253 | | CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | Transformer +*VQVAE* | self-construct | [CogVideo](https://github.com/THUDM/CogVideo) | | [2205.15868](https://arxiv.org/abs/2205.15868) | THU | 254 | 255 | ## Metrics 256 | 257 | - [T2VScore](https://github.com/showlab/T2VScore), T2VScore: Towards A Better Metric for Text-to-Video Generation 258 | 259 | ## Projects 260 | 261 | - [Open Sora](https://github.com/hpcaitech/Open-Sora), Open-Sora: Democratizing Efficient Video Production for All 262 | - [Open Chat Video Editor](https://github.com/SCUTlihaoyu/open-chat-video-editor), Open source short video automatic generation tool 263 | -------------------------------------------------------------------------------- /Vision/VisionEncoder.md: -------------------------------------------------------------------------------- 1 | # Vision Encoder 2 | 3 | commonly used vision encoder 4 | 5 | Table of contents 6 | - [Image Encoder](#image-encoder) 7 | - [Video Encoder](#video-encoder) 8 | - [Encoder Analysis](#encoder-analysis) 9 | 10 | ## Image Encoder 11 | | Paper | Framework | Data | Code | Publication | Preprint | Affiliation | 12 | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------- | ---------------------------------------- | 
------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------- | 13 | | DINOv2: Learning Robust Visual Features without Supervision | DINO+iBOT | 142M Images | [DINOv2](https://github.com/facebookresearch/dinov2) | TMLR | [2304.07193](https://arxiv.org/abs/2304.07193) | Meta | 14 | | Sigmoid Loss for Language Image Pre-Training | Contrastive (sigmoid) | 900M Image-text Pair | [SigLIP](https://github.com/google-research/big_vision) | ICCV 2023 | [2303.15343](https://arxiv.org/abs/2303.15343) | Google | 15 | | Learning Transferable Visual Models From Natural Language Supervision | Contrastive (softmax) | 400M Image-text Pair | [CLIP](https://github.com/openai/CLIP) | ICML 2021 | [2103.00020](https://arxiv.org/abs/2103.00020) | OpenAI | 16 | 17 | 18 | 19 | ## Video Encoder 20 | 21 | 22 | 23 | ## Encoder Analysis 24 | 25 | - [2024-06] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [arxiv](https://arxiv.org/abs/2406.16860) | comparison of different image encoders for multimodal LLMs -------------------------------------------------------------------------------- /assets/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/patrick-tssn/Awesome-Colorful-LLM/40d4f0d86e7f8cf8398ef871db62bd45f0774948/assets/logo.png --------------------------------------------------------------------------------