# Multimodal & Large Language Models

**Note:** This paper list only records papers I read from the daily arXiv digest for my personal needs. I only subscribe to and cover the following subjects: Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), and Machine Learning (cs.LG). If you find that I missed some important and exciting work, it would be super helpful to let me know. Thanks!

**Update:** Starting from June 2024, I'm focusing on reading and recording papers that I believe offer unique insights and substantial contributions to the field.


## Table of Contents

- [Survey](#survey)
- [Position Paper](#position-paper)
- [Structure](#structure)
- [Planning](#planning)
- [Reasoning](#reasoning)
- [Generation](#generation)
- [Representation Learning](#representation-learning)
- [LLM Analysis](#llm-analysis)
- [LLM Safety](#llm-safety)
- [LLM Evaluation](#llm-evaluation)
- [LLM Reasoning](#llm-reasoning)
- [LLM Application](#llm-application)
- [LLM with Memory](#llm-with-memory)
- [LLM with Human](#llm-with-human)
- [Inference-time Scaling (via RL)](#inference-time-scaling-via-rl)
- [Long-Context LLM](#long-context-llm)
- [LLM Foundation](#llm-foundation)
- [Scaling Law](#scaling-law)
- [LLM Data Engineering](#llm-data-engineering)
- [VLM Data Engineering](#vlm-data-engineering)
- [Alignment](#alignment)
- [Scalable Oversight&SuperAlignment](#scalable-oversightsuperalignment)
- [RL Foundation](#rl-foundation)
- [Beyond Bandit](#beyond-bandit)
- [Agent](#agent)
- [DeepResearch](#deepresearch)
- [SWE-Agent](#swe-agent)
- [Evolution](#evolution)
- [Interaction](#interaction)
- [Critique Modeling](#critic-modeling)
- [MoE/Specialized](#moespecialized)
- [Vision-Language Foundation Model](#vision-language-foundation-model)
- [Vision-Language Model Analysis & Evaluation](#vision-language-model-analysis--evaluation)
- [Vision-Language Model Application](#vision-language-model-application)
- [Multimodal Foundation Model](#multimodal-foundation-model)
- [Image Generation](#image-generation)
- [Diffusion](#diffusion)
- [Document Understanding](#document-understanding)
- [Tool Learning](#tool-learning)
- [Instruction Tuning](#instruction-tuning)
- [In-context Learning](#incontext-learning)
- [Learning from Feedback](#learning-from-feedback)
- [Reward Modeling](#reward-modeling)
- [Video Foundation Model](#video-foundation-model)
- [Key Frame Detection](#key-frame-detection)
- [Pretraining](#pretraining)
- [Vision Model](#vision-model)
- [Adaptation of Foundation Model](#adaptation-of-foundation-model)
- [Prompting](#prompting)
- [Efficiency](#efficiency)
- [Analysis](#analysis)
- [Grounding](#grounding)
- [VQA Task](#vqa-task)
- [VQA Dataset](#vqa-dataset)
- [Social Good](#social-good)
- [Application](#application)
- [Benchmark & Evaluation](#benchmark--evaluation)
- [Dataset](#dataset)
- [Robustness](#robustness)
- [Hallucination&Factuality](#hallucinationfactuality)
- [Cognitive NeuronScience & Machine Learning](#cognitive-neuronscience--machine-learning)
- [Theory of Mind](#theory-of-mind)
- [Cognitive NeuronScience](#cognitive-neuronscience)
- [World Model](#world-model)
- [Resource](#resource)

## Survey

- **Multimodal Learning with Transformers: A Survey;** Peng Xu, Xiatian Zhu, David A. Clifton
- **Multimodal Machine Learning: A Survey and Taxonomy;** Tadas Baltrusaitis, Chaitanya Ahuja, Louis-Philippe Morency; Introduces five challenges for multimodal learning: representation, translation, alignment, fusion, and co-learning.
- **FOUNDATIONS & RECENT TRENDS IN MULTIMODAL MACHINE LEARNING: PRINCIPLES, CHALLENGES, & OPEN QUESTIONS;** Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
- **Multimodal research in vision and language: A review of current and emerging trends;** Shagun Uppal et al
- **Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods;** Aditya Mogadala et al
- **Challenges and Prospects in Vision and Language Research;** Kushal Kafle et al
- **A Survey of Current Datasets for Vision and Language Research;** Francis Ferraro et al
- **VLP: A Survey on Vision-Language Pre-training;** Feilong Chen et al
- **A Survey on Multimodal Disinformation Detection;** Firoj Alam et al
- **Vision-Language Pre-training: Basics, Recent Advances, and Future Trends;** Zhe Gan et al
- **Deep Multimodal Representation Learning: A Survey;** Wenzhong Guo et al
- **The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges;** Maria Lymperaiou et al
- **Augmented Language Models: a Survey;** Grégoire Mialon et al
- **Multimodal Deep Learning;** Matthias Aßenmacher et al
- **Sparks of Artificial General Intelligence: Early experiments with GPT-4;** Sebastien Bubeck et al
- **Retrieving Multimodal Information for Augmented Generation: A Survey;** Ruochen Zhao et al
- **Is Prompt All You Need? No. A Comprehensive and Broader View of Instruction Learning;** Renze Lou et al
- **A Survey of Large Language Models;** Wayne Xin Zhao et al
- **Tool Learning with Foundation Models;** Yujia Qin et al
- **A Cookbook of Self-Supervised Learning;** Randall Balestriero et al
- **Foundation Models for Decision Making: Problems, Methods, and Opportunities;** Sherry Yang et al
- **Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation;** Patrick Fernandes et al
- **Reasoning with Language Model Prompting: A Survey;** Shuofei Qiao et al
- **Towards Reasoning in Large Language Models: A Survey;** Jie Huang et al
- **Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models;** Chen Ling et al
- **Unifying Large Language Models and Knowledge Graphs: A Roadmap;** Shirui Pan et al
- **Interactive Natural Language Processing;** Zekun Wang et al
- **A Survey on Multimodal Large Language Models;** Shukang Yin et al
- **TRUSTWORTHY LLMS: A SURVEY AND GUIDELINE FOR EVALUATING LARGE LANGUAGE MODELS’ ALIGNMENT;** Yang Liu et al
- **Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback;** Stephen Casper et al
- **Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies;** Liangming Pan et al
- **Challenges and Applications of Large Language Models;** Jean Kaddour et al
- **Aligning Large Language Models with Human: A Survey;** Yufei Wang et al
- **Instruction Tuning for Large Language Models: A Survey;** Shengyu Zhang et al
- **From Instructions to Intrinsic Human Values —— A Survey of Alignment Goals for Big Models;** Jing Yao et al
- **A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation;** Xiaowei Huang et al
- **Explainability for Large Language Models: A Survey;** Haiyan Zhao et al
- **Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models;** Yue Zhang et al
- **Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity;** Cunxiang Wang et al
- **ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?;** Hailin Chen et al
- **Vision-Language Instruction Tuning: A Review and Analysis;** Chen Li et al
- **The Mystery and Fascination of LLMs: A Comprehensive Survey on the Interpretation and Analysis of Emergent Abilities;** Yuxiang Zhou et al
- **Efficient Large Language Models: A Survey;** Zhongwei Wan et al
- **The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision);** Zhengyuan Yang et al
- **Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents;** Zhuosheng Zhang et al
- **Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis;** Yafei Hu et al
- **Multimodal Foundation Models: From Specialists to General-Purpose Assistants;** Chunyuan Li et al
- **A Survey on Large Language Model based Autonomous Agents;** Lei Wang et al
- **Video Understanding with Large Language Models: A Survey;** Yunlong Tang et al
- **A Survey of Preference-Based Reinforcement Learning Methods;** Christian Wirth et al
- **AI Alignment: A Comprehensive Survey;** Jiaming Ji et al
- **A SURVEY OF REINFORCEMENT LEARNING FROM HUMAN FEEDBACK;** Timo Kaufmann et al
- **TRUSTLLM: TRUSTWORTHINESS IN LARGE LANGUAGE MODELS;** Lichao Sun et al
- **AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION;** Zane Durante et al
- **Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey;** Cedric Colas et al
- **Safety of Multimodal Large Language Models on Images and Text;** Xin Liu et al
- **MM-LLMs: Recent Advances in MultiModal Large Language Models;** Duzhen Zhang et al
- **Rethinking Interpretability in the Era of Large Language Models;** Chandan Singh et al
- **Large Multimodal Agents: A Survey;** Junlin Xie et al
- **A Survey on Data Selection for Language Models;** Alon Albalak et al
- **What Are Tools Anyway? A Survey from the Language Model Perspective;** Zora Zhiruo Wang et al
- **Best Practices and Lessons Learned on Synthetic Data for Language Models;** Ruibo Liu et al
- **A Survey on the Memory Mechanism of Large Language Model based Agents;** Zeyu Zhang et al
- **A Survey on Self-Evolution of Large Language Models;** Zhengwei Tao et al
- **When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models;** Xianzheng Ma et al
- **An Introduction to Vision-Language Modeling;** Florian Bordes et al
- **Towards Scalable Automated Alignment of LLMs: A Survey;** Boxi Cao et al
- **A Survey on Mixture of Experts;** Weilin Cai et al
- **The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective;** Zhen Qin et al
- **Retrieval-Augmented Generation for Large Language Models: A Survey;** Yunfan Gao et al
- **Towards a Unified View of Preference Learning for Large Language Models: A Survey;** Bofei Gao et al
- **From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models;** Sean Welleck et al
- **A Survey on the Honesty of Large Language Models;** Siheng Li et al
- **Autoregressive Models in Vision: A Survey;** Jing Xiong et al
- **Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective;** Zhiyuan Zeng et al
- **Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey;** Liang Chen et al
- **(MIS)FITTING: A SURVEY OF SCALING LAWS;** Margaret Li et al
- **Thus Spake Long-Context Large Language Model;** Xiaoran Liu et al
- **Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models;** Yang Sui et al
- **A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future;** Jialun Zhong et al


## Position Paper

- **Eight Things to Know about Large Language Models;** Samuel R. Bowman et al
- **A PhD Student’s Perspective on Research in NLP in the Era of Very Large Language Models;** Oana Ignat et al
- **Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models;** Yuxi Ma et al
- **Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models;** Lingxi Xie et al
- **A Path Towards Autonomous Machine Intelligence;** Yann LeCun et al
- **GPT-4 Can’t Reason;** Konstantine Arkoudas et al
- **Cognitive Architectures for Language Agents;** Theodore Sumers et al
- **Large Search Model: Redefining Search Stack in the Era of LLMs;** Liang Wang et al
- **PROAGENT: FROM ROBOTIC PROCESS AUTOMATION TO AGENTIC PROCESS AUTOMATION;** Yining Ye et al
- **Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning;** Zhiting Hu et al
- **A Roadmap to Pluralistic Alignment;** Taylor Sorensen et al
- **Towards Unified Alignment Between Agents, Humans, and Environment;** Zonghan Yang et al
- **Video as the New Language for Real-World Decision Making;** Sherry Yang et al
- **A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI;** Seliem El-Sayed et al
- **Concrete Problems in AI Safety;** Dario Amodei et al
- **Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought;** Violet Xiang et al


## Structure

- **Finding Structural Knowledge in Multimodal-BERT;** Victor Milewski et al
- **Going Beyond Nouns With Vision & Language Models Using Synthetic Data;** Paola Cascante-Bonilla et al
- **Measuring Progress in Fine-grained Vision-and-Language Understanding;** Emanuele Bugliarello et al
- **PV2TEA: Patching Visual Modality to Textual-Established Information Extraction;** Hejie Cui et al

**Event Extraction**

- **Cross-media Structured Common Space for Multimedia Event Extraction;** Manling Li et al; Focus on image-text event extraction. A new benchmark and baseline are proposed.
- **Visual Semantic Role Labeling for Video Understanding;** Arka Sadhu et al; A new benchmark is proposed.
- **GAIA: A Fine-grained Multimedia Knowledge Extraction System;** Manling Li et al; Demo paper. Extracts knowledge (relations, events) from multimedia data.
- **MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities;** Yubo Ma et al

**Situation Recognition**

- **Situation Recognition: Visual Semantic Role Labeling for Image Understanding;** Mark Yatskar et al; Focus on image understanding: given only an image (no text), perform semantic role labeling. A new benchmark and baseline are proposed.
- **Commonly Uncommon: Semantic Sparsity in Situation Recognition;** Mark Yatskar et al; Addresses the long-tail problem.
- **Grounded Situation Recognition;** Sarah Pratt et al
- **Rethinking the Two-Stage Framework for Grounded Situation Recognition;** Meng Wei et al
- **Collaborative Transformers for Grounded Situation Recognition;** Junhyeong Cho et al

**Scene Graph**

- **Action Genome: Actions as Composition of Spatio-temporal Scene Graphs;** Jingwei Ji et al; Spatio-temporal scene graphs (video).
- **Unbiased Scene Graph Generation from Biased Training;** Kaihua Tang et al
- **Visual Distant Supervision for Scene Graph Generation;** Yuan Yao et al
- **Learning to Generate Scene Graph from Natural Language Supervision;** Yiwu Zhong et al
- **Weakly Supervised Visual Semantic Parsing;** Alireza Zareian, Svebor Karaman, Shih-Fu Chang
- **Scene Graph Prediction with Limited Labels;** Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, Li Fei-Fei
- **Neural Motifs: Scene Graph Parsing with Global Context;** Rowan Zellers et al
- **Fine-Grained Scene Graph Generation with Data Transfer;** Ao Zhang et al
- **Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning;** Tao He et al
- **COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR OPEN-VOCABULARY VIDEO RELATION DETECTION;** Kaifeng Gao et al; Video.
- **LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation;** Xiaoguang Chang et al
- **TRANSFORMER-BASED IMAGE GENERATION FROM SCENE GRAPHS;** Renato Sortino et al
- **The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation;** Lin Li et al
- **Knowledge-augmented Few-shot Visual Relation Detection;** Tianyu Yu et al
- **Prototype-based Embedding Network for Scene Graph Generation;** Chaofan Zhen et al
- **Unified Visual Relationship Detection with Vision and Language Models;** Long Zhao et al
- **Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge;** Yufeng Huang et al

**Attribute**

- **COCO Attributes: Attributes for People, Animals, and Objects;** Genevieve Patterson et al
- **Human Attribute Recognition by Deep Hierarchical Contexts;** Yining Li et al; Attribute prediction in specific domains.
- **Emotion Recognition in Context;** Ronak Kosti et al; Attribute prediction in specific domains.
- **The iMaterialist Fashion Attribute Dataset;** Sheng Guo et al; Attribute prediction in specific domains.
- **Learning to Predict Visual Attributes in the Wild;** Khoi Pham et al
- **Open-vocabulary Attribute Detection;** María A. Bravo et al
- **OvarNet: Towards Open-vocabulary Object Attribute Recognition;** Keyan Chen et al

**Compositionality**

- **CREPE: Can Vision-Language Foundation Models Reason Compositionally?;** Zixian Ma et al
- **Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality;** Tristan Thrush et al
- **WHEN AND WHY VISION-LANGUAGE MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT?;** Mert Yuksekgonul et al
- **GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering;** Drew A. Hudson et al
- **COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images;** Ben Bogin et al
- **Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension;** Zhenfang Chen et al
- **Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?;** Tian Yun et al
- **SUGARCREPE: Fixing Hackable Benchmarks for Vision-Language Compositionality;** Cheng-Yu Hsieh et al
- **An Examination of the Compositionality of Large Generative Vision-Language Models;** Teli Ma et al

**Concept**

- **Cross-Modal Concept Learning and Inference for Vision-Language Models;** Yi Zhang et al
- **Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning;** Hanjae Kim et al

## Planning

- **Multimedia Generative Script Learning for Task Planning;** Qingyun Wang et al; Next step prediction.
- **PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks;** Jiankai Sun et al; Procedure planning.
- **P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision;** He Zhao et al; Procedure planning; uses text as weak supervision in place of video clips.
- **Procedure Planning in Instructional Videos;** Chien-Yi Chang et al; Procedure planning.
- **ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities;** Terry Yue Zhuo et al
- **Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation;** Bingqian Lin et al

## Reasoning

- **VisualCOMET: Reasoning about the Dynamic Context of a Still Image;** Jae Sung Park et al; Benchmark dataset requiring models to reason about a still image (what happened before and what will happen next).
- **Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering;** Pan Lu et al
- **See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning;** Zhenfang Chen et al
- **An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;** Zhengyuan Yang et al
- **Multimodal Chain-of-Thought Reasoning in Language Models;** Zhuosheng Zhang et al
- **LAMPP: Language Models as Probabilistic Priors for Perception and Action;** Belinda Z. Li et al
- **Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings;** Daniel Rose et al
- **Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning;** Xiaoqian Wu et al

**Common Sense**

- **Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles;** Shuquan Ye et al
- **VIPHY: Probing “Visible” Physical Commonsense Knowledge;** Shikhar Singh et al
- **Visual Commonsense in Pretrained Unimodal and Multimodal Models;** Chenyu Zhang et al

## Generation

- **ClipCap: CLIP Prefix for Image Captioning;** Ron Mokady et al; Trains a lightweight mapping network that converts CLIP embeddings into prefix token embeddings for GPT-2 (see the sketch below).
- **Multimodal Knowledge Alignment with Reinforcement Learning;** Youngjae Yu et al; Uses RL to train an encoder that projects multimodal inputs into the word embedding space of GPT-2.
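The sketch below illustrates the prefix-mapping idea behind the ClipCap entry above. It is a minimal, hedged illustration only: the module name `PrefixMapper`, the MLP shape, and the dimensions (CLIP ViT-B/32 embedding of 512, GPT-2 hidden size of 768, 10 prefix tokens) are my assumptions, not the authors' released code.

```python
# Minimal sketch (assumptions noted above), not the authors' implementation.
import torch
import torch.nn as nn


class PrefixMapper(nn.Module):
    """Maps a frozen CLIP image embedding to a sequence of GPT-2 prefix token embeddings."""

    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_len, gpt_dim)
        return self.mlp(clip_embedding).view(-1, self.prefix_len, self.gpt_dim)


# Conceptual usage: concatenate the prefix with caption token embeddings, feed the
# sequence to a (typically frozen) GPT-2, and minimize the caption language-modeling
# loss; only the mapper (and optionally GPT-2) is updated.
prefix = PrefixMapper()(torch.randn(2, 512))  # -> torch.Size([2, 10, 768])
```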

## Representation Learning

- **Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;** Peter Anderson et al
- **Fusion of Detected Objects in Text for Visual Question Answering;** Chris Alberti et al
- **VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix;** Teng Wang et al
- **Vision-Language Pre-Training with Triple Contrastive Learning;** Jinyu Yang et al
- **Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision;** Hao Tan et al; Use visual supervision to pretrain language models.
- **HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning;** Paul Pu Liang et al
- **Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture;** Mahmoud Assran et al
- **PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World;** Rowan Zellers et al
- **Learning the Effects of Physical Actions in a Multi-modal Environment;** Gautier Dagan et al
- **Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models;** Zhiqiu Lin et al
- **Learning Visual Representations via Language-Guided Sampling;** Mohamed El Banani et al
- **Image as Set of Points;** Xu Ma et al
- **ARCL: ENHANCING CONTRASTIVE LEARNING WITH AUGMENTATION-ROBUST REPRESENTATIONS;** Xuyang Zhao et al
- **BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING;** Maximilian Seitzer et al
- **Learning Transferable Spatiotemporal Representations from Natural Script Knowledge;** Ziyun Zeng et al
- **Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning;** Qian Jiang et al
- **VLM2VEC: TRAINING VISION-LANGUAGE MODELS FOR MASSIVE MULTIMODAL EMBEDDING TASKS;** Ziyan Jiang et al
- **When Does Perceptual Alignment Benefit Vision Representations?;** Shobhita Sundaram et al
- **NARAIM: Native Aspect Ratio Autoregressive Image Models;** Daniel Gallo Fernández et al
- **Masked Autoencoders Are Scalable Vision Learners;** Kaiming He et al


## LLM Analysis

- **GROKKING: GENERALIZATION BEYOND OVERFITTING ON SMALL ALGORITHMIC DATASETS;** Alethea Power et al
- **Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition;** Yufei Huang et al
- **A Categorical Archive of ChatGPT Failures;** Ali Borji et al
- **Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling;** Stella Biderman et al
- **Are Emergent Abilities of Large Language Models a Mirage?;** Rylan Schaeffer et al
- **A Drop of Ink may Make a Million Think: The Spread of False Information in Large Language Models;** Ning Bian et al
- **Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting;** Miles Turpin et al
- **SYMBOL TUNING IMPROVES IN-CONTEXT LEARNING IN LANGUAGE MODELS;** Jerry Wei et al
- **What In-Context Learning “Learns” In-Context: Disentangling Task Recognition and Task Learning;** Jane Pan et al
- **Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models;** Amirhossein Kazemnejad et al
- **Scaling Data-Constrained Language Models;** Niklas Muennighoff et al
- **The False Promise of Imitating Proprietary LLMs;** Arnav Gudibande et al
- **Counterfactual reasoning: Testing language models’ understanding of hypothetical scenarios;** Jiaxuan Li et al
- **Inverse Scaling: When Bigger Isn’t Better;** Ian R. McKenzie et al
- **DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models;** Boxin Wang et al
- **Lost in the Middle: How Language Models Use Long Contexts;** Nelson F. Liu et al
- **Won’t Get Fooled Again: Answering Questions with False Premises;** Shengding Hu et al
- **Generating Benchmarks for Factuality Evaluation of Language Models;** Dor Muhlgay et al
- **Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations;** Yanda Chen et al
- **Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation;** Ruiyang Ren et al
- **Large Language Models Struggle to Learn Long-Tail Knowledge;** Nikhil Kandpal et al
- **SCALING RELATIONSHIP ON LEARNING MATHEMATICAL REASONING WITH LARGE LANGUAGE MODELS;** Zheng Yuan et al
- **Multimodal Neurons in Pretrained Text-Only Transformers;** Sarah Schwettmann et al
- **SIMPLE SYNTHETIC DATA REDUCES SYCOPHANCY IN LARGE LANGUAGE MODELS;** Jerry Wei et al
- **Studying Large Language Model Generalization with Influence Functions;** Roger Grosse et al
- **Taken out of context: On measuring situational awareness in LLMs;** Lukas Berglund et al
- **OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs;** Patrick Haller et al
- **Neurons in Large Language Models: Dead, N-gram, Positional;** Elena Voita et al
- **Are Emergent Abilities in Large Language Models just In-Context Learning?;** Sheng Lu et al
- **The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”;** Lukas Berglund et al
- **Language Modeling Is Compression;** Grégoire Delétang et al
- **FROM LANGUAGE MODELING TO INSTRUCTION FOLLOWING: UNDERSTANDING THE BEHAVIOR SHIFT IN LLMS AFTER INSTRUCTION TUNING;** Xuansheng Wu et al
- **RESOLVING KNOWLEDGE CONFLICTS IN LARGE LANGUAGE MODELS;** Yike Wang et al
- **LARGE LANGUAGE MODELS CANNOT SELF-CORRECT REASONING YET;** Jie Huang et al
- **ASK AGAIN, THEN FAIL: LARGE LANGUAGE MODELS’ VACILLATIONS IN JUDGEMENT;** Qiming Xie et al
- **FRESHLLMS: REFRESHING LARGE LANGUAGE MODELS WITH SEARCH ENGINE AUGMENTATION;** Tu Vu et al
- **Demystifying Embedding Spaces using Large Language Models;** Guy Tennenholtz et al
- **An Emulator for Fine-Tuning Large Language Models using Small Language Models;** Eric Mitchell et al
- **UNVEILING A CORE LINGUISTIC REGION IN LARGE LANGUAGE MODELS;** Jun Zhao et al
- **DETECTING PRETRAINING DATA FROM LARGE LANGUAGE MODELS;** Weijia Shi et al
- **BENCHMARKING AND IMPROVING GENERATOR-VALIDATOR CONSISTENCY OF LMS;** Xiang Lisa Li et al
- **Trusted Source Alignment in Large Language Models;** Vasilisa Bashlovkina et al
- **THE UNLOCKING SPELL ON BASE LLMS: RETHINKING ALIGNMENT VIA IN-CONTEXT LEARNING;** Bill Yuchen Lin et al
- **Can Large Language Models Really Improve by Self-critiquing Their Own Plans?;** Karthik Valmeekam et al
- **TELL, DON’T SHOW: DECLARATIVE FACTS INFLUENCE HOW LLMS GENERALIZE;** Alexander Meinke et al
- **A Closer Look at the Limitations of Instruction Tuning;** Sreyan Ghosh et al
- **PERSONAS AS A WAY TO MODEL TRUTHFULNESS IN LANGUAGE MODELS;** Nitish Joshi et al
- **Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models;** Chenyang Lyu et al
- **Dated Data: Tracing Knowledge Cutoffs in Large Language Models;** Jeffrey Cheng et al
- **Context versus Prior Knowledge in Language Models;** Kevin Du et al
- **Training Trajectories of Language Models Across Scales;** Mengzhou Xia et al
- **Retrieval Head Mechanistically Explains Long-Context Factuality;** Wenhao Wu et al
- **Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models;** Jacob Pfau et al
- **Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?;** Zorik Gekhman et al
- **To Code, or Not To Code? Exploring Impact of Code in Pre-training;** Viraat Aryabumi et al
- **LATENT SPACE CHAIN-OF-EMBEDDING ENABLES OUTPUT-FREE LLM SELF-EVALUATION;** Yiming Wang et al
- **Physics of Language Models: Part 3.1, Knowledge Storage and Extraction;** Zeyuan Allen-Zhu et al
- **Physics of Language Models: Part 3.2, Knowledge Manipulation;** Zeyuan Allen-Zhu et al
- **IMPROVING PRETRAINING DATA USING PERPLEXITY CORRELATIONS;** Tristan Thrush et al
- **Overtrained Language Models Are Harder to Fine-Tune;** Jacob Mitchell Springer et al
- **Reasoning Models Know When They’re Right: Probing Hidden States for Self-Verification;** Anqi Zhang et al
- **Model Merging in Pre-training of Large Language Models;** ByteDance Seed
- **How Alignment Shrinks the Generative Horizon;** Chenghao Yang et al
- **EvoLM: In Search of Lost Language Model Training Dynamics;** Zhenting Qi et al

**Calibration & Uncertainty**

- **Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models;** Alfonso Amayuelas et al
- **Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs;** Miao Xiong et al
- **LLAMAS KNOW WHAT GPTS DON’T SHOW: SURROGATE MODELS FOR CONFIDENCE ESTIMATION;** Vaishnavi Shrivastava et al
- **Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models;** Kaitlyn Zhou et al
- **R-Tuning: Teaching Large Language Models to Refuse Unknown Questions;** Hanning Zhang et al
- **Relying on the Unreliable: The Impact of Language Models’ Reluctance to Express Uncertainty;** Kaitlyn Zhou et al
- **Prudent Silence or Foolish Babble? Examining Large Language Models’ Responses to the Unknown;** Genglin Liu et al
- **Benchmarking LLMs via Uncertainty Quantification;** Fanghua Ye et al
- **Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback;** Katherine Tian et al
- **Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models;** Anthony Sicilia et al
- **Calibrating Long-form Generations from Large Language Models;** Yukun Huang et al
- **Distinguishing the Knowable from the Unknowable with Language Models;** Gustaf Ahdritz et al
- **Introspective Planning: Guiding Language-Enabled Agents to Refine Their Own Uncertainty;** Kaiqu Liang et al
- **Asking the Right Question at the Right Time: Human and Model Uncertainty Guidance to Ask Clarification Questions;** Alberto Testoni et al
- **Into the Unknown: Self-Learning Large Language Models;** Teddy Ferdinan et al
- **The Internal State of an LLM Knows When It’s Lying;** Amos Azaria et al
- **SELFCHECKGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models;** Potsawee Manakul et al
- **Calibrating Large Language Models with Sample Consistency;** Qing Lyu et al
- **Gotcha! Don’t trick me with unanswerable questions! Self-aligning Large Language Models for Responding to Unknown Questions;** Yang Deng et al
- **Unfamiliar Finetuning Examples Control How Language Models Hallucinate;** Katie Kang et al
- **Few-Shot Recalibration of Language Models;** Xiang Lisa Li et al
- **When to Trust LLMs: Aligning Confidence with Response Quality and Exploring Applications in RAG;** Shuchang Tao et al
- **Linguistic Calibration of Language Models;** Neil Band et al
- **Large Language Models Must Be Taught to Know What They Don’t Know;** Sanyam Kapoor et al
- **SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales;** Tianyang Xu et al
- **Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models;** Qingcheng Zeng et al
- **SEMANTIC UNCERTAINTY: LINGUISTIC INVARIANCES FOR UNCERTAINTY ESTIMATION IN NATURAL LANGUAGE GENERATION;** Lorenz Kuhn et al
- **I Don’t Know: Explicit Modeling of Uncertainty with an [IDK] Token;** Roi Cohen et al


## LLM Safety

- **Learning Human Objectives by Evaluating Hypothetical Behavior;** Siddharth Reddy et al
- **Universal and Transferable Adversarial Attacks on Aligned Language Models;** Andy Zou et al
- **XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models;** Paul Röttger et al
- **Jailbroken: How Does LLM Safety Training Fail? Content Warning: This paper contains examples of harmful language;** Alexander Wei et al
- **FUNDAMENTAL LIMITATIONS OF ALIGNMENT IN LARGE LANGUAGE MODELS;** Yotam Wolf et al
- **BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset;** Jiaming Ji et al
- **GPT-4 IS TOO SMART TO BE SAFE: STEALTHY CHAT WITH LLMS VIA CIPHER;** Youliang Yuan et al
- **Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment;** Rishabh Bhardwaj et al
- **Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs;** Yuxia Wang et al
- **SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions;** Zhexin Zhang et al
- **Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions;** Federico Bianchi et al
- **Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations;** Hakan Inan et al
- **EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!;** Zhanhui Zhou et al
- **Logits of API-Protected LLMs Leak Proprietary Information;** Matthew Finlayson et al
- **Simple probes can catch sleeper agents;** Monte MacDiarmid et al
- **AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs;** Anselm Paulus et al
- **SYCOPHANCY TO SUBTERFUGE: INVESTIGATING REWARD-TAMPERING IN LARGE LANGUAGE MODELS;** Carson Denison et al
- **SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors;** Tinghao Xie et al
- **WILDGUARD: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs;** Seungju Han et al
- **Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs;** Rudolf Laine et al
- **Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?;** Richard Ren et al
- **LANGUAGE MODELS LEARN TO MISLEAD HUMANS VIA RLHF;** Jiaxin Wen et al
- **MEASURING AND IMPROVING PERSUASIVENESS OF GENERATIVE MODELS;** Somesh Singh et al
- **LOOKING INWARD: LANGUAGE MODELS CAN LEARN ABOUT THEMSELVES BY INTROSPECTION;** Felix J Binder et al
- **Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming;** Mrinank Sharma et al
- **Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs;** Jan Betley et al
- **Reasoning Models Don’t Always Say What They Think;** Yanda Chen et al


## LLM Evaluation

- **IS CHATGPT A GENERAL-PURPOSE NATURAL LANGUAGE PROCESSING TASK SOLVER?;** Chengwei Qin et al
- **AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models;** Wanjun Zhong et al
- **A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity;** Yejin Bang et al
- **On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective;** Jindong Wang et al
- **A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models;** Junjie Ye et al
- **KoLA: Carefully Benchmarking World Knowledge of Large Language Models;** Jifan Yu et al
- **SCIBENCH: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models;** Xiaoxuan Wang et al
- **FLASK: FINE-GRAINED LANGUAGE MODEL EVALUATION BASED ON ALIGNMENT SKILL SETS;** Seonghyeon Ye et al
- **Efficient Benchmarking (of Language Models);** Yotam Perlitz et al
- **Can Large Language Models Understand Real-World Complex Instructions?;** Qianyu He et al
- **NLPBENCH: EVALUATING LARGE LANGUAGE MODELS ON SOLVING NLP PROBLEMS;** Linxin Song et al
- **CALIBRATING LLM-BASED EVALUATOR;** Yuxuan Liu et al
- **GPT-FATHOM: BENCHMARKING LARGE LANGUAGE MODELS TO DECIPHER THE EVOLUTIONARY PATH TOWARDS GPT-4 AND BEYOND;** Shen Zheng et al
- **L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models;** Ansong Ni et al
- **Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations;** Lifan Yuan et al
- **TIGERSCORE: TOWARDS BUILDING EXPLAINABLE METRIC FOR ALL TEXT GENERATION TASKS;** Dongfu Jiang et al
- **DO LARGE LANGUAGE MODELS KNOW ABOUT FACTS?;** Xuming Hu et al
- **PROMETHEUS: INDUCING FINE-GRAINED EVALUATION CAPABILITY IN LANGUAGE MODELS;** Seungone Kim et al
- **CRITIQUE ABILITY OF LARGE LANGUAGE MODELS;** Liangchen Luo et al
- **BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues;** Haodong Duan et al
- **Instruction-Following Evaluation for Large Language Models;** Jeffrey Zhou et al
- **GAIA: A Benchmark for General AI Assistants;** Gregoire Mialon et al
- **ML-BENCH: LARGE LANGUAGE MODELS LEVERAGE OPEN-SOURCE LIBRARIES FOR MACHINE LEARNING TASKS;** Yuliang Liu et al
- **TASKBENCH: BENCHMARKING LARGE LANGUAGE MODELS FOR TASK AUTOMATION;** Yongliang Shen et al
- **GENERATIVE JUDGE FOR EVALUATING ALIGNMENT;** Junlong Li et al
- **InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks;** Xueyu Hu et al
- **AGENTBOARD: AN ANALYTICAL EVALUATION BOARD OF MULTI-TURN LLM AGENTS;** Chang Ma et al
- **WEBLINX: Real-World Website Navigation with Multi-Turn Dialogue;** Xing Han Lu et al
- **MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization;** Zhiyu Yang et al
- **Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference;** Wei-Lin Chiang et al
- **DevBench: A Comprehensive Benchmark for Software Development;** Bowen Li et al
- **REWARDBENCH: Evaluating Reward Models for Language Modeling;** Nathan Lambert et al
- **Long-context LLMs Struggle with Long In-context Learning;** Tianle Li et al
- **LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models;** Shibo Hao et al
- **Benchmarking Benchmark Leakage in Large Language Models;** Ruijie Xu et al
- **PROMETHEUS 2: An Open Source Language Model Specialized in Evaluating Other Language Models;** Seungone Kim et al
- **Revealing the structure of language model capabilities;** Ryan Burnell et al
- **MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures;** Jinjie Ni et al
- **BEHONEST: Benchmarking Honesty of Large Language Models;** Steffi Chern et al
- **SciCode: A Research Coding Benchmark Curated by Scientists;** Minyang Tian et al
- **Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation;** Tu Vu et al
- **MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains;** Guoli Yin et al
- **Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries;** Kiran Vodrahalli et al
- **Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization;** Mucong Ding et al
- **Law of the Weakest Link: Cross Capabilities of Large Language Models;** Ming Zhong et al
- **HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly;** Howard Yen et al
- **GPQA: A Graduate-Level Google-Proof Q&A Benchmark;** David Rein et al
- **Humanity's Last Exam;** Long Phan et al
- **VERDICT: A Library for Scaling Judge-Time Compute;** Nimit Kalra et al
- **BIG-Bench Extra Hard;** Mehran Kazemi et al
- **VerifiAgent: a Unified Verification Agent in Language Model Reasoning;** Jiuzhou Han et al
- **PaperBench: Evaluating AI’s Ability to Replicate AI Research;** Giulio Starace et al
- **SEALQA: Raising the Bar for Reasoning in Search-Augmented Language Models;** Thinh Pham et al
- **UQ: Assessing Language Models on Unsolved Questions;** Fan Nie et al


## LLM Reasoning

- **STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning;** Eric Zelikman et al
- **Generated Knowledge Prompting for Commonsense Reasoning;** Jiacheng Liu et al
- **SELF-CONSISTENCY IMPROVES CHAIN OF THOUGHT REASONING IN LANGUAGE MODELS;** Xuezhi Wang et al
- **LEAST-TO-MOST PROMPTING ENABLES COMPLEX REASONING IN LARGE LANGUAGE MODELS;** Denny Zhou et al
- **REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS;** Shunyu Yao et al
- **The Capacity for Moral Self-Correction in Large Language Models;** Deep Ganguli et al
- **Learning to Reason and Memorize with Self-Notes;** Jack Lanchantin et al
- **Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models;** Lei Wang et al
- **T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering;** Lei Wang et al
- **Tree of Thoughts: Deliberate Problem Solving with Large Language Models;** Shunyu Yao et al
- **Introspective Tips: Large Language Model for In-Context Decision Making;** Liting Chen et al
- **Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples;** Abulhair Saparov et al
- **Reasoning with Language Model is Planning with World Model;** Shibo Hao et al
- **Interpretable Math Word Problem Solution Generation Via Step-by-step Planning;** Mengxue Zhang et al
- **Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters;** Boshi Wang et al
- **Recursion of Thought: A Divide-and-Conquer Approach to Multi-Context Reasoning with Language Models;** Soochan Lee et al
- **Large Language Models Are Reasoning Teachers;** Namgyu Ho et al
- **Meta-Reasoning: Semantics-Symbol Deconstruction For Large Language Models;** Yiming Wang et al
- **BeamSearchQA: Large Language Models are Strong Zero-Shot QA Solver;** Hao Sun et al
- **AdaPlanner: Adaptive Planning from Feedback with Language Models;** Haotian Sun et al
- **ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models;** Binfeng Xu et al
- **SKILLS-IN-CONTEXT PROMPTING: UNLOCKING COMPOSITIONALITY IN LARGE LANGUAGE MODELS;** Jiaao Chen et al
- **SOLVING CHALLENGING MATH WORD PROBLEMS USING GPT-4 CODE INTERPRETER WITH CODE-BASED SELF-VERIFICATION;** Aojun Zhou et al
- **MAMMOTH: BUILDING MATH GENERALIST MODELS THROUGH HYBRID INSTRUCTION TUNING;** Xiang Yue et al
- **DESIGN OF CHAIN-OF-THOUGHT IN MATH PROBLEM SOLVING;** Zhanming Jie et al
- **NATURAL LANGUAGE EMBEDDED PROGRAMS FOR HYBRID LANGUAGE SYMBOLIC REASONING;** Tianhua Zhang et al
- **MATHCODER: SEAMLESS CODE INTEGRATION IN LLMS FOR ENHANCED MATHEMATICAL REASONING;** Ke Wang et al
- **META-COT: GENERALIZABLE CHAIN-OF-THOUGHT PROMPTING IN MIXED-TASK SCENARIOS WITH LARGE LANGUAGE MODELS;** Anni Zou et al
- **TOOLCHAIN\*: EFFICIENT ACTION SPACE NAVIGATION IN LARGE LANGUAGE MODELS WITH A\* SEARCH;** Yuchen Zhuang et al
- **Learning From Mistakes Makes LLM Better Reasoner;** Shengnan An et al
- **Chain of Code: Reasoning with a Language Model-Augmented Code Emulator;** Chengshu Li et al
- **Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives;** Wenqi Zhang et al
- **Divide and Conquer for Large Language Models Reasoning;** Zijie Meng et al
- **The Impact of Reasoning Step Length on Large Language Models;** Mingyu Jin et al
- **REFT: Reasoning with REinforced Fine-Tuning;** Trung Quoc Luong et al
- **Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding;** Mirac Suzgun et al
- **SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures;** Pei Zhou et al
- **Guiding Large Language Models with Divide-and-Conquer Program for Discerning Problem Solving;** Yizhou Zhang et al
- **Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning;** Zhiheng Xi et al
- **V-STaR: Training Verifiers for Self-Taught Reasoners;** Arian Hosseini et al
- **Verified Multi-Step Synthesis using Large Language Models and Monte Carlo Tree Search;** David Brandfonbrener et al
- **BOOSTING OF THOUGHTS: TRIAL-AND-ERROR PROBLEM SOLVING WITH LARGE LANGUAGE MODELS;** Sijia Chen et al
- **Language Agents as Optimizable Graphs;** Mingchen Zhuge et al
- **MathScale: Scaling Instruction Tuning for Mathematical Reasoning;** Zhengyang Tang et al
- **Teaching Large Language Models to Reason with Reinforcement Learning;** Alex Havrilla et al
- **Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking;** Eric Zelikman et al
- **TREE SEARCH FOR LANGUAGE MODEL AGENTS;** Jing Yu Koh et al
- **DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search;** Huajian Xin et al
- **Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning;** Yuxi Xie et al
- **Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs;** Xuan Zhang et al
- **B-STAR: MONITORING AND BALANCING EXPLORATION AND EXPLOITATION IN SELF-TAUGHT REASONERS;** Weihao Zeng et al

**Self-consistency**

- **Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference;** Eric Mitchell et al
- **Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs;** Angelica Chen et al
- **Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation;** Niels Mündler et al
- **Measuring and Narrowing the Compositionality Gap in Language Models;** Ofir Press et al
- **Self-consistency for open-ended generations;** Siddhartha Jain et al
- **Question Decomposition Improves the Faithfulness of Model-Generated Reasoning;** Ansh Radhakrishnan et al
- **Measuring Faithfulness in Chain-of-Thought Reasoning;** Tamera Lanham et al
- **SELFCHECK: USING LLMS TO ZERO-SHOT CHECK THEIR OWN STEP-BY-STEP REASONING;** Ning Miao et al
- **On Measuring Faithfulness or Self-consistency of Natural Language Explanations;** Letitia Parcalabescu et al
- **Chain-of-Thought Unfaithfulness as Disguised Accuracy;** Oliver Bentham et al

(with images)

- **Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation;** Arijit Ray et al
- **Maintaining Reasoning Consistency in Compositional Visual Question Answering;** Chenchen Jing et al
- **SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions;** Ramprasaath R. Selvaraju et al
- **Logical Implications for Visual Question Answering Consistency;** Sergio Tascon-Morales et al
- **Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models;** Adyasha Maharana et al
- **Co-VQA: Answering by Interactive Sub Question Sequence;** Ruonan Wang et al
- **IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models;** Haoxuan You et al
- **Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense;** Zhecan Wang et al

## LLM Application

- **ArK: Augmented Reality with Knowledge Interactive Emergent Ability;** Qiuyuan Huang et al
- **Can Large Language Models Be an Alternative to Human Evaluation?;** Cheng-Han Chiang et al
- **Few-shot In-context Learning for Knowledge Base Question Answering;** Tianle Li et al
- **AutoML-GPT: Automatic Machine Learning with GPT;** Shujian Zhang et al
- **Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs;** Jinyang Li et al
- **Language models can explain neurons in language models;** Steven Bills et al
- **Large Language Model Programs;** Imanol Schlag et al
- **Evaluating Factual Consistency of Summaries with Large Language Models;** Shiqi Chen et al
- **WikiChat: A Few-Shot LLM-Based Chatbot Grounded with Wikipedia;** Sina J. Semnani et al
- **Language Models Can Improve Event Prediction by Few-Shot Abductive Reasoning;** Xiaoming Shi et al
- **Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks;** Sherzod Hakimov et al
- **PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents;** Simeng Sun et al
- **LayoutGPT: Compositional Visual Planning and Generation with Large Language Models;** Weixi Feng et al
- **Judging LLM-as-a-judge with MT-Bench and Chatbot Arena;** Lianmin Zheng et al
- **LLM-BLENDER: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion;** Dongfu Jiang et al
- **Benchmarking Foundation Models with Language-Model-as-an-Examiner;** Yushi Bai et al
- **AudioPaLM: A Large Language Model That Can Speak and Listen;** Paul K. Rubenstein et al
- **Human-in-the-Loop through Chain-of-Thought;** Zefan Cai et al
- **LARGE LANGUAGE MODELS ARE EFFECTIVE TEXT RANKERS WITH PAIRWISE RANKING PROMPTING;** Zhen Qin et al
- **Language to Rewards for Robotic Skill Synthesis;** Wenhao Yu et al
- **Visual Programming for Text-to-Image Generation and Evaluation;** Jaemin Cho et al
- **Mindstorms in Natural Language-Based Societies of Mind;** Mingchen Zhuge et al
- **Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators;** Zhizheng Zhang et al
- **Large Language Models as General Pattern Machines;** Suvir Mirchandani et al
- **A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation;** Neeraj Varshney et al
- **VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models;** Wenlong Huang et al
- **External Reasoning: Towards Multi-Large-Language-Models Interchangeable Assistance with Human Feedback;** Akide Liu et al
- **OCTOPACK: INSTRUCTION TUNING CODE LARGE LANGUAGE MODELS;** Niklas Muennighoff et al
- **Tackling Vision Language Tasks Through Learning Inner Monologues;** Diji Yang et al
- **Can Language Models Learn to Listen?;** Evonne Ng et al
- **PROMPT2MODEL: Generating Deployable Models from Natural Language Instructions;** Vijay Viswanathan et al
- **AnomalyGPT: Detecting Industrial Anomalies using Large Vision-Language Models;** Zhaopeng Gu et al
- **LARGE LANGUAGE MODELS AS OPTIMIZERS;** Chengrun Yang et al
- **Large Language Model for Science: A Study on P vs. NP;** Qingxiu Dong et al
- **Physically Grounded Vision-Language Models for Robotic Manipulation;** Jensen Gao et al
- **Compositional Foundation Models for Hierarchical Planning;** Anurag Ajay et al
- **STRUC-BENCH: Are Large Language Models Really Good at Generating Complex Structured Data?;** Xiangru Tang et al
- **XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates;** Haopeng Zhang et al
- **TEXT2REWARD: AUTOMATED DENSE REWARD FUNCTION GENERATION FOR REINFORCEMENT LEARNING;** Tianbao Xie et al
- **EUREKA: HUMAN-LEVEL REWARD DESIGN VIA CODING LARGE LANGUAGE MODELS;** Yecheng Jason Ma et al
- **CREATIVE ROBOT TOOL USE WITH LARGE LANGUAGE MODELS;** Mengdi Xu et al
- **Goal Driven Discovery of Distributional Differences via Language Descriptions;** Ruiqi Zhong et al
- **Can large language models provide useful feedback on research papers? A large-scale empirical analysis.;** Weixin Liang et al
- **DRIVEGPT4: INTERPRETABLE END-TO-END AUTONOMOUS DRIVING VIA LARGE LANGUAGE MODEL;** Zhenhua Xu et al
- **QUALEVAL: QUALITATIVE EVALUATION FOR MODEL IMPROVEMENT;** Vishvak Murahari et al
- **LLM AUGMENTED LLMS: EXPANDING CAPABILITIES THROUGH COMPOSITION;** Rachit Bansal et al
- **SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems;** Dong Zhang et al
- **DEMOCRATIZING FINE-GRAINED VISUAL RECOGNITION WITH LARGE LANGUAGE MODELS;** Mingxuan Liu et al
- **Solving olympiad geometry without human demonstrations;** Trieu H. Trinh et al
- **AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy;** Philipp Schoenegger et al
- **What Evidence Do Language Models Find Convincing?;** Alexander Wan et al
- **Tx-LLM: A Large Language Model for Therapeutics;** Juan Manuel Zambrano Chaves et al
- **Language Models as Science Tutors;** Alexis Chevalier et al
- **Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers;** Chenglei Si et al
- **Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild;** Xinyu Zhao et al


## LLM with Memory

- **Neural Turing Machines;** Alex Graves et al
- **Narrative Question Answering with Cutting-Edge Open-Domain QA Techniques: A Comprehensive Study;** Xiangyang Mou et al
- **Memory and Knowledge Augmented Language Models for Inferring Salience in Long-Form Stories;** David Wilmot et al
- **MemPrompt: Memory-assisted Prompt Editing with User Feedback;** Aman Madaan et al
- **LANGUAGE MODEL WITH PLUG-IN KNOWLEDGE MEMORY;** Xin Cheng et al
- **Assessing Working Memory Capacity of ChatGPT;** Dongyu Gong et al
- **Prompted LLMs as Chatbot Modules for Long Open-domain Conversation;** Gibbeum Lee et al
- **Beyond Goldfish Memory: Long-Term Open-Domain Conversation;** Jing Xu et al
- **Memory Augmented Large Language Models are Computationally Universal;** Dale Schuurmans et al
- **MemoryBank: Enhancing Large Language Models with Long-Term Memory;** Wanjun Zhong et al
- **Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Clashes;** Jian Xie et al
- **RET-LLM: Towards a General Read-Write Memory for Large Language Models;** Ali Modarressi et al
- **RECURRENTGPT: Interactive Generation of (Arbitrarily) Long Text;** Wangchunshu Zhou et al
- **MEMORIZING TRANSFORMERS;** Yuhuai Wu et al
- **Augmenting Language Models with Long-Term Memory;** Weizhi Wang et al
- **Statler: State-Maintaining Language Models for Embodied Reasoning;** Takuma Yoneda et al
- **LONGNET: Scaling Transformers to 1,000,000,000 Tokens;** Jiayu Ding et al
- **In-context Autoencoder for Context Compression in a Large Language Model;** Tao Ge et al
- **MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation;** Junru Lu et al
- **KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases;** Xintao Wang et al
- **LONGBENCH: A BILINGUAL, MULTITASK BENCHMARK FOR LONG CONTEXT UNDERSTANDING;** Yushi Bai et al
- **ChipNeMo: Domain-Adapted LLMs for Chip Design;** Mingjie Liu et al
- **LongAlign: A Recipe for Long Context Alignment of Large Language Models;** Yushi Bai et al
- **RAPTOR: RECURSIVE ABSTRACTIVE PROCESSING FOR TREE-ORGANIZED RETRIEVAL;** Parth Sarthi et al
- **A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts;** Kuang-Huei Lee et al
- **Transformers Can Achieve Length Generalization But Not Robustly;** Yongchao Zhou et al
- **Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context;** Gemini Team, Google
- **HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models;** Bernal Jiménez Gutiérrez et al
- **ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities;** Peng Xu et al

**Advanced**

- **MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent;** Hongli Yu et al
**MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent;** Hongli Yu et al 680 | 681 | 682 | 683 | **Retrieval-augmented LLM** 684 | 685 | - **Training Language Models with Memory Augmentation;** Zexuan Zhong et al 686 | - **Enabling Large Language Models to Generate Text with Citations;** Tianyu Gao et al 687 | - **Multiview Identifiers Enhanced Generative Retrieval;** Yongqi Li et al 688 | - **Meta-training with Demonstration Retrieval for Efficient Few-shot Learning;** Aaron Mueller et al 689 | - **SELF-RAG: LEARNING TO RETRIEVE, GENERATE, AND CRITIQUE THROUGH SELF-REFLECTION;** Akari Asai et al 690 | - **RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents;** Tomoyuki Kagaya et al 691 | - **Unsupervised Dense Information Retrieval with Contrastive Learning;** Gautier Izacard et al 692 | - **LONGCITE: ENABLING LLMS TO GENERATE FINE-GRAINED CITATIONS IN LONG-CONTEXT QA;** Jiajie Zhang et al 693 | - **ONEGEN: EFFICIENT ONE-PASS UNIFIED GENERATION AND RETRIEVAL FOR LLMS;** Jintian Zhang et al 694 | - **OPENSCHOLAR: SYNTHESIZING SCIENTIFIC LITERATURE WITH RETRIEVAL-AUGMENTED LMS;** Akari Asai et al 695 | - **AUTO-RAG: AUTONOMOUS RETRIEVAL-AUGMENTED GENERATION FOR LARGE LANGUAGE MODELS;** Tian Yu et al 696 | 697 | 698 | 699 | ## LLM with Human 700 | 701 | - **CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities;** Mina Lee et al 702 | - **RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting;** Lei Shu et al 703 | - **LeanDojo: Theorem Proving with Retrieval-Augmented Language Models;** Kaiyu Yang et al 704 | - **Evaluating Human-Language Model Interaction;** Mina Lee et al 705 | 706 | 707 | 708 | 709 | ## Inference-time Scaling (via RL) 710 | 711 | - **An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models;** Yangzhen Wu et al 712 | - **Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters;** Charlie Snell et al 713 | - **Large Language Monkeys: Scaling Inference Compute with Repeated Sampling;** Bradley Brown et al 714 | - **Stream of Search (SoS): Learning to Search in Language;** Kanishk Gandhi et al 715 | - **Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation;** Rohin Manvi et al 716 | - **Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning;** Amrith Setlur et al 717 | - **EXACT: TEACHING AI AGENTS TO EXPLORE WITH REFLECTIVE-MCTS AND EXPLORATORY LEARNING;** Xiao Yu et al 718 | - **Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search;** Jinhao Jiang et al 719 | - **KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS;** Kimi Team 720 | - **DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning;** DeepSeek-AI 721 | - **s1: Simple test-time scaling;** Niklas Muennighoff et al 722 | - **Demystifying Long Chain-of-Thought Reasoning in LLMs;** Edward Yeo et al 723 | - **Can 1B LLM Surpass 405B LLM?
Rethinking Compute-Optimal Test-Time Scaling;** Runze Liu et al 724 | - **Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?;** Zhiyuan Zeng et al 725 | - **Scaling Test-Time Compute Without Verification or RL is Suboptimal;** Amrith Setlur et al 726 | - **LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!;** Dacheng Li et al 727 | - **Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning;** Tian Xie et al 728 | - **Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach;** Jonas Geiping et al 729 | - **Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs;** Kanishk Gandhi et al 730 | - **Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models;** Wenxuan Huang et al 731 | - **R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning;** Huatong Song et al 732 | - **Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning;** Yuxiao Qu et al 733 | - **Understanding R1-Zero-Like Training: A Critical Perspective;** Zichen Liu et al 734 | - **Reasoning to Learn from Latent Thoughts;** Yangjun Ruan et al 735 | - **ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning;** Mingyang Chen et al 736 | - **Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators;** Seungone Kim et al 737 | - **SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild;** Weihao Zeng et al 738 | - **Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model;** Jingcheng Hu et al 739 | - **Climbing the Ladder of Reasoning: What LLMs Can—and Still Can’t—Solve after SFT?;** Yiyou Sun et al 740 | - **Learning Adaptive Parallel Reasoning with Language Models;** Jiayi Pan et al 741 | - **TTRL: Test-Time Reinforcement Learning;** Yuxin Zuo et al 742 | - **Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?** Yang Yue et al 743 | - **Reinforcement Learning for Reasoning in Large Language Models with One Training Example;** Yiping Wang et al 744 | - **DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition;** Z.Z. Ren et al 745 | - **Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers;** Kusha Sareen et al 746 | - **Absolute Zero: Reinforced Self-play Reasoning with Zero Data;** Andrew Zhao et al 747 | - **RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning;** Kaiwen Zha et al 748 | - **Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective;** Zhoujun Cheng et al 749 | - **QWENLONG-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning;** Fanqi Wan et al 750 | - **AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning;** Yang Chen et al 751 | - **AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy;** Zihan Liu et al 752 | - **Skywork Open Reasoner 1 Technical Report;** Jujie He et al 753 | - **Thinking vs. 
Doing: Agents that Reason by Scaling Test-Time Interaction;** Junhong Shen et al 754 | - **Skywork-R1V3 Technical Report;** Multimodal Team, Skywork AI 755 | - **Reinforcement Pre-Training;** Qingxiu Dong et al 756 | - **Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training;** NVIDIA 757 | - **MEGASCIENCE: PUSHING THE FRONTIERS OF POST-TRAINING DATASETS FOR SCIENCE REASONING;** Run-Ze Fan et al 758 | - **Group Sequence Policy Optimization;** Chujie Zheng et al 759 | - **AGENTIC REINFORCED POLICY OPTIMIZATION;** Guanting Dong et al 760 | - **AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?;** Ori Press et al 761 | - **WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning;** Kuan Li et al 762 | - **PUSHING TEST-TIME SCALING LIMITS OF DEEP SEARCH WITH ASYMMETRIC VERIFICATION;** Weihao Zeng et al 763 | - **PretrainZero: Reinforcement Active Pretraining;** Xingrun Xing et al 764 | 765 | 766 | 767 | 768 | **Vision-Language or Vision-Only** 769 | 770 | - **Visual-RFT: Visual Reinforcement Fine-Tuning;** Ziyu Liu et al 771 | - **LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL;** Yingzhe Peng et al 772 | - **MM-EUREKA: EXPLORING VISUAL AHA MOMENT WITH RULE-BASED LARGE-SCALE REINFORCEMENT LEARNING;** Fanqing Meng et al 773 | - **DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding;** Xinyu Ma et al 774 | - **Video-R1: Reinforcing Video Reasoning in MLLMs;** Kaituo Feng et al 775 | - **Boosting MLLM Reasoning with Text-Debiased Hint-GRPO;** Qihan Huang et al 776 | - **Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning;** Chris et al 777 | - **ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO;** Muzhi Zhu et al 778 | - **DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models;** Chenbin Pan et al 779 | - **Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models;** Shilin Xu et al 780 | - **Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning;** Yana Wei et al 781 | - **RLPR: EXTRAPOLATING RLVR TO GENERAL DOMAINS WITHOUT VERIFIERS;** Tianyu Yu et al 782 | - **GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning;** GLM-V Team 783 | - **Scaling RL to Long Videos;** Yukang Chen et al 784 | - **Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning;** Ang Li et al 785 | - **VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning;** Ruifeng Yuan et al 786 | - **Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning;** Hao Shao et al 787 | 788 | 789 | 790 | **Image in Thoughts** 791 | 792 | - **GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs;** Shixian Luo et al 793 | - **DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning;** Ziwei Zheng et al 794 | - **Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning;** Alex Su et al 795 | - **Visual Planning: Let’s Think Only with Images;** Yi Xu et al 796 | - **Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing;** Junfei Wu et al 797 | 798 | 799 | 800 | 801 | **Latent Space** 802 | 803 | - **Training Large Language Models to Reason
in a Continuous Latent Space;** Shibo Hao et al 804 | - **Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space;** Zhen Zhang et al 805 | - **Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens;** Zeyuan Yang et al 806 | - **LATENT VISUAL REASONING;** Bangzheng Li et al 807 | 808 | **Pre-Training** 809 | 810 | - **RLP: Reinforcement as a Pretraining Objective;** Ali Hatamizadeh et al 811 | 812 | 813 | 814 | **Understanding & Analysis** 815 | 816 | - **The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning;** Shivam Agarwal et al 817 | - **Maximizing Confidence Alone Improves Reasoning;** Mihir Prabhudesai et al 818 | - **Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning;** Shenzhi Wang et al 819 | - **Spurious Rewards: Rethinking Training Signals in RLVR;** Rulin Shao et al 820 | - **The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models;** Ganqu Cui et al 821 | - **REASONING WITH EXPLORATION: AN ENTROPY PERSPECTIVE;** Daixuan Cheng et al 822 | - **REINFORCEMENT FINE-TUNING NATURALLY MITIGATES FORGETTING IN CONTINUAL POST-TRAINING;** Song Lai et al 823 | - **REASONING OR MEMORIZATION? UNRELIABLE RESULTS OF REINFORCEMENT LEARNING DUE TO DATA CONTAMINATION;** Mingqi Wu et al 824 | 825 | 826 | ## Long-context LLM 827 | 828 | - **UniMem: Towards a Unified View of Long-Context Large Language Models;** Junjie Fang et al 829 | - **Data Engineering for Scaling Language Models to 128K Context;** Yao Fu et al 830 | - **How to Train Long-Context Language Models (Effectively);** Tianyu Gao et al 831 | - **Qwen2.5-1M Technical Report;** Qwen Team, Alibaba Group 832 | 833 | 834 | 835 | 836 | 837 | ## LLM Foundation 838 | 839 | - **QWEN TECHNICAL REPORT;** Jinze Bai et al 840 | - **DeepSeek LLM: Scaling Open-Source Language Models with Longtermism;** DeepSeek-AI 841 | - **Retentive Network: A Successor to Transformer for Large Language Models;** Yutao Sun et al 842 | - **Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models;** Mayee F. Chen et al 843 | - **Secrets of RLHF in Large Language Models Part I: PPO;** Rui Zheng et al 844 | - **EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education;** Yuhao Dan et al 845 | - **WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct;** Haipeng Luo et al 846 | - **SlimPajama-DC: Understanding Data Combinations for LLM Training;** Zhiqiang Shen et al 847 | - **LMSYS-CHAT-1M: A LARGE-SCALE REAL-WORLD LLM CONVERSATION DATASET;** Lianmin Zheng et al 848 | - **Mistral 7B;** Albert Q. 
Jiang et al 849 | - **Tokenizer Choice For LLM Training: Negligible or Crucial?;** Mehdi Ali et al 850 | - **ZEPHYR: DIRECT DISTILLATION OF LM ALIGNMENT;** Lewis Tunstall et al 851 | - **LEMUR: HARMONIZING NATURAL LANGUAGE AND CODE FOR LANGUAGE AGENTS;** Yiheng Xu et al 852 | - **System 2 Attention (is something you might need too);** Jason Weston et al 853 | - **Camels in a Changing Climate: Enhancing LM Adaptation with TÜLU 2;** Hamish Ivison et al 854 | - **The Falcon Series of Open Language Models;** Ebtesam Almazrouei et al 855 | - **LLM360: Towards Fully Transparent Open-Source LLMs;** Zhengzhong Liu et al 856 | - **OLMO: Accelerating the Science of Language Models;** Dirk Groeneveld et al 857 | - **InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning;** Huaiyuan Ying et al 858 | - **Gemma: Open Models Based on Gemini Research and Technology;** Gemma Team 859 | - **StarCoder 2 and The Stack v2: The Next Generation;** Anton Lozhkov et al 860 | - **Yi: Open Foundation Models by 01.AI;** 01.AI 861 | - **InternLM2 Technical Report;** Zheng Cai et al 862 | - **Jamba: A Hybrid Transformer-Mamba Language Model;** Opher Lieber et al 863 | - **JetMoE: Reaching Llama2 Performance with 0.1M Dollars;** Yikang Shen et al 864 | - **RHO-1: Not All Tokens Are What You Need;** Zhenghao Lin et al 865 | - **DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model;** DeepSeek-AI 866 | - **MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series;** M-A-P 867 | - **Nemotron-4 340B Technical Report;** NVIDIA 868 | - **Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models;** Tianwen Wei et al 869 | - **Gemma 2: Improving Open Language Models at a Practical Size;** Gemma Team 870 | - **ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools;** Team GLM 871 | - **Reuse, Don’t Retrain: A Recipe for Continued Pretraining of Language Models;** Jupinder Parmar et al 872 | - **QWEN2 TECHNICAL REPORT;** An Yang et al 873 | - **Apple Intelligence Foundation Language Models;** Apple 874 | - **Jamba-1.5: Hybrid Transformer-Mamba Models at Scale;** Jamba Team 875 | - **The Llama 3 Herd of Models;** Llama Team, AI @ Meta 876 | - **QWEN2.5-MATH TECHNICAL REPORT: TOWARD MATHEMATICAL EXPERT MODEL VIA SELF IMPROVEMENT;** An Yang et al 877 | - **Qwen2.5-Coder Technical Report;** Binyuan Hui et al 878 | - **TÜLU 3: Pushing Frontiers in Open Language Model Post-Training;** Nathan Lambert et al 879 | - **Phi-4 Technical Report;** Marah Abdin et al 880 | - **Byte Latent Transformer: Patches Scale Better Than Tokens;** Artidoro Pagnoni et al 881 | - **Qwen2.5 Technical Report;** Qwen Team 882 | - **DeepSeek-V3 Technical Report;** DeepSeek-AI 883 | - **2 OLMo 2 Furious;** OLMo Team 884 | - **Titans: Learning to Memorize at Test Time;** Ali Behrouz et al 885 | - **Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs;** Microsoft et al 886 | - **Command A: An Enterprise-Ready Large Language Model;** Cohere 887 | - **MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining;** Xiaomi LLM-Core Team 888 | - **Qwen3 Technical Report;** Qwen Team 889 | - **Seed-Coder: Let the Code Model Curate Data for Itself;** ByteDance Seed 890 | - **MiniCPM4: Ultra-Efficient LLMs on End Devices;** MiniCPM Team 891 | - **Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities;** 
Gemini Team, Google 892 | - **Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance;** Falcon LLM Team 893 | - **KIMI K2: OPEN AGENTIC INTELLIGENCE;** Kimi Team 894 | - **Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data;** Syeda Nahida Akter et al 895 | 896 | 897 | 898 | ## Scaling Law 899 | 900 | - **Scaling Laws for Neural Language Models;** Jared Kaplan et al 901 | - **Explaining Neural Scaling Laws;** Yasaman Bahri et al 902 | - **Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws;** Zeyuan Allen-Zhu et al 903 | - **Training Compute-Optimal Large Language Models;** Jordan Hoffmann et al 904 | - **Scaling Laws for Autoregressive Generative Modeling;** Tom Henighan et al 905 | - **Scaling Laws for Generative Mixed-Modal Language Models;** Armen Aghajanyan et al 906 | - **Beyond neural scaling laws: beating power law scaling via data pruning;** Ben Sorscher et al 907 | - **Scaling Vision Transformers;** Xiaohua Zhai et al 908 | - **Scaling Laws for Reward Model Overoptimization;** Leo Gao et al 909 | - **Scaling Laws from the Data Manifold Dimension;** Utkarsh Sharma et al 910 | - **BROKEN NEURAL SCALING LAWS;** Ethan Caballero et al 911 | - **Scaling Laws for Transfer;** Danny Hernandez et al 912 | - **Scaling Data-Constrained Language Models;** Niklas Muennighoff et al 913 | - **Revisiting Neural Scaling Laws in Language and Vision;** Ibrahim Alabdulmohsin et al 914 | - **SCALE EFFICIENTLY: INSIGHTS FROM PRE-TRAINING AND FINE-TUNING TRANSFORMERS;** Yi Tay et al 915 | - **Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer;** Greg Yang et al 916 | - **Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster;** Nolan Dey et al 917 | - **Language models scale reliably with over-training and on downstream tasks;** Samir Yitzhak Gadre et al 918 | - **Unraveling the Mystery of Scaling Laws: Part I;** Hui Su et al 919 | - **nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales;** Yiqun Yao et al 920 | - **Understanding Emergent Abilities of Language Models from the Loss Perspective;** Zhengxiao Du et al 921 | - **Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance;** Jiasheng Ye et al 922 | - **The Fine Line: Navigating Large Language Model Pretraining with Down streaming Capability Analysis;** Chen Yang et al 923 | - **UNLOCK PREDICTABLE SCALING FROM EMERGENT ABILITIES;** Shengding Hu et al 924 | - **Scaling Laws for Data Filtering—Data Curation cannot be Compute Agnostic;** Sachin Goyal et al 925 | - **A Large-Scale Exploration of µ-Transfer;** Lucas Dax Lingle et al 926 | - **Observational Scaling Laws and the Predictability of Language Model Performance;** Yangjun Ruan et al 927 | - **Selecting Large Language Model to Fine-tune via Rectified Scaling Law;** Haowei Lin et al 928 | - **D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models;** Haoran Que et al 929 | - **Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms;** Rafael Rafailov et al 930 | - **Scaling and evaluating sparse autoencoders;** Leo Gao et al 931 | - **Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?;** Rylan Schaeffer et al 932 | - **REGMIX: Data Mixture as Regression for Language Model Pre-training;** Qian Liu et al 933 | - **Scaling Retrieval-Based Language Models with a Trillion-Token 
Datastore;** Rulin Shao et al 934 | - **Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies;** Chaofan Tao et al 935 | - **AutoScale–Automatic Prediction of Compute-optimal Data Composition for Training LLMs;** Feiyang Kang et al 936 | - **ARE BIGGER ENCODERS ALWAYS BETTER IN VISION LARGE MODELS?;** Bozhou Li et al 937 | - **An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models;** Yangzhen Wu et al 938 | - **Scaling Laws for Data Poisoning in LLMs;** Dillon Bowen et al 939 | - **SCALING LAW WITH LEARNING RATE ANNEALING;** Howe Tissue et al 940 | - **Optimization Hyper-parameter Laws for Large Language Models;** Xingyu Xie et al 941 | - **Scaling Laws and Interpretability of Learning from Repeated Data;** Danny Hernandez et al 942 | - **U-SHAPED AND INVERTED-U SCALING BEHIND EMERGENT ABILITIES OF LARGE LANGUAGE MODELS;** Tung-Yu Wu et al 943 | - **Revisiting the Superficial Alignment Hypothesis;** Mohit Raghavendra et al 944 | - **SCALING LAWS FOR DIFFUSION TRANSFORMERS;** Zhengyang Liang et al 945 | - **ADAPTIVE DATA OPTIMIZATION: DYNAMIC SAMPLE SELECTION WITH SCALING LAWS;** Yiding Jiang et al 946 | - **Scaling Laws for Predicting Downstream Performance in LLMs;** Yangyi Chen et al 947 | - **SCALING LAWS FOR PRE-TRAINING AGENTS AND WORLD MODELS;** Tim Pearce et al 948 | - **Towards Precise Scaling Laws for Video Diffusion Transformers;** Yuanyang Yin et al 949 | - **Predicting Emergent Capabilities by Finetuning;** Charlie Snell et al 950 | - **Densing Law of LLMs;** Chaojun Xiao et al 951 | - **Establishing Task Scaling Laws via Compute-Efficient Model Ladders;** Akshita Bhagia et al 952 | - **SLOTH: SCALING LAWS FOR LLM SKILLS TO PREDICT MULTI-BENCHMARK PERFORMANCE ACROSS FAMILIES;** Felipe Maia Polo et al 953 | - **LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws;** Prasanna Mayilvahanan et al 954 | - **Predictable Scale: Part I — Optimal Hyperparameter Scaling Law in Large Language Model Pretraining;** Houyi Li et al 955 | - **A MULTI-POWER LAW FOR LOSS CURVE PREDICTION ACROSS LEARNING RATE SCHEDULES;** Kairong Luo et al 956 | - **Scaling Laws of Synthetic Data for Language Models;** Zeyu Qin et al 957 | - **Compression Laws for Large Language Models;** Ayan Sengupta et al 958 | - **Scaling Laws for Native Multimodal Models;** Mustafa Shukor et al 959 | - **Parallel Scaling Law for Language Models;** Mouxiang Chen et al 960 | - **Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training;** Shane Bergsma et al 961 | - **Farseer: A Refined Scaling Law in Large Language Models;** Houyi Li et al 962 | - **The Art of Scaling Reinforcement Learning Compute for LLMs;** Devvrit Khatri et al 963 | 964 | 965 | **MoE** 966 | 967 | - **UNIFIED SCALING LAWS FOR ROUTED LANGUAGE MODELS;** Aidan Clark et al 968 | - **SCALING LAWS FOR SPARSELY-CONNECTED FOUNDATION MODELS;** Elias Frantar et al 969 | - **Toward Inference-optimal Mixture-of-Expert Large Language Models;** Longfei Yun et al 970 | - **SCALING LAWS FOR FINE-GRAINED MIXTURE OF EXPERTS;** Jakub Krajewski et al 971 | - **Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent;** Tencent Hunyuan Team 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | ## LLM Data Engineering 980 | 981 | - **Textbooks Are All You Need II: phi-1.5 technical report;** Yuanzhi Li et al 982 | - **Orca: Progressive Learning from Complex Explanation Traces of GPT-4;** Subhabrata Mukherjee et al 983 | - **Symbol-LLM: Towards Foundational Symbol-centric 
Interface For Large Language Models;** Fangzhi Xu et al 984 | - **Orca 2: Teaching Small Language Models How to Reason;** Arindam Mitra et al 985 | - **REST MEETS REACT: SELF-IMPROVEMENT FOR MULTI-STEP REASONING LLM AGENT;** Renat Aksitov et al 986 | - **WHAT MAKES GOOD DATA FOR ALIGNMENT? A COMPREHENSIVE STUDY OF AUTOMATIC DATA SELECTION IN INSTRUCTION TUNING;** Wei Liu et al 987 | - **ChatQA: Building GPT-4 Level Conversational QA Models;** Zihan Liu et al 988 | - **AGENTOHANA: DESIGN UNIFIED DATA AND TRAINING PIPELINE FOR EFFECTIVE AGENT LEARNING;** Jianguo Zhang et al 989 | - **Advancing LLM Reasoning Generalists with Preference Trees;** Lifan Yuan et al 990 | - **WILDCHAT: 1M CHATGPT INTERACTION LOGS IN THE WILD;** Wenting Zhao et al 991 | - **MAmmoTH2: Scaling Instructions from the Web;** Xiang Yue et al 992 | - **Scaling Synthetic Data Creation with 1,000,000,000 Personas;** Xin Chan et al 993 | - **AgentInstruct: Toward Generative Teaching with Agentic Flows;** Arindam Mitra et al 994 | - **Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models — The Story Goes On;** Liang Zeng et al 995 | - **WizardLM: Empowering Large Language Models to Follow Complex Instructions;** Can Xu et al 996 | - **Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena;** Haipeng Luo et al 997 | - **Automatic Instruction Evolving for Large Language Models;** Weihao Zeng et al 998 | - **AgentInstruct: Toward Generative Teaching with Agentic Flows;** Arindam Mitra et al 999 | - **Does your data spark joy? Performance gains from domain upsampling at the end of training;** Cody Blakeney et al 1000 | - **ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws;** Ruihang Li et al 1001 | - **BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline;** Guosheng Dong et al 1002 | - **How Does Code Pretraining Affect Language Model Task Performance?;** Jackson Petty et al 1003 | - **DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models;** Ranchi Zhao et al 1004 | - **Data Selection via Optimal Control for Language Models;** Yuxian Gu et al 1005 | - **HARNESSING WEBPAGE UIS FOR TEXT-RICH VISUAL UNDERSTANDING;** Junpeng Liu et al 1006 | - **Aioli: A unified optimization framework for language model data mixing;** Mayee F. 
Chen et al 1007 | - **HOW TO SYNTHESIZE TEXT DATA WITHOUT MODEL COLLAPSE?;** Xuekai Zhu et al 1008 | - **Predictive Data Selection: The Data That Predicts Is the Data That Teaches;** Kashun Shum et al 1009 | - **DataDecide: How to Predict Best Pretraining Data with Small Experiments;** Ian Magnusson et al 1010 | - **Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data;** Yudong Wang et al 1011 | - **Scaling Physical Reasoning with the PHYSICS Dataset;** Shenghe Zheng et al 1012 | - **OpenThoughts: Data Recipes for Reasoning Models;** Etash Guha et al 1013 | - **BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining;** DatologyAI Team 1014 | - **Scaling Agents via Continual Pre-training;** Liangcai Su et al 1015 | 1016 | 1017 | 1018 | ## VLM Data Engineering 1019 | 1020 | - **Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models;** Lei Li et al 1021 | - **MANTIS: Interleaved Multi-Image Instruction Tuning;** Dongfu Jiang et al 1022 | - **ShareGPT4V: Improving Large Multi-Modal Models with Better Captions;** Lin Chen et al 1023 | - **ShareGPT4Video: Improving Video Understanding and Generation with Better Captions;** Lin Chen et al 1024 | - **OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text;** Qingyun Li et al 1025 | - **MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens;** Anas Awadalla et al 1026 | - **PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents;** Junjie Wang et al 1027 | - **Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs;** Sukmin Yun et al 1028 | - **MAVIS: Mathematical Visual Instruction Tuning;** Renrui Zhang et al 1029 | - **MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions;** Xuan Ju et al 1030 | - **On Pre-training of Multimodal Language Models Customized for Chart Understanding;** Wan-Cyuan Fan et al 1031 | - **MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity;** Yangzhou Liu et al 1032 | - **VILA^2: VILA Augmented VILA;** Yunhao Fang et al 1033 | - **VIDGEN-1M: A LARGE-SCALE DATASET FOR TEXT-TO-VIDEO GENERATION;** Zhiyu Tan et al 1034 | - **MMEVOL: EMPOWERING MULTIMODAL LARGE LANGUAGE MODELS WITH EVOL-INSTRUCT;** Run Luo et al 1035 | - **InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning;** Xiaotian Han et al 1036 | - **MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning;** Haotian Zhang et al 1037 | - **DATACOMP: In search of the next generation of multimodal datasets;** Samir Yitzhak Gadre et al 1038 | - **VIDEO INSTRUCTION TUNING WITH SYNTHETIC DATA;** Yuanhan Zhang et al 1039 | - **REVISIT LARGE-SCALE IMAGE-CAPTION DATA IN PRETRAINING MULTIMODAL FOUNDATION MODELS;** Zhengfeng Lai et al 1040 | - **CompCap: Improving Multimodal Large Language Models with Composite Captions;** Xiaohui Chen et al 1041 | - **MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale;** Jarvis Guo et al 1042 | - **BIGDOCS: AN OPEN AND PERMISSIVELY-LICENSED DATASET FOR TRAINING MULTIMODAL MODELS ON DOCUMENT AND CODE TASKS;** Juan Rodriguez et al 1043 | - **VisionArena: 230K Real World User-VLM Conversations with Preference Labels;** Christopher Chou et al 1044 | - **DIVING INTO SELF-EVOLVING TRAINING FOR MULTIMODAL REASONING;** Wei Liu et al 1045 | - **Eagle 2: Building Post-Training
Data Strategies from Scratch for Frontier Vision-Language Models;** Zhiqi Li et al 1046 | - **Scaling Pre-training to One Hundred Billion Data for Vision Language Models;** Xiao Wang et al 1047 | - **MM-RLHF The Next Step Forward in Multimodal LLM Alignment;** Yi-Fan Zhang et al 1048 | - **TASKGALAXY: SCALING MULTI-MODAL INSTRUCTION FINE-TUNING WITH TENS OF THOUSANDS VISION TASK TYPES;** Jiankang Chen et al 1049 | - **Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation;** Yue Yang et al 1050 | - **GneissWeb: Preparing High Quality Data for LLMs at Scale;** Hajar Emami Gohari et al 1051 | - **OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference;** Xiangyu Zhao et al 1052 | - **SpiritSight Agent: Advanced GUI Agent with One Look;** Zhiyuan Huang et al 1053 | - **R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization;** Yi Yang et al 1054 | - **SHOULD VLMS BE PRE-TRAINED WITH IMAGE DATA?;** Sedrick Keh et al 1055 | - **Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation;** Yuheng Zha et al 1056 | 1057 | 1058 | ## Alignment 1059 | 1060 | - **AI Alignment Research Overview;** Jacob Steinhardt 1061 | - **Language Model Alignment with Elastic Reset;** Michael Noukhovitch et al 1062 | - **Alignment for Honesty;** Yuqing Yang et al 1063 | - **Align on the Fly: Adapting Chatbot Behavior to Established Norms;** Chunpu Xu et al 1064 | - **Combining weak-to-strong generalization with scalable oversight A high-level view on how this new approach fits into our alignment plans;** JAN LEIKE 1065 | - **SLEEPER AGENTS: TRAINING DECEPTIVE LLMS THAT PERSIST THROUGH SAFETY TRAINING;** Evan Hubinger et al 1066 | - **Towards Efficient and Exact Optimization of Language Model Alignment;** Haozhe Ji et al 1067 | - **Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction;** Jiaming Ji et al 1068 | - **DeAL: Decoding-time Alignment for Large Language Models;** James Y. Huang et al 1069 | - **Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping;** Haoyu Wang et al 1070 | - **Dissecting Human and LLM Preferences;** Junlong Li1 et al 1071 | - **Reformatted Alignment;** Run-Ze Fan et al 1072 | - **Capability or Alignment? Respect the LLM Base Model’s Capability During Alignment;** Jingfeng Yang 1073 | - **Learning or Self-aligning? Rethinking Instruction Fine-tuning;** Mengjie Ren et al 1074 | - **Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models;** Yi Luo et al 1075 | - **Weak-to-Strong Extrapolation Expedites Alignment;** Chujie Zheng et al 1076 | - **Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization;** Wenkai Yang et al 1077 | - **Baichuan Alignment Technical Report;** Mingan Lin et al 1078 | 1079 | 1080 | 1081 | 1082 | 1083 | ## Scalable Oversight & SuperAlignment 1084 | 1085 | - **Supervising strong learners by amplifying weak experts;** Paul Christiano et al 1086 | - **Deep Reinforcement Learning from Human Preferences;** Paul F Christiano et al 1087 | - **AI safety via debate;** Geoffrey Irving et al 1088 | - **Scalable agent alignment via reward modeling: a research direction;** Jan Leike et al 1089 | - **Recursively Summarizing Books with Human Feedback;** Jeff Wu et al 1090 | - **Self-critiquing models for assisting human evaluators;** William Saunders et al 1091 | - **Measuring Progress on Scalable Oversight for Large Language Models;** Samuel R. 
Bowman et al 1092 | - **Debate Helps Supervise Unreliable Experts;** Julian Michael et al 1093 | - **WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION;** Collin Burns et al 1094 | - **Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models;** Zixiang Chen et al 1095 | - **Discovering Language Model Behaviors with Model-Written Evaluations;** Ethan Perez et al 1096 | - **Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models;** Hongzhan Lin et al 1097 | - **Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate;** Tian Liang et al 1098 | - **Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning;** Jitao Sang et al 1099 | - **PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations;** Ruosen Li et al 1100 | - **Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models;** Jianyuan Guo et al 1101 | - **Improving Factuality and Reasoning in Language Models through Multiagent Debate;** Yilun Du et al 1102 | - **CHATEVAL: TOWARDS BETTER LLM-BASED EVALUATORS THROUGH MULTI-AGENT DEBATE;** Chi-Min Chan et al 1103 | - **Debating with More Persuasive LLMs Leads to More Truthful Answers;** Akbir Khan et al 1104 | - **Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision;** Zhiqing Sun et al 1105 | - **The Unreasonable Effectiveness of Easy Training Data for Hard Tasks;** Peter Hase et al 1106 | - **LLM Critics Help Catch LLM Bugs;** Nat McAleese et al 1107 | - **On scalable oversight with weak LLMs judging strong LLMs;** Zachary Kenton et al 1108 | - **PROVER-VERIFIER GAMES IMPROVE LEGIBILITY OF LLM OUTPUTS;** Jan Hendrik Kirchner et al 1109 | - **TRAINING LANGUAGE MODELS TO WIN DEBATES WITH SELF-PLAY IMPROVES JUDGE ACCURACY;** Samuel Arnesen et al 1110 | - **Great Models Think Alike and this Undermines AI Oversight;** Shashwat Goel et al 1111 | 1112 | 1113 | 1114 | 1115 | ## RL Foundation 1116 | 1117 | - **Proximal Policy Optimization Algorithms;** John Schulman et al 1118 | - **PREFERENCES IMPLICIT IN THE STATE OF THE WORLD;** Rohin Shah et al 1119 | - **Hindsight Experience Replay;** Marcin Andrychowicz et al 1120 | - **Learning to Reach Goals via Iterated Supervised Learning;** Dibya Ghosh et al 1121 | - **The Wisdom of Hindsight Makes Language Models Better Instruction Followers;** Tianjun Zhang et al 1122 | - **REWARD UNCERTAINTY FOR EXPLORATION IN PREFERENCE-BASED REINFORCEMENT LEARNING;** Xinran Liang et al 1123 | 1124 | **With Foundation Model** 1125 | 1126 | - **Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs;** Arash Ahmadian et al 1127 | - **DAPO: An Open-Source LLM Reinforcement Learning System at Scale;** ByteDance Seed 1128 | - **VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks;** ByteDance Seed 1129 | - **TAPERED OFF-POLICY REINFORCE Stable and efficient reinforcement learning for LLMs;** Nicolas Le Roux et al 1130 | - **AlphaEvolve: A coding agent for scientific and algorithmic discovery;** Alexander Novikov et al 1131 | - **AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?;** Ori Press et al 1132 | - **Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models;** Zhipeng Chen et al 1133 | - **InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling;** Peiji Li et al 1134 | - **VERLTOOL: TOWARDS 
HOLISTIC AGENTIC REINFORCEMENT LEARNING WITH TOOL USE;** Dongfu Jiang et al 1135 | - **Reinforcement Learning on Pre-Training Data;** Siheng Li et al 1136 | - **Defeating the Training-Inference Mismatch via FP16;** Penghui Qi et al 1137 | - **Stabilizing Reinforcement Learning with LLMs: Formulation and Practices;** Chujie Zheng et al 1138 | - **DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning;** Zhihong Shao et al 1139 | 1140 | 1141 | ## Beyond Bandit 1142 | 1143 | - **ZERO-SHOT GOAL-DIRECTED DIALOGUE VIA RL ON IMAGINED CONVERSATIONS;** Joey Hong et al 1144 | - **LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models;** Marwa Abdulhai et al 1145 | - **ELICITING HUMAN PREFERENCES WITH LANGUAGE MODELS;** Belinda Z. Li et al 1146 | - **ITERATED DECOMPOSITION: IMPROVING SCIENCE Q&A BY SUPERVISING REASONING PROCESSES;** Justin Reppert et al 1147 | - **Let’s Verify Step by Step;** Hunter Lightman et al 1148 | - **Solving math word problems with process and outcome-based feedback;** Jonathan Uesato et al 1149 | - **EMPOWERING LANGUAGE MODELS WITH ACTIVE INQUIRY FOR DEEPER UNDERSTANDING;** Jing-Cheng Pang et al 1150 | - **Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models;** Zhiyuan Hu et al 1151 | - **Tell Me More! Towards Implicit User Intention Understanding of Language Model-Driven Agents;** Cheng Qian et al 1152 | - **MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues;** Ge Bai et al 1153 | - **ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL;** Yifei Zhou et al 1154 | - **Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games;** Yizhe Zhang et al 1155 | - **STaR-GATE: Teaching Language Models to Ask Clarifying Questions;** Chinmaya Andukuri et al 1156 | - **Bayesian Preference Elicitation with Language Models;** Kunal Handa et al 1157 | - **Multi-turn Reinforcement Learning from Preference Human Feedback;** Lior Shani et al 1158 | - **Improve Mathematical Reasoning in Language Models by Automated Process Supervision;** Liangchen Luo et al 1159 | - **MODELING FUTURE CONVERSATION TURNS TO TEACH LLMS TO ASK CLARIFYING QUESTIONS;** Michael J.Q.
Zhang et al 1160 | 1161 | ## Agent 1162 | 1163 | - **Generative Agents: Interactive Simulacra of Human Behavior;** Joon Sung Park et al 1164 | - **SWIFTSAGE: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks;** Bill Yuchen Lin et al 1165 | - **Large Language Model Is Semi-Parametric Reinforcement Learning Agent;** Danyang Zhang et al 1166 | - **The Role of Summarization in Generative Agents: A Preliminary Perspective;** Xiachong Feng et al 1167 | - **CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society;** Guohao Li et al 1168 | - **Plan, Eliminate, and Track-Language Models are Good Teachers for Embodied Agents;** Yue Wu et al 1169 | - **Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents;** Zihao Wang et al 1170 | - **Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory;** Xizhou Zhu et al 1171 | - **TOWARDS A UNIFIED AGENT WITH FOUNDATION MODELS;** Norman Di Palo et al 1172 | - **MotionLM: Multi-Agent Motion Forecasting as Language Modeling;** Ari Seff et al 1173 | - **A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis;** Izzeddin Gur et al 1174 | - **Guide Your Agent with Adaptive Multimodal Rewards;** Changyeon Kim et al 1175 | - **Generative Agents: Interactive Simulacra of Human Behavior;** Joon Sung Park et al 1176 | - **AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents;** Weize Chen et al 1177 | - **METAGPT: META PROGRAMMING FOR MULTI-AGENT COLLABORATIVE FRAMEWORK;** Sirui Hong et al 1178 | - **YOU ONLY LOOK AT SCREENS: MULTIMODAL CHAIN-OF-ACTION AGENTS;** Zhuosheng Zhang et al 1179 | - **SELF: LANGUAGE-DRIVEN SELF-EVOLUTION FOR LARGE LANGUAGE MODEL;** Jianqiao Lu et al 1180 | - **Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond;** Liang Chen et al 1181 | - **A Zero-Shot Language Agent for Computer Control with Structured Reflection;** Tao Li et al 1182 | - **Character-LLM: A Trainable Agent for Role-Playing;** Yunfan Shao et al 1183 | - **CLIN: A CONTINUALLY LEARNING LANGUAGE AGENT FOR RAPID TASK ADAPTATION AND GENERALIZATION;** Bodhisattwa Prasad Majumder et al 1184 | - **FIREACT: TOWARD LANGUAGE AGENT FINE-TUNING;** Baian Chen et al 1185 | - **TrainerAgent: Customizable and Efficient Model Training through LLM-Powered Multi-Agent System;** Haoyuan Li et al 1186 | - **LUMOS: LEARNING AGENTS WITH UNIFIED DATA, MODULAR DESIGN, AND OPEN-SOURCE LLMS;** Da Yin et al 1187 | - **TaskWeaver: A Code-First Agent Framework;** Bo Qiao et al 1188 | - **Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning;** Filippos Christianos et al 1189 | - **AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation;** Qingyun Wu et al 1190 | - **TRUE KNOWLEDGE COMES FROM PRACTICE: ALIGNING LLMS WITH EMBODIED ENVIRONMENTS VIA REINFORCEMENT LEARNING;** Weihao Tan et al 1191 | - **Investigate-Consolidate-Exploit: A General Strategy for Inter-Task Agent Self-Evolution;** Cheng Qian et al 1192 | - **OS-COPILOT: TOWARDS GENERALIST COMPUTER AGENTS WITH SELF-IMPROVEMENT;** Zhiyong Wu et al 1193 | - **LONGAGENT: Scaling Language Models to 128k Context through Multi-Agent Collaboration;** Jun Zhao et al 1194 | - **When is Tree Search Useful for LLM Planning? 
It Depends on the Discriminator;** Ziru Chen et al 1195 | - **DATA INTERPRETER: AN LLM AGENT FOR DATA SCIENCE;** Sirui Hong et al 1196 | - **Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study;** Weihao Tan et al 1197 | - **SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents;** Ruiyi Wang et al 1198 | - **Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models;** Zehui Chen et al 1199 | - **LLM Agent Operating System;** Kai Mei et al 1200 | - **Symbolic Learning Enables Self-Evolving Agents;** Wangchunshu Zhou et al 1201 | - **Executable Code Actions Elicit Better LLM Agents;** Xingyao Wang et al 1202 | - **Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents;** Pranav Putta et al 1203 | - **MindSearch: Mimicking Human Minds Elicits Deep AI Searcher;** Zehui Chen et al 1204 | - **xLAM: A Family of Large Action Models to Empower AI Agent Systems;** Jianguo Zhang et al 1205 | - **Agent-as-a-Judge: Evaluate Agents with Agents;** Mingchen Zhuge et al 1206 | - **Search-o1: Agentic Search-Enhanced Large Reasoning Models;** Xiaoxi Li et al 1207 | - **The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks;** Alejandro Cuadron et al 1208 | - **ARMAP: SCALING AUTONOMOUS AGENTS VIA AUTOMATIC REWARD MODELING AND PLANNING;** Zhenfang Chen et al 1209 | - **SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks;** Yifei Zhou et al 1210 | - **Visual Agentic Reinforcement Fine-Tuning;** Ziyu Liu et al 1211 | - **AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning;** Zhong Zhang et al 1212 | - **GTA1: GUI Test-time Scaling Agent;** Yan Yang et al 1213 | - **OPENCUA: Open Foundations for Computer-Use Agents;** Xinyuan Wang et al 1214 | - **AGENTRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework;** Hanchen Zhang et al 1215 | - **The FM Agent;** Annan Li et al 1216 | 1217 | 1218 | 1219 | **AutoTelic Agent** 1220 | 1221 | - **AUGMENTING AUTOTELIC AGENTS WITH LARGE LANGUAGE MODELS;** Cedric Colas et al 1222 | - **Visual Reinforcement Learning with Imagined Goals;** Ashvin Nair et al 1223 | 1224 | 1225 | **Evaluation** 1226 | 1227 | - **AgentBench: Evaluating LLMs as Agents;** Xiao Liu et al 1228 | - **EVALUATING MULTI-AGENT COORDINATION ABILITIES IN LARGE LANGUAGE MODELS;** Saaket Agashe et al 1229 | - **OpenAgents: AN OPEN PLATFORM FOR LANGUAGE AGENTS IN THE WILD;** Tianbao Xie et al 1230 | - **SMARTPLAY : A BENCHMARK FOR LLMS AS INTELLIGENT AGENTS;** Yue Wu et al 1231 | - **WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?;** Alexandre Drouin et al 1232 | - **Autonomous Evaluation and Refinement of Digital Agents;** Jiayi Pan et al 1233 | 1234 | 1235 | **VL Related Task** 1236 | 1237 | - **LANGNAV: LANGUAGE AS A PERCEPTUAL REPRESENTATION FOR NAVIGATION;** Bowen Pan et al 1238 | - **VIDEO LANGUAGE PLANNING;** Yilun Du et al 1239 | - **Fuyu-8B: A Multimodal Architecture for AI Agents;** Rohan Bavishi et al 1240 | - **GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation;** An Yan et al 1241 | - **Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld;** Yijun Yang et al 1242 | - **STEVE: See and Think: Embodied Agent in Virtual Environment;** Zhonghan Zhao et al 1243 | - **JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models;** Zihao Wang et al 1244 | - **STEVE-EYE: EQUIPPING LLM-BASED EMBODIED 
AGENTS WITH VISUAL PERCEPTION IN OPEN WORLDS;** Sipeng Zheng et al 1245 | - **OCTOPUS: EMBODIED VISION-LANGUAGE PROGRAMMER FROM ENVIRONMENTAL FEEDBACK;** Jingkang Yang et al 1246 | - **CogAgent: A Visual Language Model for GUI Agents;** Wenyi Hong et al 1247 | - **GPT-4V(ision) is a Generalist Web Agent, if Grounded;** Boyuan Zheng et al 1248 | - **WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models;** Hongliang He et al 1249 | - **MOBILE-AGENT: AUTONOMOUS MULTI-MODAL MOBILE DEVICE AGENT WITH VISUAL PERCEPTION;** Junyang Wang et al 1250 | - **V-IRL: Grounding Virtual Intelligence in Real Life;** Jihan Yang et al 1251 | - **An Interactive Agent Foundation Model;** Zane Durante et al 1252 | - **RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis;** Yao Mu et al 1253 | - **Scaling Instructable Agents Across Many Simulated Worlds;** SIMA Team 1254 | - **Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model;** Zhonghan Zhao1 et al 1255 | - **Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs;** Keen You et al 1256 | - **OSWORLD: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments;** Tianbao Xie et al 1257 | - **GUICourse: From General Vision Language Model to Versatile GUI Agent;** Wentong Chen et al 1258 | - **VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents;** Xiao Liu et al 1259 | - **ShowUI: One Vision-Language-Action Model for GUI Visual Agent;** Kevin Qinghong Lin et al 1260 | - **UI-TARS: Pioneering Automated GUI Interaction with Native Agents;** Yujia Qin et al 1261 | - **Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents;** Saaket Agashe et al 1262 | - **Thyme: Think Beyond Images;** Yi-Fan Zhang et al 1263 | 1264 | ## DeepResearch 1265 | 1266 | - **Tongyi DeepResearch Technical Report;** Tongyi DeepResearch Team 1267 | 1268 | 1269 | ## SWE-Agent 1270 | 1271 | ### Agent Framework 1272 | 1273 | - **AutoCodeRover: Autonomous Program Improvement;** Yuntong Zhang et al 1274 | - **CODER: ISSUE RESOLVING WITH MULTI-AGENT AND TASK GRAPHS;** Dong Chen et al 1275 | - **SWE-AGENT: AGENT-COMPUTER INTERFACES ENABLE AUTOMATED SOFTWARE ENGINEERING;** John Yang et al 1276 | - **OpenHands: An Open Platform for AI Software Developers as Generalist Agents;** Xingyao Wang et al 1277 | - **Coding Agents with Multimodal Browsing are Generalist Problem Solvers;** Aditya Bharat Soni et al 1278 | - **Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling;** Trae Research 1279 | 1280 | 1281 | 1282 | ### Component Improvement 1283 | 1284 | - **REPOGRAPH: ENHANCING AI SOFTWARE ENGINEERING WITH REPOSITORY-LEVEL CODE GRAPH;** Siru Ouyang 1285 | - **AGENTLESS: Demystifying LLM-based Software Engineering Agents;** Chunqiu Steven Xia et al 1286 | 1287 | ### Data & Env 1288 | 1289 | - **SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents;** Ibragim Badertdinov et al 1290 | - **SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks;** Lianghong Guo et al 1291 | - **SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling;** Haoran Wang et al 1292 | - **R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents;** Naman Jain et al 1293 | - **SWE-smith: Scaling Data for Software Engineering Agents;** John Yang et al 1294 | - **SWE-Fixer: Training Open-Source LLMs for 
Effective and Efficient GitHub Issue Resolution;** Chengxing Xie et al 1295 | - **Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs;** Liang Zeng et al 1296 | 1297 | 1298 | ### RL 1299 | 1300 | - **Training Software Engineering Agents and Verifiers with SWE-Gym;** Jiayi Pan et al 1301 | - **SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution;** Yuxiang Wei et al 1302 | - **https://nebius.com/blog/posts/training-and-search-for-software-engineering-agents** 1303 | - **Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards;** Jeff Da et al 1304 | - **Introducing Kimi-Dev: A Strong and Open-source Coding LLM for Issue Resolution;** Kimi Team 1305 | - **DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL;** Michael Luo et al 1306 | - **SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution;** Zhenyu He et al 1307 | - **KIMI-DEV: AGENTLESS TRAINING AS SKILL PRIOR FOR SWE-AGENTS;** Zonghan Yang et al 1308 | 1309 | ### Pre-Training/Mid-Training 1310 | 1311 | - **CWM: An Open-Weights LLM for Research on Code Generation with World Models;** Meta FAIR CodeGen Team 1312 | 1313 | 1314 | 1315 | 1316 | ### Benchmark 1317 | 1318 | - **SWE-BENCH: CAN LANGUAGE MODELS RESOLVE REAL-WORLD GITHUB ISSUES?;** Carlos E. Jimenez 1319 | - **SWE-BENCH MULTIMODAL: DO AI SYSTEMS GENERALIZE TO VISUAL SOFTWARE DOMAINS?;** John Yang et al 1320 | - **SWE-Bench Verified;** Neil Chowdhury et al 1321 | - **SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering;** Xuehang Guo et al 1322 | - **Programming with Pixels: Computer-Use Meets Software Engineering;** Pranjal Aggarwal et al 1323 | - **SWE-bench Goes Live!;** Linghao Zhang et al 1324 | - **SWINGARENA: Competitive Programming Arena for Long-context GitHub Issue Solving;** Wendong Xu et al 1325 | - **SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner;** Lei Zhang et al 1326 | - **SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?;** Xiang Deng etal 1327 | 1328 | 1329 | 1330 | ## Evolution 1331 | 1332 | - **SCIENTIFIC ALGORITHM DISCOVERY BY AUGMENTING ALPHAEVOLVE WITH DEEP RESEARCH;** Gang Liu et al 1333 | 1334 | 1335 | 1336 | 1337 | ## Interaction 1338 | 1339 | - **Imitating Interactive Intelligence;** Interactive Agents Group 1340 | - **Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning;** Interactive Agents Team 1341 | - **Evaluating Multimodal Interactive Agents;** Interactive Agents Team 1342 | - **Improving Multimodal Interactive Agents with Reinforcement Learning from Human Feedback;** Interactive Agents Team 1343 | - **LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles;** Shulin Huang et al 1344 | - **BENCHMARKING LARGE LANGUAGE MODELS AS AI RESEARCH AGENTS;** Qian Huang et al 1345 | - **ADAPTING LLM AGENTS THROUGH COMMUNICATION;** Kuan Wang et al 1346 | - **PARROT: ENHANCING MULTI-TURN CHAT MODELS BY LEARNING TO ASK QUESTIONS;** Yuchong Sun et al 1347 | - **LLAMA RIDER: SPURRING LARGE LANGUAGE MODELS TO EXPLORE THE OPEN WORLD;** Yicheng Feng et al 1348 | - **AGENTTUNING: ENABLING GENERALIZED AGENT ABILITIES FOR LLMS;** Aohan Zeng et al 1349 | - **MINT: Evaluating LLMs in Multi-Turn Interaction with Tools and Language Feedback;** Xingyao Wang et al 1350 | - **LLF-Bench: Benchmark for Interactive Learning from Language Feedback;** Ching-An Cheng et al 
1351 | - **MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models;** Wai-Chung Kwan et al 1352 | - **Can large language models explore in-context?;** Akshay Krishnamurthy et al 1353 | - **LARGE LANGUAGE MODELS CAN INFER PERSONALITY FROM FREE-FORM USER INTERACTIONS;** Heinrich Peters et al 1354 | 1355 | ## Critique Modeling 1356 | 1357 | - **Learning Evaluation Models from Large Language Models for Sequence Generation;** Chenglong Wang et al 1358 | - **RETROFORMER: RETROSPECTIVE LARGE LANGUAGE AGENTS WITH POLICY GRADIENT OPTIMIZATION;** Weiran Yao et al 1359 | - **Shepherd: A Critic for Language Model Generation;** Tianlu Wang et al 1360 | - **GENERATING SEQUENCES BY LEARNING TO [SELF-]CORRECT;** Sean Welleck et al 1361 | - **LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked;** Alec Helbling et al 1362 | - **RAIN: Your Language Models Can Align Themselves without Finetuning;** Yuhui Li et al 1363 | - **SYNDICOM: Improving Conversational Commonsense with Error-Injection and Natural Language Feedback;** Christopher Richardson et al 1364 | - **MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models;** Deepak Nathani et al 1365 | - **DON’T THROW AWAY YOUR VALUE MODEL! MAKING PPO EVEN BETTER VIA VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING;** Jiacheng Liu et al 1366 | - **COFFEE: Boost Your Code LLMs by Fixing Bugs with Feedback;** Seungjun Moon et al 1367 | - **Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer;** Bowen Tan et al 1368 | - **Pinpoint, Not Criticize: Refining Large Language Models via Fine-Grained Actionable Feedback;** Wenda Xu et al 1369 | - **Digital Socrates: Evaluating LLMs through explanation critiques;** Yuling Gu et al 1370 | - **Outcome-supervised Verifiers for Planning in Mathematical Reasoning;** Fei Yu et al 1371 | - **Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language;** Di Jin et al 1372 | - **CRITIQUELLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation;** Pei Ke et al 1373 | - **Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment;** Brian Gordon et al 1374 | - **MATH-SHEPHERD: A LABEL-FREE STEP-BY-STEP VERIFIER FOR LLMS IN MATHEMATICAL REASONING;** Peiyi Wang et al 1375 | - **The Critique of Critique;** Shichao Sun et al 1376 | - **LLMCRIT: Teaching Large Language Models to Use Criteria;** Weizhe Yuan et al 1377 | - **Multi-Level Feedback Generation with Large Language Models for Empowering Novice Peer Counselors;** Alicja Chaszczewicz et al 1378 | - **Training LLMs to Better Self-Debug and Explain Code;** Nan Jiang et al 1379 | - **Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision;** Zhiheng Xi et al 1380 | - **Self-Generated Critiques Boost Reward Modeling for Language Models;** Yue Yu et al 1381 | 1382 | 1383 | 1384 | ## MoE/Specialized 1385 | 1386 | - **OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER;** Noam Shazeer et al 1387 | - **Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity;** William Fedus et al 1388 | - **DEMIX Layers: Disentangling Domains for Modular Language Modeling;** Suchin Gururangan et al 1389 | - **ModuleFormer: Learning Modular Large Language Models From Uncurated Data;** Yikang Shen et al 1390 | - **Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models;** Sheng Shen et al 1391 |
- **From Sparse to Soft Mixtures of Experts;** Joan Puigcerver et al 1392 | - **SELF-SPECIALIZATION: UNCOVERING LATENT EXPERTISE WITHIN LARGE LANGUAGE MODELS;** Junmo Kang et al 1393 | - **HOW ABILITIES IN LARGE LANGUAGE MODELS ARE AFFECTED BY SUPERVISED FINE-TUNING DATA COMPOSITION;** Guanting Dong et al 1394 | - **OPENWEBMATH: AN OPEN DATASET OF HIGH-QUALITY MATHEMATICAL WEB TEXT;** Keiran Paster et al 1395 | - **LLEMMA: AN OPEN LANGUAGE MODEL FOR MATHEMATICS;** Zhangir Azerbayev et al 1396 | - **Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models;** Keming Lu et al 1397 | - **Mixtral of Experts;** Albert Q. Jiang et al 1398 | - **DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models;** Damai Dai et al 1399 | - **MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts;** Maciej Pioro et al 1400 | - **OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models;** Fuzhao Xue et al 1401 | - **Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM;** Sainbayar Sukhbaatar et al 1402 | - **Multi-Head Mixture-of-Experts;** Xun Wu et al 1403 | - **X FT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts;** Yifeng Ding et al 1404 | - **BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts;** Qizhen Zhang et al 1405 | - **OLMoE: Open Mixture-of-Experts Language Models;** Niklas Muennighoff et al 1406 | 1407 | ## Vision-Language Foundation Model 1408 | 1409 | ### First Generation: Using region-based features; can be classified as one- and two- streams model architectures; Before 2020.6; 1410 | 1411 | - **Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs;** Emanuele Bugliarello et al; A meta-analysis of the first generation VL models and a unified framework. 1412 | - **Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers;** Lisa Anne Hendricks et al 1413 | - **ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;** Jiasen Lu et al 1414 | - **LXMERT: Learning Cross-Modality Encoder Representations from Transformers;** Hao Tan et al 1415 | - **VISUALBERT: A SIMPLE AND PERFORMANT BASELINE FOR VISION AND LANGUAGE;** Liunian Harold Li et al 1416 | - **UNITER: UNiversal Image-TExt Representation Learning;** Yen-Chun Chen et al 1417 | - **VL-BERT: PRE-TRAINING OF GENERIC VISUAL-LINGUISTIC REPRESENTATIONS;** Weijie Su et al 1418 | - **IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA;** Di Qi et al 1419 | - **Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training;** Gen Li et al 1420 | - **UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning;** Wei Li et al; Motivate to use unimodal data to improve the performance of VL tasks. 1421 | 1422 | **Introduce image tags to learn image-text alignments.** 1423 | 1424 | - **Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;** Xiujun Li et al 1425 | - **VinVL: Revisiting Visual Representations in Vision-Language Models;** Pengchuan Zhang et al 1426 | - **Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions;** Liunian Harold Li et al; Consider the unsupervised setting. 
1427 | - **Tag2Text: Guiding Vision-Language Model via Image Tagging;** Xinyu Huang et al 1428 | 1429 | ### Second Generation: Get rid of ROI and object detectors for acceleration; Moving to large pretraining datasets; Moving to unified architectures for understanding and generation tasks; Mostly before 2022.6. 1430 | 1431 | - **An Empirical Study of Training End-to-End Vision-and-Language Transformers;** Zi-Yi Dou et al; Meta-analysis. Investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. 1432 | - **Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers;** Zhicheng Huang et al; Throw away region-based features, bounding boxes, and object detectors. Directly input the raw pixels and use CNN to extract features. 1433 | - **ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;** Wonjae Kim et al; Get rid of the heavy computation of ROI and CNN by utilizing ViT. 1434 | - **Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning;** Zhicheng Huang et al 1435 | - **E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning;** Haiyang Xu et al; Get rid of bounding boxes; Introduce object detection and image captioning as pretraining tasks with an encoder-decoder structure. 1436 | - **Align before Fuse: Vision and Language Representation Learning with Momentum Distillation;** Junnan Li et al; Propose ALBEF. 1437 | - **SimVLM: Simple Visual Language Model Pre-training with Weak Supervision;** Zirui Wang et al; Get rid of bounding boxes; Further argue that the pretraining objectives are complicated and not scalable; Consider the zero-shot behaviors that emerge from pretraining on large datasets. 1438 | - **UFO: A UniFied TransfOrmer for Vision-Language Representation Learning;** Jianfeng Wang et al 1439 | - **VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts;** Hangbo Bao et al; Introduce the mixture-of-experts method to model text and image separately and use a specific expert to learn the cross-modal fusion (Multiway Transformer), which is later adopted by BEiT-3; Ensure better image-text retrieval (performance & speed) and VL tasks; 1440 | - **Learning Transferable Visual Models From Natural Language Supervision;** Alec Radford et al; Using large noisy pretraining datasets. 1441 | - **Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision;** Chao Jia et al; Using large noisy pretraining datasets. 1442 | - **FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING;** Lewei Yao et al; Further improve CLIP & ALIGN by introducing fine-grained alignments.
1443 | - **PERCEIVER IO: A GENERAL ARCHITECTURE FOR STRUCTURED INPUTS & OUTPUTS;** Andrew Jaegle et al 1444 | - **X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages;** Feilong Chen et al 1445 | 1446 | **Special designs tailored to enhance the position encoding & grounding.** 1447 | 1448 | - **UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;** Zhengyuan Yang et al 1449 | - **PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models;** Yuan Yao et al; Introduce explicit object position modeling, e.g., "A woman <310 mask 406 475> is watching the mask <175 86 254 460>". 1450 | - **GLIPv2: Unifying Localization and VL Understanding;** Haotian Zhang et al; Further show that GLIP's pretraining method can benefit the VL task (Unifying localization and understanding). 1451 | - **DesCo: Learning Object Recognition with Rich Language Descriptions;** Liunian Harold Li et al 1452 | 1453 | **Motivate to use unpaired image & text data to build a unified model for VL, vision, and language tasks and potentially bring better performance.** 1454 | 1455 | - **Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks;** Xizhou Zhu et al; Siamese network to encode various modalities. 1456 | - **FLAVA: A Foundational Language And Vision Alignment Model;** Amanpreet Singh et al; A unified backbone model (need task-specific heads) for NLP, CV, and VL tasks. 1457 | - **UNIMO-2: End-to-End Unified Vision-Language Grounded Learning;** Wei Li et al; Design a new method "Grounded Dictionary Learning", similar to the sense of "continuous" image tags to align two modalities. 1458 | 1459 | ### Third Generation: Chasing one unified/general/generalist model to include more VL/NLP/CV tasks; Becoming larger & stronger; 2022->Now. 1460 | 1461 | - **BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;** Junnan Li et al; New unified architecture and new method to generate and then filter captions. 1462 | - **OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;** Peng Wang et al; A unified model (framework) to handle text, image, and image-text tasks. 1463 | - **Webly Supervised Concept Expansion for General Purpose Vision Models;** Amita Kamath et al 1464 | - **Language Models are General-Purpose Interfaces;** Yaru Hao et al 1465 | - **GIT: A Generative Image-to-text Transformer for Vision and Language;** Jianfeng Wang et al 1466 | - **CoCa: Contrastive Captioners are Image-Text Foundation Models;** Jiahui Yu et al 1467 | - **Flamingo: a Visual Language Model for Few-Shot Learning;** Jean-Baptiste Alayrac et al; Designed for few-shot learning. 1468 | - **Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks;** Wenhui Wang et al; BEIT-3. 1469 | - **OmniVL: One Foundation Model for Image-Language and Video-Language Tasks;** Junke Wang et al; Support both image-language and video-language tasks and show the positive transfer in three modalities. 1470 | - **Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks;** Hao Li et al; Propose a generalist model that can also handle object detection and instance segmentation tasks.
1471 | - **X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks;** Yan Zeng et al; Propose a unified model for image-language and video-text-language tasks; Modeling the fine-grained alignments between image regions and descriptions. 1472 | - **Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks;** Xinsong Zhang et al 1473 | - **mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video;** Haiyang Xu et al 1474 | - **KOSMOS-2: Grounding Multimodal Large Language Models to the World;** Zhiliang Peng et al 1475 | - **PaLI-X: On Scaling up a Multilingual Vision and Language Model;** Xi Chen et al 1476 | - **UNIFIED LANGUAGE-VISION PRETRAINING WITH DYNAMIC DISCRETE VISUAL TOKENIZATION;** Yang Jin et al 1477 | - **PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER;** Xi Chen et al 1478 | 1479 | **Generalist models** 1480 | 1481 | - **UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS;** Jiasen Lu et al; Examine whether a single unified model can solve a variety of tasks (NLP, CV, VL) simultaneously; Construct a massive multi-tasking dataset by ensembling 95 datasets from 62 publicly available data sources, including Image Synthesis, Keypoint Estimation, Depth Estimation, Object Segmentation, et al; Focusing on multi-task fine-tuning. 1482 | - **Generalized Decoding for Pixel, Image, and Language;** Xueyan Zou et al 1483 | - **Foundation Transformers;** Hongyu Wang et al; Propose a new unified architecture. 1484 | - **A Generalist Agent;** Scott Reed et al 1485 | - **PaLM-E: An Embodied Multimodal Language Model;** Danny Driess et al 1486 | - **IMAGEBIND: One Embedding Space To Bind Them All;** Rohit Girdhar et al 1487 | 1488 | ### Fourth Generation: Relying on LLMs and instruction tuning 1489 | 1490 | - **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;** Junnan Li et al 1491 | - **Grounding Language Models to Images for Multimodal Inputs and Outputs;** Jing Yu Koh et al 1492 | - **Language Is Not All You Need: Aligning Perception with Language Models;** Shaohan Huang et al 1493 | - **Otter: A Multi-Modal Model with In-Context Instruction Tuning;** Bo Li et al 1494 | - **Visual Instruction Tuning;** Haotian Liu et al 1495 | - **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models;** Deyao Zhu et al 1496 | - **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning;** Wenliang Dai et al 1497 | - **LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model;** Peng Gao et al 1498 | - **LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding;** Yanzhe Zhang et al 1499 | - **MultiModal-GPT: A Vision and Language Model for Dialogue with Humans;** Tao Gong et al 1500 | - **GPT-4 Technical Report;** OpenAI 1501 | - **mPLUG-Owl : Modularization Empowers Large Language Models with Multimodality;** Qinghao Ye et al 1502 | - **VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks;** Wenhai Wang et al 1503 | - **PandaGPT: One Model To Instruction-Follow Them All;** Yixuan Su et al 1504 | - **Generating Images with Multimodal Language Models;** Jing Yu Koh et al 1505 | - **What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?;** Yan Zeng et al 1506 | - **GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest;** Shilong Zhang et al 1507 | - **Generative Pretraining in Multimodality;** Quan 
Sun et al 1508 | - **Planting a SEED of Vision in Large Language Model;** Yuying Ge et al 1509 | - **ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning;** Liang Zhao et al 1510 | - **Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning;** Lili Yu et al 1511 | - **The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World;** Weiyun Wang et al 1512 | - **EMPOWERING VISION-LANGUAGE MODELS TO FOLLOW INTERLEAVED VISION-LANGUAGE INSTRUCTIONS;** Juncheng Li et al 1513 | - **RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension;** Qiang Zhou et al 1514 | - **LISA: REASONING SEGMENTATION VIA LARGE LANGUAGE MODEL;** Xin Lai et al 1515 | - **Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities;** Jinze Bai et al 1516 | - **InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4;** Lai Wei et al 1517 | - **StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data;** Yanda Li et al 1518 | - **Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages;** Jinyi Hu et al 1519 | - **MMICL: EMPOWERING VISION-LANGUAGE MODEL WITH MULTI-MODAL IN-CONTEXT LEARNING;** Haozhe Zhao et al 1520 | - **An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models;** Yadong Lu et al 1521 | - **ALIGNING LARGE MULTIMODAL MODELS WITH FACTUALLY AUGMENTED RLHF;** Zhiqing Sun et al 1522 | - **Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants;** Tianyu Yu et al 1523 | - **AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model;** Seungwhan Moon et al 1524 | - **InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition;** Pan Zhang et al 1525 | - **HALLE-SWITCH: RETHINKING AND CONTROLLING OBJECT EXISTENCE HALLUCINATIONS IN LARGE VISION LANGUAGE MODELS FOR DETAILED CAPTION;** Bohan Zhai et al 1526 | - **Improved Baselines with Visual Instruction Tuning;** Haotian Liu et al 1527 | - **Fuyu-8B: A Multimodal Architecture for AI Agents;** Rohan Bavishi et al 1528 | - **MINIGPT-5: INTERLEAVED VISION-AND-LANGUAGE GENERATION VIA GENERATIVE VOKENS;** Kaizhi Zheng et al 1529 | - **Making LLaMA SEE and Draw with SEED Tokenizer;** Yuying Ge et al 1530 | - **To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning;** Junke Wang et al 1531 | - **TEAL: TOKENIZE AND EMBED ALL FOR MULTI-MODAL LARGE LANGUAGE MODELS;** Zhen Yang et al 1532 | - **mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration;** Qinghao Ye et al 1533 | - **LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge;** Gongwei Chen et al 1534 | - **OtterHD: A High-Resolution Multi-modality Model;** Bo Li et al 1535 | - **PerceptionGPT: Effectively Fusing Visual Perception into LLM;** Renjie Pi et al 1536 | - **OCTAVIUS: MITIGATING TASK INTERFERENCE IN MLLMS VIA MOE;** Zeren Chen et al 1537 | - **COGVLM: VISUAL EXPERT FOR LARGE LANGUAGE MODELS;** Weihan Wang et al 1538 | - **Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models;** Zhang Li et al 1539 | - **Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts;** Jialin Wu et al 1540 | - **SILKIE: PREFERENCE DISTILLATION FOR LARGE VISUAL LANGUAGE MODELS;** Lei Li et al 1541 | - **GLaMM: Pixel Grounding Large Multimodal Model;** Hanoona Rasheed et al 
1542 | - **TEXTBIND: MULTI-TURN INTERLEAVED MULTIMODAL INSTRUCTION-FOLLOWING IN THE WILD;** Huayang Li et al 1543 | - **DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback;** Yangyi Chen et al 1544 | - **Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding;** Peng Jin et al 1545 | - **LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models;** Hao Zhang et al 1546 | - **mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model;** Anwen Hu et al 1547 | - **Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models;** Haoning Wu et al 1548 | - **SPHINX: THE JOINT MIXING OF WEIGHTS, TASKS, AND VISUAL EMBEDDINGS FOR MULTI-MODAL LARGE LANGUAGE MODELS;** Ziyi Lin et al 1549 | - **DeepSpeed-VisualChat: Multi Round Multi Images Interleave Chat via Multi-Modal Casual Attention;** Zhewei Yao et al 1550 | - **Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models;** Haoran Wei et al 1551 | - **Osprey: Pixel Understanding with Visual Instruction Tuning;** Yuqian Yuan et al 1552 | - **Generative Multimodal Models are In-Context Learners;** Quan Sun et al 1553 | - **Gemini: A Family of Highly Capable Multimodal Models;** Gemini Team 1554 | - **CaMML: Context-Aware Multimodal Learner for Large Models;** Yixin Chen et al 1555 | - **MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer;** Changyao Tian et al 1556 | - **InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models;** Xiaoyi Dong et al 1557 | - **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models;** Bin Lin et al 1558 | - **MouSi: Poly-Visual-Expert Vision-Language Models;** Xiaoran Fan et al 1559 | - **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models;** Peng Gao et al 1560 | - **Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models;** Gen Luo et al 1561 | - **DeepSeek-VL: Towards Real-World Vision-Language Understanding;** Haoyu Lu et al 1562 | - **UniCode: Learning a Unified Codebook for Multimodal Large Language Models;** Sipeng Zheng et al 1563 | - **MoAI: Mixture of All Intelligence for Large Language and Vision Models;** Byung-Kwan Lee et al 1564 | - **LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images;** Ruyi Xu et al 1565 | - **Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models;** Yanwei Li et al 1566 | - **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD;** Xiaoyi Dong et al 1567 | - **Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models;** Haotian Zhang et al 1568 | - **Self-Supervised Visual Preference Alignment;** Ke Zhu et al 1569 | - **How Far Are We to GPT-4V? 
Closing the Gap to Commercial Multimodal Models with Open-Source Suites;** Zhe Chen et al 1570 | - **CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts;** Jiachen Li et al 1571 | - **Libra: Building Decoupled Vision System on Large Language Models;** Yifan Xu et al 1572 | - **Chameleon: Mixed-Modal Early-Fusion Foundation Models;** Chameleon Team 1573 | - **Towards Semantic Equivalence of Tokenization in Multimodal LLM;** Shengqiong Wu et al 1574 | - **Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models;** Wenhao Shi et al 1575 | - **VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks;** Jiannan Wu et al 1576 | - **LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models;** Feng Li et al 1577 | - **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output;** Pan Zhang et al 1578 | - **PaliGemma: A versatile 3B VLM for transfer;** Lucas Beyer et al 1579 | - **TokenPacker: Efficient Visual Projector for Multimodal LLM;** Wentong Li et al 1580 | - **MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts;** Xi Victoria Lin et al 1581 | - **mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models;** Jiabo Ye et al 1582 | - **LLaVA-OneVision: Easy Visual Task Transfer;** Bo Li et al 1583 | - **xGen-MM (BLIP-3): A Family of Open Large Multimodal Models;** Le Xue et al 1584 | - **CogVLM2: Visual Language Models for Image and Video Understanding;** Wenyi Hong et al 1585 | - **MiniCPM-V: A GPT-4V Level MLLM on Your Phone;** Yuan Yao et al 1586 | - **General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model;** Haoran Wei et al 1587 | - **looongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture;** Xidong Wang et al 1588 | - **NVLM: Open Frontier-Class Multimodal LLMs;** Wenliang Dai et al 1589 | - **Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution;** Peng Wang et al 1590 | - **Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models;** Matt Deitke et al 1591 | - **ARIA: An Open Multimodal Native Mixture-of-Experts Model;** Dongxu Li et al 1592 | - **Pixtral 12B;** Pravesh Agrawal et al 1593 | - **RECONSTRUCTIVE VISUAL INSTRUCTION TUNING;** Haochen Wang et al 1594 | - **DEEM: DIFFUSION MODELS SERVE AS THE EYES OF LARGE LANGUAGE MODELS FOR IMAGE PERCEPTION;** Run Luo et al 1595 | - **Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models;** Weixin Liang et al 1596 | - **Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling;** Zhe Chen et al 1597 | - **DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding;** Zhiyu Wu et al 1598 | - **Qwen2.5-VL Technical Report;** Qwen Team, Alibaba Group 1599 | - **Gemma 3 Technical Report;** Gemma Team, Google DeepMind 1600 | - **Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources;** Weizhi Wang et al 1601 | - **SmolVLM: Redefining small and efficient multimodal models;** Andrés Marafioti et al 1602 | - **KIMI-VL TECHNICAL REPORT;** Kimi Team 1603 | - **InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models;** Jinguo Zhu et al 1604 | - **Seed1.5-VL Technical Report;** ByteDance Seed 1605 | - **MiMo-VL Technical 
Report;** LLM-Core Xiaomi 1606 | - **MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipes;** Tianyu Yu et al 1607 | - **Emu3.5: Native Multimodal Models are World Learners;** Emu3.5 Team 1608 | 1609 | 1610 | 1611 | ### Unified Understanding and Generation 1612 | 1613 | - **DREAMLLM: SYNERGISTIC MULTIMODAL COMPREHENSION AND CREATION;** Runpei Dong et al 1614 | - **SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation;** Yuying Ge et al 1615 | - **VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation;** Yecheng Wu et al 1616 | - **TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation;** Liao Qu et al 1617 | - **Emu3: Next-Token Prediction is All You Need;** Emu3 Team 1618 | - **JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation;** Yiyang Ma et al 1619 | - **BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset;** Jiuhai Chen 1620 | - **Qwen2.5-Omni Technical Report;** Qwen Team 1621 | - **Transfer between Modalities with MetaQueries;** Xichen Pan et al 1622 | - **MetaMorph: Multimodal Understanding and Generation via Instruction Tuning;** Shengbang Tong et al 1623 | - **Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model;** Chunting Zhou et al 1624 | - **SHOW-O: ONE SINGLE TRANSFORMER TO UNIFY MULTIMODAL UNDERSTANDING AND GENERATION;** Jinheng Xie et al 1625 | - **JETFORMER: AN AUTOREGRESSIVE GENERATIVE MODEL OF RAW IMAGES AND TEXT;** Michael Tschannen et al 1626 | - **Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation;** Zhiyang Xu et al 1627 | - **Ming-Omni: A Unified Multimodal Model for Perception and Generation;** Inclusion AI, Ant Group 1628 | - **Show-o2: Improved Native Unified Multimodal Models;** Jinheng Xie et al 1629 | - **Emerging Properties in Unified Multimodal Pretraining;** Chaorui Deng et al 1630 | 1631 | 1632 | 1633 | ### Unified Architecture 1634 | 1635 | - **Unveiling Encoder-Free Vision-Language Models;** Haiwen Diao et al 1636 | - **MONO-INTERNVL: PUSHING THE BOUNDARIES OF MONOLITHIC MULTIMODAL LARGE LANGUAGE MODELS WITH ENDOGENOUS VISUAL PRE-TRAINING;** Gen Luo et al 1637 | - **EVEv2: Improved Baselines for Encoder-Free Vision-Language Models;** Haiwen Diao et al 1638 | - **HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding;** Chenxin Tao et al 1639 | - **A Single Transformer for Scalable Vision-Language Modeling;** Yangyi Chen et al 1640 | - **Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation;** Bencheng Liao et al 1641 | - **The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer;** Weixian Lei et al 1642 | - **Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding;** Tao Zhang et al 1643 | - **FROM PIXELS TO WORDS – TOWARDS NATIVE VISIONLANGUAGE PRIMITIVES AT SCALE;** Haiwen Diao et al 1644 | 1645 | 1646 | 1647 | 1648 | ### Others 1649 | 1650 | - **Unified Vision-Language Pre-Training for Image Captioning and VQA;** Luowei Zhou et al 1651 | - **Unifying Vision-and-Language Tasks via Text Generation;** Jaemin Cho et al 1652 | - **MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound;** Rowan Zellers et al 1653 | - **CLIP-Event: Connecting Text and Images with Event Structures;** Manling Li et al; The new model 
CLIP-Event is specifically designed for multi-modal event extraction; it introduces new pretraining tasks to enable strong zero-shot performance and moves from object-centric to event-centric representations. 1654 | - **Scaling Vision-Language Models with Sparse Mixture of Experts;** Sheng Shen et al 1655 | - **MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks;** Weicheng Kuo et al 1656 | - **Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning;** Zhiyang Xu et al 1657 | 1658 | ## Vision-Language Model Application 1659 | 1660 | - **VISION-LANGUAGE FOUNDATION MODELS AS EFFECTIVE ROBOT IMITATORS;** Xinghang Li et al 1661 | - **LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing;** Wei-Ge Chen et al 1662 | - **Vision-Language Models as a Source of Rewards;** Kate Baumli et al 1663 | - **SELF-IMAGINE: EFFECTIVE UNIMODAL REASONING WITH MULTIMODAL MODELS USING SELF-IMAGINATION;** Syeda Nahida Akter et al 1664 | - **Code as Reward: Empowering Reinforcement Learning with VLMs;** David Venuto et al 1665 | - **MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark;** Dongping Chen et al 1666 | - **PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs;** Soroush Nasiriany et al 1667 | - **LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning;** Dantong Niu et al 1668 | - **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding;** Tao Zhang et al 1669 | 1670 | 1671 | 1672 | ## Vision-Language Model Analysis & Evaluation 1673 | 1674 | - **What Makes for Good Visual Tokenizers for Large Language Models?;** Guangzhi Wang et al 1675 | - **LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models;** Peng Xu et al 1676 | - **MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models;** Chaoyou Fu et al 1677 | - **JourneyDB: A Benchmark for Generative Image Understanding;** Junting Pan et al 1678 | - **MMBench: Is Your Multi-modal Model an All-around Player?;** Yuan Liu et al 1679 | - **SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension;** Bohao Li et al 1680 | - **Tiny LVLM-eHub: Early Multimodal Experiments with Bard;** Wenqi Shao et al 1681 | - **MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities;** Weihao Yu et al 1682 | - **VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use;** Yonatan Bitton et al 1683 | - **TouchStone: Evaluating Vision-Language Models by Language Models;** Shuai Bai et al 1684 | - **Investigating the Catastrophic Forgetting in Multimodal Large Language Models;** Yuexiang Zhai et al 1685 | - **DEMYSTIFYING CLIP DATA;** Hu Xu et al 1686 | - **Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models;** Yangyi Chen et al 1687 | - **REFORM-EVAL: EVALUATING LARGE VISION LANGUAGE MODELS VIA UNIFIED RE-FORMULATION OF TASK-ORIENTED BENCHMARKS;** Zejun Li et al 1688 | - **REVO-LION: EVALUATING AND REFINING VISION LANGUAGE INSTRUCTION TUNING DATASETS;** Ning Liao et al 1689 | - **BEYOND TASK PERFORMANCE: EVALUATING AND REDUCING THE FLAWS OF LARGE MULTIMODAL MODELS WITH IN-CONTEXT LEARNING;** Mustafa Shukor et al 1690 | - **Grounded Intuition of GPT-Vision’s Abilities with Scientific Images;** Alyssa Hwang et al 1691 | - **Holistic Evaluation of Text-to-Image Models;** Tony Lee et al 1692 | - **CORE-MM: COMPLEX OPEN-ENDED REASONING EVALUATION FOR MULTI-MODAL LARGE LANGUAGE MODELS;**
Xiaotian Han et al 1693 | - **HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models;** Tianrui Guan et al 1694 | - **SEED-Bench-2: Benchmarking Multimodal Large Language Models;** Bohao Li et al 1695 | - **MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI;** Xiang Yue et al 1696 | - **MATHVISTA: EVALUATING MATH REASONING IN VISUAL CONTEXTS WITH GPT-4V, BARD, AND OTHER LARGE MULTIMODAL MODELS;** Pan Lu et al 1697 | - **VILA: On Pre-training for Visual Language Models;** Ji Lin et al 1698 | - **TUNING LAYERNORM IN ATTENTION: TOWARDS EFFICIENT MULTI-MODAL LLM FINETUNING;** Bingchen Zhao et al 1699 | - **Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs;** Shengbang Tong et al 1700 | - **FROZEN TRANSFORMERS IN LANGUAGE MODELS ARE EFFECTIVE VISUAL ENCODER LAYERS;** Ziqi Pang et al 1701 | - **Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models;** Siddharth Karamcheti et al 1702 | - **Design2Code: How Far Are We From Automating Front-End Engineering?;** Chenglei Si et al 1703 | - **MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training;** Brandon McKinzie et al 1704 | - **MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?;** Renrui Zhang et al 1705 | - **Are We on the Right Way for Evaluating Large Vision-Language Models?;** Lin Chen et al 1706 | - **MMInA: Benchmarking Multihop Multimodal Internet Agents;** Ziniu Zhang et al 1707 | - **A Multimodal Automated Interpretability Agent;** Tamar Rott Shaham et al 1708 | - **MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI;** Kaining Ying et al 1709 | - **ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models;** Shuo Liu et al 1710 | - **Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models;** Piotr Padlewski et al 1711 | - **What matters when building vision-language models?;** Hugo Laurençon et al 1712 | - **Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions;** Junzhang Liu et al 1713 | - **Needle In A Multimodal Haystack;** Weiyun Wang et al 1714 | - **MUIRBENCH: A Comprehensive Benchmark for Robust Multi-image Understanding;** Fei Wang et al 1715 | - **VideoGUI: A Benchmark for GUI Automation from Instructional Videos;** Kevin Qinghong Lin et al 1716 | - **MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs;** Ziyu Liu et al 1717 | - **Task Me Anything;** Jieyu Zhang et al 1718 | - **MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos;** Xuehai He et al 1719 | - **WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?;** Runqi Qiao et al 1720 | - **MMEVALPRO: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation;** Jinsheng Huang et al 1721 | - **MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs;** Yusu Qian et al 1722 | - **Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?;** Ruisheng Cao et al 1723 | - **LLAVADI: What Matters For Multimodal Large Language Models Distillation;** Shilin Xu et al 1724 | - **CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models;** Junda Wu et al 1725 | - 
**MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models;** Fanqing Meng et al 1726 | - **UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling;** Haider Al-Tahan et al 1727 | - **MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?;** Yi-Fan Zhang et al 1728 | - **LAW OF VISION REPRESENTATION IN MLLMS;** Shijia Yang et al 1729 | - **MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark;** Xiang Yue et al 1730 | - **WINDOWSAGENTARENA: EVALUATING MULTI-MODAL OS AGENTS AT SCALE;** Rogerio Bonatti et al 1731 | - **INTRIGUING PROPERTIES OF LARGE LANGUAGE AND VISION MODELS;** Young-Jun Lee et al 1732 | - **VHELM: A Holistic Evaluation of Vision Language Models;** Tony Lee et al 1733 | - **DECIPHERING CROSS-MODAL ALIGNMENT IN LARGE VISION-LANGUAGE MODELS WITH MODALITY INTEGRATION RATE;** Qidong Huang et al 1734 | - **MMCOMPOSITION: REVISITING THE COMPOSITIONALITY OF PRE-TRAINED VISION-LANGUAGE MODELS;** Hang Hua et al 1735 | - **MMIE: MASSIVE MULTIMODAL INTERLEAVED COMPREHENSION BENCHMARK FOR LARGE VISION-LANGUAGE MODELS;** Peng Xia et al 1736 | - **MEGA-BENCH: SCALING MULTIMODAL EVALUATION TO OVER 500 REAL-WORLD TASKS;** Jiacheng Chen et al 1737 | - **HUMANEVAL-V: EVALUATING VISUAL UNDERSTANDING AND REASONING ABILITIES OF LARGE MULTIMODAL MODELS THROUGH CODING TASKS;** Fengji Zhang et al 1738 | - **VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models;** Lei Li et al 1739 | - **Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training;** Shijian Wang et al 1740 | - **Apollo: An Exploration of Video Understanding in Large Multimodal Models;** Orr Zohar et al 1741 | - **LLAVA-MINI: EFFICIENT IMAGE AND VIDEO LARGE MULTIMODAL MODELS WITH ONE VISION TOKEN;** Shaolei Zhang et al 1742 | - **MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency;** Dongzhi Jiang et al 1743 | - **ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models;** Jonathan Roberts et al 1744 | - **Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation;** Jihai Zhang et al 1745 | 1746 | 1747 | 1748 | 1749 | ## Multimodal Foundation Model 1750 | 1751 | - **MotionGPT: Human Motion as a Foreign Language;** Biao Jiang et al 1752 | - **Meta-Transformer: A Unified Framework for Multimodal Learning;** Yiyuan Zhang et al 1753 | - **3D-LLM: Injecting the 3D World into Large Language Models;** Yining Hong et al 1754 | - **BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs;** Yang Zhao et al 1755 | - **VIT-LENS: Towards Omni-modal Representations;** Weixian Lei et al 1756 | - **LLASM: LARGE LANGUAGE AND SPEECH MODEL;** Yu Shu et al 1757 | - **Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following;** Ziyu Guo et al 1758 | - **NExT-GPT: Any-to-Any Multimodal LLM;** Shengqiong Wu et al 1759 | - **ImageBind-LLM: Multi-modality Instruction Tuning;** Jiaming Han et al 1760 | - **LAURAGPT: LISTEN, ATTEND, UNDERSTAND, AND REGENERATE AUDIO WITH GPT;** Jiaming Wang et al 1761 | - **AN EMBODIED GENERALIST AGENT IN 3D WORLD;** Jiangyong Huang et al 1762 | - **VIT-LENS-2: Gateway to Omni-modal Intelligence;** Weixian Lei et al 1763 | - **CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation;** Zineng Tang 
et al 1764 | - **X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning;** Artemis Panagopoulou et al 1765 | - **Merlin: Empowering Multimodal LLMs with Foresight Minds;** En Yu et al 1766 | - **OneLLM: One Framework to Align All Modalities with Language;** Jiaming Han et al 1767 | - **Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action;** Jiasen Lu et al 1768 | - **WORLD MODEL ON MILLION-LENGTH VIDEO AND LANGUAGE WITH RINGATTENTION;** Hao Liu et al 1769 | - **LLMBind: A Unified Modality-Task Integration Framework;** Bin Zhu et al 1770 | - **Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts;** Yunxin Li et al 1771 | - **4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities;** Roman Bachmann et al 1772 | - **video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models;** Guangzhi Sun et al 1773 | - **Explore the Limits of Omni-modal Pretraining at Scale;** Yiyuan Zhang et al 1774 | - **Autoregressive Speech Synthesis without Vector Quantization;** Lingwei Meng et al 1775 | - **E5-V: Universal Embeddings with Multimodal Large Language Models;** Ting Jiang et al 1776 | - **VITA: Towards Open-Source Interactive Omni Multimodal LLM;** Chaoyou Fu et al 1777 | - **Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming;** Zhifei Xie et al 1778 | - **Moshi: a speech-text foundation model for real-time dialogue;** Alexandre Défossez et al 1779 | - **LATENT ACTION PRETRAINING FROM VIDEOS;** Seonghyeon Ye et al 1780 | - **SCALING SPEECH-TEXT PRE-TRAINING WITH SYNTHETIC INTERLEAVED DATA;** Aohan Zeng et al 1781 | - **Exploring the Potential of Encoder-free Architectures in 3D LMMs;** Yiwen Tang et al 1782 | - **Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning;** NVIDIA 1783 | - **GR00T N1: An Open Foundation Model for Generalist Humanoid Robots;** NVIDIA 1784 | 1785 | 1786 | 1787 | 1788 | ## Image Generation 1789 | 1790 | - **Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors;** Oran Gafni et al 1791 | - **Modeling Image Composition for Complex Scene Generation;** Zuopeng Yang et al 1792 | - **ReCo: Region-Controlled Text-to-Image Generation;** Zhengyuan Yang et al 1793 | - **Going Beyond Nouns With Vision & Language Models Using Synthetic Data;** Paola Cascante-Bonilla et al 1794 | - **GUIDING INSTRUCTION-BASED IMAGE EDITING VIA MULTIMODAL LARGE LANGUAGE MODELS;** Tsu-Jui Fu et al 1795 | - **KOSMOS-G: Generating Images in Context with Multimodal Large Language Models;** Xichen Pan et al 1796 | - **DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning;** Abhay Zala et al 1797 | - **LLMGA: Multimodal Large Language Model based Generation Assistant;** Bin Xia et al 1798 | - **ChatIllusion: Efficient-Aligning Interleaved Generation ability with Visual Instruction Model;** Xiaowei Chi et al 1799 | - **Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition;** Chun-Hsiao Yeh et al 1800 | - **ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation;** Ethan Chern et al 1801 | - **SEED-Story: Multimodal Long Story Generation with Large Language Model;** Shuai Yang et al 1802 | - **JPEG-LM: LLMs as Image Generators with Canonical Codec Representations;** Xiaochuang Han et al 1803 | - **OmniGen: Unified Image Generation;** Shitao Xiao et al 1804 | - **SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through
Pretraining, SFT, and RL;** Junke Wang et al 1805 | - **PixelFlow: Pixel-Space Generative Models with Flow;** Shoufa Chen et al 1806 | - **ENVISIONING BEYOND THE PIXELS: BENCHMARKING REASONING-INFORMED VISUAL EDITING;** Xiangyu Zhao et al 1807 | 1808 | 1809 | 1810 | 1811 | ## Diffusion 1812 | 1813 | - **Denoising Diffusion Probabilistic Models;** Jonathan Ho et al 1814 | - **GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models;** Alex Nichol et al 1815 | - **Diffusion Models Beat GANs on Image Synthesis;** Prafulla Dhariwal et al 1816 | - **One-step Diffusion with Distribution Matching Distillation;** Tianwei Yin et al 1817 | - **SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis;** Dustin Podell et al 1818 | - **Denoising Autoregressive Representation Learning;** Yazhe Li et al 1819 | - **Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis;** Wan-Cyuan Fan et al 1820 | - **UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild;** Can Qin et al 1821 | - **Autoregressive Image Generation without Vector Quantization;** Tianhong Li et al 1822 | - **All are Worth Words: A ViT Backbone for Diffusion Models;** Fan Bao et al 1823 | - **Scalable Diffusion Models with Transformers;** William Peebles et al 1824 | - **Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers;** Katherine Crowson et al 1825 | - **Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers;** Peng Gao et al 1826 | - **FiT: Flexible Vision Transformer for Diffusion Model;** Zeyu Lu et al 1827 | - **Scaling Diffusion Transformers to 16 Billion Parameters;** Zhengcong Fei et al 1828 | - **Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget;** Vikash Sehwag et al 1829 | - **DIFFUSION FEEDBACK HELPS CLIP SEE BETTER;** Wenxuan Wang et al 1830 | - **Imagen 3;** Imagen 3 Team, Google 1831 | - **MONOFORMER: ONE TRANSFORMER FOR BOTH DIFFUSION AND AUTOREGRESSION;** Chuyang Zhao et al 1832 | - **Movie Gen: A Cast of Media Foundation Models;** The Movie Gen team @ Meta 1833 | - **FLUID: SCALING AUTOREGRESSIVE TEXT-TO-IMAGE GENERATIVE MODELS WITH CONTINUOUS TOKENS;** Lijie Fan et al 1834 | - **Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion;** Emiel Hoogeboom et al 1835 | - **Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps;** Nanye Ma et al 1836 | 1837 | **Language Modeling** 1838 | 1839 | - **MMaDA: Multimodal Large Diffusion Language Models;** Ling Yang et al 1840 | - **DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation;** Shansan Gong et al 1841 | - **Mercury: Ultra-Fast Language Models Based on Diffusion;** Samar Khanna et al 1842 | 1843 | 1844 | 1845 | ## Document Understanding 1846 | 1847 | - **LayoutLM: Pre-training of Text and Layout for Document Image Understanding;** Yiheng Xu et al 1848 | - **LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding;** Yang Xu et al 1849 | - **LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking;** Yupan Huang et al 1850 | - **StrucTexT: Structured Text Understanding with Multi-Modal Transformers;** Yulin Li et al 1851 | - **LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding;** Jiapeng Wang et al 1852 | - **PIX2STRUCT: SCREENSHOT PARSING AS PRETRAINING FOR VISUAL LANGUAGE
UNDERSTANDING;** Kenton Lee et al 1853 | - **Unifying Vision, Text, and Layout for Universal Document Processing;** Zineng Tang et al 1854 | - **STRUCTEXTV2: MASKED VISUAL-TEXTUAL PREDICTION FOR DOCUMENT IMAGE PRE-TRAINING;** Yuechen Yu et al 1855 | - **UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning;** Ahmed Masry et al 1856 | - **Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models;** Geewook Kim et al 1857 | - **LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding;** Yi Tu et al 1858 | - **mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding;** Jiabo Ye et al 1859 | - **KOSMOS-2.5: A Multimodal Literate Model;** Tengchao Lv et al 1860 | - **STRUCTCHART: PERCEPTION, STRUCTURING, REASONING FOR VISUAL CHART UNDERSTANDING;** Renqiu Xia et al 1861 | - **UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model;** Jiabo Ye et al 1862 | - **MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning;** Fuxiao Liu et al 1863 | - **ChartLlama: A Multimodal LLM for Chart Understanding and Generation;** Yucheng Han et al 1864 | - **G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model;** Jiahui Gao et al 1865 | - **ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning;** Fanqing Meng et al 1866 | - **ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning;** Renqiu Xia et al 1867 | - **Enhancing Vision-Language Pre-training with Rich Supervisions;** Yuan Gao et al 1868 | - **TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document;** Yuliang Liu et al 1869 | - **ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning;** Ahmed Masry et al 1870 | - **Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs;** Yonghui Wang et al 1871 | - **Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs;** Victor Carbune et al 1872 | - **HRVDA: High-Resolution Visual Document Assistant;** Chaohu Liu et al 1873 | - **TextSquare: Scaling up Text-Centric Visual Instruction Tuning;** Jingqun Tang et al 1874 | - **TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning;** Liang Zhang et al 1875 | - **Exploring the Capabilities of Large Multimodal Models on Dense Text;** Shuo Zhang et al 1876 | - **STRUCTEXTV3: AN EFFICIENT VISION-LANGUAGE MODEL FOR TEXT-RICH IMAGE PERCEPTION, COMPREHENSION, AND BEYOND;** Pengyuan Lyu et al 1877 | - **TRINS: Towards Multimodal Language Models that Can Read;** Ruiyi Zhang et al 1878 | - **Multimodal Table Understanding;** Mingyu Zheng et al 1879 | - **MPLUG-DOCOWL2: HIGH-RESOLUTION COMPRESSING FOR OCR-FREE MULTI-PAGE DOCUMENT UNDERSTANDING;** Anwen Hu et al 1880 | - **PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling;** Xudong Xie et al 1881 | 1882 | **Dataset** 1883 | 1884 | - **A Diagram Is Worth A Dozen Images;** Aniruddha Kembhavi et al 1885 | - **ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning;** Ahmed Masry et al 1886 | - **PDF-VQA: A New Dataset for Real-World VQA on PDF Documents;** Yihao Ding et al 1887 | - **DocumentNet: Bridging the Data Gap in Document Pre-Training;** Lijun Yu et al
1888 | - **Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning;** Kung-Hsiang Huang et al 1889 | - **CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs;** Zirui Wang et al 1890 | 1891 | 1892 | 1893 | ***Table*** 1894 | 1895 | - **Visual Understanding of Complex Table Structures from Document Images;** Sachin Raja et al 1896 | - **Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling;** Yongshuai Huang et al 1897 | - **Table-GPT: Table-tuned GPT for Diverse Table Tasks;** Peng Li et al 1898 | - **TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios;** Xiaokang Zhang et al 1899 | 1900 | ## Tool Learning 1901 | 1902 | **NLP** 1903 | 1904 | - **TALM: Tool Augmented Language Models;** Aaron Parisi et al 1905 | - **WebGPT: Browser-assisted question-answering with human feedback;** Reiichiro Nakano et al 1906 | - **LaMDA: Language Models for Dialog Applications;** Romal Thoppilan et al 1907 | - **BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage;** Kurt Shuster et al 1908 | - **PAL: program-aided language models;** Luyu Gao et al 1909 | - **Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks;** Wenhu Chen et al 1910 | - **A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level;** Iddo Drori et al 1911 | - **ReAct: Synergizing Reasoning and Acting in Language Models;** Shunyu Yao et al 1912 | - **MIND’S EYE: GROUNDED LANGUAGE MODEL REASONING THROUGH SIMULATION;** Ruibo Liu et al 1913 | - **Toolformer: Language Models Can Teach Themselves to Use Tools;** Timo Schick et al 1914 | - **Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback;** Baolin Peng et al 1915 | - **ART: Automatic multi-step reasoning and tool-use for large language models;** Bhargavi Paranjape et al 1916 | - **Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models;** Pan Lu et al 1917 | - **AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head;** Rongjie Huang et al 1918 | - **Augmented Large Language Models with Parametric Knowledge Guiding;** Ziyang Luo et al 1919 | - **COOK: Empowering General-Purpose Language Models with Modular and Collaborative Knowledge;** Shangbin Feng et al 1920 | - **StructGPT: A General Framework for Large Language Model to Reason over Structured Data;** Jinhao Jiang et al 1921 | - **Chain of Knowledge: A Framework for Grounding Large Language Models with Structured Knowledge Bases;** Xingxuan Li et al 1922 | - **CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation;** Cheng Qian et al 1923 | - **ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases;** Qiaoyu Tang et al 1924 | - **WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences;** Xiao Liu et al 1925 | - **RestGPT: Connecting Large Language Models with Real-World Applications via RESTful APIs;** Yifan Song et al 1926 | - **MIND2WEB: Towards a Generalist Agent for the Web;** Xiang Deng et al 1927 | - **Certified Reasoning with Language Models;** Gabriel Poesia et al 1928 | - **ToolQA: A Dataset for LLM Question Answering with External Tools;** Yuchen Zhuang et al 1929 | - **On the Tool Manipulation Capability of Open-source Large Language
Models;** Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, Jian Zhang et al 1930 | - **CHATDB: AUGMENTING LLMS WITH DATABASES AS THEIR SYMBOLIC MEMORY;** Chenxu Hu et al 1931 | - **MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting;** Tatsuro Inaba et al 1932 | - **Making Language Models Better Tool Learners with Execution Feedback;** Shuofei Qiao et al 1933 | - **CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing;** Zhibin Gou et al 1934 | - **ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models;** Zhipeng Chen et al 1935 | - **Fact-Checking Complex Claims with Program-Guided Reasoning;** Liangming Pan et al 1936 | - **Gorilla: Large Language Model Connected with Massive APIs;** Shishir G. Patil et al 1937 | - **ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings;** Shibo Hao et al 1938 | - **Large Language Models as Tool Makers;** Tianle Cai et al 1939 | - **VOYAGER: An Open-Ended Embodied Agent with Large Language Models;** Guanzhi Wang et al 1940 | - **FACTOOL: Factuality Detection in Generative AI A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios;** I-Chun Chern et al 1941 | - **WebArena: A Realistic Web Environment for Building Autonomous Agents;** Shuyan Zhou et al 1942 | - **TOOLLLM: FACILITATING LARGE LANGUAGE MODELS TO MASTER 16000+ REAL-WORLD APIS;** Yujia Qin et al 1943 | - **Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models;** Cheng-Yu Hsieh et al 1944 | - **ExpeL: LLM Agents Are Experiential Learners;** Andrew Zhao et al 1945 | - **Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum;** Shen Gao et al 1946 | - **Self-driven Grounding: Large Language Model Agents with Automatical Language-aligned Skill Learning;** Shaohui Peng et al 1947 | - **Identifying the Risks of LM Agents with an LM-Emulated Sandbox;** Yangjun Ruan et al 1948 | - **TORA: A TOOL-INTEGRATED REASONING AGENT FOR MATHEMATICAL PROBLEM SOLVING;** Zhibin Gou et al 1949 | - **CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets;** Lifan Yuan et al 1950 | - **METATOOL BENCHMARK: DECIDING WHETHER TO USE TOOLS AND WHICH TO USE;** Yue Huang et al 1951 | - **A Comprehensive Evaluation of Tool-Assisted Generation Strategies;** Alon Jacovi et al 1952 | - **TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems;** Yilun Kong et al 1953 | - **GITAGENT: FACILITATING AUTONOMOUS AGENT WITH GITHUB BY TOOL EXTENSION;** Bohan Lyu et al 1954 | - **TROVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks;** Zhiruo Wang et al 1955 | - **Towards Uncertainty-Aware Language Agent;** Jiuzhou Han et al 1956 | - **Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning;** Chenyu Wang et al 1957 | - **Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills;** Kolby Nottingham et al 1958 | - **AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls;** Yu Du et al 1959 | - **SCIAGENT: Tool-augmented Language Models for Scientific Reasoning;** Yubo Ma et al 1960 | - **API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs;** Kinjal Basu et al 1961 | - **Empowering Large Language Model Agents through Action Learning;** Haiteng Zhao et al 1962 | - **LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error;** Boshi Wang et al 1963 | - 
**StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models;** Zhicheng Guo et al 1964 | - **APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets;** Zuxin Liu et al 1965 | - **Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks;** Ibrahim Abdelaziz et al 1966 | - **ToolACE: Winning the Points of LLM Function Calling;** Weiwen Liu et al 1967 | - **MMSEARCH: BENCHMARKING THE POTENTIAL OF LARGE MODELS AS MULTI-MODAL SEARCH ENGINES;** Dongzhi Jiang et al 1968 | - **ReTool: Reinforcement Learning for Strategic Tool Use in LLMs;** Jiazhan Feng et al 1969 | 1970 | 1971 | 1972 | **With Visual Tools** 1973 | 1974 | - **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models;** Chenfei Wu et al 1975 | - **ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions;** Deyao Zhu et al 1976 | - **Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions;** Jun Chen et al 1977 | - **Visual Programming: Compositional visual reasoning without training;** Tanmay Gupta et al 1978 | - **ViperGPT: Visual Inference via Python Execution for Reasoning;** Dídac Surís et al 1979 | - **Chat with the Environment: Interactive Multimodal Perception using Large Language Models;** Xufeng Zhao et al 1980 | - **MM-REACT : Prompting ChatGPT for Multimodal Reasoning and Action;** Zhengyuan Yang et al 1981 | - **HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace;** Yongliang Shen et al 1982 | - **TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs;** Yaobo Liang et al 1983 | - **OpenAGI: When LLM Meets Domain Experts;** Yingqiang Ge et al; Benchmark. 
1984 | - **Inner Monologue: Embodied Reasoning through Planning with Language Models;** Wenlong Huang et al 1985 | - **Caption Anything: Interactive Image Description with Diverse Multimodal Controls;** Teng Wang et al 1986 | - **InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language;** Zhaoyang Liu et al 1987 | - **Modular Visual Question Answering via Code Generation;** Sanjay Subramanian et al 1988 | - **Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language;** William Berrios et al 1989 | - **AVIS: Autonomous Visual Information Seeking with Large Language Models;** Ziniu Hu et al 1990 | - **AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn;** Difei Gao et al 1991 | - **GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction;** Rui Yang et al 1992 | - **LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent;** Jianing Yang et al 1993 | - **Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation;** Zhengyuan Yang et al 1994 | - **ControlLLM: Augment Language Models with Tools by Searching on Graphs;** Zhaoyang Liu et al 1995 | - **MM-VID: Advancing Video Understanding with GPT-4V(ision);** Kevin Lin et al 1996 | - **Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models;** Yushi Hu et al 1997 | - **CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations;** Ji Qi et al 1998 | - **CLOVA: a closed-loop visual assistant with tool usage and update;** Zhi Gao et al 1999 | - **m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks;** Zixian Ma et al 2000 | - **Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models;** Jiaxing Chen et al 2001 | 2002 | 2003 | ## Instruction Tuning 2004 | 2005 | - **Cross-Task Generalization via Natural Language Crowdsourcing Instructions;** Swaroop Mishra et al 2006 | - **FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS;** Jason Wei et al 2007 | - **MULTITASK PROMPTED TRAINING ENABLES ZERO-SHOT TASK GENERALIZATION;** Victor Sanh et al 2008 | - **Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks;** Yizhong Wang et al 2009 | - **Learning Instructions with Unlabeled Data for Zero-Shot Cross-Task Generalization;** Yuxian Gu et al 2010 | - **Scaling Instruction-Finetuned Language Models;** Hyung Won Chung et al 2011 | - **Task-aware Retrieval with Instructions;** Akari Asai et al 2012 | - **One Embedder, Any Task: Instruction-Finetuned Text Embeddings;** Hongjin Su et al 2013 | - **Boosting Natural Language Generation from Instructions with Meta-Learning;** Budhaditya Deb et al 2014 | - **Exploring the Benefits of Training Expert Language Models over Instruction Tuning;** Joel Jang et al 2015 | - **OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization;** Srinivasan Iyer et al 2016 | - **Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor;** Or Honovich et al 2017 | - **WeaQA: Weak Supervision via Captions for Visual Question Answering;** Pratyay Banerjee et al 2018 | - **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning;** Zhiyang Xu et al 2019 | - **SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions;** Yizhong Wang et al 2020 | - **Exploring the Impact of Instruction Data Scaling on Large Language Models: An 
Empirical Study on Real-World Use Cases;** Yunjie Ji et al 2021 | - **INSTRUCTION TUNING WITH GPT-4;** Baolin Peng et al 2022 | - **The Flan Collection: Designing Data and Methods for Effective Instruction Tuning;** Shayne Longpre et al 2023 | - **LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction;** Abdullatif Köksal et al 2024 | - **GUESS THE INSTRUCTION! FLIPPED LEARNING MAKES LANGUAGE MODELS STRONGER ZERO-SHOT LEARNERS;** Seonghyeon Ye et al 2025 | - **In-Context Instruction Learning;** Seonghyeon Ye et al 2026 | - **WizardLM: Empowering Large Language Models to Follow Complex Instructions;** Can Xu et al 2027 | - **Controlled Text Generation with Natural Language Instructions;** Wangchunshu Zhou et al 2028 | - **Poisoning Language Models During Instruction Tuning;** Alexander Wan et al 2029 | - **Improving Cross-Task Generalization with Step-by-Step Instructions;** Yang Wu et al 2030 | - **VideoChat: Chat-Centric Video Understanding;** KunChang Li et al 2031 | - **SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities;** Dong Zhang et al 2032 | - **Prompting with Pseudo-Code Instructions;** Mayank Mishra et al 2033 | - **LIMA: Less Is More for Alignment;** Chunting Zhou et al 2034 | - **ExpertPrompting: Instructing Large Language Models to be Distinguished Experts;** Benfeng Xu et al 2035 | - **HINT: Hypernetwork Instruction Tuning for Efficient Zero- & Few-Shot Generalisation;** Hamish Ivison et al 2036 | - **Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models;** Gen Luo et al 2037 | - **SAIL: Search-Augmented Instruction Learning;** Hongyin Luo et al 2038 | - **Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning;** Fan Yin et al 2039 | - **DYNOSAUR: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation;** Da Yin et al 2040 | - **MACAW-LLM: MULTI-MODAL LANGUAGE MODELING WITH IMAGE, AUDIO, VIDEO, AND TEXT INTEGRATION;** Chenyang Lyu et al 2041 | - **How Far Can Camels Go? 
Exploring the State of Instruction Tuning on Open Resources;** Yizhong Wang et al 2042 | - **INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models;** Yew Ken Chia et al 2043 | - **MIMIC-IT: Multi-Modal In-Context Instruction Tuning;** Bo Li et al 2044 | - **Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning;** Fuxiao Liu et al 2045 | - **M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning;** Lei Li et al 2046 | - **InstructEval: Systematic Evaluation of Instruction Selection Methods;** Anirudh Ajith et al 2047 | - **LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark;** Zhenfei Yin et al 2048 | - **Instruction Mining: High-Quality Instruction Data Selection for Large Language Models;** Yihan Cao et al 2049 | - **ALPAGASUS: TRAINING A BETTER ALPACA WITH FEWER DATA;** Lichang Chen et al 2050 | - **Exploring Format Consistency for Instruction Tuning;** Shihao Liang et al 2051 | - **Self-Alignment with Instruction Backtranslation;** Xian Li et al 2052 | - **#INSTAG: INSTRUCTION TAGGING FOR DIVERSITY AND COMPLEXITY ANALYSIS;** Keming Lu et al 2053 | - **CITING: LARGE LANGUAGE MODELS CREATE CURRICULUM FOR INSTRUCTION TUNING;** Tao Feng et al 2054 | - **Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models;** Haoran Li et al 2055 | 2056 | ## Incontext Learning 2057 | 2058 | - **Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?;** Sewon Min et al 2059 | - **Extrapolating to Unnatural Language Processing with GPT-3's In-context Learning: The Good, the Bad, and the Mysterious;** Frieda Rong et al 2060 | - **Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning;** Haokun Liu et al 2061 | - **Learning To Retrieve Prompts for In-Context Learning;** Ohad Rubin et al 2062 | - **An Explanation of In-context Learning as Implicit Bayesian Inference;** Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma 2063 | - **MetaICL: Learning to Learn In Context;** Sewon Min et al 2064 | - **PROMPTING GPT-3 TO BE RELIABLE;** Chenglei Si et al 2065 | - **Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm;** Laria Reynolds et al 2066 | - **Do Prompt-Based Models Really Understand the Meaning of their Prompts?;** Albert Webson et al 2067 | - **On the Relation between Sensitivity and Accuracy in In-context Learning;** Yanda Chen et al 2068 | - **Meta-learning via Language Model In-context Tuning;** Yanda Chen et al 2069 | - **Extrapolating to Unnatural Language Processing with GPT-3's In-context Learning: The Good, the Bad, and the Mysterious;** Frieda Rong 2070 | - **SELECTIVE ANNOTATION MAKES LANGUAGE MODELS BETTER FEW-SHOT LEARNERS;** Hongjin Su et al 2071 | - **Robustness of Demonstration-based Learning Under Limited Data Scenario;** Hongxin Zhang et al; Demonstration-based learning, tuning the parameters. 2072 | - **Active Example Selection for In-Context Learning;** Yiming Zhang et al 2073 | - **Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity;** Yao Lu et al 2074 | - **Calibrate Before Use: Improving Few-Shot Performance of Language Models;** Tony Z. 
Zhao et al 2075 | - **DIALOGIC: Controllable Dialogue Simulation with In-Context Learning;** Zekun Li et al 2076 | - **PRESERVING IN-CONTEXT LEARNING ABILITY IN LARGE LANGUAGE MODEL FINE-TUNING;** Yihan Wang et al 2077 | - **Teaching Algorithmic Reasoning via In-context Learning;** Hattie Zhou et al 2078 | - **On the Compositional Generalization Gap of In-Context Learning** Arian Hosseini et al 2079 | - **Transformers generalize differently from information stored in context vs weights;** Stephanie C.Y. Chan et al 2080 | - **OVERTHINKING THE TRUTH: UNDERSTANDING HOW LANGUAGE MODELS PROCESS FALSE DEMONSTRATIONS;** Anonymous 2081 | - **In-context Learning and Induction Heads;** Catherine Olsson et al 2082 | - **Complementary Explanations for Effective In-Context Learning;** Xi Ye et al 2083 | - **What is Not in the Context? Evaluation of Few-shot Learners with Informative Demonstrations;** Michal Štefánik et al 2084 | - **Robustness of Learning from Task Instructions;** Jiasheng Gu et al 2085 | - **Structured Prompting: Scaling In-Context Learning to 1,000 Examples;** Yaru Hao et al 2086 | - **Transformers learn in-context by gradient descent;** Johannes von Oswald et al 2087 | - **Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale;** Hritik Bansal et al 2088 | - **Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations;** Xinxi Lyu et al 2089 | - **Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters;** Boshi Wang et al 2090 | - **Careful Data Curation Stabilizes In-context Learning;** Ting-Yun Chang et al 2091 | - **Parallel Context Windows Improve In-Context Learning of Large Language Models;** Nir Ratner et al 2092 | - **Investigating Fusion Methods for In-Context Learning;** Qinyuan Ye et al 2093 | - **Batch Prompting: Efficient Inference with Large Language Model APIs;** Zhoujun Cheng et al 2094 | - **Explanation Selection Using Unlabeled Data for In-Context Learning;** Xi Ye et al 2095 | - **Compositional Exemplars for In-context Learning;** Jiacheng Ye et al 2096 | - **Distinguishability Calibration to In-Context Learning;** Hongjing Li et al 2097 | - **How Does In-Context Learning Help Prompt Tuning?;** Simeng Sun et al 2098 | - **Guiding Large Language Models via Directional Stimulus Prompting;** Zekun Li et al 2099 | - **In-Context Instruction Learning;** Seonghyeon Ye et al 2100 | - **LARGER LANGUAGE MODELS DO IN-CONTEXT LEARNING DIFFERENTLY;** Jerry Wei et al 2101 | - **kNN PROMPTING: BEYOND-CONTEXT LEARNING WITH CALIBRATION-FREE NEAREST NEIGHBOR INFERENCE;** Benfeng Xu et al 2102 | - **Learning In-context Learning for Named Entity Recognition;** Jiawei Chen et al 2103 | - **SELF-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations;** Wei-Lin Chen et al 2104 | - **Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation;** Marius Mosbach et al 2105 | - **Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning;** Ruixiang Tang et al 2106 | - **IN-CONTEXT REINFORCEMENT LEARNING WITH ALGORITHM DISTILLATION;** Michael Laskin et al 2107 | - **Supervised Pretraining Can Learn In-Context Reinforcement Learning;** Jonathan N. 
Lee et al 2108 | - **Learning to Retrieve In-Context Examples for Large Language Models;** Liang Wang et al 2109 | - **IN-CONTEXT LEARNING IN LARGE LANGUAGE MODELS LEARNS LABEL RELATIONSHIPS BUT IS NOT CONVENTIONAL LEARNING;** Jannik Kossen et al 2110 | - **In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning;** Xiaochuang Han et al 2111 | 2112 | ## Learning from Feedback 2113 | 2114 | - **Decision Transformer: Reinforcement Learning via Sequence Modeling;** Lili Chen et al 2115 | - **Quark: Controllable Text Generation with Reinforced (Un)learning;** Ximing Lu et al 2116 | - **Learning to Repair: Repairing model output errors after deployment using a dynamic memory of feedback;** Niket Tandon et al 2117 | - **MemPrompt: Memory-assisted Prompt Editing with User Feedback;** Aman Madaan et al 2118 | - **Training language models to follow instructions with human feedback;** Long Ouyang et al 2119 | - **Pretraining Language Models with Human Preferences;** Tomasz Korbak et al 2120 | - **Training Language Models with Language Feedback;** Jérémy Scheurer et al 2121 | - **Training Language Models with Language Feedback at Scale;** Jérémy Scheurer et al 2122 | - **Improving Code Generation by Training with Natural Language Feedback;** Angelica Chen et al 2123 | - **REFINER: Reasoning Feedback on Intermediate Representations;** Debjit Paul et al 2124 | - **RRHF: Rank Responses to Align Language Models with Human Feedback without tears;** Zheng Yuan et al 2125 | - **Constitutional AI: Harmlessness from AI Feedback;** Yuntao Bai et al 2126 | - **Chain of Hindsight Aligns Language Models with Feedback;** Hao Liu et al 2127 | - **Self-Edit: Fault-Aware Code Editor for Code Generation;** Kechi Zhang et al 2128 | - **RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs;** Afra Feyza Akyürek et al 2129 | - **Learning to Simulate Natural Language Feedback for Interactive Semantic Parsing;** Hao Yan et al 2130 | - **Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback;** Yao Fu et al 2131 | - **Fine-Grained Human Feedback Gives Better Rewards for Language Model Training;** Zeqiu Wu et al 2132 | - **Aligning Large Language Models through Synthetic Feedback;** Sungdong Kim1 et al 2133 | - **Improving Language Models via Plug-and-Play Retrieval Feedback;** Wenhao Yu et al 2134 | - **Improving Open Language Models by Learning from Organic Interactions;** Jing Xu et al 2135 | - **Demystifying GPT Self-Repair for Code Generation;** Theo X. Olausson et al 2136 | - **Reflexion: Language Agents with Verbal Reinforcement Learning;** Noah Shinn et al 2137 | - **Evaluating Language Models for Mathematics through Interactions;** Katherine M. 
Collins et al 2138 | - **InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback;** John Yang et al 2139 | - **System-Level Natural Language Feedback;** Weizhe Yuan et al 2140 | - **Preference Ranking Optimization for Human Alignment;** Feifan Song et al 2141 | - **Let Me Teach You: Pedagogical Foundations of Feedback for Language Models;** Beatriz Borges et al 2142 | - **AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback;** Yann Dubois et al 2143 | - **Training Socially Aligned Language Models in Simulated Human Society;** Ruibo Liu et al 2144 | - **RLTF: Reinforcement Learning from Unit Test Feedback;** Jiate Liu et al 2146 | - **LETI: Learning to Generate from Textual Interactions;** Xingyao Wang et al 2147 | - **Direct Preference Optimization: Your Language Model is Secretly a Reward Model;** Rafael Rafailov et al 2148 | - **FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback;** Ashish Singh et al 2149 | - **Leveraging Implicit Feedback from Deployment Data in Dialogue;** Richard Yuanzhe Pang et al 2150 | - **RLCD: REINFORCEMENT LEARNING FROM CONTRAST DISTILLATION FOR LANGUAGE MODEL ALIGNMENT;** Kevin Yang et al 2151 | - **Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback;** Viet Dac Lai et al 2152 | - **Reinforced Self-Training (ReST) for Language Modeling;** Caglar Gulcehre et al 2153 | - **EVERYONE DESERVES A REWARD: LEARNING CUSTOMIZED HUMAN PREFERENCES;** Pengyu Cheng et al 2154 | - **RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback;** Harrison Lee et al 2155 | - **STABILIZING RLHF THROUGH ADVANTAGE MODEL AND SELECTIVE REHEARSAL;** Baolin Peng et al 2156 | - **OPENCHAT: ADVANCING OPEN-SOURCE LANGUAGE MODELS WITH MIXED-QUALITY DATA;** Guan Wang et al 2157 | - **HUMAN FEEDBACK IS NOT GOLD STANDARD;** Tom Hosking et al 2158 | - **A LONG WAY TO GO: INVESTIGATING LENGTH CORRELATIONS IN RLHF;** Prasann Singhal et al 2159 | - **CHAT VECTOR: A SIMPLE APPROACH TO EQUIP LLMS WITH NEW LANGUAGE CHAT CAPABILITIES;** Shih-Cheng Huang et al 2160 | - **SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF;** Yi Dong et al 2161 | - **UNDERSTANDING THE EFFECTS OF RLHF ON LLM GENERALISATION AND DIVERSITY;** Robert Kirk et al 2162 | - **GAINING WISDOM FROM SETBACKS: ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS;** Kai Chen et al 2163 | - **Tuna: Instruction Tuning using Feedback from Large Language Models;** Haoran Li et al 2164 | - **Teaching Language Models to Self-Improve through Interactive Demonstrations;** Xiao Yu et al 2165 | - **Democratizing Reasoning Ability: Tailored Learning from Large Language Model;** Zhaoyang Wang et al 2166 | - **ENABLE LANGUAGE MODELS TO IMPLICITLY LEARN SELF-IMPROVEMENT FROM DATA;** Ziqi Wang et al 2167 | - **ULTRAFEEDBACK: BOOSTING LANGUAGE MODELS WITH HIGH-QUALITY FEEDBACK;** Ganqu Cui et al 2168 | - **HELPSTEER: Multi-attribute Helpfulness Dataset for STEERLM;** Zhilin Wang et al 2169 | - **Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering;** Yichi Zhang et al 2170 | - **Nash Learning from Human Feedback;** Rémi Munos et al 2171 | - **Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models;** Avi Singh et al 2172 | - **When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good
Labels;** Weiyan Shi et al 2173 | - **ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles;** Savvas Petridis et al 2174 | - **REASONS TO REJECT? ALIGNING LANGUAGE MODELS WITH JUDGMENTS;** Weiwen Xu et al 2175 | - **Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss;** Jing Xu et al 2176 | - **Mitigating Unhelpfulness in Emotional Support Conversations with Multifaceted AI Feedback;** Jiashuo Wang et al 2177 | - **Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation;** Haoran Xu et al 2178 | - **Self-Rewarding Language Models;** Weizhe Yuan et al 2179 | - **Dense Reward for Free in Reinforcement Learning from Human Feedback;** Alex J. Chan et al 2180 | - **Efficient Exploration for LLMs;** Vikranth Dwaracherla et al 2181 | - **KTO: Model Alignment as Prospect Theoretic Optimization;** Kawin Ethayarajh et al 2182 | - **LiPO: Listwise Preference Optimization through Learning-to-Rank;** Tianqi Liu et al 2183 | - **Direct Language Model Alignment from Online AI Feedback;** Shangmin Guo et al 2184 | - **Noise Contrastive Alignment of Language Models with Explicit Rewards;** Huayu Chen et al 2185 | - **RLVF: Learning from Verbal Feedback without Overgeneralization;** Moritz Stephan et al 2186 | - **OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement;** Tianyu Zheng et al 2187 | - **A Critical Evaluation of AI Feedback for Aligning Large Language Models;** Archit Sharma et al 2188 | - **VOLCANO: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision;** Seongyun Lee et al 2189 | - **Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward;** Ruohong Zhang et al 2190 | - **ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback;** Zhenyu Hou et al 2191 | - **From r to Q: Your Language Model is Secretly a Q-Function;** Rafael Rafailov* et al 2192 | - **Aligning LLM Agents by Learning Latent Preference from User Edits;** Ge Gao et al 2193 | - **Self-Play Preference Optimization for Language Model Alignment;** Yue Wu et al 2194 | - **Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data;** Fahim Tajwar et al 2195 | - **Robust Preference Optimization through Reward Model Distillation;** Adam Fisch et al 2196 | - **Preference Learning Algorithms Do Not Learn Preference Rankings;** Angelica Chen et al 2197 | - **UNDERSTANDING ALIGNMENT IN MULTIMODAL LLMS: A COMPREHENSIVE STUDY;** Elmira Amirloo et al 2198 | - **Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback;** Hamish Ivison et al 2199 | - **Learning from Naturally Occurring Feedback;** Shachar Don-Yehiya et al 2200 | - **FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models;** Pengxiang Li et al 2201 | - **BOND: Aligning LLMs with Best-of-N Distillation;** Pier Giuseppe Sessa et al 2202 | - **Recursive Introspection: Teaching Language Model Agents How to Self-Improve;** Yuxiao Qu et al 2203 | - **WILDFEEDBACK: ALIGNING LLMS WITH IN-SITU USER INTERACTIONS AND FEEDBACK;** Taiwei Shi et al 2204 | - **Training Language Models to Self-Correct via Reinforcement Learning;** Aviral Kumar et al 2205 | - **The Perfect Blend: Redefining RLHF with Mixture of Judges;** Tengyu Xu et al 2206 | - **DOES RLHF SCALE? 
EXPLORING THE IMPACTS FROM DATA, MODEL, AND METHOD;** Zhenyu Hou et al 2207 | 2208 | 2209 | 2210 | 2211 | ## Reward Modeling 2212 | 2213 | - **HelpSteer2: Open-source dataset for training top-performing reward models;** Zhilin Wang et al 2214 | - **WARM: On the Benefits of Weight Averaged Reward Models;** Alexandre Ramé et al 2215 | - **Secrets of RLHF in Large Language Models Part II: Reward Modeling;** Binghai Wang et al 2216 | - **TOOL-AUGMENTED REWARD MODELING;** Lei Li et al 2217 | - **ZYN: Zero-Shot Reward Models with Yes-No Questions;** Victor Gallego et al 2218 | - **LET’S REWARD STEP BY STEP: STEP-LEVEL REWARD MODEL AS THE NAVIGATORS FOR REASONING;** Qianli Ma et al 2219 | - **Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning;** Amrith Setlur et al 2220 | - **Generative Verifiers: Reward Modeling as Next-Token Prediction;** Lunjun Zhang et al 2221 | - **GENERATIVE REWARD MODELS;** Dakota Mahan et al 2222 | - **Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs;** Chris Yuhao Liu et al 2223 | - **FREE PROCESS REWARDS WITHOUT PROCESS LABELS;** Lifan Yuan et al 2224 | - **The Lessons of Developing Process Reward Models in Mathematical Reasoning;** Zhenru Zhang et al 2225 | - **PROCESS REINFORCEMENT THROUGH IMPLICIT REWARDS;** Ganqu Cui et al 2226 | - **RETHINKING REWARD MODEL EVALUATION: ARE WE BARKING UP THE WRONG TREE?;** Xueru Wen et al 2227 | - **VisualPRM: An Effective Process Reward Model for Multimodal Reasoning;** Weiyun Wang et al 2228 | - **Expanding RL with Verifiable Rewards Across Diverse Domains;** Yi Su et al 2229 | - **Inference-Time Scaling for Generalist Reward Modeling;** Zijun Liu et al 2230 | - **J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning;** Chenxi Whitehouse et al 2231 | - **WorldPM: Scaling Human Preference Modeling;** Binghai Wang et al 2232 | - **RM-R1: Reward Modeling as Reasoning;** Xiusi Chen et al 2233 | - **Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy;** Chris Yuhao Liu et al 2234 | 2235 | 2236 | 2237 | ## Video Foundation Model 2238 | 2239 | - **VideoBERT: A Joint Model for Video and Language Representation Learning;** Chen Sun et al 2240 | - **LEARNING VIDEO REPRESENTATIONS USING CONTRASTIVE BIDIRECTIONAL TRANSFORMER;** Chen Sun et al 2241 | - **End-to-End Learning of Visual Representations from Uncurated Instructional Videos;** Antoine Miech et al 2242 | - **HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;** Linjie Li et al 2243 | - **Multi-modal Transformer for Video Retrieval;** Valentin Gabeur et al 2244 | - **ActBERT: Learning Global-Local Video-Text Representations;** Linchao Zhu et al 2245 | - **Spatiotemporal Contrastive Video Representation Learning;** Rui Qian et al 2246 | - **DECEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization;** Zineng Tang et al 2247 | - **HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval;** Song Liu et al 2248 | - **Self-Supervised MultiModal Versatile Networks;** Jean-Baptiste Alayrac et al 2249 | - **COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning;** Simon Ging et al 2250 | - **VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning;** Hao Tan et al 2251 | - **Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling;** Jie Lei et al 2252 | - **Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval;** Max Bain et al 2253 | - **CLIP4Clip: An 
Empirical Study of CLIP for End to End Video Clip Retrieval;** Huaishao Luo et al 2254 | - **MERLOT: Multimodal Neural Script Knowledge Models;** Rowan Zellers et al 2255 | - **VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text;** Hassan Akbari et al 2256 | - **VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling;** Tsu-Jui Fu et al 2257 | - **CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising;** Jianjie Luo et al 2258 | - **LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling;** Linjie Li et al 2259 | - **CLIP-VIP: ADAPTING PRE-TRAINED IMAGE-TEXT MODEL TO VIDEO-LANGUAGE ALIGNMENT;** Hongwei Xue et al 2260 | - **Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning;** Rui Wang et al 2261 | - **Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning;** Yuchong Sun et al 2262 | - **Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning;** Antoine Yang et al 2263 | - **InternVideo: General Video Foundation Models via Generative and Discriminative Learning;** Yi Wang et al 2264 | - **MINOTAUR: Multi-task Video Grounding From Multimodal Queries;** Raghav Goyal et al 2265 | - **VideoLLM: Modeling Video Sequence with Large Language Models;** Guo Chen et al 2266 | - **COSA: Concatenated Sample Pretrained Vision-Language Foundation Model;** Sihan Chen et al 2267 | - **VALLEY: VIDEO ASSISTANT WITH LARGE LANGUAGE MODEL ENHANCED ABILITY;** Ruipu Luo et al 2268 | - **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models;** Muhammad Maaz et al 2269 | - **Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding;** Hang Zhang et al 2270 | - **InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation;** Yi Wang et al 2271 | - **VideoCon: Robust Video-Language Alignment via Contrast Captions;** Hritik Bansal et al 2272 | - **PG-Video-LLaVA: Pixel Grounding Large Video-Language Models;** Shehan Munasinghe et al 2273 | - **VLM-Eval: A General Evaluation on Video Large Language Models;** Shuailin Li et al 2274 | - **Video-LLaVA: Learning United Visual Representation by Alignment Before Projection;** Bin Lin et al 2275 | - **MVBench: A Comprehensive Multi-modal Video Understanding Benchmark;** Kunchang Li et al 2276 | - **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models;** Yanwei Li et al 2277 | - **LANGUAGEBIND: EXTENDING VIDEO-LANGUAGE PRETRAINING TO N-MODALITY BY LANGUAGE-BASED SEMANTIC ALIGNMENT;** Bin Zhu et al 2278 | - **TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding;** Shuhuai Ren et al 2279 | - **Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers;** Tsai-Shien Chen et al 2280 | - **INTERNVIDEO2: SCALING VIDEO FOUNDATION MODELS FOR MULTIMODAL VIDEO UNDERSTANDING;** Yi Wang et al 2281 | - **PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning;** Lin Xu et al 2282 | - **LLAVA FINDS FREE LUNCH: TEACHING HUMAN BEHAVIOR IMPROVES CONTENT UNDERSTANDING ABILITIES OF LLMS;** Somesh Singh et al 2283 | - **Pandora: Towards General World Model with Natural Language Actions and Video States;** Jiannan Xiang et al 2284 | - **VideoLLM-online: Online Video Large Language Model for Streaming Video;** Joya Chen et al 2285 | - **OMCHAT: A RECIPE TO TRAIN MULTIMODAL 
LANGUAGE MODELS WITH STRONG LONG CONTEXT AND VIDEO UNDERSTANDING;** Tiancheng Zhao et al 2286 | - **SLOWFAST-LLAVA: A STRONG TRAINING-FREE BASELINE FOR VIDEO LARGE LANGUAGE MODELS;** Mingze Xu et al 2287 | - **LONGVILA: SCALING LONG-CONTEXT VISUAL LANGUAGE MODELS FOR LONG VIDEOS;** Fuzhao Xue et al 2288 | - **Streaming Long Video Understanding with Large Language Models;** Rui Qian et al 2289 | - **ORYX MLLM: ON-DEMAND SPATIAL-TEMPORAL UNDERSTANDING AT ARBITRARY RESOLUTION;** Zuyan Liu et al 2290 | - **XGEN-MM-VID (BLIP-3-VIDEO): YOU ONLY NEED 32 TOKENS TO REPRESENT A VIDEO EVEN IN VLMS;** Michael S. Ryoo et al 2291 | - **VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding;** Boqiang Zhang et al 2292 | - **Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy;** Yunhang Shen et al 2293 | - **Breaking the Encoder Barrier for Seamless Video-Language Understanding;** Handong Li et al 2294 | - **PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding;** Jang Hyun Cho et al 2295 | 2296 | 2297 | 2298 | 2299 | ## Key Frame Detection 2300 | 2301 | - **Self-Supervised Learning to Detect Key Frames in Videos;** Xiang Yan et al 2302 | - **Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training;** Dezhao Luo et al 2303 | - **Localizing Moments in Long Video Via Multimodal Guidance;** Wayner Barrios et al 2304 | 2305 | ## Vision Model 2306 | 2307 | - **PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION;** Ting Chen et al 2308 | - **Scaling Vision Transformers to 22 Billion Parameters;** Mostafa Dehghani et al 2309 | - **CLIPPO: Image-and-Language Understanding from Pixels Only;** Michael Tschannen et al 2310 | - **Segment Anything;** Alexander Kirillov et al 2311 | - **InstructDiffusion: A Generalist Modeling Interface for Vision Tasks;** Zigang Geng et al 2312 | - **RMT: Retentive Networks Meet Vision Transformers;** Qihang Fan et al 2313 | - **INSTRUCTCV: INSTRUCTION-TUNED TEXT-TO-IMAGE DIFFUSION MODELS AS VISION GENERALISTS;** Yulu Gan et al 2314 | - **Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks;** Micah Goldblum et al 2315 | - **RECOGNIZE ANY REGIONS;** Haosen Yang et al 2316 | - **AiluRus: A Scalable ViT Framework for Dense Prediction;** Jin Li et al 2317 | - **T-Rex: Counting by Visual Prompting;** Qing Jiang et al 2318 | - **Visual In-Context Prompting;** Feng Li et al 2319 | - **SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding;** Haoxiang Wang et al 2320 | - **Sequential Modeling Enables Scalable Learning for Large Vision Models;** Yutong Bai et al 2321 | - **Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks;** Bin Xiao et al 2322 | - **4M: Massively Multimodal Masked Modeling;** David Mizrahi et al 2323 | - **InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks;** Zhe Chen et al 2324 | - **Scalable Pre-training of Large Autoregressive Image Models;** Alaaeldin El-Nouby et al 2325 | - **When Do We Not Need Larger Vision Models?;** Baifeng Shi et al 2326 | - **ViTamin: Designing Scalable Vision Models in the Vision-Language Era;** Jieneng Chen et al 2327 | - **MambaVision: A Hybrid Mamba-Transformer Vision Backbone;** Ali Hatamizadeh et al 2328 | - **Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs;** Shengbang Tong et al 2329 | - **SAM 2: Segment Anything in Images 
and Videos;** Nikhila Ravi et al 2330 | - **Multimodal Autoregressive Pre-training of Large Vision Encoders;** Enrico Fini et al 2331 | - **SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features;** Michael Tschannen et al 2332 | - **TULIP: Towards Unified Language-Image Pretraining;** Zineng Tang et al 2333 | - **Scaling Language-Free Visual Representation Learning;** David Fan et al 2334 | - **Perception Encoder: The best visual embeddings are not at the output of the network;** Daniel Bolya et al 2335 | 2336 | 2337 | 2338 | 2339 | ## Pretraining 2340 | 2341 | - **MDETR - Modulated Detection for End-to-End Multi-Modal Understanding;** Aishwarya Kamath et al 2342 | - **SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning;** Zhecan Wang et al; Incorporating scene graphs in pretraining and fine-tuning improves performance of VCR tasks. 2343 | - **ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs;** Fei Yu et al 2344 | - **KB-VLP: Knowledge Based Vision and Language Pretraining;** Kezhen Chen et al; Propose to distill the object knowledge in VL pretraining for object-detector-free VL foundation models; Pretraining tasks include predicting the RoI features, category, and learning the alignments between phrases and image regions. 2345 | - **Large-Scale Adversarial Training for Vision-and-Language Representation Learning;** Zhe Gan et al 2346 | - **Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts;** Yan Zeng et al 2347 | - **BEIT: BERT Pre-Training of Image Transformers;** Hangbo Bao et al; Pre-trained CV model. 2348 | - **BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers;** Zhiliang Peng et al; Pre-trained CV model. 2349 | - **VirTex: Learning Visual Representations from Textual Annotations;** Karan Desai et al; Pretraining CV models through the dense image captioning task. 2350 | - **Florence: A New Foundation Model for Computer Vision;** Lu Yuan et al; Pre-trained CV model. 2351 | - **Grounded Language-Image Pre-training;** Liunian Harold Li et al; Learning object-level, language-aware, and semantic-rich visual representations. Introducing phrase grounding to the pretraining task and focusing on object detection as the downstream task; Propose GLIP. 2352 | - **VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix;** Teng Wang et al; Using unpaired data for pretraining. 
2353 | - **Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone;** Zi-Yi Dou et al 2354 | - **WRITE AND PAINT: GENERATIVE VISION-LANGUAGE MODELS ARE UNIFIED MODAL LEARNERS;** Shizhe Diao et al 2355 | - **VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining;** Junjie Ke et al 2356 | - **CONTRASTIVE ALIGNMENT OF VISION TO LANGUAGE THROUGH PARAMETER-EFFICIENT TRANSFER LEARNING;** Zaid Khan et al 2357 | - **The effectiveness of MAE pre-pretraining for billion-scale pretraining;** Mannat Singh et al 2358 | - **Retrieval-based Knowledge Augmented Vision Language Pre-training;** Jiahua Rao et al 2359 | 2360 | **Visual-augmented LM** 2361 | 2362 | - **Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision;** Hao Tan et al 2363 | - **Imagination-Augmented Natural Language Understanding;** Yujie Lu et al 2364 | - **Visually-augmented language modeling;** Weizhi Wang et al 2365 | - **Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models;** Taichi Iki et al 2366 | - **Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding;** Morris Alper et al 2367 | - **TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models;** Md Kamrul Hasan et al 2368 | - **Learning to Imagine: Visually-Augmented Natural Language Generation;** Tianyi Tang et al 2369 | 2370 | **Novel techniques.** 2371 | 2372 | - **CM3: A CAUSAL MASKED MULTIMODAL MODEL OF THE INTERNET;** Armen Aghajanyan et al; Propose to pretrain on a large corpus of structured multi-modal documents (CC-NEWS & En-Wikipedia) that can contain both text and image tokens. 2373 | - **PaLI: A Jointly-Scaled Multilingual Language-Image Model;** Xi Chen et al; Investigate the scaling effect of multi-modal models; Pretrained on WebLI, which contains text in over 100 languages. 2374 | - **Retrieval-Augmented Multimodal Language Modeling;** Michihiro Yasunaga et al; Consider text generation and image generation tasks. 2375 | - **Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning;** Zhuolin Yang et al 2376 | - **Teaching Structured Vision & Language Concepts to Vision & Language Models;** Sivan Doveh et al 2377 | - **MATCHA: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering;** Fangyu Liu et al 2378 | - **Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training;** Filip Radenovic et al; Propose methods to improve zero-shot performance on retrieval and classification tasks through large-scale pre-training. 2379 | - **Prismer: A Vision-Language Model with An Ensemble of Experts;** Shikun Liu et al 2380 | - **REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory;** Ziniu Hu et al 2381 | 2382 | ## Adaptation of Foundation Model 2383 | 2384 | - **Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture;** Tanmay Gupta et al 2385 | - **Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;** Zhenhailong Wang et al 2386 | - **Multimodal Few-Shot Learning with Frozen Language Models;** Maria Tsimpoukelli et al; Use prefix-like image embeddings to steer the text generation process of a frozen LM and achieve few-shot learning (see the sketch below).
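The Frozen entry above conditions a frozen language model on an image by projecting visual features into a handful of prefix embeddings that precede the text tokens. A minimal PyTorch-style sketch of that visual-prefix idea follows; it is not the paper's code, and the module names, dimensions, and interfaces are hypothetical stand-ins (the LM is assumed to accept precomputed `inputs_embeds`, as HuggingFace-style causal LMs do).

```python
import torch
import torch.nn as nn

class VisualPrefixLM(nn.Module):
    """Frozen-style visual prefix: a trainable vision encoder produces a few
    pseudo-token embeddings that are prepended to the text embeddings of a
    frozen causal LM, so the image steers generation without touching LM weights."""

    def __init__(self, vision_encoder: nn.Module, frozen_lm: nn.Module,
                 vision_dim: int = 768, lm_dim: int = 4096, prefix_len: int = 2):
        super().__init__()
        self.vision_encoder = vision_encoder  # trained; maps image -> (B, vision_dim)
        self.frozen_lm = frozen_lm            # kept frozen throughout
        for p in self.frozen_lm.parameters():
            p.requires_grad_(False)
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        # map pooled visual features to `prefix_len` embeddings in the LM's input space
        self.to_prefix = nn.Linear(vision_dim, prefix_len * lm_dim)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W); text_embeds: (B, T, lm_dim) token embeddings of the prompt
        pooled = self.vision_encoder(images)                                    # (B, vision_dim)
        prefix = self.to_prefix(pooled).view(-1, self.prefix_len, self.lm_dim)  # (B, P, lm_dim)
        inputs = torch.cat([prefix, text_embeds], dim=1)
        # assumes a HuggingFace-style LM that accepts precomputed `inputs_embeds`
        return self.frozen_lm(inputs_embeds=inputs)
```

In the Frozen-style setup, few-shot prompting then interleaves several such image prefixes with their captions in a single context, mirroring text-only in-context learning.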
2387 | - **Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language;** Andy Zeng et al 2388 | - **UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes;** Alexander Kolesnikov et al 2389 | - **META LEARNING TO BRIDGE VISION AND LANGUAGE MODELS FOR MULTIMODAL FEW-SHOT LEARNING;** Ivona Najdenkoska et al 2390 | - **RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training;** Zheng Yuan et al 2391 | - **Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners;** Renrui Zhang et al 2392 | - **F-VLM: OPEN-VOCABULARY OBJECT DETECTION UPON FROZEN VISION AND LANGUAGE MODELS;** Weicheng Kuo et al 2393 | - **eP-ALM: Efficient Perceptual Augmentation of Language Models;** Mustafa Shukor et al 2394 | - **Transfer Visual Prompt Generator across LLMs;** Ao Zhang et al 2395 | - **Multimodal Web Navigation with Instruction-Finetuned Foundation Models;** Hiroki Furuta et al 2396 | 2397 | ## Prompting 2398 | 2399 | - **Learning to Prompt for Vision-Language Models;** Kaiyang Zhou et al; Soft prompt tuning. Using few-shot learning to improve performance on both in-distribution and out-of-distribution data. Few-shot setting (see the soft-prompt sketch below). 2400 | - **Unsupervised Prompt Learning for Vision-Language Models;** Tony Huang et al; Soft prompt tuning. Unsupervised setting. 2401 | - **Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling;** Renrui Zhang et al; Few-shot setting. 2402 | - **CLIP-Adapter: Better Vision-Language Models with Feature Adapters;** Peng Gao et al; Few-shot setting. 2403 | - **Neural Prompt Search;** Yuanhan Zhang et al; Explore the combination of LoRA, Adapter, and soft prompt tuning. In full-data, few-shot, and domain shift settings. 2404 | - **Visual Prompt Tuning;** Menglin Jia et al; Soft prompt tuning + head tuning. Show better performance in few-shot and full-data settings than full-parameter tuning. Quite different from the NLP field. 2405 | - **Prompt Distribution Learning;** Yuning Lu et al; Soft prompt tuning. Few-shot setting. 2406 | - **Conditional Prompt Learning for Vision-Language Models;** Kaiyang Zhou et al; Identify a critical problem of CoOp: the learned context does not generalize to wider unseen classes within the same dataset; Propose to learn a DNN that generates an input-conditional context token (vector) for each image. 2407 | - **Learning to Prompt for Continual Learning;** Zifeng Wang et al; Continual learning setting. Maintain a prompt pool. 2408 | - **Exploring Visual Prompts for Adapting Large-Scale Models;** Hyojin Bahng et al; Employ adversarial reprogramming as visual prompts. Full-data setting. 2409 | - **Learning multiple visual domains with residual adapters;** Sylvestre-Alvise Rebuffi et al; Use adapters to transfer pretrained knowledge to multiple domains while freezing the base model parameters. Work in the CV field & full-data transfer learning. 2410 | - **Efficient parametrization of multi-domain deep neural networks;** Sylvestre-Alvise Rebuffi et al; Still use adapters for transfer learning, with a more comprehensive empirical study of design choices. 2411 | - **Prompting Visual-Language Models for Efficient Video Understanding;** Chen Ju et al; Video tasks. Few-shot & zero-shot settings. Soft prompt tuning. 2412 | - **Visual Prompting via Image Inpainting;** Amir Bar et al; In-context learning in CV. Use a pretrained masked auto-encoder.
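Several of the prompting entries above (CoOp-style soft prompt tuning and its variants) share one mechanism: the vision-language backbone stays frozen, and only a few continuous context vectors, prepended to each class name's token embeddings, are learned from a handful of labeled shots. A minimal sketch of that mechanism is below; it is not the papers' released code, and the `encode_image` / `encode_text_from_embeddings` hooks on the frozen CLIP-like model are assumed interfaces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPromptClassifier(nn.Module):
    """CoOp-style soft prompt tuning: the CLIP-like backbone is frozen; only the
    continuous context vectors prepended to each class-name embedding are trained."""

    def __init__(self, clip_like: nn.Module, class_token_embeds: torch.Tensor,
                 n_ctx: int = 16, embed_dim: int = 512):
        super().__init__()
        self.backbone = clip_like
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # learnable context shared across classes: (n_ctx, embed_dim)
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, embed_dim))
        # precomputed token embeddings of the class names: (n_classes, name_len, embed_dim)
        self.register_buffer("class_token_embeds", class_token_embeds)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        n_classes = self.class_token_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        prompts = torch.cat([ctx, self.class_token_embeds], dim=1)      # "[V]_1 ... [V]_M <class>"
        text_feat = self.backbone.encode_text_from_embeddings(prompts)  # assumed hook
        image_feat = self.backbone.encode_image(images)
        text_feat = F.normalize(text_feat, dim=-1)
        image_feat = F.normalize(image_feat, dim=-1)
        return 100.0 * image_feat @ text_feat.t()                       # (B, n_classes) logits
```

Training minimizes cross-entropy over these image-to-text logits on the few labeled examples; the conditional variant (CoCoOp) replaces the shared context with one produced per image by a small network.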
2413 | - **CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment;** Haoyu Song et al; Propose a parameter-efficient tuning method (bias tuning), function well in few-shot setting. 2414 | - **LEARNING TO COMPOSE SOFT PROMPTS FOR COMPOSITIONAL ZERO-SHOT LEARNING;** Nihal V. Nayak et al; zero-shot setting, inject some knowledge in the learning process. 2415 | - **Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models;** Manli Shu et al; Learn soft-prompt in the test-time. 2416 | - **Multitask Vision-Language Prompt Tuning;** Sheng Shen et al; Few-shot. 2417 | - **A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models;** Woojeong Jin et al 2418 | - **CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS;** Yuan Yao et al; Good few-shot & zero-shot performance on RefCOCO datasets. 2419 | - **What Makes Good Examples for Visual In-Context Learning?;** Yuanhan Zhang et al 2420 | - **Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery;** Yuxin Wen et al 2421 | - **PLOT: PROMPT LEARNING WITH OPTIMAL TRANSPORT FOR VISION-LANGUAGE MODELS;** Guangyi Chen et al 2422 | - **What does CLIP know about a red circle? Visual prompt engineering for VLMs;** Aleksandar Shtedritski et al 2423 | 2424 | ## Efficiency 2425 | 2426 | - **M3SAT: A SPARSELY ACTIVATED TRANSFORMER FOR EFFICIENT MULTI-TASK LEARNING FROM MULTIPLE MODALITIES;** Anonymous 2427 | - **Prompt Tuning for Generative Multimodal Pretrained Models;** Hao Yang et al; Implement prefix-tuning in OFA. Try full-data setting and demonstrate comparable performance. 2428 | - **Fine-tuning Image Transformers using Learnable Memory;** Mark Sandler et al; Add soft prompts in each layer. full-data. 2429 | - **Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks;** Jeffrey O. Zhang et al; Transfer learning. 2430 | - **Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks;** Yen-Cheng Liu et al 2431 | - **Task Residual for Tuning Vision-Language Models;** Tao Yu et al 2432 | - **UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling;** Haoyu Lu et al 2433 | 2434 | ## Analysis 2435 | 2436 | - **What Does BERT with Vision Look At?** Liunian Harold Li et al 2437 | - **Visual Referring Expression Recognition: What Do Systems Actually Learn?;** Volkan Cirik et al 2438 | - **Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks;** Nan Wu et al; Study the problem of only relying on one certain modality in training when using multi-modal models. 2439 | - **Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models;** Jize Cao et al 2440 | - **Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning** Weixin Liang et al 2441 | - **How Much Can CLIP Benefit Vision-and-Language Tasks?;** Sheng Shen et al; Explore two scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. Show the boost in performance when using CLIP as the image encoder. 2442 | - **Vision-and-Language or Vision-for-Language? 
On Cross-Modal Influence in Multimodal Transformers;** Stella Frank et al 2443 | - **Controlling for Stereotypes in Multimodal Language Model Evaluation;** Manuj Malik et al 2444 | - **Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube;** Jack Hessel et al 2445 | - **What is More Likely to Happen Next? Video-and-Language Future Event Prediction;** Jie Lei et al 2446 | 2447 | ## Grounding 2448 | 2449 | - **Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models;** Bryan A. Plummer et al; A new benchmark dataset, annotating phrase-region correspondences. 2450 | - **Connecting Vision and Language with Localized Narratives;** Jordi Pont-Tuset et al 2451 | - **MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding;** Qinxin Wang et al 2452 | - **Visual Grounding Strategies for Text-Only Natural Language Processing;** Propose to improve the NLP tasks performance by grounding to images. Two methods are proposed. 2453 | - **Visually Grounded Neural Syntax Acquisition;** Haoyue Shi et al 2454 | - **PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World;** Rowan Zellers et al 2455 | 2456 | ## VQA Task 2457 | 2458 | - **WeaQA: Weak Supervision via Captions for Visual Question Answering;** Pratyay Banerjee et al 2459 | - **Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering;** Aishwarya Agrawal et al 2460 | - **Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA;** Qingyi Si et al 2461 | - **Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning;** Qingyi Si et al 2462 | - **Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training;** Anthony Meng Huat Tiong et al 2463 | - **FROM IMAGES TO TEXTUAL PROMPTS: ZERO-SHOT VQA WITH FROZEN LARGE LANGUAGE MODELS;** Jiaxian Guo et al 2464 | - **SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions;** Ramprasaath R. Selvaraju et al 2465 | - **Multimodal retrieval-augmented generator for open question answering over images and text;** Wenhu Chen et al 2466 | - **Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering;** Chenxi Whitehouse et al 2467 | - **Modularized Zero-shot VQA with Pre-trained Models;** Rui Cao et al 2468 | - **Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge;** Xingyu Fu et al 2469 | - **Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models;** Jiarui Zhang et al 2470 | - **Zero-shot Visual Question Answering with Language Model Feedback;** Yifan Du et al 2471 | - **Learning to Ask Informative Sub-Questions for Visual Question Answering;** Kohei Uehara et al 2472 | - **Why Did the Chicken Cross the Road? Rephrasing and Analyzing Ambiguous Questions in VQA;** Elias Stengel-Eskin et al 2473 | - **Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering;** Rabiul Awal et al 2474 | 2475 | ## VQA Dataset 2476 | 2477 | - **VQA: Visual Question Answering;** Aishwarya Agrawal et al 2478 | - **Towards VQA Models That Can Read;** Amanpreet Singh et al 2479 | - **Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering;** Yash Goyal et al; VQA-V2. 
2480 | - **MULTIMODALQA: COMPLEX QUESTION ANSWERING OVER TEXT, TABLES AND IMAGES;** Alon Talmor et al 2481 | - **WebQA: Multihop and Multimodal QA;** Yingshan Chang et al 2482 | - **FunQA: Towards Surprising Video Comprehension;** Binzhu Xie et al; Used for video foundation model evaluation. 2483 | - **Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering;** Pan Lu et al 2484 | - **Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?;** Yang Chen et al 2485 | 2486 | **Cognition** 2487 | 2488 | - **Inferring the Why in Images;** Hamed Pirsiavash et al 2489 | - **Visual Madlibs: Fill in the blank Image Generation and Question Answering;** Licheng Yu et al 2490 | - **From Recognition to Cognition: Visual Commonsense Reasoning;** Rowan Zellers et al; Benchmark dataset requiring models to go beyond the recognition level to cognition: reason about a still image and give rationales. 2491 | - **VisualCOMET: Reasoning about the Dynamic Context of a Still Image;** Jae Sung Park et al 2492 | - **The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning;** Jack Hessel et al 2493 | 2494 | **Knowledge** 2495 | 2496 | - **Explicit Knowledge-based Reasoning for Visual Question Answering;** Peng Wang et al 2497 | - **FVQA: Fact-based Visual Question Answering;** Peng Wang et al 2498 | - **OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge;** Kenneth Marino et al 2499 | 2500 | ## Social Good 2501 | 2502 | - **The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes;** Douwe Kiela et al; Multi-modal hate-speech detection. 2503 | - **Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News;** Reuben Tan et al; Multi-modal fake news detection. 2504 | - **InfoSurgeon: Cross-Media Fine-grained Information Consistency Checking for Fake News Detection;** Yi R. Fung et al; Cross-modal fake news detection. 2505 | - **EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection;** Yaqing Wang et al 2506 | - **End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models;** Barry Menglong Yao et al 2507 | - **SAFE: Similarity-Aware Multi-Modal Fake News Detection;** Xinyi Zhou et al 2508 | - **r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection;** Kai Nakamura et al; Fake news detection dataset. 2509 | - **Fact-Checking Meets Fauxtography: Verifying Claims About Images;** Dimitrina Zlatkova et al; Claim-image pairs.
2510 | - **Prompting for Multimodal Hateful Meme Classification;** Rui Cao et al 2511 | 2512 | ## Application 2513 | 2514 | - **MSMO: Multimodal Summarization with Multimodal Output;** Junnan Zhu et al 2515 | - **Re-imagen: Retrieval-augmented text-to-image generator;** Wenhu Chen et al 2516 | - **Large Scale Multi-Lingual Multi-Modal Summarization Dataset;** Yash Verma et al 2517 | - **Retrieval-augmented Image Captioning;** Rita Ramos et al 2518 | - **SYNTHETIC MISINFORMERS: GENERATING AND COMBATING MULTIMODAL MISINFORMATION;** Stefanos-Iordanis Papadopoulos et al 2519 | - **The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training;** Gi-Cheon Kang et al 2520 | - **CapDet: Unifying Dense Captioning and Open-World Detection Pretraining;** Yanxin Long et al 2521 | - **DECAP: DECODING CLIP LATENTS FOR ZERO-SHOT CAPTIONING VIA TEXT-ONLY TRAINING;** Wei Li et al 2522 | - **Align and Attend: Multimodal Summarization with Dual Contrastive Losses;** Bo He et al 2523 | 2524 | ## Benchmark & Evaluation 2525 | 2526 | - **Multimodal datasets: misogyny, pornography, and malignant stereotypes;** Abeba Birhane et al 2527 | - **Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense;** Zhecan Wang et al 2528 | - **Probing Image–Language Transformers for Verb Understanding;** Lisa Anne Hendricks et al 2529 | - **VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations;** Tiancheng Zhao et al 2530 | - **WHEN AND WHY VISION-LANGUAGE MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT?;** Mert Yuksekgonul et al 2531 | - **GRIT: General Robust Image Task Benchmark;** Tanmay Gupta et al 2532 | - **MULTIMODALQA: COMPLEX QUESTION ANSWERING OVER TEXT, TABLES AND IMAGES;** Alon Talmor et al 2533 | - **Test of Time: Instilling Video-Language Models with a Sense of Time;** Piyush Bagad et al 2534 | 2535 | ## Dataset 2536 | 2537 | - **Visual Entailment: A Novel Task for Fine-Grained Image Understanding;** Ning Xie et al; Visual entailment task. SNLI-VE. 2538 | - **A Corpus for Reasoning About Natural Language Grounded in Photographs;** Alane Suhr et al; NLVR2. 2539 | - **VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models;** Wangchunshu Zhou et al; VLUE. 
2540 | - **Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning;** Piyush Sharma et al 2541 | - **Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts;** Soravit Changpinyo et al 2542 | - **LAION-5B: An open large-scale dataset for training next generation image-text models;** Christoph Schuhmann et al 2543 | - **Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks;** Colin Leong et al 2544 | - **Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;** Haoxuan You et al 2545 | - **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning;** Zhiyang Xu et al 2546 | - **UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training;** Biao Gong et al 2547 | - **HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips;** Antoine Miech et al 2548 | - **Connecting Vision and Language with Video Localized Narratives;** Paul Voigtlaender et al 2550 | - **MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions;** Mattia Soldan et al 2551 | - **CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos;** Seungju Han et al 2552 | - **WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning;** Krishna Srinivasan et al 2553 | - **Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text;** Wanrong Zhu et al 2554 | - **OpenAssistant Conversations - Democratizing Large Language Model Alignment;** Andreas Köpf et al 2555 | - **TheoremQA: A Theorem-driven Question Answering dataset;** Wenhu Chen et al 2556 | - **MetaCLUE: Towards Comprehensive Visual Metaphors Research;** Arjun R. Akula et al 2557 | - **CAPSFUSION: Rethinking Image-Text Data at Scale;** Qiying Yu et al 2558 | - **RedCaps: Web-curated image-text data created by the people, for the people;** Karan Desai et al 2559 | - **OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents;** Hugo Laurençon et al 2560 | 2561 | ## Robustness 2562 | 2563 | - **Domino: Discovering Systematic Errors with Cross-Modal Embeddings;** Sabri Eyuboglu et al 2564 | - **Learning Visually-Grounded Semantics from Contrastive Adversarial Samples;** Haoyue Shi et al 2565 | - **Visually Grounded Reasoning across Languages and Cultures;** Fangyu Liu et al 2566 | - **A Closer Look at the Robustness of Vision-and-Language Pre-trained Models;** Linjie Li et al; Compile a list of robustness-VQA datasets. 2567 | - **ROBUSTNESS ANALYSIS OF VIDEO-LANGUAGE MODELS AGAINST VISUAL AND LANGUAGE PERTURBATIONS;** Madeline C.
Schiappa et al 2568 | - **Context-Aware Robust Fine-Tuning;** Xiaofeng Mao et al 2569 | - **Task Bias in Vision-Language Models;** Sachit Menon et al 2570 | - **Are Multimodal Models Robust to Image and Text Perturbations?;** Jielin Qiu et al 2571 | - **CPL: Counterfactual Prompt Learning for Vision and Language Models;** Xuehai He et al 2572 | - **Improving Zero-shot Generalization and Robustness of Multi-modal Models;** Yunhao Ge et al 2573 | - **DIAGNOSING AND RECTIFYING VISION MODELS USING LANGUAGE;** Yuhui Zhang et al 2574 | - **Multimodal Prompting with Missing Modalities for Visual Recognition;** Yi-Lun Lee et al 2575 | 2576 | ## Hallucination&Factuality 2577 | 2578 | - **Object Hallucination in Image Captioning;** Anna Rohrbach et al 2579 | - **Learning to Generate Grounded Visual Captions without Localization Supervision;** Chih-Yao Ma et al 2580 | - **On Hallucination and Predictive Uncertainty in Conditional Language Generation;** Yijun Xiao et al 2581 | - **Consensus Graph Representation Learning for Better Grounded Image Captioning;** Wenqiao Zhang et al 2582 | - **Relational Graph Learning for Grounded Video Description Generation;** Wenqiao Zhang et al 2583 | - **Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning;** Ali Furkan Biten et al 2584 | - **Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training;** Wenliang Dai et al 2585 | - **Models See Hallucinations: Evaluating the Factuality in Video Captioning;** Hui Liu et al 2586 | - **Evaluating and Improving Factuality in Multimodal Abstractive Summarization;** David Wan et al 2587 | - **Evaluating Object Hallucination in Large Vision-Language Models;** Yifan Li et al 2588 | - **Do Language Models Know When They’re Hallucinating References?;** Ayush Agrawal et al 2589 | - **Detecting and Preventing Hallucinations in Large Vision Language Models;** Anisha Gunjal et al 2590 | - **DOLA: DECODING BY CONTRASTING LAYERS IMPROVES FACTUALITY IN LARGE LANGUAGE MODELS;** Yung-Sung Chuang et al 2591 | - **A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models;** S.M Towhidul Islam Tonmoy et al 2592 | - **Inference-Time Intervention: Eliciting Truthful Answers from a Language Model;** Kenneth Li et al 2593 | - **FELM: Benchmarking Factuality Evaluation of Large Language Models;** Shiqi Chen et al 2594 | - **Unveiling the Siren’s Song: Towards Reliable Fact-Conflicting Hallucination Detection;** Xiang Chen et al 2595 | - **ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS;** Yiyang Zhou et al 2596 | - **Woodpecker: Hallucination Correction for Multimodal Large Language Models;** Shukang Yin et al 2597 | - **AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation;** Junyang Wang et al 2598 | - **Fine-tuning Language Models for Factuality;** Katherine Tian et al 2599 | - **Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization;** Zhiyuan Zhao et al 2600 | - **RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback;** Tianyu Yu et al 2601 | - **RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models;** Yuanhao Wu et al 2602 | - **Learning to Trust Your Feelings: Leveraging Self-awareness in LLMs for Hallucination Mitigation;** Yuxin Liang et al 2603 | - **Don’t Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration;** Shangbin Feng 
et al 2604 | - **FLAME: Factuality-Aware Alignment for Large Language Models;** Sheng-Chieh Lin et al 2605 | - **Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability;** Jiri Hron et al 2606 | 2607 | ## Cognitive NeuronScience & Machine Learning 2608 | 2609 | - **Mind Reader: Reconstructing complex images from brain activities;** Sikun Lin et al 2610 | - **Joint processing of linguistic properties in brains and language models;** Subba Reddy Oota et al 2611 | - **Is the Brain Mechanism for Hierarchical Structure Building Universal Across Languages? An fMRI Study of Chinese and English;** Xiaohan Zhang et al 2612 | - **TRAINING LANGUAGE MODELS FOR DEEPER UNDERSTANDING IMPROVES BRAIN ALIGNMENT;** Khai Loong Aw et al 2613 | - **Abstract Visual Reasoning with Tangram Shapes;** Anya Ji et al 2614 | - **DISSOCIATING LANGUAGE AND THOUGHT IN LARGE LANGUAGE MODELS: A COGNITIVE PERSPECTIVE;** Kyle Mahowald et al 2615 | - **Language Cognition and Language Computation Human and Machine Language Understanding;** Shaonan Wang et al 2616 | - **From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought;** Lionel Wong et al 2617 | - **DIVERGENCES BETWEEN LANGUAGE MODELS AND HUMAN BRAINS;** Yuchen Zhou et al 2618 | - **Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?;** Andreas Opedal et al 2619 | 2620 | ## Theory of Mind 2621 | 2622 | - **Do Large Language Models know what humans know?;** Sean Trott et al 2623 | - **Few-shot Language Coordination by Modeling Theory of Mind;** Hao Zhu et al 2624 | - **Few-Shot Character Understanding in Movies as an Assessment to Meta-Learning of Theory-of-Mind;** Mo Yu et al 2625 | - **Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs;** Maarten Sap et al 2626 | - **A Cognitive Evaluation of Instruction Generation Agents tl;dr They Need Better Theory-of-Mind Capabilities;** Lingjun Zhao et al 2627 | - **MINDCRAFT: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks;** Cristian-Paul Bara et al 2628 | - **TVSHOWGUESS: Character Comprehension in Stories as Speaker Guessing;** Yisi Sang et al 2629 | - **Theory of Mind May Have Spontaneously Emerged in Large Language Models;** Michal Kosinski 2630 | - **COMPUTATIONAL LANGUAGE ACQUISITION WITH THEORY OF MIND;** Andy Liu et al 2631 | - **Speaking the Language of Your Listener: Audience-Aware Adaptation via Plug-and-Play Theory of Mind;** Ece Takmaz et al 2632 | - **Understanding Social Reasoning in Language Models with Language Models;** Kanishk Gandhi et al 2633 | - **HOW FAR ARE LARGE LANGUAGE MODELS FROM AGENTS WITH THEORY-OF-MIND?;** Pei Zhou et al 2634 | 2635 | ## Cognitive NeuronScience 2636 | 2637 | - **Functional specificity in the human brain: A window into the functional architecture of the mind;** Nancy Kanwisher et al 2638 | - **Visual motion aftereffect in human cortical area MT revealed by functional magnetic resonance imaging;** Roger B. H. Tootell et al 2639 | - **Speed of processing in the human visual system;** Simon Thorpe et al 2640 | - **A Cortical Area Selective for Visual Processing of the Human Body;** Paul E. Downing et al 2641 | - **Triple Dissociation of Faces, Bodies, and Objects in Extrastriate Cortex;** David Pitcher et al 2642 | - **Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex;** James V. 
Haxby et al 2643 | - **Rectilinear Edge Selectivity Is Insufficient to Explain the Category Selectivity of the Parahippocampal Place Area;** Peter B. Bryan et al 2644 | - **Selective scene perception deficits in a case of topographical disorientation;** Jessica Robin et al 2645 | - **The cognitive map in humans: spatial navigation and beyond;** Russell A Epstein et al 2646 | - **From simple innate biases to complex visual concepts;** Shimon Ullman et al 2647 | - **Face perception in monkeys reared with no exposure to faces;** Yoichi Sugita et al 2648 | - **Functional neuroanatomy of intuitive physical inference;** Jason Fischer et al 2649 | - **Recruitment of an Area Involved in Eye Movements During Mental Arithmetic;** André Knops et al 2650 | - **Intonational speech prosody encoding in the human auditory cortex;** C. Tang et al 2651 | 2652 | ## World Model 2653 | 2654 | - **Recurrent World Models Facilitate Policy Evolution;** David Ha et al 2655 | - **TRANSFORMERS ARE SAMPLE-EFFICIENT WORLD MODELS;** Vincent Micheli et al 2656 | - **Language Models Meet World Models: Embodied Experiences Enhance Language Models;** Jiannan Xiang et al 2657 | - **Reasoning with Language Model is Planning with World Model;** Shibo Hao et al 2658 | - **Learning to Model the World with Language;** Jessy Lin et al 2659 | - **Learning Interactive Real-World Simulators;** Mengjiao Yang et al 2660 | - **Diffusion World Model;** Zihan Ding et al 2661 | - **Genie: Generative Interactive Environments;** Jake Bruce et al 2662 | - **Learning and Leveraging World Models in Visual Representation Learning;** Quentin Garrido et al 2663 | - **iVideoGPT: Interactive VideoGPTs are Scalable World Models;** Jialong Wu et al 2664 | - **Navigation World Models;** Amir Bar et al 2665 | - **Cosmos World Foundation Model Platform for Physical AI;** NVIDIA 2666 | 2667 | 2668 | 2669 | ## Resource 2670 | 2671 | - **LAVIS-A One-stop Library for Language-Vision Intelligence;** https://github.com/salesforce/LAVIS 2672 | - **MULTIVIZ: TOWARDS VISUALIZING AND UNDERSTANDING MULTIMODAL MODELS;** Paul Pu Liang et al 2673 | - **TorchScale - A Library for Transformers at (Any) Scale;** Shuming Ma et al 2674 | - **Video pretraining;** https://zhuanlan.zhihu.com/p/515175476 2675 | - **Towards Complex Reasoning: the Polaris of Large Language Models;** Yao Fu 2676 | - **Prompt Engineering;** Lilian Weng 2677 | - **Memory in human brains;** https://qbi.uq.edu.au/brain-basics/memory 2678 | - **Bloom's Taxonomy;** https://cft.vanderbilt.edu/guides-sub-pages/blooms-taxonomy/#:~:text=Familiarly%20known%20as%20Bloom's%20Taxonomy,Analysis%2C%20Synthesis%2C%20and%20Evaluation. 
2679 | - **Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance;** Yao Fu et al 2680 | - **LLM Powered Autonomous Agents;** Lilian Weng 2681 | - **Retrieval-based Language Models and Applications;** Tutorial; https://github.com/ACL2023-Retrieval-LM/ACL2023-Retrieval-LM.github.io 2682 | - **Recent Advances in Vision Foundation Models;** Tutorial; https://vlp-tutorial.github.io/ 2683 | - **LM-Polygraph: Uncertainty Estimation for Language Models;** Ekaterina Fadeeva et al 2684 | - **The Q\* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data;** Nathan Lambert 2685 | - **Data-Juicer: A One-Stop Data Processing System for Large Language Models;** Daoyuan Chen et al 2686 | - **Designing, Evaluating, and Learning from Human-AI Interactions;** Sherry Tongshuang Wu et al 2687 | - **Reinforcement Learning from Human Feedback: Progress and Challenges;** John Schulman 2688 | - **Our approach to alignment research;** OpenAI 2689 | - **AI Alignment: A Comprehensive Survey;** Jiaming Ji et al; https://alignmentsurvey.com/ 2690 | - **Alignment Workshop;** https://www.alignment-workshop.com/nola-2023 2691 | - **AI Alignment Research Overview;** Jacob Steinhardt 2692 | - **Proxy objectives in reinforcement learning from human feedback;** John Schulman 2693 | - **AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System;** Zhiwei Liu et al 2694 | - **Training great LLMs entirely from ground up in the wilderness as a startup;** Yi Tay 2695 | - **MiniCPM: Unveiling the Infinite Potential of On-Device Large Language Models;** Shengding Hu (胡声鼎) et al 2696 | - **Superalignment Research Directions;** OpenAI; https://openai.notion.site/Research-directions-0df8dd8136004615b0936bf48eb6aeb8 2697 | - **Llama 3 Opens the Second Chapter of the Game of Scale;** Yao Fu 2698 | - **RLHF Workflow: From Reward Modeling to Online RLHF;** Hanze Dong et al 2699 | - **TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models;** Junlong Jia et al 2700 | - **LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data?;** Bo Li et al 2701 | - **John Schulman (OpenAI Cofounder) - Reasoning, RLHF, & Plan for 2027 AGI;** 2702 | - **LLAMAFACTORY: Unified Efficient Fine-Tuning of 100+ Language Models;** Yaowei Zheng et al 2703 | - **How NuminaMath Won the 1st AIMO Progress Prize;** Yann Fleureau et al 2704 | - **How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model;** Sharath Sreenivas et al 2705 | - **Cosmopedia: how to create large-scale synthetic data for pre-training;** Loubna Ben Allal et al 2706 | - **SmolLM - blazingly fast and remarkably powerful;** Loubna Ben Allal et al 2707 | - **A recipe for frontier model post-training;** Nathan Lambert 2708 | - **Speculations on Test-Time Scaling (o1);** Sasha Rush 2709 | - **Don't teach.
Incentivize;** Hyung Won Chung 2710 | - **From PhD to Google DeepMind: Lessons and Gratitude on My Journey;** Fuzhao Xue 2711 | - **Reward Hacking in Reinforcement Learning;** Lilian Weng 2712 | - **ICML 2024 Tutorial: Physics of Language Models;** Zeyuan Allen-Zhu 2713 | - **Process Reinforcement through Implicit Rewards;** Ganqu Cui et al 2714 | - **Scaling Paradigms for Large Language Models;** Jason Wei 2715 | - **Recommendations for Technical AI Safety Research Directions;** Anthropic Alignment Team 2716 | - **Optimizing LLM Test-Time Compute Involves Solving a Meta-RL Problem;** Amrith Setlur et al 2717 | - **There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study;** Zichen Liu et al 2718 | - **RLVR in Vision Language Models: Findings, Questions and Directions;** Liang Chen et al 2719 | - **LLM (ML) Job Interviews - Resources;** Mimansa Jaiswal et al 2720 | - **SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning;** Shiyi Cao et al 2721 | - **On-Policy Distillation;** Kevin Lu et al 2722 | --------------------------------------------------------------------------------