├── README.md
└── images
├── act.PNG
├── act1.PNG
├── act2.PNG
├── figure.PNG
├── foundation_models.png
├── octo.PNG
└── openvla.PNG
/README.md:
--------------------------------------------------------------------------------
1 | # Foundation Models for Manipulation: Overview
2 |
3 |
4 |

5 |
6 |
7 | The world of robotics is evolving rapidly with the advent of **foundation models**: large AI systems that enable robots to perform complex manipulation tasks with unprecedented flexibility. These models, which leverage techniques like **transformers** and **imitation learning**, allow robots to learn across diverse environments and tasks without task-specific programming.
8 |
9 | In this blog post, we’ll break down key foundation models for robotic manipulation, including:
10 |
11 | - **Architectures of Foundation Models**: How modern models like ACT, Octo, OpenVLA and Helix are designed to enable robots to perform generalist tasks.
12 | - **Action Representation**: Different methods for representing actions, such as continuous space, discretization and diffusion-based generation.
13 | - **Finetuning Considerations**: Key insights on how to adapt these models for specific tasks and environments to ensure high performance in real-world applications.
14 |
15 | Whether you're an AI researcher, roboticist or just curious about the future of autonomous robots, this guide will provide a clear and engaging overview of the exciting innovations shaping the next generation of robotic manipulation.
16 |
17 | ## 1) Action Chunking with Transformers (ACT)
18 |
19 |
20 |

21 |
22 |
23 | ACT leverages **transformer-based action chunking** and end-to-end imitation learning to enable low-cost robotic arms to perform complex tasks with high success rates. Developed as part of the ALOHA project, ACT learns from real-world demonstrations collected via a custom teleoperation setup. The model generates action sequences in a chunked manner, improving stability and reducing compounding errors over time. With only 10 minutes of demonstrations, ACT enables robots to achieve 80-90% success rates on fine manipulation tasks.
24 |
25 | ## Dataset
26 |
27 | ACT uses a dataset collected from real-world bimanual teleoperation experiments. Rather than relying on pre-existing datasets, the authors gather their own **human demonstrations**, consisting of trajectories of **image observations, joint positions and executed actions**.
28 |
29 | ## Input & Output
30 |
31 | ### Input
32 | - **4 RGB images**
33 | - **Joint positions** for the two robot arms (7+7=14 DOF)
34 |
35 | ### Output
36 |
37 | - **Absolute joint positions** in chunks (e.g., next 100 timesteps)
38 |
39 | ## Model Architecture
40 |
41 |
42 |

43 |
44 |
45 |
46 |

47 |
48 |
49 | ### Training Phase
50 |
51 | #### Step 1: Sample Data
52 | From the demonstration dataset, we sample:
53 | - A sequence of **RGB images**
54 | - **Joint positions** of two 7-DOF robot arms (14-dimensional vector)
55 | - A target **action sequence** over the next $k$ time steps
56 |
57 | #### Step 2: Infer Latent Style Variable $z$
58 | The encoder is a BERT-style transformer encoder that receives:
59 | - A learned **[CLS]** token
60 | - The **current joint positions**, projected to the embedding dimension
61 | - The **target action sequence**, also linearly embedded
62 |
63 | These inputs form a $(k + 2) \times d_\text{embed}$ sequence. After passing through the transformer encoder, the **[CLS] token output** is used to predict the **mean and variance** of the latent style variable $z$, modeled as a diagonal Gaussian. Using the **reparameterization trick**, a sample of $z$ is drawn, enabling gradient backpropagation.
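
To make Step 2 concrete, here is a minimal PyTorch-style sketch of the CVAE encoder and the reparameterization trick. It assumes an embedding size of 512 and a 32-dimensional latent; class and attribute names are illustrative, not taken from the official ACT implementation.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Illustrative CVAE encoder: [CLS] + joints + action chunk -> latent style variable z."""
    def __init__(self, d_embed=512, z_dim=32, action_dim=14):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_embed))       # learned [CLS] token
        self.embed_joints = nn.Linear(action_dim, d_embed)        # current 14-D joint positions
        self.embed_actions = nn.Linear(action_dim, d_embed)       # each action of the target chunk
        layer = nn.TransformerEncoderLayer(d_embed, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mu_logvar = nn.Linear(d_embed, 2 * z_dim)

    def forward(self, joints, actions):                           # joints: (B, 14), actions: (B, k, 14)
        B = joints.shape[0]
        tokens = torch.cat([self.cls.expand(B, -1, -1),
                            self.embed_joints(joints).unsqueeze(1),
                            self.embed_actions(actions)], dim=1)  # (B, k + 2, d_embed)
        cls_out = self.encoder(tokens)[:, 0]                      # take the [CLS] output
        mu, logvar = self.to_mu_logvar(cls_out).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return z, mu, logvar
```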
64 |
65 | #### Step 3: Decode Predicted Action Sequence
66 | The decoder — the actual **policy** — takes as input:
67 | - **Image features**: Each image is processed with a **ResNet18** to get a 15×20×512 feature map, flattened into a sequence of 300×512. For 4 cameras, this gives a total of 1200×512.
68 | - **2D sinusoidal position embeddings** are added to preserve spatial structure.
69 | - **Joint positions** and **$z$**, both projected to the same embedding dimension.
70 |
71 | These inputs are concatenated into a 1202×512 sequence and passed through a transformer encoder. A transformer decoder uses cross-attention to generate a sequence of **$k \times 14$** outputs, representing joint positions for each time step.
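
The decoder can be sketched in the same spirit. The snippet below is a simplified illustration, not the official implementation: the 480×640 input resolution, the layer counts and the use of learned action queries are assumptions, and the 2D sinusoidal position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class ActionDecoder(nn.Module):
    """Illustrative CVAE decoder (the policy): images + joints + z -> k x 14 action chunk."""
    def __init__(self, d=512, k=100, action_dim=14, z_dim=32):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])   # (B, 512, 15, 20) per 480x640 image
        self.embed_joints = nn.Linear(action_dim, d)
        self.embed_z = nn.Linear(z_dim, d)
        self.transformer = nn.Transformer(d_model=d, batch_first=True)
        self.queries = nn.Parameter(torch.zeros(1, k, d))         # k learned action queries
        self.head = nn.Linear(d, action_dim)

    def forward(self, images, joints, z):                         # images: (B, 4, 3, 480, 640)
        B = images.shape[0]
        feats = [self.cnn(images[:, i]).flatten(2).transpose(1, 2)
                 for i in range(images.shape[1])]                 # 4 x (B, 300, 512)
        src = torch.cat(feats + [self.embed_joints(joints).unsqueeze(1),
                                 self.embed_z(z).unsqueeze(1)], dim=1)   # (B, 1202, 512)
        out = self.transformer(src, self.queries.expand(B, -1, -1))      # cross-attention decoding
        return self.head(out)                                     # (B, k, 14)
```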
72 |
73 | ---
74 |
75 | ### Inference Phase
76 |
77 | At test time, the model uses only the **CVAE Decoder** as the policy. The encoder is discarded.
78 |
79 | - The robot receives a new observation: **RGB images + joint positions**
80 | - These are processed exactly as during training (ResNet18 → flattened features → transformer encoder)
81 | - The **style variable $z$** is fixed to a **zero vector** (i.e., mean of the prior distribution)
82 | - The transformer decoder outputs a deterministic **$k \times 14$** tensor, corresponding to the next $k$ joint positions
83 |
84 | This deterministic decoding provides stable, repeatable behavior, which is especially valuable for evaluation and deployment.
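
A hypothetical rollout loop might look as follows; `env.get_observation`, `env.send_joint_command` and `env.done` are placeholder names for the robot interface, not part of ACT.

```python
import torch

@torch.no_grad()
def rollout(policy, env, z_dim=32):
    """Test-time loop: decode a chunk with z = 0, execute it, then replan."""
    while not env.done():
        images, joints = env.get_observation()        # 4 RGB views + 14-D joint vector
        z = torch.zeros(1, z_dim)                     # mean of the prior -> deterministic policy
        chunk = policy(images, joints, z)             # (1, k, 14) absolute joint targets
        for action in chunk[0]:                       # execute the chunk step by step
            env.send_joint_command(action)
```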
85 |
86 | ## Innovative Contributions
87 |
88 | A central innovation of ACT is its use of **action chunking** - predicting sequences of joint positions over a fixed horizon (e.g., the next *k* steps) instead of single-step actions. This chunked prediction strategy reduces the task's effective time horizon and significantly mitigates compounding errors during execution.
89 |
90 | ## 2) Octo: An Open-Source Generalist Robot Policy
91 |
92 |
93 |

94 |
95 |
96 | Octo is a large, transformer-based policy pretrained on 800k demonstrations from the Open X-Embodiment dataset. Designed for flexibility, it supports multiple robots, sensor setups, and task types - including language commands and goal images. Octo can be finetuned quickly on new environments and is fully open-source, making it a powerful foundation for scalable, general-purpose robotic learning.
97 |
98 | ## Dataset
99 |
100 | Octo is trained on a massive dataset of **800,000 robot trajectories** collected from the Open X-Embodiment dataset - the largest and most diverse robot manipulation dataset to date. This dataset brings together demonstrations from nine different robotic platforms, spanning a wide variety of manipulation tasks such as pick-and-place, tool use, button pressing and drawer opening or closing. The data is highly heterogeneous, featuring a mix of camera perspectives (e.g., wrist-mounted and third-person views), robots with different degrees of freedom, and task-conditioning signals in the form of either language instructions or goal images.
101 |
102 | ## Input & Output
103 |
104 | ### **Input:**
105 |
106 | - **RGB images** from multiple viewpoints (wrist cam, third-person).
107 | - **Proprioceptive states** (joint positions, velocities).
108 | - **Task conditioning**:
109 | - **Text commands** (e.g., "Pick up the red cup").
110 | - **Goal images** (e.g., "Make the scene look like this").
111 |
112 | ### **Output:**
113 | - **Delta position Cartesian actions** in chunks.
114 |
115 |
116 | ## Model Architecture
117 |
118 | The **Octo** architecture consists of three main components:
119 |
120 | 1. **Input tokenizers** for processing observations and task specifications
121 | 2. A **transformer backbone** that encodes the unified input sequence
122 | 3. **Readout heads** that decode the embeddings into actionable commands
123 |
124 | ### Input Tokenization
125 |
126 | Octo supports multiple input modalities including language commands, goal images, and diverse robot observations. Each of these is converted into a unified token representation using modality-specific encoders:
127 |
128 | - **Language commands** are tokenized and encoded using a pretrained **T5-base** transformer model, producing a sequence of language embeddings.
129 | - **Goal images** and **RGB observations** (from wrist or third-person cameras) are passed through a shallow CNN, then divided into flattened patch sequences.
130 |
131 | After encoding, each token is assigned a **learned positional embedding**. These are concatenated into a single token sequence that includes both **task tokens** (e.g., language or goal images) and **observation tokens**, forming the complete input to the transformer.
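
As a rough illustration of this assembly step (the encoders themselves are omitted and all shapes are assumptions), the unified sequence could be built like this:

```python
import torch

def build_token_sequence(lang_tokens, goal_patches, obs_patches, pos_emb_task, pos_emb_obs):
    """Illustrative assembly of the transformer input (shapes are assumptions):
    lang_tokens:  (B, L, d)  T5-base embeddings of the instruction
    goal_patches: (B, G, d)  goal image after the shallow CNN + patchification
    obs_patches:  (B, O, d)  observation images after the shallow CNN + patchification
    """
    task_tokens = torch.cat([lang_tokens, goal_patches], dim=1) + pos_emb_task
    obs_tokens = obs_patches + pos_emb_obs               # learned positional embeddings
    return torch.cat([task_tokens, obs_tokens], dim=1)   # task tokens first, then observations
```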
132 |
133 | ### Transformer Backbone
134 |
135 | The token sequence is processed by a **transformer model** with a block-wise attention mechanism. Observation tokens are allowed to attend causally - meaning only to past or current tokens - while also attending to task tokens. This structure ensures proper temporal consistency in policy outputs.
136 |
137 | Importantly, modality-specific blocks can be masked, enabling Octo to seamlessly handle datasets with missing modalities (e.g., no language input) and making it highly modular for downstream finetuning.
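
One way to picture this block-wise pattern is as a boolean attention mask. The sketch below is an assumption about how such a mask could be constructed, not Octo's actual implementation.

```python
import torch

def blockwise_causal_mask(n_task, n_obs_per_step, n_steps):
    """Build a mask where True means 'may attend'. Task tokens come first, followed by
    n_steps blocks of observation tokens. (PyTorch's attn_mask uses the opposite
    convention, so this would need to be inverted before use.)"""
    n = n_task + n_obs_per_step * n_steps
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_task, :n_task] = True                    # task tokens attend among themselves
    for t in range(n_steps):
        start = n_task + t * n_obs_per_step
        end = start + n_obs_per_step
        mask[start:end, :n_task] = True              # observations attend to task tokens
        mask[start:end, n_task:end] = True           # ...and to current/past observations only
    return mask
```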
138 |
139 | ### Readout Heads & Action Prediction
140 |
141 | To generate actions, **readout tokens** are inserted into the input sequence. These tokens attend to task and observation tokens but are **not attended to in return**. They act as passive readers, similar to the [CLS] token in BERT, summarizing the encoded information.
142 |
143 | The output embeddings of the readout tokens are passed through a lightweight **action head** based on **diffusion models**, which predicts a **chunk of future actions**. This formulation allows Octo to model complex, multimodal action distributions and supports chunked action execution similar to ACT.
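
The readout mechanism can be illustrated by extending the mask from the previous sketch; again, this is an assumption about the mechanics rather than Octo's code. The embeddings collected at the readout positions are then consumed by the diffusion action head, whose training and sampling loops are sketched in Section 5.

```python
import torch

def append_readout_tokens(mask, n_readout):
    """Readout tokens attend to all existing tokens (and each other) but are not
    attended to in return, like the [CLS] token in BERT."""
    n = mask.shape[0]
    new = torch.zeros(n + n_readout, n + n_readout, dtype=torch.bool)
    new[:n, :n] = mask                               # keep the original block-wise mask
    new[n:, :] = True                                # readout tokens read everything
    new[:n, n:] = False                              # nothing attends back to readout tokens
    return new
```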
144 |
145 | ## Innovative Contributions
146 |
147 | One of Octo’s key design advantages is its **modular and adaptable architecture**. During finetuning, new sensors, tasks, or robot morphologies can be integrated by simply attaching new lightweight encoders, positional embeddings, or output heads — all **without modifying the pretrained transformer weights**. This stands in contrast to prior architectures that often require reinitialization or full retraining when adapting to new settings.
148 |
149 | ## 3) OpenVLA: An Open-Source Vision-Language-Action Model
150 |
151 |
152 |

153 |
154 |
157 | **OpenVLA** is a 7B-parameter open-source model for generalist robot manipulation, trained on **970k real-world demos** from the **Open X-Embodiment** dataset. It combines a **LLaMA 2 language model** with visual features from **DINOv2** and **SigLIP**, enabling rich vision-language grounding.
158 |
159 | ## Dataset
160 |
161 | OpenVLA is trained on a curated subset of 970,000 robot demonstrations from the Open X-Embodiment dataset, which contains over 2 million trajectories from more than 70 individual robot datasets. To ensure consistency, only demonstrations with third-person camera views and single-arm end-effector control were included. For diversity, the team followed Octo’s data mixture strategy, prioritizing datasets with a wide range of tasks and scenes, while down-weighting redundant or narrow-scope data. This balance enables strong generalization across embodiments and environments.
162 |
163 | ## Input & Output
164 |
165 | ### **Input:**
166 | - **Observation image(s):** One or more RGB frames from third-person cameras, processed by the visual encoder.
167 | - **Language instruction:** A natural language command describing the desired task (e.g., "stack the blocks" or "put the apple in the bowl").
168 |
169 | ### **Output:**
170 | - **Delta position Cartesian actions** as discrete tokens
171 |
172 | ## Model Architecture
173 |
174 | OpenVLA builds on a modular vision-language foundation, with three primary components:
175 |
176 | 1. **Visual Encoder:**
177 | - Dual-encoder setup: features from **DINOv2** and **SigLIP** are extracted independently and concatenated.
178 | - Enables strong spatial grounding, helpful for manipulation tasks involving complex scenes.
179 |
180 | 2. **Projector:**
181 | - A small 2-layer MLP that maps visual features into the language model's token embedding space.
182 | - Ensures compatibility with the Llama 2 tokenizer and architecture.
183 |
184 | 3. **Language Model Backbone (Llama 2 7B, from the Prismatic-7B VLM):**
185 | - Based on **Llama 2 (7B)**, pretrained on large-scale Internet text.
186 | - Fine-tuned with a next-token prediction objective on mixed vision-language-action data.
187 | - Predicts tokenized robot actions in an autoregressive fashion, conditioned on the task context.
188 |
189 | This combination allows OpenVLA to act as a generalist visuomotor controller, understanding high-level language commands and grounding them into low-level action sequences.
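
As an illustration of the "actions as discrete tokens" output format, the sketch below shows a generic RT-2-style binning scheme; the exact bin boundaries and the mapping into the Llama 2 token vocabulary used by OpenVLA differ in detail.

```python
import numpy as np

def actions_to_tokens(action, low, high, n_bins=256):
    """Map each continuous action dimension (e.g. a 7-D end-effector delta) to a bin id."""
    action = np.clip(action, low, high)
    return ((action - low) / (high - low) * (n_bins - 1)).round().astype(int)

def tokens_to_actions(ids, low, high, n_bins=256):
    """Inverse mapping used at inference: bin ids -> continuous action values."""
    return low + ids.astype(float) / (n_bins - 1) * (high - low)
```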
190 |
191 | ## Innovative Contributions
192 |
193 | A key innovation of OpenVLA is its ability to **ground natural language instructions in visual observations** by leveraging a large pretrained language model (LLaMA 2) within a unified vision-language-action architecture. This enables OpenVLA to understand and execute complex task instructions - such as “place the blue mug on the top shelf next to the red bowl” - without requiring handcrafted reward functions or rigid scripting.
194 |
195 | ## 4) Helix: A Vision-Language-Action Model for Humanoid Control
196 |
197 |
198 |

199 |
200 |
201 | Helix (**Figure AI**) is a Vision-Language-Action (VLA) model capable of controlling the **entire upper body of a humanoid robot** from raw pixels and natural language. It introduces a novel dual-system design - System 1 for fast, reactive control and System 2 for semantic understanding - enabling real-time dexterous manipulation grounded in language.
202 |
203 | ## Dataset
204 |
205 | Helix is trained on a high-quality, diverse dataset consisting of approximately 500 hours of teleoperated demonstrations, collected across multiple robots and human operators. These demonstrations cover a broad spectrum of upper-body behaviors, including precise finger movements, coordinated arm motions and full-body pose adjustments. To generate language-conditioned training pairs at scale, an auto-labeling vision-language model (VLM) is used to create hindsight instructions. This model analyzes segmented video clips from onboard cameras and answers the prompt: “What instruction would you have given the robot to get the action seen in this video?”
206 |
207 | ## Input & Output
208 |
209 | ### **Input:**
210 | - **Monocular RGB image** from the robot’s onboard camera
211 | - **Robot state information** (e.g., wrist pose, finger joint positions)
212 | - **Natural language command** specifying the desired behavior
213 |
214 | ### **Output:**
215 | - **Continuous 35-DoF action vector** at 200 Hz, including:
216 | - **Wrist pose targets**
217 | - **Finger movements**
218 | - **Head and torso orientation**
219 |
220 | ## Model Architecture
221 |
222 | Helix consists of two main components that operate at different frequencies: **System 2 (S2)** for high-level perception and planning, and **System 1 (S1)** for low-level real-time control.
223 |
224 | ### **System 2 (S2): Vision-Language Model**
225 |
226 | S2 is a 7B-parameter vision-language model (VLM), pretrained on large-scale internet data. It processes:
227 | - Monocular RGB images from the robot’s onboard camera
228 | - Proprioceptive robot state (e.g., wrist pose, finger joint positions)
229 | - A natural language command
230 |
231 | These inputs are encoded into a shared embedding space and distilled into a single **latent semantic vector**, which summarizes the high-level task intent. This vector is passed to S1 to guide motor control.
232 |
233 | ### **System 1 (S1): Visuomotor Transformer**
234 |
235 | S1 is an 80M-parameter cross-attention encoder-decoder transformer optimized for reactive control at **200 Hz**. It uses:
236 | - A multi-scale convolutional vision backbone pretrained in simulation
237 | - The same image and state inputs as S2
238 | - The latent vector from S2 as task-conditioning input
239 |
240 | These inputs are combined and processed to produce continuous control outputs for:
241 | - End-effector poses (wrist and arm)
242 | - Finger flexion and abduction
243 | - Head and torso orientation
244 | - A scalar representing task progress (used for predicting completion)
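
A schematic control loop for this dual-rate design might look as follows. The robot interface names and the S2 refresh period are assumptions; the key point is simply that S2 updates the latent far less often than S1 emits commands at 200 Hz.

```python
import time

def helix_style_control_loop(s2, s1, robot, s1_hz=200, s2_period=0.2):
    """Illustrative dual-rate loop: S2 refreshes a latent task vector slowly, while S1
    turns the latest latent plus fresh observations into 35-DoF commands at 200 Hz."""
    latent, last_s2 = None, 0.0
    while True:
        image, state = robot.get_observation()           # monocular RGB + proprioception
        now = time.monotonic()
        if latent is None or now - last_s2 > s2_period:
            latent = s2(image, state, robot.current_instruction())  # slow semantic pass
            last_s2 = now
        action = s1(image, state, latent)                # fast reactive pass (35-DoF vector)
        robot.send_command(action)
        time.sleep(1.0 / s1_hz)                          # crude pacing, for illustration only
```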
245 |
246 | ## Innovative Contributions
247 |
248 | Helix introduces a novel dual-system architecture inspired by "System 1 / System 2" reasoning. **System 2** (S2) handles slow, semantic understanding using a large vision-language model, while **System 1** (S1) performs fast, reactive control at 200 Hz. This separation allows Helix to combine internet-scale language grounding with high-frequency, whole upper-body humanoid control.
249 |
250 | ## 5) Action Representation
251 |
252 | One of the most crucial components in any robot policy is how actions are represented and generated. Different approaches make different trade-offs in terms of generalization, expressivity, and training stability. This section outlines and compares three prominent action representation strategies: **MSE regression**, **discretization**, and **diffusion-based generation**.
253 |
254 | ### 1. Continuous Regression with MSE Loss
255 |
256 | The most straightforward method is to **directly regress the next action** (e.g., joint positions or torques) using **Mean Squared Error (MSE)**:
257 |
258 | ```math
259 | \mathcal{L}_{\text{MSE}} = \frac{1}{T} \sum_{t=1}^{T} \left\| a_t^{\text{pred}} - a_t^{\text{true}} \right\|^2
260 | ```
261 |
262 | This method assumes a **unimodal distribution**, producing the "average" best action. It works well when demonstrations are consistent, but struggles in multimodal settings, where multiple distinct strategies exist (e.g., grasping an object from different angles).
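
A minimal sketch of this baseline, assuming a policy network that directly outputs an action (or action chunk) for each observation:

```python
import torch.nn.functional as F

def mse_action_loss(policy, obs, true_actions):
    """Plain regression objective: one continuous prediction per observation."""
    pred_actions = policy(obs)            # same shape as true_actions, e.g. (B, T, action_dim)
    return F.mse_loss(pred_actions, true_actions)
```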
263 |
264 | ---
265 |
266 | ### 2. Discretized Action Space
267 |
268 | Instead of predicting continuous values, one can **discretize** each action dimension into $K$ bins and treat action generation as a **classification task**:
269 |
270 | - Each action $a_i$ is split into $K$ bins.
271 | - The model outputs a probability distribution over bins.
272 |
273 | Training is done using **cross-entropy loss**:
274 |
275 | ```math
276 | \mathcal{L}_{\text{disc}} = - \sum_{i=1}^{D} \log p_i(a_i)
277 | ```
278 |
279 | This approach enables **multi-modal prediction** by selecting among multiple possible bins. However, it introduces quantization errors and can lead to coarse, jittery behavior in fine-grained tasks.
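
A minimal sketch of this scheme, assuming per-dimension logits and uniform bins over a known action range:

```python
import torch.nn.functional as F

def discretized_action_loss(logits, true_actions, low, high, n_bins=256):
    """Classification objective: logits has shape (B, D, n_bins), one distribution per dimension."""
    targets = ((true_actions.clamp(low, high) - low) / (high - low) * (n_bins - 1)).round().long()
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

def decode_discretized_action(logits, low, high, n_bins=256):
    """At inference: pick a bin per dimension and map it back to a continuous value."""
    ids = logits.argmax(dim=-1)           # (B, D), greedy bin selection
    return low + ids.float() / (n_bins - 1) * (high - low)
```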
280 |
281 | ---
282 |
283 | ### 3. Diffusion Models for Action Generation
284 |
285 | Diffusion models provide a powerful way to represent **multi-modal**, continuous action distributions, especially when predicting **chunks of actions** rather than single steps. These models consist of two phases: a **forward process** (adding noise) and a **reverse process** (iterative denoising).
286 |
287 | #### Forward Process (Training Phase)
288 |
289 | In the forward process, we gradually add Gaussian noise to a ground-truth action chunk $a_0$, generating noisy versions $x_k$ at timestep $k$:
290 |
291 | $$
292 | x_k = \sqrt{\alpha_k} a_0 + \sqrt{1 - \alpha_k} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
293 | $$
294 |
295 | The model is trained to **predict the noise** that was added:
296 |
297 | ```math
298 | \mathcal{L}_{\text{diff}} = \mathbb{E}_{a_0, \epsilon, k} \left[ \left\| \epsilon - \epsilon_\theta(x_k, e, k) \right\|^2 \right]
299 | ```
300 |
301 | Where:
302 | - $x_k$ is the noisy action chunk at timestep $k$
303 | - $\epsilon_\theta$ is the denoising network (diffusion head)
304 | - $e$ is the context embedding from the transformer
305 |
306 | **Training Pseudocode**:
307 |
```python
import torch
import torch.nn.functional as F

def diffusion_training_step(a0, context_embedding, denoise_net, noise_schedule, num_steps=50):
    # a0: ground-truth action chunk, e.g. shape (batch, chunk_len, action_dim)
    # denoise_net and noise_schedule are assumed to be defined elsewhere

    # Sample a random diffusion timestep and Gaussian noise
    k = torch.randint(0, num_steps, (1,)).item()
    eps = torch.randn_like(a0)
    alpha_k = noise_schedule.alpha(k)   # cumulative signal fraction at step k, in (0, 1)

    # Forward diffusion (noisy input)
    x_k = alpha_k ** 0.5 * a0 + (1 - alpha_k) ** 0.5 * eps

    # Predict the noise, conditioned on the transformer context embedding
    eps_pred = denoise_net(x_k, context_embedding, k)

    # Loss: predict the added noise
    loss = F.mse_loss(eps_pred, eps)

    return loss
```
326 |
327 |
328 | #### Reverse Process (Inference Phase)
329 |
330 | Once a diffusion model is trained, generating actions is done via a **reverse denoising process**, starting from random noise and progressively refining it into a meaningful action or action chunk.
331 |
332 | Given a noisy action sample $x_k$, the model denoises it using:
333 |
334 | $$
335 | x_{k-1} = \alpha_k \left(x_k - \gamma_k \, \epsilon_\theta(x_k, e, k)\right) + \sigma_k \, \mathcal{N}(0, I)
336 | $$
337 |
338 | Where:
339 | - $x_k$ is the noisy action chunk at step $k$
340 | - $\epsilon_\theta(x_k, e, k)$ is the predicted noise from the denoising network
341 | - $e$ is the transformer-derived context embedding
342 | - $\alpha_k, \gamma_k, \sigma_k$ are parameters from a cosine or linear noise schedule
343 | - The added Gaussian noise $\mathcal{N}(0, I)$ ensures sample diversity
344 |
345 | This step is iteratively applied from a pure noise sample $x_T$ down to $x_0$, the final denoised action chunk.
346 |
347 | #### Inference Pseudocode (Chunk Prediction)
348 |
```python
import torch

@torch.no_grad()
def generate_action_chunk(context_embedding, denoise_net, noise_schedule, chunk_shape, T=50):
    # chunk_shape: e.g. (batch, chunk_len, action_dim)

    # Start from pure Gaussian noise
    x = torch.randn(chunk_shape)

    # Iteratively denoise
    for k in reversed(range(T)):
        eps_pred = denoise_net(x, context_embedding, k)
        alpha_k, gamma_k, sigma_k = noise_schedule.get(k)

        # Reverse denoising step
        x = alpha_k * (x - gamma_k * eps_pred)
        if k > 0:
            x += sigma_k * torch.randn_like(x)

    return x  # Final predicted action chunk
```
367 |
368 | ## 6) Fine-tuning VLA Models: Considerations and Strategies
369 |
370 | Fine-tuning is a critical process for adapting pre-trained **VLA models** to specific tasks and robot setups. While these models are powerful and generalizable out of the box, fine-tuning allows for improved performance in diverse real-world scenarios. This section explores general considerations and the different strategies available for fine-tuning VLA models to achieve task-specific optimization.
371 |
372 | ### 1. **Full Fine-tuning**
373 | Full fine-tuning involves updating all model parameters, including the vision encoder, LLM and transformer layers. This approach provides the most flexibility and potential for performance improvement but requires substantial computational resources. It is suitable for situations where the robot setup and task domain are significantly different from the pre-trained data.
374 |
375 | - **Pros**: High performance with full adaptation.
376 | - **Cons**: High computational cost and memory usage.
377 |
378 | ### 2. **Last Layer Only**
379 | In this strategy, only the last layer of the model’s transformer backbone is fine-tuned. This significantly reduces the number of trainable parameters and the computational requirements, but may limit the model’s ability to adapt to new tasks that demand deeper adjustments across the network.
380 |
381 | - **Pros**: Low computational cost and memory usage.
382 | - **Cons**: Likely to yield poorer performance on complex tasks.
383 |
384 | ### 3. **Sandwich Fine-tuning**
385 | Sandwich fine-tuning unfreezes the vision encoder and last layer while keeping the rest of the model frozen. This technique is a compromise between full fine-tuning and parameter-efficient approaches, providing better adaptation to new visual features while saving on GPU memory by not fine-tuning the entire model backbone.
386 |
387 | - **Pros**: Balanced approach with good performance and reduced memory usage.
388 | - **Cons**: Still requires significant resources, though less than full fine-tuning.
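
A generic PyTorch pattern for strategies 2 and 3 is sketched below; the attribute names (`vision_encoder`, `backbone`, `action_head`) are assumptions about the model layout rather than any specific VLA codebase.

```python
def apply_sandwich_finetuning(model, unfreeze_vision=True):
    """Freeze everything, then selectively unfreeze the parts to be fine-tuned."""
    for p in model.parameters():
        p.requires_grad = False                # start fully frozen
    if unfreeze_vision:
        for p in model.vision_encoder.parameters():
            p.requires_grad = True             # adapt to the new visual domain
    for p in model.backbone[-1].parameters():
        p.requires_grad = True                 # last transformer block
    for p in model.action_head.parameters():
        p.requires_grad = True                 # task-specific output head
```

With `unfreeze_vision=False`, the same function reduces to the "last layer only" strategy.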
389 |
390 | ### 4. **LoRA (Low-Rank Adaptation)**
391 | LoRA adapts a pretrained model by adding small trainable low-rank matrices alongside the frozen original weights, achieving performance close to that of full fine-tuning while modifying only a small fraction of the parameters. By applying LoRA to all linear layers of the model, the number of trainable parameters can be reduced drastically (often to just 1.4% of the full model), yielding significant computational savings without sacrificing performance.
392 |
393 | - **Pros**: Best performance-compute trade-off, requiring only a fraction of the model parameters to be updated.
394 | - **Cons**: May not fully capture all potential domain-specific nuances compared to full fine-tuning.
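
To make the idea concrete, here is a minimal LoRA layer written in plain PyTorch (in practice one would typically rely on an existing library); the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (B @ A) * (alpha / r)."""
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

Wrapping all linear layers of the backbone in this way is what brings the trainable-parameter count down to the roughly 1.4% figure mentioned above.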
395 |
396 | ### Choosing the Right Strategy
397 |
398 | The selection of a fine-tuning strategy should depend on the available computational resources, the complexity of the task, and the extent of the domain shift between the pre-trained model and the target environment. For most use cases, **LoRA** presents a highly effective solution, offering an excellent trade-off between computational efficiency and task performance.
399 |
400 | ## 7) Key Works and Citations
401 |
402 | - **T. Zhao, V. Kumar**: [*Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware*](https://arxiv.org/pdf/2304.13705)
403 | - **D. Ghosh, H. Walke**: [*Octo: An Open-Source Generalist Robot Policy*](https://arxiv.org/pdf/2405.12213)
404 | - **A. Brohan, N. Brown**: [*RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control*](https://arxiv.org/pdf/2307.15818)
405 | - **M. Kim, K. Pertsch**: [*OpenVLA: An Open-Source Vision-Language-Action Model*](https://arxiv.org/pdf/2406.09246)
406 | - **K. Black, N. Brown**: [*π0: A Vision-Language-Action Flow Model for General Robot Control*](https://arxiv.org/pdf/2410.24164)
407 | - **Figure AI**: [*Helix: A Vision-Language-Action Model for Generalist Humanoid Control*](https://www.figure.ai/news/helix)
408 | - **C. Chi, Z. Xu, S. Feng**: [*Diffusion Policy: Visuomotor Policy Learning via Action Diffusion*](https://arxiv.org/pdf/2303.04137)
409 | - **K. Pertsch, K. Stachowicz**: [*FAST: Efficient Action Tokenization for Vision-Language-Action Models*](https://arxiv.org/pdf/2501.09747)
410 | - **G. Berseth**: [*Coding Generalist Robot Policies*](https://www.youtube.com/watch?v=w12h2tKKl_s)
411 |
--------------------------------------------------------------------------------
/images/act.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Argo-Robot/foundation_models/6ce10fb729fb53b17fe926a175274a661f6b0d14/images/act.PNG
--------------------------------------------------------------------------------
/images/act1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Argo-Robot/foundation_models/6ce10fb729fb53b17fe926a175274a661f6b0d14/images/act1.PNG
--------------------------------------------------------------------------------
/images/act2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Argo-Robot/foundation_models/6ce10fb729fb53b17fe926a175274a661f6b0d14/images/act2.PNG
--------------------------------------------------------------------------------
/images/figure.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Argo-Robot/foundation_models/6ce10fb729fb53b17fe926a175274a661f6b0d14/images/figure.PNG
--------------------------------------------------------------------------------
/images/foundation_models.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Argo-Robot/foundation_models/6ce10fb729fb53b17fe926a175274a661f6b0d14/images/foundation_models.png
--------------------------------------------------------------------------------
/images/octo.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Argo-Robot/foundation_models/6ce10fb729fb53b17fe926a175274a661f6b0d14/images/octo.PNG
--------------------------------------------------------------------------------
/images/openvla.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Argo-Robot/foundation_models/6ce10fb729fb53b17fe926a175274a661f6b0d14/images/openvla.PNG
--------------------------------------------------------------------------------