├── experiments
│   ├── 4_so101
│   │   └── .gitkeep
│   ├── 5_widowx
│   │   ├── .gitkeep
│   │   ├── requirements.txt
│   │   ├── README.md
│   │   ├── widowx_env.py
│   │   └── eval_widowx.py
│   ├── 6_agibot
│   │   └── .gitkeep
│   ├── 7_franka
│   │   ├── .gitkeep
│   │   ├── requirements.txt
│   │   ├── README.md
│   │   ├── realsense_camera.py
│   │   └── eval_franka.py
│   ├── 8_vllmeval
│   │   ├── vlm
│   │   │   ├── __init__.py
│   │   │   └── prompt.py
│   │   ├── run.sh
│   │   ├── dataset-config.json
│   │   ├── dataset
│   │   │   ├── eobench.py
│   │   │   └── erqa.py
│   │   └── README.md
│   ├── 3_simpler
│   │   ├── data-bridge.yaml
│   │   ├── data-fractal.yaml
│   │   ├── simpler_env
│   │   │   ├── eval_simpler.sh
│   │   │   └── main_inference.py
│   │   ├── train_bridge.sh
│   │   ├── train_fractal.sh
│   │   └── README.md
│   ├── 9_pretraining
│   │   ├── zero1.json
│   │   ├── zero3.json
│   │   ├── launch_pretrain.sh
│   │   └── README.md
│   ├── 2_libero
│   │   ├── data-libero.yaml
│   │   ├── train.sh
│   │   ├── README.md
│   │   └── eval_libero.py
│   └── 1_demo
│       ├── train.sh
│       ├── data-demo.yaml
│       └── README.md
├── scripts
│   ├── inference_service.py
│   ├── chat_template.json
│   ├── eval_policy.py
│   └── train.py
├── tests
│   ├── test_dataset.py
│   └── test_vlm.py
├── .assets
│   ├── logo.png
│   ├── embodiments.png
│   ├── merged_grid.gif
│   ├── data_example.png
│   └── openloop_example.png
├── demo_data
│   ├── example1.jpg
│   ├── example2.png
│   └── refcoco
│       ├── images
│       │   ├── COCO_train2014_000000168643_2.jpg
│       │   ├── COCO_train2014_000000263111_0.jpg
│       │   ├── COCO_train2014_000000579299_4.jpg
│       │   ├── COCO_train2014_000000580905_2.jpg
│       │   ├── COCO_train2014_000000580957_2.jpg
│       │   └── COCO_train2014_000000567396_13.jpg
│       └── refcoco.jsonl
├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── config.yml
│   │   ├── feature-request.yml
│   │   └── bug-report.yml
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── workflows
│   │   ├── security.yml
│   │   ├── quality.yml
│   │   └── release.yml
│   └── test.yml
├── getting_started
│   ├── 3_eval_deploy.ipynb
│   └── 4_advanced_pretrain.ipynb
├── .gitmodules
├── eo
│   ├── constants.py
│   ├── data
│   │   ├── schema.py
│   │   ├── transforms.py
│   │   └── multim_dataset.py
│   ├── model
│   │   └── configuration_eo1.py
│   └── train
│       ├── pipeline_config.py
│       └── train_utils.py
├── CITATION.cff
├── docker
│   └── Dockerfile
├── .pre-commit-config.yaml
├── pyproject.toml
├── tools
│   ├── openloop.py
│   └── group_length.py
└── .gitignore
/experiments/4_so101/.gitkeep:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/scripts/inference_service.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/experiments/5_widowx/.gitkeep:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/experiments/6_agibot/.gitkeep:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/experiments/7_franka/.gitkeep:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/tests/test_dataset.py:
--------------------------------------------------------------------------------
1 | # test datasets here
2 |
--------------------------------------------------------------------------------
/experiments/5_widowx/requirements.txt:
--------------------------------------------------------------------------------
1 | gym
2 | funcsigs
3 | numpy==1.24.3
4 |
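Aside: tests/test_dataset.py above currently holds only a placeholder comment. A minimal sketch of what such a test could check is given below; it is illustrative and not a file in this repository, and it assumes pytest and PyYAML plus the mm_datasets / lerobot_datasets layout used by the experiments/*/data-*.yaml configs shown later in this dump.

# Illustrative sketch for tests/test_dataset.py (not part of the repository).
# Assumes pytest + PyYAML; validates the experiments/*/data-*.yaml layout.
import pathlib

import pytest
import yaml

CONFIGS = sorted(pathlib.Path("experiments").glob("*/data-*.yaml"))


@pytest.mark.parametrize("config_path", CONFIGS, ids=str)
def test_data_yaml_layout(config_path):
    cfg = yaml.safe_load(config_path.read_text())
    # every config declares at least one of the two dataset groups
    assert "mm_datasets" in cfg or "lerobot_datasets" in cfg
    for entry in cfg.get("lerobot_datasets") or []:
        # each LeRobot entry names a dataset and the camera streams to load
        assert "repo_id" in entry
        assert "select_video_keys" in entry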
-------------------------------------------------------------------------------- /.assets/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/.assets/logo.png -------------------------------------------------------------------------------- /.assets/embodiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/.assets/embodiments.png -------------------------------------------------------------------------------- /.assets/merged_grid.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/.assets/merged_grid.gif -------------------------------------------------------------------------------- /demo_data/example1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/example1.jpg -------------------------------------------------------------------------------- /demo_data/example2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/example2.png -------------------------------------------------------------------------------- /.assets/data_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/.assets/data_example.png -------------------------------------------------------------------------------- /.assets/openloop_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/.assets/openloop_example.png -------------------------------------------------------------------------------- /experiments/7_franka/requirements.txt: -------------------------------------------------------------------------------- 1 | torch 2 | transformers>=4.56.0 3 | opencv-python 4 | imageio 5 | tyro 6 | pillow 7 | pyrealsense2 8 | lerobot 9 | -------------------------------------------------------------------------------- /experiments/8_vllmeval/vlm/__init__.py: -------------------------------------------------------------------------------- 1 | from .model import EO1VisionFlowMatchingChat, EO1VisionFlowMatchingChatAguvis 2 | from .prompt import Qwen2VLPromptMixin 3 | -------------------------------------------------------------------------------- /demo_data/refcoco/images/COCO_train2014_000000168643_2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/refcoco/images/COCO_train2014_000000168643_2.jpg -------------------------------------------------------------------------------- /demo_data/refcoco/images/COCO_train2014_000000263111_0.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/refcoco/images/COCO_train2014_000000263111_0.jpg -------------------------------------------------------------------------------- /demo_data/refcoco/images/COCO_train2014_000000579299_4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/refcoco/images/COCO_train2014_000000579299_4.jpg 
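Aside: the demo_data/refcoco images listed above and below ship together with demo_data/refcoco/refcoco.jsonl, which experiments/1_demo/data-demo.yaml references via json_path and vision_base_path. The snippet below is an illustrative way to peek at those annotations without assuming a particular record schema; it is not part of the repository.

# Illustrative only: inspect the RefCOCO demo annotations.
import json
import pathlib

jsonl_path = pathlib.Path("demo_data/refcoco/refcoco.jsonl")
with jsonl_path.open() as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # field names are dataset-specific, so list them instead of assuming a schema
        print(i, sorted(record.keys()))
        if i >= 4:  # the first few records are enough for a sanity check
            break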
-------------------------------------------------------------------------------- /demo_data/refcoco/images/COCO_train2014_000000580905_2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/refcoco/images/COCO_train2014_000000580905_2.jpg -------------------------------------------------------------------------------- /demo_data/refcoco/images/COCO_train2014_000000580957_2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/refcoco/images/COCO_train2014_000000580957_2.jpg -------------------------------------------------------------------------------- /demo_data/refcoco/images/COCO_train2014_000000567396_13.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SHAILAB-IPEC/EO1/HEAD/demo_data/refcoco/images/COCO_train2014_000000567396_13.jpg -------------------------------------------------------------------------------- /experiments/8_vllmeval/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -x 3 | export GPU=$(nvidia-smi --list-gpus | wc -l) 4 | torchrun --nproc-per-node=$GPU run.py --config dataset-config.json 5 | -------------------------------------------------------------------------------- /experiments/3_simpler/data-bridge.yaml: -------------------------------------------------------------------------------- 1 | mm_datasets: 2 | 3 | lerobot_datasets: 4 | 5 | - repo_id: bridge_orig_lerobot 6 | root: HF_LEROBOT_HOME 7 | select_video_keys: [observation.images.image_0] 8 | -------------------------------------------------------------------------------- /experiments/3_simpler/data-fractal.yaml: -------------------------------------------------------------------------------- 1 | mm_datasets: 2 | 3 | lerobot_datasets: 4 | 5 | - repo_id: fractal20220817_data_lerobot 6 | root: HF_LEROBOT_HOME 7 | select_video_keys: [observation.images.image] 8 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | blank_issues_enabled: false 2 | contact_links: 3 | - name: "🙋 General Question" 4 | url: https://github.com/EO-Robotics/EO-1/discussions/new/choose 5 | about: Our preferred starting point if you have any questions about the project. 
6 | -------------------------------------------------------------------------------- /getting_started/3_eval_deploy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "2fa50ceb", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [] 10 | } 11 | ], 12 | "metadata": { 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "nbformat": 4, 18 | "nbformat_minor": 5 19 | } 20 | -------------------------------------------------------------------------------- /getting_started/4_advanced_pretrain.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "a9616890", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [] 10 | } 11 | ], 12 | "metadata": { 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "nbformat": 4, 18 | "nbformat_minor": 5 19 | } 20 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "experiments/7_franka/deoxys_control"] 2 | path = experiments/7_franka/deoxys_control 3 | url = https://github.com/UT-Austin-RPL/deoxys_control.git 4 | [submodule "experiments/5_widowx/bridge_data_robot"] 5 | path = experiments/5_widowx/bridge_data_robot 6 | url = https://github.com/HaomingSong/bridge_data_robot.git 7 | [submodule "experiments/5_widowx/edgeml"] 8 | path = experiments/5_widowx/edgeml 9 | url = https://github.com/youliangtan/edgeml.git 10 | -------------------------------------------------------------------------------- /experiments/9_pretraining/zero1.json: -------------------------------------------------------------------------------- 1 | { 2 | "fp16": { 3 | "enabled": false 4 | }, 5 | "bf16": { 6 | "enabled": true 7 | }, 8 | "gradient_accumulation_steps": "auto", 9 | "gradient_clipping": "auto", 10 | "steps_per_print": 2000, 11 | "train_batch_size": "auto", 12 | "train_micro_batch_size_per_gpu": "auto", 13 | "wall_clock_breakdown": true, 14 | "zero_optimization": { 15 | "stage": 1, 16 | "allgather_partitions": true, 17 | "allgather_bucket_size": 1e9, 18 | "overlap_comm": true, 19 | "reduce_scatter": true, 20 | "reduce_bucket_size": 1e9, 21 | "contiguous_gradients": true 22 | } 23 | } 24 | -------------------------------------------------------------------------------- /experiments/9_pretraining/zero3.json: -------------------------------------------------------------------------------- 1 | { 2 | "fp16": { 3 | "enabled": "auto", 4 | "loss_scale": 0, 5 | "loss_scale_window": 1000, 6 | "initial_scale_power": 16, 7 | "hysteresis": 2, 8 | "min_loss_scale": 1 9 | }, 10 | "bf16": { 11 | "enabled": "auto" 12 | }, 13 | "train_micro_batch_size_per_gpu": "auto", 14 | "train_batch_size": "auto", 15 | "gradient_accumulation_steps": "auto", 16 | "zero_optimization": { 17 | "stage": 3, 18 | "overlap_comm": true, 19 | "contiguous_gradients": true, 20 | "sub_group_size": 1e9, 21 | "reduce_bucket_size": "auto", 22 | "stage3_prefetch_bucket_size": "auto", 23 | "stage3_param_persistence_threshold": "auto", 24 | "stage3_max_live_parameters": 1e9, 25 | "stage3_max_reuse_distance": 1e9, 26 | "stage3_gather_16bit_weights_on_model_save": true 27 | } 28 | } 29 | -------------------------------------------------------------------------------- /experiments/3_simpler/simpler_env/eval_simpler.sh: 
-------------------------------------------------------------------------------- 1 | 2 | 3 | dist_tasks=( 4 | bridge.sh 5 | drawer_variant_agg.sh 6 | drawer_visual_matching.sh 7 | move_near_variant_agg.sh 8 | move_near_visual_matching.sh 9 | pick_coke_can_variant_agg.sh 10 | pick_coke_can_visual_matching.sh 11 | put_in_drawer_variant_agg.sh 12 | put_in_drawer_visual_matching.sh 13 | ) 14 | 15 | action_ensemble_temp=4 16 | 17 | ckpt_path=YOUR_CHECKPOINT_PATH 18 | model_name=eo 19 | job_name=simpler 20 | logging_dir=results_${model_name}/${job_name}_ck${action_ensemble_temp} 21 | mkdir -p $logging_dir 22 | 23 | conda activate simpler_env 24 | XDG_RUNTIME_DIR=/usr/lib 25 | LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH} 26 | 27 | for task in ${dist_tasks[@]}; do 28 | bash scripts/$task $ckpt_path $model_name \ 29 | $action_ensemble_temp $logging_dir 30 | done 31 | 32 | python tools/calc_metrics_evaluation_videos.py \ 33 | --log-dir-root $logging_dir 34 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | # What does this PR do? 2 | 3 | 12 | 13 | 14 | 15 | Fixes # (issue) 16 | 17 | ## Before submitting 18 | 19 | - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 20 | - [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. 21 | - [ ] Did you make sure to update the documentation with your changes? 22 | - [ ] Did you write any new necessary tests? 23 | -------------------------------------------------------------------------------- /experiments/2_libero/data-libero.yaml: -------------------------------------------------------------------------------- 1 | mm_datasets: 2 | 3 | lerobot_datasets: 4 | 5 | - repo_id: libero_spatial_no_noops_1.0.0_lerobot 6 | root: ./demo_data/ 7 | select_video_keys: [observation.images.image, observation.images.wrist_image] 8 | select_state_keys: [observation.state] 9 | select_action_keys: [action] 10 | 11 | # - repo_id: libero_90_no_noops_lerobot 12 | # root: HF_LEROBOT_HOME 13 | # select_video_keys: [observation.images.image, observation.images.wrist_image] 14 | # select_state_keys: [observation.state] 15 | # select_action_keys: [action] 16 | 17 | # - repo_id: libero_object_no_noops_1.0.0_lerobot 18 | # root: HF_LEROBOT_HOME 19 | # select_video_keys: [observation.images.image, observation.images.wrist_image] 20 | # select_state_keys: [observation.state] 21 | # select_action_keys: [action] 22 | 23 | # - repo_id: libero_10_no_noops_1.0.0_lerobot 24 | # root: HF_LEROBOT_HOME 25 | # select_video_keys: [observation.images.image, observation.images.wrist_image] 26 | # select_state_keys: [observation.state] 27 | # select_action_keys: [action] 28 | -------------------------------------------------------------------------------- /experiments/8_vllmeval/dataset-config.json: -------------------------------------------------------------------------------- 1 | { 2 | "model": { 3 | "EO1-3B": { 4 | "class": "EO1VisionFlowMatchingChat", 5 | "min_pixels": 50176, 6 | "max_pixels": 100352, 7 | "use_custom_prompt": false, 8 | "model_path": "IPEC-COMMUNITY/EO-1-3B" 9 | } 10 | }, 11 | "data": { 12 | "EOBench": { 13 | "class": "EOBench", 14 | "dataset": "EOBench", 15 | "data_file": "IPEC-COMMUNITY/EO-Bench/benchmark_v1.jsonl", 16 | "data_root": "IPEC-COMMUNITY/EO-Bench" 17 | }, 18 | "ERQABench": { 19 | "class": "ERQABench", 20 | 
"dataset": "ERQABench", 21 | "data_root": "IPEC-COMMUNITY/ERQABench", 22 | "data_file": "IPEC-COMMUNITY/ERQABench/benchmark_v1.jsonl" 23 | }, 24 | "RoboVQA": { 25 | "class": "RoboVQA", 26 | "dataset": "RoboVQA", 27 | "data_root": "IPEC-COMMUNITY/RoboVQA", 28 | "data_file": "IPEC-COMMUNITY/RoboVQA/benchmark_v1.jsonl", 29 | "fps": 1 30 | } 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature-request.yml: -------------------------------------------------------------------------------- 1 | name: "🚀 Feature request" 2 | description: Submit a proposal/request for a new any4lerobot feature 3 | labels: ["Feature request"] 4 | body: 5 | - type: textarea 6 | id: feature-request 7 | validations: 8 | required: true 9 | attributes: 10 | label: Feature request 11 | description: | 12 | A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist. 13 | 14 | - type: textarea 15 | id: motivation 16 | validations: 17 | required: true 18 | attributes: 19 | label: Motivation 20 | description: | 21 | Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too. 22 | 23 | - type: textarea 24 | id: contribution 25 | validations: 26 | required: true 27 | attributes: 28 | label: Your contribution 29 | description: | 30 | Is there any way that you could help, e.g. by submitting a PR? Make sure to read the CONTRIBUTING.MD [readme](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md) 31 | -------------------------------------------------------------------------------- /scripts/chat_template.json: -------------------------------------------------------------------------------- 1 | { 2 | "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% set state_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif content['type'] == 'state' or 'state' in content %}{% set state_count.value = state_count.value + 1 %}<|state_start|><|state_pad|><|state_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}{{ noise_prompt }}" 3 | } 4 | -------------------------------------------------------------------------------- /experiments/3_simpler/simpler_env/main_inference.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | from simpler_env.evaluation.argparse import get_args 6 | from simpler_env.evaluation.maniskill2_evaluator import 
maniskill2_evaluator
7 |
8 | if __name__ == "__main__":
9 |     args = get_args()
10 |     os.environ["DISPLAY"] = ""
11 |     os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
12 |     gpus = tf.config.list_physical_devices("GPU")
13 |     if len(gpus) > 0:
14 |         tf.config.set_logical_device_configuration(
15 |             gpus[0],
16 |             [tf.config.LogicalDeviceConfiguration(memory_limit=args.tf_memory_limit)],
17 |         )
18 |     print(f"**** {args.policy_model} ****")
19 |     if args.policy_model in ["eo", "eo-1"]:
20 |         assert args.ckpt_path is not None
21 |         from simpler_env.policies.eo.eo_model import EOInference
22 |
23 |         model = EOInference(
24 |             saved_model_path=args.ckpt_path,
25 |             policy_setup=args.policy_setup,
26 |             action_scale=args.action_scale,
27 |             action_ensemble_temp=args.action_ensemble_temp,
28 |         )
29 |     else:
30 |         raise NotImplementedError()
31 |
32 |     # run real-to-sim evaluation
33 |     success_arr = maniskill2_evaluator(model, args)
34 |     print(args)
35 |     print(" " * 10, "Average success", np.mean(success_arr))
36 |
--------------------------------------------------------------------------------
/experiments/1_demo/train.sh:
--------------------------------------------------------------------------------
1 | GPUS=1
2 | PER_DEVICE_BATCH_SIZE=8
3 |
4 | ACCELERATE_ARGS="--num_machines 1 --machine_rank 0 --num_processes=${GPUS}"
5 |
6 | # * datasets
7 | dataset=experiments/1_demo/data-demo.yaml
8 | dataset_name=$(basename ${dataset%.*})
9 |
10 | # hparams
11 | lr=1e-4
12 | mlr=1e-4
13 | vlr=2e-5
14 |
15 | chunk_size=30
16 | epoch=50
17 |
18 | model_name_or_path=
19 | run_name=${dataset_name}_ck${chunk_size}_gpu${GPUS}_lr${lr}_vlr${vlr}_mlr${mlr}_bs${PER_DEVICE_BATCH_SIZE}
20 |
21 |
22 | conda activate eo
23 |
24 | accelerate launch $ACCELERATE_ARGS scripts/train.py \
25 |     --vlm-name-or-path ../pretrained/Qwen2.5-VL-3B-Instruct \
26 |     --data-path ${dataset} \
27 |     --chunk-size ${chunk_size} \
28 |     --dataloader-num-workers 8 \
29 |     --freeze-vision-tower False \
30 |     --freeze-llm False \
31 |     --freeze-merger False \
32 |     --bf16 True \
33 |     --tf32 True \
34 |     --fp16 False \
35 |     --num-train-epochs ${epoch} \
36 |     --per-device-train-batch-size ${PER_DEVICE_BATCH_SIZE} \
37 |     --learning-rate ${lr} \
38 |     --merger-lr ${mlr} \
39 |     --vision-lr ${vlr} \
40 |     --weight-decay 0.1 \
41 |     --warmup-ratio 0.03 \
42 |     --lr-scheduler-type cosine \
43 |     --gradient-checkpointing True \
44 |     --save-strategy steps \
45 |     --logging-steps 100 \
46 |     --save-steps 5000 \
47 |     --save-total-limit 3 \
48 |     --report-to none \
49 |     --run-name ${run_name} \
50 |     --attn-implementation flash_attention_2
51 |
--------------------------------------------------------------------------------
/scripts/eval_policy.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | import torch
4 | from PIL import Image
5 | from transformers import AutoModel, AutoProcessor
6 |
7 | argparser = argparse.ArgumentParser()
8 | argparser.add_argument(
9 |     "--model_path",
10 |     type=str,
11 |     default="outputs/your_path",
12 |     help="Path to the pretrained model",
13 | )
14 | argparser.add_argument(
15 |     "--repo_id",
16 |     type=str,
17 |     default="libero_spatial_no_noops_1.0.0_lerobot",
18 |     help="Repo id of the LeRobot dataset",
19 | )
20 | args = argparser.parse_args()
21 |
22 |
23 | def eval_policy():
24 |     # set the observation (image, state, etc.)
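# (annotation for this listing, not part of scripts/eval_policy.py)
# The dummy batch built below uses the same keys as the LIBERO data config
# (experiments/2_libero/data-libero.yaml): observation.images.image,
# observation.images.wrist_image and observation.state correspond to that
# dataset's select_video_keys / select_state_keys. When adapting this script to
# another robot, the batch keys should match the dataset the checkpoint was
# trained on; the commented-out repo_id entry would pass that dataset identity
# through to the processor.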
25 | import numpy as np 26 | 27 | image0 = (np.random.rand(224, 224, 3) * 255).astype(np.uint8) 28 | image1 = Image.fromarray(image0.copy()) 29 | 30 | model = ( 31 | AutoModel.from_pretrained(args.model_path, dtype=torch.bfloat16, trust_remote_code=True).eval().cuda() 32 | ) 33 | 34 | processor = AutoProcessor.from_pretrained(args.model_path, trust_remote_code=True) 35 | 36 | batch = { 37 | "observation.images.image": [image0], 38 | "observation.images.wrist_image": [image1], 39 | "observation.state": [torch.rand(8)], 40 | "task": ["put the object in the box."], 41 | # "repo_id": [args.repo_id], 42 | } 43 | ov_output = processor.select_action( 44 | model, 45 | batch, 46 | ) 47 | print(ov_output) 48 | 49 | 50 | if __name__ == "__main__": 51 | eval_policy() 52 | -------------------------------------------------------------------------------- /experiments/1_demo/data-demo.yaml: -------------------------------------------------------------------------------- 1 | mm_datasets: 2 | - json_path: demo_data/refcoco/refcoco.jsonl # jsonl file 3 | vision_base_path: demo_data/refcoco # base path for vision data files referenced in the JSONL 4 | sampling_strategy: random:100% # sampling strategy 5 | 6 | - json_path: demo_data/interleaved_demo.jsonl # interleaved data jsonl 7 | 8 | # @robot control config 9 | lerobot_datasets: 10 | - repo_id: demo25 11 | root: ./demo_data 12 | # Optional fields: 13 | # episodes: [1, 2, 3] # specific episodes to load (None = all) 14 | train_subtask: mix:0.9 # mix sub-task instructions and overall instructions with 90% sub-task 15 | delta_action: false # train with delta actions 16 | state_mode: "MEAN_STD" # state normalization mode 17 | # which camera streams to load 18 | select_video_keys: [observation.images.head, observation.images.hand_left, observation.images.hand_right] 19 | # proprioceptive states 20 | select_state_keys: [observation.states.joint.position, observation.states.effector.position] 21 | # action targets 22 | select_action_keys: [actions.joint.position, actions.effector.position] 23 | effector_indices: [14, 15] # indices of effector channels in the flattened action vector 24 | weight: 1.0 # dataset weight for sampling 25 | -------------------------------------------------------------------------------- /experiments/2_libero/train.sh: -------------------------------------------------------------------------------- 1 | GPUS=8 2 | PER_DEVICE_BATCH_SIZE=64 3 | 4 | ACCELERATE_ARGS="--num_machines 1 --machine_rank 0 --num_processes=${GPUS} --multi_gpu" 5 | 6 | # datasets 7 | dataset=experiments/2_libero/data-libero.yaml 8 | dataset_name=$(basename ${dataset%.*}) 9 | 10 | # hparams 11 | lr=1e-4 12 | mlr=1e-4 13 | vlr=2e-5 14 | 15 | chunk_size=8 16 | epoch=50 17 | 18 | model_name_or_path= 19 | run_name=${dataset_name}_ck${chunk_size}_gpu${GPUS}_lr${lr}_vlr${vlr}_mlr${mlr}_bs${PER_DEVICE_BATCH_SIZE} 20 | 21 | 22 | conda activate eo 23 | 24 | accelerate launch $ACCELERATE_ARGS scripts/train.py \ 25 | ${model_name_or_path:+--model-name-or-path $model_name_or_path} \ 26 | --vlm-name-or-path ../pretrained/Qwen2.5-VL-3B-Instruct \ 27 | --data-path ${dataset} \ 28 | --chunk-size ${chunk_size} \ 29 | --dataloader-num-workers 8 \ 30 | --freeze-vision-tower False \ 31 | --freeze-llm False \ 32 | --freeze-merger False \ 33 | --bf16 True \ 34 | --tf32 True \ 35 | --fp16 False \ 36 | --num-train-epochs ${epoch} \ 37 | --per-device-train-batch-size ${PER_DEVICE_BATCH_SIZE} \ 38 | --learning-rate ${lr} \ 39 | --merger-lr ${mlr} \ 40 | --vision-lr ${vlr} \ 41 | --weight-decay 
0.1 \ 42 | --warmup-ratio 0.03 \ 43 | --lr-scheduler-type cosine \ 44 | --gradient-checkpointing True \ 45 | --save-strategy steps \ 46 | --logging-steps 100 \ 47 | --save-steps 5000 \ 48 | --save-total-limit 3 \ 49 | --report-to none \ 50 | --run-name ${run_name} \ 51 | --attn-implementation flash_attention_2 52 | -------------------------------------------------------------------------------- /experiments/3_simpler/train_bridge.sh: -------------------------------------------------------------------------------- 1 | GPUS=8 2 | PER_DEVICE_BATCH_SIZE=128 3 | 4 | ACCELERATE_ARGS="--main_process_ip=$MASTER_ADDR --main_process_port=$MASTER_PORT \ 5 | --num_machines 1 --machine_rank 0 --num_processes=${GPUS} --multi_gpu" 6 | 7 | dataset=experiments/3_simpler/data-bridge.yaml 8 | dataset_name=$(basename ${dataset%.*}) 9 | 10 | lr=1e-4 11 | mlr=1e-4 12 | vlr=2e-5 13 | 14 | chunk_size=4 15 | epoch=20 16 | 17 | model_name_or_path= 18 | run_name=${dataset_name}_ck${chunk_size}_gpu${GPUS}_lr${lr}_vlr${vlr}_mlr${mlr}_bs${PER_DEVICE_BATCH_SIZE} 19 | 20 | 21 | conda activate eo 22 | 23 | accelerate launch $ACCELERATE_ARGS scripts/train.py \ 24 | ${model_name_or_path:+--model-name-or-path $model_name_or_path} \ 25 | --vlm-name-or-path ../pretrained/Qwen2.5-VL-3B-Instruct \ 26 | --data-path ${dataset} \ 27 | --chunk-size ${chunk_size} \ 28 | --dataloader-num-workers 8 \ 29 | --freeze-vision-tower False \ 30 | --freeze-llm False \ 31 | --freeze-merger False \ 32 | --bf16 True \ 33 | --tf32 True \ 34 | --fp16 False \ 35 | --num-train-epochs ${epoch} \ 36 | --per-device-train-batch-size ${PER_DEVICE_BATCH_SIZE} \ 37 | --learning-rate ${lr} \ 38 | --merger-lr ${mlr} \ 39 | --vision-lr ${vlr} \ 40 | --weight-decay 0.1 \ 41 | --warmup-ratio 0.03 \ 42 | --lr-scheduler-type cosine \ 43 | --gradient-checkpointing True \ 44 | --save-strategy steps \ 45 | --logging-steps 100 \ 46 | --save-steps 5000 \ 47 | --save-total-limit 3 \ 48 | --report-to none \ 49 | --run-name ${run_name} \ 50 | --attn-implementation flash_attention_2 51 | -------------------------------------------------------------------------------- /experiments/3_simpler/train_fractal.sh: -------------------------------------------------------------------------------- 1 | GPUS=8 2 | PER_DEVICE_BATCH_SIZE=256 3 | 4 | ACCELERATE_ARGS="--main_process_ip=$MASTER_ADDR --main_process_port=$MASTER_PORT \ 5 | --num_machines 1 --machine_rank 0 --num_processes=${GPUS} --multi_gpu" 6 | 7 | dataset=experiments/3_simpler/data-fractal.yaml 8 | dataset_name=$(basename ${dataset%.*}) 9 | 10 | lr=1e-4 11 | mlr=1e-4 12 | vlr=2e-5 13 | 14 | chunk_size=4 15 | epoch=10 16 | 17 | model_name_or_path= 18 | run_name=${dataset_name}_ck${chunk_size}_gpu${GPUS}_lr${lr}_vlr${vlr}_mlr${mlr}_bs${PER_DEVICE_BATCH_SIZE} 19 | 20 | 21 | conda activate eo 22 | 23 | accelerate launch $ACCELERATE_ARGS scripts/train.py \ 24 | ${model_name_or_path:+--model-name-or-path $model_name_or_path} \ 25 | --vlm-name-or-path ../pretrained/Qwen2.5-VL-3B-Instruct \ 26 | --data-path ${dataset} \ 27 | --chunk-size ${chunk_size} \ 28 | --dataloader-num-workers 8 \ 29 | --freeze-vision-tower False \ 30 | --freeze-llm False \ 31 | --freeze-merger False \ 32 | --bf16 True \ 33 | --tf32 True \ 34 | --fp16 False \ 35 | --num-train-epochs ${epoch} \ 36 | --per-device-train-batch-size ${PER_DEVICE_BATCH_SIZE} \ 37 | --learning-rate ${lr} \ 38 | --merger-lr ${mlr} \ 39 | --vision-lr ${vlr} \ 40 | --weight-decay 0.1 \ 41 | --warmup-ratio 0.03 \ 42 | --lr-scheduler-type cosine \ 43 | --gradient-checkpointing True \ 44 
| --save-strategy steps \ 45 | --logging-steps 100 \ 46 | --save-steps 5000 \ 47 | --save-total-limit 3 \ 48 | --report-to none \ 49 | --run-name ${run_name} \ 50 | --attn-implementation flash_attention_2 51 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug-report.yml: -------------------------------------------------------------------------------- 1 | name: "🐛 Bug Report" 2 | description: Submit a bug report to help us improve EO-1 3 | labels: ["bug"] 4 | body: 5 | - type: markdown 6 | attributes: 7 | value: | 8 | Thanks for taking the time to fill out this bug report! 🤗 9 | 10 | - type: textarea 11 | id: system-info 12 | attributes: 13 | label: System Info 14 | description: Please share your system info with us. 15 | placeholder: platform, python version, EO-1 commit, ... 16 | validations: 17 | required: true 18 | 19 | - type: textarea 20 | id: reproduction 21 | validations: 22 | required: true 23 | attributes: 24 | label: Reproduction 25 | description: | 26 | Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet. 27 | Please include relevant config information with your code, for example your Trainers, TRL, Peft, and DeepSpeed configs. 28 | If you have code snippets, error messages, stack traces please provide them here as well. 29 | Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting 30 | Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code. 31 | 32 | placeholder: | 33 | Steps to reproduce the behavior: 34 | 35 | 1. 36 | 2. 37 | 3. 38 | 39 | - type: textarea 40 | id: expected-behavior 41 | validations: 42 | required: true 43 | attributes: 44 | label: Expected behavior 45 | description: "A clear and concise description of what you would expect to happen." 46 | -------------------------------------------------------------------------------- /eo/constants.py: -------------------------------------------------------------------------------- 1 | # Copyright 2025 EO-Robotics Team. All rights reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """This module defines constants used throughout the application, 16 | including system messages and various special tokens for language 17 | and vision models. 18 | These tokens are used to demarcate different types of input such 19 | as images, videos, actions, and states, with specific sets for 20 | different model architectures like LLaVA and datasets like LeRobot. 21 | """ 22 | 23 | SYSTEM_MESSAGE = "You are a helpful physical assistant." 
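# (annotation for this listing, not part of eo/constants.py) The special tokens
# defined below are the ones emitted by the chat template in
# scripts/chat_template.json: images are wrapped as
# <|vision_start|><|image_pad|><|vision_end|>, videos as
# <|vision_start|><|video_pad|><|vision_end|>, and proprioceptive states as
# <|state_start|><|state_pad|><|state_end|>.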
24 | 25 | # qwen2.5-vl special tokens 26 | DEFAULT_IM_START_TOKEN = "<|im_start|>" 27 | DEFAULT_IM_END_TOKEN = "<|im_end|>" 28 | DEFAULT_IMAGE_TOKEN = "<|image_pad|>" 29 | DEFAULT_VIDEO_TOKEN = "<|video_pad|>" 30 | VISION_START_TOKEN = "<|vision_start|>" 31 | VISION_END_TOKEN = "<|vision_end|>" 32 | 33 | # EO-1 special tokens 34 | ACTION_START_TOKEN = "<|action_start|>" 35 | DEFAULT_ACTION_TOKEN = "<|action_pad|>" 36 | PASS_ACTION_TOKEN = "<|action_pass|>" 37 | ACTION_END_TOKEN = "<|action_end|>" 38 | STATE_START_TOKEN = "<|state_start|>" 39 | DEFAULT_STATE_TOKEN = "<|state_pad|>" 40 | STATE_END_TOKEN = "<|state_end|>" 41 | TASK_VLA_TOKEN = "<|vla|>" 42 | 43 | # llava style special tokens 44 | IGNORE_INDEX = -100 45 | LLAVA_IMAGE_TOKEN = "" 46 | LLAVA_VIDEO_TOKEN = "