
**RoboAgent** is a project from the CMU Robotics Institute (Vikash Kumar) and Meta AI (FAIR), announced in August 2023 (paper: 'RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking', Bharadhwaj et al., arXiv:2309.01918). The goal is to demonstrate that a **small (~75M-parameter) policy** trained on just **7,500 trajectories** can perform 12 skills across 38 tasks and generalize to new objects and scenes.
The key innovation is **semantic augmentations**: during data preparation, synthetic image variants are generated through inpainting (Stable Diffusion) while the robot's recorded actions are kept unchanged. One demonstration yields 10-50 variants (different backgrounds, lighting, distractors, object colors), which significantly improves generalization without additional real demonstrations.
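The idea of changing the pixels while reusing the actions can be illustrated with a minimal sketch. This is not the project's pipeline: `Step`, `repaint_background`, and `augment_trajectory` are hypothetical names, images are tiny nested lists, and the "inpainting" is a flat recolor standing in for a diffusion model. It only shows that each variant alters the scene outside a preserved mask and copies the recorded actions verbatim.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]


@dataclass(frozen=True)
class Step:
    image: Image               # camera frame at this timestep
    keep_mask: Mask            # True where pixels (robot, target object) must be preserved
    action: Tuple[float, ...]  # robot command recorded in the real demonstration


def repaint_background(image: Image, keep_mask: Mask, color: Pixel) -> Image:
    """Hypothetical stand-in for an inpainting model: recolor every unmasked pixel."""
    return [
        [px if keep else color for px, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, keep_mask)
    ]


def augment_trajectory(traj: List[Step], n_variants: int, seed: int = 0) -> List[List[Step]]:
    """Return n_variants visually altered copies of traj with identical actions."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        # One scene change per variant, applied to every frame for temporal consistency.
        color = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        variants.append([
            Step(repaint_background(s.image, s.keep_mask, color), s.keep_mask, s.action)
            for s in traj
        ])
    return variants


# A two-step toy demonstration with 2x2 frames; only the top-right pixel is "the object".
demo = [
    Step([[(10, 10, 10), (200, 0, 0)], [(10, 10, 10), (10, 10, 10)]],
         [[False, True], [False, False]], (0.1, 0.0, -0.2)),
    Step([[(10, 10, 10), (200, 0, 0)], [(10, 10, 10), (10, 10, 10)]],
         [[False, True], [False, False]], (0.0, 0.1, -0.1)),
]
variants = augment_trajectory(demo, n_variants=10)
```

Taking the figures above at face value, and assuming every demonstration were augmented, 7,500 trajectories with 10-50 variants each would give roughly 75,000 to 375,000 training trajectories.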
Architecture: **MT-ACT** (Multi-Task Action Chunking Transformer), an extension of ACT (the Action Chunking Transformer introduced with ALOHA) with multi-task conditioning via a language embedding (from a CLIP text encoder). The network predicts 'chunks' of 10-20 actions instead of individual actions, which stabilizes execution and allows 30 Hz control even when model inference is slow, because the policy does not have to be queried at every control step.
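A minimal sketch of chunked execution, under stated assumptions: `run_chunked`, `toy_policy`, and `toy_apply` are hypothetical names, the "policy" is a constant toy function rather than a transformer, and each chunk is executed open-loop before the policy is queried again. It illustrates why predicting a chunk of actions reduces how often the model has to run.

```python
from typing import Callable, List

Action = List[float]


def run_chunked(
    policy: Callable[[List[float], int], List[Action]],
    get_obs: Callable[[], List[float]],
    apply_action: Callable[[Action], None],
    horizon: int,
    chunk_size: int,
) -> int:
    """Execute `horizon` control steps, querying the policy once per chunk.

    Returns the number of policy calls made.
    """
    calls = 0
    t = 0
    while t < horizon:
        chunk = policy(get_obs(), chunk_size)  # one forward pass yields chunk_size actions
        calls += 1
        for action in chunk[: horizon - t]:    # do not overshoot the horizon
            apply_action(action)
            t += 1
    return calls


# Toy 1-D system: the state is a single position and each action is a displacement.
state = [0.0]


def toy_policy(obs: List[float], k: int) -> List[Action]:
    return [[0.1] for _ in range(k)]


def toy_apply(action: Action) -> None:
    state[0] += action[0]


calls = run_chunked(toy_policy, lambda: list(state), toy_apply, horizon=100, chunk_size=20)
```

With a chunk of 20 actions, 100 control steps need only 5 policy calls. At a 30 Hz control rate that corresponds to one forward pass roughly every 20/30 ≈ 0.67 s instead of every 33 ms. The sketch omits refinements such as overlapping or blended chunks.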
Demonstrated skills: pick, place, push, slide, rotate, hinge open/close, drawer open/close, pour, wipe, pick-and-place, multi-step assembly. Hardware: Franka Emika Panda + RealSense D455 (one front view) + RealSense D435 (wrist view). Training: 4× A100 over ~3 days. The full stack is open-source on github.com/robopen/roboagent (Apache 2.0).
Impact: RoboAgent introduced **semantic augmentations** to mainstream robot learning, and the technique has been adopted by OpenVLA, π0, and Genesis. RoboAgent co-author Abhinav Gupta co-founded the startup Skild AI, which announced a $300M Series A round in 2024, continuing the 'compact foundation models' direction.
A Runtime is the environment or execution layer used to run code, load libraries, manage dependencies, and operate applications or services — either in real time or during normal system operation. In robotics this includes real-time operating system (RTOS) runtimes, ROS 2 executor runtimes, containerised execution environments (Docker, podman), and embedded C++ runtimes on microcontrollers.
An API Library is a software package that exposes programmatic interfaces for communicating with a device, service, or system. In robotics it typically forms a lightweight integration layer built on top of the manufacturer's official API or an open-source project, abstracting low-level protocol details and providing language-native bindings (Python, C++, Java, etc.).
A family of open Vision-Language-Action (VLA) and foundation models for robotics: OpenVLA (Stanford/Berkeley), LeRobot (Hugging Face), RoboAgent (CMU), RT-2 (Google DeepMind, publication only). Models in this family are trained on datasets such as Open X-Embodiment, BridgeData V2, and RoboNet.
Academic research: ~50 citing publications (Q1 2026), including variants CACTI (CMU 2024, diffusion head), Augment Anything (Berkeley 2024, extraction of augmentations for downstream datasets). Commercial: the 'semantic augmentations' technique from RoboAgent has become a standard in the pipelines of Skild AI (founding RoboAgent team), Physical Intelligence (π0 data prep), and Embodied Vision (Hugging Face). The Stanford CS 224R Lab uses RoboAgent as a curriculum benchmark.
github.com/robopen/roboagent ~750★, ~80 forks. The arXiv:2309.01918 paper has ~180 citations (Q1 2026). The RoboSet dataset on HuggingFace Hub has ~5k downloads. Activity: latest commits Q1 2025 (maintenance rather than active development — the authors work at Skild AI).
Open-source stack: github.com/robopen/roboagent (Apache 2.0). Requires MuJoCo for the simulator pipeline + a Stable Diffusion checkpoint for the semantic augmentations pipeline. RoboSet datasets (7,500 trajectories) available under CC-BY-4.0.
License family: Permissive
PyTorch 2.5+ compatibility update, MuJoCo 3.0 integration fixes. Last commit series before the pause.
CACTI variant — Conditional Action Chunking Transformer with a diffusion head instead of a deterministic one.
RoboSet expanded to 15,000 demonstrations from 6 geographic locations. New skills: lift, tilt, fold.
First public release — arXiv:2309.01918 paper, code, RoboSet (7,500 demonstrations).