GO-1 (Genie Operator-1)

AgiBot's generalist embodied foundation model, launched March 10, 2025. Its ViLLA (Vision-Language-Latent-Action) architecture combines a VLM, a Latent Planner and an Action Expert in a single policy that drives heterogeneous robot platforms.

✓ Active, 🏢 Enterprise, ★ Featured, Robotics foundation model, Vision-Language-Action model

Release date

10 March 2025

Producer: 🏢 AGIBOT

Deployment: 📱 On-device, ☁ Cloud

Overview

GO-1 (Genie Operator-1) is a generalist embodied foundation model developed by the Chinese company AgiBot and officially unveiled in Shanghai on March 10, 2025. It targets a shift from rigid, task-specific automation toward flexible, general-purpose robots that perceive a scene, interpret an instruction and act on it. GO-1 sits at the core of AgiBot's humanoid control stack, including the industrial G2 humanoid (unveiled October 16, 2025).

ViLLA (Vision-Language-Latent-Action) architecture

GO-1's architecture, ViLLA, pairs a Vision-Language Model (VLM) with a Mixture of Experts (MoE) made up of a Latent Planner and an Action Expert. At inference time the three components run in sequence: the VLM first analyzes the scene and its objects, the Latent Planner then predicts k latent action tokens, and the Action Expert, conditioned on those tokens, runs a denoising process that produces the final control signals.
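The three-stage flow described above can be sketched as a simple data pipeline. This is only an illustration of the ordering of the stages, not AgiBot's implementation: GO-1's weights and API are closed, so `VillaPolicy`, the three `toy_*` functions, the codebook size and the 7-value action vectors are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class VillaPolicy:
    """Hypothetical container that wires the three ViLLA stages together."""
    vlm: Callable[[Sequence[float], str], Vector]
    latent_planner: Callable[[Vector, int], List[int]]
    action_expert: Callable[[Vector, List[int]], List[Vector]]

    def act(self, image: Sequence[float], instruction: str, k: int = 4) -> List[Vector]:
        context = self.vlm(image, instruction)        # 1. scene + language understanding
        tokens = self.latent_planner(context, k)      # 2. k latent action tokens
        return self.action_expert(context, tokens)    # 3. concrete control signals


def toy_vlm(image: Sequence[float], instruction: str) -> Vector:
    # Stand-in for the VLM: two crude summary features instead of a real embedding.
    return [sum(image) / len(image), float(len(instruction.split()))]


def toy_planner(context: Vector, k: int, codebook_size: int = 16) -> List[int]:
    # Stand-in for the Latent Planner: deterministic token ids derived from the context.
    seed = int(abs(context[0]) * 100) + int(context[1])
    return [(seed + 7 * i) % codebook_size for i in range(k)]


def toy_action_expert(context: Vector, tokens: List[int], dof: int = 7) -> List[Vector]:
    # Stand-in for the Action Expert: one dof-sized action vector per latent token
    # (an assumption made for this toy; the real mapping is not documented here).
    return [[(t + 1) / 100.0] * dof for t in tokens]


policy = VillaPolicy(toy_vlm, toy_planner, toy_action_expert)
actions = policy.act(image=[0.1, 0.2, 0.3], instruction="pick up the cup", k=4)
print(len(actions), len(actions[0]))
```

Keeping each stage behind a plain callable makes the hand-off explicit: the planner only sees the VLM's context, and the action expert only sees the context plus the latent tokens.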

VLM (Vision-Language Model)

Builds on internet-scale heterogeneous data as the foundation for scene and object understanding. It lets the robot interpret language instructions and ground them in the visual perception of its environment.

Latent Planner (MoE)

The first of the two MoE experts. It learns from cross-embodiment data and from human demonstrations, building a general understanding of actions that is independent of any specific robot body. Its output is k latent action tokens that form an abstract action plan.

Action Expert (MoE)

The second of the two MoE experts. It is trained on more than 1 million real-robot demonstrations (AgiBot World Colosseo) and performs high-frequency dexterous manipulation by converting the planner's latent tokens into concrete control signals through a denoising process.
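To make the idea of a conditioned denoising process more concrete, here is a deliberately simplified sketch. It assumes a toy "denoiser" that already knows the clean target, whereas a real action expert would use a learned network to predict the noise. The functions `condition_vector` and `denoise_action`, the codebook size and the step schedule are all hypothetical and are not taken from GO-1.

```python
import random
from typing import List


def condition_vector(latent_tokens: List[int], dof: int, codebook_size: int = 16) -> List[float]:
    """Hypothetical conditioning: map latent token ids to a target in [-1, 1]."""
    mean_token = sum(latent_tokens) / len(latent_tokens)
    base = 2.0 * mean_token / (codebook_size - 1) - 1.0
    return [base] * dof


def denoise_action(latent_tokens: List[int], dof: int = 7, steps: int = 10, seed: int = 0) -> List[float]:
    """Start from Gaussian noise and iteratively remove it, guided by the latent tokens."""
    rng = random.Random(seed)
    target = condition_vector(latent_tokens, dof)
    action = [rng.gauss(0.0, 1.0) for _ in range(dof)]  # pure noise at the start
    for step in range(steps):
        # A learned network would predict this noise; the toy version computes it exactly.
        predicted_noise = [a - t for a, t in zip(action, target)]
        # Each step removes 1/steps of the initial noise; the last step removes what is left.
        fraction = 1.0 / (steps - step)
        action = [a - fraction * n for a, n in zip(action, predicted_noise)]
    return action


control_signal = denoise_action([8, 15, 6, 13])
print([round(v, 3) for v in control_signal])
```

The point of the sketch is the shape of the loop: the latent tokens fix what the noise is removed toward, and the action only becomes a usable control signal after the final step.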

Distinctive capabilities

GO-1 has four distinguishing properties: (1) Learning from Human Videos: it learns from footage of people performing tasks, with no teleoperation required; (2) Few-shot Generalization: it adapts to new tasks from a small number of examples; (3) Cross-Embodiment Adaptation: it transfers policies across robot platforms with different kinematics; (4) Continuous Self-Evolution: it keeps improving as new operational data is collected.

Applications and deployments

GO-1 is deployed in manufacturing (precision assembly, automotive part transfer), logistics (parcel sorting), services (guided tours) and home automation. In the AGIBOT G2 humanoid it runs locally on the NVIDIA Jetson Thor T5000 (2,070 TFLOPS FP4) with control latency under 10 ms, paired with the GE-1 world model.

Classification

Robotics foundation model, Vision-Language-Action model

Applications

Robotic manipulation, Robot policy training

Access & deployment

On-device, Cloud

Weights: Closed

Key parameters

✓ Fine-tuning

📥 Input: image, video, text, robot sensors…

Robotics

Robot manipulation, Bimanual manipulation, Dexterous manipulation, Robot control, Scene understanding, Embodied task planning

Technical specification

License

Proprietary (closed)

Hardware requirements

Deployed locally on NVIDIA Jetson Thor T5000 (2,070 TFLOPS FP4, control latency <10 ms) in the AGIBOT G2 humanoid. Training requires data-center class GPU clusters.
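As a rough piece of arithmetic, a 10 ms latency budget corresponds to at most 100 control cycles per second if each cycle waits for one full pass. The source does not say whether the "<10 ms" figure covers a complete policy inference or only a lower-level control cycle, so the sketch below treats it as a generic per-step budget. `LATENCY_BUDGET_S`, `max_control_rate_hz` and `measure_latency` are illustrative names, not part of any AgiBot or NVIDIA tooling.

```python
import time
from typing import Callable

LATENCY_BUDGET_S = 0.010  # the "<10 ms" figure quoted above, in seconds


def max_control_rate_hz(latency_s: float) -> float:
    """Upper bound on the loop rate when every cycle waits for one full step."""
    return 1.0 / latency_s


def measure_latency(step_fn: Callable[[], object], n_iters: int = 50) -> float:
    """Return the worst-case wall-clock duration of step_fn over n_iters calls."""
    worst = 0.0
    for _ in range(n_iters):
        start = time.perf_counter()
        step_fn()
        worst = max(worst, time.perf_counter() - start)
    return worst


print(max_control_rate_hz(LATENCY_BUDGET_S))  # loop-rate ceiling implied by the budget
worst_case = measure_latency(lambda: sum(range(1000)))
print("within budget:", worst_case < LATENCY_BUDGET_S)
```

Measuring the worst case rather than the mean is the more conservative choice for a control loop, since a single slow step is what delays the robot's response.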

Features: ✓ Fine-tuning

Modalities

⬇ Input

image, video, text, robot_sensors, robot_state_data

⬆ Output

robot_actions, robot_commands, motion_trajectories, manipulator_control
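One way to picture the listed modalities is as typed input and output records. The sketch below is purely illustrative: the class and field names (`Observation`, `ActionChunk`, `joint_trajectory`, and so on) are assumptions chosen to mirror the modality labels above, and GO-1's actual interface is not public.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Observation:
    """Hypothetical input record covering the listed input modalities."""
    instruction: str                                                  # text
    image_frames: List[bytes]                                         # image / video frames
    robot_state: Dict[str, float] = field(default_factory=dict)      # robot_state_data
    sensor_readings: Dict[str, float] = field(default_factory=dict)  # robot_sensors


@dataclass
class ActionChunk:
    """Hypothetical output record covering the listed output modalities."""
    joint_trajectory: List[List[float]]  # motion_trajectories: one joint vector per waypoint
    gripper_commands: List[float]        # manipulator_control: one value per waypoint


def is_consistent(chunk: ActionChunk) -> bool:
    """Check that every waypoint has the same width and a matching gripper command."""
    if not chunk.joint_trajectory:
        return False
    width = len(chunk.joint_trajectory[0])
    same_width = all(len(w) == width for w in chunk.joint_trajectory)
    return same_width and len(chunk.gripper_commands) == len(chunk.joint_trajectory)


obs = Observation(instruction="sort the parcels", image_frames=[b"frame0"])
chunk = ActionChunk(joint_trajectory=[[0.0] * 7, [0.1] * 7], gripper_commands=[0.0, 1.0])
print(is_consistent(chunk))
```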

Capabilities and applications

Native model capabilities

Cross-embodiment transfer

The ability of a single model to control robots with different morphologies (humanoids, dual-arm rigs, mobile platforms) without training a separate model per platform. Intelligence is decoupled from embodiment, so the same policy runs on hardware with different kinematics and dynamics.
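A common way to decouple a policy from the robot body is to have it emit normalized actions and let a thin per-robot adapter map them into that robot's joint space. The sketch below assumes that pattern for illustration only. It is not a description of how GO-1 does it (the source attributes that to latent action tokens), and `Embodiment`, its fields and the example joint limits are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical per-robot adapter: joint limits differ, the policy output does not."""
    name: str
    joint_limits: List[Tuple[float, float]]  # (lower, upper) per joint, in radians or meters

    def to_joint_targets(self, normalized: Sequence[float]) -> List[float]:
        """Map shared actions in [-1, 1] onto this robot's joint ranges."""
        if len(normalized) < len(self.joint_limits):
            raise ValueError(f"{self.name} needs {len(self.joint_limits)} action values")
        targets = []
        # zip stops at the shorter sequence, so extra policy outputs are ignored.
        for value, (lower, upper) in zip(normalized, self.joint_limits):
            clipped = max(-1.0, min(1.0, value))
            targets.append(lower + (clipped + 1.0) * 0.5 * (upper - lower))
        return targets


shared_action = [0.0, 0.0, 0.5]  # one policy output, reused for both robots
arm = Embodiment("two_joint_arm", [(-1.0, 1.0), (0.0, 2.0)])
gripper = Embodiment("parallel_gripper", [(0.0, 0.08)])
print(arm.to_joint_targets(shared_action))
print(gripper.to_joint_targets([2.0]))
```

Only the adapter knows about joint counts and limits, so supporting a new platform in this toy setup means adding one `Embodiment` rather than retraining the policy.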

Category: robotics

Vision-language-action grounding

The ability of a VLA model to turn visual perception and a language instruction into a concrete physical robot action. The model understands the scene and the intent, then generates an executable action sequence, closing the loop from observation to motion.

Category: robotics

Planning

Forming and executing action plans for complex tasks.

Category: planning

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Multimodal understanding

Category: multimodal

Application domains