GO-1 (Genie Operator-1)

1 · Family: GO-1 (Genie Operator-1)
AgiBot's generalist embodied foundation model (launched March 10, 2025), built on the ViLLA (Vision-Language-Latent-Action) architecture, which combines a VLM, a Latent Planner and an Action Expert in a single policy driving heterogeneous robot platforms.
Active · Enterprise · Featured · Robotics foundation model · Vision-Language-Action model · GO-1 (Genie Operator-1)
Release date
10 March 2025
Deployment: On-device, Cloud

Overview

GO-1 (Genie Operator-1) is a generalist embodied foundation model developed by the Chinese company AgiBot and officially unveiled on March 10, 2025 in Shanghai. It targets how robots see, understand and act in the real world, moving from rigid, task-specific automation toward flexible, generalist robotics. GO-1 is the core of the control stack in AgiBot's humanoids, including the industrial G2 (unveiled October 16, 2025).

ViLLA (Vision-Language-Latent-Action) architecture

GO-1's architecture combines a Vision-Language Model (VLM) with a Mixture of Experts (MoE), forming the ViLLA framework. At inference time the components run in sequence: the VLM first analyzes the scene and its objects, the Latent Planner then predicts k latent action tokens, and the Action Expert runs a denoising process conditioned on those tokens to produce the final control signals.
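
The three-stage flow described above can be sketched in a few lines of Python. This is only an illustrative sketch: GO-1's weights and interfaces are closed, so every name here (`Observation`, `villa_step`, the `toy_*` stand-ins, the assumed codebook of 16 tokens) is hypothetical, and the stubs return placeholder numbers rather than real model outputs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """Hypothetical input bundle: pre-extracted image features plus an instruction."""
    image_features: List[float]
    instruction: str


def villa_step(
    obs: Observation,
    encode: Callable[[Observation], List[float]],
    plan: Callable[[List[float], int], List[int]],
    act: Callable[[List[float], List[int]], List[List[float]]],
    k: int = 4,
) -> List[List[float]]:
    """Run one VLM -> Latent Planner -> Action Expert pass (illustrative only)."""
    context = encode(obs)               # stage 1: scene and instruction understanding
    latent_tokens = plan(context, k)    # stage 2: k latent action tokens
    if len(latent_tokens) != k:
        raise ValueError(f"planner returned {len(latent_tokens)} tokens, expected {k}")
    return act(context, latent_tokens)  # stage 3: tokens -> control signals


# Toy stand-ins so the sketch runs; none of these reflect the real model.
def toy_encode(obs: Observation) -> List[float]:
    # Append the instruction's word count to the image features.
    return obs.image_features + [float(len(obs.instruction.split()))]


def toy_plan(context: List[float], k: int) -> List[int]:
    base = int(sum(context))
    return [(base + i) % 16 for i in range(k)]  # assumed codebook of 16 tokens


def toy_act(context: List[float], tokens: List[int]) -> List[List[float]]:
    # Placeholder for the denoising step: one 2-value command per latent token.
    return [[t / 16.0, context[-1]] for t in tokens]


obs = Observation(image_features=[1.0, 2.0], instruction="pick up the cup")
actions = villa_step(obs, toy_encode, toy_plan, toy_act, k=4)
```

Passing the stages in as callables keeps the data flow visible: the planner never sees raw pixels, and the action stage receives both the scene context and the latent plan.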

VLM (Vision-Language Model)

The VLM draws on internet-scale heterogeneous data as its foundation for scene and object understanding. It lets the robot interpret language instructions and ground them in its visual perception of the environment.

Latent Planner (MoE)

The first MoE expert. It learns from cross-embodiment data and from human demonstrations, building a general understanding of actions that is independent of any specific robot body. Its output is k latent action tokens that constitute an abstract action plan.

Action Expert (MoE)

The second MoE expert, trained on more than 1 million real-robot demonstrations (AgiBot World Colosseo). It performs high-frequency dexterous manipulation by converting the planner's latent tokens into concrete control signals via a denoising process.
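
To make the idea of a denoising process concrete, here is a toy refinement loop: it starts from random noise and moves step by step toward an action that depends on the latent tokens. The source does not say which sampler or network GO-1 uses, so this is purely an assumption-laden illustration. `predict_clean` is a hand-written stand-in for a learned denoiser, and the token-to-action mapping, step schedule and codebook size of 16 are all hypothetical.

```python
import random
from typing import List


def predict_clean(tokens: List[int], dim: int) -> List[float]:
    """Hypothetical stand-in for a learned denoiser conditioned on latent tokens."""
    return [tokens[j % len(tokens)] / 16.0 for j in range(dim)]


def denoise_actions(tokens: List[int], dim: int = 3, steps: int = 10, seed: int = 0) -> List[float]:
    """Refine a noisy action vector toward the token-conditioned target."""
    if not tokens or steps < 1:
        raise ValueError("need at least one token and one step")
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for step in range(steps):
        target = predict_clean(tokens, dim)
        alpha = 1.0 / (steps - step)  # fraction of the remaining gap closed this step
        x = [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]
    return x


result = denoise_actions([4, 8, 12], dim=3)
```

Because the last step closes the whole remaining gap, `result` ends up at the conditioned target (0.25, 0.5, 0.75) up to floating-point rounding. A real action expert would replace `predict_clean` with a trained network and would typically emit a chunk of future actions rather than a single vector.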

Distinctive capabilities

GO-1 offers four distinctive properties: (1) Learning from Human Videos: acquiring skills from human video footage without teleoperation; (2) Few-shot Generalization: adapting to new tasks from minimal examples; (3) Cross-Embodiment Adaptation: transferring policies across robot platforms with different kinematics; (4) Continuous Self-Evolution: ongoing model improvement based on new operational data.
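
Cross-embodiment adaptation, the third property above, can be pictured as one shared policy output that a thin per-robot adapter maps onto each platform's joints. The sketch below assumes the policy emits normalized commands in [-1, 1] and that each robot consumes as many of them as it has joints. This convention, the `Embodiment` class and both example robots are hypothetical and are not taken from AgiBot's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Embodiment:
    """Hypothetical robot description: a name and per-joint (min, max) limits in radians."""
    name: str
    joint_limits: List[Tuple[float, float]]


def to_joint_targets(policy_output: List[float], robot: Embodiment) -> List[float]:
    """Map shared normalized commands in [-1, 1] onto one robot's joint ranges."""
    n = len(robot.joint_limits)
    if len(policy_output) < n:
        raise ValueError(f"{robot.name} needs {n} commands, got {len(policy_output)}")
    targets = []
    for value, (low, high) in zip(policy_output[:n], robot.joint_limits):
        clipped = max(-1.0, min(1.0, value))  # guard against out-of-range commands
        targets.append(low + (clipped + 1.0) * 0.5 * (high - low))
    return targets


# One shared policy output, two robots with different joint counts and ranges.
shared_output = [0.0, 1.0, -1.0, 0.5]
arm_a = Embodiment("two_joint_arm", [(-2.0, 2.0), (0.0, 1.0)])
arm_b = Embodiment("three_joint_arm", [(-1.0, 1.0), (-1.0, 1.0), (0.0, 4.0)])

arm_a_targets = to_joint_targets(shared_output, arm_a)
arm_b_targets = to_joint_targets(shared_output, arm_b)
```

Here `arm_a_targets` is `[0.0, 1.0]` and `arm_b_targets` is `[0.0, 1.0, 0.0]`: the same policy output yields valid targets on both robots, and only the adapter knows about joint counts and limits.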

Applications and deployments

GO-1 is deployed in manufacturing (precision assembly, automotive part transfer), logistics (parcel sorting), services (guided tours), and home automation. In the AGIBOT G2 humanoid it runs locally on the NVIDIA Jetson Thor T5000 (2,070 TFLOPS FP4) with control latency under 10 ms, paired with the GE-1 world model.
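
The latency figure implies a simple bound on control rate. If each control cycle waits on one full inference pass, a latency of 10 ms corresponds to 100 Hz, so a latency under 10 ms allows a rate above 100 Hz. The short sketch below does that arithmetic and checks a per-stage budget. The stage names and their millisecond values are invented for illustration, since the source gives only the end-to-end figure.

```python
from typing import Dict


def max_control_rate_hz(latency_ms: float) -> float:
    """Upper bound on loop rate when each cycle waits for one full inference pass."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def fits_budget(stage_latencies_ms: Dict[str, float], budget_ms: float = 10.0) -> bool:
    """True if the summed stage latencies stay strictly under the budget."""
    return sum(stage_latencies_ms.values()) < budget_ms


# Hypothetical per-stage split; only the <10 ms total comes from the source.
assumed_stages = {"vlm": 4.0, "latent_planner": 2.0, "action_expert": 3.0}

rate_at_limit = max_control_rate_hz(10.0)
within_budget = fits_budget(assumed_stages)
```

With these assumed numbers the stages sum to 9 ms, which fits the budget, and `rate_at_limit` evaluates to 100 Hz.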

Classification
Robotics foundation model, Vision-Language-Action model
Access & deployment
On-device, Cloud
Weights: Closed
Key parameters
Fine-tuning: supported
Input: image, video, text, robot sensors…
Robotics
Robot manipulation, Bimanual manipulation, Dexterous manipulation, Robot control, Scene understanding, Embodied task planning

Technical specification

License
Proprietary (closed)
Hardware requirements
Deployed locally on NVIDIA Jetson Thor T5000 (2,070 TFLOPS FP4, control latency <10 ms) in the AGIBOT G2 humanoid. Training requires data-center class GPU clusters.
Features: Fine-tuning
Modalities
Input
image, video, text, robot_sensors, robot_state_data
Output
robot_actions, robot_commands, motion_trajectories, manipulator_control

Capabilities and applications

Native model capabilities
Cross-embodiment transfer
The ability of a single model to control robots with different morphologies (humanoids, dual-arm rigs, mobile platforms) without training a separate model per platform. Intelligence is decoupled from embodiment, so the same policy runs on hardware with different kinematics and dynamics.
Category: robotics
Vision-language-action grounding
The ability of a VLA model to ground visual perception and a language instruction into a concrete physical robot action. The model understands the scene and intent, then generates an executable action sequence, closing the loop from observation to motion.
Category: robotics
Planning
Forming and executing action plans for complex tasks.
Category: planning
Reasoning
The model's ability to reason logically and solve complex problems.
Category: reasoning
Multimodal understanding
Category: multimodal

Deployment and security
