MoDE-VLA

Sharpa's vision-language-action (VLA) robotic model, designed for contact-rich, bimanual manipulation tasks using vision, language, force, and touch.

Context window

Not publicly disclosed

Parameters

Not publicly disclosed; the backbone comprises SigLIP So400m/14, PaliGemma (Gemma-3B), and a Gemma-300M action expert

Release date

9 March 2026

🏢 Producer: Sharpa

Overview

Key parameters

📏 Context: not publicly disclosed

🧩 Parameters: not publicly disclosed; the backbone comprises SigLIP So400m/14, PaliGemma (Gemma-3B), and a Gemma-300M action expert

📥 Input: text, robot_vision, robot_sensors, robot_state_data

Technical specification

Context window

Not publicly disclosed

Parameters

Not publicly disclosed; the backbone comprises SigLIP So400m/14, PaliGemma (Gemma-3B), and a Gemma-300M action expert

License

CC BY 4.0 for the paper; the license for the model weights is not publicly disclosed

Hardware requirements

Requires an advanced robotic platform with RGB cameras, proprioception, torque/force sensing, and tactile sensors on the hands; demonstrated on the Sharpa North platform with two Sharpa Wave hands.
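The requirements above can be read as a checklist of sensing capabilities. The following is a minimal sketch of such a check, assuming Python. The capability labels and the `missing_capabilities` helper are hypothetical names chosen for illustration; they do not come from the MoDE-VLA paper or from any Sharpa software.

```python
from typing import Iterable, List

# Sensing capabilities named in the hardware requirements.
# The label strings are hypothetical and exist only for this sketch.
REQUIRED_CAPABILITIES = (
    "rgb_cameras",
    "proprioception",
    "force_torque_sensing",
    "tactile_hand_sensors",
)


def missing_capabilities(available: Iterable[str]) -> List[str]:
    """Return the required capabilities that a platform does not report."""
    have = set(available)
    return [cap for cap in REQUIRED_CAPABILITIES if cap not in have]
```

An empty result means a platform description covers every listed sensing capability; it says nothing about whether the model would actually run on that platform.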

Modalities

⬇ Input

text, robot_vision, robot_sensors, robot_state_data

⬆ Output

robot_actions, robot_commands, manipulator_control, motion_trajectories
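The input tags can also be read as a typed interface. Below is a minimal sketch, assuming Python. The `Observation` class, its field types, and the `missing_inputs` helper are hypothetical; only the modality tag names come from this page. Tensor shapes and sensor layouts are not disclosed here, so values are kept as plain lists.

```python
from dataclasses import dataclass
from typing import Dict, List

# Modality tags as listed on this page.
INPUT_MODALITIES = ("text", "robot_vision", "robot_sensors", "robot_state_data")
OUTPUT_MODALITIES = (
    "robot_actions",
    "robot_commands",
    "manipulator_control",
    "motion_trajectories",
)


@dataclass
class Observation:
    """Hypothetical container for one model input; fields mirror the input tags."""
    text: str                              # language instruction
    robot_vision: List[List[float]]        # assumed: one flattened frame per camera
    robot_sensors: Dict[str, List[float]]  # assumed: e.g. {"force": [...], "tactile": [...]}
    robot_state_data: List[float]          # assumed: e.g. joint positions


def missing_inputs(obs: Observation) -> List[str]:
    """Return the input tags whose value is empty in the observation."""
    return [name for name in INPUT_MODALITIES if not getattr(obs, name)]
```

A caller could use `missing_inputs` to reject an incomplete observation before passing it on, since the page lists all four input modalities together.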

Capabilities and applications

Native model capabilities

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Planning

Forming and executing action plans for complex tasks.

Category: planning

Image understanding

Analysing and interpreting the content of images.

Category: vision

Multimodal understanding

Category: multimodal

Sources and related pages

4 sources

Paper: Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA (arxiv.org)

Web: MoDE-VLA Project Page (sites.google.com)

Web: Sharpa North (sharpa.com)

Web: Sharpa Wave (sharpa.com)