RLDX-1

RLWRLD foundation model for dexterous manipulation, built on the Multi-Stream Action Transformer (MSAT) architecture with dedicated streams for vision, tactile, torque, and memory.

✓ Active · ✓ Public access · ⚖ Open weights · Robotics foundation model · Vision-Language-Action model · Multimodal

Parameters

8.1B parameters (mid-trained)

Release date

7 May 2026

Producer: 🏢 RLWRLD

Access: Download · Deployment: 💻 Local, 📱 On-device

Overview

RLDX-1 is a dexterity-first foundation model for robotic hands developed by RLWRLD and introduced in May 2026. The model uses the proprietary Multi-Stream Action Transformer (MSAT) architecture, in which each modality (vision, language, proprioception, memory, tactile, torque) is processed in its own dedicated stream and joint self-attention fuses them before action decoding.
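
The fusion step described above can be illustrated with a toy example. This is a minimal sketch, not RLWRLD's implementation. It assumes each stream has already been encoded into token vectors of equal width, and it uses a single attention head with identity query, key and value projections. The function name `joint_self_attention` and the token values are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joint_self_attention(streams):
    """Fuse per-modality token lists with one self-attention pass.

    streams: dict mapping modality name -> list of token vectors (equal width).
    Assumption: identity Q/K/V projections and a single head, for illustration only.
    """
    # Concatenate all streams into one sequence, remembering which stream owns each token.
    tokens, owners = [], []
    for name, toks in streams.items():
        tokens.extend(toks)
        owners.extend([name] * len(toks))

    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in tokens:
        # Every token attends to every token from every stream.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        fused.append([sum(w[j] * tokens[j][i] for j in range(len(tokens)))
                      for i in range(d)])
    return fused, weights, owners

# Hypothetical token values; the six stream names follow the modalities listed above.
streams = {
    "vision": [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]],
    "language": [[0.0, 1.0, 0.0, 0.0]],
    "proprioception": [[0.0, 0.0, 1.0, 0.0]],
    "memory": [[0.0, 0.0, 0.0, 1.0]],
    "tactile": [[0.2, 0.0, 0.8, 0.0]],
    "torque": [[0.0, 0.3, 0.0, 0.7]],
}
fused, weights, owners = joint_self_attention(streams)
```

Each fused token is a weighted mix of tokens from all six streams. In this sketch the mixing is the only cross-modal step; an action decoder would then read the fused sequence.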

RLDX-1 uses a fine-tuned Qwen3-VL 8B as its vision-language backbone (RLDX-1-VLM). The model integrates a Motion Module (multi-frame compression through cognition video tokens), a Physics Module (tactile and torque stream with future-contact-state prediction) and a Cognition Interface with 64 cognition tokens that doubles as the substrate for the long-horizon Memory Module.

RLDX-1 ships in three checkpoints: RLDX-1-PT (pre-trained, embodiment-agnostic) and two 8.1B mid-trained variants — RLDX-1-MT-ALLEX (ALLEX humanoid) and RLDX-1-MT-DROID (Franka Research 3 with AnySkin tactile). The training pipeline includes pre-training, mid-training for the target embodiment, and post-training with DAgger plus Progress-Aware RL driven by a dedicated VLM-critic. Weights and code are released on Hugging Face and GitHub.

In simulation, RLDX-1 reaches 97.8 on LIBERO, 70.6 on RoboCasa Kitchen, 58.7 on RoboCasa GR-1 Tabletop and 32.1 on RoboCasa 365, outperforming π₀.₅, π₀-FAST and GR00T N1.5/N1.6. On the real ALLEX benchmark (Conveyor Pick-and-Place, Object-in-Box Selection, Pot-to-Cup Pouring) RLDX-1-MT-ALLEX scores 87.5%, 91.7% and 70.8% respectively, versus below 30% for the baselines.

Classification

Robotics foundation model · Vision-Language-Action model · Multimodal

Access & deployment

Download

Local · On-device

Weights: Open weights

Key parameters

🧩 Parameters: 8.1B (mid-trained)

✓ Fine-tuning

📥 Input: image, video, text, robot sensors…

Robotics

Dexterous manipulation · Bimanual manipulation · Robot manipulation

Technical specification

Parameters

8.1B parameters (mid-trained)

License

Open weights (Hugging Face — RLWRLD)

Hardware requirements

Inference optimized for NVIDIA RTX 5090 + Intel Core Ultra 7 265K class hardware (p50 latency ~43 ms for the all-modality variant via static graph + CUDA Graph + kernel fusion).
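
For readers who want to check a p50 figure like this on their own hardware, the sketch below times repeated calls and takes the median. It is a generic measurement harness. `dummy_policy_step` is a hypothetical stand-in for a model forward pass, not an RLDX-1 API. The last line is simple arithmetic on the quoted figure: roughly 43 ms per inference allows at most about 23 back-to-back inferences per second.

```python
import statistics
import time

def p50_latency_ms(fn, warmup=3, runs=20):
    """Return the median (p50) wall-clock latency of fn() in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the median.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def dummy_policy_step():
    # Hypothetical placeholder workload standing in for a model forward pass.
    sum(i * i for i in range(10_000))

latency = p50_latency_ms(dummy_policy_step)

# Upper bound on back-to-back inference rate implied by the quoted ~43 ms p50.
max_rate_hz = 1000.0 / 43.0
```

The median is used rather than the mean because occasional slow calls (scheduler jitter, cache misses) would otherwise inflate the figure.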

Features: ✓ Fine-tuning

Modalities

⬇ Input

image, video, text, robot_sensors, robot_state_data

⬆ Output

robot_actions, motion_trajectories, manipulator_control, robot_commands

Capabilities and applications

Native model capabilities

Image understanding

Analysing and interpreting the content of images.

Category: vision

Video understanding

Category: video

Multimodal understanding

Category: multimodal

Planning

Forming and executing action plans for complex tasks.

Category: planning

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Multi-step reasoning

Carrying out multi-step chains of reasoning across long, complex tasks.

Category: reasoning

Robotics

Dexterous manipulation · Bimanual manipulation · Robot manipulation

Benchmark results

10 benchmarks

LIBERO

average success rate · RLDX-1-PT, simulation

97.8%