12 May 2026 · 4 min read · RLDX-1 · RLWRLD · Dexterous Manipulation

RLDX-1: a foundation model for robot hands with memory and force sensing

RLWRLD, a Seoul-based startup, released RLDX-1 on May 11, 2026: a foundation model built from the ground up for high-degree-of-freedom dexterous hands. Unlike the dominant vision-language-action (VLA) models, RLDX-1 integrates force sensing, motion processing, and long-term memory as native data streams rather than late add-ons. Early benchmarks show success rates close to 90% on tasks requiring physical signals and context history, versus below 30% for GR00T N1.6 and π₀.₅.

Key takeaways

  • RLDX-1 is an 8.1B-parameter model available in three checkpoints: RLDX-1-PT, RLDX-1-MT-ALLEX, and RLDX-1-MT-DROID
  • The Multi-Stream Action Transformer (MSAT) architecture processes vision, force, and motion in separate streams, merging them only at the action decoding stage
  • On the OpenArm benchmark (versatile intelligence), RLDX-1 outperforms GR00T N1.6 and π₀.₅ in out-of-domain generalization
  • On the ALLEX humanoid, RLDX-1 reaches ~90% success on tasks requiring the Motion and Physics Modules, vs. below 30% for the baselines
  • Synthetic data pipeline scales data volume ~5x and improves success rate by 9.2% on the GR-1 Tabletop benchmark

What existing VLA models lacked

Most existing VLA models treat force sensing, touch, and context history as optional extensions. A standard transformer processes all modalities in a single stream, so whichever modality dominates the gradient absorbs most of the model's capacity while the rest become decorative. Robots running such models perform adequately in controlled conditions but fail on tasks that require sensitivity to changing object weight, tracking of moving targets, or multi-step planning.

RLWRLD categorized these gaps into five "dexterity regimes": grasp diversity, spatial precision, temporal precision, contact precision, and context awareness. Each maps to a specific failure type in industrial robots: for example, failing to compensate for conveyor movement (temporal precision) or missing the moment of contact (contact precision).
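For quick reference, the five regimes can be written down as a small taxonomy. The two inline examples below are the article's own; arranging the labels as a Python enum is this write-up's convenience, not an artifact of the RLDX-1 release.

```python
from enum import Enum


class DexterityRegime(Enum):
    """The five 'dexterity regimes' named by RLWRLD."""
    GRASP_DIVERSITY = "grasp diversity"
    SPATIAL_PRECISION = "spatial precision"
    TEMPORAL_PRECISION = "temporal precision"  # e.g. compensating for conveyor movement
    CONTACT_PRECISION = "contact precision"    # e.g. detecting the moment of contact
    CONTEXT_AWARENESS = "context awareness"
```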

MSAT architecture — four streams in one model

The technical answer is the Multi-Stream Action Transformer (MSAT). Each modality has a dedicated processing stream: video, force/torque signals, and motion stay parallel in the early layers and are merged via joint self-attention only just before action decoding.

  • The Motion Module achieves +37.5 percentage points over GR00T N1.6 and π₀.₅ on the conveyor pick-and-place task
  • The Physics Module integrates torque and tactile signals, predicts future contact states, and degrades gracefully to vision-only when sensors are unavailable
  • The Memory Module keeps a FIFO buffer of 64 "cognition tokens" that serves as long-term task memory while speeding inference by 35% (16.3 → 22.1 Hz)

The base VLM is Qwen3-VL 8B fine-tuned on robot-trajectory VQA. Combined with post-training via DAgger and Progress-Aware RL, the final policy executes tasks roughly 3x faster than imitation learning alone.
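A minimal PyTorch sketch of that stream-then-fuse layout may make the idea concrete. Everything here is an assumption: the stream widths, layer counts, the action dimension, and the FIFO update rule are illustrative guesses, not RLWRLD's released code (the heading's "four streams" plausibly counts the memory buffer alongside vision, force, and motion).

```python
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """One per-modality transformer stream (vision, force/torque, or motion)."""

    def __init__(self, in_dim: int, d_model: int = 512, layers: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, in_dim)
        return self.encoder(self.proj(x))                # -> (B, T, d_model)


class MSATSketch(nn.Module):
    """Parallel streams, late fusion, FIFO cognition-token memory."""

    def __init__(self, d_model: int = 512, action_dim: int = 22,
                 memory_tokens: int = 64):
        super().__init__()
        # Early layers: independent streams, so no single modality can
        # dominate the gradient and starve the others of capacity.
        self.vision = StreamEncoder(in_dim=768, d_model=d_model)
        self.force = StreamEncoder(in_dim=12, d_model=d_model)   # F/T + tactile
        self.motion = StreamEncoder(in_dim=64, d_model=d_model)  # joint states
        # FIFO buffer of "cognition tokens" acting as long-term task memory.
        self.memory_tokens = memory_tokens
        self.register_buffer("memory", torch.zeros(1, memory_tokens, d_model))
        # Late layers: joint self-attention over all streams plus memory,
        # applied only just before action decoding.
        fusion = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion, num_layers=2)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vis, frc, mot):
        streams = [self.vision(vis), self.force(frc), self.motion(mot)]
        b = vis.shape[0]
        tokens = torch.cat([self.memory.expand(b, -1, -1), *streams], dim=1)
        fused = self.fusion(tokens)  # the only point where modalities mix
        # FIFO update: drop the oldest cognition token, append a summary
        # of the newest fused state (first batch element, for simplicity).
        newest = fused[:1, -1:].detach()
        self.memory = torch.cat([self.memory[:, 1:], newest], dim=1)
        return self.action_head(fused[:, -1])  # next-step action
```

Masking or zeroing the force stream in this sketch would mimic the vision-only degradation claimed for the Physics Module; the real model's action chunking, token counts, and fusion depth are not public.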

Synthetic data and learning from human hands

Teleoperation for five-fingered hands is inherently limited, so RLWRLD scales data through two pipelines. The synthetic pipeline uses video generation models (including Cosmos-Predict2) to generate new trajectories, which are annotated by an inverse dynamics model and filtered for physical consistency; the result is roughly a 5x increase in data volume and +9.2% on GR-1 Tabletop. The Human Data pipeline records a bare human hand without teleoperation devices and retargets the movements onto a robot hand using 3D Gaussian Splatting, at a throughput of over 200 demonstrations per hour.
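A skeleton of the synthetic loop described above might look as follows. Every interface here (`video_model.generate`, `inverse_dynamics.annotate`, the consistency scorer, the threshold) is a hypothetical placeholder; RLWRLD has not published pipeline code, and Cosmos-Predict2 is only named as one of the generators used.

```python
from typing import Iterable, List


def physical_consistency(frames, actions) -> float:
    """Hypothetical scorer: 1.0 = fully dynamically plausible.

    A real filter would check things like joint limits, contact forces,
    and dynamics residuals; the post gives no details, so this is a stub.
    """
    return 1.0  # placeholder


def synthesize_trajectories(seed_episodes: Iterable[dict],
                            video_model,
                            inverse_dynamics,
                            threshold: float = 0.9) -> List[dict]:
    """Generate, annotate, and filter synthetic trajectories (sketch)."""
    accepted: List[dict] = []
    for episode in seed_episodes:
        # 1. Propose a new visual rollout from a real seed episode
        #    (the article names Cosmos-Predict2 among the video models).
        frames = video_model.generate(episode["frames"])
        # 2. Annotate the rollout with actions via an inverse dynamics model.
        actions = inverse_dynamics.annotate(frames)
        # 3. Keep only physically consistent trajectories.
        if physical_consistency(frames, actions) >= threshold:
            accepted.append({"frames": frames, "actions": actions})
    return accepted
```

The filter-after-generate structure means throughput scales with the video model while quality is gated by the consistency check, which is one plausible reading of how the pipeline grows volume ~5x without degrading success rates.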

Why does this matter?

RLDX-1 is not just another VLA model with better benchmark scores. It is the first publicly described architecture to simultaneously target the failure types catalogued in its five dexterity regimes, where previous models such as GR00T N1.6 or π₀ either ignored the relevant modalities or bolted them on as ad-hoc layers. What matters for the industrial robotics market is that RLDX-1 is evaluated on commercial platforms, not only in simulation. The 90% vs. below-30% gap on Physics Module tasks points to a real capability difference. If these results hold in broader deployments, RLDX-1 could set the reference foundation model architecture for dexterous AI.

What comes next?

  • RLWRLD has announced three future directions: long-horizon tasks (hour-long interactions), zero-shot generalization for the pre-trained policy, and an extension toward a world model (predicting future visual observations conditioned on language and actions)
  • Checkpoints RLDX-1-PT, RLDX-1-MT-ALLEX, and RLDX-1-MT-DROID are available on Hugging Face — no public beta date for external integrators has been announced
  • The DexBench benchmark published by RLWRLD at dexbench.org may become an industry standard for dexterous manipulation evaluation, if adopted by other platform vendors
