HIW-500: 10 TB of real-home humanoid data now open source

BitRobot Network, Hugging Face, and Unitree Robotics published HIW-500 (Humanoids-in-the-Wild 500) on June 25, 2026. It is the largest open-source humanoid teleoperation dataset collected in real residential environments. With more than 500 hours of footage, 23,000 episodes, and 10 TB of raw data gathered across 12 homes in Southeast Asia, it directly targets the generalization bottleneck that blocks home robots from mass deployment.

Key takeaways

HIW-500: 500+ hours, 23,000 episodes, 10+ TB of data — collected in 12 real homes in Southeast Asia using Unitree G1
Hugging Face LeRobot compressed 10 TB to 2 TB with zero data loss; the result is available on Hugging Face as BitRobot/HIW-500-LeRobot
Hardware: Unitree G1 (29 DoF), stereo head camera at 480p/30fps, IR/RGB wrist cameras on both arms
Tasks: 10+ categories of household activities (sweeping, tidying, object transport), episodes up to 8 minutes long
Unitree G1 is natively supported in LeRobot — researchers can immediately train VLA policies on their own G1 hardware
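For readers who want to look at the dataset before committing to a 2 TB download, the following is a minimal sketch of fetching only the metadata files with the huggingface_hub library. The repository name comes from the takeaways above; the "meta/*" pattern is an assumption based on the usual LeRobot dataset layout and may not match how this repository is actually organized, and the local directory name is a placeholder.

```python
REPO_ID = "BitRobot/HIW-500-LeRobot"  # repository name quoted in the article
META_PATTERNS = ["meta/*"]            # assumption: LeRobot-style metadata folder


def fetch_metadata(local_dir: str = "hiw500_meta") -> str:
    """Download only files matching META_PATTERNS and return the local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",           # the repo is a dataset, not a model
        allow_patterns=META_PATTERNS,  # skip video and trajectory files
        local_dir=local_dir,
    )
```

Calling fetch_metadata() requires network access and the huggingface_hub package. Widening META_PATTERNS is one way to pull a handful of episodes rather than the whole corpus.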

Why real-home data is different

Most existing humanoid datasets come from labs — controlled lighting, empty tables, familiar objects in known positions. Real homes look different: clothes on the sofa, a bucket on the floor, children running around. That gap is exactly the generalization problem. Robots trained exclusively in labs cannot be safely deployed in a random apartment. HIW-500 attacks this at the source, collecting data precisely where robots are meant to work. Unitree Robotics supplied a fleet of G1 robots, and data collection took place across 12 distinct homes — each with a different layout, furniture, and level of "chaos."

The dataset covers 10+ categories of household tasks: sweeping, collecting trash, moving objects, organizing cabinets. Individual episodes run from a few seconds up to 8 minutes, and the long horizons are intentional. Models trained on short pick-and-place episodes cannot plan multi-step sequences; long episodes force the model to learn planning. Each task is broken down into annotated subtasks, a multi-level structure that lets researchers train and evaluate models at different layers of complexity.
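To make the multi-level structure concrete, here is a small sketch of how an episode with subtask annotations could be represented and sliced into frame ranges. The class names, field names, and example labels are all hypothetical and are not the dataset's actual schema; only the 30 fps frame rate is taken from the article.

```python
from dataclasses import dataclass, field

FPS = 30  # camera frame rate quoted in the article


@dataclass
class Subtask:
    label: str      # hypothetical language annotation
    start_s: float  # start time within the episode, in seconds
    end_s: float    # end time within the episode, in seconds

    def frame_range(self, fps: int = FPS) -> range:
        """Frame indices covered by this subtask (end exclusive)."""
        return range(round(self.start_s * fps), round(self.end_s * fps))


@dataclass
class Episode:
    task: str
    duration_s: float
    subtasks: list[Subtask] = field(default_factory=list)


# Hypothetical episode: one task-level instruction, three subtask-level ones.
episode = Episode(
    task="tidy the living room",
    duration_s=95.0,
    subtasks=[
        Subtask("pick up the shirt from the sofa", 0.0, 30.0),
        Subtask("carry the shirt to the laundry basket", 30.0, 70.0),
        Subtask("drop the shirt into the basket", 70.0, 95.0),
    ],
)

for st in episode.subtasks:
    frames = st.frame_range()
    print(f"{st.label}: frames {frames.start}-{frames.stop - 1}")
```

A policy can then be trained or scored on whole episodes using the task label, or on individual segments using the subtask labels and their frame ranges.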

Hardware and teleoperation setup

The hardware platform is the Unitree G1, a robot that has become a research favorite thanks to its sub-$30,000 enterprise pricing, 29 degrees of freedom, and open SDK. For HIW-500, each G1 was equipped with a stereo head camera (RGB, 480p, 30 fps) for spatial perception, and wrist cameras on both arms (RGB + IR, 480p, 30 fps). The IR cameras mitigate occlusion during manipulation: when the hand obscures an object, the infrared channel still delivers position and shape data.

Full robot state (29 DoF) along with IMU and odometry data was logged in real time. Teleoperating a 29-DoF humanoid in a tight apartment takes months of operator training. Collecting 23,000 clean episodes required close collaboration with the Unitree team for hardware support throughout the months-long campaign.

LeRobot: compressing 10 TB to 2 TB without data loss

At 10 TB, the raw dataset is prohibitively large for small labs: downloading it over a 1 Gbps connection would take about 22 hours even at full line rate, and storage requires dedicated infrastructure. Hugging Face solved this by re-encoding the entire dataset into the LeRobot format. Result: 2 TB with 100% data fidelity. Trajectories, camera feeds, and annotations are identical; only the encoding changed. A 5:1 compression ratio with zero loss is a meaningful technical result.
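The figures above can be checked with a short back-of-the-envelope calculation. This sketch assumes decimal units (1 TB = 10^12 bytes) and a link that stays fully saturated, so real downloads will be slower; the function name is illustrative.

```python
def transfer_hours(size_tb: float, link_gbps: float) -> float:
    """Hours needed to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8               # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)      # bits / (bits per second)
    return seconds / 3600


raw_tb, lerobot_tb = 10.0, 2.0  # sizes quoted in the article

print(f"compression ratio: {raw_tb / lerobot_tb:.0f}:1")             # 5:1
print(f"raw at 1 Gbps: {transfer_hours(raw_tb, 1):.1f} h")           # 22.2 h
print(f"LeRobot at 1 Gbps: {transfer_hours(lerobot_tb, 1):.1f} h")   # 4.4 h
```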

The practical effect: a smaller lab with a 1 Gbps connection can download the 2 TB version in a few hours. A browser-based visualizer lets anyone inspect any episode. It shows a synchronized 3D model of the robot alongside camera feeds, language instructions, and subtask annotations, with no local installation required. Data is available in two formats: native ROSbag for advanced users and LeRobot format for direct use with the framework.

Target: the 80/80 benchmark for home robots

Unitree CEO Wang Xingxing defined the "ChatGPT moment" for embodied AI as achieving 80% task completion across 80% of unfamiliar real-world environments. Existing lab datasets — even large ones — are not suited for training models toward this benchmark, because they lack sufficient diversity of domestic chaos. HIW-500 is the first step toward building a training base that could realistically bring VLA models closer to that target.
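One way to read the 80/80 target as a testable criterion is sketched below: an environment passes if at least 80% of its trials succeed, and the benchmark is met if at least 80% of environments pass. This interpretation, the function name, and the example results are assumptions for illustration, not an official definition or real evaluation data.

```python
from typing import Mapping, Sequence


def meets_80_80(
    results: Mapping[str, Sequence[bool]],
    task_threshold: float = 0.8,
    env_threshold: float = 0.8,
) -> bool:
    """results maps an environment id to per-trial success flags."""
    if not results:
        return False
    passing = 0
    for trials in results.values():
        # An environment passes when its success rate reaches the task threshold.
        if trials and sum(trials) / len(trials) >= task_threshold:
            passing += 1
    return passing / len(results) >= env_threshold


# Hypothetical results from five unseen homes, five trials each.
good = [True, True, True, True, False]    # 80% success
bad = [True, True, False, False, False]   # 40% success

print(meets_80_80({"h1": good, "h2": good, "h3": good, "h4": good, "h5": bad}))  # True
print(meets_80_80({"h1": good, "h2": good, "h3": good, "h4": bad, "h5": bad}))   # False
```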

A comparison with existing datasets shows what is new. Open X-Embodiment (the most widely used VLA training corpus) contains data from many robots, but almost entirely from labs. DROID, a more recent large-scale dataset, collected data in more varied settings, but still under controlled conditions. HIW-500 is the first dataset of this scale collected exclusively in unmodified private homes.

Why this matters

Data scarcity is one of the two main brakes on home robotics — alongside hardware dexterity limits. Models like GR00T, π0, or OpenVLA are starved for data from diverse, unstructured environments. Labs cannot generate that diversity at the scale needed to train models with sufficient generalization. HIW-500 is the first systematic attempt to collect data exactly where robots are meant to operate. The partnership with Hugging Face and LeRobot is critical here — the platform provides distribution infrastructure, and the LeRobot format guarantees compatibility with the growing open-source robotics ecosystem. If more labs contribute similar datasets from Europe, North America, and the rest of Asia, HIW-500 could become the starting point for a global open home-data project — something like Common Crawl for embodied AI.

What's next?

Native G1 support in LeRobot lets researchers start training immediately — the first behavioral cloning results should appear within weeks
BitRobot announced plans to expand coverage to additional geographic regions — European and North American home data would increase environmental variability and the dataset's global value
Key question: whether VLA models trained on HIW-500 can generalize zero-shot to an unseen home — the test that will verify the dataset's real-world value