Robots Atlas

Dataset Cards

Dataset Cards introduce a structured, standardized format for documenting ML datasets (covering composition, provenance, limitations, and ethical risks), filling the gap left by the lack of any consistent dataset documentation standard.

Use cases

Documenting publicly or privately published datasets
Assessing dataset suitability for model training and evaluation
Support for Responsible AI and data governance
Comparing datasets by quality, provenance, and risks
Improving transparency in model hubs and data repositories
Support for compliance, audits, and ethical reviews
Better communication between data creators and dataset users

Dataset Cards function as a standardized documentation artifact attached to a dataset. They include sections covering the motivation for creating the dataset, its composition, sources, collection and annotation process, data structure, train/evaluation splits, license, risks related to bias or errors, recommended uses, limitations, and maintenance and update considerations. On platforms such as Hugging Face Hub, a dataset card is rendered as the primary documentation for the dataset within its repository. This makes the card a bridge between the raw data and its responsible use in research, prototyping, and production deployments.

Dataset Cards address the lack of transparent, comparable, and actionable dataset documentation. Without such a framework, dataset users often have no way of knowing how the data was collected, which groups or phenomena it represents, what quality limitations it has, what ethical or legal risks it carries, or which use cases it suits well or poorly. The concept helps reduce dataset misuse, supports reproducibility, enables better assessment of dataset suitability, and reinforces Responsible AI and data governance practices.

01

YAML Metadata Block

Stores structured dataset metadata: license, language, task category, size, and tags; rendered by the Hub as filterable data.

A YAML front-matter block at the top of README.md storing machine-readable metadata such as license, language, task_categories, size_categories, and tags. Enables filtering and discovery on the Hugging Face Hub.
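
As a rough illustration, the sketch below assembles such a front-matter block using only the Python standard library. The helper name `render_front_matter` and the example values are hypothetical; the keys are the ones listed above. It handles only plain strings and lists of strings, with no quoting or escaping, so a real project would more likely rely on a YAML library.

```python
def render_front_matter(metadata: dict) -> str:
    """Render a flat dict of strings / lists of strings as YAML front matter."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            # Lists become a YAML block sequence under the key.
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Hypothetical example values for the keys named above.
metadata = {
    "license": "mit",
    "language": ["en"],
    "task_categories": ["text-classification"],
    "size_categories": ["1K<n<10K"],
    "tags": ["example"],
}

# The front matter goes at the very top of README.md, before the Markdown body.
readme = render_front_matter(metadata) + "\n# Dataset Card for Example\n"
print(readme)
```

Keeping the metadata in a dict makes it easy to generate the block from the same configuration that drives the dataset release.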

02

Dataset Description Section

Describes the dataset's purpose, language, tasks, associated papers, and other key contextual information.

Free-form Markdown section describing the dataset's purpose, supported languages, task types, relevant papers, and homepage links.

03

Dataset Structure Section

Documents data fields, splits, and dataset repository structure.

Describes the data fields, their types, splits (train/validation/test), and configurations, helping users understand how to load and navigate the dataset.

04

Dataset Creation Section

Documents data sources, collection process, annotations, tools, and contributors involved in creating the dataset.

Documents the source data, collection process, annotation procedures, tools used, and information about dataset creators.

05

Ethical Considerations and Limitations Section

Reports known biases, social risks, privacy concerns, and recommended or discouraged use cases.

Covers known biases, social risks, privacy considerations, and recommendations or warnings about intended and unintended uses of the dataset.

Common pitfalls

Incomplete or Superficial Section Completion
HIGH

Many dataset cards leave key sections blank or fill them with placeholder text ([More Information Needed]), reducing their informational value. Empirical analysis of 7,433 cards on Hugging Face found that only 7.9% of cards with no downloads completed all five community-endorsed sections, versus 86% for the top-100 downloaded datasets.

Follow the official Dataset Cards creation guide and complete all five recommended sections. Treat documentation quality as a checklist item in the dataset release process.
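
Such a checklist item could be automated with a small script. The sketch below flags sections that are missing, empty, or filled only with the `[More Information Needed]` placeholder. The section titles in `EXPECTED_SECTIONS` are an assumption about the template in use and should be adjusted to match it; the function names are hypothetical. The heading detection is deliberately naive and would also treat `#` lines inside fenced code blocks as headings.

```python
import re

PLACEHOLDER = "[More Information Needed]"
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

# Assumed section titles; adjust to the card template actually in use.
EXPECTED_SECTIONS = [
    "Dataset Description",
    "Dataset Structure",
    "Dataset Creation",
    "Considerations for Using the Data",
    "Additional Information",
]


def section_bodies(card_text: str) -> dict:
    """Map each heading title to the text under it, nested subsections included."""
    lines = card_text.splitlines()
    headings = [
        (i, len(m.group(1)), m.group(2))
        for i, line in enumerate(lines)
        if (m := HEADING.match(line))
    ]
    bodies = {}
    for pos, (start, level, title) in enumerate(headings):
        end = len(lines)
        # A section ends at the next heading of the same or a higher level.
        for next_start, next_level, _ in headings[pos + 1:]:
            if next_level <= level:
                end = next_start
                break
        bodies[title] = "\n".join(lines[start + 1:end])
    return bodies


def incomplete_sections(card_text: str, expected=EXPECTED_SECTIONS) -> list:
    """Return expected sections that are missing, empty, or placeholder-only."""
    bodies = section_bodies(card_text)
    problems = []
    for title in expected:
        body = bodies.get(title)
        if body is None:
            problems.append(title)
            continue
        # Ignore subsection headings; only real prose counts as content.
        content = [ln for ln in body.splitlines() if not HEADING.match(ln)]
        if not "\n".join(content).replace(PLACEHOLDER, "").strip():
            problems.append(title)
    return problems


card = """# Dataset Card for Example

## Dataset Description

A small illustrative corpus.

## Dataset Structure

[More Information Needed]

## Dataset Creation

### Source Data

[More Information Needed]
"""
print(incomplete_sections(card))
```

Running a check like this in the release pipeline turns an empty or placeholder-only section into a visible failure rather than a silent gap.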

Outdated documentation after dataset update
MEDIUM

Dataset cards are static documents that may become outdated as the underlying dataset evolves (new splits, filtered versions, license changes). There is no automated mechanism to enforce synchronization between the card and the data.

Update README.md with each significant dataset release or revision. Include the dataset version or date in the card.
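
One lightweight way to approximate this is a release-time check that fails when the card does not record the version being published. The sketch below assumes the card carries a line such as `Version: 1.1.0`; that convention and the helper names are hypothetical and not part of any Hub specification.

```python
import re
from typing import Optional

# Assumed convention: the card contains a line like "Version: 1.1.0".
VERSION_LINE = re.compile(r"^\s*Version:\s*(\S+)\s*$", re.MULTILINE | re.IGNORECASE)


def card_version(card_text: str) -> Optional[str]:
    """Return the version string recorded in the card, or None if absent."""
    match = VERSION_LINE.search(card_text)
    return match.group(1) if match else None


def check_card_is_current(card_text: str, release_version: str) -> None:
    """Raise ValueError when the card does not record the version being released."""
    found = card_version(card_text)
    if found != release_version:
        raise ValueError(
            f"Dataset card records version {found!r}, "
            f"but the release is {release_version!r}; update README.md."
        )


card = "# Dataset Card for Example\n\nVersion: 1.1.0\n"
check_card_is_current(card, "1.1.0")  # matching versions: no exception
```

A check of this kind only confirms that the version label was touched; it cannot verify that the rest of the card still describes the data accurately.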

Omitting the ethical risks and biases section
HIGH

The 'Considerations for Using the Data' section covering known biases, privacy risks, and social impact is the most commonly omitted or vaguely completed section, limiting the card's utility for responsible AI.

Explicitly document known biases, demographic imbalances, privacy risks, and potential misuse scenarios. Use the Datasheets for Datasets framework as a template for documentation scope.

2018

Datasheets for Datasets (Gebru et al.)

breakthrough

Publication of 'Datasheets for Datasets' by Gebru et al. on arXiv, proposing structured documentation for ML datasets and serving as the direct conceptual inspiration for Dataset Cards.

2019

Model Cards for Model Reporting (Mitchell et al.)

Publication of 'Model Cards for Model Reporting' by Mitchell et al. at FAT* 2019, providing a parallel documentation framework for ML models that shaped the Dataset Cards format.

2021

Dataset Cards adopted as a standard on Hugging Face Hub

breakthrough

Hugging Face Hub adopted Dataset Cards (README.md) as the official dataset documentation standard, providing an official template, metadata UI, and tagging tools.

2024

Empirical analysis of Dataset Cards on Hugging Face (Yang et al.)

Publication of 'Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face' (arXiv:2401.13822), analyzing 7,433 dataset cards for completeness and documentation patterns.

Hardware agnostic
PRIMARY

Dataset Cards are a documentation framework with no computational requirements. Cards are stored as Markdown files and rendered by web platforms.

Commonly used with

Model Cards

Model Cards are a documentation framework for trained machine learning models, proposed by Mitchell et al. in 2018 and published at ACM FAT* 2019. The framework defines nine sections for a model card:

Model Details: basic information about the model and its development
Intended Use: primary intended use cases and out-of-scope applications
Factors: relevant demographic, environmental, and instrumentation factors affecting performance
Metrics: performance metrics and decision thresholds
Evaluation Data: composition and source of evaluation datasets
Training Data: composition and source of training datasets
Quantitative Analyses: disaggregated evaluation results per factor, including intersectional analysis
Ethical Considerations: sensitive data, human life implications, risks and mitigations
Caveats and Recommendations: additional concerns and usage guidance

Model Cards are designed to promote transparent reporting of model performance across subgroups, reducing the risk of misapplication in high-stakes domains.

On the Hugging Face Hub, model cards are implemented as a README.md file with a YAML metadata header (machine-readable: license, language, base_model, pipeline_tag, evaluation results) and a Markdown body adapted from this structure. Google developed an open-source Model Card Toolkit (2020) to automate card generation from TensorFlow model evaluation data.

Dataset Cards

Official Hugging Face documentation describing dataset cards as the dataset repository README and a tool for responsible data use.

documentation | Hugging Face

Create a dataset card

A guide for creating dataset cards, covering descriptions of content, usage context, and dataset creation methodology.

documentation | Hugging Face

Datasheets for Datasets

Classic work on transparent dataset documentation, serving as an important conceptual background for later dataset cards.

scientific article | arXiv

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

Work developing the concept of structured, human-centered dataset documentation.

scientific article | arXiv

The Data Cards Playbook

Official Google playbook for transparent dataset documentation.

official website | Google Research