
Data Flywheel

2025 · Active · Published: 26 June 2026 · Updated: 26 June 2026

Key innovation

Closes the loop between production AI traffic and model training: operational exhaust (prompts, responses, feedback) becomes an automated data source for distilling large models into smaller, cheaper ones that remain equally effective on the same workload.

How it works

Step 1: The production application instruments every LLM call with a stable task identifier (workload_id) and logs the full request/response to a centralised store (e.g. Elasticsearch).

Step 2: An orchestrator (e.g. Celery) periodically collects logs, deduplicates them, and splits them into evaluation and training sets with class-balanced stratification.

Step 3: For each workload_id, three experiment types (base, in-context learning, LoRA fine-tune) run in parallel across a pool of candidate models.

Step 4: An evaluator compares candidate outputs against the production model using LLM-as-judge scoring in the range [0, 1].

Step 5: Candidates above the threshold go to human review; after acceptance they are deployed as a new NIM, closing the loop.

The cycle runs daily, weekly or on demand.

Problem solved

Frontier models are expensive and slow at inference, yet many production workloads are narrow enough (e.g. a single agent route) that a 70× smaller model can handle them after fine-tuning on the application's own traffic. What was missing was an automated process that continuously detects such opportunities and keeps production on the cheapest sufficiently accurate model, without ML engineers manually designing each experiment.

Components

Log Store

Single source of truth for the flywheel; every downstream step relies on this data.

Centralised store of raw production logs in the {timestamp, workload_id, client_id, request, response} schema. The NVIDIA Blueprint uses Elasticsearch 8.12.
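A minimal sketch of one log record in the schema above, using only the Python standard library. The field types, the epoch-seconds timestamp, and the helper names make_record and to_json_line are assumptions for illustration; the Blueprint's actual Elasticsearch mapping may differ.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogRecord:
    """One logged LLM call; field names follow the schema described above."""
    timestamp: float   # assumption: Unix epoch seconds (UTC)
    workload_id: str   # stable identifier of the task or agent route
    client_id: str     # identifier of the calling application
    request: dict      # full request payload as sent to the model
    response: dict     # full response payload as returned by the model


def make_record(workload_id, client_id, request, response, now=None):
    """Build a record, stamping it with the current UTC time unless `now` is given."""
    if now is None:
        now = datetime.now(timezone.utc)
    return LogRecord(now.timestamp(), workload_id, client_id, request, response)


def to_json_line(record):
    """Serialise a record to one JSON line, ready for whichever log-store client is in use."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping workload_id as a required field at construction time means a call cannot be logged without it, which matters because every later step groups by that identifier.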

Official

Dataset Builder

Conversion of raw traffic into datasets fit for training and evaluation.

Component that pulls logs from the log store, deduplicates them per workload_id and splits them into eval/train sets with class-aware stratified splitting (scikit-learn), ensuring balanced representation of call types.
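The deduplicate-then-stratify idea can be sketched in plain Python. This is a stand-in for illustration, not the Blueprint's code (which relies on scikit-learn); the dedup key, the label_of callback that assigns a class to each record, and the default eval_fraction are all assumptions.

```python
import json
import random
from collections import defaultdict


def dedupe(records):
    """Keep the first record for each distinct (workload_id, request) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["workload_id"], json.dumps(rec["request"], sort_keys=True))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def stratified_split(records, label_of, eval_fraction=0.2, seed=0):
    """Split each class separately so eval and train keep similar class ratios.

    `label_of` is a hypothetical callback mapping a record to its class
    (for example "tool" vs "text" responses). Classes with a single
    record go entirely to the training set.
    """
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)

    rng = random.Random(seed)  # seeded for reproducible splits
    eval_set, train_set = [], []
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        n_eval = max(1, round(len(group) * eval_fraction)) if len(group) > 1 else 0
        eval_set.extend(group[:n_eval])
        train_set.extend(group[n_eval:])
    return eval_set, train_set
```

Splitting per class rather than over the whole pool keeps rare call types represented in the evaluation set instead of leaving that to chance.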

Official

Experiment Orchestrator

The orchestrator is the heart of the flywheel; without it, experiments would revert to manual ML engineering work.

Workflow runner that schedules experiments per workload_id × candidate (base / ICL / LoRA fine-tune) and runs them in parallel while respecting the GPU pool. The NVIDIA Blueprint uses Celery with a parent_queue (concurrency=1) for the main DAG and a separate worker for evals.
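As a rough sketch of the scheduling idea, the job matrix for one workload can be enumerated and executed with a bounded worker pool. This uses the standard library's thread pool in place of Celery; the function names, the gpu_slots parameter (standing in for the GPU pool size), and the run_one callback are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# The three experiment types described above.
EXPERIMENT_TYPES = ("base", "icl", "lora")


def plan_jobs(workload_id, candidate_models):
    """Enumerate one job per candidate model and experiment type."""
    return [
        {"workload_id": workload_id, "model": model, "experiment": experiment}
        for model, experiment in product(candidate_models, EXPERIMENT_TYPES)
    ]


def run_jobs(jobs, run_one, gpu_slots=2):
    """Run jobs concurrently, never more than `gpu_slots` at a time.

    `run_one` is a hypothetical callback that executes a single job and
    returns its result; results come back in the same order as `jobs`.
    """
    with ThreadPoolExecutor(max_workers=gpu_slots) as pool:
        return list(pool.map(run_one, jobs))
```

For two candidate models this plans six jobs; capping the pool at the number of free GPUs is what keeps the parallelism from oversubscribing hardware.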

Official

Fine-Tuner

Produces competitive variants of a smaller model that can replace the large production model.

Component that fine-tunes candidate base models on the workload_id training dataset. The NVIDIA Blueprint uses NeMo Customizer with SFT + LoRA (adapter dim 32, dropout 0.1, 2 epochs, batch 16, lr 1e-4).
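To make the hyperparameters above concrete, the sketch below records them in a small config object and computes how many trainable parameters a LoRA adapter adds to one weight matrix: a rank-r adapter on a d_out × d_in matrix trains r · (d_in + d_out) parameters. The field names are illustrative and are not the NeMo Customizer API; the 4096-wide layer in the usage note is a hypothetical size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSftConfig:
    """Hyperparameter values quoted above; field names are assumptions."""
    adapter_dim: int = 32        # LoRA rank
    adapter_dropout: float = 0.1
    epochs: int = 2
    batch_size: int = 16
    learning_rate: float = 1e-4


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in, d_out, rank):
    """Adapter parameters as a fraction of the full weight matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)
```

For a hypothetical 4096 × 4096 projection, lora_trainable_params(4096, 4096, LoraSftConfig().adapter_dim) gives 262,144 parameters, or 1/64 of the matrix, which is why LoRA is the cheap default for narrow workloads.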

LoRA SFT: Default in the NVIDIA Blueprint.

Full Fine-Tuning: More expensive, rarely cost-effective for narrow workloads.

DPO / KTO: On the roadmap when logs contain thumbs-up/down signal.

Official

Evaluator

Gating mechanism; without an evaluator there is no way to tell whether a candidate is truly production-ready.

Component that compares candidate responses with production model responses via an LLM-as-judge, returning a similarity score in [0, 1]. The NVIDIA Blueprint uses NeMo Evaluator with either a self-hosted (6 GPU) or remote (2 GPU) judge.
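The gating logic can be sketched independently of the judge itself. Here judge is any callable returning a similarity in [0, 1]; the toy word-overlap judge below is only a placeholder for an LLM-as-judge call, and the threshold is left as a required argument because the source does not state a value. All names are assumptions.

```python
from statistics import mean


def toy_overlap_judge(reference, candidate):
    """Placeholder judge: Jaccard overlap of lowercase word sets, in [0, 1]."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)


def score_candidate(eval_pairs, judge):
    """Average judge score over (production_response, candidate_response) pairs."""
    scores = [judge(reference, candidate) for reference, candidate in eval_pairs]
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("judge scores must lie in [0, 1]")
    return mean(scores)  # raises StatisticsError if eval_pairs is empty


def passes_gate(score, threshold):
    """A candidate is forwarded to human review only at or above the threshold."""
    return score >= threshold
```

Passing the gate only nominates a candidate for review; it does not deploy anything.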

Official

Candidate Promoter

Safety gate preventing quality regression and unintended changes to model behaviour.

Intentionally manual step: an ML engineer or researcher reviews candidates flagged by the evaluator and decides on deployment. The NVIDIA Blueprint defines the flywheel as 'a flashlight, not an autopilot' — promotion to production remains a human decision.

Implementation

Reference implementations

NVIDIA Data Flywheel Foundational Blueprint

Python 3.11 · NVIDIA AI Blueprints

Official

Enhance Your AI Agent with Data Flywheels Using NVIDIA NeMo Microservices (developer blog)

— · NVIDIA

Official

Implementation pitfalls

No PII removal before fine-tuning

Severity: Critical

By default the NVIDIA Blueprint routes raw production traffic into training without masking PII. For many sectors (healthcare, finance, public sector) this is unacceptable and requires a custom PII redaction pipeline before the log store.

Fix: Insert a PII redaction layer (e.g. Presidio, Skyflow, a custom tokenizer) between the application and the log store. The NVIDIA Blueprint roadmap lists PII redaction as a planned extension.
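To show where such a layer sits, here is a deliberately simple regex-based sketch that masks e-mail addresses and phone-like numbers in the request and response before a record is logged. It is not a substitute for a dedicated tool: the two patterns, the placeholder tokens, and the function names are assumptions, and real deployments need far broader coverage (names, addresses, identifiers).

```python
import re

# Illustrative patterns only; they will miss many real-world PII formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_text(text):
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


def redact_value(value):
    """Recursively redact every string inside nested dicts and lists."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_value(item) for item in value]
    return value


def redact_record(record):
    """Return a copy of a log record with request and response redacted."""
    cleaned = dict(record)
    cleaned["request"] = redact_value(record["request"])
    cleaned["response"] = redact_value(record["response"])
    return cleaned
```

Applying redact_record in the application, before the write to the log store, means raw PII never reaches the training pipeline.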

No ground truth: evaluation against itself

Severity: High

The pattern builds evaluation sets from production-model responses, treating them as the gold standard. If production systematically errs on a narrow slice, candidates will fine-tune to the same errors.

Fix: Blend automatic evaluation with periodic hand-labelling on a random sample; introduce external ground truth for critical workloads; monitor drift of the production metric.

No free-GPU discovery: full-run serialisation

Severity: Medium

NVIDIA Blueprint v1 caps parent_queue concurrency at 1, so only one flywheel run can execute at a time. In large deployments this creates a bottleneck independent of the GPU count.

Fix: Deploy a custom scheduler with GPU introspection (e.g. Volcano, Kueue) or wait for the planned auto-discovery in future Blueprint versions.

Silent quality degradation risk

Severity: High

The loop optimises cost and latency; even with a human-in-the-loop there is a risk that successive iterations lower model quality by an imperceptible fraction that, after many rollouts, accumulates into clear regression.

Fix: Keep a fixed external golden set never seeded from production; periodically benchmark production vs golden and alert on drops.
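One way to catch slow drift is to track the golden-set score after every rollout and alert on both the per-rollout drop and the cumulative drop from the first recorded score. The sketch below assumes scores are listed chronologically; the function name and both tolerance values are hypothetical and would need tuning per workload.

```python
def regression_alerts(scores, max_step_drop=0.02, max_total_drop=0.05):
    """Flag rollouts whose golden-set score dropped too far.

    `scores` is a chronological list of golden-set scores, with scores[0]
    as the baseline. Returns (rollout_index, kind) tuples, where kind is
    "step" for a drop versus the previous rollout and "cumulative" for a
    drop versus the baseline.
    """
    alerts = []
    if not scores:
        return alerts
    baseline = scores[0]
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > max_step_drop:
            alerts.append((i, "step"))
        if baseline - scores[i] > max_total_drop:
            alerts.append((i, "cumulative"))
    return alerts
```

With scores of 0.90, 0.89, 0.88, 0.87, 0.86 and 0.845, no single rollout exceeds the step tolerance, yet the last one trips the cumulative check; that is the accumulation pattern a per-rollout comparison alone would miss.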

Evolution

Original paper · 2025 · NVIDIA AI Blueprints repository (Apache-2.0). First public release: April 2025; deprecated: April 2026. · NVIDIA AI Blueprints Team

Data Flywheel Foundational Blueprint (NVIDIA AI Blueprints)

NVIDIA AI Blueprints Team

2001

Jim Collins defines the 'flywheel effect' in 'Good to Great'

Inflection point

Business metaphor of a heavy wheel spun up by consistent pushes — concept later borrowed for AI/ML.

Good to Great: Why Some Companies Make the Leap... and Others Don't (book)

2017

Andrew Ng popularises the 'AI virtuous cycle' at Landing AI

The term 'data flywheel' enters VC and AI education circles for the product → data → model → product loop.

2025

NVIDIA Data Flywheel Blueprint (April 2025)

Inflection point

The first public reference implementation of a flywheel as a production-grade service, built on NeMo Microservices (Datastore, Customizer, Evaluator, Deployment Manager). Result: 98.6% cost reduction in NVIDIA's internal HR chatbot.

Enhance Your AI Agent with Data Flywheels Using NVIDIA NeMo Microservices (developer blog)

2026

Deprecation of the NVIDIA Foundational Blueprint (April 2026)

NVIDIA marks the public blueprint as deprecated and reference-only, moving development to newer patterns on NeMo Microservices. The Data Flywheel pattern itself remains active and continues to be used by the community and other stacks (LangSmith, Arize, Weights & Biases).