Step 1: the production application instruments every LLM call with a stable task identifier (workload_id) and logs the full request/response to a centralised store (e.g. Elasticsearch). Step 2: an orchestrator (e.g. Celery) periodically collects logs, deduplicates them, and splits them into evaluation and training sets with class-balanced stratification. Step 3: for each workload_id, three experiment types run in parallel across a pool of candidate models (base, in-context learning, LoRA fine-tune). Step 4: an evaluator compares candidate outputs against the production model using LLM-as-judge scoring in the range [0,1]. Step 5: candidates above the threshold go to human review; after acceptance they are deployed as a new NIM, closing the loop. The cycle runs daily, weekly or on demand.
Frontier models are expensive and slow at inference, yet production applications generate sufficiently narrow-domain data (e.g. a single agent route) for a 70× smaller model to handle them after fine-tuning. What was missing was an automated process that continuously detects such opportunities and keeps production on the cheapest sufficiently accurate model, without ML engineers manually designing each experiment.
Centralised store of raw production logs in the {timestamp, workload_id, client_id, request, response} schema. The NVIDIA Blueprint uses Elasticsearch 8.12.
Official
Component that pulls logs from the log store, deduplicates them per workload_id and splits them into eval/train sets with class-aware stratified splitting (scikit-learn), ensuring balanced representation of call types.
Official
Workflow runner that schedules experiments per workload_id × candidate (base / ICL / LoRA fine-tune) and runs them in parallel while respecting the GPU pool. The NVIDIA Blueprint uses Celery with a parent_queue (concurrency=1) for the main DAG and a separate worker for evals.
Official
Component that fine-tunes candidate base models on the workload_id training dataset. The NVIDIA Blueprint uses NeMo Customizer with SFT + LoRA (adapter dim 32, dropout 0.1, 2 epochs, batch 16, lr 1e-4).
Official
Component that compares candidate responses with production model responses via an LLM-as-judge, returning a similarity score in [0, 1]. The NVIDIA Blueprint uses NeMo Evaluator with either a self-hosted (6 GPU) or remote (2 GPU) judge.
Official
Intentionally manual step: an ML engineer or researcher reviews candidates flagged by the evaluator and decides on deployment. The NVIDIA Blueprint defines the flywheel as 'a flashlight, not an autopilot' — promotion to production remains a human decision.
By default the NVIDIA Blueprint routes raw production traffic into training without masking PII. For many sectors (healthcare, finance, public sector) this is unacceptable and requires a custom PII redaction pipeline before the log store.
The pattern builds evaluation sets from production-model responses, treating them as the gold standard. If production systematically errs on a narrow slice, candidates will fine-tune to the same errors.
NVIDIA Blueprint v1 caps parent_queue concurrency at 1, so only one flywheel run can execute at a time. In large deployments this creates a bottleneck independent of the GPU count.
The loop optimises cost and latency; even with a human-in-the-loop there is a risk that successive iterations lower model quality by an imperceptible fraction that, after many rollouts, accumulates into clear regression.
Business metaphor of a heavy wheel spun up by consistent pushes — concept later borrowed for AI/ML.
The term 'data flywheel' enters VC and AI education circles for the product → data → model → product loop.
The first public reference implementation of a flywheel as a production-grade service, built on NeMo Microservices (Datastore, Customizer, Evaluator, Deployment Manager). Result: 98.6% cost reduction in NVIDIA's internal HR chatbot.
NVIDIA marks the public blueprint as deprecated and reference-only, moving development to newer patterns on NeMo Microservices. The Data Flywheel pattern itself remains active and continues to be used by the community and other stacks (LangSmith, Arize, Weights & Biases).
Time complexity: O(W · K · E + W · F). Space complexity: O(L + K · M_adapter + D).
The bottleneck shifted from manual ML engineering work to GPU resources. NVIDIA Blueprint v1 serialises whole DAG runs (concurrency=1 on parent_queue) because automatic free-GPU discovery is not yet implemented.
How finely production traffic is partitioned into tasks. Too coarse = undetectable patterns; too fine = no critical mass for fine-tuning (min. 50 records).
List of smaller models considered as production replacements. NVIDIA Blueprint v1 tests Meta Llama 3.2 1B Instruct; roadmap: Qwen, Llama-Nemotron, Mistral.
How often the orchestrator launches a full flywheel DAG.
Number of examples held out for evaluation per workload_id. NVIDIA Blueprint default = 20 eval + 0.1 validation.
LoRA fine-tuning hyperparameters: adapter_dim, dropout, epochs, batch_size, learning_rate.
The flywheel loop activates at different model lifecycle stages: daily (incremental), weekly (deep sweep), ad-hoc (after application change).
Every production call receives a workload_id (e.g. agent.tool_router). The orchestrator routes logs of that workload_id to a dedicated evaluation pipeline and a dedicated fine-tuned candidate. This is not inference-time routing — it routes the training/evaluation loop itself.
Architecturally equivalent to a hyperparameter sweep — every workload_id × candidate × strategy (base/ICL/LoRA) is an independent job.
LoRA fine-tuning and parallel LLM-as-judge inference require capable Tensor Cores. NVIDIA Blueprint minimum: 6× H100/A100 (self-hosted judge) or 2× (remote judge).
Orchestration (Celery), log store (Elasticsearch), API (FastAPI), MongoDB and Redis components run fully on CPU.
The Data Flywheel concept itself is an abstract systems pattern — it can be realised on any ML stack (TPU, AWS Inferentia, AMD MI300).