Pre-training: sample millions of synthetic datasets from a Bayesian prior over structural causal models (SCMs) and train a single transformer across all of them to predict held-out test labels given the labeled training context of each dataset. Inference: feed the transformer the entire training set {(x_i, y_i)} as context plus the test points x_test; in a single forward pass, the model returns p(y_test | x_test, context). There is no training or fine-tuning on the target dataset; the one exception is the TabPFN Enterprise variant, which offers optional fine-tuning.
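The pre-training objective described above can be written compactly. The following is a sketch in the notation of this section, following the prior-data fitted network formulation of Müller et al.; the symbols q_θ (the transformer), p(𝒟) (the prior over datasets), and φ (the latent SCM) are introduced here for illustration.

```latex
\begin{align}
\mathcal{L}(\theta) &= \mathbb{E}_{(D_{\text{train}},\, x_{\text{test}},\, y_{\text{test}}) \sim p(\mathcal{D})}
  \left[ -\log q_\theta\!\left(y_{\text{test}} \mid x_{\text{test}}, D_{\text{train}}\right) \right], \\
q_\theta\!\left(y_{\text{test}} \mid x_{\text{test}}, D_{\text{train}}\right)
  &\approx \int p\!\left(y_{\text{test}} \mid x_{\text{test}}, \phi\right)\,
     p\!\left(\phi \mid D_{\text{train}}\right)\, d\phi .
\end{align}
```

Minimizing the first line over many sampled datasets pushes q_θ toward the posterior predictive distribution in the second line, which is why a single forward pass can stand in for explicit Bayesian inference.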
Eliminates the need to train a separate model and tune hyperparameters for every new tabular dataset, delivering high-quality predictions in seconds on small and medium datasets (up to 50K rows in TabPFN-2.5).
Generator of synthetic datasets sampled from a Bayesian prior over functions (Structural Causal Models, BNNs, Gaussian Processes). In TabPFN, this is the distribution on which the model is pre-trained — effectively 'amortized' Bayesian inference.
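As an illustration of what "sampling a dataset from a prior over SCMs" can mean, here is a minimal sketch in plain Python. It is not the actual TabPFN prior: the graph density (0.5), the tanh nonlinearity, the unit Gaussian noise, and the median threshold for labels are all assumptions chosen to keep the example short, and `sample_scm_dataset` is a hypothetical name.

```python
import math
import random


def sample_scm_dataset(n_rows, n_features, seed=0):
    """Draw one synthetic table from a randomly sampled toy SCM.

    Hypothetical sketch: nodes are ordered so that edges only run from lower
    to higher index (a DAG). Each node is tanh of a weighted sum of its
    parents plus Gaussian noise. The last node becomes a binary label.
    """
    rng = random.Random(seed)
    n_nodes = n_features + 1
    # Sample the graph and its edge weights once per dataset (the prior draw).
    weights = [
        [rng.gauss(0.0, 1.0) if rng.random() < 0.5 else 0.0 for _ in range(j)]
        for j in range(n_nodes)
    ]
    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            parent_sum = sum(w * v for w, v in zip(weights[j], values))
            values.append(math.tanh(parent_sum + rng.gauss(0.0, 1.0)))
        rows.append(values)
    # Threshold the target node at its median to obtain binary labels.
    targets = sorted(r[-1] for r in rows)
    median = targets[len(targets) // 2]
    X = [r[:-1] for r in rows]
    y = [1 if r[-1] >= median else 0 for r in rows]
    return X, y
```

Calling the function with different seeds yields datasets with different causal graphs, which is the sense in which pre-training sees a distribution over tasks rather than a single task.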
The Transformer takes the full training set (X_train, y_train) plus test points X_test as context and in a single forward pass returns p(y_test | x_test, D_train). No gradient training on the target task.
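The call pattern of in-context prediction can be illustrated with a toy stand-in. The sketch below is explicitly not TabPFN's learned network: it replaces the transformer with a distance-based softmax weighting over the training rows, and `in_context_predict` and `temperature` are hypothetical names. What it does share with the description above is the interface: the prediction is a function of the context and the test point, with no parameter updates.

```python
import math


def in_context_predict(X_train, y_train, x_test, n_classes, temperature=1.0):
    """Toy stand-in for an in-context learner: one pass, no weight updates.

    The test point "attends" to every training row with softmax weights based
    on negative squared distance, then averages the one-hot labels. This only
    illustrates the call pattern p(y_test | x_test, D_train).
    """
    scores = [
        -sum((a - b) ** 2 for a, b in zip(row, x_test)) / temperature
        for row in X_train
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [0.0] * n_classes
    for w, label in zip(weights, y_train):
        probs[label] += w / total
    return probs
```

Swapping in a different training set changes the prediction immediately, without any fitting step, which mirrors the "no gradient training on the target task" property.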
Mechanism converting a table row into a token sequence. TabPFNv2 uses per-feature embeddings + sample-level positions, treating each cell as a token. Handles heterogeneous feature types (numerical, categorical).
TabPFNv2/2.5 architecture combines row-wise attention (samples attend to other samples) and column-wise attention (features attend to other features) — crucial for permutation invariance over features and samples.
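The alternating attention pattern can be sketched with unparameterized attention over a table of cell embeddings. This is a simplification under stated assumptions: there are no learned query/key/value projections, no MLP or normalization layers, and no train/test masking, and `self_attention` and `two_way_block` are hypothetical names. It shows only the data flow: attention across the features of each row, then attention across the rows of each feature column.

```python
import math


def self_attention(vectors):
    """Unparameterized softmax self-attention over equal-length vectors."""
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(dim)
        ])
    return out


def two_way_block(table):
    """table[r][f] is the embedding of the cell in row r, feature f."""
    # Attention across features: within each row, cells attend to each other.
    table = [self_attention(row) for row in table]
    # Attention across samples: within each feature column, rows attend to each other.
    n_rows, n_feats = len(table), len(table[0])
    columns = [
        self_attention([table[r][f] for r in range(n_rows)])
        for f in range(n_feats)
    ]
    return [[columns[f][r] for f in range(n_feats)] for r in range(n_rows)]
```

Because neither step depends on the order of its inputs, shuffling the rows of `table` simply shuffles the rows of the output in the same way (up to floating-point rounding), and the same holds for feature columns.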
Final layer returns a predictive distribution — softmax over classes for classification; distribution parameters (mixture of Gaussians or bin-based in TabPFNv2) for regression. Provides native uncertainty estimates.
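For the bin-based regression case, a point estimate and an uncertainty summary can be read off the predicted bin probabilities. The sketch below assumes the output is one logit per bin over known bin edges and that mass is spread uniformly inside each bin; both are assumptions of this example, and `binned_summary` is a hypothetical helper, not part of any TabPFN API.

```python
import math


def binned_summary(logits, bin_edges):
    """Turn per-bin logits into probabilities, a mean, and a median.

    Assumes len(bin_edges) == len(logits) + 1 and that probability mass is
    uniform inside each bin (an assumption of this sketch).
    """
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    lows, highs = bin_edges[:-1], bin_edges[1:]
    centers = [(lo + hi) / 2 for lo, hi in zip(lows, highs)]
    mean = sum(p * c for p, c in zip(probs, centers))
    # Median: walk the CDF and interpolate linearly inside the crossing bin.
    cum = 0.0
    median = bin_edges[-1]
    for p, lo, hi in zip(probs, lows, highs):
        if cum + p >= 0.5:
            median = lo + (0.5 - cum) / p * (hi - lo)
            break
        cum += p
    return probs, mean, median
```

The same CDF walk with a target other than 0.5 gives arbitrary quantiles, which is one way a binned output yields prediction intervals.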
The full training set must fit in context. For datasets >50K rows, TabPFN requires subsampling or ensembling — it is not a drop-in XGBoost replacement on big data.
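One simple form of the subsampling-plus-ensembling workaround is sketched below. It assumes a generic callable `predict_fn(X_ctx, y_ctx, x_test)` that returns class probabilities for one in-context forward pass; that callable, the uniform sampling scheme, and the defaults `max_rows` and `n_members` are hypothetical choices for illustration, not a documented TabPFN procedure.

```python
import random


def subsample_ensemble_predict(X_train, y_train, x_test, predict_fn,
                               max_rows=50_000, n_members=4, seed=0):
    """Average class probabilities over random subsamples that fit the context.

    predict_fn(X_ctx, y_ctx, x_test) -> list of class probabilities is a
    hypothetical callable standing in for one in-context forward pass.
    """
    rng = random.Random(seed)
    n = len(X_train)
    if n <= max_rows:
        # The whole training set fits in context; no subsampling needed.
        return predict_fn(X_train, y_train, x_test)
    summed = None
    for _ in range(n_members):
        idx = rng.sample(range(n), max_rows)  # sample without replacement
        probs = predict_fn(
            [X_train[i] for i in idx], [y_train[i] for i in idx], x_test
        )
        summed = probs if summed is None else [s + p for s, p in zip(summed, probs)]
    return [s / n_members for s in summed]
```

Each ensemble member sees only part of the data, so this trades some accuracy for feasibility; a gradient-boosted model trained on all rows may still be the better choice at that scale.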
If real data have structure not covered by the synthetic prior (e.g., extreme heteroscedasticity, strong temporal effects), TabPFN may underperform XGBoost.
Unlike XGBoost, which is CPU-friendly, TabPFN-2.5 requires an A100/H100-class GPU for datasets >10K rows. This may be unacceptable in edge or CPU-only environments.
TabPFNv2 and TabPFN-2.5 weights released on Hugging Face are under a non-commercial license. Commercial use requires the Prior Labs API or commercial platforms (SageMaker, Azure AI Foundry, Databricks).
Standard TabPFN does not, by design, perform gradient fine-tuning on the target dataset (optional fine-tuning is available only in the TabPFN Enterprise variant). For tasks with a strong domain signal (e.g., medical biomarkers), the lack of fine-tuning may limit performance compared to a dedicated model.
Müller et al. publish 'Transformers Can Do Bayesian Inference' — showing that a Transformer pre-trained on samples from a prior approximates the posterior in a single forward pass.
Hollmann et al. release TabPFN — the first PFN for tabular data. Limit: ~1K rows, ~100 features, classification only.
Hollmann, Müller, and Hutter found Prior Labs as a University of Freiburg spin-off — commercializing the TabPFN line.
TabPFNv2 published in Nature — regression support, ~10K rows, two-way attention, SCM-based prior. Surpasses XGBoost as the state of the art on small/medium datasets.
TabPFN-2.5 scales to 50K rows × 2K features, matching AutoGluon 1.4 with 4-hour tuning on TabArena. A specialized variant, TabPFN-TS, targets time series.
SAP announces an agreement to acquire Prior Labs (>€1B over 4 years) — commercializing TabPFN in the enterprise stack (S/4HANA, Joule).
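Time complexity: O((N_train + N_test)² · d_model) per layer with standard full attention. FlashAttention computes exact attention, so the time stays quadratic, but it reduces attention memory to linear in sequence length; only sparse or approximate attention lowers the time toward O((N_train + N_test) · d_model). Space complexity: O((N_train + N_test)² + (N_train + N_test) · F · d_model) when the attention matrix is materialized.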
Unlike XGBoost, which processes the training data only while fitting, TabPFN processes the entire training set on every prediction call. Inference cost therefore grows as O(N_train²), which becomes a practical limitation for large datasets (>50K rows).
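The quadratic growth is easy to make concrete with a little arithmetic. The helper below counts pairwise attention scores for one head in one layer, assuming every row attends to every row; `attention_cost` is a hypothetical name, and `bytes_per_score=2` assumes FP16/BF16 storage if the score matrix were fully materialized.

```python
def attention_cost(n_train, n_test, bytes_per_score=2):
    """Count pairwise attention scores for one head in one layer.

    Assumes full attention over all rows, matching the
    O((N_train + N_test)^2) term; bytes_per_score=2 corresponds to FP16/BF16.
    Returns (number of scores, bytes if the matrix were materialized).
    """
    n = n_train + n_test
    scores = n * n
    return scores, scores * bytes_per_score


small = attention_cost(10_000, 0)   # (100_000_000, 200_000_000)
large = attention_cost(50_000, 0)   # (2_500_000_000, 5_000_000_000)
```

Going from 10K to 50K rows multiplies the row count by 5 but the number of scores by 25, which is the scaling behavior behind the row limits quoted in this section.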
The architecture is a dense Transformer (no MoE). The entire model is activated on every prediction.
No gradient training on the target task — this is the fundamental difference from XGBoost/RF.
TabPFN-2.5 is designed for Tensor Core GPUs (A100/H100/B200). FP16/BF16 dense matmul + FlashAttention are the dominant operations.
The Transformer-based architecture is natively compatible with TPU; no official Prior Labs deployments on TPU, but feasible (XLA/JAX).
TabPFNv1 and small TabPFNv2 instances (<1K rows) work on CPU, but latency grows quickly. For larger datasets, CPU is practically excluded.