Robots Atlas

Native Multimodal Architecture

A model trained from scratch simultaneously on data from all modalities, eliminating the need to combine separate pre-trained modality encoders and enabling the learning of joint cross-modal representations.

Category · Abstraction level · Operation level
01

Unified Multimodal Tokenizer

Creates a unified input for the shared Transformer backbone, enabling data from multiple modalities to be processed as a single token sequence.

Modular

Module responsible for converting data from all modalities into a shared token space. Images are typically quantized with a vector quantizer (VQ-VAE), producing discrete visual tokens; text is tokenized with a standard subword tokenizer; audio is converted to spectrograms or discrete acoustic tokens.

  • VQ-VAE image tokenizer
  • Continuous patch embeddings
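A minimal PyTorch sketch of how a unified token space can be built from the two inputs named above; the vocabulary sizes, the offset scheme, and the `unify` helper are illustrative assumptions, not details of any specific model.

```python
import torch

TEXT_VOCAB = 65_536        # assumed text vocabulary size
IMAGE_CODEBOOK = 8_192     # assumed VQ-VAE codebook size
IMAGE_OFFSET = TEXT_VOCAB  # image codes are shifted to follow text ids in the shared vocabulary

def unify(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate text ids and offset VQ-VAE codes into one interleaved sequence."""
    image_ids = image_codes + IMAGE_OFFSET  # disjoint id ranges per modality
    return torch.cat([text_ids, image_ids], dim=-1)

# Example: a short caption followed by a 2x2 grid of image tokens
seq = unify(torch.tensor([11, 42, 7, 901, 3]), torch.tensor([0, 17, 512, 8191]))
print(seq.tolist())  # [11, 42, 7, 901, 3, 65536, 65553, 66048, 73727]
```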
02

Shared Transformer Backbone

Central computational unit of the model; handles unified representation and cross-modal reasoning.

A single stack of transformer layers processing interleaved token sequences from all modalities. The self-attention mechanism operates on the combined sequence, allowing tokens from different modalities to attend to each other.
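A compact sketch of such a shared backbone, assuming the unified ids from the tokenizer sketch above; the layer count and width are placeholders, and `nn.TransformerEncoder` with a causal mask stands in for a production decoder stack.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self, vocab=65_536 + 8_192, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)  # one embedding table for all modalities
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Causal mask: every token (text or image) attends to all earlier tokens,
        # regardless of modality -- this is where cross-modal mixing happens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.blocks(self.embed(ids), mask=mask)

h = SharedBackbone()(torch.randint(0, 70_000, (2, 16)))
print(h.shape)  # torch.Size([2, 16, 512])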

03

Joint Pretraining Objective

Provides a unified gradient signal across all modalities during training, enforcing cross-modal representation learning.

Modular

Training objective applied simultaneously to data from all modalities. Typically autoregressive next-token prediction on interleaved multimodal sequences, without separate per-modality pretraining phases.
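A minimal sketch of this objective: a single next-token cross-entropy over the interleaved sequence, so text and image positions contribute to the same loss and gradient. Shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_lm_loss(logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) from the shared backbone; ids: (batch, seq)."""
    # Predict token t+1 from positions <= t, identically for text and image tokens.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),
    )
```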

04

Modality-specific Output Heads

Enables generating outputs across multiple modalities while maintaining a shared backbone.

Modular

Separate output heads mapping the transformer's internal representation to the output space of each modality. May include a language head (softmax over text vocabulary) and a visual head (softmax over image token vocabulary or image decoder).
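A sketch of per-modality heads over one shared hidden state, assuming a boolean mask marking which target positions hold image tokens; sizes and helper names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityHeads(nn.Module):
    """Two output projections over one shared hidden state."""
    def __init__(self, d_model=512, text_vocab=65_536, image_codebook=8_192):
        super().__init__()
        self.text_head = nn.Linear(d_model, text_vocab)       # softmax over text vocabulary
        self.image_head = nn.Linear(d_model, image_codebook)  # softmax over image codes

    def forward(self, hidden: torch.Tensor):
        return self.text_head(hidden), self.image_head(hidden)

def heads_loss(text_logits, image_logits, targets, is_image):
    # Each target position is scored by the head of its own modality;
    # targets at image positions are codebook indices, elsewhere text ids.
    loss_text = F.cross_entropy(text_logits[~is_image], targets[~is_image])
    loss_image = F.cross_entropy(image_logits[is_image], targets[is_image])
    return loss_text + loss_image
```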

Parallelism

Partially parallel

Training on interleaved multimodal data can be parallelized across devices (data parallelism, tensor parallelism), but the sequential nature of autoregressive decoding limits parallelism during inference. Training from scratch on multimodal data requires a large number of GPUs/TPUs.

Paradigm

Dense

All paths active

The base execution pattern is dense: all transformer parameters are activated for every token, regardless of modality. MoE variants introduce conditional expert activation, but the native multimodal paradigm does not inherently require routing.

Modality Fusion Depth

Critical
  • early fusion: Tokens from all modalities are concatenated directly into a single input sequence; the approach used in Chameleon and GPT-4o.
  • late fusion: Separate encoders process each modality, with outputs fused at a later stage, e.g., LLaVA and Flamingo.

Whether modalities are fused at the input level (early fusion) or after separate encoding (late fusion). Determines when cross-modal attention can first occur.
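A short sketch contrasting where fusion happens under the two options; the dimensions are placeholders, and the cross-attention line illustrates one Flamingo-style late-fusion variant rather than any single model's exact connector.

```python
import torch
import torch.nn as nn

d = 512
text_emb = torch.randn(1, 12, d)  # embedded text tokens
img_emb = torch.randn(1, 64, d)   # image tokens / projected patch features

# Early fusion: one interleaved sequence enters the first transformer layer,
# so cross-modal attention is possible from layer 0 onward.
early_input = torch.cat([img_emb, text_emb], dim=1)  # (1, 76, d)

# Late fusion (here a Flamingo-style cross-attention variant): the text stream
# only later attends to separately encoded image features.
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=text_emb, key=img_emb, value=img_emb)  # (1, 12, d)
```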

Modality Token Representation

Standard
  • discrete (VQ-VAE): Images are quantized to discrete tokens; the approach used in Chameleon.
  • continuous patch embeddings: Images are represented as continuous patch embeddings; the approach used in Gemini.

Whether non-text modalities are represented as discrete tokens (via VQ-VAE) or as continuous embeddings projected into the shared space.
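A sketch of the two representation choices side by side; the dimensions are assumptions.

```python
import torch
import torch.nn as nn

d_model, codebook_size, patch_dim = 512, 8_192, 1_024

# Discrete (Chameleon-style): VQ-VAE code indices pass through an embedding table
# shared with the rest of the token vocabulary.
code_embed = nn.Embedding(codebook_size, d_model)
discrete_tokens = code_embed(torch.randint(0, codebook_size, (1, 64)))  # (1, 64, d_model)

# Continuous (Gemini-style, per the description above): patch features are projected
# into the backbone's embedding dimension without quantization.
project = nn.Linear(patch_dim, d_model)
continuous_tokens = project(torch.randn(1, 64, patch_dim))              # (1, 64, d_model)
```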

Modality Range

Standard
  • text + image: The most common range; Chameleon, Aria.
  • text + image + audio + video: The full modality range; Gemini, GPT-4o.

Which modalities are included in joint pretraining: only text and images, or also audio, video, and sensor data.

MoE Integration

Standard
  • dense (no MoE): Chameleon, a fully dense model.
  • sparse MoE: Aria and Gemini-style models; MoE enables per-modality expert specialization.

Whether MoE layers are incorporated to enable implicit modality-specific expert specialization, improving parameter efficiency.

Common pitfalls

Training instability with early fusion
HIGH

Early-fusion native multimodal models trained from scratch on mixed-modal data are prone to training instability, including loss spikes and gradient issues, due to the heterogeneity of token distributions across modalities.

QK-Norm, learning-rate scaling adapted to model size and modality, and careful curation of interleaved data.
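A minimal sketch of the QK-Norm mitigation named above: queries and keys are normalized per attention head before the dot product, which bounds attention logits and helps suppress the loss spikes described. Dimensions are placeholders and the class name is illustrative.

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.q_norm = nn.LayerNorm(self.dk)  # normalizes each query head
        self.k_norm = nn.LayerNorm(self.dk)  # normalizes each key head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.h, self.dk)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.h, self.dk)).transpose(1, 2)
        v = v.view(b, t, self.h, self.dk).transpose(1, 2)
        attn = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, -1))
```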

Training from scratch cost
HIGH

Training a native multimodal model from scratch on all modalities simultaneously requires substantially more compute than fine-tuning a pre-trained LLM with a grafted vision encoder, making the approach inaccessible without significant infrastructure.

Staged curriculum training, efficient data mixing with modality ratio control, and MoE to reduce active FLOPs per token.

Modality imbalance in training data
MEDIUM

Imbalance in the quantity and quality of training data across modalities can cause the model to underperform on underrepresented modalities while excelling on the dominant one (typically text).

Careful tuning of per-modality data ratios; use of separate tokenizers to balance token distributions.
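A sketch of ratio-controlled data mixing as mentioned above; the modality names, ratios, and the `sample_batch` helper are illustrative assumptions.

```python
import random

def sample_batch(sources: dict, ratios: dict, batch_size: int) -> list:
    """sources: modality -> list of examples; ratios: modality -> sampling weight."""
    modalities = list(ratios)
    weights = [ratios[m] for m in modalities]
    batch = []
    for _ in range(batch_size):
        m = random.choices(modalities, weights=weights, k=1)[0]  # pick a modality by ratio
        batch.append(random.choice(sources[m]))                  # then an example from it
    return batch

# Example: keep image-text pairs at 30% of the stream even if the text corpus is far larger.
batch = sample_batch(
    {"text": ["t1", "t2", "t3"], "image_text": ["i1", "i2"]},
    {"text": 0.7, "image_text": 0.3},
    batch_size=8,
)
```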

Difficulty generating across multiple modalities simultaneously
MEDIUM

Enabling the model to generate outputs in multiple modalities (e.g., interleaved text and images) requires additional architectural support (separate output heads, image decoders) and alignment training that significantly increases engineering complexity.

Separate output heads per modality; staged alignment (SFT, RLHF) using data that includes mixed-modal outputs.

GENESIS · Source paper

Chameleon: Mixed-Modal Early-Fusion Foundation Models
2024 · arXiv (arXiv:2405.09818) · Chameleon Team (FAIR at Meta)
2021

BEiT and visual tokenization

BEiT (Bao et al., 2021) introduced self-supervised vision representation learning using discrete image patch tokens, establishing the conceptual foundation for treating image patches as tokens analogous to text tokens.

2022

Training on mixed-document data (Aghajanyan et al.)

Aghajanyan et al. (2022) extended token-based modeling to mixed-modal documents with interleaved image and text tokens, enabling joint reasoning over both modalities in a unified architecture.

2023

Gemini: first large natively multimodal model

breakthrough

Google Gemini (2023) introduced a large-scale native multimodal model trained from the ground up on text, image, audio, and video, using a unified token stream and shared transformer backbone, establishing native multimodality as a viable paradigm at frontier scale.

2024

Chameleon: an open early-fusion model trained from scratch

breakthrough

Meta's Chameleon (2024) formalized the early-fusion token-based native multimodal paradigm in an open model, demonstrating stable training from scratch on ~10 trillion interleaved tokens using a unified discrete vocabulary for text and images.

2024

GPT-4o: end-to-end training across text, audio, and image modalities

breakthrough

OpenAI's GPT-4o (2024) adopted end-to-end training across text, audio, and visual modalities without separate cascaded models for speech recognition and synthesis, reducing latency and improving cross-modal reasoning.

2025

Scaling laws for native multimodal models (Apple/Sorbonne)

breakthrough

Shukor et al. (2025, Apple/Sorbonne) established scaling laws for native multimodal models, showing that early-fusion architectures trained from scratch match or outperform late-fusion designs at equivalent compute, and that MoE integration enables implicit modality-specific specialization.

GPU Tensor Cores · PRIMARY

Training and inference of native multimodal models rely on large matrix operations within the transformer (QKV projections, FFN), which map well onto the tensor cores of GPUs such as the NVIDIA A100, H100, and GB200. Chameleon was trained on A100 GPU clusters.

Most openly documented implementations were trained on NVIDIA A100/H100 GPUs; Gemini was trained on Google TPUs, and newer NVIDIA systems such as GB200 NVL72 target the same workloads.

TPU · GOOD

Google Gemini, one of the key native multimodal models, was trained on TPU v4/v5. The Transformer architecture is well suited to matrix-oriented TPU accelerators.

TPUs are particularly efficient for Gemini-style architectures with large batch sizes.

EXTENDS

Multimodal LLM

Multimodal Large Language Models (MLLMs) are architectures built on top of pretrained decoder-only large language models (LLMs) that incorporate additional modality-specific encoders and a modality interface to process non-textual inputs, most commonly images, but also audio, video, and 3D data. The canonical MLLM architecture consists of three core modules: (1) a modality encoder that converts raw non-text inputs (e.g., image patches processed by a Vision Transformer) into token embeddings; (2) a modality interface (connector) that bridges the encoder and LLM spaces via projection layers, Q-Former modules (as in BLIP-2), or cross-attention layers (as in Flamingo); and (3) the pretrained LLM backbone that performs reasoning and text generation over the resulting combined token sequence. Optionally, a modality generator can be appended to produce non-textual outputs.

During training, MLLMs typically proceed through pretraining (cross-modal alignment, with the LLM and often the encoder frozen), instruction tuning, and alignment tuning stages. The primary computational bottleneck is the quadratic complexity O(n²) of self-attention with respect to the total token count, which becomes severe when visual inputs are converted into hundreds or thousands of tokens, particularly for high-resolution images or video. Three main connector paradigms are used: projection-based (linear layer or MLP), query-based (Q-Former with learnable queries), and fusion-based (cross-attention into frozen LLM layers). Representative models include Flamingo (DeepMind, 2022), BLIP-2 (Salesforce, 2023), LLaVA (2023), GPT-4V (OpenAI, 2023), Gemini (Google, 2023), and Claude (Anthropic).
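To make the projection-based connector concrete, here is a minimal LLaVA-style MLP sketch; the hidden sizes and patch count are assumptions, not values from any particular model.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1_024, 4_096  # assumed ViT and LLM hidden sizes

# Two-layer MLP connector mapping frozen vision-encoder patch features into
# the LLM's token embedding space.
connector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_features = torch.randn(1, 576, vision_dim)  # e.g. a 24x24 patch grid from the encoder
visual_tokens = connector(patch_features)         # (1, 576, llm_dim)
# visual_tokens are concatenated with the text embeddings and fed to the pretrained LLM.
```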


Connects

MoE

Mixture of Experts (MoE) is an architecture in which a model is composed of multiple parallel sub-networks, the experts, along with a gating (routing) network that determines, for each input, which subset of experts to activate and how to combine their outputs. The gating network produces a weighting over experts; in the original soft formulation (Jacobs et al., 1991), all experts are weighted and summed. In the sparse formulation (Shazeer et al., 2017), only the top-k scoring experts are activated, and the remaining experts produce no output and incur no compute cost for that input. In the context of large language models, MoE is typically applied as a replacement for the feed-forward network (FFN) sub-layer within each Transformer block. Each token is routed to a small number of expert FFNs (commonly top-1 or top-2), with the router being a learned linear projection followed by a softmax. The outputs of the selected experts are weighted by the corresponding router scores and summed.

A central challenge in sparse MoE is load balancing: without explicit regularization, the router tends to collapse onto a small set of preferred experts, leaving others undertrained. This is addressed via auxiliary load-balancing losses added to the training objective, which encourage a roughly uniform distribution of tokens across experts. Expert parallelism is the standard distributed training and inference strategy for large MoE models: each expert is placed on a separate device, so that the total parameter count scales with the number of devices without increasing per-device memory or per-token FLOPs proportionally. The capacity factor controls the maximum number of tokens each expert can process per batch; tokens that overflow the capacity are either dropped or passed through a residual connection. Tuning the capacity factor is a critical practical consideration.
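A sketch of a sparse top-2 MoE FFN with a Switch-style auxiliary load-balancing loss, as described above; the expert count, sizes, and class name are illustrative, and the loop-based dispatch stands in for the batched expert-parallel kernels used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Drop-in replacement for an FFN sub-layer; sizes and expert count are illustrative."""
    def __init__(self, d_model=512, d_ff=2_048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # learned linear router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -- tokens are routed independently.
        probs = F.softmax(self.router(x), dim=-1)  # (tokens, n_experts)
        top_p, top_i = probs.topk(self.k, dim=-1)  # only the top-k experts run per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                hit = top_i[:, slot] == e
                if hit.any():
                    out[hit] += top_p[hit, slot].unsqueeze(-1) * expert(x[hit])
        # Auxiliary load-balancing loss: fraction of tokens whose top-1 choice is
        # expert i, times the mean router probability for expert i.
        frac = torch.bincount(probs.argmax(dim=-1), minlength=len(self.experts)).float() / x.size(0)
        aux_loss = len(self.experts) * (frac * probs.mean(dim=0)).sum()
        return out, aux_loss

y, aux = TopKMoE()(torch.randn(32, 512))  # 32 tokens; add aux to the main training loss
```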
