Robots Atlas

Multimodal LLM

Extending a decoder-based LLM with dedicated modality encoders and a modality interface module (connector) to enable processing and joint reasoning over inputs from multiple modalities (image, audio, video) while preserving the text generation capabilities of the LLM.

Typical use cases:
  • Analyzing images and charts
  • Voice and multimodal assistants
  • OCR and document understanding
  • Reading PDFs, screenshots, and presentations
  • Q&A based on images, tables, and audio

A typical Multimodal LLM combines a base language model with additional modality encoders — for image or audio, for example — and a projection/alignment layer that maps representations from different data types into a shared space. This allows the model to understand relationships between text, images, and other signals, and to generate responses spanning more than one data type.
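As a sketch of how these pieces compose, the following minimal PyTorch module wires a vision encoder, a connector, and a decoder-only LLM together. All module names and shapes here are illustrative assumptions, not any specific model's API.

```python
import torch
import torch.nn as nn

class MultimodalLLM(nn.Module):
    """Illustrative composition: encoder -> connector -> LLM backbone."""

    def __init__(self, vision_encoder, connector, llm, embed_tokens):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a pretrained ViT
        self.connector = connector            # maps d_enc features to d_llm
        self.llm = llm                        # decoder-only language model
        self.embed_tokens = embed_tokens      # the LLM's token embedding table

    def forward(self, pixel_values, input_ids):
        visual_feats = self.vision_encoder(pixel_values)   # (B, n_v, d_enc)
        visual_tokens = self.connector(visual_feats)       # (B, n_v, d_llm)
        text_tokens = self.embed_tokens(input_ids)         # (B, n_t, d_llm)
        # Run the LLM over the joint visual + text sequence.
        inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```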

Classic text-only LLMs have limited ability to understand information conveyed through images, audio, documents, and other data modalities. Multimodal LLMs address this by integrating multiple input and output types into a single system.

  • Integrating a base LLM with image, audio, or video encoders
  • Aligning representations from different modalities into a shared semantic space
  • Cross-modal reasoning between text and non-verbal signals
  • Generating text, voice, or multimodal responses
Mechanisms

01

Modality Encoder

Extracts features from non-linguistic modalities into an embedding space compatible with downstream integration.

Modular

Module converting raw non-text modality data (images, audio, video) into structured, semantically meaningful token embeddings. Typically pretrained independently (e.g., CLIP ViT for images).

  • Vision Transformer (ViT)
  • Perceiver Resampler
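As a concrete example, patch-level features can be extracted with the Hugging Face transformers implementation of CLIP ViT-L/14 (the encoder used in LLaVA); the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# At 224x224 with 14-pixel patches: 256 patch tokens + 1 CLS = (1, 257, 1024).
print(outputs.last_hidden_state.shape)
```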
02

Modality Interface / Connector

Alignment and integration of representations from different modalities prior to processing by the LLM.

Modular

Bridge between the modality encoder embedding space and the LLM input space. Responsible for cross-modal alignment.

  • Linear Projection / MLP
  • Q-Former
  • Cross-attention layers inserted into the LLM
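Two connector flavors in minimal PyTorch form: an MLP projection that keeps the token count unchanged (LLaVA-style) and a learned-query cross-attention block that compresses it (Q-Former-flavored). This is a sketch under assumed dimensions, not either paper's exact architecture.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """LLaVA-style projection: keeps the visual token count unchanged."""
    def __init__(self, d_enc, d_llm):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_enc, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, visual_feats):           # (B, n_v, d_enc)
        return self.proj(visual_feats)         # (B, n_v, d_llm)

class QueryConnector(nn.Module):
    """Q-Former-flavored sketch: q learned queries cross-attend to the
    visual features, compressing n_v tokens down to a fixed q."""
    def __init__(self, d_enc, d_llm, q=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, q, d_enc) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_enc, n_heads, batch_first=True)
        self.proj = nn.Linear(d_enc, d_llm)

    def forward(self, visual_feats):           # (B, n_v, d_enc)
        queries = self.queries.expand(visual_feats.size(0), -1, -1)
        out, _ = self.cross_attn(queries, visual_feats, visual_feats)
        return self.proj(out)                  # (B, q, d_llm)
```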
03

LLM Backbone

Reasoning, language understanding, and text generation based on a combined sequence of text tokens and other modalities.

Modular

Architectural core: a pretrained decoder-only large language model serving as the reasoning and text generation module. Often kept frozen or lightly fine-tuned during MLLM training.
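In LLaVA-style models the projected visual tokens are spliced into the text embedding sequence where an image placeholder token sits, and the otherwise unchanged decoder attends over the joint sequence. A minimal sketch (batch size 1, single image, hypothetical placeholder position):

```python
import torch

def splice_visual_tokens(text_embeds, visual_tokens, image_pos):
    """Replace the <image> placeholder embedding at index `image_pos`
    with the full block of projected visual tokens."""
    before = text_embeds[:, :image_pos]
    after = text_embeds[:, image_pos + 1:]  # drop the placeholder itself
    return torch.cat([before, visual_tokens, after], dim=1)

text_embeds = torch.randn(1, 32, 4096)     # prompt embeddings, d_llm = 4096
visual_tokens = torch.randn(1, 576, 4096)  # projected image tokens
joint = splice_visual_tokens(text_embeds, visual_tokens, image_pos=5)
print(joint.shape)  # torch.Size([1, 607, 4096])
```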

04

Modality Generator

Generates model outputs in non-text modalities.

Optional

Optional module generating non-text modality outputs (e.g., images, audio) from the LLM's output representations. Not present in all MLLM architectures.

Time complexity: O(n² · d)

n is the total token sequence length (text + visual); d is the model dimension. Quadratic complexity arises from the self-attention mechanism in the LLM. For high-resolution images or video, n can range from several thousand to tens of millions of tokens.

The modality encoder (e.g., ViT) adds complexity O(p² · d_enc) dependent on the number of image patches p. A Q-Former connector reduces n_v visual tokens to a fixed number q of query tokens, thereby lowering the effective n.
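A back-of-envelope comparison of self-attention cost with and without compression, assuming the LLaVA-1.5 setting of CLIP ViT-L/14 at 336×336 (576 visual tokens) and a hypothetical q = 32 query budget:

```python
# Attention score matrix entries per layer/head, with and without compression.
n_text = 128
n_vis_full, n_vis_q = 576, 32

def attn_pairs(n):
    return n * n  # self-attention is quadratic in sequence length

full = attn_pairs(n_text + n_vis_full)
compressed = attn_pairs(n_text + n_vis_q)
print(f"full: {full:,} pairs, compressed: {compressed:,} pairs "
      f"({full / compressed:.1f}x reduction)")
# full: 495,616 pairs, compressed: 25,600 pairs (19.4x reduction)
```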

Memory complexity: O(n² + P)

n is the token sequence length; P is the parameter count (encoder + connector + LLM). Memory scales quadratically with attention maps and linearly with the parameters of the component models.

In practice, the dominant memory costs are the LLM backbone weights (ranging from a few to hundreds of billions of parameters) and the attention activations for long visual sequences.
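A rough weight-memory estimate for a hypothetical 7B-class MLLM (component sizes are illustrative assumptions) shows why the backbone dominates:

```python
def weight_gb(params_billion, bytes_per_param=2):  # 2 bytes per param = fp16/bf16
    return params_billion * 1e9 * bytes_per_param / 1024**3

# Illustrative component sizes for a 7B-class MLLM.
for name, p in [("vision encoder", 0.3), ("connector", 0.02), ("LLM backbone", 7.0)]:
    print(f"{name}: ~{weight_gb(p):.1f} GB in fp16")
# vision encoder: ~0.6 GB, connector: ~0.0 GB, LLM backbone: ~13.0 GB
```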

Bottleneck: Quadratic self-attention complexity for visual tokens

The primary computational bottleneck is the quadratic complexity of the self-attention mechanism with respect to total token count. Non-text tokens (from images, video, or audio) dramatically inflate the input sequence: a standard image encoded by a ViT yields hundreds of tokens, while video can generate up to tens of millions of tokens.

Parallelism

Partially parallel

The modality encoder and connector can be processed fully in parallel during both training and the prefill phase of inference. Tensor parallelism and pipeline parallelism are widely used for large LLM backbones in multi-GPU environments.

Paradigm

Dense

All paths active

Standard MLLMs process all tokens (text and visual) through the full self-attention layers of the LLM backbone. Sparse or MoE variants exist in specific implementations (e.g., MoE-LLaVA) but are not a defining feature of the MLLM paradigm.

Modality Encoder Type

Critical
  • CLIP ViT-L/14 – Most commonly used visual encoder in LLaVA, BLIP-2, and many other models.
  • CLIP ViT-H/14 – Larger variant used in newer models for improved representation quality.
  • SigLIP – Alternative visual encoder used in Gemini models and newer LLaVA variants.

Architecture and pretrained weights of the modality encoder. Affects quality of visual/audio representations and knowledge transfer.

Modality Connector Type

Critical
  • Linear projection (MLP) – Used in LLaVA. A simple, efficient projection without token count compression.
  • Q-Former – Used in BLIP-2. Compresses visual tokens into a fixed number of query tokens.
  • Gated cross-attention – Used in Flamingo. Inserted between frozen LLM layers.

Architecture of the modality interface module. Determines how modality tokens are aligned and compressed before the LLM.

LLM backbone

Critical
  • Vicuna-7B / 13B – Used in LLaVA.
  • Chinchilla-70B – Used in Flamingo.
  • OPT / Flan-T5 – Used in BLIP-2.

Choice of pretrained language model as the MLLM core. Determines reasoning capabilities, generation quality, and model scale.

Input Image Resolution

Standard
  • 224×224 – Standard resolution for CLIP ViT-L/14. Produces 256 visual tokens (a 16×16 grid of 14-pixel patches).
  • 336×336 to 448×448 – Higher resolution for improved detail recognition and OCR.
  • Dynamic resolution (tile-based) – Dynamic image tiling for high-resolution processing (e.g., in InternVL, LLaVA-HD).

Input image pixel resolution. Directly determines the number of visual tokens and thus computational cost.
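Since CLIP-style ViTs split the image into fixed-size patches, the visual token count grows quadratically with resolution; a quick check assuming 14-pixel patches:

```python
def n_visual_tokens(resolution, patch=14):
    """Patch-token count for a square image (CLS token not included)."""
    return (resolution // patch) ** 2

for res in (224, 336, 448):
    print(f"{res}x{res}: {n_visual_tokens(res)} visual tokens")
# 224x224: 256, 336x336: 576, 448x448: 1024
```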

Training Strategy

Standard
  • Freeze encoder + LLM, train connector only – The most common alignment pretraining stage (e.g., BLIP-2 stage 1, LLaVA stage 1).
  • Freeze encoder, train connector + LLM – Instruction tuning with partial model unfreezing (e.g., LLaVA stage 2).
  • Parameter-efficient fine-tuning (PEFT / LoRA) – Efficiently tunes all components at reduced computational cost.

Determines which MLLM components are frozen vs. trained at each stage (alignment pretraining, instruction tuning, alignment tuning).
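A minimal sketch of stage-dependent freezing for the three components (PyTorch modules assumed; the staging mirrors the LLaVA recipe but is not its exact code):

```python
def configure_stage(encoder, connector, llm, stage=1):
    """Stage 1 (alignment pretraining): train the connector only.
    Stage 2 (instruction tuning): also unfreeze the LLM backbone.
    The vision encoder stays frozen in both stages."""
    for p in encoder.parameters():
        p.requires_grad = False
    for p in connector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)
```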

Strengths

  • More natural user interaction
  • Broader range of inputs and outputs
  • Better understanding of documents, charts, and interfaces
  • Supports integration of speech, image, and text within a single system

Limitations

  • Higher computational and memory costs
  • Higher training and evaluation complexity
  • Harder to ensure quality across all modalities
  • Uneven quality across modality types

Computational characteristics

  • Requires additional modality encoders beyond the LLM itself
  • Typically requires more memory and compute than a text-only model
  • Latency increases with the number and complexity of modalities

Evaluating multimodal LLMs requires benchmarks that go beyond text, such as VQA, OCR, chart and document understanding, audio understanding, and multimodal reasoning tasks.

Common pitfalls

Visual token explosion at high resolutions
CRITICAL

Increasing input image resolution or video length leads to rapid growth in visual token count and quadratic growth in LLM self-attention cost. Ignoring this causes GPU memory overflow or drastic slowdown in training and inference.

Mitigation: use token-compressing connectors (Q-Former, Perceiver Resampler), token pruning/merging, dynamic image tiling with a limited tile count, or sparse attention.
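The simplest of these ideas can be sketched as plain average pooling over the token axis; real systems use learned compression (Q-Former, Perceiver Resampler, token merging), so treat this purely as an illustration of the cost/fidelity trade:

```python
import torch
import torch.nn.functional as F

def pool_visual_tokens(visual_tokens, factor=4):
    """Reduce the visual token count by 1D average pooling:
    (B, n_v, d) -> (B, n_v // factor, d)."""
    x = visual_tokens.transpose(1, 2)                      # (B, d, n_v)
    x = F.avg_pool1d(x, kernel_size=factor, stride=factor)
    return x.transpose(1, 2)                               # (B, n_v//factor, d)

tokens = torch.randn(1, 576, 1024)
print(pool_visual_tokens(tokens).shape)  # torch.Size([1, 144, 1024])
```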

Embedding space misalignment between encoder and LLM
HIGH

The modality encoder and LLM operate in different embedding spaces. An insufficiently trained connector leads to poor visual information transfer — the model ignores visual cues or hallucinates.

Mitigation: pretrain the connector on a large image-text pair dataset before instruction tuning; use established connector architectures (Q-Former, MLP).

Catastrophic forgetting of LLM capabilities during fine-tuning
HIGH

Aggressive fine-tuning of the LLM backbone during MLLM training can cause loss of original language capabilities (forgetting general knowledge, degraded text generation).

Mitigation: freeze the LLM during alignment pretraining; apply PEFT (LoRA, QLoRA) instead of full fine-tuning; mix text-only and multimodal data during training.
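For the PEFT route, wrapping the backbone with LoRA adapters via the peft library looks roughly like this; the model id and target module names are typical for LLaMA-family backbones and should be adjusted per model:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # example backbone
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, LLaMA-style names
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # typically well under 1% of total weights
```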

Object hallucination – nonexistent objects in output
HIGH

MLLM generates descriptions of objects not present in the input image. Caused by imbalance between the LLM's strong linguistic priors and weaker visual signal, especially when the image lacks elements expected by the LLM.

Mitigation: use instruction-following data with negative examples; tune visual signal strength; apply specialized modality alignment losses; evaluate on the POPE and HallusionBench benchmarks.
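POPE reduces hallucination measurement to balanced yes/no questions ("Is there a <object> in the image?"); scoring it is just accuracy plus the yes-rate, as in this sketch with made-up answers:

```python
def pope_scores(predictions, labels):
    """Accuracy and yes-rate over POPE-style yes/no answers.
    A high yes-rate on negative questions signals object hallucination."""
    preds = [p.strip().lower() for p in predictions]
    labels = [l.strip().lower() for l in labels]
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    yes_rate = preds.count("yes") / len(preds)
    return accuracy, yes_rate

# Toy example: the model wrongly says "yes" to one negative question.
print(pope_scores(["yes", "no", "yes"], ["yes", "no", "no"]))  # (0.666..., 0.666...)
```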

2021

CLIP (Radford et al., OpenAI) – contrastive image-text alignment as a foundation

breakthrough

CLIP introduced effective vision-text alignment via large-scale contrastive learning, providing a strong visual encoder widely used in subsequent MLLMs.

2022

Flamingo (Alayrac et al., DeepMind / NeurIPS 2022) – landmark MLLM with few-shot learning

breakthrough

Flamingo defined the MLLM architecture with gated cross-attention layers interleaved into a frozen LLM (Chinchilla-70B), a Perceiver Resampler as connector, and training on interleaved image-text sequences. It demonstrated strong few-shot capabilities across 16 visual tasks.

2023

BLIP-2 (Li et al., Salesforce) and LLaVA (Liu et al.) — efficient and open MLLMs

breakthrough

BLIP-2 introduced Q-Former as an efficient connector compressing visual tokens to a fixed count, enabling MLLM training with far fewer trainable parameters than Flamingo. LLaVA showed that a simple linear projection with GPT-4-generated instruction-following data is sufficient for strong performance.

2023

GPT-4V (OpenAI) and Gemini (Google) – first-class commercial MLLMs

breakthrough

OpenAI and Google released closed-source MLLMs capable of advanced image-text processing in a unified system, setting new quality standards and driving broad industrial adoption of the MLLM paradigm.

2024

Expansion to multiple modalities and visual token compression as a primary research focus

Research expanded to audio, video, and high-resolution document processing, while quadratic visual token complexity became a leading research problem. Token pruning, Q-Former-based compression, and dynamic resolution methods gained prominence.

GPU Tensor Cores – PRIMARY

MLLM consists of Transformers (visual encoder, connector, LLM backbone) — all relying on matrix multiplication (GEMM), which is accelerated by GPU Tensor Cores (NVIDIA A100, H100). In practice, training and inference of MLLMs require GPUs with large HBM capacity (40–80 GB).

Training large MLLMs (>7B parameters) requires multi-GPU setups with Tensor Parallelism or Pipeline Parallelism. Inference for 7–13B models is feasible on 24–40 GB GPUs with 4-bit quantization.
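For the quantized-inference path, loading a 7B-class backbone in 4-bit NF4 via transformers and bitsandbytes looks like this (the model id is an example; a CUDA GPU and the bitsandbytes package are required):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",      # example 7B backbone
    quantization_config=bnb_config,
    device_map="auto",
)
# Weights drop from ~13 GB (fp16) to roughly 4 GB, leaving headroom
# for the vision encoder and attention activations on a 24 GB card.
```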

TPU – GOOD

TPU v4/v5 are used to train MLLMs at Google (Gemini). They offer high throughput for GEMM operations and efficient scaling via TPU Pods.

TPU-friendly implementations require specific frameworks (JAX/XLA). Flamingo and Gemini were trained on TPUs.

What is a Multimodal LLM (MLLM)?

Definition and practical overview of MLLMs as models processing multiple modalities.

Article – IBM
A Comprehensive Survey and Guide to Multimodal Large Language Models

Survey describing the architecture, applications, and evolution of multimodal LLMs.

Scientific article – arXiv