Multimodal LLM

A multimodal LLM (MLLM) extends a decoder-based LLM with dedicated modality encoders and a modality interface module (connector). This enables processing and joint reasoning over inputs from multiple modalities (image, audio, video) while preserving the text generation capabilities of the LLM.

Modality Encoder

Extracts features from non-linguistic modalities into an embedding space compatible with downstream integration.

Modular

Module converting raw non-text modality data (images, audio, video) into structured, semantically meaningful token embeddings. Typically pretrained independently (e.g., CLIP ViT for images).

Modality Interface / Connector

Alignment and integration of representations from different modalities prior to processing by the LLM.

Modular

Bridge between the modality encoder embedding space and the LLM input space. Responsible for cross-modal alignment.

LLM Backbone

Reasoning, language understanding, and text generation based on a combined sequence of text tokens and tokens from other modalities.

Modular

Architectural core: a pretrained decoder-only large language model serving as the reasoning and text generation module. Often kept frozen or lightly fine-tuned during MLLM training.

Modality Generator

Generates outputs in non-text modalities.

Optional module generating non-text modality outputs (e.g., images, audio) from the LLM's output representations. Not present in all MLLM architectures.
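The way these components fit together can be illustrated with a minimal sketch. It assumes a toy random encoder, a single linear connector, and visual tokens placed before the text tokens. The names `toy_encoder`, `connector`, and `build_llm_input` are hypothetical, and the dimensions are far smaller than in any real model.

```python
import random

def linear(x, weight):
    """Multiply one vector (length d_in) by a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(d_out)]

def toy_encoder(num_patches, d_enc, rng):
    """Stand-in for a pretrained modality encoder: one feature vector per patch."""
    return [[rng.uniform(-1, 1) for _ in range(d_enc)] for _ in range(num_patches)]

def connector(features, weight):
    """Project encoder features (width d_enc) into the LLM embedding space (width d_llm)."""
    return [linear(f, weight) for f in features]

def build_llm_input(text_embeddings, visual_embeddings):
    """Assumption: projected visual tokens are placed before the text tokens."""
    return visual_embeddings + text_embeddings

rng = random.Random(0)
d_enc, d_llm = 8, 16
patch_features = toy_encoder(num_patches=4, d_enc=d_enc, rng=rng)
proj = [[rng.uniform(-0.1, 0.1) for _ in range(d_llm)] for _ in range(d_enc)]
visual_tokens = connector(patch_features, proj)
text_tokens = [[0.0] * d_llm for _ in range(3)]  # placeholder text embeddings
sequence = build_llm_input(text_tokens, visual_tokens)
print(len(sequence), len(sequence[0]))  # 7 tokens, each of width 16
```

The LLM backbone would then run self-attention over `sequence` exactly as it does over a text-only input.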

Time complexity

…

n is the total token sequence length (text + visual); d is the model dimension. Quadratic complexity arises from the self-attention mechanism in the LLM. For high-resolution images or video, n can range from several thousand to tens of millions of tokens.

The modality encoder (e.g., ViT) adds complexity O(p² · d_enc) dependent on the number of image patches p. A Q-Former connector reduces n_v visual tokens to a fixed number q of query tokens, thereby lowering the effective n.
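The token reduction performed by a Q-Former-style connector can be sketched as a single cross-attention step in which a fixed set of query vectors attends over all visual tokens. This is a simplification under stated assumptions: a real Q-Former is a multi-layer Transformer with learned projections, whereas the sketch below uses random queries, one head, and no projections. BLIP-2 uses 32 queries; 4 are used here for brevity.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, visual_tokens):
    """Each query attends over all visual tokens (single head, no projections).

    Returns one output vector per query, so the output length equals
    len(queries) regardless of how many visual tokens come in.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * vi for qi, vi in zip(q, v)) / math.sqrt(d)
                  for v in visual_tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[k] for w, v in zip(weights, visual_tokens))
                        for k in range(d)])
    return outputs

rng = random.Random(0)
d = 8
num_queries = 4  # q: fixed number of query tokens (hypothetical value)
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_queries)]

for n_v in (16, 256):
    visual = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_v)]
    compressed = cross_attend(queries, visual)
    print(n_v, "->", len(compressed))
```

Whether 16 or 256 visual tokens go in, the LLM receives `num_queries` tokens, which is what lowers the effective n.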

Memory complexity

…

n is the token sequence length; P is the total parameter count (encoder + connector + LLM). Attention maps grow quadratically with n, while weight memory grows linearly with the parameter counts of the component models.

In practice, the dominant memory costs are the LLM backbone weights (ranging from a few to hundreds of billions of parameters) and the attention activations for long visual sequences.
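A rough back-of-the-envelope estimate of these two terms is sketched below. It assumes 2 bytes per value (fp16/bf16), 32 attention heads, and a naively materialized attention score matrix; fused attention kernels avoid storing the full matrix, so the attention figure is an upper bound for one layer rather than a measurement. The function names are hypothetical.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Memory for model weights; 2 bytes per parameter corresponds to fp16/bf16."""
    return num_params * bytes_per_param

def attention_map_bytes(seq_len, num_heads, bytes_per_value=2):
    """Memory for one layer's full attention score matrix (heads x n x n)."""
    return num_heads * seq_len * seq_len * bytes_per_value

GIB = 1024 ** 3
print(f"7B weights (fp16): {weight_bytes(7_000_000_000) / GIB:.1f} GiB")
for n in (2_048, 16_384):
    per_layer = attention_map_bytes(n, num_heads=32) / GIB
    print(f"n={n}: {per_layer:.2f} GiB of attention scores per layer")
```

Growing the sequence 8x (2,048 to 16,384 tokens) multiplies the attention term by 64, while the weight term is unchanged.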

Bottleneck: Quadratic self-attention complexity for visual tokens

The primary computational bottleneck is the quadratic complexity of the self-attention mechanism with respect to total token count. Modality tokens (from images, video, audio) dramatically inflate the input sequence: a standard image via ViT yields hundreds of tokens, while video can generate tens of millions of tokens.
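As a worked example of this effect, the sketch below compares self-attention cost with and without visual tokens. It assumes cost proportional to n² and ignores the terms that grow linearly in n. The prompt length of 512 tokens and the figure of 576 tokens per image (a 336×336 input with 14-pixel patches) are illustrative assumptions.

```python
def relative_attention_cost(text_tokens, visual_tokens):
    """Ratio of self-attention cost with visual tokens to the text-only cost.

    Assumes cost proportional to n**2, ignoring the linear-in-n terms
    (MLP blocks, projections) that also contribute in practice.
    """
    n_total = text_tokens + visual_tokens
    return (n_total / text_tokens) ** 2

print(relative_attention_cost(512, 576))      # one image: 4.515625
print(relative_attention_cost(512, 576 * 8))  # eight video frames: 100.0
```

Eight frames add nine times the prompt's token count, yet raise the attention cost a hundredfold.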

Parallelism

Partially parallel

The modality encoder and connector can be processed fully in parallel during both training and the prefill phase of inference, whereas autoregressive decoding of the output text remains sequential, one token at a time. Tensor parallelism and pipeline parallelism are widely used for large LLM backbones in multi-GPU environments.

Paradigm

Dense

All paths active

Standard MLLMs process all tokens (text and visual) through full self-attention layers of the LLM backbone. Sparse or MoE variants appear in specific implementations (e.g., MoE-LLaVA) but are not a defining feature of the MLLM paradigm.

Modality Encoder Type

Critical

CLIP ViT-L/14: Most commonly used visual encoder in LLaVA, BLIP-2, and many other models.
CLIP ViT-H/14: Larger variant used in newer models for improved representation quality.
SigLIP: Alternative visual encoder used in PaliGemma and newer LLaVA variants.

Architecture and pretrained weights of the modality encoder. Affects quality of visual/audio representations and knowledge transfer.

Modality Connector Type

Critical

Linear projection / MLP: Used in LLaVA (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5). A simple, efficient projection without token count compression.
Q-Former: Used in BLIP-2. Compresses visual tokens into a fixed number of query tokens.
Gated cross-attention: Used in Flamingo. Inserted between frozen LLM layers.

Architecture of the modality interface module. Determines how modality tokens are aligned and compressed before the LLM.
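The gating idea behind the Flamingo-style option can be sketched as a residual update scaled by tanh of a learnable scalar initialized at zero. The cross-attention output is a hypothetical placeholder here, and `gated_residual` is an illustrative name rather than a function from any library.

```python
import math

def gated_residual(hidden, cross_attn_out, alpha):
    """Add cross-attention output to the hidden state through a tanh gate.

    With alpha = 0 the gate is closed (tanh(0) = 0), so the frozen LLM's
    hidden state passes through unchanged at the start of training.
    """
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [0.5, -1.0, 2.0]
visual_update = [1.0, 1.0, 1.0]  # hypothetical cross-attention output
print(gated_residual(hidden, visual_update, alpha=0.0))  # identical to hidden
print(gated_residual(hidden, visual_update, alpha=1.0))  # visual signal mixed in
```

Starting with the gate closed means the inserted layers initially leave the pretrained LLM's behavior untouched, and visual information is blended in only as `alpha` is learned.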

LLM backbone

Critical

Vicuna-7B / 13B: Used in LLaVA.
Chinchilla-70B: Used in Flamingo.
OPT / Flan-T5: Used in BLIP-2.

Choice of pretrained language model as the MLLM core. Determines reasoning capabilities, generation quality, and model scale.

Input Image Resolution

Standard

224×224: Standard resolution for CLIP ViT. Produces 256 visual tokens with 14×14-pixel patches (a 16×16 grid, as in ViT-L/14) or 196 tokens with 16×16-pixel patches (a 14×14 grid).
336×336 – 448×448: Higher resolution for improved detail recognition and OCR.
Dynamic resolution (tile-based): Dynamic image tiling for high-resolution processing (e.g., in InternVL, LLaVA-HD).

Input image pixel resolution. Directly determines the number of visual tokens and thus computational cost.
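The relationship between resolution and token count is simple arithmetic: a square image of side s with patch size p yields (s / p)² patch tokens. The sketch below assumes square inputs, no class token, and, for the tiled case, one extra low-resolution thumbnail tile. The thumbnail convention and both function names are assumptions for illustration.

```python
def visual_token_count(image_size, patch_size):
    """Patch tokens for a square image (excluding any class token)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

def tiled_token_count(num_tiles, tile_size, patch_size, include_thumbnail=True):
    """Tokens for tile-based processing: each tile is encoded separately."""
    tiles = num_tiles + (1 if include_thumbnail else 0)
    return tiles * visual_token_count(tile_size, patch_size)

print(224, 16, visual_token_count(224, 16))  # 14 x 14 grid -> 196
for size in (224, 336, 448):
    print(size, 14, visual_token_count(size, 14))  # 256, 576, 1024
print("4 tiles + thumbnail:", tiled_token_count(4, 336, 14))  # 2880
```

Doubling the side length from 224 to 448 quadruples the token count, and tiling multiplies it again by the number of tiles.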

Training Strategy

Standard

Freeze encoder + LLM, train connector only: The most common alignment pretraining stage (e.g., BLIP-2 stage 1, LLaVA stage 1).
Freeze encoder, train connector + LLM: Instruction tuning with partial model unfreezing (e.g., LLaVA stage 2).
Parameter-efficient fine-tuning (PEFT / LoRA): Adapts components through small trainable adapter weights instead of updating all parameters, at reduced computational cost.

Determines which MLLM components are frozen vs. trained at each stage (alignment pretraining, instruction tuning, alignment tuning).
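One way to make the freeze/train decision explicit is a small per-stage configuration. The sketch below encodes the two-stage recipe listed above (connector only, then connector plus LLM). The `StageConfig` class and the stage names are hypothetical and not taken from any particular codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    name: str
    train_encoder: bool
    train_connector: bool
    train_llm: bool

    def trainable(self):
        """Return the component names that receive gradient updates."""
        flags = {
            "encoder": self.train_encoder,
            "connector": self.train_connector,
            "llm": self.train_llm,
        }
        return [part for part, enabled in flags.items() if enabled]

STAGES = [
    StageConfig("alignment_pretraining", False, True, False),
    StageConfig("instruction_tuning", False, True, True),
]

for stage in STAGES:
    print(stage.name, "->", stage.trainable())
```

In a training loop, such a config would typically decide which parameter groups are handed to the optimizer at each stage.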

Strengths

More natural user interaction
Broader range of inputs and outputs
Better understanding of documents, charts, and interfaces
Supports integration of speech, image, and text within a single system

Limitations

Higher computational and memory costs
Higher training and evaluation complexity
Harder to ensure quality across all modalities
Uneven quality across modality types

Computational characteristics

Requires additional modality encoders beyond the LLM itself
Typically requires more memory and compute than a text-only model
Latency increases with the number and complexity of modalities

Evaluating multimodal LLMs requires benchmarks covering more than text, such as VQA, OCR, chart understanding, document understanding, audio understanding, and multimodal reasoning tasks.

Common pitfalls

Visual token explosion at high resolutions

CRITICAL

Increasing input image resolution or video length rapidly inflates the visual token count: it grows quadratically with image side length and linearly with the number of video frames. LLM self-attention cost in turn grows quadratically with that token count. Ignoring this causes GPU out-of-memory errors or drastic slowdowns in training and inference.

Use token-compressing connectors (Q-Former, Perceiver Resampler), token pruning/merging, dynamic image tiling with a limited patch count, or sparse attention.
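As a minimal illustration of token merging, the sketch below averages each 2×2 block of neighboring patch tokens, cutting the count by 4x. Plain average pooling is used here as a simple stand-in; published pruning and merging methods use more selective criteria. The function name is hypothetical.

```python
def pool_tokens_2x2(tokens, grid_size):
    """Average each 2x2 block of a grid_size x grid_size token grid.

    tokens is a row-major list of feature vectors; the result has
    (grid_size // 2) ** 2 tokens, a 4x reduction.
    """
    if grid_size % 2 != 0 or len(tokens) != grid_size * grid_size:
        raise ValueError("expected an even, square token grid")
    dim = len(tokens[0])
    pooled = []
    for row in range(0, grid_size, 2):
        for col in range(0, grid_size, 2):
            block = [tokens[(row + dr) * grid_size + (col + dc)]
                     for dr in (0, 1) for dc in (0, 1)]
            pooled.append([sum(vec[k] for vec in block) / 4 for k in range(dim)])
    return pooled

grid = 24  # 336 / 14 patches per side
tokens = [[float(i)] for i in range(grid * grid)]  # toy 1-dimensional features
merged = pool_tokens_2x2(tokens, grid)
print(len(tokens), "->", len(merged))  # 576 -> 144
```

Because attention cost is quadratic in sequence length, a 4x cut in visual tokens reduces the visual-to-visual attention cost by roughly 16x.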

Embedding space misalignment between encoder and LLM

HIGH

The modality encoder and LLM operate in different embedding spaces. An insufficiently trained connector leads to poor visual information transfer: the model ignores visual cues or hallucinates.

Pretrain the connector on a large image-text pair dataset before instruction tuning; use established connector architectures (Q-Former, MLP).

Catastrophic forgetting of LLM capabilities during fine-tuning

HIGH

Aggressive fine-tuning of the LLM backbone during MLLM training can cause loss of original language capabilities (forgetting general knowledge, degraded text generation).

Freeze the LLM during alignment pretraining; apply PEFT (LoRA, QLoRA) instead of full fine-tuning; mix text and multimodal data during training.
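The parameter savings from LoRA follow directly from its low-rank form: instead of updating a d_out × d_in weight matrix W, it trains two factors B (d_out × r) and A (r × d_in) and adds their product to the frozen W. The sketch below counts trainable parameters for a single square projection; the hidden size of 4096 and rank of 8 are illustrative assumptions.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for a rank-r update B @ A (A: r x d_in, B: d_out x r)."""
    return rank * (d_in + d_out)

d = 4096  # hypothetical hidden size of one projection matrix
full = full_finetune_params(d, d)
lora = lora_params(d, d, rank=8)
print(full, lora, f"{lora / full:.4%}")  # 16777216 65536 0.3906%
```

Keeping the original weights frozen and training well under 1% as many parameters per matrix limits how far the backbone can drift from its pretrained language behavior.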

Object hallucination – nonexistent objects in output

HIGH

The MLLM generates descriptions of objects not present in the input image. This is caused by an imbalance between the LLM's strong linguistic priors and the weaker visual signal, especially when the image lacks elements the LLM expects.

Use instruction-following data with negative examples; tune visual signal strength; apply specialized modality alignment losses; evaluate on the POPE and HallusionBench benchmarks.
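Polling-style hallucination benchmarks such as POPE ask yes/no questions about whether an object is present, so scoring reduces to binary classification metrics. The sketch below is a generic scorer under that assumption, not the benchmark's official evaluation script; the example predictions and labels are made up.

```python
def yes_no_metrics(predictions, labels):
    """Binary metrics for yes/no object-presence questions ("yes" is positive)."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }

preds = ["yes", "yes", "yes", "no", "yes", "no"]   # made-up model answers
labels = ["yes", "no", "yes", "no", "no", "no"]    # made-up ground truth
print(yes_no_metrics(preds, labels))
```

A high yes-ratio combined with low precision, as in this toy example, is the typical signature of a model that affirms objects regardless of the image.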

Reference implementations

LLaVA – Large Language and Vision Assistant

official

Python · Haotian Liu et al. (UW-Madison, Microsoft Research, Columbia University)

OpenFlamingo – open-source implementation of Flamingo

Python · ML Foundations

BLIP-2 – Salesforce Research

official

Python · Salesforce Research

2021

CLIP (Radford et al., OpenAI) – contrastive image-text alignment as a foundation

breakthrough

CLIP introduced effective vision-text alignment via large-scale contrastive learning, providing a strong visual encoder widely used in subsequent MLLMs.

Learning Transferable Visual Models From Natural Language Supervision

2022

Flamingo (Alayrac et al., DeepMind / NeurIPS 2022) – landmark MLLM with few-shot learning

breakthrough

Flamingo established an MLLM architecture in which gated cross-attention layers are interleaved with the layers of a frozen LLM (Chinchilla-70B), with a Perceiver Resampler as the connector and training on interleaved image-text sequences. It demonstrated strong few-shot capabilities across 16 visual tasks.

Flamingo: a Visual Language Model for Few-Shot Learning

2023

BLIP-2 (Li et al., Salesforce) and LLaVA (Liu et al.) – efficient and open MLLMs

breakthrough

BLIP-2 introduced Q-Former as an efficient connector compressing visual tokens to a fixed count, enabling MLLM training with far fewer trainable parameters than Flamingo. LLaVA showed that a simple linear projection with GPT-4-generated instruction-following data is sufficient for strong performance.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

2023

GPT-4V (OpenAI) and Gemini (Google) – first-class commercial MLLMs

breakthrough

OpenAI and Google released closed-source MLLMs capable of advanced image-text processing in a unified system, setting new quality standards and driving broad industrial adoption of the MLLM paradigm.

GPT-4 Technical Report

2024

Expansion to multiple modalities and visual token compression as a primary research focus

Research expanded to audio, video, and high-resolution document processing, while quadratic visual token complexity became a leading research problem. Token pruning, Q-Former-based compression, and dynamic resolution methods gained prominence.

GPU Tensor Cores

PRIMARY

An MLLM is built largely from Transformer components (visual encoder, LLM backbone, and often the connector), all of which rely on matrix multiplication (GEMM), which is accelerated by GPU Tensor Cores (NVIDIA A100, H100). In practice, training and inference of MLLMs require GPUs with large HBM capacity (40–80 GB).

Training large MLLMs (>7B parameters) requires multi-GPU setups with Tensor Parallelism or Pipeline Parallelism. Inference for 7–13B models is feasible on 24–40 GB GPUs with 4-bit quantization.
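The effect of quantization on weight memory can be estimated with simple arithmetic. The sketch below counts weights only and assumes idealized storage of 2, 1, and 0.5 bytes per parameter; real deployments also need memory for the KV cache, activations, the modality encoder, and quantization metadata such as scales.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_footprint_gb(num_params, precision):
    """Approximate weight memory in GB (10**9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for params in (7e9, 13e9):
    row = {p: weight_footprint_gb(params, p) for p in BYTES_PER_PARAM}
    print(f"{params / 1e9:.0f}B:", row)
```

Under these assumptions a 13B backbone drops from about 26 GB in fp16 to about 6.5 GB at 4 bits, which leaves room on a 24 GB GPU for the remaining components and runtime state.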

TPU

GOOD

TPU v4/v5 are used to train MLLMs at Google (Gemini). They offer high throughput for GEMM operations and efficient scaling via TPU Pods.

TPU-friendly implementations require specific frameworks (JAX/XLA). Flamingo and Gemini were trained on TPUs.

Related AI models

Claude

Claude 3.7 Sonnet

GPT

GPT-5.4 Pro

GPT-5.5 Pro

Gemini

Gemini 3

Gemini 3 Flash

Gemini 3.1 Deep Think

Gemini 3.1 Flash-Lite

Gemini 3.1 Pro

Gemini Robotics-ER 1.6

Gemma

Gemma 4

Grok

Grok 4

Grok 4.1

Grok-2

Mistral

Mistral Large 3

Muse

Muse Spark

Title	Description	Publisher	Type
What is a Multimodal LLM (MLLM)?	Definition and practical overview of MLLMs as models processing multiple modalities.	IBM	article
A Comprehensive Survey and Guide to Multimodal Large Language Models	Survey describing the architecture, applications, and evolution of multimodal LLMs.	arXiv	scientific article
