A typical Multimodal LLM combines a base language model with additional modality encoders — for image or audio, for example — and a projection/alignment layer that maps representations from different data types into a shared space. This allows the model to understand relationships between text, images, and other signals, and to generate responses spanning more than one data type.
Classic text-only LLMs have limited ability to understand information conveyed through images, audio, documents, and other data modalities. Multimodal LLMs address this by integrating multiple input and output types into a single system.
The modality encoder converts raw non-text data (images, audio, video) into structured, semantically meaningful token embeddings. It is typically pretrained independently (e.g., CLIP ViT for images).
The connector bridges the modality encoder's embedding space and the LLM's input space, and is responsible for cross-modal alignment.
The LLM backbone is the architectural core: a pretrained decoder-only large language model that serves as the reasoning and text generation module. It is often kept frozen or only lightly fine-tuned during MLLM training.
An optional generator module produces non-text outputs (e.g., images, audio) from the LLM's output representations. Not all MLLM architectures include one.
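The data flow through the components described above can be sketched in a few lines. This is a minimal illustration, not any specific model's implementation: the function names, the tiny dimensions, and the random stand-in for a pretrained encoder are all assumptions, and the connector is taken to be a single linear projection.

```python
import random

def encode_image(num_patches, d_vision, seed=0):
    """Stand-in for a pretrained vision encoder: one d_vision vector per patch."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_vision)] for _ in range(num_patches)]

def make_linear_connector(d_vision, d_llm, seed=1):
    """Hypothetical linear connector: a d_vision x d_llm weight matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_llm)] for _ in range(d_vision)]

def project(features, weight):
    """Map each encoder vector into the LLM embedding space (features @ weight)."""
    d_llm = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(vec)) for j in range(d_llm)]
        for vec in features
    ]

def build_llm_input(visual_tokens, text_embeddings):
    """Prepend the projected visual tokens to the embedded text tokens."""
    return visual_tokens + text_embeddings

patches = encode_image(num_patches=16, d_vision=8)
connector = make_linear_connector(d_vision=8, d_llm=12)
visual_tokens = project(patches, connector)
text_embeddings = [[0.0] * 12 for _ in range(5)]  # placeholder for 5 embedded text tokens
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(len(llm_input), len(llm_input[0]))  # 21 12
```

The LLM backbone then processes the combined sequence like any other input; only the connector has to learn the mapping between the two embedding spaces.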
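Increasing input image resolution or video length rapidly inflates the visual token count: for a patch-based encoder, tokens grow quadratically with image side length and linearly with the number of video frames. LLM self-attention cost in turn grows quadratically with the total token count. Ignoring this causes GPU out-of-memory errors or drastic slowdowns in training and inference.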
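A short calculation makes the scaling concrete. The sketch below assumes a ViT-style encoder that emits one token per non-overlapping patch, with a patch size of 14 pixels chosen for illustration; it ignores special tokens and any token compression in the connector.

```python
def visual_token_count(height, width, patch_size=14):
    """Number of patch tokens for an image, ignoring any special tokens."""
    return (height // patch_size) * (width // patch_size)

def relative_attention_cost(n_tokens, baseline_tokens):
    """Self-attention cost relative to a baseline, assuming cost scales as n^2."""
    return (n_tokens / baseline_tokens) ** 2

base = visual_token_count(224, 224)
for side in (224, 448, 896):
    n = visual_token_count(side, side)
    print(side, n, relative_attention_cost(n, base))
# 224 256 1.0
# 448 1024 16.0
# 896 4096 256.0
```

Doubling the side length quadruples the token count and multiplies the attention cost over those tokens by sixteen.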
The modality encoder and LLM operate in different embedding spaces. An insufficiently trained connector leads to poor visual information transfer — the model ignores visual cues or hallucinates.
Aggressive fine-tuning of the LLM backbone during MLLM training can cause loss of original language capabilities (forgetting general knowledge, degraded text generation).
The MLLM describes objects that are not present in the input image (object hallucination). This is caused by an imbalance between the LLM's strong linguistic priors and the weaker visual signal, and is most likely when the image lacks elements the LLM expects from context.
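One simple way to quantify this failure mode is to compare the objects a model mentions against the objects annotated in the image. The helper below is a hypothetical sketch: it assumes the object mentions have already been extracted from the generated text and that ground-truth annotations are available, and it is not a standard benchmark implementation.

```python
def object_hallucination_rate(mentioned_objects, objects_in_image):
    """Fraction of distinct mentioned objects that are absent from the image."""
    mentioned = {o.lower() for o in mentioned_objects}
    present = {o.lower() for o in objects_in_image}
    if not mentioned:
        return 0.0
    hallucinated = mentioned - present
    return len(hallucinated) / len(mentioned)

# The model mentions a bench that is not in the annotated image.
rate = object_hallucination_rate(
    ["dog", "frisbee", "bench"],
    ["dog", "frisbee", "tree"],
)
print(round(rate, 3))  # 0.333
```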
CLIP introduced effective vision-text alignment via large-scale contrastive learning, providing a strong visual encoder widely used in subsequent MLLMs.
Flamingo introduced an influential MLLM architecture: gated cross-attention layers interleaved into a frozen LLM (Chinchilla-70B), a Perceiver Resampler as connector, and training on interleaved image-text sequences. It demonstrated strong few-shot capabilities across 16 image and video tasks.
BLIP-2 introduced Q-Former as an efficient connector compressing visual tokens to a fixed count, enabling MLLM training with far fewer trainable parameters than Flamingo. LLaVA showed that a simple linear projection with GPT-4-generated instruction-following data is sufficient for strong performance.
OpenAI and Google released closed-source MLLMs capable of advanced image-text processing in a unified system, setting new quality standards and driving broad industrial adoption of the MLLM paradigm.
Research expanded to audio, video, and high-resolution document processing, while the quadratic cost of attention over large numbers of visual tokens became a leading research problem. Token pruning, Q-Former-based compression, and dynamic resolution methods gained prominence.
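Time complexity: O(n² · d) per self-attention layer, where n is the total number of tokens (text plus other modalities) and d is the hidden dimension. Space complexity: O(n² + P), where the n² term is the attention score matrix and P is the number of model parameters.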
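The complexity terms can be turned into a rough per-layer estimate. This sketch makes several simplifying assumptions: it counts only the two n x n x d matrix products in attention (scores and weighted sum over values), ignores the linear projections and softmax, and assumes the full score matrix is materialized in 2-byte precision, which fused attention kernels avoid.

```python
def attention_cost(n_tokens, d_model, bytes_per_value=2):
    """Rough per-layer cost of full self-attention.

    Returns (flops, score_bytes): FLOPs for the score and value products
    (2 FLOPs per multiply-add) and bytes for one n x n score matrix.
    """
    flops = 2 * (2 * n_tokens * n_tokens * d_model)
    score_bytes = n_tokens * n_tokens * bytes_per_value
    return flops, score_bytes

flops_1k, bytes_1k = attention_cost(1000, 4096)
flops_2k, bytes_2k = attention_cost(2000, 4096)
print(flops_1k, bytes_1k)                          # 16384000000 2000000
print(flops_2k // flops_1k, bytes_2k // bytes_1k)  # 4 4
```

Doubling the sequence length quadruples both the compute and the score-matrix memory, which is why the number of non-text tokens dominates cost at high resolution.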
Evaluating multimodal LLMs requires benchmarks covering more than text, such as VQA, OCR, chart understanding, document understanding, audio understanding, and multimodal reasoning tasks.
The primary computational bottleneck is the quadratic complexity of the self-attention mechanism with respect to total token count. Non-text tokens (from images, video, audio) dramatically inflate the input sequence: a standard image passed through a ViT yields hundreds of tokens, while a video can yield anywhere from thousands to millions of tokens depending on its length and how densely frames are sampled.
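The token budget for video follows directly from the per-frame token count. The numbers below are illustrative assumptions (sampling rate and tokens per frame vary widely between models), and the function name is hypothetical.

```python
def video_token_count(duration_s, sampled_fps, tokens_per_frame):
    """Total visual tokens for a video, assuming uniform frame sampling."""
    frames = int(duration_s * sampled_fps)
    return frames * tokens_per_frame

# One-minute clip, 1 frame per second, 256 tokens per frame.
print(video_token_count(60, 1, 256))    # 15360
# One-hour video, 1 frame per second, 576 tokens per frame.
print(video_token_count(3600, 1, 576))  # 2073600
```

Even at one sampled frame per second, an hour of video exceeds two million tokens without compression, which motivates frame subsampling and token reduction in the connector.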
Architecture and pretrained weights of the modality encoder. Affects quality of visual/audio representations and knowledge transfer.
Architecture of the modality interface module. Determines how modality tokens are aligned and compressed before the LLM.
Choice of pretrained language model as the MLLM core. Determines reasoning capabilities, generation quality, and model scale.
Input image pixel resolution. Directly determines the number of visual tokens and thus computational cost.
The training strategy determines which MLLM components are frozen and which are trained at each stage (alignment pretraining, instruction tuning, alignment tuning).
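A freezing schedule of this kind can be written down as a small configuration. The particular choices below (encoder always frozen, connector always trained, LLM unfrozen after the first stage) are one common pattern assumed for illustration; actual recipes differ between models, and all names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StagePlan:
    """Which components receive gradient updates in a training stage."""
    encoder: bool
    connector: bool
    llm: bool

# Assumed schedule for illustration only; real recipes differ between models.
SCHEDULE = {
    "alignment_pretraining": StagePlan(encoder=False, connector=True, llm=False),
    "instruction_tuning": StagePlan(encoder=False, connector=True, llm=True),
    "alignment_tuning": StagePlan(encoder=False, connector=True, llm=True),
}

def trainable_components(stage):
    """Return the names of the components that are unfrozen in the given stage."""
    plan = SCHEDULE[stage]
    return [name for name in ("encoder", "connector", "llm") if getattr(plan, name)]

for stage in SCHEDULE:
    print(stage, trainable_components(stage))
# alignment_pretraining ['connector']
# instruction_tuning ['connector', 'llm']
# alignment_tuning ['connector', 'llm']
```

In a real training loop, the same plan would decide which parameter groups have gradients enabled before each stage starts.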
Standard MLLMs process all tokens (text and visual) through the full self-attention layers of the LLM backbone. Sparse or mixture-of-experts (MoE) variants appear in specific implementations (e.g., MoE-LLaVA) but are not a defining feature of the MLLM paradigm.
The modality encoder and connector are not autoregressive, so they process all patches or frames of an input in a single parallel pass, both during training and during the prefill phase of inference. Tensor parallelism and pipeline parallelism are widely used to split large LLM backbones across multiple GPUs.
An MLLM's main components (visual encoder, connector, LLM backbone) are Transformers or small MLPs, all dominated by matrix multiplication (GEMM), which GPU Tensor Cores accelerate (e.g., NVIDIA A100, H100). In practice, training and inference of MLLMs require GPUs with large HBM capacity (40–80 GB).
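A back-of-the-envelope check shows why HBM capacity matters. The sketch below estimates memory for model weights only, assuming 2 bytes per parameter (fp16 or bf16) and decimal gigabytes; it deliberately ignores activations, the KV cache, and optimizer state, all of which add substantially to the real footprint.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def fits_on_gpu(num_params, gpu_memory_gb, bytes_per_param=2):
    """True if the weights alone fit in the given GPU memory."""
    return weight_memory_gb(num_params, bytes_per_param) <= gpu_memory_gb

print(weight_memory_gb(7_000_000_000))    # 14.0
print(weight_memory_gb(70_000_000_000))   # 140.0
print(fits_on_gpu(7_000_000_000, 80))     # True
print(fits_on_gpu(70_000_000_000, 80))    # False
```

A 70B-parameter backbone does not fit on a single 80 GB device even before activations are counted, so it has to be sharded across several GPUs.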
Google trains its MLLMs (Gemini) on TPU v4 and v5 accelerators, which offer high throughput for GEMM operations and scale efficiently via TPU Pods.