Unified Multimodal Tokenizer
Creates a unified input for the shared Transformer backbone, enabling sequential processing of data from multiple modalities.
Module that converts data from all modalities into a shared token space. Images are typically quantized with a vector quantizer (for example, a VQ-VAE), producing discrete visual tokens; text is tokenized with a standard text tokenizer; audio is converted to spectrograms or to discrete acoustic tokens.
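One common way to realize a shared token space, sketched here under stated assumptions rather than taken from this description, is to give each modality its own contiguous ID range inside a single vocabulary and to wrap non-text segments in boundary tokens. The vocabulary sizes, the special tokens (`<boi>`, `<eoi>`, `<boa>`, `<eoa>`), and the helper names `build_ranges`, `to_shared`, and `unify` are all hypothetical. The per-modality tokenizers themselves (VQ-VAE encoder, text tokenizer, acoustic tokenizer) are assumed to have already produced local integer IDs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB = 32000      # local IDs from an assumed text tokenizer
IMAGE_CODEBOOK = 8192   # local IDs from an assumed VQ codebook
AUDIO_CODEBOOK = 1024   # local IDs from an assumed acoustic tokenizer

# Hypothetical boundary tokens occupying the first shared IDs.
SPECIAL = {"<boi>": 0, "<eoi>": 1, "<boa>": 2, "<eoa>": 3}


@dataclass(frozen=True)
class ModalityRange:
    offset: int  # first shared ID assigned to this modality
    size: int    # number of local IDs the modality can emit


def build_ranges(sizes: Dict[str, int], num_special: int) -> Dict[str, ModalityRange]:
    """Lay out modality ranges back to back after the special tokens."""
    ranges: Dict[str, ModalityRange] = {}
    offset = num_special
    for name, size in sizes.items():
        ranges[name] = ModalityRange(offset, size)
        offset += size
    return ranges


RANGES = build_ranges(
    {"text": TEXT_VOCAB, "image": IMAGE_CODEBOOK, "audio": AUDIO_CODEBOOK},
    num_special=len(SPECIAL),
)
SHARED_VOCAB_SIZE = len(SPECIAL) + sum(r.size for r in RANGES.values())

# Boundary tokens placed around non-text segments; text is left unwrapped.
WRAPPERS = {"image": ("<boi>", "<eoi>"), "audio": ("<boa>", "<eoa>")}


def to_shared(modality: str, local_ids: List[int]) -> List[int]:
    """Shift local token IDs into the modality's slice of the shared vocabulary."""
    r = RANGES[modality]
    for i in local_ids:
        if not 0 <= i < r.size:
            raise ValueError(f"{modality} token {i} is outside [0, {r.size})")
    return [r.offset + i for i in local_ids]


def unify(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Concatenate (modality, local_ids) segments into one shared-ID sequence."""
    out: List[int] = []
    for modality, local_ids in segments:
        shared = to_shared(modality, local_ids)
        if modality in WRAPPERS:
            begin, end = WRAPPERS[modality]
            shared = [SPECIAL[begin], *shared, SPECIAL[end]]
        out.extend(shared)
    return out


sequence = unify([("text", [5, 6]), ("image", [0, 8191]), ("audio", [3])])
print(sequence)  # [9, 10, 0, 32004, 40195, 1, 2, 40199, 3]
```

With this layout the backbone needs only one embedding table of size `SHARED_VOCAB_SIZE` (41220 under the sizes assumed here), and the modality of any token can be recovered from the range its ID falls in. Audio kept as continuous spectrograms, rather than discrete acoustic tokens, would not fit this scheme and would need a separate projection into the embedding space.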