Robots Atlas

Speech-to-speech AI

A class of architectures enabling direct speech-to-speech processing — either by replacing the cascaded STT→LLM→TTS pipeline with a single end-to-end model operating on audio representations, or by tightly integrating pipeline components to minimize latency and preserve paralinguistic features (prosody, emotion, speaker characteristics).

Category
Abstraction level
Operation level
Next-generation voice assistants · Realtime voice agents · Voice-based customer support · Translation and voice conversations · Hands-free interfaces

The model takes audio as input, analyzes its content and paralinguistic features, then generates a response directly as audio output. The native speech-to-speech variant operates as a single multimodal model, while the pipeline variant chains multiple components: ASR, LLM, and TTS.

Traditional voice pipelines increase latency and can lose information carried in speech, such as emotion, intent, accent, and prosodic nuances. Speech-to-speech AI reduces this problem by handling voice input and output directly.

01

Speech Encoder

Extracts semantic and paralinguistic representations from the input speech signal.

Modular

Module converting input audio signal (raw samples or mel spectrograms) into latent representations used by downstream components. In cascade architectures the encoder role is played by the ASR model producing text tokens. In end-to-end architectures the encoder processes audio into continuous representations preserving paralinguistic information.

ASR/STT (Automatic Speech Recognition) · End-to-end continuous audio encoder
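As a concrete illustration of the encoder's first step, the sketch below frames a waveform and computes per-frame magnitude spectra with NumPy; a mel filterbank and learned encoder layers would sit on top of this in a real system. The 25 ms window and 10 ms hop are assumed typical values, not taken from any specific model.

```python
import numpy as np

def frame_signal(audio: np.ndarray, sr: int = 16000,
                 win_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Slice a mono waveform into overlapping, windowed analysis frames.
    Assumes len(audio) >= one window."""
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + max(0, len(audio) - win) // hop
    frames = np.stack([audio[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hanning(win)  # taper edges to reduce spectral leakage

def magnitude_spectrogram(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Per-frame magnitude spectrum; a mel filterbank would be applied on top."""
    frames = frame_signal(audio, sr)
    return np.abs(np.fft.rfft(frames, axis=1))

audio = np.random.randn(16000)       # 1 s of noise at 16 kHz
spec = magnitude_spectrogram(audio)  # shape: (frames, frequency bins)
```

One second of 16 kHz audio yields 98 frames of 201 frequency bins each; the downstream model consumes this time–frequency representation rather than raw samples.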
02

Language/Reasoning Module

Intent understanding, response content generation, and conversation context management.

Modular

Component processing input representations and generating response representations. In cascade architectures this is a text-operating LLM. In end-to-end architectures it may be an LLM conditioned on audio tokens/embeddings or a seq2seq model trained directly on audio pairs.
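The cascade variant can be sketched as three chained stages. The stubs below are placeholders, not a real API; the point is that each stage consumes only the previous stage's output, so only text survives between them.

```python
# Stub stages standing in for real ASR / LLM / TTS models; the function
# names and return values here are illustrative, not any library's API.
def asr(audio: bytes) -> str:
    return "what time is it"          # pretend transcription

def llm(text: str) -> str:
    return f"You asked: {text}."      # pretend response generation

def tts(text: str) -> bytes:
    return text.encode("utf-8")       # pretend synthesized audio

def cascade_s2s(audio: bytes) -> bytes:
    """STT -> LLM -> TTS: text is the only information passed between
    stages, which is why prosody and emotion are lost in this design."""
    return tts(llm(asr(audio)))
```

An end-to-end model replaces this composition with a single function from input audio representations to output audio representations.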

03

Speech Decoder / Synthesizer

Synthesizes speech output with natural-sounding prosody, optionally preserving voice identity or emotional tone.

Modular

Component generating output audio signal from response representations. In cascade architectures this is a TTS module operating on text. In end-to-end architectures it decodes latent representations into spectrograms or audio tokens, subsequently converted to output by a vocoder model.

TTS (Text-to-Speech) · Spectrogram decoder + vocoder
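A shape-level sketch of the decode path: response latents are projected to mel frames, then upsampled to a waveform. The 80 mel bins and 256-sample hop are assumed typical values, and `np.repeat` stands in for a neural vocoder (e.g., HiFi-GAN) that would synthesize real audio.

```python
import numpy as np

hop = 256                                     # assumed samples per mel frame
latents = np.random.randn(50, 512)            # 50 response frames, dim 512
mel = latents @ np.random.randn(512, 80)      # random stand-in projection to 80 mel bins
waveform = np.repeat(mel.mean(axis=1), hop)   # placeholder "vocoder" upsampling
# 50 frames * 256 samples/frame = 12800 samples (~0.8 s at 16 kHz)
```

The fixed frames-to-samples ratio is what lets streaming decoders emit audio chunk by chunk as latent frames arrive.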
04

Voice Activity Detection (VAD)

Manages turn-taking order, triggers input processing, and handles barge-in interruptions.

Modular

Component detecting the start and end of user speech in the audio stream, critical for natural turn-taking management in conversation. Modern VAD models (e.g., Silero VAD) process 30 ms audio frames in under 1 ms on CPU.
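A toy energy-based VAD over 30 ms frames illustrates the mechanism; the threshold is an arbitrary assumption, and learned models such as Silero VAD are far more robust to noise.

```python
import numpy as np

def energy_vad(audio: np.ndarray, sr: int = 16000,
               frame_ms: int = 30, threshold: float = 0.02) -> list[bool]:
    """Flag each 30 ms frame whose RMS energy exceeds a threshold.
    Illustrative only; real systems use learned models."""
    frame = sr * frame_ms // 1000            # 480 samples at 16 kHz
    flags = []
    for start in range(0, len(audio) - frame + 1, frame):
        rms = float(np.sqrt(np.mean(audio[start:start + frame] ** 2)))
        flags.append(rms > threshold)
    return flags

silence = np.zeros(4800)                     # 300 ms of silence
speech = 0.1 * np.sin(2 * np.pi * 220 * np.arange(4800) / 16000)
flags = energy_vad(np.concatenate([silence, speech]))
# first 10 frames flagged silent, last 10 flagged voiced
```

Turn-taking logic then reduces to watching these flags for speech onsets (trigger processing) and sustained silence (end of turn).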

Bottleneck: Pipeline latency (cascade) or audio token generation (end-to-end)

In cascade architectures the bottleneck is the sum of stage latencies: ASR + LLM + TTS, typically 2–4 seconds end-to-end. In end-to-end architectures the bottleneck is autoregressive audio token generation by the LLM (similar to text generation but with higher token volume per second of speech).
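A back-of-envelope latency budget makes the cascade bottleneck concrete. The per-stage numbers below are illustrative assumptions, not measurements of any particular system.

```python
# Assumed per-stage latencies for one conversational turn, in milliseconds.
cascade_ms = {
    "vad_endpointing": 300,   # waiting to confirm the user stopped speaking
    "asr": 400,
    "llm_first_token": 500,
    "llm_generation": 1200,
    "tts": 600,
}
sequential_total = sum(cascade_ms.values())   # 3000 ms without streaming

# With streaming, TTS can start on the LLM's first sentence, so perceived
# latency approaches time-to-first-audio rather than the full sum.
streamed_ms = (cascade_ms["vad_endpointing"] + cascade_ms["asr"]
               + cascade_ms["llm_first_token"]
               + 150)                         # assumed TTS time-to-first-chunk
```

Under these assumptions, streaming cuts perceived latency from 3.0 s to roughly 1.35 s, which is why the hybrid pipelines below overlap stages aggressively.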

Parallelism

Partially parallel

Multiple parallel requests from different users can be handled concurrently by separate model instances across devices. Audio streaming and stage overlapping can reduce perceived latency.

Paradigm

Dense

Stage dependent

Both cascaded and end-to-end S2S architectures use dense processing in each of their components. 'Stage-dependent' refers to the fact that different components (encoder, LLM, decoder) are activated sequentially during the processing of a single query. In full-duplex systems (e.g., Moshi), input and output are processed simultaneously by a model capable of concurrently listening and speaking.

Architecture type (cascade vs. end-to-end)

Critical
  • cascade (STT→LLM→TTS): modular, configurable, 2–4 s latency, best content control.
  • end-to-end (audio-to-audio): prosody preservation, latency <1 s, requires paired audio data.
  • hybrid (tightly coupled pipeline): STT/LLM/TTS with overlapping and streaming, latency 250–500 ms.

Fundamental choice between cascade architecture (STT→LLM→TTS) and direct end-to-end architecture. Determines latency, prosody preservation, debuggability, and training data requirements.

Duplex mode

Standard
  • half-duplex (turn-based): simpler to implement; the model waits for the user to finish speaking before responding.
  • full-duplex: more natural dialogue; the model can be interrupted while generating a response.

Whether the system supports half-duplex (turn-based, one side speaks at a time) or full-duplex (both sides can speak simultaneously, with barge-in capability). Full-duplex requires advanced VAD and barge-in mechanisms.
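Barge-in handling can be reduced to a small state machine: if VAD reports user speech while the agent is speaking, playback is cancelled. A minimal sketch; production systems also debounce VAD decisions and flush pending TTS buffers.

```python
class TurnManager:
    """Minimal half-duplex turn manager with barge-in support."""

    def __init__(self):
        self.state = "listening"          # "listening" or "speaking"
        self.cancelled = False

    def on_agent_response_ready(self):
        """Agent starts playing its synthesized response."""
        self.state = "speaking"
        self.cancelled = False

    def on_vad_speech_start(self):
        """User speech detected; interrupt playback if mid-response."""
        if self.state == "speaking":      # barge-in: user interrupted
            self.cancelled = True         # stop playback, drop remaining audio
        self.state = "listening"

tm = TurnManager()
tm.on_agent_response_ready()
tm.on_vad_speech_start()                  # user barges in mid-response
```

Full-duplex models such as Moshi dispense with explicit states like these by modeling both audio streams jointly, but external agents built on half-duplex APIs still need this kind of logic.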

Audio representation

Standard
  • mel spectrogram: used in Translatotron, where input and output are mel spectrogram sequences.
  • discrete audio tokens (codec): used in models such as Moshi and LLaMA-Omni; audio is tokenized via EnCodec or SoundStream.
  • continuous audio embeddings: continuous embeddings from a pretrained audio encoder (e.g., the Whisper encoder).

Audio input/output format used by the model: raw waveform, mel spectrogram, discrete audio tokens (from audio codec e.g., EnCodec, SoundStream), or continuous embeddings.
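The practical consequence of discrete audio tokens is token volume. A rough comparison, using assumed typical values (an EnCodec-style codec at 75 frames per second with 8 residual codebooks) rather than any one model's published spec:

```python
# Back-of-envelope token rates; all numbers below are assumed typical
# values, not specifications of a specific model or codec configuration.
codec_frame_rate_hz = 75        # codec frames per second of audio
num_codebooks = 8               # RVQ levels, one token each per frame
audio_tokens_per_sec = codec_frame_rate_hz * num_codebooks   # 600 tokens/s

words_per_sec_speech = 2.5      # rough conversational speaking rate
text_tokens_per_word = 1.3      # rough BPE expansion factor
text_tokens_per_sec = words_per_sec_speech * text_tokens_per_word  # ~3.25
```

Under these assumptions, one second of speech costs on the order of a hundred times more tokens as audio than as text, which is why audio token generation dominates end-to-end inference cost.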

Common pitfalls

Loss of paralinguistic information in cascaded architectures
HIGH

In STT→LLM→TTS cascade, speech-to-text conversion irreversibly removes prosody, emotion, speaking rate, disfluencies, and voice characteristics. TTS must reconstruct expression from scratch, losing naturalness and emotional context.

For applications requiring emotion/prosody preservation: use an end-to-end architecture, or augment the cascaded pipeline with an emotion analysis module running in parallel with STT and passing emotional metadata to TTS.

Error Propagation in Cascaded Architectures
HIGH

ASR errors propagate downstream and may be amplified by the LLM (incorrect intent understanding), ultimately producing an incorrect or inadequate spoken response. Error accumulation is particularly acute for key terms, proper names, and domain-specific jargon.

Use domain-specific STT models trained on in-domain data; add correction and validation mechanisms between pipeline stages; monitor WER (Word Error Rate) in production environments.
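WER can be computed with a word-level Levenshtein distance; a minimal self-contained version for production monitoring sanity checks:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) divided
    by the reference word count, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# "to" misheard as "two": 1 substitution over 3 reference words
print(wer("transfer to checking", "transfer two checking"))  # ≈ 0.33
```

Tracking WER per domain term (account numbers, product names) rather than only in aggregate surfaces exactly the errors that cascade worst through the LLM.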

Low availability of parallel audio data for end-to-end models
HIGH

Direct (end-to-end) models require audio pairs (input→output) that are significantly harder to collect than text or audio-transcript pairs. This is particularly problematic for low-resource languages and often results in poorer generalization of direct models on them.

Use TTS-generated synthetic data as target training examples (as in Translatotron); apply multitask learning with available text data as an auxiliary signal; use transfer learning from pretrained audio encoders.

Real-time network latency and audio quality issues
MEDIUM

Real-time S2S systems are sensitive to network jitter and connection quality. Standard telephone codecs (e.g., 8 kHz G.711 in PSTN/Twilio) degrade audio below the requirements of modern models, which are typically trained on 16 kHz audio. GPT-4o Realtime and Gemini Live achieve their best results with 16 kHz wideband audio but lose their advantage over cascaded systems on 8 kHz telephony audio.

Use wideband audio (G.722, 16 kHz or higher) where possible. For telephony deployments, consider a cascaded architecture with telephony-optimized STT components. Implement client-side audio buffering to smooth jitter.
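Matching the sample rate a 16 kHz model expects from 8 kHz telephony audio can be done with naive interpolation, but this does not restore the 4–8 kHz band the narrowband codec discarded; proper bandwidth extension needs a learned model. An illustrative sketch:

```python
import numpy as np

def upsample_8k_to_16k(audio_8k: np.ndarray) -> np.ndarray:
    """Naive 2x linear-interpolation upsampling of narrowband audio.
    This only matches the sample rate a 16 kHz model expects; the
    frequency content above 4 kHz remains missing."""
    n = len(audio_8k)
    x_new = np.linspace(0, n - 1, 2 * n - 1)     # interleave midpoints
    return np.interp(x_new, np.arange(n), audio_8k)

audio = np.array([0.0, 1.0, 0.0])
up = upsample_8k_to_16k(audio)
# -> [0.0, 0.5, 1.0, 0.5, 0.0]
```

This gap between "correct sample rate" and "restored bandwidth" is why telephony deployments often fare better with STT components trained directly on 8 kHz audio.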

2019

Translatotron (Google) – first end-to-end S2ST model without intermediate text representation

breakthrough

Jia et al. published Translatotron (arXiv:1904.06037), the first seq2seq model for direct speech-to-speech translation without an intermediate text representation. The model took source-language mel spectrograms as input and generated target-language mel spectrograms, and it demonstrated preservation of voice characteristics via a speaker encoder. Translation quality was lower than that of cascade systems, but feasibility was demonstrated.

2022

Translatotron 2 – end-to-end S2ST quality matching cascade systems

Google published Translatotron 2, achieving quality comparable to cascade systems on standard benchmarks while eliminating the voice cloning vulnerability present in Translatotron 1.

2024

Moshi (Kyutai Labs) – first publicly documented end-to-end S2S model for real-time full-duplex conversational dialogue

breakthrough

Kyutai Labs published Moshi (2024), a speech-text foundation model for real-time dialogue. The model supports full-duplex operation: it can listen and speak simultaneously. Kyutai released the model weights and technical documentation. Moshi achieves a theoretical latency of ~160 ms (~200 ms in practice).

2024

GPT-4o Realtime API (OpenAI) and LLaMA-Omni – commercialization of end-to-end S2S

breakthrough

OpenAI released GPT-4o with native speech-to-speech capabilities (May 2024 demo, October 2024 API). LLaMA-Omni (2024) demonstrated an open-source approach to end-to-end S2S based on LLaMA. End-to-end S2S architecture entered production commercial deployment at scale.

GPU Tensor CoresPRIMARY

Both the components of cascaded models (Whisper, LLM, TTS) and end-to-end S2S models (Moshi, LLaMA-Omni, GPT-4o backend) are Transformer architectures requiring GPUs with Tensor Cores for efficient inference. Real-time speech processing with latency <500 ms at production scale requires GPUs.

End-to-end S2S models (~7B–70B parameters) require GPUs with large VRAM (24–80 GB). Cascaded systems can distribute components across smaller GPUs or CPU, but the LLM component still requires a GPU to maintain low latency.

CPU AVXPOSSIBLE

Smaller STT models (e.g., Whisper tiny/base) and VAD models (e.g., Silero VAD) can run efficiently on CPUs with AVX extensions. For full cascaded pipelines with large LLMs, CPU is insufficient to meet real-time latency requirements.

VAD runs efficiently on CPU (<1 ms per 30 ms audio frame). STT components for small models can run on CPU in resource-constrained environments at the cost of higher latency.
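This constraint can be expressed as a real-time factor (RTF): processing time divided by audio duration, where RTF < 1 means the component keeps up with the incoming stream. The VAD figure below mirrors the numbers cited above; the LLM figure is an illustrative assumption.

```python
def rtf(processing_ms: float, audio_ms: float) -> float:
    """Real-time factor: < 1 means the component keeps up with the stream."""
    return processing_ms / audio_ms

vad_rtf = rtf(1.0, 30.0)       # ~0.033: comfortably real-time on CPU
llm_rtf = rtf(1200.0, 1000.0)  # assumed CPU LLM figure; > 1, cannot stream
```

Budgeting each pipeline stage by RTF, rather than by raw latency alone, is what determines whether a given hardware configuration can sustain a live conversation.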

Realtime API

Documentation for realtime multimodal and speech-to-speech interactions.

documentationOpenAI
Audio and speech

Description of speech-to-speech approaches and voice pipelines.

documentationOpenAI
Voice agents

Description of S2S architecture and voice agent applications.

documentationOpenAI