Robots Atlas

Speech-to-speech AI

A class of architectures enabling direct speech-to-speech processing — either by replacing the cascaded STT→LLM→TTS pipeline with a single end-to-end model operating on audio representations, or by tightly integrating pipeline components to minimize latency and preserve paralinguistic features (prosody, emotion, speaker characteristics).

Category
Abstraction level
Operation level
Next-generation voice assistants · Realtime voice agents · Voice-based customer support · Translation and voice conversations · Hands-free interfaces

The model takes audio as input, analyzes its content and paralinguistic features, then generates a response directly as audio output. The native speech-to-speech variant operates as a single multimodal model, while the pipeline variant chains multiple components: ASR, LLM, and TTS.

Traditional voice pipelines increase latency and can lose information carried in speech, such as emotion, intent, accent, and prosodic nuances. Speech-to-speech AI reduces this problem by handling voice input and output directly.

01

Speech Encoder

Extracts semantic and paralinguistic representations from the input speech signal.

Modular

Module converting input audio signal (raw samples or mel spectrograms) into latent representations used by downstream components. In cascade architectures the encoder role is played by the ASR model producing text tokens. In end-to-end architectures the encoder processes audio into continuous representations preserving paralinguistic information.

ASR/STT (Automatic Speech Recognition) · End-to-end continuous audio encoder
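As a concrete illustration of the encoder's first step, the sketch below frames a waveform and computes per-frame magnitude spectra with NumPy; a mel filterbank and learned encoder layers would sit on top of this in a real system. The 25 ms window and 10 ms hop are assumed typical values, not taken from any specific model.

```python
import numpy as np

def frame_signal(audio: np.ndarray, sr: int = 16000,
                 win_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Slice a mono waveform into overlapping, windowed analysis frames.
    Assumes len(audio) >= one window."""
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + max(0, len(audio) - win) // hop
    frames = np.stack([audio[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hanning(win)  # taper edges to reduce spectral leakage

def magnitude_spectrogram(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Per-frame magnitude spectrum; a mel filterbank would be applied on top."""
    frames = frame_signal(audio, sr)
    return np.abs(np.fft.rfft(frames, axis=1))

audio = np.random.randn(16000)       # 1 s of noise at 16 kHz
spec = magnitude_spectrogram(audio)  # shape: (frames, frequency bins)
```

One second of 16 kHz audio yields 98 frames of 201 frequency bins each; the downstream model consumes this time–frequency representation rather than raw samples.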
02

Language/Reasoning Module

Intent understanding, response content generation, and conversation context management.

Modular

Component processing input representations and generating response representations. In cascade architectures this is a text-operating LLM. In end-to-end architectures it may be an LLM conditioned on audio tokens/embeddings or a seq2seq model trained directly on audio pairs.
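The cascade variant can be sketched as three chained stages. The stubs below are placeholders, not a real API; the point is that each stage consumes only the previous stage's output, so only text survives between them.

```python
# Stub stages standing in for real ASR / LLM / TTS models; the function
# names and return values here are illustrative, not any library's API.
def asr(audio: bytes) -> str:
    return "what time is it"          # pretend transcription

def llm(text: str) -> str:
    return f"You asked: {text}."      # pretend response generation

def tts(text: str) -> bytes:
    return text.encode("utf-8")       # pretend synthesized audio

def cascade_s2s(audio: bytes) -> bytes:
    """STT -> LLM -> TTS: text is the only information passed between
    stages, which is why prosody and emotion are lost in this design."""
    return tts(llm(asr(audio)))
```

An end-to-end model replaces this composition with a single function from input audio representations to output audio representations.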

03

Speech Decoder / Synthesizer

Synthesizes speech output with natural-sounding prosody, optionally preserving voice identity or emotional tone.

Modular

Component generating output audio signal from response representations. In cascade architectures this is a TTS module operating on text. In end-to-end architectures it decodes latent representations into spectrograms or audio tokens, subsequently converted to output by a vocoder model.

TTS (Text-to-Speech) · Spectrogram decoder + vocoder
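A shape-level sketch of the decode path: response latents are projected to mel frames, then upsampled to a waveform. The 80 mel bins and 256-sample hop are assumed typical values, and `np.repeat` stands in for a neural vocoder (e.g., HiFi-GAN) that would synthesize real audio.

```python
import numpy as np

hop = 256                                     # assumed samples per mel frame
latents = np.random.randn(50, 512)            # 50 response frames, dim 512
mel = latents @ np.random.randn(512, 80)      # random stand-in projection to 80 mel bins
waveform = np.repeat(mel.mean(axis=1), hop)   # placeholder "vocoder" upsampling
# 50 frames * 256 samples/frame = 12800 samples (~0.8 s at 16 kHz)
```

The fixed frames-to-samples ratio is what lets streaming decoders emit audio chunk by chunk as latent frames arrive.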
04

Voice Activity Detection (VAD)

Manages turn-taking order, triggers input processing, and handles barge-in interruptions.

Modular

Component detecting the start and end of user speech in the audio stream, critical for natural turn-taking management in conversation. Modern VAD models (e.g., Silero VAD) process 30 ms audio frames in under 1 ms on CPU.
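A toy energy-based VAD over 30 ms frames illustrates the mechanism; the threshold is an arbitrary assumption, and learned models such as Silero VAD are far more robust to noise.

```python
import numpy as np

def energy_vad(audio: np.ndarray, sr: int = 16000,
               frame_ms: int = 30, threshold: float = 0.02) -> list[bool]:
    """Flag each 30 ms frame whose RMS energy exceeds a threshold.
    Illustrative only; real systems use learned models."""
    frame = sr * frame_ms // 1000            # 480 samples at 16 kHz
    flags = []
    for start in range(0, len(audio) - frame + 1, frame):
        rms = float(np.sqrt(np.mean(audio[start:start + frame] ** 2)))
        flags.append(rms > threshold)
    return flags

silence = np.zeros(4800)                     # 300 ms of silence
speech = 0.1 * np.sin(2 * np.pi * 220 * np.arange(4800) / 16000)
flags = energy_vad(np.concatenate([silence, speech]))
# first 10 frames flagged silent, last 10 flagged voiced
```

Turn-taking logic then reduces to watching these flags for speech onsets (trigger processing) and sustained silence (end of turn).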

Bottleneck: Pipeline latency (cascade) or audio token generation (end-to-end)

In cascade architectures the bottleneck is the sum of stage latencies: ASR + LLM + TTS, typically 2–4 seconds end-to-end. In end-to-end architectures the bottleneck is autoregressive audio token generation by the LLM (similar to text generation but with higher token volume per second of speech).
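A back-of-envelope latency budget makes the cascade bottleneck concrete. The per-stage numbers below are illustrative assumptions, not measurements of any particular system.

```python
# Assumed per-stage latencies for one conversational turn, in milliseconds.
cascade_ms = {
    "vad_endpointing": 300,   # waiting to confirm the user stopped speaking
    "asr": 400,
    "llm_first_token": 500,
    "llm_generation": 1200,
    "tts": 600,
}
sequential_total = sum(cascade_ms.values())   # 3000 ms without streaming

# With streaming, TTS can start on the LLM's first sentence, so perceived
# latency approaches time-to-first-audio rather than the full sum.
streamed_ms = (cascade_ms["vad_endpointing"] + cascade_ms["asr"]
               + cascade_ms["llm_first_token"]
               + 150)                         # assumed TTS time-to-first-chunk
```

Under these assumptions, streaming cuts perceived latency from 3.0 s to roughly 1.35 s, which is why the hybrid pipelines below overlap stages aggressively.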

Parallelism

Partially parallel

Multiple parallel requests from different users can be handled concurrently by separate model instances across devices. Audio streaming and stage overlapping can reduce perceived latency.

Paradigm

Dense

Stage dependent

Both cascaded and end-to-end S2S architectures use dense processing in each of their components. 'Stage-dependent' refers to the fact that different components (encoder, LLM, decoder) are activated sequentially during the processing of a single query. In full-duplex systems (e.g., Moshi), input and output are processed simultaneously by a model capable of concurrently listening and speaking.

Architecture type (cascade vs. end-to-end)

Critical
  • cascade (STT→LLM→TTS): modular, configurable, 2–4 s latency, best content control.
  • end-to-end (audio-to-audio): prosody preservation, latency <1 s, requires paired audio data.
  • hybrid (tightly coupled pipeline): STT/LLM/TTS with overlapping and streaming, latency 250–500 ms.

Fundamental choice between cascade architecture (STT→LLM→TTS) and direct end-to-end architecture. Determines latency, prosody preservation, debuggability, and training data requirements.

Duplex mode

Standard
  • half-duplex (turn-based): simpler to implement; the model waits for the user to finish speaking before responding.
  • full-duplex: more natural dialogue; the model can be interrupted while generating a response.

Whether the system supports half-duplex (turn-based, one side speaks at a time) or full-duplex (both sides can speak simultaneously, with barge-in capability). Full-duplex requires advanced VAD and barge-in mechanisms.
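Barge-in handling can be reduced to a small state machine: if VAD reports user speech while the agent is speaking, playback is cancelled. A minimal sketch; production systems also debounce VAD decisions and flush pending TTS buffers.

```python
class TurnManager:
    """Minimal half-duplex turn manager with barge-in support."""

    def __init__(self):
        self.state = "listening"          # "listening" or "speaking"
        self.cancelled = False

    def on_agent_response_ready(self):
        """Agent starts playing its synthesized response."""
        self.state = "speaking"
        self.cancelled = False

    def on_vad_speech_start(self):
        """User speech detected; interrupt playback if mid-response."""
        if self.state == "speaking":      # barge-in: user interrupted
            self.cancelled = True         # stop playback, drop remaining audio
        self.state = "listening"

tm = TurnManager()
tm.on_agent_response_ready()
tm.on_vad_speech_start()                  # user barges in mid-response
```

Full-duplex models such as Moshi dispense with explicit states like these by modeling both audio streams jointly, but external agents built on half-duplex APIs still need this kind of logic.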

Audio representation

Standard
  • mel spectrogram: used in Translatotron, where input and output are mel spectrogram sequences.
  • discrete audio tokens (codec): used in models such as Moshi and LLaMA-Omni; audio is tokenized via EnCodec or SoundStream.
  • continuous audio embeddings: continuous embeddings from a pretrained audio encoder (e.g., the Whisper encoder).

Audio input/output format used by the model: raw waveform, mel spectrogram, discrete audio tokens (from audio codec e.g., EnCodec, SoundStream), or continuous embeddings.
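The practical consequence of discrete audio tokens is token volume. A rough comparison, using assumed typical values (an EnCodec-style codec at 75 frames per second with 8 residual codebooks) rather than any one model's published spec:

```python
# Back-of-envelope token rates; all numbers below are assumed typical
# values, not specifications of a specific model or codec configuration.
codec_frame_rate_hz = 75        # codec frames per second of audio
num_codebooks = 8               # RVQ levels, one token each per frame
audio_tokens_per_sec = codec_frame_rate_hz * num_codebooks   # 600 tokens/s

words_per_sec_speech = 2.5      # rough conversational speaking rate
text_tokens_per_word = 1.3      # rough BPE expansion factor
text_tokens_per_sec = words_per_sec_speech * text_tokens_per_word  # ~3.25
```

Under these assumptions, one second of speech costs on the order of a hundred times more tokens as audio than as text, which is why audio token generation dominates end-to-end inference cost.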

Common pitfalls

Loss of paralinguistic information in cascaded architectures
HIGH

In STT→LLM→TTS cascade, speech-to-text conversion irreversibly removes prosody, emotion, speaking rate, disfluencies, and voice characteristics. TTS must reconstruct expression from scratch, losing naturalness and emotional context.

For applications requiring emotion/prosody preservation: use an end-to-end architecture, or augment the cascaded pipeline with an emotion analysis module running in parallel with STT and passing emotional metadata to TTS.

Error Propagation in Cascaded Architectures
HIGH

ASR errors propagate downstream and may be amplified by the LLM (incorrect intent understanding), ultimately producing an incorrect or inadequate spoken response. Error accumulation is particularly acute for key terms, proper names, and domain-specific jargon.

Use domain-specific STT models trained on in-domain data; add correction and validation mechanisms between pipeline stages; monitor WER (Word Error Rate) in production environments.
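WER can be computed with a word-level Levenshtein distance; a minimal self-contained version for production monitoring sanity checks:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) divided
    by the reference word count, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# "to" misheard as "two": 1 substitution over 3 reference words
print(wer("transfer to checking", "transfer two checking"))  # ≈ 0.33
```

Tracking WER per domain term (account numbers, product names) rather than only in aggregate surfaces exactly the errors that cascade worst through the LLM.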

Low availability of parallel audio data for end-to-end models
HIGH

Direct (end-to-end) models require audio pairs (input→output) that are significantly harder to collect than text or audio-transcript pairs. This is particularly problematic for low-resource languages and often results in poorer generalization of direct models on them.

Use TTS-generated synthetic data as target training examples (as in Translatotron); apply multitask learning with available text data as an auxiliary signal; use transfer learning from pretrained audio encoders.

Real-time network latency and audio quality issues
MEDIUM

Real-time S2S systems are sensitive to network jitter and connection quality. Standard telephone codecs (e.g., 8 kHz G.711 in PSTN/Twilio) degrade audio below the requirements of modern models, which are typically trained on 16 kHz audio. GPT-4o Realtime and Gemini Live achieve their best results with 16 kHz wideband audio but lose their advantage over cascaded systems on 8 kHz telephony audio.

Use wideband audio (G.722, 16 kHz or higher) where possible. For telephony deployments, consider a cascaded architecture with telephony-optimized STT components. Implement client-side audio buffering to smooth jitter.
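Matching the sample rate a 16 kHz model expects from 8 kHz telephony audio can be done with naive interpolation, but this does not restore the 4–8 kHz band the narrowband codec discarded; proper bandwidth extension needs a learned model. An illustrative sketch:

```python
import numpy as np

def upsample_8k_to_16k(audio_8k: np.ndarray) -> np.ndarray:
    """Naive 2x linear-interpolation upsampling of narrowband audio.
    This only matches the sample rate a 16 kHz model expects; the
    frequency content above 4 kHz remains missing."""
    n = len(audio_8k)
    x_new = np.linspace(0, n - 1, 2 * n - 1)     # interleave midpoints
    return np.interp(x_new, np.arange(n), audio_8k)

audio = np.array([0.0, 1.0, 0.0])
up = upsample_8k_to_16k(audio)
# -> [0.0, 0.5, 1.0, 0.5, 0.0]
```

This gap between "correct sample rate" and "restored bandwidth" is why telephony deployments often fare better with STT components trained directly on 8 kHz audio.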

2019

Translatotron (Google) – first end-to-end S2ST model without intermediate text representation

breakthrough

Jia et al. published Translatotron (arXiv:1904.06037), the first seq2seq model for direct speech-to-speech translation without an intermediate text representation. The model took source-language mel spectrograms as input and generated target-language mel spectrograms, and it demonstrated preservation of voice characteristics via a speaker encoder. Translation quality was lower than that of cascade systems, but feasibility was demonstrated.

2022

Translatotron 2 – end-to-end S2ST quality matching cascade systems

Google published Translatotron 2, achieving quality comparable to cascade systems on standard benchmarks while eliminating the voice cloning vulnerability present in Translatotron 1.

2024

Moshi (Kyutai Labs) – first publicly documented end-to-end S2S model for real-time full-duplex conversational dialogue

breakthrough

Kyutai Labs published Moshi (2024), a speech-text foundation model for real-time dialogue. The model supports full-duplex operation: it can listen and speak simultaneously. Kyutai released the model weights and technical documentation. Moshi achieves a theoretical latency of ~160 ms (~200 ms in practice).

2024

GPT-4o Realtime API (OpenAI) and LLaMA-Omni – commercialization of end-to-end S2S

breakthrough

OpenAI released GPT-4o with native speech-to-speech capabilities (May 2024 demo, October 2024 API). LLaMA-Omni (2024) demonstrated an open-source approach to end-to-end S2S based on LLaMA. End-to-end S2S architecture entered production commercial deployment at scale.

GPU Tensor CoresPRIMARY

Both the components of cascaded models (Whisper, LLM, TTS) and end-to-end S2S models (Moshi, LLaMA-Omni, GPT-4o backend) are Transformer architectures requiring GPUs with Tensor Cores for efficient inference. Real-time speech processing with latency <500 ms at production scale requires GPUs.

End-to-end S2S models (~7B–70B parameters) require GPUs with large VRAM (24–80 GB). Cascaded systems can distribute components across smaller GPUs or CPU, but the LLM component still requires a GPU to maintain low latency.

CPU AVXPOSSIBLE

Smaller STT models (e.g., Whisper tiny/base) and VAD models (e.g., Silero VAD) can run efficiently on CPUs with AVX extensions. For full cascaded pipelines with large LLMs, CPU is insufficient to meet real-time latency requirements.

VAD runs efficiently on CPU (<1 ms per 30 ms audio frame). STT components for small models can run on CPU in resource-constrained environments at the cost of higher latency.
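This constraint can be expressed as a real-time factor (RTF): processing time divided by audio duration, where RTF < 1 means the component keeps up with the incoming stream. The VAD figure below mirrors the numbers cited above; the LLM figure is an illustrative assumption.

```python
def rtf(processing_ms: float, audio_ms: float) -> float:
    """Real-time factor: < 1 means the component keeps up with the stream."""
    return processing_ms / audio_ms

vad_rtf = rtf(1.0, 30.0)       # ~0.033: comfortably real-time on CPU
llm_rtf = rtf(1200.0, 1000.0)  # assumed CPU LLM figure; > 1, cannot stream
```

Budgeting each pipeline stage by RTF, rather than by raw latency alone, is what determines whether a given hardware configuration can sustain a live conversation.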

Realtime API

Documentation for realtime multimodal and speech-to-speech interactions.

documentationOpenAI
Audio and speech

Description of speech-to-speech approaches and voice pipelines.

documentationOpenAI
Voice agents

Description of S2S architecture and voice agent applications.

documentationOpenAI