In voice mode, the incoming audio stream is first processed by ASR into text, then analyzed by NLU for intents (e.g., "flight_booking") and entities (date, location, number of passengers). The dialogue management module updates the conversation state, decides on the next action (response, clarification, API call), and passes the response structure to NLG. NLG generates text, which in voice mode is converted to audio by TTS. In LLM-based architectures (post-2022), the NLU + dialogue + NLG steps typically merge into a single model call, and in S2S variants, ASR and TTS are also absorbed into a single multimodal model.
Traditional GUI interfaces (forms, menus, tables) require users to learn application-specific structures and typically cannot handle ambiguous queries. Conversational AI addresses the problem of accessing system functionality in a natural way — via voice or text — with support for ambiguity resolution, multi-turn context, and human fallback when the system cannot handle a request.
Converts an audio stream into text. Classical implementations use hybrid acoustic-language models; modern systems rely on end-to-end models like Whisper. Optional for chat mode.
Official
Extracts the user's intent and salient entities (slots) from the input text. In pre-LLM systems, implemented via intent classifiers + NER; in LLM-based systems, often merged with dialog management.
Official
Tracks dialog state across turns (Dialog State Tracking) and decides the next system action (Dialog Policy): answer, clarification question, tool invocation, human escalation. The central component distinguishing Conversational AI from a single model call.
Produces the textual response for the user. Classically template-based with rules; modern systems use free-form LLM generation controlled by prompts and guardrails.
Official
Converts the response text into an audio stream. Modern neural systems (e.g. WaveNet, Tacotron, VALL-E) generate near-human quality speech with optional emotion and voice control. Optional for chat mode.
Official
Stores the conversation history within a session and optionally the user profile and long-term memory across sessions. Essential for multi-turn coherence and personalization.
Official
Mechanism for detecting that the system does not understand an utterance or that the request exceeds its scope, and handing the conversation off to a human agent with full context. Critical for user trust.
Official
Voice mode requires end-to-end latency below ~500 ms from the end of the user's utterance to the start of the response. Classical ASR→LLM→TTS pipelines without streaming often reach 1–3 s, which feels artificial and uncomfortable.
LLM-based dialog policy can generate confident but incorrect facts (prices, policies, account data), leading to loss of trust and legal risk.
ASR models have significantly higher WER for non-standard accents, dialects, code-switching, and noisy environments. ASR errors propagate into NLU, yielding incorrect intents.
A system that stubbornly tries to answer outside its scope leads to user frustration, negative NPS, and churn. Often more important than answer quality within scope.
Accumulated conversation history can exceed the LLM context window or be summarized incorrectly, causing the system to forget previously established intents and entities.
A malicious user can try to hijack system behavior ('forget previous instructions', 'pretend to be DAN'), which in an unhardened system leads to system prompt disclosure or out-of-scope behavior.
Conversational AI drifts as business processes, offerings, and documentation change. Without automated conversation evaluation (intent accuracy, resolution rate, escalation rate), quality degrades invisibly.
Joseph Weizenbaum (MIT) creates ELIZA — a program imitating a Rogerian therapist via pattern-matching rules. Demonstrates that even a simple text system can give users the illusion of understanding.
Slot-filling architecture with explicitly defined intents and entities becomes the dominant pattern for task-oriented dialog systems (e.g. flight booking).
Apple introduces Siri on the iPhone 4S, popularizing the idea of a mass-market personal voice assistant. In subsequent years come Google Now (2012), Cortana (2014), Alexa (2014).
Vinyals and Le (Google) publish 'A Neural Conversational Model' — show that RNN encoder-decoder models can generate coherent responses in open domain. Opens the era of neural generative chatbots.
OpenAI releases ChatGPT (November 2022). RLHF-tuned LLMs prove capable of multi-turn open-domain conversations with response quality surpassing prior modular systems. Conversational AI architecture shifts from modular pipelines toward unified LLMs.
OpenAI introduces Advanced Voice Mode in GPT-4o (May 2024) — a multimodal audio→audio model with ~320 ms latency, eliminating the intermediate text step. Other S2S models (Moshi, Hume Octave) confirm the trend.
Sierra publishes the Agents-as-a-Service manifesto (March 2026) — Conversational AI integrates with the agentic paradigm, where a single agent handles chat, voice, email, and 30+ languages with built-in guardrails, autonomously improved by an overseer agent (Ghostwriter).
User interaction mode. Voice mode requires ASR + TTS and much lower latency (below ~500 ms) than chat mode.
Whether the system is composed of separate modules (ASR + NLU + DM + NLG + TTS) or unified in a single model (LLM or speech-to-speech).
Acceptable time between the end of the user's utterance and the start of the system's response. Determines the naturalness of voice conversation.
Number and quality of supported languages and accents. Affects geographic reach and ASR/NLU accuracy for low-resource languages.
Strategy ensuring the system responds factually: pure model, RAG over customer documents, API access to live data.
When and how the system hands off to a human: after N failed attempts, on user request, based on emotion signals.
Modern LLM-based implementations combine NLU, dialog management, and NLG in a single model call, greatly simplifying the pipeline compared to classical modular systems.
Dialog policy routes the conversation among paths: direct answer, clarification question, tool invocation, human escalation. In LLM-based systems, routing is realized by model decisions in the context of a system prompt.
Parallelism occurs primarily inter-session (multiple users served concurrently) and within a single turn (parallel tool calls, RAG retrieval during generation).
LLM inference (NLU/dialog/NLG) and neural ASR/TTS run most efficiently on GPUs with tensor cores; voice mode with a <500 ms budget requires hardware acceleration.
Google deploys conversational AI (Google Assistant) on TPU; similar results to GPU for most inference workloads.
Lightweight intent classifiers, template-based NLG, and classical ASR run on CPU. Insufficient for modern LLM-based real-time voice systems.