The way an AI voicebot processes speech determines whether it sounds natural or slow and unreliable. Developers currently use two architectures, and the choice between them shapes the caller experience, the reliability of the system and what the bot can learn from a conversation. The older approach chains three separate components together, whilst the newer method processes the entire conversation in one go.

Diagram showing three stacked components on the left and a single realtime speech model on the right

The classic approach: stitching

When the first voicebots were built, it made sense to chain three existing components together. Incoming speech went through a speech-to-text engine that converted it to text, then a language model read that text and formulated a response, and finally a text-to-speech engine converted that response back to audible speech. This architecture is called “stitching” in the industry because you chain three independent systems together into one pipeline.
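The three hops can be sketched in a few lines. This is a minimal illustration, not a real implementation: the three stage functions are hypothetical stand-ins for actual STT, LLM and TTS providers, reduced to string handling so the pipeline shape is visible.

```python
# Minimal sketch of a "stitching" pipeline: three independent systems
# chained into one flow. Each function is a placeholder for a real provider.

def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real system would call an STT engine here.
    return audio.decode("utf-8")  # pretend the audio is already words

def generate_reply(transcript: str) -> str:
    # Placeholder: a real system would call a language model here.
    # Note it only ever sees text, never the caller's tone of voice.
    return f"You said: {transcript}"

def text_to_speech(reply: str) -> bytes:
    # Placeholder: a real system would call a TTS engine here.
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS, three separate hops."""
    transcript = speech_to_text(audio)   # hop 1: audio becomes text
    reply = generate_reply(transcript)   # hop 2: text in, text out
    return text_to_speech(reply)         # hop 3: text becomes audio
```

Each hop adds latency and is a separate place where the turn can fail, which is exactly the weakness discussed next.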

For a time, this delivered workable results, and for teams that didn’t want to train their own speech model, it was the only practical route. In practice, though, three points of failure emerge, because each link in the chain can fail independently: speech recognition can mishear a sentence, the language model can produce a slow or incorrect response, and voice synthesis can fail at an awkward moment. Many teams therefore build in a backup using an alternative TTS or LLM provider, so the bot keeps working during an outage. That solves the failure, but callers suddenly hear a completely different voice and become confused about who they are actually talking to.
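The backup pattern itself is simple to sketch. The provider functions below are hypothetical stubs, assuming a primary that is down and a backup that answers, to show both why the bot keeps working and why the voice suddenly changes.

```python
# Sketch of provider fallback: try each provider in order and return the
# first result that succeeds. Provider names here are hypothetical.

def call_with_fallback(providers, payload):
    """Try each (name, provider) pair in order; return the first success."""
    last_error = None
    for name, provider in providers:
        try:
            return name, provider(payload)
        except Exception as err:
            last_error = err  # note the failure, move on to the backup
    raise RuntimeError("all providers failed") from last_error

# Hypothetical stand-ins: the primary TTS is unreachable, the backup works.
def primary_tts(text: str) -> bytes:
    raise ConnectionError("primary TTS provider is unreachable")

def backup_tts(text: str) -> bytes:
    return text.encode("utf-8")  # a different engine, so a different voice
```

The returned provider name makes the trade-off concrete: the outage is survived, but mid-conversation the caller is handed to a different voice.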

The second disadvantage may carry even more weight. With stitching, the language model sees only a textual transcript, which means it cannot perceive the tone, volume, hesitation or emotion of the caller. An irritated customer and a satisfied customer sound identical to the model once their words are written down, and this compromises the context sensitivity that makes a conversation valuable. Signals about likely age, native language or emotional state are lost in the conversion to text, yet these signals often determine how a staff member would conduct the call.

The new approach: a single realtime speech model

Since OpenAI released gpt-realtime-1.5 on 24 February 2026, there is a second way to build voicebots that works better in most cases. Instead of three separate components in sequence, a single model listens and speaks directly, eliminating the entire layer of transcription and synthesis. The model understands the words, tone and emotion of the caller simultaneously, so it can respond directly to them. A demo by Charlierguo shows how fluidly this works in practice.

This delivers concrete advantages in everyday use. There is now only one point where something can fail instead of three, which significantly reduces the chance of an outage. Response time typically sits under 400 milliseconds, so the conversation flows naturally without the latency that stitching introduces. Multilingualism is built in, so the same model effortlessly switches between Dutch, English, German and other languages without requiring you to configure that switch beforehand. And because the model processes audio rather than text, it recognises an irritated customer by their voice and can route them directly to a staff member without needing a keyword or explicit escalation.
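The tone-based escalation described above can be sketched as routing logic. The analysis function below is a runnable placeholder, not the real model: a realtime speech model infers signals like irritation from the voice itself, so here a fake marker stands in for that inference to keep the example self-contained.

```python
# Sketch of tone-based routing: the decision uses the caller's emotional
# signal, not a keyword in the transcript. The analysis is a stub.

def analyse_turn(audio_chunk: bytes) -> dict:
    # Placeholder for the realtime model's audio understanding. A real
    # model infers tone from the voice; here "!!" fakes an irritated tone
    # so the routing below is runnable.
    irritated = b"!!" in audio_chunk
    return {
        "transcript": audio_chunk.decode("utf-8"),
        "emotion": "irritated" if irritated else "neutral",
    }

def route_turn(audio_chunk: bytes) -> str:
    """Route on tone, not keywords: an angry-sounding caller goes to staff."""
    signals = analyse_turn(audio_chunk)
    if signals["emotion"] == "irritated":
        return "transfer_to_staff"  # escalate without an explicit request
    return "bot_replies"
```

In a stitched pipeline this decision is impossible without extra tooling, because the emotional signal never survives the conversion to text.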

When stitching is still the right choice

There remains a niche where the older architecture fits better, and that is situations where no live conversation needs to take place but a recording is analysed afterwards. When a call centre wants to summarise, code or screen calls for compliance after they end, there is no latency requirement and you can happily select a specialist language model. Think of a medical language model that recognises abbreviations and terminology in healthcare, or a speech-to-text engine specifically trained on a regional dialect. In those scenarios, precision on that one component outweighs the overall conversation experience, because there is no caller on the line waiting for a response.
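A post-call pipeline reuses the stitched shape without the latency pressure. This is a minimal sketch under stated assumptions: both functions are hypothetical placeholders for specialist components such as a dialect-trained STT engine or a medical language model.

```python
# Sketch of post-call analysis: no caller is waiting, so each stage can
# be a slow, specialist component. Both stages are placeholder stubs.

def transcribe(recording: bytes) -> str:
    # Placeholder for a specialist STT engine, e.g. one trained on a
    # regional dialect; offline, it can favour accuracy over speed.
    return recording.decode("utf-8")

def summarise(transcript: str) -> str:
    # Placeholder for a specialist language model, e.g. a medical one.
    # Here: naively take the first sentence as the "summary".
    first_sentence = transcript.split(".")[0]
    return first_sentence.strip()

def analyse_call(recording: bytes) -> dict:
    """Run a finished recording through transcription and summarisation."""
    transcript = transcribe(recording)
    return {"transcript": transcript, "summary": summarise(transcript)}
```

Because each component is chosen for precision on its own task, swapping one out for a better specialist model is straightforward here, unlike in a live call.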

Our recommendation

For organisations that want voicebots to handle live conversations, we recommend the realtime approach in virtually all cases. The combination of faster response, lower failure risk, multilingualism without configuration and emotional awareness creates a caller experience that doesn’t feel robotic. For post-call analysis and other scenarios where precision on a specific component is decisive, we continue to use stitching architectures because they still deliver the strongest results there.

Our team builds in both architectures

CallFactory builds voicebots in both architectures, depending on what works best for your call flow. Whether you want a fully managed solution where our team sets everything up from start to finish, or prefer a dedicated IVR running on your own infrastructure, we deliver GDPR-compliant implementations that are available 24 hours a day, seven days a week.

Contact our team to discuss which architecture suits your calls, how integration with your existing systems will work and when the voicebot can go live. This way you get a clear estimate of the timeline and investment, and from day one you can have incoming and outgoing calls handled by a voicebot that speaks and listens at a level that was unthinkable until recently.