The way an AI voicebot processes speech determines whether it sounds natural or slow and unreliable. Developers currently use two architectures, and the choice between them shapes the caller experience, the reliability of the system and what the bot can learn from a conversation. The older approach chains three separate components together, whilst the newer method processes the entire conversation in one go.

Diagram showing three stacked components on the left and a single realtime speech model on the right

The classic approach: stitching

When the first voicebots were built, it made sense to chain three existing components together. Incoming speech went through a speech-to-text engine that converted it to text, then a language model read that text and formulated a response, and finally a text-to-speech engine converted that response back to audible speech. This architecture is called “stitching” in the industry because you chain three independent systems together into one pipeline.
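The three hops can be sketched in a few lines. This is a minimal illustration, not a real implementation: the three stage functions are hypothetical stand-ins for actual STT, LLM and TTS providers, reduced to string handling so the pipeline shape is visible.

```python
# Minimal sketch of a "stitching" pipeline: three independent systems
# chained into one flow. Each function is a placeholder for a real provider.

def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real system would call an STT engine here.
    return audio.decode("utf-8")  # pretend the audio is already words

def generate_reply(transcript: str) -> str:
    # Placeholder: a real system would call a language model here.
    # Note it only ever sees text, never the caller's tone of voice.
    return f"You said: {transcript}"

def text_to_speech(reply: str) -> bytes:
    # Placeholder: a real system would call a TTS engine here.
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS, three separate hops."""
    transcript = speech_to_text(audio)   # hop 1: audio becomes text
    reply = generate_reply(transcript)   # hop 2: text in, text out
    return text_to_speech(reply)         # hop 3: text becomes audio
```

Each hop adds latency and is a separate place where the turn can fail, which is exactly the weakness discussed next.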

For a time, this delivered workable results, and for teams that didn’t want to train their own speech model, it was the only practical route. In practice, though, three points of failure emerge, because each link in the chain can fail independently: speech recognition can mishear a sentence, the language model can produce a slow or incorrect response, and voice synthesis can fail at an awkward moment. Many teams therefore build in a backup using an alternative TTS or LLM provider, so the bot keeps working during an outage. That solves the failure, but callers suddenly hear a completely different voice and become confused about who they are actually talking to.
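The backup pattern itself is simple to sketch. The provider functions below are hypothetical stubs, assuming a primary that is down and a backup that answers, to show both why the bot keeps working and why the voice suddenly changes.

```python
# Sketch of provider fallback: try each provider in order and return the
# first result that succeeds. Provider names here are hypothetical.

def call_with_fallback(providers, payload):
    """Try each (name, provider) pair in order; return the first success."""
    last_error = None
    for name, provider in providers:
        try:
            return name, provider(payload)
        except Exception as err:
            last_error = err  # note the failure, move on to the backup
    raise RuntimeError("all providers failed") from last_error

# Hypothetical stand-ins: the primary TTS is unreachable, the backup works.
def primary_tts(text: str) -> bytes:
    raise ConnectionError("primary TTS provider is unreachable")

def backup_tts(text: str) -> bytes:
    return text.encode("utf-8")  # a different engine, so a different voice
```

The returned provider name makes the trade-off concrete: the outage is survived, but mid-conversation the caller is handed to a different voice.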

The second disadvantage may carry even more weight. With stitching, the language model sees only a textual transcript, which means it cannot perceive the tone, volume, hesitation or emotion of the caller. An irritated customer and a satisfied customer sound identical to the model once their words are written down, and this compromises the context sensitivity that makes a conversation valuable. Signals about likely age, native language or emotional state are lost in the conversion to text, yet these signals often determine how a staff member would conduct the call.

The new approach: a single realtime speech model

Since OpenAI released gpt-realtime-1.5 on 24 February 2026, there is a second way to build voicebots that works better in most cases. Instead of three separate components in sequence, a single model listens and speaks directly, eliminating the entire layer of transcription and synthesis. The model understands the words, tone and emotion of the caller simultaneously, so it can respond directly to them. A demo by Charlierguo shows how fluidly this works in practice.

This delivers concrete advantages in everyday use. There is now only one point where something can fail instead of three, which significantly reduces the chance of an outage. Response time typically sits under 400 milliseconds, so the conversation flows naturally without the latency that stitching introduces. Multilingualism is built in, so the same model effortlessly switches between Dutch, English, German and other languages without requiring you to configure that switch beforehand. And because the model processes audio rather than text, it recognises an irritated customer by their voice and can route them directly to a staff member without needing a keyword or explicit escalation.
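The tone-based escalation described above can be sketched as routing logic. The analysis function below is a runnable placeholder, not the real model: a realtime speech model infers signals like irritation from the voice itself, so here a fake marker stands in for that inference to keep the example self-contained.

```python
# Sketch of tone-based routing: the decision uses the caller's emotional
# signal, not a keyword in the transcript. The analysis is a stub.

def analyse_turn(audio_chunk: bytes) -> dict:
    # Placeholder for the realtime model's audio understanding. A real
    # model infers tone from the voice; here "!!" fakes an irritated tone
    # so the routing below is runnable.
    irritated = b"!!" in audio_chunk
    return {
        "transcript": audio_chunk.decode("utf-8"),
        "emotion": "irritated" if irritated else "neutral",
    }

def route_turn(audio_chunk: bytes) -> str:
    """Route on tone, not keywords: an angry-sounding caller goes to staff."""
    signals = analyse_turn(audio_chunk)
    if signals["emotion"] == "irritated":
        return "transfer_to_staff"  # escalate without an explicit request
    return "bot_replies"
```

In a stitched pipeline this decision is impossible without extra tooling, because the emotional signal never survives the conversion to text.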

When stitching is still the right choice

There remains a niche where the older architecture fits better, and that is situations where no live conversation needs to take place but a recording is analysed afterwards. When a call centre wants to summarise, code or screen calls for compliance after they end, there is no latency requirement and you can happily select a specialist language model. Think of a medical language model that recognises abbreviations and terminology in healthcare, or a speech-to-text engine specifically trained on a regional dialect. In those scenarios, precision on that one component outweighs the overall conversation experience, because there is no caller on the line waiting for a response.
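A post-call pipeline reuses the stitched shape without the latency pressure. This is a minimal sketch under stated assumptions: both functions are hypothetical placeholders for specialist components such as a dialect-trained STT engine or a medical language model.

```python
# Sketch of post-call analysis: no caller is waiting, so each stage can
# be a slow, specialist component. Both stages are placeholder stubs.

def transcribe(recording: bytes) -> str:
    # Placeholder for a specialist STT engine, e.g. one trained on a
    # regional dialect; offline, it can favour accuracy over speed.
    return recording.decode("utf-8")

def summarise(transcript: str) -> str:
    # Placeholder for a specialist language model, e.g. a medical one.
    # Here: naively take the first sentence as the "summary".
    first_sentence = transcript.split(".")[0]
    return first_sentence.strip()

def analyse_call(recording: bytes) -> dict:
    """Run a finished recording through transcription and summarisation."""
    transcript = transcribe(recording)
    return {"transcript": transcript, "summary": summarise(transcript)}
```

Because each component is chosen for precision on its own task, swapping one out for a better specialist model is straightforward here, unlike in a live call.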

Our recommendation

For organisations that want voicebots to handle live conversations, we recommend the realtime approach in virtually all cases. The combination of faster response, lower failure risk, multilingualism without configuration and emotional awareness creates a caller experience that doesn’t feel robotic. For post-call analysis and other scenarios where precision on a specific component is decisive, we continue to use stitching architectures because they still deliver the strongest results there.

Our team builds in both architectures

CallFactory builds voicebots in both architectures, depending on what works best for your call flow. Whether you want a fully managed solution where our team sets everything up from start to finish, or prefer a dedicated IVR running on your own infrastructure, we deliver GDPR-compliant implementations that are available 24 hours a day, seven days a week.

Contact our team to discuss which architecture suits your calls, how integration with your existing systems will work and when the voicebot can go live. This way you get a clear estimate of the timeline and investment, and from day one you can have incoming and outgoing calls handled by a voicebot that speaks and listens at a level that was unthinkable until recently.