I Spoke to 500+ Voice AI Builders in India Over 3 Months. Here Is What I Found.
India is, structurally, the most interesting market in the world for voice AI. Not because of the size of the opportunity, though that is real, but because of the actual nature of the problem. Over a billion people, twenty-two officially scheduled languages, many with their own scripts, hundreds of dialects, and a communication culture that has never neatly separated into "text people" and "voice people." Most Indians speak multiple languages and switch between them mid-sentence without thinking about it. The average urban user code-switches between English and at least one regional language several times in a ten-minute phone call. Rural users often operate in dialects that have no written standard at all. And yet voice interfaces, which should be the most natural fit for this context, have historically been terrible at handling it.
That has started to change, and a generation of builders is racing to figure out where the real problems are. Over the past three months, I have been inside those conversations, across multiple builder communities and in direct one-on-one discussions with 500+ Voice AI founders, engineers, and product leaders building in India. Some of them are shipping their first agents. Others are running millions of call-minutes per month. What follows is what I actually observed.
What Is Actually Getting Built
Before the technical problems, the use cases. Across these conversations, the pattern that stood out most is how decisively outbound dominates. Roughly seven in ten builders I spoke with are building for outbound, primarily AI-initiated calls at scale, and only a smaller fraction are focused on inbound customer service or IVR replacement. That ratio surprised me initially, then made sense: outbound is where the immediate ROI is visible, the conversion rates are measurable, and the Indian enterprise buyer already has budget allocated from legacy call centre spend.
- Outbound: 72% (Sales, collections, follow-up, lead qualification, appointment setting)
- Inbound: 28% (Customer support, IVR replacement, booking, service requests)
Within that outbound majority, the verticals concentrating the most activity are BFSI, real estate, and D2C commerce, in that order. BFSI is the most mature deployment context by a distance, partly because the use case is clear (collections, insurance sales, loan follow-ups) and partly because the enterprise buyers in that segment have the most experience contracting for outbound call capacity. Real estate, specifically new project launches, is gaining fast: the economics of qualifying a hundred leads to find three serious buyers map almost perfectly to what a well-prompted voice agent does well. D2C brands are using voice primarily for post-purchase and order management flows, often in combination with WhatsApp.
- BFSI Sales and Collections: 35%
- Real Estate and New Project Sales: 25%
- D2C and E-commerce: 20%
- Healthcare and Clinics: 10%
- Automotive and Other: 10%
Healthcare and automotive are smaller but active segments. Healthcare is complicated by compliance sensitivity (HIPAA equivalents are being asked for by larger hospital chains), while automotive has one of the cleaner use cases: service appointment booking and test drive scheduling, where the agent needs to check availability, confirm a slot, and update a CRM, with a well-defined end state and a measurable conversion metric. That kind of bounded, tool-heavy task is where current voice agents perform most reliably.
Where the Attention Goes
With the use-case landscape in view, it is worth mapping where the technical attention actually lives. Certain themes dominate almost every conversation. Others surface only when someone has pushed far enough into production to start running into the hard edges.
- AI Stack (LLM, TTS, STT): 36%
- Telephony & Infrastructure: 29%
- Production Operations: 11%
- Latency & Performance: 9%
- Context & Memory Management: 9%
- Indic Languages & Multilingual: 6%
The AI Stack and Telephony categories together consume nearly two-thirds of all technical conversation. These are infrastructure choices, not differentiation. Builders are spending the majority of their cognitive energy on the plumbing. Context and Memory Management shows up as nine percent when counted as a named topic, but it accounts for roughly thirteen percent of all technical conversation once you trace it across topics: it surfaces in LLM threads as token explosion, in latency threads as the retrieval tradeoff, in prompt engineering as context stuffing, and in multi-agent threads as the handoff problem. It is the most quietly load-bearing problem in the stack, and the one least likely to have a dedicated conversation named after it.
The Stack: A Constant, Ongoing Negotiation
The TTS Problem Nobody Talks About Honestly
The most counterintuitive finding from three months of conversations: ElevenLabs being too good is a genuine production problem. Here is the dynamic. A voice that sounds crisp, warm, and perfectly articulated in a demo environment makes silence louder in a real call. The human auditory system, particularly for Indian callers who have spent years on patchy phone lines with contact centre agents who cough and lose their place, has calibrated expectations around what a real voice sounds like. When those expectations are violated by something too smooth, the uncanny valley does not kick in visually, but it does kick in conversationally. Callers feel something is off before they can articulate why.
"The model sounds too HD. That is the problem we keep running into. You want it imperfect, you want the slight unevenness. When it is too clean, the silence between turns feels surgical and people get uncomfortable."
— Paraphrased from multiple builder conversations
This has created a productive, if slightly absurd, micro-economy: some builders maintain curated libraries of voice profiles specifically for Indian use cases, categorised by age group, tone, and call direction. The demand for it is real. The cost dimension compounds things further. ElevenLabs at scale is expensive enough that a meaningful portion of builders are running cost-optimisation experiments rather than quality ones.
Cartesia is gaining ground as a more economical alternative. Sarvam is the default recommendation for anything involving Indic languages. Smallest AI's Lightning TTS release landed well, which tells you something about how hungry builders are for a credible third option in this space.
"Average contact centre human has average sound. Most TTS models are trained on synthetic data and will never fully close that gap. The question is whether that matters for your use case, and for most Indian deployments at scale, it does."
— Paraphrased from builder conversation
The Orchestration Fork
Every builder hits this fork in week two or three of serious development: LiveKit or Pipecat. LiveKit wins when speed to integration is the priority, handling SIP cleanly with a self-hosted option that is genuinely capable for straightforward inbound and outbound flows. Pipecat wins when agent behaviour is complex and dynamic, when you need fine-grained pipeline control, or when the call involves multiple handoff scenarios. The tradeoff is verbosity and compute cost.
LangGraph gets proposed periodically and walked back almost as often. It is built for multi-agent orchestration in general, not for voice specifically, and under real telephony conditions the latency characteristics do not hold up well enough to justify the complexity. A smaller segment is rolling their own state machines entirely. Honest verdict from builders who have done it: works in demos, breaks in production around edge cases that only appear at volume.
"Locally, everything looks fine at one second. The real thing happens when you switch to telephony, add a ten-thousand token prompt, five to ten tools, and push past ten turns. That is when the architectural choices you made in week two start mattering."
— Paraphrased from builder discussion
The Telephony Tax
Telephony is the most discussed topic in these conversations, not because it is the most technically interesting problem but because it is the most frustrating one. The layer is fragmented, expensive relative to the value it delivers, and almost entirely invisible to the caller. Nobody has ever chosen a voice agent because the SIP provider was excellent.
The cost gap between Twilio and Indian providers is significant enough that builders who start with Twilio for the documentation and reliability almost always migrate once volume hits a meaningful level. Options like Telnyx, Exotel, and local SIP resellers each carry their own set of tradeoffs around reliability, documentation quality, and compliance support. TRAI compliance adds a layer that catches early-stage builders off guard. The 140-series requirement for promotional calls exists regardless of explicit consent status. The 160-series requirement for BFSI transactional calls has a licensing precondition that most non-bank builders cannot satisfy directly. This is not optional complexity.
The Latency Obsession (And Why the Numbers Are Mostly Fiction)
Sub-500 millisecond latency is claimed by the majority of builders I spoke with. It is achievable by roughly a third of them, under specific conditions, and by significantly fewer in actual production under real telephony load. This is not a conspiracy. It is a measurement problem.
Telephony adds a fixed floor that no model optimisation removes. A voice with the right naturalness can feel faster at two and a half seconds than a stilted response at eight hundred milliseconds, because the brain is not timing responses; it is evaluating conversational flow. The techniques that actually move perceived latency in production are well understood at this point: preemptive generation, prewarming connections, TTS buffer management, speculative decoding. What is not resolved is the fact that adding a memory or retrieval layer to achieve better answer quality almost always degrades latency. That tradeoff has no clean resolution in current tooling.
"Latency without context is a vanity metric. We test with full prompt load, knowledge base retrieval, and tool execution in the loop. That is the only number worth publishing."
The honest benchmark, proposed independently by multiple experienced builders: test with at least a two-thousand token system prompt, a knowledge base retrieval call, and at least one tool execution in the loop. Anything less is a demo number, not a production one.
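The three conditions in that benchmark are easy to encode as a check on any published latency figure. The following is a minimal sketch, not anyone's actual test harness: the `BenchmarkRun` record and its field names are hypothetical, and the latency values in the usage lines are invented purely for illustration. Only the three conditions themselves come from the builders' proposal.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """Hypothetical record of one latency test; field names are illustrative."""
    system_prompt_tokens: int
    retrieval_calls: int
    tool_executions: int
    measured_latency_ms: float


def is_production_grade(run: BenchmarkRun, min_prompt_tokens: int = 2000) -> bool:
    """True only if the run meets all three conditions named in the text."""
    return (
        run.system_prompt_tokens >= min_prompt_tokens
        and run.retrieval_calls >= 1
        and run.tool_executions >= 1
    )


# Invented example runs: a bare demo setup and a fully loaded one.
demo = BenchmarkRun(system_prompt_tokens=300, retrieval_calls=0,
                    tool_executions=0, measured_latency_ms=420.0)
loaded = BenchmarkRun(system_prompt_tokens=2500, retrieval_calls=1,
                      tool_executions=2, measured_latency_ms=1350.0)

print(is_production_grade(demo))    # False
print(is_production_grade(loaded))  # True
```

The point of the check is that the latency number is only reported alongside the load it was measured under.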
"At one pitch I saw sub-hundred millisecond latency claimed. Network latency alone on Indian telephony infrastructure is two to three hundred milliseconds. The physics are not negotiable."
— Paraphrased from builder conversation
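The arithmetic behind that quote can be sketched in a few lines. This assumes the two to three hundred millisecond network floor from the quote; the per-stage timings are hypothetical placeholders chosen only to show the sum, not measurements from any provider.

```python
# Network floor quoted above for Indian telephony, in milliseconds (low, high).
TELEPHONY_FLOOR_MS = (200, 300)


def end_to_end_range_ms(stage_ms: dict[str, float]) -> tuple[float, float]:
    """Add per-stage processing time to the telephony floor."""
    processing = sum(stage_ms.values())
    low, high = TELEPHONY_FLOOR_MS
    return (low + processing, high + processing)


def claim_is_plausible(claimed_ms: float, stage_ms: dict[str, float]) -> bool:
    """A claim below the best-case total cannot be an end-to-end number."""
    best_case, _ = end_to_end_range_ms(stage_ms)
    return claimed_ms >= best_case


# Hypothetical stage timings, for illustration only.
stages = {"stt": 150, "llm_first_token": 250, "tts_first_byte": 120}

print(end_to_end_range_ms(stages))      # (720, 820)
print(claim_is_plausible(100, stages))  # False
print(claim_is_plausible(100, {}))      # False: below the network floor alone
```

Even with zero processing time, a sub-hundred millisecond claim falls under the network floor, which is the point the builder was making.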
The India Problem Is Larger Than Hinglish
Hinglish support, meaning the ability to handle natural English-Hindi code-switching mid-sentence, is widely described as mostly solved. That is partly true and usefully misleading. What has been solved is the educated, urban, metropolitan version of the problem. What has not been solved is India.
The framing I found most clarifying: Hinglish covers the portion of the Indian AI market that lives in metro cities and urban centres. It does not cover the larger portion of the country that does not. The regional blends, the dialectal variations, the non-standard code-switching patterns of Tamil-English or Bengali-English or Marathi-Hindi, those are not solved problems. They are not even well-defined problems yet.
Sarvam is the consensus recommendation for Indic STT and is doing genuinely good work. But there is a known hallucination artifact affecting its STT during silence stretches, where the model generates repetitive Hindi tokens unprompted, that was actively circulating among builders at the time of writing. It is the kind of bug that surfaces only at production scale, which itself tells you something about how many teams are actually in production with Indic language deployments right now.
"Speech-to-speech models handle English well. The moment the language switches mid-sentence, they lose the thread. We have been doing prompting experiments to get natural Hinglish output from S2S. It improves things. It does not solve things."
— Paraphrased from builder discussion
The pipeline architecture, separating STT (Deepgram, for example), LLM, and TTS into distinct components, has significant advantages over speech-to-speech models for Indian deployments specifically because each component can be swapped independently for language-specific optimisation. The working view, which I share after these conversations, is that S2S will eventually handle multilingual Indian contexts well. The timeline is not next quarter.
The Commoditisation Anxiety Is Real, and It Is Also Correct
Per-minute pricing for voice agents is in a visible, accelerating decline. Builders who launched eighteen months ago at a certain price point are watching the floor drop quarter over quarter. The anxiety is legitimate.
But the framing from builders who seem less anxious about it is worth capturing. The voice pipe, meaning the STT-LLM-TTS-telephony stack, is commoditising. Infrastructure commoditises. What does not commoditise at the same rate is the intelligence layer: compliance automation, business-specific metric extraction from call data, CRM integration, conversation continuity across sessions. These are harder problems and their value compounds in ways that per-minute rates do not.
"The pipe gets cheaper every quarter. The brain on top of it gets more valuable. Compliance, memory, business-specific metrics, workspace integrations. Nobody is commoditising that anytime soon."
— Paraphrased from builder conversation
The implication is structural: the builders who will matter in three years are not building the best voice pipe. They are building the best intelligence layer that sits on top of a commoditising pipe. And that intelligence layer requires memory, which is where almost everyone is visibly stuck.
The Problem Nobody Has Actually Solved
Context and memory management accounts for roughly thirteen percent of all technical conversation when you trace it honestly across topics. It rarely shows up as its own thread. It appears in LLM discussions as token explosion and cost unpredictability. In latency discussions as the quality-versus-speed tradeoff nobody has resolved. In prompt engineering as the "stuff everything into one massive prompt" anti-pattern that teams adopt because they have no better option. In multi-agent discussions as the handoff problem, which underneath the architectural language is really just a context transfer problem. Everyone is solving a piece of it. Nobody has the layer underneath.
- Multi-Agent Architecture: 24%
- Prompt Engineering: 22%
- Compliance / Data Governance: 15%
- LLM Selection & Inference: 10%
- Latency & Performance: 9%
- VAD / Turn Detection: 8%
- Orchestration Frameworks: 5%
- STT / ASR: 5%
- TTS / Voice Synthesis: 3%
The problem is being solved in pieces, in isolation, without a coherent layer underneath it. Four failure modes appear often enough to be worth naming.
In-Call Token Explosion
Real-time voice models, particularly the live API variants, do not support prompt caching. Every turn in a conversation adds to the token count, and that count compounds faster than most builders expect when tool calls are in the loop. By the tenth turn of a complex call, a session that started at ten thousand tokens can be approaching six figures. At that point the model does not fail cleanly. It gets unstable in ways that are hard to diagnose and harder to explain to a client.
"We are running into something strange with the live models. By the time a call reaches ten turns, the token count has ballooned to something that makes the model behave strangely. The same pipeline using a standard LLM does the same job at a fraction of the tokens because caching works. The live API just keeps stacking."
— Paraphrased and merged from multiple builder conversations
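One way to see how quickly this compounds is to count the input tokens processed across a call. This is a rough sketch under stated assumptions: every turn reprocesses the full accumulated context when caching is unavailable, the ten-thousand token starting point is taken from the text, and the 500 tokens added per turn is a hypothetical figure. The cached case is idealised (the prefix is treated as free after the first turn, whereas real caching typically bills cached tokens at a discount rather than zero).

```python
def cumulative_input_tokens(base_prompt: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens processed when every turn resends the full context."""
    total = 0
    context = base_prompt
    for _ in range(turns):
        total += context            # the whole context is processed again
        context += tokens_per_turn  # this turn's speech and tool output is appended
    return total


def cumulative_with_caching(base_prompt: int, tokens_per_turn: int, turns: int) -> int:
    """Idealised cached case: the prefix is processed once, then only new tokens."""
    return base_prompt + tokens_per_turn * (turns - 1)


print(cumulative_input_tokens(10_000, 500, 10))   # 122500
print(cumulative_with_caching(10_000, 500, 10))   # 14500
```

Under these assumptions the uncached call crosses six figures by the tenth turn, while the idealised cached pipeline stays close to its starting size.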
The workaround that has spread through these circles is a sliding window approach: every five to seven turns, older context is compressed so the running total drops back below a threshold. It works well enough to be in production use across several teams. It is also duct-taped engineering that papers over the absence of a real memory layer. Every team implementing it is writing the same logic from scratch, and no two implementations behave identically.
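A minimal sketch of what such a sliding window might look like follows. It is not any team's actual implementation. Several things here are assumptions: compression is triggered by a token threshold rather than a fixed turn count, tokens are approximated by whitespace-separated words, and `truncate_summary` is a deliberately lossy placeholder where a real system would call an LLM summariser.

```python
from dataclasses import dataclass, field
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate_summary(chunks: list[str]) -> str:
    """Placeholder summariser: keeps the first five words of each chunk."""
    return " | ".join(" ".join(c.split()[:5]) for c in chunks)


@dataclass
class SlidingWindowContext:
    max_tokens: int = 200   # hypothetical threshold
    keep_recent: int = 4    # most recent turns are always kept verbatim
    summarise: Callable[[list[str]], str] = truncate_summary
    summary: str = ""
    turns: list[str] = field(default_factory=list)

    def total_tokens(self) -> int:
        return rough_token_count(self.summary) + sum(
            rough_token_count(t) for t in self.turns
        )

    def add_turn(self, text: str) -> None:
        self.turns.append(text)
        if self.total_tokens() > self.max_tokens and len(self.turns) > self.keep_recent:
            # Fold everything except the most recent turns into the summary.
            old = self.turns[:-self.keep_recent]
            self.turns = self.turns[-self.keep_recent:]
            prior = [self.summary] if self.summary else []
            self.summary = self.summarise(prior + old)

    def render(self) -> str:
        head = [f"Summary so far: {self.summary}"] if self.summary else []
        return "\n".join(head + self.turns)


ctx = SlidingWindowContext(max_tokens=30, keep_recent=2)
for i in range(5):
    ctx.add_turn(f"turn {i} " + "word " * 8)  # ten words per turn

print(len(ctx.turns))            # 2
print(ctx.total_tokens() < 50)   # True: less than the 50 uncompressed tokens
```

The choice of summariser is where implementations diverge, which is one reason no two teams' versions behave identically.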
"No caching means unpredictable bills. We have seen a single campaign's API spend vary by forty percent week to week based on how chatty callers are. You cannot price a product reliably when your infrastructure cost has that kind of variance."
— Paraphrased from builder conversation
The Multi-Agent Handoff Problem
Complex voice use cases, think triage to specialist to booking to confirmation, require multiple specialised agents working in sequence. Transferring context between them cleanly, without the call sounding like it just hit a speed bump, is an unsolved UX problem with a messy infrastructure explanation underneath it. Teams either put everything into one enormous prompt and accept the fragility, or they attempt a clean handoff and live with the dead zone while context transfers. One approach that has been shared across several builders: streaming a pre-recorded filler phrase the moment a handoff triggers, which buys exactly the time needed to summarise prior context, compress it, and inject it into the new agent's instructions. It is clever. It is also a workaround for something that should be solved at the infrastructure layer.
"Most teams are stuffing every agent into one massive prompt. It is fragile, it breaks when you look at it wrong, and it does not scale. We moved to a sequential handoff approach and now the challenge is that the handoff itself sounds clunky. The tool call has to render the entire message before the next agent can start. That gap is audible."
— Paraphrased and merged from builder discussion
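The filler-phrase trick described above amounts to running audio playback and context summarisation concurrently. Here is a small sketch using Python's `asyncio`; `play_filler` and `summarise_context` are hypothetical stand-ins (sleeps instead of real audio streaming and an LLM call), and the durations and example turns are invented for illustration.

```python
import asyncio


async def play_filler(events: list[str], duration_s: float = 0.05) -> None:
    """Stand-in for streaming a pre-recorded phrase to the caller."""
    events.append("filler_started")
    await asyncio.sleep(duration_s)
    events.append("filler_finished")


async def summarise_context(history: list[str], events: list[str]) -> str:
    """Stand-in for an LLM summarisation call; here it joins the last turns."""
    events.append("summary_started")
    await asyncio.sleep(0.01)
    events.append("summary_finished")
    return " / ".join(history[-3:])


async def hand_off(history: list[str], next_agent_prompt: str) -> tuple[str, list[str]]:
    events: list[str] = []
    # Run both at once so the summary is built while the caller hears the filler.
    _, summary = await asyncio.gather(
        play_filler(events),
        summarise_context(history, events),
    )
    instructions = f"{next_agent_prompt}\n\nContext from previous agent: {summary}"
    return instructions, events


instructions, events = asyncio.run(hand_off(
    ["caller wants an appointment", "prefers Saturday", "existing patient"],
    "You are the booking agent.",
))
print(events)
print(instructions)
```

The handoff only feels seamless if the summary finishes before the filler does; when it does not, the audible gap returns, which is why this remains a workaround.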
RAG in the Hot Path
Adding a retrieval layer improves answer quality and degrades latency at the same time, and there is no clean resolution in current tooling at the price points Indian builders operate at. Some teams pre-load everything relevant into the system prompt at call initiation, which works until the knowledge base outgrows that approach. Some teams accept the latency hit and compensate with audio buffers. Some abandon retrieval entirely and write larger, more carefully structured prompts. These are different shapes of the same problem, not solutions to it.
"The moment you add a retrieval layer, your latency number is gone. Better answers or faster answers. Nobody has both yet, so most teams pick speed and accept that the agent will occasionally say something confidently wrong."
— Paraphrased from builder discussion
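The point at which pre-loading stops working can be expressed as a simple size check. This is a sketch under assumptions: the function name, the token budget, and the word-count approximation of tokens are all hypothetical, and a real decision would also weigh retrieval latency and cost.

```python
def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def choose_context_strategy(kb_documents: list[str], prompt_budget_tokens: int) -> str:
    """Pre-load while the knowledge base fits the prompt budget, else retrieve."""
    kb_tokens = sum(rough_token_count(doc) for doc in kb_documents)
    return "preload" if kb_tokens <= prompt_budget_tokens else "retrieve"


small_kb = ["Office hours are nine to six.", "Site visits run on weekends."]
large_kb = small_kb * 500

print(choose_context_strategy(small_kb, prompt_budget_tokens=2000))  # preload
print(choose_context_strategy(large_kb, prompt_budget_tokens=2000))  # retrieve
```

Neither branch removes the tradeoff: "preload" pays in prompt size on every turn, and "retrieve" pays in latency on the turns that need it.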
Cross-Session Memory: The Most Underbuilt Layer
The infrastructure for remembering a user across calls, for carrying context from one session into the next, for personalising an interaction based on what the agent learned last time, is essentially absent from most production deployments. The tell is in how builders talk about it. When someone has cross-session memory working, they present it as a differentiator. One builder showed me a demo where the agent remembered a caller from the previous day and opened with context from that prior conversation. He framed it as a feature, with visible pride. He was right to be proud of it. But the fact that it is impressive in 2026 tells you exactly where the baseline is.
"We showed the demo to a potential client and the first thing they asked was whether the agent would remember their customers across calls. We said yes. That was actually a lie at the time. We spent the next three weeks making it true."
— Paraphrased from builder conversation
Every team building this is building it entirely from scratch. There is no shared infrastructure, no agreed-upon data model, no standard for what "remembering a user" means in a voice context across Indic languages with code-switching. The teams doing it well have invested disproportionate time in a problem that should be infrastructure, not product differentiation.
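To make the missing data model concrete, here is one possible minimal shape for it. Everything in this sketch is an assumption rather than a standard: the `CallerMemory` fields, the in-memory store, and the use of a caller identifier as the key are illustrative choices, and a production system would need persistence, consent handling, and a real summarisation step.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CallerMemory:
    """Hypothetical per-caller record carried from one call to the next."""
    caller_id: str  # e.g. a normalised phone number
    preferred_languages: list[str] = field(default_factory=list)
    last_call_summary: str = ""
    last_call_at: Optional[datetime] = None


class InMemoryStore:
    """Toy store; a real deployment would use a database."""

    def __init__(self) -> None:
        self._records: dict[str, CallerMemory] = {}

    def recall(self, caller_id: str) -> Optional[CallerMemory]:
        return self._records.get(caller_id)

    def remember(self, caller_id: str, summary: str, languages: list[str]) -> CallerMemory:
        record = self._records.setdefault(caller_id, CallerMemory(caller_id))
        record.last_call_summary = summary
        record.last_call_at = datetime.now(timezone.utc)
        for lang in languages:
            if lang not in record.preferred_languages:
                record.preferred_languages.append(lang)
        return record


def opening_context(store: InMemoryStore, caller_id: str) -> str:
    """Text to inject into the agent's instructions at the start of a call."""
    record = store.recall(caller_id)
    if record is None:
        return "No prior calls on record."
    return f"Previous call: {record.last_call_summary}"


store = InMemoryStore()
print(opening_context(store, "caller-123"))  # No prior calls on record.
store.remember("caller-123", "asked about a site visit on Sunday", ["hi", "en"])
print(opening_context(store, "caller-123"))  # Previous call: asked about a site visit on Sunday
```

Tracking the languages a caller has used is one small example of what "remembering a user" could mean in a code-switching context.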
What I Took Away
India is, without exaggeration, one of the most demanding environments in the world to build production voice AI. The language complexity alone would be enough. Add telephony fragmentation, cost pressure, regulatory overhead, and the user expectation that a voice agent should sound like someone a caller from a tier-two city would actually trust, and the problem surface is genuinely hard.
What I observed is a builder ecosystem that is past asking whether Voice AI works. They are deep in the harder question: whether the unit economics work, whether the intelligence layer can be built well enough to justify the infrastructure cost, whether the personalisation and continuity that make a voice agent useful rather than merely functional are achievable at Indian price points and Indian scale. The builders I have most respect for in this space are not the ones claiming the lowest latency or the cheapest per-minute rate. They are the ones who have run out of ways to avoid the memory problem and are now solving it seriously.
The voice pipe is getting cheaper every quarter. The question worth sitting with is what you are building on top of it.
About the Author:
Gaurav Dadhich is the Founder of Maximem, which builds AI memory infrastructure for voice and conversational agents. Observations in this piece are drawn from community engagement and direct conversations with 500+ founders and builders across the Indian Voice AI ecosystem over three months in 2025 and 2026.
Related Docs:
Synap LangChain Integration
Synap LangGraph Integration
The Real Cost of DIY Agent Memory



