When we started building Dilr Voice in late 2025, the landscape for voice AI was fragmented. You could stitch together Twilio, a speech-to-text API, an LLM, and a text-to-speech engine — but the glue code between them was always custom, always brittle, and always the thing that broke at 2am. Then we found Pipecat.
This post is a technical deep-dive into how Dilr Voice is built on Pipecat — what the framework gives us for free, what we had to engineer on top, and what the production architecture looks like when you're handling thousands of concurrent calls across 30+ languages.
What is Pipecat?
Pipecat is an open-source Python framework for building voice and multimodal AI applications. It provides a pipeline abstraction — a directed graph of processors that transform frames of data (audio, text, control signals) as they flow through the system.
Audio comes in from a phone call. Voice Activity Detection detects when someone is speaking. STT converts speech to text. The LLM generates a response. TTS converts it back to speech. Pipecat handles the frame routing, async execution, backpressure, and interruption logic. We handle everything else.
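Pipecat's real classes are far richer than this, but the core frame-routing idea can be sketched in plain Python. Every name below is a toy stand-in, not Pipecat's actual API:

```python
from dataclasses import dataclass

# Toy stand-ins for Pipecat-style frames (illustrative names only)
@dataclass
class Frame:
    kind: str   # "audio", "text", "control"
    data: object

class Processor:
    """A pipeline stage: consumes a frame, emits zero or more frames."""
    def process(self, frame: Frame) -> list[Frame]:
        return [frame]  # default: pass through untouched

class FakeSTT(Processor):
    """Stand-in for a streaming STT stage: audio frames become text frames."""
    def process(self, frame: Frame) -> list[Frame]:
        if frame.kind == "audio":
            return [Frame("text", f"transcript of {len(frame.data)} bytes")]
        return [frame]

class Logger(Processor):
    """Middleware: observe every frame, pass it along unchanged."""
    def process(self, frame: Frame) -> list[Frame]:
        print(f"[{frame.kind}] {frame.data}")
        return [frame]

def run_pipeline(stages: list[Processor], frames: list[Frame]) -> list[Frame]:
    # Push each frame through every stage in order
    for stage in stages:
        out: list[Frame] = []
        for f in frames:
            out.extend(stage.process(f))
        frames = out
    return frames

result = run_pipeline([FakeSTT(), Logger()], [Frame("audio", b"\x00" * 160)])
```

The `Logger` stage is the middleware hook: because every stage sees every frame, logging, metrics, or guardrails drop in anywhere without touching the stages around them.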
What Pipecat gives us
🔌 Provider abstraction
Swap Deepgram for Whisper, OpenAI for Claude, ElevenLabs for Azure — without rewriting the pipeline. Common interface across all providers.
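The shape of such a provider abstraction can be sketched with a structural protocol. The `STTService` interface and both adapters below are hypothetical, not Pipecat's real service classes:

```python
from typing import Protocol

class STTService(Protocol):
    """Common interface every STT provider adapter implements (illustrative)."""
    def transcribe(self, audio: bytes) -> str: ...

class DeepgramSTT:
    def transcribe(self, audio: bytes) -> str:
        return "hello from deepgram"  # a real adapter would stream to the API

class WhisperSTT:
    def transcribe(self, audio: bytes) -> str:
        return "hello from whisper"

def handle_turn(stt: STTService, audio: bytes) -> str:
    # Pipeline code depends only on the protocol, so providers swap freely
    return stt.transcribe(audio)

print(handle_turn(DeepgramSTT(), b"..."))
print(handle_turn(WhisperSTT(), b"..."))
```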
🗣️ Interruption handling
Caller talks over the agent? Pipecat stops TTS, flushes the buffer, captures the new utterance, resumes. Deceptively hard to build from scratch.
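The hard part is cancelling in-flight speech cleanly. A toy asyncio sketch of the barge-in pattern, with made-up timings and no real audio, to show the cancel-and-flush shape rather than Pipecat's implementation:

```python
import asyncio

async def speak(text: str, spoken: list[str]) -> None:
    """Toy TTS playback: one word per tick, cancellable mid-utterance."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.01)

async def call_loop() -> list[str]:
    spoken: list[str] = []
    tts = asyncio.create_task(
        speak("our premium plan includes unlimited minutes", spoken))
    await asyncio.sleep(0.025)   # caller barges in mid-sentence (VAD fires)
    tts.cancel()                 # stop TTS, flush the rest of the utterance
    try:
        await tts
    except asyncio.CancelledError:
        pass
    # ...next step: capture the caller's new utterance, resume the pipeline
    return spoken

spoken = asyncio.run(call_loop())
# only the words played before the interruption made it out
```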
📦 Frame-based architecture
Everything is a frame — AudioRawFrame, TextFrame, LLMMessagesFrame. Insert middleware anywhere: logging, metrics, guardrails, transformations.
⚡ Async pipeline
Fully async Python. STT and TTS run concurrently. No blocking on I/O. Handles 50+ concurrent calls per pod without thread contention.
Our production stack
| Component | Provider | Why we chose it |
|---|---|---|
| VAD | Silero | Best accuracy/latency tradeoff, runs locally — no network hop |
| STT | Deepgram Nova-3 | Lowest latency streaming STT, best accent handling across 30+ languages |
| LLM | GPT-4.1-mini | Best cost/quality for real-time voice — fast enough for sub-second responses |
| TTS | ElevenLabs Flash v2.5 | Most natural voice, lowest latency in its class, streaming output |
| Telephony | Twilio + Vobiz | Twilio for US/UK/global, Vobiz for India (+91) — both configurable per agent |
All configurable per-agent in the Dilr Voice platform. No code changes needed — just update the Smart Agent config.
What we built on top
Pipecat is a framework, not a product. To turn it into Dilr Voice, we engineered six major systems on top.
1. Multi-agent routing (Agent Flows)
Single-agent conversations hit a ceiling fast. A receptionist agent can't also be a billing expert and a technical support specialist. We built Agent Flows — a multi-agent system where specialist agents handle different parts of a call.
- Keyword matching: "book", "cancel", "schedule" → Action Agent. Keywords are pre-defined per agent type, with Hindi supported.
- Handoff phrases: the agent saying "let me look that up" routes to the Knowledge Agent; "let me book that" routes to the Action Agent. Detection is regex-based.
- Fallback: no keyword match and no handoff signal means the call stays on the current agent, with the Marketing agent as the default.
The system prompt is swapped in-place in the LLM context — no new context created, full conversation history preserved across agent switches. The caller never notices.
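A simplified sketch of the routing-plus-swap logic. The keyword tables and prompts here are hypothetical; the real config is defined per agent in the platform:

```python
import re

# Illustrative keyword routes (real ones are per agent type, incl. Hindi)
ROUTES = {
    "action": re.compile(r"\b(book|cancel|schedule)\b", re.IGNORECASE),
    "knowledge": re.compile(r"\blet me look that up\b", re.IGNORECASE),
}

PROMPTS = {
    "marketing": "You are a friendly marketing agent...",
    "action": "You book, cancel and reschedule appointments...",
    "knowledge": "You answer questions from the knowledge base...",
}

def route(utterance: str, current: str = "marketing") -> str:
    for agent, pattern in ROUTES.items():
        if pattern.search(utterance):
            return agent
    return current  # no match, no handoff signal: stay put

def switch_agent(messages: list[dict], agent: str) -> list[dict]:
    """Swap the system prompt in place; the rest of the history is untouched."""
    messages[0] = {"role": "system", "content": PROMPTS[agent]}
    return messages

history = [{"role": "system", "content": PROMPTS["marketing"]},
           {"role": "user", "content": "I'd like to book a table"}]
agent = route(history[-1]["content"])
switch_agent(history, agent)
```

Because only `messages[0]` changes, the full conversation history rides along across every agent switch.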
2. Dynamic system prompt construction
Every call starts by building a system prompt from seven layers of context.
Two calls to the same agent produce completely different system prompts — because the caller's context is different. A returning caller with negative sentiment gets a more empathetic opening.
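Mechanically, this reduces to stacking context layers into one string. The layer names below are illustrative examples, not the actual seven:

```python
def build_system_prompt(layers: list[str]) -> str:
    """Join the non-empty context layers into one system prompt.
    Layer contents below are made up for illustration."""
    return "\n\n".join(layer for layer in layers if layer)

prompt = build_system_prompt([
    "You are the booking agent for Acme Dental.",              # agent persona
    "Caller history: 2 previous calls, last one unresolved.",  # CRM context
    "Last call sentiment: negative. Open with empathy.",       # sentiment layer
])
```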
3. Knowledge Base (RAG)
Callers ask questions the LLM doesn't know — your pricing, your hours, your return policy. We built hybrid retrieval:
- Vector search: `text-embedding-3-small` embeddings in PostgreSQL + `pgvector` (1536-dim)
- BM25 full-text: PostgreSQL `tsvector` for keyword queries
- Reciprocal Rank Fusion: merges both result sets
Upload PDFs, paste URLs (we crawl), or type text — all in the platform.
```python
# Hybrid retrieval: run vector and keyword search, then fuse
vector_results = pgvector.search(embedding, top_k=10)
bm25_results = postgres.fts(query, top_k=10)

# Reciprocal Rank Fusion merges the two ranked lists
final = rrf(vector_results, bm25_results)  # → returns top 5 chunks
```
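The `rrf` step fits in a few lines. `k=60` is the constant from the original RRF paper (ours may differ), and the document ids below are made up:

```python
def rrf(*rankings: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each doc scores the sum of 1/(k + rank)
    over every ranked list it appears in; higher total ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["pricing", "hours", "returns"]
bm25_results = ["returns", "pricing", "shipping"]
fused = rrf(vector_results, bm25_results)
# "pricing" wins: it sits near the top of both lists
```

RRF needs no score calibration between the two retrievers, which is why it pairs well with mixing cosine distances and BM25 scores.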
4. Production telephony
The API server handles config, billing, and post-call processing. The Pipecat Runner handles real-time voice. They communicate via gRPC. This separation means we scale runners independently — more pods when volume spikes, without touching the API.
5. AI tools the agent can use
Tools are configurable per agent in Agent Flows. The Action agent gets book_appointment + send_sms. The Knowledge agent gets search_knowledge_base. The Greeter has none — it just talks.
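One plausible shape for that per-agent tool registry, using function-calling-style schemas. The field layout is an assumption for illustration, not the platform's actual format:

```python
# Hypothetical per-agent tool registry; real schemas live in Agent Flows config
AGENT_TOOLS: dict[str, list[str]] = {
    "action": ["book_appointment", "send_sms"],
    "knowledge": ["search_knowledge_base"],
    "greeter": [],  # no tools - it just talks
}

# Illustrative function-calling schemas for each tool
SCHEMAS: dict[str, dict] = {
    "book_appointment": {
        "name": "book_appointment",
        "description": "Book a calendar slot for the caller",
        "parameters": {"type": "object",
                       "properties": {"time": {"type": "string"}},
                       "required": ["time"]},
    },
    "send_sms": {
        "name": "send_sms",
        "description": "Send a confirmation SMS to the caller",
        "parameters": {"type": "object",
                       "properties": {"body": {"type": "string"}},
                       "required": ["body"]},
    },
    "search_knowledge_base": {
        "name": "search_knowledge_base",
        "description": "Retrieve relevant chunks from the knowledge base",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
}

def tools_for(agent: str) -> list[dict]:
    """Expand an agent's tool names into the schemas passed with the LLM call."""
    return [SCHEMAS[name] for name in AGENT_TOOLS[agent]]
```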
6. Infrastructure
- Platform: GKE · europe-west2
- API: FastAPI + gRPC
- Voice: Pipecat Runner
- Workers: Celery + Redis
- Database: Cloud SQL + pgvector
- STT/TTS pools: no cold-start
- LLM warmup: on WS connect
- Codec: μ-law 8kHz pre-negotiated
- Agent switch: ~1ms in-place
- Concurrency: ~50 calls/pod
The numbers
After 6 months in production on Dilr Voice:
Try it
Dilr Voice is live and free to try — $20 in credits on signup, no credit card required. Build an agent in the visual flow builder, attach a phone number, and call it.
- Open Dilr Voice platform — sign up free, $20 credits
- Book a technical demo — we'll show you the architecture live
- Read: The evaluation harness — how we test voice agents in production