When we started building Dilr Voice in late 2025, the landscape for voice AI was fragmented. You could stitch together Twilio, a speech-to-text API, an LLM, and a text-to-speech engine — but the glue code between them was always custom, always brittle, and always the thing that broke at 2am. Then we found Pipecat.
This post is a technical deep-dive into how Dilr Voice is built on Pipecat — what the framework gives us for free, what we had to engineer on top, and what the production architecture looks like when you're handling thousands of concurrent calls across 30+ languages.
What is Pipecat?
Pipecat is an open-source Python framework for building voice and multimodal AI applications. It provides a pipeline abstraction — a directed graph of processors that transform frames of data (audio, text, control signals) as they flow through the system.
Audio comes in from a phone call. Voice Activity Detection detects when someone is speaking. STT converts speech to text. The LLM generates a response. TTS converts it back to speech. Pipecat handles the frame routing, async execution, backpressure, and interruption logic. We handle everything else.
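Pipecat's real classes are far richer than this, but the core frame-routing idea can be sketched in plain Python. Every name below is a toy stand-in, not Pipecat's actual API:

```python
from dataclasses import dataclass

# Toy stand-ins for Pipecat-style frames (illustrative names only)
@dataclass
class Frame:
    kind: str   # "audio", "text", "control"
    data: object

class Processor:
    """A pipeline stage: consumes a frame, emits zero or more frames."""
    def process(self, frame: Frame) -> list[Frame]:
        return [frame]  # default: pass through untouched

class FakeSTT(Processor):
    """Stand-in for a streaming STT stage: audio frames become text frames."""
    def process(self, frame: Frame) -> list[Frame]:
        if frame.kind == "audio":
            return [Frame("text", f"transcript of {len(frame.data)} bytes")]
        return [frame]

class Logger(Processor):
    """Middleware: observe every frame, pass it along unchanged."""
    def process(self, frame: Frame) -> list[Frame]:
        print(f"[{frame.kind}] {frame.data}")
        return [frame]

def run_pipeline(stages: list[Processor], frames: list[Frame]) -> list[Frame]:
    # Push each frame through every stage in order
    for stage in stages:
        out: list[Frame] = []
        for f in frames:
            out.extend(stage.process(f))
        frames = out
    return frames

result = run_pipeline([FakeSTT(), Logger()], [Frame("audio", b"\x00" * 160)])
```

The `Logger` stage is the middleware hook: because every stage sees every frame, logging, metrics, or guardrails drop in anywhere without touching the stages around them.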
What Pipecat gives us
🔌 Provider abstraction
Swap Deepgram for Whisper, OpenAI for Claude, ElevenLabs for Azure — without rewriting the pipeline. Common interface across all providers.
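The shape of such a provider abstraction can be sketched with a structural protocol. The `STTService` interface and both adapters below are hypothetical, not Pipecat's real service classes:

```python
from typing import Protocol

class STTService(Protocol):
    """Common interface every STT provider adapter implements (illustrative)."""
    def transcribe(self, audio: bytes) -> str: ...

class DeepgramSTT:
    def transcribe(self, audio: bytes) -> str:
        return "hello from deepgram"  # a real adapter would stream to the API

class WhisperSTT:
    def transcribe(self, audio: bytes) -> str:
        return "hello from whisper"

def handle_turn(stt: STTService, audio: bytes) -> str:
    # Pipeline code depends only on the protocol, so providers swap freely
    return stt.transcribe(audio)

print(handle_turn(DeepgramSTT(), b"..."))
print(handle_turn(WhisperSTT(), b"..."))
```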
🗣️ Interruption handling
Caller talks over the agent? Pipecat stops TTS, flushes the buffer, captures the new utterance, resumes. Deceptively hard to build from scratch.
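The hard part is cancelling in-flight speech cleanly. A toy asyncio sketch of the barge-in pattern, with made-up timings and no real audio, to show the cancel-and-flush shape rather than Pipecat's implementation:

```python
import asyncio

async def speak(text: str, spoken: list[str]) -> None:
    """Toy TTS playback: one word per tick, cancellable mid-utterance."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.01)

async def call_loop() -> list[str]:
    spoken: list[str] = []
    tts = asyncio.create_task(
        speak("our premium plan includes unlimited minutes", spoken))
    await asyncio.sleep(0.025)   # caller barges in mid-sentence (VAD fires)
    tts.cancel()                 # stop TTS, flush the rest of the utterance
    try:
        await tts
    except asyncio.CancelledError:
        pass
    # ...next step: capture the caller's new utterance, resume the pipeline
    return spoken

spoken = asyncio.run(call_loop())
# only the words played before the interruption made it out
```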
📦 Frame-based architecture
Everything is a frame — AudioRawFrame, TextFrame, LLMMessagesFrame. Insert middleware anywhere: logging, metrics, guardrails, transformations.
⚡ Async pipeline
Fully async Python. STT and TTS run concurrently. No blocking on I/O. Handles 50+ concurrent calls per pod without thread contention.
Our production stack
| Component | Provider | Why we chose it |
|---|---|---|
| VAD | Silero | Best accuracy/latency tradeoff, runs locally — no network hop |
| STT | Deepgram Nova-3 | Lowest latency streaming STT, best accent handling across 30+ languages |
| LLM | GPT-4.1-mini | Best cost/quality for real-time voice — fast enough for sub-second responses |
| TTS | ElevenLabs Flash v2.5 | Most natural voice, lowest latency in its class, streaming output |
| Telephony | Twilio + Vobiz | Twilio for US/UK/global, Vobiz for India (+91) — both configurable per agent |
All configurable per-agent in the Dilr Voice platform. No code changes needed — just update the Smart Agent config.
What we built on top
Pipecat is a framework, not a product. To turn it into Dilr Voice, we engineered six major systems on top.
1. Multi-agent routing (Agent Flows)
Single-agent conversations hit a ceiling fast. A receptionist agent can't also be a billing expert and a technical support specialist. We built Agent Flows — a multi-agent system where specialist agents handle different parts of a call.
- Keyword matching: "book", "cancel", "schedule" → Action Agent. Keywords are pre-defined per agent type, with Hindi supported.
- Handoff phrases: the agent saying "let me look that up" routes to the Knowledge Agent; "let me book that" routes to the Action Agent. Detection is regex-based.
- Fallback: no keyword match and no handoff signal means the call stays on the current agent, with the Marketing agent as the default.
The system prompt is swapped in-place in the LLM context — no new context created, full conversation history preserved across agent switches. The caller never notices.
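A simplified sketch of the routing-plus-swap logic. The keyword tables and prompts here are hypothetical; the real config is defined per agent in the platform:

```python
import re

# Illustrative keyword routes (real ones are per agent type, incl. Hindi)
ROUTES = {
    "action": re.compile(r"\b(book|cancel|schedule)\b", re.IGNORECASE),
    "knowledge": re.compile(r"\blet me look that up\b", re.IGNORECASE),
}

PROMPTS = {
    "marketing": "You are a friendly marketing agent...",
    "action": "You book, cancel and reschedule appointments...",
    "knowledge": "You answer questions from the knowledge base...",
}

def route(utterance: str, current: str = "marketing") -> str:
    for agent, pattern in ROUTES.items():
        if pattern.search(utterance):
            return agent
    return current  # no match, no handoff signal: stay put

def switch_agent(messages: list[dict], agent: str) -> list[dict]:
    """Swap the system prompt in place; the rest of the history is untouched."""
    messages[0] = {"role": "system", "content": PROMPTS[agent]}
    return messages

history = [{"role": "system", "content": PROMPTS["marketing"]},
           {"role": "user", "content": "I'd like to book a table"}]
agent = route(history[-1]["content"])
switch_agent(history, agent)
```

Because only `messages[0]` changes, the full conversation history rides along across every agent switch.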
2. Dynamic system prompt construction
Every call starts by building a system prompt from seven layers of context.
Two calls to the same agent produce completely different system prompts — because the caller's context is different. A returning caller with negative sentiment gets a more empathetic opening.
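Mechanically, this reduces to stacking context layers into one string. The layer names below are illustrative examples, not the actual seven:

```python
def build_system_prompt(layers: list[str]) -> str:
    """Join the non-empty context layers into one system prompt.
    Layer contents below are made up for illustration."""
    return "\n\n".join(layer for layer in layers if layer)

prompt = build_system_prompt([
    "You are the booking agent for Acme Dental.",              # agent persona
    "Caller history: 2 previous calls, last one unresolved.",  # CRM context
    "Last call sentiment: negative. Open with empathy.",       # sentiment layer
])
```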
3. Knowledge Base (RAG)
Callers ask questions the LLM doesn't know — your pricing, your hours, your return policy. We built hybrid retrieval:
- Vector search: `text-embedding-3-small` embeddings in PostgreSQL + `pgvector` (1536-dim)
- BM25 full-text: PostgreSQL `tsvector` for keyword queries
- Reciprocal Rank Fusion: merges both result sets
Upload PDFs, paste URLs (we crawl), or type text — all in the platform.
```python
# Hybrid retrieval: run vector and keyword search, then fuse
vector_results = pgvector.search(embedding, top_k=10)
bm25_results = postgres.fts(query, top_k=10)

# Reciprocal Rank Fusion merges the two ranked lists
final = rrf(vector_results, bm25_results)  # → returns top 5 chunks
```
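The `rrf` step fits in a few lines. `k=60` is the constant from the original RRF paper (ours may differ), and the document ids below are made up:

```python
def rrf(*rankings: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each doc scores the sum of 1/(k + rank)
    over every ranked list it appears in; higher total ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["pricing", "hours", "returns"]
bm25_results = ["returns", "pricing", "shipping"]
fused = rrf(vector_results, bm25_results)
# "pricing" wins: it sits near the top of both lists
```

RRF needs no score calibration between the two retrievers, which is why it pairs well with mixing cosine distances and BM25 scores.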
4. Production telephony
The API server handles config, billing, and post-call processing. The Pipecat Runner handles real-time voice. They communicate via gRPC. This separation means we scale runners independently — more pods when volume spikes, without touching the API.
5. AI tools the agent can use
Tools are configurable per agent in Agent Flows. The Action agent gets book_appointment + send_sms. The Knowledge agent gets search_knowledge_base. The Greeter has none — it just talks.
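One plausible shape for that per-agent tool registry, using function-calling-style schemas. The field layout is an assumption for illustration, not the platform's actual format:

```python
# Hypothetical per-agent tool registry; real schemas live in Agent Flows config
AGENT_TOOLS: dict[str, list[str]] = {
    "action": ["book_appointment", "send_sms"],
    "knowledge": ["search_knowledge_base"],
    "greeter": [],  # no tools - it just talks
}

# Illustrative function-calling schemas for each tool
SCHEMAS: dict[str, dict] = {
    "book_appointment": {
        "name": "book_appointment",
        "description": "Book a calendar slot for the caller",
        "parameters": {"type": "object",
                       "properties": {"time": {"type": "string"}},
                       "required": ["time"]},
    },
    "send_sms": {
        "name": "send_sms",
        "description": "Send a confirmation SMS to the caller",
        "parameters": {"type": "object",
                       "properties": {"body": {"type": "string"}},
                       "required": ["body"]},
    },
    "search_knowledge_base": {
        "name": "search_knowledge_base",
        "description": "Retrieve relevant chunks from the knowledge base",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
}

def tools_for(agent: str) -> list[dict]:
    """Expand an agent's tool names into the schemas passed with the LLM call."""
    return [SCHEMAS[name] for name in AGENT_TOOLS[agent]]
```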
6. Infrastructure
- Platform: GKE · europe-west2
- API: FastAPI + gRPC
- Voice: Pipecat Runner
- Workers: Celery + Redis
- Database: Cloud SQL + pgvector
- STT/TTS pools: no cold-start
- LLM warmup: on WS connect
- Codec: μ-law 8kHz pre-negotiated
- Agent switch: ~1ms in-place
- Concurrency: ~50 calls/pod
The numbers
After 6 months in production on Dilr Voice:
Try it
Dilr Voice is live and free to try — $20 in credits on signup, no credit card required. Build an agent in the visual flow builder, attach a phone number, and call it.
- Open Dilr Voice platform — sign up free, $20 credits
- Book a technical demo — we'll show you the architecture live
- Read: The evaluation harness — how we test voice agents in production