A Twilio-Style Streaming Bridge for Asterisk, FreeSWITCH & AI Systems
Building an AI voice agent is no longer hard.
Connecting that agent to real phone calls (SIP, PBX, PSTN) is.
Most AI systems operate over WebSockets and PCM audio, while production telephony relies on SIP, RTP, codecs, and PBX logic. This mismatch is where most voice-AI projects fail to reach production.
This article explains how NextGenSwitch acts as a telephony abstraction layer, allowing any AI voice agent to communicate with Asterisk, FreeSWITCH, or PSTN networks using a Twilio-style <Connect><Stream> interface and real-time audio streaming.
The Core Challenge
AI voice systems typically expect:
WebSocket → PCM audio → AI pipeline → PCM audio
Telephony systems operate with:
PSTN → SIP → PBX → RTP (μ-law / A-law)
Key problems:
- SIP and RTP are stateful and codec-specific
- AI pipelines want raw audio frames
- Real-time latency, barge-in, and scaling are non-trivial
- Most AI frameworks are not PBX-aware
The Role of NextGenSwitch
NextGenSwitch sits between PBX/PSTN infrastructure and AI services.
It provides:
- SIP & PSTN termination
- PBX integration
- Programmable Voice API (Twilio-like)
- Real-time WebSocket audio streaming
- Codec and sample-rate normalization
- AI-friendly JSON media events
Your AI service never touches SIP or RTP directly.
Supported Telephony Systems
NextGenSwitch works with standard SIP environments, including:
- Asterisk
- FreeSWITCH
- SIP trunks
- DID / PSTN providers
- GSM gateways
High-Level Architecture
Caller
|
[PSTN / SIP Trunk]
|
[Asterisk / FreeSWITCH]
|
[NextGenSwitch]
|
<WebSocket Audio Stream>
|
[Any AI Voice Service]
Step 1: Answering a Call with Twilio-Style XML
When a call reaches NextGenSwitch, it fetches XML instructions (similar to TwiML).
Minimal XML (only the stream URL is required)
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Connect>
<Stream url="wss://ai.yourdomain.com/ws/voice-agent"/>
</Connect>
</Response>
This single instruction:
- Answers the call
- Opens a bidirectional WebSocket
- Starts real-time audio streaming
Optional Parameters (Examples Only)
Parameters are optional.
They carry metadata only, just like Twilio's <Parameter> element.
<Response>
<Connect>
<Stream url="wss://ai.yourdomain.com/ws/voice-agent">
<Parameter name="agent" value="support-bot"/>
<Parameter name="tenant_id" value="company-01"/>
<Parameter name="language" value="en-US"/>
<Parameter name="context" value="sales-inquiry"/>
</Stream>
</Connect>
</Response>
These parameters are delivered to your AI service in the JSON start event and can be used for routing, prompts, or CRM lookups.
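If the XML is generated per call (for example, per tenant), building it programmatically keeps the markup well-formed and the attribute values escaped. The following is a minimal sketch using only Python's standard library. The function name `build_stream_response` is hypothetical, and how NextGenSwitch fetches the document (URL, HTTP method) is outside the scope of this example.

```python
import xml.etree.ElementTree as ET


def build_stream_response(ws_url, parameters=None):
    """Return a <Response><Connect><Stream> document as a string.

    `parameters` is an optional dict; each item becomes a <Parameter> child.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": ws_url})
    for name, value in (parameters or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": str(value)})
    body = ET.tostring(response, encoding="unicode")
    # ElementTree omits the declaration for unicode output, so add it by hand.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_stream_response(
        "wss://ai.yourdomain.com/ws/voice-agent",
        {"agent": "support-bot", "tenant_id": "company-01"},
    ))
```

Calling the function without `parameters` yields the minimal form shown earlier, with only the stream URL.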
Step 2: WebSocket Streaming Protocol (JSON)
NextGenSwitch uses a Twilio Media Streams–style JSON protocol for audio exchange.
Your AI service only needs to understand four event types:
- start
- media
- stop
- (optional) control events
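One simple way to handle these event types is a dispatch table keyed on the `event` field. The sketch below is illustrative only: the `dispatch` function and the handler signatures are assumptions, not part of the NextGenSwitch API. Unknown events (such as optional control events) are ignored rather than treated as errors.

```python
import json


def dispatch(raw_message, handlers):
    """Route one JSON text message to the handler registered for its event.

    `handlers` maps event names ("start", "media", "stop") to callables that
    take the decoded message dict. Unrecognized events return None.
    """
    msg = json.loads(raw_message)
    handler = handlers.get(msg.get("event"))
    if handler is None:
        return None  # optional control event or something we do not handle
    return handler(msg)


if __name__ == "__main__":
    handlers = {
        "start": lambda m: "call started on " + m["streamId"],
        "media": lambda m: "audio frame received",
        "stop": lambda m: "call ended: " + m["stop"]["reason"],
    }
    print(dispatch('{"event": "stop", "streamId": "S1", "stop": {"reason": "hangup"}}', handlers))
```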
1️⃣ start — Call Initialization
{
"event": "start",
"streamId": "NGS_STREAM_123456",
"start": {
"callId": "NGS_CALL_abc",
"from": "+8801XXXXXXXXX",
"to": "5000",
"customParameters": {
"agent": "support-bot",
"tenant_id": "company-01",
"language": "en-US"
}
}
}
Important
- Save streamId
- All outbound audio must reference this ID
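A minimal sketch of capturing the `start` event in Python might look like the following. The field names are taken from the JSON example above. The `CallSession` class and `session_from_start` function are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Per-call state kept by the AI service (illustrative structure)."""
    stream_id: str
    call_id: str
    caller: str
    callee: str
    params: dict = field(default_factory=dict)


def session_from_start(raw_message):
    """Parse a `start` event and keep the streamId for later outbound audio."""
    msg = json.loads(raw_message)
    if msg.get("event") != "start":
        raise ValueError("expected a start event, got %r" % msg.get("event"))
    start = msg.get("start", {})
    return CallSession(
        stream_id=msg["streamId"],
        call_id=start.get("callId", ""),
        caller=start.get("from", ""),
        callee=start.get("to", ""),
        # Parameters are optional, so default to an empty dict.
        params=dict(start.get("customParameters", {})),
    )
```

The resulting `params` dict can then drive routing or prompt selection, for example by reading `session.params.get("agent")`.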
2️⃣ media — Inbound Audio (Caller → AI)
{
"event": "media",
"streamId": "NGS_STREAM_123456",
"media": {
"payload": "BASE64_AUDIO_BYTES=="
}
}
Audio characteristics:
- Codec: G.711 μ-law
- Sample rate: 8 kHz
- Payload: base64 encoded
NextGenSwitch handles:
- RTP decoding
- Codec normalization
- Telephony timing
Your AI service receives clean, ordered audio frames.
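Most STT engines expect linear PCM rather than μ-law, so the payload usually needs one more decoding step on the AI side. The sketch below decodes a `media` event into 16-bit little-endian PCM using the standard G.711 μ-law expansion, written out by hand to avoid extra dependencies. The function names are hypothetical. The frame duration is simply derived from the byte count (one byte per sample at 8 kHz), and the actual frame size NextGenSwitch sends is not assumed.

```python
import base64
import json
import struct

SAMPLE_RATE_HZ = 8000  # from the article: G.711 mu-law at 8 kHz


def _ulaw_byte_to_pcm16(u):
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values once; decoding is then a table lookup per byte.
ULAW_TO_PCM16 = [_ulaw_byte_to_pcm16(b) for b in range(256)]


def decode_media_event(raw_message):
    """Return (pcm16_le_bytes, duration_ms) for one inbound media event."""
    msg = json.loads(raw_message)
    ulaw = base64.b64decode(msg["media"]["payload"])
    samples = [ULAW_TO_PCM16[b] for b in ulaw]
    pcm16 = struct.pack("<%dh" % len(samples), *samples)
    duration_ms = 1000.0 * len(ulaw) / SAMPLE_RATE_HZ
    return pcm16, duration_ms
```

If the STT engine requires 16 kHz input, a resampling step would follow. That step is not shown here.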
3️⃣ media — Outbound Audio (AI → Caller)
{
"event": "media",
"streamId": "NGS_STREAM_123456",
"media": {
"payload": "BASE64_AUDIO_BYTES=="
}
}
Your AI service sends synthesized audio back using the same structure.
NextGenSwitch:
- Converts audio to telephony format
- Sends it through PBX → SIP → PSTN
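As a rough sketch of the outbound direction, the following compresses 16-bit PCM from a TTS engine to μ-law and wraps it in a `media` message. It assumes the outbound payload uses the same format as inbound audio (G.711 μ-law, 8 kHz, base64); since NextGenSwitch converts audio to telephony format, the accepted input formats should be checked against its documentation. The function names are hypothetical.

```python
import base64
import json
import struct

_BIAS = 0x84   # standard G.711 mu-law bias
_CLIP = 32635  # largest magnitude that survives adding the bias


def _pcm16_to_ulaw_byte(sample):
    """Compress one signed 16-bit linear sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment: the position of the highest set bit above bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def build_media_message(stream_id, pcm16_le):
    """Wrap 16-bit little-endian PCM (already at 8 kHz) in a media event."""
    count = len(pcm16_le) // 2
    samples = struct.unpack("<%dh" % count, pcm16_le[: count * 2])
    ulaw = bytes(_pcm16_to_ulaw_byte(s) for s in samples)
    return json.dumps({
        "event": "media",
        "streamId": stream_id,  # must match the streamId from the start event
        "media": {"payload": base64.b64encode(ulaw).decode("ascii")},
    })
```

TTS engines commonly produce 16 kHz or 24 kHz audio, so a downsampling step to 8 kHz would normally come before this function.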
4️⃣ stop — Call End
{
"event": "stop",
"streamId": "NGS_STREAM_123456",
"stop": {
"reason": "hangup"
}
}
AI Stack: Completely Flexible
NextGenSwitch does not mandate any AI framework.
You can use:
- Any STT engine (cloud or local)
- Any LLM
- Any TTS engine
- Any programming language
Frameworks like Pipecat can be used as a reference implementation, but they are optional, not required.
What matters is:
- You accept WebSocket JSON
- You process audio frames
- You send audio frames back
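These three requirements can be illustrated with a transport-agnostic echo agent, which is a handy first test before wiring in STT, an LLM, and TTS. This is a sketch under assumptions: `run_echo_agent` is a hypothetical name, `incoming` is any async iterable of JSON text messages, `send` is any coroutine that transmits one text message, and echoing works only if outbound audio may use the same encoding as inbound audio.

```python
import asyncio
import json


async def run_echo_agent(incoming, send):
    """Echo every inbound audio frame back to the caller until `stop`.

    Returns the number of frames echoed.
    """
    stream_id = None
    frames_echoed = 0
    async for raw in incoming:
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "start":
            stream_id = msg["streamId"]  # needed for all outbound audio
        elif event == "media" and stream_id is not None:
            reply = {
                "event": "media",
                "streamId": stream_id,
                "media": {"payload": msg["media"]["payload"]},
            }
            await send(json.dumps(reply))
            frames_echoed += 1
        elif event == "stop":
            break
    return frames_echoed


async def _demo():
    async def fake_call():
        yield json.dumps({"event": "start", "streamId": "S1", "start": {}})
        yield json.dumps({"event": "media", "streamId": "S1", "media": {"payload": "/w=="}})
        yield json.dumps({"event": "stop", "streamId": "S1", "stop": {"reason": "hangup"}})

    async def fake_send(text):
        print("->", text)

    await run_echo_agent(fake_call(), fake_send)


if __name__ == "__main__":
    asyncio.run(_demo())
```

With a WebSocket library whose connection object is an async iterable of messages and exposes an awaitable `send` method (the Python `websockets` package works this way), the connection and its `send` method can be passed in directly.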
Why This Architecture Works
| Problem | Solution |
|---|---|
| SIP & RTP complexity | Handled by PBX + NextGenSwitch |
| Codec conversion | Automatic |
| Real-time streaming | WebSocket |
| AI vendor lock-in | None |
| Multi-tenant routing | XML + parameters |
| PSTN scalability | SIP-native |
Common Use Cases
- AI receptionist
- AI customer support agent
- Voice order processing
- Appointment booking
- IVR replacement
- Multilingual / regional AI voice bots
Key Takeaways
- Only <Stream url> is required
- XML parameters are optional examples
- Streaming protocol is Twilio-style JSON
- Telephony audio uses μ-law @ 8 kHz
- AI implementation is fully decoupled
- NextGenSwitch isolates PBX logic from AI logic
References
- Programmable Voice Stream API
https://nextgenswitch.com/docs/programmable-voice-api/#stream
- AI streaming serializer examples
https://github.com/nextgenswitch/ai_agents