Connecting AI Voice Agents to SIP & PSTN Using NextGenSwitch

admin

February 5, 2026

A Twilio-Style Streaming Bridge for Asterisk, FreeSWITCH & AI Systems

Building an AI voice agent is no longer hard.
Connecting that agent to real phone calls (SIP, PBX, PSTN) is.

Most AI systems operate over WebSockets and PCM audio, while production telephony relies on SIP, RTP, codecs, and PBX logic. This mismatch is a common reason voice-AI projects stall before reaching production.

This article explains how NextGenSwitch acts as a telephony abstraction layer, allowing any AI voice agent to communicate with Asterisk, FreeSWITCH, or PSTN networks using a Twilio-style <Connect><Stream> interface and real-time audio streaming.

The Core Challenge

AI voice systems typically expect:

WebSocket → PCM audio → AI pipeline → PCM audio

Telephony systems operate with:

PSTN → SIP → PBX → RTP (μ-law / A-law)

Key problems:

SIP and RTP are stateful and codec-specific
AI pipelines want raw audio frames
Real-time latency, barge-in, and scaling are non-trivial
Most AI frameworks are not PBX-aware

The Role of NextGenSwitch

NextGenSwitch sits between PBX/PSTN infrastructure and AI services.

It provides:

SIP & PSTN termination
PBX integration
Programmable Voice API (Twilio-like)
Real-time WebSocket audio streaming
Codec and sample-rate normalization
AI-friendly JSON media events

Your AI service never touches SIP or RTP directly.

Supported Telephony Systems

NextGenSwitch works with standard SIP environments, including:

Asterisk
FreeSWITCH
SIP trunks
DID / PSTN providers
GSM gateways

High-Level Architecture

Caller
  |
[PSTN / SIP Trunk]
  |
[Asterisk / FreeSWITCH]
  |
[NextGenSwitch]
  |
<WebSocket Audio Stream>
  |
[Any AI Voice Service]

Step 1: Answering a Call with Twilio-Style XML

When a call reaches NextGenSwitch, it fetches XML instructions (similar to TwiML).

Minimal XML (only the stream URL is required)

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent"/>
  </Connect>
</Response>

This single instruction:

Answers the call
Opens a bidirectional WebSocket
Starts real-time audio streaming

Optional Parameters (Examples Only)

Parameters are not mandatory.
They are metadata only, just like Twilio's <Parameter> element.

<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent">
      <Parameter name="agent" value="support-bot"/>
      <Parameter name="tenant_id" value="company-01"/>
      <Parameter name="language" value="en-US"/>
      <Parameter name="context" value="sales-inquiry"/>
    </Stream>
  </Connect>
</Response>

These parameters are delivered to your AI service in the JSON start event and can be used for routing, prompts, or CRM lookups.
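If the XML is generated per call rather than served as a static file, it can be built with a few lines of code. The following is a minimal sketch using Python's standard library; the function name `build_stream_response` is hypothetical, and how NextGenSwitch fetches the document (HTTP method, request fields) is not covered here.

```python
import xml.etree.ElementTree as ET


def build_stream_response(ws_url, parameters=None):
    """Return a <Response><Connect><Stream> document as a string.

    `parameters` is an optional dict; each entry becomes a <Parameter> element.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": ws_url})
    for name, value in (parameters or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": str(value)})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Example values taken from the article; any metadata keys can be used.
xml_doc = build_stream_response(
    "wss://ai.yourdomain.com/ws/voice-agent",
    {"agent": "support-bot", "tenant_id": "company-01"},
)
print(xml_doc)
```

Calling the function without `parameters` produces the minimal form, with only the stream URL.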

Step 2: WebSocket Streaming Protocol (JSON)

NextGenSwitch uses a Twilio Media Streams–style JSON protocol for audio exchange.

Your AI service only needs to understand three event types, plus optional control events:

start
media
stop
(optional) control events

1️⃣ `start` — Call Initialization

{
  "event": "start",
  "streamId": "NGS_STREAM_123456",
  "start": {
    "callId": "NGS_CALL_abc",
    "from": "+8801XXXXXXXXX",
    "to": "5000",
    "customParameters": {
      "agent": "support-bot",
      "tenant_id": "company-01",
      "language": "en-US",
      "context": "sales-inquiry"
    }
  }
}

Important

Save streamId
All outbound audio must reference this ID
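One way to keep the stream ID and call metadata together is to parse the start event into a small session object. This is a sketch only: the `CallSession` class and `parse_start` function are hypothetical names, and the field names are taken from the sample event above.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CallSession:
    stream_id: str          # must be echoed on every outbound media message
    call_id: str
    caller: str
    callee: str
    params: dict = field(default_factory=dict)


def parse_start(raw):
    """Parse a JSON start event into a CallSession."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError("expected a start event")
    start = msg.get("start", {})
    return CallSession(
        stream_id=msg["streamId"],
        call_id=start.get("callId", ""),
        caller=start.get("from", ""),
        callee=start.get("to", ""),
        params=dict(start.get("customParameters", {})),
    )


sample = json.dumps({
    "event": "start",
    "streamId": "NGS_STREAM_123456",
    "start": {
        "callId": "NGS_CALL_abc",
        "from": "+8801XXXXXXXXX",
        "to": "5000",
        "customParameters": {"agent": "support-bot", "language": "en-US"},
    },
})
session = parse_start(sample)
print(session.stream_id, session.params.get("agent"))
```

The custom parameters land in `session.params`, where they can drive routing or prompt selection.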

2️⃣ `media` — Inbound Audio (Caller → AI)

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

Audio characteristics:

Codec: G.711 μ-law
Sample rate: 8 kHz
Payload: base64 encoded

NextGenSwitch handles:

RTP decoding
Codec normalization
Telephony timing

Your AI service receives clean, ordered audio frames.

3️⃣ `media` — Outbound Audio (AI → Caller)

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

Your AI service sends synthesized audio back using the same structure.

NextGenSwitch:

Converts audio to telephony format
Sends it through PBX → SIP → PSTN
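Synthesized audio is normally longer than a single frame, so it helps to split it into small messages before sending. The sketch below makes two assumptions that are not stated in this article: that outbound audio is sent as μ-law at 8 kHz, mirroring the inbound format, and that 20 ms frames (160 bytes) are a suitable chunk size. Check the Stream API documentation for the formats NextGenSwitch actually accepts. `outbound_media_messages` is a hypothetical helper name.

```python
import base64
import json

# Assumption: 20 ms of 8 kHz mu-law audio = 8000 samples/s * 0.020 s * 1 byte.
FRAME_BYTES = 160


def outbound_media_messages(stream_id, ulaw_audio, frame_bytes=FRAME_BYTES):
    """Yield JSON media messages, one per frame, all tagged with stream_id."""
    for offset in range(0, len(ulaw_audio), frame_bytes):
        chunk = ulaw_audio[offset:offset + frame_bytes]
        yield json.dumps({
            "event": "media",
            "streamId": stream_id,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })


# 400 bytes of mu-law silence split into two full frames and one short frame.
messages = list(outbound_media_messages("NGS_STREAM_123456", b"\xff" * 400))
print(len(messages))
```

Each yielded string can be written to the WebSocket as-is; pacing the sends in real time is left to the caller.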

4️⃣ `stop` — Call End

{
  "event": "stop",
  "streamId": "NGS_STREAM_123456",
  "stop": {
    "reason": "hangup"
  }
}

AI Stack: Completely Flexible

NextGenSwitch does not mandate any AI framework.

You can use:

Any STT engine (cloud or local)
Any LLM
Any TTS engine
Any programming language

Frameworks like Pipecat can be used as a reference implementation, but they are optional.

What matters is:

You accept WebSocket JSON
You process audio frames
You send audio frames back
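These three responsibilities can be sketched without committing to any WebSocket library. The hypothetical `EchoAgent` below takes each incoming JSON text message and returns the messages to send back; it simply echoes the caller's audio, standing in for a real STT → LLM → TTS pipeline. Wiring `handle` to an actual WebSocket server is left out.

```python
import json


class EchoAgent:
    """Transport-agnostic handler: feed it incoming messages, send what it returns."""

    def __init__(self):
        self.stream_id = None
        self.closed = False

    def handle(self, raw):
        """Process one JSON message and return a list of JSON replies."""
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "start":
            # Remember the stream ID; outbound audio must reference it.
            self.stream_id = msg["streamId"]
        elif event == "media" and self.stream_id is not None:
            # A real agent would run STT/LLM/TTS here; this one echoes.
            reply = {
                "event": "media",
                "streamId": self.stream_id,
                "media": {"payload": msg["media"]["payload"]},
            }
            return [json.dumps(reply)]
        elif event == "stop":
            self.closed = True
        return []


agent = EchoAgent()
agent.handle('{"event": "start", "streamId": "NGS_STREAM_123456", "start": {}}')
replies = agent.handle(
    '{"event": "media", "streamId": "NGS_STREAM_123456", "media": {"payload": "/////w=="}}'
)
print(replies)
```

Keeping the handler free of network code makes it easy to unit-test and to reuse behind whichever WebSocket server the AI service already runs.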

Why This Architecture Works

Problem	Solution
SIP & RTP complexity	Handled by PBX + NextGenSwitch
Codec conversion	Automatic
Real-time streaming	WebSocket
AI vendor lock-in	None
Multi-tenant routing	XML + parameters
PSTN scalability	SIP-native

Common Use Cases

AI receptionist
AI customer support agent
Voice order processing
Appointment booking
IVR replacement
Multilingual / regional AI voice bots

Key Takeaways

Only <Stream url> is required
XML parameters are optional examples
Streaming protocol is Twilio-style JSON
Telephony audio uses μ-law at 8 kHz
AI implementation is fully decoupled
NextGenSwitch isolates PBX logic from AI logic

References

Programmable Voice Stream API
https://nextgenswitch.com/docs/programmable-voice-api/#stream
AI streaming serializer examples
https://github.com/nextgenswitch/ai_agents
