Connecting AI Voice Agents to SIP & PSTN Using NextGenSwitch

admin
February 5, 2026

A Twilio-Style Streaming Bridge for Asterisk, FreeSWITCH & AI Systems

Building an AI voice agent is no longer hard.
Connecting that agent to real phone calls (SIP, PBX, PSTN) is.

Most AI systems operate over WebSockets and PCM audio, while production telephony relies on SIP, RTP, codecs, and PBX logic. This mismatch is a common reason voice-AI projects stall before reaching production.

This article explains how NextGenSwitch acts as a telephony abstraction layer, allowing any AI voice agent to communicate with Asterisk, FreeSWITCH, or PSTN networks using a Twilio-style <Connect><Stream> interface and real-time audio streaming.

The Core Challenge

AI voice systems typically expect:

WebSocket → PCM audio → AI pipeline → PCM audio

Telephony systems operate with:

PSTN → SIP → PBX → RTP (μ-law / A-law)

Key problems:

  • SIP and RTP are stateful and codec-specific
  • AI pipelines want raw audio frames
  • Real-time latency, barge-in, and scaling are non-trivial
  • Most AI frameworks are not PBX-aware

The Role of NextGenSwitch

NextGenSwitch sits between PBX/PSTN infrastructure and AI services.

It provides:

  • SIP & PSTN termination
  • PBX integration
  • Programmable Voice API (Twilio-like)
  • Real-time WebSocket audio streaming
  • Codec and sample-rate normalization
  • AI-friendly JSON media events

Your AI service never touches SIP or RTP directly.

Supported Telephony Systems

NextGenSwitch works with standard SIP environments, including:

  • Asterisk
  • FreeSWITCH
  • SIP trunks
  • DID / PSTN providers
  • GSM gateways

High-Level Architecture

Caller
  |
[PSTN / SIP Trunk]
  |
[Asterisk / FreeSWITCH]
  |
[NextGenSwitch]
  |
<WebSocket Audio Stream>
  |
[Any AI Voice Service]

Step 1: Answering a Call with Twilio-Style XML

When a call reaches NextGenSwitch, it fetches XML instructions (similar to TwiML).

Minimal XML (only the stream URL is required)

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent"/>
  </Connect>
</Response>

This single instruction:

  • Answers the call
  • Opens a bidirectional WebSocket
  • Starts real-time audio streaming
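
As a rough illustration, the minimal XML above could be served from a small HTTP endpoint. The sketch below uses only Python's standard library. The port, the choice to answer both GET and POST, and the text/xml content type are assumptions, since this article does not specify how NextGenSwitch requests the document.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from xml.sax.saxutils import quoteattr

# Stream URL copied from the example above; replace with your own endpoint.
STREAM_URL = "wss://ai.yourdomain.com/ws/voice-agent"


def minimal_response(stream_url: str) -> bytes:
    """Return the minimal <Connect><Stream> document as UTF-8 bytes."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        "  <Connect>\n"
        # quoteattr() escapes the value and wraps it in quotes.
        f"    <Stream url={quoteattr(stream_url)}/>\n"
        "  </Connect>\n"
        "</Response>\n"
    )
    return xml.encode("utf-8")


class VoiceXMLHandler(BaseHTTPRequestHandler):
    def _send_xml(self) -> None:
        body = minimal_response(STREAM_URL)
        self.send_response(200)
        self.send_header("Content-Type", "text/xml; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    # Assumption: the fetch may arrive as GET or POST, so answer both.
    do_GET = _send_xml
    do_POST = _send_xml


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) the server; call serve_forever() to run it."""
    return HTTPServer(("0.0.0.0", port), VoiceXMLHandler)
```

Calling make_server().serve_forever() would start answering requests; in production this would normally sit behind your existing web framework and TLS termination.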

Optional Parameters (Examples Only)

Parameters are optional.
They carry metadata only, just like Twilio's <Parameter> element.

<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent">
      <Parameter name="agent" value="support-bot"/>
      <Parameter name="tenant_id" value="company-01"/>
      <Parameter name="language" value="en-US"/>
      <Parameter name="context" value="sales-inquiry"/>
    </Stream>
  </Connect>
</Response>

These parameters are delivered to your AI service in the JSON start event and can be used for routing, prompts, or CRM lookups.
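
When the parameters differ per call or per tenant, the XML is usually generated instead of hand-written. A minimal sketch using Python's standard xml.etree.ElementTree follows; the function name build_stream_response is hypothetical, and the element and attribute names are taken from the example above.

```python
import xml.etree.ElementTree as ET


def build_stream_response(stream_url, parameters=None):
    """Build a <Response><Connect><Stream> document as UTF-8 bytes.

    `parameters` is an optional dict of name -> value pairs that become
    <Parameter> children of <Stream>. ElementTree escapes attribute values.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": stream_url})
    for name, value in (parameters or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": str(value)})
    # Passing a byte encoding makes tostring() emit the XML declaration too.
    return ET.tostring(response, encoding="utf-8", xml_declaration=True)
```

For example, build_stream_response("wss://ai.yourdomain.com/ws/voice-agent", {"agent": "support-bot", "language": "en-US"}) yields a document equivalent to the one above with two parameters.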

Step 2: WebSocket Streaming Protocol (JSON)

NextGenSwitch uses a Twilio Media Streams–style JSON protocol for audio exchange.

Your AI service only needs to understand three event types, plus optional control events:

  • start
  • media
  • stop
  • (optional) control events

1️⃣ start — Call Initialization

{
  "event": "start",
  "streamId": "NGS_STREAM_123456",
  "start": {
    "callId": "NGS_CALL_abc",
    "from": "+8801XXXXXXXXX",
    "to": "5000",
    "customParameters": {
      "agent": "support-bot",
      "tenant_id": "company-01",
      "language": "en-US"
    }
  }
}

Important

  • Save streamId
  • All outbound audio must reference this ID
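
One way to handle the start event is to parse it into a small per-call record that keeps the streamId and custom parameters together. The sketch below assumes the field names shown in the JSON above; the CallSession class and parse_start function are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CallSession:
    stream_id: str            # must be echoed on all outbound audio
    call_id: str
    caller: str
    callee: str
    params: dict = field(default_factory=dict)


def parse_start(raw: str) -> CallSession:
    """Turn a raw `start` message into a CallSession."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError(f"expected start event, got {msg.get('event')!r}")
    start = msg.get("start", {})
    return CallSession(
        stream_id=msg["streamId"],
        call_id=start.get("callId", ""),
        caller=start.get("from", ""),
        callee=start.get("to", ""),
        # Parameters are optional, so default to an empty dict.
        params=dict(start.get("customParameters", {})),
    )
```

The params dict can then drive routing or prompt selection, for example session.params.get("language", "en-US").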

2️⃣ media — Inbound Audio (Caller → AI)

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

Audio characteristics:

  • Codec: G.711 μ-law
  • Sample rate: 8 kHz
  • Payload: base64 encoded

NextGenSwitch handles:

  • RTP decoding
  • Codec normalization
  • Telephony timing

Your AI service receives clean, ordered audio frames.
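
Many STT engines expect 16-bit linear PCM, not μ-law, so the payload usually needs one more conversion step on the AI side. The following sketch decodes a media event with the standard G.711 μ-law expansion in pure Python (the audioop module that used to provide this was removed in Python 3.13). Whether your STT engine wants 8 kHz PCM16 or a resampled stream is an assumption to check against its documentation.

```python
import base64
import json
import struct


def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# 256-entry lookup table so decoding costs one index per byte.
ULAW_TABLE = [ulaw_to_pcm16(b) for b in range(256)]


def decode_media(raw: str) -> bytes:
    """Return little-endian PCM16 bytes for one inbound `media` message."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        raise ValueError("not a media event")
    ulaw = base64.b64decode(msg["media"]["payload"])
    samples = [ULAW_TABLE[b] for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

At 8 kHz with one byte per sample, each second of call audio is 8000 payload bytes before base64 encoding and 16000 bytes after expansion to PCM16.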

3️⃣ media — Outbound Audio (AI → Caller)

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

Your AI service sends synthesized audio back using the same structure.

NextGenSwitch:

  • Converts audio to telephony format
  • Sends it through PBX → SIP → PSTN
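
A sketch of the sending side: split synthesized audio into small frames and wrap each one in a media message that carries the saved streamId. The 160-byte frame size (20 ms of 8 kHz μ-law) is an assumption borrowed from common telephony practice, and this article does not state which audio formats NextGenSwitch accepts on the outbound path, so confirm both before relying on them.

```python
import base64
import json

# Assumption: 20 ms frames of 8 kHz mu-law at 1 byte per sample.
FRAME_BYTES = 160


def outbound_media_messages(stream_id: str, audio: bytes, frame_bytes: int = FRAME_BYTES):
    """Yield one JSON `media` message per frame of `audio`."""
    for offset in range(0, len(audio), frame_bytes):
        chunk = audio[offset:offset + frame_bytes]
        yield json.dumps({
            "event": "media",
            "streamId": stream_id,   # the ID received in the start event
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Sending small frames instead of one large blob keeps playback interruptible, which matters for barge-in.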

4️⃣ stop — Call End

{
  "event": "stop",
  "streamId": "NGS_STREAM_123456",
  "stop": {
    "reason": "hangup"
  }
}
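
Taken together, the events above suggest a small dispatcher keyed on the event field. The sketch below is one possible shape, not a NextGenSwitch API; dispatch, handlers, and active_streams are hypothetical names, and events without a handler (such as optional control events) are simply reported as unhandled.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


def dispatch(raw: str, handlers: Dict[str, Handler]) -> bool:
    """Route one WebSocket text message to the handler for its event type.

    Returns False when no handler is registered for the event, so the
    caller can log or ignore it.
    """
    msg = json.loads(raw)
    handler = handlers.get(msg.get("event"))
    if handler is None:
        return False
    handler(msg)
    return True


# Example wiring: track which streams are currently active.
active_streams = set()

handlers = {
    "start": lambda m: active_streams.add(m["streamId"]),
    "media": lambda m: None,  # audio handling would go here
    "stop": lambda m: active_streams.discard(m["streamId"]),
}
```

Cleaning up per-call state in the stop handler prevents sessions from leaking after a hangup.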

AI Stack: Completely Flexible

NextGenSwitch does not mandate any AI framework.

You can use:

  • Any STT engine (cloud or local)
  • Any LLM
  • Any TTS engine
  • Any programming language

Frameworks like Pipecat can serve as a reference implementation, but they are optional.

What matters is:

  • You accept WebSocket JSON
  • You process audio frames
  • You send audio frames back
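
The three requirements above fit in a few lines. The sketch below is an echo agent that sends the caller's own audio straight back, which is a useful first connectivity test before adding STT, an LLM, or TTS. It is written against any connection object that supports async iteration and an async send() method, which is the shape exposed by the third-party websockets library; FakeConnection is a hypothetical in-memory stand-in so the example runs without a server.

```python
import asyncio
import json


async def echo_agent(connection) -> None:
    """Accept JSON events, and echo each inbound audio frame to the caller."""
    stream_id = None
    async for raw in connection:
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "start":
            stream_id = msg["streamId"]
        elif event == "media" and stream_id is not None:
            await connection.send(json.dumps({
                "event": "media",
                "streamId": stream_id,
                "media": {"payload": msg["media"]["payload"]},
            }))
        elif event == "stop":
            break


class FakeConnection:
    """In-memory stand-in for a WebSocket connection (illustration only)."""

    def __init__(self, incoming):
        self._incoming = list(incoming)
        self.sent = []

    def __aiter__(self):
        return self._messages()

    async def _messages(self):
        for message in self._incoming:
            yield message

    async def send(self, data):
        self.sent.append(data)


def run_demo():
    conn = FakeConnection([
        json.dumps({"event": "start", "streamId": "NGS_STREAM_123456", "start": {}}),
        json.dumps({"event": "media", "streamId": "NGS_STREAM_123456",
                    "media": {"payload": "AAAA"}}),
        json.dumps({"event": "stop", "streamId": "NGS_STREAM_123456",
                    "stop": {"reason": "hangup"}}),
    ])
    asyncio.run(echo_agent(conn))
    return conn.sent
```

Replacing the echo branch with calls into an STT, LLM, and TTS pipeline turns this loop into a full voice agent without changing the protocol handling.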

Why This Architecture Works

Problem                  Solution
SIP & RTP complexity     Handled by PBX + NextGenSwitch
Codec conversion         Automatic
Real-time streaming      WebSocket
AI vendor lock-in        None
Multi-tenant routing     XML + parameters
PSTN scalability         SIP-native

Common Use Cases

  • AI receptionist
  • AI customer support agent
  • Voice order processing
  • Appointment booking
  • IVR replacement
  • Multilingual / regional AI voice bots

Key Takeaways

  • Only <Stream url> is required
  • XML parameters are optional metadata
  • Streaming protocol is Twilio-style JSON
  • Telephony audio uses G.711 μ-law at 8 kHz
  • AI implementation is fully decoupled
  • NextGenSwitch isolates PBX logic from AI logic
