Realtime WebSocket

WebSocket provides a reliable transport for realtime communication with rtAV. It's well suited to server-to-server integrations and to web applications that don't require the lowest possible latency.

Overview

The WebSocket endpoint implements the OpenAI Realtime API protocol, making it compatible with existing OpenAI Realtime SDKs. All communication happens over a single WebSocket connection, with audio and video data encoded as base64 strings.

  • Endpoint: wss://api.rtav.io/v1/realtime
  • Protocol: OpenAI Realtime API compatible
  • Audio Format: Base64-encoded PCM audio
  • Video Format: Base64-encoded JPEG frames

Connection Methods

Method 1: Auto-Create Session (OpenAI-Compatible)

Connect with a model parameter to automatically create a session:

// Browser JavaScript
const ws = new WebSocket('wss://api.rtav.io/v1/realtime?model=gpt-5.2');

ws.onopen = () => {
  // Send auth message first
  ws.send(JSON.stringify({
    type: 'auth',
    api_key: 'rtav_ak_your_api_key_here'
  }));
};

Method 2: Connect to Existing Session

Create a session via REST API first, then connect with session_id:

// Browser JavaScript
// 1. Create session
const response = await fetch('https://api.rtav.io/v1/sessions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer rtav_ak_your_api_key_here',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-5.2',
    voice: 'default',
    instructions: 'You are a helpful assistant.'
  })
});

const { id: sessionId } = await response.json();

// 2. Connect via WebSocket
const ws = new WebSocket(`wss://api.rtav.io/v1/realtime?session_id=${sessionId}`);

ws.onopen = () => {
  // Send auth message
  ws.send(JSON.stringify({
    type: 'auth',
    api_key: 'rtav_ak_your_api_key_here'
  }));
};

Authentication

The browser WebSocket API doesn't support custom headers, so when connecting from a browser, authentication must be sent as the first message after the connection opens:

// Browser JavaScript
ws.onopen = () => {
  // Send auth as first message
  ws.send(JSON.stringify({
    type: 'auth',
    api_key: 'rtav_ak_your_api_key_here'
  }));
};

For server-side applications, you can instead authenticate by passing the API key in an Authorization: Bearer header when opening the connection.
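
For example, here is a minimal server-side sketch using the ws package for Node.js (the headers option belongs to ws, not the browser API), assuming the header replaces the first-message auth flow:

// Node.js (npm install ws)
import WebSocket from 'ws';

const ws = new WebSocket('wss://api.rtav.io/v1/realtime?model=gpt-5.2', {
  headers: {
    'Authorization': 'Bearer rtav_ak_your_api_key_here'
  }
});

ws.on('open', () => {
  // Authenticated via the header, so no auth message should be needed
  console.log('Connected');
});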

Sending Messages

Text Input (Closed Loop - Triggers LLM)

Send text messages that will be processed by the LLM:

// Browser JavaScript
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{ type: 'input_text', text: 'Hello, how are you?' }]
  }
}));

Text Input (Open Loop - Direct to Worker)

Send text directly to the worker without LLM processing:

// Browser JavaScript
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'assistant',  // Use 'assistant' role for open loop
    content: [{ type: 'input_text', text: 'Hello!' }]
  }
}));

Audio Input

Send audio data as base64-encoded PCM:

// Browser JavaScript
// Append audio chunks
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64EncodedPCMAudio
}));

// Commit audio buffer to trigger STT processing
ws.send(JSON.stringify({
  type: 'input_audio_buffer.commit'
}));
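
To produce those chunks in the browser, one approach is to capture microphone audio and convert it to 16-bit PCM before encoding. The sketch below assumes the session expects 16-bit mono PCM at 24 kHz; verify both against your session configuration:

// Browser JavaScript
// Assumes 16-bit mono PCM at 24 kHz -- adjust to your session's format.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 24000 });
const source = audioContext.createMediaStreamSource(stream);

// ScriptProcessorNode is deprecated but keeps this sketch short;
// prefer an AudioWorklet in production code.
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (event) => {
  const float32 = event.inputBuffer.getChannelData(0);

  // Convert Float32 samples in [-1, 1] to 16-bit signed PCM
  const pcm16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }

  // Base64-encode the raw bytes and append to the input buffer
  let binary = '';
  const bytes = new Uint8Array(pcm16.buffer);
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: btoa(binary)
  }));
};

source.connect(processor);
processor.connect(audioContext.destination);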

Receiving Messages

Audio Output

Receive audio as base64-encoded PCM deltas:

// Browser JavaScript
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  
  if (data.type === 'response.output_audio.delta') {
    const audioBase64 = data.delta;
    // Decode base64 and play audio
    const audioData = atob(audioBase64);
    const audioBuffer = new ArrayBuffer(audioData.length);
    const view = new Uint8Array(audioBuffer);
    for (let i = 0; i < audioData.length; i++) {
      view[i] = audioData.charCodeAt(i);
    }
    // Play using Web Audio API
  }
};
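
As a sketch of the playback step, assuming the deltas are 16-bit mono PCM at 24 kHz (check your session's actual audio format), each decoded chunk can be scheduled gaplessly with the Web Audio API:

// Browser JavaScript
// Assumes 16-bit mono PCM at 24 kHz -- adjust to your session's format.
const audioContext = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;

function playPcmChunk(bytes) {
  // Reinterpret the raw bytes as 16-bit samples, scaled to [-1, 1]
  const pcm16 = new Int16Array(bytes.buffer, bytes.byteOffset, bytes.byteLength / 2);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) {
    float32[i] = pcm16[i] / 0x8000;
  }

  const buffer = audioContext.createBuffer(1, float32.length, 24000);
  buffer.copyToChannel(float32, 0);

  const sourceNode = audioContext.createBufferSource();
  sourceNode.buffer = buffer;
  sourceNode.connect(audioContext.destination);

  // Queue chunks back to back so playback stays gapless
  nextStartTime = Math.max(nextStartTime, audioContext.currentTime);
  sourceNode.start(nextStartTime);
  nextStartTime += buffer.duration;
}

In the handler above, call playPcmChunk(view) where the '// Play using Web Audio API' comment sits.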

Video Output

Receive video frames as base64-encoded JPEG deltas:

// Browser JavaScript
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  
  if (data.type === 'response.output_image.delta') {
    const imageBase64 = data.delta;
    // Decode and display image
    const imageUrl = 'data:image/jpeg;base64,' + imageBase64;
    document.getElementById('avatar').src = imageUrl;
  }
};
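
Data URLs work for a quick demo, but at realtime frame rates each one re-embeds the full base64 string in the DOM. A possible variant uses object URLs and releases each previous frame once the new one loads (the avatar element id comes from the example above):

// Browser JavaScript
const avatar = document.getElementById('avatar');

function showFrame(imageBase64) {
  // Decode base64 into raw JPEG bytes
  const binary = atob(imageBase64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  const url = URL.createObjectURL(new Blob([bytes], { type: 'image/jpeg' }));
  const previous = avatar.src;
  avatar.onload = () => {
    // Release the previous frame's object URL to avoid leaking memory
    if (previous.startsWith('blob:')) URL.revokeObjectURL(previous);
  };
  avatar.src = url;
}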

Transcripts

Receive input and output transcripts:

// Browser JavaScript
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  
  // Input audio transcription
  if (data.type === 'conversation.item.input_audio_transcription.completed') {
    const transcript = data.transcript;
    console.log('User said:', transcript);
  }
  
  // Output audio transcription
  if (data.type === 'response.output_audio_transcript.delta') {
    const transcript = data.delta;
    console.log('Assistant said:', transcript);
  }
};
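
Because output transcripts arrive as incremental deltas, you'll typically accumulate them into a running string. A minimal sketch (the caption element id is illustrative, and resetting between responses depends on lifecycle events not covered here):

// Browser JavaScript
let assistantTranscript = '';

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'response.output_audio_transcript.delta') {
    // Append each delta to the running transcript
    assistantTranscript += data.delta;
    document.getElementById('caption').textContent = assistantTranscript;
  }
};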

Session Configuration

Update session settings dynamically during an active session:

// Browser JavaScript
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'gpt-5.2',
    instructions: 'You are a helpful coding assistant.',
    voice: 'default',
    face: 'face1',
    driving: 'idle'  // Single motion
  }
}));

Driving Motion Sequences

You can send a sequence of driving motions as an array. The avatar transitions smoothly between motions, playing each intermediate motion once and looping the final motion:

// Browser JavaScript
// Single motion (permanent change)
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    driving: 'Wink'  // Transitions to Wink and loops
  }
}));

// Sequence with intermediate motion
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    driving: ['AgreeYesTotaly', 'default']
    // Plays AgreeYesTotaly once, then transitions to default and loops
  }
}));

// Repeating intermediate motion
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    driving: ['Wink', 'Wink', 'Wink', 'default']
    // Plays Wink 3 times, then transitions to default and loops
  }
}));

See the Assets Guide for more details on driving motion sequences.

Complete Example

// Browser JavaScript
// Create session
const response = await fetch('https://api.rtav.io/v1/sessions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer rtav_ak_your_api_key',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-5.2',
    voice: 'default',
    instructions: 'You are a helpful assistant.'
  })
});

const { id: sessionId } = await response.json();

// Connect WebSocket
const ws = new WebSocket(`wss://api.rtav.io/v1/realtime?session_id=${sessionId}`);

ws.onopen = () => {
  // Authenticate
  ws.send(JSON.stringify({
    type: 'auth',
    api_key: 'rtav_ak_your_api_key'
  }));
  
  // Send a text message after a short delay so the auth message is processed first
  setTimeout(() => {
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text: 'Hello!' }]
      }
    }));
  }, 100);
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Event:', data.type);
  
  // Handle audio
  if (data.type === 'response.output_audio.delta') {
    // Decode and play audio
  }
  
  // Handle video
  if (data.type === 'response.output_image.delta') {
    // Display video frame
    document.getElementById('avatar').src = 
      'data:image/jpeg;base64,' + data.delta;
  }
  
  // Handle transcripts
  if (data.type === 'conversation.item.input_audio_transcription.completed') {
    console.log('User:', data.transcript);
  }
  if (data.type === 'response.output_audio_transcript.delta') {
    console.log('Assistant:', data.delta);
  }
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('WebSocket closed');
};
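
Realtime connections can drop for reasons outside your control, so production code usually reconnects automatically. A simple reconnect-with-backoff wrapper might look like the sketch below; note that whether a closed session can be rejoined with the same session_id depends on the session lifecycle, which this page doesn't specify:

// Browser JavaScript
let retryDelayMs = 1000;

function connectWithRetry() {
  const ws = new WebSocket(`wss://api.rtav.io/v1/realtime?session_id=${sessionId}`);

  ws.onopen = () => {
    retryDelayMs = 1000;  // Reset the backoff on a successful connection
    ws.send(JSON.stringify({
      type: 'auth',
      api_key: 'rtav_ak_your_api_key'
    }));
  };

  ws.onclose = () => {
    // Retry with exponential backoff, capped at 30 seconds
    setTimeout(connectWithRetry, retryDelayMs);
    retryDelayMs = Math.min(retryDelayMs * 2, 30000);
  };

  return ws;
}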

Next Steps