Using Audio Models

How to send audio to, and receive audio from, a /chat/completions endpoint.

Audio Output from a Model

An example of generating a human-like audio response to a text prompt.

Python

import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.haimaker.ai/v1"
)

response = client.chat.completions.create(
    model="openai/gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

print(response.choices[0])

# Decode the base64-encoded audio and write it to a WAV file
wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)

cURL

curl https://api.haimaker.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "openai/gpt-4o-audio-preview",
    "modalities": ["text", "audio"],
    "audio": {"voice": "alloy", "format": "wav"},
    "messages": [
      {
        "role": "user",
        "content": "Is a golden retriever a good family dog?"
      }
    ]
  }'

Audio Input to a Model

An example of sending an audio recording as input alongside a text prompt.

Python

import base64
import requests
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.haimaker.ai/v1"
)

# Fetch the audio file and convert it to a base64 encoded string
url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')

response = client.chat.completions.create(
    model="openai/gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(response.choices[0].message)
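Since the request above also asks for audio output, the reply carries the same audio object as in the first example. A small helper, sketched under the assumption that the message object exposes audio.data and audio.transcript as documented below (the name save_audio_reply is illustrative, not part of the SDK):

```python
import base64

def save_audio_reply(message, path):
    # Write the base64-decoded audio bytes to disk and return the
    # model-generated transcript of the clip.
    with open(path, "wb") as f:
        f.write(base64.b64decode(message.audio.data))
    return message.audio.transcript

# Usage, assuming the `response` from the example above:
# transcript = save_audio_reply(response.choices[0].message, "reply.wav")
# print(transcript)
```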

Response Format with Audio

Below is an example JSON data structure for a message you might receive from a /chat/completions endpoint when requesting audio output.

{
  "index": 0,
  "message": {
    "role": "assistant",
    "content": null,
    "refusal": null,
    "audio": {
      "id": "audio_abc123",
      "expires_at": 1729018505,
      "data": "<bytes omitted>",
      "transcript": "Yes, golden retrievers are known to be ..."
    }
  },
  "finish_reason": "stop"
}
  • audio: If the audio output modality is requested, this object contains data about the audio response from the model.
    • audio.id: Unique identifier for the audio response
    • audio.expires_at: The Unix timestamp (in seconds) for when this audio response will no longer be accessible on the server for use in multi-turn conversations.
    • audio.data: Base64 encoded audio bytes generated by the model, in the format specified in the request.
    • audio.transcript: Transcript of the audio generated by the model.
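Because audio.data can be large, OpenAI's Chat Completions API lets later turns reference a previous audio reply by its id rather than re-sending the base64 payload, for as long as the audio has not passed audio.expires_at. A minimal sketch, assuming this endpoint supports that same pattern; the helper name build_followup_messages is illustrative, and "audio_abc123" is the example id from the response above:

```python
def build_followup_messages(first_question, audio_id, followup_question):
    # Reference the assistant's earlier audio reply by id instead of
    # re-sending its base64 data; valid until audio.expires_at.
    return [
        {"role": "user", "content": first_question},
        {"role": "assistant", "audio": {"id": audio_id}},
        {"role": "user", "content": followup_question},
    ]

# Usage, assuming the `client` from the examples above:
# response = client.chat.completions.create(
#     model="openai/gpt-4o-audio-preview",
#     modalities=["text", "audio"],
#     audio={"voice": "alloy", "format": "wav"},
#     messages=build_followup_messages(
#         "Is a golden retriever a good family dog?",
#         "audio_abc123",
#         "Why do you say that?",
#     ),
# )
```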