Benchmarks

Benchmarks for the haimaker Gateway (Proxy Server), tested against a fake OpenAI endpoint.

Use this config for testing:

model_list:
  - model_name: "fake-openai-endpoint"
    litellm_params:
      model: openai/any
      api_base: https://your-fake-openai-endpoint.com/chat/completions
      api_key: "test"
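
Once the gateway is running with this config, traffic goes through the standard OpenAI-compatible chat completions route. Below is a minimal sketch of a single test request; the base URL http://localhost:4000 and the sk-1234567890 key are placeholders for your deployment:

import requests

# Hypothetical deployment details; replace with your gateway URL and key
PROXY_BASE = "http://localhost:4000"
API_KEY = "sk-1234567890"

response = requests.post(
    f"{PROXY_BASE}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "fake-openai-endpoint",
        "messages": [{"role": "user", "content": "ping"}],
    },
)
print(response.status_code, response.json())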

1 Instance haimaker Proxy

These tests measure the baseline latency characteristics of a single haimaker Proxy instance against the fake-openai-endpoint.

Performance Metrics

Metric                        Value
Requests per Second (RPS)     475
End-to-End Latency P50 (ms)   100
haimaker Overhead P50 (ms)    3
haimaker Overhead P90 (ms)    17
haimaker Overhead P99 (ms)    31

Key Findings

  • Single instance: 475 RPS @ 100ms median latency
  • haimaker adds 3ms P50 overhead, 17ms P90 overhead, 31ms P99 overhead
  • 2 haimaker instances: 950 RPS @ 100ms latency
  • 4 haimaker instances: 1900 RPS @ 100ms latency
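
The figures above suggest roughly linear horizontal scaling at about 475 RPS per instance. As a rough sizing sketch based only on these measurements (and assuming the linear relationship continues to hold for your workload):

import math

# Throughput measured per haimaker instance in these benchmarks (2 CPU / 4GB RAM)
RPS_PER_INSTANCE = 475

def instances_needed(target_rps: float) -> int:
    """Estimate the instance count for a target RPS, assuming linear scaling."""
    return math.ceil(target_rps / RPS_PER_INSTANCE)

print(instances_needed(950))   # 2 instances, matching the 2-instance result
print(instances_needed(1900))  # 4 instances, matching the 4-instance result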

2 Instances

Adding a second instance doubles the RPS while keeping median latency in the 100ms-110ms range.

Metric               haimaker Proxy (2 Instances)
Median Latency (ms)  100
RPS                  950

Machine Spec used for testing

Each machine deploying haimaker had the following specs:

  • 2 CPU
  • 4GB RAM

How to measure haimaker Overhead

All responses from the haimaker Proxy include the x-litellm-overhead-duration-ms header, which reports the latency overhead (in milliseconds) added by the proxy.
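
For a quick spot-check outside of Locust, you can read the header directly from a handful of responses. The snippet below is a sketch that assumes the same placeholder base URL and API key as above:

import statistics

import requests

# Hypothetical deployment details; adjust to your gateway
PROXY_BASE = "http://localhost:4000"
HEADERS = {"Authorization": "Bearer sk-1234567890"}

overheads = []
for _ in range(20):
    response = requests.post(
        f"{PROXY_BASE}/chat/completions",
        headers=HEADERS,
        json={
            "model": "fake-openai-endpoint",
            "messages": [{"role": "user", "content": "overhead check"}],
        },
    )
    value = response.headers.get("x-litellm-overhead-duration-ms")
    if value:
        overheads.append(float(value))

if overheads:
    print("median overhead (ms):", statistics.median(overheads))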

If you want to track this during a load test, you can use the following Locust code:

Locust Code for measuring haimaker Overhead
import os
import uuid

from locust import HttpUser, task, between, events

# Custom metric to track haimaker overhead duration
overhead_durations = []


@events.request.add_listener
def on_request(request_type, name, response_time, response_length, response=None,
               context=None, exception=None, start_time=None, url=None, **kwargs):
    # Only real HTTP responses carry the overhead header; the custom events
    # fired below pass no response object, so they are skipped here.
    if response is not None and hasattr(response, "headers"):
        overhead_duration = response.headers.get("x-litellm-overhead-duration-ms")
        if overhead_duration:
            try:
                duration_ms = float(overhead_duration)
                overhead_durations.append(duration_ms)
                # Report the overhead as a custom metric in the Locust stats
                events.request.fire(
                    request_type="Custom",
                    name="haimaker Overhead Duration (ms)",
                    response_time=duration_ms,
                    response_length=0,
                    exception=None,
                    context={},
                )
            except (ValueError, TypeError):
                pass


class MyUser(HttpUser):
    wait_time = between(0.5, 1)  # Random wait time between requests

    def on_start(self):
        self.api_key = os.getenv("API_KEY", "sk-1234567890")
        self.client.headers.update({"Authorization": f"Bearer {self.api_key}"})

    @task
    def litellm_completion(self):
        # Large, unique prompts ensure there are no cache hits
        payload = {
            "model": "fake-openai-endpoint",  # matches the model_name in the test config
            "messages": [
                {
                    "role": "user",
                    "content": f"{uuid.uuid4()} This is a test there will be no cache hits and we'll fill up the context" * 150,
                }
            ],
            "user": "my-new-end-user-1",
        }
        response = self.client.post("chat/completions", json=payload)

        if response.status_code != 200:
            # Log failed responses to error.txt for later inspection
            with open("error.txt", "a") as error_log:
                error_log.write(response.text + "\n")
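
To run the test, save the snippet above as a locustfile (for example locustfile.py) and point Locust at your gateway, e.g. locust -f locustfile.py --host <your-proxy-url>. The custom "haimaker Overhead Duration (ms)" entry will then appear alongside the regular request stats.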

Logging Callbacks

GCS Bucket Logging

Using GCS Bucket logging has no measurable impact on latency or RPS compared to the basic haimaker Proxy.

Metric               Basic haimaker Proxy    haimaker Proxy with GCS Bucket Logging
RPS                  1133.2                  1137.3
Median Latency (ms)  140                     138
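
For reference, this is roughly what enabling the GCS callback looks like in the proxy config, assuming the gateway follows the LiteLLM-style litellm_settings convention (the bucket name and service-account path would be supplied via environment variables such as GCS_BUCKET_NAME and GCS_PATH_SERVICE_ACCOUNT):

litellm_settings:
  callbacks: ["gcs_bucket"]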

LangSmith Logging

Using LangSmith logging has no measurable impact on latency or RPS compared to the basic haimaker Proxy.

Metric               Basic haimaker Proxy    haimaker Proxy with LangSmith
RPS                  1133.2                  1135
Median Latency (ms)  140                     132
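
Similarly, a sketch of enabling LangSmith logging, again assuming the LiteLLM-style callback settings apply and that the LANGSMITH_API_KEY environment variable is set:

litellm_settings:
  success_callback: ["langsmith"]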

Locust Settings

  • 2500 Users
  • 100 user Ramp Up