Auto-router

Send model: "haimaker/auto" and the auto-router picks the right model for each request. It checks what the request actually needs (vision, tool use, long context), then compares the prompt against routing rules you define by example.

You configure it in the dashboard -- set up routing rules, pick a default model, and assign the router to your API keys. No code changes beyond swapping the model name.

Upgrading from v1?

Routing rules used to be keyword lists. They are now example prompts matched by similarity, and the router learns new rules from your traffic. If you set up a router before this change, read Upgrading from v1 -- a few API fields changed.

Quick start

curl https://api.haimaker.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "haimaker/auto",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}]
  }'

The model field in the response contains the model that actually handled the request (e.g., moonshotai/kimi-k2.5), not haimaker/auto.
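
For callers that are not using curl, the same request can be sketched in Python with only the standard library. The helper names (build_request, send, resolved_model) are illustrative and not part of any haimaker SDK; the URL, headers, and body mirror the curl example above.

```python
import json
import urllib.request

API_URL = "https://api.haimaker.ai/v1/chat/completions"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = {
        "model": "haimaker/auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def resolved_model(response_body: dict) -> str:
    """Return the model that actually served the request."""
    return response_body["model"]


# Usage (performs a real network call, so it is left commented out):
# body = send(build_request("YOUR_API_KEY", "Write a Python function to sort a list"))
# print(resolved_model(body))
```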

How routing works

Every routing decision is deterministic: under a given router configuration, the same request always routes to the same model. There is no LLM call in the routing path, and nothing random. The router evaluates requests in order:

1. Capability detection

The router inspects the request and filters the model pool to only models that can handle it:

| Detected capability | Request signal | Pool filter |
| --- | --- | --- |
| Vision | Message contains image_url content | supports_vision |
| Tool use | Request has tools or functions | supports_function_calling |
| Structured output | response_format.type is json_schema | supports_response_schema |
| Audio input | Message contains input_audio content | supports_audio_input |
| PDF input | Message contains file or document content | supports_pdf_input |
| Web search | Tool with type web_search or web_search_preview | supports_web_search |
| Long context | Estimated token count exceeds model limit | max_input_tokens (with 10% safety buffer) |

This always runs first. If a rule matches but its target model can't handle the request (e.g., the prompt looks like a coding task but includes an image, and the coding model doesn't support vision), that rule is skipped.
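
The table can be read as a function from request to required flags, followed by a filter over the pool. The sketch below is one way to picture that, not the router's actual code: the flag names and request signals come from the table, while the pool entry shape (a dict with flag keys and max_input_tokens) and the function names are assumptions made for illustration.

```python
def detect_capabilities(request: dict) -> set[str]:
    """Map request signals to the pool-filter flags they require."""
    needed: set[str] = set()
    for message in request.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no attachments
        for part in content:
            part_type = part.get("type")
            if part_type == "image_url":
                needed.add("supports_vision")
            elif part_type == "input_audio":
                needed.add("supports_audio_input")
            elif part_type in ("file", "document"):
                needed.add("supports_pdf_input")
    if request.get("tools") or request.get("functions"):
        needed.add("supports_function_calling")
    for tool in request.get("tools") or []:
        if tool.get("type") in ("web_search", "web_search_preview"):
            needed.add("supports_web_search")
    if (request.get("response_format") or {}).get("type") == "json_schema":
        needed.add("supports_response_schema")
    return needed


def filter_pool(pool: list[dict], needed: set[str], estimated_tokens: int) -> list[dict]:
    """Keep models that have every needed flag and enough context headroom."""
    return [
        model
        for model in pool
        if all(model.get(flag) for flag in needed)
        # Integer form of estimated_tokens < max_input_tokens * 0.9
        and estimated_tokens * 10 < model["max_input_tokens"] * 9
    ]
```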

2. Example-based rules

A rule is a set of example prompts plus a target model. Instead of listing keywords, you write 3 to 10 prompts that represent the kind of request you want routed:

"Write a Python function to deduplicate a list"
"Debug this TypeScript error: Cannot read properties of undefined"
"Refactor this SQL query to use a CTE"

The router embeds your examples and combines them into a single reference point (a centroid). At request time, it embeds the last user message the same way and measures its cosine similarity to each rule's centroid. The best-scoring rule above its match threshold wins. The default threshold is 0.80; you can adjust it per rule -- lower means the rule matches more loosely, higher means it only fires on near-identical prompts.

A few properties worth knowing:

  • Matching is semantic, not literal. "Fix this bug in my code" and "debug this function" land near each other even though they share no keywords.
  • Embeddings are computed in-process with a pinned model. No network call, no per-request cost, and the same text produces the same vector forever.
  • Timestamps, UUIDs, and long IDs are stripped before matching. A cron job that embeds the current time in every prompt still matches the same rule each run.
  • Ties go to rule order. If two rules score identically, the one with the lower order number wins.

Rules can also carry trigger conditions: required_capabilities (the rule only fires when the request needs those capabilities) and initial_turn_only (the rule only fires on the first message of a conversation).
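
To make the matching step concrete, here is a minimal sketch using toy two-dimensional vectors in place of real embeddings. It assumes the centroid is a plain mean of the example vectors and that a score equal to the threshold counts as a match; neither detail is confirmed above. The Rule class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Rule:
    rule_id: str
    order: int
    target_model: str
    centroid: list[float]
    threshold: float = 0.80
    required_capabilities: set[str] = field(default_factory=set)
    initial_turn_only: bool = False


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average the example embeddings into one centroid."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_rule(rules, prompt_vector, needed_capabilities, is_initial_turn):
    """Return the highest-scoring rule at or above its threshold, or None."""
    matches = []
    for rule in rules:
        # Trigger conditions: skip rules whose conditions the request fails.
        if not rule.required_capabilities <= needed_capabilities:
            continue
        if rule.initial_turn_only and not is_initial_turn:
            continue
        score = cosine(rule.centroid, prompt_vector)
        if score >= rule.threshold:
            matches.append((-score, rule.order, rule))
    if not matches:
        return None
    # Highest score first; ties go to the lower order number.
    return min(matches, key=lambda m: (m[0], m[1]))[2]
```

Sorting on (negative score, order) is what gives the tie-break described in the list above: two rules with identical scores are separated only by their order number.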

3. Default model

If nothing matches, the default model handles the request. Pick something general-purpose here.

4. Fallback: cheapest capable model

If the default model gets filtered out by capability detection (most often a long-context request that exceeds its window), the router picks the cheapest model in the pool that can handle the request. The 10% context buffer protects correctness, and you don't pay flagship prices for a request that just needed a bigger window.
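
Put together, steps 2 through 4 form a short decision chain over the capability-filtered pool. The sketch below is illustrative only: the single "cost" number per model, the LookupError when nothing is capable, and the choice to try the next-best matching rule after a skipped one are all assumptions. The trigger strings match the values recorded in the spend log.

```python
def choose_model(matched_rules, default_model, capable_pool):
    """
    matched_rules: (rule_id, target_model) pairs, best similarity first.
    capable_pool:  models left after capability filtering, each a dict with
                   "name" and "cost" (hypothetical fields).
    Returns (model_name, trigger).
    """
    capable_names = {model["name"] for model in capable_pool}

    # Step 2: a matching rule wins only if its target survived the filter.
    for rule_id, target in matched_rules:
        if target in capable_names:
            return target, f"rule:{rule_id}"

    # Step 3: otherwise use the default model, if it is capable.
    if default_model in capable_names:
        return default_model, "default"

    # Step 4: fall back to the cheapest model that can handle the request.
    if not capable_pool:
        raise LookupError("no model in the pool can handle this request")
    cheapest = min(capable_pool, key=lambda model: model["cost"])
    return cheapest["name"], "capability-fallback"
```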

Rules the router learns on its own

The auto-router watches your traffic and proposes rules where they would save money. The analysis runs offline, once a day -- never during a request.

Here's the loop:

  1. Routed prompts are captured (see What gets captured below) and grouped into clusters of similar requests.
  2. Clusters that show up repeatedly -- think heartbeat checks, cron report templates, classification prompts your agent sends hundreds of times a day -- get classified by an offline LLM judge: is this trivial, standard, or complex work?
  3. For each cluster, the tuner asks whether a cheaper model in your pool could handle it. Trivial work maps to your cheapest capable model; complex work maps to your most capable one. A proposal is only created when the target is strictly cheaper than what the cluster currently costs. The tuner can never increase your spend.
  4. The proposal either applies automatically (within the guardrails below) or lands in your Review Queue with the judge's reasoning, sample prompts, request volume, and projected monthly savings. You accept or reject it.

Accepted and auto-applied proposals become normal rules, tagged mined in the rule list. You can edit, disable, or delete them like any rule you wrote yourself.

When a proposal applies automatically

A proposal skips the review queue only when all of these hold:

  • The judge rated the cluster trivial and the cluster is tight (the prompts are near-duplicates of each other).
  • The traffic currently goes to your default model or the fallback -- a mined rule never takes traffic away from a rule you wrote.
  • The cluster has real volume: at least 50 requests across at least 3 different days.
  • The cheaper target passes every capability check seen in the cluster's requests.
  • Auto-apply is enabled for your router (it's a toggle in Settings; turn it off and everything goes to the review queue instead).

Every auto-apply shows up in the Changelog tab with a one-click Revert button.
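
The auto-apply conditions amount to a single conjunction. The following sketch expresses them as a predicate; the Cluster fields are hypothetical names, and "tight" is left as a boolean because the tightness cutoff is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    judge_rating: str                # "trivial", "standard", or "complex"
    is_tight: bool                   # prompts are near-duplicates
    current_trigger: str             # "default", "capability-fallback", or "rule:<id>"
    request_count: int
    distinct_days: int
    required_capabilities: set[str]  # every capability seen in the cluster


def can_auto_apply(cluster: Cluster, target_capabilities: set[str],
                   auto_apply_enabled: bool) -> bool:
    """True only when every auto-apply condition holds."""
    return (
        auto_apply_enabled
        and cluster.judge_rating == "trivial"
        and cluster.is_tight
        # Never take traffic away from a rule the user wrote.
        and cluster.current_trigger in ("default", "capability-fallback")
        and cluster.request_count >= 50
        and cluster.distinct_days >= 3
        and cluster.required_capabilities <= target_capabilities
    )
```

Anything that fails this check is not discarded; it goes to the Review Queue instead.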

The safety net

For 7 days after a rule auto-applies, the router watches its error rate. If infrastructure errors (5xx, timeouts) double compared to the router's baseline, the rule is automatically reverted and the changelog records it. Quality problems that don't produce errors are yours to judge -- that's what the changelog and revert button are for.
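
As a rough illustration of the revert check, assume the comparison is between the rule's infrastructure error rate and the router's baseline rate, with "double" meaning at least twice the baseline. How the router actually measures either rate is not specified, so treat the function below as a sketch with hypothetical names.

```python
def should_revert(days_since_apply: int, rule_errors: int, rule_requests: int,
                  baseline_error_rate: float) -> bool:
    """Decide whether an auto-applied rule should be rolled back."""
    if days_since_apply > 7 or rule_requests == 0:
        return False  # outside the watch window, or nothing to measure
    rule_error_rate = rule_errors / rule_requests
    # Require at least one error so a zero baseline cannot trigger a revert
    # on error-free traffic.
    return rule_error_rate > 0 and rule_error_rate >= 2 * baseline_error_rate
```

With a 1% baseline, 3 errors in 100 requests (3%) would trigger a revert, while 1 error in 100 (1%) would not.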

What gets captured, and how to turn it off

To learn from traffic, the router stores the last user message of each auto-routed request:

  • Normalized and capped at 2KB -- timestamps, UUIDs, and long IDs are stripped.
  • Deduplicated: 500 identical heartbeats on one day become a single row with a hit count, not 500 rows.
  • Deleted after 30 days.
  • Used only for your router's own tuning and the Traffic tab. Routers never learn from each other's traffic.

If you don't want prompts stored at all, turn off traffic capture in your router's Settings tab. The router keeps working; it just stops learning, and the Traffic tab goes quiet.
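
The normalization and deduplication steps can be pictured as follows. The regular expressions are guesses at what "timestamps, UUIDs, and long IDs" might cover (ISO-style timestamps, canonical UUIDs, hex runs of 16 or more characters), and 2KB is taken as 2048 bytes of UTF-8; the router's real patterns may differ.

```python
import re
from collections import Counter

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
TIMESTAMP_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}(?:[T ]\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:Z|[+-]\d{2}:?\d{2})?)?"
)
LONG_ID_RE = re.compile(r"\b[0-9a-fA-F]{16,}\b")
MAX_BYTES = 2048  # assumed meaning of "2KB"


def normalize(prompt: str) -> str:
    """Strip volatile tokens, collapse whitespace, and cap the size."""
    text = UUID_RE.sub("", prompt)
    text = TIMESTAMP_RE.sub("", text)
    text = LONG_ID_RE.sub("", text)
    text = " ".join(text.split())
    # Truncate on bytes; drop any multi-byte character cut in half.
    return text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")


def capture(rows: Counter, day: str, prompt: str) -> None:
    """Record one request as a (day, normalized prompt) row with a hit count."""
    rows[(day, normalize(prompt))] += 1


rows: Counter = Counter()
for minute in range(500):
    capture(rows, "2026-06-01", f"Heartbeat check at 2026-06-01T12:{minute % 60:02d}:00Z")
# All 500 heartbeats collapse into a single row with a hit count of 500.
```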

Setting up a router

1. Create a router

Go to Auto-Router in the dashboard and click Create New. Pick a default model -- this is where requests go when no rule matches.

2. Write rules by example

For each kind of traffic you want routed somewhere specific, add a rule with a handful of example prompts and a target model. Real prompts from your application make the best examples. Three tips:

  • Use 3 to 10 examples per rule. One example works but matches narrowly; examples that vary in phrasing widen the match.
  • Keep each rule about one kind of task. A rule whose examples mix coding questions and translation requests will match neither reliably.
  • After saving, test it in the sandbox (below) with prompts that should and shouldn't match.

3. Add capability-based rules

You can also create rules that trigger on what the request contains, independent of content. Set required_capabilities on a rule and leave the examples empty:

  • Route all image requests to a vision-capable model
  • Route all function-calling requests to a model with strong tool use
  • Route structured-output requests to a model that handles JSON schemas well

The reasoning capability is special. There's no way to detect "this prompt needs a reasoning model" from request structure alone, so it always passes the capability check. Use it alongside example prompts -- e.g., a rule with analytical example prompts and the reasoning capability routes that work to a reasoning model.

4. Assign to API keys

In the dashboard, edit an API key and select your auto-router from the dropdown. Any request from that key using model: "haimaker/auto" will use your router configuration.

Testing your configuration

The Test sandbox tab is a similarity debugger. Type a sample prompt and you get:

  • The model the router would select, and why (example-match, capability-fallback, or default)
  • A per-rule breakdown: each rule's similarity score against its threshold, whether it matched, and if not, why it was skipped (disabled, capability-mismatch, not-initial-turn, target-not-capable)
  • The capabilities detected in the request

This calls the simulate endpoint without making an actual LLM request, so it's free and fast. It's the quickest way to answer "why did this prompt go to that model?" -- and to tune a threshold when a rule matches too eagerly or not at all.

Observability

The spend log metadata field records every routing decision:

{
  "auto_routed_from": "haimaker/auto",
  "auto_routed_model": "moonshotai/kimi-k2.5",
  "auto_routing_trigger": "rule:abc-123",
  "auto_routing_similarity": "0.91",
  "auto_routing_rule_source": "manual"
}

Cost tracking and rate limits apply to the resolved model, not haimaker/auto.

Query routing history

Fetch your recent spend logs to see how requests were routed. Non-admin users are automatically restricted to their own data.

curl "https://api.haimaker.ai/spend/logs/v2?start_date=2026-06-01&end_date=2026-06-08&page=1&page_size=50" \
  -H "Authorization: Bearer YOUR_API_KEY"

| Metadata field | Description |
| --- | --- |
| auto_routed_from | Always "haimaker/auto" for auto-routed requests |
| auto_routed_model | The model that was selected |
| auto_routing_trigger | "default", "rule:<id>", or "capability-fallback" |
| auto_routing_similarity | Cosine similarity of the matched rule (example matches only) |
| auto_routing_rule_source | "manual" for rules you wrote, "mined" for learned rules |

The dashboard's Traffic tab shows the same data aggregated: your top prompt templates over the last 30 days, how often each one hits, and which model serves it. It's the fastest way to spot high-volume traffic still going to the default model -- exactly the traffic the tuner targets, and a good place to find candidates for a manual rule.
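
If you prefer to do this aggregation yourself, the metadata fields are enough to tally routing decisions. The sketch below assumes the spend log response has already been decoded into a list of entries, each with a "metadata" dict; that shape is an assumption, so adjust the access path to whatever the endpoint actually returns.

```python
from collections import Counter


def summarize_routing(logs: list[dict]) -> Counter:
    """Count auto-routed requests per (trigger, resolved model) pair."""
    counts: Counter = Counter()
    for entry in logs:
        metadata = entry.get("metadata") or {}
        if metadata.get("auto_routed_from") != "haimaker/auto":
            continue  # not an auto-routed request
        key = (metadata.get("auto_routing_trigger"), metadata.get("auto_routed_model"))
        counts[key] += 1
    return counts


# Usage: list the busiest routes first.
# for (trigger, model), hits in summarize_routing(logs).most_common():
#     print(f"{hits:6d}  {trigger:24s} {model}")
```

A large count under the "default" trigger points at traffic that might deserve its own rule.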

Limits and edge cases

No recursive routing. You can't use haimaker/auto as a rule target or default model. The API rejects this, and there's a runtime check as a safety net.

One router per key. Each API key can have one auto-router assigned. If no router is assigned, requests to haimaker/auto return an error.

Context length safety buffer. Token estimation uses a 10% safety margin. A model is only considered if estimated_tokens < max_input_tokens * 0.9. This avoids routing to a model that then fails at the provider with a context length error.
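
A quick worked example of the buffer, using a hypothetical model with a 128,000-token input limit: the cutoff is 128,000 * 0.9 = 115,200, so an estimate of 115,199 tokens is accepted and 115,200 is not. The check below uses integer arithmetic, which is equivalent to the inequality above and avoids floating-point rounding at the boundary.

```python
def fits_context(estimated_tokens: int, max_input_tokens: int) -> bool:
    """Integer form of estimated_tokens < max_input_tokens * 0.9."""
    return estimated_tokens * 10 < max_input_tokens * 9


print(fits_context(115_199, 128_000))  # True: just under the cutoff
print(fits_context(115_200, 128_000))  # False: the inequality is strict
```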

Caching. Router configurations are cached for 60 seconds. Changes you make in the dashboard take up to a minute to take effect.

Identical prompts route identically. Repeated prompts hit an exact-match cache, so high-frequency templated traffic (heartbeats, cron jobs) doesn't even pay the (sub-millisecond) embedding cost.