# ⚡ Best Practices for Production
## 1. Use this config.yaml

Use this config.yaml in production (with your own LLMs):
```yaml
model_list:
  - model_name: fake-openai-endpoint
    litellm_params:
      model: openai/fake
      api_key: fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.app/

general_settings:
  master_key: sk-1234      # enter your own master key, ensure it starts with 'sk-'
  alerting: ["slack"]      # Setup slack alerting - get alerts on LLM exceptions, Budget Alerts, Slow LLM Responses
  proxy_batch_write_at: 60 # Batch write spend updates every 60s
  database_connection_pool_limit: 10 # connection pool limit per worker process. Total connections = limit × workers × instances. Calculate: MAX_DB_CONNECTIONS / (instances × workers). Default: 10.

  # OPTIONAL Best Practices
  disable_error_logs: True # turn off writing LLM Exceptions to DB
  allow_requests_on_db_unavailable: True # ONLY use when running haimaker in your VPC. Allows requests to still be processed even if the DB is unavailable. We recommend this only if haimaker runs in a VPC that cannot be accessed from the public internet.

litellm_settings:
  request_timeout: 600 # raise Timeout error if call takes longer than 600 seconds. Default value is 6000 seconds if not set
  set_verbose: False   # Switch off Debug Logging, ensure your logs do not have any debugging on
  json_logs: true      # Get debug logs in json format
```

:::warning
**Multiple instances:** If running multiple LiteLLM instances (e.g., Kubernetes pods), remember each instance multiplies your total connections. Example: 3 instances × 4 workers × 10 connections = 120 total connections.
:::
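To size `database_connection_pool_limit`, work backwards from your database's `max_connections`. A minimal worked example (the Postgres numbers here are hypothetical; check your own server settings):

```yaml
# Hypothetical sizing: Postgres max_connections = 100, ~20 reserved for other clients,
# running 2 haimaker instances with 4 workers each:
#   80 usable connections / (2 instances × 4 workers) = 10 connections per worker
general_settings:
  database_connection_pool_limit: 10
```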
Set the Slack webhook URL in your env:

```bash
export SLACK_WEBHOOK_URL="example-slack-webhook-url"
```
Turn off FastAPI's default info logs:

```bash
export LITELLM_LOG="ERROR"
```
Need help or want dedicated support? Talk to a founder [here](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat).
## 2. Recommended Machine Specifications
For optimal performance in production, we recommend the following minimum machine specifications:
| Resource | Recommended Value |
|---|---|
| CPU | 4 vCPU |
| Memory | 8 GB RAM |
These specifications provide:
- Sufficient compute power for handling concurrent requests
- Adequate memory for request processing and caching
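For Kubernetes deployments, a minimal sketch of container resources matching the table above (the values are the recommendations, not hard requirements):

```yaml
# Sketch: container resources matching the recommended minimum specs
resources:
  requests:
    cpu: "4"      # 4 vCPU
    memory: 8Gi   # 8 GB RAM
  limits:
    memory: 8Gi   # cap memory; CPU limit intentionally omitted to avoid throttling
```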
## 3. On Kubernetes - Match Uvicorn Workers to CPU Count [Suggested CMD]
Use this Docker CMD. It matches the number of Uvicorn workers to the pod's CPU count, so each worker maps to one core, improving throughput and stabilizing latency.

```dockerfile
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--num_workers", "$(nproc)"]
```

Note: exec-form `CMD` is not run through a shell, so confirm that your image's entrypoint expands `$(nproc)`; otherwise substitute the core count directly.
Optional: If you observe gradual memory growth under sustained load, consider recycling workers after a fixed number of requests to mitigate leaks. You can configure this either via CLI or environment variable:
```dockerfile
# CLI
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--num_workers", "$(nproc)", "--max_requests_before_restart", "10000"]
```

```bash
# or ENV (for deployment manifests / containers)
export MAX_REQUESTS_BEFORE_RESTART=10000
```
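In Kubernetes manifests, the ENV form can be set on the container directly. A minimal sketch (the surrounding deployment fields are omitted):

```yaml
# Sketch: setting the worker-recycle threshold via env in a pod/deployment spec
env:
  - name: MAX_REQUESTS_BEFORE_RESTART
    value: "10000"
```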
:::tip
When using `--max_requests_before_restart`, the `--run_gunicorn` flag is more stable and mature, as it uses Gunicorn's battle-tested worker recycling mechanism instead of Uvicorn's implementation.
:::
```dockerfile
# Use Gunicorn for more stable worker recycling
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml", "--num_workers", "$(nproc)", "--run_gunicorn", "--max_requests_before_restart", "10000"]
```
## 4. Use Redis 'port', 'host', 'password'. NOT 'redis_url'
If you decide to use Redis, DO NOT use `redis_url`. We recommend using the `redis_host`, `redis_port`, and `redis_password` params.

`redis_url` is 80 RPS slower. This is still something we're investigating. Keep track of it here.
**Redis Version Requirement**
| Component | Minimum Version |
|---|---|
| Redis | 7.0+ |
Recommended to do this for prod:
```yaml
router_settings:
  routing_strategy: simple-shuffle # (default) - recommended for best performance
  # redis_url: "os.environ/REDIS_URL"
  redis_host: os.environ/REDIS_HOST
  redis_port: os.environ/REDIS_PORT
  redis_password: os.environ/REDIS_PASSWORD

litellm_settings:
  cache: True
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
```
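The `os.environ/...` values above are read from your environment at startup. A minimal sketch of exporting them (hostname and password here are placeholders):

```bash
# Placeholders - point these at your actual Redis deployment
export REDIS_HOST="my-redis.internal"
export REDIS_PORT="6379"
export REDIS_PASSWORD="my-redis-password"
```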
:::warning
Usage-based routing is not recommended for production due to performance impacts. Use `simple-shuffle` (default) for optimal performance in high-traffic scenarios.
:::
## 5. Disable 'load_dotenv'
Set this in your env:

```bash
export LITELLM_MODE="PRODUCTION"
```

This disables `load_dotenv()`, which would otherwise automatically load environment credentials from a local `.env` file.
## 6. If running haimaker in a VPC, gracefully handle DB unavailability
When running haimaker in a VPC (inaccessible from the public internet), you can enable graceful degradation so that request processing continues even if the database is temporarily unavailable.
:::warning
Only do this if haimaker runs in a VPC that cannot be accessed from the public internet.
:::
### Configuration

```yaml
general_settings:
  allow_requests_on_db_unavailable: True
```
### Expected Behavior

When `allow_requests_on_db_unavailable` is set to `true`, haimaker handles errors as follows:
| Type of Error | Expected Behavior | Details |
|---|---|---|
| Prisma Errors | ✅ Request will be allowed | Covers issues like DB connection resets or rejections from the DB via Prisma, the ORM used by haimaker. |
| Httpx Errors | ✅ Request will be allowed | Occurs when the database is unreachable, allowing the request to proceed despite the DB outage. |
| Pod Startup Behavior | ✅ Pods start regardless | haimaker Pods will start even if the database is down or unreachable, ensuring higher uptime guarantees for deployments. |
| Health/Readiness Check | ✅ Always returns 200 OK | The /health/readiness endpoint returns a 200 OK status to ensure that pods remain operational even when the database is unavailable. |
| haimaker Budget Errors or Model Errors | ❌ Request will be blocked | Triggered when the DB is reachable but the authentication token is invalid, lacks access, or exceeds budget limits. |
More information about what the Database is used for here
## 7. Use Helm PreSync Hook for Database Migrations [BETA]
To ensure only one service manages database migrations, use our Helm PreSync hook for Database Migrations. This ensures migrations are handled during helm upgrade or helm install, while haimaker pods explicitly disable migrations.
- **Helm PreSync Hook**:
  - The Helm PreSync hook is configured in the chart to run database migrations during deployments.
  - The hook always sets `DISABLE_SCHEMA_UPDATE=false`, ensuring migrations are executed reliably.
Reference settings to set on ArgoCD for `values.yaml`:

```yaml
db:
  useExisting: true # use existing Postgres DB
  url: postgresql://ishaanjaffer0324:... # url of existing Postgres DB
```
- **haimaker Pods**:
  - Set `DISABLE_SCHEMA_UPDATE=true` in haimaker pod configurations to prevent them from running migrations.

Example configuration for a haimaker pod:

```yaml
env:
  - name: DISABLE_SCHEMA_UPDATE
    value: "true"
```
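For reference, this is roughly what an ArgoCD PreSync migration Job looks like. The chart ships its own hook, so treat this as an illustrative sketch only; the Job name and image tag are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: litellm-db-migrations          # illustrative name
  annotations:
    argocd.argoproj.io/hook: PreSync   # run before the main sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ghcr.io/berriai/litellm-database:main-latest # assumed image tag
          env:
            - name: DISABLE_SCHEMA_UPDATE
              value: "false"           # the hook runs migrations
```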
## 8. Set haimaker Salt Key
If you plan on using the DB, set a salt key for encrypting/decrypting variables in the DB.

**Do not change this after adding a model.** It is used to encrypt/decrypt your LLM API Key credentials.

We recommend using a password generator, e.g. https://1password.com/password-generator/, to get a random hash for your salt key.
```bash
export LITELLM_SALT_KEY="sk-1234"
```
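Alternatively, a quick way to generate a random key locally (the `sk-` prefix here mirrors the example above and is an assumption, not a requirement):

```bash
# Generate 32 random bytes, hex-encoded, as the salt key
export LITELLM_SALT_KEY="sk-$(openssl rand -hex 32)"
```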
## 9. Use prisma migrate deploy

Use this to handle DB migrations across haimaker versions in production.

- ENV:

  ```bash
  USE_PRISMA_MIGRATE="True"
  ```

- CLI:

  ```bash
  litellm
  ```
Benefits:
The `migrate deploy` command:
- Does not issue a warning if an already applied migration is missing from migration history
- Does not detect drift (production database schema differs from migration history end state - for example, due to a hotfix)
- Does not reset the database or generate artifacts (such as Prisma Client)
- Does not rely on a shadow database
### How does haimaker handle DB migrations in production?
1. A new migration file is written to our `litellm-proxy-extras` package. See all
2. The core litellm pip package is bumped to point to the new `litellm-proxy-extras` package. This ensures older versions of haimaker continue to use the old migrations. See code
3. When you upgrade to a new version of haimaker, the migration file is applied to the database. See code
### Read-only File System
If you see a `Permission denied` error, the haimaker pod is running with a read-only file system.

To fix this, set `LITELLM_MIGRATION_DIR="/path/to/writeable/directory"` in your environment. haimaker will use this directory to write migration files.
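On Kubernetes with a read-only root filesystem, a minimal sketch is to point `LITELLM_MIGRATION_DIR` at a writable `emptyDir` mount (the path is hypothetical):

```yaml
# Container-level settings
env:
  - name: LITELLM_MIGRATION_DIR
    value: /tmp/litellm-migrations   # hypothetical writable path
volumeMounts:
  - name: migrations
    mountPath: /tmp/litellm-migrations
# Pod-level setting
volumes:
  - name: migrations
    emptyDir: {}
```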
## 10. Use a Separate Health Check App
The separate health check app only runs when using the haimaker Docker image with the `SEPARATE_HEALTH_APP` env var set to `"1"`.
Using a separate health check app ensures that your liveness and readiness probes remain responsive even when the main application is under heavy load.
Why is this important?
- If your health endpoints share the same process as your main app, high traffic or resource exhaustion can cause health checks to hang or fail.
- When Kubernetes liveness probes hang or time out, it may incorrectly assume your pod is unhealthy and restart it—even if the main app is just busy, not dead.
- By running health endpoints on a separate lightweight FastAPI app (with its own port), you guarantee that health checks remain fast and reliable, preventing unnecessary pod restarts during traffic spikes or heavy workloads.
- If either the health app or the main proxy app dies for any reason, the health check probe fails, the pod is marked unhealthy, and the orchestrator restarts it.
- Since the proxy and the health app run in the same pod, a failed health probe reliably signals that the pod is unhealthy and needs to be restarted or otherwise acted on.
**How to enable:**

Set the following environment variable(s):

```bash
SEPARATE_HEALTH_APP="1"     # Default "0"
SEPARATE_HEALTH_PORT="8001" # Default "4001", works only if SEPARATE_HEALTH_APP is "1"
```
Or watch on Loom.
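With the separate health app enabled, point your Kubernetes probes at its port rather than the main proxy port. A minimal sketch, assuming the `SEPARATE_HEALTH_PORT="8001"` value from above and LiteLLM's standard health endpoints:

```yaml
# Probes target the lightweight health app, not the main proxy port
livenessProbe:
  httpGet:
    path: /health/liveliness   # LiteLLM's liveness endpoint
    port: 8001
readinessProbe:
  httpGet:
    path: /health/readiness    # returns 200 OK even if the DB is down (see section 6)
    port: 8001
```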
### High Level Architecture
## Extras
### Expected Performance in Production
See benchmarks here
### Verifying Debug Logs are Off

You should only see the following level of detail in logs on the proxy server:
```shell
# INFO: 192.168.2.205:11774 - "POST /chat/completions HTTP/1.1" 200 OK
# INFO: 192.168.2.205:34717 - "POST /chat/completions HTTP/1.1" 200 OK
# INFO: 192.168.2.205:29734 - "POST /chat/completions HTTP/1.1" 200 OK
```