Recently, during a call with some friends, we talked about using an LLM router to improve model usage and token efficiency. I looked around at what solutions I could easily run locally to provide such capabilities. I also wanted something that would give me more visibility into my token and model consumption.
LiteLLM does exactly that. Plus, it lets me use all models and routers with a single API key. One local proxy that routes everything, tracks spend, caches responses, and picks the right model based on how hard the task is. Sounds too good to be true?
That’s what we’re going to cover in this post.
If you want to try OpenCode Go, sign up with my referral link — we both get a $5 usage credit.
If you are a human#
Here is the full story of the setup and why each decision was made.
What is an LLM gateway, and why would you want one#
An LLM gateway is a proxy that sits between your applications (or coding agent) and the various LLM providers you use. Instead of configuring your tools with multiple API keys, base URLs, and model-specific settings, you point everything at one local endpoint. The gateway handles the rest.
Think of it like an API aggregator for language models. You send a standard OpenAI-style request to http://localhost:4000 (if hosted locally, of course), and the gateway forwards it to Anthropic, OpenAI, Gemini, or any other provider you have configured. The response comes back in the same format regardless of which model actually served it.
This matters for a few reasons:
Unified access. One API key, one base URL, one response format. Your coding tools, scripts, and experiments all talk to the same local endpoint. Switching from GPT-5.5 to Claude Sonnet is a one-line change in the model name, not a provider swap.
Cost management. LiteLLM tracks spend per model, per key, and per user. You can see which models are costing you money, set budgets, and rotate provider keys without touching client configurations. This is especially useful when you are experimenting with multiple providers and need to know where the budget is going.
Response caching. If you or your tools send the same prompt twice, a cached response comes back instantly without hitting the provider API again. That saves money and reduces latency for repeated queries. Redis handles this transparently.
Virtual key management. Instead of embedding your real provider keys in every tool, you generate virtual keys inside the gateway. Each virtual key can be scoped to specific models or rate-limited. If a key leaks, you revoke just that virtual key without touching your actual OpenAI or Anthropic accounts.
Provider abstraction. The OpenAI API format has become the de facto standard. A gateway lets you use that same format with Anthropic, Google, Mistral, and any other provider. No need to learn different SDKs or handle different response shapes.
Failover and routing. Some gateways can fall back to a secondary provider if the primary one is down. Others, like the complexity router I set up, can pick different models for different tasks automatically.
In short: an LLM gateway turns a mess of provider-specific integrations into one clean, observable, controllable interface.
What this stack does#
I deployed a local LiteLLM proxy on my laptop using a docker compose stack. It holds 27 models from 5 providers in a single configuration file, plus three complexity routers on top.
The providers I configured in LiteLLM are:
| Provider | Models |
|---|---|
| OpenAI | GPT-5.5, GPT-5.4, GPT-5.4 Mini |
| Anthropic | Claude Haiku 4.5, Sonnet 4.5, Sonnet 4.6, Opus 4.5 through 4.8 |
| Google Gemini | Gemini 2.5 Pro, 3.1 Pro Preview, 3.5 Flash |
| Mistral | Mistral Small 4, Mistral Medium 3.5 |
| OpenCode Go | 12 models via custom opencodego provider (DeepSeek V4 Pro/Flash, GLM 5/5.1, Kimi K2.5/K2.6, MiMo V2.5/V2.5 Pro, MiniMax M2.5/M2.7, Qwen3.6 Plus/3.7 Max) |
Then I configured three complexity routers that sit on top of these models and route requests based on complexity:
- frontier-router — routes across OpenAI and Anthropic models.
- opencodego-router — routes to 4 OpenCode Go models across four tiers (DeepSeek V4 Flash, DeepSeek V4 Pro, Kimi K2.6, MiMo V2.5 Pro)
- mistral-router — routes between Small 4 and Medium 3.5
OpenCode sees all 30 model entries (27 individual + 3 routers), but the routers are the ones I use the most.
LiteLLM handles the complexity classification and model selection. Individual models remain available for direct access, or as a fallback if the router has a problem.
Why it runs locally, not on the homelab#
I do run a homelab. It has Portainer managing containers, Traefik handling reverse proxy and TLS, and a handful of services I access from anywhere. This stack is not part of that (yet ? :D).
Right now it runs directly on my laptop with docker compose up -d. That is intentional for this proof of concept. My primary use case is coding with OpenCode, which runs locally and talks to the proxy over localhost. No Traefik, no TLS termination, no DNS records. Just a local proxy that starts when I need it and stops when I do not.
If I move this to the homelab in the future, I would add TLS and authentication, and probably put it behind Tailscale. For now, as a single-developer coding setup, localhost is the right boundary. I could even deploy it on a VPS or a cloud provider, but that’s for another day.
The architecture#
Three containers in one Compose file:
- LiteLLM proxy — the OpenAI-compatible API gateway on port 4000
- PostgreSQL — virtual key management, spend tracking, logs etc (persistent data)
- Redis — response caching
API keys for connection to AI providers are passed through as environment variables. No env_file in Compose, just .env referenced through variable substitution. Secrets stay in one place locally and are gitignored.
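The three local secrets in .env (the Postgres password, the LiteLLM master key, and the salt key) can be generated rather than typed by hand. A minimal sketch using Python's standard secrets module; the helper name generate_local_secrets is made up for this example, and the sk- prefix simply mirrors the placeholder used in the .env.example of this post.

```python
import secrets


def generate_local_secrets() -> dict[str, str]:
    """Return random values for the three local secrets (illustrative helper)."""
    return {
        # token_urlsafe only emits URL-safe characters, which matters because
        # the password is embedded in DATABASE_URL in the Compose file.
        "POSTGRES_PASSWORD": secrets.token_urlsafe(24),
        # Keeps the sk- prefix used by the placeholder in .env.example.
        "LITELLM_MASTER_KEY": "sk-" + secrets.token_urlsafe(32),
        "LITELLM_SALT_KEY": secrets.token_urlsafe(48),
    }


if __name__ == "__main__":
    # Print lines that can be pasted into .env.
    for name, value in generate_local_secrets().items():
        print(f"{name}={value}")
```

Provider API keys still have to be pasted in manually; this only covers the values that exist solely on the local machine.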
The complete files#
If you want to reproduce this setup, here are the full working templates. Create a directory (for example litellm-local/) and place these three files inside it, together with the providers.json shown in the model configuration section further down (the Compose file mounts it into the container). The templates include a few example models per provider. Add or remove models as needed; the pattern is the same for every provider.
The OpenCode config goes in your global ~/.config/opencode/opencode.jsonc, not in the project directory.
docker-compose.yaml — the full stack:
services:
  postgres:
    image: postgres:16-alpine
    container_name: litellm-local-postgres
    restart: unless-stopped
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm_admin
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm_admin -d litellm"]
      interval: 5s
      timeout: 5s
      retries: 10
  redis:
    image: redis:7-alpine
    container_name: litellm-local-redis
    restart: unless-stopped
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 5s
      retries: 10
  litellm:
    image: docker.litellm.ai/berriai/litellm:main-stable
    container_name: litellm-local
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml:ro
      - ./providers.json:/app/.venv/lib/python3.13/site-packages/litellm/llms/openai_like/providers.json:ro
    environment:
      DATABASE_URL: postgresql://litellm_admin:${POSTGRES_PASSWORD}@postgres:5432/litellm
      REDIS_HOST: redis
      REDIS_PORT: 6379
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      LITELLM_SALT_KEY: ${LITELLM_SALT_KEY}
      # Add or remove provider keys as needed
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      GEMINI_API_KEY: ${GEMINI_API_KEY}
      MISTRAL_API_KEY: ${MISTRAL_API_KEY}
      OPENCODE_GO_API_KEY: ${OPENCODE_GO_API_KEY}
      STORE_MODEL_IN_DB: "True"
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    healthcheck:
      test: ["CMD-SHELL", "python -c 'import urllib.request; urllib.request.urlopen(\"http://localhost:4000/health/readiness\")' 2>/dev/null || exit 1"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
volumes:
  postgres_data:
  redis_data:

.env.example (copy to .env and fill in your keys):
POSTGRES_PASSWORD=change-me
LITELLM_MASTER_KEY=sk-local-master-key
LITELLM_SALT_KEY=replace-with-a-long-random-string
# Add or remove as needed
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=
MISTRAL_API_KEY=
OPENCODE_GO_API_KEY=

litellm_config.yaml (the full model list with cost tracking and three routers):
model_list:
  # Frontier router (OpenAI + Anthropic)
  - model_name: frontier-router
    litellm_params:
      model: auto_router/complexity_router
      complexity_router_config:
        tiers:
          SIMPLE: gpt-5.4-mini
          MEDIUM: claude-sonnet-4-6
          COMPLEX: gpt-5.5
          REASONING: claude-opus-4-8
    model_info:
      mode: chat
      disable_background_health_check: true
  # OpenAI
  - model_name: gpt-5.5
    litellm_params:
      model: openai/gpt-5.5
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: gpt-5.4
    litellm_params:
      model: openai/gpt-5.4
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: gpt-5.4-mini
    litellm_params:
      model: openai/gpt-5.4-mini
      api_key: "os.environ/OPENAI_API_KEY"
  # Anthropic
  - model_name: claude-haiku-4-5
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: claude-opus-4-5
    litellm_params:
      model: anthropic/claude-opus-4-5
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: claude-opus-4-6
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: claude-opus-4-7
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: claude-opus-4-8
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: "os.environ/ANTHROPIC_API_KEY"
  # Google Gemini
  - model_name: gemini-2.5-pro
    litellm_params:
      model: gemini/gemini-2.5-pro
      api_key: "os.environ/GEMINI_API_KEY"
  - model_name: gemini-3.1-pro-preview
    litellm_params:
      model: gemini/gemini-3.1-pro-preview
      api_key: "os.environ/GEMINI_API_KEY"
  - model_name: gemini-3.5-flash
    litellm_params:
      model: gemini/gemini-3.5-flash
      api_key: "os.environ/GEMINI_API_KEY"
  # Mistral (versioned, with explicit cost tracking)
  - model_name: mistral-small-4
    litellm_params:
      model: mistral/mistral-small-2603
      api_key: "os.environ/MISTRAL_API_KEY"
      input_cost_per_token: 0.00000015
      output_cost_per_token: 0.00000060
  - model_name: mistral-medium-3-5
    litellm_params:
      model: mistral/mistral-medium-3-5
      api_key: "os.environ/MISTRAL_API_KEY"
      input_cost_per_token: 0.00000150
      output_cost_per_token: 0.00000750
  # Mistral complexity router
  - model_name: mistral-router
    litellm_params:
      model: auto_router/complexity_router
      complexity_router_config:
        tiers:
          SIMPLE: mistral-small-4
          COMPLEX: mistral-medium-3-5
    model_info:
      mode: chat
      disable_background_health_check: true
  # OpenCode Go (all 12 models with cost tracking, using custom opencodego provider)
  - model_name: opencode-deepseek-v4-pro
    litellm_params:
      model: opencodego/deepseek-v4-pro
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000084
      output_cost_per_token: 0.00000253
  - model_name: opencode-deepseek-v4-flash
    litellm_params:
      model: opencodego/deepseek-v4-flash
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000009
      output_cost_per_token: 0.00000027
  - model_name: opencode-glm-5
    litellm_params:
      model: opencodego/glm-5
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000100
      output_cost_per_token: 0.00000320
  - model_name: opencode-glm-5-1
    litellm_params:
      model: opencodego/glm-5.1
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000140
      output_cost_per_token: 0.00000440
  - model_name: opencode-kimi-k2-5
    litellm_params:
      model: opencodego/kimi-k2.5
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000060
      output_cost_per_token: 0.00000300
  - model_name: opencode-kimi-k2-6
    litellm_params:
      model: opencodego/kimi-k2.6
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000095
      output_cost_per_token: 0.00000400
  - model_name: opencode-mimo-v2-5
    litellm_params:
      model: opencodego/mimo-v2.5
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000009
      output_cost_per_token: 0.00000027
  - model_name: opencode-mimo-v2-5-pro
    litellm_params:
      model: opencodego/mimo-v2.5-pro
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000084
      output_cost_per_token: 0.00000252
  - model_name: opencode-minimax-m2-5
    litellm_params:
      model: opencodego/minimax-m2.5
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000030
      output_cost_per_token: 0.00000120
  - model_name: opencode-minimax-m2-7
    litellm_params:
      model: opencodego/minimax-m2.7
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000030
      output_cost_per_token: 0.00000120
  - model_name: opencode-qwen3-6-plus
    litellm_params:
      model: opencodego/qwen3.6-plus
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000050
      output_cost_per_token: 0.00000300
  - model_name: opencode-qwen3-7-max
    litellm_params:
      model: opencodego/qwen3.7-max
      api_key: "os.environ/OPENCODE_GO_API_KEY"
      api_base: https://opencode.ai/zen/go/v1
      input_cost_per_token: 0.00000250
      output_cost_per_token: 0.00000750
  # OpenCode Go complexity router (tuned for OpenCode's large baseline context)
  - model_name: opencodego-router
    litellm_params:
      model: auto_router/complexity_router
      complexity_router_config:
        tiers:
          SIMPLE: opencode-deepseek-v4-flash
          MEDIUM: opencode-deepseek-v4-pro
          COMPLEX: opencode-kimi-k2-6
          REASONING: opencode-mimo-v2-5-pro
        # Token count is useless for chat with large system prompts
        # (OpenCode baseline is ~13k tokens). Kill it and let
        # content-based signals drive routing.
        dimension_weights:
          tokenCount: 0.0
          reasoningMarkers: 0.40
          simpleIndicators: 0.20
          technicalTerms: 0.25
          codePresence: 0.10
          multiStepPatterns: 0.03
          questionComplexity: 0.02
        # Lower boundaries to compensate for tokenCount removal
        tier_boundaries:
          simple_medium: 0.10 # was 0.15
          medium_complex: 0.25 # was 0.35
          complex_reasoning: 0.55 # was 0.60
    model_info:
      mode: chat
      disable_background_health_check: true
general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"
  health_check_skip_disabled_background_models: true
litellm_settings:
  cache: true
  cache_params:
    type: redis
    namespace: litellm.local

~/.config/opencode/opencode.jsonc is my global OpenCode config. I keep the LiteLLM provider there with all routers and models:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"litellm-local": {
"npm": "@ai-sdk/openai-compatible",
"name": "LiteLLM Local",
"options": {
"baseURL": "http://127.0.0.1:4000/v1",
"apiKey": "{env:LITELLM_API_KEY}"
},
"models": {
"opencodego-router": { "name": "OpenCode Go Router (recommended default)" },
"frontier-router": { "name": "Frontier Router (recommended)" },
"mistral-router": { "name": "Mistral Router (recommended)" },
"gpt-5.5": { "name": "GPT-5.5 via LiteLLM" },
"gpt-5.4": { "name": "GPT-5.4 via LiteLLM" },
"gpt-5.4-mini": { "name": "GPT-5.4 Mini via LiteLLM" },
"claude-haiku-4-5": { "name": "Claude Haiku 4.5 via LiteLLM" },
"claude-sonnet-4-5": { "name": "Claude Sonnet 4.5 via LiteLLM" },
"claude-sonnet-4-6": { "name": "Claude Sonnet 4.6 via LiteLLM" },
"claude-opus-4-5": { "name": "Claude Opus 4.5 via LiteLLM" },
"claude-opus-4-6": { "name": "Claude Opus 4.6 via LiteLLM" },
"claude-opus-4-7": { "name": "Claude Opus 4.7 via LiteLLM" },
"claude-opus-4-8": { "name": "Claude Opus 4.8 via LiteLLM" },
"gemini-2.5-pro": { "name": "Gemini 2.5 Pro via LiteLLM" },
"gemini-3.1-pro-preview": { "name": "Gemini 3.1 Pro Preview via LiteLLM" },
"gemini-3.5-flash": { "name": "Gemini 3.5 Flash via LiteLLM" },
"mistral-small-4": { "name": "Mistral Small 4 via LiteLLM" },
"mistral-medium-3-5": { "name": "Mistral Medium 3.5 via LiteLLM" },
"opencode-deepseek-v4-pro": { "name": "OpenCode Go DeepSeek V4 Pro via LiteLLM" },
"opencode-deepseek-v4-flash": { "name": "OpenCode Go DeepSeek V4 Flash via LiteLLM" },
"opencode-glm-5": { "name": "OpenCode Go GLM 5 via LiteLLM" },
"opencode-glm-5-1": { "name": "OpenCode Go GLM 5.1 via LiteLLM" },
"opencode-kimi-k2-5": { "name": "OpenCode Go Kimi K2.5 via LiteLLM" },
"opencode-kimi-k2-6": { "name": "OpenCode Go Kimi K2.6 via LiteLLM" },
"opencode-mimo-v2-5": { "name": "OpenCode Go MiMo V2.5 via LiteLLM" },
"opencode-mimo-v2-5-pro": { "name": "OpenCode Go MiMo V2.5 Pro via LiteLLM" },
"opencode-minimax-m2-5": { "name": "OpenCode Go MiniMax M2.5 via LiteLLM" },
"opencode-minimax-m2-7": { "name": "OpenCode Go MiniMax M2.7 via LiteLLM" },
"opencode-qwen3-6-plus": { "name": "OpenCode Go Qwen3.6 Plus via LiteLLM" },
"opencode-qwen3-7-max": { "name": "OpenCode Go Qwen3.7 Max via LiteLLM" },
}
}
},
"model": "litellm-local/opencodego-router",
"small_model": "litellm-local/mistral-router"
}

Once deployed, type /model in your OpenCode terminal and you will see all models plus the three routers.
How caching works in practice#
Redis caches responses based on the prompt content and model name. If you send the exact same message to the same model twice, the second request hits the cache instead of the provider API. That means:
- Zero cost for the second identical request
- Instant response instead of waiting for the provider
- Less rate-limit pressure on your provider accounts
The cache is keyed by a hash of the request, so even a single character difference results in a fresh provider call. For a coding agent like OpenCode, this means the cache is rarely useful. Each request carries the full conversation history, tool definitions, and dynamic system context, so two requests are almost never identical even when you retry the same user message. The cache still helps if you use the same model for simple one-shot API calls outside the agent — for example, a curl request with no history or tools — but do not expect cache hits from your day-to-day coding sessions.
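To make the cache-miss behaviour concrete, here is a small sketch of a hash-keyed cache lookup. It is an illustration of the idea only, not LiteLLM's actual key derivation: the cache_key helper and the choice of SHA-256 over canonical JSON are assumptions made for this example.

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict]) -> str:
    """Hash the model name and messages into a stable key (illustrative only)."""
    canonical = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# A one-shot call and its retry carry exactly the same payload.
one_shot = [{"role": "user", "content": "What is a reverse proxy?"}]
retry = [{"role": "user", "content": "What is a reverse proxy?"}]

# An agent resends the whole conversation, so the same question arrives
# wrapped in a different history on every turn.
agent_turn = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "List the files."},
    {"role": "assistant", "content": "README.md, main.py"},
    {"role": "user", "content": "What is a reverse proxy?"},
]

same_key = cache_key("gpt-5.5", one_shot) == cache_key("gpt-5.5", retry)
agent_key_matches = cache_key("gpt-5.5", one_shot) == cache_key("gpt-5.5", agent_turn)
print(same_key, agent_key_matches)
```

The one-shot retry produces the same key, while the agent turn does not, even though the final user message is identical.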
How virtual keys and spend tracking work#
LiteLLM lets you generate virtual API keys that are scoped to specific models or users. Each virtual key has its own rate limits, budget, and other settings. When a request comes in, LiteLLM logs which virtual key was used, which model was called, how many tokens were consumed, and what the estimated cost was.
All of this data lives in PostgreSQL. You can query it directly or view it in the LiteLLM web UI at http://127.0.0.1:4000/ui.
For my setup, I mostly use the master key for simplicity, but I also generated a scoped virtual key for OpenCode. The virtual key only has access to the models I actually want OpenCode to use. If something goes wrong, I can revoke that key without touching the master key or any provider keys.
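Besides the web UI, a scoped key can be created through the proxy's key management API. The sketch below builds (but does not send) a POST to /key/generate using only the standard library. The helper name build_key_request, the model list, and the budget value are illustrative choices, not the exact key I created; check the LiteLLM docs for the fields your version accepts.

```python
import json
import urllib.request


def build_key_request(base_url: str, master_key: str, models: list[str],
                      max_budget: float) -> urllib.request.Request:
    """Build (but do not send) a POST to the proxy's /key/generate endpoint."""
    body = {"models": models, "max_budget": max_budget}
    return urllib.request.Request(
        url=f"{base_url}/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_key_request(
    "http://127.0.0.1:4000",
    "sk-local-master-key",  # placeholder value from .env.example
    ["opencodego-router", "opencode-kimi-k2-6"],  # example scope
    max_budget=10.0,  # hypothetical budget in USD
)

# To create the key against a running proxy, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["key"])
```

The returned key is what goes into LITELLM_API_KEY on the client side, so the master key never has to leave the machine that runs the proxy.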
The model configuration#
The litellm_config.yaml maps each model name to its provider and API key. This is where the abstraction happens. On the outside, everything looks like model: "gpt-5.5". On the inside, LiteLLM knows to call OpenAI’s API with the OpenAI key.
OpenCode Go models use the opencodego/ prefix through a custom provider definition. By default, LiteLLM would treat OpenCode Go as openai/ since the API is OpenAI-compatible, but that causes all Go spend to be categorized under “OpenAI” on the billing dashboard.
Using a custom opencodego provider in providers.json fixes this:
{
"opencodego": {
"base_url": "https://opencode.ai/zen/go/v1",
"api_key_env": "OPENCODE_GO_API_KEY"
}
}

The providers.json file is mounted into the LiteLLM container at /app/.venv/lib/python3.13/site-packages/litellm/llms/openai_like/providers.json (the exact path depends on the LiteLLM version and Python path inside the container; find it with docker exec litellm-local find /app -name providers.json). Model entries in litellm_config.yaml then use the opencodego/ prefix:
- model_name: opencode-deepseek-v4-flash
  litellm_params:
    model: opencodego/deepseek-v4-flash
    api_key: "os.environ/OPENCODE_GO_API_KEY"
    api_base: https://opencode.ai/zen/go/v1

Cost tracking#
LiteLLM maintains a built-in price list for major providers (OpenAI, Anthropic, Google, Mistral) at https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json. For those models, you do not need to specify costs — LiteLLM looks them up automatically.
OpenCode Go models are not in that list. That is why every OpenCode Go entry in my config includes explicit input_cost_per_token and output_cost_per_token values. LiteLLM multiplies these by the token counts in each response and logs the result to PostgreSQL.
This matters because “unified access” is only half the story. The other half is knowing what that access costs. With 12 OpenCode Go models ranging from $0.09 to $2.50 per million input tokens, the difference between routing to Flash and routing to Qwen3.7 Max is real. The complexity router handles that decision automatically, but the cost parameters let you audit it after the fact.
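The cost arithmetic itself is simple enough to check by hand. The sketch below applies the per-token prices from the config to one hypothetical request (13,000 input tokens, 500 output tokens); the function name request_cost and the token counts are made up for the example.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_cost_per_token: float, output_cost_per_token: float) -> float:
    """Cost in USD for one response, given per-token prices."""
    return input_tokens * input_cost_per_token + output_tokens * output_cost_per_token


# (input, output) per-token prices copied from litellm_config.yaml.
FLASH = (0.00000009, 0.00000027)     # opencode-deepseek-v4-flash
QWEN_MAX = (0.00000250, 0.00000750)  # opencode-qwen3-7-max

# Hypothetical request: 13,000 input tokens and 500 output tokens.
flash_cost = request_cost(13_000, 500, *FLASH)
qwen_cost = request_cost(13_000, 500, *QWEN_MAX)

print(f"DeepSeek V4 Flash: ${flash_cost:.6f}")
print(f"Qwen3.7 Max:       ${qwen_cost:.6f}")
```

For this request the two models differ by a factor of roughly 28 (about $0.0013 versus $0.036), which adds up quickly over a long agent session.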
The complexity routers#
This is the part I like the most.
LiteLLM has a built-in auto_router/complexity_router feature. It inspects the incoming prompt, classifies its complexity, and routes to a different model accordingly.
Unlike some routers that call an external LLM to classify each query — adding latency, cost, and non-deterministic behavior — LiteLLM’s complexity router uses pure pattern matching and heuristics. No API calls, no extra tokens, and the classification is deterministic: the same prompt always lands on the same tier.
Why tokenCount: 0.0 — the key tuning decision#
OpenCode (or any other AI coding tool) sends a lot of context with every request: system prompt, tools, MCP instructions, agent rules, and conversation history. Even a trivial session can start around 10k tokens before I type anything meaningful, and every later request carries the full history. If token count participates in the complexity score, every request looks expensive and gets pushed toward the COMPLEX tier. A “Hi” with 10k tokens of context is still a “Hi”.
The root issue is that a coding agent mixes two very different signals into one payload. There are actually two routing questions, not one:
| Question | Good signal | Why it matters for agentic coding |
|---|---|---|
| How hard is this task? | Current message content | Token count is noise here. The request size grows because the tool appends history, files, and tool output — it measures session length, not task difficulty. |
| Can this request fit safely and cheaply? | Total request tokens | Token count matters. A 200k-token session routed to the REASONING tier might fit technically, but sending it to a cheaper model with a large enough window saves money. |
Most one-shot API calls get both answers from the same signal — a large request tends to be a complex request. But agentic coding breaks that assumption. The request grows over time regardless of what you are asking, so raw token count stops being a useful proxy for anything except capacity.
The rule: do not use raw total tokens as a complexity signal for long-running coding agents unless the router can separate current-user-message tokens from accumulated context tokens. For OpenCode, Claude Code, Cursor, Aider, or any agentic IDE workflow, set tokenCount to 0.0 and let content-based signals drive routing.
Token count should inform capacity decisions (history summarization, context window safety), not complexity classification. Those are different problems — treat them as such.
OpenCode Go router#
- model_name: opencodego-router
  litellm_params:
    model: auto_router/complexity_router
    complexity_router_config:
      tiers:
        SIMPLE: opencode-deepseek-v4-flash
        MEDIUM: opencode-deepseek-v4-pro
        COMPLEX: opencode-kimi-k2-6
        REASONING: opencode-mimo-v2-5-pro
      dimension_weights:
        tokenCount: 0.0
        reasoningMarkers: 0.40
        simpleIndicators: 0.20
        technicalTerms: 0.25
        codePresence: 0.10
        multiStepPatterns: 0.03
        questionComplexity: 0.02
      tier_boundaries:
        simple_medium: 0.10
        medium_complex: 0.25
        complex_reasoning: 0.55

The router scores each request across six content-based dimensions:
| Dimension | Weight | What it catches |
|---|---|---|
| reasoningMarkers | 0.40 | Phrases like “step by step”, “think through”, “explain your reasoning” |
| technicalTerms | 0.25 | Domain complexity: architecture, distributed, throughput, latency, encryption, scalability |
| simpleIndicators | 0.20 | Greetings, definitions, basic facts, short simple questions |
| codePresence | 0.10 | Code-related terms: function, class, refactor, implement, api, error, docker, kubernetes |
| multiStepPatterns | 0.03 | Sequential instructions like “first… then…” or numbered steps |
| questionComplexity | 0.02 | Compound questions and multiple question marks |
| tokenCount | 0.00 | Disabled (see rationale above) |
The matching uses word boundaries for single-word keywords, so “microservice” matches “microservice” but not “microservices”. Multi-word phrases use substring matching.
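The difference between the two matching modes is easy to see in a few lines. This is a sketch of the behaviour described above, not LiteLLM's implementation: the matches_keyword helper and the case-insensitive comparison are assumptions made for the example.

```python
import re


def matches_keyword(keyword: str, text: str) -> bool:
    """Word-boundary match for single words, substring match for phrases."""
    text = text.lower()
    keyword = keyword.lower()
    if " " in keyword:
        # Multi-word phrase: plain substring search.
        return keyword in text
    # Single word: must be delimited by word boundaries on both sides.
    return re.search(rf"\b{re.escape(keyword)}\b", text) is not None


singular = matches_keyword("microservice", "Design a microservice architecture")
plural = matches_keyword("microservice", "Split this into microservices")
phrase = matches_keyword("step by step", "Walk me through it step by step.")
print(singular, plural, phrase)
```

The practical consequence is that plurals and other inflected forms of a single-word keyword do not count unless they are listed separately.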
Tier boundaries were lowered to compensate for removing tokenCount:
| Boundary | Default | Current |
|---|---|---|
| simple_medium | 0.15 | 0.10 |
| medium_complex | 0.35 | 0.25 |
| complex_reasoning | 0.60 | 0.55 |
How scoring works#
The router extracts the last system prompt and the last user message from the request, then scores each dimension against the combination, multiplies by its weight, and sums the result. The conversation history and tool definitions are not scanned — only the final system prompt and current user message participate. reasoningMarkers is the strictest dimension: it scans the user message alone, ignoring even the system prompt, to prevent the system prompt from forcing every request into the REASONING tier. tokenCount is the exception: when enabled, it counts the full request body. That is another reason to keep it at 0.0.
A few things to know about how dimensions are scored before looking at examples:
- Scoring is threshold-based, not per-match. Each dimension counts keyword matches, then maps the count to a raw score (0 to 1, or -1) based on thresholds. Having more matches beyond the threshold adds nothing. For codePresence, 2+ matches gives the maximum raw score of 1.0; it does not matter whether you hit 2 or 20 matches. For technicalTerms, 4+ matches hits the max (2-3 matches scores 0.5, anything below 2 scores 0). Same pattern for every dimension.
- simpleIndicators is negative: a single match scores -1.0, which pulls the total score down. At weight 0.20, one greeting costs you -0.20. This is the mechanism that keeps “Hi” out of MEDIUM tier even in a session loaded with technical context.
- The system prompt participates too. reasoningMarkers is the only dimension that scans the user message alone. All others (codePresence, technicalTerms, simpleIndicators, multiStepPatterns) use the last system prompt plus the user message. The conversation history and tool definitions are excluded regardless.
- reasoningMarkers has a scoring threshold: 0 matches = 0, 1 match = 0.7, 2+ matches = 1.0. But 2+ matches also triggers the bypass (see below), so the 1.0 score is never actually used: the request never reaches the weighted scoring step.
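The count-to-score mapping can be written down as a handful of small functions. The thresholds are the ones stated in this post (including the 0.5 for a single code keyword used in the worked examples); the function names are made up, and the real logic lives in LiteLLM's complexity_router.py.

```python
def code_presence_raw(matches: int) -> float:
    """codePresence: 1 match scores 0.5, 2 or more score the maximum."""
    if matches >= 2:
        return 1.0
    return 0.5 if matches == 1 else 0.0


def technical_terms_raw(matches: int) -> float:
    """technicalTerms: 2-3 matches score 0.5, 4 or more score the maximum."""
    if matches >= 4:
        return 1.0
    return 0.5 if matches >= 2 else 0.0


def reasoning_markers_raw(matches: int) -> float:
    """reasoningMarkers: 1 match scores 0.7, 2 or more score 1.0."""
    if matches >= 2:
        return 1.0
    return 0.7 if matches == 1 else 0.0


def simple_indicators_raw(matches: int) -> float:
    """simpleIndicators: any match scores -1.0 and pulls the total down."""
    return -1.0 if matches >= 1 else 0.0


# Extra matches beyond the threshold add nothing.
print(code_presence_raw(2) == code_presence_raw(20))
```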
Here is how different prompts land on the tier ladder with these weights:
| # | Prompt | reasoning | technical | simple | codePresence | Score | Tier |
|---|---|---|---|---|---|---|---|
| 1 | “Hello, can you help me?” | 0 | 0 | -0.20 | 0 | -0.20 | SIMPLE |
| 2 | “Refactor the API to use async database queries with proper error handling” | 0 | 0 | 0 | 0.10 | 0.10 | MEDIUM |
| 3 | “Design a distributed microservice architecture with container orchestration, high throughput, and low latency” | 0 | 0.25 | 0 | 0 | 0.25 | COMPLEX |
| 4 | “Explain your reasoning for this authentication architecture. First, analyze the distributed design, then implement the container orchestration layer.” | 0.28 | 0.25 | 0 | 0.05 | 0.595 | REASONING |
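Each column shows the weighted contribution (raw score × weight). Weights: reasoning 0.40, technical 0.25, simple 0.20, codePresence 0.10. A simpleIndicators match always scores -1.0 (raw), so its contribution is negative. The multiStepPatterns (0.03) and questionComplexity (0.02) columns are omitted because they rarely tip a decision. Row 4 is the one place where multiStepPatterns contributes (0.015), which is why its score is 0.595 rather than the 0.58 that the visible columns add up to.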
Example 1 is a greeting. simpleIndicators catches “hello” (raw -1.0 × 0.20 = -0.20). No other dimension fires. Score -0.20 → SIMPLE.
Example 2 matches five code keywords (refactor, api, async, database, error). Raw score 1.0 (2+ matches triggers the high threshold), weighted 1.0 × 0.10 = 0.10. No other dimension fires. Score 0.10 → MEDIUM.
Example 3 has no code matches but seven technical terms (distributed, microservice, architecture, container, orchestration, throughput, latency). Raw score 1.0 (4+ matches), weighted 1.0 × 0.25 = 0.25. Score 0.25 → COMPLEX.
Example 4 reaches REASONING through normal scoring. One reasoning marker (“explain your reasoning”, raw 0.7 × 0.40 = 0.28), five technical terms (raw 1.0 × 0.25 = 0.25), one code keyword (“implement”, raw 0.5 × 0.10 = 0.05), and the “first…then” multi-step pattern (raw 0.5 × 0.03 = 0.015). Total 0.595 > 0.55 → REASONING.
There is also a faster path to REASONING. If the user message contains two or more reasoning markers, the router bypasses normal scoring entirely and returns REASONING directly. A prompt like “analyze this step by step and think carefully” should not go through a weighted formula — it is obviously a reasoning request. This override is implemented in LiteLLM’s source (complexity_router.py:225), not something I added.
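Putting the weights, the boundaries, and the bypass together, the tier decision can be sketched as follows. It takes raw dimension scores as input and reproduces the four examples above. The function names are hypothetical, and treating each boundary as an inclusive lower bound is an assumption that matches the examples (a score of exactly 0.10 lands on MEDIUM); LiteLLM's own code may differ in detail.

```python
# Weights and boundaries from the opencodego-router config.
WEIGHTS = {
    "reasoningMarkers": 0.40,
    "technicalTerms": 0.25,
    "simpleIndicators": 0.20,
    "codePresence": 0.10,
    "multiStepPatterns": 0.03,
    "questionComplexity": 0.02,
    "tokenCount": 0.0,
}
# (lower bound, tier), checked from the highest tier down.
BOUNDARIES = [(0.55, "REASONING"), (0.25, "COMPLEX"), (0.10, "MEDIUM")]


def weighted_score(raw_scores: dict[str, float]) -> float:
    """Sum of raw score × weight over the dimensions that fired."""
    return sum(WEIGHTS[name] * raw for name, raw in raw_scores.items())


def pick_tier(raw_scores: dict[str, float], reasoning_matches: int = 0) -> str:
    """Return the tier name for a set of raw dimension scores."""
    if reasoning_matches >= 2:
        # Two or more reasoning markers skip the weighted formula entirely.
        return "REASONING"
    score = weighted_score(raw_scores)
    for lower_bound, tier in BOUNDARIES:
        if score >= lower_bound:
            return tier
    return "SIMPLE"


example_4 = {
    "reasoningMarkers": 0.7,   # one marker
    "technicalTerms": 1.0,     # five terms
    "codePresence": 0.5,       # one keyword
    "multiStepPatterns": 0.5,  # "first... then"
}
print(round(weighted_score(example_4), 3), pick_tier(example_4, reasoning_matches=1))
```

Passing `{"simpleIndicators": -1.0}`, `{"codePresence": 1.0}`, and `{"technicalTerms": 1.0}` reproduces examples 1 to 3 in the same way.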
The scoring logic and keyword lists used by the router live in LiteLLM’s repository:
- Scoring and tier selection: litellm/router_strategy/complexity_router/complexity_router.py
- Keyword patterns and default weights: litellm/router_strategy/complexity_router/config.py
How the tiers break down in practice:
- SIMPLE → DeepSeek V4 Flash. A lightweight 284B-parameter model with 13B active parameters. Handles straightforward coding tasks, code review, simple refactors, and one-line completions. Fast and cheap at $0.09/$0.27 per million tokens.
- MEDIUM → DeepSeek V4 Pro. The full 1.6T-parameter version. Better reasoning, better code generation, 1M context window. $0.84/$2.53 per million tokens.
- COMPLEX → Kimi K2.6. 1T total parameters, 32B active. Scores 80.2% on SWE-bench Verified and leads the OpenCode Go lineup on agentic coding. $0.95/$4.00 per million tokens.
- REASONING → MiMo V2.5 Pro. Purpose-built for long-horizon autonomous tasks. Xiaomi reports it built a complete compiler in 4.3 hours unsupervised. Terminal-Bench 2.0 score of 68.4%. $0.84/$2.52 per million tokens.
Creating your own routers#
The same pattern works for any model combination — you are not limited to a router per provider. Once models are defined in model_list, you can reference them by their model_name in any router’s tier mapping. Cross-provider routers work too: a SIMPLE tier pointing to Mistral Small 4 and a REASONING tier pointing to GPT-5.5 is valid.
Then you call any router like a normal model:
curl http://127.0.0.1:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_API_KEY" \
  -d '{"model": "opencodego-router", "messages": [{"role": "user", "content": "Hello"}]}'

One model name. The router figures out the rest.
The health check bug#
LiteLLM 1.86.2 has a gap: the built-in health check does not recognize auto_router as a valid provider. When the background health checker runs, it throws a 400 BadRequestError on the router model.
This happens because the health check function iterates through all configured models and tries to validate them by sending a test request. It knows how to handle openai/, anthropic/, gemini/, opencodego/, and mistral/ prefixes. It does not know what to do with auto_router/complexity_router.
The fix was two settings working together:
On the router model itself:
model_info:
  disable_background_health_check: true
In the global settings:
general_settings:
  health_check_skip_disabled_background_models: true
This tells LiteLLM to skip all three routers during health checks while keeping all 27 individual models monitored. The routers still work for actual API calls. They just do not get tested by the background checker.
The UI “Test” button also respects this setting, so you will not see a red error badge on the router model. That is important because a permanently failing health check makes the dashboard noisy and distracts from real issues.
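To picture what the two settings do together, here is a rough Python sketch of the resulting split. It is not LiteLLM's actual implementation, just my mental model of it: entries flagged with disable_background_health_check are left out of the monitored set.

```python
def split_by_health_check(model_list: list[dict]) -> tuple[list[str], list[str]]:
    """Split model names into (monitored, skipped) based on model_info."""
    monitored, skipped = [], []
    for entry in model_list:
        disabled = entry.get("model_info", {}).get("disable_background_health_check", False)
        (skipped if disabled else monitored).append(entry["model_name"])
    return monitored, skipped


# Illustrative subset of the config: one regular model and one router.
model_list = [
    {"model_name": "opencode-kimi-k2-6"},
    {"model_name": "opencodego-router",
     "model_info": {"disable_background_health_check": True}},
]
monitored, skipped = split_by_health_check(model_list)
print(monitored, skipped)
```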
OpenCode integration#
I keep my OpenCode config in ~/.config/opencode/opencode.jsonc (the global config). It includes a litellm-local provider pointing to http://127.0.0.1:4000/v1. The provider exposes all 27 models plus all three routers, so I can use whichever I need.
Default workflow: I use the routers. /model opencodego-router in the TUI lets LiteLLM classify the prompt and pick the right OpenCode Go model automatically. Same for frontier with /model frontier-router. The OpenCode Go router is my go-to for general coding tasks. The frontier router is useful when I specifically want OpenAI or Anthropic models.
Fallback workflow: If the router misclassifies a prompt, returns an error, or I simply want a specific model for a particular task, I can call any model directly: /model litellm-local/claude-sonnet-4-6 or /model litellm-local/opencode-kimi-k2-6. Having all models in the config has saved me more than once when the router had a transient issue.
The key change in the global config was moving from a hardcoded API key to an environment variable:
"apiKey": "{env:LITELLM_API_KEY}"
No more exposed keys in config files. The key is injected at runtime from the shell environment, which means it never sits in version control and can be rotated easily.
To use it:
export LITELLM_API_KEY="sk-your-litellm-key"
opencode
Then /model opencodego-router for the smart default, or /model litellm-local/<any-model> for direct access.
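If the {env:NAME} placeholder feels like magic, this Python sketch shows the kind of substitution it stands for. How OpenCode resolves it internally is not something I have verified. The function name is mine, and raising an error for an unset variable is my own choice for the example.

```python
import os
import re

PLACEHOLDER = re.compile(r"\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_placeholders(value: str, env=os.environ) -> str:
    """Replace each {env:NAME} with env[NAME]; raise if NAME is not set."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]
    return PLACEHOLDER.sub(substitute, value)


# A plain dict stands in for the shell environment here.
print(resolve_env_placeholders("{env:LITELLM_API_KEY}", {"LITELLM_API_KEY": "sk-your-litellm-key"}))
```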
What I learned#
A few things worth noting if you are building something similar:
STORE_MODEL_IN_DB must be “True” — without this, the LiteLLM UI does not see models added through the config file. The UI’s auto-router feature also depends on it. I spent about 20 minutes wondering why the UI showed an empty model list before finding this in the LiteLLM docs. It is not obvious from the config file alone.
Virtual keys are worth the extra step — the master key works, but generating a separate LiteLLM virtual key for day-to-day clients is safer. If the key leaks, you revoke it without touching the admin key. It also lets you scope access: a virtual key for OpenCode gets the full model list, while a testing key might only get access to a subset.
Cost parameters matter — I initially skipped input_cost_per_token and output_cost_per_token on the OpenCode Go models. The proxy worked fine, but the spend dashboard showed zero cost for every request. Adding the parameters means LiteLLM can calculate per-request cost and aggregate it in PostgreSQL. The values come from OpenCode Go’s pricing page.
Custom providers fix billing categorization — the OpenCode Go API is OpenAI-compatible, so LiteLLM treats it as openai/ by default. That means all Go spend shows up under “OpenAI” in the billing dashboard. Creating a custom opencodego provider in providers.json and switching the model prefixes from openai/ to opencodego/ gives Go its own spend category. Finding the right mount path inside the container took a few tries (/app/litellm/llms/... didn’t work because LiteLLM is installed as a pip package, not from source – the actual path is under /app/.venv/lib/python3.13/site-packages/litellm/llms/openai_like/providers.json).
providers.json — custom provider definitions:
{
  "opencodego": {
    "base_url": "https://opencode.ai/zen/go/v1",
    "api_key_env": "OPENCODE_GO_API_KEY"
  },
  "opencodezen": {
    "base_url": "https://opencode.ai/zen/v1",
    "api_key_env": "OPENCODE_ZEN_API_KEY"
  }
}
This file is minimal because LiteLLM only needs the base URL and key environment variable. The rest of the model configuration (model names, costs, API keys) comes from litellm_config.yaml. I also added a few other providers here for future use: PublicAI, Helicone, VeniceAI, and others that might be useful later.
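Because a broken providers.json only shows up once requests start failing, I find it handy to lint the file before mounting it. The following Python sketch is a hypothetical checker of my own, not a LiteLLM tool. It assumes the two fields shown above are the required ones and that an empty key variable counts as a problem.

```python
import json

REQUIRED_FIELDS = ("base_url", "api_key_env")


def check_providers(raw_json: str, env: dict) -> list[str]:
    """Return human-readable problems found in a providers.json document."""
    problems = []
    for name, definition in json.loads(raw_json).items():
        for field in REQUIRED_FIELDS:
            if field not in definition:
                problems.append(f"{name}: missing {field}")
        key_var = definition.get("api_key_env")
        if key_var and not env.get(key_var):
            problems.append(f"{name}: environment variable {key_var} is empty or unset")
    return problems


providers = """
{
  "opencodego": {
    "base_url": "https://opencode.ai/zen/go/v1",
    "api_key_env": "OPENCODE_GO_API_KEY"
  }
}
"""
print(check_providers(providers, {"OPENCODE_GO_API_KEY": "placeholder"}))  # no problems: []
```

In real use you would pass the file contents and dict(os.environ) instead of the inline sample.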
Verification#
Once the stack is running, these commands confirm everything works:
# List all 30 model entries
curl http://127.0.0.1:4000/v1/models \
-H "Authorization: Bearer $LITELLM_API_KEY"
# Test the router
curl http://127.0.0.1:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $LITELLM_API_KEY" \
-d '{"model": "opencodego-router", "messages": [{"role": "user", "content": "Hello"}]}'
# Inspect the actual deployment selected by the router
curl -s -D - http://127.0.0.1:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $LITELLM_API_KEY" \
-d '{"model": "opencodego-router", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'
# Check health (27 models monitored, 3 routers skipped)
curl http://127.0.0.1:4000/health \
-H "Authorization: Bearer $LITELLM_API_KEY"
The first command should return a JSON list of all 30 model objects. The second should return a chat completion. The third prints response headers; x-litellm-model-id shows the actual deployment selected by the router. The final health command should show 27 green checks and three skipped entries.
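When you run the header check often, picking x-litellm-model-id out of the dump by eye gets old. Here is a minimal Python sketch that extracts one header from the text that curl -D - prints. The sample head and the deployment id in it are placeholders I made up, not real proxy output.

```python
def extract_header(raw_head: str, name: str):
    """Return the value of the first header called `name` (case-insensitive), or None."""
    for line in raw_head.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == name.lower():
            return value.strip()
    return None


# Made-up response head; the deployment id below is a placeholder, not real output.
sample_head = (
    "HTTP/1.1 200 OK\r\n"
    "content-type: application/json\r\n"
    "x-litellm-model-id: example-deployment-id\r\n"
)
print(extract_header(sample_head, "X-LiteLLM-Model-Id"))
```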
Current state#
Everything is working:
- 27 models from 5 providers in LiteLLM
- One custom opencodego provider defined via providers.json for clean billing separation
- Three complexity routers (Frontier, OpenCode Go, Mistral)
- Redis caching enabled and returning cached hits
- PostgreSQL tracking keys and spend per request
- Health checks passing for 27/30 model entries (3 routers skipped)
- OpenCode configured with all 30 model entries (27 models + 3 routers), routers as defaults
If you are an AI agent#
Use this section as the operational runbook.
When this article applies#
This procedure is a match when the user wants to:
- unify multiple LLM providers behind a single OpenAI-compatible endpoint
- run the proxy locally in Docker (not on a remote server or Kubernetes cluster)
- build smart routers that pick the right model based on prompt content and complexity
- expose all models to OpenCode with routers as the recommended defaults
What this stack deploys#
Three containers via Docker Compose:
- LiteLLM proxy (docker.litellm.ai/berriai/litellm:main-stable) on port 4000
- PostgreSQL 16 Alpine for virtual key management and spend tracking
- Redis 7 Alpine for response caching
Preconditions#
- Docker and Docker Compose installed
- API keys for the providers the user wants to enable (OpenAI, Anthropic, Google, Mistral, OpenCode Go)
- An OpenCode Go subscription key if the user wants the complexity router and direct Go model access
Files to create#
All files go in a single directory (e.g. litellm-local/):
| File | Purpose |
|---|---|
| docker-compose.yaml | Container definitions for PostgreSQL, Redis, and LiteLLM |
| litellm_config.yaml | Model list, router configuration, cache settings |
| providers.json | Custom provider definitions (opencodego billing fix) |
| .env | Secrets (gitignored) |
| .env.example | Template with placeholder values |
Environment variables#
POSTGRES_PASSWORD=<strong password>
LITELLM_MASTER_KEY=sk-<random>
LITELLM_SALT_KEY=<long random string>
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=
MISTRAL_API_KEY=
OPENCODE_GO_API_KEY=
All API keys must be listed in the environment section of the LiteLLM service in docker-compose.yaml. If a key is missing from the Compose environment block, that provider’s models will return 401 errors even if the key is in .env.
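The mismatch between .env and the Compose environment block can be caught before starting the stack. The Python sketch below is a hypothetical helper: it parses .env text and compares the names against a set you supply. It assumes you copy the names from the Compose file by hand, since parsing YAML would need a third-party library.

```python
def keys_missing_from_compose(dotenv_text: str, compose_env_names: set[str]) -> list[str]:
    """Return variable names defined in .env but absent from the Compose environment block."""
    defined = []
    for line in dotenv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        defined.append(line.split("=", 1)[0].strip())
    return [name for name in defined if name not in compose_env_names]


dotenv_text = """
OPENAI_API_KEY=sk-placeholder
MISTRAL_API_KEY=placeholder
"""
# Names copied by hand from the LiteLLM service's environment block (hypothetical state).
compose_env_names = {"OPENAI_API_KEY"}
print(keys_missing_from_compose(dotenv_text, compose_env_names))  # ['MISTRAL_API_KEY']
```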
Model configuration#
litellm_config.yaml must contain:
- A model_list with entries for each model. Provider prefix format:
  - OpenAI: openai/<model-id>
  - Anthropic: anthropic/<model-id>
  - Gemini: gemini/<model-id>
  - Mistral: pinned model IDs, for example mistral/mistral-small-2603 and mistral/mistral-medium-3-5
  - OpenCode Go: opencodego/<model-id> with api_base: https://opencode.ai/zen/go/v1 (requires custom providers.json mounted in the container)
- Router entries. The same pattern applies to any combination of models: reference them by model_name in any router’s tier mapping, even across providers.
OpenCode Go router (with tuned weights for agentic coding):
- model_name: opencodego-router
  litellm_params:
    model: auto_router/complexity_router
    complexity_router_config:
      tiers:
        SIMPLE: opencode-deepseek-v4-flash
        MEDIUM: opencode-deepseek-v4-pro
        COMPLEX: opencode-kimi-k2-6
        REASONING: opencode-mimo-v2-5-pro
      dimension_weights:
        tokenCount: 0.0
        reasoningMarkers: 0.40
        simpleIndicators: 0.20
        technicalTerms: 0.25
        codePresence: 0.10
        multiStepPatterns: 0.03
        questionComplexity: 0.02
      tier_boundaries:
        simple_medium: 0.10
        medium_complex: 0.25
        complex_reasoning: 0.55
  model_info:
    mode: chat
    disable_background_health_check: true

general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"
  health_check_skip_disabled_background_models: true

litellm_settings:
  cache: true
  cache_params:
    type: redis
    namespace: litellm.local
Required Compose environment variable#
STORE_MODEL_IN_DB must be set to "True" in the LiteLLM container environment. Without this, the LiteLLM UI will not display models defined in the config file, and the UI auto-router feature will not work.
Steps#
- Create the directory and files.
- Copy .env.example to .env and fill in real API keys.
- Run docker compose up -d.
- Wait for all three containers to be healthy (docker compose ps).
- Verify models: curl http://127.0.0.1:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY".
- Test a chat completion: send a POST to http://127.0.0.1:4000/v1/chat/completions with a valid model name.
- Test the router: send a request with "model": "opencodego-router".
- Check health: curl http://127.0.0.1:4000/health -H "Authorization: Bearer $LITELLM_MASTER_KEY".
OpenCode Go router behavior#
The opencodego-router is intentionally tuned for OpenCode’s large baseline context. Do not use token count as a routing signal for this router.
Required behavior:
- tokenCount weight must stay at 0.0.
- Routing must be driven by content-based signals only: reasoning markers, technical terms, simple indicators, code presence, multi-step patterns, and question complexity.
- A short greeting must route to the same tier whether the request has no previous context or tens of thousands of tokens of previous context.
- Tier boundaries should be lower than LiteLLM defaults: simple_medium: 0.10, medium_complex: 0.25, complex_reasoning: 0.55.
- Messages with two or more reasoning markers should route directly to REASONING / opencode-mimo-v2-5-pro.
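The required behavior above can be sketched as a weighted sum followed by a threshold lookup. This Python example is an illustration under stated assumptions, not LiteLLM's scoring code: it assumes each dimension arrives as a signed score between -1 and 1 (simple indicators pushing negative), and that a score below a boundary falls into the lower tier. The weights and boundaries are the ones from the router config.

```python
WEIGHTS = {
    "tokenCount": 0.0,
    "reasoningMarkers": 0.40,
    "simpleIndicators": 0.20,
    "technicalTerms": 0.25,
    "codePresence": 0.10,
    "multiStepPatterns": 0.03,
    "questionComplexity": 0.02,
}
# (upper bound, tier) pairs from tier_boundaries; anything at or above the last bound is REASONING.
BOUNDARIES = [(0.10, "SIMPLE"), (0.25, "MEDIUM"), (0.55, "COMPLEX")]


def pick_tier(scores: dict[str, float], reasoning_marker_count: int = 0) -> str:
    """Map per-dimension scores to a tier using a weighted sum."""
    if reasoning_marker_count >= 2:
        return "REASONING"  # direct override for two or more reasoning markers
    total = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    for upper_bound, tier in BOUNDARIES:
        if total < upper_bound:
            return tier
    return "REASONING"


# Hypothetical scores: the same greeting with and without a huge accumulated history.
fresh_greeting = {"simpleIndicators": -1.0, "tokenCount": 0.0}
late_greeting = {"simpleIndicators": -1.0, "tokenCount": 1.0}
print(pick_tier(fresh_greeting), pick_tier(late_greeting))
```

Because the tokenCount weight is 0.0, both greetings get the same total and land in the same tier, which is the invariant the list asks for.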
Keep complexity routing separate from capacity routing:
- Complexity routing should classify the current task from message content.
- Capacity routing may use total request tokens to choose a long-context model, trigger history summarization, or prevent an over-limit request.
- Raw total token count is useful for one-shot document-heavy calls, but it is misleading for long-running coding-agent sessions where the request grows because of accumulated history and tool context.
- For tools like OpenCode, Claude Code, Cursor, Aider, Continue, or another agentic IDE workflow, set raw tokenCount to 0.0 or very low if the router receives the full request payload.
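One way to keep the two concerns apart is to run the capacity check as a separate step after the complexity decision. The Python sketch below is hypothetical throughout: the function name is mine, and the context limit is a placeholder rather than a published figure.

```python
def apply_capacity_check(chosen_model: str, total_tokens: int,
                         context_limits: dict[str, int],
                         long_context_model: str) -> str:
    """Keep the complexity-chosen model unless the request would exceed its context limit."""
    limit = context_limits.get(chosen_model)
    if limit is not None and total_tokens > limit:
        return long_context_model
    return chosen_model


# Placeholder limit for illustration only; check each model's real context window.
context_limits = {"opencode-deepseek-v4-flash": 128_000}

print(apply_capacity_check("opencode-deepseek-v4-flash", 200_000,
                           context_limits, "opencode-deepseek-v4-pro"))
```

Token count only influences the final model when the request would not fit. It never changes the tier assigned from message content.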
Practical expectation:
| Prompt type | Expected tier | Model |
|---|---|---|
| Greeting or very simple query | SIMPLE | opencode-deepseek-v4-flash |
| Explanation request | MEDIUM | opencode-deepseek-v4-pro |
| Coding, refactor, architecture | COMPLEX | opencode-kimi-k2-6 |
| Explicit deep reasoning | REASONING | opencode-mimo-v2-5-pro |
Health check behavior#
- 27 of 30 model entries are monitored by background health checks.
- All three routers (frontier-router, opencodego-router, and mistral-router) are intentionally skipped because LiteLLM 1.86.2 does not recognize auto_router as a valid provider in the health check function.
- The settings that enforce this are disable_background_health_check: true on each router model and health_check_skip_disabled_background_models: true in general_settings.
- The routers still function correctly for actual API requests.
OpenCode integration#
Add a litellm-local provider to OpenCode config. Expose all models plus the routers:
{
  "provider": {
    "litellm-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LiteLLM Local",
      "options": {
        "baseURL": "http://127.0.0.1:4000/v1",
        "apiKey": "{env:LITELLM_API_KEY}"
      },
      "models": {
        "frontier-router": {},
        "opencodego-router": {},
        "mistral-router": {},
        "gpt-5.5": {},
        "claude-sonnet-4-6": {},
        "gemini-2.5-pro": {},
        "mistral-small-4": {},
        "opencode-deepseek-v4-flash": {}
        // ... add remaining models as needed
      }
    }
  }
}
This goes in the global ~/.config/opencode/opencode.jsonc. The {env:LITELLM_API_KEY} syntax tells OpenCode to read the key from the environment at runtime.
Set "model": "litellm-local/opencodego-router" and "small_model": "litellm-local/mistral-router" so the routers are the defaults. Individual models are available for fallback if a router has issues.
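Typing 30 model entries by hand invites typos, so the provider block can be generated instead. This Python sketch uses a helper name of my own. It reuses the field names from the config above and takes the model names as a plain list.

```python
import json


def build_provider_block(model_names: list[str]) -> dict:
    """Build the litellm-local provider entry with one empty options object per model."""
    return {
        "provider": {
            "litellm-local": {
                "npm": "@ai-sdk/openai-compatible",
                "name": "LiteLLM Local",
                "options": {
                    "baseURL": "http://127.0.0.1:4000/v1",
                    "apiKey": "{env:LITELLM_API_KEY}",
                },
                "models": {name: {} for name in model_names},
            }
        }
    }


names = ["frontier-router", "opencodego-router", "mistral-router", "gpt-5.5"]
print(json.dumps(build_provider_block(names), indent=2))
```

The list of names could be taken from the proxy's /v1/models response, assuming it follows the usual OpenAI list shape with one id per entry.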
Do not use this setup when#
- the user needs the proxy to be accessible from multiple machines (would need TLS and proper auth)
- the user wants managed infrastructure with automated backups (PostgreSQL volume is local)
- the user only uses one provider and does not need routing or unified access
Further reading#
- LiteLLM documentation https://docs.litellm.ai/
- LiteLLM complexity router https://docs.litellm.ai/docs/proxy/caching_and_routing
- OpenCode configuration https://opencode.ai/docs/config/
- OpenCode providers https://opencode.ai/docs/providers/
This article was written with an AI agent at my side — I brought the expertise, it helped with the words.
