[{"content":"","date":"26 June 2026","externalUrl":null,"permalink":"/","section":"Julien.Cloud","summary":"","title":"Julien.Cloud","type":"page"},{"content":" Auto-generated page. This page is automatically built by a scheduled CI pipeline that runs twice daily. The data below is a snapshot of three public sources, cross-referenced programmatically.\nLast generated: 2026-06-26 22:08 UTC\nData sources:\nOpenCode Go API \u0026ndash; unauthenticated metadata endpoint listing all models in the Go catalog Models.dev \u0026ndash; provider/model metadata including costs, release dates, and deprecation status OpenCode Go Documentation \u0026ndash; official docs page with request estimates and pricing This is an unofficial tracker. Not affiliated with or endorsed by OpenCode or Anomaly.\nIf you spot missing or incorrect information, please reach out so I can fix or improve the page generation.\nThis page tracks every model available through OpenCode Go, the $10/month subscription for open coding models. The pipeline queries the three sources above, merges their data, and regenerates this page.\nQuick Stats # Metric Count Documented (listed on Go docs page) 12 API metadata endpoint (all models in catalog) 20 Models.dev registry 19 Undocumented (in API metadata, not on docs page) 8 Deprecated 6 Total tracked 20 The 12 documented models match what a typical Go subscriber sees when running /models in the OpenCode TUI. The API metadata endpoint returns 20 models total \u0026ndash; the extra 8 are either deprecated predecessors or preview models not yet documented. 
Whether a specific undocumented model is usable with your API key depends on your subscription.\nChangelog # Changes detected since previous snapshot (2026-06-04):\nNewly Available Models # Model Release Date Status GLM-5.2 2026-06-13 active Kimi K2.7 Code 2026-06-12 active Marked as Deprecated # Date Model 2026-06-26 MiniMax M2.5 marked as deprecated 2026-06-26 Kimi K2.5 marked as deprecated 2026-06-26 GLM-5 marked as deprecated Removed from Documentation # Date Model 2026-06-26 MiniMax M2.5 removed from official documentation 2026-06-26 Kimi K2.5 removed from official documentation 2026-06-26 GLM-5 removed from official documentation Pricing Changes # Date Model 2026-06-26 MiniMax M3 (3x usage) pricing: $0.60/$2.40 -\u0026gt; $0.10/$0.40 (input/output per 1M) All Models # Model Documented Status Release Context Input $/1M Output $/1M Cache Read Req/5h Req/Month Sources GLM-5.2 yes active 2026-06-13 1M $1.40 $4.40 $0.26 880 4,300 API Models.dev Docs Kimi K2.7 Code catalog active 2026-06-12 262K $0.95 $4.00 $0.19 1,350 9,250 API Models.dev Qwen3.7 Plus yes active 2026-06-02 1M $0.40 $1.60 $0.04 4,300 21,600 API Models.dev Docs MiniMax M3 (3x usage) yes active 2026-05-31 1M $0.10 $0.40 $0.02 3,200 16,000 API Models.dev Docs Qwen3.7 Max yes active 2026-05-21 1M $2.50 $7.50 $0.50 950 4,770 API Models.dev Docs DeepSeek V4 Pro yes active 2026-04-24 1M $1.74 $3.48 $0.01 3,450 17,150 API Models.dev Docs DeepSeek V4 Flash yes active 2026-04-24 1M $0.14 $0.28 $0.0028 31,650 158,150 API Models.dev Docs MiMo V2.5 Pro yes active 2026-04-22 1M $1.74 $3.48 $0.01 3,250 16,300 API Models.dev Docs MiMo V2.5 yes active 2026-04-22 1M $0.14 $0.28 $0.0028 30,100 150,400 API Models.dev Docs Kimi K2.6 yes active 2026-04-21 262K $0.95 $4.00 $0.16 1,150 5,750 API Models.dev Docs GLM-5.1 yes active 2026-04-07 203K $1.40 $4.40 $0.26 880 4,300 API Models.dev Docs Qwen3.6 Plus yes active 2026-04-02 1M $0.50 $3.00 $0.05 3,300 16,300 API Models.dev Docs MiniMax M2.7 yes active 2026-03-18 205K 
$0.30 $1.20 $0.06 3,400 17,000 API Models.dev Docs MiMo V2 Pro catalog deprecated 2026-03-18 1M $1.00 $3.00 $0.20 — — API Models.dev MiMo V2 Omni catalog deprecated 2026-03-18 262K $0.40 $2.00 $0.08 — — API Models.dev Qwen3.5 Plus catalog deprecated 2026-02-16 262K $0.20 $1.20 $0.02 — — API Models.dev MiniMax M2.5 catalog deprecated 2026-02-12 205K $0.30 $1.20 $0.03 — — API Models.dev GLM-5 catalog deprecated 2026-02-11 203K $1.00 $3.20 $0.20 — — API Models.dev Kimi K2.5 catalog deprecated 2026-01-27 262K $0.60 $3.00 $0.10 — — API Models.dev Hy3 Preview catalog active — — — — — — — API Model Details # GLM-5.2 # Model ID: glm-5.2 Family: glm Status: active Release date: 2026-06-13 Last updated (Models.dev): 2026-06-13 Context window: 1M Max output tokens: 131,072 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $1.40 $4.40 $0.26 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 880 2,150 4,300 Available in: API Models.dev Docs\nKimi K2.7 Code # Model ID: kimi-k2.7-code Family: kimi-k2 Status: active Release date: 2026-06-12 Last updated (Models.dev): 2026-06-12 Context window: 262K Max output tokens: 262,144 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.95 $4.00 $0.19 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 1,350 4,630 9,250 Available in: API Models.dev\nQwen3.7 Plus # Model ID: qwen3.7-plus Family: qwen3.7-plus Status: active Release date: 2026-06-02 Last updated (Models.dev): 2026-06-02 Context window: 1M Max output tokens: 65,536 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: — Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.40 $1.60 $0.04 $0.50 Estimated requests (Go plan):\nPer 5 hours Per week Per month 4,300 10,800 21,600 Available in: API Models.dev Docs\nMiniMax M3 (3x usage) # Model ID: minimax-m3 
Family: minimax-m3 Status: active Release date: 2026-05-31 Last updated (Models.dev): 2026-05-31 Context window: 1M Max output tokens: 131,072 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.10 $0.40 $0.02 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 3,200 8,000 16,000 Available in: API Models.dev Docs\nQwen3.7 Max # Model ID: qwen3.7-max Family: qwen3.7-max Status: active Release date: 2026-05-21 Last updated (Models.dev): 2026-05-21 Context window: 1M Max output tokens: 65,536 Reasoning: yes Tool calling: yes Attachment support: — Open weights: — Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $2.50 $7.50 $0.50 $3.12 Estimated requests (Go plan):\nPer 5 hours Per week Per month 950 2,390 4,770 Available in: API Models.dev Docs\nDeepSeek V4 Pro # Model ID: deepseek-v4-pro Family: deepseek-thinking Status: active Release date: 2026-04-24 Last updated (Models.dev): 2026-04-24 Context window: 1M Max output tokens: 384,000 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $1.74 $3.48 $0.01 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 3,450 8,550 17,150 Available in: API Models.dev Docs\nDeepSeek V4 Flash # Model ID: deepseek-v4-flash Family: deepseek-flash Status: active Release date: 2026-04-24 Last updated (Models.dev): 2026-04-24 Context window: 1M Max output tokens: 384,000 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.14 $0.28 $0.0028 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 31,650 79,050 158,150 Available in: API Models.dev Docs\nMiMo V2.5 Pro # Model ID: mimo-v2.5-pro Family: mimo-v2.5-pro Status: active Release date: 2026-04-22 Last updated (Models.dev): 2026-04-22 Context window: 1M Max output tokens: 128,000 Reasoning: yes 
Tool calling: yes Attachment support: yes Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $1.74 $3.48 $0.01 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 3,250 8,150 16,300 Available in: API Models.dev Docs\nMiMo V2.5 # Model ID: mimo-v2.5 Family: mimo-v2.5 Status: active Release date: 2026-04-22 Last updated (Models.dev): 2026-04-22 Context window: 1M Max output tokens: 128,000 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.14 $0.28 $0.0028 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 30,100 75,200 150,400 Available in: API Models.dev Docs\nKimi K2.6 # Model ID: kimi-k2.6 Family: kimi-k2 Status: active Release date: 2026-04-21 Last updated (Models.dev): 2026-04-21 Context window: 262K Max output tokens: 65,536 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.95 $4.00 $0.16 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 1,150 2,880 5,750 Available in: API Models.dev Docs\nGLM-5.1 # Model ID: glm-5.1 Family: glm Status: active Release date: 2026-04-07 Last updated (Models.dev): 2026-04-07 Context window: 203K Max output tokens: 32,768 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $1.40 $4.40 $0.26 — Estimated requests (Go plan):\nPer 5 hours Per week Per month 880 2,150 4,300 Available in: API Models.dev Docs\nQwen3.6 Plus # Model ID: qwen3.6-plus Family: qwen3.6 Status: active Release date: 2026-04-02 Last updated (Models.dev): 2026-04-02 Context window: 1M Max output tokens: 65,536 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: — Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.50 $3.00 $0.05 $0.62 Estimated requests (Go plan):\nPer 5 hours Per week Per month 3,300 8,200 
16,300 Available in: API Models.dev Docs\nMiniMax M2.7 # Model ID: minimax-m2.7 Family: minimax-m2.7 Status: active Release date: 2026-03-18 Last updated (Models.dev): 2026-03-18 Context window: 205K Max output tokens: 131,072 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.30 $1.20 $0.06 $0.38 Estimated requests (Go plan):\nPer 5 hours Per week Per month 3,400 8,500 17,000 Available in: API Models.dev Docs\nMiMo V2 Pro # Model ID: mimo-v2-pro Family: mimo-v2-pro Status: deprecated Release date: 2026-03-18 Last updated (Models.dev): 2026-03-18 Context window: 1M Max output tokens: 128,000 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $1.00 $3.00 $0.20 — Available in: API Models.dev\nMiMo V2 Omni # Model ID: mimo-v2-omni Family: mimo-v2-omni Status: deprecated Release date: 2026-03-18 Last updated (Models.dev): 2026-03-18 Context window: 262K Max output tokens: 128,000 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.40 $2.00 $0.08 — Available in: API Models.dev\nQwen3.5 Plus # Model ID: qwen3.5-plus Family: qwen3.5 Status: deprecated Release date: 2026-02-16 Last updated (Models.dev): 2026-02-16 Context window: 262K Max output tokens: 65,536 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: — Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.20 $1.20 $0.02 $0.25 Available in: API Models.dev\nMiniMax M2.5 # Model ID: minimax-m2.5 Family: minimax-m2.5 Status: deprecated Release date: 2026-02-12 Last updated (Models.dev): 2026-02-12 Context window: 205K Max output tokens: 65,536 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.30 $1.20 $0.03 $0.38 Available in: API 
Models.dev\nGLM-5 # Model ID: glm-5 Family: glm Status: deprecated Release date: 2026-02-11 Last updated (Models.dev): 2026-02-11 Context window: 203K Max output tokens: 32,768 Reasoning: yes Tool calling: yes Attachment support: — Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $1.00 $3.20 $0.20 — Available in: API Models.dev\nKimi K2.5 # Model ID: kimi-k2.5 Family: kimi-k2 Status: deprecated Release date: 2026-01-27 Last updated (Models.dev): 2026-01-27 Context window: 262K Max output tokens: 65,536 Reasoning: yes Tool calling: yes Attachment support: yes Open weights: yes Pricing (per 1M tokens):\nInput Output Cache Read Cache Write $0.60 $3.00 $0.10 — Available in: API Models.dev\nHy3 Preview # Model ID: hy3-preview Family: Status: active Release date: — Last updated (Models.dev): — Context window: — Max output tokens: — Reasoning: — Tool calling: — Attachment support: — Open weights: — Available in: API\nUndocumented Models # These models appear in the API metadata endpoint but are not listed on the official Go documentation page. They may still be usable depending on your subscription \u0026ndash; run /models in the TUI to check. They are typically deprecated predecessors, preview/beta models, or restricted-access models.\nModel Release Date Status Likely Reason Kimi K2.7 Code 2026-06-12 active Unknown / restricted MiMo V2 Pro 2026-03-18 deprecated Deprecated predecessor of mimo-v2.5-pro MiMo V2 Omni 2026-03-18 deprecated Deprecated predecessor of mimo-v2.5 (multimodal variant) Qwen3.5 Plus 2026-02-16 deprecated Deprecated predecessor of qwen3.6-plus / qwen3.7-max MiniMax M2.5 2026-02-12 deprecated Unknown / restricted GLM-5 2026-02-11 deprecated Unknown / restricted Kimi K2.5 2026-01-27 deprecated Unknown / restricted Hy3 Preview — active Tencent Hy3 preview/beta, not generally available Deprecated Models # These models are marked as deprecated in Models.dev. 
They may still be available through the API but are likely to be removed.\nModel Release Date Replacement MiMo V2 Pro 2026-03-18 mimo-v2.5-pro MiMo V2 Omni 2026-03-18 mimo-v2.5 Qwen3.5 Plus 2026-02-16 qwen3.6-plus / qwen3.7-max MiniMax M2.5 2026-02-12 — GLM-5 2026-02-11 — Kimi K2.5 2026-01-27 — Source Availability Matrix # Model API Models.dev Docs Status GLM-5.2 yes yes yes active Kimi K2.7 Code yes yes — active Qwen3.7 Plus yes yes yes active MiniMax M3 (3x usage) yes yes yes active Qwen3.7 Max yes yes yes active DeepSeek V4 Pro yes yes yes active DeepSeek V4 Flash yes yes yes active MiMo V2.5 Pro yes yes yes active MiMo V2.5 yes yes yes active Kimi K2.6 yes yes yes active GLM-5.1 yes yes yes active Qwen3.6 Plus yes yes yes active MiniMax M2.7 yes yes yes active MiMo V2 Pro yes yes — deprecated MiMo V2 Omni yes yes — deprecated Qwen3.5 Plus yes yes — deprecated MiniMax M2.5 yes yes — deprecated GLM-5 yes yes — deprecated Kimi K2.5 yes yes — deprecated Hy3 Preview yes — — active About This Page # This page is automatically generated by a scheduled CI pipeline that queries three publicly accessible sources:\nOpenCode Go API \u0026ndash; the unauthenticated metadata endpoint listing all models in the Go catalog Models.dev \u0026ndash; provider/model metadata including costs, release dates, and deprecation status OpenCode Go Documentation \u0026ndash; the official docs page with request estimates and pricing What this tracker can and cannot tell you # What it verifies: which models exist in each public data source, their release dates, pricing, deprecation status, and when models are added to or removed from those sources.\nWhat it cannot verify: which models your specific subscription entitles you to. The API metadata endpoint is unauthenticated and lists models that may not be available to all subscribers. 
To see which models you can actually use, run /models in the OpenCode TUI \u0026ndash; that command authenticates with your API key.\nThe \u0026ldquo;Documented\u0026rdquo; column above reflects which models appear on the official docs page \u0026ndash; those typically match what a standard Go subscription includes, but this is not independently verified by this tracker.\nThis is an unofficial tracker. It is not affiliated with or endorsed by OpenCode or Anomaly.\nSee Also # OpenCode Go: Can $10/Month Open Models Replace Frontier APIs? — my detailed benchmark and analysis of the Go model lineup Building an LLM Gateway with LiteLLM and OpenCode Go Router — how I built a gateway to route between Go models OpenCode Go official page ","date":"26 June 2026","externalUrl":null,"permalink":"/opencode-go-models/","section":"Julien.Cloud","summary":"","title":"OpenCode Go Models","type":"page"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/categories/ai/","section":"Categories","summary":"","title":"Ai","type":"categories"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"Ai","type":"tags"},{"content":"Essays and deep dives on infrastructure engineering — from Kubernetes platform design to AI inference at scale.\n","date":"9 June 2026","externalUrl":null,"permalink":"/blog/","section":"Blog","summary":"","title":"Blog","type":"blog"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/edge-ai/","section":"Tags","summary":"","title":"Edge-Ai","type":"tags"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/gemma4/","section":"Tags","summary":"","title":"Gemma4","type":"tags"},{"content":"","date":"9 June 
2026","externalUrl":null,"permalink":"/tags/homelab/","section":"Tags","summary":"","title":"Homelab","type":"tags"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/inference/","section":"Tags","summary":"","title":"Inference","type":"tags"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/jetson/","section":"Tags","summary":"","title":"Jetson","type":"tags"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/llm/","section":"Tags","summary":"","title":"Llm","type":"tags"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/local-models/","section":"Tags","summary":"","title":"Local-Models","type":"tags"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/ollama/","section":"Tags","summary":"","title":"Ollama","type":"tags"},{"content":"A few months ago I added an NVIDIA Jetson Orin Nano Developer Kit to my homelab. The idea was simple: a dedicated, always-on inference server for local LLMs, completely separate from my main Proxmox cluster. With 8GB of unified memory and an integrated CUDA-capable GPU, it sounded like the perfect edge device for running small models.\nReality, as usual, was more nuanced. Not every model that fits on disk actually runs well. Some swap to eMMC and crawl. Others load fast but generate gibberish. After testing six models across two generations of Google\u0026rsquo;s Gemma architecture, surviving a GPU driver rabbithole, and learning more about CMA memory than I ever wanted to know, I found a setup that delivers 25.5 tokens per second on GPU with zero swap.\nThis post covers both rounds of testing, the hardware limits, the GPU configuration that took hours to figure out, and the service architecture that keeps everything running.\nThe Hardware Reality Check # The Jetson Orin Nano Developer Kit has 8GB of shared memory. That means RAM and GPU VRAM come from the same pool. There is no dedicated graphics memory to fall back on. 
Storage is a 56GB eMMC module, fast enough for OS duties but brutally slow when used as swap.\nKey constraints from day one:\n8GB unified memory limits model size and context length ARM64 architecture restricts which models have native support eMMC swap is a performance cliff — once the system starts swapping, inference drops from 17 tokens per second to under 2 This means model selection is not about downloading the latest 8B parameter release and hoping for the best. It is about finding the sweet spot between size, speed, and quality on very specific hardware.\nRound One: Gemma 3 # I tested six models with the same prompt — \u0026ldquo;Hello, how are you?\u0026rdquo; — and measured tokens per second, memory footprint, and overall responsiveness.\nModel Size Tokens/sec Verdict llama3.1:8b 4.9 GB ~8-10 Too large \u0026ndash; heavy swap makes it unusable llama3.2:3b 2.0 GB ~15-18 Fast but mediocre reasoning quality llama3.2:3b-12k 2.0 GB ~15-18 Same speed with extended context, same quality limits qwen3.5:2b 2.7 GB ~18-22 Fastest of the bunch \u0026ndash; but weak at reasoning tasks gemma3:4b 3.3 GB ~16-20 Runner-up \u0026ndash; solid speed/quality balance, fits in RAM gemma3:4b-8k 3.3 GB ~17.5 Winner (GPU not yet functional) \u0026ndash; best reasoning, zero swap Gemma 3 4B with a custom 8K context window became the production model. 17.5 tok/s on CPU, zero swap, stable for months. But it ran on CPU only — the GPU was idle because the original Ollama binary had dropped JetPack 5 support.\nRound Two: Gemma 4 Arrives # In early 2026, Google released Gemma 4 with two edge variants: E2B (2.3B effective) and E4B (4.5B effective). The on-paper specs were compelling: 128K native context, thinking mode, function calling, system prompt support. The default Ollama quantization was 7.2 GB — too large — but the QAT tag changed everything.\nQAT (Quantization-Aware Training) quantizes during training rather than after. The result: gemma4:e2b-it-qat at 4.3 GB instead of 7.2 GB. 
Same architecture, 40% smaller. This is not a niche optimization; it is the difference between fitting and failing on 8 GB hardware.\nFeature Gemma 3 4B Gemma 4 E2B Disk size 3.3 GB 4.3 GB (QAT) Default context 8K 128K Thinking mode No Yes Function calling No Yes System prompt No Yes MMLU Pro score ~50% 60% MMLU Pro scores are approximate, sourced from published benchmarks and community results on comparable hardware. Your mileage will vary with quantization and prompt style.\nPulling the model was the easy part. Making it run on GPU was not.\nThe GPU Driver Rabbit Hole # The Jetson runs Ubuntu 22.04 with a CUDA 12.6 driver. Ollama 0.30.6 expects a cuda_jetpack6 directory — which was missing despite the system having the right driver version. The problem: the CUDA toolkit directory layout was from an older JetPack 5 installation (cuda_jetpack5), and the Ollama binary only checks for cuda_jetpack6.\nThree failed approaches before finding the right one:\nAttempt 1: JetPack 5 CUDA libs (CUDA 11.x). Symlinked cuda_jetpack6 to the existing cuda_jetpack5. GPU was detected but model loading failed with cudaMalloc failed: out of memory. These libs route all GPU allocations through CMA (Contiguous Memory Allocator), which defaults to 256 MB on Jetson. An LLM needs gigabytes.\nAttempt 2: Generic CUDA 12 libs. Symlinked to cuda_v12 from the main ARM64 tarball. GPU skipped entirely — the libggml-cuda.so was compiled for desktop GPU architectures (SM 5.0 through 9.0) but not Orin\u0026rsquo;s CC 8.7.\nAttempt 3: The CMA trap. Tried cma=4096M in the kernel boot parameters to expand the memory pool. This broke GPU detection entirely — the CUDA driver could not initialize when CMA consumed half the system RAM. Even cma=1024M had the same effect. 
The lesson: never touch CMA on Jetson.\nThe working solution: Extract the JetPack 6 CUDA tarball from the Ollama release.\ncurl -L https://github.com/ollama/ollama/releases/download/v0.30.6/ollama-linux-arm64-jetpack6.tar.zst -o ollama-jp6.tar.zst sudo tar --zstd -xf ollama-jp6.tar.zst -C /usr/local sudo systemctl restart ollama This provides libggml-cuda.so compiled with Orin CC 8.7 support and CUDA 12.6 runtime libs that match the Jetson driver. GPU discovery confirmed:\ninference compute: library=CUDA compute=8.7 name=CUDA0 description=Orin No CMA tweaks. No symlinks. Just the right libs in the right place.\nBenchmark: Gemma 4 on GPU # With GPU working, the numbers were decisive:\nModel Mode Tok/s Cold Load RAM GPU gemma3:4b-8k CPU 17.5 0.79s 4.5 GB No gemma4:e2b-it-qat CPU 12.4 70s 5.0 GB No gemma4:e2b-4k GPU 25.7 67s 3.4 GB 100% gemma4:e2b-8k GPU 25.5 ~30-70s 3.6 GB 100% Gemma 4 on GPU is 46% faster than Gemma 3 on CPU while using 1 GB less RAM. The 8K context window has a negligible speed penalty versus 4K (25.5 vs 25.7 tok/s) — the KV cache is small next to the model weights. The 128K native context support is there if needed, though I settled on 8K as the practical sweet spot.\nCold load is the only downside: 30-70 seconds versus Gemma 3\u0026rsquo;s sub-second CPU load. But a keepalive service that pings the model every 4 minutes makes this a non-issue. The model stays in GPU memory permanently.\nCreating the Custom 8K Context Model # The default gemma4:e2b-it-qat is a raw weights download with Ollama\u0026rsquo;s default context cap. To set an explicit 8K context window (matching what fits comfortably in the 8GB unified memory) and give it a friendly name, use a Modelfile:\n# Modelfile for gemma4:e2b-8k FROM gemma4:e2b-it-qat PARAMETER num_ctx 8192 Then create the named model:\nollama create gemma4:e2b-8k -f Modelfile The gemma4:e2b-4k variant in the benchmarks was the same base model capped at 4096 context for comparison. 
The 8K cap shows zero speed penalty \u0026ndash; the KV cache overhead is negligible next to the 4.3 GB model weights. You could go higher (128K native is supported) but at some point memory pressure from the KV cache starts eating into the safety margin.\nQuality: Is Gemma 4 Actually Smarter? # Benchmarks on paper are one thing. Real prompts are another. I tested both models on three tasks:\nLogic reasoning: \u0026ldquo;If a shirt takes 4 hours to dry, how long for 3 shirts?\u0026rdquo;\nGemma 4 answered correctly (4 hours, simultaneous drying) with structured step-by-step reasoning: \u0026ldquo;Since all three shirts dry independently at the same rate, you only need to wait the time required for one shirt to finish.\u0026rdquo; Gemma 3 sometimes fell for the multiplication trap.\nCode generation: \u0026ldquo;Write a Sieve of Eratosthenes in Python.\u0026rdquo;\nGemma 4 produced clean, commented code with proper edge cases (n \u0026lt; 2 returns empty), complexity analysis (O(N log log N)), and usage examples. Gemma 3 was adequate but less thorough.\nLong-form generation: \u0026ldquo;Write a technical essay about transformer architecture.\u0026rdquo;\nGemma 4 generated 2,700+ coherent tokens with technical depth on attention mechanisms, positional encoding, and multi-head attention. Sustained 25.5 tok/s throughout with no degradation.\nDespite having fewer effective parameters (2.3B vs Gemma 3\u0026rsquo;s ~4B), the QAT quantization and architectural improvements in Gemma 4 produce noticeably better output. The thinking mode — where the model outputs a chain-of-thought before the final answer — adds further quality for complex reasoning tasks.\nThe CMA Lesson # The Jetson\u0026rsquo;s Contiguous Memory Allocator defaults to 256 MB. On the old CUDA 11 libs, this was a hard bottleneck — every GPU memory allocation went through CMA, which is orders of magnitude too small. 
On the JetPack 6 CUDA 12 libs, GPU memory allocations bypass CMA and use system memory directly.\nBut CMA still matters for compute buffer allocation. When a model loads, a small compute buffer (100-200 MB) goes through CMA. If CMA is fragmented from a previous model load/unload cycle, the new load fails with cudaMalloc failed even though 6+ GB of system RAM is free. CMA fragmentation is permanent — it survives Ollama restarts and only a full reboot clears it.\nThe fix: never unload the model. The keepalive service is not just for cold-start latency — it prevents CMA fragmentation. If the model stays loaded, CMA is consumed once at initial load and never touched again.\nThe boot sequence must be careful:\nOllama starts, GPU discovery runs (CMA still clean) Preload fires, loads gemma4 (uses CMA once, model stays warm forever) Keepalive takes over (pings every 4 minutes, never unloads) The preload service must point at gemma4 — if gemma3 loads at boot, it consumes CMA and gemma4\u0026rsquo;s GPU load fails later Keeping the Model Warm # The service architecture uses three systemd units:\nollama.service — Main Ollama daemon, always running on port 11434 ollama-preload.service — Oneshot that loads gemma4 20 seconds after Ollama starts, warming the model at boot ollama-keepalive.service — User service that pings the model every 4 minutes to prevent eviction and CMA fragmentation The preload and keepalive scripts read the model name from /etc/ollama/model.conf, making it trivial to switch models by changing one line:\nOLLAMA_MODEL=gemma4:e2b-8k Monitoring What Matters # jtop — Jetson-specific monitoring. Watch GPU utilization (100% during inference), RAM usage (3.6 GB under load), and temperature (under 70C with the reference cooler). tegrastats — Low-level telemetry for power draw, per-core CPU usage, and memory. htop — General system view, mostly to confirm swap stays near zero. If swap usage climbs during inference, something is wrong. 
The fix is never to add more swap — it is to use a smaller model or reduce context.\nFinal Setup # Device: NVIDIA Jetson Orin Nano Developer Kit (8GB) OS: Ubuntu 22.04.5 LTS (JetPack 6) Ollama: 0.30.6 with JetPack 6 CUDA libs Active model: gemma4:e2b-8k (custom 8K context, QAT quantized) Inference speed: 25.5 tokens/sec (GPU, warm) Memory footprint: 3.6 GB total (weights + KV cache) GPU: 100% CUDA0 Orin Swap usage: Zero Lessons Learned # QAT quantization matters more than you think. The -qat tag on Ollama is the difference between a model that fits (4.3 GB) and one that does not (7.2 GB). Always check for QAT variants before dismissing a model for edge hardware.\nJetPack 6 CUDA libs are required for GPU on Orin. The standard ARM64 ollama tarball lacks Orin support. The JetPack 6 tarball has it. This is not documented anywhere obvious.\nNever touch CMA. The cma= kernel parameter breaks GPU detection entirely. Default CMA (256 MB) is sufficient when using the correct CUDA libs.\nKeepalive prevents CMA fragmentation. On Jetson, it is the difference between a working GPU inference server and a brick that needs a reboot after every model unload. CMA fragmentation is permanent and unrecoverable without a full system reboot.\nGemma 4 on edge is worth the effort. 46% faster, noticeably smarter, 1 GB lighter on RAM, with thinking mode and function calling. The hour of CUDA debugging pays for itself in every inference.\nThe Jetson Orin Nano is not going to replace a GPU server. But as a dedicated local LLM endpoint — handling RAG queries, chat, code generation, and light automation — it punches well above its weight class. The key is respecting the hardware limits, choosing the right model, and getting the CUDA configuration right the first time.\n","date":"9 June 2026","externalUrl":null,"permalink":"/blog/jetson-nano-ollama-edge-inference/","section":"Blog","summary":"The journey from Gemma 3 4B (17.5 tok/s CPU) to Gemma 4 E2B (25.5 tok/s GPU) on the Jetson Orin Nano. 
Covers model testing, QAT quantization, the JetPack CUDA rabbithole, CMA traps, and the keepalive architecture that makes it all work.","title":"Running Ollama on a Jetson Orin Nano: From Gemma 3 to Gemma 4 with GPU Acceleration","type":"blog"},{"content":"","date":"9 June 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/ai-coding/","section":"Tags","summary":"","title":"Ai-Coding","type":"tags"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/docker/","section":"Tags","summary":"","title":"Docker","type":"tags"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/categories/infrastructure/","section":"Categories","summary":"","title":"Infrastructure","type":"categories"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/infrastructure/","section":"Tags","summary":"","title":"Infrastructure","type":"tags"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/litellm/","section":"Tags","summary":"","title":"Litellm","type":"tags"},{"content":"Recently, during a call with some friends, we talked about using an LLM router to improve model usage and token efficiency. I looked around at what solutions I could easily run locally to provide such capabilities. I also wanted something that would give me more visibility into my token and model consumption.\nLiteLLM does exactly that. Plus, it lets me use all models and routers with a single API key. One local proxy that routes everything, tracks spend, caches responses, and picks the right model based on how hard the task is. 
Sounds too good to be true?\nThat\u0026rsquo;s what we\u0026rsquo;re going to cover in this post.\nIf you want to try OpenCode Go, sign up with my referral link — we both get a $5 usage credit.\nIf you are a human # Here is the full story of the setup and why each decision was made.\nWhat is an LLM gateway, and why would you want one # An LLM gateway is a proxy that sits between your applications (or coding agent) and the various LLM providers you use. Instead of configuring your tools with multiple API keys, base URLs, and model-specific settings, you point everything at one local endpoint. The gateway handles the rest.\nThink of it like an API aggregator for language models. You send a standard OpenAI-style request to http://localhost:4000 (if hosted locally, of course), and the gateway forwards it to Anthropic, OpenAI, Gemini, or any other provider you have configured. The response comes back in the same format regardless of which model actually served it.\nThis matters for a few reasons:\nUnified access. One API key, one base URL, one response format. Your coding tools, scripts, and experiments all talk to the same local endpoint. Switching from GPT-5.5 to Claude Sonnet is a one-line change in the model name, not a provider swap.\nCost management. LiteLLM tracks spend per model, per key, and per user. You can see which models are costing you money, set budgets, and rotate provider keys without touching client configurations. This is especially useful when you are experimenting with multiple providers and need to know where the budget is going.\nResponse caching. If you or your tools send the same prompt twice, a cached response comes back instantly without hitting the provider API again. That saves money and reduces latency for repeated queries. Redis handles this transparently.\nVirtual key management. Instead of embedding your real provider keys in every tool, you generate virtual keys inside the gateway. 
Each virtual key can be scoped to specific models or rate-limited. If a key leaks, you revoke just that virtual key without touching your actual OpenAI or Anthropic accounts.\nProvider abstraction. The OpenAI API format has become the de facto standard. A gateway lets you use that same format with Anthropic, Google, Mistral, and any other provider. No need to learn different SDKs or handle different response shapes.\nFailover and routing. Some gateways can fall back to a secondary provider if the primary one is down. Others, like the complexity router I set up, can pick different models for different tasks automatically.\nIn short: an LLM gateway turns a mess of provider-specific integrations into one clean, observable, controllable interface.\nWhat this stack does # I deployed a local LiteLLM proxy on my laptop using a Docker Compose stack. It holds 27 models from 5 providers in a single configuration file, plus three complexity routers on top.\nThe providers I configured in LiteLLM are:\nProvider Models OpenAI GPT-5.5, GPT-5.4, GPT-5.4 Mini Anthropic Claude Haiku 4.5, Sonnet 4.5, Sonnet 4.6, Opus 4.5 through 4.8 Google Gemini 2.5 Pro, 3.1 Pro Preview, 3.5 Flash Mistral Mistral Small 4, Mistral Medium 3.5 OpenCode Go 12 models via custom opencodego provider (DeepSeek V4 Pro/Flash, GLM 5/5.1, Kimi K2.5/K2.6, MiMo V2.5/V2.5 Pro, MiniMax M2.5/M2.7, Qwen3.6 Plus/3.7 Max) Then I configured three complexity routers that sit on top of these models and route requests based on complexity:\nfrontier-router — routes across OpenAI and Anthropic models. opencodego-router — routes to 4 OpenCode Go models across four tiers (DeepSeek V4 Flash, DeepSeek V4 Pro, Kimi K2.6, MiMo V2.5 Pro) mistral-router — routes between Small 4 and Medium 3.5 OpenCode sees all 30 model entries (27 individual + 3 routers), but the routers are the ones I use the most.\nLiteLLM handles the complexity classification and model selection. 
Individual models are available for direct access or as a fallback if I have a problem with the router, for example.\nWhy it runs locally, not on the homelab # I do run a homelab. It has Portainer managing containers, Traefik handling reverse proxy and TLS, and a handful of services I access from anywhere. This stack is not part of that (yet? :D).\nRight now it runs directly on my laptop with docker compose up -d. That is intentional for this proof of concept. My primary use case is coding with OpenCode, which runs locally and talks to the proxy over localhost. No Traefik, no TLS termination, no DNS records. Just a local proxy that starts when I need it and stops when I do not.\nIf I move this to the homelab in the future, I would add TLS, authentication, and probably deploy it behind my Tailscale network. For now, as a single-developer coding setup, localhost is the right boundary. I could even deploy it on a VPS or cloud provider, but that\u0026rsquo;s for another day.\nThe architecture # Three containers in one Compose file:\nLiteLLM proxy — the OpenAI-compatible API gateway on port 4000 PostgreSQL — virtual key management, spend tracking, logs, etc. (persistent data) Redis — response caching API keys for the AI providers are passed through as environment variables. No env_file in Compose, just .env referenced through variable substitution. Secrets stay in one place locally and are gitignored.\nThe complete files # If you want to reproduce this setup, here are the full working templates. Create a directory (for example litellm-local/) and place these three files inside it, together with the providers.json that the Compose file mounts and that is covered later in this post. The templates include a few example models per provider. 
Add or remove models as needed — the pattern is the same for every provider.\nThe OpenCode config goes in your global ~/.config/opencode/opencode.jsonc, not in the project directory.\ndocker-compose.yaml — the full stack:\nservices: postgres: image: postgres:16-alpine container_name: litellm-local-postgres restart: unless-stopped environment: POSTGRES_DB: litellm POSTGRES_USER: litellm_admin POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} volumes: - postgres_data:/var/lib/postgresql/data healthcheck: test: [\u0026#34;CMD-SHELL\u0026#34;, \u0026#34;pg_isready -U litellm_admin -d litellm\u0026#34;] interval: 5s timeout: 5s retries: 10 redis: image: redis:7-alpine container_name: litellm-local-redis restart: unless-stopped command: [\u0026#34;redis-server\u0026#34;, \u0026#34;--appendonly\u0026#34;, \u0026#34;yes\u0026#34;] volumes: - redis_data:/data healthcheck: test: [\u0026#34;CMD\u0026#34;, \u0026#34;redis-cli\u0026#34;, \u0026#34;ping\u0026#34;] interval: 5s timeout: 5s retries: 10 litellm: image: docker.litellm.ai/berriai/litellm:main-stable container_name: litellm-local restart: unless-stopped ports: - \u0026#34;4000:4000\u0026#34; volumes: - ./litellm_config.yaml:/app/config.yaml:ro - ./providers.json:/app/.venv/lib/python3.13/site-packages/litellm/llms/openai_like/providers.json:ro environment: DATABASE_URL: postgresql://litellm_admin:${POSTGRES_PASSWORD}@postgres:5432/litellm REDIS_HOST: redis REDIS_PORT: 6379 LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY} LITELLM_SALT_KEY: ${LITELLM_SALT_KEY} # Add or remove provider keys as needed OPENAI_API_KEY: ${OPENAI_API_KEY} ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY} GEMINI_API_KEY: ${GEMINI_API_KEY} MISTRAL_API_KEY: ${MISTRAL_API_KEY} OPENCODE_GO_API_KEY: ${OPENCODE_GO_API_KEY} STORE_MODEL_IN_DB: \u0026#34;True\u0026#34; command: [\u0026#34;--config\u0026#34;, \u0026#34;/app/config.yaml\u0026#34;, \u0026#34;--port\u0026#34;, \u0026#34;4000\u0026#34;] healthcheck: test: [\u0026#34;CMD-SHELL\u0026#34;, \u0026#34;python -c 
\u0026#39;import urllib.request; urllib.request.urlopen(\\\u0026#34;http://localhost:4000/health/readiness\\\u0026#34;)\u0026#39; 2\u0026gt;/dev/null || exit 1\u0026#34;] interval: 15s timeout: 10s retries: 5 start_period: 30s depends_on: postgres: condition: service_healthy redis: condition: service_healthy volumes: postgres_data: redis_data: .env.example — copy to .env and fill in your keys:\nPOSTGRES_PASSWORD=change-me LITELLM_MASTER_KEY=sk-local-master-key LITELLM_SALT_KEY=replace-with-a-long-random-string # Add or remove as needed OPENAI_API_KEY= ANTHROPIC_API_KEY= GEMINI_API_KEY= MISTRAL_API_KEY= OPENCODE_GO_API_KEY= litellm_config.yaml — the full model list with cost tracking and three routers:\nmodel_list: # Frontier router (OpenAI + Anthropic) - model_name: frontier-router litellm_params: model: auto_router/complexity_router complexity_router_config: tiers: SIMPLE: gpt-5.4-mini MEDIUM: claude-sonnet-4-6 COMPLEX: gpt-5.5 REASONING: claude-opus-4-8 model_info: mode: chat disable_background_health_check: true # OpenAI - model_name: gpt-5.5 litellm_params: model: openai/gpt-5.5 api_key: \u0026#34;os.environ/OPENAI_API_KEY\u0026#34; - model_name: gpt-5.4 litellm_params: model: openai/gpt-5.4 api_key: \u0026#34;os.environ/OPENAI_API_KEY\u0026#34; - model_name: gpt-5.4-mini litellm_params: model: openai/gpt-5.4-mini api_key: \u0026#34;os.environ/OPENAI_API_KEY\u0026#34; # Anthropic - model_name: claude-haiku-4-5 litellm_params: model: anthropic/claude-haiku-4-5 api_key: \u0026#34;os.environ/ANTHROPIC_API_KEY\u0026#34; - model_name: claude-sonnet-4-5 litellm_params: model: anthropic/claude-sonnet-4-5 api_key: \u0026#34;os.environ/ANTHROPIC_API_KEY\u0026#34; - model_name: claude-sonnet-4-6 litellm_params: model: anthropic/claude-sonnet-4-6 api_key: \u0026#34;os.environ/ANTHROPIC_API_KEY\u0026#34; - model_name: claude-opus-4-5 litellm_params: model: anthropic/claude-opus-4-5 api_key: \u0026#34;os.environ/ANTHROPIC_API_KEY\u0026#34; - model_name: claude-opus-4-6 
litellm_params: model: anthropic/claude-opus-4-6 api_key: \u0026#34;os.environ/ANTHROPIC_API_KEY\u0026#34; - model_name: claude-opus-4-7 litellm_params: model: anthropic/claude-opus-4-7 api_key: \u0026#34;os.environ/ANTHROPIC_API_KEY\u0026#34; - model_name: claude-opus-4-8 litellm_params: model: anthropic/claude-opus-4-8 api_key: \u0026#34;os.environ/ANTHROPIC_API_KEY\u0026#34; # Google Gemini - model_name: gemini-2.5-pro litellm_params: model: gemini/gemini-2.5-pro api_key: \u0026#34;os.environ/GEMINI_API_KEY\u0026#34; - model_name: gemini-3.1-pro-preview litellm_params: model: gemini/gemini-3.1-pro-preview api_key: \u0026#34;os.environ/GEMINI_API_KEY\u0026#34; - model_name: gemini-3.5-flash litellm_params: model: gemini/gemini-3.5-flash api_key: \u0026#34;os.environ/GEMINI_API_KEY\u0026#34; # Mistral (versioned, with explicit cost tracking) - model_name: mistral-small-4 litellm_params: model: mistral/mistral-small-2603 api_key: \u0026#34;os.environ/MISTRAL_API_KEY\u0026#34; input_cost_per_token: 0.00000015 output_cost_per_token: 0.00000060 - model_name: mistral-medium-3-5 litellm_params: model: mistral/mistral-medium-3-5 api_key: \u0026#34;os.environ/MISTRAL_API_KEY\u0026#34; input_cost_per_token: 0.00000150 output_cost_per_token: 0.00000750 # Mistral complexity router - model_name: mistral-router litellm_params: model: auto_router/complexity_router complexity_router_config: tiers: SIMPLE: mistral-small-4 COMPLEX: mistral-medium-3-5 model_info: mode: chat disable_background_health_check: true # OpenCode Go (all 12 models with cost tracking, using custom opencodego provider) - model_name: opencode-deepseek-v4-pro litellm_params: model: opencodego/deepseek-v4-pro api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000084 output_cost_per_token: 0.00000253 - model_name: opencode-deepseek-v4-flash litellm_params: model: opencodego/deepseek-v4-flash api_key: 
\u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000009 output_cost_per_token: 0.00000027 - model_name: opencode-glm-5 litellm_params: model: opencodego/glm-5 api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000100 output_cost_per_token: 0.00000320 - model_name: opencode-glm-5-1 litellm_params: model: opencodego/glm-5.1 api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000140 output_cost_per_token: 0.00000440 - model_name: opencode-kimi-k2-5 litellm_params: model: opencodego/kimi-k2.5 api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000060 output_cost_per_token: 0.00000300 - model_name: opencode-kimi-k2-6 litellm_params: model: opencodego/kimi-k2.6 api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000095 output_cost_per_token: 0.00000400 - model_name: opencode-mimo-v2-5 litellm_params: model: opencodego/mimo-v2.5 api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000009 output_cost_per_token: 0.00000027 - model_name: opencode-mimo-v2-5-pro litellm_params: model: opencodego/mimo-v2.5-pro api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000084 output_cost_per_token: 0.00000252 - model_name: opencode-minimax-m2-5 litellm_params: model: opencodego/minimax-m2.5 api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000030 output_cost_per_token: 0.00000120 - model_name: opencode-minimax-m2-7 litellm_params: model: opencodego/minimax-m2.7 api_key: 
\u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000030 output_cost_per_token: 0.00000120 - model_name: opencode-qwen3-6-plus litellm_params: model: opencodego/qwen3.6-plus api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000050 output_cost_per_token: 0.00000300 - model_name: opencode-qwen3-7-max litellm_params: model: opencodego/qwen3.7-max api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 input_cost_per_token: 0.00000250 output_cost_per_token: 0.00000750 # OpenCode Go complexity router (tuned for OpenCode\u0026#39;s large baseline context) - model_name: opencodego-router litellm_params: model: auto_router/complexity_router complexity_router_config: tiers: SIMPLE: opencode-deepseek-v4-flash MEDIUM: opencode-deepseek-v4-pro COMPLEX: opencode-kimi-k2-6 REASONING: opencode-mimo-v2-5-pro # Token count is useless for chat with large system prompts # (OpenCode baseline is ~13k tokens). Kill it and let # content-based signals drive routing. dimension_weights: tokenCount: 0.0 reasoningMarkers: 0.40 simpleIndicators: 0.20 technicalTerms: 0.25 codePresence: 0.10 multiStepPatterns: 0.03 questionComplexity: 0.02 # Lower boundaries to compensate for tokenCount removal tier_boundaries: simple_medium: 0.10 # was 0.15 medium_complex: 0.25 # was 0.35 complex_reasoning: 0.55 # was 0.60 model_info: mode: chat disable_background_health_check: true general_settings: master_key: \u0026#34;os.environ/LITELLM_MASTER_KEY\u0026#34; database_url: \u0026#34;os.environ/DATABASE_URL\u0026#34; health_check_skip_disabled_background_models: true litellm_settings: cache: true cache_params: type: redis namespace: litellm.local ~/.config/opencode/opencode.jsonc — my global OpenCode config. 
I keep the LiteLLM provider there with all routers and models:\n{ \u0026#34;$schema\u0026#34;: \u0026#34;https://opencode.ai/config.json\u0026#34;, \u0026#34;provider\u0026#34;: { \u0026#34;litellm-local\u0026#34;: { \u0026#34;npm\u0026#34;: \u0026#34;@ai-sdk/openai-compatible\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;LiteLLM Local\u0026#34;, \u0026#34;options\u0026#34;: { \u0026#34;baseURL\u0026#34;: \u0026#34;http://127.0.0.1:4000/v1\u0026#34;, \u0026#34;apiKey\u0026#34;: \u0026#34;{env:LITELLM_API_KEY}\u0026#34; }, \u0026#34;models\u0026#34;: { \u0026#34;opencodego-router\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go Router (recommended default)\u0026#34; }, \u0026#34;frontier-router\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Frontier Router (recommended)\u0026#34; }, \u0026#34;mistral-router\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Mistral Router (recommended)\u0026#34; }, \u0026#34;gpt-5.5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;GPT-5.5 via LiteLLM\u0026#34; }, \u0026#34;gpt-5.4\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;GPT-5.4 via LiteLLM\u0026#34; }, \u0026#34;gpt-5.4-mini\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;GPT-5.4 Mini via LiteLLM\u0026#34; }, \u0026#34;claude-haiku-4-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Claude Haiku 4.5 via LiteLLM\u0026#34; }, \u0026#34;claude-sonnet-4-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Claude Sonnet 4.5 via LiteLLM\u0026#34; }, \u0026#34;claude-sonnet-4-6\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Claude Sonnet 4.6 via LiteLLM\u0026#34; }, \u0026#34;claude-opus-4-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Claude Opus 4.5 via LiteLLM\u0026#34; }, \u0026#34;claude-opus-4-6\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Claude Opus 4.6 via LiteLLM\u0026#34; }, \u0026#34;claude-opus-4-7\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Claude Opus 4.7 via LiteLLM\u0026#34; }, \u0026#34;claude-opus-4-8\u0026#34;: { 
\u0026#34;name\u0026#34;: \u0026#34;Claude Opus 4.8 via LiteLLM\u0026#34; }, \u0026#34;gemini-2.5-pro\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Gemini 2.5 Pro via LiteLLM\u0026#34; }, \u0026#34;gemini-3.1-pro-preview\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Gemini 3.1 Pro Preview via LiteLLM\u0026#34; }, \u0026#34;gemini-3.5-flash\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Gemini 3.5 Flash via LiteLLM\u0026#34; }, \u0026#34;mistral-small-4\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Mistral Small 4 via LiteLLM\u0026#34; }, \u0026#34;mistral-medium-3-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;Mistral Medium 3.5 via LiteLLM\u0026#34; }, \u0026#34;opencode-deepseek-v4-pro\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go DeepSeek V4 Pro via LiteLLM\u0026#34; }, \u0026#34;opencode-deepseek-v4-flash\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go DeepSeek V4 Flash via LiteLLM\u0026#34; }, \u0026#34;opencode-glm-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go GLM 5 via LiteLLM\u0026#34; }, \u0026#34;opencode-glm-5-1\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go GLM 5.1 via LiteLLM\u0026#34; }, \u0026#34;opencode-kimi-k2-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go Kimi K2.5 via LiteLLM\u0026#34; }, \u0026#34;opencode-kimi-k2-6\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go Kimi K2.6 via LiteLLM\u0026#34; }, \u0026#34;opencode-mimo-v2-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go MiMo V2.5 via LiteLLM\u0026#34; }, \u0026#34;opencode-mimo-v2-5-pro\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go MiMo V2.5 Pro via LiteLLM\u0026#34; }, \u0026#34;opencode-minimax-m2-5\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go MiniMax M2.5 via LiteLLM\u0026#34; }, \u0026#34;opencode-minimax-m2-7\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go MiniMax M2.7 via LiteLLM\u0026#34; }, 
\u0026#34;opencode-qwen3-6-plus\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go Qwen3.6 Plus via LiteLLM\u0026#34; }, \u0026#34;opencode-qwen3-7-max\u0026#34;: { \u0026#34;name\u0026#34;: \u0026#34;OpenCode Go Qwen3.7 Max via LiteLLM\u0026#34; }, } } }, \u0026#34;model\u0026#34;: \u0026#34;litellm-local/opencodego-router\u0026#34;, \u0026#34;small_model\u0026#34;: \u0026#34;litellm-local/mistral-router\u0026#34; } Once deployed, type /model in your OpenCode terminal and you will see all models plus the three routers.\nHow caching works in practice # Redis caches responses based on the prompt content and model name. If you send the exact same message to the same model twice, the second request hits the cache instead of the provider API. That means:\nZero cost for the second identical request Instant response instead of waiting for the provider Less rate-limit pressure on your provider accounts The cache is keyed by a hash of the request, so even a single character difference results in a fresh provider call. For a coding agent like OpenCode, this means the cache is rarely useful. Each request carries the full conversation history, tool definitions, and dynamic system context, so two requests are almost never identical even when you retry the same user message. The cache still helps if you use the same model for simple one-shot API calls outside the agent — for example, a curl request with no history or tools — but do not expect cache hits from your day-to-day coding sessions.\nHow virtual keys and spend tracking work # LiteLLM lets you generate virtual API keys that are scoped to specific models or users. Each virtual key has its own rate limits, budget, and other settings. When a request comes in, LiteLLM logs which virtual key was used, which model was called, how many tokens were consumed, and what the estimated cost was.\nAll of this data lives in PostgreSQL. 
You can query it directly or view it in the LiteLLM web UI at http://127.0.0.1:4000/ui.\nFor my setup, I mostly use the master key for simplicity, but I also generated a scoped virtual key for OpenCode. The virtual key only has access to the models I actually want OpenCode to use. If something goes wrong, I can revoke that key without touching the master key or any provider keys.\nThe model configuration # The litellm_config.yaml maps each model name to its provider and API key. This is where the abstraction happens. On the outside, everything looks like model: \u0026quot;gpt-5.5\u0026quot;. On the inside, LiteLLM knows to call OpenAI\u0026rsquo;s API with the OpenAI key.\nOpenCode Go models use the opencodego/ prefix through a custom provider definition. By default, LiteLLM would treat OpenCode Go as openai/ since the API is OpenAI-compatible, but that causes all Go spend to be categorized under \u0026ldquo;OpenAI\u0026rdquo; on the billing dashboard.\nUsing a custom opencodego provider in providers.json fixes this:\n{ \u0026#34;opencodego\u0026#34;: { \u0026#34;base_url\u0026#34;: \u0026#34;https://opencode.ai/zen/go/v1\u0026#34;, \u0026#34;api_key_env\u0026#34;: \u0026#34;OPENCODE_GO_API_KEY\u0026#34; } } The providers.json file is mounted into the LiteLLM container at /app/.venv/lib/python3.13/site-packages/litellm/llms/openai_like/providers.json (the exact path depends on the LiteLLM version and Python path inside the container — find it with docker exec litellm-local find /app -name providers.json). Each OpenCode Go entry in litellm_config.yaml then uses the opencodego/ prefix:\n- model_name: opencode-deepseek-v4-flash litellm_params: model: opencodego/deepseek-v4-flash api_key: \u0026#34;os.environ/OPENCODE_GO_API_KEY\u0026#34; api_base: https://opencode.ai/zen/go/v1 Cost tracking # LiteLLM maintains a built-in price list for major providers (OpenAI, Anthropic, Google, Mistral) at https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json. 
For those models, you do not need to specify costs — LiteLLM looks them up automatically.\nOpenCode Go models are not in that list. That is why every OpenCode Go entry in my config includes explicit input_cost_per_token and output_cost_per_token values. LiteLLM multiplies these by the token counts in each response and logs the result to PostgreSQL.\nThis matters because \u0026ldquo;unified access\u0026rdquo; is only half the story. The other half is knowing what that access costs. With 12 OpenCode Go models ranging from $0.09 to $2.50 per million input tokens, the difference between routing to Flash and routing to Qwen3.7 Max is real. The complexity router handles that decision automatically, but the cost parameters let you audit it after the fact.\nThe complexity routers # This is the part I like the most.\nLiteLLM has a built-in auto_router/complexity_router feature. It inspects the incoming prompt, classifies its complexity, and routes to a different model accordingly.\nUnlike some routers that call an external LLM to classify each query — adding latency, cost, and non-deterministic behavior — LiteLLM\u0026rsquo;s complexity router uses pure pattern matching and heuristics. No API calls, no extra tokens, and the classification is deterministic: the same prompt always lands on the same tier.\nWhy tokenCount: 0.0 — the key tuning decision # OpenCode (or any other AI Coding tool) sends a lot of context with every request: system prompt, tools, MCP instructions, agent rules, and conversation history. Even a trivial session can start around 10k tokens before I type anything meaningful. And then every request contains the full history. If token count participates in the complexity score, every request looks expensive and gets pushed toward the COMPLEX tier. A Hi with 10k tokens of context is still a Hi.\nThe root issue is that a coding agent mixes two very different signals into one payload. 
There are actually two routing questions, not one:\nQuestion Good signal Why it matters for agentic coding How hard is this task? Current message content Token count is noise here. The request size grows because the tool appends history, files, and tool output — it measures session length, not task difficulty. Can this request fit safely and cheaply? Total request tokens Token count matters. A 200k-token session routed to the REASONING tier might fit technically, but sending it to a cheaper model with a large enough window saves money. Most one-shot API calls get both answers from the same signal — a large request tends to be a complex request. But agentic coding breaks that assumption. The request grows over time regardless of what you are asking, so raw token count stops being a useful proxy for anything except capacity.\nThe rule: do not use raw total tokens as a complexity signal for long-running coding agents unless the router can separate current-user-message tokens from accumulated context tokens. For OpenCode, Claude Code, Cursor, Aider, or any agentic IDE workflow, set tokenCount to 0.0 and let content-based signals drive routing.\nToken count should inform capacity decisions (history summarization, context window safety), not complexity classification. 
Those are different problems — treat them as such.\nOpenCode Go router # - model_name: opencodego-router litellm_params: model: auto_router/complexity_router complexity_router_config: tiers: SIMPLE: opencode-deepseek-v4-flash MEDIUM: opencode-deepseek-v4-pro COMPLEX: opencode-kimi-k2-6 REASONING: opencode-mimo-v2-5-pro dimension_weights: tokenCount: 0.0 reasoningMarkers: 0.40 simpleIndicators: 0.20 technicalTerms: 0.25 codePresence: 0.10 multiStepPatterns: 0.03 questionComplexity: 0.02 tier_boundaries: simple_medium: 0.10 medium_complex: 0.25 complex_reasoning: 0.55 The router scores each request across six content-based dimensions:\nDimension Weight What it catches reasoningMarkers 0.40 Phrases like \u0026ldquo;step by step\u0026rdquo;, \u0026ldquo;think through\u0026rdquo;, \u0026ldquo;explain your reasoning\u0026rdquo; technicalTerms 0.25 Domain complexity: architecture, distributed, throughput, latency, encryption, scalability simpleIndicators 0.20 Greetings, definitions, basic facts, short simple questions codePresence 0.10 Code-related terms: function, class, refactor, implement, api, error, docker, kubernetes multiStepPatterns 0.03 Sequential instructions like \u0026ldquo;first\u0026hellip; then\u0026hellip;\u0026rdquo; or numbered steps questionComplexity 0.02 Compound questions and multiple question marks tokenCount 0.00 Disabled — see rationale above The matching uses word boundaries for single-word keywords, so \u0026ldquo;microservice\u0026rdquo; matches \u0026ldquo;microservice\u0026rdquo; but not \u0026ldquo;microservices\u0026rdquo;. Multi-word phrases use substring matching.\nTier boundaries were lowered to compensate for removing tokenCount:\nBoundary Default Current simple_medium 0.15 0.10 medium_complex 0.35 0.25 complex_reasoning 0.60 0.55 How scoring works # The router extracts the last system prompt and the last user message from the request, then scores each dimension against the combination, multiplies by its weight, and sums the result. 
The conversation history and tool definitions are not scanned — only the final system prompt and current user message participate. reasoningMarkers is the strictest dimension: it scans the user message alone, ignoring even the system prompt, to prevent the system prompt from forcing every request into the REASONING tier. tokenCount is the exception: when enabled, it counts the full request body. That is another reason to keep it at 0.0.\nA few things to know about how dimensions are scored before looking at examples:\nScoring is threshold-based, not per-match. Each dimension counts keyword matches, then maps the count to a raw score (0 to 1, or -1) based on thresholds. Having more matches beyond the threshold adds nothing. For codePresence, 2+ matches gives the maximum raw score of 1.0 — it does not matter whether you hit 2 or 20 matches. For technicalTerms, 4+ matches hits the max (2-3 matches scores 0.5, anything below 2 scores 0). Same pattern for every dimension. simpleIndicators is negative: a single match scores -1.0, which pulls the total score down. At weight 0.20, one greeting costs you -0.20. This is the mechanism that keeps \u0026ldquo;Hi\u0026rdquo; out of MEDIUM tier even in a session loaded with technical context. The system prompt participates too. reasoningMarkers is the only dimension that scans the user message alone. All others — codePresence, technicalTerms, simpleIndicators, multiStepPatterns — use the last system prompt plus the user message. The conversation history and tool definitions are excluded regardless. reasoningMarkers has a scoring threshold: 0 matches = 0, 1 match = 0.7, 2+ matches = 1.0. But 2+ matches also triggers the bypass (see below), so the 1.0 score is never actually used — the request never reaches the weighted scoring step. 
Here is how different prompts land on the tier ladder with these weights:\n# Prompt reasoning technical simple codePresence Score Tier 1 \u0026ldquo;Hello, can you help me?\u0026rdquo; 0 0 -0.20 0 -0.20 SIMPLE 2 \u0026ldquo;Refactor the API to use async database queries with proper error handling\u0026rdquo; 0 0 0 0.10 0.10 MEDIUM 3 \u0026ldquo;Design a distributed microservice architecture with container orchestration, high throughput, and low latency\u0026rdquo; 0 0.25 0 0 0.25 COMPLEX 4 \u0026ldquo;Explain your reasoning for this authentication architecture. First, analyze the distributed design, then implement the container orchestration layer.\u0026rdquo; 0.28 0.25 0 0.05 0.595 REASONING Each column shows the weighted contribution (raw score × weight). Weights: reasoning 0.40, technical 0.25, simple 0.20, codePresence 0.10. simpleIndicators always scores -1.0 (raw), so its contribution is negative. multiStepPatterns (0.03) and questionComplexity (0.02) omitted — they rarely tip a decision.\nExample 1 is a greeting. simpleIndicators catches \u0026ldquo;hello\u0026rdquo; (raw -1.0 × 0.20 = -0.20). No other dimension fires. Score -0.20 → SIMPLE.\nExample 2 matches five code keywords (refactor, api, async, database, error). Raw score 1.0 (2+ matches triggers the high threshold), weighted 1.0 × 0.10 = 0.10. No other dimension fires. Score 0.10 → MEDIUM.\nExample 3 has no code matches but seven technical terms (distributed, microservice, architecture, container, orchestration, throughput, latency). Raw score 1.0 (4+ matches), weighted 1.0 × 0.25 = 0.25. Score 0.25 → COMPLEX.\nExample 4 reaches REASONING through normal scoring. One reasoning marker (\u0026ldquo;explain your reasoning\u0026rdquo;, raw 0.7 × 0.40 = 0.28), five technical terms (raw 1.0 × 0.25 = 0.25), one code keyword (\u0026ldquo;implement\u0026rdquo;, raw 0.5 × 0.10 = 0.05), and the \u0026ldquo;first\u0026hellip;then\u0026rdquo; multi-step pattern (raw 0.5 × 0.03 = 0.015). 
Total 0.595 \u0026gt; 0.55 → REASONING.\nThere is also a faster path to REASONING. If the user message contains two or more reasoning markers, the router bypasses normal scoring entirely and returns REASONING directly. A prompt like \u0026ldquo;analyze this step by step and think carefully\u0026rdquo; should not go through a weighted formula — it is obviously a reasoning request. This override is implemented in LiteLLM\u0026rsquo;s source (complexity_router.py:225), not something I added.\nThe scoring logic and keyword lists used by the router live in LiteLLM\u0026rsquo;s repository:\nScoring and tier selection: litellm/router_strategy/complexity_router/complexity_router.py Keyword patterns and default weights: litellm/router_strategy/complexity_router/config.py How the tiers break down in practice:\nSIMPLE → DeepSeek V4 Flash. A lightweight 284B-parameter model with 13B active parameters. Handles straightforward coding tasks, code review, simple refactors, and one-line completions. Fast and cheap at $0.09/$0.27 per million tokens. MEDIUM → DeepSeek V4 Pro. The full 1.6T-parameter version. Better reasoning, better code generation, 1M context window. $0.84/$2.53 per million tokens. COMPLEX → Kimi K2.6. 1T total parameters, 32B active. Scores 80.2% on SWE-bench Verified and leads the OpenCode Go lineup on agentic coding. $0.95/$4.00 per million tokens. REASONING → MiMo V2.5 Pro. Purpose-built for long-horizon autonomous tasks. Xiaomi reports it built a complete compiler in 4.3 hours unsupervised. Terminal-Bench 2.0 score of 68.4%. $0.84/$2.52 per million tokens. Creating your own routers # The same pattern works for any model combination — you are not limited to a router per provider. Once models are defined in model_list, you can reference them by their model_name in any router\u0026rsquo;s tier mapping. 
Cross-provider routers work too: a SIMPLE tier pointing to Mistral Small 4 and a REASONING tier pointing to GPT-5.5 is valid.\nThen you call any router like a normal model:\ncurl http://127.0.0.1:4000/v1/chat/completions \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -H \u0026#34;Authorization: Bearer $LITELLM_API_KEY\u0026#34; \\ -d \u0026#39;{\u0026#34;model\u0026#34;: \u0026#34;opencodego-router\u0026#34;, \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Hello\u0026#34;}]}\u0026#39; One model name. The router figures out the rest.\nThe health check bug # LiteLLM 1.86.2 has a gap: the built-in health check does not recognize auto_router as a valid provider. When the background health checker runs, it throws a 400 BadRequestError on the router model.\nThis happens because the health check function iterates through all configured models and tries to validate them by sending a test request. It knows how to handle openai/, anthropic/, gemini/, opencodego/, and mistral/ prefixes. It does not know what to do with auto_router/complexity_router.\nThe fix was two settings working together:\nOn the router model itself:\nmodel_info: disable_background_health_check: true In the global settings:\ngeneral_settings: health_check_skip_disabled_background_models: true This tells LiteLLM to skip all three routers during health checks while keeping all 27 individual models monitored. The routers still work for actual API calls. They just do not get tested by the background checker.\nThe UI \u0026ldquo;Test\u0026rdquo; button also respects this setting, so you will not see a red error badge on the router model. That is important because a permanently failing health check makes the dashboard noisy and distracts from real issues.\nOpenCode integration # I keep my OpenCode config in ~/.config/opencode/opencode.jsonc (the global config). 
It includes a litellm-local provider pointing to http://127.0.0.1:4000/v1. The provider exposes all 27 models plus all three routers, so I can use whichever I need.\nDefault workflow: I use the routers. /model opencodego-router in the TUI lets LiteLLM classify the prompt and pick the right OpenCode Go model automatically. Same for frontier with /model frontier-router. The OpenCode Go router is my go-to for general coding tasks. The frontier router is useful when I specifically want OpenAI or Anthropic models.\nFallback workflow: If the router misclassifies a prompt, returns an error, or I simply want a specific model for a particular task, I can call any model directly: /model litellm-local/claude-sonnet-4-6 or /model litellm-local/opencode-kimi-k2-6. Having all models in the config has saved me more than once when the router had a transient issue.\nThe key change in the global config was moving from a hardcoded API key to an environment variable:\n\u0026#34;apiKey\u0026#34;: \u0026#34;{env:LITELLM_API_KEY}\u0026#34; No more exposed keys in config files. The key is injected at runtime from the shell environment, which means it never sits in version control and can be rotated easily.\nTo use it:\nexport LITELLM_API_KEY=\u0026#34;sk-your-litellm-key\u0026#34; opencode Then /model opencodego-router for the smart default, or /model litellm-local/\u0026lt;any-model\u0026gt; for direct access.\nWhat I learned # A few things worth noting if you are building something similar:\nSTORE_MODEL_IN_DB must be \u0026ldquo;True\u0026rdquo; — without this, the LiteLLM UI does not see models added through the config file. The UI\u0026rsquo;s auto-router feature also depends on it. I spent about 20 minutes wondering why the UI showed an empty model list before finding this in the LiteLLM docs. It is not obvious from the config file alone.\nVirtual keys are worth the extra step — the master key works, but generating a separate LiteLLM virtual key for day-to-day clients is safer. 
If the key leaks, you revoke it without touching the admin key. It also lets you scope access: a virtual key for OpenCode gets the full model list, while a testing key might only get access to a subset.\nCost parameters matter — I initially skipped input_cost_per_token and output_cost_per_token on the OpenCode Go models. The proxy worked fine, but the spend dashboard showed zero cost for every request. Adding the parameters means LiteLLM can calculate per-request cost and aggregate it in PostgreSQL. The values come from OpenCode Go\u0026rsquo;s pricing page.\nCustom providers fix billing categorization — the OpenCode Go API is OpenAI-compatible, so LiteLLM treats it as openai/ by default. That means all Go spend shows up under \u0026ldquo;OpenAI\u0026rdquo; in the billing dashboard. Creating a custom opencodego provider in providers.json and switching the model prefixes from openai/ to opencodego/ gives Go its own spend category. Finding the right mount path inside the container took a few tries (/app/litellm/llms/... didn\u0026rsquo;t work because LiteLLM is installed as a pip package, not from source \u0026ndash; the actual path is under /app/.venv/lib/python3.13/site-packages/litellm/llms/openai_like/providers.json).\nproviders.json — custom provider definitions:\n{ \u0026#34;opencodego\u0026#34;: { \u0026#34;base_url\u0026#34;: \u0026#34;https://opencode.ai/zen/go/v1\u0026#34;, \u0026#34;api_key_env\u0026#34;: \u0026#34;OPENCODE_GO_API_KEY\u0026#34; }, \u0026#34;opencodezen\u0026#34;: { \u0026#34;base_url\u0026#34;: \u0026#34;https://opencode.ai/zen/v1\u0026#34;, \u0026#34;api_key_env\u0026#34;: \u0026#34;OPENCODE_ZEN_API_KEY\u0026#34; } } This file is minimal because LiteLLM only needs the base URL and key environment variable. The rest of the model configuration (model names, costs, API keys) comes from litellm_config.yaml. 
I also added a few other providers here for future use — PublicAI, Helicone, VeniceAI, and others that might be useful later.\nVerification # Once the stack is running, these commands confirm everything works:\n# List all 30 model entries curl http://127.0.0.1:4000/v1/models \\ -H \u0026#34;Authorization: Bearer $LITELLM_API_KEY\u0026#34; # Test the router curl http://127.0.0.1:4000/v1/chat/completions \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -H \u0026#34;Authorization: Bearer $LITELLM_API_KEY\u0026#34; \\ -d \u0026#39;{\u0026#34;model\u0026#34;: \u0026#34;opencodego-router\u0026#34;, \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Hello\u0026#34;}]}\u0026#39; # Inspect the actual deployment selected by the router curl -s -D - http://127.0.0.1:4000/v1/chat/completions \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -H \u0026#34;Authorization: Bearer $LITELLM_API_KEY\u0026#34; \\ -d \u0026#39;{\u0026#34;model\u0026#34;: \u0026#34;opencodego-router\u0026#34;, \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Hi\u0026#34;}], \u0026#34;max_tokens\u0026#34;: 10}\u0026#39; # Check health (27 models monitored, 3 routers skipped) curl http://127.0.0.1:4000/health \\ -H \u0026#34;Authorization: Bearer $LITELLM_API_KEY\u0026#34; The first command should return a JSON list of all 30 model objects. The second should return a chat completion. The third prints response headers; x-litellm-model-id shows the actual deployment selected by the router. 
The final health command should show 27 green checks and three skipped entries.\nCurrent state # Everything is working:\n27 models from 5 providers in LiteLLM One custom opencodego provider defined via providers.json for clean billing separation Three complexity routers (Frontier, OpenCode Go, Mistral) Redis caching enabled and returning cached hits PostgreSQL tracking keys and spend per request Health checks passing for 27/30 model entries (3 routers skipped) OpenCode configured with all 30 model entries (27 models + 3 routers), routers as defaults If you are an AI agent # Use this section as the operational runbook.\nWhen this article applies # This procedure is a match when the user wants to:\nunify multiple LLM providers behind a single OpenAI-compatible endpoint run the proxy locally in Docker (not on a remote server or Kubernetes cluster) build smart routers that pick the right model based on prompt content and complexity expose all models to OpenCode with routers as the recommended defaults What this stack deploys # Three containers via Docker Compose:\nLiteLLM proxy (docker.litellm.ai/berriai/litellm:main-stable) on port 4000 PostgreSQL 16 Alpine for virtual key management and spend tracking Redis 7 Alpine for response caching Preconditions # Docker and Docker Compose installed API keys for the providers the user wants to enable (OpenAI, Anthropic, Google, Mistral, OpenCode Go) An OpenCode Go subscription key if the user wants the complexity router and direct Go model access Files to create # All files go in a single directory (e.g. 
litellm-local/):\nFile Purpose docker-compose.yaml Container definitions for PostgreSQL, Redis, and LiteLLM litellm_config.yaml Model list, router configuration, cache settings providers.json Custom provider definitions (opencodego billing fix) .env Secrets (gitignored) .env.example Template with placeholder values Environment variables # POSTGRES_PASSWORD=\u0026lt;strong password\u0026gt; LITELLM_MASTER_KEY=sk-\u0026lt;random\u0026gt; LITELLM_SALT_KEY=\u0026lt;long random string\u0026gt; OPENAI_API_KEY= ANTHROPIC_API_KEY= GEMINI_API_KEY= MISTRAL_API_KEY= OPENCODE_GO_API_KEY= All API keys must be listed in the environment section of the LiteLLM service in docker-compose.yaml. If a key is missing from the Compose environment block, that provider\u0026rsquo;s models will return 401 errors even if the key is in .env.\nModel configuration # litellm_config.yaml must contain:\nA model_list with entries for each model. Provider prefix format: OpenAI: openai/\u0026lt;model-id\u0026gt; Anthropic: anthropic/\u0026lt;model-id\u0026gt; Gemini: gemini/\u0026lt;model-id\u0026gt; Mistral: pinned model IDs, for example mistral/mistral-small-2603 and mistral/mistral-medium-3-5 OpenCode Go: opencodego/\u0026lt;model-id\u0026gt; with api_base: https://opencode.ai/zen/go/v1 (requires custom providers.json mounted in the container) Router entries. 
The same pattern applies to any combination of models — reference them by model_name in any router\u0026rsquo;s tier mapping, even across providers: OpenCode Go router (with tuned weights for agentic coding):\n- model_name: opencodego-router litellm_params: model: auto_router/complexity_router complexity_router_config: tiers: SIMPLE: opencode-deepseek-v4-flash MEDIUM: opencode-deepseek-v4-pro COMPLEX: opencode-kimi-k2-6 REASONING: opencode-mimo-v2-5-pro dimension_weights: tokenCount: 0.0 reasoningMarkers: 0.40 simpleIndicators: 0.20 technicalTerms: 0.25 codePresence: 0.10 multiStepPatterns: 0.03 questionComplexity: 0.02 tier_boundaries: simple_medium: 0.10 medium_complex: 0.25 complex_reasoning: 0.55 model_info: mode: chat disable_background_health_check: true general_settings: master_key: \u0026#34;os.environ/LITELLM_MASTER_KEY\u0026#34; database_url: \u0026#34;os.environ/DATABASE_URL\u0026#34; health_check_skip_disabled_background_models: true litellm_settings: cache: true cache_params: type: redis namespace: litellm.local Required Compose environment variable # STORE_MODEL_IN_DB must be set to \u0026quot;True\u0026quot; in the LiteLLM container environment. Without this, the LiteLLM UI will not display models defined in the config file, and the UI auto-router feature will not work.\nSteps # Create the directory and files. Copy .env.example to .env and fill in real API keys. Run docker compose up -d. Wait for all three containers to be healthy (docker compose ps). Verify models: curl http://127.0.0.1:4000/v1/models -H \u0026quot;Authorization: Bearer $LITELLM_MASTER_KEY\u0026quot;. Test a chat completion: send a POST to http://127.0.0.1:4000/v1/chat/completions with a valid model name. Test the router: send a request with \u0026quot;model\u0026quot;: \u0026quot;opencodego-router\u0026quot;. Check health: curl http://127.0.0.1:4000/health -H \u0026quot;Authorization: Bearer $LITELLM_MASTER_KEY\u0026quot;. 
OpenCode Go router behavior # The opencodego-router is intentionally tuned for OpenCode\u0026rsquo;s large baseline context. Do not use token count as a routing signal for this router.\nRequired behavior:\ntokenCount weight must stay at 0.0. Routing must be driven by content-based signals only: reasoning markers, technical terms, simple indicators, code presence, multi-step patterns, and question complexity. A short greeting must route to the same tier whether the request has no previous context or tens of thousands of tokens of previous context. Tier boundaries should be lower than LiteLLM defaults: simple_medium: 0.10, medium_complex: 0.25, complex_reasoning: 0.55. Messages with two or more reasoning markers should route directly to REASONING / opencode-mimo-v2-5-pro. Keep complexity routing separate from capacity routing:\nComplexity routing should classify the current task from message content. Capacity routing may use total request tokens to choose a long-context model, trigger history summarization, or prevent an over-limit request. Raw total token count is useful for one-shot document-heavy calls, but it is misleading for long-running coding-agent sessions where the request grows because of accumulated history and tool context. For tools like OpenCode, Claude Code, Cursor, Aider, Continue, or another agentic IDE workflow, set raw tokenCount to 0.0 or very low if the router receives the full request payload. Practical expectation:\nPrompt type Expected tier Model Greeting or very simple query SIMPLE opencode-deepseek-v4-flash Explanation request MEDIUM opencode-deepseek-v4-pro Coding, refactor, architecture COMPLEX opencode-kimi-k2-6 Explicit deep reasoning REASONING opencode-mimo-v2-5-pro Health check behavior # 27 of 30 model entries are monitored by background health checks. 
All three routers (frontier-router, opencodego-router, and mistral-router) are intentionally skipped because LiteLLM 1.86.2 does not recognize auto_router as a valid provider in the health check function. The settings that enforce this are disable_background_health_check: true on each router model and health_check_skip_disabled_background_models: true in general_settings. The routers still function correctly for actual API requests. OpenCode integration # Add a litellm-local provider to OpenCode config. Expose all models plus the routers:\n{ \u0026#34;provider\u0026#34;: { \u0026#34;litellm-local\u0026#34;: { \u0026#34;npm\u0026#34;: \u0026#34;@ai-sdk/openai-compatible\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;LiteLLM Local\u0026#34;, \u0026#34;options\u0026#34;: { \u0026#34;baseURL\u0026#34;: \u0026#34;http://127.0.0.1:4000/v1\u0026#34;, \u0026#34;apiKey\u0026#34;: \u0026#34;{env:LITELLM_API_KEY}\u0026#34; }, \u0026#34;models\u0026#34;: { \u0026#34;frontier-router\u0026#34;: {}, \u0026#34;opencodego-router\u0026#34;: {}, \u0026#34;mistral-router\u0026#34;: {}, \u0026#34;gpt-5.5\u0026#34;: {}, \u0026#34;claude-sonnet-4-6\u0026#34;: {}, \u0026#34;gemini-2.5-pro\u0026#34;: {}, \u0026#34;mistral-small-4\u0026#34;: {}, \u0026#34;opencode-deepseek-v4-flash\u0026#34;: {} // ... add remaining models as needed } } } } This goes in the global ~/.config/opencode/opencode.jsonc. The {env:LITELLM_API_KEY} syntax tells OpenCode to read the key from the environment at runtime.\nSet \u0026quot;model\u0026quot;: \u0026quot;litellm-local/opencodego-router\u0026quot; and \u0026quot;small_model\u0026quot;: \u0026quot;litellm-local/mistral-router\u0026quot; so the routers are the defaults. 
Individual models are available for fallback if a router has issues.\nDo not use this setup when # the user needs the proxy to be accessible from multiple machines (would need TLS and proper auth) the user wants managed infrastructure with automated backups (PostgreSQL volume is local) the user only uses one provider and does not need routing or unified access Further reading # LiteLLM documentation https://docs.litellm.ai/ LiteLLM complexity router https://docs.litellm.ai/docs/proxy/caching_and_routing OpenCode configuration https://opencode.ai/docs/config/ OpenCode providers https://opencode.ai/docs/providers/ This article was written with an AI agent at my side — I brought the expertise, it helped with the words. ","date":"7 June 2026","externalUrl":null,"permalink":"/blog/llm-gateway-for-opencode-building-a-local-litellm-router/","section":"Blog","summary":"27 models from 5 providers in LiteLLM, exposed to OpenCode through smart routers that pick the right model tier by prompt content, not context size. 
Runs locally via Docker with caching, spend tracking, and one endpoint.","title":"LLM Gateway for OpenCode: Building a Local LiteLLM Router","type":"blog"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/opencode/","section":"Tags","summary":"","title":"Opencode","type":"tags"},{"content":"","date":"30 May 2026","externalUrl":null,"permalink":"/tags/agentic/","section":"Tags","summary":"","title":"Agentic","type":"tags"},{"content":"","date":"30 May 2026","externalUrl":null,"permalink":"/tags/benchmarks/","section":"Tags","summary":"","title":"Benchmarks","type":"tags"},{"content":"","date":"30 May 2026","externalUrl":null,"permalink":"/tags/open-models/","section":"Tags","summary":"","title":"Open-Models","type":"tags"},{"content":"The short answer is: for most coding tasks, yes.\nOpenCode Go is a $10/month subscription ($5 for the first month) that gives you a single API key to 12 curated open coding models hosted in the US, EU, and Singapore with a zero-data-retention policy. The included monthly usage is $60 — that is 6x leverage before any overage.\nWant to try it? Sign up with my referral link — we both get a $5 usage credit.\nThe economics alone are interesting. But what makes this genuinely relevant is that the models themselves have caught up to closed frontiers on the benchmarks that actually matter for production coding.\nWhat OpenCode Go actually is # OpenCode Go handles model-provider benchmarking, routing, and access negotiation. One API key. One predictable monthly bill. 
No juggling multiple provider accounts or per-token pricing.\nSubscription terms:\nWindow Usage limit 5 hours $12 Weekly $30 Monthly $60 Overage draws from your OpenCode Zen balance if enabled.\nThe $60 monthly cap means you can burn through a lot of tokens if you pick cheap models. For instance, DeepSeek V4 Flash gets you ~31,650 requests per 5-hour window and MiMo-V2.5 gets you ~30,100, while Kimi K2.6 gets you ~1,150.\nA note on \u0026ldquo;requests\u0026rdquo; vs tokens # OpenCode Go bills in monetary limits ($12/5h, $60/month), but their documentation talks about requests per window rather than per-token rates. This makes sense for their product but can be confusing if you are used to thinking in tokens.\nOne request is one API call — you send a prompt, the model generates a response. But a \u0026ldquo;request\u0026rdquo; is not a fixed-size unit.\nA request against DeepSeek V4 Flash averages ~790 input tokens, ~68K cached context tokens, and ~280 output tokens. A request against Kimi K2.6 averages ~870 input, ~55K cache, and ~200 output tokens; one against GLM-5.1 averages ~700 input, ~52K cache, and ~150 output tokens. The input sizes are similar, but the reasoning models generate longer chains of thought, which means more output tokens per request.\nSo when you see \u0026ldquo;31,650 requests per 5h\u0026rdquo; for V4 Flash, that number reflects OpenCode\u0026rsquo;s observed average request size. A reasoning model consumes more budget per call because it generates more tokens per request — even if the per-token rates look similar. 
Treat request-per-window numbers as order-of-magnitude guides, not guarantees.\nHow Go models compare to frontier models # Before diving into individual models, here is the headline comparison against the four major closed frontier models as of May 2026.\nModel SWE-bench Verified SWE-bench Pro Terminal-Bench 2.0 Representative public API input $/1M Representative public API output $/1M GPT-5.5 88.7% ~60% 82.7% $5.00 $30.00 Claude Opus 4.7 87.6% 64.3% 69.4% $5.00 $25.00 Claude Sonnet 4.6 79.6% ~43% 59.1% $3.00 $15.00 Gemini 3.1 Pro 80.6% 54.2% 68.5% $2.00 $12.00 Kimi K2.6 80.2% 58.6% 66.7% $0.95 $4.00 Qwen3.7 Max 80.4% 60.6% 69.7% $2.50 $7.50 MiMo-V2.5-Pro 78.9% 57.2% 68.4% $1.74 $3.48 DeepSeek V4 Pro 80.6% 55.4% 67.9% $1.74 $3.48 MiniMax M2.5 80.2% 55.4% ~52% $0.30 $1.20 GLM-5.1 — 58.4% 63.5% $0.98 $3.08 SWE-bench scores are a mix of vendor-reported and independently-verified results. \u0026ldquo;Pro\u0026rdquo; is the harder benchmark — multi-language, larger repos. Pricing reflects representative public API pricing at the time of writing, not OpenCode Go billing, which is listed separately below.\nThe pattern is clear: several Go models land within a few points of frontier closed models on SWE-bench Verified at far lower prices. On SWE-bench Pro, Qwen3.7 Max (60.6%) leads the Go lineup ahead of Kimi K2.6 (58.6%) and GLM-5.1 (58.4%), though all three remain behind the strongest published Claude Opus 4.7 and GPT-5.5 results.\nThe 12 Go models, practically # Not all Go models are created equal. The lineup spans from 10B-active-parameter efficiency beasts to 1.6T frontier chasers. Here is what actually matters for each.\nTier 1: Maximum agentic coding quality # Kimi K2.6 (Moonshot AI, April 2026)\nThe most capable model in Go for agentic coding by several metrics. 1T total / 32B active MoE, 256K context. 
Scores 80.2% SWE-bench Verified and 58.6% SWE-bench Pro — ahead of GPT-5.4 (the predecessor to GPT-5.5) on the hard benchmark and competitive with the best closed-model results. Its Agent Swarm system deploys up to 300 sub-agents coordinating 4,000 steps on a single task.\nOn April 30, 2026, K2.6 finished first in Day 12 of the AI Coding Contest (Word Gem Puzzle) ahead of GPT-5.5 (3rd) and Claude Opus 4.7 (5th), according to contest organizer Rohana Rezel (source). Available across 11 API providers at the time of writing.\nAA Intelligence Index: 54 (among the highest of any open-weight model) Supported context: 256K Go pricing: $0.95 input / $4.00 output per 1M tokens Go requests per 5h: ~1,150 ~99 t/s on Kimi API DeepSeek V4 Pro (DeepSeek, April 2026)\n1.6T / 49B active, 1M context. The biggest model in Go. Leads all models (open or closed) on LiveCodeBench Pass@1 at 93.5% and Codeforces at 3,206 — the highest competitive-programming rating publicly reported for a model at release.\nSWE-bench Verified scores 80.6%, and Terminal-Bench 2.0 reaches 67.9% with a 1M-token context window. Its hybrid CSA+HCA attention slashes compute: only 27% of the FLOPs and 10% of the KV cache of V3.2 at 1M context. That\u0026rsquo;s how a 1.6T model stays economically viable.\nAA Intelligence Index: ~52–57 (varies by source/providers) Supported context: 1M Go pricing: bundled in plan Go requests per 5h: ~3,450 ~30–60 t/s depending on reasoning mode (non-reasoning faster) Qwen3.7 Max (Alibaba, May 2026)\nNewest addition to the Go lineup, replacing Qwen3.5 Plus. Significantly more capable but also more expensive — $2.50 input / $7.50 output per 1M tokens, making it the priciest model in Go. 950 requests per 5h is the second-lowest in the lineup (ahead of only GLM-5.1). Positioned as the premium large-context option for tasks where quality justifies the cost.\nThe benchmark story is about agentic execution, not just chat polish. 
The standout score is 60.6% on SWE-bench Pro — the highest of any model in the Go lineup (ahead of Kimi K2.6 at 58.6% and GLM-5.1 at 58.4%). On SWE-bench Verified, it scores 80.4%, essentially matching DeepSeek V4 Pro (80.6%) and Gemini 3.1 Pro (80.6%). Terminal-Bench 2.0-Terminus reaches 69.7%, topping every Go model and trailing only GPT-5.5 (82.7%) among the models compared here.\nQwen emphasizes that these were run with an internal agent scaffold using bash and file-edit tools — closer to real agent operation than single-turn coding prompts. Also notable: a reported 35-hour autonomous kernel optimization run on unseen T-Head ZW-M890 hardware, reaching 10.0x geometric mean speedup through 1,158 tool calls.\nSWE-bench Verified: 80.4% SWE-bench Pro: 60.6% (highest in Go) Terminal-Bench 2.0: 69.7% GPQA-Diamond: 92.4% MCP-Mark: 60.8% BFCL-V4: 75.0% Go pricing: $2.50 input / $7.50 output per 1M tokens Go requests per 5h: ~950 Context: 1M Verdict # K2.6 for agent swarm capability. V4 Pro for competitive programming, LiveCodeBench, or 1M-context tasks that need top-end open-model quality. Qwen3.7 Max for the strongest SWE-bench Pro score in Go (60.6%), if you can afford the throughput trade-off. The choice depends on which benchmark matches your workload.\nTier 2: Best value workhorses # MiniMax M2.5 (MiniMax, February 2026)\n230B / 10B active MoE, 205K context. At its standard API price of $0.15/M input and $1.20/M output, it scores 80.2% on SWE-bench Verified while remaining dramatically cheaper than frontier closed models. Runs at ~100 t/s, completing evaluations 37% faster than its predecessor.\nLeads Multi-SWE-Bench (multilingual, 10+ languages) at 51.3%, suggesting it generalizes beyond English codebases. 
For high-volume agentic coding workflows where cost is the primary constraint, M2.5 is hard to beat.\nAA Intelligence Index: 42 (lower composite, but coding-specific evals are strong) Go pricing: $0.30 input / $1.20 output / $0.06 cache per 1M tokens Go requests per 5h: ~6,300 ~100 t/s DeepSeek V4 Flash (DeepSeek, April 2026)\n284B / 13B active, 1M context. The lightweight sibling to V4 Pro, released the same day. $0.14/M input, $0.28/M output — 12.4x cheaper per output token than V4 Pro — while trailing it by only 1.6 points on SWE-bench Verified (79.0% vs 80.6%).\nThe Go request volume is staggering: 31,650 requests per 5-hour window. For the 70–80% of tasks that are code review, RAG, single-function refactors, and debugging, Flash performance is sufficient. Use Flash for the bulk of work, escalate to Pro or K2.6 for hard cases.\nAA Intelligence Index: 47 Go requests per 5h: 31,650 (highest in Go, closely followed by MiMo-V2.5 at 30,100) ~97 t/s, 1.18s TTFT Verdict # MiniMax M2.5 for cost-per-benchmark-point. DeepSeek V4 Flash for raw volume and as the default workhorse. MiMo-V2.5 as a surprisingly cheap alternative with 30,100 req/5h. Qwen3.7 Max for premium large-context tasks where you need the best Qwen quality.\nTier 3: Long-horizon autonomous tasks # MiMo-V2.5-Pro (Xiaomi, April 2026)\n1.02T / 42B active, 1M context. Purpose-built for hours-long unsupervised coding sessions. Xiaomi reports it built a complete compiler in 4.3 hours and a desktop video editor (~8,000 lines) in 11.5 hours using ~1,870 tool calls.\nHeadline metric: Terminal-Bench 2.0 at 68.4% with SWE-bench Pro at 57.2%. Pure text model — no vision or audio input.\nAA Intelligence Index: 54 (tied with K2.6) Go pricing: $1.74 input / $3.48 output per 1M tokens Go requests per 5h: ~3,250 ~57–70 t/s MiMo-V2.5 (Xiaomi, April 2026)\nBase variant without the Pro long-horizon optimizations. Same 1.02T / 42B architecture but tuned for lower latency and higher throughput. 
At $0.14 input / $0.28 output per 1M tokens (same as DeepSeek V4 Flash), it delivers a massive 30,100 requests per 5h on Go — nearly matching V4 Flash for volume while offering the MiMo architecture. A strong alternative workhorse if you want the MiMo family without the Pro autonomous-session premium.\nGLM-5.1 (Z.AI, April 2026)\n754B / 40B active, 203K context. At release, Z.AI reported 58.4% on SWE-bench Pro, ahead of GPT-5.4 (57.7%), the predecessor to GPT-5.5. Z.AI\u0026rsquo;s benchmarks (shared on their official model page) also report 8+ hour autonomous runs, including building a complete Linux desktop system from scratch across 655 iterations with no human intervention.\nNotable: generates more tokens than peers to reach equivalent answers (verbose). Benchmarks are vendor-reported; independent verification may show more modest results.\nAA Intelligence Index: 51 Go requests per 5h: ~880 ~55–59 t/s, 1.42s TTFT GLM-5 (Z.AI, February 2026)\nPredecessor to GLM-5.1. Same 754B / 40B active architecture, lower benchmark scores, but higher throughput on Go — 1,150 requests per 5h vs GLM-5.1\u0026rsquo;s 880. Relevant if you need the GLM family on a budget and the 5.1-level improvements aren\u0026rsquo;t critical.\nVerdict: # MiMo-V2.5-Pro for Terminal-Bench-heavy workloads or 1M-context tasks; it is text-only, so look elsewhere for multimodal input. Qwen3.7 Max if SWE-bench Pro leadership matters (60.6% vs GLM-5.1 at 58.4%) and you can tolerate the low throughput. GLM-5.1 if you want long-autonomy runs and are willing to accept verbose output in exchange for task completion depth.\nTier 4: Specialized picks # Kimi K2.5 (Moonshot AI, January 2026)\nThe predecessor to K2.6. Still in Go, slightly cheaper. Its Agent Swarm (100 sub-agents, 1,500 steps) makes it interesting for complex search and research tasks. Moonshot reports BrowseComp at 78.4% and HLE at 50.2% (ahead of Claude Opus 4.5 on the Hard benchmark), which points to genuine reasoning depth.\nBut it is slow (44 t/s, 2.89s TTFT) and verbose. 
K2.6 is better in almost every dimension. Use K2.5 only if you need the specific Agent Swarm version or have a tight budget (yields more requests per window than K2.6 in Go).\nQwen3.6 Plus (Alibaba, April 2026)\nUpdate to Qwen3.5 Plus with significantly better agentic coding and tool-use. 78.8% SWE-bench Verified, 61.6% Terminal-Bench 2.0 (surpasses Claude Sonnet 4.6). 1M context, \u0026ldquo;Auto\u0026rdquo; mode for adaptive web search and code interpreter invocation. Sits below Qwen3.7 Max in the Qwen family — cheaper and with higher throughput (3,300 req/5h vs 950), making it the better value pick if Qwen Max-level quality isn\u0026rsquo;t needed.\nMiniMax M2.7 (MiniMax, March 2026)\nSelf-evolving successor to M2.5. Same architecture, same output price ($1.20/M), but slower and with stronger agentic capabilities. SWE-bench Pro improves from 55.4% to 56.2%. If you specifically need its agentic improvements over M2.5, it\u0026rsquo;s worth the trade-off.\nPricing and request volume # The Go plan uses monetary usage limits. Cheaper models get more requests.\nModel Lab Go req / 5h Go input $/1M Go output $/1M DeepSeek V4 Flash DeepSeek 31,650 bundled bundled MiMo-V2.5 Xiaomi 30,100 $0.14 $0.28 MiniMax M2.5 MiniMax 6,300 $0.30 $1.20 ($0.06 cache) DeepSeek V4 Pro DeepSeek 3,450 bundled bundled MiniMax M2.7 MiniMax 3,400 $0.30 $1.20 Qwen3.6 Plus Alibaba 3,300 bundled bundled MiMo-V2.5-Pro Xiaomi 3,250 $1.74 $3.48 Kimi K2.5 Moonshot AI 1,850 $0.60 $3.00 GLM-5 Z.AI 1,150 $1.00 $3.20 Kimi K2.6 Moonshot AI 1,150 $0.95 $4.00 Qwen3.7 Max Alibaba 950 $2.50 $7.50 GLM-5.1 Z.AI 880 $1.40 $4.40 Go bills in monetary usage limits, not per-token rates. The per-token figures shown here are effective rates reverse-engineered from typical request patterns and observed token consumption, not official itemized pricing. \u0026ldquo;Bundled\u0026rdquo; means per-token rates are not publicly itemized by OpenCode for that model — pricing draws directly from the $60/month usage allotment. 
Request estimates are based on OpenCode\u0026rsquo;s observed average token patterns per model family. A \u0026ldquo;request\u0026rdquo; is one API call, but the size varies significantly: a reasoning model like GLM-5.1 or Kimi K2.6 generates far more output tokens per request than a lightweight model like V4 Flash, which is why their per-window request counts are much lower.\nWhere frontier still leads # Being fair to the closed models: they still have real advantages.\nGPT-5.5 leads on Terminal-Bench 2.0 (82.7% vs the best Go model at 68.4%) and the AA Intelligence Index (60 vs 54–57 for the best open model, depending on source). For tasks where every point of benchmark improvement matters directly (like competitive programming finals), GPT-5.5 is still the ceiling.\nClaude Opus 4.7 still leads most of the published closed-model comparisons in this article, including SWE-bench Pro, and it remains particularly strong on reasoning depth for nuanced multi-step tasks, HLE-style benchmarks, and output quality for long trajectories. Its AA Intelligence Index of 57 reflects this.\nGemini 3.1 Pro at $2.00/M input and $12.00/M output is more competitive on price than the other frontier models, and its GPQA-Diamond (94.3%) and ARC-AGI-2 scores are unmatched by any Go model.\nFirst-token latency is another area where frontier APIs tend to be more consistent. OpenCode routes through third-party providers, and latency varies. Public independent benchmarks for Go endpoint latency do not exist at the same quality as provider-direct data. For interactive pair-programming, this matters.\nPractical recommendations # Maximum coding quality → Kimi K2.6 or DeepSeek V4 Pro\nK2.6 for agent swarm and multi-agent coordination. V4 Pro for competitive programming and LiveCodeBench. 
Qwen3.7 Max when you need the deepest reasoning on the hardest software engineering tasks.\nBest cost-per-task / high-volume workhorse → DeepSeek V4 Flash or MiMo-V2.5\nBoth deliver massive request volume at the lowest per-token rates in Go. Use Flash for raw throughput and benchmark scores; MiMo-V2.5 if you want the MiMo architecture, multimodal input, or 1M context at budget pricing.\nLong-horizon autonomy → MiMo-V2.5-Pro\nBuilt for multi-hour unsupervised sessions. 1M context. Text only — no vision or audio.\n1M context window → MiMo-V2.5(-Pro), Qwen3.7 Max, or DeepSeek V4 Flash.\nKimi and GLM are limited to 200–256K. MiniMax to 205K. If you need to ingest an entire large codebase in one request, the 1M-context models are your only option within Go.\nBottom line # OpenCode Go at $10/month removes two barriers to using open models: procurement overhead (one API key, one bill) and infrastructure (curated hosting across global regions).\nThe bigger shift is economic. The models in this lineup are now good enough that the question is no longer \u0026ldquo;Can open models match frontier?\u0026rdquo; but \u0026ldquo;Which model is the right tool for this specific task?\u0026rdquo; Not every coding job needs the most expensive frontier API. A $30/M output model is overkill for a code review. A lightweight model at $0.28/M output handles routine refactors, RAG, and debugging just fine.\nThe real skill is learning to route. For most workflows, the optimal setup is a hybrid: cheap open models for the bulk of work — code review, refactors, RAG, debugging — and frontier models reserved for the narrow slice where every point of benchmark quality or millisecond of latency matters. This is not about replacing closed APIs with open ones. 
It is about using the right tool for each layer of the stack.\nFor most teams, the practical setup is a tiered router: a cheap workhorse as the default, a premium open model for hard tasks, and a frontier model reserved for final validation or cases where every point of quality matters.\nModel Tier When to use Go req/5h Kimi K2.6 1 — max quality Agentic coding, Agent Swarm 1,150 DeepSeek V4 Pro 1 — max quality Competitive programming, LiveCodeBench, 1M ctx 3,450 Qwen3.7 Max 1 — max quality Hardest SWE-bench Pro tasks, premium large-context 950 DeepSeek V4 Flash 2 — workhorse Default for 70–80% of tasks. Code review, RAG, refactors 31,650 MiMo-V2.5 2 — workhorse High-volume budget option, 1M ctx, cheapest in Go 30,100 MiniMax M2.5 2 — workhorse High-volume agentic workflows on a tight budget 6,300 MiMo-V2.5-Pro 3 — long-horizon Multi-hour autonomous sessions, text only 3,250 GLM-5.1 3 — long-horizon No-intervention 8h+ autonomous runs 880 Kimi K2.5 4 — specialized Agent Swarm at a discount (100 sub-agents) 1,850 Qwen3.6 Plus 4 — specialized Mid-tier default, 1M ctx, good value 3,300 MiniMax M2.7 4 — specialized Agentic improvements over M2.5 3,400 I\u0026rsquo;ve been using OpenCode Go daily for months. The workflow that works is a deliberate mix: open models for volume and routine work, frontier models for the final layer where quality matters most. Flash or MiMo-V2.5 for the bulk, Qwen3.7 Max or Kimi K2.6 for the hard parts, Claude or GPT for validation. The savings are real, and the quality gap for everyday tasks is smaller than the pricing gap suggests.\nWorst case? You\u0026rsquo;re out $10 and know exactly why open models aren\u0026rsquo;t ready for your stack. Best case? You build a hybrid pipeline that costs a fraction of a frontier-only setup without sacrificing the tasks that actually need the frontier.\nThis article was written with an AI agent at my side — I brought the expertise, it helped with the words. 
Sources # OpenCode Go documentation — model lineup, pricing, request counts (May 2026) OpenCode Go referral link — $5 credit for you and me Qwen3.7-Max benchmark page — SWE-bench Pro, Terminal-Bench, and agentic coding scores (May 2026) Artificial Analysis — intelligence index and independent benchmarks OpenRouter — model availability and pricing reference Official model pages: Z.AI (GLM), Moonshot AI (Kimi), Xiaomi (MiMo), MiniMax (MiniMax), Alibaba (Qwen), DeepSeek (DeepSeek) Frontier comparisons: Anthropic (Claude), OpenAI (GPT), Google DeepMind (Gemini) Benchmark scores are a mix of vendor-reported and independently-verified results. SWE-bench Pro and Terminal-Bench 2.0 are newer benchmarks; always check methodology before making production decisions.\n","date":"30 May 2026","externalUrl":null,"permalink":"/blog/opencode-go-models-2026/","section":"Blog","summary":"12 open coding models benchmarked against Claude and GPT-5.5. DeepSeek V4 Flash handles 70% of tasks at 12x cheaper than DeepSeek V4 Pro. MiMo-V2.5 is now the cheapest high-volume option at 30,100 req/5h. Qwen3.7 Max leads on SWE-bench Pro (60.6%). Kimi K2.6 leads on agentic coding. 
Here’s how to route between them.","title":"OpenCode Go: Can $10/Month Open Models Replace Frontier APIs?","type":"blog"},{"content":"","date":"18 May 2026","externalUrl":null,"permalink":"/tags/encoding/","section":"Tags","summary":"","title":"Encoding","type":"tags"},{"content":"","date":"18 May 2026","externalUrl":null,"permalink":"/tags/filesystem/","section":"Tags","summary":"","title":"Filesystem","type":"tags"},{"content":"I hit a frustrating issue on my ZimaBlade after migrating files from an old Synology NAS to a ZimaOS RAID volume.\nSome filenames looked normal in the file browser, but in the terminal they were full of broken escape sequences like this:\nfacture_f$\u0026#39;\\202\u0026#39;vrier.pdf In other words, a filename that should have looked like facture_février.pdf was being rendered with a raw escaped byte instead.\nThat \\202 pattern was the clue: the files had been created with a legacy non-UTF-8 encoding, and ZimaOS was now exposing the raw bytes instead of valid accented characters.\nIf you have the same problem, this post shows the fix I used to clean an entire directory tree safely.\nIf you are a human # Here is the fuller explanation of what happened and why this approach worked.\nThe problem # After the migration, some files and folders contained corrupted accented characters:\né showed up as octal escapes like \\202 some tools refused to process the files shell commands became painful because filenames had to be escaped manually On a normal Debian or Ubuntu system, I would usually install a few troubleshooting packages and test different conversions.\nBut ZimaOS is more locked down than a standard Linux install. 
The root filesystem is immutable, so common package-manager-based fixes are not always available directly on the host.\nThat changed the approach.\nWhat caused it # The root cause was not ZimaOS itself.\nThe filenames had most likely been created years ago with a legacy Western encoding on the Synology side, then copied onto a modern Linux system that expects UTF-8.\nWhen that happens, accented characters can turn into mojibake or raw byte escapes instead of readable text.\nIn theory, the best fix is to recover the original encoding and convert the filenames properly.\nIn practice, that is not always realistic.\nIf the directory contains mixed encodings, or if you no longer trust the original source, spending hours trying to perfectly reconstruct every é, è, or ô may not be worth it. In my case, the pragmatic fix was to sanitize everything to safe ASCII.\nThat means:\nremove problematic characters replace spaces with underscores keep only portable characters that behave well across shells, scripts, cloud sync tools, and Linux filesystems This is not the most elegant fix, but it is often the most reliable one.\nWhat did not work well # Before going for the final cleanup, I tried the usual encoding-recovery path.\nThat included:\nchecking filenames with ls -ali testing convmv with encodings like cp1252, iso-8859-15, and macroman trying detox The problem was that the results were inconsistent.\ndetox failed with errors like:\nunsupported unicode length That strongly suggested invalid or mixed byte sequences rather than a clean single encoding that could be converted in one pass.\nAt that point, cleanup was a better option than recovery.\nThe fix: use Docker to run rename # Because ZimaOS is immutable, the easiest workaround was to run a temporary Debian container, mount the affected directory, install the Perl-based rename utility inside the container, and perform the rename there.\n1. 
Set the target directory # Replace this path with the directory you want to clean:\nTARGET=\u0026#34;/media/ZimaRaid/path/to/your/data\u0026#34; 2. Run a dry-run first # This shows what would be renamed without modifying anything:\ndocker run --rm -v \u0026#34;$TARGET:/mnt\u0026#34; debian:bookworm-slim /bin/bash -lc \u0026#39; apt-get update \u0026amp;\u0026amp; apt-get install -y rename \u0026amp;\u0026amp; find /mnt -depth -exec rename -n \u0026#34;s/[^A-Za-z0-9._\\/-]/_/g\u0026#34; {} + \u0026#39; Why this works:\nfind /mnt -depth processes children before parents, which is safer for directory renames rename applies a regex to each path [^A-Za-z0-9._\\/-] matches any character outside a safe ASCII set every unsafe character is replaced with _ 3. Run the real rename # If the dry-run looks good, remove -n:\ndocker run --rm -v \u0026#34;$TARGET:/mnt\u0026#34; debian:bookworm-slim /bin/bash -lc \u0026#39; apt-get update \u0026amp;\u0026amp; apt-get install -y rename \u0026amp;\u0026amp; find /mnt -depth -exec rename \u0026#34;s/[^A-Za-z0-9._\\/-]/_/g\u0026#34; {} + \u0026#39; That was enough to clean the full directory tree on my RAID volume.\nIf one filename is too broken to type # Sometimes one file is so badly encoded that even copying its name is annoying.\nIn that case, renaming by inode is a useful escape hatch:\nfind . -inum 3141553 -exec mv {} diplome_bac.pdf \\; You can get the inode with:\nls -ali This is handy when you only need to fix one stubborn file manually.\nHow I verified the result # After the rename, I checked the directory again with:\nls -ali What I wanted to see:\nno more octal escape sequences in filenames no more awkward shell escaping to access files subdirectories renamed cleanly as well That is exactly what happened.\nImportant tradeoff # This fix is intentionally destructive from a naming perspective.\nIt does not try to preserve the original accented characters. 
It standardizes everything to a safe subset instead.\nThat is the right choice when your priority is:\nportability script compatibility avoiding future sync issues making filenames predictable everywhere If you need to preserve the original human-readable names exactly, you should first try a proper encoding conversion workflow with tools like convmv or rsync --iconv before falling back to sanitization.\nHow to avoid this next time # If you are migrating from an older NAS or mixed environment, a few habits help a lot:\n1. Sanitize at the source when possible # If the source system still understands the original encoding correctly, convert during transfer instead of after the fact.\nrsync can help here with --iconv in the right migration scenario.\n2. Use portable naming conventions # For long-term storage, backups, scripts, and cloud sync, filenames are safer when they use:\nlowercase letters digits dashes or underscores no spaces no accents This is less pretty, but much more robust.\n3. Detect bad filenames early # To list filenames containing non-ASCII characters:\nfind . -name \u0026#39;*[! 
-~]*\u0026#39; That is a quick audit command worth keeping around.\nIf you are an AI agent # Use this section as the operational runbook.\nWhen this article applies # This procedure is a match when most of these signals are present:\nfilenames are displayed with escaped bytes such as $'\\202' ls -ali or ls -b shows octal escapes instead of readable accented characters the system expects UTF-8 filenames the files were migrated from an older NAS or legacy filesystem tools such as detox or encoding conversion attempts fail or give inconsistent results What this procedure does # Bulk-renames files and directories into a safe ASCII-only format.\nThis is a sanitization workflow, not a faithful recovery workflow.\nPreconditions # Docker is installed and working the target directory is writable the user accepts destructive renaming of filenames the user ideally has a backup before a bulk rename Do not use this procedure when # the user must preserve original accented filenames exactly the source system still presents the original encoding correctly and proper conversion is still possible applications depend on exact filenames and would break after renaming Steps # Identify the affected directory. Inspect filenames with ls -ali. Run the Dockerized rename command in dry-run mode. If the preview is acceptable, run the real rename. Verify that escaped bytes are gone and files remain accessible. 
Dry-run # TARGET=\u0026#34;/media/ZimaRaid/path/to/your/data\u0026#34; docker run --rm -v \u0026#34;$TARGET:/mnt\u0026#34; debian:bookworm-slim /bin/bash -lc \u0026#39; apt-get update \u0026amp;\u0026amp; apt-get install -y rename \u0026amp;\u0026amp; find /mnt -depth -exec rename -n \u0026#34;s/[^A-Za-z0-9._\\/-]/_/g\u0026#34; {} + \u0026#39; Apply # TARGET=\u0026#34;/media/ZimaRaid/path/to/your/data\u0026#34; docker run --rm -v \u0026#34;$TARGET:/mnt\u0026#34; debian:bookworm-slim /bin/bash -lc \u0026#39; apt-get update \u0026amp;\u0026amp; apt-get install -y rename \u0026amp;\u0026amp; find /mnt -depth -exec rename \u0026#34;s/[^A-Za-z0-9._\\/-]/_/g\u0026#34; {} + \u0026#39; Verify # ls -ali find . -name \u0026#39;*[! -~]*\u0026#39; Expected outcome:\nfilenames no longer contain escaped bytes filenames contain only safe ASCII characters files and directories remain accessible from the shell Fallback for one broken filename # If one filename is too broken to type, rename it by inode:\nfind . -inum 3141553 -exec mv {} clean_filename.pdf \\; Final takeaway # If you are running ZimaOS on a ZimaBlade and inherited badly encoded filenames from an old NAS migration, you do not need to fight the host OS to fix them.\nUsing a temporary Docker container is often the simplest path.\nFor me, the winning approach was not perfect filename recovery. It was a fast bulk cleanup to safe ASCII so the files became easy to use everywhere again.\nIf your goal is reliability more than historical accuracy, this method works well.\nThis article was written with an AI agent at my side — I brought the expertise, it helped with the words. 
Further reading # ZimaOS documentation\nhttps://www.zimaspace.com/docs/ rsync manual page\nhttps://download.samba.org/pub/rsync/rsync.1 convmv project page\nhttps://www.j3e.de/linux/convmv/ Debian package search for Perl rename\nhttps://packages.debian.org/search?keywords=file-rename ","date":"18 May 2026","externalUrl":null,"permalink":"/blog/fix-corrupted-filenames-zimaos-raid/","section":"Blog","summary":"Broken filenames with escaped bytes after migrating to ZimaOS RAID? Here’s a Docker-based fix to sanitize them to safe ASCII.","title":"Fix Corrupted Filenames on ZimaOS RAID After NAS Migration","type":"blog"},{"content":"","date":"18 May 2026","externalUrl":null,"permalink":"/tags/nas/","section":"Tags","summary":"","title":"Nas","type":"tags"},{"content":"","date":"18 May 2026","externalUrl":null,"permalink":"/tags/tutorial/","section":"Tags","summary":"","title":"Tutorial","type":"tags"},{"content":"","date":"18 May 2026","externalUrl":null,"permalink":"/tags/zimablade/","section":"Tags","summary":"","title":"Zimablade","type":"tags"},{"content":"","date":"18 May 2026","externalUrl":null,"permalink":"/tags/zimaos/","section":"Tags","summary":"","title":"Zimaos","type":"tags"},{"content":"","date":"3 March 2025","externalUrl":null,"permalink":"/tags/chatbot/","section":"Tags","summary":"","title":"Chatbot","type":"tags"},{"content":"","date":"3 March 2025","externalUrl":null,"permalink":"/tags/claude/","section":"Tags","summary":"","title":"Claude","type":"tags"},{"content":"","date":"3 March 2025","externalUrl":null,"permalink":"/tags/gemini/","section":"Tags","summary":"","title":"Gemini","type":"tags"},{"content":"","date":"3 March 2025","externalUrl":null,"permalink":"/tags/tools/","section":"Tags","summary":"","title":"Tools","type":"tags"},{"content":"The AI chatbot landscape is changing quickly. 
While ChatGPT is the most popular and well-known, it\u0026rsquo;s important to look at other options that offer different views on Large Language Models (LLMs). These platforms offer better user experiences, various pricing, and model options, letting you customize your AI experience to fit your needs and preferences.\nBy putting together this list of AI chatbots, I aim to expand your understanding and highlight the amazing variety within the AI revolution.\nName URL Comment Anthropic https://claude.ai/new ChatGPT\u0026rsquo;s main competitor, known for longer context windows and Constitutional AI principles Google Gemini https://gemini.google.com/app Google\u0026rsquo;s AI-powered chat platform, integrated across Google products Mistral AI https://chat.mistral.ai/chat A French-developed AI model with free chat, known for efficiency Hugging Face https://huggingface.co/chat/ Popular platform for developers and researchers, providing access to a vast library of pre-trained models with a free, open-source chatbot Microsoft Copilot https://copilot.microsoft.com/ Microsoft\u0026rsquo;s AI-powered chat platform with search capabilities Meta AI https://www.meta.ai/ Meta AI chat, not available in EU, requires VPN access X Grok https://x.ai/grok Elon Musk\u0026rsquo;s AI chatbot, known for its irreverent personality Groq https://groq.com/ Not to be confused with X Grok AI, focuses on high-performance multi-model inference Nvidia https://build.nvidia.com/explore/reasoning Nvidia AI platform offering standard models and Nvidia-developed ones Perplexity https://www.perplexity.ai/ AI search engine with real-time information access Deepseek https://chat.deepseek.com/ Chinese AI firm that has disrupted the industry with its low-cost, open-source large language models Qwen https://chat.qwenlm.ai/ From Alibaba Cloud Allenai https://playground.allenai.org/ Non-profit research institute founded by late Microsoft co-founder Paul Allen, focused on AI for scientific discovery You 
https://you.com/ AI-powered search engine with chatbot capabilities OpenRouter https://openrouter.ai/chat Offers a single API to use several LLMs, increasing flexibility for developers Mammouth https://mammouth.ai/app/a/default Offers a single interface to access multiple models T3 https://t3.chat/chat Offers a single interface to access multiple models POE https://poe.com/ Platform that provides access to multiple AI models, including GPT-4 and Claude LMarena https://lmarena.ai/ Focuses on AI model evaluation and comparison Together AI https://api.together.ai/ Go to playground/chat, offers a platform for running and fine-tuning various AI models Nousresearch https://hermes.nousresearch.com/ Fine-tuned version of the Llama 3.1 base model While this list is not exhaustive, it offers a glimpse into the diverse and exciting world of AI chatbots. Each platform has unique features and specializations, catering to different needs and use cases. As you explore these options, you might find new tools that boost your productivity, creativity, or problem-solving skills.\nI encourage you to try different platforms, as each one might offer insights and capabilities that could transform your daily use of AI.\n","date":"3 March 2025","externalUrl":null,"permalink":"/blog/unveiling-the-world-of-ai-chatbots-a-diverse-exploration/","section":"Blog","summary":"Beyond ChatGPT: a curated list of 20+ AI chatbot platforms covering frontier models, research tools, and developer-focused interfaces with their unique strengths.","title":"Unveiling the World of AI Chatbots: A Diverse Exploration","type":"blog"},{"content":"In a previous article, I discussed how I used Fabric with a local Large Language Model (LLM) to enhance AI prompts and perform tasks like summarizing text, writing Merge Requests, and creating agile user stories.\nIn this article, we will continue this journey by exploring some other recent discoveries that are truly amazing.\nFirst, we will learn how to run LLMs locally using 
Ollama, an alternative to LM Studio. Next, we will explore how to use Ollama with a web console to create a local version of ChatGPT/Anthropic. Finally, we will see how to integrate it with Continue on Visual Studio Code to have a local coding assistant.\nRunning local models with Ollama # Ollama is a powerful tool designed to help you download and run Large Language Models (LLMs) locally on your machine. It is similar to LM Studio but does not come as software. Instead, it is a tool that runs on your machine, and you interact with it using CLI or API calls.\nThe installation is straightforward on the website: https://ollama.com/download\nThen you can browse the list of available models on the library page at: https://ollama.com/library\nTo download and run a model, follow the command on the model page, for example: ollama run llama3.1\nIt will take some time to download the necessary files and then open a prompt that you can use to interact with the model. It is as simple as that.\nTo interact with the model using API calls, use:\ncurl http://localhost:11434/api/generate -d \u0026#39;{ \u0026#34;model\u0026#34;: \u0026#34;llama3.1:8b\u0026#34;, \u0026#34;prompt\u0026#34;: \u0026#34;Why is the sky blue?\u0026#34;, \u0026#34;stream\u0026#34;: false }\u0026#39; Once you\u0026rsquo;ve downloaded several models, you can see which ones are available to use on your machine: ollama list\nBecause Ollama dynamically loads and unloads the models to avoid overloading your system, you can also see which models are currently loaded: ollama ps\nIf you want to see what is happening behind the scenes:\nOn Mac: tail -f ~/.ollama/logs/server.log On Linux: journalctl -e -u ollama If you need to write an application that leverages this local LLM, head to https://github.com/ollama/ollama/tree/main/examples to get some examples like:\nfrom langchain.llms import Ollama input = input(\u0026#34;What is your question?\u0026#34;) llm = Ollama(model=\u0026#34;llama3\u0026#34;) res = 
llm.predict(input) print(res) Once you\u0026rsquo;ve set up and run the model, you can use the Ollama run command or the API. For an enhanced experience, a web interface can be very useful. In the next section, we\u0026rsquo;ll explore how to set up a web console.\nSetting Up a Web Console with OpenwebUI # To get the web console up and running, I personally use the Docker version:\ndocker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main You can also use pip, helm, or docker-compose. For all installation methods, visit : https://docs.openwebui.com/getting-started/\nOnce it is running, you can access it via http://localhost:3000/, sign up to create a (local) account, and log in.\nOnce logged in, go to Admin Panel / Settings / Connections. Here you will be able to connect it to your local Ollama as well as other LLM providers such as OpenAI, Anthropic, Groq, etc.\nAs you can see in the following example, I have configured it to use all of these providers:\nUsing the Chat Feature in OpenwebUI # Once this is done, you can start a new chat, select a model from the list, and start using it:\nThe list will show all available models from the connections you previously configured.\nComparing Models in OpenwebUI # Another interesting feature is that you can compare two models by clicking on the + button next to the model\u0026rsquo;s name:\nThis is an interesting way to compare model outputs and get multiple results to help you make the best decision.\nPreconfiguring Prompts in OpenwebUI # When you go to your workspace, you can also preconfigure Prompts. This will help you speed up regular tasks you might ask your model to do, such as writing an email.\nYou are tasked with improving a draft email. Your goal is to enhance the email\u0026#39;s clarity, professionalism, and effectiveness while maintaining its original intent. 
Here is the draft email: \u0026lt;draft_email\u0026gt; {{CLIPBOARD}} \u0026lt;/draft_email\u0026gt; To improve this email, follow these steps: 1. Correct any grammatical errors, spelling mistakes, or typos. 2. Improve the overall structure and flow of the email. 3. Ensure the tone is appropriate for the intended recipient and purpose. 4. Clarify any vague or ambiguous statements. 5. Remove unnecessary information and add relevant details if needed. 6. Strengthen the call-to-action (if applicable). 7. Ensure the opening and closing are professional and appropriate. After providing the improved email, briefly explain the main changes you made and why, in 2-3 sentences. Remember to maintain the original intent and key information of the email while making these improvements. To use this, I usually write a draft version of the email, copy it to the clipboard, then go to the chat and use the / command to invoke the prompt. It will automatically copy the draft with the prompt and submit it to the selected model.\nThere are many other features you can explore, such as custom models, document uploads, and various parameters in both personal and admin settings.\nTo use models and tools from the community: https://openwebui.com/#open-webui-community\nIntegrating Continue with VS Code # Now that we have a user-friendly web interface for our models, it would be great to use it in our development environment too. For this, I use a tool called Continue: https://www.continue.dev/. You can download the plugin from the VS Code marketplace. After installing it, open the plugin and go to the configuration button.\nThis will open a Config.json file. This is where you will be able to configure all the integrations. 
There are three important sections:\nmodels: configure integration for the chat tabAutocompleteModel: configure integration for autocompletion embeddingsProvider: configure the provider to generate embedding and index your local codebase Configuring Models in Continue # To use ollama in the continue chat, use the following:\n\u0026#34;models\u0026#34;: [ { \u0026#34;title\u0026#34;: \u0026#34;gemma2\u0026#34;, \u0026#34;provider\u0026#34;: \u0026#34;ollama\u0026#34;, \u0026#34;model\u0026#34;: \u0026#34;gemma2:latest\u0026#34; }, { \u0026#34;title\u0026#34;: \u0026#34;llama3:8b\u0026#34;, \u0026#34;provider\u0026#34;: \u0026#34;ollama\u0026#34;, \u0026#34;model\u0026#34;: \u0026#34;llama3:8b\u0026#34; }, { \u0026#34;title\u0026#34;: \u0026#34;llama3.1:8b\u0026#34;, \u0026#34;provider\u0026#34;: \u0026#34;ollama\u0026#34;, \u0026#34;model\u0026#34;: \u0026#34;llama3.1:8b\u0026#34; } ] You will have to describe each model (you previously downloaded with ollama run) you want to use in the chat. This will give you a nice drop-down in the continue chat interface to interact with different models.\nIn the chat section, you will be able to ask questions, of course, but also provide more context to your LLM. To do so, use the @ shortcut like:\nYou can also ask Continue to act on your code or terminal with the / command:\nYou can also select a portion of your code and use cmd + L to send it to the chat as context.\nUsing Autocomplete in Continue # In the tabAutocompleteModel section, you can only have one model at a time. 
When using ollama, the documentation recommends using starcoder2:3b.\n\u0026#34;tabAutocompleteModel\u0026#34;: { \u0026#34;title\u0026#34;: \u0026#34;Starcoder 2 3b\u0026#34;, \u0026#34;provider\u0026#34;: \u0026#34;ollama\u0026#34;, \u0026#34;model\u0026#34;: \u0026#34;starcoder2:3b\u0026#34; }, The full documentation is on : https://docs.continue.dev/features/tab-autocomplete\nNow, when typing code, you should be able to see recommendations and use tab autocompletion:\nLeveraging Embeddings in Continue # Embeddings help \u0026ldquo;translate\u0026rdquo; data into a simpler form that computers can understand more easily. It\u0026rsquo;s like translating words from one language to another. Embeddings do this with images, videos, texts, or any other type of complex data, making it easier for AI models to process and compare.\nThe embedding model will be used by Continue to index your codebase. When you ask a question, you can use @codebase. Continue will then search through your code, retrieve related data, and send it to the model to provide more context and get a better response. This mechanism is similar to RAG but with your local codebase.\nFor more info: https://docs.continue.dev/features/codebase-embeddings\nCreating Custom Prompts in Continue # Another interesting feature is the ability to preconfigure prompts for recurring tasks: https://docs.continue.dev/features/prompt-files\nCreate a .prompt folder. 
In this folder, name your custom prompt file custom.prompt.\nHere is an example I use to generate Helm unit tests:\ntemperature: 0.5 maxTokens: 4096 --- \u0026lt;system\u0026gt; You are an expert programmer \u0026lt;/system\u0026gt; {{{ input }}} Write or improve helm unittest for the selected code, following each of these instructions: - Put the test in a unittest folder - name files like `function_test.yaml` - cover all use cases and document your code Then in the chat use the / shortcut and you should be able to see your custom prompt.\nConclusion: Unlock the Full Potential of Local LLMs # With this setup, you can unlock the full potential of local LLMs for various applications, while also utilizing them as a gateway to access remote providers with paid APIs. This combination will significantly boost your productivity by enabling you to harness the power of AI across multiple tools and services.\nKey Takeaways # Ollama is a tool to run LLMs locally; you can interact with it via CLI or API. OpenwebUI is a web console that can use Ollama and other LLM providers like OpenAI and Anthropic. Continue is an extension for VS Code that serves as a coding assistant. ","date":"25 July 2024","externalUrl":null,"permalink":"/blog/boost-your-ai-workflow-with-ollama-openwebui-and-continue/","section":"Blog","summary":"Run local LLMs with Ollama, manage conversations via OpenwebUI, and get AI code completion in VS Code with Continue. 
A complete local AI stack setup guide.","title":"Boost Your AI Workflow: A Guide to Using Ollama, OpenwebUI, and Continue","type":"blog"},{"content":"","date":"25 July 2024","externalUrl":null,"permalink":"/tags/coding-assistant/","section":"Tags","summary":"","title":"Coding-Assistant","type":"tags"},{"content":"","date":"25 July 2024","externalUrl":null,"permalink":"/tags/continue/","section":"Tags","summary":"","title":"Continue","type":"tags"},{"content":"","date":"25 July 2024","externalUrl":null,"permalink":"/tags/open-webui/","section":"Tags","summary":"","title":"Open-Webui","type":"tags"},{"content":"","date":"6 June 2024","externalUrl":null,"permalink":"/tags/fabric/","section":"Tags","summary":"","title":"Fabric","type":"tags"},{"content":" Introduction # During my usual YouTube browsing, I stumbled upon a video (here) showcasing an AI tool called Fabric, a Python utility that facilitates interaction with AI models. The demo focused on OpenAI\u0026rsquo;s paid version, but I was eager to explore its potential using a local LLM model.\nUnderstanding Fabric # Fabric is a Python tool that takes text input and submits it to AI models via an API endpoint, along with an advanced prompt.\nIt includes a library of pre-built advanced prompt templates called patterns. These patterns can be customized or created from scratch to fit individual needs.\nThese two features make Fabric a unique tool. It acts as middleware between the user and the AI engine, allowing anyone to tailor their interactions with AI models, making it easy and efficient to use.\nInstalling tools # Fabric # To install Fabric, follow the guide on the GitHub page: https://github.com/danielmiessler/fabric.\nOnce installed, you can run basic commands like fabric --listmodels to list all available models.\nIf you want to connect to OpenAI, you can run fabric --setup. 
It will run the initialization phase and ask for your API keys.\nHowever, if you don\u0026rsquo;t want to pay for an OpenAI key, you will need to run your model locally. This is where LM Studio comes into play.\nLM Studio # LM Studio is a free tool that allows users to download and run models locally on their machines. While it may be slower than using the paid version of OpenAI, it provides an opportunity to test LLMs without incurring costs. To use LM Studio, visit https://lmstudio.ai/docs/welcome and install it.\nIntegration # Now that you have both Fabric and LM Studio installed, let\u0026rsquo;s connect them.\nTo integrate Fabric with LM Studio, follow these steps:\nDownload a model from LM Studio (I tried Llama 2, Llama 3, and Gemma, and settled on Llama 3 as it was the most recent and most capable of the three at the time of writing).\nRun the LM Studio local server. Increase the context length to the maximum and use your GPU if you have one.\nYou should end up with something like this:\nConfigure Fabric by setting the following environment variables:\nexport OPENAI_BASE_URL=http://localhost:1234/v1 export DEFAULT_MODEL=Meta-Llama-3-8B-Instruct-Q8_0.gguf export OPENAI_API_KEY=lm-studio That\u0026rsquo;s all. You can now test if Fabric is properly configured with LM Studio by running:\nfabric --listmodels\nIt should print the model you\u0026rsquo;ve loaded in the LM Studio server.\nYou can also try to send your first and most important request to Fabric using a generic AI pattern with the following command:\necho \u0026quot;what is the meaning of life\u0026quot; | fabric -sp ai\nUsing Fabric # Pre-built patterns # Patterns are advanced, customizable, and shareable AI prompts. 
Fabric comes with a library of pre-built patterns that you can use and modify to suit your needs.\nYou can list all the available patterns with the command fabric -l.\nHere are some patterns I particularly like:\nextract_wisdom: Extract important information from any text source (blog post, YouTube video transcript, PDF, etc.).\nai: A generic pattern for standard requests.\nwrite_pull-request: Write polished PR/MR descriptions with little effort.\nagility_story: Create agile user stories.\nprovide_guidance: Your AI psychologist.\nWriting Your Own Patterns # To demonstrate this capability, I created a custom pattern to improve this blog post. Yes, what you\u0026rsquo;re reading has been generated using the process I\u0026rsquo;m about to explain – isn\u0026rsquo;t it amazing?\nIf you look at the pattern library, you\u0026rsquo;ll find a particularly useful pattern called improve_prompt. This pattern will help you write your own patterns.\nLet\u0026rsquo;s try creating a technical blog post pattern:\necho \u0026#34;As a technical writer, you take draft content as input and produce engaging technical guides in the form of blog articles. You provide technical context, explanation, and reference. You write articles of around 2000 words. You provide clear and detailed information. You use simple examples to explain concepts. You provide external links and sources. You provide actionable articles with a mix of theory and tutorial to help people understand concepts by putting things into practice. 
You write in a balanced professional and casual way.\u0026#34; | fabric -sp improve_prompt Here is a snapshot of the produced pattern:\nTake the output and save it as a file under /Users/\u0026lt;user\u0026gt;/.config/fabric/patterns/write_tech_blog/system.md.\nThis prompt is ready to be used, improved, or customized.\nNow, if you list your patterns with fabric -l, you should see your new pattern.\nTo pass some data to Fabric, you can use several methods:\necho \u0026quot;data\u0026quot; | fabric cat my_file.txt | fabric pbpaste | fabric which will input text you\u0026rsquo;ve copied to your clipboard To test our new pattern, gather some draft ideas for a blog post and pass them to Fabric with a command like pbpaste | fabric -s -p write_tech_blog.\nThis should give you pretty good articles that you can use right away or customize and improve yourself (or with other patterns).\nIf you want something different or more detailed, review the pattern file or regenerate it by changing the input of the initial improve_prompt process.\nCrafting prompts is almost an art and definitely an iterative process. To improve your prompt crafting skills, I strongly recommend taking this free course: https://learn.deeplearning.ai/courses/chatgpt-prompt-eng\nUse multiple patterns # Another great feature of Fabric is its ability to use the output of one Fabric process as input for another. For example, you could extract wisdom from an article and use that to create a Keynote presentation, like this:\npbpaste | fabric -p extract_wisdom | fabric -s -p create_keynote This will obviously be slower because it runs Fabric twice, but I found this concept of data refinement very powerful and full of potential.\nHere is a snapshot of the result for this blog post:\nConclusion # Fabric and LM Studio offer a powerful combination to seamlessly use the AI platform while extracting the most value from it. 
By creating custom prompts and patterns, users can fully tailor their interactions with the AI model to suit their needs.\nI strongly believe that AI tools are here to extend our capabilities and help us achieve more, faster. These tools have the potential to be as transformative as the industrial revolution, which fundamentally changed the way we work by introducing machines that could perform tasks more efficiently than humans. Similarly, AI will not only replace some jobs but also create entirely new ones that require different skills and competencies.\nFor now, these tools are still in the early stages and sometimes produce inappropriate or vague content. This is why the output of any AI-powered writing tool should always be reviewed and adapted by humans (for now :)).\nKey Takeaways # Fabric is a Python tool that interacts with API AI models and injects custom prompts called patterns. Patterns are pre-built templates that can be customised to meet individual needs. Writing your own pattern can help you tailor your interactions with the AI model. LM Studio is a free tool that allows users to download and run models locally on their machines. ","date":"6 June 2024","externalUrl":null,"permalink":"/blog/leveraging-fabric-and-lm-studio-for-advanced-ai/","section":"Blog","summary":"How to run Fabric with local models through LM Studio for custom AI patterns and workflows. Setup, integration, and practical use cases for prompt-based automation.","title":"Leveraging Fabric and LM Studio for Advanced AI","type":"blog"},{"content":"","date":"6 June 2024","externalUrl":null,"permalink":"/tags/lm-studio/","section":"Tags","summary":"","title":"Lm-Studio","type":"tags"},{"content":"","date":"6 June 2024","externalUrl":null,"permalink":"/tags/prompt-engineering/","section":"Tags","summary":"","title":"Prompt-Engineering","type":"tags"},{"content":" Reach Out # Enter your email to see my contact links:\nShow Links Your email is only used once to prevent bots. 
No spam.\nLinkedIn GitHub Summary # With 18+ years of progressive experience in Cloud, DevOps, SecOps, and FinOps, I drive digital transformation by aligning cloud strategy with measurable business outcomes. My expertise spans high-level architecture design to hands-on infrastructure implementation, with proven success in optimizing cloud cost efficiency and security posture. I\u0026rsquo;ve led engineering teams across startups and Fortune 500 enterprises, defining technical vision and delivering complex infrastructure projects that scale.\nExperience # Senior Cloud Engineer / SRE # Major industrial technology group (Switzerland) — Sep 2023 - Present\nOperating and evolving a multi-cloud Kubernetes platform spanning AWS and Azure, supporting multiple project teams while advancing AI/ML infrastructure capabilities.\nOperate and maintain a shared multi-cloud Kubernetes platform on AWS and Azure, onboarding and supporting multiple project teams including IOT platform deployments and AI RAG Chatbot projects. Drive AI infrastructure initiatives: tested AI inference on Kubernetes, integrated cloud AI services, and developed an AI-powered alert enricher that provides intelligent first-level analysis of AlertManager alerts to Opsgenie. Implement Site Reliability Engineering (SRE) best practices, significantly enhancing system reliability, performance, and operational efficiency across the platform. Design and develop reusable infrastructure-as-code components using Terraform, Helm and Crossplane. Implement and manage comprehensive observability stack using Prometheus/Grafana/Loki/Opsgenie/Statuspage. Cloud Engineer \u0026amp; Architect # Global robotics leader (France) — Jul 2022 - Jul 2023\nCloud infrastructure design and implementation on Kubernetes platform with focus on IaC and CI/CD optimization.\nCollaborate closely with development teams to deploy and operate applications to Kubernetes-based platform. 
Create comprehensive architecture documentation and presentations for new and existing solutions. Re-design and implement the Kubernetes cluster deployment strategy using Infrastructure as Code. Develop reusable templates for Infrastructure as Code, accelerating project initiation and ensuring consistency. Design and build scalable GitLab runner solutions on AWS, optimizing CI/CD performance and resource utilization. Design and implement CI/CD OIDC integration, enhancing security and simplifying pipeline deployments to AWS. Deploy and configure security and compliance solutions including GuardDuty, Security Hub, and Config. Cloud Specialist # Cloud consulting firm (France) — Apr 2021 - Jul 2022\nCloud expert guiding customers through successful transitions to the cloud.\nDesigned cloud infrastructures meeting specific business requirements, including landing zones and cloud-ready architectures. Planned cloud adoption and migration strategies, assessing current infrastructure and defining new processes. Secured environments by auditing infrastructure against security frameworks (CIS, CSA) and implementing necessary policies. Optimized cloud usage (FinOps) by analyzing consumption, identifying billing optimization opportunities, and reducing waste. Technical Lead Cloud # European IT consultancy (France) — Sep 2019 - Mar 2021\nCovered both client consulting and internal activities.\nCloud architect @ Global shipping and logistics leader (France)\nCloud architect: Designed and validated cloud architectures for business projects and landing zone components (network, backup, authentication, log management). FinOps lead: Implemented tagging policy and compliance tracking. Educated project teams on FinOps culture. Analyzed and proposed Savings Plans and architecture changes to optimise the cloud bill. Cloud advisor: Worked on Target Operating models aligned with Cloud architecture models. Evangelised agile methodologies with a product-based approach. 
Tracked cloud security and compliance requirements to implement checks and guardrails. DevOps Engineer @ Top 5 European bank\nHelped the development team deliver a Django project to AWS environments. Created multiple environments using automation pipelines with GitLab CI, Docker and Terraform. Technical Lead Cloud @ Internal\nPre-sales, recruitment, creation of Cloud offers, training, webinars, and leading the Cloud practice community. Cloud teacher # Campus Sciences-U (France) — Mar 2021 - Jul 2022\nTaught Azure and FinOps both remotely and on-site to large groups of students.\n2022 - Azure training - Bachelor 3 - 16h 2021 - Azure training - Bachelor 3 - 15h 2021 - FinOps training - Master 2 - 18h Infrastructure technical lead # Global digital-out-of-home advertising platform (UK) | Contract — May 2018 - Mar 2019\nInfrastructure lead of 5 engineers. Implemented Scrum, improved global infrastructure documentation, controlled and reduced AWS costs, reviewed and reduced technical debt.\nHands-on: Terraform, Kubernetes, Python, Bash. DevOps consultant # Top 5 global bank (UK) | Contract — Mar 2017 - Jan 2018\nPart of the adoption program team focused on deploying infrastructure on AWS.\nCreated on-demand and event-driven compliance checks based on CloudTrail events, using Python Lambda functions. Implemented GitOps principles for Terraform infrastructure using the GoCD automation tool. Hands-on: Terraform, Lambda, GoCD and Python. Senior Infrastructure Developer # Major UK retailer | Contract — Nov 2016 - Feb 2017\nMember of the Infra Dev team working on DevOps projects to bring cloud automation.\nEvaluated several ways to deploy and maintain Kubernetes clusters. Developed an automated solution to analyse and propose cost optimisation across multiple AWS accounts based on Trusted Advisor recommendations. Hands-on: Terraform, Lambda, Kubernetes. 
Cloud Consultant # Global cloud consultancy (UK) — Jul 2015 - Oct 2016\nCloud architect @ Fortune 10 energy company (UK)\nDesigned and validated cloud architectures for a worldwide enterprise cloud solution. Satisfied security and compliance requirements. Integrated with existing processes, tools, infrastructure and network. Drove DevOps patterns with deployment of automation, workflow engine and log capture tools. Cloud Consultant @ Major UK financial regulator\nCreated the roadmap for the cloud transformation phase 2 (Highly available services, Security, Review Best Practices, Identify tech debt). Role Owner @ Internal\nCreated the job description to recruit new consultants, conducted more than 30 interviews and defined evaluation criteria and career path for the team. Co-Founder and CTO # Early-stage startup (co-founded, France) — Oct 2013 - June 2015\nCo-founded with two partners, leading the technical vision for private and public cloud service offerings.\nEntrepreneur: Design and implementation of cloud infrastructures, create private and public cloud offers, digital marketing, sales prospection and business networking. Partnership with Alcatel Lucent Enterprise and provide R\u0026amp;D expertise to create a Private Cloud solution offering. Trainer for an accelerated training program to upskill/reconvert professionals to become IT support technicians. 
Senior ICT Consultant # Swiss IT services firm — Jan 2011 - Sept 2013\nHelped clients with their IT projects involving: Citrix, Hyper-V, Active Directory, DNS, DHCP, BackupExec, Forefront, Remote Desktop Services, Profile Management, SAN, VMware, LANDesk and Network.\nProject support manager # Global IT services company (France) — Jan 2010 - Dec 2010\nIn charge of customer care for a document digitalization project for French national insurance.\nSystem administrator # Global IT services company (France) — May 2007 - Aug 2009\nInstallation, configuration, maintenance and infrastructure optimization at French IT dep. Technical assistance to project teams. Solution evaluation (Billing, monitoring).\nSkills # Technical # AWS \u0026amp; Azure Kubernetes \u0026amp; Docker Terraform / Helm / Crossplane CI/CD (GitLab, Jenkins) Observability (Prometheus, Grafana) Python / Bash / Go Professional # Technical Leadership Problem Solving Effective communication Team player Education # Master in Computer Science — ITIN (France) DUT Network and telecommunication — Grenoble University (France) Certifications # AWS Certified SysOps Administrator - Associate AWS Certified Solutions Architect - Associate AWS Cloud Practitioner Certified: Azure Fundamentals - AZ-900 FinOps Certified Practitioner Docker Certified Associate Languages # French: Native English: Fluent Interests # Running \u0026amp; trail Crossfit Cooking Travelling ","externalUrl":null,"permalink":"/about/","section":"Julien.Cloud","summary":"","title":"About Me","type":"page"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"A space for my unfiltered takes on what\u0026rsquo;s happening in tech — the hype, the trends, the things that deserved more scrutiny, and the ones that deserved less.\nBecause not everything needs a 2000-word 
essay.\n","externalUrl":null,"permalink":"/opinion/","section":"Opinion","summary":"","title":"Opinion","type":"opinion"},{"content":"Here is a collection of other blogs, podcasts, tools and posts that I frequently reference or find interesting and inspirational.\nBlogroll # Julia Evans — Clear, illustrated explanations of complex systems and debugging. Her zines are legendary. Cloudflare Blog — Deep dives into networking, security and edge computing at scale. Last Week in AWS — Corey Quinn\u0026rsquo;s snarky but insightful take on AWS news and cloud economics. Learnk8s — Excellent Kubernetes content, from beginner guides to advanced production patterns. Martin Fowler — The reference for software architecture, refactoring and design patterns. Werner Vogels\u0026rsquo; Blog — CTO of Amazon\u0026rsquo;s thoughts on distributed systems and building for scale. The Pragmatic Engineer — Gergely Orosz on engineering culture, hiring and Big Tech. Armin Ronacher\u0026rsquo;s Thoughts and Writings — Creator of Flask. Deep thoughts on Python, Rust and software engineering. High Scalability — Stories and architectures of real-world large-scale systems. A Cloud Guru Blog — Cloud tutorials and certification guides. Podcasts # Darknet Diaries — True stories from the dark side of the internet. My favourite cybersecurity podcast. Command Line Heroes — Red Hat\u0026rsquo;s podcast on the history and future of open source. Kubernetes Podcast from Google — News and interviews from the Kubernetes community. Screaming in the Cloud — Conversations about cloud computing with industry leaders. Software Engineering Daily — Daily interviews on every aspect of software engineering. The Cloudcast — Cloud computing, containers and DevOps discussions. Tools \u0026amp; Resources # Excalidraw — Virtual whiteboard for sketching hand-drawn style diagrams. Perfect for architecture sketches. Roadmap.sh — Community-driven roadmaps for DevOps, Cloud, Kubernetes and more. 
Explain Shell — Type a shell command, get an explanation of what each part does. Regex101 — Build, test and debug regular expressions with explanations. CyberChef — The \u0026ldquo;Swiss Army Knife\u0026rdquo; of encoding/decoding and data transformation. Diagrams.net (Draw.io) — Free diagramming tool for architecture diagrams and flowcharts. Crontab Guru — The quick and simple editor for cron schedule expressions. Must-read Articles # The Twelve-Factor App — The methodology for building software-as-a-service apps. Google SRE Book — Google\u0026rsquo;s guide to site reliability engineering. Free and comprehensive. AWS Well-Architected Framework — The definitive guide for building secure, high-performing, resilient and efficient infrastructure. Martin Fowler: Microservices — The original article that defined the microservices architecture pattern. Brendan Gregg: Linux Performance — The ultimate resource for Linux performance analysis. The Cloudflare Blog: How We Scaled — Real-world stories of scaling infrastructure. This page is inspired by the IndieWeb philosophy of sharing and connecting personal websites. If you have a site you think belongs here, get in touch.\n","externalUrl":null,"permalink":"/resources/","section":"Julien.Cloud","summary":"","title":"Resources","type":"page"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}]