A few months ago I added an NVIDIA Jetson Orin Nano Developer Kit to my homelab. The idea was simple: a dedicated, always-on inference server for local LLMs, completely separate from my main Proxmox cluster. With 8GB of unified memory and an integrated CUDA-capable GPU, it sounded like the perfect edge device for running small models.
Reality, as usual, was more nuanced. Not every model that fits on disk actually runs well. Some swap to eMMC and crawl. Others load fast but generate gibberish. After testing six models across two generations of Google’s Gemma architecture, surviving a GPU driver rabbit hole, and learning more about CMA memory than I ever wanted to know, I found a setup that delivers 25.5 tokens per second on GPU with zero swap.
This post covers both rounds of testing, the hardware limits, the GPU configuration that took hours to figure out, and the service architecture that keeps everything running.
The Hardware Reality Check
The Jetson Orin Nano Developer Kit has 8GB of shared memory. That means RAM and GPU VRAM come from the same pool. There is no dedicated graphics memory to fall back on. Storage is a 56GB eMMC module, fast enough for OS duties but brutally slow when used as swap.
Key constraints from day one:
- 8GB unified memory limits model size and context length
- ARM64 architecture restricts which models have native support
- eMMC swap is a performance cliff — once the system starts swapping, inference drops from 17 tokens per second to under 2
This means model selection is not about downloading the latest 8B parameter release and hoping for the best. It is about finding the sweet spot between size, speed, and quality on very specific hardware.
Round One: Gemma 3
I tested six models with the same prompt — “Hello, how are you?” — and measured tokens per second, memory footprint, and overall responsiveness.
| Model | Size | Tokens/sec | Verdict |
|---|---|---|---|
| llama3.1:8b | 4.9 GB | ~8-10 | Too large – heavy swap makes it unusable |
| llama3.2:3b | 2.0 GB | ~15-18 | Fast but mediocre reasoning quality |
| llama3.2:3b-12k | 2.0 GB | ~15-18 | Same speed with extended context, same quality limits |
| qwen3.5:2b | 2.7 GB | ~18-22 | Fastest of the bunch – but weak at reasoning tasks |
| gemma3:4b | 3.3 GB | ~16-20 | Runner-up – solid speed/quality balance, fits in RAM |
| gemma3:4b-8k | 3.3 GB | ~17.5 | Winner (GPU not yet functional) – best reasoning, zero swap |
Gemma 3 4B with a custom 8K context window became the production model. 17.5 tok/s on CPU, zero swap, stable for months. But it ran on CPU only — the GPU was idle because the original Ollama binary had dropped JetPack 5 support.
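For anyone repeating the measurement, here is a minimal sketch of one way to get a tokens-per-second figure, not necessarily how the numbers above were taken. It relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` endpoint returns for a non-streaming request. The function names `tok_per_sec` and `bench` are made up for illustration, and the script assumes `curl` and `jq` are installed and Ollama is listening on its default port.

```shell
#!/bin/sh
# Sketch: measure generation speed through Ollama's HTTP API.
# Assumptions: curl and jq are installed, Ollama listens on localhost:11434.

# tok_per_sec EVAL_COUNT EVAL_DURATION_NS -> tokens per second, one decimal.
# Exits nonzero when the duration is missing or zero (e.g. an error response).
tok_per_sec() {
    awk -v n="$1" -v ns="$2" \
        'BEGIN { if (ns + 0 <= 0) exit 1; printf "%.1f\n", n / (ns / 1e9) }'
}

# bench MODEL [PROMPT]: run one non-streaming generation and print tok/s.
bench() {
    model=$1
    prompt=${2:-"Hello, how are you?"}
    body=$(jq -n --arg m "$model" --arg p "$prompt" \
        '{model: $m, prompt: $p, stream: false}')
    resp=$(curl -s http://localhost:11434/api/generate -d "$body") || return 1
    tok_per_sec "$(printf '%s' "$resp" | jq -r '.eval_count')" \
                "$(printf '%s' "$resp" | jq -r '.eval_duration')"
}
```

Calling `bench gemma3:4b-8k` a few times and ignoring the first (cold) run gives a steadier number than a single shot.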
Round Two: Gemma 4 Arrives
In early 2026, Google released Gemma 4 with two edge variants: E2B (2.3B effective) and E4B (4.5B effective). The on-paper specs were compelling: 128K native context, thinking mode, function calling, system prompt support. The default Ollama quantization was 7.2 GB — too large — but the QAT tag changed everything.
QAT (Quantization-Aware Training) quantizes during training rather than after. The result: gemma4:e2b-it-qat at 4.3 GB instead of 7.2 GB. Same architecture, 40% smaller. This is not a niche optimization; it is the difference between fitting and failing on 8 GB hardware.
| Feature | Gemma 3 4B | Gemma 4 E2B |
|---|---|---|
| Disk size | 3.3 GB | 4.3 GB (QAT) |
| Default context | 8K | 128K |
| Thinking mode | No | Yes |
| Function calling | No | Yes |
| System prompt | No | Yes |
| MMLU Pro score | ~50% | 60% |
MMLU Pro scores are approximate, sourced from published benchmarks and community results on comparable hardware. Your mileage will vary with quantization and prompt style.
Pulling the model was the easy part. Making it run on GPU was not.
The GPU Driver Rabbit Hole
The Jetson runs Ubuntu 22.04 with a CUDA 12.6 driver. Ollama 0.30.6 expects a cuda_jetpack6 directory — which was missing despite the system having the right driver version. The problem: the CUDA toolkit directory layout was from an older JetPack 5 installation (cuda_jetpack5), and the Ollama binary only checks for cuda_jetpack6.
Three failed approaches before finding the right one:
Attempt 1: JetPack 5 CUDA libs (CUDA 11.x). Symlinked cuda_jetpack6 to the existing cuda_jetpack5. GPU was detected but model loading failed with cudaMalloc failed: out of memory. These libs route all GPU allocations through CMA (Contiguous Memory Allocator), which defaults to 256 MB on Jetson. An LLM needs gigabytes.
Attempt 2: Generic CUDA 12 libs. Symlinked to cuda_v12 from the main ARM64 tarball. GPU skipped entirely — the libggml-cuda.so was compiled for desktop GPU architectures (SM 5.0 through 9.0) but not Orin’s CC 8.7.
Attempt 3: The CMA trap. Tried cma=4096M in the kernel boot parameters to expand the memory pool. This broke GPU detection entirely — the CUDA driver could not initialize when CMA consumed half the system RAM. Even cma=1024M had the same effect. The lesson: never touch CMA on Jetson.
The working solution: Extract the JetPack 6 CUDA tarball from the Ollama release.
curl -L https://github.com/ollama/ollama/releases/download/v0.30.6/ollama-linux-arm64-jetpack6.tar.zst -o ollama-jp6.tar.zst
sudo tar --zstd -xf ollama-jp6.tar.zst -C /usr/local
sudo systemctl restart ollama
This provides libggml-cuda.so compiled with Orin CC 8.7 support and CUDA 12.6 runtime libs that match the Jetson driver. GPU discovery confirmed:
inference compute: library=CUDA compute=8.7 name=CUDA0 description=Orin
No CMA tweaks. No symlinks. Just the right libs in the right place.
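To check the same thing after a restart without scrolling through logs, a small filter can pull the compute capability out of the discovery line. This is a sketch that assumes the log line keeps the `inference compute ... library=CUDA ... compute=8.7` shape shown above. The exact wording can differ between Ollama versions, and `gpu_compute` is a hypothetical helper name.

```shell
#!/bin/sh
# gpu_compute: read Ollama log lines on stdin and print the compute capability
# from the first "inference compute" line that reports a CUDA library.
# Prints nothing if no such line is found (GPU not discovered).
gpu_compute() {
    grep -i 'inference compute' | grep -i 'library=cuda' |
        sed -n 's/.*compute=\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

On a systemd install, `journalctl -u ollama -b --no-pager | gpu_compute` should print `8.7` on an Orin once the right libs are in place, and nothing at all if the GPU was skipped.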
Benchmark: Gemma 4 on GPU
With GPU working, the numbers were decisive:
| Model | Mode | Tok/s | Cold Load | RAM | GPU |
|---|---|---|---|---|---|
| gemma3:4b-8k | CPU | 17.5 | 0.79s | 4.5 GB | No |
| gemma4:e2b-it-qat | CPU | 12.4 | 70s | 5.0 GB | No |
| gemma4:e2b-4k | GPU | 25.7 | 67s | 3.4 GB | 100% |
| gemma4:e2b-8k | GPU | 25.5 | ~30-70s | 3.6 GB | 100% |
Gemma 4 on GPU is 46% faster than Gemma 3 on CPU (25.5 vs 17.5 tok/s) while using about 1 GB less RAM (3.6 GB vs 4.5 GB). The 8K context window costs essentially nothing in speed versus 4K (25.5 vs 25.7 tok/s); at these sizes the KV cache is negligible next to the model weights. The 128K native context support is there if needed, though I settled on 8K as the practical sweet spot.
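The percentages quoted in this post are plain relative changes, and they are easy to re-derive from the table values. The helper below is only an illustration, and `pct_change` is a made-up name:

```shell
#!/bin/sh
# pct_change OLD NEW -> signed percentage change from OLD to NEW, one decimal.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

pct_change 17.5 25.5   # throughput, gemma3 on CPU -> gemma4 on GPU: 45.7
pct_change 4.5 3.6     # RAM in GB for the same pair: -20.0
pct_change 7.2 4.3     # disk size, default quantization -> QAT: -40.3
```

Rounded, those are the 46% speedup and the 40% size reduction mentioned in the text.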
Cold load is the only downside: 30-70 seconds versus Gemma 3’s sub-second CPU load. But a keepalive service that pings the model every 4 minutes makes this a non-issue. The model stays in GPU memory permanently.
Creating the Custom 8K Context Model
The default gemma4:e2b-it-qat is a raw weights download with Ollama’s default context cap. To set an explicit 8K context window (matching what fits comfortably in the 8GB unified memory) and give it a friendly name, use a Modelfile:
# Modelfile for gemma4:e2b-8k
FROM gemma4:e2b-it-qat
PARAMETER num_ctx 8192
Then create the named model:
ollama create gemma4:e2b-8k -f Modelfile
The gemma4:e2b-4k variant in the benchmarks was the same base model capped at 4096 context for comparison. The 8K cap shows zero speed penalty: the KV cache overhead is negligible next to the 4.3 GB model weights. You could go higher (128K native is supported) but at some point memory pressure from the KV cache starts eating into the safety margin.
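Since the 4K and 8K variants differ only in `num_ctx`, they can be generated from one template instead of hand-edited files. This is a sketch of that idea, and the helper names `write_modelfile` and `build_variant` are hypothetical. It uses only the `FROM` and `PARAMETER num_ctx` directives and the `ollama create NAME -f FILE` command already shown above.

```shell
#!/bin/sh
# write_modelfile BASE NUM_CTX -> prints a two-line Modelfile to stdout.
write_modelfile() {
    printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# build_variant BASE TAG NUM_CTX: create TAG from BASE with a context cap,
# using a temporary Modelfile that is removed afterwards.
build_variant() {
    tmp=$(mktemp) || return 1
    write_modelfile "$1" "$3" > "$tmp"
    ollama create "$2" -f "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

With that in place, `build_variant gemma4:e2b-it-qat gemma4:e2b-4k 4096` and `build_variant gemma4:e2b-it-qat gemma4:e2b-8k 8192` produce the two variants compared in the benchmark table.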
Quality: Is Gemma 4 Actually Smarter?
Benchmarks on paper are one thing. Real prompts are another. I tested both models on three tasks:
Logic reasoning: “If a shirt takes 4 hours to dry, how long for 3 shirts?”
Gemma 4 answered correctly (4 hours, simultaneous drying) with structured step-by-step reasoning: “Since all three shirts dry independently at the same rate, you only need to wait the time required for one shirt to finish.” Gemma 3 sometimes fell for the multiplication trap.
Code generation: “Write a Sieve of Eratosthenes in Python.”
Gemma 4 produced clean, commented code with proper edge cases (n < 2 returns empty), complexity analysis (O(N log log N)), and usage examples. Gemma 3 was adequate but less thorough.
Long-form generation: “Write a technical essay about transformer architecture.”
Gemma 4 generated 2,700+ coherent tokens with technical depth on attention mechanisms, positional encoding, and multi-head attention. Sustained 25.5 tok/s throughout with no degradation.
Despite having fewer effective parameters (2.3B vs Gemma 3’s ~4B), QAT and the architectural improvements in Gemma 4 produce noticeably better output. The thinking mode, where the model outputs a chain-of-thought before the final answer, adds further quality for complex reasoning tasks.
The CMA Lesson
The Jetson’s Contiguous Memory Allocator defaults to 256 MB. On the old CUDA 11 libs, this was a hard bottleneck — every GPU memory allocation went through CMA, which is orders of magnitude too small. On the JetPack 6 CUDA 12 libs, GPU memory allocations bypass CMA and use system memory directly.
But CMA still matters for compute buffer allocation. When a model loads, a small compute buffer (100-200 MB) goes through CMA. If CMA is fragmented from a previous model load/unload cycle, the new load fails with cudaMalloc failed even though 6+ GB of system RAM is free. CMA fragmentation is permanent — it survives Ollama restarts and only a full reboot clears it.
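A quick way to keep an eye on the pool is to read the CMA counters from `/proc/meminfo`. This is a sketch that assumes the kernel exposes `CmaTotal` and `CmaFree` there, which kernels built with CMA support normally do. `cma_free_mb` is a made-up helper name. Note that `CmaFree` only reports how much of the pool is unallocated; it says nothing about whether that space is contiguous, so it is a rough signal rather than proof of fragmentation.

```shell
#!/bin/sh
# cma_free_mb [MEMINFO_FILE] -> free CMA in whole MB.
# Exits nonzero if the file has no CmaFree line.
cma_free_mb() {
    awk '$1 == "CmaFree:" { printf "%d\n", $2 / 1024; found = 1 }
         END { exit !found }' "${1:-/proc/meminfo}"
}
```

Comparing the value right after boot with the value after a failed load gives at least a hint of whether the pool has been eaten into.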
The fix: never unload the model. The keepalive service is not just for cold-start latency — it prevents CMA fragmentation. If the model stays loaded, CMA is consumed once at initial load and never touched again.
The boot sequence must be careful:
- Ollama starts, GPU discovery runs (CMA still clean)
- Preload fires, loads gemma4 (uses CMA once, model stays warm forever)
- Keepalive takes over (pings every 4 minutes, never unloads)
- The preload service must point at gemma4 — if gemma3 loads at boot, it consumes CMA and gemma4’s GPU load fails later
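The preload step in that sequence can be sketched as a small script. This is not the script used on this box: it polls the API for readiness instead of waiting a fixed delay, and the function names, the retry count, and the `OLLAMA_URL` variable are all assumptions. It relies on two documented Ollama endpoints: `GET /api/version`, and `POST /api/generate` with a model name but no prompt, which loads the model without generating anything.

```shell
#!/bin/sh
# Sketch of a boot-time preload. Names and the retry count are illustrative.
OLLAMA_URL=${OLLAMA_URL:-http://127.0.0.1:11434}

# load_payload MODEL -> JSON body for /api/generate with no prompt,
# which asks Ollama to load MODEL into memory.
load_payload() {
    printf '{"model": "%s"}\n' "$1"
}

# wait_for_ollama TRIES: poll the API once per second until it answers.
wait_for_ollama() {
    tries=$1
    while [ "$tries" -gt 0 ]; do
        curl -sf "$OLLAMA_URL/api/version" >/dev/null && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# preload MODEL: wait for the daemon, then load the model exactly once.
preload() {
    wait_for_ollama 60 || return 1
    curl -sf "$OLLAMA_URL/api/generate" -d "$(load_payload "$1")" >/dev/null
}
```

Running `preload gemma4:e2b-8k` from a oneshot unit ordered after `ollama.service` covers the second step of the boot sequence; the first load is the one that touches CMA, so it should be the only model this script ever names.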
Keeping the Model Warm
The service architecture uses three systemd units:
- ollama.service: Main Ollama daemon, always running on port 11434
- ollama-preload.service: Oneshot that loads gemma4 20 seconds after Ollama starts, warming the model at boot
- ollama-keepalive.service: User service that pings the model every 4 minutes to prevent eviction and CMA fragmentation
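The keepalive unit can wrap a loop as simple as the sketch below. It is an illustration under stated assumptions rather than the actual script: the function names are invented, the config path is the `/etc/ollama/model.conf` file described next, and it leans on Ollama's default behaviour of keeping a model loaded for five minutes after the last request, which is why a four-minute interval is enough.

```shell
#!/bin/sh
# Sketch of a keepalive loop; 240 seconds matches "every 4 minutes".
CONF=${CONF:-/etc/ollama/model.conf}

# read_model FILE -> value of the OLLAMA_MODEL= line in FILE.
read_model() {
    sed -n 's/^OLLAMA_MODEL=//p' "$1" | head -n 1
}

# ping_model MODEL: a request with no prompt, which resets the unload timer
# without generating any tokens.
ping_model() {
    curl -sf http://127.0.0.1:11434/api/generate \
        -d "{\"model\": \"$1\"}" >/dev/null
}

# keepalive_loop: re-read the config each round so a model switch is picked up.
keepalive_loop() {
    while :; do
        model=$(read_model "$CONF")
        [ -n "$model" ] && ping_model "$model"
        sleep 240
    done
}
```

The unit's `ExecStart` would point at a script that ends by calling `keepalive_loop`. A failed ping is ignored on purpose so the loop keeps running while Ollama restarts.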
The preload and keepalive scripts read the model name from /etc/ollama/model.conf, making it trivial to switch models by changing one line:
OLLAMA_MODEL=gemma4:e2b-8k
Monitoring What Matters
- jtop: Jetson-specific monitoring. Watch GPU utilization (100% during inference), RAM usage (3.6 GB under load), and temperature (under 70°C with the reference cooler).
- tegrastats: Low-level telemetry for power draw, per-core CPU usage, and memory.
- htop: General system view, mostly to confirm swap stays near zero.
If swap usage climbs during inference, something is wrong. The fix is never to add more swap — it is to use a smaller model or reduce context.
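The swap check is easy to automate so it does not depend on someone watching htop. This is a minimal sketch that reads the standard `SwapTotal` and `SwapFree` counters from `/proc/meminfo`. The helper names and the idea of a fixed MB limit are assumptions for illustration.

```shell
#!/bin/sh
# swap_used_mb [MEMINFO_FILE] -> swap currently in use, in whole MB.
swap_used_mb() {
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { printf "%d\n", (total - free) / 1024 }' "${1:-/proc/meminfo}"
}

# warn_on_swap LIMIT_MB [MEMINFO_FILE]: print a warning and return nonzero
# when swap usage exceeds LIMIT_MB.
warn_on_swap() {
    used=$(swap_used_mb "${2:-/proc/meminfo}")
    if [ "$used" -gt "$1" ]; then
        echo "swap in use: ${used} MB (limit $1 MB)" >&2
        return 1
    fi
}
```

Something like `warn_on_swap 50` from a cron job or timer is enough to flag the moment a model or context size has outgrown the 8GB pool.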
Final Setup
- Device: NVIDIA Jetson Orin Nano Developer Kit (8GB)
- OS: Ubuntu 22.04.5 LTS (JetPack 6)
- Ollama: 0.30.6 with JetPack 6 CUDA libs
- Active model: gemma4:e2b-8k (custom 8K context, QAT quantized)
- Inference speed: 25.5 tokens/sec (GPU, warm)
- Memory footprint: 3.6 GB total (weights + KV cache)
- GPU: 100% CUDA0 Orin
- Swap usage: Zero
Lessons Learned
QAT quantization matters more than you think. The -qat tag on Ollama is the difference between a model that fits (4.3 GB) and one that does not (7.2 GB). Always check for QAT variants before dismissing a model for edge hardware.
JetPack 6 CUDA libs are required for GPU on Orin. The standard ARM64 Ollama tarball lacks Orin support. The JetPack 6 tarball has it. This is not documented anywhere obvious.
Never touch CMA. The cma= kernel parameter breaks GPU detection entirely. Default CMA (256 MB) is sufficient when using the correct CUDA libs.
Keepalive prevents CMA fragmentation. On Jetson, it is the difference between a working GPU inference server and a brick that needs a reboot after every model unload. CMA fragmentation is permanent and unrecoverable without a full system reboot.
Gemma 4 on edge is worth the effort. 46% faster, noticeably smarter, about 1 GB lighter on RAM, with thinking mode and function calling. The hours of CUDA debugging pay for themselves in every inference.
The Jetson Orin Nano is not going to replace a GPU server. But as a dedicated local LLM endpoint — handling RAG queries, chat, code generation, and light automation — it punches well above its weight class. The key is respecting the hardware limits, choosing the right model, and getting the CUDA configuration right the first time.
