Running Qwen 3.5 on Hermes Agent via LM Studio and MLX
A practical guide to running Hermes Agent with Qwen 3.5 (4B & 9B) locally on Apple Silicon — no cloud API keys required.
Why Run Local?
Hermes Agent is a powerful open-source AI coding assistant that typically connects to cloud providers like OpenRouter or OpenAI. But if you have an Apple Silicon Mac, you can run the entire stack locally — model inference included — for zero-latency, zero-cost, fully private AI coding.
The stack: Hermes Agent → LM Studio → MLX → Apple Silicon GPU
We tested this on a MacBook Pro M3 Max (36GB) running Qwen3.5-9B at 4-bit quantization. The results: ~51 tokens/sec generation, ~5GB memory footprint, and a fully functional coding agent.
Hardware Requirements
MLX inference is memory-bandwidth bound. Here's what to expect:
| Machine | Memory BW | Qwen3.5-9B 4bit Gen Speed | Qwen3.5-4B 4bit Gen Speed |
|---|---|---|---|
| Mac Mini M4 (16GB) | 120 GB/s | 19.6 tok/s | 36.7 tok/s |
| MacBook Pro M3 Max (36GB) | 400 GB/s | 51.1 tok/s | 87.3 tok/s |
The M3 Max is ~2.5x faster purely due to memory bandwidth. Any Apple Silicon Mac works, but more bandwidth = faster generation. 16GB minimum for 9B models; 8GB is enough for 4B models.
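The bandwidth numbers translate into a useful rule of thumb: each generated token has to stream the full set of weights from unified memory, so peak generation speed is roughly bandwidth divided by model size. A minimal sketch, where the 0.65 efficiency factor is a hypothetical allowance for overhead rather than a measured constant:

```python
# Rule of thumb: each generated token streams the full weight set from
# unified memory once, so peak speed ≈ bandwidth / model size. The 0.65
# efficiency factor is a hypothetical allowance for overhead, not a
# measured constant.

def estimated_tok_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                          efficiency: float = 0.65) -> float:
    """Crude upper-bound estimate of autoregressive generation speed."""
    return bandwidth_gb_s / model_size_gb * efficiency

# MacBook Pro M3 Max (400 GB/s) running Qwen3.5-9B 4-bit (~5 GB of weights):
print(round(estimated_tok_per_sec(400, 5.0), 1))  # ~52, near the measured 51.1 tok/s
```

The estimate lands close to the measured 51.1 tok/s, which is why upgrading CPU cores does little for generation speed while upgrading to a higher-bandwidth chip helps a lot.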
Step 1: Install LM Studio
Download LM Studio from lmstudio.ai. It provides a robust OpenAI-compatible API server with proper connection handling, context management, and a GUI for model management.
Why LM Studio over raw mlx_lm.server? We initially tried mlx_lm.server (the Python server bundled with the mlx-lm package). It works for simple requests, but it chokes on large payloads. Hermes sends ~3,500+ tokens of tool definitions with every request, and mlx_lm.server — which even warns it's "not recommended for production" — would disconnect mid-processing, causing BrokenPipeError and RemoteProtocolError crashes. LM Studio handles these payloads reliably.
Step 2: Load a Model
In LM Studio, search for and download an MLX-optimized model. We recommend:
- Qwen3.5-9B MLX 4bit — Best balance of quality and speed. ~5GB VRAM.
- Qwen3.5-4B MLX 4bit — Faster (87 tok/s on M3 Max), lighter (2.5GB), good for simpler tasks.
Load the model and start the local server. LM Studio defaults to http://127.0.0.1:1234/v1.
Important: Increase the context length. LM Studio defaults to 4096 tokens, but Hermes's 29 tool definitions alone consume ~6,200 tokens. Go to the server settings and increase context to at least 8192 — preferably 16384 or 32768 for longer conversations. Without this, you'll get a cryptic error:
```
Cannot truncate prompt with n_keep (6221) >= n_ctx (4096)
```
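The arithmetic behind that error is worth making explicit. A back-of-envelope budget, using the n_keep figure from the error plus assumed headroom values:

```python
# Back-of-envelope context budget for Hermes on LM Studio. The tool/system
# figure comes from the n_keep error above; the headroom and history
# figures are assumptions, not measurements.

tool_and_system = 6_221   # n_keep from the error: Hermes's 29 tools + system prompt
reply_headroom = 512      # space for the model's next answer
short_history = 1_000     # a couple of conversation turns

needed = tool_and_system + reply_headroom + short_history
print(needed)             # 7733

for n_ctx in (4096, 8192, 16384, 32768):
    print(n_ctx, "fits" if n_ctx >= needed else "too small")
# 4096 fails (the error above); 8192 barely fits; 16384+ leaves room to chat
```

Even a short conversation blows past the 4096 default, and 8192 only just fits — hence the recommendation to go to 16384 or higher.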
Step 3: Configure Hermes
This is where it gets tricky. Hermes has three places that affect provider configuration, and they interact in non-obvious ways:
1. ~/.hermes/.env (Highest Priority)
This is the most important file and the one most likely to bite you. Environment variables here override everything else:
```
OPENAI_BASE_URL=http://127.0.0.1:1234/v1
OPENAI_API_KEY=not-needed
```
LM Studio doesn't require an API key, but Hermes's OpenAI client refuses to connect without one. Set it to any non-empty string like not-needed.
This is the #1 gotcha. If you previously configured a different provider (OpenRouter, Codex, or a raw mlx_lm.server on a different port), old values in .env will silently override your config.yaml changes. We spent significant debugging time because .env had OPENAI_BASE_URL=http://127.0.0.1:8800/v1 from an earlier setup.
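A quick way to catch this gotcha is to parse `.env` yourself and compare it against the endpoint you expect. A minimal sketch — the parser below is a bare-bones `.env` reader (KEY=VALUE lines, `#` comments), not Hermes's own loader:

```python
# Sanity check for stale OPENAI_BASE_URL values in ~/.hermes/.env silently
# overriding config.yaml. Minimal .env parser, not Hermes's actual loader.

def parse_env(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def check_base_url(env: dict, expected: str = "http://127.0.0.1:1234/v1") -> bool:
    """True if .env either omits OPENAI_BASE_URL or agrees with `expected`."""
    return env.get("OPENAI_BASE_URL", expected) == expected

# A leftover mlx_lm.server port from an earlier setup fails the check:
stale = parse_env("OPENAI_BASE_URL=http://127.0.0.1:8800/v1\nOPENAI_API_KEY=not-needed")
print(check_base_url(stale))  # False
```

Running a check like this against the real file (`Path.home() / ".hermes/.env"`) before starting Hermes would have saved us the debugging session described above.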
2. ~/.hermes/config.yaml (Model & Provider)
```yaml
model:
  default: qwen/qwen3.5-9b
  provider: custom
  base_url: http://127.0.0.1:1234/v1
```
Setting `provider: custom` tells Hermes to use a custom OpenAI-compatible endpoint instead of OpenRouter. The model name must match what LM Studio reports (check `curl http://127.0.0.1:1234/v1/models`).
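One way to sanity-check the name is to compare your configured model against the ids LM Studio returns from `/v1/models`. A sketch against a captured response — the JSON below is illustrative, not a real server reply:

```python
import json

# The model name in config.yaml must exactly match an id that LM Studio's
# /v1/models endpoint reports. Illustrative response, not a real capture.

models_response = json.loads("""
{"object": "list", "data": [{"id": "qwen/qwen3.5-9b", "object": "model"}]}
""")

def model_is_served(configured: str, response: dict) -> bool:
    return any(m["id"] == configured for m in response["data"])

print(model_is_served("qwen/qwen3.5-9b", models_response))   # True
print(model_is_served("qwen3.5-9b", models_response))        # False: must match exactly
```

Note the second check: dropping the `qwen/` prefix is enough to break the match, and LM Studio will reject requests for a model id it doesn't serve.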
3. ~/.hermes/auth.json (OAuth State)
If you previously logged in to a provider like OpenAI Codex (hermes login), the auth store saves an active_provider field:
```json
{
  "active_provider": "openai-codex"
}
```
When active_provider is set and has valid credentials, Hermes's auto-detection will prefer it over your config.yaml settings, even if you explicitly set provider: custom. To fix this, either:
- Run `hermes logout` to clear the active provider
- Or manually set `"active_provider": null` in `auth.json`
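The manual edit can be scripted so that every other field in `auth.json` survives. A sketch, demonstrated against a temp file; pointing `path` at `~/.hermes/auth.json` would apply it for real:

```python
import json
import os
import tempfile

# Scripted version of the manual fix: null out active_provider while
# leaving every other field in auth.json intact. Demonstrated on a temp
# file; point `path` at ~/.hermes/auth.json to apply it for real.

def clear_active_provider(path: str) -> dict:
    with open(path) as f:
        auth = json.load(f)
    auth["active_provider"] = None      # serializes as JSON null
    with open(path, "w") as f:
        json.dump(auth, f, indent=2)
    return auth

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"active_provider": "openai-codex", "tokens": {}}, f)
    path = f.name

print(clear_active_provider(path)["active_provider"])  # None
os.remove(path)
```

Read-modify-write (rather than overwriting the whole file) matters because the auth store may also hold OAuth tokens you want to keep.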
Priority Order (TL;DR)
```
~/.hermes/.env        (OPENAI_BASE_URL, OPENAI_API_KEY)
    ↓ overrides
~/.hermes/auth.json   (active_provider with valid OAuth)
    ↓ overrides
~/.hermes/config.yaml (model.provider, model.base_url)
```
All three must agree, or you'll get mysterious Connection refused errors while curl works fine from the same machine.
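The precedence chain can be sketched as a simple resolution function — illustrative logic, not Hermes's actual source:

```python
# Sketch of the config precedence above: .env beats auth.json beats
# config.yaml. Illustrative logic, not Hermes's actual implementation.

def resolve_base_url(env: dict, auth: dict, config: dict) -> str:
    if env.get("OPENAI_BASE_URL"):       # 1. .env wins outright
        return env["OPENAI_BASE_URL"]
    if auth.get("active_provider"):      # 2. then OAuth auto-detection
        return f"provider:{auth['active_provider']}"
    return config.get("base_url", "")    # 3. config.yaml is the fallback

env = {"OPENAI_BASE_URL": "http://127.0.0.1:8800/v1"}   # stale entry
auth = {"active_provider": "openai-codex"}
config = {"base_url": "http://127.0.0.1:1234/v1", "provider": "custom"}

# The stale .env value shadows both auth.json and config.yaml:
print(resolve_base_url(env, auth, config))   # http://127.0.0.1:8800/v1
print(resolve_base_url({}, {}, config))      # http://127.0.0.1:1234/v1
```

This is why a perfectly correct `config.yaml` can still produce `Connection refused`: the two higher-priority sources are consulted first.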
Step 4: Verify the Connection
Before starting Hermes, verify LM Studio is responding:
```bash
curl -s http://127.0.0.1:1234/v1/models | python3 -m json.tool

curl -s http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen/qwen3.5-9b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}' \
  | python3 -m json.tool
```
Then start Hermes:
```bash
hermes
```
Troubleshooting
"Connection refused" but curl works
Almost certainly a config priority issue. Check all three files:
```bash
# 1. Environment overrides
grep -E "OPENAI_BASE_URL|OPENAI_API_KEY" ~/.hermes/.env

# 2. OAuth auto-detection state
python3 -c "import json; d=json.load(open('$HOME/.hermes/auth.json')); print('active_provider:', d.get('active_provider'))"

# 3. Model/provider config
grep -A3 "^model:" ~/.hermes/config.yaml
```
"Cannot truncate prompt with n_keep >= n_ctx"
Increase LM Studio's context length. Hermes's tool definitions alone need ~6,200 tokens.
Server disconnects on large prompts (BrokenPipeError)
If using raw mlx_lm.server instead of LM Studio, switch to LM Studio. The Python HTTP server can't handle Hermes's payload sizes reliably.
Performance Tips
- Memory bandwidth is king. M3 Max/Ultra and M4 Pro/Max give the best local inference speeds.
- 4-bit quantization is the sweet spot. Minimal quality loss, ~60% memory savings.
- Keep the MacBook plugged in. Sustained GPU load draws ~13W from the GPU alone.
- Monitor with asitop. Install via `uv tool install asitop`, run with `sudo asitop` to see real-time CPU/GPU/ANE usage and memory bandwidth.
- ANE won't be used. MLX runs entirely on the GPU via Metal. The Apple Neural Engine is designed for small classification models, not autoregressive LLM generation.
The Full Stack
When everything is configured correctly, you have a completely local coding agent:
```
You (terminal)
  → Hermes Agent (CLI, tools, context management)
    → LM Studio (OpenAI-compatible server, context handling)
      → MLX (Metal-optimized inference)
        → Apple Silicon GPU (400 GB/s unified memory)
```
No API keys. No cloud calls. No rate limits. No data leaving your machine. Just you and a 9-billion parameter model running at 51 tokens per second on your laptop's GPU.
Tested on MacBook Pro M3 Max (36GB) with Hermes Agent v0.3.0, LM Studio, and Qwen3.5-9B MLX 4bit. March 2026.