If you have ever used Claude Code and loved every bit of it — but quietly resented the idea of your private code flying off to someone else’s servers — you are not alone. A growing number of developers are figuring out how to run Claude Code with a local LLM, and the results are genuinely impressive. No cloud dependency. No API bills stacking up. No terms of service breathing down your neck. Just pure, private, agentic coding power running right on your own hardware.
This article breaks down exactly how it works, why it matters, and what you need to get started today.
What Is Claude Code — and Why Does It Need the Cloud at All?
Claude Code is Anthropic’s terminal-based agentic coding tool. It understands your entire codebase, edits files, calls tools, plans multi-step tasks, and works with minimal hand-holding. Think of it as having a senior developer sitting inside your terminal, available around the clock.
By default, Claude Code is hardwired to Anthropic’s cloud. Every prompt you send — every snippet of code, every function, every file path — travels to Anthropic’s servers, gets processed, and returns an answer. For most casual users, that’s perfectly fine. But for security researchers, developers working on proprietary software, or anyone handling sensitive client data, sending that code off-machine is somewhere between uncomfortable and outright prohibited.
That’s exactly the problem that running Claude Code with a local LLM solves.
Why Developers Are Moving to Claude Code with a Local LLM
There are three very real reasons this setup is gaining serious traction.
1. Privacy Is Non-Negotiable for Some Work
When you are reverse engineering firmware, tearing apart binaries, or working through the early stages of a security assessment, sending disassembled code to a cloud API is a questionable decision at best. For anyone working under an NDA or a security policy, it may be outright illegal. Running Claude Code with a local LLM means your data never leaves your network — period.
2. Cloud Costs Add Up Frighteningly Fast
Claude Code’s cloud API is priced per token. During long development sessions involving constant refactoring, code explanation, and iteration, those tokens burn through fast. Developers have reported draining API credits at an alarming rate just by experimenting with features. A local setup removes that cost entirely. You run as many loops as you want, refine your prompts freely, and the only bottleneck is your own hardware — not your wallet.
3. No Rate Limits, No Latency Tax
Cloud APIs come with rate limits. They also introduce network latency on every single request. When you run Claude Code with a local LLM, inference happens on your machine. You get faster responses, zero throttling, and the freedom to fire the same analysis dozens of times while dialing in your approach.
The Script That Makes It All Simple
Setting up Claude Code with a local LLM manually used to be tedious. You had to export multiple environment variables, remember the right flags, confirm your local inference server was running, and then launch Claude Code — every single time.
The smarter solution is a short bash script that handles all of it in one command. Here is what a well-built launcher script does under the hood:
- Sets the base URL to your local inference server
- Provides a dummy authentication token so Claude Code doesn’t throw an auth error
- Clears the real Anthropic API key so the tool doesn’t try to phone home
- Disables telemetry that would fail without a live cloud connection
- Launches Claude Code with your chosen local model name
When you run the script without arguments, it checks what models are currently loaded on your local server. If there’s only one model running — which is often the case — it skips the selection step and launches straight away. That single quality-of-life detail makes daily workflow noticeably smoother.
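As a sketch, the auto-selection logic might look like the following. This is an illustrative script, not an official one: it assumes your local server exposes an OpenAI-style /v1/models endpoint (as LM Studio does) returning JSON like {"data":[{"id":"..."}]}, and the helper names are made up for this example.

```shell
#!/usr/bin/env bash
# claude-local — illustrative launcher sketch, not an official script.
# Assumes a local server with an OpenAI-style /v1/models endpoint
# (e.g. LM Studio on port 1234) returning {"data":[{"id":"..."}]}.

BASE_URL="${CLAUDE_LOCAL_URL:-http://127.0.0.1:1234}"

# Pull every model id out of a /v1/models JSON payload (stdin, one per line).
list_models() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*:[[:space:]]*"\(.*\)"/\1/'
}

# Given the JSON payload as $1, print the single loaded model; fail when there
# are zero or several, so the caller can prompt the user instead.
pick_model() {
  local models count
  models=$(printf '%s' "$1" | list_models)
  count=$(printf '%s\n' "$models" | grep -c .)
  [ "$count" -eq 1 ] && printf '%s\n' "$models"
}

launch() {
  local model="${1:-}"
  if [ -z "$model" ]; then
    model=$(pick_model "$(curl -s "$BASE_URL/v1/models")") || {
      echo "Zero or several models loaded — pass one explicitly." >&2
      return 1
    }
  fi
  export ANTHROPIC_BASE_URL="$BASE_URL"  # redirect to the local server
  export ANTHROPIC_AUTH_TOKEN="dummy"    # dummy token so auth checks pass
  unset ANTHROPIC_API_KEY                # make sure no real key is used
  export DISABLE_TELEMETRY=1             # skip telemetry that needs the cloud
  exec claude --model "$model"
}

# In the actual script file, finish with:
#   launch "$@"
```

The single-model shortcut lives in pick_model: if exactly one id comes back from the server, launch never has to ask.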
This launcher works with any inference server that supports the Anthropic API format — LM Studio, Ollama, llama.cpp, and vLLM behind LiteLLM all qualify.
Which Local Model Works Best with Claude Code?
Running Claude Code with a local LLM is only as good as the model you point it at. For a long time, local models were too weak for real coding work — capable of basic autocomplete but unable to handle multi-file reasoning or complex refactoring.
That changed with the arrival of Qwen3 Coder Next. This model brought a 170,000-token context window to a locally runnable package, meaning you can feed Claude Code entire decompiled functions, surrounding binary context, and detailed instructions without truncation.
Other solid options include:
- Qwen3-Coder 30B — A strong balance of capability and hardware requirements, especially in quantized form
- DeepSeek V3 — Excellent reasoning depth for complex tasks
- GLM-4.7-Flash — Lightweight and low-latency when speed matters more than depth
Hardware You Need for Claude Code with a Local LLM
Your GPU is the backbone of this setup. Here is what you realistically need:
- GPU with at least 8GB VRAM — 12–24GB strongly preferred for larger models
- 16GB+ system RAM — 32GB is the sweet spot
- NVMe storage — model files range from 8GB to 40GB+ depending on size and quantization
CPU-only inference is possible but painfully slow for agentic workflows. If you are serious about making Claude Code with a local LLM part of your daily routine, a capable GPU is not optional — it’s essential.
How to Connect Claude Code to Your Local Inference Server
The core trick is redirecting Claude Code away from Anthropic’s API endpoint and toward your local server. Claude Code respects the ANTHROPIC_BASE_URL environment variable, so all you need is to set it to your local server’s address before launching.
For a persistent setup, add these lines to your ~/.bashrc or ~/.zshrc:
```shell
export ANTHROPIC_BASE_URL="http://127.0.0.1:YOUR_PORT"
export ANTHROPIC_API_KEY="dummy-key"
```

If Claude Code still prompts you to sign in on first run, add "hasCompletedOnboarding": true and "primaryApiKey": "sk-dummy-key" to your ~/.claude.json file.

For LM Studio users, the local server URL is typically http://127.0.0.1:1234. Start the server from the Developer tab, confirm the API is active, and you’re good to go.
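For reference, a minimal ~/.claude.json containing just the two keys mentioned above would look like this — merge them into whatever settings the file already holds rather than replacing it:

```json
{
  "hasCompletedOnboarding": true,
  "primaryApiKey": "sk-dummy-key"
}
```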
What You Can Actually Do with Claude Code Running Locally
This is the part that genuinely surprises people. The local LLM version of Claude Code is not a dumbed-down experience — it is a full agentic workflow, just running on your own hardware.
Reverse Engineering and Security Research
This is arguably the strongest use case for running Claude Code with a local LLM. Static analysis tasks — checking file types, reading ELF headers, extracting strings, identifying debug symbols — are repeatable steps that an agentic model can automate confidently. What makes it powerful is the flexibility: unlike a deterministic script, a capable local model can pivot based on what it finds. If it spots something unusual in a binary, it adjusts its approach. All of this happens without a single byte of sensitive data touching the internet.
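For instance, the file-type check at the start of that loop boils down to reading the ELF magic bytes — exactly the kind of small, repeatable step the agent chains together with strings, readelf, and nm. A minimal sketch using only standard shell tools (the is_elf helper is made up for this example):

```shell
#!/usr/bin/env bash
# Minimal sketch: detect an ELF binary by its 4-byte magic (0x7f 'E' 'L' 'F') —
# the kind of repeatable first-pass check an agentic loop automates.

is_elf() {
  # Read the first four bytes and strip the non-printable 0x7f prefix,
  # leaving the printable "ELF" for comparison.
  [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\177')" = "ELF" ]
}
```

A capable local model writes and runs dozens of such probes per session, then pivots based on what they return.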
Everyday Development Work
For day-to-day iterative coding — refactoring, boilerplate generation, explaining unfamiliar code, writing tests — Claude Code with a local LLM handles the load well. Because there are no API costs, you can iterate freely, run the same analysis a dozen times, and experiment without watching a usage meter tick upward.
Firmware Analysis
Extract a firmware image, identify the filesystem, pull out interesting binaries, check for hardcoded credentials. These are standard steps for security engineers, but they’re tedious done manually. With Claude Code and a local LLM handling the agentic loop, the entire initial phase of firmware analysis becomes significantly faster — and entirely private.
The Honest Limitations You Should Know
Running Claude Code with a local LLM is not a perfect replacement for cloud models on every task.
Reasoning depth — The largest cloud models still outperform local models on the most complex multi-step reasoning tasks. If you are working on genuinely difficult architectural problems, a local 30B or 70B model may fall short.
Context window constraints — While Qwen3 Coder Next’s 170K context window is impressive, very large codebases may still hit limits depending on your configuration.
Occasional overconfidence — Local models can suggest wrong answers with confident phrasing. You need to stay sharp and verify outputs rather than accepting them wholesale.
Hardware cost — The upfront investment in a capable GPU is real. For professional developers and security researchers, the ROI makes sense. For hobbyists, it requires honest calculation.
Why This Is More Than Just a Privacy Trick
Running Claude Code with a local LLM changes the relationship between a developer and their AI tooling in a meaningful way. Instead of working within the constraints of a metered cloud service — rationing prompts, optimizing token efficiency, avoiding anything that might trip a content filter — you work freely. The model is yours. The compute is yours. The data is yours.
For anyone doing security research or professional development on sensitive codebases, that freedom is not a luxury. It’s a requirement.
And as local models continue to improve — which they are doing rapidly — the gap between cloud and local capability will keep narrowing. The setup that feels like a solid-but-limited alternative today will look like an obvious first choice within a year.
Your Quick-Start Checklist
- Install a local inference server — Ollama, LM Studio, or llama.cpp
- Download a capable coding model — Qwen3-Coder 30B or Qwen3 Coder Next
- Set your context length to at least 16K, ideally 32K or higher
- Point ANTHROPIC_BASE_URL at your local server
- Write or download a launcher script to automate the environment setup
- Test with a real task — not a toy example, but an actual problem you’d normally send to the cloud
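Before that first real task, a quick sanity check that your shell actually points at your own machine can save confusion. A hypothetical helper (preflight is not part of Claude Code — it just inspects the environment variable the tool respects):

```shell
#!/usr/bin/env bash
# Hypothetical sanity check before launching Claude Code against a local server.

preflight() {
  if [ -z "${ANTHROPIC_BASE_URL:-}" ]; then
    echo "ANTHROPIC_BASE_URL is not set — Claude Code will talk to the cloud" >&2
    return 1
  fi
  case "$ANTHROPIC_BASE_URL" in
    http://127.0.0.1:*|http://localhost:*)
      echo "ok: pointing at $ANTHROPIC_BASE_URL" ;;
    *)
      echo "warning: $ANTHROPIC_BASE_URL is not a loopback address" >&2
      return 1 ;;
  esac
}
```

Run it in the same shell session you launch Claude Code from; a non-zero exit means prompts could still leave your machine.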
Once you have run Claude Code with a local LLM through a real workflow, going back to the cloud-only setup will feel like an unnecessary constraint. The capability is there. The privacy is there. The cost is zero. All it takes is the right hardware, the right model, and a one-time setup that takes less than an hour.
Running Claude Code with a local LLM is one of the most practical moves a privacy-conscious developer can make right now. The tools are mature, the models are capable, and the barrier to entry has never been lower.