Set up a local LLM with Ollama on macOS and Linux
Prerequisites
- macOS 13+ or Ubuntu 22.04+ with 16 GB RAM (32 GB recommended for 13B models)
- Admin access to install packages
- Optional: NVIDIA GPU with 8 GB+ VRAM for faster inference
What you'll build
A local LLM environment where you can:
- Run models on your machine with Ollama
- Chat in the terminal or Ollama desktop app
- Call http://localhost:11434 from Node or Python using an OpenAI-compatible client
- Compare latency and quality against hosted APIs before committing spend
Local inference is ideal for drafts, PII-sensitive workflows, and offline development — not a full replacement for frontier hosted models on hard reasoning tasks.
Why Ollama in 2026
Ollama packages model weights, GPU detection, and a simple pull/run UX. Alternatives like vLLM or llama.cpp offer more control but more ops work. For indie developers validating RAG or agent prototypes, Ollama is the fastest on-ramp. See Mistral vs Llama when choosing open-weight families.
Step 1 — Install Ollama
macOS:
brew install ollama
# or download from https://ollama.com/download
ollama serve # starts the server in the foreground (the desktop app starts it automatically)
Linux:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable ollama
sudo systemctl start ollama
Verify: curl http://localhost:11434/api/tags should return JSON with a models array (empty until you pull a model).
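The same check can be scripted, for example as a preflight step before a test run. This is a minimal sketch assuming Node 18+ (global fetch) and that /api/tags returns an object with a models array whose entries carry a name field. The helper names modelNames and listLocalModels are illustrative and not part of Ollama.

```javascript
// Pull model names out of an /api/tags response body.
// Assumes the shape { models: [{ name: "llama3.2:latest", ... }] }.
function modelNames(tagsResponse) {
  const models = Array.isArray(tagsResponse?.models) ? tagsResponse.models : [];
  return models.map((m) => m.name);
}

// Ask the local daemon which models are installed.
// Rejects if the daemon is unreachable or answers with a non-2xx status.
async function listLocalModels(host = "http://localhost:11434") {
  const res = await fetch(`${host}/api/tags`);
  if (!res.ok) throw new Error(`Ollama answered with HTTP ${res.status}`);
  return modelNames(await res.json());
}
```

Calling listLocalModels().then(console.log) on a fresh install should print an empty array; a rejected promise usually means the daemon is not running.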
Step 2 — Pull a model
Start small, then scale:
ollama pull llama3.2 # 3B — fast on laptops
ollama pull mistral # 7B — balanced
ollama pull llama3.1:8b # stronger reasoning, more RAM
Interactive chat:
ollama run llama3.2
>>> Write a one-paragraph product description for a todo app.
Watch Activity Monitor (macOS) or nvidia-smi (Linux with an NVIDIA GPU) for memory pressure. If the model starts swapping to disk, pick a smaller model or quantization, or reduce the context length.
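To predict whether a model is likely to fit before pulling it, a common rule of thumb is that the weights alone take roughly parameters × bits per weight ÷ 8 bytes. The sketch below applies that arithmetic. The function name and the example bit widths are illustrative assumptions, and the estimate leaves out the KV cache (which grows with context length) and runtime overhead, so real memory use will be higher.

```javascript
// Rule-of-thumb size of the model weights in GB (1 GB = 1e9 bytes).
// Not an Ollama API; excludes KV cache and runtime overhead.
function estimateWeightsGB(paramsBillions, bitsPerWeight) {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9;
}

console.log(estimateWeightsGB(3, 4));  // 3B model at 4-bit  → 1.5
console.log(estimateWeightsGB(8, 4));  // 8B model at 4-bit  → 4
console.log(estimateWeightsGB(8, 16)); // 8B model at 16-bit → 16
```

If the estimate is already close to your free RAM or VRAM, expect swapping once the context fills up.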
Step 3 — OpenAI-compatible API
Ollama exposes /v1/chat/completions. Test with curl:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello in one sentence."}]
}'
Node.js example:
import OpenAI from 'openai'
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama', // required string, ignored locally
})
const res = await client.chat.completions.create({
model: 'llama3.2',
messages: [{ role: 'user', content: 'Summarize RAG in two sentences.' }],
})
console.log(res.choices[0].message.content)
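Point existing apps at this baseURL for development. In production, switch to a hosted provider such as OpenAI or Anthropic behind the same client abstraction; when the provider offers an OpenAI-compatible endpoint, only the base URL, API key, and model name should need to change.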
Step 4 — Environment variables for your app
export LLM_BASE_URL=http://localhost:11434/v1
export LLM_MODEL=llama3.2
export LLM_API_KEY=ollama
In Next.js or Express, read these in a single createLlmClient() helper so CI uses mocks while your laptop uses Ollama.
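One possible shape for that helper, as a minimal sketch rather than a prescribed implementation: it reads the three variables above, falls back to the local Ollama defaults, and returns a canned reply when a mock flag is set. LLM_MOCK is a hypothetical variable introduced here for illustration. The sketch calls the endpoint with Node 18+ global fetch instead of the openai package, purely to stay dependency-free.

```javascript
// Build a tiny chat client from environment variables.
// LLM_MOCK is a hypothetical flag for this sketch, not an Ollama or OpenAI setting.
function createLlmClient(env = process.env) {
  const config = {
    baseUrl: env.LLM_BASE_URL || "http://localhost:11434/v1",
    model: env.LLM_MODEL || "llama3.2",
    apiKey: env.LLM_API_KEY || "ollama",
    mock: env.LLM_MOCK === "1",
  };

  async function chat(messages) {
    // In CI, skip the network entirely and return a predictable reply.
    if (config.mock) return "[mock reply]";

    const res = await fetch(`${config.baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${config.apiKey}`,
      },
      body: JSON.stringify({ model: config.model, messages }),
    });
    if (!res.ok) throw new Error(`LLM request failed with HTTP ${res.status}`);
    const data = await res.json();
    return data.choices[0].message.content;
  }

  return { config, chat };
}
```

With this shape, CI would set LLM_MOCK=1 while a laptop exports the Step 4 variables, and application code only ever calls createLlmClient().chat(messages).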
Step 5 — Optional: Modelfile for custom system prompts
Create Modelfile:
FROM llama3.2
SYSTEM You are a concise technical writer. Answer in bullet points unless asked otherwise.
PARAMETER temperature 0.3
Build and run:
ollama create writer -f Modelfile
ollama run writer
Useful for consistent tone in changelog or support-draft agents.
Benchmark against cloud APIs
Run the same ten prompts through Ollama and Claude. Track:
- Time to first token
- Factual errors on your domain docs
- Refusal rate on edge cases
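Time to first token can be measured by streaming the response and timing the first non-empty content delta. The sketch below assumes Node 18+ and that the endpoint emits OpenAI-style server-sent events (data: {...} lines, ending with data: [DONE]) when stream is set to true. The function names are illustrative.

```javascript
// Collect text deltas from complete SSE lines of the form "data: {json}".
function contentDeltas(sseText) {
  const deltas = [];
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "" || payload === "[DONE]") continue;
    try {
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) deltas.push(delta);
    } catch {
      // Skip lines that are not valid JSON.
    }
  }
  return deltas;
}

// Milliseconds from sending the request to the first non-empty content delta.
// Returns null if the stream ends without producing any text.
async function timeToFirstToken({ baseUrl, model, prompt, apiKey = "ollama" }) {
  const started = performance.now();
  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);

  const decoder = new TextDecoder();
  let buffer = "";
  for await (const chunk of res.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const lastNewline = buffer.lastIndexOf("\n");
    if (lastNewline === -1) continue; // no complete line yet
    const completeLines = buffer.slice(0, lastNewline);
    buffer = buffer.slice(lastNewline + 1);
    if (contentDeltas(completeLines).length > 0) {
      return performance.now() - started; // leaving the loop cancels the stream
    }
  }
  return null;
}

// Median of a list of numbers, for summarizing repeated runs.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

Running timeToFirstToken for each of the ten prompts against http://localhost:11434/v1 and against an OpenAI-compatible hosted endpoint, then comparing the medians, gives a first latency comparison. Factual errors and refusals still need manual review.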
If local quality is 80% of cloud at 10% of cost for your workload, hybrid routing (local draft → cloud polish) often wins. Read How to Choose an LLM API in 2026 for production routing patterns.
Common errors
connection refused on 11434: the daemon is not running. Start ollama serve or open the desktop app.
Model download hangs: check free disk space, since models are multiple gigabytes, and rerun ollama pull on a stable network.
Out of memory: use a smaller model (llama3.2) or a more heavily quantized tag (for example a q4_0 variant, when the model offers one).
Next steps
- Build RAG with Pinecone and Claude — swap Claude for Ollama in dev
- Compare hosted vs local in the LLM API guide
- Explore Perplexity vs ChatGPT for research tasks that still need the web