What you'll build

A local LLM environment where you can:

Run models on your machine with Ollama
Chat in the terminal or the Ollama desktop app
Call http://localhost:11434 from Node or Python using an OpenAI-compatible client
Compare latency and quality against hosted APIs before committing spend

Local inference is ideal for drafts, PII-sensitive workflows, and offline development. It is not a full replacement for frontier hosted models on hard reasoning tasks.

Why Ollama in 2026

Ollama packages model weights, GPU detection, and a simple pull/run workflow. Alternatives such as vLLM or llama.cpp offer more control but require more ops work. For indie developers validating RAG or agent prototypes, Ollama is the fastest on-ramp. See Mistral vs Llama when choosing between open-weight families.

Step 1 — Install Ollama

macOS:

brew install ollama
# or download from https://ollama.com/download
ollama serve   # starts the server in the foreground (the desktop app starts it for you)

Linux:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable ollama
sudo systemctl start ollama

Verify: curl http://localhost:11434/api/tags should return JSON with a models array, which stays empty until you pull a model.
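The same check can be done from Node. This is a minimal sketch, assuming Node 18+ (for the global fetch) and the default address; modelNames and listLocalModels are hypothetical helper names, not part of Ollama.

```javascript
// Assumption: Ollama is listening on its default address.
const OLLAMA_URL = 'http://localhost:11434';

// Pull model names out of an /api/tags response body.
// A missing or empty "models" array means nothing has been pulled yet.
function modelNames(tagsResponse) {
  const models = Array.isArray(tagsResponse?.models) ? tagsResponse.models : [];
  return models.map((m) => m.name);
}

// Hypothetical helper: resolves to the installed model names,
// or null when the server cannot be reached or answers with an error.
async function listLocalModels(baseUrl = OLLAMA_URL) {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);
    if (!res.ok) return null;
    return modelNames(await res.json());
  } catch {
    return null; // most often: connection refused because the daemon is down
  }
}
```

Calling listLocalModels().then(console.log) prints null when the daemon is down and an array (possibly empty) when it is up.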

Step 2 — Pull a model

Start small, then scale:

ollama pull llama3.2        # 3B — fast on laptops
ollama pull mistral         # 7B — balanced
ollama pull llama3.1:8b     # stronger reasoning, more RAM

Interactive chat:

ollama run llama3.2
>>> Write a one-paragraph product description for a todo app.

Watch Activity Monitor (macOS) or nvidia-smi (NVIDIA GPUs) for memory pressure. If the model swaps to disk, pick a smaller quant or reduce context.
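A quick way to guess whether a model will fit before pulling it: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below is only a lower bound, since the context (KV cache) and runtime overhead come on top, and real quantized files use slightly more than their nominal bit width. The function names are illustrative.

```javascript
// Lower-bound estimate of weight size in GB (decimal):
// parameters (in billions) * bits per weight / 8 bits per byte.
function estimateWeightGB(paramsBillions, bitsPerWeight) {
  return (paramsBillions * bitsPerWeight) / 8;
}

// True when even the weights alone exceed the free memory.
// A false result does not guarantee a fit; context and overhead are not counted.
function definitelyTooBig(paramsBillions, bitsPerWeight, freeMemoryGB) {
  return estimateWeightGB(paramsBillions, bitsPerWeight) > freeMemoryGB;
}

console.log(estimateWeightGB(3, 4));      // 1.5  (3B model, about 4-bit quant)
console.log(estimateWeightGB(8, 4));      // 4    (8B model, about 4-bit quant)
console.log(definitelyTooBig(7, 16, 8));  // true (7B at 16-bit needs at least 14 GB)
```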

Step 3 — OpenAI-compatible API

Ollama exposes /v1/chat/completions. Test with curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello in one sentence."}]
  }'

Node.js example:

import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required string, ignored locally
})

const res = await client.chat.completions.create({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Summarize RAG in two sentences.' }],
})
console.log(res.choices[0].message.content)

Point existing apps at this baseURL during development; in production, switch to Anthropic or OpenAI behind the same interface abstraction.

Step 4 — Environment variables for your app

export LLM_BASE_URL=http://localhost:11434/v1
export LLM_MODEL=llama3.2
export LLM_API_KEY=ollama

In Next.js or Express, read these in a single createLlmClient() helper so CI uses mocks while your laptop uses Ollama.
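One possible shape for that helper is sketched below. It assumes Node 18+ and talks to the /chat/completions route with plain fetch instead of the openai package, so that a test can inject a fake fetch. The defaults and the fetchImpl option are illustrative choices, not an established convention.

```javascript
// Read the three variables from Step 4, falling back to local Ollama defaults.
function resolveLlmConfig(env = process.env) {
  return {
    // Strip trailing slashes so URL joining below stays predictable.
    baseUrl: (env.LLM_BASE_URL || 'http://localhost:11434/v1').replace(/\/+$/, ''),
    model: env.LLM_MODEL || 'llama3.2',
    apiKey: env.LLM_API_KEY || 'ollama',
  };
}

// Hypothetical helper: fetchImpl is injectable so CI can pass a mock
// while a laptop uses the real global fetch against Ollama.
function createLlmClient({ env = process.env, fetchImpl = fetch } = {}) {
  const config = resolveLlmConfig(env);
  return {
    config,
    // messages: [{ role, content }, ...]; resolves to the reply text.
    async chat(messages) {
      const res = await fetchImpl(`${config.baseUrl}/chat/completions`, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${config.apiKey}`,
        },
        body: JSON.stringify({ model: config.model, messages }),
      });
      if (!res.ok) throw new Error(`LLM request failed: HTTP ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    },
  };
}
```

In a test, createLlmClient({ env: {}, fetchImpl: fakeFetch }) returns canned replies without any server running; in development, createLlmClient() picks up the exported variables.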

Step 5 — Optional: Modelfile for custom system prompts

Create Modelfile:

FROM llama3.2
SYSTEM You are a concise technical writer. Answer in bullet points unless asked otherwise.
PARAMETER temperature 0.3

Build and run:

ollama create writer -f Modelfile
ollama run writer

Useful for consistent tone in changelog or support-draft agents.
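Once created, the custom model can be addressed by name through the same API as any pulled model. The sketch below builds two request bodies for /v1/chat/completions: one that relies on the Modelfile, and one that sends the same system prompt and temperature per request, for cases where you would rather not maintain a Modelfile. The function names are hypothetical.

```javascript
const WRITER_SYSTEM =
  'You are a concise technical writer. Answer in bullet points unless asked otherwise.';

// Option A: the Modelfile already carries the system prompt and temperature,
// so the request only needs to name the custom model.
function writerRequest(userText) {
  return {
    model: 'writer',
    messages: [{ role: 'user', content: userText }],
  };
}

// Option B: no Modelfile; send the same settings with every request.
function inlineWriterRequest(userText, baseModel = 'llama3.2') {
  return {
    model: baseModel,
    temperature: 0.3,
    messages: [
      { role: 'system', content: WRITER_SYSTEM },
      { role: 'user', content: userText },
    ],
  };
}
```

Either object can be passed as the JSON body of the curl call from Step 3 or to client.chat.completions.create in the Node example.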

Benchmark against cloud APIs

Run the same ten prompts through Ollama and Claude. Track:

Time to first token
Factual errors on your domain docs
Refusal rate on edge cases
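A small harness keeps these three numbers comparable across providers. The sketch below times the first non-empty chunk of any async iterable of text and then aggregates hand-labeled results; the field names (firstTokenMs, factualErrors, refused) are assumptions made for this example, and the factual-error and refusal labels still come from you reading the answers.

```javascript
// Consume a stream of text chunks and record when the first non-empty one arrives.
// `now` is injectable so the function can be tested with a fake clock.
async function timeFirstToken(chunks, now = () => performance.now()) {
  const start = now();
  let firstTokenMs = null;
  let text = '';
  for await (const chunk of chunks) {
    if (firstTokenMs === null && chunk.length > 0) firstTokenMs = now() - start;
    text += chunk;
  }
  return { firstTokenMs, totalMs: now() - start, text };
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// runs: one entry per prompt, e.g. { firstTokenMs: 180, factualErrors: 1, refused: false }
function summarizeRuns(runs) {
  return {
    medianFirstTokenMs: median(runs.map((r) => r.firstTokenMs)),
    factualErrors: runs.reduce((sum, r) => sum + r.factualErrors, 0),
    refusalRate: runs.filter((r) => r.refused).length / runs.length,
  };
}
```

With the openai client from Step 3, passing stream: true returns an async iterable whose chunks carry text in choices[0].delta.content; mapping that to plain strings gives timeFirstToken its input for both the local and the hosted run.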

If, for example, local quality reaches 80% of cloud at 10% of the cost for your workload, hybrid routing (local draft → cloud polish) often wins. Read How to Choose an LLM API in 2026 for production routing patterns.
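The local draft → cloud polish idea can be sketched as a single function. Everything here is hypothetical: local and cloud stand for any async function that takes a prompt and returns text, needsPolish is whatever acceptance check suits your workload, and the polish instruction is a placeholder.

```javascript
// local, cloud: async (prompt) => string
// needsPolish: (draft) => boolean, e.g. a length check or a validator for your domain
async function draftThenPolish(prompt, { local, cloud, needsPolish }) {
  const draft = await local(prompt);

  // Cheap path: the local draft is good enough, so no cloud call is made.
  if (!needsPolish(draft)) return { text: draft, route: 'local' };

  // Expensive path: hand the draft to the hosted model for a rewrite.
  const polished = await cloud(
    `Improve the following draft. Keep its facts unchanged.\n\n${draft}`
  );
  return { text: polished, route: 'local+cloud' };
}
```

Logging the route field over a week of real traffic shows how often the cloud call is actually needed, which is the number the cost comparison depends on.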

Common errors

connection refused on 11434: The daemon is not running. Run ollama serve or open the desktop app.

Model download hangs: Check disk space, since models are multi-GB, then retry ollama pull on a stable network.

Out of memory: Use a smaller model (llama3.2) or a quantized tag (:q4_0 variants when available).

Next steps

Build RAG with Pinecone and Claude — swap Claude for Ollama in dev
Compare hosted vs local in the LLM API guide
Explore Perplexity vs ChatGPT for research tasks that still need the web

Set up a local LLM with Ollama on macOS and Linux

Prerequisites