Set up a local LLM with Ollama on macOS and Linux

~45 min hands-on · 3 min read · June 4, 2026

Prerequisites

  • macOS 13+ or Ubuntu 22.04+ with 16 GB RAM (32 GB recommended for 13B models)
  • Admin access to install packages
  • Optional: NVIDIA GPU with 8 GB+ VRAM for faster inference

What you'll build

A local LLM environment where you can:

  1. Run models on your machine with Ollama
  2. Chat in the terminal or Ollama desktop app
  3. Call http://localhost:11434 from Node or Python using an OpenAI-compatible client
  4. Compare latency and quality against hosted APIs before committing spend

Local inference is ideal for drafts, PII-sensitive workflows, and offline development, but it is not a full replacement for frontier hosted models on hard reasoning tasks.

Why Ollama in 2026

Ollama bundles model weight management, GPU detection, and a simple pull/run workflow. Alternatives like vLLM or llama.cpp offer more control but demand more ops work. For indie developers validating RAG or agent prototypes, Ollama is the fastest on-ramp. See Mistral vs Llama when choosing open-weight families.

Step 1 — Install Ollama

macOS:

brew install ollama
# or download from https://ollama.com/download
ollama serve   # runs the server in the foreground (the desktop app starts it automatically)

Linux:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable ollama
sudo systemctl start ollama

Verify: curl http://localhost:11434/api/tags should return JSON (with an empty model list if nothing has been pulled yet).
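The same check can be scripted. This is a minimal sketch using only the Python standard library; it assumes the /api/tags response has a top-level "models" array whose entries carry a "name" field, and the helper names model_names and fetch_model_names are made up for illustration.

```python
import json
import urllib.request


def model_names(tags_body: str) -> list[str]:
    """Extract model names from the JSON text returned by /api/tags."""
    data = json.loads(tags_body)
    # Assumed shape: {"models": [{"name": "llama3.2:latest", ...}, ...]}
    return [m["name"] for m in (data.get("models") or [])]


def fetch_model_names(base_url: str = "http://localhost:11434") -> list[str]:
    """Query a running Ollama daemon; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(resp.read().decode("utf-8"))
```

Calling fetch_model_names() right after installation should give an empty list; once a model is pulled, its tag should appear.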

Step 2 — Pull a model

Start small, then scale:

ollama pull llama3.2        # 3B — fast on laptops
ollama pull mistral         # 7B — balanced
ollama pull llama3.1:8b     # stronger reasoning, more RAM

Interactive chat:

ollama run llama3.2
>>> Write a one-paragraph product description for a todo app.

Watch Activity Monitor (macOS) or nvidia-smi (NVIDIA GPUs) for memory pressure. If the system starts swapping to disk, pick a smaller quantization or reduce the context length.
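To guess in advance whether a model will fit, a rough rule of thumb is parameters times bytes per weight. The sketch below is an approximation under stated assumptions: it uses nominal parameter counts, counts the weights only, and ignores the context cache and runtime overhead, so real usage will be higher.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rule-of-thumb size of the model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_weight


# Nominal sizes for the models pulled above (assumed, not measured).
for name, params in [("llama3.2 (3B)", 3), ("mistral (7B)", 7), ("llama3.1:8b", 8)]:
    fp16 = estimate_weights_gb(params, 16)
    q4 = estimate_weights_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Treat the result as a lower bound: if the estimate is already close to your free RAM or VRAM, the model will likely swap.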

Step 3 — OpenAI-compatible API

Ollama exposes /v1/chat/completions. Test with curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello in one sentence."}]
  }'

Node.js example:

import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required string, ignored locally
})

const res = await client.chat.completions.create({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Summarize RAG in two sentences.' }],
})
console.log(res.choices[0].message.content)

Point existing apps at this baseURL during development, then switch to Anthropic/OpenAI in production behind the same client abstraction.
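For Python, the same request can be sent without any third-party package. This sketch mirrors the curl call above using the standard library; build_chat_request and chat are hypothetical helper names, and the response is assumed to follow the OpenAI chat completions shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str, api_key: str = "ollama"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key locally; hosted providers will not.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

With the daemon running, chat("http://localhost:11434/v1", "llama3.2", "Hello in one sentence.") should return the model's reply as a string.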

Step 4 — Environment variables for your app

export LLM_BASE_URL=http://localhost:11434/v1
export LLM_MODEL=llama3.2
export LLM_API_KEY=ollama

In Next.js or Express, read these in a single createLlmClient() helper so CI uses mocks while your laptop uses Ollama.
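The same idea in Python might look like the sketch below. LlmConfig and load_llm_config are hypothetical names; the defaults simply repeat the exports above, and passing a plain dict instead of os.environ is one way to keep tests and CI independent of a running daemon.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LlmConfig:
    base_url: str
    model: str
    api_key: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LlmConfig:
    """Read the LLM_* variables, falling back to local Ollama defaults."""
    return LlmConfig(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        model=env.get("LLM_MODEL", "llama3.2"),
        api_key=env.get("LLM_API_KEY", "ollama"),
    )
```

Keeping all three values in one object means switching providers is a change of environment, not of application code.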

Step 5 — Optional: Modelfile for custom system prompts

Create Modelfile:

FROM llama3.2
SYSTEM You are a concise technical writer. Answer in bullet points unless asked otherwise.
PARAMETER temperature 0.3

Build and run:

ollama create writer -f Modelfile
ollama run writer

Useful for consistent tone in changelog or support-draft agents.
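If you maintain several such agents, generating the Modelfile text from code keeps them consistent. This is a small sketch, not an Ollama API: render_modelfile is a hypothetical helper that only emits the three instructions used above and deliberately rejects multi-line system prompts.

```python
def render_modelfile(base: str, system: str, parameters: dict[str, object]) -> str:
    """Return Modelfile text with FROM, SYSTEM, and PARAMETER lines."""
    if "\n" in system:
        raise ValueError("this sketch only handles single-line system prompts")
    lines = [f"FROM {base}", f"SYSTEM {system}"]
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    "You are a concise technical writer. Answer in bullet points unless asked otherwise.",
    {"temperature": 0.3},
)
print(text)
```

Write the returned text to a file named Modelfile and build it with ollama create as shown above.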

Benchmark against cloud APIs

Run the same ten prompts through Ollama and Claude. Track:

  • Time to first token
  • Factual errors on your domain docs
  • Refusal rate on edge cases
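Time to first token is the easiest of these to automate. The sketch below assumes you already have a streaming response exposed as an iterable of text chunks from either provider; time_to_first_token is a hypothetical helper, and the clock parameter exists only so the function can be tested without real delays.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_token(
    stream: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> Optional[float]:
    """Seconds from the start of iteration until the first non-empty chunk."""
    start = clock()
    for chunk in stream:
        if chunk:
            return clock() - start
    return None  # the stream ended without producing any text
```

For a fair comparison, pass a lazy generator that sends the request on first iteration, so network and model load time are included in the measurement.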

If, for example, local output reaches roughly 80% of cloud quality at 10% of the cost for your workload, hybrid routing (local draft → cloud polish) often wins. Read How to Choose an LLM API in 2026 for production routing patterns.
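One way to picture the local draft → cloud polish pattern is the sketch below. Everything in it is an assumption for illustration: the two models are passed in as plain functions, and needs_polish stands in for whatever quality check suits your workload (length, a keyword test, a human flag).

```python
from typing import Callable

Generate = Callable[[str], str]


def draft_then_polish(
    prompt: str,
    local: Generate,
    cloud: Generate,
    needs_polish: Callable[[str], bool],
) -> str:
    """Draft locally; call the hosted model only when the draft fails the check."""
    draft = local(prompt)
    if not needs_polish(draft):
        return draft  # good enough, no cloud spend
    return cloud(f"Improve this draft while keeping its meaning:\n\n{draft}")
```

The fraction of prompts that pass the check without escalation is what determines how much of the cloud bill you actually avoid.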

Common errors

connection refused on 11434 — Daemon not running; start ollama serve or the desktop app.

Model download hangs — Check disk space; models are multi-GB. Use ollama pull on a stable network.

Out of memory — Use smaller models (llama3.2) or quantized tags (:q4_0 variants when available).
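The first two errors can be caught before they happen with a quick preflight. This is a sketch using only the standard library; port_is_open and free_disk_gb are hypothetical helpers, and the 11434 default is the port used throughout this tutorial.

```python
import shutil
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" as well as timeouts.
        return False


def free_disk_gb(path: str = ".") -> float:
    """Free space on the filesystem containing path, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9
```

If port_is_open() is False, start the daemon; if free_disk_gb() is only a few GB, clear space before pulling a multi-GB model.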
