What you'll build

A small RAG (retrieval-augmented generation) service that:

Chunks and embeds your documents into a Pinecone index
Retrieves top-k chunks for a user question
Sends grounded context to Claude with a strict "answer from context only" prompt
Exposes a CLI or HTTP endpoint for queries

This pattern powers support bots, internal wikis, and sales enablement without fine-tuning a model.

Architecture overview

Documents → chunk (500 tokens) → embed → Pinecone upsert
User question → embed query → Pinecone query → top 5 chunks → Claude Messages API call

Overlap consecutive chunks by about 50 tokens so that a sentence split at a chunk boundary still appears whole in at least one chunk. Store metadata (source, page, title) on each vector so the UI can show citations.
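To make the overlap arithmetic concrete, here is a minimal sketch of where chunk boundaries fall for a given chunk size and overlap. The function name `chunkWindows` and the `ChunkMetadata` shape are illustrative assumptions, not part of any SDK; `page` and `title` are optional because plain-text sources may not have them.

```typescript
// Hypothetical helper: compute [start, end) token windows for a document.
// Each window starts (size - overlap) tokens after the previous one.
function chunkWindows(total: number, size = 500, overlap = 50): Array<[number, number]> {
  if (overlap >= size) throw new Error('overlap must be smaller than size')
  const windows: Array<[number, number]> = []
  const stride = size - overlap
  for (let start = 0; start < total; start += stride) {
    const end = Math.min(start + size, total)
    windows.push([start, end])
    if (end === total) break // the last window already reaches the end
  }
  return windows
}

// Assumed metadata shape stored next to each vector for citations.
interface ChunkMetadata {
  source: string
  text: string
  page?: number
  title?: string
}

console.log(chunkWindows(1200))
// [[0, 500], [450, 950], [900, 1200]]
```

Tokens 450 to 499 appear in both the first and second window, which is what lets a sentence straddling position 500 be retrieved intact.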

Before you start

Create accounts:

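Anthropic console: API key
OpenAI: API key, used here for embeddings only
Pinecone: create an index whose dimension matches your embedding model. This tutorial uses OpenAI text-embedding-3-small, which outputs 1536-dimensional vectors by default; many open-source embedders use 1024.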

export ANTHROPIC_API_KEY=sk-ant-...
export PINECONE_API_KEY=...
export PINECONE_INDEX=rag-demo
export OPENAI_API_KEY=sk-...   # for embeddings only
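A missing key tends to surface later as an opaque authentication error, so it can help to check the environment up front. A minimal sketch, assuming the four variable names exported above; the helper name `missingEnv` is hypothetical.

```typescript
// Names taken from the export lines above.
const REQUIRED_VARS = ['ANTHROPIC_API_KEY', 'PINECONE_API_KEY', 'PINECONE_INDEX', 'OPENAI_API_KEY']

// Hypothetical helper: return the names that are unset or blank.
function missingEnv(names: string[], env: Record<string, string | undefined>): string[] {
  return names.filter((name) => (env[name] ?? '').trim() === '')
}

// Example with a partial environment (values are placeholders).
const missing = missingEnv(REQUIRED_VARS, {
  ANTHROPIC_API_KEY: 'placeholder',
  PINECONE_INDEX: 'rag-demo',
})
console.log(missing) // ['PINECONE_API_KEY', 'OPENAI_API_KEY']
```

In the scripts below, the same helper could be called as `missingEnv(REQUIRED_VARS, process.env)` at startup, exiting with a clear message when the result is not empty.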

Step 1 — Project setup

mkdir rag-pinecone-claude && cd rag-pinecone-claude
npm init -y
npm install @anthropic-ai/sdk @pinecone-database/pinecone openai pdf-parse
npm install -D tsx @types/node typescript

Step 2 — Chunk and embed documents

Create src/ingest.ts:

import fs from 'fs/promises'
import path from 'path'
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI()
const pinecone = new Pinecone()
const index = pinecone.index(process.env.PINECONE_INDEX!)

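// Sizes are counted in whitespace-separated words, a rough stand-in for tokens.
const CHUNK_SIZE = 500
const OVERLAP = 50

function chunkText(text: string): string[] {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += CHUNK_SIZE - OVERLAP) {
    chunks.push(words.slice(i, i + CHUNK_SIZE).join(' '))
    // Stop once a chunk reaches the end, so the tail is not emitted twice.
    if (i + CHUNK_SIZE >= words.length) break
  }
  // Drop fragments of 80 characters or fewer; they rarely carry a full answer.
  return chunks.filter((c) => c.length > 80)
}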

async function embed(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  })
  return res.data.map((d) => d.embedding)
}

export async function ingestDir(dir: string) {
  const files = (await fs.readdir(dir)).filter((f) => f.endsWith('.md') || f.endsWith('.txt'))
  for (const file of files) {
    const raw = await fs.readFile(path.join(dir, file), 'utf-8')
    const chunks = chunkText(raw)
    // Skip files with no usable chunks; there is nothing to embed or upsert.
    if (chunks.length === 0) continue
    const vectors = await embed(chunks)
    await index.upsert(
      vectors.map((values, i) => ({
        // A per-file chunk index keeps IDs stable, so re-ingesting a file
        // overwrites its vectors instead of adding duplicates.
        id: `${file}-${i}`,
        values,
        metadata: { source: file, text: chunks[i] },
      })),
    )
    console.log(`Indexed ${chunks.length} chunks from ${file}`)
  }
}

Run: npx tsx -e "import { ingestDir } from './src/ingest.ts'; ingestDir('./docs')"
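The ingest function sends all chunks of a file in one embeddings request and one upsert. Both services limit request size, so large files are safer to send in batches. Below is a minimal sketch of a generic batching helper; the name `toBatches` and the batch size of 100 are assumptions to check against the current OpenAI and Pinecone limits.

```typescript
// Hypothetical helper: split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) throw new Error('size must be a positive integer')
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// Placeholder IDs standing in for the chunks of one large file.
const chunkIds = Array.from({ length: 250 }, (_, i) => `chunk-${i}`)
console.log(toBatches(chunkIds, 100).map((b) => b.length)) // [100, 100, 50]
```

Inside `ingestDir`, the embed and upsert calls would then run once per batch, for example in a `for (const batch of toBatches(chunks, 100))` loop.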

Step 3 — Query with retrieval + Claude

Create src/ask.ts:

import Anthropic from '@anthropic-ai/sdk'
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const anthropic = new Anthropic()
const openai = new OpenAI()
const index = new Pinecone().index(process.env.PINECONE_INDEX!)

async function retrieve(question: string, topK = 5) {
  const emb = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  })
  const res = await index.query({
    vector: emb.data[0].embedding,
    topK,
    includeMetadata: true,
  })
  return (res.matches ?? [])
    .map((m) => String(m.metadata?.text ?? ''))
    .filter(Boolean)
}

export async function ask(question: string): Promise<string> {
  const chunks = await retrieve(question)
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join('\n\n')

  const msg = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 1024,
    system: `Answer using ONLY the numbered context below. If the answer is not in context, say "I don't have that in the indexed documents." Cite chunk numbers like [2] when relevant.`,
    messages: [
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`,
      },
    ],
  })

  const block = msg.content[0]
  return block.type === 'text' ? block.text : ''
}

Test: npx tsx -e "import { ask } from './src/ask.ts'; ask('What is our refund policy?').then(console.log)"
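The overview lists an HTTP endpoint, so here is a minimal sketch of one using Node's built-in `http` module. The route `/ask`, the JSON body shape `{ "question": "..." }`, and the names `parseQuestion` and `createAskServer` are assumptions for illustration. The `ask` function is passed in as a parameter so the server code stays independent of the SDK clients.

```typescript
import http from 'node:http'

type AskFn = (question: string) => Promise<string>

// Hypothetical helper: extract a non-empty question from a JSON request body.
function parseQuestion(rawBody: string): string | null {
  try {
    const parsed: unknown = JSON.parse(rawBody)
    if (typeof parsed !== 'object' || parsed === null) return null
    const question = (parsed as { question?: unknown }).question
    return typeof question === 'string' && question.trim() !== '' ? question.trim() : null
  } catch {
    return null // body was not valid JSON
  }
}

// Build (but do not start) a server that answers POST /ask.
function createAskServer(ask: AskFn): http.Server {
  return http.createServer((req, res) => {
    const send = (status: number, body: object) => {
      res.writeHead(status, { 'Content-Type': 'application/json' })
      res.end(JSON.stringify(body))
    }
    if (req.method !== 'POST' || req.url !== '/ask') return send(404, { error: 'Not found' })

    let raw = ''
    req.on('data', (chunk) => (raw += chunk))
    req.on('end', async () => {
      const question = parseQuestion(raw)
      if (!question) return send(400, { error: 'Expected a JSON body with a "question" string' })
      try {
        send(200, { answer: await ask(question) })
      } catch {
        send(500, { error: 'Failed to answer the question' })
      }
    })
  })
}
```

With `ask` imported from src/ask.ts, `createAskServer(ask).listen(3000)` would start the service (the port is an arbitrary choice). This sketch has no body-size limit or authentication, both of which matter before exposing it publicly.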

Step 4 — Evaluation before production

Build a CSV of 20 question/answer pairs from your docs. For each question:

Run retrieval-only — do the right chunks appear in top 5?
Run full RAG — is the answer faithful to context?
Log failures and tune chunk size or metadata filters

Bad RAG is usually a retrieval problem, not a model problem. Compare Pinecone vs Weaviate if you need hybrid search later.
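The retrieval-only check can be scored as a hit rate: the fraction of questions whose expected source shows up in the top k results. A minimal sketch, assuming each test case records the source file that should be retrieved and the sources actually returned; the `EvalCase` shape and the name `hitRateAtK` are hypothetical.

```typescript
interface EvalCase {
  question: string
  expectedSource: string // file that contains the answer
  retrievedSources: string[] // sources of the returned chunks, best match first
}

// Fraction of cases whose expected source appears among the first k results.
function hitRateAtK(cases: EvalCase[], k = 5): number {
  if (cases.length === 0) return 0
  const hits = cases.filter((c) => c.retrievedSources.slice(0, k).includes(c.expectedSource))
  return hits.length / cases.length
}

// Made-up sample data, for illustration only.
const sample: EvalCase[] = [
  { question: 'q1', expectedSource: 'a.md', retrievedSources: ['a.md', 'b.md'] },
  { question: 'q2', expectedSource: 'b.md', retrievedSources: ['c.md', 'a.md'] },
  { question: 'q3', expectedSource: 'c.md', retrievedSources: ['b.md', 'c.md'] },
  { question: 'q4', expectedSource: 'a.md', retrievedSources: [] },
]
console.log(hitRateAtK(sample)) // 0.5
console.log(hitRateAtK(sample, 1)) // 0.25
```

Tracking this number before and after a change to chunk size or metadata filters shows whether the change helped retrieval, independent of the model's answers.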

Production hardening

Namespaces per customer in Pinecone for multi-tenant SaaS
Rate limits on the ask endpoint
Citation UI — show source filenames from metadata
Re-ingest webhook when docs change in Notion or Google Drive
Cost caps on embedding batch jobs
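For the citation UI, the bracketed numbers in Claude's answer can be mapped back to source filenames. A minimal sketch, assuming the answer cites chunks as `[n]` as the Step 3 system prompt requests, and that `sources[n - 1]` holds the `source` metadata of chunk n; the function name `citedSources` is hypothetical.

```typescript
// Map citation markers like [2] in an answer to unique source filenames.
function citedSources(answer: string, sources: string[]): string[] {
  const cited = new Set<string>()
  for (const match of answer.matchAll(/\[(\d+)\]/g)) {
    const source = sources[Number(match[1]) - 1]
    if (source !== undefined) cited.add(source) // ignore out-of-range numbers
  }
  return [...cited]
}

// Made-up filenames and answer text, for illustration only.
const exampleSources = ['refunds.md', 'shipping.md', 'refunds.md']
console.log(citedSources('See the policy [1][3] and also [7].', exampleSources))
// ['refunds.md']
```

The `retrieve` function in Step 3 returns only chunk text, so it would need to return `m.metadata?.source` alongside each chunk for this mapping to work.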

Common errors

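Empty retrieval: Usually an index name or namespace mismatch, or a query that runs before freshly upserted vectors are searchable. A dimension mismatch between the embedding model and the index normally raises an error instead of returning nothing.

Hallucinations despite context: Strengthen the system prompt, and reduce topK if irrelevant chunks are confusing the model.

Slow queries: Cache embeddings for frequent questions, and use Claude Haiku for draft answers.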
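Caching query embeddings can be as simple as an in-memory map keyed by the normalized question. Below is a minimal sketch with least-recently-used eviction; the class name `EmbeddingCache` and the default capacity of 1000 entries are assumptions, and a multi-instance deployment would need a shared store instead.

```typescript
class EmbeddingCache {
  private entries = new Map<string, number[]>()
  private readonly capacity: number

  constructor(capacity = 1000) {
    this.capacity = capacity
  }

  // Treat questions that differ only in case or spacing as the same key.
  private key(question: string): string {
    return question.trim().toLowerCase().replace(/\s+/g, ' ')
  }

  get(question: string): number[] | undefined {
    const k = this.key(question)
    const hit = this.entries.get(k)
    if (hit !== undefined) {
      // Re-insert to mark as most recently used (Map keeps insertion order).
      this.entries.delete(k)
      this.entries.set(k, hit)
    }
    return hit
  }

  set(question: string, vector: number[]): void {
    const k = this.key(question)
    this.entries.delete(k)
    this.entries.set(k, vector)
    if (this.entries.size > this.capacity) {
      // Evict the least recently used entry, which is first in iteration order.
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
  }
}

const cache = new EmbeddingCache(2)
cache.set('What is our refund policy?', [0.1, 0.2])
console.log(cache.get('  what is our   REFUND policy? ')) // [0.1, 0.2]
```

In `retrieve`, the cache would be consulted before calling the embeddings API, and the result stored with `set` on a miss.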

Next steps

Add reranking with Cohere for better precision on long corpora
Wire the ask endpoint into your support widget
Read "How to Choose an LLM API in 2026" for provider failover patterns

Build a RAG pipeline with Pinecone and Claude
