Considering RAG for your Agent? Build this instead.

Key Takeaways

Most SaaS AI agents don’t need a vector database — file-based memory plus 1M-token context windows plus tool calls handle the typical case
Anthropic’s official “key primitive for just-in-time context retrieval” is filesystem-based, not vector-based
Claude Code’s pattern — an index file (MEMORY.md) plus per-topic markdown files loaded on demand — works for production SaaS agents too
RAG still wins for large unstructured corpora, regulated multi-tenant data, and frequently-refreshed external knowledge — most SaaS use cases don’t fit those criteria

If you’re considering RAG for your AI agent in 2026, the most important question isn’t which vector database to pick. It’s whether you need one at all.

For most SaaS agents, the simpler pattern is file-based memory: the agent stores what it learns in markdown files and reads them back on demand, the shape Claude Code uses internally. Add 1M-token context windows and tool calls against your existing database, and you handle the typical agent job with fewer moving parts than a vector-DB pipeline.

This isn’t a “RAG is dead” piece. Hamel Husain rebutted that take in July 2025 and he’s right. What’s changing is which kind of retrieval you reach for first. If you’ve been vibe coding with Claude Code or Cursor, you’ve already been using file-based memory without naming it.

The Default-RAG Instinct Is Doing Too Much

Open any “build an AI agent” tutorial and the architecture is the same: pick a vector database (Pinecone, ChromaDB, pgvector), build an embedding pipeline, chunk your documents, write retrieval, layer in a reranker, hand the top-k chunks to the model. Each piece is a system you own and pay to run.

That stack made sense when frontier models had 8K-to-32K context windows and tool calling was experimental. It doesn’t make sense as the default in 2026, when Claude Sonnet 4.6 ships a 1M-token context window and function calling is universal. Most SaaS data already lives in a structured database; agents reach it through tool calls, not similarity search. That 2023-era stack is over-engineering for the job.

When RAG Genuinely Wins

Before pulling apart the default, name the cases where a full RAG pipeline is the right answer. There are real ones.

Large unstructured corpora. When the agent searches across tens of thousands of documents whose titles don’t tell you what’s inside (product manuals, legal archives, scientific literature, internal wikis at enterprise scale), you need similarity search. Listing every doc in an index stops fitting in context; exact-match lookups miss the relevant chunk.
Regulated, multi-tenant isolation. SaaS apps with strict per-tenant data boundaries (healthcare, finance, defense) get row-level access controls and audit trails out of the box from a vector store. Filesystem memory can do this too, but you build the primitives yourself.
Frequently-refreshed external knowledge. News feeds, market data, regulatory updates: anything where the corpus changes hourly. Vector indexes update incrementally; filesystem memory drifts unless you build the same incremental path yourself.
Agentic search over structured tool responses. Jason Liu puts it sharply: “Good search is the ceiling on your RAG quality. If recall is poor, no prompt engineering or model upgrade will save you.” When the agent reasons across thousands of structured records and chooses what to ask next, you need real retrieval infrastructure with faceted metadata.

If your use case fits one of those, build the RAG stack. The rest of this post is about every other case.

Why Most SaaS Agents Don’t Fit That Profile

The typical SaaS agent operates over your own structured data: users, accounts, orders, tickets, audit logs. You don’t need fuzzy similarity search to find a user record; you need a tool call that runs SELECT * FROM users WHERE id = ?. Tool calls beat vector retrieval here on three counts: precise structured records the model handles more reliably than chunks of prose; fresh data the moment it’s written, with no embedding pipeline to re-run; and your existing database’s access controls, transactions, and audit trail. None of that is true of a parallel vector store sitting alongside your DB.
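As a rough illustration of the tool-call path, here is a minimal sketch of a lookup tool backed by a parameterized query. The `users` schema, the `get_user` name, and the sample row are assumptions made for the example, not part of any particular SDK.

```python
import sqlite3
from typing import Optional


def get_user(conn: sqlite3.Connection, user_id: int) -> Optional[dict]:
    """Tool the agent calls for an exact lookup instead of a similarity search."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    row = conn.execute(
        "SELECT id, email, plan FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return dict(row) if row is not None else None


# Demo against an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES (42, 'ada@example.com', 'pro')")
print(get_user(conn, 42))
```

The result is a small structured record read straight from the source of truth, so there is nothing to re-embed when the row changes.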

For the parts of agent context that aren’t in your DB (system instructions, conventions, accumulated learnings about a user, prior conversation summaries, your product’s docs), the math has changed too. With a 1M-token context window you can carry an enormous amount of state inline. You don’t need to retrieve what already fits.

The File-Based Memory Pattern

The architecture is simple: an index file listing what the agent knows, a directory of per-topic markdown files with the contents, and file-read and file-write tools the agent uses to navigate them.

Anthropic’s official Memory tool documentation describes this as “the key primitive for just-in-time context retrieval”: the agent stores what it learns in files in a /memories directory and reads them back on demand, instead of loading everything upfront. No embedding step, no vector store, no chunker. Just files.

Anthropic’s September 2025 post on effective context engineering formalizes it: “agents built with the just in time approach maintain lightweight identifiers (file paths, stored queries, web links, etc.) and use these references to dynamically load data into context at runtime using tools.” The same post names the failure mode this avoids: “context rot,” where model recall degrades as context fills. File-based memory keeps context lean by design.

Working memory stays small: the system prompt, the conversation, and whichever topic files were pulled in for this step. Everything else sits on disk. Need more, read more. Harness engineering calls this a feedforward control: structure the inputs so the agent doesn’t have to guess.
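One way to picture that working set is a function that assembles context from the system prompt, the index, and only the topic files requested for this step. This is a sketch under assumptions: `build_context` and its arguments are hypothetical names, and a real agent would pull files through tool calls instead of a single helper.

```python
from pathlib import Path
from typing import List


def build_context(root: Path, system_prompt: str, requested: List[str]) -> str:
    """Assemble working memory: prompt, index, and only the requested topic files."""
    parts = [system_prompt, (root / "MEMORY.md").read_text(encoding="utf-8")]
    for rel in requested:
        body = (root / rel).read_text(encoding="utf-8")
        parts.append(f"## {rel}\n{body}")
    # Files that were not requested stay on disk and cost no tokens.
    return "\n\n".join(parts)
```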

How Claude Code Does It

The reference implementation is sitting on every Claude Code user’s machine. Claude Code maintains a memory directory at ~/.claude/projects/<project>/memory/ with a single index file (MEMORY.md) and one or more topic-specific markdown files alongside it.

The official docs spell out the rules: MEMORY.md loads first, capped at the first 200 lines or 25KB, and contains one-line entries pointing to per-topic memory files. Topic files don’t load until the agent asks for one. The /memory command lists what’s currently loaded, toggles auto-memory, and opens the underlying folder.
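If you want the same cap in your own agent, a sketch might look like the following. It assumes the two limits apply as "whichever is hit first" and that only whole lines are kept; it is not Claude Code’s actual implementation, and `load_index` is a hypothetical name.

```python
MAX_LINES = 200
MAX_BYTES = 25 * 1024


def load_index(text: str, max_lines: int = MAX_LINES, max_bytes: int = MAX_BYTES) -> str:
    """Return the leading part of the index that fits both caps."""
    kept, size = [], 0
    for line in text.splitlines(keepends=True)[:max_lines]:
        size += len(line.encode("utf-8"))
        if size > max_bytes:
            break  # this line would cross the byte cap, so stop before it
        kept.append(line)
    return "".join(kept)
```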

An easy-to-miss guideline in the same docs: target under 200 lines per memory file. The reason: longer files consume more context and reduce adherence. That’s the principle making file-based memory work. Many small focused files beat one giant context dump.

Why this works

Three properties map cleanly onto what an agent needs. The index gives directional awareness: the agent knows what it knows. Per-topic files provide just-in-time depth: they enter context only when the topic is live. The 200-line cap forces summarization discipline: topics that get longer have to be split, which keeps each load focused.

None of this is novel infrastructure. It’s a directory of markdown files plus a convention for organizing and reading them. It works because the convention matches how the model reasons about relevance.

Applying This to Your SaaS Agent

Adapting this pattern for an agent inside your SaaS is mostly a question of mapping the same conventions onto your storage and your tools.

Storage layer

The simplest backend is a literal filesystem (fine for single-tenant, single-machine setups). For production multi-tenant SaaS, the pattern fits cleanly into S3 or Cloudflare R2 with one prefix per tenant, or a database table where each row is “a file” (tenant_id, path, content, updated_at). Pick whichever is closest to your stack. The agent’s tools don’t care.
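For the database-table variant, a minimal sketch could look like this. The table name `memory_files` and the helper names are assumptions; the columns follow the (tenant_id, path, content, updated_at) shape described above, and SQLite stands in for whatever database you already run.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS memory_files (
    tenant_id  TEXT NOT NULL,
    path       TEXT NOT NULL,
    content    TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    PRIMARY KEY (tenant_id, path)
)
"""


def write_file(conn: sqlite3.Connection, tenant_id: str, path: str, content: str) -> None:
    """Full rewrite of one 'file' row; the primary key keeps tenants apart."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO memory_files (tenant_id, path, content, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (tenant_id, path, content, now),
    )
    conn.commit()


def read_file(conn: sqlite3.Connection, tenant_id: str, path: str) -> Optional[str]:
    """Every read is scoped by tenant_id, so one tenant cannot see another's rows."""
    row = conn.execute(
        "SELECT content FROM memory_files WHERE tenant_id = ? AND path = ?",
        (tenant_id, path),
    ).fetchone()
    return row[0] if row else None
```

Because the tenant id is part of every query, the agent-facing tools can keep accepting plain relative paths while the server supplies the tenant.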

Index format

Your MEMORY.md is a markdown table of contents. Each entry is one line: a path, a short description, optionally a category tag. The agent loads it every turn, so keep it tight; same 200-line discipline as Claude Code.
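A sketch of that one-line format, with a formatter and a parser so the server can validate the index. The `- path: description` shape matches the example index later in this post; the trailing `[tag]` syntax for the optional category is an assumption made for illustration.

```python
import re
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    path: str
    description: str
    tag: Optional[str] = None


# "- <path>: <description>" with an optional " [tag]" suffix (hypothetical syntax).
_ENTRY = re.compile(r"^- (?P<path>\S+): (?P<desc>.*?)(?: \[(?P<tag>[^\]]+)\])?$")


def format_entry(entry: IndexEntry) -> str:
    suffix = f" [{entry.tag}]" if entry.tag else ""
    return f"- {entry.path}: {entry.description}{suffix}"


def parse_entry(line: str) -> Optional[IndexEntry]:
    """Return None for headings, blank lines, and anything else that is not an entry."""
    m = _ENTRY.match(line.strip())
    if m is None:
        return None
    return IndexEntry(m["path"], m["desc"], m["tag"])
```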

Topic file conventions

Group topics by the dimension that matches your access pattern. A customer support agent usually wants per-user files: memory/user-<id>/preferences.md, memory/user-<id>/recent-tickets.md, memory/user-<id>/open-issues.md. A coding assistant groups per-project; a research agent groups per-topic.

Loading and update rules

Two invariants do most of the work. Always load the index. Load topic files only when the conversation needs them. The agent can decide what’s worth saving in the moment, but deterministic capture is more reliable. Topic files get rewritten in full, not appended; that keeps them under 200 lines and forces summarization.
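The full-rewrite rule is easy to enforce in the write path. The sketch below rejects oversized content and swaps the file in one step; `rewrite_topic` is a hypothetical helper, and treating 200 lines as a hard limit is an assumption.

```python
from pathlib import Path

MAX_TOPIC_LINES = 200


def rewrite_topic(path: Path, new_content: str, max_lines: int = MAX_TOPIC_LINES) -> None:
    """Replace a topic file in full; refuse content that exceeds the line cap."""
    n_lines = len(new_content.splitlines())
    if n_lines > max_lines:
        raise ValueError(f"{n_lines} lines exceeds the cap of {max_lines}; summarize or split")
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(new_content, encoding="utf-8")
    tmp.replace(path)  # old content is discarded, never appended to
```

Raising instead of truncating pushes the summarization work back to the agent, which is where the 200-line discipline is meant to bite.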

Capture patterns: hooks and a daily diary

The interesting design question isn’t where memory goes — it’s when the agent writes to it. Two patterns combine to handle most of the work.

Per-session hooks. After a session ends, a deterministic trigger writes a short entry to memory/sessions/<session-id>.md: what the user did, what they pushed back on, what preferences came up, what broke. The agent doesn’t decide mid-session; the hook captures at session close. Same shape as Claude Code’s auto-memory: the model spots new conventions during the conversation, the system persists them at close.
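A session-close hook along these lines could be as small as the sketch below. The `notes` dictionary and its four keys are hypothetical: the assumption is that your application collects them during the session and calls the hook when the session ends.

```python
from pathlib import Path

# Hypothetical note categories, mirroring the list in the paragraph above.
SECTIONS = ("did", "pushed_back_on", "preferences", "broke")


def on_session_end(memory_root: Path, session_id: str, notes: dict) -> Path:
    """Deterministic hook: runs at session close, not when the model decides to."""
    lines = [f"# Session {session_id}"]
    for key in SECTIONS:
        items = notes.get(key, [])
        if items:  # empty categories are left out of the entry
            lines.append(f"## {key.replace('_', ' ')}")
            lines.extend(f"- {item}" for item in items)
    target = memory_root / "sessions" / f"{session_id}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```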

A daily diary. Once a day, a scheduled job summarizes the last 24 hours of session logs into a single short entry at memory/diary/2026-05-10.md. One paragraph, no more. Old logs get folded in and archived. Over a month you have 30 diary entries instead of thousands of raw logs. Compress further over a year, with weekly summaries and monthly themes, and the agent has hierarchical memory that mirrors how humans remember: vivid for last week, summarized for last month, themes-only for last year.

The diary works for the same reason journaling does. It forces summarization, which forces relevance ranking. Deciding what mattered at the time is much cheaper than reconstructing relevance later from an unstructured pile. Unlike humans, the agent doesn’t forget to do it. A scheduled function reads memory/sessions/, prompts the model with “summarize the last 24 hours of sessions into one paragraph, focused on durable learnings,” writes the result, and archives the source. A 50-line cron job, not infrastructure.
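The scheduled job described above might be sketched as follows. The `summarize` callable stands in for your model call and is an assumption, as are the `archive/` directory and the function name; the sketch also assumes `sessions/` holds only logs that have not been rolled up yet.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional

PROMPT = (
    "Summarize the last 24 hours of sessions into one paragraph, "
    "focused on durable learnings."
)


def run_daily_diary(root: Path, day: str, summarize: Callable[[str], str]) -> Optional[Path]:
    """Fold pending session logs into one diary entry, then archive the raw files."""
    sessions = sorted((root / "sessions").glob("*.md"))
    if not sessions:
        return None  # nothing to summarize today
    logs = "\n\n".join(p.read_text(encoding="utf-8") for p in sessions)
    entry = summarize(f"{PROMPT}\n\n{logs}")
    diary = root / "diary" / f"{day}.md"
    diary.parent.mkdir(parents=True, exist_ok=True)
    diary.write_text(entry.strip() + "\n", encoding="utf-8")
    archive = root / "archive" / day
    archive.mkdir(parents=True, exist_ok=True)
    for p in sessions:
        shutil.move(str(p), str(archive / p.name))
    return diary
```

Passing the summarizer in as a function keeps the job testable without a model and leaves the choice of SDK open.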

Andrej Karpathy’s April 2026 “LLM Wiki” gist formalizes the same shape with a three-layer split: a raw/ directory of immutable source documents, a wiki/ directory of LLM-maintained markdown pages summarizing and cross-referencing the raw material, and a CLAUDE.md at the root defining the schema and update workflow. His framing: “LLMs don’t get bored, don’t forget to update a cross-reference, and can touch 15 files in one pass.” Same skeleton, different vocabulary.

The strongest validation comes from another Anthropic post. In “Code execution with MCP” (November 2025), the team described a workflow that consumed ~150,000 tokens loading tool definitions upfront. Reimplemented with filesystem-style MCP APIs (tool definitions read on demand), the same workflow used ~2,000 tokens. A 98.7% reduction. They call it “progressive disclosure.” Your file-based memory layer encodes the same idea.
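The percentage follows directly from the two token counts quoted above; a quick check of the arithmetic:

```python
before, after = 150_000, 2_000  # approximate token counts from the post
reduction = (before - after) / before
print(f"{reduction:.1%}")  # 98.7%
```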

A Working Example

Here’s what this looks like end-to-end for a customer support agent inside a SaaS: no vector DB, no embeddings, just files and four tools.

Directory layout

Each tenant gets its own root, with three categories underneath: per-user state (the agent’s working knowledge of each customer), the time-decaying capture layer from the previous section (sessions and diary), and tenant-wide policies. One tenant’s layout:

memory/ layout

memory/MEMORY.md                      # tenant-wide index

memory/user-42/preferences.md         # explicit facts (timezone, plan tier, channels)
memory/user-42/recent-tickets.md      # last 5 tickets, summarized
memory/user-42/open-issues.md         # current state of unresolved issues

memory/sessions/2026-05-10-094217.md  # raw session log (last 24h only)
memory/diary/2026-05-09.md            # yesterday, one paragraph
memory/diary/2026-05-week-19.md       # last week, two sentences
memory/diary/2026-04-themes.md        # April, three bullet points

memory/policies/refunds.md            # product-wide policy
memory/policies/escalation.md         # escalation rules

This is the hierarchy from the previous section in action: sessions are vivid and short-lived; dailies roll them up and live about 30 days; weeklies roll up the dailies; monthly themes carry only recurring patterns. Per-user state and tenant policies sit alongside, untouched by the rollups.

What MEMORY.md actually contains

The index is the agent’s table of contents: one line per file, enough metadata to decide what to load. Loading the whole index every turn costs almost nothing because the index itself stays small.

memory/MEMORY.md

# Memory Index

## User state
- user-42/preferences.md: Pro plan, async preferred, EU timezone
- user-42/recent-tickets.md: last 5 (1 refund, 2 billing, 2 onboarding)
- user-42/open-issues.md: webhook signature mismatch, opened 2026-05-08

## Capture layer
- sessions/: raw logs, last 24h only
- diary/2026-05-09.md: billing webhook day
- diary/2026-05-week-19.md: refund policy edge cases

## Policies
- policies/refunds.md: refund auth + escalation thresholds
- policies/escalation.md: when to involve a human

Tool surface

Four tools, defined the same way they would be in any modern AI SDK (Vercel AI SDK’s tool(), OpenAI function-calling, or LangChain’s Tool interface):

read_memory_index() — returns the contents of MEMORY.md for the active tenant. Called every turn (cheap because the index is small).
read_memory_file(path) — returns the contents of one topic file. Called only when the index suggests it’s relevant.
write_memory_file(path, content) — rewrites a topic file in full. The full-rewrite constraint forces summarization rather than append-only growth.
delete_memory_file(path) — explicit deletion. Used when a topic is resolved or superseded.
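Stripped of any SDK wrapper, the four tools could be sketched as methods on a per-tenant class. This assumes a plain filesystem backend; the `MemoryTools` class and the path-escape check are additions for illustration, and you would register each method with your SDK’s own tool definition.

```python
from pathlib import Path


class MemoryTools:
    """The four memory tools, scoped to a single tenant's root directory."""

    def __init__(self, tenant_root: Path):
        self.root = tenant_root.resolve()

    def _resolve(self, rel: str) -> Path:
        # Reject absolute paths and "../" tricks that would leave the tenant root.
        p = (self.root / rel).resolve()
        if self.root not in p.parents:
            raise ValueError(f"path escapes tenant root: {rel}")
        return p

    def read_memory_index(self) -> str:
        return (self.root / "MEMORY.md").read_text(encoding="utf-8")

    def read_memory_file(self, path: str) -> str:
        return self._resolve(path).read_text(encoding="utf-8")

    def write_memory_file(self, path: str, content: str) -> None:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")  # full rewrite, never append

    def delete_memory_file(self, path: str) -> None:
        self._resolve(path).unlink()
```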

A turn in the life

How one query flows through the system. The user asks: “what was the resolution to my webhook issue from last week?”

The agent calls read_memory_index(). Index entries flag user-42/open-issues.md (webhook signature mismatch) and diary/2026-05-09.md (“billing webhook day”).
It calls read_memory_file("user-42/open-issues.md") and read_memory_file("diary/2026-05-09.md") in parallel.
Combined context is enough to answer: “We pinned it to a Stripe API key rotation that wasn’t propagated to your staging env. The fix shipped Friday. Issue is closed on our side; you should be receiving webhooks normally now.”
The agent calls write_memory_file to remove the resolved entry from user-42/open-issues.md. A server-side validator checks schema, size, and rate before the write lands.
At session end, the per-session hook writes a one-line summary to memory/sessions/2026-05-10-094217.md: “User asked about webhook resolution. Confirmed fix held. Removed entry from open-issues.md.”
At 04:00 the next morning, the daily-diary cron reads memory/sessions/ from the last 24 hours, summarizes it into one paragraph at memory/diary/2026-05-10.md, and archives the raw files. A week later the dailies fold into diary/2026-05-week-19.md; a month later the weeklies fold into diary/2026-05-themes.md.

How the agent decides what to save

The decision rule is part of the system prompt, not the tool schema. Something like: “After resolving a ticket, update recent-tickets.md with a one-line summary. If the user states a durable preference (‘always send me updates by email’), update preferences.md. Don’t save transient facts (‘the user said hi’).”

Deterministic guards earn their keep here. For high-stakes writes (preferences, policy overrides), route the agent’s write_memory_file calls through a server-side validator that enforces schema, size, and rate caps before the write lands. The agent thinks it’s writing freely; the system enforces invariants. Structured vibe coding calls this “guides plus guardrails”: the same idea applied to agent runtime instead of code generation.
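A validator of that kind might be sketched as below. The path pattern, the 16 KB size cap, and the ten-writes-per-minute limit are all hypothetical values chosen for the example; tune them to your own layout and traffic.

```python
import re
import time
from collections import defaultdict, deque

# Hypothetical schema: only these top-level areas, markdown files only.
ALLOWED_PATH = re.compile(r"^(user-\d+|policies|diary|sessions)/[\w.-]+\.md$")
MAX_BYTES = 16 * 1024
MAX_WRITES_PER_MINUTE = 10


class WriteValidator:
    """Server-side checks that run before any write_memory_file call lands."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._recent = defaultdict(deque)  # tenant_id -> timestamps of recent writes

    def check(self, tenant_id: str, path: str, content: str) -> None:
        if not ALLOWED_PATH.match(path):
            raise ValueError("path does not match the allowed schema")
        if len(content.encode("utf-8")) > MAX_BYTES:
            raise ValueError("content exceeds the size cap")
        now = self._clock()
        window = self._recent[tenant_id]
        while window and now - window[0] >= 60:
            window.popleft()  # drop writes older than one minute
        if len(window) >= MAX_WRITES_PER_MINUTE:
            raise ValueError("write rate cap exceeded")
        window.append(now)
```

Call `check` inside the server-side handler for the write tool, so a rejected write comes back to the agent as an ordinary tool error it can react to.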

The Honest Tradeoffs — Context Rot

File-based memory isn’t a free lunch. The biggest failure mode is context rot. Chroma’s July 2025 study of 18 frontier models (including Claude Opus 4, Sonnet 4, GPT-4.1, GPT-4o, o3, and Gemini 2.5 Pro) found that “model performance degrades as input length increases” well before the stated max context window. A 200K-window model can show meaningful degradation at 50K tokens. The 200-line discipline matters because it caps how much memory enters context at once. The older “lost in the middle” finding from Liu et al. (TACL 2024) is softened in current frontier models but not eliminated; if you’re packing 30 memory files into context, the order matters.

Two more failure modes are worth naming.

Fuzzy matching is genuinely harder. If a user asks “what was that thing about Stripe webhooks we discussed?” and the relevant entry is in memory/billing-debugging.md, the agent has to either browse the index intelligently or accept that some queries will miss. With vector search, the same query lights up automatically. For most SaaS use cases this is acceptable; for a public-facing knowledge base where users phrase the same question 50 different ways, vector retrieval still wins.

Memory has to be maintained. Files go stale, two files end up contradicting each other, and the agent saves a fact incorrectly and propagates the error on every read. None of these are unique to file-based memory; they’re the same problems any RAG system has. The solution is different, though: explicit update and delete semantics in your write path, not incremental embedding refreshes.

None of these tradeoffs make file-based memory wrong. They make it bounded. Know where the bounds are.

Where Everyone Is Converging

If this looked contrarian a year ago, it doesn’t now. The major AI infrastructure players have adopted the pattern. The timeline:

August 2025: Anthropic ships the Memory tool. The official tool for stateful Claude agents writes to a filesystem (/memories directory), not a vector store. Tool version: memory_20250818.
September 2025: Anthropic publishes “Effective context engineering for AI agents.” The post argues for the “just-in-time” approach with file paths as lightweight identifiers, and warns explicitly about “context rot.”
November 2025: Anthropic publishes “Code execution with MCP.” The 98.7% token reduction case study. Simon Willison’s reaction: “a sensible way to take advantage of the strengths of coding agents and address some of the major drawbacks of MCP as it is usually implemented today.”
December 2025: Linux Foundation forms the Agentic AI Foundation. Founding contributions include OpenAI’s AGENTS.md, Anthropic’s MCP, and Block’s Goose. AGENTS.md was already adopted by 60,000+ open-source projects at the announcement, supported by Cursor, Codex, GitHub Copilot, Gemini CLI, Devin, and others. The standard for agent context is a markdown file. Not a vector index.
April 2026: Karpathy publishes the LLM Wiki gist. Three-layer markdown wiki maintained by the LLM, explicitly contrasted with naive RAG.

Anthropic’s official memory primitive, Anthropic’s context-engineering guidance, the Linux Foundation’s flagship agent standard, Karpathy’s most recent public design: all point at file-based memory as the default for agent state. Major AI coding tools (Claude Code, Cursor, Windsurf, GitHub Copilot) consume this pattern natively. Convergence is moving faster than most teams have updated their architectures.

Decision Framework — Do You Need RAG?

If you’re still on the fence, the decision is mostly mechanical. Run your use case down the comparison and the answer falls out.

	File-based memory	Vector RAG	Long context only
Best for	Per-user/per-tenant agent state, conventions, summarized history	Large unstructured corpora, fuzzy semantic search	Single-shot tasks with bounded inputs
Corpus size	Up to a few thousand small files per scope	Tens of thousands to millions of documents	Whatever fits in 1M tokens
Data structure	Structured or summarized prose, agent-organized	Unstructured or semi-structured prose	Anything that fits
Infrastructure	Filesystem or object store, four tools	Embedding model, vector DB, chunker, reranker	None beyond the model API
Latency	One file-read per topic, fast	Embedding + vector search + rerank, several hops	Just the model
Cost shape	Storage + token cost on read	Storage + embedding compute + DB ops	Token cost only, scales with context size
Failure mode	Stale or contradictory memory files	Bad chunks retrieved, agent ignores them	Context rot, lost-in-the-middle

The heuristic that captures most of this: if your data fits in your existing database and your relevant memory fits in your context, you don’t need a vector DB. Reach for one when you outgrow that envelope, not before. Memory is one layer of a larger system; for the others, see the full AI agent SaaS tech stack.

The practical sequence: ship the agent with file-based memory first, watch how it fails in production, add RAG infrastructure only when a specific corpus demands it. Anthropic’s “Building Effective Agents” guidance from December 2024 says the same thing more generally: “Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.”

Frequently Asked Questions

Is file-based memory just RAG by another name?

Technically yes, by a strict definition: any time an agent reads external info into context, that's retrieval. The narrower argument: most SaaS agents don't need a vector database, embeddings, or chunking. File reads against an index plus per-topic markdown files do the same job with less infrastructure.

Won't file-based memory break at scale?

It will, and that's where RAG starts winning. File-based memory works while the index plus relevant topic files fit in the agent's working context. Once you're storing millions of facts that need fuzzy retrieval — say, support tickets across many tenants — you'll want vector search and reranking.

How does this work with Vercel AI SDK, OpenAI SDK, or LangChain?

The implementation is nearly identical across SDKs, because the pattern is just file-read and file-write tools. With Vercel AI SDK, OpenAI function-calling, or LangChain, define read_memory_index, read_memory_file, write_memory_file, and delete_memory_file as tools. The storage backend (filesystem, S3/R2, or database) is SDK-agnostic.

Should I migrate my existing RAG system to file-based memory?

Probably not. If your RAG works, leave it. Migrate only when you hit specific signals: most queries return chunks the agent ignores, retrieval latency dominates response time, or the vector DB is the highest-maintenance part of your stack. A hybrid (file-based memory plus targeted RAG for one corpus) is fine.

What does VibeReady ship for AI agent memory?

VibeReady's AI Agent Starter Kit ships file-based memory primitives — a memory/ directory, index file format, read/write tools wired into Vercel AI SDK, and per-user organization. Layer RAG on top later if you outgrow the pattern. See vibeready.sh/ai-agent-starter/.
