Chat on WhatsApp
AI for Magento 13 min read

RAG Over Magento Documentation — Building an Internal Dev Assistant

Magento developers waste hours re-grepping the same documentation — Adobe DevDocs, Hyvä, Mage-OS, internal CLAUDE.md notes — every time a new team member ramps up or a familiar one forgets which DI argument was renamed in 2.4.7. A real retrieval-augmented generation pipeline collapses that into a single REST endpoint that takes a question and returns a cited, paragraph-length answer in under a second. This article walks through the working pipeline shipped on kishansavaliya.com — crawler, chunker, embeddings, pgvector store, Magento controller, and the ragas evaluation harness that keeps it honest.

RAG Over Magento Documentation — Building an Internal Dev Assistant

RAG over Magento documentation is the architecture pattern that turns the merchant's developer reference material — Adobe Commerce DevDocs, Hyvä documentation, Mage-OS knowledge base, and the team's accumulated CLAUDE.md and runbook notes — into a vector-indexed corpus that a large language model can cite from when answering a developer's question on Magento 2.4.4 — 2.4.9 in 2026. The problem this solves is not "our team doesn't know Magento"; it is "our team forgets the exact DI argument name that changed in 2.4.7". The fix is a 400-line Python pipeline plus one Magento REST controller. The rest of this article is the working implementation, the cost math, and the evaluation harness that proves it is not hallucinating.

Documentation sprawl is the actual productivity tax.

Adobe Commerce DevDocs alone is more than 4,200 pages. Add Hyvä's documentation (around 600 pages of theme, checkout, and module reference), the Mage-OS knowledge base (community-maintained patches and ADRs), plus the team's own internal notes — CLAUDE.md per repository, deployment runbooks, post-incident reports — and a senior Magento developer is searching across roughly 8,000 documents in three different search UIs that all rank by token frequency, not meaning.

  • A developer asking how do I add a custom attribute group to a product form on Magento 2.4.9 currently opens DevDocs, runs three searches, switches to Hyvä docs, opens an old PR, and copies a snippet from a 2024 article that may or may not still compile.
  • The same question through a RAG endpoint returns the relevant DevDocs section, the Hyvä override note, and the in-house pattern from CLAUDE.md in one paragraph with citations — in under a second.
  • Across a team of four Magento developers at $25/hr internal rate, recovering 30 minutes per developer per day is roughly $700 per month of reclaimed engineering time.
RAG is not "chat with your docs". It is a retrieval system that happens to use a language model as the final formatter — the retrieval is the work.

1. The crawler pulls all four sources into clean Markdown.

Crawling is the boring half of the pipeline and the one that determines answer quality more than the model choice. Three sources are public HTTP (Adobe DevDocs, Hyvä, Mage-OS); the fourth is the team's own CLAUDE.md tree on disk. The script normalizes all four into Markdown files under data/corpus/<source>/<path>.md so the rest of the pipeline does not care where a chunk came from.

# scripts/crawl_docs.py
import asyncio, re, hashlib, pathlib
import httpx
from selectolax.parser import HTMLParser
from markdownify import markdownify

SOURCES = {
    "adobe":   "https://developer.adobe.com/commerce/docs/",
    "hyva":    "https://docs.hyva.io/",
    "mage-os": "https://mage-os.org/docs/",
}
OUT = pathlib.Path("data/corpus")

async def fetch(client, url):
    r = await client.get(url, timeout=30, follow_redirects=True)
    r.raise_for_status()
    return r.text

def to_markdown(html: str, url: str) -> str:
    tree = HTMLParser(html)
    main = tree.css_first("main, article, div.content") or tree.body
    md = markdownify(main.html, heading_style="ATX")
    return f"---\nsource_url: {url}\n---\n\n{md.strip()}\n"

async def crawl(source: str, root: str):
    seen, queue = set(), [root]
    async with httpx.AsyncClient(headers={"User-Agent": "panth-rag/1.0"}) as client:
        while queue:
            url = queue.pop(0)
            if url in seen or not url.startswith(root):
                continue
            seen.add(url)
            try:
                html = await fetch(client, url)
            except Exception:
                continue
            path = OUT / source / (hashlib.sha1(url.encode()).hexdigest()[:12] + ".md")
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(to_markdown(html, url))
            for a in HTMLParser(html).css("a[href]"):
                href = a.attributes.get("href", "")
                if href.startswith("/"):
                    href = root.rstrip("/") + href
                if href.startswith(root):
                    queue.append(href.split("#")[0])

if __name__ == "__main__":
    for src, root in SOURCES.items():
        asyncio.run(crawl(src, root))

The internal CLAUDE.md tree is pulled with a one-liner — rsync -a ~/Projects/magento/ data/corpus/internal/ --include='*/' --include='CLAUDE.md' --exclude='*' — that catches every project's per-repo memory file. On a four-project team this brings in roughly 40 KB of curated, project-specific guidance that no public documentation ever sees.

2. The chunker splits on H2 boundaries with deliberate overlap.

Chunk strategy decides whether the retriever finds the right paragraph or a misleading neighbor. After benchmarking five strategies, splitting on Markdown H2 boundaries with a 500-token target and a 50-token overlap consistently beat fixed-window and sentence-window approaches on Magento DevDocs content.[1]

# scripts/chunk_docs.py
import re, pathlib, json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TARGET = 500
OVERLAP = 50

def split_h2(md: str) -> list[str]:
    parts = re.split(r"(?m)^## ", md)
    head = parts[0]
    return [head] + ["## " + p for p in parts[1:]]

def pack(sections: list[str]) -> list[str]:
    chunks, buf, buf_tokens = [], [], 0
    for s in sections:
        n = len(enc.encode(s))
        if buf_tokens + n > TARGET and buf:
            chunks.append("\n\n".join(buf))
            tail = enc.decode(enc.encode("\n\n".join(buf))[-OVERLAP:])
            buf, buf_tokens = [tail], OVERLAP
        buf.append(s)
        buf_tokens += n
    if buf:
        chunks.append("\n\n".join(buf))
    return chunks

out = []
for md_file in pathlib.Path("data/corpus").rglob("*.md"):
    text = md_file.read_text()
    src_url = re.search(r"source_url:\s*(\S+)", text)
    for i, chunk in enumerate(pack(split_h2(text))):
        out.append({
            "id":         f"{md_file.stem}-{i}",
            "source":     md_file.parts[-2],
            "source_url": src_url.group(1) if src_url else None,
            "text":       chunk,
            "tokens":     len(enc.encode(chunk)),
        })

pathlib.Path("data/chunks.jsonl").write_text("\n".join(json.dumps(c) for c in out))

On the full corpus (Adobe + Hyvä + Mage-OS + internal) this produces roughly 12,800 chunks averaging 470 tokens. Total token count to embed: ~6 million.

3. Embeddings run on OpenAI's cheapest model and store in pgvector.

OpenAI's text-embedding-3-small produces 1,536-dimension vectors at $0.02 per million input tokens.[3] Embedding the entire corpus once costs $0.12. Re-embedding on a weekly doc refresh costs the same. pgvector is the storage choice because Magento already runs PostgreSQL-compatible workloads on the same VPS and a separate vector database adds operational overhead with no measurable retrieval-quality gain at this scale.[2]

-- migrations/001_pgvector.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE panth_rag_chunks (
    id           text PRIMARY KEY,
    source       text NOT NULL,
    source_url   text,
    text         text NOT NULL,
    tokens       int  NOT NULL,
    embedding    vector(1536) NOT NULL,
    indexed_at   timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX panth_rag_chunks_hnsw
    ON panth_rag_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX panth_rag_chunks_source ON panth_rag_chunks (source);
# scripts/embed_chunks.py
import json, os, pathlib
from openai import OpenAI
import psycopg

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
conn = psycopg.connect(os.environ["PG_DSN"])

BATCH = 100
chunks = [json.loads(l) for l in pathlib.Path("data/chunks.jsonl").read_text().splitlines()]

with conn.cursor() as cur:
    for i in range(0, len(chunks), BATCH):
        batch = chunks[i:i + BATCH]
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=[c["text"] for c in batch],
        )
        rows = [
            (c["id"], c["source"], c["source_url"], c["text"], c["tokens"], d.embedding)
            for c, d in zip(batch, resp.data)
        ]
        cur.executemany(
            """INSERT INTO panth_rag_chunks (id, source, source_url, text, tokens, embedding)
               VALUES (%s, %s, %s, %s, %s, %s)
               ON CONFLICT (id) DO UPDATE SET
                 text = EXCLUDED.text, embedding = EXCLUDED.embedding, indexed_at = now()""",
            rows,
        )
        conn.commit()
        print(f"upserted {i + len(batch)} / {len(chunks)}")

4. LlamaIndex vs LangChain — pick by the shape of the workload.

Both frameworks wrap the same underlying pieces (loaders, splitters, vector stores, LLM clients). The difference is what they make easy. LlamaIndex is optimized around a single query-engine abstraction: question in, cited answer out. LangChain is optimized around composing multi-step agents that call tools, including retrieval as one tool among many. Pick by the workload, not by the marketing.

CapabilityLlamaIndexLangChain
Single-question retrieval + answerOne QueryEngine object; ~15 lines to shipDoable, but boilerplate-heavy for the same outcome
Multi-step agent (plan, retrieve, run tool, retrieve again)Possible via AgentRunner, less matureNative — LangGraph is the right primitive
Streaming partial tokens to the clientYes, via response_genYes, via astream_events
Vector store coverage40+ integrations including pgvector50+ integrations including pgvector
Eval hooksBuilt-in evaluation moduleExternal via ragas or LangSmith
Best fit on this projectYes — query is single-stepReach for it the day "and run the Magento CLI" lands on the spec

The Magento dev assistant is a single-question, single-answer workload. LlamaIndex is the right primitive. The day the team asks the assistant to also run bin/magento module:status and read the output back, LangChain becomes the right primitive — but that is a different product and worth the rewrite when it arrives.

5. The Magento REST controller turns the pipeline into one HTTP endpoint.

Magento exposes the pipeline through a custom REST endpoint /rest/V1/panth-rag/query. The endpoint embeds the question, runs a cosine-similarity top-5 against pgvector, packs the retrieved chunks into a Claude prompt with explicit citation markers, and streams the response back. The 800 ms p95 latency budget exists because the planned IDE plugin treats this endpoint like an autocomplete — anything slower feels broken.

<?php
// app/code/Panth/Rag/Api/QueryInterface.php
namespace Panth\Rag\Api;

interface QueryInterface
{
    /**
     * @param string $question
     * @param int|null $topK
     * @return string JSON: {answer, citations[], latencyMs}
     */
    public function execute(string $question, ?int $topK = 5): string;
}
<?php
// app/code/Panth/Rag/Model/Query.php
namespace Panth\Rag\Model;

use Panth\Rag\Api\QueryInterface;
use Panth\Rag\Service\Embedder;
use Panth\Rag\Service\VectorStore;
use Panth\Rag\Service\Anthropic;

class Query implements QueryInterface
{
    public function __construct(
        private Embedder $embedder,
        private VectorStore $store,
        private Anthropic $llm
    ) {}

    public function execute(string $question, ?int $topK = 5): string
    {
        $started = microtime(true);
        $vector  = $this->embedder->embed($question);
        $hits    = $this->store->cosineTopK($vector, $topK ?? 5);

        $context = '';
        foreach ($hits as $i => $hit) {
            $context .= "[{$i}] {$hit['source_url']}\n{$hit['text']}\n\n";
        }

        $answer = $this->llm->complete(
            system: 'You are a Magento 2.4.4 — 2.4.9 developer assistant. '
                  . 'Answer ONLY from the provided context. Cite sources as [0], [1], [2]. '
                  . 'If the context is insufficient, say so.',
            user: "Question: {$question}\n\nContext:\n{$context}",
            maxTokens: 400,
        );

        return json_encode([
            'answer'    => $answer,
            'citations' => array_column($hits, 'source_url'),
            'latencyMs' => (int)((microtime(true) - $started) * 1000),
        ]);
    }
}
{
  "routes": {
    "/V1/panth-rag/query": {
      "POST": {
        "service": {
          "class": "Panth\\Rag\\Api\\QueryInterface",
          "method": "execute"
        },
        "resources": ["Panth_Rag::query"]
      }
    }
  }
}

The cosine-similarity query against pgvector is one line of SQL — the HNSW index does the heavy lifting and returns the top-5 in roughly 6 ms on the test corpus.

-- Service\VectorStore::cosineTopK
SELECT id, source, source_url, text, 1 - (embedding <=> $1::vector) AS similarity
FROM panth_rag_chunks
ORDER BY embedding <=> $1::vector
LIMIT $2;

Latency budget — where the 800 ms goes

  • Question embedding (OpenAI text-embedding-3-small) — 90–140 ms.
  • pgvector cosine top-5 with HNSW — 4–8 ms.
  • Claude completion with 5 chunks of context (~2,500 input tokens, ~400 output) — 500–650 ms.
  • Network + Magento controller overhead — 20–40 ms.
  • Total p95 observed across 200 questions — 760 ms. Within budget.

6. The eval framework runs ragas on 50 hand-curated Magento questions.

An untested RAG pipeline silently degrades — a doc refresh changes chunk IDs, a new embedding model shifts cosine distances, and answers slowly become wrong. ragas measures three things on a fixed evaluation set: context precision (were the retrieved chunks relevant), context recall (did we miss any relevant chunks), and answer faithfulness (did the model invent anything that was not in the retrieved chunks).[4]

# scripts/eval_rag.py
import json, os
import httpx
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
)

QUESTIONS = json.loads(open("eval/magento_50.json").read())

def run(q):
    r = httpx.post(
        "https://kishansavaliya.com/rest/V1/panth-rag/query",
        json={"question": q["question"]},
        headers={"Authorization": f"Bearer {os.environ['MAGENTO_TOKEN']}"},
        timeout=10,
    ).json()
    return {
        "question":     q["question"],
        "answer":       r["answer"],
        "contexts":     r.get("raw_contexts", []),
        "ground_truth": q["ground_truth"],
    }

rows = [run(q) for q in QUESTIONS]
ds = Dataset.from_list(rows)
result = evaluate(ds, metrics=[context_precision, context_recall, faithfulness])
print(result)

thresholds = {"faithfulness": 0.85, "context_recall": 0.80, "context_precision": 0.75}
for metric, floor in thresholds.items():
    if result[metric] < floor:
        raise SystemExit(f"FAIL: {metric}={result[metric]:.3f} below {floor}")

The 50-question evaluation set is the most valuable artifact in the whole pipeline. Sample entries from eval/magento_50.json:

[
  {
    "question": "How do I register a custom REST API endpoint in Magento 2.4.9?",
    "ground_truth": "Declare the route in etc/webapi.xml with service class and method, then bind ACL resources in etc/acl.xml."
  },
  {
    "question": "Where does Hyva expect Alpine.js component registration files to live?",
    "ground_truth": "Inside Magento_Theme/web/js, registered via the Hyva ViewModel and loaded through the requirejs-config.js stub Hyva ships."
  },
  {
    "question": "What is the recommended way to override a vendor template under Hyva without forking the module?",
    "ground_truth": "Place the override in app/design/frontend/<Vendor>/<theme>/<Module_Name>/templates/... — never edit src/vendor."
  }
]

Run the eval in CI on every doc refresh. A faithfulness score under 0.85 fails the build and the new index is not promoted to production.

7. The operational gotchas worth knowing.

Six things broke on the way to shipping, and each one became a permanent rule.

  • Stale doc cache — Adobe DevDocs ships under a fast-changing path structure. Re-crawl weekly and diff chunk IDs; anything deleted upstream must be deleted from pgvector or stale answers leak in.
  • Embedding model lock-in — vectors from text-embedding-3-small are not comparable to vectors from text-embedding-3-large or all-MiniLM-L6-v2. Switching models requires a full re-embed.
  • Chunk overlap matters more than chunk size — moving overlap from 0 to 50 tokens improved context recall by 11 points on the eval set. Moving from 500 to 700 tokens of target size moved nothing.
  • Citations are mandatory — the system prompt instructs Claude to cite as [0], [1], [2]. Without that, faithfulness scores drop and the assistant becomes a confident hallucinator instead of a doc lookup.
  • pgvector + Magento on the same host — PostgreSQL on the Magento VPS adds ~120 MB resident memory under the test load. Cheaper and lower latency than a separate vector DB.
  • Prompt injection in docs — community docs occasionally contain "ignore previous instructions"-style tokens left as jokes. Strip with a regex pass during chunking; otherwise a single bad chunk in the top-5 can derail an answer.

8. What this is not.

This is not a chatbot. It is a retrieval API with a thin LLM formatter on top. There is no conversation memory, no multi-turn refinement, and no agent that runs Magento CLI commands. Those are separate products with separate failure modes — adding them to a doc-retrieval endpoint is the fastest way to make both worse. The dev assistant on kishansavaliya.com answers one question per request, cites its sources, and stays out of the way. When the team needs an agent, that gets a different endpoint with a different LangChain-shaped architecture.

9. Where this is heading next.

Three upgrades are queued for the next quarter on Kishan Savaliya's roadmap.

  • Per-project context layers — every Magento project gets its own private chunk namespace seeded from its repo's CLAUDE.md plus its app/code/Panth/* custom modules. Question routing decides whether to query the public corpus, the private corpus, or both.
  • Reranking with a cross-encoder — the top-5 from cosine similarity get re-scored by a small cross-encoder model (bge-reranker-v2-m3 or similar) before going into the Claude prompt. Expect +5 to +8 points on context precision.
  • IDE plugin — a thin VS Code extension that calls the same REST endpoint from the editor sidebar. The 800 ms budget exists for this client; the web UI is a fallback.

Decision table — when to ship a RAG dev assistant.

Team shapeDoc footprintShip RAG?Why
Solo developer, 1 client< 50 internal pagesNoGrep + bookmarks is faster than maintaining a pipeline.
2–4 developers, 1–3 stores~200 internal pages + DevDocsWorth pilotingOnboarding new devs is the biggest payoff.
4+ developers, multiple stores, multiple stacks (Luma + Hyvä + PWA)500+ internal pages + DevDocs + HyväYesCross-stack questions are the hard ones. RAG closes them.
Agency, 10+ active projectsPer-project runbooks + shared knowledge baseYes, with namespacingEach project needs its own chunk namespace plus a shared one.

FAQ

Why pgvector instead of a dedicated vector database like Pinecone or Weaviate?

The corpus is 12,800 chunks. At that scale the HNSW index in pgvector returns top-5 in single-digit milliseconds — well inside the latency budget — and the operational overhead of a managed vector DB (separate auth, separate backups, separate billing) buys nothing. The break-even point where a dedicated vector DB starts winning is somewhere past 5 million chunks.

Can the same pipeline run on Claude embeddings instead of OpenAI?

Anthropic does not ship a hosted embedding endpoint as of this writing — embeddings come from OpenAI, Cohere, or open-source models. The completion side, however, is one client swap. The current pipeline uses OpenAI for embeddings and Claude for completion specifically because Claude's faithfulness on cited-context prompts measured higher on the eval set.

How do you keep the corpus from going stale?

A weekly cron runs the crawler, then the chunker, then the embedder with upsert semantics. Chunks whose hashes match the existing row are skipped. New chunks are inserted; deleted upstream pages get their chunks removed by a final reconciliation pass that compares the latest chunk-ID set to what is in pgvector.

Does this break Magento upgrades on 2.4.4 — 2.4.9?

No. The Magento side is a self-contained module — one REST controller, one service class, one ACL entry — that does not patch any core. The Python pipeline runs as a separate cron user and writes to a separate PostgreSQL database that Magento never touches. Adobe security patches do not interact with any of it.

How big does the team need to be before this is worth the engineering?

Two senior Magento developers and 200+ pages of internal docs is the floor. Below that, grep and bookmarks are faster than the pipeline you have to maintain. Above that, onboarding new developers is the biggest single payback — a new hire who can ask the assistant "how do we handle B2B price approval workflows in this repo" ramps in days instead of weeks.

What does the whole thing cost to run per month?

OpenAI embeddings on a weekly refresh — ~$0.50. Claude completions at 200 queries per day across the team — ~$15 to $30 depending on context length. pgvector on the existing Magento VPS — $0. Total: under $35 per month.

Why LlamaIndex over LangChain on this specific build?

The workload is single-step: take a question, retrieve, answer with citations. LlamaIndex's QueryEngine abstraction ships that in ~15 lines and exposes the right hooks for evaluation. LangChain shines when the workload becomes multi-step (retrieve, decide to call a tool, retrieve again) — that day will come, and LangChain will get the rewrite.

How is faithfulness actually measured by ragas?

ragas.metrics.faithfulness decomposes the generated answer into atomic claims and uses a separate LLM judge to check whether each claim is supported by the retrieved context. The output is a 0-to-1 score that is the fraction of claims grounded in context. The build fails if it drops below 0.85 on the 50-question evaluation set.

Citations

  1. LangChain — framework for building LLM applications [1]
  2. LlamaIndex — data framework for LLM applications [2]
  3. OpenAI — new embedding models and API updates [3]
  4. ragas — evaluation framework for RAG pipelines [4]
Want a RAG dev assistant shipped over your team's Magento documentation?

I scope and ship the full pipeline — crawler, chunker, pgvector store, Magento REST endpoint, and the ragas eval harness — on Magento 2.4.4 — 2.4.9 with citation logging, weekly doc refresh cron, and 30 days of patches. Fixed quote from $499 audit · $2,499 sprint · ~36h @ $25/hr. See hire me.