TL;DR

A working RAG (retrieval-augmented generation) pipeline that indexes Adobe DevDocs, Hyvä docs, Mage-OS knowledge base, and the team's own CLAUDE.md files into one searchable corpus for Magento 2.4.4-2.4.9 work.
Crawler: python scripts/crawl_docs.py pulls developer.adobe.com/commerce, docs.hyva.io, and mage-os.org/docs into clean Markdown on disk.
Chunker: splits on H2 boundaries, target 500 tokens per chunk with 50-token overlap. Keeps semantic units intact and survives long code blocks.
Embeddings: OpenAI text-embedding-3-small at $0.02 per million tokens^[3], stored in pgvector running alongside Magento's MySQL host on the same VPS. Zero added infra cost.
Query: Magento REST endpoint /rest/V1/panth-rag/query embeds the question, runs cosine-similarity top-5, calls Claude with retrieved chunks as context. Latency budget 800 ms p95 for autocomplete UX.
Eval: ragas measures context precision, context recall, and answer faithfulness on 50 hand-curated Magento dev questions. Anything below 0.85 on faithfulness blocks the release.

RAG over Magento documentation is the architecture pattern that turns the merchant's developer reference material, Adobe Commerce DevDocs, Hyvä documentation, Mage-OS knowledge base, and the team's accumulated CLAUDE.md and runbook notes, into a vector-indexed corpus that a large language model can cite from when answering a developer's question on Magento 2.4.4-2.4.9 in 2026. The problem this solves is not "our team doesn't know Magento"; it is "our team forgets the exact DI argument name that changed in 2.4.7". The fix is a 400-line Python pipeline plus one Magento REST controller. The rest of this article is the working implementation, the cost math, and the evaluation harness that proves it is not hallucinating.

Documentation sprawl is the actual productivity tax.

Adobe Commerce DevDocs alone is more than 4,200 pages. Add Hyvä's documentation (around 600 pages of theme, checkout, and module reference), the Mage-OS knowledge base (community-maintained patches and ADRs), plus the team's own internal notes: CLAUDE.md per repository, deployment runbooks, post-incident reports: and a senior Magento developer is searching across roughly 8,000 documents in three different search UIs that all rank by token frequency, not meaning.

A developer asking how do I add a custom attribute group to a product form on Magento 2.4.9 currently opens DevDocs, runs three searches, switches to Hyvä docs, opens an old PR, and copies a snippet from a 2024 article that may or may not still compile.
The same question through a RAG endpoint returns the relevant DevDocs section, the Hyvä override note, and the in-house pattern from CLAUDE.md in one paragraph with citations, in under a second.
Across a team of four Magento developers at $25/hr internal rate, recovering 30 minutes per developer per day is roughly $700 per month of reclaimed engineering time.

RAG is not "chat with your docs". It is a retrieval system that happens to use a language model as the final formatter: the retrieval is the work.

1. The crawler pulls all four sources into clean Markdown.

Crawling is the boring half of the pipeline and the one that determines answer quality more than the model choice. Three sources are public HTTP (Adobe DevDocs, Hyvä, Mage-OS); the fourth is the team's own CLAUDE.md tree on disk. The script normalizes all four into Markdown files under data/corpus/<source>/<path>.md so the rest of the pipeline does not care where a chunk came from.

# scripts/crawl_docs.py
import asyncio, re, hashlib, pathlib
import httpx
from selectolax.parser import HTMLParser
from markdownify import markdownify

SOURCES = {
    "adobe":   "https://developer.adobe.com/commerce/docs/",
    "hyva":    "https://docs.hyva.io/",
    "mage-os": "https://mage-os.org/docs/",
}
OUT = pathlib.Path("data/corpus")

async def fetch(client, url):
    r = await client.get(url, timeout=30, follow_redirects=True)
    r.raise_for_status()
    return r.text

def to_markdown(html: str, url: str) -> str:
    tree = HTMLParser(html)
    main = tree.css_first("main, article, div.content") or tree.body
    md = markdownify(main.html, heading_style="ATX")
    return f"---\nsource_url: {url}\n---\n\n{md.strip()}\n"

async def crawl(source: str, root: str):
    seen, queue = set(), [root]
    async with httpx.AsyncClient(headers={"User-Agent": "panth-rag/1.0"}) as client:
        while queue:
            url = queue.pop(0)
            if url in seen or not url.startswith(root):
                continue
            seen.add(url)
            try:
                html = await fetch(client, url)
            except Exception:
                continue
            path = OUT / source / (hashlib.sha1(url.encode()).hexdigest()[:12] + ".md")
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(to_markdown(html, url))
            for a in HTMLParser(html).css("a[href]"):
                href = a.attributes.get("href", "")
                if href.startswith("/"):
                    href = root.rstrip("/") + href
                if href.startswith(root):
                    queue.append(href.split("#")[0])

if __name__ == "__main__":
    for src, root in SOURCES.items():
        asyncio.run(crawl(src, root))

The internal CLAUDE.md tree is pulled with a one-liner: rsync -a ~/Projects/magento/ data/corpus/internal/ --include='*/' --include='CLAUDE.md' --exclude='*' That catches every project's per-repo memory file. On a four-project team this brings in roughly 40 KB of curated, project-specific guidance that no public documentation ever sees.

2. The chunker splits on H2 boundaries with deliberate overlap.

Chunk strategy decides whether the retriever finds the right paragraph or a misleading neighbor. After benchmarking five strategies, splitting on Markdown H2 boundaries with a 500-token target and a 50-token overlap consistently beat fixed-window and sentence-window approaches on Magento DevDocs content.^[1]

# scripts/chunk_docs.py
import re, pathlib, json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TARGET = 500
OVERLAP = 50

def split_h2(md: str) -> list[str]:
    parts = re.split(r"(?m)^## ", md)
    head = parts[0]
    return [head] + ["## " + p for p in parts[1:]]

def pack(sections: list[str]) -> list[str]:
    chunks, buf, buf_tokens = [], [], 0
    for s in sections:
        n = len(enc.encode(s))
        if buf_tokens + n > TARGET and buf:
            chunks.append("\n\n".join(buf))
            tail = enc.decode(enc.encode("\n\n".join(buf))[-OVERLAP:])
            buf, buf_tokens = [tail], OVERLAP
        buf.append(s)
        buf_tokens += n
    if buf:
        chunks.append("\n\n".join(buf))
    return chunks

out = []
for md_file in pathlib.Path("data/corpus").rglob("*.md"):
    text = md_file.read_text()
    src_url = re.search(r"source_url:\s*(\S+)", text)
    for i, chunk in enumerate(pack(split_h2(text))):
        out.append({
            "id":         f"{md_file.stem}-{i}",
            "source":     md_file.parts[-2],
            "source_url": src_url.group(1) if src_url else None,
            "text":       chunk,
            "tokens":     len(enc.encode(chunk)),
        })

pathlib.Path("data/chunks.jsonl").write_text("\n".join(json.dumps(c) for c in out))

On the full corpus (Adobe + Hyvä + Mage-OS + internal) this produces roughly 12,800 chunks averaging 470 tokens. Total token count to embed: ~6 million.

3. Embeddings run on OpenAI's cheapest model and store in pgvector.

OpenAI's text-embedding-3-small produces 1,536-dimension vectors at $0.02 per million input tokens.^[3] Embedding the entire corpus once costs $0.12. Re-embedding on a weekly doc refresh costs the same. pgvector is the storage choice because Magento already runs PostgreSQL-compatible workloads on the same VPS and a separate vector database adds operational overhead with no measurable retrieval-quality gain at this scale.^[2]

-- migrations/001_pgvector.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE panth_rag_chunks (
    id           text PRIMARY KEY,
    source       text NOT NULL,
    source_url   text,
    text         text NOT NULL,
    tokens       int  NOT NULL,
    embedding    vector(1536) NOT NULL,
    indexed_at   timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX panth_rag_chunks_hnsw
    ON panth_rag_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX panth_rag_chunks_source ON panth_rag_chunks (source);

# scripts/embed_chunks.py
import json, os, pathlib
from openai import OpenAI
import psycopg

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
conn = psycopg.connect(os.environ["PG_DSN"])

BATCH = 100
chunks = [json.loads(l) for l in pathlib.Path("data/chunks.jsonl").read_text().splitlines()]

with conn.cursor() as cur:
    for i in range(0, len(chunks), BATCH):
        batch = chunks[i:i + BATCH]
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=[c["text"] for c in batch],
        )
        rows = [
            (c["id"], c["source"], c["source_url"], c["text"], c["tokens"], d.embedding)
            for c, d in zip(batch, resp.data)
        ]
        cur.executemany(
            """INSERT INTO panth_rag_chunks (id, source, source_url, text, tokens, embedding)
               VALUES (%s, %s, %s, %s, %s, %s)
               ON CONFLICT (id) DO UPDATE SET
                 text = EXCLUDED.text, embedding = EXCLUDED.embedding, indexed_at = now()""",
            rows,
        )
        conn.commit()
        print(f"upserted {i + len(batch)} / {len(chunks)}")

4. LlamaIndex vs LangChain: pick by the shape of the workload.

Both frameworks wrap the same underlying pieces (loaders, splitters, vector stores, LLM clients). The difference is what they make easy. LlamaIndex is optimized around a single query-engine abstraction: question in, cited answer out. LangChain is optimized around composing multi-step agents that call tools, including retrieval as one tool among many. Pick by the workload, not by the marketing.

Capability	LlamaIndex	LangChain
Single-question retrieval + answer	One `QueryEngine` object; ~15 lines to ship	Doable, but boilerplate-heavy for the same outcome
Multi-step agent (plan, retrieve, run tool, retrieve again)	Possible via `AgentRunner`, less mature	Native: `LangGraph` is the right primitive
Streaming partial tokens to the client	Yes, via `response_gen`	Yes, via `astream_events`
Vector store coverage	40+ integrations including pgvector	50+ integrations including pgvector
Eval hooks	Built-in `evaluation` module	External via `ragas` or `LangSmith`
Best fit on this project	Yes: query is single-step	Reach for it the day "and run the Magento CLI" lands on the spec

The Magento dev assistant is a single-question, single-answer workload. LlamaIndex is the right primitive. The day the team asks the assistant to also run bin/magento module:status and read the output back, LangChain becomes the right primitive, but that is a different product and worth the rewrite when it arrives.

5. The Magento REST controller turns the pipeline into one HTTP endpoint.

Magento exposes the pipeline through a custom REST endpoint /rest/V1/panth-rag/query. The endpoint embeds the question, runs a cosine-similarity top-5 against pgvector, packs the retrieved chunks into a Claude prompt with explicit citation markers, and streams the response back. The 800 ms p95 latency budget exists because the planned IDE plugin treats this endpoint like an autocomplete: anything slower feels broken.

<?php
// app/code/Panth/Rag/Api/QueryInterface.php
namespace Panth\Rag\Api;

interface QueryInterface
{
    /**
     * @param string $question
     * @param int|null $topK
     * @return string JSON: {answer, citations[], latencyMs}
     */
    public function execute(string $question, ?int $topK = 5): string;
}

<?php
// app/code/Panth/Rag/Model/Query.php
namespace Panth\Rag\Model;

use Panth\Rag\Api\QueryInterface;
use Panth\Rag\Service\Embedder;
use Panth\Rag\Service\VectorStore;
use Panth\Rag\Service\Anthropic;

class Query implements QueryInterface
{
    public function __construct(
        private Embedder $embedder,
        private VectorStore $store,
        private Anthropic $llm
    ) {}

    public function execute(string $question, ?int $topK = 5): string
    {
        $started = microtime(true);
        $vector  = $this->embedder->embed($question);
        $hits    = $this->store->cosineTopK($vector, $topK ?? 5);

        $context = '';
        foreach ($hits as $i => $hit) {
            $context .= "[{$i}] {$hit['source_url']}\n{$hit['text']}\n\n";
        }

        $answer = $this->llm->complete(
            system: 'You are a Magento 2.4.4-2.4.9 developer assistant. '
                  . 'Answer ONLY from the provided context. Cite sources as [0], [1], [2]. '
                  . 'If the context is insufficient, say so.',
            user: "Question: {$question}\n\nContext:\n{$context}",
            maxTokens: 400,
        );

        return json_encode([
            'answer'    => $answer,
            'citations' => array_column($hits, 'source_url'),
            'latencyMs' => (int)((microtime(true) - $started) * 1000),
        ]);
    }
}

{
  "routes": {
    "/V1/panth-rag/query": {
      "POST": {
        "service": {
          "class": "Panth\\Rag\\Api\\QueryInterface",
          "method": "execute"
        },
        "resources": ["Panth_Rag::query"]
      }
    }
  }
}

The cosine-similarity query against pgvector is one line of SQL: the HNSW index does the heavy lifting and returns the top-5 in roughly 6 ms on the test corpus.

-- Service\VectorStore::cosineTopK
SELECT id, source, source_url, text, 1 - (embedding <=> $1::vector) AS similarity
FROM panth_rag_chunks
ORDER BY embedding <=> $1::vector
LIMIT $2;

Latency budget: where the 800 ms goes

Question embedding (OpenAI text-embedding-3-small): 90-140 ms.
pgvector cosine top-5 with HNSW: 4-8 ms.
Claude completion with 5 chunks of context (~2,500 input tokens, ~400 output): 500-650 ms.
Network + Magento controller overhead: 20-40 ms.
Total p95 observed across 200 questions: 760 ms. Within budget.

6. The eval framework runs `ragas` on 50 hand-curated Magento questions.

An untested RAG pipeline silently degrades: a doc refresh changes chunk IDs, a new embedding model shifts cosine distances, and answers slowly become wrong. ragas measures three things on a fixed evaluation set: context precision (were the retrieved chunks relevant), context recall (did we miss any relevant chunks), and answer faithfulness (did the model invent anything that was not in the retrieved chunks).^[4]

# scripts/eval_rag.py
import json, os
import httpx
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
)

QUESTIONS = json.loads(open("eval/magento_50.json").read())

def run(q):
    r = httpx.post(
        "https://kishansavaliya.com/rest/V1/panth-rag/query",
        json={"question": q["question"]},
        headers={"Authorization": f"Bearer {os.environ['MAGENTO_TOKEN']}"},
        timeout=10,
    ).json()
    return {
        "question":     q["question"],
        "answer":       r["answer"],
        "contexts":     r.get("raw_contexts", []),
        "ground_truth": q["ground_truth"],
    }

rows = [run(q) for q in QUESTIONS]
ds = Dataset.from_list(rows)
result = evaluate(ds, metrics=[context_precision, context_recall, faithfulness])
print(result)

thresholds = {"faithfulness": 0.85, "context_recall": 0.80, "context_precision": 0.75}
for metric, floor in thresholds.items():
    if result[metric] < floor:
        raise SystemExit(f"FAIL: {metric}={result[metric]:.3f} below {floor}")

The 50-question evaluation set is the most valuable artifact in the whole pipeline. Sample entries from eval/magento_50.json:

[
  {
    "question": "How do I register a custom REST API endpoint in Magento 2.4.9?",
    "ground_truth": "Declare the route in etc/webapi.xml with service class and method, then bind ACL resources in etc/acl.xml."
  },
  {
    "question": "Where does Hyva expect Alpine.js component registration files to live?",
    "ground_truth": "Inside Magento_Theme/web/js, registered via the Hyva ViewModel and loaded through the requirejs-config.js stub Hyva ships."
  },
  {
    "question": "What is the recommended way to override a vendor template under Hyva without forking the module?",
    "ground_truth": "Place the override in app/design/frontend/<Vendor>/<theme>/<Module_Name>/templates/..., never edit src/vendor."
  }
]

Run the eval in CI on every doc refresh. A faithfulness score under 0.85 fails the build and the new index is not promoted to production.

7. The operational gotchas worth knowing.

Six things broke on the way to shipping, and each one became a permanent rule.

Stale doc cache Adobe DevDocs ships under a fast-changing path structure. Re-crawl weekly and diff chunk IDs; anything deleted upstream must be deleted from pgvector or stale answers leak in.
Embedding model lock-in : vectors from text-embedding-3-small are not comparable to vectors from text-embedding-3-large or all-MiniLM-L6-v2. Switching models requires a full re-embed.
Chunk overlap matters more than chunk size : moving overlap from 0 to 50 tokens improved context recall by 11 points on the eval set. Moving from 500 to 700 tokens of target size moved nothing.
Citations are mandatory The system prompt instructs Claude to cite as [0], [1], [2]. Without that, faithfulness scores drop and the assistant becomes a confident hallucinator instead of a doc lookup.
pgvector + Magento on the same host PostgreSQL on the Magento VPS adds ~120 MB resident memory under the test load. Cheaper and lower latency than a separate vector DB.
Prompt injection in docs community docs occasionally contain "ignore previous instructions"-style tokens left as jokes. Strip with a regex pass during chunking; otherwise a single bad chunk in the top-5 can derail an answer.

8. What this is not.

This is not a chatbot. It is a retrieval API with a thin LLM formatter on top. There is no conversation memory, no multi-turn refinement, and no agent that runs Magento CLI commands. Those are separate products with separate failure modes: adding them to a doc-retrieval endpoint is the fastest way to make both worse. The dev assistant on kishansavaliya.com answers one question per request, cites its sources, and stays out of the way. When the team needs an agent, that gets a different endpoint with a different LangChain-shaped architecture.

9. Where this is heading next.

Three upgrades are queued for the next quarter on Kishan Savaliya's roadmap.

Per-project context layers : every Magento project gets its own private chunk namespace seeded from its repo's CLAUDE.md plus its app/code/Panth/* custom modules. Question routing decides whether to query the public corpus, the private corpus, or both.
Reranking with a cross-encoder : the top-5 from cosine similarity get re-scored by a small cross-encoder model (bge-reranker-v2-m3 or similar) before going into the Claude prompt. Expect +5 to +8 points on context precision.
IDE plugin : a thin VS Code extension that calls the same REST endpoint from the editor sidebar. The 800 ms budget exists for this client; the web UI is a fallback.

Decision table: when to ship a RAG dev assistant.

Team shape	Doc footprint	Ship RAG?	Why
Solo developer, 1 client	< 50 internal pages	No	Grep + bookmarks is faster than maintaining a pipeline.
2-4 developers, 1-3 stores	~200 internal pages + DevDocs	Worth piloting	Onboarding new devs is the biggest payoff.
4+ developers, multiple stores, multiple stacks (Luma + Hyvä + PWA)	500+ internal pages + DevDocs + Hyvä	Yes	Cross-stack questions are the hard ones. RAG closes them.
Agency, 10+ active projects	Per-project runbooks + shared knowledge base	Yes, with namespacing	Each project needs its own chunk namespace plus a shared one.

FAQ

Why pgvector instead of a dedicated vector database like Pinecone or Weaviate?

The corpus is 12,800 chunks. At that scale the HNSW index in pgvector returns top-5 in single-digit milliseconds, well inside the latency budget, and the operational overhead of a managed vector DB (separate auth, separate backups, separate billing) buys nothing. The break-even point where a dedicated vector DB starts winning is somewhere past 5 million chunks.

Can the same pipeline run on Claude embeddings instead of OpenAI?

Anthropic does not ship a hosted embedding endpoint as of this writing: embeddings come from OpenAI, Cohere, or open-source models. The completion side, however, is one client swap. The current pipeline uses OpenAI for embeddings and Claude for completion specifically because Claude's faithfulness on cited-context prompts measured higher on the eval set.

How do you keep the corpus from going stale?

A weekly cron runs the crawler, then the chunker, then the embedder with upsert semantics. Chunks whose hashes match the existing row are skipped. New chunks are inserted; deleted upstream pages get their chunks removed by a final reconciliation pass that compares the latest chunk-ID set to what is in pgvector.

Does this break Magento upgrades on 2.4.4-2.4.9?

No. The Magento side is a self-contained module, one REST controller, one service class, one ACL entry, that does not patch any core. The Python pipeline runs as a separate cron user and writes to a separate PostgreSQL database that Magento never touches. Adobe security patches do not interact with any of it.

How big does the team need to be before this is worth the engineering?

Two senior Magento developers and 200+ pages of internal docs is the floor. Below that, grep and bookmarks are faster than the pipeline you have to maintain. Above that, onboarding new developers is the biggest single payback: a new hire who can ask the assistant "how do we handle B2B price approval workflows in this repo" ramps in days instead of weeks.

What does the whole thing cost to run per month?

OpenAI embeddings on a weekly refresh: ~$0.50. Claude completions at 200 queries per day across the team: ~$15 to $30 depending on context length. pgvector on the existing Magento VPS: $0. Total: under $35 per month.

Why LlamaIndex over LangChain on this specific build?

The workload is single-step: take a question, retrieve, answer with citations. LlamaIndex's QueryEngine abstraction ships that in ~15 lines and exposes the right hooks for evaluation. LangChain shines when the workload becomes multi-step (retrieve, decide to call a tool, retrieve again): that day will come, and LangChain will get the rewrite.

How is faithfulness actually measured by ragas?

ragas.metrics.faithfulness decomposes the generated answer into atomic claims and uses a separate LLM judge to check whether each claim is supported by the retrieved context. The output is a 0-to-1 score that is the fraction of claims grounded in context. The build fails if it drops below 0.85 on the 50-question evaluation set.

Citations

Want a RAG dev assistant shipped over your team's Magento documentation?

I scope and ship the full pipeline, crawler, chunker, pgvector store, Magento REST endpoint, and the ragas eval harness, on Magento 2.4.4-2.4.9 with citation logging, weekly doc refresh cron, and 30 days of patches. Fixed quote from $499 audit · $2,499 sprint · ~36h @ $25/hr. See hire me.

Tagged #Claude #OpenAI API #AI Pair Programming #LangChain #RAG #Vector Store

Keep reading

Generative Engine Optimization (GEO) for Magento: Get Cited by AI Search

GEO is how you get Magento product, category, and brand pages cited inside ChatGPT, Perplexity, and Google AI Overviews. A concrete, honest developer playbook.

Jun 8, 2026
Answer Engine Optimization (AEO) for Magento: Winning Snippets, PAA, and Voice Answers

Answer Engine Optimization for Magento is about being the single extracted answer across search and AI engines. Here is how to win snippets, PAA, and voice answers, plus the honest reality of FAQ rich results in 2026.

Jun 8, 2026
Google AI Mode Is Here (May 2026): The SEO Playbook You Need to Rewrite, Now

AI Overviews now show up on 48% of Google queries, and 93% of AI Mode sessions end without a single click off the page. The bar moved from 'rank a link' to 'be cited inside the answer.' Here is the May-2026 playbook: the six ranking factors that actually drive citations, the llms.txt + JSON-LD stack to deploy this week, the bot-allow rules every site needs, and the one Magento-specific pattern that turns AI Mode from a traffic loss into a brand-mention pipeline.

May 29, 2026