Data Science

Baseline Enterprise RAG, From PDF to Highlighted Answer

May 30, 2026


The fastest way to understand what RAG is, is to build the smallest version that actually works, run it on a real document, and look closely at what just happened.

That’s this article. A few hundred lines of Python (no vector database, no framework, no agents) running on the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page), returning a sourced answer with the exact source lines highlighted on the page.

Then we walk back through each block and ask the question it naturally raises. Each question is what a later article develops.

The minimal pipeline is the smallest amount of code that respects the four bricks and produces a verifiable answer. Each later article adds capability the team needs after a specific failure on real documents, not because the architecture needed more layers.

1. What we’re building

The pipeline has four bricks (Part II goes into each in detail) plus a final, optional rendering step. Each brick says what it takes in and what it gives back; what we pass from one brick to the next is what we save.

Document parsing takes a PDF path and returns line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus page_df. The minimal version holds both in memory; bigger systems persist them (Article 23 covers when to move to a database).
Question parsing turns the user’s question into a ParsedQuestion carrying the normalized question plus a short list of checked keywords. It stays narrow on purpose: no retrieval logic here, no question embedding.
Retrieval consumes the ParsedQuestion and emits top-k page numbers (and, when needed, the matching line numbers within those pages). Keeping the handoff to page numbers only keeps it small; the next step rebuilds the filtered lines from line_df on the spot. The question embedding lives in this brick because it depends on the corpus index.
Generation brings together the question, line_df, and the retrieved page numbers, and produces an AnswerWithEvidence: a typed JSON carrying the answer, the evidence span (start_page, start_line, end_page, end_line), a confidence, a justification, the exact quotes from the source, and any caveats. The full JSON is worth saving for evaluation, audit, and replay.
PDF annotation is optional. Given the source PDF and the evidence span, it writes an annotated PDF with rectangles drawn around the cited lines. A CLI tool, a batch job, or an API client can skip it; the answer with citations is already complete after generation.

The first four are the four bricks (Article 5 develops document parsing, Article 6 question parsing, Article 7 retrieval, Article 8 generation). PDF annotation is the rendering step, not a brick in itself.

The baseline RAG pipeline, end to end – Image by author

A PDF and a question go in. Each brick turns its input into something more structured: document parsing turns the PDF into rows, question parsing turns the question into search-ready keywords, retrieval cuts the rows down to a few page numbers, generation produces a typed answer, and PDF annotation draws the cited lines back onto the source. What comes out is not a chatbot bubble. It’s a sourced JSON answer plus an annotated PDF you can open and check.

The dependencies are minimal:

pymupdf parses PDFs into text plus position data; the bounding boxes it returns are what we use to highlight the answer back on the source page.
openai is the LLM client; via base_url the same library serves Azure, OpenRouter, Ollama, or any compatible endpoint.
pandas holds the document as a DataFrame, the format every parsing and retrieval step uses.
pydantic defines the answer schema that forces structured JSON with citations.

No vector database, no orchestration framework, no specialized RAG library. Later articles look at when those libraries’ helpers become useful, and when they get in the way of seeing what’s happening.

“For a 15-page paper, the LLM can read the whole thing. Why bother with retrieval?” Fair point on this one document. We use the paper to teach the method, not to save tokens on these 15 pages. The objection often points to the Needle in a Haystack benchmark (Kamradt, 2023), where frontier models score near-perfectly retrieving a single verbatim sentence from a 1M-token context.

That benchmark is research, not practice. A needle is one isolated, verbatim fact, while enterprise questions aggregate (“every contract whose deductible exceeds €5,000”), compare (“clause 12 across these three policies”), or summarize across many passages. None of those is a single sentence to find.

Two more practical reasons keep retrieval in the loop. Enterprise documents are often long:

a 300-page insurance contract,
a 500-page regulatory filing,
a multi-volume technical specification.

Sending the whole thing to the LLM costs real money on every question, every rerun, every user, and dilutes its attention across irrelevant pages.

And the same question runs across hundreds or thousands of documents at once:

“find every contract that excludes earthquake damage”,
“summarize this year’s regulatory changes across all filings”.

At that scale, “throw it all in” stops being a strategy. Retrieval is what makes the pipeline survive both moves: from one short paper to one long contract, and from one document to a whole corpus.

2. The four bricks, and a PDF highlight

Each step declares its inputs and outputs, and the steps are independent. The output of step N is the input of step N+1, saved as a named DataFrame so any step can be re-run on its own against the saved output of the previous one. In the AI-coding era, an assistant told to “fix retrieval” can quietly modify the question parser when it should have stayed untouched. Independent modules are how you work confidently on one piece without breaking the rest.

The setup chunks below load the dependencies alongside the OpenAI client.

Every brick that talks to a model needs a configured client. The series uses OpenAI’s Python SDK; any provider that exposes an OpenAI-compatible endpoint (Azure OpenAI, vLLM, llama.cpp’s --api-server, …) drops in by changing base_url and the model name.

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url=os.getenv("BASE_URL"),
)
model_chat = os.getenv("MODEL_CHAT", "gpt-4.1")
model_embed = os.getenv("MODEL_EMBED", "text-embedding-3-small")

2.1 Document parsing

We extract every text line of the PDF along with its position on the page. The output is a DataFrame where each row is one line, with page_num, line_num, the text itself, and the four bounding-box coordinates x0, y0, x1, y1.

In: a PDF path.

Out: line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus a page_df we’ll build in section 2.3.

The bounding boxes matter: they’re what we use to draw highlights on the source PDF at the end.

import fitz  # PyMuPDF
import pandas as pd

def fitz_pdf_to_line_df(file_path):
    doc = fitz.open(file_path)
    data = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        blocks = page.get_text("dict").get("blocks", [])
        line_num = 0  # line numbering restarts on every page
        for block in blocks:
            if block.get("type") != 0:  # type 0 = text block; skip images
                continue
            for line in block.get("lines", []):
                spans = line.get("spans", [])
                if not spans:
                    continue
                text = "".join(s["text"] for s in spans)
                # Union of the span boxes gives the bounding box of the whole line.
                rect = fitz.Rect(spans[0]["bbox"])
                for span in spans[1:]:
                    rect |= fitz.Rect(span["bbox"])
                data.append({
                    "page_num": page_num + 1,
                    "line_num": line_num + 1,
                    "text": text,
                    "x0": float(rect.x0), "y0": float(rect.y0),
                    "x1": float(rect.x1), "y1": float(rect.y1),
                })
                line_num += 1
    return pd.DataFrame(data)

Running line_df = fitz_pdf_to_line_df(pdf_path) on the Attention paper returns 1048 lines across 15 pages.

*First 5 rows of line_df with page, line number, text, and bounding box – Image by author*

The paper, turned into rows. Each line is one row, with its text and the four numbers that locate it on the page. The x0, y0, x1, y1 columns don’t mean much yet; in section 2.5 they’re what we use to draw rectangles on the source PDF, exactly over the lines the model cited.

This DataFrame, line_df, is the core data structure of the rest of the series. Article 5 introduces a richer relational model around it (line_df, chunk_df, toc_df, page_df, image_df).

What this parser doesn’t do: detect tables (Table 1 page 4, Table 3 page 9 flatten into plain lines), reconstruct headings, footnotes, cross-references, or handle multi-column layouts. None of this matters for the question we ask here. For other questions on the same paper, it will. Article 5 covers parsing in full.

2.2 Question parsing

Before the question goes to retrieval, we run it through a tiny LLM call. The goal is to extract the keywords most useful for searching the document: short phrases the document is likely to use, not necessarily the literal words of the question.

In: a text question.

Out: a ParsedQuestion holding the normalized question and a short list of checked keywords.

This step doesn’t know about retrieval. It doesn’t compute the question embedding either. That one is tied to the corpus index and lives in section 2.3. Keep that line clean and you can swap the embedding model or add a hybrid retriever tomorrow without touching question parsing.

Why bother on a minimal pipeline? Two reasons:

You can explain why retrieval picked what it picked. When the system answers wrong, we can see whether the keywords were off (question-parsing problem) or the right keywords landed on the wrong page (retrieval problem). Without question parsing, retrieval is a black box.
The question is a real input, just like the document. Section 2.1 parsed the document into line_df. This subsection parses the question into ParsedQuestionMinimal. Both inputs deserve to be parsed before they hit the search step. Article 6 builds the richer brick (parse_question, with answer shape, scope filters, decomposition, …).

On the question “What are the options mentioned for positional encoding?”, the call parsed_question = get_keywords_from_question(question, client=client) returns parsed_question.keywords = ['positional encoding'].

question = "What are the options mentioned for positional encoding?"
parsed_question = get_keywords_from_question(question, client=client)
print(parsed_question.keywords)

['positional encoding']

The LLM produces a single, literal phrase like ['positional encoding']. That’s deliberate. An earlier draft of this prompt asked for “3 to 5 short keywords useful for searching”, and the LLM happily filled the quota with paraphrases (positional encoding options, types of positional encoding, transformer positional encoding). None of those are written in the document. Only positional encoding is. Substring matching is strict: a single missing word kills the match. The minimal version asks the LLM to do less (extract the literal noun phrase, drop the question framing) and trusts the next block to do the rest.

What this minimal version doesn’t do:

detect an answer_shape (Q&A vs summarization)
decompose compound questions
pull from a domain glossary
attach retrieval hints

All covered in Article 6, under the richer parse_question brick. Here we keep two fields, corrected_question and keywords, the smallest version that makes the brick visible.

Note: overriding the system prompt. get_keywords_from_question exposes the system prompt as a kwarg with KEYWORDS_PROMPT as default. To test a variant (different domain, stricter rules, extra examples), pass system_prompt=... at the call site. No edit to the function. Same pattern for every LLM helper in docintel (llm_answer_with_evidence exposes both system_prompt and user_template). Below: the same call, run twice on a contract-style question. First with the research-paper default, which stays generic. Then with a contract-domain prompt, which picks up insurance vocabulary like exclusions, deductible.


demo_question = "Are earthquakes excluded from coverage?"

# Default: research-paper prompt.
parsed_question_default = get_keywords_from_question(demo_question, client=client)
print("Default (research-paper):", parsed_question_default.keywords)

# Override: insurance / legal contract prompt.
contract_prompt = (
    "Extract 1 to 3 short keywords from the user question for searching an "
    "insurance contract or legal policy. Prefer literal terms the contract is "
    "likely to use: clauses, exclusions, named perils, deductibles, caps. Drop "
    "question framing words. Output 1 to 3 keywords."
)
parsed_question_contract = get_keywords_from_question(
    demo_question, system_prompt=contract_prompt, client=client,
)
print("Contract prompt:        ", parsed_question_contract.keywords)

Default (research-paper): ['earthquakes', 'coverage']
Contract prompt:         ['earthquakes', 'exclusions', 'coverage']

2.3 Retrieval

Sending all 1048 lines to the LLM works on a paper this size but doesn’t scale and dilutes the model’s attention. We cut the document down to the few pages most likely to contain the answer.

In: the checked keywords (and/or the normalized question, depending on the method) from section 2.2.

Out: the top-k page numbers, plus optionally the matching line numbers within those pages.

The question embedding is computed here, not in section 2.2, because an embedding only makes sense relative to the index it was built on. Same logic for any hybrid scoring or BM25 statistics.

The standard answer in 2024 RAG tutorials is embeddings: turn each page into a vector, rank by cosine similarity. Article 2 is dedicated to them. For the minimal version, we deliberately don’t, for one reason.

Embeddings are opaque. Cosine similarity returns a number like 0.7798 and asks the user to trust that “page 6 is relevant to the question”. Show that score to a domain expert, a product owner, or a manager: nobody understands what 0.78 means, or why it’s higher than 0.65. Developers may argue they understand it (“dot product of normalized vectors”). They understand the math, not the relevance. Asked why this specific page scored 0.7798 against this specific question, they shrug and point at the model.

In an enterprise context, retrieval is the step users question the most. Why did the system look at this page and not that one? You have to explain it. So the minimal version uses something we can read with our own eyes: keyword matching. Section 2.2 pulled the keywords; we score each page by how many of those keywords appear in it, and keep the top three.

Where we search vs what we return: both pages here. Real retrieval has two levels. The anchor is where the keyword or embedding actually hits (a line, a sentence). The context is what we hand to generation (the lines around it, the page). We search small, we return big. Here we use the page for both. That works on an academic paper where each page is roughly one idea. Article 7 separates the two levels for long contracts, multi-column reports, table-heavy documents.
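
To make the anchor/context split concrete, here is a minimal sketch in plain Python. The helper name context_window, the radius parameter, and the toy lines are assumptions made for illustration; they are not part of the article's pipeline, which uses the whole page as context.

```python
def context_window(lines, keyword, radius=2):
    """Search small, return big.

    The anchor is the first line containing the keyword; the context is
    that line plus `radius` lines on each side.
    Returns (anchor_index, context_lines), or (None, []) when nothing matches.
    """
    needle = keyword.lower()
    for i, text in enumerate(lines):
        if needle in text.lower():
            start = max(0, i - radius)
            end = min(len(lines), i + radius + 1)
            return i, lines[start:end]
    return None, []


# Toy lines, loosely modeled on a page of the paper (illustrative only).
lines = [
    "3.4 Embeddings and Softmax",
    "We use learned embeddings to convert tokens to vectors.",
    "3.5 Positional Encoding",
    "Since our model contains no recurrence, we inject position information.",
    "We use sine and cosine functions of different frequencies.",
    "4 Why Self-Attention",
]

anchor, context = context_window(lines, "positional encoding", radius=1)
print(anchor)   # index of the line where the keyword hits
print(context)  # the hit plus one line before and after
```

The hit is one line; what gets handed downstream is the window around it. Widening radius until it covers the page gives back the article's page-level behavior.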

page_df = build_page_df(line_df) collapses the 1048 lines into 15 pages, one row per page.

*First 5 rows of page_df, one row per page with the full text concatenated – Image by author*

2.3.a Embeddings + cosine similarity

Embed every page (one call per page), embed the question, compute cosine similarity, keep the top-k. The output: a number like 0.7798 per page. Look at the scores below: can you tell why a page made the top three? Could you explain the ranking to a domain expert? That’s the opaque-score problem the article opens with.

*Top three pages by cosine similarity. Precise scores, opaque ranking – Image by author*

Three numbers, all very close to each other (0.7843, 0.7798, 0.7728). Can you say why page 9 beats page 6? The text preview makes it obvious: page 9 is the Variations on the Transformer architecture table, page 5 is about output values and concatenation, page 6 is the Maximum path lengths table. The page that actually answers the question, section 3.5 Positional Encoding, sits on page 6 and ranks last in the top three. The unrelated page 5 ranks second. The scores look precise, but the ranking has no story behind it: there is no token to point at, no phrase to defend, just a dot product on two black-box vectors. Embeddings work in many cases, and Article 2 unpacks where this score comes from. But the score itself never becomes interpretable, and for the rest of this article we use a retriever you can read with your own eyes.

2.3.b Keyword matching

For each page, count how many of parsed_question.keywords appear in it (case-insensitive substring match). Drop pages with zero matches; keep the top-k by match count. The output table below carries the exact matched_keywords per page, so anyone can read it and see why a page was picked.

retrieve_pages(page_df, line_df, parsed_question.keywords, top_k=3) returns the top three pages by keyword count plus the filtered lines: 314 lines kept from pages 6, 9, 7.

*Top three keyword-matched pages, with the matched terms shown per page – Image by author*

Three pages, ranked by match count, with the exact matches laid out. Pages 6, 8, and 9 each contain the literal phrase positional encoding; page 6 holds Section 3.5 Positional Encoding with the exact answer. Anyone reading the table can verify the result by hand: search the source for positional encoding and you’ll find these three pages.

Two design choices:

Drop pages with zero matches. A retrieval that says “nothing matches” is more useful than one that pads with three random pages. The schema’s null path (next subsection) handles the empty case cleanly.
We don’t break ties. When pages tie at the same match count, the order is whatever pandas’ nlargest returns. The downstream LLM sees the lines from all tied pages in document order and decides.

From 1048 lines to about 300, and we know the right material is in there.
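
The keyword scoring described in this subsection can be sketched in a few lines of plain Python. This is a stand-in under stated assumptions, not the article's retrieve_pages: the name retrieve_pages_by_keywords is hypothetical, pages are a list of (page_num, text) tuples instead of page_df, and the page texts are toy data. One difference to note: ties here keep document order because Python's sort is stable, whereas the article leaves tie order to pandas' nlargest.

```python
def retrieve_pages_by_keywords(pages, keywords, top_k=3):
    """Score each page by how many keywords it contains.

    pages: list of (page_num, text) tuples.
    Match is a case-insensitive substring test, one point per keyword found.
    Pages with zero matches are dropped; the top_k by match count are returned.
    """
    scored = []
    for page_num, text in pages:
        haystack = text.lower()
        matched = [kw for kw in keywords if kw.lower() in haystack]
        if matched:  # drop pages with zero matches
            scored.append({
                "page_num": page_num,
                "match_count": len(matched),
                "matched_keywords": matched,
            })
    # Stable sort: pages with equal counts stay in document order.
    scored.sort(key=lambda row: row["match_count"], reverse=True)
    return scored[:top_k]


# Toy pages (illustrative text, not the real paper).
pages = [
    (1, "Attention mechanisms relate positions of a sequence."),
    (2, "We add positional encoding to the input embeddings."),
    (3, "Label smoothing hurts perplexity."),
    (4, "Learned positional encoding versus sinusoidal positional encoding."),
]

hits = retrieve_pages_by_keywords(pages, ["positional encoding"])
for row in hits:
    print(row["page_num"], row["match_count"], row["matched_keywords"])
```

Every row carries its matched_keywords, which is the property the section cares about: the reason a page was picked is readable in the output itself.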

import numpy as np

# Embedding-based variant (section 2.3.a). get_embedding is a helper that
# calls the embedding model and returns one vector per input text.
def cosine_sim_matrix(query_vec, doc_matrix):
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

def retrieve_pages(page_df, line_df, question, top_k=3):
    q_vec = np.asarray(get_embedding(question), dtype=np.float32)
    doc_matrix = np.vstack(page_df["embedding"].values)
    sims = cosine_sim_matrix(q_vec, doc_matrix)

    scored = page_df.copy()
    scored["similarity"] = sims
    retrieved_pages_df = scored.nlargest(top_k, "similarity")

    # Keep only the lines that belong to the retrieved pages.
    kept_pages = retrieved_pages_df["page_num"].tolist()
    filtered_line_df = line_df[line_df["page_num"].isin(kept_pages)]
    return retrieved_pages_df, filtered_line_df

Note: the “split into individual words” trap. A natural reflex when the multi-word phrases don’t match: split them and search for the individual tokens. Below we expand every keyword into its words, deduplicate, then re-run retrieval. We get matches, and we also get false positives, because words like encoding, transformer, network appear all over the document in unrelated contexts.

Now every page in the top three matches multiple tokens, but look at which tokens. Words like encoding and transformer cover most of the paper. Pages about layer encoding or encoder stacks look as relevant as the page that actually answers the question. Splitting trades one failure (zero matches) for another (false positives). Article 7 covers the real fixes (synonym expansion via a dictionary, hybrid scoring); for now, keep the phrase whole.
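
A toy reproduction of the trap, in plain Python. The helper pages_matching and the three page texts are made up for illustration; only the mechanism (phrase match vs token match) comes from the note above.

```python
def pages_matching(pages, terms):
    """Return the page numbers whose text contains at least one term
    (case-insensitive substring match)."""
    return [
        num for num, text in pages
        if any(term.lower() in text.lower() for term in terms)
    ]


# Toy pages (illustrative text only).
pages = [
    (6, "We add positional encodings to the input embeddings."),
    (7, "Sentences were encoded using byte-pair encoding."),
    (9, "Variations on the Transformer architecture."),
]

phrase = ["positional encoding"]
# Split every keyword into its words and deduplicate.
tokens = sorted({tok for kw in phrase for tok in kw.split()})

phrase_hits = pages_matching(pages, phrase)  # whole phrase: only the right page
token_hits = pages_matching(pages, tokens)   # single words: a false positive joins

print("phrase:", phrase_hits)
print("tokens:", tokens, "->", token_hits)
```

The page about byte-pair encoding matches the token encoding and nothing else, yet it now sits in the result next to the page that actually answers the question.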

2.3.c A harder question: where each retriever breaks

Same pipeline, a different question. We ask about the value of epsilon used in label smoothing. The answer is on page 8 of the paper, written as ε_ls = 0.1 (Greek letter ε, never the English word epsilon). Watch what each retriever does.

question_2 = "What is the value of epsilon used in label smoothing?"
parsed_question_2 = get_keywords_from_question(question_2, client=client)
print("Keywords:", parsed_question_2.keywords)

Keywords: ['epsilon', 'label smoothing']

Two failures of different shapes:

Embeddings rank pages by topical proximity. The right page (page 8, where ε_ls = 0.1 lives) may or may not be in the top three. Pages dense in math notation come up even when they’re unrelated.
Keywords are blind to symbols. The LLM emits epsilon, label smoothing, and so on. The document writes the Greek letter ε. Substring match returns zero on anything that mentions epsilon by symbol only. The page that contains the answer is invisible to the keyword retriever.

Section 4.4 picks this up as the bridge to Article 2 (Embeddings handle synonyms and surface variation) and Article 6 (richer Question Parsing pulls in alternatives like the Greek letter).
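
The symbol blindness is easy to reproduce on a single line. The sketch below also shows one possible shape for the "alternatives" idea: a small alias table consulted at match time. The ALIASES dictionary and the keyword_hits helper are hypothetical, written only for this illustration; they are not the mechanism Article 6 describes.

```python
# Hypothetical alias table: the English word mapped to the symbols a paper may use.
ALIASES = {"epsilon": ["ε", "ϵ"]}


def keyword_hits(text, keywords, aliases=None):
    """Return the keywords found in text (case-insensitive substring match).

    When an alias table is given, a keyword also counts as found if any of
    its aliases appears in the text.
    """
    aliases = aliases or {}
    haystack = text.lower()
    hits = []
    for kw in keywords:
        variants = [kw] + aliases.get(kw, [])
        if any(v.lower() in haystack for v in variants):
            hits.append(kw)
    return hits


line = "ε_ls = 0.1"  # how the paper writes the value: symbol only

plain = keyword_hits(line, ["epsilon"])
expanded = keyword_hits(line, ["epsilon"], ALIASES)

print("plain:   ", plain)     # the word 'epsilon' is never written out
print("expanded:", expanded)  # the alias ε makes the line visible again
```

The match stays readable either way: the table says which alias stood in for which keyword, so the explanation for a hit is still something a person can check by hand.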

2.4 Generation

We send the retrieved lines to the LLM with the question, formatted as a tab-separated block where page_num and line_num sit next to each line. That format gives the LLM the exact coordinates it needs to cite.

In: the original question, line_df, and the retrieved page numbers from section 2.3.

Out: an AnswerWithEvidence, a structured JSON with the answer, the evidence span (start_page_num, start_line_num, end_page_num, end_line_num), a confidence, a justification, the exact quotes, and any caveats.

from pydantic import BaseModel, Field

class AnswerWithEvidence(BaseModel):
    answer: str = Field(...)

    # Evidence span; all four are None when no answer was found.
    start_page_num: int | None
    start_line_num: int | None
    end_page_num: int | None
    end_line_num: int | None

    confidence: float = Field(..., ge=0.0, le=1.0)
    justification: str = Field(...)

    quotes: list[str] = Field(default_factory=list)
    caveats: list[str] = Field(default_factory=list)

The raw JSON is worth saving in production: justification, quotes, caveats, and confidence all feed evaluation, audit, and replay, well beyond the answer field a chat UI shows.

We serialize the filtered lines into a TSV with header page_num\tline_num\ttext, one row per line. The LLM sees the exact coordinates next to each text fragment so it can cite by (page_num, line_num) in its answer.
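
A minimal sketch of that serialization, in plain Python. The helper name lines_to_tsv is an assumption, and rows are plain dicts standing in for the rows of the filtered line_df. Collapsing whitespace inside the text is a choice made here, not something the article specifies, so that a stray tab or newline cannot break the three-column layout.

```python
def lines_to_tsv(rows):
    """Serialize line records into the TSV block shown to the LLM.

    rows: iterable of dicts with page_num, line_num and text keys.
    Whitespace inside the text is collapsed so every record stays on one
    physical line with exactly three tab-separated fields.
    """
    header = "page_num\tline_num\ttext"
    body = []
    for row in rows:
        text = " ".join(str(row["text"]).split())
        body.append(f"{row['page_num']}\t{row['line_num']}\t{text}")
    return "\n".join([header, *body])


rows = [
    {"page_num": 6, "line_num": 31, "text": "There are many choices of\tpositional encodings,"},
    {"page_num": 6, "line_num": 32, "text": "learned and fixed [9]."},
]

tsv = lines_to_tsv(rows)
print(tsv)
```

Each record starts with its own coordinates, so the model can copy a (page_num, line_num) pair straight into the evidence span instead of counting lines.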

This is what makes the answer grounded: the schema forces the model to fill in (start_page, start_line, end_page, end_line), a verbatim quote, and caveats if anything is uncertain. No prose, only a typed object with citations.

We call answer = llm_answer_with_evidence(question, filtered_line_df, client=client) and get back an AnswerWithEvidence instance, rendered below as a styled JSON image so the field labels stay legible.

def llm_answer_with_evidence(question, filtered_text_prompt):
    resp = client.responses.parse(
        model=model_chat,
        input=[
            {
                "role": "system",
                "content": (
                    "Answer using ONLY the provided lines. "
                    "Return JSON only."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Lines:\n{filtered_text_prompt}\n\n"
                    f"Question:\n{question}\n\n"
                    "Pick a contiguous evidence span."
                ),
            },
        ],
        text_format=AnswerWithEvidence,  # schema the output must follow
        store=False,
    )
    # JSON string that follows the AnswerWithEvidence schema.
    return resp.output_text

We call answer = llm_answer_with_evidence(question, filtered_line_df, client=client) and get back an AnswerWithEvidence instance.

{
  "answer": "The options for positional encoding mentioned are learned positional embeddings and fixed positional encodings (specifically, using sine and cosine functions of different frequencies).",
  "start_page_num": 6,
  "start_line_num": 31,
  "end_page_num": 6,
  "end_line_num": 32,
  "confidence": 0.98,
  "justification": "Lines 31–32 explicitly state: 'There are many choices of positional encodings, learned and fixed [9].' Additionally, further lines detail the sinusoidal encoding as the fixed choice, and Table 3 row (E) discusses using learned embeddings instead.",
  "quotes": [
    "There are many choices of positional encodings, learned and fixed [9]."
  ],
  "caveats": [
    "Further details about the specific implementation of learned embeddings are only touched on elsewhere, but both options are mentioned here."
  ],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "learned positional embeddings",
    "fixed positional encodings",
    "sinusoidal positional encoding"
  ]
}

Three things happened that matter:

The answer is correct. Both options identified, paraphrased correctly.
The evidence span (page 6, lines 26-44) points to a specific region. Not “somewhere on page 6”. Exact lines.
The model couldn’t have hallucinated a citation: it only saw lines from the retrieved pages, and the schema forced a real (page, line) range we can verify.
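
One way to act on "a range we can verify" is to check that each returned quote really appears in the cited lines. The sketch below is an assumption about how such a check could look, not a step in the article's pipeline; quote_in_span and _norm are hypothetical helpers, and the cited lines are retyped here as plain strings.

```python
import re


def _norm(s):
    """Collapse whitespace and lowercase, so line breaks don't block a match."""
    return re.sub(r"\s+", " ", s).strip().lower()


def quote_in_span(cited_lines, quote):
    """True if the quote appears in the cited lines once they are joined,
    ignoring case and whitespace differences."""
    return _norm(quote) in _norm(" ".join(cited_lines))


# The two cited lines, as they could appear in line_df (illustrative split).
cited_lines = [
    "There are many choices of positional encodings,",
    "learned and fixed [9].",
]
quote = "There are many choices of positional encodings, learned and fixed [9]."

print(quote_in_span(cited_lines, quote))
print(quote_in_span(cited_lines, "Positional encodings are always learned."))
```

A check this simple has limits: a word hyphenated across a line break in the PDF would fail the match even when the quote is faithful, so a failed check is a flag for review, not proof of a bad citation.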

If the model can’t fill the schema, null fields are allowed and caveats record why. Article 8 develops the schema into a much richer form with per-brick feedback fields; Article 23 builds the storage architecture around it.

Sanity check. On a paper this short we can also send the entire line_df to the LLM with no retrieval and check the answer matches. Reassuring here, won’t scale to large documents.

{
  "answer": "The options mentioned for positional encoding are sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
  "start_page_num": 6,
  "start_line_num": 27,
  "end_page_num": 6,
  "end_line_num": 41,
  "confidence": 0.99,
  "justification": "Lines 6:27-6:41 describe adding 'positional encodings' to the input embeddings, specify the sinusoidal method, and mention experimenting with learned positional embeddings, stating both options were tried and produced nearly identical results.",
  "quotes": [
    "Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add 'positional encodings' to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9]. In this work, we use sine and cosine functions of different frequencies: ... We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training."
  ],
  "caveats": [
    "Exact mathematical formulas for sinusoidal encoding are present here, but full details for learned embeddings are not. Table 3 row (E) and further details may expand on results but are not needed for the options question."
  ],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "sinusoidal positional encoding",
    "learned positional embeddings",
    "sine and cosine functions",
    "relative or absolute position"
  ]
}

2.5 PDF annotation on the source PDF

Now the satisfying part. We use the evidence span to draw rectangles directly on the source PDF.

In: the source PDF and the evidence span from the AnswerWithEvidence.

Out: an annotated PDF with rectangles drawn around the cited lines.

Optional. A CLI tool, a batch job, or an API may skip it; the answer with citations is already complete after section 2.4.

Three calls do the work:

passage_lines_df_from_answer(line_df, answer) rebuilds the cited-line DataFrame from the evidence span.
passage_bbox_by_page(passage_df) groups bounding boxes per page.
draw_passage_rectangles(pdf_path, bboxes_df, out_pdf_path) writes the annotated PDF.

*One bounding box per cited page, wrapping every cited line on that page – Image by author*

*PDF annotation in three steps: expand the span, union per page, draw rectangles – Image by author*

import json

def passage_lines_df_from_answer(line_df, answer_json):
    a = json.loads(answer_json)
    sp, sl = a["start_page_num"], a["start_line_num"]
    ep, el = a["end_page_num"], a["end_line_num"]
    if sp is None:
        return line_df.iloc[0:0]  # null path: no evidence span, empty frame
    # Keep lines from (sp, sl) to (ep, el), both ends included.
    mask = (
        line_df["page_num"].between(sp, ep)
        & ((line_df["page_num"] != sp) | (line_df["line_num"] >= sl))
        & ((line_df["page_num"] != ep) | (line_df["line_num"] <= el))
    )
    return line_df.loc[mask].copy()

def passage_bbox_by_page(passage_df):
    # One wrapping rectangle per page: the union of the cited line boxes.
    return passage_df.groupby("page_num", as_index=False).agg(
        x0=("x0", "min"), y0=("y0", "min"),
        x1=("x1", "max"), y1=("y1", "max"))

def draw_passage_rectangles(pdf_path, bboxes_df, out_path):
    doc = fitz.open(pdf_path)
    for _, r in bboxes_df.iterrows():
        page = doc[int(r["page_num"]) - 1]  # page_num is 1-based
        page.add_rect_annot(fitz.Rect(r["x0"], r["y0"], r["x1"], r["y1"]))
    doc.save(out_path)
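
The three-part mask in passage_lines_df_from_answer can look dense at first read. As a hedged aside, the same selection can be written in plain Python with tuple comparison, which orders (page, line) pairs the way a reader does: page first, then line. The helper lines_in_span and the toy coordinates are assumptions for illustration only.

```python
def lines_in_span(lines, start, end):
    """Select the (page_num, line_num) pairs inside an evidence span.

    start and end are inclusive (page, line) pairs. Python compares tuples
    element by element, so (5, 3) <= (6, 1) <= (7, 1) holds: the page is
    compared first and the line only breaks ties within a page.
    """
    return [pl for pl in lines if start <= pl <= end]


# Toy document: three pages of three lines each.
lines = [(page, line) for page in (5, 6, 7) for line in (1, 2, 3)]

# A span that starts at the end of page 5 and stops at the top of page 7.
span = lines_in_span(lines, (5, 3), (7, 1))
print(span)
```

The first page contributes only its lines from start_line on, the last page only its lines up to end_line, and every page in between is kept whole, which is what the pandas mask expresses with its three conditions.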

*Attention paper page 6 with cited paragraph highlighted, next to question and answer – Image by author*

The passage really is where the answer comes from. The red box wraps the Positional Encoding paragraph: the sentence that introduces the choice (“we use sine and cosine functions of different frequencies”) and the two-line formula directly below it. The reader can move from the chat answer to the citation to the source paragraph without leaving the same screen. That’s the whole point.

Why a box around the whole paragraph and not the exact words? Because we worked at the line granularity: line_df carries one bounding box per text line, the LLM cites a (start_line, end_line) span, and passage_bbox_by_page collapses every line in that span into one wrapping rectangle. If you want to draw the box around the exact terms sin(pos / 10000^(2i/d_model)) instead of the whole paragraph, the method is the same. Just change the granularity. Replace line_df with a word-level word_df (PyMuPDF’s page.get_text("words") gives you a bounding box per word), make the schema cite (start_word, end_word), and passage_bbox_by_page already does the right thing. Same four-brick pipeline, finer scope.

3. Chaining the bricks, and testing the pipeline

3.1 The whole pipeline as one function

The bricks chain into a single call. Feed in a PDF and a question; get back a typed answer with line citations, and optionally an annotated PDF.

In: a PDF path and a text question (plus an optional top_k and an optional output PDF path).

Out: an AnswerWithEvidence, and (if annotate_pdf is given) an annotated PDF on disk.

Inside, pdf_qa_baseline chains document parsing → question parsing → retrieval → generation → PDF annotation. What crosses the retrieval → generation boundary is just the page numbers; the filtered line_df is rebuilt inside generation.

def pdf_qa_baseline(
    pdf_path: str,
    query: str,
    top_k: int = 3,
    annotate_pdf: str | None = None,
):
    # 1. Parsing
    line_df = fitz_pdf_to_line_df(pdf_path)

    # 2. Retrieval
    page_df = embed_page_df(build_page_df(line_df))
    _, filtered = retrieve_pages(page_df, line_df, query, top_k)

    # 3. Generation
    answer = llm_answer_with_evidence(query, filtered)

    # 4. Optional highlighting on the source PDF
    if annotate_pdf is not None:
        passage = passage_lines_df_from_answer(line_df, answer)
        bboxes = passage_bbox_by_page(passage)
        draw_passage_rectangles(pdf_path, bboxes, annotate_pdf)

    return answer

{
  "answer": "The options mentioned for positional encoding are learned and fixed positional encodings, specifically sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
  "start_page_num": 6,
  "start_line_num": 31,
  "end_page_num": 6,
  "end_line_num": 41,
  "confidence": 0.99,
  "justification": "Lines 31-41 discuss the choices for positional encodings, stating that there are many choices including learned and fixed encodings. It then explains the use of sine and cosine functions (sinusoidal encoding) and notes that learned positional embeddings were also experimented with.",
  "quotes": [
    "There are many choices of positional encodings, learned and fixed [9].",
    "In this work, we use sine and cosine functions of different frequencies: ...",
    "We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E))."
  ],
  "caveats": [],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "positional encodings",
    "learned",
    "fixed",
    "sinusoidal",
    "sine and cosine functions",
    "learned positional embeddings"
  ]
}

This is the API of the article. Later articles build a sister function ask_corpus(query, corpus, ...) for archive-scale work: same contract (typed answer with citations), different scope (filter the corpus first, then run document-level work on the matching documents).

3.2 Try it on a different document

Drop in any PDF you have around: a paper from your own field, a contract, a report from work. Here we pick the World Bank’s April 2026 Commodity Markets Outlook (World Bank publication, April 2026 issue; CC BY 3.0 IGO, as declared on the World Bank Open Knowledge Repository publication page for this issue): a 69-page report on energy, agriculture, and fertilizer markets, far from a research paper in tone and structure.

Same four bricks, same default prompts, same retrieve_pages, same schema. Nothing about the pipeline changes for a new document.

We start with a question whose answer lives deep in the report, in the metals chapter rather than the Executive Summary: the outlook for aluminum prices in 2026.

We call pdf_qa_baseline end-to-end: pass the CMO PDF, the aluminum question, top_k=3, and an annotate_pdf path so the pipeline also writes the highlighted source. The returned answer_cmo_al is the same AnswerWithEvidence shape we saw on the Attention paper.

{
  "answer": "Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth. Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease.",
  "start_page_num": 45,
  "start_line_num": 32,
  "end_page_num": 45,
  "end_line_num": 43,
  "confidence": 0.98,
  "justification": "The selected span explicitly provides the projected percentage increase for aluminum prices in 2026, the context for these movements, and the outlook for 2027. It also mentions the record-high level forecast and factors driving the price.",
  "quotes": [
    "Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth (table 1).",
    "Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease."
  ],
  "caveats": [],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "all-time high",
    "tight supply conditions",
    "solid demand growth"
  ]
}

The composite view places the highlighted source page next to the question and the answer, so the citation can be checked at a glance:

A harder question on the same report. What if we ask about something the report mentions only in passing? We try the AI-related electricity demand question, whose answer the World Bank developed only in an “Upside risk” sidebar on page 31.

Same call shape, harder question: pdf_qa_baseline(pdf_path=pdf_path_cmo, query=question_cmo_ai, top_k=3, ...). The pipeline must decide whether the retrieved pages actually carry the AI-electricity figure or whether to flag the answer as not found.

{
  "answer": "The provided lines indicate that faster-than-anticipated expansion of AI-related data centers could boost demand for certain metals like aluminum and copper, but do not quantify the contribution of AI-related data centers to global electricity demand growth.",
  "start_page_num": 47,
  "start_line_num": 39,
  "end_page_num": 47,
  "end_line_num": 40,
  "confidence": 0.8,
  "justification": "The only mention of AI-related data centers is in relation to demand for metals, not electricity demand. There is no quantitative estimate or percentage given for their impact on global electricity demand growth.",
  "quotes": [
    "Also, faster-than-antici-\npated expansion of AI-related data centers could \nboost demand for aluminum and copper, driving \nprices higher."
  ],
  "caveats": [
    "No specific figures or direct statements about global electricity demand growth caused by AI-related data centers were found in the provided lines."
  ],
  "complete_answer_found": false,
  "context_structured": true,
  "llm_discovered_keywords": [
    "AI-related data centers",
    "electricity demand growth",
    "boost demand for aluminum and copper"
  ]
}

*CMO page 47, null-path response: the schema refused to fabricate when the answer wasn’t there – Image by author*

But how can we be sure the answer really doesn’t exist in the document? Strictly, we can’t, at least not from this null path alone. What the schema says is “the LLM didn’t find the answer in the lines it was shown”, which is a different claim from “the answer is not in the document”. The Upside-risk sidebar on page 31 of the same CMO report does quantify the figure (the World Bank cites the IEA’s 8% projection of global electricity demand growth from 2024 to 2030). The default keyword pipeline pulled page 47 and nearby pages instead, where the report’s prose discusses AI’s effect on metal demand. Proving absence would require either running the LLM on every page, or a retrieval method that surfaces sidebar text and short reference mentions. That’s exactly what Article 7 (Retrieval) develops; for the minimal version, “I didn’t find it in the top three pages” is what we report.
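To make the two claims distinguishable in code, here is a rough sketch of the “run the LLM on every page” option. Everything in it is hypothetical: scan_all_pages, WindowResult, and the ask callable are illustrative names, and fake_ask is a stub that stands in for the real generation call by pretending the answer sits on page 31.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WindowResult:
    found: bool
    page_num: Optional[int] = None


def scan_all_pages(
    page_nums: list[int],
    ask: Callable[[list[int]], WindowResult],
    window: int = 3,
) -> dict:
    """Call `ask` on consecutive page windows until one reports an answer."""
    checked: list[int] = []
    for i in range(0, len(page_nums), window):
        batch = page_nums[i:i + window]
        checked.extend(batch)
        result = ask(batch)
        if result.found:
            return {"status": "found", "page_num": result.page_num,
                    "pages_checked": checked}
    # Every page was shown to the model, so the null result now covers the whole document.
    return {"status": "not_found_after_full_scan", "page_num": None,
            "pages_checked": checked}


def fake_ask(batch: list[int]) -> WindowResult:
    """Stub for the LLM call (assumption): the answer is on page 31."""
    return WindowResult(found=31 in batch, page_num=31 if 31 in batch else None)


report = scan_all_pages(list(range(1, 70)), fake_ask)
empty = scan_all_pages([1, 2, 3, 4], lambda batch: WindowResult(found=False))
```

The cost grows linearly with page count, which is why this only makes sense as a fallback on a single document. Even a full scan reports what the model found, so the status name stays “not found” rather than “absent”.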

3.3 More questions in one table

A small batch of four questions on the same two documents, all results in one table. Read the table for patterns, not for every cell.

Numeric value: learning rate of the base Transformer. Specific number, expected page 7 (section 5.3 on Adam optimizer).
No answer in document: chemical composition of seawater. The schema’s null path should fire; both retrievers will pull random-looking pages.
Different topic on CMO: outlook for urea prices. Same pipeline on the fertilizer section of the World Bank report, far from the AI sidebar.
Compound question: d_k and d_v in the Transformer. Two values requested at once. Also tests the table-parsing limit (the values live in Table 1 page 4, parsed as flat lines).

def run_pipeline_test(
    query: str,
    line_df_in: pd.DataFrame,
    page_df_in: pd.DataFrame,
    page_df_emb_in: pd.DataFrame,
    top_k: int = 3,
    client=client,
) -> dict:
    """Run both retrievers + generation on one question; return a summary dict."""
    parsed_q = get_keywords_from_question(query, client=client)
    retrieved_emb_df, _ = retrieve_pages_by_similarity(
        page_df_emb_in, line_df_in, query, top_k=top_k, client=client,
    )
    retrieved_kw_df, filtered_lines_kw = retrieve_pages(
        page_df_in, line_df_in, parsed_q.keywords, top_k=top_k,
    )
    # If keyword retrieval finds nothing, fall back to the whole document so generation
    # still runs (small PDFs only: would not scale to a real corpus).
    lines_for_generation = (
        filtered_lines_kw if len(filtered_lines_kw) > 0 else line_df_in
    )
    answer = llm_answer_with_evidence(
        query, lines_for_generation, client=client,
    )
    return {
        "query": query,
        "keywords": parsed_q.keywords,
        "emb_top3": retrieved_emb_df["page_num"].tolist(),
        "kw_top3": (
            retrieved_kw_df["page_num"].tolist()
            if len(retrieved_kw_df) > 0 else "(no kw match)"
        ),
        "answer_excerpt": (answer.answer[:80] + ("..." if len(answer.answer) > 80 else "")),
        "cite_page": answer.start_page_num,
    }

*Same pipeline on four questions: two succeed, one refuses cleanly, one trips on table parsing – Image by author*

Read the table left-to-right per row. Four patterns to take away:

Keywords beat embeddings on the learning rate row. The base Transformer’s training schedule is on page 7 (section 5.3, Optimizer). Embeddings rank pages 8/9/10; page 7 is not in the top three. The keyword retriever finds page 7 directly via the literal phrase learning rate. Same lesson as the epsilon row in section 2.3.c: when the question depends on a precise term the document prints verbatim, keywords are the better tool.
Both retrievers fail on the seawater row, and the failure is visible. The PDF has nothing to say about seawater. The keyword column shows (no kw match) outright, with no false ‘top-3 pages’ that look plausible. The schema then returns a null answer with a caveat. A clean ‘I don’t know’ is the system’s most valuable behavior on out-of-scope questions.
Both retrievers work on the urea row. The CMO has a fertilizer section; embeddings and keywords both bring back page 42, and generation cites it correctly. Cross-domain pipelines work as long as the question’s vocabulary lands on the document.
The d_k and d_v compound row exposes the table-parsing limit. The two values live in Table 1, page 4 of the Transformer paper, where each row lists d_model, h, d_k, d_v, and so on. Our parser flattened the table into plain lines, so a model that asks for two cells side by side has to reassemble the row from text alone. Keywords retrieve page 4 (the literal term d_k appears there), but the citation often points to one value while the other is paraphrased. The fix is structural: parse tables as tables, not as lines. That’s Article 5 (parsing) and Article 6 (compound-question decomposition) doing their job.

4. The questions each block raises

What this minimal system does well:

A real, verifiable answer. A structured object with the answer, the page, the lines, the quote. The user can check the citation in seconds.
“Not found” handled cleanly. When the answer isn’t in the retrieved lines, the schema allows null fields and the caveats field says why. No fabrication.
The answer linked to the source. The highlighted PDF closes the loop between the LLM’s claim and the document. This is what separates a useful RAG system from a chatbot that happens to read documents.
Easy to follow. Each function does one thing. No hidden state, no framework magic. When something goes wrong, debugging is reading the code.

Now look at the same system again. Each block hides assumptions worth questioning.

4.1 Document parsing: we just read lines

We extracted text line by line. That’s reasonable for an academic paper, but look at what we threw away: section structure, headings, table layouts, figures, footnotes, cross-references. Page 4 of this paper contains Table 1 with the per-layer complexities. We parsed each of its rows as plain lines, losing the table structure entirely. Page 9 contains Table 3, the ablation study. Same problem.

For a question like “What are the options for positional encoding?” this doesn’t matter. The answer is in continuous prose. For a question like “What is the per-layer complexity of self-attention?” it suddenly does, because the answer lives in a table cell that our parser flattened into noise.

That’s the topic of Article 5: Parsing. Documents have structure. Ignoring it is the single biggest source of downstream failure.

4.2 Question parsing: we asked for keywords, but only keywords

Our question-parsing step extracts a flat list of keywords. That works on a clean question against an academic paper. It starts to break down as soon as questions get harder.

Three things this minimal version doesn’t do.

It doesn’t detect intent. “Summarize chapter 3”, “Translate this clause into French”, “Compare X and Y” each call for a different downstream pipeline. A single keywords field can’t carry that signal.

It doesn’t decompose compound questions. “What are the exclusions and the deductible?” parsed as a flat keyword list pollutes the retrieval (the keywords for “exclusions” and “deductible” pull in two different scopes that interfere). Article 6 walks through how to detect compound questions, decide whether to decompose, and route the sub-questions independently.

It doesn’t detect an expected answer shape. “What is the premium amount?” wants a number with a currency. “What are the obligations?” wants a list. “Compare the two policies” wants a table. The minimal version treats every answer as free text. Article 6 introduces the expected_answer_shape field that drives the generation template downstream.

That’s the topic of Article 6: Question Parsing. The same brick, much richer JSON.
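To give a feel for what “much richer JSON” could look like, here is a speculative sketch. Only the expected_answer_shape field name comes from the text above. The class name RichParsedQuestion, the intent and sub_topics fields, and the rule-based naive_parse function are assumptions for illustration; a real question parser would most likely delegate these decisions to an LLM with a typed schema rather than to regular expressions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RichParsedQuestion:
    question: str
    keywords: list[str] = field(default_factory=list)
    intent: str = "lookup"               # e.g. "lookup", "summarize", "translate", "compare"
    expected_answer_shape: str = "text"  # e.g. "text", "number", "list", "table"
    sub_topics: list[str] = field(default_factory=list)


def naive_parse(question: str) -> RichParsedQuestion:
    """Rule-based stand-in for an LLM call; for illustration only."""
    q = question.strip()
    low = q.lower()

    # Intent: read off the leading verb.
    intent = "lookup"
    for verb in ("summarize", "translate", "compare"):
        if low.startswith(verb):
            intent = verb

    # Expected answer shape: a few hand-written cues.
    if intent == "compare":
        shape = "table"
    elif re.search(r"\b(amount|how much|how many)\b", low):
        shape = "number"
    elif low.startswith("what are"):
        shape = "list"
    else:
        shape = "text"

    # Compound detection: "What are the A and the B?" yields two topics.
    m = re.match(r"what (?:is|are) (.+?) and (the .+?)\?$", q, flags=re.IGNORECASE)
    sub_topics = [m.group(1), m.group(2)] if m else []

    return RichParsedQuestion(question=q, intent=intent,
                              expected_answer_shape=shape, sub_topics=sub_topics)


compound = naive_parse("What are the exclusions and the deductible?")
premium = naive_parse("What is the premium amount?")
comparison = naive_parse("Compare the two policies")
```

The point of the sketch is the contract, not the rules: downstream bricks can branch on intent, shape, and sub-topics only if the question-parsing step emits them as separate fields.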

4.3 Chunking: we aggregated by page

We chose pages as the unit of retrieval. Why pages? Why not paragraphs, or sections, or fixed-size chunks of 512 tokens like every standard RAG tutorial recommends?

The answer is that page-level aggregation happens to work for this paper because pages roughly align with semantic units. On a contract, on a legal text, on a technical manual with numbered clauses, pages are arbitrary cuts and you’d want clause-level or section-level chunks instead. The “right” chunking depends on the document and the question, not on a default value.
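As a rough illustration of clause-level chunking, the sketch below groups text lines under the numbered clause heading that precedes them. The function name chunk_by_clause, the CLAUSE_START pattern, and the sample contract lines are all invented for this example. The pattern is a heuristic (one or two digits per level, followed by a capitalized word) and would need adapting to a real document’s numbering style.

```python
import re

# Heuristic: "4", "4.1", "4.1.2", optionally followed by "." or ")", then a capitalized word.
CLAUSE_START = re.compile(r"^(\d{1,2}(?:\.\d{1,2})*)[.)]?\s+[A-Z]")


def chunk_by_clause(lines):
    """Group text lines into chunks that each start at a numbered clause heading."""
    chunks = []
    current = {"clause": None, "lines": []}
    for line in lines:
        m = CLAUSE_START.match(line)
        if m:
            if current["lines"]:
                chunks.append(current)
            current = {"clause": m.group(1), "lines": [line]}
        else:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append(current)
    return chunks


# Invented contract text, for illustration only.
contract_lines = [
    "4.1 Exclusions. The insurer is not liable for",
    "damage caused by wear and tear.",
    "4.2 Deductible. The insured bears the first",
    "500 EUR of each claim.",
]
chunks = chunk_by_clause(contract_lines)
```

The unit of retrieval here follows the document’s own structure instead of a page break or a token count, which is the structural decision the paragraph above argues for.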

The temptation, when a fixed-size approach starts failing, is to grid-search over chunk sizes and overlaps. That’s the machine learning reflex. It’s the wrong frame for what’s actually a structural decision. Article 3: RAG Is Not Machine Learning, and the Six-Month Mistake of Treating It Like One makes that case in full.

4.4 Retrieval: keyword matching is transparent, but blind to vocabulary

Our retrieval just worked. Page 6 came back with the matched keyword, ahead of the rest, and the Positional Encoding section is on page 6. Anyone can look at the match table and see why. That’s the trade we made: the simplest possible retrieval, completely auditable.

The trade has a cost. Keyword matching is blind whenever the question’s vocabulary doesn’t match the document’s. Three failure modes show up immediately on the same paper.

Symbol vs word. Ask “What is the value of epsilon used in label smoothing?” The keywords from question parsing are likely something like ["epsilon", "label smoothing"]. The actual answer (ε_ls = 0.1) sits on page 8, but the document writes it as the Greek letter ε, never the English word “epsilon”. The substring check for “epsilon” returns zero on the symbol-only page; only the literal phrase label smoothing lands on page 8.

Synonym mismatch. Ask “How does the model know the order of words in a sentence?” The keywords might be ["word order", "sentence order"]. The document calls this positional encoding. None of the question’s keywords appear on page 6. The retriever picks pages that happen to mention “order” or “sentence” in passing, none of which contain the answer.

Paraphrase. Ask “What attention mechanism does the encoder use?” The document says self-attention and Multi-Head Attention, never the phrase “attention mechanism the encoder uses”. The keywords pulled from the question, even after expansion, may or may not include the document’s exact phrasing. When they do, retrieval works. When they don’t, it silently degrades.
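The symbol-versus-word case is easy to reproduce in a few lines. The sketch below assumes the keyword retriever boils down to a case-insensitive substring count; count_keyword_hits and the ALIASES table are illustrative names, and the sample sentence is a stand-in modelled on the label-smoothing passage rather than text read from the PDF.

```python
def count_keyword_hits(page_text: str, keywords: list[str]) -> dict[str, int]:
    """Case-insensitive substring counts, one entry per keyword."""
    text = page_text.lower()
    return {kw: text.count(kw.lower()) for kw in keywords}


# Stand-in sentence (assumption), not extracted from the PDF.
page_8_like = "During training, we employed label smoothing of value ε_ls = 0.1."

hits = count_keyword_hits(page_8_like, ["epsilon", "label smoothing"])

# One hypothetical mitigation: expand a keyword with the symbols that may stand for it.
ALIASES = {"epsilon": ["ε", "ϵ"]}
expanded = ["epsilon", "label smoothing"] + ALIASES["epsilon"]
hits_expanded = count_keyword_hits(page_8_like, expanded)
```

The word “epsilon” scores zero while the symbol ε scores one, so the page is only retrieved thanks to the second keyword. An alias table patches this one case; it does nothing for synonyms or paraphrase.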

The first two failures are so common that the rest of the series spends two articles on them.

Article 6: Question Parsing turns the keyword extraction into a much richer step that pulls from a domain glossary, expands synonyms, and includes likely document phrasings rather than the question’s literal words.
Article 2: Embeddings introduces vector representations that match across surface vocabulary: where embeddings shine (synonyms, paraphrase, misspellings, cross-lingual matching), where they quietly fail (negation, exact values, internal acronyms, polysemic words), and how to combine them with keyword matching for the best of both worlds.
Articles 7 and 9 put the resulting hybrid retrieval into a real document index.

The right answer is to combine, not pick a winner. The two methods fail on almost opposite cases: embeddings stumble when the question depends on a precise symbol, named term, or exact value; keywords stumble when the asker’s vocabulary doesn’t literally appear in the document. Running both retrievers, taking the union of their candidates, and (optionally) re-ranking with a cross-encoder is the standard hybrid recipe. Article 2 develops it; Articles 7 and 9 wire it into a corpus.
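The “union of candidates” step can be sketched without any model at all. In the example below, union_candidates is a hypothetical helper, the ordering rule (pages found by both retrievers first, then best rank) is one plausible choice among several, and the page lists are invented inputs. The optional cross-encoder re-ranking is left out.

```python
def union_candidates(keyword_pages: list[int], embedding_pages: list[int]) -> list[dict]:
    """Merge two ranked page lists, recording which retriever found each page."""
    merged: dict[int, dict] = {}
    for source, pages in (("keyword", keyword_pages), ("embedding", embedding_pages)):
        for rank, page in enumerate(pages, start=1):
            entry = merged.setdefault(
                page, {"page_num": page, "found_by": [], "best_rank": rank}
            )
            entry["found_by"].append(source)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Pages found by both retrievers come first, then best rank, then page number.
    return sorted(
        merged.values(),
        key=lambda e: (-len(e["found_by"]), e["best_rank"], e["page_num"]),
    )


# Invented rankings, for illustration only.
candidates = union_candidates(keyword_pages=[7, 3, 8], embedding_pages=[8, 9, 10])
ordered_pages = [c["page_num"] for c in candidates]
```

Keeping found_by on each candidate preserves auditability: for every page handed to generation, you can still say which retriever put it there.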

The minimal version stays single-retriever because it teaches the right reflex first: the retriever must be auditable. Keyword matching makes that reflex concrete (you can see exactly which words landed on which page). Once that reflex is in place, embeddings become a controlled addition rather than an opaque default, and combining the two becomes a deliberate engineering choice rather than a trend.

4.5 Generation: we asked for sources, and we got them

This is the block that worked best, almost too easily. We defined a Pydantic schema with start_page_num, start_line_num, end_page_num, end_line_num, confidence, justification, quotes, and caveats, and the model filled it in correctly.

How much more can we ask? A structured comparison for comparative questions, a list of conflicts if the document contradicts itself, multiple citations from multiple parts of the document, a confidence breakdown per claim. Yes to all of the above. The generation step is far more controllable than most teams realize. Article 8: Generation as Controlled Execution explores this in depth.
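Because the schema returns both a line span and a list of quotes, one cheap control is to check mechanically that every quote occurs in the cited lines. The sketch below is an assumption about how such a check could look, not the series’ implementation: normalize and unverified_quotes are illustrative names, the cited lines are a reconstruction of the CMO passage quoted earlier, and the second quote is deliberately invented to show a failure.

```python
import re


def normalize(text: str) -> str:
    """Rejoin words hyphenated across line breaks, collapse whitespace, lowercase."""
    text = re.sub(r"-\s*\n\s*", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_quotes(quotes: list[str], cited_lines: list[str]) -> list[str]:
    """Return the quotes that do not appear verbatim in the cited span."""
    span = normalize("\n".join(cited_lines))
    return [q for q in quotes if normalize(q) not in span]


# Reconstructed line breaks, for illustration only.
cited_lines = [
    "Also, faster-than-antici-",
    "pated expansion of AI-related data centers could",
    "boost demand for aluminum and copper, driving",
    "prices higher.",
]
quotes = [
    "faster-than-anticipated expansion of AI-related data centers",
    "AI will double electricity demand",  # invented, to show a failing quote
]
missing = unverified_quotes(quotes, cited_lines)
```

One known limitation: the hyphen rule also merges genuine compounds that happen to break at a line end, so a production check would need a smarter de-hyphenation step. A non-empty result is a signal to downgrade confidence or add a caveat, not proof of fabrication.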

5. The shape of what comes next

This minimal pipeline is the spine of everything that follows. Each part of the series goes deep on one of the questions raised above.

The mistakes that kill most projects come from getting the wrong picture of one of these blocks: RAG isn’t ML (Article 3), embeddings aren’t magic (Article 2), not all RAG problems look the same (Article 4). That’s Part I.

Each brick then gets its own deep dive: document parsing, question parsing, retrieval, generation. That’s Part II, the four bricks.

Once the blocks are solid, we recombine them for cases that look like production: long documents, justification and absence handling, table-of-contents-driven retrieval, listing questions, structured extraction, the composite pipeline. That’s Part III.

Then we change scale. From one document to many. From a single paper to an archive of hundreds or thousands of documents. The architecture changes significantly. That’s Part IV.

Finally, what it takes to operate the system in production: evaluation, cost and monitoring, security and compliance, the architecture of the codebase itself. That’s Part V.

The blocks don’t change. Their internals do.

A few framing notes:

The four bricks (Part II) are the conceptual core. Most of the rest of the series is about doing each one better. Part III and Part IV are recombinations: the same four ideas at different scales and for different question types.
The series scope is enterprise documents. Contracts, technical specifications, regulatory filings, internal procedures: all carry structure (TOC, sections, tables) and bounded vocabulary (industry jargon, expert terms). RAG works on these corpora because of that structure, not heroic embedding tricks. Documents with no structure (novels, long unstructured transcripts) and questions that require intent rather than locating a passage are out of scope; Article 4 returns to where the line falls.
Code is illustrative, not production-ready. What you’ve read works on a real PDF, but lacks the error handling, validation, caching, cost controls, monitoring, and security a production system needs. Each gets its own article.

Here’s the explicit map from this minimal system to the rest of the series:

PDF parsing throws away structure → Article 5, Article 10
Question parsing needs more than keywords (intent, decomposition, expected answer shape) → Article 6
Chunking strategy isn’t a hyperparameter → Article 3
Question vocabulary doesn’t match document terms → Article 2, Article 6
Retrieval picks the wrong page → Article 7, Article 9
Model paraphrases its citation → Article 8, Article 21
“Not found” needs nuance → Article 4
Compound, listing, comparison, summarization questions → Article 6, Articles 11-13
Multi-document corpus → Part IV (Articles 15-20)
Production, evaluation, security, architecture → Part V (Articles 21-25)

6. Conclusion

100 lines of Python and a Pydantic schema are enough to ship a working RAG system on a real PDF. What makes the system trustworthy is not the line count: it’s the structured answer with line-level citations, the schema’s null path that refuses to fabricate, and the PDF highlight that ties every claim back to its source. The four bricks (parsing, question parsing, retrieval, generation) are the conceptual core; everything that follows in the series is about doing each one better.

The minimal version is a baseline, not a destination. The next article tackles the misperception that wrecks the most RAG projects: that RAG is a machine learning problem. It’s not.

7. Sources and further reading

The structured-output-with-citations framing this article uses for AnswerWithEvidence is the same direction as Bohnet et al. (Attributed Question Answering, 2022). The full production-grade equivalent of this kind of pipeline shows up in Anthropic’s Contextual Retrieval (Sept 2024), which Article 9 will preview. The term RAG itself comes from Lewis et al. (2020). Volume 3 (Agentic Bricks) returns to the agentic upgrade path on top of the four bricks defined here.

Same direction as the article:

Bohnet et al., Attributed Question Answering, 2022 (arXiv:2212.08037). Structured output with citations as the trust mechanism; the closest published idea behind the AnswerWithEvidence schema.
Anthropic, Contextual Retrieval (Sept 2024 engineering post). Production-grade “minimal but ship-ready” pipeline; lands on hybrid retrieval + reranking. Article 9 picks up where this one stops.
Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, ICLR 2024 (arXiv:2310.11511). Same trust-via-structure direction. Self-reflection tokens flag when retrieval helped and when a claim is grounded.
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020 (arXiv:2005.11401). The paper that named RAG.

Different angle, different context:

Karpukhin et al., Dense Passage Retrieval for Open-Domain QA, EMNLP 2020 (arXiv:2004.04906). Dense retrieval as the production primitive; most “minimal RAG” tutorials descend from this. This article uses keyword matching instead (defended in Article 2).
Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023 (arXiv:2210.03629). Founding paper of agentic RAG. The context is general-purpose tool-picking at runtime. Volume 3 (Agentic Bricks) develops this line on top of the four bricks defined here.
Lee et al., Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, 2024 (arXiv:2406.13121). The long-context-replaces-retrieval approach. Empirical data on where this works and where it breaks; this article assumes long-context doesn’t replace structured retrieval on enterprise PDFs.
