
How to Context Engineer to Optimize Question Answering Pipelines


Context engineering is one of the most relevant topics in machine learning today, which is why I'm writing my third article on the subject. My goal is both to broaden my own understanding of engineering contexts for LLMs and to share that knowledge through my articles.

In today's article, I'll discuss improving the context you feed into your LLMs for question answering. Usually, this context is based on retrieval augmented generation (RAG); however, in today's ever-shifting environment, this approach needs to be updated.

The co-founder of Chroma (a vector database provider) tweeted that RAG is dead. I don't fully agree that we won't use RAG anymore, but his tweet highlights how there are different options for filling the context of your LLM.

You can also read my previous context engineering articles:

  1. Basic context engineering approaches
  2. Advanced context engineering techniques


Why you should care about context engineering

First, let me highlight three key points for why you should care about context engineering:

  • Better output quality by avoiding context rot. Fewer unnecessary tokens improve output quality. You can read more details about it in this article
  • Cheaper (don't send unnecessary tokens; they cost money)
  • Speed (fewer tokens = faster response times)

These are three core metrics for most question answering systems. Output quality is naturally of the utmost priority, considering users will not want to use a low-performing system.

Furthermore, cost should always be a consideration, and if you can lower it (without too much engineering cost), it's a simple decision to do so. Finally, a faster question answering system provides a better user experience. You don't want users waiting several seconds for a response when ChatGPT would answer much faster.

The traditional question-answering approach

Traditional, in this sense, means the most common question answering approach in systems built after the release of ChatGPT. This approach is traditional RAG, which works as follows:

  1. Fetch the documents most relevant to the user's question, using vector similarity retrieval
  2. Feed the relevant documents along with the question into an LLM, and receive a response
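
As a rough sketch, the whole pipeline fits in a few lines of Python. The `embed`, `vector_index`, and `llm_complete` helpers below are hypothetical stand-ins for your embedding model, vector store, and LLM client:

```python
# Minimal sketch of the two-step RAG pipeline described above.
# `embed`, `vector_index`, and `llm_complete` are hypothetical stand-ins
# for your embedding model, vector store, and LLM client.

def answer_question(question: str, top_k: int = 10) -> str:
    # Step 1: fetch the most relevant documents via vector similarity
    query_vector = embed(question)
    documents = vector_index.search(query_vector, top_k=top_k)

    # Step 2: feed the documents along with the question into an LLM
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_complete(prompt)
```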

Considering its simplicity, this approach works incredibly well. Interestingly enough, we also see this with another traditional approach: BM25 has been around since 1994 and was, for example, recently used by Anthropic when they launched Contextual Retrieval, proving how effective even simple information retrieval techniques are.

However, you can still vastly improve your question answering system by updating your RAG with some of the techniques I'll describe in the next section.

Improving RAG context fetching

Although RAG works comparatively nicely, you possibly can probably obtain higher efficiency by introducing the strategies I’ll talk about on this part. The strategies I describe right here all concentrate on enhancing the context you feed to the LLM. You may enhance this context with two principal approaches:

  1. Use fewer tokens on irrelevant context (for example, by removing documents or using less material from them)
  2. Add documents that are relevant

Thus, you should focus on achieving one of the points above. If you think in terms of precision and recall, the two approaches correspond to:

  1. Increase precision (at the cost of recall)
  2. Increase recall (at the cost of precision)

This is a tradeoff you have to make while working on context engineering for your question answering system.

Reducing the number of irrelevant tokens

In this section, I highlight three main approaches to reduce the number of irrelevant tokens you feed into the LLM's context:

  • Reranking
  • Summarization
  • Prompting GPT

When you fetch documents with vector similarity search, they are returned in order from most relevant to least relevant, according to the vector similarity score. However, this similarity score might not accurately represent which documents are most relevant.

Reranking

You can thus use a reranking model, for example the Qwen reranker, to reorder the document chunks. You can then choose to keep only the top X most relevant chunks (according to the reranker), which should remove some irrelevant documents from your context.
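
As a minimal sketch, here is what reranking and filtering could look like with a cross-encoder from the sentence-transformers library; the model name is only an example, and you could swap in a Qwen reranker or any other reranking model you prefer:

```python
# Rerank retrieved chunks with a cross-encoder and keep only the top X.
# The model name is illustrative; swap in whichever reranker you use.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_filter(question: str, chunks: list[str], keep_top: int = 5) -> list[str]:
    # Score each (question, chunk) pair; a higher score means more relevant
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep_top]]
```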

Summarization

You can also choose to summarize documents, reducing the number of tokens used per document. You can, for example, keep the full document for the top 10 most similar documents fetched, summarize the documents ranked 11-20, and discard the rest.

This approach increases the likelihood that you keep the full context from relevant documents, while at least maintaining some context (the summary) from documents that are less likely to be relevant.
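
A minimal sketch of this tiered approach, assuming a hypothetical `llm_summarize` helper that compresses a document with an LLM call:

```python
# Tiered context building: keep the top 10 documents in full, summarize
# documents ranked 11-20, and discard the rest.
# `llm_summarize` is a hypothetical helper that compresses one document.

def build_context(ranked_docs: list[str]) -> str:
    full_docs = ranked_docs[:10]                                    # keep full text
    summaries = [llm_summarize(doc) for doc in ranked_docs[10:20]]  # compress
    # Documents ranked 21 and below are dropped entirely
    return "\n\n".join(full_docs + summaries)
```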

Prompting GPT

Finally, you can also prompt GPT to judge whether the fetched documents are relevant to the user query. For example, if you fetch 15 documents, you can make 15 individual LLM calls to assess whether each document is relevant, and then discard the documents deemed irrelevant. Keep in mind that these LLM calls have to be parallelized to keep the response time within an acceptable limit.
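
A minimal sketch of this relevance filter, assuming a hypothetical async `llm_judge` helper that sends one prompt to the LLM and returns its text reply; the calls run concurrently with asyncio so the latency stays close to that of a single call:

```python
import asyncio

# Judge each fetched document in parallel and drop the irrelevant ones.
# `llm_judge` is a hypothetical async helper returning the LLM's text reply.

async def filter_relevant(question: str, documents: list[str]) -> list[str]:
    async def is_relevant(doc: str) -> bool:
        verdict = await llm_judge(
            f"Question: {question}\n\nDocument:\n{doc}\n\n"
            "Is this document relevant to the question? Answer yes or no."
        )
        return verdict.strip().lower().startswith("yes")

    # One call per document, run concurrently to keep response time acceptable
    verdicts = await asyncio.gather(*(is_relevant(doc) for doc in documents))
    return [doc for doc, keep in zip(documents, verdicts) if keep]
```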

Adding relevant documents

Before or after removing irrelevant documents, you should also make sure you include relevant documents. I cover two main approaches in this subsection:

  • Better embedding models
  • Searching through more documents (at the cost of lower precision)

Better embedding models

To find the best embedding models, you can check the Hugging Face embedding model leaderboard, where Gemini and Qwen are in the top 3 as of the writing of this article. Updating your embedding model is usually a cheap way to fetch more relevant documents. This is because running and storing embeddings is usually inexpensive, for example, embedding via the Gemini API and storing the vectors in Pinecone.
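
As an illustration, the combination mentioned above could look roughly like the sketch below; treat the exact client calls as assumptions, since SDK interfaces change between versions:

```python
# Sketch of the cheap upgrade path above: embed with the Gemini API and
# store the vectors in Pinecone. Exact calls may differ by SDK version.
import google.generativeai as genai
from pinecone import Pinecone

genai.configure(api_key="YOUR_GEMINI_API_KEY")
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("documents")  # assumes an existing index with a matching dimension

def embed_and_store(doc_id: str, text: str) -> None:
    result = genai.embed_content(model="models/text-embedding-004", content=text)
    index.upsert(vectors=[{"id": doc_id, "values": result["embedding"]}])
```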

Search more documents

Another (relatively simple) way to fetch more relevant documents is simply to fetch more documents overall. Fetching more documents naturally increases the chance that you add relevant ones. However, you have to balance this against avoiding context rot and keeping the number of irrelevant documents to a minimum. Every unnecessary token in an LLM call is, as mentioned earlier, likely to:

  • Reduce output quality
  • Increase cost
  • Lower speed

These are all important aspects of a question-answering system.

Agentic search approach

I've discussed agentic search approaches in previous articles, for example, when I discussed Scaling your AI Search. However, in this section, I'll dive deeper into setting up an agentic search, which replaces some or all of the vector retrieval step in your RAG.

The first step is that the user poses their question against a given set of data points, for example, a set of documents. You then set up an agentic system consisting of an orchestrator agent and a list of sub-agents.

This figure highlights an orchestrator system of LLM agents: the main agent receives the user query and assigns tasks to subagents. Image by ChatGPT.

This is an example of the pipeline the agents could follow (though there are many ways to set it up):

  1. The orchestrator agent tells two subagents to iterate over all document filenames and return the relevant documents
  2. The relevant documents are fed back to the orchestrator agent, which then dispatches a subagent for each relevant document to fetch the subparts (chunks) of the document that are relevant to the user's question. These chunks are then fed back to the orchestrator agent
  3. The orchestrator agent answers the user's question, given the provided chunks
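
Below is a minimal sketch of this flow. `llm_call` and `read_document` are hypothetical helpers (an async LLM client and a document loader), and for brevity the filename-scanning step is collapsed into a single call rather than split across two subagents:

```python
import asyncio

# Sketch of the orchestrator flow described above.
# `llm_call` (async LLM client) and `read_document` are hypothetical helpers.

async def select_files(question: str, filenames: list[str]) -> list[str]:
    # A subagent scans the filename list and returns the ones that look relevant
    response = await llm_call(
        f"Question: {question}\n\nFilenames:\n" + "\n".join(filenames) +
        "\n\nReturn the filenames likely to contain the answer, one per line."
    )
    return [line.strip() for line in response.splitlines() if line.strip()]

async def extract_chunks(question: str, filename: str) -> str:
    # One subagent per relevant document pulls out the chunks that matter
    document = read_document(filename)
    return await llm_call(
        f"Question: {question}\n\nDocument:\n{document}\n\n"
        "Quote only the passages that are relevant to the question."
    )

async def agentic_answer(question: str, filenames: list[str]) -> str:
    relevant_files = await select_files(question, filenames)
    chunks = await asyncio.gather(
        *(extract_chunks(question, name) for name in relevant_files)
    )
    # The orchestrator answers using only the chunks returned by the subagents
    return await llm_call(
        f"Question: {question}\n\nRelevant excerpts:\n" + "\n\n".join(chunks)
    )
```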

Another flow you could implement would be to store document embeddings and replace step one with a vector similarity search between the user's question and each document.

This agentic approach has upsides and downsides.

Upsides:

  • Better likelihood of fetching relevant chunks than with traditional RAG
  • More control over the RAG system. You can update system prompts, etc., whereas RAG is relatively static with its embedding similarities

Downside:

  • Increased cost, since the pipeline makes many more LLM calls than a single retrieval step

In my opinion, building such an agent-based retrieval system is a super powerful approach that can lead to excellent results. The consideration you have to make when building such a system is whether the increase in quality you will (likely) see is worth the increase in cost.

Other context engineering aspects

In this article, I've mainly covered context engineering for the documents we fetch in a question answering system. However, there are also other aspects you should be aware of, mainly:

  • The system/user prompt you're using
  • Other information fed into the prompt

The prompt you write for your question answering system should be precise, structured, and free of irrelevant information. You can read many other articles on the topic of structuring prompts, and you can typically ask an LLM to improve these aspects of your prompt.

Sometimes, you also feed other information into your prompt. A common example is feeding in metadata, for example, data covering information about the user, such as:

  • Name
  • Job role
  • What they usually search for
  • etc.
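
As a small illustration, here is a hypothetical prompt builder that keeps the structure explicit and only includes user metadata when you have decided it actually helps (all names are illustrative):

```python
# Hypothetical structured prompt for a question answering system.
SYSTEM_PROMPT = (
    "You are a question answering assistant.\n"
    "Answer only from the provided context.\n"
    "If the context does not contain the answer, say so.\n"
    "Keep answers concise."
)

def build_user_prompt(context: str, question: str, user_metadata: str = "") -> str:
    # Only include metadata when it helps answer the question
    metadata_block = f"User info:\n{user_metadata}\n\n" if user_metadata else ""
    return f"{metadata_block}Context:\n{context}\n\nQuestion: {question}"
```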

Whenever you add such information, you should always ask yourself:

Does adding this information help my question answering system answer the question?

Sometimes the answer is yes, other times it's no. The most important part is that you make a rational decision about whether the information is needed in the prompt. If you can't justify having the information in the prompt, it should usually be removed.

Conclusion

In this article, I've discussed context engineering for your question answering system and why it's important. Question answering systems usually consist of an initial step to fetch relevant information. The focus of this step should be to reduce the number of irrelevant tokens to a minimum, while also including as many relevant pieces of information as possible.

