7.3 C
New York
Sunday, March 22, 2026

4 Strategies to Optimize Your LLM Prompts for Price, Latency and Efficiency


of automating a major variety of duties. For the reason that launch of ChatGPT in 2022, we now have seen an increasing number of AI merchandise available on the market using LLMs. Nevertheless, there are nonetheless lots of enhancements that needs to be made in the way in which we make the most of LLMs. Bettering your immediate with an LLM immediate improver and using cached tokens are, for instance, two easy strategies you possibly can make the most of to vastly enhance the efficiency of your LLM software.

On this article, I’ll talk about a number of particular strategies you possibly can apply to the way in which you create and construction your prompts, which is able to cut back latency and price, and likewise enhance the standard of your responses. The aim is to current you with these particular strategies, so you possibly can instantly implement them into your individual LLM software.

This infographic highlights the primary contents of this text. I’ll talk about 4 totally different strategies to significantly enhance the efficiency of your LLM software, with regard to price, latency, and output high quality. I’ll cowl using cached tokens, having the consumer query on the finish, utilizing immediate optimizers, and having your individual custom-made LLM benchmarks. Picture by Gemini.

Why you must optimize your immediate

In lots of instances, you might need a immediate that works with a given LLM and yields ample outcomes. Nevertheless, in lots of instances, you haven’t spent a lot time optimizing the immediate, which leaves lots of potential on the desk.

I argue that utilizing the particular strategies I’ll current on this article, you possibly can simply each enhance the standard of your responses and cut back prices with out a lot effort. Simply because a immediate and LLM work doesn’t imply it’s performing optimally, and in lots of instances, you possibly can see nice enhancements with little or no effort.

Particular strategies to optimize

On this part, I’ll cowl the particular strategies you possibly can make the most of to optimize your prompts.

At all times maintain static content material early

The primary method I’ll cowl is to all the time maintain static content material early in your immediate. With static content material, I consult with content material that is still the identical if you make a number of API calls.

The explanation you must maintain the static content material early is that every one the large LLM suppliers, corresponding to Anthropic, Google, and OpenAI, make the most of cached tokens. Cached tokens are tokens which have already been processed in a earlier API request, and that may be processed cheaply and shortly. It varies from supplier to supplier, however cached enter tokens are often priced round 10% of regular enter tokens.

Cached tokens are tokens which have already been processed in a earlier API request, and that may be processed cheaper and sooner than regular tokens

Meaning, when you ship in the identical immediate two instances in a row, the enter tokens of the second immediate will solely price 1/tenth the enter tokens of the primary immediate. This works as a result of the LLM suppliers cache the processing of those enter tokens, which makes processing your new request cheaper and sooner.


In observe, caching enter tokens is finished by maintaining variables on the finish of the immediate.

For instance, if in case you have an extended system immediate with a query that varies from request to request, you must do one thing like this:

immediate = f"""
{lengthy static system immediate}

{consumer immediate}
"""

For instance:

immediate = f"""
You're a doc professional ...
It's best to all the time reply on this format ...
If a consumer asks about ... you must reply ...

{consumer query}
"""

Right here we now have the static content material of the immediate first, earlier than we put the variable contents (the consumer query) final.


In some eventualities, you wish to feed in doc contents. If you happen to’re processing lots of totally different paperwork, you must maintain the doc content material on the finish of the immediate:

# if processing totally different paperwork
immediate = f"""
{static system immediate}
{variable immediate instruction 1}
{doc content material}
{variable immediate instruction 2}
{consumer query}
"""

Nevertheless, suppose you’re processing the identical paperwork a number of instances. In that case, you can also make positive the tokens of the doc are additionally cached by making certain no variables are put into the immediate beforehand:

# if processing the identical paperwork a number of instances
immediate = f"""
{static system immediate}
{doc content material} # maintain this earlier than any variable directions
{variable immediate instruction 1}
{variable immediate instruction 2}
{consumer query}
"""

Word that cached tokens are often solely activated if the primary 1024 tokens are the identical in two requests. For instance, in case your static system immediate within the above instance is shorter than 1024 tokens, you’ll not make the most of any cached tokens.

# do NOT do that
immediate = f"""
{variable content material} < --- this removes all utilization of cached tokens
{static system immediate}
{doc content material}
{variable immediate instruction 1}
{variable immediate instruction 2}
{consumer query}
"""

Your prompts ought to all the time be constructed up with essentially the most static contents first (the content material various the least from request to request), the essentially the most dynamic content material (the content material various essentially the most from request to request)

  1. You probably have an extended system and consumer immediate with none variables, you must maintain that first, and add the variables on the finish of the immediate
  2. If you’re fetching textual content from paperwork, for instance, and processing the identical doc twice, you must

Could possibly be doc contents, or if in case you have an extended immediate -> make use of caching

Query on the finish

One other method you must make the most of to enhance LLM efficiency is to all the time put the consumer query on the finish of your immediate. Ideally, you manage it so you will have your system immediate containing all the overall directions, and the consumer immediate merely consists of solely the consumer query, corresponding to beneath:

system_prompt = ""

user_prompt = f"{user_question}"

In Anthropic’s immediate engineering docs, the state that features the consumer immediate on the finish can enhance efficiency by as much as 30%, particularly if you’re utilizing lengthy contexts. Together with the query in the long run makes it clearer to the mannequin which job it’s making an attempt to realize, and can, in lots of instances, result in higher outcomes.

Utilizing a immediate optimizer

Lots of instances, when people write prompts, they develop into messy, inconsistent, embrace redundant content material, and lack construction. Thus, you must all the time feed your immediate by means of a immediate optimizer.

The only immediate optimizer you need to use is to immediate an LLM to enhance this immediate {immediate}, and it’ll offer you a extra structured immediate, with much less redundant content material, and so forth.

A good higher strategy, nevertheless, is to make use of a particular immediate optimizer, corresponding to one you could find in OpenAI’s or Anthropic’s consoles. These optimizers are LLMs particularly prompted and created to optimize your prompts, and can often yield higher outcomes. Moreover, you must make sure that to incorporate:

  • Particulars in regards to the job you’re making an attempt to realize
  • Examples of duties the immediate succeeded at, and the enter and output
  • Instance of duties the immediate failed at, with the enter and output

Offering this extra data will often yield means higher outcomes, and also you’ll find yourself with a a lot better immediate. In lots of instances, you’ll solely spend round 10-Quarter-hour and find yourself with a far more performant immediate. This makes utilizing a immediate optimizer one of many lowest effort approaches to enhancing LLM efficiency.

Benchmark LLMs

The LLM you employ may also considerably impression the efficiency of your LLM software. Completely different LLMs are good at totally different duties, so it’s essential to check out the totally different LLMs in your particular software space. I like to recommend at the very least organising entry to the most important LLM suppliers like Google Gemini, OpenAI, and Anthropic. Setting this up is sort of easy, and switching your LLM supplier takes a matter of minutes if you have already got credentials arrange. Moreover, you possibly can think about testing open-source LLMs as properly, although they often require extra effort.

You now must arrange a particular benchmark for the duty you’re making an attempt to realize, and see which LLM works greatest. Moreover, you must commonly test mannequin efficiency, for the reason that large LLM suppliers sometimes improve their fashions, with out essentially popping out with a brand new model. It’s best to, after all, even be able to check out any new fashions popping out from the massive LLM suppliers.

Conclusion

On this article, I’ve lined 4 totally different strategies you possibly can make the most of to enhance the efficiency of your LLM software. I mentioned using cached tokens, having the query on the finish of the immediate, utilizing immediate optimizers, and creating particular LLM benchmarks. These are all comparatively easy to arrange and do, and may result in a major efficiency enhance. I consider many comparable and easy strategies exist, and you must all the time attempt to be looking out for them. These matters are often described in several weblog posts, the place Anthropic is among the blogs that has helped me enhance LLM efficiency essentially the most.

👉 Discover me on socials:

📩 Subscribe to my publication

🧑‍💻 Get in contact

🔗 LinkedIn

🐦 X / Twitter

✍️ Medium

You can even learn a few of my different articles:

Related Articles

Latest Articles