Scenario: You work on the operations team of a medium-sized company. Every day, your team processes order forms from different B2B customers. They all arrive as PDFs, and in principle they all contain the same information: customer ID, purchase order number, delivery date, and the ordered items.
In practice, however, every document looks slightly different: one customer places the purchase order number in the top-left corner, the next one in the bottom-right corner. Some write “PO Number”, others use “Order ID”, “Order Reference”, or something completely different.
For us humans, this is usually not a problem. We look at the document, understand the context, and immediately recognize which information is meant.
For traditional automation systems, however, this becomes difficult: a regex rule can specifically search for “PO Number: ”. But what happens if the next customer uses “Order Reference: ” instead?
That is exactly the problem I recreated for this article.
We compare two different approaches for extracting structured data from B2B order forms:
- A traditional rule-based approach using pytesseract and regex rules
- An LLM-based approach using pytesseract, Ollama, and LLaMA 3
The goal of this article is not to show that LLMs are generally better. They aren’t always.
A much more interesting question is: at what point do traditional extraction pipelines start to reach their limits as complexity and the number of different layouts increase? And when can an LLM actually reduce maintenance effort?
Table of Contents
1 – Step-by-Step Guide
2 – Head-to-Head Comparison
3 – When should we NOT use an LLM?
4 – Final Thoughts
Where to Continue Learning?
1 – Step-by-Step Guide
We rebuild both approaches step by step. First, we create two sample PDFs containing the same business information but using different layouts. Afterwards, we extract the data once with a traditional OCR and regex pipeline, and once with an OCR and LLM pipeline. This allows us to compare both approaches under identical conditions.
- The traditional approach basically asks: “Can I find the exact pattern that I programmed?”
- The LLM-based approach instead asks: “Can I understand the meaning of this field in context?”
→ 🤓 Find the full code in the GitHub Repo 🤓 ←
Before We Start — Mise en Place
pip vs. Anaconda
In this guide, we use pip, Python’s standard package manager. This means we install all libraries directly through the command line using pip install …. pip is already included automatically when you install Python. If you know Python tutorials that work with Anaconda, that is simply another way to achieve the same goal (using conda install …). In the article “Python Data Analysis Ecosystem — A Beginner’s Roadmap”, you can find further details about getting started with Python. Additionally, on a Microsoft system we use the CMD terminal (Windows key + R > type cmd).
Create and activate a new virtual environment
Create a new Python environment with python -m venv b2bdocumentextractor (you can change the name) in a terminal and activate it with b2bdocumentextractor\Scripts\activate.
Optional: Check Python and pip
python --version
pip --version
You should see a Python and a pip version.
Step 1 – Install Tesseract
Tesseract is the OCR engine. It is the tool that actually reads text from images or scanned PDFs using OCR (Optical Character Recognition). pytesseract is only the Python bridge to Tesseract. This means: our Python code can communicate with Tesseract through pytesseract, but the actual text recognition is done by Tesseract itself. Without installing Tesseract first, pytesseract cannot work.
First, we download the latest .exe file for w64 and run the installer:
GitHub – Tesseract at UB Mannheim
Important: Remember the installation path:
C:\Program Files\Tesseract-OCR
Inside the CMD terminal, we verify the installation using the following command:
"C:\Program Files\Tesseract-OCR\tesseract.exe" --version
If everything worked correctly, we should see the corresponding Tesseract version.
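On Windows, pytesseract occasionally cannot locate the engine even after a successful install. As an optional sketch (the second candidate path is my assumption for alternative install locations, not from the article), the snippet below looks for tesseract.exe so the result can be assigned to pytesseract.pytesseract.tesseract_cmd:

```python
# Sketch: locate the Tesseract binary so pytesseract can be pointed at it.
# The first path is the default from the UB Mannheim installer; the second
# is an assumed alternative - adjust both if you installed elsewhere.
import os

DEFAULT_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]

def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return the first existing Tesseract path, or None if none exists."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

if __name__ == "__main__":
    path = find_tesseract()
    if path:
        # In your scripts you would then set:
        # pytesseract.pytesseract.tesseract_cmd = path
        print(f"Tesseract found at: {path}")
    else:
        print("Tesseract not found - check the installation path.")
```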
Step 2 – Install Poppler
Next, we install pdf2image. This is our library for converting PDFs into images, and it requires Poppler in the background. Poppler is an open-source PDF rendering library used to display PDF files.
For this, we download the latest version of Poppler, extract the ZIP file, and move the extracted folder to the C: drive.
GitHub – Poppler Windows Releases
Inside the folder, click on Library > bin and save the path where you stored the folder on your C: drive. On my machine, it looks like this:
C:\Users\schue\poppler-26.02.0\Library\bin
Additionally, we add the path to the PATH variable so Windows knows where Poppler is located.
Hint for Beginners:
Press the Windows key and search for Edit environment variables. Afterwards click on Edit the system environment variables. Then click on Environment Variables. Under User variables, select the variable PATH, click on Edit, then New, and paste the path.
Now restart CMD so the changes are applied.

Step 3 – Install Python Libraries
Now we install all the Python libraries we need (pip install pytesseract pdf2image pillow fpdf2 ollama). Make sure you reactivate the Python environment beforehand:
- pytesseract: We install this library as the bridge between Python and Tesseract. We already installed Tesseract as the OCR engine, but only with pytesseract can Python communicate with it directly.
- pdf2image: Tesseract is an OCR engine, which means it recognizes text from pixels in an image. It cannot read PDF structures directly. pdf2image therefore performs an intermediate step: it renders each PDF page as an image, similar to a screenshot, so that pytesseract can analyze it afterwards. Note: If we had digital PDFs (meaning PDFs where you can select and copy text), we could extract the text directly using libraries such as pdfplumber or PyMuPDF. However, since we assume that B2B order forms are often scans in practice, we take the detour through pdf2image.
- pillow: pdf2image and pytesseract use this image-processing library in the background (we don’t see the usage directly in the code) to process images correctly.
- fpdf2: We use this library to automatically generate two test PDFs (Layout A and Layout B) via script for the article’s example.
- ollama: This library allows our Python script to send messages to the LLM and receive responses.
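To confirm that this step worked, a quick standard-library check (a convenience sketch of mine, not part of the article’s scripts) can list which of the five packages are actually present in the active environment:

```python
# Sketch: verify that the five libraries from this step are installed.
# Uses only the standard library; a missing package shows up as None.
from importlib import metadata

REQUIRED = ["pytesseract", "pdf2image", "pillow", "fpdf2", "ollama"]

def check_installed(names):
    """Map each distribution name to its installed version, or None."""
    result = {}
    for name in names:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result

if __name__ == "__main__":
    for name, version in check_installed(REQUIRED).items():
        print(f"{name}: {version if version else 'MISSING'}")
```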

Step 4 – Install Ollama and Download LLaMA 3
Once the installation of the libraries has succeeded, we install Ollama and LLaMA 3 as the LLM. Ollama is the tool that allows us to run LLMs completely free, locally on our laptop, and without API keys.
First, we install Ollama. If you have not already done this, you can download the Windows installer from Ollama and execute it.
Afterwards, we download LLaMA 3 using the following command:
ollama pull llama3
Depending on your internet connection, this step may take a while since roughly 4.7 GB are downloaded. However, we can see a progress bar in the terminal.

Afterwards, we verify whether everything worked:
ollama list
If you see something similar to the screenshot, it worked successfully.

Step 5 – Create the Project Folder and Generate Test PDFs
For this comparison, we create two B2B order forms for Alpha GmbH and Beta AG that contain the same information but use different layouts. In this example, we assume that the order forms are scans, which is why we previously installed pdf2image (for digital PDFs, this would also be possible with libraries such as pdfplumber or PyMuPDF).
First, we create a project folder to store all files in:
mkdir document_extractor
cd document_extractor
Next, we create a new file called create_test_pdfs.py and insert the code that you can find in this GitHub Gist. We save this file inside the previously created folder document_extractor:
https://gist.github.com/Sari95/a52a62eb78e0604c4d8c64f5cdd1160a
Now we return to the terminal and execute the file:
python create_test_pdfs.py
Inside the folder, we can now see the two newly created PDFs:

In the two PDFs, we can already see the problem:
- They contain the same information.
- But the PDFs use completely different field names and a different date format.
Approach 1: The Traditional Way (pytesseract + Regex Rules)
The traditional approach works in two steps:
- First, we convert the PDF into an image. Afterwards, we use pytesseract to read the image and extract the raw text via OCR (Optical Character Recognition). Put simply, OCR means that the tool “looks” at the image and tries to recognize letters from pixels, quite similar to how humans decipher handwritten notes.
- In the second step, we use regex. These are regular expressions that search for specific patterns inside the text. For example, we can define: “Search for everything that comes after PO Number:.”
Already in this second step, we can identify the first problem: what happens if the customer simply writes “Order Reference” instead of “PO Number: ”?
In that case, the regex pattern finds nothing. What we can then do (or must do) is add a new rule.
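The maintenance problem fits in a few lines. Here is a minimal sketch (field labels taken from the article’s example; the actual script is in the Gist below): every label variant must be enumerated explicitly in the pattern, and any unknown label silently yields nothing.

```python
import re

# One alternation per known label - every new customer label means
# editing, testing, and redeploying this pattern.
PO_PATTERN = re.compile(r"(?:PO Number|Order ID|Order Reference)\s*:\s*(\S+)")

def extract_po(text):
    """Return the purchase order number, or None if no known label matches."""
    match = PO_PATTERN.search(text)
    return match.group(1) if match else None

print(extract_po("PO Number: 4711"))        # -> 4711
print(extract_po("Order Reference: 4711"))  # -> 4711
print(extract_po("Purchase Ref: 4711"))     # -> None (unknown label)
```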
Execute Script 1 for Approach 1
Next, we create a new file called approach1_traditional.py, inside the same folder, with the code that you can find in the GitHub Gist:
https://gist.github.com/Sari95/aa2be6938fbcb1c7f94b053d9046f55d
Now we execute the file again inside the terminal:
python approach1_traditional.py
The Result of Approach 1
For Layout A, everything works perfectly:
For Layout B? Not a single field is recognized, and all values return “None”:

And this is exactly where the problem lies. For every new customer, new regex rules have to be written, tested, and deployed. With 200 customers, that means 200 different patterns. And every time a customer slightly changes their form, the system breaks again.
Approach 2: A New Way (pytesseract + Ollama + LLaMA 3)
In this second approach, we keep the OCR step but replace the rigid regex rules with an LLM:
- pytesseract still reads the text from the PDF.
- Instead of telling the code “Search for PO Number: ”, we tell the LLM: “Here is an order document. Extract these fields for me, regardless of how they are named.”
The LLM understands the semantic context. It recognizes that “Order Reference” and “PO Number” mean the same thing, even without an explicit rule.
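The actual extraction script is in the Gist in the next step; as a rough sketch of the idea (field names and prompt wording are my assumptions, not the article’s exact code), the prompt names the target fields without tying them to any particular label, and the local LLaMA 3 model is called through the ollama library:

```python
import json

# Target fields - assumed to mirror the article's example documents.
FIELDS = ["customer_id", "po_number", "delivery_date", "items"]

def build_prompt(ocr_text, fields=FIELDS):
    """Build an extraction prompt that names fields but no specific labels."""
    return (
        "Here is an order document. Extract these fields as JSON, "
        f"regardless of how they are labeled: {', '.join(fields)}.\n\n"
        f"Document:\n{ocr_text}"
    )

def extract_with_llm(ocr_text, model="llama3"):
    """Send the OCR text to a local Ollama model and parse the JSON reply."""
    import ollama  # imported lazily: only needed when actually calling the LLM
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": build_prompt(ocr_text)}],
        format="json",  # ask Ollama to constrain the output to valid JSON
    )
    return json.loads(response["message"]["content"])
```

The format="json" option makes the model return machine-readable output instead of free text, which is what lets the pipeline assign the values to fixed fields afterwards.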
Execute Script 2 for Approach 2
Now we create a new file called approach2_llm.py, inside the same folder, with the code that you can find in the GitHub Gist:
https://gist.github.com/Sari95/d4e9e83490a9fbf34a3776d1604f8742
Now we execute the file again inside the terminal. Make sure that Ollama is still running in the background:
python approach2_llm.py
The Result of Approach 2
What we can now see is that both layouts are correctly recognized:

For both layouts, the information from the differently named fields is correctly extracted and assigned, even though not a single regex expression was adjusted and no new template was created. The LLM understands both layouts because it reads the context. Additionally, the date format from Layout B is directly normalized to match the format from Layout A.
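The date normalization the LLM performs implicitly is something a rule-based pipeline would have to do explicitly, one format at a time. A minimal sketch (the concrete formats are my assumptions for illustration; the article only says the two layouts use different date formats):

```python
from datetime import datetime

# Candidate input formats - assumed examples; every new customer format
# means adding another entry here.
KNOWN_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y"]

def normalize_date(raw, formats=KNOWN_FORMATS):
    """Parse a date in any known format and return ISO (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unknown format - yet another rule to maintain

print(normalize_date("12.03.2025"))  # -> 2025-03-12
print(normalize_date("2025-03-12"))  # -> 2025-03-12
```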
2 – Head-to-Head Comparison
After both tests, one thing quickly becomes clear: technically, both approaches solve the same problem.
Both approaches have their own advantages and disadvantages:

With regex-based pipelines, the complexity lives in the rules and the maintenance effort. With LLM-based pipelines, the complexity shifts toward infrastructure, inference time, and model behavior. For medium-sized companies processing many customer-specific layouts, that trade-off can become strategically more important than pure extraction accuracy.
3 – When should we NOT use an LLM?
At the moment, it often feels as if every existing automation process suddenly needs to be replaced with AI or LLMs.
In practice, however, this is not always the better solution. Medium-sized companies in particular usually don’t need to build the “most modern” solution, but rather the one that remains stable, maintainable, and economically reasonable in the long run. Depending on the situation, that can be the traditional regex-based approach, while in other cases switching to an LLM can make more sense.
Some situations where the traditional approach may still be the more suitable option:
- The documents are stable and standardized:
If a company only processes a few known layouts and these rarely change, regex is often the better solution. Why? Because the additional benefit of an LLM becomes small, while the overall system complexity increases. A stable rule-based process, on the other hand, is faster, cheaper, easier to debug, and easier to hand over to new people.
- Speed and throughput are critical:
In our example, the LLM processes one document within 20–40 seconds. At first, that sounds acceptable. But once we imagine ourselves inside a real production environment, the perspective changes quickly. A medium-sized company probably processes orders, delivery notes, invoices, customs documents, support documents, etc. And not 10 times per day, but 10,000 times per day. In this situation, inference time suddenly becomes a real infrastructure issue. Regex-based systems run significantly faster, while LLMs require more RAM, more CPU/GPU power, and often additional queueing or batch-processing mechanisms.
- Explainability is more important than flexibility:
Especially in regulated industries such as pharma, insurance, banking, or healthcare, it is often critical to fully understand why a specific value was extracted. Regex rules are clearly deterministic: one line of code produces one clearly explainable result. LLMs, on the other hand, work probabilistically: the model interprets the context and returns the most likely result. This is exactly what makes LLMs flexible, but at the same time also more difficult to audit.
- The company doesn’t have the right infrastructure:
In our example, we used Ollama. Getting started was generally simple. However, it shouldn’t be underestimated that memory consumption, GPU resources, monitoring, or response times under load can look very different when working with LLMs in production.
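The throughput concern from the second point above can be made concrete with quick back-of-the-envelope math (using 30 seconds as an assumed midpoint of the observed 20–40 second range):

```python
# Back-of-the-envelope throughput check using the article's numbers:
# ~30 s per document with the local LLM vs. 10,000 documents per day.
SECONDS_PER_DOC = 30          # assumed midpoint of the observed 20-40 s
DOCS_PER_DAY = 10_000

total_hours = SECONDS_PER_DOC * DOCS_PER_DAY / 3600
print(f"Sequential LLM processing: {total_hours:.0f} h per day")  # -> 83 h

# A single sequential worker cannot even finish one day's volume within a
# day - hence the need for batching, queueing, or more hardware.
workers_needed = -(-SECONDS_PER_DOC * DOCS_PER_DAY // 86_400)  # ceiling division
print(f"Parallel workers needed just to keep up: {workers_needed}")  # -> 4
```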
On my Substack Data Science Espresso, I share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning, and Tech — made for curious minds like yours.
Take a look and subscribe on Medium or on Substack if you want to stay in the loop.
4 – Final Thoughts
Choosing the right approach is not necessarily a technical question, but rather a strategic one.
The traditional approach tries to explicitly describe every possible document. The LLM-based approach instead tries to understand meaning and context. For small and stable environments, the traditional approach is often completely sufficient. The more layouts and edge cases appear, the more difficult it becomes to keep the rules maintainable in the long run. That is exactly where LLMs start to become interesting.
It can also be an exciting entry-level use case for a company: start working with an LLM here and, in doing so, make the company ready for AI and gain initial practical experience.
