
LangChain for EDA: Build a CSV Sanity-Check Agent in Python


LLMs generate text; agents carry out actions.

That’s exactly what we’re going to explore in today’s article.

In this article, we’ll use LangChain and Python to build our own CSV sanity-check agent. With this agent, we’ll automate typical exploratory data analysis (EDA) tasks such as showing columns, detecting missing values (NaNs), and retrieving descriptive statistics.

Agents decide step by step which tool to call and when to answer a question about our data. This is a big difference from an application in the traditional sense, where the developer defines how the process works (e.g., via if-else branches), as sketched below. It also goes far beyond simple prompting, because we’re building a system that acts (albeit in a simple way) and doesn’t just talk.
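To make the contrast concrete, here is a minimal schematic sketch (illustration only; get_schema and get_nulls are stand-ins for the tools we define later):

def get_schema() -> str: ...   # stand-in tool
def get_nulls() -> str: ...    # stand-in tool

# Traditional application: the developer hard-codes the routing.
def answer_traditional(question: str) -> str:
    if "columns" in question:
        return get_schema()
    elif "missing" in question:
        return get_nulls()
    return "I don't know."

# Agent: the LLM decides at runtime which tool to call, in which
# order, and when it has enough information to answer.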

This article is for you if you:

  • …work with Pandas and want to automate EDA.
  • …find LLMs exciting but have little experience with LangChain so far.
  • …want to understand how agents really work (from setup to mini-evaluation) using a simple example.

Table of Contents
What we build & why
Hands-On Example: CSV Sanity-Check Agent with LangChain
Mini-Evaluation
Final Thoughts – Pitfalls, Tips and Next Steps
Where Can You Continue Reading?

What we build & why

An agent is a system to which we assign tasks. The system then decides for itself which tools to use to solve those tasks.

This requires three components:

Agent = LLM + Tools + Control logic

Let’s take a closer look at the three components:

  • The LLM provides the intelligence: it understands the question, plans steps, and decides what to do.
  • The tools are small Python functions that the agent is allowed to call (e.g., get_schema() or get_nulls()): they provide specific information from the data, such as column names or statistics.
  • The control logic (policy) ensures that the LLM doesn’t answer immediately, but first decides whether it should use a tool. It proceeds step by step: first the question is analyzed, then the appropriate tool is selected, then the result is interpreted and, if necessary, a next step is chosen, and finally a response is returned.

Instead of manually describing all the data as in classic prompting, we hand the responsibility over to the agent: the system should act on its own, but only with the tools provided.

Let’s look at a simple example:

A user asks: “What’s the average age in the CSV?”

At this point, the agent calls the tool we’ve defined around df.describe(). The output is a clearly structured value (e.g., “mean”: 29.7). Here we can also see how this reduces hallucinations: the system knows exactly what to call and cannot return a vague answer such as “Probably between 20 and 40.”
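Under the hood, the deterministic part is plain pandas. A minimal sketch of the kind of lookup such a tool performs (assuming the titanic.csv we create later):

import pandas as pd

df = pd.read_csv("titanic.csv")
print(round(df["age"].mean(), 1))  # 29.7, a concrete value instead of a guess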

LangChain as a framework

We use the LangChain framework for the agent. It allows us to connect LLMs with tools and build systems with defined behavior, so the system can perform actions instead of just providing answers or generating text. A detailed explanation would make this article too long, but in a previous article you’ll find an introduction to LangChain and a comparison with Langflow: LangChain vs Langflow: Build a Simple LLM App with Code or Drag & Drop.

What the agent does for us

When we receive a new CSV, we usually ask ourselves the following questions first (the start of exploratory data analysis):

  • What columns are there?
  • Where is data missing?
  • What do the descriptive statistics look like?

This is exactly what we want the agent to do automatically.

Tools we define for the agent

For the agent to work, it needs clearly defined tools. It’s best to define them as small, specific, and controlled as possible. This way we avoid errors, hallucinations, and unclear outputs, because such tools make the output deterministic. They also make the agent reproducible and testable, because the same input should produce a consistent result.

In our example, we define three tools:

  • schema: returns column names and data types.
  • nulls: shows columns with missing values (including counts).
  • describe: provides descriptive statistics for numeric columns.

Later, we will add a small mini-evaluation to make sure our agent works correctly.

Why is this an agent and not an app?

We’re not building a classic program with a fixed sequence (e.g., using if-else). Instead, the model plans for itself based on the question, selects the appropriate tool, and combines steps as needed to arrive at an answer:

Visualization by the author.

Hands-On Example: CSV Sanity-Check Agent with LangChain

1) Setup

Prerequisite: Python 3.10 or higher must be installed; many packages in the AI tooling world require ≥ 3.10. You can find the code and the link to the repo below.

Tip for beginners:
You can check this by entering python --version in cmd.exe.

With the code below, we first create a new project folder, create an isolated Python environment, and activate it. We do this so that packages and versions stay reproducible and don’t conflict with other projects.

Tip for beginners:
I work with Windows. We open a terminal with Windows + R > cmd and paste the following code.

mkdir csv-agent
cd csv-agent
python -m venv .venv
.venv\Scripts\activate
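Tip for beginners:
On macOS/Linux, the activation command is different:

source .venv/bin/activate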

Then we install the required packages:

pip install "langchain>=0.2,<0.3" "langchain-openai>=0.1.7" "langchain-community>=0.2" pandas seaborn

With this command, we pin LangChain to the 0.2 line and install the OpenAI connector and the community package. We also install pandas for the EDA functions and seaborn for loading the Titanic sample dataset.
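Optionally, you can keep the same pins in a requirements.txt (an equivalent alternative, not used in this tutorial) and install them in one go with pip install -r requirements.txt:

langchain>=0.2,<0.3
langchain-openai>=0.1.7
langchain-community>=0.2
pandas
seaborn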

The image shows creating an environment and installing packages.
Screenshot taken by the author.

Tip for beginners:
If you don’t want to use OpenAI, you can work locally with Ollama (e.g., with Llama or Mistral). This option is included later in the code.

2) Prepare the dataset in prepare_data.py

Next, we create a Python file called prepare_data.py. I use Visual Studio Code for this, but you can also use another IDE. In this file, we load the Titanic dataset, as it is publicly available.

# prepare_data.py
import seaborn as sns
df = sns.load_dataset("titanic")
df.to_csv("titanic.csv", index=False)
print("Saved titanic.csv")

With seaborn.load_dataset("titanic"), we load the public dataset (891 rows plus a header row with the column names) directly into memory and save it as titanic.csv. The dataset contains only numeric, Boolean, and categorical columns, making it perfect for an EDA agent.

Tips for beginners:

  • sns.load_dataset() requires internet access (the data comes from the seaborn repo).
  • Save the file in the project folder (csv-agent) so that main.py can find it.

In the terminal, we execute the Python file with the following command, so that the titanic.csv file ends up in the project:

python prepare_data.py

We then see in the terminal that the CSV has been saved, and the titanic.csv file appears in the folder:

The image shows the result in the terminal after the csv is saved.
Screenshot taken by the author.
The image shows the folder structure of the project.
Screenshot taken by the author.

Side Note – Titanic dataset

The analysis is based on the Titanic dataset (OpenML ID 40945), which is marked as public on OpenML.

When we open the file, we see the following 14 columns and 891 rows of data. The Titanic dataset is a classic example for exploratory data analysis (EDA). It contains information on 891 passengers of the Titanic and is often used to analyze the relationship between characteristics (e.g., gender, age, ticket class) and survival.

The image shows the Titanic dataset in Excel.
Screenshot taken by the author.

Here are the 14 columns with a brief explanation:

  • survived: Survived (1) or did not survive (0).
  • pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
  • sex: Gender of the passenger.
  • age: Age of the passenger (in years; may be missing).
  • sibsp: Number of siblings/spouses on board.
  • parch: Number of parents/children on board.
  • fare: Fare paid by the passenger.
  • embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
  • class: Ticket class as text (First, Second, Third). Corresponds to pclass.
  • who: Categorization “man,” “woman,” “child.”
  • adult_male: Boolean field: was the passenger an adult male (True/False)?
  • deck: Cabin deck (often missing).
  • embark_town: City of the port of embarkation (Cherbourg, Queenstown, Southampton).
  • alone: Boolean field: did the passenger travel alone (True/False)?

Optional for advanced readers
If you want to trace and evaluate your agent runs later, you can use LangSmith.

3) Define tools in main.py

Next, we define the individual tools. To do this, we create a new Python file called main.py and save it in the csv-agent folder as well. We add the following code to it:

# main.py
import os, json
import pandas as pd

# --- 0) Load the CSV ---
DF_PATH = "titanic.csv"
df = pd.read_csv(DF_PATH)

# --- 1) Define tools as small, concise commands ---
# IMPORTANT: Tools return strings (in this case, JSON strings) so that the LLM sees clearly structured responses.

from langchain_core.tools import tool

@tool
def tool_schema(dummy: str) -> str:
    """Returns column names and data types as JSON."""
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return json.dumps(schema)

@tool
def tool_nulls(dummy: str) -> str:
    """Returns columns with the number of missing values as JSON (only columns with >0 missing values)."""
    nulls = df.isna().sum()
    result = {col: int(n) for col, n in nulls.items() if n > 0}
    return json.dumps(result)

@tool
def tool_describe(input_str: str) -> str:
    """
    Returns describe() statistics.
    Optional: input_str can contain a comma-separated list of columns, e.g. "age, fare".
    """
    cols = None
    if input_str and input_str.strip():
        cols = [c.strip() for c in input_str.split(",") if c.strip() in df.columns]
    stats = df[cols].describe() if cols else df.describe()
    # Serialize the describe() DataFrame as CSV text so it stays readable for the LLM:
    return stats.to_csv(index=True)

After importing the required packages, we load titanic.csv into df once and define three small, narrowly scoped tools. Let’s take a closer look at each of them:

  • tool_schema returns the column names and data types as JSON. This gives us an overview of what we’re dealing with and is usually the first step in any data analysis. Even if a tool doesn’t need input (like schema), it must still accept one argument, because the agent always passes a string; we simply ignore it.
  • tool_nulls counts missing values per column and returns only the columns with missing values.
  • tool_describe calls df.describe(). Note that this tool only works on numeric columns; strings and Booleans are ignored. This is an important step in the sanity check or EDA, as it lets us quickly see the mean, min, max, etc. of the different columns. For large CSVs, describe() can take a long time; in that case you could integrate sampling logic such as df.sample(n=10000), as sketched after this list.

These tools are the controlled interfaces through which the LLM is allowed to access the data. They are deterministic and therefore reproducible. Tools should ideally be clear and limited: in other words, they should have exactly one function or task.
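For very large files, the sampling idea mentioned above could look like this (a sketch only, with an assumed threshold of 10,000 rows; it would sit below the other tools in main.py):

@tool
def tool_describe_sampled(dummy: str) -> str:
    """Returns describe() statistics, computed on a random sample for large files."""
    # Assumed threshold: full describe() up to 10,000 rows, otherwise a fixed-seed sample.
    data = df if len(df) <= 10_000 else df.sample(n=10_000, random_state=42)
    return data.describe().to_csv(index=True)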


Why do we need tools at all?

An LLM can generate text, but it cannot directly “see” data. For the LLM to work meaningfully with a CSV, we need to provide interfaces. That’s exactly what tools are for:

Tools are small Python functions that the agent is allowed to call. Instead of leaving everything open, we only allow very specific, reproducible actions.


What exactly does the code do?

With the @tool decorator, LangChain automatically infers the tool’s name, description, and argument schema from the function signature and docstring. This means we only need to write the function itself; LangChain takes care of the rest.

  • The model passes arguments that match the tool’s schema (typically JSON). In this tutorial we keep things simple and accept a single string argument (e.g., input_str: str, or a dummy string we ignore).
  • Tools always return a string (text). JSON is ideal for structured data, which we produce with return json.dumps(…).
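You can check what the decorator inferred, and try a tool in isolation, before wiring anything to the agent (an optional sanity check, e.g. in a REPL after the tool definitions):

print(tool_nulls.name)              # "tool_nulls"
print(tool_nulls.description)       # the docstring text
print(tool_nulls.invoke("unused"))  # e.g. {"age": 177, "deck": 688, ...}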
This image shows how the agent uses multi-step reasoning with tools.
Visualization by the author.

This is a multi-step thought process. The LLM plans iteratively: instead of responding directly, it thinks step by step, decides which tool to call, interprets the result, and may continue until it has enough information to answer.

4) Register the tools for LangChain in main.py

We add the code below to the same main.py file to register the previously defined tools for the agent:

# --- 2) Register tools for LangChain ---

tools = [tool_schema, tool_nulls, tool_describe]

With this code, we simply collect the decorated functions into a list. Each function has already been converted into a LangChain tool by the @tool decorator.

5) Configure the LLM in main.py

Next, we configure the LLM that the agent uses. Here, you can either use the OpenAI variant or an open-source model via Ollama.

I used OpenAI, which is why we first need to set the API key.

At OpenAI, we create a new API key:

The image shows how to create an API-Key in OpenAI.
Screenshot taken by the author.

We then copy it immediately (it will not be displayed again later) and set it as an environment variable in the terminal with the following command:

setx OPENAI_API_KEY "your_key"

It is important to restart cmd and reactivate .venv afterwards. We can use echo %OPENAI_API_KEY% to check whether the API key has been saved.

The image shows how to check in the terminal, if the API-Key was saved.
Screenshot taken by the author.

Now we add the following code to the end of main.py:

# --- 3) Configure the LLM ---
# Option A: OpenAI (simple)
#   export OPENAI_API_KEY=...    # Windows: setx OPENAI_API_KEY "YOUR_KEY"
#   Use a lower temperature for more stable tool usage
USE_OPENAI = bool(os.getenv("OPENAI_API_KEY"))

if USE_OPENAI:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
else:
    # Option B: Local with Ollama (make sure to pull the model first, e.g. 'ollama run llama3')
    from langchain_community.chat_models import ChatOllama
    llm = ChatOllama(model="llama3.1:8b", temperature=0.1)

The code uses OpenAI if an OPENAI_API_KEY is available; otherwise it falls back to a local Ollama model.

We set the temperature to 0.1. This makes the responses more deterministic, which is important for the test later on.

We also use gpt-4o-mini as the LLM. This is a lightweight model from OpenAI with a focus on tool usage.

Tip for beginners:
The temperature determines how creatively an LLM responds. At 0.0, it responds almost deterministically: the model nearly always returns the same answer for the same input, which is good for structured tasks such as tool usage, code, or facts. At 1.0, the model responds more creatively and varies more, suggesting different formulations or solutions, which is great for brainstorming or text ideas.
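Purely as an illustration of the two extremes (an assumed snippet, not part of main.py):

from langchain_openai import ChatOpenAI

llm_factual = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)   # near-deterministic: tools, code, facts
llm_creative = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)  # more varied: brainstorming, text ideas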

6) Define the agent’s behavior in main.py using the policy

In this step, we define how the agent should behave. The system prompt sets the policy.

# --- 4) Narrow Policy/Prompt (Agent Behavior) ---
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

SYSTEM_PROMPT = (
    "You are a data-focused assistant. "
    "If a question requires information from the CSV, first use an appropriate tool. "
    "Use only one tool call per step if possible. "
    "Answer concisely and in a structured way. "
    "If no tool matches, briefly explain why.\n\n"
    "Available tools:\n{tools}\n"
    "Use only these tools: {tool_names}."
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

_tool_desc = "\n".join(f"- {t.name}: {t.description}" for t in tools)
_tool_names = ", ".join(t.name for t in tools)
prompt = prompt.partial(tools=_tool_desc, tool_names=_tool_names)

First, we import ChatPromptTemplate to structure our agent’s prompt. The most important part of the code is the system prompt: it defines the policy, i.e., the “rules of the game” for the agent. In it, we specify that the agent may only use one tool per step, that it should answer concisely, and that it may only use the tools we’ve defined.

With the last two lines of the system prompt, we ensure that {tools} lists all available tools with their descriptions, and with {tool_names} we ensure that the agent can only use these names and cannot invent fantasy tools.

In addition, we use MessagesPlaceholder("agent_scratchpad"). This is where the agent stores its intermediate steps: which tools it has called and which results it has received. This allows it to continue its own chain of reasoning until it arrives at a final answer.

7) Create the tool-calling agent in main.py

In the last step, we define the agent:

# --- 5) Create & Run Tool-Calling Agent ---
from langchain.agents import create_tool_calling_agent, AgentExecutor

agent = create_tool_calling_agent(llm=llm, tools=tools, prompt=prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=False,   # optional: True for debug logs
    max_iterations=3,
)

if __name__ == "__main__":
    user_query = "Which columns have missing values? List 'Column: Count'."
    result = agent_executor.invoke({"input": user_query})
    print("\n=== AGENT ANSWER ===")
    print(result["output"])

With create_tool_calling_agent, we connect our LLM, the tools, and the prompt to form a tool-calling agent.

To make sure the process runs smoothly, we use the AgentExecutor. It takes care of the so-called agent loop: the agent first plans what needs to be done, then calls a tool, receives the result, and decides whether another tool is needed or whether it can give the final answer. This cycle repeats until the result is ready.

With verbose=True, we can view the intermediate steps in the terminal, which is extremely helpful for debugging. For example, we can see which tool was called when, or what data was returned. Once everything runs smoothly, we can set it back to False to keep the output cleaner.

With max_iterations=3, we limit how many reasoning–tool–response cycles the agent may perform. This helps prevent infinite loops or excessive tool calls. In our example, the agent might reasonably call schema → nulls → describe before answering.

With the last part of the code, the agent is executed with the sample input “Which columns have missing values?”. The result is printed in the terminal.

Tip for beginners:
if __name__ == "__main__": is a common Python pattern: if we execute the file directly in the terminal with python main.py, the code in this block runs. However, if we only import the file (e.g., later in the mini_eval.py file), this block is skipped. This allows us to use the file as a standalone script or reuse it as a module in other projects.

8) Run the script: run main.py in the terminal.

Now we enter python main.py in the terminal to start the agent. We then see the final answer in the terminal:

The image shows the result that the agent shows in the terminal (how many missing values).
Screenshot taken by the author.

Mini-Evaluation

Finally, we want to test our agent, which we do with a small evaluation. This ensures that the agent behaves correctly and doesn’t introduce any regressions when we change something in the code later on.

At the end of main.py, we add the code below:

def ask_agent(query: str) -> str:
    return agent_executor.invoke({"input": query})["output"]

With ask_agent, we encapsulate the agent call in a function that simply returns a string. This allows us to call the agent later from other files.

The if __name__ == "__main__": block ensures that a test run is only performed when main.py is called directly. If, on the other hand, we import main into another file, only the function is provided.
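With that in place, any other script in the project folder can import and query the agent, for example (a hypothetical one-liner):

from main import ask_agent

print(ask_agent("Which columns have missing values?"))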

Now we create the mini_eval.py file and insert the following code:

# mini_eval.py

from main import ask_agent

tests = [
    ("Which columns have missing values?", ["age", "embarked", "deck", "embark_town"]),
    ("Show me the first 3 columns with their data types.", ["survived", "pclass", "sex"]),
    ("Give me a statistical summary of the 'age' column.", ["mean", "min", "max"]),
]

def passed(q, out, must_include):
    # Case-insensitive keyword check: every expected token must appear in the answer.
    text = out.lower()
    return all(m.lower() in text for m in must_include)

if __name__ == "__main__":
    ok = 0
    for q, must in tests:
        out = ask_agent(q)
        result = passed(q, out, must)
        print(f"[{'OK' if result else 'FAIL'}] {q}\n{out}\n")
        ok += int(result)
    print(f"Passed {ok}/{len(tests)}")

In the code, we define three test cases. Each test consists of a question for the agent and a list of keywords that must appear in the answer. The passed() function checks whether these keywords are included.

Expected test results

  • Test 1: “Which columns have missing values?”
    Expected: the output mentions age, deck, embarked, embark_town.
  • Test 2: “Show me the first 3 columns with their data types.” Expected: the output contains survived, pclass, sex with types such as int64 or object.
  • Test 3: “Give me a statistical summary of the ‘age’ column.” Expected: the output contains mean ≈ 29.7, min = 0.42, max = 80.

If everything runs correctly, the script reports “Passed 3/3” at the end.
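You can confirm the reference values independently of the agent with plain pandas:

import pandas as pd

df = pd.read_csv("titanic.csv")
print(df["age"].agg(["mean", "min", "max"]))  # mean ≈ 29.70, min 0.42, max 80.0
print(df.isna().sum()[lambda s: s > 0])       # age, embarked, deck, embark_town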

We get this output in the terminal, so the test works:

The image shows the result of the mini-evaluation.
Screenshot taken by the author.

You can find the code & the CSV in the repo on GitHub.

On my Substack Data Science Espresso, I share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning, and Tech — made for curious minds like yours.

Take a look and subscribe on Medium or on Substack if you want to stay in the loop.


Final Thoughts – Pitfalls, Tips and Next Steps

LangChain is very practical for this example because it already includes, and nicely illustrates, the entire agent loop (planning, tool calling, control). For small or clearly structured tasks, however, alternatives such as pure function calling (e.g., via the OpenAI API) or classic EDA frameworks like Great Expectations can be sufficient. That said, LangChain does add some overhead: if you only need fixed EDA checks, a plain Python script can be leaner and faster. LangChain is especially worthwhile when you want to extend things flexibly or orchestrate several tools and agents.

When working with agents, there are a few things you should keep in mind:

One common pitfall is unclear tool descriptions: if the descriptions are too vague, the model can easily choose the wrong tool (misrouting). With precise and concrete descriptions, we can greatly reduce this.
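For example (an assumed comparison, not from the repo), the difference between a vague and a precise description:

from langchain_core.tools import tool

@tool
def stats(dummy: str) -> str:
    """Gets data info."""  # vague: the model may misroute
    ...

@tool
def nulls_per_column(dummy: str) -> str:
    """Returns columns with the number of missing values as JSON."""  # precise: easy to route
    ...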

Another important point is testing: even a small mini-evaluation with three simple tests helps detect regressions (errors that would otherwise go unnoticed after later changes) at an early stage.

It’s also worth starting small: in our example, we only worked with three clearly defined tools, but now we know that they work reliably.

With regard to this agent, it can also be useful to incorporate sampling (for example, df.sample(n=10000)) for very large CSV files to avoid performance issues. Keep in mind that LLM agents can also become costly if every question triggers several tool calls.

In this article, we built a single agent that checks CSV files. In practice, several agents often work together: for example, one agent could ensure data quality while a second agent creates visualizations. Such multi-agent systems are the next step in solving more complex tasks.

As a next step, we could also incorporate LangGraph to extend the agent loop with state and orchestration. This would allow us to assemble agents as in a flowchart, including interruptions, memory, or more flexible control logic.

Finally, in our example, we manually defined the three tools schema, nulls, and describe. With the Model Context Protocol (MCP), we could connect tools in a standardized way, for example to databases, APIs, or IDEs.

Where Can You Continue Reading?
