, agents carry out actions.
That's exactly what we're going to look at in today's article.
In this article, we'll use LangChain and Python to build our own CSV sanity-check agent. With this agent, we'll automate typical exploratory data analysis (EDA) tasks such as displaying columns, detecting missing values (NaNs), and retrieving descriptive statistics.
Agents decide step by step which tool to call and when to answer a question about our data. This is a big difference from an application in the traditional sense, where the developer defines how the process works (e.g., via if-else logic). It also goes far beyond simple prompting, because we're building a system that acts (albeit in a simple way) and doesn't just talk.
This article is for you if you:
- …work with Pandas and want to automate EDA.
- …find LLMs exciting but have little experience with LangChain so far.
- …want to understand how agents really work (from setup to mini-evaluation) using a simple example.
Table of Contents
What we build & why
Hands-On Example: CSV Sanity-Check Agent with LangChain
Mini-Evaluation
Final Thoughts – Pitfalls, Tips and Next Steps
Where Can You Continue Learning?
What we build & why
An agent is a system to which we assign tasks. The system then decides for itself which tools to use to solve those tasks.
This requires three components:
Agent = LLM + Tools + Control logic
Let's take a closer look at the three components:
- The LLM provides the intelligence: it understands the question, plans steps, and decides what to do.
- The tools are small Python functions that the agent is allowed to call (e.g., get_schema() or get_nulls()): they supply specific information from the data, such as column names or statistics.
- The control logic (policy) ensures that the LLM doesn't answer immediately, but first decides whether it should use a tool. It proceeds step by step: first the question is analyzed, then the appropriate tool is selected, then the result is interpreted and, if necessary, a next step is chosen, and finally a response is returned.
Instead of manually describing all the data as in classic prompting, we hand the task over to the agent: the system should act on its own, but only with the tools provided.
Let's look at a simple example:
A user asks: "What's the average age in the CSV?"
At this point, the agent calls the tool we've defined around df.describe(). The output is a clearly structured value (e.g., "mean": 29.7). Here we can also see how this reduces hallucinations: the system knows which tool to use and can't return an answer such as "Probably between 20 and 40."
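To make this concrete, here is a minimal sketch of such a tool. Note that get_mean_age is a hypothetical name for illustration, not one of the tools we build later, and it assumes the Titanic CSV we create further below:
# Hypothetical mini-tool: returns a structured value instead of letting the LLM guess
import json
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumes the CSV from the tutorial below

def get_mean_age() -> str:
    """Return the mean of the 'age' column as a structured JSON string."""
    return json.dumps({"mean": round(df["age"].mean(), 1)})

print(get_mean_age())  # -> {"mean": 29.7}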
LangChain as a framework
We use the LangChain framework for the agent. It allows us to connect LLMs with tools and build systems with defined behavior. The system can perform actions instead of just providing answers or generating text. A detailed explanation would make this article too long, but in a previous article you can find an explanation of LangChain and a comparison with Langflow: LangChain vs Langflow: Build a Simple LLM App with Code or Drag & Drop.
What the agent does for us
When we receive a new CSV, we usually ask ourselves the following questions first (the start of exploratory data analysis):
- What columns are there?
- Where is data missing?
- What do the descriptive statistics look like?
This is exactly what we want the agent to do automatically.
Tools we define for the agent
For the agent to work, it needs clearly defined tools. It's best to define them as small, specific, and controlled as possible. This way, we avoid errors, hallucinations, and unclear outputs, because narrow tools make the output deterministic. They also make the agent reproducible and testable, because the same input should produce a consistent result.
In our example, we define three tools:
- schema: Returns column names and data types.
- nulls: Shows columns with missing values (including counts).
- describe: Provides descriptive statistics for numeric columns.
Later, we'll add a small mini-evaluation to make sure that our agent is working correctly.
Why is this an agent and not an app?
We aren't building a classic program with a fixed sequence (e.g., using if-else). Instead, the model plans for itself based on the question, selects the appropriate tool, and combines steps as needed to arrive at an answer:
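For contrast, here is a minimal sketch of what a classic app with a fixed sequence might look like (illustrative only, not code from this project):
# Classic app: the developer hard-codes the routing with if-else
def classic_app(question: str, df) -> str:
    q = question.lower()
    if "column" in q:
        return ", ".join(df.columns)
    elif "missing" in q:
        return df.isna().sum().to_string()
    else:
        return "Sorry, I can only answer two kinds of questions."

# An agent replaces this fixed routing: the LLM itself decides, per question,
# which of the registered tools to call and in what order.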
Hands-On Example: CSV Sanity-Check Agent with LangChain
1) Setup
Prerequisite: Python 3.10 or higher must be installed. Many packages in the AI tooling world require ≥ 3.10. You can find the code and the link to the repo below.
Tip for beginners:
You can check this by entering "python --version" in cmd.exe.
With the code below, we first create a new project, create an isolated Python environment, and activate it. We do this so that packages and versions are reproducible and don't collide with other projects.
Tip for beginners:
I work with Windows. We open a terminal with Windows + R > cmd and paste the following code.
mkdir csv-agent
cd csv-agent
python -m venv .venv
.venv\Scripts\activate
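If you're on macOS or Linux, only the activation command differs (this is standard venv usage, nothing specific to this project):
source .venv/bin/activate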
Then we install the required packages:
pip install "langchain>=0.2,<0.3" "langchain-openai>=0.1.7" "langchain-community>=0.2" pandas seaborn
With this command, we pin LangChain to the 0.2 line and install the OpenAI connection and the community package. We also install pandas for the EDA functions and seaborn for loading the Titanic sample dataset.

Tip for beginners:
If you don't want to use OpenAI, you can work locally with Ollama (e.g., with Llama or Mistral). This option is available later in the code.
2) Prepare the dataset in prepare_data.py
Next, we create a Python file called prepare_data.py. I use Visual Studio Code for this, but you can also use another IDE. In this file, we load the Titanic dataset, as it is publicly available.
# prepare_data.py
import seaborn as sns

# Load the public Titanic dataset and save it as a local CSV
df = sns.load_dataset("titanic")
df.to_csv("titanic.csv", index=False)
print("Saved titanic.csv")
With seaborn.load_dataset("titanic"), we load the public dataset (891 data rows plus a header row with the column names) directly into memory and save it as titanic.csv. The dataset contains only numeric, Boolean, and categorical columns, making it ideal for an EDA agent.
Tips for beginners:
- sns.load_dataset() requires internet access (the data comes from the seaborn repo).
- Save the file in the project folder (csv-agent) so that main.py can find it.
In the terminal, we execute the Python file with the following command, so that the titanic.csv file ends up in the project:
python prepare_data.py
We then see in the terminal that the CSV has been saved, and the titanic.csv file appears in the folder:


Side Note – Titanic dataset
The analysis is based on the Titanic dataset (OpenML ID 40945), which is marked as public on OpenML.
When we open the file, we see the following 15 columns and 891 rows of data. The Titanic dataset is a classic example for exploratory data analysis (EDA). It contains information on 891 passengers of the Titanic and is often used to investigate the relationship between characteristics (e.g., gender, age, ticket class) and survival.

Here are the 15 columns with a brief explanation:
- survived: Survived (1) or did not survive (0).
- pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
- sex: Gender of the passenger.
- age: Age of the passenger (in years; may be missing).
- sibsp: Number of siblings/spouses on board.
- parch: Number of parents/children on board.
- fare: Fare paid by the passenger.
- embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
- class: Ticket class as text (First, Second, Third). Corresponds to pclass.
- who: Categorization "man," "woman," "child."
- adult_male: Boolean field: Was the passenger an adult male (True/False)?
- deck: Cabin deck (often missing).
- embark_town: City of the port of embarkation (Cherbourg, Queenstown, Southampton).
- alive: Survival as text ("yes"/"no"). Corresponds to survived.
- alone: Boolean field: Did the passenger travel alone (True/False)?
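If you want to verify the columns yourself, here is a quick optional peek (it assumes prepare_data.py has already been run):
# Optional sanity peek at the freshly saved CSV
import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.shape)    # (891, 15)
print(df.dtypes)   # column names with their data types
print(df.head(3))  # first three rows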
Optional for advanced readers
If you want to trace and evaluate your agent runs later, you can use LangSmith.
3) Define the tools in main.py
Next, we define the various tools. To do this, we create a new Python file called main.py and save it in the csv-agent folder as well. We add the following code to it:
# main.py
import os, json
import pandas as pd

# --- 0) Load CSV ---
DF_PATH = "titanic.csv"
df = pd.read_csv(DF_PATH)

# --- 1) Define tools as small, focused commands ---
# IMPORTANT: Tools return strings (in this case, JSON strings) so that the LLM sees clearly structured responses.
from langchain_core.tools import tool

@tool
def tool_schema(dummy: str) -> str:
    """Returns column names and data types as JSON."""
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return json.dumps(schema)

@tool
def tool_nulls(dummy: str) -> str:
    """Returns columns with the number of missing values as JSON (only columns with >0 missing values)."""
    nulls = df.isna().sum()
    result = {col: int(n) for col, n in nulls.items() if n > 0}
    return json.dumps(result)

@tool
def tool_describe(input_str: str) -> str:
    """
    Returns describe() statistics.
    Optional: input_str can contain a comma-separated list of columns, e.g. "age, fare".
    """
    cols = None
    if input_str and input_str.strip():
        cols = [c.strip() for c in input_str.split(",") if c.strip() in df.columns]
    stats = df[cols].describe() if cols else df.describe()
    # describe() returns a DataFrame; serialize it as CSV text so the LLM sees a readable table:
    return stats.to_csv(index=True)
After importing the required packages, we load titanic.csv into df once and define three small, narrowly scoped tools. Let's take a closer look at each of them:
- tool_schema returns the column names and data types as JSON. This gives us an overview of what we're dealing with and is usually the first step in any data analysis. Even if a tool doesn't need input (like schema), it must still accept one argument, because the agent always passes a string. We simply ignore it.
- tool_nulls counts missing values per column and returns only the columns that have missing values.
- tool_describe calls df.describe(). It is important to note that this tool only works on numeric columns; strings and Booleans are ignored. This is an important step in the sanity check or EDA, as it lets us quickly see the mean, min, max, etc. of the different columns. For large CSVs, describe() can take a long time. In that case, you could integrate sampling logic, for example df.sample(n=10000).
These tools are the controlled interfaces through which the LLM is allowed to access the data. They are deterministic and therefore reproducible. Tools should ideally be clear and limited: in other words, they should have exactly one function or job.
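Before wiring up the agent, you can call each tool in isolation, e.g., in a Python REPL. LangChain tools expose .invoke(); for our single-argument tools, the string we pass is simply the (ignored) dummy input:
# Quick manual test of the tools, without any LLM involved
print(tool_schema.invoke(""))             # JSON with columns and dtypes
print(tool_nulls.invoke(""))              # JSON with missing-value counts
print(tool_describe.invoke("age, fare"))  # describe() stats for two columns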
Why do we need tools at all?
An LLM can generate text, but it cannot directly "see" data. For the LLM to work meaningfully with a CSV, we have to provide interfaces. That's exactly what tools are for:
Tools are small Python functions that the agent is allowed to call. Instead of leaving everything open, we only allow very specific, reproducible actions.
What exactly does the code do?
With the @tool decorator, LangChain automatically infers the tool's name, description, and argument schema from the function signature and docstring. This means we only need to write the function itself; LangChain takes care of the rest.
- The model passes arguments that match the tool's schema (often JSON). In this tutorial we keep things simple and accept a single string argument (e.g., input_str: str, or a dummy string we ignore).
- Tools always return a string (text). JSON is ideal for structured data, which we produce with return json.dumps(…).
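You can inspect the inferred metadata directly; this is exactly the information we later feed into the prompt:
# Inspect what @tool inferred from the function and its docstring
print(tool_schema.name)         # "tool_schema"
print(tool_schema.description)  # the docstring text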

This is a multi-step thought process. The LLM plans iteratively. Instead of responding directly, it thinks step by step: it decides which tool to call, interprets the result, and may continue until it has enough information to answer.
4) Register the tools for LangChain in main.py
We add the code below to the same main.py file to register the previously defined tools for the agent:
# --- 2) Register tools for LangChain ---
tools = [tool_schema, tool_nulls, tool_describe]
With this code, we simply collect the decorated functions into a list. Each function has already been converted into a LangChain tool by the @tool decorator.
5) Configure the LLM in main.py
Next, we configure the LLM that the agent uses. Here, you can either use the variant for OpenAI or an open-source model via Ollama.
I used OpenAI, which is why we first need to set the API key.
At OpenAI, we create a new API key:

We then copy it immediately (it will not be displayed again later) and set it as an environment variable in the terminal with the following command.
setx OPENAI_API_KEY "your_key"
It is important to restart cmd and reactivate .venv afterwards.
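To verify that the key was saved, open a new cmd window and print the variable (standard Windows cmd syntax, nothing project-specific):
echo %OPENAI_API_KEY%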

Now we add the following code to the end of main.py:
# --- 3) Configure LLM ---
# Option A: OpenAI (simple)
# export OPENAI_API_KEY=...   # Windows: setx OPENAI_API_KEY "YOUR_KEY"
# Use a lower temperature for more stable tool usage
USE_OPENAI = bool(os.getenv("OPENAI_API_KEY"))
if USE_OPENAI:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
else:
    # Option B: Local with Ollama (make sure to pull the model first, e.g. 'ollama run llama3')
    from langchain_community.chat_models import ChatOllama
    llm = ChatOllama(model="llama3.1:8b", temperature=0.1)
The code uses OpenAI if an OPENAI_API_KEY is available, otherwise it falls back to Ollama locally.
We set the temperature to 0.1. This makes the responses more deterministic, which is important for the test later on.
We also use gpt-4o-mini as the LLM. This is a lightweight model from OpenAI that handles tool usage well.
Tip for beginners:
The temperature determines how creatively an LLM responds. If we set 0.0, it responds deterministically, meaning the model almost always returns the same answer for the same input. This is good for structured tasks such as tool usage, code, or facts. If we set 1.0, the model responds creatively and varies much more; it may suggest different formulations or solutions, which is good for brainstorming or text ideas.
6) Define the agent's behavior in main.py using the policy
In this step, we define how the agent should behave. The system prompt sets the policy.
# --- 4) Narrow Policy/Prompt (Agent Behavior) ---
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

SYSTEM_PROMPT = (
    "You are a data-focused assistant. "
    "If a question requires information from the CSV, first use an appropriate tool. "
    "Use only one tool call per step if possible. "
    "Answer concisely and in a structured manner. "
    "If no tool matches, briefly explain why.\n\n"
    "Available tools:\n{tools}\n"
    "Use only these tools: {tool_names}."
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

_tool_desc = "\n".join(f"- {t.name}: {t.description}" for t in tools)
_tool_names = ", ".join(t.name for t in tools)
prompt = prompt.partial(tools=_tool_desc, tool_names=_tool_names)
First, we import ChatPromptTemplate to structure our agent's prompt. The most important part of the code is the system prompt: it defines the policy, i.e., the "rules of the game" for the agent. In it, we specify that the agent may only use one tool per step, that it should answer concisely, and that it may only use the tools we've defined.
With the last two lines of the system prompt, we make sure that {tools} lists all available tools with their descriptions and, with {tool_names}, that the agent can only use these names and can't invent fantasy tools.
In addition, we use MessagesPlaceholder("agent_scratchpad"). This is where the agent stores intermediate steps: which tools it has called and which results it has received. This allows it to continue its own chain of reasoning until it arrives at a final answer.
7) Create the tool-calling agent in main.py
In the last step, we define the agent:
# --- 5) Create & Run Tool-Calling Agent ---
from langchain.agents import create_tool_calling_agent, AgentExecutor

agent = create_tool_calling_agent(llm=llm, tools=tools, prompt=prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=False,  # optional: True for debug logs
    max_iterations=3,
)

if __name__ == "__main__":
    user_query = "Which columns have missing values? List 'Column: Count'."
    result = agent_executor.invoke({"input": user_query})
    print("\n=== AGENT ANSWER ===")
    print(result["output"])
With create_tool_calling_agent, we connect our LLM, the tools, and the prompt to form a tool-calling agent.
To make sure the process runs smoothly, we use the AgentExecutor. It takes care of the so-called agent loop: the agent first plans what needs to be done, then calls a tool, receives the result, and decides whether another tool is needed or whether it can give the final answer. This cycle repeats until the result is ready.
With verbose=True, we can view the intermediate steps in the terminal, which is extremely helpful for debugging. For example, we can see which tool was called when, or what data was returned. If everything is running smoothly, we can set it back to False to keep the output cleaner.
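If you prefer to capture the intermediate steps programmatically instead of reading the verbose logs, AgentExecutor also supports return_intermediate_steps=True. A small sketch (the variable names are illustrative):
# Capture (tool call, observation) pairs for inspection
debug_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=True,
    max_iterations=3,
)
res = debug_executor.invoke({"input": "Which columns have missing values?"})
for action, observation in res["intermediate_steps"]:
    print(action.tool, "->", observation)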
With max_iterations=3, we limit how many reasoning–tool–response cycles the agent may perform. This helps prevent infinite loops or excessive tool calls. In our example, the agent might reasonably call schema → nulls → describe before answering.
With the last part of the code, the agent is executed with the sample input "Which columns have missing values?". The result is printed in the terminal.
Tip for beginners:
if __name__ == "__main__": is a standard Python pattern: if we execute the file directly in the terminal with python main.py, the code in this block is run. However, if we only import the file (e.g., later in the mini_eval.py file), this block is skipped. This allows us to use the file as a standalone script or reuse it as a module in other projects.
8) Run the script: run the file main.py in the terminal.
Now we enter python main.py in the terminal to start the agent. We then see the final answer in the terminal:

Mini-Evaluation
Finally, we want to check our agent, which we do with a small evaluation. This ensures that the agent behaves correctly and that we don't introduce any "regressions" when we change something in the code later on.
At the end of main.py, we add the code below:
def ask_agent(query: str) -> str:
    return agent_executor.invoke({"input": query})["output"]
With ask_agent, we encapsulate the agent call in a function that simply returns a string. This allows us to call the agent later from other files.
The if __name__ == "__main__": block ensures that a test run is performed when main.py is called directly. If, on the other hand, we import main into another file, only the function is available.
Now we create the mini_eval.py file and insert the following code:
# mini_eval.py
from main import ask_agent

tests = [
    ("Which columns have missing values?", ["age", "embarked", "deck", "embark_town"]),
    ("Show me the first 3 columns with their data types.", ["survived", "pclass", "sex"]),
    ("Give me a statistical summary of the 'age' column.", ["mean", "min", "max"]),
]

def passed(q, out, must_include):
    text = out.lower()
    # Every expected keyword must appear somewhere in the (lowercased) answer
    return all(m.lower() in text for m in must_include)

if __name__ == "__main__":
    ok = 0
    for q, must in tests:
        out = ask_agent(q)
        result = passed(q, out, must)
        print(f"[{'OK' if result else 'FAIL'}] {q}\n{out}\n")
        ok += int(result)
    print(f"Passed {ok}/{len(tests)}")
In the code, we define three test cases. Each test consists of a question for the agent and a list of keywords that must appear in the answer. The passed() function checks whether these keywords are included.
Expected test results
- Test 1: "Which columns have missing values?" Expected: output mentions age, deck, embarked, embark_town.
- Test 2: "Show me the first 3 columns with their data types." Expected: output contains survived, pclass, sex with types such as int64 or object.
- Test 3: "Give me a statistical summary of the 'age' column." Expected: output contains mean ≈ 29.7, min = 0.42, max = 80.
If everything runs correctly, the script reports "Passed 3/3" at the end.
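We run the evaluation from the project folder (same terminal, with the .venv still active):
python mini_eval.py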
We get this output in the terminal, so the test works:

You can find the code & the CSV in the repo on GitHub.
On my Substack Data Science Espresso, I share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning, and Tech, made for curious minds like yours.
Take a look and subscribe on Medium or on Substack if you want to stay in the loop.
Final Thoughts – Pitfalls, Tips and Next Steps
LangChain is very practical for this example because it already includes, and nicely illustrates, the full agent loop (planning, tool calling, control). For small or clearly structured tasks, however, alternatives such as plain function calling (e.g., via the OpenAI API) or data-validation frameworks like Great Expectations might be sufficient. That said, LangChain does add some overhead. If you only need fixed EDA checks, a plain Python script would be leaner and faster. LangChain is especially worthwhile when you want to extend things flexibly or orchestrate multiple tools and agents.
When working with agents, there are a few things you should keep in mind:
One common pitfall is unclear tool descriptions: if the descriptions are too vague, the model can easily choose the wrong tool (misrouting). With precise and concrete descriptions, we can drastically reduce this.
Another important point is testing: even a small mini-evaluation with three simple tests helps detect regressions (errors that stay unnoticed due to subsequent changes) at an early stage.
It's also worth starting small: in our example, we only worked with three clearly defined tools, but now we know that they work reliably.
With regard to this agent, it could also be useful to incorporate sampling (for example, df.sample(n=10000)) for very large CSV files to avoid performance issues; a small sketch follows below. Keep in mind that LLM agents can also become costly if every question triggers multiple tool calls.
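Here is a minimal sketch of such a sampling guard (the file name and threshold are illustrative):
# Sample very large CSVs before handing them to the tools
import pandas as pd

df = pd.read_csv("big_file.csv")
if len(df) > 10_000:
    # random_state keeps runs reproducible, which matters for the mini-evaluation
    df = df.sample(n=10_000, random_state=42)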
In this article, we built a single agent that checks CSV files. In practice, several agents often work together: for example, one agent could ensure data quality while a second agent creates visualizations. Such multi-agent systems are the next step in solving more complex tasks.
As a next step, we could also incorporate LangGraph to extend the agent loop with state and orchestration. This would allow us to assemble agents as in a flowchart, including interruptions, memory, or more flexible control logic.
Finally, in our example we manually defined the three tools schema, nulls, and describe. With the Model Context Protocol (MCP), we could connect tools in a standardized way; for example, we could connect databases, APIs, or IDEs.