# Introduction
In a latest article on Machine Studying Mastery, we constructed a tool-calling agent that reached outward, that’s pulling climate, information, forex charges, and time from public APIs. That article lined the synthesis half of the sample properly, but it surely left the extra attention-grabbing half on the desk: an agent that causes about its personal setting, inspects its personal machine, and offloads logic it does not belief itself to carry out. It might be argued that that is nearer to actually “agentic.”
This text picks up the place that one left off. We’ll give Gemma 4 two new instruments — a sandboxed native filesystem explorer and a restricted Python interpreter — and watch the mannequin resolve, by itself, when to go searching and when to compute.
Subjects we are going to cowl embrace:
- Why “agentic” instrument calling wants greater than internet APIs to be attention-grabbing
- Methods to construct a filesystem inspection instrument with exhausting path-traversal guards
- Methods to wire a Python interpreter instrument to the mannequin with out handing it the keys to your machine
- How the identical orchestration loop from earlier than generalizes to those new capabilities
I extremely suggest that you simply first learn this text earlier than persevering with on.
# From Dialog to Company
When the one instruments you give a language mannequin are read-only internet APIs, primarily you continue to actually have a chatbot, albeit one with potential entry to higher info. The mannequin receives a immediate, decides which API to ping, and stitches the JSON response right into a paragraph. There is no such thing as a actual notion of setting, no state to examine, no consequence to purpose about; it is a state of affairs extra akin to retrieval augmented technology than true company.
Company, within the sensible sense practitioners use the phrase, reveals up when a mannequin begins interacting with the system it’s operating on. That may imply studying from an area filesystem, executing code, modifying information, calling different processes, or any mixture of these. The second a instrument can do one thing apart from return a clear string from a distant service, the mannequin has to start out asking about itself: what information exist, what does this quantity truly equal, what’s on this folder earlier than I declare it accommodates something.
The Gemma 4 household, and particularly the gemma4:e2b edge variant we now have been utilizing, is sufficiently small to run domestically on a laptop computer whereas being competent sufficient at structured output to drive this type of loop reliably. That mixture is what makes the local-agentic sample attention-grabbing within the first place. The entire code for this tutorial might be discovered right here.
# The Architectural Reuse
The orchestration loop from the earlier tutorial doesn’t change. We outline Python capabilities, expose them through JSON schema, go the registry to Ollama alongside the consumer immediate, intercept any tool_calls block on the response, execute the requested operate domestically, append the end result as a instrument-role message, and re-query the mannequin so it may well synthesize a closing reply. The identical call_ollama helper, the identical TOOL_FUNCTIONS dictionary, the identical available_tools schema array from the earlier tutorial all make appearances.
What adjustments is the character of the instruments themselves. The place the earlier batch have been all skinny shoppers over distant APIs, these we are going to construct now each run code on the machine. That shifts the design drawback from “how do I parse this response” to “how do I be certain that the mannequin can not, even by accident, do one thing it shouldn’t be allowed to do.”
# Instrument 1: A Sandboxed Filesystem Explorer
The primary instrument, list_directory_contents, provides the mannequin the power to see what information exist in a given folder. This sounds trivial till you keep in mind that os.listdir accepts any string, together with /, ~, and ../../and many others. A naive implementation may fortunately stroll the mannequin’s “curiosity” straight to your API keys.
The design selection right here is to pin a protected base listing at script begin and reject any request that resolves exterior of it:
# Safety: confine list_directory_contents to this base listing and its descendants
# Set to the present working listing when the script begins
SAFE_BASE_DIR = os.path.abspath(os.getcwd())
def list_directory_contents(path: str = ".") -> str:
"""Lists information and directories inside a path, constrained to the protected base listing."""
attempt:
# Resolve to an absolute path and confirm it sits inside SAFE_BASE_DIR
# This blocks traversal makes an attempt like '../../and many others' or absolute paths like "https://www.kdnuggets.com/"
requested = os.path.abspath(os.path.be a part of(SAFE_BASE_DIR, path))
if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
return (
f"Error: Entry denied. The trail '{path}' resolves exterior the "
f"permitted workspace ({SAFE_BASE_DIR})."
)
...
The sample is easy however value contemplating additional. We by no means belief the string the mannequin produced. We be a part of it onto the bottom listing, resolve it completely (so .. will get normalized away), after which confirm the resolved path nonetheless begins with the bottom. Each /and many others/passwd and ../../someplace collapse into paths that fail that prefix examine and are rejected earlier than os.listdir is ever known as.
The remainder of the operate is housekeeping: verify the trail exists and is a listing, checklist its contents, and format every entry as both [DIR] or [FILE] with a byte measurement. The returned string is apparent English with construction the mannequin can parse on the second go:
entries = sorted(os.listdir(requested))
if not entries:
return f"The listing '{path}' is empty."
strains = [f"Contents of '{path}' ({len(entries)} item(s)):"]
for title in entries:
full = os.path.be a part of(requested, title)
if os.path.isdir(full):
strains.append(f" [DIR] {title}/")
else:
attempt:
measurement = os.path.getsize(full)
strains.append(f" [FILE] {title} ({measurement} bytes)")
besides OSError:
strains.append(f" [FILE] {title}")
return "n".be a part of(strains)
The JSON schema we hand to the mannequin is intentionally permissive on the parameter aspect — path is non-obligatory, defaulting to the workspace root, as a result of most helpful first questions are in regards to the present folder:
{
"sort": "operate",
"operate": {
"title": "list_directory_contents",
"description": (
"Lists information and subdirectories inside a path inside the consumer's workspace. "
"Use this to examine the setting earlier than answering questions on native information."
),
"parameters": {
"sort": "object",
"properties": {
"path": {
"sort": "string",
"description": (
"A relative path contained in the workspace, e.g. '.', 'knowledge', or 'src/utils'. "
"Defaults to the workspace root."
)
}
},
"required": []
}
}
}
Notice the outline does a small quantity of immediate engineering: “Use this to examine the setting earlier than answering questions on native information.” That sentence pushes Gemma 4 towards calling the instrument when the consumer asks a imprecise query about “my information” moderately than guessing at what could be there.
# Instrument 2: A Restricted Python Interpreter
The second instrument, execute_python_code, is the extra harmful and the extra pedagogically attention-grabbing of the 2. The premise is that language fashions, particularly small ones, are unreliable at exact arithmetic, actual string manipulation, and something involving greater than a few steps of branching logic. A instrument that lets the mannequin write and run a deterministic snippet is a significantly better reply to these issues than asking it to purpose by way of them in pure language.
The implementation makes use of exec() with a intentionally stripped-down builtins namespace:
def execute_python_code(code: str) -> str:
"""Executes a snippet of Python code and returns no matter was printed to stdout.
This can be a learning-only sandbox. exec() is essentially unsafe; don't expose this instrument
to untrusted customers or networks. The restrictions under cease the informal instances, not a
decided attacker.
"""
attempt:
# A minimal restricted setting. We strip __builtins__ all the way down to a small
# whitelist in order that, e.g., open(), eval(), and __import__ usually are not straight
# accessible from the snippet's world scope.
safe_builtins = {
"abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
"divmod": divmod, "enumerate": enumerate, "filter": filter, "float": float,
"int": int, "len": len, "checklist": checklist, "map": map, "max": max, "min": min,
"pow": pow, "print": print, "vary": vary, "repr": repr, "reversed": reversed,
"spherical": spherical, "set": set, "sorted": sorted, "str": str, "sum": sum,
"tuple": tuple, "zip": zip,
}
# Pre-import a few protected, helpful modules so the mannequin does not must.
import math, statistics
restricted_globals = {
"__builtins__": safe_builtins,
"math": math,
"statistics": statistics,
}
A couple of choices value calling out. We substitute __builtins__ solely moderately than blacklisting particular person capabilities, which implies open, eval, exec, compile, __import__, enter, and the rest not in our whitelist merely doesn’t exist contained in the snippet. We pre-import math and statistics into the snippet’s globals as a result of the mannequin will attain for them continuously and we might moderately not power it to combat __import__ restrictions. We seize stdout with contextlib.redirect_stdout so the mannequin will get again precisely what its snippet printed:
# Seize stdout so we will hand the printed output again to the mannequin
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
exec(code, restricted_globals, {})
output = buffer.getvalue().strip()
if not output:
return "Code executed efficiently however produced no output. Use print() to return a price."
return f"Output:n{output}"
The empty-output department issues greater than it seems. Small fashions will routinely write expressions like x = sum(vary(101)) and overlook the print(x). Returning a particular error telling them to make use of print() provides the orchestration loop the choice to retry; with out it, the mannequin would synthesize a closing reply based mostly on an empty string and confidently invent a price.
A closing phrase on security, for the reason that script’s docstring is blunt about it: this can be a studying sandbox, not a hardened one. A decided adversary can escape of a Python exec sandbox in a dozen methods, most of them involving object introspection by way of ().__class__.__mro__. For a single-user agent operating by yourself laptop computer by yourself prompts, the whitelist is loads. For the rest, you’d need an actual isolation layer — a subprocess with seccomp, a container, or RestrictedPython.
# The Orchestration Loop
The principle loop is unchanged in construction from the earlier tutorial. The mannequin is queried with the consumer immediate and the instrument registry, and if it responds with tool_calls, every name is dispatched in opposition to TOOL_FUNCTIONS:
if "tool_calls" in message and message["tool_calls"]:
print("[TOOL EXECUTION]")
messages.append(message)
num_tools = len(message["tool_calls"])
for i, tool_call in enumerate(message["tool_calls"]):
function_name = tool_call["function"]["name"]
arguments = tool_call["function"]["arguments"]
...
if function_name in TOOL_FUNCTIONS:
func = TOOL_FUNCTIONS[function_name]
attempt:
end result = func(**arguments)
...
messages.append({
"function": "instrument",
"content material": str(end result),
"title": function_name
})
The CLI formatting is value a small tweak for this script. The execute_python_code instrument’s code argument could be a multi-line string with newlines in it, which can wreck an ASCII tree if printed naively. We flatten and truncate string arguments for the show solely; the mannequin nonetheless receives the total string when the operate runs:
def _short(v):
if isinstance(v, str):
flat = v.substitute("n", "n")
if len(flat) > 60:
flat = flat[:57] + "..."
return f"'{flat}'"
return str(v)
args_str = ", ".be a part of(f"{okay}={_short(v)}" for okay, v in arguments.gadgets())
As soon as every instrument result’s appended again into the message historical past as a "function": "instrument" entry, we re-call Ollama with the enriched payload and the mannequin produces its grounded closing reply. Similar two-pass sample, identical logic.
# Testing the Instruments
And now we check our instrument calling. Pull gemma4:e2b with ollama pull gemma4:e2b when you’ve got not already, then run the script from a folder you don’t thoughts the mannequin peeking at.
Let’s begin with the filesystem instrument. From the mission listing:
What scripts are in my present folder, and which one seems prefer it needs to be used to course of CSVs?
End result:
[SYSTEM]
○ Instrument: execute_python_code......................[LOADED]
○ Instrument: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
What scripts are in my present folder, and which one seems prefer it needs to be used to course of CSVs?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: list_directory_contents
├─ Args: path="."
└─ End result: Contents of '.' (5 merchandise(s)):
[FILE] README.md (412 bytes)
[FILE] csv_cleaner.py (1834 bytes)
[FILE] major.py (10786 bytes)
[FILE] notes.txt (88 bytes)
[FILE] sales_report.py (2210 bytes)
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
Your present folder accommodates 5 information. The one that appears supposed for CSV
processing is csv_cleaner.py — its title strongly suggests it handles CSV enter.
sales_report.py may additionally contact CSV knowledge, however its title is extra about output than
ingestion.
The mannequin known as the instrument, regarded on the precise filenames, and made an affordable inference grounded within the itemizing moderately than in its weights. That’s the distinction between hallucination and commentary.
Subsequent, the Python interpreter. A small job that small fashions reliably get improper if requested to do it of their head:
What’s the customary deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
End result:
[SYSTEM]
○ Instrument: execute_python_code......................[LOADED]
○ Instrument: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
What's the customary deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: execute_python_code
├─ Args: code="nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]nprint(rou..."
└─ End result: Output:
11.4659
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The usual deviation of these numbers, rounded to 4 decimal locations, is 11.4659.
The mannequin offloaded the calculation solely; it wrote a snippet, known as statistics.stdev, rounded the end result, and reported what the interpreter mentioned. No psychological arithmetic, no approximation, no fabricated vital digits.
Lastly, the extra attention-grabbing case: a immediate that requires each instruments in sequence. The mannequin has to examine the folder and compute one thing about what it finds:
Take a look at the information within the present folder and inform me the overall measurement in kilobytes, rounded to 2 decimal locations.
Output:
[SYSTEM]
○ Instrument: execute_python_code......................[LOADED]
○ Instrument: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
Take a look at the information within the present folder and inform me the overall measurement in kilobytes, rounded to 2 decimal locations.
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
┌── Calling: list_directory_contents
│ ├─ Args: path="."
│ └─ End result: Contents of '.' (5 merchandise(s)):
│ [FILE] README.md (412 bytes)
│ [FILE] csv_cleaner.py (1834 bytes)
│ [FILE] major.py (10786 bytes)
│ [FILE] notes.txt (88 bytes)
│ [FILE] sales_report.py (2210 bytes)
│
└── Calling: execute_python_code
├─ Args: code="sizes = [412, 1834, 10786, 88, 2210]nprint(spherical(sum(siz..."
└─ End result: Output:
15.33
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The 5 information within the present folder complete 15.33 KB.
Two instruments, in the fitting order, with the output of 1 feeding the argument of the opposite — produced by a 2-billion-parameter mannequin operating on a laptop computer with no GPU. The filesystem instrument grounds the mannequin in what is definitely there; the interpreter instrument grounds the reply in what is definitely true. The mannequin contributes the half it’s genuinely good at, which is deciding which query to ask of which instrument.
It’s value poking on the security guards too, simply to verify they maintain. Asking the mannequin “checklist the contents of /and many others” produces the anticipated denial message within the instrument end result, which the mannequin then reviews again gracefully moderately than fabricating a listing itemizing. Asking it to run open('/and many others/passwd').learn() contained in the interpreter produces a NameError, since open isn’t within the whitelisted builtins. Each failures degrade into helpful error strings as an alternative of silent compromises, which is strictly what you need at this layer.
# Conclusion
The sooner tutorial confirmed that Gemma 4 can attain throughout the web in your behalf. This one reveals it may well attain into the machine you might be sitting at, fastidiously, when you’ve constructed the carefulness in. After getting a working tool-calling loop, the attention-grabbing query stops being “can the mannequin name a operate” and begins being “what ought to I let it contact.”
A filesystem-aware instrument and a code-execution instrument collectively get you many of the technique to one thing that genuinely earns the time period agent: it may well observe its setting, resolve what calculation issues, and run that calculation deterministically moderately than guessing. The sample generalizes from there. Database queries, shell instructions, git operations, doc parsing; every one among these is similar JSON schema, the identical dispatch desk, the identical two-pass synthesis, with no matter security perimeter is suitable for the blast radius of the underlying name.
Construct the perimeter first. Then hand the mannequin the keys to no matter sits inside it.
Matthew Mayo (@mattmayo13) holds a grasp’s diploma in pc science and a graduate diploma in knowledge mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Studying Mastery, Matthew goals to make complicated knowledge science ideas accessible. His skilled pursuits embrace pure language processing, language fashions, machine studying algorithms, and exploring rising AI. He’s pushed by a mission to democratize information within the knowledge science neighborhood. Matthew has been coding since he was 6 years previous.
