TOOL

hermes-tool-test-suite

pytest harness for AI agent tool-calling reliability

certify that a model + prompt-config combination actually executes tools end-to-end

What it is

A pytest harness that validates AI agent tool-calling reliability across providers. It is used to certify that a given model + prompt-config combination actually executes tool calls end-to-end, rather than merely generating text that looks like a tool call.

The default validation set covers 10 scenarios across:

Category        Scenarios
Terminal        shell exec, cwd handling
File ops        read, write, edit, list
Code execution  inline Python, subprocess Python
Web tools       fetch, search

Each scenario has a deterministic expected output. The harness runs the model's tool-call sequence and compares the resulting state (filesystem, stdout, etc.) against that expected output.
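As a rough illustration of the run-then-compare idea, here is a minimal sketch of a single file-ops scenario. It is not the project's actual code: the `Scenario` record, the `write_file` tool name, and the helper functions are all hypothetical, and the tool calls are canned rather than produced by a model.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Scenario:
    """Hypothetical scenario record: a name plus the state expected after the run."""
    name: str
    expected_files: dict  # relative path -> expected file content


def execute_tool_call(call: dict, workdir: Path) -> None:
    """Execute one tool call inside the sandbox directory (illustrative tool set)."""
    if call["tool"] == "write_file":
        target = workdir / call["args"]["path"]
        target.write_text(call["args"]["content"])
    else:
        raise ValueError(f"unknown tool: {call['tool']}")


def snapshot(workdir: Path) -> dict:
    """Collect the resulting filesystem state as {relative path: content}."""
    return {
        p.relative_to(workdir).as_posix(): p.read_text()
        for p in sorted(workdir.rglob("*"))
        if p.is_file()
    }


def run_scenario(scenario: Scenario, tool_calls: list, workdir: Path) -> bool:
    """Run the tool-call sequence, then compare the resulting state to the expectation."""
    for call in tool_calls:
        execute_tool_call(call, workdir)
    return snapshot(workdir) == scenario.expected_files


def test_file_write_scenario():
    # Assumption: in a real run these calls would come from the model under test.
    scenario = Scenario("file_write", {"hello.txt": "hi\n"})
    calls = [{"tool": "write_file", "args": {"path": "hello.txt", "content": "hi\n"}}]
    with tempfile.TemporaryDirectory() as tmp:
        assert run_scenario(scenario, calls, Path(tmp))
```

Comparing a snapshot of the final state, instead of inspecting the calls themselves, means a scenario passes only if the tools were really driven to the expected result.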

Why this exists

A model that generates a tool_calls block in its API response hasn't actually proven it can drive tools. Real failures observed in practice:

A test harness that runs the entire loop catches all three.
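The gap between "the response contains a tool_calls block" and "the tool call actually ran" can be sketched as follows. This is an illustration under assumptions, not the harness's implementation: the response layout (a `tool_calls` list whose entries carry a JSON `arguments` string with a `code` field) is hypothetical, and the truncated-arguments case in the usage note is an invented example, not one of the failures the project reports.

```python
import json
import subprocess
import sys


def has_tool_call(response: dict) -> bool:
    """Shape-only check: the response contains a non-empty tool_calls block."""
    return bool(response.get("tool_calls"))


def run_tool_calls(response: dict) -> list:
    """End-to-end check: execute each call in a subprocess and collect its stdout."""
    outputs = []
    for call in response.get("tool_calls", []):
        # Raises json.JSONDecodeError if the arguments string is not valid JSON.
        args = json.loads(call["arguments"])
        proc = subprocess.run(
            [sys.executable, "-c", args["code"]],
            capture_output=True,
            text=True,
            timeout=30,
        )
        outputs.append(proc.stdout.strip())
    return outputs
```

A response whose `arguments` string is cut off mid-JSON still passes `has_tool_call`, but `run_tool_calls` raises on it, which is the kind of difference that only shows up when the whole loop is executed.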

Use cases

Status

Active. Public. See the GitHub repo for installation and the test catalog.