Every tool call
before it fires.
The standard approach to agent safety is to measure what agents did after they did it. Chenglin Yang built something different: a pipeline that intercepts every tool call before execution, deobfuscates shell payloads, tracks multi-step attack patterns across a session, and returns a structured verdict. Not a sandbox. Not a post-hoc audit. A gate.
First surfaced in Tandemly Briefing — 2026-05-22.
Post-hoc safety
is too late.
File deletion is immediate. Credential exfiltration has already left the system. Measuring agent behavior after execution catches patterns in logs. It does not stop the action.
The standard safety evaluation paradigm for AI agents works like a flight data recorder. It captures what happened during the flight. It is useful for investigation and for training future systems. It is not useful for stopping a crash that is in progress.
The tools that act before execution have their own gaps. Static guardrails match keywords in tool call parameters, but shell commands can be obfuscated in ways that defeat keyword matching entirely. A command that recursively deletes a directory tree can be written through variable substitution, hex escape sequences, command chaining, or ANSI-C quoting such that no pattern in the raw string reveals what it does. A guardrail has to expand the command into its plaintext form before any matching runs. Most static guardrails skip that step.
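To make the gap concrete, here is a minimal sketch of a keyword guardrail failing on obfuscated input. The guard, its single pattern, and the target path are hypothetical and are not taken from AgentTrust or any named product; the three obfuscated strings are forms that bash expands to the same plain command.

```python
import re

# Hypothetical keyword guardrail: flags any raw string containing "rm -rf".
BLOCKLIST = [re.compile(r"\brm\s+-rf\b")]

def keyword_guard(raw_command: str) -> bool:
    """Return True if the raw, unexpanded string matches a blocked pattern."""
    return any(p.search(raw_command) for p in BLOCKLIST)

plain = "rm -rf /srv/data"
# Each of these becomes `rm -rf /srv/data` once a shell expands it.
via_variable = "X=rm; $X -rf /srv/data"        # variable substitution
via_quotes = "r''m -r''f /srv/data"            # adjacent-quote concatenation
via_ansi_c = r"$'\x72\x6d' -rf /srv/data"      # ANSI-C quoting with hex escapes

results = {
    cmd: keyword_guard(cmd)
    for cmd in (plain, via_variable, via_quotes, via_ansi_c)
}
```

Only the plain form is flagged. The other three pass the guard untouched, which is why matching has to happen on the expanded command rather than the raw string.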
Multi-step attack chains add a second gap. Reconnaissance steps are individually harmless. A tool call that reads a directory listing looks fine on its own. The same call followed by a read of a credentials file, followed by a network upload, forms a coherent exfiltration chain. No single-step rule catches it. Only a system that tracks session context can.
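A session-aware check of this kind can be sketched in a few lines. The step labels, the single hard-coded chain, and the `SessionTracker` class below are illustrative assumptions, not AgentTrust's actual tracker; the point is only that the verdict for the final step depends on what came before it.

```python
from dataclasses import dataclass, field

# Hypothetical step labels; a real system would derive them from analyzed tool calls.
EXFIL_CHAIN = ("list_dir", "read_secret", "network_upload")

@dataclass
class SessionTracker:
    """Tracks how far a session has advanced through one ordered attack chain."""
    chain: tuple = EXFIL_CHAIN
    progress: int = 0
    history: list = field(default_factory=list)

    def observe(self, step: str) -> str:
        """Call before the step executes; returns 'allow' or 'block'."""
        self.history.append(step)
        # Unrelated steps may be interleaved; only the next expected step advances.
        if step == self.chain[self.progress]:
            self.progress += 1
        if self.progress == len(self.chain):
            self.progress = 0  # reset after flagging the completed chain
            return "block"
        return "allow"
```

In a fresh session, `observe("network_upload")` returns `"allow"`. After a directory listing and a secret read in the same session, the identical call returns `"block"`.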
Infrastructure sandboxes constrain the execution environment: what directories are accessible, what network calls are permitted, what user context the agent runs under. They don't evaluate what an action means. A shell command that exfiltrates credentials by piping them to an allowed network endpoint runs cleanly inside a sandbox that permits both operations. The sandbox constrained the environment. It didn't evaluate the action.
Can a system intercept every agent tool call before execution, evaluate its semantic intent, and return a structured verdict fast enough to be used in production, without requiring the agent to be modified?
Eight components,
one pipeline.
AgentTrust is structured as a pipeline of eight cooperating components. Five always run. Two run by default and can be disabled. One is opt-in for ambiguous cases. Every tool call passes through the pipeline before it executes.
The pipeline begins with two always-on extraction components. The ShellNormalizer expands shell command strings through nine syntactic strategies before any pattern matching runs: variable expansion, hex and octal escape resolution, alias resolution, command substitution, ANSI-C quoting, and adjacent-quote concatenation, among others. The expanded plaintext is what gets evaluated. The ActionAnalyzer then extracts risk-relevant features from the normalized command using 42 regex patterns organized into four categories.
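The normalize-then-analyze order can be illustrated with a rough sketch. It implements only three of the expansions named above, in simplified form, and two made-up feature patterns; the function names, the `env` parameter, and the regexes are assumptions for illustration and do not reproduce the ShellNormalizer's nine strategies or the ActionAnalyzer's 42 patterns.

```python
import re

def normalize(command: str, env: dict[str, str]) -> str:
    """Apply three simplified expansions to a shell command string."""
    # 1. ANSI-C quoting, hex escapes only: $'\x72\x6d' -> rm
    def ansi_c(match: re.Match) -> str:
        return re.sub(r"\\x([0-9a-fA-F]{2})",
                      lambda m: chr(int(m.group(1), 16)),
                      match.group(1))
    command = re.sub(r"\$'([^']*)'", ansi_c, command)
    # 2. Adjacent-quote concatenation: r''m -> rm (empty quote pairs removed).
    command = command.replace("''", "").replace('""', "")
    # 3. Variable expansion from a known environment: $X or ${X}.
    command = re.sub(r"\$\{?(\w+)\}?",
                     lambda m: env.get(m.group(1), m.group(0)),
                     command)
    return command

# Hypothetical feature patterns, matched against the normalized text only.
FEATURE_PATTERNS = {
    "recursive_delete": re.compile(r"\brm\s+-\w*r\w*\b"),
    "credential_path": re.compile(r"\.ssh/|\.aws/credentials"),
}

def extract_features(normalized: str) -> set[str]:
    """Return the names of all feature patterns found in the command."""
    return {name for name, p in FEATURE_PATTERNS.items() if p.search(normalized)}
```

For example, `normalize("${X} -rf ~/.ssh/", {"X": "rm"})` yields `rm -rf ~/.ssh/`, and `extract_features` on that result reports both `recursive_delete` and `credential_path`. Run against the raw string instead, the delete pattern finds nothing.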
The PolicyEngine applies 170 YAML-configurable rules to those features and produces a preliminary verdict. YAML-configurable means operators can update the rule set without touching code, and the rules can be tuned to match specific deployment risk profiles. The TrustInterceptor orchestrates the full pipeline, and the TrustReporter produces a structured, human-readable explanation of the verdict.
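A data-driven rule set of this shape might look like the sketch below. The rules are written as Python dictionaries standing in for parsed YAML; the field names (`id`, `requires`, `verdict`), the three example rules, and the severity ordering used to pick a winner are all assumptions, not AgentTrust's schema.

```python
# Stand-in for rules loaded from a YAML file; every field name is hypothetical.
RULES = [
    {"id": "fs-001", "requires": {"recursive_delete"}, "verdict": "block"},
    {"id": "cred-001", "requires": {"credential_path"}, "verdict": "review"},
    {"id": "net-001", "requires": {"network_upload"}, "verdict": "warn"},
]

# Assumed ordering for resolving several matching rules: strictest verdict wins.
SEVERITY = {"allow": 0, "warn": 1, "review": 2, "block": 3}

def evaluate(features: set[str], rules: list[dict] = RULES) -> tuple[str, list[str]]:
    """Return (preliminary verdict, ids of matched rules) for extracted features."""
    # A rule matches when all of its required features are present.
    matched = [r for r in rules if r["requires"] <= features]
    if not matched:
        return "allow", []
    strictest = max(matched, key=lambda r: SEVERITY[r["verdict"]])
    return strictest["verdict"], [r["id"] for r in matched]
```

Because the rules are plain data, tightening a deployment is a matter of editing the list (or the file it is loaded from), for instance changing `net-001` from `warn` to `block`, with no change to `evaluate`.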
The four verdicts are allow, warn, block, and review. Warn lets the action proceed with a logged flag. Block stops it. Review routes it for human inspection before execution continues. The output is a structured TrustReport, not a binary pass or fail. The structured output means downstream systems can act on the verdict type differently rather than applying a single response to any non-allow case.
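The value of a structured verdict shows up in the caller. The sketch below models the four verdicts and a report object, then dispatches on the verdict type; the `TrustReport` fields and the `handle` function are hypothetical and are not the project's actual data model.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"

@dataclass(frozen=True)
class TrustReport:
    """Hypothetical report shape: a verdict plus human-readable reasons."""
    verdict: Verdict
    reasons: tuple[str, ...] = ()

def handle(report: TrustReport, log: list[str]) -> bool:
    """Return True if the tool call may execute now, False otherwise."""
    if report.verdict is Verdict.ALLOW:
        return True
    if report.verdict is Verdict.WARN:
        # Proceed, but leave a logged flag behind.
        log.append("warn: " + "; ".join(report.reasons))
        return True
    if report.verdict is Verdict.REVIEW:
        # Hold the call until a human has inspected it.
        log.append("review: " + "; ".join(report.reasons))
        return False
    log.append("block: " + "; ".join(report.reasons))
    return False
```

A binary pass/fail result would force `warn` and `review` into one of the two outer branches. With four verdict types, each can have its own handling.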
The MCP server interface is the deployment detail that matters most for practical adoption. Because AgentTrust exposes itself as a Model Context Protocol server, it can interpose between any MCP-compatible agent and its tools without patching the agent itself. The agent sees the same tool interface. The interceptor sits invisibly between the call and the execution.
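The interposition idea itself is independent of MCP and can be sketched as a wrapper around a tool function. This is not the Model Context Protocol and not AgentTrust's server; `gated`, `BlockedAction`, and the `check` callback are hypothetical names used only to show that the caller's interface stays the same while a gate runs first.

```python
from typing import Callable

class BlockedAction(Exception):
    """Raised in place of executing a tool call that the gate refuses."""

def gated(tool: Callable[[str], str],
          check: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap `tool` so `check` runs on every call before the tool does."""
    def wrapper(command: str) -> str:
        verdict = check(command)
        if verdict in ("block", "review"):
            # The underlying tool is never invoked for these verdicts.
            raise BlockedAction(f"{verdict}: {command}")
        return tool(command)
    return wrapper
```

The wrapped function has the same signature as the original, so code that calls the tool does not need to change. That is the property the MCP-server deployment provides at the protocol level.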
Most agent safety tooling is binary: allow or block. SafeFix, the component that attaches a safer alternative to a non-allow verdict, is the part that makes the system constructive rather than obstructionist. An agent that was about to run a recursive delete gets back a suggestion to move the target to trash instead. This keeps the agent functional while reducing blast radius. The 37 fix rules map to the most common blocked action patterns.
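A fix rule of this kind can be pictured as a pattern paired with a rewrite template. The single rule below, the `~/.trash/` destination, and the `suggest_fix` function are illustrative assumptions, not one of the 37 rules shipped with the project.

```python
import re
from typing import Optional

# Hypothetical fix rule: (pattern on the normalized command, rewrite template).
FIX_RULES = [
    (re.compile(r"^rm\s+-\w*r\w*\s+(?P<target>\S+)$"), "mv {target} ~/.trash/"),
]

def suggest_fix(normalized: str) -> Optional[str]:
    """Return a safer alternative for a blocked command, or None if no rule fits."""
    for pattern, template in FIX_RULES:
        match = pattern.match(normalized)
        if match:
            return template.format(**match.groupdict())
    return None
```

Here `suggest_fix("rm -rf build/")` returns `mv build/ ~/.trash/`, which the agent can retry with, while a command with no matching rule simply gets no suggestion.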
95% verdict accuracy
at millisecond latency.
Two benchmarks. One internal, designed to cover the six primary risk categories evenly. One external, built independently from real-world adversarial scenarios. Both tested the production configuration without the LLMJudge component.
The internal benchmark reports two accuracy figures: 95.0% verdict accuracy and 73.7% risk-level accuracy. Verdict accuracy is the security-critical one. Getting the right verdict (block, warn, allow, review) is what stops harmful actions. Risk-level accuracy measures whether the system calibrated the severity correctly within a verdict category, which matters for logging, alerting, and operator prioritization. At 73.7%, it is a useful directional signal on severity, not a precise score to act on mechanically.
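The two figures are computed over the same cases but compare different fields, which is how they can diverge. The toy calculation below uses four invented cases and invented risk-level names purely to show the mechanics; none of it is benchmark data.

```python
def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, expected) pairs that agree."""
    return sum(predicted == expected for predicted, expected in pairs) / len(pairs)

# Invented cases for illustration only; the risk-level labels are hypothetical.
# (predicted verdict, expected verdict, predicted risk, expected risk)
cases = [
    ("block", "block", "high", "critical"),
    ("block", "block", "critical", "critical"),
    ("warn", "warn", "medium", "low"),
    ("allow", "allow", "none", "none"),
]

verdict_acc = accuracy([(c[0], c[1]) for c in cases])
risk_acc = accuracy([(c[2], c[3]) for c in cases])
```

In this toy set every verdict is right while half the severity labels are off. A harmful action is still stopped in each of those cases, even though the logged severity is miscalibrated.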
The 73.7% figure is honest. Most safety tooling reports single-dimension accuracy on its best-performing task. Reporting both, and explaining the gap, gives operators a calibrated expectation of what the system actually provides.
The system achieves low-millisecond median end-to-end latency on the production configuration. That figure matters because a safety interceptor that adds hundreds of milliseconds to every tool call will either be disabled by operators or will make agents noticeably slower in ways users observe. Low-millisecond overhead is smaller than network call variance for typical tool execution. The interceptor is practically invisible in the latency budget.
Both benchmarks were constructed by the author. Independent external replication on a broader distribution of real production agent logs would give higher confidence in the generalization claim. The LLMJudge component is opt-in and not measured in the primary results, so its contribution on ambiguous cases has to be assessed separately. The AGPL-3.0 license constrains commercial deployments that are not prepared to disclose source.
What this means
for building agents.
Runtime interception is a distinct safety layer that neither sandbox restrictions nor post-hoc audits provide. For builders operating agents in production, this paper frames the gap and offers a concrete pattern to fill it.
Where to go
from here.
If you want to go deeper on runtime agent safety.