
Before You Enforce, Simulate: Thoth's Observe Mode and the Day-7 Report

The question every security team asks before turning on enforcement is "What would this actually block?" Thoth's observe mode gives you a precise answer by running your policies against real production traffic and recording what would have been blocked, without blocking anything.

Aten Security

June 1, 2026 · 11 min read

Technical architecture diagram of Thoth's passive Observe Mode pipeline, showing an AI agent executing a tool call through the Thoth Proxy to an API, with a parallel asynchronous data path writing a deterministic verdict token directly to a Day-7 compliance ledger.

"We don't want to turn on enforcement and find out it breaks something."

Every security and platform engineering team raises this concern when deploying autonomous AI workflows. The worry isn't unfounded. A policy that is too aggressive will block legitimate agent operations, degrade your user experience, and create an internal engineering backlash that sets security adoption back by months. Conversely, a policy that is too permissive is nothing more than compliance theater.

The true request underneath this concern isn't "don't enforce." It is "prove to me that enforcement is calibrated before it's active."

Thoth's observe mode is built to answer that exact question. It runs your policies against live production traffic and records every decision (what would have been allowed, what would have been blocked, and what would have triggered a step-up challenge) without changing agent behavior or modifying payloads.

After seven days, the Day-7 report gives you the full picture: which tool calls would have been blocked, which policies triggered, and the exact regulatory or financial exposure those calls represent.

That's not a marketing promise. That's evidence.

How Observe Mode Works

Every thoth_policy_bundle infrastructure resource contains a dedicated enforcement_mode field. To run shadow evaluation, simply set it to observe:

resource "thoth_policy_bundle" "hipaa_minimum_necessary" {
  name             = "hipaa-minimum-necessary"
  framework        = "OPA"
  raw_policy       = file("${path.module}/policies/hipaa-minimum-necessary.rego")
  enforcement_mode = "observe"
}

In observe mode, every tool call your agents attempt passes through the full runtime enforcement pipeline. The policy is evaluated. A verdict is computed (ALLOW, BLOCK, STEP_UP, or MODIFY). That verdict is recorded in the immutable audit trail, but the tool call executes regardless. Nothing is dropped or intercepted.

The agents operate normally, completely unaware they are being evaluated. Yet, the evaluation remains fully deterministic. If the exact same tool call occurred in enforce mode, the policy engine would yield the identical decision. Observe mode gives you the precise telemetry of active enforcement, with zero operational risk.
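
A minimal Python sketch of that control flow may make the distinction concrete. It is not Thoth's implementation: guarded_call, AuditLog, and the stand-in evaluate rule are hypothetical names, and only the BLOCK verdict is handled (STEP_UP and MODIFY are left out for brevity). The point is that the policy evaluation is identical in both modes and only the final gate differs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class AuditLog:
    """Hypothetical in-memory stand-in for an append-only audit trail."""
    records: list = field(default_factory=list)

    def append(self, record: dict) -> None:
        self.records.append(record)


def guarded_call(
    tool_call: dict,
    evaluate: Callable[[dict], str],
    execute: Callable[[dict], Any],
    mode: str,
    audit: AuditLog,
) -> Optional[Any]:
    # The verdict is computed the same way regardless of mode.
    verdict = evaluate(tool_call)
    # Only enforce mode turns a BLOCK verdict into an actual interception.
    intercepted = mode == "enforce" and verdict == "BLOCK"
    audit.append({
        "action": tool_call["action"],
        "verdict": verdict,
        "mode": mode,
        "intercepted": intercepted,
    })
    if intercepted:
        return None
    return execute(tool_call)


def evaluate(tool_call: dict) -> str:
    # Stand-in rule (assumption): full-record reads are denied.
    return "BLOCK" if tool_call["tool_args"].get("full_record") else "ALLOW"


call = {"action": "tool_call:get_patient_record", "tool_args": {"full_record": True}}
audit = AuditLog()
observed = guarded_call(call, evaluate, lambda c: "executed", "observe", audit)
enforced = guarded_call(call, evaluate, lambda c: "executed", "enforce", audit)
print(observed, enforced)                      # executed None
print([r["verdict"] for r in audit.records])   # ['BLOCK', 'BLOCK']
```

Both audit records carry the same BLOCK verdict. The only difference is whether the call went through, which is why observe-mode telemetry can predict enforce-mode behavior.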

The Day-7 Report

After seven days of shadow observation, the Day-7 report is automatically populated in your dashboard or delivered straight to your engineering team's inbox.

The report breaks the observation window down into four parts:

The Aggregated Headline: A high-level summary of total evaluated tool calls, simulated block frequencies, and escalation rates broken down by policy category.
Per-Violation Traces: For every tool call that would have triggered an interception, the report exposes the exact agent identity, the target tool schema, the specific rule that triggered, the policy version hash, and the longitudinal behavioral baseline score at the time of execution.
Regulatory Exposure Estimator: For regulated environments, the report maps violation categories directly to applicable compliance frameworks and estimated fine structures. A healthcare procurement agent attempting to access patient tables outside of its session intent maps directly to specific HIPAA citations and the corresponding HHS Office for Civil Rights (OCR) fine bands.
Calibration Recommendations: Based on the seven-day trend, the report flags which policies are tight and which are generating noise. If a specific rule would have blocked 40 tool calls but 38 of them were legitimate engineering queries, that's a tuning signal: the policy logic needs to be adjusted before enforcement goes live.

The Day-7 report converts "prove to me enforcement is calibrated" from a subjective request into an auditable fact.
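
To illustrate the calibration arithmetic, here is a rough Python sketch using the 40-blocked, 38-legitimate example above. The record shape, the reviewed_legitimate label (assumed to come from human review of each simulated block), and the 0.5 noise threshold are all assumptions made for this example, not the report's actual schema or scoring.

```python
from collections import defaultdict


def calibration_summary(records: list, noise_threshold: float = 0.5) -> dict:
    """Per rule: simulated blocks and the share reviewers judged legitimate."""
    blocked = defaultdict(int)
    legitimate = defaultdict(int)
    for record in records:
        if record["verdict"] != "BLOCK":
            continue
        blocked[record["rule"]] += 1
        if record["reviewed_legitimate"]:
            legitimate[record["rule"]] += 1

    summary = {}
    for rule, count in blocked.items():
        rate = legitimate[rule] / count
        summary[rule] = {
            "would_block": count,
            "legitimate": legitimate[rule],
            "false_positive_rate": rate,
            # A rule that mostly blocks legitimate calls is generating noise.
            "needs_tuning": rate >= noise_threshold,
        }
    return summary


# 40 simulated blocks, 38 of which reviewers marked as legitimate queries.
records = [
    {"rule": "minimum_necessary_access", "verdict": "BLOCK", "reviewed_legitimate": i < 38}
    for i in range(40)
]
summary = calibration_summary(records)
print(summary["minimum_necessary_access"])
```

In this example the rule's false-positive rate is 38/40, or 95%, so it is flagged for tuning before it is promoted.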

Policy Simulation: Deterministic Testing in CI/CD

Beyond shadow evaluation in production, Thoth supports offline policy simulation, allowing you to evaluate policy logic against mock input documents locally before deploying code to any environment.

thothctl policy simulate \
  --policy ./policies/hipaa-minimum-necessary.rego \
  --framework OPA \
  --input ./test-inputs/patient-record-query.json

Output:

{
  "decision": "BLOCK",
  "policy": "hipaa-minimum-necessary",
  "policy_version": "sha256:e3b0c44...",
  "triggered_rules": ["minimum_necessary_access"],
  "deny_messages": [
    "tool call requests full_record=true with purpose=customer-facing; minimum necessary access requires field-scoped query"
  ],
  "behavioral_score": null,
  "simulated": true
}

Because simulation is entirely deterministic, it serves as a unit testing framework for your security policies. You can run it directly inside your CI/CD pipelines. The pipeline fails if a known-bad agent input produces an ALLOW decision, or if a known-good schema gets blocked. Your policy lifecycle gains a robust test suite.

The test input structure perfectly matches the canonical payload the production runtime layer processes:

{
  "principal": {
    "id": "agent:care-navigator-001",
    "role": "clinical_agent",
    "scopes": ["query:patient_records"]
  },
  "action": "tool_call:get_patient_record",
  "context": {
    "purpose": "customer-facing",
    "sensitivity_label": "phi",
    "session_intent": "care-navigation",
    "task_id": "task_01hx..."
  },
  "tool_args": {
    "patient_id": "P-4421",
    "full_record": true,
    "fields": null
  }
}
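
The CI check described earlier (fail on a known-bad ALLOW or a known-good BLOCK) can be sketched as a small fixture runner. This is a hedged illustration in Python: evaluate_stub is a stand-in that mimics only the deny message shown in the sample output, not the real Rego policy, and run_fixtures and the fixture names are hypothetical. In a real pipeline the evaluate argument would wrap your actual simulator.

```python
def evaluate_stub(doc: dict) -> str:
    """Stand-in for the policy engine (assumption): deny full-record reads
    made for a customer-facing purpose, allow everything else."""
    args = doc.get("tool_args", {})
    context = doc.get("context", {})
    if args.get("full_record") and context.get("purpose") == "customer-facing":
        return "BLOCK"
    return "ALLOW"


def run_fixtures(fixtures: list, evaluate) -> list:
    """Return one failure message per fixture whose decision is unexpected."""
    failures = []
    for name, doc, expected in fixtures:
        actual = evaluate(doc)
        if actual != expected:
            failures.append(f"{name}: expected {expected}, got {actual}")
    return failures


known_bad = {
    "principal": {"id": "agent:care-navigator-001", "role": "clinical_agent"},
    "action": "tool_call:get_patient_record",
    "context": {"purpose": "customer-facing", "sensitivity_label": "phi"},
    "tool_args": {"patient_id": "P-4421", "full_record": True, "fields": None},
}
# Same call, but scoped to specific fields instead of the full record.
known_good = {
    **known_bad,
    "tool_args": {"patient_id": "P-4421", "full_record": False, "fields": ["allergies"]},
}

FIXTURES = [
    ("full-record-customer-facing", known_bad, "BLOCK"),
    ("field-scoped-query", known_good, "ALLOW"),
]

failures = run_fixtures(FIXTURES, evaluate_stub)
if failures:
    # A non-zero exit is what makes the CI job fail.
    raise SystemExit("\n".join(failures))
print("all policy fixtures passed")
```

Keeping the expected decision next to each fixture means a policy change that flips a verdict shows up as a failed build rather than as a surprise in production.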

The lifecycle is clean: Write test inputs for your known edge cases. Run them in simulation during code review. Deploy the policy in observe mode to monitor production traffic. Review the Day-7 report. Promote to enforce.

The Transition: From Observe to Enforce

When your team and compliance leads are satisfied with the calibration, promoting a policy to active enforcement requires a simple, single-line GitOps change:

resource "thoth_policy_bundle" "hipaa_minimum_necessary" {
  name             = "hipaa-minimum-necessary"
  framework        = "OPA"
  raw_policy       = file("${path.module}/policies/hipaa-minimum-necessary.rego")
  enforcement_mode = "enforce"   # changed from: "observe"
}

Every tool call that would have triggered a simulated block during the observation window is now programmatically intercepted. Because Thoth records the immutable version hash of the policy code throughout both stages, the transition is completely auditable. For SOC 2 CC6 compliance, the pull request history showing who reviewed the Day-7 telemetry and who authorized the toggle serves as the primary audit record. No separate log search required.

Version-Aware Audit Trails

Every update to your policy files creates a new, immutable bundle version. Thoth stamps the active policy version hash onto the runtime trace of every single decision.

This allows you to answer the exact question auditors ask during an investigation: "What specific version of your policy was running at 14:32 on May 9th, and what code logic dictated that choice?"

thothctl audit query \
  --agent care-navigator-001 \
  --from 2026-05-09T14:00:00Z \
  --to 2026-05-09T15:00:00Z \
  --include-policy-version

Cross-referencing the resulting hash against your version control repository surfaces the exact lines of Rego or Cedar code that executed at that specific millisecond. You hand the auditor a cryptographic trace, not an estimate.
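
As an illustration of that cross-referencing step, the sketch below matches a recorded hash against candidate policy revisions. It assumes the version hash is a plain SHA-256 digest of the raw policy text. Thoth's actual bundle hashing scheme may differ, and policy_hash, find_revision, and the commit identifiers are hypothetical.

```python
import hashlib
from typing import Optional


def policy_hash(policy_text: str) -> str:
    # Assumption: the version hash is SHA-256 over the raw policy source.
    return "sha256:" + hashlib.sha256(policy_text.encode("utf-8")).hexdigest()


def find_revision(recorded_hash: str, revisions: dict) -> Optional[str]:
    """Return the commit whose policy text hashes to recorded_hash, if any."""
    for commit, text in revisions.items():
        if policy_hash(text) == recorded_hash:
            return commit
    return None


# Hypothetical history: commit id -> policy file contents at that commit.
revisions = {
    "commit-a": "package hipaa\n\ndefault allow = false\n",
    "commit-b": "package hipaa\n\ndefault allow = false\n\n# field-scoped rule added\n",
}
recorded = policy_hash(revisions["commit-b"])  # stands in for the audit record's hash
print(find_revision(recorded, revisions))      # commit-b
```

A lookup that returns None is itself a finding: it would mean the policy that ran is not one of the revisions in the repository.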

Real-World Sequence: Regulated Healthcare Deployment

For a healthcare team deploying an autonomous procurement or care-navigation agent, the safe deployment sequence looks like this:

CI Simulation: Test fixtures covering compliant and malicious tool use run against the policy engine in the build pipeline.
14-Day Observation: The policy bundle is deployed in observe mode. The agent keeps operating without disruption while Thoth maps the tool-calling baseline.
Telemetry Review: Compliance reviews the report. Suppose 23 tool calls would have been blocked. Of these, 21 are cases where the policy is overly restrictive toward valid tool schemas, which indicates the policy needs broader parameters; the other 2 are genuine policy anomalies that require remediation.
Policy Tuning: The policy rules are updated to incorporate valid schema variations, verified locally via thothctl policy simulate.
Active Enforcement: The PR is merged, changing the state to enforce. The runtime guardrails are active, completely calibrated, and verified against real production traffic profiles.

That is the shift from having an abstract, unverified corporate policy to maintaining a proven, operational enforcement plane.

Getting Started

The Thoth infrastructure providers are available on the public registries:

Terraform: registry.terraform.io/providers/atensecurity/thoth
Pulumi: pulumi.com/atensec/thoth

Starter policy packs mapping to HIPAA Minimum Necessary, SOC 2 CC6, and FINRA 4370 are maintained in our public thoth-runbooks repository.

If you are currently managing volatile agent schemas in a regulated sector and want to eliminate the integration and maintenance tax of manual guardrails, reach out to our team at sales@atensecurity.com to kick off an automated observe-mode pipeline.
