Engineering · 11 min read

The v0.7 Bet: Postgres Stays Authoritative, the Bridge Fails Open

v0.7.2 of the Code Atelier Governance SDK adds an opt-in platform bridge, a write-and-poll HITL approvals inbox, and an AGT recipe scaffolder. The SDK stays in-process and Postgres stays the source of truth. The bridge fails open; enforcement does not.

We shipped v0.7.2 of the Code Atelier Governance SDK this week. The headline change is not the AGT recipe scaffolder, which is a distribution convenience we will get to at the bottom of this piece. The headline change is the platform bridge: an opt-in dual-write that lets operators get a hosted audit dashboard and a human-in-the-loop approvals inbox without adding a second piece of infrastructure to their stack. This piece is the engineering write-up of what the bridge actually is, what it is not, and what "opt-in and fail-open" means in code.

A quick note before we go further. Nothing in this article is legal advice. Regulatory references, including the EU AI Act Article 12 obligation on automatic event logging, cite primary sources. If you are in a regulated industry and making a deployment decision, your counsel should be in the room.

The default path to compliance-grade audit is another piece of infrastructure

Engineering teams I have talked to about agent governance tend to arrive at the same question. The product team has shipped an agent that writes to a real system. Finance has noticed. Legal has noticed. Someone asks who signs off when the agent spends more than a threshold, and who has the audit trail if a regulator asks in two years. The engineering team goes looking for a solution.

The default path they find is another piece of infrastructure. A Kafka topic to carry events off the primary database. An observability SaaS to store and index them. A trust service to hold the tamper-evident chain. A Data Processing Agreement with that trust service. An SSO integration so the compliance officer can actually read the dashboard. A second on-call rotation because the trust service is now in the critical path for your regulator-facing evidence. The event stream you were trying to secure becomes a thing your infrastructure team now operates.

For a mid-market company shipping one or two agentic features, that is a six-figure commitment in staff time before it is a software bill. For an enterprise, it is a procurement cycle. For a small team, it is a reason to keep the agent off the critical path indefinitely.

The v0.7 series of the Code Atelier Governance SDK is a bet against that default. Everything in the SDK still writes to the customer's existing Postgres first. That is the first and most important architectural invariant the SDK ships with, and it is non-negotiable: the host application must continue to work if the governance database is unreachable. What v0.7.0 added, and what v0.7.1 and v0.7.2 extended, is an optional bridge that fire-and-forgets a copy of each audit event to a hosted platform. The customer gets the dashboard, the approvals inbox, and the evidence pack without running any of it themselves. The SDK stays in-process. Postgres stays authoritative. The bridge is advisory.

What v0.7.0 added: the platform bridge

The bridge lives in a new package, codeatelier_governance.platform. It is an async HTTP client that POSTs each local audit event to the hosted platform's ingest endpoint at codeatelier.tech/api/v1/ingest/events. It is fire-and-forget from the audit hot path. It retries with exponential backoff on HTTP 429. It is silent on 5xx. It logs but does not raise on 4xx. It holds a bounded in-process queue, default 1000 events, and drops the oldest entry on saturation. If the platform is unreachable, rate-limited, or the ingest token is revoked, the host application keeps working and every audit event still commits to the customer's Postgres.
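The drop-oldest semantics of that bounded queue can be sketched with a stdlib deque. This is illustrative, not the bridge's published internals; the class and counter names are assumptions.

```python
from collections import deque


class BoundedEventQueue:
    """Bounded in-process queue that drops the oldest entry on saturation.

    A sketch of the drop-oldest behaviour described above; the bridge's
    actual internals are not published, so names here are illustrative.
    """

    def __init__(self, max_size: int = 1000):
        # deque(maxlen=...) evicts from the left automatically on append
        self._q = deque(maxlen=max_size)
        self.dropped = 0

    def put(self, event: dict) -> None:
        if len(self._q) == self._q.maxlen:
            self.dropped += 1  # count the oldest event we are about to evict
        self._q.append(event)

    def drain(self) -> list[dict]:
        """Hand the queued events to the forwarder and clear the queue."""
        events = list(self._q)
        self._q.clear()
        return events
```

The important property is that saturation costs you the oldest advisory copy, never a local Postgres write, and never a blocked caller.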

The bridge is opt-in and non-blocking by default. There are four new kwargs on the GovernanceSDK constructor: platform_ingest_url, platform_ingest_token, platform_bridge_enabled (default False), and platform_trusted_hosts. Each has an environment-variable equivalent prefixed GOVERNANCE_PLATFORM_*.

Setting the URL and token alone is not enough to turn the bridge on. You also have to pass platform_bridge_enabled=True or export GOVERNANCE_PLATFORM_BRIDGE_ENABLED=true. When the bridge activates, the SDK prints a single WARN line, platform.bridge_enabled, with the configured URL and a link to the data-residency page. Bridge activation is never invisible to an operator reading logs. That is a deliberate design choice: the SDK should not silently start sending audit metadata off-premise because two environment variables happened to be present in a container image from a previous deploy.

Here is roughly what the config looks like in code:

import os

from codeatelier_governance import GovernanceSDK

sdk = GovernanceSDK(
    database_url="postgresql://...",
    platform_ingest_url="https://codeatelier.tech/api/v1/ingest/events",
    platform_ingest_token=os.environ["GOVERNANCE_PLATFORM_INGEST_TOKEN"],
    platform_bridge_enabled=True,
)

With the flag off (the default), the SDK's behaviour is identical to v0.6.x. With the flag on, every audit event that commits to the customer's Postgres also gets queued for fire-and-forget forwarding to the hosted platform. The local write is synchronous and authoritative. The remote forward is asynchronous and advisory. That is the entire contract.
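That ordering contract fits in a few lines. The sketch below is not SDK source; db_commit and bridge_queue are stand-ins for the SDK's Postgres writer and bridge queue.

```python
def log_audit_event(db_commit, bridge_queue, event: dict) -> None:
    """Local commit first, bridge forward second, never the other way.

    Sketch of the dual-write ordering described above. db_commit and
    bridge_queue are illustrative stand-ins, not SDK names.
    """
    db_commit(event)         # synchronous and authoritative: raises on failure
    try:
        bridge_queue(event)  # asynchronous and advisory: never raises upward
    except Exception:
        pass                 # a bridge problem must not reach the host application
```

A failing bridge leaves the local trail complete; a failing local write never reaches the bridge at all.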

SSRF hardening is default-on, not opt-in

A bridge that makes outbound HTTP on behalf of the host application is a server-side request forgery surface by construction. We treated that seriously. The SSRF guard is default-ON and rejects RFC1918 addresses, loopback, link-local, and AWS, GCP, and Azure cloud-metadata hosts without any explicit allowlist from the operator. trust_env=False is set on the httpx client, which blocks HTTPS_PROXY and HTTP_PROXY exfiltration vectors that would otherwise get picked up from the container environment. follow_redirects=False blocks the classic 302-to-internal-host redirect exfil. Ingest tokens have whitespace stripped before being placed on the wire so a trailing newline from a Kubernetes secret does not trigger confusing 401s downstream. A 401 response latches the bridge off for the rest of the process so a revoked token never retries.

If you actually need to talk to a private ingest endpoint in your own VPC, you explicitly opt in via platform_trusted_hosts. That is the only way past the guard, and the opt-in happens at SDK construction, not via an environment variable that could drift in from a neighbouring deploy.
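The shape of that guard can be sketched with the stdlib ipaddress module. This is a simplified illustration under stated assumptions: the real guard also resolves DNS before checking, and the helper name and metadata-host list here are mine, not the SDK's.

```python
import ipaddress
from urllib.parse import urlsplit

# Well-known cloud metadata endpoints (illustrative, not exhaustive):
# 169.254.169.254 serves AWS and Azure; GCP also answers on a hostname.
_METADATA_HOSTS = {"169.254.169.254", "metadata.google.internal"}


def is_blocked_ingest_host(url: str, trusted_hosts: frozenset = frozenset()) -> bool:
    """Return True if the ingest URL points at a private or metadata host.

    Sketch of the SSRF posture described above; the shipped guard resolves
    hostnames to addresses before applying these checks.
    """
    host = urlsplit(url).hostname or ""
    if host in trusted_hosts:
        return False  # explicit operator opt-in at SDK construction
    if host in _METADATA_HOSTS:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # public hostname: the real guard resolves it first
    # RFC1918, loopback, and link-local all trip these stdlib properties
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

Note the opt-in is a constructor argument, mirroring the design point above: an allowlist should be something an operator wrote down, not something an environment inherited.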

That posture maps cleanly onto OWASP Top 10 for Agentic Applications 2026 ASI03 (Identity and Privilege Abuse) and the tool-abuse categories. The bridge is the SDK reaching outward to a third party, and each of those guards is there because "agent governance tool exfiltrates the environment it is supposed to protect" is a well-understood attack shape.

Observability-grade, not durable-queue

I want to name one constraint honestly. The bridge is observability-grade, not durable-queue. On a crash or a SIGKILL, up to max_queue_size in-flight rows on the bridge path may be lost. The local Postgres writes are unaffected, because they committed before the bridge ever queued them. If your regulator reads the chain off the hosted dashboard, and the hosted dashboard is five seconds behind the source of truth, that is by design. The source of truth is your Postgres.

If you need a durable, at-least-once delivery guarantee between your application and the dashboard, that is a v0.8 conversation. v0.7 optimised for "opt-in, fail-open, zero new infrastructure." Adding at-least-once to that constraint set without breaking either of the other two is non-trivial, and it is the next big architectural conversation for the project.

v0.7.1: a second channel for HITL approvals, with local still authoritative

v0.7.0 bridged audit events. v0.7.1 extended the bridge to human-in-the-loop gates. When the bridge is on, sdk.gates.request() dual-writes each pending gate to the platform's /app/approvals inbox, and sdk.gates.wait_for() polls the platform's resolution endpoint alongside the local gate store. A reviewer can now approve or deny a gate from a web UI instead of a CLI. The CLI fallback, governance grant <token> and governance deny <token>, still works.

The rule is unchanged: local always wins. If the local gate store says granted and the platform says pending, the agent proceeds on local. If the local gate store says pending and the platform returns granted, the SDK syncs the resolution back into the local store via the on-commit hook that emits the approval.granted audit event, and the wait returns. A sibling process waking up a moment later sees the decision locally without another platform round-trip.

The poll cadence on the platform side is 2 seconds base with 500 ms of jitter, exponential backoff on errors capped at 10 seconds. The local poll cadence is unchanged from v0.7.0 at a default of 500 ms. Local reads first, platform reads only if local is still pending. An operator who wants to run the bridge disabled sees exactly the same wait-loop behaviour as they saw on v0.7.0, which is the same behaviour they saw on v0.6.x.
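The platform-side cadence reduces to a small delay function. A minimal sketch, assuming the backoff doubles per consecutive error; the helper name and exact doubling rule are illustrative, only the constants come from the text above.

```python
import random

PLATFORM_BASE_S = 2.0    # platform poll base cadence: 2 seconds
PLATFORM_JITTER_S = 0.5  # plus up to 500 ms of jitter
PLATFORM_CAP_S = 10.0    # error backoff capped at 10 seconds


def next_platform_delay(consecutive_errors: int, rng: random.Random) -> float:
    """Seconds to wait before the next platform resolution poll.

    Sketch of the cadence described above; the SDK's scheduling code is
    not published, so treat the shape (not the name) as the point.
    """
    if consecutive_errors == 0:
        # healthy path: base cadence with jitter to avoid thundering herds
        return PLATFORM_BASE_S + rng.uniform(0.0, PLATFORM_JITTER_S)
    # error path: exponential growth, never past the cap
    return min(PLATFORM_BASE_S * (2 ** consecutive_errors), PLATFORM_CAP_S)
```

The local 500 ms loop runs independently of this; the platform delay only matters when local is still pending.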

The wire format is small on purpose

The forwarded body on a gate-request carries exactly six fields: request_id, agent_id, kind, action_hash, expires_at, and sdk_version. It does not carry the single-use HMAC token that approvers use on the CLI fallback path (that stays strictly customer-side). It does not carry the agent-side payload, because payloads can contain PII. The resolution response from the platform deliberately omits the reviewer's reason string, because a reason written in free text by a human reviewer should not round-trip back into the agent's LLM context. That last one was a v0.7.1 security decision and it is one I want on the record: reviewer text going back into an agent's prompt is an injection surface.

We wrote a regression test that asserts the forward body contains exactly those six fields and none of token, secret, payload, chain_key, reason, decision, hmac, or client_hmac. A future refactor that leaks a secret to the platform fails CI. That is the kind of guard I want in place before I ask an operator to flip a flag that moves bytes off-premise.
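The shape of that guard looks roughly like the check below. This is a sketch of the test's structure, not the SDK's actual test code; only the two field lists come from the text above.

```python
ALLOWED_FIELDS = {
    "request_id", "agent_id", "kind", "action_hash", "expires_at", "sdk_version",
}
BANNED_FIELDS = {
    "token", "secret", "payload", "chain_key",
    "reason", "decision", "hmac", "client_hmac",
}


def check_forward_body(body: dict) -> None:
    """Fail loudly if a gate-forward body drifts from the six-field contract.

    Sketch of the regression guard described above; the SDK's real test
    builds the body through the bridge, this version just checks a dict.
    """
    keys = set(body)
    # exact-set equality catches both leaked extras and dropped fields
    assert keys == ALLOWED_FIELDS, f"unexpected field set: {sorted(keys)}"
    assert not (keys & BANNED_FIELDS), f"leaked fields: {sorted(keys & BANNED_FIELDS)}"
```

Exact-set equality is the load-bearing choice: an allowlist-only check would let a new field slip through unreviewed, and a banlist-only check would miss a renamed secret.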

Reverse-sync closes the loop

Reverse-sync handles the other direction. When a reviewer uses the CLI to grant or deny a gate that was forwarded to the platform, the SDK fire-and-forgets a POST /api/v1/bridge/gates/:id/resolution so the hosted inbox renders "Resolved in SDK" instead of leaving a stale pending card for the compliance officer to wonder about. This only fires for gates the platform actually acknowledged (tracked via the bridge's internal forwarded-request set), so a stale CLI grant against a platform-down-at-creation-time gate skips silently. The reverse-sync stat counters (local_resolutions_sent, local_resolutions_dropped) are exposed on sdk.platform.stats() for operators who want to see the end-to-end flow.

There is also a forward-burst semaphore, capped at 100 via asyncio.Semaphore(100). If the platform is hanging and the application is generating gates faster than the platform can ingest them, concurrent forwards back-pressure instead of creating an unbounded number of coroutines. A new stat counter, gate_forwards_queued, increments whenever a task hits that back-pressure. Operators can see it on sdk.platform.stats() and alert on it if the bridge starts queuing under normal traffic.
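The back-pressure mechanics can be sketched in a few lines of asyncio. The class name and counter placement are illustrative; only the semaphore cap and the gate_forwards_queued stat come from the text above.

```python
import asyncio


class GateForwarder:
    """Caps concurrent gate forwards and counts back-pressured tasks.

    Sketch of the forward-burst semaphore described above; not the SDK's
    actual implementation.
    """

    def __init__(self, max_in_flight: int = 100):
        self._sem = asyncio.Semaphore(max_in_flight)
        self.gate_forwards_queued = 0  # exposed via stats for operator alerting

    async def forward(self, send) -> None:
        if self._sem.locked():
            # all permits taken: this forward waits instead of spawning
            # an unbounded coroutine against a hanging platform
            self.gate_forwards_queued += 1
        async with self._sem:
            await send()
```

A hanging platform therefore costs you at most max_in_flight in-flight requests plus a counter operators can alert on, never unbounded memory.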

The 402 tier-not-entitled latch, and why it is permanent until restart

This is the part of the design I want to describe in detail, because it is the part that most often surprises a security-conscious reviewer.

If the hosted platform returns HTTP 402 on any bridge route (ingest, gate forward, resolution poll, reverse-sync), the SDK does three things. It auto-disables the bridge for the rest of the process via a single-flight latch. It logs one structured WARN, platform.bridge_disabled reason=tier_not_entitled, with the platform-supplied upgrade_url. And it silently drops every subsequent bridge call for the lifetime of that process. The latch resets only on a process restart.

Every local path keeps working through all of that. sdk.audit.log() commits to the customer's Postgres. sdk.gates.request() creates local gates. sdk.gates.grant() and sdk.gates.deny() resolve them. The CLI fallback still issues approvals. The host application never sees a 402 and never sees an exception. If the 402 had flipped the application's error path, a billing-tier glitch at the platform would become an agent outage at the customer. We chose to make the bridge advisory instead.

The same latch covers 401 auth failures. A revoked or briefly-rotated token disables the bridge for the rest of the process rather than retrying forever and flooding the logs. Operators can detect the state via two fields on sdk.platform.stats(): the pre-existing disabled boolean, and the new disabled_reason string, which emits "auth_failed" for 401, "tier_not_entitled" for 402, and None otherwise. Dashboards that already scrape the disabled boolean keep working; the disabled_reason field is additive.

The permanent-until-restart property is the exact property that makes the bridge safe to enable. A flapping tier check that retries every thirty seconds would create an intermittent double-write where the operator cannot tell whether a gate made it to the inbox or not. A latch that stays closed until the operator explicitly resets it (by restarting the process) gives you a crisp state machine: either the bridge is on and healthy, or it is off and the operator knows why. No half-on state. No "some gates are in the inbox and some are not" mystery. The bridge is never on the enforcement path, so silent latching never causes an enforcement bypass.

A consecutive-5xx backoff cap is the same idea at a different timescale. After five consecutive platform poll errors, the next-poll hold jumps from the normal jittered interval to 60 seconds for one tick, before resuming the exponential-with-cap ladder. The latch resets on any successful poll. We are not going to hammer a stuck upstream on behalf of an advisory feature.
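The latch state machine fits in one small class. This is a sketch of the behaviour described above, under the assumption that the first latched status wins; the class and method names are illustrative, the disabled and disabled_reason fields mirror sdk.platform.stats().

```python
class BridgeLatch:
    """Single-flight latch that disables the bridge for the process lifetime.

    Sketch of the 401/402 handling described above; the SDK's internal
    names are not published, so these are illustrative.
    """

    def __init__(self):
        self.disabled = False
        self.disabled_reason = None  # "auth_failed" | "tier_not_entitled" | None

    def on_response(self, status: int) -> None:
        if self.disabled:
            return  # latch already set: drop silently, no relabeling, no retries
        if status == 401:
            self._disable("auth_failed")
        elif status == 402:
            self._disable("tier_not_entitled")

    def _disable(self, reason: str) -> None:
        self.disabled = True  # resets only on process restart
        self.disabled_reason = reason
        # the real SDK logs exactly one structured WARN at this point

    def should_send(self) -> bool:
        return not self.disabled
```

Every transition is one-way and process-scoped, which is exactly the crisp on-or-off-and-you-know-why property the design aims for.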

The invariants that make it safe to flip on

Five design choices, stated plainly. If any of these surprise you, read the v0.7.0 and v0.7.1 sections of the repository changelog before you enable the bridge.

  1. Postgres is authoritative. Every audit event and every gate commits locally first. The bridge is an asynchronous forward of what already landed. Turning the bridge off tomorrow leaves a complete, tamper-evident, HMAC-chained audit trail in the database you were already running.
  2. No background worker is required. The bridge runs in-process. An architectural invariant the SDK has held since v0.1 is that no feature may require a background worker process to function. v0.7 did not relax it.
  3. SSRF guard is default-on. RFC1918, loopback, link-local, and cloud metadata hosts are rejected out of the box. trust_env=False blocks proxy exfil. follow_redirects=False blocks 302 exfil. Tokens are whitespace-stripped before they hit the wire.
  4. Wire-format CI guard. The gate-forward body is asserted to contain exactly the six permitted fields and none of the eight banned ones. A refactor that leaks a secret fails CI before it ships. The guard is in the test suite, not just in a code-review comment.
  5. Explicit opt-in with a visible WARN on activation. The flag is off by default. Turning it on prints one WARN line per process with the configured URL and a data-residency link. An operator who did not know the bridge was on can find out by reading the first thousand lines of their application logs.

Taken together, those five choices describe a bridge that is hard to turn on by accident, hard to exfiltrate through, and impossible to be the sole source of the audit trail. It is advisory. The customer's own Postgres is the system of record.

What this is not

I want to be explicit about what the bridge does not claim to be.

It is not a durable queue. A kill-9 can lose up to max_queue_size in-flight rows on the bridge path. The local write already committed, so the local chain is intact; the hosted dashboard may be slightly behind until the next event arrives and the queue drains.

It is not a replacement for the customer's own disaster-recovery story for audit data. If you are obliged to keep audit logs for six months under EU AI Act Article 12, that obligation sits on your Postgres, not on the hosted dashboard. We publish a data-residency page so you can see what the hosted platform retains, for how long, and in which region, before you enable the bridge. If your compliance posture is "audit events must never leave the customer VPC," leave the flag off.

It is not the enforcement path. Enforcement (scope checks, cost pre-flight, HITL gates) is the local SDK in your process. The bridge is a read-out and a review surface on top of enforcement. The 402 latch section above is the clearest place this matters: the bridge silently latching off cannot bypass an enforcement decision, because the enforcement decision already happened locally.

It is not required to use the SDK. Every feature in v0.6.x still works in v0.7.2 with the bridge flag off. Some customers will run with the bridge off permanently and that is a fully supported configuration. The hosted dashboard is an affordance, not a gate.

v0.7.2 in one paragraph: the AGT recipe scaffolder

Since this is the v0.7.2 release write-up, I owe you the actual v0.7.2 change: a CLI command, codeatelier-governance recipe agt <path>, that writes a five-file Microsoft Agent Framework starter already wired through the governance sandwich. The scaffolded agent.py demonstrates sdk.scope.check plus sdk.cost.preflight plus sdk.gates.request around a refund tool, where refunds of $1,000 or more block on a human decision. --force overwrites an existing directory but refuses symlinked targets outright, so ln -s /etc/important my-agent; recipe agt my-agent --force cannot dereference through the link and wipe a critical path on the host. The fix for that vector was applied during the release security review and is the kind of edge case the command-line surface attracts. There are no schema changes, no migrations, and no runtime behaviour changes; the scaffolder is a build-time convenience and never touches Postgres. If you want the starter, install the SDK and run the command. Future recipes (LangGraph, CrewAI) plug in by dropping a directory under recipes/<name>/ and a single name-registry entry, with no CLI changes required.
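The symlink refusal is the interesting edge case in that paragraph, and its logic can be sketched in a few lines. The helper name and exceptions are mine, not the scaffolder's; the point is the ordering: the symlink check runs before --force can touch anything.

```python
from pathlib import Path


def check_scaffold_target(path: str, force: bool = False) -> Path:
    """Refuse symlinked targets before --force is allowed to overwrite.

    Sketch of the guard described above; the scaffolder's real argument
    handling is not published, so this helper is illustrative.
    """
    target = Path(path)
    if target.is_symlink():
        # never dereference: writing "through" the link would modify
        # whatever the link points at, e.g. a critical host path
        raise ValueError(f"refusing symlinked target: {path}")
    if target.exists() and not force:
        raise FileExistsError(f"{path} exists; pass --force to overwrite")
    return target
```

Checking is_symlink() before exists() matters: exists() follows links, so a link to a missing path would otherwise look like a fresh target.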

The bet behind the v0.7 series

We keep three positioning lines pinned to the wall. They have been there since v0.1, and v0.7 is where they have had to earn their keep against a harder design problem than anything the SDK has faced before.

Enforcement gates, not just tracing. Every incumbent in the adjacent space, whether it is an observability platform, a gateway, or a trust service, sits downstream of the LLM call. The governance SDK sits in-process, upstream of the call, and can refuse to let the call fire. The platform bridge does not change that. Enforcement is local; the dashboard is remote.

Five lines to enforcement. The AGT recipe is the cleanest example we have published. The scaffold wires three SDK calls (scope, cost, gate) around a refund tool in under five lines of governance code. What the lines DO matters more than how many. One line of tracing does not refuse a $50,000 refund to a hijacked customer account; five lines of enforcement does.

Just Postgres. No ClickHouse, no Redis, no Kafka, no sidecar, no background worker. The host application's existing Postgres is the only infrastructure dependency. The bridge is optional and fails open, so "just Postgres" survives the v0.7 extension. If you turn the bridge off tomorrow, you still have a complete, tamper-evident, HMAC-chained audit trail in the database you were already running, aligned with the event-logging obligations of EU AI Act Article 12 and the auditability direction of the NIST AI Risk Management Framework.

The v0.7 bet, reduced to one sentence: an operator who wants a hosted dashboard should not have to take on more operations work to get one, and an operator who cannot accept any audit metadata leaving their VPC should not have to accept a worse SDK to keep their posture. The bridge is how those two operators share a codebase.

If you want to compare notes on any of this, or if you are running the v0.7 series and have an edge case you want me to look at, I am easy to reach. The contact form at the top of the site goes directly to my inbox.

Frequently Asked Questions

Is the platform bridge on by default in v0.7.2?

No. The bridge is opt-in, and setting the environment variables for the ingest URL and token alone is not enough to turn it on. You must also pass platform_bridge_enabled=True on the GovernanceSDK constructor, or set GOVERNANCE_PLATFORM_BRIDGE_ENABLED=true. When the flag is off, the SDK's behaviour is identical to v0.6.x. When the flag is on, a single structured WARN line prints at SDK init (platform.bridge_enabled) with the configured URL and a link to the data-residency page, so activation is never invisible to an operator reading the logs.

What happens to audit events if the hosted platform is unreachable?

The local Postgres write is unaffected. The bridge is fire-and-forget from the audit hot path: it retries with exponential backoff on HTTP 429, is silent on 5xx, and logs but does not raise on 4xx. A bounded in-process queue (default 1000 events) drops the oldest entries on saturation. The governing architectural invariant is that the host application must continue working if the governance database is unreachable; v0.7.1 extended that to cover the case where the platform rejects the tier. The bridge is observability-grade, not durable-queue, so a crash or SIGKILL can lose up to max_queue_size in-flight rows on the bridge path. The local chain on your Postgres is always intact.

What exactly does the SDK forward to the platform when a gate is requested?

Exactly six fields: request_id, agent_id, kind, action_hash, expires_at, and sdk_version. It does not forward the single-use HMAC approval token (that stays customer-side for the CLI fallback path). It does not forward the agent-side payload (which can contain PII). The resolution response from the platform deliberately omits the reviewer's reason string, because a reason written in free text by a human reviewer should not round-trip back into the agent's LLM context. A CI regression test asserts the forward body contains exactly those six fields and none of token, secret, payload, chain_key, reason, decision, hmac, or client_hmac. A future refactor that leaks a secret to the platform fails CI before it ships.

Why is the 402 tier-not-entitled latch permanent until process restart?

Because the bridge is advisory and the latch property is what makes it safe. A flapping tier check that retries every thirty seconds would create an intermittent double-write where the operator cannot tell whether a given gate made it to the hosted inbox or not. A latch that stays closed until the operator explicitly resets the process gives you a crisp state machine: either the bridge is on and healthy, or it is off and the operator knows exactly why. Local audit.log, gates.request, gates.grant, and gates.deny remain fully operational while the latch is set. The bridge is never on the enforcement path, so silent latching never causes an enforcement bypass.

How does this relate to EU AI Act Article 12?

Article 12 obliges providers and deployers of high-risk AI systems to keep automatic event logs with a minimum six-month retention window. The Commission's enforcement powers enter application on 2 August 2026. The governance SDK writes an HMAC-chained, append-only audit log to your Postgres by default; that log goes beyond the Article 12 letter by adding tamper-evidence, which the Act itself does not require. The platform bridge is an additional, optional read-out of the same events on a hosted dashboard. The obligation to retain audit logs sits on your Postgres, not on the hosted dashboard. We publish a data-residency page that documents what the hosted platform retains, and for how long, so you can make that call before enabling the bridge. Nothing in this answer is legal advice.

What changed in v0.7.2 compared to v0.7.0?

v0.7.0 shipped the platform bridge (opt-in dual-write of audit events), v0.7.1 extended the bridge to human-in-the-loop gates with a six-field wire format and reverse-sync, and v0.7.2 added the codeatelier-governance recipe agt CLI scaffolder for Microsoft Agent Framework starters. v0.7.2 has no schema migrations, no breaking API changes, and no runtime behaviour changes; the scaffolder is build-time only and does not touch Postgres. Upgrade with pip install --upgrade 'code-atelier-governance>=0.7.2'.

Does the Code Atelier Governance SDK require a background worker?

No. An architectural invariant the SDK has held since v0.1 is that no feature may require a background worker process to function. v0.7.2 did not relax it. The platform bridge runs in-process behind a bounded asyncio queue (default 1000 events), fire-and-forget from the audit hot path. The only infrastructure dependency is a Postgres connection string.

Code Atelier · NYC
