Gateway Queue And Recovery
This page explains how the live gateway process persists queue state, tracks the currently attached upstream instance, and avoids replaying work across unsafe continuity changes.
Mental Model
The gateway is small on purpose.
- It keeps a durable queue plus a read-optimized status snapshot.
- It owns one active execution slot for terminal-mutating work.
- It treats the managed agent behind it as a replaceable upstream instance, not as the durable identity of the session.
- That is why it tracks an epoch and blocks replay when continuity becomes uncertain.
Queue Storage Model
The durable queue lives in queue.sqlite.
Current stored request states:
- accepted
- running
- completed
- failed
- coalesced
Current queue-depth reporting counts only accepted and running items. Completed, failed, and coalesced records remain useful for history and diagnostics, but they are not part of active queue depth.
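The depth rule above can be illustrated with a small query. This is a minimal sketch, not the gateway's actual schema: the `requests` table and its `request_id` and `state` columns are hypothetical names chosen for the example, and an in-memory database stands in for queue.sqlite.

```python
import sqlite3

# States that count toward active queue depth, per the rule above.
ACTIVE_STATES = ("accepted", "running")


def active_queue_depth(conn: sqlite3.Connection) -> int:
    """Count only records that still occupy the active queue."""
    placeholders = ", ".join("?" for _ in ACTIVE_STATES)
    row = conn.execute(
        f"SELECT COUNT(*) FROM requests WHERE state IN ({placeholders})",
        ACTIVE_STATES,
    ).fetchone()
    return row[0]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in with one record per stored state."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the real table and column names may differ.
    conn.execute(
        "CREATE TABLE requests (request_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO requests VALUES (?, ?)",
        [
            ("r1", "accepted"),
            ("r2", "running"),
            ("r3", "completed"),
            ("r4", "failed"),
            ("r5", "coalesced"),
        ],
    )
    return conn
```

With one record in each state, only the accepted and running rows contribute to the reported depth; the other three stay in the table for history.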
Opt-in gateway diagnostic logs do not live in queue.sqlite. They are cleanup-sensitive JSONL files under logs/diagnostics/ and are useful for route-boundary and mailbox-operation postmortems. The queue database remains the durable authority for accepted work, terminal state, and gateway-owned notifier audit history.
Current-Instance State
The gateway writes run/current-instance.json with:
- process id,
- bound host,
- bound port,
- managed_agent_instance_epoch,
- optional managed_agent_instance_id.
This file tells the runtime which live gateway process published the current listener, and it lets the gateway notice when the upstream managed-agent instance behind the same session changed.
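As a rough illustration of what publishing that file could look like, the sketch below writes the listed fields as JSON. Only `managed_agent_instance_epoch` and `managed_agent_instance_id` are key names taken from this page; `pid`, `host`, and `port` are assumed key names, `write_current_instance` is a hypothetical helper, and the write-then-rename step is an assumption about how a reader might avoid seeing a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_current_instance(
    run_dir: Path,
    host: str,
    port: int,
    epoch: int,
    instance_id: Optional[str] = None,
) -> Path:
    """Publish the current listener record under run_dir (hypothetical helper)."""
    record = {
        "pid": os.getpid(),   # assumed key name for the process id
        "host": host,         # assumed key name for the bound host
        "port": port,         # assumed key name for the bound port
        "managed_agent_instance_epoch": epoch,
    }
    # The instance id is optional, so it is written only when known.
    if instance_id is not None:
        record["managed_agent_instance_id"] = instance_id

    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / "current-instance.json"
    # Write to a temporary file first, then rename it into place
    # (an assumption, not confirmed gateway behavior).
    fd, tmp_name = tempfile.mkstemp(dir=run_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
    os.replace(tmp_name, target)
    return target
```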
Request Admission And Serial Execution
The gateway worker loop is intentionally serialized.
- only one queue item can hold the active terminal-mutation slot at a time,
- new requests are first persisted as accepted,
- the worker coalesces adjacent pending control intents before promotion,
- the worker promotes the next effective eligible request to running,
- completion updates the record to completed or failed and appends an event.
sequenceDiagram
participant CLI as Runtime CLI
participant GW as Gateway
participant Q as queue.sqlite
participant Be as Agent terminal
CLI->>GW: POST /v1/requests
GW->>Q: insert accepted record
GW-->>CLI: accepted response
opt adjacent control-intent run
GW->>Q: mark superseded records<br/>coalesced
end
GW->>Q: promote effective record<br/>to running
GW->>Be: submit_prompt or interrupt
alt backend call succeeds
GW->>Q: mark completed
else backend call fails
GW->>Q: mark failed
end
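The admission and promotion flow in the diagram can be sketched as a single-step worker. This is a simplified illustration under assumptions: the `requests` table, the `seq` ordering column, and the `admit` and `run_next` helpers are hypothetical, and the sketch leaves out coalescing and event appends.

```python
import sqlite3
from typing import Callable, Optional


def open_queue() -> sqlite3.Connection:
    """In-memory stand-in for queue.sqlite with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "kind TEXT NOT NULL, "
        "state TEXT NOT NULL)"
    )
    return conn


def admit(conn: sqlite3.Connection, kind: str) -> int:
    """Persist a new request as accepted before any execution happens."""
    cur = conn.execute(
        "INSERT INTO requests (kind, state) VALUES (?, 'accepted')", (kind,)
    )
    conn.commit()
    return cur.lastrowid


def run_next(
    conn: sqlite3.Connection, backend: Callable[[str], None]
) -> Optional[int]:
    """Promote the oldest accepted request, call the backend, record the outcome."""
    row = conn.execute(
        "SELECT seq, kind FROM requests WHERE state = 'accepted' "
        "ORDER BY seq LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    seq, kind = row
    # Only this one record holds the execution slot while the backend call runs.
    conn.execute("UPDATE requests SET state = 'running' WHERE seq = ?", (seq,))
    conn.commit()
    try:
        backend(kind)
        final_state = "completed"
    except Exception:
        final_state = "failed"
    conn.execute(
        "UPDATE requests SET state = ? WHERE seq = ?", (final_state, seq)
    )
    conn.commit()
    return seq
```

Because `run_next` handles one record per call and returns before the next promotion, at most one record is ever in the running state in this sketch.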
Control-Intent Coalescing
The gateway treats a narrow set of queued records as coalescible control intents:
- interrupt,
- submit_prompt whose entire trimmed prompt is exactly /compact,
- submit_prompt whose entire trimmed prompt is exactly /clear,
- submit_prompt whose entire trimmed prompt is exactly /new.
This policy is intentionally conservative. It does not parse command prefixes inside ordinary prose, it does not coalesce multiline prompts that merely mention commands, and it does not apply to direct /v1/control/prompt because that route bypasses the durable queue.
When the oldest accepted queue record is a control intent, the worker scans the adjacent accepted control-intent run for the same managed_agent_instance_epoch. Ordinary prompts, internal mail_notifier_prompt records, unsupported request kinds, and different epochs stop the scan. Within the run, duplicate interrupts collapse to one interrupt, context-control prompts collapse to the strongest effective context action, /new supersedes /clear and /compact, and /clear supersedes /compact. If both interrupt and context action remain effective, the interrupt executes first and the context action executes afterward.
Rows removed from execution are not deleted. They are marked coalesced, get finished_at_utc, and store result_json with the superseding request or action. The gateway also appends a coalesced event listing the coalesced request ids and effective actions.
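The scan and collapse rules above can be sketched as a pure function over the accepted records, oldest first. The `Record` type and helper names are hypothetical. Two tie-breaks are assumptions that this page does not specify: the earliest interrupt in the run is kept, and among equally strong context actions the latest one is kept. The sketch does not model result_json, finished_at_utc, or the coalesced event.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Context actions ranked by strength: /new > /clear > /compact.
CONTEXT_STRENGTH = {"/compact": 1, "/clear": 2, "/new": 3}


@dataclass
class Record:
    request_id: str
    kind: str
    epoch: int
    prompt: Optional[str] = None


def control_intent(record: Record) -> Optional[str]:
    """Return the control intent a record carries, or None for ordinary work."""
    if record.kind == "interrupt":
        return "interrupt"
    if record.kind == "submit_prompt" and record.prompt is not None:
        trimmed = record.prompt.strip()
        # Only an exact match on the whole trimmed prompt qualifies.
        if trimmed in CONTEXT_STRENGTH:
            return trimmed
    return None


def coalesce_head_run(accepted: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (effective, coalesced) for the control-intent run at the queue head."""
    if not accepted or control_intent(accepted[0]) is None:
        return [], []
    epoch = accepted[0].epoch
    run: List[Record] = []
    for record in accepted:
        # A different epoch or any non-control record stops the scan.
        if record.epoch != epoch or control_intent(record) is None:
            break
        run.append(record)

    # Assumption: the earliest interrupt is the one that survives.
    interrupt = next((r for r in run if control_intent(r) == "interrupt"), None)
    context: Optional[Record] = None
    for r in run:
        intent = control_intent(r)
        if intent in CONTEXT_STRENGTH:
            # Assumption: on equal strength, the later record wins.
            if context is None or (
                CONTEXT_STRENGTH[intent] >= CONTEXT_STRENGTH[control_intent(context)]
            ):
                context = r

    # The interrupt executes first, then the context action.
    effective = [r for r in (interrupt, context) if r is not None]
    coalesced = [r for r in run if all(r is not e for e in effective)]
    return effective, coalesced
```

For a head run of interrupt, /compact, interrupt, /new followed by an ordinary prompt, the sketch keeps the first interrupt and the /new prompt, marks the /compact prompt and the second interrupt as coalesced, and leaves everything from the ordinary prompt onward untouched.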
Health Versus Upstream Availability
This split is easy to miss the first time you debug the system.
- GET /health only asks whether the gateway control plane is alive.
- GET /v1/status adds the managed-agent view: connectivity, recovery state, request admission, and surface eligibility.
That means a healthy gateway can still report:
- managed_agent_connectivity=unavailable,
- managed_agent_recovery=awaiting_rebind,
- request_admission=blocked_unavailable.
The gateway is alive; the upstream session it fronts is not currently ready.
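A client-side check for this split might look like the sketch below. It works on an already-parsed status payload rather than calling the gateway. The helper name is hypothetical, and treating every `blocked_`-prefixed `request_admission` value as "not admitting" is an assumption drawn only from the two blocked values named on this page.

```python
def alive_but_not_admitting(status: dict) -> bool:
    """True when the gateway is healthy but requests are currently blocked."""
    healthy = status.get("gateway_health") == "healthy"
    # Assumption: blocked admission states share the "blocked_" prefix.
    blocked = str(status.get("request_admission", "")).startswith("blocked_")
    return healthy and blocked


# Example payload built from the field values listed above.
upstream_down = {
    "gateway_health": "healthy",
    "managed_agent_connectivity": "unavailable",
    "managed_agent_recovery": "awaiting_rebind",
    "request_admission": "blocked_unavailable",
}
```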
Epochs, Reconciliation, And Replay Blocking
The gateway increments managed_agent_instance_epoch when it sees a different current upstream instance id than the last one it recorded.
Consequences:
- if the upstream instance did not change, the epoch stays stable,
- if the upstream instance changed, the gateway enters reconciliation-oriented status,
- requests accepted for the old epoch are not replayed blindly against the replacement instance.
Representative status after an instance change:
{
"gateway_health": "healthy",
"managed_agent_connectivity": "connected",
"managed_agent_recovery": "reconciliation_required",
"request_admission": "blocked_reconciliation",
"managed_agent_instance_epoch": 2
}
This is a safety boundary, not just bookkeeping. It prevents the sidecar from silently delivering old queued intent to a new upstream instance whose continuity has not been positively established.
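The epoch rule can be sketched as a small state transition. The `EpochState` type and both helpers are hypothetical. The `reconciliation_required` and `blocked_reconciliation` values come from the representative status above, while the default `"idle"` and `"open"` values are placeholders invented for the example, since this page does not name the steady-state values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpochState:
    epoch: int
    instance_id: str
    recovery: str = "idle"    # placeholder value, not a documented state
    admission: str = "open"   # placeholder value, not a documented state


def observe_instance(state: EpochState, current_instance_id: str) -> EpochState:
    """Bump the epoch only when the upstream instance id differs from the recorded one."""
    if current_instance_id == state.instance_id:
        return state
    return EpochState(
        epoch=state.epoch + 1,
        instance_id=current_instance_id,
        recovery="reconciliation_required",
        admission="blocked_reconciliation",
    )


def may_replay(request_epoch: int, state: EpochState) -> bool:
    """A queued request runs only for its own epoch and while admission is not blocked."""
    return (
        request_epoch == state.epoch
        and state.admission != "blocked_reconciliation"
    )
```

A request accepted under epoch 1 fails the `may_replay` check once the upstream instance is replaced, which is the behavior the safety boundary is meant to guarantee.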
Restart Recovery
Gateway restarts do not discard already accepted queued work by default.
Current behavior:
- requests left in accepted state remain eligible after restart,
- requests left in running state are marked failed on startup because the old process died mid-execution,
- accepted work can be recovered, coalesced, and executed after restart if the upstream instance continuity is still valid,
- accepted work is preserved but not replayed when the new startup detects an epoch change that requires reconciliation.
sequenceDiagram
participant Old as Old gateway
participant Q as queue.sqlite
participant New as New gateway
participant Up as Upstream
Old->>Q: leave accepted work<br/>durably stored
Old-x New: process restart
New->>Q: fail leftover running work
New->>Up: inspect current instance id
alt same instance
New->>Q: execute accepted work
else replacement instance
New->>Q: keep accepted work<br/>blocked by reconciliation
end
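The startup sequence in the diagram can be sketched as one recovery pass. The `requests` table layout, the `recover_on_startup` helper, and the shape of its return value are hypothetical. The sketch only decides whether replay is allowed; it does not execute any work.

```python
import sqlite3


def recover_on_startup(
    conn: sqlite3.Connection,
    recorded_instance_id: str,
    current_instance_id: str,
) -> dict:
    """Fail leftover running work, keep accepted work, and decide on replay."""
    # Anything still marked running belonged to the process that died.
    failed = conn.execute(
        "UPDATE requests SET state = 'failed' WHERE state = 'running'"
    ).rowcount
    conn.commit()
    # Accepted rows are never touched here, so they survive the restart.
    preserved = conn.execute(
        "SELECT COUNT(*) FROM requests WHERE state = 'accepted'"
    ).fetchone()[0]
    return {
        "failed_running": failed,
        "accepted_preserved": preserved,
        # Replay is allowed only when the upstream instance did not change.
        "replay_allowed": recorded_instance_id == current_instance_id,
    }
```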
Current Execution-Adapter Boundary
The live gateway process now selects an execution adapter from manifest-backed authority plus internal bootstrap metadata instead of assuming a single callback path.
- Legacy REST-backed adapters may still appear when inspecting old manifests, but new public launches no longer create cao_rest or houmao_server_rest sessions.
- A local tmux-backed adapter covers runtime-owned native headless sessions and runtime-owned local_interactive sessions, and resumes that runtime through runtime-owned control.
- A passive-server-managed headless adapter covers native headless sessions whose attach metadata publishes managed_api_base_url plus managed_agent_ref, and routes prompt or interrupt work back through the managed-agent API rather than bypassing passive-server-owned turn authority.
- Queue durability, reconciliation checks, and gateway-local epoch handling stay the same across those adapters.