# Gateway Queue And Recovery
This page explains how the live gateway process persists queue state, tracks the currently attached upstream instance, and avoids replaying work across unsafe continuity changes.
## Mental Model
The gateway is small on purpose.
- It keeps a durable queue plus a read-optimized status snapshot.
- It owns one active execution slot for terminal-mutating work.
- It treats the managed agent behind it as a replaceable upstream instance, not as the durable identity of the session.
- That is why it tracks an epoch and blocks replay when continuity becomes uncertain.
## Queue Storage Model
The durable queue lives in queue.sqlite.
Current stored request states:
- `accepted`
- `running`
- `completed`
- `failed`
Current queue-depth reporting counts only `accepted` and `running` items. Completed and failed records remain useful for history and diagnostics, but they are not part of active queue depth.
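The depth rule can be sketched directly against `queue.sqlite`. This is an illustrative sketch, not the gateway's actual schema: the `requests` table and column names are assumptions; only the four states and the counting rule come from the documented model.

```python
import sqlite3

# Assumed schema for illustration; only the four documented states are real.
SETUP = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    state TEXT NOT NULL CHECK (state IN ('accepted', 'running', 'completed', 'failed'))
);
"""

def queue_depth(conn: sqlite3.Connection) -> int:
    """Active queue depth counts only accepted and running items."""
    (depth,) = conn.execute(
        "SELECT COUNT(*) FROM requests WHERE state IN ('accepted', 'running')"
    ).fetchone()
    return depth

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    conn.executemany(
        "INSERT INTO requests (state) VALUES (?)",
        [("accepted",), ("running",), ("completed",), ("failed",)],
    )
    # Completed and failed rows stay stored for history but are not counted.
    print(queue_depth(conn))  # → 2
```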
## Current-Instance State
The gateway writes `run/current-instance.json` with:

- process id,
- bound host,
- bound port,
- `managed_agent_instance_epoch`,
- optional `managed_agent_instance_id`.
This file tells the runtime which live gateway process published the current listener, and it lets the gateway notice when the upstream managed-agent instance behind the same session changed.
## Request Admission And Serial Execution
The gateway worker loop is intentionally serialized.
- only one queue item can hold the active terminal-mutation slot at a time,
- new requests are first persisted as `accepted`,
- the worker promotes the oldest eligible request to `running`,
- completion updates the record to `completed` or `failed` and appends an event.
```mermaid
sequenceDiagram
    participant CLI as Runtime CLI
    participant GW as Gateway
    participant Q as queue.sqlite
    participant Be as Agent terminal
    CLI->>GW: POST /v1/requests
    GW->>Q: insert accepted record
    GW-->>CLI: accepted response
    GW->>Q: promote next record<br/>to running
    GW->>Be: submit_prompt or interrupt
    alt backend call succeeds
        GW->>Q: mark completed
    else backend call fails
        GW->>Q: mark failed
    end
```
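The promote-execute-record steps can be sketched as a single worker tick. This is an illustrative sketch, not the gateway's code: the `requests` table shape is assumed, `submit` stands in for the backend call (`submit_prompt` or interrupt delivery), and event appending is omitted; only the state transitions follow the documented ones.

```python
import sqlite3

# Assumed table shape for illustration.
SETUP = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    state TEXT NOT NULL,
    payload TEXT NOT NULL
);
"""

def worker_tick(conn: sqlite3.Connection, submit) -> bool:
    """Promote the oldest accepted request and run it to completion.

    Returns False when nothing is eligible. Serialization is implicit:
    one call holds the single terminal-mutation slot at a time.
    """
    row = conn.execute(
        "SELECT id, payload FROM requests "
        "WHERE state = 'accepted' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    req_id, payload = row
    conn.execute("UPDATE requests SET state = 'running' WHERE id = ?", (req_id,))
    try:
        submit(payload)  # stand-in for the backend call
        conn.execute("UPDATE requests SET state = 'completed' WHERE id = ?", (req_id,))
    except Exception:
        conn.execute("UPDATE requests SET state = 'failed' WHERE id = ?", (req_id,))
    conn.commit()
    return True
```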
## Health Versus Upstream Availability
This split is easy to miss the first time you debug the system.
- `GET /health` only asks whether the gateway control plane is alive.
- `GET /v1/status` adds the managed-agent view: connectivity, recovery state, request admission, and surface eligibility.
That means a healthy gateway can still report:
- `managed_agent_connectivity=unavailable`,
- `managed_agent_recovery=awaiting_rebind`,
- `request_admission=blocked_unavailable`.
The gateway is alive; the upstream session it fronts is not currently ready.
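One way to make the split concrete is a readiness predicate over the `/v1/status` payload. A hedged sketch: the field names come from the status examples on this page, but the positive `request_admission` value (`"allowed"` here) and the `upstream_ready` helper are assumptions.

```python
def upstream_ready(status: dict) -> bool:
    """True only when the gateway AND its upstream can take new work.

    A healthy gateway with a degraded upstream returns False, which
    is exactly the case /health alone cannot distinguish.
    """
    return (
        status.get("gateway_health") == "healthy"
        and status.get("managed_agent_connectivity") == "connected"
        and status.get("request_admission") == "allowed"  # positive value assumed
    )
```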
## Epochs, Reconciliation, And Replay Blocking
The gateway increments `managed_agent_instance_epoch` when it sees a different current upstream instance id than the last one it recorded.
Consequences:
- if the upstream instance did not change, the epoch stays stable,
- if the upstream instance changed, the gateway enters reconciliation-oriented status,
- requests accepted for the old epoch are not replayed blindly against the replacement instance.
Representative status after an instance change:
```json
{
  "gateway_health": "healthy",
  "managed_agent_connectivity": "connected",
  "managed_agent_recovery": "reconciliation_required",
  "request_admission": "blocked_reconciliation",
  "managed_agent_instance_epoch": 2
}
```
This is a safety boundary, not just bookkeeping. It prevents the sidecar from silently delivering old queued intent to a new upstream instance whose continuity has not been positively established.
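The epoch rule can be sketched as a pure transition over the recorded instance state. Illustrative only: the dict-shaped state and the `observe_instance` helper are assumptions; the field names and the blocked-reconciliation outcome match the representative status above.

```python
def observe_instance(state: dict, current_id: str) -> dict:
    """Bump the epoch and block admission when the upstream id changes."""
    if state.get("managed_agent_instance_id") == current_id:
        return state  # continuity intact: the epoch stays stable
    # A different upstream instance appeared behind the same session:
    # record it, bump the epoch, and refuse blind replay.
    return {
        "managed_agent_instance_id": current_id,
        "managed_agent_instance_epoch": state.get("managed_agent_instance_epoch", 0) + 1,
        "managed_agent_recovery": "reconciliation_required",
        "request_admission": "blocked_reconciliation",
    }
```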
## Restart Recovery
Gateway restarts do not discard already accepted queued work by default.
Current behavior:
- requests left in `accepted` state remain eligible after restart,
- requests left in `running` state are marked failed on startup because the old process died mid-execution,
- accepted work can be recovered and executed after restart if the upstream instance continuity is still valid,
- accepted work is preserved but not replayed when the new startup detects an epoch change that requires reconciliation.
```mermaid
sequenceDiagram
    participant Old as Old gateway
    participant Q as queue.sqlite
    participant New as New gateway
    participant Up as Upstream
    Old->>Q: leave accepted work<br/>durably stored
    Old-x New: process restart
    New->>Q: fail leftover running work
    New->>Up: inspect current instance id
    alt same instance
        New->>Q: execute accepted work
    else replacement instance
        New->>Q: keep accepted work<br/>blocked by reconciliation
    end
```
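The startup pass over `queue.sqlite` can be sketched as follows. The table shape and the `recover_on_startup` helper are assumptions; the transitions (fail leftover `running` work, preserve `accepted` work, and gate it behind reconciliation when the epoch changed) are the documented ones.

```python
import sqlite3

# Assumed table shape for illustration.
SETUP = """
CREATE TABLE IF NOT EXISTS requests (
    id INTEGER PRIMARY KEY,
    state TEXT NOT NULL
);
"""

def recover_on_startup(conn: sqlite3.Connection, epoch_changed: bool) -> str:
    """One startup pass over the durable queue after a gateway restart."""
    # The old process died mid-execution, so leftover running work
    # cannot be trusted and is marked failed.
    conn.execute("UPDATE requests SET state = 'failed' WHERE state = 'running'")
    conn.commit()
    if epoch_changed:
        # Accepted work is preserved but held behind reconciliation
        # rather than replayed against the replacement instance.
        return "blocked_reconciliation"
    # Continuity is intact: accepted work remains eligible to execute.
    return "ready"
```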
## Current Execution-Adapter Boundary
The live gateway process now selects an execution adapter from manifest-backed authority plus internal bootstrap metadata instead of assuming a single callback path.
- REST-backed adapters cover runtime-owned `cao_rest` and `houmao_server_rest` sessions and use the existing REST callback path for inspection, prompt submission, and interrupt delivery.
- A local tmux-backed adapter covers runtime-owned native headless sessions and runtime-owned `local_interactive` sessions, and resumes that runtime through runtime-owned control.
- A server-managed headless adapter covers native headless sessions whose attach metadata publishes `managed_api_base_url` plus `managed_agent_ref`, and routes prompt or interrupt work back through the managed-agent API rather than bypassing server-owned turn authority.
- Queue durability, reconciliation checks, and gateway-local epoch handling stay the same across those adapters.
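The selection rule can be sketched as a dispatch function. The metadata shape (the `session_type` and `attach` keys, and the `native_headless` type name) and the adapter labels returned here are assumptions; the routing logic follows the bullets above, checking the server-managed attach metadata before falling back to the local tmux path.

```python
def select_adapter(session: dict) -> str:
    """Pick an execution adapter from session metadata (shape assumed)."""
    kind = session.get("session_type")
    attach = session.get("attach", {})
    # Runtime-owned REST sessions use the existing REST callback path.
    if kind in ("cao_rest", "houmao_server_rest"):
        return "rest"
    # Native headless sessions that publish server-managed attach metadata
    # route through the managed-agent API, preserving server-owned turn
    # authority, so this check precedes the local fallback.
    if "managed_api_base_url" in attach and "managed_agent_ref" in attach:
        return "server_managed_headless"
    # Remaining runtime-owned native headless and local interactive
    # sessions are driven through runtime-owned tmux control.
    if kind in ("native_headless", "local_interactive"):
        return "local_tmux"
    raise ValueError(f"no adapter for session type {kind!r}")
```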