Supervisor And Lifecycle
The server-owned tracker runs as one supervisor thread plus one watch worker per live known session. This logic lives in src/houmao/server/tui/supervisor.py.
Runtime Contract
The supervisor depends on a TrackingRuntime protocol implemented by HoumaoServerService.
The important runtime hooks are:
- watch_poll_interval_seconds()
- load_live_known_sessions()
- ensure_known_session(record)
- poll_known_session(tracked_session_id)
- handle_poll_exception(tracked_session_id, exc)
- release_known_session(tracked_session_id)
This keeps the thread orchestration in tui/supervisor.py while the actual registry, tracking, and route authority stay in service.py.
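As a rough illustration, the hooks listed above could be written as a typing.Protocol. The hook names come from this page; every parameter type and return type below is an assumption made for the sketch, not the actual definition in the codebase.

```python
from typing import Any, Mapping, Protocol, runtime_checkable


@runtime_checkable
class TrackingRuntime(Protocol):
    """Sketch of the runtime hooks; all type annotations are assumed."""

    def watch_poll_interval_seconds(self) -> float: ...

    def load_live_known_sessions(self) -> Mapping[str, Any]: ...

    def ensure_known_session(self, record: Any) -> None: ...

    def poll_known_session(self, tracked_session_id: str) -> bool: ...

    def handle_poll_exception(self, tracked_session_id: str, exc: Exception) -> None: ...

    def release_known_session(self, tracked_session_id: str) -> None: ...
```

Because the supervisor only depends on this surface, a small fake that implements the same six methods is enough to exercise the thread orchestration in tests.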
Reconcile Loop
TuiTrackingSupervisor._run() repeatedly calls _reconcile_once() and then waits for either:
- the normal poll interval, or
- an out-of-band wakeup from request_reconcile()
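One common way to get this "interval or wakeup" behavior is a threading.Event with a timed wait. The sketch below assumes that approach; ReconcileLoopSketch and its attributes are hypothetical names, and the real TuiTrackingSupervisor may be structured differently.

```python
import threading
from typing import Callable


class ReconcileLoopSketch:
    """Illustrative stand-in for the supervisor loop, not the real class."""

    def __init__(self, reconcile_once: Callable[[], None], interval_seconds: float) -> None:
        self._reconcile_once = reconcile_once
        self._interval_seconds = interval_seconds
        self._wake = threading.Event()
        self._stop = threading.Event()

    def request_reconcile(self) -> None:
        # Out-of-band wakeup: cut the current wait short.
        self._wake.set()

    def stop(self) -> None:
        self._stop.set()
        self._wake.set()

    def run(self) -> None:
        while not self._stop.is_set():
            self._reconcile_once()
            # Returns when the interval elapses or when the event is set.
            self._wake.wait(timeout=self._interval_seconds)
            self._wake.clear()
```

With this ordering, any wakeup request is always followed by at least one reconcile pass, even if it arrives while a pass is already running.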
The reconcile pass does four things:
- Load the current live known-session map from KnownSessionRegistry.
- Stop workers whose tracked_session_id is no longer present in the live set.
- Stop dead workers for sessions that are still live so they can be recreated on the same pass.
- Ensure tracker state exists for every live record and start workers that are missing.
When a tracked session leaves the live set, the supervisor also calls release_known_session(tracked_session_id). In HoumaoServerService, that maps to _forget_tracker(...), which evicts both the live tracker and any terminal alias bound to it.
That eviction step is what keeps GET /houmao/terminals/{terminal_id}/state from continuing to resolve through stale in-memory residue after tmux or registry authority has gone away.
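The four steps and the release call can be sketched as a single function. This is a simplified model under stated assumptions: FakeWorker is a hypothetical stand-in for a worker handle, the live map is assumed to be keyed by tracked_session_id, and dead workers are assumed to stay in the worker map until the next reconcile pass.

```python
from typing import Any, Dict, Mapping


class FakeWorker:
    """Hypothetical worker handle with the minimum surface the sketch needs."""

    def __init__(self) -> None:
        self.alive = True

    def is_alive(self) -> bool:
        return self.alive

    def stop(self) -> None:
        self.alive = False


def reconcile_once(runtime: Any, workers: Dict[str, FakeWorker]) -> None:
    # Step 1: load the current live known-session map.
    live: Mapping[str, Any] = runtime.load_live_known_sessions()

    for session_id in list(workers):
        if session_id not in live:
            # Step 2: the session left the live set, so stop and release it.
            workers.pop(session_id).stop()
            runtime.release_known_session(session_id)
        elif not workers[session_id].is_alive():
            # Step 3: drop the dead worker so step 4 recreates it.
            workers.pop(session_id).stop()

    for session_id, record in live.items():
        # Step 4: ensure tracker state exists and start any missing worker.
        runtime.ensure_known_session(record)
        if session_id not in workers:
            workers[session_id] = FakeWorker()
```

Handling dead workers before the "ensure" loop is what lets a crashed worker be replaced within the same pass rather than one interval later.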
Worker Loop
SessionWatchWorker owns one background thread for one tracked_session_id.
Its loop is simple:
- Call poll_known_session(tracked_session_id).
- If the runtime says False, exit.
- Otherwise sleep until the next poll interval or until stop is requested.
In the current implementation, poll_known_session() returns False only when the tracked tmux session no longer exists. That means:
- TUI-down sessions remain eligible for future polling
- parse failures remain eligible for future polling
- unsupported-tool sessions remain eligible for future polling
- tmux loss ends the worker, and the next reconcile pass releases the stale in-memory authority if the session is no longer rediscovered
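A minimal sketch of that loop, assuming a threading.Event is used for the interruptible sleep. WatchWorkerSketch and its constructor arguments are hypothetical; only the poll-then-exit-on-False behavior is taken from the description above.

```python
import threading
from typing import Callable, Optional


class WatchWorkerSketch:
    """Illustrative worker: one background thread for one tracked session."""

    def __init__(
        self,
        tracked_session_id: str,
        poll: Callable[[str], bool],
        interval_seconds: float,
    ) -> None:
        self._tracked_session_id = tracked_session_id
        self._poll = poll
        self._interval_seconds = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()

    def join(self, timeout: Optional[float] = None) -> None:
        self._thread.join(timeout)

    def is_alive(self) -> bool:
        return self._thread.is_alive()

    def _run(self) -> None:
        while not self._stop.is_set():
            if not self._poll(self._tracked_session_id):
                # The runtime reported that the tracked session is gone.
                return
            # Sleep until the next poll, or return early if stop is requested.
            self._stop.wait(timeout=self._interval_seconds)
```

Since only a False return ends the loop, transient conditions such as a parse failure leave the thread running and the session is simply polled again on the next interval.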
Alias And Tracker Lifecycle
HoumaoServerService keeps two in-memory maps:
- m_trackers: tracked_session_id -> LiveSessionTracker
- m_terminal_aliases: terminal_id -> tracked_session_id
ensure_known_session(record) is responsible for keeping those maps coherent:
- create a new tracker if the session has not been seen before
- refresh tracker identity if the session already has a tracker
- remove stale aliases that still point at the same tracked session under an older terminal id
- bind the current terminal id alias to the tracked session
release_known_session() and the explicit delete handlers remove those mappings again.
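The alias bookkeeping can be illustrated with two plain dictionaries. The helper names bind_terminal_alias and forget_tracker below are hypothetical, and the tracker values are placeholders; the sketch only models the stale-alias removal and eviction rules described above.

```python
from typing import Any, Dict


def bind_terminal_alias(
    aliases: Dict[str, str], tracked_session_id: str, terminal_id: str
) -> None:
    """Point terminal_id at the session and drop older aliases for it."""
    stale = [
        alias
        for alias, session_id in aliases.items()
        if session_id == tracked_session_id and alias != terminal_id
    ]
    for alias in stale:
        del aliases[alias]
    aliases[terminal_id] = tracked_session_id


def forget_tracker(
    trackers: Dict[str, Any], aliases: Dict[str, str], tracked_session_id: str
) -> None:
    """Evict the tracker and every terminal alias bound to it."""
    trackers.pop(tracked_session_id, None)
    bound = [alias for alias, sid in aliases.items() if sid == tracked_session_id]
    for alias in bound:
        del aliases[alias]
```

For example, rebinding a session from terminal id "t1" to "t2" leaves only the "t2" alias, and forgetting the session afterwards removes both the tracker entry and that alias.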
The explicit delete flows are:
- handle_deleted_terminal(terminal_id)
- handle_deleted_session(session_name)
Both remove the registration directory, forget the in-memory tracker state, and wake the supervisor.
Exception Hardening
The review hardening change made both thread layers resilient to unexpected exceptions.
Worker-side failures
SessionWatchWorker._run() now catches any unexpected exception raised by poll_known_session().
The recovery path is:
- call handle_poll_exception(tracked_session_id, exc)
- if that handler also fails, log the failure defensively
- keep the worker alive for later polling
HoumaoServerService.handle_poll_exception() logs the exception and records an explicit probe/runtime error state into the tracker with:
- transport_state="probe_error"
- process_state="probe_error"
- parse_status="probe_error"
- probe_error.kind="tracking_runtime_error"
This keeps the failure visible in the live-state route instead of silently killing the worker.
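The recovery path amounts to a nested try/except around a single poll. The function below is a sketch of that pattern only; poll_once_guarded and the logger name are hypothetical, and the real SessionWatchWorker._run() may organize this differently.

```python
import logging
from typing import Any

logger = logging.getLogger("tracking_sketch")  # hypothetical logger name


def poll_once_guarded(runtime: Any, tracked_session_id: str) -> bool:
    """Return False only when the runtime itself reports the session is gone."""
    try:
        return runtime.poll_known_session(tracked_session_id)
    except Exception as exc:
        try:
            runtime.handle_poll_exception(tracked_session_id, exc)
        except Exception:
            # The handler failed too: log defensively and never propagate.
            logger.exception(
                "handle_poll_exception failed for %s", tracked_session_id
            )
        # Keep the worker alive so the session is polled again later.
        return True
```

Returning True on failure means an unexpected exception is treated like any other transient condition: the worker keeps polling, and only an explicit False from the runtime ends it.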
Supervisor-side failures
TuiTrackingSupervisor._run() also wraps _reconcile_once() in a catch-all guard. An unexpected reconcile failure is logged and the supervisor continues to the next wait/retry cycle instead of dying permanently.
This is important because reconcile is the mechanism that:
- admits new registrations
- releases dead sessions
- recreates dead workers
Without the guard, one unexpected exception could stall the whole watch plane until the server process restarted.
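The catch-all guard can be sketched as follows. run_supervisor and its parameters are hypothetical names chosen for the example, and the stop event doubles as the interruptible wait; the point is only that a failing pass is logged and the loop proceeds to the next cycle.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("supervisor_sketch")  # hypothetical logger name


def run_supervisor(
    reconcile_once: Callable[[], None],
    stop: threading.Event,
    interval_seconds: float,
) -> None:
    while not stop.is_set():
        try:
            reconcile_once()
        except Exception:
            # Log and continue: one bad pass must not end the supervisor.
            logger.exception("reconcile pass failed; retrying after the next wait")
        stop.wait(timeout=interval_seconds)
```

Catching Exception rather than BaseException still lets interpreter-level signals such as KeyboardInterrupt and SystemExit propagate normally.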