The 25-Hour Outage
Most of the media stack was dead for 25 hours. Nobody noticed. The fix broke more things. Then I rebuilt how Athena thinks about infrastructure.
I came back today to find most of the stack dead. qBittorrent, Jellyfin, Sonarr, Radarr, the VPN tunnel. All down. The only survivors were Prowlarr and FlareSolverr. Everything else had been offline for over 25 hours, and not a single alert reached me.
What Died and Why
Yesterday I consolidated the workspace. Moved the docker-compose.yml into a new directory as part of the Zeus-Config reorganization. Clean move, everything made sense at the time.
Then Zeus rebooted.
Docker’s unless-stopped restart policy is supposed to bring containers back after a reboot. But Compose ties every container to its project context: the directory the compose file lived in when the container was created, stamped onto it as labels. Move that file and Compose can’t match the containers to a project anymore. The containers became orphans. They didn’t restart. They just sat there, stopped, waiting for a compose file that no longer existed at the expected path.
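The broken link is visible on the containers themselves. Compose records the project name and the compose file’s directory as labels at creation time, and docker compose claims containers by matching those labels. A small sketch (the container name in the example is illustrative):

```shell
#!/bin/sh
# Print the compose project and the working directory a container was
# created from. If that directory no longer matches where the compose
# file lives now, `docker compose` in the new location won't claim it.
compose_context() {
  docker inspect --format \
    '{{ index .Config.Labels "com.docker.compose.project" }} {{ index .Config.Labels "com.docker.compose.project.working_dir" }}' \
    "$1"
}

# Pure comparison helper: does the recorded context match the current dir?
context_matches() {
  [ "$1" = "$2" ]
}

# Example (container name is illustrative):
#   compose_context qbittorrent
```

Comparing the recorded working_dir against the new Zeus-Config path would have flagged every orphan in one loop.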
Meanwhile, the Secret Broker had its own problem. It’s the credential proxy that every other service depends on. A log file was owned by ali instead of the secretbroker service user. The broker couldn’t write to its own log. It crash-looped on every boot attempt. By the time I found it, the restart counter had hit 13,560.
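The fix itself is two commands, assuming the broker runs as a systemd unit; the unit name, user, and log path below are placeholders, not the real Zeus layout. The restart counter comes straight out of systemd, which reports it as a NRestarts=N property:

```shell
#!/bin/sh
# Hypothetical recovery for the crash loop. Unit name, service user, and
# log path are illustrative placeholders.
fix_broker_log_ownership() {
  chown secretbroker:secretbroker /var/log/secretbroker/broker.log
  systemctl restart secretbroker.service
}

# systemd exposes the restart counter:
#   systemctl show -p NRestarts secretbroker.service
# which prints a line like "NRestarts=13560".
parse_nrestarts() {
  # expects a "NRestarts=N" line on stdin, emits just the number
  cut -d= -f2
}
```

A single ownership check in the broker’s own health probe would have caught this on the first loop instead of the 13,560th.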
The Cascade
The Secret Broker isn’t just another service. It’s the authentication layer. Every service that needs credentials goes through it. When the broker dies, nothing downstream can authenticate.
Overwatch, the system guardian, couldn’t fetch qBittorrent credentials from the broker. The torrent enforcer, the component that verifies downloads are running through the VPN, couldn’t function. No credential proxying meant no health checks, no enforcement, no monitoring. The monitoring system was the first casualty of the outage it was supposed to detect.
Twenty-five hours. Zero alerts.
The Messy Recovery
Athena started bringing containers back. Got the VPN tunnel up, then qBittorrent, then the arr stack. Should’ve been straightforward.
It wasn’t.
qBittorrent had a password issue after coming back. Athena’s fix was to restart the container. Repeatedly. Each restart triggered Overwatch alerts. The same monitoring that hadn’t noticed 25 hours of downtime was now flooding my Telegram with notifications about every container bounce.
When restarts didn’t fix the password, Athena nuked the qBittorrent config file to force a reset. This worked for the password. It also wiped every ratio target, seeding rule, and upload limit I’d configured. All gone.
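The non-destructive version of that reset exists: the WebUI password hash lives on a single line of qBittorrent.conf, so deleting just that line forces a password reset while leaving every other setting intact. A sketch, using a throwaway fixture file so the effect is visible; the real config path depends on how the container mounts its volume:

```shell
#!/bin/sh
# Remove only the WebUI password hash from qBittorrent.conf, leaving ratio
# targets, seeding rules, and upload limits untouched.
reset_webui_password() {
  CONF="$1"
  cp "$CONF" "$CONF.bak"                       # keep a backup first
  sed -i '/^WebUI\\Password_PBKDF2=/d' "$CONF" # drop only the hash line
}

# Demo on a throwaway fixture:
cat > /tmp/qBittorrent.conf <<'EOF'
[Preferences]
WebUI\Password_PBKDF2="@ByteArray(abc123)"
WebUI\Port=8080
EOF
reset_webui_password /tmp/qBittorrent.conf
grep -c '^WebUI' /tmp/qBittorrent.conf   # → 1 (the port line survives)
```

This is exactly the procedure the qBittorrent skill now documents instead of the nuke-and-reconfigure path.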
Then we discovered Sonarr and Radarr had lost their webhook configurations. Download completion notifications weren’t firing. Jellyfin library scans weren’t triggering after imports. Took multiple rounds of debugging to figure out why new downloads weren’t appearing. The webhooks that connect the whole pipeline had been silently dropped.
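Lost webhooks are cheap to detect: Sonarr and Radarr both expose their configured notifications at /api/v3/notification, so one authenticated GET shows whether the connections are still there. A sketch; host, port, and API key are placeholders:

```shell
#!/bin/sh
# List configured connect/webhook notifications so a wiped config is
# visible at a glance. Host, port, and API key are placeholders.
list_notifications() {
  # $1 = base URL, e.g. http://localhost:8989   $2 = API key
  curl -fsS -H "X-Api-Key: $2" "$1/api/v3/notification"
}

# Pure helper: an empty JSON array means every webhook is gone.
webhooks_missing() {
  [ "$(printf '%s' "$1" | tr -d '[:space:]')" = "[]" ]
}

# Example:
#   webhooks_missing "$(list_notifications http://localhost:8989 "$SONARR_API_KEY")"
```

Running that check for both apps would have turned “multiple rounds of debugging” into one API call per service.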
Each fix revealed the next broken thing. Each broken thing required another round of investigation. The whole recovery took hours.
The Real Problem
While all of this was happening, I was waiting. Every message I sent to Athena came back minutes later. She was executing recovery commands inline, sequentially chaining 10+ tool calls per response, blocking on each one. Check this container. Restart that service. Curl this API. Wait for the response. Parse it. Try the next thing.
One generalist agent, doing everything serially, with no specialization and no ability to parallelize. She was figuring out each service’s API by trial and error, reading documentation mid-recovery, guessing at endpoints. Every mistake added another round trip.
An AI assistant that blocks on sequential tool calls while the user waits is just a slow script with better grammar.
The Architectural Response
The outage exposed a design flaw I’d been ignoring: Athena had no operational knowledge. Every time something broke, she was starting from scratch. Discovering API endpoints, guessing authentication flows, reading container logs to figure out what a service even does.
So I built dedicated specialist skills for every layer of the stack.
The VPN skill covers the WireGuard tunnel, kill switch verification, DNS resolution checks, and container network dependencies. 13KB of operational knowledge.
The qBittorrent skill documents the full Web API: authentication, torrent management, transfer settings, the password reset procedure that doesn’t destroy your config. 23KB.
The Arr Stack skill covers Sonarr, Radarr, and Prowlarr. Quality profiles, webhook configuration, indexer management, the download pipeline from search to import. 37KB.
The Secret Broker skill maps the credential proxy architecture: HMAC authentication, service tiers, health checks, the restart procedure that actually fixes crash loops. 15KB.
The Jellyfin skill handles library management, scan triggers, media search, and the API authentication flow. 14KB.
The Overwatch skill documents the monitoring system itself. Alert configuration, service health checks, webhook setup, and how to avoid the alert flood problem during recovery. 13KB.
Then I built a master orchestrator skill that sits on top of all of them. It maps the full dependency chain: VPN must be up before qBittorrent. Broker must be up before anything that needs credentials. Arr stack depends on both qBittorrent and the indexers. Jellyfin depends on the import pipeline. The orchestrator knows which specialist to spawn for each layer, and it can run them in parallel when there are no dependencies between them.
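The orchestrator’s dependency walk boils down to layers: everything in a layer starts in parallel, and each layer blocks until the one before it is healthy. A minimal sketch; start_service is a stub standing in for spawning the real specialist skill, and the layer grouping mirrors the chain above:

```shell
#!/bin/sh
# Layered bring-up: parallel within a layer, sequential between layers.
start_service() {
  echo "starting $1"   # placeholder for the real bring-up + health check
}

run_layer() {
  for svc in "$@"; do
    start_service "$svc" &   # no dependencies inside a layer: parallelize
  done
  wait                       # block until the whole layer is up
}

bring_up_stack() {
  run_layer vpn secretbroker         # independent of each other
  run_layer qbittorrent prowlarr     # need VPN and credentials
  run_layer sonarr radarr            # need the client and the indexers
  run_layer jellyfin                 # needs the import pipeline
  echo "stack up"
}
```

Within a layer the start order is nondeterministic, which is the point: the only ordering the orchestrator enforces is the one the dependencies actually require.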
155KB of operational knowledge, codified. The agent that was fumbling through API calls by trial and error now has a manual for every service it manages.
The Lesson
Monitoring that doesn’t alert is decoration. A system guardian that can’t reach the credential broker can’t check the services behind it. The broker was a single point of failure not just for authentication, but for observability. When it went down, the whole stack went blind.
The 25-hour gap between failure and detection is the number that matters. Not the restart count, not the cascade depth, not the recovery time. Twenty-five hours of silence from a system designed to watch for exactly this kind of failure.
The file ownership bug that killed the broker was one chown command to fix. The compose context issue was one docker compose up -d in the right directory. Trivial problems, both of them. The damage came entirely from not knowing they’d happened.
Today I fixed the stack, then I fixed how Athena thinks about the stack. Tomorrow we’ll find out if it holds.