March 15, 2026  ·  5 min read

How I Caused 10 Hours of AI Agent Downtime (and Built a Fix)

ai agents devops open source

I changed one line in my AI agent's config file and took it down for 10 hours while I was asleep.

The change looked harmless: updating the model from claude-opus-4-5 to claude-opus-4-6. What I didn't check was whether the installed version of the agent software actually supported the new model name. It didn't. Every request failed, and nothing alerted me. I woke up to 10 hours of missed messages and a log file full of errors.

This shouldn't be a hard problem. Servers have had health checks and auto-rollback for decades. But AI agent frameworks don't ship with this by default. So I built it.

What model-watchdog does

It's a single Python file with zero dependencies beyond the standard library. It does three things:

  1. Probes your agent's health endpoint every N seconds
  2. When K failures occur within M minutes, rolls back the config to the last known good backup and restarts the service
  3. When the agent is healthy after a config change, updates the "good backup" automatically
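The "K failures within M minutes" rule in step 2 can be sketched as a sliding window over failure timestamps. This is a minimal illustration rather than code lifted from watchdog.py; the class name FailureWindow and its methods are made up for the example, and the defaults mirror the thresholds shown in the config later in this post.

```python
from collections import deque


class FailureWindow:
    """Tracks probe failures inside a sliding time window (illustrative)."""

    def __init__(self, max_failures=3, window_sec=180):
        self.max_failures = max_failures
        self.window_sec = window_sec
        self._failures = deque()  # timestamps of recent failed probes

    def record(self, healthy, now):
        """Record one probe result; return True when a rollback is due."""
        if not healthy:
            self._failures.append(now)
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_sec:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def reset(self):
        """Clear the history, e.g. once a rollback has been performed."""
        self._failures.clear()
```

Passing the timestamp in (instead of calling time.time() inside) keeps the logic easy to test: three failures at t=0, 30, and 60 trip the threshold, while failures spread further apart than the window never do.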

It also sends alerts via Telegram, Slack, Discord, or any HTTP webhook when it takes action.
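For a sense of what the alert step involves, here is a rough standard-library sketch. It is not the tool's actual alert code: the function names and the "kind" labels are hypothetical, and the generic webhook payload shape is an assumption. The Telegram, Slack, and Discord payload fields follow those services' public webhook formats.

```python
import json
import urllib.request


def build_alert_request(kind, target, message, chat_id=None):
    """Return an unsent urllib Request for one alert channel.

    kind: "telegram", "slack", "discord", or anything else for a generic
    webhook (labels are illustrative).
    target: the bot token for Telegram, otherwise the webhook URL.
    """
    if kind == "telegram":
        url = f"https://api.telegram.org/bot{target}/sendMessage"
        payload = {"chat_id": chat_id, "text": message}
    elif kind == "slack":
        url, payload = target, {"text": message}
    elif kind == "discord":
        url, payload = target, {"content": message}
    else:
        url, payload = target, {"message": message}  # generic shape, assumed
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request, timeout_sec=5):
    """Send the alert; a failed alert should never crash the watchdog."""
    try:
        with urllib.request.urlopen(request, timeout=timeout_sec) as resp:
            return 200 <= resp.status < 300
    except OSError:  # includes URLError and HTTPError
        return False
```

Splitting "build" from "send" means the payloads can be checked without touching the network.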

The pattern it solves

Config change → agent fails → watchdog detects → rollback → restart → alert → agent back online
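The rollback → restart → alert part of that chain might look roughly like the sketch below. It assumes the rollback and alert steps are passed in as plain callables; run_recovery and its parameters are hypothetical names, not the tool's real internals. The run and sleep arguments exist only so the flow can be exercised without actually restarting anything.

```python
import shlex
import subprocess
import time


def run_recovery(rollback, restart_cmd, alert, restart_wait_sec=10,
                 run=subprocess.run, sleep=time.sleep):
    """Roll back, restart, then alert. Returns True if the restart succeeded."""
    if not rollback():
        # Nothing to restore: tell a human instead of restarting blindly.
        alert("agent is failing and no known-good config exists")
        return False
    # shlex.split avoids shell=True while still accepting a one-line command.
    result = run(shlex.split(restart_cmd), capture_output=True, text=True)
    sleep(restart_wait_sec)  # give the service time to come back up
    ok = result.returncode == 0
    if ok:
        alert("rolled back config and restarted agent")
    else:
        alert(f"rollback done but restart failed: {result.stderr.strip()}")
    return ok
```

Alerting last means the message can say whether the restart command itself succeeded, which is the first thing you want to know when you read it.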

The key insight is to save the backup after you've confirmed the new config works, not before. Most backup tools snapshot before the change. model-watchdog saves the "good" state only while the agent is healthy: if you change the config and it passes 3 probes, that config becomes the new baseline. If it fails immediately, the watchdog rolls back to the previous good state.
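To make the "promote only after it's proven healthy" idea concrete, here is a minimal sketch. It assumes the config is a single file and that three consecutive healthy probes are enough to trust it; GoodConfigKeeper and healthy_needed are illustrative names, not the tool's actual API.

```python
import shutil
from pathlib import Path


class GoodConfigKeeper:
    """Keeps a 'last known good' copy of a config file (illustrative)."""

    def __init__(self, config_path, backup_path, healthy_needed=3):
        self.config_path = Path(config_path).expanduser()
        self.backup_path = Path(backup_path).expanduser()
        self.healthy_needed = healthy_needed
        self._seen = None          # config contents the streak applies to
        self._healthy_streak = 0

    def on_probe(self, healthy):
        """Return True when the live config was promoted to 'good'."""
        current = self.config_path.read_bytes()
        if not healthy or current != self._seen:
            # A failure or an edited config restarts the count.
            self._seen = current
            self._healthy_streak = 0
        if healthy:
            self._healthy_streak += 1
        if self._healthy_streak < self.healthy_needed:
            return False
        if self.backup_path.exists() and self.backup_path.read_bytes() == current:
            return False  # backup already matches the live config
        shutil.copy2(self.config_path, self.backup_path)
        return True

    def rollback(self):
        """Restore the last known-good config; False if there is none."""
        if not self.backup_path.exists():
            return False
        shutil.copy2(self.backup_path, self.config_path)
        self._healthy_streak = 0
        return True
```

Resetting the streak whenever the file contents change matters: otherwise a long healthy run under the old config would let a freshly edited config get promoted after a single lucky probe.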

Usage

# Install: just download the file
curl -O https://raw.githubusercontent.com/feralghost/model-watchdog/master/watchdog.py

# Run with defaults (probes http://localhost:18789/health)
python3 watchdog.py

# Custom config (JSON works out of the box; YAML needs the optional pyyaml package)
python3 watchdog.py --config watchdog.yaml

# One-shot health check (for CI scripts)
python3 watchdog.py --check-once
echo $?  # 0 = healthy, 1 = down
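If you're curious what a one-shot health check boils down to, it can be approximated in a few lines of standard-library Python. This is a sketch of the general technique, not the contents of watchdog.py; the function name probe and its parameters are chosen to echo the config keys shown later in this post.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout_sec=5, expected_status=200):
    """Return True if the endpoint answers with the expected HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_sec) as resp:
            return resp.status == expected_status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; still compare against the expectation.
        return err.code == expected_status
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, DNS failures, malformed responses.
        return False
```

Wrapping the call as sys.exit(0 if probe(url) else 1) would give the same 0/1 exit-code contract as the command above.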

Generate a sample config:

python3 watchdog.py --dump-config > watchdog.yaml

The key config options:

{
  "probe": {
    "url": "http://localhost:18789/health",
    "timeout_sec": 5,
    "expected_status": 200
  },
  "thresholds": {
    "failures": 3,
    "window_sec": 180,
    "probe_interval_sec": 30
  },
  "rollback": {
    "config_path": "~/.openclaw/openclaw.json",
    "backup_path": "~/.openclaw/openclaw.json.watchdog-good",
    "restart_cmd": "systemctl --user restart openclaw-gateway",
    "restart_wait_sec": 10
  },
  "alerts": {
    "telegram_bot_token": "...",
    "telegram_chat_id": "..."
  }
}
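One thing worth checking when you tune the thresholds: K failures need at least (K - 1) probe intervals to accumulate, so with the values above the earliest possible rollback is (3 - 1) × 30 = 60 seconds after the first failure, comfortably inside the 180-second window. The sketch below shows one way a loader could layer a user's JSON file over defaults and reject combinations that can never trigger. It is an illustration only; load_config, deep_merge, and DEFAULTS are hypothetical names, not the tool's real loader.

```python
import copy
import json
from pathlib import Path

DEFAULTS = {
    "probe": {"url": "http://localhost:18789/health",
              "timeout_sec": 5, "expected_status": 200},
    "thresholds": {"failures": 3, "window_sec": 180, "probe_interval_sec": 30},
}


def deep_merge(base, override):
    """Return a copy of base with override's values layered on top."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(path=None):
    """Load a JSON config over the defaults and sanity-check thresholds."""
    cfg = copy.deepcopy(DEFAULTS)
    if path is not None:
        cfg = deep_merge(cfg, json.loads(Path(path).expanduser().read_text()))
    t = cfg["thresholds"]
    # K failures need at least (K - 1) probe intervals to accumulate.
    min_detect_sec = (t["failures"] - 1) * t["probe_interval_sec"]
    if min_detect_sec > t["window_sec"]:
        raise ValueError("window_sec is too short for the threshold to trigger")
    # Expand "~" so later file operations get real paths.
    for key in ("config_path", "backup_path"):
        if key in cfg.get("rollback", {}):
            cfg["rollback"][key] = str(Path(cfg["rollback"][key]).expanduser())
    return cfg
```

With this shape, a user file only has to mention the keys it changes; everything else falls back to the defaults.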

Running as a service

# The user unit directory may not exist yet on a fresh install
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/model-watchdog.service << 'EOF'
[Unit]
Description=model-watchdog AI agent health monitor
After=network.target

[Service]
ExecStart=/usr/bin/python3 /path/to/watchdog.py
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
EOF

systemctl --user enable --now model-watchdog

# Optional: keep user services running while you are logged out
loginctl enable-linger "$USER"

Works with anything

The default config targets OpenClaw's health endpoint, but it works with any agent that has an HTTP health check, a config file, and a restart command. Point the probe URL, config path, and restart command at your own setup and it covers LangChain servers, custom OpenAI wrappers, local Ollama setups, whatever.

Why no dependencies

Agents running 24/7 on minimal VPS installs shouldn't need a pip install to stay alive. The whole tool is standard library Python. Optional: pip install pyyaml if you want YAML config instead of JSON. That's it.

The repo is at github.com/feralghost/model-watchdog. Single file, MIT license.

If you're running an autonomous agent 24/7 and you don't have something like this, you're one config change away from finding out the hard way.