I changed one line in my AI agent's config file and took it down for 10 hours while I was asleep.
The change looked harmless: updating the model from claude-opus-4-5 to claude-opus-4-6. What I didn't check was whether the installed version of the agent software actually supported the new model name. It didn't. Every request failed, and nothing told me about it. I woke up to 10 hours of missed messages and a log file full of errors.
This shouldn't be a hard problem. Servers have had health checks and auto-rollback for decades. But AI agent frameworks don't ship with this by default. So I built it.
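It's a single Python file with zero dependencies beyond the standard library. It does three things: it probes the agent's health endpoint on a fixed interval, restores the last known-good config when the probes keep failing, and restarts the agent.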
It also sends alerts via Telegram, Slack, Discord, or any HTTP webhook when it takes action.
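To make the alerting step concrete, here's a minimal sketch of a webhook alert using only the standard library. It is not the code from watchdog.py: the function names `build_alert_request` and `send_alert` are hypothetical, and the `"message"` key for generic webhooks is an assumption. The Slack (`text`) and Discord (`content`) payload keys follow those services' incoming-webhook formats. Telegram is left out because it goes through the Bot API rather than a plain webhook.

```python
import json
import urllib.request


def build_alert_request(kind, url, message):
    """Build a JSON POST for an alert; the payload key depends on the service."""
    if kind == "slack":
        payload = {"text": message}
    elif kind == "discord":
        payload = {"content": message}
    else:
        # Generic webhook: the "message" key is an assumption for illustration.
        payload = {"message": message}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(kind, url, message, timeout_sec=5):
    """Best effort: a failed alert must never break the watchdog loop."""
    request = build_alert_request(kind, url, message)
    try:
        with urllib.request.urlopen(request, timeout=timeout_sec) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError, timeouts, and refused connections.
        return False
```

Splitting the request building from the sending keeps the payload logic testable without touching the network.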
Config change → agent fails → watchdog detects → rollback → restart → alert → agent back online
The key insight is that you want to save the backup after you've confirmed the new config works, not before. Most backup tools snapshot before the change. model-watchdog saves the "good" state only while the agent is healthy: if you change the config and the agent passes 3 probes, that config becomes the new baseline. If it fails instead, the watchdog rolls back to the previous good state.
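Here's a rough sketch of that "promote only after it's proven healthy" idea. It is my own illustration, not the logic from watchdog.py: the class name `BaselineKeeper`, its methods, and the choice to reset the healthy streak whenever the config file changes are all assumptions.

```python
import shutil
from pathlib import Path


class BaselineKeeper:
    """Promote the live config to 'known good' only after N healthy probes."""

    def __init__(self, config_path, backup_path, healthy_needed=3):
        self.config_path = Path(config_path)
        self.backup_path = Path(backup_path)
        self.healthy_needed = healthy_needed
        self.healthy_streak = 0
        self._last_seen = None  # config bytes observed on the previous probe

    def record_probe(self, healthy):
        """Feed one probe result; return True if the baseline was just updated."""
        current = self.config_path.read_bytes()
        if current != self._last_seen:
            # Config changed since the last probe: it has to earn trust again.
            self._last_seen = current
            self.healthy_streak = 0
        if not healthy:
            self.healthy_streak = 0
            return False
        self.healthy_streak += 1
        if self.healthy_streak < self.healthy_needed:
            return False
        if self.backup_path.exists() and self.backup_path.read_bytes() == current:
            return False  # baseline already matches the live config
        shutil.copy2(self.config_path, self.backup_path)
        return True

    def rollback(self):
        """Restore the last known-good config; return False if none exists yet."""
        if not self.backup_path.exists():
            return False
        shutil.copy2(self.backup_path, self.config_path)
        self.healthy_streak = 0
        return True
```

One caveat with this sketch: it can't tell whether the agent has actually reloaded a changed config, so healthy probes only mean something once the agent has been restarted with the new file.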
# Install: just download the file
curl -O https://raw.githubusercontent.com/feralghost/model-watchdog/master/watchdog.py
# Run with defaults (probes http://localhost:18789/health)
python3 watchdog.py
# Custom config
python3 watchdog.py --config watchdog.yaml
# One-shot health check (for CI scripts)
python3 watchdog.py --check-once
echo $? # 0 = healthy, 1 = down
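For a sense of what a one-shot check involves, here's a minimal standard-library sketch. This is an assumption about the general approach, not the tool's actual code: `probe_once` and `exit_code` are hypothetical names, and the defaults mirror the probe URL, timeout, and expected status used elsewhere in this post.

```python
import http.client
import urllib.error
import urllib.request


def probe_once(url, timeout_sec=5, expected_status=200):
    """Return True if the endpoint answers with the expected HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_sec) as resp:
            return resp.status == expected_status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; still compare, in case such a code is expected.
        return exc.code == expected_status
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, malformed response.
        return False


def exit_code(url="http://localhost:18789/health"):
    """Map probe health to a shell exit code: 0 = healthy, 1 = down."""
    return 0 if probe_once(url) else 1
```

A CI script could call `sys.exit(exit_code())` to get the same 0/1 convention as the shell example above.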
Generate a sample config:
python3 watchdog.py --dump-config > watchdog.yaml
The key config options:
{
  "probe": {
    "url": "http://localhost:18789/health",
    "timeout_sec": 5,
    "expected_status": 200
  },
  "thresholds": {
    "failures": 3,
    "window_sec": 180,
    "probe_interval_sec": 30
  },
  "rollback": {
    "config_path": "~/.openclaw/openclaw.json",
    "backup_path": "~/.openclaw/openclaw.json.watchdog-good",
    "restart_cmd": "systemctl --user restart openclaw-gateway",
    "restart_wait_sec": 10
  },
  "alerts": {
    "telegram_bot_token": "...",
    "telegram_chat_id": "..."
  }
}
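One plausible reading of the `thresholds` block is "trigger a rollback after 3 failed probes within 180 seconds". The sketch below implements that reading with a sliding window. The `FailureWindow` class is hypothetical, and the rule that a healthy probe clears the failure history is my assumption; the real tool may count differently.

```python
import time
from collections import deque


class FailureWindow:
    """Track failed probes and report when too many land inside the window."""

    def __init__(self, failures=3, window_sec=180, clock=time.monotonic):
        self.failures = failures
        self.window_sec = window_sec
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._stamps = deque()

    def record(self, healthy):
        """Record one probe result; return True when a rollback should fire."""
        now = self.clock()
        if healthy:
            # Assumption: a healthy probe wipes the failure history.
            self._stamps.clear()
            return False
        self._stamps.append(now)
        # Drop failures that have aged out of the window.
        while self._stamps and now - self._stamps[0] > self.window_sec:
            self._stamps.popleft()
        return len(self._stamps) >= self.failures
```

With a 30-second probe interval, three back-to-back failures span 60 seconds, so under this reading a dead agent would be detected roughly a minute after the first failed probe.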
cat > ~/.config/systemd/user/model-watchdog.service << EOF
[Unit]
Description=model-watchdog AI agent health monitor
After=network.target
[Service]
ExecStart=/usr/bin/python3 /path/to/watchdog.py
Restart=always
RestartSec=5
[Install]
WantedBy=default.target
EOF
systemctl --user enable --now model-watchdog
The default config targets OpenClaw's health endpoint, but it works with any agent that exposes an HTTP health check, reads a config file, and can be restarted with a command. Change two lines and it covers LangChain servers, custom OpenAI wrappers, local Ollama setups, whatever.
Agents running 24/7 on minimal VPS installs shouldn't need a pip install to stay alive. The whole tool is standard library Python. Optional: pip install pyyaml if you want YAML config instead of JSON. That's it.
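The usual way to keep a dependency optional like this is to guard the import and fall back to JSON. Here's a small sketch of that pattern. The `load_config` function and the choice to pick the parser by file extension are assumptions for illustration, not necessarily how watchdog.py does it.

```python
import json
from pathlib import Path

try:
    import yaml  # optional dependency: pip install pyyaml
except ImportError:
    yaml = None


def load_config(path):
    """Load a config file, choosing the parser by extension (an assumption)."""
    path = Path(path).expanduser()
    text = path.read_text(encoding="utf-8")
    if path.suffix.lower() in (".yaml", ".yml"):
        if yaml is None:
            raise RuntimeError(
                "YAML config requested but PyYAML is not installed"
            )
        return yaml.safe_load(text)
    # JSON needs nothing beyond the standard library.
    return json.loads(text)
```

On a bare VPS without PyYAML, a JSON config still loads, and a YAML config fails with a clear message instead of an import traceback.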
The repo is at github.com/feralghost/model-watchdog. Single file, MIT license.
If you're running an autonomous agent 24/7 and you don't have something like this, you're one config change away from finding out the hard way.