Detecting Unresponsive Components with Watchdogs¶
How to detect when a component stops responding and take recovery action.
Overview¶
A watchdog monitors that an expected event keeps happening. If the event doesn't occur within a specified timeout, the watchdog fires a reaction so you can take corrective action.
sequenceDiagram
participant S as Monitored System
participant W as Watchdog Timer
participant R as Recovery Reaction
S->>W: emit<Scope::WATCHDOG>(service)
Note over W: Timer reset
S->>W: emit<Scope::WATCHDOG>(service)
Note over W: Timer reset
Note over S: System stalls...
Note over W: Timeout expires!
W->>R: Watchdog reaction fires
R->>S: Take recovery action
1. Define a Watchdog Group¶
A watchdog group is any type that identifies what you're monitoring. It's never instantiated — it's just a tag:
2. Set Up the Watchdog Reaction¶
Use on<Watchdog<Group, ticks, period>>() to define what happens when the timeout expires:
on<Watchdog<HeartbeatMonitor, 5, std::chrono::seconds>>().then([this] {
log<WARN>("Heartbeat lost! Attempting recovery...");
emit(std::make_unique<RecoveryCommand>());
});
This fires if no service signal is received within 5 seconds.
3. Service the Watchdog¶
Each time the monitored activity occurs, reset the timer:
Every emit<Scope::WATCHDOG> call resets the countdown.
As long as services arrive faster than the timeout, the watchdog never fires.
4. Complete Example¶
#include "nuclear"
// Tag types
struct HeartbeatMonitor {};
struct RecoveryCommand {};
class SensorWatchdog : public NUClear::Reactor {
public:
SensorWatchdog(std::unique_ptr<NUClear::Environment> environment)
: Reactor(std::move(environment)) {
// Fire if no heartbeat for 3 seconds
on<Watchdog<HeartbeatMonitor, 3, std::chrono::seconds>>().then([this] {
log<WARN>("Sensor heartbeat lost!");
emit(std::make_unique<RecoveryCommand>());
});
// When we receive sensor data, service the watchdog
on<Trigger<SensorData>>().then([this](const SensorData& data) {
// Process the data...
process(data);
// Reset the watchdog timer
emit<Scope::WATCHDOG>(ServiceWatchdog<HeartbeatMonitor>());
});
// Recovery logic
on<Trigger<RecoveryCommand>>().then([this] {
log("Restarting sensor connection...");
reconnect_sensor();
});
}
};
Multiple Watchdogs with Runtime Keys¶
You can monitor multiple instances of the same group using a runtime argument. Each unique key gets its own independent timer:
// Monitor each motor independently
on<Watchdog<MotorMonitor, 500, std::chrono::milliseconds>>(motor_id).then([this, motor_id] {
log<WARN>("Motor", motor_id, "stopped responding");
});
// Service a specific motor's watchdog
emit<Scope::WATCHDOG>(ServiceWatchdog<MotorMonitor>(motor_id));
Timing Diagram¶
gantt
title Watchdog Lifecycle (3-second timeout)
dateFormat X
axisFormat %s
section Activity
Service received :done, 0, 1
Service received :done, 1, 2
Service received :done, 2, 3
No activity :crit, 3, 6
section Timer State
Reset (3s remaining) :active, 0, 1
Reset (3s remaining) :active, 1, 2
Reset (3s remaining) :active, 2, 3
Counting down :crit, 3, 6
section Events
Timeout fires! :milestone, 6, 6
Parameters¶
| Parameter | Description |
|---|---|
WatchdogGroup |
A type tag identifying what is being monitored |
ticks |
Number of time units before the watchdog fires |
period |
A std::chrono::duration type defining the tick length |
Common period types: std::chrono::milliseconds, std::chrono::seconds, std::chrono::minutes
Tips¶
- The watchdog starts timing from the moment
bindis called (when theon<>statement runs). Service it early if you need a grace period at startup. - If the watchdog fires, the timer resets automatically — it will fire again after another timeout unless serviced.
- Use specific group types to avoid accidentally servicing the wrong watchdog.
- If a reactor only needs a single watchdog, you can use the reactor type itself as the group instead of creating a separate tag type: