Skip to content

Detecting Unresponsive Components with Watchdogs

How to detect when a component stops responding and take recovery action.

Overview

A watchdog monitors that an expected event keeps happening. If the event doesn't occur within a specified timeout, the watchdog fires a reaction so you can take corrective action.

sequenceDiagram
    participant S as Monitored System
    participant W as Watchdog Timer
    participant R as Recovery Reaction

    S->>W: emit<Scope::WATCHDOG>(service)
    Note over W: Timer reset
    S->>W: emit<Scope::WATCHDOG>(service)
    Note over W: Timer reset
    Note over S: System stalls...
    Note over W: Timeout expires!
    W->>R: Watchdog reaction fires
    R->>S: Take recovery action

1. Define a Watchdog Group

A watchdog group is any type that identifies what you're monitoring. It's never instantiated — it's just a tag:

struct HeartbeatMonitor {};

2. Set Up the Watchdog Reaction

Use on<Watchdog<Group, ticks, period>>() to define what happens when the timeout expires:

on<Watchdog<HeartbeatMonitor, 5, std::chrono::seconds>>().then([this] {
    log<WARN>("Heartbeat lost! Attempting recovery...");
    emit(std::make_unique<RecoveryCommand>());
});

This fires if no service signal is received within 5 seconds.

3. Service the Watchdog

Each time the monitored activity occurs, reset the timer:

emit<Scope::WATCHDOG>(ServiceWatchdog<HeartbeatMonitor>());

Every emit<Scope::WATCHDOG> call resets the countdown. As long as services arrive faster than the timeout, the watchdog never fires.

4. Complete Example

#include "nuclear"

// Tag types
struct HeartbeatMonitor {};
struct RecoveryCommand {};

class SensorWatchdog : public NUClear::Reactor {
public:
    SensorWatchdog(std::unique_ptr<NUClear::Environment> environment)
        : Reactor(std::move(environment)) {

        // Fire if no heartbeat for 3 seconds
        on<Watchdog<HeartbeatMonitor, 3, std::chrono::seconds>>().then([this] {
            log<WARN>("Sensor heartbeat lost!");
            emit(std::make_unique<RecoveryCommand>());
        });

        // When we receive sensor data, service the watchdog
        on<Trigger<SensorData>>().then([this](const SensorData& data) {
            // Process the data...
            process(data);

            // Reset the watchdog timer
            emit<Scope::WATCHDOG>(ServiceWatchdog<HeartbeatMonitor>());
        });

        // Recovery logic
        on<Trigger<RecoveryCommand>>().then([this] {
            log("Restarting sensor connection...");
            reconnect_sensor();
        });
    }
};

Multiple Watchdogs with Runtime Keys

You can monitor multiple instances of the same group using a runtime argument. Each unique key gets its own independent timer:

// Monitor each motor independently
on<Watchdog<MotorMonitor, 500, std::chrono::milliseconds>>(motor_id).then([this, motor_id] {
    log<WARN>("Motor", motor_id, "stopped responding");
});

// Service a specific motor's watchdog
emit<Scope::WATCHDOG>(ServiceWatchdog<MotorMonitor>(motor_id));

Timing Diagram

gantt
    title Watchdog Lifecycle (3-second timeout)
    dateFormat X
    axisFormat %s

    section Activity
    Service received     :done, 0, 1
    Service received     :done, 1, 2
    Service received     :done, 2, 3
    No activity          :crit, 3, 6

    section Timer State
    Reset (3s remaining) :active, 0, 1
    Reset (3s remaining) :active, 1, 2
    Reset (3s remaining) :active, 2, 3
    Counting down        :crit, 3, 6

    section Events
    Timeout fires!       :milestone, 6, 6

Parameters

Parameter Description
WatchdogGroup A type tag identifying what is being monitored
ticks Number of time units before the watchdog fires
period A std::chrono::duration type defining the tick length

Common period types: std::chrono::milliseconds, std::chrono::seconds, std::chrono::minutes

Tips

  • The watchdog starts timing from the moment bind is called (when the on<> statement runs). Service it early if you need a grace period at startup.
  • If the watchdog fires, the timer resets automatically — it will fire again after another timeout unless serviced.
  • Use specific group types to avoid accidentally servicing the wrong watchdog.
  • If a reactor only needs a single watchdog, you can use the reactor type itself as the group instead of creating a separate tag type:
    class SensorReader : public NUClear::Reactor {
        // ...
        on<Watchdog<SensorReader, 5, std::chrono::seconds>>().then([this] {
            log<WARN>("Sensor timeout!");
        });
        // Service with: emit<Scope::WATCHDOG>(ServiceWatchdog<SensorReader>());
    };