
Healthcheck and monitoring

The ping command answers in milliseconds. Wire it into your monitoring agent, alert on missing responses, surface session counts. Three signals cover most of what an OPC UA daemon owes its operators.

The daemon does not expose a metrics endpoint. What it offers is the ping IPC command: cheap, fast, and answered without a token even when authentication is enforced. Two of the three signals below derive from it and the third from a real read(); everything else is operating-system plumbing.

The signals

| Signal | Source | What it tells you |
|---|---|---|
| Daemon liveness | ping returns success: true in under 100 ms | Daemon is up and responsive |
| Session count | ping response data.sessions | Number of active OPC UA sessions |
| OPC UA reachability | A successful read() round-trip | The daemon's OPC UA side is healthy |

The first two cover the daemon; the third covers the daemon's connection to the OPC UA server. Wire all three for full coverage.

Daemon liveness — minimal probe

A shell one-liner per check:

bash terminal — POSIX
echo '{"command":"ping"}' \
    | timeout 2 nc -U /tmp/opcua-session-manager.sock \
    | grep -q '"success":true' && echo healthy || echo unhealthy

For TCP loopback:

bash terminal — TCP
echo '{"command":"ping"}' \
    | timeout 2 nc 127.0.0.1 9990 \
    | grep -q '"success":true' && echo healthy || echo unhealthy

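Wire this into the targets below. Each of them reads the exit status rather than the printed word, so drop the trailing echo branches (they make the pipeline always exit 0) and let grep's status through: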

  • Kubernetes livenessProbe with exec.command: ["/bin/sh","-c","..."]
  • systemd ExecStartPost for boot smoke
  • cron / Nagios / Sensu as the cheap "is it up" check
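
A status-returning variant of the probe can be sketched as a pair of shell functions. This is a minimal sketch, not part of the package: the names `ping_ok` and `probe_daemon` are illustrative, and the match assumes the response is compact JSON containing `"success":true`, exactly as the one-liners above do.

```shell
# Sketch only: ping_ok and probe_daemon are illustrative names, not package commands.

# Succeeds when the text on stdin contains the compact JSON fragment "success":true.
ping_ok() {
    grep -q '"success":true'
}

# Sends one ping over the Unix socket and returns grep's exit status:
# 0 = healthy, non-zero = no response, timeout, or success != true.
probe_daemon() {
    sock="${1:-/tmp/opcua-session-manager.sock}"
    echo '{"command":"ping"}' | timeout 2 nc -U "$sock" 2>/dev/null | ping_ok
}
```

Source the file and call `probe_daemon` (optionally with a socket path) as the last command of the probe; its exit status becomes the probe result.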

Healthcheck script

A more useful version, in PHP, that surfaces the session count:

php bin/opcua-healthcheck
#!/usr/bin/env php
<?php
require __DIR__ . '/../vendor/autoload.php';

use PhpOpcua\SessionManager\Client\SocketConnection;
use PhpOpcua\SessionManager\Exception\DaemonException;

$endpoint = getenv('OPCUA_SOCKET_PATH') ?: '/tmp/opcua-session-manager.sock';
$token    = getenv('OPCUA_AUTH_TOKEN');

$request = ['command' => 'ping'];
if ($token) {
    $request['authToken'] = $token;
}

$start = microtime(true);

try {
    $response = SocketConnection::send($endpoint, $request, timeout: 2.0);
} catch (DaemonException $e) {
    fwrite(STDERR, "FAIL ipc {$e->getMessage()}\n");
    exit(2);
}

$durationMs = (int) ((microtime(true) - $start) * 1000);

if (! $response['success']) {
    fwrite(STDERR, "FAIL response_not_ok\n");
    exit(2);
}

if ($durationMs > 100) {
    fwrite(STDERR, "WARN slow {$durationMs}ms\n");
    exit(1);
}

printf("OK sessions=%d ms=%d\n", $response['data']['sessions'], $durationMs);
exit(0);

Exit codes follow the Nagios convention (0 ok, 1 warning, 2 critical). The output line includes a session count and duration — useful in scrape-and-graph monitoring.
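
For scrape-and-graph use, the fields in that output line can be pulled apart with a small helper. A minimal sketch, assuming the line keeps the `OK sessions=<n> ms=<n>` shape printed by the script above; `healthcheck_field` is a hypothetical name, not something the package ships.

```shell
# Sketch only: healthcheck_field is an illustrative helper.
# Usage: healthcheck_field <key> <output-line>
# Prints the value of <key> from a line such as "OK sessions=3 ms=12".
healthcheck_field() {
    # Put each space-separated token on its own line, then strip the "key=" prefix.
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p"
}
```

For example, `healthcheck_field sessions "OK sessions=3 ms=12"` prints `3`, and the same call with `ms` prints `12`.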

OPC UA reachability probe

The ping covers the daemon. To cover the OPC UA path end-to-end, issue a real read() against a well-known node:

php bin/opcua-readinessprobe
#!/usr/bin/env php
<?php
require __DIR__ . '/../vendor/autoload.php';

use PhpOpcua\SessionManager\Client\ManagedClient;
use PhpOpcua\Client\Types\StatusCode;

$client = new ManagedClient(
    socketPath: getenv('OPCUA_SOCKET_PATH') ?: '/tmp/opcua-session-manager.sock',
    timeout:    5.0,
    authToken:  getenv('OPCUA_AUTH_TOKEN') ?: null,   // getenv() returns false when unset
);

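$failure = null;
try {
    $client->connect(getenv('OPCUA_ENDPOINT'));
    $dv = $client->read('i=2261');   // ProductName, present on every server
} catch (Throwable $e) {
    // Defer the exit: calling exit() here would skip the finally block.
    $failure = $e->getMessage();
} finally {
    try {
        $client->disconnect();
    } catch (Throwable) { /* never connected: nothing to close */ }
}

if ($failure !== null) {
    fwrite(STDERR, "FAIL {$failure}\n");
    exit(2);
}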

if (! StatusCode::isGood($dv->statusCode)) {
    fwrite(STDERR, "FAIL bad_status " . StatusCode::getName($dv->statusCode) . "\n");
    exit(2);
}

echo "OK product=" . $dv->getValue() . "\n";
exit(0);

This costs more than ping — one round-trip to the OPC UA server. Reserve it for the readiness probe (run every 30-60 s), not for the liveness probe (run every 5-10 s).

Metrics scraping

The daemon does not currently emit Prometheus metrics. The healthcheck script above can be wrapped in a textfile collector:

bash terminal — Prometheus textfile
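OUT_DIR=/var/lib/node_exporter/textfile
# Temp file lives in the target directory so the final mv is an atomic rename;
# node_exporter only reads *.prom, so the temp name is ignored while being written.
TMP="$(mktemp "$OUT_DIR/opcua.prom.XXXXXX")"
{
    # Any non-zero exit, including the slow-ping warning (1), is reported as down.
    if out="$(/opt/myapp/bin/opcua-healthcheck 2>/dev/null)"; then
        # "OK sessions=3 ms=12" -> "3"
        sessions="$(printf '%s\n' "$out" | sed -n 's/.*sessions=\([0-9][0-9]*\).*/\1/p')"
        echo "opcua_daemon_up 1"
        echo "opcua_daemon_sessions ${sessions:-0}"
    else
        echo "opcua_daemon_up 0"
        echo "opcua_daemon_sessions 0"
    fi
} > "$TMP" && chmod 644 "$TMP" && mv "$TMP" "$OUT_DIR/opcua.prom"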

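Schedule it every 15-60 s; node_exporter picks up the file on its scrape interval. cron alone cannot run a job more often than once a minute, so intervals below 60 s need a systemd timer or a sleep loop.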

Alerts worth wiring

| Alert | Threshold | Severity |
|---|---|---|
| Daemon unreachable | ping fails 2 checks in a row | Critical |
| Daemon slow | ping > 100 ms over 3 checks | Warning |
| Session count drops sharply | Drop ≥ 30 % in 5 minutes | Warning |
| OPC UA readiness fails | read healthcheck fails 2 checks | Critical |
| Frame-size cap hit | Any payload_too_large in logs | Warning |
| Auth failures | Any auth_failed in logs | Critical (potential intrusion attempt) |
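
Thresholds of the "N checks in a row" kind need state between runs when the monitoring agent does not keep it itself. A minimal sketch that stores a failure counter in a file; `record_check` and the state-file layout are assumptions for illustration, not part of the package.

```shell
# Sketch only: record_check and the state-file layout are illustrative.
# Usage: record_check <state-file> <exit-status-of-last-check>
# Prints the current run of consecutive failures; a success resets it to 0.
record_check() {
    state="$1"
    status="$2"
    count=0
    if [ -f "$state" ]; then
        count="$(cat "$state")"
    fi
    if [ "$status" -eq 0 ]; then
        count=0
    else
        count=$(( ${count:-0} + 1 ))
    fi
    echo "$count" > "$state"
    echo "$count"
}
```

Run the check, pass its `$?` to `record_check`, and raise the alert once the printed number reaches the threshold from the table (2 for the unreachable and readiness alerts).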

The session-count drop matters because:

  • A drop in active sessions usually means workers died or the daemon restarted. Either case is worth investigating.
  • Sustained zero sessions usually means the workers are not reaching the daemon (network, auth, configuration drift) — the daemon may be perfectly healthy and still useless.
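
The drop threshold can be evaluated from two session-count samples taken five minutes apart. A minimal sketch, assuming the agent stores the earlier sample itself; `session_drop_pct` and `is_sharp_drop` are hypothetical names, and the percentage is truncated to a whole number.

```shell
# Sketch only: session_drop_pct and is_sharp_drop are illustrative names.
# Usage: session_drop_pct <previous> <current>
# Prints the whole-number percentage by which the count fell; prints 0 when
# the count did not fall or the previous sample was 0.
session_drop_pct() {
    prev="$1"
    curr="$2"
    if [ "$prev" -le 0 ] || [ "$curr" -ge "$prev" ]; then
        echo 0
    else
        echo $(( (prev - curr) * 100 / prev ))
    fi
}

# Usage: is_sharp_drop <previous> <current> [threshold-percent]
# Succeeds when the drop reaches the threshold (30 % by default).
is_sharp_drop() {
    [ "$(session_drop_pct "$1" "$2")" -ge "${3:-30}" ]
}
```

With 10 sessions before and 7 now, `session_drop_pct 10 7` prints `30` and `is_sharp_drop 10 7` succeeds; a fall from 10 to 8 does not trigger. Sustained zero sessions never register as a drop here, so that case needs its own check on the raw count.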

Inspect from the operator side

For ad-hoc inspection, the list IPC command enumerates every active session with its endpoint, lastUsed, and (redacted) config:

bash terminal — netcat list
echo "{\"command\":\"list\",\"authToken\":\"$(cat /etc/opcua/daemon.token)\"}" \
    | nc -U /var/run/opcua/sessions.sock \
    | jq .

See Recipes · Debugging with netcat for more interactive patterns.

Log-based monitoring

The daemon's info level captures session create / close, auto-connect outcomes, and per-session expiry during cleanup. Common alert patterns:

| Pattern in logs | What it means |
|---|---|
| session_not_found on query | Worker tried a stale session; reconnect expected |
| forbidden_method on query | Worker tried an unknown method; code bug |
| auth_failed | Worker is sending wrong token, or attacker probing |
| payload_too_large | Worker is sending oversized frames; bug |
| Many Session <id> expired (endpoint: <url>) lines | Worker idle pattern; review --timeout |

The daemon does not emit a summary line per cleanup run; the only cleanup-related entry is the per-session expiry log line shown above (emitted by SessionManagerDaemon::cleanupExpiredSessions()).

Pipe --log-file into your log aggregation; alert on the patterns above.
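
Where no log aggregator is available, the same patterns can be counted straight from the file. A minimal sketch, assuming the tokens from the table appear verbatim in the log lines (the exact line format is not specified here); `count_log_pattern` and `scan_log` are hypothetical helpers.

```shell
# Sketch only: count_log_pattern and scan_log are illustrative helpers.
# Usage: count_log_pattern <log-file> <fixed-string>
# Prints how many lines contain the string; prints 0 for an unreadable file.
count_log_pattern() {
    if [ -r "$1" ]; then
        # grep -c exits 1 on zero matches; the count it prints is still valid.
        grep -cF -- "$2" "$1" || true
    else
        echo 0
    fi
}

# Usage: scan_log <log-file>
# Prints one "<pattern> <count>" line per alert-worthy token.
scan_log() {
    for p in auth_failed payload_too_large forbidden_method session_not_found; do
        echo "$p $(count_log_pattern "$1" "$p")"
    done
}
```

The counts cover the whole file, so they only ever grow between rotations; alert on the change since the previous scan, or run the scan against a freshly rotated file.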