OpenClaw Reliability

OpenClaw Reliability is a reliability smoke-check and guarded repair plugin for OpenClaw. It diagnoses runtime, provider, tool, plugin, channel, and configuration noise, then turns those findings into human-readable reports, watch decisions, and safe repair flows.

For the Chinese documentation, see README.zh-CN.md.

What it is for

Use this plugin when you want to know whether an OpenClaw runtime problem is truly a runtime failure or just provider noise, model-call pressure, caller input error, stale tool state, channel expiry, or approval governance.

It helps answer:

Is the Gateway actually degraded, or is the signal short-window sampling noise?
Are failures coming from providers, credentials, HTTP endpoints, tool inputs, plugin config, or channels?
Should the user be notified, or should the signal stay quiet as known noise?
Is there a safe read-only diagnostic action to run?
Is a state-changing repair blocked until explicit human approval?

Philosophy

Reliability follows a conservative operating rule:

Diagnose first. Repair only what is safe, narrow, verified, and reversible.

The plugin separates:

Smoke checks — collect local, redacted health evidence.
Reports — summarize blockers, warnings, and next action.
Watch decisions — de-duplicate signals and stay quiet for non-actionable noise.
Repair plans — create advisory actions with evidence and anti-actions.
Watch repair — run only safe non-mutating actions automatically; keep state-changing actions guarded.
Cron runner — emit NO_REPLY or a ready-to-send Chinese alert for schedulers.

It is not a security boundary. Use openclaw-language-boundary for action policy.

What it can do

Run passive reliability smoke checks.
Sample Gateway health and sustained CPU.
Classify provider timeouts, HTTP errors, missing config, tool input errors, approval outcomes, plugin runtime noise, source-only shadowing, channel expiry, and stale tool failures.
Generate short human-readable reports.
Generate JSON output for automation.
De-duplicate watch notifications.
Run safe read-only diagnostic repair actions.
Run one guarded state-changing cleanup for stale tool-failure records only when explicitly allowlisted.
Emit NO_REPLY for quiet scheduled checks.

What it will not do by default

It will not restart Gateway.
It will not patch config.
It will not write credentials.
It will not disable plugins.
It will not delete files.
It will not mutate cron jobs, channels, sessions, or external services.
It will not upload logs.

Install

openclaw plugins install clawhub:openclaw-reliability@0.3.14
openclaw gateway restart

For the latest ClawHub release:

openclaw plugins install clawhub:openclaw-reliability
openclaw gateway restart

Common commands

From the installed package or source directory:

npm run smoke
npm run smoke:json
npm run report
npm run repair-plan
npm run repair-plan:json
npm run watch-check
npm run watch-check:json
npm run watch-repair
npm run watch-repair:json
npm run cron-runner
npm run cron-runner:json
npm run release:check

Useful examples:

npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run report -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run watch-repair:json -- --cpu-sample-seconds=5 --failure-freshness-minutes=60
npm run cron-runner -- --cpu-sample-seconds=5 --failure-freshness-minutes=60

Cron runner

cron-runner is designed for scheduled checks.

If there is no actionable signal, it prints exactly:

NO_REPLY

If a human should be notified, it prints a concise Chinese alert.

This means the scheduler does not need to parse JSON or summarize the output with a model. It can run the command and forward stdout whenever the output is not NO_REPLY.
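A minimal sketch of that scheduler-side logic, assuming a POSIX shell. The function name `forward_if_actionable` and the `your-notifier` placeholder are hypothetical and not part of the plugin. Only the `NO_REPLY` sentinel and the `npm run cron-runner` command come from this README. Treating empty output as quiet is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the alert text only when it is actionable.
# Returns 0 when something was forwarded, 1 when the check was quiet.
forward_if_actionable() {
  output="$1"
  # The exact sentinel means "stay quiet". Empty output is treated as
  # quiet too (an assumption, not documented plugin behavior).
  if [ "$output" = "NO_REPLY" ] || [ -z "$output" ]; then
    return 1
  fi
  printf '%s\n' "$output"
  return 0
}

# Usage, run from the plugin directory (your-notifier is a placeholder):
#   output="$(npm run --silent cron-runner)"
#   forward_if_actionable "$output" | your-notifier
```

`npm run --silent` suppresses npm's own banner lines so that stdout holds only the runner's output, and the command substitution drops the trailing newline before the comparison.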

Safe repair model

watch-repair automatically runs only actions whose repair plan says:

wouldChangeSystem=false
requiresConfirmation=false
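
The gate above can be pictured as a two-field check. This is an illustrative sketch only: the function name `is_auto_runnable` is hypothetical, and it assumes the two plan fields have already been extracted as the strings `true` or `false`.

```shell
#!/bin/sh
# Hypothetical gate: succeeds only when the action neither changes the
# system nor requires confirmation. Any other value fails closed.
is_auto_runnable() {
  would_change_system="$1"
  requires_confirmation="$2"
  [ "$would_change_system" = "false" ] && [ "$requires_confirmation" = "false" ]
}

# Usage:
#   if is_auto_runnable "false" "false"; then echo "run automatically"; fi
```

Comparing against the literal `false` rather than testing for "not true" means a missing or malformed field keeps the action guarded.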

State-changing actions require explicit allowlists such as:

npm run watch-repair -- --allow-state-change=tool_failure_stale

The only guarded state-changing action currently implemented is cleanup-stale-tool-failures. It removes only stale, non-healthy tool records, creates a backup, reruns the smoke check, and restores the backup if the post-check gets worse.
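
The general pattern of backing up, changing, re-checking, and rolling back can be sketched as follows. This is not the plugin's implementation: `guarded_change`, its arguments, and the `.bak` suffix are all hypothetical, and the post-check is reduced to a command whose exit status says whether the result is acceptable.

```shell
#!/bin/sh
# Hypothetical helper illustrating backup -> change -> post-check -> rollback.
#   $1: path to a state file
#   $2: shell command that mutates the file
#   $3: shell command that exits 0 when the post-check is not worse
guarded_change() {
  state_file="$1"
  change_cmd="$2"
  check_cmd="$3"
  backup="${state_file}.bak"

  # Back up first; refuse to continue if the backup cannot be made.
  cp "$state_file" "$backup" || return 2

  # Apply the change, then verify. Roll back if either step fails.
  if sh -c "$change_cmd" && sh -c "$check_cmd"; then
    return 0
  fi
  cp "$backup" "$state_file"
  return 1
}
```

Taking the backup before the change and restoring on any failure keeps the operation reversible, which matches the "safe, narrow, verified, and reversible" rule stated earlier.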

Relationship with Language Boundary

openclaw-language-boundary protects action safety boundaries.
openclaw-reliability diagnoses runtime reliability and guarded repair signals.

Use both together for a safer OpenClaw runtime: one controls action policy, the other watches operational health.

Current release

Latest verified release: openclaw-reliability@0.3.14
ClawHub release ID: rd7a6g0tky94f0sbfrxa27n6e18895e7
Status at final verification: published, scanned clean, isolated install verified.
