openclaw-reliability
Read-only reliability smoke checks and health summaries for OpenClaw agents.
Goals
- Detect reliability risks before they become agent failures.
- Classify tool/provider/runtime/plugin problems.
- Provide local, redacted, reusable health summaries.
- Stay safe by default: no restarts, no config changes, no plugin disabling.
Non-goals
- This is not a security boundary. Use openclaw-language-boundary for action policy.
- This does not automatically fix host/firewall/SSH/Gateway config.
- This does not upload logs.
Current usage
npm run typecheck
npm run smoke
npm run report
npm run repair-plan
npm run release:check
npm run smoke -- --cpu-sample-seconds=8
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run smoke -- --probe-tools=web_fetch,web_search --probe-timeout-ms=10000
npm run smoke:json
npm run repair-plan:json
npm run release:check runs the read-only internal gate: typecheck, tests, build, required docs, example JSON fixture validation, and live passive smoke:json schema validation. Use -- --skip-smoke for offline fixture-only release checks.
npm run report wraps passive smoke JSON into a shorter human-readable report with:
- overall health result
- blockers vs acceptable warnings
- recommended next action
- whether external probes were run
- whether any state-changing remediation should be reviewed first
Example outputs are available in examples/:
- passive-smoke-output.txt
- smoke-json-sample.json
- report-output.txt
Current checks include:
- Gateway /health plus sustained process CPU sampling.
- Diagnostic log signals since the latest Gateway start window.
- Config warnings from openclaw config validate.
- Web provider credential mismatch, including tools.web.search / legacy web.search, Tavily, and MiniMax model-vs-search credential paths.
- Source-only local extension shadows that cause compiled-runtime warning noise.
- language-boundary runtime state and audit summary, with stale historical tool failures ignored after a freshness window.
- Skills root pressure.
- Channel session-expired log noise.
- Optional active probes for selected tools/providers. Disabled by default; pass --probe-tools=web_fetch,web_search to run lightweight external checks.
Result categories
The smoke output separates true runtime degradation from short-window sampling noise:
- runtime_degraded: sustained or unclear Gateway pressure. Treat this as actionable investigation material: active sessions, memory indexing, plugin hooks, cron jobs, channel loops, and file-descriptor growth.
- runtime_health_sampling_noise: /health reports event_loop_utilization/cpu, but sustained CPU sampling is low or the latest diagnostic logs are clearly model_call / session-queue related. Do not restart or disable components for this alone; re-test when no model call is active.
- provider_missing_config: a provider is selected/enabled but its tool-specific credential path is missing.
- plugin_runtime_output_missing: a local source-only extension shadows a compiled installed package and may create loader/config warning noise.
- tool_degraded: current non-healthy runtime-state record for a tool/provider within the freshness window.
- tool_failure_stale: historical non-healthy runtime-state record older than the freshness window or followed by a newer success. Stale records should not drive operational decisions.
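The categories above can be modelled as a closed union so that a consumer of the smoke output has to handle each one. The following is a minimal sketch, not the project's own types: the category names come from the list above, while the Triage labels and the triage function are hypothetical and only encode one reading of the guidance given for each category.

```typescript
// Category names are taken from the list above.
type ResultCategory =
  | "runtime_degraded"
  | "runtime_health_sampling_noise"
  | "provider_missing_config"
  | "plugin_runtime_output_missing"
  | "tool_degraded"
  | "tool_failure_stale";

// Hypothetical triage labels; they are not part of the smoke output.
type Triage = "investigate" | "retest-later" | "review-manually" | "ignore";

// Maps a category to a suggested next step. The mapping is an assumption
// based on the prose description of each category.
function triage(category: ResultCategory): Triage {
  switch (category) {
    case "runtime_degraded":
    case "tool_degraded":
      return "investigate";
    case "runtime_health_sampling_noise":
      // Re-test when no model call is active; do not restart for this alone.
      return "retest-later";
    case "provider_missing_config":
    case "plugin_runtime_output_missing":
      return "review-manually";
    case "tool_failure_stale":
      // Stale records should not drive operational decisions.
      return "ignore";
  }
}
```

Because the switch is exhaustive over the union, adding a new category to ResultCategory would make the compiler flag triage until the new case is handled.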
Repair plan (advisory only)
npm run repair-plan converts smoke findings into a ranked advisory plan. It
does not execute commands, edit config, restart Gateway, disable plugins, or
change cron jobs.
npm run repair-plan
npm run repair-plan:json
The JSON schema is openclaw.reliability.repair-plan.v1. Each action includes
severity, confidence, risk, confirmation requirements, evidence, manual steps,
and rollback notes when a future system-changing remediation might exist.
Use this as the v0.2 bridge between diagnosis and future confirmed repair flows.
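For readers consuming the JSON output, a rough TypeScript shape for a plan might look like the sketch below. Only the schema id and the list of concepts (severity, confidence, risk, confirmation, evidence, manual steps, rollback notes) come from this README; the field names, the three-level scale, and the ordering rule in rankActions are all assumptions, so check them against the real repair-plan:json output before relying on them.

```typescript
// Hypothetical three-level scale; the real schema may use other values.
type Level = "low" | "medium" | "high";

// Hypothetical field names for one advisory action.
interface RepairAction {
  id: string;
  severity: Level;
  confidence: Level;
  risk: Level;
  requiresConfirmation: boolean;
  evidence: string[];
  manualSteps: string[];
  rollbackNotes?: string[];
}

interface RepairPlan {
  schema: "openclaw.reliability.repair-plan.v1"; // schema id from this README
  actions: RepairAction[];
}

const weight: Record<Level, number> = { low: 0, medium: 1, high: 2 };

// Returns a new array ordered by severity, then confidence (both descending).
// The input plan is left untouched. This ordering is an assumed rule.
function rankActions(plan: RepairPlan): RepairAction[] {
  return plan.actions.slice().sort(
    (a, b) =>
      weight[b.severity] - weight[a.severity] ||
      weight[b.confidence] - weight[a.confidence],
  );
}
```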
Confirmed repair scaffold
npm run repair is the v0.3 safety scaffold. It does not implement actual
remediation yet. It can preview a selected repair-plan action and deliberately
blocks execution unless a future version implements that exact action.
npm run repair-plan
npm run repair -- --action <actionId> --dry-run
npm run repair -- --action <actionId> --confirm
Current behavior:
- A missing --action is blocked.
- Unknown action ids are blocked.
- --dry-run prints the selected action, suggested commands, manual steps, and rollback notes.
- --confirm is implemented only for safe diagnostic actions: retest-runtime-sampling-noise reruns smoke, investigate-provider-timeout runs an explicit web_fetch active probe, and fix-provider-missing-config reruns openclaw config validate without writing credentials. These do not change config, restart Gateway, disable plugins, or change cron jobs; other actions remain blocked.
This keeps the execution entrypoint safe before any state-changing repairs are added.
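The gating rules above amount to a small decision function. The sketch below is illustrative only: the three confirmable action ids are the ones named in this section, but the RepairArgs shape, the Decision type, and the choices to let --dry-run win over --confirm and to block when neither flag is given are assumptions, not documented behavior.

```typescript
type Decision =
  | { kind: "blocked"; reason: string }
  | { kind: "preview"; actionId: string }
  | { kind: "run-diagnostic"; actionId: string };

// Action ids that --confirm may run, as listed in this section.
const CONFIRMABLE: string[] = [
  "retest-runtime-sampling-noise",
  "investigate-provider-timeout",
  "fix-provider-missing-config",
];

// Hypothetical parsed form of the CLI flags.
interface RepairArgs {
  action?: string;
  dryRun?: boolean;
  confirm?: boolean;
}

function decide(args: RepairArgs, knownActionIds: string[]): Decision {
  if (!args.action) return { kind: "blocked", reason: "missing --action" };
  if (knownActionIds.indexOf(args.action) === -1) {
    return { kind: "blocked", reason: "unknown action id" };
  }
  // Assumption: a preview request takes precedence over --confirm.
  if (args.dryRun) return { kind: "preview", actionId: args.action };
  if (args.confirm) {
    return CONFIRMABLE.indexOf(args.action) !== -1
      ? { kind: "run-diagnostic", actionId: args.action }
      : { kind: "blocked", reason: "action is not implemented for --confirm" };
  }
  // Assumption: with neither flag, nothing runs.
  return { kind: "blocked", reason: "pass --dry-run or --confirm" };
}
```

Defaulting every unmatched path to "blocked" mirrors the stated intent of keeping the entrypoint safe before state-changing repairs exist.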
Active probes
Smoke stays passive by default. Active probes are opt-in because they can create outbound network traffic and consume provider quota.
npm run smoke -- --probe-tools=web_fetch
npm run smoke -- --probe-tools=web_fetch,web_search --probe-timeout-ms=10000
Current probes:
- web_fetch: HTTP GET https://example.com using the local runtime network path.
- web_search: Tavily search probe when tools.web.search.provider=tavily and plugins.entries.tavily.config.webSearch.apiKey or TAVILY_API_KEY is configured.
Unsupported probe names produce a warning rather than failing the whole smoke run.
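The warn-instead-of-fail behavior for probe names can be sketched as a small parser over the --probe-tools value. The two supported names come from the list above; the function name, return shape, and warning text are hypothetical and not taken from the actual CLI.

```typescript
// Probe names documented in this section.
const SUPPORTED_PROBES = ["web_fetch", "web_search"] as const;
type ProbeName = (typeof SUPPORTED_PROBES)[number];

interface ParsedProbes {
  probes: ProbeName[];
  warnings: string[];
}

// Splits a comma-separated flag value, keeps supported names once each,
// and collects a warning for every unsupported name instead of throwing.
function parseProbeTools(flagValue: string | undefined): ParsedProbes {
  const probes: ProbeName[] = [];
  const warnings: string[] = [];
  if (!flagValue) return { probes, warnings }; // passive by default

  for (const raw of flagValue.split(",")) {
    const name = raw.trim();
    if (name === "") continue;
    const match = SUPPORTED_PROBES.find((p) => p === name);
    if (match === undefined) {
      warnings.push("unsupported probe: " + name);
    } else if (probes.indexOf(match) === -1) {
      probes.push(match);
    }
  }
  return { probes, warnings };
}
```

An absent or empty flag yields no probes at all, which matches the passive default described above.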
Freshness windows
Historical provider/tool failures can otherwise pollute smoke results long after recovery. Runtime-state checks therefore treat a non-healthy tool record as current only when:
- it has a failure timestamp inside --failure-freshness-minutes (default 60), and
- there is no newer lastSuccessAt timestamp for that tool.
If all non-healthy records are stale, smoke reports the runtime-state checks as OK and mentions the stale records in detail.
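The two freshness conditions translate directly into a predicate. This is a sketch under assumptions: lastSuccessAt is named in the text above, but the ToolRecord shape, the lastFailureAt and healthy field names, and the use of ISO timestamp strings are illustrative guesses about the runtime-state format.

```typescript
// Hypothetical runtime-state record; only lastSuccessAt is named in the README.
interface ToolRecord {
  tool: string;
  healthy: boolean;
  lastFailureAt?: string; // ISO timestamp (assumed field name)
  lastSuccessAt?: string; // ISO timestamp
}

// True only when the record is non-healthy, its failure falls inside the
// freshness window, and no newer success has been recorded.
function isCurrentFailure(
  record: ToolRecord,
  now: Date,
  freshnessMinutes: number = 60,
): boolean {
  if (record.healthy || !record.lastFailureAt) return false;

  const failedAt = Date.parse(record.lastFailureAt);
  if (Number.isNaN(failedAt)) return false; // unparseable timestamp: treat as stale

  const ageMs = now.getTime() - failedAt;
  if (ageMs > freshnessMinutes * 60 * 1000) return false; // outside the window

  if (record.lastSuccessAt) {
    const succeededAt = Date.parse(record.lastSuccessAt);
    if (!Number.isNaN(succeededAt) && succeededAt > failedAt) return false; // recovered
  }
  return true;
}
```

A record that fails this predicate would correspond to tool_failure_stale rather than tool_degraded.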
Current status on this host
Last verified 2026-05-13 16:22 Asia/Shanghai:
npm run typecheck
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
Summary: ok=11 warn=2 fail=0
The remaining warnings were both runtime_health_sampling_noise: 15s Gateway CPU average was low, and latest liveness/long-running diagnostics were model-call/session-queue related. Treat the P0 runtime stability incident as recovered unless future smoke runs show sustained CPU, repeated non-model-call liveness warnings, plugin errors, provider failures, or channel loops.
Current known remediation candidates
These are intentionally not automatic:
- If runtime_degraded returns outside active model calls, investigate active sessions, memory indexing, plugin hooks, cron jobs, channel loops, and file descriptors before restarting Gateway.
- Add freshness/active-probe semantics for historical runtime-state tool failures.
- Keep web search provider checks aligned with current config paths (tools.web.search first, legacy web.search fallback).
- Track security-hardening warnings separately from runtime reliability: exec security, Control UI auth posture, trusted proxies, and plugin pinning are not P0 stability failures.
Mainline closeout
openclaw-reliability is now in a stable usable state rather than an active incident response state.
Latest validation should be generated locally with:
npm run typecheck
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run report -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
Interpretation:
- fail=0 means the system is usable for diagnostics.
- runtime_health_sampling_noise is usually acceptable when sustained CPU is low and diagnostics are model-call/session-queue related.
- Fresh runtime_degraded, plugin errors, provider failures, or channel loops should be investigated before changing runtime behavior.
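For scripting around the text output, the fail=0 rule can be checked by parsing a summary line of the form shown earlier in this README (Summary: ok=11 warn=2 fail=0). This is a hedged sketch: the helper names are hypothetical, the regular expression assumes that exact line format, and smoke:json is likely the more robust source for automation.

```typescript
interface SmokeSummary {
  ok: number;
  warn: number;
  fail: number;
}

// Parses a line such as "Summary: ok=11 warn=2 fail=0".
// Returns null when the line does not match the assumed format.
function parseSummary(line: string): SmokeSummary | null {
  const m = line.match(/ok=(\d+)\s+warn=(\d+)\s+fail=(\d+)/);
  if (!m) return null;
  return { ok: Number(m[1]), warn: Number(m[2]), fail: Number(m[3]) };
}

// fail=0 means the system is usable for diagnostics; warnings still
// need to be read individually against the result categories.
function usableForDiagnostics(summary: SmokeSummary): boolean {
  return summary.fail === 0;
}
```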
Resolved during this mainline:
- failing remote embedding/provider paths are classified as provider/tool issues rather than global runtime failure
- session-expired channel noise is detected separately
- noisy cron delivery failures are detected as operational noise rather than plugin packaging failure
- stale source-only extension shadows are detected
- skills pressure is summarized
- language-boundary runtime state and audit summary are included when available
Reusability requirements
- No hard-coded user paths such as /Users/<name> or machine-specific workspace paths.
- No dependency on a specific agent/session/machine name.
- No dependency on a specific channel, provider account, local credential, cron job, or local extension layout.
- Host-specific paths or providers may appear only as examples or local validation notes, never runtime defaults.
- All future state paths must be configurable or derived from OpenClaw/home directory.
- External probes must stay opt-in and must not assume any specific provider, channel, or local credential exists.
- Default behavior must remain read-only.