@lanshanpi

openclaw-reliability

Read-only reliability smoke checks and health summaries for OpenClaw agents

Current version
v0.3.2
code-plugin · community · source-linked

Goals

  • Detect reliability risks before they become agent failures.
  • Classify tool/provider/runtime/plugin problems.
  • Provide local, redacted, reusable health summaries.
  • Stay safe by default: no restarts, no config changes, no plugin disabling.

Non-goals

  • This is not a security boundary. Use openclaw-language-boundary for action policy.
  • This does not automatically fix host/firewall/SSH/Gateway config.
  • This does not upload logs.

Current usage

npm run typecheck
npm run smoke
npm run report
npm run repair-plan
npm run release:check
npm run smoke -- --cpu-sample-seconds=8
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run smoke -- --probe-tools=web_fetch,web_search --probe-timeout-ms=10000
npm run smoke:json
npm run repair-plan:json

npm run release:check runs the read-only internal gate: typecheck, tests, build, required docs, example JSON fixture validation, and live passive smoke:json schema validation. Use -- --skip-smoke for offline fixture-only release checks.

npm run report wraps passive smoke JSON into a shorter human-readable report with:

  • overall health result
  • blockers vs acceptable warnings
  • recommended next action
  • whether external probes were run
  • whether any state-changing remediation should be reviewed first

Example outputs are available in examples/:

  • passive-smoke-output.txt
  • smoke-json-sample.json
  • report-output.txt

Current checks include:

  • Gateway /health plus sustained process CPU sampling.
  • Diagnostic log signals since the latest Gateway start window.
  • Config warnings from openclaw config validate.
  • Web provider credential mismatch, including tools.web.search / legacy web.search, Tavily, and MiniMax model-vs-search credential paths.
  • Source-only local extension shadows that cause compiled-runtime warning noise.
  • language-boundary runtime state and audit summary, with stale historical tool failures ignored after a freshness window.
  • Skills root pressure.
  • Channel session-expired log noise.
  • Optional active probes for selected tools/providers. Disabled by default; pass --probe-tools=web_fetch,web_search to run lightweight external checks.

Result categories

The smoke output separates true runtime degradation from short-window sampling noise:

  • runtime_degraded: sustained or unclear Gateway pressure. Treat this as actionable investigation material: active sessions, memory indexing, plugin hooks, cron jobs, channel loops, and file-descriptor growth.
  • runtime_health_sampling_noise: /health reports event_loop_utilization / cpu, but sustained CPU sampling is low or the latest diagnostic logs are clearly model_call / session-queue related. Do not restart or disable components for this alone; re-test when no model call is active.
  • provider_missing_config: a provider is selected/enabled but its tool-specific credential path is missing.
  • plugin_runtime_output_missing: a local source-only extension shadows a compiled installed package and may create loader/config warning noise.
  • tool_degraded: current non-healthy runtime-state record for a tool/provider within the freshness window.
  • tool_failure_stale: historical non-healthy runtime-state record older than the freshness window or followed by a newer success. Stale records should not drive operational decisions.
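
The split between runtime_degraded and runtime_health_sampling_noise described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the RuntimeSample shape, the classifyRuntime function, the diagnostic kind strings, and the 25% CPU threshold are all assumptions made for this example.

```typescript
// Hypothetical input shape; the real smoke check's fields are not documented here.
interface RuntimeSample {
  healthFlagsPressure: boolean;    // /health reported event_loop_utilization or cpu pressure
  sustainedCpuPercent: number;     // average over the --cpu-sample-seconds window
  latestDiagnosticKinds: string[]; // e.g. ["model_call", "session_queue"]
}

type RuntimeCategory = "ok" | "runtime_health_sampling_noise" | "runtime_degraded";

// Assumed threshold and kind names, for illustration only.
const LOW_CPU_PERCENT = 25;
const NOISE_KINDS: string[] = ["model_call", "session_queue"];

function classifyRuntime(sample: RuntimeSample): RuntimeCategory {
  if (!sample.healthFlagsPressure) return "ok";
  const cpuLow = sample.sustainedCpuPercent < LOW_CPU_PERCENT;
  // "Clearly model-call related" is read here as: there are logs, and all of them are noise kinds.
  const logsAreModelCalls =
    sample.latestDiagnosticKinds.length > 0 &&
    sample.latestDiagnosticKinds.every((kind) => NOISE_KINDS.indexOf(kind) !== -1);
  // Pressure that is neither short-lived nor explained by model calls counts as degraded.
  return cpuLow || logsAreModelCalls ? "runtime_health_sampling_noise" : "runtime_degraded";
}
```

Anything that cannot be explained as noise falls through to runtime_degraded, which matches the "sustained or unclear" wording of that category.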

Repair plan (advisory only)

npm run repair-plan converts smoke findings into a ranked advisory plan. It does not execute commands, edit config, restart Gateway, disable plugins, or change cron jobs.

npm run repair-plan
npm run repair-plan:json

The JSON schema is openclaw.reliability.repair-plan.v1. Each action includes severity, confidence, risk, confirmation requirements, evidence, manual steps, and rollback notes when a future system-changing remediation might exist.
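
As a rough picture of what one plan entry might look like, the sketch below models the listed fields as TypeScript types. Only the schema id comes from this README. The property names, the three-level scale, the example values, and the rankActions helper (which orders by severity only) are assumptions for illustration; check the actual output of npm run repair-plan:json before relying on any of them.

```typescript
// Assumed three-level scale; the real schema may use different values.
type Level = "low" | "medium" | "high";

// Hypothetical property names mirroring the fields listed in the README.
interface RepairAction {
  id: string;
  severity: Level;
  confidence: Level;
  risk: Level;
  requiresConfirmation: boolean;
  evidence: string[];
  manualSteps: string[];
  rollbackNotes?: string; // only when a future system-changing remediation might exist
}

interface RepairPlan {
  schema: "openclaw.reliability.repair-plan.v1";
  actions: RepairAction[];
}

const LEVEL_RANK: Record<Level, number> = { high: 0, medium: 1, low: 2 };

// Returns a new array with the most severe actions first (assumed ranking rule).
function rankActions(actions: RepairAction[]): RepairAction[] {
  return [...actions].sort((a, b) => LEVEL_RANK[a.severity] - LEVEL_RANK[b.severity]);
}

// Illustrative plan; severity and evidence values are made up for this example.
const examplePlan: RepairPlan = {
  schema: "openclaw.reliability.repair-plan.v1",
  actions: [
    {
      id: "retest-runtime-sampling-noise",
      severity: "low", confidence: "high", risk: "low",
      requiresConfirmation: false,
      evidence: ["sustained CPU sample was low"],
      manualSteps: ["Re-run smoke when no model call is active"],
    },
    {
      id: "fix-provider-missing-config",
      severity: "high", confidence: "medium", risk: "low",
      requiresConfirmation: true,
      evidence: ["provider selected but credential path missing"],
      manualSteps: ["Add the credential manually", "Re-run openclaw config validate"],
      rollbackNotes: "Remove the credential entry that was added.",
    },
  ],
};
```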

Use this as the v0.2 bridge between diagnosis and future confirmed repair flows.

Confirmed repair scaffold

npm run repair is the v0.3 safety scaffold. It does not implement any state-changing remediation yet. It can preview a selected repair-plan action, and it blocks execution of every action that the current version does not explicitly implement.

npm run repair-plan
npm run repair -- --action <actionId> --dry-run
npm run repair -- --action <actionId> --confirm

Current behavior:

  • missing --action is blocked
  • unknown action ids are blocked
  • --dry-run prints the selected action, suggested commands, manual steps, and rollback notes
  • --confirm runs only three safe diagnostic actions: retest-runtime-sampling-noise reruns smoke, investigate-provider-timeout runs an explicit web_fetch active probe, and fix-provider-missing-config reruns openclaw config validate without writing credentials. None of these changes config, restarts Gateway, disables plugins, or changes cron jobs. All other actions remain blocked.

This keeps the execution entrypoint safe before any state-changing repairs are added.
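
The gating rules listed above can be sketched as a small decision function. This is not the plugin's code: the RepairOptions shape, the gateRepair name, and the result type are hypothetical, and two behaviors are assumptions the README does not state (--dry-run wins when both flags are given, and a call with neither flag is blocked). The three confirmable action ids are the ones named in this README.

```typescript
// Action ids that --confirm may run, as listed in the README.
const CONFIRMABLE_ACTIONS: string[] = [
  "retest-runtime-sampling-noise",
  "investigate-provider-timeout",
  "fix-provider-missing-config",
];

// Hypothetical option shape; the real CLI's argument parsing is not shown here.
interface RepairOptions {
  action?: string;
  dryRun?: boolean;
  confirm?: boolean;
}

type GateResult =
  | { status: "blocked"; reason: string }
  | { status: "preview"; actionId: string }
  | { status: "run"; actionId: string };

function gateRepair(planActionIds: string[], opts: RepairOptions): GateResult {
  if (!opts.action) return { status: "blocked", reason: "missing --action" };
  if (planActionIds.indexOf(opts.action) === -1) {
    return { status: "blocked", reason: "unknown action id" };
  }
  // Assumption: a preview never executes, so --dry-run takes precedence over --confirm.
  if (opts.dryRun) return { status: "preview", actionId: opts.action };
  if (opts.confirm && CONFIRMABLE_ACTIONS.indexOf(opts.action) !== -1) {
    return { status: "run", actionId: opts.action };
  }
  // Everything else stays blocked, including state-changing actions and calls without a mode flag.
  return { status: "blocked", reason: "action is not implemented for execution" };
}
```

Keeping the allowlist explicit means a new action id added to the plan is blocked by default until someone deliberately adds it to the list.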

Active probes

Smoke stays passive by default. Active probes are opt-in because they can create outbound network traffic and consume provider quota.

npm run smoke -- --probe-tools=web_fetch
npm run smoke -- --probe-tools=web_fetch,web_search --probe-timeout-ms=10000

Current probes:

  • web_fetch: HTTP GET https://example.com using the local runtime network path.
  • web_search: Tavily search probe when tools.web.search.provider=tavily and plugins.entries.tavily.config.webSearch.apiKey or TAVILY_API_KEY is configured.

Unsupported probe names produce a warning rather than failing the whole smoke run.
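
One way to get the "warn, do not fail" behavior for unsupported probe names is to split the flag value and sort each name into a run list or a warning list. The sketch below assumes a hypothetical parseProbeTools helper and warning text; only the two probe names and the comma-separated flag format come from this README.

```typescript
// Probe names documented in the README.
const SUPPORTED_PROBES = ["web_fetch", "web_search"] as const;
type ProbeName = (typeof SUPPORTED_PROBES)[number];

interface ParsedProbes {
  probes: ProbeName[]; // probes to run, de-duplicated, in the order given
  warnings: string[];  // one entry per unsupported name
}

// Hypothetical helper: parses the value passed to --probe-tools.
function parseProbeTools(flagValue: string | undefined): ParsedProbes {
  const probes: ProbeName[] = [];
  const warnings: string[] = [];
  for (const raw of (flagValue ?? "").split(",")) {
    const name = raw.trim();
    if (name === "") continue; // an absent flag or stray commas mean no probes
    if ((SUPPORTED_PROBES as readonly string[]).indexOf(name) !== -1) {
      const probe = name as ProbeName;
      if (probes.indexOf(probe) === -1) probes.push(probe);
    } else {
      // Unsupported names are reported but do not abort the smoke run.
      warnings.push("unsupported probe ignored: " + name);
    }
  }
  return { probes, warnings };
}
```

With no flag at all the result is an empty probe list, which keeps the default run passive.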

Freshness windows

Historical provider/tool failures can otherwise pollute smoke results long after recovery. Runtime-state checks therefore treat a non-healthy tool record as current only when:

  • it has a failure timestamp inside --failure-freshness-minutes (default 60), and
  • there is no newer lastSuccessAt timestamp for that tool.

If all non-healthy records are stale, smoke reports the runtime-state checks as OK and mentions the stale records in detail.
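
The two freshness conditions above can be expressed as a short check. This is a sketch under assumptions: lastSuccessAt is named in this README, but the ToolRecord shape, the lastFailureAt field name, the ISO timestamp format, and the classifyRecord function are hypothetical.

```typescript
// Hypothetical runtime-state record; only lastSuccessAt is named in the README.
interface ToolRecord {
  tool: string;
  healthy: boolean;
  lastFailureAt?: string; // ISO 8601 timestamp (assumed field name and format)
  lastSuccessAt?: string; // ISO 8601 timestamp
}

type ToolStatus = "ok" | "tool_degraded" | "tool_failure_stale";

function classifyRecord(record: ToolRecord, now: Date, freshnessMinutes: number = 60): ToolStatus {
  if (record.healthy) return "ok";
  const failedAt = record.lastFailureAt ? Date.parse(record.lastFailureAt) : NaN;
  const succeededAt = record.lastSuccessAt ? Date.parse(record.lastSuccessAt) : NaN;
  // Condition 1: the failure timestamp exists and falls inside the freshness window.
  const fresh = !isNaN(failedAt) && now.getTime() - failedAt <= freshnessMinutes * 60 * 1000;
  // Condition 2: no success was recorded after that failure.
  const recovered = !isNaN(succeededAt) && succeededAt > failedAt;
  return fresh && !recovered ? "tool_degraded" : "tool_failure_stale";
}
```

A record with no failure timestamp at all is treated as stale here, since it cannot satisfy the first condition.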

Current status on this host

Last verified 2026-05-13 16:22 Asia/Shanghai:

npm run typecheck
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
Summary: ok=11 warn=2 fail=0

The remaining warnings were both runtime_health_sampling_noise: 15s Gateway CPU average was low, and latest liveness/long-running diagnostics were model-call/session-queue related. Treat the P0 runtime stability incident as recovered unless future smoke runs show sustained CPU, repeated non-model-call liveness warnings, plugin errors, provider failures, or channel loops.

Current known remediation candidates

These are intentionally not automatic:

  1. If runtime_degraded returns outside active model calls, investigate active sessions, memory indexing, plugin hooks, cron jobs, channel loops, and file descriptors before restarting Gateway.
  2. Add freshness/active-probe semantics for historical runtime-state tool failures.
  3. Keep web search provider checks aligned with current config paths (tools.web.search first, legacy web.search fallback).
  4. Track security-hardening warnings separately from runtime reliability: exec security, Control UI auth posture, trusted proxies, and plugin pinning are not P0 stability failures.
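
For item 3, the lookup order (tools.web.search first, legacy web.search as fallback) might be sketched as below. The ConfigLike type and the resolveWebSearch helper are hypothetical and model only the two paths named in this README; the real OpenClaw config has many more fields.

```typescript
interface WebSearchSection {
  provider?: string; // e.g. "tavily"
}

// Minimal, assumed view of the config: just the two paths this README mentions.
interface ConfigLike {
  tools?: { web?: { search?: WebSearchSection } };
  web?: { search?: WebSearchSection };
}

interface ResolvedSearch {
  path: "tools.web.search" | "web.search";
  section: WebSearchSection;
}

// Prefers the current path and falls back to the legacy one only when the current path is absent.
function resolveWebSearch(config: ConfigLike): ResolvedSearch | undefined {
  const current = config.tools?.web?.search;
  if (current) return { path: "tools.web.search", section: current };
  const legacy = config.web?.search;
  if (legacy) return { path: "web.search", section: legacy };
  return undefined;
}
```

Returning the path alongside the section lets a check report which location it actually read, which helps when both are present and disagree.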

Mainline closeout

openclaw-reliability is now in a stable usable state rather than an active incident response state.

Latest validation should be generated locally with:

npm run typecheck
npm run smoke -- --cpu-sample-seconds=15 --failure-freshness-minutes=60
npm run report -- --cpu-sample-seconds=15 --failure-freshness-minutes=60

Interpretation:

  • fail=0 means the system is usable for diagnostics.
  • runtime_health_sampling_noise is usually acceptable when sustained CPU is low and diagnostics are model-call/session-queue related.
  • Fresh runtime_degraded, plugin errors, provider failures, or channel loops should be investigated before changing runtime behavior.
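
For scripting around these rules, the Summary line printed by smoke can be parsed into counts. The sketch below assumes the line keeps the exact "Summary: ok=N warn=N fail=N" form shown in this README; parseSummary and usableForDiagnostics are hypothetical helper names, and npm run smoke:json is likely the more robust input for automation.

```typescript
interface SmokeSummary {
  ok: number;
  warn: number;
  fail: number;
}

// Parses a line such as "Summary: ok=11 warn=2 fail=0"; returns undefined for any other shape.
function parseSummary(line: string): SmokeSummary | undefined {
  const match = /^Summary:\s*ok=(\d+)\s+warn=(\d+)\s+fail=(\d+)$/.exec(line.trim());
  if (!match) return undefined;
  return { ok: Number(match[1]), warn: Number(match[2]), fail: Number(match[3]) };
}

// Mirrors the first interpretation rule: fail=0 means the system is usable for diagnostics.
function usableForDiagnostics(summary: SmokeSummary): boolean {
  return summary.fail === 0;
}
```

Warnings are deliberately not part of the usability check, because runtime_health_sampling_noise warnings are usually acceptable on their own.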

Resolved during this mainline:

  • failing remote embedding/provider paths are classified as provider/tool issues rather than global runtime failure
  • session-expired channel noise is detected separately
  • noisy cron delivery failures are detected as operational noise rather than plugin packaging failure
  • stale source-only extension shadows are detected
  • skills pressure is summarized
  • language-boundary runtime state and audit summary are included when available

Reusability requirements

  • No hard-coded user paths such as /Users/<name> or machine-specific workspace paths.
  • No dependency on a specific agent/session/machine name.
  • No dependency on a specific channel, provider account, local credential, cron job, or local extension layout.
  • Host-specific paths or providers may appear only as examples or local validation notes, never runtime defaults.
  • All future state paths must be configurable or derived from OpenClaw/home directory.
  • External probes must stay opt-in and must not assume any specific provider, channel, or local credential exists.
  • Default behavior must remain read-only.

Source and version

Source repository

LanShanPi/openclaw-safety

Source commit

cf1c6afd5acd33fdc462c3ff90061978dd63771a

Install command

openclaw plugins install clawhub:openclaw-reliability

Metadata

  • Package name: openclaw-reliability
  • Created: 2026/05/15
  • Updated: 2026/05/15
  • Executes code:
  • Source tag: v0.3.2-reliability-internal-beta

Compatibility

  • Built with OpenClaw: 2026.5.7
  • Plugin API range: >=2026.3.24-beta.2
  • Tags: internal-beta, latest
  • File count: 6