Delta-Mem MLX for OpenClaw
Give OpenClaw agents a local Apple Silicon memory sidecar.
This plugin connects OpenClaw to a local δ-mem MLX sidecar: an OpenAI-compatible chat-completions service that runs MLX models and can preload retrieved memory into a compact neural state before generation. It is for people running local agents on macOS who want memory behavior that is more interesting than simply pasting old context into the prompt.
δ-mem is exciting because it changes how attention is shaped in a frozen model. Instead of fine-tuning a new model or appending a long transcript, the sidecar keeps a small per-session memory state and applies it through the model's attention path. In local tests, the Qwen3 4B backbone improved consistently when the δ-mem adapter was attached; OpenClaw-specific memory preload experiments are still being tuned and benchmarked.
Stack
| Layer | What it uses |
|---|---|
| Agent runtime | OpenClaw model provider config |
| Sidecar API | FastAPI, OpenAI-compatible /v1/chat/completions |
| Inference | MLX / mlx-lm on Apple Silicon |
| Default backbone | mlx-community/Qwen3-4B-Instruct-2507-4bit |
| δ-mem adapter | ofthetrees/delta-mem-qwen3-4b-instruct-mlx-adapter |
| Optional retrieval | QMD vsearch snippets passed as attention_state |
Sidecar and benchmark repo:
https://github.com/elimaine/delta-mem-mlx-sidecar-w-openclaw
δ-mem paper:
https://arxiv.org/abs/2605.12357
Requirements
| Component | Minimum | Notes |
|---|---|---|
| Host | macOS on Apple Silicon | MLX is the intended local runtime |
| Python | 3.11+ | Used by the sidecar |
| Node | Current LTS | Used by plugin helper scripts |
| OpenClaw | 2026.3.24-beta.2 or newer | Required for plugin install/use |
| Storage | Several GB | Model downloads are gated and explicit |
| Hugging Face | Network access | Needed for model/adapter downloads |
The plugin can still emit provider config for an already-running sidecar. Managed install is optional.
Install
From ClawHub:
openclaw plugins install clawhub:@elimaine/openclaw-delta-mem-mlx
From GitHub:
git clone https://github.com/elimaine/openclaw-delta-mem-mlx-plugin.git
openclaw plugins install -l ./openclaw-delta-mem-mlx-plugin
Quick Start
Check whether the sidecar route is reachable:
curl -fsS http://127.0.0.1:8765/health
Then use the plugin inside OpenClaw:
- Human path: ask OpenClaw to run `delta_mem_mlx_status`, then `delta_mem_mlx_provider_config`.
- Agent path: call `delta_mem_mlx_provider_config` with `sidecarBaseUrl: "http://127.0.0.1:8765"` and `modelId: "delta-mem-qwen3-4b-mlx"`, then add the returned provider block to the model configuration.
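The returned provider block looks roughly like the sketch below. The exact key names are an assumption here (check the tool's actual output); the base URL, transport, endpoint, and session header follow the fields described under "How Memory Reaches The Model".

```json
{
  "models": {
    "delta-mem-qwen3-4b-mlx": {
      "baseUrl": "http://127.0.0.1:8765/v1",
      "transportProtocol": "openai-chat-completions",
      "endpoint": "/v1/chat/completions",
      "headers": {
        "X-OpenClaw-Session-Key": "agent-main"
      }
    }
  }
}
```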
Sidecar Setup
Managed install:
npm run install-sidecar
The installer checks platform assumptions first. It asks before downloading model artifacts and starts the sidecar only after a completed managed install.
Install without starting:
npm run install-sidecar-only
Start an existing sidecar checkout:
npm run start-sidecar -- --root /path/to/delta-mem-mlx-sidecar-w-openclaw
Use the QMD fallback when OpenClaw is not yet passing attention_state itself:
npm run start-sidecar -- \
--root /path/to/delta-mem-mlx-sidecar-w-openclaw \
--attention-state-source qmd \
--qmd-mode vsearch \
--qmd-limit 6
If QMD is missing, times out, or exits nonzero, the sidecar still answers normally and reports `X-Delta-Attention-State-Count: 0`.
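Those diagnostic headers make the fallback observable from the client side. A minimal sketch of interpreting them (`preloadStatus` is a hypothetical helper name, not part of the plugin; the header names are from the sidecar's response contract described in this document):

```javascript
// Decide whether memory preload actually happened for a response,
// based on the sidecar's diagnostic headers. Assumes headers have
// already been lowercased, as fetch()'s Headers and Node both do.
function preloadStatus(headers) {
  const raw = headers["x-delta-attention-state-count"];
  const count = Number(raw ?? 0);
  return {
    preloaded: Number.isFinite(count) && count > 0,
    count: Number.isFinite(count) ? count : 0,
    source: headers["x-delta-attention-state-source"] ?? "none",
  };
}

// QMD missing or failed: the sidecar answered anyway, with no preload.
console.log(preloadStatus({ "x-delta-attention-state-count": "0" }));

// Successful QMD vsearch preload of six snippets.
console.log(preloadStatus({
  "x-delta-attention-state-count": "6",
  "x-delta-attention-state-source": "qmd:vsearch",
}));
```

Checking `preloaded` after each call is a cheap way to catch a silently broken QMD path before it skews an evaluation run.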
Model Presets
| Preset | Purpose |
|---|---|
| qwen3-delta | Qwen3 4B MLX backbone plus converted δ-mem adapter |
| smoke | Small Qwen2.5 MLX model for cheap setup checks |
| custom | User-supplied MLX model and optional adapter directory |
The qwen3-delta preset validates that `delta_mem_config.json` and `delta_mem_adapter_mlx.npz` exist before startup, so partial adapter downloads fail early with a clear error.
How Memory Reaches The Model
OpenClaw calls the sidecar like a normal OpenAI-compatible provider:
- baseUrl: sidecar `/v1` root
- transportProtocol: `openai-chat-completions`
- endpoint: `/v1/chat/completions`
- stable session header: `X-OpenClaw-Session-Key`
Retrieved memory should not be appended to the visible chat unless the operator explicitly wants that behavior. The sidecar accepts hidden memory-shaping input as:
- request body: `attention_state`, `attentionState`, or `delta_attention_state`
- request header: `X-Delta-Attention-State`
The sidecar then preloads those snippets into the per-session δ-state before the real assistant response. This is attention shaping, not guaranteed verbatim recall.
Agent Integration Notes
Agents configuring this plugin should preserve these invariants:
- Use one stable `X-OpenClaw-Session-Key` per logical agent/session.
- Route to `/v1/chat/completions`, not legacy completions.
- Keep QMD snippets in `attention_state`; do not copy them into user-visible prompt text unless explicitly requested.
- Verify integration with `X-Delta-Attention-State-Count` and `X-Delta-Attention-State-Source`.
- Treat install/download actions as operator-approved; do not silently fetch model weights.
Minimal provider hook:
const body = {
  model,
  messages,
  temperature,
  max_tokens: maxTokens
};

const query = latestUserText(messages);
const snippets = await qmd.vsearch(query, { limit: 6 });
body.attention_state = snippets.map((item) => ({
  text: item.text || item.content || item.snippet,
  source: item.source || "qmd:vsearch",
  score: item.score
})).filter((item) => item.text);
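Putting the invariants together, a full request to the sidecar can be assembled like this. `buildSidecarRequest` is a hypothetical helper, not part of the plugin API; the URL shape and header names follow the provider fields documented above.

```javascript
// Assemble a sidecar request: OpenAI-compatible chat-completions endpoint,
// stable session header, and attention_state kept out of the visible chat.
function buildSidecarRequest({ baseUrl, sessionKey, model, messages, attentionState }) {
  const body = { model, messages };
  if (attentionState && attentionState.length > 0) {
    body.attention_state = attentionState; // hidden memory-shaping input
  }
  return {
    url: `${baseUrl.replace(/\/$/, "")}/v1/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      "X-OpenClaw-Session-Key": sessionKey,
    },
    body: JSON.stringify(body),
  };
}

const req = buildSidecarRequest({
  baseUrl: "http://127.0.0.1:8765",
  sessionKey: "agent-main",
  model: "delta-mem-qwen3-4b-mlx",
  messages: [{ role: "user", content: "What did we decide about presets?" }],
  attentionState: [{ text: "Preset: qwen3-delta", source: "qmd:vsearch", score: 0.91 }],
});
console.log(req.url); // http://127.0.0.1:8765/v1/chat/completions
```

Because the session key is fixed per agent, the sidecar can keep accumulating the same per-session δ-state across turns.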
Benchmarks
Current findings live in the sidecar wiki:
https://github.com/elimaine/delta-mem-mlx-sidecar-w-openclaw/blob/main/wiki/Benchmark-Findings.md
High-level summary:
| Test shape | Result |
|---|---|
| Qwen3 4B + δ-mem adapter | Consistent response-quality lift over the plain backbone |
| Strongest OpenClaw-shaped QMD preload tests | 0.5625 plain to 0.7292 δ-mem, about 1.30x |
| Stricter OpenClaw-16 replay | Small exact-recall recovery, +0.0391 absolute |
| Latency | Usually slower with memory enabled; exact ratio depends on path |
These are early local benchmarks, not a claim that the plugin perfectly recalls session history. The practical goal is to measure when δ-state improves agent behavior enough to justify the latency and setup cost.
When Not To Use This
- You are not on Apple Silicon and need an efficient local runtime today.
- You want guaranteed transcript recall rather than attention shaping.
- You need a zero-setup hosted model provider.
- You cannot allocate disk space for local model artifacts.
Documentation
- Sidecar repo: https://github.com/elimaine/delta-mem-mlx-sidecar-w-openclaw
- Benchmark findings: https://github.com/elimaine/delta-mem-mlx-sidecar-w-openclaw/blob/main/wiki/Benchmark-Findings.md
- δ-mem paper: https://arxiv.org/abs/2605.12357
- Upstream adapter: https://huggingface.co/declare-lab/delta-mem_qwen3_4b-instruct
- MLX adapter artifact: https://huggingface.co/ofthetrees/delta-mem-qwen3-4b-instruct-mlx-adapter