How Self-Diagnostics Work
ClawMagic is designed to understand its own health and fix issues without manual intervention. The diagnostic toolkit spans four categories — system health, cron scheduling, log analysis, and model/provider diagnostics — each with CLI commands, API routes, and portal UI integration.
- Skills-based self-diagnosis: the agent uses its own diagnostic tools as skills — querying health endpoints, scanning logs for errors, checking cron job status, and verifying provider connectivity. When it detects an issue, it can attempt to fix it using the same tool loop that handles user tasks.
- Existing infrastructure:
GET /health,GET /api/health/detail,clawmagic doctor(15+ env checks with--fixmode), execution manager telemetry (p50/p95, throughput, error counts), structured JSONL logging with redaction, and a full cron engine (9 action types). - Agent-driven remediation: when the agent detects failures (stuck jobs, provider errors, stale state), it can restart services, rotate to fallback providers, clean up stale jobs, and report what it did.
Recommended: Claude Code Agent or Codex for Self-Correction
For the most powerful self-healing experience, we recommend connecting ClawMagic to Claude Code (agent OAuth) or OpenAI Codex as a self-correction layer. These agents can:
- Read ClawMagic diagnostic output and identify root causes
- Generate and apply fixes to configuration, state files, and plugin manifests
- Debug provider connectivity issues by testing API keys and fallback chains
- Analyze structured logs to find patterns in failures
- Self-correct agent behavior by updating routing rules, contracts, and memory
ClawMagic exposes all diagnostic data through its API and CLI, making it straightforward for an external coding agent to ingest health signals and apply corrections. The clawmagic health --json andclawmagic logs --json commands produce machine-readable output designed for this purpose.
Diagnostic CLI Commands
| Command | Flags | Purpose |
|---|---|---|
clawmagic health | --json, --verbose | Live runtime health check: process, port, error rates, LLM failures, pass/fail summary |
clawmagic doctor | --fix, --json | Static environment checks (15+) with auto-fix mode |
clawmagic logs | --event, --level, --grep, --since, --json, --follow | Unified log viewer across all JSONL streams (events, errors, audit, security) |
clawmagic cron | list, history, run, stats, cleanup | Cron job management: list jobs, query history, manual trigger, failure detection |
clawmagic models status | --verbose | Active model, auth health, fallback chain, context window usage |
clawmagic models canary | --provider <name> | Send test prompt to verify provider response and auth |
clawmagic usage | --period 7d, --json | Usage dashboard: model costs, cron reliability, storage, API call counts |
1. System Health Checks (8 Items)
A single clawmagic health command checks all critical systems and outputs a pass/fail summary. Builds on the existing doctor command and /api/health/detail endpoint.
- Process liveness: detect if the gateway process (launchd/systemd or foreground) is alive via PID check + port probe.
- Port reachability: TCP connect to configured port (default 18790). Distinguishes “port closed” vs “port open but unhealthy”.
- LLM failure rates: uses execution manager telemetry (error counts, throughput, p95 latency) to compute failure rate over last N minutes.
- Error log scanning: reads
state/errors-YYYY-MM-DD.jsonland counts errors. Flags if rate exceeds threshold (>5 errors/min). - Pass/fail summary: aggregates all checks into a table with PASS/WARN/FAIL status. Exit code 0 = all pass, 1 = any fail.
- Alert state with backoff: persists alert state in
state/health_alerts.jsonwith exponential backoff (1m → 5m → 15m → 1h → 4h). Prevents alert fatigue.
# Run full health check clawmagic health # Machine-readable output for agent consumption clawmagic health --json # Verbose output with per-check details clawmagic health --verbose
2. Cron Job Debugging (9 Items)
ClawMagic’s cron engine (cronScheduler.ts, 380+ lines) supports interval and expression schedules with 9 action types. The diagnostic toolkit adds CLI tools to query, debug, and manage cron jobs.
- Job listing:
clawmagic cron list— shows job name, schedule, next run, last status, consecutive errors. Color-coded: green = ok, yellow = backed off, red = stale/stuck. - History query:
clawmagic cron history— filter by job name, status, date range. Backed bystate/cron/history.jsonl(append-only, 10K line cap with rotation). - Persistent failure detection: time-windowed analysis for 3+ failures within any 6-hour window. Distinguishes “flaky” (intermittent) from “broken” (consecutive).
- Manual trigger:
clawmagic cron run <job>— enqueue a job immediately for testing. - Stale job cleanup: auto-marks stuck jobs as failed (2-hour threshold). Handles machine sleep and process crash edge cases.
- Metrics summary: success rate %, avg/p95 duration, jobs by error frequency.
- API routes:
GET /api/cron/jobs,GET /api/cron/history,POST /api/cron/jobs/:id/run,POST /api/cron/cleanupfor portal integration.
# List all cron jobs with status clawmagic cron list # Query history for a specific job clawmagic cron history --job heartbeat --since 24h # Manually trigger a job for debugging clawmagic cron run heartbeat # Clean up stale/stuck jobs clawmagic cron cleanup # View aggregate stats clawmagic cron stats
3. Unified Log Viewer (9 Items)
ClawMagic already has structured JSONL logging across events, errors, audit, and security streams. The log viewer provides a single CLI command with powerful filtering.
- Unified stream: merges
state/events-*.jsonl+errors-*.jsonl+audit.jsonl+security_audit.jsonlby timestamp. - Event filter:
--event "ai.model_choice"or--event "tool.*"(glob patterns). Comma-separated for multiple events. - Level filter:
--level erroror--level warn,error. Shortcut:--errors. - Content search:
--grep "timeout"for substring matching. Context-field filters:--job,--agent,--plugin. - Time range:
--since "1h"or--since "2026-03-03T10:00". Relative durations (1h, 30m, 2d) and ISO timestamps. - Live tail:
--followfor real-time log streaming. - JSON output:
--jsonfor piping:clawmagic logs --errors --json | jq '.message' - Quick aliases:
errors-1h,ai-failures,cron-errors,securityfor common queries. - API route:
GET /api/logswith matching query params for portal log viewer.
# Show last 50 log entries clawmagic logs # Errors in the last hour clawmagic logs --errors --since 1h # Filter by event type clawmagic logs --event "ai.model_choice" --since 30m # Search logs for specific content clawmagic logs --grep "timeout" --level error # Live tail (follow mode) clawmagic logs --follow # Machine-readable for agent consumption clawmagic logs --errors --json
4. Model + Provider Diagnostics (9 Items)
Know exactly what model is running, whether it’s healthy, and how much it’s costing. Provider infrastructure and diagnostic events already exist — these tools surface them.
- Status command:
clawmagic models status— current provider, model ID, auth profile, context window usage (tokens used / max), fallback chain order + health of each. - Provider auth health: verify API keys are valid without making a billed call. Flag expired, near-expiry (<7 days), or invalid format.
- Canary test:
clawmagic models canary— send a minimal test prompt to verify provider response, check model header matches expected provider, validate latency (<10s). Catches silent auth failures and model misrouting. - Fallback chain: show primary → secondary → tertiary providers with status.
--testto simulate primary failure and verify fallback activates. - Usage dashboard:
clawmagic usage— cost breakdown by model/provider (tokens, estimated $, request count), cron reliability (success rate, avg duration), storage sizes, API call counts by type.
# Show active model and provider status clawmagic models status # Run canary test on current provider clawmagic models canary # Test a specific provider clawmagic models canary --provider anthropic # View fallback chain clawmagic models status --verbose # Usage and cost dashboard clawmagic usage --period 7d
Diagnostic API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
/health | GET | Basic health status + stuck detection |
/api/health/detail | GET | Detailed health: memory, event loop, plugins, sessions |
/api/cron/jobs | GET | List all registered cron jobs with status |
/api/cron/history | GET | Query cron run history (filter by job, status, date range) |
/api/cron/jobs/:id/run | POST | Manually trigger a cron job |
/api/cron/cleanup | POST | Clean up stale/stuck cron jobs |
/api/logs | GET | Paginated log query (mirrors CLI filter params) |
/api/providers | GET | List configured providers with status |
/api/usage | GET | Aggregated usage: model costs, cron reliability, storage, API counts |
Self-Healing Flow
When the agent detects an issue through diagnostic tools, it follows a structured remediation flow:
- Detection — Heartbeat or cron job runs
clawmagic health --json. Any FAIL or WARN status triggers the remediation skill. - Diagnosis — Agent reads detailed diagnostics (
/api/health/detail), queries logs for related errors (clawmagic logs --errors --since 15m --json), and checks provider status (clawmagic models status). - Remediation — Based on the diagnosis:
- Stuck jobs:
clawmagic cron cleanup - Provider down: switch to fallback chain
- Stale state: restart affected subsystem
- Config issue:
clawmagic doctor --fix
- Stuck jobs:
- Verification — Re-run
clawmagic healthto confirm the fix worked. Log the remediation action and outcome. - Escalation — If self-healing fails after 2 attempts, alert the user through the configured notification channel (portal, Slack, Discord, etc.) with a diagnostic summary.
Agent Self-Correction with External Agents
For complex issues beyond simple restarts, ClawMagic can delegate to an external coding agent for deep self-correction:
- Claude Code (agent OAuth): connect via OAuth to give Claude Code access to ClawMagic’s diagnostic endpoints and state files. Claude Code can read health output, analyze log patterns, modify configuration, and verify fixes — all within a sandboxed session.
- OpenAI Codex: similarly, Codex can ingest
--jsonoutput from diagnostic commands and generate targeted fixes for configuration, routing rules, or plugin manifests. - Any agent-compatible tool: the diagnostic toolkit outputs machine-readable JSON by design. Any agent that can call CLI commands or HTTP endpoints can participate in the self-healing loop.
This approach keeps ClawMagic focused on runtime execution while leveraging purpose-built coding agents for the reasoning-heavy work of root-cause analysis and code-level fixes.
Existing Diagnostic Infrastructure
ClawMagic already has significant diagnostic infrastructure that the toolkit builds on:
| Component | File | What It Does |
|---|---|---|
/health | routes/health.ts | Status + stuck job detection |
/api/health/detail | routes/health_detail.ts | Memory, event loop, plugins, sessions |
clawmagic doctor | commands/doctor.ts | 15+ env checks, JSON output, --fix mode |
| Execution telemetry | executionManager.ts | p50/p95 latency, throughput, error counts |
| Structured logging | logger.ts | JSONL events/errors with redaction, getRecentLogs(), getDebugBundle() |
| Cron engine | cronScheduler.ts | 380+ lines, 9 action types, error backoff [30s, 60s, 5m, 15m, 60m], stale detection |
| DiagnosticsStream | diagnosticEvents.ts | ai.model_choice, tool.call, context_budget events |
| Provider factory | agentProviderFactory.ts | Agent-aware provider routing + fallback chains |
State Files for Diagnostics
state/
events-YYYY-MM-DD.jsonl # Structured event log (daily rotation)
errors-YYYY-MM-DD.jsonl # Error log (daily rotation)
audit.jsonl # Audit trail
security_audit.jsonl # Security event audit
health_alerts.json # Alert state with backoff timers
diagnostics/ # Runtime diagnostic snapshots
cron/
history.jsonl # Cron run history (append-only, 10K cap)
*.json # Per-job state filesAll diagnostic data is local-first JSON/JSONL. No external database required. Files are human-readable and can be committed to version control for debugging.