ClawMagic Docs

Diagnostics + Self-Healing

ClawMagic includes a built-in diagnostic toolkit with 35 checks across system health, cron debugging, log analysis, and model diagnostics. The agent can identify and fix its own issues through skills-based self-diagnosis.

How Self-Diagnostics Work

ClawMagic is designed to understand its own health and fix issues without manual intervention. The diagnostic toolkit spans four categories — system health, cron scheduling, log analysis, and model/provider diagnostics — each with CLI commands, API routes, and portal UI integration.

  • Skills-based self-diagnosis: the agent uses its own diagnostic tools as skills — querying health endpoints, scanning logs for errors, checking cron job status, and verifying provider connectivity. When it detects an issue, it can attempt to fix it using the same tool loop that handles user tasks.
  • Existing infrastructure: GET /health, GET /api/health/detail, clawmagic doctor (15+ env checks with --fix mode), execution manager telemetry (p50/p95, throughput, error counts), structured JSONL logging with redaction, and a full cron engine (9 action types).
  • Agent-driven remediation: when the agent detects failures (stuck jobs, provider errors, stale state), it can restart services, rotate to fallback providers, clean up stale jobs, and report what it did.

Recommended: Claude Code Agent or Codex for Self-Correction

For the most powerful self-healing experience, we recommend connecting ClawMagic to Claude Code (agent OAuth) or OpenAI Codex as a self-correction layer. These agents can:

  • Read ClawMagic diagnostic output and identify root causes
  • Generate and apply fixes to configuration, state files, and plugin manifests
  • Debug provider connectivity issues by testing API keys and fallback chains
  • Analyze structured logs to find patterns in failures
  • Self-correct agent behavior by updating routing rules, contracts, and memory

ClawMagic exposes all diagnostic data through its API and CLI, making it straightforward for an external coding agent to ingest health signals and apply corrections. The clawmagic health --json and clawmagic logs --json commands produce machine-readable output designed for this purpose.
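
As a rough illustration of how an external agent might consume this output, the sketch below parses a health report and picks out the checks that need attention. The JSON shape (a top-level `checks` list whose items have `name` and `status` fields) and the check names are assumptions made for the example; the actual schema of clawmagic health --json is not described here.

```python
import json

# Hypothetical report shape; the real `clawmagic health --json` schema may differ.
SAMPLE_REPORT = """
{"checks": [
  {"name": "process", "status": "PASS"},
  {"name": "port", "status": "PASS"},
  {"name": "llm_failure_rate", "status": "WARN"},
  {"name": "error_log", "status": "FAIL"}
]}
"""


def checks_needing_attention(report_text: str) -> list[str]:
    """Return the names of checks whose status is WARN or FAIL."""
    report = json.loads(report_text)
    return [
        check["name"]
        for check in report.get("checks", [])
        if check.get("status") in ("WARN", "FAIL")
    ]


print(checks_needing_attention(SAMPLE_REPORT))
```

In practice the report text would come from running the CLI, for example via `subprocess.run(["clawmagic", "health", "--json"], capture_output=True, text=True).stdout`.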

Diagnostic CLI Commands

| Command | Flags / subcommands | Purpose |
| --- | --- | --- |
| clawmagic health | --json, --verbose | Live runtime health check: process, port, error rates, LLM failures, pass/fail summary |
| clawmagic doctor | --fix, --json | Static environment checks (15+) with auto-fix mode |
| clawmagic logs | --event, --level, --grep, --since, --json, --follow | Unified log viewer across all JSONL streams (events, errors, audit, security) |
| clawmagic cron | list, history, run, stats, cleanup | Cron job management: list jobs, query history, manual trigger, failure detection |
| clawmagic models status | --verbose | Active model, auth health, fallback chain, context window usage |
| clawmagic models canary | --provider <name> | Send test prompt to verify provider response and auth |
| clawmagic usage | --period 7d, --json | Usage dashboard: model costs, cron reliability, storage, API call counts |

1. System Health Checks (8 Items)

A single clawmagic health command checks all critical systems and outputs a pass/fail summary. It builds on the existing clawmagic doctor command and the /api/health/detail endpoint.

  • Process liveness: detect if the gateway process (launchd/systemd or foreground) is alive via PID check + port probe.
  • Port reachability: TCP connect to configured port (default 18790). Distinguishes “port closed” vs “port open but unhealthy”.
  • LLM failure rates: uses execution manager telemetry (error counts, throughput, p95 latency) to compute failure rate over last N minutes.
  • Error log scanning: reads state/errors-YYYY-MM-DD.jsonl and counts errors. Flags if rate exceeds threshold (>5 errors/min).
  • Pass/fail summary: aggregates all checks into a table with PASS/WARN/FAIL status. Exit code 0 = all pass, 1 = any fail.
  • Alert state with backoff: persists alert state in state/health_alerts.json and re-alerts on an escalating backoff schedule (1m → 5m → 15m → 1h → 4h) to prevent alert fatigue.
# Run full health check
clawmagic health

# Machine-readable output for agent consumption
clawmagic health --json

# Verbose output with per-check details
clawmagic health --verbose
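
The pass/fail summary described above can be sketched as a small aggregation step. This is only an illustration: the check names are hypothetical, and treating WARN as exit code 0 is an assumption, since the text only states that 0 means all pass and 1 means any fail.

```python
# Severity order used to pick the overall status.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def summarize(results: dict[str, str]) -> tuple[str, int]:
    """Return (overall status, exit code) for a mapping of check name -> status."""
    overall = max(results.values(), key=SEVERITY.__getitem__, default="PASS")
    # Exit 1 if any check failed, otherwise 0.
    # Assumption: a WARN on its own does not change the exit code.
    exit_code = 1 if overall == "FAIL" else 0
    return overall, exit_code


def render(results: dict[str, str]) -> str:
    """Format the results as a simple two-column table."""
    width = max((len(name) for name in results), default=0)
    return "\n".join(f"{name.ljust(width)}  {status}" for name, status in results.items())


# Hypothetical check names, for illustration only.
example = {"process": "PASS", "port": "PASS", "llm_failure_rate": "WARN"}
print(render(example))
print(summarize(example))
```

A wrapper script would pass the second element of the returned tuple to `sys.exit()` so that schedulers and agents can branch on the exit code.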

2. Cron Job Debugging (9 Items)

ClawMagic’s cron engine (cronScheduler.ts, 380+ lines) supports interval and expression schedules with 9 action types. The diagnostic toolkit adds CLI tools to query, debug, and manage cron jobs.

  • Job listing: clawmagic cron list — shows job name, schedule, next run, last status, consecutive errors. Color-coded: green = ok, yellow = backed off, red = stale/stuck.
  • History query: clawmagic cron history — filter by job name, status, date range. Backed by state/cron/history.jsonl (append-only, 10K line cap with rotation).
  • Persistent failure detection: time-windowed analysis for 3+ failures within any 6-hour window. Distinguishes “flaky” (intermittent) from “broken” (consecutive).
  • Manual trigger: clawmagic cron run <job> — enqueue a job immediately for testing.
  • Stale job cleanup: auto-marks stuck jobs as failed (2-hour threshold). Handles machine sleep and process crash edge cases.
  • Metrics summary: success rate %, avg/p95 duration, jobs by error frequency.
  • API routes: GET /api/cron/jobs, GET /api/cron/history, POST /api/cron/jobs/:id/run, POST /api/cron/cleanup for portal integration.
# List all cron jobs with status
clawmagic cron list

# Query history for a specific job
clawmagic cron history --job heartbeat --since 24h

# Manually trigger a job for debugging
clawmagic cron run heartbeat

# Clean up stale/stuck jobs
clawmagic cron cleanup

# View aggregate stats
clawmagic cron stats
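
The persistent failure detection rule (3+ failures within any 6-hour window) can be illustrated with a sliding-window check over a job's run history. The sketch below is not the cron engine's implementation. In particular, reading "broken" as "the most recent three runs all failed within one window" and "flaky" as any other persistent failure pattern is an assumption.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=6)  # "any 6-hour window" from the text
MIN_FAILURES = 3             # "3+ failures"


def classify(runs: list[tuple[datetime, bool]]) -> str:
    """Classify a job from (finished_at, succeeded) pairs as ok, flaky, or broken."""
    runs = sorted(runs)
    failures = [finished for finished, ok in runs if not ok]
    # Sliding window: any MIN_FAILURES failures that fit inside WINDOW.
    persistent = any(
        failures[i + MIN_FAILURES - 1] - failures[i] <= WINDOW
        for i in range(len(failures) - MIN_FAILURES + 1)
    )
    if not persistent:
        return "ok"
    # Assumption: "broken" means the latest runs failed back to back.
    tail = runs[-MIN_FAILURES:]
    consecutive = all(not ok for _, ok in tail) and tail[-1][0] - tail[0][0] <= WINDOW
    return "broken" if consecutive else "flaky"


start = datetime(2026, 3, 3, 10, 0)
history = [(start + timedelta(hours=h), ok) for h, ok in [(0, True), (1, False), (2, False), (3, False)]]
print(classify(history))
```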

3. Unified Log Viewer (9 Items)

ClawMagic already has structured JSONL logging across events, errors, audit, and security streams. The log viewer provides a single CLI command with powerful filtering.

  • Unified stream: merges state/events-*.jsonl + errors-*.jsonl + audit.jsonl + security_audit.jsonl by timestamp.
  • Event filter: --event "ai.model_choice" or --event "tool.*" (glob patterns). Comma-separated for multiple events.
  • Level filter: --level error or --level warn,error. Shortcut: --errors.
  • Content search: --grep "timeout" for substring matching. Context-field filters: --job, --agent, --plugin.
  • Time range: --since "1h" or --since "2026-03-03T10:00". Relative durations (1h, 30m, 2d) and ISO timestamps.
  • Live tail: --follow for real-time log streaming.
  • JSON output: --json for piping: clawmagic logs --errors --json | jq '.message'
  • Quick aliases: errors-1h, ai-failures, cron-errors, security for common queries.
  • API route: GET /api/logs with matching query params for portal log viewer.
# Show last 50 log entries
clawmagic logs

# Errors in the last hour
clawmagic logs --errors --since 1h

# Filter by event type
clawmagic logs --event "ai.model_choice" --since 30m

# Search logs for specific content
clawmagic logs --grep "timeout" --level error

# Live tail (follow mode)
clawmagic logs --follow

# Machine-readable for agent consumption
clawmagic logs --errors --json
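
To make the filter semantics concrete, here is a minimal sketch of how the event, level, content, and time-range filters could be applied to parsed JSONL entries. The field names `ts`, `event`, and `level` are assumptions (only `message` appears in the jq example above), and timestamps are assumed to be ISO strings in a single time zone.

```python
import re
from datetime import datetime, timedelta
from fnmatch import fnmatchcase

UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_since(value: str, now: datetime) -> datetime:
    """Turn "1h", "30m", "2d", or an ISO timestamp into a cutoff datetime."""
    match = re.fullmatch(r"(\d+)([mhd])", value)
    if match:
        return now - timedelta(**{UNITS[match.group(2)]: int(match.group(1))})
    return datetime.fromisoformat(value)


def matches(entry: dict, events=None, levels=None, since=None, grep=None) -> bool:
    """Return True if a log entry passes every filter that was supplied."""
    if events and not any(fnmatchcase(entry.get("event", ""), p) for p in events):
        return False  # glob patterns such as "tool.*"
    if levels and entry.get("level") not in levels:
        return False
    if since and datetime.fromisoformat(entry["ts"]) < since:
        return False
    if grep and grep not in entry.get("message", ""):
        return False  # plain substring match
    return True


now = datetime(2026, 3, 3, 12, 0)
entry = {"ts": "2026-03-03T11:30:00", "event": "tool.call", "level": "error", "message": "request timeout"}
print(matches(entry, events=["tool.*"], levels=["warn", "error"], since=parse_since("1h", now), grep="timeout"))
```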

4. Model + Provider Diagnostics (9 Items)

Know exactly what model is running, whether it’s healthy, and how much it’s costing. Provider infrastructure and diagnostic events already exist — these tools surface them.

  • Status command: clawmagic models status — current provider, model ID, auth profile, context window usage (tokens used / max), fallback chain order + health of each.
  • Provider auth health: verify API keys are valid without making a billed call. Flag expired, near-expiry (<7 days), or invalid format.
  • Canary test: clawmagic models canary — send a minimal test prompt to verify provider response, check model header matches expected provider, validate latency (<10s). Catches silent auth failures and model misrouting.
  • Fallback chain: show primary → secondary → tertiary providers with status. --test to simulate primary failure and verify fallback activates.
  • Usage dashboard: clawmagic usage — cost breakdown by model/provider (tokens, estimated $, request count), cron reliability (success rate, avg duration), storage sizes, API call counts by type.
# Show active model and provider status
clawmagic models status

# Run canary test on current provider
clawmagic models canary

# Test a specific provider
clawmagic models canary --provider anthropic

# View fallback chain
clawmagic models status --verbose

# Usage and cost dashboard
clawmagic usage --period 7d
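
The canary test's pass criteria (provider matches, latency under 10s, no silent auth failure) can be expressed as a small evaluation function. The sketch below assumes the canary has already been sent and its outcome captured. The `CanaryResult` fields are hypothetical, and mapping HTTP 401/403 to an auth failure is an assumption about how providers report bad credentials.

```python
from dataclasses import dataclass

MAX_LATENCY_S = 10.0  # "<10s" latency bound from the text


@dataclass
class CanaryResult:
    expected_provider: str
    reported_provider: str  # hypothetical: taken from a response header or field
    latency_s: float
    http_status: int


def evaluate(result: CanaryResult) -> list[str]:
    """Return a list of problems; an empty list means the canary passed."""
    problems = []
    if result.http_status in (401, 403):
        problems.append("auth failure")
    elif result.http_status != 200:
        problems.append(f"unexpected status {result.http_status}")
    if result.reported_provider != result.expected_provider:
        problems.append(
            f"model misrouting: expected {result.expected_provider}, got {result.reported_provider}"
        )
    if result.latency_s >= MAX_LATENCY_S:
        problems.append(f"slow response: {result.latency_s:.1f}s")
    return problems


print(evaluate(CanaryResult("anthropic", "anthropic", 1.2, 200)))
```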

Diagnostic API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /health | GET | Basic health status + stuck detection |
| /api/health/detail | GET | Detailed health: memory, event loop, plugins, sessions |
| /api/cron/jobs | GET | List all registered cron jobs with status |
| /api/cron/history | GET | Query cron run history (filter by job, status, date range) |
| /api/cron/jobs/:id/run | POST | Manually trigger a cron job |
| /api/cron/cleanup | POST | Clean up stale/stuck cron jobs |
| /api/logs | GET | Paginated log query (mirrors CLI filter params) |
| /api/providers | GET | List configured providers with status |
| /api/usage | GET | Aggregated usage: model costs, cron reliability, storage, API counts |
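
As a hedged sketch of how a client might address two of these routes, the snippet below builds (but does not send) requests with the standard library. The host `127.0.0.1`, the query parameter names `job` and `status`, and the absence of authentication are all assumptions; only the paths, methods, and default port 18790 come from this page.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:18790"  # default port from the text; host is an assumption


def cron_history_request(job=None, status=None) -> Request:
    """Build a GET request for cron run history with optional filters."""
    # Query parameter names are hypothetical.
    params = {k: v for k, v in {"job": job, "status": status}.items() if v is not None}
    url = f"{BASE_URL}/api/cron/history"
    if params:
        url += "?" + urlencode(params)
    return Request(url, method="GET")


def cron_run_request(job_id: str) -> Request:
    """Build a POST request that manually triggers one cron job."""
    return Request(f"{BASE_URL}/api/cron/jobs/{quote(job_id, safe='')}/run", data=b"", method="POST")


req = cron_history_request(job="heartbeat", status="failed")
print(req.get_method(), req.full_url)
```

Against a running instance, a request built this way could be sent with `urllib.request.urlopen(req)` and the response body decoded as JSON.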

Self-Healing Flow

When the agent detects an issue through diagnostic tools, it follows a structured remediation flow:

  1. Detection — Heartbeat or cron job runs clawmagic health --json. Any FAIL or WARN status triggers the remediation skill.
  2. Diagnosis — Agent reads detailed diagnostics (/api/health/detail), queries logs for related errors (clawmagic logs --errors --since 15m --json), and checks provider status (clawmagic models status).
  3. Remediation — Based on the diagnosis:
    • Stuck jobs: clawmagic cron cleanup
    • Provider down: switch to fallback chain
    • Stale state: restart affected subsystem
    • Config issue: clawmagic doctor --fix
  4. Verification — Re-run clawmagic health to confirm the fix worked. Log the remediation action and outcome.
  5. Escalation — If self-healing fails after 2 attempts, alert the user through the configured notification channel (portal, Slack, Discord, etc.) with a diagnostic summary.
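
The flow above can be sketched as a short remediate-verify-escalate loop. This is an illustrative outline, not ClawMagic's remediation skill: the issue labels are hypothetical, only the two remediations that map to a documented CLI command are included, and the command runner, health check, and notifier are injected as plain callables.

```python
from typing import Callable

MAX_ATTEMPTS = 2  # escalate after 2 failed attempts, per step 5

# Issue label -> remediation command, taken from step 3.
# The labels themselves are hypothetical.
REMEDIATIONS = {
    "stuck_jobs": ["clawmagic", "cron", "cleanup"],
    "config_issue": ["clawmagic", "doctor", "--fix"],
}


def self_heal(
    issue: str,
    run_command: Callable[[list[str]], None],
    is_healthy: Callable[[], bool],
    escalate: Callable[[str], None],
) -> bool:
    """Try the mapped remediation up to MAX_ATTEMPTS times, then escalate."""
    command = REMEDIATIONS.get(issue)
    if command is None:
        escalate(f"no automatic remediation for {issue}")
        return False
    for _ in range(MAX_ATTEMPTS):
        run_command(command)  # step 3: remediation
        if is_healthy():      # step 4: verification
            return True
    escalate(f"{issue}: still unhealthy after {MAX_ATTEMPTS} attempts")  # step 5
    return False
```

In a real deployment `run_command` might wrap `subprocess.run`, `is_healthy` might re-run clawmagic health and inspect its exit code, and `escalate` would post to the configured notification channel.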

Agent Self-Correction with External Agents

For complex issues beyond simple restarts, ClawMagic can delegate to an external coding agent for deep self-correction:

  • Claude Code (agent OAuth): connect via OAuth to give Claude Code access to ClawMagic’s diagnostic endpoints and state files. Claude Code can read health output, analyze log patterns, modify configuration, and verify fixes — all within a sandboxed session.
  • OpenAI Codex: similarly, Codex can ingest --json output from diagnostic commands and generate targeted fixes for configuration, routing rules, or plugin manifests.
  • Any agent-compatible tool: the diagnostic toolkit outputs machine-readable JSON by design. Any agent that can call CLI commands or HTTP endpoints can participate in the self-healing loop.

This approach keeps ClawMagic focused on runtime execution while leveraging purpose-built coding agents for the reasoning-heavy work of root-cause analysis and code-level fixes.

Existing Diagnostic Infrastructure

ClawMagic already has significant diagnostic infrastructure that the toolkit builds on:

| Component | File | What It Does |
| --- | --- | --- |
| /health | routes/health.ts | Status + stuck job detection |
| /api/health/detail | routes/health_detail.ts | Memory, event loop, plugins, sessions |
| clawmagic doctor | commands/doctor.ts | 15+ env checks, JSON output, --fix mode |
| Execution telemetry | executionManager.ts | p50/p95 latency, throughput, error counts |
| Structured logging | logger.ts | JSONL events/errors with redaction, getRecentLogs(), getDebugBundle() |
| Cron engine | cronScheduler.ts | 380+ lines, 9 action types, error backoff [30s, 60s, 5m, 15m, 60m], stale detection |
| DiagnosticsStream | diagnosticEvents.ts | ai.model_choice, tool.call, context_budget events |
| Provider factory | agentProviderFactory.ts | Agent-aware provider routing + fallback chains |
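
Both the cron error backoff listed above and the health alert backoff (1m, 5m, 15m, 1h, 4h) are fixed step schedules. A minimal sketch of looking up the next delay from such a schedule follows. Holding at the last step once the schedule is exhausted is an assumption, as is indexing by consecutive failure count.

```python
# Step schedules from this page, expressed in seconds.
CRON_ERROR_BACKOFF_S = [30, 60, 300, 900, 3600]  # 30s, 60s, 5m, 15m, 60m
ALERT_BACKOFF_S = [60, 300, 900, 3600, 14400]    # 1m, 5m, 15m, 1h, 4h


def backoff_delay(consecutive_failures: int, schedule: list[int]) -> int:
    """Return the delay in seconds after the given number of consecutive failures."""
    if consecutive_failures <= 0:
        return 0  # nothing has failed, so no delay
    # Assumption: once the schedule is exhausted, stay at the last (longest) step.
    index = min(consecutive_failures, len(schedule)) - 1
    return schedule[index]


for failures in range(0, 7):
    print(failures, backoff_delay(failures, CRON_ERROR_BACKOFF_S))
```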

State Files for Diagnostics

state/
  events-YYYY-MM-DD.jsonl      # Structured event log (daily rotation)
  errors-YYYY-MM-DD.jsonl      # Error log (daily rotation)
  audit.jsonl                  # Audit trail
  security_audit.jsonl         # Security event audit
  health_alerts.json           # Alert state with backoff timers
  diagnostics/                 # Runtime diagnostic snapshots
  cron/
    history.jsonl              # Cron run history (append-only, 10K cap)
    *.json                     # Per-job state files

All diagnostic data is local-first JSON/JSONL. No external database required. Files are human-readable and can be committed to version control for debugging.
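
Because the state files are plain JSONL, checks such as the error log scan (more than 5 errors per minute) need little more than a line-by-line parse. The sketch below counts errors per minute from an iterable of lines. The `ts` field name and its ISO format are assumptions about the log schema.

```python
import json
from collections import Counter
from typing import Iterable

THRESHOLD_PER_MIN = 5  # flag minutes with more than 5 errors


def errors_per_minute(lines: Iterable[str]) -> Counter:
    """Count log entries per minute, keyed by "YYYY-MM-DDTHH:MM"."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        ts = entry.get("ts") if isinstance(entry, dict) else None  # hypothetical field name
        if isinstance(ts, str):
            counts[ts[:16]] += 1
    return counts


def noisy_minutes(counts: Counter) -> list[str]:
    """Return the minutes whose error count exceeds the threshold."""
    return sorted(minute for minute, n in counts.items() if n > THRESHOLD_PER_MIN)
```

An open file handle works as the input, for example `with open("state/errors-2026-03-03.jsonl") as fh: noisy_minutes(errors_per_minute(fh))`, where the date in the file name is only an example.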