
OpenClaw: The Definitive Guide to Building 24/7 Agentic AI Workflows


AI agents that run once and quit are toys. The real revolution happens when agents run continuously - monitoring systems, analyzing data, responding to incidents, and orchestrating workflows around the clock without a human touching the keyboard. That's the promise of OpenClaw: an open-source framework purpose-built for running autonomous AI agents 24 hours a day, 7 days a week, as persistent daemon-mode services with health checks, auto-restart, observability, and multi-agent orchestration baked in from day one.

This guide is the definitive deep-dive. We'll cover everything from architecture internals and installation to building production-grade agents, multi-agent pipelines, LLM integration, security guardrails, Kubernetes deployment, and real-world examples. Whether you're deploying a single uptime monitor or orchestrating a fleet of 50 collaborating agents, this is your reference.

What you'll learn: OpenClaw architecture, agent lifecycle management, multi-agent orchestration patterns, state persistence, LLM routing with cost budgets, observability with Prometheus/Grafana, production Kubernetes deployment, and 5 real-world 24/7 agent examples with code.

1. What is OpenClaw?

OpenClaw is an open-source framework for running AI agents as persistent, autonomous services. It is not a chatbot framework. It is not a prompt chain library. It is infrastructure - the daemon layer that keeps your AI agents alive, healthy, and productive around the clock.

Think of it as systemd for AI agents. Just as systemd manages long-running Linux services with health checks, restart policies, and logging, OpenClaw manages long-running AI agents with the same operational rigor - plus first-class support for LLM integration, inter-agent communication, state persistence, and cost tracking.

Core Design Principles

  • Daemon-first: Agents are designed to run forever, not execute once. The default mode is a persistent loop with configurable schedules, not a one-shot script.
  • Operationally mature: Health checks (liveness + readiness), auto-restart with exponential backoff, structured logging, Prometheus metrics, and distributed tracing are built in - not bolted on.
  • LLM-native: First-class support for OpenAI, Anthropic, local models via Ollama, model fallback chains, token tracking, and per-agent cost budgets.
  • Multi-agent by default: Agents communicate via a message bus (Redis Streams or NATS), share state through a persistent store, and can be orchestrated in pipelines, fan-out/fan-in patterns, or event-driven topologies.
  • Cloud-native: Runs in Docker, Kubernetes, or as a standalone binary. Helm charts, HPA, rolling updates, and blue-green deployments are supported out of the box.

What OpenClaw Is NOT

  • It is not a prompt engineering library (use LangChain or LlamaIndex for that).
  • It is not a workflow orchestrator like Airflow (though it can replace Airflow for AI-driven workflows).
  • It is not a chatbot framework (use Rasa or Botpress for conversational UIs).
  • It is the runtime and infrastructure layer that sits beneath all of these, keeping your agents alive and observable 24/7.
Key differentiator: OpenClaw's daemon-mode agents come with health checks, auto-restart, resource isolation, and observability built in. You write the agent logic; OpenClaw handles everything else needed to keep it running in production around the clock.

2. Why 24/7 Agents?

Most AI agent frameworks assume a request-response model: a human asks something, the agent responds, done. But entire categories of valuable work require continuous, autonomous operation - work that happens at 3 AM on a Sunday, work that monitors streams of data in real time, work that must never stop.

Use Cases That Demand Always-On Agents

| Domain | Use Case | Why 24/7? |
|---|---|---|
| Security | Log analysis, threat detection, IP blocking | Attacks don't wait for business hours |
| Infrastructure | Auto-scaling, incident response, health monitoring | Downtime costs $5,600/minute on average |
| Market Analysis | Price tracking, competitor monitoring, sentiment analysis | Markets are global and never close |
| Content Moderation | Scanning user-generated content for policy violations | Content is posted around the clock |
| Customer Support | Ticket triage, auto-responses, escalation routing | Customers expect instant responses at any hour |
| Data Pipelines | ETL orchestration, data quality monitoring, schema drift detection | Data flows continuously from upstream sources |
| Compliance | Regulatory change monitoring, policy enforcement, audit trails | Regulations change without notice; violations are costly |
| Social Media | Brand monitoring, engagement tracking, crisis detection | Viral events happen in minutes, not days |

Cost Analysis: 24/7 Agent vs. Human Team

Let's do the math for a security monitoring operation that requires 24/7 coverage:

| Cost Factor | Human Team (3 shifts) | OpenClaw Agent Fleet |
|---|---|---|
| Personnel | 5 analysts × $95K/yr = $475K | $0 (automated) |
| Infrastructure | SIEM licenses: $50K/yr | 3-node K8s cluster: $8K/yr |
| LLM API costs | N/A | ~$18K/yr (GPT-4o with fallback) |
| Response time | 5-15 minutes (human triage) | <30 seconds (automated) |
| Coverage gaps | Shift handoffs, sick days, holidays | Zero (daemon-mode) |
| Total annual cost | ~$525K | ~$26K |

Important: AI agents augment human teams - they don't replace judgment. Use agents for triage, detection, and first-response automation. Keep humans in the loop for critical decisions, escalations, and novel threat analysis.
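
The totals in the comparison table reduce to simple arithmetic; here is a quick sanity check in Python, with the figures copied from the table above:

```python
# Annual figures (USD) copied from the comparison table above.
human_team = 5 * 95_000 + 50_000    # 5 analysts + SIEM licenses
agent_fleet = 8_000 + 18_000        # 3-node K8s cluster + LLM API spend

print(human_team)   # 525000
print(agent_fleet)  # 26000
```

Roughly a 20x difference in annual cost, before accounting for response-time and coverage gains.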

3. OpenClaw Architecture

OpenClaw's architecture is modular and designed for production workloads. Every component can be scaled independently, replaced with alternatives, or extended via plugins. Here's the complete system architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                       OpenClaw Control Plane                        │
│                                                                     │
│  ┌───────────────┐  ┌───────────────┐  ┌──────────────────────────┐ │
│  │  API Gateway  │  │   Scheduler   │  │ Supervisor (Process Mgr) │ │
│  │  REST/WS API  │  │  Cron+Events  │  │ Health │ Restart │ Stop  │ │
│  └───────┬───────┘  └───────┬───────┘  └────────────┬─────────────┘ │
│          │                  │                       │               │
│  ────────┴──────────────────┴───────────────────────┴────────────   │
│                         Internal Event Bus                          │
│  ────────────────────────────────────────────────────────────────   │
│          │                  │                       │               │
│  ┌───────┴───────┐  ┌───────┴───────┐  ┌────────────┴─────────────┐ │
│  │ Agent Runtime │  │ Agent Runtime │  │      Agent Runtime       │ │
│  │ ┌───────────┐ │  │ ┌───────────┐ │  │ ┌───────────┐            │ │
│  │ │  Agent A  │ │  │ │  Agent B  │ │  │ │  Agent C  │            │ │
│  │ │ (Python)  │ │  │ │ (Python)  │ │  │ │ (Python)  │            │ │
│  │ └───────────┘ │  │ └───────────┘ │  │ └───────────┘            │ │
│  │   Container   │  │   Container   │  │   Container              │ │
│  └───────────────┘  └───────────────┘  └──────────────────────────┘ │
│                                                                     │
└─────────────────────────────┬───────────────────────────────────────┘
                              │
             ┌────────────────┼────────────────┐
             │                │                │
    ┌────────┴──────┐ ┌───────┴───────┐ ┌──────┴────────┐
    │ Message Bus   │ │ State Store   │ │ Metrics/Traces│
    │ Redis Streams │ │ PostgreSQL    │ │ Prometheus +  │
    │ or NATS       │ │ + Redis       │ │ OpenTelemetry │
    └───────────────┘ └───────────────┘ └───────────────┘

Component Deep-Dive

Agent Runtime

Each agent runs inside an isolated container (or process, in standalone mode). The runtime provides:

  • Resource isolation: CPU and memory limits enforced via cgroups (Docker/K8s) or process-level ulimits.
  • Dependency isolation: Each agent has its own Python virtual environment and dependency set.
  • Lifecycle hooks: setup(), run(), teardown(), on_error() - the runtime calls these at the appropriate lifecycle stages.
  • Signal handling: SIGTERM triggers graceful shutdown; SIGKILL is a last resort after the grace period expires.
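
For intuition, the SIGTERM-then-grace-period contract can be sketched in plain Python. This is an illustrative stand-in, not OpenClaw's runtime code:

```python
import signal

class GracefulLoop:
    """Minimal daemon loop that exits cleanly on SIGTERM."""

    def __init__(self):
        self.running = True
        # SIGTERM flips the flag; the loop drains and exits on its own.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.running = False

    def run(self, do_work, max_cycles=1000):
        """Run do_work repeatedly until SIGTERM arrives or the cycle cap hits."""
        cycles = 0
        while self.running and cycles < max_cycles:
            do_work()  # the current cycle always finishes before exit
            cycles += 1
        return cycles
```

In a real deployment the supervisor sends SIGTERM, waits out the grace period, and only then escalates to SIGKILL.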

Supervisor

The Supervisor is OpenClaw's process manager - analogous to systemd or supervisord, but purpose-built for AI agents. It:

  • Monitors agent health via liveness and readiness probes.
  • Restarts crashed agents according to configurable restart policies (immediate, linear backoff, exponential backoff).
  • Handles graceful shutdown sequences - draining in-flight work before stopping.
  • Enforces resource limits and kills agents that exceed their memory or CPU budget.
  • Reports agent status to the API Gateway and emits lifecycle events to the message bus.

Message Bus

Inter-agent communication flows through a message bus. OpenClaw supports two backends:

  • Redis Streams (default): Lightweight, easy to operate, supports consumer groups for load balancing. Best for small-to-medium deployments (up to ~50 agents).
  • NATS JetStream: Higher throughput, built-in persistence, better for large-scale deployments (50+ agents) or when you need exactly-once delivery semantics.

Messages are typed events with JSON payloads. Agents publish events with await self.emit("event_name", payload) and subscribe with @on("event_name") decorators.
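
As a mental model, the emit/@on contract boils down to a small async dispatcher. The sketch below is illustrative plain Python, not OpenClaw's implementation (which routes events through Redis Streams or NATS):

```python
import asyncio
from collections import defaultdict

_handlers = defaultdict(list)  # event name -> list of subscribed coroutines
received = []

def on(event_name):
    """Subscribe the decorated coroutine to a named event."""
    def decorator(fn):
        _handlers[event_name].append(fn)
        return fn
    return decorator

async def emit(event_name, payload):
    """Deliver the JSON-style payload to every subscribed handler."""
    for handler in _handlers[event_name]:
        await handler(payload)

@on("data_collected")
async def analyze(payload):
    # Stand-in for an Analyzer agent picking up a Collector's batch.
    received.append(payload["count"])

asyncio.run(emit("data_collected", {"count": 3}))
# received is now [3]
```

The real bus adds durability and consumer groups on top of this shape, so handlers survive restarts and can be load-balanced across replicas.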

State Store

Agent state is persisted in PostgreSQL with an optional Redis cache layer for hot state. Features include:

  • Automatic checkpointing: State is saved at configurable intervals (default: every 60 seconds) and after every run() cycle.
  • Point-in-time recovery: State snapshots are versioned, allowing rollback to any previous checkpoint.
  • Shared state: Agents can read (but not write) other agents' state via namespaced keys.
  • TTL support: State keys can have expiration times for temporary data.
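
The checkpoint/rollback and TTL semantics can be modeled in a few lines. This is a toy in-memory stand-in; OpenClaw persists to PostgreSQL with a Redis cache:

```python
import time

class StateStore:
    """Toy state store illustrating TTL keys and versioned checkpoints."""

    def __init__(self):
        self._data = {}
        self._expiry = {}       # key -> monotonic deadline
        self._checkpoints = []  # versioned snapshots

    def set(self, key, value, ttl=None):
        self._data[key] = value
        if ttl is not None:
            self._expiry[key] = time.monotonic() + ttl

    def get(self, key, default=None):
        deadline = self._expiry.get(key)
        if deadline is not None and time.monotonic() >= deadline:
            self._data.pop(key, None)   # expired: drop lazily on read
            self._expiry.pop(key, None)
            return default
        return self._data.get(key, default)

    def checkpoint(self):
        """Snapshot current state; returns a version number for rollback."""
        self._checkpoints.append(dict(self._data))
        return len(self._checkpoints) - 1

    def rollback(self, version):
        """Restore a previous snapshot (point-in-time recovery)."""
        self._data = dict(self._checkpoints[version])
```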

Scheduler

The Scheduler supports two trigger modes:

  • Cron-like schedules: Schedule.every(minutes=5), Schedule.cron("0 */2 * * *"), Schedule.daily(hour=9, minute=0).
  • Event-driven triggers: Agents can be triggered by events from other agents, webhooks, or external systems via the API Gateway.
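
For intuition, matching the cron expressions used above ("0 */2 * * *", or Schedule.daily(hour=9) as "0 9 * * *") needs only a small field matcher. This standalone sketch handles just `*`, `*/step`, and plain numbers:

```python
def cron_matches(expr: str, minute: int, hour: int) -> bool:
    """Check the minute and hour fields of a 5-field cron expression."""
    def field_ok(field: str, value: int) -> bool:
        if field == "*":
            return True
        if field.startswith("*/"):          # e.g. */2 -> every 2 units
            return value % int(field[2:]) == 0
        return int(field) == value          # plain number

    minute_field, hour_field = expr.split()[:2]
    return field_ok(minute_field, minute) and field_ok(hour_field, hour)
```

So "0 */2 * * *" fires at minute 0 of every even hour, and "0 9 * * *" fires once a day at 09:00.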

API Gateway

The API Gateway exposes a REST and WebSocket API for external control:

  • GET /api/agents - List all agents and their status.
  • POST /api/agents/{name}/start - Start a stopped agent.
  • POST /api/agents/{name}/pause - Pause a running agent.
  • GET /api/agents/{name}/state - Read agent state.
  • GET /api/agents/{name}/logs - Stream agent logs via WebSocket.
  • POST /api/agents/{name}/trigger - Manually trigger an agent run.
  • GET /api/metrics - Prometheus-format metrics endpoint.
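
A thin client over these endpoints could look like the following sketch. The base URL matches the Docker Compose port mapping later in this guide; the response shapes are assumptions:

```python
import json
import urllib.request

class OpenClawClient:
    """Minimal stdlib REST client for the endpoints listed above (illustrative)."""

    def __init__(self, base_url="http://localhost:8400"):
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base_url}{path}"

    def list_agents(self):
        # GET /api/agents - list all agents and their status
        with urllib.request.urlopen(self._url("/api/agents")) as resp:
            return json.load(resp)

    def trigger(self, name):
        # POST /api/agents/{name}/trigger - manually trigger a run
        req = urllib.request.Request(
            self._url(f"/api/agents/{name}/trigger"), method="POST"
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status
```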

4. Installation & Setup

OpenClaw supports three deployment modes: Docker Compose (recommended for development and small production), Kubernetes with Helm (recommended for production at scale), and standalone binary (for single-machine deployments).

Option A: Docker Compose (Recommended Start)

The fastest way to get a full OpenClaw stack running. This docker-compose.yml includes all services:

# docker-compose.yml
version: "3.9"

services:
  openclaw-supervisor:
    image: openclaw/supervisor:1.4.0
    ports:
      - "8400:8400"   # API Gateway
      - "9090:9090"   # Prometheus metrics
    environment:
      - OPENCLAW_DB_URL=postgresql://openclaw:secret@postgres:5432/openclaw
      - OPENCLAW_REDIS_URL=redis://redis:6379/0
      - OPENCLAW_LOG_LEVEL=info
      - OPENCLAW_LOG_FORMAT=json
    volumes:
      - ./agents:/app/agents
      - ./openclaw.yaml:/app/openclaw.yaml
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: openclaw
      POSTGRES_USER: openclaw
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U openclaw"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redisdata:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/provisioning:/etc/grafana/provisioning

volumes:
  pgdata:
  redisdata:

Start the stack:

docker compose up -d
# Verify all services are healthy
docker compose ps
# Check the API Gateway
curl http://localhost:8400/api/health

Option B: Kubernetes with Helm

# Add the OpenClaw Helm repository
helm repo add openclaw https://charts.openclaw.dev
helm repo update

# Install with default values
helm install openclaw openclaw/openclaw \
  --namespace openclaw \
  --create-namespace \
  --set supervisor.replicas=2 \
  --set postgres.enabled=true \
  --set redis.enabled=true

# Verify the deployment
kubectl get pods -n openclaw

Option C: Standalone Binary

# Install via the install script
curl -fsSL https://get.openclaw.dev | sh

# Or via pip
pip install openclaw-cli

# Verify installation
openclaw version
# OpenClaw CLI v1.4.0 (runtime v1.4.0, Python 3.11+)

The OpenClaw CLI

The openclaw CLI is your primary interface for managing agents:

# Initialize a new project
openclaw init my-project
cd my-project

# Create an agent from a template
openclaw agent create --name monitor --template watchdog
openclaw agent create --name analyzer --template llm-processor
openclaw agent create --name reporter --template scheduled-report

# List available templates
openclaw templates list

# Run locally for development (hot-reload enabled)
openclaw dev

# Deploy to a remote OpenClaw cluster
openclaw deploy --env production

# Check agent status
openclaw status
# ┌──────────────────┬──────────┬────────┬───────────┬──────────────┐
# │ Agent            │ Status   │ Uptime │ Last Run  │ Success Rate │
# ├──────────────────┼──────────┼────────┼───────────┼──────────────┤
# │ monitor          │ running  │ 4d 12h │ 2m ago    │ 99.7%        │
# │ analyzer         │ running  │ 4d 12h │ 5m ago    │ 98.2%        │
# │ reporter         │ paused   │ -      │ 1h ago    │ 100%         │
# └──────────────────┴──────────┴────────┴───────────┴──────────────┘

# Stream logs from a specific agent
openclaw logs monitor --follow

# Trigger a manual run
openclaw trigger monitor

# Pause / resume an agent
openclaw pause reporter
openclaw resume reporter
Project structure: After openclaw init, your project will contain: openclaw.yaml (main config), agents/ (agent source code), tests/ (agent tests), docker-compose.yml (local stack), and .openclaw/ (CLI state).

5. Building Your First 24/7 Agent

Let's build a real-world agent: an uptime monitor that checks a list of URLs every 5 minutes, detects downtime, sends Slack alerts, and maintains a persistent uptime history that survives restarts.

Agent Configuration

First, define the agent in openclaw.yaml:

# openclaw.yaml
project: my-monitoring
version: "1.0"

agents:
  uptime-monitor:
    source: agents/uptime_monitor.py
    schedule: every 5m
    config:
      slack_webhook: ${SLACK_WEBHOOK_URL}
      targets:
        - url: https://api.example.com/health
          name: "Production API"
          expected_status: 200
          timeout: 10
        - url: https://app.example.com
          name: "Web Application"
          expected_status: 200
          timeout: 15
        - url: https://docs.example.com
          name: "Documentation Site"
          expected_status: 200
          timeout: 10
    restart_policy:
      max_retries: 10
      backoff: exponential
    resources:
      memory: 256Mi
      cpu: 0.25

Full Agent Implementation

# agents/uptime_monitor.py
import asyncio
from datetime import datetime, timezone
from openclaw import Agent, Schedule, Alert
from openclaw.tools import HttpClient, SlackNotifier

class UptimeMonitor(Agent):
    """24/7 website uptime monitoring agent with Slack alerting."""

    name = "uptime-monitor"
    schedule = Schedule.every(minutes=5)

    async def setup(self):
        """Initialize HTTP client, Slack notifier, and load targets."""
        self.http = HttpClient(
            timeout=self.config.get("default_timeout", 15),
            retries=2,
            retry_delay=3,
        )
        self.slack = SlackNotifier(
            webhook_url=self.config.slack_webhook
        )
        self.targets = self.config.get("targets", [])
        self.logger.info(
            f"Uptime monitor initialized with {len(self.targets)} targets"
        )

    async def run(self):
        """Check all targets and process results."""
        results = await asyncio.gather(
            *[self.check_target(t) for t in self.targets],
            return_exceptions=True,
        )

        # Load persistent history from state store
        history = await self.state.get("check_history", [])
        timestamp = datetime.now(timezone.utc).isoformat()

        for target, result in zip(self.targets, results):
            if isinstance(result, Exception):
                result = {
                    "status": "error",
                    "error": str(result),
                    "latency_ms": None,
                }

            entry = {
                "target": target["name"],
                "url": target["url"],
                "timestamp": timestamp,
                **result,
            }
            history.append(entry)

            # Alert on downtime
            if result["status"] != "healthy":
                await self.handle_downtime(target, result)
            else:
                # Clear incident state if target recovered
                incident_key = f"incident_{target['name']}"
                was_down = await self.state.get(incident_key, False)
                if was_down:
                    await self.slack.send(
                        f"✅ *{target['name']}* is back UP "
                        f"(latency: {result['latency_ms']}ms)"
                    )
                    await self.state.set(incident_key, False)

        # Keep last 10,000 entries, persist to state store
        await self.state.set("check_history", history[-10000:])

        # Update summary metrics
        healthy = sum(1 for r in results if not isinstance(r, Exception) and r["status"] == "healthy")
        self.metrics.gauge("targets_healthy", healthy)
        self.metrics.gauge("targets_total", len(self.targets))

    async def check_target(self, target: dict) -> dict:
        """Check a single target URL and return status."""
        timeout = target.get("timeout", 15)
        expected = target.get("expected_status", 200)

        start = asyncio.get_running_loop().time()
        try:
            response = await self.http.get(
                target["url"], timeout=timeout
            )
            latency = int(
                (asyncio.get_running_loop().time() - start) * 1000
            )

            if response.status_code == expected:
                return {
                    "status": "healthy",
                    "http_status": response.status_code,
                    "latency_ms": latency,
                }
            else:
                return {
                    "status": "unhealthy",
                    "http_status": response.status_code,
                    "expected_status": expected,
                    "latency_ms": latency,
                }
        except Exception as e:
            return {
                "status": "error",
                "error": str(e),
                "latency_ms": None,
            }

    async def handle_downtime(self, target: dict, result: dict):
        """Send alert and track incident state."""
        incident_key = f"incident_{target['name']}"
        already_alerted = await self.state.get(incident_key, False)

        if not already_alerted:
            status = result.get("http_status", "N/A")
            error = result.get("error", "unexpected status")
            await self.slack.send(
                f"🚨 *{target['name']}* is DOWN!\n"
                f"URL: {target['url']}\n"
                f"Status: {status}\n"
                f"Error: {error}"
            )
            await self.state.set(incident_key, True)
            self.logger.warning(
                f"Target {target['name']} is down: {result}"
            )

    async def teardown(self):
        """Cleanup on graceful shutdown."""
        await self.http.close()
        self.logger.info("Uptime monitor shut down gracefully")
Key features in this agent: Persistent state that survives restarts (self.state), built-in metrics (self.metrics), structured logging (self.logger), concurrent target checking with asyncio.gather, incident tracking to avoid alert storms, and recovery notifications.

6. Agent Lifecycle Management

Understanding the agent lifecycle is critical for building reliable 24/7 systems. OpenClaw agents transition through well-defined states, and the Supervisor manages these transitions automatically.

Agent State Machine

                    ┌──────────────┐
                    │ INITIALIZING │
                    │   setup()    │
                    └──────┬───────┘
                           │ success
                           ▼
              ┌───── RUNNING ◄─────┐
              │      run() loop    │
              │                    │ resume
    pause     │    ┌─────────┐     │
              ├───►│ PAUSED  ├─────┘
              │    └─────────┘
              │
    error     │    ┌─────────┐  retry (backoff)
              ├───►│  ERROR  ├───────────────►  RUNNING
              │    └────┬────┘
              │         │ max retries exceeded
              │         ▼
              │    ┌───────────┐
              └───►│  STOPPED  │
     stop          │ teardown()│
                   └───────────┘

Health Checks

OpenClaw implements two types of health checks, inspired by Kubernetes:

  • Liveness probe: "Is the agent process alive?" - Checks that the agent's event loop is responsive. If the liveness probe fails, the Supervisor kills and restarts the agent.
  • Readiness probe: "Is the agent ready to do work?" - Checks that dependencies (database, APIs, etc.) are available. If the readiness probe fails, the agent is temporarily removed from event subscriptions but not restarted.

You can define custom health checks:

class MyAgent(Agent):
    async def liveness_check(self) -> bool:
        """Return True if the agent is alive and responsive."""
        return self.event_loop_responsive()

    async def readiness_check(self) -> bool:
        """Return True if all dependencies are available."""
        try:
            await self.db.ping()
            await self.http.get("https://api.openai.com/v1/models", timeout=5)
            return True
        except Exception:
            return False

Complete Lifecycle Configuration

# openclaw.yaml - Agent lifecycle configuration
agents:
  uptime-monitor:
    source: agents/uptime_monitor.py
    replicas: 1

    restart_policy:
      max_retries: 5
      backoff: exponential       # immediate | linear | exponential
      initial_delay: 5s          # first retry after 5 seconds
      max_delay: 300s            # cap backoff at 5 minutes
      reset_after: 3600s         # reset retry counter after 1h of healthy operation

    health_check:
      liveness:
        interval: 30s
        timeout: 10s
        unhealthy_threshold: 3   # 3 consecutive failures = restart
      readiness:
        interval: 15s
        timeout: 5s
        unhealthy_threshold: 2   # 2 failures = remove from event bus

    resources:
      cpu: 0.5                   # 0.5 CPU cores
      memory: 512Mi              # 512 MB RAM
      api_calls_per_minute: 60   # rate limit for external API calls
      max_state_size: 100Mi      # max persistent state size

    graceful_shutdown:
      timeout: 30s               # time to finish in-flight work
      drain_events: true         # stop accepting new events before shutdown

    logging:
      level: info                # debug | info | warning | error
      format: json               # json | text
      retention: 7d              # keep logs for 7 days
Exponential backoff matters: Without backoff, a crashing agent will restart in a tight loop, consuming resources and flooding logs. OpenClaw's default exponential backoff starts at 5 seconds and doubles each retry: 5s → 10s → 20s → 40s → 80s → 160s → 300s (capped). After 1 hour of healthy operation, the retry counter resets.
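
That schedule is easy to reproduce; here is a standalone sketch of the doubling-with-cap rule:

```python
def backoff_delays(initial=5, max_delay=300, retries=7):
    """Exponential backoff: double each retry, capped at max_delay seconds."""
    delays, delay = [], initial
    for _ in range(retries):
        delays.append(min(delay, max_delay))
        delay *= 2
    return delays

# backoff_delays() -> [5, 10, 20, 40, 80, 160, 300]
```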

7. Multi-Agent Orchestration

Single agents are useful. Fleets of collaborating agents are transformative. OpenClaw provides four orchestration patterns for multi-agent systems, all built on top of the message bus.

Orchestration Patterns

Pattern 1: Pipeline

Agent A's output feeds Agent B, which feeds Agent C. Linear, sequential processing.

  [Collector] ──► [Analyzer] ──► [Reporter] ──► [Alerter]
     emit()        @on()          @on()          @on()

Pattern 2: Fan-Out / Fan-In

One agent distributes work to multiple workers; an aggregator collects results.

                  ┌─► [Worker 1] ─┐
  [Dispatcher] ───┼─► [Worker 2] ─┼──► [Aggregator]
                  └─► [Worker 3] ─┘
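
In plain asyncio, fan-out/fan-in is a work queue plus an aggregation step. This standalone sketch (not OpenClaw API) doubles each item across three workers and collects the results:

```python
import asyncio

async def fan_out_fan_in(items, n_workers=3):
    """Dispatcher puts work on a queue, workers process it concurrently,
    and the results list plays the aggregator."""
    work: asyncio.Queue = asyncio.Queue()
    results = []

    async def worker():
        while True:
            item = await work.get()
            if item is None:              # poison pill: shut this worker down
                work.task_done()
                return
            results.append(item * 2)      # stand-in for real processing
            work.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(n_workers)]
    for item in items:                    # fan-out
        await work.put(item)
    for _ in range(n_workers):            # one pill per worker
        await work.put(None)
    await work.join()
    await asyncio.gather(*workers)
    return sorted(results)                # fan-in: aggregate in stable order
```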

Pattern 3: Supervisor

A meta-agent that monitors and manages other agents - spawning, stopping, or reconfiguring them based on conditions.

Pattern 4: Event-Driven

Agents react to events from other agents, external webhooks, or system events. No fixed topology - agents subscribe to events they care about.

Complete Multi-Agent Pipeline Example

Here's a full 4-agent pipeline: Data Collector → Analyzer → Reporter → Alerter.

# agents/data_collector.py
from openclaw import Agent, Schedule

class DataCollector(Agent):
    """Collects data from multiple sources every 10 minutes."""
    name = "data-collector"
    schedule = Schedule.every(minutes=10)

    async def setup(self):
        self.sources = self.config.get("data_sources", [])

    async def run(self):
        collected = []
        for source in self.sources:
            data = await self.http.get(source["url"])
            collected.append({
                "source": source["name"],
                "data": data.json(),
                "collected_at": self.now_iso(),
            })

        # Emit to the pipeline - Analyzer will pick this up
        await self.emit("data_collected", {
            "batch_id": self.generate_id(),
            "records": collected,
            "count": len(collected),
        })
        self.logger.info(f"Collected {len(collected)} records")
# agents/analyzer.py
from openclaw import Agent, on

class Analyzer(Agent):
    """Analyzes collected data for anomalies and trends."""
    name = "analyzer"

    @on("data_collected")
    async def analyze(self, event):
        records = event.data["records"]
        batch_id = event.data["batch_id"]

        analysis = {
            "batch_id": batch_id,
            "anomalies": [],
            "trends": [],
            "summary": {},
        }

        # Load historical baselines from persistent state
        baselines = await self.state.get("baselines", {})

        for record in records:
            source = record["source"]
            values = record["data"].get("metrics", {})

            baseline = baselines.get(source, {})
            for metric, value in values.items():
                avg = baseline.get(metric, {}).get("avg", value)
                stddev = baseline.get(metric, {}).get("stddev", 0)

                # Flag anomalies beyond 2 standard deviations
                if stddev > 0 and abs(value - avg) > 2 * stddev:
                    analysis["anomalies"].append({
                        "source": source,
                        "metric": metric,
                        "value": value,
                        "expected": avg,
                        "deviation": round((value - avg) / stddev, 2),
                    })

            # Update rolling baselines
            self.update_baseline(baselines, source, values)

        await self.state.set("baselines", baselines)
        await self.emit("analysis_complete", analysis)

    def update_baseline(self, baselines, source, values):
        """Update rolling average and stddev for each metric."""
        if source not in baselines:
            baselines[source] = {}
        for metric, value in values.items():
            b = baselines[source].get(metric, {"avg": value, "stddev": 0, "n": 0})
            n = b["n"] + 1
            old_avg = b["avg"]
            new_avg = old_avg + (value - old_avg) / n
            b["stddev"] = ((b["stddev"] ** 2 * (n - 1) + (value - old_avg) * (value - new_avg)) / n) ** 0.5
            b["avg"] = new_avg
            b["n"] = n
            baselines[source][metric] = b
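
The update_baseline method above is Welford's online algorithm for a running mean and population standard deviation. Here is a standalone replica of the same per-metric update, checked against the stdlib:

```python
import statistics

def welford_update(b, value):
    """Same rolling avg/stddev update as update_baseline, for one metric."""
    n = b["n"] + 1
    old_avg = b["avg"]
    new_avg = old_avg + (value - old_avg) / n
    # Reconstruct M2 from the previous population variance, then extend it.
    m2 = b["stddev"] ** 2 * (n - 1) + (value - old_avg) * (value - new_avg)
    return {"avg": new_avg, "stddev": (m2 / n) ** 0.5, "n": n}

values = [1, 2, 3, 4]
b = {"avg": values[0], "stddev": 0, "n": 0}
for v in values:
    b = welford_update(b, v)

# b["avg"] == 2.5 and b["stddev"] matches statistics.pstdev(values)
```

Running the update online like this avoids storing raw samples, which is why the Analyzer can keep baselines for many sources in a small persistent state.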
# agents/reporter.py
from openclaw import Agent, on

class Reporter(Agent):
    """Generates human-readable reports from analysis results."""
    name = "reporter"

    @on("analysis_complete")
    async def report(self, event):
        analysis = event.data
        anomaly_count = len(analysis["anomalies"])

        report = {
            "batch_id": analysis["batch_id"],
            "generated_at": self.now_iso(),
            "anomaly_count": anomaly_count,
            "severity": (
                "critical" if anomaly_count > 5
                else "warning" if anomaly_count > 0
                else "normal"
            ),
            "details": analysis["anomalies"],
        }

        # Use LLM to generate a natural-language summary
        if anomaly_count > 0:
            summary = await self.llm.complete(
                f"Summarize these anomalies in 2-3 sentences for an ops team: "
                f"{analysis['anomalies']}"
            )
            report["summary"] = summary

        await self.emit("report_ready", report)
        self.logger.info(f"Report generated: {anomaly_count} anomalies")
# agents/alerter.py
from openclaw import Agent, on
from openclaw.tools import SlackNotifier, PagerDutyClient

class Alerter(Agent):
    """Routes alerts based on severity."""
    name = "alerter"

    async def setup(self):
        self.slack = SlackNotifier(webhook_url=self.config.slack_webhook)
        self.pagerduty = PagerDutyClient(api_key=self.config.pagerduty_key)

    @on("report_ready")
    async def alert(self, event):
        report = event.data
        severity = report["severity"]

        if severity == "normal":
            return  # No alert needed

        if severity == "warning":
            await self.slack.send(
                f"⚠️ *{report['anomaly_count']} anomalies detected*\n"
                f"{report.get('summary', 'See dashboard for details.')}"
            )

        elif severity == "critical":
            # Slack + PagerDuty for critical alerts
            await self.slack.send(
                f"🚨 *CRITICAL: {report['anomaly_count']} anomalies!*\n"
                f"{report.get('summary', 'Immediate attention required.')}"
            )
            await self.pagerduty.trigger_incident(
                title=f"Critical anomalies: {report['anomaly_count']}",
                severity="critical",
                details=report["details"],
            )

Pipeline Configuration

# openclaw.yaml - Multi-agent pipeline
agents:
  data-collector:
    source: agents/data_collector.py
    schedule: every 10m
    config:
      data_sources:
        - name: "production-metrics"
          url: https://metrics.internal/api/v1/query
        - name: "sales-data"
          url: https://sales-api.internal/daily

  analyzer:
    source: agents/analyzer.py
    # Event-driven - triggered by data-collector
    config: {}

  reporter:
    source: agents/reporter.py
    # Event-driven - triggered by analyzer
    config: {}

  alerter:
    source: agents/alerter.py
    config:
      slack_webhook: ${SLACK_WEBHOOK_URL}
      pagerduty_key: ${PAGERDUTY_API_KEY}

8. State Management & Persistence

Stateless agents are simple but limited. Real-world 24/7 agents need to remember what happened across restarts, track trends over time, and share context with other agents. OpenClaw's state management system provides all of this.

How State Works

  • Automatic persistence: Agent state is stored in PostgreSQL and cached in Redis. Writes go to both; reads hit Redis first.
  • Checkpointing: State is automatically checkpointed after every run() cycle and at configurable intervals.
  • Versioning: Every state change creates a new version. You can roll back to any previous checkpoint.
  • Namespacing: Each agent has its own state namespace. Agents can read (but not write) other agents' state.
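The write-both/read-cache-first behavior from the first bullet is essentially a write-through cache with a durable fallback. A minimal standalone sketch — plain dicts standing in for Redis and PostgreSQL, not OpenClaw's actual implementation:

```python
class CachedStateStore:
    """Write-through state store: writes go to both tiers, reads hit cache first."""

    def __init__(self):
        self.cache = {}    # stands in for Redis
        self.durable = {}  # stands in for PostgreSQL

    def set(self, key, value):
        self.durable[key] = value  # durable write first, then cache
        self.cache[key] = value

    def get(self, key, default=None):
        if key in self.cache:
            return self.cache[key]
        if key in self.durable:
            # Cache miss (e.g. after a Redis restart): repopulate from durable tier
            self.cache[key] = self.durable[key]
            return self.cache[key]
        return default

store = CachedStateStore()
store.set("price_history", [1, 2, 3])
store.cache.clear()  # simulate a cache flush
recovered = store.get("price_history")
# recovered == [1, 2, 3], served from the durable tier and re-cached
```

Writing durable-first means a crash between the two writes leaves a stale cache entry (which the next miss repairs) rather than a durable store missing data.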

State API

class TrendAnalyzer(Agent):
    """Stateful agent that tracks price trends across restarts."""
    name = "trend-analyzer"
    schedule = Schedule.every(minutes=15)

    async def run(self):
        # State persists across restarts - no data loss
        history = await self.state.get("price_history", [])
        current = await self.fetch_prices()
        history.append({
            "timestamp": self.now_iso(),
            "prices": current,
        })

        # Keep last 1000 entries (automatic size management)
        await self.state.set("price_history", history[-1000:])

        # Detect anomalies using historical data
        if self.detect_anomaly(history):
            await self.emit("anomaly_detected", {
                "type": "price_spike",
                "current": current,
                "baseline": self.calculate_baseline(history),
                "deviation_pct": self.calculate_deviation(history, current),
            })

        # Update summary metrics
        self.metrics.gauge("history_size", len(history))
        self.metrics.gauge("latest_price", current.get("avg", 0))

    async def fetch_prices(self) -> dict:
        resp = await self.http.get("https://api.exchange.com/v1/prices")
        return resp.json()

    def detect_anomaly(self, history: list) -> bool:
        if len(history) < 11:
            return False
        # Baseline from the 10 points *before* the latest, so a spike
        # doesn't inflate its own average and stddev
        window = [h["prices"].get("avg", 0) for h in history[-11:-1]]
        avg = sum(window) / len(window)
        stddev = (sum((x - avg) ** 2 for x in window) / len(window)) ** 0.5
        latest = history[-1]["prices"].get("avg", 0)
        return stddev > 0 and abs(latest - avg) > 2.5 * stddev

    def calculate_baseline(self, history: list) -> float:
        prices = [h["prices"].get("avg", 0) for h in history[-100:]]
        return sum(prices) / len(prices) if prices else 0.0

    def calculate_deviation(self, history: list, current: dict) -> float:
        baseline = self.calculate_baseline(history)
        if baseline == 0:
            return 0.0
        return round((current.get("avg", 0) - baseline) / baseline * 100, 2)

Shared State Between Agents

# Agent A writes to its own state
await self.state.set("latest_analysis", analysis_result)

# Agent B reads Agent A's state (read-only cross-agent access)
analyzer_state = await self.state.read_from("analyzer")
latest = analyzer_state.get("latest_analysis", {})

# Shared state namespace (writable by any agent with access)
await self.shared_state.set("global_config", {"threshold": 0.95})
config = await self.shared_state.get("global_config")

State Snapshots & Rollback

# CLI commands for state management
openclaw state list uptime-monitor
# ┌─────────┬──────────────────────┬──────────┬──────┐
# │ Version │ Timestamp            │ Size     │ Keys │
# ├─────────┼──────────────────────┼──────────┼──────┤
# │ v142    │ 2026-04-18T14:30:00Z │ 2.3 MB   │ 4    │
# │ v141    │ 2026-04-18T14:25:00Z │ 2.3 MB   │ 4    │
# │ v140    │ 2026-04-18T14:20:00Z │ 2.2 MB   │ 4    │
# └─────────┴──────────────────────┴──────────┴──────┘

# Rollback to a previous version
openclaw state rollback uptime-monitor --version v140

# Export state for debugging
openclaw state export uptime-monitor --output state.json
State size limits: By default, each agent's state is capped at 100 MB. For agents that accumulate large datasets, use the max_state_size config option and implement pruning logic in your agent (e.g., keep only the last N entries). For truly large datasets, use an external database and store only references in agent state.

9. Observability & Monitoring

You can't run agents 24/7 if you can't see what they're doing. OpenClaw ships with comprehensive observability: Prometheus metrics, Grafana dashboards, structured JSON logging, and distributed tracing via OpenTelemetry.

Built-in Prometheus Metrics

Every agent automatically exposes these metrics at /api/metrics:

| Metric | Type | Description |
| --- | --- | --- |
| openclaw_agent_uptime_seconds | Gauge | Time since agent last started |
| openclaw_agent_runs_total | Counter | Total number of run() invocations |
| openclaw_agent_runs_failed_total | Counter | Number of failed run() invocations |
| openclaw_agent_run_duration_seconds | Histogram | Duration of each run() cycle (p50, p95, p99) |
| openclaw_agent_state_size_bytes | Gauge | Current size of agent persistent state |
| openclaw_llm_tokens_total | Counter | Total LLM tokens consumed (by model) |
| openclaw_llm_cost_usd_total | Counter | Cumulative LLM API cost in USD |
| openclaw_llm_latency_seconds | Histogram | LLM API call latency |
| openclaw_events_emitted_total | Counter | Events published to the message bus |
| openclaw_events_consumed_total | Counter | Events consumed from the message bus |
| openclaw_restarts_total | Counter | Number of agent restarts |
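These counters compose into standard reliability math: a 24-hour run success rate is one minus the ratio of the failure counter's delta to the run counter's delta over that window. With illustrative values:

```python
# Counter deltas over a 24-hour window (illustrative values)
runs_total_delta = 2880   # e.g. an agent scheduled every 30 seconds
runs_failed_delta = 12

success_rate = 1 - runs_failed_delta / runs_total_delta
# 1 - 12/2880 = 0.99583... -> about 99.58% of runs succeeded
```

This is the same calculation the dashboard's success-rate gauge performs with PromQL `rate()` functions.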

Grafana Dashboard Configuration

{
  "dashboard": {
    "title": "OpenClaw Agent Fleet",
    "panels": [
      {
        "title": "Agent Status Overview",
        "type": "stat",
        "targets": [{
          "expr": "count by (status) (openclaw_agent_uptime_seconds > 0)",
          "legendFormat": "{{status}}"
        }]
      },
      {
        "title": "Run Success Rate (24h)",
        "type": "gauge",
        "targets": [{
          "expr": "1 - (rate(openclaw_agent_runs_failed_total[24h]) / rate(openclaw_agent_runs_total[24h]))",
          "legendFormat": "{{agent}}"
        }],
        "thresholds": [
          {"value": null, "color": "red"},
          {"value": 0.95, "color": "yellow"},
          {"value": 0.99, "color": "green"}
        ]
      },
      {
        "title": "LLM Cost per Agent (Daily)",
        "type": "timeseries",
        "targets": [{
          "expr": "increase(openclaw_llm_cost_usd_total[24h])",
          "legendFormat": "{{agent}} ({{model}})"
        }]
      },
      {
        "title": "Run Duration (p95)",
        "type": "timeseries",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(openclaw_agent_run_duration_seconds_bucket[5m]))",
          "legendFormat": "{{agent}}"
        }]
      }
    ]
  }
}

Alerting Rules

# prometheus-alerts.yml
groups:
  - name: openclaw-agents
    rules:
      - alert: AgentDown
        expr: openclaw_agent_uptime_seconds == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} is down"
          description: "Agent has been down for more than 2 minutes."

      - alert: HighFailureRate
        expr: |
          rate(openclaw_agent_runs_failed_total[15m])
          / rate(openclaw_agent_runs_total[15m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} failure rate > 10%"

      - alert: LLMBudgetExceeded
        expr: increase(openclaw_llm_cost_usd_total[24h]) > 50
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} LLM spend > $50/day"

      - alert: AgentRestartLoop
        expr: increase(openclaw_restarts_total[1h]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} restarted 5+ times in 1 hour"
OpenTelemetry tracing: Enable distributed tracing by setting OPENCLAW_OTEL_ENDPOINT in your environment. Every run() cycle, LLM call, event emission, and state operation creates a span. Traces flow through Jaeger, Zipkin, or any OTLP-compatible backend.

10. LLM Integration

OpenClaw is LLM-native - it treats language models as first-class infrastructure, not an afterthought. Every agent has access to an LLM router that handles provider selection, fallback chains, cost tracking, rate limiting, and token budgets.

LLM Router Configuration

from openclaw import Agent
from openclaw.llm import LLMRouter, Budget, RateLimitError  # TimeoutError is the Python builtin

# Configure the LLM router with fallback chain
router = LLMRouter(
    primary="gpt-4o",
    fallback=["claude-3.5-sonnet", "llama-3.1-70b"],
    budget=Budget(
        daily_limit_usd=50.0,
        monthly_limit_usd=1200.0,
        alert_at_pct=80,           # Alert when 80% of budget consumed
    ),
    retry_on=[RateLimitError, TimeoutError],
    max_retries=3,
    timeout=30,
)

# Use in an agent
class SmartAnalyzer(Agent):
    async def setup(self):
        self.llm = LLMRouter(
            primary="gpt-4o",
            fallback=["claude-3.5-sonnet", "llama-3.1-70b"],
            budget=Budget(daily_limit_usd=50.0),
            retry_on=[RateLimitError, TimeoutError],
        )

    async def run(self):
        data = await self.fetch_data()

        # The router automatically handles:
        # 1. Try gpt-4o first
        # 2. On rate limit or timeout, fall back to claude-3.5-sonnet
        # 3. If that fails too, fall back to llama-3.1-70b (local)
        # 4. Track tokens and cost for each call
        # 5. Reject calls if daily budget is exceeded
        analysis = await self.llm.complete(
            messages=[
                {"role": "system", "content": "You are a data analyst."},
                {"role": "user", "content": f"Analyze this data: {data}"},
            ],
            temperature=0.3,
            max_tokens=2000,
        )

        self.logger.info(
            f"LLM call: model={analysis.model}, "
            f"tokens={analysis.usage.total_tokens}, "
            f"cost=${analysis.usage.cost_usd:.4f}, "
            f"latency={analysis.latency_ms}ms"
        )
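At its core, the router's failover is ordered retry logic over a provider chain. Here is a standalone sketch of that behavior with stub provider callables — an illustration of the pattern, not OpenClaw's actual internals:

```python
class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit error."""

def complete_with_fallback(providers, prompt):
    """Try each (name, callable) provider in order; return the first success."""
    last_err = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except (RateLimitError, TimeoutError) as err:
            last_err = err  # recoverable: fall through to the next provider
    raise RuntimeError("all providers failed") from last_err

def flaky_primary(prompt):       # simulates a rate-limited primary model
    raise RateLimitError("429 Too Many Requests")

def healthy_fallback(prompt):    # simulates the fallback model succeeding
    return f"analysis of: {prompt}"

model, result = complete_with_fallback(
    [("gpt-4o", flaky_primary), ("claude-3.5-sonnet", healthy_fallback)],
    "q3 revenue data",
)
# model == "claude-3.5-sonnet"
```

Only the configured recoverable errors trigger fallback; anything else (e.g. an authentication error) propagates immediately, which is usually what you want.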

YAML-Based LLM Configuration

# openclaw.yaml - LLM configuration
llm:
  providers:
    openai:
      api_key: ${OPENAI_API_KEY}
      models: ["gpt-4o", "gpt-4o-mini"]
      rate_limit: 500/min

    anthropic:
      api_key: ${ANTHROPIC_API_KEY}
      models: ["claude-3.5-sonnet", "claude-3-haiku"]
      rate_limit: 300/min

    ollama:
      base_url: http://ollama:11434
      models: ["llama-3.1-70b", "mistral-7b"]
      # No rate limit for local models

  defaults:
    primary: gpt-4o
    fallback: [claude-3.5-sonnet, llama-3.1-70b]
    timeout: 30s
    max_retries: 3

  budgets:
    global:
      daily_limit_usd: 200.0
      monthly_limit_usd: 5000.0
    per_agent:
      smart-analyzer:
        daily_limit_usd: 50.0
      content-generator:
        daily_limit_usd: 30.0
Cost control is critical for 24/7 agents: An agent running every 5 minutes with GPT-4o can easily spend $100+/day. Always set per-agent daily budgets. Use gpt-4o-mini or local models for high-frequency, low-complexity tasks. Reserve GPT-4o/Claude for tasks that genuinely need advanced reasoning.
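To see how a figure like $100+/day arises, here is the arithmetic with illustrative token counts and per-token prices — these numbers are assumptions for the calculation, not current provider pricing:

```python
# Assumed figures for illustration only -- check your provider's current pricing
runs_per_day = 24 * 60 // 5                     # one run every 5 minutes = 288/day
input_tokens, output_tokens = 100_000, 10_000   # assumed large-context run
usd_per_input_tok = 2.50 / 1_000_000
usd_per_output_tok = 10.00 / 1_000_000

cost_per_run = input_tokens * usd_per_input_tok + output_tokens * usd_per_output_tok
daily_cost = runs_per_day * cost_per_run
# 288 runs/day at about $0.35/run is roughly $100/day
```

The multiplier that hurts is run frequency: the same prompt at an hourly schedule would cost a twelfth as much.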

11. Security & Guardrails

Autonomous agents running 24/7 without human oversight need strong guardrails. A misconfigured agent can leak data, burn through API budgets, or take destructive actions. OpenClaw provides multiple layers of security.

Security Architecture

  • Agent sandboxing: Each agent runs in an isolated container with a read-only filesystem (except for designated temp directories). Network access is restricted to an allowlist.
  • Secret management: Secrets are injected via environment variables from a secrets backend (Vault, AWS Secrets Manager, or encrypted .env files). Secrets are never stored in agent state or logs.
  • Output validation: Agent outputs (especially LLM-generated content) pass through configurable validators before being emitted or sent externally.
  • Rate limiting: Per-agent limits on API calls, event emissions, state writes, and LLM tokens.
  • Audit logging: Every agent action (state changes, events emitted, LLM calls, external API calls) is logged to an immutable audit trail.
  • RBAC: Role-based access control for multi-tenant deployments - teams can only manage their own agents.
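At its simplest, a redact-mode PII validator like the one configured below reduces to pattern-based substitution. A deliberately minimal standalone sketch — two illustrative patterns only; production PII detection needs far broader coverage:

```python
import re

# Two illustrative patterns; real detection needs many more (and smarter) rules
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a [REDACTED:<type>] placeholder."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text

out = redact("Contact alice@example.com, SSN 123-45-6789.")
# "Contact [REDACTED:email], SSN [REDACTED:ssn]."
```

Running validators on agent *output* (rather than input) is the key design point: it catches PII regardless of whether it came from a data source, agent state, or an LLM completion.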

Guardrail Configuration

# openclaw.yaml - Security & guardrails
security:
  # Agent sandboxing
  sandbox:
    read_only_filesystem: true
    allowed_network:
      - "*.internal"
      - "api.openai.com"
      - "api.anthropic.com"
      - "hooks.slack.com"
    blocked_network:
      - "*.torproject.org"
      - "metadata.google.internal"    # Block cloud metadata
      - "169.254.169.254"             # Block AWS IMDS

  # Secret management
  secrets:
    backend: vault                     # vault | aws-sm | env
    vault_addr: https://vault.internal:8200
    vault_path: secret/openclaw

  # Output validation
  validators:
    - type: pii_filter
      action: redact                   # redact | block | warn
      fields: ["email", "phone", "ssn", "credit_card"]

    - type: content_safety
      action: block
      categories: ["hate_speech", "violence", "self_harm"]

    - type: json_schema
      action: block
      schema_path: schemas/output.json

  # Rate limiting
  rate_limits:
    default:
      api_calls_per_minute: 60
      events_per_minute: 100
      state_writes_per_minute: 30
      llm_tokens_per_hour: 100000

  # Audit logging
  audit:
    enabled: true
    backend: postgresql               # postgresql | elasticsearch | s3
    retention: 90d
    log_events:
      - state_change
      - event_emitted
      - llm_call
      - external_api_call
      - agent_lifecycle

  # RBAC
  rbac:
    enabled: true
    roles:
      admin:
        permissions: ["*"]
      developer:
        permissions: ["agents:read", "agents:create", "agents:update", "logs:read", "state:read"]
      viewer:
        permissions: ["agents:read", "logs:read"]
Defense in depth: OpenClaw's security model follows the principle of least privilege. Agents start with no network access and no secrets - you explicitly grant what each agent needs. This prevents a compromised or buggy agent from accessing resources it shouldn't.

12. Production Deployment

Running agents in development is easy. Running them in production - with high availability, rolling updates, autoscaling, and disaster recovery - requires careful planning. Here's the production playbook.

Kubernetes Deployment with Helm

# values.yaml - Production Helm configuration
replicaCount: 2

supervisor:
  replicas: 2                          # HA supervisor
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi

agentRuntime:
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 1Gi

postgres:
  enabled: true
  persistence:
    size: 50Gi
    storageClass: gp3
  replication:
    enabled: true
    readReplicas: 2
  backup:
    enabled: true
    schedule: "0 */6 * * *"            # Every 6 hours
    retention: 30d
    destination: s3://backups/openclaw

redis:
  enabled: true
  architecture: replication
  replica:
    replicaCount: 2
  persistence:
    size: 10Gi

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: openclaw.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: openclaw-tls
      hosts:
        - openclaw.example.com

monitoring:
  prometheus:
    enabled: true
    serviceMonitor: true
  grafana:
    enabled: true
    dashboards: true

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

Deployment Commands

# Deploy to production
helm upgrade --install openclaw openclaw/openclaw \
  --namespace openclaw \
  --create-namespace \
  -f values.yaml \
  --set image.tag=v1.4.0 \
  --wait --timeout 5m

# Verify deployment
kubectl get pods -n openclaw
kubectl get svc -n openclaw

# Check agent status via the API
kubectl port-forward svc/openclaw-api 8400:8400 -n openclaw
curl http://localhost:8400/api/agents

# Rolling update (zero-downtime)
helm upgrade openclaw openclaw/openclaw \
  --namespace openclaw \
  -f values.yaml \
  --set image.tag=v1.5.0 \
  --wait

# Rollback if something goes wrong
helm rollback openclaw 1 --namespace openclaw

Horizontal Pod Autoscaling

# hpa.yaml - Custom HPA based on agent queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: openclaw-agents
  namespace: openclaw
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: openclaw-agent-runtime
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: openclaw_events_pending
        target:
          type: AverageValue
          averageValue: "100"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Disaster Recovery

  • Database backups: Automated PostgreSQL backups every 6 hours to S3 with 30-day retention. Point-in-time recovery via WAL archiving.
  • State snapshots: Agent state is snapshotted before every deployment and stored in S3.
  • Multi-region: For critical workloads, deploy OpenClaw in two regions with PostgreSQL streaming replication and Redis Sentinel for automatic failover.
  • Recovery time: Cold start from backup: ~5 minutes. Failover to standby region: ~30 seconds.
Pre-deployment checklist: Before deploying to production, verify: (1) all secrets are in your secrets backend (not in config files), (2) resource limits are set for every agent, (3) LLM budgets are configured, (4) alerting rules are active in Prometheus, (5) database backups are tested with a restore drill.

13. Real-World 24/7 Agent Examples

Theory is useful; working examples are better. Here are five production-grade agent architectures with key code snippets.

Example 1: DevOps Agent

Purpose: Monitors infrastructure, auto-scales services, responds to incidents, and posts status updates.

# agents/devops_agent.py
from openclaw import Agent, Schedule, on
from openclaw.tools import HttpClient, SlackNotifier

class DevOpsAgent(Agent):
    name = "devops-agent"
    schedule = Schedule.every(minutes=2)

    async def setup(self):
        self.http = HttpClient()
        self.slack = SlackNotifier(webhook_url=self.config.slack_webhook)
        self.k8s_api = self.config.kubernetes_api_url

    async def run(self):
        # Check cluster health
        nodes = await self.http.get(f"{self.k8s_api}/api/v1/nodes")
        pods = await self.http.get(f"{self.k8s_api}/api/v1/pods?fieldSelector=status.phase!=Running")

        unhealthy_pods = [
            p for p in pods.json()["items"]
            if p["status"]["phase"] == "Failed" or any(
                # CrashLoopBackOff is a container waiting reason, not a pod phase
                cs.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
                for cs in p["status"].get("containerStatuses", [])
            )
        ]

        if unhealthy_pods:
            # Use LLM to diagnose the issue
            diagnosis = await self.llm.complete(
                f"Diagnose these Kubernetes pod failures and suggest fixes: "
                f"{[p['metadata']['name'] + ': ' + p['status'].get('reason', 'unknown') for p in unhealthy_pods]}"
            )
            await self.slack.send(
                f"🔧 *{len(unhealthy_pods)} unhealthy pods detected*\n"
                f"Diagnosis: {diagnosis.text}"
            )

        # Auto-scale based on CPU metrics
        metrics = await self.http.get(f"{self.k8s_api}/apis/metrics.k8s.io/v1beta1/pods")
        for pod_metric in metrics.json().get("items", []):
            cpu_usage = self.parse_cpu(pod_metric)
            if cpu_usage > 80:
                deployment = pod_metric["metadata"]["labels"].get("app")
                await self.scale_up(deployment)

    async def scale_up(self, deployment: str):
        current = await self.state.get(f"replicas_{deployment}", 2)
        new_count = min(current + 1, 10)
        await self.http.patch(
            f"{self.k8s_api}/apis/apps/v1/namespaces/default/deployments/{deployment}/scale",
            json={"spec": {"replicas": new_count}},
        )
        await self.state.set(f"replicas_{deployment}", new_count)
        await self.slack.send(f"📈 Scaled *{deployment}* to {new_count} replicas")

Example 2: Security Agent

Purpose: Scans logs for threats, blocks malicious IPs, and generates daily security reports.

# agents/security_agent.py
from openclaw import Agent, Schedule

class SecurityAgent(Agent):
    name = "security-agent"
    schedule = Schedule.every(minutes=1)

    async def run(self):
        # Pull latest logs from Elasticsearch
        logs = await self.http.post(
            f"{self.config.elasticsearch_url}/_search",
            json={
                "query": {"range": {"@timestamp": {"gte": "now-1m"}}},
                "size": 1000,
            },
        )

        suspicious = []
        for hit in logs.json()["hits"]["hits"]:
            log = hit["_source"]
            # Pattern matching for common attack signatures
            if self.is_suspicious(log):
                suspicious.append(log)

        if suspicious:
            # Use LLM to classify threat severity
            classification = await self.llm.complete(
                f"Classify these {len(suspicious)} suspicious log entries by threat level "
                f"(critical/high/medium/low). Return JSON array: {suspicious[:20]}"
            )

            for threat in classification.parsed:
                if threat["level"] in ("critical", "high"):
                    # Auto-block the IP
                    await self.http.post(
                        f"{self.config.firewall_api}/block",
                        json={"ip": threat["source_ip"], "duration": "24h"},
                    )

            await self.emit("threats_detected", {
                "count": len(suspicious),
                "blocked_ips": [t["source_ip"] for t in classification.parsed if t["level"] in ("critical", "high")],
            })

    def is_suspicious(self, log: dict) -> bool:
        patterns = ["SQL injection", "XSS", "path traversal", "brute force", "401", "403"]
        message = log.get("message", "").lower()
        return any(p.lower() in message for p in patterns)

Example 3: Market Intelligence Agent

Purpose: Tracks competitor pricing, analyzes market trends, and generates weekly intelligence reports.

# agents/market_intel.py
from openclaw import Agent, Schedule

class MarketIntelAgent(Agent):
    name = "market-intel"
    schedule = Schedule.every(hours=1)

    async def run(self):
        competitors = self.config.get("competitors", [])
        price_data = []

        for competitor in competitors:
            data = await self.http.get(competitor["pricing_url"])
            price_data.append({
                "competitor": competitor["name"],
                "prices": data.json(),
                "timestamp": self.now_iso(),
            })

        # Store in persistent state for trend analysis
        history = await self.state.get("price_history", [])
        history.extend(price_data)
        await self.state.set("price_history", history[-5000:])

        # Detect significant price changes
        previous = await self.state.get("last_prices", {})
        alerts = []
        for entry in price_data:
            name = entry["competitor"]
            for product, price in entry["prices"].items():
                prev_price = previous.get(name, {}).get(product)
                if prev_price and abs(price - prev_price) / prev_price > 0.05:
                    alerts.append({
                        "competitor": name,
                        "product": product,
                        "old_price": prev_price,
                        "new_price": price,
                        "change_pct": round((price - prev_price) / prev_price * 100, 1),
                    })

        if alerts:
            summary = await self.llm.complete(
                f"Summarize these competitor price changes for a business audience: {alerts}"
            )
            await self.emit("price_changes_detected", {
                "alerts": alerts,
                "summary": summary.text,
            })

        await self.state.set("last_prices", {
            e["competitor"]: e["prices"] for e in price_data
        })
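The 5% change-detection loop above is easy to verify in isolation. Here it is as a standalone function with the same logic and no OpenClaw dependencies:

```python
def detect_changes(previous: dict, current: dict, threshold: float = 0.05) -> list:
    """Return products whose price moved more than `threshold` (fractional)."""
    alerts = []
    for product, price in current.items():
        prev = previous.get(product)
        if prev and abs(price - prev) / prev > threshold:
            alerts.append({
                "product": product,
                "old_price": prev,
                "new_price": price,
                "change_pct": round((price - prev) / prev * 100, 1),
            })
    return alerts

alerts = detect_changes({"pro": 99.0, "team": 299.0}, {"pro": 89.0, "team": 305.0})
# Only "pro" moved more than 5%: (89 - 99) / 99 is about -10.1%
```

Note the `if prev` guard: it skips products with no prior price (or a zero price), which avoids both false alerts on first sight and division by zero.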

Example 4: Content Pipeline Agent

Purpose: Monitors RSS feeds, summarizes articles with LLM, and publishes to a CMS.

# agents/content_pipeline.py
from openclaw import Agent, Schedule

class ContentPipeline(Agent):
    name = "content-pipeline"
    schedule = Schedule.every(minutes=30)

    async def run(self):
        feeds = self.config.get("rss_feeds", [])
        # State is JSON-serialized, so persist a list and work with a set in memory
        processed_ids = set(await self.state.get("processed_ids", []))

        for feed_url in feeds:
            entries = await self.http.get(feed_url, headers={"Accept": "application/rss+xml"})
            articles = self.parse_rss(entries.text)

            for article in articles:
                if article["id"] in processed_ids:
                    continue

                # Summarize with LLM
                summary = await self.llm.complete(
                    f"Write a 3-paragraph summary of this article for a tech audience. "
                    f"Title: {article['title']}\nContent: {article['content'][:3000]}"
                )

                # Publish to CMS
                await self.http.post(
                    f"{self.config.cms_api}/articles",
                    json={
                        "title": article["title"],
                        "summary": summary.text,
                        "source_url": article["link"],
                        "category": "auto-curated",
                        "status": "draft",
                    },
                    headers={"Authorization": f"Bearer {self.config.cms_token}"},
                )

                processed_ids.add(article["id"])
                self.logger.info(f"Published: {article['title']}")

        await self.state.set("processed_ids", sorted(processed_ids))

Example 5: Compliance Agent

Purpose: Monitors regulatory changes, checks internal policy compliance, and alerts the legal team.

# agents/compliance_agent.py
from openclaw import Agent, Schedule
from openclaw.tools import SlackNotifier

class ComplianceAgent(Agent):
    name = "compliance-agent"
    schedule = Schedule.every(hours=6)

    async def setup(self):
        # run() calls self.slack below, so initialize the notifier here
        self.slack = SlackNotifier(webhook_url=self.config.slack_webhook)

    async def run(self):
        # Check regulatory feeds
        reg_sources = self.config.get("regulatory_sources", [])
        known_regs = await self.state.get("known_regulations", {})
        new_changes = []

        for source in reg_sources:
            data = await self.http.get(source["url"])
            regulations = data.json().get("regulations", [])

            for reg in regulations:
                reg_id = reg["id"]
                if reg_id not in known_regs or reg["updated_at"] != known_regs[reg_id]:
                    new_changes.append(reg)
                    known_regs[reg_id] = reg["updated_at"]

        await self.state.set("known_regulations", known_regs)

        if new_changes:
            # Use LLM to assess impact on our policies
            impact = await self.llm.complete(
                f"Analyze these regulatory changes and assess their impact on a "
                f"SaaS company's data handling and privacy policies. "
                f"For each change, rate impact as high/medium/low and explain why.\n"
                f"Changes: {new_changes}"
            )

            await self.emit("regulatory_changes", {
                "changes": new_changes,
                "impact_analysis": impact.text,
                "high_impact_count": impact.text.lower().count("high"),
            })

            # Alert legal team for high-impact changes
            if "high" in impact.text.lower():
                await self.slack.send(
                    f"⚖️ *{len(new_changes)} regulatory changes detected*\n"
                    f"Impact analysis:\n{impact.text[:500]}"
                )

14. OpenClaw vs. Alternatives

OpenClaw isn't the only option for running automated workflows. But it's the only one purpose-built for 24/7 LLM-native autonomous agents. Here's how it compares:

| Feature | OpenClaw | Temporal | Prefect | Airflow | Cron Jobs | LangGraph Cloud |
| --- | --- | --- | --- | --- | --- | --- |
| 24/7 daemon agents | ✅ Native | ⚠️ Possible | ❌ Task-based | ❌ DAG-based | ❌ One-shot | ⚠️ Limited |
| LLM-native | ✅ Built-in router, budgets, fallback | ❌ DIY | ❌ DIY | ❌ DIY | ❌ DIY | ✅ LangChain ecosystem |
| State management | ✅ Built-in, versioned | ✅ Excellent | ⚠️ Basic | ⚠️ XCom (limited) | ❌ None | ✅ Checkpointing |
| Multi-agent orchestration | ✅ Pipeline, fan-out, event-driven | ⚠️ Workflow-based | ⚠️ Flow-based | ⚠️ DAG-based | ❌ None | ✅ Graph-based |
| Observability | ✅ Prometheus, Grafana, OTel | ✅ Built-in UI | ✅ Built-in UI | ✅ Built-in UI | ❌ Manual | ⚠️ LangSmith |
| Auto-restart & health checks | ✅ Native | ✅ Native | ⚠️ Limited | ⚠️ Limited | ❌ None | ⚠️ Platform-managed |
| Cost tracking (LLM) | ✅ Per-agent budgets | ❌ None | ❌ None | ❌ None | ❌ None | ⚠️ Via LangSmith |
| Learning curve | 🟢 Low (Python classes) | 🟡 Medium (workflows) | 🟢 Low (decorators) | 🔴 High (DAGs, operators) | 🟢 Low (shell scripts) | 🟡 Medium (graphs) |
| Self-hosted | ✅ Open source | ✅ Open source | ✅ Open source | ✅ Open source | ✅ Built-in | ❌ Cloud only |
When to choose OpenClaw: If your primary use case is running AI agents that need to operate continuously (not just scheduled batch jobs), need LLM integration with cost controls, and require multi-agent collaboration - OpenClaw is purpose-built for this. If you need traditional DAG-based ETL, Airflow or Prefect may be better fits. If you need durable workflow execution without LLM features, Temporal is excellent.

15. Scaling & Performance

OpenClaw is designed to scale from a single agent on a laptop to hundreds of agents across a Kubernetes cluster. Here are the benchmarks and scaling strategies.

Performance Benchmarks

| Metric | Single Node (4 CPU, 8 GB) | 3-Node Cluster | 10-Node Cluster |
|---|---|---|---|
| Max concurrent agents | 25 | 80 | 300+ |
| Message bus throughput | 5,000 events/sec | 15,000 events/sec | 50,000+ events/sec |
| State read latency (p50) | 1.2 ms | 1.5 ms | 2.1 ms |
| State read latency (p99) | 8 ms | 12 ms | 18 ms |
| State write latency (p50) | 3.5 ms | 4.2 ms | 5.8 ms |
| Supervisor overhead | ~50 MB RAM | ~120 MB RAM | ~300 MB RAM |
| Cold start (agent) | 1.2 sec | 1.5 sec | 2.0 sec |

Benchmarks measured with Redis Streams message bus, PostgreSQL 16 state store, Python 3.12 runtime. NATS JetStream provides ~2x higher message throughput at the cost of additional operational complexity.

Scaling Strategies

Vertical Scaling

The simplest approach: give your nodes more CPU and memory. This is effective up to roughly 25 agents per node; beyond that, CPU-bound work starts to serialize on the Python GIL, and you should switch to horizontal scaling.

Horizontal Scaling

Add more nodes to your cluster. OpenClaw's Supervisor distributes agents across available nodes using consistent hashing, so placements stay stable as nodes join and leave; when multiple candidate nodes are eligible, new agents go to the least-loaded one.

# Scale the agent runtime deployment
kubectl scale deployment openclaw-agent-runtime \
  --replicas=5 -n openclaw

# Or use HPA for automatic scaling
kubectl autoscale deployment openclaw-agent-runtime \
  --min=2 --max=20 --cpu-percent=70 -n openclaw
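To make the placement behavior concrete, here is a minimal sketch of consistent hashing with virtual nodes. This is an illustrative model of the idea, not the actual OpenClaw Supervisor implementation; the `HashRing` class and its methods are hypothetical.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: agents map to nodes via hash position."""

    def __init__(self, nodes, vnodes=100):
        # Place `vnodes` virtual points per node so load spreads evenly
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, agent_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash
        idx = bisect.bisect(self._keys, self._hash(agent_id)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
placement = ring.node_for("uptime-monitor")  # deterministic for a given ring
```

The key property: when a node is added or removed, only the agents hashed to that node's arc move, so a scale-up event does not reshuffle the whole fleet.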

Sharding

For agents that process large datasets, partition the work across multiple agent instances:

# openclaw.yaml - Sharded agent configuration
agents:
  log-processor:
    source: agents/log_processor.py
    replicas: 4
    sharding:
      strategy: hash                   # hash | range | round-robin
      key: "source_id"                 # shard by data source
      # Each replica processes 1/4 of the sources
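The effect of `strategy: hash` with `key: "source_id"` can be sketched in a few lines: each replica computes the same stable hash and keeps only the sources in its shard. The function names below are illustrative, not OpenClaw API.

```python
import zlib

NUM_REPLICAS = 4  # matches `replicas: 4` in the config above

def shard_for(source_id: str, num_replicas: int = NUM_REPLICAS) -> int:
    # Stable hash (crc32) so every replica computes the same assignment
    return zlib.crc32(source_id.encode()) % num_replicas

def sources_for_replica(all_sources, replica_index, num_replicas=NUM_REPLICAS):
    return [s for s in all_sources
            if shard_for(s, num_replicas) == replica_index]

sources = [f"src-{i}" for i in range(12)]
partitions = [sources_for_replica(sources, r) for r in range(NUM_REPLICAS)]
```

Because the assignment is a pure function of the key, no coordination is needed: each replica independently derives its partition, and every source lands in exactly one partition.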
Scaling bottleneck: The most common bottleneck is the state store (PostgreSQL), not the agents themselves. If state operations become slow, add read replicas for read-heavy agents, enable connection pooling (PgBouncer), and consider partitioning large state tables. For write-heavy workloads, use Redis as the primary state store with async PostgreSQL persistence.
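For the write-heavy pattern just described (Redis as primary state store with async PostgreSQL persistence behind PgBouncer), a configuration might look like the sketch below. The keys shown are illustrative assumptions, not a confirmed OpenClaw config schema.

```yaml
# openclaw.yaml - Hypothetical write-heavy state-store configuration
state:
  primary: redis                       # serve reads/writes from Redis
  redis:
    url: redis://redis:6379/0
  persistence:
    backend: postgres
    mode: async                        # flush to PostgreSQL in the background
    flush_interval: 5s
  postgres:
    dsn: postgres://openclaw@pgbouncer:6432/openclaw   # pooled via PgBouncer
    read_replicas:                     # route read-heavy agents here
      - postgres://openclaw@pg-replica-1:5432/openclaw
```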

Resource Planning Guide

  • Small (1-10 agents): Single node, 4 CPU, 8 GB RAM. Docker Compose deployment. ~$50/month on cloud.
  • Medium (10-50 agents): 3-node K8s cluster, 4 CPU / 16 GB each. Helm deployment with HA supervisor. ~$300/month.
  • Large (50-200 agents): 5-10 node K8s cluster with HPA. Dedicated PostgreSQL with read replicas. NATS JetStream for message bus. ~$1,000-2,500/month.
  • Enterprise (200+ agents): Multi-cluster deployment with federation. Sharded state store. Dedicated monitoring infrastructure. Custom capacity planning required.

Start Building 24/7 Agents Today

OpenClaw is open source and free to use. Get started in under 5 minutes with Docker Compose, or deploy to Kubernetes for production workloads. The future of AI isn't chatbots - it's autonomous agents that work around the clock.

⭐ Star on GitHub  |  πŸ“– Read the Docs  |  πŸ’¬ Join Discord