OpenClaw: The Definitive Guide to Building 24/7 Agentic AI Workflows
AI agents that run once and quit are toys. The real revolution happens when agents run continuously - monitoring systems, analyzing data, responding to incidents, and orchestrating workflows around the clock without a human touching the keyboard. That's the promise of OpenClaw: an open-source framework purpose-built for running autonomous AI agents 24 hours a day, 7 days a week, as persistent daemon-mode services with health checks, auto-restart, observability, and multi-agent orchestration baked in from day one.
This guide is the definitive deep-dive. We'll cover everything from architecture internals and installation to building production-grade agents, multi-agent pipelines, LLM integration, security guardrails, Kubernetes deployment, and real-world examples. Whether you're deploying a single uptime monitor or orchestrating a fleet of 50 collaborating agents, this is your reference.
1. What is OpenClaw?
OpenClaw is an open-source framework for running AI agents as persistent, autonomous services. It is not a chatbot framework. It is not a prompt chain library. It is infrastructure - the daemon layer that keeps your AI agents alive, healthy, and productive around the clock.
Think of it as systemd for AI agents. Just as systemd manages long-running Linux services with health checks, restart policies, and logging, OpenClaw manages long-running AI agents with the same operational rigor - plus first-class support for LLM integration, inter-agent communication, state persistence, and cost tracking.
Core Design Principles
- Daemon-first: Agents are designed to run forever, not execute once. The default mode is a persistent loop with configurable schedules, not a one-shot script.
- Operationally mature: Health checks (liveness + readiness), auto-restart with exponential backoff, structured logging, Prometheus metrics, and distributed tracing are built in - not bolted on.
- LLM-native: First-class support for OpenAI, Anthropic, local models via Ollama, model fallback chains, token tracking, and per-agent cost budgets.
- Multi-agent by default: Agents communicate via a message bus (Redis Streams or NATS), share state through a persistent store, and can be orchestrated in pipelines, fan-out/fan-in patterns, or event-driven topologies.
- Cloud-native: Runs in Docker, Kubernetes, or as a standalone binary. Helm charts, HPA, rolling updates, and blue-green deployments are supported out of the box.
What OpenClaw Is NOT
- It is not a prompt engineering library (use LangChain or LlamaIndex for that).
- It is not a workflow orchestrator like Airflow (though it can replace Airflow for AI-driven workflows).
- It is not a chatbot framework (use Rasa or Botpress for conversational UIs).
- It is the runtime and infrastructure layer that sits beneath all of these, keeping your agents alive and observable 24/7.
2. Why 24/7 Agents?
Most AI agent frameworks assume a request-response model: a human asks something, the agent responds, done. But entire categories of valuable work require continuous, autonomous operation - work that happens at 3 AM on a Sunday, work that monitors streams of data in real time, work that must never stop.
Use Cases That Demand Always-On Agents
| Domain | Use Case | Why 24/7? |
|---|---|---|
| Security | Log analysis, threat detection, IP blocking | Attacks don't wait for business hours |
| Infrastructure | Auto-scaling, incident response, health monitoring | Downtime costs $5,600/minute on average |
| Market Analysis | Price tracking, competitor monitoring, sentiment analysis | Markets are global and never close |
| Content Moderation | Scanning user-generated content for policy violations | Content is posted around the clock |
| Customer Support | Ticket triage, auto-responses, escalation routing | Customers expect instant responses at any hour |
| Data Pipelines | ETL orchestration, data quality monitoring, schema drift detection | Data flows continuously from upstream sources |
| Compliance | Regulatory change monitoring, policy enforcement, audit trails | Regulations change without notice; violations are costly |
| Social Media | Brand monitoring, engagement tracking, crisis detection | Viral events happen in minutes, not days |
Cost Analysis: 24/7 Agent vs. Human Team
Let's do the math for a security monitoring operation that requires 24/7 coverage:
| Cost Factor | Human Team (3 shifts) | OpenClaw Agent Fleet |
|---|---|---|
| Personnel | 5 analysts × $95K/yr = $475K | 0 (automated) |
| Infrastructure | SIEM licenses: $50K/yr | 3-node K8s cluster: $8K/yr |
| LLM API costs | N/A | ~$18K/yr (GPT-4o with fallback) |
| Response time | 5-15 minutes (human triage) | <30 seconds (automated) |
| Coverage gaps | Shift handoffs, sick days, holidays | Zero (daemon-mode) |
| Total annual cost | ~$525K | ~$26K |
3. OpenClaw Architecture
OpenClaw's architecture is modular and designed for production workloads. Every component can be scaled independently, replaced with alternatives, or extended via plugins. Here's the complete system architecture:
Architecture Diagram
+----------------------------------------------------------------------+
|                        OpenClaw Control Plane                        |
|                                                                      |
|  +--------------+   +--------------+   +--------------------------+  |
|  | API Gateway  |   |  Scheduler   |   | Supervisor (Process Mgr) |  |
|  | REST/WS API  |   | Cron+Events  |   | Health / Restart / Stop  |  |
|  +------+-------+   +------+-------+   +------------+-------------+  |
|         |                  |                        |                |
|  +------+------------------+------------------------+------------+  |
|  |                      Internal Event Bus                        |  |
|  +------+------------------+------------------------+------------+  |
|         |                  |                        |                |
|  +------+-------+   +------+-------+   +------------+-------------+  |
|  | Agent Runtime|   | Agent Runtime|   |      Agent Runtime       |  |
|  |  +---------+ |   |  +---------+ |   |       +---------+        |  |
|  |  | Agent A | |   |  | Agent B | |   |       | Agent C |        |  |
|  |  | (Python)| |   |  | (Python)| |   |       | (Python)|        |  |
|  |  +---------+ |   |  +---------+ |   |       +---------+        |  |
|  |  Container   |   |  Container   |   |       Container          |  |
|  +--------------+   +--------------+   +--------------------------+  |
|                                                                      |
+-----------------------------------+----------------------------------+
                                    |
                 +------------------+------------------+
                 |                  |                  |
        +--------+------+   +-------+------+   +-------+---------+
        | Message Bus   |   | State Store  |   | Metrics/Traces  |
        | Redis Streams |   | PostgreSQL   |   | Prometheus +    |
        | or NATS       |   | + Redis      |   | OpenTelemetry   |
        +---------------+   +--------------+   +-----------------+
Component Deep-Dive
Agent Runtime
Each agent runs inside an isolated container (or process, in standalone mode). The runtime provides:
- Resource isolation: CPU and memory limits enforced via cgroups (Docker/K8s) or process-level ulimits.
- Dependency isolation: Each agent has its own Python virtual environment and dependency set.
- Lifecycle hooks: setup(), run(), teardown(), on_error() - the runtime calls these at the appropriate lifecycle stages.
- Signal handling: SIGTERM triggers graceful shutdown; SIGKILL is a last resort after the grace period expires.
Supervisor
The Supervisor is OpenClaw's process manager - analogous to systemd or supervisord, but purpose-built for AI agents. It:
- Monitors agent health via liveness and readiness probes.
- Restarts crashed agents according to configurable restart policies (immediate, linear backoff, exponential backoff).
- Handles graceful shutdown sequences - draining in-flight work before stopping.
- Enforces resource limits and kills agents that exceed their memory or CPU budget.
- Reports agent status to the API Gateway and emits lifecycle events to the message bus.
Message Bus
Inter-agent communication flows through a message bus. OpenClaw supports two backends:
- Redis Streams (default): Lightweight, easy to operate, supports consumer groups for load balancing. Best for small-to-medium deployments (up to ~50 agents).
- NATS JetStream: Higher throughput, built-in persistence, better for large-scale deployments (50+ agents) or when you need exactly-once delivery semantics.
Messages are typed events with JSON payloads. Agents publish events with await self.emit("event_name", payload) and subscribe with @on("event_name") decorators.
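The emit/subscribe shape can be illustrated with a toy in-memory bus. This stand-in is not OpenClaw's implementation - real deployments route events through Redis Streams or NATS - but it shows how named events with JSON-style payloads flow from publishers to decorated handlers:

```python
import asyncio

# Toy in-memory stand-in for the message bus (illustration only).
_handlers: dict[str, list] = {}

def on(event_name):
    """Register an async handler for a named event (mimics the @on decorator)."""
    def decorator(fn):
        _handlers.setdefault(event_name, []).append(fn)
        return fn
    return decorator

async def emit(event_name, payload):
    """Deliver a typed event with a JSON-serializable payload to all subscribers."""
    for handler in _handlers.get(event_name, []):
        await handler(payload)

received = []

@on("data_collected")
async def handle(payload):
    # A subscriber reacts to the event; in OpenClaw this would be an agent method.
    received.append(payload["count"])

asyncio.run(emit("data_collected", {"batch_id": "b1", "count": 3}))
print(received)  # [3]
```

In a real deployment the bus also provides consumer groups (Redis Streams) or durable streams (NATS JetStream), so multiple agent replicas can share a subscription without double-processing events.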
State Store
Agent state is persisted in PostgreSQL with an optional Redis cache layer for hot state. Features include:
- Automatic checkpointing: State is saved at configurable intervals (default: every 60 seconds) and after every run() cycle.
- Point-in-time recovery: State snapshots are versioned, allowing rollback to any previous checkpoint.
- Shared state: Agents can read (but not write) other agents' state via namespaced keys.
- TTL support: State keys can have expiration times for temporary data.
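The TTL semantics can be modeled with a minimal expiring store. This is a toy sketch of the behavior, not the PostgreSQL/Redis implementation - keys written with a time-to-live simply stop resolving after it elapses:

```python
import time

class TTLStore:
    """Minimal model of TTL semantics: keys expire after their time-to-live."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazily evict the expired key
            return default
        return value

store = TTLStore()
store.set("session_token", "abc123", ttl=0.05)  # temporary data
store.set("baseline", {"avg": 42})              # no expiry
assert store.get("session_token") == "abc123"
time.sleep(0.1)
assert store.get("session_token") is None       # expired
assert store.get("baseline") == {"avg": 42}     # still present
```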
Scheduler
The Scheduler supports two trigger modes:
- Cron-like schedules: Schedule.every(minutes=5), Schedule.cron("0 */2 * * *"), Schedule.daily(hour=9, minute=0).
- Event-driven triggers: Agents can be triggered by events from other agents, webhooks, or external systems via the API Gateway.
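The fixed-interval mode boils down to computing the next fire time from the last run. The sketch below shows one plausible policy - skipping missed slots rather than replaying them after downtime; whether OpenClaw replays missed runs is an assumption here, not documented behavior:

```python
from datetime import datetime, timedelta, timezone

def next_fire(last_run: datetime, every: timedelta, now: datetime) -> datetime:
    """Next fire time for a fixed-interval schedule, skipping missed slots.

    If the agent was down for several intervals, fire at the next slot
    after `now` instead of replaying every missed run.
    """
    if now <= last_run:
        return last_run + every
    missed = (now - last_run) // every  # whole intervals already elapsed
    return last_run + every * (missed + 1)

last = datetime(2026, 4, 18, 14, 0, tzinfo=timezone.utc)
five = timedelta(minutes=5)
# Normal case: next slot is 14:05.
assert next_fire(last, five, last) == last + five
# Agent was down 12 minutes: skip 14:05 and 14:10, fire at 14:15.
assert next_fire(last, five, last + timedelta(minutes=12)) == last + timedelta(minutes=15)
```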
API Gateway
The API Gateway exposes a REST and WebSocket API for external control:
- GET /api/agents - List all agents and their status.
- POST /api/agents/{name}/start - Start a stopped agent.
- POST /api/agents/{name}/pause - Pause a running agent.
- GET /api/agents/{name}/state - Read agent state.
- GET /api/agents/{name}/logs - Stream agent logs via WebSocket.
- POST /api/agents/{name}/trigger - Manually trigger an agent run.
- GET /api/metrics - Prometheus-format metrics endpoint.
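From Python, the routes above map naturally onto a thin client. This is a hypothetical sketch (the OpenClawClient class is not part of the framework) that only builds URLs - pair it with any HTTP library to make the actual calls:

```python
class OpenClawClient:
    """Hypothetical thin wrapper that builds URLs for the control-plane routes."""
    def __init__(self, base_url="http://localhost:8400"):
        self.base = base_url.rstrip("/")

    def url(self, *parts):
        # All routes live under /api; path segments are joined in order.
        return "/".join([self.base, "api", *parts])

client = OpenClawClient()
assert client.url("agents") == "http://localhost:8400/api/agents"
# POST to this URL would manually trigger a run of the "monitor" agent:
assert client.url("agents", "monitor", "trigger") == \
    "http://localhost:8400/api/agents/monitor/trigger"
```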
4. Installation & Setup
OpenClaw supports three deployment modes: Docker Compose (recommended for development and small production), Kubernetes with Helm (recommended for production at scale), and standalone binary (for single-machine deployments).
Option A: Docker Compose (Recommended Start)
The fastest way to get a full OpenClaw stack running. This docker-compose.yml includes all services:
# docker-compose.yml
version: "3.9"
services:
openclaw-supervisor:
image: openclaw/supervisor:1.4.0
ports:
- "8400:8400" # API Gateway
- "9090:9090" # Prometheus metrics
environment:
- OPENCLAW_DB_URL=postgresql://openclaw:secret@postgres:5432/openclaw
- OPENCLAW_REDIS_URL=redis://redis:6379/0
- OPENCLAW_LOG_LEVEL=info
- OPENCLAW_LOG_FORMAT=json
volumes:
- ./agents:/app/agents
- ./openclaw.yaml:/app/openclaw.yaml
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
restart: unless-stopped
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: openclaw
POSTGRES_USER: openclaw
POSTGRES_PASSWORD: secret
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U openclaw"]
interval: 5s
timeout: 3s
retries: 5
redis:
image: redis:7-alpine
command: redis-server --appendonly yes
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
grafana:
image: grafana/grafana:10.4.0
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- ./grafana/dashboards:/var/lib/grafana/dashboards
- ./grafana/provisioning:/etc/grafana/provisioning
volumes:
pgdata:
redisdata:
Start the stack:
docker compose up -d
# Verify all services are healthy
docker compose ps
# Check the API Gateway
curl http://localhost:8400/api/health
Option B: Kubernetes with Helm
# Add the OpenClaw Helm repository
helm repo add openclaw https://charts.openclaw.dev
helm repo update
# Install with default values
helm install openclaw openclaw/openclaw \
--namespace openclaw \
--create-namespace \
--set supervisor.replicas=2 \
--set postgres.enabled=true \
--set redis.enabled=true
# Verify the deployment
kubectl get pods -n openclaw
Option C: Standalone Binary
# Install via the install script
curl -fsSL https://get.openclaw.dev | sh
# Or via pip
pip install openclaw-cli
# Verify installation
openclaw version
# OpenClaw CLI v1.4.0 (runtime v1.4.0, Python 3.11+)
The OpenClaw CLI
The openclaw CLI is your primary interface for managing agents:
# Initialize a new project
openclaw init my-project
cd my-project
# Create an agent from a template
openclaw agent create --name monitor --template watchdog
openclaw agent create --name analyzer --template llm-processor
openclaw agent create --name reporter --template scheduled-report
# List available templates
openclaw templates list
# Run locally for development (hot-reload enabled)
openclaw dev
# Deploy to a remote OpenClaw cluster
openclaw deploy --env production
# Check agent status
openclaw status
# +------------------+---------+--------+----------+--------------+
# | Agent            | Status  | Uptime | Last Run | Success Rate |
# +------------------+---------+--------+----------+--------------+
# | monitor          | running | 4d 12h | 2m ago   | 99.7%        |
# | analyzer         | running | 4d 12h | 5m ago   | 98.2%        |
# | reporter         | paused  | -      | 1h ago   | 100%         |
# +------------------+---------+--------+----------+--------------+
# Stream logs from a specific agent
openclaw logs monitor --follow
# Trigger a manual run
openclaw trigger monitor
# Pause / resume an agent
openclaw pause reporter
openclaw resume reporter
After openclaw init, your project will contain: openclaw.yaml (main config), agents/ (agent source code), tests/ (agent tests), docker-compose.yml (local stack), and .openclaw/ (CLI state).
5. Building Your First 24/7 Agent
Let's build a real-world agent: an uptime monitor that checks a list of URLs every 5 minutes, detects downtime, sends Slack alerts, and maintains a persistent uptime history that survives restarts.
Agent Configuration
First, define the agent in openclaw.yaml:
# openclaw.yaml
project: my-monitoring
version: "1.0"
agents:
uptime-monitor:
source: agents/uptime_monitor.py
schedule: every 5m
config:
slack_webhook: ${SLACK_WEBHOOK_URL}
targets:
- url: https://api.example.com/health
name: "Production API"
expected_status: 200
timeout: 10
- url: https://app.example.com
name: "Web Application"
expected_status: 200
timeout: 15
- url: https://docs.example.com
name: "Documentation Site"
expected_status: 200
timeout: 10
restart_policy:
max_retries: 10
backoff: exponential
resources:
memory: 256Mi
cpu: 0.25
Full Agent Implementation
# agents/uptime_monitor.py
import asyncio
from datetime import datetime, timezone
from openclaw import Agent, Schedule, Alert
from openclaw.tools import HttpClient, SlackNotifier
class UptimeMonitor(Agent):
"""24/7 website uptime monitoring agent with Slack alerting."""
name = "uptime-monitor"
schedule = Schedule.every(minutes=5)
async def setup(self):
"""Initialize HTTP client, Slack notifier, and load targets."""
self.http = HttpClient(
timeout=self.config.get("default_timeout", 15),
retries=2,
retry_delay=3,
)
self.slack = SlackNotifier(
webhook_url=self.config.slack_webhook
)
self.targets = self.config.get("targets", [])
self.logger.info(
f"Uptime monitor initialized with {len(self.targets)} targets"
)
async def run(self):
"""Check all targets and process results."""
results = await asyncio.gather(
*[self.check_target(t) for t in self.targets],
return_exceptions=True,
)
# Load persistent history from state store
history = await self.state.get("check_history", [])
timestamp = datetime.now(timezone.utc).isoformat()
for target, result in zip(self.targets, results):
if isinstance(result, Exception):
result = {
"status": "error",
"error": str(result),
"latency_ms": None,
}
entry = {
"target": target["name"],
"url": target["url"],
"timestamp": timestamp,
**result,
}
history.append(entry)
# Alert on downtime
if result["status"] != "healthy":
await self.handle_downtime(target, result)
else:
# Clear incident state if target recovered
incident_key = f"incident_{target['name']}"
was_down = await self.state.get(incident_key, False)
if was_down:
await self.slack.send(
f"✅ *{target['name']}* is back UP "
f"(latency: {result['latency_ms']}ms)"
)
await self.state.set(incident_key, False)
# Keep last 10,000 entries, persist to state store
await self.state.set("check_history", history[-10000:])
# Update summary metrics
healthy = sum(1 for r in results if not isinstance(r, Exception) and r["status"] == "healthy")
self.metrics.gauge("targets_healthy", healthy)
self.metrics.gauge("targets_total", len(self.targets))
async def check_target(self, target: dict) -> dict:
"""Check a single target URL and return status."""
timeout = target.get("timeout", 15)
expected = target.get("expected_status", 200)
start = asyncio.get_event_loop().time()
try:
response = await self.http.get(
target["url"], timeout=timeout
)
latency = int(
(asyncio.get_event_loop().time() - start) * 1000
)
if response.status_code == expected:
return {
"status": "healthy",
"http_status": response.status_code,
"latency_ms": latency,
}
else:
return {
"status": "unhealthy",
"http_status": response.status_code,
"expected_status": expected,
"latency_ms": latency,
}
except Exception as e:
return {
"status": "error",
"error": str(e),
"latency_ms": None,
}
async def handle_downtime(self, target: dict, result: dict):
"""Send alert and track incident state."""
incident_key = f"incident_{target['name']}"
already_alerted = await self.state.get(incident_key, False)
if not already_alerted:
status = result.get("http_status", "N/A")
error = result.get("error", "unexpected status")
await self.slack.send(
f"🚨 *{target['name']}* is DOWN!\n"
f"URL: {target['url']}\n"
f"Status: {status}\n"
f"Error: {error}"
)
await self.state.set(incident_key, True)
self.logger.warning(
f"Target {target['name']} is down: {result}"
)
async def teardown(self):
"""Cleanup on graceful shutdown."""
await self.http.close()
self.logger.info("Uptime monitor shut down gracefully")
This agent demonstrates the core building blocks: persistent state (self.state), built-in metrics (self.metrics), structured logging (self.logger), concurrent target checking with asyncio.gather, incident tracking to avoid alert storms, and recovery notifications.
6. Agent Lifecycle Management
Understanding the agent lifecycle is critical for building reliable 24/7 systems. OpenClaw agents transition through well-defined states, and the Supervisor manages these transitions automatically.
Agent State Machine
        +--------------+
        | INITIALIZING |
        |   setup()    |
        +------+-------+
               | success
               v
        +--------------+   pause    +--------+
        |   RUNNING    | ---------> | PAUSED |
   +--> |  run() loop  | <--------- |        |
   |    +------+-------+   resume   +--------+
   |           | error
   |           v
   |    +--------------+
   +----|    ERROR     |  retry (backoff) -> RUNNING
        +------+-------+
               | max retries exceeded (or explicit stop)
               v
        +--------------+
        |   STOPPED    |
        |  teardown()  |
        +--------------+
Health Checks
OpenClaw implements two types of health checks, inspired by Kubernetes:
- Liveness probe: "Is the agent process alive?" - Checks that the agent's event loop is responsive. If the liveness probe fails, the Supervisor kills and restarts the agent.
- Readiness probe: "Is the agent ready to do work?" - Checks that dependencies (database, APIs, etc.) are available. If the readiness probe fails, the agent is temporarily removed from event subscriptions but not restarted.
You can define custom health checks:
class MyAgent(Agent):
async def liveness_check(self) -> bool:
"""Return True if the agent is alive and responsive."""
return self.event_loop_responsive()
async def readiness_check(self) -> bool:
"""Return True if all dependencies are available."""
try:
await self.db.ping()
await self.http.get("https://api.openai.com/v1/models", timeout=5)
return True
except Exception:
return False
Complete Lifecycle Configuration
# openclaw.yaml - Agent lifecycle configuration
agents:
uptime-monitor:
source: agents/uptime_monitor.py
replicas: 1
restart_policy:
max_retries: 5
backoff: exponential # immediate | linear | exponential
initial_delay: 5s # first retry after 5 seconds
max_delay: 300s # cap backoff at 5 minutes
reset_after: 3600s # reset retry counter after 1h of healthy operation
health_check:
liveness:
interval: 30s
timeout: 10s
unhealthy_threshold: 3 # 3 consecutive failures = restart
readiness:
interval: 15s
timeout: 5s
unhealthy_threshold: 2 # 2 failures = remove from event bus
resources:
cpu: 0.5 # 0.5 CPU cores
memory: 512Mi # 512 MB RAM
api_calls_per_minute: 60 # rate limit for external API calls
max_state_size: 100Mi # max persistent state size
graceful_shutdown:
timeout: 30s # time to finish in-flight work
drain_events: true # stop accepting new events before shutdown
logging:
level: info # debug | info | warning | error
format: json # json | text
retention: 7d # keep logs for 7 days
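The restart_policy above translates into a concrete delay sequence. A sketch of how the three backoff modes produce retry delays from initial_delay and max_delay (the formulas are the standard ones these mode names imply, not taken from OpenClaw internals):

```python
def backoff_delays(initial=5, max_delay=300, max_retries=5, mode="exponential"):
    """Delay in seconds before each retry attempt, capped at max_delay."""
    delays = []
    for attempt in range(max_retries):
        if mode == "immediate":
            d = 0
        elif mode == "linear":
            d = initial * (attempt + 1)       # 5, 10, 15, ...
        else:  # exponential: initial * 2^attempt
            d = initial * (2 ** attempt)      # 5, 10, 20, 40, ...
        delays.append(min(d, max_delay))
    return delays

# With initial_delay=5s and max_delay=300s, as configured above:
print(backoff_delays())               # [5, 10, 20, 40, 80]
print(backoff_delays(max_retries=8))  # [5, 10, 20, 40, 80, 160, 300, 300]
```

The reset_after setting matters here: without it, a slow memory leak that crashes the agent once a day would eventually exhaust max_retries; resetting the counter after an hour of healthy operation keeps long-lived agents restartable indefinitely.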
7. Multi-Agent Orchestration
Single agents are useful. Fleets of collaborating agents are transformative. OpenClaw provides four orchestration patterns for multi-agent systems, all built on top of the message bus.
Orchestration Patterns
Pattern 1: Pipeline
Agent A's output feeds Agent B, which feeds Agent C. Linear, sequential processing.
[Collector] ---> [Analyzer] ---> [Reporter] ---> [Alerter]
   emit()          @on()           @on()           @on()
Pattern 2: Fan-Out / Fan-In
One agent distributes work to multiple workers; an aggregator collects results.
                 +--> [Worker 1] --+
[Dispatcher] ----+--> [Worker 2] --+--> [Aggregator]
                 +--> [Worker 3] --+
Pattern 3: Supervisor
A meta-agent that monitors and manages other agents - spawning, stopping, or reconfiguring them based on conditions.
Pattern 4: Event-Driven
Agents react to events from other agents, external webhooks, or system events. No fixed topology - agents subscribe to events they care about.
Complete Multi-Agent Pipeline Example
Here's a full 4-agent pipeline: Data Collector β Analyzer β Reporter β Alerter.
# agents/data_collector.py
from openclaw import Agent, Schedule
class DataCollector(Agent):
"""Collects data from multiple sources every 10 minutes."""
name = "data-collector"
schedule = Schedule.every(minutes=10)
async def setup(self):
self.sources = self.config.get("data_sources", [])
async def run(self):
collected = []
for source in self.sources:
data = await self.http.get(source["url"])
collected.append({
"source": source["name"],
"data": data.json(),
"collected_at": self.now_iso(),
})
# Emit to the pipeline - Analyzer will pick this up
await self.emit("data_collected", {
"batch_id": self.generate_id(),
"records": collected,
"count": len(collected),
})
self.logger.info(f"Collected {len(collected)} records")
# agents/analyzer.py
from openclaw import Agent, on
class Analyzer(Agent):
"""Analyzes collected data for anomalies and trends."""
name = "analyzer"
@on("data_collected")
async def analyze(self, event):
records = event.data["records"]
batch_id = event.data["batch_id"]
analysis = {
"batch_id": batch_id,
"anomalies": [],
"trends": [],
"summary": {},
}
# Load historical baselines from persistent state
baselines = await self.state.get("baselines", {})
for record in records:
source = record["source"]
values = record["data"].get("metrics", {})
baseline = baselines.get(source, {})
for metric, value in values.items():
avg = baseline.get(metric, {}).get("avg", value)
stddev = baseline.get(metric, {}).get("stddev", 0)
# Flag anomalies beyond 2 standard deviations
if stddev > 0 and abs(value - avg) > 2 * stddev:
analysis["anomalies"].append({
"source": source,
"metric": metric,
"value": value,
"expected": avg,
"deviation": round((value - avg) / stddev, 2),
})
# Update rolling baselines
self.update_baseline(baselines, source, values)
await self.state.set("baselines", baselines)
await self.emit("analysis_complete", analysis)
def update_baseline(self, baselines, source, values):
"""Update rolling average and stddev for each metric."""
if source not in baselines:
baselines[source] = {}
for metric, value in values.items():
b = baselines[source].get(metric, {"avg": value, "stddev": 0, "n": 0})
n = b["n"] + 1
old_avg = b["avg"]
new_avg = old_avg + (value - old_avg) / n
b["stddev"] = ((b["stddev"] ** 2 * (n - 1) + (value - old_avg) * (value - new_avg)) / n) ** 0.5
b["avg"] = new_avg
b["n"] = n
baselines[source][metric] = b
# agents/reporter.py
from openclaw import Agent, on
class Reporter(Agent):
"""Generates human-readable reports from analysis results."""
name = "reporter"
@on("analysis_complete")
async def report(self, event):
analysis = event.data
anomaly_count = len(analysis["anomalies"])
report = {
"batch_id": analysis["batch_id"],
"generated_at": self.now_iso(),
"anomaly_count": anomaly_count,
"severity": "critical" if anomaly_count > 5 else "warning" if anomaly_count > 0 else "normal",
"details": analysis["anomalies"],
}
# Use LLM to generate a natural-language summary
if anomaly_count > 0:
summary = await self.llm.complete(
f"Summarize these anomalies in 2-3 sentences for an ops team: "
f"{analysis['anomalies']}"
)
report["summary"] = summary
await self.emit("report_ready", report)
self.logger.info(f"Report generated: {anomaly_count} anomalies")
# agents/alerter.py
from openclaw import Agent, on
from openclaw.tools import SlackNotifier, PagerDutyClient
class Alerter(Agent):
"""Routes alerts based on severity."""
name = "alerter"
async def setup(self):
self.slack = SlackNotifier(webhook_url=self.config.slack_webhook)
self.pagerduty = PagerDutyClient(api_key=self.config.pagerduty_key)
@on("report_ready")
async def alert(self, event):
report = event.data
severity = report["severity"]
if severity == "normal":
return # No alert needed
if severity == "warning":
await self.slack.send(
f"⚠️ *{report['anomaly_count']} anomalies detected*\n"
f"{report.get('summary', 'See dashboard for details.')}"
)
elif severity == "critical":
# Slack + PagerDuty for critical alerts
await self.slack.send(
f"🚨 *CRITICAL: {report['anomaly_count']} anomalies!*\n"
f"{report.get('summary', 'Immediate attention required.')}"
)
await self.pagerduty.trigger_incident(
title=f"Critical anomalies: {report['anomaly_count']}",
severity="critical",
details=report["details"],
)
Pipeline Configuration
# openclaw.yaml - Multi-agent pipeline
agents:
data-collector:
source: agents/data_collector.py
schedule: every 10m
config:
data_sources:
- name: "production-metrics"
url: https://metrics.internal/api/v1/query
- name: "sales-data"
url: https://sales-api.internal/daily
analyzer:
source: agents/analyzer.py
# Event-driven - triggered by data-collector
config: {}
reporter:
source: agents/reporter.py
# Event-driven - triggered by analyzer
config: {}
alerter:
source: agents/alerter.py
config:
slack_webhook: ${SLACK_WEBHOOK_URL}
pagerduty_key: ${PAGERDUTY_API_KEY}
8. State Management & Persistence
Stateless agents are simple but limited. Real-world 24/7 agents need to remember what happened across restarts, track trends over time, and share context with other agents. OpenClaw's state management system provides all of this.
How State Works
- Automatic persistence: Agent state is stored in PostgreSQL and cached in Redis. Writes go to both; reads hit Redis first.
- Checkpointing: State is automatically checkpointed after every run() cycle and at configurable intervals.
- Versioning: Every state change creates a new version. You can roll back to any previous checkpoint.
- Namespacing: Each agent has its own state namespace. Agents can read (but not write) other agents' state.
State API
class TrendAnalyzer(Agent):
"""Stateful agent that tracks price trends across restarts."""
name = "trend-analyzer"
schedule = Schedule.every(minutes=15)
async def run(self):
# State persists across restarts - no data loss
history = await self.state.get("price_history", [])
current = await self.fetch_prices()
history.append({
"timestamp": self.now_iso(),
"prices": current,
})
# Keep last 1000 entries (automatic size management)
await self.state.set("price_history", history[-1000:])
# Detect anomalies using historical data
if self.detect_anomaly(history):
await self.emit("anomaly_detected", {
"type": "price_spike",
"current": current,
"baseline": self.calculate_baseline(history),
"deviation_pct": self.calculate_deviation(history, current),
})
# Update summary metrics
self.metrics.gauge("history_size", len(history))
self.metrics.gauge("latest_price", current.get("avg", 0))
async def fetch_prices(self) -> dict:
resp = await self.http.get("https://api.exchange.com/v1/prices")
return resp.json()
def detect_anomaly(self, history: list) -> bool:
if len(history) < 10:
return False
recent = [h["prices"].get("avg", 0) for h in history[-10:]]
avg = sum(recent) / len(recent)
stddev = (sum((x - avg) ** 2 for x in recent) / len(recent)) ** 0.5
latest = recent[-1]
return stddev > 0 and abs(latest - avg) > 2.5 * stddev
def calculate_baseline(self, history: list) -> float:
prices = [h["prices"].get("avg", 0) for h in history[-100:]]
return sum(prices) / len(prices) if prices else 0.0
def calculate_deviation(self, history: list, current: dict) -> float:
baseline = self.calculate_baseline(history)
if baseline == 0:
return 0.0
return round((current.get("avg", 0) - baseline) / baseline * 100, 2)
Shared State Between Agents
# Agent A writes to its own state
await self.state.set("latest_analysis", analysis_result)
# Agent B reads Agent A's state (read-only cross-agent access)
analyzer_state = await self.state.read_from("analyzer")
latest = analyzer_state.get("latest_analysis", {})
# Shared state namespace (writable by any agent with access)
await self.shared_state.set("global_config", {"threshold": 0.95})
config = await self.shared_state.get("global_config")
State Snapshots & Rollback
# CLI commands for state management
openclaw state list uptime-monitor
# +---------+----------------------+--------+------+
# | Version | Timestamp            | Size   | Keys |
# +---------+----------------------+--------+------+
# | v142    | 2026-04-18T14:30:00Z | 2.3 MB | 4    |
# | v141    | 2026-04-18T14:25:00Z | 2.3 MB | 4    |
# | v140    | 2026-04-18T14:20:00Z | 2.2 MB | 4    |
# +---------+----------------------+--------+------+
# Rollback to a previous version
openclaw state rollback uptime-monitor --version v140
# Export state for debugging
openclaw state export uptime-monitor --output state.json
To keep state manageable, set the max_state_size config option and implement pruning logic in your agent (e.g., keep only the last N entries). For truly large datasets, use an external database and store only references in agent state.
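The pruning pattern suggested above can be sketched as a small helper that bounds history by both entry count and serialized size (estimating size via JSON length is an assumption here - OpenClaw's own accounting of max_state_size may differ):

```python
import json

def prune_history(history, max_entries=10_000, max_bytes=1_000_000):
    """Keep the newest entries, bounded by both count and serialized size."""
    pruned = history[-max_entries:]
    # Drop the oldest half until the serialized payload fits the byte budget.
    while len(pruned) > 1 and len(json.dumps(pruned).encode()) > max_bytes:
        pruned = pruned[len(pruned) // 2:]
    return pruned

entries = [{"t": i, "v": i * 1.5} for i in range(25_000)]
kept = prune_history(entries, max_entries=10_000)
assert len(kept) == 10_000
assert kept[-1]["t"] == 24_999  # newest entries survive
```

Calling this before every self.state.set("check_history", ...) keeps checkpoints fast and prevents the state store from growing without bound.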
9. Observability & Monitoring
You can't run agents 24/7 if you can't see what they're doing. OpenClaw ships with comprehensive observability: Prometheus metrics, Grafana dashboards, structured JSON logging, and distributed tracing via OpenTelemetry.
Built-in Prometheus Metrics
Every agent automatically exposes these metrics at /api/metrics:
| Metric | Type | Description |
|---|---|---|
| openclaw_agent_uptime_seconds | Gauge | Time since agent last started |
| openclaw_agent_runs_total | Counter | Total number of run() invocations |
| openclaw_agent_runs_failed_total | Counter | Number of failed run() invocations |
| openclaw_agent_run_duration_seconds | Histogram | Duration of each run() cycle (p50, p95, p99) |
| openclaw_agent_state_size_bytes | Gauge | Current size of agent persistent state |
| openclaw_llm_tokens_total | Counter | Total LLM tokens consumed (by model) |
| openclaw_llm_cost_usd_total | Counter | Cumulative LLM API cost in USD |
| openclaw_llm_latency_seconds | Histogram | LLM API call latency |
| openclaw_events_emitted_total | Counter | Events published to message bus |
| openclaw_events_consumed_total | Counter | Events consumed from message bus |
| openclaw_restarts_total | Counter | Number of agent restarts |
Grafana Dashboard Configuration
{
"dashboard": {
"title": "OpenClaw Agent Fleet",
"panels": [
{
"title": "Agent Status Overview",
"type": "stat",
"targets": [{
"expr": "count by (status) (openclaw_agent_uptime_seconds > 0)",
"legendFormat": "{{status}}"
}]
},
{
"title": "Run Success Rate (24h)",
"type": "gauge",
"targets": [{
"expr": "1 - (rate(openclaw_agent_runs_failed_total[24h]) / rate(openclaw_agent_runs_total[24h]))",
"legendFormat": "{{agent}}"
}],
"thresholds": [
{"value": 0.95, "color": "red"},
{"value": 0.99, "color": "yellow"},
{"value": 1.0, "color": "green"}
]
},
{
"title": "LLM Cost per Agent (Daily)",
"type": "timeseries",
"targets": [{
"expr": "increase(openclaw_llm_cost_usd_total[24h])",
"legendFormat": "{{agent}} ({{model}})"
}]
},
{
"title": "Run Duration (p95)",
"type": "timeseries",
"targets": [{
"expr": "histogram_quantile(0.95, rate(openclaw_agent_run_duration_seconds_bucket[5m]))",
"legendFormat": "{{agent}}"
}]
}
]
}
}
Alerting Rules
# prometheus-alerts.yml
groups:
- name: openclaw-agents
rules:
- alert: AgentDown
expr: openclaw_agent_uptime_seconds == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Agent {{ $labels.agent }} is down"
description: "Agent has been down for more than 2 minutes."
- alert: HighFailureRate
expr: |
rate(openclaw_agent_runs_failed_total[15m])
/ rate(openclaw_agent_runs_total[15m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Agent {{ $labels.agent }} failure rate > 10%"
- alert: LLMBudgetExceeded
expr: increase(openclaw_llm_cost_usd_total[24h]) > 50
labels:
severity: warning
annotations:
summary: "Agent {{ $labels.agent }} LLM spend > $50/day"
- alert: AgentRestartLoop
expr: increase(openclaw_restarts_total[1h]) > 5
labels:
severity: critical
annotations:
summary: "Agent {{ $labels.agent }} restarted 5+ times in 1 hour"
Distributed Tracing
Distributed tracing is enabled by setting OPENCLAW_OTEL_ENDPOINT in your environment. Every run() cycle, LLM call, event emission, and state operation creates a span. Traces flow to Jaeger, Zipkin, or any OTLP-compatible backend.
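For example, you might point agents at an OpenTelemetry Collector from your compose file. A sketch (the collector address/port and the service-name variable are assumptions, not documented OpenClaw settings):

```yaml
# docker-compose.yml (excerpt) — enable OpenTelemetry tracing
services:
  openclaw:
    environment:
      OPENCLAW_OTEL_ENDPOINT: http://otel-collector:4317  # OTLP gRPC endpoint
      OPENCLAW_OTEL_SERVICE_NAME: openclaw-agents         # assumed variable name
```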
10. LLM Integration
OpenClaw is LLM-native - it treats language models as first-class infrastructure, not an afterthought. Every agent has access to an LLM router that handles provider selection, fallback chains, cost tracking, rate limiting, and token budgets.
LLM Router Configuration
from openclaw.llm import LLMRouter, Budget, RateLimitError  # TimeoutError is the Python builtin
# Configure the LLM router with fallback chain
router = LLMRouter(
primary="gpt-4o",
fallback=["claude-3.5-sonnet", "llama-3.1-70b"],
budget=Budget(
daily_limit_usd=50.0,
monthly_limit_usd=1200.0,
alert_at_pct=80, # Alert when 80% of budget consumed
),
retry_on=[RateLimitError, TimeoutError],
max_retries=3,
timeout=30,
)
# Use in an agent
class SmartAnalyzer(Agent):
async def setup(self):
self.llm = LLMRouter(
primary="gpt-4o",
fallback=["claude-3.5-sonnet", "llama-3.1-70b"],
budget=Budget(daily_limit_usd=50.0),
retry_on=[RateLimitError, TimeoutError],
)
async def run(self):
data = await self.fetch_data()
# The router automatically handles:
# 1. Try gpt-4o first
# 2. On rate limit or timeout, fall back to claude-3.5-sonnet
# 3. If that fails too, fall back to llama-3.1-70b (local)
# 4. Track tokens and cost for each call
# 5. Reject calls if daily budget is exceeded
analysis = await self.llm.complete(
messages=[
{"role": "system", "content": "You are a data analyst."},
{"role": "user", "content": f"Analyze this data: {data}"},
],
temperature=0.3,
max_tokens=2000,
)
self.logger.info(
f"LLM call: model={analysis.model}, "
f"tokens={analysis.usage.total_tokens}, "
f"cost=${analysis.usage.cost_usd:.4f}, "
f"latency={analysis.latency_ms}ms"
)
YAML-Based LLM Configuration
# openclaw.yaml - LLM configuration
llm:
providers:
openai:
api_key: ${OPENAI_API_KEY}
models: ["gpt-4o", "gpt-4o-mini"]
rate_limit: 500/min
anthropic:
api_key: ${ANTHROPIC_API_KEY}
models: ["claude-3.5-sonnet", "claude-3-haiku"]
rate_limit: 300/min
ollama:
base_url: http://ollama:11434
models: ["llama-3.1-70b", "mistral-7b"]
# No rate limit for local models
defaults:
primary: gpt-4o
fallback: [claude-3.5-sonnet, llama-3.1-70b]
timeout: 30s
max_retries: 3
budgets:
global:
daily_limit_usd: 200.0
monthly_limit_usd: 5000.0
per_agent:
smart-analyzer:
daily_limit_usd: 50.0
content-generator:
daily_limit_usd: 30.0
Cost tip: use gpt-4o-mini or local models for high-frequency, low-complexity tasks. Reserve GPT-4o and Claude for tasks that genuinely need advanced reasoning.
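One way to act on this tip is to choose the model per task before handing the call to the router. An illustrative sketch (the keyword heuristic and token thresholds are assumptions, not an OpenClaw API):

```python
def pick_model(task: str, input_tokens: int) -> str:
    """Route cheap, high-frequency work to small models and hard work to frontier models.

    The keyword heuristic here is deliberately crude; in practice you might
    classify tasks with a small model or tag them explicitly at the call site.
    """
    hard_keywords = ("analyze", "diagnose", "plan", "legal", "architecture")
    if input_tokens < 500 and not any(k in task.lower() for k in hard_keywords):
        return "gpt-4o-mini"    # cheap and fast: classification, extraction, short summaries
    if input_tokens > 50_000:
        return "llama-3.1-70b"  # local model: no per-token cost for huge inputs
    return "gpt-4o"             # frontier model: tasks that need real reasoning

print(pick_model("Classify this log line", 80))       # → gpt-4o-mini
print(pick_model("Diagnose this pod failure", 1200))  # → gpt-4o
```

You could then pass the result as the router's `primary` model while keeping the same fallback chain.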
11. Security & Guardrails
Autonomous agents running 24/7 without human oversight need strong guardrails. A misconfigured agent can leak data, burn through API budgets, or take destructive actions. OpenClaw provides multiple layers of security.
Security Architecture
- Agent sandboxing: Each agent runs in an isolated container with a read-only filesystem (except for designated temp directories). Network access is restricted to an allowlist.
- Secret management: Secrets are injected via environment variables from a secrets backend (Vault, AWS Secrets Manager, or encrypted `.env` files). Secrets are never stored in agent state or logs.
- Output validation: Agent outputs (especially LLM-generated content) pass through configurable validators before being emitted or sent externally.
- Rate limiting: Per-agent limits on API calls, event emissions, state writes, and LLM tokens.
- Audit logging: Every agent action (state changes, events emitted, LLM calls, external API calls) is logged to an immutable audit trail.
- RBAC: Role-based access control for multi-tenant deployments - teams can only manage their own agents.
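The RBAC layer ultimately reduces to permission matching. A minimal sketch of wildcard-aware matching (illustrative only; OpenClaw's internal implementation may differ):

```python
def is_allowed(role_permissions: list[str], required: str) -> bool:
    """Return True if any granted permission covers the required one.

    Supports the global wildcard "*" and resource-level wildcards like "agents:*".
    """
    for perm in role_permissions:
        if perm == "*" or perm == required:
            return True
        resource, _, action = perm.partition(":")
        if action == "*" and required.startswith(resource + ":"):
            return True
    return False

developer = ["agents:read", "agents:create", "agents:update", "logs:read", "state:read"]
print(is_allowed(developer, "agents:create"))  # → True
print(is_allowed(developer, "agents:delete"))  # → False
print(is_allowed(["*"], "agents:delete"))      # → True
```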
Guardrail Configuration
# openclaw.yaml - Security & guardrails
security:
# Agent sandboxing
sandbox:
read_only_filesystem: true
allowed_network:
- "*.internal"
- "api.openai.com"
- "api.anthropic.com"
- "hooks.slack.com"
blocked_network:
- "*.torproject.org"
- "metadata.google.internal" # Block cloud metadata
- "169.254.169.254" # Block AWS IMDS
# Secret management
secrets:
backend: vault # vault | aws-sm | env
vault_addr: https://vault.internal:8200
vault_path: secret/openclaw
# Output validation
validators:
- type: pii_filter
action: redact # redact | block | warn
fields: ["email", "phone", "ssn", "credit_card"]
- type: content_safety
action: block
categories: ["hate_speech", "violence", "self_harm"]
- type: json_schema
action: block
schema_path: schemas/output.json
# Rate limiting
rate_limits:
default:
api_calls_per_minute: 60
events_per_minute: 100
state_writes_per_minute: 30
llm_tokens_per_hour: 100000
# Audit logging
audit:
enabled: true
backend: postgresql # postgresql | elasticsearch | s3
retention: 90d
log_events:
- state_change
- event_emitted
- llm_call
- external_api_call
- agent_lifecycle
# RBAC
rbac:
enabled: true
roles:
admin:
permissions: ["*"]
developer:
permissions: ["agents:read", "agents:create", "agents:update", "logs:read", "state:read"]
viewer:
permissions: ["agents:read", "logs:read"]
12. Production Deployment
Running agents in development is easy. Running them in production - with high availability, rolling updates, autoscaling, and disaster recovery - requires careful planning. Here's the production playbook.
Kubernetes Deployment with Helm
# values.yaml - Production Helm configuration
replicaCount: 2
supervisor:
replicas: 2 # HA supervisor
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
agentRuntime:
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 1Gi
postgres:
enabled: true
persistence:
size: 50Gi
storageClass: gp3
replication:
enabled: true
readReplicas: 2
backup:
enabled: true
schedule: "0 */6 * * *" # Every 6 hours
retention: 30d
destination: s3://backups/openclaw
redis:
enabled: true
architecture: replication
replica:
replicaCount: 2
persistence:
size: 10Gi
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: openclaw.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: openclaw-tls
hosts:
- openclaw.example.com
monitoring:
prometheus:
enabled: true
serviceMonitor: true
grafana:
enabled: true
dashboards: true
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
Deployment Commands
# Deploy to production
helm upgrade --install openclaw openclaw/openclaw \
--namespace openclaw \
--create-namespace \
-f values.yaml \
--set image.tag=v1.4.0 \
--wait --timeout 5m
# Verify deployment
kubectl get pods -n openclaw
kubectl get svc -n openclaw
# Check agent status via the API
kubectl port-forward svc/openclaw-api 8400:8400 -n openclaw
curl http://localhost:8400/api/agents
# Rolling update (zero-downtime)
helm upgrade openclaw openclaw/openclaw \
--namespace openclaw \
-f values.yaml \
--set image.tag=v1.5.0 \
--wait
# Rollback if something goes wrong
helm rollback openclaw 1 --namespace openclaw
Horizontal Pod Autoscaling
# hpa.yaml - Custom HPA based on agent queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: openclaw-agents
namespace: openclaw
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: openclaw-agent-runtime
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: openclaw_events_pending
target:
type: AverageValue
averageValue: "100"
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Disaster Recovery
- Database backups: Automated PostgreSQL backups every 6 hours to S3 with 30-day retention. Point-in-time recovery via WAL archiving.
- State snapshots: Agent state is snapshotted before every deployment and stored in S3.
- Multi-region: For critical workloads, deploy OpenClaw in two regions with PostgreSQL streaming replication and Redis Sentinel for automatic failover.
- Recovery time: Cold start from backup: ~5 minutes. Failover to standby region: ~30 seconds.
13. Real-World 24/7 Agent Examples
Theory is useful; working examples are better. Here are five production-grade agent architectures with key code snippets.
Example 1: DevOps Agent
Purpose: Monitors infrastructure, auto-scales services, responds to incidents, and posts status updates.
# agents/devops_agent.py
from openclaw import Agent, Schedule
from openclaw.tools import HttpClient, SlackNotifier
class DevOpsAgent(Agent):
name = "devops-agent"
schedule = Schedule.every(minutes=2)
async def setup(self):
self.http = HttpClient()
self.slack = SlackNotifier(webhook_url=self.config.slack_webhook)
self.k8s_api = self.config.kubernetes_api_url
async def run(self):
# Check cluster health
nodes = await self.http.get(f"{self.k8s_api}/api/v1/nodes")
pods = await self.http.get(f"{self.k8s_api}/api/v1/pods?fieldSelector=status.phase!=Running")
unhealthy_pods = [
    p for p in pods.json()["items"]
    # "CrashLoopBackOff" is a container waiting reason, not a pod phase,
    # so inspect container statuses in addition to the pod phase
    if p["status"]["phase"] == "Failed"
    or any(
        cs.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
        for cs in p["status"].get("containerStatuses", [])
    )
]
if unhealthy_pods:
# Use LLM to diagnose the issue
diagnosis = await self.llm.complete(
f"Diagnose these Kubernetes pod failures and suggest fixes: "
f"{[p['metadata']['name'] + ': ' + p['status'].get('reason', 'unknown') for p in unhealthy_pods]}"
)
await self.slack.send(
f"π§ *{len(unhealthy_pods)} unhealthy pods detected*\n"
f"Diagnosis: {diagnosis.text}"
)
# Auto-scale based on CPU metrics
metrics = await self.http.get(f"{self.k8s_api}/apis/metrics.k8s.io/v1beta1/pods")
for pod_metric in metrics.json().get("items", []):
cpu_usage = self.parse_cpu(pod_metric)
if cpu_usage > 80:
deployment = pod_metric["metadata"]["labels"].get("app")
await self.scale_up(deployment)
async def scale_up(self, deployment: str):
current = await self.state.get(f"replicas_{deployment}", 2)
new_count = min(current + 1, 10)
await self.http.patch(
f"{self.k8s_api}/apis/apps/v1/namespaces/default/deployments/{deployment}/scale",
json={"spec": {"replicas": new_count}},
)
await self.state.set(f"replicas_{deployment}", new_count)
await self.slack.send(f"π Scaled *{deployment}* to {new_count} replicas")
Example 2: Security Agent
Purpose: Scans logs for threats, blocks malicious IPs, and generates daily security reports.
# agents/security_agent.py
from openclaw import Agent, Schedule
class SecurityAgent(Agent):
name = "security-agent"
schedule = Schedule.every(minutes=1)
async def run(self):
# Pull latest logs from Elasticsearch
logs = await self.http.post(
f"{self.config.elasticsearch_url}/_search",
json={
"query": {"range": {"@timestamp": {"gte": "now-1m"}}},
"size": 1000,
},
)
suspicious = []
for hit in logs.json()["hits"]["hits"]:
log = hit["_source"]
# Pattern matching for common attack signatures
if self.is_suspicious(log):
suspicious.append(log)
if suspicious:
# Use LLM to classify threat severity
classification = await self.llm.complete(
f"Classify these {len(suspicious)} suspicious log entries by threat level "
f"(critical/high/medium/low). Return JSON array: {suspicious[:20]}"
)
for threat in classification.parsed:
if threat["level"] in ("critical", "high"):
# Auto-block the IP
await self.http.post(
f"{self.config.firewall_api}/block",
json={"ip": threat["source_ip"], "duration": "24h"},
)
await self.emit("threats_detected", {
"count": len(suspicious),
"blocked_ips": [t["source_ip"] for t in classification.parsed if t["level"] in ("critical", "high")],
})
def is_suspicious(self, log: dict) -> bool:
patterns = ["SQL injection", "XSS", "path traversal", "brute force", "401", "403"]
message = log.get("message", "").lower()
return any(p.lower() in message for p in patterns)
Example 3: Market Intelligence Agent
Purpose: Tracks competitor pricing, analyzes market trends, and generates weekly intelligence reports.
# agents/market_intel.py
from openclaw import Agent, Schedule
class MarketIntelAgent(Agent):
name = "market-intel"
schedule = Schedule.every(hours=1)
async def run(self):
competitors = self.config.get("competitors", [])
price_data = []
for competitor in competitors:
data = await self.http.get(competitor["pricing_url"])
price_data.append({
"competitor": competitor["name"],
"prices": data.json(),
"timestamp": self.now_iso(),
})
# Store in persistent state for trend analysis
history = await self.state.get("price_history", [])
history.extend(price_data)
await self.state.set("price_history", history[-5000:])
# Detect significant price changes
previous = await self.state.get("last_prices", {})
alerts = []
for entry in price_data:
name = entry["competitor"]
for product, price in entry["prices"].items():
prev_price = previous.get(name, {}).get(product)
if prev_price and abs(price - prev_price) / prev_price > 0.05:
alerts.append({
"competitor": name,
"product": product,
"old_price": prev_price,
"new_price": price,
"change_pct": round((price - prev_price) / prev_price * 100, 1),
})
if alerts:
summary = await self.llm.complete(
f"Summarize these competitor price changes for a business audience: {alerts}"
)
await self.emit("price_changes_detected", {
"alerts": alerts,
"summary": summary.text,
})
await self.state.set("last_prices", {
e["competitor"]: e["prices"] for e in price_data
})
Example 4: Content Pipeline Agent
Purpose: Monitors RSS feeds, summarizes articles with LLM, and publishes to a CMS.
# agents/content_pipeline.py
from openclaw import Agent, Schedule
class ContentPipeline(Agent):
name = "content-pipeline"
schedule = Schedule.every(minutes=30)
async def run(self):
feeds = self.config.get("rss_feeds", [])
# Persist IDs as a list (JSON-serializable); work with a set in memory
processed_ids = set(await self.state.get("processed_ids", []))
for feed_url in feeds:
entries = await self.http.get(feed_url, headers={"Accept": "application/rss+xml"})
articles = self.parse_rss(entries.text)
for article in articles:
if article["id"] in processed_ids:
continue
# Summarize with LLM
summary = await self.llm.complete(
f"Write a 3-paragraph summary of this article for a tech audience. "
f"Title: {article['title']}\nContent: {article['content'][:3000]}"
)
# Publish to CMS
await self.http.post(
f"{self.config.cms_api}/articles",
json={
"title": article["title"],
"summary": summary.text,
"source_url": article["link"],
"category": "auto-curated",
"status": "draft",
},
headers={"Authorization": f"Bearer {self.config.cms_token}"},
)
processed_ids.add(article["id"])
self.logger.info(f"Published: {article['title']}")
await self.state.set("processed_ids", processed_ids)
Example 5: Compliance Agent
Purpose: Monitors regulatory changes, checks internal policy compliance, and alerts the legal team.
# agents/compliance_agent.py
from openclaw import Agent, Schedule
from openclaw.tools import SlackNotifier
class ComplianceAgent(Agent):
    name = "compliance-agent"
    schedule = Schedule.every(hours=6)
    async def setup(self):
        # Slack client is not provided automatically, so create it here
        self.slack = SlackNotifier(webhook_url=self.config.slack_webhook)
async def run(self):
# Check regulatory feeds
reg_sources = self.config.get("regulatory_sources", [])
known_regs = await self.state.get("known_regulations", {})
new_changes = []
for source in reg_sources:
data = await self.http.get(source["url"])
regulations = data.json().get("regulations", [])
for reg in regulations:
reg_id = reg["id"]
if reg_id not in known_regs or reg["updated_at"] != known_regs[reg_id]:
new_changes.append(reg)
known_regs[reg_id] = reg["updated_at"]
await self.state.set("known_regulations", known_regs)
if new_changes:
# Use LLM to assess impact on our policies
impact = await self.llm.complete(
f"Analyze these regulatory changes and assess their impact on a "
f"SaaS company's data handling and privacy policies. "
f"For each change, rate impact as high/medium/low and explain why.\n"
f"Changes: {new_changes}"
)
await self.emit("regulatory_changes", {
"changes": new_changes,
"impact_analysis": impact.text,
"high_impact_count": impact.text.lower().count("high"),
})
# Alert legal team for high-impact changes
if "high" in impact.text.lower():
await self.slack.send(
f"βοΈ *{len(new_changes)} regulatory changes detected*\n"
f"Impact analysis:\n{impact.text[:500]}"
)
14. OpenClaw vs. Alternatives
OpenClaw isn't the only option for running automated workflows. But it's the only one purpose-built for 24/7 LLM-native autonomous agents. Here's how it compares:
| Feature | OpenClaw | Temporal | Prefect | Airflow | Cron Jobs | LangGraph Cloud |
|---|---|---|---|---|---|---|
| 24/7 daemon agents | β Native | β οΈ Possible | β Task-based | β DAG-based | β One-shot | β οΈ Limited |
| LLM-native | β Built-in router, budgets, fallback | β DIY | β DIY | β DIY | β DIY | β LangChain ecosystem |
| State management | β Built-in, versioned | β Excellent | β οΈ Basic | β οΈ XCom (limited) | β None | β Checkpointing |
| Multi-agent orchestration | β Pipeline, fan-out, event-driven | β οΈ Workflow-based | β οΈ Flow-based | β οΈ DAG-based | β None | β Graph-based |
| Observability | β Prometheus, Grafana, OTel | β Built-in UI | β Built-in UI | β Built-in UI | β Manual | β οΈ LangSmith |
| Auto-restart & health checks | β Native | β Native | β οΈ Limited | β οΈ Limited | β None | β οΈ Platform-managed |
| Cost tracking (LLM) | β Per-agent budgets | β None | β None | β None | β None | β οΈ Via LangSmith |
| Learning curve | π’ Low (Python classes) | π‘ Medium (workflows) | π’ Low (decorators) | π΄ High (DAGs, operators) | π’ Low (shell scripts) | π‘ Medium (graphs) |
| Self-hosted | β Open source | β Open source | β Open source | β Open source | β Built-in | β Cloud only |
15. Scaling & Performance
OpenClaw is designed to scale from a single agent on a laptop to hundreds of agents across a Kubernetes cluster. Here are the benchmarks and scaling strategies.
Performance Benchmarks
| Metric | Single Node (4 CPU, 8 GB) | 3-Node Cluster | 10-Node Cluster |
|---|---|---|---|
| Max concurrent agents | 25 | 80 | 300+ |
| Message bus throughput | 5,000 events/sec | 15,000 events/sec | 50,000+ events/sec |
| State read latency (p50) | 1.2 ms | 1.5 ms | 2.1 ms |
| State read latency (p99) | 8 ms | 12 ms | 18 ms |
| State write latency (p50) | 3.5 ms | 4.2 ms | 5.8 ms |
| Supervisor overhead | ~50 MB RAM | ~120 MB RAM | ~300 MB RAM |
| Cold start (agent) | 1.2 sec | 1.5 sec | 2.0 sec |
Benchmarks measured with Redis Streams message bus, PostgreSQL 16 state store, Python 3.12 runtime. NATS JetStream provides ~2x higher message throughput at the cost of additional operational complexity.
Scaling Strategies
Vertical Scaling
The simplest approach: give your nodes more CPU and memory. Effective up to ~25 agents per node. Beyond that, you hit Python GIL limitations and should switch to horizontal scaling.
Horizontal Scaling
Add more nodes to your cluster. OpenClaw's Supervisor automatically distributes agents across available nodes using a consistent hashing algorithm. New agents are placed on the least-loaded node.
# Scale the agent runtime deployment
kubectl scale deployment openclaw-agent-runtime \
--replicas=5 -n openclaw
# Or use HPA for automatic scaling
kubectl autoscale deployment openclaw-agent-runtime \
--min=2 --max=20 --cpu-percent=70 -n openclaw
Sharding
For agents that process large datasets, partition the work across multiple agent instances:
# openclaw.yaml - Sharded agent configuration
agents:
log-processor:
source: agents/log_processor.py
replicas: 4
sharding:
strategy: hash # hash | range | round-robin
key: "source_id" # shard by data source
# Each replica processes 1/4 of the sources
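The hash strategy maps each shard key to a replica deterministically, so the same source always lands on the same instance. A minimal sketch of the idea (illustrative; OpenClaw's actual hash function is an internal detail) using a hash that is stable across processes:

```python
import zlib

def shard_for(key: str, replicas: int) -> int:
    """Deterministically map a shard key to a replica index.

    zlib.crc32 is stable across interpreter runs, unlike Python's built-in
    hash(), which is salted per process.
    """
    return zlib.crc32(key.encode()) % replicas

sources = ["web-01", "web-02", "db-primary", "cache-01"]
assignment = {src: shard_for(src, 4) for src in sources}
print(assignment)  # each source_id consistently maps to one of 4 replicas
```

Because the mapping depends only on the key and the replica count, any replica can recompute it locally; the tradeoff is that changing `replicas` remaps most keys, which is why consistent hashing is used for agent placement at the cluster level.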
Resource Planning Guide
- Small (1-10 agents): Single node, 4 CPU, 8 GB RAM. Docker Compose deployment. ~$50/month on cloud.
- Medium (10-50 agents): 3-node K8s cluster, 4 CPU / 16 GB each. Helm deployment with HA supervisor. ~$300/month.
- Large (50-200 agents): 5-10 node K8s cluster with HPA. Dedicated PostgreSQL with read replicas. NATS JetStream for message bus. ~$1,000-2,500/month.
- Enterprise (200+ agents): Multi-cluster deployment with federation. Sharded state store. Dedicated monitoring infrastructure. Custom capacity planning required.
Start Building 24/7 Agents Today
OpenClaw is open source and free to use. Get started in under 5 minutes with Docker Compose, or deploy to Kubernetes for production workloads. The future of AI isn't chatbots - it's autonomous agents that work around the clock.