GitHub - cprobe/catpaw: A lightweight monitoring agent that detects anomalies, auto-diagnoses root causes with AI, and lets you troubleshoot interactively via chat

Resumen Principal

catpaw es un agente de monitoreo ligero y altamente eficiente, diseñado para la detección de anomalías y el análisis de causa raíz asistido por inteligencia artificial. Utiliza un sistema de plugins modular, ofreciendo más de 25 checks predefinidos para monitorear componentes críticos como CPU, disco, red, Docker, logs y estado de servicios. Cuando una alerta se activa, catpaw puede desencadenar automáticamente un análisis de causa raíz empleando un robusto conjunto de más de 70 herramientas de diagnóstico integradas que cubren aspectos de sistema, red, almacenamiento, seguridad y kernel. Además, facilita la resolución interactiva de problemas a través de un chat con AI y permite inspecciones de salud proactivas impulsadas por IA. Su capacidad para integrarse con fuentes de datos externas mediante MCP (Model Context Protocol), como Prometheus o Jaeger, amplifica la profundidad de sus diagnósticos, posicionándolo como una solución integral para mantener la operatividad y reducir el tiempo medio de resolución.

Elementos Clave

Diagnósticos basados en AI: catpaw sobresale por su capacidad de realizar análisis de causa raíz automáticos cuando se dispara una alerta. Utiliza más de 70 herramientas de diagnóstico incorporadas para investigar profundamente el sistema. Además, ofrece un chat AI interactivo para la resolución de problemas conversacional y permite inspecciones de salud proactivas bajo demanda, todas impulsadas por la misma inteligencia artificial, para una identificación rápida y precisa de los problemas.
Monitoreo modular basado en Plugins: La flexibilidad es un pilar fundamental de catpaw gracias a su arquitectura de plugins. Ofrece más de 25 plugins de verificación que abarcan una amplia gama de funcionalidades, desde la utilización de CPU y disco, monitoreo de contenedores Docker y servicios systemd, hasta cheques de conectividad de red y análisis de logs. Esta modularidad permite a los usuarios habilitar únicamente las funciones que necesitan, manteniendo al agente ligero y con pocos requisitos de dependencias.
Notificación Flexible y Despliegue Sencillo: El agente está diseñado para ser ligero y fácil de desplegar, distribuyéndose como un solo binario sin dependencias pesadas. Soporta múltiples canales de notificación, permitiendo la configuración simultánea de salidas a consola (para validación rápida), API web genéricas (para cualquier endpoint HTTP), Flashduty y PagerDuty. Esta versatilidad asegura que los eventos y alertas puedan ser dirigidos a las plataformas y equipos adecuados de manera eficiente.
Integración con MCP (Model Context Protocol): catpaw puede extender sus capacidades de diagnóstico integrándose con fuentes de datos externas a través del Model Context Protocol (MCP). Esto permite al AI consultar datos históricos de plataformas como Prometheus, Jaeger o CMDB, enriqueciendo el contexto para un análisis de causa raíz más completo y profundo. La AI descubre y utiliza automáticamente las herramientas proporcionadas por estas fuentes de datos externas, consolidando la información de monitoreo y diagnóstico.

Análisis e Implicaciones

La combinación de monitoreo proactivo y diagnósticos AI automatizados en catpaw tiene el potencial de transformar la gestión de incidencias, reduciendo significativamente el tiempo de resolución (MTTR). Su arquitectura modular y ligera lo hace ideal para entornos diversos, desde infraestructuras simples hasta sistemas distribuidos complejos, al poder auto-monitorear incluso otros sistemas de monitoreo.

Contexto Adicional

El agente catpaw se puede configurar fácilmente a través de archivos TOML, permitiendo recargas en caliente de las configuraciones, y cuenta con una documentación exhaustiva que cubre desde la arquitectura hasta guías de desarrollo de plugins, así como una comunidad activa en WeChat.

English | 中文

catpaw is a lightweight monitoring agent with AI-powered diagnostics. It detects anomalies through plugin-based checks, produces standardized events, and — when an alert fires — can automatically trigger AI root-cause analysis using 70+ built-in diagnostic tools.

Events can be forwarded to any alert platform (Flashduty, PagerDuty, or any HTTP endpoint), or simply printed to the console for quick validation.

✨ Key Features

🪶 Lightweight, zero heavy dependencies — single binary, easy to deploy
🔌 Plugin-based monitoring — 25+ check plugins, enable only what you need
🤖 AI-powered diagnosis — automatic root-cause analysis triggered by alerts
💬 Interactive AI chat — troubleshoot issues conversationally with AI + tools
🩺 Proactive health inspection — on-demand AI-driven health checks
🛠️ 70+ diagnostic tools — system, network, storage, security, process, kernel
🔗 MCP integration — connect external data sources (Prometheus, Jaeger, CMDB, etc.) via Model Context Protocol
📡 Flexible notification — console, generic WebAPI, Flashduty, PagerDuty, or any combination
🔄 Self-monitoring friendly — ideal for monitoring your monitoring systems

🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        catpaw agent                             │
│                                                                 │
│  ┌─────────────┐   alert    ┌──────────────┐    AI + Tools     │
│  │  25+ Check  │ ────────── │  AI Diagnose │ ──────────────┐   │
│  │   Plugins   │  trigger   │    Engine    │               │   │
│  └──────┬──────┘            └──────────────┘               │   │
│         │                                                  ▼   │
│         │ events    ┌──────────────┐         ┌───────────────┐ │
│         └────────── │   Notifiers  │         │  70+ Diagnose │ │
│                     │  (multiple)  │         │     Tools     │ │
│                     └──────────────┘         └───────┬───────┘ │
│                                                      │         │
│  ┌─────────────┐                            ┌────────┴───────┐ │
│  │  AI Chat    │ ───── interactive ──────── │  MCP External  │ │
│  │  (CLI)      │       troubleshoot         │  Data Sources  │ │
│  └─────────────┘                            └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

🔍 Check Plugins

Plugin	Description
`cert`	TLS certificate expiry check (remote TLS + local files; STARTTLS, SNI, glob)
`conntrack`	Linux conntrack table usage — prevent silent packet drops
`cpu`	CPU utilization and per-core normalized load average
`disk`	Disk space, inode, and writability check
`dns`	DNS resolution check
`docker`	Docker container monitoring (state, restart, health, CPU/mem)
`exec`	Run scripts/commands to produce events (JSON and Nagios modes)
`filecheck`	File existence, mtime, and checksum check
`filefd`	System-level file descriptor usage (Linux)
`http`	HTTP availability, status code, response body, cert expiry
`journaltail`	Incremental journalctl log reading with keyword matching (Linux)
`logfile`	Log file monitoring (offset tracking, rotation, glob, multi-encoding)
`mem`	Memory and swap usage check
`mount`	Mount point baseline (fs type, options compliance; Linux)
`neigh`	ARP/neighbor table usage — prevent new-IP failures (K8s)
`net`	TCP/UDP connectivity and response time
`netif`	Network interface health (link state, error/drop delta; Linux)
`ntp`	NTP sync, clock offset, stratum (Linux)
`ping`	ICMP reachability, packet loss, latency
`procfd`	Per-process fd usage — prevent nofile exhaustion
`procnum`	Process count check (multiple lookup methods)
`scriptfilter`	Script output filter-rule matching
`secmod`	SELinux/AppArmor baseline (Linux)
`sockstat`	TCP listen queue overflow detection (Linux)
`sysctl`	Kernel parameter baseline — detect silent resets (Linux)
`systemd`	systemd service status (Linux)
`tcpstate`	TCP state monitoring (CLOSE_WAIT/TIME_WAIT; Netlink; Linux)
`uptime`	Unexpected reboot detection
`zombie`	Zombie process detection

🧠 AI Diagnostic Tools (70+)

When AI diagnosis is triggered (by alert, inspection, or chat), the AI agent has access to a rich toolkit:

⚙️ System & Process: CPU top, memory breakdown, OOM history, cgroup limits, process threads (with wchan), open files, environment variables, PSI pressure

🌐 Network: ping, traceroute, DNS resolve, ARP neighbors, TCP connection states, socket details (RTT/cwnd), retransmission rate, connection latency summary, listen queue overflow, TCP tuning check, softnet stats, route table, IP addresses, interface stats, firewall rules

💾 Storage: disk I/O latency, block device topology, LVM status, mount info

🔐 Kernel & Security: dmesg, interrupts distribution, conntrack stats, NUMA stats, thermal zones, sysctl snapshot, SELinux/AppArmor status, coredump list

📜 Logs: log tail, log grep (with pattern matching), journald query

🐳 Services: systemd service status, failed services list, timer list, Docker ps/inspect

🔌 Remote plugins (Redis, etc.) contribute their own specialized diagnostic tools for deep introspection.

🔗 MCP external tools: Connect Prometheus, Jaeger, CMDB, or any MCP-compatible data source — the AI automatically discovers and uses their tools.

🖥️ CLI Commands

catpaw run [flags]                      # Start the monitoring agent
catpaw chat [-v]                        # Interactive AI chat for troubleshooting
catpaw inspect <plugin> [target]        # Proactive AI health inspection
catpaw diagnose list|show <id>          # View past diagnosis records
catpaw selftest [filter] [-q]           # Smoke-test all diagnostic tools
catpaw mcptest                          # Test MCP server connections

🚀 Quick Start

📦 Installation

Download the binary from GitHub Releases.

Basic Monitoring

Enable plugin configs under conf.d/p.<plugin>/
Start:

The default config enables [notify.console], so events are printed to the terminal with colored output — no external service needed for a quick test.

📡 Event Notification

catpaw supports multiple notification channels. Configure one or more in conf.d/config.toml:

Channel	Config Section	Description
Console	`[notify.console]`	Print events to terminal (enabled by default)
WebAPI	`[notify.webapi]`	Push raw Event JSON to any HTTP endpoint
Flashduty	`[notify.flashduty]`	Forward to Flashduty alert platform
PagerDuty	`[notify.pagerduty]`	Forward to PagerDuty incident management

Multiple channels can be active simultaneously. For example, you can print to console for debugging while also forwarding to your alert platform.

Console (default — for quick validation):

[notify.console]
enabled = true

WebAPI (push raw Event JSON to any HTTP endpoint):

[notify.webapi]
url = "https://your-service.example.com/api/v1/events"
# method = "POST"
# timeout = "10s"
[notify.webapi.headers]
Authorization = "Bearer ${WEBAPI_TOKEN}"

Flashduty:

[notify.flashduty]
integration_key = "your-integration-key"

PagerDuty:

[notify.pagerduty]
routing_key = "your-routing-key"

🤖 AI Diagnosis (optional)

Add to conf.d/config.toml:

[ai]
enabled = true
model_priority = ["default"]

[ai.models.default]
base_url = "https://api.openai.com/v1"
api_key = "${OPENAI_API_KEY}"
model = "gpt-4o"

Now when alerts fire, AI automatically analyzes root cause using built-in diagnostic tools.

💬 Interactive Chat

Ask questions like "Why is CPU high?" or "Check disk I/O latency" — the AI uses diagnostic tools and shell commands (with confirmation) to investigate.

🔗 MCP External Data Sources (optional)

Connect Prometheus, Jaeger, or other MCP servers for AI to query historical metrics, traces, etc.:

[ai.mcp]
enabled = true

[[ai.mcp.servers]]
name = "prometheus"
command = "/usr/local/bin/mcp-prometheus"
args = ["serve"]
identity = 'instance="${IP}:9100"'
[ai.mcp.servers.env]
PROMETHEUS_URL = "http://127.0.0.1:9090"

[[ai.mcp.servers]]
name = "nightingale"
command = "npx"
args = ["-y", "@n9e/n9e-mcp-server", "stdio"]
identity = 'ident="${HOSTNAME}"'
tools_allow = []
[ai.mcp.servers.env]
N9E_TOKEN = "480c04ed-ebe7-4266-xxxx-f8daf7819a6d"
N9E_BASE_URL = "http://127.0.0.1:17000"

Verify connectivity:

⚙️ Configuration

Global config: conf.d/config.toml
Plugin configs: conf.d/p.<plugin>/*.toml (multiple files merged on load)
Hot-reload plugin configs with SIGHUP:

kill -HUP $(pidof catpaw)

📚 Documentation

Document	Description
Developer Guide	Architecture overview and codebase walkthrough — read this first
CLI Reference	Complete command-line options
Deployment Guide	Binary, systemd, Docker deployment
Event Data Model	Event structure, labels, AlertKey rules
Plugin Development Guide	How to create a new catpaw plugin

💬 Community

WeChat: add picobyte and mention catpaw to join the group.

Absortio