
Agents

Understanding autonomous agents in OpsSquad.ai

An Agent is an autonomous AI entity that works as part of a squad. Agents are the core of OpsSquad.ai's capabilities, enabling AI-powered diagnostics, monitoring, and incident response through their linked nodes.

What is an Agent?

An agent is:

  • Autonomous - Can operate independently through linked nodes
  • AI-Powered - Understands natural language requests
  • Verified - Executes commands with permission and SLM guardrails
  • Squad-Based - Works as part of a team organized by purpose
  • Node-Linked - Connects to physical/virtual infrastructure for execution

Agent Capabilities

System Diagnostics

Agents can investigate system state:

  • Process monitoring (CPU, memory usage)
  • Disk space and I/O analysis
  • Network connectivity and traffic
  • Service health and status

Log Analysis

Agents can analyze logs:

  • Search for patterns and errors
  • Correlate events across time
  • Identify anomalies
  • Summarize recent activity

Service Management

With approval, agents can:

  • Check service status
  • Restart services
  • View container states
  • Manage processes

Custom Commands

Agents can execute authorized commands:

  • Run diagnostics scripts
  • Execute health checks
  • Gather metrics
  • Perform routine maintenance

Agent Lifecycle

Agents go through distinct states:

| State | Description | Node Requirement |
|-------|-------------|------------------|
| Created | Agent exists in squad, not linked | None |
| Linked | Connected to a node | Has node assignment |
| Active | Receiving and processing requests | Node must be online |
| Unlinked | Disconnected from node | None |
| Inactive | Suspended but linked | Node assignment preserved |
| Deleted | Removed from squad | None |

State Transitions

| From | To | How |
|------|----|-----|
| Created | Linked | Link to deployed node |
| Linked | Active | Node comes online |
| Active | Unlinked | Remove node link |
| Active | Inactive | Pause agent |
| Inactive | Active | Resume agent |
| Linked | Linked | Reassign to different node |
| Any | Deleted | Delete agent |
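
The transitions in the table above can be read as a small state machine. The following is a minimal sketch in Python, not OpsSquad code: the names `AgentState`, `TRANSITIONS`, and `can_transition` are hypothetical, and treating Deleted as a terminal state is an assumption that the table does not state explicitly.

```python
from enum import Enum


class AgentState(Enum):
    CREATED = "Created"
    LINKED = "Linked"
    ACTIVE = "Active"
    UNLINKED = "Unlinked"
    INACTIVE = "Inactive"
    DELETED = "Deleted"


# (from, to) -> how the transition is triggered, copied from the table.
TRANSITIONS = {
    (AgentState.CREATED, AgentState.LINKED): "Link to deployed node",
    (AgentState.LINKED, AgentState.ACTIVE): "Node comes online",
    (AgentState.ACTIVE, AgentState.UNLINKED): "Remove node link",
    (AgentState.ACTIVE, AgentState.INACTIVE): "Pause agent",
    (AgentState.INACTIVE, AgentState.ACTIVE): "Resume agent",
    (AgentState.LINKED, AgentState.LINKED): "Reassign to different node",
}


def can_transition(current: AgentState, target: AgentState) -> bool:
    """Return True if the table allows moving from current to target."""
    if current is AgentState.DELETED:
        # Assumption: a deleted agent cannot change state again.
        return False
    if target is AgentState.DELETED:
        # The "Any -> Deleted" row.
        return True
    return (current, target) in TRANSITIONS
```

For example, `can_transition(AgentState.CREATED, AgentState.ACTIVE)` is false under this reading, because a created agent must be linked before it can become active.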

Agent Architecture

Components

Agents are cloud-based AI entities that coordinate with nodes.

Agent-Node Relationship

  1. Agent Creation - Created in squad through dashboard
  2. Node Linking - Linked to a deployed node
  3. Command Flow - Agent sends commands to linked node
  4. Execution - Node executes commands on server
  5. Response - Results returned to agent for processing
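
The five steps above can be modeled in memory to make the division of labor concrete. This is an illustrative sketch only: the `Agent` and `Node` classes and the `executor` callable are hypothetical stand-ins, and the real platform sends commands over the network rather than through a direct method call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    node_id: str
    # Stand-in for command execution on the server (step 4).
    executor: Callable[[str, list], str]
    online: bool = True

    def execute(self, command: str, args: list) -> str:
        return self.executor(command, args)


@dataclass
class Agent:
    agent_id: str
    node: Optional[Node] = None  # step 1: created, not yet linked

    def link(self, node: Node) -> None:
        # Step 2: link the agent to a deployed node.
        self.node = node

    def run(self, command: str, args: list) -> str:
        # Step 3: the agent can only send commands through a linked, online node.
        if self.node is None:
            raise RuntimeError("agent is not linked to a node")
        if not self.node.online:
            raise RuntimeError("linked node is offline")
        # Step 5: the node's result is returned to the agent.
        return self.node.execute(command, args)
```

The two error branches correspond to the first two checks listed under "Agent Not Executing Commands".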

Node Execution Layer

The linked node provides:

  • MCP Shell Server for secure command execution
  • Security validation before execution
  • Resource management for processes
  • Output capture and streaming
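
As a rough illustration of "security validation before execution" followed by output capture, the sketch below checks a command against an allowlist and then runs it without a shell. The allowlist contents, the rejected characters, and the function names are all assumptions made for this example; the node's actual validation rules are not described on this page.

```python
import subprocess

# Hypothetical allowlist, for illustration only.
ALLOWED_COMMANDS = {"ps", "df", "uptime"}

# Characters that commonly signal shell injection attempts (assumed policy).
SUSPICIOUS_CHARS = ";|&$`\n"


def validate(command: str, args: list[str]) -> None:
    """Raise PermissionError unless the command is allowlisted and args look safe."""
    if command not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command}")
    for arg in args:
        if any(ch in arg for ch in SUSPICIOUS_CHARS):
            raise PermissionError(f"suspicious argument: {arg!r}")


def run_validated(command: str, args: list[str], timeout: float = 10.0) -> dict:
    """Validate first, then execute without a shell and capture the output."""
    validate(command, args)
    result = subprocess.run(
        [command, *args],
        capture_output=True,
        text=True,
        timeout=timeout,  # bounds how long a single command may run
    )
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
```

Passing the command as a list rather than a shell string means the arguments are never interpreted by a shell, which is why the validation step stays this small.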

Agent Types

Agents can be specialized for different roles within your squad:

| Type | Focus | Example Use Cases |
|------|-------|-------------------|
| System | OS-level diagnostics | Monitor CPU, memory, disk usage |
| Database | Database operations | Check DB health, query performance |
| Container | Container management | Monitor Docker/K8s deployments |
| Security | Security monitoring | Audit logs, check vulnerabilities |
| Custom | Specialized tasks | Your specific workflows |

Agent-Node Communication

Protocol

Nodes communicate with the platform using JSON over TCP:

{
  "type": "COMMAND",
  "node_id": "node_abc123",
  "agent_id": "agent_xyz789",
  "request_id": "req_123456",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": {
    "command": "ps",
    "args": ["aux", "--sort=-%cpu"]
  }
}
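
A minimal Python sketch of building and checking a message shaped like the example above. The helper names `build_command` and `parse_message` are hypothetical, and treating every field in the example as required is an assumption. This page does not say how messages are framed on the TCP stream, so framing is left out.

```python
import json
from datetime import datetime, timezone

# Assumption: every field shown in the example message is required.
REQUIRED_FIELDS = ("type", "node_id", "agent_id", "request_id", "timestamp", "payload")


def build_command(node_id: str, agent_id: str, request_id: str,
                  command: str, args: list) -> dict:
    """Build a COMMAND message with the same shape as the documented example."""
    return {
        "type": "COMMAND",
        "node_id": node_id,
        "agent_id": agent_id,
        "request_id": request_id,
        # UTC timestamp in the same format as the example.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": {"command": command, "args": list(args)},
    }


def parse_message(raw: str) -> dict:
    """Decode a JSON message and reject it if any expected field is missing."""
    message = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in message]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return message
```

Serializing with `json.dumps(build_command(...))` and decoding with `parse_message` should give back the same dictionary.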

Message Types

| Type | Direction | Purpose |
|------|-----------|---------|
| REGISTER | Node → Platform | Initial authentication |
| HEARTBEAT | Both | Keep-alive signal |
| COMMAND | Platform → Node | Command request from agent |
| RESPONSE | Node → Platform | Command result to agent |
| ERROR | Node → Platform | Error notification |
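
The direction column above can be encoded as a lookup, so a receiver can reject a message type arriving from the wrong side. This is a sketch under the assumption that direction is enforced this way; `Sender`, `ALLOWED_SENDERS`, and `is_valid_direction` are names invented for the example.

```python
from enum import Enum


class Sender(Enum):
    NODE = "node"
    PLATFORM = "platform"


# Which side may send each message type, taken from the table.
ALLOWED_SENDERS = {
    "REGISTER": {Sender.NODE},
    "HEARTBEAT": {Sender.NODE, Sender.PLATFORM},
    "COMMAND": {Sender.PLATFORM},
    "RESPONSE": {Sender.NODE},
    "ERROR": {Sender.NODE},
}


def is_valid_direction(message_type: str, sender: Sender) -> bool:
    """Return True if the sender is allowed to emit this message type."""
    # Unknown message types are rejected.
    return sender in ALLOWED_SENDERS.get(message_type, set())
```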

Security

All communication is:

  • Authenticated - Using node tokens
  • Validated - Request ID correlation
  • Timestamped - Replay attack prevention
  • Logged - Complete audit trail
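
Two of the properties above, request ID correlation and timestamp-based replay prevention, can be sketched as follows. The 30-second window, the `is_fresh` function, and the `PendingRequests` class are assumptions for illustration; the platform's actual tolerance and bookkeeping are not documented here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness window; the real value is not documented.
MAX_AGE = timedelta(seconds=30)


def is_fresh(timestamp: str, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the message timestamp is within max_age of now."""
    sent = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return abs(now - sent) <= max_age


class PendingRequests:
    """Tracks issued request IDs so each response is accepted at most once."""

    def __init__(self) -> None:
        self._pending = set()

    def sent(self, request_id: str) -> None:
        self._pending.add(request_id)

    def accept(self, response: dict) -> bool:
        """Accept a response only if it matches a request that is still pending."""
        request_id = response.get("request_id")
        if request_id in self._pending:
            self._pending.remove(request_id)
            return True
        return False
```

Removing the ID on first acceptance means a replayed copy of the same response is rejected even if its timestamp is still inside the window.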

Resource Usage

Agent resource consumption:

| Component | Resource Usage |
|-----------|----------------|
| Agent (Cloud) | Serverless, scales automatically |
| Node (Server) | 30-50 MB memory, <1% CPU idle |
| Network | Minimal (heartbeats + commands) |
| Disk | Config + logs on server |

While a node executes a command, its resource usage rises temporarily by an amount that depends on the command.

Best Practices

Naming

Use clear, descriptive names for agents:

  • Include role: Database Monitor, Web Server Guardian
  • Include environment: Production API Agent, Staging DB Agent
  • Be descriptive: Security Audit Agent, Log Analysis Agent

Example: Production Web Server Monitor

Squad Organization

Organize agents effectively:

  • Group by environment (Production Squad, Staging Squad)
  • Group by function (DevOps Squad, Security Squad)
  • Create specialized squads for different teams

Node Linking

Best practices for linking agents to nodes:

  • Link agents to nodes that match their purpose
  • An agent can be linked to only one node at a time
  • Unlink and relink to reassign agents as needed
  • Monitor node status to ensure agents can execute

Security

Protect your agents and nodes:

  • Use unique tokens per node
  • Rotate node tokens periodically
  • Review audit logs for all commands
  • Use minimal permissions
  • Only approve necessary commands

Troubleshooting Agents

Agent Not Executing Commands

  1. Check if agent is linked to a node
  2. Verify linked node is online
  3. Check node connectivity to socket.opssquad.ai:9000
  4. Review node logs for errors
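
For step 3, one quick way to test reachability from the server that hosts the node is to open a TCP connection to the endpoint. The sketch below uses only the Python standard library; `can_reach` is a hypothetical helper, and a successful connection shows only that the port is reachable, not that the node's token or registration is valid.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call, run on the server where the node is installed:
# can_reach("socket.opssquad.ai", 9000)
```

If this returns false, firewall rules or outbound proxy settings on the server are a reasonable next place to look.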

Agent Slow to Respond

  1. Check linked node's resource utilization
  2. Verify network latency between node and platform
  3. Review command complexity
  4. Check for competing processes on node

Need to Reassign Agent

  1. Go to agent details in dashboard
  2. Click "Unlink Node"
  3. Select a different online node
  4. Click "Link Node"

Next Steps

  • Squads - Learn how to organize agents into squads.
  • Security - Understand the security model protecting agents.
  • Managing Agents - Manage agents through the dashboard.
