Agents
Understanding autonomous agents in OpsSquad.ai
An Agent is an autonomous AI entity that works as part of a squad. Agents are the core of OpsSquad.ai's capabilities, enabling AI-powered diagnostics, monitoring, and incident response through their linked nodes.
What is an Agent?
An agent is:
- Autonomous - Can operate independently through linked nodes
- AI-Powered - Understands natural language requests
- Verified - Executes commands with permission and SLM guardrails
- Squad-Based - Works as part of a team organized by purpose
- Node-Linked - Connects to physical/virtual infrastructure for execution
Agent Capabilities
System Diagnostics
Agents can investigate system state:
- Process monitoring (CPU, memory usage)
- Disk space and I/O analysis
- Network connectivity and traffic
- Service health and status
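Process monitoring usually starts from raw command output, such as the `ps aux --sort=-%cpu` request used in the protocol example on this page. The sketch below shows one way that output could be turned into structured records. `ProcessInfo` and `parse_ps_aux` are hypothetical names used only for illustration, not part of OpsSquad.ai, and the parser assumes the standard 11-column `ps aux` layout.

```python
from dataclasses import dataclass


@dataclass
class ProcessInfo:
    user: str
    pid: int
    cpu_percent: float
    mem_percent: float
    command: str


def parse_ps_aux(output: str, limit: int = 5) -> list:
    """Parse `ps aux` text into at most `limit` ProcessInfo records.

    Assumes the output was produced with `--sort=-%cpu`, so the first
    rows are already the heaviest CPU consumers.
    """
    processes = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # ps aux has 11 columns; the last one (COMMAND) may contain spaces,
        # so split at most 10 times and keep the remainder intact.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # ignore malformed or truncated rows
        processes.append(ProcessInfo(
            user=fields[0],
            pid=int(fields[1]),
            cpu_percent=float(fields[2]),
            mem_percent=float(fields[3]),
            command=fields[10],
        ))
        if len(processes) == limit:
            break
    return processes
```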
Log Analysis
Agents can analyze logs:
- Search for patterns and errors
- Correlate events across time
- Identify anomalies
- Summarize recent activity
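As an illustration of pattern search and time correlation, the sketch below counts error-level lines per minute and flags minutes that reach a threshold. It assumes a hypothetical log format of `<ISO-8601 timestamp> <LEVEL> <message>`. Real logs vary, and the threshold is an arbitrary example, not an OpsSquad.ai default.

```python
import re
from collections import Counter

# Hypothetical format: "2024-01-15T10:30:05Z ERROR message text".
# The "minute" group keeps the timestamp truncated to the minute.
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def errors_per_minute(lines, levels=("ERROR", "CRITICAL")):
    """Count lines at the given levels, bucketed by minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") in levels:
            counts[match.group("minute")] += 1
    return counts


def spikes(counts, threshold):
    """Return the minutes whose error count is at least `threshold`."""
    return sorted(minute for minute, n in counts.items() if n >= threshold)
```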
Service Management
With approval, agents can:
- Check service status
- Restart services
- View container states
- Manage processes
Custom Commands
Agents can execute authorized commands:
- Run diagnostics scripts
- Execute health checks
- Gather metrics
- Perform routine maintenance
Agent Lifecycle
Agents go through distinct states:
| State | Description | Node Requirement |
|---|---|---|
| Created | Agent exists in squad, not linked | None |
| Linked | Connected to a node | Has node assignment |
| Active | Receiving and processing requests | Node must be online |
| Unlinked | Disconnected from node | None |
| Inactive | Suspended but linked | Node assignment preserved |
| Deleted | Removed from squad | None |
State Transitions
| From | To | How |
|---|---|---|
| Created | Linked | Link to deployed node |
| Linked | Active | Node comes online |
| Active | Unlinked | Remove node link |
| Active | Inactive | Pause agent |
| Inactive | Active | Resume agent |
| Linked | Linked | Reassign to different node |
| Any | Deleted | Delete agent |
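The transition table can be read as a small state machine. The sketch below encodes exactly the rows listed above and rejects any other move. `AgentState` and `transition` are hypothetical names, not an OpsSquad.ai API. Treating Deleted as terminal is an assumption, and because the table lists no transition out of Unlinked, relinking an unlinked agent would need an additional entry.

```python
from enum import Enum


class AgentState(Enum):
    CREATED = "Created"
    LINKED = "Linked"
    ACTIVE = "Active"
    UNLINKED = "Unlinked"
    INACTIVE = "Inactive"
    DELETED = "Deleted"


# (from, to) -> how, copied from the transition table.
# "Any -> Deleted" is handled in transition() instead of listed here.
TRANSITIONS = {
    (AgentState.CREATED, AgentState.LINKED): "Link to deployed node",
    (AgentState.LINKED, AgentState.ACTIVE): "Node comes online",
    (AgentState.ACTIVE, AgentState.UNLINKED): "Remove node link",
    (AgentState.ACTIVE, AgentState.INACTIVE): "Pause agent",
    (AgentState.INACTIVE, AgentState.ACTIVE): "Resume agent",
    (AgentState.LINKED, AgentState.LINKED): "Reassign to different node",
}


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return `target` if the move is documented, otherwise raise ValueError."""
    if current is AgentState.DELETED:
        # Assumption: a deleted agent cannot change state again.
        raise ValueError("deleted agents cannot change state")
    if target is AgentState.DELETED or (current, target) in TRANSITIONS:
        return target
    raise ValueError(f"no documented transition {current.value} -> {target.value}")
```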
Agent Architecture
Components
Agents are cloud-based AI entities that coordinate with nodes.
Agent-Node Relationship
- Agent Creation - Created in squad through dashboard
- Node Linking - Linked to a deployed node
- Command Flow - Agent sends commands to linked node
- Execution - Node executes commands on server
- Response - Results returned to agent for processing
Node Execution Layer
The linked node provides:
- MCP Shell Server for secure command execution
- Security validation before execution
- Resource management for processes
- Output capture and streaming
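To make the node's responsibilities concrete, here is a minimal sketch of validating a command against an allowlist, running it without a shell, and capturing its output. It is not the MCP Shell Server's implementation. The allowlist, the 30-second timeout, and the result shape are all assumptions, and this version captures output after the process exits rather than streaming it.

```python
import subprocess

# Hypothetical allowlist; the node's real security policy is not documented here.
ALLOWED_COMMANDS = {"ps", "df", "uptime", "free", "echo"}


def validate(command: str) -> None:
    """Reject commands outside the allowlist before anything runs."""
    if command not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command}")


def execute(command: str, args: list, timeout: float = 30.0) -> dict:
    """Validate, run without a shell, and capture stdout/stderr."""
    validate(command)
    # Passing an argument list (no shell=True) means characters such as
    # ";" or "&&" inside args are passed literally, never interpreted.
    result = subprocess.run(
        [command, *args],
        capture_output=True,
        text=True,
        timeout=timeout,  # wall-clock limit; raises subprocess.TimeoutExpired
    )
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
```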
Agent Types
Agents can be specialized for different roles within your squad:
| Type | Focus | Example Use Cases |
|---|---|---|
| System | OS-level diagnostics | Monitor CPU, memory, disk usage |
| Database | Database operations | Check DB health, query performance |
| Container | Container management | Monitor Docker/K8s deployments |
| Security | Security monitoring | Audit logs, check vulnerabilities |
| Custom | Specialized tasks | Your specific workflows |
Agent-Node Communication
Protocol
Nodes communicate with the platform using JSON over TCP:
```json
{
  "type": "COMMAND",
  "node_id": "node_abc123",
  "agent_id": "agent_xyz789",
  "request_id": "req_123456",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": {
    "command": "ps",
    "args": ["aux", "--sort=-%cpu"]
  }
}
```
Message Types
| Type | Direction | Purpose |
|---|---|---|
| REGISTER | Node → Platform | Initial authentication |
| HEARTBEAT | Both | Keep-alive signal |
| COMMAND | Platform → Node | Command request from agent |
| RESPONSE | Node → Platform | Command result to agent |
| ERROR | Node → Platform | Error notification |
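The sketch below builds a COMMAND message with the same fields as the JSON example above and checks incoming messages against the listed types. It covers only JSON encoding and decoding; how messages are framed on the TCP stream is not described on this page, so no framing is assumed. The `req_` plus random-hex request ID is a hypothetical format modeled on the example values.

```python
import json
import uuid
from datetime import datetime, timezone

MESSAGE_TYPES = {"REGISTER", "HEARTBEAT", "COMMAND", "RESPONSE", "ERROR"}


def build_command(node_id: str, agent_id: str, command: str, args: list) -> str:
    """Serialize a COMMAND message with the fields from the example."""
    message = {
        "type": "COMMAND",
        "node_id": node_id,
        "agent_id": agent_id,
        # Hypothetical ID scheme; the platform's real format may differ.
        "request_id": f"req_{uuid.uuid4().hex[:12]}",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": {"command": command, "args": args},
    }
    return json.dumps(message)


def parse_message(raw: str) -> dict:
    """Decode a message and reject types not in the message-type table."""
    message = json.loads(raw)
    if message.get("type") not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {message.get('type')!r}")
    return message
```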
Security
All communication is:
- Authenticated - Using node tokens
- Validated - Request ID correlation
- Timestamped - Replay attack prevention
- Logged - Complete audit trail
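Request IDs and timestamps together make a simple replay check possible: reject messages that are too old, and reject any request ID already seen within the accepted window. The sketch below shows that idea under stated assumptions (a 5-minute window, an in-memory cache, and timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form used in the protocol example). It is not OpsSquad.ai's actual policy, and it does not cover token authentication.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


class ReplayGuard:
    """Reject stale or duplicate messages (hypothetical policy)."""

    def __init__(self, max_age: timedelta = timedelta(minutes=5)):
        self.max_age = max_age
        self.seen: Dict[str, datetime] = {}  # request_id -> sent time

    def accept(self, message: dict, now: Optional[datetime] = None) -> bool:
        now = now or datetime.now(timezone.utc)
        sent = datetime.strptime(
            message["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        # Forget IDs older than the window; replays of those fail the age check.
        self.seen = {
            rid: t for rid, t in self.seen.items() if now - t <= self.max_age
        }
        if abs(now - sent) > self.max_age:
            return False  # too old, or too far in the future
        if message["request_id"] in self.seen:
            return False  # duplicate request ID inside the window
        self.seen[message["request_id"]] = sent
        return True
```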
Resource Usage
Agent resource consumption:
| Component | Resource Usage |
|---|---|
| Agent (Cloud) | Serverless, scales automatically |
| Node (Server) | 30-50 MB memory, <1% CPU when idle |
| Network | Minimal (heartbeats + commands) |
| Disk | Config + logs on server |
When nodes execute commands, usage temporarily increases based on the command.
Best Practices
Naming
Use clear, descriptive names for agents:
- Include role: Database Monitor, Web Server Guardian
- Include environment: Production API Agent, Staging DB Agent
- Be descriptive: Security Audit Agent, Log Analysis Agent
Example: Production Web Server Monitor
Squad Organization
Organize agents effectively:
- Group by environment (Production Squad, Staging Squad)
- Group by function (DevOps Squad, Security Squad)
- Create specialized squads for different teams
Node Linking
Best practices for linking agents to nodes:
- Link agents to nodes that match their purpose
- An agent can be linked to only one node at a time
- Unlink and relink to reassign agents as needed
- Monitor node status to ensure agents can execute
Security
Protect your agents and nodes:
- Use unique tokens per node
- Rotate node tokens periodically
- Review audit logs for all commands
- Use minimal permissions
- Only approve necessary commands
Troubleshooting Agents
Agent Not Executing Commands
- Check if agent is linked to a node
- Verify linked node is online
- Check node connectivity to socket.opssquad.ai:9000
- Review node logs for errors
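The connectivity check can be scripted from the node itself. The sketch below opens a plain TCP connection to the documented endpoint and reports how long the connection took, which also gives a rough latency figure for the checks under Agent Slow to Respond. It only proves the port is reachable; it does not authenticate or speak the node protocol. The function name and 5-second timeout are illustrative assumptions.

```python
import socket
import time
from typing import Optional


def tcp_latency(host: str = "socket.opssquad.ai", port: int = 9000,
                timeout: float = 5.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return None
```

A result of `None` points to DNS, firewall, or routing problems between the node and the platform, while a consistently high value is worth comparing against the node's normal network latency.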
Agent Slow to Respond
- Check linked node's resource utilization
- Verify network latency between node and platform
- Review command complexity
- Check for competing processes on node
Need to Reassign Agent
- Go to agent details in dashboard
- Click "Unlink Node"
- Select a different online node
- Click "Link Node"
Next Steps
- Squads - Learn how to organize agents into squads.
- Security - Understand the security model protecting agents.
- Managing Agents - Manage agents through the dashboard.