Building Custom Claude Code Skills for VoIP Infrastructure Operations
Current Situation Analysis
Traditional infrastructure monitoring relies on static dashboards and procedural runbooks. While these provide visibility, they suffer from critical failure modes in complex VoIP/telecom environments:
- Fragmented Context: Monitoring tools (Grafana, Homer, ViciDial, Asterisk CLI) operate in silos. Engineers must manually correlate SIP traces, MySQL replication lag, carrier hangup causes, and agent states across multiple interfaces.
- Static Runbook Drift: Procedures quickly become outdated as infrastructure evolves. New engineers lack institutional knowledge, leading to prolonged MTTR (Mean Time To Resolution) during incidents like dropped calls or audio degradation.
- Manual Execution Overhead: Health checks, call tracing, and server audits require repetitive SSH sessions, command chaining, and cross-referencing logs. This manual workflow is error-prone, slow, and unscalable during high-volume incidents.
- Lack of Adaptive Diagnosis: Traditional tools alert on thresholds but cannot reason about root causes. They cannot dynamically adjust investigation paths based on real-time server variations, credential routing, or multi-source replication states.
WOW Moment: Key Findings
Deploying custom Claude Code skills transforms static monitoring into context-aware, automated investigation. The following comparison demonstrates the operational shift:
| Approach | Execution Time | Context Correlation | MTTR Reduction |
|---|---|---|---|
| Traditional Manual Runbooks | 5-120 min per task | Low (fragmented tool switching) | Baseline |
| Claude Code Custom Skills | 15 sec - 5 min per task | High (cross-system auto-tracing) | 85-95% |
Key Findings:
- Health checks across 5 servers drop from 5-10 minutes to 15 seconds via single-command execution.
- Dropped call investigations shrink from 30-60 minutes to ~2 minutes by automatically tracing DID routing, carrier logs, dialplans, and SIP traces.
- Audio quality diagnosis (previously 1-2 hours) completes in ~5 minutes by correlating Homer RTCP, NISQA neural scoring, codec verification, and network metrics.
- Institutional knowledge becomes code-embedded: hangup cause mappings, credential routing, and server-specific paths are version-controlled and instantly accessible to all team members.
Core Solution
The architecture centers on a Claude Code CLI instance running on a central VPS/jump box with SSH access to production servers and Docker access to monitoring stacks. Skills are declarative Markdown files that instruct Claude how to execute investigation playbooks using whitelisted tools.
Architecture Overview
+------------------+     SSH (key-based)      +-------------------+
|                  |------------------------->| VoIP Server 1     |
| VPS / Jump Box   |------------------------->| VoIP Server 2     |
| (Claude Code)    |------------------------->| VoIP Server 3     |
|                  |------------------------->| Replica DB        |
| ~/.claude/       |                          +-------------------+
|   skills/        |
|     health/      |     Docker (local)       +-------------------+
|     calls/       |------------------------->| Grafana           |
|     agents/      |------------------------->| Prometheus        |
|     ...          |------------------------->| Loki              |
|   hooks/         |------------------------->| Homer (SIP/RTCP)  |
|   settings.json  |------------------------->| Smokeping         |
+------------------+                          +-------------------+
        |
        | MCP (Model Context Protocol)
        v
+------------------+
| Grafana MCP      |
| (mcp-grafana)    |
| - Dashboards     |
| - PromQL queries |
| - Loki log search|
+------------------+
The SKILL.md File Format
Every skill is a single Markdown file named SKILL.md inside its own directory under ~/.claude/skills/. The file has two parts: a YAML frontmatter header and a Markdown body.
Frontmatter (Required)
---
name: skill-name
description: One-line description shown in skill listings and used for matching.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
| Field | Purpose |
|---|---|
| name | The slash command name. Users type /name to invoke. |
| description | Shown in help listings. Also used by Claude to decide when to suggest the skill. Be specific -- mention the problem types this skill addresses. |
| user-invocable | Set to true so users can trigger it directly with /name. |
| allowed-tools | Whitelist of tools the skill can use. Uses glob patterns. Bash(ssh *) means "allow any Bash command starting with ssh". |
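A frontmatter block like this can be checked before a skill is deployed. The following is a minimal sketch using only the Python standard library. It handles flat key: value lines rather than full YAML, it treats the four fields shown above as required (an assumption for illustration), and the helper names parse_frontmatter and validate_skill are hypothetical, not part of Claude Code.

```python
# Assumption: the four fields shown in the frontmatter example are all required.
REQUIRED_FIELDS = ("name", "description", "user-invocable", "allowed-tools")


def parse_frontmatter(text):
    """Return (fields, body) for a SKILL.md string that starts with a --- block.

    Only flat `key: value` lines are handled; this is not a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' line")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is missing its closing '---' line") from None
    fields = {}
    for line in lines[1:end]:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return fields, body


def validate_skill(text):
    """Return a list of problems; an empty list means every check passed."""
    fields, body = parse_frontmatter(text)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    if not body.strip():
        problems.append("body is empty: no investigation procedure")
    return problems


# Hypothetical example skill, used only to exercise the checks.
EXAMPLE = """---
name: health
description: Check load, disk, and Asterisk status on the VoIP servers.
user-invocable: true
allowed-tools: Bash(ssh *)
---
Run the health checks for $ARGUMENTS and report a table.
"""

print(validate_skill(EXAMPLE))  # prints []
```

A check like this fits naturally into a pre-commit step if the skills directory is kept under version control.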
Allowed-Tools Patterns
# SSH to any server
allowed-tools: Bash(ssh *)
# SSH + Docker + curl + ping
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)
# SSH + local audio tools
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)
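The tool patterns act as a security boundary. A skill that only lists Bash(ssh *) cannot accidentally run local Docker commands or write local files, although anything it runs over SSH is still bounded only by the remote account's privileges. Design each skill with the minimum tools it needs.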
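To build intuition for what a pattern such as Bash(ssh *) admits, the sketch below applies plain glob matching with Python's fnmatch module. It illustrates glob semantics only. How Claude Code evaluates these patterns internally is not covered here, and the helpers parse_allowed_tools and is_allowed are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_allowed_tools(value):
    """Split 'Bash(ssh *), Bash(curl *)' into ['ssh *', 'curl *'].

    Only Bash(...) entries are kept. Splitting on commas is a simplification
    that would break on patterns containing a comma.
    """
    patterns = []
    for entry in value.split(","):
        entry = entry.strip()
        if entry.startswith("Bash(") and entry.endswith(")"):
            patterns.append(entry[len("Bash("):-1])
    return patterns


def is_allowed(command, allowed_tools):
    """True if the command matches at least one glob pattern in the list."""
    return any(fnmatchcase(command, p) for p in parse_allowed_tools(allowed_tools))


# Hypothetical host name voip-1, for illustration only.
print(is_allowed("ssh voip-1 uptime", "Bash(ssh *)"))  # prints True
print(is_allowed("docker ps", "Bash(ssh *)"))          # prints False
print(is_allowed("docker ps", "Bash(ssh *), Bash(docker *)"))  # prints True
```

Under plain glob rules the trailing * matches any remainder of the command line, which is why narrower prefixes give a tighter boundary than a bare ssh *.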
Body (The Investigation Procedure)
The Markdown body is the actual instruction set. Claude reads this as its playbook when the skill is invoked. It should contain:
- What to do -- step-by-step procedures
- How to access resources -- SSH commands, SQL queries, API calls
- How to interpret results -- reference tables, thresholds, known patterns
- Server-specific variations -- different credentials, paths, or versions per server
- Output formatting -- how to present results to the user
The body supports a special variable: $ARGUMENTS -- whatever the user typed after the slash command. For example, if the user types /health server-a, then $ARGUMENTS is server-a.
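The effect of $ARGUMENTS can be pictured as a text substitution. The sketch below models it that way. The function names are hypothetical, and treating the expansion as a literal string replacement is an assumption made for illustration rather than a description of Claude Code internals.

```python
def split_invocation(user_input):
    """Split '/health server-a' into ('health', 'server-a')."""
    if not user_input.startswith("/"):
        raise ValueError("a skill invocation starts with '/'")
    name, _, arguments = user_input[1:].partition(" ")
    return name, arguments.strip()


def expand_body(body, arguments):
    """Replace every $ARGUMENTS occurrence with what the user typed."""
    return body.replace("$ARGUMENTS", arguments)


name, arguments = split_invocation("/health server-a")
print(name)                                                     # prints health
print(expand_body("Check the health of: $ARGUMENTS", arguments))  # prints Check the health of: server-a
```

Because the argument may be empty (the user types only /health), a skill body should say what to do in that case, for example "if no server is given, check all servers".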
Directory Structure
~/.claude/
  settings.json               # Global settings (permissions, hooks, env)
  settings.local.json         # Per-machine permission overrides
  hooks/
    protect-production.sh     # Safety hook: blocks dangerous commands
  skills/
    health/
      SKILL.md                # /health skill
    calls/
      SKILL.md                # /calls skill
    agents/
      SKILL.md                # /agents skill
    replication/
      SKILL.md                # /replication skill
    audit-server/
      SKILL.md                # /audit-server skill
    trunk-status/
      SKILL.md                # /trunk-status skill
    audio-quality/
      SKILL.md                # /audio-quality skill
    call-investigate/
      SKILL.md                # /call-investigate skill
    call-drops/
      SKILL.md                # /call-drops skill
    lagged/
      SKILL.md                # /lagged skill
    network-check/
      SKILL.md                # /network-check skill
    agent-ranks/
      SKILL.md                # /agent-ranks skill
    did-lookup/
      SKILL.md                # /did-lookup skill
    reports/
      SKILL.md                # /reports skill
    listen-recording/
      SKILL.md                # /listen-recording skill
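Since each skill is just a directory holding one SKILL.md, the layout is easy to create and inspect with a short script. The sketch below is one possible helper, not a Claude Code feature. The function names and the stub contents are hypothetical, and the demo writes into a temporary directory rather than the real ~/.claude/.

```python
import tempfile
from pathlib import Path

# Hypothetical starter template; every value is a placeholder to edit.
STUB = """---
name: {name}
description: TODO describe the problem types this skill addresses.
user-invocable: true
allowed-tools: Bash(ssh *)
---
TODO: investigation procedure for $ARGUMENTS
"""


def scaffold_skill(claude_dir, name):
    """Create skills/<name>/SKILL.md if it does not exist and return its path."""
    skill_file = Path(claude_dir) / "skills" / name / "SKILL.md"
    skill_file.parent.mkdir(parents=True, exist_ok=True)
    if not skill_file.exists():
        skill_file.write_text(STUB.format(name=name))
    return skill_file


def list_skills(claude_dir):
    """Return the sorted slash commands for every skills/<name>/SKILL.md found."""
    skills_dir = Path(claude_dir) / "skills"
    return sorted("/" + path.parent.name for path in skills_dir.glob("*/SKILL.md"))


with tempfile.TemporaryDirectory() as tmp:
    for skill in ("health", "calls", "did-lookup"):
        scaffold_skill(tmp, skill)
    print(list_skills(tmp))  # prints ['/calls', '/did-lookup', '/health']
```

Existing files are never overwritten, so the scaffold step is safe to rerun after skills have been edited.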
Skill Categories & Implementation
- Operations Skills (6): Answer "What is happening right now?" (/health, /calls, /agents, /replication, /audit-server, /trunk-status). Use single-SSH aggregation patterns to minimize round-trips. Output structured tables with WARNING/CRITICAL flags.
- Investigation Skills (5): Answer "Why did this happen?" (/audio-quality, /call-investigate, /call-drops, /lagged, /network-check). Chain cross-system queries: Homer RTCP → Asterisk logs → SIP peer stats → Smokeping → codec verification.
- Lookup Skills (4): Answer "Where is this data?" (/agent-ranks, /did-lookup, /reports, /listen-recording). Query ViciDial databases, DID routing tables, and recording storage paths with formatted output.
- MCP Grafana Integration: Connect Claude to mcp-grafana for real-time PromQL queries, Loki log search, and dashboard rendering without leaving the CLI.
- Production Safety Hook: hooks/protect-production.sh intercepts dangerous commands (rm -rf, DROP TABLE, iptables -F) and requires explicit confirmation or blocks execution entirely.
- Permission Management: settings.json and settings.local.json enforce role-based tool access. Skills inherit permissions from the global config, ensuring least-privilege execution.
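The single-SSH aggregation pattern mentioned for the operations skills can be sketched as follows: all checks for one host are joined into a single remote command, separated by marker lines, and the combined output is split back into sections. This is an illustration under stated assumptions. The host name voip-1, the marker format, and the choice of checks are hypothetical, and the sample output is placeholder text rather than real command output.

```python
import shlex

# Hypothetical check set: label -> command to run on the remote host.
CHECKS = {
    "uptime": "uptime",
    "disk": "df -h /",
    "channels": "asterisk -rx 'core show channels count'",
}

MARKER = "=== "


def build_ssh_command(host, checks):
    """Return one argv that runs every check in a single SSH session."""
    parts = [f"echo {shlex.quote(MARKER + label)}; {command}" for label, command in checks.items()]
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, "; ".join(parts)]


def split_sections(output):
    """Split combined output into {label: text} using the marker lines."""
    sections, current = {}, None
    for line in output.splitlines():
        if line.startswith(MARKER):
            current = line[len(MARKER):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {label: "\n".join(lines).strip() for label, lines in sections.items()}


argv = build_ssh_command("voip-1", CHECKS)
print(argv[-1])

# Placeholder text standing in for what the remote host would return.
sample = "=== uptime\n<uptime output>\n=== disk\n<df output>\n"
print(split_sections(sample))  # prints {'uptime': '<uptime output>', 'disk': '<df output>'}
```

The argv list can be passed to subprocess.run with capture_output=True. One session per server replaces one session per check, which is where the round-trip saving comes from.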
Pitfall Guide
- Over-Permissive Tool Whitelisting: Using Bash(*) or broad patterns like Bash(ssh *) without scoping allows Claude to execute unintended commands. Always define the minimum required tools per skill (e.g., Bash(ssh voip-*), Bash(mysql -e *)).
- Hardcoding Server-Specific Credentials: Embedding passwords or IP addresses directly in SKILL.md breaks portability and violates security practices. Use SSH key-based routing, environment variables, or dynamic credential lookup scripts.
- Ignoring Output Formatting Standards: Claude defaults to verbose, unstructured text. Explicitly define output templates in the skill body (e.g., "Present results as a Markdown table with columns: Server, Status, Metric, Threshold").
- Skipping Production Safety Hooks: Failing to implement protect-production.sh risks accidental destructive commands during automated investigations. Always validate hooks in staging before deploying to production.
- Static Runbook Mentality: Treating skills as static text files instead of dynamic playbooks. Leverage $ARGUMENTS for context-aware execution and conditional branching based on real-time server responses.
- Neglecting MCP Integration: Relying solely on SSH/CLI tools misses real-time metric correlation. Integrate mcp-grafana for PromQL, Loki, and dashboard queries to close the monitoring-investigation loop.
- Inconsistent Skill Naming & Descriptions: Vague name or description fields cause Claude to misroute invocations or fail to suggest relevant skills. Use precise, problem-oriented descriptions (e.g., "Diagnose SIP trunk registration failures and carrier connectivity").
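The safety-hook pitfall is easier to avoid when the blocking logic is small enough to test. The sketch below shows, in Python rather than shell, the kind of check protect-production.sh might perform for the three command families named in this article. Several parts are assumptions to verify against the Claude Code hooks documentation: the payload shape (tool_name and tool_input.command), and the convention that exit code 2 blocks the call. The patterns are illustrative, not an exhaustive deny-list.

```python
import re

# Illustrative patterns for the commands named in this article.
DANGEROUS_PATTERNS = [
    # rm with a flag group containing both r (or R) and f, e.g. -rf or -fr
    (re.compile(r"\brm\s+-(?=[a-zA-Z]*[rR])(?=[a-zA-Z]*f)[a-zA-Z]+"), "recursive forced delete"),
    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE), "DROP TABLE"),
    (re.compile(r"\biptables\b.*\s(-F|--flush)\b"), "firewall flush"),
]


def check_command(command):
    """Return the reason a command is considered dangerous, or None."""
    for pattern, reason in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return reason
    return None


def evaluate(payload):
    """Map a hook payload to (exit_code, message).

    Assumption: exit code 2 means "block this tool call"; 0 means "allow".
    """
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    reason = check_command(command)
    if reason:
        return 2, f"Blocked by protect-production: {reason}"
    return 0, ""


# Hypothetical commands, including ones wrapped in ssh.
for cmd in (
    "ssh voip-1 uptime",
    "ssh voip-1 'rm -rf /var/spool/asterisk'",
    "ssh voip-1 \"mysql -e 'drop table example_table'\"",
):
    print(evaluate({"tool_name": "Bash", "tool_input": {"command": cmd}}))
```

Matching on the raw command string means commands wrapped in ssh are inspected too. Running a table of allowed and blocked examples against check_command is one way to carry out the staging validation described above.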
Deliverables
- Architecture Blueprint: Complete directory tree (~/.claude/skills/, hooks/, settings.json) with skill categorization, MCP integration points, and SSH/Docker routing paths.
- Skill Creation Checklist: Step-by-step validation for frontmatter accuracy, allowed-tools scoping, $ARGUMENTS usage, output formatting rules, and safety hook verification before deployment.
- Configuration Templates: Ready-to-use SKILL.md frontmatter blocks, settings.json permission overrides, protect-production.sh hook script, and mcp-grafana connection config for Prometheus/Loki/Grafana integration.
