Difficulty: Intermediate
Read Time: 6 min

Building Custom Claude Code Skills for VoIP Infrastructure Operations

By Codcompass Team · 6 min read

Current Situation Analysis

Traditional infrastructure monitoring relies on static dashboards and procedural runbooks. While these provide visibility, they suffer from critical failure modes in complex VoIP/telecom environments:

  • Fragmented Context: Monitoring tools (Grafana, Homer, ViciDial, Asterisk CLI) operate in silos. Engineers must manually correlate SIP traces, MySQL replication lag, carrier hangup causes, and agent states across multiple interfaces.
  • Static Runbook Drift: Procedures quickly become outdated as infrastructure evolves. New engineers lack institutional knowledge, leading to prolonged MTTR (Mean Time To Resolution) during incidents like dropped calls or audio degradation.
  • Manual Execution Overhead: Health checks, call tracing, and server audits require repetitive SSH sessions, command chaining, and cross-referencing logs. This manual workflow is error-prone, slow, and unscalable during high-volume incidents.
  • Lack of Adaptive Diagnosis: Traditional tools alert on thresholds but cannot reason about root causes. They cannot dynamically adjust investigation paths based on real-time server variations, credential routing, or multi-source replication states.

WOW Moment: Key Findings

Deploying custom Claude Code skills transforms static monitoring into context-aware, automated investigation. The following comparison demonstrates the operational shift:

| Approach | Execution Time | Context Correlation | MTTR Reduction |
| --- | --- | --- | --- |
| Traditional Manual Runbooks | 30-120 min per task | Low (fragmented tool switching) | Baseline |
| Claude Code Custom Skills | 15 sec - 5 min per task | High (cross-system auto-tracing) | 85-95% |

Key Findings:

  • Health checks across 5 servers drop from 5-10 minutes to 15 seconds via single-command execution.
  • Dropped call investigations shrink from 30-60 minutes to ~2 minutes by automatically tracing DID routing, carrier logs, dialplans, and SIP traces.
  • Audio quality diagnosis (previously 1-2 hours) completes in ~5 minutes by correlating Homer RTCP, NISQA neural scoring, codec verification, and network metrics.
  • Institutional knowledge becomes code-embedded: hangup cause mappings, credential routing, and server-specific paths are version-controlled and instantly accessible to all team members.
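
The per-task figures above can be sanity-checked with simple arithmetic. The sketch below uses only the durations quoted in the findings and computes the time saved per task. These are task-time reductions, which is a narrower measure than the MTTR column in the table, so the two ranges are not expected to match exactly.

```python
# Manual (low, high) and skill-based durations in seconds, from the findings above.
TASKS = {
    "health check (5 servers)": ((5 * 60, 10 * 60), 15),
    "dropped call investigation": ((30 * 60, 60 * 60), 2 * 60),
    "audio quality diagnosis": ((60 * 60, 120 * 60), 5 * 60),
}


def reduction_pct(before_s: float, after_s: float) -> float:
    """Percentage of time saved relative to the manual duration."""
    return round(100 * (1 - after_s / before_s), 1)


def reduction_range(before_range, after_s):
    """Time saved for the low and high end of the manual estimate."""
    low, high = before_range
    return reduction_pct(low, after_s), reduction_pct(high, after_s)


for task, (before_range, after_s) in TASKS.items():
    low_pct, high_pct = reduction_range(before_range, after_s)
    print(f"{task}: {low_pct}% to {high_pct}% less time")
```

With the quoted durations, every task lands between roughly 92% and 98% less hands-on time.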

Core Solution

The architecture centers on a Claude Code CLI instance running on a central VPS/jump box with SSH access to production servers and Docker access to monitoring stacks. Skills are declarative Markdown files that instruct Claude how to execute investigation playbooks using whitelisted tools.

Architecture Overview

+------------------+     SSH (key-based)     +-------------------+
|                  |------------------------->| VoIP Server 1     |
|  VPS / Jump Box  |------------------------->| VoIP Server 2     |
|  (Claude Code)   |------------------------->| VoIP Server 3     |
|                  |------------------------->| Replica DB        |
|  ~/.claude/      |                          +-------------------+
|    skills/       |
|      health/     |     Docker (local)       +-------------------+
|      calls/      |------------------------->| Grafana           |
|      agents/     |------------------------->| Prometheus        |
|      ...         |------------------------->| Loki              |
|    hooks/        |------------------------->| Homer (SIP/RTCP)  |
|    settings.json |------------------------->| Smokeping         |
+------------------+                          +-------------------+
        |
        | MCP (Model Context Protocol)
        v
+------------------+
| Grafana MCP      |
| (mcp-grafana)    |
| - Dashboards     |
| - PromQL queries |
| - Loki log search|
+------------------+
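
The SSH paths in the diagram are where round-trips add up, which is why the operations skills described later use a single-SSH aggregation pattern. A minimal Python sketch of that idea follows: it builds one `ssh` invocation that runs several checks in a single session and splits the output back into sections. The marker string, host name, and the example checks are hypothetical and only illustrate the pattern; they are not taken from the author's skills.

```python
import shlex

# Hypothetical separator echoed before each check so the output can be split later.
MARKER = "=====SECTION====="


def build_ssh_command(host: str, checks: dict[str, str]) -> list[str]:
    """Build one ssh argv that runs every check in a single remote session."""
    parts = []
    for label, command in checks.items():
        parts.append(f"echo {shlex.quote(MARKER + ' ' + label)}; {command}")
    remote_script = "; ".join(parts)
    # BatchMode avoids password prompts; ConnectTimeout keeps dead hosts from hanging.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, remote_script]


def split_sections(output: str) -> dict[str, str]:
    """Split aggregated output into {label: text} using the marker lines."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in output.splitlines():
        if line.startswith(MARKER + " "):
            current = line[len(MARKER) + 1:]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {label: "\n".join(lines) for label, lines in sections.items()}


# Example checks (assumed, adjust per server).
EXAMPLE_CHECKS = {
    "uptime": "uptime",
    "disk": "df -h /",
    "channels": "asterisk -rx 'core show channels count'",
}
```

The argv returned by `build_ssh_command("voip-1", EXAMPLE_CHECKS)` can be passed to `subprocess.run(..., capture_output=True, text=True)`, and the captured stdout fed to `split_sections`.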

The SKILL.md File Format

Every skill is a single Markdown file named SKILL.md inside its own directory under ~/.claude/skills/. The file has two parts: a YAML frontmatter header and a Markdown body.

Frontmatter (Required)

---
name: skill-name
description: One-line description shown in skill listings and used for matching.
user-invocable: true
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *)
---
| Field | Purpose |
| --- | --- |
| name | The slash command name. Users type /name to invoke. |
| description | Shown in help listings. Also used by Claude to decide when to suggest the skill. Be specific: mention the problem types this skill addresses. |
| user-invocable | Set to true so users can trigger it directly with /name. |
| allowed-tools | Whitelist of tools the skill can use. Uses glob patterns. Bash(ssh *) means "allow any Bash command starting with ssh". |
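
A frontmatter header like the one above is easy to lint before deployment. The following is a minimal sketch, assuming the header only contains flat `key: value` lines (it is not a full YAML parser) and assuming all four fields from the table should be present; `parse_frontmatter` and `validate_frontmatter` are hypothetical helper names.

```python
REQUIRED_FIELDS = ("name", "description", "user-invocable", "allowed-tools")


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse the flat `key: value` header between the two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' line")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        fields[key.strip()] = value.strip()
    raise ValueError("frontmatter is missing its closing '---' line")


def validate_frontmatter(fields: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems (empty means OK)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    name = fields.get("name", "")
    if name and any(ch.isspace() for ch in name):
        problems.append("name must not contain whitespace (it becomes /name)")
    flag = fields.get("user-invocable")
    if flag and flag not in ("true", "false"):
        problems.append("user-invocable must be true or false")
    return problems
```

Running `validate_frontmatter(parse_frontmatter(text))` on each SKILL.md gives a quick pre-deployment check for missing or malformed fields.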

Allowed-Tools Patterns

# SSH to any server
allowed-tools: Bash(ssh *)

# SSH + Docker + curl + ping
allowed-tools: Bash(ssh *), Bash(docker *), Bash(curl *), Bash(ping *)

# SSH + local audio tools
allowed-tools: Bash(ssh *), Bash(curl *), Bash(sox *), Bash(soxi *), Bash(ffprobe *)

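The tool patterns act as a security boundary: a skill that lists only SSH does not get blanket approval for Docker commands or file writes. Depending on the Claude Code version, tools outside the list may still be reachable through the normal permission flow, so treat allowed-tools as one layer alongside the permissions in settings.json and the safety hook. Design skills with the minimum tools they need.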
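
To build intuition for how these patterns scope commands, the sketch below approximates the matching with Python's `fnmatch`. This is an illustration only: Claude Code's real matcher is not reproduced here and may treat shell operators, quoting, and separators differently. `parse_allowed_tools` and `is_allowed` are hypothetical helpers.

```python
from fnmatch import fnmatchcase


def parse_allowed_tools(value: str) -> list[tuple[str, str]]:
    """Split 'Bash(ssh *), Bash(curl *)' into (tool, pattern) pairs.

    Assumes patterns contain no commas; a bare tool name matches everything.
    """
    rules = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        tool, sep, rest = entry.partition("(")
        pattern = rest[:-1] if sep and rest.endswith(")") else "*"
        rules.append((tool.strip(), pattern))
    return rules


def is_allowed(tool: str, command: str, rules: list[tuple[str, str]]) -> bool:
    """True if any rule for this tool glob-matches the full command string."""
    return any(t == tool and fnmatchcase(command, p) for t, p in rules)


broad = parse_allowed_tools("Bash(ssh *), Bash(curl *)")
scoped = parse_allowed_tools("Bash(ssh voip-*)")
print(is_allowed("Bash", "ssh db-1 uptime", broad))   # broad pattern accepts any host
print(is_allowed("Bash", "ssh db-1 uptime", scoped))  # scoped pattern rejects it
```

The comparison between `broad` and `scoped` shows why narrowing `Bash(ssh *)` to a host prefix is worth the extra typing.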

Body (The Investigation Procedure)

The Markdown body is the actual instruction set. Claude reads this as its playbook when the skill is invoked. It should contain:

  1. What to do -- step-by-step procedures
  2. How to access resources -- SSH commands, SQL queries, API calls
  3. How to interpret results -- reference tables, thresholds, known patterns
  4. Server-specific variations -- different credentials, paths, or versions per server
  5. Output formatting -- how to present results to the user

The body supports a special variable: $ARGUMENTS -- whatever the user typed after the slash command. For example, if the user types /health server-a, then $ARGUMENTS is server-a.
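
The effect of $ARGUMENTS can be pictured as plain text substitution into the skill body. The sketch below is a simplified model of that behavior, not Claude Code's internal implementation; both function names are hypothetical.

```python
def arguments_from_input(user_input: str, skill_name: str) -> str:
    """Return whatever follows '/<skill_name>' in the user's input."""
    prefix = "/" + skill_name
    if user_input != prefix and not user_input.startswith(prefix + " "):
        raise ValueError(f"input does not invoke /{skill_name}")
    return user_input[len(prefix):].strip()


def render_skill_body(body: str, arguments: str) -> str:
    """Replace every $ARGUMENTS placeholder with the user's arguments."""
    return body.replace("$ARGUMENTS", arguments.strip())


body = "Run the health check on $ARGUMENTS and report any WARNING or CRITICAL items."
args = arguments_from_input("/health server-a", "health")
print(render_skill_body(body, args))
```

Because the argument may be empty (the user typed only /health), a skill body should state what to do in that case, for example checking all servers.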

Directory Structure

~/.claude/
  settings.json              # Global settings (permissions, hooks, env)
  settings.local.json        # Per-machine permission overrides
  hooks/
    protect-production.sh    # Safety hook: blocks dangerous commands
  skills/
    health/
      SKILL.md               # /health skill
    calls/
      SKILL.md               # /calls skill
    agents/
      SKILL.md               # /agents skill
    replication/
      SKILL.md               # /replication skill
    audit-server/
      SKILL.md               # /audit-server skill
    trunk-status/
      SKILL.md               # /trunk-status skill
    audio-quality/
      SKILL.md               # /audio-quality skill
    call-investigate/
      SKILL.md               # /call-investigate skill
    call-drops/
      SKILL.md               # /call-drops skill
    lagged/
      SKILL.md               # /lagged skill
    network-check/
      SKILL.md               # /network-check skill
    agent-ranks/
      SKILL.md               # /agent-ranks skill
    did-lookup/
      SKILL.md               # /did-lookup skill
    reports/
      SKILL.md               # /reports skill
    listen-recording/
      SKILL.md               # /listen-recording skill
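
With this many skill directories, a quick consistency check helps catch a folder that was created without its SKILL.md. The following is a small sketch that walks a skills root and reports which directories are complete; `audit_skills` is a hypothetical helper, and the default path simply mirrors the tree above.

```python
from pathlib import Path


def audit_skills(skills_root: Path) -> dict[str, list[str]]:
    """Report which skill directories contain a SKILL.md file."""
    report: dict[str, list[str]] = {"ok": [], "missing_skill_md": []}
    for entry in sorted(skills_root.iterdir()):
        if not entry.is_dir():
            continue  # ignore stray files next to the skill directories
        key = "ok" if (entry / "SKILL.md").is_file() else "missing_skill_md"
        report[key].append(entry.name)
    return report


default_root = Path.home() / ".claude" / "skills"
if default_root.is_dir():
    print(audit_skills(default_root))
```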

Skill Categories & Implementation

  • Operations Skills (6): Answer "What is happening right now?" (/health, /calls, /agents, /replication, /audit-server, /trunk-status). Use single-SSH aggregation patterns to minimize round-trips. Output structured tables with WARNING/CRITICAL flags.
  • Investigation Skills (5): Answer "Why did this happen?" (/audio-quality, /call-investigate, /call-drops, /lagged, /network-check). Chain cross-system queries: Homer RTCP → Asterisk logs → SIP peer stats → Smokeping → Codec verification.
  • Lookup Skills (4): Answer "Where is this data?" (/agent-ranks, /did-lookup, /reports, /listen-recording). Query ViciDial databases, DID routing tables, and recording storage paths with formatted output.
  • MCP Grafana Integration: Connect Claude to mcp-grafana for real-time PromQL queries, Loki log search, and dashboard rendering without leaving the CLI.
  • Production Safety Hook: hooks/protect-production.sh intercepts dangerous commands (rm -rf, DROP TABLE, iptables -F) and requires explicit confirmation or blocks execution entirely.
  • Permission Management: settings.json and settings.local.json define which tools and command patterns are allowed or denied. Skills run under these global rules, which keeps execution least-privilege.
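
The production safety hook is the piece most worth prototyping before trusting it. The article's hook is a shell script whose contents are not shown, so what follows is a Python sketch of the same idea rather than the author's implementation. It assumes the hook is registered as a pre-tool-use hook that receives a JSON payload with `tool_name` and `tool_input.command`, and that exit code 2 blocks the call; the regular expressions are illustrative and deliberately not exhaustive.

```python
import json
import re
import sys
from typing import Optional

# (pattern, label) pairs for the three command families named above.
DANGEROUS_PATTERNS = [
    (re.compile(r"\brm\s+(-[a-z]+\s+)*-[a-z]*(r[a-z]*f|f[a-z]*r)", re.IGNORECASE),
     "recursive forced delete"),
    (re.compile(r"\bdrop\s+table\b", re.IGNORECASE), "DROP TABLE"),
    (re.compile(r"\biptables\b[^|;&]*\s(-F|--flush)\b"), "iptables flush"),
]


def find_danger(command: str) -> Optional[str]:
    """Return a label for the first dangerous pattern found, else None."""
    for pattern, label in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return label
    return None


def hook_exit_code(payload_text: str) -> int:
    """Decide the hook's exit code for one JSON payload (assumed shape)."""
    payload = json.loads(payload_text)
    if payload.get("tool_name") != "Bash":
        return 0  # only Bash commands are inspected in this sketch
    command = payload.get("tool_input", {}).get("command", "")
    label = find_danger(command)
    if label is None:
        return 0
    print(f"Blocked by production safety hook: {label}", file=sys.stderr)
    return 2  # assumed convention: 2 blocks the tool call


# Entry point when installed as a hook script (left as a comment on purpose):
#   sys.exit(hook_exit_code(sys.stdin.read()))
```

Pattern lists like this catch common accidents, not determined misuse: `rm -r -f` with split flags, for instance, slips past the first expression. Validate the list against real command history in staging.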

Pitfall Guide

  1. Over-Permissive Tool Whitelisting: Using Bash(*) or broad patterns like Bash(ssh *) without scoping allows Claude to execute unintended commands. Always define the minimum required tools per skill (e.g., Bash(ssh voip-*), Bash(mysql -e *)).
  2. Hardcoding Server-Specific Credentials: Embedding passwords or IP addresses directly in SKILL.md breaks portability and violates security practices. Use SSH key-based routing, environment variables, or dynamic credential lookup scripts.
  3. Ignoring Output Formatting Standards: Claude defaults to verbose, unstructured text. Explicitly define output templates in the skill body (e.g., "Present results as a Markdown table with columns: Server, Status, Metric, Threshold").
  4. Skipping Production Safety Hooks: Failing to implement protect-production.sh risks accidental destructive commands during automated investigations. Always validate hooks in staging before deploying to production.
  5. Static Runbook Mentality: Treating skills as static text files instead of dynamic playbooks. Leverage $ARGUMENTS for context-aware execution and conditional branching based on real-time server responses.
  6. Neglecting MCP Integration: Relying solely on SSH/CLI tools misses real-time metric correlation. Integrate mcp-grafana for PromQL, Loki, and dashboard queries to close the monitoring-investigation loop.
  7. Inconsistent Skill Naming & Descriptions: Vague name or description fields cause Claude to misroute invocations or fail to suggest relevant skills. Use precise, problem-oriented descriptions (e.g., "Diagnose SIP trunk registration failures and carrier connectivity").
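
Pitfall 3 is easier to avoid when the target format is concrete. The sketch below renders the table layout suggested there (Server, Status, Metric, Threshold) and applies WARNING/CRITICAL flags. The thresholds, metric name, and host names are made-up example values, and `classify` assumes that higher readings are worse.

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a reading to OK / WARNING / CRITICAL (higher is worse)."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


def render_table(rows) -> str:
    """Render (server, metric, value, warning, critical) rows as a Markdown table."""
    lines = [
        "| Server | Status | Metric | Threshold |",
        "| --- | --- | --- | --- |",
    ]
    for server, metric, value, warning, critical in rows:
        status = classify(value, warning, critical)
        lines.append(
            f"| {server} | {status} | {metric}={value:g} | warn {warning:g} / crit {critical:g} |"
        )
    return "\n".join(lines)


# Hypothetical readings for illustration only.
example_rows = [
    ("voip-1", "disk_used_pct", 92, 80, 90),
    ("voip-2", "disk_used_pct", 45, 80, 90),
]
print(render_table(example_rows))
```

Pasting a sample of this output into the skill body gives Claude an unambiguous template to follow.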

Deliverables

  • Architecture Blueprint: Complete directory tree (~/.claude/skills/, hooks/, settings.json) with skill categorization, MCP integration points, and SSH/Docker routing paths.
  • Skill Creation Checklist: Step-by-step validation for frontmatter accuracy, allowed-tools scoping, $ARGUMENTS usage, output formatting rules, and safety hook verification before deployment.
  • Configuration Templates: Ready-to-use SKILL.md frontmatter blocks, settings.json permission overrides, protect-production.sh hook script, and mcp-grafana connection config for Prometheus/Loki/Grafana integration.