
Incidents

Incidents group related alerts together and track their lifecycle from detection to resolution.

How Incidents Work

  1. Alert fires - A rule condition is met
  2. Incident created - New incident or grouped with existing
  3. Notifications sent - Team is alerted
  4. Investigation - Team acknowledges and investigates
  5. Resolution - Issue is fixed, incident closed
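The lifecycle above can be sketched as a small state machine. The state names match the Incident States table below; the exact transition rules (e.g. whether an incident may be resolved without being acknowledged first) are an illustrative assumption:

```python
# Illustrative sketch of incident lifecycle transitions.
# State names come from the docs; the transition table is an assumption.
ALLOWED_TRANSITIONS = {
    "open": {"acknowledged", "resolved"},   # assumes direct resolve is allowed
    "acknowledged": {"resolved"},
    "resolved": set(),                      # terminal state
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```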

Incident Grouping

Related alerts are automatically grouped:

Incident: API Degradation
├── Alert: High P95 Latency (10:30)
├── Alert: Error Rate Spike (10:32)
└── Alert: Database Slow Queries (10:35)

Grouping rules:
  • Same service/source
  • Within time window (default: 5 minutes)
  • Similar tags
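The first two grouping rules can be sketched as a predicate over a pair of alerts (the field names `service` and `fired_at` are assumptions, and the tag-similarity rule is omitted for brevity):

```python
from datetime import datetime, timedelta

# Illustrative grouping check: same service/source and fired within the
# time window (default: 5 minutes). Field names are assumed.
def should_group(alert_a: dict, alert_b: dict,
                 window: timedelta = timedelta(minutes=5)) -> bool:
    same_source = alert_a["service"] == alert_b["service"]
    t_a = datetime.fromisoformat(alert_a["fired_at"])
    t_b = datetime.fromisoformat(alert_b["fired_at"])
    return same_source and abs(t_a - t_b) <= window
```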

Incident States

State         Description
open          New incident, not yet acknowledged
acknowledged  Team is aware and investigating
resolved      Issue fixed, incident closed

Managing Incidents

List Incidents

GET /api/v1/incidents
GET /api/v1/incidents?status=open
GET /api/v1/incidents?severity=critical
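A minimal sketch of building these filtered list URLs on the client side (the `https://example.com` base is a placeholder, not the real API host):

```python
from urllib.parse import urlencode

BASE_URL = "https://example.com"  # placeholder host

def incidents_url(**filters) -> str:
    """Build a list URL, e.g. /api/v1/incidents?status=open."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{BASE_URL}/api/v1/incidents" + (f"?{query}" if query else "")
```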

Get Incident

GET /api/v1/incidents/:id

Response includes:
{
  "incident": {
    "id": "inc_abc123",
    "title": "High Error Rate on API",
    "status": "acknowledged",
    "severity": "critical",
    "created_at": "2024-01-15T10:30:00Z",
    "acknowledged_at": "2024-01-15T10:32:00Z",
    "acknowledged_by": "user_xyz",
    "alerts": [
      {
        "id": "alert_1",
        "rule_name": "High Error Rate",
        "fired_at": "2024-01-15T10:30:00Z"
      }
    ],
    "timeline": [
      {
        "event": "incident_created",
        "timestamp": "2024-01-15T10:30:00Z"
      },
      {
        "event": "alert_added",
        "alert_id": "alert_1",
        "timestamp": "2024-01-15T10:30:00Z"
      },
      {
        "event": "acknowledged",
        "user": "user_xyz",
        "timestamp": "2024-01-15T10:32:00Z"
      }
    ]
  }
}

Acknowledge Incident

POST /api/v1/incidents/:id/acknowledge
{
  "note": "Investigating the issue"
}

Resolve Incident

POST /api/v1/incidents/:id/resolve
{
  "note": "Deployed fix in v1.2.3",
  "root_cause": "Memory leak in cache service"
}

Add Note

POST /api/v1/incidents/:id/notes
{
  "content": "Identified root cause - scaling up instances"
}

Incident Timeline

Every incident maintains a timeline:

Event             Description
incident_created  Incident opened
alert_added       New alert added to the incident
alert_resolved    An alert in the incident resolved
acknowledged      Team acknowledged the incident
note_added        Note added
escalated         Escalated to the next level
resolved          Incident resolved

Escalation

Incidents can escalate if not acknowledged:
{
  "escalation_policy": {
    "steps": [
      { "delay": "5m", "channels": ["slack-ops"] },
      { "delay": "15m", "channels": ["pagerduty-oncall"] },
      { "delay": "30m", "channels": ["email-management"] }
    ]
  }
}
See Escalation Policies for configuration.
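A sketch of how a client might interpret a policy like the one above: given the minutes elapsed without acknowledgment, return every channel that should have been notified so far. The delay parser handles only the `"5m"` minutes form shown in the example:

```python
def parse_delay_minutes(delay: str) -> int:
    """Parse delays like "5m" into minutes (only the form shown above)."""
    if not delay.endswith("m"):
        raise ValueError(f"unsupported delay format: {delay}")
    return int(delay[:-1])

def channels_due(policy: dict, elapsed_minutes: int) -> list[str]:
    """Channels whose escalation step has already been reached."""
    due = []
    for step in policy["steps"]:
        if elapsed_minutes >= parse_delay_minutes(step["delay"]):
            due.extend(step["channels"])
    return due
```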

Incident Metrics

Track incident performance:
GET /api/v1/incidents/stats
{
  "stats": {
    "total": 156,
    "by_status": {
      "open": 3,
      "acknowledged": 2,
      "resolved": 151
    },
    "mttr": 1800,
    "mtta": 300
  }
}
Metric  Description
MTTA    Mean Time to Acknowledge, in seconds
MTTR    Mean Time to Resolve, in seconds
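Both metrics can be computed from incident timestamps. A sketch using the field names from the Get Incident response; `resolved_at` is an assumed analogous field not shown in that example:

```python
from datetime import datetime

def _ts(s: str) -> datetime:
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def _mean_seconds(incidents, start_field, end_field):
    """Mean gap in seconds over incidents that have both timestamps."""
    deltas = [
        (_ts(i[end_field]) - _ts(i[start_field])).total_seconds()
        for i in incidents
        if i.get(start_field) and i.get(end_field)
    ]
    return sum(deltas) / len(deltas) if deltas else None

def mtta(incidents):
    return _mean_seconds(incidents, "created_at", "acknowledged_at")

def mttr(incidents):
    return _mean_seconds(incidents, "created_at", "resolved_at")
```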

Best Practices

Acknowledge Quickly

Set targets for acknowledgment time.

Add Notes

Document investigation steps in the incident timeline.

Include Root Cause

Record the root cause when resolving an incident.

Review Metrics

Track MTTA and MTTR to improve response times.