Suggest a runbook for an alert given its name, threshold, and recent firing pattern — produce diagnosis steps, mitigation options, and an escalation note.
You are an SRE drafting a runbook for an existing alert. The on-call engineer paged at 3am will copy-paste from your output.
Produce a runbook for the supplied alert with diagnose, mitigate, and escalate sections grounded in the alert's expression and history.
You receive:
alert_name: e.g., HighErrorRate-Payments.
expression: PromQL-like expression.
threshold: e.g., > 1% for 5m.
recent_fires: integer count in last 7 days.
owning_service: service name.

## Alert: <alert_name>
## What it means: interpret the expression in plain English. State the implied user impact.
## Diagnose: numbered steps. Start with the cheapest, most diagnostic check (a dashboard URL placeholder, a single curl, a log query). 4-7 steps.
## Mitigate: numbered options. Always include the safe rollback option first; include load-shedding or circuit-breaker options where applicable.
## Escalate: when and to whom. Use roles, not names: <owning_service>-oncall, infra-oncall.
## Notes: known false-positive patterns; reference recent_fires if the alert is noisy.

[...] <service> exceeds 500ms for 5 minutes".
If recent_fires >= 10, add a "this alert is noisy; consider tuning threshold" note in ## Notes.
Return JSON { runbook_markdown } containing the full document.
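The input fields and output shape described above can be sketched in Python. This is only an illustration under stated assumptions: the `AlertInput` class, the `build_skeleton` function, and the angle-bracket placeholder texts are hypothetical names introduced here, and the sample expression and service name in the usage block are invented. Only the field names, the section order, the `recent_fires >= 10` rule, and the `{ runbook_markdown }` wrapper come from the text.

```python
import json
from dataclasses import dataclass

NOISY_FIRES = 10  # rule from the text: recent_fires >= 10 triggers the tuning note


@dataclass
class AlertInput:
    """The five inputs listed above (class name is hypothetical)."""
    alert_name: str
    expression: str
    threshold: str
    recent_fires: int
    owning_service: str


def build_skeleton(alert: AlertInput) -> str:
    """Return a JSON string {"runbook_markdown": ...} with the required sections.

    Section bodies are placeholders; a real runbook would fill them in.
    """
    notes = "<known false-positive patterns>"
    if alert.recent_fires >= NOISY_FIRES:
        notes += (
            f"\nFired {alert.recent_fires} times in the last 7 days: "
            "this alert is noisy; consider tuning threshold."
        )
    markdown = "\n".join([
        f"## Alert: {alert.alert_name}",
        "## What it means",
        f"<plain-English reading of `{alert.expression}` {alert.threshold}, plus user impact>",
        "## Diagnose",
        "1. <cheapest, most diagnostic check>",
        "## Mitigate",
        "1. <safe rollback option>",
        "## Escalate",
        f"{alert.owning_service}-oncall, then infra-oncall",
        "## Notes",
        notes,
    ])
    return json.dumps({"runbook_markdown": markdown})


if __name__ == "__main__":
    # Sample values: alert_name and threshold reuse the examples from the text;
    # the expression and service name are made up for illustration.
    sample = AlertInput(
        alert_name="HighErrorRate-Payments",
        expression='rate(http_requests_total{code=~"5.."}[5m])',
        threshold="> 1% for 5m",
        recent_fires=12,
        owning_service="payments",
    )
    print(json.loads(build_skeleton(sample))["runbook_markdown"])
```

Keeping the noisy-alert rule as a named constant makes the cutoff easy to spot and change if the threshold for "noisy" is ever revised.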
[...] kubectl rollout undo deploy/<service> over "roll back the deploy").
[...] <dashboard-url>): do not invent real URLs.
If recent_fires >= 10, the Notes section includes a "tune threshold" note.
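Checks like these could be automated with a small validator. The sketch below is an assumption-laden illustration, not the registry's actual test harness: `check_runbook` and `section_body` are hypothetical names, any `http(s)://` link is treated as an invented URL, and the "tune threshold" check is approximated by looking for the word "threshold" in the Notes section. The checks whose beginnings are cut off above are not reconstructed.

```python
import json
import re

REQUIRED_HEADINGS = [
    "## Alert:", "## What it means", "## Diagnose",
    "## Mitigate", "## Escalate", "## Notes",
]


def section_body(doc: str, title: str) -> str:
    """Return the text between '## <title>' and the next '## ' heading."""
    pattern = rf"^## {re.escape(title)}.*?$(.*?)(?=^## |\Z)"
    match = re.search(pattern, doc, flags=re.M | re.S)
    return match.group(1) if match else ""


def check_runbook(output_json: str, recent_fires: int) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    doc = json.loads(output_json)["runbook_markdown"]

    for heading in REQUIRED_HEADINGS:
        if heading not in doc:
            problems.append(f"missing section: {heading}")

    # Heuristic: placeholders such as <dashboard-url> are fine, literal links are not.
    if re.search(r"https?://", doc):
        problems.append("contains a literal URL; use a placeholder instead")

    # The Diagnose section should hold 4-7 numbered steps.
    steps = re.findall(r"^\s*\d+\.", section_body(doc, "Diagnose"), flags=re.M)
    if not 4 <= len(steps) <= 7:
        problems.append(f"Diagnose has {len(steps)} steps; expected 4-7")

    # Noisy alerts must carry a tuning note in Notes.
    if recent_fires >= 10 and "threshold" not in section_body(doc, "Notes").lower():
        problems.append("noisy alert lacks a tune-threshold note in Notes")

    return problems
```

Returning a list of problems rather than raising on the first failure lets a reviewer see every violated check in one pass.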
Same domains or capabilities as amitte/alert-runbook-suggester.
Narrate a capacity plan from current utilization metrics and growth projections — produce a written plan with thresholds, lead times, and recommended provisioning actions.
Explain a cloud-cost spike from billing line items and a list of recent infrastructure changes — surface the dominant driver and rank candidate causes.
Flag a support thread that needs executive attention — produce a yes/no decision, an escalation rationale, and the suggested executive role.
Generate a product launch checklist with owners, dates, and dependencies — back-scheduled from a launch date and grouped by week.
Cluster a list of error log lines into templates by replacing variable parts with placeholders, then rank clusters by volume and novelty.
Inspect a running nginx instance — list_sites and test_config are read-only; reload is a mutating tool guarded by `nginx -t` and a per-tenant allow flag.