Data Pipeline Failure Triager

What this skill does

Reads task failure logs from Airflow or Dagster runs, classifies each failure into a root-cause family, and emits a triage report that groups failures by cause with a concrete fix per family.

Inputs

logs_dir: directory containing per-task log files.
tool: airflow or dagster.
Optional dag_id / job_name: scope the triage to one pipeline.
Optional since: ISO timestamp; only consider failures after this point.

Steps

List failed task runs: Airflow via airflow tasks failed-deps or by walking logs/<dag_id>/<task_id>/<run_id>/; Dagster via dagster job execute history files or dagster-graphql.
For each failure, read the last 200 lines of the log.
Apply pattern matchers in priority order:
- OperationalError, connection refused, Could not connect: classify infra-network.
- KeyError, AttributeError, column ... does not exist: classify schema-drift.
- MemoryError, Killed, OOMKilled: classify resource-exhaustion.
- TimeoutError, Task timed out: classify timeout.
- PermissionDenied, 403, AccessDenied: classify auth.
- IntegrityError, unique constraint: classify data-quality.
Per family, propose a fix: infra-network -> add retry with backoff, increase pool slots; schema-drift -> regenerate model, add a contract test; resource-exhaustion -> bump operator resources / chunk input; timeout -> raise execution_timeout; auth -> rotate creds, check IAM; data-quality -> add dbt test or upstream constraint.
Sort failures by family count desc; per family list the top 5 affected tasks.
Compute the time series: failure counts per hour over the last 24h to spot bursts.
Write a triage report as markdown.

Output

pipeline-triage.md with: summary table (family, count, % of total), per-family detail (top tasks, sample log excerpt, suggested fix), and a "bursts" section listing windows with > 3x mean failure rate. Exit 1 if any task is still in retry-loop.

Verification

Re-classify a 10% sample by hand and check overall agreement with the automated classification; below 80% agreement means the regex set is too coarse for this corpus and the report should warn. For each suggested fix, confirm the relevant config knob exists (e.g., Airflow's pool_slots is a real setting in airflow.cfg or per-DAG). Re-run the triager after fixes are applied and confirm the family counts decrease as expected.

Edge cases

Log truncation (large outputs cut by orchestrator): note "log truncated, classification approximate" on those entries.
Custom in-house exceptions not matched: bucket into unknown and surface a sample for taxonomy update.
Logs with sensitive data (PII): redact known patterns (emails, IPs) before quoting in the report.
Single failure across many retries: count once per task instance, not per attempt.

amitte/data-pipeline-failure-triager