Triage Airflow or Dagster task failures from logs, classify each by root-cause family, and suggest a concrete fix per cause.
Reads task failure logs from Airflow or Dagster runs, classifies each failure into a root-cause family, and emits a triage report that groups failures by cause with a concrete fix per family.
logs_dir: directory containing per-task log files.
tool: airflow or dagster.
dag_id / job_name: scope the triage to one pipeline.
since: ISO timestamp; only consider failures after this point.

airflow tasks failed-deps or by walking logs/<dag_id>/<task_id>/<run_id>/; Dagster via dagster job execute history files or dagster-graphql.

OperationalError, connection refused, Could not connect: classify infra-network.
KeyError, AttributeError, column ... does not exist: classify schema-drift.
MemoryError, Killed, OOMKilled: classify resource-exhaustion.
TimeoutError, Task timed out: classify timeout.
PermissionDenied, 403, AccessDenied: classify auth.
IntegrityError, unique constraint: classify data-quality.

execution_timeout; auth -> rotate creds, check IAM; data-quality -> add dbt test or upstream constraint.

pipeline-triage.md with: summary table (family, count, % of total), per-family detail (top tasks, sample log excerpt, suggested fix), and a "bursts" section listing windows with > 3x mean failure rate.
Exit 1 if any task is still in retry-loop.
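The classification rules above can be sketched as an ordered list of regexes. This is a minimal illustration, not the skill's actual implementation: the rule order, the case-insensitive matching, the word boundary around 403, and the helper names classify and summarize are all assumptions. The unknown fallback follows the family named in the verification notes.

```python
import re
from collections import Counter

# Patterns taken from the rule list above. Rule order, IGNORECASE, and the
# \b403\b word boundary are assumptions made for this sketch.
RULES = [
    ("infra-network", r"OperationalError|connection refused|Could not connect"),
    ("schema-drift", r"KeyError|AttributeError|column .* does not exist"),
    ("resource-exhaustion", r"MemoryError|Killed|OOMKilled"),
    ("timeout", r"TimeoutError|Task timed out"),
    ("auth", r"PermissionDenied|\b403\b|AccessDenied"),
    ("data-quality", r"IntegrityError|unique constraint"),
]
COMPILED = [(family, re.compile(pattern, re.IGNORECASE)) for family, pattern in RULES]


def classify(log_text):
    """Return the first family whose pattern matches, else 'unknown'."""
    for family, pattern in COMPILED:
        if pattern.search(log_text):
            return family
    return "unknown"


def summarize(log_texts):
    """Build summary rows of (family, count, percent of total)."""
    counts = Counter(classify(text) for text in log_texts)
    total = sum(counts.values())
    return [
        (family, count, round(100 * count / total, 1))
        for family, count in counts.most_common()
    ]
```

Because the first matching rule wins, a log that contains both a network error and a timeout is counted under infra-network in this sketch; a different rule order would change the family counts.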
Re-classify a 10% sample by hand and check overall agreement with the automated classification; below 80% agreement means the regex set is too coarse for this corpus and the report should warn. For each suggested fix, confirm the relevant config knob exists and where it lives (e.g., Airflow's pool_slots is a real setting, but it is a per-task operator argument rather than an airflow.cfg option). Re-run the triager after fixes are applied and confirm the family counts decrease as expected.
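The 10% hand-check could be sketched as follows. The function names, the fixed random seed, and the dict-based label format (task id mapped to family) are hypothetical choices for illustration; only the 10% sample size and the 80% threshold come from the text above.

```python
import random


def pick_sample(task_ids, frac=0.10, seed=0):
    """Pick a reproducible random sample of task ids to re-classify by hand."""
    ids = sorted(task_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * frac))  # always review at least one task
    return random.Random(seed).sample(ids, k)


def agreement(auto_labels, hand_labels):
    """Fraction of hand-labelled tasks where the automated family matches."""
    if not hand_labels:
        raise ValueError("no hand labels to compare against")
    matches = sum(
        1 for task_id, label in hand_labels.items()
        if auto_labels.get(task_id) == label
    )
    return matches / len(hand_labels)


def needs_warning(auto_labels, hand_labels, threshold=0.80):
    """True when agreement falls below the threshold, so the report should warn."""
    return agreement(auto_labels, hand_labels) < threshold
```

For example, if 3 of 4 hand-labelled tasks match the automated family, agreement is 0.75 and the report would carry the warning.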
unknown and surface a sample for taxonomy update.
Same domains or capabilities as amitte/data-pipeline-failure-triager.
Narrate A/B test results from a structured summary into a plain-English readout including effect size, statistical significance, and the recommended decision.
Explain a metric anomaly from a time-series excerpt and a list of known events — produce candidate causes ranked by plausibility with grounded evidence.
Read-only AWS surface — list/describe EC2, S3 buckets, IAM users, and Lambda functions. Auth via STS-assumed role; no mutating tools.
Run a backup-restore drill: pick a recent snapshot, restore to a sandbox database, and verify data integrity with row counts and checksums.
Detect weeks with meeting overload from a calendar export, suggest blocks to decline, and propose a recurring focus-time policy.
Suggest a chart type from a dataset description and an analytical goal — pick one primary chart and one fallback, with rationale grounded in field cardinality.