Log Anomaly Clusterer

What this skill does

Clusters semi-structured log lines into templates using the Drain algorithm, then compares template volumes between a baseline window and a recent window to surface anomalies. The output is a top-N table of templates whose volumes shifted the most.

Inputs

log_path: a single log file or directory of files. Lines may be plain text, JSON, or syslog.
baseline_window: ISO interval, e.g., 2026-04-01T00:00:00/PT24H.
recent_window: another ISO interval (must not overlap).
Optional top_n: defaults to 20.

Steps

Detect format: JSON lines if the first non-empty line parses as JSON; otherwise treat as plain or syslog.
Extract a timestamp per line. JSON: known fields (@timestamp, time, ts); plain: regex match common formats (2006-01-02 15:04:05, ISO 8601).
Filter lines into the two windows by timestamp.
Feed each window's message field (or the full line minus timestamp) into Drain3 (from drain3 import TemplateMiner).
After both windows are processed, dump the merged template inventory.
For each template, compute delta = (recent_count / recent_total) - (baseline_count / baseline_total).
Sort by |delta| descending; take top_n.
For each row, render: template id, template string (parameter slots as <*>), baseline count, recent count, delta percentage, sample line.
Highlight templates that are newly seen (baseline_count == 0) — those are usually the most actionable.
Emit a markdown table.

Output

log-anomalies.md with the top_n table, plus three "spotlight" subsections for the largest deltas (newly seen, biggest spike, biggest drop). Stdout prints the total template count and Drain's leaf-node count.

Verification

Pick three top entries and grep the log for the literal sample line; confirm the count is plausibly within 5% of Drain's reported count. Re-run the clusterer with Drain3 similarity threshold 0.4 and 0.6; the top entries should remain the same but ordering may shuffle — large reorderings indicate threshold sensitivity. If both windows have under 1000 lines total, prepend a "low corpus" warning to the report.

Edge cases

Multi-line stack traces: collapse continuation lines (those starting with whitespace) into the previous line before clustering.
Timestamps in mixed timezones: convert all to UTC before windowing.
Logs containing high-cardinality identifiers (UUIDs, request IDs): pre-mask via regex so Drain doesn't explode the template count.
Gzipped log files: detect by extension and pipe through zcat transparently.

amitte/log-anomaly-clusterer