Pivot Spec from Question

What this skill does

Converts a natural-language analytics question and a column inventory into a structured pivot table specification: which fields go in rows, columns, values, filters, and the sort order. Output is a JSON spec plus a generated pandas snippet.

Inputs

question: a single sentence (e.g., "Revenue by region by month for 2025, top 5 regions only").
columns: an array of column descriptors {name, type, sample_values, role?}.
Optional dataset_name: used as the dataframe identifier in the snippet.
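
As a rough illustration of the input shape described above, the following sketch builds one payload and checks that each descriptor carries the required keys. The column names, sample values, the role labels ("dimension", "metric"), and the check_columns helper are all hypothetical; only the key names come from the descriptor format listed here.

```python
# Hypothetical example payload; names and sample values are invented for illustration.
question = "Revenue by region by month for 2025, top 5 regions only"

columns = [
    {"name": "order_date", "type": "date", "sample_values": ["2025-01-14", "2025-02-03"]},
    {"name": "region", "type": "string", "sample_values": ["EMEA", "APAC"], "role": "dimension"},
    {"name": "revenue", "type": "number", "sample_values": [1200.0, 87.5], "role": "metric"},
]

dataset_name = "orders"  # optional; used as the dataframe identifier in the snippet

# "role" is optional, so only these three keys are required.
REQUIRED_KEYS = {"name", "type", "sample_values"}


def check_columns(cols):
    """Return the names of descriptors that are missing a required key."""
    return [c.get("name", "<unnamed>") for c in cols if not REQUIRED_KEYS <= c.keys()]


print(check_columns(columns))
```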

Steps

Tokenize the question and tag each noun/verb. Identify the question type: compare-by, trend-over, top-N, share-of, contribution.
Match temporal phrases ("by month", "over 2025", "this quarter") to columns of type date or timestamp. The by <X> phrase usually marks rows or columns; over <X> marks the column axis when temporal.
Match metric phrases ("revenue", "users", "retention") to columns of type number. The dominant metric becomes the value field; aggregation defaults: count -> count, revenue -> sum, retention -> avg.
Match dimension phrases ("by region", "per product") to non-numeric columns; place the most-mentioned dimension on rows, the second on columns.
Filters: extract WHERE-like phrases ("for 2025", "active users only"); map to known column values from sample_values if possible.
Top-N / Bottom-N: detect "top 5", "bottom 10"; map to a sort on the value field (descending for top, ascending for bottom) plus a limit of N.
Compose the spec: {rows, cols, values: [{field, agg}], filters, sort, limit}.
Validate the spec: every referenced field must be in columns; every agg must be one of sum, count, avg, min, max, count_distinct.
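Generate a pandas snippet using pd.pivot_table(df, index=rows, columns=cols, values=values_field, aggfunc=agg), preceded by the filter step. Translate the spec aggregations that pandas names differently: avg becomes "mean" and count_distinct becomes "nunique".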
Print the spec JSON and the snippet.
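
The last few steps above (top-N detection, validation, snippet generation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the function names, the filter shape {field, value} with equality-only matching, and the sort shape {field, order} are all hypothetical, and only the first value field is rendered. The sort and limit are carried in the spec but not rendered into the snippet here.

```python
import re

ALLOWED_AGGS = {"sum", "count", "avg", "min", "max", "count_distinct"}
# pandas spells two of the spec aggregations differently.
PANDAS_AGG = {"avg": "mean", "count_distinct": "nunique"}


def detect_limit(question):
    """Find 'top N' / 'bottom N' and return a sort order plus limit, or None."""
    match = re.search(r"\b(top|bottom)\s+(\d+)\b", question, flags=re.IGNORECASE)
    if match is None:
        return None
    order = "desc" if match.group(1).lower() == "top" else "asc"
    return {"order": order, "limit": int(match.group(2))}


def validate_spec(spec, columns):
    """Raise ValueError naming any unknown field or unsupported aggregation."""
    known = {col["name"] for col in columns}
    referenced = (
        spec["rows"]
        + spec["cols"]
        + [v["field"] for v in spec["values"]]
        + [flt["field"] for flt in spec["filters"]]
    )
    missing = sorted(set(referenced) - known)
    if missing:
        raise ValueError(f"fields not in columns: {missing}")
    bad_aggs = sorted({v["agg"] for v in spec["values"]} - ALLOWED_AGGS)
    if bad_aggs:
        raise ValueError(f"unsupported aggregations: {bad_aggs}")


def to_pandas_snippet(spec, df_name="df"):
    """Render the filter pre-step and the pd.pivot_table call as source text."""
    lines = ["import pandas as pd", ""]
    source = df_name
    if spec["filters"]:
        # Assumption: every filter is a simple equality test.
        conditions = " & ".join(
            f"({df_name}[{flt['field']!r}] == {flt['value']!r})" for flt in spec["filters"]
        )
        lines.append(f"filtered = {df_name}[{conditions}]")
        source = "filtered"
    value = spec["values"][0]  # only the dominant metric is rendered here
    aggfunc = PANDAS_AGG.get(value["agg"], value["agg"])
    args = [source, f"index={spec['rows']!r}"]
    if spec["cols"]:
        args.append(f"columns={spec['cols']!r}")
    args += [f"values={value['field']!r}", f"aggfunc={aggfunc!r}"]
    lines.append(f"pivot = pd.pivot_table({', '.join(args)})")
    return "\n".join(lines)


# Hypothetical column inventory and spec for the example question.
columns = [{"name": n} for n in ("region", "order_month", "order_year", "revenue")]
top_n = detect_limit("Revenue by region by month for 2025, top 5 regions only")
spec = {
    "rows": ["region"],
    "cols": ["order_month"],
    "values": [{"field": "revenue", "agg": "avg"}],
    "filters": [{"field": "order_year", "value": 2025}],
    "sort": {"field": "revenue", "order": top_n["order"]},
    "limit": top_n["limit"],
}
validate_spec(spec, columns)
print(to_pandas_snippet(spec, "orders"))
```

Rendering the snippet as text rather than executing it keeps this step free of a pandas dependency; the generated code only needs pandas when it is actually run.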

Output

A JSON document with the pivot spec and a sibling Markdown file containing the generated pandas snippet (and a SQL equivalent using GROUP BY <rows>, <cols>). Stdout summarizes how each question phrase was mapped to a spec field.
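
One possible way to derive the SQL equivalent from a spec is sketched below. The to_sql and sql_literal helpers, the equality-only filter shape, and the output alias format are assumptions made for illustration; identifiers are not quoted or escaped, so this is not safe for untrusted input. The query returns one row per (rows, cols) combination, that is, the long form of the pivot rather than a cross-tab.

```python
SQL_AGG = {
    "sum": "SUM({field})",
    "count": "COUNT({field})",
    "avg": "AVG({field})",
    "min": "MIN({field})",
    "max": "MAX({field})",
    "count_distinct": "COUNT(DISTINCT {field})",
}


def sql_literal(value):
    """Render a Python value as a SQL literal (strings get single quotes)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def to_sql(spec, table):
    """Build a GROUP BY query over the spec's row and column fields."""
    group_fields = spec["rows"] + spec["cols"]
    value = spec["values"][0]  # only the dominant metric is rendered here
    agg_expr = SQL_AGG[value["agg"]].format(field=value["field"])
    select_list = group_fields + [f"{agg_expr} AS {value['agg']}_{value['field']}"]
    parts = [f"SELECT {', '.join(select_list)}", f"FROM {table}"]
    if spec["filters"]:
        where = " AND ".join(
            f"{flt['field']} = {sql_literal(flt['value'])}" for flt in spec["filters"]
        )
        parts.append(f"WHERE {where}")
    if group_fields:
        parts.append(f"GROUP BY {', '.join(group_fields)}")
    return "\n".join(parts)


# Hypothetical spec; field and table names are invented.
spec = {
    "rows": ["region"],
    "cols": ["order_month"],
    "values": [{"field": "revenue", "agg": "sum"}],
    "filters": [{"field": "order_year", "value": 2025}],
}
print(to_sql(spec, "orders"))
```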

Verification

Run the pandas snippet against a small representative dataframe and confirm the result has the expected number of rows (= unique row values matching filters) and columns (= unique column values, plus a totals column if requested). For "top N" specs, count the row-axis cardinality after the limit is applied and confirm it equals N, or the number of distinct row values if fewer than N remain after filtering. If the question references a column not present in columns, do not fabricate one; raise an error naming the missing field.
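
The shape checks described above might look like the following on a six-row dataframe. The data and column names are invented, and ranking rows by their total across all columns is an assumed interpretation of "top N on the value field" for a two-axis pivot.

```python
import pandas as pd

# Small hypothetical dataset; the 2024 row should be removed by the filter.
orders = pd.DataFrame(
    {
        "region": ["EMEA", "EMEA", "APAC", "APAC", "AMER", "AMER"],
        "order_month": ["2025-01", "2025-02", "2025-01", "2025-02", "2025-01", "2024-12"],
        "order_year": [2025, 2025, 2025, 2025, 2025, 2024],
        "revenue": [100, 150, 80, 60, 200, 999],
    }
)

# Filter pre-step followed by the pivot itself.
filtered = orders[orders["order_year"] == 2025]
pivot = pd.pivot_table(
    filtered, index=["region"], columns=["order_month"], values="revenue", aggfunc="sum"
)

# Shape check: one row per region and one column per month that survive the filter.
expected_shape = (filtered["region"].nunique(), filtered["order_month"].nunique())
assert pivot.shape == expected_shape

# Top-N check: rank rows by their total across columns, then keep the first N.
n = 2
totals = pivot.sum(axis=1)  # missing cells are skipped when summing
top = pivot.loc[totals.nlargest(n).index]
assert len(top) == min(n, len(pivot))

print(top)
```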

Edge cases

Ambiguous question (multiple plausible dimensions): emit two candidate specs ranked by confidence and ask for disambiguation.
Questions implying a chart (e.g., "show me a line chart of..."): produce the spec but tag chart_hint: line in the output JSON.
Questions with relative time ("last 30 days") and no current-date context: resolve "today" to the current system date, expressed as an ISO 8601 date (YYYY-MM-DD).
Columns with overlapping sample values across categories: prefer the column whose name has the smallest Levenshtein distance to the matching word in the question.
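
The Levenshtein tie-break in the last edge case can be sketched with a standard dynamic-programming edit distance. The closest_column helper and the candidate column names are hypothetical, and comparing lowercased names is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # delete ch_a
                    curr[j - 1] + 1,  # insert ch_b
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def closest_column(word, candidates):
    """Pick the candidate column name with the smallest distance to the question word."""
    return min(candidates, key=lambda name: levenshtein(word.lower(), name.lower()))


print(closest_column("region", ["region_name", "sales_rep_region"]))
```

Raw edit distance tends to favor shorter column names, so a length-normalized distance may be a better fit when candidate names differ widely in length.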

amitte/pivot-spec-from-question