CSV Schema Inferer

What this skill does

Reads a CSV file (full or sampled) and produces a JSON Schema (Draft 2020-12) describing each column's type, nullability, observed range or enum, and a one-line example. Useful as the first step of a data-onboarding pipeline.

Inputs

csv_path: path to the CSV file.
Optional sample_rows: number of data rows to sample (default 10000).
Optional delimiter: defaults to auto-detected via the Python csv.Sniffer heuristic.
Optional out_path: defaults to <csv_path>.schema.json.
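
The inputs above could be held in a small options object. This is a minimal sketch: the class name InferOptions and the method resolved_out_path are hypothetical, and only the defaults (10000 rows, auto-detected delimiter, <csv_path>.schema.json) come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InferOptions:
    """Hypothetical container for the skill's inputs."""

    csv_path: str
    sample_rows: int = 10000          # default from the Inputs list
    delimiter: Optional[str] = None   # None means "auto-detect with csv.Sniffer"
    out_path: Optional[str] = None    # None means "<csv_path>.schema.json"

    def resolved_out_path(self) -> str:
        # The suffix is appended to the full CSV path, as the default is written.
        return self.out_path or f"{self.csv_path}.schema.json"
```

For example, InferOptions("data/users.csv").resolved_out_path() gives "data/users.csv.schema.json".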

Steps

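Detect the delimiter by reading the first 4096 bytes, decoding them to text, and passing the result to csv.Sniffer().sniff(), which expects a string. Skip this step when delimiter is supplied.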
Read the header row; sanitize column names by trimming whitespace and lowercasing.
Read up to sample_rows data rows with a streaming reader (csv.reader in Python or csv-parse in Node).
Per column, count observed values in a small histogram, tracking at most 1024 unique values.
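Detect type per column by taking the first type that every non-blank value matches: bool (true|false|0|1 whitelist), then int (regex ^-?\d+$), then float, then ISO date (YYYY-MM-DD), then ISO datetime, default string. Bool is tested before int because a column of only 0 and 1 would otherwise always match int and the 0|1 whitelist would never apply.
Nullability: if any blank cell is observed, mark the column nullable. Draft 2020-12 has no nullable keyword (that belongs to OpenAPI 3.0), so express it as a type array such as ["integer", "null"].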
For numeric types, record minimum and maximum. For strings, record minLength and maxLength.
If a column has <= 32 unique non-null values, emit enum. Otherwise, emit a pattern if all values match a tight regex (e.g., UUID, email).
Compose a JSON Schema with $schema, type: object, properties, required (columns with no nulls), and additionalProperties: false.
Write to out_path.
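
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: the names ColumnStats and infer_schema are hypothetical, the input is already-decoded text, bool is checked before int so that the 0|1 whitelist is reachable, enum is emitted only for string columns, nullability is written as a type array, and pattern detection, the per-column example, and the ambiguity report are left out for brevity.

```python
import csv
import io
import itertools
import re
from datetime import date, datetime

UNIQUE_CAP = 1024  # per-column cap on tracked distinct values
ENUM_MAX = 32      # emit enum at or below this many distinct values

INT_RE = re.compile(r"^-?\d+$")
FLOAT_RE = re.compile(r"^-?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def _parses(parser, text):
    try:
        parser(text)
        return True
    except ValueError:
        return False


# Ordered checks; a column gets the first type that all its values match.
CHECKS = {
    "boolean": lambda s: s.lower() in {"true", "false", "0", "1"},
    "integer": lambda s: INT_RE.fullmatch(s) is not None,
    "number": lambda s: FLOAT_RE.fullmatch(s) is not None,
    "date": lambda s: bool(DATE_RE.fullmatch(s)) and _parses(date.fromisoformat, s),
    "date-time": lambda s: len(s) > 10 and _parses(datetime.fromisoformat, s),
}


class ColumnStats:
    def __init__(self):
        self.candidates = list(CHECKS)  # types every value so far still matches
        self.nullable = False
        self.seen = 0
        self.uniques = set()
        self.num_min = self.num_max = None
        self.min_len = self.max_len = None

    def add(self, cell):
        if cell == "":
            self.nullable = True
            return
        self.seen += 1
        self.candidates = [t for t in self.candidates if CHECKS[t](cell)]
        if len(self.uniques) < UNIQUE_CAP:
            self.uniques.add(cell)
        n = len(cell)
        self.min_len = n if self.min_len is None else min(self.min_len, n)
        self.max_len = n if self.max_len is None else max(self.max_len, n)
        if "number" in self.candidates:  # every value so far is numeric
            x = int(cell) if INT_RE.fullmatch(cell) else float(cell)
            self.num_min = x if self.num_min is None else min(self.num_min, x)
            self.num_max = x if self.num_max is None else max(self.num_max, x)

    def to_schema(self):
        kind = self.candidates[0] if self.seen and self.candidates else "string"
        if kind in ("date", "date-time"):
            schema = {"type": "string", "format": kind}
        else:
            schema = {"type": kind}
        if kind in ("integer", "number"):
            schema.update(minimum=self.num_min, maximum=self.num_max)
        elif kind == "string" and self.seen:
            schema.update(minLength=self.min_len, maxLength=self.max_len)
            if len(self.uniques) <= ENUM_MAX:
                extra = [None] if self.nullable else []
                schema["enum"] = sorted(self.uniques) + extra
        if self.nullable:
            schema["type"] = [schema["type"], "null"]
        return schema


def infer_schema(text, sample_rows=10000, delimiter=None):
    if delimiter is None:
        delimiter = csv.Sniffer().sniff(text[:4096]).delimiter
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = [name.strip().lower() for name in next(reader, [])]
    stats = [ColumnStats() for _ in header]
    for row in itertools.islice(reader, sample_rows):
        for col, cell in zip(stats, row):
            col.add(cell)
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {n: s.to_schema() for n, s in zip(header, stats)},
        "required": [n for n, s in zip(header, stats) if not s.nullable],
        "additionalProperties": False,
    }
```

Writing the result is then a matter of json.dump(schema, f, indent=2) on a file opened at out_path. Keeping a shrinking list of candidate types per column means each cell is tested once as it streams past, so the sampled rows never need to be held in memory.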

Output

A JSON Schema file at out_path that validates rows of the CSV when each row is reshaped to an object. Stdout prints column count, sample size used, and any columns where type detection was ambiguous (mixed types).
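
Because CSV cells are always strings, "reshaped to an object" implies casting each cell to the type the schema declares before validating. A minimal sketch of that reshaping follows; the function name row_to_object is hypothetical, and it assumes the header has already been sanitized to match the schema's property names.

```python
def row_to_object(header, row, schema):
    """Cast CSV strings to the JSON types declared in schema["properties"]."""
    obj = {}
    for name, cell in zip(header, row):
        declared = schema["properties"][name]["type"]
        types = declared if isinstance(declared, list) else [declared]
        if cell == "":
            obj[name] = None      # blank cell becomes JSON null
        elif "integer" in types:
            obj[name] = int(cell)
        elif "number" in types:
            obj[name] = float(cell)
        elif "boolean" in types:
            obj[name] = cell.lower() in ("true", "1")
        else:
            obj[name] = cell      # strings, dates, and datetimes stay as text
    return obj
```

A blank cell in a column that is not nullable still becomes null here, so the validator, rather than the reshaping step, reports the mismatch.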

Verification

Round-trip validate: parse N random rows from the sampled portion of the CSV as objects and run them through ajv against the produced schema; confirm zero validation errors.
Re-run with sample_rows = sample_rows * 2 and confirm the schema's enums and ranges only widen, never narrow.
If any column came back as string despite most values matching a numeric regex, type detection found at least one non-numeric value; the report should show one example.
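
The widen-only check can be automated by comparing the two schemas. The sketch below is one possible way to do it; the function name narrowed_columns is hypothetical, and it only compares bounds and enums that are present in both schemas.

```python
LOWER_BOUNDS = ("minimum", "minLength")
UPPER_BOUNDS = ("maximum", "maxLength")


def narrowed_columns(old, new):
    """Return columns whose range or enum in `new` is narrower than in `old`."""
    narrowed = []
    for name, before in old["properties"].items():
        after = new["properties"].get(name, {})
        bad = False
        # Lower bounds may only decrease; upper bounds may only increase.
        for key in LOWER_BOUNDS:
            if key in before and key in after and after[key] > before[key]:
                bad = True
        for key in UPPER_BOUNDS:
            if key in before and key in after and after[key] < before[key]:
                bad = True
        # An enum may grow or be dropped entirely, but must not lose members.
        if "enum" in before and "enum" in after:
            if not set(before["enum"]) <= set(after["enum"]):
                bad = True
        if bad:
            narrowed.append(name)
    return narrowed
```

An empty list means the larger sample only widened the schema; any name returned points at a column worth inspecting by hand.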

Edge cases

Files with quoted multi-line fields: rely on the streaming parser, not naive line splits.
Mixed types in one column (e.g., dates and "TBD"): emit oneOf with both alternatives or a string with pattern.
Empty file or header-only file: emit a stub schema with empty properties and a warning.
CSVs with BOM: strip the UTF-8 BOM from the first column name.
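
Two of the edge cases above (BOM and quoted multi-line fields) can be handled with the standard library alone. A minimal sketch, assuming the file is UTF-8 and using a hypothetical helper name read_rows:

```python
import csv
import io


def read_rows(raw: bytes, delimiter=","):
    # utf-8-sig drops a leading UTF-8 BOM if present and is a no-op otherwise,
    # so the first column name comes out clean.
    text = raw.decode("utf-8-sig")
    # newline="" leaves line endings untranslated so csv.reader can keep
    # newlines that sit inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)
```

Calling read_rows on b'\xef\xbb\xbfid,note\r\n1,"line one\nline two"\r\n' yields two rows: the header ['id', 'note'] and a single data row whose second field contains the embedded newline. When reading from disk, open(csv_path, encoding="utf-8-sig", newline="") has the same effect.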

amitte/csv-schema-inferer