Cohort Retention Analyzer

What this skill does

Computes weekly cohort retention from an event log: for each cohort (users who signed up in week W), it reports the share of those users who returned in each subsequent week. Output is a triangular CSV plus a chart-ready JSON.

Inputs

events_path: a CSV or NDJSON file with at least user_id, event_name, event_ts columns.
signup_event: event name marking the start of a user (e.g., account_created).
return_event: event name counted as retention (e.g., app_open).
Optional weeks: number of post-signup weeks to track (default 12).

Steps

Validate the file: confirm header includes the three required columns; abort otherwise.
Load events with a streaming reader. Parse event_ts as ISO timestamp; convert to UTC.
Build the cohort map: for each user, find the earliest signup_event timestamp; assign the user to the ISO week of that timestamp (year-week pair).
For each cohort, count distinct users (cohort size).
For each cohort and each week-offset 0..weeks, count distinct users with at least one return_event falling in [signup_week + offset, signup_week + offset + 1).
Compute retention percent = returners / cohort_size.
Emit a triangular CSV: rows are cohort weeks; columns are W0, W1, ..., W<weeks>; values are retention percentages.
Emit a JSON sidecar suitable for charting: {cohorts: [{week, size, retention: [...]}, ...]}.
Compute aggregate metrics: median W1 retention, median W4, median W12; print to stdout.
Save outputs as retention.csv and retention.json.

Output

Two files: retention.csv (triangular table) and retention.json (machine-readable). Stdout summary lists cohort count, smallest cohort size, and the median retention at W1 / W4 / W12.

Verification

Pick two cohorts and recompute their W1 retention by SQL or pandas: the value must match the CSV cell exactly. Confirm the matrix is triangular (a cohort starting in the most recent week has no W1+ data; cells must be empty, not zero). If aggregate W0 retention is not 100% for every cohort, the data has duplicate signup events — surface a warning before publishing.

Edge cases

Users with multiple signup_event rows: keep only the earliest.
Sparse data (cohorts under 30 users): mark them in the JSON with low_n: true.
Time zone skew (events in user-local time): document the assumption in the report header; do not silently rewrite.
Recent partial weeks: include but mark partial: true so consumers don't compare incomplete data with complete cohorts.

amitte/cohort-retention-analyzer