Why did users actually churn?
Most churn models answer who will leave. This analysis answers why — and what intervention would have changed the outcome. The difference between correlation and causation is the difference between a dashboard and a decision.
IBM's Telco Customer Churn dataset is a subscription business in miniature — 7,043 customers, each described by their contract type, service usage, tenure, charges, and whether they left. It maps directly to any recurring-revenue product: SaaS, media, newspaper subscriptions.
View data loading code (Python / pandas)
    import pandas as pd
    import numpy as np

    # Load IBM Telco dataset
    df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

    # Clean target variable: 1 = churned, 0 = retained
    df['Churn'] = (df['Churn'] == 'Yes').astype(int)

    # Blank TotalCharges strings become NaN, then 0
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)

    # Tenure groups for visualisation
    # include_lowest=True keeps customers with tenure == 0 in the first bin
    df['TenureGroup'] = pd.cut(
        df['tenure'],
        bins=[0, 12, 24, 36, 48, 60, 72],
        labels=['0–12m', '13–24m', '25–36m', '37–48m', '49–60m', '61–72m'],
        include_lowest=True
    )

    print(df.groupby('Contract')['Churn'].mean().round(3))
    # Contract
    # Month-to-month    0.427
    # One year          0.113
    # Two year          0.028
A naive comparison shows month-to-month customers churning at 42.7% versus 2.8% on two-year contracts, a gap of 39.9 percentage points. The obvious product decision: push everyone onto longer contracts. But this conclusion skips a critical question: are these groups actually comparable?
View confounder analysis code
    # Check confounding: are contract groups actually comparable?
    confounder_check = df.groupby('Contract')[
        ['tenure', 'MonthlyCharges', 'SeniorCitizen']
    ].mean().round(2)
    print(confounder_check)
    # Contract        tenure  MonthlyCharges  SeniorCitizen
    # Month-to-month    17.9            66.4           0.18
    # One year          34.4            65.0           0.12
    # Two year          55.5            60.4           0.10
    # Groups differ massively on tenure, so they are not directly comparable

    # Standardised mean difference (balance check)
    def smd(group1, group2):
        diff = group1.mean() - group2.mean()
        pooled_std = np.sqrt((group1.var() + group2.var()) / 2)
        return diff / pooled_std

    m2m = df[df['Contract'] == 'Month-to-month']
    longer = df[df['Contract'] != 'Month-to-month']
    print(smd(m2m['tenure'], longer['tenure']))  # → -1.24 (severe imbalance)
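For reference, the smd helper above computes the standardised mean difference in its usual pooled-variance form. Here the subscript 1 denotes month-to-month customers and 0 denotes customers on longer contracts, with sample means and sample variances as returned by pandas. A common rule of thumb, which is an assumption here rather than something the analysis states, treats an absolute value above roughly 0.1 as a sign of imbalance.

```latex
\mathrm{SMD} = \frac{\bar{x}_{1} - \bar{x}_{0}}{\sqrt{\left(s_{1}^{2} + s_{0}^{2}\right)/2}}
```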
Before running any model, we encode our causal assumptions as a Directed Acyclic Graph. This makes assumptions explicit and auditable. The DAG defines which variables are confounders (must be controlled), mediators (should not be controlled), and instruments.
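As a rough sketch of what "explicit and auditable" can look like, the DAG can be written down as a plain edge list and checked mechanically. The three confounder edges below mirror the common causes used in this analysis. The node PerceivedLockIn is a hypothetical mediator added purely for illustration, and the helper names (is_acyclic, classify) are made up for this example. The role definitions are deliberately simplified; a full instrument check, for instance, has further conditions.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Edges as (cause, effect). "PerceivedLockIn" is a HYPOTHETICAL mediator,
# included only to show how the classification works.
EDGES = [
    ("tenure", "MonthToMonth"), ("tenure", "Churn"),
    ("SeniorCitizen", "MonthToMonth"), ("SeniorCitizen", "Churn"),
    ("InternetService", "MonthToMonth"), ("InternetService", "Churn"),
    ("MonthToMonth", "PerceivedLockIn"), ("PerceivedLockIn", "Churn"),
    ("MonthToMonth", "Churn"),
]

def build_children(edges):
    """Map each node to the set of its direct effects."""
    graph = {}
    for cause, effect in edges:
        graph.setdefault(cause, set()).add(effect)
        graph.setdefault(effect, set())
    return graph

def is_acyclic(edges):
    """True if the edge list contains no directed cycle."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
        parents.setdefault(cause, set())
    try:
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

def descendants(graph, node):
    """All nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def classify(edges, treatment, outcome):
    """Simplified role assignment for every node other than treatment and outcome."""
    graph = build_children(edges)
    # Same graph without the treatment node: paths here bypass the treatment
    bypass = {n: kids - {treatment} for n, kids in graph.items() if n != treatment}
    downstream = descendants(graph, treatment)
    roles = {"confounders": set(), "mediators": set(), "instruments": set()}
    for node in graph:
        if node in (treatment, outcome):
            continue
        if node in downstream:
            # Caused by the treatment and causing the outcome: a mediator
            if outcome in descendants(graph, node):
                roles["mediators"].add(node)
        elif treatment in descendants(graph, node):
            # Causes the treatment; confounder if it also reaches the outcome directly
            if outcome in descendants(bypass, node):
                roles["confounders"].add(node)
            else:
                roles["instruments"].add(node)
    return roles

print(is_acyclic(EDGES))
print(classify(EDGES, "MonthToMonth", "Churn"))
```

Writing the graph as data means a reviewer can challenge a single edge, and the set of variables to control for updates accordingly.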
View DoWhy causal model code
    from dowhy import CausalModel

    # Binary treatment indicator: 1 = month-to-month, 0 = longer contract
    df['MonthToMonth'] = (df['Contract'] == 'Month-to-month').astype(int)

    model = CausalModel(
        data=df,
        treatment='MonthToMonth',
        outcome='Churn',
        common_causes=['tenure', 'SeniorCitizen', 'InternetService'],
        instruments=[]
    )

    # Identify the causal effect
    identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
    print(identified_estimand)
    # Estimand type: nonparametric-ate
    # Backdoor variables: tenure, SeniorCitizen, InternetService
We can't run a randomised experiment, because we can't randomly assign customers to contract types. Propensity Score Matching is the next best thing: for each month-to-month customer, find a customer on a longer contract with a similar estimated probability of being month-to-month, then compare outcomes within matched pairs. This blocks the backdoor paths through tenure and the other measured confounders. It does nothing about confounders that were never measured.
View propensity matching code
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Indicator columns used below (assumed definitions; not present in the raw CSV)
    df['InternetService_Fiber'] = (df['InternetService'] == 'Fiber optic').astype(int)
    df['TechSupport_Yes'] = (df['TechSupport'] == 'Yes').astype(int)

    # Step 1: Estimate propensity scores
    covariates = ['tenure', 'MonthlyCharges', 'SeniorCitizen',
                  'InternetService_Fiber', 'TechSupport_Yes']
    X = df[covariates]
    T = df['MonthToMonth']
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_scaled, T)
    df['propensity_score'] = lr.predict_proba(X_scaled)[:, 1]

    # Step 2: Match on propensity score (nearest neighbour, caliper 0.05)
    treated = df[df['MonthToMonth'] == 1].copy()
    control = df[df['MonthToMonth'] == 0].copy()
    matched_pairs = []
    for _, t_row in treated.iterrows():
        if control.empty:
            break  # no controls left to match
        diffs = (control['propensity_score'] - t_row['propensity_score']).abs()
        best_match_idx = diffs.idxmin()
        if diffs[best_match_idx] <= 0.05:
            matched_pairs.append((t_row.name, best_match_idx))
            control = control.drop(best_match_idx)  # no replacement

    print(f'Matched pairs: {len(matched_pairs)}')  # → 2,841 matched pairs
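Matching is only useful if it actually improves balance, so a natural follow-up is to recompute the standardised mean difference on the matched sample. The sketch below uses only the standard library and a handful of made-up propensity scores. The function names and numbers are illustrative assumptions, not output from the Telco data.

```python
from statistics import mean, variance

def smd(a, b):
    """Standardised mean difference using the pooled sample variance."""
    pooled_sd = ((variance(a) + variance(b)) / 2) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

def greedy_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbour matching on a score, without replacement."""
    available = list(control)
    pairs = []
    for t in treated:
        if not available:
            break  # ran out of controls
        nearest = min(available, key=lambda c: abs(c - t))
        if abs(nearest - t) <= caliper:
            pairs.append((t, nearest))
            available.remove(nearest)  # each control is used at most once
    return pairs

# Made-up propensity scores, for illustration only
treated_scores = [0.92, 0.81, 0.70, 0.58]
control_scores = [0.78, 0.66, 0.55, 0.30, 0.10]

pairs = greedy_match(treated_scores, control_scores, caliper=0.05)
smd_before = smd(treated_scores, control_scores)
smd_after = smd([t for t, _ in pairs], [c for _, c in pairs])

print(f"pairs kept: {len(pairs)}")
print(f"SMD before: {smd_before:.2f}, after: {smd_after:.2f}")
```

The treated unit with score 0.92 has no control within the caliper and is dropped. That is the usual trade-off: better balance in exchange for a smaller sample that no longer covers everyone.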
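With matched groups, we estimate the Average Treatment Effect (ATE): the causal impact of each intervention on churn probability, valid only if all relevant confounders have been measured and balanced. Because matching starts from month-to-month customers and looks for a control for each one, the estimate is strictly closer to the effect on the treated (ATT) than to a population-wide ATE. The gap between naive correlation and causal estimate reveals how much confounding was inflating the apparent effect.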
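In potential-outcomes notation (a standard convention, not spelled out in the analysis itself), let T = 1 denote a month-to-month contract and Y(t) the churn outcome a customer would have under contract type t. The quantities being compared can then be sketched as:

```latex
\begin{aligned}
\text{Naive gap} &= E[Y \mid T = 1] - E[Y \mid T = 0] \\
\mathrm{ATE} &= E[Y(1) - Y(0)] \\
\mathrm{ATT} &= E[Y(1) - Y(0) \mid T = 1]
\end{aligned}
```

The naive gap equals the ATE only when treatment is independent of the potential outcomes. Randomisation would guarantee that. Matching tries to approximate it on the measured covariates.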
View ATE and CATE estimation code
    from econml.dml import LinearDML
    from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

    # Difference in churn rates on the matched sample
    matched_df = df.loc[[i for pair in matched_pairs for i in pair]]
    ate = (matched_df[matched_df['MonthToMonth'] == 1]['Churn'].mean()
           - matched_df[matched_df['MonthToMonth'] == 0]['Churn'].mean())
    print(f'ATE (contract): {ate:.3f}')  # → 0.243

    # Conditional ATE using Double ML (EconML)
    est = LinearDML(
        model_y=GradientBoostingRegressor(),
        model_t=GradientBoostingClassifier(),
        discrete_treatment=True
    )

    # Note: 'EngagementScore' is not a column in the raw Telco CSV;
    # it has to be derived before this step
    X_cate = df[['tenure', 'MonthlyCharges', 'EngagementScore']]
    est.fit(df['Churn'], df['MonthToMonth'], X=X_cate, W=df[covariates])

    cate_estimates = est.effect(X_cate)
    print(f'CATE range: {cate_estimates.min():.3f} to {cate_estimates.max():.3f}')
    # → CATE range: 0.071 to 0.338
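To see how confounding can inflate a naive gap, here is a toy stratification example. The counts are made up for illustration and are not taken from the Telco data. In it, short-tenure customers both churn more and skew heavily towards month-to-month contracts. Comparing within tenure strata, then weighting by stratum size, is a simpler adjustment than matching, but it shows the same mechanism.

```python
# Made-up counts: (customers, churners) per tenure stratum and contract group
strata = {
    "short tenure": {"m2m": (800, 400), "longer": (200, 60)},
    "long tenure":  {"m2m": (200, 40),  "longer": (800, 80)},
}

def churn_rate(group):
    customers, churners = group
    return churners / customers

def naive_gap(strata):
    """Pooled churn-rate difference, ignoring tenure."""
    rates = {}
    for key in ("m2m", "longer"):
        customers = sum(s[key][0] for s in strata.values())
        churners = sum(s[key][1] for s in strata.values())
        rates[key] = churners / customers
    return rates["m2m"] - rates["longer"]

def stratified_gap(strata):
    """Within-stratum differences, weighted by each stratum's share of all customers."""
    total = sum(s["m2m"][0] + s["longer"][0] for s in strata.values())
    gap = 0.0
    for s in strata.values():
        weight = (s["m2m"][0] + s["longer"][0]) / total
        gap += weight * (churn_rate(s["m2m"]) - churn_rate(s["longer"]))
    return gap

print(f"naive gap:      {naive_gap(strata):.2f}")       # 0.30
print(f"stratified gap: {stratified_gap(strata):.2f}")  # 0.15
```

In this toy setup half of the naive gap comes from the difference in tenure mix between the two contract groups, not from the contract itself.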
Three decisions that follow directly from the causal analysis — each one different from what a purely predictive model would have recommended.