News Froggy

Causal Inference for LLM Features: The Propensity Score

Published: May 1, 2026
Reading time: 9 min

Every product experimentation team eventually confronts a common challenge when launching new features, especially those leveraging Large Language Models (LLMs): the 'Opt-In Trap'. Imagine shipping a new AI assistant mode. Your dashboard reports that users who enable it have a task-completion rate 21 percentage points higher than users who don't. It looks fantastic, yet you instinctively know something's amiss.

The core problem? Users who opt into features, particularly those behind a user-controlled toggle (e.g., "Try our AI assistant," "Enable smart replies"), are rarely a random sample. Heavy-engagement users, early adopters, or those with specific intentions are far more likely to click. This self-selection creates a systematic difference between your 'treated' (opted-in) and 'control' (non-opted-in) groups before the feature even makes an impact. A naive comparison, like a simple t-test, will attribute these pre-existing differences (selection bias) to the feature's effect, leading to a wildly inflated estimate. In our synthetic dataset, a true +8 percentage point effect inflates to +21 percentage points – a 2.6x overshoot.

Why Naive Comparisons Fail for Opt-In Features

Three mechanisms commonly drive this selection bias:

  1. Selection on Engagement: Power users are more likely to explore new features. If your most engaged users opt in at a high rate (e.g., 65%) while light users opt in at a low rate (e.g., 12%), the opted-in group is dominated by users who were already more active and more likely to complete tasks, irrespective of the new feature.
  2. Selection on Intent: Users opting into a feature often have an immediate, specific need. A developer enabling "code suggestions" likely has code to write and would show higher task completion even without the AI assist.
  3. Selection on Risk Tolerance: Early adopters are often more tolerant of initial bugs or performance issues. This means your opt-in group might be enriched with users less likely to churn due to imperfections, impacting downstream metrics.
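To see how selection on engagement alone can manufacture an inflated lift, here is a minimal simulation sketch. The 65% and 12% opt-in rates come from the example above; the 50/50 user mix, the baseline completion rates (0.70 and 0.40), and the +0.08 true effect are assumptions chosen purely for illustration, not values from the companion dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
TRUE_EFFECT = 0.08  # assumed ground-truth lift in completion probability

# Assumption: half the users are heavy, half are light.
heavy = rng.random(n) < 0.5

# Opt-in depends only on engagement (65% heavy vs. 12% light).
opt_in = rng.random(n) < np.where(heavy, 0.65, 0.12)

# Hypothetical baseline completion rates; the feature adds TRUE_EFFECT.
p_complete = np.where(heavy, 0.70, 0.40) + TRUE_EFFECT * opt_in
completed = rng.random(n) < p_complete

# Naive comparison mixes the feature effect with the engagement gap.
naive = completed[opt_in].mean() - completed[~opt_in].mean()

# Comparing within each tier, then averaging over the population mix,
# removes the engagement gap.
stratified = 0.0
for tier in (True, False):
    in_tier = heavy == tier
    diff = completed[in_tier & opt_in].mean() - completed[in_tier & ~opt_in].mean()
    stratified += diff * in_tier.mean()

print(f"Naive difference:      {naive:+.3f}")
print(f"Stratified difference: {stratified:+.3f}")
```

Under these assumptions the naive difference lands well above +0.20, while the within-tier comparison stays close to the assumed +0.08. The feature's effect is identical in both calculations; only the comparison group changes.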

What Propensity Scores Do

Propensity score methods are statistical tools designed to disentangle selection bias from the true causal effect. A propensity score is simply the probability that a user opts in, given their observable characteristics (e.g., engagement tier, historical behavior). By estimating this probability, we can reweight or rematch our comparison groups to make them appear balanced on these observable characteristics, much like a randomized experiment would.

This relies on three key identification assumptions:

  1. Unconfoundedness: All variables influencing both opt-in and the outcome are included in your propensity model.
  2. Overlap (Positivity): Every user has a non-zero probability of both opting in and not opting in. This ensures we have comparable users across both groups.
  3. No Interference (SUTVA): One user's opt-in decision doesn't affect another user's outcome.

Violating any of these can lead to biased estimates.
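For readers who prefer notation, the propensity score and the three assumptions can be written in standard potential-outcomes form. The symbols are introduced here for illustration and are not used elsewhere in the article: T is the opt-in indicator, X the observed covariates, and Y(1), Y(0) the outcomes a user would have with and without the feature.

```latex
\begin{align*}
e(x) &= \Pr(T = 1 \mid X = x) && \text{(propensity score)} \\
\bigl(Y(1),\, Y(0)\bigr) &\perp\!\!\!\perp T \mid X && \text{(unconfoundedness)} \\
0 < e(x) &< 1 \quad \text{for all } x && \text{(overlap)} \\
Y_i &= T_i\, Y_i(1) + (1 - T_i)\, Y_i(0) && \text{(no interference)}
\end{align*}
```

The last line says that a user's observed outcome depends only on that user's own opt-in status, which is what the no-interference assumption requires.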

Practical Implementation in Python

We'll walk through a pipeline using a synthetic SaaS dataset, where the ground-truth causal effect is known to be +0.08. You'll need numpy, pandas, scikit-learn, and matplotlib. First, clone the companion repository and generate the synthetic data:

```shell
git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
```

Load the data and observe the naive effect:

```python
import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

# Opt-in rate per engagement tier
print(df.groupby("engagement_tier").opt_in_agent_mode.mean().round(3))

# Naive effect: raw difference in completion rate between opted-in and other users
naive_effect = (
    df[df.opt_in_agent_mode == 1].task_completed.mean()
    - df[df.opt_in_agent_mode == 0].task_completed.mean()
)
print(f"Naive opt-in effect: {naive_effect:+.4f}")
```

Expected output:

```
engagement_tier
heavy     0.647
light     0.120
medium    0.353
Name: opt_in_agent_mode, dtype: float64
Naive opt-in effect: +0.2106
```

Heavy users opt in far more often than light users, and the naive effect is +0.2106, far from the true +0.08.

Step 1: Estimate the Propensity Score

We use logistic regression to predict opt_in_agent_mode based on observable characteristics. For production, include all relevant confounders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# One-hot encode the tier; query_confidence is passed through as-is
X = pd.get_dummies(
    df[["engagement_tier", "query_confidence"]], drop_first=True
).astype(float)
y_treat = df.opt_in_agent_mode

ps_model = LogisticRegression(max_iter=1000).fit(X, y_treat)
df["propensity"] = ps_model.predict_proba(X)[:, 1]

# Sanity checks for calibration and overlap
print(df.groupby("engagement_tier").propensity.mean().round(3))
print(
    f"Propensity range (treated): "
    f"{df[df.opt_in_agent_mode == 1].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 1].propensity.max():.3f}"
)
print(
    f"Propensity range (control): "
    f"{df[df.opt_in_agent_mode == 0].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 0].propensity.max():.3f}"
)
print(f"Propensity model AUC: {roc_auc_score(y_treat, df.propensity):.3f}")
```

Expected output:

```
engagement_tier
heavy     0.646
light     0.120
medium    0.353
Name: propensity, dtype: float64
Propensity range (treated): 0.114 - 0.675
Propensity range (control): 0.114 - 0.673
Propensity model AUC: 0.744
```

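The propensities are well calibrated (the mean propensity in each tier matches the observed opt-in rate), the model separates opted-in from non-opted-in users moderately well (AUC 0.744), and, crucially, the treated and control ranges overlap almost completely (roughly 0.11 to 0.67 in both groups), which is consistent with the positivity assumption. The AUC is not a target to maximize here: a value near 1.0 would mean opt-in is almost perfectly predictable, which implies poor overlap.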

Step 2: Inverse-Probability Weighting (IPW)

IPW reweights each user inversely to their probability of receiving their observed treatment status. This balances covariates, allowing us to estimate the Average Treatment Effect (ATE) or Average Treatment Effect on the Treated (ATT).
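Written out, the normalized (Hájek-style) estimators that correspond to the weights described in this step are shown here. The notation is an addition for illustration: T_i is the opt-in indicator, Y_i is task completion, and ê_i is the estimated propensity for user i.

```latex
\begin{align*}
\hat{\tau}_{\mathrm{ATE}} &=
  \frac{\sum_i T_i\, Y_i / \hat{e}_i}{\sum_i T_i / \hat{e}_i}
  \;-\;
  \frac{\sum_i (1 - T_i)\, Y_i / (1 - \hat{e}_i)}{\sum_i (1 - T_i) / (1 - \hat{e}_i)} \\[6pt]
\hat{\tau}_{\mathrm{ATT}} &=
  \frac{\sum_i T_i\, Y_i}{\sum_i T_i}
  \;-\;
  \frac{\sum_i (1 - T_i)\, Y_i\, \hat{e}_i / (1 - \hat{e}_i)}{\sum_i (1 - T_i)\, \hat{e}_i / (1 - \hat{e}_i)}
\end{align*}
```

Dividing by the sum of the weights, rather than by the sample size, keeps each weighted mean inside the range of the observed outcomes even when a few weights are large.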

```python
import numpy as np

# ATE weights: 1/P(treat) for treated, 1/P(no treat) for control
df["ipw"] = np.where(
    df.opt_in_agent_mode == 1,
    1 / df.propensity,
    1 / (1 - df.propensity),
)
t = df[df.opt_in_agent_mode == 1]
c = df[df.opt_in_agent_mode == 0]
ate_ipw = (
    (t.task_completed * t.ipw).sum() / t.ipw.sum()
    - (c.task_completed * c.ipw).sum() / c.ipw.sum()
)
print(f"IPW average treatment effect (ATE): {ate_ipw:+.4f}")

# ATT weights: 1 for treated, P(treat)/P(no treat) for control
df["ipw_att"] = np.where(
    df.opt_in_agent_mode == 1,
    1,
    df.propensity / (1 - df.propensity),
)
t = df[df.opt_in_agent_mode == 1]
c = df[df.opt_in_agent_mode == 0]
treated_mean = t.task_completed.mean()
control_w_mean = (c.task_completed * c.ipw_att).sum() / c.ipw_att.sum()
att_ipw = treated_mean - control_w_mean
print(f"IPW average treatment effect on treated (ATT): {att_ipw:+.4f}")
```

Expected output:

```
IPW average treatment effect (ATE): +0.0851
IPW average treatment effect on treated (ATT): +0.0770
```

Both the ATE (+0.0851) and the ATT (+0.0770) are far closer to the true +0.08 effect than the naive +0.2106. The ATE describes the effect averaged over the whole user population, while the ATT quantifies what the users who opted in actually gained.

Step 3: Nearest-Neighbor Matching

Matching pairs each treated user with one or more control users who have the most similar propensity scores. This directly estimates ATT.

```python
from sklearn.neighbors import NearestNeighbors

treated_ps = df[df.opt_in_agent_mode == 1][["propensity"]].values
control_ps = df[df.opt_in_agent_mode == 0][["propensity"]].values

# For each treated user, find the control user with the closest propensity
nn = NearestNeighbors(n_neighbors=1).fit(control_ps)
_, idx = nn.kneighbors(treated_ps)

treated_outcomes = df[df.opt_in_agent_mode == 1].task_completed.values
matched_control_outcomes = (
    df[df.opt_in_agent_mode == 0].task_completed.values[idx.flatten()]
)
att_match = (treated_outcomes - matched_control_outcomes).mean()
print(f"1-NN matching ATT: {att_match:+.4f}")
```

Expected output:

1-NN matching ATT: +0.0752

The 1-Nearest-Neighbor matching ATT (+0.0752) is also close to the ground truth. Matching with replacement (as done here) can reduce bias but potentially increase variance. For robust results, k-nearest-neighbor matching (k=3-5) with replacement is often a good default.
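As a rough illustration of the k-nearest-neighbor variant, here is a minimal NumPy-only sketch. The helper name `knn_match_att` and the toy data are hypothetical, and the propensity is assumed known rather than fitted. The brute-force distance matrix only suits small samples; on a dataset the size of the companion CSV, passing a larger `n_neighbors` to `NearestNeighbors` would be the more practical route.

```python
import numpy as np

def knn_match_att(propensity, treated, outcome, k=3):
    """ATT via k-nearest-neighbor matching on the propensity score, with replacement."""
    t_ps, c_ps = propensity[treated], propensity[~treated]
    c_outcome = outcome[~treated]
    # Brute-force distance matrix of shape (n_treated, n_control).
    dist = np.abs(t_ps[:, None] - c_ps[None, :])
    # Indices of the k closest controls per treated user (order within the k is arbitrary).
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    # Average the k matched control outcomes, then difference against the treated outcome.
    matched_mean = c_outcome[nearest].mean(axis=1)
    return (outcome[treated] - matched_mean).mean()

# Hypothetical toy data: one covariate drives both opt-in and the outcome.
rng = np.random.default_rng(1)
n = 3_000
x = rng.random(n)
propensity = 0.1 + 0.6 * x  # assumed known here; fitted in practice
treated = rng.random(n) < propensity
outcome = 0.3 + 0.4 * x + 0.08 * treated + rng.normal(0, 0.1, n)

naive = outcome[treated].mean() - outcome[~treated].mean()
att = knn_match_att(propensity, treated, outcome, k=3)
print(f"Naive: {naive:+.3f}   3-NN matching ATT: {att:+.3f}")
```

Averaging over several neighbors smooths out the noise of any single matched control, at the cost of using controls whose propensity is slightly further away.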

Step 4: Check Covariate Balance

Propensity score methods are only valid if they successfully balance the observable covariates. The Standardized Mean Difference (SMD) is the standard diagnostic. An |SMD| < 0.1 after weighting or matching indicates good balance.

```python
def smd(treated_vals, control_vals, treated_w=None, control_w=None):
    """Standardized mean difference, optionally with weights."""
    if treated_w is None:
        treated_w = np.ones(len(treated_vals))
    if control_w is None:
        control_w = np.ones(len(control_vals))

    t_mean = np.average(treated_vals, weights=treated_w)
    c_mean = np.average(control_vals, weights=control_w)
    # Denominator uses the unweighted variances of the two groups
    pooled_std = np.sqrt((treated_vals.var() + control_vals.var()) / 2)
    return (t_mean - c_mean) / pooled_std


engagement_heavy = (df.engagement_tier == "heavy").astype(float).values
qc = df.query_confidence.values
tr = (df.opt_in_agent_mode == 1).values

covariates = {
    "engagement_tier_heavy": engagement_heavy,
    "query_confidence": qc,
}

print(f"{'Covariate':<30} {'Raw SMD':>10} {'Weighted SMD':>15}")
for name, vals in covariates.items():
    smd_raw = smd(vals[tr], vals[~tr])
    smd_weighted = smd(
        vals[tr],
        vals[~tr],
        treated_w=df[tr].ipw.values,
        control_w=df[~tr].ipw.values,
    )
    print(f"{name:<30} {smd_raw:>+10.3f} {smd_weighted:>+15.3f}")
```

Expected output:

```
Covariate                         Raw SMD    Weighted SMD
engagement_tier_heavy              +0.742          +0.002
query_confidence                   -0.032          -0.003
```

The raw SMD for engagement_tier_heavy is a high +0.742, indicating severe imbalance. After IPW weighting, it drops to a negligible +0.002, well below the 0.1 threshold, confirming successful balance. If SMDs remain high, your propensity model needs more features or interaction terms.

When Propensity Score Methods Fail

Propensity score methods are powerful but not foolproof. They rely heavily on the unconfoundedness assumption – that all variables driving both opt-in and the outcome are included and correctly modeled. If there's an unobserved confounder, the estimates will remain biased, and no diagnostic computed from the observed data can rule that out. Similarly, a severe lack of overlap (e.g., one group of users who always opt in and another who never do) means there are no comparable users, making causal inference impossible for those users. In such cases, a true A/B test is the only reliable path forward.
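One way to put a number on lack of overlap is sketched below. It reports the share of users whose propensity falls outside a chosen band and the Kish effective sample size of the IPW weights. The helper name `overlap_diagnostics`, the 0.05 and 0.95 cutoffs (a common rule of thumb, not a value from this article), and the toy Beta-distributed propensities are all assumptions for illustration.

```python
import numpy as np

def overlap_diagnostics(propensity, treated, low=0.05, high=0.95):
    """Summarize overlap: share of users outside [low, high] and effective sample sizes."""
    keep = (propensity >= low) & (propensity <= high)
    weights = np.where(treated, 1 / propensity, 1 / (1 - propensity))

    def ess(w):
        # Kish effective sample size: equals len(w) when all weights are equal.
        return w.sum() ** 2 / (w ** 2).sum()

    return {
        "share_outside": 1 - keep.mean(),
        "ess_treated": ess(weights[treated]),
        "ess_control": ess(weights[~treated]),
        "keep_mask": keep,  # boolean mask for an optional trimmed re-analysis
    }

# Hypothetical propensities with a lot of mass near 0 and 1 (poor overlap).
rng = np.random.default_rng(2)
propensity = np.clip(rng.beta(0.5, 0.5, size=5_000), 1e-6, 1 - 1e-6)
treated = rng.random(5_000) < propensity

report = overlap_diagnostics(propensity, treated)
print(f"Share outside [0.05, 0.95]: {report['share_outside']:.1%}")
print(f"ESS treated: {report['ess_treated']:.0f} of {treated.sum()}")
print(f"ESS control: {report['ess_control']:.0f} of {(~treated).sum()}")
```

An effective sample size far below the raw group size means a handful of users carry most of the weight. Restricting the analysis to `keep_mask` is one response, with the caveat that the estimate then describes only the users inside the band.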

Practical Takeaways

For LLM-based features deployed behind an opt-in toggle, naive comparisons are dangerously misleading. Propensity score methods (IPW or matching) offer a robust approach to estimate the true causal effect by balancing observable confounders. Always verify covariate balance using SMDs and remember to quantify uncertainty with confidence intervals (e.g., via bootstrapping) for a complete picture.
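The bootstrap mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the companion repository's implementation. The propensity is estimated as the opt-in rate within each tier (a simplified stand-in for the logistic regression), the data are synthetic with hypothetical completion rates, and the helper names `ipw_ate` and `bootstrap_ci` are invented here. The key design point is that the propensity is re-estimated inside every resample, so the interval reflects uncertainty from that step too.

```python
import numpy as np

def ipw_ate(tier, treated, outcome):
    """Normalized IPW estimate of the ATE, with per-tier opt-in rates as propensities."""
    propensity = np.empty(len(tier))
    for g in np.unique(tier):
        in_tier = tier == g
        propensity[in_tier] = treated[in_tier].mean()
    w = np.where(treated, 1 / propensity, 1 / (1 - propensity))
    return (np.average(outcome[treated], weights=w[treated])
            - np.average(outcome[~treated], weights=w[~treated]))

def bootstrap_ci(tier, treated, outcome, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap interval; the propensity is re-fit on each resample."""
    rng = np.random.default_rng(seed)
    n = len(tier)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n, size=n)  # resample users with replacement
        estimates[b] = ipw_ate(tier[i], treated[i], outcome[i])
    return tuple(np.quantile(estimates, [alpha / 2, 1 - alpha / 2]))

# Hypothetical data: 0 = light, 1 = medium, 2 = heavy; assumed true effect +0.08.
rng = np.random.default_rng(3)
n = 20_000
tier = rng.integers(0, 3, size=n)
treated = rng.random(n) < np.array([0.12, 0.35, 0.65])[tier]
p_complete = np.array([0.35, 0.50, 0.65])[tier] + 0.08 * treated
outcome = (rng.random(n) < p_complete).astype(float)

point = ipw_ate(tier, treated, outcome)
lo, hi = bootstrap_ci(tier, treated, outcome)
print(f"IPW ATE: {point:+.4f}  (95% bootstrap CI: {lo:+.4f} to {hi:+.4f})")
```

With the real pipeline, the same loop would wrap the logistic regression fit and the weight computation, at a correspondingly higher runtime per resample.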

FAQ

Q: How do I choose between Inverse-Probability Weighting (IPW) and Matching?

A: The two methods can target different quantities. IPW with the standard weights estimates the Average Treatment Effect (ATE), the average effect if every user in the population were moved from not using the feature to using it; with ATT weights, which reweight the control group to resemble the treated group, it estimates the Average Treatment Effect on the Treated (ATT). Matching typically estimates the ATT, focusing on the effect for users who actually opted in. IPW can be more sensitive to extreme weights if there's poor overlap, while matching can be computationally intensive for very large datasets. The choice often depends on the specific question (population-level vs. treated-user impact) and data characteristics.

Q: What if my propensity model's AUC is low, or my SMDs are still high after weighting?

A: These two diagnostics mean different things. A low AUC (e.g., near 0.5) says the chosen covariates barely predict opt-in. That is a concern only if you believe opt-in is strongly selected, in which case important drivers are probably missing from the model. High SMDs (above 0.1) after weighting indicate that the covariates are still imbalanced, which usually points to a misspecified propensity model. In both cases, enrich the propensity model with more relevant features, include interaction terms, or consider a more flexible model (like gradient boosting) to better capture the relationship between covariates and opt-in behavior. Neither diagnostic can confirm unconfoundedness: balance on observed covariates says nothing about unobserved confounders. If balance still can't be achieved, or key drivers of opt-in are unmeasured, these methods become unreliable.

Q: Can propensity scores handle situations where multiple features are opted into simultaneously?

A: Propensity score methods, in their basic form, are designed for binary treatment (opt-in vs. no opt-in for a single feature). For multiple, simultaneous, and potentially interacting treatments, you would need more advanced causal inference techniques such as generalized propensity scores for continuous treatments, or multi-valued treatment propensity scores, which become significantly more complex. For product experimentation, it's often best to isolate feature impacts where possible, or design A/B tests to measure combinations of features if interaction is expected.
