Why did I get a different lift result when using get_cumlift() versus calculating line by line? #706

Open
AmyLin0515 opened this issue Nov 14, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@AmyLin0515

Describe the bug
Hi Team!
I used get_cumlift(), and got the lift for S-Learner like this:
image

When I tried to reproduce the result by calculating it manually, I got a different result from what get_cumlift() returned.

sorted_df = df_try.sort_values(col, ascending=False).reset_index(drop=True)
sorted_df.index = sorted_df.index + 1
sorted_df["cumsum_tr"] = sorted_df['w'].cumsum()
sorted_df["cumsum_ct"] = sorted_df.index.values - sorted_df["cumsum_tr"]
sorted_df["cumsum_y_tr"] = (sorted_df['y'] * sorted_df['w']).cumsum()
sorted_df["cumsum_y_ct"] = (sorted_df['y'] * (1 - sorted_df['w'])).cumsum()

This is what the table looks like:
image

And then I calculate the lift:

lift=[]
lift.append(sorted_df["cumsum_y_tr"] / sorted_df["cumsum_tr"] - sorted_df["cumsum_y_ct"] / sorted_df["cumsum_ct"])
lift = pd.concat(lift, join="inner", axis=1)
lift.loc[0] = np.zeros((lift.shape[1],))
lift = lift.sort_index().interpolate()

This is what the final result looks like:
image

I plotted the difference between the result from get_cumlift() and the manual calculation.
image

Does anyone know why they are different?

Environment (please complete the following information):

  • OS: Windows
  • Python Version: 3.8
  • Versions of Major Dependencies (pandas, scikit-learn, cython): pandas==1.3.5, scikit-learn==1.0.2, cython==0.29.34
@AmyLin0515 AmyLin0515 added the bug Something isn't working label Nov 14, 2023
@ras44
Collaborator

ras44 commented Nov 16, 2023

Hi @AmyLin0515

A couple of ideas:

See the code for get_cumlift here:

def get_cumlift(
    df, outcome_col="y", treatment_col="w", treatment_effect_col="tau", random_seed=42
):
    """Get average uplifts of model estimates in cumulative population.
    If the true treatment effect is provided (e.g. in synthetic data), it's calculated
    as the mean of the true treatment effect in each of cumulative population.
    Otherwise, it's calculated as the difference between the mean outcomes of the
    treatment and control groups in each of cumulative population.
    For details, see Section 4.1 of Gutierrez and G{\'e}rardy (2016), `Causal Inference
    and Uplift Modeling: A review of the literature`.
    For the former, `treatment_effect_col` should be provided. For the latter, both
    `outcome_col` and `treatment_col` should be provided.
    Args:
        df (pandas.DataFrame): a data frame with model estimates and actual data as columns
        outcome_col (str, optional): the column name for the actual outcome
        treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
        treatment_effect_col (str, optional): the column name for the true treatment effect
        random_seed (int, optional): random seed for numpy.random.rand()
    Returns:
        (pandas.DataFrame): average uplifts of model estimates in cumulative population
    """
    assert (
        (outcome_col in df.columns)
        and (treatment_col in df.columns)
        or treatment_effect_col in df.columns
    )
    df = df.copy()
    np.random.seed(random_seed)
    random_cols = []
    for i in range(10):
        random_col = "__random_{}__".format(i)
        df[random_col] = np.random.rand(df.shape[0])
        random_cols.append(random_col)
    model_names = [
        x
        for x in df.columns
        if x not in [outcome_col, treatment_col, treatment_effect_col]
    ]
    lift = []
    for i, col in enumerate(model_names):
        sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
        sorted_df.index = sorted_df.index + 1
        if treatment_effect_col in sorted_df.columns:
            # When treatment_effect_col is given, use it to calculate the average treatment effects
            # of cumulative population.
            lift.append(sorted_df[treatment_effect_col].cumsum() / sorted_df.index)
        else:
            # When treatment_effect_col is not given, use outcome_col and treatment_col
            # to calculate the average treatment_effects of cumulative population.
            sorted_df["cumsum_tr"] = sorted_df[treatment_col].cumsum()
            sorted_df["cumsum_ct"] = sorted_df.index.values - sorted_df["cumsum_tr"]
            sorted_df["cumsum_y_tr"] = (
                sorted_df[outcome_col] * sorted_df[treatment_col]
            ).cumsum()
            sorted_df["cumsum_y_ct"] = (
                sorted_df[outcome_col] * (1 - sorted_df[treatment_col])
            ).cumsum()
            lift.append(
                sorted_df["cumsum_y_tr"] / sorted_df["cumsum_tr"]
                - sorted_df["cumsum_y_ct"] / sorted_df["cumsum_ct"]
            )
    lift = pd.concat(lift, join="inner", axis=1)
    lift.loc[0] = np.zeros((lift.shape[1],))
    lift = lift.sort_index().interpolate()
    lift.columns = model_names
    lift[RANDOM_COL] = lift[random_cols].mean(axis=1)
    lift.drop(random_cols, axis=1, inplace=True)
    return lift

  1. Note that get_cumlift iterates over at least 10 random orderings, plus one more ordering for every column in your input df other than outcome_col, treatment_col, and treatment_effect_col:

for i in range(10):
    random_col = "__random_{}__".format(i)
    df[random_col] = np.random.rand(df.shape[0])
    random_cols.append(random_col)

for i, col in enumerate(model_names):
    sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
    sorted_df.index = sorted_df.index + 1

  2. Also, if treatment_effect_col is provided, it is used to calculate the ATE of the cumulative population:
    if treatment_effect_col in sorted_df.columns:
        # When treatment_effect_col is given, use it to calculate the average treatment effects
        # of cumulative population.

Not sure if you are providing the treatment_effect_col using synthetic data or not, but if that is the case, then 2) would apply.
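To make the difference between the two branches concrete, here is a minimal sketch on made-up toy data. The column names `y`, `w`, and `tau` follow the defaults in the signature quoted above; the `score` column and all values are assumptions for illustration only. It sorts once by `score` and then computes the cumulative lift both ways.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: 'score' stands in for a model
# estimate and 'tau' for a known true effect, as in synthetic data.
df = pd.DataFrame({
    "y":     [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    "w":     [1,   0,   1,   0,   1,   0],
    "tau":   [0.9, 0.7, 0.5, 0.3, 0.2, 0.1],
    "score": [0.8, 0.6, 0.9, 0.1, 0.4, 0.2],
})

s = df.sort_values("score", ascending=False).reset_index(drop=True)
s.index = s.index + 1

# Branch taken when a treatment-effect column is present:
# running mean of the true effect.
lift_from_tau = s["tau"].cumsum() / s.index

# Branch taken otherwise: running difference between the mean outcome
# of treated rows and the mean outcome of control rows.
cumsum_tr = s["w"].cumsum()
cumsum_ct = s.index.values - cumsum_tr
lift_from_outcomes = (
    (s["y"] * s["w"]).cumsum() / cumsum_tr
    - (s["y"] * (1 - s["w"])).cumsum() / cumsum_ct
)

print(pd.DataFrame({"from_tau": lift_from_tau, "from_outcomes": lift_from_outcomes}))
```

The two columns generally disagree, so a manual calculation based on `y` and `w` would not be expected to match the function output if the input frame also contains a column named `tau`.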

If you're not providing treatment_effect_col, then 1) still applies: the lift is computed over repeated random orderings, and the lift results are then interpolated.
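The interpolation step can be looked at in isolation. The sketch below applies the same three lines from the end of get_cumlift to a small hand-written series; the values are made up, with the first two rows set to NaN to mimic the case where no control row has been seen yet, so the control mean is 0 / 0.

```python
import numpy as np
import pandas as pd

# Made-up per-row lift for a single ordering, indexed from 1 as in get_cumlift.
raw = pd.Series([np.nan, np.nan, 1.0, 0.5], index=[1, 2, 3, 4])

lift = pd.concat([raw], join="inner", axis=1)
lift.loc[0] = np.zeros((lift.shape[1],))   # add a row of zeros at index 0
lift = lift.sort_index().interpolate()     # linear fill of the NaN rows

print(lift)
```

The NaN rows are filled linearly between the zero at index 0 and the first defined value, so the start of the curve is an artifact of interpolation rather than a measured lift.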

FYI, also see work in #707

@AmyLin0515
Author

Hi @ras44! Thanks for providing insights. I did find that the difference decreased a lot after I added 10 random columns and included them in the sort. However, I don't understand why we need to add these random columns. And if the order is eventually determined by the final (10th) random column, what is the point of adding so many of them?
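One way to read the quoted get_cumlift code is that each column is sorted separately from the same unsorted df (`df.sort_values` returns a new frame), and the ten random curves are averaged at the end. The sketch below illustrates that reading on made-up data. `cumlift_one` is a hypothetical helper that mirrors the outcome/treatment branch; it is not part of the library.

```python
import numpy as np
import pandas as pd


def cumlift_one(df, col):
    """Hypothetical helper: cumulative lift for one ordering column."""
    s = df.sort_values(col, ascending=False).reset_index(drop=True)
    s.index = s.index + 1
    n_tr = s["w"].cumsum()
    n_ct = s.index.values - n_tr
    return (s["y"] * s["w"]).cumsum() / n_tr - (s["y"] * (1 - s["w"])).cumsum() / n_ct


# Made-up data: random assignment, a constant effect of 0.5, a random score.
rng = np.random.RandomState(0)
n = 2000
df = pd.DataFrame({"w": rng.randint(0, 2, n)})
df["y"] = 0.5 * df["w"] + rng.normal(size=n)
df["score"] = rng.rand(n)

before = cumlift_one(df, "score")

# Add ten random columns, named as in the quoted code.
random_cols = []
for i in range(10):
    name = "__random_{}__".format(i)
    df[name] = rng.rand(n)
    random_cols.append(name)

after = cumlift_one(df, "score")
print("score curve unchanged by extra columns:", before.equals(after))

# One curve per random ordering, then the row-wise average.
random_curves = pd.concat([cumlift_one(df, c) for c in random_cols], axis=1)
random_mean = random_curves.mean(axis=1)
overall = df.loc[df["w"] == 1, "y"].mean() - df.loc[df["w"] == 0, "y"].mean()
```

Under this reading, the tenth random column does not overwrite the earlier orderings: every random column contributes its own curve, all curves end at the same overall difference in means, and averaging them gives a steadier random reference than any single ordering.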

@jeongyoonlee
Collaborator

Hi @AmyLin0515. In #799, we updated the uplift/Qini curve and score calculation to use the theoretical random baseline instead of the sampled one. With that change, we no longer use a set of ten random estimates to calculate the AUUC and Qini scores for random, which will always be 0.5 and 0.0, respectively.

To investigate the discrepancy between the manual calculation vs. the get_cumlift() outputs, can you provide the data used to run the simulation?
