Estimated reading time: 202 minutes
  1. Comment on the predictive strenght of interactions: no more thn two, otherwise the brain capsizes.

This notebook would study the possibilty of preempting customer churn so as to allow for the opportunity to intervene. As an analyst you have to understand who is leaving and why. Customers who have left in the last month are known as customers that have churned. This is thus the target you want to investigate, and you want to understand the characteristics associated with this target. All you do is a simple prediction excercise with a tree, understandable variables and feature importance measures for single and interaction variables. Another interesting feature is a wordcloud of important predictors. Here is an example of some wordclouds, https://nicholastsmith.wordpress.com/2017/08/16/using-random-forests-and-wordclouds-to-visualize-feature-importance-in-document-classification//

An important part now is the actuate on the knowledge that we have gathered. This is easy and can be done by formulating simple decision rules to identify the customers most at risk. This is done by identifying the important predictors and isolating those for which a large group of predictors point in a similat direction. I would recommend isolating no more than five variables and playing around with them as inputs to the model keeping all other variables stable at median or mean - good question as to which would be the most stable and then identify the liklihood to churn. What you can do is split the features by median in five pieces and then slowsly list the top ten most likley groups to fail. That is all you really need, I think it is a roundabout way of understand, but a useful method none the less.

import pandas as  pd

df = pd.read_csv("data/telco_churn.csv")

df.shape
(7043, 21)

How you treat your categorical variables depend on the cardinality. tenure, MonthlyCharges and TotalCharges are not cateogical variables, hence they show large cardinality. The rest of the variables are indeed categorical, in this problem set, one hot encoding would work fine because of the low cardinality present. In applied machine learning you do not want to exploit leakages, you want to fix them.

df["SeniorCitizen"] = df["SeniorCitizen"].astype("object")

pd.options.display.max_rows = None

df["TotalChargesConv"] = pd.to_numeric(df["TotalCharges"], errors='coerce')

df["TotalChargesConv"].isnull().sum()
11

TotalCharges is completely missing. From further analysis, I am convinced that they are missed at random and that the fact that they are missing does not contain any value. So I would go ahead and just fill the values with the mean

df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn TotalChargesConv
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No Month-to-month Yes Electronic check 29.85 29.85 No 29.85
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... No No No One year No Mailed check 56.95 1889.5 No 1889.50
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No Month-to-month Yes Mailed check 53.85 108.15 Yes 108.15
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No 1840.75
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No Month-to-month Yes Electronic check 70.70 151.65 Yes 151.65

5 rows × 22 columns

Here I would now also seek to see if there are more missing feature present.


def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

missing_data(df)
Total Percent
TotalChargesConv 11 0.156183
Churn 0 0.000000
gender 0 0.000000
SeniorCitizen 0 0.000000
Partner 0 0.000000
Dependents 0 0.000000
tenure 0 0.000000
PhoneService 0 0.000000
MultipleLines 0 0.000000
InternetService 0 0.000000
OnlineSecurity 0 0.000000
OnlineBackup 0 0.000000
DeviceProtection 0 0.000000
TechSupport 0 0.000000
StreamingTV 0 0.000000
StreamingMovies 0 0.000000
Contract 0 0.000000
PaperlessBilling 0 0.000000
PaymentMethod 0 0.000000
MonthlyCharges 0 0.000000
TotalCharges 0 0.000000
customerID 0 0.000000
df["TotalChargesConv"] = df["TotalChargesConv"].fillna(df["TotalChargesConv"].mean())

del df["TotalCharges"]

import numpy as np
# Keeping 15% for the validation set, thats about 1000 instances to test on 
msk = np.random.rand(len(df)) < 0.15
df["is_test"] = msk

cols_drop = ["customerID","Churn", "is_test"]
id_target = df[cols_drop]

df = pd.get_dummies(df.drop(cols_drop, axis=1))
df = pd.concat((id_target, df),axis=1)

df["Churn"] = df["Churn"].apply(lambda x: 0 if x=="No" else 1)

## Immediately Remove High Correlates
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))

# Find index of feature columns with correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.999999)]
to_drop
['gender_Male',
 'SeniorCitizen_1',
 'Partner_Yes',
 'Dependents_Yes',
 'PhoneService_Yes',
 'MultipleLines_No phone service',
 'OnlineSecurity_No internet service',
 'OnlineBackup_No internet service',
 'DeviceProtection_No internet service',
 'TechSupport_No internet service',
 'StreamingTV_No internet service',
 'StreamingMovies_No internet service',
 'PaperlessBilling_Yes']
df = df.drop(to_drop, axis=1)
train = df[df["is_test"]==False]
X_train = df[df["is_test"]==False].drop(cols_drop,axis=1)
y_train = df[df["is_test"]==False]["Churn"]

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
import gc
import random
params = {
         'boosting_type': 'gbdt',
          'max_depth': 6,
          'objective': 'binary',
          'n_estimators': 5000, 
          'nthread': 12,
          'num_leaves': 32,
          'learning_rate': 0.02,
          'subsample': 0.85,
          'colsample_bytree': 0.95,
          'reg_alpha': 0.5,
          'reg_lambda': 0.8,
          'min_split_gain': 0.05,
          'min_child_weight': 40,
          'min_child_samples': 5,
          'metric': 'auc'}



def pars(n_estimator_bottom, n_estimator_top):
    return random.sample(range(n_estimator_bottom, n_estimator_bottom+ n_estimator_top), 8) + random.sample(range(n_estimator_bottom, int(n_estimator_bottom + n_estimator_top/np.log(n_estimator_top))), 6) + random.sample(range(n_estimator_bottom, int(n_estimator_bottom+(n_estimator_top/np.log(n_estimator_top))/np.log(n_estimator_top))), 4)

    #return random.sample(range(n_estimator_bottom, n_estimator_bottom+ n_estimator_top), 4) + random.sample(range(n_estimator_bottom, int(n_estimator_bottom + n_estimator_top/np.log(n_estimator_top))), 3) + random.sample(range(n_estimator_bottom, int(n_estimator_bottom+(n_estimator_top/np.log(n_estimator_top))/np.log(n_estimator_top))), 2)



mdl = LGBMClassifier(
    **params
)

gridParams = {
    'learning_rate': [0.02, 0.05],
    'n_estimators': pars(10,15000),
    'num_leaves': pars(20,100),
    'boosting_type' : ['gbdt'],
    'random_state' : [405], # Updated from 'seed'
    'colsample_bytree' : [round(random.uniform(0.7,1),2) for _ in range(18)] ,
    'subsample' : [round(random.uniform(.2,.95),2) for _ in range(18)] ,
    'reg_alpha' : [round(random.uniform(0.1,4),2) for _ in range(18)],
    'reg_lambda' : [round(random.uniform(0.1,4),2) for _ in range(18)],
    }

from sklearn.model_selection import GridSearchCV from sklearn.model_selection import RandomizedSearchCV

Create the grid

LightGBM should be jobs 1

grid = RandomizedSearchCV(mdl, gridParams, cv=3, n_jobs=1, verbose=10) #grid = GridSearchCV(mdl, gridParams, cv=4, n_jobs=1, verbose=10)

Run the grid

grid.fit(X_train, y_train )

Took like a few minutes so worth it.

print(grid.best_params_) print(grid.best_score_)

Print the best parameters found

print(grid.best_params_) print(grid.best_score_)

params[‘subsample’] = grid.best_params_[‘subsample’] params[‘n_estimators’] = grid.best_params_[‘n_estimators’] params[‘colsample_bytree’] = grid.best_params_[‘colsample_bytree’] params[‘learning_rate’] = grid.best_params_[‘learning_rate’] params[‘num_leaves’] = grid.best_params_[‘num_leaves’] params[‘reg_alpha’] = grid.best_params_[‘reg_alpha’] params[‘reg_lambda’] = grid.best_params_[‘reg_lambda’]

params[‘subsample_for_bin’] = grid.best_params_[‘subsample_for_bin’]

print(‘Fitting with params: ‘) print(params)

X_test = df.loc[df.is_test, :].drop(cols_drop, axis=1)
y_test = df.loc[df.is_test, :]["Churn"]
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import roc_auc_score

n_splits = 5
cvv = StratifiedKFold(n_splits=n_splits, random_state=42)

oof_preds = np.zeros(X_train.shape[0])

sub = df.loc[df.is_test, :][cols_drop]
sub["Churn_Predict"] = 0
feature_importances = pd.DataFrame()
avg_iter = 0
for i, (fit_idx, val_idx) in enumerate(cvv.split(X_train, y_train)):
    
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    
    model = LGBMClassifier(
**params
    )

    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='auc',
        early_stopping_rounds=150,
        verbose=False
    )
    
    print("itteration: ", model.best_iteration_)
    avg_iter += model.best_iteration_
    oof_preds[val_idx] = model.predict_proba(X_val, num_iteration=model.best_iteration_)[:, 1]
    sub['Churn_Predict'] += model.predict_proba(X_test, num_iteration=model.best_iteration_)[:,1]
    
    fi = pd.DataFrame()
    fi["feature"] = X_train.columns
    fi["importance"] = model.feature_importances_
    fi["fold"] = (i+1)
    
    feature_importances = pd.concat([feature_importances, fi], axis=0)
    
    print("Fold {} AUC: {:.8f}".format(i+1, roc_auc_score(y_val, oof_preds[val_idx])))
    
print('Full AUC score %.8f' % roc_auc_score(y_train, oof_preds)) 
print('Out of sample AUC: ', roc_auc_score(y_test, sub['Churn_Predict']/n_splits))
print("avg iteration: ",(avg_iter/n_splits))
itteration:  177
Fold 1 AUC: 0.86939524
itteration:  314
Fold 2 AUC: 0.85728010
itteration:  230
Fold 3 AUC: 0.84829261
itteration:  495
Fold 4 AUC: 0.84521372
itteration:  377
Fold 5 AUC: 0.85197786
Full AUC score 0.85256117
Out of sample AUC:  0.8286286199168862
avg iteration:  318.6
## The splits seems to work well.
## You can take the average iteration above or use the process below.
## You can retrain below on an ensemble of different seeds to produce the final model
## This will work if ou change parameters to get the new iteration score. 

from lightgbm import cv
from lightgbm import Dataset


def get_score(X, y, usecols, params,seeder,depther):  
     dtrain = Dataset(X[usecols], y)
     params["max_depth"] = depther
     ## These are forced for variable importance
     params["feature_fraction"] = 0.9
     params["bagging_fraction"] = 0.9

     eval =  cv(params,
             dtrain,
             nfold=5,
             stratified=True,
             num_boost_round=20000,
             early_stopping_rounds=160, ## After it stopped how long should go on. 
             verbose_eval=20,
             #verbose_eval=-1,
             seed = seeder,
             show_stdv=True)
     return max(eval['auc-mean']), eval["auc-mean"].index(max(eval["auc-mean"]))
  
seed_dict = {}
for seeder, depther in zip([5, 1, 9 ,12, 20], [6,6,6,6,6]):  ## Here is where you add parameter adaptions
    score, seed_best = get_score(X_train,y_train ,list(X_train.columns), params, seeder,depther)
    seed_dict[seeder] = seed_best

## Lets see if a full dataset prediction validation outperforms.
[20]	cv_agg's auc: 0.843784 + 0.00821279
[40]	cv_agg's auc: 0.846575 + 0.00947819
[60]	cv_agg's auc: 0.848162 + 0.00912504
[80]	cv_agg's auc: 0.849335 + 0.00902077
[100]	cv_agg's auc: 0.850259 + 0.00888681
[120]	cv_agg's auc: 0.851368 + 0.008769
[140]	cv_agg's auc: 0.852178 + 0.00865427
[160]	cv_agg's auc: 0.852431 + 0.00848813
[180]	cv_agg's auc: 0.852657 + 0.00825845
[200]	cv_agg's auc: 0.852751 + 0.00810582
[220]	cv_agg's auc: 0.853085 + 0.00787604
[240]	cv_agg's auc: 0.853163 + 0.00771073
[260]	cv_agg's auc: 0.853307 + 0.00762619
[280]	cv_agg's auc: 0.853409 + 0.00761826
[300]	cv_agg's auc: 0.853207 + 0.00768979
[320]	cv_agg's auc: 0.853189 + 0.00778502
[340]	cv_agg's auc: 0.853249 + 0.00784857
[360]	cv_agg's auc: 0.853228 + 0.00789094
[380]	cv_agg's auc: 0.853026 + 0.00807676
[400]	cv_agg's auc: 0.852888 + 0.00812161
[420]	cv_agg's auc: 0.852822 + 0.0081956
[20]	cv_agg's auc: 0.844505 + 0.0153633
[40]	cv_agg's auc: 0.846352 + 0.0156829
[60]	cv_agg's auc: 0.847766 + 0.0153002
[80]	cv_agg's auc: 0.848652 + 0.0151338
[100]	cv_agg's auc: 0.849505 + 0.0150432
[120]	cv_agg's auc: 0.850304 + 0.0151445
[140]	cv_agg's auc: 0.850746 + 0.0153762
[160]	cv_agg's auc: 0.851053 + 0.0153237
[180]	cv_agg's auc: 0.851171 + 0.0152782
[200]	cv_agg's auc: 0.851375 + 0.015367
[220]	cv_agg's auc: 0.851382 + 0.0151171
[240]	cv_agg's auc: 0.851239 + 0.0151573
[260]	cv_agg's auc: 0.851301 + 0.0152773
[280]	cv_agg's auc: 0.851341 + 0.0151455
[300]	cv_agg's auc: 0.851295 + 0.0150807
[320]	cv_agg's auc: 0.851225 + 0.0149669
[340]	cv_agg's auc: 0.851117 + 0.0148903
[360]	cv_agg's auc: 0.851039 + 0.0148297
[20]	cv_agg's auc: 0.846313 + 0.0114744
[40]	cv_agg's auc: 0.847691 + 0.011438
[60]	cv_agg's auc: 0.848695 + 0.0115246
[80]	cv_agg's auc: 0.849907 + 0.0117519
[100]	cv_agg's auc: 0.850889 + 0.0117958
[120]	cv_agg's auc: 0.85157 + 0.0117482
[140]	cv_agg's auc: 0.852051 + 0.011485
[160]	cv_agg's auc: 0.85236 + 0.0112426
[180]	cv_agg's auc: 0.852455 + 0.011207
[200]	cv_agg's auc: 0.852387 + 0.0112711
[220]	cv_agg's auc: 0.852347 + 0.0113187
[240]	cv_agg's auc: 0.852387 + 0.0113198
[260]	cv_agg's auc: 0.85245 + 0.0111506
[280]	cv_agg's auc: 0.852548 + 0.011055
[300]	cv_agg's auc: 0.852749 + 0.0110681
[320]	cv_agg's auc: 0.852543 + 0.0111498
[340]	cv_agg's auc: 0.852551 + 0.0112671
[360]	cv_agg's auc: 0.852435 + 0.0113013
[380]	cv_agg's auc: 0.852323 + 0.0113198
[400]	cv_agg's auc: 0.852281 + 0.0114486
[420]	cv_agg's auc: 0.852147 + 0.0115187
[440]	cv_agg's auc: 0.852138 + 0.0114774
[20]	cv_agg's auc: 0.844818 + 0.00752731
[40]	cv_agg's auc: 0.8469 + 0.00812337
[60]	cv_agg's auc: 0.848418 + 0.00853306
[80]	cv_agg's auc: 0.849986 + 0.00962428
[100]	cv_agg's auc: 0.850866 + 0.00988992
[120]	cv_agg's auc: 0.851671 + 0.00982287
[140]	cv_agg's auc: 0.852286 + 0.00978718
[160]	cv_agg's auc: 0.85262 + 0.00990508
[180]	cv_agg's auc: 0.852703 + 0.010033
[200]	cv_agg's auc: 0.852879 + 0.0100981
[220]	cv_agg's auc: 0.852956 + 0.0100493
[240]	cv_agg's auc: 0.853011 + 0.00976151
[260]	cv_agg's auc: 0.852986 + 0.00949627
[280]	cv_agg's auc: 0.852963 + 0.0095649
[300]	cv_agg's auc: 0.852869 + 0.00948038
[320]	cv_agg's auc: 0.852862 + 0.00946797
[340]	cv_agg's auc: 0.852779 + 0.00925175
[360]	cv_agg's auc: 0.852641 + 0.00918829
[380]	cv_agg's auc: 0.852628 + 0.00909059
[20]	cv_agg's auc: 0.846441 + 0.0161064
[40]	cv_agg's auc: 0.847855 + 0.0163468
[60]	cv_agg's auc: 0.849537 + 0.0167314
[80]	cv_agg's auc: 0.850486 + 0.0159733
[100]	cv_agg's auc: 0.851169 + 0.0152941
[120]	cv_agg's auc: 0.852089 + 0.0150693
[140]	cv_agg's auc: 0.852565 + 0.0147605
[160]	cv_agg's auc: 0.852684 + 0.0145832
[180]	cv_agg's auc: 0.852811 + 0.0142475
[200]	cv_agg's auc: 0.852997 + 0.0139145
[220]	cv_agg's auc: 0.853024 + 0.0136548
[240]	cv_agg's auc: 0.853132 + 0.0135377
[260]	cv_agg's auc: 0.853081 + 0.0133628
[280]	cv_agg's auc: 0.852963 + 0.0130755
[300]	cv_agg's auc: 0.852855 + 0.012968
[320]	cv_agg's auc: 0.852727 + 0.0127505
[340]	cv_agg's auc: 0.852658 + 0.0125208
[360]	cv_agg's auc: 0.852502 + 0.0123709
[380]	cv_agg's auc: 0.852345 + 0.0121535
[400]	cv_agg's auc: 0.852298 + 0.012038
[420]	cv_agg's auc: 0.852294 + 0.0118129
## This process gives you 15% extra data, but you have no validation set. 
# This can of course be improved by saving the best iterations, which is what I would do
dip  = {}
dip[1] = [6, 5]
dip[2] = [6, 1]
dip[3] = [6, 9]
dip[4] = [6, 12]

dip[5] = [6, 20]

n_splits = 5
cvv = StratifiedKFold(n_splits=n_splits, random_state=42)

sub = df.loc[df.is_test, :][cols_drop]
sub["Churn_Predict"] = 0

feature_importances = pd.DataFrame()

ba= 0 
for i, r in enumerate(range(n_splits)):
    ba = ba + 1
    
    params["num_boost_round"] = seed_dict[dip[ba][1]]  
    params["seed"] = dip[ba][1]
    params["max_depth"] = dip[ba][0] # if you want to customise uncomment
    
    model = LGBMClassifier(
**params
    )

    model.fit(
        X_train,
        y_train,
        eval_metric='auc',
        verbose=False
    )
    
    preds = model.predict_proba(X_test, num_iteration=seed_dict[dip[ba][1]] )[:,1]
    sub['Churn_Predict'] += preds
    
    fi = pd.DataFrame()
    fi["feature"] = X_train.columns
    fi["importance"] = model.feature_importances_
    fi["fold"] = (i+1)
    
    feature_importances = pd.concat([feature_importances, fi], axis=0)
    print("Fold {} AUC: {:.8f}".format(i+1, roc_auc_score(y_test, preds)))
  
/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 1 AUC: 0.82761689


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 2 AUC: 0.82700115


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 3 AUC: 0.82692065


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 4 AUC: 0.82756032


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 5 AUC: 0.82757338
print('Out of sample AUC: ', roc_auc_score(y_test, sub['Churn_Predict']/n_splits))
Out of sample AUC:  0.8246883227083832
feat_imp = feature_importances.groupby("feature").mean().sort_values("importance", ascending=False)

feat_imp.head(10)

# if is_test shows significance it would point to leakage issue 
importance fold
feature
MonthlyCharges 694.6 3.0
TotalChargesConv 593.6 3.0
tenure 571.4 3.0
MultipleLines_No 130.0 3.0
PaymentMethod_Electronic check 123.6 3.0
OnlineSecurity_No 118.4 3.0
PaperlessBilling_No 113.6 3.0
Contract_Month-to-month 112.6 3.0
Contract_One year 108.0 3.0
TechSupport_No 85.8 3.0
## Often Times the cross-validation splits can just be completely ignored.
#180
import numpy as np
l = [x for x in seed_dict.values()]
best_iterations = np.mean(l)

best_iterations
252.8
### best_iterations  394.6 for inst
### 252 -  litle subsample
import lightgbm as lgb
d_train = lgb.Dataset(X_train, label=y_train)
d_test = lgb.Dataset(X_test, label=y_test)
# 1.1 for no leaks, 0.7 used because best model feature importance 
# this slightly outperforms enemble but it leaks
params["num_boost_round"] = int(best_iterations*1.1) 
params["num_boost_round"] = int(best_iterations*1.1)


## Here you have to force feature fractions for predictor analysis

import shap

params["num_boost_round"] = int(best_iterations*1.1)
splits = 5 
f = -1
for seeds in [15,1,6,7,2]:
    f = f +1
    print(seeds)
    params["feature_fraction_seed"] = seeds
    
    params["random_seed"] = seeds + 1
    
    model = lgb.train(params, d_train, verbose_eval=1000)

    shap_values = shap.TreeExplainer(model).shap_values(df.drop(cols_drop, axis=1))

    shap_fram = pd.DataFrame(shap_values[:,:-1], columns=list(X_train.columns))

    shap_new = shap_fram.sum().sort_values().to_frame()

    shap_new.columns = ["SHAP"]

    shap_new["Largest_Effect"] = shap_new["SHAP"].apply(lambda x: "-" if x<0else "+")

    shap_new["SHAP_abs"] = shap_new["SHAP"].abs()

    direction = df.corr()["Churn"].to_frame()
    direction.columns = ["corr"]
    shap_new = pd.merge(shap_new, direction, left_index=True, right_index=True, how="left")

    shap_new["direction"] = shap_new["corr"].apply(lambda x: "-" if x<0else "+")
    shap_new["abs_corr"] = shap_new["corr"].abs()
    
    if f==0:
        shap_fin = shap_new
    else:
        shap_fin = shap_fin + shap_new
15


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


1
6
7
2
from lightgbm import plotting
plotting.create_tree_digraph(model, tree_index=1)

svg

plotting.plot_tree(model)
<matplotlib.axes._subplots.AxesSubplot at 0x1a410be278>

png

model.plot_metric()
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-162-48ad7f4179a8> in <module>()
----> 1 model.plot_metric()


AttributeError: 'Booster' object has no attribute 'plot_metric'
shap_fin.loc[:,["SHAP","SHAP_abs","corr","abs_corr"]] = shap_fin.loc[:,["SHAP","SHAP_abs","corr","abs_corr"]]/5
shap_fin["SHAP_corr_corrrected"] = 0
shap_fin.head()
SHAP Largest_Effect SHAP_abs corr direction abs_corr SHAP_corr_corrrected
Contract_Month-to-month 906.785271 +++++ 906.785271 0.405103 +++++ 0.405103 0
Contract_One year -130.124222 ----- 130.124222 -0.177820 ----- 0.177820 0
Contract_Two year 110.094202 +++++ 110.094202 -0.302253 ----- 0.302253 0
Dependents_No -5.010731 ----- 5.010731 0.164221 +++++ 0.164221 0
DeviceProtection_No -0.006343 --+++ 0.732087 0.252481 +++++ 0.252481 0
b = "Contract_Month-to-month"
das  = df.corr()[df.corr().index.isin(list(shap_fin.index))].abs()
shap_fin.head()
SHAP Largest_Effect SHAP_abs corr direction abs_corr SHAP_corr_corrrected
Contract_Month-to-month 906.785271 +++++ 906.785271 0.405103 +++++ 0.405103 0
Contract_One year -130.124222 ----- 130.124222 -0.177820 ----- 0.177820 0
Contract_Two year 110.094202 +++++ 110.094202 -0.302253 ----- 0.302253 0
Dependents_No -5.010731 ----- 5.010731 0.164221 +++++ 0.164221 0
DeviceProtection_No -0.006343 --+++ 0.732087 0.252481 +++++ 0.252481 0
for b in das.iloc[:,2:].columns:
    cora = df.corr()[df.corr().index.isin(list(shap_fin.index))][b].abs()
    for c, v in zip(cora.index, cora.values):
        shap_fin["SHAP_corr_corrrected"].loc[c] = shap_fin["SHAP_corr_corrrected"].loc[c] + (shap_fin["SHAP_abs"].loc[b]/shap_fin["SHAP_abs"].max())*v
/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/indexing.py:194: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
shap_fin["SHAP_corr"] = shap_fin["SHAP_corr_corrrected"].copy()

shap_fin["SHAP_corr"] = shap_fin["SHAP_corr"]/(shap_fin["SHAP_corr"].max()/shap_fin["SHAP_abs"].max())

shap_fin["SHAP_corr_corrrected"] = shap_fin["SHAP_corr_corrrected"]/(shap_fin["SHAP_corr_corrrected"].max()/shap_fin["SHAP_abs"].max())

shap_fin["SHAP_corr_corrrected"] = shap_fin["SHAP_corr_corrrected"] + shap_fin["SHAP_abs"]

shap_fin["SHAP_corr_corrrected"] = shap_fin["SHAP_corr_corrrected"]/(shap_fin["SHAP_corr_corrrected"].max()/shap_fin["SHAP_abs"].max())
shap_fin.sort_values('SHAP_corr', ascending=False).head(10)
SHAP Largest_Effect SHAP_abs corr direction abs_corr SHAP_corr_corrrected SHAP_corr
Contract_Month-to-month 906.785271 +++++ 906.785271 0.405103 +++++ 0.405103 906.785271 906.785271
tenure -308.250063 ----- 308.250063 -0.352229 ----- 0.352229 516.718340 725.186618
Contract_Two year 110.094202 +++++ 110.094202 -0.302253 ----- 0.302253 386.615172 663.136143
TotalChargesConv -58.927057 ----- 58.927057 -0.199428 ----- 0.199428 311.159987 563.392916
TechSupport_No -170.060887 ----- 170.060887 0.337281 +++++ 0.337281 349.015912 527.970937
Contract_One year -130.124222 ----- 130.124222 -0.177820 ----- 0.177820 316.545191 502.966159
OnlineSecurity_No -68.364262 ----- 68.364262 0.342637 +++++ 0.342637 270.712949 473.061636
DeviceProtection_No -0.006343 --+++ 0.732087 0.252481 +++++ 0.252481 215.028381 429.324675
PaymentMethod_Electronic check -88.616541 ----- 88.616541 0.301919 +++++ 0.301919 244.125871 399.635200
OnlineBackup_No -14.979056 ----- 14.979056 0.268005 +++++ 0.268005 204.751625 394.524194
dans = pd.read_csv("amalgamated.csv")
dans.sort_values('SHAP_corr', ascending=False).head(10)
Unnamed: 0 SHAP Largest_Effect SHAP_abs corr direction abs_corr SHAP_corr_corrrected SHAP_corr
0 Contract_Month-to-month 89.722288 +++++ 89.722288 0.405103 +++++ 0.405103 89.722288 89.722288
1 tenure -32.415159 ----- 32.415159 -0.352229 ----- 0.352229 55.615119 78.815078
2 Contract_Two year 56.824406 +++++ 56.824406 -0.302253 ----- 0.302253 67.805145 78.785884
3 TotalChargesConv -2.957160 -+--+ 11.577356 -0.199428 ----- 0.199428 41.444353 71.311350
4 TechSupport_No -3.145848 -+++- 10.992171 0.337281 +++++ 0.337281 35.971348 60.950524
5 OnlineSecurity_No 6.738696 ++++- 12.082303 0.342637 +++++ 0.342637 35.709987 59.337670
6 InternetService_No 4.387260 ++-+- 8.929218 -0.227890 ----- 0.227890 32.621494 56.313771
7 InternetService_Fiber optic 8.872516 ++-+- 13.110938 0.308020 +++++ 0.308020 32.963081 52.815223
8 DeviceProtection_No -3.105107 -+--- 4.129762 0.252481 +++++ 0.252481 27.780575 51.431389
9 OnlineBackup_No -14.044804 ----+ 15.399742 0.268005 +++++ 0.268005 33.136412 50.873081
### There is a way to prove that it works - using all. 
#shap_fin.sort_values('SHAP_corr', ascending=False).to_csv("amalgamated.csv")
"SHAP values are sensitive to high correlations among different features. When features are correlated, their impact on the model score can be split among them in an infinite number of ways. This means that the SHAP values will be lower than if all but one of the correlated feature(s) had been removed from the model. "

My recommendation based on the SHAP_corr (adjusted shapely correlation values)

  1. Only target customers who don’t have internet or are less likley to get internet
  2. Those who have paid more in the past are locked in by sunk cost and stay on for longer
  3. Potentially lower the amount to get more clients
  4. The longer a client has stayed on, the longer they are expected to stay on, the lindy effect, therefore get clients to stay on for longer, even if you trade at a loss at first.
  5. Sign customers on for yearly and not month to month agreements
  6. If you do target internet enabled customers, focus on those who do not have fiber.
  7. Online Security and Tech support are markers for ineptness and clients who require this would less likley churn.
  8. If they stream movies and tv they are more likley to churn, so these are two important questions to ask, I believe these proxy for competence and age i.e. liklihood to churn.

Here we try out the global feature importance calcuations that come with XGBoost. Note that they all contradict each other, which motivates the use of SHAP values since they come with consistency gaurentees (meaning they will order the features correctly).

shap_fin.sort_values("SHAP_corr_corrrected",ascending=False).head(20)
SHAP Largest_Effect SHAP_abs corr direction abs_corr SHAP_corr_corrrected SHAP_corr
Contract_Month-to-month 89.722288 +++++ 89.722288 0.405103 +++++ 0.405103 89.722288 77.623205
Contract_Two year 56.824406 +++++ 56.824406 -0.302253 ----- 0.302253 66.017732 66.308543
tenure -32.415159 ----- 32.415159 -0.352229 ----- 0.352229 60.448009 80.329423
MonthlyCharges -23.233877 ---+- 23.831915 0.193356 +++++ 0.193356 59.466914 87.082779
TotalChargesConv -2.957160 -+--+ 11.577356 -0.199428 ----- 0.199428 53.569346 88.337491
InternetService_No 4.387260 ++-+- 8.929218 -0.227890 ----- 0.227890 52.892006 89.722288
InternetService_Fiber optic 8.872516 ++-+- 13.110938 0.308020 +++++ 0.308020 46.719933 74.028728
OnlineSecurity_No 6.738696 ++++- 12.082303 0.342637 +++++ 0.342637 44.640472 71.178857
TechSupport_No -3.145848 -+++- 10.992171 0.337281 +++++ 0.337281 43.851507 70.797451
OnlineBackup_No -14.044804 ----+ 15.399742 0.268005 +++++ 0.268005 40.939261 60.958105
StreamingTV_Yes -5.488447 --+-- 8.484568 0.063228 +++++ 0.063228 39.678901 65.522520
PaymentMethod_Electronic check -14.921700 ----- 14.921700 0.301919 +++++ 0.301919 39.614285 58.964869
StreamingMovies_Yes -2.688222 +---- 4.784227 0.061382 +++++ 0.061382 38.097282 66.272904
DeviceProtection_No -3.105107 -+--- 4.129762 0.252481 +++++ 0.252481 36.274619 63.527831
DeviceProtection_Yes 0.877864 ++-+- 3.441347 -0.066160 ----- 0.066160 35.175298 62.165847
TechSupport_Yes -3.379816 --+-+ 7.888294 -0.164674 ----- 0.164674 34.985053 57.364064
MultipleLines_Yes -6.962591 ----- 6.962591 0.040102 +++++ 0.040102 34.215635 56.854687
OnlineSecurity_Yes 3.851610 -++++ 7.781800 -0.171226 ----- 0.171226 33.883269 55.415566
OnlineBackup_Yes -0.167503 ++++- 4.217049 -0.082255 ----- 0.082255 33.378034 58.037978
StreamingMovies_No 4.124048 +++++ 4.124048 0.130845 +++++ 0.130845 32.164288 55.867162
## I would argue that SHAP_corr are more causal in manner.
## Other shit is correlated with SHAP_corr, so if it falls away not too much issues.
shap_fin.sort_values("SHAP_corr",ascending=False).head(20)
## New Number I want to TRY
shap_fin = shap_fin.sort_values("SHAP_abs",ascending=False)
shap_new.to_csv("imp1.csv")
preds = model.predict(X_test, num_iteration=model.best_iteration)

print('Out of sample AUC: ', roc_auc_score(y_test, preds))
# To be truthfull the prediction excerse is not that interesting, what is more interesting
# fitting the data to the response, i.e. training the model and looking at the feature
# interactions it proposes. Feature importance is only interesting to the extent it is
# different to correlation plot, in that regard, higher dimensional interactions become
# important
## To get direction use general df
## df.corr()["Churn"]
shap.summary_plot(shap_values, df.drop(cols_drop, axis=1))

png




shap.TreeExplainer(xgb).shap_interaction_values(df.drop(cols_drop, axis=1))

---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-147-092cf5c20b25> in <module>()
      2 
      3 
----> 4 shap.TreeExplainer(xgb).shap_interaction_values(df.drop(cols_drop, axis=1))


NameError: name 'xgb' is not defined
shap_fram = pd.DataFrame(shap_values[:,:-1], columns=list(X_train.columns))

shap_new = shap_fram.sum().sort_values().to_frame()

shap_new.columns = ["SHAP"]

shap_new["Largest_Effect"] = shap_new["SHAP"].apply(lambda x: "-" if x<0else "+")

shap_new["SHAP_abs"] = shap_new["SHAP"].abs()
df.drop(cols_drop, axis=1).shape
(7043, 33)
print('Out of sample AUC: ', roc_auc_score(y_test, preds))
Out of sample AUC:  0.8275733774286896
## Here at this point, you migth want to investigate individual mistakes from the validation
## set
## You can simply focus on train on validation instead of train on test.
## However I doubt any leakage would occur here as there is not much to be gained.

sub["Churn_Predict"] = sub['Churn_Predict']/n_splits
sub["Churn"] = y_test

sub["Error"] = sub["Churn"] - sub["Churn_Predict"]

## Worse false positive
sub.sort_values("Error").head(2)
customerID Churn is_test Churn_Predict Error
5140 7577-SWIFR 0 True 0.886304 -0.886304
5307 1941-HOSAM 0 True 0.880972 -0.880972

[5140, 5307, 3735, 532, 1550, 3969, 5071, 2493, 85, 1724, 3582, 684, 727, 2453, 6362, 5839, 1022, 4173, 3822, 5144, 5846, 4150, 1482, 5755, 1446, 3337, 2998, 4124, 1719, 2530, 1297, 2173, 4094, 1282, 4648, 4473, 5658, 7029, 1829, 2364, 6345, 5716, 5187, 1628, 2398, 702, 5907, 3469, 4478, 2652, 6265, 5993, 4641, 376, 2046, 5707, 4335, 491, 5189, 2866, 2966, 4016, 1079, 4940, 3048, 2236, 5748, 6669, 390, 5631, 3554, 3221, 2171, 5119, 5699, 5188, 3545, 1160, 4986, 5514, 3791, 2343, 3481, 50, 4514, 2385, 5298, 2065, 2107, 6109, 3798, 1833, 4338, 5656, 5604, 4869, 3492, 6668, 1811, 1972]

## All worse false positives:
## Contract Month to Months seems to be the culprit, it would be good idea to look at its distribution 
## before and after the 

sub
shap.force_plot(shap_values[list(sub.sort_values("Error").head(100).index),:], df.drop(cols_drop, axis=1).iloc[list(sub.sort_values("Error").head(100).index),:], link="logit")
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security.
## The distributions are not different, so this is not a concern 
X_train["Contract_Month-to-month"].sum()/len(X_train)
0.550175908862456
X_test["Contract_Month-to-month"].sum()/len(X_test)
0.5502793296089385
## Different model giving different amounts, run initial model for shap values.
shap.initjs()

shap.force_plot(shap_values[5354,:], df.drop(cols_drop, axis=1).iloc[5354,:], link="logit")
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security.
shap.force_plot(shap_values[345,:], df.drop(cols_drop, axis=1).iloc[345,:], link="logit")
sub["Error_Abs"] = sub["Error"].abs()
## Best True Negative
sub[sub["Churn"]==0].sort_values("Error_Abs").head(2)
shap.force_plot(shap_values[5804,:], df.drop(cols_drop, axis=1).iloc[5804,:], link="logit")
shap_ordered = shap_new.sort_values("SHAP_abs",ascending=False)
# Remember this direction says the average effect on the overall outcome.
# The direction is generally not too informative as it is not the same
# As the direction of the variable relationship with response, for that
# you would have to look at the chart. 
shap_ordered[shap_ordered["Direction"]=="+"].head()
## All this says is that low tenure is more informative than high tenure. 
shap_ordered[shap_ordered["Direction"]=="-"].head()
df["CurrentRateTotal"] = df["tenure"] * df["TotalChargesConv"]
df["RateDiff"] = df["CurrentRateTotal"] - df["TotalChargesConv"] 
df[df.index.isin([4484, 5804])]
## Worse false negative
sub.sort_values("Error").tail(2)
shap.force_plot(shap_values[4272,:], df.drop(cols_drop, axis=1).iloc[4272,:], link="logit")
## Best True Positive
sub[sub["Churn"]==1].sort_values("Error_Abs").head(2)
sub_full = pd.merge(sub,df.drop(cols_drop, axis=1),left_index=True, right_index=True, how="left" )
sub_full.shape
sub_full.head()

Arbitary Decision Rules:

Churn

  1. Tenure < 20
  2. Contract_Month-to-month ==1
  3. Monthly_Charges > 50

No Churn

  1. Tenure > 20
  2. Contract_Month-to-month ==0 {Implies Yearly or Two-yearly contracts}
  3. Monthly_Charges < 50

The predicted and actual outcomes probabilities are very similar. The prediction model seems to be working well.

# Churn - Predict
sub_full[(sub_full["tenure"]<20) &(sub_full["Contract_Month-to-month"]==1)&(sub_full["MonthlyCharges"]>50) ]["Churn_Predict"].mean()
0.5948143191306948
# Churn - Predict
sub_full[(sub_full["tenure"]<20) &(sub_full["Contract_Month-to-month"]==1)&(sub_full["MonthlyCharges"]>50) ]["Churn"].mean()
0.6164383561643836
# No Chrun - Predict
sub_full[(sub_full["tenure"]>20) &(sub_full["Contract_Month-to-month"]==0)&(sub_full["MonthlyCharges"]<50) ]["Churn_Predict"].mean()
0.03017844932166786
# No Churn - Actual
sub_full[(sub_full["tenure"]>20) &(sub_full["Contract_Month-to-month"]==0)&(sub_full["MonthlyCharges"]<50) ]["Churn"].mean()
0.013986013986013986
df.head()
customerID Churn is_test tenure MonthlyCharges TotalChargesConv gender_Female gender_Male SeniorCitizen_0 SeniorCitizen_1 ... Contract_One year Contract_Two year PaperlessBilling_No PaperlessBilling_Yes PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check CurrentRateTotal RateDiff
0 7590-VHVEG 0 False 1 29.85 29.85 1 0 1 0 ... 0 0 0 1 0 0 1 0 29.85 0.00
1 5575-GNVDE 0 False 34 56.95 1889.50 0 1 1 0 ... 1 0 1 0 0 0 0 1 64243.00 62353.50
2 3668-QPYBK 1 False 2 53.85 108.15 0 1 1 0 ... 0 0 0 1 0 0 0 1 216.30 108.15
3 7795-CFOCW 0 True 45 42.30 1840.75 0 1 1 0 ... 1 0 1 0 1 0 0 0 82833.75 80993.00
4 9237-HQITU 1 False 2 70.70 151.65 1 0 1 0 ... 0 0 0 1 0 0 1 0 303.30 151.65

5 rows × 51 columns

## New Features - Lets See Effect



X_train = df[df["is_test"]==False].drop(cols_drop,axis=1)
y_train = df[df["is_test"]==False]["Churn"]

X_test = df.loc[df.is_test, :].drop(cols_drop, axis=1)
y_test = df.loc[df.is_test, :]["Churn"]

## This process gives you 15% extra data, but you have no validation set. 
# This can of course be improved by saving the best iterations, which is what I would do
dip  = {}
dip[1] = [6, 5]
dip[2] = [6, 1]
dip[3] = [6, 9]
dip[4] = [6, 12]
dip[5] = [6, 20]

n_splits = 5
cvv = StratifiedKFold(n_splits=n_splits, random_state=42)

sub = df.loc[df.is_test, :][cols_drop]
sub["Churn_Predict"] = 0

feature_importances = pd.DataFrame()

ba= 0 
for i, r in enumerate(range(n_splits)):
    ba = ba + 1
    
    params["num_boost_round"] = seed_dict[dip[ba][1]]  
    params["seed"] = dip[ba][1]
    params["max_depth"] = dip[ba][0] # if you want to customise uncomment
    
    model = LGBMClassifier(
**params
    )

    model.fit(
        X_train,
        y_train,
        eval_metric='auc',
        verbose=False
    )
    
    preds = model.predict_proba(X_test, num_iteration=seed_dict[dip[ba][1]] )[:,1]
    sub['Churn_Predict'] += preds
    
    fi = pd.DataFrame()
    fi["feature"] = X_train.columns
    fi["importance"] = model.feature_importances_
    fi["fold"] = (i+1)
    
    feature_importances = pd.concat([feature_importances, fi], axis=0)
    print("Fold {} AUC: {:.8f}".format(i+1, roc_auc_score(y_test, preds)))
  
print('Out of sample AUC: ', roc_auc_score(y_test, sub['Churn_Predict']/n_splits))
/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 1 AUC: 0.86425604


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 2 AUC: 0.86441416


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 3 AUC: 0.86525830


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 4 AUC: 0.86397384


/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Fold 5 AUC: 0.86469149
Out of sample AUC:  0.8647741979429193
## It has led to worse performance 
Out of sample AUC:  0.8648277170687089
#Start here for XGBoost two interactions:
import xgboost

d_train = xgboost.DMatrix(X_train, label=y_train)
d_test = xgboost.DMatrix(X_test, label=y_test)
    
params = {
    "eta": 0.01,
    "objective": "binary:logistic",
    "colsample_bytree":0.2,
    "colsample_bylevel":0.2,
    "subsample": 0.5,
    "base_score": np.mean(y_train),
    "eval_metric": "logloss"
}
xgb = xgboost.train(params, d_train, 15000, evals = [(d_test, "test")], verbose_eval=100, early_stopping_rounds=200)

[0]	test-logloss:0.583503
Will train until test-logloss hasn't improved in 200 rounds.
[100]	test-logloss:0.511794
[200]	test-logloss:0.473851
[300]	test-logloss:0.451492
[400]	test-logloss:0.438492
[500]	test-logloss:0.430542
[600]	test-logloss:0.424635
[700]	test-logloss:0.420811
[800]	test-logloss:0.418328
[900]	test-logloss:0.415762
[1000]	test-logloss:0.414202
[1100]	test-logloss:0.413115
[1200]	test-logloss:0.412331
[1300]	test-logloss:0.411409
[1400]	test-logloss:0.410732
[1500]	test-logloss:0.410264
[1600]	test-logloss:0.409837
[1700]	test-logloss:0.409436
[1800]	test-logloss:0.408982
[1900]	test-logloss:0.408614
[2000]	test-logloss:0.408551
[2100]	test-logloss:0.408201
[2200]	test-logloss:0.408031
[2300]	test-logloss:0.40796
[2400]	test-logloss:0.407775
[2500]	test-logloss:0.407447
[2600]	test-logloss:0.407408
[2700]	test-logloss:0.407488
Stopping. Best iteration:
[2576]	test-logloss:0.407333
import _pickle as pickle
#pickle.dump(shap_inter, open("shap_inter.p", "wb"))  # save it into a file 
shap_inter = pickle.load(open("shap_inter.p", "rb")) # open pickle

import shap shap_inter = shap.TreeExplainer(xgb).shap_interaction_values(df.drop(cols_drop, axis=1))

shap_inter[1,1,:]
array([-0.00034301, -0.11787798,  0.00363676,  0.00428383,  0.00157433,
       -0.00140101, -0.0018202 ,  0.000298  ,  0.0005892 ,  0.0013317 ,
        0.00247201, -0.00043364, -0.00027109,  0.00078189,  0.00436718,
       -0.00215778,  0.00173602,  0.00467855,  0.00454791, -0.0008688 ,
       -0.00289185,  0.00525794,  0.00300427,  0.00142787, -0.00099375,
        0.00400078,  0.0022267 , -0.00055282,  0.00487384,  0.00060737,
        0.00056997, -0.00150263,  0.00884111,  0.        ], dtype=float32)
import shap

shap.summary_plot(shap_inter, df.drop(cols_drop, axis=1))

png

## This is a very important interaction that can't be over-emphesised. 

Here you can see something very interesting, only if you sign or regularly sign two-year contracts does your tenure matter. Thus if you have already signed two year contracts, you are more likley to sign contracts in the future. On the other hand if you have been tenured for very long withu

df.head()
customerID Churn is_test tenure MonthlyCharges TotalChargesConv gender_Female SeniorCitizen_0 Partner_No Dependents_No ... StreamingMovies_No StreamingMovies_Yes Contract_Month-to-month Contract_One year Contract_Two year PaperlessBilling_No PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 7590-VHVEG 0 False 1 29.85 29.85 1 1 0 1 ... 1 0 1 0 0 0 0 0 1 0
1 5575-GNVDE 0 False 34 56.95 1889.50 0 1 1 1 ... 1 0 0 1 0 1 0 0 0 1
2 3668-QPYBK 1 False 2 53.85 108.15 0 1 1 1 ... 1 0 1 0 0 0 0 0 0 1
3 7795-CFOCW 0 False 45 42.30 1840.75 0 1 1 1 ... 1 0 0 1 0 1 1 0 0 0
4 9237-HQITU 1 False 2 70.70 151.65 1 1 1 1 ... 1 0 1 0 0 0 0 0 1 0

5 rows × 36 columns

Females are less willing to change providers if they have been with the company for a long time, males are more likley to churn and do not show as much loyalty as females.

shap.dependence_plot(
    ("tenure", "gender_Female"),
    shap_inter, df.drop(cols_drop, axis=1),
    display_features=df.drop(cols_drop, axis=1)
)

png

Young people are more likley to churn if they have been with the company for a long time as opposed to old people

shap.dependence_plot(
    ("tenure", "SeniorCitizen_0"),
    shap_inter, df.drop(cols_drop, axis=1),
    display_features=df.drop(cols_drop, axis=1)
)

png

Customers are a lot less likley to churn if they have spent exhorborant amounts on fees and have been with the company for a long time. It is likley that these companies have gained favour due to be long term customers, so this might have less to do with their total charges than their current rates.

shap.dependence_plot(
    ("tenure", "TotalChargesConv"),
    shap_inter, df.drop(cols_drop, axis=1),
    display_features=df.drop(cols_drop, axis=1)
)

png

df.head()
customerID Churn is_test tenure MonthlyCharges TotalChargesConv gender_Female SeniorCitizen_0 Partner_No Dependents_No ... StreamingMovies_No StreamingMovies_Yes Contract_Month-to-month Contract_One year Contract_Two year PaperlessBilling_No PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 7590-VHVEG 0 False 1 29.85 29.85 1 1 0 1 ... 1 0 1 0 0 0 0 0 1 0
1 5575-GNVDE 0 False 34 56.95 1889.50 0 1 1 1 ... 1 0 0 1 0 1 0 0 0 1
2 3668-QPYBK 1 False 2 53.85 108.15 0 1 1 1 ... 1 0 1 0 0 0 0 0 0 1
3 7795-CFOCW 0 False 45 42.30 1840.75 0 1 1 1 ... 1 0 0 1 0 1 1 0 0 0
4 9237-HQITU 1 False 2 70.70 151.65 1 1 1 1 ... 1 0 1 0 0 0 0 0 1 0

5 rows × 36 columns

If your total charges are high but you are on month to month, you are less likley to jump ship, because you do not feel constrained. So this might be a ptential method to get more total slaes, convert to monthly

shap.dependence_plot(
    ("TotalChargesConv","Contract_Month-to-month"),
    shap_inter, df.drop(cols_drop, axis=1),
    display_features=df.drop(cols_drop, axis=1)
)

png

This can help you understand certain price points. Similar to shelley points in game theory. There is certain values that customers just flat out do not like. When the price increases from 40 to 50 customers are unhappy, it migth be pshychological or the company is infact losing out at that price point against the market. From price point 60 - 90 the firm seems to be doing reasonably well in retaining customer etc.

shap.dependence_plot(
    ("MonthlyCharges","Contract_Two year"),
    shap_inter, df.drop(cols_drop, axis=1),
    display_features=df.drop(cols_drop, axis=1)
)

png

These partial dependence plots and correlation plots will 99% of the tme show the same relationship. The only issue is that with the corerlation plot you have to create groupby’s

This chart shows something concerning,

shap.dependence_plot(
    ("TotalChargesConv","MonthlyCharges"),
    shap_inter, df.drop(cols_drop, axis=1),
    display_features=df.drop(cols_drop, axis=1)
)

png

cols_drop

[‘customerID’, ‘Churn’, ‘is_test’]


[‘Contract_Month-to-month’, ‘tenure’, ‘Contract_Two year’, ‘TotalChargesConv’, ‘TechSupport_No’, ‘Contract_One year’, ‘OnlineSecurity_No’, ‘DeviceProtection_No’, ‘PaymentMethod_Electronic check’, ‘OnlineBackup_No’]

kt

[‘Contract_Month-to-month’, ‘tenure’, ‘Contract_Two year’, ‘TotalChargesConv’, ‘TechSupport_No’, ‘Contract_One year’, ‘OnlineSecurity_No’, ‘DeviceProtection_No’, ‘PaymentMethod_Electronic check’, ‘OnlineBackup_No’, ‘Churn’]

df[kt].head()
Contract_Month-to-month tenure Contract_Two year TotalChargesConv TechSupport_No Contract_One year OnlineSecurity_No DeviceProtection_No PaymentMethod_Electronic check OnlineBackup_No Churn
0 1 1 0 29.85 1 0 1 1 1 0 0
1 0 34 0 1889.50 1 1 0 0 0 1 0
2 1 2 0 108.15 1 0 0 1 0 0 1
3 0 45 0 1840.75 0 1 0 0 0 1 0
4 1 2 0 151.65 1 0 1 1 1 1 1
df["Churn"].value_counts()
0    5174
1    1869
Name: Churn, dtype: int64
import seaborn as sns

kt = list(shap_fin.sort_values("SHAP_corr",ascending=False).index[:4])
kt = []
kt.append("Churn")
kt.append("MonthlyCharges")
kt.append("TotalChargesConv")
g = sns.pairplot(df[kt], hue='Churn', palette = 'seismic',size=3,diag_kind = 'kde',diag_kws=dict(shade=True),plot_kws=dict(s=10) )
g.set(xticklabels=[])
/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py:494: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X,a,b,gridsize)/(delta*nobs)
/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/statsmodels/nonparametric/kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)





<seaborn.axisgrid.PairGrid at 0x1a3feeda58>

png

## You are more likely to churn with high monthly charges and low total cost.
## You are not as likley to churn with high monthly charges and high monthly costs.
## See it shows the same relationship
## What it does unfortunately not show is the probability of being this and that as it is single points. 
## For that you might have to do a nearest neighbour transofmration or a strategised groupby.
## and that is simply what the tree provides, see it as a groupby function, in out example it is just not 
## entirely overfitted, so some information might be lost, but not necessarily if generalisability is important. 
## With scatter plots you have to do it with actual or expected values. 
## I have neverv seen expected value plots because to create it you would have to create some groupbys or knn's
## so inface your LightGBM is the expected value amounts - nothing can actually get as close as it, hence why it
## is used. And the expected value plot gives you a slightly better picture than colored scatter plots for categorical.
## However, for regression plots, it is going to show a very similar plot. - You can at a a later stage try that out 
## for regression plots. So I would recommend using the shap dependence plots when it comes to classification
## and maybe switching  to scater plots when it comes to regression. There is also an argument to just use the
## the scatter plots as the relationship might become moot as a result of another correlated vairable.