The PyMC3 workflow

PyMC3 aims for intuitive and readable, yet powerful syntax that reflects how statisticians describe models. The modeling process generally follows these five steps:

  1. Encode a probability model by defining the following:
    1. The prior distributions that quantify knowledge and uncertainty about latent variables
    2. The likelihood function that conditions the parameters on observed data
  1. Analyze the posterior using one of the options:
    1. Obtain a point estimate using MAP inference
    2. Sample from the posterior using MCMC methods
  1. Approximate the posterior using variational Bayes.
  2. Check the model using various diagnostic tools.
  3. Generate predictions.

The resulting model can be used for inference to gain detailed insights into parameter values as well as to predict outcomes for new data points.

Logistic regression with PyMC3

Logistic regression estimates a linear relationship between a set of features and a binary outcome, mediated by a sigmoid function to ensure the model produces probabilities. The frequentist approach resulted in point estimates for the parameters that measure the influence of each feature on the probability that a data point belongs to the positive class, with confidence intervals based on assumptions about the parameter distribution.

In contrast, Bayesian logistic regression estimates the posterior distribution over the parameters itself. The posterior allows for more robust estimates of what is called a Bayesian credible interval for each parameter with the benefit of more transparency about the model’s uncertainty.

Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
from pathlib import Path
import pickle
from collections import OrderedDict
import pandas as pd
import numpy as np
from scipy import stats

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score, confusion_matrix, accuracy_score, f1_score, 
                             precision_recall_curve) 
import theano
import pymc3 as pm
from pymc3.variational.callbacks import CheckParametersConvergence
import statsmodels.formula.api as smf

import arviz as az
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.lines as mlines

import seaborn as sns
from IPython.display import HTML
In [3]:
sns.set(style="white", color_codes=True)
pd.set_option('display.max_columns', None)
In [4]:
data_path = Path('data')
fig_path = Path('figures')
model_path = Path('models')
for p in [data_path, fig_path, model_path]:
    if not p.exists():
        p.mkdir()

The Data

The goal of this dataset is to create a binary classification model that predicts whether or not over $11,000$ customers will subscribe to a term deposit after a marketing campaign the bank has performed, based on $17$ predictor variables. The target variable is given as deposit and takes on a value of $1$ if the customer has subscribed and $0$ otherwise.

This is a slightly imbalanced class problem because there are more customers that did not subscribe the term deposit than did.

Load from Disk

In [5]:
data = pd.read_csv('bank.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
age          11162 non-null int64
job          11162 non-null object
marital      11162 non-null object
education    11162 non-null object
default      11162 non-null object
balance      11162 non-null int64
housing      11162 non-null object
loan         11162 non-null object
contact      11162 non-null object
day          11162 non-null int64
month        11162 non-null object
duration     11162 non-null int64
campaign     11162 non-null int64
pdays        11162 non-null int64
previous     11162 non-null int64
poutcome     11162 non-null object
deposit      11162 non-null object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB
In [6]:
data.head()
Out[6]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 59 admin. married secondary no 2343 yes no unknown 5 may 1042 1 -1 0 unknown yes
1 56 admin. married secondary no 45 no no unknown 5 may 1467 1 -1 0 unknown yes
2 41 technician married secondary no 1270 yes no unknown 5 may 1389 1 -1 0 unknown yes
3 55 services married secondary no 2476 yes no unknown 5 may 579 1 -1 0 unknown yes
4 54 admin. married tertiary no 184 no no unknown 5 may 673 2 -1 0 unknown yes
In [7]:
unique = data.nunique()
unique
Out[7]:
age            76
job            12
marital         3
education       4
default         2
balance      3805
housing         2
loan            2
contact         3
day            31
month          12
duration     1428
campaign       36
pdays         472
previous       34
poutcome        4
deposit         2
dtype: int64
In [8]:
mask = data.nunique().values < 31
cat_features = list(unique[mask].index)
cat_features
Out[8]:
['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'poutcome',
 'deposit']

Quick EDA

In [9]:
total = len(data)
plt.figure(figsize=(7,5))
g = sns.countplot(x='deposit', data=data)
g.set_ylabel('Count', fontsize=14)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 1.5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=14, fontweight='bold')
plt.margins(y=0.1)
plt.show()
In [10]:
fig, axs = plt.subplots(ncols=2, nrows=0, figsize=(13, 45))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)

for i, feature in enumerate(list(data[cat_features]), 1):
    if(feature=='deposit'):
        break
        
    plt.subplot(len(cat_features), 3, i)
    sns.countplot(y=feature, hue='deposit', data=data)
    plt.title(f'{feature}', size=15, fontsize=12)    
    plt.legend(loc='best', prop={'size': 8})
    
plt.tight_layout()       
plt.show()
In [11]:
plt.figure(figsize=(8,4))
sns.countplot(x='day', hue='deposit', data=data);
In [12]:
num_features = [f for f in data.columns if f not in cat_features]
num_features.remove('day')
num_features
Out[12]:
['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
In [13]:
basetable0 = pd.DataFrame(index=data.index)
basetable0['deposit'] = data['deposit']
deposit_dict = {'no': 0, 'yes': 1}
basetable0 = basetable0.replace({'deposit': deposit_dict})

for variable in num_features:
    if len(data.groupby(variable)) > 31:
        new_variable = "disc_" + variable
        basetable0[new_variable] = pd.qcut(data[variable], 5, duplicates='drop')
In [14]:
basetable0.head()
Out[14]:
deposit disc_age disc_balance disc_duration disc_campaign disc_pdays disc_previous
0 1 (52.0, 95.0] (2223.0, 81204.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
1 1 (52.0, 95.0] (-6847.001, 62.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
2 1 (36.0, 42.0] (862.6, 2223.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
3 1 (52.0, 95.0] (2223.0, 81204.0] (323.0, 585.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
4 1 (52.0, 95.0] (62.0, 337.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
In [15]:
def create_pig_table(basetable, target, variable):
    
    groups = basetable[[target,variable]].groupby(variable)
    pig_table = groups[target].agg(Incidence = np.mean, Size = np.size).reset_index()
    
    return pig_table

def plot_pig(pig_table, variable):
    
    plt.figure(figsize=(10, 4))
    plt.ylabel("Size", rotation=0, rotation_mode="anchor", ha="right")
    pig_table["Size"].plot(kind="bar", width=0.5, color="lightgray", edgecolor="none") 
    pig_table["Incidence"].plot(secondary_y=True)
    
    plt.xticks(np.arange(len(pig_table)), pig_table[variable])
    plt.xlim([-0.5, len(pig_table) - 0.5])
    plt.ylabel("Incidence", rotation=0, rotation_mode="anchor", ha="left")
    plt.title(f'{variable}')
    
    plt.show()
In [16]:
variables = list(basetable0.drop('deposit', axis=1).columns)

pig_tables = {}
for variable in variables: 
    pig_table = create_pig_table(basetable0, 'deposit', variable)
    pig_tables[variable] = pig_table
In [17]:
for variable in variables: 
    pig_table = create_pig_table(basetable0, "deposit", variable)
    plot_pig(pig_table, variable)

Data Preprocessing

In [18]:
job_dict = {'management': 12, 'admin.': 11, 'entrepreneur': 10, 'technician': 9, 'services': 8, 'self-employed': 7,
            'blue-collar': 6, 'retired': 5, 'housemaid': 4, 'unemployed': 3, 'unknown': 2, 'student': 1}
data = data.replace({'job': job_dict})

marital_dict = {'married': 4, 'single': 3, 'divorced': 2, 'unknown': 1}
data = data.replace({'marital': marital_dict})
data['marital'] = data['marital'].astype(int)

edu_dict = {'tertiary': 4, 'secondary': 3, 'primary': 2, 'unknown': 1}
data = data.replace({'education': edu_dict})

default_dict = {'no': 1, 'yes': 0}
data = data.replace({'default': default_dict})

housing_dict = {'no': 1, 'yes': 0}
data = data.replace({'housing': housing_dict})

loan_dict = {'no': 1, 'yes': 0}
data = data.replace({'loan': loan_dict})

contact_dict = {'cellular': 3, 'telephone': 2, 'unknown': 1}
data = data.replace({'contact': contact_dict})

poutcome_dict = {'success': 4, 'other': 3, 'unknown': 2, 'failure': 1}
data = data.replace({'poutcome': poutcome_dict})

month_dict = {'jan': 1, 'feb': 2, 'aug': 8, 'nov': 11, 'jun': 6, 'apr': 4, 'jul': 7, 'may': 5, 'oct': 10, 
              'mar': 3, 'sep': 9, 'dec': 12}
data = data.replace({'month': month_dict})

pdays_dict = {-1: 0}
data = data.replace({'pdays': pdays_dict})
In [19]:
data.head()
Out[19]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 59 11 4 3 1 2343 0 1 1 5 5 1042 1 0 0 2 yes
1 56 11 4 3 1 45 1 1 1 5 5 1467 1 0 0 2 yes
2 41 9 4 3 1 1270 0 1 1 5 5 1389 1 0 0 2 yes
3 55 8 4 3 1 2476 0 1 1 5 5 579 1 0 0 2 yes
4 54 11 4 4 1 184 1 1 1 5 5 673 2 0 0 2 yes
In [20]:
data.dtypes
Out[20]:
age           int64
job           int64
marital       int64
education     int64
default       int64
balance       int64
housing       int64
loan          int64
contact       int64
day           int64
month         int64
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome      int64
deposit      object
dtype: object
In [21]:
mask = data.nunique().values > 31
num_features = list(unique[mask].index)
num_features
Out[21]:
['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

Distributions and transformations

In [22]:
data[num_features].hist(bins=30, figsize=(8,7))
plt.tight_layout();
In [23]:
from sklearn.preprocessing import MinMaxScaler
import scipy.stats as stats

colors = cm.rainbow(np.linspace(0, 1, len(num_features)))

for col, c in zip(num_features, colors):
    
    minmax = MinMaxScaler(feature_range=(0.0001, max(data[col].values)))
    if data[col].describe().iloc[3] <= 0.0:
        z = minmax.fit_transform(data[col].values.reshape(-1, 1))
        z = np.squeeze(z)
        print('BC MinMax Scaled')
    else: 
        z = data[col]
        
    z0, _ = stats.boxcox(z, lmbda=None)
    z1 = stats.boxcox(z, lmbda=-1)
    z2 = stats.boxcox(z, lmbda=-0.5)
    z3 = stats.boxcox(z, lmbda=0.0)
    z4 = stats.boxcox(z, lmbda=0.5)
    z5 = stats.boxcox(z, lmbda=2)
    z6 = stats.boxcox(z, lmbda=-2)
    z7 = np.cbrt(data[col])

    trans = [data[col], z0, z1, z2, z3, z4, z5, z6, z7]
    names = ['Original', 'Boxcox', 'Recip', 'Recip_Sqrt', 'Log', 'Sqrt', 'Square', 'Recip_Square', 'Cube_rt']

    fig = plt.figure(figsize=(12,6))
    for i, t, n in zip(range(1, 10), trans, names):
        ax = fig.add_subplot(3, 3, i)
        plt.hist(t, bins=30, color=c, alpha=0.6)
        plt.title(col+'_'+n)
        
    plt.tight_layout()
    plt.show();
BC MinMax Scaled
BC MinMax Scaled
BC MinMax Scaled
In [24]:
data.balance = np.cbrt(data.balance)

age, _ = stats.boxcox(data.age, lmbda=None)
data.age = age

duration, _ = stats.boxcox(data.duration, lmbda=None)
data.duration = duration
In [25]:
from sklearn.preprocessing import scale

cols = ['duration', 'balance', 'age']

data.loc[:, cols] = scale(data.loc[:, cols])
In [26]:
nums = data[num_features]
nums['deposit'] = data.deposit
In [27]:
sns.pairplot(nums, hue='deposit', diag_kind='hist');