The PyMC3 workflow

PyMC3 aims for intuitive and readable, yet powerful syntax that reflects how statisticians describe models. The modeling process generally follows these five steps:

  1. Encode a probability model by defining the following:
    1. The prior distributions that quantify knowledge and uncertainty about latent variables
    2. The likelihood function that conditions the parameters on observed data
  1. Analyze the posterior using one of the options:
    1. Obtain a point estimate using MAP inference
    2. Sample from the posterior using MCMC methods
  1. Approximate the posterior using variational Bayes.
  2. Check the model using various diagnostic tools.
  3. Generate predictions.

The resulting model can be used for inference to gain detailed insights into parameter values as well as to predict outcomes for new data points.

Logistic regression with PyMC3

Logistic regression estimates a linear relationship between a set of features and a binary outcome, mediated by a sigmoid function to ensure the model produces probabilities. The frequentist approach resulted in point estimates for the parameters that measure the influence of each feature on the probability that a data point belongs to the positive class, with confidence intervals based on assumptions about the parameter distribution.

In contrast, Bayesian logistic regression estimates the posterior distribution over the parameters itself. The posterior allows for more robust estimates of what is called a Bayesian credible interval for each parameter with the benefit of more transparency about the model’s uncertainty.

Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
from pathlib import Path
import pickle
from collections import OrderedDict
import pandas as pd
import numpy as np
from scipy import stats

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score, confusion_matrix, accuracy_score, f1_score, 
                             precision_recall_curve) 
import theano
import pymc3 as pm
from pymc3.variational.callbacks import CheckParametersConvergence
import statsmodels.formula.api as smf

import arviz as az
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.lines as mlines

import seaborn as sns
from IPython.display import HTML
In [3]:
sns.set(style="white", color_codes=True)
pd.set_option('display.max_columns', None)
In [4]:
data_path = Path('data')
fig_path = Path('figures')
model_path = Path('models')
for p in [data_path, fig_path, model_path]:
    if not p.exists():
        p.mkdir()

The Data

The goal of this dataset is to create a binary classification model that predicts whether or not over $11,000$ customers will subscribe to a term deposit after a marketing campaign the bank has performed, based on $17$ predictor variables. The target variable is given as deposit and takes on a value of $1$ if the customer has subscribed and $0$ otherwise.

This is a slightly imbalanced class problem because there are more customers that did not subscribe the term deposit than did.

Load from Disk

In [5]:
data = pd.read_csv('bank.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
age          11162 non-null int64
job          11162 non-null object
marital      11162 non-null object
education    11162 non-null object
default      11162 non-null object
balance      11162 non-null int64
housing      11162 non-null object
loan         11162 non-null object
contact      11162 non-null object
day          11162 non-null int64
month        11162 non-null object
duration     11162 non-null int64
campaign     11162 non-null int64
pdays        11162 non-null int64
previous     11162 non-null int64
poutcome     11162 non-null object
deposit      11162 non-null object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB
In [6]:
data.head()
Out[6]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 59 admin. married secondary no 2343 yes no unknown 5 may 1042 1 -1 0 unknown yes
1 56 admin. married secondary no 45 no no unknown 5 may 1467 1 -1 0 unknown yes
2 41 technician married secondary no 1270 yes no unknown 5 may 1389 1 -1 0 unknown yes
3 55 services married secondary no 2476 yes no unknown 5 may 579 1 -1 0 unknown yes
4 54 admin. married tertiary no 184 no no unknown 5 may 673 2 -1 0 unknown yes
In [7]:
unique = data.nunique()
unique
Out[7]:
age            76
job            12
marital         3
education       4
default         2
balance      3805
housing         2
loan            2
contact         3
day            31
month          12
duration     1428
campaign       36
pdays         472
previous       34
poutcome        4
deposit         2
dtype: int64
In [8]:
mask = data.nunique().values < 31
cat_features = list(unique[mask].index)
cat_features
Out[8]:
['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'poutcome',
 'deposit']

Quick EDA

In [9]:
total = len(data)
plt.figure(figsize=(7,5))
g = sns.countplot(x='deposit', data=data)
g.set_ylabel('Count', fontsize=14)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 1.5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=14, fontweight='bold')
plt.margins(y=0.1)
plt.show()
In [10]:
fig, axs = plt.subplots(ncols=2, nrows=0, figsize=(13, 45))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)

for i, feature in enumerate(list(data[cat_features]), 1):
    if(feature=='deposit'):
        break
        
    plt.subplot(len(cat_features), 3, i)
    sns.countplot(y=feature, hue='deposit', data=data)
    plt.title(f'{feature}', size=15, fontsize=12)    
    plt.legend(loc='best', prop={'size': 8})
    
plt.tight_layout()       
plt.show()
In [11]:
plt.figure(figsize=(8,4))
sns.countplot(x='day', hue='deposit', data=data);
In [12]:
num_features = [f for f in data.columns if f not in cat_features]
num_features.remove('day')
num_features
Out[12]:
['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
In [13]:
basetable0 = pd.DataFrame(index=data.index)
basetable0['deposit'] = data['deposit']
deposit_dict = {'no': 0, 'yes': 1}
basetable0 = basetable0.replace({'deposit': deposit_dict})

for variable in num_features:
    if len(data.groupby(variable)) > 31:
        new_variable = "disc_" + variable
        basetable0[new_variable] = pd.qcut(data[variable], 5, duplicates='drop')
In [14]:
basetable0.head()
Out[14]:
deposit disc_age disc_balance disc_duration disc_campaign disc_pdays disc_previous
0 1 (52.0, 95.0] (2223.0, 81204.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
1 1 (52.0, 95.0] (-6847.001, 62.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
2 1 (36.0, 42.0] (862.6, 2223.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
3 1 (52.0, 95.0] (2223.0, 81204.0] (323.0, 585.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
4 1 (52.0, 95.0] (62.0, 337.0] (585.0, 3881.0] (0.999, 2.0] (-1.001, 95.0] (-0.001, 1.0]
In [15]:
def create_pig_table(basetable, target, variable):
    
    groups = basetable[[target,variable]].groupby(variable)
    pig_table = groups[target].agg(Incidence = np.mean, Size = np.size).reset_index()
    
    return pig_table

def plot_pig(pig_table, variable):
    
    plt.figure(figsize=(10, 4))
    plt.ylabel("Size", rotation=0, rotation_mode="anchor", ha="right")
    pig_table["Size"].plot(kind="bar", width=0.5, color="lightgray", edgecolor="none") 
    pig_table["Incidence"].plot(secondary_y=True)
    
    plt.xticks(np.arange(len(pig_table)), pig_table[variable])
    plt.xlim([-0.5, len(pig_table) - 0.5])
    plt.ylabel("Incidence", rotation=0, rotation_mode="anchor", ha="left")
    plt.title(f'{variable}')
    
    plt.show()
In [16]:
variables = list(basetable0.drop('deposit', axis=1).columns)

pig_tables = {}
for variable in variables: 
    pig_table = create_pig_table(basetable0, 'deposit', variable)
    pig_tables[variable] = pig_table
In [17]:
for variable in variables: 
    pig_table = create_pig_table(basetable0, "deposit", variable)
    plot_pig(pig_table, variable)

Data Preprocessing

In [18]:
job_dict = {'management': 12, 'admin.': 11, 'entrepreneur': 10, 'technician': 9, 'services': 8, 'self-employed': 7,
            'blue-collar': 6, 'retired': 5, 'housemaid': 4, 'unemployed': 3, 'unknown': 2, 'student': 1}
data = data.replace({'job': job_dict})

marital_dict = {'married': 4, 'single': 3, 'divorced': 2, 'unknown': 1}
data = data.replace({'marital': marital_dict})
data['marital'] = data['marital'].astype(int)

edu_dict = {'tertiary': 4, 'secondary': 3, 'primary': 2, 'unknown': 1}
data = data.replace({'education': edu_dict})

default_dict = {'no': 1, 'yes': 0}
data = data.replace({'default': default_dict})

housing_dict = {'no': 1, 'yes': 0}
data = data.replace({'housing': housing_dict})

loan_dict = {'no': 1, 'yes': 0}
data = data.replace({'loan': loan_dict})

contact_dict = {'cellular': 3, 'telephone': 2, 'unknown': 1}
data = data.replace({'contact': contact_dict})

poutcome_dict = {'success': 4, 'other': 3, 'unknown': 2, 'failure': 1}
data = data.replace({'poutcome': poutcome_dict})

month_dict = {'jan': 1, 'feb': 2, 'aug': 8, 'nov': 11, 'jun': 6, 'apr': 4, 'jul': 7, 'may': 5, 'oct': 10, 
              'mar': 3, 'sep': 9, 'dec': 12}
data = data.replace({'month': month_dict})

pdays_dict = {-1: 0}
data = data.replace({'pdays': pdays_dict})
In [19]:
data.head()
Out[19]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 59 11 4 3 1 2343 0 1 1 5 5 1042 1 0 0 2 yes
1 56 11 4 3 1 45 1 1 1 5 5 1467 1 0 0 2 yes
2 41 9 4 3 1 1270 0 1 1 5 5 1389 1 0 0 2 yes
3 55 8 4 3 1 2476 0 1 1 5 5 579 1 0 0 2 yes
4 54 11 4 4 1 184 1 1 1 5 5 673 2 0 0 2 yes
In [20]:
data.dtypes
Out[20]:
age           int64
job           int64
marital       int64
education     int64
default       int64
balance       int64
housing       int64
loan          int64
contact       int64
day           int64
month         int64
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome      int64
deposit      object
dtype: object
In [21]:
mask = data.nunique().values > 31
num_features = list(unique[mask].index)
num_features
Out[21]:
['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

Distributions and transformations

In [22]:
data[num_features].hist(bins=30, figsize=(8,7))
plt.tight_layout();
In [23]:
from sklearn.preprocessing import MinMaxScaler
import scipy.stats as stats

colors = cm.rainbow(np.linspace(0, 1, len(num_features)))

for col, c in zip(num_features, colors):
    
    minmax = MinMaxScaler(feature_range=(0.0001, max(data[col].values)))
    if data[col].describe().iloc[3] <= 0.0:
        z = minmax.fit_transform(data[col].values.reshape(-1, 1))
        z = np.squeeze(z)
        print('BC MinMax Scaled')
    else: 
        z = data[col]
        
    z0, _ = stats.boxcox(z, lmbda=None)
    z1 = stats.boxcox(z, lmbda=-1)
    z2 = stats.boxcox(z, lmbda=-0.5)
    z3 = stats.boxcox(z, lmbda=0.0)
    z4 = stats.boxcox(z, lmbda=0.5)
    z5 = stats.boxcox(z, lmbda=2)
    z6 = stats.boxcox(z, lmbda=-2)
    z7 = np.cbrt(data[col])

    trans = [data[col], z0, z1, z2, z3, z4, z5, z6, z7]
    names = ['Original', 'Boxcox', 'Recip', 'Recip_Sqrt', 'Log', 'Sqrt', 'Square', 'Recip_Square', 'Cube_rt']

    fig = plt.figure(figsize=(12,6))
    for i, t, n in zip(range(1, 10), trans, names):
        ax = fig.add_subplot(3, 3, i)
        plt.hist(t, bins=30, color=c, alpha=0.6)
        plt.title(col+'_'+n)
        
    plt.tight_layout()
    plt.show();
BC MinMax Scaled
BC MinMax Scaled
BC MinMax Scaled
In [24]:
data.balance = np.cbrt(data.balance)

age, _ = stats.boxcox(data.age, lmbda=None)
data.age = age

duration, _ = stats.boxcox(data.duration, lmbda=None)
data.duration = duration
In [25]:
from sklearn.preprocessing import scale

cols = ['duration', 'balance', 'age']

data.loc[:, cols] = scale(data.loc[:, cols])
In [26]:
nums = data[num_features]
nums['deposit'] = data.deposit
In [27]:
sns.pairplot(nums, hue='deposit', diag_kind='hist');
In [28]:
deposit_dict = {'no': 0, 'yes': 1}
data = data.replace({'deposit': deposit_dict})

data.head()
Out[28]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 1.393156 11 4 3 1 0.800181 0 1 1 5 5 1.633998 1 0 0 2 1
1 1.225089 11 4 3 1 -0.723010 1 1 1 5 5 2.107157 1 0 0 2 1
2 0.163473 9 4 3 1 0.416053 0 1 1 5 5 2.029763 1 0 0 2 1
3 1.166438 8 4 3 1 0.838821 0 1 1 5 5 0.881662 1 0 0 2 1
4 1.106378 11 4 4 1 -0.389262 1 1 1 5 5 1.067320 2 0 0 2 1

Correlations

In [29]:
plt.figure(figsize=(13, 13))
corr = data.corr() 
mask = np.tri(*corr.shape).T 
sns.heatmap(corr.abs(), mask=mask, annot=True, annot_kws={"size": 9})
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show()
In [30]:
n_fts = len(data.columns)
colors = cm.rainbow(np.linspace(0, 1, n_fts))

data.drop('deposit',axis=1).corrwith(data.deposit).sort_values(ascending=True).plot(kind='barh', 
                                                                                     color=colors, figsize=(12, 6))
plt.title('Correlation to Target (deposit)')
plt.show()

print('\n',data.drop('deposit',axis=1).corrwith(data.deposit).sort_values(ascending=False))
 duration     0.519315
contact      0.249847
poutcome     0.237089
housing      0.203888
balance      0.151666
pdays        0.151167
previous     0.139867
loan         0.110580
education    0.075583
default      0.040680
month        0.028645
age         -0.007167
job         -0.031694
day         -0.056326
marital     -0.068325
campaign    -0.128081
dtype: float64

Logistic Regression with one independent variable

Example logistic regression equation:

$$y = \frac{e^{(\beta_0 + \beta_1*x)}}{(1 + e^{(\beta_0 + \beta_1*x)})}$$

Where $y$ is the predicted output, $e$ is the base of the natural logarithms or the np.exp() function, $\beta_0$ is the bias or intercept term and $\beta_1$ is the coefficient for the single input value $(x)$. Each column in your input data has an associated $\beta$ coefficient (a constant real value) that must be learned from the training data.

We are going to begin with the simplest possible logistic model, using just one independent variable or feature, the duration.

In [31]:
y_simple = data['deposit']
x_n = 'duration' 
x_c = data[x_n].values

with pm.Model() as model_simple:
    # random variables for the coefficient with
    # uninformative priors for each parameter
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=10)
    
    # Transform random variables into vector of probabilities 
    μ = α + pm.math.dot(x_c, β)    
    θ = pm.Deterministic('θ', pm.math.sigmoid(μ))
    
    # The decision boundary
    db = pm.Deterministic('db', -α/β)
    
    # Bernoulli random vector with probability of success
    # given by sigmoid function and actual data as observed
    y = pm.Bernoulli(name='logit', p=θ, observed=y_simple)

    trace_simple = pm.sample(1000, tune=1000, chains=4, init = 'adapt_diag', cores=5)
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 5 jobs)
NUTS: [β, α]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:09<00:00, 875.08draws/s] 
In [32]:
model_simple.model
Out[32]:
$$ \begin{array}{rcl} \text{α} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=10.0)\\\text{β} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=10.0)\\\text{θ} &\sim & \text{Deterministic}(\text{α},~\text{Constant},~\text{β})\\\text{db} &\sim & \text{Deterministic}(\text{α},~\text{β})\\\text{logit} &\sim & \text{Bernoulli}(\mathit{p}=\text{θ}) \end{array} $$

The command pm.model_to_graphviz(manual_logistic_model) produces the plate notation displayed below.

It shows the unobserved parameters as light and the observed elements as dark circles. The rectangle indicates the number of repetitions of the observed model element implied by the data included in the model definition.

In [33]:
pm.model_to_graphviz(model_simple)
Out[33]:
%3 cluster11,162 11,162 db db ~ Deterministic α α ~ Normal α->db θ θ ~ Deterministic α->θ β β ~ Normal β->db β->θ logit logit ~ Bernoulli θ->logit
In [34]:
theta = trace_simple['θ'].mean(axis=0)
idx = np.argsort(x_c)

plt.plot(x_c[idx], theta[idx], color='red', lw=3)
plt.vlines(trace_simple['db'].mean(), 0, 1, color='k', ls='--')
plt.scatter(x_c, np.random.normal(y_simple, 0.02), marker='.', alpha=0.5, 
            color=[f'C{x}' for x in y_simple])

plt.xlabel(x_n)
plt.ylabel('θ', rotation=0)
locs, _ = plt.xticks()
plt.xticks(locs, np.round(locs + x_c.mean(), 1));

The above plot shows non deposit versus deposit $(y = 0, y = 1)$. The S-shaped (red) line is the mean value of $\theta$ (sigmoid function). This line can be interpreted as the probability of a deposit, given that we know that the last time contact duration (the value of the duration). The decision boundary is represented as a (black) vertical line. According to the decision boundary, the values of duration to the left correspond to $y = 0$ (non deposit), and the values to the right to $y = 1$ (deposit).

Create a data frame with trace summary statistics.

In [35]:
az.summary(trace_simple, var_names=['α', 'β', 'db'])
Out[35]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
α -0.135 0.023 -0.177 -0.093 0.000 0.0 3629.0 3572.0 3628.0 2898.0 1.0
β 1.439 0.031 1.385 1.498 0.001 0.0 3367.0 3365.0 3349.0 2621.0 1.0
db 0.094 0.016 0.064 0.123 0.000 0.0 3656.0 3592.0 3656.0 2771.0 1.0
  • mean: Trace mean

  • sd: Trace standard deviation

  • mcse: Markov Chain Standard Error statistic

  • ess: Estimate of the effective sample size

  • r_hat: The R-hat diagnostic tests for lack of convergence by comparing the variance between multiple chains to the variance within each chain. If convergence has been achieved, the between-chain and within-chain variances should be identical. To be most effective in detecting evidence for nonconvergence, each chain should have been initialized to starting values that are dispersed relative to the target distribution.

Posterior Predictive Checks

PPCs are very useful for examining how well a model fits the data. They do so by generating data from the model using parameters from draws from the posterior. We use the function pm.sample_ppc for this purpose and obtain $n$ samples for each observation (the GLM module automatically names the outcome $y$):

In [36]:
ppc = pm.sample_ppc(trace_simple, model=model_simple, samples=500)
preds = np.rint(ppc['logit'].mean(axis=0)).astype('int')
y_score = np.mean(ppc['logit'], axis=0)
100%|██████████| 500/500 [00:00<00:00, 671.87it/s]
In [37]:
print('Accuracy of the simplest model:', accuracy_score(data['deposit'], preds))
print('AUC score of the simplest model:', roc_auc_score(data['deposit'], y_score))
Accuracy of the simplest model: 0.7170757928686615
AUC score of the simplest model: 0.8038875231925058

Models

In [38]:
simple_model = 'deposit ~ duration + pdays + previous'
full_model = 'deposit ~ age + job + marital + default + day + month + duration + pdays + previous + education + balance +  loan + poutcome + campaign +  housing + contact'

MAP Inference

A probabilistic program consists of observed and unobserved random variables (RVs). We define the observed RVs via likelihood distributions and unobserved RVs via prior distributions. PyMC3 includes numerous probability distributions for this purpose.

The PyMC3 library makes it very straightforward to perform approximate Bayesian inference for logistic regression. Logistic regression models the probability that individual $i$ subscribes to a deposit based on $k$ features:

$$ p(y_i = 1 \mid \beta) = \sigma (\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) $$

Where $\sigma$ is the logistic function: $$ \sigma(t) = \frac{1}{1 + e^{-1}} $$

Manual Model Specification

We will use the context manager with to define a manual_logistic_model that we can refer to later as a probabilistic model:

  1. The random variables for the unobserved parameters for intercept and $3$ features are expressed using uninformative priors that assume normal distributions with mean $0$ and standard deviation of $100$.
  1. The likelihood combines the parameters with the data according to the specification of the logistic regression.
  1. The outcome is modeled as a Bernoulli RV with success probability given by the likelihood.
In [39]:
with pm.Model() as manual_logistic_model:
    # random variables for coefficients with
    # uninformative priors for each parameter

    intercept = pm.Normal('intercept', 0, sd=100)
    beta_1 = pm.Normal('beta_1', 0, sd=100)
    beta_2 = pm.Normal('beta_2', 0, sd=100)
    beta_3 = pm.Normal('beta_3', 0, sd=100)

    # Transform random variables into vector of probabilities p(y_i=1)
    # according to logistic regression model specification.
    likelihood = pm.invlogit(intercept + beta_1 * data.duration + beta_2 * data.pdays + beta_3 * data.previous)

    # Bernoulli random vector with probability of success
    # given by sigmoid function and actual data as observed
    pm.Bernoulli(name='logit', p=likelihood, observed=data.deposit)
In [40]:
manual_logistic_model.model
Out[40]:
$$ \begin{array}{rcl} \text{intercept} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{beta_1} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{beta_2} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{beta_3} &\sim & \text{Normal}(\mathit{mu}=0.0,~\mathit{sigma}=100.0)\\\text{logit} &\sim & \text{Bernoulli}(\mathit{p}=f(f(f(),~f(f(),~f(f(f(f(f(f(\text{intercept}),~f(f(\text{beta_1}),~array)),~f(f(\text{beta_2}),~array)),~f(f(\text{beta_3}),~array)))))),~f())) \end{array} $$
In [41]:
pm.model_to_graphviz(manual_logistic_model)
Out[41]:
%3 cluster11,162 11,162 beta_1 beta_1 ~ Normal logit logit ~ Bernoulli beta_1->logit beta_2 beta_2 ~ Normal beta_2->logit beta_3 beta_3 ~ Normal beta_3->logit intercept intercept ~ Normal intercept->logit

MAP Estimate

Maximum a posteriori probability (MAP) estimation leverages that the evidence is a constant factor that scales the posterior to meet the requirements for a probability distribution. Since the evidence does not depend on $θ$, the posterior distribution is proportional to the product of the likelihood and the prior. Hence, MAP estimation chooses the value of $θ$ that maximizes the posterior given the observed data and the prior belief, that is, the mode of the posterior.

$$ \theta_{MAP} = \operatorname*{arg\,max}_\theta P(X \mid \theta) P(\theta) $$

We obtain point MAP estimates for the three parameters using the just defined model’s .find_MAP() method:

In [42]:
with manual_logistic_model:
    # compute maximum a-posteriori estimate
    # for logistic regression weights
    manual_map_estimate = pm.find_MAP()
logp = -5,766.1, ||grad|| = 217.22: 100%|██████████| 45/45 [00:00<00:00, 329.78it/s]   
In [43]:
def print_map(result):
    return pd.Series({k: np.asscalar(v) for k, v in result.items()})
In [44]:
print_map(manual_map_estimate)
Out[44]:
intercept   -0.389113
beta_1       1.487781
beta_2       0.002422
beta_3       0.140743
dtype: float64

GLM Model

PyMC3 includes numerous common models so that we can usually leave the manual specification for custom applications.

The following code defines the same logistic regression as a member of the Generalized Linear Models (GLM) family using the formula format inspired by the statistical language R and ported to python by the patsy library:

In [45]:
with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula(simple_model,
                            data,
                            family=pm.glm.families.Binomial())
In [46]:
pm.model_to_graphviz(logistic_model)
Out[46]:
%3 cluster11,162 11,162 previous previous ~ Normal y y ~ Binomial previous->y Intercept Intercept ~ Flat Intercept->y duration duration ~ Normal duration->y pdays pdays ~ Normal pdays->y

PyMC3 solves the optimization problem of finding the posterior point with the highest density using the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm but offers several alternatives provided by the scipy library. The result is virtually identically to the corresponding statsmodels estimate:

In [47]:
model = smf.logit(formula=simple_model, data=data[['deposit', 'duration', 'pdays', 'previous']])
result = model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.514605
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                deposit   No. Observations:                11162
Model:                          Logit   Df Residuals:                    11158
Method:                           MLE   Df Model:                            3
Date:                Fri, 09 Oct 2020   Pseudo R-squ.:                  0.2561
Time:                        14:43:13   Log-Likelihood:                -5744.0
converged:                       True   LL-Null:                       -7721.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.3891      0.026    -14.920      0.000      -0.440      -0.338
duration       1.4878      0.031     48.209      0.000       1.427       1.548
pdays          0.0024      0.000      9.319      0.000       0.002       0.003
previous       0.1407      0.014      9.874      0.000       0.113       0.169
==============================================================================
In [48]:
print_map(manual_map_estimate)
Out[48]:
intercept   -0.389113
beta_1       1.487781
beta_2       0.002422
beta_3       0.140743
dtype: float64
In [49]:
result.params 
Out[49]:
Intercept   -0.389111
duration     1.487782
pdays        0.002422
previous     0.140742
dtype: float64

Markov Chain Monte Carlo

Markov chains are stochastic models that describe sequences of possible events. Each event comes from a set of outcomes, and each outcome determines which outcome occurs next, according to a fixed set of probabilities. An important feature of Markov chains is that they are memoryless: everything that you would possibly need to predict the next event is available in the current state, and no new information comes from knowing historical events.

Monte Carlo methods rely on repeated random sampling to approximate results that may be deterministic, but that does not permit an analytic, exact solution.

Many algorithms apply the Monte Carlo method to a Markov Chain, and generally proceed as follows:

  1. Start at the current position.
  2. Draw a new position from a proposal distribution.
  3. Evaluate the probability of the new position in light of data and prior distributions:
    1. If sufficiently likely, move to the new position
    2. Otherwise, remain at the current position
  4. Repeat from step 1.
  5. After a given number of iterations, return all accepted positions.

Define the Model

We will use a slightly more complicated model to illustrate Markov chain Monte Carlo inference:

  • full_model = 'deposit ~ age + job + marital + default + day + month + duration + pdays + previous + education + balance + loan + poutcome + campaign + housing + contact'
In [50]:
with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula(formula=full_model,
                            data=data,
                            family=pm.glm.families.Binomial())
In [51]:
pm.model_to_graphviz(logistic_model)
Out[51]:
%3 cluster11,162 11,162 duration duration ~ Normal y y ~ Binomial duration->y balance balance ~ Normal balance->y poutcome poutcome ~ Normal poutcome->y age age ~ Normal age->y marital marital ~ Normal marital->y day day ~ Normal day->y loan loan ~ Normal loan->y housing housing ~ Normal housing->y month month ~ Normal month->y previous previous ~ Normal previous->y Intercept Intercept ~ Flat Intercept->y education education ~ Normal education->y campaign campaign ~ Normal campaign->y pdays pdays ~ Normal pdays->y contact contact ~ Normal contact->y default default ~ Normal default->y job job ~ Normal job->y

Hamiltonian Monte Carlo – going NUTS

By default, PyMC3 automatically selects the most efficient sampler and initializes the sampling process for efficient convergence. For a continuous model, PyMC3 chooses the NUTS sampler. It also runs variational inference via ADVI to find good starting parameters for the sampler. One among several alternatives is to use the MAP estimate.

Hamiltonian Monte Carlo (HMC) is a hybrid method that leverages the first-order derivative information of the gradient of the likelihood to propose new states for exploration and overcome some of the challenges of MCMC. In addition, it incorporates momentum to efficiently jump around the posterior. As a result, it converges faster to a high-dimensional target distribution than simpler random-walk Metropolis or Gibbs sampling.

To see what the convergence looks like, we first draw $1,000$ samples after tuning the sampler for $1,000$ iterations that will be discarded. The sampling process can be parallelized for multiple chains using the cores argument (except when using GPU).

In [52]:
with logistic_model:
    trace = pm.sample(tune=1000, draws=1000, chains=4, init = 'adapt_diag', cores=5)
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 5 jobs)
NUTS: [contact, housing, campaign, poutcome, loan, balance, education, previous, pdays, duration, month, day, default, marital, job, age, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [02:44<00:00, 48.75draws/s]

Inspect Trace

In [53]:
pm.plot_trace(trace);
  • On the left, we have a KDE plot, — for each parameter value on the x-axis we get a probability on the y-axis that tells us how likely that parameter value is. The maximum posterior estimate of each variable (the peak in the left side distributions) is very close to the true parameters.
  • On the right, we get the individual sampled values at each step during the sampling. From the trace plot, we can visually get the plausible values from the posterior. Our sampling chains for the individual parameters (left) seem well converged and stationary (there are no large drifts or other odd patterns).

Plot joint distribution of parameters.

One of the major benefits of Bayesian data analysis is that we can be explicit about our uncertainty. Maximum likelihood returns a number, but how certain can we be that we found the right number? Instead, Bayesian inference returns a distribution over parameter values.

In [54]:
temp = pd.DataFrame(dict(job =trace['job'], edu = trace['education']))

kdeplot = sns.jointplot(data=temp, x='edu', y='job', kind="kde", cmap='jet', cbar=True, n_levels=100,
                        marginal_kws={'color': 'blue'}).annotate(stats.pearsonr)

kdeplot.fig.set_size_inches(9, 7)

plt.subplots_adjust(left=0.1, right=0.8, top=0.9, bottom=0.1)
pos_joint_ax = kdeplot.ax_joint.get_position()
pos_marg_x_ax = kdeplot.ax_marg_x.get_position()
kdeplot.ax_joint.set_position([pos_joint_ax.x0, pos_joint_ax.y0, pos_marg_x_ax.width, pos_joint_ax.height])
kdeplot.fig.axes[-1].set_position([.83, pos_joint_ax.y0, .07, pos_joint_ax.height])

plt.show()

Interpreting Model Coefficients

Here we fit a logistic regression model using MLE via the statsmodels.formula library and exctract the p-values and coefficients. We will also calculate the odds ratio of the coefficients to try and interpret the models feature importance.

In [55]:
tmp = data.copy()
tmp['deposit'] = data['deposit']

model = smf.logit(formula=full_model, data=tmp)
result = model.fit()
Optimization terminated successfully.
         Current function value: 0.436392
         Iterations 7

Alt text that describes the graphic

An odds ratio (OR) is a measure of association between a certain property $A$ and a second property $B$ in a population. Specifically, it tells you how the presence or absence of property $A$ has an effect on the presence or absence of property $B$. The OR is also used to figure out if a particular exposure (like eating processed meat) is a risk factor for a particular outcome (such as colon cancer), and to compare the various risk factors for that outcome. As long as you have two properties you think are linked, you can calculate the odds.

You have two choices for the formula: $$\frac{(a \div c)}{(b \div d)}$$

or, equivalently: $$\frac{(a \times d)}{(b \times c)}$$

  • $OR = 1$ indicates exposure does not affect odds of event occurrence
  • $OR > 1$ indicates increased occurrence of event
  • $OR < 1$ indicates decreased occurrence of event (protective exposure)

We can get the odds ratio of the model coefficients by exponentiating the coefficients.

In [56]:
log_df = pd.DataFrame(dict(coefs = result.params[1:], p_vals = result.pvalues[1:]), index=data.columns)
log_df['odds'] = np.exp(log_df['coefs'])
log_df['Pct_odds_increase'] = (log_df['odds'] - 1) * 100
log_df = log_df.sort_values('odds', ascending=False)
mask = log_df.p_vals <= 0.05
log_df[mask]
Out[56]:
coefs p_vals odds Pct_odds_increase
duration 1.671703 0.000000e+00 5.321223 432.122345
housing 0.939173 7.765103e-66 2.557865 155.786525
poutcome 0.670200 2.952181e-68 1.954629 95.462907
contact 0.580022 7.722039e-60 1.786078 78.607753
loan 0.558923 1.560255e-12 1.748787 74.878725
balance 0.257488 1.190841e-21 1.293677 29.367658
education 0.167433 3.239794e-06 1.182266 18.226588
previous 0.057102 1.498422e-04 1.058764 5.876393
pdays 0.002370 2.181593e-16 1.002373 0.237272
day -0.006962 2.105777e-02 0.993062 -0.693806
job -0.021345 2.051986e-02 0.978881 -2.111852
campaign -0.095100 2.462751e-13 0.909282 -9.071792
marital -0.163300 1.069260e-05 0.849336 -15.066392

Each curve shows how the probability of subscribing to a term deposit changes with duration, given this customer owns a house.

The red curve represents success, the green curve represents unknown and, the blue failure poutcomes (outcome of the previous marketing campaign). For all three poutcomes levels, the probability of subscribing a term deposit increases with duration until approximately duration (last contact duration, in seconds) $0$, when the probability begins to decrease.

We are plotting $100$ different curves for each level of poutcomes. Each curve is a draw from our posterior distribution.

In [57]:
def lm_full(trace, duration, poutcome, housing):
    shape = np.broadcast(duration, poutcome, housing).shape
    x_norm = np.asarray([np.broadcast_to(x, shape) for x in [duration, poutcome, housing]])
    
    return 1 / (1 + np.exp(-(trace['Intercept'] + 
                             trace['duration']*x_norm[0] + 
                             trace['poutcome']*x_norm[1] +
                             trace['housing']*x_norm[2])))

lm = lambda x, samples: lm_full(samples, x, 1, 1)
lm2 = lambda x, samples: lm_full(samples, x,2, 1)
lm3 = lambda x, samples: lm_full(samples, x, 4, 1)
In [58]:
min_age = min(data.duration)
max_age = max(data.duration)

# Plot the posterior predictive distributions of P(deposit = yes) vs. duration
pm.plot_posterior_predictive_glm(trace, eval=np.linspace(
    min_age, max_age, 1000), lm=lm, samples=100, color="blue", alpha=.7)
pm.plot_posterior_predictive_glm(trace, eval=np.linspace(
    min_age, max_age, 1000), lm=lm2, samples=100, color="green", alpha=.7)
pm.plot_posterior_predictive_glm(trace, eval=np.linspace(
    min_age, max_age, 1000), lm=lm3, samples=100, color="red", alpha=.7)

blue_line = mlines.Line2D(['lm'], [], color='b', label='failure')
green_line = mlines.Line2D(['lm2'], [], color='g', label='unknown')
red_line = mlines.Line2D(['lm3'], [], color='r', label='success')
plt.legend(handles=[blue_line, green_line, red_line], loc='lower right')
plt.ylabel("P(deposit = yes)")
plt.xlabel("duration")
plt.show()
In [59]:
pm.trace_to_dataframe(trace).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 17 columns):
Intercept    4000 non-null float64
age          4000 non-null float64
job          4000 non-null float64
marital      4000 non-null float64
default      4000 non-null float64
day          4000 non-null float64
month        4000 non-null float64
duration     4000 non-null float64
pdays        4000 non-null float64
previous     4000 non-null float64
education    4000 non-null float64
balance      4000 non-null float64
loan         4000 non-null float64
poutcome     4000 non-null float64
campaign     4000 non-null float64
housing      4000 non-null float64
contact      4000 non-null float64
dtypes: float64(17)
memory usage: 531.4 KB
In [60]:
with open(model_path / 'logistic_model_mh.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model, 'trace': trace}, buff)
In [61]:
pm.summary(trace)
Out[61]:
mean sd hpd_3% hpd_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
Intercept -3.753 0.302 -4.337 -3.209 0.005 0.004 3379.0 3361.0 3375.0 2984.0 1.0
age -0.045 0.026 -0.096 0.002 0.000 0.000 4591.0 3761.0 4598.0 3133.0 1.0
job -0.021 0.009 -0.038 -0.004 0.000 0.000 5148.0 4153.0 5169.0 2622.0 1.0
marital -0.163 0.037 -0.230 -0.093 0.001 0.000 4966.0 4773.0 4949.0 3204.0 1.0
default 0.142 0.216 -0.265 0.548 0.003 0.003 4230.0 2702.0 4233.0 2914.0 1.0
day -0.007 0.003 -0.013 -0.001 0.000 0.000 5646.0 4676.0 5668.0 3090.0 1.0
month -0.018 0.010 -0.036 0.001 0.000 0.000 5087.0 3955.0 5079.0 2794.0 1.0
duration 1.676 0.035 1.610 1.743 0.001 0.000 4946.0 4946.0 4940.0 2664.0 1.0
pdays 0.002 0.000 0.002 0.003 0.000 0.000 4299.0 4299.0 4317.0 3234.0 1.0
previous 0.058 0.015 0.029 0.084 0.000 0.000 3964.0 3762.0 3962.0 3283.0 1.0
education 0.168 0.036 0.099 0.234 0.001 0.000 4533.0 4441.0 4529.0 3066.0 1.0
balance 0.258 0.027 0.210 0.309 0.000 0.000 5051.0 4904.0 5057.0 2872.0 1.0
loan 0.559 0.079 0.415 0.706 0.001 0.001 5613.0 5565.0 5612.0 2899.0 1.0
poutcome 0.672 0.039 0.603 0.747 0.001 0.000 5267.0 5267.0 5281.0 2896.0 1.0
campaign -0.096 0.013 -0.120 -0.071 0.000 0.000 5846.0 5693.0 5850.0 2901.0 1.0
housing 0.942 0.055 0.841 1.047 0.001 0.001 5093.0 5058.0 5117.0 2718.0 1.0
contact 0.581 0.036 0.517 0.650 0.001 0.000 5095.0 5095.0 5105.0 3042.0 1.0
In [62]:
draws = 100

trace_df = pm.trace_to_dataframe(trace).assign(
    chain=lambda x: x.index // draws)

trace_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 18 columns):
Intercept    4000 non-null float64
age          4000 non-null float64
job          4000 non-null float64
marital      4000 non-null float64
default      4000 non-null float64
day          4000 non-null float64
month        4000 non-null float64
duration     4000 non-null float64
pdays        4000 non-null float64
previous     4000 non-null float64
education    4000 non-null float64
balance      4000 non-null float64
loan         4000 non-null float64
poutcome     4000 non-null float64
campaign     4000 non-null float64
housing      4000 non-null float64
contact      4000 non-null float64
chain        4000 non-null int64
dtypes: float64(17), int64(1)
memory usage: 562.6 KB

Persist Results

In [63]:
with open(model_path / 'logistic_model_nuts.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model,
                 'trace': trace}, buff)
In [64]:
with open(model_path / 'logistic_model_nuts.pkl', 'rb') as buff:
    data0 = pickle.load(buff)  

logistic_model, trace_NUTS = data0['model'], data0['trace']

Combine Traces

In [65]:
draws = 10000

df = pm.trace_to_dataframe(trace_NUTS).iloc[200:].reset_index(
    drop=True).assign(chain=lambda x: x.index // draws)

trace_df = pd.concat([trace_df.assign(samples=100),
                      df.assign(samples=10000)])
trace_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7800 entries, 0 to 3799
Data columns (total 19 columns):
Intercept    7800 non-null float64
age          7800 non-null float64
job          7800 non-null float64
marital      7800 non-null float64
default      7800 non-null float64
day          7800 non-null float64
month        7800 non-null float64
duration     7800 non-null float64
pdays        7800 non-null float64
previous     7800 non-null float64
education    7800 non-null float64
balance      7800 non-null float64
loan         7800 non-null float64
poutcome     7800 non-null float64
campaign     7800 non-null float64
housing      7800 non-null float64
contact      7800 non-null float64
chain        7800 non-null int64
samples      7800 non-null int64
dtypes: float64(17), int64(2)
memory usage: 1.2 MB

Visualize both traces

In [66]:
trace_df_long = pd.melt(trace_df, id_vars=['samples', 'chain'])
trace_df_long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132600 entries, 0 to 132599
Data columns (total 4 columns):
samples     132600 non-null int64
chain       132600 non-null int64
variable    132600 non-null object
value       132600 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 4.0+ MB

Convergence

We can visualize the samples over time and their distributions to check the quality of the results. The following charts show the posterior distributions after an initial $100$ and an additional $10,000$ samples, respectively, and illustrate how convergence implies that multiple chains identify the same distribution. The pm.trace_plot() function shows the evolution of the samples as well.

In [67]:
g = sns.FacetGrid(trace_df_long, col='variable', row='samples',
                  hue='chain', sharex='col', sharey=False)
g = g.map(sns.distplot, 'value', hist=False, rug=False)

Estimate Odds Ratio

In [68]:
b = trace['duration']
lb, ub = np.percentile(b, 2.5), np.percentile(b, 97.5)
lb, ub = np.exp(lb), np.exp(ub)
print(f'P({lb:.3f} < Odds Ratio < {ub:.3f}) = 0.95')
P(4.985 < Odds Ratio < 5.725) = 0.95

We can interpret something along those lines: "With probability $0.95$ the odds ratio is greater than $4.985$ and less than $5.725$, so the duration effect takes place because a person with a longer contact duration has at least $4.985$ higher probability to subscribe to a term deposit than a person with a lower duration, while holding all the other variables constant.

Computing Credible Intervals

We can compute the credible intervals, the Bayesian counterpart of confidence intervals, as percentiles of the trace. The resulting boundaries reflect our confidence about the range of the parameter value for a given probability threshold, as opposed to the number of times the parameter will be within this range for a large number of trials.

In [69]:
fig, ax = plt.subplots(figsize=(8, 4))
sns.distplot(np.exp(b), axlabel='Odds Ratio', ax=ax)
ax.set_title(f'Credible Interval: P({lb:.3f} < Odds Ratio < {ub:.3f}) = 0.95')
ax.axvspan(lb, ub, alpha=0.25, color='gray');

Variational Inference

Variational Inference (VI) is a machine learning method that approximates probability densities through optimization. In the Bayesian context, it approximates the posterior distribution as follows:

  1. Select a parametrized family of probability distributions
  2. Find the member of this family closest to the target, as measured by Kullback-Leibler divergence

Compared to MCMC, Variational Bayes tends to converge faster and scales to large data better. While MCMC approximates the posterior with samples from the chain that will eventually converge arbitrarily close to the target, variational algorithms approximate the posterior with the result of the optimization, which is not guaranteed to coincide with the target.

Variational Inference is better suited for large datasets and to quickly explore many models.In contrast, MCMC will deliver more accurate results on smaller datasets or when time and computational resources pose fewer constraints.

Run Automatic Differentation Variational Inference (ADVI)

The interface for variational inference is very similar to the MCMC implementation. We just use the fit() instead of the sample() function, with the option to include an early stopping CheckParametersConvergence callback if the distribution-fitting process converged up to a given tolerance:

In [70]:
with logistic_model:
    callback = CheckParametersConvergence(diff='absolute')
    approx = pm.fit(n=100000, callbacks=[callback])
Average Loss = 5,120.8: 100%|██████████| 100000/100000 [02:13<00:00, 751.37it/s] 
Finished [100%]: Average Loss = 5,120.7

Persist Result

In [71]:
with open(model_path / 'logistic_model_advi.pkl', 'wb') as buff:
    pickle.dump({'model': logistic_model,
                 'approx': approx}, buff)

Sample from approximated distribution

We can draw samples from the approximated distribution to obtain a trace object as above for the MCMC sampler:

In [72]:
trace_advi = approx.sample(2000)
In [73]:
# pm.summary(trace_advi).to_csv(model_path / 'trace_advi.csv')

Visualize the covariance structure of the models

In [74]:
az.plot_pair(trace_NUTS, figsize=(10, 10), kind='kde', divergences=True);

Let us visualize the covariance structure of the ADVI model

In [75]:
az.plot_pair(trace_advi, figsize=(10, 10), kind='kde', divergences=True);

Clearly, ADVI does not capture (as expected) the interactions between variables because of the mean field approximation, and so it underestimates the overall variance.

Model Diagnostics

Bayesian model diagnostics includes validating that the sampling process has converged and consistently samples from high-probability areas of the posterior, and confirming that the model represents the data well.

For high-dimensional models with many variables, it becomes cumbersome to inspect numerous traces. When using NUTS, the energy plot helps to assess problems of convergence. It summarizes how efficiently the random process explores the posterior. The plot shows the energy and the energy transition matrix that should be well matched as in the below example.

Energy Plot

When using NUTS, the energy plot helps to assess problems of convergence. It summarizes how efficiently the random process explores the posterior. The plot shows the energy and the energy transition matrix, which should be well-matched.

In [76]:
pm.energyplot(trace_NUTS);

Forest Plot

A forest plot, also known as a blobbogram, is a graphical display of estimated results from a number of scientific studies addressing the same question, along with the overall results. It was developed for use in medical research as a means of graphically representing a meta-analysis of the results of randomized controlled trials.

In [77]:
az.plot_forest([trace_advi, trace_NUTS], model_names=['trace_advi', 'trace_NUTS']);

Posterior Plot

In [78]:
pm.plot_posterior(trace_NUTS);

Model selection using Widely-applicable Information Criterion (WAIC)

WAIC (Watanabe 2010) is a fully Bayesian criterion for estimating out-of-sample expectation, using the computed log pointwise posterior predictive density (LPPD) and correcting for the effective number of parameters to adjust for overfitting.

One question that was immediately asked was what effect does campaign have on the model, and why should it be $\text{campaign}^2$ versus campaign? We’ll run the model with a few changes to see what effect increasing polynomial complexity has on this model in terms of WAIC.

In [79]:
def run_models(df, upper_order=5):
   
    models, traces = OrderedDict(), OrderedDict()

    for k in range(1,upper_order+1):

        nm = 'k{}'.format(k)
        fml = create_poly_modelspec(k)

        with pm.Model() as models[nm]:

            print('\nRunning: {}'.format(nm))
            pm.glm.GLM.from_formula(fml, df,
                                    family=pm.glm.families.Binomial())
            
            traces[nm] = pm.sample(1000, tune=1000, chains=4, init = 'adapt_diag', cores=5)

    return models, traces

def create_poly_modelspec(k=1):
   
    return ('deposit ~ age + job + marital + default + day + month + duration + pdays + previous + education + balance +  loan + poutcome + campaign +  housing + contact' + ' '.join(['+ np.power(campaign,{})'.format(j)
                                     for j in range(2,k+1)])).strip()
In [80]:
models_lin, traces_lin = run_models(data, 4)
Running: k1
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 5 jobs)
NUTS: [contact, housing, campaign, poutcome, loan, balance, education, previous, pdays, duration, month, day, default, marital, job, age, Intercept]
Sampling 4 chains, 1 divergences: 100%|██████████| 8000/8000 [02:38<00:00, 50.50draws/s]
Running: k2
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 5 jobs)
NUTS: [np.power(campaign, 2), contact, housing, campaign, poutcome, loan, balance, education, previous, pdays, duration, month, day, default, marital, job, age, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [02:38<00:00, 50.40draws/s]
Running: k3
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 5 jobs)
NUTS: [np.power(campaign, 3), np.power(campaign, 2), contact, housing, campaign, poutcome, loan, balance, education, previous, pdays, duration, month, day, default, marital, job, age, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [03:05<00:00, 43.18draws/s]
Running: k4
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 5 jobs)
NUTS: [np.power(campaign, 4), np.power(campaign, 3), np.power(campaign, 2), contact, housing, campaign, poutcome, loan, balance, education, previous, pdays, duration, month, day, default, marital, job, age, Intercept]
Sampling 4 chains, 1 divergences: 100%|██████████| 8000/8000 [09:54<00:00, 13.45draws/s]
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
In [81]:
model_trace_dict = dict()
ks = ['k1', 'k2', 'k3', 'k4']
for nm in ks:
    models_lin[nm].name = nm
    model_trace_dict.update({models_lin[nm]: traces_lin[nm]})

dfwaic = pm.compare(model_trace_dict, ic='WAIC')
dfwaic = dfwaic.reset_index(drop=True)
dfwaic['names'] = ks
dfwaic = dfwaic.sort_values('rank')
dfwaic.index = dfwaic.names

dfwaic
Out[81]:
rank waic p_waic d_waic weight se dse warning waic_scale names
names
k1 0 -4869.19 19.8589 0 0.516058 59.0606 0 True log k1
k2 1 -4869.27 19.7413 0.0823213 0.478088 59.0329 0.417066 True log k2
k3 2 -4883.47 20.8359 14.2835 0.00574277 59.1341 6.34255 True log k3
k4 3 -4888.31 17.5975 19.1204 0.000111089 59.166 5.90907 False log k4
  • We prefer the model(s) with lower WAIC.

Next we use the pm.compareplot function takes the output of pm.compare and produces a summary plot.

In [82]:
pm.compareplot(dfwaic);
  • The empty circle represents the values of WAIC and the black error bars associated with them are the values of the standard deviation of WAIC.
  • The value of the lowest WAIC is also indicated with a vertical dashed grey line to ease comparison with other WAIC values.
  • The filled in black dots are the in-sample deviance of each model, which for WAIC is 2 pWAIC from the corresponding WAIC value.
  • For all models except the top-ranked one, we also get a triangle, indicating the value of the difference of WAIC between that model and the top model, and a grey error bar indicating the standard error of the differences between the top-ranked WAIC and WAIC for each model.

This confirms that the original model is better than the models with increasing polynomial complexity.

Posterior Predictive Checks

PPCs are very useful for examining how well a model fits the data. They do so by generating data from the model using parameters from draws from the posterior. We use the function pm.sample_ppc for this purpose and obtain $n$ samples for each observation (the GLM module automatically names the outcome $y$):

In [83]:
ppc = pm.sample_ppc(trace_NUTS, samples=500, model=logistic_model)
100%|██████████| 500/500 [00:00<00:00, 517.36it/s]
In [84]:
ppc['y'].shape
Out[84]:
(500, 11162)

Check AUC Score

In [85]:
y_score = np.mean(ppc['y'], axis=0)
pred_scores = dict(y_true=data.deposit,y_score=y_score)
preds = np.rint(ppc['y'].mean(axis=0)).astype('int')

print('Accuracy of the full model: ', accuracy_score(data.deposit, preds))
print('AUC of the full model: ', roc_auc_score(**pred_scores))
Accuracy of the full model:  0.8011109120229349
AUC of the full model:  0.8795237358009937

Prediction

Follows PyMC3 docs

Predictions use theano’s shared variables to replace the training data with test data before running posterior predictive checks. To facilitate visualization, we create the train and test datasets, and convert the former to a shared variable. Note that we need to use numpy arrays and provide a list of column labels:

Train-test split

In [86]:
X = data.drop('deposit', axis=1)
y = data.deposit

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
labels = X_train.columns

Create shared theano variable

In [87]:
X_shared = theano.shared(X_train.values)

Define logistic model

In [88]:
with pm.Model() as logistic_model_pred:
    pm.glm.GLM(x=X_shared, labels=labels,
               y=y_train, family=pm.glm.families.Binomial())

Run NUTS sampler

In [89]:
with logistic_model_pred:
    pred_trace = pm.sample(draws=1000, tune=1000, chains=4, cores=5, init='adapt_diag')
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
Multiprocess sampling (4 chains in 5 jobs)
NUTS: [poutcome, previous, pdays, campaign, duration, month, day, contact, loan, housing, balance, default, education, marital, job, age, Intercept]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [02:45<00:00, 48.19draws/s]

Replace shared variable with test set

We then run the sampler as before, and apply the pm.sample_ppc function to the resulting trace after replacing the train with test data:

In [90]:
X_shared.set_value(X_test)
In [91]:
ppc = pm.sample_ppc(pred_trace,
                    model=logistic_model_pred,
                    samples=5000)
100%|██████████| 5000/5000 [00:10<00:00, 454.75it/s]

AUC Score

In [92]:
y_score = np.mean(ppc['y'], axis=0)
preds = np.rint(ppc['y'].mean(axis=0)).astype('int')

print('Accuracy on test data: ', accuracy_score(y_test, preds))
print('AUC on test data: ', roc_auc_score(y_score=y_score, y_true=y_test))
Accuracy on test data:  0.793999104343932
AUC on test data:  0.8756986854986891

Thresholding probabilities

In [93]:
from plot_utils import plot_thresholds, comp_conf_matrix

plot_thresholds(X_test, y_test, y_score, name='Bayesian Logistic Regression', fig_x=13, fig_y=4, 
                sizes=[0.55, -95, -45])
In [94]:
g_ts_pred = np.where(y_score > 0.44, 1, 0)
f1_ts_pred = np.where(y_score > 0.40, 1, 0)
prl_ts_pred = np.where(y_score > 0.50, 1, 0)

f1_ts_cm = np.around(confusion_matrix(y_test, f1_ts_pred, normalize='true'), decimals=3)
g_ts_cm = np.around(confusion_matrix(y_test, g_ts_pred, normalize='true'), decimals=3)
prl_ts_cm = np.around(confusion_matrix(y_test, prl_ts_pred, normalize='true'), decimals=3)
no_ts_cm = np.around(confusion_matrix(y_test, preds, normalize='true'), decimals=3)

comp_conf_matrix(mtx1=no_ts_cm, mtx2=f1_ts_cm, labels=['deposit', 'no deposit'], 
                 titles=['No Threshold', 'F1 Thresholded'], sizes=[10, 4])

comp_conf_matrix(mtx1=prl_ts_cm, mtx2=g_ts_cm, labels=['deposit', 'no deposit'], 
                 titles=['PRL Thresholded', 'G Thresholded'], sizes=[10, 4])
In [ ]: