Logistic Regression

Logistic regression estimates a linear relationship between a set of features and a binary outcome (usually $1$ for yes and $0$ for no), mediated by a sigmoid function to ensure the model produces probabilities. The frequentist approach results in point estimates for the parameters that measure the influence of each feature on the probability that a data point belongs to the positive or negative class.

Logistic Function

The logistic function, also called the sigmoid function, is an S-shaped curve that maps any real-valued number to a value between $0$ and $1$, which can be interpreted as a probability.

$$ f(x)={\frac {1}{1+e^{-k(x-x_{0})}}} $$

where

  • $x_{0} =$ the $x$ value of the sigmoid's midpoint,
  • $k =$ the logistic growth rate or steepness of the curve,
  • $e =$ the base of the natural logarithm (np.exp() in NumPy).
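As a minimal sketch of this formula (the parameter names k and x0 mirror the definition above; the standard sigmoid used later in this notebook is the special case $k=1$, $x_{0}=0$):

import numpy as np

def logistic(x, k=1.0, x0=0.0):
    # General logistic function: 1 / (1 + exp(-k * (x - x0)))
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

logistic(0.0)          # 0.5, the value at the midpoint x == x0
logistic(1.0, k=5.0)   # a steeper curve pushes values toward 1 faster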

Representation Used for Logistic Regression

Logistic regression uses an equation as the representation, very much like linear regression.

Input values $(x)$ are combined linearly using weights or coefficient values (referred to by the Greek letter beta, $\beta$) to predict an output value $(y)$. A key difference from linear regression is that the output value being modeled is a binary value ($0$ or $1$) rather than a numeric value.

Example logistic regression equation:

$$y = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

Where $y$ is the predicted output, $\beta_0$ is the bias or intercept term, and $\beta_1$ is the coefficient for the single input value $(x)$. Each column in your input data has an associated $\beta$ coefficient (a constant real value) that must be learned from the training data.
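For a concrete, purely illustrative example, suppose the fitted values were $\beta_0 = -1.0$ and $\beta_1 = 0.8$ (these numbers are made up, not estimated from any data):

import numpy as np

beta_0, beta_1 = -1.0, 0.8          # hypothetical intercept and coefficient
x = 2.5                             # a single input value

# Predicted probability that y == 1, following the equation above
y_hat = np.exp(beta_0 + beta_1 * x) / (1 + np.exp(beta_0 + beta_1 * x))
print(y_hat)                        # ~0.73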

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns 
In [3]:
sns.set(style="darkgrid", color_codes=True)
pd.set_option('display.max_columns', None)

Here we define a sigmoid function and plot it for a range of values between $-10$ and $10$.

In [4]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
In [5]:
x = np.arange(-10,11,1)
y = sigmoid(x)

plt.plot(x,y)
plt.xlabel('X')
plt.ylabel('Sigmoid(X)');

Logistic Regression Example

Data Set

Quarterly US macro data from $1959 – 2009$

In [6]:
data = pd.DataFrame(sm.datasets.macrodata.load().data)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year        203 non-null float64
quarter     203 non-null float64
realgdp     203 non-null float64
realcons    203 non-null float64
realinv     203 non-null float64
realgovt    203 non-null float64
realdpi     203 non-null float64
cpi         203 non-null float64
m1          203 non-null float64
tbilrate    203 non-null float64
unemp       203 non-null float64
pop         203 non-null float64
infl        203 non-null float64
realint     203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
In [7]:
data.head()
Out[7]:
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19

Data Prep

To obtain a binary target variable, we compute the $20$-quarter rolling average of the annual growth rate of quarterly real GDP. We then assign $1$ if current growth exceeds the moving average and $0$ otherwise. Finally, we shift the indicator variable so that next quarter's outcome is aligned with the current quarter's features.

In [8]:
data['growth_rate'] = data.realgdp.pct_change(4)
data['target'] = (data.growth_rate > data.growth_rate.rolling(20).mean()).astype(int).shift(-1)
data.quarter = data.quarter.astype(int)
In [9]:
data.tail()
Out[9]:
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint growth_rate target
198 2008.0 3 13324.600 9267.7 1990.693 991.551 9838.3 216.889 1474.7 1.17 6.0 305.270 -3.16 4.33 0.000262 0.0
199 2008.0 4 13141.920 9195.3 1857.661 1007.273 9920.4 212.174 1576.5 0.12 6.9 305.952 -8.79 8.91 -0.018619 0.0
200 2009.0 1 12925.410 9209.2 1558.494 996.287 9926.4 212.671 1592.8 0.22 8.1 306.547 0.94 -0.71 -0.033026 0.0
201 2009.0 2 12901.504 9189.0 1456.678 1023.528 10077.5 214.469 1653.6 0.18 9.2 307.226 3.37 -3.19 -0.038297 0.0
202 2009.0 3 12990.341 9256.0 1486.398 1044.088 10040.6 216.385 1673.9 0.12 9.6 308.013 3.56 -3.44 -0.025086 NaN
In [10]:
pct_cols = ['realcons', 'realinv', 'realgovt', 'realdpi', 'm1', 'realgdp', 'pop', 'cpi', 'tbilrate', 'unemp']
drop_cols = ['year', 'growth_rate']

data.loc[:, pct_cols] = data.loc[:, pct_cols].pct_change(4)
In [11]:
data = pd.get_dummies(data.drop(drop_cols, axis=1), columns=['quarter']).dropna()
In [12]:
data.head()
Out[12]:
realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint target quarter_1 quarter_2 quarter_3 quarter_4
4 0.050676 0.036957 0.156237 -0.016692 0.036356 0.019324 -0.000716 0.241135 -0.103448 0.016151 2.31 1.19 0.0 1 0 0 0
5 0.020005 0.034147 -0.040877 -0.043426 0.024170 0.013722 -0.010586 -0.129870 0.019608 0.015976 0.14 2.55 0.0 0 1 0 0
6 0.022891 0.019409 0.024718 -0.033758 0.026821 0.013629 0.002847 -0.382199 0.056604 0.016070 2.70 -0.34 0.0 0 0 1 0
7 0.006252 0.019673 -0.132257 -0.015738 0.018278 0.016003 0.007857 -0.471132 0.125000 0.016172 1.21 1.08 0.0 0 0 0 1
8 -0.009985 0.009715 -0.196903 0.029544 0.014830 0.009140 0.017908 -0.322857 0.307692 0.016583 -0.40 2.77 0.0 1 0 0 0
In [13]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 198 entries, 4 to 201
Data columns (total 17 columns):
realgdp      198 non-null float64
realcons     198 non-null float64
realinv      198 non-null float64
realgovt     198 non-null float64
realdpi      198 non-null float64
cpi          198 non-null float64
m1           198 non-null float64
tbilrate     198 non-null float64
unemp        198 non-null float64
pop          198 non-null float64
infl         198 non-null float64
realint      198 non-null float64
target       198 non-null float64
quarter_1    198 non-null uint8
quarter_2    198 non-null uint8
quarter_3    198 non-null uint8
quarter_4    198 non-null uint8
dtypes: float64(13), uint8(4)
memory usage: 22.4 KB

EDA

In [14]:
total = len(data)
plt.figure(figsize=(7,5))
g = sns.countplot(x='target', data=data)
g.set_ylabel('Count', fontsize=14)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 1.5,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=14, fontweight='bold')
plt.margins(y=0.1)
plt.show()
In [15]:
plt.figure(figsize=(16, 8))
corr = data.corr()
mask = np.tri(*corr.shape).T 
sns.heatmap(corr.abs(), mask=mask, annot=True, fmt=".3f", annot_kws={"size": 10})
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show()
In [16]:
import matplotlib.cm as cm

n_fts = len(data.drop('target', axis=1).columns)
colors = cm.rainbow(np.linspace(0, 1, n_fts))

_ = data.drop('target', axis=1).corrwith(data.target).sort_values(ascending=True).plot(kind='barh', 
                                                                                     color=colors, figsize=(10, 4))
plt.title('Correlation to Target')
plt.show()
In [17]:
basetable0 = data.copy()

# Get all the variable names except "target"
variables = list(basetable0.columns)
variables.remove("target")

# Loop through all the variables and discretize in 5 bins if there are more than 10 distinct values
for variable in variables:
    if len(basetable0.groupby(variable)) > 10:
        new_variable = "disc_" + variable
        basetable0[new_variable] = pd.qcut(basetable0[variable].round(4), 5)
        
# Print the columns in the new basetable
basetable0 = basetable0.drop(variables, axis=1)
print('\n', basetable0.columns)
 Index(['target', 'disc_realgdp', 'disc_realcons', 'disc_realinv',
       'disc_realgovt', 'disc_realdpi', 'disc_cpi', 'disc_m1', 'disc_tbilrate',
       'disc_unemp', 'disc_pop', 'disc_infl', 'disc_realint'],
      dtype='object')
In [18]:
# Function that creates predictor insight graph table
def create_pig_table(basetable, target, variable):

    # Create groups for each variable
    groups = basetable[[target,variable]].groupby(variable)
    
    # Calculate size and target incidence for each group
    pig_table = groups[target].agg(Incidence = np.mean, Size = np.size).reset_index()

    # Return the predictor insight graph table
    return pig_table
In [19]:
# Variables you want to make predictor insight graph tables for
variables = list(basetable0.columns)
variables.remove("target")

# Create an empty dictionary
pig_tables = {}

# Loop through the variables
for variable in variables: 

    # Create a predictor insight graph table
    pig_table = create_pig_table(basetable0, "target", variable)

    # Add the table to the dictionary
    pig_tables[variable] = pig_table

# Print the predictor insight graph table of the variable "disc_realgdp"
print(pig_tables["disc_realgdp"])
        disc_realgdp  Incidence  Size
0  (-0.0393, 0.0153]   0.050000  40.0
1    (0.0153, 0.029]   0.179487  39.0
2    (0.029, 0.0406]   0.675000  40.0
3   (0.0406, 0.0497]   0.666667  39.0
4   (0.0497, 0.0851]   0.700000  40.0
In [20]:
# The function to plot a predictor insight graph
def plot_pig(pig_table, variable):
    
    plt.figure(figsize=(10, 4))
    # Plot formatting
    plt.ylabel("Size", rotation=0, rotation_mode="anchor", ha="right")
    
    # Plot the bars with sizes 
    pig_table["Size"].plot(kind="bar", width=0.5, color="pink", edgecolor="none", alpha=0.3) 
    
    # Plot the incidence line on secondary axis
    pig_table["Incidence"].plot(secondary_y=True)
    
    # Plot formatting
    plt.xticks(np.arange(len(pig_table)), pig_table[variable])
    plt.xlim([-0.5, len(pig_table) - 0.5])
    plt.ylabel("Incidence", rotation=0, rotation_mode="anchor", ha="left")
    plt.title(f'{variable}')
    
    # Show the graph
    plt.show()
In [21]:
# Loop through the variables
for variable in variables: 
    
    # Create the predictor insight graph table
    pig_table = create_pig_table(basetable0, "target", variable)
    
    # Plot the predictor insight graph
    plot_pig(pig_table, variable)

Fit Logistic Regression

We add an intercept and, with the quarter values already converted to dummy variables, train the logistic regression model as follows:

In [22]:
model = sm.Logit(data.target, sm.add_constant(data.drop('target', axis=1)))
result = model.fit()
result.summary()
Optimization terminated successfully.
         Current function value: 0.335442
         Iterations 9
Out[22]:
Logit Regression Results
Dep. Variable: target No. Observations: 198
Model: Logit Df Residuals: 182
Method: MLE Df Model: 15
Date: Thu, 01 Oct 2020 Pseudo R-squ.: 0.5132
Time: 01:24:57 Log-Likelihood: -66.417
converged: True LL-Null: -136.42
Covariance Type: nonrobust LLR p-value: 2.288e-22
coef std err z P>|z| [0.025 0.975]
const 1.5416 6.83e+06 2.26e-07 1.000 -1.34e+07 1.34e+07
realgdp -84.1671 44.337 -1.898 0.058 -171.066 2.732
realcons 173.7999 39.838 4.363 0.000 95.719 251.880
realinv 35.2613 7.280 4.844 0.000 20.993 49.529
realgovt -11.9359 6.624 -1.802 0.072 -24.920 1.048
realdpi -42.5489 20.847 -2.041 0.041 -83.407 -1.690
cpi -0.8881 27.725 -0.032 0.974 -55.228 53.451
m1 7.7942 5.461 1.427 0.154 -2.910 18.498
tbilrate 0.0715 0.786 0.091 0.927 -1.469 1.612
unemp 4.4700 3.205 1.395 0.163 -1.813 10.753
pop -496.9396 178.376 -2.786 0.005 -846.550 -147.329
infl -0.1254 0.249 -0.503 0.615 -0.614 0.364
realint -0.0890 0.144 -0.619 0.536 -0.371 0.193
quarter_1 0.5723 6.83e+06 8.38e-08 1.000 -1.34e+07 1.34e+07
quarter_2 0.4601 6.83e+06 6.73e-08 1.000 -1.34e+07 1.34e+07
quarter_3 0.0408 6.83e+06 5.98e-09 1.000 -1.34e+07 1.34e+07
quarter_4 0.4683 6.83e+06 6.86e-08 1.000 -1.34e+07 1.34e+07

The summary shows that the model was fit on $198$ observations with $16$ variables (including the intercept) using maximum likelihood, and reports the maximized value of the log-likelihood function, $-66.4$.

  • likelihood function: The likelihood function (often simply called the likelihood) measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters. It is formed from the joint probability distribution of the sample, but viewed and used as a function of the parameters only, thus treating the random variables as fixed at the observed values.
  • Maximizing the likelihood identifies the combination of parameter values that makes the observed sample most probable. The procedure for obtaining these arguments of the maximum of the likelihood function is known as maximum likelihood estimation, which for computational convenience is usually carried out on the natural logarithm of the likelihood, known as the log-likelihood function.
  • maximum likelihood estimation: Maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data are most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.
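For logistic regression in particular, with predicted probabilities $p_i$ given by the sigmoid of the linear combination of features, the log-likelihood being maximized is the standard Bernoulli log-likelihood:

$$ \ell(\beta) = \sum_{i=1}^{n} \left[ y_{i} \log p_{i} + (1 - y_{i}) \log(1 - p_{i}) \right], \qquad p_{i} = \frac{1}{1 + e^{-x_{i}^{T}\beta}} $$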
In [23]:
# Image of model summary
plt.rc('figure', figsize=(12, 7))
plt.text(0.01, 0.05, str(result.summary()), {'fontsize': 14}, fontproperties = 'monospace')
plt.axis('off')
plt.tight_layout()
plt.subplots_adjust(left=0.2, right=0.8, top=0.8, bottom=0.1)
plt.savefig('logistic_example.png', bbox_inches='tight', dpi=300);
In [24]:
# Store p-values in dictionary
pvals = result.pvalues[1:].to_dict()
pvals
Out[24]:
{'realgdp': 0.05765035302701731,
 'realcons': 1.2846790085441199e-05,
 'realinv': 1.2739780714173386e-06,
 'realgovt': 0.071578784150557,
 'realdpi': 0.041245807979824425,
 'cpi': 0.9744466146149555,
 'm1': 0.15354027960377567,
 'tbilrate': 0.9274729258852897,
 'unemp': 0.16316542882812857,
 'pop': 0.005337771929708511,
 'infl': 0.6151894399881822,
 'realint': 0.5358600251228749,
 'quarter_1': 0.9999999331604994,
 'quarter_2': 0.9999999462713322,
 'quarter_3': 0.9999999952311711,
 'quarter_4': 0.9999999453030424}

The LL-Null value of $-136.42$ is the maximized log-likelihood when only an intercept is included. It forms the basis for the pseudo-$R^2$ statistic and the log-likelihood ratio (LLR) test. The pseudo-$R^2$ statistic is a substitute for the familiar $R^2$ available under least squares. It is computed from the ratio of the maximized log-likelihoods of the null model $m_0$ and the full model $m_1$.

The values range from $0$ (when the model does not improve on the null likelihood) to $1$ (when the model fits perfectly and the log-likelihood is maximized at $0$). Consequently, higher values indicate a better fit.
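Using McFadden's definition, the pseudo-$R^2$ reported in the summary can be reproduced from the two log-likelihood values above:

$$ \text{pseudo-}R^{2} = 1 - \frac{\log \hat{L}(m_{1})}{\log \hat{L}(m_{0})} = 1 - \frac{-66.417}{-136.42} \approx 0.513 $$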

Scikit Learn Example With Regularization

Regularization can be used to train models that generalize better on unseen data by preventing the algorithm from overfitting the training dataset; it does so by penalizing the model's coefficients.


Types of weight regularization:

  • l1 (Lasso): sum of the absolute values of the weights $$ \lambda \sum_{i=1}^{k} |w_{i}| $$
  • l2 (Ridge): sum of the squared weights $$ \lambda \sum_{i=1}^{k} w_{i}^{2} $$
  • l1 + l2 (Elastic Net): a weighted combination of the absolute and the squared weights, e.g. for a linear model
$$ \frac{\sum_{i=1}^{n}(y_{i} - x_{i}^{T} \hat{\beta})^{2}}{2n} + \lambda \left( \frac{1 - a}{2} \sum_{j=1}^{m} \hat{\beta}_{j}^{2} + a \sum_{j=1}^{m} |\hat{\beta}_{j}| \right) $$
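In scikit-learn's LogisticRegression, these penalties are selected via the penalty argument; the overall strength is set by C, the inverse of $\lambda$ (so smaller C means stronger regularization), and for the elastic net the mixing weight is l1_ratio. A minimal sketch with arbitrary parameter values:

from sklearn.linear_model import LogisticRegression

# l1 needs a solver that supports it, e.g. 'liblinear' or 'saga'
lasso_lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')

# l2 is the default penalty
ridge_lr = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs')

# elastic net requires the 'saga' solver; l1_ratio mixes the l1 and l2 terms
enet_lr = LogisticRegression(penalty='elasticnet', C=1.0, l1_ratio=0.5, solver='saga')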

Logistic regression with different types of regularization

Here, we fit a logistic regression model with each type of regularization, as well as without regularization, and evaluate each with 3-fold time-series cross-validation.

In [25]:
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split, TimeSeriesSplit
from sklearn.metrics import roc_auc_score

#cv = StratifiedKFold(3)
cv = TimeSeriesSplit(n_splits=3)

X, y = data.drop('target', axis=1), data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)

regs = ['l1', 'l2', 'none', 'elasticnet']

for reg in regs:
    
    if reg == 'l1':
        lr = LogisticRegression(penalty=reg, solver='liblinear', random_state=1)
        print(f'{reg} cv score:', np.mean(cross_val_score(lr, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)))
        
    elif reg == 'elasticnet':
        lr = LogisticRegression(penalty=reg, solver='saga', l1_ratio=0.5, random_state=2)
        print(f'\n{reg} cv score:', np.mean(cross_val_score(lr, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)))
    
    else:
        lr = LogisticRegression(penalty=reg, solver='lbfgs', random_state=3)
        print(f'\n{reg} cv score:', np.mean(cross_val_score(lr, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)))
l1 cv score: 0.7231865477517653

l2 cv score: 0.7009866220735786

none cv score: 0.7445484949832776

elasticnet cv score: 0.7511947231512449

Logistic Regression Regularization Strength

To control how strongly we penalize the coefficients of a logistic regression in scikit-learn, we adjust the $C$ parameter. $C$ is the inverse of the regularization strength: smaller values of $C$ impose a stronger penalty and guard against overfitting.
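As a quick illustration on the data prepared above (the specific C values here are arbitrary), smaller C shrinks the fitted coefficients toward zero:

import numpy as np
from sklearn.linear_model import LogisticRegression

for C in (100.0, 1.0, 0.01):   # weak -> strong regularization
    lr = LogisticRegression(penalty='l2', C=C, solver='lbfgs', max_iter=1000)
    lr.fit(X_train, y_train)
    print(C, np.abs(lr.coef_).sum())   # coefficient magnitudes shrink as C decreases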


Here we evaluate $21$ values between $0$ and $1$ ($20$ evenly spaced points plus $0.5$), passing each one as the elastic net mixing parameter l1_ratio while keeping the default regularization strength. We then use the roc_auc_score to compare predictive accuracy across these settings as follows:

In [26]:
a = np.linspace(0, 1, 20)
b = [0.5]
Cs = np.sort(np.concatenate((a, b)))

enet_result, enet_coeffs = pd.DataFrame(), pd.DataFrame()
for i, C in enumerate(Cs):
    print(C)
    coeffs, test_results = [], []
    lr_enet = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=C, random_state=22, n_jobs=-1)
    for train, test in cv.split(X.values, y.values):

        X_train0 = X.iloc[train]
        y_train0 = y.iloc[train]
        lr_enet.fit(X_train0, y_train0)
        coeffs.append(np.squeeze(lr_enet.coef_))

        X_test0 = X.iloc[test]
        y_test0 = y.iloc[test]
        y_prob0 = lr_enet.predict_proba(X_test0)[:,1]

        auc = roc_auc_score(y_test0, y_prob0)
        
        test_results.append([auc, C])
        
    test_results = pd.DataFrame(test_results, columns=['auc', 'C'])
    enet_result = enet_result.append(test_results)
    enet_coeffs[C] = np.mean(coeffs, axis=0)
0.0
0.05263157894736842
0.10526315789473684
0.15789473684210525
0.21052631578947367
0.2631578947368421
0.3157894736842105
0.3684210526315789
0.42105263157894735
0.47368421052631576
0.5
0.5263157894736842
0.5789473684210527
0.631578947368421
0.6842105263157894
0.7368421052631579
0.7894736842105263
0.8421052631578947
0.894736842105263
0.9473684210526315
1.0
In [27]:
enet_result.describe()
Out[27]:
auc C
count 63.000000 63.000000
mean 0.719001 0.500000
std 0.156965 0.298553
min 0.260870 0.000000
25% 0.658027 0.263158
50% 0.676667 0.500000
75% 0.876254 0.736842
max 0.939799 1.000000

We can again plot the AUC for the range of hyperparameter values alongside the coefficient path, which shows how predictive accuracy changes as the coefficients are shrunk, with the best value of the parameter marked:

In [28]:
fig, axes = plt.subplots(ncols=2, sharex=True, figsize=(12,6))

best_C = enet_result.groupby('C').auc.mean().idxmax()
best_accuracy = enet_result.groupby('C').auc.mean().max()
mean_auc = enet_result.groupby('C').auc.mean().median()

enet_result.groupby('C')['auc'].mean().plot(logx=True, title='AUC Scores', ax=axes[0],
                                                label=f'Best AUC = {round(best_accuracy,3)}')
axes[0].axhline(mean_auc, color='r', label=f'Mean AUC = {mean_auc}', ls='--')
axes[0].axvline(x=best_C, c='darkgrey', ls='--', label=f'Best C = {best_C}')
axes[0].set_xlabel('Regularization')
axes[0].set_ylabel('AUC')
axes[0].legend(loc='best')

enet_coeffs.T.plot(legend=False, logx=True, title='Elasticnet Path', ax=axes[1])
axes[1].set_xlabel('Regularization')
axes[1].set_ylabel('Coefficients')
axes[1].axvline(x=best_C, c='darkgrey', ls='--')
fig.tight_layout();

Feature Selection

The forward stepwise variable selection procedure

  • Empty set
  • Find best variable $v_1$
  • Find best variable $v_2$ in combination with $v_1$
  • Find best variable $v_3$ in combination with $v_1 , v_2$

(Until all variables are added or until a predefined number of variables is added)

Implementation of the forward stepwise procedure

  • Function auc that calculates the AUC for a given set of variables
  • Function next_best that returns the next best variable in combination with the current variables
  • Loop until the desired number of variables is reached
In [29]:
def auc(variables, target, df, multiple=True):
    
    if multiple:
        X0 = X[variables]
        y0 = y
    else:
        X0 = X[[variables]]
        y0 = y
        
    logreg = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=best_C, 
                                random_state=22, n_jobs=-1)
    
    score = np.mean(cross_val_score(logreg, X0, y0, scoring='roc_auc', cv=cv, n_jobs=-1))
    return(score)

def next_best(current_variables, candidate_variables, target, df):
    best_auc = -1
    best_variable = None
    
    # Calculate the auc score of adding v to the current variables
    for v in candidate_variables:
        auc_v = auc(current_variables + [v], target, df, multiple=True)
        
        # Update best_auc and best_variable if adding v led to a better auc score
        if auc_v >= best_auc:
            best_auc = auc_v
            best_variable = v
            
    return best_variable

Finding the order of variables

The forward stepwise variable selection procedure starts with an empty set of variables, and adds predictors one by one. In each step, the predictor that has the highest AUC in combination with the current variables is selected.

In [30]:
# Find the candidate variables
candidate_variables = list(data.drop('target', axis=1).columns.values)

# Initialize the current variables
current_variables = []

# The forward stepwise variable selection procedure
number_iterations = 16
for i in range(0, number_iterations):
    next_variable = next_best(current_variables, candidate_variables, ["target"], data)
    current_variables = current_variables + [next_variable]
    candidate_variables.remove(next_variable)
    print("Variable added in step " + str(i+1)  + " is " + next_variable + ".")

print('\n', current_variables)
Variable added in step 1 is realinv.
Variable added in step 2 is quarter_3.
Variable added in step 3 is realcons.
Variable added in step 4 is pop.
Variable added in step 5 is cpi.
Variable added in step 6 is realdpi.
Variable added in step 7 is realgdp.
Variable added in step 8 is realgovt.
Variable added in step 9 is m1.
Variable added in step 10 is realint.
Variable added in step 11 is quarter_1.
Variable added in step 12 is quarter_2.
Variable added in step 13 is quarter_4.
Variable added in step 14 is tbilrate.
Variable added in step 15 is unemp.
Variable added in step 16 is infl.

 ['realinv', 'quarter_3', 'realcons', 'pop', 'cpi', 'realdpi', 'realgdp', 'realgovt', 'm1', 'realint', 'quarter_1', 'quarter_2', 'quarter_4', 'tbilrate', 'unemp', 'infl']

Evaluating a model with 3-fold CV

In [31]:
# Keep track of AUC values
auc_values_cv = []

# Add variables one by one
variables_evaluate = []

# Iterate over the variables in current_variables, ordered according to the forward stepwise procedure
for v in current_variables:

    # Add the variable
    variables_evaluate.append(v)
    
    # Calculate the AUC of this set of variables
    auc_test = auc(variables_evaluate, 'target', data, multiple=True)
    
    # Append the values to the lists
    auc_values_cv.append(auc_test)

Finding the optimal set of features

The forward stepwise variable selection procedure provides an order in which variables are optimally added to the predictor set. To decide where to cut off the variables, we plot the 3-fold CV AUC curve, which shows the AUC using the first, first two, first three, … variables in the model.

In [32]:
res = pd.DataFrame(dict(variables=current_variables, auc=auc_values_cv))
x = np.array(range(0,len(auc_values_cv)))
y_cv = np.array(auc_values_cv)
plt.xticks(x, current_variables, rotation = 90)
plt.plot(x,y_cv, label='3-fold CV')
plt.axvline(res.auc.idxmax(), linestyle='dashed', color='r', lw=1.5)
plt.ylim((0.71, 0.99))
plt.title(f'Best AUC = {round(res.auc.max(),3)}')
plt.legend()
plt.show()

best_vars = list(res[:res.auc.idxmax()+1].variables.values)
print(f'Best set of variables: \n{best_vars}')
Best set of variables: 
['realinv', 'quarter_3', 'realcons']

Explore logistic regression coefficients

The coefficients can be interpreted as the change in the log odds of the target variable associated with a $1$-unit increase in the input feature value.

For example:

If the input feature is realinv (here, its annual growth rate), then a one-unit increase in realinv changes the log odds by an amount equal to its coefficient.

  • Exponentiating a coefficient gives the change in the actual odds $O$ associated with a $1$-unit increase in the feature value

  • Odds: the probability $p$ of an event occurring divided by the probability of the event not occurring, $\frac{p}{1-p}$

  • We can go backwards to get the probability $p$ by calculating $p = \frac{O}{1+O}$

Note that:

  • Probability ranges from $0$ to $1$
  • Odds range from $0$ to $\infty$
  • Log Odds range from $-\infty$ to $\infty$

This is why the log odds are used: they avoid modeling a variable with a restricted range, such as a probability.

We will now explore the coefficients of the logistic regression to understand what is driving our target variable up or down. We will extract the logistic regression coefficients from our fitted model and calculate their exponents to make them more interpretable.

In [33]:
logreg = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=best_C, 
                                random_state=22, n_jobs=-1)

logreg.fit(X_train[best_vars], y_train)

# Combine feature names and coefficients into pandas DataFrame
feature_names = pd.DataFrame(X[best_vars].columns, columns=['Feature'])
log_coef = pd.DataFrame(np.transpose(logreg.coef_), columns=['Coefficient'])
coefficients = pd.concat([feature_names, log_coef], axis = 1)

coefficients['P-value'] = coefficients.Feature.map(pvals).values

# Calculate exponent of the logistic regression coefficients
coefficients['Odds_ratio'] = np.exp(coefficients['Coefficient']).round(3)

# Calculate the probability for each coefficient
coefficients['Probability'] = coefficients['Odds_ratio'] / (1 + coefficients['Odds_ratio'])

coefficients['Pct_odds_increase'] = (coefficients['Odds_ratio'] - 1) * 100

# Remove coefficients that are equal to zero
coefficients = coefficients[coefficients['Coefficient']!=0]

# Print the values sorted by the exponent coefficient
pd.set_option('display.float_format', lambda x: '%.3f' % x)
coefficients = coefficients.sort_values(by=['Odds_ratio'], ascending=False)
coefficients
Out[33]:
Feature Coefficient P-value Odds_ratio Probability Pct_odds_increase
0 realinv 3.657 0.000 38.756 0.975 3775.600
2 realcons 0.237 0.000 1.267 0.559 26.700
1 quarter_3 -0.164 1.000 0.849 0.459 -15.100
In [34]:
# Filter only statistically significant coefficients
mask = coefficients['P-value'] <= 0.05
coefficients[mask]
Out[34]:
Feature Coefficient P-value Odds_ratio Probability Pct_odds_increase
0 realinv 3.657 0.000 38.756 0.975 3775.600
2 realcons 0.237 0.000 1.267 0.559 26.700
  • Odds ratios $< 1$ decrease the odds; values $> 1$ increase the odds

  • The coefficient for realinv is $3.657$, which is interpreted as the expected change in the log odds for a one-unit increase in realinv.

  • The odds ratio can be calculated by exponentiating this value to get $38.756$, which means we expect about a $3{,}776\%$ increase in the odds of quarterly real GDP growth exceeding its $20$-quarter rolling average for a one-unit increase in realinv (a quick numeric check follows)
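A short numeric check of this interpretation, using the realinv coefficient from the table above (values rounded):

import numpy as np

coef = 3.657                                   # realinv coefficient
odds_ratio = np.exp(coef)                      # ~38.76
probability = odds_ratio / (1 + odds_ratio)    # ~0.975
pct_odds_increase = (odds_ratio - 1) * 100     # ~3776%

print(odds_ratio, probability, pct_odds_increase)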

Plot Receiver Operating Characteristic (ROC) with cross validation

In [35]:
from sklearn.metrics import auc
from sklearn.metrics import plot_roc_curve

vars_ = list(coefficients[mask].Feature.values)
n_samples, n_features = X[vars_].shape

classifier = LogisticRegression(penalty='l2', solver='lbfgs', C=best_C, n_jobs=-1, random_state=122)

X0, y0 = X[vars_].values, y.values

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

fig, ax = plt.subplots(figsize=(12, 7))
for i, (train, test) in enumerate(cv.split(X0, y0)):
    
    classifier.fit(X0[train], y0[train])
    
    viz = plot_roc_curve(classifier, X0[test], y0[test],
                         name='ROC fold {}'.format(i),
                         alpha=0.5, lw=1, ax=ax)
    
    interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)
    aucs.append(viz.roc_auc)

ax.plot([0, 1], [0, 1], linestyle='--', lw=1.5, color='k',
        label='Chance', alpha=.8)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
ax.plot(mean_fpr, mean_tpr, color='red', label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc), lw=1.5)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
ax.fill_between(mean_fpr, tprs_lower, tprs_upper, color='white', label=r'$\pm$ 1 std. dev.')

ax.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05], title="Receiver operating characteristic")
ax.legend(loc="lower right")
plt.show()