Customer Segmentation in Python

Customer segmentation is a powerful analytics technique for grouping customers so that businesses can tailor their product offerings and marketing strategies to each group.

For example, we can group customers by the month of their first purchase, segment them by their recency, frequency and monetary (RFM) values, or use a clustering model such as k-means to identify groups of customers with similar purchasing behavior.

Introduction to cohort analysis

Cohort analysis is a subset of behavioral analytics that takes the data from a given data set (e.g., an e-commerce platform, web application, or online game) and, rather than looking at all users as one unit, breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time span. Cohort analysis allows a company to “see patterns clearly across the life-cycle of a customer (or user), rather than slicing across all customers blindly without accounting for the natural cycle that a customer undergoes.” By seeing these patterns over time, a company can adapt and tailor its service to those specific cohorts. While cohort analysis is sometimes associated with a cohort study, they are different and should not be viewed as one and the same.

  • Mutually exclusive segments - cohorts - which are then measured over time
  • Compare metrics across product lifecycle
  • Compare metrics across customer lifecycle

Types of cohorts

  • Time cohorts: Customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows customer behavior depending on when they started using the company's products or services. The time frame could be monthly, quarterly or even daily.
  • Behavior cohorts: Customers grouped by the type of product they purchased or the tier of service they signed up for in the past. Customers who sign up for different levels of service may have different needs; understanding those needs can help companies develop customized products or services for particular segments.
  • Size cohorts: Customers grouped by how much they spend with the company. Size can be based on the amount spent in some time period after acquisition, or on the product category the customer spent the most money on in a given period.

Elements of cohort analysis

  • Typically pivot table format
  • Assigned cohort in rows
  • Cohort Index in columns
  • Metrics (e.g. active customer counts) in the table cells - a toy sketch follows below
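
To make this layout concrete, here is a minimal, hand-made sketch of such a table (the numbers are invented for illustration, not taken from the retail data used below):

import pandas as pd

# Toy cohort table: acquisition cohorts in rows, cohort index (periods since
# acquisition) in columns, and a metric (active customer counts) in the cells
toy_cohort_counts = pd.DataFrame(
    {1: [100, 80, 90],
     2: [40, 30, None],
     3: [25, None, None]},
    index=pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01']))
toy_cohort_counts.index.name = 'CohortMonth'
toy_cohort_counts.columns.name = 'CohortIndex'
print(toy_cohort_counts)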

Online retail data

The dataset contains over $0.5$ million transactions from a UK-based online retail store. We will work with a randomly sampled $20\%$ subset of it.
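
For reference, a subset like this could be drawn from the full transaction file with pandas; the file name and random seed below are hypothetical, shown only to illustrate the sampling step:

import pandas as pd

# Draw a reproducible 20% random sample of the full transaction data
# ('full_online_retail.csv' is a hypothetical file name)
full = pd.read_csv('full_online_retail.csv')
full.sample(frac=0.2, random_state=42).to_csv('online.csv')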

In [1]:
import pandas as pd
import numpy as np 
import datetime as dt

online = pd.read_csv('online.csv', index_col='Unnamed: 0')
online.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70864 entries, 416792 to 312243
Data columns (total 8 columns):
InvoiceNo      70864 non-null int64
StockCode      70864 non-null object
Description    70864 non-null object
Quantity       70864 non-null int64
InvoiceDate    70864 non-null object
UnitPrice      70864 non-null float64
CustomerID     70864 non-null int64
Country        70864 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 4.9+ MB
In [2]:
online.InvoiceDate = pd.to_datetime(online.InvoiceDate)
online.dtypes
Out[2]:
InvoiceNo               int64
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID              int64
Country                object
dtype: object

Assign daily acquisition cohort

Defining a cohort is the first step of cohort analysis. We will now create daily cohorts based on the day each customer made their first transaction.

In [3]:
# Define a function that will parse the date
def get_day(x): 
    return dt.datetime(x.year, x.month, x.day) 
In [4]:
# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) 

# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay'] 

# Assign the minimum InvoiceDay value per customer as their CohortDay
online['CohortDay'] = grouping.transform('min')

# View the top 5 rows
online.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceDay CohortDay
416792 572558 22745 POPPY'S PLAYHOUSE BEDROOM 6 2011-10-25 08:26:00 2.10 14286 United Kingdom 2011-10-25 2011-04-11
482904 577485 23196 VINTAGE LEAF MAGNETIC NOTEPAD 1 2011-11-20 11:56:00 1.45 16360 United Kingdom 2011-11-20 2011-09-12
263743 560034 23299 FOOD COVER WITH BEADS SET 2 6 2011-07-14 13:35:00 3.75 13933 United Kingdom 2011-07-14 2011-07-14
495549 578307 72349B SET/6 PURPLE BUTTERFLY T-LIGHTS 1 2011-11-23 15:53:00 2.10 17290 United Kingdom 2011-11-23 2011-11-23
204384 554656 21756 BATH BUILDING BLOCK WORD 3 2011-05-25 13:36:00 5.95 17663 United Kingdom 2011-05-25 2011-02-25

Now each customer belongs to a daily acquisition cohort that we can use for further analysis!
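
As a side note, the same day-level truncation can be done without apply(), using the vectorized .dt accessor (assuming a reasonably recent pandas version); this is an equivalent alternative to the get_day() helper above, not a different calculation:

# Vectorized alternative: truncate each timestamp to day precision
online['InvoiceDay'] = online['InvoiceDate'].dt.floor('D')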

Calculate time offset in days - part 1

Calculating time offset for each transaction allows us to report the metrics for each cohort in a comparable fashion.

First, we will create $6$ variables that capture the integer values of the year, month and day for the invoice and cohort dates, using a get_date_int() helper function.

In [5]:
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day
In [6]:
# Get the integers for date parts from the InvoiceDay column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')

# Get the integers for date parts from the CohortDay column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')

Now we will use these integer values to calculate business metrics for our time cohorts!

Calculate time offset in days - part 2

Great work! Now we have $6$ different series with year, month and day values for the invoice and cohort dates - invoice_year, cohort_year, invoice_month, cohort_month, invoice_day, and cohort_day.

In this exercise we will calculate the difference between the invoice and cohort dates in years, months and days separately, and then combine them into a total difference in days. This will be our days offset, which we will use in the next exercise to visualize the customer count.

In [7]:
# Calculate difference in years
years_diff = invoice_year - cohort_year

# Calculate difference in months
months_diff = invoice_month - cohort_month

# Calculate difference in days
days_diff = invoice_day - cohort_day

# Combine the differences into an approximate total day offset (+1 so the acquisition day has index 1)
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1

online.head()
Out[7]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceDay CohortDay CohortIndex
416792 572558 22745 POPPY'S PLAYHOUSE BEDROOM 6 2011-10-25 08:26:00 2.10 14286 United Kingdom 2011-10-25 2011-04-11 195
482904 577485 23196 VINTAGE LEAF MAGNETIC NOTEPAD 1 2011-11-20 11:56:00 1.45 16360 United Kingdom 2011-11-20 2011-09-12 69
263743 560034 23299 FOOD COVER WITH BEADS SET 2 6 2011-07-14 13:35:00 3.75 13933 United Kingdom 2011-07-14 2011-07-14 1
495549 578307 72349B SET/6 PURPLE BUTTERFLY T-LIGHTS 1 2011-11-23 15:53:00 2.10 17290 United Kingdom 2011-11-23 2011-11-23 1
204384 554656 21756 BATH BUILDING BLOCK WORD 3 2011-05-25 13:36:00 5.95 17663 United Kingdom 2011-05-25 2011-02-25 91

We have successfully assigned the daily time offset to each transaction and can use it for running daily cohort analysis!
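
Note that the $365$/$30$ multipliers used above are only an approximation of calendar arithmetic (every month contributes $30$ days to the offset regardless of its actual length). Since InvoiceDay and CohortDay are already datetime columns, an exact offset could also be computed directly from their difference; a sketch, adding a hypothetical CohortIndex_exact column rather than the CohortIndex used in the output above:

# Exact day offset via timedelta arithmetic (+1 so the acquisition day has index 1)
online['CohortIndex_exact'] = (online['InvoiceDay'] - online['CohortDay']).dt.days + 1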

Cohort metrics

  • Retention: how many customers from each cohort have returned in the subsequent months? This gives you the percentage of active customers compared to the total number of customers in the cohort.

Customer retention: cohort_counts table

  • How many customers were originally in each cohort?
  • How many of them were active in the following months?

Calculate retention rate from scratch

We have seen how retention and average-quantity metrics tables for monthly acquisition cohorts are structured. Now it's time to build the retention metrics from scratch.

In [8]:
# Define a function that will parse the month from the date 
def get_month(x): 
    return dt.datetime(x.year, x.month, 1)
In [9]:
# Calculate the month offset (+1 so the acquisition month has index 1)
online['CohortIndex_M'] = years_diff * 12 + months_diff + 1

# Create InvoiceMonth column
online['InvoiceMonth'] = online['InvoiceDate'].apply(get_month)

# Group by CustomerID and select the InvoiceMonth value
grouping = online.groupby('CustomerID')['InvoiceMonth']

# Assign the minimum InvoiceMonth value per customer as their CohortMonth
online['CohortMonth'] = grouping.transform('min')

# View the top 5 rows
online.head()
Out[9]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceDay CohortDay CohortIndex CohortIndex_M InvoiceMonth CohortMonth
416792 572558 22745 POPPY'S PLAYHOUSE BEDROOM 6 2011-10-25 08:26:00 2.10 14286 United Kingdom 2011-10-25 2011-04-11 195 7 2011-10-01 2011-04-01
482904 577485 23196 VINTAGE LEAF MAGNETIC NOTEPAD 1 2011-11-20 11:56:00 1.45 16360 United Kingdom 2011-11-20 2011-09-12 69 3 2011-11-01 2011-09-01
263743 560034 23299 FOOD COVER WITH BEADS SET 2 6 2011-07-14 13:35:00 3.75 13933 United Kingdom 2011-07-14 2011-07-14 1 1 2011-07-01 2011-07-01
495549 578307 72349B SET/6 PURPLE BUTTERFLY T-LIGHTS 1 2011-11-23 15:53:00 2.10 17290 United Kingdom 2011-11-23 2011-11-23 1 1 2011-11-01 2011-11-01
204384 554656 21756 BATH BUILDING BLOCK WORD 3 2011-05-25 13:36:00 5.95 17663 United Kingdom 2011-05-25 2011-02-25 91 4 2011-05-01 2011-02-01
In [10]:
# Group by CohortMonth and CohortIndex_M
grouping = online.groupby(['CohortMonth', 'CohortIndex_M'])

# Count the number of unique customers in each group
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique)

# Reset the index
cohort_data = cohort_data.reset_index()

# Create a pivot table with CohortMonth as rows, CohortIndex_M as columns, and unique customer counts as values
cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='CustomerID')

cohort_counts
Out[10]:
CohortIndex_M 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 716.0 246.0 221.0 251.0 245.0 285.0 249.0 236.0 240.0 265.0 254.0 348.0 172.0
2011-01-01 332.0 69.0 82.0 81.0 110.0 90.0 82.0 86.0 104.0 102.0 124.0 45.0 NaN
2011-02-01 316.0 58.0 57.0 83.0 85.0 74.0 80.0 83.0 86.0 95.0 28.0 NaN NaN
2011-03-01 388.0 63.0 100.0 76.0 83.0 67.0 98.0 85.0 107.0 38.0 NaN NaN NaN
2011-04-01 255.0 49.0 52.0 49.0 47.0 52.0 56.0 59.0 17.0 NaN NaN NaN NaN
2011-05-01 249.0 40.0 43.0 36.0 52.0 58.0 61.0 22.0 NaN NaN NaN NaN NaN
2011-06-01 207.0 33.0 26.0 41.0 49.0 62.0 19.0 NaN NaN NaN NaN NaN NaN
2011-07-01 173.0 28.0 31.0 38.0 44.0 17.0 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 139.0 30.0 28.0 35.0 14.0 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 279.0 56.0 78.0 34.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 318.0 67.0 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 291.0 32.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 38.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [11]:
# Select the first column (cohort index 1, i.e. the cohort sizes) and store it as cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]

# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)
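
The retention table itself is not printed above. To inspect it as percentages, or to visualize it the same way the other cohort metrics are visualized further below, something like the following sketch should work (seaborn and matplotlib are imported later in this notebook; the color map choice is arbitrary):

import matplotlib.pyplot as plt
import seaborn as sns

# Retention rates as percentages, one row per acquisition cohort
print((retention * 100).round(1))

# Heatmap of monthly retention rates
plt.figure(figsize=(12, 6))
plt.title('Retention by Monthly Cohorts')
sns.heatmap(retention, annot=True, fmt='.0%', cmap='BuGn')
plt.show()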

Calculate average price

We will now calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts.

In [12]:
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex_M']) 

# Calculate the average of the unit price 
cohort_data = grouping['UnitPrice'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot 
average_price = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='UnitPrice')
print(average_price.round(1))
CohortIndex_M   1    2    3    4    5    6    7    8    9    10   11   12   13
CohortMonth                                                                   
2010-12-01     3.0  3.0  3.0  2.8  2.7  6.9  2.8  3.0  2.7  2.7  3.0  2.8  2.6
2011-01-01     3.2  3.1  3.0  3.0  3.1  3.0  3.0  2.5  2.7  2.9  2.6  2.0  NaN
2011-02-01     3.1  4.0  3.3  2.9  3.3  2.9  2.8  2.7  2.9  2.7  3.1  NaN  NaN
2011-03-01     3.5  3.6  3.5  2.8  2.7  2.5  2.7  2.9  2.5  2.4  NaN  NaN  NaN
2011-04-01     3.3  4.4  3.4  2.6  2.8  2.8  2.8  2.6  2.6  NaN  NaN  NaN  NaN
2011-05-01     3.1  2.8  2.5  2.7  2.5  2.3  2.7  2.3  NaN  NaN  NaN  NaN  NaN
2011-06-01     2.8  2.4  2.7  3.1  2.5  2.4  2.5  NaN  NaN  NaN  NaN  NaN  NaN
2011-07-01     3.2  3.1  3.4  2.5  2.4  2.3  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-08-01     2.9  3.7  5.4  6.9  4.2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-09-01     2.9  3.1  3.0  2.6  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-10-01     2.9  2.7  2.5  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-11-01     2.5  2.1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-12-01     1.9  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Visualize average quantity metric

We are now going to visualize average quantity values in a heatmap.

In [13]:
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex_M']) 

# Calculate the average of the quantity  
cohort_data = grouping['Quantity'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot 
average_quantity = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='Quantity')
In [14]:
import matplotlib.pyplot as plt
import seaborn as sns

# Initialize a 12 by 6 inch plot figure
fig, ax = plt.subplots(figsize=(12, 6))

# Add a title
ax.set_title('Average Quantity by Monthly Cohorts')

# Create the heatmap
sns.heatmap(average_quantity, annot=True, fmt=".1f", annot_kws={"size": 12}, ax=ax)
# Work around a matplotlib version issue where the top and bottom heatmap rows can be cut off
b, t = ax.set_ylim() 
b += 0.5 
t -= 0.5 
ax.set_ylim(b, t) 
plt.show()

Recency, frequency, monetary (RFM) segmentation

What is RFM segmentation?

Behavioral customer segmentation based on three metrics:

  • Recency (R): time since last customer transaction
  • Frequency (F): number of purchases in the observed period
  • Monetary Value (M): total amount spent in the observed period

Grouping RFM values

The RFM values can be grouped in several ways:

  • Percentiles e.g. quantiles
  • Pareto $80/20$ cut
  • Custom - based on business knowledge

We are going to implement percentile-based grouping.

Short review of percentiles

Process of calculating percentiles (a toy example follows the steps):

  1. Sort customers based on the chosen metric
  2. Break customers into a pre-defined number of groups of equal size
  3. Assign a label to each group
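
As a toy illustration of these three steps with pandas (the spend values are invented for the example):

import pandas as pd

# Toy spend values for 9 customers
spend = pd.Series([5, 12, 18, 25, 40, 53, 70, 88, 95])

# qcut sorts the values, splits them into 3 equal-sized groups and
# assigns the labels 1 (lowest spend) to 3 (highest spend)
spend_group = pd.qcut(spend, q=3, labels=[1, 2, 3])
print(spend_group)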

Calculate RFM values

Calculate Recency, Frequency and Monetary values for the online dataset we have used before.

In [15]:
# Calculate the total sum 
online['TotalSum'] = online['Quantity'] * online['UnitPrice']

# Print min and max dates 
print('Min:{}; Max:{}'.format(min(online.InvoiceDate), max(online.InvoiceDate)))
Min:2010-12-01 08:26:00; Max:2011-12-09 12:49:00
In [16]:
# Create a hypothetical snapshot date, as if we were running the analysis the day after the last transaction
snapshot_date = max(online.InvoiceDate) + dt.timedelta(days=1)
print(snapshot_date)
2011-12-10 12:49:00
In [17]:
# Calculate Recency, Frequency and Monetary value for each customer 
datamart = online.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'})

# Rename the columns 
datamart.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalSum': 'MonetaryValue'}, inplace=True)
# Print top 5 rows
datamart.head()
Out[17]:
Recency Frequency MonetaryValue
CustomerID
12747 2 27 992.82
12748 1 967 7522.06
12749 4 37 813.45
12820 3 17 268.02
12822 71 9 146.15

Calculate 3 groups for recency and frequency

We will now group the customers into three separate groups based on Recency and Frequency.

We will use this result in the next exercise, where we will group customers based on MonetaryValue and finally calculate an RFM_Score.

In [18]:
# Create labels for Recency and Frequency
# (Recency labels are reversed: the most recently active customers get the highest R score)
r_labels = range(3, 0, -1); f_labels = range(1, 4)

# Assign these labels to three equal percentile groups 
r_groups = pd.qcut(datamart['Recency'], q=3, labels=r_labels)

# Assign these labels to three equal percentile groups 
f_groups = pd.qcut(datamart['Frequency'], q=3, labels=f_labels)

# Create new columns R and F
datamart = datamart.assign(R=r_groups.values, F=f_groups.values)
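
To verify that the reversed Recency labels behave as intended (more recent customers, i.e. smaller Recency values, receiving a higher R score), one could inspect the Recency range covered by each group, for example:

# Recency range (in days) covered by each R score
print(datamart.groupby('R')['Recency'].agg(['min', 'max', 'count']))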

Calculate RFM Score

We will now finish the job by assigning customers to three groups based on the MonetaryValue percentiles and then calculate an RFM_Score which is a sum of the $R$, $F$, and $M$ values.

In [19]:
# Create labels for MonetaryValue 
m_labels = range(1, 4)

# Assign these labels to three equal percentile groups
m_groups = pd.qcut(datamart['MonetaryValue'], q=3, labels=m_labels)

# Create new column M
datamart = datamart.assign(M=m_groups.values)

# Calculate RFM_Score
datamart['RFM_Score'] = datamart[['R','F','M']].sum(axis=1)

datamart.head()
Out[19]:
Recency Frequency MonetaryValue R F M RFM_Score
CustomerID
12747 2 27 992.82 3 3 3 9.0
12748 1 967 7522.06 3 3 3 9.0
12749 4 37 813.45 3 3 3 9.0
12820 3 17 268.02 3 3 3 9.0
12822 71 9 146.15 2 2 2 6.0

Creating custom segments

Here we will create a custom segmentation based on RFM_Score values. We will write a function that maps each RFM_Score to a segment label and then assign that label to each customer.

In [20]:
# Define rfm_level function
def rfm_level(df):
    if df['RFM_Score'] >= 9:
        return 'Top'
    elif ((df['RFM_Score'] >= 5) and (df['RFM_Score'] < 9)):
        return 'Middle'
    else:
        return 'Low'

# Create a new variable RFM_Level
datamart['RFM_Level'] = datamart.apply(rfm_level, axis=1)

# Print the header with the top 5 rows to the console.
datamart.head()
Out[20]:
Recency Frequency MonetaryValue R F M RFM_Score RFM_Level
CustomerID
12747 2 27 992.82 3 3 3 9.0 Top
12748 1 967 7522.06 3 3 3 9.0 Top
12749 4 37 813.45 3 3 3 9.0 Top
12820 3 17 268.02 3 3 3 9.0 Top
12822 71 9 146.15 2 2 2 6.0 Middle

Analyzing custom segments

As a final step, we will analyze average values of Recency, Frequency and MonetaryValue for the custom segments we've created.

In [21]:
# Calculate average values for each RFM_Level, and return the size of each segment 
rfm_level_agg = datamart.groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
  
    # Return the size of each segment
    'MonetaryValue': ['mean', 'count']}).round(1)

# Print the aggregated dataset
rfm_level_agg
Out[21]:
Recency Frequency MonetaryValue
mean mean mean count
RFM_Level
Low 186.3 3.3 50.8 1054
Middle 70.5 15.1 286.0 2081
Top 10.3 63.7 1342.1 566

Data pre-processing for k-means clustering

Advantages of k-means clustering

  • One of the most popular unsupervised learning methods
  • Simple and fast
  • Works well, given certain assumptions about the data

Key k-means assumptions

  • Symmetric distribution of variables (not skewed)
  • Variables with same average values
  • Variables with same variance

Variables on the same scale

  • K-means assumes equal mean
  • And equal variance
  • It's not the case with RFM data - a quick check follows below
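
A quick way to confirm this on our data is to compare the summary statistics of the raw RFM values (a sketch reusing the datamart table built in the RFM section above):

# Means, standard deviations and skewness differ a lot across the three metrics
print(datamart[['Recency', 'Frequency', 'MonetaryValue']].agg(['mean', 'std', 'skew']).round(1))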

Visualize RFM distributions

We will now explore the RFM distributions.

In [22]:
datamart_rfm = datamart[['Recency', 'Frequency', 'MonetaryValue']]

# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_rfm['Recency'])

# Plot frequency distribution
plt.subplot(3, 1, 2); sns.distplot(datamart_rfm['Frequency'])

# Plot monetary value distribution
plt.subplot(3, 1, 3); sns.distplot(datamart_rfm['MonetaryValue'])

# Show the plot
plt.tight_layout()
plt.show()
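
Note that sns.distplot has been deprecated in newer seaborn releases (0.11 and later). If it is unavailable, an equivalent set of plots can be drawn with histplot; a sketch under that assumption:

# Equivalent distribution plots with the newer seaborn API
fig, axes = plt.subplots(3, 1, figsize=(8, 9))
sns.histplot(datamart_rfm['Recency'], kde=True, ax=axes[0])
sns.histplot(datamart_rfm['Frequency'], kde=True, ax=axes[1])
sns.histplot(datamart_rfm['MonetaryValue'], kde=True, ax=axes[2])
plt.tight_layout()
plt.show()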

Pre-process RFM data

Since the variables are skewed and are on different scales, we will now un-skew and scale them.

In [23]:
from sklearn.preprocessing import StandardScaler

# Unskew the data
datamart_log = np.log(datamart_rfm)

# Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(datamart_log)

# Scale and center the data
datamart_normalized = scaler.transform(datamart_log)

# Create a pandas DataFrame
datamart_normalized = pd.DataFrame(data=datamart_normalized, index=datamart_rfm.index, 
                                   columns=datamart_rfm.columns)
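
One caveat: np.log is only defined for strictly positive values. The log transform runs without issue on this sample, but if the data contained zero or negative values (e.g. customers whose refunds outweigh their purchases, driving MonetaryValue below zero), they would need to be handled first. A hypothetical guard (the datamart_rfm_pos name is illustrative) could look like this:

# Count values that would break np.log
print((datamart_rfm <= 0).sum())

# One possible guard: keep only customers with strictly positive R, F and M values
datamart_rfm_pos = datamart_rfm[(datamart_rfm > 0).all(axis=1)]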

Visualize the normalized variables

Now we will plot the normalized and unskewed variables to see the difference in the distribution as well as the range of the values.

In [24]:
# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_normalized['Recency'])

# Plot frequency distribution
plt.subplot(3, 1, 2); sns.distplot(datamart_normalized['Frequency'])

# Plot monetary value distribution
plt.subplot(3, 1, 3); sns.distplot(datamart_normalized['MonetaryValue'])

# Show the plot
plt.tight_layout()
plt.show()

Run k-means

We will now build a $3$-cluster k-means model.

In [25]:
# Import KMeans 
from sklearn.cluster import KMeans

# Initialize KMeans
kmeans = KMeans(n_clusters=3, random_state=1) 

# Fit k-means clustering on the normalized data set
kmeans.fit(datamart_normalized)

# Extract cluster labels
cluster_labels = kmeans.labels_

Assign labels to raw data

We will now analyze the average RFM values of the three clusters we've created in the previous exercise.

In [26]:
# Create a DataFrame by adding a new cluster label column
datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)

# Group the data by cluster
grouped = datamart_rfm_k3.groupby(['Cluster'])

# Calculate average RFM values and segment sizes per cluster value
grouped.agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)
Out[26]:
Recency Frequency MonetaryValue
mean mean mean count
Cluster
0 13.8 52.7 1128.1 855
1 171.7 3.1 55.3 1224
2 78.3 13.6 231.9 1622

You can immediately see the differences in RFM values of these segments!

Choosing number of clusters

Methods

  • Visual methods - elbow criterion
  • Mathematical methods - silhouette coefficient
  • Experimentation and interpretation

Elbow criterion method

  • Plot the number of clusters against within-cluster sum-of-squared-errors (SSE) - sum of squared distances from every data point to their cluster center
  • Identify an "elbow" in the plot
  • Elbow - a point representing an "optimal" number of clusters

Experimental approach - analyze segments

  • Build clustering at and around elbow solution
  • Analyze their properties - average RFM values
  • Compare against each other and choose one which makes most business sense

Calculate sum of squared errors

In this exercise, we will calculate the sum of squared errors for different numbers of clusters, ranging from $1$ to $20$.

In [27]:
sse = dict()

# Fit KMeans and calculate SSE for each k
for k in range(1, 21):
  
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    
    # Fit KMeans on the normalized dataset
    kmeans.fit(datamart_normalized)
    
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_ 

Plot sum of squared errors

Now we will plot the sum of squared errors for each value of $k$ and identify if there is an elbow. This will guide us towards the recommended number of clusters to use.

In [28]:
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')

# Add X-axis label "k"
plt.xlabel('k')

# Add Y-axis label "SSE"
plt.ylabel('SSE')

# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

You can see the elbow is clearly around $2-3$ clusters!

Compute the mean silhouette coefficient of all samples

The Silhouette Coefficient is calculated using the mean intra-cluster distance $(a)$ and the mean nearest-cluster distance $(b)$ for each sample. The Silhouette Coefficient for a sample is

$$s = \frac{b - a}{max(a, b)}$$

To clarify, $b$ is the distance between a sample and the nearest cluster that the sample is not a part of, while $a$ is the mean distance between a sample and all other points in the same cluster. Note that the silhouette coefficient is only defined if the number of labels satisfies $2 \leq$ n_labels $\leq$ n_samples $- 1$.

The best value is $1$ and the worst value is $-1$. Values near $0$ indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

Now we will compute the silhouette scores for several values of $k$ and identify which value gives the highest average score. This will guide us towards the recommended number of clusters to use.
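
Before the full visualization below, note that the average silhouette score for a single solution can be computed in a couple of lines; a minimal sketch, reusing the normalized data and the $3$-cluster labels fitted earlier:

from sklearn.metrics import silhouette_score

# Average silhouette score for the 3-cluster k-means solution
score_k3 = silhouette_score(datamart_normalized, cluster_labels)
print(round(score_k3, 3))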

In [29]:
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

X = datamart_normalized.values

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
    
    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o', c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
For n_clusters = 2 The average silhouette_score is : 0.3898282199798433
For n_clusters = 3 The average silhouette_score is : 0.3067028493682284
For n_clusters = 4 The average silhouette_score is : 0.2957223549730578
For n_clusters = 5 The average silhouette_score is : 0.2811533972803104
For n_clusters = 6 The average silhouette_score is : 0.2783496283019491

Profile and interpret segments

Approaches to build customer personas

  • Summary statistics for each cluster e.g. average RFM values
  • Snake plots (from market research)
  • Relative importance of cluster attributes compared to population

Snake plots to understand and compare segments

  • Market research technique to compare different segments
  • Visual representation of each segment's attributes
  • Need to first normalize data (center & scale)
  • Plot each cluster's average normalized values of each attribute

Prepare data for the snake plot

Now we will prepare data for the snake plot. We will use the 3-cluster RFM segmentation solution we have built previously. We will transform the normalized RFM data into a long format by "melting" the metric columns into two columns - one for the name of the metric, and another for the actual numeric value.

In [30]:
datamart_normalized['Cluster'] = datamart_rfm_k3['Cluster']

# Melt the normalized dataset and reset the index
datamart_melt = pd.melt(datamart_normalized.reset_index(), 
                        
                    # Assign CustomerID and Cluster as ID variables                  
                    id_vars=['CustomerID', 'Cluster'],

                    # Assign RFM values as value variables
                    value_vars=['Recency', 'Frequency', 'MonetaryValue'], 
                        
                    # Name the variable and value
                    var_name='Metric', value_name='Value')

datamart_melt.head()
Out[30]:
CustomerID Cluster Metric Value
0 12747 0 Recency -2.195642
1 12748 0 Recency -2.684553
2 12749 0 Recency -1.706731
3 12820 0 Recency -1.909647
4 12822 2 Recency 0.322128

Visualize snake plot

Now we will use the melted dataset to build the snake plot.

In [31]:
# Add the plot title
plt.title('Snake plot of normalized variables')

# Add the x axis label
plt.xlabel('Metric')

# Add the y axis label
plt.ylabel('Value')

# Plot a line for each value of the cluster variable
sns.lineplot(data=datamart_melt, x='Metric', y='Value', hue='Cluster')
plt.show()

Calculate relative importance of each attribute

Relative importance of segment attributes

  • Useful technique to identify the relative importance of each segment's attributes
  • Calculate average values of each cluster
  • Calculate average values of the population
  • Calculate an importance score by dividing them and subtracting $1$ (this ensures $0$ is returned when the cluster average equals the population average; e.g. a cluster average twice the population average gives $2/1 - 1 = 1$)

Now we will calculate the relative importance of the RFM values within each cluster.

In [32]:
# Calculate average RFM values for each cluster
cluster_avg = datamart_rfm_k3.groupby(['Cluster']).mean() 

# Calculate average RFM values for the total customer population
population_avg = datamart_rfm.mean()

# Calculate relative importance of cluster's attribute value compared to population
relative_imp = cluster_avg / population_avg - 1

# Print relative importance score rounded to 2 decimals
relative_imp.round(2)
Out[32]:
Recency Frequency MonetaryValue
Cluster
0 -0.85 1.75 1.96
1 0.82 -0.84 -0.85
2 -0.17 -0.29 -0.39

The further the score is from $0$, the more important that attribute is for the segment relative to the total population.

Plot relative importance heatmap

Now we will build a heatmap visualizing the relative scores for each cluster.

In [33]:
# Initialize a plot with a figure size of 8 by 2 inches 
plt.figure(figsize=(8, 2))

# Add the plot title
plt.title('Relative importance of attributes')

# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn', annot_kws={"size": 16})
# Same workaround for the heatmap row-clipping issue as above
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show()